Zedstore - compressed in-core columnar storage
Heikki and I have been hacking for the past few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, a compressed in-core columnar storage
(table access method). Attaching the patch and a link to the GitHub
branch [1] to follow along.
The objective is to gather feedback on the design and approach. The
implementation has the core basic pieces working, but is not close to
complete.
Big thank you to Andres, Haribabu and team for the table access method
APIs. We leveraged the APIs to implement zedstore, and they proved to
be in very good shape. We had to enhance them only minimally and, in
general, didn't have to touch the executor much.
Motivations / Objectives
========================
* Performance improvement for queries selecting a subset of columns
(reduced IO).
* Reduced on-disk footprint compared to heap tables. Shorter tuple
headers, and also leveraging compression of similar-type data.
* Be a first-class citizen in the Postgres architecture (table data can
just live independently in columnar storage)
* Fully MVCC compliant
* All Indexes supported
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together.
* Provide better control over bloat (similar to zheap)
* Eliminate need for separate toast tables
* Faster add / drop column or changing data type of column by avoiding
full rewrite of the table.
High-level Design - B-trees for the win!
========================================
To start simple, let's ignore the column store aspect for a moment and
consider it as a compressed row store. The column store is a natural
extension of this concept, explained in the next section.
The basic on-disk data structure leveraged is a B-tree, indexed by
TID. The B-tree is a great data structure, fast and versatile. Note
this is not referring to the existing Btree indexes, but to a net new,
separate B-tree used for table data storage.
TID - logical row identifier:
TID is just a 48-bit row identifier. The traditional division into
block and offset numbers is meaningless. In order to find a tuple with
a given TID, one must always descend the B-tree. Having a logical TID
provides the flexibility to move tuples around to different pages when
page splits or page merges are performed.
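For illustration, here is a minimal sketch of the logical TID, assuming
a plain 64-bit integer representation. The authoritative zstid
definition and accessor macros live in
src/include/access/zedstore_internal.h in the attached patch; the exact
bit packing shown here is only an assumption:

/*
 * Illustrative sketch only. Block/offset views are derived purely for
 * compatibility with code that expects an ItemPointer; they carry no
 * physical meaning in zedstore.
 */
typedef uint64 zstid;               /* only the low 48 bits are used */

#define ZSTidGetBlockNumber(tid)    ((BlockNumber) ((tid) >> 16))
#define ZSTidGetOffsetNumber(tid)   ((OffsetNumber) ((tid) & 0xFFFF))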
The internal pages of the B-tree are super simple and boring. Each
internal page just stores an array of TID and downlink pairs. Let's
focus on the leaf level. Leaf blocks have a short uncompressed header,
followed by btree items. Two kinds of items exist:
- plain item, holds one tuple or one datum, uncompressed payload
- a "container item", holds multiple plain items, compressed payload
+-----------------------------
| Fixed-size page header:
|
| LSN
| TID low and hi key (for Lehman & Yao B-tree operations)
| left and right page pointers
|
| Items:
|
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | undo pointer | payload (plain item)
| TID | size | flags | undo pointer | payload (plain item)
| ...
|
+----------------------------
Row store
---------
The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, an undo record pointer, and the actual
tuple data uncompressed.
In uncompressed form, the page can be arbitrarily large. But after
compression, it must fit into a physical 8k block. If on insert or
update of a tuple, the page cannot be compressed below 8k anymore, the
page is split. Note that because TIDs are logical rather than physical
identifiers, we can freely move tuples from one physical page to
another during page split. A tuple's TID never changes.
The buffer cache caches compressed blocks. Likewise, WAL-logging,
full-page images etc. work on compressed blocks. Decompression is done
on-the-fly, in backend-private memory, as and when needed for reading.
For some encodings, like run-length (RLE) or delta encoding, tuples can
be constructed directly from the compressed data.
Column store
------------
A column store uses the same structure but we have *multiple* B-trees,
one for each column, all indexed by TID. The B-trees for all columns
are stored in the same physical file.
A metapage at block 0 has links to the roots of the B-trees. Leaf
pages look the same, but instead of storing whole tuples, they store
just a single attribute. To reconstruct a row with a given TID, a scan
descends the B-trees for all the columns using that TID, and fetches
all attributes. Likewise, a sequential scan walks all the B-trees in
lockstep.
So, in summary, one can imagine Zedstore as a forest of B-trees, one
for each column, all indexed by TIDs.
This way of laying out the data also easily allows for a hybrid
row-column store, where some columns are stored together, and others
have a dedicated B-tree. User-facing syntax is needed to allow
specifying how to group the columns.
Main reasons for storing data this way
--------------------------------------
* Lay out the data/tuples in a mapped fashion instead of keeping the
logical to physical mapping separate from the actual data. That is,
keep the meta-data and data logically in a single stream of the file,
avoiding the need for separate forks/files to store meta-data and data.
* Stick to fixed-size physical blocks. Variable-size blocks would
require increased logical to physical mapping maintenance, plus pose
restrictions on concurrency of writes and reads to files. Hence, adopt
compression to fit fixed-size blocks instead of the other way round.
MVCC
----
For zedstore, MVCC works very similarly to zheap. Undo record pointers
are used to implement MVCC. Transaction information is not directly
stored with the data. In zheap, there's a small, fixed number of
"transaction slots" on each page, but zedstore has an undo pointer with
each item directly; in normal cases, the compression squeezes this down
to almost nothing.
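As a rough sketch, the plain leaf item layout from the diagram above
maps to a struct along these lines. The field names follow the patch,
but the exact types shown are assumptions; the authoritative definition
is in zedstore_internal.h:

typedef struct ZSUncompressedBtreeItem
{
    zstid        t_tid;         /* logical row identifier */
    uint16       t_size;        /* total item size */
    uint16       t_flags;       /* ZSBT_NULL, ZSBT_DELETED, ZSBT_DEAD, ... */
    ZSUndoRecPtr t_undo_ptr;    /* per-item undo pointer, instead of
                                 * zheap-style per-page transaction slots */
    char         t_payload[FLEXIBLE_ARRAY_MEMBER];   /* the datum */
} ZSUncompressedBtreeItem;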
Implementation
==============
Insert:
Inserting a new row splits the row into datums. For the first column,
we decide which block to insert it into, pick a TID for it, and write
an undo record. The rest of the columns are inserted using that same
TID and point to the same undo position.
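A minimal sketch of that flow, using the zsbt_insert() signature from
the patch (the wrapper function name here is hypothetical, and error
handling, toasting and slot handling are omitted):

#include "postgres.h"
#include "access/zedstore_internal.h"

static zstid
zedstore_insert_row(Relation rel, Datum *values, bool *isnulls,
                    TransactionId xid, CommandId cid)
{
    int          natts = RelationGetDescr(rel)->natts;
    zstid        tid = InvalidZSTid;
    ZSUndoRecPtr undorecptr;
    AttrNumber   attno;

    /* assumption: a zeroed pointer reads as "no undo record yet" */
    memset(&undorecptr, 0, sizeof(undorecptr));

    for (attno = 1; attno <= natts; attno++)
    {
        /*
         * The first column passes InvalidZSTid, so zsbt_insert() picks a
         * TID and writes the undo record; the remaining columns reuse
         * that TID and point at the same undo position.
         */
        tid = zsbt_insert(rel, attno,
                          values[attno - 1], isnulls[attno - 1],
                          xid, cid, tid, &undorecptr);
    }
    return tid;
}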
Compression:
Items are added to the Btree in uncompressed form. If a page is full
and a new item can't be added, compression kicks in. The existing
uncompressed items (plain items) on the page are passed to the
compressor for compression. Already-compressed items are added back as
is. The page is rewritten with the compressed data and the new item
added to it. If the item still can't fit even after compression, the
page is split.
Toast:
When an overly large datum is stored, it is divided into chunks, and
each chunk is stored on a dedicated toast page within the same
physical file. The toast pages of a datum form a list; each page has a
next/prev pointer.
Select:
A property is added to the table AM to convey whether column projection
is leveraged by the AM for scans. While scanning tables whose AM
leverages this property, the executor parses the plan and uses the
target list and quals to find the columns required by the query. This
list is passed down to the AM on beginscan. Zedstore uses this column
projection list to pull data only from the selected columns. A virtual
tuple table slot is used to pass back the datums for the subset of
columns.
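As a sketch of the executor side of this, the required columns can be
collected from a scan node's target list and quals with the existing
pull_varattnos() helper; how the resulting set is then handed to the AM
is what the API experiments described below deal with (the helper
function name here is hypothetical):

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "optimizer/optimizer.h"    /* pull_varattnos() */

static Bitmapset *
zs_collect_scan_columns(List *targetlist, List *quals, Index scanrelid)
{
    Bitmapset  *attrs = NULL;

    /* gather every column referenced by the projection and the quals */
    pull_varattnos((Node *) targetlist, scanrelid, &attrs);
    pull_varattnos((Node *) quals, scanrelid, &attrs);

    /* attnos in 'attrs' are offset by FirstLowInvalidHeapAttributeNumber */
    return attrs;
}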
The current table AM API requires enhancement here to pass the column
projection down to the AM. The patch showcases two different ways to do
this:
* For sequential scans, a new beginscan_with_column_projection() API is
added. The executor checks the AM property and, if the AM leverages
column projection, uses this new API, else the normal beginscan() API.
* For index scans, instead of modifying the begin-scan API, a new API
is added to specifically pass the column projection list after calling
begin scan (to populate the scan descriptor) but before fetching the
tuples.
Index Support:
Building an index also leverages the columnar storage and only scans
the columns required to build the index. Indexes work pretty much like
they do for heap tables: data is inserted into the table and the TID of
the tuple gets stored in the index. On index scans, the required column
Btrees are scanned for the given TID and the datums are passed back
using a virtual tuple.
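A rough sketch of such a by-TID fetch, based on the zsbt_begin_scan()
and zsbt_scan_next() signatures in the patch (the wrapper name is
hypothetical, and projection/visibility details are simplified):

#include "postgres.h"
#include "access/zedstore_internal.h"

static bool
zedstore_fetch_row(Relation rel, zstid tid, Snapshot snapshot,
                   int natts, Datum *values, bool *isnulls)
{
    AttrNumber  attno;

    for (attno = 1; attno <= natts; attno++)
    {
        ZSBtreeScan scan;
        zstid       fetched_tid;
        bool        found;

        /* descend this column's btree, starting at the wanted TID */
        zsbt_begin_scan(rel, attno, tid, snapshot, &scan);
        found = zsbt_scan_next(&scan, &values[attno - 1],
                               &isnulls[attno - 1], &fetched_tid);
        zsbt_end_scan(&scan);

        if (!found || fetched_tid != tid)
            return false;       /* no visible tuple with this TID */
    }
    return true;
}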
Page Format:
A ZedStore table contains different kinds of pages, all in the same
file: the meta-page, per-attribute btree internal and leaf pages, UNDO
log pages, and toast pages. Each page type has its own distinct data
storage format.
Block 0 is always a metapage. It contains the block numbers of the
other data structures stored within the file, like the per-attribute
B-trees, and the UNDO log.
Enhancements to design:
=======================
Instead of compressing all the tuples on a page in one batch, we could
store a small "dictionary", e.g. in the page header or meta-page, and use
it to compress each tuple separately. That could make random reads and
updates of individual tuples faster.
When adding a column, we just need to create a new Btree for the newly
added column and link it to the meta-page. No existing content needs to
be rewritten.
When a column is dropped, we can scan the B-tree of that column, and
immediately mark all the pages as free in the FSM. But we don't
actually have to scan the leaf level: all leaf tuples have a downlink
in the parent, so we can scan just the internal pages. Unless the
column is very wide, that's only a small fraction of the data. That
makes the space immediately reusable for new insertions, but it won't
return the space to the Operating System. In order to do that, we'd
still need to defragment, moving pages from the end of the file closer
to the beginning, and truncate the file.
In this design, we only cache compressed pages in the page cache. If
we want to cache uncompressed pages instead, or in addition to that,
we need to invent a whole new kind of a buffer cache that can deal
with the variable-size blocks.
If you do a lot of updates, the file can get fragmented, with lots of
unused space on pages. Losing the correlation between TIDs and
physical order is also bad, because it will make SeqScans slow, as
they're not actually doing sequential I/O anymore. We can write a
defragmenter to fix things up. Half-empty pages can be merged, and
pages can be moved to restore TID/physical correlation. This format
doesn't have the same MVCC problems with moving tuples around that the
Postgres heap does, so it can fairly easily be done on-line.
Min-Max values can be stored per block, to easily skip scanning a block
if the column values don't fall in the requested range.
Notes about current patch
=========================
Basic (core) functionality is implemented to showcase and play with.
Two compression algorithms are supported: Postgres pg_lzcompress and
lz4. Compiling the server with --with-lz4 enables LZ4 compression for
zedstore, else pg_lzcompress is the default. LZ4 is definitely super
fast at compressing and decompressing.
Not all the table AM APIs are implemented. Functionality not
implemented yet will ERROR out as not supported. A Zedstore table can
be created using the command:
CREATE TABLE <name> (column listing) USING zedstore;
Bulk load can be performed using COPY. INSERT, SELECT, UPDATE and
DELETE work. Btree indexes can be created. Btree and bitmap index
scans work. The test in src/test/regress/sql/zedstore.sql showcases all
the functionality working currently. Updates are currently implemented
as cold, meaning they always create new items and are not performed
in-place.
TIDs currently can't leverage the full 48-bit range, but instead are
limited to values which are considered valid ItemPointers. Also,
MaxHeapTuplesPerPage poses restrictions on the values they can
currently have. Refer to [7] for details.
Extremely basic UNDO logging has been implemented, just enough for
MVCC. MVCC is missing tuple locks right now. Plus, it doesn't actually
perform any undo yet. No WAL logging exists currently, hence it's not
crash safe either.
Helpful functions are provided to find how many pages of each type are
present in a zedstore table, and also to find the compression ratio.
The test mentioned in the thread "Column lookup in a row performance"
[6] is a good example query for zedstore. Run locally on a laptop using
lz4, it shows:
postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; -- heap
avg
---------------------
500000.500000000000
(1 row)
Time: 4679.026 ms (00:04.679)
postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; -- zedstore
avg
---------------------
500000.500000000000
(1 row)
Time: 379.710 ms
Important note:
---------------
The planner has not been modified yet to leverage the columnar
storage. Hence, plans using the "physical tlist" optimization, or
similar optimizations good for a row store, currently miss out on
leveraging the columnar nature. That is why the subquery with OFFSET 0
is needed above, to disable the optimization and scan only the required
column.
The current proposal and discussion are focused on the AM layer work
first. Hence, this intentionally skips discussion of planner or
executor "feature" enhancements, like vectorized execution and that
family of features.
Previous discussions and implementations of column stores (Vertical
cluster index [2], In-core columnar storage [3] and [4], cstore_fdw
[5]) were studied to distill the objectives and come up with a design
and implementation that avoids the concerns raised earlier. Learnings
from the Greenplum Database column store were also leveraged while
designing and implementing this.
Credit: The design is mostly the brainchild of Heikki, or actually his
epiphany to be exact. I acted as an idea bouncing board and contributed
enhancements. We are both having a lot of fun writing the code for
this.
References
[1] https://github.com/greenplum-db/postgres/tree/zedstore
[2] /messages/by-id/CAJrrPGfaC7WC9NK6PTTy6YN-NN+hCy8xOLAh2doYhVg5d6HsAA@mail.gmail.com
[3] /messages/by-id/20150611230316.GM133018@postgresql.org
[4] /messages/by-id/20150831225328.GM2912@alvherre.pgsql
[5] https://github.com/citusdata/cstore_fdw
[6] /messages/by-id/CAOykqKfko-n5YiBJtk-ocVdp+j92Apu5MJBwbGGh4awRY5NCuQ@mail.gmail.com
[7] /messages/by-id/d0fc97bd-7ec8-2388-e4a6-0fda86d71a43@iki.fi
Attachments:
v1-0001-Zedstore-compressed-in-core-columnar-storage.patch (text/x-patch)
From a31743fa72025c836283504a930fb173b5eabe81 Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal <aagrawal@pivotal.io>
Date: Mon, 8 Apr 2019 16:54:21 -0700
Subject: [PATCH v1] Zedstore - compressed in-core columnar storage.
---
configure | 118 ++
configure.in | 19 +
src/backend/access/Makefile | 2 +-
src/backend/access/zedstore/Makefile | 19 +
src/backend/access/zedstore/README | 253 +++
src/backend/access/zedstore/zedstore_btree.c | 1723 +++++++++++++++++
.../access/zedstore/zedstore_compression.c | 362 ++++
.../access/zedstore/zedstore_inspect.c | 445 +++++
src/backend/access/zedstore/zedstore_meta.c | 218 +++
src/backend/access/zedstore/zedstore_toast.c | 188 ++
src/backend/access/zedstore/zedstore_undo.c | 783 ++++++++
.../access/zedstore/zedstore_visibility.c | 259 +++
.../access/zedstore/zedstoream_handler.c | 1500 ++++++++++++++
src/backend/executor/execScan.c | 75 +
src/backend/executor/nodeIndexonlyscan.c | 16 +-
src/backend/executor/nodeIndexscan.c | 7 +
src/backend/executor/nodeSeqscan.c | 18 +-
src/include/access/tableam.h | 44 +
src/include/access/zedstore_compression.h | 51 +
src/include/access/zedstore_internal.h | 409 ++++
src/include/access/zedstore_undo.h | 133 ++
src/include/catalog/pg_am.dat | 3 +
src/include/catalog/pg_proc.dat | 24 +
src/include/executor/executor.h | 2 +
src/include/nodes/execnodes.h | 1 +
src/include/pg_config.h.in | 9 +
src/test/regress/expected/create_am.out | 11 +-
src/test/regress/expected/zedstore.out | 259 +++
src/test/regress/parallel_schedule | 2 +-
src/test/regress/serial_schedule | 1 +
src/test/regress/sql/zedstore.sql | 104 +
31 files changed, 7045 insertions(+), 13 deletions(-)
create mode 100644 src/backend/access/zedstore/Makefile
create mode 100644 src/backend/access/zedstore/README
create mode 100644 src/backend/access/zedstore/zedstore_btree.c
create mode 100644 src/backend/access/zedstore/zedstore_compression.c
create mode 100644 src/backend/access/zedstore/zedstore_inspect.c
create mode 100644 src/backend/access/zedstore/zedstore_meta.c
create mode 100644 src/backend/access/zedstore/zedstore_toast.c
create mode 100644 src/backend/access/zedstore/zedstore_undo.c
create mode 100644 src/backend/access/zedstore/zedstore_visibility.c
create mode 100644 src/backend/access/zedstore/zedstoream_handler.c
create mode 100644 src/include/access/zedstore_compression.h
create mode 100644 src/include/access/zedstore_internal.h
create mode 100644 src/include/access/zedstore_undo.h
create mode 100644 src/test/regress/expected/zedstore.out
create mode 100644 src/test/regress/sql/zedstore.sql
diff --git a/configure b/configure
index 806810817d..8ca80562c8 100755
--- a/configure
+++ b/configure
@@ -700,6 +700,7 @@ LDFLAGS_EX
ELF_SYS
EGREP
GREP
+with_lz4
with_zlib
with_system_tzdata
with_libxslt
@@ -864,6 +865,7 @@ with_libxml
with_libxslt
with_system_tzdata
with_zlib
+with_lz4
with_gnu_ld
enable_largefile
enable_float4_byval
@@ -1570,6 +1572,7 @@ Optional Packages:
--with-system-tzdata=DIR
use system time zone data in DIR
--without-zlib do not use Zlib
+ --with-lz4 build with LZ4 support
--with-gnu-ld assume the C compiler uses GNU ld [default=no]
Some influential environment variables:
@@ -8306,6 +8309,41 @@ fi
+#
+# LZ4
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with LZ4 support" >&5
+$as_echo_n "checking whether to build with LZ4 support... " >&6; }
+
+
+
+# Check whether --with-lz4 was given.
+if test "${with_lz4+set}" = set; then :
+ withval=$with_lz4;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LZ4 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-lz4 option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_lz4=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_lz4" >&5
+$as_echo "$with_lz4" >&6; }
+
+
#
# Elf
#
@@ -11828,6 +11866,56 @@ fi
fi
+if test "$with_lz4" = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for LZ4_compress_default in -llz4" >&5
+$as_echo_n "checking for LZ4_compress_default in -llz4... " >&6; }
+if ${ac_cv_lib_lz4_LZ4_compress_default+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-llz4 $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char LZ4_compress_default ();
+int
+main ()
+{
+return LZ4_compress_default ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_lz4_LZ4_compress_default=yes
+else
+ ac_cv_lib_lz4_LZ4_compress_default=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_lz4_LZ4_compress_default" >&5
+$as_echo "$ac_cv_lib_lz4_LZ4_compress_default" >&6; }
+if test "x$ac_cv_lib_lz4_LZ4_compress_default" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBLZ4 1
+_ACEOF
+
+ LIBS="-llz4 $LIBS"
+
+else
+ as_fn_error $? "library 'lz4' is required for LZ4 support" "$LINENO" 5
+fi
+
+fi
+
if test "$enable_spinlocks" = yes; then
$as_echo "#define HAVE_SPINLOCKS 1" >>confdefs.h
@@ -13027,6 +13115,36 @@ Use --without-zlib to disable zlib support." "$LINENO" 5
fi
+fi
+
+if test "$with_lz4" = yes; then
+ for ac_header in lz4.h
+do :
+ ac_fn_c_check_header_mongrel "$LINENO" "lz4.h" "ac_cv_header_lz4_h" "$ac_includes_default"
+if test "x$ac_cv_header_lz4_h" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LZ4_H 1
+_ACEOF
+
+else
+ for ac_header in lz4.h
+do :
+ ac_fn_c_check_header_mongrel "$LINENO" "lz4.h" "ac_cv_header_lz4_h" "$ac_includes_default"
+if test "x$ac_cv_header_lz4_h" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LZ4_H 1
+_ACEOF
+
+else
+ as_fn_error $? "lz4.h header file is required for LZ4" "$LINENO" 5
+fi
+
+done
+
+fi
+
+done
+
fi
if test "$with_gssapi" = yes ; then
diff --git a/configure.in b/configure.in
index 9c7a9738bc..e229faf75a 100644
--- a/configure.in
+++ b/configure.in
@@ -964,6 +964,16 @@ PGAC_ARG_BOOL(with, zlib, yes,
[do not use Zlib])
AC_SUBST(with_zlib)
+#
+# LZ4
+#
+AC_MSG_CHECKING([whether to build with LZ4 support])
+PGAC_ARG_BOOL(with, lz4, no,
+ [build with LZ4 support],
+ [AC_DEFINE([USE_LZ4], 1, [Define to 1 to build with LZ4 support. (--with-lz4)])])
+AC_MSG_RESULT([$with_lz4])
+AC_SUBST(with_lz4)
+
#
# Elf
#
@@ -1174,6 +1184,10 @@ failure. It is possible the compiler isn't looking in the proper directory.
Use --without-zlib to disable zlib support.])])
fi
+if test "$with_lz4" = yes; then
+ AC_CHECK_LIB(lz4, LZ4_compress_default, [], [AC_MSG_ERROR([library 'lz4' is required for LZ4 support])])
+fi
+
if test "$enable_spinlocks" = yes; then
AC_DEFINE(HAVE_SPINLOCKS, 1, [Define to 1 if you have spinlocks.])
else
@@ -1387,6 +1401,11 @@ failure. It is possible the compiler isn't looking in the proper directory.
Use --without-zlib to disable zlib support.])])
fi
+if test "$with_lz4" = yes; then
+ AC_CHECK_HEADERS(lz4.h, [],
+ [AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is required for LZ4])])])
+fi
+
if test "$with_gssapi" = yes ; then
AC_CHECK_HEADERS(gssapi/gssapi.h, [],
[AC_CHECK_HEADERS(gssapi.h, [], [AC_MSG_ERROR([gssapi.h header file is required for GSSAPI])])])
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 0880e0a8bb..6d36f3bd26 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \
- table tablesample transam
+ table tablesample transam zedstore
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/zedstore/Makefile b/src/backend/access/zedstore/Makefile
new file mode 100644
index 0000000000..52c2b2025c
--- /dev/null
+++ b/src/backend/access/zedstore/Makefile
@@ -0,0 +1,19 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for access/zedstore
+#
+# IDENTIFICATION
+# src/backend/access/zedstore/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/zedstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = zedstore_btree.o zedstore_compression.o zedstoream_handler.o \
+ zedstore_meta.o zedstore_undo.o zedstore_toast.o zedstore_visibility.o \
+ zedstore_inspect.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/zedstore/README b/src/backend/access/zedstore/README
new file mode 100644
index 0000000000..96d50ae580
--- /dev/null
+++ b/src/backend/access/zedstore/README
@@ -0,0 +1,253 @@
+
+src/backend/access/zedstore/README
+
+ZedStore - compressed column (and row) store for PostgreSQL
+===========================================================
+
+The purpose of this README is to provide overview of zedstore's
+design, major requirements/objectives it intends to fulfill and
+high-level implementation details.
+
+Objectives
+----------
+
+* Performance improvement for queries selecting subset of columns (reduced IO).
+
+* Reduced on-disk footprint compared to heap table. Shorter tuple
+ headers and also leveraging compression of similar type data
+
+* Be first-class citizen in the Postgres architecture (tables data can
+ just independently live in columnar storage) and not be at arm's
+ length though an opaque interface.
+
+* Fully MVCC compliant - basically all operations supported similar to
+ heap, like update, delete, serializable transactions etc...
+
+* All Indexes supported
+
+* Hybrid row-column store, where some columns are stored together, and
+ others separately. Provide flexibility of granularity on how to
+ divide the columns. Columns accessed together can be stored
+ together.
+
+* Provide better control over bloat (using zheap)
+
+* Eliminate need for separate toast tables
+
+* Faster add / drop column or changing data type of column by avoiding
+ full rewrite of the table.
+
+Highlevel design of zedStore - B-trees for the win!
+---------------------------------------------------
+
+To start simple, let's ignore column store aspect and consider it as
+compressed row store. The column store is a natural extension of this
+concept, explained in next section.
+
+The basic on-disk data structure leveraged is a B-tree, indexed by
+TID. BTree being a great data structure, fast and versatile. Note this
+is not referring to existing Btree indexes, but instead net new BTree
+for table data storage.
+
+TID - used as a logical row identifier:
+TID is just a 48-bit row identifier. The traditional division into
+block and offset numbers is meaningless. In order to find a tuple with
+a given TID, one must always descend the B-tree. Having logical TID
+provides flexibility to move the tuples around different pages on page
+splits or page merges can be performed.
+
+The internal pages of the B-tree are super simple and boring. Each
+internal page just stores an array of TID/downlink pairs. Let's focus
+on the leaf level. Leaf blocks have short uncompressed header,
+followed by btree items. It contains two kind of items:
+
+ - plain item, holds one tuple or one datum, uncompressed payload
+ - a "container item", holds multiple plain items, compressed payload
+
++-----------------------------
+| Fixed-size page header:
+|
+| LSN
+| TID low and hi key (for Lehman & Yao B-tree operations)
+| left and right page pointers
+|
+| Items:
+|
+| TID | size | flags | uncompressed size | lastTID | payload (container item)
+| TID | size | flags | uncompressed size | lastTID | payload (container item)
+| TID | size | flags | undo pointer | payload (plain item)
+| TID | size | flags | undo pointer | payload (plain item)
+| ...
+|
++----------------------------
+
+Row store
+---------
+
+The tuples are stored one after another, sorted by TID. For each
+tuple, we store its 48-bit TID, a undo record pointer, and the actual
+tuple data uncompressed.
+
+In uncompressed form, the page can be arbitrarily large. But after
+compression, it must fit into a physical 8k block. If on insert or
+update of a tuple, the page cannot be compressed below 8k anymore, the
+page is split. Note that because TIDs are logical rather than physical
+identifiers, we can freely move tuples from one physical page to
+another during page split. A tuple's TID never changes.
+
+The buffer cache caches compressed blocks. Likewise, WAL-logging,
+full-page images etc. work on compressed blocks. Uncompression is done
+on-the-fly, as and when needed in backend-private memory, when
+reading. For some compressions like rel encoding or delta encoding
+tuples can be constructed directly from compressed data.
+
+Column store
+------------
+
+A column store uses the same structure but we have *multiple* B-trees,
+one for each column, all indexed by TID. Imagine zedstore as a forest
+of B-trees. The B-trees for all columns are stored in the same
+physical file.
+
+A metapage at block 0, has links to the roots of the B-trees. Leaf
+pages look the same, but instead of storing the whole tuple, stores
+just a single attribute. To reconstruct a row with given TID, scan
+descends down the B-trees for all the columns using that TID, and
+fetches all attributes. Likewise, a sequential scan walks all the
+B-trees in lockstep.
+
+
+MVCC
+----
+
+Undo record pointers are used to implement MVCC, like in zheap. Hence,
+transaction information is not directly stored with the data. In
+zheap, there's a small, fixed, number of "transaction slots" on each
+page, but zedstore has undo pointer with each item directly; in normal
+cases, the compression squeezes this down to almost nothing. In case
+of bulk load the undo record pointer can be maintained for bulk of
+items and not per item.
+
+
+Insert:
+
+Inserting a new row, splits the row into datums. Then for first column
+decide which block to insert the same to, and pick a TID for it, and
+write undo record for the same. Rest of the columns are inserted using
+that TID and point to same undo position.
+
+There's one subtle little issue here:
+
+Imagine that you load the table with very large rows, so that every page
+has just a single row. If you assign the TID ranges naively, as you add
+new leaf pages to the end, you will end up with leaf pages with only one
+TID each. So the first page covers TIDs [1, 2), the second [2, 3), and
+so forth. If you then delete a row, and try to insert 10 smaller rows to
+the same page, you can't, because there aren't enough unused TIDs in the
+page's range.
+
+Can avoid that by simply padding the TID ranges, as we add new pages,
+so that each page is initially allocated e.g. 50000 TIDs, even if you
+only place one row to it. That gives a lot of breathing room. There
+might still be some corner cases, where repeated updates cause page
+splits, so that you still end up with very small TIDs ranges on the
+split pages. But that seems fine.
+
+Toast:
+When an overly large datum is stored, it is divided into chunks, and
+each chunk is stored on a dedicated toast page within the same
+physical file. The toast pages of a datum form list, each page has a
+next/prev pointer.
+
+Select:
+Property is added to Table AM to convey if column projection is
+leveraged by AM for scans. While scanning tables with AM leveraging
+this property, executor parses the plan. Leverages the target list and
+quals to find the required columns for query. This list is passed down
+to AM on beginscan. Zedstore uses this column projection list to only
+pull data from selected columns. Virtual tuple table slot is used to
+pass back the datums for subset of columns.
+
+Current table am API requires enhancement here to pass down column
+projection to AM. The patch showcases two different ways for the same.
+
+* For sequential scans added new beginscan_with_column_projection()
+ API. Executor checks AM property and if it leverages column
+ projection uses this new API else normal beginscan() API.
+
+* For index scans instead of modifying the begin scan API, added new
+ API to specifically pass column projection list after calling begin
+ scan to populate the scan descriptor but before fetching the tuples.
+
+Update:
+
+Index Support:
+Building index also leverages columnar storage and only scans columns
+required to build the index. Indexes work pretty similar to heap
+tables. Data is inserted into tables and TID for the tuple same gets
+stored in index. On index scans, required column Btrees are scanned
+for given TID and datums passed back using virtual tuple.
+
+Page Format
+-----------
+A ZedStore table contains different kinds of pages, all in the same
+file. Kinds of pages are meta-page, per-attribute btree internal and
+leaf pages, UNDO log page, and toast pages. Each page type has its own
+distinct data storage format.
+
+META Page:
+Block 0 is always a metapage. It contains the block numbers of the
+other data structures stored within the file, like the per-attribute
+B-trees, and the UNDO log.
+
+BTREE Page:
+
+UNDO Page:
+
+TOAST Page:
+
+
+Free Space Map
+--------------
+
+
+Enhancements
+------------
+
+Instead of compressing all the tuples on a page in one batch, store a
+small "dictionary", e.g. in page header or meta page or separate
+dedicated page, and use it to compress tuple by tuple. That could make
+random reads and updates of individual tuples faster. Need to find how
+to create the dictionary first.
+
+Only cached compressed pages in the page cache. If we want to cache
+uncompressed pages instead, or in addition to that, we need to invent
+a whole new kind of a buffer cache that can deal with the
+variable-size blocks. For a first version, I think we can live without
+it.
+
+Instead of storing all columns in the same file, we could store them in
+separate files (separate forks?). That would allow immediate reuse of
+space, after dropping a column. It's not clear how to use an FSM in that
+case, though. Might have to implement an integrated FSM, too. (Which
+might not be a bad idea, anyway).
+
+Design allows for hybrid row-column store, where some columns are
+stored together, and others have a dedicated B-tree. Need to have user
+facing syntax to allow specifying how to group the columns.
+
+Salient points for the design
+------------------------------
+
+* Layout the data/tuples in mapped fashion instead of keeping the
+ logical to physical mapping separate from actual data. So, keep all
+ the meta-data and data logically in single stream of file, avoiding
+ the need for separate forks/files to store meta-data and data.
+
+* Handle/treat operations at tuple level and not block level.
+
+* Stick to fixed size physical blocks. Variable size blocks (for
+ possibly higher compression ratios) pose need for increased logical
+ to physical mapping maintenance, plus restrictions on concurrency of
+ writes and reads to files. Hence adopt compression to fit fixed size
+ blocks instead of other way round.
diff --git a/src/backend/access/zedstore/zedstore_btree.c b/src/backend/access/zedstore/zedstore_btree.c
new file mode 100644
index 0000000000..a1e0cb31b7
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_btree.c
@@ -0,0 +1,1723 @@
+/*
+ * zedstore_btree.c
+ * Routines for handling B-trees structures in ZedStore
+ *
+ * A Zedstore table consists of multiple B-trees, one for each attribute. The
+ * functions in this file deal with one B-tree at a time, it is the caller's
+ * responsibility to tie together the scans of each btree.
+ *
+ * Operations:
+ *
+ * - Sequential scan in TID order
+ * - must be efficient with scanning multiple trees in sync
+ *
+ * - random lookups, by TID (for index scan)
+ *
+ * - range scans by TID (for bitmap index scan)
+ *
+ * NOTES:
+ * - Locking order: child before parent, left before right
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_btree.c
+ */
+#include "postgres.h"
+
+#include "access/tableam.h"
+#include "access/xact.h"
+#include "access/zedstore_compression.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/procarray.h"
+#include "utils/datum.h"
+#include "utils/rel.h"
+
+/* prototypes for local functions */
+static zstid zsbt_insert_item(Relation rel, AttrNumber attno, ZSUncompressedBtreeItem *newitem,
+ TransactionId xid, CommandId cid, ZSUndoRecPtr *undorecptr);
+static Buffer zsbt_descend(Relation rel, BlockNumber rootblk, zstid key);
+static Buffer zsbt_find_downlink(Relation rel, AttrNumber attno,
+ zstid key, BlockNumber childblk, int level,
+ int *itemno);
+static void zsbt_recompress_replace(Relation rel, AttrNumber attno,
+ Buffer oldbuf, List *items);
+static void zsbt_insert_downlink(Relation rel, AttrNumber attno, Buffer leftbuf,
+ zstid rightlokey, BlockNumber rightblkno);
+static void zsbt_split_internal_page(Relation rel, AttrNumber attno,
+ Buffer leftbuf, Buffer childbuf,
+ OffsetNumber newoff, zstid newkey, BlockNumber childblk);
+static void zsbt_newroot(Relation rel, AttrNumber attno, int level,
+ zstid key1, BlockNumber blk1,
+ zstid key2, BlockNumber blk2,
+ Buffer leftchildbuf);
+static ZSUncompressedBtreeItem *zsbt_scan_next_internal(ZSBtreeScan *scan);
+static void zsbt_replace_item(Relation rel, AttrNumber attno, Buffer buf,
+ ZSBtreeItem *olditem, ZSBtreeItem *replacementitem,
+ ZSBtreeItem *newitem, List *newitems);
+
+static int zsbt_binsrch_internal(zstid key, ZSBtreeInternalPageItem *arr, int arr_elems);
+
+/* ----------------------------------------------------------------
+ * Public interface
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Begin a scan of the btree.
+ */
+void
+zsbt_begin_scan(Relation rel, AttrNumber attno, zstid starttid, Snapshot snapshot, ZSBtreeScan *scan)
+{
+ BlockNumber rootblk;
+ Buffer buf;
+
+ rootblk = zsmeta_get_root_for_attribute(rel, attno, false);
+
+ if (rootblk == InvalidBlockNumber)
+ {
+ /* completely empty tree */
+ scan->rel = NULL;
+ scan->attno = InvalidAttrNumber;
+ scan->active = false;
+ scan->lastbuf = InvalidBuffer;
+ scan->lastbuf_is_locked = false;
+ scan->lastoff = InvalidOffsetNumber;
+ scan->snapshot = NULL;
+ memset(&scan->recent_oldest_undo, 0, sizeof(scan->recent_oldest_undo));
+ scan->nexttid = InvalidZSTid;
+ return;
+ }
+
+ buf = zsbt_descend(rel, rootblk, starttid);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ scan->rel = rel;
+ scan->attno = attno;
+ scan->snapshot = snapshot;
+ scan->for_update = false; /* caller can change this */
+
+ scan->active = true;
+ scan->lastbuf = buf;
+ scan->lastbuf_is_locked = false;
+ scan->lastoff = InvalidOffsetNumber;
+ scan->nexttid = starttid;
+
+ scan->has_decompressed = false;
+ zs_decompress_init(&scan->decompressor);
+
+ memset(&scan->recent_oldest_undo, 0, sizeof(scan->recent_oldest_undo));
+}
+
+void
+zsbt_end_scan(ZSBtreeScan *scan)
+{
+ if (!scan->active)
+ return;
+
+ if (scan->lastbuf != InvalidBuffer)
+ {
+ if (scan->lastbuf_is_locked)
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(scan->lastbuf);
+ }
+ zs_decompress_free(&scan->decompressor);
+
+ scan->active = false;
+}
+
+/*
+ * Return true if there was another tuple. The datum is returned in *datum,
+ * and its TID in *tid. For a pass-by-ref datum, it's a palloc'd copy.
+ */
+bool
+zsbt_scan_next(ZSBtreeScan *scan, Datum *datum, bool *isnull, zstid *tid)
+{
+ TupleDesc desc;
+ Form_pg_attribute attr;
+ ZSUncompressedBtreeItem *item;
+
+ if (!scan->active)
+ return false;
+
+ desc = RelationGetDescr(scan->rel);
+ attr = &desc->attrs[scan->attno - 1];
+
+ while ((item = zsbt_scan_next_internal(scan)) != NULL)
+ {
+ if (zs_SatisfiesVisibility(scan, item))
+ {
+ char *ptr = item->t_payload;
+
+ *tid = item->t_tid;
+ if (item->t_flags & ZSBT_NULL)
+ *isnull = true;
+ else
+ {
+ *isnull = false;
+ *datum = fetchatt(attr, ptr);
+ *datum = zs_datumCopy(*datum, attr->attbyval, attr->attlen);
+ }
+
+ if (scan->lastbuf_is_locked)
+ {
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ scan->lastbuf_is_locked = false;
+ }
+
+ return true;
+ }
+ }
+ return false;
+}
+
+/*
+ * Get the last tid (plus one) in the tree.
+ */
+zstid
+zsbt_get_last_tid(Relation rel, AttrNumber attno)
+{
+ BlockNumber rootblk;
+ zstid rightmostkey;
+ zstid tid;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+
+ /* Find the rightmost leaf */
+ rootblk = zsmeta_get_root_for_attribute(rel, attno, true);
+ rightmostkey = MaxZSTid;
+ buf = zsbt_descend(rel, rootblk, rightmostkey);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /*
+ * Look at the last item, for its tid.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ if (maxoff >= FirstOffsetNumber)
+ {
+ ItemId iid = PageGetItemId(page, maxoff);
+ ZSBtreeItem *hitup = (ZSBtreeItem *) PageGetItem(page, iid);
+
+ /* COMPRESSED items cover a range of TIDs */
+ if ((hitup->t_flags & ZSBT_COMPRESSED) != 0)
+ tid = ((ZSCompressedBtreeItem *) hitup)->t_lasttid;
+ else
+ tid = hitup->t_tid;
+ tid = ZSTidIncrementForInsert(tid);
+ }
+ else
+ {
+ tid = opaque->zs_lokey;
+ }
+ UnlockReleaseBuffer(buf);
+
+ return tid;
+}
+
+ZSUncompressedBtreeItem *
+zsbt_create_item(Form_pg_attribute attr, zstid tid, Datum datum, bool isnull)
+{
+ Size datumsz;
+ Size itemsz;
+ ZSUncompressedBtreeItem *newitem;
+ char *dataptr;
+
+ /*
+ * Form a ZSBtreeItem to insert.
+ */
+ if (isnull)
+ datumsz = 0;
+ else
+ datumsz = zs_datumGetSize(datum, attr->attbyval, attr->attlen);
+ itemsz = offsetof(ZSUncompressedBtreeItem, t_payload) + datumsz;
+
+ newitem = palloc(itemsz);
+ memset(newitem, 0, offsetof(ZSUncompressedBtreeItem, t_payload)); /* zero padding */
+ newitem->t_tid = tid;
+ newitem->t_flags = 0;
+ newitem->t_size = itemsz;
+ memset(&newitem->t_undo_ptr, 0, sizeof(ZSUndoRecPtr));
+
+ if (isnull)
+ newitem->t_flags |= ZSBT_NULL;
+ else
+ {
+ dataptr = ((char *) newitem) + offsetof(ZSUncompressedBtreeItem, t_payload);
+ if (attr->attbyval)
+ store_att_byval(dataptr, datum, attr->attlen);
+ else
+ memcpy(dataptr, DatumGetPointer(datum), datumsz);
+ }
+
+ return newitem;
+}
+
+/*
+ * Insert a new datum to the given attribute's btree.
+ *
+ * Returns the TID of the new tuple.
+ *
+ * If 'tid' is valid, then that TID is used. It better not be in use already. If
+ * it's invalid, then a new TID is allocated, as we see best. (When inserting the
+ * first column of the row, pass invalid, and for other columns, pass the TID
+ * you got for the first column.)
+ */
+zstid
+zsbt_insert(Relation rel, AttrNumber attno, Datum datum, bool isnull,
+ TransactionId xid, CommandId cid, zstid tid, ZSUndoRecPtr *undorecptr)
+{
+ TupleDesc desc = RelationGetDescr(rel);
+ Form_pg_attribute attr = &desc->attrs[attno - 1];
+ ZSUncompressedBtreeItem *newitem;
+
+ newitem = zsbt_create_item(attr, tid, datum, isnull);
+ tid = zsbt_insert_item(rel, attno, newitem, xid, cid, undorecptr);
+
+ pfree(newitem);
+
+ return tid;
+}
+
+static zstid
+zsbt_insert_item(Relation rel, AttrNumber attno, ZSUncompressedBtreeItem *newitem,
+ TransactionId xid, CommandId cid, ZSUndoRecPtr *undorecptr)
+{
+ zstid tid = newitem->t_tid;
+ BlockNumber rootblk;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+ zstid lasttid;
+ zstid insert_target_key;
+ ZSUndoRec_Insert undorec;
+
+ rootblk = zsmeta_get_root_for_attribute(rel, attno, true);
+
+ /*
+ * If TID was given, find the right place for it. Otherwise, insert to
+ * the rightmost leaf.
+ *
+ * TODO: use a Free Space Map to find suitable target.
+ */
+ if (tid != InvalidZSTid)
+ insert_target_key = tid;
+ else
+ insert_target_key = MaxZSTid;
+
+ buf = zsbt_descend(rel, rootblk, insert_target_key);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /*
+ * Look at the last item, for its tid.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ if (maxoff >= FirstOffsetNumber)
+ {
+ ItemId iid = PageGetItemId(page, maxoff);
+ ZSBtreeItem *hitup = (ZSBtreeItem *) PageGetItem(page, iid);
+
+ if ((hitup->t_flags & ZSBT_COMPRESSED) != 0)
+ lasttid = ((ZSCompressedBtreeItem *) hitup)->t_lasttid;
+ else
+ lasttid = hitup->t_tid;
+
+ if (tid == InvalidZSTid)
+ {
+ tid = lasttid;
+ tid = ZSTidIncrementForInsert(tid);
+ }
+ }
+ else
+ {
+ lasttid = opaque->zs_lokey;
+ if (tid == InvalidZSTid)
+ tid = lasttid;
+ }
+
+ /* Form an undo record */
+ if (!IsZSUndoRecPtrValid(undorecptr))
+ {
+ undorec.rec.size = sizeof(ZSUndoRec_Insert);
+ undorec.rec.type = ZSUNDO_TYPE_INSERT;
+ undorec.rec.attno = attno;
+ undorec.rec.xid = xid;
+ undorec.rec.cid = cid;
+ undorec.rec.tid = tid;
+ undorec.endtid = tid;
+ *undorecptr = zsundo_insert(rel, &undorec.rec);
+ }
+
+ /* fill in the remaining fields in the item */
+ newitem->t_undo_ptr = *undorecptr;
+
+ if (newitem->t_tid == InvalidZSTid)
+ newitem->t_tid = tid;
+
+ /*
+ * If there's enough space on the page, insert it directly. Otherwise, try to
+ * compress all existing items. If that still doesn't create enough space, we
+ * have to split the page.
+ *
+ * TODO: We also resort to the slow way, if the new TID is not at the end of
+ * the page. Things get difficult, if the new TID is covered by the range of
+ * an existing compressed item.
+ */
+ if (PageGetFreeSpace(page) >= MAXALIGN(newitem->t_size) &&
+ (maxoff == InvalidOffsetNumber || tid > lasttid))
+ {
+ OffsetNumber off;
+
+ off = PageAddItemExtended(page, (Item) newitem, newitem->t_size, maxoff + 1, PAI_OVERWRITE);
+ if (off == InvalidOffsetNumber)
+ elog(ERROR, "didn't fit, after all?");
+ MarkBufferDirty(buf);
+ /* TODO: WAL-log */
+
+ UnlockReleaseBuffer(buf);
+ }
+ else
+ {
+ /* recompress and possibly split the page */
+ zsbt_replace_item(rel, attno, buf,
+ NULL, NULL,
+ (ZSBtreeItem *) newitem, NIL);
+ /* zsbt_replace_item unlocked 'buf' */
+ ReleaseBuffer(buf);
+ }
+
+ return tid;
+}
+
+/*
+ * Insert multiple items into the given attribute's btree.
+ *
+ * Populates the TIDs of the new tuples.
+ *
+ * If 'tid' in list is valid, then that TID is used. It better not be in use already. If
+ * it's invalid, then a new TID is allocated, as we see best. (When inserting the
+ * first column of the row, pass invalid, and for other columns, pass the TID
+ * you got for the first column.)
+ *
+ * TODO: this routine is very similar to zsbt_insert_item(); they can easily be combined.
+ */
+void
+zsbt_insert_multi_items(Relation rel, AttrNumber attno, List *newitems,
+ TransactionId xid, CommandId cid,
+ ZSUndoRecPtr *undorecptr, zstid *tid_return_list)
+{
+ ZSUncompressedBtreeItem *firstItem = (ZSUncompressedBtreeItem *) linitial(newitems);
+ zstid tid = firstItem->t_tid;
+ zstid firsttid;
+ BlockNumber rootblk;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+ zstid lasttid;
+ zstid insert_target_key;
+ ZSUndoRec_Insert undorec;
+ int i;
+ ListCell *lc;
+
+ rootblk = zsmeta_get_root_for_attribute(rel, attno, true);
+
+ /*
+ * If TID was given, find the right place for it. Otherwise, insert to
+ * the rightmost leaf.
+ *
+ * TODO: use a Free Space Map to find suitable target.
+ */
+ if (tid != InvalidZSTid)
+ insert_target_key = tid;
+ else
+ insert_target_key = MaxZSTid;
+
+ buf = zsbt_descend(rel, rootblk, insert_target_key);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /*
+ * Look at the last item, for its tid.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ if (maxoff >= FirstOffsetNumber)
+ {
+ ItemId iid = PageGetItemId(page, maxoff);
+ ZSBtreeItem *hitup = (ZSBtreeItem *) PageGetItem(page, iid);
+
+ if ((hitup->t_flags & ZSBT_COMPRESSED) != 0)
+ lasttid = ((ZSCompressedBtreeItem *) hitup)->t_lasttid;
+ else
+ lasttid = hitup->t_tid;
+
+ if (tid == InvalidZSTid)
+ {
+ tid = lasttid;
+ tid = ZSTidIncrementForInsert(tid);
+ }
+ }
+ else
+ {
+ lasttid = opaque->zs_lokey;
+ if (tid == InvalidZSTid)
+ tid = lasttid;
+ }
+
+ firsttid = tid;
+ lasttid = InvalidZSTid;
+ i = 0;
+
+ /* assign tids for all the items */
+ foreach(lc, newitems)
+ {
+ ZSUncompressedBtreeItem *newitem = (ZSUncompressedBtreeItem *) lfirst(lc);
+
+ /* fill in the remaining fields in the item */
+ newitem->t_undo_ptr = *undorecptr;
+
+ if (newitem->t_tid == InvalidZSTid)
+ {
+ newitem->t_tid = tid;
+ tid_return_list[i++] = tid;
+ lasttid = tid;
+
+ tid = ZSTidIncrementForInsert(tid);
+ }
+ }
+
+ /* Form an undo record */
+ if (!IsZSUndoRecPtrValid(undorecptr))
+ {
+ undorec.rec.size = sizeof(ZSUndoRec_Insert);
+ undorec.rec.type = ZSUNDO_TYPE_INSERT;
+ undorec.rec.attno = attno;
+ undorec.rec.xid = xid;
+ undorec.rec.cid = cid;
+ undorec.rec.tid = firsttid;
+ undorec.endtid = lasttid;
+ *undorecptr = zsundo_insert(rel, &undorec.rec);
+ }
+
+ /*
+ * update undo record pointer for all the items.
+ *
+ * TODO: refactor later to avoid this loop. Can assign above with tids as
+ * undo pointer is known similar to tid for rest of the columns, but just
+ * not for first attribute.
+ */
+ foreach(lc, newitems)
+ {
+ ZSUncompressedBtreeItem *newitem = (ZSUncompressedBtreeItem *) lfirst(lc);
+ newitem->t_undo_ptr = *undorecptr;
+ }
+
+ while (list_length(newitems))
+ {
+ ZSUncompressedBtreeItem *newitem = (ZSUncompressedBtreeItem *) linitial(newitems);
+
+ /*
+ * If there's enough space on the page, insert it directly. Otherwise, try to
+ * compress all existing items. If that still doesn't create enough space, we
+ * have to split the page.
+ *
+ * TODO: We also resort to the slow way, if the new TID is not at the end of
+ * the page. Things get difficult, if the new TID is covered by the range of
+ * an existing compressed item.
+ */
+ if (PageGetFreeSpace(page) >= MAXALIGN(newitem->t_size) &&
+ (maxoff > FirstOffsetNumber || tid > lasttid))
+ {
+ OffsetNumber off;
+
+ off = PageAddItemExtended(page, (Item) newitem, newitem->t_size,
+ maxoff + 1, PAI_OVERWRITE);
+ if (off == InvalidOffsetNumber)
+ elog(ERROR, "didn't fit, after all?");
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ newitems = list_delete_first(newitems);
+ }
+ else
+ break;
+ }
+
+ if (list_length(newitems))
+ {
+ /* recompress and possibly split the page */
+ zsbt_replace_item(rel, attno, buf,
+ NULL, NULL,
+ NULL, newitems);
+ /* zsbt_replace_item unlocked 'buf' */
+ ReleaseBuffer(buf);
+ }
+ else
+ {
+ MarkBufferDirty(buf);
+ /* TODO: WAL-log */
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+TM_Result
+zsbt_delete(Relation rel, AttrNumber attno, zstid tid,
+ TransactionId xid, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart)
+{
+ ZSBtreeScan scan;
+ ZSUncompressedBtreeItem *item;
+ TM_Result result;
+ ZSUndoRecPtr undorecptr;
+ ZSUncompressedBtreeItem *deleteditem;
+
+ zsbt_begin_scan(rel, attno, tid, snapshot, &scan);
+ scan.for_update = true;
+
+ /* Find the item to delete. (It could be compressed) */
+ item = zsbt_scan_next_internal(&scan);
+ if (item->t_tid != tid)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find tuple to delete with TID (%u, %u) for attribute %d",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid), attno);
+ }
+ result = zs_SatisfiesUpdate(&scan, item);
+ if (result != TM_Ok)
+ {
+ zsbt_end_scan(&scan);
+ return result;
+ }
+
+ /* Create UNDO record. */
+ {
+ ZSUndoRec_Delete undorec;
+
+ undorec.rec.size = sizeof(ZSUndoRec_Delete);
+ undorec.rec.type = ZSUNDO_TYPE_DELETE;
+ undorec.rec.attno = attno;
+ undorec.rec.xid = xid;
+ undorec.rec.cid = cid;
+ undorec.rec.tid = tid;
+ undorec.prevundorec = item->t_undo_ptr;
+
+ undorecptr = zsundo_insert(rel, &undorec.rec);
+ }
+
+ /* Replace the ZSBtreeItem with a DELETED item. */
+ deleteditem = palloc(item->t_size);
+ memcpy(deleteditem, item, item->t_size);
+ deleteditem->t_flags |= ZSBT_DELETED;
+ deleteditem->t_undo_ptr = undorecptr;
+
+ zsbt_replace_item(rel, attno, scan.lastbuf,
+ (ZSBtreeItem *) item, (ZSBtreeItem *) deleteditem,
+ NULL, NIL);
+ scan.lastbuf_is_locked = false; /* zsbt_replace_item released */
+ zsbt_end_scan(&scan);
+
+ pfree(deleteditem);
+
+ return TM_Ok;
+}
+
+/*
+ * If 'newtid' is valid, then that TID is used for the new item. It better not
+ * be in use already. If it's invalid, then a new TID is allocated, as we see
+ * best. (When inserting the first column of the row, pass invalid, and for
+ * other columns, pass the TID you got for the first column.)
+ */
+TM_Result
+zsbt_update(Relation rel, AttrNumber attno, zstid otid, Datum newdatum,
+ bool newisnull, TransactionId xid, CommandId cid, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd,
+ zstid *newtid_p)
+{
+ TupleDesc desc = RelationGetDescr(rel);
+ Form_pg_attribute attr = &desc->attrs[attno - 1];
+ ZSBtreeScan scan;
+ ZSUncompressedBtreeItem *olditem;
+ TM_Result result;
+ ZSUndoRecPtr undorecptr;
+ ZSUncompressedBtreeItem *deleteditem;
+ ZSUncompressedBtreeItem *newitem;
+ OffsetNumber maxoff;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ zstid newtid = *newtid_p;
+ Size datumsz;
+ Size newitemsz;
+ char *dataptr;
+
+ /*
+ * Find the item to delete. It could be part of a compressed item,
+ * we let zsbt_scan_next_internal() handle that.
+ */
+ zsbt_begin_scan(rel, attno, otid, snapshot, &scan);
+ scan.for_update = true;
+
+ olditem = zsbt_scan_next_internal(&scan);
+ if (olditem->t_tid != otid)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find old tuple to update with TID (%u, %u) for attribute %d",
+ ZSTidGetBlockNumber(otid), ZSTidGetOffsetNumber(otid), attno);
+ }
+
+ /*
+ * Is it visible to us?
+ */
+ result = zs_SatisfiesUpdate(&scan, olditem);
+ if (result != TM_Ok)
+ {
+ zsbt_end_scan(&scan);
+ return result;
+ }
+
+ /*
+ * Look at the last item on the page, for its tid. We will use that + 1,
+ * as the TID of the new item.
+ */
+ buf = scan.lastbuf;
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+ if (maxoff >= FirstOffsetNumber)
+ {
+ ItemId iid = PageGetItemId(page, maxoff);
+ ZSBtreeItem *hitup = (ZSBtreeItem *) PageGetItem(page, iid);
+
+ if ((hitup->t_flags & ZSBT_COMPRESSED) != 0)
+ newtid = ((ZSCompressedBtreeItem *) hitup)->t_lasttid;
+ else
+ newtid = hitup->t_tid;
+ newtid = ZSTidIncrementForInsert(newtid);
+ }
+ else
+ {
+ newtid = opaque->zs_lokey;
+ }
+
+ if (newtid >= opaque->zs_hikey)
+ {
+ /* no more free TIDs on the page. Bail out */
+ /*
+ * TODO: what we should do, is to find another target page for the
+ * new tuple.
+ */
+ elog(ERROR, "out of TID space on page");
+ }
+
+ /* Create UNDO record. */
+ {
+ ZSUndoRec_Update undorec;
+
+ undorec.rec.size = sizeof(ZSUndoRec_Update);
+ undorec.rec.type = ZSUNDO_TYPE_UPDATE;
+ undorec.rec.attno = attno;
+ undorec.rec.xid = xid;
+ undorec.rec.cid = cid;
+ undorec.rec.tid = newtid;
+ undorec.prevundorec = olditem->t_undo_ptr;
+ undorec.otid = otid;
+
+ undorecptr = zsundo_insert(rel, &undorec.rec);
+ }
+
+ /* Replace the ZSBtreeItem with an UPDATED item. */
+ deleteditem = palloc(olditem->t_size);
+ memcpy(deleteditem, olditem, olditem->t_size);
+ deleteditem->t_flags |= ZSBT_UPDATED;
+ deleteditem->t_undo_ptr = undorecptr;
+
+ /*
+ * Form a ZSBtreeItem to insert.
+ */
+ if (newisnull)
+ datumsz = 0;
+ else
+ datumsz = zs_datumGetSize(newdatum, attr->attbyval, attr->attlen);
+ newitemsz = offsetof(ZSUncompressedBtreeItem, t_payload) + datumsz;
+
+ newitem = palloc(newitemsz);
+ memset(newitem, 0, offsetof(ZSUncompressedBtreeItem, t_payload)); /* zero padding */
+ newitem->t_tid = newtid;
+ newitem->t_flags = 0;
+ newitem->t_size = newitemsz;
+ newitem->t_undo_ptr = undorecptr;
+
+ if (newisnull)
+ newitem->t_flags |= ZSBT_NULL;
+ else
+ {
+ dataptr = ((char *) newitem) + offsetof(ZSUncompressedBtreeItem, t_payload);
+ if (attr->attbyval)
+ store_att_byval(dataptr, newdatum, attr->attlen);
+ else
+ memcpy(dataptr, DatumGetPointer(newdatum), datumsz);
+ }
+
+ zsbt_replace_item(rel, attno, scan.lastbuf,
+ (ZSBtreeItem *) olditem, (ZSBtreeItem *) deleteditem,
+ (ZSBtreeItem *) newitem, NIL);
+ scan.lastbuf_is_locked = false; /* zsbt_recompress_replace released */
+ zsbt_end_scan(&scan);
+
+ pfree(deleteditem);
+ pfree(newitem);
+
+ *newtid_p = newtid;
+ return TM_Ok;
+}
+
+/*
+ * Mark item with given TID as dead.
+ *
+ * This is used during VACUUM.
+ */
+void
+zsbt_mark_item_dead(Relation rel, AttrNumber attno, zstid tid, ZSUndoRecPtr undoptr)
+{
+ ZSBtreeScan scan;
+ ZSUncompressedBtreeItem *item;
+ ZSUncompressedBtreeItem deaditem;
+
+ zsbt_begin_scan(rel, attno, tid, NULL, &scan);
+ scan.for_update = true;
+
+ /* Find the item to delete. (It could be compressed) */
+ item = zsbt_scan_next_internal(&scan);
+ if (item->t_tid != tid)
+ {
+ zsbt_end_scan(&scan);
+ elog(WARNING, "could not find tuple to remove with TID (%u, %u) for attribute %d",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid), attno);
+ return;
+ }
+
+ /* Replace the ZSBtreeItem with a DEAD item. (Unless it's already dead) */
+ if ((item->t_flags & ZSBT_DEAD) != 0)
+ return;
+
+ memset(&deaditem, 0, offsetof(ZSUncompressedBtreeItem, t_payload));
+ deaditem.t_tid = tid;
+ deaditem.t_size = sizeof(ZSUncompressedBtreeItem);
+ deaditem.t_flags = ZSBT_DEAD;
+ deaditem.t_undo_ptr = undoptr;
+
+ zsbt_replace_item(rel, attno, scan.lastbuf,
+ (ZSBtreeItem *) item, (ZSBtreeItem *) &deaditem,
+ NULL, NIL);
+ scan.lastbuf_is_locked = false; /* zsbt_replace_item released */
+ zsbt_end_scan(&scan);
+}
+
+/* ----------------------------------------------------------------
+ * Internal routines
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Find the leaf page containing the given key TID.
+ */
+static Buffer
+zsbt_descend(Relation rel, BlockNumber rootblk, zstid key)
+{
+ BlockNumber next;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ ZSBtreeInternalPageItem *items;
+ int nitems;
+ int itemno;
+ int nextlevel = -1;
+
+ next = rootblk;
+ for (;;)
+ {
+ buf = ReadBuffer(rel, next);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); /* TODO: shared */
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ if (nextlevel == -1)
+ nextlevel = opaque->zs_level;
+ else if (opaque->zs_level != nextlevel)
+ elog(ERROR, "unexpected level encountered when descending tree");
+
+ if (opaque->zs_level == 0)
+ return buf;
+
+ /*
+ * Do we need to walk right? This could happen if the page was concurrently split.
+ */
+ if (key >= opaque->zs_hikey)
+ {
+ /* follow the right-link */
+ next = opaque->zs_next;
+ if (next == InvalidBlockNumber)
+ elog(ERROR, "fell off the end of btree");
+ }
+ else
+ {
+ /* follow the downlink */
+ items = ZSBtreeInternalPageGetItems(page);
+ nitems = ZSBtreeInternalPageGetNumItems(page);
+
+ itemno = zsbt_binsrch_internal(key, items, nitems);
+ if (itemno < 0)
+ elog(ERROR, "could not descend tree for tid (%u, %u)",
+ ZSTidGetBlockNumber(key), ZSTidGetOffsetNumber(key));
+ next = items[itemno].childblk;
+ nextlevel--;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+/*
+ * Re-find the parent page containing downlink for given block.
+ * The returned page is exclusive-locked, and *itemno_p is set to the
+ * position of the downlink in the parent.
+ *
+ * If 'childblk' is the root, returns InvalidBuffer.
+ */
+static Buffer
+zsbt_find_downlink(Relation rel, AttrNumber attno,
+ zstid key, BlockNumber childblk, int level,
+ int *itemno_p)
+{
+ BlockNumber rootblk;
+ BlockNumber next;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ ZSBtreeInternalPageItem *items;
+ int nitems;
+ int itemno;
+ int nextlevel = -1;
+
+ /* start from root */
+ rootblk = zsmeta_get_root_for_attribute(rel, attno, true);
+ if (rootblk == childblk)
+ return InvalidBuffer;
+
+ /* XXX: this is mostly the same as zsbt_descend, but we stop at an internal
+ * page instead of descending all the way down to leaf */
+ next = rootblk;
+ for (;;)
+ {
+ buf = ReadBuffer(rel, next);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ if (nextlevel == -1)
+ nextlevel = opaque->zs_level;
+ else if (nextlevel != opaque->zs_level)
+ elog(ERROR, "unexpected level encountered when descending tree");
+
+ if (opaque->zs_level <= level)
+ elog(ERROR, "unexpected page level encountered");
+
+ /*
+ * Do we need to walk right? This could happen if the page was concurrently split.
+ */
+ if (key >= opaque->zs_hikey)
+ {
+ next = opaque->zs_next;
+ if (next == InvalidBlockNumber)
+ elog(ERROR, "fell off the end of btree");
+ }
+ else
+ {
+ items = ZSBtreeInternalPageGetItems(page);
+ nitems = ZSBtreeInternalPageGetNumItems(page);
+
+ itemno = zsbt_binsrch_internal(key, items, nitems);
+ if (itemno < 0)
+ elog(ERROR, "could not descend tree for tid (%u, %u)",
+ ZSTidGetBlockNumber(key), ZSTidGetOffsetNumber(key));
+
+ if (opaque->zs_level == level + 1)
+ {
+ if (items[itemno].childblk != childblk)
+ elog(ERROR, "could not re-find downlink for block %u", childblk);
+ *itemno_p = itemno;
+ return buf;
+ }
+
+ next = items[itemno].childblk;
+ nextlevel--;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+/*
+ * Create a new btree root page, containing two downlinks.
+ *
+ * NOTE: the very first root page of a btree, which is also the leaf, is created
+ * in zsmeta_get_root_for_attribute(), not here.
+ */
+static void
+zsbt_newroot(Relation rel, AttrNumber attno, int level,
+ zstid key1, BlockNumber blk1,
+ zstid key2, BlockNumber blk2,
+ Buffer leftchildbuf)
+{
+ ZSBtreePageOpaque *opaque;
+ ZSBtreePageOpaque *leftchildopaque;
+ Buffer buf;
+ Page page;
+ ZSBtreeInternalPageItem *items;
+ Buffer metabuf;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ Assert(key1 < key2);
+
+ buf = zs_getnewbuf(rel);
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ opaque = ZSBtreePageGetOpaque(page);
+ opaque->zs_attno = attno;
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_lokey = MinZSTid;
+ opaque->zs_hikey = MaxPlusOneZSTid;
+ opaque->zs_level = level;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ items = ZSBtreeInternalPageGetItems(page);
+ items[0].tid = key1;
+ items[0].childblk = blk1;
+ items[1].tid = key2;
+ items[1].childblk = blk2;
+ ((PageHeader) page)->pd_lower += 2 * sizeof(ZSBtreeInternalPageItem);
+ Assert(ZSBtreeInternalPageGetNumItems(page) == 2);
+
+ /* clear the follow-right flag on left child */
+ leftchildopaque = ZSBtreePageGetOpaque(BufferGetPage(leftchildbuf));
+ leftchildopaque->zs_flags &= ~ZS_FOLLOW_RIGHT;
+
+ /* TODO: wal-log all, including metapage */
+
+ MarkBufferDirty(buf);
+ MarkBufferDirty(leftchildbuf);
+
+ /* Before exiting, update the metapage */
+ zsmeta_update_root_for_attribute(rel, attno, metabuf, BufferGetBlockNumber(buf));
+
+ UnlockReleaseBuffer(leftchildbuf);
+ UnlockReleaseBuffer(buf);
+ UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * After page split, insert the downlink of 'rightblkno' to the parent.
+ *
+ * On entry, 'leftbuf' must be pinned exclusive-locked. It is released on exit.
+ */
+static void
+zsbt_insert_downlink(Relation rel, AttrNumber attno, Buffer leftbuf,
+ zstid rightlokey, BlockNumber rightblkno)
+{
+ BlockNumber leftblkno = BufferGetBlockNumber(leftbuf);
+ Page leftpage = BufferGetPage(leftbuf);
+ ZSBtreePageOpaque *leftopaque = ZSBtreePageGetOpaque(leftpage);
+ zstid leftlokey = leftopaque->zs_lokey;
+ ZSBtreeInternalPageItem *items;
+ int nitems;
+ int itemno;
+ Buffer parentbuf;
+ Page parentpage;
+
+ /*
+ * re-find parent
+ *
+ * TODO: this is a bit inefficient. Usually, we have just descended the
+ * tree, and if we just remembered the path we descended, we could just
+ * walk back up.
+ */
+ parentbuf = zsbt_find_downlink(rel, attno, leftlokey, leftblkno, leftopaque->zs_level, &itemno);
+ if (parentbuf == InvalidBuffer)
+ {
+ zsbt_newroot(rel, attno, leftopaque->zs_level + 1,
+ leftlokey, BufferGetBlockNumber(leftbuf),
+ rightlokey, rightblkno, leftbuf);
+ return;
+ }
+ parentpage = BufferGetPage(parentbuf);
+
+ /* Find the position in the parent for the downlink */
+ items = ZSBtreeInternalPageGetItems(parentpage);
+ nitems = ZSBtreeInternalPageGetNumItems(parentpage);
+ itemno = zsbt_binsrch_internal(rightlokey, items, nitems);
+
+ /* sanity checks */
+ if (itemno < 0 || items[itemno].tid != leftlokey ||
+ items[itemno].childblk != leftblkno)
+ {
+ elog(ERROR, "could not find downlink");
+ }
+ itemno++;
+
+ if (ZSBtreeInternalPageIsFull(parentpage))
+ {
+ /* split internal page */
+ zsbt_split_internal_page(rel, attno, parentbuf, leftbuf, itemno, rightlokey, rightblkno);
+ }
+ else
+ {
+ /* insert the new downlink for the right page. */
+ memmove(&items[itemno + 1],
+ &items[itemno],
+ (nitems - itemno) * sizeof(ZSBtreeInternalPageItem));
+ items[itemno].tid = rightlokey;
+ items[itemno].childblk = rightblkno;
+ ((PageHeader) parentpage)->pd_lower += sizeof(ZSBtreeInternalPageItem);
+
+ leftopaque->zs_flags &= ~ZS_FOLLOW_RIGHT;
+
+ /* TODO: WAL-log */
+
+ MarkBufferDirty(leftbuf);
+ MarkBufferDirty(parentbuf);
+ UnlockReleaseBuffer(leftbuf);
+ UnlockReleaseBuffer(parentbuf);
+ }
+}
+
+/*
+ * Split an internal page.
+ *
+ * The new downlink specified by 'newkey' and 'childblk' is inserted to
+ * position 'newoff', on 'leftbuf'. The page is split.
+ */
+static void
+zsbt_split_internal_page(Relation rel, AttrNumber attno, Buffer leftbuf, Buffer childbuf,
+ OffsetNumber newoff, zstid newkey, BlockNumber childblk)
+{
+ Buffer rightbuf;
+ Page origpage = BufferGetPage(leftbuf);
+ Page leftpage;
+ Page rightpage;
+ BlockNumber rightblkno;
+ ZSBtreePageOpaque *leftopaque;
+ ZSBtreePageOpaque *rightopaque;
+ ZSBtreeInternalPageItem *origitems;
+ ZSBtreeInternalPageItem *leftitems;
+ ZSBtreeInternalPageItem *rightitems;
+ int orignitems;
+ int leftnitems;
+ int rightnitems;
+ int splitpoint;
+ zstid splittid;
+ bool newitemonleft;
+ int i;
+ ZSBtreeInternalPageItem newitem;
+
+ leftpage = PageGetTempPageCopySpecial(origpage);
+ leftopaque = ZSBtreePageGetOpaque(leftpage);
+ Assert(leftopaque->zs_level > 0);
+ /* any previous incomplete split must be finished first */
+ Assert((leftopaque->zs_flags & ZS_FOLLOW_RIGHT) == 0);
+
+ rightbuf = zs_getnewbuf(rel);
+ rightpage = BufferGetPage(rightbuf);
+ rightblkno = BufferGetBlockNumber(rightbuf);
+ PageInit(rightpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ rightopaque = ZSBtreePageGetOpaque(rightpage);
+
+ /*
+ * Figure out the split point.
+ *
+ * TODO: currently, always do 90/10 split.
+ */
+ origitems = ZSBtreeInternalPageGetItems(origpage);
+ orignitems = ZSBtreeInternalPageGetNumItems(origpage);
+ splitpoint = orignitems * 0.9;
+ splittid = origitems[splitpoint].tid;
+ newitemonleft = (newkey < splittid);
+
+ /* Set up the page headers */
+ rightopaque->zs_attno = attno;
+ rightopaque->zs_next = leftopaque->zs_next;
+ rightopaque->zs_lokey = splittid;
+ rightopaque->zs_hikey = leftopaque->zs_hikey;
+ rightopaque->zs_level = leftopaque->zs_level;
+ rightopaque->zs_flags = 0;
+ rightopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ leftopaque->zs_next = rightblkno;
+ leftopaque->zs_hikey = splittid;
+ leftopaque->zs_flags |= ZS_FOLLOW_RIGHT;
+
+ /* copy the items */
+ leftitems = ZSBtreeInternalPageGetItems(leftpage);
+ leftnitems = 0;
+ rightitems = ZSBtreeInternalPageGetItems(rightpage);
+ rightnitems = 0;
+
+ newitem.tid = newkey;
+ newitem.childblk = childblk;
+
+ for (i = 0; i < orignitems; i++)
+ {
+ if (i == newoff)
+ {
+ if (newitemonleft)
+ leftitems[leftnitems++] = newitem;
+ else
+ rightitems[rightnitems++] = newitem;
+ }
+
+ if (i < splitpoint)
+ leftitems[leftnitems++] = origitems[i];
+ else
+ rightitems[rightnitems++] = origitems[i];
+ }
+ /* cope with possibility that newitem goes at the end */
+ if (i <= newoff)
+ {
+ Assert(!newitemonleft);
+ rightitems[rightnitems++] = newitem;
+ }
+ ((PageHeader) leftpage)->pd_lower += leftnitems * sizeof(ZSBtreeInternalPageItem);
+ ((PageHeader) rightpage)->pd_lower += rightnitems * sizeof(ZSBtreeInternalPageItem);
+
+ Assert(leftnitems + rightnitems == orignitems + 1);
+
+ PageRestoreTempPage(leftpage, origpage);
+
+ /* TODO: WAL-logging */
+ MarkBufferDirty(leftbuf);
+ MarkBufferDirty(rightbuf);
+
+ MarkBufferDirty(childbuf);
+ ZSBtreePageGetOpaque(BufferGetPage(childbuf))->zs_flags &= ~ZS_FOLLOW_RIGHT;
+ UnlockReleaseBuffer(childbuf);
+
+ UnlockReleaseBuffer(rightbuf);
+
+ /* recurse to insert downlink. (this releases 'leftbuf') */
+ zsbt_insert_downlink(rel, attno, leftbuf, splittid, rightblkno);
+}
+
+/*
+ * Returns the next item in the scan. This doesn't pay attention to visibility.
+ *
+ * The returned pointer might point directly into a btree buffer, or it might
+ * be a palloc'd copy. If it points to a buffer, scan->lastbuf_is_locked is true,
+ * otherwise false.
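+ *
+ * Returns NULL when the scan has no more items; the scan is deactivated.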
+ */
+static ZSUncompressedBtreeItem *
+zsbt_scan_next_internal(ZSBtreeScan *scan)
+{
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ BlockNumber next;
+
+ if (!scan->active)
+ return NULL;
+
+ for (;;)
+ {
+ while (scan->has_decompressed)
+ {
+ ZSUncompressedBtreeItem *item = zs_decompress_read_item(&scan->decompressor);
+
+ if (item == NULL)
+ {
+ scan->has_decompressed = false;
+ break;
+ }
+ if (item->t_tid >= scan->nexttid)
+ {
+ scan->nexttid = item->t_tid;
+ scan->nexttid = ZSTidIncrement(scan->nexttid);
+ return item;
+ }
+ }
+
+ buf = scan->lastbuf;
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ if (!scan->lastbuf_is_locked)
+ LockBuffer(buf, scan->for_update ? BUFFER_LOCK_EXCLUSIVE : BUFFER_LOCK_SHARE);
+ scan->lastbuf_is_locked = true;
+
+ /* TODO: check that the page is a valid zs btree page */
+
+ /* TODO: check the last offset first, as an optimization */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSBtreeItem *item = (ZSBtreeItem *) PageGetItem(page, iid);
+
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ ZSCompressedBtreeItem *citem = (ZSCompressedBtreeItem *) item;
+
+ if (citem->t_lasttid >= scan->nexttid)
+ {
+ zs_decompress_chunk(&scan->decompressor, citem);
+ scan->has_decompressed = true;
+ if (!scan->for_update)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ scan->lastbuf_is_locked = false;
+ }
+ break;
+ }
+ }
+ else
+ {
+ ZSUncompressedBtreeItem *uitem = (ZSUncompressedBtreeItem *) item;
+
+ if (uitem->t_tid >= scan->nexttid)
+ {
+ scan->nexttid = uitem->t_tid;
+ scan->nexttid = ZSTidIncrement(scan->nexttid);
+ return uitem;
+ }
+ }
+ }
+
+ if (scan->has_decompressed)
+ continue;
+
+ /* No more items on this page. Walk right, if possible */
+ next = opaque->zs_next;
+ if (next == BufferGetBlockNumber(buf))
+ elog(ERROR, "btree page %u next-pointer points to itself", next);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ scan->lastbuf_is_locked = false;
+
+ if (next == InvalidBlockNumber)
+ {
+ scan->active = false;
+ ReleaseBuffer(scan->lastbuf);
+ scan->lastbuf = InvalidBuffer;
+ return NULL;
+ }
+
+ scan->lastbuf = ReleaseAndReadBuffer(scan->lastbuf, scan->rel, next);
+ }
+}
+
+/*
+ * This helper function is used to implement INSERT, UPDATE and DELETE.
+ *
+ * If 'olditem' is not NULL, then 'olditem' on the page is replaced with
+ * 'replacementitem'. 'replacementitem' can be NULL, to remove an old item.
+ *
+ * If 'newitem' is not NULL, it is added to the page, to the correct position.
+ *
+ * This function handles decompressing and recompressing items, and splitting
+ * the page if needed.
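+ *
+ * On entry, 'buf' must be pinned and exclusive-locked. The lock is released
+ * by this function (via zsbt_recompress_replace()), but the caller's pin
+ * remains.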
+ */
+static void
+zsbt_replace_item(Relation rel, AttrNumber attno, Buffer buf,
+ ZSBtreeItem *olditem,
+ ZSBtreeItem *replacementitem,
+ ZSBtreeItem *newitem,
+ List *newitems)
+{
+ Page page = BufferGetPage(buf);
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ List *items;
+ bool found_old_item = false;
+ /* We might need to decompress up to two previously compressed items */
+ ZSDecompressContext decompressors[2];
+ int numdecompressors = 0;
+
+ /*
+ * Helper routine to append the given old item 'x' to the list.
+ * If 'x' matches the old item, then append 'replacementitem' instead.
+ * And if the 'newitem' should go before 'x', then append that first.
+ *
+ * TODO: We could also leave out any old, deleted items that are no longer
+ * visible to anyone.
+ */
+#define PROCESS_ITEM(x) \
+ do { \
+ if (newitem && (x)->t_tid >= newitem->t_tid) \
+ { \
+ Assert((x)->t_tid != newitem->t_tid); \
+ items = lappend(items, newitem); \
+ newitem = NULL; \
+ } \
+ if (olditem && (x)->t_tid == olditem->t_tid) \
+ { \
+ Assert(!found_old_item); \
+ found_old_item = true; \
+ if (replacementitem) \
+ items = lappend(items, replacementitem); \
+ } \
+ else \
+ items = lappend(items, x); \
+ } while(0)
+
+ /* Loop through all old items on the page */
+ items = NIL;
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSBtreeItem *item = (ZSBtreeItem *) PageGetItem(page, iid);
+
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ ZSCompressedBtreeItem *citem = (ZSCompressedBtreeItem *) item;
+
+ if ((olditem && citem->t_tid <= olditem->t_tid && olditem->t_tid <= citem->t_lasttid) ||
+ (newitem && citem->t_tid <= newitem->t_tid && newitem->t_tid <= citem->t_lasttid))
+ {
+ /* Found it, this compressed item covers the target or the new TID. */
+ /* We have to decompress it, and recompress */
+ ZSDecompressContext *decompressor = &decompressors[numdecompressors++];
+ ZSUncompressedBtreeItem *uitem;
+
+ Assert(numdecompressors <= 2);
+
+ zs_decompress_init(decompressor);
+ zs_decompress_chunk(decompressor, citem);
+
+ while ((uitem = zs_decompress_read_item(decompressor)) != NULL)
+ PROCESS_ITEM(uitem);
+ }
+ else
+ {
+ /* this item does not cover the target, nor the newitem. Add as it is. */
+ items = lappend(items, item);
+ continue;
+ }
+ }
+ else
+ PROCESS_ITEM(item);
+ }
+
+ if (olditem && !found_old_item)
+ elog(ERROR, "could not find old item to replace");
+
+ /* if the new item was not added in the loop, it goes to the end */
+ if (newitem)
+ items = lappend(items, newitem);
+
+ if (newitems)
+ items = list_concat(items, newitems);
+
+ /* Now pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ zsbt_recompress_replace(rel, attno, buf, items);
+
+ /*
+ * We can now free the decompression contexts. The pointers in the 'items' list
+ * point to decompression buffers, so we cannot free them until after writing out
+ * the pages.
+ */
+ for (int i = 0; i < numdecompressors; i++)
+ zs_decompress_free(&decompressors[i]);
+ list_free(items);
+}
+
+/*
+ * Recompressor routines
+ */
+typedef struct
+{
+ Page currpage;
+ ZSCompressContext compressor;
+ int compressed_items;
+ List *pages; /* first page writes over the old buffer,
+ * subsequent pages get newly-allocated buffers */
+
+ int total_items;
+ int total_compressed_items;
+ int total_already_compressed_items;
+
+ AttrNumber attno;
+ zstid hikey;
+} zsbt_recompress_context;
+
+static void
+zsbt_recompress_newpage(zsbt_recompress_context *cxt, zstid nexttid)
+{
+ Page newpage;
+ ZSBtreePageOpaque *newopaque;
+
+ if (cxt->currpage)
+ {
+ /* set the hikey on the previous page */
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(cxt->currpage);
+
+ oldopaque->zs_hikey = nexttid;
+ }
+
+ newpage = (Page) palloc(BLCKSZ);
+ PageInit(newpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ cxt->pages = lappend(cxt->pages, newpage);
+ cxt->currpage = newpage;
+
+ newopaque = ZSBtreePageGetOpaque(newpage);
+ newopaque->zs_attno = cxt->attno;
+ newopaque->zs_next = InvalidBlockNumber; /* filled in later */
+ newopaque->zs_lokey = nexttid;
+ newopaque->zs_hikey = cxt->hikey; /* overwritten later, if this is not last page */
+ newopaque->zs_level = 0;
+ newopaque->zs_flags = 0;
+ newopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+}
+
+static void
+zsbt_recompress_add_to_page(zsbt_recompress_context *cxt, ZSBtreeItem *item)
+{
+ if (PageGetFreeSpace(cxt->currpage) < MAXALIGN(item->t_size))
+ zsbt_recompress_newpage(cxt, item->t_tid);
+
+ if (PageAddItemExtended(cxt->currpage,
+ (Item) item, item->t_size,
+ PageGetMaxOffsetNumber(cxt->currpage) + 1,
+ PAI_OVERWRITE) == InvalidOffsetNumber)
+ elog(ERROR, "could not add item to page while recompressing");
+
+ cxt->total_items++;
+}
+
+static bool
+zsbt_recompress_add_to_compressor(zsbt_recompress_context *cxt, ZSUncompressedBtreeItem *item)
+{
+ bool result;
+
+ if (cxt->compressed_items == 0)
+ zs_compress_begin(&cxt->compressor, PageGetFreeSpace(cxt->currpage));
+
+ result = zs_compress_add(&cxt->compressor, item);
+ if (result)
+ {
+ cxt->compressed_items++;
+
+ cxt->total_compressed_items++;
+ }
+
+ return result;
+}
+
+static void
+zsbt_recompress_flush(zsbt_recompress_context *cxt)
+{
+ ZSCompressedBtreeItem *citem;
+
+ if (cxt->compressed_items == 0)
+ return;
+
+ citem = zs_compress_finish(&cxt->compressor);
+
+ zsbt_recompress_add_to_page(cxt, (ZSBtreeItem *) citem);
+ cxt->compressed_items = 0;
+}
+
+/*
+ * Rewrite a leaf page, with given 'items' as the new content.
+ *
+ * If there are any uncompressed items in the list, we try to compress them.
+ * Any already-compressed items are added as is.
+ *
+ * If the items no longer fit on the page, then the page is split. It is
+ * entirely possible that they don't fit even on two pages; we split the page
+ * into as many pages as needed. Hopefully not more than a few pages, though,
+ * because otherwise you might hit limits on the number of buffer pins (with
+ * tiny shared_buffers).
+ *
+ * On entry, 'oldbuf' must be pinned and exclusive-locked. On exit, the lock
+ * is released, but it's still pinned.
+ */
+static void
+zsbt_recompress_replace(Relation rel, AttrNumber attno, Buffer oldbuf, List *items)
+{
+ ListCell *lc;
+ ListCell *lc2;
+ zsbt_recompress_context cxt;
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(BufferGetPage(oldbuf));
+ ZSUndoRecPtr recent_oldest_undo = { 0 };
+ List *bufs;
+ int i;
+ BlockNumber orignextblk;
+
+ cxt.currpage = NULL;
+ zs_compress_init(&cxt.compressor);
+ cxt.compressed_items = 0;
+ cxt.pages = NIL;
+ cxt.attno = attno;
+ cxt.hikey = oldopaque->zs_hikey;
+
+ cxt.total_items = 0;
+ cxt.total_compressed_items = 0;
+ cxt.total_already_compressed_items = 0;
+
+ zsbt_recompress_newpage(&cxt, oldopaque->zs_lokey);
+
+ foreach(lc, items)
+ {
+ ZSBtreeItem *item = (ZSBtreeItem *) lfirst(lc);
+
+ /* We can leave out any old-enough DEAD items */
+ if ((item->t_flags & ZSBT_DEAD) != 0)
+ {
+ ZSUncompressedBtreeItem *uitem = (ZSUncompressedBtreeItem *) item;
+
+ if (recent_oldest_undo.counter == 0)
+ recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+
+ if (uitem->t_undo_ptr.counter < recent_oldest_undo.counter)
+ continue;
+ }
+
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ /* already compressed, add as it is. */
+ zsbt_recompress_flush(&cxt);
+ cxt.total_already_compressed_items++;
+ zsbt_recompress_add_to_page(&cxt, item);
+ }
+ else
+ {
+ /* try to add this item to the compressor */
+ ZSUncompressedBtreeItem *uitem = (ZSUncompressedBtreeItem *) item;
+
+ if (!zsbt_recompress_add_to_compressor(&cxt, uitem))
+ {
+ if (cxt.compressed_items > 0)
+ {
+ /* flush, and retry */
+ zsbt_recompress_flush(&cxt);
+
+ if (!zsbt_recompress_add_to_compressor(&cxt, uitem))
+ {
+ /* could not compress, even on its own. Store it uncompressed, then */
+ zsbt_recompress_add_to_page(&cxt, item);
+ }
+ }
+ else
+ {
+ /* could not compress, even on its own. Store it uncompressed, then */
+ zsbt_recompress_add_to_page(&cxt, item);
+ }
+ }
+ }
+ }
+
+ /* flush the last one, if any */
+ zsbt_recompress_flush(&cxt);
+
+ zs_compress_free(&cxt.compressor);
+
+ /*
+ * OK, we now have a list of pages, as private in-memory copies, to replace
+ * the original page. Allocate buffers for them, and write them out.
+ *
+ * Allocate all the buffers before entering the critical section, so that
+ * running out of disk space doesn't lead to a PANIC.
+ */
+ bufs = list_make1_int(oldbuf);
+ for (i = 0; i < list_length(cxt.pages) - 1; i++)
+ {
+ Buffer newbuf = zs_getnewbuf(rel);
+
+ bufs = lappend_int(bufs, newbuf);
+ }
+
+ START_CRIT_SECTION();
+
+ orignextblk = oldopaque->zs_next;
+ forboth(lc, cxt.pages, lc2, bufs)
+ {
+ Page page_copy = (Page) lfirst(lc);
+ Buffer buf = (Buffer) lfirst_int(lc2);
+ Page page = BufferGetPage(buf);
+ ZSBtreePageOpaque *opaque;
+
+ PageRestoreTempPage(page_copy, page);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /* TODO: WAL-log */
+ if (lnext(lc2))
+ {
+ Buffer nextbuf = (Buffer) lfirst_int(lnext(lc2));
+
+ opaque->zs_next = BufferGetBlockNumber(nextbuf);
+ opaque->zs_flags |= ZS_FOLLOW_RIGHT;
+ }
+ else
+ {
+ /* last one in the chain. */
+ opaque->zs_next = orignextblk;
+ }
+
+ MarkBufferDirty(buf);
+ }
+ list_free(cxt.pages);
+
+ END_CRIT_SECTION();
+
+ /* If we had to split, insert downlinks for the new pages. */
+ while (list_length(bufs) > 1)
+ {
+ Buffer leftbuf = (Buffer) linitial_int(bufs);
+ Buffer rightbuf = (Buffer) lsecond_int(bufs);
+
+ zsbt_insert_downlink(rel, attno, leftbuf,
+ ZSBtreePageGetOpaque(BufferGetPage(leftbuf))->zs_hikey,
+ BufferGetBlockNumber(rightbuf));
+ /* zsbt_insert_downlink() released leftbuf */
+ bufs = list_delete_first(bufs);
+ }
+ /* release the last page */
+ UnlockReleaseBuffer((Buffer) linitial_int(bufs));
+ list_free(bufs);
+}
+
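+/*
+ * Binary search over the items of an internal page.
+ *
+ * Returns the index of the last item with tid <= key, or -1 if 'key' is
+ * smaller than the first item's tid.
+ */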
+static int
+zsbt_binsrch_internal(zstid key, ZSBtreeInternalPageItem *arr, int arr_elems)
+{
+ int low,
+ high,
+ mid;
+
+ low = 0;
+ high = arr_elems;
+ while (high > low)
+ {
+ mid = low + (high - low) / 2;
+
+ if (key >= arr[mid].tid)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+ return low - 1;
+}
diff --git a/src/backend/access/zedstore/zedstore_compression.c b/src/backend/access/zedstore/zedstore_compression.c
new file mode 100644
index 0000000000..2aeac56beb
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_compression.c
@@ -0,0 +1,362 @@
+/*
+ * zedstore_compression.c
+ * Routines for compression
+ *
+ * There are two implementations at the moment: LZ4, and the Postgres
+ * pg_lzcompress(). LZ4 support requires that the server was compiled
+ * with --with-lz4.
+ *
+ * The compressor works on ZSUncompressedBtreeItems.
+ *
+ * Compression interface
+ * ---------------------
+ *
+ * Call zs_compress_init() to initialize.
+ *
+ * Call zs_compress_begin(), to begin compressing a group of items. Pass the
+ * maximum amount of space it's allowed to use after compression, as argument.
+ *
+ * Feed them to the compressor one by one with zs_compress_add(), until it
+ * returns false.
+ *
+ * Finally, call zs_compress_finish(). It returns a ZSCompressedBtreeItem,
+ * which contains all the plain items that were added (except for the last one
+ * for which zs_compress_add() returned false)
+ *
+ * Decompression interface
+ * -----------------------
+ *
+ * zs_decompress_chunk() takes a ZSCompressedBtreeItem as argument. It
+ * initializes a "context" with the given chunk.
+ *
+ * Call zs_decompress_read_item() to return the uncompressed items one by one.
+ *
+ *
+ * NOTES:
+ *
+ * Currently, the compressor accepts input until the *uncompressed* size exceeds
+ * the *compressed* space available. I.e. it assumes that the compressed size is
+ * never larger than the uncompressed size.
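+ *
+ * A sketch of the intended calling pattern (the caller's 'page', 'items' and
+ * 'nitems' here are hypothetical; the functions are the ones defined below):
+ *
+ *     ZSCompressContext cxt;
+ *     ZSCompressedBtreeItem *citem;
+ *     int         i = 0;
+ *
+ *     zs_compress_init(&cxt);
+ *     zs_compress_begin(&cxt, PageGetFreeSpace(page));
+ *     while (i < nitems && zs_compress_add(&cxt, items[i]))
+ *         i++;
+ *     citem = zs_compress_finish(&cxt);
+ *     ... store 'citem' on the page; items[i..] start the next chunk ...
+ *     zs_compress_free(&cxt);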
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_compression.c
+ */
+#include "postgres.h"
+
+#ifdef USE_LZ4
+#include <lz4.h>
+#endif
+
+#include "access/zedstore_compression.h"
+#include "access/zedstore_internal.h"
+#include "common/pg_lzcompress.h"
+#include "utils/datum.h"
+
+
+/*
+ * There are two implementations at the moment: LZ4, and the Postgres
+ * pg_lzcompress(). LZ4 support requires that the server was compiled
+ * with --with-lz4.
+ */
+#ifdef USE_LZ4
+
+/*
+ * Initialize a compression context.
+ */
+void
+zs_compress_init(ZSCompressContext *context)
+{
+ context->uncompressedbuffer = palloc(BLCKSZ * 10); /* FIXME: arbitrary size */
+ context->buffer = palloc(BLCKSZ);
+ context->maxCompressedSize = 0;
+ context->nitems = 0;
+ context->rawsize = 0;
+}
+
+void
+zs_compress_begin(ZSCompressContext *context, int maxCompressedSize)
+{
+ context->buffer = repalloc(context->buffer, maxCompressedSize);
+
+ maxCompressedSize -= offsetof(ZSCompressedBtreeItem, t_payload);
+ if (maxCompressedSize < 0)
+ maxCompressedSize = 0;
+
+ context->maxCompressedSize = maxCompressedSize;
+ context->nitems = 0;
+ context->rawsize = 0;
+}
+
+/*
+ * Try to add some data to the compressed block.
+ *
+ * If it wouldn't fit, return false.
+ */
+bool
+zs_compress_add(ZSCompressContext *context, ZSUncompressedBtreeItem *item)
+{
+ ZSCompressedBtreeItem *chunk = (ZSCompressedBtreeItem *) context->buffer;
+
+ Assert((item->t_flags & ZSBT_COMPRESSED) == 0);
+ Assert(item->t_tid != InvalidZSTid);
+
+ if (LZ4_COMPRESSBOUND(context->rawsize + item->t_size) > context->maxCompressedSize)
+ return false;
+
+ memcpy(context->uncompressedbuffer + context->rawsize, item, item->t_size);
+ if (context->nitems == 0)
+ chunk->t_tid = item->t_tid;
+ chunk->t_lasttid = item->t_tid;
+ context->nitems++;
+ context->rawsize += item->t_size;
+
+ return true;
+}
+
+ZSCompressedBtreeItem *
+zs_compress_finish(ZSCompressContext *context)
+{
+ ZSCompressedBtreeItem *chunk = (ZSCompressedBtreeItem *) context->buffer;
+ int32 compressed_size;
+
+ compressed_size = LZ4_compress_default(context->uncompressedbuffer,
+ chunk->t_payload,
+ context->rawsize,
+ context->maxCompressedSize);
+ if (compressed_size <= 0)
+ elog(ERROR, "LZ4 compression failed");
+
+ chunk->t_size = offsetof(ZSCompressedBtreeItem, t_payload) + compressed_size;
+ chunk->t_flags = ZSBT_COMPRESSED;
+ chunk->t_uncompressedsize = context->rawsize;
+
+ return chunk;
+}
+
+void
+zs_compress_free(ZSCompressContext *context)
+{
+ pfree(context->uncompressedbuffer);
+ pfree(context->buffer);
+}
+
+void
+zs_decompress_init(ZSDecompressContext *context)
+{
+ context->buffer = NULL;
+ context->bufsize = 0;
+ context->uncompressedsize = 0;
+}
+
+void
+zs_decompress_chunk(ZSDecompressContext *context, ZSCompressedBtreeItem *chunk)
+{
+ Assert((chunk->t_flags & ZSBT_COMPRESSED) != 0);
+ Assert(chunk->t_uncompressedsize > 0);
+ if (context->bufsize < chunk->t_uncompressedsize)
+ {
+ if (context->buffer)
+ pfree(context->buffer);
+ context->buffer = palloc(chunk->t_uncompressedsize);
+ context->bufsize = chunk->t_uncompressedsize;
+ }
+ context->uncompressedsize = chunk->t_uncompressedsize;
+
+ if (LZ4_decompress_safe(chunk->t_payload,
+ context->buffer,
+ chunk->t_size - offsetof(ZSCompressedBtreeItem, t_payload),
+ context->uncompressedsize) != context->uncompressedsize)
+ elog(ERROR, "could not decompress chunk");
+
+ context->bytesread = 0;
+}
+
+ZSUncompressedBtreeItem *
+zs_decompress_read_item(ZSDecompressContext *context)
+{
+ ZSUncompressedBtreeItem *next;
+
+ if (context->bytesread == context->uncompressedsize)
+ return NULL;
+ next = (ZSUncompressedBtreeItem *) (context->buffer + context->bytesread);
+ if (context->bytesread + next->t_size > context->uncompressedsize)
+ elog(ERROR, "invalid compressed item");
+ context->bytesread += next->t_size;
+
+ Assert(next->t_size >= sizeof(ZSUncompressedBtreeItem));
+ Assert(next->t_tid != InvalidZSTid);
+
+ return next;
+}
+
+void
+zs_decompress_free(ZSDecompressContext *context)
+{
+ if (context->buffer)
+ pfree(context->buffer);
+ context->buffer = NULL;
+ context->bufsize = 0;
+ context->uncompressedsize = 0;
+}
+
+
+#else
+/* PGLZ implementation */
+
+/*
+ * In the worst case, pg_lz outputs everything as "literals", and emits one
+ * "control byte" ever 8 bytes. Also, it requires 4 bytes extra at the end
+ * of the buffer. And add 10 bytes of slop, for good measure.
+ */
+#define MAX_COMPRESS_EXPANSION_OVERHEAD (8)
+#define MAX_COMPRESS_EXPANSION_BYTES (4 + 10)
+
+/*
+ * Initialize a compression context.
+ */
+void
+zs_compress_init(ZSCompressContext *context)
+{
+ context->uncompressedbuffer = palloc(BLCKSZ * 10); /* FIXME: arbitrary size */
+ context->buffer = palloc(BLCKSZ);
+ context->maxCompressedSize = 0;
+ context->maxUncompressedSize = 0;
+ context->nitems = 0;
+ context->rawsize = 0;
+}
+
+void
+zs_compress_begin(ZSCompressContext *context, int maxCompressedSize)
+{
+ int maxUncompressedSize;
+
+ context->buffer = repalloc(context->buffer, maxCompressedSize + 4 /* LZ slop */);
+
+ context->maxCompressedSize = maxCompressedSize;
+
+ /* determine the max uncompressed size */
+ maxUncompressedSize = maxCompressedSize;
+ maxUncompressedSize -= offsetof(ZSCompressedBtreeItem, t_payload);
+ maxUncompressedSize -= maxUncompressedSize / MAX_COMPRESS_EXPANSION_OVERHEAD;
+ maxUncompressedSize -= MAX_COMPRESS_EXPANSION_BYTES;
+ if (maxUncompressedSize < 0)
+ maxUncompressedSize = 0;
+ context->maxUncompressedSize = maxUncompressedSize;
+ context->nitems = 0;
+ context->rawsize = 0;
+}
+
+/*
+ * Try to add some data to the compressed block.
+ *
+ * If it wouldn't fit, return false.
+ */
+bool
+zs_compress_add(ZSCompressContext *context, ZSUncompressedBtreeItem *item)
+{
+ ZSCompressedBtreeItem *chunk = (ZSCompressedBtreeItem *) context->buffer;
+
+ Assert ((item->t_flags & ZSBT_COMPRESSED) == 0);
+
+ if (context->rawsize + item->t_size > context->maxUncompressedSize)
+ return false;
+
+ memcpy(context->uncompressedbuffer + context->rawsize, item, item->t_size);
+ if (context->nitems == 0)
+ chunk->t_tid = item->t_tid;
+ chunk->t_lasttid = item->t_tid;
+ context->nitems++;
+ context->rawsize += item->t_size;
+
+ return true;
+}
+
+ZSCompressedBtreeItem *
+zs_compress_finish(ZSCompressContext *context)
+{
+ ZSCompressedBtreeItem *chunk = (ZSCompressedBtreeItem *) context->buffer;
+ int32 compressed_size;
+
+ compressed_size = pglz_compress(context->uncompressedbuffer, context->rawsize,
+ chunk->t_payload,
+ PGLZ_strategy_always);
+ if (compressed_size < 0)
+ elog(ERROR, "pglz compression failed");
+
+ chunk->t_size = offsetof(ZSCompressedBtreeItem, t_payload) + compressed_size;
+ chunk->t_flags = ZSBT_COMPRESSED;
+ chunk->t_uncompressedsize = context->rawsize;
+
+ return chunk;
+}
+
+void
+zs_compress_free(ZSCompressContext *context)
+{
+ pfree(context->uncompressedbuffer);
+ pfree(context->buffer);
+}
+
+void
+zs_decompress_init(ZSDecompressContext *context)
+{
+ context->buffer = NULL;
+ context->bufsize = 0;
+ context->uncompressedsize = 0;
+}
+
+void
+zs_decompress_chunk(ZSDecompressContext *context, ZSCompressedBtreeItem *chunk)
+{
+ Assert((chunk->t_flags & ZSBT_COMPRESSED) != 0);
+ Assert(chunk->t_uncompressedsize > 0);
+ if (context->bufsize < chunk->t_uncompressedsize)
+ {
+ if (context->buffer)
+ pfree(context->buffer);
+ context->buffer = palloc(chunk->t_uncompressedsize);
+ context->bufsize = chunk->t_uncompressedsize;
+ }
+ context->uncompressedsize = chunk->t_uncompressedsize;
+
+ if (pglz_decompress(chunk->t_payload,
+ chunk->t_size - offsetof(ZSCompressedBtreeItem, t_payload),
+ context->buffer,
+ context->uncompressedsize, true) != context->uncompressedsize)
+ elog(ERROR, "could not decompress chunk");
+
+ context->bytesread = 0;
+}
+
+ZSUncompressedBtreeItem *
+zs_decompress_read_item(ZSDecompressContext *context)
+{
+ ZSUncompressedBtreeItem *next;
+
+ if (context->bytesread == context->uncompressedsize)
+ return NULL;
+ next = (ZSUncompressedBtreeItem *) (context->buffer + context->bytesread);
+ if (context->bytesread + next->t_size > context->uncompressedsize)
+ elog(ERROR, "invalid compressed item");
+ context->bytesread += next->t_size;
+
+ Assert(next->t_size >= sizeof(ZSUncompressedBtreeItem));
+ Assert(next->t_tid != InvalidZSTid);
+
+ return next;
+}
+
+void
+zs_decompress_free(ZSDecompressContext *context)
+{
+ if (context->buffer)
+ pfree(context->buffer);
+ context->buffer = NULL;
+ context->bufsize = 0;
+ context->uncompressedsize = 0;
+}
+
+#endif /* !USE_LZ4 */
diff --git a/src/backend/access/zedstore/zedstore_inspect.c b/src/backend/access/zedstore/zedstore_inspect.c
new file mode 100644
index 0000000000..b7c18bc04c
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_inspect.c
@@ -0,0 +1,445 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstore_inspect.c
+ * Debugging functions, for viewing ZedStore page contents
+ *
+ * These should probably be moved to contrib/, but it's handy to have them
+ * here during development.
+ *
+ * Example queries
+ * ---------------
+ *
+ * How many pages of each type does a table have?
+ *
+ * select count(*), pg_zs_page_type('t_zedstore', g)
+ * from generate_series(0, pg_table_size('t_zedstore') / 8192 - 1) g group by 2;
+ *
+ * count | pg_zs_page_type
+ * -------+-----------------
+ * 1 | META
+ * 3701 | BTREE
+ * 6 | UNDO
+ * (3 rows)
+ *
+ * Compression ratio of B-tree leaf pages (other pages are not compressed):
+ *
+ * select sum(uncompressedsz::numeric) / sum(totalsz) as compratio
+ * from pg_zs_btree_pages('t_zedstore') ;
+ * compratio
+ * --------------------
+ * 3.6623829559208134
+ * (1 row)
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_inspect.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "miscadmin.h"
+
+#include "access/relscan.h"
+#include "access/table.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "commands/vacuum.h"
+#include "funcapi.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+
+Datum pg_zs_page_type(PG_FUNCTION_ARGS);
+Datum pg_zs_undo_pages(PG_FUNCTION_ARGS);
+Datum pg_zs_btree_pages(PG_FUNCTION_ARGS);
+
+Datum
+pg_zs_page_type(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint64 pageno = PG_GETARG_INT64(1);
+ Relation rel;
+ uint16 zs_page_id;
+ Buffer buf;
+ Page page;
+ char *result;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ buf = ReadBuffer(rel, pageno);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
+ zs_page_id = *((uint16 *) ((char *) page + BLCKSZ - sizeof(uint16)));
+
+ UnlockReleaseBuffer(buf);
+
+ table_close(rel, AccessShareLock);
+
+ switch (zs_page_id)
+ {
+ case ZS_META_PAGE_ID:
+ result = "META";
+ break;
+ case ZS_BTREE_PAGE_ID:
+ result = "BTREE";
+ break;
+ case ZS_UNDO_PAGE_ID:
+ result = "UNDO";
+ break;
+ case ZS_TOAST_PAGE_ID:
+ result = "UNDO";
+ break;
+ default:
+ result = psprintf("UNKNOWN 0x%04x", zs_page_id);
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+/*
+ * blkno int8
+ * nrecords int4
+ * freespace int4
+ * firstrecptr int8
+ * lastrecptr int8
+ */
+Datum
+pg_zs_undo_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber firstblk;
+ BlockNumber blkno;
+ char *ptr;
+ char *endptr;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ /*
+ * Get the current oldest undo page from the metapage.
+ */
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ firstblk = metaopaque->zs_undo_head;
+
+ UnlockReleaseBuffer(metabuf);
+
+ /*
+ * Loop through UNDO records, starting from the oldest page.
+ */
+ blkno = firstblk;
+ while (blkno != InvalidBlockNumber)
+ {
+ Datum values[5];
+ bool nulls[5];
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+ int nrecords;
+ ZSUndoRecPtr firstptr = { 0, 0, 0 };
+ ZSUndoRecPtr lastptr = { 0, 0, 0 };
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the UNDO page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ {
+ elog(WARNING, "unexpected page id on UNDO page %u", blkno);
+ break;
+ }
+
+ /* loop through all records on the page */
+ endptr = (char *) page + ((PageHeader) page)->pd_lower;
+ ptr = (char *) page + SizeOfPageHeaderData;
+ nrecords = 0;
+ while (ptr < endptr)
+ {
+ ZSUndoRec *undorec = (ZSUndoRec *) ptr;
+
+ Assert(undorec->undorecptr.blkno == blkno);
+
+ lastptr = undorec->undorecptr;
+ if (nrecords == 0)
+ firstptr = lastptr;
+ nrecords++;
+
+ ptr += undorec->size;
+ }
+
+ values[0] = Int64GetDatum(blkno);
+ values[1] = Int32GetDatum(nrecords);
+ values[2] = Int32GetDatum(PageGetExactFreeSpace(page));
+ values[3] = Int64GetDatum(firstptr.counter);
+ values[4] = Int64GetDatum(lastptr.counter);
+
+ blkno = opaque->next;
+ UnlockReleaseBuffer(buf);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
+
+
+/*
+ * blkno int8
+ * nextblk int8
+ * attno int4
+ * level int4
+ *
+ * lokey int8
+ * hikey int8
+ *
+ * nitems int4
+ * ncompressed int4
+ * totalsz int4
+ * uncompressedsz int4
+ * freespace int4
+ */
+Datum
+pg_zs_btree_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ BlockNumber blkno;
+ BlockNumber nblocks;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /* scan all blocks in physical order (block 0 is the metapage, skip it) */
+ for (blkno = 1; blkno < nblocks; blkno++)
+ {
+ Datum values[11];
+ bool nulls[11];
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ int nitems;
+ int ncompressed;
+ int totalsz;
+ int uncompressedsz;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * we're only interested in B-tree pages. (Presumably, most of the
+ * pages in the relation are b-tree pages, so it makes sense to
+ * scan the whole relation in physical order)
+ */
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSBtreePageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+ opaque = (ZSBtreePageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_BTREE_PAGE_ID)
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+
+ nitems = 0;
+ ncompressed = 0;
+ totalsz = 0;
+ uncompressedsz = 0;
+ if (opaque->zs_level == 0)
+ {
+ /* leaf page */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSBtreeItem *item = (ZSBtreeItem *) PageGetItem(page, iid);
+
+ nitems++;
+ totalsz += item->t_size;
+
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ ZSCompressedBtreeItem *citem = (ZSCompressedBtreeItem *) PageGetItem(page, iid);
+
+ ncompressed++;
+ uncompressedsz += citem->t_uncompressedsize;
+ }
+ else
+ uncompressedsz += item->t_size;
+ }
+ }
+ else
+ {
+ /* internal page */
+ nitems = ZSBtreeInternalPageGetNumItems(page);
+ }
+ values[0] = Int64GetDatum(blkno);
+ values[1] = Int64GetDatum(opaque->zs_next);
+ values[2] = Int32GetDatum(opaque->zs_attno);
+ values[3] = Int32GetDatum(opaque->zs_level);
+ values[4] = Int64GetDatum(opaque->zs_lokey);
+ values[5] = Int64GetDatum(opaque->zs_hikey);
+ values[6] = Int32GetDatum(nitems);
+ if (opaque->zs_level == 0)
+ {
+ values[7] = Int32GetDatum(ncompressed);
+ values[8] = Int32GetDatum(totalsz);
+ values[9] = Int32GetDatum(uncompressedsz);
+ }
+ else
+ {
+ nulls[7] = true;
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ values[10] = Int32GetDatum(PageGetExactFreeSpace(page));
+
+ UnlockReleaseBuffer(buf);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/access/zedstore/zedstore_meta.c b/src/backend/access/zedstore/zedstore_meta.c
new file mode 100644
index 0000000000..b58b3bb31c
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_meta.c
@@ -0,0 +1,218 @@
+/*
+ * zedstore_meta.c
+ * Routines for handling ZedStore metapage
+ *
+ * The metapage holds a directory of B-tree root block numbers, one for each
+ * column.
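+ *
+ * The directory is kept in the metapage's contents as a ZSMetaPage struct:
+ * the number of attributes, followed by one root BlockNumber per attribute
+ * (InvalidBlockNumber if that attribute's tree hasn't been created yet).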
+ *
+ * TODO:
+ * - support ALTER TABLE ADD COLUMN.
+ * - extend the root block dir to an overflow page if there are too many
+ * attributes to fit on one page
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_meta.c
+ */
+#include "postgres.h"
+
+#include "access/itup.h"
+#include "access/zedstore_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "utils/rel.h"
+
+static void zs_initmetapage(Relation rel);
+
+/*
+ * Allocate a new page.
+ *
+ * The page is exclusive-locked, but not initialized.
+ *
+ * Currently, this just extends the relation, but we should have a free space
+ * map of some kind.
+ */
+Buffer
+zs_getnewbuf(Relation rel)
+{
+ Buffer buf;
+ bool needLock;
+
+ /*
+ * Extend the relation by one page.
+ *
+ * We have to use a lock to ensure no one else is extending the rel at
+ * the same time, else we will both try to initialize the same new
+ * page. We can skip locking for new or temp relations, however,
+ * since no one else could be accessing them.
+ */
+ needLock = !RELATION_IS_LOCAL(rel);
+
+ if (needLock)
+ LockRelationForExtension(rel, ExclusiveLock);
+
+ buf = ReadBuffer(rel, P_NEW);
+
+ /* Acquire buffer lock on new page */
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * Release the file-extension lock; it's now OK for someone else to
+ * extend the relation some more. Note that we cannot release this
+ * lock before we have buffer lock on the new page, or we risk a race
+ * condition against btvacuumscan --- see comments therein.
+ */
+ if (needLock)
+ UnlockRelationForExtension(rel, ExclusiveLock);
+
+ return buf;
+}
+
+/*
+ * Initialize the metapage for an empty relation.
+ */
+static void
+zs_initmetapage(Relation rel)
+{
+ int natts = RelationGetNumberOfAttributes(rel);
+ Buffer buf;
+ Page page;
+ ZSMetaPage *metapg;
+ ZSMetaPageOpaque *opaque;
+ Size freespace;
+ int maxatts;
+
+ buf = ReadBuffer(rel, P_NEW);
+ if (BufferGetBlockNumber(buf) != ZS_META_BLK)
+ elog(ERROR, "index is not empty");
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ PageInit(page, BLCKSZ, sizeof(ZSMetaPageOpaque));
+
+ /* Initialize the attribute root dir */
+ freespace = PageGetExactFreeSpace(page);
+ maxatts = freespace / sizeof(BlockNumber);
+ if (natts > maxatts)
+ {
+ /*
+ * The root block directory must fit on the metapage.
+ *
+ * TODO: We could extend this by overflowing to another page.
+ */
+ elog(ERROR, "too many attributes for zedstore");
+ }
+
+ metapg = (ZSMetaPage *) PageGetContents(page);
+ metapg->nattributes = natts;
+ for (int i = 0; i < natts; i++)
+ metapg->roots[i] = InvalidBlockNumber;
+ ((PageHeader) page)->pd_lower += natts * sizeof(BlockNumber);
+
+ opaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_META_PAGE_ID;
+
+ /* UNDO-related fields */
+ opaque->zs_undo_counter = 1; /* start at 1, so that 0 is always "old" */
+ opaque->zs_undo_head = InvalidBlockNumber;
+ opaque->zs_undo_tail = InvalidBlockNumber;
+ opaque->zs_undo_oldestptr.counter = 1;
+
+ MarkBufferDirty(buf);
+ /* TODO: WAL-log */
+
+ UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Get the block number of the b-tree root for given attribute.
+ *
+ * If 'forupdate' is true, and the root doesn't exist yet (ie. it's an empty
+ * table), a new root is allocated. Otherwise, returns InvalidBlockNumber if
+ * the root doesn't exist.
+ */
+BlockNumber
+zsmeta_get_root_for_attribute(Relation rel, AttrNumber attno, bool forupdate)
+{
+ Buffer metabuf;
+ ZSMetaPage *metapg;
+ BlockNumber rootblk;
+
+ if (RelationGetNumberOfBlocks(rel) == 0)
+ {
+ if (!forupdate)
+ return InvalidBlockNumber;
+
+ zs_initmetapage(rel);
+ }
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+
+ /* TODO: get share lock to begin with */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metapg = (ZSMetaPage *) PageGetContents(BufferGetPage(metabuf));
+
+ if (attno <= 0 || attno > metapg->nattributes)
+ elog(ERROR, "invalid attribute number %d (table has only %d attributes)", attno, metapg->nattributes);
+
+ rootblk = metapg->roots[attno - 1];
+
+ if (forupdate && rootblk == InvalidBlockNumber)
+ {
+ /* try to allocate one */
+ Buffer rootbuf;
+ Page rootpage;
+ ZSBtreePageOpaque *opaque;
+
+ /* TODO: release lock on metapage while we do I/O */
+ rootbuf = zs_getnewbuf(rel);
+ rootblk = BufferGetBlockNumber(rootbuf);
+
+ metapg->roots[attno - 1] = rootblk;
+
+ /* initialize the page to look like a root leaf */
+ rootpage = BufferGetPage(rootbuf);
+ PageInit(rootpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ opaque = ZSBtreePageGetOpaque(rootpage);
+ opaque->zs_attno = attno;
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_lokey = MinZSTid;
+ opaque->zs_hikey = MaxPlusOneZSTid;
+ opaque->zs_level = 0;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ MarkBufferDirty(rootbuf);
+ MarkBufferDirty(metabuf);
+ /* TODO: WAL-log both pages */
+
+ UnlockReleaseBuffer(rootbuf);
+ }
+
+ UnlockReleaseBuffer(metabuf);
+
+ return rootblk;
+}
+
+/*
+ * Update the root block pointer for the given attribute in the metapage.
+ *
+ * Caller is responsible for WAL-logging this.
+ */
+void
+zsmeta_update_root_for_attribute(Relation rel, AttrNumber attno,
+ Buffer metabuf, BlockNumber rootblk)
+{
+ ZSMetaPage *metapg;
+
+ metapg = (ZSMetaPage *) PageGetContents(BufferGetPage(metabuf));
+
+ if (attno <= 0 || attno > metapg->nattributes)
+ elog(ERROR, "invalid attribute number %d (table has only %d attributes)", attno, metapg->nattributes);
+
+ metapg->roots[attno - 1] = rootblk;
+
+ MarkBufferDirty(metabuf);
+}
diff --git a/src/backend/access/zedstore/zedstore_toast.c b/src/backend/access/zedstore/zedstore_toast.c
new file mode 100644
index 0000000000..f1e90f0b4b
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_toast.c
@@ -0,0 +1,188 @@
+/*
+ * zedstore_toast.c
+ * Routines for Toasting oversized tuples in Zedstore
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_toast.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_compression.h"
+#include "access/zedstore_internal.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/rel.h"
+
+/*
+ * Toast a datum, inside the ZedStore file.
+ *
+ * This is similar to regular toasting, but instead of using a separate toast heap
+ * and index, the datum is stored within the same ZedStore file as the b-trees. A
+ * chain of "toast pages" is allocated for the datum, and each page is filled
+ * with as much of the datum as possible.
+ *
+ *
+ * Note: You must call zedstore_toast_finish() after this,
+ * to set the TID in the toast-chain's first block. Otherwise, it's considered recyclable.
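+ *
+ * A sketch of the intended calling sequence ('rel', 'attno', 'value' and
+ * 'tid' stand for whatever the caller has at hand):
+ *
+ *     Datum toasted = zedstore_toast_datum(rel, attno, value);
+ *     ... insert 'toasted' into the attribute's b-tree, which yields 'tid' ...
+ *     zedstore_toast_finish(rel, attno, toasted, tid);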
+ */
+Datum
+zedstore_toast_datum(Relation rel, AttrNumber attno, Datum value)
+{
+ varatt_zs_toastptr *toastptr;
+ BlockNumber firstblk = InvalidBlockNumber;
+ Buffer buf = InvalidBuffer;
+ Page page;
+ ZSToastPageOpaque *opaque;
+ Buffer prevbuf = InvalidBuffer;
+ ZSToastPageOpaque *prevopaque = NULL;
+ char *ptr;
+ int32 total_size;
+ int32 offset;
+
+ /* TODO: try to compress it in place first. Maybe just call toast_compress_datum? */
+
+ /*
+ * If that doesn't reduce it enough, allocate a toast page
+ * for it.
+ */
+ ptr = VARDATA_ANY(value);
+ total_size = VARSIZE_ANY_EXHDR(value);
+ offset = 0;
+
+ while (total_size - offset > 0)
+ {
+ Size thisbytes;
+
+ buf = zs_getnewbuf(rel);
+ if (prevbuf == InvalidBuffer)
+ firstblk = BufferGetBlockNumber(buf);
+
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, sizeof(ZSToastPageOpaque));
+
+ thisbytes = Min(total_size - offset, PageGetExactFreeSpace(page));
+
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_attno = attno;
+ opaque->zs_tid = InvalidZSTid;
+ opaque->zs_total_size = total_size;
+ opaque->zs_slice_offset = offset;
+ opaque->zs_prev = BufferIsValid(prevbuf) ? BufferGetBlockNumber(prevbuf) : InvalidBlockNumber;
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_TOAST_PAGE_ID;
+
+ memcpy((char *) page + SizeOfPageHeaderData, ptr, thisbytes);
+ ((PageHeader) page)->pd_lower += thisbytes;
+ ptr += thisbytes;
+ offset += thisbytes;
+
+ if (prevbuf != InvalidBuffer)
+ {
+ prevopaque->zs_next = BufferGetBlockNumber(buf);
+ MarkBufferDirty(prevbuf);
+ }
+
+ /* TODO: WAL-log */
+ MarkBufferDirty(buf);
+
+ if (prevbuf != InvalidBuffer)
+ UnlockReleaseBuffer(prevbuf);
+ prevbuf = buf;
+ prevopaque = opaque;
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ toastptr = palloc0(sizeof(varatt_zs_toastptr));
+ SET_VARTAG_1B_E(toastptr, VARTAG_ZEDSTORE);
+ toastptr->zst_block = firstblk;
+
+ return PointerGetDatum(toastptr);
+}
+
+void
+zedstore_toast_finish(Relation rel, AttrNumber attno, Datum toasted, zstid tid)
+{
+ varatt_zs_toastptr *toastptr = (varatt_zs_toastptr *) DatumGetPointer(toasted);
+ Buffer buf;
+ Page page;
+ ZSToastPageOpaque *opaque;
+
+ Assert(toastptr->va_tag == VARTAG_ZEDSTORE);
+
+ buf = ReadBuffer(rel, toastptr->zst_block);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+
+ Assert(opaque->zs_tid == InvalidZSTid);
+ Assert(opaque->zs_attno == attno);
+ Assert(opaque->zs_prev == InvalidBlockNumber);
+
+ opaque->zs_tid = tid;
+
+ /* TODO: WAL-log */
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+}
+
+Datum
+zedstore_toast_flatten(Relation rel, AttrNumber attno, zstid tid, Datum toasted)
+{
+ varatt_zs_toastptr *toastptr = (varatt_zs_toastptr *) DatumGetPointer(toasted);
+ BlockNumber nextblk;
+ BlockNumber prevblk;
+ char *result = NULL;
+ char *ptr = NULL;
+ int32 total_size = 0;
+
+ Assert(toastptr->va_tag == VARTAG_ZEDSTORE);
+
+ prevblk = InvalidBlockNumber;
+ nextblk = toastptr->zst_block;
+
+ while (nextblk != InvalidBlockNumber)
+ {
+ Buffer buf;
+ Page page;
+ ZSToastPageOpaque *opaque;
+ uint32 size;
+
+ buf = ReadBuffer(rel, nextblk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+
+ Assert(opaque->zs_attno == attno);
+ Assert(opaque->zs_prev == prevblk);
+
+ if (prevblk == InvalidBlockNumber)
+ {
+ Assert(opaque->zs_tid == tid);
+
+ total_size = opaque->zs_total_size;
+
+ result = palloc(total_size + VARHDRSZ);
+ SET_VARSIZE(result, total_size + VARHDRSZ);
+ ptr = result + VARHDRSZ;
+ }
+
+ size = ((PageHeader) page)->pd_lower - SizeOfPageHeaderData;
+ memcpy(ptr, (char *) page + SizeOfPageHeaderData, size);
+ ptr += size;
+
+ prevblk = nextblk;
+ nextblk = opaque->zs_next;
+ UnlockReleaseBuffer(buf);
+ }
+ Assert(total_size > 0);
+ Assert(ptr == result + total_size + VARHDRSZ);
+
+ return PointerGetDatum(result);
+}
diff --git a/src/backend/access/zedstore/zedstore_undo.c b/src/backend/access/zedstore/zedstore_undo.c
new file mode 100644
index 0000000000..65bf67137e
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_undo.c
@@ -0,0 +1,783 @@
+/*
+ * zedstore_undo.c
+ * Temporary UNDO-logging for zedstore.
+ *
+ * XXX: This will hopefully be replaced with an upstream UNDO facility later.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_undo.c
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/multixact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "commands/progress.h"
+#include "commands/vacuum.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/pg_rusage.h"
+#include "utils/rel.h"
+#include "utils/lsyscache.h"
+
+/*
+ * Working area for zsundo_scan().
+ */
+typedef struct ZSUndoTrimStats
+{
+ /* List of TIDs of tuples we intend to delete */
+ /* NB: this list is ordered by TID address */
+ int num_dead_tuples; /* current # of entries */
+ int max_dead_tuples; /* # slots allocated in array */
+ ItemPointer dead_tuples; /* array of ItemPointerData */
+ bool dead_tuples_overflowed;
+
+ BlockNumber deleted_undo_pages;
+
+ bool can_advance_oldestundorecptr;
+} ZSUndoTrimStats;
+
+/*
+ * Working area for VACUUM.
+ */
+typedef struct ZSVacRelStats
+{
+ int elevel;
+ BufferAccessStrategy vac_strategy;
+
+ /* hasindex = true means two-pass strategy; false means one-pass */
+ bool hasindex;
+ /* Overall statistics about rel */
+ BlockNumber old_rel_pages; /* previous value of pg_class.relpages */
+ BlockNumber rel_pages; /* total number of pages */
+ BlockNumber scanned_pages; /* number of pages we examined */
+ BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */
+ BlockNumber tupcount_pages; /* pages whose tuples we counted */
+ double old_live_tuples; /* previous value of pg_class.reltuples */
+ double new_rel_tuples; /* new estimated total # of tuples */
+ double new_live_tuples; /* new estimated total # of live tuples */
+ double new_dead_tuples; /* new estimated total # of dead tuples */
+ BlockNumber pages_removed;
+ double tuples_deleted;
+ BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+
+ ZSUndoTrimStats trimstats;
+} ZSVacRelStats;
+
+/*
+ * Guesstimation of number of dead tuples per page. This is used to
+ * provide an upper limit to memory allocated when vacuuming small
+ * tables.
+ */
+#define LAZY_ALLOC_TUPLES MaxHeapTuplesPerPage
+
+static int zs_vac_cmp_itemptr(const void *left, const void *right);
+static bool zs_lazy_tid_reaped(ItemPointer itemptr, void *state);
+static void lazy_space_alloc(ZSVacRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_vacuum_index(Relation indrel,
+ IndexBulkDeleteResult **stats,
+ ZSVacRelStats *vacrelstats);
+static void lazy_cleanup_index(Relation indrel,
+ IndexBulkDeleteResult *stats,
+ ZSVacRelStats *vacrelstats);
+static ZSUndoRecPtr zsundo_scan(Relation rel, TransactionId OldestXmin, ZSUndoTrimStats *trimstats, BlockNumber *oldest_undopage);
+static void zsundo_update_oldest_ptr(Relation rel, ZSUndoRecPtr oldest_undorecptr, BlockNumber oldest_undopage);
+static void zsundo_record_dead_tuple(ZSUndoTrimStats *trimstats, zstid tid);
+
+/*
+ * Insert the given UNDO record to the UNDO log.
+ */
+ZSUndoRecPtr
+zsundo_insert(Relation rel, ZSUndoRec *rec)
+{
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber tail_blk;
+ Buffer tail_buf = InvalidBuffer;
+ Page tail_pg = NULL;
+ ZSUndoPageOpaque *tail_opaque = NULL;
+ char *dst;
+ ZSUndoRecPtr undorecptr;
+ int offset;
+ uint64 undo_counter;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+
+ /* TODO: get share lock to begin with, for more concurrency */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ tail_blk = metaopaque->zs_undo_tail;
+
+ /* Is there space on the tail page? */
+ if (tail_blk != InvalidBlockNumber)
+ {
+ tail_buf = ReadBuffer(rel, tail_blk);
+ LockBuffer(tail_buf, BUFFER_LOCK_EXCLUSIVE);
+ tail_pg = BufferGetPage(tail_buf);
+ tail_opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(tail_pg);
+ }
+ if (tail_blk == InvalidBlockNumber || PageGetExactFreeSpace(tail_pg) < rec->size)
+ {
+ Buffer newbuf;
+ BlockNumber newblk;
+ Page newpage;
+ ZSUndoPageOpaque *newopaque;
+
+ /* new page */
+ newbuf = zs_getnewbuf(rel);
+ newblk = BufferGetBlockNumber(newbuf);
+ newpage = BufferGetPage(newbuf);
+ PageInit(newpage, BLCKSZ, sizeof(ZSUndoPageOpaque));
+ newopaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(newpage);
+ newopaque->next = InvalidBlockNumber;
+ newopaque->zs_page_id = ZS_UNDO_PAGE_ID;
+
+ metaopaque->zs_undo_tail = newblk;
+ if (tail_blk == InvalidBlockNumber)
+ metaopaque->zs_undo_head = newblk;
+
+ MarkBufferDirty(metabuf);
+
+ if (tail_blk != InvalidBlockNumber)
+ {
+ tail_opaque->next = newblk;
+ MarkBufferDirty(tail_buf);
+ UnlockReleaseBuffer(tail_buf);
+ }
+
+ tail_blk = newblk;
+ tail_buf = newbuf;
+ tail_pg = newpage;
+ tail_opaque = newopaque;
+ }
+
+ undo_counter = metaopaque->zs_undo_counter++;
+ MarkBufferDirty(metabuf);
+
+ UnlockReleaseBuffer(metabuf);
+
+ /* insert the record to this page */
+ offset = ((PageHeader) tail_pg)->pd_lower;
+
+ undorecptr.counter = undo_counter;
+ undorecptr.blkno = tail_blk;
+ undorecptr.offset = offset;
+ rec->undorecptr = undorecptr;
+ dst = ((char *) tail_pg) + offset;
+ memcpy(dst, rec, rec->size);
+ ((PageHeader) tail_pg)->pd_lower += rec->size;
+ MarkBufferDirty(tail_buf);
+ UnlockReleaseBuffer(tail_buf);
+
+ return undorecptr;
+}
+
+/*
+ * Fetch the UNDO record with the given undo-pointer.
+ *
+ * The returned record is a palloc'd copy.
+ */
+ZSUndoRec *
+zsundo_fetch(Relation rel, ZSUndoRecPtr undoptr)
+{
+ Buffer buf;
+ Page page;
+ PageHeader pagehdr;
+ ZSUndoPageOpaque *opaque;
+ ZSUndoRec *undorec;
+ ZSUndoRec *undorec_copy;
+
+ buf = ReadBuffer(rel, undoptr.blkno);
+ page = BufferGetPage(buf);
+ pagehdr = (PageHeader) page;
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u; not an UNDO page",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+
+ /* Sanity check that the pointer pointed to a valid place */
+ if (undoptr.offset < SizeOfPageHeaderData ||
+ undoptr.offset + sizeof(ZSUndoRec) > pagehdr->pd_lower)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+
+ undorec = (ZSUndoRec *) (((char *) page) + undoptr.offset);
+
+ if (memcmp(&undorec->undorecptr, &undoptr, sizeof(ZSUndoRecPtr)) != 0)
+ elog(ERROR, "could not find UNDO record");
+
+ undorec_copy = palloc(undorec->size);
+ memcpy(undorec_copy, undorec, undorec->size);
+
+ UnlockReleaseBuffer(buf);
+
+ return undorec_copy;
+}
+
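+/*
+ * zs_lazy_tid_reaped() -- is a particular tid deletable?
+ *
+ * This has the right signature to be an IndexBulkDeleteCallback; it is
+ * passed to index_bulk_delete() in lazy_vacuum_index().
+ */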
+static bool
+zs_lazy_tid_reaped(ItemPointer itemptr, void *state)
+{
+ ZSVacRelStats *vacrelstats = (ZSVacRelStats *) state;
+ ItemPointer res;
+
+ res = (ItemPointer) bsearch((void *) itemptr,
+ (void *) vacrelstats->trimstats.dead_tuples,
+ vacrelstats->trimstats.num_dead_tuples,
+ sizeof(ItemPointerData),
+ zs_vac_cmp_itemptr);
+
+ return (res != NULL);
+}
+
+/*
+ * Comparator routines for use with qsort() and bsearch().
+ */
+static int
+zs_vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+void
+zsundo_vacuum(Relation rel, VacuumParams *params, BufferAccessStrategy bstrategy,
+ TransactionId OldestXmin)
+{
+ ZSVacRelStats *vacrelstats;
+ ZSUndoTrimStats *trimstats;
+ Relation *Irel;
+ int nindexes;
+ IndexBulkDeleteResult **indstats;
+ BlockNumber nblocks;
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (nblocks == 0)
+ return; /* empty table */
+
+ vacrelstats = (ZSVacRelStats *) palloc0(sizeof(ZSVacRelStats));
+ trimstats = &vacrelstats->trimstats;
+
+ if (params->options & VACOPT_VERBOSE)
+ vacrelstats->elevel = INFO;
+ else
+ vacrelstats->elevel = DEBUG2;
+ vacrelstats->vac_strategy = bstrategy;
+
+ /* Open all indexes of the relation */
+ vac_open_indexes(rel, RowExclusiveLock, &nindexes, &Irel);
+ vacrelstats->hasindex = (nindexes > 0);
+ indstats = (IndexBulkDeleteResult **)
+ palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
+
+ lazy_space_alloc(vacrelstats, nblocks);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("vacuuming \"%s.%s\"",
+ get_namespace_name(RelationGetNamespace(rel)),
+ RelationGetRelationName(rel))));
+
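+ /*
+ * Scan the UNDO log and remove tuples that are no longer needed. If the
+ * dead TIDs don't all fit in the dead_tuples array in one go, this is
+ * done in multiple rounds, like lazy vacuum on a heap does.
+ */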
+ do
+ {
+ ZSUndoRecPtr reaped_upto;
+ BlockNumber oldest_undopage;
+ int j;
+
+ trimstats->dead_tuples_overflowed = false;
+ trimstats->num_dead_tuples = 0;
+ trimstats->deleted_undo_pages = 0;
+
+ reaped_upto = zsundo_scan(rel, OldestXmin, trimstats, &oldest_undopage);
+
+ if (trimstats->num_dead_tuples > 0)
+ {
+ pg_qsort(trimstats->dead_tuples, trimstats->num_dead_tuples,
+ sizeof(ItemPointerData), zs_vac_cmp_itemptr);
+ /* TODO: currently, we write a separate UNDO record for each attribute, so there will
+ * be duplicates. Eliminate them. */
+ j = 1;
+ for (int i = 1; i < trimstats->num_dead_tuples; i++)
+ {
+ if (!ItemPointerEquals(&trimstats->dead_tuples[j - 1],
+ &trimstats->dead_tuples[i]))
+ trimstats->dead_tuples[j++] = trimstats->dead_tuples[i];
+ }
+ trimstats->num_dead_tuples = j;
+
+ /* Remove index entries */
+ for (int i = 0; i < nindexes; i++)
+ lazy_vacuum_index(Irel[i],
+ &indstats[i],
+ vacrelstats);
+
+ /*
+ * Mark the items as dead in the attribute b-trees.
+ *
+ * We cannot remove them immediately, because we must prevent the TIDs from
+ * being reused, until we have trimmed the UNDO records. Otherwise, this might
+ * happen:
+ *
+ * 1. We remove items from all the B-trees.
+ * 2. An inserter reuses the now-unused TID for a new tuple
+ * 3. We abort the VACUUM, for some reason
+ * 4. We start VACUUM again. We will now try to remove the item again, but
+ * we will remove the new item with the same TID instead.
+ *
+ * There would be other ways to deal with it. One easy optimization would be
+ * to leave the DEAD item in only one of the attributes, and remove all others
+ * completely. Or in step #4, we could refrain from removing items whose
+ * UNDO pointers are newer than expected. But that's tricky, because we scan
+ * the indexes first, and we must refrain from removing index entries for
+ * new items, too.
+ */
+ for (int attno = 1; attno <= RelationGetNumberOfAttributes(rel); attno++)
+ {
+ for (int i = 0; i < trimstats->num_dead_tuples; i++)
+ zsbt_mark_item_dead(rel, attno,
+ ZSTidFromItemPointer(trimstats->dead_tuples[i]),
+ reaped_upto);
+ }
+ }
+
+ /*
+ * The UNDO records for the tuple versions we just removed are no longer
+ * interesting to anyone. Advance the UNDO tail, so that the UNDO pages
+ * can be recycled.
+ */
+ zsundo_update_oldest_ptr(rel, reaped_upto, oldest_undopage);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("\"%s\": removed %d row versions and %d undo pages",
+ RelationGetRelationName(rel),
+ trimstats->num_dead_tuples,
+ trimstats->deleted_undo_pages)));
+ } while(trimstats->dead_tuples_overflowed);
+
+ /* Do post-vacuum cleanup and statistics update for each index */
+ for (int i = 0; i < nindexes; i++)
+ lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
+
+ /* Done with indexes */
+ vac_close_indexes(nindexes, Irel, NoLock);
+}
+
+
+/*
+ * lazy_space_alloc - space allocation decisions for lazy vacuum
+ *
+ * See the comment above LAZY_ALLOC_TUPLES for the rationale.
+ */
+static void
+lazy_space_alloc(ZSVacRelStats *vacrelstats, BlockNumber relblocks)
+{
+ long maxtuples;
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem : maintenance_work_mem;
+
+ if (vacrelstats->hasindex)
+ {
+ maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
+ maxtuples = Min(maxtuples, INT_MAX);
+ maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
+
+ /* curious coding here to ensure the multiplication can't overflow */
+ if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
+ maxtuples = relblocks * LAZY_ALLOC_TUPLES;
+
+ /* stay sane if small maintenance_work_mem */
+ maxtuples = Max(maxtuples, MaxHeapTuplesPerPage);
+ }
+ else
+ {
+ maxtuples = MaxHeapTuplesPerPage;
+ }
+
+ vacrelstats->trimstats.num_dead_tuples = 0;
+ vacrelstats->trimstats.max_dead_tuples = (int) maxtuples;
+ vacrelstats->trimstats.dead_tuples = (ItemPointer)
+ palloc(maxtuples * sizeof(ItemPointerData));
+}
+
+/*
+ * lazy_vacuum_index() -- vacuum one index relation.
+ *
+ * Delete all the index entries pointing to tuples listed in
+ * vacrelstats->dead_tuples, and update running statistics.
+ */
+static void
+lazy_vacuum_index(Relation indrel,
+ IndexBulkDeleteResult **stats,
+ ZSVacRelStats *vacrelstats)
+{
+ IndexVacuumInfo ivinfo;
+ PGRUsage ru0;
+
+ pg_rusage_init(&ru0);
+
+ ivinfo.index = indrel;
+ ivinfo.analyze_only = false;
+ ivinfo.estimated_count = true;
+ ivinfo.message_level = vacrelstats->elevel;
+ /* We can only provide an approximate value of num_heap_tuples here */
+ ivinfo.num_heap_tuples = vacrelstats->old_live_tuples;
+ ivinfo.strategy = vacrelstats->vac_strategy;
+
+ /* Do bulk deletion */
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ zs_lazy_tid_reaped, (void *) vacrelstats);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("scanned index \"%s\" to remove %d row versions",
+ RelationGetRelationName(indrel),
+ vacrelstats->trimstats.num_dead_tuples),
+ errdetail_internal("%s", pg_rusage_show(&ru0))));
+}
+
+/*
+ * lazy_cleanup_index() -- do post-vacuum cleanup for one index relation.
+ */
+static void
+lazy_cleanup_index(Relation indrel,
+ IndexBulkDeleteResult *stats,
+ ZSVacRelStats *vacrelstats)
+{
+ IndexVacuumInfo ivinfo;
+ PGRUsage ru0;
+
+ pg_rusage_init(&ru0);
+
+ ivinfo.index = indrel;
+ ivinfo.analyze_only = false;
+ ivinfo.estimated_count = (vacrelstats->tupcount_pages < vacrelstats->rel_pages);
+ ivinfo.message_level = vacrelstats->elevel;
+
+ /*
+ * Now we can provide a better estimate of total number of surviving
+ * tuples (we assume indexes are more interested in that than in the
+ * number of nominally live tuples).
+ */
+ ivinfo.num_heap_tuples = vacrelstats->new_rel_tuples;
+ ivinfo.strategy = vacrelstats->vac_strategy;
+
+ stats = index_vacuum_cleanup(&ivinfo, stats);
+
+ if (!stats)
+ return;
+
+ /*
+ * Now update statistics in pg_class, but only if the index says the count
+ * is accurate.
+ */
+ if (!stats->estimated_count)
+ vac_update_relstats(indrel,
+ stats->num_pages,
+ stats->num_index_tuples,
+ 0,
+ false,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ false);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("index \"%s\" now contains %.0f row versions in %u pages",
+ RelationGetRelationName(indrel),
+ stats->num_index_tuples,
+ stats->num_pages),
+ errdetail("%.0f index row versions were removed.\n"
+ "%u index pages have been deleted, %u are currently reusable.\n"
+ "%s.",
+ stats->tuples_removed,
+ stats->pages_deleted, stats->pages_free,
+ pg_rusage_show(&ru0))));
+
+ pfree(stats);
+}
+
+/*
+ * Scan the UNDO log, starting from oldest entry. For every tuple that is
+ * now considered dead, add it to 'dead_tuples'. Records for committed
+ * transactions can be trimmed away immediately.
+ *
+ * Returns the value that the oldest UNDO ptr can be trimmed up to, after
+ * removing all the dead TIDs.
+ *
+ * The caller must initialize ZSUndoTrimStats. This function updates the
+ * counters, and adds dead TIDs that can be removed to trimstats->dead_tuples.
+ * If there are more dead TIDs than fit in the dead_tuples array, this
+ * function sets the trimstats->dead_tuples_overflowed flag, and stops just before
+ * the UNDO record for the TID that did not fit. An important special case is
+ * calling this with trimstats->max_dead_tuples == 0. In that case, we scan
+ * as much as is possible without scanning the indexes (i.e. only UNDO
+ * records belonging to committed transactions at the tail of the UNDO log).
+ * IOW, it returns the oldest UNDO rec pointer that is still needed by
+ * active snapshots.
+ */
+static ZSUndoRecPtr
+zsundo_scan(Relation rel, TransactionId OldestXmin, ZSUndoTrimStats *trimstats,
+ BlockNumber *oldest_undopage)
+{
+ /* Scan the undo log from oldest to newest */
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber firstblk;
+ BlockNumber lastblk;
+ ZSUndoRecPtr oldest_undorecptr;
+ bool can_advance_oldestundorecptr;
+ char *ptr;
+ char *endptr;
+
+ /*
+ * Get the current oldest undo page from the metapage.
+ */
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ firstblk = metaopaque->zs_undo_head;
+
+ oldest_undorecptr = metaopaque->zs_undo_oldestptr;
+
+ /*
+ * If we assume that only one process can call TRIM at a time, then we
+ * don't need to hold the metapage locked. Alternatively, if multiple
+ * concurrent trims are possible, we could check after reading the head
+ * page, that it is the page we expect, and re-read the metapage if it's
+ * not.
+ *
+ * FIXME: Currently this works even if two backends call zsundo_trim()
+ * concurrently, because we never recycle UNDO pages.
+ */
+ UnlockReleaseBuffer(metabuf);
+
+ /*
+ * Loop through UNDO records, starting from the oldest page, until we
+ * hit a record that we cannot remove.
+ */
+ lastblk = firstblk;
+ can_advance_oldestundorecptr = false;
+ while (lastblk != InvalidBlockNumber && !trimstats->dead_tuples_overflowed)
+ {
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the UNDO page */
+ buf = ReadBuffer(rel, lastblk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ elog(ERROR, "unexpected page id on UNDO page");
+
+ /* loop through all records on the page */
+ endptr = (char *) page + ((PageHeader) page)->pd_lower;
+ ptr = (char *) page + SizeOfPageHeaderData;
+ while (ptr < endptr && !trimstats->dead_tuples_overflowed)
+ {
+ ZSUndoRec *undorec = (ZSUndoRec *) ptr;
+ bool did_commit;
+
+ Assert(undorec->undorecptr.blkno == lastblk);
+
+ /*
+ * Skip records that are already older than the current oldest UNDO
+ * pointer. (This check must happen before advancing oldest_undorecptr,
+ * or it can never be true.)
+ */
+ if (undorec->undorecptr.counter < oldest_undorecptr.counter)
+ {
+ ptr += undorec->size;
+ continue;
+ }
+
+ oldest_undorecptr = undorec->undorecptr;
+
+ if (!TransactionIdPrecedes(undorec->xid, OldestXmin))
+ {
+ /* This is still needed. Bail out */
+ break;
+ }
+
+ /*
+ * No one thinks this transaction is in-progress anymore. If it
+ * committed, we can just trim away its UNDO record. If it aborted,
+ * we need to apply the UNDO record first.
+ *
+ * TODO: applying UNDO actions has not been implemented, so if we
+ * encounter an aborted record, we just stop dead right there, and
+ * never trim past that point.
+ */
+ did_commit = TransactionIdDidCommit(undorec->xid);
+
+ switch (undorec->type)
+ {
+ case ZSUNDO_TYPE_INSERT:
+ if (!did_commit)
+ zsundo_record_dead_tuple(trimstats, undorec->tid);
+ break;
+ case ZSUNDO_TYPE_DELETE:
+ if (did_commit)
+ zsundo_record_dead_tuple(trimstats, undorec->tid);
+ break;
+ case ZSUNDO_TYPE_UPDATE:
+ if (did_commit)
+ zsundo_record_dead_tuple(trimstats, ((ZSUndoRec_Update *) undorec)->otid);
+ else
+ zsundo_record_dead_tuple(trimstats, undorec->tid);
+ break;
+ }
+ ptr += undorec->size;
+
+ can_advance_oldestundorecptr = true;
+ }
+
+ if (ptr < endptr)
+ {
+ UnlockReleaseBuffer(buf);
+ break;
+ }
+ else
+ {
+ /* We processed all records on the page. Step to the next one, if any. */
+ Assert(ptr == endptr);
+ lastblk = opaque->next;
+ UnlockReleaseBuffer(buf);
+ if (lastblk != InvalidBlockNumber)
+ trimstats->deleted_undo_pages++;
+ }
+ }
+
+ trimstats->can_advance_oldestundorecptr = can_advance_oldestundorecptr;
+ *oldest_undopage = lastblk;
+ return oldest_undorecptr;
+}
+
+/* Update metapage with the oldest value */
+static void
+zsundo_update_oldest_ptr(Relation rel, ZSUndoRecPtr oldest_undorecptr, BlockNumber oldest_undopage)
+{
+ /* Scan the undo log from oldest to newest */
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ metaopaque->zs_undo_oldestptr = oldest_undorecptr;
+ if (oldest_undopage == InvalidBlockNumber)
+ {
+ metaopaque->zs_undo_head = InvalidBlockNumber;
+ metaopaque->zs_undo_tail = InvalidBlockNumber;
+ }
+ else
+ metaopaque->zs_undo_head = oldest_undopage;
+
+ /* TODO: WAL-log */
+
+ MarkBufferDirty(metabuf);
+ UnlockReleaseBuffer(metabuf);
+
+ /* TODO: we leak all the removed undo pages */
+}
+
+/*
+ * zsundo_record_dead_tuple - remember one deletable tuple
+ */
+static void
+zsundo_record_dead_tuple(ZSUndoTrimStats *trimstats, zstid tid)
+{
+ /*
+ * The array shouldn't overflow under normal behavior, but perhaps it
+ * could if we are given a really small maintenance_work_mem. In that
+ * case, just forget the last few tuples (we'll get 'em next time).
+ */
+ if (trimstats->num_dead_tuples < trimstats->max_dead_tuples)
+ {
+ trimstats->dead_tuples[trimstats->num_dead_tuples] = ItemPointerFromZSTid(tid);
+ trimstats->num_dead_tuples++;
+ pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+ trimstats->num_dead_tuples);
+ }
+ else
+ trimstats->dead_tuples_overflowed = true;
+}
+
+/*
+ * Return the current "Oldest undo pointer". The effects of any actions with
+ * undo pointer older than this is known to be visible to everyone. (i.e.
+ * an inserted tuple is known to be visible, and a deleted tuple is known to
+ * be invisible.)
+ */
+ZSUndoRecPtr
+zsundo_get_oldest_undo_ptr(Relation rel)
+{
+ ZSUndoRecPtr result;
+ ZSUndoTrimStats trimstats;
+ BlockNumber oldest_undopage;
+
+ if (RelationGetNumberOfBlocks(rel) == 0)
+ {
+ memset(&result, 0, sizeof(ZSUndoRecPtr));
+ return result;
+ }
+
+ /*
+ * Call zsundo_scan, with max_dead_tuples = 0. It scans the UNDO log,
+ * starting from the oldest record, and advances the oldest UNDO pointer
+ * past as many committed, visible-to-all transactions as possible.
+ *
+ * TODO:
+ * We could get the latest cached value directly from the metapage, but
+ * this allows trimming the UNDO log more aggressively, whenever we're
+ * scanning. Fetching records from the UNDO log is pretty expensive,
+ * so until that is somehow sped up, it is a good tradeoff to be
+ * aggressive about that.
+ */
+ trimstats.num_dead_tuples = 0;
+ trimstats.max_dead_tuples = 0;
+ trimstats.dead_tuples = NULL;
+ trimstats.dead_tuples_overflowed = false;
+ trimstats.deleted_undo_pages = 0;
+ result = zsundo_scan(rel, RecentGlobalXmin, &trimstats, &oldest_undopage);
+
+ if (trimstats.can_advance_oldestundorecptr)
+ zsundo_update_oldest_ptr(rel, result, oldest_undopage);
+
+ return result;
+}
diff --git a/src/backend/access/zedstore/zedstore_visibility.c b/src/backend/access/zedstore/zedstore_visibility.c
new file mode 100644
index 0000000000..302331f2cb
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_visibility.c
@@ -0,0 +1,259 @@
+/*
+ * zedstore_visibility.c
+ * Routines for MVCC in Zedstore
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_visibility.c
+ */
+#include "postgres.h"
+
+#include "access/tableam.h"
+#include "access/xact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "storage/procarray.h"
+
+/*
+ * Like HeapTupleSatisfiesUpdate.
+ */
+TM_Result
+zs_SatisfiesUpdate(ZSBtreeScan *scan, ZSUncompressedBtreeItem *item)
+{
+ Relation rel = scan->rel;
+ Snapshot snapshot = scan->snapshot;
+ ZSUndoRecPtr recent_oldest_undo;
+ bool is_deleted;
+ ZSUndoRec *undorec;
+
+ if (scan->recent_oldest_undo.counter == 0)
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(scan->rel);
+ recent_oldest_undo = scan->recent_oldest_undo;
+
+ is_deleted = (item->t_flags & (ZSBT_UPDATED | ZSBT_DELETED)) != 0;
+
+ /*
+ * Fast path: if the item's UNDO pointer is older than the oldest UNDO
+ * pointer of interest to anyone, the last operation on it is known to
+ * be visible to everyone, and we don't need to fetch the UNDO record.
+ */
+ if (item->t_undo_ptr.counter < recent_oldest_undo.counter)
+ {
+ if (is_deleted)
+ {
+ /* this probably shouldn't happen.. */
+ return TM_Invisible;
+ }
+ else
+ return TM_Ok;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, item->t_undo_ptr);
+
+ if (!is_deleted)
+ {
+ /* Inserted tuple */
+ Assert(undorec->type == ZSUNDO_TYPE_INSERT ||
+ undorec->type == ZSUNDO_TYPE_UPDATE);
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ if (undorec->cid >= snapshot->curcid)
+ return TM_Invisible; /* inserted after scan started */
+ return TM_Ok;
+ }
+ else if (TransactionIdIsInProgress(undorec->xid))
+ return TM_Invisible; /* inserter has not committed yet */
+ else if (TransactionIdDidCommit(undorec->xid))
+ return TM_Ok;
+ else
+ {
+ /* it must have aborted or crashed */
+ return TM_Invisible;
+ }
+ }
+ else
+ {
+ /* deleted or updated-away tuple */
+ Assert(undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE);
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ if (undorec->cid >= snapshot->curcid)
+ return TM_SelfModified; /* deleted/updated after scan started */
+ else
+ return TM_Invisible; /* deleted before scan started */
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return TM_BeingModified;
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /* deleter must have aborted or crashed */
+ return TM_Ok;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_DELETE)
+ return TM_Deleted;
+ else
+ return TM_Updated;
+ }
+}
+
+/*
+ * Like HeapTupleSatisfiesAny
+ */
+static bool
+zs_SatisfiesAny(ZSBtreeScan *scan, ZSUncompressedBtreeItem *item)
+{
+ return true;
+}
+
+/*
+ * helper function to zs_SatisfiesMVCC(), to check if the given XID
+ * is visible to the snapshot.
+ */
+static bool
+xid_is_visible(Snapshot snapshot, TransactionId xid, CommandId cid)
+{
+ if (TransactionIdIsCurrentTransactionId(xid))
+ {
+ if (cid >= snapshot->curcid)
+ return false;
+ else
+ return true;
+ }
+ else if (XidInMVCCSnapshot(xid, snapshot))
+ return false;
+ else if (TransactionIdDidCommit(xid))
+ {
+ return true;
+ }
+ else
+ {
+ /* it must have aborted or crashed */
+ return false;
+ }
+}
+
+/*
+ * Like HeapTupleSatisfiesMVCC
+ */
+static bool
+zs_SatisfiesMVCC(ZSBtreeScan *scan, ZSUncompressedBtreeItem *item)
+{
+ Relation rel = scan->rel;
+ Snapshot snapshot = scan->snapshot;
+ ZSUndoRecPtr recent_oldest_undo = scan->recent_oldest_undo;
+ ZSUndoRec *undorec;
+ bool is_deleted;
+
+ Assert (snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ is_deleted = (item->t_flags & (ZSBT_UPDATED | ZSBT_DELETED)) != 0;
+
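+ /*
+ * If the item's UNDO pointer is older than the oldest UNDO pointer that
+ * can be of interest to anyone, the last operation on the item is known
+ * to be visible to everyone: an insertion means the item is visible, a
+ * deletion (or update-away) means it is not.
+ */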
+ if (item->t_undo_ptr.counter < recent_oldest_undo.counter)
+ {
+ if (!is_deleted)
+ return true;
+ else
+ return false;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, item->t_undo_ptr);
+
+ if (!is_deleted)
+ {
+ /* Inserted tuple */
+ Assert(undorec->type == ZSUNDO_TYPE_INSERT ||
+ undorec->type == ZSUNDO_TYPE_UPDATE);
+
+ return xid_is_visible(snapshot, undorec->xid, undorec->cid);
+ }
+ else
+ {
+ /* deleted or updated-away tuple */
+ Assert(undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE);
+
+ if (xid_is_visible(snapshot, undorec->xid, undorec->cid))
+ {
+ /* we can see the deletion */
+ return false;
+ }
+ else
+ {
+ /*
+ * The deleting XID is not visible to us. But before concluding
+ * that the tuple is visible, we have to check if the inserting
+ * XID is visible to us.
+ */
+ ZSUndoRecPtr prevptr;
+
+ if (undorec->type == ZSUNDO_TYPE_DELETE)
+ prevptr = ((ZSUndoRec_Delete *) undorec)->prevundorec;
+ else
+ prevptr = ((ZSUndoRec_Update *) undorec)->prevundorec;
+
+ if (prevptr.counter < recent_oldest_undo.counter)
+ return true;
+
+ undorec = zsundo_fetch(rel, prevptr);
+
+ Assert(undorec->type == ZSUNDO_TYPE_INSERT ||
+ undorec->type == ZSUNDO_TYPE_UPDATE);
+ if (xid_is_visible(snapshot, undorec->xid, undorec->cid))
+ return true; /* we can see the insert, but not the delete */
+ else
+ return false; /* we cannot see the insert */
+ }
+ }
+}
+
+/*
+ * Like HeapTupleSatisfiesVisibility
+ */
+bool
+zs_SatisfiesVisibility(ZSBtreeScan *scan, ZSUncompressedBtreeItem *item)
+{
+ /*
+ * If we don't have a cached oldest-undo-ptr value yet, fetch it
+ * from the metapage. (TODO: in EDB's final UNDO-log implementation,
+ * this will probably be just a global variable, like RecentGlobalXmin.)
+ */
+ if (scan->recent_oldest_undo.counter == 0)
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(scan->rel);
+
+ /* dead items are never considered visible. */
+ if ((item->t_flags & ZSBT_DEAD) != 0)
+ return false;
+
+ switch (scan->snapshot->snapshot_type)
+ {
+ case SNAPSHOT_MVCC:
+ return zs_SatisfiesMVCC(scan, item);
+ break;
+ case SNAPSHOT_SELF:
+ elog(ERROR, "SnapshotSelf not implemented in zedstore yet");
+ break;
+ case SNAPSHOT_ANY:
+ return zs_SatisfiesAny(scan, item);
+ break;
+ case SNAPSHOT_TOAST:
+ elog(ERROR, "SnapshotToast not implemented in zedstore");
+ break;
+ case SNAPSHOT_DIRTY:
+ elog(ERROR, "SnapshotDirty not implemented in zedstore yet");
+ break;
+ case SNAPSHOT_HISTORIC_MVCC:
+ elog(ERROR, "SnapshotHistoricMVCC not implemented in zedstore yet");
+ break;
+ case SNAPSHOT_NON_VACUUMABLE:
+ elog(ERROR, "SnapshotNonVacuumable not implemented in zedstore yet");
+ break;
+ }
+
+ return false; /* keep compiler quiet */
+}
diff --git a/src/backend/access/zedstore/zedstoream_handler.c b/src/backend/access/zedstore/zedstoream_handler.c
new file mode 100644
index 0000000000..84575260fb
--- /dev/null
+++ b/src/backend/access/zedstore/zedstoream_handler.c
@@ -0,0 +1,1500 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstoream_handler.c
+ * ZedStore table access method code
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstoream_handler.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "miscadmin.h"
+
+#include "access/heapam.h"
+#include "access/multixact.h"
+#include "access/relscan.h"
+#include "access/tableam.h"
+#include "access/xact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "catalog/catalog.h"
+#include "catalog/index.h"
+#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
+#include "commands/vacuum.h"
+#include "executor/executor.h"
+#include "optimizer/plancat.h"
+#include "storage/procarray.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+
+
+typedef enum
+{
+ ZSSCAN_STATE_UNSTARTED,
+ ZSSCAN_STATE_SCANNING,
+ ZSSCAN_STATE_FINISHED_RANGE,
+ ZSSCAN_STATE_FINISHED
+} zs_scan_state;
+
+typedef struct ZedStoreDescData
+{
+ /* scan parameters */
+ TableScanDescData rs_scan; /* AM-independent part of the scan descriptor */
+ int *proj_atts;
+ ZSBtreeScan *btree_scans;
+ int num_proj_atts;
+
+ zs_scan_state state;
+ zstid cur_range_start;
+ zstid cur_range_end;
+ bool finished;
+
+ /* These fields are used for bitmap scans, to hold a "block's" worth of data */
+#define MAX_ITEMS_PER_LOGICAL_BLOCK MaxHeapTuplesPerPage
+ int bmscan_ntuples;
+ zstid *bmscan_tids;
+ Datum **bmscan_datums;
+ bool **bmscan_isnulls;
+ int bmscan_nexttuple;
+
+} ZedStoreDescData;
+
+typedef struct ZedStoreDescData *ZedStoreDesc;
+
+typedef struct ZedStoreIndexFetchData
+{
+ IndexFetchTableData idx_fetch_data;
+ int *proj_atts;
+ int num_proj_atts;
+} ZedStoreIndexFetchData;
+
+typedef struct ZedStoreIndexFetchData *ZedStoreIndexFetch;
+
+typedef struct ParallelZSScanDescData *ParallelZSScanDesc;
+
+static Size zs_parallelscan_estimate(Relation rel);
+static Size zs_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan);
+static void zs_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan);
+static bool zs_parallelscan_nextrange(Relation rel, ParallelZSScanDesc pzscan,
+ zstid *start, zstid *end);
+
+
+/* ----------------------------------------------------------------
+ * storage AM support routines for zedstoream
+ * ----------------------------------------------------------------
+ */
+
+static bool
+zedstoream_fetch_row_version(Relation relation,
+ ItemPointer tid,
+ Snapshot snapshot,
+ TupleTableSlot *slot)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function %s not implemented yet", __func__)));
+ return false;
+}
+
+static void
+zedstoream_get_latest_tid(Relation relation,
+ Snapshot snapshot,
+ ItemPointer tid)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function %s not implemented yet", __func__)));
+}
+
+static void
+zedstoream_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, struct BulkInsertStateData *bistate)
+{
+ AttrNumber attno;
+ Datum *d;
+ bool *isnulls;
+ zstid tid;
+ ZSUndoRecPtr undorecptr;
+ TransactionId xid = GetCurrentTransactionId();
+
+ slot_getallattrs(slot);
+ d = slot->tts_values;
+ isnulls = slot->tts_isnull;
+
+ tid = InvalidZSTid;
+ ZSUndoRecPtrInitialize(&undorecptr);
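+ /*
+ * Insert the datum of each column into that attribute's B-tree. The TID
+ * is assigned when the first column is inserted, and the same TID is
+ * then used for the remaining columns.
+ */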
+ for (attno = 1; attno <= relation->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = &relation->rd_att->attrs[attno - 1];
+ Datum datum = d[attno - 1];
+ bool isnull = isnulls[attno - 1];
+ Datum toastptr = (Datum) 0;
+
+ /* If this datum is too large, toast it */
+ if (!isnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(datum) > MaxZedStoreDatumSize)
+ {
+ toastptr = datum = zedstore_toast_datum(relation, attno, datum);
+ }
+
+ tid = zsbt_insert(relation, attno, datum, isnull, xid, cid, tid, &undorecptr);
+
+ if (toastptr != (Datum) 0)
+ zedstore_toast_finish(relation, attno, toastptr, tid);
+ }
+
+ slot->tts_tableOid = RelationGetRelid(relation);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+}
+
+static void
+zedstoream_insert_speculative(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, BulkInsertState bistate, uint32 specToken)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function %s not implemented yet", __func__)));
+}
+
+static void
+zedstoream_complete_speculative(Relation relation, TupleTableSlot *slot, uint32 spekToken,
+ bool succeeded)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function %s not implemented yet", __func__)));
+}
+
+static void
+zedstoream_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
+ CommandId cid, int options, BulkInsertState bistate)
+{
+ AttrNumber attno;
+ zstid *tid;
+ ZSUndoRecPtr undorecptr;
+ int i;
+ bool slotgetandset = true;
+ TransactionId xid = GetCurrentTransactionId();
+ int *tupletoasted;
+ Datum *toastdatum;
+
+ tid = palloc(ntuples * sizeof(zstid));
+ tupletoasted = palloc(ntuples * sizeof(int));
+ toastdatum = palloc(ntuples * sizeof(Datum));
+
+ ZSUndoRecPtrInitialize(&undorecptr);
+
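+ /*
+ * Insert the rows column-at-a-time: for each attribute, build the items
+ * for all the rows and insert them into that attribute's B-tree in one
+ * call. As in zedstoream_insert(), the TID array is filled in by the
+ * first column's insertion and reused for the remaining columns.
+ */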
+ for (attno = 1; attno <= relation->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = &relation->rd_att->attrs[attno - 1];
+ int ntupletoasted = 0;
+ List *zitems = NIL;
+
+ for (i = 0; i < ntuples; i++)
+ {
+ ZSUncompressedBtreeItem *zitem;
+ Datum datum = slots[i]->tts_values[attno - 1];
+ bool isnull = slots[i]->tts_isnull[attno - 1];
+
+ if (slotgetandset)
+ {
+ slot_getallattrs(slots[i]);
+ tid[i] = InvalidZSTid;
+ }
+
+ /* If this datum is too large, toast it */
+ if (!isnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(datum) > MaxZedStoreDatumSize)
+ {
+ datum = toastdatum[ntupletoasted] = zedstore_toast_datum(relation, attno, datum);
+ tupletoasted[ntupletoasted++] = i;
+ }
+
+ zitem = zsbt_create_item(attr, tid[i], datum, isnull);
+ zitems = lappend(zitems, zitem);
+ }
+
+ zsbt_insert_multi_items(relation, attno, zitems, xid, cid, &undorecptr, tid);
+
+ for (i = 0; i < ntupletoasted; i++)
+ {
+ zedstore_toast_finish(relation, attno, toastdatum[tupletoasted[i]],
+ tid[tupletoasted[i]]);
+ }
+
+ slotgetandset = false;
+ }
+
+ for (i = 0; i < ntuples; i++)
+ {
+ slots[i]->tts_tableOid = RelationGetRelid(relation);
+ slots[i]->tts_tid = ItemPointerFromZSTid(tid[i]);
+ }
+
+ pfree(tid);
+ pfree(tupletoasted);
+ pfree(toastdatum);
+}
+
+static TM_Result
+zedstoream_delete(Relation relation, ItemPointer tid_p, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart)
+{
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ AttrNumber attno;
+
+ for (attno = 1; attno <= relation->rd_att->natts; attno++)
+ {
+ TM_Result result;
+
+ result = zsbt_delete(relation, attno, tid, xid, cid,
+ snapshot, crosscheck, wait, hufd, changingPart);
+
+ /*
+ * TODO: Here, we should check for TM_BeingModified, like heap_delete()
+ * does
+ */
+
+ if (result != TM_Ok)
+ {
+ if (attno != 1)
+ {
+ /* failed to delete this attribute, but we might already have
+ * deleted other attributes. */
+ elog(ERROR, "could not delete all columns of row");
+ }
+ return result;
+ }
+ }
+
+ return TM_Ok;
+}
+
+
+static TM_Result
+zedstoream_lock_tuple(Relation relation, ItemPointer tid, Snapshot snapshot,
+ TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
+ LockWaitPolicy wait_policy, uint8 flags,
+ TM_FailureData *hufd)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function %s not implemented yet", __func__)));
+}
+
+
+static TM_Result
+zedstoream_update(Relation relation, ItemPointer otid_p, TupleTableSlot *slot,
+ CommandId cid, Snapshot snapshot, Snapshot crosscheck,
+ bool wait, TM_FailureData *hufd,
+ LockTupleMode *lockmode, bool *update_indexes)
+{
+ zstid otid = ZSTidFromItemPointer(*otid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ AttrNumber attno;
+ Datum *d;
+ bool *isnulls;
+ TM_Result result;
+ zstid newtid;
+
+ slot_getallattrs(slot);
+ d = slot->tts_values;
+ isnulls = slot->tts_isnull;
+
+ /*
+ * TODO: Since we have visibility information on each column, we could skip
+ * updating columns whose value didn't change.
+ */
+
+ result = TM_Ok;
+ newtid = InvalidZSTid;
+ for (attno = 1; attno <= relation->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = &relation->rd_att->attrs[attno - 1];
+ Datum newdatum = d[attno - 1];
+ bool newisnull = isnulls[attno - 1];
+ Datum toastptr = (Datum) 0;
+ TM_Result this_result;
+
+ /* If this datum is too large, toast it */
+ if (!newisnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(newdatum) > MaxZedStoreDatumSize)
+ {
+ toastptr = newdatum = zedstore_toast_datum(relation, attno, newdatum);
+ }
+
+ this_result = zsbt_update(relation, attno, otid, newdatum, newisnull,
+ xid, cid, snapshot, crosscheck,
+ wait, hufd, &newtid);
+
+ if (this_result != TM_Ok)
+ {
+ /* FIXME: hmm, failed to update this attribute, but we might already have
+ * updated other attributes. Error? */
+ /* FIXME: this leaks the toast chain on failure */
+ result = this_result;
+ break;
+ }
+
+ if (toastptr != (Datum) 0)
+ zedstore_toast_finish(relation, attno, toastptr, newtid);
+ }
+ slot->tts_tid = ItemPointerFromZSTid(newtid);
+
+ /* TODO: could we do HOT updates? */
+ /* TODO: What should we set lockmode to? */
+
+ return result;
+}
+
+static const TupleTableSlotOps *
+zedstoream_slot_callbacks(Relation relation)
+{
+ return &TTSOpsVirtual;
+}
+
+static TableScanDesc
+zedstoream_beginscan_with_column_projection(Relation relation, Snapshot snapshot,
+ int nkeys, ScanKey key,
+ ParallelTableScanDesc parallel_scan,
+ bool *project_columns,
+ bool allow_strat,
+ bool allow_sync,
+ bool allow_pagemode,
+ bool is_bitmapscan,
+ bool is_samplescan,
+ bool temp_snap)
+{
+ ZedStoreDesc scan;
+
+ /*
+ * allocate and initialize scan descriptor
+ */
+ scan = (ZedStoreDesc) palloc(sizeof(ZedStoreDescData));
+
+ scan->rs_scan.rs_rd = relation;
+ scan->rs_scan.rs_snapshot = snapshot;
+ scan->rs_scan.rs_nkeys = nkeys;
+ scan->rs_scan.rs_bitmapscan = is_bitmapscan;
+ scan->rs_scan.rs_samplescan = is_samplescan;
+ scan->rs_scan.rs_allow_strat = allow_strat;
+ scan->rs_scan.rs_allow_sync = allow_sync;
+ scan->rs_scan.rs_temp_snap = temp_snap;
+ scan->rs_scan.rs_parallel = parallel_scan;
+
+ /*
+ * we can use page-at-a-time mode if it's an MVCC-safe snapshot
+ */
+ scan->rs_scan.rs_pageatatime = allow_pagemode && snapshot && IsMVCCSnapshot(snapshot);
+
+ scan->state = ZSSCAN_STATE_UNSTARTED;
+
+ /*
+ * we do this here instead of in initscan() because heap_rescan also calls
+ * initscan() and we don't want to allocate memory again
+ */
+ if (nkeys > 0)
+ scan->rs_scan.rs_key = (ScanKey) palloc(sizeof(ScanKeyData) * nkeys);
+ else
+ scan->rs_scan.rs_key = NULL;
+
+ scan->proj_atts = palloc(relation->rd_att->natts * sizeof(int));
+
+ scan->btree_scans = palloc0(relation->rd_att->natts * sizeof(ZSBtreeScan));
+ scan->num_proj_atts = 0;
+
+ /*
+ * Convert the boolean array into an array of the attribute numbers of the
+ * required columns.
+ */
+ for (int i = 0; i < relation->rd_att->natts; i++)
+ {
+ /* a NULL project_columns array means that all columns are needed */
+ if (project_columns == NULL || project_columns[i])
+ {
+ scan->proj_atts[scan->num_proj_atts++] = i;
+ }
+ }
+
+ /* Extra setup for bitmap scans */
+ if (is_bitmapscan)
+ {
+ scan->bmscan_ntuples = 0;
+ scan->bmscan_tids = palloc(MAX_ITEMS_PER_LOGICAL_BLOCK * sizeof(zstid));
+
+ scan->bmscan_datums = palloc(scan->num_proj_atts * sizeof(Datum *));
+ scan->bmscan_isnulls = palloc(scan->num_proj_atts * sizeof(bool *));
+ for (int i = 0; i < scan->num_proj_atts; i++)
+ {
+ scan->bmscan_datums[i] = palloc(MAX_ITEMS_PER_LOGICAL_BLOCK * sizeof(Datum));
+ scan->bmscan_isnulls[i] = palloc(MAX_ITEMS_PER_LOGICAL_BLOCK * sizeof(bool));
+ }
+ }
+
+ return (TableScanDesc) scan;
+}
+
+static TableScanDesc
+zedstoream_beginscan(Relation relation, Snapshot snapshot,
+ int nkeys, ScanKey key,
+ ParallelTableScanDesc parallel_scan,
+ bool allow_strat,
+ bool allow_sync,
+ bool allow_pagemode,
+ bool is_bitmapscan,
+ bool is_samplescan,
+ bool temp_snap)
+{
+ return zedstoream_beginscan_with_column_projection(relation, snapshot, nkeys, key, parallel_scan,
+ NULL, allow_strat, allow_sync, allow_pagemode,
+ is_bitmapscan, is_samplescan, temp_snap);
+}
+
+static void
+zedstoream_endscan(TableScanDesc sscan)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+
+ if (scan->proj_atts)
+ pfree(scan->proj_atts);
+
+ for (int i = 0; i < sscan->rs_rd->rd_att->natts; i++)
+ zsbt_end_scan(&scan->btree_scans[i]);
+
+ if (scan->rs_scan.rs_temp_snap)
+ UnregisterSnapshot(scan->rs_scan.rs_snapshot);
+
+ pfree(scan->btree_scans);
+ pfree(scan);
+}
+
+static bool
+zedstoream_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+
+ Assert(scan->num_proj_atts <= slot->tts_tupleDescriptor->natts);
+
+ slot->tts_nvalid = 0;
+ slot->tts_flags |= TTS_FLAG_EMPTY;
+
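+ /*
+ * This is a simple state machine: in ZSSCAN_STATE_UNSTARTED or
+ * ZSSCAN_STATE_FINISHED_RANGE, pick the next TID range to scan (the
+ * whole table, or the next parallel chunk) and begin a B-tree scan on
+ * each projected column; in ZSSCAN_STATE_SCANNING, advance all the
+ * column scans in lockstep, one TID at a time.
+ */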
+ while (scan->state != ZSSCAN_STATE_FINISHED)
+ {
+ zstid this_tid;
+
+ if (scan->state == ZSSCAN_STATE_UNSTARTED ||
+ scan->state == ZSSCAN_STATE_FINISHED_RANGE)
+ {
+ if (scan->rs_scan.rs_parallel)
+ {
+ /* Allocate next range of TIDs to scan */
+ if (!zs_parallelscan_nextrange(scan->rs_scan.rs_rd,
+ (ParallelZSScanDesc) scan->rs_scan.rs_parallel,
+ &scan->cur_range_start, &scan->cur_range_end))
+ {
+ scan->state = ZSSCAN_STATE_FINISHED;
+ break;
+ }
+ }
+ else
+ {
+ if (scan->state == ZSSCAN_STATE_FINISHED_RANGE)
+ {
+ scan->state = ZSSCAN_STATE_FINISHED;
+ break;
+ }
+ scan->cur_range_start = MinZSTid;
+ scan->cur_range_end = MaxPlusOneZSTid;
+ }
+
+ for (int i = 0; i < scan->num_proj_atts; i++)
+ {
+ int natt = scan->proj_atts[i];
+
+ zsbt_begin_scan(scan->rs_scan.rs_rd, natt + 1,
+ scan->cur_range_start,
+ scan->rs_scan.rs_snapshot,
+ &scan->btree_scans[i]);
+ }
+ scan->state = ZSSCAN_STATE_SCANNING;
+ }
+
+ /* We now have a range to scan */
+ Assert(scan->state == ZSSCAN_STATE_SCANNING);
+ this_tid = InvalidZSTid;
+ for (int i = 0; i < scan->num_proj_atts; i++)
+ {
+ int natt = scan->proj_atts[i];
+ Form_pg_attribute att = &scan->rs_scan.rs_rd->rd_att->attrs[natt];
+ Datum datum;
+ bool isnull;
+ zstid tid;
+
+ if (!zsbt_scan_next(&scan->btree_scans[i], &datum, &isnull, &tid))
+ {
+ scan->state = ZSSCAN_STATE_FINISHED_RANGE;
+ break;
+ }
+ if (tid >= scan->cur_range_end)
+ {
+ scan->state = ZSSCAN_STATE_FINISHED_RANGE;
+ break;
+ }
+
+ if (i == 0)
+ this_tid = tid;
+ else if (this_tid != tid)
+ {
+ elog(ERROR, "scans on different attributes out of sync");
+ }
+
+ /*
+ * Flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ if (!isnull && att->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(scan->rs_scan.rs_rd, natt + 1, tid, datum);
+ }
+
+ slot->tts_values[natt] = datum;
+ slot->tts_isnull[natt] = isnull;
+ }
+
+ if (scan->state == ZSSCAN_STATE_FINISHED_RANGE)
+ {
+ for (int i = 0; i < scan->num_proj_atts; i++)
+ {
+ /* the btree scans are indexed by projection position, not attno */
+ zsbt_end_scan(&scan->btree_scans[i]);
+ }
+ }
+ else
+ {
+ Assert(scan->state == ZSSCAN_STATE_SCANNING);
+ slot->tts_tid = ItemPointerFromZSTid(this_tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+ return true;
+ }
+ }
+
+ ExecClearTuple(slot);
+ return false;
+}
+
+static bool
+zedstoream_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
+ Snapshot snapshot)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function %s not implemented yet", __func__)));
+}
+
+static TransactionId
+zedstoream_compute_xid_horizon_for_tuples(Relation rel,
+ ItemPointerData *items,
+ int nitems)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function %s not implemented yet", __func__)));
+
+}
+
+static IndexFetchTableData *
+zedstoream_begin_index_fetch(Relation rel)
+{
+ ZedStoreIndexFetch zscan = palloc0(sizeof(ZedStoreIndexFetchData));
+
+ zscan->idx_fetch_data.rel = rel;
+
+ zscan->proj_atts = palloc(rel->rd_att->natts * sizeof(int));
+ zscan->num_proj_atts = 0;
+
+ return (IndexFetchTableData *) zscan;
+}
+
+static void
+zedstoream_fetch_set_column_projection(struct IndexFetchTableData *scan,
+ bool *project_column)
+{
+ ZedStoreIndexFetch zscan = (ZedStoreIndexFetch)scan;
+ Relation rel = zscan->idx_fetch_data.rel;
+
+ zscan->num_proj_atts = 0;
+
+ /*
+ * Convert the boolean array into an array of the attribute numbers of the
+ * required columns.
+ */
+ for (int i = 0; i < rel->rd_att->natts; i++)
+ {
+ /* a NULL project_column array means that all columns are needed */
+ if (project_column == NULL || project_column[i])
+ {
+ zscan->proj_atts[zscan->num_proj_atts++] = i;
+ }
+ }
+}
+
+static void
+zedstoream_reset_index_fetch(IndexFetchTableData *scan)
+{
+}
+
+static void
+zedstoream_end_index_fetch(IndexFetchTableData *scan)
+{
+ ZedStoreIndexFetch zscan = (ZedStoreIndexFetch) scan;
+
+ pfree(zscan->proj_atts);
+ pfree(zscan);
+}
+
+static bool
+zedstoream_index_fetch_tuple(struct IndexFetchTableData *scan,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot,
+ bool *call_again, bool *all_dead)
+{
+ ZedStoreIndexFetch zscan = (ZedStoreIndexFetch) scan;
+ Relation rel = zscan->idx_fetch_data.rel;
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ bool found = true;
+
+ /*
+ * If the executor didn't set a column projection, we need to return all
+ * the columns.
+ */
+ if (zscan->num_proj_atts == 0)
+ {
+ for (int i = 0; i < rel->rd_att->natts; i++)
+ zscan->proj_atts[zscan->num_proj_atts++] = i;
+ }
+
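+ /*
+ * Fetch the datum for each projected column by beginning a B-tree scan
+ * at the target TID in that attribute's tree. If any column doesn't
+ * have an entry for the TID, the row is treated as not found.
+ */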
+ for (int i = 0; i < zscan->num_proj_atts && found; i++)
+ {
+ int natt = zscan->proj_atts[i];
+ Form_pg_attribute att = &rel->rd_att->attrs[natt];
+ ZSBtreeScan btree_scan;
+ Datum datum;
+ bool isnull;
+ zstid this_tid;
+
+ if (att->attisdropped)
+ continue;
+
+ zsbt_begin_scan(rel, natt + 1, tid, snapshot, &btree_scan);
+
+ if (zsbt_scan_next(&btree_scan, &datum, &isnull, &this_tid))
+ {
+ if (this_tid != tid)
+ found = false;
+ else
+ {
+ slot->tts_values[natt] = datum;
+ slot->tts_isnull[natt] = isnull;
+ }
+ }
+ else
+ found = false;
+
+ zsbt_end_scan(&btree_scan);
+ }
+
+ if (found)
+ {
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+ return true;
+ }
+ else
+ {
+ /*
+ * not found
+ *
+ * TODO: as a sanity check, it would be good to check if we
+ * get *any* of the columns. Currently, if any of the columns
+ * is missing, we treat the tuple as non-existent
+ */
+ return false;
+ }
+}
+
+static void
+zedstoream_index_validate_scan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ Snapshot snapshot,
+ ValidateIndexState *state)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function %s not implemented yet", __func__)));
+}
+
+static double
+zedstoream_index_build_range_scan(Relation baseRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ bool anyvisible,
+ bool progress,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state,
+ TableScanDesc scan)
+{
+ bool checking_uniqueness;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ double reltuples;
+ ExprState *predicate;
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
+ Snapshot snapshot;
+ bool need_unregister_snapshot = false;
+ TransactionId OldestXmin;
+
+ /*
+ * sanity checks
+ */
+ Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+ /* See whether we're verifying uniqueness/exclusion properties */
+ checking_uniqueness = (indexInfo->ii_Unique ||
+ indexInfo->ii_ExclusionOps != NULL);
+
+ /*
+ * "Any visible" mode is not compatible with uniqueness checks; make sure
+ * only one of those is requested.
+ */
+ Assert(!(anyvisible && checking_uniqueness));
+
+ /*
+ * Need an EState for evaluation of index expressions and partial-index
+ * predicates. Also a slot to hold the current tuple.
+ */
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ slot = table_slot_create(baseRelation, NULL);
+
+ /* Arrange for econtext's scan tuple to be the tuple under test */
+ econtext->ecxt_scantuple = slot;
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+
+ /*
+ * Prepare for scan of the base relation. In a normal index build, we use
+ * SnapshotAny because we must retrieve all tuples and do our own time
+ * qual checks (because we have to index RECENTLY_DEAD tuples). In a
+ * concurrent build, or during bootstrap, we take a regular MVCC snapshot
+ * and index whatever's live according to that.
+ */
+ OldestXmin = InvalidTransactionId;
+
+ /* okay to ignore lazy VACUUMs here */
+ if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
+ OldestXmin = GetOldestXmin(baseRelation, PROCARRAY_FLAGS_VACUUM);
+
+ /*
+ * TODO: It would be very good to fetch only the columns we need.
+ */
+ if (!scan)
+ {
+ bool *proj;
+ int attno;
+
+ /*
+ * Serial index build.
+ *
+ * Must begin our own zedstore scan in this case. We may also need to
+ * register a snapshot whose lifetime is under our direct control.
+ */
+ if (!TransactionIdIsValid(OldestXmin))
+ {
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ need_unregister_snapshot = true;
+ }
+ else
+ snapshot = SnapshotAny;
+
+ proj = palloc0(baseRelation->rd_att->natts * sizeof(bool));
+ for (attno = 0; attno < indexInfo->ii_NumIndexKeyAttrs; attno++)
+ {
+ Assert(indexInfo->ii_IndexAttrNumbers[attno] <= baseRelation->rd_att->natts);
+ /* skip expressions */
+ if (indexInfo->ii_IndexAttrNumbers[attno] > 0)
+ proj[indexInfo->ii_IndexAttrNumbers[attno] - 1] = true;
+ }
+
+ GetNeededColumnsForNode((Node *)indexInfo->ii_Expressions, proj,
+ baseRelation->rd_att->natts);
+
+ scan = table_beginscan_with_column_projection(baseRelation, /* relation */
+ snapshot, /* snapshot */
+ 0, /* number of keys */
+ NULL, /* scan key */
+ proj);
+
+ if (start_blockno != 0 || numblocks != InvalidBlockNumber)
+ {
+ ZedStoreDesc zscan = (ZedStoreDesc) scan;
+
+ zscan->cur_range_start = ZSTidFromBlkOff(start_blockno, 1);
+ zscan->cur_range_end = ZSTidFromBlkOff(numblocks, 1);
+
+ for (int i = 0; i < zscan->num_proj_atts; i++)
+ {
+ int natt = zscan->proj_atts[i];
+
+ zsbt_begin_scan(zscan->rs_scan.rs_rd, natt + 1,
+ zscan->cur_range_start,
+ zscan->rs_scan.rs_snapshot,
+ &zscan->btree_scans[i]);
+ }
+ zscan->state = ZSSCAN_STATE_SCANNING;
+ }
+ }
+ else
+ {
+ /*
+ * Parallel index build.
+ *
+ * Parallel case never registers/unregisters own snapshot. Snapshot
+ * is taken from parallel zedstore scan, and is SnapshotAny or an MVCC
+ * snapshot, based on same criteria as serial case.
+ */
+ Assert(!IsBootstrapProcessingMode());
+ Assert(allow_sync);
+ Assert(start_blockno == 0);
+ Assert(numblocks == InvalidBlockNumber);
+ snapshot = scan->rs_snapshot;
+ }
+
+ /*
+ * Must call GetOldestXmin() with SnapshotAny. Should never call
+ * GetOldestXmin() with MVCC snapshot. (It's especially worth checking
+ * this for parallel builds, since ambuild routines that support parallel
+ * builds must work these details out for themselves.)
+ */
+ Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
+ Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
+ !TransactionIdIsValid(OldestXmin));
+ Assert(snapshot == SnapshotAny || !anyvisible);
+
+ reltuples = 0;
+
+ /*
+ * Scan all tuples in the base relation.
+ */
+ while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
+ {
+ bool tupleIsAlive;
+ HeapTuple heapTuple;
+
+ if (numblocks != InvalidBlockNumber &&
+ ItemPointerGetBlockNumber(&slot->tts_tid) >= numblocks)
+ break;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* table_scan_getnextslot did the visibility check */
+ tupleIsAlive = true;
+ reltuples += 1;
+
+ /*
+ * TODO: Once we have in-place updates, like HOT, this will need
+ * to work harder, to figure out which tuple version to index.
+ */
+
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
+
+ /*
+ * In a partial index, discard tuples that don't satisfy the
+ * predicate.
+ */
+ if (predicate != NULL)
+ {
+ if (!ExecQual(predicate, econtext))
+ continue;
+ }
+
+ /*
+ * For the current heap tuple, extract all the attributes we use in
+ * this index, and note which are null. This also performs evaluation
+ * of any expressions needed.
+ */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ /* Call the AM's callback routine to process the tuple */
+ heapTuple = ExecCopySlotHeapTuple(slot);
+ heapTuple->t_self = slot->tts_tid;
+ callback(indexRelation, heapTuple, values, isnull, tupleIsAlive,
+ callback_state);
+ pfree(heapTuple);
+ }
+
+ table_endscan(scan);
+
+ /* we can now forget our snapshot, if set and registered by us */
+ if (need_unregister_snapshot)
+ UnregisterSnapshot(snapshot);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ FreeExecutorState(estate);
+
+ /* These may have been pointing to the now-gone estate */
+ indexInfo->ii_ExpressionsState = NIL;
+ indexInfo->ii_PredicateState = NULL;
+
+ return reltuples;
+}
+
+static void
+zedstoream_finish_bulk_insert(Relation relation, int options)
+{
+ /*
+ * If we skipped writing WAL, then we need to sync the zedstore (but not
+ * indexes since those use WAL anyway / don't go through tableam)
+ */
+ if (options & HEAP_INSERT_SKIP_WAL)
+ heap_sync(relation);
+}
+
+/* ------------------------------------------------------------------------
+ * DDL related callbacks for zedstore AM.
+ * ------------------------------------------------------------------------
+ */
+
+static void
+zedstoream_relation_set_new_filenode(Relation rel, char persistence,
+ TransactionId *freezeXid,
+ MultiXactId *minmulti)
+{
+ /*
+ * Initialize to the minimum XID that could put tuples in the table. We
+ * know that no xacts older than RecentXmin are still running, so that
+ * will do.
+ */
+ *freezeXid = RecentXmin;
+
+ /*
+ * Similarly, initialize the minimum Multixact to the first value that
+ * could possibly be stored in tuples in the table. Running transactions
+ * could reuse values from their local cache, so we are careful to
+ * consider all currently running multis.
+ *
+ * XXX this could be refined further, but is it worth the hassle?
+ */
+ *minmulti = GetOldestMultiXactId();
+
+ RelationCreateStorage(rel->rd_node, persistence);
+
+ /*
+ * If required, set up an init fork for an unlogged table so that it can
+ * be correctly reinitialized on restart. An immediate sync is required
+ * even if the page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved the
+ * redo pointer past our xlog record. Recovery may as well remove it
+ * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+ * record. Therefore, logging is necessary even if wal_level=minimal.
+ */
+ if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+ rel->rd_rel->relkind == RELKIND_MATVIEW ||
+ rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+ RelationOpenSmgr(rel);
+ smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);
+ log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM);
+ smgrimmedsync(rel->rd_smgr, INIT_FORKNUM);
+ }
+}
+
+static void
+zedstoream_relation_nontransactional_truncate(Relation rel)
+{
+ RelationTruncate(rel, 0);
+}
+
+static void
+zedstoream_relation_copy_data(Relation rel, RelFileNode newrnode)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("function %s not implemented yet", __func__)));
+}
+
+static void
+zedstoream_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
+ Relation OldIndex, bool use_sort,
+ TransactionId OldestXmin,
+ TransactionId FreezeXid,
+ MultiXactId MultiXactCutoff,
+ double *num_tuples,
+ double *tups_vacuumed,
+ double *tups_recently_dead)
+{
+}
+
+static bool
+zedstoream_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
+ BufferAccessStrategy bstrategy)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("function %s not implemented yet", __func__)));
+ return false;
+}
+
+static bool
+zedstoream_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
+ double *liverows, double *deadrows,
+ TupleTableSlot *slot)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("function %s not implemented yet", __func__)));
+ return false;
+}
+
+/* ------------------------------------------------------------------------
+ * Planner related callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+/*
+ * currently this is exact duplicate of heapam_estimate_rel_size().
+ * TODO fix to tune it based on zedstore storage.
+ */
+static void
+zedstoream_relation_estimate_size(Relation rel, int32 *attr_widths,
+ BlockNumber *pages, double *tuples,
+ double *allvisfrac)
+{
+ BlockNumber curpages;
+ BlockNumber relpages;
+ double reltuples;
+ BlockNumber relallvisible;
+ double density;
+
+ /* it has storage, ok to call the smgr */
+ curpages = RelationGetNumberOfBlocks(rel);
+
+ /* coerce values in pg_class to more desirable types */
+ relpages = (BlockNumber) rel->rd_rel->relpages;
+ reltuples = (double) rel->rd_rel->reltuples;
+ relallvisible = (BlockNumber) rel->rd_rel->relallvisible;
+
+ /*
+ * HACK: if the relation has never yet been vacuumed, use a minimum size
+ * estimate of 10 pages. The idea here is to avoid assuming a
+ * newly-created table is really small, even if it currently is, because
+ * that may not be true once some data gets loaded into it. Once a vacuum
+ * or analyze cycle has been done on it, it's more reasonable to believe
+ * the size is somewhat stable.
+ *
+ * (Note that this is only an issue if the plan gets cached and used again
+ * after the table has been filled. What we're trying to avoid is using a
+ * nestloop-type plan on a table that has grown substantially since the
+ * plan was made. Normally, autovacuum/autoanalyze will occur once enough
+ * inserts have happened and cause cached-plan invalidation; but that
+ * doesn't happen instantaneously, and it won't happen at all for cases
+ * such as temporary tables.)
+ *
+ * We approximate "never vacuumed" by "has relpages = 0", which means this
+ * will also fire on genuinely empty relations. Not great, but
+ * fortunately that's a seldom-seen case in the real world, and it
+ * shouldn't degrade the quality of the plan too much anyway to err in
+ * this direction.
+ *
+ * If the table has inheritance children, we don't apply this heuristic.
+ * Totally empty parent tables are quite common, so we should be willing
+ * to believe that they are empty.
+ */
+ if (curpages < 10 &&
+ relpages == 0 &&
+ !rel->rd_rel->relhassubclass)
+ curpages = 10;
+
+ /* report estimated # pages */
+ *pages = curpages;
+ /* quick exit if rel is clearly empty */
+ if (curpages == 0)
+ {
+ *tuples = 0;
+ *allvisfrac = 0;
+ return;
+ }
+
+ /* estimate number of tuples from previous tuple density */
+ if (relpages > 0)
+ density = reltuples / (double) relpages;
+ else
+ {
+ /*
+ * When we have no data because the relation was truncated, estimate
+ * tuple width from attribute datatypes. We assume here that the
+ * pages are completely full, which is OK for tables (since they've
+ * presumably not been VACUUMed yet) but is probably an overestimate
+ * for indexes. Fortunately get_relation_info() can clamp the
+ * overestimate to the parent table's size.
+ *
+ * Note: this code intentionally disregards alignment considerations,
+ * because (a) that would be gilding the lily considering how crude
+ * the estimate is, and (b) it creates platform dependencies in the
+ * default plans which are kind of a headache for regression testing.
+ */
+ int32 tuple_width;
+
+ tuple_width = get_rel_data_width(rel, attr_widths);
+ tuple_width += MAXALIGN(SizeofHeapTupleHeader);
+ tuple_width += sizeof(ItemIdData);
+ /* note: integer division is intentional here */
+ density = (BLCKSZ - SizeOfPageHeaderData) / tuple_width;
+ }
+ *tuples = rint(density * (double) curpages);
+
+ /*
+ * We use relallvisible as-is, rather than scaling it up like we do for
+ * the pages and tuples counts, on the theory that any pages added since
+ * the last VACUUM are most likely not marked all-visible. But costsize.c
+ * wants it converted to a fraction.
+ */
+ if (relallvisible == 0 || curpages <= 0)
+ *allvisfrac = 0;
+ else if ((double) relallvisible >= curpages)
+ *allvisfrac = 1;
+ else
+ *allvisfrac = (double) relallvisible / curpages;
+}
+
+/* ------------------------------------------------------------------------
+ * Executor related callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+static bool
+zedstoream_scan_bitmap_next_block(TableScanDesc sscan,
+ TBMIterateResult *tbmres)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ BlockNumber tid_blkno = tbmres->blockno;
+ int ntuples;
+ int first_ntuples = 0;
+ bool firstcol;
+
+ /*
+ * Our strategy for a bitmap scan is to scan the tree of each attribute,
+	 * starting at the given logical block number, and store all the datums
+ * in the scan struct. zedstoream_scan_bitmap_next_tuple() then just
+ * needs to store the datums of the next TID in the slot.
+ *
+ * An alternative would be to keep the scans of each attribute open,
+ * like in a sequential scan. I'm not sure which is better.
+ */
+ firstcol = true;
+ for (int i = 0; i < scan->num_proj_atts; i++)
+ {
+ int natt = scan->proj_atts[i];
+ ZSBtreeScan btree_scan;
+ Datum datum;
+ bool isnull;
+ zstid tid;
+ Datum *datums = scan->bmscan_datums[natt];
+ bool *isnulls = scan->bmscan_isnulls[natt];
+ int noff = 0;
+
+ zsbt_begin_scan(scan->rs_scan.rs_rd, natt + 1,
+ ZSTidFromBlkOff(tid_blkno, 1),
+ scan->rs_scan.rs_snapshot,
+ &btree_scan);
+
+ /*
+ * TODO: it would be good to pass the next expected TID down to zsbt_scan_next,
+ * so that it could skip over to it more efficiently.
+ */
+ ntuples = 0;
+ while (zsbt_scan_next(&btree_scan, &datum, &isnull, &tid))
+ {
+ if (ZSTidGetBlockNumber(tid) != tid_blkno)
+ {
+ Assert(ZSTidGetBlockNumber(tid) > tid_blkno);
+ break;
+ }
+
+ if (tbmres->ntuples != -1)
+ {
+				while (noff < tbmres->ntuples && ZSTidGetOffsetNumber(tid) > tbmres->offsets[noff])
+ noff++;
+
+ if (noff == tbmres->ntuples)
+ break;
+
+ if (ZSTidGetOffsetNumber(tid) < tbmres->offsets[noff])
+ continue;
+ }
+
+ datums[ntuples] = datum;
+ isnulls[ntuples] = isnull;
+ if (firstcol)
+ scan->bmscan_tids[ntuples] = tid;
+ else if (tid != scan->bmscan_tids[ntuples])
+ elog(ERROR, "scans on different attributes out of sync");
+
+ ntuples++;
+ }
+ if (firstcol)
+ first_ntuples = ntuples;
+ else if (ntuples != first_ntuples)
+ elog(ERROR, "scans on different attributes out of sync");
+
+ zsbt_end_scan(&btree_scan);
+
+ firstcol = false;
+ }
+
+ scan->bmscan_nexttuple = 0;
+ scan->bmscan_ntuples = first_ntuples;
+
+ return first_ntuples > 0;
+}
+
+static bool
+zedstoream_scan_bitmap_next_tuple(TableScanDesc sscan,
+ TBMIterateResult *tbmres,
+ TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ zstid tid;
+
+ if (scan->bmscan_nexttuple >= scan->bmscan_ntuples)
+ return false;
+
+ tid = scan->bmscan_tids[scan->bmscan_nexttuple];
+ for (int i = 0; i < scan->num_proj_atts; i++)
+ {
+		int			natt = scan->proj_atts[i];
+		Form_pg_attribute att = &scan->rs_scan.rs_rd->rd_att->attrs[natt];
+		Datum		datum;
+		bool		isnull;
+
+		datum = (scan->bmscan_datums[natt])[scan->bmscan_nexttuple];
+		isnull = (scan->bmscan_isnulls[natt])[scan->bmscan_nexttuple];
+
+ /*
+		 * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ if (!isnull && att->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(scan->rs_scan.rs_rd, natt + 1, tid, datum);
+ }
+
+ slot->tts_values[natt] = datum;
+ slot->tts_isnull[natt] = isnull;
+ }
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ scan->bmscan_nexttuple++;
+
+ return true;
+}
+
+static bool
+zedstoream_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("function %s not implemented yet", __func__)));
+ return false;
+}
+
+static bool
+zedstoream_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
+ TupleTableSlot *slot)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("function %s not implemented yet", __func__)));
+ return false;
+}
+
+static void
+zedstoream_vacuum_rel(Relation onerel, VacuumParams *params,
+ BufferAccessStrategy bstrategy)
+{
+ zsundo_vacuum(onerel, params, bstrategy,
+ GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM));
+}
+
+static const TableAmRoutine zedstoream_methods = {
+ .type = T_TableAmRoutine,
+ .scans_leverage_column_projection = true,
+
+ .slot_callbacks = zedstoream_slot_callbacks,
+
+ .scan_begin = zedstoream_beginscan,
+ .scan_begin_with_column_projection = zedstoream_beginscan_with_column_projection,
+ .scan_end = zedstoream_endscan,
+ .scan_rescan = heap_rescan,
+ .scan_getnextslot = zedstoream_getnextslot,
+
+ .parallelscan_estimate = zs_parallelscan_estimate,
+ .parallelscan_initialize = zs_parallelscan_initialize,
+ .parallelscan_reinitialize = zs_parallelscan_reinitialize,
+
+ .index_fetch_begin = zedstoream_begin_index_fetch,
+ .index_fetch_reset = zedstoream_reset_index_fetch,
+ .index_fetch_end = zedstoream_end_index_fetch,
+ .index_fetch_set_column_projection = zedstoream_fetch_set_column_projection,
+ .index_fetch_tuple = zedstoream_index_fetch_tuple,
+
+ .tuple_insert = zedstoream_insert,
+ .tuple_insert_speculative = zedstoream_insert_speculative,
+ .tuple_complete_speculative = zedstoream_complete_speculative,
+ .multi_insert = zedstoream_multi_insert,
+ .tuple_delete = zedstoream_delete,
+ .tuple_update = zedstoream_update,
+ .tuple_lock = zedstoream_lock_tuple,
+ .finish_bulk_insert = zedstoream_finish_bulk_insert,
+
+ .tuple_fetch_row_version = zedstoream_fetch_row_version,
+ .tuple_get_latest_tid = zedstoream_get_latest_tid,
+ .tuple_satisfies_snapshot = zedstoream_tuple_satisfies_snapshot,
+ .compute_xid_horizon_for_tuples = zedstoream_compute_xid_horizon_for_tuples,
+
+ .relation_set_new_filenode = zedstoream_relation_set_new_filenode,
+ .relation_nontransactional_truncate = zedstoream_relation_nontransactional_truncate,
+ .relation_copy_data = zedstoream_relation_copy_data,
+ .relation_copy_for_cluster = zedstoream_relation_copy_for_cluster,
+ .relation_vacuum = zedstoream_vacuum_rel,
+ .scan_analyze_next_block = zedstoream_scan_analyze_next_block,
+ .scan_analyze_next_tuple = zedstoream_scan_analyze_next_tuple,
+
+ .index_build_range_scan = zedstoream_index_build_range_scan,
+ .index_validate_scan = zedstoream_index_validate_scan,
+
+ .relation_estimate_size = zedstoream_relation_estimate_size,
+
+ .scan_bitmap_next_block = zedstoream_scan_bitmap_next_block,
+ .scan_bitmap_next_tuple = zedstoream_scan_bitmap_next_tuple,
+ .scan_sample_next_block = zedstoream_scan_sample_next_block,
+ .scan_sample_next_tuple = zedstoream_scan_sample_next_tuple
+};
+
+Datum
+zedstore_tableam_handler(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_POINTER(&zedstoream_methods);
+}
+
+
+/*
+ * Routines for dividing up the TID range for parallel seq scans
+ */
+
+typedef struct ParallelZSScanDescData
+{
+ ParallelTableScanDescData base;
+
+ zstid pzs_endtid; /* last tid + 1 in relation at start of scan */
+ pg_atomic_uint64 pzs_allocatedtid_blk; /* TID space allocated to workers so far. (in 65536 increments) */
+} ParallelZSScanDescData;
+typedef struct ParallelZSScanDescData *ParallelZSScanDesc;
+
+static Size
+zs_parallelscan_estimate(Relation rel)
+{
+ return sizeof(ParallelZSScanDescData);
+}
+
+static Size
+zs_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
+{
+ ParallelZSScanDesc zpscan = (ParallelZSScanDesc) pscan;
+
+ zpscan->base.phs_relid = RelationGetRelid(rel);
+ /* FIXME: if attribute 1 is dropped, should use another attribute */
+ zpscan->pzs_endtid = zsbt_get_last_tid(rel, 1);
+ pg_atomic_init_u64(&zpscan->pzs_allocatedtid_blk, 0);
+
+ return sizeof(ParallelZSScanDescData);
+}
+
+static void
+zs_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
+{
+ ParallelZSScanDesc bpscan = (ParallelZSScanDesc) pscan;
+
+ pg_atomic_write_u64(&bpscan->pzs_allocatedtid_blk, 0);
+}
+
+/*
+ * get the next TID range to scan
+ *
+ * Returns true if there is more to scan, false otherwise.
+ *
+ * Get the next TID range to scan. Even if there are no TIDs left to scan,
+ * another backend could have grabbed a range to scan and not yet finished
+ * looking at it, so it doesn't follow that the scan is done when the first
+ * backend gets 'false' return.
+ */
+static bool
+zs_parallelscan_nextrange(Relation rel, ParallelZSScanDesc pzscan,
+ zstid *start, zstid *end)
+{
+ uint64 allocatedtid_blk;
+
+ /*
+	 * pzs_allocatedtid_blk tracks how much of the TID space has been handed
+	 * out to workers already, in units of 2^16 TIDs. When the allocated
+	 * range reaches pzs_endtid, all TIDs have been allocated.
+	 *
+	 * Because we use an atomic fetch-and-add to fetch the current value, the
+	 * pzs_allocatedtid_blk counter will keep incrementing past the end of
+	 * the TID space, as workers keep trying to allocate the next range even
+	 * after all ranges have been handed out. The counter must be 64 bits
+	 * wide because of that, to avoid wrapping around when the block counter
+	 * is close to 2^32. That's also one reason we do this at a granularity
+	 * of 2^16 TIDs, even though zedstore isn't block-oriented.
+	 *
+	 * TODO: we divide the TID space into chunks of 2^16 TIDs each. That's
+	 * pretty inefficient; there's a fair amount of overhead in re-starting
+	 * the B-tree scans between each range. We probably should use much larger
+ * ranges. But this is good for testing.
+ */
+ allocatedtid_blk = pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtid_blk, 1);
+ *start = ZSTidFromBlkOff(allocatedtid_blk, 1);
+ *end = ZSTidFromBlkOff(allocatedtid_blk + 1, 1);
+
+ return *start < pzscan->pzs_endtid;
+}
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 881131aff2..cb212e49ce 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -20,6 +20,7 @@
#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
#include "utils/memutils.h"
@@ -300,3 +301,77 @@ ExecScanReScan(ScanState *node)
}
}
}
+
+typedef struct neededColumnContext
+{
+ bool *mask;
+ int n;
+} neededColumnContext;
+
+static bool
+neededColumnContextWalker(Node *node, neededColumnContext *c)
+{
+ if (node == NULL)
+ return false;
+
+ if (IsA(node, Var))
+ {
+ Var *var = (Var *)node;
+
+ if (var->varattno > 0)
+ {
+ Assert(var->varattno <= c->n);
+ c->mask[var->varattno - 1] = true;
+ }
+ /*
+ * If all attributes are included,
+ * set all entries in mask to true.
+ */
+ else if (var->varattno == 0)
+ memset(c->mask, true, c->n);
+
+ return false;
+ }
+ return expression_tree_walker(node, neededColumnContextWalker, (void * )c);
+}
+
+/*
+ * n specifies the number of allowed entries in mask: we use
+ * it for bounds-checking in the walker above.
+ */
+void
+GetNeededColumnsForNode(Node *expr, bool *mask, int n)
+{
+ neededColumnContext c;
+
+ c.mask = mask;
+ c.n = n;
+
+ neededColumnContextWalker(expr, &c);
+}
+
+bool *
+GetNeededColumnsForScan(ScanState *scanstate, int ncol)
+{
+ bool *proj;
+ int i;
+
+ proj = palloc0(ncol * sizeof(bool));
+ GetNeededColumnsForNode((Node*) scanstate->ps.plan->targetlist, proj, ncol);
+ GetNeededColumnsForNode((Node*) scanstate->ps.plan->qual, proj, ncol);
+
+ for (i = 0; i < ncol; i++)
+ {
+ if (proj[i])
+ break;
+ }
+
+ /*
+ * In some cases (for example, count(*)), no columns are specified.
+ * We always scan the first column.
+ */
+ if (i == ncol)
+ proj[0] = true;
+
+ return proj;
+}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 7711728495..5833d683b3 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -166,10 +166,10 @@ IndexOnlyNext(IndexOnlyScanState *node)
* Rats, we have to visit the heap to check visibility.
*/
InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, slot))
+ if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
continue; /* no visible tuple, try next index entry */
- ExecClearTuple(slot);
+ ExecClearTuple(node->ioss_TableSlot);
/*
* Only MVCC snapshots are supported here, so there should be no
@@ -528,7 +528,17 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
*/
tupDesc = ExecTypeFromTL(node->indextlist);
ExecInitScanTupleSlot(estate, &indexstate->ss, tupDesc,
- table_slot_callbacks(currentRelation));
+ &TTSOpsVirtual);
+
+ /*
+ * We need another slot, in a format that's suitable for the table AM,
+ * for when we need to fetch a tuple from the table for rechecking
+ * visibility.
+ */
+ indexstate->ioss_TableSlot =
+ ExecAllocTableSlot(&estate->es_tupleTable,
+ RelationGetDescr(currentRelation),
+ table_slot_callbacks(currentRelation));
/*
* Initialize result type and projection info. The node's targetlist will
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 399ac0109c..82903bc708 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -115,6 +115,13 @@ IndexNext(IndexScanState *node)
node->iss_NumScanKeys,
node->iss_NumOrderByKeys);
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ bool *proj;
+ proj = GetNeededColumnsForScan(&node->ss, node->ss.ss_currentRelation->rd_att->natts);
+ table_index_fetch_set_column_projection(scandesc->xs_heapfetch, proj);
+ }
+
node->iss_ScanDesc = scandesc;
/*
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 8bd7430a91..7f5a2c107c 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -31,6 +31,7 @@
#include "access/tableam.h"
#include "executor/execdebug.h"
#include "executor/nodeSeqscan.h"
+#include "nodes/nodeFuncs.h"
#include "utils/rel.h"
static TupleTableSlot *SeqNext(SeqScanState *node);
@@ -68,9 +69,20 @@ SeqNext(SeqScanState *node)
* We reach here if the scan is not parallel, or if we're serially
* executing a scan that was planned to be parallel.
*/
- scandesc = table_beginscan(node->ss.ss_currentRelation,
- estate->es_snapshot,
- 0, NULL);
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ bool *proj;
+ proj = GetNeededColumnsForScan(&node->ss, node->ss.ss_currentRelation->rd_att->natts);
+ scandesc = table_beginscan_with_column_projection(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL, proj);
+ }
+ else
+ {
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ }
node->ss.ss_currentScanDesc = scandesc;
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a647e7db32..2929165e2a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -138,6 +138,7 @@ typedef struct TableAmRoutine
{
/* this must be set to T_TableAmRoutine */
NodeTag type;
+ bool scans_leverage_column_projection;
/* ------------------------------------------------------------------------
@@ -187,6 +188,18 @@ typedef struct TableAmRoutine
bool is_samplescan,
bool temp_snap);
+ TableScanDesc (*scan_begin_with_column_projection)(Relation relation,
+ Snapshot snapshot,
+ int nkeys, struct ScanKeyData *key,
+ ParallelTableScanDesc parallel_scan,
+ bool *project_column,
+ bool allow_strat,
+ bool allow_sync,
+ bool allow_pagemode,
+ bool is_bitmapscan,
+ bool is_samplescan,
+ bool temp_snap);
+
/*
* Release resources and deallocate scan. If TableScanDesc.temp_snap,
* TableScanDesc.rs_snapshot needs to be unregistered.
@@ -262,6 +275,13 @@ typedef struct TableAmRoutine
*/
void (*index_fetch_end) (struct IndexFetchTableData *data);
+ /*
+ * Set column projections for AM which leverage column projections for
+ * scanning.
+ */
+ void (*index_fetch_set_column_projection) (struct IndexFetchTableData *data,
+ bool *project_column);
+
/*
* Fetch tuple at `tid` into `slot`, after doing a visibility test
* according to `snapshot`. If a tuple was found and passed the visibility
@@ -683,6 +703,12 @@ table_beginscan(Relation rel, Snapshot snapshot,
true, true, true, false, false, false);
}
+static inline bool
+table_scans_leverage_column_projection(Relation relation)
+{
+ return relation->rd_tableam->scans_leverage_column_projection;
+}
+
/*
* Like table_beginscan(), but for scanning catalog. It'll automatically use a
* snapshot appropriate for scanning catalog relations.
@@ -707,6 +733,17 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
false, false, false);
}
+static inline TableScanDesc
+table_beginscan_with_column_projection(Relation relation, Snapshot snapshot,
+ int nkeys, struct ScanKeyData *key,
+ bool *project_column)
+{
+ Assert(relation->rd_tableam->scans_leverage_column_projection);
+
+ return relation->rd_tableam->scan_begin_with_column_projection(
+ relation, snapshot, nkeys, key, NULL, project_column,
+ true, true, true, false, false, false);
+}
/*
* table_beginscan_bm is an alternative entry point for setting up a
@@ -886,6 +923,13 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
scan->rel->rd_tableam->index_fetch_end(scan);
}
+static inline void
+table_index_fetch_set_column_projection(struct IndexFetchTableData *scan,
+ bool *project_column)
+{
+ scan->rel->rd_tableam->index_fetch_set_column_projection(scan, project_column);
+}
+
/*
* Fetches, as part of an index scan, tuple at `tid` into `slot`, after doing
* a visibility test according to `snapshot`. If a tuple was found and passed
diff --git a/src/include/access/zedstore_compression.h b/src/include/access/zedstore_compression.h
new file mode 100644
index 0000000000..9958c02eac
--- /dev/null
+++ b/src/include/access/zedstore_compression.h
@@ -0,0 +1,51 @@
+/*
+ * zedstore_compression.h
+ * internal declarations for ZedStore compression
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_compression.h
+ */
+#ifndef ZEDSTORE_COMPRESSION_H
+#define ZEDSTORE_COMPRESSION_H
+
+#include "storage/itemptr.h"
+
+typedef struct ZSDecompressContext
+{
+ char *buffer;
+ int bufsize; /* allocated size of 'buffer' */
+ int uncompressedsize;
+ int bytesread;
+} ZSDecompressContext;
+
+typedef struct ZSCompressContext
+{
+ char *uncompressedbuffer;
+
+ int maxCompressedSize;
+ int maxUncompressedSize;
+ char *buffer;
+ int nitems;
+ int rawsize;
+} ZSCompressContext;
+
+typedef struct ZSBtreeItem ZSBtreeItem;
+typedef struct ZSCompressedBtreeItem ZSCompressedBtreeItem;
+typedef struct ZSUncompressedBtreeItem ZSUncompressedBtreeItem;
+
+/* compression functions */
+extern void zs_compress_init(ZSCompressContext *context);
+extern void zs_compress_begin(ZSCompressContext *context, int maxCompressedSize);
+extern bool zs_compress_add(ZSCompressContext *context, ZSUncompressedBtreeItem *item);
+extern ZSCompressedBtreeItem *zs_compress_finish(ZSCompressContext *context);
+extern void zs_compress_free(ZSCompressContext *context);
+
+/* decompression functions */
+extern void zs_decompress_init(ZSDecompressContext *context);
+extern void zs_decompress_chunk(ZSDecompressContext *context, ZSCompressedBtreeItem *chunk);
+extern ZSUncompressedBtreeItem *zs_decompress_read_item(ZSDecompressContext *context);
+extern void zs_decompress_free(ZSDecompressContext *context);
+
+#endif /* ZEDSTORE_COMPRESSION_H */
diff --git a/src/include/access/zedstore_internal.h b/src/include/access/zedstore_internal.h
new file mode 100644
index 0000000000..ab09b4a09b
--- /dev/null
+++ b/src/include/access/zedstore_internal.h
@@ -0,0 +1,409 @@
+/*
+ * zedstore_internal.h
+ * internal declarations for ZedStore tables
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_internal.h
+ */
+#ifndef ZEDSTORE_INTERNAL_H
+#define ZEDSTORE_INTERNAL_H
+
+#include "access/tableam.h"
+#include "access/zedstore_compression.h"
+#include "access/zedstore_undo.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+
+/*
+ * Throughout ZedStore, we pass around TIDs as uint64's, rather than ItemPointers,
+ * for speed.
+ */
+typedef uint64 zstid;
+
+#define InvalidZSTid 0
+#define MinZSTid 1 /* blk 0, off 1 */
+#define MaxZSTid ((uint64) MaxBlockNumber << 16 | 0xffff)
+/* note: if this is converted to ItemPointer, it is invalid */
+#define MaxPlusOneZSTid (MaxZSTid + 1)
+
+static inline zstid
+ZSTidFromItemPointer(ItemPointerData iptr)
+{
+ Assert(ItemPointerIsValid(&iptr));
+ return ((uint64) iptr.ip_blkid.bi_hi << 32 |
+ (uint64) iptr.ip_blkid.bi_lo << 16 |
+ (uint64) iptr.ip_posid);
+}
+
+static inline zstid
+ZSTidFromBlkOff(BlockNumber blk, OffsetNumber off)
+{
+ Assert(off != 0);
+ return ((uint64) blk << 16 | off);
+}
+
+static inline ItemPointerData
+ItemPointerFromZSTid(zstid tid)
+{
+ ItemPointerData iptr;
+
+ iptr.ip_blkid.bi_hi = (tid >> 32) & 0xffff;
+ iptr.ip_blkid.bi_lo = (tid >> 16) & 0xffff;
+ iptr.ip_posid = (tid) & 0xffff;
+ Assert(ItemPointerIsValid(&iptr));
+ return iptr;
+}
+
+static inline BlockNumber
+ZSTidGetBlockNumber(zstid tid)
+{
+ return (BlockNumber) (tid >> 16);
+}
+
+static inline OffsetNumber
+ZSTidGetOffsetNumber(zstid tid)
+{
+ return (OffsetNumber) tid;
+}
+
+/*
+ * Helper function to "increment" a TID by one.
+ *
+ * Skips over values that would be invalid ItemPointers.
+ */
+static inline zstid
+ZSTidIncrement(zstid tid)
+{
+ tid++;
+ if ((tid & 0xffff) == 0)
+ tid++;
+ return tid;
+}
+
+static inline zstid
+ZSTidIncrementForInsert(zstid tid)
+{
+ tid++;
+ if (ZSTidGetOffsetNumber(tid) >= MaxHeapTuplesPerPage)
+ tid = ZSTidFromBlkOff(ZSTidGetBlockNumber(tid) + 1, 1);
+ return tid;
+}
+
+/*
+ * A ZedStore table contains different kinds of pages, all in the same file.
+ *
+ * Block 0 is always a metapage. It contains the block numbers of the other
+ * data structures stored within the file, like the per-attribute B-trees,
+ * and the UNDO log. In addition, if there are overly large datums in the
+ * table, they are chopped into separate "toast" pages.
+ */
+#define ZS_META_PAGE_ID 0xF083
+#define ZS_BTREE_PAGE_ID 0xF084
+#define ZS_UNDO_PAGE_ID 0xF085
+#define ZS_TOAST_PAGE_ID 0xF086
+
+/* like nbtree/gist FOLLOW_RIGHT flag, used to detect concurrent page splits */
+#define ZS_FOLLOW_RIGHT 0x0002
+
+typedef struct ZSBtreePageOpaque
+{
+ AttrNumber zs_attno;
+ BlockNumber zs_next;
+ zstid zs_lokey; /* inclusive */
+ zstid zs_hikey; /* exclusive */
+ uint16 zs_level; /* 0 = leaf */
+ uint16 zs_flags;
+ uint16 padding; /* padding, to put zs_page_id last */
+ uint16 zs_page_id; /* always ZS_BTREE_PAGE_ID */
+} ZSBtreePageOpaque;
+
+#define ZSBtreePageGetOpaque(page) ((ZSBtreePageOpaque *) PageGetSpecialPointer(page))
+
+/*
+ * Internal B-tree page layout.
+ *
+ * The "contents" of the page is an array of ZSBtreeInternalPageItem. The number
+ * of items can be deduced from pd_lower.
+ */
+typedef struct ZSBtreeInternalPageItem
+{
+ zstid tid;
+ BlockNumber childblk;
+} ZSBtreeInternalPageItem;
+
+static inline ZSBtreeInternalPageItem *
+ZSBtreeInternalPageGetItems(Page page)
+{
+ ZSBtreeInternalPageItem *items;
+
+ items = (ZSBtreeInternalPageItem *) PageGetContents(page);
+
+ return items;
+}
+static inline int
+ZSBtreeInternalPageGetNumItems(Page page)
+{
+ ZSBtreeInternalPageItem *begin;
+ ZSBtreeInternalPageItem *end;
+
+ begin = (ZSBtreeInternalPageItem *) PageGetContents(page);
+ end = (ZSBtreeInternalPageItem *) ((char *) page + ((PageHeader) page)->pd_lower);
+
+ return end - begin;
+}
+
+static inline bool
+ZSBtreeInternalPageIsFull(Page page)
+{
+ PageHeader phdr = (PageHeader) page;
+
+ return phdr->pd_upper - phdr->pd_lower < sizeof(ZSBtreeInternalPageItem);
+}
+
+/*
+ * Leaf B-tree page layout
+ *
+ * Leaf pages are packed with ZSBtreeItems. There are three kinds of items:
+ *
+ * 1. plain item, holds one tuple (or rather, one datum).
+ *
+ * 2. A "container item", which holds multiple plain items, compressed.
+ *
+ * 3. A "dead item". A dead item prevents the TID from being reused. It's
+ * used during VACUUM, to mark items for which there are no index pointers
+ * anymore. But it cannot be removed until the undo record has been
+ * trimmed away, because if the TID was reused for a new record, vacuum
+ * might remove the new tuple version instead. After t_undo_ptr becomes
+ * older than "oldest undo ptr", the item can be removed and the TID
+ * recycled.
+ *
+ * TODO: squeeze harder: eliminate padding, use high bits of t_tid for flags or size
+ */
+typedef struct ZSBtreeItem
+{
+ zstid t_tid;
+ uint16 t_size;
+ uint16 t_flags;
+} ZSBtreeItem;
+
+typedef struct ZSUncompressedBtreeItem
+{
+ /* these fields must match ZSBtreeItem */
+ zstid t_tid;
+ uint16 t_size;
+ uint16 t_flags;
+
+ ZSUndoRecPtr t_undo_ptr;
+
+ char t_payload[FLEXIBLE_ARRAY_MEMBER];
+} ZSUncompressedBtreeItem;
+
+typedef struct ZSCompressedBtreeItem
+{
+ /* these fields must match ZSBtreeItem */
+ zstid t_tid;
+ uint16 t_size;
+ uint16 t_flags;
+
+ uint16 t_uncompressedsize;
+ zstid t_lasttid; /* inclusive */
+
+ char t_payload[FLEXIBLE_ARRAY_MEMBER];
+} ZSCompressedBtreeItem;
+
+
+#define ZSBT_COMPRESSED 0x0001
+#define ZSBT_DELETED 0x0002
+#define ZSBT_UPDATED 0x0004
+#define ZSBT_NULL 0x0008
+#define ZSBT_DEAD 0x0010
+
+/*
+ * Toast page layout.
+ *
+ * When an overly large datum is stored, it is divided into chunks, and each
+ * chunk is stored on a dedicated toast page. The toast pages of a datum form
+ * a list; each page has a next/prev pointer.
+ */
+/*
+ * Maximum size of an individual untoasted Datum stored in ZedStore. Datums
+ * larger than this need to be toasted.
+ *
+ * A datum needs to fit on a B-tree page, with page and item headers.
+ *
+ * XXX: 500 accounts for all the headers. Need to compute this correctly...
+ */
+#define MaxZedStoreDatumSize (BLCKSZ - 500)
+
+typedef struct ZSToastPageOpaque
+{
+ AttrNumber zs_attno;
+
+ /* these are only set on the first page. */
+ zstid zs_tid;
+ uint32 zs_total_size;
+
+ uint32 zs_slice_offset;
+ BlockNumber zs_prev;
+ BlockNumber zs_next;
+ uint16 zs_flags;
+ uint16 padding1; /* padding, to put zs_page_id last */
+ uint16 padding2; /* padding, to put zs_page_id last */
+ uint16 zs_page_id;
+} ZSToastPageOpaque;
+
+/*
+ * "Toast pointer" of a datum that's stored in zedstore toast pages.
+ *
+ * This looks somewhat like a normal TOAST pointer, but we mustn't let these
+ * escape out of zedstore code, because the rest of the system doesn't know
+ * how to deal with them.
+ *
+ * This must look like varattrib_1b_e!
+ */
+typedef struct varatt_zs_toastptr
+{
+ /* varattrib_1b_e */
+ uint8 va_header;
+ uint8 va_tag; /* VARTAG_ZEDSTORE in zedstore toast datums */
+
+ /* first block */
+ BlockNumber zst_block;
+} varatt_zs_toastptr;
+
+/*
+ * va_tag value. this should be distinguishable from the values in
+ * vartag_external
+ */
+#define VARTAG_ZEDSTORE 10
+
+/*
+ * Versions of datumGetSize and datumCopy that know about ZedStore-toasted
+ * datums.
+ */
+static inline Size
+zs_datumGetSize(Datum value, bool typByVal, int typLen)
+{
+ if (typLen < 0 && VARATT_IS_EXTERNAL(value) && VARTAG_EXTERNAL(value) == VARTAG_ZEDSTORE)
+ return sizeof(varatt_zs_toastptr);
+ return datumGetSize(value, typByVal, typLen);
+}
+
+static inline Datum
+zs_datumCopy(Datum value, bool typByVal, int typLen)
+{
+ if (typLen < 0 && VARATT_IS_EXTERNAL(value) && VARTAG_EXTERNAL(value) == VARTAG_ZEDSTORE)
+ {
+ char *result = palloc(sizeof(varatt_zs_toastptr));
+
+ memcpy(result, DatumGetPointer(value), sizeof(varatt_zs_toastptr));
+
+ return PointerGetDatum(result);
+ }
+ else
+ return datumCopy(value, typByVal, typLen);
+}
+
+/*
+ * Block 0 on every ZedStore table is a metapage.
+ *
+ * It contains a directory of b-tree roots for each attribute, and lots more.
+ */
+#define ZS_META_BLK 0
+
+typedef struct ZSMetaPage
+{
+ int nattributes;
+ BlockNumber roots[FLEXIBLE_ARRAY_MEMBER]; /* one for each attribute */
+} ZSMetaPage;
+
+/*
+ * it's not clear what we should store in the "opaque" special area, and what
+ * as page contents, on a metapage. But have at least the page_id field here,
+ * so that tools like pg_filedump can recognize it as a zedstore metapage.
+ */
+typedef struct ZSMetaPageOpaque
+{
+ uint64 zs_undo_counter;
+ BlockNumber zs_undo_head;
+ BlockNumber zs_undo_tail;
+ ZSUndoRecPtr zs_undo_oldestptr;
+
+ uint16 zs_flags;
+ uint16 padding1; /* padding, to put zs_page_id last */
+ uint16 padding2; /* padding, to put zs_page_id last */
+ uint16 zs_page_id;
+} ZSMetaPageOpaque;
+
+
+/*
+ * Holds the state of an in-progress scan on a zedstore btree.
+ */
+typedef struct ZSBtreeScan
+{
+ Relation rel;
+ AttrNumber attno;
+
+ bool for_update;
+
+ bool active;
+ Buffer lastbuf;
+ bool lastbuf_is_locked;
+ OffsetNumber lastoff;
+ zstid nexttid;
+ Snapshot snapshot;
+
+ /* in the "real" UNDO-log, this would probably be a global variable */
+ ZSUndoRecPtr recent_oldest_undo;
+
+ /*
+ * if we have remaining items from a compressed "container" tuple, they
+ * are kept in the decompressor context, and 'has_decompressed' is true.
+ */
+ ZSDecompressContext decompressor;
+ bool has_decompressed;
+} ZSBtreeScan;
+
+/* prototypes for functions in zstore_btree.c */
+extern ZSUncompressedBtreeItem *zsbt_create_item(Form_pg_attribute attr, zstid tid,
+ Datum datum, bool isnull);
+extern zstid zsbt_insert(Relation rel, AttrNumber attno, Datum datum,
+ bool isnull, TransactionId xmin, CommandId cmin,
+ zstid tid, ZSUndoRecPtr *undorecptr);
+extern void zsbt_insert_multi_items(Relation rel, AttrNumber attno, List *newitems,
+ TransactionId xid, CommandId cid,
+ ZSUndoRecPtr *undorecptr, zstid *tid);
+extern TM_Result zsbt_delete(Relation rel, AttrNumber attno, zstid tid,
+ TransactionId xid, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart);
+extern TM_Result zsbt_update(Relation rel, AttrNumber attno, zstid otid,
+ Datum newdatum, bool newisnull, TransactionId xid,
+ CommandId cid, Snapshot snapshot, Snapshot crosscheck,
+ bool wait, TM_FailureData *hufd, zstid *newtid_p);
+extern void zsbt_mark_item_dead(Relation rel, AttrNumber attno, zstid tid, ZSUndoRecPtr);
+extern void zsbt_begin_scan(Relation rel, AttrNumber attno, zstid starttid, Snapshot snapshot, ZSBtreeScan *scan);
+extern bool zsbt_scan_next(ZSBtreeScan *scan, Datum *datum, bool *isnull, zstid *tid);
+extern void zsbt_end_scan(ZSBtreeScan *scan);
+extern zstid zsbt_get_last_tid(Relation rel, AttrNumber attno);
+
+
+
+
+/* prototypes for functions in zstore_meta.c */
+extern Buffer zs_getnewbuf(Relation rel);
+extern BlockNumber zsmeta_get_root_for_attribute(Relation rel, AttrNumber attno, bool for_update);
+extern void zsmeta_update_root_for_attribute(Relation rel, AttrNumber attno, Buffer metabuf, BlockNumber rootblk);
+
+/* prototypes for functions in zstore_visibility.c */
+extern TM_Result zs_SatisfiesUpdate(ZSBtreeScan *scan, ZSUncompressedBtreeItem *item);
+extern bool zs_SatisfiesVisibility(ZSBtreeScan *scan, ZSUncompressedBtreeItem *item);
+
+/* prototypes for functions in zstore_toast.c */
+extern Datum zedstore_toast_datum(Relation rel, AttrNumber attno, Datum value);
+extern void zedstore_toast_finish(Relation rel, AttrNumber attno, Datum toasted, zstid tid);
+extern Datum zedstore_toast_flatten(Relation rel, AttrNumber attno, zstid tid, Datum toasted);
+
+#endif /* ZEDSTORE_INTERNAL_H */
diff --git a/src/include/access/zedstore_undo.h b/src/include/access/zedstore_undo.h
new file mode 100644
index 0000000000..ea41f3cfd1
--- /dev/null
+++ b/src/include/access/zedstore_undo.h
@@ -0,0 +1,133 @@
+/*
+ * zedstore_undo.h
+ * internal declarations for ZedStore undo logging
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_undo.h
+ */
+#ifndef ZEDSTORE_UNDO_H
+#define ZEDSTORE_UNDO_H
+
+#include "commands/vacuum.h"
+#include "utils/relcache.h"
+
+/* this must match the definition in zedstore_internal.h */
+typedef uint64 zstid;
+
+/*
+ * An UNDO-pointer.
+ *
+ * In the "real" UNDO-logging work from EDB, an UndoRecPtr is only 64 bits.
+ * But we make life easier for us, by encoding more information in it.
+ *
+ * 'counter' is a number that's incremented every time a new undo record is
+ * created. It can be used to determine if an undo pointer is too old to be
+ * of interest to anyone.
+ *
+ * 'blkno' and 'offset' are the physical location of the UNDO record. They
+ * can be used to easily fetch a given record.
+ */
+typedef struct
+{
+ uint64 counter;
+ BlockNumber blkno;
+ int32 offset;
+} ZSUndoRecPtr;
+
+/* TODO: assert that blkno and offset match, too, if counter matches */
+#define ZSUndoRecPtrEquals(a, b) ((a).counter == (b).counter)
+
+typedef struct
+{
+ int16 size; /* size of this record, including header */
+ uint8 type; /* ZSUNDO_TYPE_* */
+ AttrNumber attno;
+ ZSUndoRecPtr undorecptr;
+ TransactionId xid;
+ CommandId cid;
+ zstid tid;
+} ZSUndoRec;
+
+#define ZSUNDO_TYPE_INSERT 1
+#define ZSUNDO_TYPE_DELETE 2
+#define ZSUNDO_TYPE_UPDATE 3
+
+/*
+ * Type-specific record formats.
+ *
+ * We store similar info as zheap for INSERT/UPDATE/DELETE. See zheap README.
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+ zstid endtid; /* inclusive */
+} ZSUndoRec_Insert;
+
+typedef struct
+{
+ ZSUndoRec rec;
+
+ /*
+ * UNDO-record of the inserter. This is needed if a row is inserted, and
+	 * deleted, and there are still active snapshots that don't consider
+	 * even the insertion as visible.
+ */
+ ZSUndoRecPtr prevundorec;
+
+ /*
+ * TODO: It might be good to move the deleted tuple to the undo-log, so
+ * that the space can immediately be reused. But currently, we don't do
+	 * that. (Or even better, move the old tuple to the undo-log lazily, if
+	 * the space is needed for a new insertion, before the old tuple becomes
+	 * recyclable.)
+ */
+} ZSUndoRec_Delete;
+
+typedef struct
+{
+ ZSUndoRec rec;
+
+ /* Like in ZSUndoRec_Delete. */
+ ZSUndoRecPtr prevundorec;
+
+ /* old version of the datum */
+ /* TODO: currently, we only do "cold" updates, so the old tuple is
+ * left in the old place. Once we start supporting in-place updates,
+ * the old tuple should be stored here.
+ */
+ zstid otid;
+
+} ZSUndoRec_Update;
+
+typedef struct
+{
+ BlockNumber next;
+ uint16 padding; /* padding, to put zs_page_id last */
+ uint16 zs_page_id; /* ZS_UNDO_PAGE_ID */
+} ZSUndoPageOpaque;
+
+static inline void
+ZSUndoRecPtrInitialize(ZSUndoRecPtr *uptr)
+{
+ uptr->blkno = InvalidBlockNumber;
+ uptr->offset = InvalidOffsetNumber;
+ uptr->counter = 0;
+}
+
+static inline bool
+IsZSUndoRecPtrValid(ZSUndoRecPtr *uptr)
+{
+ return (uptr->blkno != InvalidBlockNumber &&
+ uptr->offset != InvalidOffsetNumber);
+}
+
+/* prototypes for functions in zstore_undo.c */
+extern ZSUndoRecPtr zsundo_insert(Relation rel, ZSUndoRec *rec);
+extern ZSUndoRec *zsundo_fetch(Relation rel, ZSUndoRecPtr undorecptr);
+extern void zsundo_vacuum(Relation rel, VacuumParams *params, BufferAccessStrategy bstrategy,
+ TransactionId OldestXmin);
+extern ZSUndoRecPtr zsundo_get_oldest_undo_ptr(Relation rel);
+
+#endif /* ZEDSTORE_UNDO_H */
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 393b41dd68..f370f63460 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
{ oid => '3580', oid_symbol => 'BRIN_AM_OID',
descr => 'block range index (BRIN) access method',
amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '6668', oid_symbol => 'ZEDSTORE_TABLE_AM_OID',
+ descr => 'zedstore table access method',
+ amname => 'zedstore', amhandler => 'zedstore_tableam_handler', amtype => 't' },
]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ad4519e001..e08d62754d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -873,6 +873,11 @@
proname => 'heap_tableam_handler', provolatile => 'v',
prorettype => 'table_am_handler', proargtypes => 'internal',
prosrc => 'heap_tableam_handler' },
+{ oid => '6669', oid_symbol => 'ZEDSTORE_TABLE_AM_HANDLER_OID',
+ descr => 'column-oriented table access method handler',
+ proname => 'zedstore_tableam_handler', provolatile => 'v',
+ prorettype => 'table_am_handler', proargtypes => 'internal',
+ prosrc => 'zedstore_tableam_handler' },
# Index access method handlers
{ oid => '330', descr => 'btree index access method handler',
@@ -10672,4 +10677,23 @@
proname => 'pg_partition_root', prorettype => 'regclass',
proargtypes => 'regclass', prosrc => 'pg_partition_root' },
+# zedstore inspection functions
+{ oid => '7000', descr => 'get zedstore page type',
+ proname => 'pg_zs_page_type', prorettype => 'text',
+ proargtypes => 'regclass int8', prosrc => 'pg_zs_page_type' },
+{ oid => '7001', descr => 'show stats about active zedstore undo pages',
+ proname => 'pg_zs_undo_pages', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int4,int4,int8,int8}',
+ proargmodes => '{i,o,o,o,o,o}',
+ proargnames => '{relid,blkno,nrecords,freespace,firstrecptr,lastrecptr}',
+ prosrc => 'pg_zs_undo_pages' },
+{ oid => '7002', descr => 'show stats about zedstore btree pages',
+ proname => 'pg_zs_btree_pages', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int8,int4,int4,int8,int8,int4,int4,int4,int4,int4}',
+ proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{relid,blkno,nextblk,attno,level,lokey,hikey,nitems,ncompressed,totalsz,uncompressedsz,freespace}',
+ prosrc => 'pg_zs_btree_pages' },
+
]
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index eb4c8b5e79..509017aa56 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -597,5 +597,7 @@ extern void CheckCmdReplicaIdentity(Relation rel, CmdType cmd);
extern void CheckSubscriptionRelkind(char relkind, const char *nspname,
const char *relname);
+extern void GetNeededColumnsForNode(Node *expr, bool *mask, int n);
+extern bool *GetNeededColumnsForScan(ScanState *scanstate, int ncol);
#endif /* EXECUTOR_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a5e4b7ef2e..108dee61e2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1424,6 +1424,7 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
Size ioss_PscanLen;
+ TupleTableSlot *ioss_TableSlot;
} IndexOnlyScanState;
/* ----------------
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 6cd4cfed0a..ad7870a0bb 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -350,6 +350,9 @@
/* Define to 1 if you have the `ldap_r' library (-lldap_r). */
#undef HAVE_LIBLDAP_R
+/* Define to 1 if you have the `lz4' library (-llz4). */
+#undef HAVE_LIBLZ4
+
/* Define to 1 if you have the `m' library (-lm). */
#undef HAVE_LIBM
@@ -389,6 +392,9 @@
/* Define to 1 if `long long int' works and is 64 bits. */
#undef HAVE_LONG_LONG_INT_64
+/* Define to 1 if you have the <lz4.h> header file. */
+#undef HAVE_LZ4_H
+
/* Define to 1 if you have the <mbarrier.h> header file. */
#undef HAVE_MBARRIER_H
@@ -932,6 +938,9 @@
/* Define to 1 to build with LLVM based JIT support. (--with-llvm) */
#undef USE_LLVM
+/* Define to 1 to build with LZ4 support. (--with-lz4) */
+#undef USE_LZ4
+
/* Define to select named POSIX semaphores. */
#undef USE_NAMED_POSIX_SEMAPHORES
diff --git a/src/test/regress/expected/create_am.out b/src/test/regress/expected/create_am.out
index 352959b751..6eae2bab97 100644
--- a/src/test/regress/expected/create_am.out
+++ b/src/test/regress/expected/create_am.out
@@ -126,11 +126,12 @@ ERROR: function int4in(internal) does not exist
CREATE ACCESS METHOD bogus TYPE TABLE HANDLER bthandler;
ERROR: function bthandler must return type table_am_handler
SELECT amname, amhandler, amtype FROM pg_am where amtype = 't' ORDER BY 1, 2;
- amname | amhandler | amtype
---------+----------------------+--------
- heap | heap_tableam_handler | t
- heap2 | heap_tableam_handler | t
-(2 rows)
+ amname | amhandler | amtype
+----------+--------------------------+--------
+ heap | heap_tableam_handler | t
+ heap2 | heap_tableam_handler | t
+ zedstore | zedstore_tableam_handler | t
+(3 rows)
-- First create tables employing the new AM using USING
-- plain CREATE TABLE
diff --git a/src/test/regress/expected/zedstore.out b/src/test/regress/expected/zedstore.out
new file mode 100644
index 0000000000..37ffd6a98a
--- /dev/null
+++ b/src/test/regress/expected/zedstore.out
@@ -0,0 +1,259 @@
+-- simple tests to iteratively build the zedstore
+-- create and drop works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+drop table t_zedstore;
+-- insert and select works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+insert into t_zedstore select i,i+1,i+2 from generate_series(1, 10)i;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 4 | 5 | 6
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+(10 rows)
+
+-- selecting only few columns work
+select c1, c3 from t_zedstore;
+ c1 | c3
+----+----
+ 1 | 3
+ 2 | 4
+ 3 | 5
+ 4 | 6
+ 5 | 7
+ 6 | 8
+ 7 | 9
+ 8 | 10
+ 9 | 11
+ 10 | 12
+(10 rows)
+
+-- only few columns in output and where clause work
+select c3 from t_zedstore where c2 > 5;
+ c3
+----
+ 7
+ 8
+ 9
+ 10
+ 11
+ 12
+(6 rows)
+
+-- Test abort works
+begin;
+insert into t_zedstore select i,i+1,i+2 from generate_series(21, 25)i;
+abort;
+insert into t_zedstore select i,i+1,i+2 from generate_series(31, 35)i;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 4 | 5 | 6
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(15 rows)
+
+--
+-- Test indexing
+--
+create index on t_zedstore (c1);
+set enable_seqscan=off;
+set enable_indexscan=on;
+set enable_bitmapscan=off;
+-- index scan
+select * from t_zedstore where c1 = 5;
+ c1 | c2 | c3
+----+----+----
+ 5 | 6 | 7
+(1 row)
+
+-- index-only scan
+select c1 from t_zedstore where c1 = 5;
+ c1
+----
+ 5
+(1 row)
+
+-- bitmap scan
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+select c1, c2 from t_zedstore where c1 between 5 and 10;
+ c1 | c2
+----+----
+ 5 | 6
+ 6 | 7
+ 7 | 8
+ 8 | 9
+ 9 | 10
+ 10 | 11
+(6 rows)
+
+--
+-- Test DELETE and UPDATE
+--
+delete from t_zedstore where c2 = 5;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(14 rows)
+
+delete from t_zedstore where c2 < 5;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(11 rows)
+
+update t_zedstore set c2 = 100 where c1 = 8;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+-----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+ 8 | 100 | 10
+(11 rows)
+
+--
+-- Test VACUUM
+--
+vacuum t_zedstore;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+-----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+ 8 | 100 | 10
+(11 rows)
+
+--
+-- Test toasting
+--
+create table t_zedtoast(c1 int, t text) USING zedstore;
+insert into t_zedtoast select i, repeat('x', 10000) from generate_series(1, 10) i;
+select c1, length(t) from t_zedtoast;
+ c1 | length
+----+--------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+ 6 | 10000
+ 7 | 10000
+ 8 | 10000
+ 9 | 10000
+ 10 | 10000
+(10 rows)
+
+--
+-- Test NULL values
+--
+create table t_zednullvalues(c1 int, c2 int) USING zedstore;
+insert into t_zednullvalues values(1, NULL), (NULL, 2);
+select * from t_zednullvalues;
+ c1 | c2
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+select c2 from t_zednullvalues;
+ c2
+----
+
+ 2
+(2 rows)
+
+update t_zednullvalues set c1 = 1, c2 = NULL;
+select * from t_zednullvalues;
+ c1 | c2
+----+----
+ 1 |
+ 1 |
+(2 rows)
+
+--
+-- Test COPY
+--
+create table t_zedcopy(a serial, b int, c text not null default 'stuff', d text,e text) USING zedstore;
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+COPY t_zedcopy (b, d) from stdin;
+COPY t_zedcopy (b, d) from stdin;
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 1 | 1 | stuff | test_1 |
+ 2 | 2 | stuff | test_2 |
+ 3 | 3 | stuff | test_3 |
+ 4 | 4 | stuff | test_4 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(12 rows)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index f320fb6ef3..1668c251db 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -89,7 +89,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
# ----------
# Another group of parallel tests
# ----------
-test: create_table_like alter_generic alter_operator misc psql async dbsize misc_functions sysviews tsrf tidscan stats_ext
+test: create_table_like alter_generic alter_operator misc psql async dbsize misc_functions sysviews tsrf tidscan stats_ext zedstore
# rules cannot run concurrently with any test that creates a view
test: rules psql_crosstab amutils
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 36644aa963..a81722e412 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -135,6 +135,7 @@ test: sysviews
test: tsrf
test: tidscan
test: stats_ext
+test: zedstore
test: rules
test: psql_crosstab
test: select_parallel
diff --git a/src/test/regress/sql/zedstore.sql b/src/test/regress/sql/zedstore.sql
new file mode 100644
index 0000000000..5240a9efeb
--- /dev/null
+++ b/src/test/regress/sql/zedstore.sql
@@ -0,0 +1,104 @@
+-- simple tests to iteratively build the zedstore
+-- create and drop works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+drop table t_zedstore;
+-- insert and select works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+insert into t_zedstore select i,i+1,i+2 from generate_series(1, 10)i;
+select * from t_zedstore;
+-- selecting only few columns work
+select c1, c3 from t_zedstore;
+-- only few columns in output and where clause work
+select c3 from t_zedstore where c2 > 5;
+
+-- Test abort works
+begin;
+insert into t_zedstore select i,i+1,i+2 from generate_series(21, 25)i;
+abort;
+insert into t_zedstore select i,i+1,i+2 from generate_series(31, 35)i;
+select * from t_zedstore;
+
+--
+-- Test indexing
+--
+create index on t_zedstore (c1);
+set enable_seqscan=off;
+set enable_indexscan=on;
+set enable_bitmapscan=off;
+
+-- index scan
+select * from t_zedstore where c1 = 5;
+
+-- index-only scan
+select c1 from t_zedstore where c1 = 5;
+
+-- bitmap scan
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+select c1, c2 from t_zedstore where c1 between 5 and 10;
+
+--
+-- Test DELETE and UPDATE
+--
+delete from t_zedstore where c2 = 5;
+select * from t_zedstore;
+delete from t_zedstore where c2 < 5;
+select * from t_zedstore;
+
+update t_zedstore set c2 = 100 where c1 = 8;
+select * from t_zedstore;
+
+--
+-- Test VACUUM
+--
+vacuum t_zedstore;
+select * from t_zedstore;
+
+--
+-- Test toasting
+--
+create table t_zedtoast(c1 int, t text) USING zedstore;
+insert into t_zedtoast select i, repeat('x', 10000) from generate_series(1, 10) i;
+
+select c1, length(t) from t_zedtoast;
+
+--
+-- Test NULL values
+--
+create table t_zednullvalues(c1 int, c2 int) USING zedstore;
+insert into t_zednullvalues values(1, NULL), (NULL, 2);
+select * from t_zednullvalues;
+select c2 from t_zednullvalues;
+update t_zednullvalues set c1 = 1, c2 = NULL;
+select * from t_zednullvalues;
+
+--
+-- Test COPY
+--
+create table t_zedcopy(a serial, b int, c text not null default 'stuff', d text,e text) USING zedstore;
+
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+9999 \N \\N \NN \N
+10000 21 31 41 51
+\.
+
+COPY t_zedcopy (b, d) from stdin;
+1 test_1
+\.
+
+COPY t_zedcopy (b, d) from stdin;
+2 test_2
+3 test_3
+4 test_4
+5 test_5
+\.
+
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+10001 22 32 42 52
+10002 23 33 43 53
+10003 24 34 44 54
+10004 25 35 45 55
+10005 26 36 46 56
+\.
+
+select * from t_zedcopy;
base-commit: 80a96e066eecb6bd1788964b5911a405d932a784
--
2.19.1
Hi,
On 2019-04-08 17:27:05 -0700, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method).
That's very cool.
Motivations / Objectives
* Performance improvement for queries selecting subset of columns
(reduced IO).
* Reduced on-disk footprint compared to heap table. Shorter tuple
headers and also leveraging compression of similar type data
* Be first-class citizen in the Postgres architecture (tables data can
just independently live in columnar storage)
* Fully MVCC compliant
* All Indexes supported
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together.
* Provide better control over bloat (similar to zheap)
* Eliminate need for separate toast tables
* Faster add / drop column or changing data type of column by avoiding
full rewrite of the table.
Is storage going through the bufmgr.c or separately?
In uncompressed form, the page can be arbitrarily large. But after
compression, it must fit into a physical 8k block. If on insert or
update of a tuple, the page cannot be compressed below 8k anymore, the
page is split. Note that because TIDs are logical rather than physical
identifiers, we can freely move tuples from one physical page to
another during page split. A tuple's TID never changes.
When does compression happen? After each modification of the expanded
"page"? Are repeated expansions prevented somehow, e.g. if I
insert/delete rows into the same page one-by-one?
A metapage at block 0, has links to the roots of the B-trees. Leaf
pages look the same, but instead of storing the whole tuple, stores
just a single attribute. To reconstruct a row with given TID, scan
descends down the B-trees for all the columns using that TID, and
fetches all attributes. Likewise, a sequential scan walks all the
B-trees in lockstep.
Does the size of the metapage limit the number of column [groups]? Or is
there some overflow / tree of trees / whatnot happening?
Insert:
Inserting a new row, splits the row into datums. Then for first column
decide which block to insert the same to, and pick a TID for it, and
write undo record for the same. Rest of the columns are inserted using
that same TID and point to same undo position.
Is there some buffering? Without that it seems like retail inserts are
going to be pretty slow?
Property is added to Table AM to convey if column projection is
leveraged by AM for scans. While scanning tables with AM leveraging
this property, executor parses the plan. Leverages the target list and
quals to find the required columns for query. This list is passed down
to AM on beginscan. Zedstore uses this column projection list to only
pull data from selected columns. Virtual tuple table slot is used to
pass back the datums for subset of columns.

Current table am API requires enhancement here to pass down column
projection to AM. The patch showcases two different ways for the same.

* For sequential scans added new beginscan_with_column_projection()
  API. Executor checks AM property and if it leverages column
  projection uses this new API else normal beginscan() API.

* For index scans instead of modifying the begin scan API, added new
  API to specifically pass column projection list after calling begin
  scan to populate the scan descriptor but before fetching the tuples.
FWIW, I don't quite think this is the right approach. I've only a vague
sketch of this in my head, but I think we should want a general API to
pass that down to *any* scan. Even for heap, not deforming leading
columns that are uninteresting, but precede relevant columns, would be
quite a noticeable performance win. I don't think the projection list is
the right approach for that.
Extremely basic UNDO logging has been implemented just from the MVCC
perspective. MVCC is missing tuple lock right now. Plus, it doesn't
actually perform any undo yet. No WAL logging exists currently, hence
it's not crash safe either.
Have you looked at the undo APIs developed for zheap, as discussed on
the list? Seems important that they're suitable for this too.
Test mentioned in thread "Column lookup in a row performance" [6],
good example query for zedstore locally on laptop using lz4 shows

postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; -- heap
         avg
---------------------
 500000.500000000000
(1 row)

Time: 4679.026 ms (00:04.679)

postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; -- zedstore
         avg
---------------------
 500000.500000000000
(1 row)

Time: 379.710 ms
Well, I'm not sure I'm actually impressed by that. What does the
performance look like if you select i0 instead?
Important note:
---------------
Planner has not been modified yet to leverage the columnar
storage. Hence, plans using the "physical tlist" optimization or such, good
for a row store, currently miss out on leveraging the columnar nature.
Hence the need for the subquery with OFFSET 0 above, to disable the
optimization and scan only the required column.
I'm more and more thinking that we should just nix the physical tlist
stuff and start afresh.
Congrats again, this is cool stuff.
- Andres
On Mon, Apr 8, 2019 at 6:04 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-08 17:27:05 -0700, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method).

That's very cool.
Motivations / Objectives
* Performance improvement for queries selecting subset of columns
(reduced IO).
* Reduced on-disk footprint compared to heap table. Shorter tuple
headers and also leveraging compression of similar type data
* Be first-class citizen in the Postgres architecture (tables data can
just independently live in columnar storage)
* Fully MVCC compliant
* All Indexes supported
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together.
* Provide better control over bloat (similar to zheap)
* Eliminate need for separate toast tables
* Faster add / drop column or changing data type of column by avoiding
full rewrite of the table.

Is storage going through the bufmgr.c or separately?
Yes, below the access method it's pretty much the same as heap. All reads and
writes flow via the buffer cache. The implementation sits nicely in between,
just modifying the access method code to change how data is stored in pages;
above and below the AM everything basically behaves the same as the heap code.
In uncompressed form, the page can be arbitrarily large. But after
compression, it must fit into a physical 8k block. If on insert or
update of a tuple, the page cannot be compressed below 8k anymore, the
page is split. Note that because TIDs are logical rather than physical
identifiers, we can freely move tuples from one physical page to
another during page split. A tuple's TID never changes.

When does compression happen? After each modification of the expanded
"page"? Are repeated expansions prevented somehow, e.g. if I
insert/delete rows into the same page one-by-one?
Compression of new data is performed only if the page becomes full; till
then uncompressed data is added to the page. If even after compression the
data cannot fit on the page, then a page split is performed. Already
compressed data is not compressed again on the next insert. A new compressed
block is created for the newly added uncompressed items.

The line of thought we have for delete is to not free the space as soon
as the delete is performed, but instead delay and reuse the deleted space on
the next insertion to the page.
A metapage at block 0, has links to the roots of the B-trees. Leaf
pages look the same, but instead of storing the whole tuple, stores
just a single attribute. To reconstruct a row with given TID, scan
descends down the B-trees for all the columns using that TID, and
fetches all attributes. Likewise, a sequential scan walks all the
B-trees in lockstep.

Does the size of the metapage limit the number of column [groups]? Or is
there some overflow / tree of trees / whatnot happening?
In the design it doesn't limit the number of columns, as we can have a chain
of meta-pages to store the required meta-data, with page 0 still being the
start of the chain.
Insert:
Inserting a new row, splits the row into datums. Then for first column
decide which block to insert the same to, and pick a TID for it, and
write undo record for the same. Rest of the columns are inserted using
that same TID and point to same undo position.

Is there some buffering? Without that it seems like retail inserts are
going to be pretty slow?
Yes, regular buffer cache.
Property is added to Table AM to convey if column projection is
leveraged by AM for scans. While scanning tables with AM leveraging
this property, executor parses the plan. Leverages the target list and
quals to find the required columns for query. This list is passed down
to AM on beginscan. Zedstore uses this column projection list to only
pull data from selected columns. Virtual tuple table slot is used to
pass back the datums for subset of columns.

Current table am API requires enhancement here to pass down column
projection to AM. The patch showcases two different ways for the same.

* For sequential scans added new beginscan_with_column_projection()
  API. Executor checks AM property and if it leverages column
  projection uses this new API else normal beginscan() API.

* For index scans instead of modifying the begin scan API, added new
  API to specifically pass column projection list after calling begin
  scan to populate the scan descriptor but before fetching the tuples.

FWIW, I don't quite think this is the right approach. I've only a vague
sketch of this in my head, but I think we should want a general API to
pass that down to *any* scan. Even for heap, not deforming leading
columns that are uninteresting, but precede relevant columns, would be
quite a noticeable performance win. I don't think the projection list is
the right approach for that.
Sure, would love to hear more on it and can enhance the same to make it more
usable for AMs.
Extremely basic UNDO logging has been implemented just from the MVCC
perspective. MVCC is missing tuple lock right now. Plus, it doesn't
actually perform any undo yet. No WAL logging exists currently, hence
it's not crash safe either.

Have you looked at the undo APIs developed for zheap, as discussed on
the list? Seems important that they're suitable for this too.
Not in detail yet, but yes the plan is to leverage the same common framework
and undo log API as zheap. Will look into the details. With the current
zedstore implementation the requirements from the undo are pretty clear.
Test mentioned in thread "Column lookup in a row performance" [6],
good example query for zedstore locally on laptop using lz4 shows

postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; -- heap
         avg
---------------------
 500000.500000000000
(1 row)

Time: 4679.026 ms (00:04.679)

postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; -- zedstore
         avg
---------------------
 500000.500000000000
(1 row)

Time: 379.710 ms
Well, I'm not sure I'm actually impressed by that. What does the
performance look like if you select i0 instead?
Just for a quick test I used 100 instead of 200 columns (with 200 the results
would diverge more); this is what it reports:
postgres=# SELECT AVG(i0) FROM (select i0 from layout offset 0) x; -- heap
avg
------------------------
1.00000000000000000000
(1 row)
Time: 183.865 ms
postgres=# SELECT AVG(i0) FROM (select i0 from zlayout offset 0) x; -- zedstore
avg
------------------------
1.00000000000000000000
(1 row)
Time: 47.624 ms
Hi,
On 09.04.2019 3:27, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.
Thank you for publishing this patch. IMHO Postgres is really missing
normal support of columnar store and table access method
API is the best way of integrating it.
I wanted to compare memory footprint and performance of zedstore with
standard Postgres heap and my VOPS extension.
As test data I used the TPC-H benchmark (actually only the lineitem table,
generated with the tpch-dbgen utility at scale factor 10, ~8Gb database).
I attached the script which I have used to populate the data (you have to
download, build and run tpch-dbgen yourself; also you can comment out the
code related to VOPS).
Unfortunately I failed to load data in zedstore:
postgres=# insert into zedstore_lineitem_projection (select
l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"
from lineitem);
psql: ERROR: compression failed. what now?
Time: 237804.775 ms (03:57.805)
Then I try to check if there is something in
zedstore_lineitem_projection table:
postgres=# select count(*) from zedstore_lineitem_projection;
psql: WARNING: terminating connection because of crash of another
server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 145710.828 ms (02:25.711)
Backend consumes 16GB of RAM and 16Gb of swap and was killed by OOM
killer (undo log?)
Subsequent attempt to run the same command is failed with the following
error:
postgres=# select count(*) from zedstore_lineitem_projection;
psql: ERROR: unexpected level encountered when descending tree
So the only thing I can do at this moment is report size of tables on
the disk:
postgres=# select pg_relation_size('lineitem');
pg_relation_size
------------------
10455441408
(1 row)
postgres=# select pg_relation_size('lineitem_projection');
pg_relation_size
------------------
3129974784
(1 row)
postgres=# select pg_relation_size('vops_lineitem_projection');
pg_relation_size
------------------
1535647744
(1 row)
postgres=# select pg_relation_size('zedstore_lineitem_projection');
pg_relation_size
------------------
2303688704
(1 row)
But I do not know how much data was actually loaded in zedstore table...
Actually the main question is why this table is not empty if the INSERT
statement failed?
Please let me know if I can somehow help you to reproduce and
investigate the problem.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 09.04.2019 17:09, Konstantin Knizhnik wrote:
Hi,
On 09.04.2019 3:27, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.

Thank you for publishing this patch. IMHO Postgres is really missing
normal support of columnar store and table access method
API is the best way of integrating it.

I wanted to compare memory footprint and performance of zedstore with
standard Postgres heap and my VOPS extension.
As test data I used TPC-H benchmark (actually only one lineitem table
generated with tpch-dbgen utility with scale factor 10 (~8Gb database).
I attached script which I have use to populate data (you have to to
download, build and run tpch-dbgen yourself, also you can comment code
related with VOPS).
Unfortunately I failed to load data in zedstore:

postgres=# insert into zedstore_lineitem_projection (select
l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"
from lineitem);
psql: ERROR: compression failed. what now?
Time: 237804.775 ms (03:57.805)

Then I try to check if there is something in
zedstore_lineitem_projection table:

postgres=# select count(*) from zedstore_lineitem_projection;
psql: WARNING: terminating connection because of crash of another
server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 145710.828 ms (02:25.711)

Backend consumes 16GB of RAM and 16Gb of swap and was killed by OOM
killer (undo log?)
Subsequent attempt to run the same command failed with the
following error:

postgres=# select count(*) from zedstore_lineitem_projection;
psql: ERROR: unexpected level encountered when descending tree

So the only thing I can do at this moment is report size of tables on
the disk:

postgres=# select pg_relation_size('lineitem');
pg_relation_size
------------------
10455441408
(1 row)

postgres=# select pg_relation_size('lineitem_projection');
pg_relation_size
------------------
3129974784
(1 row)

postgres=# select pg_relation_size('vops_lineitem_projection');
pg_relation_size
------------------
1535647744
(1 row)

postgres=# select pg_relation_size('zedstore_lineitem_projection');
pg_relation_size
------------------
2303688704
(1 row)

But I do not know how much data was actually loaded in zedstore table...
Actually the main question is why this table is not empty if the INSERT
statement failed?

Please let me know if I can somehow help you to reproduce and
investigate the problem.
Looks like the original problem was caused by internal postgres
compressor: I have not configured Postgres to use lz4.
When I configured Postgres --with-lz4, data was correctly inserted in the
zedstore table, but it looks like it is not compressed at all:
postgres=# select pg_relation_size('zedstore_lineitem_projection');
pg_relation_size
------------------
9363010640
No wonder that zedstore shows the worst results:
lineitem 6240.261 ms
lineitem_projection 5390.446 ms
zedstore_lineitem_projection 23310.341 ms
vops_lineitem_projection 439.731 ms
Updated version of vstore_bench.sql is attached (sorry, there were some
errors in the previous version of this script).
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 09/04/2019 18:00, Konstantin Knizhnik wrote:
On 09.04.2019 17:09, Konstantin Knizhnik wrote:
standard Postgres heap and my VOPS extension.
As test data I used TPC-H benchmark (actually only one lineitem table
generated with tpch-dbgen utility with scale factor 10 (~8Gb database).
I attached script which I have use to populate data (you have to to
download, build and run tpch-dbgen yourself, also you can comment code
related with VOPS).
Cool, thanks!
Unfortunately I failed to load data in zedstore:
postgres=# insert into zedstore_lineitem_projection (select
l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"
from lineitem);
psql: ERROR: compression failed. what now?
Time: 237804.775 ms (03:57.805)
Yeah, it's still early days, it will crash and burn in a lot of cases.
We wanted to publish this early, to gather ideas and comments on the
high level design, and to validate that the table AM API that's in v12
is usable.
Looks like the original problem was caused by internal postgres
compressor: I have not configured Postgres to use lz4.
When I configured Postgres --with-lz4, data was correctly inserted in
zedstore table, but looks it is not compressed at all:

postgres=# select pg_relation_size('zedstore_lineitem_projection');
pg_relation_size
------------------
9363010640
The single-insert codepath isn't very optimized yet. If you populate the
table with large "INSERT ... SELECT ...", you end up with a huge undo
log. Try loading it with COPY.
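For instance, loading a full-width zedstore copy of lineitem straight from
the dbgen output would look roughly like this (the path is just a
placeholder, and it assumes such a table was created with USING zedstore):

postgres=# copy zedstore_lineitem from '/path/to/lineitem.tbl' delimiter '|' csv;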
You can also see how many pages of each type there is with:
select count(*), pg_zs_page_type('zedstore_lineitem_projection', g)
from generate_series(0, pg_table_size('zedstore_lineitem_projection')
/ 8192 - 1) g group by 2;
- Heikki
On 09.04.2019 18:08, Heikki Linnakangas wrote:
On 09/04/2019 18:00, Konstantin Knizhnik wrote:
On 09.04.2019 17:09, Konstantin Knizhnik wrote:
standard Postgres heap and my VOPS extension.
As test data I used TPC-H benchmark (actually only one lineitem table
generated with tpch-dbgen utility with scale factor 10 (~8Gb database).
I attached script which I have use to populate data (you have to to
download, build and run tpch-dbgen yourself, also you can comment code
related with VOPS).

Cool, thanks!
Unfortunately I failed to load data in zedstore:
postgres=# insert into zedstore_lineitem_projection (select
l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"
from lineitem);
psql: ERROR: compression failed. what now?
Time: 237804.775 ms (03:57.805)

Yeah, it's still early days, it will crash and burn in a lot of cases.
We wanted to publish this early, to gather ideas and comments on the
high level design, and to validate that the table AM API that's in v12
is usable.

Looks like the original problem was caused by internal postgres
compressor: I have not configured Postgres to use lz4.
When I configured Postgres --with-lz4, data was correctly inserted in
zedstore table, but looks it is not compressed at all:

postgres=# select pg_relation_size('zedstore_lineitem_projection');
pg_relation_size
------------------
9363010640

The single-insert codepath isn't very optimized yet. If you populate
the table with large "INSERT ... SELECT ...", you end up with a huge
undo log. Try loading it with COPY.

You can also see how many pages of each type there is with:
select count(*), pg_zs_page_type('zedstore_lineitem_projection', g)
from generate_series(0,
pg_table_size('zedstore_lineitem_projection') / 8192 - 1) g group by 2;

- Heikki
postgres=# copy zedstore_lineitem from '/mnt/data/lineitem.tbl'
delimiter '|' csv;
COPY 59986052
Time: 232802.257 ms (03:52.802)
postgres=# select pg_relation_size('zedstore_lineitem');
pg_relation_size
------------------
10346504192
(1 row)
postgres=# select count(*), pg_zs_page_type('zedstore_lineitem', g)
from generate_series(0, pg_table_size('zedstore_lineitem') / 8192 -
1) g group by 2;
count | pg_zs_page_type
---------+-----------------
1 | META
1262308 | BTREE
692 | UNDO
(3 rows)
And now performance is much worse:
Time: 99819.476 ms (01:39.819)
It is strange, because the main advantage of a columnar store is that it
has to fetch only the accessed columns.
What I see is that in non-parallel mode (max_parallel_workers_per_gather
= 0)
backend consumes about 11GB of memory. It fits in my desktop RAM (16GB)
and speed is ~58 seconds.
But once I start 4 parallel workers, they cause huge swapping:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
28195 knizhnik 20 0 11.823g 6.553g 5.072g D 7.6 42.2 0:17.19
postgres
28074 knizhnik 20 0 11.848g 6.726g 5.223g D 7.3 43.3 4:14.96
postgres
28192 knizhnik 20 0 11.854g 6.586g 5.075g D 7.3 42.4 0:17.18
postgres
28193 knizhnik 20 0 11.870g 6.594g 5.064g D 7.3 42.4 0:17.19
postgres
28194 knizhnik 20 0 11.854g 6.589g 5.078g D 7.3 42.4 0:17.09
postgres
which is also strange because data should be present in shared buffers.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2019-Apr-09, Konstantin Knizhnik wrote:
On 09.04.2019 3:27, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.

Thank you for publishing this patch. IMHO Postgres is really missing normal
support of columnar store
Yep.
and table access method API is the best way of integrating it.
This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.
We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09.04.2019 18:51, Alvaro Herrera wrote:
On 2019-Apr-09, Konstantin Knizhnik wrote:
On 09.04.2019 3:27, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.Thank you for publishing this patch. IMHO Postgres is really missing normal
support of columnar storeYep.
and table access method API is the best way of integrating it.
This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.

We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.
There are two different aspects:
1. Store format.
2. Vector execution.
1. VOPS is using a mixed format, something similar to Apache Parquet.
Tuples are stored vertically, but only inside one page.
It tries to minimize the trade-offs between true horizontal and true
vertical storage:
the first is most optimal for selecting all rows, while the second is for
selecting a small subset of rows.
To make this approach more efficient, it is better to use a large page
size - the default Postgres 8k page is not enough.

From my point of view such a format is better than pure vertical storage,
which will be very inefficient if a query accesses a larger number of columns.
This problem can be somehow addressed by creating projections: grouping
several columns together. But it requires more space for storing
multiple projections.
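A projection here is essentially just a narrower table holding the columns
that are queried together; for example, something along these lines (a
sketch, not the exact definition from the attached script):

create table lineitem_projection as
  select l_shipdate, l_quantity, l_extendedprice, l_discount, l_tax,
         l_returnflag::"char", l_linestatus::"char"
  from lineitem;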
2. No matter which format we choose, to take full advantage of the
vertical representation we need to use vector operations.
And the Postgres executor doesn't support them now. This is why VOPS is
using some hacks, which is definitely not good and does not work in all cases.
zedstore is not using such hacks and ... this is why it can never reach
VOPS performance.
The right solution is to add vector operation support to the Postgres
planner and executor.
But that is much harder than developing the columnar store itself.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 09/04/2019 18:00, Konstantin Knizhnik wrote:
Looks like the original problem was caused by internal postgres
compressor: I have not configured Postgres to use lz4.
When I configured Postgres --with-lz4, data was correctly inserted in
zedstore table, but looks it is not compressed at all:

postgres=# select pg_relation_size('zedstore_lineitem_projection');
pg_relation_size
------------------
9363010640

No wonder that zedstore shows the worst results:
lineitem 6240.261 ms
lineitem_projection 5390.446 ms
zedstore_lineitem_projection 23310.341 ms
vops_lineitem_projection 439.731 ms

Updated version of vstore_bench.sql is attached (sorry, there was some
errors in previous version of this script).
I tried this quickly, too. With default work_mem and no parallelism, and
1 gb table size, it seems that the query chooses a different plan with
heap and zedstore, with a sort+group for zedstore and hash agg for heap.
There's no ANALYZE support in zedstore yet, and we haven't given much
thought to parallelism either. With work_mem='1GB' and no parallelism,
both queries use a hash agg, and the numbers are much closer than what
you saw, about 6 s for heap, and 9 s for zedstore.
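To check which plan is being chosen, something along these lines works (a
simplified form of the Q1-style aggregate, not the exact query from
vstore_bench.sql):

postgres=# set work_mem = '1GB';
postgres=# explain (costs off)
           select l_returnflag, l_linestatus, sum(l_quantity)
           from zedstore_lineitem_projection
           group by l_returnflag, l_linestatus;

With the larger work_mem the plan should show a HashAggregate instead of
GroupAggregate + Sort.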
- Heikki
On 09.04.2019 19:19, Heikki Linnakangas wrote:
On 09/04/2019 18:00, Konstantin Knizhnik wrote:
Looks like the original problem was caused by internal postgres
compressor: I have not configured Postgres to use lz4.
When I configured Postgres --with-lz4, data was correctly inserted in
zedstore table, but looks it is not compressed at all:

postgres=# select pg_relation_size('zedstore_lineitem_projection');
pg_relation_size
------------------
9363010640

No wonder that zedstore shows the worst results:
lineitem 6240.261 ms
lineitem_projection 5390.446 ms
zedstore_lineitem_projection 23310.341 ms
vops_lineitem_projection 439.731 ms

Updated version of vstore_bench.sql is attached (sorry, there was some
errors in previous version of this script).

I tried this quickly, too. With default work_mem and no parallelism,
and 1 gb table size, it seems that the query chooses a different plan
with heap and zedstore, with a sort+group for zedstore and hash agg
for heap. There's no ANALYZE support in zedstore yet, and we haven't
given much thought to parallelism either. With work_mem='1GB' and no
parallelism, both queries use a hash agg, and the numbers are much
closer than what you saw, about 6 s for heap, and 9 s for zedstore.

- Heikki
Yes, you were right. The plan for zedstore uses GroupAggregate instead
of HashAggregate.
Increasing work_mem forces the optimizer to use HashAggregate in all cases.
But it doesn't prevent memory overflow in my case.
And it is very strange to me, because there are just 4 groups in the
result, so it should not consume any memory.
Yet another strange thing is that the size of the zedstore table is 10Gb
according to pg_relation_size.
The Q1 query accesses only some subset of the "lineitem" columns, not
touching the largest ones (with text).
I have configured 12Gb shared buffers. And all this 11Gb are used! Looks
like all columns are fetched from the disk.
And it looks like, in addition to this 11Gb of shared data, the backend (and
each parallel worker) is also consuming several gigabytes of heap memory.
As a result total size of used memory during parallel query execution
with 4 workers exceeds 20GB and cause terrible swapping at my system.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.

We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.
I think that conclusion may be premature. There seem to be a bunch of
different ways of doing columnar storage, so I don't know how we can
be sure that one size will fit all, or that the first thing we accept
will be the best thing.
Of course, we probably do not want to accept a ton of storage manager
implementations in core. I think if people propose implementations
that are poor quality, or missing important features, or don't have
significantly different use cases from the ones we've already got,
it's reasonable to reject those. But I wouldn't be prepared to say
that if we have two significantly different column stores that are both
awesome code with a complete feature set and significantly disjoint
use cases, we should reject the second one just because it is also a
column store. I think that won't get out of control because few
people will be able to produce really high-quality implementations.
This stuff is hard, which I think is also why we only have 6.5 index
AMs in core after many, many years. And our standards have gone up
over the years - not all of those would pass muster if they were
proposed today.
BTW, can I express a small measure of disappointment that the name for
the thing under discussion on this thread chose to be called
"zedstore"? That seems to invite confusion with "zheap", especially
in parts of the world where the last letter of the alphabet is
pronounced "zed," where people are going to say zed-heap and
zed-store. Brr.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Apr 9, 2019 at 11:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.

We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.

I think that conclusion may be premature. There seem to be a bunch of
different ways of doing columnar storage, so I don't know how we can
be sure that one size will fit all, or that the first thing we accept
will be the best thing.

Of course, we probably do not want to accept a ton of storage manager
implementations in core. I think if people propose implementations
that are poor quality, or missing important features, or don't have
significantly different use cases from the ones we've already got,
it's reasonable to reject those. But I wouldn't be prepared to say
that if we have two significantly different column stores that are both
awesome code with a complete feature set and significantly disjoint
use cases, we should reject the second one just because it is also a
column store. I think that won't get out of control because few
people will be able to produce really high-quality implementations.

This stuff is hard, which I think is also why we only have 6.5 index
AMs in core after many, many years. And our standards have gone up
over the years - not all of those would pass muster if they were
proposed today.
+1
BTW, can I express a small measure of disappointment that the name for
the thing under discussion on this thread chose to be called
"zedstore"? That seems to invite confusion with "zheap", especially
in parts of the world where the last letter of the alphabet is
pronounced "zed," where people are going to say zed-heap and
zed-store. Brr.
Surprised it's felt this thread would initiate the invitation to confusion.
Based on past internal and meetup discussions for quite some time now,
the confusion already exists for the zheap pronunciation because of the
reason mentioned, as the last letter is not pronounced universally the same.
Hence we explicitly called it zedstore, to learn from that and make the
pronunciation worldwide universal for the new thing at least.
On Tue, Apr 9, 2019 at 9:13 AM Konstantin Knizhnik <
k.knizhnik@postgrespro.ru> wrote:
On 09.04.2019 18:51, Alvaro Herrera wrote:
On 2019-Apr-09, Konstantin Knizhnik wrote:
On 09.04.2019 3:27, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.

Thank you for publishing this patch. IMHO Postgres is really missing
normal
support of columnar store
Yep.
and table access method API is the best way of integrating it.
This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.

We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.

There are two different aspects:
1. Store format.
2. Vector execution.

1. VOPS is using mixed format, something similar with Apache parquet.
Tuples are stored vertically, but only inside one page.
It tries to minimize trade-offs between true horizontal and true
vertical storage:
first is most optimal for selecting all rows, while second - for
selecting small subset of rows.
To make this approach more efficient, it is better to use large page
size - default Postgres 8k pages is not enough.

From my point of view such format is better than pure vertical storage
which will be very inefficient if query access larger number of columns.
This problem can be somehow addressed by creating projections: grouping
several columns together. But it requires more space for storing
multiple projections.
Right, storing all the columns in a single page doesn't give any savings on
IO.
2. Doesn't matter which format we choose, to take all advantages of
vertical representation we need to use vector operations.
And Postgres executor doesn't support them now. This is why VOPS is
using some hacks, which is definitely not good and not working in all
cases.
zedstore is not using such hacks and ... this is why it never can reach
VOPS performance.
Vectorized execution is orthogonal to storage format. It can even be
applied to a row store and performance gained. Similarly, a column store
without vectorized execution also gives performance gains, better compression
ratios and such benefits. A column store clubbed with vectorized execution
makes it a lot more performant, agreed. Zedstore currently is focused on
having the AM piece in place, which fits the postgres ecosystem and supports
all the features heap does.
On 09/04/2019 23:24, Ashwin Agrawal wrote:
BTW, can I express a small measure of disappointment that the name for
the thing under discussion on this thread chose to be called
"zedstore"? That seems to invite confusion with "zheap", especially
in parts of the world where the last letter of the alphabet is
pronounced "zed," where people are going to say zed-heap and
zed-store. Brr.

Surprised its felt this thread would initiate the invitation to
confusion. Based on past internal and meetup discussions for few quite
sometime now, the confusion already exists for zheap pronunciation
because of the reason mentioned, as last letter is not pronounced
universally same. Hence we explicitly called it zedstore to learn from
and make the pronunciation world wide universal for new thing atleast.
Yeah, you can blame me for the name. It's a pun on zheap. I'm hoping we
come up with a better name before this matures; I'm thinking it could be
just "column store" or something like that in the end, but it's good to
have a more unique name during development.
- Heikki
On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.The objective is to gather feedback on design and approach to the
same. The implementation has core basic pieces working but not close
to complete.Big thank you to Andres, Haribabu and team for the table access method
API's. Leveraged the API's for implementing zedstore, and proves API
to be in very good shape. Had to enhance the same minimally but
in-general didn't had to touch executor much.Motivations / Objectives
* Performance improvement for queries selecting subset of columns
(reduced IO).
* Reduced on-disk footprint compared to heap table. Shorter tuple
headers and also leveraging compression of similar type data
* Be first-class citizen in the Postgres architecture (tables data can
just independently live in columnar storage)
* Fully MVCC compliant
* All Indexes supported
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together.
* Provide better control over bloat (similar to zheap)
* Eliminate need for separate toast tables
* Faster add / drop column or changing data type of column by avoiding
full rewrite of the table.High-level Design - B-trees for the win!
========================================To start simple, let's ignore column store aspect for a moment and
consider it as compressed row store. The column store is natural
extension of this concept, explained in next section.The basic on-disk data structure leveraged is a B-tree, indexed by
TID. BTree being a great data structure, fast and versatile. Note this
is not referring to existing Btree indexes, but instead net new
separate BTree for table data storage.TID - logical row identifier:
TID is just a 48-bit row identifier. The traditional division into
block and offset numbers is meaningless. In order to find a tuple with
a given TID, one must always descend the B-tree. Having logical TID
provides flexibility to move the tuples around different pages on page
splits or page merges can be performed.The internal pages of the B-tree are super simple and boring. Each
internal page just stores an array of TID and downlink pairs. Let's
focus on the leaf level. Leaf blocks have short uncompressed header,
followed by btree items. Two kinds of items exist:- plain item, holds one tuple or one datum, uncompressed payload
- a "container item", holds multiple plain items, compressed payload+-----------------------------
| Fixed-size page header:
|
| LSN
| TID low and hi key (for Lehman & Yao B-tree operations)
| left and right page pointers
|
| Items:
|
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | undo pointer | payload (plain item)
| TID | size | flags | undo pointer | payload (plain item)
| ...
|
+----------------------------Row store
---------The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, a undo record pointer, and the actual
tuple data uncompressed.
Storing undo record pointer with each tuple can take quite a lot of
space in cases where you can't compress them. Have you thought how
will you implement the multi-locker scheme in this design? In zheap,
we have used undo for the same and it is easy to imagine when you have
separate transaction slots for each transaction. I am not sure how
will you implement the same here.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 10/04/2019 09:29, Amit Kapila wrote:
On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
Row store
---------The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, a undo record pointer, and the actual
tuple data uncompressed.

Storing undo record pointer with each tuple can take quite a lot of
space in cases where you can't compress them.
Yeah. This does depend on compression to eliminate the unused fields
quite heavily at the moment. But you could have a flag in the header to
indicate "no undo pointer needed", and just leave it out, when it's needed.
Have you thought how will you implement the multi-locker scheme in
this design? In zheap, we have used undo for the same and it is easy
to imagine when you have separate transaction slots for each
transaction. I am not sure how will you implement the same here.
I've been thinking that the undo record would store all the XIDs
involved. So if there are multiple lockers, the UNDO record would store
a list of XIDs. Alternatively, I suppose you could store multiple UNDO
pointers for the same tuple.
- Heikki
On 10.04.2019 10:25, Heikki Linnakangas wrote:
On 10/04/2019 09:29, Amit Kapila wrote:
On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal@pivotal.io>
wrote:

Row store
---------The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, a undo record pointer, and the actual
tuple data uncompressed.

Storing undo record pointer with each tuple can take quite a lot of
space in cases where you can't compress them.

Yeah. This does depend on compression to eliminate the unused fields
quite heavily at the moment. But you could have a flag in the header
to indicate "no undo pointer needed", and just leave it out, when it's
needed.Have you thought how will you implement the multi-locker scheme in
this design? In zheap, we have used undo for the same and it is easy
to imagine when you have separate transaction slots for each
transaction. I am not sure how will you implement the same here.

I've been thinking that the undo record would store all the XIDs
involved. So if there are multiple lockers, the UNDO record would
store a list of XIDs. Alternatively, I suppose you could store
multiple UNDO pointers for the same tuple.

- Heikki
I am also a little bit confused about UNDO records and MVCC support in
Zedstore. Actually a columnar store is mostly needed for analytics, for
read-only or append-only data. One of the disadvantages of Postgres is the
quite large per-record space overhead caused by MVCC.
It may be critical if you want to store huge timeseries with a relatively
small number of columns (like measurements of some sensor).
It would be nice to have a storage format which reduces this overhead when
it is not needed (data is not updated).
Right now, even without UNDO pages, the size of zedstore is larger than the
size of the original Postgres table.
It seems to be very strange.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 10/04/2019 10:38, Konstantin Knizhnik wrote:
I also a little bit confused about UNDO records and MVCC support in
Zedstore. Actually columnar store is mostly needed for analytic for
read-only or append-only data. One of the disadvantages of Postgres is
quite larger per-record space overhead caused by MVCC.
It may be critical if you want to store huge timeseries with relatively
small number of columns (like measurements of some sensor).
It will be nice to have storage format which reduce this overhead when
it is not needed (data is not updated).
Sure. Definitely something we need to optimize.
Right now, even without UNDO pages, size of zedstore is larger than size
of original Postgres table.
It seems to be very strange.
If you have a table with a lot of columns, but each column is small,
e.g. lots of boolean columns, the item headers that zedstore currently
stores for each datum take up a lot of space. We'll need to squeeze
those harder to make this competitive. Instead of storing a header for
each datum, if a group of consecutive tuples have the same visibility
information, we could store the header just once, with an array of the
datums, for example.
- Heikki
On Wed, Apr 10, 2019 at 12:55 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 10/04/2019 09:29, Amit Kapila wrote:
On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
Row store
---------The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, a undo record pointer, and the actual
tuple data uncompressed.

Storing undo record pointer with each tuple can take quite a lot of
space in cases where you can't compress them.

Yeah. This does depend on compression to eliminate the unused fields
quite heavily at the moment. But you could have a flag in the header to
indicate "no undo pointer needed", and just leave it out, when it's needed.Have you thought how will you implement the multi-locker scheme in
this design? In zheap, we have used undo for the same and it is easy
to imagine when you have separate transaction slots for each
transaction. I am not sure how will you implement the same here.

I've been thinking that the undo record would store all the XIDs
involved. So if there are multiple lockers, the UNDO record would store
a list of XIDs.
This will be quite tricky. Whenever a new locker arrives, you first
need to fetch previous undo to see which all XIDs already have a lock
on it. Not only that, it will make discarding undo's way complicated.
We have considered this approach before implementing the current
approach in zheap.
Alternatively, I suppose you could store multiple UNDO
pointers for the same tuple.
This will not only make the length of the tuple unnecessarily long but
would make it much harder to reclaim that space once the corresponding
undo is discarded.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 9/04/19 12:27 PM, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.
Very nice. I realize that it is very early days, but applying this patch
I've managed to stumble over some compression bugs doing some COPY's:
benchz=# COPY dim1 FROM '/data0/dump/dim1.dat'
USING DELIMITERS ',';
psql: ERROR: compression failed. what now?
CONTEXT: COPY dim1, line 458
The log has:
2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed.
what now?
2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458
2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish,
zedstore_compression.c:287
2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM
'/data0/dump/dim1.dat'
USING DELIMITERS ',';
The dataset is generated from an old DW benchmark I wrote
(https://sourceforge.net/projects/benchw/). The row concerned looks like:
457,457th interesting measure,1th measure
type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp
458,458th interesting measure,2th measure
type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg
I'll see if changing to LZ4 makes any difference.
best wishes
Mark
On 11/04/19 4:01 PM, Mark Kirkwood wrote:
On 9/04/19 12:27 PM, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.

Very nice. I realize that it is very early days, but applying this
patch I've managed to stumble over some compression bugs doing some
COPY's:

benchz=# COPY dim1 FROM '/data0/dump/dim1.dat'
USING DELIMITERS ',';
psql: ERROR: compression failed. what now?
CONTEXT: COPY dim1, line 458

The log has:
2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed.
what now?
2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458
2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish,
zedstore_compression.c:287
2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM
'/data0/dump/dim1.dat'
USING DELIMITERS ',';

The dataset is generated from an old DW benchmark I wrote
(https://sourceforge.net/projects/benchw/). The row concerned looks like:

457,457th interesting measure,1th measure
type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp
458,458th interesting measure,2th measure
type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg

I'll see if changing to LZ4 makes any difference.
The COPY works with LZ4 configured.
On Apr 10, 2019, at 9:08 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
On 11/04/19 4:01 PM, Mark Kirkwood wrote:
On 9/04/19 12:27 PM, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.

Very nice. I realize that it is very early days, but applying this patch I've managed to stumble over some compression bugs doing some COPY's:
benchz=# COPY dim1 FROM '/data0/dump/dim1.dat'
USING DELIMITERS ',';
psql: ERROR: compression failed. what now?
CONTEXT: COPY dim1, line 458

The log has:
2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed. what now?
2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458
2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, zedstore_compression.c:287
2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM '/data0/dump/dim1.dat'
USING DELIMITERS ',';

The dataset is generated from an old DW benchmark I wrote
(https://sourceforge.net/projects/benchw/). The row concerned looks like:
457,457th interesting measure,1th measure type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp
458,458th interesting measure,2th measure type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg

I'll see if changing to LZ4 makes any difference.
The COPY works with LZ4 configured.
Thank you for trying it out. Yes, we noticed that for certain patterns pg_lzcompress() actually requires much larger output buffers; for one 86-byte source it required a 2296-byte output buffer. The current zedstore code doesn't handle this case and errors out. LZ4 works fine for the same patterns, so we would highly recommend using LZ4 only, as its speed is very fast as well.
On 11.04.2019 8:03, Ashwin Agrawal wrote:
On Apr 10, 2019, at 9:08 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
On 11/04/19 4:01 PM, Mark Kirkwood wrote:
On 9/04/19 12:27 PM, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.

Very nice. I realize that it is very early days, but applying this patch I've managed to stumble over some compression bugs doing some COPY's:
benchz=# COPY dim1 FROM '/data0/dump/dim1.dat'
USING DELIMITERS ',';
psql: ERROR: compression failed. what now?
CONTEXT: COPY dim1, line 458

The log has:
2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed. what now?
2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458
2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, zedstore_compression.c:287
2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM '/data0/dump/dim1.dat'
USING DELIMITERS ',';

The dataset is generated from an old DW benchmark I wrote
(https://sourceforge.net/projects/benchw/). The row concerned looks like:
457,457th interesting measure,1th measure type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp
458,458th interesting measure,2th measure type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg

I'll see if changing to LZ4 makes any difference.
The COPY works with LZ4 configured.
Thank you for trying it out. Yes, noticed for certain patterns pg_lzcompress() actually requires much larger output buffers. Like for one 86 len source it required 2296 len output buffer. Current zedstore code doesn’t handle this case and errors out. LZ4 for same patterns works fine, would highly recommend using LZ4 only, as anyways speed is very fast as well with it.
The internal Postgres lz compressor is really very inefficient compared
with other compression algorithms.
But in any case you should never assume that the size of compressed data
will be smaller than the size of the plain data.
Moreover, if you are trying to compress already compressed data, then the
result will almost always be larger.
If the size of the compressed data is larger (or even not significantly
smaller) than the size of the raw data, then you should store the original
data.
lz4 is actually very fast, but it doesn't provide a good compression ratio.
These are my results of compressing pgbench data using different compressors:

Configuration            Size (Gb)   Time (sec)
no compression             15.31         92
zlib (default level)        2.37        284
zlib (best speed)           2.43        191
postgres internal lz        3.89        214
lz4                         4.12         95
snappy                      5.18         99
lzfse (apple)               2.80       1099
zstd                        1.69        125

You see that zstd provides an almost 2 times better compression ratio at
almost the same speed.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, 9 Apr 2019 at 02:27, Ashwin Agrawal <aagrawal@pivotal.io> wrote:
Reading about it reminds me of this work -- TAG column storage
(http://www09.sigmod.org/sigmod/record/issues/0703/03.article-graefe.pdf).
Isn't this storage system inspired by that work, with TID as the TAG?
It is not referenced here, which made me wonder.
--
Regards,
Rafia Sabih
On Tue, 9 Apr 2019 at 20:29, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.
We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.
I think that conclusion may be premature. There seem to be a bunch of
different ways of doing columnar storage, so I don't know how we can
be sure that one size will fit all, or that the first thing we accept
will be the best thing.
Of course, we probably do not want to accept a ton of storage manager
implementations in core. I think if people propose implementations
that are poor quality, or missing important features, or don't have
significantly different use cases from the ones we've already got,
it's reasonable to reject those. But I wouldn't be prepared to say
that if we have two significantly different column stores that are both
awesome code with a complete feature set and significantly disjoint
use cases, we should reject the second one just because it is also a
column store. I think that won't get out of control because few
people will be able to produce really high-quality implementations.
This stuff is hard, which I think is also why we only have 6.5 index
AMs in core after many, many years. And our standards have gone up
over the years - not all of those would pass muster if they were
proposed today.
BTW, can I express a small measure of disappointment that the name for
the thing under discussion on this thread chose to be called
"zedstore"? That seems to invite confusion with "zheap", especially
in parts of the world where the last letter of the alphabet is
pronounced "zed," where people are going to say zed-heap and
zed-store. Brr.
+1 on Brr. Looks like Thomas's and your thought that having a 'z' makes things
popular/stylish, etc. is true after all; I was skeptical back then.
--
Regards,
Rafia Sabih
On 11/04/2019 16:12, Rafia Sabih wrote:
On Tue, 9 Apr 2019 at 20:29, Robert Haas <robertmhaas@gmail.com> wrote:
BTW, can I express a small measure of disappointment that the name for
the thing under discussion on this thread chose to be called
"zedstore"? That seems to invite confusion with "zheap", especially
in parts of the world where the last letter of the alphabet is
pronounced "zed," where people are going to say zed-heap and
zed-store. Brr.

+1 on Brr. Looks like Thomas's and your thought that having a 'z' makes
things popular/stylish, etc. is true after all; I was skeptical back then.

BrrStore works for me, too ;-).
- Heikki
On 4/11/19 10:46 AM, Konstantin Knizhnik wrote:
These are my results of compressing pgbench data using different compressors:

Configuration          Size (Gb)   Time (sec)
no compression           15.31         92
zlib (default level)      2.37        284
zlib (best speed)         2.43        191
postgres internal lz      3.89        214
lz4                       4.12         95
snappy                    5.18         99
lzfse                     2.80       1099
(apple)                   2.80       1099
                          1.69        125

You see that zstd provides an almost 2 times better compression ratio at
almost the same speed.

What is "(apple) 2.80 1099"? Was that intended to be zstd?
Andreas
On 11.04.2019 16:18, Andreas Karlsson wrote:
On 4/11/19 10:46 AM, Konstantin Knizhnik wrote:
These are my results of compressing pgbench data using different compressors:

Configuration          Size (Gb)   Time (sec)
no compression           15.31         92
zlib (default level)      2.37        284
zlib (best speed)         2.43        191
postgres internal lz      3.89        214
lz4                       4.12         95
snappy                    5.18         99
lzfse                     2.80       1099
(apple)                   2.80       1099
                          1.69        125

You see that zstd provides an almost 2 times better compression ratio at
almost the same speed.

What is "(apple) 2.80 1099"? Was that intended to be zstd?
Andreas
Ugh...
Cut and paste problems.
The whole document can be found here:
http://garret.ru/PageLevelCompression.pdf
lzfse (apple) 2.80 1099
zstd (facebook) 1.69 125
zstd is a compression algorithm proposed by Facebook:
https://github.com/facebook/zstd
Looks like it provides the best speed/compression ratio result.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Apr 11, 2019 at 3:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 11/04/2019 16:12, Rafia Sabih wrote:
On Tue, 9 Apr 2019 at 20:29, Robert Haas <robertmhaas@gmail.com> wrote:
BTW, can I express a small measure of disappointment that the name for
the thing under discussion on this thread chose to be called
"zedstore"? That seems to invite confusion with "zheap", especially
in parts of the world where the last letter of the alphabet is
pronounced "zed," where people are going to say zed-heap and
zed-store. Brr.

+1 on Brr. Looks like Thomas's and your thought that having a 'z' makes
things popular/stylish, etc. is true after all; I was skeptical back then.
BrrStore works for me, too ;-).
Also works as a reference to the Finnish climate?
(Sorry, couldn't help myself)
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On Tue, 9 Apr 2019 at 02:27, Ashwin Agrawal <aagrawal@pivotal.io> wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.The objective is to gather feedback on design and approach to the
same. The implementation has core basic pieces working but not close
to complete.Big thank you to Andres, Haribabu and team for the table access method
API's. Leveraged the API's for implementing zedstore, and proves API
to be in very good shape. Had to enhance the same minimally but
in-general didn't had to touch executor much.Motivations / Objectives
* Performance improvement for queries selecting subset of columns
(reduced IO).
* Reduced on-disk footprint compared to heap table. Shorter tuple
headers and also leveraging compression of similar type data
* Be first-class citizen in the Postgres architecture (tables data can
just independently live in columnar storage)
* Fully MVCC compliant
* All Indexes supported
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together.
* Provide better control over bloat (similar to zheap)
* Eliminate need for separate toast tables
* Faster add / drop column or changing data type of column by avoiding
full rewrite of the table.High-level Design - B-trees for the win!
========================================To start simple, let's ignore column store aspect for a moment and
consider it as compressed row store. The column store is natural
extension of this concept, explained in next section.The basic on-disk data structure leveraged is a B-tree, indexed by
TID. BTree being a great data structure, fast and versatile. Note this
is not referring to existing Btree indexes, but instead net new
separate BTree for table data storage.TID - logical row identifier:
TID is just a 48-bit row identifier. The traditional division into
block and offset numbers is meaningless. In order to find a tuple with
a given TID, one must always descend the B-tree. Having logical TID
provides flexibility to move the tuples around different pages on page
splits or page merges can be performed.

In my understanding these TIDs will follow the datatype of the current
ones. Then my question is will TIDs be reusable here and how will the
reusable range of TIDs be determined? If not, wouldn't that become a hard
limit to the number of insertions performed on a table?
The internal pages of the B-tree are super simple and boring. Each
internal page just stores an array of TID and downlink pairs. Let's
focus on the leaf level. Leaf blocks have short uncompressed header,
followed by btree items. Two kinds of items exist:- plain item, holds one tuple or one datum, uncompressed payload
- a "container item", holds multiple plain items, compressed payload+-----------------------------
| Fixed-size page header:
|
| LSN
| TID low and hi key (for Lehman & Yao B-tree operations)
| left and right page pointers
|
| Items:
|
| TID | size | flags | uncompressed size | lastTID | payload (container
item)
| TID | size | flags | uncompressed size | lastTID | payload (container
item)
| TID | size | flags | undo pointer | payload (plain item)
| TID | size | flags | undo pointer | payload (plain item)
| ...
|
+----------------------------Row store
---------The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, a undo record pointer, and the actual
tuple data uncompressed.In uncompressed form, the page can be arbitrarily large. But after
compression, it must fit into a physical 8k block. If on insert or
update of a tuple, the page cannot be compressed below 8k anymore, the
page is split. Note that because TIDs are logical rather than physical
identifiers, we can freely move tuples from one physical page to
another during page split. A tuple's TID never changes.The buffer cache caches compressed blocks. Likewise, WAL-logging,
full-page images etc. work on compressed blocks. Uncompression is done
on-the-fly, as and when needed in backend-private memory, when
reading. For some compressions like rel encoding or delta encoding
tuples can be constructed directly from compressed data.Column store
------------A column store uses the same structure but we have *multiple* B-trees,
one for each column, all indexed by TID. The B-trees for all columns
are stored in the same physical file.A metapage at block 0, has links to the roots of the B-trees. Leaf
pages look the same, but instead of storing the whole tuple, stores
just a single attribute. To reconstruct a row with given TID, scan
descends down the B-trees for all the columns using that TID, and
fetches all attributes. Likewise, a sequential scan walks all the
B-trees in lockstep.So, in summary can imagine Zedstore as forest of B-trees, one for each
column, all indexed by TIDs.This way of laying out the data also easily allows for hybrid
row-column store, where some columns are stored together, and others
have a dedicated B-tree. Need to have user facing syntax to allow
specifying how to group the columns.Main reasons for storing data this way
--------------------------------------* Layout the data/tuples in mapped fashion instead of keeping the
logical to physical mapping separate from actual data. So, keep the
meta-data and data logically in single stream of file, avoiding the
need for separate forks/files to store meta-data and data.* Stick to fixed size physical blocks. Variable size blocks pose need
for increased logical to physical mapping maintenance, plus
restrictions on concurrency of writes and reads to files. Hence
adopt compression to fit fixed size blocks instead of other way
round.MVCC
----
MVCC works very similar to zheap for zedstore. Undo record pointers
are used to implement MVCC. Transaction information if not directly
stored with the data. In zheap, there's a small, fixed, number of
"transaction slots" on each page, but zedstore has undo pointer with
each item directly; in normal cases, the compression squeezes this
down to almost nothing.

How about using a separate BTree for undo also?
Implementation
==============Insert:
Inserting a new row, splits the row into datums. Then for first column
decide which block to insert the same to, and pick a TID for it, and
write undo record for the same. Rest of the columns are inserted using
that same TID and point to same undo position.Compression:
Items are added to Btree in uncompressed form. If page is full and new
item can't be added, compression kicks in. Existing uncompressed items
(plain items) of the page are passed to compressor for
compression. Already compressed items are added back as is. Page is
rewritten with compressed data with new item added to it. If even
after compression, can't add item to page, then page split happens.Toast:
When an overly large datum is stored, it is divided into chunks, and
each chunk is stored on a dedicated toast page within the same
physical file. The toast pages of a datum form list, each page has a
next/prev pointer.Select:
Property is added to Table AM to convey if column projection is
leveraged by AM for scans. While scanning tables with AM leveraging
this property, executor parses the plan. Leverages the target list and
quals to find the required columns for query. This list is passed down
to AM on beginscan. Zedstore uses this column projection list to only
pull data from selected columns. Virtual tuple table slot is used to
pass back the datums for subset of columns.

I am curious about how delete is working here? Will the TID entries just be
marked deleted as in the current heap, or will they actually be removed and
the whole btree restructured (if required)?
Similarly, for updates, will they just be delete+insert or will something
cleverer happen there?
Will there be in-place updates, and in what scenarios will they be possible?
There is nothing mentioned in this direction; however, the use of undo files
assures me there must be some in-place updates somewhere.

References
1] https://github.com/greenplum-db/postgres/tree/zedstore
2]
/messages/by-id/CAJrrPGfaC7WC9NK6PTTy6YN-NN+hCy8xOLAh2doYhVg5d6HsAA@mail.gmail.com
3]
/messages/by-id/20150611230316.GM133018@postgresql.org
4]
/messages/by-id/20150831225328.GM2912@alvherre.pgsql
5] https://github.com/citusdata/cstore_fdw
6]
/messages/by-id/CAOykqKfko-n5YiBJtk-ocVdp+j92Apu5MJBwbGGh4awRY5NCuQ@mail.gmail.com
7]
/messages/by-id/d0fc97bd-7ec8-2388-e4a6-0fda86d71a43@iki.fi
--
Regards,
Rafia Sabih
Ashwin Agrawal <aagrawal@pivotal.io> writes:
Thank you for trying it out. Yes, noticed for certain patterns pg_lzcompress() actually requires much larger output buffers. Like for one 86 len source it required 2296 len output buffer. Current zedstore code doesn’t handle this case and errors out. LZ4 for same patterns works fine, would highly recommend using LZ4 only, as anyways speed is very fast as well with it.
You realize of course that *every* compression method has some inputs that
it makes bigger. If your code assumes that compression always produces a
smaller string, that's a bug in your code, not the compression algorithm.
regards, tom lane
On 11/04/2019 17:54, Tom Lane wrote:
Ashwin Agrawal <aagrawal@pivotal.io> writes:
Thank you for trying it out. Yes, noticed for certain patterns pg_lzcompress() actually requires much larger output buffers. Like for one 86 len source it required 2296 len output buffer. Current zedstore code doesn’t handle this case and errors out. LZ4 for same patterns works fine, would highly recommend using LZ4 only, as anyways speed is very fast as well with it.
You realize of course that *every* compression method has some inputs that
it makes bigger. If your code assumes that compression always produces a
smaller string, that's a bug in your code, not the compression algorithm.
Of course. The code is not making that assumption, although clearly
there is a bug there somewhere because it throws that error. It's early
days..
In practice it's easy to weasel out of that, by storing the data
uncompressed, if compression would make it longer. Then you need an
extra flag somewhere to indicate whether it's compressed or not. It
doesn't break the theoretical limit because the actual stored length is
then original length + 1 bit, but it's usually not hard to find a place
for one extra bit.
- Heikki
On Thu, Apr 11, 2019 at 6:06 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
Reading about it reminds me of this work -- TAG column storage
(http://www09.sigmod.org/sigmod/record/issues/0703/03.article-graefe.pdf).
Isn't this storage system inspired by that work, with TID as the TAG?
It is not referenced here, which made me wonder.

I don't think they're particularly similar, because that paper
describes an architecture based on using purely logical row
identifiers, which is not what a TID is. TID is a hybrid
physical/logical identifier, sometimes called a "physiological"
identifier, which will have significant overhead. Ashwin said that
ZedStore TIDs are logical identifiers, but I don't see how that's
compatible with a hybrid row/column design (unless you map heap TID to
logical row identifier using a separate B-Tree).
The big idea with Graefe's TAG design is that there is practically no
storage overhead for these logical identifiers, because each entry's
identifier is calculated by adding its slot number to the page's
tag/low key. The ZedStore design, in contrast, explicitly stores TID
for every entry. ZedStore seems more flexible for that reason, but at
the same time the per-datum overhead seems very high to me. Maybe
prefix compression could help here, which a low key and high key can
do rather well.
--
Peter Geoghegan
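(A minimal sketch to illustrate the contrast Peter describes; the function, struct, and field names here are hypothetical, not zedstore's or Graefe's actual layouts.)

#include <stdint.h>

typedef uint64_t zstid;            /* 48-bit logical TID, illustrative only */

/*
 * Graefe-style TAG: the identifier is implicit.  It is derived from the
 * page's low key plus the entry's slot number, so it costs no per-entry
 * storage at all.
 */
static inline zstid
tag_entry_id(zstid page_lowkey, uint16_t slotno)
{
    return page_lowkey + slotno;
}

/*
 * Zedstore-style item: the TID is stored explicitly with every entry.
 * That is more flexible (gaps after deletes, free movement of tuples),
 * but it costs space per datum unless the TID is prefix-compressed
 * against the page's low key.
 */
typedef struct
{
    zstid       tid;               /* explicit logical TID */
    uint16_t    size;              /* item size */
    uint16_t    flags;
    /* payload follows */
} ExplicitItem;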
On Tue, Apr 09, 2019 at 02:29:09PM -0400, Robert Haas wrote:
On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.
We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.
I think that conclusion may be premature. There seem to be a bunch of
different ways of doing columnar storage, so I don't know how we can
be sure that one size will fit all, or that the first thing we accept
will be the best thing.
Of course, we probably do not want to accept a ton of storage manager
implementations in core. I think if people propose implementations
that are poor quality, or missing important features, or don't have
significantly different use cases from the ones we've already got,
it's reasonable to reject those. But I wouldn't be prepared to say
that if we have two significantly different column stores that are both
awesome code with a complete feature set and significantly disjoint
use cases, we should reject the second one just because it is also a
column store. I think that won't get out of control because few
people will be able to produce really high-quality implementations.
This stuff is hard, which I think is also why we only have 6.5 index
AMs in core after many, many years. And our standards have gone up
over the years - not all of those would pass muster if they were
proposed today.
It's not clear to me whether you're arguing for not having any such
implementation in core, or having multiple ones? I think we should aim
to have at least one in-core implementation, even if it's not the best
possible one for all sizes. It's not like our rowstore is the best
possible implementation for all cases either.
I think having a colstore in core is important not just for adoption,
but also for testing and development of the executor / planner bits.
If we have multiple candidates with sufficient code quality, then we may
consider including both. I don't think it's very likely to happen in the
same release, considering how much work it will require. And I have no
idea if zedstore or VOPS are / will be the only candidates - it's way
too early at this point.
FWIW I personally plan to focus primarily on the features that aim to
be included in core, and that applies to colstores too.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 09, 2019 at 02:03:09PM -0700, Ashwin Agrawal wrote:
On Tue, Apr 9, 2019 at 9:13 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
On 09.04.2019 18:51, Alvaro Herrera wrote:
On 2019-Apr-09, Konstantin Knizhnik wrote:
On 09.04.2019 3:27, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage
(table
access method). Attaching the patch and link to github branch [1] to
follow along.

Thank you for publishing this patch. IMHO Postgres is really missing
normal
support of columnar store
Yep.
and table access method API is the best way of integrating it.
This is not surprising, considering that columnar store is precisely
the
reason for starting the work on table AMs.
We should certainly look into integrating some sort of columnar
storage
in mainline. Not sure which of zedstore or VOPS is the best
candidate,
or maybe we'll have some other proposal. My feeling is that having
more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.

There are two different aspects:
1. Store format.
2. Vector execution.

1. VOPS is using a mixed format, something similar to Apache Parquet.
Tuples are stored vertically, but only inside one page.
It tries to minimize trade-offs between true horizontal and true
vertical storage:
first is most optimal for selecting all rows, while second - for
selecting small subset of rows.
To make this approach more efficient, it is better to use large page
size - the default Postgres 8k page is not enough.

From my point of view such a format is better than pure vertical storage,
which will be very inefficient if a query accesses a larger number of columns.
This problem can be somehow addressed by creating projections: grouping
several columns together. But it requires more space for storing
multiple projections.

Right, storing all the columns in a single page doesn't give any savings on
IO.
Yeah, although you could save some I/O thanks to compression even in
that case.
2. Doesn't matter which format we choose, to take all advantages of
vertical representation we need to use vector operations.
And Postgres executor doesn't support them now. This is why VOPS is
using some hacks, which is definitely not good and not working in all
cases.
zedstore is not using such hacks and ... this is why it never can reach
VOPS performance.

Vectorized execution is orthogonal to storage format. It can even be
applied to a row store with performance gains. Similarly, a column store
without vectorized execution also gives benefits such as better
compression ratios. A column store combined with vectorized execution is
a lot more performant, agreed. Zedstore currently is focused on having the
AM piece in place, which fits the postgres ecosystem and supports all the
features heap does.
Not sure it's quite orthogonal. Sure, you can apply it to rowstores too,
but I'd say column stores are naturally better suited for it.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 11, 2019 at 04:52:33PM +0300, Konstantin Knizhnik wrote:
On 11.04.2019 16:18, Andreas Karlsson wrote:
On 4/11/19 10:46 AM, Konstantin Knizhnik wrote:
These are my results of compressing pgbench data using different compressors:

+----------------------+-----------+------------+
| Configuration        | Size (Gb) | Time (sec) |
+----------------------+-----------+------------+
| no compression       | 15.31     | 92         |
| zlib (default level) | 2.37      | 284        |
| zlib (best speed)    | 2.43      | 191        |
| postgres internal lz | 3.89      | 214        |
| lz4                  | 4.12      | 95         |
| snappy               | 5.18      | 99         |
| lzfse                | 2.80      | 1099       |
| (apple) 2.80 1099    | 1.69      | 125        |
+----------------------+-----------+------------+

You see that zstd provides an almost 2 times better compression ratio at
almost the same speed.

What is "(apple) 2.80 1099"? Was that intended to be zstd?
Andreas
Ugh...
Cut and paste problems.
The whole document can be found here:
http://garret.ru/PageLevelCompression.pdf

lzfse (apple)    2.80    1099
zstd (facebook)  1.69    125

zstd is a compression algorithm proposed by Facebook:
https://github.com/facebook/zstd
Looks like it provides the best speed/compression ratio result.
I think those comparisons are cute and we did a fair amount of them when
considering a drop-in replacement for pglz, but ultimately it might be a
bit pointless because:
(a) it very much depends on the dataset (one algorithm may work great on
one type of data, suck on another)
(b) different systems may require different trade-offs (high ingestion
rate vs. best compression ratio)
(c) decompression speed may be much more important
What I'm trying to say is that we shouldn't obsess about picking one
particular algorithm too much, because it's entirely pointless. Instead,
we should probably design the system to support different compression
algorithms, ideally at column level.
Also, while these general purpose algorithms are nice, what I think will
be important in later stages of colstore development will be compression
algorithms allowing execution directly on the compressed data (like RLE,
dictionary and similar approaches).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
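(To illustrate the kind of "execution directly on compressed data" Tomas mentions, here is a minimal run-length-encoding sketch; the layout and names are hypothetical and purely for illustration.)

#include <stdint.h>
#include <stddef.h>

/* One run of a run-length-encoded int32 column: value repeated count times. */
typedef struct
{
    int32_t     value;
    uint32_t    count;
} RleRun;

/*
 * SUM() evaluated directly on the compressed representation: each run
 * contributes value * count, without materializing the individual rows.
 */
static int64_t
rle_sum(const RleRun *runs, size_t nruns)
{
    int64_t     sum = 0;

    for (size_t i = 0; i < nruns; i++)
        sum += (int64_t) runs[i].value * runs[i].count;
    return sum;
}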
On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
On 11/04/2019 17:54, Tom Lane wrote:
Ashwin Agrawal <aagrawal@pivotal.io> writes:
Thank you for trying it out. Yes, noticed for certain patterns pg_lzcompress() actually requires much larger output buffers. Like for one 86 len source it required 2296 len output buffer. Current zedstore code doesn’t handle this case and errors out. LZ4 for same patterns works fine, would highly recommend using LZ4 only, as anyways speed is very fast as well with it.
You realize of course that *every* compression method has some inputs that
it makes bigger. If your code assumes that compression always produces a
smaller string, that's a bug in your code, not the compression algorithm.

Of course. The code is not making that assumption, although clearly
there is a bug there somewhere because it throws that error. It's
early days..

In practice it's easy to weasel out of that, by storing the data
uncompressed, if compression would make it longer. Then you need an
extra flag somewhere to indicate whether it's compressed or not. It
doesn't break the theoretical limit because the actual stored length
is then original length + 1 bit, but it's usually not hard to find a
place for one extra bit.
Don't we already have that flag, though? I see ZSCompressedBtreeItem has
t_flags, and there's ZSBT_COMPRESSED, but maybe it's more complicated.
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2019-04-14 18:36:18 +0200, Tomas Vondra wrote:
I think those comparisons are cute and we did a fair amount of them when
considering a drop-in replacement for pglz, but ultimately it might be a
bit pointless because:

(a) it very much depends on the dataset (one algorithm may work great on
one type of data, suck on another)

(b) different systems may require different trade-offs (high ingestion
rate vs. best compression ratio)

(c) decompression speed may be much more important

What I'm trying to say is that we shouldn't obsess about picking one
particular algorithm too much, because it's entirely pointless. Instead,
we should probably design the system to support different compression
algorithms, ideally at column level.
I think we still need to pick a default algorithm, and realistically
that's going to be used by like 95% of the users.
Greetings,
Andres Freund
On Mon, Apr 08, 2019 at 05:27:05PM -0700, Ashwin Agrawal wrote:
Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.The objective is to gather feedback on design and approach to the
same. The implementation has core basic pieces working but not close
to complete.Big thank you to Andres, Haribabu and team for the table access method
API's. Leveraged the API's for implementing zedstore, and proves API
to be in very good shape. Had to enhance the same minimally but
in-general didn't had to touch executor much.

Motivations / Objectives
* Performance improvement for queries selecting subset of columns
  (reduced IO).
* Reduced on-disk footprint compared to heap table. Shorter tuple
  headers and also leveraging compression of similar type data
* Be first-class citizen in the Postgres architecture (tables data can
  just independently live in columnar storage)
* Fully MVCC compliant
* All Indexes supported
* Hybrid row-column store, where some columns are stored together, and
  others separately. Provide flexibility of granularity on how to
  divide the columns. Columns accessed together can be stored
  together.
* Provide better control over bloat (similar to zheap)
* Eliminate need for separate toast tables
* Faster add / drop column or changing data type of column by avoiding
  full rewrite of the table.

Cool. Me gusta.
High-level Design - B-trees for the win!
========================================To start simple, let's ignore column store aspect for a moment and
consider it as compressed row store. The column store is natural
extension of this concept, explained in next section.The basic on-disk data structure leveraged is a B-tree, indexed by
TID. BTree being a great data structure, fast and versatile. Note this
is not referring to existing Btree indexes, but instead net new
separate BTree for table data storage.TID - logical row identifier:
TID is just a 48-bit row identifier. The traditional division into
block and offset numbers is meaningless. In order to find a tuple with
a given TID, one must always descend the B-tree. Having logical TID
provides flexibility to move the tuples around different pages on page
splits or page merges can be performed.

So if TIDs are redefined this way, how does this affect BRIN indexes? I mean,
that's a lightweight indexing scheme which however assumes TIDs encode a
certain amount of locality - so this probably makes them (and Bitmap
Heap Scans in general) much less efficient. That's a bit unfortunate,
although I don't see a way around it :-(
The internal pages of the B-tree are super simple and boring. Each
internal page just stores an array of TID and downlink pairs. Let's
focus on the leaf level. Leaf blocks have short uncompressed header,
followed by btree items. Two kinds of items exist:

- plain item, holds one tuple or one datum, uncompressed payload
- a "container item", holds multiple plain items, compressed payload

+-----------------------------
| Fixed-size page header:
|
|   LSN
|   TID low and hi key (for Lehman & Yao B-tree operations)
|   left and right page pointers
|
| Items:
|
|   TID | size | flags | uncompressed size | lastTID | payload (container item)
|   TID | size | flags | uncompressed size | lastTID | payload (container item)
|   TID | size | flags | undo pointer | payload (plain item)
|   TID | size | flags | undo pointer | payload (plain item)
|   ...
|
+----------------------------

So if I understand it correctly, ZSUncompressedBtreeItem is the "plain"
item and ZSCompressedBtreeItem is the container one. Correct?
I find it a bit confusing, and I too ran into the issue with data that
can't be compressed, so I think the "container" should support both
compressed and uncompressed data. Heikki already mentioned that, so I
suppose it's just not implemented yet. That however means the name of
the "compressed" struct gets confusing, so I suggest to rename to:
ZSUncompressedBtreeItem -> ZSPlainBtreeItem
ZSCompressedBtreeItem -> ZSContainerBtreeItem
where the container supports both compressed and uncompressed mode.
Also, maybe we don't need to put "Btree" into every damn name ;-)
Looking at the ZSCompressedBtreeItem, I see it stores just first/last
TID for the compressed data. Won't that be insufficient when there are
some gaps due to deletions or something? Or perhaps I just don't
understand how it works.
Another thing is that with uncompressed size being stored as uint16,
won't that be insufficient for highly compressible data / large pages? I
mean, we can have pages up to 32kB, which is not that far.
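(A hypothetical sketch of the kind of unified container item suggested above, supporting both compressed and uncompressed payloads and a wider uncompressed-size field; the struct and field names are made up for illustration and are not zedstore's actual ZSCompressedBtreeItem.)

#include <stdint.h>

typedef uint64_t zstid;

#define CONTAINER_UNCOMPRESSED  0   /* payload stored verbatim */
#define CONTAINER_PGLZ          1
#define CONTAINER_LZ4           2

typedef struct
{
    zstid       t_firsttid;         /* first TID covered by this container */
    zstid       t_lasttid;          /* last TID covered by this container */
    uint16_t    t_size;             /* payload size as stored on the page */
    uint16_t    t_flags;
    uint8_t     t_compression;      /* one of the CONTAINER_* methods above */
    uint32_t    t_uncompressedsize; /* uint32 avoids the 16-bit limit */
    /* payload bytes follow */
} ContainerItem;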
Column store
------------A column store uses the same structure but we have *multiple* B-trees,
one for each column, all indexed by TID. The B-trees for all columns
are stored in the same physical file.A metapage at block 0, has links to the roots of the B-trees. Leaf
pages look the same, but instead of storing the whole tuple, stores
just a single attribute. To reconstruct a row with given TID, scan
descends down the B-trees for all the columns using that TID, and
fetches all attributes. Likewise, a sequential scan walks all the
B-trees in lockstep.
OK, so data for all the columns are stored in separate btrees, but in
the same physical file. Wouldn't it be more convenient to have one
relfilenode per column?
That would also mean the 32TB limit applies to individual columns, not
the whole table. Of course, it'd be more complicated and partitioning
allows us to work around that limit.
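(For reference on where that limit comes from: block numbers are 32-bit, so with the default 8 kB block size a single relfilenode tops out at 2^32 blocks * 8 kB = 32 TB.)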
So, in summary can imagine Zedstore as forest of B-trees, one for each
column, all indexed by TIDs.This way of laying out the data also easily allows for hybrid
row-column store, where some columns are stored together, and others
have a dedicated B-tree. Need to have user facing syntax to allow
specifying how to group the columns.
OK, makes sense. Do you also envision supporting per-column / per-group
compression etc?
Main reasons for storing data this way
--------------------------------------* Layout the data/tuples in mapped fashion instead of keeping the
� logical to physical mapping separate from actual data. So, keep the
� meta-data and data logically in single stream of file, avoiding the
� need for separate forks/files to store meta-data and data.* Stick to fixed size physical blocks. Variable size blocks pose need
� for increased logical to physical mapping maintenance, plus
� restrictions on concurrency of writes and reads to files. Hence
� adopt compression to fit fixed size blocks instead of other way
� round.MVCC
----
MVCC works very similar to zheap for zedstore. Undo record pointers
are used to implement MVCC. Transaction information if not directly
stored with the data. In zheap, there's a small, fixed, number of
"transaction slots" on each page, but zedstore has undo pointer with
each item directly; in normal cases, the compression squeezes this
down to almost nothing.Implementation
==============Insert:
Inserting a new row, splits the row into datums. Then for first column
decide which block to insert the same to, and pick a TID for it, and
write undo record for the same. Rest of the columns are inserted using
that same TID and point to same undo position.
What about deletes? How do these work?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Apr 14, 2019 at 09:45:10AM -0700, Andres Freund wrote:
Hi,
On 2019-04-14 18:36:18 +0200, Tomas Vondra wrote:
I think those comparisons are cute and we did a fair amount of them when
considering a drop-in replacement for pglz, but ultimately it might be a
bit pointless because:

(a) it very much depends on the dataset (one algorithm may work great on
one type of data, suck on another)

(b) different systems may require different trade-offs (high ingestion
rate vs. best compression ratio)

(c) decompression speed may be much more important

What I'm trying to say is that we shouldn't obsess about picking one
particular algorithm too much, because it's entirely pointless. Instead,
we should probably design the system to support different compression
algorithms, ideally at column level.

I think we still need to pick a default algorithm, and realistically
that's going to be used by like 95% of the users.
True. Do you expect it to be specific to the column store, or should it be
set as a per-instance default (even for regular heap)?
FWIW I think the conclusion from past dev meetings was we're unlikely to
find anything better than lz4. I doubt that changed very much.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings,
* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
On Tue, Apr 09, 2019 at 02:29:09PM -0400, Robert Haas wrote:
On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.
We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.
I think that conclusion may be premature. There seem to be a bunch of
different ways of doing columnar storage, so I don't know how we can
be sure that one size will fit all, or that the first thing we accept
will be the best thing.
Of course, we probably do not want to accept a ton of storage manager
implementations in core. I think if people propose implementations
that are poor quality, or missing important features, or don't have
significantly different use cases from the ones we've already got,
it's reasonable to reject those. But I wouldn't be prepared to say
that if we have two significantly different column stores that are both
awesome code with a complete feature set and significantly disjoint
use cases, we should reject the second one just because it is also a
column store. I think that won't get out of control because few
people will be able to produce really high-quality implementations.

This stuff is hard, which I think is also why we only have 6.5 index
AMs in core after many, many years. And our standards have gone up
over the years - not all of those would pass muster if they were
proposed today.

It's not clear to me whether you're arguing for not having any such
implementation in core, or having multiple ones? I think we should aim
to have at least one in-core implementation, even if it's not the best
possible one for all sizes. It's not like our rowstore is the best
possible implementation for all cases either.

I think having a colstore in core is important not just for adoption,
but also for testing and development of the executor / planner bits.
Agreed.
If we have multiple candidates with sufficient code quality, then we may
consider including both. I don't think it's very likely to happen in the
same release, considering how much work it will require. And I have no
idea if zedstore or VOPS are / will be the only candidates - it's way
too early at this point.
Definitely, but having as many different indexes as we have is certainly
a good thing and we should be looking to a future where we have multiple
in-core options for row and column-oriented storage.
FWIW I personally plan to focus primarily on the features that aim to
be included in core, and that applies to colstores too.
Yeah, same here.
Thanks!
Stephen
On Sun, Apr 14, 2019 at 06:39:47PM +0200, Tomas Vondra wrote:
On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
On 11/04/2019 17:54, Tom Lane wrote:
Ashwin Agrawal <aagrawal@pivotal.io> writes:
Thank you for trying it out. Yes, noticed for certain patterns
pg_lzcompress() actually requires much larger output buffers. Like
for one 86 len source it required 2296 len output buffer. Current
zedstore code doesn’t handle this case and errors out. LZ4 for same
patterns works fine, would highly recommend using LZ4 only, as
anyways speed is very fast as well with it.

You realize of course that *every* compression method has some inputs
that it makes bigger. If your code assumes that compression always
produces a smaller string, that's a bug in your code, not the
compression algorithm.

Of course. The code is not making that assumption, although clearly
there is a bug there somewhere because it throws that error. It's
early days..

In practice it's easy to weasel out of that, by storing the data
uncompressed, if compression would make it longer. Then you need an
extra flag somewhere to indicate whether it's compressed or not. It
doesn't break the theoretical limit because the actual stored length
is then original length + 1 bit, but it's usually not hard to find a
place for one extra bit.

Don't we already have that flag, though? I see ZSCompressedBtreeItem
has t_flags, and there's ZSBT_COMPRESSED, but maybe it's more
complicated.
After thinking about this a bit more, I think a simple flag may not be
enough. It might be better to have some sort of ID of the compression
algorithm in each item, which would allow switching algorithm for new
data (which may be useful e.g after we add new stuff in core, or when
the initial choice was not the best one).
Of course, those are just wild thoughts at this point, it's not
something the current PoC has to solve right away.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Stephen Frost <sfrost@snowman.net> writes:
* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
I think having a colstore in core is important not just for adoption,
but also for testing and development of the executor / planner bits.
Agreed.
TBH, I thought the reason we were expending so much effort on a tableam
API was exactly so we *wouldn't* have to include such stuff in core.
There is a finite limit to how much stuff we can maintain as part of core.
We should embrace the notion that Postgres is an extensible system, rather
than build all the tooling for extension and then proceed to dump stuff
into core anyway.
If we have multiple candidates with sufficient code quality, then we may
consider including both.
Dear god, no.
regards, tom lane
On Sat, Apr 13, 2019 at 4:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Apr 11, 2019 at 6:06 AM Rafia Sabih <rafia.pghackers@gmail.com>
wrote:

Reading about it reminds me of this work -- TAG column storage(
Isn't this storage system inspired from there, with TID as the TAG?
It is not referenced here so made me wonder.
I don't think they're particularly similar, because that paper
describes an architecture based on using purely logical row
identifiers, which is not what a TID is. TID is a hybrid
physical/logical identifier, sometimes called a "physiological"
identifier, which will have significant overhead.
The storage system wasn't inspired by that paper, but it seems it also talks
about laying out column data in btrees, which is good to see. But as
pointed out by Peter, the main aspect the paper focuses on, saving
space for the TAG, isn't something zedstore plans to leverage, it being more
restrictive. As discussed below, we can use other alternatives to save space.
Ashwin said that
ZedStore TIDs are logical identifiers, but I don't see how that's
compatible with a hybrid row/column design (unless you map heap TID to
logical row identifier using a separate B-Tree).
Would like to know more specifics on this, Peter. We may have different
context on the hybrid row/column design. When we said the design supports
hybrid row/column families, it did not mean within the same table; a single
table cannot have some data stored row-oriented and some column-oriented. For a
table, the structure will be homogeneous. But it can easily support storing
all the columns together, a subset of columns together, or each column
separately, all connected together by TID.
The big idea with Graefe's TAG design is that there is practically no
storage overhead for these logical identifiers, because each entry's
identifier is calculated by adding its slot number to the page's
tag/low key. The ZedStore design, in contrast, explicitly stores TID
for every entry. ZedStore seems more flexible for that reason, but at
the same time the per-datum overhead seems very high to me. Maybe
prefix compression could help here, which a low key and high key can
do rather well.
Yes, the plan is to optimize out the TID space per datum, either by prefix
compression or delta compression or some other trick.
On Sun, Apr 14, 2019 at 9:40 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
On 11/04/2019 17:54, Tom Lane wrote:
Ashwin Agrawal <aagrawal@pivotal.io> writes:
Thank you for trying it out. Yes, noticed for certain patterns
pg_lzcompress() actually requires much larger output buffers. Like for one
86 len source it required 2296 len output buffer. Current zedstore code
doesn’t handle this case and errors out. LZ4 for same patterns works fine,
would highly recommend using LZ4 only, as anyways speed is very fast as
well with it.

You realize of course that *every* compression method has some inputs
that
it makes bigger. If your code assumes that compression always produces a
smaller string, that's a bug in your code, not the compression algorithm.

Of course. The code is not making that assumption, although clearly
there is a bug there somewhere because it throws that error. It's
early days..

In practice it's easy to weasel out of that, by storing the data
uncompressed, if compression would make it longer. Then you need an
extra flag somewhere to indicate whether it's compressed or not. It
doesn't break the theoretical limit because the actual stored length
is then original length + 1 bit, but it's usually not hard to find a
place for one extra bit.

Don't we already have that flag, though? I see ZSCompressedBtreeItem has
t_flags, and there's ZSBT_COMPRESSED, but maybe it's more complicated.
The flag ZSBT_COMPRESSED differentiates between a container (compressed) item
and a plain (uncompressed) item. The current code is written such that within a
container (compressed) item, all the data is compressed. If there is a need to
store some uncompressed data inside a container item, then an
additional flag would be required to indicate that; hence it's different
from ZSBT_COMPRESSED. One of the ways could be to simply not
store a datum in a container item if it can't be compressed, and instead store it
as a plain item with uncompressed data; then this additional flag won't be
required. Will know more once the code for this is written.
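For illustration, a minimal sketch of that fallback, assuming pglz as the
compressor (the real zedstore item-building code differs, and the helper name
is made up). The destination buffer needs room for PGLZ_MAX_OUTPUT(srclen)
bytes; when this returns false, the datum is stored as a plain uncompressed
item instead of going into the container item, so no extra flag is needed:

#include "postgres.h"
#include "common/pg_lzcompress.h"

static bool
try_compress_for_container(const char *src, int32 srclen,
                           char *dst, int32 *dstlen)
{
    int32   clen = pglz_compress(src, srclen, dst, PGLZ_strategy_default);

    if (clen < 0 || clen >= srclen)
        return false;           /* incompressible: caller stores a plain item */

    *dstlen = clen;
    return true;                /* caller packs the compressed bytes into the container item */
}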
On Mon, Apr 15, 2019 at 09:29:37AM -0700, Ashwin Agrawal wrote:
On Sun, Apr 14, 2019 at 9:40 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
On 11/04/2019 17:54, Tom Lane wrote:
Ashwin Agrawal <aagrawal@pivotal.io> writes:
Thank you for trying it out. Yes, noticed for certain patterns
pg_lzcompress() actually requires much larger output buffers. Like for
one 86 len source it required 2296 len output buffer. Current zedstore
code doesn’t handle this case and errors out. LZ4 for same patterns
works fine, would highly recommend using LZ4 only, as anyways speed is
very fast as well with it.

You realize of course that *every* compression method has some inputs
that
it makes bigger. If your code assumes that compression always
produces a
smaller string, that's a bug in your code, not the compression
algorithm.
Of course. The code is not making that assumption, although clearly
there is a bug there somewhere because it throws that error. It's
early days..

In practice it's easy to weasel out of that, by storing the data
uncompressed, if compression would make it longer. Then you need an
extra flag somewhere to indicate whether it's compressed or not. It
doesn't break the theoretical limit because the actual stored length
is then original length + 1 bit, but it's usually not hard to find a
place for one extra bit.

Don't we already have that flag, though? I see ZSCompressedBtreeItem has
t_flags, and there's ZSBT_COMPRESSED, but maybe it's more complicated.

The flag ZSBT_COMPRESSED differentiates between container (compressed)
item and plain (uncompressed item). Current code is writtten such that
within container (compressed) item, all the data is compressed. If need
exists to store some part of uncompressed data inside container item, then
this additional flag would be required to indicate the same. Hence its
different than ZSBT_COMPRESSED. I am thinking one of the ways could be to
just not store this datum in container item if can't be compressed and
just store it as plain item with uncompressed data, this additional flag
won't be required. Will know more once write code for this.
I see. Perhaps it'd be better to call the flag ZSBT_CONTAINER, when it
means "this is a container". And then have another flag to track whether
the container is compressed or not. But as I suggested elsewhere in this
thread, I think it might be better to store some ID of the compression
algorithm used instead of a simple flag.
FWIW when I had to deal with incremental compression (adding data into
already compressed buffers), which is what seems to be happening here, I
found it very useful/efficient to allow partially compressed buffers and
only trigger recompression when absolutely needed.
Applied to this case, the container would first store compressed chunk,
followed by raw (uncompressed) data. Say, like this:
ZSContainerData {
    // header etc.
    int nbytes;         /* total bytes in data */
    int ncompressed;    /* ncompressed <= nbytes, fully compressed when
                         * (ncompressed == nbytes) */
    char data[FLEXIBLE_ARRAY_MEMBER];
}
When adding a value to the buffer, it'd be simply appended to the data
array. When the container would grow too much (can't fit on the page or
something), recompression is triggered.
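A minimal sketch of that append path (names are illustrative only; the struct
is just the sketch above written out as a compilable typedef, assuming the
usual postgres.h environment):

typedef struct ZSContainerData
{
    int     nbytes;         /* total bytes in data */
    int     ncompressed;    /* length of the compressed prefix; the rest is raw */
    char    data[FLEXIBLE_ARRAY_MEMBER];
} ZSContainerData;

static bool
zscontainer_append_sketch(ZSContainerData *cont, Size maxsize,
                          const char *value, Size len)
{
    Size    used = offsetof(ZSContainerData, data) + cont->nbytes;

    if (used + len > maxsize)
        return false;       /* too big: caller recompresses the whole container */

    /* Append after the compressed chunk; ncompressed stays unchanged. */
    memcpy(cont->data + cont->nbytes, value, len);
    cont->nbytes += (int) len;
    return true;
}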
cheers
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 15, 2019 at 10:33 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On Mon, Apr 15, 2019 at 09:29:37AM -0700, Ashwin Agrawal wrote:
On Sun, Apr 14, 2019 at 9:40 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:
On 11/04/2019 17:54, Tom Lane wrote:
Ashwin Agrawal <aagrawal@pivotal.io> writes:
Thank you for trying it out. Yes, noticed for certain patterns
pg_lzcompress() actually requires much larger output buffers. Like
for
one 86 len source it required 2296 len output buffer. Current
zedstore
code doesn’t handle this case and errors out. LZ4 for same patterns
works fine, would highly recommend using LZ4 only, as anyways speed is
very fast as well with it.
You realize of course that *every* compression method has some
inputs
that
it makes bigger. If your code assumes that compression always
produces a
smaller string, that's a bug in your code, not the compression
algorithm.
Of course. The code is not making that assumption, although clearly
there is a bug there somewhere because it throws that error. It's
early days..

In practice it's easy to weasel out of that, by storing the data
uncompressed, if compression would make it longer. Then you need an
extra flag somewhere to indicate whether it's compressed or not. It
doesn't break the theoretical limit because the actual stored length
is then original length + 1 bit, but it's usually not hard to find a
place for one extra bit.

Don't we already have that flag, though? I see ZSCompressedBtreeItem
has
t_flags, and there's ZSBT_COMPRESSED, but maybe it's more
complicated.
The flag ZSBT_COMPRESSED differentiates between container (compressed)
item and plain (uncompressed item). Current code is writtten such that
within container (compressed) item, all the data is compressed. If need
exists to store some part of uncompressed data inside container item, then
this additional flag would be required to indicate the same. Hence its
different than ZSBT_COMPRESSED. I am thinking one of the ways could be to
just not store this datum in container item if can't be compressed and
just store it as plain item with uncompressed data, this additional flag
won't be required. Will know more once write code for this.
I see. Perhaps it'd be better to call the flag ZSBT_CONTAINER, when it
means "this is a container". And then have another flag to track whether
the container is compressed or not. But as I suggested elsewhere in this
thread, I think it might be better to store some ID of the compression
algorithm used instead of a simple flag.

FWIW when I had to deal with incremental compression (adding data into
already compressed buffers), which is what seems to be happening here, I
found it very useful/efficient to allow partially compressed buffers and
only trigger recompression when absolutely needed.

Applied to this case, the container would first store compressed chunk,
followed by raw (uncompressed) data. Say, like this:

ZSContainerData {
    // header etc.
    int nbytes;         /* total bytes in data */
    int ncompressed;    /* ncompressed <= nbytes, fully compressed when
                         * (ncompressed == nbytes) */
    char data[FLEXIBLE_ARRAY_MEMBER];
}

When adding a value to the buffer, it'd be simply appended to the data
array. When the container would grow too much (can't fit on the page or
something), recompression is triggered.
I think what you suggested here is exactly how it's handled currently, just
that the mechanics are a little different. Plain items are added to the page as
insertions are performed. Then, when the page becomes full, compression is
triggered and a container item is created for them to store the compressed data.
New insertions are again stored as plain items, and once the page becomes
full they are compressed and a container item is created for them. So compressed
data is never attempted to be compressed again, and on a page the plain
items are acting as the data section you mentioned above. A page can have a mix
of plain and container items.
On Sun, Apr 14, 2019 at 12:22 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
It's not clear to me whether you're arguing for not having any such
implementation in core, or having multiple ones? I think we should aim
to have at least one in-core implementation, even if it's not the best
possible one for all sizes. It's not like our rowstore is the best
possible implementation for all cases either.
I'm mostly arguing that it's too early to decide anything at this
point. I'm definitely not opposed to having a column store in core.
I think having a colstore in core is important not just for adoption,
but also for testing and development of the executor / planner bits.

If we have multiple candidates with sufficient code quality, then we may
consider including both. I don't think it's very likely to happen in the
same release, considering how much work it will require. And I have no
idea if zedstore or VOPS are / will be the only candidates - it's way
too early at this point.

FWIW I personally plan to focus primarily on the features that aim to
be included in core, and that applies to colstores too.
I agree with all of that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Apr 15, 2019 at 11:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
TBH, I thought the reason we were expending so much effort on a tableam
API was exactly so we *wouldn't* have to include such stuff in core.

There is a finite limit to how much stuff we can maintain as part of core.
We should embrace the notion that Postgres is an extensible system, rather
than build all the tooling for extension and then proceed to dump stuff
into core anyway.
I don't agree with that at all. I expect, and hope, that there will
be some table AMs maintained outside of core, and I think that's
great. At the same time, it's not like we have had any great success
with out-of-core index AMs, and I don't see that table AMs are likely
to be any different in that regard; indeed, they may be quite a bit
worse. Up until now an index has only had to worry about one kind of
a table, but now a table is going to have to worry about every kind of
index. Furthermore, different table AMs are going to have different
needs. It has already been remarked by both Andres and on this thread
that for columnar storage to really zip along, the executor is going
to need to be much smarter about deciding which columns to request.
Presumably there will be a market for planner/executor optimizations
that postpone fetching columns for as long as possible. It's not
going to be maintainable to build that kind of infrastructure in core
and then have no in-core user of it.
But even if it were, it would be foolish from an adoption perspective
to drive away people who are trying to contribute that kind of
technology to PostgreSQL. Columnar storage is a big deal. Very
significant numbers of people who won't consider PostgreSQL today
because the performance characteristics are not good enough for what
they need will consider it if it's got something like what Ashwin and
Heikki are building built in. Some of those people may be determined
enough that even if the facility is out-of-core they'll be willing to
download an extension and compile it, but others won't. It's already
a problem that people have to go get pgbouncer and/or pgpool to do
something that they kinda think the database should just handle.
Columnar storage, like JSON, is not some fringe thing where we can say
that the handful of people who want it can go get it: people expect
that to be a standard offering, and they wonder why PostgreSQL hasn't
got it yet.
If we have multiple candidates with sufficient code quality, then we may
consider including both.

Dear god, no.
I hate to pick on any particular part of the tree, but it seems
entirely plausible to me that a second columnar storage implementation
could deliver more incremental value than spgist, an index AM you
committed. We should not move the goal posts into the stratosphere
here.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Apr 15, 2019 at 10:50:21AM -0700, Ashwin Agrawal wrote:
On Mon, Apr 15, 2019 at 10:33 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:...
I see. Perhaps it'd be better to call the flag ZSBT_CONTAINER, when it
means "this is a container". And then have another flag to track whether
the container is compressed or not. But as I suggested elsewhere in this
thread, I think it might be better to store some ID of the compression
algorithm used instead of a simple flag.

FWIW when I had to deal with incremental compression (adding data into
already compressed buffers), which is what seems to be happening here, I
found it very useful/efficient to allow partially compressed buffers and
only trigger recompression when absolutely needed.

Applied to this case, the container would first store compressed chunk,
followed by raw (uncompressed) data. Say, like this:

ZSContainerData {
    // header etc.
    int nbytes;         /* total bytes in data */
    int ncompressed;    /* ncompressed <= nbytes, fully compressed when
                         * (ncompressed == nbytes) */
    char data[FLEXIBLE_ARRAY_MEMBER];
}

When adding a value to the buffer, it'd be simply appended to the data
array. When the container would grow too much (can't fit on the page or
something), recompression is triggered.

I think what you suggested here is exactly how it's handled currently, just
the mechanics are little different. Plain items are added to page as
insertions are performed. Then when page becomes full, compression is
triggerred container item is created for them to store the compressed
data. Then new insertions are stored as plain items, once again when page
becomes full, they are compressed and container item created for it. So,
never, compressed data is attempted to be compressed again. So, on page
plain items are acting as data section you mentioned above. A page can
have mix of n plain and n container items.
Maybe. I'm not going to pretend I fully understand the internals. Does
that mean the container contains ZSUncompressedBtreeItem as elements? Or
just the plain Datum values?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2019-04-15 11:10:38 -0400, Tom Lane wrote:
Stephen Frost <sfrost@snowman.net> writes:
* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
I think having a colstore in core is important not just for adoption,
but also for testing and development of the executor / planner bits.

Agreed.
TBH, I thought the reason we were expending so much effort on a tableam
API was exactly so we *wouldn't* have to include such stuff in core.
I think it's mostly orthogonal. We need something like tableam to have
multiple types of storage options for tables - independent of whether
they are in core. And sure, we could have maybe reduced the effort a bit
here and there by e.g. not allowing AMs to be dynamically loaded, or
writing fewer comments or such.
There is a finite limit to how much stuff we can maintain as part of core.
We should embrace the notion that Postgres is an extensible system, rather
than build all the tooling for extension and then proceed to dump stuff
into core anyway.
I totally agree that that's something we should continue to focus
on. I personally think we *already* *have* embraced that - pretty
heavily so. And sometimes to the detriment of our users.
I think there's a pretty good case for e.g. *one* column store
in-core. For one there is a huge portion of existing postgres workloads
that benefit from them (often not for all tables, but some). Relatedly,
it's also one of the more frequent reasons why users can't migrate to
postgres / have to migrate off. And from a different angle, there's
plenty planner and executor work to be done to make column stores fast -
and that can't really be done nicely outside of core; and doing the
improvements in core without a user there is both harder, less likely to
be accepted, and more likely to regress.
If we have multiple candidates with sufficient code quality, then we may
consider including both.

Dear god, no.
Yea, I don't see much point in that. Unless there's a pretty fundamental
reason why one columnar AM can't fulfill two different workloads
(e.g. by having options that define how things are laid out / compressed
/ whatnot), I think that'd be a *terrible* idea. By that logic we'd just
get a lot of AMs with a few differences in some workloads, and our users
would be unable to choose one, and all of them would suck. I think one
such fundamental difference is e.g. the visibility management for an
in-line mvcc approach like heap, and an undo-based mvcc row-store (like
zheap) - it's very hard to imagine meaningful code savings by having
those combined into one AM. I'm sure we can find similar large
architectural issues for some types of columnar AMs - but I'm far far
from convinced that there's enough distinctive need for two different
approaches in postgres. Without having heap historically and the desire
for on-disk compat, I can't quite see being convinced that we should
e.g. add a store like heap if we already had zheap.
I think it's perfectly reasonable to have in-core AMs try to optimize
~80% for a lot of different [sets of] workloads, even if a very
specialized AM could be optimized for it much further.
Greetings,
Andres Freund
Hi,
On 2019-04-15 14:11:02 -0400, Robert Haas wrote:
Furthermore, different table AMs are going to have different
needs. It has already been remarked by both Andres and on this thread
that for columnar storage to really zip along, the executor is going
to need to be much smarter about deciding which columns to request.
Presumably there will be a market for planner/executor optimizations
that postpone fetching columns for as long as possible. It's not
going to be maintainable to build that kind of infrastructure in core
and then have no in-core user of it.
Right. Two notes on that: A lot of that infrastructure needed for fast
query execution (both plan time and execution time) is also going to be
useful for a row store like heap, even though it won't have the
~order-of-magnitude impacts it can have for column stores. Secondly,
even without those, the storage density alone can make column stores
worthwhile, even without query execution speedups (or even slowdowns).
Greetings,
Andres Freund
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Apr 15, 2019 at 11:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
There is a finite limit to how much stuff we can maintain as part of core.
I don't agree with that at all.
Really? Let's have a discussion of how thermodynamics applies to
software management sometime.
If we have multiple candidates with sufficient code quality, then we may
consider including both.
Dear god, no.
I hate to pick on any particular part of the tree, but it seems
entirely plausible to me that a second columnar storage implementation
could deliver more incremental value than spgist, an index AM you
committed.
Yeah, and that's something I've regretted more than once; I think SP-GiST
is a sterling example of something that isn't nearly useful enough in the
real world to justify the amount of maintenance effort we've been forced
to expend on it. You might trawl the commit logs to get a sense of the
amount of my own personal time --- not that of the original submitters ---
that's gone into that one module. Then ask yourself how much that model
will scale, and what other more-useful things I could've accomplished
with that time.
We do need to limit what we accept into core PG. I do not buy your
argument that users expect everything to be in core. Or more accurately,
the people who do think that way won't be using PG anyway --- they'll
be using MSSQL because it comes from their OS vendor.
regards, tom lane
Hi,
On 2019-04-15 14:11:02 -0400, Robert Haas wrote:
I hate to pick on any particular part of the tree, but it seems
entirely plausible to me that a second columnar storage implementation
could deliver more incremental value than spgist, an index AM you
committed. We should not move the goal posts into the stratosphere
here.
Oh, I forgot: I agree that we don't need to be absurdly picky - but I
also think that table storage is much more crucial to get right than
index storage, which is already plenty crucial. Especially when that
type of index is not commonly usable for constraints. It really sucks to
get wrong query results due to a corrupted index / wrong index
implementation - but if you have table AM level corruption, you're *really*
in a dark place. There's no way to just REINDEX and potentially recover
most information with a bit of surgery. Sure there can be app level
consequences to wrong query results that can be really bad, and lead to
very permanent data loss. On-disk compat is also much more important
for table level data - it's annoying to have to reindex indexes after an
upgrade, but at least it can be done concurrently after the most
important indexes are operational.
Greetings,
Andres Freund
On Mon, Apr 15, 2019 at 11:18 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
Maybe. I'm not going to pretend I fully understand the internals. Does
that mean the container contains ZSUncompressedBtreeItem as elements? Or
just the plain Datum values?
First, your reading of code and all the comments/questions so far have been
highly encouraging. Thanks a lot for the same.
The container contains ZSUncompressedBtreeItem as elements. Each item will
have to store meta-data like size, undo pointer and such info. We don't wish to
restrict compression to items from the same insertion session only. Hence,
no, it doesn't just store Datum values. We wish to treat these as tuple-level
operations, with per-tuple meta-data, and be able to work at tuple-level
granularity rather than block level.
Definitely many more tricks can be, and need to be, applied to optimize the
storage format; for example, for fixed-width columns there is no need to store the size in
every item. "Keep it simple" is the theme we have been trying to maintain.
Compression should ideally handle duplicate data pretty easily and
efficiently as well, but we will try to optimize as much as we can without relying on
it.
On Mon, Apr 15, 2019 at 11:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
We do need to limit what we accept into core PG. I do not buy your
argument that users expect everything to be in core. Or more accurately,
the people who do think that way won't be using PG anyway --- they'll
be using MSSQL because it comes from their OS vendor.
I am also concerned by the broad scope of ZedStore, and I tend to
agree that it will be difficult to maintain in core. At the same time,
I think that Andres and Robert are probably right about the difficulty
of maintaining it outside of core -- that would be difficult to
impossible as a practical matter.
Unfortunately, you're both right. I don't know where that leaves us.
--
Peter Geoghegan
On Mon, Apr 15, 2019 at 2:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Apr 15, 2019 at 11:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
There is a finite limit to how much stuff we can maintain as part of core.
I don't agree with that at all.
Really? Let's have a discussion of how thermodynamics applies to
software management sometime.
Sounds like an interesting discussion, perhaps for PGCon, but what I
was actually disagreeing with was the idea that we should add a table
AM interface and then not accept any new table AMs, which I think
would be silly. And if we're going to accept any, a columnar one
seems like a strong candidate.
Yeah, and that's something I've regretted more than once; I think SP-GiST
is a sterling example of something that isn't nearly useful enough in the
real world to justify the amount of maintenance effort we've been forced
to expend on it. You might trawl the commit logs to get a sense of the
amount of my own personal time --- not that of the original submitters ---
that's gone into that one module. Then ask yourself how much that model
will scale, and what other more-useful things I could've accomplished
with that time.
Yep, that's fair.
We do need to limit what we accept into core PG. I do not buy your
argument that users expect everything to be in core. Or more accurately,
the people who do think that way won't be using PG anyway --- they'll
be using MSSQL because it comes from their OS vendor.
I agree that we need to be judicious in what we accept, but I don't
agree that we should therefore accept nothing. There are lots of
things that we could put in core that users would like, but that I'm glad
we haven't put in core.
I think you might be surprised at the number of people who normally
want everything from a single source but are still willing to consider
PostgreSQL; vendors like my employer help to smooth the road for such
people. Still, I don't think there is any major database product
other than PostgreSQL that ships only a single table storage format
and just expects that it will be good enough for everyone. Like
640kB, it just isn't.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Peter Geoghegan <pg@bowt.ie> writes:
I am also concerned by the broad scope of ZedStore, and I tend to
agree that it will be difficult to maintain in core. At the same time,
I think that Andres and Robert are probably right about the difficulty
of maintaining it outside of core -- that would be difficult to
impossible as a practical matter.
Perhaps, but we won't know if we don't try. I think we should try,
and be willing to add hooks and flexibility to core as needed to make
it possible. Adding such flexibility would be good for other outside
projects that have no chance of (or perhaps no interest in) getting into
core, even if we end up deciding that ZedStore or some other specific
implementation is so useful that it does belong in core.
regards, tom lane
Hi,
On 2019-04-15 14:35:43 -0400, Tom Lane wrote:
Yeah, and that's something I've regretted more than once; I think SP-GiST
is a sterling example of something that isn't nearly useful enough in the
real world to justify the amount of maintenance effort we've been forced
to expend on it. You might trawl the commit logs to get a sense of the
amount of my own personal time --- not that of the original submitters ---
that's gone into that one module. Then ask yourself how much that model
will scale, and what other more-useful things I could've accomplished
with that time.
I do agree that the [group of] contributor's history of maintaining such
work should play a role. And I think that's doubly so with a piece as
crucial as a table AM.
But:
We do need to limit what we accept into core PG. I do not buy your
argument that users expect everything to be in core. Or more accurately,
the people who do think that way won't be using PG anyway --- they'll
be using MSSQL because it comes from their OS vendor.
I don't think anybody disagrees with that, actually. Including
Robert.
But I don't think it follows that we shouldn't provide things that are
either much more reasonably done in core like a pooler (authentication /
encryption; infrastructure for managing state like prepared statements,
GUCs; avoiding issues of explosion of connection counts with pooling in
other places), are required by a very significant portion of our users
(imo the case for a columnar store or a row store without the
architectural issues of heap), or where it's hard to provide the
necessary infrastructure without an in-core user (imo also the case with
columnar, due to the necessary planner / executor improvements for fast
query execution).
We also have at times pretty explicitly resisted making crucial pieces
of infrastructure usable outside of core. E.g. because it's legitimately
hard (grammar extensibility), or because we had some concerns around
stability and the exact approach (WAL - the generic stuff is usable for
anything that wants to even be somewhat efficient, some xlog
integration). So there's several types of extensions that one
realistically cannot do out of core, by our choice.
Greetings,
Andres Freund
On Mon, Apr 15, 2019 at 12:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Perhaps, but we won't know if we don't try. I think we should try,
and be willing to add hooks and flexibility to core as needed to make
it possible.
We could approach it without taking a firm position on inclusion in
core until the project begins to mature. I have little faith in our
ability to predict which approach will be the least painful at this
early stage.
--
Peter Geoghegan
Andres Freund <andres@anarazel.de> writes:
We also have at times pretty explicitly resisted making crucial pieces
of infrastructure usable outside of core. E.g. because it's legitimately
hard (grammar extensibility), or because we'd some concerns around
stability and the exact approach (WAL - the generic stuff is usable for
anything that wants to even be somewhat efficient, some xlog
integration). So there's several types of extensions that one
realistically cannot do out of core, by our choice.
Well, the grammar issue comes from a pretty specific technical problem:
bison grammars don't cope with run-time extension, and moving off of bison
would cost a lot of work, and probably more than just work (i.e., probable
loss of ability to detect grammar ambiguity). WAL extensibility likewise
has some technical issues that are hard to surmount (how do you find the
code for replaying an extension WAL record, when you can't read catalogs).
I think we could fix the latter, it's just that no one has yet troubled
to expend the effort. Similarly, things like the planner's hard-wired
handling of physical-tlist optimization are certainly a problem for
column stores, but I think the way to solve that is to provide an actual
extension capability, not merely replace one hard-wired behavior with two.
As a counterpoint to my gripe about SP-GiST being a time sink, I do not
think I'll regret the time I spent a few months ago on implementing
"planner support function" hooks. I'm all in favor of adding flexibility
like that.
regards, tom lane
Hi,
On 2019-04-15 15:19:41 -0400, Tom Lane wrote:
Peter Geoghegan <pg@bowt.ie> writes:
I am also concerned by the broad scope of ZedStore, and I tend to
agree that it will be difficult to maintain in core. At the same time,
I think that Andres and Robert are probably right about the difficulty
of maintaining it outside of core -- that would be difficult to
impossible as a practical matter.

Perhaps, but we won't know if we don't try. I think we should try,
and be willing to add hooks and flexibility to core as needed to make
it possible. Adding such flexibility would be good for other outside
projects that have no chance of (or perhaps no interest in) getting into
core, even if we end up deciding that ZedStore or some other specific
implementation is so useful that it does belong in core.
I don't think anybody argued against providing that flexibility. I think
we should absolutely do so - but that's imo not an argument against
integrating something like a hypothetical well developed columnstore to
core. I worked on tableam, which certainly provides a lot of new
extensibility, because it was the sane architecture to able to integrate
zheap. The current set of UNDO patches (developed for zheap), while
requiring core integration for xact.c etc, co-initiated improvements to
make the checkpointer fsync being closer to extensible and UNDO as
currently developed would be extensible if WAL was extensible as it's
tied to rmgrlist.h. And the improvements necessary to make query
executions for in-core columnar AM faster, would largely also be
applicable for out-of-core columnar AMs, and I'm sure we'd try to make
the necessary decisions not hardcoded if reasonable.
I think it's actually really hard to make something non-trivial
extensible without there being a proper in-core user of most of that
infrastructure.
Greetings,
Andres Freund
On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
Would like to know more specifics on this Peter. We may be having different context on hybrid row/column design.
I'm confused about how close your idea of a TID is to the traditional
definition from heapam (and even zheap). If it's a purely logical
identifier, then why would it have two components like a TID? Is that
just a short-term convenience or something?
Yes, the plan to optimize out TID space per datum, either by prefix compression or delta compression or some other trick.
It would be easier to do this if you knew for sure that the TID
behaves almost the same as a bigserial column -- a purely logical
monotonically increasing identifier. That's why I'm interested in what
exactly you mean by TID, the stability of a TID value, etc. If a leaf
page almost always stores a range covering no more than few hundred
contiguous logical values, you can justify aggressively compressing
the representation in the B-Tree entries. Compression would still be
based on prefix compression, but the representation itself can be
specialized.
--
Peter Geoghegan
Hi,
On 2019-04-15 12:50:14 -0700, Peter Geoghegan wrote:
On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
Would like to know more specifics on this Peter. We may be having different context on hybrid row/column design.
I'm confused about how close your idea of a TID is to the traditional
definition from heapam (and even zheap). If it's a purely logical
identifier, then why would it have two components like a TID? Is that
just a short-term convenience or something?
There's not much of an alternative currently. Indexes require tid
looking things, and as a consequence (and some other comparatively small
changes that'd be required) tableam does too. And there's a few places
that imbue additional meaning into the higher bits of ip_posid too, so
not all of them are valid (It can't currently be zero - or
ItemPointerIsValid fails, it can't be larger than MaxOffsetNumber -
that's used to allocate things in e.g. indexes, tidbitmap.c etc).
That's one of the reasons why I've been trying to get you to get on
board with allowing different leaf-level "item pointer equivalents"
widths inside nbtree...
Greetings,
Andres Freund
On Mon, Apr 15, 2019 at 1:02 PM Andres Freund <andres@anarazel.de> wrote:
There's not much of an alternative currently. Indexes require tid
looking things, and as a consequence (and some other comparatively small
changes that'd be required) tableam does too.
I'm trying to establish whether or not that's the only reason. It
might be okay to use the same item pointer struct as the
representation of an integer-like logical identifier. Even if it isn't,
I'm still interested in just how logical the TIDs are, because it's an
important part of the overall design.
That's one of the reasons why I've been trying to get you to get on
board with allowing different leaf-level "item pointer equivalents"
widths inside nbtree...
Getting me to agree that that would be nice and getting me to do the
work are two very different things. ;-)
--
Peter Geoghegan
On Mon, Apr 15, 2019 at 11:57:49AM -0700, Ashwin Agrawal wrote:
On Mon, Apr 15, 2019 at 11:18 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Maybe. I'm not going to pretend I fully understand the internals. Does
that mean the container contains ZSUncompressedBtreeItem as elements? Or
just the plain Datum values?

First, your reading of code and all the comments/questions so far have
been highly encouraging. Thanks a lot for the same.
;-)
Container contains ZSUncompressedBtreeItem as elements. As for Item will
have to store meta-data like size, undo and such info. We don't wish to
restrict compressing only items from same insertion sessions only. Hence,
yes doens't just store Datum values. Wish to consider it more tuple level
operations and have meta-data for it and able to work with tuple level
granularity than block level.
OK, thanks for the clarification, that somewhat explains my confusion.
So if I understand it correctly, ZSCompressedBtreeItem is essentially a
sequence of ZSUncompressedBtreeItem(s) stored one after another, along
with some additional top-level metadata.
Definitely many more tricks can be and need to be applied to optimize
storage format, like for fixed width columns no need to store the size in
every item. Keep it simple is theme have been trying to maintain.
Compression ideally should compress duplicate data pretty easily and
efficiently as well, but we will try to optimize as much we can without
the same.
I think there's plenty of room for improvement. The main problem I see
is that it mixes different types of data, which is bad for compression
and vectorized execution. I think we'll end up with a very different
representation of the container, essentially decomposing the items into
arrays of values of the same type - array of TIDs, array of undo
pointers, buffer of serialized values, etc.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 15, 2019 at 12:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal <aagrawal@pivotal.io>
wrote:

Would like to know more specifics on this Peter. We may be having
different context on hybrid row/column design.
I'm confused about how close your idea of a TID is to the traditional
definition from heapam (and even zheap). If it's a purely logical
identifier, then why would it have two components like a TID? Is that
just a short-term convenience or something?
TID is a purely logical identifier. Hence, as stated in the initial email, for a
Zedstore TID the block number and offset split carries no meaning at all. It's
purely a 48-bit integer entity assigned to the datum of the first column during
insertion, based on where in the BTree it gets inserted. The rest of the column
datums are inserted using this assigned TID value. Only the
system restrictions discussed by Heikki and Andres on the table AM thread pose
limitations on the values it can currently carry; otherwise, from the zedstore design
perspective it is just an integer number.
Yes, the plan to optimize out TID space per datum, either by prefix
compression or delta compression or some other trick.
It would be easier to do this if you knew for sure that the TID
behaves almost the same as a bigserial column -- a purely logical
monotonically increasing identifier. That's why I'm interested in what
exactly you mean by TID, the stability of a TID value, etc. If a leaf
page almost always stores a range covering no more than few hundred
contiguous logical values, you can justify aggressively compressing
the representation in the B-Tree entries. Compression would still be
based on prefix compression, but the representation itself can be
specialized.
Yes, it's for sure a logical increasing number. With only inserts, the number
is monotonically increasing. With deletes and updates, an insert could reuse
previously freed TID values. Since the TID is logical, datums can be easily
moved around to split or merge pages as required.
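To make that concrete, here is a purely illustrative sketch (not the actual
zedstore conversion routines) of packing such a logical TID into the
ItemPointer shape that indexes and the table AM currently expect. Because the
offset part has to stay within 1..MaxOffsetNumber, fewer than the full 48 bits
are usable this way, which is exactly the kind of restriction mentioned above:

#include "postgres.h"
#include "storage/itemptr.h"

typedef uint64 zstid;               /* logical row identifier, as described above */

#define SKETCH_OFFSET_BITS  11      /* (1 << 11) == 2048 == MaxOffsetNumber for 8k pages */

static inline void
sketch_zstid_to_itemptr(zstid tid, ItemPointer iptr)
{
    /* offset gets the low bits, shifted by one so it is never zero */
    OffsetNumber off = (OffsetNumber) ((tid & ((1 << SKETCH_OFFSET_BITS) - 1)) + 1);
    BlockNumber  blk = (BlockNumber) (tid >> SKETCH_OFFSET_BITS);

    ItemPointerSet(iptr, blk, off);
}

static inline zstid
sketch_itemptr_to_zstid(ItemPointer iptr)
{
    return ((zstid) ItemPointerGetBlockNumber(iptr) << SKETCH_OFFSET_BITS) |
           (ItemPointerGetOffsetNumber(iptr) - 1);
}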
On Mon, Apr 15, 2019 at 10:45:51PM -0700, Ashwin Agrawal wrote:
On Mon, Apr 15, 2019 at 12:50 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal <aagrawal@pivotal.io>
wrote:

Would like to know more specifics on this Peter. We may be having
different context on hybrid row/column design.
I'm confused about how close your idea of a TID is to the traditional
definition from heapam (and even zheap). If it's a purely logical
identifier, then why would it have two components like a TID? Is that
just a short-term convenience or something?

TID is purely logical identifier. Hence, stated in initial email that for
Zedstore TID, block number and offset split carries no meaning at all. It's
purely 48 bit integer entity assigned to datum of first column during
insertion, based on where in BTree it gets inserted. Rest of the column
datums are inserted using this assigned TID value. Just due to rest to
system restrictions discussed by Heikki and Andres on table am thread poses
limitations of value it can carry currently otherwise from zedstore design
perspective it just integer number.
I'm not sure it's that clear cut, actually. Sure, it's not the usual
(block,item) pair so it's not possible to jump to the exact location, so
it's not the raw physical identifier as regular TID. But the data are
organized in a btree, with the TID as a key, so it does actually provide
some information about the location.
I've asked about BRIN indexes elsewhere in this thread, which I think is
related to this question, because that index type relies on TID providing
sufficient information about location. And I think BRIN indexes are going
to be rather important for colstores (and formats like ORC have something
very similar built-in).
But maybe all we'll have to do is define the ranges differently - instead
of "number of pages" we may define them as "number of rows" and it might
be working.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 16, 2019 at 9:15 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
I'm not sure it's that clear cut, actually. Sure, it's not the usual
(block,item) pair so it's not possible to jump to the exact location, so
it's not the raw physical identifier as regular TID. But the data are
organized in a btree, with the TID as a key, so it does actually provide
some information about the location.
From a representation perspective it's a logical identifier. But yes,
since it is used as the key to lay out datums, there exists pretty good
correlation between TIDs and physical location. Can consider it as
clustered based on TID.
I've asked about BRIN indexes elsewhere in this thread, which I think is
related to this question, because that index type relies on TID providing
sufficient information about location. And I think BRIN indexes are going
to be rather important for colstores (and formats like ORC have something
very similar built-in).

But maybe all we'll have to do is define the ranges differently - instead
of "number of pages" we may define them as "number of rows" and it might
be working.
BRIN indexes work for zedstore right now. A block range maps to just a
range of TIDs in zedstore, as pointed out above. When one converts a
zstid to an ItemPointer, one can get the "block number" from the
ItemPointer, like from a normal heap TID. It doesn't mean the direct
physical location of the row in zedstore, but that's fine.

It might be sub-optimal in some cases. For example, if one zedstore
page contains TIDs 1-1000, and another 1000-2000, and the entry in the
BRIN index covers TIDs 500-1500, we have to access both zedstore
pages. It would be better if the cutoff points in the BRIN index
matched the physical pages of the zedstore. But it still works, and is
probably fine in practice.

The plan is to add an integrated BRIN index in zedstore, meaning keep
min-max values for appropriate columns within the page. This will not
help to eliminate the IO as an external BRIN index does, but it helps to
skip decompression and visibility checks etc. for blocks not matching
the conditions.
Just to showcase brin works for zedstore, played with hands-on example
mentioned in [1].

With btree index on zedstore:

                                                                         QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4351.50..4351.51 rows=1 width=32) (actual time=1267.140..1267.140 rows=1 loops=1)
   ->  Index Scan using idx_ztemperature_log_log_timestamp on ztemperature_log  (cost=0.56..4122.28 rows=91686 width=4) (actual time=0.117..1244.112 rows=86400 loops=1)
         Index Cond: ((log_timestamp >= '2016-04-04 00:00:00'::timestamp without time zone) AND (log_timestamp < '2016-04-05 00:00:00'::timestamp without time zone))
 Planning Time: 0.240 ms
 Execution Time: 1269.016 ms
(5 rows)
With brin index on zedstore.
Note: Bitmap index for zedstore currently scans all the columns.
Scanning only required columns for query is yet to be implemented.

                                                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=217538.85..217538.86 rows=1 width=32) (actual time=54.167..54.167 rows=1 loops=1)
   ->  Gather  (cost=217538.63..217538.84 rows=2 width=32) (actual time=53.967..55.184 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=216538.63..216538.64 rows=1 width=32) (actual time=42.956..42.957 rows=1 loops=3)
               ->  Parallel Bitmap Heap Scan on ztemperature_log  (cost=59.19..216446.98 rows=36660 width=4) (actual time=3.571..35.904 rows=28800 loops=3)
                     Recheck Cond: ((log_timestamp >= '2016-04-04 00:00:00'::timestamp without time zone) AND (log_timestamp < '2016-04-05 00:00:00'::timestamp without time zone))
                     Rows Removed by Index Recheck: 3968
                     Heap Blocks: lossy=381
                     ->  Bitmap Index Scan on idx_ztemperature_log_log_timestamp  (cost=0.00..37.19 rows=98270 width=0) (actual time=1.201..1.201 rows=7680 loops=1)
                           Index Cond: ((log_timestamp >= '2016-04-04 00:00:00'::timestamp without time zone) AND (log_timestamp < '2016-04-05 00:00:00'::timestamp without time zone))
 Planning Time: 0.240 ms
 Execution Time: 55.341 ms
(13 rows)
 schema_name |             index_name             | index_ratio | index_size | table_size
-------------+------------------------------------+-------------+------------+------------
 public      | idx_ztemperature_log_log_timestamp |           0 | 80 kB      | 1235 MB
(1 row)
[1] https://www.postgresql.fastware.com/blog/brin-indexes-what-are-they-and-how-do-you-use-them
On 15/04/2019 22:32, Peter Geoghegan wrote:
On Mon, Apr 15, 2019 at 12:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Perhaps, but we won't know if we don't try. I think we should try,
and be willing to add hooks and flexibility to core as needed to make
it possible.

We could approach it without taking a firm position on inclusion in
core until the project begins to mature. I have little faith in our
ability to predict which approach will be the least painful at this
early stage.
When we started hacking on this, we went in with the assumption that
this would have to be in core, because WAL-logging, and also because a
column-store will probably need some changes to the planner and executor
to make it shine. And also because a lot of people would like to have a
column store in PostgreSQL (although a "column store" could mean many
different things with different tradeoffs). But if we just have all the
necessary hooks in core, sure, this could be an extension, too.
But as you said, we don't need to decide that yet. Let's wait and see,
as this matures.
- Heikki
We (Heikki, me and Melanie) are continuing to build Zedstore. Wish to
share the recent additions and modifications. Attaching a patch
with the latest code. Link to github branch [1] to follow
along. The approach we have been leaning towards is to build required
functionality, get passing the test and then continue to iterate to
optimize the same. It's still work-in-progress.
Sharing the details now, as we have reached our next milestone for
Zedstore. All table AM API's are implemented for Zedstore (except
compute_xid_horizon_for_tuples; seems we need a test for it first).
Current State:
- A new type of item added to Zedstore "Array item", to boost
compression and performance. Based on Konstantin's performance
experiments [2] and inputs from Tomas Vodra [3], this is
added. Array item holds multiple datums, with consecutive TIDs and
the same visibility information. An array item saves space compared
to multiple single items, by leaving out repetitive UNDO and TID
fields. An array item cannot mix NULLs and non-NULLs. So, those
experiments should result in improved performance now. Inserting
data via COPY creates array items currently. Code for insert has not
been modified from last time. Making singleton inserts or insert
into select, performant is still on the todo list.
- Now we have a separate, dedicated meta-column btree alongside the
rest of the data column btrees. This special, always-present first
btree is used to assign TIDs for tuples and to track the UNDO location
which provides visibility information. It also helps to support
zero-column tables (which can result from ADD COLUMN / DROP COLUMN
actions as well). Storing the meta-data separately from the data helps
to get better compression ratios, and further simplifies the overall
design/implementation: deletes only need to edit the meta-column and
never touch the actual data btrees. Index scans can perform visibility
checks based on this meta-column alone and fetch the required datums
only for visible tuples, and tuple locks likewise only need to access
the meta-column. Previously, every column btree carried the same undo
pointer, so with the old layout a visibility check could potentially
be performed using any column; considering the overall simplification
the new layout provides, it's fine to give up that property. Having a
dedicated meta-column also greatly simplified handling of ADD COLUMN
with default and null values, as this column deterministically
provides all the TIDs present in the table, which can't be said for
any other data column due to defaults or nulls during ADD COLUMN.
- Free Page Map implemented. The Free Page Map keeps track of unused
pages in the relation. The FPM is also a b-tree, indexed by physical
block number. To be more compact, it stores "extents", i.e. block
ranges, rather than single blocks, when possible. An interesting paper
[4] on how modern filesystems manage space was a good source of ideas.
- Tuple locks implemented
- Serializable isolation handled
- With "default_table_access_method=zedstore"
- 31 out of 194 failing regress tests
- 10 out of 86 failing isolation tests
Many of the currently failing tests are due to plan differences, like
Index Scans being chosen for zedstore instead of Index Only Scans
(zedstore doesn't yet have a visibility map, and I have yet to give
index-only scans much thought), or plan diffs due to table size
differences between heap and zedstore.
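
As a side note on the array item mentioned above, here is a minimal
sketch of the single-item vs. array-item leaf layout. The struct and
field names follow the patch (the authoritative definitions live in
src/include/access/zedstore_internal.h and zedstore_undo.h); the
stand-in typedefs and exact field widths below are illustrative only,
not the patch's actual declarations.

    #include <stdint.h>

    /* Illustrative stand-ins for the real zedstore types. */
    typedef uint64_t zstid;     /* logical 48-bit TID, held in 64 bits here */
    typedef struct { uint64_t counter; uint32_t blkno; uint32_t offset; } ZSUndoRecPtr;

    /* One datum, carrying its own TID and undo pointer. */
    typedef struct ZSSingleBtreeItem
    {
        zstid        t_tid;
        uint16_t     t_size;
        uint16_t     t_flags;     /* ZSBT_NULL, ZSBT_ARRAY, ZSBT_COMPRESSED, ... */
        ZSUndoRecPtr t_undo_ptr;  /* visibility information lives in the UNDO log */
        char         t_payload[]; /* the datum itself */
    } ZSSingleBtreeItem;

    /* A run of datums with consecutive TIDs and identical visibility:
     * the TID and undo-pointer overhead is paid once per run instead of
     * once per row, which is where the space saving comes from. */
    typedef struct ZSArrayBtreeItem
    {
        zstid        t_tid;        /* TID of the first element */
        uint16_t     t_size;
        uint16_t     t_flags;      /* ZSBT_NULL applies to every element */
        uint16_t     t_nelements;  /* covers t_tid .. t_tid + t_nelements - 1 */
        ZSUndoRecPtr t_undo_ptr;   /* shared by all elements */
        char         t_payload[];  /* densely packed datums, no per-datum header */
    } ZSArrayBtreeItem;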
Next few milestones we wish to hit for Zedstore:
- Make the regress tests ("make check") green
- Make the isolation tests green
- Make Zedstore crash safe (which also means replication safe) by
implementing WAL logging
- Performance profiling and optimization for inserts, selects, index
scans, etc.
- Once the UNDO framework lands upstream, have Zedstore leverage it
instead of its own version of UNDO
Open questions / discussion items:
- How best to get the "column projection list" from the planner?
(Currently, we walk the plan in the executor and find the columns
required for the query; refer to GetNeededColumnsForNode(). A rough
sketch of the idea follows after this list.)
- How to pass the "column projection list" to the table AM? (As stated
in the initial email, we have currently modified the table AM API to
pass the projection to the AM.)
- TID is treated as (block, offset) in the current indexing code
- Physical tlist optimization? (Currently we have disabled it for
zedstore.)
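
To illustrate the first two open items: below is a rough sketch of how
an executor node could gather the attributes it needs into a Bitmapset
and hand that to the table AM. This is not the patch's actual code --
the real work is done by GetNeededColumnsForNode() and the modified
beginscan API mentioned above -- and collect_needed_columns() here is
a made-up name; pull_varattnos() is the existing planner utility.

    #include "postgres.h"
    #include "nodes/bitmapset.h"
    #include "nodes/plannodes.h"
    #include "optimizer/optimizer.h"   /* pull_varattnos() */

    /*
     * Collect the attribute numbers referenced by a scan node's target
     * list and quals. Note that pull_varattnos() offsets the attnos by
     * FirstLowInvalidHeapAttributeNumber, so a caller must shift them
     * back before passing plain column numbers to the AM.
     */
    static Bitmapset *
    collect_needed_columns(Scan *scan)
    {
        Bitmapset  *attrs = NULL;

        pull_varattnos((Node *) scan->plan.targetlist, scan->scanrelid, &attrs);
        pull_varattnos((Node *) scan->plan.qual, scan->scanrelid, &attrs);

        return attrs;
    }

Whether such a set should travel through a new beginscan variant, or
through a separate call made right after beginscan (as the patch does
for index scans), is exactly the API question raised above.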
Team:
Melanie has joined Heikki and me in writing code for zedstore. The
majority of the code continues to be contributed by Heikki. We are
continuing to have fun building the column store implementation and
iterating aggressively.
References:
[1] https://github.com/greenplum-db/postgres/tree/zedstore
[2] /messages/by-id/3978b57e-fe25-ca6b-f56c-48084417e115@postgrespro.ru
[3] /messages/by-id/20190415173254.nlnk2xqhgt7c5pta@development
[4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf
Attachments:
v2-0001-Zedstore-compressed-in-core-columnar-storage.patch (text/x-patch)
From b9a152c56d844f0543c2e5c1c09e7b7de4e20bc2 Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal <aagrawal@pivotal.io>
Date: Wed, 22 May 2019 16:32:57 -0700
Subject: [PATCH v2] Zedstore compressed in-core columnar storage.
---
configure | 118 +
configure.in | 19 +
src/backend/access/Makefile | 2 +-
src/backend/access/gin/ginbtree.c | 2 +-
src/backend/access/gin/ginfast.c | 2 +-
src/backend/access/gin/gininsert.c | 4 +-
src/backend/access/gist/gist.c | 2 +-
src/backend/access/hash/hashinsert.c | 2 +-
src/backend/access/heap/heapam.c | 20 +-
src/backend/access/heap/heapam_handler.c | 5 +-
src/backend/access/index/indexam.c | 4 +-
src/backend/access/nbtree/nbtinsert.c | 4 +-
src/backend/access/zedstore/Makefile | 21 +
src/backend/access/zedstore/README | 295 ++
.../access/zedstore/zedstore_attpage.c | 1589 +++++++
src/backend/access/zedstore/zedstore_btree.c | 632 +++
.../access/zedstore/zedstore_compression.c | 364 ++
.../access/zedstore/zedstore_freepagemap.c | 1076 +++++
.../access/zedstore/zedstore_inspect.c | 448 ++
src/backend/access/zedstore/zedstore_meta.c | 216 +
.../access/zedstore/zedstore_tidpage.c | 1774 ++++++++
src/backend/access/zedstore/zedstore_toast.c | 192 +
.../access/zedstore/zedstore_tupslot.c | 348 ++
src/backend/access/zedstore/zedstore_undo.c | 918 ++++
src/backend/access/zedstore/zedstore_utils.c | 76 +
.../access/zedstore/zedstore_visibility.c | 728 +++
.../access/zedstore/zedstoream_handler.c | 3163 +++++++++++++
src/backend/commands/analyze.c | 7 +-
src/backend/commands/copy.c | 22 +-
src/backend/commands/tablecmds.c | 14 +-
src/backend/commands/trigger.c | 8 +
src/backend/executor/execScan.c | 90 +
src/backend/executor/nodeIndexonlyscan.c | 16 +-
src/backend/executor/nodeIndexscan.c | 20 +-
src/backend/executor/nodeSeqscan.c | 18 +-
src/backend/optimizer/plan/createplan.c | 3 +
src/backend/optimizer/util/plancat.c | 2 +
src/backend/partitioning/partbounds.c | 15 +-
src/backend/storage/lmgr/predicate.c | 45 +-
src/include/access/tableam.h | 41 +
src/include/access/zedstore_compression.h | 51 +
src/include/access/zedstore_internal.h | 618 +++
src/include/access/zedstore_undo.h | 171 +
src/include/catalog/pg_am.dat | 3 +
src/include/catalog/pg_proc.dat | 24 +
src/include/executor/executor.h | 3 +-
src/include/nodes/execnodes.h | 1 +
src/include/nodes/pathnodes.h | 1 +
src/include/pg_config.h.in | 9 +
src/include/storage/predicate.h | 9 +-
.../isolation/specs/read-only-anomaly-2.spec | 6 +-
src/test/regress/expected/.gitignore | 1 +
src/test/regress/expected/alter_table_1.out | 3997 +++++++++++++++++
src/test/regress/expected/cluster_1.out | 475 ++
src/test/regress/expected/create_am.out | 11 +-
src/test/regress/expected/fsm_1.out | 73 +
src/test/regress/expected/rangefuncs_1.out | 2100 +++++++++
src/test/regress/expected/reloptions_1.out | 219 +
src/test/regress/expected/strings_1.out | 1823 ++++++++
src/test/regress/expected/tsrf_1.out | 712 +++
src/test/regress/expected/zedstore.out | 599 +++
src/test/regress/output/misc_1.source | 692 +++
src/test/regress/parallel_schedule | 2 +-
src/test/regress/serial_schedule | 1 +
src/test/regress/sql/zedstore.sql | 176 +
src/test/storageperf/driver.sql | 36 +
src/test/storageperf/sql/onecol.sql | 38 +
src/test/storageperf/tests.sql | 3 +
68 files changed, 24118 insertions(+), 61 deletions(-)
create mode 100644 src/backend/access/zedstore/Makefile
create mode 100644 src/backend/access/zedstore/README
create mode 100644 src/backend/access/zedstore/zedstore_attpage.c
create mode 100644 src/backend/access/zedstore/zedstore_btree.c
create mode 100644 src/backend/access/zedstore/zedstore_compression.c
create mode 100644 src/backend/access/zedstore/zedstore_freepagemap.c
create mode 100644 src/backend/access/zedstore/zedstore_inspect.c
create mode 100644 src/backend/access/zedstore/zedstore_meta.c
create mode 100644 src/backend/access/zedstore/zedstore_tidpage.c
create mode 100644 src/backend/access/zedstore/zedstore_toast.c
create mode 100644 src/backend/access/zedstore/zedstore_tupslot.c
create mode 100644 src/backend/access/zedstore/zedstore_undo.c
create mode 100644 src/backend/access/zedstore/zedstore_utils.c
create mode 100644 src/backend/access/zedstore/zedstore_visibility.c
create mode 100644 src/backend/access/zedstore/zedstoream_handler.c
create mode 100644 src/include/access/zedstore_compression.h
create mode 100644 src/include/access/zedstore_internal.h
create mode 100644 src/include/access/zedstore_undo.h
create mode 100644 src/test/regress/expected/alter_table_1.out
create mode 100644 src/test/regress/expected/cluster_1.out
create mode 100644 src/test/regress/expected/fsm_1.out
create mode 100644 src/test/regress/expected/rangefuncs_1.out
create mode 100644 src/test/regress/expected/reloptions_1.out
create mode 100644 src/test/regress/expected/strings_1.out
create mode 100644 src/test/regress/expected/tsrf_1.out
create mode 100644 src/test/regress/expected/zedstore.out
create mode 100644 src/test/regress/output/misc_1.source
create mode 100644 src/test/regress/sql/zedstore.sql
create mode 100644 src/test/storageperf/driver.sql
create mode 100644 src/test/storageperf/sql/onecol.sql
create mode 100644 src/test/storageperf/tests.sql
diff --git a/configure b/configure
index fd61bf6472..59a8a8080d 100755
--- a/configure
+++ b/configure
@@ -700,6 +700,7 @@ LDFLAGS_EX
ELF_SYS
EGREP
GREP
+with_lz4
with_zlib
with_system_tzdata
with_libxslt
@@ -864,6 +865,7 @@ with_libxml
with_libxslt
with_system_tzdata
with_zlib
+with_lz4
with_gnu_ld
enable_largefile
enable_float4_byval
@@ -1570,6 +1572,7 @@ Optional Packages:
--with-system-tzdata=DIR
use system time zone data in DIR
--without-zlib do not use Zlib
+ --with-lz4 build with LZ4 support
--with-gnu-ld assume the C compiler uses GNU ld [default=no]
Some influential environment variables:
@@ -8306,6 +8309,41 @@ fi
+#
+# LZ4
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with LZ4 support" >&5
+$as_echo_n "checking whether to build with LZ4 support... " >&6; }
+
+
+
+# Check whether --with-lz4 was given.
+if test "${with_lz4+set}" = set; then :
+ withval=$with_lz4;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LZ4 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-lz4 option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_lz4=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_lz4" >&5
+$as_echo "$with_lz4" >&6; }
+
+
#
# Elf
#
@@ -11828,6 +11866,56 @@ fi
fi
+if test "$with_lz4" = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for LZ4_compress_default in -llz4" >&5
+$as_echo_n "checking for LZ4_compress_default in -llz4... " >&6; }
+if ${ac_cv_lib_lz4_LZ4_compress_default+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-llz4 $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char LZ4_compress_default ();
+int
+main ()
+{
+return LZ4_compress_default ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_lz4_LZ4_compress_default=yes
+else
+ ac_cv_lib_lz4_LZ4_compress_default=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_lz4_LZ4_compress_default" >&5
+$as_echo "$ac_cv_lib_lz4_LZ4_compress_default" >&6; }
+if test "x$ac_cv_lib_lz4_LZ4_compress_default" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBLZ4 1
+_ACEOF
+
+ LIBS="-llz4 $LIBS"
+
+else
+ as_fn_error $? "library 'lz4' is required for LZ4 support" "$LINENO" 5
+fi
+
+fi
+
if test "$enable_spinlocks" = yes; then
$as_echo "#define HAVE_SPINLOCKS 1" >>confdefs.h
@@ -13027,6 +13115,36 @@ Use --without-zlib to disable zlib support." "$LINENO" 5
fi
+fi
+
+if test "$with_lz4" = yes; then
+ for ac_header in lz4.h
+do :
+ ac_fn_c_check_header_mongrel "$LINENO" "lz4.h" "ac_cv_header_lz4_h" "$ac_includes_default"
+if test "x$ac_cv_header_lz4_h" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LZ4_H 1
+_ACEOF
+
+else
+ for ac_header in lz4.h
+do :
+ ac_fn_c_check_header_mongrel "$LINENO" "lz4.h" "ac_cv_header_lz4_h" "$ac_includes_default"
+if test "x$ac_cv_header_lz4_h" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LZ4_H 1
+_ACEOF
+
+else
+ as_fn_error $? "lz4.h header file is required for LZ4" "$LINENO" 5
+fi
+
+done
+
+fi
+
+done
+
fi
if test "$with_gssapi" = yes ; then
diff --git a/configure.in b/configure.in
index 4586a1716c..183fad3462 100644
--- a/configure.in
+++ b/configure.in
@@ -964,6 +964,16 @@ PGAC_ARG_BOOL(with, zlib, yes,
[do not use Zlib])
AC_SUBST(with_zlib)
+#
+# LZ4
+#
+AC_MSG_CHECKING([whether to build with LZ4 support])
+PGAC_ARG_BOOL(with, lz4, no,
+ [build with LZ4 support],
+ [AC_DEFINE([USE_LZ4], 1, [Define to 1 to build with LZ4 support. (--with-lz4)])])
+AC_MSG_RESULT([$with_lz4])
+AC_SUBST(with_lz4)
+
#
# Elf
#
@@ -1174,6 +1184,10 @@ failure. It is possible the compiler isn't looking in the proper directory.
Use --without-zlib to disable zlib support.])])
fi
+if test "$with_lz4" = yes; then
+ AC_CHECK_LIB(lz4, LZ4_compress_default, [], [AC_MSG_ERROR([library 'lz4' is required for LZ4 support])])
+fi
+
if test "$enable_spinlocks" = yes; then
AC_DEFINE(HAVE_SPINLOCKS, 1, [Define to 1 if you have spinlocks.])
else
@@ -1387,6 +1401,11 @@ failure. It is possible the compiler isn't looking in the proper directory.
Use --without-zlib to disable zlib support.])])
fi
+if test "$with_lz4" = yes; then
+ AC_CHECK_HEADERS(lz4.h, [],
+ [AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is required for LZ4])])])
+fi
+
if test "$with_gssapi" = yes ; then
AC_CHECK_HEADERS(gssapi/gssapi.h, [],
[AC_CHECK_HEADERS(gssapi.h, [], [AC_MSG_ERROR([gssapi.h header file is required for GSSAPI])])])
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 0880e0a8bb..6d36f3bd26 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \
- table tablesample transam
+ table tablesample transam zedstore
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 11a8ed7bbc..e795375495 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -89,7 +89,7 @@ ginFindLeafPage(GinBtree btree, bool searchMode,
stack->predictNumber = 1;
if (rootConflictCheck)
- CheckForSerializableConflictIn(btree->index, NULL, stack->buffer);
+ CheckForSerializableConflictIn(btree->index, NULL, btree->rootBlkno);
for (;;)
{
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2b3dd1c677..f8ffeb06f8 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -246,7 +246,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
* tree, so it conflicts with all serializable scans. All scans acquire a
* predicate lock on the metabuffer to represent that.
*/
- CheckForSerializableConflictIn(index, NULL, metabuffer);
+ CheckForSerializableConflictIn(index, NULL, GIN_METAPAGE_BLKNO);
if (collector->sumsize + collector->ntuples * sizeof(ItemIdData) > GinListPageSize)
{
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 55eab14617..046a20a3d4 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -221,7 +221,7 @@ ginEntryInsert(GinState *ginstate,
return;
}
- CheckForSerializableConflictIn(ginstate->index, NULL, stack->buffer);
+ CheckForSerializableConflictIn(ginstate->index, NULL, BufferGetBlockNumber(stack->buffer));
/* modify an existing leaf entry */
itup = addItemPointersToLeafTuple(ginstate, itup,
items, nitem, buildStats, stack->buffer);
@@ -230,7 +230,7 @@ ginEntryInsert(GinState *ginstate,
}
else
{
- CheckForSerializableConflictIn(ginstate->index, NULL, stack->buffer);
+ CheckForSerializableConflictIn(ginstate->index, NULL, BufferGetBlockNumber(stack->buffer));
/* no match, so construct a new leaf entry */
itup = buildFreshLeafTuple(ginstate, attnum, key, category,
items, nitem, buildStats, stack->buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 45c00aaa87..4f150b02cb 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1273,7 +1273,7 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
* Check for any rw conflicts (in serializable isolation level) just
* before we intend to modify the page
*/
- CheckForSerializableConflictIn(state->r, NULL, stack->buffer);
+ CheckForSerializableConflictIn(state->r, NULL, BufferGetBlockNumber(stack->buffer));
/* Insert the tuple(s) to the page, splitting the page if necessary */
is_split = gistplacetopage(state->r, state->freespace, giststate,
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5321762d5e..e3fb47f9e3 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -88,7 +88,7 @@ restart_insert:
&usedmetap);
Assert(usedmetap != NULL);
- CheckForSerializableConflictIn(rel, NULL, buf);
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(buf));
/* remember the primary bucket buffer to release the pin on it at end. */
bucket_buf = buf;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6c342635e8..b09263364e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -446,7 +446,7 @@ heapgetpage(TableScanDesc sscan, BlockNumber page)
else
valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
- CheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
+ heap_CheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
&loctup, buffer, snapshot);
if (valid)
@@ -668,7 +668,7 @@ heapgettup(HeapScanDesc scan,
snapshot,
scan->rs_cbuf);
- CheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
+ heap_CheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
tuple, scan->rs_cbuf,
snapshot);
@@ -1488,7 +1488,7 @@ heap_fetch(Relation relation,
if (valid)
PredicateLockTuple(relation, tuple, snapshot);
- CheckForSerializableConflictOut(valid, relation, tuple, buffer, snapshot);
+ heap_CheckForSerializableConflictOut(valid, relation, tuple, buffer, snapshot);
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
@@ -1622,7 +1622,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
/* If it's visible per the snapshot, we must return it */
valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);
- CheckForSerializableConflictOut(valid, relation, heapTuple,
+ heap_CheckForSerializableConflictOut(valid, relation, heapTuple,
buffer, snapshot);
/* reset to original, non-redirected, tid */
heapTuple->t_self = *tid;
@@ -1764,7 +1764,7 @@ heap_get_latest_tid(TableScanDesc sscan,
* candidate.
*/
valid = HeapTupleSatisfiesVisibility(&tp, snapshot, buffer);
- CheckForSerializableConflictOut(valid, relation, &tp, buffer, snapshot);
+ heap_CheckForSerializableConflictOut(valid, relation, &tp, buffer, snapshot);
if (valid)
*tid = ctid;
@@ -1919,7 +1919,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
* lock "gaps" as index page locks do. So we don't need to specify a
* buffer when making the call, which makes for a faster check.
*/
- CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -2173,7 +2173,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
* lock "gaps" as index page locks do. So we don't need to specify a
* buffer when making the call, which makes for a faster check.
*/
- CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
ndone = 0;
while (ndone < ntuples)
@@ -2364,7 +2364,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
* lock "gaps" as index page locks do. So we don't need to specify a
* buffer when making the call.
*/
- CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
/*
* If tuples are cachable, mark them for invalidation from the caches in
@@ -2673,7 +2673,7 @@ l1:
* being visible to the scan (i.e., an exclusive buffer content lock is
* continuously held from this point until the tuple delete is visible).
*/
- CheckForSerializableConflictIn(relation, &tp, buffer);
+ CheckForSerializableConflictIn(relation, tid, BufferGetBlockNumber(buffer));
/* replace cid with a combo cid if necessary */
HeapTupleHeaderAdjustCmax(tp.t_data, &cid, &iscombo);
@@ -3583,7 +3583,7 @@ l2:
* will include checking the relation level, there is no benefit to a
* separate check for the new tuple.
*/
- CheckForSerializableConflictIn(relation, &oldtup, buffer);
+ CheckForSerializableConflictIn(relation, otid, BufferGetBlockNumber(buffer));
/*
* At this point newbuf and buffer are both pinned and locked, and newbuf
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a4a28e88ec..b1643790b9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2275,7 +2275,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
hscan->rs_vistuples[ntup++] = offnum;
PredicateLockTuple(scan->rs_rd, &loctup, snapshot);
}
- CheckForSerializableConflictOut(valid, scan->rs_rd, &loctup,
+ heap_CheckForSerializableConflictOut(valid, scan->rs_rd, &loctup,
buffer, snapshot);
}
}
@@ -2463,7 +2463,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
/* in pagemode, heapgetpage did this for us */
if (!pagemode)
- CheckForSerializableConflictOut(visible, scan->rs_rd, tuple,
+ heap_CheckForSerializableConflictOut(visible, scan->rs_rd, tuple,
hscan->rs_cbuf, scan->rs_snapshot);
/* Try next tuple from same page. */
@@ -2602,6 +2602,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
+ .scans_leverage_column_projection = false,
.slot_callbacks = heapam_slot_callbacks,
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aefdd2916d..61ed3167fe 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -180,8 +180,8 @@ index_insert(Relation indexRelation,
if (!(indexRelation->rd_indam->ampredlocks))
CheckForSerializableConflictIn(indexRelation,
- (HeapTuple) NULL,
- InvalidBuffer);
+ (ItemPointer) NULL,
+ InvalidBlockNumber);
return indexRelation->rd_indam->aminsert(indexRelation, values, isnull,
heap_t_ctid, heapRelation,
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 2eccc99023..b905cb1986 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -290,7 +290,7 @@ top:
* checkingunique and !heapkeyspace cases, but it's okay to use the
* first page the value could be on (with scantid omitted) instead.
*/
- CheckForSerializableConflictIn(rel, NULL, insertstate.buf);
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
/*
* Do the insertion. Note that insertstate contains cached binary
@@ -533,7 +533,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* otherwise be masked by this unique constraint
* violation.
*/
- CheckForSerializableConflictIn(rel, NULL, insertstate->buf);
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate->buf));
/*
* This is a definite conflict. Break the tuple down into
diff --git a/src/backend/access/zedstore/Makefile b/src/backend/access/zedstore/Makefile
new file mode 100644
index 0000000000..ae5b939026
--- /dev/null
+++ b/src/backend/access/zedstore/Makefile
@@ -0,0 +1,21 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for access/zedstore
+#
+# IDENTIFICATION
+# src/backend/access/zedstore/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/zedstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = zedstore_btree.o zedstore_tidpage.o zedstore_attpage.o \
+ zedstore_compression.o zedstoream_handler.o \
+ zedstore_meta.o zedstore_undo.o zedstore_toast.o zedstore_visibility.o \
+ zedstore_inspect.o zedstore_freepagemap.o zedstore_utils.o \
+ zedstore_tupslot.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/zedstore/README b/src/backend/access/zedstore/README
new file mode 100644
index 0000000000..85493caf03
--- /dev/null
+++ b/src/backend/access/zedstore/README
@@ -0,0 +1,295 @@
+
+src/backend/access/zedstore/README
+
+ZedStore - compressed column (and row) store for PostgreSQL
+===========================================================
+
+The purpose of this README is to provide overview of zedstore's
+design, major requirements/objectives it intends to fulfill and
+high-level implementation details.
+
+Objectives
+----------
+
+* Performance improvement for queries selecting subset of columns
+(reduced IO).
+
+* Reduced on-disk footprint compared to heap table. Shorter tuple
+headers and also leveraging compression of similar type data
+
+* Be first-class citizen in the Postgres architecture (tables data can
+just independently live in columnar storage) and not be at arm's
+length through an opaque interface.
+
+* Fully MVCC compliant - basically all operations supported similar to
+heap, like update, delete, serializable transactions etc...
+
+* All Indexes supported
+
+* Hybrid row-column store, where some columns are stored together, and
+others separately. Provide flexibility of granularity on how to divide
+the columns. Columns accessed together can be stored together.
+
+* Provide better control over bloat (using zheap)
+
+* Eliminate need for separate toast tables
+
+* Faster add / drop column or changing data type of column by avoiding
+full rewrite of the table.
+
+Highlevel design of zedStore - B-trees for the win!
+---------------------------------------------------
+
+To start simple, let's ignore column store aspect and consider it as
+compressed row store. The column store is a natural extension of this
+concept, explained in next section.
+
+The basic on-disk data structure leveraged is a B-tree, indexed by
+TID. BTree being a great data structure, fast and versatile. Note this
+is not referring to existing Btree indexes, but instead a net new BTree
+for table data storage.
+
+TID - used as a logical row identifier:
+TID is just a 48-bit row identifier. The traditional division into
+block and offset numbers is meaningless. In order to find a tuple with
+a given TID, one must always descend the B-tree. Having logical TID
+provides flexibility to move the tuples around different pages on page
+splits or page merges can be performed.
+
+The internal pages of the B-tree are super simple and boring. Each
+internal page just stores an array of TID/downlink pairs. Let's focus
+on the leaf level. Leaf blocks have short uncompressed header,
+followed by btree items. It contains three kinds of items:
+
+- plain item, holds one tuple or one datum, uncompressed payload
+
+- array item, holds multiple datums, with consecutive TIDs and the
+same visibility information. An array item saves space compared to
+multiple single items, by leaving out repetitive UNDO and TID
+fields. An array item cannot mix NULLs and non-NULLs, so the ZSBT_NULL
+flag applies to all elements.
+
+- a "container item", holds multiple plain items, compressed payload
+
++-----------------------------
+| Fixed-size page header:
+|
+| LSN
+| TID low and hi key (for Lehman & Yao B-tree operations)
+| left and right page pointers
+|
+| Items:
+|
+| TID | size | flags | uncompressed size | lastTID | payload (container item)
+| TID | size | flags | uncompressed size | lastTID | payload (container item)
+| TID | size | flags | undo pointer | payload (plain item)
+| TID | size | flags | undo pointer | payload (plain item)
+| ...
+|
++----------------------------
+
+Row store
+---------
+
+The tuples are stored one after another, sorted by TID. For each
+tuple, we store its 48-bit TID, an undo record pointer, and the actual
+tuple data uncompressed.
+
+In uncompressed form, the page can be arbitrarily large. But after
+compression, it must fit into a physical 8k block. If on insert or
+update of a tuple, the page cannot be compressed below 8k anymore, the
+page is split. Note that because TIDs are logical rather than physical
+identifiers, we can freely move tuples from one physical page to
+another during page split. A tuple's TID never changes.
+
+The buffer cache caches compressed blocks. Likewise, WAL-logging,
+full-page images etc. work on compressed blocks. Uncompression is done
+on-the-fly, as and when needed in backend-private memory, when
+reading. For some compression schemes, like RLE or delta encoding,
+tuples can be constructed directly from compressed data.
+
+Column store
+------------
+
+A column store uses the same structure but we have *multiple* B-trees,
+one for each column plus one for storing meta-data
+(a.k.a. meta-column), all indexed by TID. Imagine zedstore as a forest
+of B-trees. The B-trees for all columns are stored in the same
+physical file.
+
+A metapage at block 0, has links to the roots of the B-trees. Leaf
+pages look the same, but instead of storing the whole tuple, stores
+just a single attribute. To reconstruct a row with given TID, scan
+descends down the B-trees for all the columns using that TID, and
+fetches all attributes. Likewise, a sequential scan walks all the
+B-trees in lockstep.
+
+The special or first btree for meta-column is used to allocate TIDs
+for tuples, track the UNDO location which provides visibility
+information. Also this special btree, which always exists, helps to
+support zero column tables (which can be result of ADD COLUMN DROP
+COLUMN actions as well). Plus, having meta-data stored separately from
+data, helps to get better compression ratios. And also helps to
+simplify the overall design/implementation as for deletes just need to
+edit the meta-column and avoid touching the actual data btrees.
+
+
+MVCC
+----
+
+Undo record pointers are used to implement MVCC, like in zheap. Hence,
+transaction information is not directly stored with the data. In
+zheap, there's a small, fixed, number of "transaction slots" on each
+page, but zedstore has undo pointer with each item directly; in normal
+cases, the compression squeezes this down to almost nothing. In case
+of bulk load the undo record pointer is maintained for array of items
+and not per item. Undo pointer is only stored in meta-column and all
+MVCC operations are performed using the meta-column only.
+
+
+Insert:
+Inserting a new row, splits the row into datums. Then while adding
+entry for meta-column adds, decides block to insert, picks a TID for
+it, and writes undo record for the same. All the data columns are
+inserted using that TID.
+
+Toast:
+When an overly large datum is stored, it is divided into chunks, and
+each chunk is stored on a dedicated toast page within the same
+physical file. The toast pages of a datum form a list; each page has a
+next/prev pointer.
+
+Select:
+Property is added to Table AM to convey if column projection is
+leveraged by AM for scans. While scanning tables with AM leveraging
+this property, executor parses the plan. Leverages the target list and
+quals to find the required columns for query. This list is passed down
+to AM on beginscan. Zedstore uses this column projection list to only
+pull data from selected columns. Virtual tuple table slot is used to
+pass back the datums for subset of columns.
+
+Current table am API requires enhancement here to pass down column
+projection to AM. The patch showcases two different ways for the same.
+
+* For sequential scans added new beginscan_with_column_projection()
+API. Executor checks AM property and if it leverages column projection
+uses this new API else normal beginscan() API.
+
+* For index scans instead of modifying the begin scan API, added new
+API to specifically pass column projection list after calling begin
+scan to populate the scan descriptor but before fetching the tuples.
+
+Delete:
+When deleting a tuple, new undo record is created for delete and only
+meta-column item is updated with this new undo record. New undo record
+created points to previous undo record pointer (insert undo record)
+present for the tuple. Hence, delete only operates on meta-column and
+no data column is edited.
+
+Update:
+Update in zedstore is pretty equivalent to delete and insert. Delete
+action is performed as stated above and new entry is added with
+updated values. So, no in-place update happens.
+
+Index Support:
+Building index also leverages columnar storage and only scans columns
+required to build the index. Indexes work pretty similar to heap
+tables. Data is inserted into tables and TID for the tuple gets stored
+in index. On index scans, required column Btrees are scanned for given
+TID and datums passed back using virtual tuple. Since only meta-column
+is leveraged to perform visibility check, only visible tuples data are
+fetched from rest of the Btrees.
+
+Page Format
+-----------
+A ZedStore table contains different kinds of pages, all in the same
+file. Kinds of pages are meta-page, per-attribute btree internal and
+leaf pages, UNDO log page, and toast pages. Each page type has its own
+distinct data storage format.
+
+META Page:
+Block 0 is always a metapage. It contains the block numbers of the
+other data structures stored within the file, like the per-attribute
+B-trees, and the UNDO log.
+
+BTREE Page:
+
+UNDO Page:
+
+TOAST Page:
+
+
+Free Space Map
+--------------
+
+
+Enhancements
+------------
+
+Instead of compressing all the tuples on a page in one batch, store a
+small "dictionary", e.g. in page header or meta page or separate
+dedicated page, and use it to compress tuple by tuple. That could make
+random reads and updates of individual tuples faster. Need to find how
+to create the dictionary first.
+
+Only cached compressed pages in the page cache. If we want to cache
+uncompressed pages instead, or in addition to that, we need to invent
+a whole new kind of a buffer cache that can deal with the
+variable-size blocks. For a first version, I think we can live without
+it.
+
+Instead of storing all columns in the same file, we could store them
+in separate files (separate forks?). That would allow immediate reuse
+of space, after dropping a column. It's not clear how to use an FSM in
+that case, though. Might have to implement an integrated FSM,
+too. (Which might not be a bad idea, anyway).
+
+Design allows for hybrid row-column store, where some columns are
+stored together, and others have a dedicated B-tree. Need to have user
+facing syntax to allow specifying how to group the columns.
+
+Salient points for the design
+------------------------------
+
+* Layout the data/tuples in mapped fashion instead of keeping the
+logical to physical mapping separate from actual data. So, keep all
+the meta-data and data logically in single stream of file, avoiding
+the need for separate forks/files to store meta-data and data.
+
+* Handle/treat operations at tuple level and not block level.
+
+* Stick to fixed size physical blocks. Variable size blocks (for
+possibly higher compression ratios) pose need for increased logical to
+physical mapping maintenance, plus restrictions on concurrency of
+writes and reads to files. Hence adopt compression to fit fixed size
+blocks instead of other way round.
+
+
+Predicate locking
+-----------------
+
+Predicate locks, to support SERIALIZABLE transactions, are taken like
+with the heap. From README-SSI:
+
+* For a table scan, the entire relation will be locked.
+
+* Each tuple read which is visible to the reading transaction will be
+locked, whether or not it meets selection criteria; except that there
+is no need to acquire an SIREAD lock on a tuple when the transaction
+already holds a write lock on any tuple representing the row, since a
+rw-conflict would also create a ww-dependency which has more
+aggressive enforcement and thus will prevent any anomaly.
+
+* Modifying a heap tuple creates a rw-conflict with any transaction
+that holds a SIREAD lock on that tuple, or on the page or relation
+that contains it.
+
+* Inserting a new tuple creates a rw-conflict with any transaction
+holding a SIREAD lock on the entire relation. It doesn't conflict with
+page-level locks, because page-level locks are only used to aggregate
+tuple locks. Unlike index page locks, they don't lock "gaps" on the
+page.
+
+
+ZedStore isn't block-based, so page-level locks really just mean a
+range of TIDs. They're only used to aggregate tuple locks.
diff --git a/src/backend/access/zedstore/zedstore_attpage.c b/src/backend/access/zedstore/zedstore_attpage.c
new file mode 100644
index 0000000000..10e6517c26
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_attpage.c
@@ -0,0 +1,1589 @@
+/*
+ * zedstore_attpage.c
+ * Routines for handling attribute leaf pages.
+ *
+ * A Zedstore table consists of multiple B-trees, one for each attribute. The
+ * functions in this file deal with one B-tree at a time, it is the caller's
+ * responsibility to tie together the scans of each btree.
+ *
+ * Operations:
+ *
+ * - Sequential scan in TID order
+ * - must be efficient with scanning multiple trees in sync
+ *
+ * - random lookups, by TID (for index scan)
+ *
+ * - range scans by TID (for bitmap index scan)
+ *
+ * NOTES:
+ * - Locking order: child before parent, left before right
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_attpage.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_compression.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/rel.h"
+
+/* prototypes for local functions */
+static void zsbt_attr_recompress_replace(Relation rel, AttrNumber attno,
+ Buffer oldbuf, List *items);
+static ZSSingleBtreeItem *zsbt_attr_fetch(Relation rel, AttrNumber attno,
+ zstid tid, Buffer *buf_p);
+static void zsbt_attr_replace_item(Relation rel, AttrNumber attno, Buffer buf,
+ zstid oldtid, ZSBtreeItem *replacementitem,
+ List *newitems);
+static Size zsbt_compute_data_size(Form_pg_attribute atti, Datum val, bool isnull);
+static ZSBtreeItem *zsbt_attr_create_item(Form_pg_attribute att, zstid tid,
+ int nelements, Datum *datums,
+ char *dataptr, Size datasz, bool isnull);
+
+/* ----------------------------------------------------------------
+ * Public interface
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Begin a scan of the btree.
+ */
+void
+zsbt_attr_begin_scan(Relation rel, TupleDesc tdesc, AttrNumber attno, zstid starttid,
+ zstid endtid, ZSBtreeScan *scan)
+{
+ Buffer buf;
+
+ scan->rel = rel;
+ scan->attno = attno;
+ scan->tupledesc = tdesc;
+
+ scan->snapshot = NULL;
+ scan->context = CurrentMemoryContext;
+ scan->lastoff = InvalidOffsetNumber;
+ scan->has_decompressed = false;
+ scan->nexttid = starttid;
+ scan->endtid = endtid;
+ memset(&scan->recent_oldest_undo, 0, sizeof(scan->recent_oldest_undo));
+ memset(&scan->array_undoptr, 0, sizeof(scan->array_undoptr));
+ scan->array_datums = palloc(sizeof(Datum));
+ scan->array_datums_allocated_size = 1;
+ scan->array_elements_left = 0;
+
+ buf = zsbt_descend(rel, attno, starttid, 0, true);
+ if (!BufferIsValid(buf))
+ {
+ /* completely empty tree */
+ scan->active = false;
+ scan->lastbuf = InvalidBuffer;
+ return;
+ }
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ scan->active = true;
+ scan->lastbuf = buf;
+
+ zs_decompress_init(&scan->decompressor);
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+}
+
+/*
+ * Reset the 'next' TID in a scan to the given TID.
+ */
+void
+zsbt_attr_reset_scan(ZSBtreeScan *scan, zstid starttid)
+{
+ if (starttid < scan->nexttid)
+ {
+ /* have to restart from scratch. */
+ scan->array_elements_left = 0;
+ scan->nexttid = starttid;
+ scan->has_decompressed = false;
+ if (scan->lastbuf != InvalidBuffer)
+ ReleaseBuffer(scan->lastbuf);
+ scan->lastbuf = InvalidBuffer;
+ }
+ else
+ zsbt_scan_skip(scan, starttid);
+}
+
+void
+zsbt_attr_end_scan(ZSBtreeScan *scan)
+{
+ if (!scan->active)
+ return;
+
+ if (scan->lastbuf != InvalidBuffer)
+ ReleaseBuffer(scan->lastbuf);
+ zs_decompress_free(&scan->decompressor);
+
+ scan->active = false;
+ scan->array_elements_left = 0;
+}
+
+/*
+ * Helper function of zsbt_attr_scan_next(), to extract Datums from the given
+ * array item into the scan->array_* fields.
+ */
+static void
+zsbt_attr_scan_extract_array(ZSBtreeScan *scan, ZSArrayBtreeItem *aitem)
+{
+ int nelements = aitem->t_nelements;
+ zstid tid = aitem->t_tid;
+ bool isnull = (aitem->t_flags & ZSBT_NULL) != 0;
+ char *p = aitem->t_payload;
+
+ /* skip over elements that we are not interested in */
+ while (tid < scan->nexttid && nelements > 0)
+ {
+ Form_pg_attribute attr = ZSBtreeScanGetAttInfo(scan);
+ if (!isnull)
+ {
+ if (attr->attlen > 0)
+ {
+ p += att_align_nominal(attr->attlen, attr->attalign);
+ }
+ else
+ {
+ p = (Pointer) att_align_pointer(p, attr->attalign, attr->attlen, p);
+ p = att_addlength_pointer(p, attr->attlen, p);
+ }
+ }
+ tid++;
+ nelements--;
+ }
+
+ /* leave out elements that are past end of range */
+ if (tid + nelements > scan->endtid)
+ nelements = scan->endtid - tid;
+
+ scan->array_isnull = isnull;
+
+ if (nelements > scan->array_datums_allocated_size)
+ {
+ if (scan->array_datums)
+ pfree(scan->array_datums);
+ scan->array_datums = palloc(nelements * sizeof(Datum));
+ scan->array_datums_allocated_size = nelements;
+ }
+
+ if (isnull)
+ {
+ /*
+ * For NULLs, clear the Datum array. Not strictly necessary, I think,
+ * but less confusing when debugging.
+ */
+ memset(scan->array_datums, 0, nelements * sizeof(Datum));
+ }
+ else
+ {
+ /*
+ * Expand the packed array data into an array of Datums.
+ *
+ * It would perhaps be more natural to loop through the elements with
+ * datumGetSize() and fetch_att(), but this is a pretty hot loop, so it's
+ * better to avoid checking attlen/attbyval in the loop.
+ *
+ * TODO: a different on-disk representation might make this better still,
+ * for varlenas (this is pretty optimal for fixed-lengths already).
+ * For example, storing an array of sizes or an array of offsets, followed
+ * by the data itself, might incur fewer pipeline stalls in the CPU.
+ */
+ Form_pg_attribute attr = ZSBtreeScanGetAttInfo(scan);
+ int16 attlen = attr->attlen;
+
+ if (attr->attbyval)
+ {
+ if (attlen == sizeof(Datum))
+ {
+ memcpy(scan->array_datums, p, nelements * sizeof(Datum));
+ }
+ else if (attlen == sizeof(int32))
+ {
+ for (int i = 0; i < nelements; i++)
+ {
+ scan->array_datums[i] = fetch_att(p, true, sizeof(int32));
+ p += sizeof(int32);
+ }
+ }
+ else if (attlen == sizeof(int16))
+ {
+ for (int i = 0; i < nelements; i++)
+ {
+ scan->array_datums[i] = fetch_att(p, true, sizeof(int16));
+ p += sizeof(int16);
+ }
+ }
+ else if (attlen == 1)
+ {
+ for (int i = 0; i < nelements; i++)
+ {
+ scan->array_datums[i] = fetch_att(p, true, 1);
+ p += 1;
+ }
+ }
+ else
+ Assert(false);
+ }
+ else if (attlen > 0)
+ {
+ for (int i = 0; i < nelements; i++)
+ {
+ scan->array_datums[i] = PointerGetDatum(p);
+ p += att_align_nominal(attr->attlen, attr->attalign);
+ }
+ }
+ else if (attlen == -1)
+ {
+ for (int i = 0; i < nelements; i++)
+ {
+ p = (Pointer) att_align_pointer(p, attr->attalign, attr->attlen, p);
+ scan->array_datums[i] = PointerGetDatum(p);
+ p = att_addlength_pointer(p, attr->attlen, p);
+ }
+ }
+ else
+ {
+ /* TODO: convert cstrings to varlenas before we get here? */
+ elog(ERROR, "cstrings not supported");
+ }
+ }
+ scan->array_undoptr = aitem->t_undo_ptr;
+ scan->array_next_datum = &scan->array_datums[0];
+ scan->array_elements_left = nelements;
+}
+
+/*
+ * Advance scan to next item.
+ *
+ * Return true if there was another item. The Datum/isnull of the item is
+ * placed in scan->array_* fields. For a pass-by-ref datum, it's a palloc'd
+ * copy that's valid until the next call.
+ *
+ * This is normally not used directly. See zsbt_scan_next_tid() and
+ * zsbt_scan_next_fetch() wrappers, instead.
+ */
+bool
+zsbt_attr_scan_next(ZSBtreeScan *scan)
+{
+ Buffer buf;
+ bool buf_is_locked = false;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ BlockNumber next;
+
+ Assert(scan->active);
+
+ /*
+ * Advance to the next TID >= nexttid.
+ *
+ * This advances scan->nexttid as it goes.
+ */
+ while (scan->nexttid < scan->endtid)
+ {
+ /*
+ * If we are still processing an array item, return next element from it.
+ */
+ if (scan->array_elements_left > 0)
+ {
+ return true;
+ }
+
+ /*
+ * If we are still processing a compressed item, process the next item
+ * from it. If it's an array item, we start iterating the array by
+ * setting the scan->array_* fields, and loop back to top to return the
+ * first element from the array.
+ */
+ if (scan->has_decompressed)
+ {
+ zstid lasttid;
+ ZSBtreeItem *uitem;
+
+ uitem = zs_decompress_read_item(&scan->decompressor);
+
+ if (uitem == NULL)
+ {
+ scan->has_decompressed = false;
+ continue;
+ }
+
+ /* a compressed item cannot contain nested compressed items */
+ Assert((uitem->t_flags & ZSBT_COMPRESSED) == 0);
+
+ lasttid = zsbt_item_lasttid(uitem);
+ if (lasttid < scan->nexttid)
+ continue;
+
+ if (uitem->t_tid >= scan->endtid)
+ break;
+
+ if ((uitem->t_flags & ZSBT_ARRAY) != 0)
+ {
+ /* no need to make a copy, because the uncompressed buffer
+ * is already a copy */
+ ZSArrayBtreeItem *aitem = (ZSArrayBtreeItem *) uitem;
+
+ zsbt_attr_scan_extract_array(scan, aitem);
+ continue;
+ }
+ else
+ {
+ /* single item */
+ ZSSingleBtreeItem *sitem = (ZSSingleBtreeItem *) uitem;
+ Form_pg_attribute attr = ZSBtreeScanGetAttInfo(scan);
+
+ scan->nexttid = sitem->t_tid;
+ scan->array_undoptr = sitem->t_undo_ptr;
+ scan->array_elements_left = 1;
+ scan->array_next_datum = &scan->array_datums[0];
+ if (sitem->t_flags & ZSBT_NULL)
+ scan->array_isnull = true;
+ else
+ {
+ scan->array_isnull = false;
+ scan->array_datums[0] = fetch_att(sitem->t_payload, attr->attbyval, attr->attlen);
+ /* no need to copy, because the uncompression buffer is a copy already */
+ /* FIXME: do we need to copy anyway, to make sure it's aligned correctly? */
+ }
+
+ if (buf_is_locked)
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ buf_is_locked = false;
+ return true;
+ }
+ }
+
+ /*
+ * Scan the page for the next item.
+ */
+ buf = scan->lastbuf;
+ if (!buf_is_locked)
+ {
+ if (BufferIsValid(buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ buf_is_locked = true;
+
+ /*
+ * It's possible that the page was concurrently split or recycled by
+ * another backend (or ourselves). Have to re-check that the page is
+ * still valid.
+ */
+ if (!zsbt_page_is_expected(scan->rel, scan->attno, scan->nexttid, 0, buf))
+ {
+ /*
+ * It's not valid for the TID we're looking for, but maybe it was the
+ * right page for the previous TID. In that case, we don't need to
+ * restart from the root, we can follow the right-link instead.
+ */
+ if (zsbt_page_is_expected(scan->rel, scan->attno, scan->nexttid - 1, 0, buf))
+ {
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ next = opaque->zs_next;
+ if (next != InvalidBlockNumber)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ buf_is_locked = false;
+ buf = ReleaseAndReadBuffer(buf, scan->rel, next);
+ scan->lastbuf = buf;
+ continue;
+ }
+ }
+
+ UnlockReleaseBuffer(buf);
+ buf_is_locked = false;
+ buf = scan->lastbuf = InvalidBuffer;
+ }
+ }
+
+ if (!BufferIsValid(buf))
+ {
+ buf = scan->lastbuf = zsbt_descend(scan->rel, scan->attno, scan->nexttid, 0, true);
+ buf_is_locked = true;
+ }
+ }
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ Assert(opaque->zs_page_id == ZS_BTREE_PAGE_ID);
+
+ /* TODO: check the last offset first, as an optimization */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSBtreeItem *item = (ZSBtreeItem *) PageGetItem(page, iid);
+ zstid lasttid;
+
+ lasttid = zsbt_item_lasttid(item);
+
+ if (scan->nexttid > lasttid)
+ continue;
+
+ if (item->t_tid >= scan->endtid)
+ {
+ scan->nexttid = scan->endtid;
+ break;
+ }
+
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ ZSCompressedBtreeItem *citem = (ZSCompressedBtreeItem *) item;
+ MemoryContext oldcxt = MemoryContextSwitchTo(scan->context);
+
+ zs_decompress_chunk(&scan->decompressor, citem);
+ MemoryContextSwitchTo(oldcxt);
+ scan->has_decompressed = true;
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ buf_is_locked = false;
+ break;
+ }
+ else
+ {
+ if ((item->t_flags & ZSBT_ARRAY) != 0)
+ {
+ /* copy the item, because we can't hold a lock on the page */
+ ZSArrayBtreeItem *aitem;
+
+ aitem = MemoryContextAlloc(scan->context, item->t_size);
+ memcpy(aitem, item, item->t_size);
+
+ zsbt_attr_scan_extract_array(scan, aitem);
+
+ if (scan->array_elements_left > 0)
+ {
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ buf_is_locked = false;
+ break;
+ }
+ }
+ else
+ {
+ /* single item */
+ ZSSingleBtreeItem *sitem = (ZSSingleBtreeItem *) item;
+ Form_pg_attribute attr = ZSBtreeScanGetAttInfo(scan);
+
+ scan->nexttid = sitem->t_tid;
+ scan->array_undoptr = sitem->t_undo_ptr;
+ scan->array_elements_left = 1;
+ scan->array_next_datum = &scan->array_datums[0];
+ if (item->t_flags & ZSBT_NULL)
+ scan->array_isnull = true;
+ else
+ {
+ scan->array_isnull = false;
+ scan->array_datums[0] = fetch_att(sitem->t_payload, attr->attbyval, attr->attlen);
+ scan->array_datums[0] = zs_datumCopy(scan->array_datums[0], attr->attbyval, attr->attlen);
+ }
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ buf_is_locked = false;
+ return true;
+ }
+ }
+ }
+
+ if (scan->array_elements_left > 0 || scan->has_decompressed)
+ continue;
+
+ /* No more items on this page. Walk right, if possible */
+ next = opaque->zs_next;
+ if (next == BufferGetBlockNumber(buf))
+ elog(ERROR, "btree page %u next-pointer points to itself", next);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ buf_is_locked = false;
+
+ if (next == InvalidBlockNumber || scan->nexttid >= scan->endtid)
+ {
+ scan->active = false;
+ scan->array_elements_left = 0;
+ ReleaseBuffer(scan->lastbuf);
+ scan->lastbuf = InvalidBuffer;
+ break;
+ }
+
+ scan->lastbuf = ReleaseAndReadBuffer(scan->lastbuf, scan->rel, next);
+ }
+
+ return false;
+}
+
+/*
+ * Insert multiple items into the given attribute's btree.
+ *
+ * Populates the TIDs of the new tuples.
+ *
+ * If 'tid' in list is valid, then that TID is used. It better not be in use already. If
+ * it's invalid, then a new TID is allocated, as we see best. (When inserting the
+ * first column of the row, pass invalid, and for other columns, pass the TID
+ * you got for the first column.)
+ */
+void
+zsbt_attr_multi_insert(Relation rel, AttrNumber attno,
+ Datum *datums, bool *isnulls, zstid *tids, int nitems)
+{
+ Form_pg_attribute attr;
+ zstid tid = tids[0];
+ Buffer buf;
+ zstid insert_target_key;
+ int i;
+ List *newitems;
+
+ Assert (attno >= 1);
+ attr = &rel->rd_att->attrs[attno - 1];
+
+ /*
+ * Find the right place for the given TID.
+ */
+ insert_target_key = tid;
+
+ buf = zsbt_descend(rel, attno, insert_target_key, 0, false);
+
+ /* Create items to insert. */
+ newitems = NIL;
+ i = 0;
+ while (i < nitems)
+ {
+ Size datasz;
+ int j;
+ ZSBtreeItem *newitem;
+
+ /*
+ * Try to collapse as many items as possible into an Array item.
+ * The first item in the array is now at tids[i]/datums[i]/isnulls[i].
+ * Items can be stored in the same array as long as the TIDs are
+ * consecutive, they all have the same isnull flag, and the array
+ * isn't too large to be stored on a single leaf page. Scan the
+ * arrays, checking those conditions.
+ */
+ datasz = zsbt_compute_data_size(attr, datums[i], isnulls[i]);
+ for (j = i + 1; j < nitems; j++)
+ {
+ if (isnulls[j] != isnulls[i])
+ break;
+
+ if (tids[j] != tids[j - 1] + 1)
+ break;
+
+ /*
+ * Will the array still fit on a leaf page, if this datum is
+ * included in it? We actually use 1/4 of the page, to avoid
+ * making very large arrays, which might be slower to update in
+ * the future. Also, using an array that completely fills a page
+ * might cause more fragmentation. (XXX: The 1/4 threshold
+ * is arbitrary, though, and this probably needs more smarts
+ * or testing to determine the optimum.)
+ */
+ if (!isnulls[i])
+ {
+ Datum val = datums[j];
+ Size datum_sz;
+
+ datum_sz = zsbt_compute_data_size(attr, val, false);
+ if (datasz + datum_sz > MaxZedStoreDatumSize / 4)
+ break;
+ datasz += datum_sz;
+ }
+ }
+
+ /*
+ * 'i' is now the first entry to store in the array, and 'j' is the
+ * last + 1 element to store. If j == i + 1, then there is only one
+ * element and zsbt_create_item() will create a 'single' item rather
+ * than an array.
+ */
+ newitem = zsbt_attr_create_item(attr, tids[i],
+ j - i, &datums[i], NULL, datasz, isnulls[i]);
+
+ newitems = lappend(newitems, newitem);
+ i = j;
+ }
+
+ /* recompress and possibly split the page */
+ zsbt_attr_replace_item(rel, attno, buf,
+ InvalidZSTid, NULL,
+ newitems);
+ /* zsbt_replace_item unlocked 'buf' */
+ ReleaseBuffer(buf);
+}
+
+void
+zsbt_attr_remove(Relation rel, AttrNumber attno, zstid tid)
+{
+ Buffer buf;
+ ZSSingleBtreeItem *item;
+
+ /* Find the item to delete. (It could be compressed) */
+ item = zsbt_attr_fetch(rel, attno, tid, &buf);
+ if (item == NULL)
+ {
+ elog(WARNING, "could not find tuple to remove with TID (%u, %u) for attribute %d",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid), attno);
+ return;
+ }
+
+ /* remove it */
+ zsbt_attr_replace_item(rel, attno, buf,
+ tid, NULL,
+ NIL);
+ ReleaseBuffer(buf); /* zsbt_replace_item released */
+}
+
+/* ----------------------------------------------------------------
+ * Internal routines
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Fetch the item with given TID. The page containing the item is kept locked, and
+ * returned to the caller in *buf_p. This is used to locate a tuple for updating
+ * or deleting it.
+ */
+static ZSSingleBtreeItem *
+zsbt_attr_fetch(Relation rel, AttrNumber attno, zstid tid, Buffer *buf_p)
+{
+ Buffer buf;
+ Page page;
+ ZSBtreeItem *item = NULL;
+ bool found = false;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ buf = zsbt_descend(rel, attno, tid, 0, false);
+ if (buf == InvalidBuffer)
+ {
+ *buf_p = InvalidBuffer;
+ return NULL;
+ }
+ page = BufferGetPage(buf);
+
+ /* Find the item on the page that covers the target TID */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ item = (ZSBtreeItem *) PageGetItem(page, iid);
+
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ ZSCompressedBtreeItem *citem = (ZSCompressedBtreeItem *) item;
+ ZSDecompressContext decompressor;
+
+ zs_decompress_init(&decompressor);
+ zs_decompress_chunk(&decompressor, citem);
+
+ while ((item = zs_decompress_read_item(&decompressor)) != NULL)
+ {
+ zstid lasttid = zsbt_item_lasttid(item);
+
+ if (item->t_tid <= tid && lasttid >= tid)
+ {
+ found = true;
+ break;
+ }
+ }
+ if (found)
+ {
+ /* FIXME: decompressor is leaked. Can't free it yet, because we still
+ * need to access the item below
+ */
+ break;
+ }
+ zs_decompress_free(&decompressor);
+ }
+ else
+ {
+ zstid lasttid = zsbt_item_lasttid(item);
+
+ if (item->t_tid <= tid && lasttid >= tid)
+ {
+ found = true;
+ break;
+ }
+ }
+ }
+
+ if (found)
+ {
+ ZSSingleBtreeItem *result;
+
+ if ((item->t_flags & ZSBT_ARRAY) != 0)
+ {
+ ZSArrayBtreeItem *aitem = (ZSArrayBtreeItem *) item;
+ int elemno = tid - aitem->t_tid;
+ char *dataptr = NULL;
+ int datasz;
+ int resultsize;
+
+ Assert(elemno < aitem->t_nelements);
+
+ if ((item->t_flags & ZSBT_NULL) == 0)
+ {
+ /*
+ * TODO: Currently, zsbt_fetch() is called from functions
+ * which don't have Slot, and Relation object can be trusted
+ * for attlen and attbyval. Ideally, we wish to not rely on
+ * Relation object and see how to decouple it. Previously, we
+ * stored these two values in meta-page and get these values
+ * from it but just storing them for this purpose, seems
+ * heavy. Ideally, catalog stores those values so shouldn't
+ * need to duplicate storing the same.
+ */
+ TupleDesc tdesc = RelationGetDescr(rel);
+ int attlen = tdesc->attrs[attno - 1].attlen;
+ bool attbyval = tdesc->attrs[attno - 1].attbyval;
+
+ if (attlen > 0)
+ {
+ dataptr = aitem->t_payload + elemno * attlen;
+ datasz = attlen;
+ }
+ else
+ {
+ dataptr = aitem->t_payload;
+ for (int i = 0; i < elemno; i++)
+ {
+ dataptr += zs_datumGetSize(PointerGetDatum(dataptr), attbyval, attlen);
+ }
+ datasz = zs_datumGetSize(PointerGetDatum(dataptr), attbyval, attlen);
+ }
+ }
+ else
+ datasz = 0;
+
+ resultsize = offsetof(ZSSingleBtreeItem, t_payload) + datasz;
+ result = palloc(resultsize);
+ memset(result, 0, offsetof(ZSSingleBtreeItem, t_payload)); /* zero padding */
+ result->t_tid = tid;
+ result->t_flags = item->t_flags & ~ZSBT_ARRAY;
+ result->t_size = resultsize;
+ result->t_undo_ptr = aitem->t_undo_ptr;
+ if (datasz > 0)
+ memcpy(result->t_payload, dataptr, datasz);
+ }
+ else
+ {
+ /* single item */
+ result = (ZSSingleBtreeItem *) item;
+ }
+
+ *buf_p = buf;
+ return result;
+ }
+ else
+ {
+ UnlockReleaseBuffer(buf);
+ *buf_p = InvalidBuffer;
+ return NULL;
+ }
+}
+
+/*
+ * Compute the size of a slice of an array, from an array item. 'dataptr'
+ * points to the packed on-disk representation of the array item's data.
+ * The elements are stored one after each other.
+ */
+static Size
+zsbt_get_array_slice_len(int16 attlen, bool attbyval, bool isnull,
+ char *dataptr, int nelements)
+{
+ Size datasz;
+
+ if (isnull)
+ datasz = 0;
+ else
+ {
+ /*
+ * For a fixed-width type, we can just multiply. For variable-length,
+ * we have to walk through the elements, looking at the length of each
+ * element.
+ */
+ if (attlen > 0)
+ {
+ datasz = attlen * nelements;
+ }
+ else
+ {
+ char *p = dataptr;
+
+ datasz = 0;
+ for (int i = 0; i < nelements; i++)
+ {
+ Size datumsz;
+
+ datumsz = zs_datumGetSize(PointerGetDatum(p), attbyval, attlen);
+
+ /*
+ * The array should already use short varlen representation whenever
+ * possible.
+ */
+ Assert(!VARATT_CAN_MAKE_SHORT(DatumGetPointer(p)));
+
+ datasz += datumsz;
+ p += datumsz;
+ }
+ }
+ }
+ return datasz;
+}
+
+
+/* Does att's datatype allow packing into the 1-byte-header varlena format? */
+#define ATT_IS_PACKABLE(att) \
+ ((att)->attlen == -1 && (att)->attstorage != 'p')
+/* Use this if it's already known varlena */
+#define VARLENA_ATT_IS_PACKABLE(att) \
+ ((att)->attstorage != 'p')
+
+/*
+ * This is very similar to heap_compute_data_size()
+ */
+static Size
+zsbt_compute_data_size(Form_pg_attribute atti, Datum val, bool isnull)
+{
+ Size data_length = 0;
+
+ if (isnull)
+ return 0;
+
+ if (ATT_IS_PACKABLE(atti) &&
+ VARATT_CAN_MAKE_SHORT(DatumGetPointer(val)))
+ {
+ /*
+ * we're anticipating converting to a short varlena header, so
+ * adjust length and don't count any alignment
+ */
+ data_length += VARATT_CONVERTED_SHORT_SIZE(DatumGetPointer(val));
+ }
+ else if (atti->attlen == -1 &&
+ VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val)))
+ {
+ /*
+ * we want to flatten the expanded value so that the constructed
+ * tuple doesn't depend on it
+ */
+ data_length = att_align_nominal(data_length, atti->attalign);
+ data_length += EOH_get_flat_size(DatumGetEOHP(val));
+ }
+ else if (atti->attlen == -1 &&
+ VARATT_IS_EXTERNAL(val) && VARTAG_EXTERNAL(val) == VARTAG_ZEDSTORE)
+ {
+ data_length += sizeof(varatt_zs_toastptr);
+ }
+ else
+ {
+ data_length = att_align_datum(data_length, atti->attalign,
+ atti->attlen, val);
+ data_length = att_addlength_datum(data_length, atti->attlen,
+ val);
+ }
+
+ return data_length;
+}
+
+/*
+ * Form a ZSBtreeItem out of the given datums, or data that's already in on-disk
+ * array format, for insertion.
+ *
+ * If there's more than one element, an array item is created. Otherwise, a single
+ * item.
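+ *
+ * In the array case, the payload is the datums packed back-to-back, in the
+ * same on-disk format that the copy loop below produces: short varlenas and
+ * zedstore toast pointers are stored unaligned, other types are aligned
+ * per attalign.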
+ */
+static ZSBtreeItem *
+zsbt_attr_create_item(Form_pg_attribute att, zstid tid,
+ int nelements, Datum *datums,
+ char *datasrc, Size datasz, bool isnull)
+{
+ ZSBtreeItem *result;
+ Size itemsz;
+ char *databegin;
+
+ Assert(nelements > 0);
+
+ if (nelements > 1)
+ {
+ ZSArrayBtreeItem *newitem;
+
+ itemsz = offsetof(ZSArrayBtreeItem, t_payload) + datasz;
+
+ newitem = palloc(itemsz);
+ memset(newitem, 0, offsetof(ZSArrayBtreeItem, t_payload)); /* zero padding */
+ newitem->t_tid = tid;
+ newitem->t_size = itemsz;
+ newitem->t_flags = ZSBT_ARRAY;
+ if (isnull)
+ newitem->t_flags |= ZSBT_NULL;
+ newitem->t_nelements = nelements;
+ ZSUndoRecPtrInitialize(&newitem->t_undo_ptr);
+
+ databegin = newitem->t_payload;
+
+ result = (ZSBtreeItem *) newitem;
+ }
+ else
+ {
+ ZSSingleBtreeItem *newitem;
+
+ itemsz = offsetof(ZSSingleBtreeItem, t_payload) + datasz;
+
+ newitem = palloc(itemsz);
+ memset(newitem, 0, offsetof(ZSSingleBtreeItem, t_payload)); /* zero padding */
+ newitem->t_tid = tid;
+ newitem->t_flags = 0;
+ if (isnull)
+ newitem->t_flags |= ZSBT_NULL;
+ newitem->t_size = itemsz;
+ ZSUndoRecPtrInitialize(&newitem->t_undo_ptr);
+
+ databegin = newitem->t_payload;
+
+ result = (ZSBtreeItem *) newitem;
+ }
+
+ /*
+ * Copy the data.
+ *
+ * This is largely copied from heaptuple.c's fill_val().
+ */
+ if (!isnull)
+ {
+ char *data = databegin;
+
+ if (datums)
+ {
+ for (int i = 0; i < nelements; i++)
+ {
+ Datum datum = datums[i];
+ Size data_length;
+
+ /*
+ * XXX we use the att_align macros on the pointer value itself, not on an
+ * offset. This is a bit of a hack.
+ */
+ if (att->attbyval)
+ {
+ /* pass-by-value */
+ data = (char *) att_align_nominal(data, att->attalign);
+ store_att_byval(data, datum, att->attlen);
+ data_length = att->attlen;
+ }
+ else if (att->attlen == -1)
+ {
+ /* varlena */
+ Pointer val = DatumGetPointer(datum);
+
+ if (VARATT_IS_EXTERNAL(val))
+ {
+ if (VARATT_IS_EXTERNAL_EXPANDED(val))
+ {
+ /*
+ * we want to flatten the expanded value so that the
+ * constructed tuple doesn't depend on it
+ */
+ /* FIXME: This should happen earlier, because if the
+ * datum is very large, it should be toasted, and
+ * that should happen earlier.
+ */
+ ExpandedObjectHeader *eoh = DatumGetEOHP(datum);
+
+ data = (char *) att_align_nominal(data,
+ att->attalign);
+ data_length = EOH_get_flat_size(eoh);
+ EOH_flatten_into(eoh, data, data_length);
+ }
+ else if (VARATT_IS_EXTERNAL(val) && VARTAG_EXTERNAL(val) == VARTAG_ZEDSTORE)
+ {
+ data_length = sizeof(varatt_zs_toastptr);
+ memcpy(data, val, data_length);
+ }
+ else
+ {
+ /* no alignment, since it's short by definition */
+ data_length = VARSIZE_EXTERNAL(val);
+ memcpy(data, val, data_length);
+ }
+ }
+ else if (VARATT_IS_SHORT(val))
+ {
+ /* no alignment for short varlenas */
+ data_length = VARSIZE_SHORT(val);
+ memcpy(data, val, data_length);
+ }
+ else if (VARLENA_ATT_IS_PACKABLE(att) &&
+ VARATT_CAN_MAKE_SHORT(val))
+ {
+ /* convert to short varlena -- no alignment */
+ data_length = VARATT_CONVERTED_SHORT_SIZE(val);
+ SET_VARSIZE_SHORT(data, data_length);
+ memcpy(data + 1, VARDATA(val), data_length - 1);
+ }
+ else
+ {
+ /* full 4-byte header varlena */
+ data = (char *) att_align_nominal(data,
+ att->attalign);
+ data_length = VARSIZE(val);
+ memcpy(data, val, data_length);
+ }
+ }
+ else if (att->attlen == -2)
+ {
+ /* cstring ... never needs alignment */
+ Assert(att->attalign == 'c');
+ data_length = strlen(DatumGetCString(datum)) + 1;
+ memcpy(data, DatumGetPointer(datum), data_length);
+ }
+ else
+ {
+ /* fixed-length pass-by-reference */
+ data = (char *) att_align_nominal(data, att->attalign);
+ Assert(att->attlen > 0);
+ data_length = att->attlen;
+ memcpy(data, DatumGetPointer(datum), data_length);
+ }
+ data += data_length;
+ }
+ Assert(data - databegin == datasz);
+ }
+ else
+ memcpy(data, datasrc, datasz);
+ }
+
+ return result;
+}
+
+/*
+ * This helper function is used to implement INSERT, UPDATE and DELETE.
+ *
+ * If 'oldtid' is valid, the existing item covering 'oldtid' on the page is
+ * replaced with 'replacementitem'. 'replacementitem' can be NULL, to remove
+ * the old item without putting anything in its place.
+ *
+ * If 'newitems' is not empty, the items in the list are added to the page,
+ * to the correct position. FIXME: Actually, they're always just added to
+ * the end of the page, and that better be the correct position.
+ *
+ * This function handles decompressing and recompressing items, and splitting
+ * the page if needed.
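+ *
+ * Illustrative call patterns (an editor's sketch, based on the callers in
+ * this file):
+ *
+ *   add new items:    zsbt_attr_replace_item(rel, attno, buf, InvalidZSTid, NULL, newitems);
+ *   remove one item:  zsbt_attr_replace_item(rel, attno, buf, oldtid, NULL, NIL);
+ *   replace one item: zsbt_attr_replace_item(rel, attno, buf, oldtid, replacementitem, NIL);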
+ */
+static void
+zsbt_attr_replace_item(Relation rel, AttrNumber attno, Buffer buf,
+ zstid oldtid,
+ ZSBtreeItem *replacementitem,
+ List *newitems)
+{
+ Form_pg_attribute attr;
+ int16 attlen;
+ bool attbyval;
+ Page page = BufferGetPage(buf);
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ List *items;
+ bool found_old_item = false;
+ /* We might need to decompress a previously compressed item */
+ ZSDecompressContext decompressor;
+ bool decompressor_used = false;
+ bool decompressing;
+
+ if (attno == ZS_META_ATTRIBUTE_NUM)
+ {
+ attr = NULL;
+ attlen = 0;
+ attbyval = true;
+ }
+ else
+ {
+ attr = &rel->rd_att->attrs[attno - 1];
+ attlen = attr->attlen;
+ attbyval = attr->attbyval;
+ }
+
+ if (replacementitem)
+ Assert(replacementitem->t_tid == oldtid);
+
+ /*
+ * TODO: It would be good to have a fast path, for the common case that we're
+ * just adding items to the end.
+ */
+
+ /* Loop through all old items on the page */
+ items = NIL;
+ maxoff = PageGetMaxOffsetNumber(page);
+ decompressing = false;
+ off = 1;
+ for (;;)
+ {
+ ZSBtreeItem *item;
+
+ /*
+ * Get the next item to process. If we're decompressing, get the next
+ * tuple from the decompressor, otherwise get the next item from the page.
+ */
+ if (decompressing)
+ {
+ item = zs_decompress_read_item(&decompressor);
+ if (!item)
+ {
+ decompressing = false;
+ continue;
+ }
+ }
+ else if (off <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, off);
+
+ item = (ZSBtreeItem *) PageGetItem(page, iid);
+ off++;
+
+ }
+ else
+ {
+ /* out of items */
+ break;
+ }
+
+ /* we now have an item to process, either straight from the page or from
+ * the decompressor */
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ zstid item_lasttid = zsbt_item_lasttid(item);
+
+ /* there shouldn't be nested compressed items */
+ if (decompressing)
+ elog(ERROR, "nested compressed items on zedstore page not supported");
+
+ if (oldtid != InvalidZSTid && item->t_tid <= oldtid && oldtid <= item_lasttid)
+ {
+ ZSCompressedBtreeItem *citem = (ZSCompressedBtreeItem *) item;
+
+ /* Found it, this compressed item covers the target TID. */
+ /* We have to decompress it, and recompress */
+ Assert(!decompressor_used);
+
+ zs_decompress_init(&decompressor);
+ zs_decompress_chunk(&decompressor, citem);
+ decompressor_used = true;
+ decompressing = true;
+ continue;
+ }
+ else
+ {
+ /* keep this compressed item as it is */
+ items = lappend(items, item);
+ }
+ }
+ else if ((item->t_flags & ZSBT_ARRAY) != 0)
+ {
+ /* array item */
+ ZSArrayBtreeItem *aitem = (ZSArrayBtreeItem *) item;
+ zstid item_lasttid = zsbt_item_lasttid(item);
+
+ if (oldtid != InvalidZSTid && item->t_tid <= oldtid && oldtid <= item_lasttid)
+ {
+ /*
+ * The target TID is currently part of an array item. We have to split
+ * the array item into two, and put the replacement item in the middle.
+ */
+ int cutoff;
+ Size olddatalen;
+ int nelements = aitem->t_nelements;
+ bool isnull = (aitem->t_flags & ZSBT_NULL) != 0;
+ char *dataptr;
+
+ cutoff = oldtid - item->t_tid;
+
+ /* Array slice before the target TID */
+ dataptr = aitem->t_payload;
+ if (cutoff > 0)
+ {
+ ZSBtreeItem *item1;
+ Size datalen1;
+
+ datalen1 = zsbt_get_array_slice_len(attlen, attbyval, isnull,
+ dataptr, cutoff);
+ item1 = zsbt_attr_create_item(attr, aitem->t_tid,
+ cutoff, NULL, dataptr, datalen1, isnull);
+ dataptr += datalen1;
+ items = lappend(items, item1);
+ }
+
+ /*
+ * Skip over the target element, and store the replacement
+ * item, if any, in its place
+ */
+ olddatalen = zsbt_get_array_slice_len(attlen, attbyval, isnull,
+ dataptr, 1);
+ dataptr += olddatalen;
+ if (replacementitem)
+ items = lappend(items, replacementitem);
+
+ /* Array slice after the target */
+ if (cutoff + 1 < nelements)
+ {
+ ZSBtreeItem *item2;
+ Size datalen2;
+
+ datalen2 = zsbt_get_array_slice_len(attlen, attbyval, isnull,
+ dataptr, nelements - (cutoff + 1));
+ item2 = zsbt_attr_create_item(attr, oldtid + 1,
+ nelements - (cutoff + 1), NULL, dataptr, datalen2, isnull);
+ items = lappend(items, item2);
+ }
+
+ found_old_item = true;
+ }
+ else
+ items = lappend(items, item);
+ }
+ else
+ {
+ /* single item */
+ if (oldtid != InvalidZSTid && item->t_tid == oldtid)
+ {
+ Assert(!found_old_item);
+ found_old_item = true;
+ if (replacementitem)
+ items = lappend(items, replacementitem);
+ }
+ else
+ items = lappend(items, item);
+ }
+ }
+
+ if (oldtid != InvalidZSTid && !found_old_item)
+ elog(ERROR, "could not find old item to replace");
+
+ /* Add any new items to the end */
+ if (newitems)
+ items = list_concat(items, newitems);
+
+ /* Now pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ if (items)
+ {
+ zsbt_attr_recompress_replace(rel, attno, buf, items);
+ }
+ else
+ {
+ zs_split_stack *stack;
+
+ stack = zsbt_unlink_page(rel, attno, buf, 0);
+
+ if (!stack)
+ {
+ /* couldn't unlink the page; just replace it with an empty page instead */
+ Page newpage = PageGetTempPageCopySpecial(BufferGetPage(buf));
+
+ stack = zs_new_split_stack_entry(buf, newpage);
+ }
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack);
+ }
+
+ /*
+ * We can now free the decompression context. The pointers in the 'items' list
+ * may point into the decompression buffer, so it cannot be freed until after
+ * the pages have been written out.
+ */
+ if (decompressor_used)
+ zs_decompress_free(&decompressor);
+ list_free(items);
+}
+
+/*
+ * Recompressor routines
+ */
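+
+/*
+ * Overview of the recompressor flow (an editor's summary of the routines
+ * below): zsbt_attr_recompress_replace() walks the item list, buffering
+ * uncompressed items in the compressor with
+ * zsbt_attr_recompress_add_to_compressor(). When the compressor fills up, or
+ * an already-compressed item is encountered, the buffered items are written
+ * to the current in-memory page as one compressed item by
+ * zsbt_attr_recompress_flush() / zsbt_attr_recompress_add_to_page(). When the
+ * current page runs out of space, zsbt_attr_recompress_newpage() starts a new
+ * page and links it into the split stack.
+ */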
+typedef struct
+{
+ Page currpage;
+ ZSCompressContext compressor;
+ int compressed_items;
+
+ /* first page writes over the old buffer, subsequent pages get newly-allocated buffers */
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ int total_items;
+ int total_compressed_items;
+ int total_already_compressed_items;
+
+ AttrNumber attno;
+ zstid hikey;
+} zsbt_attr_recompress_context;
+
+static void
+zsbt_attr_recompress_newpage(zsbt_attr_recompress_context *cxt, zstid nexttid, int flags)
+{
+ Page newpage;
+ ZSBtreePageOpaque *newopaque;
+ zs_split_stack *stack;
+
+ if (cxt->currpage)
+ {
+ /* set the last tid on previous page */
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(cxt->currpage);
+
+ oldopaque->zs_hikey = nexttid;
+ }
+
+ newpage = (Page) palloc(BLCKSZ);
+ PageInit(newpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+
+ stack = zs_new_split_stack_entry(InvalidBuffer, /* will be assigned later */
+ newpage);
+ if (cxt->stack_tail)
+ cxt->stack_tail->next = stack;
+ else
+ cxt->stack_head = stack;
+ cxt->stack_tail = stack;
+
+ cxt->currpage = newpage;
+
+ newopaque = ZSBtreePageGetOpaque(newpage);
+ newopaque->zs_attno = cxt->attno;
+ newopaque->zs_next = InvalidBlockNumber; /* filled in later */
+ newopaque->zs_lokey = nexttid;
+ newopaque->zs_hikey = cxt->hikey; /* overwritten later, if this is not last page */
+ newopaque->zs_level = 0;
+ newopaque->zs_flags = flags;
+ newopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+}
+
+static void
+zsbt_attr_recompress_add_to_page(zsbt_attr_recompress_context *cxt, ZSBtreeItem *item)
+{
+ if (PageGetFreeSpace(cxt->currpage) < MAXALIGN(item->t_size))
+ zsbt_attr_recompress_newpage(cxt, item->t_tid, 0);
+
+ if (PageAddItemExtended(cxt->currpage,
+ (Item) item, item->t_size,
+ PageGetMaxOffsetNumber(cxt->currpage) + 1,
+ PAI_OVERWRITE) == InvalidOffsetNumber)
+ elog(ERROR, "could not add item to page while recompressing");
+
+ cxt->total_items++;
+}
+
+static bool
+zsbt_attr_recompress_add_to_compressor(zsbt_attr_recompress_context *cxt, ZSBtreeItem *item)
+{
+ bool result;
+
+ if (cxt->compressed_items == 0)
+ zs_compress_begin(&cxt->compressor, PageGetFreeSpace(cxt->currpage));
+
+ result = zs_compress_add(&cxt->compressor, item);
+ if (result)
+ {
+ cxt->compressed_items++;
+
+ cxt->total_compressed_items++;
+ }
+
+ return result;
+}
+
+static void
+zsbt_attr_recompress_flush(zsbt_attr_recompress_context *cxt)
+{
+ ZSCompressedBtreeItem *citem;
+
+ if (cxt->compressed_items == 0)
+ return;
+
+ citem = zs_compress_finish(&cxt->compressor);
+
+ if (citem)
+ zsbt_attr_recompress_add_to_page(cxt, (ZSBtreeItem *) citem);
+ else
+ {
+ uint16 size = 0;
+ /*
+ * Compression failed, so add the items uncompressed. We could note that
+ * these items don't compress well and skip future attempts, but it's
+ * possible that, grouped together with other items later on, they will
+ * compress fine. So don't record anything, and simply try compressing
+ * again later if needed.
+ */
+ for (int i = 0; i < cxt->compressor.nitems; i++)
+ {
+ citem = (ZSCompressedBtreeItem *) (cxt->compressor.uncompressedbuffer + size);
+ zsbt_attr_recompress_add_to_page(cxt, (ZSBtreeItem *) citem);
+
+ size += MAXALIGN(citem->t_size);
+ }
+ }
+
+ cxt->compressed_items = 0;
+}
+
+/*
+ * Rewrite a leaf page, with given 'items' as the new content.
+ *
+ * If there are any uncompressed items in the list, we try to compress them.
+ * Any already-compressed items are added as is.
+ *
+ * If the items no longer fit on the page, then the page is split. It is
+ * entirely possible that they don't fit even on two pages; we split the page
+ * into as many pages as needed. Hopefully not more than a few pages, though,
+ * because otherwise you might hit limits on the number of buffer pins (with
+ * tiny shared_buffers).
+ *
+ * On entry, 'oldbuf' must be pinned and exclusive-locked. On exit, the lock
+ * is released, but it's still pinned.
+ *
+ * TODO: Try to combine single items, and existing array-items, into new array
+ * items.
+ */
+static void
+zsbt_attr_recompress_replace(Relation rel, AttrNumber attno, Buffer oldbuf, List *items)
+{
+ ListCell *lc;
+ zsbt_attr_recompress_context cxt;
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(BufferGetPage(oldbuf));
+ ZSUndoRecPtr recent_oldest_undo = { 0 };
+ BlockNumber orignextblk;
+ zs_split_stack *stack;
+ List *downlinks = NIL;
+
+ orignextblk = oldopaque->zs_next;
+
+ cxt.currpage = NULL;
+ zs_compress_init(&cxt.compressor);
+ cxt.compressed_items = 0;
+ cxt.stack_head = cxt.stack_tail = NULL;
+ cxt.attno = attno;
+ cxt.hikey = oldopaque->zs_hikey;
+
+ cxt.total_items = 0;
+ cxt.total_compressed_items = 0;
+ cxt.total_already_compressed_items = 0;
+
+ zsbt_attr_recompress_newpage(&cxt, oldopaque->zs_lokey, (oldopaque->zs_flags & ZSBT_ROOT));
+
+ foreach(lc, items)
+ {
+ ZSBtreeItem *item = (ZSBtreeItem *) lfirst(lc);
+
+ /* We can leave out any old-enough DEAD items */
+ if ((item->t_flags & ZSBT_DEAD) != 0)
+ {
+ ZSBtreeItem *uitem = (ZSBtreeItem *) item;
+
+ if (recent_oldest_undo.counter == 0)
+ recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+
+ if (zsbt_item_undoptr(uitem).counter <= recent_oldest_undo.counter)
+ continue;
+ }
+
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ /* already compressed, add as it is. */
+ zsbt_attr_recompress_flush(&cxt);
+ cxt.total_already_compressed_items++;
+ zsbt_attr_recompress_add_to_page(&cxt, item);
+ }
+ else
+ {
+ /* try to add this item to the compressor */
+ if (!zsbt_attr_recompress_add_to_compressor(&cxt, item))
+ {
+ if (cxt.compressed_items > 0)
+ {
+ /* flush, and retry */
+ zsbt_attr_recompress_flush(&cxt);
+
+ if (!zsbt_attr_recompress_add_to_compressor(&cxt, item))
+ {
+ /* could not compress, even on its own. Store it uncompressed, then */
+ zsbt_attr_recompress_add_to_page(&cxt, item);
+ }
+ }
+ else
+ {
+ /* could not compress, even on its own. Store it uncompressed, then */
+ zsbt_attr_recompress_add_to_page(&cxt, item);
+ }
+ }
+ }
+ }
+
+ /* flush the last one, if any */
+ zsbt_attr_recompress_flush(&cxt);
+
+ zs_compress_free(&cxt.compressor);
+
+ /*
+ * Ok, we now have a list of pages, to replace the original page, as private
+ * in-memory copies. Allocate buffers for them, and write them out.
+ *
+ * allocate all the pages before entering critical section, so that
+ * out-of-disk-space doesn't lead to PANIC
+ */
+ stack = cxt.stack_head;
+ Assert(stack->buf == InvalidBuffer);
+ stack->buf = oldbuf;
+ while (stack->next)
+ {
+ Page thispage = stack->page;
+ ZSBtreePageOpaque *thisopaque = ZSBtreePageGetOpaque(thispage);
+ ZSBtreeInternalPageItem *downlink;
+ Buffer nextbuf;
+
+ Assert(stack->next->buf == InvalidBuffer);
+
+ nextbuf = zspage_getnewbuf(rel, InvalidBuffer);
+ stack->next->buf = nextbuf;
+
+ thisopaque->zs_next = BufferGetBlockNumber(nextbuf);
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = thisopaque->zs_hikey;
+ downlink->childblk = BufferGetBlockNumber(nextbuf);
+ downlinks = lappend(downlinks, downlink);
+
+ stack = stack->next;
+ }
+ /* last one in the chain */
+ ZSBtreePageGetOpaque(stack->page)->zs_next = orignextblk;
+
+ /* If we had to split, insert downlinks for the new pages. */
+ if (cxt.stack_head->next)
+ {
+ oldopaque = ZSBtreePageGetOpaque(cxt.stack_head->page);
+
+ if ((oldopaque->zs_flags & ZSBT_ROOT) != 0)
+ {
+ ZSBtreeInternalPageItem *downlink;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = MinZSTid;
+ downlink->childblk = BufferGetBlockNumber(cxt.stack_head->buf);
+ downlinks = lcons(downlink, downlinks);
+
+ cxt.stack_tail->next = zsbt_newroot(rel, attno, oldopaque->zs_level + 1, downlinks);
+
+ /* clear the ZSBT_ROOT flag on the old root page */
+ oldopaque->zs_flags &= ~ZSBT_ROOT;
+ }
+ else
+ {
+ cxt.stack_tail->next = zsbt_insert_downlinks(rel, attno,
+ oldopaque->zs_lokey, BufferGetBlockNumber(oldbuf), oldopaque->zs_level + 1,
+ downlinks);
+ }
+ /* note: stack_tail is not the real tail anymore */
+ }
+
+ /* Finally, overwrite all the pages we had to modify */
+ zs_apply_split_changes(rel, cxt.stack_head);
+}
diff --git a/src/backend/access/zedstore/zedstore_btree.c b/src/backend/access/zedstore/zedstore_btree.c
new file mode 100644
index 0000000000..108170ffee
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_btree.c
@@ -0,0 +1,632 @@
+/*
+ * zedstore_btree.c
+ * Common routines for handling TID and attribute B-tree structures
+ *
+ * A Zedstore table consists of multiple B-trees, one to store TIDs and
+ * visibility information of the rows, and one tree for each attribute,
+ * to hold the data. The TID and attribute trees differ at the leaf
+ * level, but the internal pages have the same layout. This file contains
+ * routines to deal with internal pages, and some other common
+ * functionality.
+ *
+ * When dealing with the TID tree, pass ZS_META_ATTRIBUTE_NUM as the
+ * attribute number.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_btree.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/procarray.h"
+#include "utils/rel.h"
+
+/* prototypes for local functions */
+static zs_split_stack *zsbt_split_internal_page(Relation rel, AttrNumber attno,
+ Buffer leftbuf, OffsetNumber newoff, List *downlinks);
+static zs_split_stack *zsbt_merge_pages(Relation rel, AttrNumber attno, Buffer leftbuf, Buffer rightbuf, bool target_is_left);
+
+static int zsbt_binsrch_internal(zstid key, ZSBtreeInternalPageItem *arr, int arr_elems);
+
+/*
+ * Find the page containing the given key TID at the given level.
+ *
+ * Level 0 means leaf. The returned buffer is exclusive-locked.
+ */
+Buffer
+zsbt_descend(Relation rel, AttrNumber attno, zstid key, int level, bool readonly)
+{
+ BlockNumber next;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ ZSBtreeInternalPageItem *items;
+ int nitems;
+ int itemno;
+ BlockNumber rootblk;
+ int nextlevel = -1;
+ BlockNumber failblk = InvalidBlockNumber;
+
+ /* start from root */
+restart:
+ rootblk = zsmeta_get_root_for_attribute(rel, attno, readonly);
+
+ if (rootblk == InvalidBlockNumber)
+ {
+ /* completely empty tree */
+ return InvalidBuffer;
+ }
+
+ next = rootblk;
+ for (;;)
+ {
+ /*
+ * If we arrive again to a block that was a dead-end earlier, it seems
+ * that the tree is corrupt.
+ *
+ * XXX: It's theoretically possible that the block was removed, but then
+ * added back at the same location, and removed again. So perhaps retry
+ * a few times?
+ */
+ if (next == failblk || next == ZS_META_BLK)
+ elog(ERROR, "arrived at incorrect block %u while descending zedstore btree", next);
+
+ buf = ReadBuffer(rel, next);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); /* TODO: shared */
+ page = BufferGetPage(buf);
+ if (!zsbt_page_is_expected(rel, attno, key, nextlevel, buf))
+ {
+ /*
+ * We arrived at an unexpected page. This can happen with concurrent
+ * splits, or page deletions. We could try following the right-link, but
+ * there's no guarantee that's the correct page either, so let's restart
+ * from the root. If we landed here because of concurrent modifications,
+ * the next attempt should land on the correct page. Remember that we
+ * incorrectly ended up on this page, so that if this happens because
+ * the tree is corrupt, rather than concurrent splits, and we land here
+ * again, we won't loop forever.
+ */
+ failblk = next;
+ goto restart;
+ }
+ opaque = ZSBtreePageGetOpaque(page);
+
+ if (nextlevel == -1)
+ nextlevel = opaque->zs_level;
+
+ else if (opaque->zs_level != nextlevel)
+ elog(ERROR, "unexpected level encountered when descending tree");
+
+ if (opaque->zs_level == level)
+ return buf;
+
+ /* Find the downlink and follow it */
+ items = ZSBtreeInternalPageGetItems(page);
+ nitems = ZSBtreeInternalPageGetNumItems(page);
+
+ itemno = zsbt_binsrch_internal(key, items, nitems);
+ if (itemno < 0)
+ elog(ERROR, "could not descend tree for tid (%u, %u)",
+ ZSTidGetBlockNumber(key), ZSTidGetOffsetNumber(key));
+
+ next = items[itemno].childblk;
+ nextlevel--;
+
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+/*
+ * Check that a page is a valid B-tree page, and covers the given key.
+ *
+ * This is used when traversing the tree, to check that e.g. a concurrent page
+ * split didn't move pages around, so that the page we were walking to isn't
+ * the correct one anymore.
+ */
+bool
+zsbt_page_is_expected(Relation rel, AttrNumber attno, zstid key, int level, Buffer buf)
+{
+ Page page = BufferGetPage(buf);
+ ZSBtreePageOpaque *opaque;
+
+ /*
+ * The page might have been deleted and even reused as a completely different
+ * kind of a page, so we must be prepared for anything.
+ */
+ if (PageIsNew(page))
+ return false;
+
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSBtreePageOpaque)))
+ return false;
+
+ opaque = ZSBtreePageGetOpaque(page);
+ if (opaque->zs_page_id != ZS_BTREE_PAGE_ID)
+ return false;
+
+ if (opaque->zs_attno != attno)
+ return false;
+
+ if (level != -1 && opaque->zs_level != level)
+ return false;
+
+ if (opaque->zs_lokey > key || opaque->zs_hikey <= key)
+ return false;
+
+ return true;
+}
+
+/*
+ * Create a new btree root page, containing two downlinks.
+ *
+ * NOTE: the very first root page of a btree, which is also the leaf, is created
+ * in zsmeta_get_root_for_attribute(), not here.
+ *
+ * XXX: What if there are too many downlinks to fit on a page? Shouldn't happen
+ * in practice..
+ */
+zs_split_stack *
+zsbt_newroot(Relation rel, AttrNumber attno, int level, List *downlinks)
+{
+ Page metapage;
+ ZSMetaPage *metapg;
+ Buffer newrootbuf;
+ Page newrootpage;
+ ZSBtreePageOpaque *newrootopaque;
+ ZSBtreeInternalPageItem *items;
+ Buffer metabuf;
+ zs_split_stack *stack1;
+ zs_split_stack *stack2;
+ ListCell *lc;
+ int i;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ /* allocate a new root page */
+ newrootbuf = zspage_getnewbuf(rel, metabuf);
+ newrootpage = palloc(BLCKSZ);
+ PageInit(newrootpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ newrootopaque = ZSBtreePageGetOpaque(newrootpage);
+ newrootopaque->zs_attno = attno;
+ newrootopaque->zs_next = InvalidBlockNumber;
+ newrootopaque->zs_lokey = MinZSTid;
+ newrootopaque->zs_hikey = MaxPlusOneZSTid;
+ newrootopaque->zs_level = level;
+ newrootopaque->zs_flags = ZSBT_ROOT;
+ newrootopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ items = ZSBtreeInternalPageGetItems(newrootpage);
+
+ /* add all the downlinks */
+ i = 0;
+ foreach (lc, downlinks)
+ {
+ ZSBtreeInternalPageItem *downlink = (ZSBtreeInternalPageItem *) lfirst(lc);
+
+ items[i++] = *downlink;
+ }
+ ((PageHeader) newrootpage)->pd_lower += i * sizeof(ZSBtreeInternalPageItem);
+
+ /* FIXME: Check that all the downlinks fit on the page. */
+
+ /* update the metapage */
+ metapage = PageGetTempPageCopy(BufferGetPage(metabuf));
+
+ metapg = (ZSMetaPage *) PageGetContents(metapage);
+ if ((attno != ZS_META_ATTRIBUTE_NUM) && (attno <= 0 || attno > metapg->nattributes))
+ elog(ERROR, "invalid attribute number %d (table \"%s\" has only %d attributes)",
+ attno, RelationGetRelationName(rel), metapg->nattributes);
+
+ metapg->tree_root_dir[attno].root = BufferGetBlockNumber(newrootbuf);
+
+ stack1 = zs_new_split_stack_entry(metabuf, metapage);
+ stack2 = zs_new_split_stack_entry(newrootbuf, newrootpage);
+ stack2->next = stack1;
+
+ return stack2;
+}
+
+/*
+ * After page split, insert the downlink of 'rightblkno' to the parent.
+ *
+ * On entry, 'leftbuf' must be pinned exclusive-locked.
+ */
+zs_split_stack *
+zsbt_insert_downlinks(Relation rel, AttrNumber attno,
+ zstid leftlokey, BlockNumber leftblkno, int level,
+ List *downlinks)
+{
+ int numdownlinks = list_length(downlinks);
+ ZSBtreeInternalPageItem *items;
+ int nitems;
+ int itemno;
+ Buffer parentbuf;
+ Page parentpage;
+ zs_split_stack *split_stack;
+ ZSBtreeInternalPageItem *firstdownlink;
+
+ /*
+ * re-find parent
+ *
+ * TODO: this is a bit inefficient. Usually, we have just descended the
+ * tree, and if we just remembered the path we descended, we could just
+ * walk back up.
+ */
+ parentbuf = zsbt_descend(rel, attno, leftlokey, level, false);
+ parentpage = BufferGetPage(parentbuf);
+
+ firstdownlink = (ZSBtreeInternalPageItem *) linitial(downlinks);
+
+ /* Find the position in the parent for the downlink */
+ items = ZSBtreeInternalPageGetItems(parentpage);
+ nitems = ZSBtreeInternalPageGetNumItems(parentpage);
+ itemno = zsbt_binsrch_internal(firstdownlink->tid, items, nitems);
+
+ /* sanity checks */
+ if (itemno < 0 || items[itemno].tid != leftlokey ||
+ items[itemno].childblk != leftblkno)
+ {
+ elog(ERROR, "could not find downlink for block %u TID (%u, %u)",
+ leftblkno, ZSTidGetBlockNumber(leftlokey),
+ ZSTidGetOffsetNumber(leftlokey));
+ }
+ itemno++;
+
+ if (PageGetExactFreeSpace(parentpage) < numdownlinks * sizeof(ZSBtreeInternalPageItem))
+ {
+ /* split internal page */
+ split_stack = zsbt_split_internal_page(rel, attno, parentbuf, itemno, downlinks);
+ }
+ else
+ {
+ ZSBtreeInternalPageItem *newitems;
+ Page newpage;
+ int i;
+ ListCell *lc;
+
+ newpage = PageGetTempPageCopySpecial(parentpage);
+
+ split_stack = zs_new_split_stack_entry(parentbuf, newpage);
+
+ /* insert the new downlink for the right page. */
+ newitems = ZSBtreeInternalPageGetItems(newpage);
+ memcpy(newitems, items, itemno * sizeof(ZSBtreeInternalPageItem));
+
+ i = itemno;
+ foreach(lc, downlinks)
+ {
+ ZSBtreeInternalPageItem *downlink = (ZSBtreeInternalPageItem *) lfirst(lc);
+
+ Assert(downlink->childblk != 0);
+ newitems[i++] = *downlink;
+ }
+
+ memcpy(&newitems[i], &items[itemno], (nitems - itemno) * sizeof(ZSBtreeInternalPageItem));
+ ((PageHeader) newpage)->pd_lower += (nitems + numdownlinks) * sizeof(ZSBtreeInternalPageItem);
+ }
+ return split_stack;
+}
+
+/*
+ * Split an internal page.
+ *
+ * The new downlinks in 'newitems' are inserted at position 'newoff' on
+ * 'origbuf', and the page is split into as many pages as needed.
+ */
+static zs_split_stack *
+zsbt_split_internal_page(Relation rel, AttrNumber attno, Buffer origbuf,
+ OffsetNumber newoff, List *newitems)
+{
+ Page origpage = BufferGetPage(origbuf);
+ ZSBtreePageOpaque *origopaque = ZSBtreePageGetOpaque(origpage);
+ Buffer buf;
+ Page page;
+ ZSBtreeInternalPageItem *origitems;
+ int orignitems;
+ zs_split_stack *stack_first;
+ zs_split_stack *stack;
+ Size splitthreshold;
+ ListCell *lc;
+ int origitemno;
+ List *downlinks = NIL;
+
+ origitems = ZSBtreeInternalPageGetItems(origpage);
+ orignitems = ZSBtreeInternalPageGetNumItems(origpage);
+
+ page = PageGetTempPageCopySpecial(origpage);
+ buf = origbuf;
+
+ stack = zs_new_split_stack_entry(buf, page);
+ stack_first = stack;
+
+ /* XXX: currently, we always do 90/10 splits, i.e. fill each page to ~90% */
+ splitthreshold = PageGetExactFreeSpace(page) * 0.10;
+
+ lc = list_head(newitems);
+ origitemno = 0;
+ for (;;)
+ {
+ ZSBtreeInternalPageItem *item;
+ ZSBtreeInternalPageItem *p;
+
+ if (origitemno == newoff && lc)
+ {
+ item = lfirst(lc);
+ lc = lnext(lc);
+ }
+ else
+ {
+ if (origitemno == orignitems)
+ break;
+ item = &origitems[origitemno];
+ origitemno++;
+ }
+
+ if (PageGetExactFreeSpace(page) < splitthreshold)
+ {
+ /* have to split to another page */
+ ZSBtreePageOpaque *prevopaque = ZSBtreePageGetOpaque(page);
+ ZSBtreePageOpaque *opaque = ZSBtreePageGetOpaque(page);
+ BlockNumber blkno;
+ ZSBtreeInternalPageItem *downlink;
+
+ buf = zspage_getnewbuf(rel, InvalidBuffer);
+ blkno = BufferGetBlockNumber(buf);
+ page = palloc(BLCKSZ);
+ PageInit(page, BLCKSZ, sizeof(ZSBtreePageOpaque));
+
+ opaque = ZSBtreePageGetOpaque(page);
+ opaque->zs_attno = attno;
+ opaque->zs_next = prevopaque->zs_next;
+ opaque->zs_lokey = item->tid;
+ opaque->zs_hikey = prevopaque->zs_hikey;
+ opaque->zs_level = prevopaque->zs_level;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ prevopaque->zs_next = blkno;
+ prevopaque->zs_hikey = item->tid;
+
+ stack->next = zs_new_split_stack_entry(buf, page);
+ stack = stack->next;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = item->tid;
+ downlink->childblk = blkno;
+ downlinks = lappend(downlinks, downlink);
+ }
+
+ p = (ZSBtreeInternalPageItem *) ((char *) page + ((PageHeader) page)->pd_lower);
+ *p = *item;
+ ((PageHeader) page)->pd_lower += sizeof(ZSBtreeInternalPageItem);
+ }
+
+ /* recurse to insert downlinks, if we had to split. */
+ if (downlinks)
+ {
+ if ((origopaque->zs_flags & ZSBT_ROOT) != 0)
+ {
+ ZSBtreeInternalPageItem *downlink;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = MinZSTid;
+ downlink->childblk = BufferGetBlockNumber(origbuf);
+ downlinks = lcons(downlink, downlinks);
+
+ stack->next = zsbt_newroot(rel, attno, origopaque->zs_level + 1, downlinks);
+
+ /* clear the ZSBT_ROOT flag on the old root page */
+ ZSBtreePageGetOpaque(stack_first->page)->zs_flags &= ~ZSBT_ROOT;
+ }
+ else
+ {
+ stack->next = zsbt_insert_downlinks(rel, attno,
+ origopaque->zs_lokey,
+ BufferGetBlockNumber(origbuf),
+ origopaque->zs_level + 1,
+ downlinks);
+ }
+ }
+
+ return stack_first;
+}
+
+
+/*
+ * Removes the last item from page, and unlinks the page from the tree.
+ *
+ * NOTE: you cannot remove the only leaf. Returns NULL if the page could not
+ * be deleted.
+ */
+zs_split_stack *
+zsbt_unlink_page(Relation rel, AttrNumber attno, Buffer buf, int level)
+{
+ Page page = BufferGetPage(buf);
+ ZSBtreePageOpaque *opaque = ZSBtreePageGetOpaque(page);
+ Buffer leftbuf;
+ Buffer rightbuf;
+ zs_split_stack *stack;
+
+ /* cannot currently remove the only page at its level. */
+ if (opaque->zs_lokey == MinZSTid && opaque->zs_hikey == MaxPlusOneZSTid)
+ {
+ return NULL;
+ }
+
+ /*
+ * Find left sibling.
+ * or if this is leftmost page, find right sibling.
+ */
+ if (opaque->zs_lokey != MinZSTid)
+ {
+ rightbuf = buf;
+ leftbuf = zsbt_descend(rel, attno, opaque->zs_lokey - 1, level, false);
+
+ stack = zsbt_merge_pages(rel, attno, leftbuf, rightbuf, false);
+ if (!stack)
+ {
+ UnlockReleaseBuffer(leftbuf);
+ return NULL;
+ }
+ }
+ else
+ {
+ rightbuf = zsbt_descend(rel, attno, opaque->zs_hikey, level, false);
+ leftbuf = buf;
+ stack = zsbt_merge_pages(rel, attno, leftbuf, rightbuf, true);
+ if (!stack)
+ {
+ UnlockReleaseBuffer(rightbuf);
+ return NULL;
+ }
+ }
+
+ return stack;
+}
+
+/*
+ * Page deletion:
+ *
+ * Mark page empty, remove downlink. If parent becomes empty, recursively delete it.
+ *
+ * Unlike in the nbtree index, we don't need to worry about concurrent scans. They
+ * will simply retry if they land on an unexpected page.
+ */
+static zs_split_stack *
+zsbt_merge_pages(Relation rel, AttrNumber attno, Buffer leftbuf, Buffer rightbuf, bool target_is_left)
+{
+ Buffer parentbuf;
+ Page origleftpage;
+ Page leftpage;
+ Page rightpage;
+ ZSBtreePageOpaque *leftopaque;
+ ZSBtreePageOpaque *origleftopaque;
+ ZSBtreePageOpaque *rightopaque;
+ ZSBtreeInternalPageItem *parentitems;
+ int parentnitems;
+ Page parentpage;
+ int itemno;
+ zs_split_stack *stack;
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ origleftpage = BufferGetPage(leftbuf);
+ origleftopaque = ZSBtreePageGetOpaque(origleftpage);
+ rightpage = BufferGetPage(rightbuf);
+ rightopaque = ZSBtreePageGetOpaque(rightpage);
+
+ /* find downlink for 'rightbuf' in the parent */
+ parentbuf = zsbt_descend(rel, attno, rightopaque->zs_lokey, origleftopaque->zs_level + 1, false);
+ parentpage = BufferGetPage(parentbuf);
+
+ parentitems = ZSBtreeInternalPageGetItems(parentpage);
+ parentnitems = ZSBtreeInternalPageGetNumItems(parentpage);
+ itemno = zsbt_binsrch_internal(rightopaque->zs_lokey, parentitems, parentnitems);
+ if (itemno < 0 || parentitems[itemno].childblk != BufferGetBlockNumber(rightbuf))
+ elog(ERROR, "could not find downlink to FPM page %u", BufferGetBlockNumber(rightbuf));
+
+ if (parentnitems > 1 && itemno == 0)
+ {
+ /*
+ * Don't delete the leftmost child of a parent. That would move the
+ * keyspace of the parent, so we'd need to adjust the lo/hikey of
+ * the parent page, and the parent's downlink in the grandparent.
+ * Maybe later...
+ */
+ UnlockReleaseBuffer(parentbuf);
+ elog(DEBUG1, "deleting leftmost child of a parent not implemented");
+ return NULL;
+ }
+
+ if (target_is_left)
+ {
+ /* move all items from right to left before unlinking the right page */
+ leftpage = PageGetTempPageCopy(rightpage);
+ leftopaque = ZSBtreePageGetOpaque(leftpage);
+
+ memcpy(leftopaque, origleftopaque, sizeof(ZSBtreePageOpaque));
+ }
+ else
+ {
+ /* right page is empty. */
+ leftpage = PageGetTempPageCopy(origleftpage);
+ leftopaque = ZSBtreePageGetOpaque(leftpage);
+ }
+
+ /* update left hikey */
+ leftopaque->zs_hikey = ZSBtreePageGetOpaque(rightpage)->zs_hikey;
+
+ Assert(ZSBtreePageGetOpaque(leftpage)->zs_level == ZSBtreePageGetOpaque(rightpage)->zs_level);
+
+ stack = zs_new_split_stack_entry(leftbuf, leftpage);
+ stack_head = stack_tail = stack;
+
+ /* Mark right page as empty/unused */
+ rightpage = palloc0(BLCKSZ);
+
+ stack = zs_new_split_stack_entry(rightbuf, rightpage);
+ stack->recycle = true;
+ stack_tail->next = stack;
+ stack_tail = stack;
+
+ /* remove downlink from parent */
+ if (parentnitems > 1)
+ {
+ Page newpage = PageGetTempPageCopySpecial(parentpage);
+ ZSBtreeInternalPageItem *newitems = ZSBtreeInternalPageGetItems(newpage);
+
+ memcpy(newitems, parentitems, itemno * sizeof(ZSBtreeInternalPageItem));
+ memcpy(&newitems[itemno], &parentitems[itemno + 1], (parentnitems - itemno -1) * sizeof(ZSBtreeInternalPageItem));
+
+ ((PageHeader) newpage)->pd_lower += (parentnitems - 1) * sizeof(ZSBtreeInternalPageItem);
+
+ stack = zs_new_split_stack_entry(parentbuf, newpage);
+ stack_tail->next = stack;
+ stack_tail = stack;
+ }
+ else
+ {
+ /* the parent becomes empty as well. Recursively remove it. */
+ stack_tail->next = zsbt_unlink_page(rel, attno, parentbuf, leftopaque->zs_level + 1);
+ if (stack_tail->next == NULL)
+ {
+ /* oops, couldn't remove the parent. Back out */
+ stack = stack_head;
+ while (stack)
+ {
+ zs_split_stack *next = stack->next;
+
+ pfree(stack->page);
+ pfree(stack);
+ stack = next;
+ }
+ /* report failure to the caller; the freed stack must not be returned */
+ return NULL;
+ }
+ }
+
+ return stack_head;
+}
+
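+/*
+ * Binary search the internal page item array for the downlink to follow:
+ * returns the index of the last item whose TID is <= the given key, or -1
+ * if the key is smaller than the first item's TID.
+ */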
+static int
+zsbt_binsrch_internal(zstid key, ZSBtreeInternalPageItem *arr, int arr_elems)
+{
+ int low,
+ high,
+ mid;
+
+ low = 0;
+ high = arr_elems;
+ while (high > low)
+ {
+ mid = low + (high - low) / 2;
+
+ if (key >= arr[mid].tid)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+ return low - 1;
+}
diff --git a/src/backend/access/zedstore/zedstore_compression.c b/src/backend/access/zedstore/zedstore_compression.c
new file mode 100644
index 0000000000..1a1d9a018c
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_compression.c
@@ -0,0 +1,364 @@
+/*
+ * zedstore_compression.c
+ * Routines for compression
+ *
+ * There are two implementations at the moment: LZ4, and the Postgres
+ * pg_lzcompress(). LZ4 support requires that the server was compiled
+ * with --with-lz4.
+ *
+ * The compressor works on ZSUncompressedBtreeItems.
+ *
+ * Compression interface
+ * ---------------------
+ *
+ * Call zs_compress_init() to initialize.
+ *
+ * Call zs_compress_begin(), to begin compressing a group of items. Pass the
+ * maximum amount of space it's allowed to use after compression, as argument.
+ *
+ * Feed them to the compressor one by one with zs_compress_add(), until it
+ * returns false.
+ *
+ * Finally, call zs_compress_finish(). It returns a ZSCompressedBtreeItem,
+ * which contains all the plain items that were added (except for the last one
+ * for which zs_compress_add() returned false)
+ *
+ * Decompression interface
+ * -----------------------
+ *
+ * zs_decompress_chunk() takes a ZSCompressedBtreeItem as argument. It
+ * initializes a "context" with the given chunk.
+ *
+ * Call zs_decompress_read_item() to return the uncompressed items one by one.
+ *
+ *
+ * NOTES:
+ *
+ * Currently, the compressor accepts input until the *uncompressed* size
+ * exceeds the *compressed* size available. That is, it assumes that the
+ * compressed size is never larger than the uncompressed size.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_compression.c
+ */
+#include "postgres.h"
+
+#ifdef USE_LZ4
+#include <lz4.h>
+#endif
+
+#include "access/zedstore_compression.h"
+#include "access/zedstore_internal.h"
+#include "common/pg_lzcompress.h"
+#include "utils/datum.h"
+
+
+/*
+ * There are two implementations at the moment: LZ4, and the Postgres
+ * pg_lzcompress(). LZ4 support requires that the server was compiled
+ * with --with-lz4.
+ */
+#ifdef USE_LZ4
+
+/*
+ * Initialize a compression context. The space limit for a group of items
+ * is given later, to zs_compress_begin().
+ */
+void
+zs_compress_init(ZSCompressContext *context)
+{
+ context->uncompressedbuffer = palloc(BLCKSZ * 10); /* FIXME: arbitrary size */
+ context->buffer = palloc(BLCKSZ);
+ context->maxCompressedSize = 0;
+ context->maxUncompressedSize = 0;
+ context->nitems = 0;
+ context->rawsize = 0;
+}
+
+void
+zs_compress_begin(ZSCompressContext *context, int maxCompressedSize)
+{
+ context->buffer = repalloc(context->buffer, maxCompressedSize);
+
+ maxCompressedSize -= offsetof(ZSCompressedBtreeItem, t_payload);
+ if (maxCompressedSize < 0)
+ maxCompressedSize = 0;
+
+ context->maxCompressedSize = maxCompressedSize;
+ context->nitems = 0;
+ context->rawsize = 0;
+}
+
+/*
+ * Try to add some data to the compressed block.
+ *
+ * If it wouldn't fit, return false.
+ */
+bool
+zs_compress_add(ZSCompressContext *context, ZSBtreeItem *item)
+{
+ ZSCompressedBtreeItem *chunk = (ZSCompressedBtreeItem *) context->buffer;
+
+ Assert((item->t_flags & ZSBT_COMPRESSED) == 0);
+ Assert(item->t_tid != InvalidZSTid);
+
+ if (LZ4_COMPRESSBOUND(context->rawsize + MAXALIGN(item->t_size)) > context->maxCompressedSize)
+ return false;
+
+ memcpy(context->uncompressedbuffer + context->rawsize, item, item->t_size);
+ /* TODO: clear alignment padding */
+ if (context->nitems == 0)
+ chunk->t_tid = item->t_tid;
+ chunk->t_lasttid = zsbt_item_lasttid(item);
+ context->nitems++;
+ context->rawsize += MAXALIGN(item->t_size);
+
+ return true;
+}
+
+ZSCompressedBtreeItem *
+zs_compress_finish(ZSCompressContext *context)
+{
+ ZSCompressedBtreeItem *chunk = (ZSCompressedBtreeItem *) context->buffer;
+ int32 compressed_size;
+
+ compressed_size = LZ4_compress_default(context->uncompressedbuffer,
+ chunk->t_payload,
+ context->rawsize,
+ context->maxCompressedSize);
+ if (compressed_size < 0)
+ return NULL;
+
+ chunk->t_size = offsetof(ZSCompressedBtreeItem, t_payload) + compressed_size;
+ chunk->t_flags = ZSBT_COMPRESSED;
+ chunk->t_uncompressedsize = context->rawsize;
+
+ return chunk;
+}
+
+void
+zs_compress_free(ZSCompressContext *context)
+{
+ pfree(context->uncompressedbuffer);
+ pfree(context->buffer);
+}
+
+void
+zs_decompress_init(ZSDecompressContext *context)
+{
+ context->buffer = NULL;
+ context->bufsize = 0;
+ context->uncompressedsize = 0;
+}
+
+void
+zs_decompress_chunk(ZSDecompressContext *context, ZSCompressedBtreeItem *chunk)
+{
+ Assert((chunk->t_flags & ZSBT_COMPRESSED) != 0);
+ Assert(chunk->t_uncompressedsize > 0);
+ if (context->bufsize < chunk->t_uncompressedsize)
+ {
+ if (context->buffer)
+ pfree(context->buffer);
+ context->buffer = palloc(chunk->t_uncompressedsize);
+ context->bufsize = chunk->t_uncompressedsize;
+ }
+ context->uncompressedsize = chunk->t_uncompressedsize;
+
+ if (LZ4_decompress_safe(chunk->t_payload,
+ context->buffer,
+ chunk->t_size - offsetof(ZSCompressedBtreeItem, t_payload),
+ context->uncompressedsize) != context->uncompressedsize)
+ elog(ERROR, "could not decompress chunk");
+
+ context->bytesread = 0;
+}
+
+ZSBtreeItem *
+zs_decompress_read_item(ZSDecompressContext *context)
+{
+ ZSBtreeItem *next;
+
+ if (context->bytesread == context->uncompressedsize)
+ return NULL;
+ next = (ZSBtreeItem *) (context->buffer + context->bytesread);
+ if (context->bytesread + MAXALIGN(next->t_size) > context->uncompressedsize)
+ elog(ERROR, "invalid compressed item");
+ context->bytesread += MAXALIGN(next->t_size);
+
+ Assert(next->t_size >= sizeof(ZSBtreeItem));
+ Assert(next->t_tid != InvalidZSTid);
+
+ return next;
+}
+
+void
+zs_decompress_free(ZSDecompressContext *context)
+{
+ if (context->buffer)
+ pfree(context->buffer);
+ context->buffer = NULL;
+ context->bufsize = 0;
+ context->uncompressedsize = 0;
+}
+
+
+#else
+/* PGLZ implementation */
+
+/*
+ * In the worst case, pg_lz outputs everything as "literals", and emits one
+ * "control byte" for every 8 bytes of input. It also requires 4 extra bytes
+ * at the end of the buffer. On top of that, add 10 bytes of slop, for good
+ * measure.
+ */
+#define MAX_COMPRESS_EXPANSION_OVERHEAD (8)
+#define MAX_COMPRESS_EXPANSION_BYTES (4 + 10)
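+
+/*
+ * For example (an editor's sketch of the arithmetic in zs_compress_begin()):
+ * with maxCompressedSize = 8000 bytes and an item header of, say, 16 bytes,
+ * the uncompressed input is capped at roughly
+ * 7984 - 7984/8 - 14 = 6972 bytes, so that even if pglz stores everything as
+ * literals, the result still fits in the space given.
+ */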
+
+/*
+ * Initialize a compression context. The space limit for a group of items
+ * is given later, to zs_compress_begin().
+ */
+void
+zs_compress_init(ZSCompressContext *context)
+{
+ context->uncompressedbuffer = palloc(BLCKSZ * 10); /* FIXME: arbitrary size */
+ context->buffer = palloc(BLCKSZ);
+ context->maxCompressedSize = 0;
+ context->maxUncompressedSize = 0;
+ context->nitems = 0;
+ context->rawsize = 0;
+}
+
+void
+zs_compress_begin(ZSCompressContext *context, int maxCompressedSize)
+{
+ int maxUncompressedSize;
+
+ context->buffer = repalloc(context->buffer, maxCompressedSize + 4 /* LZ slop */);
+
+ context->maxCompressedSize = maxCompressedSize;
+
+ /* determine the max uncompressed size */
+ maxUncompressedSize = maxCompressedSize;
+ maxUncompressedSize -= offsetof(ZSCompressedBtreeItem, t_payload);
+ maxUncompressedSize -= maxUncompressedSize / MAX_COMPRESS_EXPANSION_OVERHEAD;
+ maxUncompressedSize -= MAX_COMPRESS_EXPANSION_BYTES;
+ if (maxUncompressedSize < 0)
+ maxUncompressedSize = 0;
+ context->maxUncompressedSize = maxUncompressedSize;
+ context->nitems = 0;
+ context->rawsize = 0;
+}
+
+/*
+ * Try to add some data to the compressed block.
+ *
+ * If it wouldn't fit, return false.
+ */
+bool
+zs_compress_add(ZSCompressContext *context, ZSBtreeItem *item)
+{
+ ZSCompressedBtreeItem *chunk = (ZSCompressedBtreeItem *) context->buffer;
+
+ Assert ((item->t_flags & ZSBT_COMPRESSED) == 0);
+
+ if (context->rawsize + item->t_size > context->maxUncompressedSize)
+ return false;
+
+ memcpy(context->uncompressedbuffer + context->rawsize, item, item->t_size);
+ if (context->nitems == 0)
+ chunk->t_tid = item->t_tid;
+ chunk->t_lasttid = zsbt_item_lasttid(item);
+ context->nitems++;
+ context->rawsize += MAXALIGN(item->t_size);
+
+ return true;
+}
+
+ZSCompressedBtreeItem *
+zs_compress_finish(ZSCompressContext *context)
+{
+ ZSCompressedBtreeItem *chunk = (ZSCompressedBtreeItem *) context->buffer;
+ int32 compressed_size;
+
+ compressed_size = pglz_compress(context->uncompressedbuffer, context->rawsize,
+ chunk->t_payload,
+ PGLZ_strategy_always);
+ if (compressed_size < 0)
+ return NULL;
+
+ chunk->t_size = offsetof(ZSCompressedBtreeItem, t_payload) + compressed_size;
+ chunk->t_flags = ZSBT_COMPRESSED;
+ chunk->t_uncompressedsize = context->rawsize;
+
+ return chunk;
+}
+
+void
+zs_compress_free(ZSCompressContext *context)
+{
+ pfree(context->uncompressedbuffer);
+ pfree(context->buffer);
+}
+
+void
+zs_decompress_init(ZSDecompressContext *context)
+{
+ context->buffer = NULL;
+ context->bufsize = 0;
+ context->uncompressedsize = 0;
+}
+
+void
+zs_decompress_chunk(ZSDecompressContext *context, ZSCompressedBtreeItem *chunk)
+{
+ Assert((chunk->t_flags & ZSBT_COMPRESSED) != 0);
+ Assert(chunk->t_uncompressedsize > 0);
+ if (context->bufsize < chunk->t_uncompressedsize)
+ {
+ if (context->buffer)
+ pfree(context->buffer);
+ context->buffer = palloc(chunk->t_uncompressedsize);
+ context->bufsize = chunk->t_uncompressedsize;
+ }
+ context->uncompressedsize = chunk->t_uncompressedsize;
+
+ if (pglz_decompress(chunk->t_payload,
+ chunk->t_size - offsetof(ZSCompressedBtreeItem, t_payload),
+ context->buffer,
+ context->uncompressedsize, true) != context->uncompressedsize)
+ elog(ERROR, "could not decompress chunk");
+
+ context->bytesread = 0;
+}
+
+ZSBtreeItem *
+zs_decompress_read_item(ZSDecompressContext *context)
+{
+ ZSBtreeItem *next;
+
+ if (context->bytesread == context->uncompressedsize)
+ return NULL;
+ next = (ZSBtreeItem *) (context->buffer + context->bytesread);
+ if (context->bytesread + MAXALIGN(next->t_size) > context->uncompressedsize)
+ elog(ERROR, "invalid compressed item");
+ context->bytesread += MAXALIGN(next->t_size);
+
+ Assert(next->t_size >= sizeof(ZSBtreeItem));
+ Assert(next->t_tid != InvalidZSTid);
+
+ return next;
+}
+
+void
+zs_decompress_free(ZSDecompressContext *context)
+{
+ if (context->buffer)
+ pfree(context->buffer);
+ context->buffer = NULL;
+ context->bufsize = 0;
+ context->uncompressedsize = 0;
+}
+
+#endif /* !USE_LZ4 */
diff --git a/src/backend/access/zedstore/zedstore_freepagemap.c b/src/backend/access/zedstore/zedstore_freepagemap.c
new file mode 100644
index 0000000000..efd01fd6c6
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_freepagemap.c
@@ -0,0 +1,1076 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstore_freepagemap.c
+ * ZedStore free space management
+ *
+ * The Free Page Map keeps track of unused pages in the relation.
+ *
+ * The FPM is a b-tree, indexed by physical block number. To be more compact,
+ * it stores "extents", i.e. block ranges, rather than just blocks, when
+ * possible.
+ *
+ * Design principles:
+ *
+ * - it's ok to have a block incorrectly stored in the FPM. Before actually
+ * reusing a page, we must check that it's safe.
+ *
+ * - a deletable page must be simple to detect just by looking at the page,
+ * and perhaps a few other pages. It should *not* require scanning the
+ * whole table, or even a whole b-tree. For example, if a column is dropped,
+ * we can detect if a b-tree page belongs to the dropped column just by
+ * looking at the information (the attribute number) stored in the page
+ * header.
+ *
+ * - if a page is deletable, it should become immediately reusable. No
+ * "wait out all possible readers that might be about to follow a link
+ * to it" business. All code that reads pages need to keep pages locked
+ * while following a link, or be prepared to retry if they land on an
+ * unexpected page.
+ *
+ *
+ * TODO:
+ *
+ * - Avoid fragmentation. If B-tree page is split, try to hand out a page
+ * that's close to the old page. When the relation is extended, allocate
+ * a larger chunk at once.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_freepagemap.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/zedstore_internal.h"
+#include "miscadmin.h"
+#include "storage/bufpage.h"
+#include "storage/lmgr.h"
+#include "utils/rel.h"
+
+/*
+ * On-disk format of the Free Page Map.
+ *
+ * The FPM is a b-tree, indexed by block number. Each page contains a
+ * ZSFreePageMapOpaque in the "special area", and an array of
+ * ZSFreePageMapItems as the content (ie. after the normal page header,
+ * up to pd_lower). On an internal page, each item contains the starting
+ * block number, and a pointer to the child FPM page. On a leaf page,
+ * each entry contains the start and end of the block range that the item
+ * represents.
+ *
+ * The block ranges stored on leaf pages must not overlap!
+ */
+typedef struct
+{
+ BlockNumber zs_lokey; /* inclusive */
+ BlockNumber zs_hikey; /* exclusive */
+ uint16 zs_level; /* 0 = leaf */
+ uint16 zs_flags;
+ char padding[2]; /* padding, to put zs_page_id last */
+ uint16 zs_page_id; /* always ZS_FPM_PAGE_ID */
+} ZSFreePageMapOpaque;
+
+typedef struct
+{
+ BlockNumber zs_startblk; /* inclusive */
+ union {
+ BlockNumber zs_endblk; /* on a leaf page, end of extent, exclusive */
+ BlockNumber zs_downlink; /* on an internal page, pointer to child */
+ } u;
+} ZSFreePageMapItem;
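+
+/*
+ * For example (an editor's illustration): a leaf item with zs_startblk = 100
+ * and u.zs_endblk = 110 says that blocks 100..109 are free, while an internal
+ * item with zs_startblk = 100 and u.zs_downlink = 25 points to the child FPM
+ * page at block 25, whose keyspace starts at block 100.
+ */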
+
+#define ZSFreePageMapGetOpaque(page) ((ZSFreePageMapOpaque *) PageGetSpecialPointer(page))
+
+/* overlap, or touch? */
+static inline bool
+zsextent_overlap(BlockNumber start1, BlockNumber end1, BlockNumber start2, BlockNumber end2)
+{
+ if (end1 < start2)
+ return false;
+ if (end2 < start1)
+ return false;
+ return true;
+}
+
+static inline ZSFreePageMapItem *
+ZSFreePageMapPageGetItems(Page page)
+{
+ ZSFreePageMapItem *items;
+
+ items = (ZSFreePageMapItem *) PageGetContents(page);
+
+ return items;
+}
+static inline int
+ZSFreePageMapPageGetNumItems(Page page)
+{
+ ZSFreePageMapItem *begin;
+ ZSFreePageMapItem *end;
+
+ begin = (ZSFreePageMapItem *) PageGetContents(page);
+ end = (ZSFreePageMapItem *) ((char *) page + ((PageHeader) page)->pd_lower);
+
+ return end - begin;
+}
+
+static zs_split_stack *zsfpm_unlink_page(Relation rel, Buffer buf, int level, Buffer metabuf);
+static zs_split_stack *zsfpm_merge_pages(Relation rel, Buffer leftbuf, Buffer rightbuf, bool target_is_left, Buffer metabuf);
+static BlockNumber zsfpm_consume_page(Relation rel, Buffer metabuf);
+static void zsfpm_insert(Relation rel, BlockNumber startblk, BlockNumber endblk);
+static zs_split_stack *zsfpm_split(Relation rel, Buffer leftbuf,
+ int newpos, ZSFreePageMapItem *newitem);
+static zs_split_stack *zsfpm_insert_downlink(Relation rel, Buffer leftbuf,
+ BlockNumber rightlokey, BlockNumber rightblkno);
+static zs_split_stack *zsfpm_newroot(Relation rel, Buffer metabuf, int level,
+ ZSFreePageMapItem *item1, ZSFreePageMapItem *item2);
+static Buffer zsfpm_descend(Relation rel, Buffer metabuf, BlockNumber key, int level);
+static int zsfpm_binsrch_blkno(BlockNumber key, ZSFreePageMapItem *arr, int arr_elems);
+
+/*
+ * zspage_is_recyclable()
+ *
+ * Is the current page recyclable?
+ *
+ * It can be:
+ *
+ * - an empty, all-zeros page,
+ * - explicitly marked as deleted,
+ * - an UNDO page older than oldest_undo_ptr
+ * - a b-tree page belonging to a deleted attribute
+ * - a TOAST page belonging to a dead item
+ *
+ *
+ * Currently, only the first of these (an empty, all-zeros page) is actually
+ * checked below; recognizing the other cases is still a TODO.
+ *
+static bool
+zspage_is_recyclable(Buffer buf)
+{
+ if (PageIsNew(BufferGetPage(buf)))
+ return true;
+ return false;
+}
+
+
+static void
+zsfpm_delete_leaf(Relation rel, Buffer buf, Buffer metabuf)
+{
+ Page page = BufferGetPage(buf);
+ ZSFreePageMapOpaque *opaque = ZSFreePageMapGetOpaque(page);
+
+ if (opaque->zs_lokey == 0 && opaque->zs_hikey == MaxBlockNumber + 1)
+ {
+ /* Don't delete the last leaf page. Just mark it empty */
+ START_CRIT_SECTION();
+
+ ((PageHeader) page)->pd_lower = SizeOfPageHeaderData;
+
+ MarkBufferDirty(buf);
+
+ /* TODO: WAL-log */
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+
+ return;
+ }
+ else
+ {
+ zs_split_stack *stack;
+
+ stack = zsfpm_unlink_page(rel, buf, 0, metabuf);
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack);
+ }
+}
+
+/*
+ * Removes the last item from page, and unlinks the page from the tree.
+ *
+ *
+ * NOTE: you cannot remove the only leaf.
+ */
+static zs_split_stack *
+zsfpm_unlink_page(Relation rel, Buffer buf, int level, Buffer metabuf)
+{
+ Page page = BufferGetPage(buf);
+ ZSFreePageMapOpaque *opaque = ZSFreePageMapGetOpaque(page);
+ Buffer leftbuf;
+ Buffer rightbuf;
+ bool target_is_left;
+
+ Assert(opaque->zs_lokey != 0 || opaque->zs_hikey != MaxBlockNumber + 1);
+ Assert(ZSFreePageMapPageGetNumItems(page) == 1);
+
+ /*
+ * Find left sibling.
+ * or if this is leftmost page, find right sibling.
+ */
+ if (opaque->zs_lokey != 0)
+ {
+ rightbuf = buf;
+ leftbuf = zsfpm_descend(rel, metabuf, opaque->zs_lokey - 1, level);
+ target_is_left = false;
+ }
+ else
+ {
+ rightbuf = zsfpm_descend(rel, metabuf, opaque->zs_hikey, level);
+ leftbuf = buf;
+ target_is_left = true;
+ }
+
+ return zsfpm_merge_pages(rel, leftbuf, rightbuf, target_is_left, metabuf);
+}
+
+/*
+ * Page deletion:
+ *
+ * Mark page empty, remove downlink. If parent becomes empty, recursively delete it.
+ *
+ * Unlike in the nbtree index, we don't need to worry about concurrent scans. They
+ * will simply retry if they land on an unexpected page.
+ */
+static zs_split_stack *
+zsfpm_merge_pages(Relation rel, Buffer leftbuf, Buffer rightbuf, bool target_is_left, Buffer metabuf)
+{
+ Buffer parentbuf;
+ Page origleftpage;
+ Page leftpage;
+ Page rightpage;
+ ZSFreePageMapOpaque *leftopaque;
+ ZSFreePageMapOpaque *rightopaque;
+ ZSFreePageMapItem *leftitems;
+ ZSFreePageMapItem *origleftitems;
+ ZSFreePageMapItem *rightitems;
+ ZSFreePageMapItem *parentitems;
+ int origleftnitems;
+ int rightnitems;
+ int parentnitems;
+ Page parentpage;
+ int itemno;
+ zs_split_stack *stack;
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ origleftpage = BufferGetPage(leftbuf);
+ leftpage = PageGetTempPageCopySpecial(origleftpage);
+ leftopaque = ZSFreePageMapGetOpaque(leftpage);
+
+ origleftitems = ZSFreePageMapPageGetItems(origleftpage);
+ origleftnitems = ZSFreePageMapPageGetNumItems(origleftpage);
+
+ leftitems = ZSFreePageMapPageGetItems(leftpage);
+
+ rightpage = BufferGetPage(rightbuf);
+ rightopaque = ZSFreePageMapGetOpaque(rightpage);
+ rightitems = ZSFreePageMapPageGetItems(rightpage);
+ rightnitems = ZSFreePageMapPageGetNumItems(rightpage);
+
+ /* move all items from right to left */
+
+ if (target_is_left)
+ {
+ Assert(origleftnitems == 1);
+
+ memcpy(leftitems,
+ rightitems,
+ rightnitems * sizeof(ZSFreePageMapItem));
+ ((PageHeader) leftpage)->pd_lower += rightnitems * sizeof(ZSFreePageMapItem);
+ }
+ else
+ {
+ origleftitems = ZSFreePageMapPageGetItems(origleftpage);
+ leftitems = ZSFreePageMapPageGetItems(leftpage);
+
+ Assert(rightnitems == 1);
+
+ memcpy(leftitems,
+ origleftitems,
+ origleftnitems * sizeof(ZSFreePageMapItem));
+ }
+
+ /* update left hikey */
+ leftopaque->zs_hikey = ZSFreePageMapGetOpaque(rightpage)->zs_hikey;
+
+ Assert(ZSFreePageMapGetOpaque(leftpage)->zs_level == ZSFreePageMapGetOpaque(rightpage)->zs_level);
+
+ stack = zs_new_split_stack_entry(leftbuf, leftpage);
+ stack_head = stack_tail = stack;
+
+ /* Mark right page as empty/unused */
+ rightpage = palloc0(BLCKSZ);
+
+ stack = zs_new_split_stack_entry(rightbuf, rightpage);
+ stack->recycle = true;
+ stack_tail->next = stack;
+ stack_tail = stack;
+
+ /* find downlink for 'rightbuf' in the parent */
+ parentbuf = zsfpm_descend(rel, metabuf, rightopaque->zs_lokey, leftopaque->zs_level + 1);
+ parentpage = BufferGetPage(parentbuf);
+
+ parentitems = ZSFreePageMapPageGetItems(parentpage);
+ parentnitems = ZSFreePageMapPageGetNumItems(parentpage);
+ itemno = zsfpm_binsrch_blkno(rightopaque->zs_lokey, parentitems, parentnitems);
+ if (itemno < 0 || parentitems[itemno].u.zs_downlink != BufferGetBlockNumber(rightbuf))
+ elog(ERROR, "could not find downlink to FPM page %u", BufferGetBlockNumber(rightbuf));
+
+ /* remove downlink from parent */
+ if (parentnitems > 1)
+ {
+ Page newpage = PageGetTempPageCopySpecial(parentpage);
+ ZSFreePageMapItem *newitems = ZSFreePageMapPageGetItems(newpage);
+
+ memcpy(newitems, parentitems, itemno * sizeof(ZSFreePageMapItem));
+ memcpy(&newitems[itemno], &parentitems[itemno + 1], (parentnitems - itemno -1) * sizeof(ZSFreePageMapItem));
+
+ ((PageHeader) newpage)->pd_lower += (parentnitems - 1) * sizeof(ZSFreePageMapItem);
+
+ stack = zs_new_split_stack_entry(parentbuf, newpage);
+ stack_tail->next = stack;
+ stack_tail = stack;
+ }
+ else
+ {
+ /* the parent becomes empty as well. Recursively remove it. */
+ stack_tail->next = zsfpm_unlink_page(rel, parentbuf, leftopaque->zs_level + 1, metabuf);
+ }
+ return stack_head;
+}
+
+/*
+ * Allocate a new page.
+ *
+ * The page is exclusive-locked, but not initialized.
+ */
+Buffer
+zspage_getnewbuf(Relation rel, Buffer metabuf)
+{
+ bool release_metabuf;
+ Buffer buf;
+ BlockNumber blk;
+
+ if (metabuf == InvalidBuffer)
+ {
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ release_metabuf = true;
+ }
+ else
+ release_metabuf = false;
+
+retry:
+ /* Get a block from the FPM. */
+ blk = zsfpm_consume_page(rel, metabuf);
+ if (blk == 0)
+ {
+ /* metapage, not expected */
+ elog(ERROR, "could not find valid page in FPM");
+ }
+ if (blk == InvalidBlockNumber)
+ {
+ /* No free pages. Have to extend the relation. */
+ buf = zspage_extendrel_newbuf(rel);
+ blk = BufferGetBlockNumber(buf);
+ }
+ else
+ {
+ buf = ReadBuffer(rel, blk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Check that the page really is unused. */
+ if (!zspage_is_recyclable(buf))
+ {
+ UnlockReleaseBuffer(buf);
+ goto retry;
+ }
+ }
+
+ if (release_metabuf)
+ UnlockReleaseBuffer(metabuf);
+ return buf;
+}
+
+/*
+ * Extend the relation.
+ *
+ * Returns the new page, exclusive-locked.
+ */
+Buffer
+zspage_extendrel_newbuf(Relation rel)
+{
+ Buffer buf;
+ bool needLock;
+
+ /*
+ * Extend the relation by one page.
+ *
+ * We have to use a lock to ensure no one else is extending the rel at
+ * the same time, else we will both try to initialize the same new
+ * page. We can skip locking for new or temp relations, however,
+ * since no one else could be accessing them.
+ */
+ needLock = !RELATION_IS_LOCAL(rel);
+
+ if (needLock)
+ LockRelationForExtension(rel, ExclusiveLock);
+
+ buf = ReadBuffer(rel, P_NEW);
+
+ /* Acquire buffer lock on new page */
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * Release the file-extension lock; it's now OK for someone else to
+ * extend the relation some more. Note that we cannot release this
+ * lock before we have buffer lock on the new page, or we risk a race
+ * condition against btvacuumscan --- see comments therein.
+ */
+ if (needLock)
+ UnlockRelationForExtension(rel, ExclusiveLock);
+
+ return buf;
+}
+
+
+/*
+ * Explictly mark a page as deleted and recyclable, and add it to the FPM.
+ *
+ * The caller must hold an exclusive-lock on the page.
+ */
+void
+zspage_delete_page(Relation rel, Buffer buf)
+{
+ BlockNumber blk = BufferGetBlockNumber(buf);
+ Page page;
+
+ page = BufferGetPage(buf);
+ memset(page, 0, BLCKSZ);
+
+ zsfpm_insert(rel, blk, blk + 1);
+}
+
+/*
+ * Remove and return a page from the FPM.
+ */
+static BlockNumber
+zsfpm_consume_page(Relation rel, Buffer metabuf)
+{
+ /* TODO: add some smarts, to allocate the page nearby old page, etc. */
+ /* currently, we just pick the first available page. */
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber rootblk;
+ Buffer buf;
+ Page page;
+ ZSFreePageMapItem *items;
+ int nitems;
+ BlockNumber result;
+
+ metapage = BufferGetPage(metabuf);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ rootblk = metaopaque->zs_fpm_root;
+
+ if (rootblk == InvalidBlockNumber)
+ return InvalidBlockNumber;
+
+ buf = zsfpm_descend(rel, metabuf, 0, 0);
+ page = BufferGetPage(buf);
+
+ items = ZSFreePageMapPageGetItems(page);
+ nitems = ZSFreePageMapPageGetNumItems(page);
+
+ if (nitems == 0)
+ {
+ UnlockReleaseBuffer(buf);
+ return InvalidBlockNumber;
+ }
+
+ result = items[0].zs_startblk;
+ items[0].zs_startblk++;
+ if (items[0].u.zs_endblk == items[0].zs_startblk)
+ {
+ if (nitems > 1)
+ {
+ memmove(&items[0],
+ &items[1],
+ (nitems - 1) * sizeof(ZSFreePageMapItem));
+ ((PageHeader) page)->pd_lower -= sizeof(ZSFreePageMapItem);
+
+ UnlockReleaseBuffer(buf);
+ }
+ else
+ {
+ zsfpm_delete_leaf(rel, buf, metabuf);
+ /* zsfpm_delete_leaf() released 'buf' */
+ }
+ }
+ else
+ {
+ UnlockReleaseBuffer(buf);
+ }
+ return result;
+}
+
+/*
+ * Add a block range to the FPM.
+ */
+static void
+zsfpm_insert(Relation rel, BlockNumber startblk, BlockNumber endblk)
+{
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber rootblk;
+ Buffer buf;
+ Page page;
+ ZSFreePageMapItem *items;
+ int nitems;
+ int pos;
+ int replacepos_first;
+ int replacepos_last;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); /* TODO: get shared lock first */
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ rootblk = metaopaque->zs_fpm_root;
+
+ if (rootblk == InvalidBlockNumber)
+ {
+ /* Create a new FPM root page */
+ ZSFreePageMapOpaque *opaque;
+
+ buf = zspage_extendrel_newbuf(rel);
+ page = BufferGetPage(buf);
+ rootblk = BufferGetBlockNumber(buf);
+
+ PageInit(page, BLCKSZ, sizeof(ZSFreePageMapOpaque));
+ opaque = ZSFreePageMapGetOpaque(page);
+ opaque->zs_lokey = 0;
+ opaque->zs_hikey = MaxBlockNumber + 1;
+ opaque->zs_level = 0;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_FPM_PAGE_ID;
+
+ metaopaque->zs_fpm_root = rootblk;
+
+ items = ZSFreePageMapPageGetItems(page);
+ Assert(ZSFreePageMapPageGetNumItems(page) == 0);
+ items[0].zs_startblk = startblk;
+ items[0].u.zs_endblk = endblk;
+
+ /* TODO: WAL-logging */
+
+ MarkBufferDirty(metabuf);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(metabuf);
+ UnlockReleaseBuffer(buf);
+ return;
+ }
+
+ /* Descend to the correct leaf page for this block */
+
+ buf = zsfpm_descend(rel, metabuf, startblk, 0);
+
+ UnlockReleaseBuffer(metabuf);
+
+ page = BufferGetPage(buf);
+ items = ZSFreePageMapPageGetItems(page);
+ nitems = ZSFreePageMapPageGetNumItems(page);
+
+ pos = zsfpm_binsrch_blkno(startblk, items, nitems);
+
+ /* FIXME: this merging business won't work correctly if the range crosses
+ * a b-tree page boundary. Not a problem currently, when we only insert
+ * individual pages.
+ */
+
+ /* Check if this item can be merged with the previous item */
+ replacepos_first = -1;
+ if (pos >= 0 && items[pos].u.zs_endblk >= startblk)
+ {
+ replacepos_first = pos;
+ }
+ /* If not, can this be merged with the next item? */
+ else if (pos + 1 < nitems && endblk >= items[pos + 1].zs_startblk)
+ {
+ /* yes, merge */
+ replacepos_first = pos + 1;
+ }
+
+ if (replacepos_first >= 0)
+ {
+ /* adjust the start block of this item */
+ if (startblk < items[replacepos_first].zs_startblk)
+ {
+ items[replacepos_first].zs_startblk = startblk;
+ }
+
+ /*
+ * The new end block might overlap with any number of existing
+ * ranges. Replace all overlapping ranges with one range that
+ * covers them all.
+ */
+ replacepos_last = replacepos_first;
+ if (endblk > items[replacepos_first].u.zs_endblk)
+ {
+ int j;
+ BlockNumber replace_end;
+
+ replace_end = endblk;
+
+ for (j = replacepos_first + 1; j < nitems; j++)
+ {
+ if (items[j].zs_startblk > replace_end)
+ break;
+
+ /*
+ * This item will be replaced. Check the end, to see
+ * if this is the last one that can be replaced.
+ */
+ replacepos_last = j;
+
+ if (items[j].u.zs_endblk > replace_end)
+ {
+ replace_end = items[j].u.zs_endblk;
+ break;
+ }
+ }
+
+ items[replacepos_first].u.zs_endblk = replace_end;
+ }
+
+ /* we already adjusted the item at 'replacepos_first'. Remove the rest. */
+ if (replacepos_last > replacepos_first)
+ {
+ int move_items = nitems - replacepos_last;
+ int remain_items = nitems - (replacepos_last - replacepos_first);
+
+ if (move_items > 0)
+ memmove(&items[replacepos_first + 1],
+ &items[replacepos_last + 1],
+ move_items * sizeof(ZSFreePageMapItem));
+
+ ((PageHeader) page)->pd_lower = SizeOfPageHeaderData + remain_items * sizeof(ZSFreePageMapItem);
+
+ }
+
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ return;
+ }
+
+ /*
+ * No overlap with any existing ranges. Add a new one. This might require
+ * splitting the page.
+ */
+ pos = pos + 1;
+
+ if (PageGetExactFreeSpace(page) >= sizeof(ZSFreePageMapItem))
+ {
+ START_CRIT_SECTION();
+
+ memmove(&items[pos + 1],
+ &items[pos],
+ (nitems - pos) * sizeof(ZSFreePageMapItem));
+
+ items[pos].zs_startblk = startblk;
+ items[pos].u.zs_endblk = endblk;
+
+ ((PageHeader) page)->pd_lower += sizeof(ZSFreePageMapItem);
+
+ /* TODO: WAL-log */
+
+ MarkBufferDirty(buf);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+ return;
+ }
+ else
+ {
+ /* last resort: split the page */
+ zs_split_stack *split_stack;
+ ZSFreePageMapItem newitem;
+
+ newitem.zs_startblk = startblk;
+ newitem.u.zs_endblk = endblk;
+ split_stack = zsfpm_split(rel, buf, pos, &newitem);
+
+ /* write out the temporary page copies */
+ zs_apply_split_changes(rel, split_stack);
+ }
+}
+
+/*
+ * Insert a downlink for right page, after splitting 'leftbuf' FPM page.
+ */
+static zs_split_stack *
+zsfpm_insert_downlink(Relation rel, Buffer leftbuf,
+ BlockNumber rightlokey, BlockNumber rightblkno)
+{
+ Buffer parentbuf;
+ Page leftpage = BufferGetPage(leftbuf);
+ BlockNumber leftblkno = BufferGetBlockNumber(leftbuf);
+ ZSFreePageMapOpaque *leftopaque = ZSFreePageMapGetOpaque(leftpage);
+ zstid leftlokey = leftopaque->zs_lokey;
+ ZSFreePageMapItem downlink;
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber rootblk;
+ Page parentpage;
+ ZSFreePageMapItem *items;
+ int nitems;
+ int pos;
+ zs_split_stack *split_stack;
+
+ /*
+ * First, find the parent of 'leftbuf'.
+ *
+ * TODO: this is a bit inefficient. Usually, we have just descended the
+ * tree, and if we just remembered the path we descended, we could just
+ * walk back up.
+ */
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metapage = BufferGetPage(metabuf);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ rootblk = metaopaque->zs_fpm_root;
+
+ if (rootblk == BufferGetBlockNumber(leftbuf))
+ {
+ /* Root split. Create new root with downlinks for the left and right page. */
+ ZSFreePageMapItem downlink1;
+ ZSFreePageMapItem downlink2;
+
+ /* re-acquire the lock on metapage in exclusive mode */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ /*
+ * No one should have been able to change the root pointer, because we were
+ * holding a lock on the root page
+ */
+ Assert(metaopaque->zs_fpm_root == BufferGetBlockNumber(leftbuf));
+
+ downlink1.zs_startblk = leftlokey;
+ downlink1.u.zs_downlink = leftblkno;
+ downlink2.zs_startblk = rightlokey;
+ downlink2.u.zs_downlink = rightblkno;
+
+ return zsfpm_newroot(rel, metabuf, leftopaque->zs_level + 1,
+ &downlink1, &downlink2);
+ }
+
+ UnlockReleaseBuffer(metabuf);
+
+ parentbuf = zsfpm_descend(rel, metabuf, leftlokey, leftopaque->zs_level + 1);
+ parentpage = BufferGetPage(parentbuf);
+
+ downlink.zs_startblk = rightlokey;
+ downlink.u.zs_downlink = rightblkno;
+
+ /* insert the item */
+ items = ZSFreePageMapPageGetItems(parentpage);
+ nitems = ZSFreePageMapPageGetNumItems(parentpage);
+
+ pos = zsfpm_binsrch_blkno(rightlokey, items, nitems);
+ pos = pos + 1;
+
+ if (PageGetExactFreeSpace(parentpage) >= sizeof(ZSFreePageMapItem))
+ {
+ ZSFreePageMapItem *newitems;
+ Page newpage;
+
+ newpage = PageGetTempPageCopySpecial(parentpage);
+
+ split_stack = zs_new_split_stack_entry(parentbuf, newpage);
+
+ newitems = ZSFreePageMapPageGetItems(newpage);
+ memcpy(newitems, items, pos * sizeof(ZSFreePageMapItem));
+
+ newitems[pos] = downlink;
+
+ memcpy(&newitems[pos + 1], &items[pos], (nitems - pos) * sizeof(ZSFreePageMapItem));
+
+ ((PageHeader) newpage)->pd_lower += (nitems + 1) * sizeof(ZSFreePageMapItem);
+
+ }
+ else
+ {
+ /* have to split the page. */
+ split_stack = zsfpm_split(rel, parentbuf, pos, &downlink);
+ }
+ return split_stack;
+}
+
+/*
+ * Split a page for insertion of 'newitem', at 'newpos'.
+ *
+ * A page split needs to modify the page being split, the block allocated for
+ * the new page, and also the downlink in the parent. If the parent needs to
+ * be split as well, its parent also needs to be recursively updated, all the
+ * way up to the root page, in the worst case. zsfpm_split() doesn't modify
+ * any pages directly, but locks them exclusively, and returns a list of
+ * zs_split_stack structs to represent the modifications. The caller must
+ * WAL-log and apply all the changes represented by the list.
+ */
+static zs_split_stack *
+zsfpm_split(Relation rel, Buffer leftbuf, int newpos, ZSFreePageMapItem *newitem)
+{
+ Buffer rightbuf;
+ Page origpage = BufferGetPage(leftbuf);
+ Page leftpage;
+ Page rightpage;
+ BlockNumber rightblkno;
+ ZSFreePageMapOpaque *leftopaque;
+ ZSFreePageMapOpaque *rightopaque;
+ ZSFreePageMapItem *origitems;
+ ZSFreePageMapItem *leftitems;
+ ZSFreePageMapItem *rightitems;
+ int orignitems;
+ int leftnitems;
+ int rightnitems;
+ int splitpoint;
+ BlockNumber splitkey;
+ bool newitemonleft;
+ int i;
+ zs_split_stack *stack1;
+ zs_split_stack *stack2;
+
+ leftpage = PageGetTempPageCopySpecial(origpage);
+ leftopaque = ZSFreePageMapGetOpaque(leftpage);
+
+ /*
+ * FIXME: can't use the FPM to get a page, because we might deadlock with
+ * ourself. We could steal a block from the page we're splitting...
+ */
+ rightbuf = zspage_extendrel_newbuf(rel);
+ rightblkno = BufferGetBlockNumber(rightbuf);
+
+ rightpage = palloc(BLCKSZ);
+ PageInit(rightpage, BLCKSZ, sizeof(ZSFreePageMapOpaque));
+ rightopaque = ZSFreePageMapGetOpaque(rightpage);
+
+ /*
+ * Figure out the split point.
+ *
+ * TODO: currently, always do 90/10 split.
+ */
+ origitems = ZSFreePageMapPageGetItems(origpage);
+ orignitems = ZSFreePageMapPageGetNumItems(origpage);
+ splitpoint = orignitems * 0.9;
+ splitkey = origitems[splitpoint].zs_startblk;
+ newitemonleft = (newitem->zs_startblk < splitkey);
+
+ /* Set up the page headers */
+ rightopaque->zs_lokey = splitkey;
+ rightopaque->zs_hikey = leftopaque->zs_hikey;
+ rightopaque->zs_level = leftopaque->zs_level;
+ rightopaque->zs_flags = 0;
+ rightopaque->zs_page_id = ZS_FPM_PAGE_ID;
+
+ leftopaque->zs_hikey = splitkey;
+
+ /* copy the items */
+ leftitems = ZSFreePageMapPageGetItems(leftpage);
+ leftnitems = 0;
+ rightitems = ZSFreePageMapPageGetItems(rightpage);
+ rightnitems = 0;
+
+ for (i = 0; i < orignitems; i++)
+ {
+ if (i == newpos)
+ {
+ if (newitemonleft)
+ leftitems[leftnitems++] = *newitem;
+ else
+ rightitems[rightnitems++] = *newitem;
+ }
+
+ if (i < splitpoint)
+ leftitems[leftnitems++] = origitems[i];
+ else
+ rightitems[rightnitems++] = origitems[i];
+ }
+ /* cope with possibility that newitem goes at the end */
+ if (i <= newpos)
+ {
+ Assert(!newitemonleft);
+ rightitems[rightnitems++] = *newitem;
+ }
+ ((PageHeader) leftpage)->pd_lower += leftnitems * sizeof(ZSFreePageMapItem);
+ ((PageHeader) rightpage)->pd_lower += rightnitems * sizeof(ZSFreePageMapItem);
+
+ Assert(leftnitems + rightnitems == orignitems + 1);
+
+ stack1 = zs_new_split_stack_entry(leftbuf, leftpage);
+ stack2 = zs_new_split_stack_entry(rightbuf, rightpage);
+ stack1->next = stack2;
+
+ /* recurse to insert downlink. */
+ stack2->next = zsfpm_insert_downlink(rel, leftbuf, splitkey, rightblkno);
+
+ return stack1;
+}
+
+static zs_split_stack *
+zsfpm_newroot(Relation rel, Buffer metabuf, int level,
+ ZSFreePageMapItem *item1, ZSFreePageMapItem *item2)
+{
+ /* Create a new FPM root page */
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ ZSFreePageMapOpaque *opaque;
+ Buffer buf;
+ Page page;
+ BlockNumber rootblk;
+ ZSFreePageMapItem *items;
+ zs_split_stack *stack1;
+ zs_split_stack *stack2;
+
+ metapage = PageGetTempPageCopy(BufferGetPage(metabuf));
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ /* TODO: get the page from the FPM */
+ buf = zspage_extendrel_newbuf(rel);
+ rootblk = BufferGetBlockNumber(buf);
+
+ page = palloc(BLCKSZ);
+ PageInit(page, BLCKSZ, sizeof(ZSFreePageMapOpaque));
+ opaque = ZSFreePageMapGetOpaque(page);
+ opaque->zs_lokey = 0;
+ opaque->zs_hikey = MaxBlockNumber + 1;
+ opaque->zs_level = level;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_FPM_PAGE_ID;
+
+ items = ZSFreePageMapPageGetItems(page);
+ items[0] = *item1;
+ items[1] = *item2;
+ ((PageHeader) page)->pd_lower += 2 * sizeof(ZSFreePageMapItem);
+ Assert(ZSFreePageMapPageGetNumItems(page) == 2);
+
+ metaopaque->zs_fpm_root = rootblk;
+
+ stack1 = zs_new_split_stack_entry(metabuf, metapage);
+
+ stack2 = zs_new_split_stack_entry(buf, page);
+ stack2->next = stack1;
+
+ return stack2;
+}
+
+static Buffer
+zsfpm_descend(Relation rel, Buffer metabuf, BlockNumber key, int level)
+{
+ BlockNumber next;
+ Buffer buf;
+ Page page;
+ ZSFreePageMapOpaque *opaque;
+ ZSFreePageMapItem *items;
+ int nitems;
+ int itemno;
+ int nextlevel = -1;
+ BlockNumber failblk = InvalidBlockNumber;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber rootblk;
+
+ metapage = BufferGetPage(metabuf);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ rootblk = metaopaque->zs_fpm_root;
+
+ next = rootblk;
+ for (;;)
+ {
+ /*
+ * If we arrive again to a block that was a dead-end earlier, it seems
+ * that the tree is corrupt.
+ *
+ * XXX: It's theoretically possible that the block was removed, but then
+ * added back at the same location, and removed again. So perhaps retry
+ * a few times?
+ */
+ if (next == failblk)
+ elog(ERROR, "could not descend to block %u in FPM", key);
+
+ buf = ReadBuffer(rel, next);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); /* TODO: shared */
+ page = BufferGetPage(buf);
+ opaque = ZSFreePageMapGetOpaque(page);
+
+ if (nextlevel == -1)
+ nextlevel = opaque->zs_level;
+ else if (opaque->zs_level != nextlevel)
+ elog(ERROR, "unexpected level encountered when descending FPM tree");
+
+ if (opaque->zs_level < level)
+ elog(ERROR, "unexpected page level encountered");
+
+ /*
+ * Do we need to walk right? This could happen if the page was concurrently split.
+ *
+ * XXX: actually, we restart from root. We're holding a lock on the metapage,
+ * so the root cannot change.
+ */
+ if (key >= opaque->zs_hikey)
+ {
+ /* Restart from the root */
+ failblk = next;
+ next = rootblk;
+ nextlevel = -1;
+ }
+ else
+ {
+ if (opaque->zs_level == level)
+ return buf;
+
+ /* Find the downlink and follow it */
+ items = ZSFreePageMapPageGetItems(page);
+ nitems = ZSFreePageMapPageGetNumItems(page);
+
+ itemno = zsfpm_binsrch_blkno(key, items, nitems);
+
+ if (itemno < 0)
+ elog(ERROR, "could not descend FPM tree for key blk %u", key);
+
+ next = items[itemno].u.zs_downlink;
+ nextlevel--;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+
+static int
+zsfpm_binsrch_blkno(BlockNumber key, ZSFreePageMapItem *arr, int arr_elems)
+{
+ int low,
+ high,
+ mid;
+
+ low = 0;
+ high = arr_elems;
+ while (high > low)
+ {
+ mid = low + (high - low) / 2;
+
+ if (key >= arr[mid].zs_startblk)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+ return low - 1;
+}
diff --git a/src/backend/access/zedstore/zedstore_inspect.c b/src/backend/access/zedstore/zedstore_inspect.c
new file mode 100644
index 0000000000..4992c52102
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_inspect.c
@@ -0,0 +1,448 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstoream_inspect.c
+ * Debugging functions, for viewing ZedStore page contents
+ *
+ * These should probably be moved to contrib/, but it's handy to have them
+ * here during development.
+ *
+ * Example queries
+ * ---------------
+ *
+ * How many pages of each type a table has?
+ *
+ * select count(*), pg_zs_page_type('t_zedstore', g)
+ * from generate_series(0, pg_table_size('t_zedstore') / 8192 - 1) g group by 2;
+ *
+ * count | pg_zs_page_type
+ * -------+-----------------
+ * 1 | META
+ * 3701 | BTREE
+ * 6 | UNDO
+ * (3 rows)
+ *
+ * Compression ratio of B-tree leaf pages (other pages are not compressed):
+ *
+ * select sum(uncompressedsz::numeric) / sum(totalsz) as compratio
+ * from pg_zs_btree_pages('t_zedstore') ;
+ * compratio
+ * --------------------
+ * 3.6623829559208134
+ * (1 row)
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstoream_inspect.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "miscadmin.h"
+
+#include "access/relscan.h"
+#include "access/table.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "commands/vacuum.h"
+#include "funcapi.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+
+Datum pg_zs_page_type(PG_FUNCTION_ARGS);
+Datum pg_zs_undo_pages(PG_FUNCTION_ARGS);
+Datum pg_zs_btree_pages(PG_FUNCTION_ARGS);
+
+Datum
+pg_zs_page_type(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint64 pageno = PG_GETARG_INT64(1);
+ Relation rel;
+ uint16 zs_page_id;
+ Buffer buf;
+ Page page;
+ char *result;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ buf = ReadBuffer(rel, pageno);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
+ zs_page_id = *((uint16 *) ((char *) page + BLCKSZ - sizeof(uint16)));
+
+ UnlockReleaseBuffer(buf);
+
+ table_close(rel, AccessShareLock);
+
+ switch (zs_page_id)
+ {
+ case ZS_META_PAGE_ID:
+ result = "META";
+ break;
+ case ZS_BTREE_PAGE_ID:
+ result = "BTREE";
+ break;
+ case ZS_UNDO_PAGE_ID:
+ result = "UNDO";
+ break;
+ case ZS_TOAST_PAGE_ID:
+ result = "TOAST";
+ break;
+ case ZS_FPM_PAGE_ID:
+ result = "FPM";
+ break;
+ default:
+ result = psprintf("UNKNOWN 0x%04x", zs_page_id);
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+/*
+ * blkno int8
+ * nrecords int4
+ * freespace int4
+ * firstrecptr int8
+ * lastrecptr int8
+ */
+Datum
+pg_zs_undo_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber firstblk;
+ BlockNumber blkno;
+ char *ptr;
+ char *endptr;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ /*
+ * Get the current oldest undo page from the metapage.
+ */
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ firstblk = metaopaque->zs_undo_head;
+
+ UnlockReleaseBuffer(metabuf);
+
+ /*
+ * Loop through UNDO records, starting from the oldest page.
+ */
+ blkno = firstblk;
+ while (blkno != InvalidBlockNumber)
+ {
+ Datum values[5];
+ bool nulls[5];
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+ int nrecords;
+ ZSUndoRecPtr firstptr = { 0, 0, 0 };
+ ZSUndoRecPtr lastptr = { 0, 0, 0 };
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the UNDO page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ {
+ elog(WARNING, "unexpected page id on UNDO page %u", blkno);
+ break;
+ }
+
+ /* loop through all records on the page */
+ endptr = (char *) page + ((PageHeader) page)->pd_lower;
+ ptr = (char *) page + SizeOfPageHeaderData;
+ nrecords = 0;
+ while (ptr < endptr)
+ {
+ ZSUndoRec *undorec = (ZSUndoRec *) ptr;
+
+ Assert(undorec->undorecptr.blkno == blkno);
+
+ lastptr = undorec->undorecptr;
+ if (nrecords == 0)
+ firstptr = lastptr;
+ nrecords++;
+
+ ptr += undorec->size;
+ }
+
+ values[0] = Int64GetDatum(blkno);
+ values[1] = Int32GetDatum(nrecords);
+ values[2] = Int32GetDatum(PageGetExactFreeSpace(page));
+ values[3] = Int64GetDatum(firstptr.counter);
+ values[4] = Int64GetDatum(lastptr.counter);
+
+ blkno = opaque->next;
+ UnlockReleaseBuffer(buf);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
+
+
+/*
+ * blkno int8
+ * nextblk int8
+ * attno int4
+ * level int4
+ *
+ * lokey int8
+ * hikey int8
+
+ * nitems int4
+ * ncompressed int4
+ * totalsz int4
+ * uncompressedsz int4
+ * freespace int4
+ */
+Datum
+pg_zs_btree_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ BlockNumber blkno;
+ BlockNumber nblocks;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /* scan all blocks in physical order */
+ for (blkno = 1; blkno < nblocks; blkno++)
+ {
+ Datum values[11];
+ bool nulls[11];
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ int nitems;
+ int ncompressed;
+ int totalsz;
+ int uncompressedsz;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * we're only interested in B-tree pages. (Presumably, most of the
+ * pages in the relation are b-tree pages, so it makes sense to
+ * scan the whole relation in physical order)
+ */
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSBtreePageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+ opaque = (ZSBtreePageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_BTREE_PAGE_ID)
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+
+ nitems = 0;
+ ncompressed = 0;
+ totalsz = 0;
+ uncompressedsz = 0;
+ if (opaque->zs_level == 0)
+ {
+ /* leaf page */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSBtreeItem *item = (ZSBtreeItem *) PageGetItem(page, iid);
+
+ nitems++;
+ totalsz += item->t_size;
+
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ ZSCompressedBtreeItem *citem = (ZSCompressedBtreeItem *) PageGetItem(page, iid);
+
+ ncompressed++;
+ uncompressedsz += citem->t_uncompressedsize;
+ }
+ else
+ uncompressedsz += item->t_size;
+ }
+ }
+ else
+ {
+ /* internal page */
+ nitems = ZSBtreeInternalPageGetNumItems(page);
+ }
+ values[0] = Int64GetDatum(blkno);
+ values[1] = Int64GetDatum(opaque->zs_next);
+ values[2] = Int32GetDatum(opaque->zs_attno);
+ values[3] = Int32GetDatum(opaque->zs_level);
+ values[4] = Int64GetDatum(opaque->zs_lokey);
+ values[5] = Int64GetDatum(opaque->zs_hikey);
+ values[6] = Int32GetDatum(nitems);
+ if (opaque->zs_level == 0)
+ {
+ values[7] = Int32GetDatum(ncompressed);
+ values[8] = Int32GetDatum(totalsz);
+ values[9] = Int32GetDatum(uncompressedsz);
+ }
+ else
+ {
+ nulls[7] = true;
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ values[10] = Int32GetDatum(PageGetExactFreeSpace(page));
+
+ UnlockReleaseBuffer(buf);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/access/zedstore/zedstore_meta.c b/src/backend/access/zedstore/zedstore_meta.c
new file mode 100644
index 0000000000..a415645914
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_meta.c
@@ -0,0 +1,216 @@
+/*
+ * zedstore_meta.c
+ * Routines for handling ZedStore metapage
+ *
+ * The metapage holds a directory of B-tree root block numbers, one for each
+ * column.
+ *
+ * TODO:
+ * - support ALTER TABLE ADD COLUMN.
+ * - extend the root block dir to an overflow page if there are too many
+ * attributes to fit on one page
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_meta.c
+ */
+#include "postgres.h"
+
+#include "access/itup.h"
+#include "access/zedstore_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "utils/rel.h"
+
+static void
+zsmeta_add_root_for_attributes(Relation rel, Page page, bool init)
+{
+ int natts = RelationGetNumberOfAttributes(rel) + 1;
+ int cur_natts;
+ int maxatts;
+ Size freespace;
+ ZSMetaPage *metapg;
+
+ /* Initialize the attribute root dir for new attribute */
+ freespace = PageGetExactFreeSpace(page);
+ maxatts = freespace / sizeof(ZSRootDirItem);
+ if (natts > maxatts)
+ {
+ /*
+ * The root block directory must fit on the metapage.
+ *
+ * TODO: We could extend this by overflowing to another page.
+ */
+ elog(ERROR, "too many attributes for zedstore");
+ }
+
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ if (init)
+ metapg->nattributes = 0;
+
+ for (cur_natts = metapg->nattributes; cur_natts < natts; cur_natts++)
+ {
+ metapg->tree_root_dir[cur_natts].root = InvalidBlockNumber;
+ }
+
+ metapg->nattributes = natts;
+ ((PageHeader) page)->pd_lower += sizeof(ZSRootDirItem);
+}
+
+/*
+ * Initialize the metapage for an empty relation.
+ */
+void
+zsmeta_initmetapage(Relation rel)
+{
+ Buffer buf;
+ Page page;
+ ZSMetaPageOpaque *opaque;
+
+ /*
+ * It's possible that we error out when building the metapage, if there
+ * are too many attribute, so work on a temporary copy first, before actually
+ * allocating the buffer.
+ */
+ page = palloc(BLCKSZ);
+ PageInit(page, BLCKSZ, sizeof(ZSMetaPageOpaque));
+ zsmeta_add_root_for_attributes(rel, page, true);
+
+ opaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_META_PAGE_ID;
+
+ /* UNDO-related fields */
+ opaque->zs_undo_counter = 1; /* start at 1, so that 0 is always "old" */
+ opaque->zs_undo_head = InvalidBlockNumber;
+ opaque->zs_undo_tail = InvalidBlockNumber;
+ opaque->zs_undo_oldestptr.counter = 1;
+
+ opaque->zs_fpm_root = InvalidBlockNumber;
+
+ /* Ok, write it out to disk */
+ buf = ReadBuffer(rel, P_NEW);
+ if (BufferGetBlockNumber(buf) != ZS_META_BLK)
+ elog(ERROR, "index is not empty");
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ PageRestoreTempPage(page, BufferGetPage(buf));
+
+ MarkBufferDirty(buf);
+ /* TODO: WAL-log */
+
+ UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Get the block number of the b-tree root for given attribute.
+ *
+ * If 'readonly' is true, and the root doesn't exist yet (ie. it's an empty
+ * table), returns InvalidBlockNumber. Otherwise new root is allocated if
+ * the root doesn't exist.
+ */
+BlockNumber
+zsmeta_get_root_for_attribute(Relation rel, AttrNumber attno, bool readonly)
+{
+ Buffer metabuf;
+ ZSMetaPage *metapg;
+ BlockNumber rootblk;
+ Page page;
+
+ if (RelationGetNumberOfBlocks(rel) == 0)
+ {
+ if (readonly)
+ return InvalidBlockNumber;
+
+ zsmeta_initmetapage(rel);
+ }
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+
+ /* TODO: get share lock to begin with */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(metabuf);
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ if ((attno != ZS_META_ATTRIBUTE_NUM) && attno <= 0)
+ elog(ERROR, "invalid attribute number %d (table has only %d attributes)", attno, metapg->nattributes);
+
+ /*
+ * file has less number of attributes stored compared to catalog. This
+ * happens due to add column default value storing value in catalog and
+ * absent in table. This attribute must be marked with atthasmissing.
+ */
+ if (attno >= metapg->nattributes)
+ {
+ if (readonly)
+ {
+ UnlockReleaseBuffer(metabuf);
+ return InvalidBlockNumber;
+ }
+ else
+ {
+ zsmeta_add_root_for_attributes(rel, page, false);
+ }
+ }
+
+ rootblk = metapg->tree_root_dir[attno].root;
+
+ if (!readonly && rootblk == InvalidBlockNumber)
+ {
+ /* try to allocate one */
+ Buffer rootbuf;
+ Page rootpage;
+ ZSBtreePageOpaque *opaque;
+
+ /* TODO: release lock on metapage while we do I/O */
+ rootbuf = zspage_getnewbuf(rel, metabuf);
+ rootblk = BufferGetBlockNumber(rootbuf);
+
+ metapg->tree_root_dir[attno].root = rootblk;
+
+ /* initialize the page to look like a root leaf */
+ rootpage = BufferGetPage(rootbuf);
+ PageInit(rootpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ opaque = ZSBtreePageGetOpaque(rootpage);
+ opaque->zs_attno = attno;
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_lokey = MinZSTid;
+ opaque->zs_hikey = MaxPlusOneZSTid;
+ opaque->zs_level = 0;
+ opaque->zs_flags = ZSBT_ROOT;
+ opaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ MarkBufferDirty(rootbuf);
+ MarkBufferDirty(metabuf);
+ /* TODO: WAL-log both pages */
+
+ UnlockReleaseBuffer(rootbuf);
+ }
+
+ UnlockReleaseBuffer(metabuf);
+
+ return rootblk;
+}
+
+/*
+ *
+ * Caller is responsible for WAL-logging this.
+ */
+void
+zsmeta_update_root_for_attribute(Relation rel, AttrNumber attno,
+ Buffer metabuf, BlockNumber rootblk)
+{
+ ZSMetaPage *metapg;
+
+ metapg = (ZSMetaPage *) PageGetContents(BufferGetPage(metabuf));
+
+ if ((attno != ZS_META_ATTRIBUTE_NUM) && (attno <= 0 || attno > metapg->nattributes))
+ elog(ERROR, "invalid attribute number %d (table \"%s\" has only %d attributes)",
+ attno, RelationGetRelationName(rel), metapg->nattributes);
+
+ metapg->tree_root_dir[attno].root = rootblk;
+
+ MarkBufferDirty(metabuf);
+}
diff --git a/src/backend/access/zedstore/zedstore_tidpage.c b/src/backend/access/zedstore/zedstore_tidpage.c
new file mode 100644
index 0000000000..cffc5f2a75
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_tidpage.c
@@ -0,0 +1,1774 @@
+/*
+ * zedstore_tidpage.c
+ * Routines for handling the TID tree.
+ *
+ * A Zedstore table consists of multiple B-trees, one for each attribute. The
+ * functions in this file deal with one B-tree at a time, it is the caller's
+ * responsibility to tie together the scans of each btree.
+ *
+ * Operations:
+ *
+ * - Sequential scan in TID order
+ * - must be efficient with scanning multiple trees in sync
+ *
+ * - random lookups, by TID (for index scan)
+ *
+ * - range scans by TID (for bitmap index scan)
+ *
+ * NOTES:
+ * - Locking order: child before parent, left before right
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tidpage.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_compression.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "storage/bufmgr.h"
+#include "storage/predicate.h"
+#include "utils/rel.h"
+
+/* prototypes for local functions */
+static void zsbt_tid_recompress_replace(Relation rel, Buffer oldbuf, List *items);
+static ZSSingleBtreeItem *zsbt_tid_fetch(Relation rel,
+ ZSUndoRecPtr *recent_oldest_undo, zstid tid, Buffer *buf_p);
+static void zsbt_tid_replace_item(Relation rel, Buffer buf,
+ zstid oldtid, ZSBtreeItem *replacementitem,
+ List *newitems);
+static ZSBtreeItem *zsbt_tid_create_item(zstid tid, ZSUndoRecPtr undo_ptr, int nelements);
+
+static TM_Result zsbt_tid_update_lock_old(Relation rel, zstid otid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd, ZSUndoRecPtr *prevundoptr_p);
+static void zsbt_tid_update_insert_new(Relation rel, zstid *newtid,
+ TransactionId xid, CommandId cid, ZSUndoRecPtr prevundoptr);
+static void zsbt_tid_mark_old_updated(Relation rel, zstid otid, zstid newtid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot);
+
+/* ----------------------------------------------------------------
+ * Public interface
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Begin a scan of the btree.
+ */
+void
+zsbt_tid_begin_scan(Relation rel, zstid starttid,
+ zstid endtid, Snapshot snapshot, ZSBtreeScan *scan)
+{
+ Buffer buf;
+
+ scan->rel = rel;
+ scan->attno = ZS_META_ATTRIBUTE_NUM;
+ scan->tupledesc = NULL;
+
+ scan->snapshot = snapshot;
+ scan->context = CurrentMemoryContext;
+ scan->lastoff = InvalidOffsetNumber;
+ scan->has_decompressed = false;
+ scan->nexttid = starttid;
+ scan->endtid = endtid;
+ memset(&scan->recent_oldest_undo, 0, sizeof(scan->recent_oldest_undo));
+ memset(&scan->array_undoptr, 0, sizeof(scan->array_undoptr));
+ scan->array_datums = palloc(sizeof(Datum));
+ scan->array_datums_allocated_size = 1;
+ scan->array_elements_left = 0;
+
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, starttid, 0, true);
+ if (!BufferIsValid(buf))
+ {
+ /* completely empty tree */
+ scan->active = false;
+ scan->lastbuf = InvalidBuffer;
+ return;
+ }
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ scan->active = true;
+ scan->lastbuf = buf;
+
+ zs_decompress_init(&scan->decompressor);
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+}
+
+/*
+ * Reset the 'next' TID in a scan to the given TID.
+ */
+void
+zsbt_tid_reset_scan(ZSBtreeScan *scan, zstid starttid)
+{
+ if (starttid < scan->nexttid)
+ {
+ /* have to restart from scratch. */
+ scan->array_elements_left = 0;
+ scan->nexttid = starttid;
+ scan->has_decompressed = false;
+ if (scan->lastbuf != InvalidBuffer)
+ ReleaseBuffer(scan->lastbuf);
+ scan->lastbuf = InvalidBuffer;
+ }
+ else
+ zsbt_scan_skip(scan, starttid);
+}
+
+void
+zsbt_tid_end_scan(ZSBtreeScan *scan)
+{
+ if (!scan->active)
+ return;
+
+ if (scan->lastbuf != InvalidBuffer)
+ ReleaseBuffer(scan->lastbuf);
+ zs_decompress_free(&scan->decompressor);
+
+ scan->active = false;
+ scan->array_elements_left = 0;
+}
+
+/*
+ * Helper function of zsbt_scan_next(), to extract Datums from the given
+ * array item into the scan->array_* fields.
+ */
+static void
+zsbt_tid_scan_extract_array(ZSBtreeScan *scan, ZSArrayBtreeItem *aitem)
+{
+ int nelements = aitem->t_nelements;
+ zstid tid = aitem->t_tid;
+
+ /* skip over elements that we are not interested in */
+ while (tid < scan->nexttid && nelements > 0)
+ {
+ tid++;
+ nelements--;
+ }
+
+ /* leave out elements that are past end of range */
+ if (tid + nelements > scan->endtid)
+ nelements = scan->endtid - tid;
+
+ scan->array_undoptr = aitem->t_undo_ptr;
+ scan->array_elements_left = nelements;
+}
+
+/*
+ * Advance scan to next item.
+ *
+ * Return true if there was another item. The Datum/isnull of the item is
+ * placed in scan->array_* fields. For a pass-by-ref datum, it's a palloc'd
+ * copy that's valid until the next call.
+ *
+ * This is normally not used directly. See zsbt_scan_next_tid() and
+ * zsbt_scan_next_fetch() wrappers, instead.
+ */
+zstid
+zsbt_tid_scan_next(ZSBtreeScan *scan)
+{
+ Buffer buf;
+ bool buf_is_locked = false;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ BlockNumber next;
+ bool visible;
+
+ if (!scan->active)
+ return InvalidZSTid;
+
+ /*
+ * Process items, until we find something that is visible to the snapshot.
+ *
+ * This advances scan->nexttid as it goes.
+ */
+ while (scan->nexttid < scan->endtid)
+ {
+ /*
+ * If we are still processing an array item, return next element from it.
+ */
+ if (scan->array_elements_left > 0)
+ goto have_array;
+
+ /*
+ * If we are still processing a compressed item, process the next item
+ * from the it. If it's an array item, we start iterating the array by
+ * setting the scan->array_* fields, and loop back to top to return the
+ * first element from the array.
+ */
+ if (scan->has_decompressed)
+ {
+ zstid lasttid;
+ ZSBtreeItem *uitem;
+ TransactionId obsoleting_xid;
+
+ uitem = zs_decompress_read_item(&scan->decompressor);
+
+ if (uitem == NULL)
+ {
+ scan->has_decompressed = false;
+ continue;
+ }
+
+ /* a compressed item cannot contain nested compressed items */
+ Assert((uitem->t_flags & ZSBT_COMPRESSED) == 0);
+
+ lasttid = zsbt_item_lasttid(uitem);
+ if (lasttid < scan->nexttid)
+ continue;
+
+ if (uitem->t_tid >= scan->endtid)
+ break;
+
+ visible = zs_SatisfiesVisibility(scan, uitem, &obsoleting_xid, NULL);
+
+ if (scan->serializable && TransactionIdIsValid(obsoleting_xid))
+ CheckForSerializableConflictOut(scan->rel, obsoleting_xid, scan->snapshot);
+
+ if (!visible)
+ {
+ scan->nexttid = lasttid + 1;
+ continue;
+ }
+ if ((uitem->t_flags & ZSBT_ARRAY) != 0)
+ {
+ /* no need to make a copy, because the uncompressed buffer
+ * is already a copy */
+ ZSArrayBtreeItem *aitem = (ZSArrayBtreeItem *) uitem;
+
+ zsbt_tid_scan_extract_array(scan, aitem);
+ continue;
+ }
+ else
+ {
+ /* single item */
+ ZSSingleBtreeItem *sitem = (ZSSingleBtreeItem *) uitem;
+
+ scan->nexttid = sitem->t_tid;
+ scan->array_undoptr = sitem->t_undo_ptr;
+ scan->array_elements_left = 1;
+
+ if (buf_is_locked)
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ buf_is_locked = false;
+ goto have_array;
+ }
+ }
+
+ /*
+ * Scan the page for the next item.
+ */
+ buf = scan->lastbuf;
+ if (!buf_is_locked)
+ {
+ if (BufferIsValid(buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ buf_is_locked = true;
+
+ /*
+ * It's possible that the page was concurrently split or recycled by
+ * another backend (or ourselves). Have to re-check that the page is
+ * still valid.
+ */
+ if (!zsbt_page_is_expected(scan->rel, scan->attno, scan->nexttid, 0, buf))
+ {
+ /*
+ * It's not valid for the TID we're looking for, but maybe it was the
+ * right page for the previous TID. In that case, we don't need to
+ * restart from the root, we can follow the right-link instead.
+ */
+ if (zsbt_page_is_expected(scan->rel, scan->attno, scan->nexttid - 1, 0, buf))
+ {
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ next = opaque->zs_next;
+ if (next != InvalidBlockNumber)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ buf_is_locked = false;
+ buf = ReleaseAndReadBuffer(buf, scan->rel, next);
+ scan->lastbuf = buf;
+ continue;
+ }
+ }
+
+ UnlockReleaseBuffer(buf);
+ buf_is_locked = false;
+ buf = scan->lastbuf = InvalidBuffer;
+ }
+ }
+
+ if (!BufferIsValid(buf))
+ {
+ buf = scan->lastbuf = zsbt_descend(scan->rel, scan->attno, scan->nexttid, 0, true);
+ buf_is_locked = true;
+ }
+ }
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ Assert(opaque->zs_page_id == ZS_BTREE_PAGE_ID);
+
+ /* TODO: check the last offset first, as an optimization */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSBtreeItem *item = (ZSBtreeItem *) PageGetItem(page, iid);
+ zstid lasttid;
+
+ lasttid = zsbt_item_lasttid(item);
+
+ if (scan->nexttid > lasttid)
+ continue;
+
+ if (item->t_tid >= scan->endtid)
+ {
+ scan->nexttid = scan->endtid;
+ break;
+ }
+
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ ZSCompressedBtreeItem *citem = (ZSCompressedBtreeItem *) item;
+ MemoryContext oldcxt = MemoryContextSwitchTo(scan->context);
+
+ zs_decompress_chunk(&scan->decompressor, citem);
+ MemoryContextSwitchTo(oldcxt);
+ scan->has_decompressed = true;
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ buf_is_locked = false;
+ break;
+ }
+ else
+ {
+ TransactionId obsoleting_xid;
+
+ visible = zs_SatisfiesVisibility(scan, item, &obsoleting_xid, NULL);
+
+ if (!visible)
+ {
+ if (scan->serializable && TransactionIdIsValid(obsoleting_xid))
+ CheckForSerializableConflictOut(scan->rel, obsoleting_xid, scan->snapshot);
+ scan->nexttid = lasttid + 1;
+ continue;
+ }
+
+ if ((item->t_flags & ZSBT_ARRAY) != 0)
+ {
+ /* copy the item, because we can't hold a lock on the page */
+ ZSArrayBtreeItem *aitem;
+
+ aitem = MemoryContextAlloc(scan->context, item->t_size);
+ memcpy(aitem, item, item->t_size);
+
+ zsbt_tid_scan_extract_array(scan, aitem);
+
+ if (scan->array_elements_left > 0)
+ {
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ buf_is_locked = false;
+ break;
+ }
+ }
+ else
+ {
+ /* single item */
+ ZSSingleBtreeItem *sitem = (ZSSingleBtreeItem *) item;
+
+ scan->nexttid = sitem->t_tid;
+ scan->array_undoptr = sitem->t_undo_ptr;
+ scan->array_elements_left = 1;
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ buf_is_locked = false;
+ goto have_array;
+ }
+ }
+ }
+
+ if (scan->array_elements_left > 0 || scan->has_decompressed)
+ continue;
+
+ /* No more items on this page. Walk right, if possible */
+ next = opaque->zs_next;
+ if (next == BufferGetBlockNumber(buf))
+ elog(ERROR, "btree page %u next-pointer points to itself", next);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ buf_is_locked = false;
+
+ if (next == InvalidBlockNumber || scan->nexttid >= scan->endtid)
+ {
+ scan->active = false;
+ scan->array_elements_left = 0;
+ ReleaseBuffer(scan->lastbuf);
+ scan->lastbuf = InvalidBuffer;
+ break;
+ }
+
+ scan->lastbuf = ReleaseAndReadBuffer(scan->lastbuf, scan->rel, next);
+ }
+
+ return InvalidZSTid;
+
+have_array:
+ /*
+ * If we are still processing an array item, return next element from it.
+ */
+ Assert(scan->array_elements_left > 0);
+
+ scan->array_elements_left--;
+ return scan->nexttid++;
+}
+
+/*
+ * Get the last tid (plus one) in the tree.
+ */
+zstid
+zsbt_get_last_tid(Relation rel)
+{
+ zstid rightmostkey;
+ zstid tid;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+
+ /* Find the rightmost leaf */
+ rightmostkey = MaxZSTid;
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, rightmostkey, 0, true);
+ if (!BufferIsValid(buf))
+ {
+ return MinZSTid;
+ }
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /*
+ * Look at the last item, for its tid.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ if (maxoff >= FirstOffsetNumber)
+ {
+ ItemId iid = PageGetItemId(page, maxoff);
+ ZSBtreeItem *hitup = (ZSBtreeItem *) PageGetItem(page, iid);
+
+ tid = zsbt_item_lasttid(hitup) + 1;
+ }
+ else
+ {
+ tid = opaque->zs_lokey;
+ }
+ UnlockReleaseBuffer(buf);
+
+ return tid;
+}
+
+/*
+ * Insert a multiple TIDs.
+ *
+ * Populates the TIDs of the new tuples.
+ *
+ * If 'tid' in list is valid, then that TID is used. It better not be in use already. If
+ * it's invalid, then a new TID is allocated, as we see best. (When inserting the
+ * first column of the row, pass invalid, and for other columns, pass the TID
+ * you got for the first column.)
+ */
+void
+zsbt_tid_multi_insert(Relation rel, zstid *tids, int nitems,
+ TransactionId xid, CommandId cid, uint32 speculative_token, ZSUndoRecPtr prevundoptr)
+{
+ bool assign_tids;
+ zstid tid = tids[0];
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+ zstid insert_target_key;
+ ZSUndoRec_Insert undorec;
+ int i;
+ List *newitems;
+ ZSUndoRecPtr undorecptr;
+
+ /*
+ * If TID was given, find the right place for it. Otherwise, insert to
+ * the rightmost leaf.
+ *
+ * TODO: use a Free Space Map to find suitable target.
+ */
+ assign_tids = (tid == InvalidZSTid);
+
+ if (!assign_tids)
+ insert_target_key = tid;
+ else
+ insert_target_key = MaxZSTid;
+
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, insert_target_key, 0, false);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Look at the last item, for its tid.
+ *
+ * assign TIDS for each item, if needed.
+ */
+ if (assign_tids)
+ {
+ zstid lasttid;
+
+ if (maxoff >= FirstOffsetNumber)
+ {
+ ItemId iid = PageGetItemId(page, maxoff);
+ ZSBtreeItem *hitup = (ZSBtreeItem *) PageGetItem(page, iid);
+
+ lasttid = zsbt_item_lasttid(hitup);
+ tid = lasttid + 1;
+ }
+ else
+ {
+ lasttid = opaque->zs_lokey;
+ tid = lasttid;
+ }
+
+ for (i = 0; i < nitems; i++)
+ {
+ tids[i] = tid;
+ tid++;
+ }
+ }
+
+ /* Form an undo record */
+ if (xid != FrozenTransactionId)
+ {
+ undorec.rec.size = sizeof(ZSUndoRec_Insert);
+ undorec.rec.type = ZSUNDO_TYPE_INSERT;
+ undorec.rec.xid = xid;
+ undorec.rec.cid = cid;
+ undorec.rec.tid = tids[0];
+ undorec.rec.speculative_token = speculative_token;
+ undorec.rec.prevundorec = prevundoptr;
+ undorec.endtid = tids[nitems - 1];
+
+ undorecptr = zsundo_insert(rel, &undorec.rec);
+ }
+ else
+ {
+ ZSUndoRecPtrInitialize(&undorecptr);
+ }
+
+ /* Create items to insert. */
+ newitems = NIL;
+ i = 0;
+ while (i < nitems)
+ {
+ int j;
+ ZSBtreeItem *newitem;
+
+ /*
+ * Try to collapse as many items as possible into an Array item.
+ * The first item in the array is now at tids[i]/datums[i]/isnulls[i].
+ * Items can be stored in the same array as long as the TIDs are
+ * consecutive, they all have the same isnull flag, and the array
+ * isn't too large to be stored on a single leaf page. Scan the
+ * arrays, checking those conditions.
+ *
+ * FIXME: this math is bogus for TIDs
+ */
+ for (j = i + 1; j < nitems; j++)
+ {
+ if (tids[j] != tids[j - 1] + 1)
+ break;
+ }
+
+ /*
+ * 'i' is now the first entry to store in the array, and 'j' is the
+ * last + 1 elemnt to store. If j == i + 1, then there is only one
+ * element and zsbt_create_item() will create a 'single' item rather
+ * than an array.
+ */
+ newitem = zsbt_tid_create_item(tids[i], undorecptr, j - i);
+
+ newitems = lappend(newitems, newitem);
+ i = j;
+ }
+
+ /* recompress and possibly split the page */
+ zsbt_tid_replace_item(rel, buf,
+ InvalidZSTid, NULL,
+ newitems);
+ /* zsbt_replace_item unlocked 'buf' */
+ ReleaseBuffer(buf);
+}
+
+TM_Result
+zsbt_tid_delete(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ ZSSingleBtreeItem *item;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ ZSUndoRecPtr undorecptr;
+ ZSSingleBtreeItem *deleteditem;
+ Buffer buf;
+ zstid next_tid;
+
+ /* Find the item to delete. (It could be compressed) */
+ item = zsbt_tid_fetch(rel, &recent_oldest_undo, tid, &buf);
+ if (item == NULL)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find tuple to delete with TID (%u, %u) in TID tree",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ }
+
+ if (snapshot)
+ {
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ (ZSBtreeItem *) item, LockTupleExclusive,
+ &keep_old_undo_ptr, hufd, &next_tid);
+ if (result != TM_Ok)
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return result;
+ }
+
+ if (crosscheck != InvalidSnapshot && result == TM_Ok)
+ {
+ /* Perform additional check for transaction-snapshot mode RI updates */
+			/* FIXME: dummy scan */
+ ZSBtreeScan scan;
+ TransactionId obsoleting_xid;
+
+ memset(&scan, 0, sizeof(scan));
+ scan.rel = rel;
+ scan.snapshot = crosscheck;
+ scan.recent_oldest_undo = recent_oldest_undo;
+
+ if (!zs_SatisfiesVisibility(&scan, (ZSBtreeItem *) item, &obsoleting_xid, NULL))
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ result = TM_Updated;
+ }
+ }
+ }
+
+ /* Create UNDO record. */
+ {
+ ZSUndoRec_Delete undorec;
+
+ undorec.rec.size = sizeof(ZSUndoRec_Delete);
+ undorec.rec.type = ZSUNDO_TYPE_DELETE;
+ undorec.rec.xid = xid;
+ undorec.rec.cid = cid;
+ undorec.rec.tid = tid;
+ undorec.changedPart = changingPart;
+
+ if (keep_old_undo_ptr)
+ undorec.rec.prevundorec = item->t_undo_ptr;
+ else
+ ZSUndoRecPtrInitialize(&undorec.rec.prevundorec);
+
+ undorecptr = zsundo_insert(rel, &undorec.rec);
+ }
+
+	/* Replace the ZSBtreeItem with one with the new UNDO pointer. */
+ deleteditem = palloc(item->t_size);
+ memcpy(deleteditem, item, item->t_size);
+ deleteditem->t_undo_ptr = undorecptr;
+
+ zsbt_tid_replace_item(rel, buf,
+ item->t_tid, (ZSBtreeItem *) deleteditem,
+ NIL);
+ ReleaseBuffer(buf); /* zsbt_replace_item unlocked */
+
+ pfree(deleteditem);
+
+ return TM_Ok;
+}
+
+void
+zsbt_find_latest_tid(Relation rel, zstid *tid, Snapshot snapshot)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ ZSSingleBtreeItem *item;
+ Buffer buf;
+	/* Using just the meta attribute, we can follow the update chain */
+ zstid curr_tid = *tid;
+
+ for(;;)
+ {
+ zstid next_tid = InvalidZSTid;
+ if (curr_tid == InvalidZSTid)
+ break;
+
+ /* Find the item */
+ item = zsbt_tid_fetch(rel, &recent_oldest_undo, curr_tid, &buf);
+ if (item == NULL)
+ break;
+
+ if (snapshot)
+ {
+			/* FIXME: dummy scan */
+ ZSBtreeScan scan;
+ TransactionId obsoleting_xid;
+
+ memset(&scan, 0, sizeof(scan));
+ scan.rel = rel;
+ scan.snapshot = snapshot;
+ scan.recent_oldest_undo = recent_oldest_undo;
+
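+			/*
+			 * If this version is visible to us, remember it. Then follow the
+			 * update chain (next_tid) to see if there is a still-newer
+			 * version.
+			 */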
+ if (zs_SatisfiesVisibility(&scan, (ZSBtreeItem *) item,
+ &obsoleting_xid, &next_tid))
+ {
+ *tid = curr_tid;
+ }
+
+ curr_tid = next_tid;
+ UnlockReleaseBuffer(buf);
+ }
+ }
+}
+
+/*
+ * A new TID is allocated, as we see best, and returned to the caller. This
+ * function is only called for the META attribute btree. Data columns will use
+ * the returned TID to insert new items.
+ */
+TM_Result
+zsbt_tid_update(Relation rel, zstid otid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd,
+ zstid *newtid_p)
+{
+ TM_Result result;
+ ZSUndoRecPtr prevundoptr;
+
+ /*
+ * This is currently only used on the meta-attribute. The other attributes
+ * don't need to carry visibility information, so the caller just inserts
+ * the new values with (multi_)insert() instead. This will change once we
+ * start doing the equivalent of HOT updates, where the TID doesn't change.
+ */
+ Assert(*newtid_p == InvalidZSTid);
+
+ /*
+ * Find and lock the old item.
+ *
+ * TODO: If there's free TID space left on the same page, we should keep the
+ * buffer locked, and use the same page for the new tuple.
+ */
+ result = zsbt_tid_update_lock_old(rel, otid,
+ xid, cid, key_update, snapshot,
+ crosscheck, wait, hufd, &prevundoptr);
+
+ if (result != TM_Ok)
+ return result;
+
+ /* insert new version */
+ zsbt_tid_update_insert_new(rel, newtid_p, xid, cid, prevundoptr);
+
+ /* update the old item with the "t_ctid pointer" for the new item */
+ zsbt_tid_mark_old_updated(rel, otid, *newtid_p, xid, cid, key_update, snapshot);
+
+ return TM_Ok;
+}
+
+/*
+ * Subroutine of zsbt_update(): locks the old item for update.
+ */
+static TM_Result
+zsbt_tid_update_lock_old(Relation rel, zstid otid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd, ZSUndoRecPtr *prevundoptr_p)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ Buffer buf;
+ ZSSingleBtreeItem *olditem;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ zstid next_tid;
+
+ /*
+ * Find the item to delete.
+ */
+ olditem = zsbt_tid_fetch(rel, &recent_oldest_undo, otid, &buf);
+ if (olditem == NULL)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find old tuple to update with TID (%u, %u) in TID tree",
+ ZSTidGetBlockNumber(otid), ZSTidGetOffsetNumber(otid));
+ }
+ *prevundoptr_p = olditem->t_undo_ptr;
+
+ /*
+ * Is it visible to us?
+ */
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ (ZSBtreeItem *) olditem,
+ key_update ? LockTupleExclusive : LockTupleNoKeyExclusive,
+ &keep_old_undo_ptr, hufd, &next_tid);
+ if (result != TM_Ok)
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return result;
+ }
+
+ if (crosscheck != InvalidSnapshot && result == TM_Ok)
+ {
+ /* Perform additional check for transaction-snapshot mode RI updates */
+		/* FIXME: dummy scan */
+ ZSBtreeScan scan;
+ TransactionId obsoleting_xid;
+
+ memset(&scan, 0, sizeof(scan));
+ scan.rel = rel;
+ scan.snapshot = crosscheck;
+ scan.recent_oldest_undo = recent_oldest_undo;
+
+ if (!zs_SatisfiesVisibility(&scan, (ZSBtreeItem *) olditem, &obsoleting_xid, NULL))
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ result = TM_Updated;
+ }
+ }
+
+ /*
+ * TODO: tuple-locking not implemented. Pray that there is no competing
+ * concurrent update!
+ */
+
+ UnlockReleaseBuffer(buf);
+
+ return TM_Ok;
+}
+
+/*
+ * Subroutine of zsbt_update(): inserts the new, updated, item.
+ */
+static void
+zsbt_tid_update_insert_new(Relation rel,
+ zstid *newtid,
+ TransactionId xid, CommandId cid, ZSUndoRecPtr prevundoptr)
+{
+ zsbt_tid_multi_insert(rel, newtid, 1, xid, cid, INVALID_SPECULATIVE_TOKEN, prevundoptr);
+}
+
+/*
+ * Subroutine of zsbt_update(): mark old item as updated.
+ */
+static void
+zsbt_tid_mark_old_updated(Relation rel, zstid otid, zstid newtid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ Buffer buf;
+ ZSSingleBtreeItem *olditem;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ TM_FailureData tmfd;
+ ZSUndoRecPtr undorecptr;
+ ZSSingleBtreeItem *deleteditem;
+ zstid next_tid;
+
+ /*
+	 * Find the item to delete. It could be part of a compressed item;
+	 * we let zsbt_tid_fetch() handle that.
+ */
+ olditem = zsbt_tid_fetch(rel, &recent_oldest_undo, otid, &buf);
+ if (olditem == NULL)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find old tuple to update with TID (%u, %u) in TID tree",
+ ZSTidGetBlockNumber(otid), ZSTidGetOffsetNumber(otid));
+ }
+
+ /*
+ * Is it visible to us?
+ */
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ (ZSBtreeItem *) olditem,
+ key_update ? LockTupleExclusive : LockTupleNoKeyExclusive,
+ &keep_old_undo_ptr, &tmfd, &next_tid);
+ if (result != TM_Ok)
+ {
+ UnlockReleaseBuffer(buf);
+ elog(ERROR, "tuple concurrently updated - not implemented");
+ }
+
+ /* Create UNDO record. */
+ {
+ ZSUndoRec_Update undorec;
+
+ undorec.rec.size = sizeof(ZSUndoRec_Update);
+ undorec.rec.type = ZSUNDO_TYPE_UPDATE;
+ undorec.rec.xid = xid;
+ undorec.rec.cid = cid;
+ undorec.rec.tid = otid;
+ if (keep_old_undo_ptr)
+ undorec.rec.prevundorec = olditem->t_undo_ptr;
+ else
+ ZSUndoRecPtrInitialize(&undorec.rec.prevundorec);
+ undorec.newtid = newtid;
+ undorec.key_update = key_update;
+
+ undorecptr = zsundo_insert(rel, &undorec.rec);
+ }
+
+	/* Replace the ZSBtreeItem with one with the updated undo pointer. */
+ deleteditem = palloc(olditem->t_size);
+ memcpy(deleteditem, olditem, olditem->t_size);
+ deleteditem->t_undo_ptr = undorecptr;
+
+ zsbt_tid_replace_item(rel, buf,
+ otid, (ZSBtreeItem *) deleteditem,
+ NIL);
+	ReleaseBuffer(buf);		/* zsbt_tid_replace_item unlocked 'buf' */
+
+ pfree(deleteditem);
+}
+
+TM_Result
+zsbt_tid_lock(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ LockTupleMode mode, Snapshot snapshot,
+ TM_FailureData *hufd, zstid *next_tid)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ Buffer buf;
+ ZSSingleBtreeItem *item;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ ZSUndoRecPtr undorecptr;
+ ZSSingleBtreeItem *newitem;
+
+ *next_tid = tid;
+
+ /* Find the item to delete. (It could be compressed) */
+ item = zsbt_tid_fetch(rel, &recent_oldest_undo, tid, &buf);
+ if (item == NULL)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find tuple to lock with TID (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ }
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ (ZSBtreeItem *) item, mode,
+ &keep_old_undo_ptr, hufd, next_tid);
+ if (result != TM_Ok)
+ {
+ UnlockReleaseBuffer(buf);
+ return result;
+ }
+
+ /* Create UNDO record. */
+ {
+ ZSUndoRec_TupleLock undorec;
+
+ undorec.rec.size = sizeof(ZSUndoRec_TupleLock);
+ undorec.rec.type = ZSUNDO_TYPE_TUPLE_LOCK;
+ undorec.rec.xid = xid;
+ undorec.rec.cid = cid;
+ undorec.rec.tid = tid;
+ undorec.lockmode = mode;
+ if (keep_old_undo_ptr)
+ undorec.rec.prevundorec = item->t_undo_ptr;
+ else
+ ZSUndoRecPtrInitialize(&undorec.rec.prevundorec);
+
+ undorecptr = zsundo_insert(rel, &undorec.rec);
+ }
+
+ /* Replace the item with an identical one, but with updated undo pointer. */
+ newitem = palloc(item->t_size);
+ memcpy(newitem, item, item->t_size);
+ newitem->t_undo_ptr = undorecptr;
+
+ zsbt_tid_replace_item(rel, buf,
+ item->t_tid, (ZSBtreeItem *) newitem,
+ NIL);
+ ReleaseBuffer(buf); /* zsbt_replace_item unlocked */
+
+ pfree(newitem);
+
+ return TM_Ok;
+}
+
+/*
+ * Mark item with given TID as dead.
+ *
+ * This is used during VACUUM.
+ */
+void
+zsbt_tid_mark_dead(Relation rel, zstid tid, ZSUndoRecPtr undoptr)
+{
+ Buffer buf;
+ ZSSingleBtreeItem *item;
+ ZSSingleBtreeItem deaditem;
+
+ /* Find the item to delete. (It could be compressed) */
+ item = zsbt_tid_fetch(rel, NULL, tid, &buf);
+ if (item == NULL)
+ {
+ elog(WARNING, "could not find tuple to mark dead with TID (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ return;
+ }
+
+	/* Replace the ZSBtreeItem with a DEAD item. (Unless it's already dead) */
+ if ((item->t_flags & ZSBT_DEAD) != 0)
+ {
+ UnlockReleaseBuffer(buf);
+ return;
+ }
+
+ memset(&deaditem, 0, offsetof(ZSSingleBtreeItem, t_payload));
+ deaditem.t_tid = tid;
+ deaditem.t_size = sizeof(ZSSingleBtreeItem);
+ deaditem.t_flags = ZSBT_DEAD;
+ deaditem.t_undo_ptr = undoptr;
+
+ zsbt_tid_replace_item(rel, buf,
+ tid, (ZSBtreeItem *) &deaditem,
+ NIL);
+	ReleaseBuffer(buf);		/* zsbt_tid_replace_item unlocked 'buf' */
+}
+
+/*
+ * Clear an item's UNDO pointer.
+ *
+ * This is used during VACUUM, to clear out aborted deletions.
+ */
+void
+zsbt_tid_undo_deletion(Relation rel, zstid tid, ZSUndoRecPtr undoptr)
+{
+ Buffer buf;
+ ZSSingleBtreeItem *item;
+ ZSSingleBtreeItem *copy;
+
+ /* Find the item to delete. (It could be compressed) */
+ item = zsbt_tid_fetch(rel, NULL, tid, &buf);
+ if (item == NULL)
+ {
+ elog(WARNING, "could not find aborted tuple to remove with TID (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ return;
+ }
+
+ if (ZSUndoRecPtrEquals(item->t_undo_ptr, undoptr))
+ {
+ copy = palloc(item->t_size);
+ memcpy(copy, item, item->t_size);
+		ZSUndoRecPtrInitialize(&copy->t_undo_ptr);
+ zsbt_tid_replace_item(rel, buf,
+ tid, (ZSBtreeItem *) copy,
+ NIL);
+ ReleaseBuffer(buf); /* zsbt_replace_item unlocked */
+ }
+ else
+ {
+ Assert(item->t_undo_ptr.counter > undoptr.counter ||
+ !IsZSUndoRecPtrValid(&item->t_undo_ptr));
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+/* ----------------------------------------------------------------
+ * Internal routines
+ * ----------------------------------------------------------------
+ */
+
+void
+zsbt_tid_clear_speculative_token(Relation rel, zstid tid, uint32 spectoken, bool forcomplete)
+{
+ Buffer buf;
+ ZSSingleBtreeItem *item = NULL;
+ ZSUndoRecPtr recent_oldest_undo;
+
+ item = zsbt_tid_fetch(rel, &recent_oldest_undo, tid, &buf);
+
+ if (item == NULL)
+ elog(ERROR, "couldn't find item for meta column for inserted tuple with TID (%u, %u) in rel %s",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid), rel->rd_rel->relname.data);
+ Assert(item->t_tid == tid);
+
+ zsundo_clear_speculative_token(rel, item->t_undo_ptr);
+
+ UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Fetch the item with given TID. The page containing the item is kept locked, and
+ * returned to the caller in *buf_p. This is used to locate a tuple for updating
+ * or deleting it.
+ */
+static ZSSingleBtreeItem *
+zsbt_tid_fetch(Relation rel, ZSUndoRecPtr *recent_oldest_undo,
+ zstid tid, Buffer *buf_p)
+{
+ Buffer buf;
+ Page page;
+ ZSBtreeItem *item = NULL;
+ bool found = false;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, tid, 0, false);
+ if (buf == InvalidBuffer)
+ {
+ *buf_p = InvalidBuffer;
+ return NULL;
+ }
+ page = BufferGetPage(buf);
+
+ /* Find the item on the page that covers the target TID */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ item = (ZSBtreeItem *) PageGetItem(page, iid);
+
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ ZSCompressedBtreeItem *citem = (ZSCompressedBtreeItem *) item;
+ ZSDecompressContext decompressor;
+
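+			/*
+			 * This is a compressed container item. Decompress it, and scan
+			 * the contained items for one that covers the target TID.
+			 */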
+ zs_decompress_init(&decompressor);
+ zs_decompress_chunk(&decompressor, citem);
+
+ while ((item = zs_decompress_read_item(&decompressor)) != NULL)
+ {
+ zstid lasttid = zsbt_item_lasttid(item);
+
+ if (item->t_tid <= tid && lasttid >= tid)
+ {
+ found = true;
+ break;
+ }
+ }
+ if (found)
+ {
+ /* FIXME: decompressor is leaked. Can't free it yet, because we still
+ * need to access the item below
+ */
+ break;
+ }
+ zs_decompress_free(&decompressor);
+ }
+ else
+ {
+ zstid lasttid = zsbt_item_lasttid(item);
+
+ if (item->t_tid <= tid && lasttid >= tid)
+ {
+ found = true;
+ break;
+ }
+ }
+ }
+
+ if (found)
+ {
+ ZSSingleBtreeItem *result;
+
+ if ((item->t_flags & ZSBT_ARRAY) != 0)
+ {
+ ZSArrayBtreeItem *aitem = (ZSArrayBtreeItem *) item;
+ int resultsize;
+
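+			/*
+			 * The target TID is part of an array item. Construct a temporary
+			 * ZSSingleBtreeItem covering just the target TID to return to the
+			 * caller.
+			 */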
+ Assert((tid - aitem->t_tid) < aitem->t_nelements);
+
+ resultsize = offsetof(ZSSingleBtreeItem, t_payload);
+ result = palloc(resultsize);
+ memset(result, 0, offsetof(ZSSingleBtreeItem, t_payload)); /* zero padding */
+ result->t_tid = tid;
+ result->t_flags = item->t_flags & ~ZSBT_ARRAY;
+ result->t_size = resultsize;
+ result->t_undo_ptr = aitem->t_undo_ptr;
+ }
+ else
+ {
+ /* single item */
+ result = (ZSSingleBtreeItem *) item;
+ }
+
+ *buf_p = buf;
+ return result;
+ }
+ else
+ {
+ UnlockReleaseBuffer(buf);
+ *buf_p = InvalidBuffer;
+ return NULL;
+ }
+}
+
+/*
+ * Form a ZSBtreeItem for the given TID and number of elements, for insertion
+ * into the TID tree.
+ *
+ * If there's more than one element, an array item is created. Otherwise, a single
+ * item.
+ */
+static ZSBtreeItem *
+zsbt_tid_create_item(zstid tid, ZSUndoRecPtr undo_ptr,
+ int nelements)
+{
+ ZSBtreeItem *result;
+ Size itemsz;
+
+ Assert(nelements > 0);
+
+ if (nelements > 1)
+ {
+ ZSArrayBtreeItem *newitem;
+
+ itemsz = offsetof(ZSArrayBtreeItem, t_payload);
+
+ newitem = palloc(itemsz);
+ memset(newitem, 0, offsetof(ZSArrayBtreeItem, t_payload)); /* zero padding */
+ newitem->t_tid = tid;
+ newitem->t_size = itemsz;
+ newitem->t_flags = ZSBT_ARRAY;
+ newitem->t_nelements = nelements;
+ newitem->t_undo_ptr = undo_ptr;
+
+ result = (ZSBtreeItem *) newitem;
+ }
+ else
+ {
+ ZSSingleBtreeItem *newitem;
+
+ itemsz = offsetof(ZSSingleBtreeItem, t_payload);
+
+ newitem = palloc(itemsz);
+ memset(newitem, 0, offsetof(ZSSingleBtreeItem, t_payload)); /* zero padding */
+ newitem->t_tid = tid;
+ newitem->t_flags = 0;
+ newitem->t_size = itemsz;
+ newitem->t_undo_ptr = undo_ptr;
+
+ result = (ZSBtreeItem *) newitem;
+ }
+
+ return result;
+}
+
+/*
+ * This helper function is used to implement INSERT, UPDATE and DELETE.
+ *
+ * If 'olditem' is not NULL, then 'olditem' on the page is replaced with
+ * 'replacementitem'. 'replacementitem' can be NULL, to remove an old item.
+ *
+ * If 'newitems' is not empty, the items in the list are added to the page,
+ * to the correct position. FIXME: Actually, they're always just added to
+ * the end of the page, and that better be the correct position.
+ *
+ * This function handles decompressing and recompressing items, and splitting
+ * the page if needed.
+ */
+static void
+zsbt_tid_replace_item(Relation rel, Buffer buf,
+ zstid oldtid,
+ ZSBtreeItem *replacementitem,
+ List *newitems)
+{
+ Page page = BufferGetPage(buf);
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ List *items;
+ bool found_old_item = false;
+ /* We might need to decompress up to two previously compressed items */
+ ZSDecompressContext decompressor;
+ bool decompressor_used = false;
+ bool decompressing;
+
+ if (replacementitem)
+ Assert(replacementitem->t_tid == oldtid);
+
+ /*
+ * TODO: It would be good to have a fast path, for the common case that we're
+ * just adding items to the end.
+ */
+
+ /* Loop through all old items on the page */
+ items = NIL;
+ maxoff = PageGetMaxOffsetNumber(page);
+ decompressing = false;
+ off = 1;
+ for (;;)
+ {
+ ZSBtreeItem *item;
+
+ /*
+ * Get the next item to process. If we're decompressing, get the next
+ * tuple from the decompressor, otherwise get the next item from the page.
+ */
+ if (decompressing)
+ {
+ item = zs_decompress_read_item(&decompressor);
+ if (!item)
+ {
+ decompressing = false;
+ continue;
+ }
+ }
+ else if (off <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, off);
+
+ item = (ZSBtreeItem *) PageGetItem(page, iid);
+ off++;
+
+ }
+ else
+ {
+ /* out of items */
+ break;
+ }
+
+ /* we now have an item to process, either straight from the page or from
+ * the decompressor */
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ zstid item_lasttid = zsbt_item_lasttid(item);
+
+			/* there shouldn't be nested compressed items */
+ if (decompressing)
+ elog(ERROR, "nested compressed items on zedstore page not supported");
+
+ if (oldtid != InvalidZSTid && item->t_tid <= oldtid && oldtid <= item_lasttid)
+ {
+ ZSCompressedBtreeItem *citem = (ZSCompressedBtreeItem *) item;
+
+ /* Found it, this compressed item covers the target or the new TID. */
+ /* We have to decompress it, and recompress */
+ Assert(!decompressor_used);
+
+ zs_decompress_init(&decompressor);
+ zs_decompress_chunk(&decompressor, citem);
+ decompressor_used = true;
+ decompressing = true;
+ continue;
+ }
+ else
+ {
+ /* keep this compressed item as it is */
+ items = lappend(items, item);
+ }
+ }
+ else if ((item->t_flags & ZSBT_ARRAY) != 0)
+ {
+ /* array item */
+ ZSArrayBtreeItem *aitem = (ZSArrayBtreeItem *) item;
+ zstid item_lasttid = zsbt_item_lasttid(item);
+
+ if (oldtid != InvalidZSTid && item->t_tid <= oldtid && oldtid <= item_lasttid)
+ {
+ /*
+ * The target TID is currently part of an array item. We have to split
+ * the array item into two, and put the replacement item in the middle.
+ */
+ int cutoff;
+ int nelements = aitem->t_nelements;
+
+ cutoff = oldtid - item->t_tid;
+
+ /* Array slice before the target TID */
+ if (cutoff > 0)
+ {
+ ZSBtreeItem *item1;
+
+ item1 = zsbt_tid_create_item(aitem->t_tid, aitem->t_undo_ptr,
+ cutoff);
+ items = lappend(items, item1);
+ }
+
+ /*
+ * Skip over the target element, and store the replacement
+ * item, if any, in its place
+ */
+ if (replacementitem)
+ items = lappend(items, replacementitem);
+
+ /* Array slice after the target */
+ if (cutoff + 1 < nelements)
+ {
+ ZSBtreeItem *item2;
+
+ item2 = zsbt_tid_create_item(oldtid + 1, aitem->t_undo_ptr,
+ nelements - (cutoff + 1));
+ items = lappend(items, item2);
+ }
+
+ found_old_item = true;
+ }
+ else
+ items = lappend(items, item);
+ }
+ else
+ {
+ /* single item */
+ if (oldtid != InvalidZSTid && item->t_tid == oldtid)
+ {
+ Assert(!found_old_item);
+ found_old_item = true;
+ if (replacementitem)
+ items = lappend(items, replacementitem);
+ }
+ else
+ items = lappend(items, item);
+ }
+ }
+
+ if (oldtid != InvalidZSTid && !found_old_item)
+ elog(ERROR, "could not find old item to replace");
+
+ /* Add any new items to the end */
+ if (newitems)
+ items = list_concat(items, newitems);
+
+ /* Now pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ if (items)
+ {
+ zsbt_tid_recompress_replace(rel, buf, items);
+ }
+ else
+ {
+ zs_split_stack *stack;
+
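+		/*
+		 * All items were removed from the page. Try to unlink it from the
+		 * tree; if that's not possible, just write back an empty page.
+		 */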
+ stack = zsbt_unlink_page(rel, ZS_META_ATTRIBUTE_NUM, buf, 0);
+
+ if (!stack)
+ {
+ /* failed. */
+ Page newpage = PageGetTempPageCopySpecial(BufferGetPage(buf));
+
+ stack = zs_new_split_stack_entry(buf, newpage);
+ }
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack);
+ }
+
+ /*
+ * We can now free the decompression contexts. The pointers in the 'items' list
+ * point to decompression buffers, so we cannot free them until after writing out
+ * the pages.
+ */
+ if (decompressor_used)
+ zs_decompress_free(&decompressor);
+ list_free(items);
+}
+
+/*
+ * Recompressor routines
+ */
+typedef struct
+{
+ Page currpage;
+ ZSCompressContext compressor;
+ int compressed_items;
+
+ /* first page writes over the old buffer, subsequent pages get newly-allocated buffers */
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ int total_items;
+ int total_compressed_items;
+ int total_already_compressed_items;
+
+ zstid hikey;
+} zsbt_tid_recompress_context;
+
+static void
+zsbt_recompress_newpage(zsbt_tid_recompress_context *cxt, zstid nexttid, int flags)
+{
+ Page newpage;
+ ZSBtreePageOpaque *newopaque;
+ zs_split_stack *stack;
+
+ if (cxt->currpage)
+ {
+ /* set the last tid on previous page */
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(cxt->currpage);
+
+ oldopaque->zs_hikey = nexttid;
+ }
+
+ newpage = (Page) palloc(BLCKSZ);
+ PageInit(newpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+
+ stack = zs_new_split_stack_entry(InvalidBuffer, /* will be assigned later */
+ newpage);
+ if (cxt->stack_tail)
+ cxt->stack_tail->next = stack;
+ else
+ cxt->stack_head = stack;
+ cxt->stack_tail = stack;
+
+ cxt->currpage = newpage;
+
+ newopaque = ZSBtreePageGetOpaque(newpage);
+ newopaque->zs_attno = ZS_META_ATTRIBUTE_NUM;
+ newopaque->zs_next = InvalidBlockNumber; /* filled in later */
+ newopaque->zs_lokey = nexttid;
+ newopaque->zs_hikey = cxt->hikey; /* overwritten later, if this is not last page */
+ newopaque->zs_level = 0;
+ newopaque->zs_flags = flags;
+ newopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+}
+
+static void
+zsbt_recompress_add_to_page(zsbt_tid_recompress_context *cxt, ZSBtreeItem *item)
+{
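+	/* If the item doesn't fit on the current page, start a new page */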
+ if (PageGetFreeSpace(cxt->currpage) < MAXALIGN(item->t_size))
+ zsbt_recompress_newpage(cxt, item->t_tid, 0);
+
+ if (PageAddItemExtended(cxt->currpage,
+ (Item) item, item->t_size,
+ PageGetMaxOffsetNumber(cxt->currpage) + 1,
+ PAI_OVERWRITE) == InvalidOffsetNumber)
+ elog(ERROR, "could not add item to page while recompressing");
+
+ cxt->total_items++;
+}
+
+static bool
+zsbt_recompress_add_to_compressor(zsbt_tid_recompress_context *cxt, ZSBtreeItem *item)
+{
+ bool result;
+
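+	/*
+	 * On the first item, begin a new compression batch, sized to the current
+	 * page's free space.
+	 */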
+ if (cxt->compressed_items == 0)
+ zs_compress_begin(&cxt->compressor, PageGetFreeSpace(cxt->currpage));
+
+ result = zs_compress_add(&cxt->compressor, item);
+ if (result)
+ {
+ cxt->compressed_items++;
+
+ cxt->total_compressed_items++;
+ }
+
+ return result;
+}
+
+static void
+zsbt_recompress_flush(zsbt_tid_recompress_context *cxt)
+{
+ ZSCompressedBtreeItem *citem;
+
+ if (cxt->compressed_items == 0)
+ return;
+
+ citem = zs_compress_finish(&cxt->compressor);
+
+ if (citem)
+ zsbt_recompress_add_to_page(cxt, (ZSBtreeItem *) citem);
+ else
+ {
+ uint16 size = 0;
+ /*
+		 * Compression failed, so add the items uncompressed. We could note
+		 * that these items/patterns are not compressible and skip future
+		 * attempts to compress them, but it's possible that they would
+		 * compress well when clubbed with other, future items. So avoid
+		 * recording such info, and just try compression again later if needed.
+ */
+ for (int i = 0; i < cxt->compressor.nitems; i++)
+ {
+ citem = (ZSCompressedBtreeItem *) (cxt->compressor.uncompressedbuffer + size);
+ zsbt_recompress_add_to_page(cxt, (ZSBtreeItem *) citem);
+
+ size += MAXALIGN(citem->t_size);
+ }
+ }
+
+ cxt->compressed_items = 0;
+}
+
+/*
+ * Rewrite a leaf page, with given 'items' as the new content.
+ *
+ * If there are any uncompressed items in the list, we try to compress them.
+ * Any already-compressed items are added as is.
+ *
+ * If the items no longer fit on the page, then the page is split. It is
+ * entirely possible that they don't fit even on two pages; we split the page
+ * into as many pages as needed. Hopefully not more than a few pages, though,
+ * because otherwise you might hit limits on the number of buffer pins (with
+ * tiny shared_buffers).
+ *
+ * On entry, 'oldbuf' must be pinned and exclusive-locked. On exit, the lock
+ * is released, but it's still pinned.
+ *
+ * TODO: Try to combine single items, and existing array-items, into new array
+ * items.
+ */
+static void
+zsbt_tid_recompress_replace(Relation rel, Buffer oldbuf, List *items)
+{
+ ListCell *lc;
+ zsbt_tid_recompress_context cxt;
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(BufferGetPage(oldbuf));
+ ZSUndoRecPtr recent_oldest_undo = { 0 };
+ BlockNumber orignextblk;
+ zs_split_stack *stack;
+ List *downlinks = NIL;
+
+ orignextblk = oldopaque->zs_next;
+
+ cxt.currpage = NULL;
+ zs_compress_init(&cxt.compressor);
+ cxt.compressed_items = 0;
+ cxt.stack_head = cxt.stack_tail = NULL;
+ cxt.hikey = oldopaque->zs_hikey;
+
+ cxt.total_items = 0;
+ cxt.total_compressed_items = 0;
+ cxt.total_already_compressed_items = 0;
+
+ zsbt_recompress_newpage(&cxt, oldopaque->zs_lokey, (oldopaque->zs_flags & ZSBT_ROOT));
+
+ foreach(lc, items)
+ {
+ ZSBtreeItem *item = (ZSBtreeItem *) lfirst(lc);
+
+ /* We can leave out any old-enough DEAD items */
+ if ((item->t_flags & ZSBT_DEAD) != 0)
+ {
+ ZSBtreeItem *uitem = (ZSBtreeItem *) item;
+
+ if (recent_oldest_undo.counter == 0)
+ recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+
+ if (zsbt_item_undoptr(uitem).counter <= recent_oldest_undo.counter)
+ continue;
+ }
+
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ {
+ /* already compressed, add as it is. */
+ zsbt_recompress_flush(&cxt);
+ cxt.total_already_compressed_items++;
+ zsbt_recompress_add_to_page(&cxt, item);
+ }
+ else
+ {
+ /* try to add this item to the compressor */
+ if (!zsbt_recompress_add_to_compressor(&cxt, item))
+ {
+ if (cxt.compressed_items > 0)
+ {
+ /* flush, and retry */
+ zsbt_recompress_flush(&cxt);
+
+ if (!zsbt_recompress_add_to_compressor(&cxt, item))
+ {
+ /* could not compress, even on its own. Store it uncompressed, then */
+ zsbt_recompress_add_to_page(&cxt, item);
+ }
+ }
+ else
+ {
+ /* could not compress, even on its own. Store it uncompressed, then */
+ zsbt_recompress_add_to_page(&cxt, item);
+ }
+ }
+ }
+ }
+
+ /* flush the last one, if any */
+ zsbt_recompress_flush(&cxt);
+
+ zs_compress_free(&cxt.compressor);
+
+ /*
+	 * Ok, we now have a list of pages to replace the original page, as private
+	 * in-memory copies. Allocate buffers for them, and write them out.
+	 *
+	 * Allocate all the pages before entering the critical section, so that
+	 * running out of disk space doesn't lead to a PANIC.
+ */
+ stack = cxt.stack_head;
+ Assert(stack->buf == InvalidBuffer);
+ stack->buf = oldbuf;
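+
+	/*
+	 * Allocate a buffer for each subsequent page, chain the pages together
+	 * with their zs_next pointers, and collect downlinks for the parent level.
+	 */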
+ while (stack->next)
+ {
+ Page thispage = stack->page;
+ ZSBtreePageOpaque *thisopaque = ZSBtreePageGetOpaque(thispage);
+ ZSBtreeInternalPageItem *downlink;
+ Buffer nextbuf;
+
+ Assert(stack->next->buf == InvalidBuffer);
+
+ nextbuf = zspage_getnewbuf(rel, InvalidBuffer);
+ stack->next->buf = nextbuf;
+
+ thisopaque->zs_next = BufferGetBlockNumber(nextbuf);
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = thisopaque->zs_hikey;
+ downlink->childblk = BufferGetBlockNumber(nextbuf);
+ downlinks = lappend(downlinks, downlink);
+
+ stack = stack->next;
+ }
+ /* last one in the chain */
+ ZSBtreePageGetOpaque(stack->page)->zs_next = orignextblk;
+
+ /* If we had to split, insert downlinks for the new pages. */
+ if (cxt.stack_head->next)
+ {
+ oldopaque = ZSBtreePageGetOpaque(cxt.stack_head->page);
+
+ if ((oldopaque->zs_flags & ZSBT_ROOT) != 0)
+ {
+ ZSBtreeInternalPageItem *downlink;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = MinZSTid;
+ downlink->childblk = BufferGetBlockNumber(cxt.stack_head->buf);
+ downlinks = lcons(downlink, downlinks);
+
+ cxt.stack_tail->next = zsbt_newroot(rel, ZS_META_ATTRIBUTE_NUM,
+ oldopaque->zs_level + 1, downlinks);
+
+ /* clear the ZSBT_ROOT flag on the old root page */
+ oldopaque->zs_flags &= ~ZSBT_ROOT;
+ }
+ else
+ {
+ cxt.stack_tail->next = zsbt_insert_downlinks(rel, ZS_META_ATTRIBUTE_NUM,
+ oldopaque->zs_lokey, BufferGetBlockNumber(oldbuf), oldopaque->zs_level + 1,
+ downlinks);
+ }
+ /* note: stack_tail is not the real tail anymore */
+ }
+
+ /* Finally, overwrite all the pages we had to modify */
+ zs_apply_split_changes(rel, cxt.stack_head);
+}
diff --git a/src/backend/access/zedstore/zedstore_toast.c b/src/backend/access/zedstore/zedstore_toast.c
new file mode 100644
index 0000000000..8e25591b16
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_toast.c
@@ -0,0 +1,192 @@
+/*
+ * zedstore_toast.c
+ * Routines for Toasting oversized tuples in Zedstore
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_toast.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_compression.h"
+#include "access/zedstore_internal.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/rel.h"
+
+/*
+ * Toast a datum, inside the ZedStore file.
+ *
+ * This is similar to regular toasting, but instead of using a separate index and
+ * heap, the datum is stored within the same ZedStore file as all the btrees and
+ * stuff. A chain of "toast-pages" is allocated for the datum, and each page is filled
+ * with as much of the datum as possible.
+ *
+ *
+ * Note: You must call zedstore_toast_finish() after this,
+ * to set the TID in the toast-chain's first block. Otherwise, it's considered recyclable.
+ */
+Datum
+zedstore_toast_datum(Relation rel, AttrNumber attno, Datum value)
+{
+ varatt_zs_toastptr *toastptr;
+ BlockNumber firstblk = InvalidBlockNumber;
+ Buffer buf = InvalidBuffer;
+ Page page;
+ ZSToastPageOpaque *opaque;
+ Buffer prevbuf = InvalidBuffer;
+ ZSToastPageOpaque *prevopaque = NULL;
+ char *ptr;
+ int32 total_size;
+ int32 offset;
+
+ /* it's possible that this is the very first insertion to the relation. */
+ if (RelationGetNumberOfBlocks(rel) == 0)
+ zsmeta_initmetapage(rel);
+
+ /* TODO: try to compress it in place first. Maybe just call toast_compress_datum? */
+
+ /*
+ * If that doesn't reduce it enough, allocate a toast page
+ * for it.
+ */
+ ptr = VARDATA_ANY(value);
+ total_size = VARSIZE_ANY_EXHDR(value);
+ offset = 0;
+
+ while (total_size - offset > 0)
+ {
+ Size thisbytes;
+
+ buf = zspage_getnewbuf(rel, InvalidBuffer);
+ if (prevbuf == InvalidBuffer)
+ firstblk = BufferGetBlockNumber(buf);
+
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, sizeof(ZSToastPageOpaque));
+
+ thisbytes = Min(total_size - offset, PageGetExactFreeSpace(page));
+
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_attno = attno;
+ opaque->zs_tid = InvalidZSTid;
+ opaque->zs_total_size = total_size;
+ opaque->zs_slice_offset = offset;
+ opaque->zs_prev = BufferIsValid(prevbuf) ? BufferGetBlockNumber(prevbuf) : InvalidBlockNumber;
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_TOAST_PAGE_ID;
+
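+		/*
+		 * Copy as much of the datum as fits on this page, right after the
+		 * page header.
+		 */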
+ memcpy((char *) page + SizeOfPageHeaderData, ptr, thisbytes);
+ ((PageHeader) page)->pd_lower += thisbytes;
+ ptr += thisbytes;
+ offset += thisbytes;
+
+ if (prevbuf != InvalidBuffer)
+ {
+ prevopaque->zs_next = BufferGetBlockNumber(buf);
+ MarkBufferDirty(prevbuf);
+ }
+
+ /* TODO: WAL-log */
+ MarkBufferDirty(buf);
+
+ if (prevbuf != InvalidBuffer)
+ UnlockReleaseBuffer(prevbuf);
+ prevbuf = buf;
+ prevopaque = opaque;
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ toastptr = palloc0(sizeof(varatt_zs_toastptr));
+ SET_VARTAG_1B_E(toastptr, VARTAG_ZEDSTORE);
+ toastptr->zst_block = firstblk;
+
+ return PointerGetDatum(toastptr);
+}
+
+void
+zedstore_toast_finish(Relation rel, AttrNumber attno, Datum toasted, zstid tid)
+{
+ varatt_zs_toastptr *toastptr = (varatt_zs_toastptr *) DatumGetPointer(toasted);
+ Buffer buf;
+ Page page;
+ ZSToastPageOpaque *opaque;
+
+ Assert(toastptr->va_tag == VARTAG_ZEDSTORE);
+
+ buf = ReadBuffer(rel, toastptr->zst_block);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+
+ Assert(opaque->zs_tid == InvalidZSTid);
+ Assert(opaque->zs_attno == attno);
+ Assert(opaque->zs_prev == InvalidBlockNumber);
+
+ opaque->zs_tid = tid;
+
+ /* TODO: WAL-log */
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+}
+
+Datum
+zedstore_toast_flatten(Relation rel, AttrNumber attno, zstid tid, Datum toasted)
+{
+ varatt_zs_toastptr *toastptr = (varatt_zs_toastptr *) DatumGetPointer(toasted);
+ BlockNumber nextblk;
+ BlockNumber prevblk;
+ char *result = NULL;
+ char *ptr = NULL;
+ int32 total_size = 0;
+
+ Assert(toastptr->va_tag == VARTAG_ZEDSTORE);
+
+ prevblk = InvalidBlockNumber;
+ nextblk = toastptr->zst_block;
+
+ while (nextblk != InvalidBlockNumber)
+ {
+ Buffer buf;
+ Page page;
+ ZSToastPageOpaque *opaque;
+ uint32 size;
+
+ buf = ReadBuffer(rel, nextblk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+
+ Assert(opaque->zs_attno == attno);
+ Assert(opaque->zs_prev == prevblk);
+
+ if (prevblk == InvalidBlockNumber)
+ {
+ Assert(opaque->zs_tid == tid);
+
+ total_size = opaque->zs_total_size;
+
+ result = palloc(total_size + VARHDRSZ);
+ SET_VARSIZE(result, total_size + VARHDRSZ);
+ ptr = result + VARHDRSZ;
+ }
+
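+		/* Append this page's slice of the datum to the result buffer */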
+ size = ((PageHeader) page)->pd_lower - SizeOfPageHeaderData;
+ memcpy(ptr, (char *) page + SizeOfPageHeaderData, size);
+ ptr += size;
+
+ prevblk = nextblk;
+ nextblk = opaque->zs_next;
+ UnlockReleaseBuffer(buf);
+ }
+ Assert(total_size > 0);
+ Assert(ptr == result + total_size + VARHDRSZ);
+
+ return PointerGetDatum(result);
+}
diff --git a/src/backend/access/zedstore/zedstore_tupslot.c b/src/backend/access/zedstore/zedstore_tupslot.c
new file mode 100644
index 0000000000..8528287d51
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_tupslot.c
@@ -0,0 +1,348 @@
+/*
+ * zedstore_tupslot.c
+ * Implementation of a TupleTableSlot for zedstore.
+ *
+ * This implementation is identical to a Virtual tuple slot
+ * (TTSOpsVirtual), but it has a slot_getsysattr() implementation
+ * that can fetch and compute the 'xmin' for the tuple.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tupslot.c
+ */
+#include "postgres.h"
+
+#include "access/table.h"
+#include "access/zedstore_internal.h"
+#include "executor/tuptable.h"
+#include "utils/expandeddatum.h"
+
+const TupleTableSlotOps TTSOpsZedstore;
+
+
+typedef struct ZedstoreTupleTableSlot
+{
+ TupleTableSlot base;
+
+ char *data; /* data for materialized slots */
+} ZedstoreTupleTableSlot;
+
+
+static void
+tts_zedstore_init(TupleTableSlot *slot)
+{
+}
+
+static void
+tts_zedstore_release(TupleTableSlot *slot)
+{
+}
+
+static void
+tts_zedstore_clear(TupleTableSlot *slot)
+{
+ if (unlikely(TTS_SHOULDFREE(slot)))
+ {
+ ZedstoreTupleTableSlot *vslot = (ZedstoreTupleTableSlot *) slot;
+
+ pfree(vslot->data);
+ vslot->data = NULL;
+
+ slot->tts_flags &= ~TTS_FLAG_SHOULDFREE;
+ }
+
+ slot->tts_nvalid = 0;
+ slot->tts_flags |= TTS_FLAG_EMPTY;
+ ItemPointerSetInvalid(&slot->tts_tid);
+}
+
+/*
+ * Attribute values are readily available in the tts_values and tts_isnull
+ * arrays in a ZedstoreTupleTableSlot, so there should be no need to call
+ * tts_zedstore_getsomeattrs().
+ */
+static void
+tts_zedstore_getsomeattrs(TupleTableSlot *slot, int natts)
+{
+ elog(ERROR, "getsomeattrs is not required to be called on a zedstore tuple table slot");
+}
+
+static void
+zs_get_xmin_cmin(Relation rel, ZSUndoRecPtr recent_oldest_undo, zstid tid, ZSUndoRecPtr undo_ptr,
+ TransactionId *xmin, CommandId *cmin)
+{
+ TransactionId this_xmin;
+ CommandId this_cmin;
+ ZSUndoRec *undorec;
+
+ /*
+ * Follow the chain of UNDO records for this tuple, to find the
+ * transaction that originally inserted the row (xmin/cmin).
+ *
+ * XXX: this is similar logic to zs_cluster_process_tuple(). Can
+ * we merge it?
+ */
+ this_xmin = FrozenTransactionId;
+ this_cmin = InvalidCommandId;
+
+ for (;;)
+ {
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ {
+ /* This tuple version is visible to everyone. */
+ break;
+ }
+
+ /* Fetch the next UNDO record. */
+ undorec = zsundo_fetch(rel, undo_ptr);
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ this_xmin = undorec->xid;
+ this_cmin = undorec->cid;
+ break;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK ||
+ undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ }
+
+ *xmin = this_xmin;
+ *cmin = this_cmin;
+}
+
+/*
+ * We only support fetching 'xmin', currently. It's needed for referential
+ * integrity triggers (i.e. foreign keys).
+ */
+static Datum
+tts_zedstore_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
+{
+ if (attnum == MinTransactionIdAttributeNumber ||
+ attnum == MinCommandIdAttributeNumber)
+ {
+ zstid tid = ZSTidFromItemPointer(slot->tts_tid);
+ ZSBtreeScan btree_scan;
+ bool found;
+ Relation rel;
+ ZSUndoRecPtr recent_oldest_undo;
+ TransactionId xmin;
+ CommandId cmin;
+
+ /*
+ * We assume that the table OID and TID in the slot are set. We
+ * fetch the tuple from the table, and follow its UNDO chain to
+ * find the transaction that inserted it.
+ *
+ * XXX: This is very slow compared to e.g. the heap, where we
+ * always store the xmin in tuple itself. We should probably do
+ * the same in zedstore, and add extra fields in the slot to hold
+ * xmin/cmin and fill them in when we fetch the tuple and check its
+ * visibility for the first time.
+ */
+ if (!OidIsValid(slot->tts_tableOid))
+ elog(ERROR, "zedstore tuple table slot does not have a table oid");
+
+ /* assume the caller is already holding a suitable lock on the table */
+ rel = table_open(slot->tts_tableOid, NoLock);
+ recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+
+ /* Use the meta-data tree for the visibility information. */
+ zsbt_tid_begin_scan(rel, tid, tid + 1, SnapshotAny, &btree_scan);
+
+ found = zsbt_tid_scan_next(&btree_scan) != InvalidZSTid;
+ if (!found)
+ elog(ERROR, "could not find zedstore tuple (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+
+ zs_get_xmin_cmin(rel, recent_oldest_undo, tid, btree_scan.array_undoptr, &xmin, &cmin);
+
+ zsbt_tid_end_scan(&btree_scan);
+
+ table_close(rel, NoLock);
+
+ *isnull = false;
+ if (attnum == MinTransactionIdAttributeNumber)
+ return TransactionIdGetDatum(xmin);
+ else
+ {
+ Assert(attnum == MinCommandIdAttributeNumber);
+ return CommandIdGetDatum(cmin);
+ }
+ }
+ elog(ERROR, "zedstore tuple table slot does not have system attributes (except xmin and cmin)");
+
+ return 0; /* silence compiler warnings */
+}
+
+/*
+ * To materialize a zedstore slot all the datums that aren't passed by value
+ * have to be copied into the slot's memory context. To do so, compute the
+ * required size, and allocate enough memory to store all attributes. That's
+ * good for cache hit ratio, but more importantly requires only one memory
+ * allocation/deallocation.
+ */
+static void
+tts_zedstore_materialize(TupleTableSlot *slot)
+{
+ ZedstoreTupleTableSlot *vslot = (ZedstoreTupleTableSlot *) slot;
+ TupleDesc desc = slot->tts_tupleDescriptor;
+ Size sz = 0;
+ char *data;
+
+ /* already materialized */
+ if (TTS_SHOULDFREE(slot))
+ return;
+
+ /* compute size of memory required */
+ for (int natt = 0; natt < desc->natts; natt++)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, natt);
+ Datum val;
+
+ if (att->attbyval || slot->tts_isnull[natt])
+ continue;
+
+ val = slot->tts_values[natt];
+
+ if (att->attlen == -1 &&
+ VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val)))
+ {
+ /*
+ * We want to flatten the expanded value so that the materialized
+ * slot doesn't depend on it.
+ */
+ sz = att_align_nominal(sz, att->attalign);
+ sz += EOH_get_flat_size(DatumGetEOHP(val));
+ }
+ else
+ {
+ sz = att_align_nominal(sz, att->attalign);
+ sz = att_addlength_datum(sz, att->attlen, val);
+ }
+ }
+
+ /* all data is byval */
+ if (sz == 0)
+ return;
+
+ /* allocate memory */
+ vslot->data = data = MemoryContextAlloc(slot->tts_mcxt, sz);
+ slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+ /* and copy all attributes into the pre-allocated space */
+ for (int natt = 0; natt < desc->natts; natt++)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, natt);
+ Datum val;
+
+ if (att->attbyval || slot->tts_isnull[natt])
+ continue;
+
+ val = slot->tts_values[natt];
+
+ if (att->attlen == -1 &&
+ VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val)))
+ {
+ Size data_length;
+
+ /*
+ * We want to flatten the expanded value so that the materialized
+ * slot doesn't depend on it.
+ */
+ ExpandedObjectHeader *eoh = DatumGetEOHP(val);
+
+ data = (char *) att_align_nominal(data,
+ att->attalign);
+ data_length = EOH_get_flat_size(eoh);
+ EOH_flatten_into(eoh, data, data_length);
+
+ slot->tts_values[natt] = PointerGetDatum(data);
+ data += data_length;
+ }
+ else
+ {
+ Size data_length = 0;
+
+ data = (char *) att_align_nominal(data, att->attalign);
+ data_length = att_addlength_datum(data_length, att->attlen, val);
+
+ memcpy(data, DatumGetPointer(val), data_length);
+
+ slot->tts_values[natt] = PointerGetDatum(data);
+ data += data_length;
+ }
+ }
+}
+
+static void
+tts_zedstore_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
+{
+	TupleDesc	srcdesc = srcslot->tts_tupleDescriptor;
+
+ Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);
+
+ tts_zedstore_clear(dstslot);
+
+ slot_getallattrs(srcslot);
+
+ for (int natt = 0; natt < srcdesc->natts; natt++)
+ {
+ dstslot->tts_values[natt] = srcslot->tts_values[natt];
+ dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
+ }
+
+ dstslot->tts_nvalid = srcdesc->natts;
+ dstslot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ /* make sure storage doesn't depend on external memory */
+ tts_zedstore_materialize(dstslot);
+}
+
+static HeapTuple
+tts_zedstore_copy_heap_tuple(TupleTableSlot *slot)
+{
+ Assert(!TTS_EMPTY(slot));
+
+ return heap_form_tuple(slot->tts_tupleDescriptor,
+ slot->tts_values,
+ slot->tts_isnull);
+
+}
+
+static MinimalTuple
+tts_zedstore_copy_minimal_tuple(TupleTableSlot *slot)
+{
+ Assert(!TTS_EMPTY(slot));
+
+ return heap_form_minimal_tuple(slot->tts_tupleDescriptor,
+ slot->tts_values,
+ slot->tts_isnull);
+}
+
+
+const TupleTableSlotOps TTSOpsZedstore = {
+ .base_slot_size = sizeof(ZedstoreTupleTableSlot),
+ .init = tts_zedstore_init,
+ .release = tts_zedstore_release,
+ .clear = tts_zedstore_clear,
+ .getsomeattrs = tts_zedstore_getsomeattrs,
+ .getsysattr = tts_zedstore_getsysattr,
+ .materialize = tts_zedstore_materialize,
+ .copyslot = tts_zedstore_copyslot,
+
+ /*
+ * A zedstore tuple table slot can not "own" a heap tuple or a minimal
+ * tuple.
+ */
+ .get_heap_tuple = NULL,
+ .get_minimal_tuple = NULL,
+ .copy_heap_tuple = tts_zedstore_copy_heap_tuple,
+ .copy_minimal_tuple = tts_zedstore_copy_minimal_tuple
+};
diff --git a/src/backend/access/zedstore/zedstore_undo.c b/src/backend/access/zedstore/zedstore_undo.c
new file mode 100644
index 0000000000..0767307253
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_undo.c
@@ -0,0 +1,918 @@
+/*
+ * zedstore_undo.c
+ * Temporary UNDO-logging for zedstore.
+ *
+ * XXX: This is hopefully replaced with an upstream UNDO facility later.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_undo.c
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/multixact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "commands/progress.h"
+#include "commands/vacuum.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/pg_rusage.h"
+#include "utils/rel.h"
+#include "utils/lsyscache.h"
+
+/*
+ * Working area for zsundo_scan().
+ */
+typedef struct ZSUndoTrimStats
+{
+ /* List of TIDs of tuples we intend to delete */
+ /* NB: this list is ordered by TID address */
+ int num_dead_tuples; /* current # of entries */
+ int max_dead_tuples; /* # slots allocated in array */
+ ItemPointer dead_tuples; /* array of ItemPointerData */
+ bool dead_tuples_overflowed;
+
+ BlockNumber deleted_undo_pages;
+
+ bool can_advance_oldestundorecptr;
+} ZSUndoTrimStats;
+
+/*
+ * Working area for VACUUM.
+ */
+typedef struct ZSVacRelStats
+{
+ int elevel;
+ BufferAccessStrategy vac_strategy;
+
+ /* hasindex = true means two-pass strategy; false means one-pass */
+ bool hasindex;
+ /* Overall statistics about rel */
+ BlockNumber old_rel_pages; /* previous value of pg_class.relpages */
+ BlockNumber rel_pages; /* total number of pages */
+ BlockNumber scanned_pages; /* number of pages we examined */
+ BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */
+ BlockNumber tupcount_pages; /* pages whose tuples we counted */
+ double old_live_tuples; /* previous value of pg_class.reltuples */
+ double new_rel_tuples; /* new estimated total # of tuples */
+ double new_live_tuples; /* new estimated total # of live tuples */
+ double new_dead_tuples; /* new estimated total # of dead tuples */
+ BlockNumber pages_removed;
+ double tuples_deleted;
+ BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+
+ ZSUndoTrimStats trimstats;
+} ZSVacRelStats;
+
+/*
+ * Guesstimation of number of dead tuples per page. This is used to
+ * provide an upper limit to memory allocated when vacuuming small
+ * tables.
+ */
+#define LAZY_ALLOC_TUPLES MaxHeapTuplesPerPage
+
+static int zs_vac_cmp_itemptr(const void *left, const void *right);
+static bool zs_lazy_tid_reaped(ItemPointer itemptr, void *state);
+static void lazy_space_alloc(ZSVacRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_vacuum_index(Relation indrel,
+ IndexBulkDeleteResult **stats,
+ ZSVacRelStats *vacrelstats);
+static void lazy_cleanup_index(Relation indrel,
+ IndexBulkDeleteResult *stats,
+ ZSVacRelStats *vacrelstats);
+static ZSUndoRecPtr zsundo_scan(Relation rel, TransactionId OldestXmin, ZSUndoTrimStats *trimstats, BlockNumber *oldest_undopage, List **unused_pages);
+static void zsundo_update_oldest_ptr(Relation rel, ZSUndoRecPtr oldest_undorecptr, BlockNumber oldest_undopage, List *unused_pages);
+static void zsundo_record_dead_tuple(ZSUndoTrimStats *trimstats, zstid tid);
+
+/*
+ * Insert the given UNDO record to the UNDO log.
+ */
+ZSUndoRecPtr
+zsundo_insert(Relation rel, ZSUndoRec *rec)
+{
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber tail_blk;
+ Buffer tail_buf = InvalidBuffer;
+ Page tail_pg = NULL;
+ ZSUndoPageOpaque *tail_opaque = NULL;
+ char *dst;
+ ZSUndoRecPtr undorecptr;
+ int offset;
+ uint64 undo_counter;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+
+ /* TODO: get share lock to begin with, for more concurrency */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+retry_lock_tail:
+ tail_blk = metaopaque->zs_undo_tail;
+
+ /*
+ * Is there space on the tail page? If not, allocate a new UNDO page.
+ */
+ if (tail_blk != InvalidBlockNumber)
+ {
+ tail_buf = ReadBuffer(rel, tail_blk);
+ LockBuffer(tail_buf, BUFFER_LOCK_EXCLUSIVE);
+ tail_pg = BufferGetPage(tail_buf);
+ tail_opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(tail_pg);
+ }
+ if (tail_blk == InvalidBlockNumber || PageGetExactFreeSpace(tail_pg) < rec->size)
+ {
+ Buffer newbuf;
+ BlockNumber newblk;
+ Page newpage;
+ ZSUndoPageOpaque *newopaque;
+
+ /*
+ * Release the lock on the metapage while we find a new block, because
+ * that could take a while. (And accessing the Free Page Map might lock
+ * the metapage, too, causing self-deadlock.)
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+ /* new page */
+ newbuf = zspage_getnewbuf(rel, metabuf);
+
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ if (metaopaque->zs_undo_tail != tail_blk)
+ {
+ /*
+ * It should not be possible for another backend to extend the UNDO log
+ * while we're holding the tail block locked.
+ */
+ if (tail_blk != InvalidBlockNumber)
+ elog(ERROR, "UNDO tail block pointer was changed unexpectedly");
+
+ /*
+ * we don't need the new page, after all. (Or maybe we do, if the new
+ * tail block is already full, but we're not smart about it.)
+ */
+ zspage_delete_page(rel, newbuf);
+ goto retry_lock_tail;
+ }
+
+ newblk = BufferGetBlockNumber(newbuf);
+ newpage = BufferGetPage(newbuf);
+ PageInit(newpage, BLCKSZ, sizeof(ZSUndoPageOpaque));
+ newopaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(newpage);
+ newopaque->next = InvalidBlockNumber;
+ newopaque->zs_page_id = ZS_UNDO_PAGE_ID;
+
+ metaopaque->zs_undo_tail = newblk;
+ if (tail_blk == InvalidBlockNumber)
+ metaopaque->zs_undo_head = newblk;
+
+ MarkBufferDirty(metabuf);
+
+ if (tail_blk != InvalidBlockNumber)
+ {
+ tail_opaque->next = newblk;
+ MarkBufferDirty(tail_buf);
+ UnlockReleaseBuffer(tail_buf);
+ }
+
+ tail_blk = newblk;
+ tail_buf = newbuf;
+ tail_pg = newpage;
+ tail_opaque = newopaque;
+ }
+
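+	/*
+	 * Assign this record the next value from the monotonically increasing
+	 * UNDO counter kept on the metapage.
+	 */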
+ undo_counter = metaopaque->zs_undo_counter++;
+ MarkBufferDirty(metabuf);
+
+ UnlockReleaseBuffer(metabuf);
+
+ /* insert the record to this page */
+ offset = ((PageHeader) tail_pg)->pd_lower;
+
+ undorecptr.counter = undo_counter;
+ undorecptr.blkno = tail_blk;
+ undorecptr.offset = offset;
+ rec->undorecptr = undorecptr;
+ dst = ((char *) tail_pg) + offset;
+ memcpy(dst, rec, rec->size);
+ ((PageHeader) tail_pg)->pd_lower += rec->size;
+ MarkBufferDirty(tail_buf);
+ UnlockReleaseBuffer(tail_buf);
+
+ return undorecptr;
+}
+
+/*
+ * Fetch the UNDO record with the given undo-pointer.
+ *
+ * The returned record is a palloc'd copy.
+ */
+ZSUndoRec *
+zsundo_fetch(Relation rel, ZSUndoRecPtr undoptr)
+{
+ Buffer buf;
+ Page page;
+ PageHeader pagehdr;
+ ZSUndoPageOpaque *opaque;
+ ZSUndoRec *undorec;
+ ZSUndoRec *undorec_copy;
+
+ buf = ReadBuffer(rel, undoptr.blkno);
+ page = BufferGetPage(buf);
+ pagehdr = (PageHeader) page;
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ if (PageIsNew(page))
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u; not an UNDO page",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u; not an UNDO page",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+
+ /* Sanity check that the pointer pointed to a valid place */
+ if (undoptr.offset < SizeOfPageHeaderData ||
+ undoptr.offset + sizeof(ZSUndoRec) > pagehdr->pd_lower)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+
+ undorec = (ZSUndoRec *) (((char *) page) + undoptr.offset);
+
+ if (memcmp(&undorec->undorecptr, &undoptr, sizeof(ZSUndoRecPtr)) != 0)
+ elog(ERROR, "could not find UNDO record");
+
+ undorec_copy = palloc(undorec->size);
+ memcpy(undorec_copy, undorec, undorec->size);
+
+ UnlockReleaseBuffer(buf);
+
+ return undorec_copy;
+}
+
+void
+zsundo_clear_speculative_token(Relation rel, ZSUndoRecPtr undoptr)
+{
+ Buffer buf;
+ Page page;
+ PageHeader pagehdr;
+ ZSUndoPageOpaque *opaque;
+ ZSUndoRec *undorec;
+
+ buf = ReadBuffer(rel, undoptr.blkno);
+ page = BufferGetPage(buf);
+ pagehdr = (PageHeader) page;
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u; not an UNDO page",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+
+ /* Sanity check that the pointer pointed to a valid place */
+ if (undoptr.offset < SizeOfPageHeaderData ||
+ undoptr.offset + sizeof(ZSUndoRec) > pagehdr->pd_lower)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+
+ undorec = (ZSUndoRec *) (((char *) page) + undoptr.offset);
+
+ if (undorec->type != ZSUNDO_TYPE_INSERT)
+ elog(ERROR, "unexpected undo record type %d on speculatively inserted row", undorec->type);
+
+ undorec->speculative_token = INVALID_SPECULATIVE_TOKEN;
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+}
+
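+/*
+ * zs_lazy_tid_reaped() -- is a particular TID deletable?
+ *
+ * This has the right signature to be an IndexBulkDeleteCallback. Assumes
+ * the dead_tuples array is sorted in ascending TID order.
+ */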
+static bool
+zs_lazy_tid_reaped(ItemPointer itemptr, void *state)
+{
+ ZSVacRelStats *vacrelstats = (ZSVacRelStats *) state;
+ ItemPointer res;
+
+ res = (ItemPointer) bsearch((void *) itemptr,
+ (void *) vacrelstats->trimstats.dead_tuples,
+ vacrelstats->trimstats.num_dead_tuples,
+ sizeof(ItemPointerData),
+ zs_vac_cmp_itemptr);
+
+ return (res != NULL);
+}
+
+/*
+ * Comparator routines for use with qsort() and bsearch().
+ */
+static int
+zs_vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+void
+zsundo_vacuum(Relation rel, VacuumParams *params, BufferAccessStrategy bstrategy,
+ TransactionId OldestXmin)
+{
+ ZSVacRelStats *vacrelstats;
+ ZSUndoTrimStats *trimstats;
+ Relation *Irel;
+ int nindexes;
+ IndexBulkDeleteResult **indstats;
+ BlockNumber nblocks;
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (nblocks == 0)
+ return; /* empty table */
+
+ vacrelstats = (ZSVacRelStats *) palloc0(sizeof(ZSVacRelStats));
+ trimstats = &vacrelstats->trimstats;
+
+ if (params->options & VACOPT_VERBOSE)
+ vacrelstats->elevel = INFO;
+ else
+ vacrelstats->elevel = DEBUG2;
+ vacrelstats->vac_strategy = bstrategy;
+
+ /* Open all indexes of the relation */
+ vac_open_indexes(rel, RowExclusiveLock, &nindexes, &Irel);
+ vacrelstats->hasindex = (nindexes > 0);
+ indstats = (IndexBulkDeleteResult **)
+ palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
+
+ lazy_space_alloc(vacrelstats, nblocks);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("vacuuming \"%s.%s\"",
+ get_namespace_name(RelationGetNamespace(rel)),
+ RelationGetRelationName(rel))));
+
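+	/*
+	 * Trim the UNDO log in one or more passes. If the dead-TID array fills
+	 * up, process what we have, then scan again for more.
+	 */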
+ do
+ {
+ ZSUndoRecPtr reaped_upto;
+ BlockNumber oldest_undopage;
+ int j;
+ List *unused_pages = NIL;
+
+ trimstats->dead_tuples_overflowed = false;
+ trimstats->num_dead_tuples = 0;
+ trimstats->deleted_undo_pages = 0;
+
+ reaped_upto = zsundo_scan(rel, OldestXmin, trimstats, &oldest_undopage, &unused_pages);
+
+ if (trimstats->num_dead_tuples > 0)
+ {
+ pg_qsort(trimstats->dead_tuples, trimstats->num_dead_tuples,
+ sizeof(ItemPointerData), zs_vac_cmp_itemptr);
+ /* TODO: currently, we write a separate UNDO record for each attribute, so there will
+ * be duplicates. Eliminate them. */
+ j = 1;
+ for (int i = 1; i < trimstats->num_dead_tuples; i++)
+ {
+ if (!ItemPointerEquals(&trimstats->dead_tuples[j - 1],
+ &trimstats->dead_tuples[i]))
+ trimstats->dead_tuples[j++] = trimstats->dead_tuples[i];
+ }
+ trimstats->num_dead_tuples = j;
+
+ /* Remove index entries */
+ for (int i = 0; i < nindexes; i++)
+ lazy_vacuum_index(Irel[i],
+ &indstats[i],
+ vacrelstats);
+
+ /*
+ * Mark the items as dead in the attribute b-trees.
+ *
+ * We cannot remove them immediately, because we must prevent the TIDs from
+ * being reused, until we have trimmed the UNDO records. Otherwise, this might
+ * happen:
+ *
+ * 1. We remove items from all the B-trees.
+ * 2. An inserter reuses the now-unused TID for a new tuple
+ * 3. We abort the VACUUM, for some reason
+ * 4. We start VACUUM again. We will now try to remove the item again, but
+ * we will remove the new item with the same TID instead.
+ *
+ * There would be other ways to deal with this. For example, in step #4 we could
+ * refrain from removing items whose UNDO pointers are newer than expected.
+ * But that's tricky, because we scan the indexes first, and we must refrain
+ * from removing index entries for new items, too.
+ */
+ for (int i = 0; i < trimstats->num_dead_tuples; i++)
+ zsbt_tid_mark_dead(rel,
+ ZSTidFromItemPointer(trimstats->dead_tuples[i]),
+ reaped_upto);
+
+ for (int attno = 1; attno <= RelationGetNumberOfAttributes(rel); attno++)
+ {
+ for (int i = 0; i < trimstats->num_dead_tuples; i++)
+ zsbt_attr_remove(rel, attno, ZSTidFromItemPointer(trimstats->dead_tuples[i]));
+ }
+ }
+
+ /*
+ * The UNDO records for the tuple versions we just removed are no longer
+ * interesting to anyone. Advance the UNDO tail, so that the UNDO pages
+ * can be recycled.
+ */
+ zsundo_update_oldest_ptr(rel, reaped_upto, oldest_undopage, unused_pages);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("\"%s\": removed %d row versions and %d undo pages",
+ RelationGetRelationName(rel),
+ trimstats->num_dead_tuples,
+ trimstats->deleted_undo_pages)));
+ } while(trimstats->dead_tuples_overflowed);
+
+ /* Do post-vacuum cleanup and statistics update for each index */
+ for (int i = 0; i < nindexes; i++)
+ lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
+
+ /* Done with indexes */
+ vac_close_indexes(nindexes, Irel, NoLock);
+}
+
+
+/*
+ * lazy_space_alloc - space allocation decisions for lazy vacuum
+ *
+ * See the comments at the head of this file for rationale.
+ */
+static void
+lazy_space_alloc(ZSVacRelStats *vacrelstats, BlockNumber relblocks)
+{
+ long maxtuples;
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem : maintenance_work_mem;
+
+ if (vacrelstats->hasindex)
+ {
+ maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
+ maxtuples = Min(maxtuples, INT_MAX);
+ maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
+
+ /* curious coding here to ensure the multiplication can't overflow */
+ if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
+ maxtuples = relblocks * LAZY_ALLOC_TUPLES;
+
+ /* stay sane if small maintenance_work_mem */
+ maxtuples = Max(maxtuples, MaxHeapTuplesPerPage);
+ }
+ else
+ {
+ /*
+ * TODO: In the heap vacuum code, this is MaxHeapTuplesPerPage. There is no
+ * particular reason to use that value here, but the same principle applies:
+ * without indexes, it's pretty cheap to do multiple iterations, so let's
+ * avoid making a huge allocation.
+ */
+ maxtuples = 1000;
+ }
+
+ vacrelstats->trimstats.num_dead_tuples = 0;
+ vacrelstats->trimstats.max_dead_tuples = (int) maxtuples;
+ vacrelstats->trimstats.dead_tuples = (ItemPointer)
+ palloc(maxtuples * sizeof(ItemPointerData));
+}
+
+/*
+ * lazy_vacuum_index() -- vacuum one index relation.
+ *
+ * Delete all the index entries pointing to tuples listed in
+ * vacrelstats->dead_tuples, and update running statistics.
+ */
+static void
+lazy_vacuum_index(Relation indrel,
+ IndexBulkDeleteResult **stats,
+ ZSVacRelStats *vacrelstats)
+{
+ IndexVacuumInfo ivinfo;
+ PGRUsage ru0;
+
+ pg_rusage_init(&ru0);
+
+ ivinfo.index = indrel;
+ ivinfo.analyze_only = false;
+ ivinfo.estimated_count = true;
+ ivinfo.message_level = vacrelstats->elevel;
+ /* We can only provide an approximate value of num_heap_tuples here */
+ ivinfo.num_heap_tuples = vacrelstats->old_live_tuples;
+ ivinfo.strategy = vacrelstats->vac_strategy;
+
+ /* Do bulk deletion */
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ zs_lazy_tid_reaped, (void *) vacrelstats);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("scanned index \"%s\" to remove %d row versions",
+ RelationGetRelationName(indrel),
+ vacrelstats->trimstats.num_dead_tuples),
+ errdetail_internal("%s", pg_rusage_show(&ru0))));
+}
+
+/*
+ * lazy_cleanup_index() -- do post-vacuum cleanup for one index relation.
+ */
+static void
+lazy_cleanup_index(Relation indrel,
+ IndexBulkDeleteResult *stats,
+ ZSVacRelStats *vacrelstats)
+{
+ IndexVacuumInfo ivinfo;
+ PGRUsage ru0;
+
+ pg_rusage_init(&ru0);
+
+ ivinfo.index = indrel;
+ ivinfo.analyze_only = false;
+ ivinfo.estimated_count = (vacrelstats->tupcount_pages < vacrelstats->rel_pages);
+ ivinfo.message_level = vacrelstats->elevel;
+
+ /*
+ * Now we can provide a better estimate of total number of surviving
+ * tuples (we assume indexes are more interested in that than in the
+ * number of nominally live tuples).
+ */
+ ivinfo.num_heap_tuples = vacrelstats->new_rel_tuples;
+ ivinfo.strategy = vacrelstats->vac_strategy;
+
+ stats = index_vacuum_cleanup(&ivinfo, stats);
+
+ if (!stats)
+ return;
+
+ /*
+ * Now update statistics in pg_class, but only if the index says the count
+ * is accurate.
+ */
+ if (!stats->estimated_count)
+ vac_update_relstats(indrel,
+ stats->num_pages,
+ stats->num_index_tuples,
+ 0,
+ false,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ false);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("index \"%s\" now contains %.0f row versions in %u pages",
+ RelationGetRelationName(indrel),
+ stats->num_index_tuples,
+ stats->num_pages),
+ errdetail("%.0f index row versions were removed.\n"
+ "%u index pages have been deleted, %u are currently reusable.\n"
+ "%s.",
+ stats->tuples_removed,
+ stats->pages_deleted, stats->pages_free,
+ pg_rusage_show(&ru0))));
+
+ pfree(stats);
+}
+
+/*
+ * Scan the UNDO log, starting from the oldest entry. For every tuple that is
+ * now considered dead, add it to 'dead_tuples'. Records for committed
+ * transactions can be trimmed away immediately.
+ *
+ * Returns the value that the oldest UNDO pointer can be trimmed up to, after
+ * removing all the dead TIDs.
+ *
+ * The caller must initialize ZSUndoTrimStats. This function updates the
+ * counters, and adds dead TIDs that can be removed to trimstats->dead_tuples.
+ * If there are more dead TIDs than fit in the dead_tuples array, this
+ * function sets the trimstats->dead_tuples_overflowed flag, and stops just
+ * before the UNDO record for the TID that did not fit. An important special
+ * case is calling this with trimstats->max_dead_tuples == 0. In that case, we
+ * scan as much as is possible without scanning the indexes (i.e. only UNDO
+ * records belonging to committed transactions at the oldest end of the UNDO
+ * log). In other words, this returns the oldest UNDO record pointer that is
+ * still needed by active snapshots.
+ */
+static ZSUndoRecPtr
+zsundo_scan(Relation rel, TransactionId OldestXmin, ZSUndoTrimStats *trimstats,
+ BlockNumber *oldest_undopage, List **unused_pages)
+{
+ /* Scan the undo log from oldest to newest */
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber firstblk;
+ BlockNumber lastblk;
+ ZSUndoRecPtr oldest_undorecptr;
+ bool can_advance_oldestundorecptr;
+ char *ptr;
+ char *endptr;
+
+ /*
+ * Get the current oldest undo page from the metapage.
+ */
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ firstblk = metaopaque->zs_undo_head;
+
+ oldest_undorecptr = metaopaque->zs_undo_oldestptr;
+
+ /*
+ * If we assume that only one process can call TRIM at a time, then we
+ * don't need to hold the metapage locked. Alternatively, if multiple
+ * concurrent trims is possible, we could check after reading the head
+ * page, that it is the page we expect, and re-read the metapage if it's
+ * not.
+ *
+ * FIXME: Currently this works even if two backends call zsundo_trim()
+ * concurrently, because we never recycle UNDO pages.
+ */
+ UnlockReleaseBuffer(metabuf);
+
+ /*
+ * Loop through UNDO records, starting from the oldest page, until we
+ * hit a record that we cannot remove.
+ */
+ lastblk = firstblk;
+ can_advance_oldestundorecptr = false;
+ while (lastblk != InvalidBlockNumber && !trimstats->dead_tuples_overflowed)
+ {
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the UNDO page */
+ buf = ReadBuffer(rel, lastblk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ elog(ERROR, "unexpected page id on UNDO page");
+
+ /* loop through all records on the page */
+ endptr = (char *) page + ((PageHeader) page)->pd_lower;
+ ptr = (char *) page + SizeOfPageHeaderData;
+ while (ptr < endptr && !trimstats->dead_tuples_overflowed)
+ {
+ ZSUndoRec *undorec = (ZSUndoRec *) ptr;
+ bool did_commit;
+
+ Assert(undorec->undorecptr.blkno == lastblk);
+
+ if (undorec->undorecptr.counter < oldest_undorecptr.counter)
+ {
+ ptr += undorec->size;
+ continue;
+ }
+ oldest_undorecptr = undorec->undorecptr;
+
+ if (!TransactionIdPrecedes(undorec->xid, OldestXmin))
+ {
+ /* This is still needed. Bail out */
+ break;
+ }
+
+ /*
+ * No one thinks this transaction is in-progress anymore. If it
+ * committed, we can just trim away its UNDO record. If it aborted,
+ * we need to apply the UNDO record first.
+ */
+ did_commit = TransactionIdDidCommit(undorec->xid);
+
+ switch (undorec->type)
+ {
+ case ZSUNDO_TYPE_INSERT:
+ if (!did_commit)
+ zsundo_record_dead_tuple(trimstats, undorec->tid);
+ break;
+ case ZSUNDO_TYPE_DELETE:
+ if (did_commit)
+ {
+ zsundo_record_dead_tuple(trimstats, undorec->tid);
+ }
+ else
+ {
+ /*
+ * We must clear the item's UNDO pointer, otherwise the deletion
+ * becomes visible to everyone when the UNDO record is trimmed
+ * away.
+ *
+ * Don't do this if we're called from zsundo_get_oldest_undo_ptr(),
+ * because we might already be holding a lock on the page, and
+ * would deadlock.
+ */
+ if (trimstats->max_dead_tuples == 0)
+ trimstats->dead_tuples_overflowed = true;
+ else
+ zsbt_tid_undo_deletion(rel, undorec->tid, undorec->undorecptr);
+ }
+ break;
+ case ZSUNDO_TYPE_UPDATE:
+ if (did_commit)
+ zsundo_record_dead_tuple(trimstats, undorec->tid);
+ break;
+ }
+
+ if (!trimstats->dead_tuples_overflowed)
+ {
+ ptr += undorec->size;
+
+ can_advance_oldestundorecptr = true;
+ }
+ }
+
+ if (ptr < endptr)
+ {
+ UnlockReleaseBuffer(buf);
+ break;
+ }
+ else
+ {
+ /* We processed all records on the page. Step to the next one, if any. */
+ Assert(ptr == endptr);
+ *unused_pages = lappend_int(*unused_pages, lastblk);
+ lastblk = opaque->next;
+ UnlockReleaseBuffer(buf);
+ if (lastblk != InvalidBlockNumber)
+ trimstats->deleted_undo_pages++;
+ }
+ }
+
+ if (can_advance_oldestundorecptr && lastblk == InvalidBlockNumber)
+ {
+ /*
+ * We stopped after the last valid record. Advance by one, to the next
+ * record which hasn't been created yet, and which is still needed
+ */
+ oldest_undorecptr.counter++;
+ oldest_undorecptr.blkno = InvalidBlockNumber;
+ oldest_undorecptr.offset = 0;
+ }
+
+ trimstats->can_advance_oldestundorecptr = can_advance_oldestundorecptr;
+ *oldest_undopage = lastblk;
+ return oldest_undorecptr;
+}
+
+/*
+ * Update the metapage with the new oldest UNDO record pointer, and delete
+ * any UNDO pages that are no longer needed.
+ */
+static void
+zsundo_update_oldest_ptr(Relation rel, ZSUndoRecPtr oldest_undorecptr, BlockNumber oldest_undopage, List *unused_pages)
+{
+ /* Scan the undo log from oldest to newest */
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ ListCell *lc;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ metaopaque->zs_undo_oldestptr = oldest_undorecptr;
+ if (oldest_undopage == InvalidBlockNumber)
+ {
+ metaopaque->zs_undo_head = InvalidBlockNumber;
+ metaopaque->zs_undo_tail = InvalidBlockNumber;
+ }
+ else
+ metaopaque->zs_undo_head = oldest_undopage;
+
+ /* TODO: WAL-log */
+
+ MarkBufferDirty(metabuf);
+ UnlockReleaseBuffer(metabuf);
+
+ foreach(lc, unused_pages)
+ {
+ BlockNumber blk = (BlockNumber) lfirst_int(lc);
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+
+ /* check that the page still looks like what we'd expect. */
+ buf = ReadBuffer(rel, blk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ if (PageIsEmpty(page) ||
+ PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSUndoPageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+
+ /* FIXME: Also check here that the max UndoRecPtr on the page is less
+ * than the new 'oldest_undorecptr'
+ */
+
+ zspage_delete_page(rel, buf);
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+/*
+ * zsundo_record_dead_tuple - remember one deletable tuple
+ */
+static void
+zsundo_record_dead_tuple(ZSUndoTrimStats *trimstats, zstid tid)
+{
+ /*
+ * The array shouldn't overflow under normal behavior, but perhaps it
+ * could if we are given a really small maintenance_work_mem. In that
+ * case, just forget the last few tuples (we'll get 'em next time).
+ */
+ if (trimstats->num_dead_tuples < trimstats->max_dead_tuples)
+ {
+ trimstats->dead_tuples[trimstats->num_dead_tuples] = ItemPointerFromZSTid(tid);
+ trimstats->num_dead_tuples++;
+ pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+ trimstats->num_dead_tuples);
+ }
+ else
+ trimstats->dead_tuples_overflowed = true;
+}
+
+/*
+ * Return the current "oldest undo pointer". The effects of any actions with
+ * an undo pointer older than this are known to be visible to everyone (i.e.
+ * an inserted tuple is known to be visible, and a deleted tuple is known to
+ * be invisible).
+ */
+ZSUndoRecPtr
+zsundo_get_oldest_undo_ptr(Relation rel)
+{
+ ZSUndoRecPtr result;
+ ZSUndoTrimStats trimstats;
+ BlockNumber oldest_undopage;
+ List *unused_pages = NIL;
+
+ if (RelationGetNumberOfBlocks(rel) == 0)
+ {
+ memset(&result, 0, sizeof(ZSUndoRecPtr));
+ return result;
+ }
+
+ /*
+ * Call zsundo_scan, with max_dead_tuples = 0. It scans the UNDO log,
+ * starting from the oldest record, and advances the oldest UNDO pointer
+ * past as many committed, visible-to-all transactions as possible.
+ *
+ * TODO:
+ * We could get the latest cached value directly from the metapage, but
+ * this allows trimming the UNDO log more aggressively, whenever we're
+ * scanning. Fetching records from the UNDO log is pretty expensive,
+ * so until that is somehow sped up, it is a good tradeoff to be
+ * aggressive about that.
+ */
+ trimstats.num_dead_tuples = 0;
+ trimstats.max_dead_tuples = 0;
+ trimstats.dead_tuples = NULL;
+ trimstats.dead_tuples_overflowed = false;
+ trimstats.deleted_undo_pages = 0;
+ result = zsundo_scan(rel, RecentGlobalXmin, &trimstats, &oldest_undopage, &unused_pages);
+
+ if (trimstats.can_advance_oldestundorecptr)
+ zsundo_update_oldest_ptr(rel, result, oldest_undopage, unused_pages);
+
+ return result;
+}
diff --git a/src/backend/access/zedstore/zedstore_utils.c b/src/backend/access/zedstore/zedstore_utils.c
new file mode 100644
index 0000000000..7673537292
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_utils.c
@@ -0,0 +1,76 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstore_utils.c
+ * ZedStore utility functions
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_utils.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/zedstore_internal.h"
+#include "miscadmin.h"
+
+/*
+ * Allocate a new zs_split_stack struct.
+ */
+zs_split_stack *
+zs_new_split_stack_entry(Buffer buf, Page page)
+{
+ zs_split_stack *stack;
+
+ stack = palloc(sizeof(zs_split_stack));
+ stack->next = NULL;
+ stack->buf = buf;
+ stack->page = page;
+ stack->recycle = false; /* caller can change this */
+
+ return stack;
+}
+
+/*
+ * Apply all the changes represented by a list of zs_split_stack
+ * entries.
+ */
+void
+zs_apply_split_changes(Relation rel, zs_split_stack *stack)
+{
+ zs_split_stack *head = stack;
+
+ START_CRIT_SECTION();
+
+ while (stack)
+ {
+ PageRestoreTempPage(stack->page, BufferGetPage(stack->buf));
+ MarkBufferDirty(stack->buf);
+ stack = stack->next;
+ }
+
+ /* TODO: WAL-log all the changes */
+
+ END_CRIT_SECTION();
+
+ stack = head;
+ while (stack)
+ {
+ zs_split_stack *next;
+
+ /* add this page to the Free Page Map for recycling */
+ if (stack->recycle)
+ zspage_delete_page(rel, stack->buf);
+
+ UnlockReleaseBuffer(stack->buf);
+
+ next = stack->next;
+ pfree(stack);
+ stack = next;
+ }
+}
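+
+/*
+ * Illustrative usage sketch (buffer/page names are hypothetical): the caller
+ * prepares temporary copies of the pages it wants to modify (the buffers must
+ * already be exclusively locked), chains them together, and applies them
+ * atomically:
+ *
+ *     zs_split_stack *stack = zs_new_split_stack_entry(origbuf, tmpcopy);
+ *     stack->next = zs_new_split_stack_entry(newbuf, newtmppage);
+ *     zs_apply_split_changes(rel, stack);
+ *
+ * zs_apply_split_changes() copies each temporary page over its buffer, marks
+ * the buffers dirty, and releases the locks and the stack entries.
+ */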
diff --git a/src/backend/access/zedstore/zedstore_visibility.c b/src/backend/access/zedstore/zedstore_visibility.c
new file mode 100644
index 0000000000..0087991f78
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_visibility.c
@@ -0,0 +1,728 @@
+/*
+ * zedstore_visibility.c
+ * Routines for MVCC in Zedstore
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_visibility.c
+ */
+#include "postgres.h"
+
+#include "access/tableam.h"
+#include "access/xact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "storage/procarray.h"
+
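+/*
+ * Can a tuple lock of strength 'newmode' be acquired, when another
+ * transaction already holds a lock of strength 'mode' on the same row?
+ *
+ * The compatibility matrix (derived from the switch below) is the same as
+ * for heapam row locks:
+ *
+ *   requested \ held:  KeyShare  Share  NoKeyExclusive  Exclusive
+ *   KeyShare              ok      ok         ok             -
+ *   Share                 ok      ok         -              -
+ *   NoKeyExclusive        ok      -          -              -
+ *   Exclusive             -       -          -              -
+ */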
+static bool
+zs_tuplelock_compatible(LockTupleMode mode, LockTupleMode newmode)
+{
+ switch (newmode)
+ {
+ case LockTupleKeyShare:
+ return mode == LockTupleKeyShare ||
+ mode == LockTupleShare ||
+ mode == LockTupleNoKeyExclusive;
+
+ case LockTupleShare:
+ return mode == LockTupleKeyShare ||
+ mode == LockTupleShare;
+
+ case LockTupleNoKeyExclusive:
+ return mode == LockTupleKeyShare;
+ case LockTupleExclusive:
+ return false;
+
+ default:
+ elog(ERROR, "unknown tuple lock mode %d", newmode);
+ }
+}
+
+/*
+ * Like HeapTupleSatisfiesUpdate.
+ *
+ * When this returns TM_Ok, it also sets a flag in *undo_record_needed, to
+ * indicate whether the old UNDO record is still of interest to anyone. If the
+ * old record belonged to an aborted deleting transaction, for example, it can
+ * be ignored.
+ *
+ * This does more than HeapTupleSatisfiesUpdate. If HeapTupleSatisfiesUpdate sees
+ * an updated or locked tuple, it returns TM_BeingUpdated, and the caller has to
+ * check if the tuple lock is compatible with the update. zs_SatisfiesUpdate
+ * checks if the new lock mode is compatible with the old one, and returns TM_Ok
+ * if so. Waiting for conflicting locks is left to the caller.
+ *
+ * This is also used for tuple locking (e.g. SELECT FOR UPDATE). 'mode' indicates
+ * the lock mode. For a genuine UPDATE, pass LockTupleExclusive or
+ * LockTupleNoKeyExclusive depending on whether key columns are being modified.
+ *
+ * If the tuple was UPDATEd, *next_tid is set to the TID of the new row version.
+ */
+TM_Result
+zs_SatisfiesUpdate(Relation rel, Snapshot snapshot,
+ ZSUndoRecPtr recent_oldest_undo, ZSBtreeItem *item,
+ LockTupleMode mode,
+ bool *undo_record_needed, TM_FailureData *tmfd, zstid *next_tid)
+{
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+ int chain_depth = 0;
+
+ Assert((item->t_flags & ZSBT_COMPRESSED) == 0);
+
+ *undo_record_needed = true;
+
+ undo_ptr = zsbt_item_undoptr(item);
+
+fetch_undo_record:
+ chain_depth++;
+
+ /* Is it visible? */
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ {
+ /*
+ * The old UNDO record is no longer visible to anyone, so there's no
+ * need to keep the item's pointer to it. If this record was not the
+ * one directly referenced from the item, however, we must keep the
+ * pointer. For example, if there is a chain
+ * (item -> LOCK_TUPLE -> INSERT), and the INSERT record is no longer
+ * needed by anyone, we must still keep the pointer to the LOCK record.
+ */
+ if (chain_depth == 1)
+ *undo_record_needed = false;
+ return TM_Ok;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, undo_ptr);
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ if (undorec->cid >= snapshot->curcid)
+ return TM_Invisible; /* inserted after scan started */
+ }
+ else if (TransactionIdIsInProgress(undorec->xid))
+ return TM_Invisible; /* inserter has not committed yet */
+ else if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /* it must have aborted or crashed */
+ return TM_Invisible;
+ }
+
+ /* The tuple is visible to us. But can we lock it? */
+
+ /*
+ * No conflict with this lock. Look at the previous UNDO record, there
+ * might be more locks.
+ *
+ * FIXME: Shouldn't we drill down to the INSERT record and check if
+ * that's visible to us first, before looking at the lockers?
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ ZSUndoRec_TupleLock *lock_undorec = (ZSUndoRec_TupleLock *) undorec;
+
+ /*
+ * If any subtransaction of the current top transaction already holds
+ * a lock as strong as or stronger than what we're requesting, we
+ * effectively hold the desired lock already. We *must* succeed
+ * without trying to take the tuple lock, else we will deadlock
+ * against anyone wanting to acquire a stronger lock.
+ */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ if (lock_undorec->lockmode >= mode)
+ {
+ *undo_record_needed = true;
+ return TM_Ok;
+ }
+ }
+ else if (!zs_tuplelock_compatible(lock_undorec->lockmode, mode) &&
+ TransactionIdIsInProgress(undorec->xid))
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item->t_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+ return TM_BeingModified;
+ }
+
+ /*
+ * No conflict with this lock. Look at the previous UNDO record, there
+ * might be more locks.
+ *
+ * FIXME: Shouldn't we drill down to the INSERT record and check if
+ * that's visible to us first, before looking at the lockers?
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE)
+ {
+ ZSUndoRec_Delete *deleterec = (ZSUndoRec_Delete *) undorec;
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ if (undorec->cid >= snapshot->curcid)
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item->t_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = undorec->cid;
+ return TM_SelfModified; /* deleted/updated after scan started */
+ }
+ else
+ return TM_Invisible; /* deleted before scan started */
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item->t_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+
+ return TM_BeingModified;
+ }
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /* deleter must have aborted or crashed. We have to keep following the
+ * undo chain, in case there are LOCK records that are still visible
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+ if (deleterec->changedPart)
+ {
+ ItemPointerSet(&tmfd->ctid, MovedPartitionsBlockNumber, MovedPartitionsOffsetNumber);
+ *next_tid = InvalidZSTid;
+ return TM_Updated;
+ }
+ else
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item->t_tid);
+ return TM_Deleted;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ /* updated-away tuple */
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ LockTupleMode old_lockmode;
+
+ *next_tid = updaterec->newtid;
+ old_lockmode = updaterec->key_update ? LockTupleExclusive : LockTupleNoKeyExclusive;
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ if (zs_tuplelock_compatible(old_lockmode, mode))
+ return TM_Ok;
+
+ if (undorec->cid >= snapshot->curcid)
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item->t_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = undorec->cid;
+ return TM_SelfModified; /* deleted/updated after scan started */
+ }
+ else
+ return TM_Invisible; /* deleted before scan started */
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ {
+ if (zs_tuplelock_compatible(old_lockmode, mode))
+ return TM_Ok;
+
+ tmfd->ctid = ItemPointerFromZSTid(item->t_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+
+ return TM_BeingModified;
+ }
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /* updater must have aborted or crashed. We have to keep following the
+ * undo chain, in case there are LOCK records that are still visible
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ if (zs_tuplelock_compatible(old_lockmode, mode))
+ return TM_Ok;
+
+ tmfd->ctid = ItemPointerFromZSTid(((ZSUndoRec_Update *) undorec)->newtid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+ return TM_Updated;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+
+/*
+ * Like HeapTupleSatisfiesAny
+ */
+static bool
+zs_SatisfiesAny(ZSBtreeScan *scan, ZSBtreeItem *item)
+{
+ return true;
+}
+
+/*
+ * helper function to zs_SatisfiesMVCC(), to check if the given XID
+ * is visible to the snapshot.
+ */
+static bool
+xid_is_visible(Snapshot snapshot, TransactionId xid, CommandId cid, bool *aborted)
+{
+ *aborted = false;
+ if (TransactionIdIsCurrentTransactionId(xid))
+ {
+ if (cid >= snapshot->curcid)
+ return false;
+ else
+ return true;
+ }
+ else if (XidInMVCCSnapshot(xid, snapshot))
+ return false;
+ else if (TransactionIdDidCommit(xid))
+ {
+ return true;
+ }
+ else
+ {
+ /* it must have aborted or crashed */
+ *aborted = true;
+ return false;
+ }
+}
+
+/*
+ * Like HeapTupleSatisfiesMVCC
+ */
+static bool
+zs_SatisfiesMVCC(ZSBtreeScan *scan, ZSBtreeItem *item,
+ TransactionId *obsoleting_xid, zstid *next_tid)
+{
+ Relation rel = scan->rel;
+ Snapshot snapshot = scan->snapshot;
+ ZSUndoRecPtr recent_oldest_undo = scan->recent_oldest_undo;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+ bool aborted;
+
+ Assert((item->t_flags & ZSBT_COMPRESSED) == 0);
+ Assert (snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ undo_ptr = zsbt_item_undoptr(item);
+
+fetch_undo_record:
+ /* If this record is "old", then the record is visible. */
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ return true;
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, undo_ptr);
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ /* Inserted tuple */
+ bool result;
+
+ result = xid_is_visible(snapshot, undorec->xid, undorec->cid, &aborted);
+ if (!result && !aborted)
+ *obsoleting_xid = undorec->xid;
+ return result;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* we don't care about tuple locks here. Follow the link to the
+ * previous UNDO record for this tuple. */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ if (next_tid)
+ *next_tid = updaterec->newtid;
+ }
+
+ /*
+ * Deleted or updated-away. They are treated the same in an MVCC snapshot.
+ * They only need different treatment when updating or locking the row,
+ * in SatisfiesUpdate().
+ */
+ if (xid_is_visible(snapshot, undorec->xid, undorec->cid, &aborted))
+ {
+ /* we can see the deletion */
+ return false;
+ }
+ else
+ {
+ if (!aborted)
+ *obsoleting_xid = undorec->xid;
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * Like HeapTupleSatisfiesSelf
+ */
+static bool
+zs_SatisfiesSelf(ZSBtreeScan *scan, ZSBtreeItem *item, zstid *next_tid)
+{
+ Relation rel = scan->rel;
+ ZSUndoRecPtr recent_oldest_undo = scan->recent_oldest_undo;
+ ZSUndoRec *undorec;
+ ZSUndoRecPtr undo_ptr;
+
+ Assert((item->t_flags & ZSBT_COMPRESSED) == 0);
+ Assert (scan->snapshot->snapshot_type == SNAPSHOT_SELF);
+
+ undo_ptr = zsbt_item_undoptr(item);
+
+fetch_undo_record:
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ return true;
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, undo_ptr);
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ /* Inserted tuple */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true; /* inserted by me */
+ else if (TransactionIdIsInProgress(undorec->xid))
+ return false;
+ else if (TransactionIdDidCommit(undorec->xid))
+ return true;
+ else
+ {
+ /* it must have aborted or crashed */
+ return false;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* we don't care about tuple locks here. Follow the link to the
+ * previous UNDO record for this tuple. */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ if (next_tid)
+ *next_tid = updaterec->newtid;
+ }
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ /* deleted by me */
+ return false;
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true;
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * Deleter must have aborted or crashed. But we have to keep following the
+ * undo chain, to check if the insertion was visible in the first
+ * place.
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ return false;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * Like HeapTupleSatisfiesDirty
+ */
+static bool
+zs_SatisfiesDirty(ZSBtreeScan *scan, ZSBtreeItem *item, zstid *next_tid)
+{
+ Relation rel = scan->rel;
+ Snapshot snapshot = scan->snapshot;
+ ZSUndoRecPtr recent_oldest_undo = scan->recent_oldest_undo;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ Assert((item->t_flags & ZSBT_COMPRESSED) == 0);
+ Assert (snapshot->snapshot_type == SNAPSHOT_DIRTY);
+
+ snapshot->xmin = snapshot->xmax = InvalidTransactionId;
+ snapshot->speculativeToken = INVALID_SPECULATIVE_TOKEN;
+
+ undo_ptr = zsbt_item_undoptr(item);
+
+fetch_undo_record:
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ return true;
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, undo_ptr);
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ snapshot->speculativeToken = undorec->speculative_token;
+ /* Inserted tuple */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true; /* inserted by me */
+ else if (TransactionIdIsInProgress(undorec->xid))
+ {
+ snapshot->xmin = undorec->xid;
+ return true;
+ }
+ else if (TransactionIdDidCommit(undorec->xid))
+ {
+ return true;
+ }
+ else
+ {
+ /* it must have aborted or crashed */
+ return false;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* locked tuple: look at the previous UNDO record to find the insert record */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ if (next_tid)
+ *next_tid = updaterec->newtid;
+ }
+
+ /* deleted or updated-away tuple */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ /* deleted by me */
+ return false;
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ {
+ snapshot->xmax = undorec->xid;
+ return true;
+ }
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * Deleter must have aborted or crashed. But we have to keep following the
+ * undo chain, to check if the insertion was visible in the first
+ * place.
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ return false;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * True if tuple might be visible to some transaction; false if it's
+ * surely dead to everyone, ie, vacuumable.
+ */
+static bool
+zs_SatisfiesNonVacuumable(ZSBtreeScan *scan, ZSBtreeItem *item)
+{
+ Relation rel = scan->rel;
+ TransactionId OldestXmin = scan->snapshot->xmin;
+ ZSUndoRecPtr recent_oldest_undo = scan->recent_oldest_undo;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ Assert (scan->snapshot->snapshot_type == SNAPSHOT_NON_VACUUMABLE);
+ Assert(TransactionIdIsValid(OldestXmin));
+
+ undo_ptr = zsbt_item_undoptr(item);
+
+fetch_undo_record:
+
+ /* Is it visible? */
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ return true;
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, undo_ptr);
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ /* Inserted tuple */
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true; /* inserter has not committed yet */
+
+ if (TransactionIdDidCommit(undorec->xid))
+ return true;
+
+ /* it must have aborted or crashed */
+ return false;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ /* deleted or updated-away tuple */
+ ZSUndoRecPtr prevptr;
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true; /* delete-in-progress */
+ else if (TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * Deleter committed. But perhaps it was recent enough that some open
+ * transactions could still see the tuple.
+ */
+ if (!TransactionIdPrecedes(undorec->xid, OldestXmin))
+ return true;
+
+ return false;
+ }
+
+ /*
+ * The deleting transaction did not commit. But before concluding
+ * that the tuple is live, we have to check if the inserting
+ * XID is live.
+ */
+ do {
+ prevptr = undorec->prevundorec;
+
+ if (prevptr.counter < recent_oldest_undo.counter)
+ return true;
+ undorec = zsundo_fetch(rel, prevptr);
+ } while(undorec->type == ZSUNDO_TYPE_TUPLE_LOCK);
+
+ Assert(undorec->type == ZSUNDO_TYPE_INSERT);
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true; /* insert-in-progress */
+ else if (TransactionIdDidCommit(undorec->xid))
+ return true; /* inserter committed */
+
+ /* inserter must have aborted or crashed */
+ return false;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* look at the previous UNDO record, to find the Insert record */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * Like HeapTupleSatisfiesVisibility
+ *
+ * If next_tid is not NULL and the tuple was UPDATEd, *next_tid is set to the
+ * TID of the new row version.
+ */
+bool
+zs_SatisfiesVisibility(ZSBtreeScan *scan, ZSBtreeItem *item,
+ TransactionId *obsoleting_xid, zstid *next_tid)
+{
+ ZSUndoRecPtr undo_ptr;
+
+ /* Initialize to invalid; if we find a valid next TID, it is filled in below. */
+ if (next_tid)
+ *next_tid = InvalidZSTid;
+
+ /*
+ * This works on a single or array item. Compressed items don't have
+ * visibility information (the items inside the compressed container
+ * do)
+ */
+ Assert((item->t_flags & ZSBT_COMPRESSED) == 0);
+
+ /* The caller should've filled in the recent_oldest_undo pointer */
+ Assert(scan->recent_oldest_undo.counter != 0);
+
+ *obsoleting_xid = InvalidTransactionId;
+
+ /* dead items are never considered visible. */
+ if ((item->t_flags & ZSBT_DEAD) != 0)
+ return false;
+
+ /*
+ * Items with an invalid undo record are considered visible. Normally, only
+ * the META column stores a valid undo record; all other columns store an
+ * invalid undo pointer. The visibility check is performed on the META
+ * column, and the rest of the columns are fetched only if it is visible.
+ * For in-place updates, however, columns other than the META column may
+ * also have a valid undo record, in which case the visibility check needs
+ * to be performed on them as well. The META column can sometimes also have
+ * items with an invalid undo pointer, see zsbt_undo_item_deletion().
+ */
+ undo_ptr = zsbt_item_undoptr(item);
+ if (!IsZSUndoRecPtrValid(&undo_ptr))
+ return true;
+
+ switch (scan->snapshot->snapshot_type)
+ {
+ case SNAPSHOT_MVCC:
+ return zs_SatisfiesMVCC(scan, item, obsoleting_xid, next_tid);
+
+ case SNAPSHOT_SELF:
+ return zs_SatisfiesSelf(scan, item, next_tid);
+
+ case SNAPSHOT_ANY:
+ return zs_SatisfiesAny(scan, item);
+
+ case SNAPSHOT_TOAST:
+ elog(ERROR, "SnapshotToast not implemented in zedstore");
+ break;
+
+ case SNAPSHOT_DIRTY:
+ return zs_SatisfiesDirty(scan, item, next_tid);
+
+ case SNAPSHOT_HISTORIC_MVCC:
+ elog(ERROR, "SnapshotHistoricMVCC not implemented in zedstore yet");
+ break;
+
+ case SNAPSHOT_NON_VACUUMABLE:
+ return zs_SatisfiesNonVacuumable(scan, item);
+ }
+
+ return false; /* keep compiler quiet */
+}
diff --git a/src/backend/access/zedstore/zedstoream_handler.c b/src/backend/access/zedstore/zedstoream_handler.c
new file mode 100644
index 0000000000..5a79b7a1fc
--- /dev/null
+++ b/src/backend/access/zedstore/zedstoream_handler.c
@@ -0,0 +1,3163 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstoream_handler.c
+ * ZedStore table access method code
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstoream_handler.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "miscadmin.h"
+
+#include "access/genam.h"
+#include "access/heapam.h"
+#include "access/multixact.h"
+#include "access/relscan.h"
+#include "access/tableam.h"
+#include "access/tsmapi.h"
+#include "access/tupdesc_details.h"
+#include "access/tuptoaster.h"
+#include "access/xact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "catalog/catalog.h"
+#include "catalog/index.h"
+#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
+#include "commands/progress.h"
+#include "commands/vacuum.h"
+#include "executor/executor.h"
+#include "optimizer/plancat.h"
+#include "pgstat.h"
+#include "storage/lmgr.h"
+#include "storage/predicate.h"
+#include "storage/procarray.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+
+
+typedef enum
+{
+ ZSSCAN_STATE_UNSTARTED,
+ ZSSCAN_STATE_SCANNING,
+ ZSSCAN_STATE_FINISHED_RANGE,
+ ZSSCAN_STATE_FINISHED
+} zs_scan_state;
+
+typedef struct ZedStoreProjectData
+{
+ int num_proj_atts;
+ bool *project_columns;
+ int *proj_atts;
+ ZSBtreeScan *btree_scans;
+ MemoryContext context;
+} ZedStoreProjectData;
+
+typedef struct ZedStoreDescData
+{
+ /* scan parameters */
+ TableScanDescData rs_scan; /* AM-independent part of the scan descriptor */
+ ZedStoreProjectData proj_data;
+
+ zs_scan_state state;
+ zstid cur_range_start;
+ zstid cur_range_end;
+ bool finished;
+
+ /* These fields are used for bitmap scans, to hold a "block's" worth of data */
+#define MAX_ITEMS_PER_LOGICAL_BLOCK MaxHeapTuplesPerPage
+ int bmscan_ntuples;
+ zstid *bmscan_tids;
+ Datum **bmscan_datums;
+ bool **bmscan_isnulls;
+ int bmscan_nexttuple;
+
+ /* These fields are used for TABLESAMPLE scans */
+ zstid max_tid_to_scan;
+ zstid next_tid_to_scan;
+
+} ZedStoreDescData;
+
+typedef struct ZedStoreDescData *ZedStoreDesc;
+
+typedef struct ZedStoreIndexFetchData
+{
+ IndexFetchTableData idx_fetch_data;
+ ZedStoreProjectData proj_data;
+} ZedStoreIndexFetchData;
+
+typedef struct ZedStoreIndexFetchData *ZedStoreIndexFetch;
+
+typedef struct ParallelZSScanDescData *ParallelZSScanDesc;
+
+static IndexFetchTableData *zedstoream_begin_index_fetch(Relation rel);
+static void zedstoream_end_index_fetch(IndexFetchTableData *scan);
+static bool zedstoream_fetch_row(ZedStoreIndexFetchData *fetch,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot);
+
+static Size zs_parallelscan_estimate(Relation rel);
+static Size zs_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan);
+static void zs_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan);
+static bool zs_parallelscan_nextrange(Relation rel, ParallelZSScanDesc pzscan,
+ zstid *start, zstid *end);
+static void zsbt_fill_missing_attribute_value(ZSBtreeScan *scan, Datum *datum, bool *isnull);
+
+/* ----------------------------------------------------------------
+ * storage AM support routines for zedstoream
+ * ----------------------------------------------------------------
+ */
+
+static bool
+zedstoream_fetch_row_version(Relation rel,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot)
+{
+ IndexFetchTableData *fetcher;
+ bool result;
+
+ fetcher = zedstoream_begin_index_fetch(rel);
+
+ result = zedstoream_fetch_row((ZedStoreIndexFetchData *) fetcher,
+ tid_p, snapshot, slot);
+ if (result)
+ {
+ /* FIXME: heapam acquires the predicate lock first, and then
+ * calls CheckForSerializableConflictOut(). We do it in the
+ * opposite order, because the CheckForSerializableConflictOut()
+ * call is already done in zsbt_get_last_tid(). Does it matter?
+ * I'm not sure.
+ */
+ PredicateLockTID(rel, tid_p, snapshot);
+ }
+ ExecMaterializeSlot(slot);
+ slot->tts_tableOid = RelationGetRelid(rel);
+ slot->tts_tid = *tid_p;
+
+ zedstoream_end_index_fetch(fetcher);
+
+ return result;
+}
+
+static void
+zedstoream_get_latest_tid(TableScanDesc sscan,
+ ItemPointer tid)
+{
+ zstid ztid = ZSTidFromItemPointer(*tid);
+ zsbt_find_latest_tid(sscan->rs_rd, &ztid, sscan->rs_snapshot);
+ *tid = ItemPointerFromZSTid(ztid);
+}
+
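+/*
+ * Workhorse for INSERT: insert one row into a zedstore table.
+ *
+ * A new TID is allocated in the TID tree (which also creates the INSERT UNDO
+ * record, stamped with the given speculative token, if any), and then each
+ * attribute's datum is inserted into the corresponding attribute tree.
+ * Datums that are too large to store inline are toasted first (see
+ * zedstore_toast_datum()).
+ */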
+static inline void
+zedstoream_insert_internal(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, struct BulkInsertStateData *bistate, uint32 speculative_token)
+{
+ AttrNumber attno;
+ Datum *d;
+ bool *isnulls;
+ zstid tid;
+ TransactionId xid = GetCurrentTransactionId();
+ bool isnull;
+ Datum datum;
+ ZSUndoRecPtr prevundoptr;
+
+ ZSUndoRecPtrInitialize(&prevundoptr);
+
+ if (slot->tts_tupleDescriptor->natts != relation->rd_att->natts)
+ elog(ERROR, "slot's attribute count doesn't match relcache entry");
+
+ slot_getallattrs(slot);
+ d = slot->tts_values;
+ isnulls = slot->tts_isnull;
+
+ tid = InvalidZSTid;
+
+ zsbt_tid_multi_insert(relation,
+ &tid, 1,
+ xid, cid, speculative_token, prevundoptr);
+
+ /*
+ * We only need to check for table-level SSI locks. Our
+ * new tuple can't possibly conflict with existing tuple locks, and
+ * page locks are only consolidated versions of tuple locks; they do not
+ * lock "gaps" as index page locks do.
+ */
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
+
+ for (attno = 1; attno <= relation->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr(slot->tts_tupleDescriptor, attno - 1);
+ Datum toastptr = (Datum) 0;
+ datum = d[attno - 1];
+ isnull = isnulls[attno - 1];
+
+ if (!isnull && attr->attlen < 0 && VARATT_IS_EXTERNAL(datum))
+ datum = PointerGetDatum(heap_tuple_fetch_attr((struct varlena *) DatumGetPointer(datum)));
+
+ /* If this datum is too large, toast it */
+ if (!isnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(datum) > MaxZedStoreDatumSize)
+ {
+ toastptr = datum = zedstore_toast_datum(relation, attno, datum);
+ }
+
+ zsbt_attr_multi_insert(relation, attno,
+ &datum, &isnull, &tid, 1);
+
+ if (toastptr != (Datum) 0)
+ zedstore_toast_finish(relation, attno, toastptr, tid);
+ }
+
+ slot->tts_tableOid = RelationGetRelid(relation);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+
+ /* Note: speculative insertions are counted too, even if aborted later */
+ pgstat_count_heap_insert(relation, 1);
+}
+
+static void
+zedstoream_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, struct BulkInsertStateData *bistate)
+{
+ zedstoream_insert_internal(relation, slot, cid, options, bistate, INVALID_SPECULATIVE_TOKEN);
+}
+
+static void
+zedstoream_insert_speculative(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, BulkInsertState bistate, uint32 specToken)
+{
+ zedstoream_insert_internal(relation, slot, cid, options, bistate, specToken);
+}
+
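+/*
+ * Finish a speculative insertion, by clearing the speculative token from the
+ * row's INSERT UNDO record. Aborting a failed speculative insertion is not
+ * handled yet.
+ */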
+static void
+zedstoream_complete_speculative(Relation relation, TupleTableSlot *slot, uint32 spekToken,
+ bool succeeded)
+{
+ zstid tid;
+
+ tid = ZSTidFromItemPointer(slot->tts_tid);
+ zsbt_tid_clear_speculative_token(relation, tid, spekToken, true /* for complete */);
+ /*
+ * If the speculative insertion failed, i.e. there was a conflict, the
+ * insertion would have to be aborted; that is not handled yet.
+ */
+ if (!succeeded)
+ elog(ERROR, "zedstoream_complete_speculative abort is not handled");
+}
+
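+/*
+ * Multi-row variant of zedstoream_insert_internal(). TIDs for all the rows
+ * are allocated with a single call into the TID tree, and the attribute
+ * trees are then populated one column at a time.
+ */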
+static void
+zedstoream_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
+ CommandId cid, int options, BulkInsertState bistate)
+{
+ AttrNumber attno;
+ int i;
+ bool slotgetandset = true;
+ TransactionId xid = GetCurrentTransactionId();
+ int *tupletoasted;
+ Datum *datums;
+ bool *isnulls;
+ zstid *tids;
+ ZSUndoRecPtr prevundoptr;
+
+ tupletoasted = palloc(ntuples * sizeof(int));
+ datums = palloc0(ntuples * sizeof(Datum));
+ isnulls = palloc(ntuples * sizeof(bool));
+ tids = palloc0(ntuples * sizeof(zstid));
+
+ for (i = 0; i < ntuples; i++)
+ isnulls[i] = true;
+
+ ZSUndoRecPtrInitialize(&prevundoptr);
+ zsbt_tid_multi_insert(relation, tids, ntuples,
+ xid, cid, INVALID_SPECULATIVE_TOKEN, prevundoptr);
+
+ /*
+ * We only need to check for table-level SSI locks. Our
+ * new tuple can't possibly conflict with existing tuple locks, and
+ * page locks are only consolidated versions of tuple locks; they do not
+ * lock "gaps" as index page locks do.
+ */
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
+
+ for (attno = 1; attno <= relation->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr((slots[0])->tts_tupleDescriptor, attno - 1);
+ int ntupletoasted = 0;
+
+ for (i = 0; i < ntuples; i++)
+ {
+ Datum datum;
+ bool isnull;
+
+ /* make sure the slot is deformed before reading its values */
+ if (slotgetandset)
+ slot_getallattrs(slots[i]);
+
+ datum = slots[i]->tts_values[attno - 1];
+ isnull = slots[i]->tts_isnull[attno - 1];
+
+ /* If this datum is too large, toast it */
+ if (!isnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(datum) > MaxZedStoreDatumSize)
+ {
+ datum = zedstore_toast_datum(relation, attno, datum);
+ tupletoasted[ntupletoasted++] = i;
+ }
+ datums[i] = datum;
+ isnulls[i] = isnull;
+ }
+
+ zsbt_attr_multi_insert(relation, attno,
+ datums, isnulls, tids, ntuples);
+
+ for (i = 0; i < ntupletoasted; i++)
+ {
+ int idx = tupletoasted[i];
+
+ zedstore_toast_finish(relation, attno, datums[idx], tids[idx]);
+ }
+
+ slotgetandset = false;
+ }
+
+ for (i = 0; i < ntuples; i++)
+ {
+ slots[i]->tts_tableOid = RelationGetRelid(relation);
+ slots[i]->tts_tid = ItemPointerFromZSTid(tids[i]);
+ }
+
+ pgstat_count_heap_insert(relation, ntuples);
+
+ pfree(tids);
+ pfree(tupletoasted);
+ pfree(datums);
+ pfree(isnulls);
+}
+
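+/*
+ * Delete a row, by stamping a DELETE UNDO record on its TID tree item. If
+ * the row is being modified by another transaction and 'wait' is true, wait
+ * for that transaction to finish and retry.
+ */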
+static TM_Result
+zedstoream_delete(Relation relation, ItemPointer tid_p, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart)
+{
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ TM_Result result = TM_Ok;
+
+retry:
+ result = zsbt_tid_delete(relation, tid, xid, cid,
+ snapshot, crosscheck, wait, hufd, changingPart);
+
+ if (result != TM_Ok)
+ {
+ if (result == TM_Invisible)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("attempted to delete invisible tuple")));
+ else if (result == TM_BeingModified && wait)
+ {
+ TransactionId xwait = hufd->xmax;
+
+ /* TODO: use something like heap_acquire_tuplock() for priority */
+ if (!TransactionIdIsCurrentTransactionId(xwait))
+ {
+ XactLockTableWait(xwait, relation, tid_p, XLTW_Delete);
+ goto retry;
+ }
+ }
+ }
+
+ /*
+ * Check for SSI conflicts.
+ */
+ CheckForSerializableConflictIn(relation, tid_p, ItemPointerGetBlockNumber(tid_p));
+
+ if (result == TM_Ok)
+ pgstat_count_heap_delete(relation);
+
+ return result;
+}
+
+
+/*
+ * Each tuple lock mode has a corresponding heavyweight lock, and one or two
+ * corresponding MultiXactStatuses (one to merely lock tuples, another one to
+ * update them). This table (and the macros below) helps us determine the
+ * heavyweight lock mode and MultiXactStatus values to use for any particular
+ * tuple lock strength.
+ *
+ * Don't look at lockstatus/updstatus directly! Use get_mxact_status_for_lock
+ * instead.
+ */
+static const struct
+{
+ LOCKMODE hwlock;
+ int lockstatus;
+ int updstatus;
+}
+
+ tupleLockExtraInfo[MaxLockTupleMode + 1] =
+{
+ { /* LockTupleKeyShare */
+ AccessShareLock,
+ MultiXactStatusForKeyShare,
+ -1 /* KeyShare does not allow updating tuples */
+ },
+ { /* LockTupleShare */
+ RowShareLock,
+ MultiXactStatusForShare,
+ -1 /* Share does not allow updating tuples */
+ },
+ { /* LockTupleNoKeyExclusive */
+ ExclusiveLock,
+ MultiXactStatusForNoKeyUpdate,
+ MultiXactStatusNoKeyUpdate
+ },
+ { /* LockTupleExclusive */
+ AccessExclusiveLock,
+ MultiXactStatusForUpdate,
+ MultiXactStatusUpdate
+ }
+};
+
+
+/*
+ * Acquire heavyweight locks on tuples, using a LockTupleMode strength value.
+ * This is more readable than having every caller translate it to lock.h's
+ * LOCKMODE.
+ */
+#define LockTupleTuplock(rel, tup, mode) \
+ LockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
+#define UnlockTupleTuplock(rel, tup, mode) \
+ UnlockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
+#define ConditionalLockTupleTuplock(rel, tup, mode) \
+ ConditionalLockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
+
+/*
+ * Acquire heavyweight lock on the given tuple, in preparation for acquiring
+ * its normal, Xmax-based tuple lock.
+ *
+ * have_tuple_lock is an input and output parameter: on input, it indicates
+ * whether the lock has previously been acquired (and this function does
+ * nothing in that case). If this function returns success, have_tuple_lock
+ * has been flipped to true.
+ *
+ * Returns false if it was unable to obtain the lock; this can only happen if
+ * wait_policy is Skip.
+ *
+ * XXX: This is identical to heap_acquire_tuplock
+ */
+
+static bool
+zs_acquire_tuplock(Relation relation, ItemPointer tid, LockTupleMode mode,
+ LockWaitPolicy wait_policy, bool *have_tuple_lock)
+{
+ if (*have_tuple_lock)
+ return true;
+
+ switch (wait_policy)
+ {
+ case LockWaitBlock:
+ LockTupleTuplock(relation, tid, mode);
+ break;
+
+ case LockWaitSkip:
+ if (!ConditionalLockTupleTuplock(relation, tid, mode))
+ return false;
+ break;
+
+ case LockWaitError:
+ if (!ConditionalLockTupleTuplock(relation, tid, mode))
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("could not obtain lock on row in relation \"%s\"",
+ RelationGetRelationName(relation))));
+ break;
+ }
+ *have_tuple_lock = true;
+
+ return true;
+}
+
+
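+/*
+ * Lock a row, e.g. for SELECT ... FOR UPDATE/SHARE, by recording a
+ * TUPLE_LOCK UNDO record for it. If the row has been concurrently updated
+ * and the caller asked for the latest version, the update chain is followed,
+ * like in heapam. For KEY SHARE locks, all row versions in the chain are
+ * locked.
+ */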
+static TM_Result
+zedstoream_lock_tuple(Relation relation, ItemPointer tid_p, Snapshot snapshot,
+ TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
+ LockWaitPolicy wait_policy, uint8 flags,
+ TM_FailureData *tmfd)
+{
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ TM_Result result;
+ bool have_tuple_lock = false;
+ zstid next_tid = tid;
+ SnapshotData SnapshotDirty;
+ bool locked_something = false;
+
+ slot->tts_tableOid = RelationGetRelid(relation);
+ slot->tts_tid = *tid_p;
+
+ tmfd->traversed = false;
+ /*
+ * For now, we lock just the first attribute. As long as everyone
+ * does that, that's enough.
+ */
+retry:
+ result = zsbt_tid_lock(relation, tid, xid, cid,
+ mode, snapshot, tmfd, &next_tid);
+
+ if (result == TM_Invisible)
+ {
+ /*
+ * This is possible, but only when locking a tuple for ON CONFLICT
+ * UPDATE. We return this value here rather than throwing an error in
+ * order to give that case the opportunity to throw a more specific
+ * error.
+ */
+ /*
+ * This can also happen, if we're locking an UPDATE chain for KEY SHARE mode:
+ * A tuple has been inserted, and then updated, by a different transaction.
+ * The updating transaction is still in progress. We can lock the row
+ * in KEY SHARE mode, assuming the key columns were not updated, and we will
+ * try to lock all the row version, even the still in-progress UPDATEs.
+ * It's possible that the UPDATE aborts while we're chasing the update chain,
+ * so that the updated tuple becomes invisible to us. That's OK.
+ */
+ if (mode == LockTupleKeyShare && locked_something)
+ return TM_Ok;
+ else
+ return TM_Invisible;
+ }
+ else if (result == TM_Updated ||
+ (result == TM_SelfModified && tmfd->cmax == cid))
+ {
+ /*
+ * The other transaction is an update and it already committed.
+ *
+ * If the caller asked for the latest version, find it.
+ */
+ if ((flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION) != 0 && next_tid != tid)
+ {
+ if (have_tuple_lock)
+ {
+ UnlockTupleTuplock(relation, tid_p, mode);
+ have_tuple_lock = false;
+ }
+
+ if (ItemPointerIndicatesMovedPartitions(&tmfd->ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
+ /* it was updated, so look at the updated version */
+ *tid_p = ItemPointerFromZSTid(next_tid);
+
+ /* signal that a tuple later in the chain is getting locked */
+ tmfd->traversed = true;
+
+ /* loop back to fetch next in chain */
+
+ /* FIXME: In the corresponding code in heapam, we cross-check the xmin/xmax
+ * of the old and new tuple. Should we do the same here?
+ */
+
+ InitDirtySnapshot(SnapshotDirty);
+ snapshot = &SnapshotDirty;
+ tid = next_tid;
+ goto retry;
+ }
+
+ return result;
+ }
+ else if (result == TM_Deleted)
+ {
+ /*
+ * The other transaction is a delete and it already committed.
+ */
+ return result;
+ }
+ else if (result == TM_BeingModified)
+ {
+ TransactionId xwait = tmfd->xmax;
+
+ /*
+ * Acquire tuple lock to establish our priority for the tuple, or
+ * die trying. LockTuple will release us when we are next-in-line
+ * for the tuple. We must do this even if we are share-locking.
+ *
+ * If we are forced to "start over" below, we keep the tuple lock;
+ * this arranges that we stay at the head of the line while
+ * rechecking tuple state.
+ */
+ if (!zs_acquire_tuplock(relation, tid_p, mode, wait_policy,
+ &have_tuple_lock))
+ {
+ /*
+ * This can only happen if wait_policy is Skip and the lock
+ * couldn't be obtained.
+ */
+ return TM_WouldBlock;
+ }
+
+ /* wait for regular transaction to end, or die trying */
+ switch (wait_policy)
+ {
+ case LockWaitBlock:
+ XactLockTableWait(xwait, relation, tid_p, XLTW_Lock);
+ break;
+ case LockWaitSkip:
+ if (!ConditionalXactLockTableWait(xwait))
+ {
+ return TM_WouldBlock;
+ }
+ break;
+ case LockWaitError:
+ if (!ConditionalXactLockTableWait(xwait))
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("could not obtain lock on row in relation \"%s\"",
+ RelationGetRelationName(relation))));
+ break;
+ }
+
+ /*
+ * xwait is done. Retry.
+ */
+ goto retry;
+ }
+ if (result == TM_Ok)
+ locked_something = true;
+
+ /*
+ * Now that we have successfully marked the tuple as locked, we can
+ * release the lmgr tuple lock, if we had it.
+ */
+ if (have_tuple_lock)
+ {
+ UnlockTupleTuplock(relation, tid_p, mode);
+ have_tuple_lock = false;
+ }
+
+ if (mode == LockTupleKeyShare)
+ {
+ /* lock all row versions, if it's a KEY SHARE lock */
+ if (result == TM_Ok && tid != next_tid && next_tid != InvalidZSTid)
+ {
+ tid = next_tid;
+ goto retry;
+ }
+ }
+
+ /* Fetch the tuple, too. */
+ if (!zedstoream_fetch_row_version(relation, tid_p, SnapshotAny, slot))
+ elog(ERROR, "could not fetch locked tuple");
+
+ return TM_Ok;
+}
+
+/* like heap_tuple_attr_equals */
+static bool
+zs_tuple_attr_equals(int attrnum, TupleTableSlot *slot1, TupleTableSlot *slot2)
+{
+ TupleDesc tupdesc = slot1->tts_tupleDescriptor;
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than tableOID; we cannot expect these to be consistent in a HOT chain,
+ * or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapDetermineModifiedColumns do
+ * a single heap_deform_tuple call on each tuple, instead? But that
+ * doesn't work for system columns ...
+ */
+ value1 = slot_getattr(slot1, attrnum, &isnull1);
+ value2 = slot_getattr(slot2, attrnum, &isnull2);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = TupleDescAttr(tupdesc, attrnum - 1);
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+}
+
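+/*
+ * Check whether an UPDATE modifies any "key" columns.
+ *
+ * Compares the old and new tuple versions on the columns reported by
+ * RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY). The caller
+ * uses the result to choose between LockTupleExclusive and
+ * LockTupleNoKeyExclusive for the update.
+ */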
+static bool
+is_key_update(Relation relation, TupleTableSlot *oldslot, TupleTableSlot *newslot)
+{
+ Bitmapset *key_attrs;
+ Bitmapset *interesting_attrs;
+ Bitmapset *modified_attrs;
+ int attnum;
+
+ /*
+ * Fetch the list of attributes to be checked for various operations.
+ *
+ * For HOT considerations, this is wasted effort if we fail to update or
+ * have to put the new tuple on a different page. But we must compute the
+ * list before obtaining buffer lock --- in the worst case, if we are
+ * doing an update on one of the relevant system catalogs, we could
+ * deadlock if we try to fetch the list later. In any case, the relcache
+ * caches the data so this is usually pretty cheap.
+ *
+ * We also need columns used by the replica identity and columns that are
+ * considered the "key" of rows in the table.
+ *
+ * Note that we get copies of each bitmap, so we need not worry about
+ * relcache flush happening midway through.
+ */
+ key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
+
+ interesting_attrs = NULL;
+ interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
+
+ /* Determine columns modified by the update. */
+ modified_attrs = NULL;
+ while ((attnum = bms_first_member(interesting_attrs)) >= 0)
+ {
+ attnum += FirstLowInvalidHeapAttributeNumber;
+
+ if (!zs_tuple_attr_equals(attnum, oldslot, newslot))
+ modified_attrs = bms_add_member(modified_attrs,
+ attnum - FirstLowInvalidHeapAttributeNumber);
+ }
+
+ return bms_overlap(modified_attrs, key_attrs);
+}
+
+static TM_Result
+zedstoream_update(Relation relation, ItemPointer otid_p, TupleTableSlot *slot,
+ CommandId cid, Snapshot snapshot, Snapshot crosscheck,
+ bool wait, TM_FailureData *hufd,
+ LockTupleMode *lockmode, bool *update_indexes)
+{
+ zstid otid = ZSTidFromItemPointer(*otid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ AttrNumber attno;
+ bool key_update;
+ Datum *d;
+ bool *isnulls;
+ TM_Result result;
+ zstid newtid;
+ TupleTableSlot *oldslot;
+ IndexFetchTableData *fetcher;
+ ZSUndoRecPtr prevundoptr;
+
+ ZSUndoRecPtrInitialize(&prevundoptr);
+
+ *update_indexes = true;
+
+ slot_getallattrs(slot);
+ d = slot->tts_values;
+ isnulls = slot->tts_isnull;
+
+ oldslot = table_slot_create(relation, NULL);
+ fetcher = zedstoream_begin_index_fetch(relation);
+
+ /*
+ * The meta-attribute holds the visibility information, including the "t_ctid"
+ * pointer to the updated version. All the real attributes are just inserted,
+ * as if for a new row.
+ */
+retry:
+ newtid = InvalidZSTid;
+
+ /*
+ * Fetch the old row, so that we can figure out which columns were modified.
+ *
+ * FIXME: if we have to follow the update chain, we should look at the
+ * currently latest tuple version, rather than the one visible to our snapshot.
+ */
+ if (!zedstoream_fetch_row((ZedStoreIndexFetchData *) fetcher,
+ otid_p, SnapshotAny, oldslot))
+ {
+ return TM_Invisible;
+ }
+ key_update = is_key_update(relation, oldslot, slot);
+
+ *lockmode = key_update ? LockTupleExclusive : LockTupleNoKeyExclusive;
+
+ result = zsbt_tid_update(relation, otid,
+ xid, cid, key_update, snapshot, crosscheck,
+ wait, hufd, &newtid);
+
+ if (result == TM_Ok)
+ {
+ /*
+ * Check for SSI conflicts.
+ */
+ CheckForSerializableConflictIn(relation, otid_p, ItemPointerGetBlockNumber(otid_p));
+
+ for (attno = 1; attno <= relation->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr(relation->rd_att, attno - 1);
+ Datum newdatum = d[attno - 1];
+ bool newisnull = isnulls[attno - 1];
+ Datum toastptr = (Datum) 0;
+
+ if (!newisnull && attr->attlen < 0 && VARATT_IS_EXTERNAL(newdatum))
+ newdatum = PointerGetDatum(heap_tuple_fetch_attr((struct varlena *) DatumGetPointer(newdatum)));
+
+ /* If this datum is too large, toast it */
+ if (!newisnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(newdatum) > MaxZedStoreDatumSize)
+ {
+ toastptr = newdatum = zedstore_toast_datum(relation, attno, newdatum);
+ }
+
+ zsbt_attr_multi_insert(relation, attno,
+ &newdatum, &newisnull, &newtid, 1);
+
+ if (toastptr != (Datum) 0)
+ zedstore_toast_finish(relation, attno, toastptr, newtid);
+ }
+
+ slot->tts_tid = ItemPointerFromZSTid(newtid);
+
+ pgstat_count_heap_update(relation, false);
+ }
+ else
+ {
+ if (result == TM_Invisible)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("attempted to update invisible tuple")));
+ else if (result == TM_BeingModified && wait)
+ {
+ TransactionId xwait = hufd->xmax;
+
+ /* TODO: use something like heap_acquire_tuplock() for priority */
+ if (!TransactionIdIsCurrentTransactionId(xwait))
+ {
+ XactLockTableWait(xwait, relation, otid_p, XLTW_Delete);
+ goto retry;
+ }
+ }
+ }
+
+ zedstoream_end_index_fetch(fetcher);
+ ExecDropSingleTupleTableSlot(oldslot);
+
+ return result;
+}
+
+static const TupleTableSlotOps *
+zedstoream_slot_callbacks(Relation relation)
+{
+ return &TTSOpsZedstore;
+}
+
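+/*
+ * Set up the column projection for a scan or fetch.
+ *
+ * Converts the caller-supplied boolean 'project_columns' array into an
+ * array of attribute numbers ('proj_atts'), with the meta-attribute always
+ * included first, and allocates the corresponding array of btree scan
+ * states. Both allocations are made in the projection's memory context so
+ * that they live as long as the scan.
+ */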
+static inline void
+zs_initialize_proj_attributes(TupleDesc tupledesc, ZedStoreProjectData *proj_data)
+{
+ MemoryContext oldcontext;
+
+ if (proj_data->num_proj_atts != 0)
+ return;
+
+ oldcontext = MemoryContextSwitchTo(proj_data->context);
+ /* add one for meta-attribute */
+ proj_data->proj_atts = palloc((tupledesc->natts + 1) * sizeof(int));
+ proj_data->btree_scans = palloc0((tupledesc->natts + 1) * sizeof(ZSBtreeScan));
+
+ proj_data->proj_atts[proj_data->num_proj_atts++] = ZS_META_ATTRIBUTE_NUM;
+
+ /*
+ * Convert the boolean array into an array of the attribute numbers of the
+ * required columns.
+ */
+ for (int idx = 0; idx < tupledesc->natts; idx++)
+ {
+ int att_no = idx + 1;
+
+ /*
+ * Never project dropped columns; NULL is returned for them in the
+ * slot by default.
+ */
+ if (TupleDescAttr(tupledesc, idx)->attisdropped)
+ continue;
+
+ /* A NULL project_columns array means that all columns are needed */
+ if (proj_data->project_columns == NULL || proj_data->project_columns[idx])
+ proj_data->proj_atts[proj_data->num_proj_atts++] = att_no;
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
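+/*
+ * Like zs_initialize_proj_attributes(), but also allocates the per-block
+ * TID/datum/isnull arrays used by bitmap, sample and analyze scans.
+ */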
+static inline void
+zs_initialize_proj_attributes_extended(ZedStoreDesc scan, TupleDesc tupledesc)
+{
+ MemoryContext oldcontext;
+ ZedStoreProjectData *proj_data = &scan->proj_data;
+
+ /* if already initialized return */
+ if (proj_data->num_proj_atts != 0)
+ return;
+
+ zs_initialize_proj_attributes(tupledesc, proj_data);
+
+ oldcontext = MemoryContextSwitchTo(proj_data->context);
+ /* Extra setup for bitmap, sample and analyze scans */
+ if ((scan->rs_scan.rs_flags & SO_TYPE_BITMAPSCAN) ||
+ (scan->rs_scan.rs_flags & SO_TYPE_SAMPLESCAN) ||
+ (scan->rs_scan.rs_flags & SO_TYPE_ANALYZE))
+ {
+ scan->bmscan_ntuples = 0;
+ scan->bmscan_tids = palloc(MAX_ITEMS_PER_LOGICAL_BLOCK * sizeof(zstid));
+
+ scan->bmscan_datums = palloc(proj_data->num_proj_atts * sizeof(Datum *));
+ scan->bmscan_isnulls = palloc(proj_data->num_proj_atts * sizeof(bool *));
+ for (int i = 0; i < proj_data->num_proj_atts; i++)
+ {
+ scan->bmscan_datums[i] = palloc(MAX_ITEMS_PER_LOGICAL_BLOCK * sizeof(Datum));
+ scan->bmscan_isnulls[i] = palloc(MAX_ITEMS_PER_LOGICAL_BLOCK * sizeof(bool));
+ }
+ }
+ MemoryContextSwitchTo(oldcontext);
+}
+
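+/*
+ * Begin a sequential scan, optionally with a column projection.
+ *
+ * 'project_columns' is a boolean array indexed by (attno - 1); NULL means
+ * that all columns are needed. As an illustration only (not part of this
+ * patch), a caller that needs just the second attribute of a three-column
+ * table could build the projection like this:
+ *
+ * bool *proj = palloc0(3 * sizeof(bool));
+ * proj[1] = true;
+ * scan = table_beginscan_with_column_projection(rel, snapshot, 0, NULL, proj);
+ *
+ * which mirrors how zedstoream_index_validate_scan() and
+ * zedstoream_index_build_range_scan() build their projections below.
+ */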
+static TableScanDesc
+zedstoream_beginscan_with_column_projection(Relation relation, Snapshot snapshot,
+ int nkeys, ScanKey key,
+ ParallelTableScanDesc parallel_scan,
+ uint32 flags,
+ bool *project_columns)
+{
+ ZedStoreDesc scan;
+
+ /* An ANALYZE scan passes no snapshot, but we need one */
+ if (!snapshot)
+ {
+ Assert(!(flags & SO_TYPE_SAMPLESCAN));
+ snapshot = SnapshotAny;
+ }
+
+ /*
+ * allocate and initialize scan descriptor
+ */
+ scan = (ZedStoreDesc) palloc0(sizeof(ZedStoreDescData));
+
+ scan->rs_scan.rs_rd = relation;
+ scan->rs_scan.rs_snapshot = snapshot;
+ scan->rs_scan.rs_nkeys = nkeys;
+ scan->rs_scan.rs_flags = flags;
+ scan->rs_scan.rs_parallel = parallel_scan;
+
+ /*
+ * we can use page-at-a-time mode if it's an MVCC-safe snapshot
+ */
+ scan->state = ZSSCAN_STATE_UNSTARTED;
+
+ /*
+ * we do this here instead of in initscan() because heap_rescan also calls
+ * initscan() and we don't want to allocate memory again
+ */
+ if (nkeys > 0)
+ scan->rs_scan.rs_key = (ScanKey) palloc(sizeof(ScanKeyData) * nkeys);
+ else
+ scan->rs_scan.rs_key = NULL;
+
+ scan->proj_data.context = CurrentMemoryContext;
+ scan->proj_data.project_columns = project_columns;
+
+ /*
+ * For a seqscan in a serializable transaction, acquire a predicate lock
+ * on the entire relation. This is required not only to lock all the
+ * matching tuples, but also to conflict with new insertions into the
+ * table. In an indexscan, we take page locks on the index pages covering
+ * the range specified in the scan qual, but in a heap scan there is
+ * nothing more fine-grained to lock. A bitmap scan is a different story,
+ * there we have already scanned the index and locked the index pages
+ * covering the predicate. But in that case we still have to lock any
+ * matching heap tuples.
+ */
+ if (!(flags & SO_TYPE_BITMAPSCAN) &&
+ !(flags & SO_TYPE_ANALYZE))
+ PredicateLockRelation(relation, snapshot);
+
+ /*
+ * Currently, we don't have a stats counter for bitmap heap scans (but the
+ * underlying bitmap index scans will be counted) or sample scans (we only
+ * update stats for tuple fetches there)
+ */
+ if (!(flags & SO_TYPE_BITMAPSCAN) && !(flags & SO_TYPE_SAMPLESCAN))
+ pgstat_count_heap_scan(relation);
+
+ return (TableScanDesc) scan;
+}
+
+static TableScanDesc
+zedstoream_beginscan(Relation relation, Snapshot snapshot,
+ int nkeys, ScanKey key,
+ ParallelTableScanDesc parallel_scan,
+ uint32 flags)
+{
+ return zedstoream_beginscan_with_column_projection(relation, snapshot,
+ nkeys, key, parallel_scan, flags, NULL);
+}
+
+static void
+zedstoream_endscan(TableScanDesc sscan)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ ZedStoreProjectData *proj_data = &scan->proj_data;
+
+ if (proj_data->proj_atts)
+ pfree(proj_data->proj_atts);
+
+ if (proj_data->num_proj_atts > 0)
+ {
+ zsbt_tid_end_scan(&proj_data->btree_scans[0]);
+ for (int i = 1; i < proj_data->num_proj_atts; i++)
+ zsbt_attr_end_scan(&proj_data->btree_scans[i]);
+ }
+
+ if (scan->rs_scan.rs_flags & SO_TEMP_SNAPSHOT)
+ UnregisterSnapshot(scan->rs_scan.rs_snapshot);
+
+ if (proj_data->btree_scans)
+ pfree(proj_data->btree_scans);
+ pfree(scan);
+}
+
+static void
+zedstoream_rescan(TableScanDesc sscan, struct ScanKeyData *key,
+ bool set_params, bool allow_strat,
+ bool allow_sync, bool allow_pagemode)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+
+ /* these params don't do much in zedstore yet, but whatever */
+ if (set_params)
+ {
+ if (allow_strat)
+ scan->rs_scan.rs_flags |= SO_ALLOW_STRAT;
+ else
+ scan->rs_scan.rs_flags &= ~SO_ALLOW_STRAT;
+
+ if (allow_sync)
+ scan->rs_scan.rs_flags |= SO_ALLOW_SYNC;
+ else
+ scan->rs_scan.rs_flags &= ~SO_ALLOW_SYNC;
+
+ if (allow_pagemode && scan->rs_scan.rs_snapshot &&
+ IsMVCCSnapshot(scan->rs_scan.rs_snapshot))
+ scan->rs_scan.rs_flags |= SO_ALLOW_PAGEMODE;
+ else
+ scan->rs_scan.rs_flags &= ~SO_ALLOW_PAGEMODE;
+ }
+
+ if (scan->proj_data.num_proj_atts > 0)
+ {
+ zsbt_tid_end_scan(&scan->proj_data.btree_scans[0]);
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ zsbt_attr_end_scan(&scan->proj_data.btree_scans[i]);
+ }
+
+ scan->state = ZSSCAN_STATE_UNSTARTED;
+}
+
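+/*
+ * Fetch the next tuple in a sequential scan.
+ *
+ * The scan proceeds in TID ranges: for each range (the whole table, or a
+ * chunk handed out by the parallel scan), we open a scan on the TID tree
+ * plus one scan per projected attribute tree. The TID tree yields the next
+ * visible TID, and the attribute trees are then consulted for that TID's
+ * datums. Columns that are not projected are left NULL in the slot.
+ */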
+static bool
+zedstoream_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ ZedStoreProjectData *scan_proj = &scan->proj_data;
+ int i;
+ int slot_natts = slot->tts_tupleDescriptor->natts;
+ Datum *slot_values = slot->tts_values;
+ bool *slot_isnull = slot->tts_isnull;
+
+ if (direction != ForwardScanDirection)
+ elog(ERROR, "backward scan not implemented in zedstore");
+
+ zs_initialize_proj_attributes(slot->tts_tupleDescriptor, scan_proj);
+ Assert((scan_proj->num_proj_atts - 1) <= slot_natts);
+
+ /*
+ * Initialize the slot.
+ *
+ * We initialize all columns to NULL. The values for columns that are projected
+ * will be set to the actual values below, but it's important that non-projected
+ * columns are NULL.
+ */
+ ExecClearTuple(slot);
+ for (i = 0; i < slot_natts; i++)
+ slot_isnull[i] = true;
+
+ while (scan->state != ZSSCAN_STATE_FINISHED)
+ {
+ zstid this_tid;
+ Datum datum;
+ bool isnull;
+
+ if (scan->state == ZSSCAN_STATE_UNSTARTED ||
+ scan->state == ZSSCAN_STATE_FINISHED_RANGE)
+ {
+ MemoryContext oldcontext;
+
+ if (scan->rs_scan.rs_parallel)
+ {
+ /* Allocate next range of TIDs to scan */
+ if (!zs_parallelscan_nextrange(scan->rs_scan.rs_rd,
+ (ParallelZSScanDesc) scan->rs_scan.rs_parallel,
+ &scan->cur_range_start, &scan->cur_range_end))
+ {
+ scan->state = ZSSCAN_STATE_FINISHED;
+ break;
+ }
+ }
+ else
+ {
+ if (scan->state == ZSSCAN_STATE_FINISHED_RANGE)
+ {
+ scan->state = ZSSCAN_STATE_FINISHED;
+ break;
+ }
+ scan->cur_range_start = MinZSTid;
+ scan->cur_range_end = MaxPlusOneZSTid;
+ }
+
+ oldcontext = MemoryContextSwitchTo(scan_proj->context);
+ zsbt_tid_begin_scan(scan->rs_scan.rs_rd,
+ scan->cur_range_start,
+ scan->cur_range_end,
+ scan->rs_scan.rs_snapshot,
+ &scan_proj->btree_scans[0]);
+ scan_proj->btree_scans[0].serializable = true;
+ for (int i = 1; i < scan_proj->num_proj_atts; i++)
+ {
+ int attno = scan_proj->proj_atts[i];
+
+ zsbt_attr_begin_scan(scan->rs_scan.rs_rd,
+ slot->tts_tupleDescriptor,
+ attno,
+ scan->cur_range_start,
+ scan->cur_range_end,
+ &scan_proj->btree_scans[i]);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ scan->state = ZSSCAN_STATE_SCANNING;
+ }
+
+ /* We now have a range to scan. Find the next visible TID. */
+ Assert(scan->state == ZSSCAN_STATE_SCANNING);
+
+ this_tid = zsbt_tid_scan_next(&scan_proj->btree_scans[0]);
+ if (this_tid == InvalidZSTid)
+ {
+ scan->state = ZSSCAN_STATE_FINISHED_RANGE;
+ }
+ else
+ {
+ Assert (this_tid < scan->cur_range_end);
+
+ /* Note: We don't need to predicate-lock tuples in Serializable mode,
+ * because in a sequential scan, we predicate-locked the whole table.
+ */
+
+ /* Fetch the datums of each attribute for this row */
+ for (int i = 1; i < scan_proj->num_proj_atts; i++)
+ {
+ ZSBtreeScan *btscan = &scan_proj->btree_scans[i];
+ Form_pg_attribute attr = ZSBtreeScanGetAttInfo(btscan);
+ int natt;
+
+ if (!zsbt_scan_next_fetch(btscan, &datum, &isnull, this_tid))
+ zsbt_fill_missing_attribute_value(btscan, &datum, &isnull);
+
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ natt = scan_proj->proj_atts[i];
+
+ if (!isnull && attr->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(scan->rs_scan.rs_rd, natt, this_tid, datum);
+ }
+
+ /* Check that the values coming out of the b-tree are aligned properly */
+ if (!isnull && attr->attlen == -1)
+ {
+ Assert (VARATT_IS_1B(datum) || INTALIGN(datum) == datum);
+ }
+
+ if (natt != ZS_META_ATTRIBUTE_NUM)
+ {
+ Assert(natt > 0);
+ slot_values[natt - 1] = datum;
+ slot_isnull[natt - 1] = isnull;
+ }
+ }
+ }
+
+ if (scan->state == ZSSCAN_STATE_FINISHED_RANGE)
+ {
+ zsbt_tid_end_scan(&scan_proj->btree_scans[0]);
+ for (int i = 1; i < scan_proj->num_proj_atts; i++)
+ zsbt_attr_end_scan(&scan_proj->btree_scans[i]);
+ }
+ else
+ {
+ Assert(scan->state == ZSSCAN_STATE_SCANNING);
+ slot->tts_tid = ItemPointerFromZSTid(this_tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ pgstat_count_heap_getnext(scan->rs_scan.rs_rd);
+ return true;
+ }
+ }
+
+ ExecClearTuple(slot);
+ return false;
+}
+
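+/*
+ * Check whether a TID could exist in the table. Any TID up to the highest
+ * TID allocated so far is considered valid; the maximum is fetched once
+ * per scan and cached in the scan descriptor.
+ */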
+static bool
+zedstoream_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ zstid ztid = ZSTidFromItemPointer(*tid);
+
+ if (scan->max_tid_to_scan == InvalidZSTid)
+ {
+ /*
+ * get the max tid once and store it
+ */
+ scan->max_tid_to_scan = zsbt_get_last_tid(sscan->rs_rd);
+ }
+
+ /*
+ * FIXME: should we get lowest TID as well to further optimize the check.
+ */
+ if (ztid <= scan->max_tid_to_scan)
+ return true;
+ else
+ return false;
+}
+
+static bool
+zedstoream_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
+ Snapshot snapshot)
+{
+ /*
+ * TODO: we didn't keep any visibility information about the tuple in the
+ * slot, so we have to fetch it again. A custom slot type might be a
+ * good idea..
+ */
+ zstid tid = ZSTidFromItemPointer(slot->tts_tid);
+ ZSBtreeScan meta_scan;
+ bool found;
+
+ /* Use the meta-data tree for the visibility information. */
+ zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, &meta_scan);
+
+ found = zsbt_tid_scan_next(&meta_scan) != InvalidZSTid;
+
+ zsbt_tid_end_scan(&meta_scan);
+
+ return found;
+}
+
+static TransactionId
+zedstoream_compute_xid_horizon_for_tuples(Relation rel,
+ ItemPointerData *items,
+ int nitems)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function %s not implemented yet", __func__)));
+
+}
+
+static IndexFetchTableData *
+zedstoream_begin_index_fetch(Relation rel)
+{
+ ZedStoreIndexFetch zscan = palloc0(sizeof(ZedStoreIndexFetchData));
+
+ zscan->idx_fetch_data.rel = rel;
+ zscan->proj_data.context = CurrentMemoryContext;
+
+ return (IndexFetchTableData *) zscan;
+}
+
+static void
+zedstoream_fetch_set_column_projection(struct IndexFetchTableData *scan,
+ bool *project_columns)
+{
+ ZedStoreIndexFetch zscan = (ZedStoreIndexFetch) scan;
+ zscan->proj_data.project_columns = project_columns;
+}
+
+static void
+zedstoream_reset_index_fetch(IndexFetchTableData *scan)
+{
+ /* TODO: we could close the scans here, but currently we don't bother */
+}
+
+static void
+zedstoream_end_index_fetch(IndexFetchTableData *scan)
+{
+ ZedStoreIndexFetch zscan = (ZedStoreIndexFetch) scan;
+ ZedStoreProjectData *zscan_proj = &zscan->proj_data;
+
+ if (zscan_proj->num_proj_atts > 0)
+ {
+ zsbt_tid_end_scan(&zscan_proj->btree_scans[0]);
+ for (int i = 1; i < zscan_proj->num_proj_atts; i++)
+ zsbt_attr_end_scan(&zscan_proj->btree_scans[i]);
+ }
+
+ if (zscan_proj->proj_atts)
+ pfree(zscan_proj->proj_atts);
+
+ if (zscan_proj->btree_scans)
+ pfree(zscan_proj->btree_scans);
+ pfree(zscan);
+}
+
+static bool
+zedstoream_index_fetch_tuple(struct IndexFetchTableData *scan,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot,
+ bool *call_again, bool *all_dead)
+{
+ bool result;
+
+ /*
+ * we don't do in-place updates, so this is essentially the same as
+ * fetch_row_version.
+ */
+ if (call_again)
+ *call_again = false;
+ if (all_dead)
+ *all_dead = false;
+
+ result = zedstoream_fetch_row((ZedStoreIndexFetchData *) scan, tid_p, snapshot, slot);
+ if (result)
+ {
+ /* FIXME: heapam acquires the predicate lock first, and then
+ * calls CheckForSerializableConflictOut(). We do it in the
+ * opposite order, because the CheckForSerializableConflictOut()
+ * call is already done in zsbt_get_last_tid(). Does it matter?
+ * I'm not sure.
+ */
+ PredicateLockTID(scan->rel, tid_p, snapshot);
+ }
+ return result;
+}
+
+/*
+ * Shared implementation of fetch_row_version and index_fetch_tuple callbacks.
+ */
+static bool
+zedstoream_fetch_row(ZedStoreIndexFetchData *fetch,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot)
+{
+ Relation rel = fetch->idx_fetch_data.rel;
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ bool found = true;
+ ZedStoreProjectData *fetch_proj = &fetch->proj_data;
+
+ /* first time here, initialize */
+ if (fetch_proj->num_proj_atts == 0)
+ zs_initialize_proj_attributes(slot->tts_tupleDescriptor, fetch_proj);
+ else
+ {
+ /* If we had previous fetches still open, close them first */
+ zsbt_tid_end_scan(&fetch_proj->btree_scans[0]);
+ for (int i = 1; i < fetch_proj->num_proj_atts; i++)
+ zsbt_attr_end_scan(&fetch_proj->btree_scans[i]);
+ }
+
+ /*
+ * Initialize the slot.
+ *
+ * If we're not fetching all columns, initialize the unfetched values
+ * in the slot to NULL. (Actually, this initializes all to NULL, and the
+ * code below will overwrite them for the columns that are projected)
+ */
+ ExecClearTuple(slot);
+ for (int i = 0; i < slot->tts_tupleDescriptor->natts; i++)
+ slot->tts_isnull[i] = true;
+
+ zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, &fetch_proj->btree_scans[0]);
+ fetch_proj->btree_scans[0].serializable = true;
+ found = zsbt_tid_scan_next(&fetch_proj->btree_scans[0]) != InvalidZSTid;
+ if (found)
+ {
+ for (int i = 1; i < fetch_proj->num_proj_atts; i++)
+ {
+ int natt = fetch_proj->proj_atts[i];
+ ZSBtreeScan *btscan = &fetch_proj->btree_scans[i];
+ Form_pg_attribute attr;
+ Datum datum;
+ bool isnull;
+
+ zsbt_attr_begin_scan(rel, slot->tts_tupleDescriptor, natt, tid, tid + 1,
+ btscan);
+
+ attr = ZSBtreeScanGetAttInfo(btscan);
+ if (zsbt_scan_next_fetch(btscan, &datum, &isnull, tid))
+ {
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ if (!isnull && attr->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(rel, natt, tid, datum);
+ }
+ }
+ else
+ zsbt_fill_missing_attribute_value(btscan, &datum, &isnull);
+
+ slot->tts_values[natt - 1] = datum;
+ slot->tts_isnull[natt - 1] = isnull;
+ }
+ }
+
+ if (found)
+ {
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+ return true;
+ }
+
+ return false;
+}
+
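+/*
+ * validate_index() support: insert index entries for all visible tuples
+ * that are not already present in the index.
+ *
+ * The tuplesort in 'state' contains the TIDs already in the index, in TID
+ * order. We scan the table in TID order as well and merge the two streams:
+ * whenever a table TID is missing from the sorted stream (and satisfies
+ * any partial-index predicate), the index tuple is formed and inserted.
+ */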
+static void
+zedstoream_index_validate_scan(Relation baseRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ Snapshot snapshot,
+ ValidateIndexState *state)
+{
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ ExprState *predicate;
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
+ bool *proj;
+ int attno;
+ TableScanDesc scan;
+ ItemPointerData idx_ptr;
+ bool tuplesort_empty = false;
+
+ /*
+ * sanity checks
+ */
+ Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+ /*
+ * Need an EState for evaluation of index expressions and partial-index
+ * predicates. Also a slot to hold the current tuple.
+ */
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ slot = table_slot_create(baseRelation, NULL);
+
+ /* Arrange for econtext's scan tuple to be the tuple under test */
+ econtext->ecxt_scantuple = slot;
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+
+ /*
+ * Prepare for scan of the base relation. We need just those tuples
+ * satisfying the passed-in reference snapshot. We must disable syncscan
+ * here, because it's critical that we read from block zero forward to
+ * match the sorted TIDs.
+ */
+
+ /*
+ * TODO: It would be very good to fetch only the columns we need.
+ */
+ proj = palloc0(baseRelation->rd_att->natts * sizeof(bool));
+ for (attno = 0; attno < indexInfo->ii_NumIndexKeyAttrs; attno++)
+ {
+ Assert(indexInfo->ii_IndexAttrNumbers[attno] <= baseRelation->rd_att->natts);
+ /* skip expressions */
+ if (indexInfo->ii_IndexAttrNumbers[attno] > 0)
+ proj[indexInfo->ii_IndexAttrNumbers[attno] - 1] = true;
+ }
+ GetNeededColumnsForNode((Node *)indexInfo->ii_Predicate, proj,
+ baseRelation->rd_att->natts);
+ GetNeededColumnsForNode((Node *)indexInfo->ii_Expressions, proj,
+ baseRelation->rd_att->natts);
+
+ scan = table_beginscan_with_column_projection(baseRelation, /* relation */
+ snapshot, /* snapshot */
+ 0, /* number of keys */
+ NULL, /* scan key */
+ proj);
+
+ /*
+ * Scan all tuples matching the snapshot.
+ */
+ ItemPointerSet(&idx_ptr, 0, 0); /* this is less than any real TID */
+ while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
+ {
+ ItemPointerData tup_ptr = slot->tts_tid;
+ HeapTuple heapTuple;
+ int cmp;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * TODO: Once we have in-place updates, like HOT, this will need
+ * to work harder, like heapam's function.
+ */
+
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
+
+ if (tuplesort_empty)
+ cmp = -1;
+ else
+ {
+ while ((cmp = ItemPointerCompare(&tup_ptr, &idx_ptr)) > 0)
+ {
+ Datum ts_val;
+ bool ts_isnull;
+
+ tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+ &ts_val, &ts_isnull, NULL);
+ if (!tuplesort_empty)
+ {
+ Assert(!ts_isnull);
+ itemptr_decode(&idx_ptr, DatumGetInt64(ts_val));
+
+ /* If int8 is pass-by-ref, free (encoded) TID Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(ts_val));
+#endif
+ break;
+ }
+ else
+ {
+ /* Be tidy */
+ ItemPointerSetInvalid(&idx_ptr);
+ cmp = -1;
+ }
+ }
+ }
+ if (cmp < 0)
+ {
+ /* This item is not in the index */
+
+ /*
+ * In a partial index, discard tuples that don't satisfy the
+ * predicate.
+ */
+ if (predicate != NULL)
+ {
+ if (!ExecQual(predicate, econtext))
+ continue;
+ }
+
+ /*
+ * For the current heap tuple, extract all the attributes we use in
+ * this index, and note which are null. This also performs evaluation
+ * of any expressions needed.
+ */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ /* Call the AM's callback routine to process the tuple */
+ heapTuple = ExecCopySlotHeapTuple(slot);
+ heapTuple->t_self = slot->tts_tid;
+ index_insert(indexRelation, values, isnull, &tup_ptr, baseRelation,
+ indexInfo->ii_Unique ?
+ UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+ indexInfo);
+ pfree(heapTuple);
+
+ state->tups_inserted += 1;
+ }
+ }
+
+ table_endscan(scan);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ FreeExecutorState(estate);
+
+ /* These may have been pointing to the now-gone estate */
+ indexInfo->ii_ExpressionsState = NIL;
+ indexInfo->ii_PredicateState = NULL;
+}
+
+static double
+zedstoream_index_build_range_scan(Relation baseRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ bool anyvisible,
+ bool progress,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state,
+ TableScanDesc scan)
+{
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ double reltuples;
+ ExprState *predicate;
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
+ Snapshot snapshot;
+ SnapshotData NonVacuumableSnapshot;
+ bool need_unregister_snapshot = false;
+ TransactionId OldestXmin;
+
+#ifdef USE_ASSERT_CHECKING
+ bool checking_uniqueness;
+ /* See whether we're verifying uniqueness/exclusion properties */
+ checking_uniqueness = (indexInfo->ii_Unique ||
+ indexInfo->ii_ExclusionOps != NULL);
+
+ /*
+ * "Any visible" mode is not compatible with uniqueness checks; make sure
+ * only one of those is requested.
+ */
+ Assert(!(anyvisible && checking_uniqueness));
+#endif
+
+ /*
+ * sanity checks
+ */
+ Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+ /*
+ * Need an EState for evaluation of index expressions and partial-index
+ * predicates. Also a slot to hold the current tuple.
+ */
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ slot = table_slot_create(baseRelation, NULL);
+
+ /* Arrange for econtext's scan tuple to be the tuple under test */
+ econtext->ecxt_scantuple = slot;
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+
+ /*
+ * Prepare for scan of the base relation. In a normal index build, we use
+ * SnapshotAny because we must retrieve all tuples and do our own time
+ * qual checks (because we have to index RECENTLY_DEAD tuples). In a
+ * concurrent build, or during bootstrap, we take a regular MVCC snapshot
+ * and index whatever's live according to that.
+ */
+ OldestXmin = InvalidTransactionId;
+
+ /* okay to ignore lazy VACUUMs here */
+ if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
+ OldestXmin = GetOldestXmin(baseRelation, PROCARRAY_FLAGS_VACUUM);
+
+ /*
+ * TODO: It would be very good to fetch only the columns we need.
+ */
+ if (!scan)
+ {
+ bool *proj;
+ int attno;
+
+ /*
+ * Serial index build.
+ *
+ * Must begin our own zedstore scan in this case. We may also need to
+ * register a snapshot whose lifetime is under our direct control.
+ */
+ if (!TransactionIdIsValid(OldestXmin))
+ {
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ need_unregister_snapshot = true;
+ }
+ else
+ {
+ /* leave out completely dead items even with SnapshotAny */
+ InitNonVacuumableSnapshot(NonVacuumableSnapshot, OldestXmin);
+ snapshot = &NonVacuumableSnapshot;
+ }
+
+ proj = palloc0(baseRelation->rd_att->natts * sizeof(bool));
+ for (attno = 0; attno < indexInfo->ii_NumIndexKeyAttrs; attno++)
+ {
+ Assert(indexInfo->ii_IndexAttrNumbers[attno] <= baseRelation->rd_att->natts);
+ /* skip expressions */
+ if (indexInfo->ii_IndexAttrNumbers[attno] > 0)
+ proj[indexInfo->ii_IndexAttrNumbers[attno] - 1] = true;
+ }
+
+ GetNeededColumnsForNode((Node *)indexInfo->ii_Predicate, proj,
+ baseRelation->rd_att->natts);
+ GetNeededColumnsForNode((Node *)indexInfo->ii_Expressions, proj,
+ baseRelation->rd_att->natts);
+
+ scan = table_beginscan_with_column_projection(baseRelation, /* relation */
+ snapshot, /* snapshot */
+ 0, /* number of keys */
+ NULL, /* scan key */
+ proj);
+
+ if (start_blockno != 0 || numblocks != InvalidBlockNumber)
+ {
+ ZedStoreDesc zscan = (ZedStoreDesc) scan;
+ ZedStoreProjectData *zscan_proj = &zscan->proj_data;
+
+ zscan->cur_range_start = ZSTidFromBlkOff(start_blockno, 1);
+ zscan->cur_range_end = ZSTidFromBlkOff(numblocks, 1);
+
+ /* FIXME: when can 'num_proj_atts' be 0? */
+ if (zscan_proj->num_proj_atts > 0)
+ {
+ zsbt_tid_begin_scan(zscan->rs_scan.rs_rd,
+ zscan->cur_range_start,
+ zscan->cur_range_end,
+ zscan->rs_scan.rs_snapshot,
+ &zscan_proj->btree_scans[0]);
+ for (int i = 1; i < zscan_proj->num_proj_atts; i++)
+ {
+ int natt = zscan_proj->proj_atts[i];
+
+ zsbt_attr_begin_scan(zscan->rs_scan.rs_rd,
+ RelationGetDescr(zscan->rs_scan.rs_rd),
+ natt,
+ zscan->cur_range_start,
+ zscan->cur_range_end,
+ &zscan_proj->btree_scans[i]);
+ }
+ }
+ zscan->state = ZSSCAN_STATE_SCANNING;
+ }
+ }
+ else
+ {
+ /*
+ * Parallel index build.
+ *
+ * Parallel case never registers/unregisters own snapshot. Snapshot
+ * is taken from parallel zedstore scan, and is SnapshotAny or an MVCC
+ * snapshot, based on same criteria as serial case.
+ */
+ Assert(!IsBootstrapProcessingMode());
+ Assert(allow_sync);
+ Assert(start_blockno == 0);
+ Assert(numblocks == InvalidBlockNumber);
+ snapshot = scan->rs_snapshot;
+
+ if (snapshot == SnapshotAny)
+ {
+ /* leave out completely dead items even with SnapshotAny */
+ InitNonVacuumableSnapshot(NonVacuumableSnapshot, OldestXmin);
+ snapshot = &NonVacuumableSnapshot;
+ }
+ }
+
+ /*
+ * Must call GetOldestXmin() with SnapshotAny. Should never call
+ * GetOldestXmin() with MVCC snapshot. (It's especially worth checking
+ * this for parallel builds, since ambuild routines that support parallel
+ * builds must work these details out for themselves.)
+ */
+ Assert(snapshot == &NonVacuumableSnapshot || IsMVCCSnapshot(snapshot));
+ Assert(snapshot == &NonVacuumableSnapshot ? TransactionIdIsValid(OldestXmin) :
+ !TransactionIdIsValid(OldestXmin));
+ Assert(snapshot == &NonVacuumableSnapshot || !anyvisible);
+
+ reltuples = 0;
+
+ /*
+ * Scan all tuples in the base relation.
+ */
+ while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
+ {
+ bool tupleIsAlive;
+ HeapTuple heapTuple;
+
+ if (numblocks != InvalidBlockNumber &&
+ ItemPointerGetBlockNumber(&slot->tts_tid) >= numblocks)
+ break;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* table_scan_getnextslot did the visibility check */
+ tupleIsAlive = true;
+ reltuples += 1;
+
+ /*
+ * TODO: Once we have in-place updates, like HOT, this will need
+ * to work harder, to figure out which tuple version to index.
+ */
+
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
+
+ /*
+ * In a partial index, discard tuples that don't satisfy the
+ * predicate.
+ */
+ if (predicate != NULL)
+ {
+ if (!ExecQual(predicate, econtext))
+ continue;
+ }
+
+ /*
+ * For the current heap tuple, extract all the attributes we use in
+ * this index, and note which are null. This also performs evaluation
+ * of any expressions needed.
+ */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ /* Call the AM's callback routine to process the tuple */
+ heapTuple = ExecCopySlotHeapTuple(slot);
+ heapTuple->t_self = slot->tts_tid;
+ callback(indexRelation, heapTuple, values, isnull, tupleIsAlive,
+ callback_state);
+ pfree(heapTuple);
+ }
+
+ table_endscan(scan);
+
+ /* we can now forget our snapshot, if set and registered by us */
+ if (need_unregister_snapshot)
+ UnregisterSnapshot(snapshot);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ FreeExecutorState(estate);
+
+ /* These may have been pointing to the now-gone estate */
+ indexInfo->ii_ExpressionsState = NIL;
+ indexInfo->ii_PredicateState = NULL;
+
+ return reltuples;
+}
+
+static void
+zedstoream_finish_bulk_insert(Relation relation, int options)
+{
+ /*
+ * If we skipped writing WAL, then we need to sync the zedstore (but not
+ * indexes since those use WAL anyway / don't go through tableam)
+ */
+ if (options & HEAP_INSERT_SKIP_WAL)
+ heap_sync(relation);
+}
+
+/* ------------------------------------------------------------------------
+ * DDL related callbacks for zedstore AM.
+ * ------------------------------------------------------------------------
+ */
+
+static void
+zedstoream_relation_set_new_filenode(Relation rel,
+ const RelFileNode *newrnode,
+ char persistence,
+ TransactionId *freezeXid,
+ MultiXactId *minmulti)
+{
+ SMgrRelation srel;
+
+ /*
+ * Initialize to the minimum XID that could put tuples in the table. We
+ * know that no xacts older than RecentXmin are still running, so that
+ * will do.
+ */
+ *freezeXid = RecentXmin;
+
+ /*
+ * Similarly, initialize the minimum Multixact to the first value that
+ * could possibly be stored in tuples in the table. Running transactions
+ * could reuse values from their local cache, so we are careful to
+ * consider all currently running multis.
+ *
+ * XXX this could be refined further, but is it worth the hassle?
+ */
+ *minmulti = GetOldestMultiXactId();
+
+ srel = RelationCreateStorage(*newrnode, persistence);
+
+ /*
+ * If required, set up an init fork for an unlogged table so that it can
+ * be correctly reinitialized on restart. An immediate sync is required
+ * even if the page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved the
+ * redo pointer past our xlog record. Recovery may as well remove it
+ * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+ * record. Therefore, logging is necessary even if wal_level=minimal.
+ */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+ rel->rd_rel->relkind == RELKIND_MATVIEW ||
+ rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+ smgrcreate(srel, INIT_FORKNUM, false);
+ log_smgrcreate(newrnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+}
+
+static void
+zedstoream_relation_nontransactional_truncate(Relation rel)
+{
+ RelationTruncate(rel, 0);
+}
+
+static void
+zedstoream_relation_copy_data(Relation rel, const RelFileNode *newrnode)
+{
+ SMgrRelation dstrel;
+
+ dstrel = smgropen(*newrnode, rel->rd_backend);
+ RelationOpenSmgr(rel);
+
+ /*
+ * Since we copy the file directly without looking at the shared buffers,
+ * we'd better first flush out any pages of the source relation that are
+ * in shared buffers. We assume no new changes will be made while we are
+ * holding exclusive lock on the rel.
+ */
+ FlushRelationBuffers(rel);
+
+ /*
+ * Create the new storage, copy the relation, and schedule unlinking of the
+ * old physical file.
+ *
+ * NOTE: any conflict in relfilenode value will be caught in
+ * RelationCreateStorage().
+ *
+ * NOTE: There is only the main fork in zedstore. Otherwise
+ * this would need to copy other forks, too.
+ */
+ RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence);
+
+ /* copy main fork */
+ RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
+ rel->rd_rel->relpersistence);
+
+ /* drop old relation, and close new one */
+ RelationDropStorage(rel);
+ smgrclose(dstrel);
+}
+
+/*
+ * Subroutine of the zedstoream_relation_copy_for_cluster() callback.
+ *
+ * Creates the TID item with correct visibility information for the
+ * given tuple in the old table. Returns the tid of the tuple in the
+ * new table, or InvalidZSTid if this tuple can be left out completely.
+ *
+ * FIXME: This breaks UPDATE chains. I.e. after this is done, an UPDATE
+ * looks like DELETE + INSERT, instead of an UPDATE, to any transaction that
+ * might try to follow the update chain.
+ */
+static zstid
+zs_cluster_process_tuple(Relation OldHeap, Relation NewHeap,
+ zstid oldtid, ZSUndoRecPtr old_undoptr,
+ ZSUndoRecPtr recent_oldest_undo,
+ TransactionId OldestXmin)
+{
+ TransactionId this_xmin;
+ CommandId this_cmin;
+ TransactionId this_xmax;
+ CommandId this_cmax;
+ bool this_changedPart;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ /*
+ * Follow the chain of UNDO records for this tuple, to find the
+ * transaction that originally inserted the row (xmin/cmin), and
+ * the transaction that deleted or updated it away, if any (xmax/cmax)
+ */
+ this_xmin = FrozenTransactionId;
+ this_cmin = InvalidCommandId;
+ this_xmax = InvalidTransactionId;
+ this_cmax = InvalidCommandId;
+
+ undo_ptr = old_undoptr;
+ for (;;)
+ {
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ {
+ /* This tuple version is visible to everyone. */
+ break;
+ }
+
+ /* Fetch the next UNDO record. */
+ undorec = zsundo_fetch(OldHeap, undo_ptr);
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ if (!TransactionIdIsCurrentTransactionId(undorec->xid) &&
+ !TransactionIdIsInProgress(undorec->xid) &&
+ !TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * inserter aborted or crashed. This row is not visible to
+ * anyone. Including any later tuple versions we might have
+ * seen.
+ */
+ this_xmin = InvalidTransactionId;
+ break;
+ }
+ else
+ {
+ /* Inserter committed. */
+ this_xmin = undorec->xid;
+ this_cmin = undorec->cid;
+
+ /* we know everything there is to know about this tuple version. */
+ break;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* Ignore tuple locks for now.
+ *
+ * FIXME: we should propagate them to the new copy of the table
+ */
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ /* Row was deleted (or updated away). */
+ if (!TransactionIdIsCurrentTransactionId(undorec->xid) &&
+ !TransactionIdIsInProgress(undorec->xid) &&
+ !TransactionIdDidCommit(undorec->xid))
+ {
+ /* deleter aborted or crashed. The previous record should
+ * be an insertion (possibly with some tuple-locking in
+ * between). We'll remember the tuple when we see the
+ * insertion.
+ */
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ else
+ {
+ /* deleter committed or is still in progress. */
+ if (TransactionIdPrecedes(undorec->xid, OldestXmin))
+ {
+ /* the deletion is visible to everyone. We can skip the row completely. */
+ this_xmin = InvalidTransactionId;
+ break;
+ }
+ else
+ {
+ /* deleter committed or is in progress. Remember that it was
+ * deleted by this XID.
+ */
+ this_xmax = undorec->xid;
+ this_cmax = undorec->cid;
+ if (undorec->type == ZSUNDO_TYPE_DELETE)
+ this_changedPart = ((ZSUndoRec_Delete *) undorec)->changedPart;
+ else
+ this_changedPart = false;
+
+ /* follow the UNDO chain to find information about the inserting
+ * transaction (xmin/cmin)
+ */
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ }
+ }
+ }
+
+ /*
+ * We now know the visibility of this tuple. Re-create it in the new table.
+ */
+ if (this_xmin != InvalidTransactionId)
+ {
+ /* Insert the first version of the row. */
+ ZSUndoRecPtr prevundoptr;
+ zstid newtid = InvalidZSTid;
+
+ /* First, insert the tuple. */
+ ZSUndoRecPtrInitialize(&prevundoptr);
+ zsbt_tid_multi_insert(NewHeap,
+ &newtid, 1,
+ this_xmin,
+ this_cmin,
+ INVALID_SPECULATIVE_TOKEN,
+ prevundoptr);
+
+ /* And if the tuple was deleted/updated away, do the same in the new table. */
+ if (this_xmax != InvalidTransactionId)
+ {
+ TM_Result delete_result;
+
+ /* tuple was deleted. */
+ delete_result = zsbt_tid_delete(NewHeap, newtid,
+ this_xmax, this_cmax,
+ NULL, NULL, false, NULL, this_changedPart);
+ if (delete_result != TM_Ok)
+ elog(ERROR, "tuple deletion failed during table rewrite");
+ }
+ return newtid;
+ }
+ else
+ return InvalidZSTid;
+}
+
+
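+/*
+ * CLUSTER / VACUUM FULL support: copy the table's contents into a new
+ * relfilenode, either in the order of 'OldIndex' or in TID order.
+ *
+ * For each row, zs_cluster_process_tuple() walks the UNDO chain to decide
+ * whether the row (and its deletion, if any) must be reproduced in the new
+ * table, and with which xmin/xmax. The attribute data is then re-inserted,
+ * flattening and re-toasting any oversized datums for the new relation.
+ */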
+static void
+zedstoream_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
+ Relation OldIndex, bool use_sort,
+ TransactionId OldestXmin,
+ TransactionId *xid_cutoff,
+ MultiXactId *multi_cutoff,
+ double *num_tuples,
+ double *tups_vacuumed,
+ double *tups_recently_dead)
+{
+ TupleDesc olddesc;
+ ZSBtreeScan meta_scan;
+ ZSBtreeScan *attr_scans;
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(OldHeap);
+ int attno;
+ IndexScanDesc indexScan;
+
+ olddesc = RelationGetDescr(OldHeap);
+
+ attr_scans = palloc((olddesc->natts + 1) * sizeof(ZSBtreeScan));
+
+ /*
+ * Scan the old table. We ignore any old updated-away tuple versions,
+ * and only stop at the latest tuple version of each row. At the latest
+ * version, follow the update chain to get all the old versions of that
+ * row, too. That way, the whole update chain is processed in one go,
+ * and can be reproduced in the new table.
+ */
+ zsbt_tid_begin_scan(OldHeap, MinZSTid, MaxPlusOneZSTid,
+ SnapshotAny, &meta_scan);
+
+ for (attno = 1; attno <= olddesc->natts; attno++)
+ {
+ if (TupleDescAttr(olddesc, attno - 1)->attisdropped)
+ continue;
+
+ zsbt_attr_begin_scan(OldHeap,
+ olddesc,
+ attno,
+ MinZSTid,
+ MaxPlusOneZSTid,
+ &attr_scans[attno]);
+ }
+
+ /* TODO: sorting not implemented yet. (it would require materializing each
+ * row into a HeapTuple or something like that, which could carry the xmin/xmax
+ * information through the sorter).
+ */
+ use_sort = false;
+
+ /*
+ * Prepare to scan the OldHeap. To ensure we see recently-dead tuples
+ * that still need to be copied, we scan with SnapshotAny and determine
+ * each tuple's visibility from its UNDO chain in zs_cluster_process_tuple().
+ */
+ if (OldIndex != NULL && !use_sort)
+ {
+ const int ci_index[] = {
+ PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_INDEX_RELID
+ };
+ int64 ci_val[2];
+
+ /* Set phase and OIDOldIndex to columns */
+ ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[1] = RelationGetRelid(OldIndex);
+ pgstat_progress_update_multi_param(2, ci_index, ci_val);
+
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+ index_rescan(indexScan, NULL, 0, NULL, 0);
+ }
+ else
+ {
+ /* In scan-and-sort mode and also VACUUM FULL, set phase */
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+
+ indexScan = NULL;
+
+ /* Set total heap blocks */
+ /* TODO */
+#if 0
+ pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ heapScan->rs_nblocks);
+#endif
+ }
+
+ for (;;)
+ {
+ zstid old_tid;
+ ZSUndoRecPtr old_undoptr;
+ zstid new_tid;
+ Datum datum;
+ bool isnull;
+ zstid fetchtid = InvalidZSTid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ if (indexScan != NULL)
+ {
+ ItemPointer itemptr;
+
+ itemptr = index_getnext_tid(indexScan, ForwardScanDirection);
+ if (!itemptr)
+ break;
+
+ /* Since we used no scan keys, should never need to recheck */
+ if (indexScan->xs_recheck)
+ elog(ERROR, "CLUSTER does not support lossy index conditions");
+
+ fetchtid = ZSTidFromItemPointer(*itemptr);
+ zsbt_tid_reset_scan(&meta_scan, fetchtid);
+ old_tid = zsbt_tid_scan_next(&meta_scan);
+ }
+ else
+ {
+ old_tid = zsbt_tid_scan_next(&meta_scan);
+ fetchtid = old_tid;
+ }
+ if (old_tid == InvalidZSTid)
+ break;
+ if (old_tid != fetchtid)
+ break;
+ old_undoptr = meta_scan.array_undoptr;
+
+ new_tid = zs_cluster_process_tuple(OldHeap, NewHeap,
+ old_tid, old_undoptr,
+ recent_oldest_undo,
+ OldestXmin);
+ if (new_tid != InvalidZSTid)
+ {
+ /* Fetch the attributes and write them out */
+ for (attno = 1; attno <= olddesc->natts; attno++)
+ {
+ Form_pg_attribute att = TupleDescAttr(olddesc, attno - 1);
+ Datum toastptr = (Datum) 0;
+
+ if (att->attisdropped)
+ {
+ datum = (Datum) 0;
+ isnull = true;
+ }
+ else
+ {
+ if (indexScan)
+ zsbt_attr_reset_scan(&attr_scans[attno], old_tid);
+
+ if (!zsbt_scan_next_fetch(&attr_scans[attno], &datum, &isnull, old_tid))
+ zsbt_fill_missing_attribute_value(&attr_scans[attno], &datum, &isnull);
+ }
+
+ /* flatten and re-toast any ZS-TOASTed values */
+ if (!isnull && att->attlen == -1)
+ {
+ if (VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(OldHeap, attno, old_tid, datum);
+ }
+
+ if (VARSIZE_ANY_EXHDR(datum) > MaxZedStoreDatumSize)
+ {
+ toastptr = datum = zedstore_toast_datum(NewHeap, attno, datum);
+ }
+ }
+
+ zsbt_attr_multi_insert(NewHeap, attno, &datum, &isnull, &new_tid, 1);
+
+ if (toastptr != (Datum) 0)
+ zedstore_toast_finish(NewHeap, attno, toastptr, new_tid);
+ }
+ }
+ }
+
+ if (indexScan != NULL)
+ index_endscan(indexScan);
+
+ zsbt_tid_end_scan(&meta_scan);
+ for (attno = 1; attno <= olddesc->natts; attno++)
+ {
+ if (TupleDescAttr(olddesc, attno - 1)->attisdropped)
+ continue;
+
+ zsbt_attr_end_scan(&attr_scans[attno]);
+ }
+}
+
+/*
+ * FIXME: The ANALYZE API is problematic for us. acquire_sample_rows() calls
+ * RelationGetNumberOfBlocks() directly on the relation, and chooses the
+ * block numbers to sample based on that. But the logical block numbers
+ * have little to do with physical ones in zedstore.
+ */
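+/*
+ * ANALYZE support: load the TIDs and datums of all tuples on one logical
+ * block into the bmscan_* arrays; zedstoream_scan_analyze_next_tuple()
+ * then hands them out one at a time.
+ */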
+static bool
+zedstoream_scan_analyze_next_block(TableScanDesc sscan, BlockNumber blockno,
+ BufferAccessStrategy bstrategy)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ Relation rel = scan->rs_scan.rs_rd;
+ int ntuples;
+ ZSBtreeScan btree_scan;
+ zstid tid;
+
+ /* TODO: for now, assume that we need all columns */
+ zs_initialize_proj_attributes_extended(scan, RelationGetDescr(rel));
+
+ ntuples = 0;
+ zsbt_tid_begin_scan(scan->rs_scan.rs_rd,
+ ZSTidFromBlkOff(blockno, 1),
+ ZSTidFromBlkOff(blockno + 1, 1),
+ scan->rs_scan.rs_snapshot,
+ &btree_scan);
+ /*
+ * TODO: it would be good to pass the next expected TID down to zsbt_scan_next,
+ * so that it could skip over to it more efficiently.
+ */
+ ntuples = 0;
+ while ((tid = zsbt_tid_scan_next(&btree_scan)) != InvalidZSTid)
+ {
+ Assert(ZSTidGetBlockNumber(tid) == blockno);
+ scan->bmscan_tids[ntuples] = tid;
+ ntuples++;
+ }
+ zsbt_tid_end_scan(&btree_scan);
+
+ if (ntuples)
+ {
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ {
+ int natt = scan->proj_data.proj_atts[i];
+ ZSBtreeScan btree_scan;
+ Datum datum;
+ bool isnull;
+ Datum *datums = scan->bmscan_datums[i];
+ bool *isnulls = scan->bmscan_isnulls[i];
+
+ zsbt_attr_begin_scan(scan->rs_scan.rs_rd,
+ RelationGetDescr(scan->rs_scan.rs_rd),
+ natt,
+ ZSTidFromBlkOff(blockno, 1),
+ ZSTidFromBlkOff(blockno + 1, 1),
+ &btree_scan);
+ for (int n = 0; n < ntuples; n++)
+ {
+ zstid tid = scan->bmscan_tids[n];
+ if (zsbt_scan_next_fetch(&btree_scan, &datum, &isnull, tid))
+ {
+ Assert(ZSTidGetBlockNumber(tid) == blockno);
+ }
+ else
+ zsbt_fill_missing_attribute_value(&btree_scan, &datum, &isnull);
+
+ /*
+ * have to make a copy because we close the scan immediately.
+ * FIXME: I think this leaks into a too-long-lived context
+ */
+ if (!isnull)
+ datum = zs_datumCopy(datum,
+ ZSBtreeScanGetAttInfo(&btree_scan)->attbyval,
+ ZSBtreeScanGetAttInfo(&btree_scan)->attlen);
+ datums[n] = datum;
+ isnulls[n] = isnull;
+ }
+ zsbt_attr_end_scan(&btree_scan);
+ }
+ }
+
+ scan->bmscan_nexttuple = 0;
+ scan->bmscan_ntuples = ntuples;
+
+ return true;
+}
+
+static bool
+zedstoream_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin,
+ double *liverows, double *deadrows,
+ TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ zstid tid;
+
+ if (scan->bmscan_nexttuple >= scan->bmscan_ntuples)
+ return false;
+ /*
+ * The projection attributes were created based on the relation's tuple
+ * descriptor; it had better match the TupleTableSlot's.
+ */
+ Assert((scan->proj_data.num_proj_atts - 1) <= slot->tts_tupleDescriptor->natts);
+ tid = scan->bmscan_tids[scan->bmscan_nexttuple];
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ {
+ int natt = scan->proj_data.proj_atts[i];
+ Form_pg_attribute att = TupleDescAttr(slot->tts_tupleDescriptor, natt - 1);
+
+ Datum datum;
+ bool isnull;
+
+ datum = (scan->bmscan_datums[i])[scan->bmscan_nexttuple];
+ isnull = (scan->bmscan_isnulls[i])[scan->bmscan_nexttuple];
+
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ if (!isnull && att->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(scan->rs_scan.rs_rd, natt, tid, datum);
+ }
+
+ slot->tts_values[natt - 1] = datum;
+ slot->tts_isnull[natt - 1] = isnull;
+ }
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ scan->bmscan_nexttuple++;
+ (*liverows)++;
+
+ return true;
+}
+
+/* ------------------------------------------------------------------------
+ * Miscellaneous callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+/*
+ * FIXME: Implement this function as appropriate for zedstore. The return
+ * value is, for example, used by ANALYZE to decide which blocks to sample.
+ */
+static uint64
+zedstoream_relation_size(Relation rel, ForkNumber forkNumber)
+{
+ uint64 nblocks = 0;
+
+ /* Open it at the smgr level if not already done */
+ RelationOpenSmgr(rel);
+ nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+ return nblocks * BLCKSZ;
+}
+
+/*
+ * Zedstore stores TOAST chunks within the table file itself, so it doesn't
+ * need a separate toast table. Returning false from this callback avoids
+ * creation of a toast table.
+ */
+static bool
+zedstoream_relation_needs_toast_table(Relation rel)
+{
+ return false;
+}
+
+/* ------------------------------------------------------------------------
+ * Planner related callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+/*
+ * Currently this is an exact duplicate of heapam_estimate_rel_size().
+ * TODO: tune it based on zedstore's storage layout.
+ */
+static void
+zedstoream_relation_estimate_size(Relation rel, int32 *attr_widths,
+ BlockNumber *pages, double *tuples,
+ double *allvisfrac)
+{
+ BlockNumber curpages;
+ BlockNumber relpages;
+ double reltuples;
+ BlockNumber relallvisible;
+ double density;
+
+ /* it has storage, ok to call the smgr */
+ curpages = RelationGetNumberOfBlocks(rel);
+
+ /* coerce values in pg_class to more desirable types */
+ relpages = (BlockNumber) rel->rd_rel->relpages;
+ reltuples = (double) rel->rd_rel->reltuples;
+ relallvisible = (BlockNumber) rel->rd_rel->relallvisible;
+
+ /*
+ * HACK: if the relation has never yet been vacuumed, use a minimum size
+ * estimate of 10 pages. The idea here is to avoid assuming a
+ * newly-created table is really small, even if it currently is, because
+ * that may not be true once some data gets loaded into it. Once a vacuum
+ * or analyze cycle has been done on it, it's more reasonable to believe
+ * the size is somewhat stable.
+ *
+ * (Note that this is only an issue if the plan gets cached and used again
+ * after the table has been filled. What we're trying to avoid is using a
+ * nestloop-type plan on a table that has grown substantially since the
+ * plan was made. Normally, autovacuum/autoanalyze will occur once enough
+ * inserts have happened and cause cached-plan invalidation; but that
+ * doesn't happen instantaneously, and it won't happen at all for cases
+ * such as temporary tables.)
+ *
+ * We approximate "never vacuumed" by "has relpages = 0", which means this
+ * will also fire on genuinely empty relations. Not great, but
+ * fortunately that's a seldom-seen case in the real world, and it
+ * shouldn't degrade the quality of the plan too much anyway to err in
+ * this direction.
+ *
+ * If the table has inheritance children, we don't apply this heuristic.
+ * Totally empty parent tables are quite common, so we should be willing
+ * to believe that they are empty.
+ */
+ if (curpages < 10 &&
+ relpages == 0 &&
+ !rel->rd_rel->relhassubclass)
+ curpages = 10;
+
+ /* report estimated # pages */
+ *pages = curpages;
+ /* quick exit if rel is clearly empty */
+ if (curpages == 0)
+ {
+ *tuples = 0;
+ *allvisfrac = 0;
+ return;
+ }
+
+ /* estimate number of tuples from previous tuple density */
+ if (relpages > 0)
+ density = reltuples / (double) relpages;
+ else
+ {
+ /*
+ * When we have no data because the relation was truncated, estimate
+ * tuple width from attribute datatypes. We assume here that the
+ * pages are completely full, which is OK for tables (since they've
+ * presumably not been VACUUMed yet) but is probably an overestimate
+ * for indexes. Fortunately get_relation_info() can clamp the
+ * overestimate to the parent table's size.
+ *
+ * Note: this code intentionally disregards alignment considerations,
+ * because (a) that would be gilding the lily considering how crude
+ * the estimate is, and (b) it creates platform dependencies in the
+ * default plans which are kind of a headache for regression testing.
+ */
+ int32 tuple_width;
+
+ tuple_width = get_rel_data_width(rel, attr_widths);
+ tuple_width += MAXALIGN(SizeofHeapTupleHeader);
+ tuple_width += sizeof(ItemIdData);
+ /* note: integer division is intentional here */
+ density = (BLCKSZ - SizeOfPageHeaderData) / tuple_width;
+ }
+ *tuples = rint(density * (double) curpages);
+
+ /*
+ * We use relallvisible as-is, rather than scaling it up like we do for
+ * the pages and tuples counts, on the theory that any pages added since
+ * the last VACUUM are most likely not marked all-visible. But costsize.c
+ * wants it converted to a fraction.
+ */
+ if (relallvisible == 0 || curpages <= 0)
+ *allvisfrac = 0;
+ else if ((double) relallvisible >= curpages)
+ *allvisfrac = 1;
+ else
+ *allvisfrac = (double) relallvisible / curpages;
+}
+
+/* ------------------------------------------------------------------------
+ * Executor related callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+static bool
+zedstoream_scan_bitmap_next_block(TableScanDesc sscan,
+ TBMIterateResult *tbmres)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ BlockNumber tid_blkno = tbmres->blockno;
+ int ntuples;
+ ZSBtreeScan btree_scan;
+ zstid tid;
+ int noff = 0;
+
+ zs_initialize_proj_attributes_extended(scan, RelationGetDescr(scan->rs_scan.rs_rd));
+
+ /*
+ * Our strategy for a bitmap scan is to scan the tree of each attribute,
+ * starting at the given logical block number, and store all the datums
+ * in the scan struct. zedstoream_scan_bitmap_next_tuple() then just
+ * needs to store the datums of the next TID in the slot.
+ *
+ * An alternative would be to keep the scans of each attribute open,
+ * like in a sequential scan. I'm not sure which is better.
+ */
+ ntuples = 0;
+ zsbt_tid_begin_scan(scan->rs_scan.rs_rd,
+ ZSTidFromBlkOff(tid_blkno, 1),
+ ZSTidFromBlkOff(tid_blkno + 1, 1),
+ scan->rs_scan.rs_snapshot,
+ &btree_scan);
+ btree_scan.serializable = true;
+ while ((tid = zsbt_tid_scan_next(&btree_scan)) != InvalidZSTid)
+ {
+ ItemPointerData itemptr;
+
+ Assert(ZSTidGetBlockNumber(tid) == tid_blkno);
+
+ ItemPointerSet(&itemptr, tid_blkno, ZSTidGetOffsetNumber(tid));
+
+ if (tbmres->ntuples != -1)
+ {
+ while (noff < tbmres->ntuples &&
+ ZSTidGetOffsetNumber(tid) > tbmres->offsets[noff])
+ {
+ ItemPointerData skippedptr;
+
+ /*
+ * Acquire predicate lock on all tuples in the bitmap, even those
+ * that are not visible to the snapshot.
+ */
+ ItemPointerSet(&skippedptr, tid_blkno, tbmres->offsets[noff]);
+ PredicateLockTID(scan->rs_scan.rs_rd, &skippedptr, scan->rs_scan.rs_snapshot);
+
+ noff++;
+ }
+
+ if (noff == tbmres->ntuples)
+ break;
+
+ if (ZSTidGetOffsetNumber(tid) < tbmres->offsets[noff])
+ continue;
+ }
+
+ Assert(ZSTidGetBlockNumber(tid) == tid_blkno);
+
+ scan->bmscan_tids[ntuples] = tid;
+ ntuples++;
+
+ /* FIXME: heapam acquires the predicate lock first, and then
+ * calls CheckForSerializableConflictOut(). We do it in the
+ * opposite order, because the CheckForSerializableConflictOut()
+ * call has already been done in zsbt_get_last_tid(). Does it matter?
+ * I'm not sure.
+ */
+ PredicateLockTID(scan->rs_scan.rs_rd, &itemptr, scan->rs_scan.rs_snapshot);
+ }
+ zsbt_tid_end_scan(&btree_scan);
+
+ if (ntuples)
+ {
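+ /* proj_atts[0] holds the meta-attribute (the TID tree), so fetch data only for the real columns, starting at index 1 */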
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ {
+ int natt = scan->proj_data.proj_atts[i];
+ ZSBtreeScan btree_scan;
+ Datum datum;
+ bool isnull;
+ Datum *datums = scan->bmscan_datums[i];
+ bool *isnulls = scan->bmscan_isnulls[i];
+
+ zsbt_attr_begin_scan(scan->rs_scan.rs_rd,
+ RelationGetDescr(scan->rs_scan.rs_rd),
+ natt,
+ ZSTidFromBlkOff(tid_blkno, 1),
+ ZSTidFromBlkOff(tid_blkno + 1, 1),
+ &btree_scan);
+ for (int n = 0; n < ntuples; n++)
+ {
+ if (!zsbt_scan_next_fetch(&btree_scan, &datum, &isnull, scan->bmscan_tids[n]))
+ zsbt_fill_missing_attribute_value(&btree_scan, &datum, &isnull);
+
+ /* have to make a copy because we close the scan immediately. */
+ if (!isnull)
+ datum = zs_datumCopy(datum,
+ ZSBtreeScanGetAttInfo(&btree_scan)->attbyval,
+ ZSBtreeScanGetAttInfo(&btree_scan)->attlen);
+ datums[n] = datum;
+ isnulls[n] = isnull;
+ }
+ zsbt_attr_end_scan(&btree_scan);
+ }
+ }
+ scan->bmscan_nexttuple = 0;
+ scan->bmscan_ntuples = ntuples;
+
+ return ntuples > 0;
+}
+
+static bool
+zedstoream_scan_bitmap_next_tuple(TableScanDesc sscan,
+ TBMIterateResult *tbmres,
+ TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ zstid tid;
+
+ if (scan->bmscan_nexttuple >= scan->bmscan_ntuples)
+ return false;
+ /*
+ * The projection attributes were created based on the relation's tuple
+ * descriptor, so it had better match the TupleTableSlot's descriptor.
+ */
+ Assert((scan->proj_data.num_proj_atts - 1) <= slot->tts_tupleDescriptor->natts);
+ tid = scan->bmscan_tids[scan->bmscan_nexttuple];
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ {
+ int natt = scan->proj_data.proj_atts[i];
+ Form_pg_attribute att = TupleDescAttr(slot->tts_tupleDescriptor, natt - 1);
+ Datum datum;
+ bool isnull;
+
+ datum = (scan->bmscan_datums[i])[scan->bmscan_nexttuple];
+ isnull = (scan->bmscan_isnulls[i])[scan->bmscan_nexttuple];
+
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ if (!isnull && att->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(scan->rs_scan.rs_rd, natt, tid, datum);
+ }
+
+ slot->tts_values[natt - 1] = datum;
+ slot->tts_isnull[natt - 1] = isnull;
+ }
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ scan->bmscan_nexttuple++;
+
+ pgstat_count_heap_fetch(scan->rs_scan.rs_rd);
+
+ return true;
+}
+
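+/*
+ * Collect the visible TIDs of the chosen sample block into the bmscan_*
+ * arrays (shared with the bitmap scan code); scan_sample_next_tuple() then
+ * picks tuples out of that array.
+ */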
+static bool
+zedstoream_scan_sample_next_block(TableScanDesc sscan, SampleScanState *scanstate)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ Relation rel = scan->rs_scan.rs_rd;
+ TsmRoutine *tsm = scanstate->tsmroutine;
+ int ntuples;
+ ZSBtreeScan btree_scan;
+ zstid tid;
+ BlockNumber blockno;
+
+ /* TODO: for now, assume that we need all columns */
+ zs_initialize_proj_attributes_extended(scan, RelationGetDescr(rel));
+
+ if (scan->max_tid_to_scan == InvalidZSTid)
+ {
+ /*
+ * get the max tid once and store it, used to calculate max blocks to
+ * scan either for SYSTEM or BERNOULLI sampling.
+ */
+ scan->max_tid_to_scan = zsbt_get_last_tid(rel);
+ /*
+ * TODO: should get lowest tid instead of starting from 0
+ */
+ scan->next_tid_to_scan = ZSTidFromBlkOff(0, 1);
+ }
+
+ if (tsm->NextSampleBlock)
+ {
+ /* Adding one below to convert block number to number of blocks. */
+ blockno = tsm->NextSampleBlock(scanstate,
+ ZSTidGetBlockNumber(scan->max_tid_to_scan) + 1);
+
+ if (!BlockNumberIsValid(blockno))
+ return false;
+ }
+ else
+ {
+ /* scanning table sequentially */
+ if (scan->next_tid_to_scan > scan->max_tid_to_scan)
+ return false;
+
+ blockno = ZSTidGetBlockNumber(scan->next_tid_to_scan);
+ /* move on to next block of tids for next iteration of scan */
+ scan->next_tid_to_scan = ZSTidFromBlkOff(blockno + 1, 1);
+ }
+
+ Assert(BlockNumberIsValid(blockno));
+
+ ntuples = 0;
+ zsbt_tid_begin_scan(scan->rs_scan.rs_rd,
+ ZSTidFromBlkOff(blockno, 1),
+ ZSTidFromBlkOff(blockno + 1, 1),
+ scan->rs_scan.rs_snapshot,
+ &btree_scan);
+ while ((tid = zsbt_tid_scan_next(&btree_scan)) != InvalidZSTid)
+ {
+ Assert(ZSTidGetBlockNumber(tid) == blockno);
+ scan->bmscan_tids[ntuples] = tid;
+ ntuples++;
+ }
+ zsbt_tid_end_scan(&btree_scan);
+
+ scan->bmscan_nexttuple = 0;
+ scan->bmscan_ntuples = ntuples;
+
+ return true;
+}
+
+static bool
+zedstoream_scan_sample_next_tuple(TableScanDesc sscan, SampleScanState *scanstate,
+ TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ TsmRoutine *tsm = scanstate->tsmroutine;
+ zstid tid;
+ BlockNumber blockno;
+ OffsetNumber tupoffset;
+ bool found;
+
+ /* all tuples on this block are invisible */
+ if (scan->bmscan_ntuples == 0)
+ return false;
+
+ blockno = ZSTidGetBlockNumber(scan->bmscan_tids[0]);
+
+ /* find which visible tuple in this block to sample */
+ for (;;)
+ {
+ zstid lasttid_for_block = scan->bmscan_tids[scan->bmscan_ntuples - 1];
+ OffsetNumber maxoffset = ZSTidGetOffsetNumber(lasttid_for_block);
+ /* Ask the tablesample method which tuples to check on this page. */
+ tupoffset = tsm->NextSampleTuple(scanstate, blockno, maxoffset);
+
+ if (!OffsetNumberIsValid(tupoffset))
+ return false;
+
+ tid = ZSTidFromBlkOff(blockno, tupoffset);
+
+ found = false;
+ for (int n = 0; n < scan->bmscan_ntuples; n++)
+ {
+ if (scan->bmscan_tids[n] == tid)
+ {
+ /* visible tuple */
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ break;
+ }
+
+ /*
+ * The projection attributes were created based on the relation's tuple
+ * descriptor, so it had better match the TupleTableSlot's descriptor.
+ */
+ Assert((scan->proj_data.num_proj_atts - 1) <= slot->tts_tupleDescriptor->natts);
+ /* fetch values for tuple pointed by tid to sample */
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ {
+ int natt = scan->proj_data.proj_atts[i];
+ ZSBtreeScan btree_scan;
+ Form_pg_attribute attr;
+ Datum datum;
+ bool isnull;
+
+ zsbt_attr_begin_scan(scan->rs_scan.rs_rd,
+ slot->tts_tupleDescriptor,
+ natt,
+ tid, tid + 1,
+ &btree_scan);
+
+ attr = ZSBtreeScanGetAttInfo(&btree_scan);
+ if (zsbt_scan_next_fetch(&btree_scan, &datum, &isnull, tid))
+ {
+ Assert(ZSTidGetBlockNumber(tid) == blockno);
+ }
+ else
+ {
+ zsbt_fill_missing_attribute_value(&btree_scan, &datum, &isnull);
+ }
+
+ /*
+ * have to make a copy because we close the scan immediately.
+ * FIXME: I think this leaks into a too-long-lived context
+ */
+ if (!isnull)
+ datum = zs_datumCopy(datum, attr->attbyval, attr->attlen);
+
+ slot->tts_values[natt - 1] = datum;
+ slot->tts_isnull[natt - 1] = isnull;
+
+ zsbt_attr_end_scan(&btree_scan);
+ }
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ return true;
+}
+
+static void
+zedstoream_vacuum_rel(Relation onerel, VacuumParams *params,
+ BufferAccessStrategy bstrategy)
+{
+ zsundo_vacuum(onerel, params, bstrategy,
+ GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM));
+}
+
+static const TableAmRoutine zedstoream_methods = {
+ .type = T_TableAmRoutine,
+ .scans_leverage_column_projection = true,
+
+ .slot_callbacks = zedstoream_slot_callbacks,
+
+ .scan_begin = zedstoream_beginscan,
+ .scan_begin_with_column_projection = zedstoream_beginscan_with_column_projection,
+ .scan_end = zedstoream_endscan,
+ .scan_rescan = zedstoream_rescan,
+ .scan_getnextslot = zedstoream_getnextslot,
+
+ .parallelscan_estimate = zs_parallelscan_estimate,
+ .parallelscan_initialize = zs_parallelscan_initialize,
+ .parallelscan_reinitialize = zs_parallelscan_reinitialize,
+
+ .index_fetch_begin = zedstoream_begin_index_fetch,
+ .index_fetch_reset = zedstoream_reset_index_fetch,
+ .index_fetch_end = zedstoream_end_index_fetch,
+ .index_fetch_set_column_projection = zedstoream_fetch_set_column_projection,
+ .index_fetch_tuple = zedstoream_index_fetch_tuple,
+
+ .tuple_insert = zedstoream_insert,
+ .tuple_insert_speculative = zedstoream_insert_speculative,
+ .tuple_complete_speculative = zedstoream_complete_speculative,
+ .multi_insert = zedstoream_multi_insert,
+ .tuple_delete = zedstoream_delete,
+ .tuple_update = zedstoream_update,
+ .tuple_lock = zedstoream_lock_tuple,
+ .finish_bulk_insert = zedstoream_finish_bulk_insert,
+
+ .tuple_fetch_row_version = zedstoream_fetch_row_version,
+ .tuple_get_latest_tid = zedstoream_get_latest_tid,
+ .tuple_tid_valid = zedstoream_tuple_tid_valid,
+ .tuple_satisfies_snapshot = zedstoream_tuple_satisfies_snapshot,
+ .compute_xid_horizon_for_tuples = zedstoream_compute_xid_horizon_for_tuples,
+
+ .relation_set_new_filenode = zedstoream_relation_set_new_filenode,
+ .relation_nontransactional_truncate = zedstoream_relation_nontransactional_truncate,
+ .relation_copy_data = zedstoream_relation_copy_data,
+ .relation_copy_for_cluster = zedstoream_relation_copy_for_cluster,
+ .relation_vacuum = zedstoream_vacuum_rel,
+ .scan_analyze_next_block = zedstoream_scan_analyze_next_block,
+ .scan_analyze_next_tuple = zedstoream_scan_analyze_next_tuple,
+
+ .index_build_range_scan = zedstoream_index_build_range_scan,
+ .index_validate_scan = zedstoream_index_validate_scan,
+
+ .relation_size = zedstoream_relation_size,
+ .relation_needs_toast_table = zedstoream_relation_needs_toast_table,
+ .relation_estimate_size = zedstoream_relation_estimate_size,
+
+ .scan_bitmap_next_block = zedstoream_scan_bitmap_next_block,
+ .scan_bitmap_next_tuple = zedstoream_scan_bitmap_next_tuple,
+ .scan_sample_next_block = zedstoream_scan_sample_next_block,
+ .scan_sample_next_tuple = zedstoream_scan_sample_next_tuple
+};
+
+Datum
+zedstore_tableam_handler(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_POINTER(&zedstoream_methods);
+}
+
+
+/*
+ * Routines for dividing up the TID range for parallel seq scans
+ */
+
+typedef struct ParallelZSScanDescData
+{
+ ParallelTableScanDescData base;
+
+ zstid pzs_endtid; /* last tid + 1 in relation at start of scan */
+ pg_atomic_uint64 pzs_allocatedtid_blk; /* number of logical TID blocks handed out to workers so far */
+} ParallelZSScanDescData;
+typedef struct ParallelZSScanDescData *ParallelZSScanDesc;
+
+static Size
+zs_parallelscan_estimate(Relation rel)
+{
+ return sizeof(ParallelZSScanDescData);
+}
+
+static Size
+zs_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
+{
+ ParallelZSScanDesc zpscan = (ParallelZSScanDesc) pscan;
+
+ zpscan->base.phs_relid = RelationGetRelid(rel);
+ zpscan->pzs_endtid = zsbt_get_last_tid(rel);
+ pg_atomic_init_u64(&zpscan->pzs_allocatedtid_blk, 0);
+
+ return sizeof(ParallelZSScanDescData);
+}
+
+static void
+zs_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
+{
+ ParallelZSScanDesc bpscan = (ParallelZSScanDesc) pscan;
+
+ pg_atomic_write_u64(&bpscan->pzs_allocatedtid_blk, 0);
+}
+
+/*
+ * Get the next TID range to scan.
+ *
+ * Returns true if there is more to scan, false otherwise.
+ *
+ * Even if there are no TIDs left to scan, another backend could have
+ * grabbed a range to scan and not yet finished looking at it, so it
+ * doesn't follow that the whole scan is done when the first backend gets
+ * a 'false' return value.
+ */
+static bool
+zs_parallelscan_nextrange(Relation rel, ParallelZSScanDesc pzscan,
+ zstid *start, zstid *end)
+{
+ uint64 allocatedtid_blk;
+
+ /*
+ * pzs_allocatedtid_blk tracks how many logical TID blocks have been
+ * handed out to workers already. Once the range it implies reaches
+ * pzs_endtid, all TIDs have been allocated.
+ *
+ * Because we use an atomic fetch-and-add to fetch the current value, the
+ * pzs_allocatedtid_blk counter will keep advancing past the end of the
+ * TID space, because workers will still increment it when they try to
+ * allocate the next range even though all ranges have been allocated
+ * already. The counter must be 64 bits wide because of that, to avoid
+ * wrapping around when pzs_endtid is close to the top of the TID space.
+ * That's also one reason we hand out one logical block of TIDs at a
+ * time, even though zedstore isn't block-oriented.
+ *
+ * TODO: we divide the TID space into chunks of one logical block each.
+ * That's pretty inefficient, there's a fair amount of overhead in
+ * re-starting the B-tree scans between each range. We probably should
+ * use much larger ranges. But this is good for testing.
+ */
+ allocatedtid_blk = pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtid_blk, 1);
+ *start = ZSTidFromBlkOff(allocatedtid_blk, 1);
+ *end = ZSTidFromBlkOff(allocatedtid_blk + 1, 1);
+
+ return *start < pzscan->pzs_endtid;
+}
+
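+/*
+ * Fill in the "missing" value (the fast default added by ALTER TABLE ...
+ * ADD COLUMN ... DEFAULT) for an attribute that has no entry in its B-tree,
+ * or NULL if the attribute has no missing value in the catalog.
+ */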
+static void
+zsbt_fill_missing_attribute_value(ZSBtreeScan *scan, Datum *datum, bool *isnull)
+{
+ int attno = scan->attno - 1;
+ TupleDesc tupleDesc = scan->tupledesc;
+ Form_pg_attribute attr = ZSBtreeScanGetAttInfo(scan);
+
+ *isnull = true;
+ *datum = (Datum) 0;
+
+ /* The catalog has no missing value ("fast default") for this attribute */
+ if (!attr->atthasmissing)
+ return;
+
+ if (tupleDesc->constr &&
+ tupleDesc->constr->missing)
+ {
+ AttrMissing *attrmiss = NULL;
+ /*
+ * If there are missing values we want to put them into the
+ * tuple.
+ */
+ attrmiss = tupleDesc->constr->missing;
+
+ if (attrmiss[attno].am_present)
+ {
+ *isnull = false;
+ if (attr->attbyval)
+ *datum = fetch_att(&attrmiss[attno].am_value, attr->attbyval, attr->attlen);
+ else
+ *datum = zs_datumCopy(attrmiss[attno].am_value, attr->attbyval, attr->attlen);
+ }
+ }
+}
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 6cb545c126..e795a510ae 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1057,7 +1057,11 @@ acquire_sample_rows(Relation onerel, int elevel,
* done.
*/
if (numrows < targrows)
- rows[numrows++] = ExecCopySlotHeapTuple(slot);
+ {
+ rows[numrows] = ExecCopySlotHeapTuple(slot);
+ rows[numrows]->t_self = slot->tts_tid;
+ numrows++;
+ }
else
{
/*
@@ -1079,6 +1083,7 @@ acquire_sample_rows(Relation onerel, int elevel,
Assert(k >= 0 && k < targrows);
heap_freetuple(rows[k]);
rows[k] = ExecCopySlotHeapTuple(slot);
+ rows[k]->t_self = slot->tts_tid;
}
rowstoskip -= 1;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b00891ffd2..ab9fea881a 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2116,9 +2116,27 @@ CopyTo(CopyState cstate)
{
TupleTableSlot *slot;
TableScanDesc scandesc;
+ bool *proj = NULL;
- scandesc = table_beginscan(cstate->rel, GetActiveSnapshot(), 0, NULL);
slot = table_slot_create(cstate->rel, NULL);
+ if (table_scans_leverage_column_projection(cstate->rel))
+ {
+ proj = palloc0(slot->tts_tupleDescriptor->natts * sizeof(bool));
+ foreach(cur, cstate->attnumlist)
+ {
+ int attnum = lfirst_int(cur);
+ Assert(attnum <= slot->tts_tupleDescriptor->natts);
+ proj[attnum-1] = true;
+ }
+
+ scandesc = table_beginscan_with_column_projection(cstate->rel,
+ GetActiveSnapshot(),
+ 0, NULL, proj);
+ }
+ else
+ {
+ scandesc = table_beginscan(cstate->rel, GetActiveSnapshot(), 0, NULL);
+ }
processed = 0;
while (table_scan_getnextslot(scandesc, ForwardScanDirection, slot))
@@ -2135,6 +2153,8 @@ CopyTo(CopyState cstate)
ExecDropSingleTupleTableSlot(slot);
table_endscan(scandesc);
+ if (proj)
+ pfree(proj);
}
else
{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 602a8dbd1c..2af39c8fdc 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -9586,6 +9586,7 @@ validateCheckConstraint(Relation rel, HeapTuple constrtup)
Form_pg_constraint constrForm;
bool isnull;
Snapshot snapshot;
+ bool *proj = NULL;
/*
* VALIDATE CONSTRAINT is a no-op for foreign tables and partitioned
@@ -9618,7 +9619,16 @@ validateCheckConstraint(Relation rel, HeapTuple constrtup)
econtext->ecxt_scantuple = slot;
snapshot = RegisterSnapshot(GetLatestSnapshot());
- scan = table_beginscan(rel, snapshot, 0, NULL);
+ if (table_scans_leverage_column_projection(rel))
+ {
+ proj = palloc0(slot->tts_tupleDescriptor->natts * sizeof(bool));
+ GetNeededColumnsForNode((Node*)exprstate->expr, proj, slot->tts_tupleDescriptor->natts);
+ scan = table_beginscan_with_column_projection(rel, snapshot, 0, NULL, proj);
+ }
+ else
+ {
+ scan = table_beginscan(rel, snapshot, 0, NULL);
+ }
/*
* Switch to per-tuple memory context and reset it for each tuple
@@ -9643,6 +9653,8 @@ validateCheckConstraint(Relation rel, HeapTuple constrtup)
UnregisterSnapshot(snapshot);
ExecDropSingleTupleTableSlot(slot);
FreeExecutorState(estate);
+ if (proj)
+ pfree(proj);
}
/*
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 3132a13785..db09b3ac9c 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2841,6 +2841,10 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
if (newtuple != trigtuple)
heap_freetuple(newtuple);
}
+
+ /* Make sure the new slot is not dependent on the original tuple */
+ ExecMaterializeSlot(slot);
+
if (should_free)
heap_freetuple(trigtuple);
@@ -3125,6 +3129,10 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
newtuple = NULL;
}
}
+
+ /* Make sure the new slot is not dependent on the original tuple */
+ ExecMaterializeSlot(newslot);
+
if (should_free_trig)
heap_freetuple(trigtuple);
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 67c4be5108..73483aa835 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -20,6 +20,7 @@
#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
#include "utils/memutils.h"
@@ -301,3 +302,92 @@ ExecScanReScan(ScanState *node)
}
}
}
+
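+/*
+ * Context for neededColumnContextWalker: 'mask' has one entry per attribute
+ * of the relation, and 'n' is the number of entries, used for bounds checks.
+ */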
+typedef struct neededColumnContext
+{
+ bool *mask;
+ int n;
+} neededColumnContext;
+
+static bool
+neededColumnContextWalker(Node *node, neededColumnContext *c)
+{
+ if (node == NULL)
+ return false;
+
+ if (IsA(node, Var))
+ {
+ Var *var = (Var *)node;
+
+ if (var->varattno > 0)
+ {
+ Assert(var->varattno <= c->n);
+ c->mask[var->varattno - 1] = true;
+ }
+ /*
+ * If all attributes are included,
+ * set all entries in mask to true.
+ */
+ else if (var->varattno == 0)
+ memset(c->mask, true, c->n);
+
+ return false;
+ }
+ return expression_tree_walker(node, neededColumnContextWalker, (void * )c);
+}
+
+/*
+ * n specifies the number of allowed entries in mask: we use
+ * it for bounds-checking in the walker above.
+ */
+void
+GetNeededColumnsForNode(Node *expr, bool *mask, int n)
+{
+ neededColumnContext c;
+
+ c.mask = mask;
+ c.n = n;
+
+ neededColumnContextWalker(expr, &c);
+}
+
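+/*
+ * Build a projection bitmap (indexed by attnum - 1) of the columns that a
+ * scan node's targetlist, quals and, for some node types, other expressions
+ * reference. The result is palloc'd in the current memory context.
+ */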
+bool *
+GetNeededColumnsForScan(ScanState *scanstate, int ncol)
+{
+ Plan *plan = scanstate->ps.plan;
+ bool *proj;
+ int i;
+
+ proj = palloc0(ncol * sizeof(bool));
+ GetNeededColumnsForNode((Node *) plan->targetlist, proj, ncol);
+ GetNeededColumnsForNode((Node *) plan->qual, proj, ncol);
+
+ /*
+ * Some node types have more fields with expressions. FIXME: This list is
+ * surely very incomplete. We should teach the planner to do this for us.
+ */
+ if (IsA(plan, IndexScan))
+ {
+ GetNeededColumnsForNode((Node *) ((IndexScan *) plan)->indexqualorig, proj, ncol);
+ GetNeededColumnsForNode((Node *) ((IndexScan *) plan)->indexorderbyorig, proj, ncol);
+ }
+ else if (IsA(plan, BitmapHeapScan))
+ {
+ GetNeededColumnsForNode((Node *) ((BitmapHeapScan *) plan)->bitmapqualorig, proj, ncol);
+ }
+
+ for (i = 0; i < ncol; i++)
+ {
+ if (proj[i])
+ break;
+ }
+
+ /*
+ * In some cases (for example, count(*)), no columns are needed at all.
+ * In that case, we always scan the first column.
+ */
+ if (i == ncol && ncol > 0)
+ proj[0] = true;
+
+ return proj;
+}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index ee5b1c493b..8a4d795d1a 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -166,10 +166,10 @@ IndexOnlyNext(IndexOnlyScanState *node)
* Rats, we have to visit the heap to check visibility.
*/
InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, slot))
+ if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
continue; /* no visible tuple, try next index entry */
- ExecClearTuple(slot);
+ ExecClearTuple(node->ioss_TableSlot);
/*
* Only MVCC snapshots are supported here, so there should be no
@@ -528,7 +528,17 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
*/
tupDesc = ExecTypeFromTL(node->indextlist);
ExecInitScanTupleSlot(estate, &indexstate->ss, tupDesc,
- table_slot_callbacks(currentRelation));
+ &TTSOpsVirtual);
+
+ /*
+ * We need another slot, in a format that's suitable for the table AM,
+ * for when we need to fetch a tuple from the table for rechecking
+ * visibility.
+ */
+ indexstate->ioss_TableSlot =
+ ExecAllocTableSlot(&estate->es_tupleTable,
+ RelationGetDescr(currentRelation),
+ table_slot_callbacks(currentRelation));
/*
* Initialize result type and projection info. The node's targetlist will
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index ac7aa81f67..5492816b6d 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -115,6 +115,13 @@ IndexNext(IndexScanState *node)
node->iss_NumScanKeys,
node->iss_NumOrderByKeys);
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ bool *proj;
+ proj = GetNeededColumnsForScan(&node->ss, node->ss.ss_currentRelation->rd_att->natts);
+ table_index_fetch_set_column_projection(scandesc->xs_heapfetch, proj);
+ }
+
node->iss_ScanDesc = scandesc;
/*
@@ -897,6 +904,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
{
IndexScanState *indexstate;
Relation currentRelation;
+ const TupleTableSlotOps *table_slot_ops;
LOCKMODE lockmode;
/*
@@ -923,11 +931,19 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ss_currentScanDesc = NULL; /* no heap scan here */
/*
- * get the scan type from the relation descriptor.
+ * Initialize the scan slot.
+ *
+ * With the reorder queue, we will sometimes use the reorderqueue's slot,
+ * which uses heap ops, and sometimes the table AM's slot directly. We
+ * have to set scanopsfixed to false, unless the table AM also uses heap
+ * ops.
*/
+ table_slot_ops = table_slot_callbacks(currentRelation);
ExecInitScanTupleSlot(estate, &indexstate->ss,
RelationGetDescr(currentRelation),
- table_slot_callbacks(currentRelation));
+ table_slot_ops);
+ if (node->indexorderby && table_slot_ops != &TTSOpsHeapTuple)
+ indexstate->ss.ps.scanopsfixed = false;
/*
* Initialize result type and projection.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 436b43f8ca..c0922ff823 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -31,6 +31,7 @@
#include "access/tableam.h"
#include "executor/execdebug.h"
#include "executor/nodeSeqscan.h"
+#include "nodes/nodeFuncs.h"
#include "utils/rel.h"
static TupleTableSlot *SeqNext(SeqScanState *node);
@@ -68,9 +69,20 @@ SeqNext(SeqScanState *node)
* We reach here if the scan is not parallel, or if we're serially
* executing a scan that was planned to be parallel.
*/
- scandesc = table_beginscan(node->ss.ss_currentRelation,
- estate->es_snapshot,
- 0, NULL);
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ bool *proj;
+ proj = GetNeededColumnsForScan(&node->ss, node->ss.ss_currentRelation->rd_att->natts);
+ scandesc = table_beginscan_with_column_projection(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL, proj);
+ }
+ else
+ {
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ }
node->ss.ss_currentScanDesc = scandesc;
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 608d5adfed..6527e0d5d2 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -822,6 +822,9 @@ use_physical_tlist(PlannerInfo *root, Path *path, int flags)
rel->rtekind != RTE_CTE)
return false;
+ if (rel->rtekind == RTE_RELATION && rel->leverage_column_projection)
+ return false;
+
/*
* Can't do it with inheritance cases either (mainly because Append
* doesn't project; this test may be unnecessary now that
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 2405acbf6f..00d125378b 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -123,6 +123,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
*/
relation = table_open(relationObjectId, NoLock);
+ if (relation->rd_tableam)
+ rel->leverage_column_projection = relation->rd_tableam->scans_leverage_column_projection;
/* Temporary and unlogged relations are inaccessible during recovery. */
if (!RelationNeedsWAL(relation) && RecoveryInProgress())
ereport(ERROR,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 99d26de7e6..b4110e4152 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1274,6 +1274,7 @@ check_default_partition_contents(Relation parent, Relation default_rel,
TableScanDesc scan;
MemoryContext oldCxt;
TupleTableSlot *tupslot;
+ bool *proj = NULL;
/* Lock already taken above. */
if (part_relid != RelationGetRelid(default_rel))
@@ -1330,7 +1331,16 @@ check_default_partition_contents(Relation parent, Relation default_rel,
econtext = GetPerTupleExprContext(estate);
snapshot = RegisterSnapshot(GetLatestSnapshot());
tupslot = table_slot_create(part_rel, &estate->es_tupleTable);
- scan = table_beginscan(part_rel, snapshot, 0, NULL);
+ if (table_scans_leverage_column_projection(part_rel))
+ {
+ proj = palloc0(tupslot->tts_tupleDescriptor->natts * sizeof(bool));
+ GetNeededColumnsForNode((Node*)partqualstate->expr, proj, tupslot->tts_tupleDescriptor->natts);
+ scan = table_beginscan_with_column_projection(part_rel, snapshot, 0, NULL, proj);
+ }
+ else
+ {
+ scan = table_beginscan(part_rel, snapshot, 0, NULL);
+ }
/*
* Switch to per-tuple memory context and reset it for each tuple
@@ -1360,6 +1370,9 @@ check_default_partition_contents(Relation parent, Relation default_rel,
if (RelationGetRelid(default_rel) != RelationGetRelid(part_rel))
table_close(part_rel, NoLock); /* keep the lock until commit */
+
+ if (proj)
+ pfree(proj);
}
}
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 2fedbc4c15..b31c0bfe00 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2547,8 +2547,6 @@ PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot)
void
PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot)
{
- PREDICATELOCKTARGETTAG tag;
- ItemPointer tid;
TransactionId targetxmin;
if (!SerializationNeededForRead(relation, snapshot))
@@ -2579,6 +2577,17 @@ PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot)
}
}
+ PredicateLockTID(relation, &(tuple->t_self), snapshot);
+}
+
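+/*
+ * Variant of PredicateLockTuple() that takes just a TID, for table AMs that
+ * don't have a HeapTuple at hand.
+ */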
+void
+PredicateLockTID(Relation relation, ItemPointer tid, Snapshot snapshot)
+{
+ PREDICATELOCKTARGETTAG tag;
+
+ if (!SerializationNeededForRead(relation, snapshot))
+ return;
+
/*
* Do quick-but-not-definitive test for a relation lock first. This will
* never cause a return when the relation is *not* locked, but will
@@ -2591,7 +2600,6 @@ PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot)
if (PredicateLockExists(&tag))
return;
- tid = &(tuple->t_self);
SET_PREDICATELOCKTARGETTAG_TUPLE(tag,
relation->rd_node.dbNode,
relation->rd_id,
@@ -4054,14 +4062,11 @@ XidIsConcurrent(TransactionId xid)
* currently no known reason to call this function from an index AM.
*/
void
-CheckForSerializableConflictOut(bool visible, Relation relation,
+heap_CheckForSerializableConflictOut(bool visible, Relation relation,
HeapTuple tuple, Buffer buffer,
Snapshot snapshot)
{
TransactionId xid;
- SERIALIZABLEXIDTAG sxidtag;
- SERIALIZABLEXID *sxid;
- SERIALIZABLEXACT *sxact;
HTSV_Result htsvResult;
if (!SerializationNeededForRead(relation, snapshot))
@@ -4125,6 +4130,19 @@ CheckForSerializableConflictOut(bool visible, Relation relation,
Assert(TransactionIdIsValid(xid));
Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+ return CheckForSerializableConflictOut(relation, xid, snapshot);
+}
+
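+/*
+ * Check for a serialization conflict caused by reading a tuple whose
+ * inserting or deleting transaction is identified only by 'xid'. This is
+ * the part of the old CheckForSerializableConflictOut() that doesn't need
+ * the tuple itself; heap_CheckForSerializableConflictOut() resolves the
+ * tuple to an xid and then calls this.
+ */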
+void
+CheckForSerializableConflictOut(Relation relation, TransactionId xid, Snapshot snapshot)
+{
+ SERIALIZABLEXIDTAG sxidtag;
+ SERIALIZABLEXID *sxid;
+ SERIALIZABLEXACT *sxact;
+
+ if (!SerializationNeededForRead(relation, snapshot))
+ return;
+
/*
* Find top level xid. Bail out if xid is too early to be a conflict, or
* if it's our own xid.
@@ -4439,8 +4457,7 @@ CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
* tuple itself.
*/
void
-CheckForSerializableConflictIn(Relation relation, HeapTuple tuple,
- Buffer buffer)
+CheckForSerializableConflictIn(Relation relation, ItemPointer tid, BlockNumber blkno)
{
PREDICATELOCKTARGETTAG targettag;
@@ -4470,22 +4487,22 @@ CheckForSerializableConflictIn(Relation relation, HeapTuple tuple,
* It is not possible to take and hold a lock across the checks for all
* granularities because each target could be in a separate partition.
*/
- if (tuple != NULL)
+ if (tid != NULL)
{
SET_PREDICATELOCKTARGETTAG_TUPLE(targettag,
relation->rd_node.dbNode,
relation->rd_id,
- ItemPointerGetBlockNumber(&(tuple->t_self)),
- ItemPointerGetOffsetNumber(&(tuple->t_self)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
CheckTargetForConflictsIn(&targettag);
}
- if (BufferIsValid(buffer))
+ if (blkno != InvalidBlockNumber)
{
SET_PREDICATELOCKTARGETTAG_PAGE(targettag,
relation->rd_node.dbNode,
relation->rd_id,
- BufferGetBlockNumber(buffer));
+ blkno);
CheckTargetForConflictsIn(&targettag);
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6f1cd382d8..d914d395c9 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -163,6 +163,7 @@ typedef struct TableAmRoutine
{
/* this must be set to T_TableAmRoutine */
NodeTag type;
+ bool scans_leverage_column_projection;
/* ------------------------------------------------------------------------
@@ -203,6 +204,13 @@ typedef struct TableAmRoutine
ParallelTableScanDesc pscan,
uint32 flags);
+ TableScanDesc (*scan_begin_with_column_projection)(Relation relation,
+ Snapshot snapshot,
+ int nkeys, struct ScanKeyData *key,
+ ParallelTableScanDesc parallel_scan,
+ uint32 flags,
+ bool *project_column);
+
/*
* Release resources and deallocate scan. If TableScanDesc.temp_snap,
* TableScanDesc.rs_snapshot needs to be unregistered.
@@ -278,6 +286,13 @@ typedef struct TableAmRoutine
*/
void (*index_fetch_end) (struct IndexFetchTableData *data);
+ /*
+ * Set the column projection bitmap for AMs that leverage column
+ * projection during scans.
+ */
+ void (*index_fetch_set_column_projection) (struct IndexFetchTableData *data,
+ bool *project_column);
+
/*
* Fetch tuple at `tid` into `slot`, after doing a visibility test
* according to `snapshot`. If a tuple was found and passed the visibility
@@ -743,6 +758,12 @@ table_beginscan(Relation rel, Snapshot snapshot,
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
+static inline bool
+table_scans_leverage_column_projection(Relation relation)
+{
+ return relation->rd_tableam->scans_leverage_column_projection;
+}
+
/*
* Like table_beginscan(), but for scanning catalog. It'll automatically use a
* snapshot appropriate for scanning catalog relations.
@@ -772,6 +793,19 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
+static inline TableScanDesc
+table_beginscan_with_column_projection(Relation relation, Snapshot snapshot,
+ int nkeys, struct ScanKeyData *key,
+ bool *project_column)
+{
+ uint32 flags = SO_TYPE_SEQSCAN |
+ SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
+
+ Assert(relation->rd_tableam->scans_leverage_column_projection);
+ return relation->rd_tableam->scan_begin_with_column_projection(
+ relation, snapshot, nkeys, key, NULL, flags, project_column);
+}
+
/*
* table_beginscan_bm is an alternative entry point for setting up a
* TableScanDesc for a bitmap heap scan. Although that scan technology is
@@ -956,6 +990,13 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
scan->rel->rd_tableam->index_fetch_end(scan);
}
+static inline void
+table_index_fetch_set_column_projection(struct IndexFetchTableData *scan,
+ bool *project_column)
+{
+ scan->rel->rd_tableam->index_fetch_set_column_projection(scan, project_column);
+}
+
/*
* Fetches, as part of an index scan, tuple at `tid` into `slot`, after doing
* a visibility test according to `snapshot`. If a tuple was found and passed
diff --git a/src/include/access/zedstore_compression.h b/src/include/access/zedstore_compression.h
new file mode 100644
index 0000000000..f70713a1a7
--- /dev/null
+++ b/src/include/access/zedstore_compression.h
@@ -0,0 +1,51 @@
+/*
+ * zedstore_compression.h
+ * internal declarations for ZedStore compression
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_compression.h
+ */
+#ifndef ZEDSTORE_COMPRESSION_H
+#define ZEDSTORE_COMPRESSION_H
+
+#include "storage/itemptr.h"
+
+typedef struct ZSDecompressContext
+{
+ char *buffer;
+ int bufsize; /* allocated size of 'buffer' */
+ int uncompressedsize;
+ int bytesread;
+} ZSDecompressContext;
+
+typedef struct ZSCompressContext
+{
+ char *uncompressedbuffer;
+
+ int maxCompressedSize;
+ int maxUncompressedSize;
+ char *buffer;
+ int nitems;
+ int rawsize;
+} ZSCompressContext;
+
+typedef struct ZSBtreeItem ZSBtreeItem;
+typedef struct ZSCompressedBtreeItem ZSCompressedBtreeItem;
+typedef struct ZSSingleBtreeItem ZSSingleBtreeItem;
+
+/* compression functions */
+extern void zs_compress_init(ZSCompressContext *context);
+extern void zs_compress_begin(ZSCompressContext *context, int maxCompressedSize);
+extern bool zs_compress_add(ZSCompressContext *context, ZSBtreeItem *item);
+extern ZSCompressedBtreeItem *zs_compress_finish(ZSCompressContext *context);
+extern void zs_compress_free(ZSCompressContext *context);
+
+/* decompression functions */
+extern void zs_decompress_init(ZSDecompressContext *context);
+extern void zs_decompress_chunk(ZSDecompressContext *context, ZSCompressedBtreeItem *chunk);
+extern ZSBtreeItem *zs_decompress_read_item(ZSDecompressContext *context);
+extern void zs_decompress_free(ZSDecompressContext *context);
+
+#endif /* ZEDSTORE_COMPRESSION_H */
diff --git a/src/include/access/zedstore_internal.h b/src/include/access/zedstore_internal.h
new file mode 100644
index 0000000000..8eb9f74b96
--- /dev/null
+++ b/src/include/access/zedstore_internal.h
@@ -0,0 +1,618 @@
+/*
+ * zedstore_internal.h
+ * internal declarations for ZedStore tables
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_internal.h
+ */
+#ifndef ZEDSTORE_INTERNAL_H
+#define ZEDSTORE_INTERNAL_H
+
+#include "access/tableam.h"
+#include "access/zedstore_compression.h"
+#include "access/zedstore_undo.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+
+#define ZS_META_ATTRIBUTE_NUM 0
+
+/*
+ * Throughout ZedStore, we pass around TIDs as uint64's, rather than ItemPointers,
+ * for speed.
+ */
+typedef uint64 zstid;
+
+#define InvalidZSTid 0
+#define MinZSTid 1 /* blk 0, off 1 */
+#define MaxZSTid ((uint64) MaxBlockNumber << 16 | 0xffff)
+/* note: if this is converted to ItemPointer, it is invalid */
+#define MaxPlusOneZSTid (MaxZSTid + 1)
+
+#define MaxZSTidOffsetNumber 129
+
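+/*
+ * A zstid packs a (block, offset) pair into a single 64-bit integer, with
+ * MaxZSTidOffsetNumber - 1 offsets per logical block. For example, with the
+ * current encoding, ZSTidFromBlkOff(0, 1) == 1 and ZSTidFromBlkOff(1, 1) == 129.
+ */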
+static inline zstid
+ZSTidFromBlkOff(BlockNumber blk, OffsetNumber off)
+{
+ Assert(off != 0);
+
+ return (uint64) blk * (MaxZSTidOffsetNumber - 1) + off;
+}
+
+static inline zstid
+ZSTidFromItemPointer(ItemPointerData iptr)
+{
+ Assert(ItemPointerIsValid(&iptr));
+ return ZSTidFromBlkOff(ItemPointerGetBlockNumber(&iptr),
+ ItemPointerGetOffsetNumber(&iptr));
+}
+
+static inline ItemPointerData
+ItemPointerFromZSTid(zstid tid)
+{
+ ItemPointerData iptr;
+ BlockNumber blk;
+ OffsetNumber off;
+
+ blk = (tid - 1) / (MaxZSTidOffsetNumber - 1);
+ off = (tid - 1) % (MaxZSTidOffsetNumber - 1) + 1;
+
+ ItemPointerSet(&iptr, blk, off);
+ Assert(ItemPointerIsValid(&iptr));
+ return iptr;
+}
+
+static inline BlockNumber
+ZSTidGetBlockNumber(zstid tid)
+{
+ return (BlockNumber) ((tid - 1) / (MaxZSTidOffsetNumber - 1));
+}
+
+static inline OffsetNumber
+ZSTidGetOffsetNumber(zstid tid)
+{
+ return (OffsetNumber) ((tid - 1) % (MaxZSTidOffsetNumber - 1) + 1);
+}
+
+/*
+ * A ZedStore table contains different kinds of pages, all in the same file.
+ *
+ * Block 0 is always a metapage. It contains the block numbers of the other
+ * data structures stored within the file, like the per-attribute B-trees,
+ * and the UNDO log. In addition, if there are overly large datums in the
+ * table, they are chopped into separate "toast" pages.
+ */
+#define ZS_META_PAGE_ID 0xF083
+#define ZS_BTREE_PAGE_ID 0xF084
+#define ZS_UNDO_PAGE_ID 0xF085
+#define ZS_TOAST_PAGE_ID 0xF086
+#define ZS_FPM_PAGE_ID 0xF087
+
+/* flags for zedstore b-tree pages */
+#define ZSBT_ROOT 0x0001
+
+typedef struct ZSBtreePageOpaque
+{
+ AttrNumber zs_attno;
+ BlockNumber zs_next;
+ zstid zs_lokey; /* inclusive */
+ zstid zs_hikey; /* exclusive */
+ uint16 zs_level; /* 0 = leaf */
+ uint16 zs_flags;
+ uint16 padding; /* padding, to put zs_page_id last */
+ uint16 zs_page_id; /* always ZS_BTREE_PAGE_ID */
+} ZSBtreePageOpaque;
+
+#define ZSBtreePageGetOpaque(page) ((ZSBtreePageOpaque *) PageGetSpecialPointer(page))
+
+/*
+ * Internal B-tree page layout.
+ *
+ * The "contents" of the page is an array of ZSBtreeInternalPageItem. The number
+ * of items can be deduced from pd_lower.
+ */
+typedef struct ZSBtreeInternalPageItem
+{
+ zstid tid;
+ BlockNumber childblk;
+} ZSBtreeInternalPageItem;
+
+static inline ZSBtreeInternalPageItem *
+ZSBtreeInternalPageGetItems(Page page)
+{
+ ZSBtreeInternalPageItem *items;
+
+ items = (ZSBtreeInternalPageItem *) PageGetContents(page);
+
+ return items;
+}
+static inline int
+ZSBtreeInternalPageGetNumItems(Page page)
+{
+ ZSBtreeInternalPageItem *begin;
+ ZSBtreeInternalPageItem *end;
+
+ begin = (ZSBtreeInternalPageItem *) PageGetContents(page);
+ end = (ZSBtreeInternalPageItem *) ((char *) page + ((PageHeader) page)->pd_lower);
+
+ return end - begin;
+}
+
+static inline bool
+ZSBtreeInternalPageIsFull(Page page)
+{
+ PageHeader phdr = (PageHeader) page;
+
+ return phdr->pd_upper - phdr->pd_lower < sizeof(ZSBtreeInternalPageItem);
+}
+
+/*
+ * Leaf B-tree page layout
+ *
+ * Leaf pages are packed with ZSBtreeItems. There are three kinds of items:
+ *
+ * 1. Single item, holds one tuple (or rather, one datum).
+ *
+ * 2. "Array item", holds multiple datums, with consecutive TIDs and the same
+ * visibility information. An array item saves space compared to multiple
+ * single items, by leaving out repetitive UNDO and TID fields. An array
+ * item cannot mix NULLs and non-NULLs, so the ZSBT_NULL flag applies to
+ * all elements.
+ *
+ * 3. "Compressed item", which can hold multiple single or array items.
+ *
+ * A single or array item can furthermore be marked as DEAD. A dead item
+ * prevents the TID (or TID range, for an array item) from being reused. It's
+ * used during VACUUM, to mark items for which there are no index pointers
+ * anymore. But it cannot be removed until the undo record has been trimmed
+ * away, because if the TID was reused for a new record, vacuum might remove
+ * the new tuple version instead. After t_undo_ptr becomes older than "oldest
+ * undo ptr", the item can be removed and the TID recycled.
+ *
+ * TODO: squeeze harder: eliminate padding, use high bits of t_tid for flags or size
+ */
+typedef struct ZSBtreeItem
+{
+ zstid t_tid;
+ uint16 t_size;
+ uint16 t_flags;
+} ZSBtreeItem;
+
+typedef struct ZSSingleBtreeItem
+{
+ /* these fields must match ZSBtreeItem */
+ zstid t_tid;
+ uint16 t_size;
+ uint16 t_flags;
+
+ ZSUndoRecPtr t_undo_ptr;
+
+ char t_payload[FLEXIBLE_ARRAY_MEMBER];
+} ZSSingleBtreeItem;
+
+typedef struct ZSArrayBtreeItem
+{
+ /* these fields must match ZSBtreeItem */
+ zstid t_tid;
+ uint16 t_size;
+ uint16 t_flags;
+
+ uint16 t_nelements;
+ ZSUndoRecPtr t_undo_ptr;
+
+ char t_payload[FLEXIBLE_ARRAY_MEMBER];
+} ZSArrayBtreeItem;
+
+typedef struct ZSCompressedBtreeItem
+{
+ /* these fields must match ZSBtreeItem */
+ zstid t_tid;
+ uint16 t_size;
+ uint16 t_flags;
+
+ uint16 t_uncompressedsize;
+ zstid t_lasttid; /* inclusive */
+
+ char t_payload[FLEXIBLE_ARRAY_MEMBER];
+} ZSCompressedBtreeItem;
+
+#define ZSBT_COMPRESSED 0x0001
+#define ZSBT_ARRAY 0x0002
+#define ZSBT_NULL 0x0010
+#define ZSBT_DEAD 0x0020
+
+/*
+ * Get the last TID that the given item spans.
+ *
+ * For a single item, it's the TID of the item. For an array item, it's the
+ * TID of the last element. For a compressed item, it's the last TID of the
+ * last item it contains (which is stored explicitly in the item header).
+ */
+static inline zstid
+zsbt_item_lasttid(ZSBtreeItem *item)
+{
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ return ((ZSCompressedBtreeItem *) item)->t_lasttid;
+ else if ((item->t_flags & ZSBT_ARRAY) != 0)
+ {
+ ZSArrayBtreeItem *aitem = (ZSArrayBtreeItem *) item;
+ return aitem->t_tid + aitem->t_nelements - 1;
+ }
+ else
+ return item->t_tid;
+}
+
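+/*
+ * Get the UNDO record pointer of a single or array item. Compressed items
+ * don't carry one; the caller must decompress first.
+ */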
+static inline ZSUndoRecPtr
+zsbt_item_undoptr(ZSBtreeItem *item)
+{
+ if ((item->t_flags & ZSBT_COMPRESSED) != 0)
+ elog(ERROR, "cannot get undo pointer from compressed item");
+ else if ((item->t_flags & ZSBT_ARRAY) != 0)
+ {
+ ZSArrayBtreeItem *aitem = (ZSArrayBtreeItem *) item;
+ return aitem->t_undo_ptr;
+ }
+ else
+ {
+ ZSSingleBtreeItem *sitem = (ZSSingleBtreeItem *) item;
+ return sitem->t_undo_ptr;
+ }
+}
+
+/*
+ * Toast page layout.
+ *
+ * When an overly large datum is stored, it is divided into chunks, and each
+ * chunk is stored on a dedicated toast page. The toast pages of a datum form
+ * a list; each page has a next/prev pointer.
+ */
+/*
+ * Maximum size of an individual untoasted Datum stored in ZedStore. Datums
+ * larger than this need to be toasted.
+ *
+ * A datum needs to fit on a B-tree page, with page and item headers.
+ *
+ * XXX: 500 accounts for all the headers. Need to compute this correctly...
+ */
+#define MaxZedStoreDatumSize (BLCKSZ - 500)
+
+typedef struct ZSToastPageOpaque
+{
+ AttrNumber zs_attno;
+
+ /* these are only set on the first page. */
+ zstid zs_tid;
+ uint32 zs_total_size;
+
+ uint32 zs_slice_offset;
+ BlockNumber zs_prev;
+ BlockNumber zs_next;
+ uint16 zs_flags;
+ uint16 padding1; /* padding, to put zs_page_id last */
+ uint16 padding2; /* padding, to put zs_page_id last */
+ uint16 zs_page_id;
+} ZSToastPageOpaque;
+
+/*
+ * "Toast pointer" of a datum that's stored in zedstore toast pages.
+ *
+ * This looks somewhat like a normal TOAST pointer, but we mustn't let these
+ * escape out of zedstore code, because the rest of the system doesn't know
+ * how to deal with them.
+ *
+ * This must look like varattrib_1b_e!
+ */
+typedef struct varatt_zs_toastptr
+{
+ /* varattrib_1b_e */
+ uint8 va_header;
+ uint8 va_tag; /* VARTAG_ZEDSTORE in zedstore toast datums */
+
+ /* first block */
+ BlockNumber zst_block;
+} varatt_zs_toastptr;
+
+/*
+ * va_tag value. this should be distinguishable from the values in
+ * vartag_external
+ */
+#define VARTAG_ZEDSTORE 10
+
+/*
+ * Versions of datumGetSize and datumCopy that know about ZedStore-toasted
+ * datums.
+ */
+static inline Size
+zs_datumGetSize(Datum value, bool typByVal, int typLen)
+{
+ if (typLen > 0)
+ return typLen;
+ else if (typLen == -1)
+ {
+ if (VARATT_IS_EXTERNAL(value) && VARTAG_EXTERNAL(value) == VARTAG_ZEDSTORE)
+ return sizeof(varatt_zs_toastptr);
+ else
+ return VARSIZE_ANY(value);
+ }
+ else
+ return datumGetSize(value, typByVal, typLen);
+}
+
+static inline Datum
+zs_datumCopy(Datum value, bool typByVal, int typLen)
+{
+ if (typLen < 0 && VARATT_IS_EXTERNAL(value) && VARTAG_EXTERNAL(value) == VARTAG_ZEDSTORE)
+ {
+ char *result = palloc(sizeof(varatt_zs_toastptr));
+
+ memcpy(result, DatumGetPointer(value), sizeof(varatt_zs_toastptr));
+
+ return PointerGetDatum(result);
+ }
+ else
+ return datumCopy(value, typByVal, typLen);
+}
+
+/*
+ * Block 0 on every ZedStore table is a metapage.
+ *
+ * It contains a directory of b-tree roots for each attribute, and lots more.
+ */
+#define ZS_META_BLK 0
+
+/*
+ * The metapage stores one of these for each attribute.
+ */
+typedef struct ZSRootDirItem
+{
+ BlockNumber root;
+} ZSRootDirItem;
+
+typedef struct ZSMetaPage
+{
+ int nattributes;
+ ZSRootDirItem tree_root_dir[FLEXIBLE_ARRAY_MEMBER]; /* one for each attribute */
+} ZSMetaPage;
+
+/*
+ * It's not clear what we should store in the "opaque" special area, and what
+ * as page contents, on a metapage. But keep at least the page_id field here,
+ * so that tools like pg_filedump can recognize it as a zedstore metapage.
+ */
+typedef struct ZSMetaPageOpaque
+{
+ uint64 zs_undo_counter;
+ BlockNumber zs_undo_head;
+ BlockNumber zs_undo_tail;
+ ZSUndoRecPtr zs_undo_oldestptr;
+
+ BlockNumber zs_fpm_root; /* root of the Free Page Map */
+
+ uint16 zs_flags;
+ uint16 zs_page_id;
+} ZSMetaPageOpaque;
+
+
+/*
+ * Holds the state of an in-progress scan on a zedstore btree.
+ */
+typedef struct ZSBtreeScan
+{
+ Relation rel;
+ AttrNumber attno;
+ TupleDesc tupledesc;
+
+ /*
+ * memory context that should be used for any allocations that go with the scan,
+ * like the decompression buffers. This isn't a dedicated context, you must still
+ * free everything to avoid leaking! We need this because the getnext function
+ * might be called in a short-lived memory context that is reset between calls.
+ */
+ MemoryContext context;
+
+ bool active;
+ Buffer lastbuf;
+ OffsetNumber lastoff;
+ zstid nexttid;
+ zstid endtid;
+ Snapshot snapshot;
+
+ /* in the "real" UNDO-log, this would probably be a global variable */
+ ZSUndoRecPtr recent_oldest_undo;
+
+ /* should this scan do predicate locking? Or check for conflicts? */
+ bool serializable;
+ bool acquire_predicate_tuple_locks;
+
+ /*
+ * if we have remaining items from a compressed container tuple, they
+ * are kept in the decompressor context, and 'has_decompressed' is true.
+ */
+ ZSDecompressContext decompressor;
+ bool has_decompressed;
+
+ /*
+ * These fields are used when the scan is processing an array tuple. They
+ * are also used for a single-item tuple - it works just like a
+ * single-element array tuple.
+ */
+ ZSUndoRecPtr array_undoptr;
+ int array_datums_allocated_size;
+ Datum *array_datums;
+ Datum *array_next_datum;
+ int array_elements_left;
+ bool array_isnull;
+
+} ZSBtreeScan;
+
+static inline Form_pg_attribute
+ZSBtreeScanGetAttInfo(ZSBtreeScan *scan)
+{
+ return TupleDescAttr(scan->tupledesc, scan->attno - 1);
+}
+
+/*
+ * zs_split_stack is used during page split, or page merge, to keep track
+ * of all the modified pages. The page split (or merge) routines don't
+ * modify pages directly, but they construct a list of 'zs_split_stack'
+ * entries. Each entry holds a buffer, and a temporary in-memory copy of
+ * a page that should be written to the buffer, once everything is completed.
+ * All the buffers are exclusively-locked.
+ */
+typedef struct zs_split_stack zs_split_stack;
+
+struct zs_split_stack
+{
+ zs_split_stack *next;
+
+ Buffer buf;
+ Page page; /* temp in-memory copy of page */
+ bool recycle; /* should the page be added to the FPM? */
+};
+
+/* prototypes for functions in zedstore_tidpage.c */
+extern void zsbt_tid_begin_scan(Relation rel,
+ zstid starttid, zstid endtid, Snapshot snapshot, ZSBtreeScan *scan);
+extern void zsbt_tid_reset_scan(ZSBtreeScan *scan, zstid starttid);
+extern void zsbt_tid_end_scan(ZSBtreeScan *scan);
+extern zstid zsbt_tid_scan_next(ZSBtreeScan *scan);
+
+extern void zsbt_tid_multi_insert(Relation rel,
+ zstid *tids, int ndatums,
+ TransactionId xid, CommandId cid, uint32 speculative_token, ZSUndoRecPtr prevundoptr);
+extern TM_Result zsbt_tid_delete(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart);
+extern TM_Result zsbt_tid_update(Relation rel, zstid otid,
+ TransactionId xid,
+ CommandId cid, bool key_update, Snapshot snapshot, Snapshot crosscheck,
+ bool wait, TM_FailureData *hufd, zstid *newtid_p);
+extern void zsbt_tid_clear_speculative_token(Relation rel, zstid tid, uint32 spectoken, bool forcomplete);
+extern void zsbt_tid_mark_dead(Relation rel, zstid tid, ZSUndoRecPtr undoptr);
+extern TM_Result zsbt_tid_lock(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ LockTupleMode lockmode, Snapshot snapshot, TM_FailureData *hufd, zstid *next_tid);
+extern void zsbt_tid_undo_deletion(Relation rel, zstid tid, ZSUndoRecPtr undoptr);
+extern zstid zsbt_get_last_tid(Relation rel);
+extern void zsbt_find_latest_tid(Relation rel, zstid *tid, Snapshot snapshot);
+
+/* prototypes for functions in zedstore_attrpage.c */
+extern void zsbt_attr_begin_scan(Relation rel, TupleDesc tdesc, AttrNumber attno,
+ zstid starttid, zstid endtid, ZSBtreeScan *scan);
+extern void zsbt_attr_reset_scan(ZSBtreeScan *scan, zstid starttid);
+extern void zsbt_attr_end_scan(ZSBtreeScan *scan);
+extern bool zsbt_attr_scan_next(ZSBtreeScan *scan);
+
+extern void zsbt_attr_multi_insert(Relation rel, AttrNumber attno,
+ Datum *datums, bool *isnulls, zstid *tids, int ndatums);
+
+/* prototypes for functions in zedstore_btree.c */
+extern zs_split_stack *zsbt_newroot(Relation rel, AttrNumber attno, int level, List *downlinks);
+extern zs_split_stack *zsbt_insert_downlinks(Relation rel, AttrNumber attno,
+ zstid leftlokey, BlockNumber leftblkno, int level,
+ List *downlinks);
+extern void zsbt_attr_remove(Relation rel, AttrNumber attno, zstid tid);
+extern zs_split_stack *zsbt_unlink_page(Relation rel, AttrNumber attno, Buffer buf, int level);
+extern Buffer zsbt_descend(Relation rel, AttrNumber attno, zstid key, int level, bool readonly);
+extern bool zsbt_page_is_expected(Relation rel, AttrNumber attno, zstid key, int level, Buffer buf);
+
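+/*
+ * Fast-forward a scan's position to 'tid', skipping over any already
+ * decoded array elements that fall between the current position and 'tid'.
+ */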
+static inline void
+zsbt_scan_skip(ZSBtreeScan *scan, zstid tid)
+{
+ if (tid > scan->nexttid)
+ {
+ if (scan->array_elements_left > 0)
+ {
+ int64 skip = tid - scan->nexttid - 1;
+
+ if (skip < scan->array_elements_left)
+ {
+ scan->array_next_datum += skip;
+ scan->array_elements_left -= skip;
+ }
+ else
+ {
+ scan->array_elements_left = 0;
+ }
+ }
+ scan->nexttid = tid;
+ }
+}
+
+/*
+ * Return the value of row identified with 'tid' in a scan.
+ *
+ * 'tid' must be greater than any previously returned item.
+ *
+ * Returns true if a matching item is found, false otherwise. After
+ * a false return, it's OK to call this again with another greater TID.
+ */
+static inline bool
+zsbt_scan_next_fetch(ZSBtreeScan *scan, Datum *datum, bool *isnull, zstid tid)
+{
+ if (!scan->active)
+ return false;
+
+ /* skip to the given tid. */
+ zsbt_scan_skip(scan, tid);
+
+ /*
+ * Fetch the next item from the scan. The item we're looking for might
+ * already be in scan->array_*.
+ */
+ do
+ {
+ if (tid < scan->nexttid)
+ {
+ /* The next item from this scan is beyond the TID we're looking for. */
+ return false;
+ }
+
+ if (scan->array_elements_left > 0)
+ {
+ *isnull = scan->array_isnull;
+ *datum = *(scan->array_next_datum++);
+ scan->nexttid++;
+ scan->array_elements_left--;
+ return true;
+ }
+ /* Advance the scan, and check again. */
+ } while (zsbt_attr_scan_next(scan));
+
+ return false;
+}
+
+extern PGDLLIMPORT const TupleTableSlotOps TTSOpsZedstore;
+
+/* prototypes for functions in zedstore_meta.c */
+extern void zsmeta_initmetapage(Relation rel);
+extern BlockNumber zsmeta_get_root_for_attribute(Relation rel, AttrNumber attno, bool for_update);
+extern void zsmeta_update_root_for_attribute(Relation rel, AttrNumber attno, Buffer metabuf, BlockNumber rootblk);
+extern void zsmeta_add_root_for_new_attributes(Relation rel, Page page);
+
+/* prototypes for functions in zedstore_visibility.c */
+extern TM_Result zs_SatisfiesUpdate(Relation rel, Snapshot snapshot,
+ ZSUndoRecPtr recent_oldest_undo, ZSBtreeItem *item,
+ LockTupleMode mode,
+ bool *undo_record_needed,
+ TM_FailureData *tmfd, zstid *next_tid);
+extern bool zs_SatisfiesVisibility(ZSBtreeScan *scan, ZSBtreeItem *item,
+ TransactionId *obsoleting_xid, zstid *next_tid);
+
+/* prototypes for functions in zedstore_toast.c */
+extern Datum zedstore_toast_datum(Relation rel, AttrNumber attno, Datum value);
+extern void zedstore_toast_finish(Relation rel, AttrNumber attno, Datum toasted, zstid tid);
+extern Datum zedstore_toast_flatten(Relation rel, AttrNumber attno, zstid tid, Datum toasted);
+
+/* prototypes for functions in zedstore_freepagemap.c */
+extern Buffer zspage_getnewbuf(Relation rel, Buffer metabuf);
+extern Buffer zspage_extendrel_newbuf(Relation rel);
+extern void zspage_delete_page(Relation rel, Buffer buf);
+
+/* prototypes for functions in zedstore_utils.c */
+extern zs_split_stack *zs_new_split_stack_entry(Buffer buf, Page page);
+extern void zs_apply_split_changes(Relation rel, zs_split_stack *stack);
+
+#endif /* ZEDSTORE_INTERNAL_H */
diff --git a/src/include/access/zedstore_undo.h b/src/include/access/zedstore_undo.h
new file mode 100644
index 0000000000..2b0c5406a6
--- /dev/null
+++ b/src/include/access/zedstore_undo.h
@@ -0,0 +1,171 @@
+/*
+ * zedstore_undo.h
+ * internal declarations for ZedStore undo logging
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_undo.h
+ */
+#ifndef ZEDSTORE_UNDO_H
+#define ZEDSTORE_UNDO_H
+
+#include "commands/vacuum.h"
+#include "utils/relcache.h"
+
+/* this must match the definition in zedstore_internal.h */
+typedef uint64 zstid;
+
+/*
+ * An UNDO-pointer.
+ *
+ * In the "real" UNDO-logging work from EDB, an UndoRecPtr is only 64 bits.
+ * But we make life easier for us, by encoding more information in it.
+ *
+ * 'counter' is a number that's incremented every time a new undo record is
+ * created. It can be used to determine if an undo pointer is too old to be
+ * of interest to anyone.
+ *
+ * 'blkno' and 'offset' are the physical location of the UNDO record. They
+ * can be used to easily fetch a given record.
+ */
+typedef struct
+{
+ uint64 counter;
+ BlockNumber blkno;
+ int32 offset;
+} ZSUndoRecPtr;
+
+/* TODO: assert that blkno and offset match, too, if counter matches */
+#define ZSUndoRecPtrEquals(a, b) ((a).counter == (b).counter)
+
+#define INVALID_SPECULATIVE_TOKEN 0
+
+typedef struct
+{
+ int16 size; /* size of this record, including header */
+ uint8 type; /* ZSUNDO_TYPE_* */
+ ZSUndoRecPtr undorecptr;
+ TransactionId xid;
+ CommandId cid;
+ zstid tid;
+ uint32 speculative_token; /* Only used for INSERT records */
+
+ /*
+ * UNDO-record of the inserter. This is needed if a row is inserted, and
+ * deleted, and there are some active snapshots that don't consider even
+ * the insertion as visible.
+ *
+ * This is also used in Insert records, if the record represents the
+ * new tuple version of an UPDATE, rather than an INSERT. It's needed to
+ * dig into possible KEY SHARE locks held on the row, which didn't prevent
+ * the tuple from being updated.
+ */
+ ZSUndoRecPtr prevundorec;
+} ZSUndoRec;
+
+#define ZSUNDO_TYPE_INSERT 1
+#define ZSUNDO_TYPE_DELETE 2
+#define ZSUNDO_TYPE_UPDATE 3
+#define ZSUNDO_TYPE_TUPLE_LOCK 4
+
+/*
+ * Type-specific record formats.
+ *
+ * We store similar info as zheap for INSERT/UPDATE/DELETE. See zheap README.
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+ zstid endtid; /* inclusive */
+
+} ZSUndoRec_Insert;
+
+typedef struct
+{
+ ZSUndoRec rec;
+
+ bool changedPart; /* tuple was moved to a different partition by UPDATE */
+
+ /*
+ * TODO: It might be good to move the deleted tuple to the undo-log, so
+ * that the space can immediately be reused. But currently, we don't do
+ * that. (or even better, move the old tuple to the undo-log lazily, if
+ * the space is needed for a new insertion, before the old tuple becomes
+ * recyclable.
+ */
+} ZSUndoRec_Delete;
+
+/*
+ * This is used for an UPDATE, to mark the old tuple version as updated.
+ * It's the same as a deletion, except this stores the TID of the new tuple
+ * version, so it can be followed in READ COMMITTED mode.
+ *
+ * The ZSUndoRec_Insert record is used for the insertion of the new tuple
+ * version.
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+
+ bool key_update; /* were key columns updated?
+ * (for conflicting with FOR KEY SHARE) */
+
+ zstid newtid;
+
+} ZSUndoRec_Update;
+
+/*
+ * This is used when a tuple is locked e.g. with SELECT FOR UPDATE.
+ * The tuple isn't really changed in any way, but the undo record gives
+ * a place to store the XID of the locking transaction.
+ *
+ * In case of a FOR SHARE lock, there can be multiple lockers. Each locker
+ * will create a new undo record with its own XID that points to the previous
+ * record. So the records will form a chain, leading finally to the insertion
+ * record (or beyond the UNDO horizon, meaning the tuple's insertion is visible
+ * to everyone).
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+
+ /*
+ * XXX: Is it OK to store this on disk? The enum values could change. Then
+ * again, no one should care about old locks that were acquired before
+ * last restart. Except with two-phase commit prepared transactions.
+ */
+ LockTupleMode lockmode;
+} ZSUndoRec_TupleLock;
+
+typedef struct
+{
+ BlockNumber next;
+ uint16 padding; /* padding, to put zs_page_id last */
+ uint16 zs_page_id; /* ZS_UNDO_PAGE_ID */
+} ZSUndoPageOpaque;
+
+static inline void
+ZSUndoRecPtrInitialize(ZSUndoRecPtr *uptr)
+{
+ uptr->blkno = InvalidBlockNumber;
+ uptr->offset = InvalidOffsetNumber;
+ uptr->counter = 0;
+}
+
+static inline bool
+IsZSUndoRecPtrValid(ZSUndoRecPtr *uptr)
+{
+ return (uptr->blkno != InvalidBlockNumber &&
+ uptr->offset != InvalidOffsetNumber);
+}
+
+/* prototypes for functions in zedstore_undo.c */
+extern ZSUndoRecPtr zsundo_insert(Relation rel, ZSUndoRec *rec);
+extern ZSUndoRec *zsundo_fetch(Relation rel, ZSUndoRecPtr undorecptr);
+extern void zsundo_clear_speculative_token(Relation rel, ZSUndoRecPtr undoptr);
+extern void zsundo_vacuum(Relation rel, VacuumParams *params, BufferAccessStrategy bstrategy,
+ TransactionId OldestXmin);
+extern ZSUndoRecPtr zsundo_get_oldest_undo_ptr(Relation rel);
+
+#endif /* ZEDSTORE_UNDO_H */
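
To make the comments in this header concrete, here is a minimal sketch of
how a tuple's undo chain can be followed using only the declarations above.
print_undo_chain is just an illustrative name, not a function in the patch;
the real visibility and locking checks live in zs_SatisfiesVisibility() and
zedstore_undo.c, and they also stop at the undo horizon (by comparing against
zsundo_get_oldest_undo_ptr()), which this sketch skips.

#include "postgres.h"
#include "access/zedstore_undo.h"

static void
print_undo_chain(Relation rel, ZSUndoRecPtr ptr)
{
	while (IsZSUndoRecPtrValid(&ptr))
	{
		ZSUndoRec  *rec = zsundo_fetch(rel, ptr);

		switch (rec->type)
		{
			case ZSUNDO_TYPE_TUPLE_LOCK:
				/* a FOR SHARE / FOR KEY SHARE locker */
				elog(DEBUG1, "tid " UINT64_FORMAT " locked by xid %u",
					 rec->tid, rec->xid);
				break;
			case ZSUNDO_TYPE_INSERT:
			case ZSUNDO_TYPE_UPDATE:
			case ZSUNDO_TYPE_DELETE:
				/* the transaction that created or removed this tuple version */
				elog(DEBUG1, "tid " UINT64_FORMAT " written by xid %u",
					 rec->tid, rec->xid);
				break;
		}
		/* follow the chain back towards the original insertion */
		ptr = rec->prevundorec;
	}
}
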
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 393b41dd68..f370f63460 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
{ oid => '3580', oid_symbol => 'BRIN_AM_OID',
descr => 'block range index (BRIN) access method',
amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '6668', oid_symbol => 'ZEDSTORE_TABLE_AM_OID',
+ descr => 'zedstore table access method',
+ amname => 'zedstore', amhandler => 'zedstore_tableam_handler', amtype => 't' },
]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 87335248a0..1df6febeca 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -873,6 +873,11 @@
proname => 'heap_tableam_handler', provolatile => 'v',
prorettype => 'table_am_handler', proargtypes => 'internal',
prosrc => 'heap_tableam_handler' },
+{ oid => '6669', oid_symbol => 'ZEDSTORE_TABLE_AM_HANDLER_OID',
+ descr => 'column-oriented table access method handler',
+ proname => 'zedstore_tableam_handler', provolatile => 'v',
+ prorettype => 'table_am_handler', proargtypes => 'internal',
+ prosrc => 'zedstore_tableam_handler' },
# Index access method handlers
{ oid => '330', descr => 'btree index access method handler',
@@ -10677,4 +10682,23 @@
proname => 'pg_partition_root', prorettype => 'regclass',
proargtypes => 'regclass', prosrc => 'pg_partition_root' },
+# zedstore inspection functions
+{ oid => '7000', descr => 'get zedstore page type',
+ proname => 'pg_zs_page_type', prorettype => 'text',
+ proargtypes => 'regclass int8', prosrc => 'pg_zs_page_type' },
+{ oid => '7001', descr => 'show stats about active zedstore undo pages',
+ proname => 'pg_zs_undo_pages', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int4,int4,int8,int8}',
+ proargmodes => '{i,o,o,o,o,o}',
+ proargnames => '{relid,blkno,nrecords,freespace,firstrecptr,lastrecptr}',
+ prosrc => 'pg_zs_undo_pages' },
+{ oid => '7002', descr => 'show stats about zedstore btree pages',
+ proname => 'pg_zs_btree_pages', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int8,int4,int4,int8,int8,int4,int4,int4,int4,int4}',
+ proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{relid,blkno,nextblk,attno,level,lokey,hikey,nitems,ncompressed,totalsz,uncompressedsz,freespace}',
+ prosrc => 'pg_zs_btree_pages' },
+
]
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 88134bcc71..2317c688e8 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -597,5 +597,6 @@ extern void CheckCmdReplicaIdentity(Relation rel, CmdType cmd);
extern void CheckSubscriptionRelkind(char relkind, const char *nspname,
const char *relname);
-
+extern void GetNeededColumnsForNode(Node *expr, bool *mask, int n);
+extern bool *GetNeededColumnsForScan(ScanState *scanstate, int ncol);
#endif /* EXECUTOR_H */
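
The two executor helpers added above are how a scan node can find out which
attributes the query actually needs. As a rough illustration of the intended
use (begin_projected_scan is a made-up name, and the attribute-numbering and
ncol conventions are assumptions, not taken from the patch), a column-oriented
scan could do something like this and then open only the per-attribute trees
that are marked as needed:

#include "postgres.h"
#include "executor/executor.h"
#include "utils/rel.h"

static void
begin_projected_scan(ScanState *scanstate, Relation rel)
{
	int			natts = RelationGetNumberOfAttributes(rel);
	bool	   *needed = GetNeededColumnsForScan(scanstate, natts);
	int			i;

	for (i = 0; i < natts; i++)
	{
		if (needed[i])
			elog(DEBUG1, "scan needs attribute %d of %s",
				 i + 1, RelationGetRelationName(rel));
	}
	/* remaining scan setup elided; only the needed columns would be fetched */
}
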
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 64122bc1e3..cd5b26118b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1423,6 +1423,7 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
Buffer ioss_VMBuffer;
Size ioss_PscanLen;
+ TupleTableSlot *ioss_TableSlot;
} IndexOnlyScanState;
/* ----------------
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 4b7703d478..b413bb9f78 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -681,6 +681,7 @@ typedef struct RelOptInfo
PlannerInfo *subroot; /* if subquery */
List *subplan_params; /* if subquery */
int rel_parallel_workers; /* wanted number of parallel workers */
+ bool leverage_column_projection; /* consider column projection for scans of this rel */
/* Information about foreign tables and foreign joins */
Oid serverid; /* identifies server for the table or join */
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 6cd4cfed0a..ad7870a0bb 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -350,6 +350,9 @@
/* Define to 1 if you have the `ldap_r' library (-lldap_r). */
#undef HAVE_LIBLDAP_R
+/* Define to 1 if you have the `lz4' library (-llz4). */
+#undef HAVE_LIBLZ4
+
/* Define to 1 if you have the `m' library (-lm). */
#undef HAVE_LIBM
@@ -389,6 +392,9 @@
/* Define to 1 if `long long int' works and is 64 bits. */
#undef HAVE_LONG_LONG_INT_64
+/* Define to 1 if you have the <lz4.h> header file. */
+#undef HAVE_LZ4_H
+
/* Define to 1 if you have the <mbarrier.h> header file. */
#undef HAVE_MBARRIER_H
@@ -932,6 +938,9 @@
/* Define to 1 to build with LLVM based JIT support. (--with-llvm) */
#undef USE_LLVM
+/* Define to 1 to build with LZ4 support. (--with-lz4) */
+#undef USE_LZ4
+
/* Define to select named POSIX semaphores. */
#undef USE_NAMED_POSIX_SEMAPHORES
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 376245ecd7..866c3a76f9 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -58,15 +58,18 @@ extern void RegisterPredicateLockingXid(TransactionId xid);
extern void PredicateLockRelation(Relation relation, Snapshot snapshot);
extern void PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot);
extern void PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot);
+extern void PredicateLockTID(Relation relation, ItemPointer tid, Snapshot snapshot);
extern void PredicateLockPageSplit(Relation relation, BlockNumber oldblkno, BlockNumber newblkno);
extern void PredicateLockPageCombine(Relation relation, BlockNumber oldblkno, BlockNumber newblkno);
extern void TransferPredicateLocksToHeapRelation(Relation relation);
extern void ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe);
/* conflict detection (may also trigger rollback) */
-extern void CheckForSerializableConflictOut(bool valid, Relation relation, HeapTuple tuple,
- Buffer buffer, Snapshot snapshot);
-extern void CheckForSerializableConflictIn(Relation relation, HeapTuple tuple, Buffer buffer);
+extern void heap_CheckForSerializableConflictOut(bool valid, Relation relation, HeapTuple tuple,
+ Buffer buffer, Snapshot snapshot);
+extern void CheckForSerializableConflictOut(Relation relation, TransactionId xid,
+ Snapshot snapshot);
+extern void CheckForSerializableConflictIn(Relation relation, ItemPointer tid, BlockNumber blkno);
extern void CheckTableForSerializableConflictIn(Relation relation);
/* final rollback checking */
diff --git a/src/test/isolation/specs/read-only-anomaly-2.spec b/src/test/isolation/specs/read-only-anomaly-2.spec
index 9812f49ee4..2b17fcb521 100644
--- a/src/test/isolation/specs/read-only-anomaly-2.spec
+++ b/src/test/isolation/specs/read-only-anomaly-2.spec
@@ -18,13 +18,15 @@ teardown
}
session "s1"
-setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; }
+setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ SET enable_seqscan=off; }
step "s1ry" { SELECT balance FROM bank_account WHERE id = 'Y'; }
step "s1wy" { UPDATE bank_account SET balance = 20 WHERE id = 'Y'; }
step "s1c" { COMMIT; }
session "s2"
-setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; }
+setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ SET enable_seqscan=off; }
step "s2rx" { SELECT balance FROM bank_account WHERE id = 'X'; }
step "s2ry" { SELECT balance FROM bank_account WHERE id = 'Y'; }
step "s2wx" { UPDATE bank_account SET balance = -11 WHERE id = 'X'; }
diff --git a/src/test/regress/expected/.gitignore b/src/test/regress/expected/.gitignore
index 93c56c85a0..0eb6984372 100644
--- a/src/test/regress/expected/.gitignore
+++ b/src/test/regress/expected/.gitignore
@@ -5,5 +5,6 @@
/largeobject.out
/largeobject_1.out
/misc.out
+/misc_1.out
/security_label.out
/tablespace.out
diff --git a/src/test/regress/expected/alter_table_1.out b/src/test/regress/expected/alter_table_1.out
new file mode 100644
index 0000000000..09d60af3b7
--- /dev/null
+++ b/src/test/regress/expected/alter_table_1.out
@@ -0,0 +1,3997 @@
+--
+-- ALTER_TABLE
+--
+-- Clean up in case a prior regression run failed
+SET client_min_messages TO 'warning';
+DROP ROLE IF EXISTS regress_alter_table_user1;
+RESET client_min_messages;
+CREATE USER regress_alter_table_user1;
+--
+-- add attribute
+--
+CREATE TABLE attmp (initial int4);
+COMMENT ON TABLE attmp_wrong IS 'table comment';
+ERROR: relation "attmp_wrong" does not exist
+COMMENT ON TABLE attmp IS 'table comment';
+COMMENT ON TABLE attmp IS NULL;
+ALTER TABLE attmp ADD COLUMN xmin integer; -- fails
+ERROR: column name "xmin" conflicts with a system column name
+ALTER TABLE attmp ADD COLUMN a int4 default 3;
+ALTER TABLE attmp ADD COLUMN b name;
+ALTER TABLE attmp ADD COLUMN c text;
+ALTER TABLE attmp ADD COLUMN d float8;
+ALTER TABLE attmp ADD COLUMN e float4;
+ALTER TABLE attmp ADD COLUMN f int2;
+ALTER TABLE attmp ADD COLUMN g polygon;
+ALTER TABLE attmp ADD COLUMN i char;
+ALTER TABLE attmp ADD COLUMN k int4;
+ALTER TABLE attmp ADD COLUMN l tid;
+ALTER TABLE attmp ADD COLUMN m xid;
+ALTER TABLE attmp ADD COLUMN n oidvector;
+--ALTER TABLE attmp ADD COLUMN o lock;
+ALTER TABLE attmp ADD COLUMN p boolean;
+ALTER TABLE attmp ADD COLUMN q point;
+ALTER TABLE attmp ADD COLUMN r lseg;
+ALTER TABLE attmp ADD COLUMN s path;
+ALTER TABLE attmp ADD COLUMN t box;
+ALTER TABLE attmp ADD COLUMN v timestamp;
+ALTER TABLE attmp ADD COLUMN w interval;
+ALTER TABLE attmp ADD COLUMN x float8[];
+ALTER TABLE attmp ADD COLUMN y float4[];
+ALTER TABLE attmp ADD COLUMN z int2[];
+INSERT INTO attmp (a, b, c, d, e, f, g, i, k, l, m, n, p, q, r, s, t,
+ v, w, x, y, z)
+ VALUES (4, 'name', 'text', 4.1, 4.1, 2, '(4.1,4.1,3.1,3.1)',
+ 'c',
+ 314159, '(1,1)', '512',
+ '1 2 3 4 5 6 7 8', true, '(1.1,1.1)', '(4.1,4.1,3.1,3.1)',
+ '(0,2,4.1,4.1,3.1,3.1)', '(4.1,4.1,3.1,3.1)',
+ 'epoch', '01:00:10', '{1.0,2.0,3.0,4.0}', '{1.0,2.0,3.0,4.0}', '{1,2,3,4}');
+SELECT * FROM attmp;
+ initial | a | b | c | d | e | f | g | i | k | l | m | n | p | q | r | s | t | v | w | x | y | z
+---------+---+------+------+-----+-----+---+-----------------------+---+--------+-------+-----+-----------------+---+-----------+-----------------------+-----------------------------+---------------------+--------------------------+------------------+-----------+-----------+-----------
+ | 4 | name | text | 4.1 | 4.1 | 2 | ((4.1,4.1),(3.1,3.1)) | c | 314159 | (1,1) | 512 | 1 2 3 4 5 6 7 8 | t | (1.1,1.1) | [(4.1,4.1),(3.1,3.1)] | ((0,2),(4.1,4.1),(3.1,3.1)) | (4.1,4.1),(3.1,3.1) | Thu Jan 01 00:00:00 1970 | @ 1 hour 10 secs | {1,2,3,4} | {1,2,3,4} | {1,2,3,4}
+(1 row)
+
+DROP TABLE attmp;
+-- the wolf bug - schema mods caused inconsistent row descriptors
+CREATE TABLE attmp (
+ initial int4
+);
+ALTER TABLE attmp ADD COLUMN a int4;
+ALTER TABLE attmp ADD COLUMN b name;
+ALTER TABLE attmp ADD COLUMN c text;
+ALTER TABLE attmp ADD COLUMN d float8;
+ALTER TABLE attmp ADD COLUMN e float4;
+ALTER TABLE attmp ADD COLUMN f int2;
+ALTER TABLE attmp ADD COLUMN g polygon;
+ALTER TABLE attmp ADD COLUMN i char;
+ALTER TABLE attmp ADD COLUMN k int4;
+ALTER TABLE attmp ADD COLUMN l tid;
+ALTER TABLE attmp ADD COLUMN m xid;
+ALTER TABLE attmp ADD COLUMN n oidvector;
+--ALTER TABLE attmp ADD COLUMN o lock;
+ALTER TABLE attmp ADD COLUMN p boolean;
+ALTER TABLE attmp ADD COLUMN q point;
+ALTER TABLE attmp ADD COLUMN r lseg;
+ALTER TABLE attmp ADD COLUMN s path;
+ALTER TABLE attmp ADD COLUMN t box;
+ALTER TABLE attmp ADD COLUMN v timestamp;
+ALTER TABLE attmp ADD COLUMN w interval;
+ALTER TABLE attmp ADD COLUMN x float8[];
+ALTER TABLE attmp ADD COLUMN y float4[];
+ALTER TABLE attmp ADD COLUMN z int2[];
+INSERT INTO attmp (a, b, c, d, e, f, g, i, k, l, m, n, p, q, r, s, t,
+ v, w, x, y, z)
+ VALUES (4, 'name', 'text', 4.1, 4.1, 2, '(4.1,4.1,3.1,3.1)',
+ 'c',
+ 314159, '(1,1)', '512',
+ '1 2 3 4 5 6 7 8', true, '(1.1,1.1)', '(4.1,4.1,3.1,3.1)',
+ '(0,2,4.1,4.1,3.1,3.1)', '(4.1,4.1,3.1,3.1)',
+ 'epoch', '01:00:10', '{1.0,2.0,3.0,4.0}', '{1.0,2.0,3.0,4.0}', '{1,2,3,4}');
+SELECT * FROM attmp;
+ initial | a | b | c | d | e | f | g | i | k | l | m | n | p | q | r | s | t | v | w | x | y | z
+---------+---+------+------+-----+-----+---+-----------------------+---+--------+-------+-----+-----------------+---+-----------+-----------------------+-----------------------------+---------------------+--------------------------+------------------+-----------+-----------+-----------
+ | 4 | name | text | 4.1 | 4.1 | 2 | ((4.1,4.1),(3.1,3.1)) | c | 314159 | (1,1) | 512 | 1 2 3 4 5 6 7 8 | t | (1.1,1.1) | [(4.1,4.1),(3.1,3.1)] | ((0,2),(4.1,4.1),(3.1,3.1)) | (4.1,4.1),(3.1,3.1) | Thu Jan 01 00:00:00 1970 | @ 1 hour 10 secs | {1,2,3,4} | {1,2,3,4} | {1,2,3,4}
+(1 row)
+
+CREATE INDEX attmp_idx ON attmp (a, (d + e), b);
+ALTER INDEX attmp_idx ALTER COLUMN 0 SET STATISTICS 1000;
+ERROR: column number must be in range from 1 to 32767
+LINE 1: ALTER INDEX attmp_idx ALTER COLUMN 0 SET STATISTICS 1000;
+ ^
+ALTER INDEX attmp_idx ALTER COLUMN 1 SET STATISTICS 1000;
+ERROR: cannot alter statistics on non-expression column "a" of index "attmp_idx"
+HINT: Alter statistics on table column instead.
+ALTER INDEX attmp_idx ALTER COLUMN 2 SET STATISTICS 1000;
+\d+ attmp_idx
+ Index "public.attmp_idx"
+ Column | Type | Key? | Definition | Storage | Stats target
+--------+------------------+------+------------+---------+--------------
+ a | integer | yes | a | plain |
+ expr | double precision | yes | (d + e) | plain | 1000
+ b | cstring | yes | b | plain |
+btree, for table "public.attmp"
+
+ALTER INDEX attmp_idx ALTER COLUMN 3 SET STATISTICS 1000;
+ERROR: cannot alter statistics on non-expression column "b" of index "attmp_idx"
+HINT: Alter statistics on table column instead.
+ALTER INDEX attmp_idx ALTER COLUMN 4 SET STATISTICS 1000;
+ERROR: column number 4 of relation "attmp_idx" does not exist
+ALTER INDEX attmp_idx ALTER COLUMN 2 SET STATISTICS -1;
+DROP TABLE attmp;
+--
+-- rename - check on both non-temp and temp tables
+--
+CREATE TABLE attmp (regtable int);
+CREATE TEMP TABLE attmp (attmptable int);
+ALTER TABLE attmp RENAME TO attmp_new;
+SELECT * FROM attmp;
+ regtable
+----------
+(0 rows)
+
+SELECT * FROM attmp_new;
+ attmptable
+------------
+(0 rows)
+
+ALTER TABLE attmp RENAME TO attmp_new2;
+SELECT * FROM attmp; -- should fail
+ERROR: relation "attmp" does not exist
+LINE 1: SELECT * FROM attmp;
+ ^
+SELECT * FROM attmp_new;
+ attmptable
+------------
+(0 rows)
+
+SELECT * FROM attmp_new2;
+ regtable
+----------
+(0 rows)
+
+DROP TABLE attmp_new;
+DROP TABLE attmp_new2;
+-- check rename of partitioned tables and indexes also
+CREATE TABLE part_attmp (a int primary key) partition by range (a);
+CREATE TABLE part_attmp1 PARTITION OF part_attmp FOR VALUES FROM (0) TO (100);
+ALTER INDEX part_attmp_pkey RENAME TO part_attmp_index;
+ALTER INDEX part_attmp1_pkey RENAME TO part_attmp1_index;
+ALTER TABLE part_attmp RENAME TO part_at2tmp;
+ALTER TABLE part_attmp1 RENAME TO part_at2tmp1;
+SET ROLE regress_alter_table_user1;
+ALTER INDEX part_attmp_index RENAME TO fail;
+ERROR: must be owner of index part_attmp_index
+ALTER INDEX part_attmp1_index RENAME TO fail;
+ERROR: must be owner of index part_attmp1_index
+ALTER TABLE part_at2tmp RENAME TO fail;
+ERROR: must be owner of table part_at2tmp
+ALTER TABLE part_at2tmp1 RENAME TO fail;
+ERROR: must be owner of table part_at2tmp1
+RESET ROLE;
+DROP TABLE part_at2tmp;
+--
+-- check renaming to a table's array type's autogenerated name
+-- (the array type's name should get out of the way)
+--
+CREATE TABLE attmp_array (id int);
+CREATE TABLE attmp_array2 (id int);
+SELECT typname FROM pg_type WHERE oid = 'attmp_array[]'::regtype;
+ typname
+--------------
+ _attmp_array
+(1 row)
+
+SELECT typname FROM pg_type WHERE oid = 'attmp_array2[]'::regtype;
+ typname
+---------------
+ _attmp_array2
+(1 row)
+
+ALTER TABLE attmp_array2 RENAME TO _attmp_array;
+SELECT typname FROM pg_type WHERE oid = 'attmp_array[]'::regtype;
+ typname
+---------------
+ __attmp_array
+(1 row)
+
+SELECT typname FROM pg_type WHERE oid = '_attmp_array[]'::regtype;
+ typname
+----------------
+ ___attmp_array
+(1 row)
+
+DROP TABLE _attmp_array;
+DROP TABLE attmp_array;
+-- renaming to table's own array type's name is an interesting corner case
+CREATE TABLE attmp_array (id int);
+SELECT typname FROM pg_type WHERE oid = 'attmp_array[]'::regtype;
+ typname
+--------------
+ _attmp_array
+(1 row)
+
+ALTER TABLE attmp_array RENAME TO _attmp_array;
+SELECT typname FROM pg_type WHERE oid = '_attmp_array[]'::regtype;
+ typname
+---------------
+ __attmp_array
+(1 row)
+
+DROP TABLE _attmp_array;
+-- ALTER TABLE ... RENAME on non-table relations
+-- renaming indexes (FIXME: this should probably test the index's functionality)
+ALTER INDEX IF EXISTS __onek_unique1 RENAME TO attmp_onek_unique1;
+NOTICE: relation "__onek_unique1" does not exist, skipping
+ALTER INDEX IF EXISTS __attmp_onek_unique1 RENAME TO onek_unique1;
+NOTICE: relation "__attmp_onek_unique1" does not exist, skipping
+ALTER INDEX onek_unique1 RENAME TO attmp_onek_unique1;
+ALTER INDEX attmp_onek_unique1 RENAME TO onek_unique1;
+SET ROLE regress_alter_table_user1;
+ALTER INDEX onek_unique1 RENAME TO fail; -- permission denied
+ERROR: must be owner of index onek_unique1
+RESET ROLE;
+-- renaming views
+CREATE VIEW attmp_view (unique1) AS SELECT unique1 FROM tenk1;
+ALTER TABLE attmp_view RENAME TO attmp_view_new;
+SET ROLE regress_alter_table_user1;
+ALTER VIEW attmp_view_new RENAME TO fail; -- permission denied
+ERROR: must be owner of view attmp_view_new
+RESET ROLE;
+-- hack to ensure we get an indexscan here
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+-- 5 values, sorted
+SELECT unique1 FROM tenk1 WHERE unique1 < 5;
+ unique1
+---------
+ 0
+ 1
+ 2
+ 3
+ 4
+(5 rows)
+
+reset enable_seqscan;
+reset enable_bitmapscan;
+DROP VIEW attmp_view_new;
+-- toast-like relation name
+alter table stud_emp rename to pg_toast_stud_emp;
+alter table pg_toast_stud_emp rename to stud_emp;
+-- renaming index should rename constraint as well
+ALTER TABLE onek ADD CONSTRAINT onek_unique1_constraint UNIQUE (unique1);
+ALTER INDEX onek_unique1_constraint RENAME TO onek_unique1_constraint_foo;
+ALTER TABLE onek DROP CONSTRAINT onek_unique1_constraint_foo;
+-- renaming constraint
+ALTER TABLE onek ADD CONSTRAINT onek_check_constraint CHECK (unique1 >= 0);
+ALTER TABLE onek RENAME CONSTRAINT onek_check_constraint TO onek_check_constraint_foo;
+ALTER TABLE onek DROP CONSTRAINT onek_check_constraint_foo;
+-- renaming constraint should rename index as well
+ALTER TABLE onek ADD CONSTRAINT onek_unique1_constraint UNIQUE (unique1);
+DROP INDEX onek_unique1_constraint; -- to see whether it's there
+ERROR: cannot drop index onek_unique1_constraint because constraint onek_unique1_constraint on table onek requires it
+HINT: You can drop constraint onek_unique1_constraint on table onek instead.
+ALTER TABLE onek RENAME CONSTRAINT onek_unique1_constraint TO onek_unique1_constraint_foo;
+DROP INDEX onek_unique1_constraint_foo; -- to see whether it's there
+ERROR: cannot drop index onek_unique1_constraint_foo because constraint onek_unique1_constraint_foo on table onek requires it
+HINT: You can drop constraint onek_unique1_constraint_foo on table onek instead.
+ALTER TABLE onek DROP CONSTRAINT onek_unique1_constraint_foo;
+-- renaming constraints vs. inheritance
+CREATE TABLE constraint_rename_test (a int CONSTRAINT con1 CHECK (a > 0), b int, c int);
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+Check constraints:
+ "con1" CHECK (a > 0)
+
+CREATE TABLE constraint_rename_test2 (a int CONSTRAINT con1 CHECK (a > 0), d int) INHERITS (constraint_rename_test);
+NOTICE: merging column "a" with inherited definition
+NOTICE: merging constraint "con1" with inherited definition
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+ALTER TABLE constraint_rename_test2 RENAME CONSTRAINT con1 TO con1foo; -- fail
+ERROR: cannot rename inherited constraint "con1"
+ALTER TABLE ONLY constraint_rename_test RENAME CONSTRAINT con1 TO con1foo; -- fail
+ERROR: inherited constraint "con1" must be renamed in child tables too
+ALTER TABLE constraint_rename_test RENAME CONSTRAINT con1 TO con1foo; -- ok
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+ALTER TABLE constraint_rename_test ADD CONSTRAINT con2 CHECK (b > 0) NO INHERIT;
+ALTER TABLE ONLY constraint_rename_test RENAME CONSTRAINT con2 TO con2foo; -- ok
+ALTER TABLE constraint_rename_test RENAME CONSTRAINT con2foo TO con2bar; -- ok
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+ "con2bar" CHECK (b > 0) NO INHERIT
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+ALTER TABLE constraint_rename_test ADD CONSTRAINT con3 PRIMARY KEY (a);
+ALTER TABLE constraint_rename_test RENAME CONSTRAINT con3 TO con3foo; -- ok
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | not null |
+ b | integer | | |
+ c | integer | | |
+Indexes:
+ "con3foo" PRIMARY KEY, btree (a)
+Check constraints:
+ "con1foo" CHECK (a > 0)
+ "con2bar" CHECK (b > 0) NO INHERIT
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | not null |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+DROP TABLE constraint_rename_test2;
+DROP TABLE constraint_rename_test;
+ALTER TABLE IF EXISTS constraint_not_exist RENAME CONSTRAINT con3 TO con3foo; -- ok
+NOTICE: relation "constraint_not_exist" does not exist, skipping
+ALTER TABLE IF EXISTS constraint_rename_test ADD CONSTRAINT con4 UNIQUE (a);
+NOTICE: relation "constraint_rename_test" does not exist, skipping
+-- renaming constraints with cache reset of target relation
+CREATE TABLE constraint_rename_cache (a int,
+ CONSTRAINT chk_a CHECK (a > 0),
+ PRIMARY KEY (a));
+ALTER TABLE constraint_rename_cache
+ RENAME CONSTRAINT chk_a TO chk_a_new;
+ALTER TABLE constraint_rename_cache
+ RENAME CONSTRAINT constraint_rename_cache_pkey TO constraint_rename_pkey_new;
+CREATE TABLE like_constraint_rename_cache
+ (LIKE constraint_rename_cache INCLUDING ALL);
+\d like_constraint_rename_cache
+ Table "public.like_constraint_rename_cache"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | not null |
+Indexes:
+ "like_constraint_rename_cache_pkey" PRIMARY KEY, btree (a)
+Check constraints:
+ "chk_a_new" CHECK (a > 0)
+
+DROP TABLE constraint_rename_cache;
+DROP TABLE like_constraint_rename_cache;
+-- FOREIGN KEY CONSTRAINT adding TEST
+CREATE TABLE attmp2 (a int primary key);
+CREATE TABLE attmp3 (a int, b int);
+CREATE TABLE attmp4 (a int, b int, unique(a,b));
+CREATE TABLE attmp5 (a int, b int);
+-- Insert rows into attmp2 (pktable)
+INSERT INTO attmp2 values (1);
+INSERT INTO attmp2 values (2);
+INSERT INTO attmp2 values (3);
+INSERT INTO attmp2 values (4);
+-- Insert rows into attmp3
+INSERT INTO attmp3 values (1,10);
+INSERT INTO attmp3 values (1,20);
+INSERT INTO attmp3 values (5,50);
+-- Try (and fail) to add constraint due to invalid source columns
+ALTER TABLE attmp3 add constraint attmpconstr foreign key(c) references attmp2 match full;
+ERROR: column "c" referenced in foreign key constraint does not exist
+-- Try (and fail) to add constraint due to invalid destination columns explicitly given
+ALTER TABLE attmp3 add constraint attmpconstr foreign key(a) references attmp2(b) match full;
+ERROR: column "b" referenced in foreign key constraint does not exist
+-- Try (and fail) to add constraint due to invalid data
+ALTER TABLE attmp3 add constraint attmpconstr foreign key (a) references attmp2 match full;
+ERROR: insert or update on table "attmp3" violates foreign key constraint "attmpconstr"
+DETAIL: Key (a)=(5) is not present in table "attmp2".
+-- Delete failing row
+DELETE FROM attmp3 where a=5;
+-- Try (and succeed)
+ALTER TABLE attmp3 add constraint attmpconstr foreign key (a) references attmp2 match full;
+ALTER TABLE attmp3 drop constraint attmpconstr;
+INSERT INTO attmp3 values (5,50);
+-- Try NOT VALID and then VALIDATE CONSTRAINT, but fails. Delete failure then re-validate
+ALTER TABLE attmp3 add constraint attmpconstr foreign key (a) references attmp2 match full NOT VALID;
+ALTER TABLE attmp3 validate constraint attmpconstr;
+ERROR: insert or update on table "attmp3" violates foreign key constraint "attmpconstr"
+DETAIL: Key (a)=(5) is not present in table "attmp2".
+-- Delete failing row
+DELETE FROM attmp3 where a=5;
+-- Try (and succeed) and repeat to show it works on already valid constraint
+ALTER TABLE attmp3 validate constraint attmpconstr;
+ALTER TABLE attmp3 validate constraint attmpconstr;
+-- Try a non-verified CHECK constraint
+ALTER TABLE attmp3 ADD CONSTRAINT b_greater_than_ten CHECK (b > 10); -- fail
+ERROR: check constraint "b_greater_than_ten" is violated by some row
+ALTER TABLE attmp3 ADD CONSTRAINT b_greater_than_ten CHECK (b > 10) NOT VALID; -- succeeds
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- fails
+ERROR: check constraint "b_greater_than_ten" is violated by some row
+DELETE FROM attmp3 WHERE NOT b > 10;
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- succeeds
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- succeeds
+-- Test inherited NOT VALID CHECK constraints
+select * from attmp3;
+ a | b
+---+----
+ 1 | 20
+(1 row)
+
+CREATE TABLE attmp6 () INHERITS (attmp3);
+CREATE TABLE attmp7 () INHERITS (attmp3);
+INSERT INTO attmp6 VALUES (6, 30), (7, 16);
+ALTER TABLE attmp3 ADD CONSTRAINT b_le_20 CHECK (b <= 20) NOT VALID;
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_le_20; -- fails
+ERROR: check constraint "b_le_20" is violated by some row
+DELETE FROM attmp6 WHERE b > 20;
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_le_20; -- succeeds
+-- An already validated constraint must not be revalidated
+CREATE FUNCTION boo(int) RETURNS int IMMUTABLE STRICT LANGUAGE plpgsql AS $$ BEGIN RAISE NOTICE 'boo: %', $1; RETURN $1; END; $$;
+INSERT INTO attmp7 VALUES (8, 18);
+ALTER TABLE attmp7 ADD CONSTRAINT identity CHECK (b = boo(b));
+NOTICE: boo: 18
+ALTER TABLE attmp3 ADD CONSTRAINT IDENTITY check (b = boo(b)) NOT VALID;
+NOTICE: merging constraint "identity" with inherited definition
+ALTER TABLE attmp3 VALIDATE CONSTRAINT identity;
+NOTICE: boo: 16
+NOTICE: boo: 20
+-- A NO INHERIT constraint should not be looked for in children during VALIDATE CONSTRAINT
+create table parent_noinh_convalid (a int);
+create table child_noinh_convalid () inherits (parent_noinh_convalid);
+insert into parent_noinh_convalid values (1);
+insert into child_noinh_convalid values (1);
+alter table parent_noinh_convalid add constraint check_a_is_2 check (a = 2) no inherit not valid;
+-- fail, because of the row in parent
+alter table parent_noinh_convalid validate constraint check_a_is_2;
+ERROR: check constraint "check_a_is_2" is violated by some row
+delete from only parent_noinh_convalid;
+-- ok (parent itself contains no violating rows)
+alter table parent_noinh_convalid validate constraint check_a_is_2;
+select convalidated from pg_constraint where conrelid = 'parent_noinh_convalid'::regclass and conname = 'check_a_is_2';
+ convalidated
+--------------
+ t
+(1 row)
+
+-- cleanup
+drop table parent_noinh_convalid, child_noinh_convalid;
+-- Try (and fail) to create constraint from attmp5(a) to attmp4(a) - unique constraint on
+-- attmp4 is a,b
+ALTER TABLE attmp5 add constraint attmpconstr foreign key(a) references attmp4(a) match full;
+ERROR: there is no unique constraint matching given keys for referenced table "attmp4"
+DROP TABLE attmp7;
+DROP TABLE attmp6;
+DROP TABLE attmp5;
+DROP TABLE attmp4;
+DROP TABLE attmp3;
+DROP TABLE attmp2;
+-- NOT VALID with plan invalidation -- ensure we don't use a constraint for
+-- exclusion until validated
+set constraint_exclusion TO 'partition';
+create table nv_parent (d date, check (false) no inherit not valid);
+-- not valid constraint added at creation time should automatically become valid
+\d nv_parent
+ Table "public.nv_parent"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ d | date | | |
+Check constraints:
+ "nv_parent_check" CHECK (false) NO INHERIT
+
+create table nv_child_2010 () inherits (nv_parent);
+create table nv_child_2011 () inherits (nv_parent);
+alter table nv_child_2010 add check (d between '2010-01-01'::date and '2010-12-31'::date) not valid;
+alter table nv_child_2011 add check (d between '2011-01-01'::date and '2011-12-31'::date) not valid;
+explain (costs off) select * from nv_parent where d between '2011-08-01' and '2011-08-31';
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2010
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2011
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+(7 rows)
+
+create table nv_child_2009 (check (d between '2009-01-01'::date and '2009-12-31'::date)) inherits (nv_parent);
+explain (costs off) select * from nv_parent where d between '2011-08-01'::date and '2011-08-31'::date;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2010
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2011
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+(7 rows)
+
+explain (costs off) select * from nv_parent where d between '2009-08-01'::date and '2009-08-31'::date;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2010
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2011
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2009
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+(9 rows)
+
+-- after validation, the constraint should be used
+alter table nv_child_2011 VALIDATE CONSTRAINT nv_child_2011_d_check;
+explain (costs off) select * from nv_parent where d between '2009-08-01'::date and '2009-08-31'::date;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2010
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2009
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+(7 rows)
+
+-- add an inherited NOT VALID constraint
+alter table nv_parent add check (d between '2001-01-01'::date and '2099-12-31'::date) not valid;
+\d nv_child_2009
+ Table "public.nv_child_2009"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ d | date | | |
+Check constraints:
+ "nv_child_2009_d_check" CHECK (d >= '01-01-2009'::date AND d <= '12-31-2009'::date)
+ "nv_parent_d_check" CHECK (d >= '01-01-2001'::date AND d <= '12-31-2099'::date) NOT VALID
+Inherits: nv_parent
+
+-- we leave nv_parent and children around to help test pg_dump logic
+-- Foreign key adding test with mixed types
+-- Note: these tables are TEMP to avoid name conflicts when this test
+-- is run in parallel with foreign_key.sql.
+CREATE TEMP TABLE PKTABLE (ptest1 int PRIMARY KEY);
+INSERT INTO PKTABLE VALUES(42);
+CREATE TEMP TABLE FKTABLE (ftest1 inet);
+-- This next should fail, because int=inet does not exist
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+ERROR: foreign key constraint "fktable_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: inet and integer.
+-- This should also fail for the same reason, but here we
+-- give the column name
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable(ptest1);
+ERROR: foreign key constraint "fktable_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: inet and integer.
+DROP TABLE FKTABLE;
+-- This should succeed, even though they are different types,
+-- because int=int8 exists and is a member of the integer opfamily
+CREATE TEMP TABLE FKTABLE (ftest1 int8);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+-- Check it actually works
+INSERT INTO FKTABLE VALUES(42); -- should succeed
+INSERT INTO FKTABLE VALUES(43); -- should fail
+ERROR: insert or update on table "fktable" violates foreign key constraint "fktable_ftest1_fkey"
+DETAIL: Key (ftest1)=(43) is not present in table "pktable".
+DROP TABLE FKTABLE;
+-- This should fail, because we'd have to cast numeric to int which is
+-- not an implicit coercion (or use numeric=numeric, but that's not part
+-- of the integer opfamily)
+CREATE TEMP TABLE FKTABLE (ftest1 numeric);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+ERROR: foreign key constraint "fktable_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: numeric and integer.
+DROP TABLE FKTABLE;
+DROP TABLE PKTABLE;
+-- On the other hand, this should work because int implicitly promotes to
+-- numeric, and we allow promotion on the FK side
+CREATE TEMP TABLE PKTABLE (ptest1 numeric PRIMARY KEY);
+INSERT INTO PKTABLE VALUES(42);
+CREATE TEMP TABLE FKTABLE (ftest1 int);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+-- Check it actually works
+INSERT INTO FKTABLE VALUES(42); -- should succeed
+INSERT INTO FKTABLE VALUES(43); -- should fail
+ERROR: insert or update on table "fktable" violates foreign key constraint "fktable_ftest1_fkey"
+DETAIL: Key (ftest1)=(43) is not present in table "pktable".
+DROP TABLE FKTABLE;
+DROP TABLE PKTABLE;
+CREATE TEMP TABLE PKTABLE (ptest1 int, ptest2 inet,
+ PRIMARY KEY(ptest1, ptest2));
+-- This should fail, because we just chose really odd types
+CREATE TEMP TABLE FKTABLE (ftest1 cidr, ftest2 timestamp);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2) references pktable;
+ERROR: foreign key constraint "fktable_ftest1_ftest2_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: cidr and integer.
+DROP TABLE FKTABLE;
+-- Again, so should this...
+CREATE TEMP TABLE FKTABLE (ftest1 cidr, ftest2 timestamp);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2)
+ references pktable(ptest1, ptest2);
+ERROR: foreign key constraint "fktable_ftest1_ftest2_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: cidr and integer.
+DROP TABLE FKTABLE;
+-- This fails because we mixed up the column ordering
+CREATE TEMP TABLE FKTABLE (ftest1 int, ftest2 inet);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2)
+ references pktable(ptest2, ptest1);
+ERROR: foreign key constraint "fktable_ftest1_ftest2_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest2" are of incompatible types: integer and inet.
+-- As does this...
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest2, ftest1)
+ references pktable(ptest1, ptest2);
+ERROR: foreign key constraint "fktable_ftest2_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest2" and "ptest1" are of incompatible types: inet and integer.
+DROP TABLE FKTABLE;
+DROP TABLE PKTABLE;
+-- Test that ALTER CONSTRAINT updates trigger deferrability properly
+CREATE TEMP TABLE PKTABLE (ptest1 int primary key);
+CREATE TEMP TABLE FKTABLE (ftest1 int);
+ALTER TABLE FKTABLE ADD CONSTRAINT fknd FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdd FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE INITIALLY DEFERRED;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdi FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE INITIALLY IMMEDIATE;
+ALTER TABLE FKTABLE ADD CONSTRAINT fknd2 FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE INITIALLY DEFERRED;
+ALTER TABLE FKTABLE ALTER CONSTRAINT fknd2 NOT DEFERRABLE;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdd2 FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE;
+ALTER TABLE FKTABLE ALTER CONSTRAINT fkdd2 DEFERRABLE INITIALLY DEFERRED;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdi2 FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE;
+ALTER TABLE FKTABLE ALTER CONSTRAINT fkdi2 DEFERRABLE INITIALLY IMMEDIATE;
+SELECT conname, tgfoid::regproc, tgtype, tgdeferrable, tginitdeferred
+FROM pg_trigger JOIN pg_constraint con ON con.oid = tgconstraint
+WHERE tgrelid = 'pktable'::regclass
+ORDER BY 1,2,3;
+ conname | tgfoid | tgtype | tgdeferrable | tginitdeferred
+---------+------------------------+--------+--------------+----------------
+ fkdd | "RI_FKey_cascade_del" | 9 | f | f
+ fkdd | "RI_FKey_noaction_upd" | 17 | t | t
+ fkdd2 | "RI_FKey_cascade_del" | 9 | f | f
+ fkdd2 | "RI_FKey_noaction_upd" | 17 | t | t
+ fkdi | "RI_FKey_cascade_del" | 9 | f | f
+ fkdi | "RI_FKey_noaction_upd" | 17 | t | f
+ fkdi2 | "RI_FKey_cascade_del" | 9 | f | f
+ fkdi2 | "RI_FKey_noaction_upd" | 17 | t | f
+ fknd | "RI_FKey_cascade_del" | 9 | f | f
+ fknd | "RI_FKey_noaction_upd" | 17 | f | f
+ fknd2 | "RI_FKey_cascade_del" | 9 | f | f
+ fknd2 | "RI_FKey_noaction_upd" | 17 | f | f
+(12 rows)
+
+SELECT conname, tgfoid::regproc, tgtype, tgdeferrable, tginitdeferred
+FROM pg_trigger JOIN pg_constraint con ON con.oid = tgconstraint
+WHERE tgrelid = 'fktable'::regclass
+ORDER BY 1,2,3;
+ conname | tgfoid | tgtype | tgdeferrable | tginitdeferred
+---------+---------------------+--------+--------------+----------------
+ fkdd | "RI_FKey_check_ins" | 5 | t | t
+ fkdd | "RI_FKey_check_upd" | 17 | t | t
+ fkdd2 | "RI_FKey_check_ins" | 5 | t | t
+ fkdd2 | "RI_FKey_check_upd" | 17 | t | t
+ fkdi | "RI_FKey_check_ins" | 5 | t | f
+ fkdi | "RI_FKey_check_upd" | 17 | t | f
+ fkdi2 | "RI_FKey_check_ins" | 5 | t | f
+ fkdi2 | "RI_FKey_check_upd" | 17 | t | f
+ fknd | "RI_FKey_check_ins" | 5 | f | f
+ fknd | "RI_FKey_check_upd" | 17 | f | f
+ fknd2 | "RI_FKey_check_ins" | 5 | f | f
+ fknd2 | "RI_FKey_check_upd" | 17 | f | f
+(12 rows)
+
+-- temp tables should go away by themselves, need not drop them.
+-- test check constraint adding
+create table atacc1 ( test int );
+-- add a check constraint
+alter table atacc1 add constraint atacc_test1 check (test>3);
+-- should fail
+insert into atacc1 (test) values (2);
+ERROR: new row for relation "atacc1" violates check constraint "atacc_test1"
+DETAIL: Failing row contains (2).
+-- should succeed
+insert into atacc1 (test) values (4);
+drop table atacc1;
+-- let's do one where the check fails when added
+create table atacc1 ( test int );
+-- insert a soon to be failing row
+insert into atacc1 (test) values (2);
+-- add a check constraint (fails)
+alter table atacc1 add constraint atacc_test1 check (test>3);
+ERROR: check constraint "atacc_test1" is violated by some row
+insert into atacc1 (test) values (4);
+drop table atacc1;
+-- let's do one where the check fails because the column doesn't exist
+create table atacc1 ( test int );
+-- add a check constraint (fails)
+alter table atacc1 add constraint atacc_test1 check (test1>3);
+ERROR: column "test1" does not exist
+HINT: Perhaps you meant to reference the column "atacc1.test".
+drop table atacc1;
+-- something a little more complicated
+create table atacc1 ( test int, test2 int, test3 int);
+-- add a check constraint (fails)
+alter table atacc1 add constraint atacc_test1 check (test+test2<test3*4);
+-- should fail
+insert into atacc1 (test,test2,test3) values (4,4,2);
+ERROR: new row for relation "atacc1" violates check constraint "atacc_test1"
+DETAIL: Failing row contains (4, 4, 2).
+-- should succeed
+insert into atacc1 (test,test2,test3) values (4,4,5);
+drop table atacc1;
+-- lets do some naming tests
+create table atacc1 (test int check (test>3), test2 int);
+alter table atacc1 add check (test2>test);
+-- should fail for $2
+insert into atacc1 (test2, test) values (3, 4);
+ERROR: new row for relation "atacc1" violates check constraint "atacc1_check"
+DETAIL: Failing row contains (4, 3).
+drop table atacc1;
+-- inheritance related tests
+create table atacc1 (test int);
+create table atacc2 (test2 int);
+create table atacc3 (test3 int) inherits (atacc1, atacc2);
+alter table atacc2 add constraint foo check (test2>0);
+-- fail and then succeed on atacc2
+insert into atacc2 (test2) values (-3);
+ERROR: new row for relation "atacc2" violates check constraint "foo"
+DETAIL: Failing row contains (-3).
+insert into atacc2 (test2) values (3);
+-- fail and then succeed on atacc3
+insert into atacc3 (test2) values (-3);
+ERROR: new row for relation "atacc3" violates check constraint "foo"
+DETAIL: Failing row contains (null, -3, null).
+insert into atacc3 (test2) values (3);
+drop table atacc3;
+drop table atacc2;
+drop table atacc1;
+-- same things with one created with INHERIT
+create table atacc1 (test int);
+create table atacc2 (test2 int);
+create table atacc3 (test3 int) inherits (atacc1, atacc2);
+alter table atacc3 no inherit atacc2;
+-- fail
+alter table atacc3 no inherit atacc2;
+ERROR: relation "atacc2" is not a parent of relation "atacc3"
+-- make sure it really isn't a child
+insert into atacc3 (test2) values (3);
+select test2 from atacc2;
+ test2
+-------
+(0 rows)
+
+-- fail due to missing constraint
+alter table atacc2 add constraint foo check (test2>0);
+alter table atacc3 inherit atacc2;
+ERROR: child table is missing constraint "foo"
+-- fail due to missing column
+alter table atacc3 rename test2 to testx;
+alter table atacc3 inherit atacc2;
+ERROR: child table is missing column "test2"
+-- fail due to mismatched data type
+alter table atacc3 add test2 bool;
+alter table atacc3 inherit atacc2;
+ERROR: child table "atacc3" has different type for column "test2"
+alter table atacc3 drop test2;
+-- succeed
+alter table atacc3 add test2 int;
+update atacc3 set test2 = 4 where test2 is null;
+alter table atacc3 add constraint foo check (test2>0);
+alter table atacc3 inherit atacc2;
+-- fail due to duplicates and circular inheritance
+alter table atacc3 inherit atacc2;
+ERROR: relation "atacc2" would be inherited from more than once
+alter table atacc2 inherit atacc3;
+ERROR: circular inheritance not allowed
+DETAIL: "atacc3" is already a child of "atacc2".
+alter table atacc2 inherit atacc2;
+ERROR: circular inheritance not allowed
+DETAIL: "atacc2" is already a child of "atacc2".
+-- test that we really are a child now (should see 4 not 3 and cascade should go through)
+select test2 from atacc2;
+ test2
+-------
+ 4
+(1 row)
+
+drop table atacc2 cascade;
+NOTICE: drop cascades to table atacc3
+drop table atacc1;
+-- adding only to a parent is allowed as of 9.2
+create table atacc1 (test int);
+create table atacc2 (test2 int) inherits (atacc1);
+-- ok:
+alter table atacc1 add constraint foo check (test>0) no inherit;
+-- check constraint is not there on child
+insert into atacc2 (test) values (-3);
+-- check constraint is there on parent
+insert into atacc1 (test) values (-3);
+ERROR: new row for relation "atacc1" violates check constraint "foo"
+DETAIL: Failing row contains (-3).
+insert into atacc1 (test) values (3);
+-- fail, violating row:
+alter table atacc2 add constraint foo check (test>0) no inherit;
+ERROR: check constraint "foo" is violated by some row
+drop table atacc2;
+drop table atacc1;
+-- test unique constraint adding
+create table atacc1 ( test int ) ;
+-- add a unique constraint
+alter table atacc1 add constraint atacc_test1 unique (test);
+-- insert first value
+insert into atacc1 (test) values (2);
+-- should fail
+insert into atacc1 (test) values (2);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test)=(2) already exists.
+-- should succeed
+insert into atacc1 (test) values (4);
+-- try to create duplicates via alter table using - should fail
+alter table atacc1 alter column test type integer using 0;
+ERROR: could not create unique index "atacc_test1"
+DETAIL: Key (test)=(0) is duplicated.
+drop table atacc1;
+-- let's do one where the unique constraint fails when added
+create table atacc1 ( test int );
+-- insert soon to be failing rows
+insert into atacc1 (test) values (2);
+insert into atacc1 (test) values (2);
+-- add a unique constraint (fails)
+alter table atacc1 add constraint atacc_test1 unique (test);
+ERROR: could not create unique index "atacc_test1"
+DETAIL: Key (test)=(2) is duplicated.
+insert into atacc1 (test) values (3);
+drop table atacc1;
+-- let's do one where the unique constraint fails
+-- because the column doesn't exist
+create table atacc1 ( test int );
+-- add a unique constraint (fails)
+alter table atacc1 add constraint atacc_test1 unique (test1);
+ERROR: column "test1" named in key does not exist
+drop table atacc1;
+-- something a little more complicated
+create table atacc1 ( test int, test2 int);
+-- add a unique constraint
+alter table atacc1 add constraint atacc_test1 unique (test, test2);
+-- insert initial value
+insert into atacc1 (test,test2) values (4,4);
+-- should fail
+insert into atacc1 (test,test2) values (4,4);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test, test2)=(4, 4) already exists.
+-- should all succeed
+insert into atacc1 (test,test2) values (4,5);
+insert into atacc1 (test,test2) values (5,4);
+insert into atacc1 (test,test2) values (5,5);
+drop table atacc1;
+-- lets do some naming tests
+create table atacc1 (test int, test2 int, unique(test));
+alter table atacc1 add unique (test2);
+-- should fail for @@ second one @@
+insert into atacc1 (test2, test) values (3, 3);
+insert into atacc1 (test2, test) values (2, 3);
+ERROR: duplicate key value violates unique constraint "atacc1_test_key"
+DETAIL: Key (test)=(3) already exists.
+drop table atacc1;
+-- test primary key constraint adding
+create table atacc1 ( id serial, test int) ;
+-- add a primary key constraint
+alter table atacc1 add constraint atacc_test1 primary key (test);
+-- insert first value
+insert into atacc1 (test) values (2);
+-- should fail
+insert into atacc1 (test) values (2);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test)=(2) already exists.
+-- should succeed
+insert into atacc1 (test) values (4);
+-- inserting NULL should fail
+insert into atacc1 (test) values(NULL);
+ERROR: null value in column "test" violates not-null constraint
+DETAIL: Failing row contains (4, null).
+-- try adding a second primary key (should fail)
+alter table atacc1 add constraint atacc_oid1 primary key(id);
+ERROR: multiple primary keys for table "atacc1" are not allowed
+-- drop first primary key constraint
+alter table atacc1 drop constraint atacc_test1 restrict;
+-- try adding a primary key on oid (should succeed)
+alter table atacc1 add constraint atacc_oid1 primary key(id);
+drop table atacc1;
+-- let's do one where the primary key constraint fails when added
+create table atacc1 ( test int );
+-- insert soon to be failing rows
+insert into atacc1 (test) values (2);
+insert into atacc1 (test) values (2);
+-- add a primary key (fails)
+alter table atacc1 add constraint atacc_test1 primary key (test);
+ERROR: could not create unique index "atacc_test1"
+DETAIL: Key (test)=(2) is duplicated.
+insert into atacc1 (test) values (3);
+drop table atacc1;
+-- let's do another one where the primary key constraint fails when added
+create table atacc1 ( test int );
+-- insert soon to be failing row
+insert into atacc1 (test) values (NULL);
+-- add a primary key (fails)
+alter table atacc1 add constraint atacc_test1 primary key (test);
+ERROR: column "test" contains null values
+insert into atacc1 (test) values (3);
+drop table atacc1;
+-- let's do one where the primary key constraint fails
+-- because the column doesn't exist
+create table atacc1 ( test int );
+-- add a primary key constraint (fails)
+alter table atacc1 add constraint atacc_test1 primary key (test1);
+ERROR: column "test1" of relation "atacc1" does not exist
+drop table atacc1;
+-- adding a new column as primary key to a non-empty table.
+-- should fail unless the column has a non-null default value.
+create table atacc1 ( test int );
+insert into atacc1 (test) values (0);
+-- add a primary key column without a default (fails).
+alter table atacc1 add column test2 int primary key;
+ERROR: column "test2" contains null values
+-- now add a primary key column with a default (succeeds).
+alter table atacc1 add column test2 int default 0 primary key;
+drop table atacc1;
+-- this combination used to have order-of-execution problems (bug #15580)
+create table atacc1 (a int);
+insert into atacc1 values(1);
+alter table atacc1
+ add column b float8 not null default random(),
+ add primary key(a);
+drop table atacc1;
+-- something a little more complicated
+create table atacc1 ( test int, test2 int);
+-- add a primary key constraint
+alter table atacc1 add constraint atacc_test1 primary key (test, test2);
+-- try adding a second primary key - should fail
+alter table atacc1 add constraint atacc_test2 primary key (test);
+ERROR: multiple primary keys for table "atacc1" are not allowed
+-- insert initial value
+insert into atacc1 (test,test2) values (4,4);
+-- should fail
+insert into atacc1 (test,test2) values (4,4);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test, test2)=(4, 4) already exists.
+insert into atacc1 (test,test2) values (NULL,3);
+ERROR: null value in column "test" violates not-null constraint
+DETAIL: Failing row contains (null, 3).
+insert into atacc1 (test,test2) values (3, NULL);
+ERROR: null value in column "test2" violates not-null constraint
+DETAIL: Failing row contains (3, null).
+insert into atacc1 (test,test2) values (NULL,NULL);
+ERROR: null value in column "test" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+-- should all succeed
+insert into atacc1 (test,test2) values (4,5);
+insert into atacc1 (test,test2) values (5,4);
+insert into atacc1 (test,test2) values (5,5);
+drop table atacc1;
+-- lets do some naming tests
+create table atacc1 (test int, test2 int, primary key(test));
+-- only first should succeed
+insert into atacc1 (test2, test) values (3, 3);
+insert into atacc1 (test2, test) values (2, 3);
+ERROR: duplicate key value violates unique constraint "atacc1_pkey"
+DETAIL: Key (test)=(3) already exists.
+insert into atacc1 (test2, test) values (1, NULL);
+ERROR: null value in column "test" violates not-null constraint
+DETAIL: Failing row contains (null, 1).
+drop table atacc1;
+-- alter table / alter column [set/drop] not null tests
+-- try altering system catalogs, should fail
+alter table pg_class alter column relname drop not null;
+ERROR: permission denied: "pg_class" is a system catalog
+alter table pg_class alter relname set not null;
+ERROR: permission denied: "pg_class" is a system catalog
+-- try altering non-existent table, should fail
+alter table non_existent alter column bar set not null;
+ERROR: relation "non_existent" does not exist
+alter table non_existent alter column bar drop not null;
+ERROR: relation "non_existent" does not exist
+-- test setting columns to null and not null and vice versa
+-- test checking for null values and primary key
+create table atacc1 (test int not null);
+alter table atacc1 add constraint "atacc1_pkey" primary key (test);
+alter table atacc1 alter column test drop not null;
+ERROR: column "test" is in a primary key
+alter table atacc1 drop constraint "atacc1_pkey";
+alter table atacc1 alter column test drop not null;
+insert into atacc1 values (null);
+alter table atacc1 alter test set not null;
+ERROR: column "test" contains null values
+delete from atacc1;
+alter table atacc1 alter test set not null;
+-- try altering a non-existent column, should fail
+alter table atacc1 alter bar set not null;
+ERROR: column "bar" of relation "atacc1" does not exist
+alter table atacc1 alter bar drop not null;
+ERROR: column "bar" of relation "atacc1" does not exist
+-- try creating a view and altering that, should fail
+create view myview as select * from atacc1;
+alter table myview alter column test drop not null;
+ERROR: "myview" is not a table or foreign table
+alter table myview alter column test set not null;
+ERROR: "myview" is not a table or foreign table
+drop view myview;
+drop table atacc1;
+-- set not null verified by constraints
+create table atacc1 (test_a int, test_b int);
+insert into atacc1 values (null, 1);
+-- constraint does not cover all values, should fail
+alter table atacc1 add constraint atacc1_constr_or check(test_a is not null or test_b < 10);
+alter table atacc1 alter test_a set not null;
+ERROR: column "test_a" contains null values
+alter table atacc1 drop constraint atacc1_constr_or;
+-- not valid constraint, should fail
+alter table atacc1 add constraint atacc1_constr_invalid check(test_a is not null) not valid;
+alter table atacc1 alter test_a set not null;
+ERROR: column "test_a" contains null values
+alter table atacc1 drop constraint atacc1_constr_invalid;
+-- with valid constraint
+update atacc1 set test_a = 1;
+alter table atacc1 add constraint atacc1_constr_a_valid check(test_a is not null);
+alter table atacc1 alter test_a set not null;
+delete from atacc1;
+insert into atacc1 values (2, null);
+alter table atacc1 alter test_a drop not null;
+-- test multiple set not null at the same time
+-- test_a checked by atacc1_constr_a_valid, test_b should fail by table scan
+alter table atacc1 alter test_a set not null, alter test_b set not null;
+ERROR: column "test_b" contains null values
+-- command order does not matter
+alter table atacc1 alter test_b set not null, alter test_a set not null;
+ERROR: column "test_b" contains null values
+-- validate one by table scan, one by check constraint
+update atacc1 set test_b = 1;
+alter table atacc1 alter test_b set not null, alter test_a set not null;
+alter table atacc1 alter test_a drop not null, alter test_b drop not null;
+-- both columns have check constraints
+alter table atacc1 add constraint atacc1_constr_b_valid check(test_b is not null);
+alter table atacc1 alter test_b set not null, alter test_a set not null;
+drop table atacc1;
+-- test inheritance
+create table parent (a int);
+create table child (b varchar(255)) inherits (parent);
+alter table parent alter a set not null;
+insert into parent values (NULL);
+ERROR: null value in column "a" violates not-null constraint
+DETAIL: Failing row contains (null).
+insert into child (a, b) values (NULL, 'foo');
+ERROR: null value in column "a" violates not-null constraint
+DETAIL: Failing row contains (null, foo).
+alter table parent alter a drop not null;
+insert into parent values (NULL);
+insert into child (a, b) values (NULL, 'foo');
+alter table only parent alter a set not null;
+ERROR: column "a" contains null values
+alter table child alter a set not null;
+ERROR: column "a" contains null values
+delete from parent;
+alter table only parent alter a set not null;
+insert into parent values (NULL);
+ERROR: null value in column "a" violates not-null constraint
+DETAIL: Failing row contains (null).
+alter table child alter a set not null;
+insert into child (a, b) values (NULL, 'foo');
+ERROR: null value in column "a" violates not-null constraint
+DETAIL: Failing row contains (null, foo).
+delete from child;
+alter table child alter a set not null;
+insert into child (a, b) values (NULL, 'foo');
+ERROR: null value in column "a" violates not-null constraint
+DETAIL: Failing row contains (null, foo).
+drop table child;
+drop table parent;
+-- test setting and removing default values
+create table def_test (
+ c1 int4 default 5,
+ c2 text default 'initial_default'
+);
+insert into def_test default values;
+alter table def_test alter column c1 drop default;
+insert into def_test default values;
+alter table def_test alter column c2 drop default;
+insert into def_test default values;
+alter table def_test alter column c1 set default 10;
+alter table def_test alter column c2 set default 'new_default';
+insert into def_test default values;
+select * from def_test;
+ c1 | c2
+----+-----------------
+ 5 | initial_default
+ | initial_default
+ |
+ 10 | new_default
+(4 rows)
+
+-- set defaults to an incorrect type: this should fail
+alter table def_test alter column c1 set default 'wrong_datatype';
+ERROR: invalid input syntax for type integer: "wrong_datatype"
+alter table def_test alter column c2 set default 20;
+-- set defaults on a non-existent column: this should fail
+alter table def_test alter column c3 set default 30;
+ERROR: column "c3" of relation "def_test" does not exist
+-- set defaults on views: we need to create a view, add a rule
+-- to allow insertions into it, and then alter the view to add
+-- a default
+create view def_view_test as select * from def_test;
+create rule def_view_test_ins as
+ on insert to def_view_test
+ do instead insert into def_test select new.*;
+insert into def_view_test default values;
+alter table def_view_test alter column c1 set default 45;
+insert into def_view_test default values;
+alter table def_view_test alter column c2 set default 'view_default';
+insert into def_view_test default values;
+select * from def_view_test;
+ c1 | c2
+----+-----------------
+ 5 | initial_default
+ | initial_default
+ |
+ 10 | new_default
+ |
+ 45 |
+ 45 | view_default
+(7 rows)
+
+drop rule def_view_test_ins on def_view_test;
+drop view def_view_test;
+drop table def_test;
+-- alter table / drop column tests
+-- try altering system catalogs, should fail
+alter table pg_class drop column relname;
+ERROR: permission denied: "pg_class" is a system catalog
+-- try altering non-existent table, should fail
+alter table nosuchtable drop column bar;
+ERROR: relation "nosuchtable" does not exist
+-- test dropping columns
+create table atacc1 (a int4 not null, b int4, c int4 not null, d int4);
+insert into atacc1 values (1, 2, 3, 4);
+alter table atacc1 drop a;
+alter table atacc1 drop a;
+ERROR: column "a" of relation "atacc1" does not exist
+-- SELECTs
+select * from atacc1;
+ b | c | d
+---+---+---
+ 2 | 3 | 4
+(1 row)
+
+select * from atacc1 order by a;
+ERROR: column "a" does not exist
+LINE 1: select * from atacc1 order by a;
+ ^
+select * from atacc1 order by "........pg.dropped.1........";
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select * from atacc1 order by "........pg.dropped.1........"...
+ ^
+select * from atacc1 group by a;
+ERROR: column "a" does not exist
+LINE 1: select * from atacc1 group by a;
+ ^
+select * from atacc1 group by "........pg.dropped.1........";
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select * from atacc1 group by "........pg.dropped.1........"...
+ ^
+select atacc1.* from atacc1;
+ b | c | d
+---+---+---
+ 2 | 3 | 4
+(1 row)
+
+select a from atacc1;
+ERROR: column "a" does not exist
+LINE 1: select a from atacc1;
+ ^
+select atacc1.a from atacc1;
+ERROR: column atacc1.a does not exist
+LINE 1: select atacc1.a from atacc1;
+ ^
+select b,c,d from atacc1;
+ b | c | d
+---+---+---
+ 2 | 3 | 4
+(1 row)
+
+select a,b,c,d from atacc1;
+ERROR: column "a" does not exist
+LINE 1: select a,b,c,d from atacc1;
+ ^
+select * from atacc1 where a = 1;
+ERROR: column "a" does not exist
+LINE 1: select * from atacc1 where a = 1;
+ ^
+select "........pg.dropped.1........" from atacc1;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select "........pg.dropped.1........" from atacc1;
+ ^
+select atacc1."........pg.dropped.1........" from atacc1;
+ERROR: column atacc1.........pg.dropped.1........ does not exist
+LINE 1: select atacc1."........pg.dropped.1........" from atacc1;
+ ^
+select "........pg.dropped.1........",b,c,d from atacc1;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select "........pg.dropped.1........",b,c,d from atacc1;
+ ^
+select * from atacc1 where "........pg.dropped.1........" = 1;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select * from atacc1 where "........pg.dropped.1........" = ...
+ ^
+-- UPDATEs
+update atacc1 set a = 3;
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: update atacc1 set a = 3;
+ ^
+update atacc1 set b = 2 where a = 3;
+ERROR: column "a" does not exist
+LINE 1: update atacc1 set b = 2 where a = 3;
+ ^
+update atacc1 set "........pg.dropped.1........" = 3;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: update atacc1 set "........pg.dropped.1........" = 3;
+ ^
+update atacc1 set b = 2 where "........pg.dropped.1........" = 3;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: update atacc1 set b = 2 where "........pg.dropped.1........"...
+ ^
+-- INSERTs
+insert into atacc1 values (10, 11, 12, 13);
+ERROR: INSERT has more expressions than target columns
+LINE 1: insert into atacc1 values (10, 11, 12, 13);
+ ^
+insert into atacc1 values (default, 11, 12, 13);
+ERROR: INSERT has more expressions than target columns
+LINE 1: insert into atacc1 values (default, 11, 12, 13);
+ ^
+insert into atacc1 values (11, 12, 13);
+insert into atacc1 (a) values (10);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a) values (10);
+ ^
+insert into atacc1 (a) values (default);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a) values (default);
+ ^
+insert into atacc1 (a,b,c,d) values (10,11,12,13);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a,b,c,d) values (10,11,12,13);
+ ^
+insert into atacc1 (a,b,c,d) values (default,11,12,13);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a,b,c,d) values (default,11,12,13);
+ ^
+insert into atacc1 (b,c,d) values (11,12,13);
+insert into atacc1 ("........pg.dropped.1........") values (10);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........") values (...
+ ^
+insert into atacc1 ("........pg.dropped.1........") values (default);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........") values (...
+ ^
+insert into atacc1 ("........pg.dropped.1........",b,c,d) values (10,11,12,13);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........",b,c,d) va...
+ ^
+insert into atacc1 ("........pg.dropped.1........",b,c,d) values (default,11,12,13);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........",b,c,d) va...
+ ^
+-- DELETEs
+delete from atacc1 where a = 3;
+ERROR: column "a" does not exist
+LINE 1: delete from atacc1 where a = 3;
+ ^
+delete from atacc1 where "........pg.dropped.1........" = 3;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: delete from atacc1 where "........pg.dropped.1........" = 3;
+ ^
+delete from atacc1;
+-- try dropping a non-existent column, should fail
+alter table atacc1 drop bar;
+ERROR: column "bar" of relation "atacc1" does not exist
+-- try removing an oid column, should succeed (as it's nonexistent)
+alter table atacc1 SET WITHOUT OIDS;
+-- try adding an oid column, should fail (not supported)
+alter table atacc1 SET WITH OIDS;
+ERROR: syntax error at or near "WITH"
+LINE 1: alter table atacc1 SET WITH OIDS;
+ ^
+-- try dropping the xmin column, should fail
+alter table atacc1 drop xmin;
+ERROR: cannot drop system column "xmin"
+-- try creating a view and altering that, should fail
+create view myview as select * from atacc1;
+select * from myview;
+ b | c | d
+---+---+---
+(0 rows)
+
+alter table myview drop d;
+ERROR: "myview" is not a table, composite type, or foreign table
+drop view myview;
+-- test some commands to make sure they fail on the dropped column
+analyze atacc1(a);
+ERROR: column "a" of relation "atacc1" does not exist
+analyze atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+vacuum analyze atacc1(a);
+ERROR: column "a" of relation "atacc1" does not exist
+vacuum analyze atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+comment on column atacc1.a is 'testing';
+ERROR: column "a" of relation "atacc1" does not exist
+comment on column atacc1."........pg.dropped.1........" is 'testing';
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set storage plain;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set storage plain;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set statistics 0;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set statistics 0;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set default 3;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set default 3;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a drop default;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" drop default;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set not null;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set not null;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a drop not null;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" drop not null;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 rename a to x;
+ERROR: column "a" does not exist
+alter table atacc1 rename "........pg.dropped.1........" to x;
+ERROR: column "........pg.dropped.1........" does not exist
+alter table atacc1 add primary key(a);
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 add primary key("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 add unique(a);
+ERROR: column "a" named in key does not exist
+alter table atacc1 add unique("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" named in key does not exist
+alter table atacc1 add check (a > 3);
+ERROR: column "a" does not exist
+alter table atacc1 add check ("........pg.dropped.1........" > 3);
+ERROR: column "........pg.dropped.1........" does not exist
+create table atacc2 (id int4 unique);
+alter table atacc1 add foreign key (a) references atacc2(id);
+ERROR: column "a" referenced in foreign key constraint does not exist
+alter table atacc1 add foreign key ("........pg.dropped.1........") references atacc2(id);
+ERROR: column "........pg.dropped.1........" referenced in foreign key constraint does not exist
+alter table atacc2 add foreign key (id) references atacc1(a);
+ERROR: column "a" referenced in foreign key constraint does not exist
+alter table atacc2 add foreign key (id) references atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" referenced in foreign key constraint does not exist
+drop table atacc2;
+create index "testing_idx" on atacc1(a);
+ERROR: column "a" does not exist
+create index "testing_idx" on atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" does not exist
+-- test create as and select into
+insert into atacc1 values (21, 22, 23);
+create table attest1 as select * from atacc1;
+select * from attest1;
+ b | c | d
+----+----+----
+ 21 | 22 | 23
+(1 row)
+
+drop table attest1;
+select * into attest2 from atacc1;
+select * from attest2;
+ b | c | d
+----+----+----
+ 21 | 22 | 23
+(1 row)
+
+drop table attest2;
+-- try dropping all columns
+alter table atacc1 drop c;
+alter table atacc1 drop d;
+alter table atacc1 drop b;
+select * from atacc1;
+--
+(1 row)
+
+drop table atacc1;
+-- test constraint error reporting in presence of dropped columns
+create table atacc1 (id serial primary key, value int check (value < 10));
+insert into atacc1(value) values (100);
+ERROR: new row for relation "atacc1" violates check constraint "atacc1_value_check"
+DETAIL: Failing row contains (1, 100).
+alter table atacc1 drop column value;
+alter table atacc1 add column value int check (value < 10);
+insert into atacc1(value) values (100);
+ERROR: new row for relation "atacc1" violates check constraint "atacc1_value_check"
+DETAIL: Failing row contains (2, 100).
+insert into atacc1(id, value) values (null, 0);
+ERROR: null value in column "id" violates not-null constraint
+DETAIL: Failing row contains (null, 0).
+drop table atacc1;
+-- test inheritance
+create table parent (a int, b int, c int);
+insert into parent values (1, 2, 3);
+alter table parent drop a;
+create table child (d varchar(255)) inherits (parent);
+insert into child values (12, 13, 'testing');
+select * from parent;
+ b | c
+----+----
+ 2 | 3
+ 12 | 13
+(2 rows)
+
+select * from child;
+ b | c | d
+----+----+---------
+ 12 | 13 | testing
+(1 row)
+
+alter table parent drop c;
+select * from parent;
+ b
+----
+ 2
+ 12
+(2 rows)
+
+select * from child;
+ b | d
+----+---------
+ 12 | testing
+(1 row)
+
+drop table child;
+drop table parent;
+-- check error cases for inheritance column merging
+create table parent (a float8, b numeric(10,4), c text collate "C");
+create table child (a float4) inherits (parent); -- fail
+NOTICE: merging column "a" with inherited definition
+ERROR: column "a" has a type conflict
+DETAIL: double precision versus real
+create table child (b decimal(10,7)) inherits (parent); -- fail
+NOTICE: moving and merging column "b" with inherited definition
+DETAIL: User-specified column moved to the position of the inherited column.
+ERROR: column "b" has a type conflict
+DETAIL: numeric(10,4) versus numeric(10,7)
+create table child (c text collate "POSIX") inherits (parent); -- fail
+NOTICE: moving and merging column "c" with inherited definition
+DETAIL: User-specified column moved to the position of the inherited column.
+ERROR: column "c" has a collation conflict
+DETAIL: "C" versus "POSIX"
+create table child (a double precision, b decimal(10,4)) inherits (parent);
+NOTICE: merging column "a" with inherited definition
+NOTICE: merging column "b" with inherited definition
+drop table child;
+drop table parent;
+-- test copy in/out
+create table attest (a int4, b int4, c int4);
+insert into attest values (1,2,3);
+alter table attest drop a;
+copy attest to stdout;
+2 3
+copy attest(a) to stdout;
+ERROR: column "a" of relation "attest" does not exist
+copy attest("........pg.dropped.1........") to stdout;
+ERROR: column "........pg.dropped.1........" of relation "attest" does not exist
+copy attest from stdin;
+ERROR: extra data after last expected column
+CONTEXT: COPY attest, line 1: "10 11 12"
+select * from attest;
+ b | c
+---+---
+ 2 | 3
+(1 row)
+
+copy attest from stdin;
+select * from attest;
+ b | c
+----+----
+ 2 | 3
+ 21 | 22
+(2 rows)
+
+copy attest(a) from stdin;
+ERROR: column "a" of relation "attest" does not exist
+copy attest("........pg.dropped.1........") from stdin;
+ERROR: column "........pg.dropped.1........" of relation "attest" does not exist
+copy attest(b,c) from stdin;
+select * from attest;
+ b | c
+----+----
+ 2 | 3
+ 21 | 22
+ 31 | 32
+(3 rows)
+
+drop table attest;
+-- test inheritance
+create table dropColumn (a int, b int, e int);
+create table dropColumnChild (c int) inherits (dropColumn);
+create table dropColumnAnother (d int) inherits (dropColumnChild);
+-- these two should fail
+alter table dropColumnchild drop column a;
+ERROR: cannot drop inherited column "a"
+alter table only dropColumnChild drop column b;
+ERROR: cannot drop inherited column "b"
+-- these three should work
+alter table only dropColumn drop column e;
+alter table dropColumnChild drop column c;
+alter table dropColumn drop column a;
+create table renameColumn (a int);
+create table renameColumnChild (b int) inherits (renameColumn);
+create table renameColumnAnother (c int) inherits (renameColumnChild);
+-- these three should fail
+alter table renameColumnChild rename column a to d;
+ERROR: cannot rename inherited column "a"
+alter table only renameColumnChild rename column a to d;
+ERROR: inherited column "a" must be renamed in child tables too
+alter table only renameColumn rename column a to d;
+ERROR: inherited column "a" must be renamed in child tables too
+-- these should work
+alter table renameColumn rename column a to d;
+alter table renameColumnChild rename column b to a;
+-- these should work
+alter table if exists doesnt_exist_tab rename column a to d;
+NOTICE: relation "doesnt_exist_tab" does not exist, skipping
+alter table if exists doesnt_exist_tab rename column b to a;
+NOTICE: relation "doesnt_exist_tab" does not exist, skipping
+-- this should work
+alter table renameColumn add column w int;
+-- this should fail
+alter table only renameColumn add column x int;
+ERROR: column must be added to child tables too
+-- Test corner cases in dropping of inherited columns
+create table p1 (f1 int, f2 int);
+create table c1 (f1 int not null) inherits(p1);
+NOTICE: merging column "f1" with inherited definition
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+-- should work
+alter table p1 drop column f1;
+-- c1.f1 is still there, but no longer inherited
+select f1 from c1;
+ f1
+----
+(0 rows)
+
+alter table c1 drop column f1;
+select f1 from c1;
+ERROR: column "f1" does not exist
+LINE 1: select f1 from c1;
+ ^
+HINT: Perhaps you meant to reference the column "c1.f2".
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1 (f1 int, f2 int);
+create table c1 () inherits(p1);
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+alter table p1 drop column f1;
+-- c1.f1 is dropped now, since there is no local definition for it
+select f1 from c1;
+ERROR: column "f1" does not exist
+LINE 1: select f1 from c1;
+ ^
+HINT: Perhaps you meant to reference the column "c1.f2".
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1 (f1 int, f2 int);
+create table c1 () inherits(p1);
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+alter table only p1 drop column f1;
+-- c1.f1 is NOT dropped, but must now be considered non-inherited
+alter table c1 drop column f1;
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1 (f1 int, f2 int);
+create table c1 (f1 int not null) inherits(p1);
+NOTICE: merging column "f1" with inherited definition
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+alter table only p1 drop column f1;
+-- c1.f1 is still there, but no longer inherited
+alter table c1 drop column f1;
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1(id int, name text);
+create table p2(id2 int, name text, height int);
+create table c1(age int) inherits(p1,p2);
+NOTICE: merging multiple inherited definitions of column "name"
+create table gc1() inherits (c1);
+select relname, attname, attinhcount, attislocal
+from pg_class join pg_attribute on (pg_class.oid = pg_attribute.attrelid)
+where relname in ('p1','p2','c1','gc1') and attnum > 0 and not attisdropped
+order by relname, attnum;
+ relname | attname | attinhcount | attislocal
+---------+---------+-------------+------------
+ c1 | id | 1 | f
+ c1 | name | 2 | f
+ c1 | id2 | 1 | f
+ c1 | height | 1 | f
+ c1 | age | 0 | t
+ gc1 | id | 1 | f
+ gc1 | name | 1 | f
+ gc1 | id2 | 1 | f
+ gc1 | height | 1 | f
+ gc1 | age | 1 | f
+ p1 | id | 0 | t
+ p1 | name | 0 | t
+ p2 | id2 | 0 | t
+ p2 | name | 0 | t
+ p2 | height | 0 | t
+(15 rows)
+
+-- should work
+alter table only p1 drop column name;
+-- should work. Now c1.name is local and inhcount is 0.
+alter table p2 drop column name;
+-- should be rejected since it's inherited
+alter table gc1 drop column name;
+ERROR: cannot drop inherited column "name"
+-- should work, and drop gc1.name along with it
+alter table c1 drop column name;
+-- should fail: column does not exist
+alter table gc1 drop column name;
+ERROR: column "name" of relation "gc1" does not exist
+-- should work and drop the attribute in all tables
+alter table p2 drop column height;
+-- IF EXISTS test
+create table dropColumnExists ();
+alter table dropColumnExists drop column non_existing; --fail
+ERROR: column "non_existing" of relation "dropcolumnexists" does not exist
+alter table dropColumnExists drop column if exists non_existing; --succeed
+NOTICE: column "non_existing" of relation "dropcolumnexists" does not exist, skipping
+select relname, attname, attinhcount, attislocal
+from pg_class join pg_attribute on (pg_class.oid = pg_attribute.attrelid)
+where relname in ('p1','p2','c1','gc1') and attnum > 0 and not attisdropped
+order by relname, attnum;
+ relname | attname | attinhcount | attislocal
+---------+---------+-------------+------------
+ c1 | id | 1 | f
+ c1 | id2 | 1 | f
+ c1 | age | 0 | t
+ gc1 | id | 1 | f
+ gc1 | id2 | 1 | f
+ gc1 | age | 1 | f
+ p1 | id | 0 | t
+ p2 | id2 | 0 | t
+(8 rows)
+
+drop table p1, p2 cascade;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to table c1
+drop cascades to table gc1
+-- test attinhcount tracking with merged columns
+create table depth0();
+create table depth1(c text) inherits (depth0);
+create table depth2() inherits (depth1);
+alter table depth0 add c text;
+NOTICE: merging definition of column "c" for child "depth1"
+select attrelid::regclass, attname, attinhcount, attislocal
+from pg_attribute
+where attnum > 0 and attrelid::regclass in ('depth0', 'depth1', 'depth2')
+order by attrelid::regclass::text, attnum;
+ attrelid | attname | attinhcount | attislocal
+----------+---------+-------------+------------
+ depth0 | c | 0 | t
+ depth1 | c | 1 | t
+ depth2 | c | 1 | f
+(3 rows)
+
+-- test renumbering of child-table columns in inherited operations
+create table p1 (f1 int);
+create table c1 (f2 text, f3 int) inherits (p1);
+alter table p1 add column a1 int check (a1 > 0);
+alter table p1 add column f2 text;
+NOTICE: merging definition of column "f2" for child "c1"
+insert into p1 values (1,2,'abc');
+insert into c1 values(11,'xyz',33,0); -- should fail
+ERROR: new row for relation "c1" violates check constraint "p1_a1_check"
+DETAIL: Failing row contains (11, xyz, 33, 0).
+insert into c1 values(11,'xyz',33,22);
+select * from p1;
+ f1 | a1 | f2
+----+----+-----
+ 1 | 2 | abc
+ 11 | 22 | xyz
+(2 rows)
+
+update p1 set a1 = a1 + 1, f2 = upper(f2);
+select * from p1;
+ f1 | a1 | f2
+----+----+-----
+ 1 | 3 | ABC
+ 11 | 23 | XYZ
+(2 rows)
+
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+-- test that operations with a dropped column do not try to reference
+-- its datatype
+create domain mytype as text;
+create temp table foo (f1 text, f2 mytype, f3 text);
+insert into foo values('bb','cc','dd');
+select * from foo;
+ f1 | f2 | f3
+----+----+----
+ bb | cc | dd
+(1 row)
+
+drop domain mytype cascade;
+NOTICE: drop cascades to column f2 of table foo
+select * from foo;
+ f1 | f3
+----+----
+ bb | dd
+(1 row)
+
+insert into foo values('qq','rr');
+select * from foo;
+ f1 | f3
+----+----
+ bb | dd
+ qq | rr
+(2 rows)
+
+update foo set f3 = 'zz';
+select * from foo;
+ f1 | f3
+----+----
+ bb | zz
+ qq | zz
+(2 rows)
+
+select f3,max(f1) from foo group by f3;
+ f3 | max
+----+-----
+ zz | qq
+(1 row)
+
+-- Simple tests for alter table column type
+alter table foo alter f1 TYPE integer; -- fails
+ERROR: column "f1" cannot be cast automatically to type integer
+HINT: You might need to specify "USING f1::integer".
+alter table foo alter f1 TYPE varchar(10);
+create table anothertab (atcol1 serial8, atcol2 boolean,
+ constraint anothertab_chk check (atcol1 <= 3));
+insert into anothertab (atcol1, atcol2) values (default, true);
+insert into anothertab (atcol1, atcol2) values (default, false);
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------
+ 1 | t
+ 2 | f
+(2 rows)
+
+alter table anothertab alter column atcol1 type boolean; -- fails
+ERROR: column "atcol1" cannot be cast automatically to type boolean
+HINT: You might need to specify "USING atcol1::boolean".
+alter table anothertab alter column atcol1 type boolean using atcol1::int; -- fails
+ERROR: result of USING clause for column "atcol1" cannot be cast automatically to type boolean
+HINT: You might need to add an explicit cast.
+alter table anothertab alter column atcol1 type integer;
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------
+ 1 | t
+ 2 | f
+(2 rows)
+
+insert into anothertab (atcol1, atcol2) values (45, null); -- fails
+ERROR: new row for relation "anothertab" violates check constraint "anothertab_chk"
+DETAIL: Failing row contains (45, null).
+insert into anothertab (atcol1, atcol2) values (default, null);
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------
+ 1 | t
+ 2 | f
+ 3 |
+(3 rows)
+
+alter table anothertab alter column atcol2 type text
+ using case when atcol2 is true then 'IT WAS TRUE'
+ when atcol2 is false then 'IT WAS FALSE'
+ else 'IT WAS NULL!' end;
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------------
+ 1 | IT WAS TRUE
+ 2 | IT WAS FALSE
+ 3 | IT WAS NULL!
+(3 rows)
+
+alter table anothertab alter column atcol1 type boolean
+ using case when atcol1 % 2 = 0 then true else false end; -- fails
+ERROR: default for column "atcol1" cannot be cast automatically to type boolean
+alter table anothertab alter column atcol1 drop default;
+alter table anothertab alter column atcol1 type boolean
+ using case when atcol1 % 2 = 0 then true else false end; -- fails
+ERROR: operator does not exist: boolean <= integer
+HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
+alter table anothertab drop constraint anothertab_chk;
+alter table anothertab drop constraint anothertab_chk; -- fails
+ERROR: constraint "anothertab_chk" of relation "anothertab" does not exist
+alter table anothertab drop constraint IF EXISTS anothertab_chk; -- succeeds
+NOTICE: constraint "anothertab_chk" of relation "anothertab" does not exist, skipping
+alter table anothertab alter column atcol1 type boolean
+ using case when atcol1 % 2 = 0 then true else false end;
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------------
+ f | IT WAS TRUE
+ t | IT WAS FALSE
+ f | IT WAS NULL!
+(3 rows)
+
+drop table anothertab;
+create table another (f1 int, f2 text);
+insert into another values(1, 'one');
+insert into another values(2, 'two');
+insert into another values(3, 'three');
+select * from another;
+ f1 | f2
+----+-------
+ 1 | one
+ 2 | two
+ 3 | three
+(3 rows)
+
+alter table another
+ alter f1 type text using f2 || ' more',
+ alter f2 type bigint using f1 * 10;
+select * from another;
+ f1 | f2
+------------+----
+ one more | 10
+ two more | 20
+ three more | 30
+(3 rows)
+
+drop table another;
+-- table's row type
+create table tab1 (a int, b text);
+create table tab2 (x int, y tab1);
+alter table tab1 alter column b type varchar; -- fails
+ERROR: cannot alter table "tab1" because column "tab2.y" uses its row type
+-- Alter column type that's part of a partitioned index
+create table at_partitioned (a int, b text) partition by range (a);
+create table at_part_1 partition of at_partitioned for values from (0) to (1000);
+insert into at_partitioned values (512, '0.123');
+create table at_part_2 (b text, a int);
+insert into at_part_2 values ('1.234', 1024);
+create index on at_partitioned (b);
+create index on at_partitioned (a);
+\d at_part_1
+ Table "public.at_part_1"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+Partition of: at_partitioned FOR VALUES FROM (0) TO (1000)
+Indexes:
+ "at_part_1_a_idx" btree (a)
+ "at_part_1_b_idx" btree (b)
+
+\d at_part_2
+ Table "public.at_part_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ b | text | | |
+ a | integer | | |
+
+alter table at_partitioned attach partition at_part_2 for values from (1000) to (2000);
+\d at_part_2
+ Table "public.at_part_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ b | text | | |
+ a | integer | | |
+Partition of: at_partitioned FOR VALUES FROM (1000) TO (2000)
+Indexes:
+ "at_part_2_a_idx" btree (a)
+ "at_part_2_b_idx" btree (b)
+
+alter table at_partitioned alter column b type numeric using b::numeric;
+\d at_part_1
+ Table "public.at_part_1"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | numeric | | |
+Partition of: at_partitioned FOR VALUES FROM (0) TO (1000)
+Indexes:
+ "at_part_1_a_idx" btree (a)
+ "at_part_1_b_idx" btree (b)
+
+\d at_part_2
+ Table "public.at_part_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ b | numeric | | |
+ a | integer | | |
+Partition of: at_partitioned FOR VALUES FROM (1000) TO (2000)
+Indexes:
+ "at_part_2_a_idx" btree (a)
+ "at_part_2_b_idx" btree (b)
+
+drop table at_partitioned;
+-- Alter column type when no table rewrite is required
+-- Also check that comments are preserved
+create table at_partitioned(id int, name varchar(64), unique (id, name))
+ partition by hash(id);
+comment on constraint at_partitioned_id_name_key on at_partitioned is 'parent constraint';
+comment on index at_partitioned_id_name_key is 'parent index';
+create table at_partitioned_0 partition of at_partitioned
+ for values with (modulus 2, remainder 0);
+comment on constraint at_partitioned_0_id_name_key on at_partitioned_0 is 'child 0 constraint';
+comment on index at_partitioned_0_id_name_key is 'child 0 index';
+create table at_partitioned_1 partition of at_partitioned
+ for values with (modulus 2, remainder 1);
+comment on constraint at_partitioned_1_id_name_key on at_partitioned_1 is 'child 1 constraint';
+comment on index at_partitioned_1_id_name_key is 'child 1 index';
+insert into at_partitioned values(1, 'foo');
+insert into at_partitioned values(3, 'bar');
+create temp table old_oids as
+ select relname, oid as oldoid, relfilenode as oldfilenode
+ from pg_class where relname like 'at_partitioned%';
+select relname,
+ c.oid = oldoid as orig_oid,
+ case relfilenode
+ when 0 then 'none'
+ when c.oid then 'own'
+ when oldfilenode then 'orig'
+ else 'OTHER'
+ end as storage,
+ obj_description(c.oid, 'pg_class') as desc
+ from pg_class c left join old_oids using (relname)
+ where relname like 'at_partitioned%'
+ order by relname;
+ relname | orig_oid | storage | desc
+------------------------------+----------+---------+---------------
+ at_partitioned | t | none |
+ at_partitioned_0 | t | own |
+ at_partitioned_0_id_name_key | t | own | child 0 index
+ at_partitioned_1 | t | own |
+ at_partitioned_1_id_name_key | t | own | child 1 index
+ at_partitioned_id_name_key | t | none | parent index
+(6 rows)
+
+select conname, obj_description(oid, 'pg_constraint') as desc
+ from pg_constraint where conname like 'at_partitioned%'
+ order by conname;
+ conname | desc
+------------------------------+--------------------
+ at_partitioned_0_id_name_key | child 0 constraint
+ at_partitioned_1_id_name_key | child 1 constraint
+ at_partitioned_id_name_key | parent constraint
+(3 rows)
+
+alter table at_partitioned alter column name type varchar(127);
+-- Note: these tests currently show the wrong behavior for comments :-(
+select relname,
+ c.oid = oldoid as orig_oid,
+ case relfilenode
+ when 0 then 'none'
+ when c.oid then 'own'
+ when oldfilenode then 'orig'
+ else 'OTHER'
+ end as storage,
+ obj_description(c.oid, 'pg_class') as desc
+ from pg_class c left join old_oids using (relname)
+ where relname like 'at_partitioned%'
+ order by relname;
+ relname | orig_oid | storage | desc
+------------------------------+----------+---------+--------------
+ at_partitioned | t | none |
+ at_partitioned_0 | t | own |
+ at_partitioned_0_id_name_key | f | own | parent index
+ at_partitioned_1 | t | own |
+ at_partitioned_1_id_name_key | f | own | parent index
+ at_partitioned_id_name_key | f | none | parent index
+(6 rows)
+
+select conname, obj_description(oid, 'pg_constraint') as desc
+ from pg_constraint where conname like 'at_partitioned%'
+ order by conname;
+ conname | desc
+------------------------------+-------------------
+ at_partitioned_0_id_name_key |
+ at_partitioned_1_id_name_key |
+ at_partitioned_id_name_key | parent constraint
+(3 rows)
+
+-- Don't remove this DROP, it exposes bug #15672
+drop table at_partitioned;
+-- disallow recursive containment of row types
+create temp table recur1 (f1 int);
+alter table recur1 add column f2 recur1; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+alter table recur1 add column f2 recur1[]; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+create domain array_of_recur1 as recur1[];
+alter table recur1 add column f2 array_of_recur1; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+create temp table recur2 (f1 int, f2 recur1);
+alter table recur1 add column f2 recur2; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+alter table recur1 add column f2 int;
+alter table recur1 alter column f2 type recur2; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+-- SET STORAGE may need to add a TOAST table
+create table test_storage (a text);
+alter table test_storage alter a set storage plain;
+alter table test_storage add b int default 0; -- rewrite table to remove its TOAST table
+alter table test_storage alter a set storage extended; -- re-add TOAST table
+select reltoastrelid <> 0 as has_toast_table
+from pg_class
+where oid = 'test_storage'::regclass;
+ has_toast_table
+-----------------
+ f
+(1 row)
+
+-- ALTER COLUMN TYPE with a check constraint and a child table (bug #13779)
+CREATE TABLE test_inh_check (a float check (a > 10.2), b float);
+CREATE TABLE test_inh_check_child() INHERITS(test_inh_check);
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | double precision | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | double precision | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(2 rows)
+
+ALTER TABLE test_inh_check ALTER COLUMN a TYPE numeric;
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(2 rows)
+
+-- also try noinherit, local, and local+inherited cases
+ALTER TABLE test_inh_check ADD CONSTRAINT bnoinherit CHECK (b > 100) NO INHERIT;
+ALTER TABLE test_inh_check_child ADD CONSTRAINT blocal CHECK (b < 1000);
+ALTER TABLE test_inh_check_child ADD CONSTRAINT bmerged CHECK (b > 1);
+ALTER TABLE test_inh_check ADD CONSTRAINT bmerged CHECK (b > 1);
+NOTICE: merging constraint "bmerged" with inherited definition
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "bmerged" CHECK (b > 1::double precision)
+ "bnoinherit" CHECK (b > 100::double precision) NO INHERIT
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "blocal" CHECK (b < 1000::double precision)
+ "bmerged" CHECK (b > 1::double precision)
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | bmerged | 0 | t | f
+ test_inh_check | bnoinherit | 0 | t | t
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | blocal | 0 | t | f
+ test_inh_check_child | bmerged | 1 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(6 rows)
+
+ALTER TABLE test_inh_check ALTER COLUMN b TYPE numeric;
+NOTICE: merging constraint "bmerged" with inherited definition
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | numeric | | |
+ b | numeric | | |
+Check constraints:
+ "bmerged" CHECK (b::double precision > 1::double precision)
+ "bnoinherit" CHECK (b::double precision > 100::double precision) NO INHERIT
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | numeric | | |
+ b | numeric | | |
+Check constraints:
+ "blocal" CHECK (b::double precision < 1000::double precision)
+ "bmerged" CHECK (b::double precision > 1::double precision)
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | bmerged | 0 | t | f
+ test_inh_check | bnoinherit | 0 | t | t
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | blocal | 0 | t | f
+ test_inh_check_child | bmerged | 1 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(6 rows)
+
+-- ALTER COLUMN TYPE with different schema in children
+-- Bug at https://postgr.es/m/20170102225618.GA10071@telsasoft.com
+CREATE TABLE test_type_diff (f1 int);
+CREATE TABLE test_type_diff_c (extra smallint) INHERITS (test_type_diff);
+ALTER TABLE test_type_diff ADD COLUMN f2 int;
+INSERT INTO test_type_diff_c VALUES (1, 2, 3);
+ALTER TABLE test_type_diff ALTER COLUMN f2 TYPE bigint USING f2::bigint;
+CREATE TABLE test_type_diff2 (int_two int2, int_four int4, int_eight int8);
+CREATE TABLE test_type_diff2_c1 (int_four int4, int_eight int8, int_two int2);
+CREATE TABLE test_type_diff2_c2 (int_eight int8, int_two int2, int_four int4);
+CREATE TABLE test_type_diff2_c3 (int_two int2, int_four int4, int_eight int8);
+ALTER TABLE test_type_diff2_c1 INHERIT test_type_diff2;
+ALTER TABLE test_type_diff2_c2 INHERIT test_type_diff2;
+ALTER TABLE test_type_diff2_c3 INHERIT test_type_diff2;
+INSERT INTO test_type_diff2_c1 VALUES (1, 2, 3);
+INSERT INTO test_type_diff2_c2 VALUES (4, 5, 6);
+INSERT INTO test_type_diff2_c3 VALUES (7, 8, 9);
+ALTER TABLE test_type_diff2 ALTER COLUMN int_four TYPE int8 USING int_four::int8;
+-- whole-row references are disallowed
+ALTER TABLE test_type_diff2 ALTER COLUMN int_four TYPE int4 USING (pg_column_size(test_type_diff2));
+ERROR: cannot convert whole-row table reference
+DETAIL: USING expression contains a whole-row table reference.
+-- check for rollback of ANALYZE corrupting table property flags (bug #11638)
+CREATE TABLE check_fk_presence_1 (id int PRIMARY KEY, t text);
+CREATE TABLE check_fk_presence_2 (id int REFERENCES check_fk_presence_1, t text);
+BEGIN;
+ALTER TABLE check_fk_presence_2 DROP CONSTRAINT check_fk_presence_2_id_fkey;
+ANALYZE check_fk_presence_2;
+ROLLBACK;
+\d check_fk_presence_2
+ Table "public.check_fk_presence_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ id | integer | | |
+ t | text | | |
+Foreign-key constraints:
+ "check_fk_presence_2_id_fkey" FOREIGN KEY (id) REFERENCES check_fk_presence_1(id)
+
+DROP TABLE check_fk_presence_1, check_fk_presence_2;
+-- check column addition within a view (bug #14876)
+create table at_base_table(id int, stuff text);
+insert into at_base_table values (23, 'skidoo');
+create view at_view_1 as select * from at_base_table bt;
+create view at_view_2 as select *, to_json(v1) as j from at_view_1 v1;
+\d+ at_view_1
+ View "public.at_view_1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+View definition:
+ SELECT bt.id,
+ bt.stuff
+ FROM at_base_table bt;
+
+\d+ at_view_2
+ View "public.at_view_2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+ j | json | | | | extended |
+View definition:
+ SELECT v1.id,
+ v1.stuff,
+ to_json(v1.*) AS j
+ FROM at_view_1 v1;
+
+explain (verbose, costs off) select * from at_view_2;
+ QUERY PLAN
+----------------------------------------------------------
+ Seq Scan on public.at_base_table bt
+ Output: bt.id, bt.stuff, to_json(ROW(bt.id, bt.stuff))
+(2 rows)
+
+select * from at_view_2;
+ id | stuff | j
+----+--------+----------------------------
+ 23 | skidoo | {"id":23,"stuff":"skidoo"}
+(1 row)
+
+create or replace view at_view_1 as select *, 2+2 as more from at_base_table bt;
+\d+ at_view_1
+ View "public.at_view_1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+ more | integer | | | | plain |
+View definition:
+ SELECT bt.id,
+ bt.stuff,
+ 2 + 2 AS more
+ FROM at_base_table bt;
+
+\d+ at_view_2
+ View "public.at_view_2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+ j | json | | | | extended |
+View definition:
+ SELECT v1.id,
+ v1.stuff,
+ to_json(v1.*) AS j
+ FROM at_view_1 v1;
+
+explain (verbose, costs off) select * from at_view_2;
+ QUERY PLAN
+----------------------------------------------------------------
+ Seq Scan on public.at_base_table bt
+ Output: bt.id, bt.stuff, to_json(ROW(bt.id, bt.stuff, NULL))
+(2 rows)
+
+select * from at_view_2;
+ id | stuff | j
+----+--------+----------------------------------------
+ 23 | skidoo | {"id":23,"stuff":"skidoo","more":null}
+(1 row)
+
+drop view at_view_2;
+drop view at_view_1;
+drop table at_base_table;
+--
+-- lock levels
+--
+drop type lockmodes;
+ERROR: type "lockmodes" does not exist
+create type lockmodes as enum (
+ 'SIReadLock'
+,'AccessShareLock'
+,'RowShareLock'
+,'RowExclusiveLock'
+,'ShareUpdateExclusiveLock'
+,'ShareLock'
+,'ShareRowExclusiveLock'
+,'ExclusiveLock'
+,'AccessExclusiveLock'
+);
+drop view my_locks;
+ERROR: view "my_locks" does not exist
+create or replace view my_locks as
+select case when c.relname like 'pg_toast%' then 'pg_toast' else c.relname end, max(mode::lockmodes) as max_lockmode
+from pg_locks l join pg_class c on l.relation = c.oid
+where virtualtransaction = (
+ select virtualtransaction
+ from pg_locks
+ where transactionid = txid_current()::integer)
+and locktype = 'relation'
+and relnamespace != (select oid from pg_namespace where nspname = 'pg_catalog')
+and c.relname != 'my_locks'
+group by c.relname;
+create table alterlock (f1 int primary key, f2 text);
+insert into alterlock values (1, 'foo');
+create table alterlock2 (f3 int primary key, f1 int);
+insert into alterlock2 values (1, 1);
+begin; alter table alterlock alter column f2 set statistics 150;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+rollback;
+begin; alter table alterlock cluster on alterlock_pkey;
+select * from my_locks order by 1;
+ relname | max_lockmode
+----------------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+ alterlock_pkey | ShareUpdateExclusiveLock
+(2 rows)
+
+commit;
+begin; alter table alterlock set without cluster;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock set (fillfactor = 100);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock reset (fillfactor);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock set (toast.autovacuum_enabled = off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock set (autovacuum_enabled = off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock alter column f2 set (n_distinct = 1);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+rollback;
+-- test that mixing options with different lock levels works as expected
+begin; alter table alterlock set (autovacuum_enabled = off, fillfactor = 80);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock alter column f2 set storage extended;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+---------------------
+ alterlock | AccessExclusiveLock
+(1 row)
+
+rollback;
+begin; alter table alterlock alter column f2 set default 'x';
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+---------------------
+ alterlock | AccessExclusiveLock
+(1 row)
+
+rollback;
+begin;
+create trigger ttdummy
+ before delete or update on alterlock
+ for each row
+ execute procedure
+ ttdummy (1, 1);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+-----------------------
+ alterlock | ShareRowExclusiveLock
+(1 row)
+
+rollback;
+begin;
+select * from my_locks order by 1;
+ relname | max_lockmode
+---------+--------------
+(0 rows)
+
+alter table alterlock2 add foreign key (f1) references alterlock (f1);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------------+-----------------------
+ alterlock | ShareRowExclusiveLock
+ alterlock2 | ShareRowExclusiveLock
+ alterlock2_pkey | AccessShareLock
+ alterlock_pkey | AccessShareLock
+(4 rows)
+
+rollback;
+begin;
+alter table alterlock2
+add constraint alterlock2nv foreign key (f1) references alterlock (f1) NOT VALID;
+select * from my_locks order by 1;
+ relname | max_lockmode
+------------+-----------------------
+ alterlock | ShareRowExclusiveLock
+ alterlock2 | ShareRowExclusiveLock
+(2 rows)
+
+commit;
+begin;
+alter table alterlock2 validate constraint alterlock2nv;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------------+--------------------------
+ alterlock | RowShareLock
+ alterlock2 | ShareUpdateExclusiveLock
+ alterlock2_pkey | AccessShareLock
+ alterlock_pkey | AccessShareLock
+(4 rows)
+
+rollback;
+create or replace view my_locks as
+select case when c.relname like 'pg_toast%' then 'pg_toast' else c.relname end, max(mode::lockmodes) as max_lockmode
+from pg_locks l join pg_class c on l.relation = c.oid
+where virtualtransaction = (
+ select virtualtransaction
+ from pg_locks
+ where transactionid = txid_current()::integer)
+and locktype = 'relation'
+and relnamespace != (select oid from pg_namespace where nspname = 'pg_catalog')
+and c.relname = 'my_locks'
+group by c.relname;
+-- raise exception
+alter table my_locks set (autovacuum_enabled = false);
+ERROR: unrecognized parameter "autovacuum_enabled"
+alter view my_locks set (autovacuum_enabled = false);
+ERROR: unrecognized parameter "autovacuum_enabled"
+alter table my_locks reset (autovacuum_enabled);
+alter view my_locks reset (autovacuum_enabled);
+begin;
+alter view my_locks set (security_barrier=off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+----------+---------------------
+ my_locks | AccessExclusiveLock
+(1 row)
+
+alter view my_locks reset (security_barrier);
+rollback;
+-- this test intentionally applies the ALTER TABLE command against a view, but
+-- uses a view option so we expect this to succeed. This form of SQL is
+-- accepted for historical reasons, as shown in the docs for ALTER VIEW
+begin;
+alter table my_locks set (security_barrier=off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+----------+---------------------
+ my_locks | AccessExclusiveLock
+(1 row)
+
+alter table my_locks reset (security_barrier);
+rollback;
+-- cleanup
+drop table alterlock2;
+drop table alterlock;
+drop view my_locks;
+drop type lockmodes;
+--
+-- alter function
+--
+create function test_strict(text) returns text as
+ 'select coalesce($1, ''got passed a null'');'
+ language sql returns null on null input;
+select test_strict(NULL);
+ test_strict
+-------------
+
+(1 row)
+
+alter function test_strict(text) called on null input;
+select test_strict(NULL);
+ test_strict
+-------------------
+ got passed a null
+(1 row)
+
+create function non_strict(text) returns text as
+ 'select coalesce($1, ''got passed a null'');'
+ language sql called on null input;
+select non_strict(NULL);
+ non_strict
+-------------------
+ got passed a null
+(1 row)
+
+alter function non_strict(text) returns null on null input;
+select non_strict(NULL);
+ non_strict
+------------
+
+(1 row)
+
+--
+-- alter object set schema
+--
+create schema alter1;
+create schema alter2;
+create table alter1.t1(f1 serial primary key, f2 int check (f2 > 0));
+create view alter1.v1 as select * from alter1.t1;
+create function alter1.plus1(int) returns int as 'select $1+1' language sql;
+create domain alter1.posint integer check (value > 0);
+create type alter1.ctype as (f1 int, f2 text);
+create function alter1.same(alter1.ctype, alter1.ctype) returns boolean language sql
+as 'select $1.f1 is not distinct from $2.f1 and $1.f2 is not distinct from $2.f2';
+create operator alter1.=(procedure = alter1.same, leftarg = alter1.ctype, rightarg = alter1.ctype);
+create operator class alter1.ctype_hash_ops default for type alter1.ctype using hash as
+ operator 1 alter1.=(alter1.ctype, alter1.ctype);
+create conversion alter1.ascii_to_utf8 for 'sql_ascii' to 'utf8' from ascii_to_utf8;
+create text search parser alter1.prs(start = prsd_start, gettoken = prsd_nexttoken, end = prsd_end, lextypes = prsd_lextype);
+create text search configuration alter1.cfg(parser = alter1.prs);
+create text search template alter1.tmpl(init = dsimple_init, lexize = dsimple_lexize);
+create text search dictionary alter1.dict(template = alter1.tmpl);
+insert into alter1.t1(f2) values(11);
+insert into alter1.t1(f2) values(12);
+alter table alter1.t1 set schema alter1; -- no-op, same schema
+alter table alter1.t1 set schema alter2;
+alter table alter1.v1 set schema alter2;
+alter function alter1.plus1(int) set schema alter2;
+alter domain alter1.posint set schema alter2;
+alter operator class alter1.ctype_hash_ops using hash set schema alter2;
+alter operator family alter1.ctype_hash_ops using hash set schema alter2;
+alter operator alter1.=(alter1.ctype, alter1.ctype) set schema alter2;
+alter function alter1.same(alter1.ctype, alter1.ctype) set schema alter2;
+alter type alter1.ctype set schema alter1; -- no-op, same schema
+alter type alter1.ctype set schema alter2;
+alter conversion alter1.ascii_to_utf8 set schema alter2;
+alter text search parser alter1.prs set schema alter2;
+alter text search configuration alter1.cfg set schema alter2;
+alter text search template alter1.tmpl set schema alter2;
+alter text search dictionary alter1.dict set schema alter2;
+-- this should succeed because nothing is left in alter1
+drop schema alter1;
+insert into alter2.t1(f2) values(13);
+insert into alter2.t1(f2) values(14);
+select * from alter2.t1;
+ f1 | f2
+----+----
+ 1 | 11
+ 2 | 12
+ 3 | 13
+ 4 | 14
+(4 rows)
+
+select * from alter2.v1;
+ f1 | f2
+----+----
+ 1 | 11
+ 2 | 12
+ 3 | 13
+ 4 | 14
+(4 rows)
+
+select alter2.plus1(41);
+ plus1
+-------
+ 42
+(1 row)
+
+-- clean up
+drop schema alter2 cascade;
+NOTICE: drop cascades to 13 other objects
+DETAIL: drop cascades to table alter2.t1
+drop cascades to view alter2.v1
+drop cascades to function alter2.plus1(integer)
+drop cascades to type alter2.posint
+drop cascades to type alter2.ctype
+drop cascades to function alter2.same(alter2.ctype,alter2.ctype)
+drop cascades to operator alter2.=(alter2.ctype,alter2.ctype)
+drop cascades to operator family alter2.ctype_hash_ops for access method hash
+drop cascades to conversion alter2.ascii_to_utf8
+drop cascades to text search parser alter2.prs
+drop cascades to text search configuration alter2.cfg
+drop cascades to text search template alter2.tmpl
+drop cascades to text search dictionary alter2.dict
+--
+-- composite types
+--
+CREATE TYPE test_type AS (a int);
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+
+ALTER TYPE nosuchtype ADD ATTRIBUTE b text; -- fails
+ERROR: relation "nosuchtype" does not exist
+ALTER TYPE test_type ADD ATTRIBUTE b text;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+
+ALTER TYPE test_type ADD ATTRIBUTE b text; -- fails
+ERROR: column "b" of relation "test_type" already exists
+ALTER TYPE test_type ALTER ATTRIBUTE b SET DATA TYPE varchar;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+-------------------+-----------+----------+---------
+ a | integer | | |
+ b | character varying | | |
+
+ALTER TYPE test_type ALTER ATTRIBUTE b SET DATA TYPE integer;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+
+ALTER TYPE test_type DROP ATTRIBUTE b;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+
+ALTER TYPE test_type DROP ATTRIBUTE c; -- fails
+ERROR: column "c" of relation "test_type" does not exist
+ALTER TYPE test_type DROP ATTRIBUTE IF EXISTS c;
+NOTICE: column "c" of relation "test_type" does not exist, skipping
+ALTER TYPE test_type DROP ATTRIBUTE a, ADD ATTRIBUTE d boolean;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ d | boolean | | |
+
+ALTER TYPE test_type RENAME ATTRIBUTE a TO aa;
+ERROR: column "a" does not exist
+ALTER TYPE test_type RENAME ATTRIBUTE d TO dd;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ dd | boolean | | |
+
+DROP TYPE test_type;
+CREATE TYPE test_type1 AS (a int, b text);
+CREATE TABLE test_tbl1 (x int, y test_type1);
+ALTER TYPE test_type1 ALTER ATTRIBUTE b TYPE varchar; -- fails
+ERROR: cannot alter type "test_type1" because column "test_tbl1.y" uses it
+CREATE TYPE test_type2 AS (a int, b text);
+CREATE TABLE test_tbl2 OF test_type2;
+CREATE TABLE test_tbl2_subclass () INHERITS (test_tbl2);
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 ADD ATTRIBUTE c text; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 ADD ATTRIBUTE c text CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 ALTER ATTRIBUTE b TYPE varchar; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 ALTER ATTRIBUTE b TYPE varchar CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+-------------------+-----------+----------+---------
+ a | integer | | |
+ b | character varying | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+-------------------+-----------+----------+---------
+ a | integer | | |
+ b | character varying | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 DROP ATTRIBUTE b; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 DROP ATTRIBUTE b CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 RENAME ATTRIBUTE a TO aa; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 RENAME ATTRIBUTE a TO aa CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ aa | integer | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ aa | integer | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+\d test_tbl2_subclass
+ Table "public.test_tbl2_subclass"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ aa | integer | | |
+ c | text | | |
+Inherits: test_tbl2
+
+DROP TABLE test_tbl2_subclass;
+CREATE TYPE test_typex AS (a int, b text);
+CREATE TABLE test_tblx (x int, y test_typex check ((y).a > 0));
+ALTER TYPE test_typex DROP ATTRIBUTE a; -- fails
+ERROR: cannot drop column a of composite type test_typex because other objects depend on it
+DETAIL: constraint test_tblx_y_check on table test_tblx depends on column a of composite type test_typex
+HINT: Use DROP ... CASCADE to drop the dependent objects too.
+ALTER TYPE test_typex DROP ATTRIBUTE a CASCADE;
+NOTICE: drop cascades to constraint test_tblx_y_check on table test_tblx
+\d test_tblx
+ Table "public.test_tblx"
+ Column | Type | Collation | Nullable | Default
+--------+------------+-----------+----------+---------
+ x | integer | | |
+ y | test_typex | | |
+
+DROP TABLE test_tblx;
+DROP TYPE test_typex;
+-- This test isn't that interesting on its own, but the purpose is to leave
+-- behind a table to test pg_upgrade with. The table has a composite type
+-- column in it, and the composite type has a dropped attribute.
+CREATE TYPE test_type3 AS (a int);
+CREATE TABLE test_tbl3 (c) AS SELECT '(1)'::test_type3;
+ALTER TYPE test_type3 DROP ATTRIBUTE a, ADD ATTRIBUTE b int;
+CREATE TYPE test_type_empty AS ();
+DROP TYPE test_type_empty;
+--
+-- typed tables: OF / NOT OF
+--
+CREATE TYPE tt_t0 AS (z inet, x int, y numeric(8,2));
+ALTER TYPE tt_t0 DROP ATTRIBUTE z;
+CREATE TABLE tt0 (x int NOT NULL, y numeric(8,2)); -- OK
+CREATE TABLE tt1 (x int, y bigint); -- wrong base type
+CREATE TABLE tt2 (x int, y numeric(9,2)); -- wrong typmod
+CREATE TABLE tt3 (y numeric(8,2), x int); -- wrong column order
+CREATE TABLE tt4 (x int); -- too few columns
+CREATE TABLE tt5 (x int, y numeric(8,2), z int); -- too many columns
+CREATE TABLE tt6 () INHERITS (tt0); -- can't have a parent
+CREATE TABLE tt7 (x int, q text, y numeric(8,2));
+ALTER TABLE tt7 DROP q; -- OK
+ALTER TABLE tt0 OF tt_t0;
+ALTER TABLE tt1 OF tt_t0;
+ERROR: table "tt1" has different type for column "y"
+ALTER TABLE tt2 OF tt_t0;
+ERROR: table "tt2" has different type for column "y"
+ALTER TABLE tt3 OF tt_t0;
+ERROR: table has column "y" where type requires "x"
+ALTER TABLE tt4 OF tt_t0;
+ERROR: table is missing column "y"
+ALTER TABLE tt5 OF tt_t0;
+ERROR: table has extra column "z"
+ALTER TABLE tt6 OF tt_t0;
+ERROR: typed tables cannot inherit
+ALTER TABLE tt7 OF tt_t0;
+CREATE TYPE tt_t1 AS (x int, y numeric(8,2));
+ALTER TABLE tt7 OF tt_t1; -- reassign an already-typed table
+ALTER TABLE tt7 NOT OF;
+\d tt7
+ Table "public.tt7"
+ Column | Type | Collation | Nullable | Default
+--------+--------------+-----------+----------+---------
+ x | integer | | |
+ y | numeric(8,2) | | |
+
+-- make sure we can drop a constraint on the parent but it remains on the child
+CREATE TABLE test_drop_constr_parent (c text CHECK (c IS NOT NULL));
+CREATE TABLE test_drop_constr_child () INHERITS (test_drop_constr_parent);
+ALTER TABLE ONLY test_drop_constr_parent DROP CONSTRAINT "test_drop_constr_parent_c_check";
+-- should fail
+INSERT INTO test_drop_constr_child (c) VALUES (NULL);
+ERROR: new row for relation "test_drop_constr_child" violates check constraint "test_drop_constr_parent_c_check"
+DETAIL: Failing row contains (null).
+DROP TABLE test_drop_constr_parent CASCADE;
+NOTICE: drop cascades to table test_drop_constr_child
+--
+-- IF EXISTS test
+--
+ALTER TABLE IF EXISTS tt8 ADD COLUMN f int;
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 ADD CONSTRAINT xxx PRIMARY KEY(f);
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 ADD CHECK (f BETWEEN 0 AND 10);
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 ALTER COLUMN f SET DEFAULT 0;
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 RENAME COLUMN f TO f1;
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 SET SCHEMA alter2;
+NOTICE: relation "tt8" does not exist, skipping
+CREATE TABLE tt8(a int);
+CREATE SCHEMA alter2;
+ALTER TABLE IF EXISTS tt8 ADD COLUMN f int;
+ALTER TABLE IF EXISTS tt8 ADD CONSTRAINT xxx PRIMARY KEY(f);
+ALTER TABLE IF EXISTS tt8 ADD CHECK (f BETWEEN 0 AND 10);
+ALTER TABLE IF EXISTS tt8 ALTER COLUMN f SET DEFAULT 0;
+ALTER TABLE IF EXISTS tt8 RENAME COLUMN f TO f1;
+ALTER TABLE IF EXISTS tt8 SET SCHEMA alter2;
+\d alter2.tt8
+ Table "alter2.tt8"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ f1 | integer | | not null | 0
+Indexes:
+ "xxx" PRIMARY KEY, btree (f1)
+Check constraints:
+ "tt8_f_check" CHECK (f1 >= 0 AND f1 <= 10)
+
+DROP TABLE alter2.tt8;
+DROP SCHEMA alter2;
+--
+-- Check conflicts between index and CHECK constraint names
+--
+CREATE TABLE tt9(c integer);
+ALTER TABLE tt9 ADD CHECK(c > 1);
+ALTER TABLE tt9 ADD CHECK(c > 2); -- picks nonconflicting name
+ALTER TABLE tt9 ADD CONSTRAINT foo CHECK(c > 3);
+ALTER TABLE tt9 ADD CONSTRAINT foo CHECK(c > 4); -- fail, dup name
+ERROR: constraint "foo" for relation "tt9" already exists
+ALTER TABLE tt9 ADD UNIQUE(c);
+ALTER TABLE tt9 ADD UNIQUE(c); -- picks nonconflicting name
+ALTER TABLE tt9 ADD CONSTRAINT tt9_c_key UNIQUE(c); -- fail, dup name
+ERROR: relation "tt9_c_key" already exists
+ALTER TABLE tt9 ADD CONSTRAINT foo UNIQUE(c); -- fail, dup name
+ERROR: constraint "foo" for relation "tt9" already exists
+ALTER TABLE tt9 ADD CONSTRAINT tt9_c_key CHECK(c > 5); -- fail, dup name
+ERROR: constraint "tt9_c_key" for relation "tt9" already exists
+ALTER TABLE tt9 ADD CONSTRAINT tt9_c_key2 CHECK(c > 6);
+ALTER TABLE tt9 ADD UNIQUE(c); -- picks nonconflicting name
+\d tt9
+ Table "public.tt9"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c | integer | | |
+Indexes:
+ "tt9_c_key" UNIQUE CONSTRAINT, btree (c)
+ "tt9_c_key1" UNIQUE CONSTRAINT, btree (c)
+ "tt9_c_key3" UNIQUE CONSTRAINT, btree (c)
+Check constraints:
+ "foo" CHECK (c > 3)
+ "tt9_c_check" CHECK (c > 1)
+ "tt9_c_check1" CHECK (c > 2)
+ "tt9_c_key2" CHECK (c > 6)
+
+DROP TABLE tt9;
+-- Check that comments on constraints and indexes are not lost at ALTER TABLE.
+CREATE TABLE comment_test (
+ id int,
+ positive_col int CHECK (positive_col > 0),
+ indexed_col int,
+ CONSTRAINT comment_test_pk PRIMARY KEY (id));
+CREATE INDEX comment_test_index ON comment_test(indexed_col);
+COMMENT ON COLUMN comment_test.id IS 'Column ''id'' on comment_test';
+COMMENT ON INDEX comment_test_index IS 'Simple index on comment_test';
+COMMENT ON CONSTRAINT comment_test_positive_col_check ON comment_test IS 'CHECK constraint on comment_test.positive_col';
+COMMENT ON CONSTRAINT comment_test_pk ON comment_test IS 'PRIMARY KEY constraint of comment_test';
+COMMENT ON INDEX comment_test_pk IS 'Index backing the PRIMARY KEY of comment_test';
+SELECT col_description('comment_test'::regclass, 1) as comment;
+ comment
+-----------------------------
+ Column 'id' on comment_test
+(1 row)
+
+SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ index | comment
+--------------------+-----------------------------------------------
+ comment_test_index | Simple index on comment_test
+ comment_test_pk | Index backing the PRIMARY KEY of comment_test
+(2 rows)
+
+SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ constraint | comment
+---------------------------------+-----------------------------------------------
+ comment_test_pk | PRIMARY KEY constraint of comment_test
+ comment_test_positive_col_check | CHECK constraint on comment_test.positive_col
+(2 rows)
+
+-- Change the datatype of all the columns. ALTER TABLE is optimized to not
+-- rebuild an index if the new data type is binary compatible with the old
+-- one. To check this, first do a dummy ALTER TABLE that doesn't change the
+-- datatype, to test that no-op codepath, and then another one that does.
+ALTER TABLE comment_test ALTER COLUMN indexed_col SET DATA TYPE int;
+ALTER TABLE comment_test ALTER COLUMN indexed_col SET DATA TYPE text;
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE int;
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE text;
+ALTER TABLE comment_test ALTER COLUMN positive_col SET DATA TYPE int;
+ALTER TABLE comment_test ALTER COLUMN positive_col SET DATA TYPE bigint;
+-- Check that the comments are intact.
+SELECT col_description('comment_test'::regclass, 1) as comment;
+ comment
+-----------------------------
+ Column 'id' on comment_test
+(1 row)
+
+SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ index | comment
+--------------------+-----------------------------------------------
+ comment_test_index | Simple index on comment_test
+ comment_test_pk | Index backing the PRIMARY KEY of comment_test
+(2 rows)
+
+SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ constraint | comment
+---------------------------------+-----------------------------------------------
+ comment_test_pk | PRIMARY KEY constraint of comment_test
+ comment_test_positive_col_check | CHECK constraint on comment_test.positive_col
+(2 rows)
+
+-- Check compatibility for foreign keys and comments. This is done
+-- separately as rebuilding the column type of the parent leads
+-- to an error and would reduce the test scope.
+CREATE TABLE comment_test_child (
+ id text CONSTRAINT comment_test_child_fk REFERENCES comment_test);
+CREATE INDEX comment_test_child_fk ON comment_test_child(id);
+COMMENT ON COLUMN comment_test_child.id IS 'Column ''id'' on comment_test_child';
+COMMENT ON INDEX comment_test_child_fk IS 'Index backing the FOREIGN KEY of comment_test_child';
+COMMENT ON CONSTRAINT comment_test_child_fk ON comment_test_child IS 'FOREIGN KEY constraint of comment_test_child';
+-- Change column type of parent
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE text;
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE int USING id::integer;
+ERROR: foreign key constraint "comment_test_child_fk" cannot be implemented
+DETAIL: Key columns "id" and "id" are of incompatible types: text and integer.
+-- Comments should be intact
+SELECT col_description('comment_test_child'::regclass, 1) as comment;
+ comment
+-----------------------------------
+ Column 'id' on comment_test_child
+(1 row)
+
+SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test_child'::regclass ORDER BY 1, 2;
+ index | comment
+-----------------------+-----------------------------------------------------
+ comment_test_child_fk | Index backing the FOREIGN KEY of comment_test_child
+(1 row)
+
+SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test_child'::regclass ORDER BY 1, 2;
+ constraint | comment
+-----------------------+----------------------------------------------
+ comment_test_child_fk | FOREIGN KEY constraint of comment_test_child
+(1 row)
+
+-- Check that we map relation oids to filenodes and back correctly. Only
+-- display bad mappings so the test output doesn't change all the time. A
+-- filenode function call can return NULL for a relation dropped concurrently
+-- with the call's surrounding query, so ignore a NULL mapped_oid for
+-- relations that no longer exist after all calls finish.
+CREATE TEMP TABLE filenode_mapping AS
+SELECT
+ oid, mapped_oid, reltablespace, relfilenode, relname
+FROM pg_class,
+ pg_filenode_relation(reltablespace, pg_relation_filenode(oid)) AS mapped_oid
+WHERE relkind IN ('r', 'i', 'S', 't', 'm') AND mapped_oid IS DISTINCT FROM oid;
+SELECT m.* FROM filenode_mapping m LEFT JOIN pg_class c ON c.oid = m.oid
+WHERE c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL;
+ oid | mapped_oid | reltablespace | relfilenode | relname
+-----+------------+---------------+-------------+---------
+(0 rows)
+
+-- Checks on creating and manipulation of user defined relations in
+-- pg_catalog.
+--
+-- XXX: It would be useful to add checks around trying to manipulate
+-- catalog tables, but that might have ugly consequences when run
+-- against an existing server with allow_system_table_mods = on.
+SHOW allow_system_table_mods;
+ allow_system_table_mods
+-------------------------
+ off
+(1 row)
+
+-- disallowed because of search_path issues with pg_dump
+CREATE TABLE pg_catalog.new_system_table();
+ERROR: permission denied to create "pg_catalog.new_system_table"
+DETAIL: System catalog modifications are currently disallowed.
+-- instead create in public first, move to catalog
+CREATE TABLE new_system_table(id serial primary key, othercol text);
+ALTER TABLE new_system_table SET SCHEMA pg_catalog;
+ALTER TABLE new_system_table SET SCHEMA public;
+ALTER TABLE new_system_table SET SCHEMA pg_catalog;
+-- will be ignored -- already there:
+ALTER TABLE new_system_table SET SCHEMA pg_catalog;
+ALTER TABLE new_system_table RENAME TO old_system_table;
+CREATE INDEX old_system_table__othercol ON old_system_table (othercol);
+INSERT INTO old_system_table(othercol) VALUES ('somedata'), ('otherdata');
+UPDATE old_system_table SET id = -id;
+DELETE FROM old_system_table WHERE othercol = 'somedata';
+TRUNCATE old_system_table;
+ALTER TABLE old_system_table DROP CONSTRAINT new_system_table_pkey;
+ALTER TABLE old_system_table DROP COLUMN othercol;
+DROP TABLE old_system_table;
+-- set logged
+CREATE UNLOGGED TABLE unlogged1(f1 SERIAL PRIMARY KEY, f2 TEXT);
+-- check relpersistence of an unlogged table
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^unlogged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+------------------+---------+----------------
+ unlogged1 | r | u
+ unlogged1_f1_seq | S | p
+ unlogged1_pkey | i | u
+(3 rows)
+
+CREATE UNLOGGED TABLE unlogged2(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES unlogged1); -- foreign key
+CREATE UNLOGGED TABLE unlogged3(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES unlogged3); -- self-referencing foreign key
+ALTER TABLE unlogged3 SET LOGGED; -- skip self-referencing foreign key
+ALTER TABLE unlogged2 SET LOGGED; -- fails because a foreign key to an unlogged table exists
+ERROR: could not change table "unlogged2" to logged because it references unlogged table "unlogged1"
+ALTER TABLE unlogged1 SET LOGGED;
+-- check relpersistence of an unlogged table after changing to permanent
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^unlogged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+------------------+---------+----------------
+ unlogged1 | r | p
+ unlogged1_f1_seq | S | p
+ unlogged1_pkey | i | p
+(3 rows)
+
+ALTER TABLE unlogged1 SET LOGGED; -- silently do nothing
+DROP TABLE unlogged3;
+DROP TABLE unlogged2;
+DROP TABLE unlogged1;
+-- set unlogged
+CREATE TABLE logged1(f1 SERIAL PRIMARY KEY, f2 TEXT);
+-- check relpersistence of a permanent table
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^logged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^logged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^logged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+----------------+---------+----------------
+ logged1 | r | p
+ logged1_f1_seq | S | p
+ logged1_pkey | i | p
+(3 rows)
+
+CREATE TABLE logged2(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES logged1); -- foreign key
+CREATE TABLE logged3(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES logged3); -- self-referencing foreign key
+ALTER TABLE logged1 SET UNLOGGED; -- fails because a foreign key from a permanent table exists
+ERROR: could not change table "logged1" to unlogged because it references logged table "logged2"
+ALTER TABLE logged3 SET UNLOGGED; -- skip self-referencing foreign key
+ALTER TABLE logged2 SET UNLOGGED;
+ALTER TABLE logged1 SET UNLOGGED;
+-- check relpersistence of a permanent table after changing to unlogged
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^logged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^logged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^logged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+----------------+---------+----------------
+ logged1 | r | u
+ logged1_f1_seq | S | p
+ logged1_pkey | i | u
+(3 rows)
+
+ALTER TABLE logged1 SET UNLOGGED; -- silently do nothing
+DROP TABLE logged3;
+DROP TABLE logged2;
+DROP TABLE logged1;
+-- test ADD COLUMN IF NOT EXISTS
+CREATE TABLE test_add_column(c1 integer);
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN c2 integer;
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN c2 integer; -- fail because c2 already exists
+ERROR: column "c2" of relation "test_add_column" already exists
+ALTER TABLE ONLY test_add_column
+ ADD COLUMN c2 integer; -- fail because c2 already exists
+ERROR: column "c2" of relation "test_add_column" already exists
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer; -- skipping because c2 already exists
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+ALTER TABLE ONLY test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer; -- skipping because c2 already exists
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN c2 integer, -- fail because c2 already exists
+ ADD COLUMN c3 integer;
+ERROR: column "c2" of relation "test_add_column" already exists
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
+ ADD COLUMN c3 integer; -- succeeds, because c3 does not exist yet
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
+ ADD COLUMN IF NOT EXISTS c3 integer; -- skipping because c3 already exists
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+NOTICE: column "c3" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
+ ADD COLUMN IF NOT EXISTS c3 integer, -- skipping because c3 already exists
+ ADD COLUMN c4 integer;
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+NOTICE: column "c3" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | |
+ c4 | integer | | |
+
+DROP TABLE test_add_column;
+-- unsupported constraint types for partitioned tables
+CREATE TABLE partitioned (
+ a int,
+ b int
+) PARTITION BY RANGE (a, (a+b+1));
+ALTER TABLE partitioned ADD EXCLUDE USING gist (a WITH &&);
+ERROR: exclusion constraints are not supported on partitioned tables
+LINE 1: ALTER TABLE partitioned ADD EXCLUDE USING gist (a WITH &&);
+ ^
+-- cannot drop column that is part of the partition key
+ALTER TABLE partitioned DROP COLUMN a;
+ERROR: cannot drop column named in partition key
+ALTER TABLE partitioned ALTER COLUMN a TYPE char(5);
+ERROR: cannot alter type of column named in partition key
+ALTER TABLE partitioned DROP COLUMN b;
+ERROR: cannot drop column referenced in partition key expression
+ALTER TABLE partitioned ALTER COLUMN b TYPE char(5);
+ERROR: cannot alter type of column referenced in partition key expression
+-- partitioned table cannot participate in regular inheritance
+CREATE TABLE nonpartitioned (
+ a int,
+ b int
+);
+ALTER TABLE partitioned INHERIT nonpartitioned;
+ERROR: cannot change inheritance of partitioned table
+ALTER TABLE nonpartitioned INHERIT partitioned;
+ERROR: cannot inherit from partitioned table "partitioned"
+-- cannot add NO INHERIT constraint to partitioned tables
+ALTER TABLE partitioned ADD CONSTRAINT chk_a CHECK (a > 0) NO INHERIT;
+ERROR: cannot add NO INHERIT constraint to partitioned table "partitioned"
+DROP TABLE partitioned, nonpartitioned;
+--
+-- ATTACH PARTITION
+--
+-- check that target table is partitioned
+CREATE TABLE unparted (
+ a int
+);
+CREATE TABLE fail_part (like unparted);
+ALTER TABLE unparted ATTACH PARTITION fail_part FOR VALUES IN ('a');
+ERROR: table "unparted" is not partitioned
+DROP TABLE unparted, fail_part;
+-- check that partition bound is compatible
+CREATE TABLE list_parted (
+ a int NOT NULL,
+ b char(2) COLLATE "C",
+ CONSTRAINT check_a CHECK (a > 0)
+) PARTITION BY LIST (a);
+CREATE TABLE fail_part (LIKE list_parted);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES FROM (1) TO (10);
+ERROR: invalid bound specification for a list partition
+LINE 1: ...list_parted ATTACH PARTITION fail_part FOR VALUES FROM (1) T...
+ ^
+DROP TABLE fail_part;
+-- check that the table being attached exists
+ALTER TABLE list_parted ATTACH PARTITION nonexistant FOR VALUES IN (1);
+ERROR: relation "nonexistant" does not exist
+-- check ownership of the source table
+CREATE ROLE regress_test_me;
+CREATE ROLE regress_test_not_me;
+CREATE TABLE not_owned_by_me (LIKE list_parted);
+ALTER TABLE not_owned_by_me OWNER TO regress_test_not_me;
+SET SESSION AUTHORIZATION regress_test_me;
+CREATE TABLE owned_by_me (
+ a int
+) PARTITION BY LIST (a);
+ALTER TABLE owned_by_me ATTACH PARTITION not_owned_by_me FOR VALUES IN (1);
+ERROR: must be owner of table not_owned_by_me
+RESET SESSION AUTHORIZATION;
+DROP TABLE owned_by_me, not_owned_by_me;
+DROP ROLE regress_test_not_me;
+DROP ROLE regress_test_me;
+-- check that the table being attached is not part of regular inheritance
+CREATE TABLE parent (LIKE list_parted);
+CREATE TABLE child () INHERITS (parent);
+ALTER TABLE list_parted ATTACH PARTITION child FOR VALUES IN (1);
+ERROR: cannot attach inheritance child as partition
+ALTER TABLE list_parted ATTACH PARTITION parent FOR VALUES IN (1);
+ERROR: cannot attach inheritance parent as partition
+DROP TABLE parent CASCADE;
+NOTICE: drop cascades to table child
+-- check any TEMP-ness
+CREATE TEMP TABLE temp_parted (a int) PARTITION BY LIST (a);
+CREATE TABLE perm_part (a int);
+ALTER TABLE temp_parted ATTACH PARTITION perm_part FOR VALUES IN (1);
+ERROR: cannot attach a permanent relation as partition of temporary relation "temp_parted"
+DROP TABLE temp_parted, perm_part;
+-- check that the table being attached is not a typed table
+CREATE TYPE mytype AS (a int);
+CREATE TABLE fail_part OF mytype;
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: cannot attach a typed table as partition
+DROP TYPE mytype CASCADE;
+NOTICE: drop cascades to table fail_part
+-- check that the table being attached has only columns present in the parent
+CREATE TABLE fail_part (like list_parted, c int);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: table "fail_part" contains column "c" not found in parent "list_parted"
+DETAIL: The new partition may contain only the columns present in parent.
+DROP TABLE fail_part;
+-- check that the table being attached has every column of the parent
+CREATE TABLE fail_part (a int NOT NULL);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table is missing column "b"
+DROP TABLE fail_part;
+-- check that columns match in type, collation and NOT NULL status
+CREATE TABLE fail_part (
+ b char(3),
+ a int NOT NULL
+);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table "fail_part" has different type for column "b"
+ALTER TABLE fail_part ALTER b TYPE char (2) COLLATE "POSIX";
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table "fail_part" has different collation for column "b"
+DROP TABLE fail_part;
+-- check that the table being attached has all constraints of the parent
+CREATE TABLE fail_part (
+ b char(2) COLLATE "C",
+ a int NOT NULL
+);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table is missing constraint "check_a"
+-- check that the constraint matches in definition with parent's constraint
+ALTER TABLE fail_part ADD CONSTRAINT check_a CHECK (a >= 0);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table "fail_part" has different definition for check constraint "check_a"
+DROP TABLE fail_part;
+-- check the attributes and constraints after partition is attached
+CREATE TABLE part_1 (
+ a int NOT NULL,
+ b char(2) COLLATE "C",
+ CONSTRAINT check_a CHECK (a > 0)
+);
+ALTER TABLE list_parted ATTACH PARTITION part_1 FOR VALUES IN (1);
+-- attislocal and conislocal are always false for merged attributes and constraints respectively.
+SELECT attislocal, attinhcount FROM pg_attribute WHERE attrelid = 'part_1'::regclass AND attnum > 0;
+ attislocal | attinhcount
+------------+-------------
+ f | 1
+ f | 1
+(2 rows)
+
+SELECT conislocal, coninhcount FROM pg_constraint WHERE conrelid = 'part_1'::regclass AND conname = 'check_a';
+ conislocal | coninhcount
+------------+-------------
+ f | 1
+(1 row)
+
+-- check that the new partition won't overlap with an existing partition
+CREATE TABLE fail_part (LIKE part_1 INCLUDING CONSTRAINTS);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: partition "fail_part" would overlap partition "part_1"
+DROP TABLE fail_part;
+-- check that an existing table can be attached as a default partition
+CREATE TABLE def_part (LIKE list_parted INCLUDING CONSTRAINTS);
+ALTER TABLE list_parted ATTACH PARTITION def_part DEFAULT;
+-- check attaching default partition fails if a default partition already
+-- exists
+CREATE TABLE fail_def_part (LIKE part_1 INCLUDING CONSTRAINTS);
+ALTER TABLE list_parted ATTACH PARTITION fail_def_part DEFAULT;
+ERROR: partition "fail_def_part" conflicts with existing default partition "def_part"
+-- check validation when attaching list partitions
+CREATE TABLE list_parted2 (
+ a int,
+ b char
+) PARTITION BY LIST (a);
+-- check that violating rows are correctly reported
+CREATE TABLE part_2 (LIKE list_parted2);
+INSERT INTO part_2 VALUES (3, 'a');
+ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
+ERROR: partition constraint is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM part_2;
+ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
+-- check partition cannot be attached if default has some row for its values
+CREATE TABLE list_parted2_def PARTITION OF list_parted2 DEFAULT;
+INSERT INTO list_parted2_def VALUES (11, 'z');
+CREATE TABLE part_3 (LIKE list_parted2);
+ALTER TABLE list_parted2 ATTACH PARTITION part_3 FOR VALUES IN (11);
+ERROR: updated partition constraint for default partition would be violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM list_parted2_def WHERE a = 11;
+ALTER TABLE list_parted2 ATTACH PARTITION part_3 FOR VALUES IN (11);
+-- adding constraints that describe the desired partition constraint
+-- (or more restrictive) will help skip the validation scan
+CREATE TABLE part_3_4 (
+ LIKE list_parted2,
+ CONSTRAINT check_a CHECK (a IN (3))
+);
+-- however, if a list partition does not accept nulls, there should be
+-- an explicit NOT NULL constraint on the partition key column for the
+-- validation scan to be skipped;
+ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
+-- adding a NOT NULL constraint will cause the scan to be skipped
+ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
+ALTER TABLE part_3_4 ALTER a SET NOT NULL;
+ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
+INFO: partition constraint for table "part_3_4" is implied by existing constraints
+-- check if the default partition scan is skipped
+ALTER TABLE list_parted2_def ADD CONSTRAINT check_a CHECK (a IN (5, 6));
+CREATE TABLE part_55_66 PARTITION OF list_parted2 FOR VALUES IN (55, 66);
+INFO: updated partition constraint for default partition "list_parted2_def" is implied by existing constraints
+-- check validation when attaching range partitions
+CREATE TABLE range_parted (
+ a int,
+ b int
+) PARTITION BY RANGE (a, b);
+-- check that violating rows are correctly reported
+CREATE TABLE part1 (
+ a int NOT NULL CHECK (a = 1),
+ b int NOT NULL CHECK (b >= 1 AND b <= 10)
+);
+INSERT INTO part1 VALUES (1, 10);
+-- Remember the TO bound is exclusive
+ALTER TABLE range_parted ATTACH PARTITION part1 FOR VALUES FROM (1, 1) TO (1, 10);
+ERROR: partition constraint is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM part1;
+ALTER TABLE range_parted ATTACH PARTITION part1 FOR VALUES FROM (1, 1) TO (1, 10);
+-- adding constraints that describe the desired partition constraint
+-- (or more restrictive) will help skip the validation scan
+CREATE TABLE part2 (
+ a int NOT NULL CHECK (a = 1),
+ b int NOT NULL CHECK (b >= 10 AND b < 18)
+);
+ALTER TABLE range_parted ATTACH PARTITION part2 FOR VALUES FROM (1, 10) TO (1, 20);
+INFO: partition constraint for table "part2" is implied by existing constraints
+-- Create default partition
+CREATE TABLE partr_def1 PARTITION OF range_parted DEFAULT;
+-- Only one default partition is allowed, hence, the following should give an error
+CREATE TABLE partr_def2 (LIKE part1 INCLUDING CONSTRAINTS);
+ALTER TABLE range_parted ATTACH PARTITION partr_def2 DEFAULT;
+ERROR: partition "partr_def2" conflicts with existing default partition "partr_def1"
+-- Overlapping partitions cannot be attached, hence, the following should give an error
+INSERT INTO partr_def1 VALUES (2, 10);
+CREATE TABLE part3 (LIKE range_parted);
+ALTER TABLE range_parted ATTACH partition part3 FOR VALUES FROM (2, 10) TO (2, 20);
+ERROR: updated partition constraint for default partition would be violated by some row
+-- Attaching partitions should be successful when there are no overlapping rows
+ALTER TABLE range_parted ATTACH partition part3 FOR VALUES FROM (3, 10) TO (3, 20);
+-- check that leaf partitions are scanned when attaching a partitioned
+-- table
+CREATE TABLE part_5 (
+ LIKE list_parted2
+) PARTITION BY LIST (b);
+-- check that violating rows are correctly reported
+CREATE TABLE part_5_a PARTITION OF part_5 FOR VALUES IN ('a');
+INSERT INTO part_5_a (a, b) VALUES (6, 'a');
+ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
+ERROR: partition constraint is violated by some row
+-- delete the faulting row and also add a constraint to skip the scan
+DELETE FROM part_5_a WHERE a NOT IN (3);
+ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 5);
+ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
+INFO: partition constraint for table "part_5" is implied by existing constraints
+ALTER TABLE list_parted2 DETACH PARTITION part_5;
+ALTER TABLE part_5 DROP CONSTRAINT check_a;
+-- scan should again be skipped, even though NOT NULL is now a column property
+ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IN (5)), ALTER a SET NOT NULL;
+ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
+INFO: partition constraint for table "part_5" is implied by existing constraints
+-- Check the case where attnos of the partitioning columns in the table being
+-- attached differ from the parent's. It should not affect the constraint-
+-- checking logic that allows the scan to be skipped.
+CREATE TABLE part_6 (
+ c int,
+ LIKE list_parted2,
+ CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 6)
+);
+ALTER TABLE part_6 DROP c;
+ALTER TABLE list_parted2 ATTACH PARTITION part_6 FOR VALUES IN (6);
+INFO: partition constraint for table "part_6" is implied by existing constraints
+-- Similar to above, but the table being attached is a partitioned table
+-- whose partition has still different attnos for the root partitioning
+-- columns.
+CREATE TABLE part_7 (
+ LIKE list_parted2,
+ CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 7)
+) PARTITION BY LIST (b);
+CREATE TABLE part_7_a_null (
+ c int,
+ d int,
+ e int,
+ LIKE list_parted2, -- 'a' will have attnum = 4
+ CONSTRAINT check_b CHECK (b IS NULL OR b = 'a'),
+ CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 7)
+);
+ALTER TABLE part_7_a_null DROP c, DROP d, DROP e;
+ALTER TABLE part_7 ATTACH PARTITION part_7_a_null FOR VALUES IN ('a', null);
+INFO: partition constraint for table "part_7_a_null" is implied by existing constraints
+ALTER TABLE list_parted2 ATTACH PARTITION part_7 FOR VALUES IN (7);
+INFO: partition constraint for table "part_7" is implied by existing constraints
+INFO: updated partition constraint for default partition "list_parted2_def" is implied by existing constraints
+-- Same example, but check this time that the constraint correctly detects
+-- violating rows
+ALTER TABLE list_parted2 DETACH PARTITION part_7;
+ALTER TABLE part_7 DROP CONSTRAINT check_a; -- thus, the scan won't be skipped
+INSERT INTO part_7 (a, b) VALUES (8, null), (9, 'a');
+SELECT tableoid::regclass, a, b FROM part_7 order by a;
+ tableoid | a | b
+---------------+---+---
+ part_7_a_null | 8 |
+ part_7_a_null | 9 | a
+(2 rows)
+
+ALTER TABLE list_parted2 ATTACH PARTITION part_7 FOR VALUES IN (7);
+INFO: updated partition constraint for default partition "list_parted2_def" is implied by existing constraints
+ERROR: partition constraint is violated by some row
+-- check that leaf partitions of default partition are scanned when
+-- attaching a partitioned table.
+ALTER TABLE part_5 DROP CONSTRAINT check_a;
+CREATE TABLE part5_def PARTITION OF part_5 DEFAULT PARTITION BY LIST(a);
+CREATE TABLE part5_def_p1 PARTITION OF part5_def FOR VALUES IN (5);
+INSERT INTO part5_def_p1 VALUES (5, 'y');
+CREATE TABLE part5_p1 (LIKE part_5);
+ALTER TABLE part_5 ATTACH PARTITION part5_p1 FOR VALUES IN ('y');
+ERROR: updated partition constraint for default partition would be violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM part5_def_p1 WHERE b = 'y';
+ALTER TABLE part_5 ATTACH PARTITION part5_p1 FOR VALUES IN ('y');
+-- check that the table being attached is not already a partition
+ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
+ERROR: "part_2" is already a partition
+-- check that circular inheritance is not allowed
+ALTER TABLE part_5 ATTACH PARTITION list_parted2 FOR VALUES IN ('b');
+ERROR: circular inheritance not allowed
+DETAIL: "part_5" is already a child of "list_parted2".
+ALTER TABLE list_parted2 ATTACH PARTITION list_parted2 FOR VALUES IN (0);
+ERROR: circular inheritance not allowed
+DETAIL: "list_parted2" is already a child of "list_parted2".
+-- If a partitioned table being created or an existing table being attached
+-- as a partition does not have a constraint that would allow validation scan
+-- to be skipped, but an individual partition does, then the partition's
+-- validation scan is skipped.
+CREATE TABLE quuux (a int, b text) PARTITION BY LIST (a);
+CREATE TABLE quuux_default PARTITION OF quuux DEFAULT PARTITION BY LIST (b);
+CREATE TABLE quuux_default1 PARTITION OF quuux_default (
+ CONSTRAINT check_1 CHECK (a IS NOT NULL AND a = 1)
+) FOR VALUES IN ('b');
+CREATE TABLE quuux1 (a int, b text);
+ALTER TABLE quuux ATTACH PARTITION quuux1 FOR VALUES IN (1); -- validate!
+CREATE TABLE quuux2 (a int, b text);
+ALTER TABLE quuux ATTACH PARTITION quuux2 FOR VALUES IN (2); -- skip validation
+INFO: updated partition constraint for default partition "quuux_default1" is implied by existing constraints
+DROP TABLE quuux1, quuux2;
+-- should validate for quuux1, but not for quuux2
+CREATE TABLE quuux1 PARTITION OF quuux FOR VALUES IN (1);
+CREATE TABLE quuux2 PARTITION OF quuux FOR VALUES IN (2);
+INFO: updated partition constraint for default partition "quuux_default1" is implied by existing constraints
+DROP TABLE quuux;
+-- check validation when attaching hash partitions
+-- Use hand-rolled hash functions and operator class to get predictable result
+-- on different machines. part_test_int4_ops is defined in insert.sql.
+-- check that the new partition won't overlap with an existing partition
+CREATE TABLE hash_parted (
+ a int,
+ b int
+) PARTITION BY HASH (a part_test_int4_ops);
+CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH (MODULUS 4, REMAINDER 0);
+CREATE TABLE fail_part (LIKE hpart_1);
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 4);
+ERROR: partition "fail_part" would overlap partition "hpart_1"
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 0);
+ERROR: partition "fail_part" would overlap partition "hpart_1"
+DROP TABLE fail_part;
+-- check validation when attaching hash partitions
+-- check that violating rows are correctly reported
+CREATE TABLE hpart_2 (LIKE hash_parted);
+INSERT INTO hpart_2 VALUES (3, 0);
+ALTER TABLE hash_parted ATTACH PARTITION hpart_2 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
+ERROR: partition constraint is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM hpart_2;
+ALTER TABLE hash_parted ATTACH PARTITION hpart_2 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
+-- check that leaf partitions are scanned when attaching a partitioned
+-- table
+CREATE TABLE hpart_5 (
+ LIKE hash_parted
+) PARTITION BY LIST (b);
+-- check that violating rows are correctly reported
+CREATE TABLE hpart_5_a PARTITION OF hpart_5 FOR VALUES IN ('1', '2', '3');
+INSERT INTO hpart_5_a (a, b) VALUES (7, 1);
+ALTER TABLE hash_parted ATTACH PARTITION hpart_5 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
+ERROR: partition constraint is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM hpart_5_a;
+ALTER TABLE hash_parted ATTACH PARTITION hpart_5 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
+-- check that the table being attached has a valid modulus and remainder value
+CREATE TABLE fail_part(LIKE hash_parted);
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 0, REMAINDER 1);
+ERROR: modulus for hash partition must be a positive integer
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 8);
+ERROR: remainder for hash partition must be less than modulus
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 3, REMAINDER 2);
+ERROR: every hash partition modulus must be a factor of the next larger modulus
+DROP TABLE fail_part;
+--
+-- DETACH PARTITION
+--
+-- check that the table is partitioned at all
+CREATE TABLE regular_table (a int);
+ALTER TABLE regular_table DETACH PARTITION any_name;
+ERROR: table "regular_table" is not partitioned
+DROP TABLE regular_table;
+-- check that the partition being detached exists at all
+ALTER TABLE list_parted2 DETACH PARTITION part_4;
+ERROR: relation "part_4" does not exist
+ALTER TABLE hash_parted DETACH PARTITION hpart_4;
+ERROR: relation "hpart_4" does not exist
+-- check that the partition being detached is actually a partition of the parent
+CREATE TABLE not_a_part (a int);
+ALTER TABLE list_parted2 DETACH PARTITION not_a_part;
+ERROR: relation "not_a_part" is not a partition of relation "list_parted2"
+ALTER TABLE list_parted2 DETACH PARTITION part_1;
+ERROR: relation "part_1" is not a partition of relation "list_parted2"
+ALTER TABLE hash_parted DETACH PARTITION not_a_part;
+ERROR: relation "not_a_part" is not a partition of relation "hash_parted"
+DROP TABLE not_a_part;
+-- check that, after being detached, attinhcount/coninhcount is dropped to 0 and
+-- attislocal/conislocal is set to true
+ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
+SELECT attinhcount, attislocal FROM pg_attribute WHERE attrelid = 'part_3_4'::regclass AND attnum > 0;
+ attinhcount | attislocal
+-------------+------------
+ 0 | t
+ 0 | t
+(2 rows)
+
+SELECT coninhcount, conislocal FROM pg_constraint WHERE conrelid = 'part_3_4'::regclass AND conname = 'check_a';
+ coninhcount | conislocal
+-------------+------------
+ 0 | t
+(1 row)
+
+DROP TABLE part_3_4;
+-- check that a detached partition is not dropped on dropping a partitioned table
+CREATE TABLE range_parted2 (
+ a int
+) PARTITION BY RANGE(a);
+CREATE TABLE part_rp PARTITION OF range_parted2 FOR VALUES FROM (0) to (100);
+ALTER TABLE range_parted2 DETACH PARTITION part_rp;
+DROP TABLE range_parted2;
+SELECT * from part_rp;
+ a
+---
+(0 rows)
+
+DROP TABLE part_rp;
+-- Check ALTER TABLE commands for partitioned tables and partitions
+-- cannot add/drop column to/from *only* the parent
+ALTER TABLE ONLY list_parted2 ADD COLUMN c int;
+ERROR: column must be added to child tables too
+ALTER TABLE ONLY list_parted2 DROP COLUMN b;
+ERROR: cannot drop column from only the partitioned table when partitions exist
+HINT: Do not specify the ONLY keyword.
+-- cannot add a column to partition or drop an inherited one
+ALTER TABLE part_2 ADD COLUMN c text;
+ERROR: cannot add column to a partition
+ALTER TABLE part_2 DROP COLUMN b;
+ERROR: cannot drop inherited column "b"
+-- Nor rename, alter type
+ALTER TABLE part_2 RENAME COLUMN b to c;
+ERROR: cannot rename inherited column "b"
+ALTER TABLE part_2 ALTER COLUMN b TYPE text;
+ERROR: cannot alter inherited column "b"
+-- cannot add/drop NOT NULL or check constraints to *only* the parent, when
+-- partitions exist
+ALTER TABLE ONLY list_parted2 ALTER b SET NOT NULL;
+ERROR: constraint must be added to child tables too
+DETAIL: Column "b" of relation "part_2" is not already NOT NULL.
+HINT: Do not specify the ONLY keyword.
+ALTER TABLE ONLY list_parted2 ADD CONSTRAINT check_b CHECK (b <> 'zz');
+ERROR: constraint must be added to child tables too
+ALTER TABLE list_parted2 ALTER b SET NOT NULL;
+ALTER TABLE ONLY list_parted2 ALTER b DROP NOT NULL;
+ERROR: cannot remove constraint from only the partitioned table when partitions exist
+HINT: Do not specify the ONLY keyword.
+ALTER TABLE list_parted2 ADD CONSTRAINT check_b CHECK (b <> 'zz');
+ALTER TABLE ONLY list_parted2 DROP CONSTRAINT check_b;
+ERROR: cannot remove constraint from only the partitioned table when partitions exist
+HINT: Do not specify the ONLY keyword.
+-- It's alright though, if no partitions are yet created
+CREATE TABLE parted_no_parts (a int) PARTITION BY LIST (a);
+ALTER TABLE ONLY parted_no_parts ALTER a SET NOT NULL;
+ALTER TABLE ONLY parted_no_parts ADD CONSTRAINT check_a CHECK (a > 0);
+ALTER TABLE ONLY parted_no_parts ALTER a DROP NOT NULL;
+ALTER TABLE ONLY parted_no_parts DROP CONSTRAINT check_a;
+DROP TABLE parted_no_parts;
+-- cannot drop inherited NOT NULL or check constraints from partition
+ALTER TABLE list_parted2 ALTER b SET NOT NULL, ADD CONSTRAINT check_a2 CHECK (a > 0);
+ALTER TABLE part_2 ALTER b DROP NOT NULL;
+ERROR: column "b" is marked NOT NULL in parent table
+ALTER TABLE part_2 DROP CONSTRAINT check_a2;
+ERROR: cannot drop inherited constraint "check_a2" of relation "part_2"
+-- Doesn't make sense to add NO INHERIT constraints on partitioned tables
+ALTER TABLE list_parted2 add constraint check_b2 check (b <> 'zz') NO INHERIT;
+ERROR: cannot add NO INHERIT constraint to partitioned table "list_parted2"
+-- check that a partition cannot participate in regular inheritance
+CREATE TABLE inh_test () INHERITS (part_2);
+ERROR: cannot inherit from partition "part_2"
+CREATE TABLE inh_test (LIKE part_2);
+ALTER TABLE inh_test INHERIT part_2;
+ERROR: cannot inherit from a partition
+ALTER TABLE part_2 INHERIT inh_test;
+ERROR: cannot change inheritance of a partition
+-- cannot drop or alter type of partition key columns of lower level
+-- partitioned tables; for example, part_5, which is list_parted2's
+-- partition, is partitioned on b;
+ALTER TABLE list_parted2 DROP COLUMN b;
+ERROR: cannot drop column named in partition key
+ALTER TABLE list_parted2 ALTER COLUMN b TYPE text;
+ERROR: cannot alter type of column named in partition key
+-- dropping non-partition key columns should be allowed on the parent table.
+ALTER TABLE list_parted DROP COLUMN b;
+SELECT * FROM list_parted;
+ a
+---
+(0 rows)
+
+-- cleanup
+DROP TABLE list_parted, list_parted2, range_parted;
+DROP TABLE fail_def_part;
+DROP TABLE hash_parted;
+-- more tests for certain multi-level partitioning scenarios
+create table p (a int, b int) partition by range (a, b);
+create table p1 (b int, a int not null) partition by range (b);
+create table p11 (like p1);
+alter table p11 drop a;
+alter table p11 add a int;
+alter table p11 drop a;
+alter table p11 add a int not null;
+-- attnum for key attribute 'a' is different in p, p1, and p11
+select attrelid::regclass, attname, attnum
+from pg_attribute
+where attname = 'a'
+ and (attrelid = 'p'::regclass
+ or attrelid = 'p1'::regclass
+ or attrelid = 'p11'::regclass)
+order by attrelid::regclass::text;
+ attrelid | attname | attnum
+----------+---------+--------
+ p | a | 1
+ p1 | a | 2
+ p11 | a | 4
+(3 rows)
+
+alter table p1 attach partition p11 for values from (2) to (5);
+insert into p1 (a, b) values (2, 3);
+-- check that partition validation scan correctly detects violating rows
+alter table p attach partition p1 for values from (1, 2) to (1, 10);
+ERROR: partition constraint is violated by some row
+-- cleanup
+drop table p;
+drop table p1;
+-- validate constraint on partitioned tables should only scan leaf partitions
+create table parted_validate_test (a int) partition by list (a);
+create table parted_validate_test_1 partition of parted_validate_test for values in (0, 1);
+alter table parted_validate_test add constraint parted_validate_test_chka check (a > 0) not valid;
+alter table parted_validate_test validate constraint parted_validate_test_chka;
+drop table parted_validate_test;
+-- test alter column options
+CREATE TABLE attmp(i integer);
+INSERT INTO attmp VALUES (1);
+ALTER TABLE attmp ALTER COLUMN i SET (n_distinct = 1, n_distinct_inherited = 2);
+ALTER TABLE attmp ALTER COLUMN i RESET (n_distinct_inherited);
+ANALYZE attmp;
+DROP TABLE attmp;
+DROP USER regress_alter_table_user1;
+-- check that violating rows are correctly reported when attaching as the
+-- default partition
+create table defpart_attach_test (a int) partition by list (a);
+create table defpart_attach_test1 partition of defpart_attach_test for values in (1);
+create table defpart_attach_test_d (like defpart_attach_test);
+insert into defpart_attach_test_d values (1), (2);
+-- error because its constraint as the default partition would be violated
+-- by the row containing 1
+alter table defpart_attach_test attach partition defpart_attach_test_d default;
+ERROR: partition constraint is violated by some row
+delete from defpart_attach_test_d where a = 1;
+alter table defpart_attach_test_d add check (a > 1);
+-- should be attached successfully and without needing to be scanned
+alter table defpart_attach_test attach partition defpart_attach_test_d default;
+INFO: partition constraint for table "defpart_attach_test_d" is implied by existing constraints
+drop table defpart_attach_test;
+-- check combinations of temporary and permanent relations when attaching
+-- partitions.
+create table perm_part_parent (a int) partition by list (a);
+create temp table temp_part_parent (a int) partition by list (a);
+create table perm_part_child (a int);
+create temp table temp_part_child (a int);
+alter table temp_part_parent attach partition perm_part_child default; -- error
+ERROR: cannot attach a permanent relation as partition of temporary relation "temp_part_parent"
+alter table perm_part_parent attach partition temp_part_child default; -- error
+ERROR: cannot attach a temporary relation as partition of permanent relation "perm_part_parent"
+alter table temp_part_parent attach partition temp_part_child default; -- ok
+drop table perm_part_parent cascade;
+drop table temp_part_parent cascade;
+-- check that attaching partitions to a table while it is being used is
+-- prevented
+create table tab_part_attach (a int) partition by list (a);
+create or replace function func_part_attach() returns trigger
+ language plpgsql as $$
+ begin
+ execute 'create table tab_part_attach_1 (a int)';
+ execute 'alter table tab_part_attach attach partition tab_part_attach_1 for values in (1)';
+ return null;
+ end $$;
+create trigger trig_part_attach before insert on tab_part_attach
+ for each statement execute procedure func_part_attach();
+insert into tab_part_attach values (1);
+ERROR: cannot ALTER TABLE "tab_part_attach" because it is being used by active queries in this session
+CONTEXT: SQL statement "alter table tab_part_attach attach partition tab_part_attach_1 for values in (1)"
+PL/pgSQL function func_part_attach() line 4 at EXECUTE
+drop table tab_part_attach;
+drop function func_part_attach();
+-- test case where the partitioning operator is a SQL function whose
+-- evaluation results in the table's relcache being rebuilt partway through
+-- the execution of an ATTACH PARTITION command
+create function at_test_sql_partop (int4, int4) returns int language sql
+as $$ select case when $1 = $2 then 0 when $1 > $2 then 1 else -1 end; $$;
+create operator class at_test_sql_partop for type int4 using btree as
+ operator 1 < (int4, int4), operator 2 <= (int4, int4),
+ operator 3 = (int4, int4), operator 4 >= (int4, int4),
+ operator 5 > (int4, int4), function 1 at_test_sql_partop(int4, int4);
+create table at_test_sql_partop (a int) partition by range (a at_test_sql_partop);
+create table at_test_sql_partop_1 (a int);
+alter table at_test_sql_partop attach partition at_test_sql_partop_1 for values from (0) to (10);
+drop table at_test_sql_partop;
+drop operator class at_test_sql_partop using btree;
+drop function at_test_sql_partop;
diff --git a/src/test/regress/expected/cluster_1.out b/src/test/regress/expected/cluster_1.out
new file mode 100644
index 0000000000..a707ea30cb
--- /dev/null
+++ b/src/test/regress/expected/cluster_1.out
@@ -0,0 +1,475 @@
+--
+-- CLUSTER
+--
+CREATE TABLE clstr_tst_s (rf_a SERIAL PRIMARY KEY,
+ b INT);
+CREATE TABLE clstr_tst (a SERIAL PRIMARY KEY,
+ b INT,
+ c TEXT,
+ d TEXT,
+ CONSTRAINT clstr_tst_con FOREIGN KEY (b) REFERENCES clstr_tst_s);
+CREATE INDEX clstr_tst_b ON clstr_tst (b);
+CREATE INDEX clstr_tst_c ON clstr_tst (c);
+CREATE INDEX clstr_tst_c_b ON clstr_tst (c,b);
+CREATE INDEX clstr_tst_b_c ON clstr_tst (b,c);
+INSERT INTO clstr_tst_s (b) VALUES (0);
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+CREATE TABLE clstr_tst_inh () INHERITS (clstr_tst);
+INSERT INTO clstr_tst (b, c) VALUES (11, 'once');
+INSERT INTO clstr_tst (b, c) VALUES (10, 'diez');
+INSERT INTO clstr_tst (b, c) VALUES (31, 'treinta y uno');
+INSERT INTO clstr_tst (b, c) VALUES (22, 'veintidos');
+INSERT INTO clstr_tst (b, c) VALUES (3, 'tres');
+INSERT INTO clstr_tst (b, c) VALUES (20, 'veinte');
+INSERT INTO clstr_tst (b, c) VALUES (23, 'veintitres');
+INSERT INTO clstr_tst (b, c) VALUES (21, 'veintiuno');
+INSERT INTO clstr_tst (b, c) VALUES (4, 'cuatro');
+INSERT INTO clstr_tst (b, c) VALUES (14, 'catorce');
+INSERT INTO clstr_tst (b, c) VALUES (2, 'dos');
+INSERT INTO clstr_tst (b, c) VALUES (18, 'dieciocho');
+INSERT INTO clstr_tst (b, c) VALUES (27, 'veintisiete');
+INSERT INTO clstr_tst (b, c) VALUES (25, 'veinticinco');
+INSERT INTO clstr_tst (b, c) VALUES (13, 'trece');
+INSERT INTO clstr_tst (b, c) VALUES (28, 'veintiocho');
+INSERT INTO clstr_tst (b, c) VALUES (32, 'treinta y dos');
+INSERT INTO clstr_tst (b, c) VALUES (5, 'cinco');
+INSERT INTO clstr_tst (b, c) VALUES (29, 'veintinueve');
+INSERT INTO clstr_tst (b, c) VALUES (1, 'uno');
+INSERT INTO clstr_tst (b, c) VALUES (24, 'veinticuatro');
+INSERT INTO clstr_tst (b, c) VALUES (30, 'treinta');
+INSERT INTO clstr_tst (b, c) VALUES (12, 'doce');
+INSERT INTO clstr_tst (b, c) VALUES (17, 'diecisiete');
+INSERT INTO clstr_tst (b, c) VALUES (9, 'nueve');
+INSERT INTO clstr_tst (b, c) VALUES (19, 'diecinueve');
+INSERT INTO clstr_tst (b, c) VALUES (26, 'veintiseis');
+INSERT INTO clstr_tst (b, c) VALUES (15, 'quince');
+INSERT INTO clstr_tst (b, c) VALUES (7, 'siete');
+INSERT INTO clstr_tst (b, c) VALUES (16, 'dieciseis');
+INSERT INTO clstr_tst (b, c) VALUES (8, 'ocho');
+-- This entry is needed to test that TOASTED values are copied correctly.
+INSERT INTO clstr_tst (b, c, d) VALUES (6, 'seis', repeat('xyzzy', 100000));
+CLUSTER clstr_tst_c ON clstr_tst;
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+(32 rows)
+
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY a;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 1 | 11 | once | |
+ 2 | 10 | diez | |
+ 3 | 31 | treinta y uno | |
+ 4 | 22 | veintidos | |
+ 5 | 3 | tres | |
+ 6 | 20 | veinte | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 9 | 4 | cuatro | |
+ 10 | 14 | catorce | |
+ 11 | 2 | dos | |
+ 12 | 18 | dieciocho | |
+ 13 | 27 | veintisiete | |
+ 14 | 25 | veinticinco | |
+ 15 | 13 | trece | |
+ 16 | 28 | veintiocho | |
+ 17 | 32 | treinta y dos | |
+ 18 | 5 | cinco | |
+ 19 | 29 | veintinueve | |
+ 20 | 1 | uno | |
+ 21 | 24 | veinticuatro | |
+ 22 | 30 | treinta | |
+ 23 | 12 | doce | |
+ 24 | 17 | diecisiete | |
+ 25 | 9 | nueve | |
+ 26 | 19 | diecinueve | |
+ 27 | 26 | veintiseis | |
+ 28 | 15 | quince | |
+ 29 | 7 | siete | |
+ 30 | 16 | dieciseis | |
+ 31 | 8 | ocho | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+(32 rows)
+
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY b;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 20 | 1 | uno | |
+ 11 | 2 | dos | |
+ 5 | 3 | tres | |
+ 9 | 4 | cuatro | |
+ 18 | 5 | cinco | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 31 | 8 | ocho | |
+ 25 | 9 | nueve | |
+ 2 | 10 | diez | |
+ 1 | 11 | once | |
+ 23 | 12 | doce | |
+ 15 | 13 | trece | |
+ 10 | 14 | catorce | |
+ 28 | 15 | quince | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 12 | 18 | dieciocho | |
+ 26 | 19 | diecinueve | |
+ 6 | 20 | veinte | |
+ 8 | 21 | veintiuno | |
+ 4 | 22 | veintidos | |
+ 7 | 23 | veintitres | |
+ 21 | 24 | veinticuatro | |
+ 14 | 25 | veinticinco | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 16 | 28 | veintiocho | |
+ 19 | 29 | veintinueve | |
+ 22 | 30 | treinta | |
+ 3 | 31 | treinta y uno | |
+ 17 | 32 | treinta y dos | |
+(32 rows)
+
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY c;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+(32 rows)
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+----------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+(33 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------
+ clstr_tst_con
+ clstr_tst_pkey
+(2 rows)
+
+SELECT relname, relkind,
+ EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
+FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
+ relname | relkind | hastoast
+----------------------+---------+----------
+ clstr_tst | r | f
+ clstr_tst_a_seq | S | f
+ clstr_tst_b | i | f
+ clstr_tst_b_c | i | f
+ clstr_tst_c | i | f
+ clstr_tst_c_b | i | f
+ clstr_tst_inh | r | f
+ clstr_tst_pkey | i | f
+ clstr_tst_s | r | f
+ clstr_tst_s_pkey | i | f
+ clstr_tst_s_rf_a_seq | S | f
+(11 rows)
+
+-- Verify that indisclustered is correctly set
+SELECT pg_class.relname FROM pg_index, pg_class, pg_class AS pg_class_2
+WHERE pg_class.oid=indexrelid
+ AND indrelid=pg_class_2.oid
+ AND pg_class_2.relname = 'clstr_tst'
+ AND indisclustered;
+ relname
+-------------
+ clstr_tst_c
+(1 row)
+
+-- Try changing indisclustered
+ALTER TABLE clstr_tst CLUSTER ON clstr_tst_b_c;
+SELECT pg_class.relname FROM pg_index, pg_class, pg_class AS pg_class_2
+WHERE pg_class.oid=indexrelid
+ AND indrelid=pg_class_2.oid
+ AND pg_class_2.relname = 'clstr_tst'
+ AND indisclustered;
+ relname
+---------------
+ clstr_tst_b_c
+(1 row)
+
+-- Try turning off all clustering
+ALTER TABLE clstr_tst SET WITHOUT CLUSTER;
+SELECT pg_class.relname FROM pg_index, pg_class, pg_class AS pg_class_2
+WHERE pg_class.oid=indexrelid
+ AND indrelid=pg_class_2.oid
+ AND pg_class_2.relname = 'clstr_tst'
+ AND indisclustered;
+ relname
+---------
+(0 rows)
+
+-- Verify that clustering all tables does in fact cluster the right ones
+CREATE USER regress_clstr_user;
+CREATE TABLE clstr_1 (a INT PRIMARY KEY);
+CREATE TABLE clstr_2 (a INT PRIMARY KEY);
+CREATE TABLE clstr_3 (a INT PRIMARY KEY);
+ALTER TABLE clstr_1 OWNER TO regress_clstr_user;
+ALTER TABLE clstr_3 OWNER TO regress_clstr_user;
+GRANT SELECT ON clstr_2 TO regress_clstr_user;
+INSERT INTO clstr_1 VALUES (2);
+INSERT INTO clstr_1 VALUES (1);
+INSERT INTO clstr_2 VALUES (2);
+INSERT INTO clstr_2 VALUES (1);
+INSERT INTO clstr_3 VALUES (2);
+INSERT INTO clstr_3 VALUES (1);
+-- "CLUSTER <tablename>" on a table that hasn't been clustered
+CLUSTER clstr_2;
+ERROR: there is no previously clustered index for table "clstr_2"
+CLUSTER clstr_1_pkey ON clstr_1;
+CLUSTER clstr_2 USING clstr_2_pkey;
+SELECT * FROM clstr_1 UNION ALL
+ SELECT * FROM clstr_2 UNION ALL
+ SELECT * FROM clstr_3;
+ a
+---
+ 1
+ 2
+ 1
+ 2
+ 2
+ 1
+(6 rows)
+
+-- revert to the original state
+DELETE FROM clstr_1;
+DELETE FROM clstr_2;
+DELETE FROM clstr_3;
+INSERT INTO clstr_1 VALUES (2);
+INSERT INTO clstr_1 VALUES (1);
+INSERT INTO clstr_2 VALUES (2);
+INSERT INTO clstr_2 VALUES (1);
+INSERT INTO clstr_3 VALUES (2);
+INSERT INTO clstr_3 VALUES (1);
+-- this user can only cluster clstr_1 and clstr_3, but the latter
+-- has not been clustered
+SET SESSION AUTHORIZATION regress_clstr_user;
+CLUSTER;
+SELECT * FROM clstr_1 UNION ALL
+ SELECT * FROM clstr_2 UNION ALL
+ SELECT * FROM clstr_3;
+ a
+---
+ 1
+ 2
+ 2
+ 1
+ 2
+ 1
+(6 rows)
+
+-- cluster a single table using the indisclustered bit previously set
+DELETE FROM clstr_1;
+INSERT INTO clstr_1 VALUES (2);
+INSERT INTO clstr_1 VALUES (1);
+CLUSTER clstr_1;
+SELECT * FROM clstr_1;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+-- Test MVCC-safety of cluster. There isn't much we can do to verify the
+-- results with a single backend...
+CREATE TABLE clustertest (key int PRIMARY KEY);
+INSERT INTO clustertest VALUES (10);
+INSERT INTO clustertest VALUES (20);
+INSERT INTO clustertest VALUES (30);
+INSERT INTO clustertest VALUES (40);
+INSERT INTO clustertest VALUES (50);
+-- Use a transaction so that updates are not committed when CLUSTER sees 'em
+BEGIN;
+-- Test update where the old row version is found first in the scan
+UPDATE clustertest SET key = 100 WHERE key = 10;
+-- Test update where the new row version is found first in the scan
+UPDATE clustertest SET key = 35 WHERE key = 40;
+-- Test longer update chain
+UPDATE clustertest SET key = 60 WHERE key = 50;
+UPDATE clustertest SET key = 70 WHERE key = 60;
+UPDATE clustertest SET key = 80 WHERE key = 70;
+SELECT * FROM clustertest;
+ key
+-----
+ 20
+ 30
+ 100
+ 35
+ 80
+(5 rows)
+
+CLUSTER clustertest_pkey ON clustertest;
+SELECT * FROM clustertest;
+ key
+-----
+ 20
+ 30
+ 35
+ 80
+ 100
+(5 rows)
+
+COMMIT;
+SELECT * FROM clustertest;
+ key
+-----
+ 20
+ 30
+ 35
+ 80
+ 100
+(5 rows)
+
+-- check that temp tables can be clustered
+create temp table clstr_temp (col1 int primary key, col2 text);
+insert into clstr_temp values (2, 'two'), (1, 'one');
+cluster clstr_temp using clstr_temp_pkey;
+select * from clstr_temp;
+ col1 | col2
+------+------
+ 1 | one
+ 2 | two
+(2 rows)
+
+drop table clstr_temp;
+RESET SESSION AUTHORIZATION;
+-- Check that partitioned tables cannot be clustered
+CREATE TABLE clstrpart (a int) PARTITION BY RANGE (a);
+CREATE INDEX clstrpart_idx ON clstrpart (a);
+ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+ERROR: cannot mark index clustered in partitioned table
+CLUSTER clstrpart USING clstrpart_idx;
+ERROR: cannot cluster a partitioned table
+DROP TABLE clstrpart;
+-- Test CLUSTER with external tuplesorting
+create table clstr_4 as select * from tenk1;
+create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
+-- ensure we don't use the index in CLUSTER nor the checking SELECTs
+set enable_indexscan = off;
+-- Use external sort:
+set maintenance_work_mem = '1MB';
+cluster clstr_4 using cluster_sort;
+select * from
+(select hundred, lag(hundred) over () as lhundred,
+ thousand, lag(thousand) over () as lthousand,
+ tenthous, lag(tenthous) over () as ltenthous from clstr_4) ss
+where row(hundred, thousand, tenthous) <= row(lhundred, lthousand, ltenthous);
+ hundred | lhundred | thousand | lthousand | tenthous | ltenthous
+---------+----------+----------+-----------+----------+-----------
+(0 rows)
+
+reset enable_indexscan;
+reset maintenance_work_mem;
+-- clean up
+DROP TABLE clustertest;
+DROP TABLE clstr_1;
+DROP TABLE clstr_2;
+DROP TABLE clstr_3;
+DROP TABLE clstr_4;
+DROP USER regress_clstr_user;
diff --git a/src/test/regress/expected/create_am.out b/src/test/regress/expected/create_am.out
index 352959b751..6eae2bab97 100644
--- a/src/test/regress/expected/create_am.out
+++ b/src/test/regress/expected/create_am.out
@@ -126,11 +126,12 @@ ERROR: function int4in(internal) does not exist
CREATE ACCESS METHOD bogus TYPE TABLE HANDLER bthandler;
ERROR: function bthandler must return type table_am_handler
SELECT amname, amhandler, amtype FROM pg_am where amtype = 't' ORDER BY 1, 2;
- amname | amhandler | amtype
---------+----------------------+--------
- heap | heap_tableam_handler | t
- heap2 | heap_tableam_handler | t
-(2 rows)
+ amname | amhandler | amtype
+----------+--------------------------+--------
+ heap | heap_tableam_handler | t
+ heap2 | heap_tableam_handler | t
+ zedstore | zedstore_tableam_handler | t
+(3 rows)
 
-- First create tables employing the new AM using USING
-- plain CREATE TABLE
diff --git a/src/test/regress/expected/fsm_1.out b/src/test/regress/expected/fsm_1.out
new file mode 100644
index 0000000000..9b5f9be13a
--- /dev/null
+++ b/src/test/regress/expected/fsm_1.out
@@ -0,0 +1,73 @@
+--
+-- Free Space Map test
+--
+SELECT current_setting('block_size')::integer AS blocksize,
+current_setting('block_size')::integer / 8 AS strsize
+\gset
+CREATE TABLE fsm_check_size (num int, str text);
+-- Fill 3 blocks with one record each
+ALTER TABLE fsm_check_size SET (fillfactor=15);
+INSERT INTO fsm_check_size SELECT i, rpad('', :strsize, 'a')
+FROM generate_series(1,3) i;
+-- There should be no FSM
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'main') / :blocksize AS heap_nblocks,
+pg_relation_size('fsm_check_size', 'fsm') / :blocksize AS fsm_nblocks;
+ heap_nblocks | fsm_nblocks
+--------------+-------------
+ 5 | 0
+(1 row)
+
+-- The following operations are for testing the functionality of the local
+-- in-memory map. In particular, we want to be able to insert into some
+-- other block than the one at the end of the heap, without using a FSM.
+-- Fill most of the last block
+ALTER TABLE fsm_check_size SET (fillfactor=100);
+INSERT INTO fsm_check_size SELECT i, rpad('', :strsize, 'a')
+FROM generate_series(101,105) i;
+-- Make sure records can go into any block but the last one
+ALTER TABLE fsm_check_size SET (fillfactor=30);
+-- Insert large record and make sure it does not cause the relation to extend
+INSERT INTO fsm_check_size VALUES (111, rpad('', :strsize, 'a'));
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'main') / :blocksize AS heap_nblocks,
+pg_relation_size('fsm_check_size', 'fsm') / :blocksize AS fsm_nblocks;
+ heap_nblocks | fsm_nblocks
+--------------+-------------
+ 5 | 0
+(1 row)
+
+-- Extend table with enough blocks to exceed the FSM threshold
+DO $$
+DECLARE curtid tid;
+num int;
+BEGIN
+num = 11;
+ LOOP
+ INSERT INTO fsm_check_size VALUES (num, 'b') RETURNING ctid INTO curtid;
+ EXIT WHEN curtid >= tid '(4, 0)';
+ num = num + 1;
+ END LOOP;
+END;
+$$;
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'fsm') / :blocksize AS fsm_nblocks;
+ fsm_nblocks
+-------------
+ 0
+(1 row)
+
+-- Add long random string to extend TOAST table to 1 block
+INSERT INTO fsm_check_size
+VALUES(0, (SELECT string_agg(md5(chr(i)), '')
+ FROM generate_series(1, :blocksize / 100) i));
+VACUUM fsm_check_size;
+SELECT pg_relation_size(reltoastrelid, 'main') / :blocksize AS toast_nblocks,
+pg_relation_size(reltoastrelid, 'fsm') / :blocksize AS toast_fsm_nblocks
+FROM pg_class WHERE relname = 'fsm_check_size';
+ toast_nblocks | toast_fsm_nblocks
+---------------+-------------------
+ |
+(1 row)
+
+DROP TABLE fsm_check_size;
diff --git a/src/test/regress/expected/rangefuncs_1.out b/src/test/regress/expected/rangefuncs_1.out
new file mode 100644
index 0000000000..78b177ceb0
--- /dev/null
+++ b/src/test/regress/expected/rangefuncs_1.out
@@ -0,0 +1,2100 @@
+CREATE TABLE rngfunc2(rngfuncid int, f2 int);
+INSERT INTO rngfunc2 VALUES(1, 11);
+INSERT INTO rngfunc2 VALUES(2, 22);
+INSERT INTO rngfunc2 VALUES(1, 111);
+CREATE FUNCTION rngfunct(int) returns setof rngfunc2 as 'SELECT * FROM rngfunc2 WHERE rngfuncid = $1 ORDER BY f2;' LANGUAGE SQL;
+-- function with ORDINALITY
+select * from rngfunct(1) with ordinality as z(a,b,ord);
+ a | b | ord
+---+-----+-----
+ 1 | 11 | 1
+ 1 | 111 | 2
+(2 rows)
+
+select * from rngfunct(1) with ordinality as z(a,b,ord) where b > 100; -- ordinal 2, not 1
+ a | b | ord
+---+-----+-----
+ 1 | 111 | 2
+(1 row)
+
+-- ordinality vs. column names and types
+select a,b,ord from rngfunct(1) with ordinality as z(a,b,ord);
+ a | b | ord
+---+-----+-----
+ 1 | 11 | 1
+ 1 | 111 | 2
+(2 rows)
+
+select a,ord from unnest(array['a','b']) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ a | 1
+ b | 2
+(2 rows)
+
+select * from unnest(array['a','b']) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ a | 1
+ b | 2
+(2 rows)
+
+select a,ord from unnest(array[1.0::float8]) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ 1 | 1
+(1 row)
+
+select * from unnest(array[1.0::float8]) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ 1 | 1
+(1 row)
+
+select row_to_json(s.*) from generate_series(11,14) with ordinality s;
+ row_to_json
+-------------------------
+ {"s":11,"ordinality":1}
+ {"s":12,"ordinality":2}
+ {"s":13,"ordinality":3}
+ {"s":14,"ordinality":4}
+(4 rows)
+
+-- ordinality vs. views
+create temporary view vw_ord as select * from (values (1)) v(n) join rngfunct(1) with ordinality as z(a,b,ord) on (n=ord);
+select * from vw_ord;
+ n | a | b | ord
+---+---+----+-----
+ 1 | 1 | 11 | 1
+(1 row)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+-------------------------------------------------------------------------
+ SELECT v.n, +
+ z.a, +
+ z.b, +
+ z.ord +
+ FROM (( VALUES (1)) v(n) +
+ JOIN rngfunct(1) WITH ORDINALITY z(a, b, ord) ON ((v.n = z.ord)));
+(1 row)
+
+drop view vw_ord;
+-- multiple functions
+select * from rows from(rngfunct(1),rngfunct(2)) with ordinality as z(a,b,c,d,ord);
+ a | b | c | d | ord
+---+-----+---+----+-----
+ 1 | 11 | 2 | 22 | 1
+ 1 | 111 | | | 2
+(2 rows)
+
+create temporary view vw_ord as select * from (values (1)) v(n) join rows from(rngfunct(1),rngfunct(2)) with ordinality as z(a,b,c,d,ord) on (n=ord);
+select * from vw_ord;
+ n | a | b | c | d | ord
+---+---+----+---+----+-----
+ 1 | 1 | 11 | 2 | 22 | 1
+(1 row)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+-------------------------------------------------------------------------------------------------------
+ SELECT v.n, +
+ z.a, +
+ z.b, +
+ z.c, +
+ z.d, +
+ z.ord +
+ FROM (( VALUES (1)) v(n) +
+ JOIN ROWS FROM(rngfunct(1), rngfunct(2)) WITH ORDINALITY z(a, b, c, d, ord) ON ((v.n = z.ord)));
+(1 row)
+
+drop view vw_ord;
+-- expansions of unnest()
+select * from unnest(array[10,20],array['foo','bar'],array[1.0]);
+ unnest | unnest | unnest
+--------+--------+--------
+ 10 | foo | 1.0
+ 20 | bar |
+(2 rows)
+
+select * from unnest(array[10,20],array['foo','bar'],array[1.0]) with ordinality as z(a,b,c,ord);
+ a | b | c | ord
+----+-----+-----+-----
+ 10 | foo | 1.0 | 1
+ 20 | bar | | 2
+(2 rows)
+
+select * from rows from(unnest(array[10,20],array['foo','bar'],array[1.0])) with ordinality as z(a,b,c,ord);
+ a | b | c | ord
+----+-----+-----+-----
+ 10 | foo | 1.0 | 1
+ 20 | bar | | 2
+(2 rows)
+
+select * from rows from(unnest(array[10,20],array['foo','bar']), generate_series(101,102)) with ordinality as z(a,b,c,ord);
+ a | b | c | ord
+----+-----+-----+-----
+ 10 | foo | 101 | 1
+ 20 | bar | 102 | 2
+(2 rows)
+
+create temporary view vw_ord as select * from unnest(array[10,20],array['foo','bar'],array[1.0]) as z(a,b,c);
+select * from vw_ord;
+ a | b | c
+----+-----+-----
+ 10 | foo | 1.0
+ 20 | bar |
+(2 rows)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+----------------------------------------------------------------------------------------
+ SELECT z.a, +
+ z.b, +
+ z.c +
+ FROM UNNEST(ARRAY[10, 20], ARRAY['foo'::text, 'bar'::text], ARRAY[1.0]) z(a, b, c);
+(1 row)
+
+drop view vw_ord;
+create temporary view vw_ord as select * from rows from(unnest(array[10,20],array['foo','bar'],array[1.0])) as z(a,b,c);
+select * from vw_ord;
+ a | b | c
+----+-----+-----
+ 10 | foo | 1.0
+ 20 | bar |
+(2 rows)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+----------------------------------------------------------------------------------------
+ SELECT z.a, +
+ z.b, +
+ z.c +
+ FROM UNNEST(ARRAY[10, 20], ARRAY['foo'::text, 'bar'::text], ARRAY[1.0]) z(a, b, c);
+(1 row)
+
+drop view vw_ord;
+create temporary view vw_ord as select * from rows from(unnest(array[10,20],array['foo','bar']), generate_series(1,2)) as z(a,b,c);
+select * from vw_ord;
+ a | b | c
+----+-----+---
+ 10 | foo | 1
+ 20 | bar | 2
+(2 rows)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+----------------------------------------------------------------------------------------------------------------------
+ SELECT z.a, +
+ z.b, +
+ z.c +
+ FROM ROWS FROM(unnest(ARRAY[10, 20]), unnest(ARRAY['foo'::text, 'bar'::text]), generate_series(1, 2)) z(a, b, c);
+(1 row)
+
+drop view vw_ord;
+-- ordinality and multiple functions vs. rewind and reverse scan
+begin;
+declare rf_cur scroll cursor for select * from rows from(generate_series(1,5),generate_series(1,2)) with ordinality as g(i,j,o);
+fetch all from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+ 2 | 2 | 2
+ 3 | | 3
+ 4 | | 4
+ 5 | | 5
+(5 rows)
+
+fetch backward all from rf_cur;
+ i | j | o
+---+---+---
+ 5 | | 5
+ 4 | | 4
+ 3 | | 3
+ 2 | 2 | 2
+ 1 | 1 | 1
+(5 rows)
+
+fetch all from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+ 2 | 2 | 2
+ 3 | | 3
+ 4 | | 4
+ 5 | | 5
+(5 rows)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+(0 rows)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+(0 rows)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 5 | | 5
+(1 row)
+
+fetch absolute 1 from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+(1 row)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+ 2 | 2 | 2
+(1 row)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+ 3 | | 3
+(1 row)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+ 4 | | 4
+(1 row)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 3 | | 3
+(1 row)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 2 | 2 | 2
+(1 row)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+(1 row)
+
+commit;
+-- function with implicit LATERAL
+select * from rngfunc2, rngfunct(rngfunc2.rngfuncid) z where rngfunc2.f2 = z.f2;
+ rngfuncid | f2 | rngfuncid | f2
+-----------+-----+-----------+-----
+ 1 | 11 | 1 | 11
+ 2 | 22 | 2 | 22
+ 1 | 111 | 1 | 111
+(3 rows)
+
+-- function with implicit LATERAL and explicit ORDINALITY
+select * from rngfunc2, rngfunct(rngfunc2.rngfuncid) with ordinality as z(rngfuncid,f2,ord) where rngfunc2.f2 = z.f2;
+ rngfuncid | f2 | rngfuncid | f2 | ord
+-----------+-----+-----------+-----+-----
+ 1 | 11 | 1 | 11 | 1
+ 2 | 22 | 2 | 22 | 1
+ 1 | 111 | 1 | 111 | 2
+(3 rows)
+
+-- function in subselect
+select * from rngfunc2 where f2 in (select f2 from rngfunct(rngfunc2.rngfuncid) z where z.rngfuncid = rngfunc2.rngfuncid) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+ 2 | 22
+(3 rows)
+
+-- function in subselect
+select * from rngfunc2 where f2 in (select f2 from rngfunct(1) z where z.rngfuncid = rngfunc2.rngfuncid) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+(2 rows)
+
+-- function in subselect
+select * from rngfunc2 where f2 in (select f2 from rngfunct(rngfunc2.rngfuncid) z where z.rngfuncid = 1) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+(2 rows)
+
+-- nested functions
+select rngfunct.rngfuncid, rngfunct.f2 from rngfunct(sin(pi()/2)::int) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+(2 rows)
+
+CREATE TABLE rngfunc (rngfuncid int, rngfuncsubid int, rngfuncname text, primary key(rngfuncid,rngfuncsubid));
+INSERT INTO rngfunc VALUES(1,1,'Joe');
+INSERT INTO rngfunc VALUES(1,2,'Ed');
+INSERT INTO rngfunc VALUES(2,1,'Mary');
+-- sql, proretset = f, prorettype = b
+CREATE FUNCTION getrngfunc1(int) RETURNS int AS 'SELECT $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc1(1) AS t1;
+ t1
+----
+ 1
+(1 row)
+
+SELECT * FROM getrngfunc1(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc1(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc1
+-------------
+ 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc1(1) WITH ORDINALITY as t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = b
+CREATE FUNCTION getrngfunc2(int) RETURNS setof int AS 'SELECT rngfuncid FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc2(1) AS t1;
+ t1
+----
+ 1
+ 1
+(2 rows)
+
+SELECT * FROM getrngfunc2(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+---+---
+ 1 | 1
+ 1 | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc2(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc2
+-------------
+ 1
+ 1
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc2(1) WITH ORDINALITY AS t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+---+---
+ 1 | 1
+ 1 | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = b
+CREATE FUNCTION getrngfunc3(int) RETURNS setof text AS 'SELECT rngfuncname FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc3(1) AS t1;
+ t1
+-----
+ Joe
+ Ed
+(2 rows)
+
+SELECT * FROM getrngfunc3(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+-----+---
+ Joe | 1
+ Ed | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc3(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc3
+-------------
+ Joe
+ Ed
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc3(1) WITH ORDINALITY AS t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+-----+---
+ Joe | 1
+ Ed | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = f, prorettype = c
+CREATE FUNCTION getrngfunc4(int) RETURNS rngfunc AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc4(1) AS t1;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+SELECT * FROM getrngfunc4(1) WITH ORDINALITY AS t1(a,b,c,o);
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc4(1);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc4(1) WITH ORDINALITY AS t1(a,b,c,o);
+SELECT * FROM vw_getrngfunc;
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = c
+CREATE FUNCTION getrngfunc5(int) RETURNS setof rngfunc AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc5(1) AS t1;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+SELECT * FROM getrngfunc5(1) WITH ORDINALITY AS t1(a,b,c,o);
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc5(1);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc5(1) WITH ORDINALITY AS t1(a,b,c,o);
+SELECT * FROM vw_getrngfunc;
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = f, prorettype = record
+CREATE FUNCTION getrngfunc6(int) RETURNS RECORD AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc6(1) AS t1(rngfuncid int, rngfuncsubid int, rngfuncname text);
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+SELECT * FROM ROWS FROM( getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) ) WITH ORDINALITY;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc6(1) AS
+(rngfuncid int, rngfuncsubid int, rngfuncname text);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS
+ SELECT * FROM ROWS FROM( getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) )
+ WITH ORDINALITY;
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = record
+CREATE FUNCTION getrngfunc7(int) RETURNS setof record AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc7(1) AS t1(rngfuncid int, rngfuncsubid int, rngfuncname text);
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+SELECT * FROM ROWS FROM( getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) ) WITH ORDINALITY;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc7(1) AS
+(rngfuncid int, rngfuncsubid int, rngfuncname text);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS
+ SELECT * FROM ROWS FROM( getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) )
+ WITH ORDINALITY;
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- plpgsql, proretset = f, prorettype = b
+CREATE FUNCTION getrngfunc8(int) RETURNS int AS 'DECLARE rngfuncint int; BEGIN SELECT rngfuncid into rngfuncint FROM rngfunc WHERE rngfuncid = $1; RETURN rngfuncint; END;' LANGUAGE plpgsql;
+SELECT * FROM getrngfunc8(1) AS t1;
+ t1
+----
+ 1
+(1 row)
+
+SELECT * FROM getrngfunc8(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc8(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc8
+-------------
+ 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc8(1) WITH ORDINALITY AS t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- plpgsql, proretset = f, prorettype = c
+CREATE FUNCTION getrngfunc9(int) RETURNS rngfunc AS 'DECLARE rngfunctup rngfunc%ROWTYPE; BEGIN SELECT * into rngfunctup FROM rngfunc WHERE rngfuncid = $1; RETURN rngfunctup; END;' LANGUAGE plpgsql;
+SELECT * FROM getrngfunc9(1) AS t1;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+SELECT * FROM getrngfunc9(1) WITH ORDINALITY AS t1(a,b,c,o);
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc9(1);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc9(1) WITH ORDINALITY AS t1(a,b,c,o);
+SELECT * FROM vw_getrngfunc;
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- mix 'n match kinds, to exercise expandRTE and related logic
+select * from rows from(getrngfunc1(1),getrngfunc2(1),getrngfunc3(1),getrngfunc4(1),getrngfunc5(1),
+ getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc8(1),getrngfunc9(1))
+ with ordinality as t1(a,b,c,d,e,f,g,h,i,j,k,l,m,o,p,q,r,s,t,u);
+ a | b | c | d | e | f | g | h | i | j | k | l | m | o | p | q | r | s | t | u
+---+---+-----+---+---+-----+---+---+-----+---+---+-----+---+---+-----+---+---+---+-----+---
+ 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | 1 | Joe | 1
+ | 1 | Ed | | | | 1 | 2 | Ed | | | | 1 | 2 | Ed | | | | | 2
+(2 rows)
+
+select * from rows from(getrngfunc9(1),getrngfunc8(1),
+ getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc5(1),getrngfunc4(1),getrngfunc3(1),getrngfunc2(1),getrngfunc1(1))
+ with ordinality as t1(a,b,c,d,e,f,g,h,i,j,k,l,m,o,p,q,r,s,t,u);
+ a | b | c | d | e | f | g | h | i | j | k | l | m | o | p | q | r | s | t | u
+---+---+-----+---+---+---+-----+---+---+-----+---+---+-----+---+---+-----+-----+---+---+---
+ 1 | 1 | Joe | 1 | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | Joe | 1 | 1 | 1
+ | | | | 1 | 2 | Ed | | | | 1 | 2 | Ed | | | | Ed | 1 | | 2
+(2 rows)
+
+create temporary view vw_rngfunc as
+ select * from rows from(getrngfunc9(1),
+ getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc1(1))
+ with ordinality as t1(a,b,c,d,e,f,g,n);
+select * from vw_rngfunc;
+ a | b | c | d | e | f | g | n
+---+---+-----+---+---+-----+---+---
+ 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1
+ | | | 1 | 2 | Ed | | 2
+(2 rows)
+
+select pg_get_viewdef('vw_rngfunc');
+ pg_get_viewdef
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ SELECT t1.a, +
+ t1.b, +
+ t1.c, +
+ t1.d, +
+ t1.e, +
+ t1.f, +
+ t1.g, +
+ t1.n +
+ FROM ROWS FROM(getrngfunc9(1), getrngfunc7(1) AS (rngfuncid integer, rngfuncsubid integer, rngfuncname text), getrngfunc1(1)) WITH ORDINALITY t1(a, b, c, d, e, f, g, n);
+(1 row)
+
+drop view vw_rngfunc;
+DROP FUNCTION getrngfunc1(int);
+DROP FUNCTION getrngfunc2(int);
+DROP FUNCTION getrngfunc3(int);
+DROP FUNCTION getrngfunc4(int);
+DROP FUNCTION getrngfunc5(int);
+DROP FUNCTION getrngfunc6(int);
+DROP FUNCTION getrngfunc7(int);
+DROP FUNCTION getrngfunc8(int);
+DROP FUNCTION getrngfunc9(int);
+DROP FUNCTION rngfunct(int);
+DROP TABLE rngfunc2;
+DROP TABLE rngfunc;
+-- Rescan tests --
+CREATE TEMPORARY SEQUENCE rngfunc_rescan_seq1;
+CREATE TEMPORARY SEQUENCE rngfunc_rescan_seq2;
+CREATE TYPE rngfunc_rescan_t AS (i integer, s bigint);
+CREATE FUNCTION rngfunc_sql(int,int) RETURNS setof rngfunc_rescan_t AS 'SELECT i, nextval(''rngfunc_rescan_seq1'') FROM generate_series($1,$2) i;' LANGUAGE SQL;
+-- plpgsql functions use materialize mode
+CREATE FUNCTION rngfunc_mat(int,int) RETURNS setof rngfunc_rescan_t AS 'begin for i in $1..$2 loop return next (i, nextval(''rngfunc_rescan_seq2'')); end loop; end;' LANGUAGE plpgsql;
+--invokes ExecReScanFunctionScan - all these cases should materialize the function only once
+-- LEFT JOIN on a condition that the planner can't prove to be true is used to ensure the function
+-- is on the inner path of a nestloop join
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_sql(11,13) ON (r+i)<100;
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 11 | 1
+ 2 | 12 | 2
+ 2 | 13 | 3
+ 3 | 11 | 1
+ 3 | 12 | 2
+ 3 | 13 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_sql(11,13) WITH ORDINALITY AS f(i,s,o) ON (r+i)<100;
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 11 | 1 | 1
+ 2 | 12 | 2 | 2
+ 2 | 13 | 3 | 3
+ 3 | 11 | 1 | 1
+ 3 | 12 | 2 | 2
+ 3 | 13 | 3 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_mat(11,13) ON (r+i)<100;
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 11 | 1
+ 2 | 12 | 2
+ 2 | 13 | 3
+ 3 | 11 | 1
+ 3 | 12 | 2
+ 3 | 13 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_mat(11,13) WITH ORDINALITY AS f(i,s,o) ON (r+i)<100;
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 11 | 1 | 1
+ 2 | 12 | 2 | 2
+ 2 | 13 | 3 | 3
+ 3 | 11 | 1 | 1
+ 3 | 12 | 2 | 2
+ 3 | 13 | 3 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN ROWS FROM( rngfunc_sql(11,13), rngfunc_mat(11,13) ) WITH ORDINALITY AS f(i1,s1,i2,s2,o) ON (r+i1+i2)<100;
+ r | i1 | s1 | i2 | s2 | o
+---+----+----+----+----+---
+ 1 | 11 | 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 12 | 2 | 2
+ 1 | 13 | 3 | 13 | 3 | 3
+ 2 | 11 | 1 | 11 | 1 | 1
+ 2 | 12 | 2 | 12 | 2 | 2
+ 2 | 13 | 3 | 13 | 3 | 3
+ 3 | 11 | 1 | 11 | 1 | 1
+ 3 | 12 | 2 | 12 | 2 | 2
+ 3 | 13 | 3 | 13 | 3 | 3
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN generate_series(11,13) f(i) ON (r+i)<100;
+ r | i
+---+----
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 2 | 11
+ 2 | 12
+ 2 | 13
+ 3 | 11
+ 3 | 12
+ 3 | 13
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN generate_series(11,13) WITH ORDINALITY AS f(i,o) ON (r+i)<100;
+ r | i | o
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 11 | 1
+ 2 | 12 | 2
+ 2 | 13 | 3
+ 3 | 11 | 1
+ 3 | 12 | 2
+ 3 | 13 | 3
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN unnest(array[10,20,30]) f(i) ON (r+i)<100;
+ r | i
+---+----
+ 1 | 10
+ 1 | 20
+ 1 | 30
+ 2 | 10
+ 2 | 20
+ 2 | 30
+ 3 | 10
+ 3 | 20
+ 3 | 30
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN unnest(array[10,20,30]) WITH ORDINALITY AS f(i,o) ON (r+i)<100;
+ r | i | o
+---+----+---
+ 1 | 10 | 1
+ 1 | 20 | 2
+ 1 | 30 | 3
+ 2 | 10 | 1
+ 2 | 20 | 2
+ 2 | 30 | 3
+ 3 | 10 | 1
+ 3 | 20 | 2
+ 3 | 30 | 3
+(9 rows)
+
+--invokes ExecReScanFunctionScan with chgParam != NULL (using implied LATERAL)
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(10+r,13);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 12 | 4
+ 2 | 13 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(10+r,13) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 12 | 4 | 1
+ 2 | 13 | 5 | 2
+ 3 | 13 | 6 | 1
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(11,10+r);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 2 | 11 | 2
+ 2 | 12 | 3
+ 3 | 11 | 4
+ 3 | 12 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(11,10+r) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 2 | 11 | 2 | 1
+ 2 | 12 | 3 | 2
+ 3 | 11 | 4 | 1
+ 3 | 12 | 5 | 2
+ 3 | 13 | 6 | 3
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_sql(r1,r2);
+ r1 | r2 | i | s
+----+----+----+----
+ 11 | 12 | 11 | 1
+ 11 | 12 | 12 | 2
+ 13 | 15 | 13 | 3
+ 13 | 15 | 14 | 4
+ 13 | 15 | 15 | 5
+ 16 | 20 | 16 | 6
+ 16 | 20 | 17 | 7
+ 16 | 20 | 18 | 8
+ 16 | 20 | 19 | 9
+ 16 | 20 | 20 | 10
+(10 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_sql(r1,r2) WITH ORDINALITY AS f(i,s,o);
+ r1 | r2 | i | s | o
+----+----+----+----+---
+ 11 | 12 | 11 | 1 | 1
+ 11 | 12 | 12 | 2 | 2
+ 13 | 15 | 13 | 3 | 1
+ 13 | 15 | 14 | 4 | 2
+ 13 | 15 | 15 | 5 | 3
+ 16 | 20 | 16 | 6 | 1
+ 16 | 20 | 17 | 7 | 2
+ 16 | 20 | 18 | 8 | 3
+ 16 | 20 | 19 | 9 | 4
+ 16 | 20 | 20 | 10 | 5
+(10 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(10+r,13);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 12 | 4
+ 2 | 13 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(10+r,13) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 12 | 4 | 1
+ 2 | 13 | 5 | 2
+ 3 | 13 | 6 | 1
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(11,10+r);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 2 | 11 | 2
+ 2 | 12 | 3
+ 3 | 11 | 4
+ 3 | 12 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(11,10+r) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 2 | 11 | 2 | 1
+ 2 | 12 | 3 | 2
+ 3 | 11 | 4 | 1
+ 3 | 12 | 5 | 2
+ 3 | 13 | 6 | 3
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_mat(r1,r2);
+ r1 | r2 | i | s
+----+----+----+----
+ 11 | 12 | 11 | 1
+ 11 | 12 | 12 | 2
+ 13 | 15 | 13 | 3
+ 13 | 15 | 14 | 4
+ 13 | 15 | 15 | 5
+ 16 | 20 | 16 | 6
+ 16 | 20 | 17 | 7
+ 16 | 20 | 18 | 8
+ 16 | 20 | 19 | 9
+ 16 | 20 | 20 | 10
+(10 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_mat(r1,r2) WITH ORDINALITY AS f(i,s,o);
+ r1 | r2 | i | s | o
+----+----+----+----+---
+ 11 | 12 | 11 | 1 | 1
+ 11 | 12 | 12 | 2 | 2
+ 13 | 15 | 13 | 3 | 1
+ 13 | 15 | 14 | 4 | 2
+ 13 | 15 | 15 | 5 | 3
+ 16 | 20 | 16 | 6 | 1
+ 16 | 20 | 17 | 7 | 2
+ 16 | 20 | 18 | 8 | 3
+ 16 | 20 | 19 | 9 | 4
+ 16 | 20 | 20 | 10 | 5
+(10 rows)
+
+-- selective rescan of multiple functions:
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( rngfunc_sql(11,11), rngfunc_mat(10+r,13) );
+ r | i | s | i | s
+---+----+---+----+---
+ 1 | 11 | 1 | 11 | 1
+ 1 | | | 12 | 2
+ 1 | | | 13 | 3
+ 2 | 11 | 1 | 12 | 4
+ 2 | | | 13 | 5
+ 3 | 11 | 1 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( rngfunc_sql(10+r,13), rngfunc_mat(11,11) );
+ r | i | s | i | s
+---+----+---+----+---
+ 1 | 11 | 1 | 11 | 1
+ 1 | 12 | 2 | |
+ 1 | 13 | 3 | |
+ 2 | 12 | 4 | 11 | 1
+ 2 | 13 | 5 | |
+ 3 | 13 | 6 | 11 | 1
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( rngfunc_sql(10+r,13), rngfunc_mat(10+r,13) );
+ r | i | s | i | s
+---+----+---+----+---
+ 1 | 11 | 1 | 11 | 1
+ 1 | 12 | 2 | 12 | 2
+ 1 | 13 | 3 | 13 | 3
+ 2 | 12 | 4 | 12 | 4
+ 2 | 13 | 5 | 13 | 5
+ 3 | 13 | 6 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM generate_series(1,2) r1, generate_series(r1,3) r2, ROWS FROM( rngfunc_sql(10+r1,13), rngfunc_mat(10+r2,13) );
+ r1 | r2 | i | s | i | s
+----+----+----+----+----+---
+ 1 | 1 | 11 | 1 | 11 | 1
+ 1 | 1 | 12 | 2 | 12 | 2
+ 1 | 1 | 13 | 3 | 13 | 3
+ 1 | 2 | 11 | 4 | 12 | 4
+ 1 | 2 | 12 | 5 | 13 | 5
+ 1 | 2 | 13 | 6 | |
+ 1 | 3 | 11 | 7 | 13 | 6
+ 1 | 3 | 12 | 8 | |
+ 1 | 3 | 13 | 9 | |
+ 2 | 2 | 12 | 10 | 12 | 7
+ 2 | 2 | 13 | 11 | 13 | 8
+ 2 | 3 | 12 | 12 | 13 | 9
+ 2 | 3 | 13 | 13 | |
+(13 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), generate_series(10+r,20-r) f(i);
+ r | i
+---+----
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 12
+ 2 | 13
+ 2 | 14
+ 2 | 15
+ 2 | 16
+ 2 | 17
+ 2 | 18
+ 3 | 13
+ 3 | 14
+ 3 | 15
+ 3 | 16
+ 3 | 17
+(21 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), generate_series(10+r,20-r) WITH ORDINALITY AS f(i,o);
+ r | i | o
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 1 | 14 | 4
+ 1 | 15 | 5
+ 1 | 16 | 6
+ 1 | 17 | 7
+ 1 | 18 | 8
+ 1 | 19 | 9
+ 2 | 12 | 1
+ 2 | 13 | 2
+ 2 | 14 | 3
+ 2 | 15 | 4
+ 2 | 16 | 5
+ 2 | 17 | 6
+ 2 | 18 | 7
+ 3 | 13 | 1
+ 3 | 14 | 2
+ 3 | 15 | 3
+ 3 | 16 | 4
+ 3 | 17 | 5
+(21 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), unnest(array[r*10,r*20,r*30]) f(i);
+ r | i
+---+----
+ 1 | 10
+ 1 | 20
+ 1 | 30
+ 2 | 20
+ 2 | 40
+ 2 | 60
+ 3 | 30
+ 3 | 60
+ 3 | 90
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), unnest(array[r*10,r*20,r*30]) WITH ORDINALITY AS f(i,o);
+ r | i | o
+---+----+---
+ 1 | 10 | 1
+ 1 | 20 | 2
+ 1 | 30 | 3
+ 2 | 20 | 1
+ 2 | 40 | 2
+ 2 | 60 | 3
+ 3 | 30 | 1
+ 3 | 60 | 2
+ 3 | 90 | 3
+(9 rows)
+
+-- deep nesting
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(21,23) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+----
+ 1 | 1 | 10 | 21
+ 1 | 1 | 10 | 22
+ 1 | 1 | 10 | 23
+ 1 | 1 | 20 | 21
+ 1 | 1 | 20 | 22
+ 1 | 1 | 20 | 23
+ 1 | 1 | 30 | 21
+ 1 | 1 | 30 | 22
+ 1 | 1 | 30 | 23
+ 2 | 2 | 10 | 21
+ 2 | 2 | 10 | 22
+ 2 | 2 | 10 | 23
+ 2 | 2 | 20 | 21
+ 2 | 2 | 20 | 22
+ 2 | 2 | 20 | 23
+ 2 | 2 | 30 | 21
+ 2 | 2 | 30 | 22
+ 2 | 2 | 30 | 23
+ 3 | 3 | 10 | 21
+ 3 | 3 | 10 | 22
+ 3 | 3 | 10 | 23
+ 3 | 3 | 20 | 21
+ 3 | 3 | 20 | 22
+ 3 | 3 | 20 | 23
+ 3 | 3 | 30 | 21
+ 3 | 3 | 30 | 22
+ 3 | 3 | 30 | 23
+(27 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(20+r1,23) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+----
+ 1 | 1 | 10 | 21
+ 1 | 1 | 10 | 22
+ 1 | 1 | 10 | 23
+ 1 | 1 | 20 | 21
+ 1 | 1 | 20 | 22
+ 1 | 1 | 20 | 23
+ 1 | 1 | 30 | 21
+ 1 | 1 | 30 | 22
+ 1 | 1 | 30 | 23
+ 2 | 2 | 10 | 22
+ 2 | 2 | 10 | 23
+ 2 | 2 | 20 | 22
+ 2 | 2 | 20 | 23
+ 2 | 2 | 30 | 22
+ 2 | 2 | 30 | 23
+ 3 | 3 | 10 | 23
+ 3 | 3 | 20 | 23
+ 3 | 3 | 30 | 23
+(18 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(r2,r2+3) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+----
+ 1 | 1 | 10 | 10
+ 1 | 1 | 10 | 11
+ 1 | 1 | 10 | 12
+ 1 | 1 | 10 | 13
+ 1 | 1 | 20 | 20
+ 1 | 1 | 20 | 21
+ 1 | 1 | 20 | 22
+ 1 | 1 | 20 | 23
+ 1 | 1 | 30 | 30
+ 1 | 1 | 30 | 31
+ 1 | 1 | 30 | 32
+ 1 | 1 | 30 | 33
+ 2 | 2 | 10 | 10
+ 2 | 2 | 10 | 11
+ 2 | 2 | 10 | 12
+ 2 | 2 | 10 | 13
+ 2 | 2 | 20 | 20
+ 2 | 2 | 20 | 21
+ 2 | 2 | 20 | 22
+ 2 | 2 | 20 | 23
+ 2 | 2 | 30 | 30
+ 2 | 2 | 30 | 31
+ 2 | 2 | 30 | 32
+ 2 | 2 | 30 | 33
+ 3 | 3 | 10 | 10
+ 3 | 3 | 10 | 11
+ 3 | 3 | 10 | 12
+ 3 | 3 | 10 | 13
+ 3 | 3 | 20 | 20
+ 3 | 3 | 20 | 21
+ 3 | 3 | 20 | 22
+ 3 | 3 | 20 | 23
+ 3 | 3 | 30 | 30
+ 3 | 3 | 30 | 31
+ 3 | 3 | 30 | 32
+ 3 | 3 | 30 | 33
+(36 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(r1,2+r2/5) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+---
+ 1 | 1 | 10 | 1
+ 1 | 1 | 10 | 2
+ 1 | 1 | 10 | 3
+ 1 | 1 | 10 | 4
+ 1 | 1 | 20 | 1
+ 1 | 1 | 20 | 2
+ 1 | 1 | 20 | 3
+ 1 | 1 | 20 | 4
+ 1 | 1 | 20 | 5
+ 1 | 1 | 20 | 6
+ 1 | 1 | 30 | 1
+ 1 | 1 | 30 | 2
+ 1 | 1 | 30 | 3
+ 1 | 1 | 30 | 4
+ 1 | 1 | 30 | 5
+ 1 | 1 | 30 | 6
+ 1 | 1 | 30 | 7
+ 1 | 1 | 30 | 8
+ 2 | 2 | 10 | 2
+ 2 | 2 | 10 | 3
+ 2 | 2 | 10 | 4
+ 2 | 2 | 20 | 2
+ 2 | 2 | 20 | 3
+ 2 | 2 | 20 | 4
+ 2 | 2 | 20 | 5
+ 2 | 2 | 20 | 6
+ 2 | 2 | 30 | 2
+ 2 | 2 | 30 | 3
+ 2 | 2 | 30 | 4
+ 2 | 2 | 30 | 5
+ 2 | 2 | 30 | 6
+ 2 | 2 | 30 | 7
+ 2 | 2 | 30 | 8
+ 3 | 3 | 10 | 3
+ 3 | 3 | 10 | 4
+ 3 | 3 | 20 | 3
+ 3 | 3 | 20 | 4
+ 3 | 3 | 20 | 5
+ 3 | 3 | 20 | 6
+ 3 | 3 | 30 | 3
+ 3 | 3 | 30 | 4
+ 3 | 3 | 30 | 5
+ 3 | 3 | 30 | 6
+ 3 | 3 | 30 | 7
+ 3 | 3 | 30 | 8
+(45 rows)
+
+-- check handling of FULL JOIN with multiple lateral references (bug #15741)
+SELECT *
+FROM (VALUES (1),(2)) v1(r1)
+ LEFT JOIN LATERAL (
+ SELECT *
+ FROM generate_series(1, v1.r1) AS gs1
+ LEFT JOIN LATERAL (
+ SELECT *
+ FROM generate_series(1, gs1) AS gs2
+ LEFT JOIN generate_series(1, gs2) AS gs3 ON TRUE
+ ) AS ss1 ON TRUE
+ FULL JOIN generate_series(1, v1.r1) AS gs4 ON FALSE
+ ) AS ss0 ON TRUE;
+ r1 | gs1 | gs2 | gs3 | gs4
+----+-----+-----+-----+-----
+ 1 | | | | 1
+ 1 | 1 | 1 | 1 |
+ 2 | | | | 1
+ 2 | | | | 2
+ 2 | 1 | 1 | 1 |
+ 2 | 2 | 1 | 1 |
+ 2 | 2 | 2 | 1 |
+ 2 | 2 | 2 | 2 |
+(8 rows)
+
+DROP FUNCTION rngfunc_sql(int,int);
+DROP FUNCTION rngfunc_mat(int,int);
+DROP SEQUENCE rngfunc_rescan_seq1;
+DROP SEQUENCE rngfunc_rescan_seq2;
+--
+-- Test cases involving OUT parameters
+--
+CREATE FUNCTION rngfunc(in f1 int, out f2 int)
+AS 'select $1+1' LANGUAGE sql;
+SELECT rngfunc(42);
+ rngfunc
+---------
+ 43
+(1 row)
+
+SELECT * FROM rngfunc(42);
+ f2
+----
+ 43
+(1 row)
+
+SELECT * FROM rngfunc(42) AS p(x);
+ x
+----
+ 43
+(1 row)
+
+-- explicit spec of return type is OK
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int) RETURNS int
+AS 'select $1+1' LANGUAGE sql;
+-- error, wrong result type
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int) RETURNS float
+AS 'select $1+1' LANGUAGE sql;
+ERROR: function result type must be integer because of OUT parameters
+-- with multiple OUT params you must get a RECORD result
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int, out f3 text) RETURNS int
+AS 'select $1+1' LANGUAGE sql;
+ERROR: function result type must be record because of OUT parameters
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int, out f3 text)
+RETURNS record
+AS 'select $1+1' LANGUAGE sql;
+ERROR: cannot change return type of existing function
+HINT: Use DROP FUNCTION rngfunc(integer) first.
+CREATE OR REPLACE FUNCTION rngfuncr(in f1 int, out f2 int, out text)
+AS $$select $1-1, $1::text || 'z'$$ LANGUAGE sql;
+SELECT f1, rngfuncr(f1) FROM int4_tbl;
+ f1 | rngfuncr
+-------------+----------------------------
+ 0 | (-1,0z)
+ 123456 | (123455,123456z)
+ -123456 | (-123457,-123456z)
+ 2147483647 | (2147483646,2147483647z)
+ -2147483647 | (-2147483648,-2147483647z)
+(5 rows)
+
+SELECT * FROM rngfuncr(42);
+ f2 | column2
+----+---------
+ 41 | 42z
+(1 row)
+
+SELECT * FROM rngfuncr(42) AS p(a,b);
+ a | b
+----+-----
+ 41 | 42z
+(1 row)
+
+CREATE OR REPLACE FUNCTION rngfuncb(in f1 int, inout f2 int, out text)
+AS $$select $2-1, $1::text || 'z'$$ LANGUAGE sql;
+SELECT f1, rngfuncb(f1, f1/2) FROM int4_tbl;
+ f1 | rngfuncb
+-------------+----------------------------
+ 0 | (-1,0z)
+ 123456 | (61727,123456z)
+ -123456 | (-61729,-123456z)
+ 2147483647 | (1073741822,2147483647z)
+ -2147483647 | (-1073741824,-2147483647z)
+(5 rows)
+
+SELECT * FROM rngfuncb(42, 99);
+ f2 | column2
+----+---------
+ 98 | 42z
+(1 row)
+
+SELECT * FROM rngfuncb(42, 99) AS p(a,b);
+ a | b
+----+-----
+ 98 | 42z
+(1 row)
+
+-- Can reference function with or without OUT params for DROP, etc
+DROP FUNCTION rngfunc(int);
+DROP FUNCTION rngfuncr(in f2 int, out f1 int, out text);
+DROP FUNCTION rngfuncb(in f1 int, inout f2 int);
+--
+-- For my next trick, polymorphic OUT parameters
+--
+CREATE FUNCTION dup (f1 anyelement, f2 out anyelement, f3 out anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+SELECT dup(22);
+ dup
+----------------
+ (22,"{22,22}")
+(1 row)
+
+SELECT dup('xyz'); -- fails
+ERROR: could not determine polymorphic type because input has type unknown
+SELECT dup('xyz'::text);
+ dup
+-------------------
+ (xyz,"{xyz,xyz}")
+(1 row)
+
+SELECT * FROM dup('xyz'::text);
+ f2 | f3
+-----+-----------
+ xyz | {xyz,xyz}
+(1 row)
+
+-- fails, as we are attempting to rename first argument
+CREATE OR REPLACE FUNCTION dup (inout f2 anyelement, out f3 anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+ERROR: cannot change name of input parameter "f1"
+HINT: Use DROP FUNCTION dup(anyelement) first.
+DROP FUNCTION dup(anyelement);
+-- equivalent behavior, though different name exposed for input arg
+CREATE OR REPLACE FUNCTION dup (inout f2 anyelement, out f3 anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+SELECT dup(22);
+ dup
+----------------
+ (22,"{22,22}")
+(1 row)
+
+DROP FUNCTION dup(anyelement);
+-- fails, no way to deduce outputs
+CREATE FUNCTION bad (f1 int, out f2 anyelement, out f3 anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+ERROR: cannot determine result data type
+DETAIL: A function returning a polymorphic type must have at least one polymorphic argument.
+--
+-- table functions
+--
+CREATE OR REPLACE FUNCTION rngfunc()
+RETURNS TABLE(a int)
+AS $$ SELECT a FROM generate_series(1,5) a(a) $$ LANGUAGE sql;
+SELECT * FROM rngfunc();
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+DROP FUNCTION rngfunc();
+CREATE OR REPLACE FUNCTION rngfunc(int)
+RETURNS TABLE(a int, b int)
+AS $$ SELECT a, b
+ FROM generate_series(1,$1) a(a),
+ generate_series(1,$1) b(b) $$ LANGUAGE sql;
+SELECT * FROM rngfunc(3);
+ a | b
+---+---
+ 1 | 1
+ 1 | 2
+ 1 | 3
+ 2 | 1
+ 2 | 2
+ 2 | 3
+ 3 | 1
+ 3 | 2
+ 3 | 3
+(9 rows)
+
+DROP FUNCTION rngfunc(int);
+-- case that causes change of typmod knowledge during inlining
+CREATE OR REPLACE FUNCTION rngfunc()
+RETURNS TABLE(a varchar(5))
+AS $$ SELECT 'hello'::varchar(5) $$ LANGUAGE sql STABLE;
+SELECT * FROM rngfunc() GROUP BY 1;
+ a
+-------
+ hello
+(1 row)
+
+DROP FUNCTION rngfunc();
+--
+-- some tests on SQL functions with RETURNING
+--
+create temp table tt(f1 serial, data text);
+create function insert_tt(text) returns int as
+$$ insert into tt(data) values($1) returning f1 $$
+language sql;
+select insert_tt('foo');
+ insert_tt
+-----------
+ 1
+(1 row)
+
+select insert_tt('bar');
+ insert_tt
+-----------
+ 2
+(1 row)
+
+select * from tt;
+ f1 | data
+----+------
+ 1 | foo
+ 2 | bar
+(2 rows)
+
+-- insert will execute to completion even if function needs just 1 row
+create or replace function insert_tt(text) returns int as
+$$ insert into tt(data) values($1),($1||$1) returning f1 $$
+language sql;
+select insert_tt('fool');
+ insert_tt
+-----------
+ 3
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+(4 rows)
+
+-- setof does what's expected
+create or replace function insert_tt2(text,text) returns setof int as
+$$ insert into tt(data) values($1),($2) returning f1 $$
+language sql;
+select insert_tt2('foolish','barrish');
+ insert_tt2
+------------
+ 5
+ 6
+(2 rows)
+
+select * from insert_tt2('baz','quux');
+ insert_tt2
+------------
+ 7
+ 8
+(2 rows)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+(8 rows)
+
+-- limit doesn't prevent execution to completion
+select insert_tt2('foolish','barrish') limit 1;
+ insert_tt2
+------------
+ 9
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+ 9 | foolish
+ 10 | barrish
+(10 rows)
+
+-- triggers will fire, too
+create function noticetrigger() returns trigger as $$
+begin
+ raise notice 'noticetrigger % %', new.f1, new.data;
+ return null;
+end $$ language plpgsql;
+create trigger tnoticetrigger after insert on tt for each row
+execute procedure noticetrigger();
+select insert_tt2('foolme','barme') limit 1;
+NOTICE: noticetrigger 11 foolme
+NOTICE: noticetrigger 12 barme
+ insert_tt2
+------------
+ 11
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+ 9 | foolish
+ 10 | barrish
+ 11 | foolme
+ 12 | barme
+(12 rows)
+
+-- and rules work
+create temp table tt_log(f1 int, data text);
+create rule insert_tt_rule as on insert to tt do also
+ insert into tt_log values(new.*);
+select insert_tt2('foollog','barlog') limit 1;
+NOTICE: noticetrigger 13 foollog
+NOTICE: noticetrigger 14 barlog
+ insert_tt2
+------------
+ 13
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+ 9 | foolish
+ 10 | barrish
+ 11 | foolme
+ 12 | barme
+ 13 | foollog
+ 14 | barlog
+(14 rows)
+
+-- note that nextval() gets executed a second time in the rule expansion,
+-- which is expected.
+select * from tt_log;
+ f1 | data
+----+---------
+ 15 | foollog
+ 16 | barlog
+(2 rows)
+
+-- test case for a whole-row-variable bug
+create function rngfunc1(n integer, out a text, out b text)
+ returns setof record
+ language sql
+ as $$ select 'foo ' || i, 'bar ' || i from generate_series(1,$1) i $$;
+set work_mem='64kB';
+select t.a, t, t.a from rngfunc1(10000) t limit 1;
+ a | t | a
+-------+-------------------+-------
+ foo 1 | ("foo 1","bar 1") | foo 1
+(1 row)
+
+reset work_mem;
+select t.a, t, t.a from rngfunc1(10000) t limit 1;
+ a | t | a
+-------+-------------------+-------
+ foo 1 | ("foo 1","bar 1") | foo 1
+(1 row)
+
+drop function rngfunc1(n integer);
+-- test use of SQL functions returning record
+-- this is supported in some cases where the query doesn't specify
+-- the actual record type ...
+create function array_to_set(anyarray) returns setof record as $$
+ select i AS "index", $1[i] AS "value" from generate_subscripts($1, 1) i
+$$ language sql strict immutable;
+select array_to_set(array['one', 'two']);
+ array_to_set
+--------------
+ (1,one)
+ (2,two)
+(2 rows)
+
+select * from array_to_set(array['one', 'two']) as t(f1 int,f2 text);
+ f1 | f2
+----+-----
+ 1 | one
+ 2 | two
+(2 rows)
+
+select * from array_to_set(array['one', 'two']); -- fail
+ERROR: a column definition list is required for functions returning "record"
+LINE 1: select * from array_to_set(array['one', 'two']);
+ ^
+create temp table rngfunc(f1 int8, f2 int8);
+create function testrngfunc() returns record as $$
+ insert into rngfunc values (1,2) returning *;
+$$ language sql;
+select testrngfunc();
+ testrngfunc
+-------------
+ (1,2)
+(1 row)
+
+select * from testrngfunc() as t(f1 int8,f2 int8);
+ f1 | f2
+----+----
+ 1 | 2
+(1 row)
+
+select * from testrngfunc(); -- fail
+ERROR: a column definition list is required for functions returning "record"
+LINE 1: select * from testrngfunc();
+ ^
+drop function testrngfunc();
+create function testrngfunc() returns setof record as $$
+ insert into rngfunc values (1,2), (3,4) returning *;
+$$ language sql;
+select testrngfunc();
+ testrngfunc
+-------------
+ (1,2)
+ (3,4)
+(2 rows)
+
+select * from testrngfunc() as t(f1 int8,f2 int8);
+ f1 | f2
+----+----
+ 1 | 2
+ 3 | 4
+(2 rows)
+
+select * from testrngfunc(); -- fail
+ERROR: a column definition list is required for functions returning "record"
+LINE 1: select * from testrngfunc();
+ ^
+drop function testrngfunc();
+--
+-- Check some cases involving added/dropped columns in a rowtype result
+--
+create temp table users (userid text, seq int, email text, todrop bool, moredrop int, enabled bool);
+insert into users values ('id',1,'email',true,11,true);
+insert into users values ('id2',2,'email2',true,12,true);
+alter table users drop column todrop;
+create or replace function get_first_user() returns users as
+$$ SELECT * FROM users ORDER BY userid LIMIT 1; $$
+language sql stable;
+SELECT get_first_user();
+ get_first_user
+-------------------
+ (id,1,email,11,t)
+(1 row)
+
+SELECT * FROM get_first_user();
+ userid | seq | email | moredrop | enabled
+--------+-----+-------+----------+---------
+ id | 1 | email | 11 | t
+(1 row)
+
+create or replace function get_users() returns setof users as
+$$ SELECT * FROM users ORDER BY userid; $$
+language sql stable;
+SELECT get_users();
+ get_users
+---------------------
+ (id,1,email,11,t)
+ (id2,2,email2,12,t)
+(2 rows)
+
+SELECT * FROM get_users();
+ userid | seq | email | moredrop | enabled
+--------+-----+--------+----------+---------
+ id | 1 | email | 11 | t
+ id2 | 2 | email2 | 12 | t
+(2 rows)
+
+SELECT * FROM get_users() WITH ORDINALITY; -- make sure ordinality copes
+ userid | seq | email | moredrop | enabled | ordinality
+--------+-----+--------+----------+---------+------------
+ id | 1 | email | 11 | t | 1
+ id2 | 2 | email2 | 12 | t | 2
+(2 rows)
+
+-- multiple functions vs. dropped columns
+SELECT * FROM ROWS FROM(generate_series(10,11), get_users()) WITH ORDINALITY;
+ generate_series | userid | seq | email | moredrop | enabled | ordinality
+-----------------+--------+-----+--------+----------+---------+------------
+ 10 | id | 1 | email | 11 | t | 1
+ 11 | id2 | 2 | email2 | 12 | t | 2
+(2 rows)
+
+SELECT * FROM ROWS FROM(get_users(), generate_series(10,11)) WITH ORDINALITY;
+ userid | seq | email | moredrop | enabled | generate_series | ordinality
+--------+-----+--------+----------+---------+-----------------+------------
+ id | 1 | email | 11 | t | 10 | 1
+ id2 | 2 | email2 | 12 | t | 11 | 2
+(2 rows)
+
+-- check that we can cope with post-parsing changes in rowtypes
+create temp view usersview as
+SELECT * FROM ROWS FROM(get_users(), generate_series(10,11)) WITH ORDINALITY;
+select * from usersview;
+ userid | seq | email | moredrop | enabled | generate_series | ordinality
+--------+-----+--------+----------+---------+-----------------+------------
+ id | 1 | email | 11 | t | 10 | 1
+ id2 | 2 | email2 | 12 | t | 11 | 2
+(2 rows)
+
+alter table users add column junk text;
+select * from usersview;
+ userid | seq | email | moredrop | enabled | generate_series | ordinality
+--------+-----+--------+----------+---------+-----------------+------------
+ id | 1 | email | 11 | t | 10 | 1
+ id2 | 2 | email2 | 12 | t | 11 | 2
+(2 rows)
+
+begin;
+alter table users drop column moredrop;
+select * from usersview; -- expect clean failure
+ERROR: attribute 5 of type record has been dropped
+rollback;
+alter table users alter column seq type numeric;
+select * from usersview; -- expect clean failure
+ERROR: attribute 2 of type record has wrong type
+DETAIL: Table has type numeric, but query expects integer.
+drop view usersview;
+drop function get_first_user();
+drop function get_users();
+drop table users;
+-- this won't get inlined because of type coercion, but it shouldn't fail
+create or replace function rngfuncbar() returns setof text as
+$$ select 'foo'::varchar union all select 'bar'::varchar ; $$
+language sql stable;
+select rngfuncbar();
+ rngfuncbar
+------------
+ foo
+ bar
+(2 rows)
+
+select * from rngfuncbar();
+ rngfuncbar
+------------
+ foo
+ bar
+(2 rows)
+
+drop function rngfuncbar();
+-- check handling of a SQL function with multiple OUT params (bug #5777)
+create or replace function rngfuncbar(out integer, out numeric) as
+$$ select (1, 2.1) $$ language sql;
+select * from rngfuncbar();
+ column1 | column2
+---------+---------
+ 1 | 2.1
+(1 row)
+
+create or replace function rngfuncbar(out integer, out numeric) as
+$$ select (1, 2) $$ language sql;
+select * from rngfuncbar(); -- fail
+ERROR: function return row and query-specified return row do not match
+DETAIL: Returned type integer at ordinal position 2, but query expects numeric.
+create or replace function rngfuncbar(out integer, out numeric) as
+$$ select (1, 2.1, 3) $$ language sql;
+select * from rngfuncbar(); -- fail
+ERROR: function return row and query-specified return row do not match
+DETAIL: Returned row contains 3 attributes, but query expects 2.
+drop function rngfuncbar();
+-- check whole-row-Var handling in nested lateral functions (bug #11703)
+create function extractq2(t int8_tbl) returns int8 as $$
+ select t.q2
+$$ language sql immutable;
+explain (verbose, costs off)
+select x from int8_tbl, extractq2(int8_tbl) f(x);
+ QUERY PLAN
+------------------------------------
+ Nested Loop
+ Output: f.x
+ -> Seq Scan on public.int8_tbl
+ Output: int8_tbl.q2
+ -> Function Scan on f
+ Output: f.x
+ Function Call: int8_tbl.q2
+(7 rows)
+
+select x from int8_tbl, extractq2(int8_tbl) f(x);
+ x
+-------------------
+ 456
+ 4567890123456789
+ 123
+ 4567890123456789
+ -4567890123456789
+(5 rows)
+
+create function extractq2_2(t int8_tbl) returns table(ret1 int8) as $$
+ select extractq2(t) offset 0
+$$ language sql immutable;
+explain (verbose, costs off)
+select x from int8_tbl, extractq2_2(int8_tbl) f(x);
+ QUERY PLAN
+-----------------------------------
+ Nested Loop
+ Output: ((int8_tbl.*).q2)
+ -> Seq Scan on public.int8_tbl
+ Output: int8_tbl.*
+ -> Result
+ Output: (int8_tbl.*).q2
+(6 rows)
+
+select x from int8_tbl, extractq2_2(int8_tbl) f(x);
+ x
+-------------------
+ 456
+ 4567890123456789
+ 123
+ 4567890123456789
+ -4567890123456789
+(5 rows)
+
+-- without the "offset 0", this function gets optimized quite differently
+create function extractq2_2_opt(t int8_tbl) returns table(ret1 int8) as $$
+ select extractq2(t)
+$$ language sql immutable;
+explain (verbose, costs off)
+select x from int8_tbl, extractq2_2_opt(int8_tbl) f(x);
+ QUERY PLAN
+-----------------------------
+ Seq Scan on public.int8_tbl
+ Output: int8_tbl.q2
+(2 rows)
+
+select x from int8_tbl, extractq2_2_opt(int8_tbl) f(x);
+ x
+-------------------
+ 456
+ 4567890123456789
+ 123
+ 4567890123456789
+ -4567890123456789
+(5 rows)
+
+-- check handling of nulls in SRF results (bug #7808)
+create type rngfunc2 as (a integer, b text);
+select *, row_to_json(u) from unnest(array[(1,'foo')::rngfunc2, null::rngfunc2]) u;
+ a | b | row_to_json
+---+-----+---------------------
+ 1 | foo | {"a":1,"b":"foo"}
+ | | {"a":null,"b":null}
+(2 rows)
+
+select *, row_to_json(u) from unnest(array[null::rngfunc2, null::rngfunc2]) u;
+ a | b | row_to_json
+---+---+---------------------
+ | | {"a":null,"b":null}
+ | | {"a":null,"b":null}
+(2 rows)
+
+select *, row_to_json(u) from unnest(array[null::rngfunc2, (1,'foo')::rngfunc2, null::rngfunc2]) u;
+ a | b | row_to_json
+---+-----+---------------------
+ | | {"a":null,"b":null}
+ 1 | foo | {"a":1,"b":"foo"}
+ | | {"a":null,"b":null}
+(3 rows)
+
+select *, row_to_json(u) from unnest(array[]::rngfunc2[]) u;
+ a | b | row_to_json
+---+---+-------------
+(0 rows)
+
+drop type rngfunc2;
diff --git a/src/test/regress/expected/reloptions_1.out b/src/test/regress/expected/reloptions_1.out
new file mode 100644
index 0000000000..fd0b73a365
--- /dev/null
+++ b/src/test/regress/expected/reloptions_1.out
@@ -0,0 +1,219 @@
+-- Simple create
+CREATE TABLE reloptions_test(i INT) WITH (FiLLFaCToR=30,
+ autovacuum_enabled = false, autovacuum_analyze_scale_factor = 0.2);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+------------------------------------------------------------------------------
+ {fillfactor=30,autovacuum_enabled=false,autovacuum_analyze_scale_factor=0.2}
+(1 row)
+
+-- Fail min/max values check
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=2);
+ERROR: value 2 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=110);
+ERROR: value 110 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor = -10.0);
+ERROR: value -10.0 out of bounds for option "autovacuum_analyze_scale_factor"
+DETAIL: Valid values are between "0.000000" and "100.000000".
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor = 110.0);
+ERROR: value 110.0 out of bounds for option "autovacuum_analyze_scale_factor"
+DETAIL: Valid values are between "0.000000" and "100.000000".
+-- Fail when option and namespace do not exist
+CREATE TABLE reloptions_test2(i INT) WITH (not_existing_option=2);
+ERROR: unrecognized parameter "not_existing_option"
+CREATE TABLE reloptions_test2(i INT) WITH (not_existing_namespace.fillfactor=2);
+ERROR: unrecognized parameter namespace "not_existing_namespace"
+-- Fail while setting improper values
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=-30.1);
+ERROR: value -30.1 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor='string');
+ERROR: invalid value for integer option "fillfactor": string
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=true);
+ERROR: invalid value for integer option "fillfactor": true
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_enabled=12);
+ERROR: invalid value for boolean option "autovacuum_enabled": 12
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_enabled=30.5);
+ERROR: invalid value for boolean option "autovacuum_enabled": 30.5
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_enabled='string');
+ERROR: invalid value for boolean option "autovacuum_enabled": string
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor='string');
+ERROR: invalid value for floating point option "autovacuum_analyze_scale_factor": string
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor=true);
+ERROR: invalid value for floating point option "autovacuum_analyze_scale_factor": true
+-- Fail if option is specified twice
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=30, fillfactor=40);
+ERROR: parameter "fillfactor" specified more than once
+-- Specifying name only for a non-Boolean option should fail
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor);
+ERROR: invalid value for integer option "fillfactor": true
+-- Simple ALTER TABLE
+ALTER TABLE reloptions_test SET (fillfactor=31,
+ autovacuum_analyze_scale_factor = 0.3);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+------------------------------------------------------------------------------
+ {autovacuum_enabled=false,fillfactor=31,autovacuum_analyze_scale_factor=0.3}
+(1 row)
+
+-- Set boolean option to true without specifying value
+ALTER TABLE reloptions_test SET (autovacuum_enabled, fillfactor=32);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+-----------------------------------------------------------------------------
+ {autovacuum_analyze_scale_factor=0.3,autovacuum_enabled=true,fillfactor=32}
+(1 row)
+
+-- Check that RESET works well
+ALTER TABLE reloptions_test RESET (fillfactor);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+---------------------------------------------------------------
+ {autovacuum_analyze_scale_factor=0.3,autovacuum_enabled=true}
+(1 row)
+
+-- Resetting all values causes the column to become null
+ALTER TABLE reloptions_test RESET (autovacuum_enabled,
+ autovacuum_analyze_scale_factor);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass AND
+ reloptions IS NULL;
+ reloptions
+------------
+
+(1 row)
+
+-- RESET fails if a value is specified
+ALTER TABLE reloptions_test RESET (fillfactor=12);
+ERROR: RESET must not include values for parameters
+-- Test vacuum_truncate option
+DROP TABLE reloptions_test;
+CREATE TABLE reloptions_test(i INT NOT NULL, j text)
+ WITH (vacuum_truncate=false,
+ toast.vacuum_truncate=false,
+ autovacuum_enabled=false);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+--------------------------------------------------
+ {vacuum_truncate=false,autovacuum_enabled=false}
+(1 row)
+
+INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
+ERROR: null value in column "i" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+VACUUM reloptions_test;
+SELECT pg_relation_size('reloptions_test') > 0;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT reloptions FROM pg_class WHERE oid =
+ (SELECT reltoastrelid FROM pg_class
+ WHERE oid = 'reloptions_test'::regclass);
+ reloptions
+------------
+(0 rows)
+
+ALTER TABLE reloptions_test RESET (vacuum_truncate);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+----------------------------
+ {autovacuum_enabled=false}
+(1 row)
+
+INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
+ERROR: null value in column "i" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+VACUUM reloptions_test;
+SELECT pg_relation_size('reloptions_test') = 0;
+ ?column?
+----------
+ f
+(1 row)
+
+-- Test toast.* options
+DROP TABLE reloptions_test;
+CREATE TABLE reloptions_test (s VARCHAR)
+ WITH (toast.autovacuum_vacuum_cost_delay = 23);
+SELECT reltoastrelid as toast_oid
+ FROM pg_class WHERE oid = 'reloptions_test'::regclass \gset
+SELECT reloptions FROM pg_class WHERE oid = :toast_oid;
+ reloptions
+------------
+(0 rows)
+
+ALTER TABLE reloptions_test SET (toast.autovacuum_vacuum_cost_delay = 24);
+SELECT reloptions FROM pg_class WHERE oid = :toast_oid;
+ reloptions
+------------
+(0 rows)
+
+ALTER TABLE reloptions_test RESET (toast.autovacuum_vacuum_cost_delay);
+SELECT reloptions FROM pg_class WHERE oid = :toast_oid;
+ reloptions
+------------
+(0 rows)
+
+-- Fail on non-existent options in toast namespace
+CREATE TABLE reloptions_test2 (i int) WITH (toast.not_existing_option = 42);
+ERROR: unrecognized parameter "not_existing_option"
+-- Mix TOAST & heap
+DROP TABLE reloptions_test;
+CREATE TABLE reloptions_test (s VARCHAR) WITH
+ (toast.autovacuum_vacuum_cost_delay = 23,
+ autovacuum_vacuum_cost_delay = 24, fillfactor = 40);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+-------------------------------------------------
+ {autovacuum_vacuum_cost_delay=24,fillfactor=40}
+(1 row)
+
+SELECT reloptions FROM pg_class WHERE oid = (
+ SELECT reltoastrelid FROM pg_class WHERE oid = 'reloptions_test'::regclass);
+ reloptions
+------------
+(0 rows)
+
+--
+-- CREATE INDEX, ALTER INDEX for btrees
+--
+CREATE INDEX reloptions_test_idx ON reloptions_test (s) WITH (fillfactor=30);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test_idx'::regclass;
+ reloptions
+-----------------
+ {fillfactor=30}
+(1 row)
+
+-- Fail when option and namespace do not exist
+CREATE INDEX reloptions_test_idx ON reloptions_test (s)
+ WITH (not_existing_option=2);
+ERROR: unrecognized parameter "not_existing_option"
+CREATE INDEX reloptions_test_idx ON reloptions_test (s)
+ WITH (not_existing_ns.fillfactor=2);
+ERROR: unrecognized parameter namespace "not_existing_ns"
+-- Check allowed ranges
+CREATE INDEX reloptions_test_idx2 ON reloptions_test (s) WITH (fillfactor=1);
+ERROR: value 1 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE INDEX reloptions_test_idx2 ON reloptions_test (s) WITH (fillfactor=130);
+ERROR: value 130 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+-- Check ALTER
+ALTER INDEX reloptions_test_idx SET (fillfactor=40);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test_idx'::regclass;
+ reloptions
+-----------------
+ {fillfactor=40}
+(1 row)
+
+-- Check ALTER on empty reloption list
+CREATE INDEX reloptions_test_idx3 ON reloptions_test (s);
+ALTER INDEX reloptions_test_idx3 SET (fillfactor=40);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test_idx3'::regclass;
+ reloptions
+-----------------
+ {fillfactor=40}
+(1 row)
+
diff --git a/src/test/regress/expected/strings_1.out b/src/test/regress/expected/strings_1.out
new file mode 100644
index 0000000000..a5c324a8b7
--- /dev/null
+++ b/src/test/regress/expected/strings_1.out
@@ -0,0 +1,1823 @@
+--
+-- STRINGS
+-- Test various data entry syntaxes.
+--
+-- SQL string continuation syntax
+-- E021-03 character string literals
+SELECT 'first line'
+' - next line'
+ ' - third line'
+ AS "Three lines to one";
+ Three lines to one
+-------------------------------------
+ first line - next line - third line
+(1 row)
+
+-- illegal string continuation syntax
+SELECT 'first line'
+' - next line' /* this comment is not allowed here */
+' - third line'
+ AS "Illegal comment within continuation";
+ERROR: syntax error at or near "' - third line'"
+LINE 3: ' - third line'
+ ^
+-- Unicode escapes
+SET standard_conforming_strings TO on;
+SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
+ data
+------
+ data
+(1 row)
+
+SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061" UESCAPE '*';
+ dat\+000061
+-------------
+ dat\+000061
+(1 row)
+
+SELECT U&' \' UESCAPE '!' AS "tricky";
+ tricky
+--------
+ \
+(1 row)
+
+SELECT 'tricky' AS U&"\" UESCAPE '!';
+ \
+--------
+ tricky
+(1 row)
+
+SELECT U&'wrong: \061';
+ERROR: invalid Unicode escape value at or near "\061'"
+LINE 1: SELECT U&'wrong: \061';
+ ^
+SELECT U&'wrong: \+0061';
+ERROR: invalid Unicode escape value at or near "\+0061'"
+LINE 1: SELECT U&'wrong: \+0061';
+ ^
+SELECT U&'wrong: +0061' UESCAPE '+';
+ERROR: invalid Unicode escape character at or near "+'"
+LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
+ ^
+SET standard_conforming_strings TO off;
+SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061" UESCAPE '*';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061...
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&' \' UESCAPE '!' AS "tricky";
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&' \' UESCAPE '!' AS "tricky";
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT 'tricky' AS U&"\" UESCAPE '!';
+ \
+--------
+ tricky
+(1 row)
+
+SELECT U&'wrong: \061';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'wrong: \061';
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&'wrong: \+0061';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'wrong: \+0061';
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&'wrong: +0061' UESCAPE '+';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+RESET standard_conforming_strings;
+-- bytea
+SET bytea_output TO hex;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\x De Ad Be Ef '::bytea;
+ bytea
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\xDeAdBeE'::bytea;
+ERROR: invalid hexadecimal data: odd number of digits
+LINE 1: SELECT E'\\xDeAdBeE'::bytea;
+ ^
+SELECT E'\\xDeAdBeEx'::bytea;
+ERROR: invalid hexadecimal digit: "x"
+LINE 1: SELECT E'\\xDeAdBeEx'::bytea;
+ ^
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------
+ \xde00beef
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+--------------------
+ \x4465416442654566
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465006442654566
+(1 row)
+
+SELECT E'De\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\678dBeEf'::bytea;
+ERROR: invalid input syntax for type bytea
+LINE 1: SELECT E'De\\678dBeEf'::bytea;
+ ^
+SET bytea_output TO escape;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\x De Ad Be Ef '::bytea;
+ bytea
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------------
+ \336\000\276\357
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+----------
+ DeAdBeEf
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+-------------
+ De\000dBeEf
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+----------
+ DeSdBeEf
+(1 row)
+
+--
+-- test conversions between various string types
+-- E021-10 implicit casting among the character data types
+--
+SELECT CAST(f1 AS text) AS "text(char)" FROM CHAR_TBL;
+ text(char)
+------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(f1 AS text) AS "text(varchar)" FROM VARCHAR_TBL;
+ text(varchar)
+---------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(name 'namefield' AS text) AS "text(name)";
+ text(name)
+------------
+ namefield
+(1 row)
+
+-- since this is an explicit cast, it should truncate w/o error:
+SELECT CAST(f1 AS char(10)) AS "char(text)" FROM TEXT_TBL;
+ char(text)
+------------
+ doh!
+ hi de ho n
+(2 rows)
+
+-- note: implicit-cast case is tested in char.sql
+SELECT CAST(f1 AS char(20)) AS "char(text)" FROM TEXT_TBL;
+ char(text)
+----------------------
+ doh!
+ hi de ho neighbor
+(2 rows)
+
+SELECT CAST(f1 AS char(10)) AS "char(varchar)" FROM VARCHAR_TBL;
+ char(varchar)
+---------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(name 'namefield' AS char(10)) AS "char(name)";
+ char(name)
+------------
+ namefield
+(1 row)
+
+SELECT CAST(f1 AS varchar) AS "varchar(text)" FROM TEXT_TBL;
+ varchar(text)
+-------------------
+ doh!
+ hi de ho neighbor
+(2 rows)
+
+SELECT CAST(f1 AS varchar) AS "varchar(char)" FROM CHAR_TBL;
+ varchar(char)
+---------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(name 'namefield' AS varchar) AS "varchar(name)";
+ varchar(name)
+---------------
+ namefield
+(1 row)
+
+--
+-- test SQL string functions
+-- E### and T### are feature reference numbers from SQL99
+--
+-- E021-09 trim function
+SELECT TRIM(BOTH FROM ' bunch o blanks ') = 'bunch o blanks' AS "bunch o blanks";
+ bunch o blanks
+----------------
+ t
+(1 row)
+
+SELECT TRIM(LEADING FROM ' bunch o blanks ') = 'bunch o blanks ' AS "bunch o blanks ";
+ bunch o blanks
+------------------
+ t
+(1 row)
+
+SELECT TRIM(TRAILING FROM ' bunch o blanks ') = ' bunch o blanks' AS " bunch o blanks";
+ bunch o blanks
+------------------
+ t
+(1 row)
+
+SELECT TRIM(BOTH 'x' FROM 'xxxxxsome Xsxxxxx') = 'some Xs' AS "some Xs";
+ some Xs
+---------
+ t
+(1 row)
+
+-- E021-06 substring expression
+SELECT SUBSTRING('1234567890' FROM 3) = '34567890' AS "34567890";
+ 34567890
+----------
+ t
+(1 row)
+
+SELECT SUBSTRING('1234567890' FROM 4 FOR 3) = '456' AS "456";
+ 456
+-----
+ t
+(1 row)
+
+-- T581 regular expression substring (with SQL's bizarre regexp syntax)
+SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
+ bcd
+-----
+ bcd
+(1 row)
+
+-- No match should return NULL
+SELECT SUBSTRING('abcdefg' FROM '#"(b_d)#"%' FOR '#') IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+-- Null inputs should return NULL
+SELECT SUBSTRING('abcdefg' FROM '%' FOR NULL) IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+SELECT SUBSTRING(NULL FROM '%' FOR '#') IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM NULL FOR '#') IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+-- The first and last parts should act non-greedy
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*' FOR '#') AS "abcdefg";
+ abcdefg
+---------
+ abcdefg
+(1 row)
+
+-- Vertical bar in any part affects only that part
+SELECT SUBSTRING('abcdefg' FROM 'a|b#"%#"g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"x|g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a#"%|ab#"g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+-- Can't have more than two part separators
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*#"x' FOR '#') AS "error";
+ERROR: SQL regular expression may not contain more than two escape-double-quote separators
+CONTEXT: SQL function "substring" statement 1
+-- Postgres extension: with 0 or 1 separator, assume parts 1 and 3 are empty
+SELECT SUBSTRING('abcdefg' FROM 'a#"%g' FOR '#') AS "bcdefg";
+ bcdefg
+--------
+ bcdefg
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a%g' FOR '#') AS "abcdefg";
+ abcdefg
+---------
+ abcdefg
+(1 row)
+
+-- substring() with just two arguments is not allowed by SQL spec;
+-- we accept it, but we interpret the pattern as a POSIX regexp not SQL
+SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde";
+ cde
+-----
+ cde
+(1 row)
+
+-- With a parenthesized subexpression, return only what matches the subexpr
+SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde";
+ cde
+-----
+ cde
+(1 row)
+
+-- PostgreSQL extension to allow using back reference in replace string;
+SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3');
+ regexp_replace
+----------------
+ (111) 222-3333
+(1 row)
+
+SELECT regexp_replace('AAA BBB CCC ', E'\\s+', ' ', 'g');
+ regexp_replace
+----------------
+ AAA BBB CCC
+(1 row)
+
+SELECT regexp_replace('AAA', '^|$', 'Z', 'g');
+ regexp_replace
+----------------
+ ZAAAZ
+(1 row)
+
+SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'gi');
+ regexp_replace
+----------------
+ Z Z
+(1 row)
+
+-- invalid regexp option
+SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'z');
+ERROR: invalid regular expression option: "z"
+-- set so we can tell NULL from empty string
+\pset null '\\N'
+-- return all matches from regexp
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
+ regexp_matches
+----------------
+ {bar,beque}
+(1 row)
+
+-- test case insensitive
+SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
+ regexp_matches
+----------------
+ {bAR,bEqUE}
+(1 row)
+
+-- global option - more than one match
+SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+ regexp_matches
+----------------
+ {bar,beque}
+ {bazil,barf}
+(2 rows)
+
+-- empty capture group (matched empty string)
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
+ regexp_matches
+----------------
+ {bar,"",beque}
+(1 row)
+
+-- no match
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(.+)(beque)$re$);
+ regexp_matches
+----------------
+(0 rows)
+
+-- optional capture group did not match, null entry in array
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(.+)?(beque)$re$);
+ regexp_matches
+------------------
+ {bar,NULL,beque}
+(1 row)
+
+-- no capture groups
+SELECT regexp_matches('foobarbequebaz', $re$barbeque$re$);
+ regexp_matches
+----------------
+ {barbeque}
+(1 row)
+
+-- start/end-of-line matches are of zero length
+SELECT regexp_matches('foo' || chr(10) || 'bar' || chr(10) || 'bequq' || chr(10) || 'baz', '^', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {""}
+ {""}
+ {""}
+(4 rows)
+
+SELECT regexp_matches('foo' || chr(10) || 'bar' || chr(10) || 'bequq' || chr(10) || 'baz', '$', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {""}
+ {""}
+ {""}
+(4 rows)
+
+SELECT regexp_matches('1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4' || chr(10), '^.?', 'mg');
+ regexp_matches
+----------------
+ {1}
+ {2}
+ {3}
+ {4}
+ {""}
+(5 rows)
+
+SELECT regexp_matches(chr(10) || '1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4' || chr(10), '.?$', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {1}
+ {""}
+ {2}
+ {""}
+ {3}
+ {""}
+ {4}
+ {""}
+ {""}
+(10 rows)
+
+SELECT regexp_matches(chr(10) || '1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4', '.?$', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {1}
+ {""}
+ {2}
+ {""}
+ {3}
+ {""}
+ {4}
+ {""}
+(9 rows)
+
+-- give me errors
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$, 'gz');
+ERROR: invalid regular expression option: "z"
+SELECT regexp_matches('foobarbequebaz', $re$(barbeque$re$);
+ERROR: invalid regular expression: parentheses () not balanced
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque){2,1}$re$);
+ERROR: invalid regular expression: invalid repetition count(s)
+-- split string on regexp
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', $re$\s+$re$) AS foo;
+ foo | length
+-------+--------
+ the | 3
+ quick | 5
+ brown | 5
+ fox | 3
+ jumps | 5
+ over | 4
+ the | 3
+ lazy | 4
+ dog | 3
+(9 rows)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', $re$\s+$re$);
+ regexp_split_to_array
+-----------------------------------------------
+ {the,quick,brown,fox,jumps,over,the,lazy,dog}
+(1 row)
+
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', $re$\s*$re$) AS foo;
+ foo | length
+-----+--------
+ t | 1
+ h | 1
+ e | 1
+ q | 1
+ u | 1
+ i | 1
+ c | 1
+ k | 1
+ b | 1
+ r | 1
+ o | 1
+ w | 1
+ n | 1
+ f | 1
+ o | 1
+ x | 1
+ j | 1
+ u | 1
+ m | 1
+ p | 1
+ s | 1
+ o | 1
+ v | 1
+ e | 1
+ r | 1
+ t | 1
+ h | 1
+ e | 1
+ l | 1
+ a | 1
+ z | 1
+ y | 1
+ d | 1
+ o | 1
+ g | 1
+(35 rows)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', $re$\s*$re$);
+ regexp_split_to_array
+-------------------------------------------------------------------------
+ {t,h,e,q,u,i,c,k,b,r,o,w,n,f,o,x,j,u,m,p,s,o,v,e,r,t,h,e,l,a,z,y,d,o,g}
+(1 row)
+
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', '') AS foo;
+ foo | length
+-----+--------
+ t | 1
+ h | 1
+ e | 1
+ | 1
+ q | 1
+ u | 1
+ i | 1
+ c | 1
+ k | 1
+ | 1
+ b | 1
+ r | 1
+ o | 1
+ w | 1
+ n | 1
+ | 1
+ f | 1
+ o | 1
+ x | 1
+ | 1
+ j | 1
+ u | 1
+ m | 1
+ p | 1
+ s | 1
+ | 1
+ o | 1
+ v | 1
+ e | 1
+ r | 1
+ | 1
+ t | 1
+ h | 1
+ e | 1
+ | 1
+ l | 1
+ a | 1
+ z | 1
+ y | 1
+ | 1
+ d | 1
+ o | 1
+ g | 1
+(43 rows)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', '');
+ regexp_split_to_array
+---------------------------------------------------------------------------------------------------------
+ {t,h,e," ",q,u,i,c,k," ",b,r,o,w,n," ",f,o,x," ",j,u,m,p,s," ",o,v,e,r," ",t,h,e," ",l,a,z,y," ",d,o,g}
+(1 row)
+
+-- case insensitive
+SELECT foo, length(foo) FROM regexp_split_to_table('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'i') AS foo;
+ foo | length
+---------------------------+--------
+ th | 2
+ QUick bROWn FOx jUMPs ov | 25
+ r Th | 4
+ lazy dOG | 9
+(4 rows)
+
+SELECT regexp_split_to_array('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'i');
+ regexp_split_to_array
+-----------------------------------------------------
+ {th," QUick bROWn FOx jUMPs ov","r Th"," lazy dOG"}
+(1 row)
+
+-- no match of pattern
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', 'nomatch') AS foo;
+ foo | length
+---------------------------------------------+--------
+ the quick brown fox jumps over the lazy dog | 43
+(1 row)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', 'nomatch');
+ regexp_split_to_array
+-------------------------------------------------
+ {"the quick brown fox jumps over the lazy dog"}
+(1 row)
+
+-- some corner cases
+SELECT regexp_split_to_array('123456','1');
+ regexp_split_to_array
+-----------------------
+ {"",23456}
+(1 row)
+
+SELECT regexp_split_to_array('123456','6');
+ regexp_split_to_array
+-----------------------
+ {12345,""}
+(1 row)
+
+SELECT regexp_split_to_array('123456','.');
+ regexp_split_to_array
+------------------------
+ {"","","","","","",""}
+(1 row)
+
+SELECT regexp_split_to_array('123456','');
+ regexp_split_to_array
+-----------------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT regexp_split_to_array('123456','(?:)');
+ regexp_split_to_array
+-----------------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT regexp_split_to_array('1','');
+ regexp_split_to_array
+-----------------------
+ {1}
+(1 row)
+
+-- errors
+SELECT foo, length(foo) FROM regexp_split_to_table('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'zippy') AS foo;
+ERROR: invalid regular expression option: "z"
+SELECT regexp_split_to_array('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'iz');
+ERROR: invalid regular expression option: "z"
+-- global option meaningless for regexp_split
+SELECT foo, length(foo) FROM regexp_split_to_table('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'g') AS foo;
+ERROR: regexp_split_to_table() does not support the "global" option
+SELECT regexp_split_to_array('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'g');
+ERROR: regexp_split_to_array() does not support the "global" option
+-- change NULL-display back
+\pset null ''
+-- E021-11 position expression
+SELECT POSITION('4' IN '1234567890') = '4' AS "4";
+ 4
+---
+ t
+(1 row)
+
+SELECT POSITION('5' IN '1234567890') = '5' AS "5";
+ 5
+---
+ t
+(1 row)
+
+-- T312 character overlay function
+SELECT OVERLAY('abcdef' PLACING '45' FROM 4) AS "abc45f";
+ abc45f
+--------
+ abc45f
+(1 row)
+
+SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5) AS "yabadaba";
+ yabadaba
+----------
+ yabadaba
+(1 row)
+
+SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0) AS "yabadabadoo";
+ yabadabadoo
+-------------
+ yabadabadoo
+(1 row)
+
+SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4) AS "bubba";
+ bubba
+-------
+ bubba
+(1 row)
+
+--
+-- test LIKE
+-- Be sure to form every test as a LIKE/NOT LIKE pair.
+--
+-- simplest examples
+-- E061-04 like predicate
+SELECT 'hawkeye' LIKE 'h%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'h%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' LIKE 'H%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'H%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' LIKE 'indio%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'indio%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' LIKE 'h%eye' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'h%eye' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE '_ndio' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE '_ndio' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE 'in__o' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE 'in__o' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE 'in_o' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' NOT LIKE 'in_o' AS "true";
+ true
+------
+ t
+(1 row)
+
+-- unused escape character
+SELECT 'hawkeye' LIKE 'h%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'h%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE 'ind_o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE 'ind_o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+-- escape character
+-- E061-05 like predicate with escape clause
+SELECT 'h%' LIKE 'h#%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%' NOT LIKE 'h#%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'h%wkeye' LIKE 'h#%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'h%wkeye' NOT LIKE 'h#%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%wkeye' LIKE 'h#%%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%wkeye' NOT LIKE 'h#%%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'h%awkeye' LIKE 'h#%a%k%e' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%awkeye' NOT LIKE 'h#%a%k%e' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE '_ndio' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE '_ndio' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'i_dio' LIKE 'i$_d_o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'i_dio' NOT LIKE 'i$_d_o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'i_dio' LIKE 'i$_nd_o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'i_dio' NOT LIKE 'i$_nd_o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'i_dio' LIKE 'i$_d%o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'i_dio' NOT LIKE 'i$_d%o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+-- escape character same as pattern character
+SELECT 'maca' LIKE 'm%aca' ESCAPE '%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'maca' NOT LIKE 'm%aca' ESCAPE '%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'ma%a' LIKE 'm%a%%a' ESCAPE '%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'ma%a' NOT LIKE 'm%a%%a' ESCAPE '%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'bear' LIKE 'b_ear' ESCAPE '_' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'bear' NOT LIKE 'b_ear' ESCAPE '_' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'be_r' LIKE 'b_e__r' ESCAPE '_' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'be_r' NOT LIKE 'b_e__r' ESCAPE '_' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'be_r' LIKE '__e__r' ESCAPE '_' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'be_r' NOT LIKE '__e__r' ESCAPE '_' AS "true";
+ true
+------
+ t
+(1 row)
+
+--
+-- test ILIKE (case-insensitive LIKE)
+-- Be sure to form every test as an ILIKE/NOT ILIKE pair.
+--
+SELECT 'hawkeye' ILIKE 'h%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT ILIKE 'h%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' ILIKE 'H%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT ILIKE 'H%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' ILIKE 'H%Eye' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT ILIKE 'H%Eye' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'Hawkeye' ILIKE 'h%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'Hawkeye' NOT ILIKE 'h%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+--
+-- test %/_ combination cases, cf bugs #4821 and #5478
+--
+SELECT 'foo' LIKE '_%' as t, 'f' LIKE '_%' as t, '' LIKE '_%' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'foo' LIKE '%_' as t, 'f' LIKE '%_' as t, '' LIKE '%_' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'foo' LIKE '__%' as t, 'foo' LIKE '___%' as t, 'foo' LIKE '____%' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'foo' LIKE '%__' as t, 'foo' LIKE '%___' as t, 'foo' LIKE '%____' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'jack' LIKE '%____%' AS t;
+ t
+---
+ t
+(1 row)
+
+--
+-- basic tests of LIKE with indexes
+--
+CREATE TABLE texttest (a text PRIMARY KEY, b int);
+SELECT * FROM texttest WHERE a LIKE '%1%';
+ a | b
+---+---
+(0 rows)
+
+CREATE TABLE byteatest (a bytea PRIMARY KEY, b int);
+SELECT * FROM byteatest WHERE a LIKE '%1%';
+ a | b
+---+---
+(0 rows)
+
+DROP TABLE texttest, byteatest;
+--
+-- test implicit type conversion
+--
+-- E021-07 character concatenation
+SELECT 'unknown' || ' and unknown' AS "Concat unknown types";
+ Concat unknown types
+----------------------
+ unknown and unknown
+(1 row)
+
+SELECT text 'text' || ' and unknown' AS "Concat text to unknown type";
+ Concat text to unknown type
+-----------------------------
+ text and unknown
+(1 row)
+
+SELECT char(20) 'characters' || ' and text' AS "Concat char to unknown type";
+ Concat char to unknown type
+-----------------------------
+ characters and text
+(1 row)
+
+SELECT text 'text' || char(20) ' and characters' AS "Concat text to char";
+ Concat text to char
+---------------------
+ text and characters
+(1 row)
+
+SELECT text 'text' || varchar ' and varchar' AS "Concat text to varchar";
+ Concat text to varchar
+------------------------
+ text and varchar
+(1 row)
+
+--
+-- test substr with toasted text values
+--
+CREATE TABLE toasttest(f1 text);
+insert into toasttest values(repeat('1234567890',10000));
+insert into toasttest values(repeat('1234567890',10000));
+--
+-- Ensure that some values are uncompressed, to test the faster substring
+-- operation used in that case
+--
+alter table toasttest alter column f1 set storage external;
+insert into toasttest values(repeat('1234567890',10000));
+insert into toasttest values(repeat('1234567890',10000));
+-- If the starting position is zero or less, then return from the start of the string
+-- adjusting the length to be consistent with the "negative start" per SQL.
+SELECT substr(f1, -1, 5) from toasttest;
+ substr
+--------
+ 123
+ 123
+ 123
+ 123
+(4 rows)
+
+-- If the length is less than zero, an ERROR is thrown.
+SELECT substr(f1, 5, -1) from toasttest;
+ERROR: negative substring length not allowed
+-- If no third argument (length) is provided, the length to the end of the
+-- string is assumed.
+SELECT substr(f1, 99995) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+-- If start plus length is > string length, the result is truncated to
+-- string length
+SELECT substr(f1, 99995, 10) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+TRUNCATE TABLE toasttest;
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+-- expect >0 blocks
+SELECT pg_relation_size(reltoastrelid) = 0 AS is_empty
+ FROM pg_class where relname = 'toasttest';
+ is_empty
+----------
+
+(1 row)
+
+TRUNCATE TABLE toasttest;
+ALTER TABLE toasttest set (toast_tuple_target = 4080);
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+-- expect 0 blocks
+SELECT pg_relation_size(reltoastrelid) = 0 AS is_empty
+ FROM pg_class where relname = 'toasttest';
+ is_empty
+----------
+
+(1 row)
+
+DROP TABLE toasttest;
+--
+-- test substr with toasted bytea values
+--
+CREATE TABLE toasttest(f1 bytea);
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+--
+-- Ensure that some values are uncompressed, to test the faster substring
+-- operation used in that case
+--
+alter table toasttest alter column f1 set storage external;
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+-- If the starting position is zero or less, then return from the start of the string
+-- adjusting the length to be consistent with the "negative start" per SQL.
+SELECT substr(f1, -1, 5) from toasttest;
+ substr
+--------
+ 123
+ 123
+ 123
+ 123
+(4 rows)
+
+-- If the length is less than zero, an ERROR is thrown.
+SELECT substr(f1, 5, -1) from toasttest;
+ERROR: negative substring length not allowed
+-- If no third argument (length) is provided, the length to the end of the
+-- string is assumed.
+SELECT substr(f1, 99995) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+-- If start plus length is > string length, the result is truncated to
+-- string length
+SELECT substr(f1, 99995, 10) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+DROP TABLE toasttest;
+-- test internally compressing datums
+-- this tests compressing a datum to a very small size which exercises a
+-- corner case in packed-varlena handling: even though small, the compressed
+-- datum must be given a 4-byte header because there are no bits to indicate
+-- compression in a 1-byte header
+CREATE TABLE toasttest (c char(4096));
+INSERT INTO toasttest VALUES('x');
+SELECT length(c), c::text FROM toasttest;
+ length | c
+--------+---
+ 1 | x
+(1 row)
+
+SELECT c FROM toasttest;
+ c
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ x
+(1 row)
+
+DROP TABLE toasttest;
+--
+-- test length
+--
+SELECT length('abcdef') AS "length_6";
+ length_6
+----------
+ 6
+(1 row)
+
+--
+-- test strpos
+--
+SELECT strpos('abcdef', 'cd') AS "pos_3";
+ pos_3
+-------
+ 3
+(1 row)
+
+SELECT strpos('abcdef', 'xy') AS "pos_0";
+ pos_0
+-------
+ 0
+(1 row)
+
+--
+-- test replace
+--
+SELECT replace('abcdef', 'de', '45') AS "abc45f";
+ abc45f
+--------
+ abc45f
+(1 row)
+
+SELECT replace('yabadabadoo', 'ba', '123') AS "ya123da123doo";
+ ya123da123doo
+---------------
+ ya123da123doo
+(1 row)
+
+SELECT replace('yabadoo', 'bad', '') AS "yaoo";
+ yaoo
+------
+ yaoo
+(1 row)
+
+--
+-- test split_part
+--
+select split_part('joeuser@mydatabase','@',0) AS "an error";
+ERROR: field position must be greater than zero
+select split_part('joeuser@mydatabase','@',1) AS "joeuser";
+ joeuser
+---------
+ joeuser
+(1 row)
+
+select split_part('joeuser@mydatabase','@',2) AS "mydatabase";
+ mydatabase
+------------
+ mydatabase
+(1 row)
+
+select split_part('joeuser@mydatabase','@',3) AS "empty string";
+ empty string
+--------------
+
+(1 row)
+
+select split_part('@joeuser@mydatabase@','@',2) AS "joeuser";
+ joeuser
+---------
+ joeuser
+(1 row)
+
+--
+-- test to_hex
+--
+select to_hex(256*256*256 - 1) AS "ffffff";
+ ffffff
+--------
+ ffffff
+(1 row)
+
+select to_hex(256::bigint*256::bigint*256::bigint*256::bigint - 1) AS "ffffffff";
+ ffffffff
+----------
+ ffffffff
+(1 row)
+
+--
+-- MD5 test suite - from IETF RFC 1321
+-- (see: ftp://ftp.rfc-editor.org/in-notes/rfc1321.txt)
+--
+select md5('') = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('a') = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abc') = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('message digest') = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abcdefghijklmnopqrstuvwxyz') = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789') = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890') = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5(''::bytea) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('a'::bytea) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abc'::bytea) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('message digest'::bytea) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abcdefghijklmnopqrstuvwxyz'::bytea) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'::bytea) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890'::bytea) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+--
+-- SHA-2
+--
+SET bytea_output TO hex;
+SELECT sha224('');
+ sha224
+------------------------------------------------------------
+ \xd14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac5b3e42f
+(1 row)
+
+SELECT sha224('The quick brown fox jumps over the lazy dog.');
+ sha224
+------------------------------------------------------------
+ \x619cba8e8e05826e9b8c519c0a5c68f4fb653e8a3d8aa04bb2c8cd4c
+(1 row)
+
+SELECT sha256('');
+ sha256
+--------------------------------------------------------------------
+ \xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
+(1 row)
+
+SELECT sha256('The quick brown fox jumps over the lazy dog.');
+ sha256
+--------------------------------------------------------------------
+ \xef537f25c895bfa782526529a9b63d97aa631564d5d789c2b765448c8635fb6c
+(1 row)
+
+SELECT sha384('');
+ sha384
+----------------------------------------------------------------------------------------------------
+ \x38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b
+(1 row)
+
+SELECT sha384('The quick brown fox jumps over the lazy dog.');
+ sha384
+----------------------------------------------------------------------------------------------------
+ \xed892481d8272ca6df370bf706e4d7bc1b5739fa2177aae6c50e946678718fc67a7af2819a021c2fc34e91bdb63409d7
+(1 row)
+
+SELECT sha512('');
+ sha512
+------------------------------------------------------------------------------------------------------------------------------------
+ \xcf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e
+(1 row)
+
+SELECT sha512('The quick brown fox jumps over the lazy dog.');
+ sha512
+------------------------------------------------------------------------------------------------------------------------------------
+ \x91ea1245f20d46ae9a037a989f54f1f790f0a47607eeb8a14d12890cea77a1bbc6c7ed9cf205e67b7f2b8fd4c7dfd3a7a8617e45f3c463d481c7e586c39ac1ed
+(1 row)
+
+--
+-- test behavior of escape_string_warning and standard_conforming_strings options
+--
+set escape_string_warning = off;
+set standard_conforming_strings = off;
+show escape_string_warning;
+ escape_string_warning
+-----------------------
+ off
+(1 row)
+
+show standard_conforming_strings;
+ standard_conforming_strings
+-----------------------------
+ off
+(1 row)
+
+set escape_string_warning = on;
+set standard_conforming_strings = on;
+show escape_string_warning;
+ escape_string_warning
+-----------------------
+ on
+(1 row)
+
+show standard_conforming_strings;
+ standard_conforming_strings
+-----------------------------
+ on
+(1 row)
+
+select 'a\bcd' as f1, 'a\b''cd' as f2, 'a\b''''cd' as f3, 'abcd\' as f4, 'ab\''cd' as f5, '\\' as f6;
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+set standard_conforming_strings = off;
+select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd' as f5, '\\\\' as f6;
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3,...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3,...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3,...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: ...bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' ...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: ...'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd'...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: ...'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd' as f5, '\\\\' as ...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+set escape_string_warning = off;
+set standard_conforming_strings = on;
+select 'a\bcd' as f1, 'a\b''cd' as f2, 'a\b''''cd' as f3, 'abcd\' as f4, 'ab\''cd' as f5, '\\' as f6;
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+set standard_conforming_strings = off;
+select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd' as f5, '\\\\' as f6;
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+--
+-- Additional string functions
+--
+SET bytea_output TO escape;
+SELECT initcap('hi THOMAS');
+ initcap
+-----------
+ Hi Thomas
+(1 row)
+
+SELECT lpad('hi', 5, 'xy');
+ lpad
+-------
+ xyxhi
+(1 row)
+
+SELECT lpad('hi', 5);
+ lpad
+-------
+ hi
+(1 row)
+
+SELECT lpad('hi', -5, 'xy');
+ lpad
+------
+
+(1 row)
+
+SELECT lpad('hello', 2);
+ lpad
+------
+ he
+(1 row)
+
+SELECT lpad('hi', 5, '');
+ lpad
+------
+ hi
+(1 row)
+
+SELECT rpad('hi', 5, 'xy');
+ rpad
+-------
+ hixyx
+(1 row)
+
+SELECT rpad('hi', 5);
+ rpad
+-------
+ hi
+(1 row)
+
+SELECT rpad('hi', -5, 'xy');
+ rpad
+------
+
+(1 row)
+
+SELECT rpad('hello', 2);
+ rpad
+------
+ he
+(1 row)
+
+SELECT rpad('hi', 5, '');
+ rpad
+------
+ hi
+(1 row)
+
+SELECT ltrim('zzzytrim', 'xyz');
+ ltrim
+-------
+ trim
+(1 row)
+
+SELECT translate('', '14', 'ax');
+ translate
+-----------
+
+(1 row)
+
+SELECT translate('12345', '14', 'ax');
+ translate
+-----------
+ a23x5
+(1 row)
+
+SELECT ascii('x');
+ ascii
+-------
+ 120
+(1 row)
+
+SELECT ascii('');
+ ascii
+-------
+ 0
+(1 row)
+
+SELECT chr(65);
+ chr
+-----
+ A
+(1 row)
+
+SELECT chr(0);
+ERROR: null character not permitted
+SELECT repeat('Pg', 4);
+ repeat
+----------
+ PgPgPgPg
+(1 row)
+
+SELECT repeat('Pg', -4);
+ repeat
+--------
+
+(1 row)
+
+SELECT trim(E'\\000'::bytea from E'\\000Tom\\000'::bytea);
+ btrim
+-------
+ Tom
+(1 row)
+
+SELECT btrim(E'\\000trim\\000'::bytea, E'\\000'::bytea);
+ btrim
+-------
+ trim
+(1 row)
+
+SELECT btrim(''::bytea, E'\\000'::bytea);
+ btrim
+-------
+
+(1 row)
+
+SELECT btrim(E'\\000trim\\000'::bytea, ''::bytea);
+ btrim
+--------------
+ \000trim\000
+(1 row)
+
+SELECT encode(overlay(E'Th\\000omas'::bytea placing E'Th\\001omas'::bytea from 2),'escape');
+ encode
+-------------
+ TTh\x01omas
+(1 row)
+
+SELECT encode(overlay(E'Th\\000omas'::bytea placing E'\\002\\003'::bytea from 8),'escape');
+ encode
+--------------------
+ Th\000omas\x02\x03
+(1 row)
+
+SELECT encode(overlay(E'Th\\000omas'::bytea placing E'\\002\\003'::bytea from 5 for 3),'escape');
+ encode
+-----------------
+ Th\000o\x02\x03
+(1 row)
+
diff --git a/src/test/regress/expected/tsrf_1.out b/src/test/regress/expected/tsrf_1.out
new file mode 100644
index 0000000000..a0f7d80c69
--- /dev/null
+++ b/src/test/regress/expected/tsrf_1.out
@@ -0,0 +1,712 @@
+--
+-- tsrf - targetlist set returning function tests
+--
+-- simple srf
+SELECT generate_series(1, 3);
+ generate_series
+-----------------
+ 1
+ 2
+ 3
+(3 rows)
+
+-- parallel iteration
+SELECT generate_series(1, 3), generate_series(3,5);
+ generate_series | generate_series
+-----------------+-----------------
+ 1 | 3
+ 2 | 4
+ 3 | 5
+(3 rows)
+
+-- parallel iteration, different number of rows
+SELECT generate_series(1, 2), generate_series(1,4);
+ generate_series | generate_series
+-----------------+-----------------
+ 1 | 1
+ 2 | 2
+ | 3
+ | 4
+(4 rows)
+
+-- srf, with SRF argument
+SELECT generate_series(1, generate_series(1, 3));
+ generate_series
+-----------------
+ 1
+ 1
+ 2
+ 1
+ 2
+ 3
+(6 rows)
+
+-- but we've traditionally rejected the same in FROM
+SELECT * FROM generate_series(1, generate_series(1, 3));
+ERROR: set-returning functions must appear at top level of FROM
+LINE 1: SELECT * FROM generate_series(1, generate_series(1, 3));
+ ^
+-- srf, with two SRF arguments
+SELECT generate_series(generate_series(1,3), generate_series(2, 4));
+ generate_series
+-----------------
+ 1
+ 2
+ 2
+ 3
+ 3
+ 4
+(6 rows)
+
+-- check proper nesting of SRFs in different expressions
+explain (verbose, costs off)
+SELECT generate_series(1, generate_series(1, 3)), generate_series(2, 4);
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ ProjectSet
+ Output: generate_series(1, (generate_series(1, 3))), (generate_series(2, 4))
+ -> ProjectSet
+ Output: generate_series(1, 3), generate_series(2, 4)
+ -> Result
+(5 rows)
+
+SELECT generate_series(1, generate_series(1, 3)), generate_series(2, 4);
+ generate_series | generate_series
+-----------------+-----------------
+ 1 | 2
+ 1 | 3
+ 2 | 3
+ 1 | 4
+ 2 | 4
+ 3 | 4
+(6 rows)
+
+CREATE TABLE few(id int, dataa text, datab text);
+INSERT INTO few VALUES(1, 'a', 'foo'),(2, 'a', 'bar'),(3, 'b', 'bar');
+-- SRF with a provably-dummy relation
+explain (verbose, costs off)
+SELECT unnest(ARRAY[1, 2]) FROM few WHERE false;
+ QUERY PLAN
+--------------------------------------
+ ProjectSet
+ Output: unnest('{1,2}'::integer[])
+ -> Result
+ One-Time Filter: false
+(4 rows)
+
+SELECT unnest(ARRAY[1, 2]) FROM few WHERE false;
+ unnest
+--------
+(0 rows)
+
+-- SRF shouldn't prevent upper query from recognizing lower as dummy
+explain (verbose, costs off)
+SELECT * FROM few f1,
+ (SELECT unnest(ARRAY[1,2]) FROM few f2 WHERE false OFFSET 0) ss;
+ QUERY PLAN
+------------------------------------------------
+ Result
+ Output: f1.id, f1.dataa, f1.datab, ss.unnest
+ One-Time Filter: false
+(3 rows)
+
+SELECT * FROM few f1,
+ (SELECT unnest(ARRAY[1,2]) FROM few f2 WHERE false OFFSET 0) ss;
+ id | dataa | datab | unnest
+----+-------+-------+--------
+(0 rows)
+
+-- SRF output order of sorting is maintained, if SRF is not referenced
+SELECT few.id, generate_series(1,3) g FROM few ORDER BY id DESC;
+ id | g
+----+---
+ 3 | 1
+ 3 | 2
+ 3 | 3
+ 2 | 1
+ 2 | 2
+ 2 | 3
+ 1 | 1
+ 1 | 2
+ 1 | 3
+(9 rows)
+
+-- but SRFs can be referenced in sort
+SELECT few.id, generate_series(1,3) g FROM few ORDER BY id, g DESC;
+ id | g
+----+---
+ 1 | 3
+ 1 | 2
+ 1 | 1
+ 2 | 3
+ 2 | 2
+ 2 | 1
+ 3 | 3
+ 3 | 2
+ 3 | 1
+(9 rows)
+
+SELECT few.id, generate_series(1,3) g FROM few ORDER BY id, generate_series(1,3) DESC;
+ id | g
+----+---
+ 1 | 3
+ 1 | 2
+ 1 | 1
+ 2 | 3
+ 2 | 2
+ 2 | 1
+ 3 | 3
+ 3 | 2
+ 3 | 1
+(9 rows)
+
+-- it's weird to have ORDER BYs that increase the number of results
+SELECT few.id FROM few ORDER BY id, generate_series(1,3) DESC;
+ id
+----
+ 1
+ 1
+ 1
+ 2
+ 2
+ 2
+ 3
+ 3
+ 3
+(9 rows)
+
+-- SRFs are computed after aggregation
+SET enable_hashagg TO 0; -- stable output order
+SELECT few.dataa, count(*), min(id), max(id), unnest('{1,1,3}'::int[]) FROM few WHERE few.id = 1 GROUP BY few.dataa;
+ dataa | count | min | max | unnest
+-------+-------+-----+-----+--------
+ a | 1 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 3
+(3 rows)
+
+-- unless referenced in GROUP BY clause
+SELECT few.dataa, count(*), min(id), max(id), unnest('{1,1,3}'::int[]) FROM few WHERE few.id = 1 GROUP BY few.dataa, unnest('{1,1,3}'::int[]);
+ dataa | count | min | max | unnest
+-------+-------+-----+-----+--------
+ a | 2 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 3
+(2 rows)
+
+SELECT few.dataa, count(*), min(id), max(id), unnest('{1,1,3}'::int[]) FROM few WHERE few.id = 1 GROUP BY few.dataa, 5;
+ dataa | count | min | max | unnest
+-------+-------+-----+-----+--------
+ a | 2 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 3
+(2 rows)
+
+RESET enable_hashagg;
+-- check HAVING works when GROUP BY does [not] reference SRF output
+SELECT dataa, generate_series(1,1), count(*) FROM few GROUP BY 1 HAVING count(*) > 1;
+ dataa | generate_series | count
+-------+-----------------+-------
+ a | 1 | 2
+(1 row)
+
+SELECT dataa, generate_series(1,1), count(*) FROM few GROUP BY 1, 2 HAVING count(*) > 1;
+ dataa | generate_series | count
+-------+-----------------+-------
+ a | 1 | 2
+(1 row)
+
+-- it's weird to have GROUP BYs that increase the number of results
+SELECT few.dataa, count(*) FROM few WHERE dataa = 'a' GROUP BY few.dataa ORDER BY 2;
+ dataa | count
+-------+-------
+ a | 2
+(1 row)
+
+SELECT few.dataa, count(*) FROM few WHERE dataa = 'a' GROUP BY few.dataa, unnest('{1,1,3}'::int[]) ORDER BY 2;
+ dataa | count
+-------+-------
+ a | 2
+ a | 4
+(2 rows)
+
+-- SRFs are not allowed if they'd need to be conditionally executed
+SELECT q1, case when q1 > 0 then generate_series(1,3) else 0 end FROM int8_tbl;
+ERROR: set-returning functions are not allowed in CASE
+LINE 1: SELECT q1, case when q1 > 0 then generate_series(1,3) else 0...
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+SELECT q1, coalesce(generate_series(1,3), 0) FROM int8_tbl;
+ERROR: set-returning functions are not allowed in COALESCE
+LINE 1: SELECT q1, coalesce(generate_series(1,3), 0) FROM int8_tbl;
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+-- SRFs are not allowed in aggregate arguments
+SELECT min(generate_series(1, 3)) FROM few;
+ERROR: aggregate function calls cannot contain set-returning function calls
+LINE 1: SELECT min(generate_series(1, 3)) FROM few;
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+-- ... unless they're within a sub-select
+SELECT sum((3 = ANY(SELECT generate_series(1,4)))::int);
+ sum
+-----
+ 1
+(1 row)
+
+SELECT sum((3 = ANY(SELECT lag(x) over(order by x)
+ FROM generate_series(1,4) x))::int);
+ sum
+-----
+ 1
+(1 row)
+
+-- SRFs are not allowed in window function arguments, either
+SELECT min(generate_series(1, 3)) OVER() FROM few;
+ERROR: window function calls cannot contain set-returning function calls
+LINE 1: SELECT min(generate_series(1, 3)) OVER() FROM few;
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+-- SRFs are normally computed after window functions
+SELECT id,lag(id) OVER(), count(*) OVER(), generate_series(1,3) FROM few;
+ id | lag | count | generate_series
+----+-----+-------+-----------------
+ 1 | | 3 | 1
+ 1 | | 3 | 2
+ 1 | | 3 | 3
+ 2 | 1 | 3 | 1
+ 2 | 1 | 3 | 2
+ 2 | 1 | 3 | 3
+ 3 | 2 | 3 | 1
+ 3 | 2 | 3 | 2
+ 3 | 2 | 3 | 3
+(9 rows)
+
+-- unless referencing SRFs
+SELECT SUM(count(*)) OVER(PARTITION BY generate_series(1,3) ORDER BY generate_series(1,3)), generate_series(1,3) g FROM few GROUP BY g;
+ sum | g
+-----+---
+ 3 | 1
+ 3 | 2
+ 3 | 3
+(3 rows)
+
+-- sorting + grouping
+SELECT few.dataa, count(*), min(id), max(id), generate_series(1,3) FROM few GROUP BY few.dataa ORDER BY 5, 1;
+ dataa | count | min | max | generate_series
+-------+-------+-----+-----+-----------------
+ a | 2 | 1 | 2 | 1
+ b | 1 | 3 | 3 | 1
+ a | 2 | 1 | 2 | 2
+ b | 1 | 3 | 3 | 2
+ a | 2 | 1 | 2 | 3
+ b | 1 | 3 | 3 | 3
+(6 rows)
+
+-- grouping sets are a bit special, they produce NULLs in columns not actually NULL
+set enable_hashagg = false;
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab);
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | | 1 | 2
+ a | | 2 | 2
+ b | bar | 1 | 1
+ b | bar | 2 | 1
+ b | | 1 | 1
+ b | | 2 | 1
+ | | 1 | 3
+ | | 2 | 3
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | foo | 1 | 1
+ | foo | 2 | 1
+(16 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab) ORDER BY dataa;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | | 1 | 2
+ a | | 2 | 2
+ b | bar | 1 | 1
+ b | bar | 2 | 1
+ b | | 1 | 1
+ b | | 2 | 1
+ | | 1 | 3
+ | | 2 | 3
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | foo | 1 | 1
+ | foo | 2 | 1
+(16 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab) ORDER BY g;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | foo | 1 | 1
+ a | | 1 | 2
+ b | bar | 1 | 1
+ b | | 1 | 1
+ | | 1 | 3
+ | bar | 1 | 2
+ | foo | 1 | 1
+ | foo | 2 | 1
+ a | bar | 2 | 1
+ b | | 2 | 1
+ a | foo | 2 | 1
+ | bar | 2 | 2
+ a | | 2 | 2
+ | | 2 | 3
+ b | bar | 2 | 1
+(16 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab, g);
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | bar | | 2
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | foo | | 2
+ a | | | 4
+ b | bar | 1 | 1
+ b | bar | 2 | 1
+ b | bar | | 2
+ b | | | 2
+ | | | 6
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | bar | | 4
+ | foo | 1 | 1
+ | foo | 2 | 1
+ | foo | | 2
+ a | | 1 | 2
+ b | | 1 | 1
+ | | 1 | 3
+ a | | 2 | 2
+ b | | 2 | 1
+ | | 2 | 3
+(24 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab, g) ORDER BY dataa;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | foo | | 2
+ a | | | 4
+ a | | 2 | 2
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | bar | | 2
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | | 1 | 2
+ b | bar | 1 | 1
+ b | | | 2
+ b | | 1 | 1
+ b | bar | 2 | 1
+ b | bar | | 2
+ b | | 2 | 1
+ | | 2 | 3
+ | | | 6
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | bar | | 4
+ | foo | 1 | 1
+ | foo | 2 | 1
+ | foo | | 2
+ | | 1 | 3
+(24 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab, g) ORDER BY g;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | foo | 1 | 1
+ b | bar | 1 | 1
+ | bar | 1 | 2
+ | foo | 1 | 1
+ a | | 1 | 2
+ b | | 1 | 1
+ | | 1 | 3
+ a | | 2 | 2
+ b | | 2 | 1
+ | bar | 2 | 2
+ | | 2 | 3
+ | foo | 2 | 1
+ a | bar | 2 | 1
+ a | foo | 2 | 1
+ b | bar | 2 | 1
+ a | | | 4
+ b | bar | | 2
+ b | | | 2
+ | | | 6
+ a | foo | | 2
+ a | bar | | 2
+ | bar | | 4
+ | foo | | 2
+(24 rows)
+
+reset enable_hashagg;
+-- case with degenerate ORDER BY
+explain (verbose, costs off)
+select 'foo' as f, generate_series(1,2) as g from few order by 1;
+ QUERY PLAN
+------------------------------------------------
+ ProjectSet
+ Output: ('foo'::text), generate_series(1, 2)
+ -> Seq Scan on public.few
+ Output: 'foo'::text
+(4 rows)
+
+select 'foo' as f, generate_series(1,2) as g from few order by 1;
+ f | g
+-----+---
+ foo | 1
+ foo | 2
+ foo | 1
+ foo | 2
+ foo | 1
+ foo | 2
+(6 rows)
+
+-- data modification
+CREATE TABLE fewmore AS SELECT generate_series(1,3) AS data;
+INSERT INTO fewmore VALUES(generate_series(4,5));
+SELECT * FROM fewmore;
+ data
+------
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+-- SRFs are not allowed in UPDATE (they once were, but it was nonsense)
+UPDATE fewmore SET data = generate_series(4,9);
+ERROR: set-returning functions are not allowed in UPDATE
+LINE 1: UPDATE fewmore SET data = generate_series(4,9);
+ ^
+-- SRFs are not allowed in RETURNING
+INSERT INTO fewmore VALUES(1) RETURNING generate_series(1,3);
+ERROR: set-returning functions are not allowed in RETURNING
+LINE 1: INSERT INTO fewmore VALUES(1) RETURNING generate_series(1,3)...
+ ^
+-- nor standalone VALUES (but surely this is a bug?)
+VALUES(1, generate_series(1,2));
+ERROR: set-returning functions are not allowed in VALUES
+LINE 1: VALUES(1, generate_series(1,2));
+ ^
+-- We allow tSRFs that are not at top level
+SELECT int4mul(generate_series(1,2), 10);
+ int4mul
+---------
+ 10
+ 20
+(2 rows)
+
+SELECT generate_series(1,3) IS DISTINCT FROM 2;
+ ?column?
+----------
+ t
+ f
+ t
+(3 rows)
+
+-- but SRFs in function RTEs must be at top level (annoying restriction)
+SELECT * FROM int4mul(generate_series(1,2), 10);
+ERROR: set-returning functions must appear at top level of FROM
+LINE 1: SELECT * FROM int4mul(generate_series(1,2), 10);
+ ^
+-- DISTINCT ON is evaluated before tSRF evaluation if SRF is not
+-- referenced either in ORDER BY or in the DISTINCT ON list. The ORDER
+-- BY reference can be implicitly generated, if there's no other ORDER BY.
+-- implicit reference (via implicit ORDER) to all columns
+SELECT DISTINCT ON (a) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b);
+ a | b | g
+---+---+---
+ 1 | 1 | 1
+ 3 | 2 | 1
+ 5 | 3 | 1
+(3 rows)
+
+-- unreferenced in DISTINCT ON or ORDER BY
+SELECT DISTINCT ON (a) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b)
+ORDER BY a, b DESC;
+ a | b | g
+---+---+---
+ 1 | 4 | 1
+ 1 | 4 | 2
+ 1 | 4 | 3
+ 3 | 2 | 1
+ 3 | 2 | 2
+ 3 | 2 | 3
+ 5 | 3 | 1
+ 5 | 3 | 2
+ 5 | 3 | 3
+(9 rows)
+
+-- referenced in ORDER BY
+SELECT DISTINCT ON (a) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b)
+ORDER BY a, b DESC, g DESC;
+ a | b | g
+---+---+---
+ 1 | 4 | 3
+ 3 | 2 | 3
+ 5 | 3 | 3
+(3 rows)
+
+-- referenced in ORDER BY and DISTINCT ON
+SELECT DISTINCT ON (a, b, g) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b)
+ORDER BY a, b DESC, g DESC;
+ a | b | g
+---+---+---
+ 1 | 4 | 3
+ 1 | 4 | 2
+ 1 | 4 | 1
+ 1 | 1 | 3
+ 1 | 1 | 2
+ 1 | 1 | 1
+ 3 | 2 | 3
+ 3 | 2 | 2
+ 3 | 2 | 1
+ 3 | 1 | 3
+ 3 | 1 | 2
+ 3 | 1 | 1
+ 5 | 3 | 3
+ 5 | 3 | 2
+ 5 | 3 | 1
+ 5 | 1 | 3
+ 5 | 1 | 2
+ 5 | 1 | 1
+(18 rows)
+
+-- only SRF mentioned in DISTINCT ON
+SELECT DISTINCT ON (g) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b);
+ a | b | g
+---+---+---
+ 3 | 2 | 1
+ 5 | 1 | 2
+ 3 | 1 | 3
+(3 rows)
+
+-- LIMIT / OFFSET is evaluated after SRF evaluation
+SELECT a, generate_series(1,2) FROM (VALUES(1),(2),(3)) r(a) LIMIT 2 OFFSET 2;
+ a | generate_series
+---+-----------------
+ 2 | 1
+ 2 | 2
+(2 rows)
+
+-- SRFs are not allowed in LIMIT.
+SELECT 1 LIMIT generate_series(1,3);
+ERROR: set-returning functions are not allowed in LIMIT
+LINE 1: SELECT 1 LIMIT generate_series(1,3);
+ ^
+-- tSRF in correlated subquery, referencing table outside
+SELECT (SELECT generate_series(1,3) LIMIT 1 OFFSET few.id) FROM few;
+ generate_series
+-----------------
+ 2
+ 3
+
+(3 rows)
+
+-- tSRF in correlated subquery, referencing SRF outside
+SELECT (SELECT generate_series(1,3) LIMIT 1 OFFSET g.i) FROM generate_series(0,3) g(i);
+ generate_series
+-----------------
+ 1
+ 2
+ 3
+
+(4 rows)
+
+-- Operators can return sets too
+CREATE OPERATOR |@| (PROCEDURE = unnest, RIGHTARG = ANYARRAY);
+SELECT |@|ARRAY[1,2,3];
+ ?column?
+----------
+ 1
+ 2
+ 3
+(3 rows)
+
+-- Some fun cases involving duplicate SRF calls
+explain (verbose, costs off)
+select generate_series(1,3) as x, generate_series(1,3) + 1 as xp1;
+ QUERY PLAN
+------------------------------------------------------------------
+ Result
+ Output: (generate_series(1, 3)), ((generate_series(1, 3)) + 1)
+ -> ProjectSet
+ Output: generate_series(1, 3)
+ -> Result
+(5 rows)
+
+select generate_series(1,3) as x, generate_series(1,3) + 1 as xp1;
+ x | xp1
+---+-----
+ 1 | 2
+ 2 | 3
+ 3 | 4
+(3 rows)
+
+explain (verbose, costs off)
+select generate_series(1,3)+1 order by generate_series(1,3);
+ QUERY PLAN
+------------------------------------------------------------------------
+ Sort
+ Output: (((generate_series(1, 3)) + 1)), (generate_series(1, 3))
+ Sort Key: (generate_series(1, 3))
+ -> Result
+ Output: ((generate_series(1, 3)) + 1), (generate_series(1, 3))
+ -> ProjectSet
+ Output: generate_series(1, 3)
+ -> Result
+(8 rows)
+
+select generate_series(1,3)+1 order by generate_series(1,3);
+ ?column?
+----------
+ 2
+ 3
+ 4
+(3 rows)
+
+-- Check that SRFs of same nesting level run in lockstep
+explain (verbose, costs off)
+select generate_series(1,3) as x, generate_series(3,6) + 1 as y;
+ QUERY PLAN
+------------------------------------------------------------------
+ Result
+ Output: (generate_series(1, 3)), ((generate_series(3, 6)) + 1)
+ -> ProjectSet
+ Output: generate_series(1, 3), generate_series(3, 6)
+ -> Result
+(5 rows)
+
+select generate_series(1,3) as x, generate_series(3,6) + 1 as y;
+ x | y
+---+---
+ 1 | 4
+ 2 | 5
+ 3 | 6
+ | 7
+(4 rows)
+
+-- Clean up
+DROP TABLE few;
+DROP TABLE fewmore;
diff --git a/src/test/regress/expected/zedstore.out b/src/test/regress/expected/zedstore.out
new file mode 100644
index 0000000000..6041e42a93
--- /dev/null
+++ b/src/test/regress/expected/zedstore.out
@@ -0,0 +1,599 @@
+-- simple tests to iteratively build the zedstore
+-- create and drop works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+drop table t_zedstore;
+-- insert and select works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+insert into t_zedstore select i,i+1,i+2 from generate_series(1, 10)i;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 4 | 5 | 6
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+(10 rows)
+
+-- selecting only a few columns works
+select c1, c3 from t_zedstore;
+ c1 | c3
+----+----
+ 1 | 3
+ 2 | 4
+ 3 | 5
+ 4 | 6
+ 5 | 7
+ 6 | 8
+ 7 | 9
+ 8 | 10
+ 9 | 11
+ 10 | 12
+(10 rows)
+
+-- only a few columns in the output and a where clause works
+select c3 from t_zedstore where c2 > 5;
+ c3
+----
+ 7
+ 8
+ 9
+ 10
+ 11
+ 12
+(6 rows)
+
+-- Test abort works
+begin;
+insert into t_zedstore select i,i+1,i+2 from generate_series(21, 25)i;
+abort;
+insert into t_zedstore select i,i+1,i+2 from generate_series(31, 35)i;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 4 | 5 | 6
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(15 rows)
+
+--
+-- Test indexing
+--
+create index on t_zedstore (c1);
+set enable_seqscan=off;
+set enable_indexscan=on;
+set enable_bitmapscan=off;
+-- index scan
+select * from t_zedstore where c1 = 5;
+ c1 | c2 | c3
+----+----+----
+ 5 | 6 | 7
+(1 row)
+
+-- index-only scan
+select c1 from t_zedstore where c1 = 5;
+ c1
+----
+ 5
+(1 row)
+
+-- bitmap scan
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+select c1, c2 from t_zedstore where c1 between 5 and 10;
+ c1 | c2
+----+----
+ 5 | 6
+ 6 | 7
+ 7 | 8
+ 8 | 9
+ 9 | 10
+ 10 | 11
+(6 rows)
+
+--
+-- Test DELETE and UPDATE
+--
+delete from t_zedstore where c2 = 5;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(14 rows)
+
+delete from t_zedstore where c2 < 5;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(11 rows)
+
+update t_zedstore set c2 = 100 where c1 = 8;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+-----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+ 8 | 100 | 10
+(11 rows)
+
+--
+-- Test page deletion, by deleting a bigger range of values
+--
+insert into t_zedstore select i,i+1,i+2 from generate_series(10000, 15000)i;
+delete from t_zedstore where c1 >= 10000;
+--
+-- Test VACUUM
+--
+vacuum t_zedstore;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+-----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+ 8 | 100 | 10
+(11 rows)
+
+--
+-- Test toasting
+--
+create table t_zedtoast(c1 int, t text) USING zedstore;
+insert into t_zedtoast select i, repeat('x', 10000) from generate_series(1, 10) i;
+select c1, length(t) from t_zedtoast;
+ c1 | length
+----+--------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+ 6 | 10000
+ 7 | 10000
+ 8 | 10000
+ 9 | 10000
+ 10 | 10000
+(10 rows)
+
+--
+-- Test NULL values
+--
+create table t_zednullvalues(c1 int, c2 int) USING zedstore;
+insert into t_zednullvalues values(1, NULL), (NULL, 2);
+select * from t_zednullvalues;
+ c1 | c2
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+select c2 from t_zednullvalues;
+ c2
+----
+
+ 2
+(2 rows)
+
+update t_zednullvalues set c1 = 1, c2 = NULL;
+select * from t_zednullvalues;
+ c1 | c2
+----+----
+ 1 |
+ 1 |
+(2 rows)
+
+--
+-- Test COPY
+--
+create table t_zedcopy(a serial, b int, c text not null default 'stuff', d text,e text) USING zedstore;
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+COPY t_zedcopy (b, d) from stdin;
+COPY t_zedcopy (b, d) from stdin;
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 1 | 1 | stuff | test_1 |
+ 2 | 2 | stuff | test_2 |
+ 3 | 3 | stuff | test_3 |
+ 4 | 4 | stuff | test_4 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(12 rows)
+
+COPY t_zedcopy (a, d, e) to stdout;
+9999 NN \N
+10000 41 51
+1 test_1 \N
+2 test_2 \N
+3 test_3 \N
+4 test_4 \N
+5 test_5 \N
+10001 42 52
+10002 43 53
+10003 44 54
+10004 45 55
+10005 46 56
+--
+-- Also test delete and update on the table that was populated with COPY.
+-- This exercises splitting the array item. (A table not populated with
+-- COPY only contains single items, at the moment.)
+--
+delete from t_zedcopy where b = 4;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 1 | 1 | stuff | test_1 |
+ 2 | 2 | stuff | test_2 |
+ 3 | 3 | stuff | test_3 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 |
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(11 rows)
+
+delete from t_zedcopy where b < 3;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 3 | 3 | stuff | test_3 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 |
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(9 rows)
+
+update t_zedcopy set b = 100 where b = 5;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+-----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 3 | 3 | stuff | test_3 |
+ 10001 | 22 | 32 | 42 |
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+ 5 | 100 | stuff | test_5 |
+(9 rows)
+
+--
+-- Test zero column table
+--
+create table t_zwithzerocols() using zedstore;
+insert into t_zwithzerocols select t.* from t_zwithzerocols t right join generate_series(1,1) on true;
+select count(*) from t_zwithzerocols;
+ count
+-------
+ 1
+(1 row)
+
+-- Test for alter table add column
+create table t_zaddcol(a int) using zedstore;
+insert into t_zaddcol select * from generate_series(1, 3);
+-- rewrite case
+alter table t_zaddcol add column b int generated always as (a + 1) stored;
+select * from t_zaddcol;
+ a | b
+---+---
+ 1 | 2
+ 2 | 3
+ 3 | 4
+(3 rows)
+
+-- test alter table add column with no default
+create table t_zaddcol_simple(a int) using zedstore;
+insert into t_zaddcol_simple values (1);
+alter table t_zaddcol_simple add b int;
+select * from t_zaddcol_simple;
+ a | b
+---+---
+ 1 |
+(1 row)
+
+insert into t_zaddcol_simple values(2,3);
+select * from t_zaddcol_simple;
+ a | b
+---+---
+ 1 |
+ 2 | 3
+(2 rows)
+
+-- fixed length default value stored in catalog
+alter table t_zaddcol add column c int default 3;
+select * from t_zaddcol;
+ a | b | c
+---+---+---
+ 1 | 2 | 3
+ 2 | 3 | 3
+ 3 | 4 | 3
+(3 rows)
+
+-- variable length default value stored in catalog
+alter table t_zaddcol add column d text default 'abcdefgh';
+select d from t_zaddcol;
+ d
+----------
+ abcdefgh
+ abcdefgh
+ abcdefgh
+(3 rows)
+
+-- insert after add column
+insert into t_zaddcol values (2);
+select * from t_zaddcol;
+ a | b | c | d
+---+---+---+----------
+ 1 | 2 | 3 | abcdefgh
+ 2 | 3 | 3 | abcdefgh
+ 3 | 4 | 3 | abcdefgh
+ 2 | 3 | 3 | abcdefgh
+(4 rows)
+
+insert into t_zaddcol (a, c, d) values (3,5, 'test_insert');
+select b,c,d from t_zaddcol;
+ b | c | d
+---+---+-------------
+ 2 | 3 | abcdefgh
+ 3 | 3 | abcdefgh
+ 4 | 3 | abcdefgh
+ 3 | 3 | abcdefgh
+ 4 | 5 | test_insert
+(5 rows)
+
+--
+-- Test TABLESAMPLE
+--
+-- The regular tablesample.sql test doesn't work directly for zedstore, as
+-- it uses fillfactor to create a specific block layout for
+-- heap. Hence, the output differs between a heap and a zedstore table while
+-- sampling. We need many tuples here to get multiple logical
+-- blocks, as there is no way to force the TIDs to spread / jump for zedstore.
+--
+CREATE TABLE t_ztablesample (id int, name text) using zedstore;
+INSERT INTO t_ztablesample
+ SELECT i, repeat(i::text, 2) FROM generate_series(0, 299) s(i);
+-- let's delete half of the rows (the even-numbered ids) to limit the output
+DELETE FROM t_ztablesample WHERE id%2 = 0;
+-- should return ALL visible tuples from SOME blocks
+SELECT ctid,t.id FROM t_ztablesample AS t TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ ctid | id
+---------+-----
+ (1,2) | 129
+ (1,4) | 131
+ (1,6) | 133
+ (1,8) | 135
+ (1,10) | 137
+ (1,12) | 139
+ (1,14) | 141
+ (1,16) | 143
+ (1,18) | 145
+ (1,20) | 147
+ (1,22) | 149
+ (1,24) | 151
+ (1,26) | 153
+ (1,28) | 155
+ (1,30) | 157
+ (1,32) | 159
+ (1,34) | 161
+ (1,36) | 163
+ (1,38) | 165
+ (1,40) | 167
+ (1,42) | 169
+ (1,44) | 171
+ (1,46) | 173
+ (1,48) | 175
+ (1,50) | 177
+ (1,52) | 179
+ (1,54) | 181
+ (1,56) | 183
+ (1,58) | 185
+ (1,60) | 187
+ (1,62) | 189
+ (1,64) | 191
+ (1,66) | 193
+ (1,68) | 195
+ (1,70) | 197
+ (1,72) | 199
+ (1,74) | 201
+ (1,76) | 203
+ (1,78) | 205
+ (1,80) | 207
+ (1,82) | 209
+ (1,84) | 211
+ (1,86) | 213
+ (1,88) | 215
+ (1,90) | 217
+ (1,92) | 219
+ (1,94) | 221
+ (1,96) | 223
+ (1,98) | 225
+ (1,100) | 227
+ (1,102) | 229
+ (1,104) | 231
+ (1,106) | 233
+ (1,108) | 235
+ (1,110) | 237
+ (1,112) | 239
+ (1,114) | 241
+ (1,116) | 243
+ (1,118) | 245
+ (1,120) | 247
+ (1,122) | 249
+ (1,124) | 251
+ (1,126) | 253
+ (1,128) | 255
+ (2,2) | 257
+ (2,4) | 259
+ (2,6) | 261
+ (2,8) | 263
+ (2,10) | 265
+ (2,12) | 267
+ (2,14) | 269
+ (2,16) | 271
+ (2,18) | 273
+ (2,20) | 275
+ (2,22) | 277
+ (2,24) | 279
+ (2,26) | 281
+ (2,28) | 283
+ (2,30) | 285
+ (2,32) | 287
+ (2,34) | 289
+ (2,36) | 291
+ (2,38) | 293
+ (2,40) | 295
+ (2,42) | 297
+ (2,44) | 299
+(86 rows)
+
+-- should return SOME visible tuples but from ALL the blocks
+SELECT ctid,id FROM t_ztablesample TABLESAMPLE BERNOULLI (50) REPEATABLE (0);
+ ctid | id
+---------+-----
+ (0,4) | 3
+ (0,6) | 5
+ (0,8) | 7
+ (0,20) | 19
+ (0,30) | 29
+ (0,42) | 41
+ (0,44) | 43
+ (0,48) | 47
+ (0,52) | 51
+ (0,54) | 53
+ (0,56) | 55
+ (0,62) | 61
+ (0,64) | 63
+ (0,66) | 65
+ (0,76) | 75
+ (0,80) | 79
+ (0,82) | 81
+ (0,84) | 83
+ (0,88) | 87
+ (0,90) | 89
+ (0,92) | 91
+ (0,98) | 97
+ (0,106) | 105
+ (0,108) | 107
+ (0,122) | 121
+ (0,126) | 125
+ (1,2) | 129
+ (1,4) | 131
+ (1,6) | 133
+ (1,8) | 135
+ (1,10) | 137
+ (1,12) | 139
+ (1,20) | 147
+ (1,24) | 151
+ (1,26) | 153
+ (1,28) | 155
+ (1,30) | 157
+ (1,32) | 159
+ (1,34) | 161
+ (1,40) | 167
+ (1,44) | 171
+ (1,46) | 173
+ (1,58) | 185
+ (1,66) | 193
+ (1,68) | 195
+ (1,70) | 197
+ (1,78) | 205
+ (1,80) | 207
+ (1,88) | 215
+ (1,92) | 219
+ (1,96) | 223
+ (1,100) | 227
+ (1,102) | 229
+ (1,106) | 233
+ (1,112) | 239
+ (1,116) | 243
+ (1,120) | 247
+ (1,122) | 249
+ (1,126) | 253
+ (2,2) | 257
+ (2,6) | 261
+ (2,8) | 263
+ (2,10) | 265
+ (2,12) | 267
+ (2,16) | 271
+ (2,18) | 273
+ (2,24) | 279
+ (2,26) | 281
+ (2,28) | 283
+ (2,30) | 285
+ (2,34) | 289
+ (2,36) | 291
+ (2,42) | 297
+ (2,44) | 299
+(74 rows)
+
diff --git a/src/test/regress/output/misc_1.source b/src/test/regress/output/misc_1.source
new file mode 100644
index 0000000000..c29c54c414
--- /dev/null
+++ b/src/test/regress/output/misc_1.source
@@ -0,0 +1,692 @@
+--
+-- MISC
+--
+--
+-- BTREE
+--
+UPDATE onek
+ SET unique1 = onek.unique1 + 1;
+UPDATE onek
+ SET unique1 = onek.unique1 - 1;
+--
+-- BTREE partial
+--
+-- UPDATE onek2
+-- SET unique1 = onek2.unique1 + 1;
+--UPDATE onek2
+-- SET unique1 = onek2.unique1 - 1;
+--
+-- BTREE shutting out non-functional updates
+--
+-- the following two tests seem to take a long time on some
+-- systems. This non-func update stuff needs to be examined
+-- more closely. - jolly (2/22/96)
+--
+UPDATE tmp
+ SET stringu1 = reverse_name(onek.stringu1)
+ FROM onek
+ WHERE onek.stringu1 = 'JBAAAA' and
+ onek.stringu1 = tmp.stringu1;
+UPDATE tmp
+ SET stringu1 = reverse_name(onek2.stringu1)
+ FROM onek2
+ WHERE onek2.stringu1 = 'JCAAAA' and
+ onek2.stringu1 = tmp.stringu1;
+DROP TABLE tmp;
+--UPDATE person*
+-- SET age = age + 1;
+--UPDATE person*
+-- SET age = age + 3
+-- WHERE name = 'linda';
+--
+-- copy
+--
+COPY onek TO '@abs_builddir@/results/onek.data';
+DELETE FROM onek;
+COPY onek FROM '@abs_builddir@/results/onek.data';
+SELECT unique1 FROM onek WHERE unique1 < 2 ORDER BY unique1;
+ unique1
+---------
+ 0
+ 1
+(2 rows)
+
+DELETE FROM onek2;
+COPY onek2 FROM '@abs_builddir@/results/onek.data';
+SELECT unique1 FROM onek2 WHERE unique1 < 2 ORDER BY unique1;
+ unique1
+---------
+ 0
+ 1
+(2 rows)
+
+COPY BINARY stud_emp TO '@abs_builddir@/results/stud_emp.data';
+DELETE FROM stud_emp;
+COPY BINARY stud_emp FROM '@abs_builddir@/results/stud_emp.data';
+SELECT * FROM stud_emp;
+ name | age | location | salary | manager | gpa | percent
+-------+-----+------------+--------+---------+-----+---------
+ jeff | 23 | (8,7.7) | 600 | sharon | 3.5 |
+ cim | 30 | (10.5,4.7) | 400 | | 3.4 |
+ linda | 19 | (0.9,6.1) | 100 | | 2.9 |
+(3 rows)
+
+-- COPY aggtest FROM stdin;
+-- 56 7.8
+-- 100 99.097
+-- 0 0.09561
+-- 42 324.78
+-- .
+-- COPY aggtest TO stdout;
+--
+-- inheritance stress test
+--
+SELECT * FROM a_star*;
+ class | a
+-------+----
+ a | 1
+ a | 2
+ a |
+ b | 3
+ b | 4
+ b |
+ b |
+ c | 5
+ c | 6
+ c |
+ c |
+ d | 7
+ d | 8
+ d | 9
+ d | 10
+ d |
+ d | 11
+ d | 12
+ d | 13
+ d |
+ d |
+ d |
+ d | 14
+ d |
+ d |
+ d |
+ d |
+ e | 15
+ e | 16
+ e | 17
+ e |
+ e | 18
+ e |
+ e |
+ f | 19
+ f | 20
+ f | 21
+ f | 22
+ f |
+ f | 24
+ f | 25
+ f | 26
+ f |
+ f |
+ f |
+ f | 27
+ f |
+ f |
+ f |
+ f |
+(50 rows)
+
+SELECT *
+ FROM b_star* x
+ WHERE x.b = text 'bumble' or x.a < 3;
+ class | a | b
+-------+---+--------
+ b | | bumble
+(1 row)
+
+SELECT class, a
+ FROM c_star* x
+ WHERE x.c ~ text 'hi';
+ class | a
+-------+----
+ c | 5
+ c |
+ d | 7
+ d | 8
+ d | 10
+ d |
+ d | 12
+ d |
+ d |
+ d |
+ e | 15
+ e | 16
+ e |
+ e |
+ f | 19
+ f | 20
+ f | 21
+ f |
+ f | 24
+ f |
+ f |
+ f |
+(22 rows)
+
+SELECT class, b, c
+ FROM d_star* x
+ WHERE x.a < 100;
+ class | b | c
+-------+---------+------------
+ d | grumble | hi sunita
+ d | stumble | hi koko
+ d | rumble |
+ d | | hi kristin
+ d | fumble |
+ d | | hi avi
+ d | |
+ d | |
+(8 rows)
+
+SELECT class, c FROM e_star* x WHERE x.c NOTNULL;
+ class | c
+-------+-------------
+ e | hi carol
+ e | hi bob
+ e | hi michelle
+ e | hi elisa
+ f | hi claire
+ f | hi mike
+ f | hi marcel
+ f | hi keith
+ f | hi marc
+ f | hi allison
+ f | hi jeff
+ f | hi carl
+(12 rows)
+
+SELECT * FROM f_star* x WHERE x.c ISNULL;
+ class | a | c | e | f
+-------+----+---+-----+-------------------------------------------
+ f | 22 | | -7 | ((111,555),(222,666),(333,777),(444,888))
+ f | 25 | | -9 |
+ f | 26 | | | ((11111,33333),(22222,44444))
+ f | | | -11 | ((1111111,3333333),(2222222,4444444))
+ f | 27 | | |
+ f | | | -12 |
+ f | | | | ((11111111,33333333),(22222222,44444444))
+ f | | | |
+(8 rows)
+
+-- grouping and aggregation on inherited sets have been busted in the past...
+SELECT sum(a) FROM a_star*;
+ sum
+-----
+ 355
+(1 row)
+
+SELECT class, sum(a) FROM a_star* GROUP BY class ORDER BY class;
+ class | sum
+-------+-----
+ a | 3
+ b | 7
+ c | 11
+ d | 84
+ e | 66
+ f | 184
+(6 rows)
+
+ALTER TABLE f_star RENAME COLUMN f TO ff;
+ALTER TABLE e_star* RENAME COLUMN e TO ee;
+ALTER TABLE d_star* RENAME COLUMN d TO dd;
+ALTER TABLE c_star* RENAME COLUMN c TO cc;
+ALTER TABLE b_star* RENAME COLUMN b TO bb;
+ALTER TABLE a_star* RENAME COLUMN a TO aa;
+SELECT class, aa
+ FROM a_star* x
+ WHERE aa ISNULL;
+ class | aa
+-------+----
+ a |
+ b |
+ b |
+ c |
+ c |
+ d |
+ d |
+ d |
+ d |
+ d |
+ d |
+ d |
+ d |
+ e |
+ e |
+ e |
+ f |
+ f |
+ f |
+ f |
+ f |
+ f |
+ f |
+ f |
+(24 rows)
+
+-- As of Postgres 7.1, ALTER implicitly recurses,
+-- so this should be same as ALTER a_star*
+ALTER TABLE a_star RENAME COLUMN aa TO foo;
+SELECT class, foo
+ FROM a_star* x
+ WHERE x.foo >= 2;
+ class | foo
+-------+-----
+ a | 2
+ b | 3
+ b | 4
+ c | 5
+ c | 6
+ d | 7
+ d | 8
+ d | 9
+ d | 10
+ d | 11
+ d | 12
+ d | 13
+ d | 14
+ e | 15
+ e | 16
+ e | 17
+ e | 18
+ f | 19
+ f | 20
+ f | 21
+ f | 22
+ f | 24
+ f | 25
+ f | 26
+ f | 27
+(25 rows)
+
+ALTER TABLE a_star RENAME COLUMN foo TO aa;
+SELECT *
+ from a_star*
+ WHERE aa < 1000;
+ class | aa
+-------+----
+ a | 1
+ a | 2
+ b | 3
+ b | 4
+ c | 5
+ c | 6
+ d | 7
+ d | 8
+ d | 9
+ d | 10
+ d | 11
+ d | 12
+ d | 13
+ d | 14
+ e | 15
+ e | 16
+ e | 17
+ e | 18
+ f | 19
+ f | 20
+ f | 21
+ f | 22
+ f | 24
+ f | 25
+ f | 26
+ f | 27
+(26 rows)
+
+ALTER TABLE f_star ADD COLUMN f int4;
+UPDATE f_star SET f = 10;
+ALTER TABLE e_star* ADD COLUMN e int4;
+--UPDATE e_star* SET e = 42;
+SELECT * FROM e_star*;
+ class | aa | cc | ee | e
+-------+----+-------------+-----+---
+ e | 15 | hi carol | -1 |
+ e | 16 | hi bob | |
+ e | 17 | | -2 |
+ e | | hi michelle | -3 |
+ e | 18 | | |
+ e | | hi elisa | |
+ e | | | -4 |
+ f | 19 | hi claire | -5 |
+ f | 20 | hi mike | -6 |
+ f | 21 | hi marcel | |
+ f | 22 | | -7 |
+ f | | hi keith | -8 |
+ f | 24 | hi marc | |
+ f | 25 | | -9 |
+ f | 26 | | |
+ f | | hi allison | -10 |
+ f | | hi jeff | |
+ f | | | -11 |
+ f | 27 | | |
+ f | | hi carl | |
+ f | | | -12 |
+ f | | | |
+ f | | | |
+(23 rows)
+
+ALTER TABLE a_star* ADD COLUMN a text;
+NOTICE: merging definition of column "a" for child "d_star"
+-- That ALTER TABLE should have added TOAST tables.
+SELECT relname, reltoastrelid <> 0 AS has_toast_table
+ FROM pg_class
+ WHERE oid::regclass IN ('a_star', 'c_star')
+ ORDER BY 1;
+ relname | has_toast_table
+---------+-----------------
+ a_star | f
+ c_star | f
+(2 rows)
+
+--UPDATE b_star*
+-- SET a = text 'gazpacho'
+-- WHERE aa > 4;
+SELECT class, aa, a FROM a_star*;
+ class | aa | a
+-------+----+---
+ a | 1 |
+ a | 2 |
+ a | |
+ b | 3 |
+ b | 4 |
+ b | |
+ b | |
+ c | 5 |
+ c | 6 |
+ c | |
+ c | |
+ d | 7 |
+ d | 8 |
+ d | 9 |
+ d | 10 |
+ d | |
+ d | 11 |
+ d | 12 |
+ d | 13 |
+ d | |
+ d | |
+ d | |
+ d | 14 |
+ d | |
+ d | |
+ d | |
+ d | |
+ e | 15 |
+ e | 16 |
+ e | 17 |
+ e | |
+ e | 18 |
+ e | |
+ e | |
+ f | 19 |
+ f | 20 |
+ f | 21 |
+ f | 22 |
+ f | |
+ f | 24 |
+ f | 25 |
+ f | 26 |
+ f | |
+ f | |
+ f | |
+ f | 27 |
+ f | |
+ f | |
+ f | |
+ f | |
+(50 rows)
+
+--
+-- versions
+--
+--
+-- postquel functions
+--
+--
+-- mike does post_hacking,
+-- joe and sally play basketball, and
+-- everyone else does nothing.
+--
+SELECT p.name, name(p.hobbies) FROM ONLY person p;
+ name | name
+-------+-------------
+ mike | posthacking
+ joe | basketball
+ sally | basketball
+(3 rows)
+
+--
+-- as above, but jeff also does post_hacking.
+--
+SELECT p.name, name(p.hobbies) FROM person* p;
+ name | name
+-------+-------------
+ mike | posthacking
+ joe | basketball
+ sally | basketball
+ jeff | posthacking
+(4 rows)
+
+--
+-- the next two queries demonstrate how functions generate bogus duplicates.
+-- this is a "feature" ..
+--
+SELECT DISTINCT hobbies_r.name, name(hobbies_r.equipment) FROM hobbies_r
+ ORDER BY 1,2;
+ name | name
+-------------+---------------
+ basketball | hightops
+ posthacking | advil
+ posthacking | peet's coffee
+ skywalking | guts
+(4 rows)
+
+SELECT hobbies_r.name, (hobbies_r.equipment).name FROM hobbies_r;
+ name | name
+-------------+---------------
+ posthacking | advil
+ posthacking | peet's coffee
+ posthacking | advil
+ posthacking | peet's coffee
+ basketball | hightops
+ basketball | hightops
+ skywalking | guts
+(7 rows)
+
+--
+-- mike needs advil and peet's coffee,
+-- joe and sally need hightops, and
+-- everyone else is fine.
+--
+SELECT p.name, name(p.hobbies), name(equipment(p.hobbies)) FROM ONLY person p;
+ name | name | name
+-------+-------------+---------------
+ mike | posthacking | advil
+ mike | posthacking | peet's coffee
+ joe | basketball | hightops
+ sally | basketball | hightops
+(4 rows)
+
+--
+-- as above, but jeff needs advil and peet's coffee as well.
+--
+SELECT p.name, name(p.hobbies), name(equipment(p.hobbies)) FROM person* p;
+ name | name | name
+-------+-------------+---------------
+ mike | posthacking | advil
+ mike | posthacking | peet's coffee
+ joe | basketball | hightops
+ sally | basketball | hightops
+ jeff | posthacking | advil
+ jeff | posthacking | peet's coffee
+(6 rows)
+
+--
+-- just like the last two, but make sure that the target list fixup and
+-- unflattening is being done correctly.
+--
+SELECT name(equipment(p.hobbies)), p.name, name(p.hobbies) FROM ONLY person p;
+ name | name | name
+---------------+-------+-------------
+ advil | mike | posthacking
+ peet's coffee | mike | posthacking
+ hightops | joe | basketball
+ hightops | sally | basketball
+(4 rows)
+
+SELECT (p.hobbies).equipment.name, p.name, name(p.hobbies) FROM person* p;
+ name | name | name
+---------------+-------+-------------
+ advil | mike | posthacking
+ peet's coffee | mike | posthacking
+ hightops | joe | basketball
+ hightops | sally | basketball
+ advil | jeff | posthacking
+ peet's coffee | jeff | posthacking
+(6 rows)
+
+SELECT (p.hobbies).equipment.name, name(p.hobbies), p.name FROM ONLY person p;
+ name | name | name
+---------------+-------------+-------
+ advil | posthacking | mike
+ peet's coffee | posthacking | mike
+ hightops | basketball | joe
+ hightops | basketball | sally
+(4 rows)
+
+SELECT name(equipment(p.hobbies)), name(p.hobbies), p.name FROM person* p;
+ name | name | name
+---------------+-------------+-------
+ advil | posthacking | mike
+ peet's coffee | posthacking | mike
+ hightops | basketball | joe
+ hightops | basketball | sally
+ advil | posthacking | jeff
+ peet's coffee | posthacking | jeff
+(6 rows)
+
+SELECT name(equipment(hobby_construct(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_1a(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_1b(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_1c(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_2a(text 'skywalking'));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_2b(text 'skywalking'));
+ name
+---------------
+ advil
+ peet's coffee
+ hightops
+ guts
+(4 rows)
+
+SELECT hobbies_by_name('basketball');
+ hobbies_by_name
+-----------------
+ joe
+(1 row)
+
+SELECT name, overpaid(emp.*) FROM emp;
+ name | overpaid
+--------+----------
+ sharon | t
+ sam | t
+ bill | t
+ jeff | f
+ cim | f
+ linda | f
+(6 rows)
+
+--
+-- Try a few cases with SQL-spec row constructor expressions
+--
+SELECT * FROM equipment(ROW('skywalking', 'mer'));
+ name | hobby
+------+------------
+ guts | skywalking
+(1 row)
+
+SELECT name(equipment(ROW('skywalking', 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT *, name(equipment(h.*)) FROM hobbies_r h;
+ name | person | name
+-------------+--------+---------------
+ posthacking | mike | advil
+ posthacking | mike | peet's coffee
+ posthacking | jeff | advil
+ posthacking | jeff | peet's coffee
+ basketball | joe | hightops
+ basketball | sally | hightops
+ skywalking | | guts
+(7 rows)
+
+SELECT *, (equipment(CAST((h.*) AS hobbies_r))).name FROM hobbies_r h;
+ name | person | name
+-------------+--------+---------------
+ posthacking | mike | advil
+ posthacking | mike | peet's coffee
+ posthacking | jeff | advil
+ posthacking | jeff | peet's coffee
+ basketball | joe | hightops
+ basketball | sally | hightops
+ skywalking | | guts
+(7 rows)
+
+--
+-- functional joins
+--
+--
+-- instance rules
+--
+--
+-- rewrite rules
+--
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index f23fe8d870..aad070d48e 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
# ----------
# Another group of parallel tests
# ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan zedstore
# rules cannot run concurrently with any test that creates
# a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index ca200eb599..5ad9d90b58 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -134,6 +134,7 @@ test: misc_functions
test: sysviews
test: tsrf
test: tidscan
+test: zedstore
test: rules
test: psql
test: psql_crosstab
diff --git a/src/test/regress/sql/zedstore.sql b/src/test/regress/sql/zedstore.sql
new file mode 100644
index 0000000000..d987e70c4f
--- /dev/null
+++ b/src/test/regress/sql/zedstore.sql
@@ -0,0 +1,176 @@
+-- simple tests to iteratively build the zedstore
+-- create and drop works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+drop table t_zedstore;
+-- insert and select works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+insert into t_zedstore select i,i+1,i+2 from generate_series(1, 10)i;
+select * from t_zedstore;
+-- selecting only a few columns works
+select c1, c3 from t_zedstore;
+-- only a few columns in the output and a where clause works
+select c3 from t_zedstore where c2 > 5;
+
+-- Test abort works
+begin;
+insert into t_zedstore select i,i+1,i+2 from generate_series(21, 25)i;
+abort;
+insert into t_zedstore select i,i+1,i+2 from generate_series(31, 35)i;
+select * from t_zedstore;
+
+--
+-- Test indexing
+--
+create index on t_zedstore (c1);
+set enable_seqscan=off;
+set enable_indexscan=on;
+set enable_bitmapscan=off;
+
+-- index scan
+select * from t_zedstore where c1 = 5;
+
+-- index-only scan
+select c1 from t_zedstore where c1 = 5;
+
+-- bitmap scan
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+select c1, c2 from t_zedstore where c1 between 5 and 10;
+
+--
+-- Test DELETE and UPDATE
+--
+delete from t_zedstore where c2 = 5;
+select * from t_zedstore;
+delete from t_zedstore where c2 < 5;
+select * from t_zedstore;
+
+update t_zedstore set c2 = 100 where c1 = 8;
+select * from t_zedstore;
+
+--
+-- Test page deletion, by deleting a bigger range of values
+--
+insert into t_zedstore select i,i+1,i+2 from generate_series(10000, 15000)i;
+delete from t_zedstore where c1 >= 10000;
+
+--
+-- Test VACUUM
+--
+vacuum t_zedstore;
+select * from t_zedstore;
+
+--
+-- Test toasting
+--
+create table t_zedtoast(c1 int, t text) USING zedstore;
+insert into t_zedtoast select i, repeat('x', 10000) from generate_series(1, 10) i;
+
+select c1, length(t) from t_zedtoast;
+
+--
+-- Test NULL values
+--
+create table t_zednullvalues(c1 int, c2 int) USING zedstore;
+insert into t_zednullvalues values(1, NULL), (NULL, 2);
+select * from t_zednullvalues;
+select c2 from t_zednullvalues;
+update t_zednullvalues set c1 = 1, c2 = NULL;
+select * from t_zednullvalues;
+
+--
+-- Test COPY
+--
+create table t_zedcopy(a serial, b int, c text not null default 'stuff', d text,e text) USING zedstore;
+
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+9999 \N \\N \NN \N
+10000 21 31 41 51
+\.
+
+COPY t_zedcopy (b, d) from stdin;
+1 test_1
+\.
+
+COPY t_zedcopy (b, d) from stdin;
+2 test_2
+3 test_3
+4 test_4
+5 test_5
+\.
+
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+10001 22 32 42 52
+10002 23 33 43 53
+10003 24 34 44 54
+10004 25 35 45 55
+10005 26 36 46 56
+\.
+
+select * from t_zedcopy;
+COPY t_zedcopy (a, d, e) to stdout;
+
+--
+-- Also test delete and update on the table that was populated with COPY.
+-- This exercises splitting the array item. (A table not populated with
+-- COPY only contains single items, at the moment.)
+--
+
+delete from t_zedcopy where b = 4;
+select * from t_zedcopy;
+delete from t_zedcopy where b < 3;
+select * from t_zedcopy;
+
+update t_zedcopy set b = 100 where b = 5;
+select * from t_zedcopy;
+
+--
+-- Test zero column table
+--
+create table t_zwithzerocols() using zedstore;
+insert into t_zwithzerocols select t.* from t_zwithzerocols t right join generate_series(1,1) on true;
+select count(*) from t_zwithzerocols;
+
+-- Test for alter table add column
+create table t_zaddcol(a int) using zedstore;
+insert into t_zaddcol select * from generate_series(1, 3);
+-- rewrite case
+alter table t_zaddcol add column b int generated always as (a + 1) stored;
+select * from t_zaddcol;
+-- test alter table add column with no default
+create table t_zaddcol_simple(a int) using zedstore;
+insert into t_zaddcol_simple values (1);
+alter table t_zaddcol_simple add b int;
+select * from t_zaddcol_simple;
+insert into t_zaddcol_simple values(2,3);
+select * from t_zaddcol_simple;
+-- fixed length default value stored in catalog
+alter table t_zaddcol add column c int default 3;
+select * from t_zaddcol;
+-- variable length default value stored in catalog
+alter table t_zaddcol add column d text default 'abcdefgh';
+select d from t_zaddcol;
+-- insert after add column
+insert into t_zaddcol values (2);
+select * from t_zaddcol;
+insert into t_zaddcol (a, c, d) values (3,5, 'test_insert');
+select b,c,d from t_zaddcol;
+
+--
+-- Test TABLESAMPLE
+--
+-- regular test tablesample.sql doesn't directly work for zedstore as
+-- its using fillfactor to create specific block layout for
+-- heap. Hence, output differs between heap and zedstore table while
+-- sampling. We need to use many tuples here to have multiple logical
+-- blocks as don't have way to force TIDs spread / jump for zedstore.
+--
+CREATE TABLE t_ztablesample (id int, name text) using zedstore;
+INSERT INTO t_ztablesample
+ SELECT i, repeat(i::text, 2) FROM generate_series(0, 299) s(i);
+-- lets delete half (even numbered ids) rows to limit the output
+DELETE FROM t_ztablesample WHERE id%2 = 0;
+-- should return ALL visible tuples from SOME blocks
+SELECT ctid,t.id FROM t_ztablesample AS t TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+-- should return SOME visible tuples but from ALL the blocks
+SELECT ctid,id FROM t_ztablesample TABLESAMPLE BERNOULLI (50) REPEATABLE (0);
diff --git a/src/test/storageperf/driver.sql b/src/test/storageperf/driver.sql
new file mode 100644
index 0000000000..73981d1c94
--- /dev/null
+++ b/src/test/storageperf/driver.sql
@@ -0,0 +1,36 @@
+--
+-- Main script, to run all the tests, and print the results.
+--
+--
+
+-- First run the tests using heap.
+DROP SCHEMA IF EXISTS storagetest_heap CASCADE;
+CREATE SCHEMA storagetest_heap;
+SET search_path='storagetest_heap';
+
+CREATE TABLE results (testname text, val numeric) USING heap;
+
+SET default_table_access_method=heap;
+\i tests.sql
+
+
+-- Repeat with zedstore
+
+DROP SCHEMA IF EXISTS storagetest_zedstore CASCADE;
+CREATE SCHEMA storagetest_zedstore;
+SET search_path='storagetest_zedstore';
+
+CREATE TABLE results (testname text, val numeric) USING heap;
+
+SET default_table_access_method=zedstore;
+\i tests.sql
+
+
+SET search_path='public';
+
+SELECT COALESCE(h.testname, zs.testname) as testname,
+ h.val as heap,
+ zs.val as zedstore,
+ round(zs.val / h.val, 2) as "zedstore / heap"
+FROM storagetest_heap.results h
+FULL OUTER JOIN storagetest_zedstore.results zs ON (h.testname = zs.testname);
diff --git a/src/test/storageperf/sql/onecol.sql b/src/test/storageperf/sql/onecol.sql
new file mode 100644
index 0000000000..5cf18158c9
--- /dev/null
+++ b/src/test/storageperf/sql/onecol.sql
@@ -0,0 +1,38 @@
+-- Tests with a narrow, single-column table.
+
+CREATE UNLOGGED TABLE onecol (i int4);
+
+-- Populate the table with a bunch of INSERT ... SELECT statements.
+-- Measure how long it takes, and the resulting table size.
+select extract(epoch from now()) as before
+\gset
+
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+
+select extract(epoch from now()) as after
+\gset
+
+INSERT INTO results (testname, val) VALUES ('onecol, insert-select, size', pg_total_relation_size('onecol'));
+INSERT INTO results (testname, val) VALUES ('onecol, insert-select, time', :after - :before);
+
+COPY onecol TO '/tmp/onecol.data'; -- dump the data, for COPY test below.
+
+--
+-- Truncate and populate it again with the same data, but this time using COPY.
+--
+TRUNCATE onecol;
+
+select extract(epoch from now()) as before
+\gset
+
+COPY onecol FROM '/tmp/onecol.data';
+
+select extract(epoch from now()) as after
+\gset
+
+INSERT INTO results (testname, val) VALUES ('onecol, COPY, size', pg_total_relation_size('onecol'));
+INSERT INTO results (testname, val) VALUES ('onecol, COPY, time', :after - :before);
diff --git a/src/test/storageperf/tests.sql b/src/test/storageperf/tests.sql
new file mode 100644
index 0000000000..d1f25ed029
--- /dev/null
+++ b/src/test/storageperf/tests.sql
@@ -0,0 +1,3 @@
+-- Test "schedule". List all the tests you want to run here.
+
+\i sql/onecol.sql
base-commit: db6e2b4c52ade524f3db419d75084728e96e1f9c
--
2.19.1
Hi Ashwin,
- how to pass the "column projection list" to table AM? (as stated in
initial email, currently we have modified table am API to pass the
projection to AM)
We were working on a similar columnar storage using pluggable APIs; one
idea that we thought of was to modify the scan slot based on the targetlist
to have only the relevant columns in the scan descriptor. This way the
table AMs are passed a slot with only relevant columns in the descriptor.
Today we do something similar for the result slot using
ExecInitResultTypeTL(); now do it for the scan tuple slot as well. So
somewhere after creating the scan slot using ExecInitScanTupleSlot(), call
a table am handler API to modify the scan tuple slot based on the
targetlist, a probable name for the new table am handler would be:
exec_init_scan_slot_tl(PlanState *planstate, TupleTableSlot *slot).
This way the scan AM handlers like getnextslot are passed a slot that only
has the relevant columns in its descriptor. One issue though is
that the beginscan is not passed the slot, so if some memory allocation
needs to be done based on the column list, it can't be done in beginscan.
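To make that concrete, here is a rough sketch of how such a callback could be wired up. The callback name is the one proposed above; the TableAmRoutine member and the call site in ExecInitSeqScan() are my assumptions for illustration, not code from any posted patch:

/* Hypothetical new member of TableAmRoutine (sketch only) */
void        (*exec_init_scan_slot_tl) (PlanState *planstate,
                                       TupleTableSlot *slot);

/*
 * In ExecInitSeqScan(), after the scan slot has been created the usual
 * way, the executor could let the AM narrow the slot's descriptor down
 * to the columns the plan actually references:
 */
ExecInitScanTupleSlot(estate, &scanstate->ss,
                      RelationGetDescr(scanstate->ss.ss_currentRelation),
                      table_slot_callbacks(scanstate->ss.ss_currentRelation));

if (scanstate->ss.ss_currentRelation->rd_tableam->exec_init_scan_slot_tl)
    scanstate->ss.ss_currentRelation->rd_tableam->exec_init_scan_slot_tl(
        (PlanState *) scanstate, scanstate->ss.ss_ScanTupleSlot);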
Let me know what you think.
regards,
Ajin Cherian
Fujitsu Australia
On Thu, May 23, 2019 at 3:56 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
We (Heikki, me and Melanie) are continuing to build Zedstore. Wish to
share the recent additions and modifications. Attaching a patch
with the latest code. Link to github branch [1] to follow
along. The approach we have been leaning towards is to build required
functionality, get passing the test and then continue to iterate to
optimize the same. It's still work-in-progress.
Sharing the details now, as have reached our next milestone for
Zedstore. All table AM API's are implemented for Zedstore (except
compute_xid_horizon_for_tuples, seems need test for it first).

Current State:
- A new type of item added to Zedstore "Array item", to boost
compression and performance. Based on Konstantin's performance
experiments [2] and inputs from Tomas Vondra [3], this is
added. Array item holds multiple datums, with consecutive TIDs and
the same visibility information. An array item saves space compared
to multiple single items, by leaving out repetitive UNDO and TID
fields. An array item cannot mix NULLs and non-NULLs. So, those
experiments should result in improved performance now. Inserting
data via COPY creates array items currently. Code for insert has not
been modified from last time. Making singleton inserts or insert
into select, performant is still on the todo list.

- Now we have a separate and dedicated meta-column btree alongside
rest of the data column btrees. This special or first btree for
meta-column is used to assign TIDs for tuples, track the UNDO
location which provides visibility information. Also, this special
btree, which always exists, helps to support zero-column tables
(which can be a result of ADD COLUMN DROP COLUMN actions as
well). Plus, having meta-data stored separately from data, helps to
get better compression ratios. And also helps to further simplify
the overall design/implementation as for deletes just need to edit
the meta-column and avoid touching the actual data btrees. Index
scans can just perform visibility checks based on this meta-column
and fetch required datums only for visible tuples. For tuple locks
also just need to access this meta-column only. Previously, every
column btree used to carry the same undo pointer. Thus visibility
check could be potentially performed, with the past layout, using
any column. But considering overall simplification new layout
provides it's fine to give up on that aspect. Having dedicated
meta-column highly simplified handling for add columns with default
and null values, as this column deterministically provides all the
TIDs present in the table, which can't be said for any other data
columns due to default or null values during add column.

- Free Page Map implemented. The Free Page Map keeps track of unused
pages in the relation. The FPM is also a b-tree, indexed by physical
block number. To be more compact, it stores "extents", i.e. block
ranges, rather than just blocks, when possible. An interesting paper [4]
on how modern filesystems manage space acted as a good source for ideas.

- Tuple locks implemented
- Serializable isolation handled
- With "default_table_access_method=zedstore"
- 31 out of 194 failing regress tests
- 10 out of 86 failing isolation tests
Many of the current failing tests are due to plan differences, like
Index scans selected for zedstore over IndexOnly scans, as zedstore
doesn't yet have visibility map. I am yet to give a thought on
index-only scans. Or plan diffs due to table size differences between
heap and zedstore.

Next few milestones we wish to hit for Zedstore:
- Make check regress green
- Make check isolation green
- Zedstore crash safe (means also replication safe). Implement WAL
logs
- Performance profiling and optimizations for Insert, Selects, Index
Scans, etc...
- Once UNDO framework lands in Upstream, Zedstore leverages it instead
of its own version of UNDO

Open questions / discussion items:
- how best to get "column projection list" from planner? (currently,
we walk plan and find the columns required for the query in
the executor, refer GetNeededColumnsForNode())

- how to pass the "column projection list" to table AM? (as stated in
initial email, currently we have modified table am API to pass the
projection to AM)

- TID treated as (block, offset) in current indexing code
- Physical tlist optimization? (currently, we disabled it for
zedstore)

Team:
Melanie joined Heikki and me to write code for zedstore. Majority of
the code continues to be contributed by Heikki. We are continuing to
have fun building column store implementation and iterate
aggressively.

References:
1] https://github.com/greenplum-db/postgres/tree/zedstore
2]
/messages/by-id/3978b57e-fe25-ca6b-f56c-48084417e115@postgrespro.ru
3]
/messages/by-id/20190415173254.nlnk2xqhgt7c5pta@development
4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf
On Thu, May 23, 2019 at 7:30 PM Ajin Cherian <itsajin@gmail.com> wrote:
Hi Ashwin,
- how to pass the "column projection list" to table AM? (as stated in
initial email, currently we have modified table am API to pass the
projection to AM)We were working on a similar columnar storage using pluggable APIs; one
idea that we thought of was to modify the scan slot based on the targetlist
to have only the relevant columns in the scan descriptor. This way the
table AMs are passed a slot with only relevant columns in the descriptor.
Today we do something similar to the result slot using
ExecInitResultTypeTL(), now do it to the scan tuple slot as well. So
somewhere after creating the scan slot using ExecInitScanTupleSlot(), call
a table am handler API to modify the scan tuple slot based on the
targetlist, a probable name for the new table am handler would be:
exec_init_scan_slot_tl(PlanState *planstate, TupleTableSlot *slot).
Interesting.
Though this reads as a hacky and not-so-clean approach to me. Reasons:
- The memory allocation and initialization for the slot descriptor is
  already done in ExecInitScanTupleSlot(); exec_init_scan_slot_tl()
  would redo a lot of that work. ExecInitScanTupleSlot() ideally just
  points to the tupleDesc from the Relation object, but
  exec_init_scan_slot_tl() would have to free the existing tupleDesc
  and reallocate a fresh one. Plus, it can't point to the Relation's
  tuple desc and essentially needs to craft one.
- As discussed in thread [1], several places want to use different
  slots for the same scan, which means the descriptor would have to be
  modified every time on such occasions, even if it remains the same
  throughout the scan. Some extra code could be added to keep the old
  tuple descriptor around and reuse it for the next slot, but that
  again seems like added code complexity.
- The AM needs to know which attributes to scan in terms of the
  relation's attribute numbers. How would the tupledesc convey that?
  TupleDescData's attrs currently carries the info for attnum at
  attrs[attnum - 1] (see the small sketch after this list). If the
  TupleDesc needs to convey an arbitrary subset of attributes to scan,
  that relationship has to be broken: attrs[offset] would then describe
  some attribute of the relation with attnum != offset + 1, and I am
  not sure how many places in the code rely on that relationship to get
  information.
- The tupledesc provides a lot of information, not just the attribute
  numbers to scan. For example, it provides the default value for a
  column in TupleConstr. If the AM layer has to modify the existing
  slot's tupledesc, it would have to copy over such information as
  well. Today this information is fetched using attnum as the offset
  into the constr->missing array. If this information is to be
  retained, how would the constr array be constructed? Would the array
  contain only values for the columns to scan, or the constr array
  as-is from the Relation's tuple descriptor, as it does today? It
  seems constructing the constr array fresh would be an overhead, and
  not constructing it fresh would leave a mismatch between natts and
  the array elements.
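For reference, the attnum relationship mentioned in the point above is roughly the following invariant on a relation's TupleDesc; this is just an illustrative check, not code from any patch:

#include "postgres.h"
#include "access/tupdesc.h"

/*
 * Illustrative only: in a relation's TupleDesc, the attribute stored at
 * zero-based position i has attnum == i + 1.  A "projected" descriptor
 * that carries only an arbitrary subset of the columns would break this,
 * which is the concern raised above.
 */
static void
check_attno_invariant(TupleDesc tupdesc)
{
    for (int i = 0; i < tupdesc->natts; i++)
        Assert(TupleDescAttr(tupdesc, i)->attnum == i + 1);
}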
It seems that with the proposed exec_init_scan_slot_tl() API, one would
have to call it after beginscan and before calling getnextslot, to
provide the column projection list to the AM. The special dedicated API
we have for Zedstore to pass down the column projection list needs the
same calling convention, which is the reason I don't like it and am
trying to find an alternative. But at least the API we added for
Zedstore seems, in comparison, much simpler, more generic and flexible,
as it lets the AM decide what it wishes to do with the list: the AM can
fiddle with the slot's TupleDescriptor if it wishes, or handle the
column projection some other way.
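For comparison, the information the Zedstore-side API wants is essentially just a set of needed attribute numbers. A minimal sketch of collecting that set is below; the helper itself is hypothetical (in the zedstore branch this walk is done by GetNeededColumnsForNode(), and the patch passes such a bitmap into the scan-begin calls, e.g. table_beginscan_parallel() gains a Bitmapset *proj argument):

#include "postgres.h"

#include "access/sysattr.h"
#include "nodes/bitmapset.h"
#include "optimizer/optimizer.h"    /* pull_varattnos() */

/*
 * Sketch: collect the attribute numbers a scan node references in its
 * targetlist and quals.  pull_varattnos() stores each attno offset by
 * FirstLowInvalidHeapAttributeNumber so that system columns fit in the
 * set; a consumer shifts them back before mapping them to the per-column
 * b-trees.
 */
static Bitmapset *
scan_column_projection(Node *targetlist, Node *quals, Index scanrelid)
{
    Bitmapset  *attrs = NULL;

    pull_varattnos(targetlist, scanrelid, &attrs);
    pull_varattnos(quals, scanrelid, &attrs);

    return attrs;
}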
So this way the scan am handlers like getnextslot is passed a slot only
having the relevant columns in the scan descriptor. One issue though is
that the beginscan is not passed the slot, so if some memory allocation
needs to be done based on the column list, it can't be done in beginscan.
Let me know what you think.
Yes, ideally we would like to have this information available at
beginscan if possible. But if it can't be, it seems fine to delay such
allocations to the first call to getnextslot and friends; that's how we
do it today for Zedstore.
1]
/messages/by-id/20190508214627.hw7wuqwawunhynj6@alap3.anarazel.de
FWIW - building this against latest 12 beta1:
Loading and examining the standard pgbench schema (with the old names,
sorry) in v10 (standard heap) and v12 (zedstore)
v10:
bench=# \i load.sql
COPY 100
Time: 16.335 ms
COPY 1000
Time: 16.748 ms
COPY 10000000
Time: 50276.230 ms (00:50.276)
bench=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+----------+-------+----------+------------+-------------
public | accounts | table | postgres | 1281 MB |
public | branches | table | postgres | 8192 bytes |
public | history | table | postgres | 0 bytes |
public | tellers | table | postgres | 72 kB |
v12+zedstore:
bench=# \i load.sql
COPY 100
Time: 0.656 ms
COPY 1000
Time: 3.573 ms
COPY 10000000
Time: 26244.832 ms (00:26.245)
bench=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+----------+-------+----------+---------+-------------
public | accounts | table | postgres | 264 MB |
public | branches | table | postgres | 56 kB |
public | history | table | postgres | 0 bytes |
public | tellers | table | postgres | 64 kB |
So a good improvement in load times and on disk footprint! Also note
that I did not build with lz4 so looks like you guys have fixed the
quirks with compression making things bigger.
regards
Mark
It's really cool and very good progress.
I'm interested in whether SIMD/JIT will be supported.
best wishes
TY
From: Ashwin Agrawal [mailto:aagrawal@pivotal.io]
The objective is to gather feedback on design and approach to the same.
The implementation has core basic pieces working but not close to complete.
Thank you for proposing a very interesting topic. Are you thinking of including this in PostgreSQL 13 if possible?
* All Indexes supported
...
work. Btree indexes can be created. Btree and bitmap index scans work.
Does Zedstore allow to create indexes of existing types on the table (btree, GIN, BRIN, etc.) and perform index scans (point query, range query, etc.)?
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together.
...
This way of laying out the data also easily allows for hybrid row-column
store, where some columns are stored together, and others have a dedicated
B-tree. Need to have user facing syntax to allow specifying how to group
the columns.
...
Zedstore Table can be
created using command: CREATE TABLE <name> (column listing) USING zedstore;
Are you aiming to enable Zedstore to be used for HTAP, i.e. the same table can be accessed simultaneously for both OLTP and analytics with the minimal performance impact on OLTP? (I got that impression from the word "hybrid".)
If yes, is the assumption that only a limited number of columns are to be stored in columnar format (for efficient scanning), and many other columns are to be stored in row format for efficient tuple access?
Are those row-formatted columns stored in the same file as the column-formatted columns, or in a separate file?
Regarding the column grouping, can I imagine HBase and Cassandra?
How could the current CREATE TABLE syntax support column grouping? (I guess CREATE TABLE needs a syntax for columnar store, and Zedstore needs to be incorporated in core, not as an extension...)
A column store uses the same structure but we have *multiple* B-trees, one
for each column, all indexed by TID. The B-trees for all columns are stored
in the same physical file.
Did you think that it's not a good idea to have a different file for each group of columns? Is that because we can't expect physical adjacency of data blocks on disk even if we separate a column in a separate file?
I thought a separate file for each group of columns would be easier and less error-prone to implement and debug. Adding and dropping the column group would also be very easy and fast.
Regards
Takayuki Tsunakawa
On Sun, Jun 30, 2019 at 7:59 PM Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:
From: Ashwin Agrawal [mailto:aagrawal@pivotal.io]
The objective is to gather feedback on design and approach to the same.
The implementation has core basic pieces working but not close to complete.
Thank you for proposing a very interesting topic. Are you thinking of
including this in PostgreSQL 13 if possible?

* All Indexes supported
...
work. Btree indexes can be created. Btree and bitmap index scans work.
Does Zedstore allow to create indexes of existing types on the table
(btree, GIN, BRIN, etc.) and perform index scans (point query, range query,
etc.)?
Yes, all index types work for zedstore and allow point or range queries.
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together....
This way of laying out the data also easily allows for hybrid row-column
store, where some columns are stored together, and others have a dedicated
B-tree. Need to have user facing syntax to allow specifying how to group
the columns....
Zedstore Table can be
created using command: CREATE TABLE <name> (column listing) USING zedstore;
Are you aiming to enable Zedstore to be used for HTAP, i.e. the same table
can be accessed simultaneously for both OLTP and analytics with the minimal
performance impact on OLTP? (I got that impression from the word "hybrid".)
Well, "hybrid" is more to convey that a compressed row store and a column
store can be supported with the same design; it really wasn't referring
to HTAP. In general, the goal we are moving towards is a column store
that is extremely efficient at analytics but still able to support all
the OLTP operations (with minimal performance or storage size impact).
When making trade-offs between different design choices, if both can't
be met, the preference is towards analytics.
If yes, is the assumption that only a limited number of columns are to be
stored in columnar format (for efficient scanning), and many other columns
are to be stored in row format for efficient tuple access?
Yes; if it's known that certain columns are always accessed together, it
is better to store them together and avoid the tuple-formation cost.
Though it's still to be seen whether compression plays a role, and
whether storing and compressing each column individually can still be a
winner compared to compressing the different columns together as a blob,
i.e. whether the saving on IO cost offsets the tuple-formation cost or
not.
Are those row-formatted columns stored in the same file as the
column-formatted columns, or in a separate file?
Currently, we are focused on just getting the pure column store working
and hence haven't coded anything for the hybrid layout yet. But at least
right now the thought is it would be in the same file.
Regarding the column grouping, can I imagine HBase and Cassandra?
How could the current CREATE TABLE syntax support column grouping? (I
guess CREATE TABLE needs a syntax for columnar store, and Zedstore need to
be incorporated in core, not as an extension...)
When column grouping comes up, yes, we will need to modify the CREATE
TABLE syntax; we are yet to reach that point in development.
A column store uses the same structure but we have *multiple* B-trees,
one
for each column, all indexed by TID. The B-trees for all columns are
stored
in the same physical file.
Did you think that it's not a good idea to have a different file for each
group of columns? Is that because we can't expect physical adjacency of
data blocks on disk even if we separate a column in a separate file?

I thought a separate file for each group of columns would be easier and
less error-prone to implement and debug. Adding and dropping the column
group would also be very easy and fast.
Currently, each group is a single column (until we have column
families), and having a file for each column definitely seems like a bad
idea, as it just explodes the number of files. A separate file may have
an advantage from a pre-fetching point of view, but yes, physical
adjacency of data blocks can't be expected, plus the access pattern will
anyway involve reading multiple files (if each column is stored in a
separate file).
I doubt a file per group makes anything easier to implement or debug; I
feel it's actually the reverse. Storing everything in a single file, but
in separate blocks, keeps the logic contained inside the AM layer, and
we don't have to write special code, for example for DROP TABLE, to
delete the files for all the groups, or when moving the table to a
different tablespace, and all such complications.
Adding and dropping a column group can, regardless, be made easy and
fast, with the blocks for that group added or marked for reuse within
the same file.
Thank you for the questions.
On Thu, May 30, 2019 at 8:07 AM DEV_OPS <devops@ww-it.cn> wrote:
it's really cool and very good progress,
I'm interested in whether SIMD/JIT will be supported.
That's something outside of the Zedstore work directly, at least for
now. The intent is to work with the current executor code, or enhance it
only where needed. If the current executor code supports something, that
would work for Zedstore; but any other enhancements to the executor will
be a separate undertaking.
Hi Ashwin,
I tried playing around with the zedstore code a bit today and there
are couple questions that came into my mind.
1) Can zedstore tables be vacuumed? If yes, does VACUUM on zedstore
table set the VM bits associated with it.
2) Is there a chance that IndexOnlyScan would ever be required for
zedstore tables considering the design approach taken for it?
Further, I tried creating a zedstore table with a btree index on one of
its columns and loaded around 50 lakh (5 million) records into the table. When the
indexed column was scanned (with enable_seqscan flag set to off), it
went for IndexOnlyScan and that took around 15-20 times more than it
would take for IndexOnly Scan on heap table just because IndexOnlyScan
in zedstore always goes to heap as the visibility check fails.
However, the seqscan on zedstore table is quite faster than seqscan on
heap table because the time taken for I/O is quite less in case for
zedstore.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Wed, Aug 14, 2019 at 2:51 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:
Hi Ashwin,
I tried playing around with the zedstore code a bit today and there
are couple questions that came into my mind.
Great! Thank You.
1) Can zedstore tables be vacuumed? If yes, does VACUUM on zedstore
table set the VM bits associated with it.
Zedstore tables can be vacuumed, though on vacuum minimal work is
performed compared to heap. The full table is not scanned. Only the UNDO
log is truncated/discarded based on RecentGlobalXmin, and only the
TidTree (meta-column) is scanned to find dead tuples and clean up the
index entries for them.
Currently, zedstore does not use the VM at all, so no operation touches
it.
2) Is there a chance that IndexOnlyScan would ever be required for
zedstore tables considering the design approach taken for it?
We have not given much thought to IndexOnlyScans so far, but I think
IndexOnlyScan would definitely be beneficial for zedstore as well. Even
for normal index scans, fetching as many columns as possible from the
index itself and only getting the rest of the required columns from the
table would be good for zedstore; it would help to further cut down
IO. Ideally, for visibility checking only the TidTree needs to be
scanned, so the cost of the check is much lower compared to heap (when
the VM can't be consulted), but it is still a cost. Also, after vacuum
trims the UNDO log, the visibility checks are pretty cheap. Still, given
all that, having a VM-type structure to optimize this further would
help.
Further, I tried creating a zedstore table with btree index on one of
it's column and loaded around 50 lacs record into the table. When the
indexed column was scanned (with enable_seqscan flag set to off), it
went for IndexOnlyScan and that took around 15-20 times more than it
would take for IndexOnly Scan on heap table just because IndexOnlyScan
in zedstore always goes to heap as the visibility check fails.
However, the seqscan on zedstore table is quite faster than seqscan on
heap table because the time taken for I/O is quite less in case for
zedstore.
Thanks for reporting, we will look into it; we should be able to
optimize it. Given that no VM exists, IndexOnlyScans on zedstore
currently behave more or less like IndexScans. The planner picks
IndexOnlyScans for zedstore mostly due to incorrect values for
reltuples, relpages, and relallvisible.
We have been focused on implementing and optimizing the AM pieces, so
not much work has been done on planner estimates and tuning yet. As a
first step towards that, getting the needed columns in the planner
instead of the executor is proposed in [1]. Once that bakes, we will use
it to improve the planner estimates. Also, ANALYZE needs work to
properly reflect reltuples and relpages so that it influences the
planner correctly.
1]
/messages/by-id/CAAKRu_ZQ0Jy7LfZDCY0JdxChdpja9rf-S8Y5+U4vX7cYJd62dA@mail.gmail.com
On 14/08/2019 20:32, Ashwin Agrawal wrote:
On Wed, Aug 14, 2019 at 2:51 AM Ashutosh Sharma wrote:
2) Is there a chance that IndexOnlyScan would ever be required for
zedstore tables considering the design approach taken for it?

We have not given much thought to IndexOnlyScans so far. But I think
IndexOnlyScan definitely would be beneficial for zedstore as
well. Even for normal index scans as well, fetching as many columns
possible from Index itself and only getting rest of required columns
from the table would be good for zedstore. It would help to further
cut down IO. Ideally, for visibility checking only TidTree needs to be
scanned and visibility checked with the same, so the cost of checking
is much lower compared to heap (if VM can't be consulted) but still is
a cost. Also, with vacuum, if UNDO log gets trimmed, the visibility
checks are pretty cheap. Still given all that, having VM type thing to
optimize the same further would help.
Hmm, yeah. An index-only scan on a zedstore table could perform the "VM
checks" by checking the TID tree in the zedstore. It's not as compact as
the 2 bits per heap page in the heapam's visibility map, but it's pretty good.
Further, I tried creating a zedstore table with btree index on one of
it's column and loaded around 50 lacs record into the table. When the
indexed column was scanned (with enable_seqscan flag set to off), it
went for IndexOnlyScan and that took around 15-20 times more than it
would take for IndexOnly Scan on heap table just because IndexOnlyScan
in zedstore always goes to heap as the visibility check fails.
Currently, an index-only scan on zedstore should be pretty much the same
speed as a regular index scan. All the visibility checks will fail, and
you end up fetching every row from the table, just like a regular index
scan. So I think what you're seeing is that the index fetches on a
zedstore table are much slower than on heap.
Ideally, on a column store the index fetches would only fetch the needed
columns, but I don't think that's been implemented yet, so all the
columns are fetched. That can make a big difference, if you have a wide
table with lots of columns, but only actually need a few of them. Was
your test case something like that?
We haven't spent much effort on optimizing index fetches yet, so I hope
there are many other little tweaks there as well that we can do to make
it faster.
- Heikki
We've continued hacking on Zedstore, here's a new patch version against
current PostgreSQL master (commit f1bf619acdf). If you want to follow
the development in real-time, we're working on this branch:
https://github.com/greenplum-db/postgres/tree/zedstore
If you want to do performance testing with this, make sure you configure
with the --with-lz4 option. Otherwise, you'll get pglz compression,
which is *much* slower.
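For anyone curious what the flag changes, the compile-time shape is roughly the following; the function and its call site are hypothetical, only the configure-provided USE_LZ4 define and the two library calls are real:

#include "postgres.h"

#include "common/pg_lzcompress.h"
#ifdef USE_LZ4
#include <lz4.h>
#endif

/*
 * Sketch only: compress one chunk of item data, using LZ4 when the
 * server was built with --with-lz4 and falling back to pglz otherwise.
 * Returns the compressed size, or 0/-1 if the data was not compressible.
 */
static int32
zs_compress_chunk(const char *src, int32 srclen, char *dst, int32 dstcap)
{
#ifdef USE_LZ4
    /* LZ4_compress_default() is the symbol configure probes for. */
    return (int32) LZ4_compress_default(src, dst, srclen, dstcap);
#else
    /* pglz sizes its output via PGLZ_MAX_OUTPUT(); dstcap is unused here. */
    (void) dstcap;
    return pglz_compress(src, srclen, dst, PGLZ_strategy_always);
#endif
}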
Major TODOs:
* Make it crash-safe, by WAL-logging.
* Teach the planner and executor to pass down the list of columns
needed. Currently, many plans will unnecessarily fetch columns that are
not needed.
* Make visibility checks against the TID tree in index-only scans.
* zedstore-toast pages are currently leaked, so you'll get a lot of
bloat if you delete/update rows with large datums
* Use the UNDO framework that's been discussed on another thread.
There's UNDO-logging built into zedstore at the moment, but it's not
very optimized.
* Improve free space management. Pages that become empty are currently
recycled, but space on pages that are not completely empty is not
reused, and half-empty pages are not merged.
* Implement TID recycling. Currently, TIDs are allocated in increasing
order, and after all 2^48 TIDs have been used, even if the rows have
been deleted since, no more tuples can be inserted.
- Heikki
Attachments:
v3-zedstore.patch (text/x-patch)
diff --git a/configure b/configure
index 2c98e80c195..97973180296 100755
--- a/configure
+++ b/configure
@@ -699,6 +699,7 @@ with_gnu_ld
LD
LDFLAGS_SL
LDFLAGS_EX
+with_lz4
with_zlib
with_system_tzdata
with_libxslt
@@ -863,6 +864,7 @@ with_libxml
with_libxslt
with_system_tzdata
with_zlib
+with_lz4
with_gnu_ld
enable_largefile
enable_float4_byval
@@ -1569,6 +1571,7 @@ Optional Packages:
--with-system-tzdata=DIR
use system time zone data in DIR
--without-zlib do not use Zlib
+ --with-lz4 build with LZ4 support
--with-gnu-ld assume the C compiler uses GNU ld [default=no]
Some influential environment variables:
@@ -8305,6 +8308,41 @@ fi
+#
+# LZ4
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with LZ4 support" >&5
+$as_echo_n "checking whether to build with LZ4 support... " >&6; }
+
+
+
+# Check whether --with-lz4 was given.
+if test "${with_lz4+set}" = set; then :
+ withval=$with_lz4;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LZ4 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-lz4 option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_lz4=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_lz4" >&5
+$as_echo "$with_lz4" >&6; }
+
+
#
# Assignments
#
@@ -11795,6 +11833,56 @@ fi
fi
+if test "$with_lz4" = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for LZ4_compress_default in -llz4" >&5
+$as_echo_n "checking for LZ4_compress_default in -llz4... " >&6; }
+if ${ac_cv_lib_lz4_LZ4_compress_default+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-llz4 $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char LZ4_compress_default ();
+int
+main ()
+{
+return LZ4_compress_default ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_lz4_LZ4_compress_default=yes
+else
+ ac_cv_lib_lz4_LZ4_compress_default=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_lz4_LZ4_compress_default" >&5
+$as_echo "$ac_cv_lib_lz4_LZ4_compress_default" >&6; }
+if test "x$ac_cv_lib_lz4_LZ4_compress_default" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBLZ4 1
+_ACEOF
+
+ LIBS="-llz4 $LIBS"
+
+else
+ as_fn_error $? "library 'lz4' is required for LZ4 support" "$LINENO" 5
+fi
+
+fi
+
if test "$enable_spinlocks" = yes; then
$as_echo "#define HAVE_SPINLOCKS 1" >>confdefs.h
@@ -12994,6 +13082,36 @@ Use --without-zlib to disable zlib support." "$LINENO" 5
fi
+fi
+
+if test "$with_lz4" = yes; then
+ for ac_header in lz4.h
+do :
+ ac_fn_c_check_header_mongrel "$LINENO" "lz4.h" "ac_cv_header_lz4_h" "$ac_includes_default"
+if test "x$ac_cv_header_lz4_h" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LZ4_H 1
+_ACEOF
+
+else
+ for ac_header in lz4.h
+do :
+ ac_fn_c_check_header_mongrel "$LINENO" "lz4.h" "ac_cv_header_lz4_h" "$ac_includes_default"
+if test "x$ac_cv_header_lz4_h" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LZ4_H 1
+_ACEOF
+
+else
+ as_fn_error $? "lz4.h header file is required for LZ4" "$LINENO" 5
+fi
+
+done
+
+fi
+
+done
+
fi
if test "$with_gssapi" = yes ; then
diff --git a/configure.in b/configure.in
index 33edfd765ba..7a6684505ae 100644
--- a/configure.in
+++ b/configure.in
@@ -964,6 +964,16 @@ PGAC_ARG_BOOL(with, zlib, yes,
[do not use Zlib])
AC_SUBST(with_zlib)
+#
+# LZ4
+#
+AC_MSG_CHECKING([whether to build with LZ4 support])
+PGAC_ARG_BOOL(with, lz4, no,
+ [build with LZ4 support],
+ [AC_DEFINE([USE_LZ4], 1, [Define to 1 to build with LZ4 support. (--with-lz4)])])
+AC_MSG_RESULT([$with_lz4])
+AC_SUBST(with_lz4)
+
#
# Assignments
#
@@ -1150,6 +1160,10 @@ failure. It is possible the compiler isn't looking in the proper directory.
Use --without-zlib to disable zlib support.])])
fi
+if test "$with_lz4" = yes; then
+ AC_CHECK_LIB(lz4, LZ4_compress_default, [], [AC_MSG_ERROR([library 'lz4' is required for LZ4 support])])
+fi
+
if test "$enable_spinlocks" = yes; then
AC_DEFINE(HAVE_SPINLOCKS, 1, [Define to 1 if you have spinlocks.])
else
@@ -1363,6 +1377,11 @@ failure. It is possible the compiler isn't looking in the proper directory.
Use --without-zlib to disable zlib support.])])
fi
+if test "$with_lz4" = yes; then
+ AC_CHECK_HEADERS(lz4.h, [],
+ [AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is required for LZ4])])])
+fi
+
if test "$with_gssapi" = yes ; then
AC_CHECK_HEADERS(gssapi/gssapi.h, [],
[AC_CHECK_HEADERS(gssapi.h, [], [AC_MSG_ERROR([gssapi.h header file is required for GSSAPI])])])
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 0880e0a8bbb..6d36f3bd262 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \
- table tablesample transam
+ table tablesample transam zedstore
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 4c29261256a..adeb1d2aa04 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -89,7 +89,7 @@ ginFindLeafPage(GinBtree btree, bool searchMode,
stack->predictNumber = 1;
if (rootConflictCheck)
- CheckForSerializableConflictIn(btree->index, NULL, stack->buffer);
+ CheckForSerializableConflictIn(btree->index, NULL, btree->rootBlkno);
for (;;)
{
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 439a91b3e61..d7b52476817 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -246,7 +246,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
* tree, so it conflicts with all serializable scans. All scans acquire a
* predicate lock on the metabuffer to represent that.
*/
- CheckForSerializableConflictIn(index, NULL, metabuffer);
+ CheckForSerializableConflictIn(index, NULL, GIN_METAPAGE_BLKNO);
if (collector->sumsize + collector->ntuples * sizeof(ItemIdData) > GinListPageSize)
{
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 55eab146173..046a20a3d41 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -221,7 +221,7 @@ ginEntryInsert(GinState *ginstate,
return;
}
- CheckForSerializableConflictIn(ginstate->index, NULL, stack->buffer);
+ CheckForSerializableConflictIn(ginstate->index, NULL, BufferGetBlockNumber(stack->buffer));
/* modify an existing leaf entry */
itup = addItemPointersToLeafTuple(ginstate, itup,
items, nitem, buildStats, stack->buffer);
@@ -230,7 +230,7 @@ ginEntryInsert(GinState *ginstate,
}
else
{
- CheckForSerializableConflictIn(ginstate->index, NULL, stack->buffer);
+ CheckForSerializableConflictIn(ginstate->index, NULL, BufferGetBlockNumber(stack->buffer));
/* no match, so construct a new leaf entry */
itup = buildFreshLeafTuple(ginstate, attnum, key, category,
items, nitem, buildStats, stack->buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0cc87911d6b..efb1a0cffa3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1263,7 +1263,7 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
* Check for any rw conflicts (in serializable isolation level) just
* before we intend to modify the page
*/
- CheckForSerializableConflictIn(state->r, NULL, stack->buffer);
+ CheckForSerializableConflictIn(state->r, NULL, BufferGetBlockNumber(stack->buffer));
/* Insert the tuple(s) to the page, splitting the page if necessary */
is_split = gistplacetopage(state->r, state->freespace, giststate,
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 89876d2ccd0..c87fa5970e3 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -88,7 +88,7 @@ restart_insert:
&usedmetap);
Assert(usedmetap != NULL);
- CheckForSerializableConflictIn(rel, NULL, buf);
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(buf));
/* remember the primary bucket buffer to release the pin on it at end. */
bucket_buf = buf;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 718f07f4f65..ccbb5daa099 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -446,7 +446,7 @@ heapgetpage(TableScanDesc sscan, BlockNumber page)
else
valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
- CheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
+ heap_CheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
&loctup, buffer, snapshot);
if (valid)
@@ -668,7 +668,7 @@ heapgettup(HeapScanDesc scan,
snapshot,
scan->rs_cbuf);
- CheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
+ heap_CheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
tuple, scan->rs_cbuf,
snapshot);
@@ -1479,7 +1479,7 @@ heap_fetch(Relation relation,
if (valid)
PredicateLockTuple(relation, tuple, snapshot);
- CheckForSerializableConflictOut(valid, relation, tuple, buffer, snapshot);
+ heap_CheckForSerializableConflictOut(valid, relation, tuple, buffer, snapshot);
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
@@ -1610,7 +1610,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
{
/* If it's visible per the snapshot, we must return it */
valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);
- CheckForSerializableConflictOut(valid, relation, heapTuple,
+ heap_CheckForSerializableConflictOut(valid, relation, heapTuple,
buffer, snapshot);
if (valid)
@@ -1750,7 +1750,7 @@ heap_get_latest_tid(TableScanDesc sscan,
* candidate.
*/
valid = HeapTupleSatisfiesVisibility(&tp, snapshot, buffer);
- CheckForSerializableConflictOut(valid, relation, &tp, buffer, snapshot);
+ heap_CheckForSerializableConflictOut(valid, relation, &tp, buffer, snapshot);
if (valid)
*tid = ctid;
@@ -1905,7 +1905,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
* lock "gaps" as index page locks do. So we don't need to specify a
* buffer when making the call, which makes for a faster check.
*/
- CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -2159,7 +2159,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
* lock "gaps" as index page locks do. So we don't need to specify a
* buffer when making the call, which makes for a faster check.
*/
- CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
ndone = 0;
while (ndone < ntuples)
@@ -2350,7 +2350,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
* lock "gaps" as index page locks do. So we don't need to specify a
* buffer when making the call.
*/
- CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
/*
* If tuples are cachable, mark them for invalidation from the caches in
@@ -2664,7 +2664,7 @@ l1:
* being visible to the scan (i.e., an exclusive buffer content lock is
* continuously held from this point until the tuple delete is visible).
*/
- CheckForSerializableConflictIn(relation, &tp, buffer);
+ CheckForSerializableConflictIn(relation, tid, BufferGetBlockNumber(buffer));
/* replace cid with a combo cid if necessary */
HeapTupleHeaderAdjustCmax(tp.t_data, &cid, &iscombo);
@@ -3579,7 +3579,7 @@ l2:
* will include checking the relation level, there is no benefit to a
* separate check for the new tuple.
*/
- CheckForSerializableConflictIn(relation, &oldtup, buffer);
+ CheckForSerializableConflictIn(relation, otid, BufferGetBlockNumber(buffer));
/*
* At this point newbuf and buffer are both pinned and locked, and newbuf
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f1ff01e8cb9..7a8ecc27135 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2168,7 +2168,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
hscan->rs_vistuples[ntup++] = offnum;
PredicateLockTuple(scan->rs_rd, &loctup, snapshot);
}
- CheckForSerializableConflictOut(valid, scan->rs_rd, &loctup,
+ heap_CheckForSerializableConflictOut(valid, scan->rs_rd, &loctup,
buffer, snapshot);
}
}
@@ -2356,7 +2356,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
/* in pagemode, heapgetpage did this for us */
if (!pagemode)
- CheckForSerializableConflictOut(visible, scan->rs_rd, tuple,
+ heap_CheckForSerializableConflictOut(visible, scan->rs_rd, tuple,
hscan->rs_cbuf, scan->rs_snapshot);
/* Try next tuple from same page. */
@@ -2495,6 +2495,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
+ .scans_leverage_column_projection = false,
.slot_callbacks = heapam_slot_callbacks,
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 28edd4aca76..6f1093ae383 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -180,8 +180,8 @@ index_insert(Relation indexRelation,
if (!(indexRelation->rd_indam->ampredlocks))
CheckForSerializableConflictIn(indexRelation,
- (HeapTuple) NULL,
- InvalidBuffer);
+ (ItemPointer) NULL,
+ InvalidBlockNumber);
return indexRelation->rd_indam->aminsert(indexRelation, values, isnull,
heap_t_ctid, heapRelation,
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 48d19be3aab..51988731769 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -290,7 +290,7 @@ top:
* checkingunique and !heapkeyspace cases, but it's okay to use the
* first page the value could be on (with scantid omitted) instead.
*/
- CheckForSerializableConflictIn(rel, NULL, insertstate.buf);
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
/*
* Do the insertion. Note that insertstate contains cached binary
@@ -533,7 +533,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* otherwise be masked by this unique constraint
* violation.
*/
- CheckForSerializableConflictIn(rel, NULL, insertstate->buf);
+ CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate->buf));
/*
* This is a definite conflict. Break the tuple down into
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e678690b18e..f2e3714c89f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1760,7 +1760,8 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
indexInfo = BuildIndexInfo(btspool->index);
indexInfo->ii_Concurrent = btshared->isconcurrent;
scan = table_beginscan_parallel(btspool->heap,
- ParallelTableScanFromBTShared(btshared));
+ ParallelTableScanFromBTShared(btshared),
+ NULL);
reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
true, progress, _bt_build_callback,
(void *) &buildstate, scan);
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1dda6..2a624b62325 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -11,6 +11,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
- smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+ smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o \
+ zedstoredesc.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/zedstoredesc.c b/src/backend/access/rmgrdesc/zedstoredesc.c
new file mode 100644
index 00000000000..98bc510d931
--- /dev/null
+++ b/src/backend/access/rmgrdesc/zedstoredesc.c
@@ -0,0 +1,43 @@
+/*
+ * zedstoredesc.c
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/zedstoredesc.c
+ */
+#include "postgres.h"
+
+#include "access/xlogreader.h"
+#include "access/zedstore_wal.h"
+#include "lib/stringinfo.h"
+
+void
+zedstore_desc(StringInfo buf, XLogReaderState *record)
+{
+ char *rec = XLogRecGetData(record);
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == WAL_ZEDSTORE_INIT_METAPAGE)
+ {
+ wal_zedstore_init_metapage *walrec = (wal_zedstore_init_metapage *) rec;
+
+ appendStringInfo(buf, "natts %d", walrec->natts);
+ }
+
+}
+
+const char *
+zedstore_identify(uint8 info)
+{
+ const char *id = NULL;
+
+ switch (info & ~XLR_INFO_MASK)
+ {
+ case WAL_ZEDSTORE_INIT_METAPAGE:
+ id = "INIT_METAPAGE";
+ break;
+ }
+ return id;
+}
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5d3f5c3f541..b9ed336a563 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -158,7 +158,7 @@ table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
}
TableScanDesc
-table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan)
+table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan, Bitmapset *proj)
{
Snapshot snapshot;
uint32 flags = SO_TYPE_SEQSCAN |
@@ -180,6 +180,9 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan)
snapshot = SnapshotAny;
}
+ if (proj)
+ return relation->rd_tableam->scan_begin_with_column_projection(relation, snapshot, 0, NULL,
+ parallel_scan, flags, proj);
return relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
parallel_scan, flags);
}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56c4ce..be9e5bb1951 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -20,6 +20,7 @@
#include "access/spgxlog.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/zedstore_wal.h"
#include "catalog/storage_xlog.h"
#include "commands/dbcommands_xlog.h"
#include "commands/sequence.h"
diff --git a/src/backend/access/zedstore/Makefile b/src/backend/access/zedstore/Makefile
new file mode 100644
index 00000000000..328a1c58777
--- /dev/null
+++ b/src/backend/access/zedstore/Makefile
@@ -0,0 +1,22 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for access/zedstore
+#
+# IDENTIFICATION
+# src/backend/access/zedstore/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/zedstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = zedstore_btree.o zedstore_tiditem.o zedstore_tidpage.o \
+ zedstore_attitem.o zedstore_attpage.o \
+ zedstore_compression.o zedstore_simple8b.o zedstoream_handler.o \
+ zedstore_meta.o zedstore_undo.o zedstore_toast.o zedstore_visibility.o \
+ zedstore_inspect.o zedstore_freepagemap.o \
+ zedstore_tupslot.o zedstore_wal.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/zedstore/README b/src/backend/access/zedstore/README
new file mode 100644
index 00000000000..35605f00e31
--- /dev/null
+++ b/src/backend/access/zedstore/README
@@ -0,0 +1,353 @@
+
+src/backend/access/zedstore/README
+
+ZedStore - compressed column (and row) store for PostgreSQL
+===========================================================
+
+The purpose of this README is to provide an overview of zedstore's
+design, the major requirements/objectives it intends to fulfill, and
+high-level implementation details.
+
+Objectives
+----------
+
+* Performance improvement for queries selecting subset of columns
+(reduced IO).
+
+* Reduced on-disk footprint compared to heap table: shorter tuple
+headers, and compression of similar-typed data.
+
+* Be first-class citizen in the Postgres architecture (tables data can
+just independently live in columnar storage) and not be at arm's
+length through an opaque interface.
+
+* Fully MVCC compliant - basically all operations supported similarly
+to heap, like update, delete, serializable transactions, etc.
+
+* All Indexes supported
+
+* Hybrid row-column store, where some columns are stored together, and
+others separately. Provide flexibility of granularity on how to divide
+the columns. Columns accessed together can be stored together.
+
+* Provide better control over bloat.
+
+* Eliminate need for separate toast tables
+
+* Faster add / drop column or changing data type of column by avoiding
+full rewrite of the table.
+
+High-level design of ZedStore - B-trees for the win!
+----------------------------------------------------
+
+ZedStore consists of multiple B-trees. There is one B-tree, called the
+TID tree, which contains the visibility information of each tuple, but
+no user data. In addition to that, there is one B-tree for each
+attribute, called the attribute trees, to store the user data. Note that
+these B-tree implementations are completely unrelated to PostgreSQL's
+B-tree indexes.
+
+The TID tree, and all the attribute trees, use the TID as the key. The
+TID is used as a logical row identifier. Internally, ZedStore passes
+TIDs around as 64-bit integers (zstid), but for interfacing with the
+rest of the system, they are converted to/from ItemPointers. When
+converted to an ItemPointer, the conversion ensures that the ItemPointer
+looks valid, i.e. offset 0 is never used. However, the TID is just a
+48-bit row identifier; the traditional division into block and offset
+numbers is meaningless. There is locality of access, though: TIDs that
+are close to each other will probably also reside close to each other
+on disk. So, for example, bitmap index scans or BRIN indexes, which
+work with block numbers, still make some sense, even though the "block
+number" stored in a zedstore ItemPointer doesn't correspond to a
+physical block.
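+
+As a minimal illustration of how a logical TID can be packaged into a
+"valid-looking" ItemPointer, consider the sketch below. This is not the
+code from the patch; the helper name and the use of MaxHeapTuplesPerPage
+as the divisor are assumptions made only for this example:
+
+    /* hypothetical sketch: map a zstid to a valid-looking ItemPointer */
+    static inline void
+    zstid_to_itemptr(uint64 zstid, ItemPointer iptr)
+    {
+        /* keep offset 0 unused by numbering offsets from 1 */
+        ItemPointerSet(iptr,
+                       (BlockNumber) ((zstid - 1) / MaxHeapTuplesPerPage),
+                       (OffsetNumber) ((zstid - 1) % MaxHeapTuplesPerPage) + 1);
+    }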
+
+The internal pages of the B-trees are super simple and boring. The
+internal pages of the TID tree and the attribute trees look identical.
+Functions that work with either the TID tree or an attribute tree use
+ZS_META_ATTRIBUTE_NUM as the "attribute number" when working with the
+TID tree.
+
+The leaf pages of the TID tree and the attribute trees look different.
+Let's look at the TID tree first:
+
+TID tree
+--------
+
+A TID tree page consists of multiple ZSTidArrayItems. Each ZSTidArrayItem
+represents a group of tuples, with TIDs in a particular range. The TID
+ranges of ZSTidArrayItems never overlap. For each tuple, we logically
+store the TID, and its UNDO pointer. The actual visibility information
+is stored in the UNDO log, if the tuple was recently modified.
+
+A tuple can also be marked as dead, which means that the tuple is not
+visible to anyone. Dead tuples are marked with a special constant
+UNDO pointer value, DeadUndoPtr. The TIDs of dead tuples cannot be
+reused until all index pointers to the tuples have been removed by
+VACUUM. VACUUM scans the TID tree to collect all the dead TIDs. Note
+that VACUUM does not need to scan the attribute trees, and the TID tree
+is hopefully just a small fraction of the table, so vacuum on zedstore
+is hopefully much faster than on heap. (Although the freeze map can be
+pretty effective on the heap, too.)
+
+So logically, the TID tree stores the TID and UNDO pointer for every
+tuple. However, that would take a lot of space. To reduce disk usage,
+the TID tree consists of ZSTidArrayItems, which contain the TIDs and
+their UNDO pointers in a specially encoded format. The encoded format
+is optimized for the common cases, where the gaps between TIDs are
+small, and most tuples are visible to everyone. See the comments above
+ZSTidArrayItem in zedstore_internal.h for details.
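+
+As a simplified illustration of the encoding idea (this is not the
+actual ZSTidArrayItem layout), the TIDs within an item can be stored as
+deltas from the previous TID, so a dense TID range packs into very few
+bits per row:
+
+    /* illustrative only: delta-encode a sorted TID array */
+    zstid   tids[] = {1000, 1001, 1002, 1005};
+    uint64  deltas[4];
+
+    deltas[0] = 0;                  /* the first TID is stored separately */
+    for (int i = 1; i < 4; i++)
+        deltas[i] = tids[i] - tids[i - 1];  /* 1, 1, 3: small values */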
+
+Having a TID tree that's separate from the attributes helps to support
+zero-column tables (which can be the result of ADD COLUMN / DROP COLUMN
+actions as well). Storing the meta-data separately from the data also
+helps to get better compression ratios, and it simplifies the overall
+design/implementation: a delete only needs to edit the TID tree, and
+can avoid touching the attribute btrees.
+
+
+Attribute trees
+---------------
+
+The leaf pages of the attribute trees also consist of items, which pack
+data from multiple tuples into one item. In the attribute tree, the items
+can furthermore be compressed using LZ4, if the server has been
+configured with "configure --with-lz4". (If you don't use --with-lz4,
+PostgreSQL's built-in pglz algorithm is used, but it is *much* slower).
+Each item (ZSAttributeArrayItem) contains data for tuples with a range
+of consecutive TIDs. Multiple ZSAttributeArrayItems can be compressed
+together into a single ZSAttributeCompressedItem.
+
+In uncompressed form, an attribute tree page can be arbitrarily large.
+But after compression, it must fit into a physical 8k block. If, on insert
+or update of a tuple, the page can no longer be compressed below 8k, the
+page is split. Note that because TIDs are logical rather than physical
+identifiers, we can freely move tuples from one physical page to
+another during page split. A tuple's TID never changes.
+
+The buffer cache caches compressed blocks. Likewise, WAL-logging,
+full-page images etc. work on compressed blocks. Decompression is done
+on-the-fly, into backend-private memory, as and when needed while
+reading. For some encodings, like run-length (RLE) or delta encoding,
+tuples could be constructed directly from the compressed data.
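+
+Simplified from the reading code in zedstore_attitem.c, the read path
+for a compressed attribute item looks roughly like this; decompression
+happens into a scan-private buffer, never in shared buffers:
+
+    if ((item->t_flags & ZSBT_ATTR_COMPRESSED) != 0)
+    {
+        ZSAttributeCompressedItem *citem = (ZSAttributeCompressedItem *) item;
+
+        /* expand the payload into a backend-private buffer */
+        zs_decompress(citem->t_payload, scan->decompress_buf,
+                      citem->t_size - offsetof(ZSAttributeCompressedItem, t_payload),
+                      citem->t_uncompressed_size);
+    }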
+
+
+To reconstruct a row with a given TID, a scan descends down the B-trees
+of all the columns using that TID, and fetches all attributes. Likewise, a
+sequential scan walks all the B-trees in lockstep.
+
+
+TODO: Currently, each attribute is stored in a separate attribute
+B-tree. But a hybrid row-column store would also be possible, where some
+columns were stored together in the same tree. Or even a row store, where
+all the user data was stored in a single tree, or even combined with the
+TID tree.
+
+Metapage
+--------
+
+The metapage at block 0 has links to the roots of the B-trees.
+
+
+Low-level locking / concurrency issues
+--------------------------------------
+Design principles:
+
+* Every page is self-identifying. Every page has a page type ID,
+ which indicates what kind of a page it is. For a B-tree page,
+ the page header contains the attribute number and lo/hi key.
+ That is enough information to find the downlink to the page, so
+ that it can be deleted if necessary. There is enough information
+ on each leaf page to easily re-build the internal pages from
+ scratch, in case of corruption, for example.
+
+* Concurrency control: When traversing the B-tree, or walking UNDO
+ or TOAST pages, it's possible that a concurrent process splits or
+ moves a page just when we're about to step on it. There is enough
+ information on each page to detect that case. For example, if a
+ B-tree page is split just when you are about to step on it, you
+ can detect that by looking at the lo/hi key. If a page is deleted,
+ that can be detected too, because the attribute number or lo/hikey
+ are not what you expected. In that case, start the scan from the
+ root.
+
+* Any page can fairly easily be moved, starting with just the
+ page itself. When you have a B-tree page at hand, you can re-find
+ its parent using its lokey, and modify the downlink. A toast page
+ contains the attno/TID, which can be used to find the pointer to
+ it in the b-tree. An UNDO page cannot currently be moved because
+ UNDO pointers contain the physical block number, but as soon as an
+ UNDO page expires, it can be deleted.
+
+
+MVCC
+----
+
+Undo record pointers are used to implement MVCC, like in zheap. Hence,
+transaction information is not directly stored with the data. In
+zheap, there's a small, fixed number of "transaction slots" on each
+page, but zedstore stores an undo pointer with each item directly; in
+normal cases, the compression squeezes this down to almost nothing. In
+the case of a bulk load, the undo record pointer is maintained for an
+array of items rather than per item. The undo pointer is only stored in
+the meta-column, and all MVCC operations are performed using the
+meta-column only.
+
+
+Insert:
+Inserting a new row splits the row into datums. While adding the entry
+for the meta-column, zedstore decides which block to insert into, picks
+a TID for the row, and writes an undo record for it. All the data
+columns are then inserted using that TID.
+
+Toast:
+When an overly large datum is stored, it is divided into chunks, and
+each chunk is stored on a dedicated toast page within the same
+physical file. The toast pages of a datum form a list; each page has a
+next/prev pointer.
+
+Select:
+A property is added to the table AM to convey whether the AM leverages
+column projection for scans. While scanning a table whose AM has this
+property, the executor parses the plan and uses the target list and
+quals to find the columns required by the query. This list is passed
+down to the AM on beginscan. Zedstore uses this column projection list
+to only pull data from the selected columns. A virtual tuple table slot
+is used to pass back the datums for the subset of columns.
+
+The current table AM API requires enhancement to pass the column
+projection down to the AM. The patch showcases two different ways of
+doing that (a usage sketch follows the list):
+
+* For sequential scans, a new beginscan_with_column_projection() API is
+added. The executor checks the AM property; if the AM leverages column
+projection, it uses this new API, else the normal beginscan() API.
+
+* For index scans, instead of modifying the begin-scan API, a new API is
+added to pass the column projection list after calling begin scan (to
+populate the scan descriptor) but before fetching the tuples.
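+
+As a rough usage sketch (not code from the patch; the executor-side
+plumbing is simplified, and 'rel' and 'pscan' are assumed to be set up
+already), the caller collects the needed attribute numbers into a
+Bitmapset and hands it to the AM when the scan begins:
+
+    TableScanDesc   scan;
+    Bitmapset      *proj = NULL;
+
+    proj = bms_add_member(proj, 1);     /* column "a" */
+    proj = bms_add_member(proj, 3);     /* column "c" */
+
+    if (rel->rd_tableam->scans_leverage_column_projection)
+        scan = table_beginscan_parallel(rel, pscan, proj);
+    else
+        scan = table_beginscan_parallel(rel, pscan, NULL);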
+
+Delete:
+When deleting a tuple, a new undo record is created for the delete, and
+only the meta-column item is updated with this new undo record. The new
+undo record points to the previous undo record (the insert undo record)
+for the tuple. Hence, a delete only operates on the meta-column; no
+data column is edited. (See the sketch below.)
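+
+Conceptually (a simplified picture, not an on-disk layout), the undo
+records for a tuple form a chain:
+
+    TID-tree item --> [delete undo record] --> [insert undo record]
+
+A visibility check follows this chain from the item's current undo
+pointer, until it either finds the relevant transaction information or
+determines that the tuple is old enough to be visible to everyone.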
+
+Update:
+An update in zedstore is essentially a delete plus an insert: the
+delete action is performed as stated above, and a new entry is added
+with the updated values. So, no in-place update happens.
+
+Index Support:
+Building an index also leverages the columnar storage and only scans
+the columns required to build the index. Indexes work pretty much the
+same as with heap tables: data is inserted into the table, and the TID
+of the tuple gets stored in the index. On an index scan, the required
+column B-trees are scanned for the given TID and the datums are passed
+back in a virtual tuple. Since only the meta-column is used to perform
+the visibility check, data is fetched from the other B-trees only for
+visible tuples.
+
+Page Format
+-----------
+A ZedStore table contains different kinds of pages, all in the same
+file. The page kinds are the meta-page, per-attribute btree internal
+and leaf pages, UNDO log pages, and toast pages. Each page type has its own
+distinct data storage format.
+
+META Page:
+Block 0 is always a metapage. It contains the block numbers of the
+other data structures stored within the file, like the per-attribute
+B-trees, and the UNDO log.
+
+BTREE Page:
+
+UNDO Page:
+
+TOAST Page:
+
+
+Free Pages Map
+--------------
+
+There is a simple Free Pages Map, which is just a linked list of unused
+blocks. The block number of the first unused page in the list is stored
+in the metapage. Each unused block contains a link to the next unused
+block in the chain. When a block becomes unused, it is added to the
+head of the list.
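+
+A minimal sketch of the idea (the struct and field names here are
+illustrative, not the actual definitions from the patch):
+
+    /* the head of the free list is a block number stored in the metapage */
+    typedef struct ZSFreePageSketch
+    {
+        BlockNumber     next_free_blkno;    /* next unused block, or InvalidBlockNumber */
+    } ZSFreePageSketch;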
+
+TODO: That doesn't scale very well, and the pages are reused in LIFO
+order. We'll probably want to do something smarter to avoid making the
+metapage a bottleneck for this, as well as try to batch the page
+allocations so that each attribute B-tree would get contiguous ranges
+of blocks, to allow I/O readahead to be effective.
+
+
+Enhancement ideas / alternative designs
+---------------------------------------
+
+Instead of compressing all the tuples on a page in one batch, store a
+small "dictionary", e.g. in page header or meta page or separate
+dedicated page, and use it to compress tuple by tuple. That could make
+random reads and updates of individual tuples faster. We'd need to
+figure out how to create the dictionary, though.
+
+Currently, only compressed pages are cached in the page cache. If we
+want to cache uncompressed pages instead, or in addition to that, we
+need to invent a whole new kind of buffer cache that can deal with
+variable-size blocks. For a first version, I think we can live without
+it.
+
+Instead of storing all columns in the same file, we could store them
+in separate files (separate forks?). That would allow immediate reuse
+of space, after dropping a column. It's not clear how to use an FSM in
+that case, though. Might have to implement an integrated FSM,
+too. (Which might not be a bad idea, anyway).
+
+The design allows for a hybrid row-column store, where some columns are
+stored together and others have a dedicated B-tree. User-facing syntax
+is needed to allow specifying how to group the columns.
+
+Salient points for the design
+------------------------------
+
+* Lay out the data/tuples in a mapped fashion instead of keeping the
+logical-to-physical mapping separate from the actual data. That is,
+keep all the meta-data and data logically in a single stream within the
+file, avoiding the need for separate forks/files to store meta-data and data.
+
+* Handle/treat operations at tuple level and not block level.
+
+* Stick to fixed-size physical blocks. Variable-size blocks (for
+possibly higher compression ratios) would require more maintenance of
+the logical-to-physical mapping, and restrict the concurrency of writes
+and reads to files. Hence, adapt the compression to fit fixed-size
+blocks instead of the other way round.
+
+
+Predicate locking
+-----------------
+
+Predicate locks, to support SERIALIZABLE transactions, are taken the
+same way as with the heap. From README-SSI:
+
+* For a table scan, the entire relation will be locked.
+
+* Each tuple read which is visible to the reading transaction will be
+locked, whether or not it meets selection criteria; except that there
+is no need to acquire an SIREAD lock on a tuple when the transaction
+already holds a write lock on any tuple representing the row, since a
+rw-conflict would also create a ww-dependency which has more
+aggressive enforcement and thus will prevent any anomaly.
+
+* Modifying a heap tuple creates a rw-conflict with any transaction
+that holds a SIREAD lock on that tuple, or on the page or relation
+that contains it.
+
+* Inserting a new tuple creates a rw-conflict with any transaction
+holding a SIREAD lock on the entire relation. It doesn't conflict with
+page-level locks, because page-level locks are only used to aggregate
+tuple locks. Unlike index page locks, they don't lock "gaps" on the
+page.
+
+
+ZedStore isn't block-based, so page-level locks really just mean a
+range of TIDs. They're only used to aggregate tuple locks.
diff --git a/src/backend/access/zedstore/zedstore_attitem.c b/src/backend/access/zedstore/zedstore_attitem.c
new file mode 100644
index 00000000000..ac0068d443b
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_attitem.c
@@ -0,0 +1,1437 @@
+/*
+ * zedstore_attitem.c
+ * Routines for packing datums into "items", in the attribute trees.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_attitem.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_compression.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_simple8b.h"
+#include "miscadmin.h"
+#include "utils/datum.h"
+
+/*
+ * We avoid creating items that are "too large". An item can legitimately use
+ * up a whole page, but we try not to create items that large, because they
+ * could lead to fragmentation. For example, if we routinely created items
+ * that are 3/4 of page size, we could only fit one item per page, and waste
+ * 1/4 of the disk space.
+ *
+ * MAX_ATTR_ITEM_SIZE is a soft limit on how large we make items. If there's
+ * a very large datum on a row, we store it on a single item of its own
+ * that can be larger, because we don't have much choice. But we don't pack
+ * multiple datums into a single item so that it would exceed the limit.
+ * NOTE: This soft limit is on the *uncompressed* item size. So in practice,
+ * when compression is effective, the items we actually store are smaller
+ * than this.
+ *
+ * MAX_TIDS_PER_ATTR_ITEM is the max number of TIDs that can be represented
+ * by a single array item. Unlike MAX_ATTR_ITEM_SIZE, it is a hard limit.
+ */
+#define MAX_ATTR_ITEM_SIZE (MaxZedStoreDatumSize / 4)
+#define MAX_TIDS_PER_ATTR_ITEM ((BLCKSZ / 2) / sizeof(zstid))
+
+static void fetch_att_array(char *src, int srcSize, bool hasnulls,
+ int numelements, ZSAttrTreeScan *scan);
+
+static ZSAttributeArrayItem *zsbt_attr_create_item(Form_pg_attribute att,
+ Datum *datums, bool *isnulls, zstid *tids, int nitems,
+ bool has_nulls, int datasz);
+static ZSExplodedItem *zsbt_attr_explode_item(ZSAttributeArrayItem *item);
+
+/*
+ * Create an attribute item, or items, from an array of tids and datums.
+ */
+List *
+zsbt_attr_create_items(Form_pg_attribute att,
+ Datum *datums, bool *isnulls, zstid *tids, int nitems)
+{
+ List *newitems;
+ int i;
+ int max_items_with_nulls = -1;
+ int max_items_without_nulls = -1;
+
+ if (att->attlen > 0)
+ {
+ max_items_without_nulls = MAX_ATTR_ITEM_SIZE / att->attlen;
+ Assert(max_items_without_nulls > 0);
+
+ max_items_with_nulls = (MAX_ATTR_ITEM_SIZE * 8) / (att->attlen * 8 + 1);
+
+ /* clamp at maximum number of tids */
+ if (max_items_without_nulls > MAX_TIDS_PER_ATTR_ITEM)
+ max_items_without_nulls = MAX_TIDS_PER_ATTR_ITEM;
+ if (max_items_with_nulls > MAX_TIDS_PER_ATTR_ITEM)
+ max_items_with_nulls = MAX_TIDS_PER_ATTR_ITEM;
+ }
+
+ /*
+ * Loop until we have packed each input datum.
+ */
+ newitems = NIL;
+ i = 0;
+ while (i < nitems)
+ {
+ size_t datasz;
+ ZSAttributeArrayItem *item;
+ int num_elements;
+ bool has_nulls = false;
+
+ /*
+ * Compute how many input datums we can pack into the next item,
+ * without exceeding MAX_ATTR_ITEM_SIZE or MAX_TIDS_PER_ATTR_ITEM.
+ *
+ * To do that, we have to loop through the datums and compute how
+ * much space they will take when packed.
+ */
+ if (att->attlen > 0)
+ {
+ int j;
+ int num_nonnull_items;
+
+ for (j = i; j < nitems && j - i < max_items_without_nulls; j++)
+ {
+ if (isnulls[j])
+ {
+ has_nulls = true;
+ break;
+ }
+ }
+ num_nonnull_items = (j - i);
+ datasz = num_nonnull_items * att->attlen;
+
+ if (has_nulls)
+ {
+ for (;j < nitems && num_nonnull_items < max_items_with_nulls &&
+ j - i < MAX_TIDS_PER_ATTR_ITEM; j++)
+ {
+ if (!isnulls[j])
+ {
+ datasz += att->attlen;
+ num_nonnull_items++;
+ }
+ }
+ }
+ num_elements = (j - i);
+ }
+ else
+ {
+ int j;
+
+ datasz = 0;
+ for (j = i; j < nitems && j - i < MAX_TIDS_PER_ATTR_ITEM; j++)
+ {
+ size_t this_sz;
+
+ if (isnulls[j])
+ {
+ has_nulls = true;
+ this_sz = 0;
+ }
+ else
+ {
+ if (att->attlen == -1)
+ {
+ if (VARATT_IS_EXTERNAL(datums[j]))
+ {
+ /*
+ * Any toasted datums should've been taken care of
+ * before we get here. We might see "zedstore-toasted"
+ * datums, but nothing else.
+ */
+ if (VARTAG_EXTERNAL(datums[j]) != VARTAG_ZEDSTORE)
+ elog(ERROR, "unrecognized toast tag");
+ this_sz = 2 + sizeof(BlockNumber);
+
+ }
+ else if (VARATT_IS_COMPRESSED(datums[j]))
+ {
+ /*
+ * "inline" compressed datum. We will include it in an
+ * attribute item, which we will try to compress as whole,
+ * so compressing individual items is a bit silly.
+ * We could uncompress it here, but that also seems
+ * silly, because then it was a waste of time to compress
+ * it earlier. Furthermore, it's theoretically possible that
+ * it would not compress as well using LZ4, so that it would
+ * be too large to store on a page.
+ *
+ * TODO: what to do?
+ */
+ elog(ERROR, "inline compressed datums not implemented");
+ }
+ else
+ {
+ this_sz = VARSIZE_ANY_EXHDR(DatumGetPointer(datums[j]));
+
+ if ((this_sz + 1) > 0x7F)
+ this_sz += 2;
+ else
+ this_sz += 1;
+ }
+ }
+ else
+ {
+ Assert(att->attlen == -2);
+ this_sz = strlen((char *) DatumGetPointer(datums[j]));
+
+ if ((this_sz + 1) > 0x7F)
+ this_sz += 2;
+ else
+ this_sz += 1;
+ }
+ }
+
+ if (j != i && datasz + this_sz > MAX_ATTR_ITEM_SIZE)
+ break;
+
+ datasz += this_sz;
+ }
+ num_elements = j - i;
+ }
+
+ /* FIXME: account for TID codewords in size calculation. */
+
+ item = zsbt_attr_create_item(att,
+ &datums[i], &isnulls[i], &tids[i], num_elements,
+ has_nulls, datasz);
+
+ newitems = lappend(newitems, item);
+ i += num_elements;
+ }
+
+ return newitems;
+}
+
+/* helper function to pack an array of bools into a NULL bitmap */
+static bits8 *
+write_null_bitmap(bool *isnulls, int num_elements, bits8 *dst)
+{
+ bits8 bits = 0;
+ int x = 0;
+
+ for (int j = 0; j < num_elements; j++)
+ {
+ if (x == 8)
+ {
+ *dst = bits;
+ dst++;
+ bits = 0;
+ x = 0;
+ }
+
+ if (isnulls[j])
+ bits |= 1 << x;
+ x++;
+ }
+ if (x > 0)
+ {
+ *dst = bits;
+ dst++;
+ }
+ return dst;
+}
+
+/*
+ * Create an array item from given datums and tids.
+ *
+ * The caller has already computed the size the datums will require.
+ */
+static ZSAttributeArrayItem *
+zsbt_attr_create_item(Form_pg_attribute att,
+ Datum *datums, bool *isnulls, zstid *tids, int num_elements,
+ bool has_nulls, int datasz)
+{
+ uint64 deltas[MAX_TIDS_PER_ATTR_ITEM];
+ uint64 codewords[MAX_TIDS_PER_ATTR_ITEM];
+ int num_codewords;
+ int total_encoded;
+ char *p;
+ char *pend;
+ size_t itemsz;
+ ZSAttributeArrayItem *item;
+
+ Assert(num_elements > 0);
+ Assert(num_elements <= MAX_TIDS_PER_ATTR_ITEM);
+
+ /* Compute TID distances */
+ for (int i = 1; i < num_elements; i++)
+ deltas[i] = tids[i] - tids[i - 1];
+
+ deltas[0] = 0;
+ num_codewords = 0;
+ total_encoded = 0;
+ while (total_encoded < num_elements)
+ {
+ int num_encoded;
+
+ codewords[num_codewords] =
+ simple8b_encode(&deltas[total_encoded], num_elements - total_encoded, &num_encoded);
+
+ total_encoded += num_encoded;
+ num_codewords++;
+ }
+
+ itemsz = offsetof(ZSAttributeArrayItem, t_tid_codewords);
+ itemsz += num_codewords * sizeof(uint64);
+ if (has_nulls)
+ itemsz += ZSBT_ATTR_BITMAPLEN(num_elements);
+ itemsz += datasz;
+
+ item = palloc(itemsz);
+ item->t_size = itemsz;
+ item->t_flags = 0;
+ if (has_nulls)
+ item->t_flags |= ZSBT_HAS_NULLS;
+ item->t_num_elements = num_elements;
+ item->t_num_codewords = num_codewords;
+ item->t_firsttid = tids[0];
+ item->t_endtid = tids[num_elements - 1] + 1;
+
+ for (int j = 0; j < num_codewords; j++)
+ item->t_tid_codewords[j] = codewords[j];
+
+ p = (char *) &item->t_tid_codewords[num_codewords];
+ pend = ((char *) item) + itemsz;
+
+ if (has_nulls)
+ p = (char *) write_null_bitmap(isnulls, num_elements, (bits8 *) p);
+
+ if (att->attlen > 0)
+ {
+ if (att->attbyval)
+ {
+ for (int j = 0; j < num_elements; j++)
+ {
+ if (!isnulls[j])
+ {
+ store_att_byval(p, datums[j], att->attlen);
+ p += att->attlen;
+ }
+ }
+ }
+ else
+ {
+ for (int j = 0; j < num_elements; j++)
+ {
+ if (!isnulls[j])
+ {
+ memcpy(p, DatumGetPointer(datums[j]), att->attlen);
+ p += att->attlen;
+ }
+ }
+ }
+ }
+ else
+ {
+ for (int j = 0; j < num_elements; j++)
+ {
+ if (!isnulls[j])
+ {
+ if (att->attlen == -1 && VARATT_IS_EXTERNAL(datums[j]))
+ {
+ varatt_zs_toastptr *zstoast;
+
+ /*
+ * Any toasted datums should've been taken care of
+ * before we get here. We might see "zedstore-toasted"
+ * datums, but nothing else.
+ */
+ if (VARTAG_EXTERNAL(datums[j]) != VARTAG_ZEDSTORE)
+ elog(ERROR, "unrecognized toast tag");
+
+ zstoast = (varatt_zs_toastptr *) DatumGetPointer(datums[j]);
+
+ /*
+ * 0xFFFF identifies a toast pointer. Followed by the block
+ * number of the first toast page.
+ */
+ *(p++) = 0xFF;
+ *(p++) = 0xFF;
+ memcpy(p, &zstoast->zst_block, sizeof(BlockNumber));
+ p += sizeof(BlockNumber);
+ }
+ else
+ {
+ size_t this_sz;
+ char *src;
+
+ if (att->attlen == -1)
+ {
+ this_sz = VARSIZE_ANY_EXHDR(DatumGetPointer(datums[j]));
+ src = VARDATA_ANY(DatumGetPointer(datums[j]));
+ }
+ else
+ {
+ Assert(att->attlen == -2);
+ this_sz = strlen((char *) DatumGetPointer(datums[j]));
+ src = (char *) DatumGetPointer(datums[j]);
+ }
+
+ if ((this_sz + 1) > 0x7F)
+ {
+ *(p++) = 0x80 | ((this_sz + 1) >> 8);
+ *(p++) = (this_sz + 1) & 0xFF;
+ }
+ else
+ {
+ *(p++) = (this_sz + 1);
+ }
+ memcpy(p, src, this_sz);
+ p += this_sz;
+ }
+ Assert(p <= pend);
+ }
+ }
+ }
+ if (p != pend)
+ elog(ERROR, "mismatch in item size calculation");
+
+ return item;
+}
+
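+/*
+ * Summary of the packed datum format used inside attribute array items
+ * (see zsbt_attr_create_item() above for the writer side):
+ *
+ * - fixed-width attributes (attlen > 0) are stored as raw, unaligned
+ *   attlen-byte values, with no per-datum header.
+ *
+ * - varlenas and cstrings get a 1- or 2-byte length header: a first byte
+ *   in the range 0x01-0x7F is a 1-byte header whose value is the total
+ *   size including the header byte; a first byte with the high bit set
+ *   (other than the 0xFF 0xFF marker) starts a 2-byte header, from which
+ *   zsbt_attr_datasize() below derives the total size.
+ *
+ * - the byte sequence 0xFF 0xFF marks a zedstore toast pointer, followed
+ *   by the BlockNumber of the first toast page (6 bytes in total).
+ */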
+static inline int
+zsbt_attr_datasize(int attlen, char *src)
+{
+ unsigned char *p = (unsigned char *) src;
+
+ if (attlen > 0)
+ return attlen;
+ else if ((p[0] & 0x80) == 0)
+ {
+ /* single-byte header */
+ return p[0];
+ }
+ else if (p[0] == 0xFF && p[1] == 0xFF)
+ {
+ /* zedstore-toast pointer. */
+ return 6;
+ }
+ else
+ {
+ /* two-byte header */
+ return ((p[0] & 0x7F) << 8 | p[1]) + 1;
+ }
+}
+
+/*
+ * Remove elements with given TIDs from an array item.
+ *
+ * Returns NULL, if all elements were removed.
+ */
+ZSExplodedItem *
+zsbt_attr_remove_from_item(Form_pg_attribute attr,
+ ZSAttributeArrayItem *olditem,
+ zstid *removetids)
+{
+ ZSExplodedItem *origitem;
+ ZSExplodedItem *newitem;
+ int i;
+ int j;
+ char *src;
+ char *dst;
+
+ origitem = zsbt_attr_explode_item(olditem);
+
+ newitem = palloc(sizeof(ZSExplodedItem));
+ newitem->tids = palloc(origitem->t_num_elements * sizeof(zstid));
+ newitem->nullbitmap = palloc0(ZSBT_ATTR_BITMAPLEN(origitem->t_num_elements));
+ newitem->datumdata = palloc(origitem->datumdatasz);
+
+ /* walk through every element */
+ j = 0;
+ src = origitem->datumdata;
+ dst = newitem->datumdata;
+ for (i = 0; i < origitem->t_num_elements; i++)
+ {
+ int this_datasz;
+ bool this_isnull;
+
+ while (origitem->tids[i] > *removetids)
+ removetids++;
+
+ this_isnull = zsbt_attr_item_isnull(origitem->nullbitmap, i);
+ if (!this_isnull)
+ this_datasz = zsbt_attr_datasize(attr->attlen, src);
+ else
+ this_datasz = 0;
+
+ if (origitem->tids[i] == *removetids)
+ {
+ /* leave this one out */
+ removetids++;
+ }
+ else
+ {
+ newitem->tids[j] = origitem->tids[i];
+ if (this_isnull)
+ {
+ zsbt_attr_item_setnull(newitem->nullbitmap, j);
+ }
+ else
+ {
+ memcpy(dst, src, this_datasz);
+ dst += this_datasz;
+ }
+ j++;
+ }
+ src += this_datasz;
+ }
+
+ if (j == 0)
+ {
+ pfree(newitem);
+ return NULL;
+ }
+
+ newitem->t_size = 0;
+ newitem->t_flags = 0;
+ newitem->t_num_elements = j;
+ newitem->datumdatasz = dst - newitem->datumdata;
+
+ Assert(newitem->datumdatasz <= origitem->datumdatasz);
+
+ return newitem;
+}
+
+/*
+ * Extract the TID and Datum/isnull arrays from the given array item.
+ *
+ * The arrays are stored directly into the scan->array_* fields.
+ *
+ * TODO: avoid extracting elements we're not interested in, by passing starttid/endtid.
+ */
+void
+zsbt_attr_item_extract(ZSAttrTreeScan *scan, ZSAttributeArrayItem *item)
+{
+ int nelements = item->t_num_elements;
+ char *p;
+ char *pend;
+ zstid currtid;
+ zstid *tids;
+ uint64 *codewords;
+
+ if (nelements > scan->array_datums_allocated_size)
+ {
+ int newsize = nelements * 2;
+
+ if (scan->array_datums)
+ pfree(scan->array_datums);
+ if (scan->array_isnulls)
+ pfree(scan->array_isnulls);
+ if (scan->array_tids)
+ pfree(scan->array_tids);
+ scan->array_datums = MemoryContextAlloc(scan->context, newsize * sizeof(Datum));
+ scan->array_isnulls = MemoryContextAlloc(scan->context, newsize * sizeof(bool) + 7);
+ scan->array_tids = MemoryContextAlloc(scan->context, newsize * sizeof(zstid));
+ scan->array_datums_allocated_size = newsize;
+ }
+
+ /* decompress if needed */
+ if ((item->t_flags & ZSBT_ATTR_COMPRESSED) != 0)
+ {
+ ZSAttributeCompressedItem *citem = (ZSAttributeCompressedItem *) item;
+
+ if (scan->decompress_buf_size < citem->t_uncompressed_size)
+ {
+ size_t newsize = citem->t_uncompressed_size * 2;
+
+ if (scan->decompress_buf != NULL)
+ pfree(scan->decompress_buf);
+ scan->decompress_buf = MemoryContextAlloc(scan->context, newsize);
+ scan->decompress_buf_size = newsize;
+ }
+
+ p = (char *) citem->t_payload;
+ zs_decompress(p, scan->decompress_buf,
+ citem->t_size - offsetof(ZSAttributeCompressedItem, t_payload),
+ citem->t_uncompressed_size);
+ p = scan->decompress_buf;
+ pend = p + citem->t_uncompressed_size;
+ }
+ else
+ {
+ p = (char *) item->t_tid_codewords;
+ pend = ((char *) item) + item->t_size;
+ }
+
+ /* Decode TIDs from codewords */
+ tids = scan->array_tids;
+ codewords = (uint64 *) p;
+ p += item->t_num_codewords * sizeof(uint64);
+
+ simple8b_decode_words(codewords, item->t_num_codewords, tids, nelements);
+
+ currtid = item->t_firsttid;
+ for (int i = 0; i < nelements; i++)
+ {
+ currtid += tids[i];
+ tids[i] = currtid;
+ }
+
+ /*
+ * Expand the packed array data into an array of Datums.
+ *
+ * It would perhaps be more natural to loop through the elements with
+ * datumGetSize() and fetch_att(), but this is a pretty hot loop, so it's
+ * better to avoid checking attlen/attbyval in the loop.
+ *
+ * TODO: a different on-disk representation might make this better still,
+ * for varlenas (this is pretty optimal for fixed-lengths already).
+ * For example, storing an array of sizes or an array of offsets, followed
+ * by the data itself, might incur fewer pipeline stalls in the CPU.
+ */
+ fetch_att_array(p, pend - p,
+ ((item->t_flags & ZSBT_HAS_NULLS) != 0),
+ nelements,
+ scan);
+ scan->array_num_elements = nelements;
+}
+
+
+/*
+ * Subroutine of zsbt_attr_item_extract(). Unpack an array item into an array of
+ * TIDs, and an array of Datums and nulls.
+ *
+ * XXX: This always copies the data to a working area in 'scan'. That can be
+ * wasteful, if the data already happened to be correctly aligned. The caller
+ * relies on the copying, though, unless it already made a copy of it when
+ * decompressing it. So take that into account if you try to avoid this by
+ * avoiding the memcpys.
+ */
+static void
+fetch_att_array(char *src, int srcSize, bool hasnulls,
+ int numelements, ZSAttrTreeScan *scan)
+{
+ Form_pg_attribute attr = scan->attdesc;
+ int attlen = attr->attlen;
+ bool attbyval = attr->attbyval;
+ char attalign = attr->attalign;
+ bool *nulls = scan->array_isnulls;
+ Datum *datums = scan->array_datums;
+ unsigned char *p = (unsigned char *) src;
+
+ if (hasnulls)
+ {
+ /* expand null bitmap */
+ for (int i = 0; i < numelements; i += 8)
+ {
+ bits8 nullbits = *(bits8 *) (p++);
+
+ /* NOTE: we always overallocate the nulls array, so that we don't
+ * need to check for out of bounds here! */
+ nulls[i] = nullbits & 1;
+ nulls[i + 1] = (nullbits & (1 << 1)) >> 1;
+ nulls[i + 2] = (nullbits & (1 << 2)) >> 2;
+ nulls[i + 3] = (nullbits & (1 << 3)) >> 3;
+ nulls[i + 4] = (nullbits & (1 << 4)) >> 4;
+ nulls[i + 5] = (nullbits & (1 << 5)) >> 5;
+ nulls[i + 6] = (nullbits & (1 << 6)) >> 6;
+ nulls[i + 7] = (nullbits & (1 << 7)) >> 7;
+ }
+ }
+ else
+ memset(nulls, 0, numelements);
+
+ if (attlen > 0 && !hasnulls && attbyval)
+ {
+ memset(nulls, 0, numelements * sizeof(bool));
+
+ /* this looks a lot like fetch_att... */
+ if (attlen == sizeof(Datum))
+ {
+ memcpy(datums, p, sizeof(Datum) * numelements);
+ p += sizeof(Datum) * numelements;
+ }
+ else if (attlen == sizeof(int32))
+ {
+ for (int i = 0; i < numelements; i++)
+ {
+ uint32 x;
+
+ memcpy(&x, p, sizeof(int32));
+ p += sizeof(int32);
+ datums[i] = Int32GetDatum(x);
+ }
+ }
+ else if (attlen == sizeof(int16))
+ {
+ for (int i = 0; i < numelements; i++)
+ {
+ uint16 x;
+
+ memcpy(&x, p, sizeof(int16));
+ p += sizeof(int16);
+ datums[i] = Int16GetDatum(x);
+ }
+ }
+ else
+ {
+ Assert(attlen == 1);
+
+ for (int i = 0; i < numelements; i++)
+ {
+ datums[i] = CharGetDatum(*p);
+ p++;
+ }
+ }
+ }
+ else if (attlen > 0 && attbyval)
+ {
+ /* this looks a lot like fetch_att... but the source might not be aligned */
+ if (attlen == sizeof(int64))
+ {
+ for (int i = 0; i < numelements; i++)
+ {
+ if (nulls[i])
+ datums[i] = (Datum) 0;
+ else
+ {
+ uint64 x;
+
+ memcpy(&x, p, sizeof(int64));
+ p += sizeof(int64);
+ datums[i] = Int64GetDatum(x);
+ }
+ }
+ }
+ else if (attlen == sizeof(int32))
+ {
+ for (int i = 0; i < numelements; i++)
+ {
+ if (nulls[i])
+ datums[i] = (Datum) 0;
+ else
+ {
+ uint32 x;
+
+ memcpy(&x, p, sizeof(int32));
+ p += sizeof(int32);
+ datums[i] = Int32GetDatum(x);
+ }
+ }
+ }
+ else if (attlen == sizeof(int16))
+ {
+ for (int i = 0; i < numelements; i++)
+ {
+ if (nulls[i])
+ datums[i] = (Datum) 0;
+ else
+ {
+ uint16 x;
+
+ memcpy(&x, p, sizeof(int16));
+ p += sizeof(int16);
+ datums[i] = Int16GetDatum(x);
+ }
+ }
+ }
+ else
+ {
+ Assert(attlen == 1);
+
+ for (int i = 0; i < numelements; i++)
+ {
+ if (nulls[i])
+ datums[i] = (Datum) 0;
+ else
+ {
+ datums[i] = CharGetDatum(*p);
+ p++;
+ }
+ }
+ }
+ }
+ else if (attlen > 0 && !attbyval)
+ {
+ /*
+ * pass-by-ref fixed size.
+ *
+ * Because the on-disk format doesn't guarantee any alignment, we need to
+ * take care of that here. XXX: we could skip the copying if attalign='c'
+ */
+ int buf_needed;
+ int alignlen;
+ char *bufp;
+
+ switch (attalign)
+ {
+ case 'd':
+ alignlen = ALIGNOF_DOUBLE;
+ break;
+ case 'i':
+ alignlen = ALIGNOF_INT;
+ break;
+ case 's':
+ alignlen = ALIGNOF_SHORT;
+ break;
+ case 'c':
+ alignlen = 1;
+ break;
+ default:
+ elog(ERROR, "invalid alignment '%c'", attalign);
+ }
+
+ buf_needed = srcSize + (alignlen - 1) * numelements;
+
+ if (scan->attr_buf_size < buf_needed)
+ {
+ if (scan->attr_buf)
+ pfree(scan->attr_buf);
+ scan->attr_buf = MemoryContextAlloc(scan->context, buf_needed);
+ scan->attr_buf_size = buf_needed;
+ }
+
+ bufp = scan->attr_buf;
+
+ for (int i = 0; i < numelements; i++)
+ {
+ if (nulls[i])
+ datums[i] = (Datum) 0;
+ else
+ {
+ bufp = (char *) att_align_nominal(bufp, attalign);
+
+ Assert(bufp + attlen - scan->attr_buf <= buf_needed);
+
+ memcpy(bufp, p, attlen);
+ datums[i] = PointerGetDatum(bufp);
+ p += attlen;
+ bufp += attlen;
+ }
+ }
+ }
+ else if (attlen == -1)
+ {
+ /*
+ * Decode varlenas.
+ * Because we store varlenas unaligned, we might need a buffer for them, too,
+ * like for pass-by-ref fixed-widths above.
+ */
+ int buf_needed;
+ char *bufp;
+
+ buf_needed = srcSize + (VARHDRSZ) * numelements;
+
+ if (scan->attr_buf_size < buf_needed)
+ {
+ if (scan->attr_buf)
+ pfree(scan->attr_buf);
+ scan->attr_buf = MemoryContextAlloc(scan->context, buf_needed);
+ scan->attr_buf_size = buf_needed;
+ }
+
+ bufp = scan->attr_buf;
+
+ for (int i = 0; i < numelements; i++)
+ {
+ if (nulls[i])
+ datums[i] = (Datum) 0;
+ else
+ {
+ if (*p == 0)
+ elog(ERROR, "invalid zs varlen header");
+
+ if ((*p & 0x80) == 0)
+ {
+ /*
+ * XXX: it would be nice if these were identical to the
+ * short varlen format used elsewhere in PostgreSQL, so
+ * we wouldn't need to copy these.
+ */
+ int this_sz = *p - 1;
+
+ datums[i] = PointerGetDatum(bufp);
+
+ /* XXX: I'm not sure if it makes sense to use
+ * the short varlen format, since this is just an in-memory
+ * copy. I think it's a good way to shake out bugs, though,
+ * so do it for now.
+ */
+ if (attr->attstorage != 'p')
+ {
+ SET_VARSIZE_1B(bufp, 1 + this_sz);
+ memcpy(bufp + 1, p + 1, this_sz);
+ p += 1 + this_sz;
+ bufp += 1 + this_sz;
+ }
+ else
+ {
+ SET_VARSIZE(bufp, VARHDRSZ + this_sz);
+ memcpy(VARDATA(bufp), p + 1, this_sz);
+ p += 1 + this_sz;
+ bufp += VARHDRSZ + this_sz;
+ }
+ }
+ else if (p[0] == 0xFF && p[1] == 0xFF)
+ {
+ /*
+ * zedstore toast pointer.
+ *
+ * Note that the zedstore toast pointer is stored
+ * unaligned. That's OK. Per postgres.h, varatts
+ * with 1-byte header don't need to aligned, and that
+ * applies to toast pointers, too.
+ */
+ varatt_zs_toastptr toastptr;
+
+ datums[i] = PointerGetDatum(bufp);
+
+ SET_VARTAG_1B_E(&toastptr, VARTAG_ZEDSTORE);
+ memcpy(&toastptr.zst_block, p + 2, sizeof(BlockNumber));
+ memcpy(bufp, &toastptr, sizeof(varatt_zs_toastptr));
+ p += 2 + sizeof(BlockNumber);
+ bufp += sizeof(varatt_zs_toastptr);
+ }
+ else
+ {
+ int this_sz = (((p[0] & 0x7f) << 8) | p[1]) - 1;
+
+ bufp = (char *) att_align_nominal(bufp, 'i');
+ datums[i] = PointerGetDatum(bufp);
+
+ Assert(bufp + VARHDRSZ + this_sz - scan->attr_buf <= buf_needed);
+
+ SET_VARSIZE(bufp, VARHDRSZ + this_sz);
+ memcpy(VARDATA(bufp), p + 2, this_sz);
+
+ p += 2 + this_sz;
+ bufp += VARHDRSZ + this_sz;
+ }
+ }
+ }
+ }
+ else
+ elog(ERROR, "not implemented");
+
+ if (p - (unsigned char *) src != srcSize)
+ elog(ERROR, "corrupt item array");
+}
+
+
+
+/*
+ * Routines to split, merge, and recompress items.
+ */
+
+static ZSExplodedItem *
+zsbt_attr_explode_item(ZSAttributeArrayItem *item)
+{
+ ZSExplodedItem *eitem;
+ int tidno;
+ zstid currtid;
+ zstid *tids;
+ char *databuf;
+ char *p;
+ char *pend;
+ uint64 *codewords;
+
+ eitem = palloc(sizeof(ZSExplodedItem));
+ eitem->t_size = 0;
+ eitem->t_flags = 0;
+ eitem->t_num_elements = item->t_num_elements;
+
+ if ((item->t_flags & ZSBT_ATTR_COMPRESSED) != 0)
+ {
+ ZSAttributeCompressedItem *citem = (ZSAttributeCompressedItem *) item;
+ int payloadsz;
+
+ payloadsz = citem->t_uncompressed_size;
+ Assert(payloadsz > 0);
+
+ databuf = palloc(payloadsz);
+
+ zs_decompress(citem->t_payload, databuf,
+ citem->t_size - offsetof(ZSAttributeCompressedItem, t_payload),
+ payloadsz);
+
+ p = databuf;
+ pend = databuf + payloadsz;
+ }
+ else
+ {
+ p = (char *) item->t_tid_codewords;
+ pend = ((char *) item) + item->t_size;
+ }
+
+ /* Decode TIDs from codewords */
+ tids = eitem->tids = palloc(item->t_num_elements * sizeof(zstid));
+ tidno = 0;
+ currtid = item->t_firsttid;
+ codewords = (uint64 *) p;
+ for (int i = 0; i < item->t_num_codewords; i++)
+ {
+ int ntids;
+
+ ntids = simple8b_decode(codewords[i], &tids[tidno]);
+
+ for (int j = 0; j < ntids; j++)
+ {
+ currtid += tids[tidno];
+ tids[tidno] = currtid;
+ tidno++;
+ }
+ }
+ p += item->t_num_codewords * sizeof(uint64);
+
+ /* nulls */
+ if ((item->t_flags & ZSBT_HAS_NULLS) != 0)
+ {
+ eitem->nullbitmap = (bits8 *) p;
+
+ p += ZSBT_ATTR_BITMAPLEN(item->t_num_elements);
+ }
+ else
+ {
+ eitem->nullbitmap = palloc0(ZSBT_ATTR_BITMAPLEN(item->t_num_elements));
+ }
+
+ /* datum data */
+ eitem->datumdata = p;
+ eitem->datumdatasz = pend - p;
+
+ return eitem;
+}
+
+/*
+ * Estimate how much space an array item takes, when it's uncompressed.
+ */
+static int
+zsbt_item_uncompressed_size(ZSAttributeArrayItem *item)
+{
+ if (item->t_size == 0)
+ {
+ ZSExplodedItem *eitem = (ZSExplodedItem *) item;
+ size_t sz = 0;
+
+		/* FIXME: account for tids and null bitmap accurately. */
+
+		sz += eitem->t_num_elements * 2;	/* conservatively estimate 2 bytes per TID */
+ sz += eitem->datumdatasz;
+
+ return sz;
+ }
+ else if (item->t_flags & ZSBT_ATTR_COMPRESSED)
+ {
+ ZSAttributeCompressedItem *citem = (ZSAttributeCompressedItem *) item;
+
+ return offsetof(ZSAttributeCompressedItem, t_payload) + citem->t_uncompressed_size;
+ }
+ else
+ return item->t_size;
+}
+
+void
+zsbt_split_item(Form_pg_attribute attr, ZSExplodedItem *origitem, zstid first_right_tid,
+ ZSExplodedItem **leftitem_p, ZSExplodedItem **rightitem_p)
+{
+ int i;
+ int left_num_elements;
+ int left_datasz;
+ int right_num_elements;
+ int right_datasz;
+ char *p;
+ ZSExplodedItem *leftitem;
+ ZSExplodedItem *rightitem;
+
+ if (origitem->t_size != 0)
+ origitem = zsbt_attr_explode_item((ZSAttributeArrayItem *) origitem);
+
+ p = origitem->datumdata;
+ for (i = 0; i < origitem->t_num_elements; i++)
+ {
+ if (origitem->tids[i] >= first_right_tid)
+ break;
+
+ p += zsbt_attr_datasize(attr->attlen, p);
+ }
+ left_num_elements = i;
+ left_datasz = p - origitem->datumdata;
+
+ right_num_elements = origitem->t_num_elements - left_num_elements;
+ right_datasz = origitem->datumdatasz - left_datasz;
+
+ if (left_num_elements == origitem->t_num_elements)
+ elog(ERROR, "item split failed");
+
+ leftitem = palloc(sizeof(ZSExplodedItem));
+ leftitem->t_size = 0;
+ leftitem->t_flags = 0;
+ leftitem->t_num_elements = left_num_elements;
+ leftitem->tids = palloc(left_num_elements * sizeof(zstid));
+ leftitem->nullbitmap = palloc0(left_num_elements * sizeof(bool));
+ leftitem->datumdata = palloc(left_datasz);
+ leftitem->datumdatasz = left_datasz;
+
+ memcpy(leftitem->tids, &origitem->tids[0], left_num_elements * sizeof(zstid));
+ /* XXX: should copy the null bitmap in a smarter way */
+ for (i = 0; i < left_num_elements; i++)
+ {
+ if (zsbt_attr_item_isnull(origitem->nullbitmap, i))
+ zsbt_attr_item_setnull(leftitem->nullbitmap, i);
+ }
+ memcpy(leftitem->datumdata, &origitem->datumdata[0], left_datasz);
+
+ rightitem = palloc(sizeof(ZSExplodedItem));
+ rightitem->t_size = 0;
+ rightitem->t_flags = 0;
+ rightitem->t_num_elements = right_num_elements;
+ rightitem->tids = palloc(right_num_elements * sizeof(zstid));
+	rightitem->nullbitmap = palloc0(right_num_elements * sizeof(bool));
+ rightitem->datumdata = palloc(right_datasz);
+ rightitem->datumdatasz = right_datasz;
+
+ memcpy(rightitem->tids, &origitem->tids[left_num_elements], right_num_elements * sizeof(zstid));
+ /* XXX: should copy the null bitmap in a smarter way */
+ for (i = 0; i < right_num_elements; i++)
+ {
+		if (zsbt_attr_item_isnull(origitem->nullbitmap, left_num_elements + i))
+			zsbt_attr_item_setnull(rightitem->nullbitmap, i);
+ }
+ memcpy(rightitem->datumdata, &origitem->datumdata[left_datasz], right_datasz);
+
+ *leftitem_p = leftitem;
+ *rightitem_p = rightitem;
+}
+
+static ZSExplodedItem *
+zsbt_combine_items(List *items, int start, int end)
+{
+ ZSExplodedItem *newitem;
+ int total_elements;
+ int total_datumdatasz;
+ List *exploded_items = NIL;
+
+ total_elements = 0;
+ total_datumdatasz = 0;
+ for (int i = start; i < end; i++)
+ {
+ ListCell *lc = list_nth_cell(items, i);
+ ZSAttributeArrayItem *item = lfirst(lc);
+ ZSExplodedItem *eitem;
+
+ if (item->t_size != 0)
+ {
+ eitem = zsbt_attr_explode_item(item);
+ lfirst(lc) = eitem;
+ }
+ else
+ eitem = (ZSExplodedItem *) item;
+
+ exploded_items = lappend(exploded_items, eitem);
+
+ total_elements += eitem->t_num_elements;
+ total_datumdatasz += eitem->datumdatasz;
+ }
+ Assert(total_elements <= MAX_TIDS_PER_ATTR_ITEM);
+
+ newitem = palloc(sizeof(ZSExplodedItem));
+	newitem->t_size = 0;		/* 0 indicates an exploded item */
+ newitem->t_flags = 0;
+ newitem->t_num_elements = total_elements;
+
+ newitem->tids = palloc(total_elements * sizeof(zstid));
+ newitem->nullbitmap = palloc0(ZSBT_ATTR_BITMAPLEN(total_elements));
+ newitem->datumdata = palloc(total_datumdatasz);
+ newitem->datumdatasz = total_datumdatasz;
+
+ {
+ char *p = newitem->datumdata;
+ int elemno = 0;
+ for (int i = start; i < end; i++)
+ {
+ ZSExplodedItem *eitem = list_nth(items, i);
+
+ memcpy(&newitem->tids[elemno], eitem->tids, eitem->t_num_elements * sizeof(zstid));
+
+ /* XXX: should copy the null bitmap in a smarter way */
+ for (int j = 0; j < eitem->t_num_elements; j++)
+ {
+ if (zsbt_attr_item_isnull(eitem->nullbitmap, j))
+ zsbt_attr_item_setnull(newitem->nullbitmap, elemno + j);
+ }
+
+ memcpy(p, eitem->datumdata, eitem->datumdatasz);
+ p += eitem->datumdatasz;
+ elemno += eitem->t_num_elements;
+ }
+ }
+
+ return newitem;
+}
+
+static ZSAttributeArrayItem *
+zsbt_pack_item(Form_pg_attribute att, ZSExplodedItem *eitem)
+{
+ ZSAttributeArrayItem *newitem;
+ int num_elements = eitem->t_num_elements;
+ zstid firsttid;
+ zstid prevtid;
+ uint64 deltas[MAX_TIDS_PER_ATTR_ITEM];
+ uint64 codewords[MAX_TIDS_PER_ATTR_ITEM];
+ int num_codewords;
+ int total_encoded;
+ size_t itemsz;
+ char *p;
+ bool has_nulls;
+ int nullbitmapsz;
+
+ Assert(num_elements > 0);
+ Assert(num_elements <= MAX_TIDS_PER_ATTR_ITEM);
+
+ /* compute deltas */
+ firsttid = eitem->tids[0];
+ prevtid = firsttid;
+ deltas[0] = 0;
+ for (int i = 1; i < num_elements; i++)
+ {
+ zstid this_tid = eitem->tids[i];
+
+ deltas[i] = this_tid - prevtid;
+ prevtid = this_tid;
+ }
+
+ /* pack into codewords */
+ num_codewords = 0;
+ total_encoded = 0;
+ while (total_encoded < num_elements)
+ {
+ int num_encoded;
+
+ codewords[num_codewords] =
+ simple8b_encode(&deltas[total_encoded], num_elements - total_encoded, &num_encoded);
+
+ total_encoded += num_encoded;
+ num_codewords++;
+ }
+
+ nullbitmapsz = ZSBT_ATTR_BITMAPLEN(num_elements);
+ has_nulls = false;
+ for (int i = 0; i < nullbitmapsz; i++)
+ {
+ if (eitem->nullbitmap[i] != 0)
+ {
+ has_nulls = true;
+ break;
+ }
+ }
+
+ itemsz = offsetof(ZSAttributeArrayItem, t_tid_codewords);
+ itemsz += num_codewords * sizeof(uint64);
+ if (has_nulls)
+ {
+ /* reserve space for NULL bitmap */
+ itemsz += nullbitmapsz;
+ }
+ itemsz += eitem->datumdatasz;
+
+ Assert(has_nulls || eitem->datumdatasz > 0);
+
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_flags = 0;
+ if (has_nulls)
+ newitem->t_flags |= ZSBT_HAS_NULLS;
+ newitem->t_num_elements = num_elements;
+ newitem->t_num_codewords = num_codewords;
+ newitem->t_firsttid = eitem->tids[0];
+ newitem->t_endtid = eitem->tids[num_elements - 1] + 1;
+
+ memcpy(newitem->t_tid_codewords, codewords, num_codewords * sizeof(uint64));
+
+ p = (char *) &newitem->t_tid_codewords[num_codewords];
+
+ if (has_nulls)
+ {
+ memcpy(p, eitem->nullbitmap, nullbitmapsz);
+ p += nullbitmapsz;
+ }
+
+ memcpy(p, eitem->datumdata, eitem->datumdatasz);
+ p += eitem->datumdatasz;
+
+ Assert(p - ((char *) newitem) == itemsz);
+
+ return newitem;
+}
+
+static ZSAttributeArrayItem *
+zsbt_compress_item(ZSAttributeArrayItem *item)
+{
+ ZSAttributeCompressedItem *citem;
+ char *uncompressed_payload;
+ int uncompressed_size;
+ int compressed_size;
+ int item_allocsize;
+
+ Assert(item->t_size > 0);
+
+ uncompressed_payload = (char *) &item->t_tid_codewords;
+ uncompressed_size = ((char *) item) + item->t_size - uncompressed_payload;
+
+ item_allocsize = item->t_size;
+ /*
+ * XXX: because pglz requires a slightly larger buffer to even try compressing,
+ * make a slightly larger allocation. If the compression succeeds but with a
+ * poor ratio, so that we actually use the extra space, then we will store it
+ * uncompressed, but pglz refuses to even try if the destination buffer is not
+ * large enough.
+ */
+ item_allocsize += 10;
+
+ citem = palloc(item_allocsize);
+ citem->t_flags = ZSBT_ATTR_COMPRESSED;
+ if ((item->t_flags & ZSBT_HAS_NULLS) != 0)
+ citem->t_flags |= ZSBT_HAS_NULLS;
+ citem->t_num_elements = item->t_num_elements;
+ citem->t_num_codewords = item->t_num_codewords;
+ citem->t_uncompressed_size = uncompressed_size;
+ citem->t_firsttid = item->t_firsttid;
+ citem->t_endtid = item->t_endtid;
+
+ /* try compressing */
+ compressed_size = zs_try_compress(uncompressed_payload,
+ citem->t_payload,
+ uncompressed_size,
+ item_allocsize - offsetof(ZSAttributeCompressedItem, t_payload));
+ /*
+ * Skip compression if it wouldn't save at least 8 bytes. There are some
+ * extra header bytes on compressed items, so if we didn't check for this,
+ * the compressed item might actually be larger than the original item,
+ * even if the size of the compressed portion was the same as uncompressed
+	 * size (or 1-2 bytes less). The 8 byte margin fixes that problem.
+ * Besides, it's hardly worth the CPU overhead of having to decompress
+ * on reading, for a saving of a few bytes.
+ */
+ if (compressed_size > 0 && compressed_size + 8 < uncompressed_size)
+ {
+ citem->t_size = offsetof(ZSAttributeCompressedItem, t_payload) + compressed_size;
+ Assert(citem->t_size < item->t_size);
+ return (ZSAttributeArrayItem *) citem;
+ }
+ else
+ return item;
+}
+
+
+/*
+ * Re-pack and compress a list of items.
+ *
+ * If there are small items in the input list, such that they can be merged
+ * together into larger items, we'll do that. And if there are uncompressed
+ * items, we'll try to compress them. If the input list contains "exploded"
+ * in-memory items, they will be packed into proper items suitable for
+ * storing on-disk.
+ */
+List *
+zsbt_attr_recompress_items(Form_pg_attribute attr, List *items)
+{
+ List *newitems = NIL;
+ int i;
+
+ /*
+ * Heuristics needed on when to try recompressing or merging existing
+ * items. Some musings on that:
+ *
+ * - If an item is already compressed, and close to maximum size, then
+ * it probably doesn't make sense to recompress.
+ * - If there are two adjacent items that are short, then it is probably
+ * worth trying to merge them.
+ */
+
+ /* loop through items, and greedily pack them */
+
+ i = 0;
+ while (i < list_length(items))
+ {
+ int total_num_elements = 0;
+ size_t total_size = 0;
+ int j;
+ ZSAttributeArrayItem *newitem;
+
+ for (j = i; j < list_length(items); j++)
+ {
+ ZSAttributeArrayItem *this_item = (ZSAttributeArrayItem *) list_nth(items, j);
+ size_t this_size;
+ int this_num_elements;
+
+ this_size = zsbt_item_uncompressed_size(this_item);
+ this_num_elements = this_item->t_num_elements;
+
+ /* don't create an item that's too large, in terms of size, or in # of tids */
+ if (total_num_elements + this_num_elements > MAX_TIDS_PER_ATTR_ITEM)
+ break;
+ if (total_size + this_size > MAX_ATTR_ITEM_SIZE)
+ break;
+ total_size += this_size;
+ total_num_elements += this_num_elements;
+ }
+ if (j == i)
+ j++; /* tolerate existing oversized items */
+
+ /* i - j are the items to pack */
+ if (j - i > 1)
+ {
+ ZSAttributeArrayItem *packeditem;
+ ZSExplodedItem *combineditem;
+
+ combineditem = zsbt_combine_items(items, i, j);
+ packeditem = zsbt_pack_item(attr, combineditem);
+ newitem = zsbt_compress_item(packeditem);
+ }
+ else
+ {
+ ZSAttributeArrayItem *olditem = list_nth(items, i);
+
+ if (olditem->t_size == 0)
+ {
+ newitem = zsbt_pack_item(attr, (ZSExplodedItem *) olditem);
+ newitem = zsbt_compress_item(newitem);
+ }
+ else if (olditem->t_flags & ZSBT_ATTR_COMPRESSED)
+ newitem = olditem;
+ else
+ newitem = zsbt_compress_item(olditem);
+ }
+
+ newitems = lappend(newitems, newitem);
+
+ i = j;
+ }
+
+ /* Check that the resulting items are in correct order, and don't overlap. */
+#ifdef USE_ASSERT_CHECKING
+ {
+ zstid endtid = 0;
+ ListCell *lc;
+
+ foreach (lc, newitems)
+ {
+ ZSAttributeArrayItem *i = (ZSAttributeArrayItem *) lfirst(lc);
+
+ Assert(i->t_firsttid >= endtid);
+ Assert(i->t_endtid > i->t_firsttid);
+ endtid = i->t_endtid;
+
+ /* there should be no exploded items left */
+ Assert(i->t_size != 0);
+ }
+ }
+#endif
+
+ return newitems;
+}
diff --git a/src/backend/access/zedstore/zedstore_attpage.c b/src/backend/access/zedstore/zedstore_attpage.c
new file mode 100644
index 00000000000..51270e6fea0
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_attpage.c
@@ -0,0 +1,833 @@
+/*
+ * zedstore_attpage.c
+ * Routines for handling attribute leaf pages.
+ *
+ * A Zedstore table consists of multiple B-trees, one for the TIDs and one
+ * for each attribute. The functions in this file deal with the leaf pages
+ * of the attribute trees.
+ *
+ * Operations:
+ *
+ * - Sequential scan in TID order
+ * - must be efficient with scanning multiple trees in sync
+ *
+ * - random lookups, by TID (for index scan)
+ *
+ * - range scans by TID (for bitmap index scan)
+ *
+ * NOTES:
+ * - Locking order: child before parent, left before right
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_attpage.c
+ */
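+
+/*
+ * Illustrative usage (a sketch only, not called anywhere in the patch): to
+ * fetch the datum for a given TID from one attribute tree, a caller would do
+ * roughly the following. After a successful fetch, the scan.array_tids /
+ * array_datums / array_isnulls arrays hold the array item covering 'tid';
+ * the caller still has to search them for the exact TID, since the item's
+ * TID range may have gaps.
+ *
+ *     ZSAttrTreeScan scan;
+ *
+ *     zsbt_attr_begin_scan(rel, tupledesc, attno, &scan);
+ *     if (zsbt_attr_scan_fetch_array(&scan, tid))
+ *     {
+ *         ... look up 'tid' in scan.array_tids, and use the corresponding
+ *         scan.array_datums / scan.array_isnulls entry ...
+ *     }
+ *     zsbt_attr_end_scan(&scan);
+ */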
+#include "postgres.h"
+
+#include "access/zedstore_compression.h"
+#include "access/zedstore_internal.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+/* prototypes for local functions */
+static void zsbt_attr_repack_replace(Relation rel, AttrNumber attno,
+ Buffer oldbuf, List *items);
+static void zsbt_attr_add_items(Relation rel, AttrNumber attno, Buffer buf,
+ List *newitems);
+
+/* ----------------------------------------------------------------
+ * Public interface
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Begin a scan of an attribute btree.
+ *
+ * Fills in the scan struct in *scan.
+ */
+void
+zsbt_attr_begin_scan(Relation rel, TupleDesc tdesc, AttrNumber attno,
+ ZSAttrTreeScan *scan)
+{
+ scan->rel = rel;
+ scan->attno = attno;
+ scan->attdesc = TupleDescAttr(tdesc, attno - 1);
+
+ scan->context = CurrentMemoryContext;
+ scan->array_datums = MemoryContextAlloc(scan->context, sizeof(Datum));
+ scan->array_isnulls = MemoryContextAlloc(scan->context, sizeof(bool) + 7);
+ scan->array_tids = MemoryContextAlloc(scan->context, sizeof(zstid));
+ scan->array_datums_allocated_size = 1;
+ scan->array_num_elements = 0;
+ scan->array_curr_idx = -1;
+
+ scan->decompress_buf = NULL;
+ scan->decompress_buf_size = 0;
+ scan->attr_buf = NULL;
+ scan->attr_buf_size = 0;
+
+ scan->active = true;
+ scan->lastbuf = InvalidBuffer;
+ scan->lastoff = InvalidOffsetNumber;
+}
+
+void
+zsbt_attr_end_scan(ZSAttrTreeScan *scan)
+{
+ if (!scan->active)
+ return;
+
+ if (scan->lastbuf != InvalidBuffer)
+ ReleaseBuffer(scan->lastbuf);
+
+ scan->active = false;
+ scan->array_num_elements = 0;
+ scan->array_curr_idx = -1;
+
+ if (scan->array_datums)
+ pfree(scan->array_datums);
+ if (scan->array_isnulls)
+ pfree(scan->array_isnulls);
+ if (scan->array_tids)
+ pfree(scan->array_tids);
+ if (scan->decompress_buf)
+ pfree(scan->decompress_buf);
+ if (scan->attr_buf)
+ pfree(scan->attr_buf);
+}
+
+/*
+ * Fetch the array item whose firsttid-endtid range contains 'nexttid',
+ * if any.
+ *
+ * Return true if an item was found. The Datum/isnull data are placed
+ * into the scan->array_* fields. The data is valid until the next
+ * call of this function. Note that the item's range contains 'nexttid',
+ * but its TID list might not include the exact TID itself. The caller
+ * must scan the array to check for that.
+ *
+ * This is normally not used directly. Use the zsbt_attr_fetch() wrapper
+ * instead.
+ */
+bool
+zsbt_attr_scan_fetch_array(ZSAttrTreeScan *scan, zstid nexttid)
+{
+ if (!scan->active)
+ return false;
+
+ /*
+ * Find the item containing nexttid.
+ */
+ for (;;)
+ {
+ Buffer buf;
+ Page page;
+ OffsetNumber off;
+ OffsetNumber maxoff;
+
+ /*
+ * Find and lock the leaf page containing scan->nexttid.
+ */
+ buf = zsbt_find_and_lock_leaf_containing_tid(scan->rel, scan->attno,
+ scan->lastbuf, nexttid,
+ BUFFER_LOCK_SHARE);
+ scan->lastbuf = buf;
+ if (!BufferIsValid(buf))
+ {
+ /*
+ * Completely empty tree. This should only happen at the beginning of a
+ * scan - a tree cannot go missing after it's been created - but we don't
+ * currently check for that.
+ */
+ break;
+ }
+ page = BufferGetPage(buf);
+
+ /*
+ * Scan the items on the page, to find the next one that covers
+ * nexttid.
+ */
+ /* TODO: check the last offset first, as an optimization */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSAttributeArrayItem *item = (ZSAttributeArrayItem *) PageGetItem(page, iid);
+
+ if (item->t_endtid <= nexttid)
+ continue;
+
+ if (item->t_firsttid > nexttid)
+ break;
+
+ /*
+ * Extract the data into scan->array_* fields.
+ *
+ * NOTE: zsbt_attr_item_extract() always makes a copy of the data,
+ * so we can release the lock on the page after doing this.
+ */
+ zsbt_attr_item_extract(scan, item);
+ scan->array_curr_idx = -1;
+
+ if (scan->array_num_elements > 0)
+ {
+ /* Found it! */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ return true;
+ }
+ }
+
+ /* No matching items. XXX: we should remember the 'next' block, for
+ * the next call. When we're seqscanning, we will almost certainly need
+ * that next.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ return false;
+ }
+
+ /* Reached end of scan. */
+ scan->array_num_elements = 0;
+ scan->array_curr_idx = -1;
+ if (BufferIsValid(scan->lastbuf))
+ ReleaseBuffer(scan->lastbuf);
+ scan->lastbuf = InvalidBuffer;
+ return false;
+}
+
+/*
+ * Insert multiple items into the given attribute's btree.
+ */
+void
+zsbt_attr_multi_insert(Relation rel, AttrNumber attno,
+ Datum *datums, bool *isnulls, zstid *tids, int nitems)
+{
+ Form_pg_attribute attr;
+ Buffer buf;
+ zstid insert_target_key;
+ List *newitems;
+
+ Assert (attno >= 1);
+ attr = &rel->rd_att->attrs[attno - 1];
+
+ /*
+ * Find the right place for the given TID.
+ */
+ insert_target_key = tids[0];
+
+ /* Create items to insert. */
+ newitems = zsbt_attr_create_items(attr, datums, isnulls, tids, nitems);
+
+ buf = zsbt_descend(rel, attno, insert_target_key, 0, false);
+
+ /*
+ * FIXME: I think it's possible that the target page has been split by
+ * a concurrent backend, so that it contains only part of the keyspace.
+ * zsbt_attr_add_items() would not handle that correctly.
+ */
+
+ /* recompress and possibly split the page */
+ zsbt_attr_add_items(rel, attno, buf, newitems);
+
+ /* zsbt_attr_add_items unlocked 'buf' */
+ ReleaseBuffer(buf);
+}
+
+/*
+ * Remove datums for the given TIDs from the attribute tree.
+ */
+void
+zsbt_attr_remove(Relation rel, AttrNumber attno, IntegerSet *tids)
+{
+ Form_pg_attribute attr;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+ List *newitems = NIL;
+ ZSAttributeArrayItem *item;
+ ZSExplodedItem *newitem;
+ zstid nexttid;
+ MemoryContext oldcontext;
+ MemoryContext tmpcontext;
+
+ tmpcontext = AllocSetContextCreate(CurrentMemoryContext,
+ "ZedstoreAMVacuumContext",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(tmpcontext);
+
+ attr = &rel->rd_att->attrs[attno - 1];
+
+ intset_begin_iterate(tids);
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = InvalidZSTid;
+
+ while (nexttid < MaxPlusOneZSTid)
+ {
+ buf = zsbt_descend(rel, attno, nexttid, 0, false);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ newitems = NIL;
+
+ /*
+ * Find the item containing the first tid to remove.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ off = FirstOffsetNumber;
+ for (;;)
+ {
+ zstid endtid;
+ ItemId iid;
+ int num_to_remove;
+ zstid *tids_arr;
+
+ if (off > maxoff)
+ break;
+
+ iid = PageGetItemId(page, off);
+ item = (ZSAttributeArrayItem *) PageGetItem(page, iid);
+ off++;
+
+ /*
+ * If we don't find an item containing the given TID, just skip
+ * over it.
+ *
+ * This can legitimately happen if, for example, VACUUM is
+ * interrupted after it has already removed the attribute data for
+ * the dead tuples.
+ */
+ while (nexttid < item->t_firsttid)
+ {
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+ }
+
+ /* If this item doesn't contain any of the TIDs we're removing, keep it as it is. */
+ endtid = item->t_endtid;
+ if (endtid < nexttid)
+ {
+ newitems = lappend(newitems, item);
+ continue;
+ }
+
+ /*
+ * We now have an array item at hand that contains at least one
+ * of the TIDs we want to remove. Split the array, removing all
+ * the target tids.
+ */
+ tids_arr = palloc((item->t_num_elements + 1) * sizeof(zstid));
+ num_to_remove = 0;
+ while (nexttid < endtid)
+ {
+ tids_arr[num_to_remove++] = nexttid;
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+ }
+ tids_arr[num_to_remove++] = MaxPlusOneZSTid;
+ newitem = zsbt_attr_remove_from_item(attr, item, tids_arr);
+ pfree(tids_arr);
+ if (newitem)
+ newitems = lappend(newitems, newitem);
+ }
+
+ /*
+ * Skip over any remaining TIDs in the dead TID list that would
+ * be on this page, but are missing.
+ */
+ while (nexttid < opaque->zs_hikey)
+ {
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+ }
+
+ /* Now pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ if (newitems)
+ {
+ zsbt_attr_repack_replace(rel, attno, buf, newitems);
+ }
+ else
+ {
+ zs_split_stack *stack;
+
+ stack = zsbt_unlink_page(rel, attno, buf, 0);
+
+ if (!stack)
+ {
+ /* failed. */
+ Page newpage = PageGetTempPageCopySpecial(BufferGetPage(buf));
+
+ stack = zs_new_split_stack_entry(buf, newpage);
+ }
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack);
+ }
+ ReleaseBuffer(buf); /* zs_apply_split_changes unlocked 'buf' */
+
+ /*
+ * We can now free the decompression contexts. The pointers in the 'newitems' list
+ * point to decompression buffers, so we cannot free them until after writing out
+ * the pages.
+ */
+ MemoryContextReset(tmpcontext);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(tmpcontext);
+}
+
+/* ----------------------------------------------------------------
+ * Internal routines
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * This helper function is used to implement INSERT, UPDATE and DELETE.
+ *
+ * The items in the 'newitems' list are added to the page, at the correct position.
+ *
+ * This function handles decompressing and recompressing items, and splitting
+ * existing items, or the page, as needed.
+ */
+static void
+zsbt_attr_add_items(Relation rel, AttrNumber attno, Buffer buf, List *newitems)
+{
+ Form_pg_attribute attr;
+ Page page = BufferGetPage(buf);
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ List *items = NIL;
+ Size growth;
+ ListCell *lc;
+ ListCell *nextnewlc;
+ zstid last_existing_tid;
+ ZSAttributeArrayItem *olditem;
+ ZSAttributeArrayItem *newitem;
+
+ attr = &rel->rd_att->attrs[attno - 1];
+
+ nextnewlc = list_head(newitems);
+
+ Assert(newitems != NIL);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Quick check if the new items go to the end of the page. This is the
+ * common case, when inserting new rows, since we allocate TIDs in order.
+ */
+ if (maxoff == 0)
+ last_existing_tid = 0;
+ else
+ {
+ ItemId iid;
+ ZSAttributeArrayItem *lastitem;
+
+ iid = PageGetItemId(page, maxoff);
+ lastitem = (ZSAttributeArrayItem *) PageGetItem(page, iid);
+
+ last_existing_tid = lastitem->t_endtid;
+ }
+
+ /*
+ * If the new items go to the end of the page, and they fit without splitting
+ * the page, just add them to the end.
+ */
+ if (((ZSAttributeArrayItem *) lfirst(nextnewlc))->t_firsttid >= last_existing_tid)
+ {
+ growth = 0;
+ foreach (lc, newitems)
+ {
+ ZSAttributeArrayItem *item = (ZSAttributeArrayItem *) lfirst(lc);
+
+ growth += MAXALIGN(item->t_size) + sizeof(ItemId);
+ }
+
+ if (growth <= PageGetExactFreeSpace(page))
+ {
+ /* The new items fit on the page. Add them. */
+ START_CRIT_SECTION();
+
+ foreach(lc, newitems)
+ {
+ ZSAttributeArrayItem *item = (ZSAttributeArrayItem *) lfirst(lc);
+
+ Assert(item->t_size > 0);
+
+ if (PageAddItemExtended(page,
+ (Item) item, item->t_size,
+ PageGetMaxOffsetNumber(page) + 1,
+ PAI_OVERWRITE) == InvalidOffsetNumber)
+ elog(ERROR, "could not add item to attribute page");
+ }
+
+ MarkBufferDirty(buf);
+
+ /* TODO: WAL-log */
+
+ END_CRIT_SECTION();
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ list_free(newitems);
+
+ return;
+ }
+ }
+
+ /*
+ * Need to recompress and/or split the hard way.
+ *
+ * First, loop through the old and new items in lockstep, to figure out
+ * where the new items go. If some of the old and new items have
+ * overlapping TID ranges, we will need to split some items so that they
+ * no longer overlap.
+ */
+ off = 1;
+ if (off <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ olditem = (ZSAttributeArrayItem *) PageGetItem(page, iid);
+ off++;
+ }
+ else
+ olditem = NULL;
+
+ if (nextnewlc)
+ {
+ newitem = lfirst(nextnewlc);
+ nextnewlc = lnext(newitems, nextnewlc);
+ }
+
+ for (;;)
+ {
+ if (!newitem && !olditem)
+ break;
+
+ if (newitem && olditem && newitem->t_firsttid == olditem->t_firsttid)
+ elog(ERROR, "duplicate TID on attribute page");
+
+ /*
+ * NNNNNNNN
+ *           OOOOOOOOO
+ */
+ if (newitem && (!olditem || newitem->t_endtid <= olditem->t_firsttid))
+ {
+ items = lappend(items, newitem);
+ if (nextnewlc)
+ {
+ newitem = lfirst(nextnewlc);
+ nextnewlc = lnext(newitems, nextnewlc);
+ }
+ else
+ newitem = NULL;
+ continue;
+ }
+
+ /*
+ *           NNNNNNNN
+ * OOOOOOOOO
+ */
+ if (olditem && (!newitem || olditem->t_endtid <= newitem->t_firsttid))
+ {
+ items = lappend(items, olditem);
+ if (off <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ olditem = (ZSAttributeArrayItem *) PageGetItem(page, iid);
+ off++;
+ }
+ else
+ olditem = NULL;
+ continue;
+ }
+
+ /*
+ * NNNNNNNN
+ *     OOOOOOOOO
+ */
+ if (olditem->t_firsttid > newitem->t_firsttid)
+ {
+ ZSExplodedItem *left_newitem;
+ ZSExplodedItem *right_newitem;
+ /*
+ * split newitem:
+ *
+ * NNNNNnnnn
+ *      OOOOOOOOO
+ */
+ zsbt_split_item(attr, (ZSExplodedItem *) newitem, olditem->t_firsttid,
+ &left_newitem, &right_newitem);
+ items = lappend(items, left_newitem);
+ newitem = (ZSAttributeArrayItem *) right_newitem;
+ continue;
+ }
+
+ /*
+ *     NNNNNNNN
+ * OOOOOOOOO
+ */
+ if (olditem->t_firsttid < newitem->t_firsttid)
+ {
+ ZSExplodedItem *left_olditem;
+ ZSExplodedItem *right_olditem;
+ /*
+ * split olditem:
+ *
+ * OOOOOoooo
+ *      NNNNNNNNN
+ */
+ zsbt_split_item(attr, (ZSExplodedItem *) olditem, newitem->t_firsttid,
+ &left_olditem, &right_olditem);
+ items = lappend(items, left_olditem);
+ olditem = (ZSAttributeArrayItem *) right_olditem;
+ continue;
+ }
+
+ elog(ERROR, "shouldn't reach here");
+ }
+
+ /* Now pass the list to the repacker, to distribute the items to pages. */
+ IncrBufferRefCount(buf);
+
+ /*
+ * Now we have a list of non-overlapping items, containing all the old and
+ * new data. zsbt_attr_repack_replace() takes care of storing them on the
+ * page, splitting the page if needed.
+ */
+ zsbt_attr_repack_replace(rel, attno, buf, items);
+
+ list_free(items);
+}
+
+
+/*
+ * Repacker routines
+ */
+typedef struct
+{
+ Page currpage;
+ int compressed_items;
+
+ /* first page writes over the old buffer, subsequent pages get newly-allocated buffers */
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ int total_items;
+ int total_packed_items;
+
+ AttrNumber attno;
+ zstid hikey;
+} zsbt_attr_repack_context;
+
+static void
+zsbt_attr_repack_newpage(zsbt_attr_repack_context *cxt, zstid nexttid, int flags)
+{
+ Page newpage;
+ ZSBtreePageOpaque *newopaque;
+ zs_split_stack *stack;
+
+ if (cxt->currpage)
+ {
+ /* set the last tid on previous page */
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(cxt->currpage);
+
+ oldopaque->zs_hikey = nexttid;
+ }
+
+ newpage = (Page) palloc(BLCKSZ);
+ PageInit(newpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+
+ stack = zs_new_split_stack_entry(InvalidBuffer, /* will be assigned later */
+ newpage);
+ if (cxt->stack_tail)
+ cxt->stack_tail->next = stack;
+ else
+ cxt->stack_head = stack;
+ cxt->stack_tail = stack;
+
+ cxt->currpage = newpage;
+
+ newopaque = ZSBtreePageGetOpaque(newpage);
+ newopaque->zs_attno = cxt->attno;
+ newopaque->zs_next = InvalidBlockNumber; /* filled in later */
+ newopaque->zs_lokey = nexttid;
+ newopaque->zs_hikey = cxt->hikey; /* overwritten later, if this is not last page */
+ newopaque->zs_level = 0;
+ newopaque->zs_flags = flags;
+ newopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+}
+
+/*
+ * Rewrite a leaf page, with given 'items' as the new content.
+ *
+ * First, calls zsbt_attr_recompress_items(), which will try to combine
+ * short items, and compress uncompressed items. After that, will try to
+ * store all the items on the page, replacing old content on the page.
+ *
+ * The items may contain "exploded" items, as ZSExplodedItem. They will
+ * be converted to normal array items suitable for storing on-disk.
+ *
+ * If the items don't fit on the page, then the page is split. It is
+ * entirely possible that they don't fit even on two pages; we split the page
+ * into as many pages as needed. Hopefully not more than a few pages, though,
+ * because otherwise you might hit limits on the number of buffer pins (with
+ * tiny shared_buffers).
+ *
+ * On entry, 'oldbuf' must be pinned and exclusive-locked. On exit, the lock
+ * is released, but it's still pinned.
+ */
+static void
+zsbt_attr_repack_replace(Relation rel, AttrNumber attno, Buffer oldbuf, List *items)
+{
+ Form_pg_attribute attr = &rel->rd_att->attrs[attno - 1];
+ ListCell *lc;
+ zsbt_attr_repack_context cxt;
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(BufferGetPage(oldbuf));
+ BlockNumber orignextblk;
+ zs_split_stack *stack;
+ List *downlinks = NIL;
+ List *recompressed_items;
+
+ /*
+ * Check that the items in the input are in correct order and don't
+ * overlap.
+ */
+#ifdef USE_ASSERT_CHECKING
+ {
+ zstid prev_endtid = 0;
+ ListCell *lc;
+
+ foreach (lc, items)
+ {
+ ZSAttributeArrayItem *item = (ZSAttributeArrayItem *) lfirst(lc);
+ zstid item_firsttid;
+ zstid item_endtid;
+
+ if (item->t_size == 0)
+ {
+ ZSExplodedItem *eitem = (ZSExplodedItem *) item;
+ item_firsttid = eitem->tids[0];
+ item_endtid = eitem->tids[eitem->t_num_elements - 1] + 1;
+ }
+ else
+ {
+ item_firsttid = item->t_firsttid;
+ item_endtid = item->t_endtid;
+ }
+
+ Assert(item_firsttid >= prev_endtid);
+ Assert(item_endtid > item_firsttid);
+ prev_endtid = item_endtid;
+ }
+ }
+#endif
+
+ /*
+ * First, split, merge and compress the items as needed, into
+ * suitable chunks.
+ */
+ recompressed_items = zsbt_attr_recompress_items(attr, items);
+
+ /*
+ * Then, store them on the page, creating new pages as needed.
+ */
+ orignextblk = oldopaque->zs_next;
+ Assert(orignextblk != BufferGetBlockNumber(oldbuf));
+
+ cxt.currpage = NULL;
+ cxt.stack_head = cxt.stack_tail = NULL;
+ cxt.attno = attno;
+ cxt.hikey = oldopaque->zs_hikey;
+
+ cxt.total_items = 0;
+
+ zsbt_attr_repack_newpage(&cxt, oldopaque->zs_lokey, (oldopaque->zs_flags & ZSBT_ROOT));
+
+ foreach(lc, recompressed_items)
+ {
+ ZSAttributeArrayItem *item = lfirst(lc);
+
+ if (PageGetFreeSpace(cxt.currpage) < MAXALIGN(item->t_size))
+ zsbt_attr_repack_newpage(&cxt, item->t_firsttid, 0);
+
+ if (PageAddItemExtended(cxt.currpage,
+ (Item) item, item->t_size,
+ PageGetMaxOffsetNumber(cxt.currpage) + 1,
+ PAI_OVERWRITE) == InvalidOffsetNumber)
+ elog(ERROR, "could not add item to page while recompressing");
+
+ cxt.total_items++;
+ }
+
+ /*
+ * Ok, we now have a list of pages, to replace the original page, as private
+ * in-memory copies. Allocate buffers for them, and write them out.
+ *
+ * Allocate all the pages before entering the critical section, so that
+ * running out of disk space doesn't lead to a PANIC.
+ */
+ stack = cxt.stack_head;
+ Assert(stack->buf == InvalidBuffer);
+ stack->buf = oldbuf;
+ while (stack->next)
+ {
+ Page thispage = stack->page;
+ ZSBtreePageOpaque *thisopaque = ZSBtreePageGetOpaque(thispage);
+ ZSBtreeInternalPageItem *downlink;
+ Buffer nextbuf;
+
+ Assert(stack->next->buf == InvalidBuffer);
+
+ nextbuf = zspage_getnewbuf(rel, InvalidBuffer);
+ stack->next->buf = nextbuf;
+ Assert (BufferGetBlockNumber(nextbuf) != orignextblk);
+
+ thisopaque->zs_next = BufferGetBlockNumber(nextbuf);
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = thisopaque->zs_hikey;
+ downlink->childblk = BufferGetBlockNumber(nextbuf);
+ downlinks = lappend(downlinks, downlink);
+
+ stack = stack->next;
+ }
+ /* last one in the chain */
+ ZSBtreePageGetOpaque(stack->page)->zs_next = orignextblk;
+
+ /* If we had to split, insert downlinks for the new pages. */
+ if (cxt.stack_head->next)
+ {
+ oldopaque = ZSBtreePageGetOpaque(cxt.stack_head->page);
+
+ if ((oldopaque->zs_flags & ZSBT_ROOT) != 0)
+ {
+ ZSBtreeInternalPageItem *downlink;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = MinZSTid;
+ downlink->childblk = BufferGetBlockNumber(cxt.stack_head->buf);
+ downlinks = lcons(downlink, downlinks);
+
+ cxt.stack_tail->next = zsbt_newroot(rel, attno, oldopaque->zs_level + 1, downlinks);
+
+ /* clear the ZSBT_ROOT flag on the old root page */
+ oldopaque->zs_flags &= ~ZSBT_ROOT;
+ }
+ else
+ {
+ cxt.stack_tail->next = zsbt_insert_downlinks(rel, attno,
+ oldopaque->zs_lokey, BufferGetBlockNumber(oldbuf), oldopaque->zs_level + 1,
+ downlinks);
+ }
+ /* note: stack_tail is not the real tail anymore */
+ }
+
+ /* Finally, overwrite all the pages we had to modify */
+ zs_apply_split_changes(rel, cxt.stack_head);
+}
diff --git a/src/backend/access/zedstore/zedstore_btree.c b/src/backend/access/zedstore/zedstore_btree.c
new file mode 100644
index 00000000000..918981b7044
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_btree.c
@@ -0,0 +1,842 @@
+/*
+ * zedstore_btree.c
+ * Common routines for handling TID and attribute B-tree structures
+ *
+ * A Zedstore table consists of multiple B-trees, one to store TIDs and
+ * visibility information of the rows, and one tree for each attribute,
+ * to hold the data. The TID and attribute trees differ at the leaf
+ * level, but the internal pages have the same layout. This file contains
+ * routines to deal with internal pages, and some other common
+ * functionality.
+ *
+ * When dealing with the TID tree, pass ZS_META_ATTRIBUTE_NUM as the
+ * attribute number.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_btree.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_internal.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/procarray.h"
+#include "utils/rel.h"
+
+/* prototypes for local functions */
+static zs_split_stack *zsbt_split_internal_page(Relation rel, AttrNumber attno,
+ Buffer leftbuf, OffsetNumber newoff, List *downlinks);
+static zs_split_stack *zsbt_merge_pages(Relation rel, AttrNumber attno, Buffer leftbuf, Buffer rightbuf, bool target_is_left);
+
+static int zsbt_binsrch_internal(zstid key, ZSBtreeInternalPageItem *arr, int arr_elems);
+
+/*
+ * Find the page containing the given key TID at the given level.
+ *
+ * Level 0 means leaf. The returned buffer is exclusive-locked.
+ *
+ * If tree doesn't exist at all (probably because the table was just created
+ * or truncated), the behavior depends on the 'readonly' argument. If
+ * readonly == true, then returns InvalidBuffer. If readonly == false, then
+ * the tree is created.
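+ *
+ * Typical call pattern (an illustrative sketch): descend to the leaf page
+ * covering a TID, work on it, and release it when done:
+ *
+ *     buf = zsbt_descend(rel, attno, tid, 0, false);
+ *     page = BufferGetPage(buf);
+ *     ... read or modify the leaf page ...
+ *     UnlockReleaseBuffer(buf);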
+ */
+Buffer
+zsbt_descend(Relation rel, AttrNumber attno, zstid key, int level, bool readonly)
+{
+ BlockNumber next;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ ZSBtreeInternalPageItem *items;
+ int nitems;
+ int itemno;
+ int nextlevel;
+ BlockNumber failblk = InvalidBlockNumber;
+ int faillevel = -1;
+ ZSMetaCacheData *metacache;
+
+ Assert(key != InvalidZSTid);
+
+ /* Fast path for the very common case that we're looking for the rightmost page */
+ metacache = zsmeta_get_cache(rel);
+ if (level == 0 &&
+ attno < metacache->cache_nattributes &&
+ metacache->cache_attrs[attno].rightmost != InvalidBlockNumber &&
+ key >= metacache->cache_attrs[attno].rightmost_lokey)
+ {
+ next = metacache->cache_attrs[attno].rightmost;
+ nextlevel = 0;
+ }
+ else
+ {
+ /* start from root */
+ next = zsmeta_get_root_for_attribute(rel, attno, readonly);
+ if (next == InvalidBlockNumber)
+ {
+ /* completely empty tree */
+ return InvalidBuffer;
+ }
+ nextlevel = -1;
+ }
+ for (;;)
+ {
+ /*
+ * If we arrive again at a block that was a dead end earlier, it seems
+ * that the tree is corrupt.
+ *
+ * XXX: It's theoretically possible that the block was removed, but then
+ * added back at the same location, and removed again. So perhaps retry
+ * a few times?
+ */
+ if (next == failblk || next == ZS_META_BLK)
+ elog(ERROR, "arrived at incorrect block %u while descending zedstore btree", next);
+
+ buf = ReadBuffer(rel, next);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); /* TODO: shared */
+ page = BufferGetPage(buf);
+ if (!zsbt_page_is_expected(rel, attno, key, nextlevel, buf))
+ {
+ /*
+ * We arrived at an unexpected page. This can happen with concurrent
+ * splits, or page deletions. We could try following the right-link, but
+ * there's no guarantee that's the correct page either, so let's restart
+ * from the root. If we landed here because of concurrent modifications,
+ * the next attempt should land on the correct page. Remember that we
+ * incorrectly ended up on this page, so that if this happens because
+ * the tree is corrupt, rather than concurrent splits, and we land here
+ * again, we won't loop forever.
+ */
+ UnlockReleaseBuffer(buf);
+
+ failblk = next;
+ faillevel = nextlevel;
+ nextlevel = -1;
+ zsmeta_invalidate_cache(rel);
+ next = zsmeta_get_root_for_attribute(rel, attno, readonly);
+ if (next == InvalidBlockNumber)
+ elog(ERROR, "could not find root for attribute %d", attno);
+
+ /*
+ * If the root was split after we cached the metadata, it's
+ * possible that the page we thought was the root page no longer
+ * is, but as we descend from the new root page, we'll end up on
+ * the same page again anyway. Don't treat that as an error. To
+ * avoid it, check for the root case here, and if so, reset 'failblk'.
+ */
+ if (faillevel == -1)
+ {
+ if (next == failblk)
+ elog(ERROR, "arrived at incorrect block %u while descending zedstore btree", next);
+ failblk = InvalidBlockNumber;
+ }
+ continue;
+ }
+ opaque = ZSBtreePageGetOpaque(page);
+
+ if (nextlevel == -1)
+ nextlevel = opaque->zs_level;
+ else if (opaque->zs_level != nextlevel)
+ elog(ERROR, "unexpected level encountered when descending tree");
+
+ if (opaque->zs_level == level)
+ break;
+
+ /* Find the downlink and follow it */
+ items = ZSBtreeInternalPageGetItems(page);
+ nitems = ZSBtreeInternalPageGetNumItems(page);
+
+ itemno = zsbt_binsrch_internal(key, items, nitems);
+ if (itemno < 0)
+ elog(ERROR, "could not descend tree for tid (%u, %u)",
+ ZSTidGetBlockNumber(key), ZSTidGetOffsetNumber(key));
+
+ next = items[itemno].childblk;
+ nextlevel--;
+
+ UnlockReleaseBuffer(buf);
+ }
+
+ if (opaque->zs_level == 0 && opaque->zs_next == InvalidBlockNumber)
+ {
+ metacache = zsmeta_get_cache(rel);
+ if (attno < metacache->cache_nattributes)
+ {
+ metacache->cache_attrs[attno].rightmost = next;
+ metacache->cache_attrs[attno].rightmost_lokey = opaque->zs_lokey;
+ }
+ }
+
+ return buf;
+}
+
+
+/*
+ * Find and lock the leaf page that contains data for scan->nexttid.
+ *
+ * If 'buf' is valid, it is a previously pinned page. We will check that
+ * page first. If it's not the correct page, it will be released.
+ *
+ * Returns InvalidBuffer if the attribute tree doesn't exist at all.
+ * That should only happen after ALTER TABLE ADD COLUMN, or on a newly
+ * created table, but none of the current callers would even try to
+ * fetch attribute data without scanning the TID tree first.
+ */
+Buffer
+zsbt_find_and_lock_leaf_containing_tid(Relation rel, AttrNumber attno,
+ Buffer buf, zstid nexttid, int lockmode)
+{
+ if (BufferIsValid(buf))
+ {
+retry:
+ LockBuffer(buf, lockmode);
+
+ /*
+ * It's possible that the page was concurrently split or recycled by
+ * another backend (or ourselves). Have to re-check that the page is
+ * still valid.
+ */
+ if (zsbt_page_is_expected(rel, attno, nexttid, 0, buf))
+ return buf;
+ else
+ {
+ /*
+ * It's not valid for the TID we're looking for, but maybe it was the
+ * right page for the previous TID. In that case, we don't need to
+ * restart from the root, we can follow the right-link instead.
+ */
+ if (nexttid > MinZSTid &&
+ zsbt_page_is_expected(rel, attno, nexttid - 1, 0, buf))
+ {
+ Page page = BufferGetPage(buf);
+ ZSBtreePageOpaque *opaque = ZSBtreePageGetOpaque(page);
+ BlockNumber next = opaque->zs_next;
+
+ if (next != InvalidBlockNumber)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ buf = ReleaseAndReadBuffer(buf, rel, next);
+ goto retry;
+ }
+ }
+
+ UnlockReleaseBuffer(buf);
+ buf = InvalidBuffer;
+ }
+ }
+
+ /* Descend the B-tree to find the correct leaf page. */
+ if (!BufferIsValid(buf))
+ buf = zsbt_descend(rel, attno, nexttid, 0, true);
+
+ return buf;
+}
+
+
+/*
+ * Check that a page is a valid B-tree page, and covers the given key.
+ *
+ * This is used when traversing the tree, to check that e.g. a concurrent page
+ * split didn't move pages around, so that the page we were walking to isn't
+ * the correct one anymore.
+ */
+bool
+zsbt_page_is_expected(Relation rel, AttrNumber attno, zstid key, int level, Buffer buf)
+{
+ Page page = BufferGetPage(buf);
+ ZSBtreePageOpaque *opaque;
+
+ /*
+ * The page might have been deleted and even reused as a completely different
+ * kind of a page, so we must be prepared for anything.
+ */
+ if (PageIsNew(page))
+ return false;
+
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSBtreePageOpaque)))
+ return false;
+
+ opaque = ZSBtreePageGetOpaque(page);
+
+ if (opaque->zs_page_id != ZS_BTREE_PAGE_ID)
+ return false;
+
+ if (opaque->zs_attno != attno)
+ return false;
+
+ if (level == -1)
+ {
+ if ((opaque->zs_flags & ZSBT_ROOT) == 0)
+ return false;
+ }
+ else
+ {
+ if (opaque->zs_level != level)
+ return false;
+ }
+
+ if (opaque->zs_lokey > key || opaque->zs_hikey <= key)
+ return false;
+
+ /* extra checks for corrupted pages */
+ if (opaque->zs_next == BufferGetBlockNumber(buf))
+ elog(ERROR, "btree page %u next-pointer points to itself", opaque->zs_next);
+
+ return true;
+}
+
+/*
+ * Create a new btree root page, containing two downlinks.
+ *
+ * NOTE: the very first root page of a btree, which is also the leaf, is created
+ * in zsmeta_get_root_for_attribute(), not here.
+ *
+ * XXX: What if there are too many downlinks to fit on a page? Shouldn't happen
+ * in practice...
+ */
+zs_split_stack *
+zsbt_newroot(Relation rel, AttrNumber attno, int level, List *downlinks)
+{
+ Page metapage;
+ ZSMetaPage *metapg;
+ Buffer newrootbuf;
+ Page newrootpage;
+ ZSBtreePageOpaque *newrootopaque;
+ ZSBtreeInternalPageItem *items;
+ Buffer metabuf;
+ zs_split_stack *stack1;
+ zs_split_stack *stack2;
+ ListCell *lc;
+ int i;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ /* allocate a new root page */
+ newrootbuf = zspage_getnewbuf(rel, metabuf);
+ newrootpage = palloc(BLCKSZ);
+ PageInit(newrootpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ newrootopaque = ZSBtreePageGetOpaque(newrootpage);
+ newrootopaque->zs_attno = attno;
+ newrootopaque->zs_next = InvalidBlockNumber;
+ newrootopaque->zs_lokey = MinZSTid;
+ newrootopaque->zs_hikey = MaxPlusOneZSTid;
+ newrootopaque->zs_level = level;
+ newrootopaque->zs_flags = ZSBT_ROOT;
+ newrootopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ items = ZSBtreeInternalPageGetItems(newrootpage);
+
+ /* add all the downlinks */
+ i = 0;
+ foreach (lc, downlinks)
+ {
+ ZSBtreeInternalPageItem *downlink = (ZSBtreeInternalPageItem *) lfirst(lc);
+
+ items[i++] = *downlink;
+ }
+ ((PageHeader) newrootpage)->pd_lower += i * sizeof(ZSBtreeInternalPageItem);
+
+ /* FIXME: Check that all the downlinks fit on the page. */
+
+ /* update the metapage */
+ metapage = PageGetTempPageCopy(BufferGetPage(metabuf));
+
+ metapg = (ZSMetaPage *) PageGetContents(metapage);
+ if ((attno != ZS_META_ATTRIBUTE_NUM) && (attno <= 0 || attno > metapg->nattributes))
+ elog(ERROR, "invalid attribute number %d (table \"%s\" has only %d attributes)",
+ attno, RelationGetRelationName(rel), metapg->nattributes);
+
+ metapg->tree_root_dir[attno].root = BufferGetBlockNumber(newrootbuf);
+
+ stack1 = zs_new_split_stack_entry(metabuf, metapage);
+ stack2 = zs_new_split_stack_entry(newrootbuf, newrootpage);
+ stack2->next = stack1;
+
+ return stack2;
+}
+
+/*
+ * After page split, insert the downlink of 'rightblkno' to the parent.
+ *
+ * On entry, 'leftbuf' must be pinned exclusive-locked.
+ */
+zs_split_stack *
+zsbt_insert_downlinks(Relation rel, AttrNumber attno,
+ zstid leftlokey, BlockNumber leftblkno, int level,
+ List *downlinks)
+{
+ int numdownlinks = list_length(downlinks);
+ ZSBtreeInternalPageItem *items;
+ int nitems;
+ int itemno;
+ Buffer parentbuf;
+ Page parentpage;
+ zs_split_stack *split_stack;
+ ZSBtreeInternalPageItem *firstdownlink;
+
+ /*
+ * re-find parent
+ *
+ * TODO: this is a bit inefficient. Usually, we have just descended the
+ * tree, and if we remembered the path we took on the way down, we could
+ * simply walk back up.
+ */
+
+ /*
+ * XXX: There was a concurrency bug here, too, observed by running
+ * "make installcheck-parallel":
+ * We're holding a lock on a page on level 0, and the root is at level 1.
+ * However, the metacache says that the page we're holding locked is
+ * the root. When we get here, to find the parent page, we will start
+ * with the cached block at level 0, and deadlock with ourselves.
+ *
+ * To fix that, invalidate the cache, if it claims that the child
+ * block is the root. I'm not sure this fixes the whole general problem,
+ * though, so this needs some more thought...
+ */
+ {
+ ZSMetaCacheData *metacache;
+
+ metacache = zsmeta_get_cache(rel);
+ if (attno < metacache->cache_nattributes &&
+ metacache->cache_attrs[attno].root == leftblkno)
+ {
+ metacache->cache_attrs[attno].root = InvalidBlockNumber;
+ }
+ }
+
+ parentbuf = zsbt_descend(rel, attno, leftlokey, level, false);
+ parentpage = BufferGetPage(parentbuf);
+
+ firstdownlink = (ZSBtreeInternalPageItem *) linitial(downlinks);
+
+ /* Find the position in the parent for the downlink */
+ items = ZSBtreeInternalPageGetItems(parentpage);
+ nitems = ZSBtreeInternalPageGetNumItems(parentpage);
+ itemno = zsbt_binsrch_internal(firstdownlink->tid, items, nitems);
+
+ /* sanity checks */
+ if (itemno < 0 || items[itemno].tid != leftlokey ||
+ items[itemno].childblk != leftblkno)
+ {
+ elog(ERROR, "could not find downlink for block %u TID (%u, %u)",
+ leftblkno, ZSTidGetBlockNumber(leftlokey),
+ ZSTidGetOffsetNumber(leftlokey));
+ }
+ itemno++;
+
+ if (PageGetExactFreeSpace(parentpage) < numdownlinks * sizeof(ZSBtreeInternalPageItem))
+ {
+ /* split internal page */
+ split_stack = zsbt_split_internal_page(rel, attno, parentbuf, itemno, downlinks);
+ }
+ else
+ {
+ ZSBtreeInternalPageItem *newitems;
+ Page newpage;
+ int i;
+ ListCell *lc;
+
+ newpage = PageGetTempPageCopySpecial(parentpage);
+
+ split_stack = zs_new_split_stack_entry(parentbuf, newpage);
+
+ /* insert the new downlink for the right page. */
+ newitems = ZSBtreeInternalPageGetItems(newpage);
+ memcpy(newitems, items, itemno * sizeof(ZSBtreeInternalPageItem));
+
+ i = itemno;
+ foreach(lc, downlinks)
+ {
+ ZSBtreeInternalPageItem *downlink = (ZSBtreeInternalPageItem *) lfirst(lc);
+
+ Assert(downlink->childblk != 0);
+ newitems[i++] = *downlink;
+ }
+
+ memcpy(&newitems[i], &items[itemno], (nitems - itemno) * sizeof(ZSBtreeInternalPageItem));
+ ((PageHeader) newpage)->pd_lower += (nitems + numdownlinks) * sizeof(ZSBtreeInternalPageItem);
+ }
+ return split_stack;
+}
+
+/*
+ * Split an internal page.
+ *
+ * The new downlinks in 'newitems' are inserted at position 'newoff', on 'origbuf'.
+ * The page is split as needed.
+ */
+static zs_split_stack *
+zsbt_split_internal_page(Relation rel, AttrNumber attno, Buffer origbuf,
+ OffsetNumber newoff, List *newitems)
+{
+ Page origpage = BufferGetPage(origbuf);
+ ZSBtreePageOpaque *origopaque = ZSBtreePageGetOpaque(origpage);
+ Buffer buf;
+ Page page;
+ ZSBtreeInternalPageItem *origitems;
+ int orignitems;
+ zs_split_stack *stack_first;
+ zs_split_stack *stack;
+ Size splitthreshold;
+ ListCell *lc;
+ int origitemno;
+ List *downlinks = NIL;
+
+ origitems = ZSBtreeInternalPageGetItems(origpage);
+ orignitems = ZSBtreeInternalPageGetNumItems(origpage);
+
+ page = PageGetTempPageCopySpecial(origpage);
+ buf = origbuf;
+
+ stack = zs_new_split_stack_entry(buf, page);
+ stack_first = stack;
+
+ /* XXX: currently, we always do 90/10 splits */
+ splitthreshold = PageGetExactFreeSpace(page) * 0.10;
+
+ lc = list_head(newitems);
+ origitemno = 0;
+ for (;;)
+ {
+ ZSBtreeInternalPageItem *item;
+ ZSBtreeInternalPageItem *p;
+
+ if (origitemno == newoff && lc)
+ {
+ item = lfirst(lc);
+ lc = lnext(newitems, lc);
+ }
+ else
+ {
+ if (origitemno == orignitems)
+ break;
+ item = &origitems[origitemno];
+ origitemno++;
+ }
+
+ if (PageGetExactFreeSpace(page) < splitthreshold)
+ {
+ /* have to split to another page */
+ ZSBtreePageOpaque *prevopaque = ZSBtreePageGetOpaque(page);
+ ZSBtreePageOpaque *opaque;
+ BlockNumber blkno;
+ ZSBtreeInternalPageItem *downlink;
+
+ buf = zspage_getnewbuf(rel, InvalidBuffer);
+ blkno = BufferGetBlockNumber(buf);
+ page = palloc(BLCKSZ);
+ PageInit(page, BLCKSZ, sizeof(ZSBtreePageOpaque));
+
+ opaque = ZSBtreePageGetOpaque(page);
+ opaque->zs_attno = attno;
+ opaque->zs_next = prevopaque->zs_next;
+ opaque->zs_lokey = item->tid;
+ opaque->zs_hikey = prevopaque->zs_hikey;
+ opaque->zs_level = prevopaque->zs_level;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ prevopaque->zs_next = blkno;
+ prevopaque->zs_hikey = item->tid;
+
+ stack->next = zs_new_split_stack_entry(buf, page);
+ stack = stack->next;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = item->tid;
+ downlink->childblk = blkno;
+ downlinks = lappend(downlinks, downlink);
+ }
+
+ p = (ZSBtreeInternalPageItem *) ((char *) page + ((PageHeader) page)->pd_lower);
+ *p = *item;
+ ((PageHeader) page)->pd_lower += sizeof(ZSBtreeInternalPageItem);
+ }
+
+ /* recurse to insert downlinks, if we had to split. */
+ if (downlinks)
+ {
+ if ((origopaque->zs_flags & ZSBT_ROOT) != 0)
+ {
+ ZSBtreeInternalPageItem *downlink;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = MinZSTid;
+ downlink->childblk = BufferGetBlockNumber(origbuf);
+ downlinks = lcons(downlink, downlinks);
+
+ stack->next = zsbt_newroot(rel, attno, origopaque->zs_level + 1, downlinks);
+
+ /* clear the ZSBT_ROOT flag on the old root page */
+ ZSBtreePageGetOpaque(stack_first->page)->zs_flags &= ~ZSBT_ROOT;
+ }
+ else
+ {
+ stack->next = zsbt_insert_downlinks(rel, attno,
+ origopaque->zs_lokey,
+ BufferGetBlockNumber(origbuf),
+ origopaque->zs_level + 1,
+ downlinks);
+ }
+ }
+
+ return stack_first;
+}
+
+
+/*
+ * Removes the last item from page, and unlinks the page from the tree.
+ *
+ * NOTE: you cannot remove the only leaf. Returns NULL if the page could not
+ * be deleted.
+ */
+zs_split_stack *
+zsbt_unlink_page(Relation rel, AttrNumber attno, Buffer buf, int level)
+{
+ Page page = BufferGetPage(buf);
+ ZSBtreePageOpaque *opaque = ZSBtreePageGetOpaque(page);
+ Buffer leftbuf;
+ Buffer rightbuf;
+ zs_split_stack *stack;
+
+ /* cannot currently remove the only page at its level. */
+ if (opaque->zs_lokey == MinZSTid && opaque->zs_hikey == MaxPlusOneZSTid)
+ {
+ return NULL;
+ }
+
+ /*
+ * Find the left sibling, or if this is the leftmost page, find the
+ * right sibling.
+ */
+ if (opaque->zs_lokey != MinZSTid)
+ {
+ rightbuf = buf;
+ leftbuf = zsbt_descend(rel, attno, opaque->zs_lokey - 1, level, false);
+
+ stack = zsbt_merge_pages(rel, attno, leftbuf, rightbuf, false);
+ if (!stack)
+ {
+ UnlockReleaseBuffer(leftbuf);
+ return NULL;
+ }
+ }
+ else
+ {
+ rightbuf = zsbt_descend(rel, attno, opaque->zs_hikey, level, false);
+ leftbuf = buf;
+ stack = zsbt_merge_pages(rel, attno, leftbuf, rightbuf, true);
+ if (!stack)
+ {
+ UnlockReleaseBuffer(rightbuf);
+ return NULL;
+ }
+ }
+
+ return stack;
+}
+
+/*
+ * Page deletion:
+ *
+ * Mark page empty, remove downlink. If parent becomes empty, recursively delete it.
+ *
+ * Unlike in the nbtree index, we don't need to worry about concurrent scans. They
+ * will simply retry if they land on an unexpected page.
+ */
+static zs_split_stack *
+zsbt_merge_pages(Relation rel, AttrNumber attno, Buffer leftbuf, Buffer rightbuf, bool target_is_left)
+{
+ Buffer parentbuf;
+ Page origleftpage;
+ Page leftpage;
+ Page rightpage;
+ ZSBtreePageOpaque *leftopaque;
+ ZSBtreePageOpaque *origleftopaque;
+ ZSBtreePageOpaque *rightopaque;
+ ZSBtreeInternalPageItem *parentitems;
+ int parentnitems;
+ Page parentpage;
+ int itemno;
+ zs_split_stack *stack;
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ origleftpage = BufferGetPage(leftbuf);
+ origleftopaque = ZSBtreePageGetOpaque(origleftpage);
+ rightpage = BufferGetPage(rightbuf);
+ rightopaque = ZSBtreePageGetOpaque(rightpage);
+
+ /* find downlink for 'rightbuf' in the parent */
+ parentbuf = zsbt_descend(rel, attno, rightopaque->zs_lokey, origleftopaque->zs_level + 1, false);
+ parentpage = BufferGetPage(parentbuf);
+
+ parentitems = ZSBtreeInternalPageGetItems(parentpage);
+ parentnitems = ZSBtreeInternalPageGetNumItems(parentpage);
+ itemno = zsbt_binsrch_internal(rightopaque->zs_lokey, parentitems, parentnitems);
+ if (itemno < 0 || parentitems[itemno].childblk != BufferGetBlockNumber(rightbuf))
+ elog(ERROR, "could not find downlink to FPM page %u", BufferGetBlockNumber(rightbuf));
+
+ if (parentnitems > 1 && itemno == 0)
+ {
+ /*
+ * Don't delete the leftmost child of a parent. That would move the
+ * keyspace of the parent, so we'd need to adjust the lo/hikey of
+ * the parent page, and the parent's downlink in the grandparent.
+ * Maybe later...
+ */
+ UnlockReleaseBuffer(parentbuf);
+ elog(DEBUG1, "deleting leftmost child of a parent not implemented");
+ return NULL;
+ }
+
+ if (target_is_left)
+ {
+ /* move all items from right to left before unlinking the right page */
+ leftpage = PageGetTempPageCopy(rightpage);
+ leftopaque = ZSBtreePageGetOpaque(leftpage);
+
+ memcpy(leftopaque, origleftopaque, sizeof(ZSBtreePageOpaque));
+ }
+ else
+ {
+ /* right page is empty. */
+ leftpage = PageGetTempPageCopy(origleftpage);
+ leftopaque = ZSBtreePageGetOpaque(leftpage);
+ }
+
+ /* update left hikey */
+ leftopaque->zs_hikey = ZSBtreePageGetOpaque(rightpage)->zs_hikey;
+ leftopaque->zs_next = ZSBtreePageGetOpaque(rightpage)->zs_next;
+
+ Assert(ZSBtreePageGetOpaque(leftpage)->zs_level == ZSBtreePageGetOpaque(rightpage)->zs_level);
+
+ stack = zs_new_split_stack_entry(leftbuf, leftpage);
+ stack_head = stack_tail = stack;
+
+ /* Mark right page as empty/unused */
+ rightpage = palloc0(BLCKSZ);
+
+ stack = zs_new_split_stack_entry(rightbuf, rightpage);
+ stack->recycle = true;
+ stack_tail->next = stack;
+ stack_tail = stack;
+
+ /* remove downlink from parent */
+ if (parentnitems > 1)
+ {
+ Page newpage = PageGetTempPageCopySpecial(parentpage);
+ ZSBtreeInternalPageItem *newitems = ZSBtreeInternalPageGetItems(newpage);
+
+ memcpy(newitems, parentitems, itemno * sizeof(ZSBtreeInternalPageItem));
+ memcpy(&newitems[itemno], &parentitems[itemno + 1], (parentnitems - itemno -1) * sizeof(ZSBtreeInternalPageItem));
+
+ ((PageHeader) newpage)->pd_lower += (parentnitems - 1) * sizeof(ZSBtreeInternalPageItem);
+
+ stack = zs_new_split_stack_entry(parentbuf, newpage);
+ stack_tail->next = stack;
+ stack_tail = stack;
+ }
+ else
+ {
+ /* the parent becomes empty as well. Recursively remove it. */
+ stack_tail->next = zsbt_unlink_page(rel, attno, parentbuf, leftopaque->zs_level + 1);
+ if (stack_tail->next == NULL)
+ {
+ /* oops, couldn't remove the parent. Back out */
+ stack = stack_head;
+ while (stack)
+ {
+ zs_split_stack *next = stack->next;
+
+ pfree(stack->page);
+ pfree(stack);
+ stack = next;
+ }
+
+ /* the stack entries were freed above; don't return a dangling pointer */
+ return NULL;
+ }
+ }
+
+ return stack_head;
+}
+
+/*
+ * Allocate a new zs_split_stack struct.
+ */
+zs_split_stack *
+zs_new_split_stack_entry(Buffer buf, Page page)
+{
+ zs_split_stack *stack;
+
+ stack = palloc(sizeof(zs_split_stack));
+ stack->next = NULL;
+ stack->buf = buf;
+ stack->page = page;
+ stack->recycle = false; /* caller can change this */
+
+ return stack;
+}
+
+/*
+ * Apply all the changes represented by a list of zs_split_stack
+ * entries.
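+ *
+ * Typical usage (an illustrative sketch): build private in-memory copies of
+ * the pages to be modified, chain them with zs_new_split_stack_entry(), and
+ * apply them all in one critical section:
+ *
+ *     stack = zs_new_split_stack_entry(buf, newpage);
+ *     stack->next = zs_new_split_stack_entry(otherbuf, othernewpage);
+ *     zs_apply_split_changes(rel, stack);
+ *
+ * The buffers must be pinned and exclusive-locked on entry; this function
+ * unlocks and releases them.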
+ */
+void
+zs_apply_split_changes(Relation rel, zs_split_stack *stack)
+{
+ zs_split_stack *head = stack;
+
+ START_CRIT_SECTION();
+
+ while (stack)
+ {
+ PageRestoreTempPage(stack->page, BufferGetPage(stack->buf));
+ MarkBufferDirty(stack->buf);
+ stack = stack->next;
+ }
+
+ /* TODO: WAL-log all the changes */
+
+ END_CRIT_SECTION();
+
+ stack = head;
+ while (stack)
+ {
+ zs_split_stack *next;
+
+ /* add this page to the Free Page Map for recycling */
+ if (stack->recycle)
+ zspage_delete_page(rel, stack->buf);
+
+ UnlockReleaseBuffer(stack->buf);
+
+ next = stack->next;
+ pfree(stack);
+ stack = next;
+ }
+}
+
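+/*
+ * Binary search over the sorted downlink array of an internal page.
+ *
+ * Returns the index of the last entry with tid <= key, or -1 if the key is
+ * smaller than all entries.
+ */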
+static int
+zsbt_binsrch_internal(zstid key, ZSBtreeInternalPageItem *arr, int arr_elems)
+{
+ int low,
+ high,
+ mid;
+
+ low = 0;
+ high = arr_elems;
+ while (high > low)
+ {
+ mid = low + (high - low) / 2;
+
+ if (key >= arr[mid].tid)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+ return low - 1;
+}
diff --git a/src/backend/access/zedstore/zedstore_compression.c b/src/backend/access/zedstore/zedstore_compression.c
new file mode 100644
index 00000000000..5ed7452e967
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_compression.c
@@ -0,0 +1,80 @@
+/*
+ * zedstore_compression.c
+ * Routines for compression
+ *
+ * There are two implementations at the moment: LZ4, and the Postgres
+ * pg_lzcompress(). LZ4 support requires that the server was compiled
+ * with --with-lz4.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_compression.c
+ */
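+
+/*
+ * Usage sketch (illustrative only): the caller provides both buffers, and
+ * falls back to storing the data uncompressed if zs_try_compress() cannot
+ * shrink it. Both the compressed and uncompressed sizes must be remembered,
+ * since zs_decompress() needs them:
+ *
+ *     csize = zs_try_compress(src, dst, srcsize, dstcapacity);
+ *     if (csize > 0)
+ *         ... store 'dst' (csize bytes), along with 'srcsize' ...
+ *     else
+ *         ... store 'src' as-is, uncompressed ...
+ *
+ *     ... and on read:
+ *     zs_decompress(stored_data, buf, csize, srcsize);
+ */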
+#include "postgres.h"
+
+#ifdef USE_LZ4
+#include <lz4.h>
+#endif
+
+#include "access/zedstore_compression.h"
+#include "common/pg_lzcompress.h"
+#include "utils/datum.h"
+
+#ifdef USE_LZ4
+
+int
+zs_try_compress(const char *src, char *dst, int srcSize, int dstCapacity)
+{
+ int compressed_size;
+
+ compressed_size = LZ4_compress_default(src, dst, srcSize, dstCapacity);
+
+ if (compressed_size > srcSize)
+ return 0;
+ else
+ return compressed_size;
+}
+
+void
+zs_decompress(const char *src, char *dst, int compressedSize, int uncompressedSize)
+{
+ int decompressed_size;
+
+ decompressed_size = LZ4_decompress_safe(src, dst, compressedSize, uncompressedSize);
+ if (decompressed_size < 0)
+ elog(ERROR, "could not decompress chunk");
+ if (decompressed_size != uncompressedSize)
+ elog(ERROR, "unexpected decompressed size");
+}
+
+#else
+/* PGLZ implementation */
+
+int
+zs_try_compress(const char *src, char *dst, int srcSize, int dstCapacity)
+{
+ int compressed_size;
+
+ if (dstCapacity < PGLZ_MAX_OUTPUT(srcSize))
+ return -1;
+
+ compressed_size = pglz_compress(src, srcSize, dst, PGLZ_strategy_always);
+
+ return compressed_size;
+}
+
+void
+zs_decompress(const char *src, char *dst, int compressedSize, int uncompressedSize)
+{
+ int decompressed_size;
+
+ decompressed_size = pglz_decompress(src, compressedSize, dst, uncompressedSize, true);
+ if (decompressed_size < 0)
+ elog(ERROR, "could not decompress chunk");
+ if (decompressed_size != uncompressedSize)
+ elog(ERROR, "unexpected decompressed size");
+}
+
+#endif /* !USE_LZ4 */
diff --git a/src/backend/access/zedstore/zedstore_freepagemap.c b/src/backend/access/zedstore/zedstore_freepagemap.c
new file mode 100644
index 00000000000..de88d85969c
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_freepagemap.c
@@ -0,0 +1,233 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstore_freepagemap.c
+ * ZedStore free space management
+ *
+ * The Free Page Map keeps track of unused pages in the relation.
+ *
+ * The FPM is a linked list of pages. Each page contains a pointer to the
+ * next free page.
+ *
+ * Design principles:
+ *
+ * - it's ok to have a block incorrectly stored in the FPM. Before actually
+ * reusing a page, we must check that it's safe.
+ *
+ * - a deletable page must be simple to detect just by looking at the page,
+ * and perhaps a few other pages. It should *not* require scanning the
+ * whole table, or even a whole b-tree. For example, if a column is dropped,
+ * we can detect if a b-tree page belongs to the dropped column just by
+ * looking at the information (the attribute number) stored in the page
+ * header.
+ *
+ * - if a page is deletable, it should become immediately reusable. No
+ * "wait out all possible readers that might be about to follow a link
+ * to it" business. All code that reads pages need to keep pages locked
+ * while following a link, or be prepared to retry if they land on an
+ * unexpected page.
+ *
+ *
+ * TODO:
+ *
+ * - Avoid fragmentation. If B-tree page is split, try to hand out a page
+ * that's close to the old page. When the relation is extended, allocate
+ * a larger chunk at once.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_freepagemap.c
+ *-------------------------------------------------------------------------
+ */
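+
+/*
+ * Usage sketch (illustrative only): new pages are allocated with
+ * zspage_getnewbuf(), which reuses a page from the FPM if one is available
+ * and extends the relation otherwise. A page that becomes empty is returned
+ * to the FPM with zspage_delete_page(). For example:
+ *
+ *     buf = zspage_getnewbuf(rel, InvalidBuffer);
+ *     ... initialize the page; 'buf' is returned exclusive-locked ...
+ *
+ *     zspage_delete_page(rel, buf);    (caller must hold an exclusive lock)
+ */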
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/zedstore_internal.h"
+#include "miscadmin.h"
+#include "storage/bufpage.h"
+#include "storage/lmgr.h"
+#include "utils/rel.h"
+
+typedef struct ZSFreePageOpaque
+{
+ BlockNumber zs_next;
+ uint16 padding;
+ uint16 zs_page_id; /* ZS_FREE_PAGE_ID */
+} ZSFreePageOpaque;
+
+/*
+ * zspage_is_unused()
+ *
+ * Is the current page recyclable?
+ *
+ * It can be:
+ *
+ * - an empty, all-zeros page,
+ * - explicitly marked as deleted,
+ * - an UNDO page older than oldest_undo_ptr
+ * - a b-tree page belonging to a deleted attribute
+ * - a TOAST page belonging to a dead item
+ *
+ * TODO: currently though, we require that it's always explicitly marked as empty.
+ *
+ */
+static bool
+zspage_is_unused(Buffer buf)
+{
+ Page page;
+ ZSFreePageOpaque *opaque;
+
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page))
+ return false;
+
+ if (PageGetSpecialSize(page) != sizeof(ZSFreePageOpaque))
+ return false;
+ opaque = (ZSFreePageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_FREE_PAGE_ID)
+ return false;
+
+ return true;
+}
+
+/*
+ * Allocate a new page.
+ *
+ * The page is exclusive-locked, but not initialized.
+ */
+Buffer
+zspage_getnewbuf(Relation rel, Buffer metabuf)
+{
+ bool release_metabuf;
+ Buffer buf;
+ BlockNumber blk;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+
+ if (metabuf == InvalidBuffer)
+ {
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ release_metabuf = true;
+ }
+ else
+ release_metabuf = false;
+
+ metapage = BufferGetPage(metabuf);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ /* Get a block from the FPM. */
+ blk = metaopaque->zs_fpm_head;
+ if (blk == 0)
+ {
+ /* metapage, not expected */
+ elog(ERROR, "could not find valid page in FPM");
+ }
+ if (blk == InvalidBlockNumber)
+ {
+ /* No free pages. Have to extend the relation. */
+ buf = zspage_extendrel_newbuf(rel);
+ blk = BufferGetBlockNumber(buf);
+ }
+ else
+ {
+ ZSFreePageOpaque *opaque;
+ Page page;
+
+ buf = ReadBuffer(rel, blk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Check that the page really is unused. */
+ if (!zspage_is_unused(buf))
+ {
+ UnlockReleaseBuffer(buf);
+ elog(ERROR, "unexpected page found in free page list");
+ }
+ page = BufferGetPage(buf);
+ opaque = (ZSFreePageOpaque *) PageGetSpecialPointer(page);
+ metaopaque->zs_fpm_head = opaque->zs_next;
+ }
+
+ if (release_metabuf)
+ UnlockReleaseBuffer(metabuf);
+ return buf;
+}
+
+/*
+ * Extend the relation.
+ *
+ * Returns the new page, exclusive-locked.
+ */
+Buffer
+zspage_extendrel_newbuf(Relation rel)
+{
+ Buffer buf;
+ bool needLock;
+
+ /*
+ * Extend the relation by one page.
+ *
+ * We have to use a lock to ensure no one else is extending the rel at
+ * the same time, else we will both try to initialize the same new
+ * page. We can skip locking for new or temp relations, however,
+ * since no one else could be accessing them.
+ */
+ needLock = !RELATION_IS_LOCAL(rel);
+
+ if (needLock)
+ LockRelationForExtension(rel, ExclusiveLock);
+
+ buf = ReadBuffer(rel, P_NEW);
+
+ /* Acquire buffer lock on new page */
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * Release the file-extension lock; it's now OK for someone else to
+ * extend the relation some more. Note that we cannot release this
+ * lock before we have buffer lock on the new page, or we risk a race
+ * condition against btvacuumscan --- see comments therein.
+ */
+ if (needLock)
+ UnlockRelationForExtension(rel, ExclusiveLock);
+
+ return buf;
+}
+
+
+/*
+ * Explicitly mark a page as deleted and recyclable, and add it to the FPM.
+ *
+ * The caller must hold an exclusive-lock on the page.
+ */
+void
+zspage_delete_page(Relation rel, Buffer buf)
+{
+ BlockNumber blk = BufferGetBlockNumber(buf);
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ Page page;
+ ZSFreePageOpaque *opaque;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metapage = BufferGetPage(metabuf);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, sizeof(ZSFreePageOpaque));
+ opaque = (ZSFreePageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_page_id = ZS_FREE_PAGE_ID;
+ opaque->zs_next = metaopaque->zs_fpm_head;
+ metaopaque->zs_fpm_head = blk;
+
+ MarkBufferDirty(metabuf);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(metabuf);
+}
diff --git a/src/backend/access/zedstore/zedstore_inspect.c b/src/backend/access/zedstore/zedstore_inspect.c
new file mode 100644
index 00000000000..e9ccc4ec3b1
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_inspect.c
@@ -0,0 +1,474 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstore_inspect.c
+ * Debugging functions, for viewing ZedStore page contents
+ *
+ * These should probably be moved to contrib/, but it's handy to have them
+ * here during development.
+ *
+ * Example queries
+ * ---------------
+ *
+ * How many pages of each type does a table have?
+ *
+ * select count(*), pg_zs_page_type('t_zedstore', g)
+ * from generate_series(0, pg_table_size('t_zedstore') / 8192 - 1) g group by 2;
+ *
+ * count | pg_zs_page_type
+ * -------+-----------------
+ * 1 | META
+ * 3701 | BTREE
+ * 6 | UNDO
+ * (3 rows)
+ *
+ * Compression ratio of B-tree leaf pages (other pages are not compressed):
+ *
+ * select sum(uncompressedsz::numeric) / sum(totalsz) as compratio
+ * from pg_zs_btree_pages('t_zedstore') ;
+ * compratio
+ * --------------------
+ * 3.6623829559208134
+ * (1 row)
+ *
+ * Per column compression ratio and number of pages:
+ *
+ * select attno, count(*), sum(uncompressedsz::numeric) / sum(totalsz) as
+ * compratio from pg_zs_btree_pages('t_zedstore') group by attno order by
+ * attno;
+ *
+ * attno | count | compratio
+ * -------+-------+------------------------
+ * 0 | 395 | 1.00000000000000000000
+ * 1 | 56 | 1.0252948766341260
+ * 2 | 3 | 38.7542309420398383
+ * (3 rows)
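+ *
+ * Listing UNDO pages (illustrative; pg_zs_undo_pages() is defined below and
+ * returns one row per UNDO page, with block number, record count, free space
+ * and first/last record pointers):
+ *
+ *     select * from pg_zs_undo_pages('t_zedstore');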
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_inspect.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "miscadmin.h"
+
+#include "access/relscan.h"
+#include "access/table.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "commands/vacuum.h"
+#include "funcapi.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+
+Datum pg_zs_page_type(PG_FUNCTION_ARGS);
+Datum pg_zs_undo_pages(PG_FUNCTION_ARGS);
+Datum pg_zs_btree_pages(PG_FUNCTION_ARGS);
+
+Datum
+pg_zs_page_type(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint64 pageno = PG_GETARG_INT64(1);
+ Relation rel;
+ uint16 zs_page_id;
+ Buffer buf;
+ Page page;
+ char *result;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ buf = ReadBuffer(rel, pageno);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
+ zs_page_id = *((uint16 *) ((char *) page + BLCKSZ - sizeof(uint16)));
+
+ UnlockReleaseBuffer(buf);
+
+ table_close(rel, AccessShareLock);
+
+ switch (zs_page_id)
+ {
+ case ZS_META_PAGE_ID:
+ result = "META";
+ break;
+ case ZS_BTREE_PAGE_ID:
+ result = "BTREE";
+ break;
+ case ZS_UNDO_PAGE_ID:
+ result = "UNDO";
+ break;
+ case ZS_TOAST_PAGE_ID:
+ result = "TOAST";
+ break;
+ case ZS_FREE_PAGE_ID:
+ result = "FREE";
+ break;
+ default:
+ result = psprintf("UNKNOWN 0x%04x", zs_page_id);
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+/*
+ * blkno int8
+ * nrecords int4
+ * freespace int4
+ * firstrecptr int8
+ * lastrecptr int8
+ */
+Datum
+pg_zs_undo_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber firstblk;
+ BlockNumber blkno;
+ char *ptr;
+ char *endptr;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ /*
+ * Get the current oldest undo page from the metapage.
+ */
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ firstblk = metaopaque->zs_undo_head;
+
+ UnlockReleaseBuffer(metabuf);
+
+ /*
+ * Loop through UNDO records, starting from the oldest page.
+ */
+ blkno = firstblk;
+ while (blkno != InvalidBlockNumber)
+ {
+ Datum values[5];
+ bool nulls[5];
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+ int nrecords;
+ ZSUndoRecPtr firstptr = { 0, 0, 0 };
+ ZSUndoRecPtr lastptr = { 0, 0, 0 };
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the UNDO page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ {
+ elog(WARNING, "unexpected page id on UNDO page %u", blkno);
+ break;
+ }
+
+ /* loop through all records on the page */
+ endptr = (char *) page + ((PageHeader) page)->pd_lower;
+ ptr = (char *) page + SizeOfPageHeaderData;
+ nrecords = 0;
+ while (ptr < endptr)
+ {
+ ZSUndoRec *undorec = (ZSUndoRec *) ptr;
+
+ Assert(undorec->undorecptr.blkno == blkno);
+
+ lastptr = undorec->undorecptr;
+ if (nrecords == 0)
+ firstptr = lastptr;
+ nrecords++;
+
+ ptr += undorec->size;
+ }
+
+ values[0] = Int64GetDatum(blkno);
+ values[1] = Int32GetDatum(nrecords);
+ values[2] = Int32GetDatum(PageGetExactFreeSpace(page));
+ values[3] = Int64GetDatum(firstptr.counter);
+ values[4] = Int64GetDatum(lastptr.counter);
+
+ blkno = opaque->next;
+ UnlockReleaseBuffer(buf);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
+
+
+/*
+ * blkno int8
+ * nextblk int8
+ * attno int4
+ * level int4
+ *
+ * lokey int8
+ * hikey int8
+ *
+ * nitems int4
+ * ncompressed int4
+ * totalsz int4
+ * uncompressedsz int4
+ * freespace int4
+ */
+Datum
+pg_zs_btree_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ BlockNumber blkno;
+ BlockNumber nblocks;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /* scan all blocks in physical order (block 0 is the metapage, skip it) */
+ for (blkno = 1; blkno < nblocks; blkno++)
+ {
+ Datum values[11];
+ bool nulls[11];
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ int nitems;
+ int ncompressed;
+ int totalsz;
+ int uncompressedsz;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * we're only interested in B-tree pages. (Presumably, most of the
+ * pages in the relation are b-tree pages, so it makes sense to
+ * scan the whole relation in physical order)
+ */
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSBtreePageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+ opaque = (ZSBtreePageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_BTREE_PAGE_ID)
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+
+ nitems = 0;
+ ncompressed = 0;
+ totalsz = 0;
+ uncompressedsz = 0;
+ if (opaque->zs_level == 0)
+ {
+ /* leaf page */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+
+ if (opaque->zs_attno == ZS_META_ATTRIBUTE_NUM)
+ {
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ nitems++;
+ totalsz += item->t_size;
+
+ uncompressedsz += item->t_size;
+ }
+ else
+ {
+ ZSAttributeArrayItem *item = (ZSAttributeArrayItem *) PageGetItem(page, iid);
+
+ nitems++;
+ totalsz += item->t_size;
+ if ((item->t_flags & ZSBT_ATTR_COMPRESSED) != 0)
+ {
+ ZSAttributeCompressedItem *citem = (ZSAttributeCompressedItem *) PageGetItem(page, iid);
+
+ ncompressed++;
+ uncompressedsz += offsetof(ZSAttributeCompressedItem, t_payload)
+ + citem->t_uncompressed_size;
+ }
+ else
+ uncompressedsz += item->t_size;
+ }
+ }
+ }
+ else
+ {
+ /* internal page */
+ nitems = ZSBtreeInternalPageGetNumItems(page);
+ }
+ values[0] = Int64GetDatum(blkno);
+ values[1] = Int64GetDatum(opaque->zs_next);
+ values[2] = Int32GetDatum(opaque->zs_attno);
+ values[3] = Int32GetDatum(opaque->zs_level);
+ values[4] = Int64GetDatum(opaque->zs_lokey);
+ values[5] = Int64GetDatum(opaque->zs_hikey);
+ values[6] = Int32GetDatum(nitems);
+ if (opaque->zs_level == 0)
+ {
+ values[7] = Int32GetDatum(ncompressed);
+ values[8] = Int32GetDatum(totalsz);
+ values[9] = Int32GetDatum(uncompressedsz);
+ }
+ else
+ {
+ nulls[7] = true;
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ values[10] = Int32GetDatum(PageGetExactFreeSpace(page));
+
+ UnlockReleaseBuffer(buf);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/access/zedstore/zedstore_meta.c b/src/backend/access/zedstore/zedstore_meta.c
new file mode 100644
index 00000000000..bcf8a28aa3e
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_meta.c
@@ -0,0 +1,397 @@
+/*
+ * zedstore_meta.c
+ * Routines for handling ZedStore metapage
+ *
+ * The metapage holds a directory of B-tree root block numbers, one for each
+ * column.
+ *
+ * TODO:
+ * - extend the root block dir to an overflow page if there are too many
+ * attributes to fit on one page
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_meta.c
+ */
+#include "postgres.h"
+
+#include "access/itup.h"
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+static ZSMetaCacheData *
+zsmeta_populate_cache_from_metapage(Relation rel, Page page)
+{
+ ZSMetaCacheData *cache;
+ ZSMetaPage *metapg;
+ int natts;
+
+ if (rel->rd_amcache != NULL)
+ {
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }
+
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ natts = metapg->nattributes;
+
+ cache =
+ MemoryContextAllocZero(CacheMemoryContext,
+ offsetof(ZSMetaCacheData, cache_attrs[natts]));
+ cache->cache_nattributes = natts;
+
+ for (int i = 0; i < natts; i++)
+ {
+ cache->cache_attrs[i].root = metapg->tree_root_dir[i].root;
+ cache->cache_attrs[i].rightmost = InvalidBlockNumber;
+ }
+
+ rel->rd_amcache = cache;
+ return cache;
+}
+
+ZSMetaCacheData *
+zsmeta_populate_cache(Relation rel)
+{
+ ZSMetaCacheData *cache;
+ Buffer metabuf;
+ BlockNumber nblocks;
+
+ RelationOpenSmgr(rel);
+
+ if (rel->rd_amcache != NULL)
+ {
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ RelationSetTargetBlock(rel, nblocks);
+ if (nblocks == 0)
+ {
+ cache =
+ MemoryContextAllocZero(CacheMemoryContext,
+ offsetof(ZSMetaCacheData, cache_attrs));
+ cache->cache_nattributes = 0;
+ rel->rd_amcache = cache;
+ }
+ else
+ {
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ cache = zsmeta_populate_cache_from_metapage(rel, BufferGetPage(metabuf));
+ UnlockReleaseBuffer(metabuf);
+ }
+
+ return cache;
+}
+
+static void
+zsmeta_expand_metapage_for_new_attributes(Relation rel)
+{
+ int natts = RelationGetNumberOfAttributes(rel) + 1;
+ Buffer metabuf;
+ Page page;
+ ZSMetaPage *metapg;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(metabuf);
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ if (natts > metapg->nattributes)
+ {
+ int new_pd_lower;
+
+ new_pd_lower = (char *) &metapg->tree_root_dir[natts] - (char *) page;
+ if (new_pd_lower > ((PageHeader) page)->pd_upper)
+ {
+ /*
+ * The root block directory must fit on the metapage.
+ *
+ * TODO: We could extend this by overflowing to another page.
+ */
+ elog(ERROR, "too many attributes for zedstore");
+ }
+
+ START_CRIT_SECTION();
+
+ /* Initialize the new attribute roots to InvalidBlockNumber */
+ for (int i = metapg->nattributes; i < natts; i++)
+ metapg->tree_root_dir[i].root = InvalidBlockNumber;
+
+ metapg->nattributes = natts;
+ ((PageHeader) page)->pd_lower = new_pd_lower;
+
+ MarkBufferDirty(metabuf);
+ /* TODO: WAL-log */
+
+ END_CRIT_SECTION();
+ }
+ UnlockReleaseBuffer(metabuf);
+
+ if (rel->rd_amcache != NULL)
+ {
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }
+}
+
+static Page
+zsmeta_initmetapage_internal(int natts)
+{
+ Page page;
+ ZSMetaPageOpaque *opaque;
+ ZSMetaPage *metapg;
+ int new_pd_lower;
+
+ /*
+ * It's possible that we error out when building the metapage, if there
+ * are too many attributes, so work on a temporary copy first, before
+ * actually allocating the buffer.
+ */
+ page = palloc(BLCKSZ);
+ PageInit(page, BLCKSZ, sizeof(ZSMetaPageOpaque));
+
+ opaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_META_PAGE_ID;
+
+ /* UNDO-related fields */
+ opaque->zs_undo_tail_first_counter = 2; /* start at 2, so that 0 is always "old", and 1 means "dead" */
+ opaque->zs_undo_head = InvalidBlockNumber;
+ opaque->zs_undo_tail = InvalidBlockNumber;
+ opaque->zs_undo_oldestptr.counter = 2;
+
+ opaque->zs_fpm_head = InvalidBlockNumber;
+
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ new_pd_lower = (char *) &metapg->tree_root_dir[natts] - (char *) page;
+ if (new_pd_lower > ((PageHeader) page)->pd_upper)
+ {
+ /*
+ * The root block directory must fit on the metapage.
+ *
+ * TODO: We could extend this by overflowing to another page.
+ */
+ elog(ERROR, "too many attributes for zedstore");
+ }
+
+ metapg->nattributes = natts;
+ for (int i = 0; i < natts; i++)
+ metapg->tree_root_dir[i].root = InvalidBlockNumber;
+
+ ((PageHeader) page)->pd_lower = new_pd_lower;
+ return page;
+}
+
+/*
+ * Initialize the metapage for an empty relation.
+ */
+void
+zsmeta_initmetapage(Relation rel)
+{
+ Buffer buf;
+ Page page;
+ int natts = RelationGetNumberOfAttributes(rel) + 1;
+
+ /* Ok, write it out to disk */
+ buf = ReadBuffer(rel, P_NEW);
+ if (BufferGetBlockNumber(buf) != ZS_META_BLK)
+ elog(ERROR, "table is not empty");
+ page = zsmeta_initmetapage_internal(natts);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ START_CRIT_SECTION();
+ PageRestoreTempPage(page, BufferGetPage(buf));
+
+ MarkBufferDirty(buf);
+
+ if (RelationNeedsWAL(rel))
+ {
+ wal_zedstore_init_metapage init_rec;
+ XLogRecPtr recptr;
+
+ init_rec.natts = natts;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf,
+ REGBUF_STANDARD | REGBUF_FORCE_IMAGE | REGBUF_KEEP_DATA);
+ XLogRegisterData((char *) &init_rec, SizeOfZSWalInitMetapage);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_INIT_METAPAGE);
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+}
+
+void
+zsmeta_initmetapage_redo(XLogReaderState *record)
+{
+ Buffer buf;
+
+ /*
+ * Metapage changes are so rare that we rely on full-page images
+ * for replay.
+ */
+ if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+ elog(ERROR, "zedstore metapage init WAL record did not contain a full-page image");
+
+ Assert(BufferGetBlockNumber(buf) == ZS_META_BLK);
+ UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Get the block number of the b-tree root for the given attribute.
+ *
+ * If 'readonly' is true and the root doesn't exist yet (i.e. it's an empty
+ * table), returns InvalidBlockNumber. Otherwise, a new root is allocated
+ * if one doesn't exist yet.
+ */
+BlockNumber
+zsmeta_get_root_for_attribute(Relation rel, AttrNumber attno, bool readonly)
+{
+ Buffer metabuf;
+ ZSMetaPage *metapg;
+ BlockNumber rootblk;
+ ZSMetaCacheData *metacache;
+
+ Assert(attno == ZS_META_ATTRIBUTE_NUM || attno >= 1);
+
+ metacache = zsmeta_get_cache(rel);
+
+ if (RelationGetTargetBlock(rel) == 0 ||
+ RelationGetTargetBlock(rel) == InvalidBlockNumber)
+ {
+ BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (nblocks != 0)
+ metacache = zsmeta_populate_cache(rel);
+ else if (readonly)
+ return InvalidBlockNumber;
+ else
+ {
+ LockRelationForExtension(rel, ExclusiveLock);
+ /*
+ * Confirm number of blocks is still 0 after taking lock,
+ * before initializing a new metapage
+ */
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (nblocks == 0)
+ zsmeta_initmetapage(rel);
+ UnlockRelationForExtension(rel, ExclusiveLock);
+ metacache = zsmeta_populate_cache(rel);
+ }
+ }
+
+ /*
+ * The file has fewer attributes stored than the catalog. This happens
+ * when a column is added with a default value: the default is stored in
+ * the catalog, and nothing is stored in the table. Such an attribute
+ * must be marked with atthasmissing.
+ */
+ if (attno >= metacache->cache_nattributes)
+ {
+ if (readonly)
+ {
+ /* re-check */
+ metacache = zsmeta_populate_cache(rel);
+ if (attno >= metacache->cache_nattributes)
+ return InvalidBlockNumber;
+ }
+ else
+ {
+ zsmeta_expand_metapage_for_new_attributes(rel);
+ metacache = zsmeta_populate_cache(rel);
+ }
+ }
+
+ rootblk = metacache->cache_attrs[attno].root;
+
+ if (!readonly && rootblk == InvalidBlockNumber)
+ {
+ /* try to allocate one */
+ Page page;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(metabuf);
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ /*
+ * Re-check that the root is still invalid, now that we have the
+ * metapage locked.
+ */
+ rootblk = metapg->tree_root_dir[attno].root;
+ if (rootblk == InvalidBlockNumber)
+ {
+ Buffer rootbuf;
+ Page rootpage;
+ ZSBtreePageOpaque *opaque;
+
+ /* TODO: release lock on metapage while we do I/O */
+ rootbuf = zspage_getnewbuf(rel, metabuf);
+ rootblk = BufferGetBlockNumber(rootbuf);
+
+ metapg->tree_root_dir[attno].root = rootblk;
+
+ /* initialize the page to look like a root leaf */
+ rootpage = BufferGetPage(rootbuf);
+ PageInit(rootpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ opaque = ZSBtreePageGetOpaque(rootpage);
+ opaque->zs_attno = attno;
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_lokey = MinZSTid;
+ opaque->zs_hikey = MaxPlusOneZSTid;
+ opaque->zs_level = 0;
+ opaque->zs_flags = ZSBT_ROOT;
+ opaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ MarkBufferDirty(rootbuf);
+ MarkBufferDirty(metabuf);
+ /* TODO: WAL-log both pages */
+
+ UnlockReleaseBuffer(rootbuf);
+ }
+ metacache->cache_attrs[attno].root = rootblk;
+ UnlockReleaseBuffer(metabuf);
+ }
+
+ return rootblk;
+}
+
+/*
+ * Update the B-tree root pointer for the given attribute in the metapage.
+ *
+ * Caller is responsible for WAL-logging this.
+ */
+void
+zsmeta_update_root_for_attribute(Relation rel, AttrNumber attno,
+ Buffer metabuf, BlockNumber rootblk)
+{
+ ZSMetaPage *metapg;
+
+ metapg = (ZSMetaPage *) PageGetContents(BufferGetPage(metabuf));
+
+ if ((attno != ZS_META_ATTRIBUTE_NUM) && (attno <= 0 || attno > metapg->nattributes))
+ elog(ERROR, "invalid attribute number %d (table \"%s\" has only %d attributes)",
+ attno, RelationGetRelationName(rel), metapg->nattributes);
+
+ metapg->tree_root_dir[attno].root = rootblk;
+
+ MarkBufferDirty(metabuf);
+}
diff --git a/src/backend/access/zedstore/zedstore_simple8b.c b/src/backend/access/zedstore/zedstore_simple8b.c
new file mode 100644
index 00000000000..088b8a25f42
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_simple8b.c
@@ -0,0 +1,391 @@
+/*
+ * zedstore_simple8b.c
+ * Simple-8b encoding for zedstore
+ *
+ * FIXME: This is copy-pasted from src/backend/lib/integerset.c. Some of
+ * the things we do here are not relevant for the use in zedstore, or could
+ * be optimized. For example, EMPTY_CODEWORD is not used.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_simple8b.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_simple8b.h"
+
+/*
+ * Decode an array of Simple-8b codewords, known to contain 'num_integers'
+ * integers.
+ */
+void
+simple8b_decode_words(uint64 *codewords, int num_codewords,
+ uint64 *dst, int num_integers)
+{
+ int total_decoded = 0;
+
+ /* decode all the codewords */
+ for (int i = 0; i < num_codewords; i++)
+ {
+ int num_decoded;
+
+ num_decoded = simple8b_decode(codewords[i], &dst[total_decoded]);
+ total_decoded += num_decoded;
+ }
+ /*
+ * XXX: This error message is rather specific, but it matches how this
+ * function is actually used, i.e. to decode TIDs, where the number of
+ * integers comes from the item header.
+ */
+ if (total_decoded != num_integers)
+ elog(ERROR, "number of TIDs in codewords did not match the item header");
+}
+
+/*
+ * Simple-8b encoding.
+ *
+ * The simple-8b algorithm packs between 1 and 240 integers into 64-bit words,
+ * called "codewords". The number of integers packed into a single codeword
+ * depends on the integers being packed; small integers are encoded using
+ * fewer bits than large integers. A single codeword can store a single
+ * 60-bit integer, or two 30-bit integers, for example.
+ *
+ * Since we're storing a unique, sorted, set of integers, we actually encode
+ * the *differences* between consecutive integers. That way, clusters of
+ * integers that are close to each other are packed efficiently, regardless
+ * of their absolute values.
+ *
+ * In Simple-8b, each codeword consists of a 4-bit selector, which indicates
+ * how many integers are encoded in the codeword, and the encoded integers are
+ * packed into the remaining 60 bits. The selector allows for 16 different
+ * ways of using the remaining 60 bits, called "modes". The number of integers
+ * packed into a single codeword in each mode is listed in the simple8b_modes
+ * table below. For example, consider the following codeword:
+ *
+ * 20-bit integer 20-bit integer 20-bit integer
+ * 1101 00000000000000010010 01111010000100100000 00000000000000010100
+ * ^
+ * selector
+ *
+ * The selector 1101 is 13 in decimal. From the modes table below, we see
+ * that it means that the codeword encodes three 20-bit integers. In decimal,
+ * those integers are 18, 500000 and 20. Because we encode deltas rather than
+ * absolute values, the actual values that they represent are 18, 500018 and
+ * 500038.
+ *
+ * Modes 0 and 1 are a bit special; they encode a run of 240 or 120 zeroes
+ * (which means 240 or 120 consecutive integers, since we're encoding the
+ * deltas between integers), without using the rest of the codeword bits
+ * for anything.
+ *
+ * Simple-8b cannot encode integers larger than 60 bits. Values larger than
+ * that are always stored in the 'first' field of a leaf item, never in the
+ * packed codeword. If there is a sequence of integers that are more than
+ * 2^60 apart, the codeword will go unused on those items. To represent that,
+ * we use a magic EMPTY_CODEWORD codeword value.
+ */
+static const struct simple8b_mode
+{
+ uint8 bits_per_int;
+ uint8 num_ints;
+} simple8b_modes[17] =
+
+{
+ {0, 240}, /* mode 0: 240 zeroes */
+ {0, 120}, /* mode 1: 120 zeroes */
+ {1, 60}, /* mode 2: sixty 1-bit integers */
+ {2, 30}, /* mode 3: thirty 2-bit integers */
+ {3, 20}, /* mode 4: twenty 3-bit integers */
+ {4, 15}, /* mode 5: fifteen 4-bit integers */
+ {5, 12}, /* mode 6: twelve 5-bit integers */
+ {6, 10}, /* mode 7: ten 6-bit integers */
+ {7, 8}, /* mode 8: eight 7-bit integers (four bits
+ * are wasted) */
+ {8, 7}, /* mode 9: seven 8-bit integers (four bits
+ * are wasted) */
+ {10, 6}, /* mode 10: six 10-bit integers */
+ {12, 5}, /* mode 11: five 12-bit integers */
+ {15, 4}, /* mode 12: four 15-bit integers */
+ {20, 3}, /* mode 13: three 20-bit integers */
+ {30, 2}, /* mode 14: two 30-bit integers */
+ {60, 1}, /* mode 15: one 60-bit integer */
+
+ {0, 0} /* sentinel value */
+};
+
+/*
+ * Maximum number of integers that can be encoded in a single Simple-8b
+ * codeword.
+ */
+#define SIMPLE8B_MAX_VALUES_PER_CODEWORD 240
+
+/*
+ * EMPTY_CODEWORD is a special value, used to indicate "no values".
+ * It is used if the next value is too large to be encoded with Simple-8b.
+ *
+ * This value looks like a mode-0 codeword, but we can distinguish it
+ * because a regular mode-0 codeword would have zeroes in the unused bits.
+ */
+#define EMPTY_CODEWORD UINT64CONST(0x0FFFFFFFFFFFFFFF)
+
+/*
+ * Encode a number of integers into a Simple-8b codeword.
+ *
+ * Returns the encoded codeword, and sets *num_encoded to the number of
+ * input integers that were encoded. That can be zero, if the first delta
+ * is too large to be encoded.
+ */
+uint64
+simple8b_encode(const uint64 *ints, int num_ints, int *num_encoded)
+{
+ int selector;
+ int nints;
+ int bits;
+ uint64 val;
+ uint64 codeword;
+ int i;
+
+ /*
+ * Select the "mode" to use for this codeword.
+ *
+ * In each iteration, check if the next value can be represented in the
+ * current mode we're considering. If it's too large, then step up the
+ * mode to a wider one, and repeat. If it fits, move on to the next
+ * integer. Repeat until the codeword is full, given the current mode.
+ *
+ * Note that we don't have any way to represent unused slots in the
+ * codeword, so we require each codeword to be "full". It is always
+ * possible to produce a full codeword unless the very first delta is too
+ * large to be encoded. For example, if the first delta is small but the
+ * second is too large to be encoded, we'll end up using the last "mode",
+ * which has nints == 1.
+ */
+ selector = 0;
+ nints = simple8b_modes[0].num_ints;
+ bits = simple8b_modes[0].bits_per_int;
+ val = ints[0];
+ i = 0; /* number of deltas we have accepted */
+ for (;;)
+ {
+ if (val >= (UINT64CONST(1) << bits))
+ {
+ /* too large, step up to next mode */
+ selector++;
+ nints = simple8b_modes[selector].num_ints;
+ bits = simple8b_modes[selector].bits_per_int;
+ /* we might already have accepted enough deltas for this mode */
+ if (i >= nints)
+ break;
+ }
+ else
+ {
+ /* accept this delta; then done if codeword is full */
+ i++;
+ if (i >= nints)
+ break;
+ /* examine next delta */
+ if (i < num_ints)
+ val = ints[i];
+ else
+ {
+ /*
+ * Reached end of input. Pretend that the next integer is a
+ * value that's too large to represent in Simple-8b, so that
+ * we fall out.
+ */
+ val = PG_UINT64_MAX;
+ }
+ }
+ }
+
+ if (nints == 0)
+ {
+ /*
+ * The first delta is too large to be encoded with Simple-8b.
+ *
+ * If there is at least one not-too-large integer in the input, we
+ * will encode it using mode 15 (or a more compact mode). Hence, we
+ * can only get here if the *first* delta is >= 2^60.
+ */
+ Assert(i == 0);
+ *num_encoded = 0;
+ return EMPTY_CODEWORD;
+ }
+
+ /*
+ * Encode the integers using the selected mode. Note that we shift them
+ * into the codeword in reverse order, so that they will come out in the
+ * correct order in the decoder.
+ */
+ codeword = 0;
+ if (bits > 0)
+ {
+ for (i = nints - 1; i > 0; i--)
+ {
+ val = ints[i];
+ codeword |= val;
+ codeword <<= bits;
+ }
+ val = ints[0];
+ codeword |= val;
+ }
+
+ /* add selector to the codeword, and return */
+ codeword |= (uint64) selector << 60;
+
+ *num_encoded = nints;
+ return codeword;
+}
+
+/*
+ * Encode a number of same integers into a Simple-8b codeword.
+ *
+ * This is a special version of simple8b_encode, where the first input
+ * integer is 'firstint', followed by a number of 'secondint'. This is
+ * equivalent to calling simple8b_encode() with an input array:
+ *
+ * ints[0]: firstint
+ * ints[1]: secondint
+ * ints[2]: secondint
+ * ...
+ * ints[num_ints - 1]: secondint
+ *
+ * This is needed when doing a multi-insert. Having a specialized version
+ * is nice for speed, and it also keeps the calling code simpler, since it
+ * doesn't need to construct an input array.
+ *
+ * TODO: This is just copy-pasted from simple8b_encode, but since we know
+ * what the input is, we could probably optimize this further.
+ */
+uint64
+simple8b_encode_consecutive(const uint64 firstint, const uint64 secondint, int num_ints,
+ int *num_encoded)
+{
+ int selector;
+ int nints;
+ int bits;
+ uint64 val;
+ uint64 codeword;
+ int i;
+
+ /*
+ * Select the "mode" to use for this codeword.
+ *
+ * In each iteration, check if the next value can be represented in the
+ * current mode we're considering. If it's too large, then step up the
+ * mode to a wider one, and repeat. If it fits, move on to the next
+ * integer. Repeat until the codeword is full, given the current mode.
+ *
+ * Note that we don't have any way to represent unused slots in the
+ * codeword, so we require each codeword to be "full". It is always
+ * possible to produce a full codeword unless the very first delta is too
+ * large to be encoded. For example, if the first delta is small but the
+ * second is too large to be encoded, we'll end up using the last "mode",
+ * which has nints == 1.
+ */
+ selector = 0;
+ nints = simple8b_modes[0].num_ints;
+ bits = simple8b_modes[0].bits_per_int;
+ val = firstint;
+ i = 0; /* number of deltas we have accepted */
+ for (;;)
+ {
+ if (val >= (UINT64CONST(1) << bits))
+ {
+ /* too large, step up to next mode */
+ selector++;
+ nints = simple8b_modes[selector].num_ints;
+ bits = simple8b_modes[selector].bits_per_int;
+ /* we might already have accepted enough deltas for this mode */
+ if (i >= nints)
+ break;
+ }
+ else
+ {
+ /* accept this delta; then done if codeword is full */
+ i++;
+ if (i >= nints)
+ break;
+ /* examine next delta */
+ if (i < num_ints)
+ val = secondint;
+ else
+ {
+ /*
+ * Reached end of input. Pretend that the next integer is a
+ * value that's too large to represent in Simple-8b, so that
+ * we fall out.
+ */
+ val = PG_UINT64_MAX;
+ }
+ }
+ }
+
+ if (nints == 0)
+ {
+ /*
+ * The first delta is too large to be encoded with Simple-8b.
+ *
+ * If there is at least one not-too-large integer in the input, we
+ * will encode it using mode 15 (or a more compact mode). Hence, we
+ * can only get here if the *first* delta is >= 2^60.
+ */
+ Assert(i == 0);
+ *num_encoded = 0;
+ return EMPTY_CODEWORD;
+ }
+
+ /*
+ * Encode the integers using the selected mode. Note that we shift them
+ * into the codeword in reverse order, so that they will come out in the
+ * correct order in the decoder.
+ */
+ codeword = 0;
+ if (bits > 0)
+ {
+ for (i = nints - 1; i > 0; i--)
+ {
+ val = secondint;
+ codeword |= val;
+ codeword <<= bits;
+ }
+ val = firstint;
+ codeword |= val;
+ }
+
+ /* add selector to the codeword, and return */
+ codeword |= (uint64) selector << 60;
+
+ *num_encoded = nints;
+ return codeword;
+}
+
+/*
+ * Decode a codeword into an array of integers.
+ * Returns the number of integers decoded.
+ */
+int
+simple8b_decode(uint64 codeword, uint64 *decoded)
+{
+ int selector = (codeword >> 60);
+ int nints = simple8b_modes[selector].num_ints;
+ int bits = simple8b_modes[selector].bits_per_int;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+
+ if (codeword == EMPTY_CODEWORD)
+ return 0;
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 val = codeword & mask;
+
+ decoded[i] = val;
+ codeword >>= bits;
+ }
+
+ return nints;
+}
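+
+/*
+ * Illustrative sketch, not meant to be compiled as part of the patch: the
+ * intended calling pattern is to encode a delta array one codeword at a
+ * time, advancing by however many integers each call managed to pack, and
+ * to decode with simple8b_decode_words() afterwards. The function name and
+ * the 60-delta input below are made up for the example.
+ */
+#ifdef NOT_USED
+static void
+simple8b_roundtrip_example(void)
+{
+ uint64 deltas[60];
+ uint64 decoded[60];
+ uint64 codewords[8];
+ int num_codewords = 0;
+ int num_encoded = 0;
+
+ /* sixty deltas of 1 fit exactly into one mode-2 codeword (sixty 1-bit ints) */
+ for (int i = 0; i < 60; i++)
+ deltas[i] = 1;
+
+ while (num_encoded < 60 && num_codewords < 8)
+ {
+ int n;
+
+ codewords[num_codewords] = simple8b_encode(&deltas[num_encoded],
+ 60 - num_encoded, &n);
+ if (n == 0)
+ break;
+ num_encoded += n;
+ num_codewords++;
+ }
+
+ /* decode everything back; this errors out if the counts don't match */
+ simple8b_decode_words(codewords, num_codewords, decoded, num_encoded);
+ Assert(num_encoded == 60 && decoded[0] == 1 && decoded[59] == 1);
+}
+#endif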
diff --git a/src/backend/access/zedstore/zedstore_tiditem.c b/src/backend/access/zedstore/zedstore_tiditem.c
new file mode 100644
index 00000000000..ab21fdc9547
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_tiditem.c
@@ -0,0 +1,932 @@
+/*
+ * zedstore_tiditem.c
+ * Routines for packing TIDs into "items"
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tiditem.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_internal.h"
+#include "access/zedstore_simple8b.h"
+
+static int remap_slots(uint8 *slotnos, int num_tids,
+ ZSUndoRecPtr *orig_slots, int num_orig_slots,
+ int target_idx, ZSUndoRecPtr target_ptr,
+ ZSUndoRecPtr *new_slots,
+ int *new_num_slots,
+ uint8 *new_slotnos,
+ ZSUndoRecPtr recent_oldest_undo);
+static ZSTidArrayItem *build_item(zstid *tids, uint64 *deltas, uint8 *slotnos, int num_tids,
+ ZSUndoRecPtr *slots, int num_slots);
+
+static void deltas_to_tids(zstid firsttid, uint64 *deltas, int num_tids, zstid *tids);
+static void slotwords_to_slotnos(uint64 *slotwords, int num_tids, uint8 *slotnos);
+static int binsrch_tid_array(zstid key, zstid *arr, int arr_elems);
+
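+/*
+ * A note on UNDO slots, summarizing what the routines below assume: each
+ * item carries a small set of UNDO slots, and for every TID a slot number
+ * is packed, ZSBT_ITEM_UNDO_SLOT_BITS bits at a time, into 64-bit
+ * "slotwords". Slots ZSBT_OLD_UNDO_SLOT and ZSBT_DEAD_UNDO_SLOT are
+ * implicit, standing for InvalidUndoPtr (an UNDO pointer older than the
+ * oldest one we keep) and DeadUndoPtr respectively, and are not stored in
+ * the item; only slots from ZSBT_FIRST_NORMAL_UNDO_SLOT onwards have their
+ * UNDO pointers stored.
+ */
+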
+/*
+ * Extract TIDs from an item into iterator.
+ */
+void
+zsbt_tid_item_unpack(ZSTidArrayItem *item, ZSTidItemIterator *iter)
+{
+ ZSUndoRecPtr *slots;
+ int num_tids;
+ uint64 *slotwords;
+ uint64 *codewords;
+
+ if (iter->tids_allocated_size < item->t_num_tids)
+ {
+ if (iter->tids)
+ pfree(iter->tids);
+ if (iter->tid_undoslotnos)
+ pfree(iter->tid_undoslotnos);
+ iter->tids = MemoryContextAlloc(iter->context, item->t_num_tids * sizeof(zstid));
+ iter->tid_undoslotnos = MemoryContextAlloc(iter->context, item->t_num_tids * sizeof(uint8));
+ iter->tids_allocated_size = item->t_num_tids;
+ }
+
+ ZSTidArrayItemDecode(item, &codewords, &slots, &slotwords);
+ num_tids = item->t_num_tids;
+
+ /* decode all the codewords */
+ simple8b_decode_words(codewords, item->t_num_codewords, iter->tids, num_tids);
+
+ /* convert the deltas to TIDs */
+ deltas_to_tids(item->t_firsttid, iter->tids, num_tids, iter->tids);
+ iter->num_tids = num_tids;
+ Assert(iter->tids[num_tids - 1] == item->t_endtid - 1);
+
+ /* Expand slotwords to slotnos */
+ slotwords_to_slotnos(slotwords, num_tids, iter->tid_undoslotnos);
+
+ /* also copy out the slots to the iterator */
+ iter->undoslots[ZSBT_OLD_UNDO_SLOT] = InvalidUndoPtr;
+ iter->undoslots[ZSBT_DEAD_UNDO_SLOT] = DeadUndoPtr;
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < item->t_num_undo_slots; i++)
+ iter->undoslots[i] = slots[i - ZSBT_FIRST_NORMAL_UNDO_SLOT];
+}
+
+/*
+ * Create a ZSTidArrayItem (or items), to represent a range of contiguous TIDs,
+ * all with the same UNDO pointer.
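+ *
+ * For example, creating items for 1000 contiguous TIDs starting at 'tid'
+ * returns a list of one or more items whose [t_firsttid, t_endtid) ranges
+ * tile the whole range [tid, tid + 1000), in order.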
+ */
+List *
+zsbt_tid_item_create_for_range(zstid tid, int nelements, ZSUndoRecPtr undo_ptr)
+{
+ uint64 total_encoded;
+ List *newitems = NIL;
+ uint64 codewords[ZSBT_MAX_ITEM_CODEWORDS];
+ int num_slots;
+ int slotno;
+
+ Assert(undo_ptr.counter != DeadUndoPtr.counter);
+ if (IsZSUndoRecPtrValid(&undo_ptr))
+ {
+ slotno = ZSBT_FIRST_NORMAL_UNDO_SLOT;
+ num_slots = ZSBT_FIRST_NORMAL_UNDO_SLOT + 1;
+ }
+ else
+ {
+ slotno = ZSBT_OLD_UNDO_SLOT;
+ num_slots = ZSBT_FIRST_NORMAL_UNDO_SLOT;
+ }
+
+ total_encoded = 0;
+ while (total_encoded < nelements)
+ {
+ ZSTidArrayItem *newitem;
+ Size itemsz;
+ int num_codewords;
+ int num_tids;
+ zstid firsttid = tid + total_encoded;
+ uint64 first_delta;
+ uint64 second_delta;
+ ZSUndoRecPtr *newitem_slots;
+ uint64 *slotword_p;
+ uint64 *newitem_slotwords;
+ uint64 *newitem_codewords;
+ int i;
+
+ /*
+ * The first delta is 0, because the first TID is implicitly 'firsttid'.
+ * The rest are at a distance of 1 from the previous TID.
+ */
+ first_delta = 0;
+ second_delta = 1;
+ num_tids = 0;
+ for (num_codewords = 0;
+ num_codewords < ZSBT_MAX_ITEM_CODEWORDS && total_encoded < nelements && num_tids < ZSBT_MAX_ITEM_TIDS;
+ num_codewords++)
+ {
+ uint64 codeword;
+ int num_encoded;
+
+ codeword = simple8b_encode_consecutive(first_delta, second_delta,
+ nelements - total_encoded,
+ &num_encoded);
+ if (num_encoded == 0)
+ break;
+
+ codewords[num_codewords] = codeword;
+ total_encoded += num_encoded;
+ num_tids += num_encoded;
+ first_delta = 1;
+ }
+
+ itemsz = SizeOfZSTidArrayItem(num_tids, num_slots, num_codewords);
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_num_tids = num_tids;
+ newitem->t_num_undo_slots = num_slots;
+ newitem->t_num_codewords = num_codewords;
+ newitem->t_firsttid = firsttid;
+ newitem->t_endtid = tid + total_encoded;
+
+ ZSTidArrayItemDecode(newitem, &newitem_codewords, &newitem_slots, &newitem_slotwords);
+
+ /* Fill in undo slots */
+ if (slotno == ZSBT_FIRST_NORMAL_UNDO_SLOT)
+ {
+ Assert(num_slots == ZSBT_FIRST_NORMAL_UNDO_SLOT + 1);
+ newitem_slots[0] = undo_ptr;
+ }
+
+ /* Fill in slotwords */
+ i = 0;
+ slotword_p = newitem_slotwords;
+ while (i < num_tids)
+ {
+ uint64 slotword;
+
+ slotword = 0;
+ for (int j = 0; j < ZSBT_SLOTNOS_PER_WORD && i < num_tids; j++)
+ {
+ slotword |= (uint64) slotno << (j * ZSBT_ITEM_UNDO_SLOT_BITS);
+ i++;
+ }
+ *(slotword_p++) = slotword;
+ }
+
+ /* Fill in TID codewords */
+ for (int i = 0; i < num_codewords; i++)
+ newitem_codewords[i] = codewords[i];
+
+ newitems = lappend(newitems, newitem);
+ }
+
+ return newitems;
+}
+
+/*
+ * Add a range of contiguous TIDs to an existing item.
+ *
+ * If all the new TIDs can be merged with the existing item, returns a List
+ * with a single element, containing the new combined item that covers all
+ * the existing TIDs, and the new TIDs. *modified_orig is set to true.
+ *
+ * If some of the new TIDs can be merged with the existing item, returns a
+ * List with more than one item. The returned items together replace the
+ * original item, such that all the existing TIDs and all the new TIDs are
+ * covered. *modified_orig is set to true in that case, too.
+ *
+ * If the new TIDs could not be merged with the existing item, returns a list
+ * of new items to represent the new TIDs, just like
+ * zsbt_tid_item_create_for_range(), and *modified_orig is set to false.
+ */
+List *
+zsbt_tid_item_add_tids(ZSTidArrayItem *orig, zstid firsttid, int nelements,
+ ZSUndoRecPtr undo_ptr, bool *modified_orig)
+{
+ int num_slots;
+ int num_new_codewords;
+ uint64 new_codewords[ZSBT_MAX_ITEM_CODEWORDS];
+ ZSUndoRecPtr *orig_slots;
+ uint64 *orig_slotwords;
+ uint64 *orig_codewords;
+ int slotno;
+ uint64 first_delta;
+ uint64 second_delta;
+ int total_new_encoded;
+ Size itemsz;
+ ZSTidArrayItem *newitem;
+ ZSUndoRecPtr *newitem_slots;
+ uint64 *newitem_slotwords;
+ uint64 *newitem_codewords;
+ List *newitems;
+ int num_tids;
+ ZSUndoRecPtr *dst_slot;
+ uint64 *dst_slotword;
+ uint64 *dst_codeword;
+ int i;
+ int j;
+
+ if (orig == NULL)
+ {
+ *modified_orig = false;
+ return zsbt_tid_item_create_for_range(firsttid, nelements, undo_ptr);
+ }
+
+ /* Quick check to see if we can add the new TIDs to the previous item */
+ Assert(orig->t_endtid <= firsttid);
+
+ /*
+ * Is there room for a new codeword? Currently, we don't try to add tids to the
+ * last existing codeword, even if we perhaps could.
+ */
+ if (orig->t_num_codewords >= ZSBT_MAX_ITEM_CODEWORDS)
+ {
+ *modified_orig = false;
+ return zsbt_tid_item_create_for_range(firsttid, nelements, undo_ptr);
+ }
+
+ ZSTidArrayItemDecode(orig, &orig_codewords, &orig_slots, &orig_slotwords);
+
+ /* Is there an UNDO slot we can use? */
+ Assert(undo_ptr.counter != DeadUndoPtr.counter);
+ if (!IsZSUndoRecPtrValid(&undo_ptr))
+ {
+ slotno = ZSBT_OLD_UNDO_SLOT;
+ num_slots = orig->t_num_undo_slots;
+ }
+ else
+ {
+ for (slotno = ZSBT_FIRST_NORMAL_UNDO_SLOT; slotno < orig->t_num_undo_slots; slotno++)
+ {
+ if (orig_slots[slotno - ZSBT_FIRST_NORMAL_UNDO_SLOT].counter == undo_ptr.counter)
+ break;
+ }
+ if (slotno >= ZSBT_MAX_ITEM_UNDO_SLOTS)
+ {
+ *modified_orig = false;
+ return zsbt_tid_item_create_for_range(firsttid, nelements, undo_ptr);
+ }
+
+ if (slotno >= orig->t_num_undo_slots)
+ num_slots = orig->t_num_undo_slots + 1;
+ else
+ num_slots = orig->t_num_undo_slots;
+ }
+
+ /* ok, go ahead, create as many new codewords as fits, or is needed. */
+ first_delta = firsttid - orig->t_endtid + 1;
+ second_delta = 1;
+ total_new_encoded = 0;
+ num_new_codewords = 0;
+ while (num_new_codewords < ZSBT_MAX_ITEM_CODEWORDS - orig->t_num_codewords &&
+ total_new_encoded < nelements && orig->t_num_tids + total_new_encoded < ZSBT_MAX_ITEM_TIDS)
+ {
+ uint64 codeword;
+ int num_encoded;
+
+ codeword = simple8b_encode_consecutive(first_delta,
+ second_delta,
+ nelements - total_new_encoded,
+ &num_encoded);
+ if (num_encoded == 0)
+ break;
+
+ new_codewords[num_new_codewords] = codeword;
+ first_delta = 1;
+ num_new_codewords++;
+ total_new_encoded += num_encoded;
+ }
+
+ if (num_new_codewords == 0)
+ {
+ *modified_orig = false;
+ return zsbt_tid_item_create_for_range(firsttid, nelements, undo_ptr);
+ }
+
+ num_tids = orig->t_num_tids + total_new_encoded;
+
+ itemsz = SizeOfZSTidArrayItem(num_tids, num_slots, orig->t_num_codewords + num_new_codewords);
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_num_undo_slots = num_slots;
+ newitem->t_num_codewords = orig->t_num_codewords + num_new_codewords;
+ newitem->t_firsttid = orig->t_firsttid;
+ newitem->t_endtid = firsttid + total_new_encoded;
+ newitem->t_num_tids = num_tids; /* number of TIDs, not the TID span; the item may have gaps */
+
+ ZSTidArrayItemDecode(newitem, &newitem_codewords, &newitem_slots, &newitem_slotwords);
+
+ /* copy existing codewords, followed by new ones */
+ dst_codeword = newitem_codewords;
+ for (int i = 0; i < orig->t_num_codewords; i++)
+ *(dst_codeword++) = orig_codewords[i];
+ for (int i = 0; i < num_new_codewords; i++)
+ *(dst_codeword++) = new_codewords[i];
+
+ /* copy existing UNDO slots, followed by new slot, if any */
+ dst_slot = newitem_slots;
+ for (i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < orig->t_num_undo_slots; i++)
+ *(dst_slot++) = orig_slots[i - ZSBT_FIRST_NORMAL_UNDO_SLOT];
+ if (num_slots > orig->t_num_undo_slots)
+ *(dst_slot++) = undo_ptr;
+
+ /*
+ * Copy and build slotwords
+ */
+ dst_slotword = newitem_slotwords;
+ /* copy full original slotwords as is */
+ for (i = 0; i < orig->t_num_tids / ZSBT_SLOTNOS_PER_WORD; i++)
+ *(dst_slotword++) = orig_slotwords[i];
+
+ /* add to the last, partial slotword. */
+ i = orig->t_num_tids;
+ j = orig->t_num_tids % ZSBT_SLOTNOS_PER_WORD;
+ if (j != 0)
+ {
+ uint64 slotword = orig_slotwords[orig->t_num_tids / ZSBT_SLOTNOS_PER_WORD];
+
+ for (; j < ZSBT_SLOTNOS_PER_WORD && i < num_tids; j++)
+ {
+ slotword |= (uint64) slotno << (j * ZSBT_ITEM_UNDO_SLOT_BITS);
+ i++;
+ }
+ *(dst_slotword++) = slotword;
+ }
+
+ /* new slotwords */
+ while (i < num_tids)
+ {
+ uint64 slotword = 0;
+
+ for (j = 0; j < ZSBT_SLOTNOS_PER_WORD && i < num_tids; j++)
+ {
+ slotword |= (uint64) slotno << (j * ZSBT_ITEM_UNDO_SLOT_BITS);
+ i++;
+ }
+ *(dst_slotword++) = slotword;
+ }
+ Assert(dst_slotword == newitem_slotwords + ZSBT_NUM_SLOTWORDS(num_tids));
+
+ /* Create more items for the remainder, if needed */
+ *modified_orig = true;
+ if (total_new_encoded < nelements)
+ newitems = zsbt_tid_item_create_for_range(newitem->t_endtid,
+ nelements - total_new_encoded,
+ undo_ptr);
+ else
+ newitems = NIL;
+ newitems = lcons(newitem, newitems);
+ return newitems;
+}
+
+/*
+ * Change the UNDO pointer of a tuple with TID 'target_tid', inside an item.
+ *
+ * Returns an item, or multiple items, to replace the original one.
+ */
+List *
+zsbt_tid_item_change_undoptr(ZSTidArrayItem *orig, zstid target_tid, ZSUndoRecPtr undoptr,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ uint64 *deltas;
+ zstid *tids;
+ int num_tids = orig->t_num_tids;
+ int target_idx = -1;
+ ZSUndoRecPtr *orig_slots_partial;
+ ZSUndoRecPtr orig_slots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ uint64 *orig_slotwords;
+ uint64 *orig_codewords;
+ List *newitems;
+ int new_slotno;
+
+ deltas = palloc(sizeof(uint64) * num_tids);
+ tids = palloc(sizeof(zstid) * num_tids);
+
+ ZSTidArrayItemDecode(orig, &orig_codewords, &orig_slots_partial, &orig_slotwords);
+
+ /* decode the codewords, to find the target TID */
+ simple8b_decode_words(orig_codewords, orig->t_num_codewords, deltas, num_tids);
+
+ deltas_to_tids(orig->t_firsttid, deltas, num_tids, tids);
+
+ target_idx = binsrch_tid_array(target_tid, tids, num_tids);
+ Assert(tids[target_idx] == target_tid);
+
+ /* Ok, we know the target TID now. Can we use one of the existing UNDO slots? */
+ new_slotno = -1;
+ if (undoptr.counter == DeadUndoPtr.counter)
+ new_slotno = ZSBT_DEAD_UNDO_SLOT;
+ if (new_slotno == -1 && undoptr.counter < recent_oldest_undo.counter)
+ new_slotno = ZSBT_OLD_UNDO_SLOT;
+
+ orig_slots[ZSBT_OLD_UNDO_SLOT] = InvalidUndoPtr;
+ orig_slots[ZSBT_DEAD_UNDO_SLOT] = DeadUndoPtr;
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < orig->t_num_undo_slots; i++)
+ orig_slots[i] = orig_slots_partial[i - ZSBT_FIRST_NORMAL_UNDO_SLOT];
+
+ if (new_slotno == -1)
+ {
+ for (int i = 0; i < orig->t_num_undo_slots; i++)
+ {
+ if (orig_slots[i].counter == undoptr.counter)
+ {
+ /* We can reuse this existing slot for the target. */
+ new_slotno = i;
+ }
+ }
+ }
+ if (new_slotno == -1 && orig->t_num_undo_slots < ZSBT_MAX_ITEM_UNDO_SLOTS)
+ {
+ /* There's a free slot we can use for the target */
+ new_slotno = orig->t_num_undo_slots;
+ }
+
+ if (new_slotno != -1)
+ {
+ int num_slots;
+ Size itemsz;
+ ZSTidArrayItem *newitem;
+ ZSUndoRecPtr *newitem_slots;
+ uint64 *newitem_slotwords;
+ uint64 *newitem_codewords;
+
+ num_slots = orig->t_num_undo_slots;
+ if (new_slotno == orig->t_num_undo_slots)
+ num_slots++;
+
+ /* Simple case */
+ itemsz = SizeOfZSTidArrayItem(orig->t_num_tids, num_slots, orig->t_num_codewords);
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_num_undo_slots = num_slots;
+ newitem->t_num_codewords = orig->t_num_codewords;
+ newitem->t_firsttid = orig->t_firsttid;
+ newitem->t_endtid = orig->t_endtid;
+ newitem->t_num_tids = orig->t_num_tids;
+
+ ZSTidArrayItemDecode(newitem, &newitem_codewords, &newitem_slots, &newitem_slotwords);
+
+ /* copy codewords. They're unmodified. */
+ for (int i = 0; i < orig->t_num_codewords; i++)
+ newitem_codewords[i] = orig_codewords[i];
+
+ /* copy existing slots, followed by new slot, if any */
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < orig->t_num_undo_slots; i++)
+ newitem_slots[i - ZSBT_FIRST_NORMAL_UNDO_SLOT] = orig_slots[i];
+ if (new_slotno == orig->t_num_undo_slots)
+ newitem_slots[new_slotno - ZSBT_FIRST_NORMAL_UNDO_SLOT] = undoptr;
+
+ /* copy slotwords */
+ for (int i = 0; i < ZSBT_NUM_SLOTWORDS(orig->t_num_tids); i++)
+ {
+ uint64 slotword;
+
+ slotword = orig_slotwords[i];
+
+ if (target_idx / ZSBT_SLOTNOS_PER_WORD == i)
+ {
+ /* this slotword contains the target TID */
+ int shift = (target_idx % ZSBT_SLOTNOS_PER_WORD) * ZSBT_ITEM_UNDO_SLOT_BITS;
+ uint64 mask;
+
+ mask = ((UINT64CONST(1) << ZSBT_ITEM_UNDO_SLOT_BITS) - 1) << shift;
+
+ slotword &= ~mask;
+ slotword |= (uint64) new_slotno << shift;
+ }
+
+ newitem_slotwords[i] = slotword;
+ }
+
+ newitems = list_make1(newitem);
+ }
+ else
+ {
+ /* Have to remap the slots. */
+ uint8 *slotnos;
+ ZSUndoRecPtr tmp_slots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ uint8 *tmp_slotnos;
+ int idx;
+
+ slotnos = palloc(orig->t_num_tids * sizeof(uint8));
+ slotwords_to_slotnos(orig_slotwords, orig->t_num_tids, slotnos);
+
+ tmp_slotnos = palloc(orig->t_num_tids * sizeof(uint8));
+
+ /* reconstruct items */
+ idx = 0;
+ newitems = NIL;
+ while (idx < orig->t_num_tids)
+ {
+ ZSTidArrayItem *newitem;
+ int num_remapped;
+ int num_tmp_slots;
+
+ num_remapped = remap_slots(&slotnos[idx], orig->t_num_tids - idx,
+ orig_slots, orig->t_num_undo_slots,
+ target_idx - idx, undoptr,
+ tmp_slots, &num_tmp_slots,
+ tmp_slotnos,
+ recent_oldest_undo);
+
+ deltas[idx] = 0;
+ newitem = build_item(&tids[idx], &deltas[idx], tmp_slotnos, num_remapped,
+ tmp_slots, num_tmp_slots);
+
+ newitems = lappend(newitems, newitem);
+ idx += newitem->t_num_tids;
+ }
+
+ pfree(slotnos);
+ pfree(tmp_slotnos);
+ }
+
+ pfree(deltas);
+ pfree(tids);
+
+ return newitems;
+}
+
+/*
+ * Completely remove a number of TIDs from an item. (for vacuum)
+ */
+List *
+zsbt_tid_item_remove_tids(ZSTidArrayItem *orig, zstid *nexttid, IntegerSet *remove_tids,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ ZSUndoRecPtr *orig_slots_partial;
+ ZSUndoRecPtr orig_slots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ uint64 *orig_slotwords;
+ uint64 *orig_codewords;
+ int total_remain;
+ uint64 *deltas;
+ zstid *tids;
+ int nelements = orig->t_num_tids;
+ List *newitems = NIL;
+ zstid tid;
+ zstid prev_tid;
+ int idx;
+ uint8 *slotnos;
+
+ deltas = palloc(sizeof(uint64) * nelements);
+ tids = palloc(sizeof(zstid) * nelements);
+ slotnos = palloc(sizeof(uint8) * nelements);
+
+ ZSTidArrayItemDecode(orig, &orig_codewords, &orig_slots_partial, &orig_slotwords);
+
+ /* decode all the codewords */
+ simple8b_decode_words(orig_codewords, orig->t_num_codewords, deltas, orig->t_num_tids);
+
+ /* also decode the slotwords */
+ orig_slots[ZSBT_OLD_UNDO_SLOT] = InvalidUndoPtr;
+ orig_slots[ZSBT_DEAD_UNDO_SLOT] = DeadUndoPtr;
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < orig->t_num_undo_slots; i++)
+ orig_slots[i] = orig_slots_partial[i - ZSBT_FIRST_NORMAL_UNDO_SLOT];
+
+ idx = 0;
+ while (idx < orig->t_num_tids)
+ {
+ uint64 slotword = orig_slotwords[idx / ZSBT_SLOTNOS_PER_WORD];
+
+ for (int j = 0; j < ZSBT_SLOTNOS_PER_WORD && idx < orig->t_num_tids; j++)
+ {
+ slotnos[idx++] = slotword & ((UINT64CONST(1) << ZSBT_ITEM_UNDO_SLOT_BITS) - 1);
+ slotword >>= ZSBT_ITEM_UNDO_SLOT_BITS;
+ }
+ }
+
+ /*
+ * Remove all the TIDs we can
+ */
+ total_remain = 0;
+ tid = orig->t_firsttid;
+ prev_tid = tid;
+ for (int i = 0; i < orig->t_num_tids; i++)
+ {
+ uint64 delta = deltas[i];
+
+ tid += delta;
+
+ while (*nexttid < tid)
+ {
+ if (!intset_iterate_next(remove_tids, nexttid))
+ *nexttid = MaxPlusOneZSTid;
+ }
+ if (tid < *nexttid)
+ {
+ deltas[total_remain] = tid - prev_tid;
+ tids[total_remain] = tid;
+ slotnos[total_remain] = slotnos[i];
+ total_remain++;
+ prev_tid = tid;
+ }
+ }
+
+ if (total_remain > 0)
+ {
+ ZSUndoRecPtr tmp_slots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ uint8 *tmp_slotnos;
+ int idx;
+
+ tmp_slotnos = palloc(total_remain * sizeof(uint8));
+
+ /*
+ * Ok, we have the remaining TIDs and undo slot numbers in 'tids' and 'slotnos' now.
+ *
+ * Time to re-encode.
+ */
+ idx = 0;
+ while (idx < total_remain)
+ {
+ ZSTidArrayItem *newitem;
+ int num_remapped;
+ int num_tmp_slots;
+
+ num_remapped = remap_slots(&slotnos[idx], total_remain - idx,
+ orig_slots, orig->t_num_undo_slots,
+ -1, InvalidUndoPtr,
+ tmp_slots, &num_tmp_slots,
+ tmp_slotnos,
+ recent_oldest_undo);
+
+ deltas[idx] = 0;
+ newitem = build_item(&tids[idx], &deltas[idx], tmp_slotnos, num_remapped,
+ tmp_slots, num_tmp_slots);
+
+ newitems = lappend(newitems, newitem);
+ idx += newitem->t_num_tids;
+ }
+ pfree(tmp_slotnos);
+ }
+
+ pfree(deltas);
+ pfree(tids);
+ pfree(slotnos);
+
+ return newitems;
+}
+
+
+/*
+ * Convert an array of deltas to tids.
+ *
+ * Note: the input and output may point to the same array!
+ */
+static void
+deltas_to_tids(zstid firsttid, uint64 *deltas, int num_tids, zstid *tids)
+{
+ zstid prev_tid = firsttid;
+
+ for (int i = 0; i < num_tids; i++)
+ {
+ zstid tid;
+
+ tid = prev_tid + deltas[i];
+ tids[i] = tid;
+ prev_tid = tid;
+ }
+}
+
+/*
+ * Expand the slot numbers packed in slotwords, 2 bits per slotno, into
+ * a regular C array.
+ */
+static void
+slotwords_to_slotnos(uint64 *slotwords, int num_tids, uint8 *slotnos)
+{
+ uint64 *slotword_p;
+ const uint64 mask = (UINT64CONST(1) << ZSBT_ITEM_UNDO_SLOT_BITS) - 1;
+ int i;
+
+ i = 0;
+ slotword_p = slotwords;
+ while (i < num_tids)
+ {
+ uint64 slotword = *(slotword_p++);
+ int j;
+
+ /*
+ * process four elements at a time, for speed. (This is an unrolled
+ * version of the loop below.)
+ */
+ j = 0;
+ while (j < ZSBT_SLOTNOS_PER_WORD && num_tids - i > 3)
+ {
+ slotnos[i] = slotword & mask;
+ slotnos[i + 1] = (slotword >> 2) & mask;
+ slotnos[i + 2] = (slotword >> 4) & mask;
+ slotnos[i + 3] = (slotword >> 6) & mask;
+ slotword = slotword >> 8;
+ i += 4;
+ j += 4;
+ }
+ /* handle the 0-3 elements at the end */
+ while (j < ZSBT_SLOTNOS_PER_WORD && num_tids - i > 0)
+ {
+ slotnos[i] = slotword & mask;
+ slotword = slotword >> 2;
+ i++;
+ j++;
+ }
+ }
+}
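+
+/*
+ * To illustrate the layout unpacked above (a reading aid, assuming
+ * ZSBT_ITEM_UNDO_SLOT_BITS is 2 and hence ZSBT_SLOTNOS_PER_WORD is 32, as
+ * the shifts in this file imply): the slot number of the TID at index 'i'
+ * within an item lives in bits ((i % 32) * 2) .. ((i % 32) * 2 + 1) of
+ * slotword number i / 32. For example, the TID at index 34 is described by
+ * bits 4..5 of the second slotword.
+ */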
+
+/*
+ * Remap undo slots.
+ *
+ * We start with empty UNDO slots, and walk through the items,
+ * filling a slot whenever we encounter an UNDO pointer that we
+ * haven't assigned a slot for yet. If we run out of slots, stop.
+ */
+static int
+remap_slots(uint8 *slotnos, int num_tids,
+ ZSUndoRecPtr *orig_slots, int num_orig_slots,
+ int target_idx, ZSUndoRecPtr target_ptr,
+ ZSUndoRecPtr *new_slots,
+ int *new_num_slots,
+ uint8 *new_slotnos,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ int num_slots;
+ int8 slot_mapping[ZSBT_MAX_ITEM_UNDO_SLOTS + 1];
+ int idx;
+
+ new_slots[ZSBT_OLD_UNDO_SLOT] = InvalidUndoPtr;
+ new_slots[ZSBT_DEAD_UNDO_SLOT] = DeadUndoPtr;
+ num_slots = ZSBT_FIRST_NORMAL_UNDO_SLOT;
+
+ /*
+ * Have to remap the UNDO slots.
+ *
+ * We start with empty UNDO slots, and walk through the items,
+ * filling a slot whenever we encounter an UNDO pointer that we
+ * haven't assigned a slot for yet. If we run out of slots, stop.
+ */
+
+ slot_mapping[ZSBT_OLD_UNDO_SLOT] = ZSBT_OLD_UNDO_SLOT;
+ slot_mapping[ZSBT_DEAD_UNDO_SLOT] = ZSBT_DEAD_UNDO_SLOT;
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < num_orig_slots; i++)
+ slot_mapping[i] = -1;
+
+ for (idx = 0; idx < num_tids; idx++)
+ {
+ int orig_slotno = slotnos[idx];
+ int new_slotno;
+
+ if (idx == target_idx)
+ new_slotno = -1;
+ else
+ new_slotno = slot_mapping[orig_slotno];
+ if (new_slotno == -1)
+ {
+ /* assign new slot for this. */
+ ZSUndoRecPtr this_undoptr;
+
+ if (idx == target_idx)
+ this_undoptr = target_ptr;
+ else
+ this_undoptr = orig_slots[orig_slotno];
+
+ if (this_undoptr.counter == DeadUndoPtr.counter)
+ new_slotno = ZSBT_DEAD_UNDO_SLOT;
+ else if (this_undoptr.counter < recent_oldest_undo.counter)
+ new_slotno = ZSBT_OLD_UNDO_SLOT;
+ else
+ {
+ for (int j = 0; j < num_slots; j++)
+ {
+ if (new_slots[j].counter == this_undoptr.counter)
+ {
+ /* We already had a slot for this undo pointer. Reuse it. */
+ new_slotno = j;
+ break;
+ }
+ }
+ if (new_slotno == -1)
+ {
+ if (num_slots >= ZSBT_MAX_ITEM_UNDO_SLOTS)
+ break; /* out of slots */
+ else
+ {
+ /* assign to free slot */
+ new_slots[num_slots] = this_undoptr;
+ new_slotno = num_slots;
+ num_slots++;
+ }
+ }
+ }
+
+ if (idx != target_idx)
+ slot_mapping[orig_slotno] = new_slotno;
+ }
+
+ new_slotnos[idx] = new_slotno;
+ }
+
+ *new_num_slots = num_slots;
+ return idx;
+}
+
+/*
+ * Construct a ZSTidArrayItem.
+ *
+ * 'tids' is the list of TIDs to be packed in the item.
+ *
+ * 'deltas' contains the differences between consecutive TIDs. They could
+ * be computed from 'tids', but since the caller already has them, we can
+ * save some effort by passing them down.
+ *
+ * 'slots' contains the UNDO slots to be stored. NOTE: it contains the
+ * special 0 and 1 slots too, but they won't be stored in the item that's
+ * created.
+ *
+ * 'slotnos' contains the UNDO slot numbers corresponding to each tuple
+ */
+static ZSTidArrayItem *
+build_item(zstid *tids, uint64 *deltas, uint8 *slotnos, int num_tids,
+ ZSUndoRecPtr *slots, int num_slots)
+{
+ int num_codewords;
+ Size itemsz;
+ ZSTidArrayItem *newitem;
+ int num_encoded;
+ uint64 codewords[ZSBT_MAX_ITEM_CODEWORDS];
+ ZSUndoRecPtr *newitem_slots;
+ uint64 *newitem_slotwords;
+ uint64 *newitem_codewords;
+ uint64 *dst_slotword;
+ int idx;
+
+ /*
+ * Create codewords.
+ */
+ num_codewords = 0;
+ num_encoded = 0;
+ while (num_encoded < num_tids && num_codewords < ZSBT_MAX_ITEM_CODEWORDS)
+ {
+ int n;
+ uint64 codeword;
+
+ codeword = simple8b_encode(&deltas[num_encoded], num_tids - num_encoded, &n);
+ if (n == 0)
+ break;
+
+ num_encoded += n;
+
+ codewords[num_codewords++] = codeword;
+ }
+
+ itemsz = SizeOfZSTidArrayItem(num_encoded, num_slots, num_codewords);
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_num_tids = num_encoded;
+ newitem->t_num_undo_slots = num_slots;
+ newitem->t_num_codewords = num_codewords;
+ newitem->t_firsttid = tids[0];
+ newitem->t_endtid = tids[num_encoded - 1] + 1;
+
+ ZSTidArrayItemDecode(newitem, &newitem_codewords, &newitem_slots, &newitem_slotwords);
+
+ /* Copy in the TID codewords */
+ for (int i = 0; i < num_codewords; i++)
+ newitem_codewords[i] = codewords[i];
+
+ /* Copy in undo slots */
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < num_slots; i++)
+ newitem_slots[i - ZSBT_FIRST_NORMAL_UNDO_SLOT] = slots[i];
+
+ /* Create slotwords */
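+ /*
+ * Each 64-bit slotword packs the slot numbers of ZSBT_SLOTNOS_PER_WORD
+ * consecutive tuples, ZSBT_ITEM_UNDO_SLOT_BITS bits each. For example, with
+ * 2-bit slot numbers, 32 tuples fit in one word, and tuple k's slot number
+ * occupies bits (k % 32) * 2 .. (k % 32) * 2 + 1 of slotword k / 32.
+ */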
+ dst_slotword = newitem_slotwords;
+ idx = 0;
+ while (idx < num_encoded)
+ {
+ uint64 slotword = 0;
+
+ for (int j = 0; j < ZSBT_SLOTNOS_PER_WORD && idx < num_encoded; j++)
+ slotword |= (uint64) slotnos[idx++] << (j * ZSBT_ITEM_UNDO_SLOT_BITS);
+
+ *(dst_slotword++) = slotword;
+ }
+ Assert(dst_slotword == newitem_slotwords + ZSBT_NUM_SLOTWORDS(num_tids));
+
+ return newitem;
+}
+
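+/*
+ * Binary search in a sorted TID array: returns the index of the last element
+ * that is <= 'key', or -1 if 'key' is smaller than all elements.
+ */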
+static int
+binsrch_tid_array(zstid key, zstid *arr, int arr_elems)
+{
+ int low,
+ high,
+ mid;
+
+ low = 0;
+ high = arr_elems;
+ while (high > low)
+ {
+ mid = low + (high - low) / 2;
+
+ if (key >= arr[mid])
+ low = mid + 1;
+ else
+ high = mid;
+ }
+ return low - 1;
+}
diff --git a/src/backend/access/zedstore/zedstore_tidpage.c b/src/backend/access/zedstore/zedstore_tidpage.c
new file mode 100644
index 00000000000..d422849835f
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_tidpage.c
@@ -0,0 +1,1849 @@
+/*
+ * zedstore_tidpage.c
+ * Routines for handling the TID tree.
+ *
+ * A Zedstore table consists of multiple B-trees, one for each attribute. The
+ * functions in this file deal with one B-tree at a time, it is the caller's
+ * responsibility to tie together the scans of each btree.
+ *
+ * Operations:
+ *
+ * - Sequential scan in TID order
+ * - must be efficient with scanning multiple trees in sync
+ *
+ * - random lookups, by TID (for index scan)
+ *
+ * - range scans by TID (for bitmap index scan)
+ *
+ * NOTES:
+ * - Locking order: child before parent, left before right
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tidpage.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "lib/integerset.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/predicate.h"
+#include "storage/procarray.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+/* prototypes for local functions */
+static void zsbt_tid_recompress_replace(Relation rel, Buffer oldbuf, List *items);
+static OffsetNumber zsbt_tid_fetch(Relation rel, zstid tid,
+ Buffer *buf_p, ZSUndoRecPtr *undo_ptr_p, bool *isdead_p);
+static void zsbt_tid_add_items(Relation rel, Buffer buf, List *newitems);
+static void zsbt_tid_replace_item(Relation rel, Buffer buf, OffsetNumber off, List *newitems);
+
+static TM_Result zsbt_tid_update_lock_old(Relation rel, zstid otid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd, ZSUndoRecPtr *prevundoptr_p);
+static void zsbt_tid_update_insert_new(Relation rel, zstid *newtid,
+ TransactionId xid, CommandId cid, ZSUndoRecPtr prevundoptr);
+static void zsbt_tid_mark_old_updated(Relation rel, zstid otid, zstid newtid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot);
+static OffsetNumber zsbt_binsrch_tidpage(zstid key, Page page);
+
+/* ----------------------------------------------------------------
+ * Public interface
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Begin a scan of the btree.
+ */
+void
+zsbt_tid_begin_scan(Relation rel, zstid starttid,
+ zstid endtid, Snapshot snapshot, ZSTidTreeScan *scan)
+{
+ scan->rel = rel;
+ scan->snapshot = snapshot;
+ scan->context = CurrentMemoryContext;
+ scan->starttid = starttid;
+ scan->endtid = endtid;
+ scan->currtid = starttid - 1;
+ memset(&scan->recent_oldest_undo, 0, sizeof(scan->recent_oldest_undo));
+ memset(&scan->array_iter, 0, sizeof(scan->array_iter));
+ scan->array_iter.context = CurrentMemoryContext;
+ scan->array_curr_idx = -1;
+
+ scan->active = true;
+ scan->lastbuf = InvalidBuffer;
+ scan->lastoff = InvalidOffsetNumber;
+
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+}
+
+/*
+ * Reset the 'next' TID in a scan to the given TID.
+ */
+void
+zsbt_tid_reset_scan(ZSTidTreeScan *scan, zstid starttid, zstid endtid, zstid currtid)
+{
+ scan->starttid = starttid;
+ scan->endtid = endtid;
+ scan->currtid = currtid;
+ scan->array_curr_idx = -1;
+}
+
+void
+zsbt_tid_end_scan(ZSTidTreeScan *scan)
+{
+ if (!scan->active)
+ return;
+
+ if (scan->lastbuf != InvalidBuffer)
+ ReleaseBuffer(scan->lastbuf);
+
+ scan->active = false;
+ scan->array_iter.num_tids = 0;
+ scan->array_curr_idx = -1;
+
+ if (scan->array_iter.tids)
+ pfree(scan->array_iter.tids);
+ if (scan->array_iter.tid_undoslotnos)
+ pfree(scan->array_iter.tid_undoslotnos);
+}
+
+/*
+ * Helper function of zsbt_tid_scan_next_array(), to extract Datums from the given
+ * array item into the scan->array_* fields.
+ */
+static void
+zsbt_tid_scan_extract_array(ZSTidTreeScan *scan, ZSTidArrayItem *aitem)
+{
+ bool slots_visible[4];
+ int first;
+ int last;
+ int num_visible_tids;
+ int continue_at;
+
+ zsbt_tid_item_unpack(aitem, &scan->array_iter);
+
+ slots_visible[ZSBT_OLD_UNDO_SLOT] = true;
+ slots_visible[ZSBT_DEAD_UNDO_SLOT] = false;
+
+ scan->array_iter.undoslot_visibility[ZSBT_OLD_UNDO_SLOT] = InvalidUndoSlotVisibility;
+ scan->array_iter.undoslot_visibility[ZSBT_OLD_UNDO_SLOT].xmin = FrozenTransactionId;
+
+ scan->array_iter.undoslot_visibility[ZSBT_DEAD_UNDO_SLOT] = InvalidUndoSlotVisibility;
+
+ for (int i = 2; i < aitem->t_num_undo_slots; i++)
+ {
+ ZSUndoRecPtr undoptr = scan->array_iter.undoslots[i];
+ TransactionId obsoleting_xid;
+
+ scan->array_iter.undoslot_visibility[i] = InvalidUndoSlotVisibility;
+
+ slots_visible[i] = zs_SatisfiesVisibility(scan, undoptr, &obsoleting_xid,
+ NULL, &scan->array_iter.undoslot_visibility[i]);
+ if (scan->serializable && TransactionIdIsValid(obsoleting_xid))
+ CheckForSerializableConflictOut(scan->rel, obsoleting_xid, scan->snapshot);
+ }
+
+ /*
+ * Skip over elements at the beginning and end of the array that
+ * are not within the range we're interested in.
+ */
+ for (first = 0; first < scan->array_iter.num_tids; first++)
+ {
+ if (scan->array_iter.tids[first] >= scan->starttid)
+ break;
+ }
+ for (last = scan->array_iter.num_tids - 1; last >= first; last--)
+ {
+ if (scan->array_iter.tids[last] < scan->endtid)
+ break;
+ }
+
+ /* squeeze out invisible TIDs */
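+ /*
+ * If no leading TIDs were skipped (first == 0), the visible entries at the
+ * start of the array are already in place; we only need to start compacting
+ * at the first invisible entry. Otherwise all entries in range are moved.
+ */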
+ if (first == 0)
+ {
+ int j;
+
+ for (j = 0; j <= last; j++)
+ {
+ if (!slots_visible[scan->array_iter.tid_undoslotnos[j]])
+ break;
+ }
+ num_visible_tids = j;
+ continue_at = j + 1;
+ }
+ else
+ {
+ num_visible_tids = 0;
+ continue_at = first;
+ }
+
+ for (int i = continue_at; i <= last; i++)
+ {
+ /* Is this item visible? */
+ if (slots_visible[scan->array_iter.tid_undoslotnos[i]])
+ {
+ scan->array_iter.tids[num_visible_tids] = scan->array_iter.tids[i];
+ scan->array_iter.tid_undoslotnos[num_visible_tids] = scan->array_iter.tid_undoslotnos[i];
+ num_visible_tids++;
+ }
+ }
+ scan->array_iter.num_tids = num_visible_tids;
+ scan->array_curr_idx = -1;
+}
+
+/*
+ * Advance scan to next batch of TIDs.
+ *
+ * Finds the next TID array item >= scan->nexttid, and decodes it into
+ * scan->array_iter. The values in scan->array_iter are valid until
+ * the next call to this function, zsbt_tid_reset_scan() or
+ * zsbt_tid_end_scan().
+ *
+ * Returns true if there was another item, or false if we reached the
+ * end of the scan.
+ *
+ * This is normally not used directly, see zsbt_tid_scan_next() wrapper.
+ */
+bool
+zsbt_tid_scan_next_array(ZSTidTreeScan *scan, zstid nexttid, ScanDirection direction)
+{
+ if (!scan->active)
+ return false;
+
+ /*
+ * Process items, until we find something that is visible to the snapshot.
+ *
+ * This advances nexttid as it goes.
+ */
+ while (nexttid < scan->endtid && nexttid >= scan->starttid)
+ {
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+ BlockNumber next;
+
+ /*
+ * Find and lock the leaf page containing nexttid.
+ */
+ buf = zsbt_find_and_lock_leaf_containing_tid(scan->rel, ZS_META_ATTRIBUTE_NUM,
+ scan->lastbuf, nexttid,
+ BUFFER_LOCK_SHARE);
+ if (buf != scan->lastbuf)
+ scan->lastoff = InvalidOffsetNumber;
+ scan->lastbuf = buf;
+ if (!BufferIsValid(buf))
+ {
+ /*
+ * Completely empty tree. This should only happen at the beginning of a
+ * scan - a tree cannot go missing after it's been created - but we don't
+ * currently check for that.
+ */
+ break;
+ }
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ Assert(opaque->zs_page_id == ZS_BTREE_PAGE_ID);
+
+ /*
+ * Scan the items on the page, to find the next one that covers
+ * nexttid.
+ *
+ * We check the last offset first, as an optimization
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ if (direction == ForwardScanDirection)
+ {
+ /* Search for the next item >= nexttid */
+ off = FirstOffsetNumber;
+ if (scan->lastoff > FirstOffsetNumber && scan->lastoff <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, scan->lastoff);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (nexttid >= item->t_endtid)
+ off = scan->lastoff + 1;
+ }
+
+ for (; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (nexttid >= item->t_endtid)
+ continue;
+
+ if (item->t_firsttid >= scan->endtid)
+ {
+ nexttid = scan->endtid;
+ break;
+ }
+
+ zsbt_tid_scan_extract_array(scan, item);
+
+ if (scan->array_iter.num_tids > 0)
+ {
+ if (scan->array_iter.tids[scan->array_iter.num_tids - 1] >= nexttid)
+ {
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ scan->lastoff = off;
+ return true;
+ }
+ nexttid = scan->array_iter.tids[scan->array_iter.num_tids - 1] + 1;
+ }
+ }
+ /* No more items on this page. Walk right, if possible */
+ if (nexttid < opaque->zs_hikey)
+ nexttid = opaque->zs_hikey;
+ next = opaque->zs_next;
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (next == InvalidBlockNumber || nexttid >= scan->endtid)
+ {
+ /* reached end of scan */
+ break;
+ }
+
+ scan->lastbuf = ReleaseAndReadBuffer(scan->lastbuf, scan->rel, next);
+ }
+ else
+ {
+ /* Search for the next item <= nexttid */
+ for (off = maxoff; off >= FirstOffsetNumber; off--)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (nexttid < item->t_firsttid)
+ continue;
+
+ if (item->t_endtid < scan->starttid)
+ {
+ nexttid = scan->starttid - 1;
+ break;
+ }
+
+ zsbt_tid_scan_extract_array(scan, item);
+
+ if (scan->array_iter.num_tids > 0)
+ {
+ if (scan->array_iter.tids[0] <= nexttid)
+ {
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ scan->lastoff = off;
+ return true;
+ }
+ nexttid = scan->array_iter.tids[0] - 1;
+ }
+ }
+ /* No more items on this page. Loop back to find the left sibling. */
+ if (nexttid >= opaque->zs_lokey)
+ nexttid = opaque->zs_lokey - 1;
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ if (nexttid < scan->starttid)
+ {
+ /* reached end of scan */
+ break;
+ }
+ scan->lastbuf = InvalidBuffer;
+ }
+ }
+
+ /* Reached end of scan. */
+ scan->array_iter.num_tids = 0;
+ if (BufferIsValid(scan->lastbuf))
+ ReleaseBuffer(scan->lastbuf);
+ scan->lastbuf = InvalidBuffer;
+ scan->lastoff = InvalidOffsetNumber;
+
+ return false;
+}
+
+/*
+ * Get the last tid (plus one) in the tree.
+ */
+zstid
+zsbt_get_last_tid(Relation rel)
+{
+ zstid rightmostkey;
+ zstid tid;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+
+ /* Find the rightmost leaf */
+ rightmostkey = MaxZSTid;
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, rightmostkey, 0, true);
+ if (!BufferIsValid(buf))
+ {
+ return MinZSTid;
+ }
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /*
+ * Look at the last item, for its tid.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ if (maxoff >= FirstOffsetNumber)
+ {
+ ItemId iid = PageGetItemId(page, maxoff);
+ ZSTidArrayItem *lastitem = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ tid = lastitem->t_endtid;
+ }
+ else
+ {
+ tid = opaque->zs_lokey;
+ }
+ UnlockReleaseBuffer(buf);
+
+ return tid;
+}
+
+/*
+ * Insert multiple TIDs.
+ *
+ * Populates the TIDs of the new tuples.
+ *
+ * If a 'tid' in the list is valid, then that TID is used; it had better not be in use
+ * already. If it's invalid, then a new TID is allocated as we see fit. (When inserting the
+ * first column of the row, pass invalid, and for other columns, pass the TID
+ * you got for the first column.)
+ */
+void
+zsbt_tid_multi_insert(Relation rel, zstid *tids, int ntuples,
+ TransactionId xid, CommandId cid, uint32 speculative_token, ZSUndoRecPtr prevundoptr)
+{
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+ zstid insert_target_key;
+ List *newitems;
+ ZSUndoRecPtr undorecptr;
+ zstid endtid;
+ zstid tid;
+ ZSTidArrayItem *lastitem;
+ bool modified_orig;
+
+ /*
+ * Insert to the rightmost leaf.
+ *
+ * TODO: use a Free Space Map to find suitable target.
+ */
+ insert_target_key = MaxZSTid;
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, insert_target_key, 0, false);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Look at the last item, for its tid.
+ *
+ * Assign TIDs for the new tuples.
+ */
+ if (maxoff >= FirstOffsetNumber)
+ {
+ ItemId iid = PageGetItemId(page, maxoff);
+
+ lastitem = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ endtid = lastitem->t_endtid;
+ }
+ else
+ {
+ endtid = opaque->zs_lokey;
+ lastitem = NULL;
+ }
+ tid = endtid;
+
+ /* Form an undo record */
+ if (xid != FrozenTransactionId)
+ {
+ undorecptr = zsundo_create_for_insert(rel, xid, cid, tid, ntuples,
+ speculative_token, prevundoptr);
+ }
+ else
+ {
+ undorecptr = InvalidUndoPtr;
+ }
+
+ /*
+ * Create an item to represent all the TIDs, merging with the last existing
+ * item if possible.
+ */
+ newitems = zsbt_tid_item_add_tids(lastitem, tid, ntuples, undorecptr,
+ &modified_orig);
+
+ /*
+ * Replace the original last item with the new items, or add new items.
+ * This splits the page if necessary.
+ */
+ if (modified_orig)
+ zsbt_tid_replace_item(rel, buf, maxoff, newitems);
+ else
+ zsbt_tid_add_items(rel, buf, newitems);
+ /* zsbt_tid_replace/add_item unlocked 'buf' */
+ ReleaseBuffer(buf);
+
+ list_free_deep(newitems);
+
+ /* Return the TIDs to the caller */
+ for (int i = 0; i < ntuples; i++)
+ tids[i] = tid + i;
+}
+
+TM_Result
+zsbt_tid_delete(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ ZSUndoRecPtr item_undoptr;
+ bool item_isdead;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ ZSUndoRecPtr undorecptr;
+ OffsetNumber off;
+ ZSTidArrayItem *origitem;
+ Buffer buf;
+ Page page;
+ zstid next_tid;
+ List *newitems = NIL;
+
+ /* Find the item to delete. (It could be compressed) */
+ off = zsbt_tid_fetch(rel, tid, &buf, &item_undoptr, &item_isdead);
+ if (!OffsetNumberIsValid(off))
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find tuple to delete with TID (%u, %u) in TID tree",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ }
+ if (item_isdead)
+ {
+ elog(ERROR, "cannot delete tuple that is already marked DEAD (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ }
+
+ if (snapshot)
+ {
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ tid, item_undoptr, LockTupleExclusive,
+ &keep_old_undo_ptr, hufd, &next_tid, NULL);
+ if (result != TM_Ok)
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return result;
+ }
+
+ if (crosscheck != InvalidSnapshot && result == TM_Ok)
+ {
+ /* Perform additional check for transaction-snapshot mode RI updates */
+ /* FIXME: dummy scan */
+ ZSTidTreeScan scan;
+ TransactionId obsoleting_xid;
+ ZSUndoSlotVisibility visi_info;
+
+ memset(&scan, 0, sizeof(scan));
+ scan.rel = rel;
+ scan.snapshot = crosscheck;
+ scan.recent_oldest_undo = recent_oldest_undo;
+
+ if (!zs_SatisfiesVisibility(&scan, item_undoptr, &obsoleting_xid, NULL, &visi_info))
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return TM_Updated;
+ }
+ }
+ }
+
+ /* Create UNDO record. */
+ undorecptr = zsundo_create_for_delete(rel, xid, cid, tid, changingPart,
+ keep_old_undo_ptr ? item_undoptr : InvalidUndoPtr);
+
+ /* Update the tid with the new UNDO pointer. */
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, tid, undorecptr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+
+ return TM_Ok;
+}
+
+void
+zsbt_find_latest_tid(Relation rel, zstid *tid, Snapshot snapshot)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ ZSUndoRecPtr item_undoptr;
+ bool item_isdead;
+ int idx;
+ Buffer buf;
+ /* Using just the meta-attribute, we can follow the update chain */
+ zstid curr_tid = *tid;
+
+ for(;;)
+ {
+ zstid next_tid = InvalidZSTid;
+ if (curr_tid == InvalidZSTid)
+ break;
+
+ /* Find the item */
+ idx = zsbt_tid_fetch(rel, curr_tid, &buf, &item_undoptr, &item_isdead);
+ if (!OffsetNumberIsValid(idx) || item_isdead)
+ break;
+
+ if (snapshot)
+ {
+ /* FIXME: dummy scan */
+ ZSTidTreeScan scan;
+ TransactionId obsoleting_xid;
+ ZSUndoSlotVisibility visi_info;
+
+ memset(&scan, 0, sizeof(scan));
+ scan.rel = rel;
+ scan.snapshot = snapshot;
+ scan.recent_oldest_undo = recent_oldest_undo;
+
+ if (zs_SatisfiesVisibility(&scan, item_undoptr,
+ &obsoleting_xid, &next_tid, &visi_info))
+ {
+ *tid = curr_tid;
+ }
+
+ curr_tid = next_tid;
+ UnlockReleaseBuffer(buf);
+ }
+ }
+}
+
+/*
+ * A new TID is allocated as we see fit, and returned to the caller. This
+ * function is only called for the META attribute btree. Data columns will use the
+ * returned tid to insert new items.
+ */
+TM_Result
+zsbt_tid_update(Relation rel, zstid otid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd,
+ zstid *newtid_p)
+{
+ TM_Result result;
+ ZSUndoRecPtr prevundoptr;
+
+ /*
+ * This is currently only used on the meta-attribute. The other attributes
+ * don't need to carry visibility information, so the caller just inserts
+ * the new values with (multi_)insert() instead. This will change once we
+ * start doing the equivalent of HOT updates, where the TID doesn't change.
+ */
+ Assert(*newtid_p == InvalidZSTid);
+
+ /*
+ * Find and lock the old item.
+ *
+ * TODO: If there's free TID space left on the same page, we should keep the
+ * buffer locked, and use the same page for the new tuple.
+ */
+ result = zsbt_tid_update_lock_old(rel, otid,
+ xid, cid, key_update, snapshot,
+ crosscheck, wait, hufd, &prevundoptr);
+
+ if (result != TM_Ok)
+ return result;
+
+ /* insert new version */
+ zsbt_tid_update_insert_new(rel, newtid_p, xid, cid, prevundoptr);
+
+ /* update the old item with the "t_ctid pointer" for the new item */
+ zsbt_tid_mark_old_updated(rel, otid, *newtid_p, xid, cid, key_update, snapshot);
+
+ return TM_Ok;
+}
+
+/*
+ * Subroutine of zsbt_update(): locks the old item for update.
+ */
+static TM_Result
+zsbt_tid_update_lock_old(Relation rel, zstid otid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd, ZSUndoRecPtr *prevundoptr_p)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ Buffer buf;
+ ZSUndoRecPtr olditem_undoptr;
+ bool olditem_isdead;
+ int idx;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ zstid next_tid;
+
+ /*
+ * Find the item to update.
+ */
+ idx = zsbt_tid_fetch(rel, otid, &buf, &olditem_undoptr, &olditem_isdead);
+ if (!OffsetNumberIsValid(idx) || olditem_isdead)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find old tuple to update with TID (%u, %u) in TID tree",
+ ZSTidGetBlockNumber(otid), ZSTidGetOffsetNumber(otid));
+ }
+ *prevundoptr_p = olditem_undoptr;
+
+ /*
+ * Is it visible to us?
+ */
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ otid, olditem_undoptr,
+ key_update ? LockTupleExclusive : LockTupleNoKeyExclusive,
+ &keep_old_undo_ptr, hufd, &next_tid, NULL);
+ if (result != TM_Ok)
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return result;
+ }
+
+ if (crosscheck != InvalidSnapshot && result == TM_Ok)
+ {
+ /* Perform additional check for transaction-snapshot mode RI updates */
+ /* FIXME: dummy scan */
+ ZSTidTreeScan scan;
+ TransactionId obsoleting_xid;
+ ZSUndoSlotVisibility visi_info;
+
+ memset(&scan, 0, sizeof(scan));
+ scan.rel = rel;
+ scan.snapshot = crosscheck;
+ scan.recent_oldest_undo = recent_oldest_undo;
+
+ if (!zs_SatisfiesVisibility(&scan, olditem_undoptr, &obsoleting_xid, NULL, &visi_info))
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return TM_Updated;
+ }
+ }
+
+ /*
+ * TODO: tuple-locking not implemented. Pray that there is no competing
+ * concurrent update!
+ */
+
+ UnlockReleaseBuffer(buf);
+
+ return TM_Ok;
+}
+
+/*
+ * Subroutine of zsbt_update(): inserts the new, updated, item.
+ */
+static void
+zsbt_tid_update_insert_new(Relation rel,
+ zstid *newtid,
+ TransactionId xid, CommandId cid, ZSUndoRecPtr prevundoptr)
+{
+ zsbt_tid_multi_insert(rel, newtid, 1, xid, cid, INVALID_SPECULATIVE_TOKEN, prevundoptr);
+}
+
+/*
+ * Subroutine of zsbt_update(): mark old item as updated.
+ */
+static void
+zsbt_tid_mark_old_updated(Relation rel, zstid otid, zstid newtid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ Buffer buf;
+ Page page;
+ ZSUndoRecPtr olditem_undoptr;
+ bool olditem_isdead;
+ OffsetNumber off;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ TM_FailureData tmfd;
+ ZSUndoRecPtr undorecptr;
+ List *newitems;
+ ZSTidArrayItem *origitem;
+ zstid next_tid;
+
+ /*
+ * Find the item to delete. It could be part of a compressed item,
+ * we let zsbt_fetch() handle that.
+ */
+ off = zsbt_tid_fetch(rel, otid, &buf, &olditem_undoptr, &olditem_isdead);
+ if (!OffsetNumberIsValid(off) || olditem_isdead)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find old tuple to update with TID (%u, %u) in TID tree",
+ ZSTidGetBlockNumber(otid), ZSTidGetOffsetNumber(otid));
+ }
+
+ /*
+ * Is it visible to us?
+ */
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ otid, olditem_undoptr,
+ key_update ? LockTupleExclusive : LockTupleNoKeyExclusive,
+ &keep_old_undo_ptr, &tmfd, &next_tid, NULL);
+ if (result != TM_Ok)
+ {
+ UnlockReleaseBuffer(buf);
+ elog(ERROR, "tuple concurrently updated - not implemented");
+ }
+
+ /* Create UNDO record. */
+ {
+ ZSUndoRec_Update undorec;
+
+ undorec.rec.size = sizeof(ZSUndoRec_Update);
+ undorec.rec.type = ZSUNDO_TYPE_UPDATE;
+ undorec.rec.xid = xid;
+ undorec.rec.cid = cid;
+ undorec.oldtid = otid;
+ undorec.newtid = newtid;
+ if (keep_old_undo_ptr)
+ undorec.rec.prevundorec = olditem_undoptr;
+ else
+ undorec.rec.prevundorec = InvalidUndoPtr;
+ undorec.key_update = key_update;
+
+ undorecptr = zsundo_insert(rel, &undorec.rec);
+ }
+
+ /* Replace the ZSTidArrayItem with one with the updated undo pointer. */
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, otid, undorecptr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+}
+
+TM_Result
+zsbt_tid_lock(Relation rel, zstid tid, TransactionId xid, CommandId cid,
+ LockTupleMode mode, bool follow_updates, Snapshot snapshot,
+ TM_FailureData *hufd, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ Buffer buf;
+ Page page;
+ ZSUndoRecPtr item_undoptr;
+ bool item_isdead;
+ OffsetNumber off;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ ZSUndoRecPtr undorecptr;
+ List *newitems;
+ ZSTidArrayItem *origitem;
+
+ *next_tid = tid;
+
+ off = zsbt_tid_fetch(rel, tid, &buf, &item_undoptr, &item_isdead);
+ if (!OffsetNumberIsValid(off) || item_isdead)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find tuple to lock with TID (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ }
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ tid, item_undoptr, mode,
+ &keep_old_undo_ptr, hufd, next_tid, visi_info);
+
+ if (result != TM_Ok)
+ {
+ if (result == TM_Invisible && follow_updates &&
+ TransactionIdIsInProgress(visi_info->xmin))
+ {
+ /*
+ * With follow_updates, the tuple needs to be locked even though it is
+ * not yet visible to us.
+ */
+ }
+ else
+ {
+ UnlockReleaseBuffer(buf);
+ return result;
+ }
+ }
+
+ /* Create UNDO record. */
+ {
+ ZSUndoRec_TupleLock undorec;
+
+ undorec.rec.size = sizeof(ZSUndoRec_TupleLock);
+ undorec.rec.type = ZSUNDO_TYPE_TUPLE_LOCK;
+ undorec.rec.xid = xid;
+ undorec.rec.cid = cid;
+ undorec.tid = tid;
+ undorec.lockmode = mode;
+ if (keep_old_undo_ptr)
+ undorec.rec.prevundorec = item_undoptr;
+ else
+ undorec.rec.prevundorec = InvalidUndoPtr;
+
+ undorecptr = zsundo_insert(rel, &undorec.rec);
+ }
+
+ /* Replace the item with an identical one, but with updated undo pointer. */
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, tid, undorecptr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+ return TM_Ok;
+}
+
+/*
+ * Collect all TIDs marked as dead in the TID tree.
+ *
+ * This is used during VACUUM.
+ */
+IntegerSet *
+zsbt_collect_dead_tids(Relation rel, zstid starttid, zstid *endtid)
+{
+ Buffer buf = InvalidBuffer;
+ IntegerSet *result;
+ ZSBtreePageOpaque *opaque;
+ zstid nexttid;
+ BlockNumber nextblock;
+ ZSTidItemIterator iter;
+
+ memset(&iter, 0, sizeof(ZSTidItemIterator));
+ iter.context = CurrentMemoryContext;
+
+ result = intset_create();
+
+ nexttid = starttid;
+ nextblock = InvalidBlockNumber;
+ for (;;)
+ {
+ Page page;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ if (nextblock != InvalidBlockNumber)
+ {
+ buf = ReleaseAndReadBuffer(buf, rel, nextblock);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+
+ if (!zsbt_page_is_expected(rel, ZS_META_ATTRIBUTE_NUM, nexttid, 0, buf))
+ {
+ UnlockReleaseBuffer(buf);
+ buf = InvalidBuffer;
+ }
+ }
+
+ if (!BufferIsValid(buf))
+ {
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, nexttid, 0, true);
+ if (!BufferIsValid(buf))
+ return result;
+ page = BufferGetPage(buf);
+ }
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ zsbt_tid_item_unpack(item, &iter);
+
+ for (int j = 0; j < iter.num_tids; j++)
+ {
+ if (iter.tid_undoslotnos[j] == ZSBT_DEAD_UNDO_SLOT)
+ intset_add_member(result, iter.tids[j]);
+ }
+ }
+
+ opaque = ZSBtreePageGetOpaque(page);
+ nexttid = opaque->zs_hikey;
+ nextblock = opaque->zs_next;
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (nexttid == MaxPlusOneZSTid)
+ {
+ Assert(nextblock == InvalidBlockNumber);
+ break;
+ }
+
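+ /*
+ * maintenance_work_mem is given in kilobytes. Once the in-memory TID set
+ * exceeds it, stop collecting; the caller can process this batch and resume
+ * the scan from *endtid.
+ */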
+ if (intset_memory_usage(result) > (uint64) maintenance_work_mem * 1024)
+ break;
+ }
+
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+
+ *endtid = nexttid;
+ return result;
+}
+
+/*
+ * Mark item with given TID as dead.
+ *
+ * This is used when UNDO actions are performed, after a transaction becomes
+ * old enough.
+ */
+void
+zsbt_tid_mark_dead(Relation rel, zstid tid, ZSUndoRecPtr recent_oldest_undo)
+{
+ Buffer buf;
+ Page page;
+ ZSUndoRecPtr item_undoptr;
+ OffsetNumber off;
+ ZSTidArrayItem *origitem;
+ List *newitems;
+ bool isdead;
+
+ /* Find the item to delete. (It could be compressed) */
+ off = zsbt_tid_fetch(rel, tid, &buf, &item_undoptr, &isdead);
+ if (!OffsetNumberIsValid(off))
+ {
+ elog(WARNING, "could not find tuple to mark dead with TID (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ return;
+ }
+
+ /* Mark the TID as DEAD. (Unless it's already dead) */
+ if (isdead)
+ {
+ UnlockReleaseBuffer(buf);
+ return;
+ }
+
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, tid, DeadUndoPtr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+}
+
+
+/*
+ * Remove items for the given TIDs from the TID tree.
+ *
+ * This is used during VACUUM.
+ */
+void
+zsbt_tid_remove(Relation rel, IntegerSet *tids)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ zstid nexttid;
+ MemoryContext oldcontext;
+ MemoryContext tmpcontext;
+
+ tmpcontext = AllocSetContextCreate(CurrentMemoryContext,
+ "ZedstoreAMVacuumContext",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(tmpcontext);
+
+ intset_begin_iterate(tids);
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+
+ while (nexttid < MaxPlusOneZSTid)
+ {
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ List *newitems;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ /*
+ * Find the leaf page containing the next item to remove
+ */
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, nexttid, 0, false);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /*
+ * Rewrite the items on the page, removing all TIDs that need to be
+ * removed from the page.
+ */
+ newitems = NIL;
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ while (nexttid < item->t_firsttid)
+ {
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+ }
+
+ if (nexttid < item->t_endtid)
+ {
+ List *newitemsx = zsbt_tid_item_remove_tids(item, &nexttid, tids,
+ recent_oldest_undo);
+
+ newitems = list_concat(newitems, newitemsx);
+ }
+ else
+ {
+ /* keep this item unmodified */
+ newitems = lappend(newitems, item);
+ }
+ }
+
+ while (nexttid < opaque->zs_hikey)
+ {
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+ }
+
+ /* Pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ if (newitems)
+ {
+ zsbt_tid_recompress_replace(rel, buf, newitems);
+ }
+ else
+ {
+ zs_split_stack *stack;
+
+ stack = zsbt_unlink_page(rel, ZS_META_ATTRIBUTE_NUM, buf, 0);
+
+ if (!stack)
+ {
+ /* failed. */
+ Page newpage = PageGetTempPageCopySpecial(BufferGetPage(buf));
+
+ stack = zs_new_split_stack_entry(buf, newpage);
+ }
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack);
+ }
+
+ ReleaseBuffer(buf);
+
+ MemoryContextReset(tmpcontext);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(tmpcontext);
+}
+
+/*
+ * Clear an item's UNDO pointer.
+ *
+ * This is used during VACUUM, to clear out aborted deletions.
+ */
+void
+zsbt_tid_undo_deletion(Relation rel, zstid tid, ZSUndoRecPtr undoptr,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ Buffer buf;
+ Page page;
+ ZSUndoRecPtr item_undoptr;
+ bool item_isdead;
+ OffsetNumber off;
+
+ /* Find the item to delete. (It could be compressed) */
+ off = zsbt_tid_fetch(rel, tid, &buf, &item_undoptr, &item_isdead);
+ if (!OffsetNumberIsValid(off))
+ {
+ elog(WARNING, "could not find aborted tuple to remove with TID (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ return;
+ }
+
+ if (ZSUndoRecPtrEquals(item_undoptr, undoptr))
+ {
+ ZSTidArrayItem *origitem;
+ List *newitems;
+
+ /* FIXME: we're overwriting the undo pointer with 'invalid', meaning the
+ * tuple becomes visible to everyone. That doesn't seem right. Shouldn't
+ * we restore the previous undo pointer, if the insertion was not yet
+ * visible to everyone?
+ */
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, tid, InvalidUndoPtr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+ }
+ else
+ {
+ Assert(item_isdead ||
+ item_undoptr.counter > undoptr.counter ||
+ !IsZSUndoRecPtrValid(&item_undoptr));
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+/* ----------------------------------------------------------------
+ * Internal routines
+ * ----------------------------------------------------------------
+ */
+
+void
+zsbt_tid_clear_speculative_token(Relation rel, zstid tid, uint32 spectoken, bool forcomplete)
+{
+ Buffer buf;
+ ZSUndoRecPtr item_undoptr;
+ bool item_isdead;
+ bool found;
+
+ found = zsbt_tid_fetch(rel, tid, &buf, &item_undoptr, &item_isdead);
+ if (!found || item_isdead)
+ elog(ERROR, "couldn't find item for meta column for inserted tuple with TID (%u, %u) in rel %s",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid), rel->rd_rel->relname.data);
+
+ zsundo_clear_speculative_token(rel, item_undoptr);
+
+ UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Fetch the item with the given TID. The page containing the item is kept locked, and
+ * returned to the caller in *buf_p. This is used to locate a tuple for updating
+ * or deleting it.
+ */
+static OffsetNumber
+zsbt_tid_fetch(Relation rel, zstid tid, Buffer *buf_p, ZSUndoRecPtr *undoptr_p, bool *isdead_p)
+{
+ Buffer buf;
+ Page page;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, tid, 0, false);
+ if (buf == InvalidBuffer)
+ {
+ *buf_p = InvalidBuffer;
+ *undoptr_p = InvalidUndoPtr;
+ return InvalidOffsetNumber;
+ }
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /* Find the item on the page that covers the target TID */
+ off = zsbt_binsrch_tidpage(tid, page);
+ if (off >= FirstOffsetNumber && off <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (tid < item->t_endtid)
+ {
+ ZSTidItemIterator iter;
+
+ memset(&iter, 0, sizeof(ZSTidItemIterator));
+ iter.context = CurrentMemoryContext;
+
+ zsbt_tid_item_unpack(item, &iter);
+
+ /* TODO: could do binary search here. Better yet, integrate the
+ * unpack function with the callers */
+ for (int i = 0; i < iter.num_tids; i++)
+ {
+ if (iter.tids[i] == tid)
+ {
+ int slotno = iter.tid_undoslotnos[i];
+ ZSUndoRecPtr undoptr = iter.undoslots[slotno];
+
+ *isdead_p = (slotno == ZSBT_DEAD_UNDO_SLOT);
+ *undoptr_p = undoptr;
+ *buf_p = buf;
+
+ if (iter.tids)
+ pfree(iter.tids);
+ if (iter.tid_undoslotnos)
+ pfree(iter.tid_undoslotnos);
+
+ return off;
+ }
+ }
+
+ if (iter.tids)
+ pfree(iter.tids);
+ if (iter.tid_undoslotnos)
+ pfree(iter.tid_undoslotnos);
+ }
+ }
+
+ /* Not found. Release the buffer before reporting failure. */
+ UnlockReleaseBuffer(buf);
+ *buf_p = InvalidBuffer;
+ *undoptr_p = InvalidUndoPtr;
+ return InvalidOffsetNumber;
+}
+
+/*
+ * This helper function is used to implement INSERT.
+ *
+ * The items in 'newitems' are added to the page, to the correct position.
+ * FIXME: Actually, they're always just added to the end of the page, and that
+ * better be the correct position.
+ *
+ * This function handles splitting the page if needed.
+ */
+static void
+zsbt_tid_add_items(Relation rel, Buffer buf, List *newitems)
+{
+ Page page = BufferGetPage(buf);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ OffsetNumber off;
+ Size newitemsize;
+ ListCell *lc;
+
+ newitemsize = 0;
+ foreach(lc, newitems)
+ {
+ ZSTidArrayItem *item = (ZSTidArrayItem *) lfirst(lc);
+
+ newitemsize += sizeof(ItemIdData) + item->t_size;
+ }
+
+ if (newitemsize <= PageGetExactFreeSpace(page))
+ {
+ /* The new items fit on the page. Add them. */
+
+ START_CRIT_SECTION();
+
+ off = maxoff;
+ foreach(lc, newitems)
+ {
+ ZSTidArrayItem *item = (ZSTidArrayItem *) lfirst(lc);
+
+ off++;
+ if (!PageAddItem(page, (Item) item, item->t_size, off, true, false))
+ elog(ERROR, "could not add item to TID tree page");
+ }
+
+ MarkBufferDirty(buf);
+
+ /* TODO: WAL-log */
+
+ END_CRIT_SECTION();
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ List *items = NIL;
+
+ /* Collect all the old items on the page to a list */
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ /*
+ * Get the next item to process from the page.
+ */
+ items = lappend(items, item);
+ }
+
+ /* Add any new items to the end */
+ foreach (lc, newitems)
+ {
+ items = lappend(items, lfirst(lc));
+ }
+
+ /* Now pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ if (items)
+ {
+ zsbt_tid_recompress_replace(rel, buf, items);
+ }
+ else
+ {
+ zs_split_stack *stack;
+
+ stack = zsbt_unlink_page(rel, ZS_META_ATTRIBUTE_NUM, buf, 0);
+
+ if (!stack)
+ {
+ /* failed. */
+ Page newpage = PageGetTempPageCopySpecial(BufferGetPage(buf));
+
+ stack = zs_new_split_stack_entry(buf, newpage);
+ }
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack);
+ }
+
+ list_free(items);
+ }
+}
+
+
+/*
+ * This helper function is used to implement INSERT, UPDATE and DELETE.
+ *
+ * If 'newitems' is not empty, the items in the list are added to the page,
+ * to the correct position. FIXME: Actually, they're always just added to
+ * the end of the page, and that better be the correct position.
+ *
+ * This function handles decompressing and recompressing items, and splitting
+ * the page if needed.
+ */
+static void
+zsbt_tid_replace_item(Relation rel, Buffer buf, OffsetNumber targetoff, List *newitems)
+{
+ Page page = BufferGetPage(buf);
+ ItemId iid;
+ ZSTidArrayItem *olditem;
+ ListCell *lc;
+ ssize_t sizediff;
+
+ /*
+ * Find the item to replace, at the given offset.
+ */
+ if (targetoff < FirstOffsetNumber || targetoff > PageGetMaxOffsetNumber(page))
+ elog(ERROR, "could not find item at off %d to replace", targetoff);
+ iid = PageGetItemId(page, targetoff);
+ olditem = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ /* Calculate how much free space we'll need */
+ sizediff = -(olditem->t_size + sizeof(ItemIdData));
+ foreach(lc, newitems)
+ {
+ ZSTidArrayItem *newitem = (ZSTidArrayItem *) lfirst(lc);
+
+ sizediff += newitem->t_size + sizeof(ItemIdData);
+ }
+
+ /* Can we fit them? */
+ if (sizediff <= PageGetExactFreeSpace(page))
+ {
+ ZSTidArrayItem *newitem;
+ OffsetNumber off;
+
+ START_CRIT_SECTION();
+
+ /* Remove existing item, and add new ones */
+ if (newitems == NIL)
+ PageIndexTupleDelete(page, targetoff);
+ else
+ {
+ lc = list_head(newitems);
+ newitem = (ZSTidArrayItem *) lfirst(lc);
+ if (!PageIndexTupleOverwrite(page, targetoff, (Item) newitem, newitem->t_size))
+ elog(ERROR, "could not replace item in TID tree page at off %d", targetoff);
+ lc = lnext(newitems, lc);
+
+ off = targetoff + 1;
+ for (; lc != NULL; lc = lnext(newitems, lc))
+ {
+ newitem = (ZSTidArrayItem *) lfirst(lc);
+ if (!PageAddItem(page, (Item) newitem, newitem->t_size, off, false, false))
+ elog(ERROR, "could not add item in TID tree page at off %d", off);
+ off++;
+ }
+ }
+
+ MarkBufferDirty(buf);
+ /* TODO: WAL-log */
+ END_CRIT_SECTION();
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ zstid lasttid = 0;
+ OffsetNumber off;
+
+ for (off = FirstOffsetNumber; off <= PageGetMaxOffsetNumber(page); off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ Assert(item->t_firsttid >= lasttid);
+ lasttid = item->t_endtid;
+ }
+ }
+#endif
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ /* Have to split the page. */
+ List *items = NIL;
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ OffsetNumber off;
+
+ /*
+ * Construct a List that contains all the items in the right order, and
+ * let zsbt_tid_recompress_page() do the heavy lifting to fit them on
+ * pages.
+ */
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (off == targetoff)
+ {
+ foreach(lc, newitems)
+ {
+ items = lappend(items, (ZSTidArrayItem *) lfirst(lc));
+ }
+ }
+ else
+ items = lappend(items, item);
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ zstid endtid = 0;
+ ListCell *lc;
+
+ foreach (lc, items)
+ {
+ ZSTidArrayItem *i = (ZSTidArrayItem *) lfirst(lc);
+
+ Assert(i->t_firsttid >= endtid);
+ Assert(i->t_endtid > i->t_firsttid);
+ endtid = i->t_endtid;
+ }
+ }
+#endif
+
+ /* Pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ if (items)
+ {
+ zsbt_tid_recompress_replace(rel, buf, items);
+ }
+ else
+ {
+ zs_split_stack *stack;
+
+ stack = zsbt_unlink_page(rel, ZS_META_ATTRIBUTE_NUM, buf, 0);
+
+ if (!stack)
+ {
+ /* failed. */
+ Page newpage = PageGetTempPageCopySpecial(BufferGetPage(buf));
+
+ stack = zs_new_split_stack_entry(buf, newpage);
+ }
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack);
+ }
+
+ list_free(items);
+ }
+}
+
+/*
+ * Recompressor routines
+ */
+typedef struct
+{
+ Page currpage;
+
+ /* first page writes over the old buffer, subsequent pages get newly-allocated buffers */
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ int num_pages;
+ int free_space_per_page;
+
+ zstid hikey;
+} zsbt_tid_recompress_context;
+
+static void
+zsbt_tid_recompress_newpage(zsbt_tid_recompress_context *cxt, zstid nexttid, int flags)
+{
+ Page newpage;
+ ZSBtreePageOpaque *newopaque;
+ zs_split_stack *stack;
+
+ if (cxt->currpage)
+ {
+ /* set the last tid on previous page */
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(cxt->currpage);
+
+ oldopaque->zs_hikey = nexttid;
+ }
+
+ newpage = (Page) palloc(BLCKSZ);
+ PageInit(newpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+
+ stack = zs_new_split_stack_entry(InvalidBuffer, /* will be assigned later */
+ newpage);
+ if (cxt->stack_tail)
+ cxt->stack_tail->next = stack;
+ else
+ cxt->stack_head = stack;
+ cxt->stack_tail = stack;
+
+ cxt->currpage = newpage;
+
+ newopaque = ZSBtreePageGetOpaque(newpage);
+ newopaque->zs_attno = ZS_META_ATTRIBUTE_NUM;
+ newopaque->zs_next = InvalidBlockNumber; /* filled in later */
+ newopaque->zs_lokey = nexttid;
+ newopaque->zs_hikey = cxt->hikey; /* overwritten later, if this is not last page */
+ newopaque->zs_level = 0;
+ newopaque->zs_flags = flags;
+ newopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+}
+
+static void
+zsbt_tid_recompress_add_to_page(zsbt_tid_recompress_context *cxt, ZSTidArrayItem *item)
+{
+ OffsetNumber maxoff;
+ Size freespc;
+
+ freespc = PageGetExactFreeSpace(cxt->currpage);
+ if (freespc < item->t_size + sizeof(ItemIdData) ||
+ freespc < cxt->free_space_per_page)
+ {
+ zsbt_tid_recompress_newpage(cxt, item->t_firsttid, 0);
+ }
+
+ maxoff = PageGetMaxOffsetNumber(cxt->currpage);
+ if (!PageAddItem(cxt->currpage, (Item) item, item->t_size, maxoff + 1, true, false))
+ elog(ERROR, "could not add item to TID tree page");
+}
+
+/*
+ * Subroutine of zsbt_tid_recompress_replace. Compute how much space the
+ * items will take, and compute how many pages will be needed for them, and
+ * decide how to distribute any free space that's left over among the
+ * pages.
+ *
+ * Like in B-tree indexes, we aim for 50/50 splits, except for the
+ * rightmost page where we aim for 90/10, so that most of the free space is
+ * left to the end of the index, where it's useful for new inserts. The
+ * 90/10 splits ensure that we don't waste too much space on a table
+ * that's loaded at the end, and never updated.
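+ *
+ * A rough worked example (assuming an 8 kB block with roughly 8150 usable
+ * bytes on an empty page): if the items total 20000 bytes, num_pages = 3
+ * and the total slack is 3 * 8150 - 20000 = 4450 bytes. For a rightmost
+ * page, 10% of that slack is reserved on the earlier pages (about 220
+ * bytes on each of the first two), so most of the remaining free space
+ * ends up on the last page.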
+ */
+static void
+zsbt_tid_recompress_picksplit(zsbt_tid_recompress_context *cxt, List *items)
+{
+ size_t total_sz;
+ int num_pages;
+ int space_on_empty_page;
+ Size free_space_per_page;
+ ListCell *lc;
+
+ space_on_empty_page = BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - MAXALIGN(sizeof(ZSBtreePageOpaque));
+
+ /* Compute total space needed for all the items. */
+ total_sz = 0;
+ foreach(lc, items)
+ {
+ ZSTidArrayItem *item = lfirst(lc);
+
+ total_sz += sizeof(ItemIdData) + item->t_size;
+ }
+
+ /* How many pages will we need for them? */
+ num_pages = (total_sz + space_on_empty_page - 1) / space_on_empty_page;
+
+ /* If everything fits on one page, don't split */
+ if (num_pages == 1)
+ {
+ free_space_per_page = 0;
+ }
+ /* If this is the rightmost page, do a 90/10 split */
+ else if (cxt->hikey == MaxPlusOneZSTid)
+ {
+ /*
+ * What does 90/10 mean if we have to use more than two pages? It means
+ * that 10% of the items go to the last page, and 90% are distributed to
+ * all the others.
+ */
+ double total_free_space;
+
+ total_free_space = space_on_empty_page * num_pages - total_sz;
+
+ free_space_per_page = total_free_space * 0.1 / (num_pages - 1);
+ }
+ /* Otherwise, aim for an even 50/50 split */
+ else
+ {
+ free_space_per_page = (space_on_empty_page * num_pages - total_sz) / num_pages;
+ }
+
+ cxt->num_pages = num_pages;
+ cxt->free_space_per_page = free_space_per_page;
+}
+
+/*
+ * Rewrite a leaf page, with given 'items' as the new content.
+ *
+ * If there are any uncompressed items in the list, we try to compress them.
+ * Any already-compressed items are added as is.
+ *
+ * If the items no longer fit on the page, then the page is split. It is
+ * entirely possible that they don't fit even on two pages; we split the page
+ * into as many pages as needed. Hopefully not more than a few pages, though,
+ * because otherwise you might hit limits on the number of buffer pins (with
+ * tiny shared_buffers).
+ *
+ * On entry, 'oldbuf' must be pinned and exclusive-locked. On exit, the lock
+ * is released, but it's still pinned.
+ *
+ * TODO: Try to combine single items, and existing array-items, into new array
+ * items.
+ */
+static void
+zsbt_tid_recompress_replace(Relation rel, Buffer oldbuf, List *items)
+{
+ ListCell *lc;
+ zsbt_tid_recompress_context cxt;
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(BufferGetPage(oldbuf));
+ BlockNumber orignextblk;
+ zs_split_stack *stack;
+ List *downlinks = NIL;
+
+ orignextblk = oldopaque->zs_next;
+
+ cxt.currpage = NULL;
+ cxt.stack_head = cxt.stack_tail = NULL;
+ cxt.hikey = oldopaque->zs_hikey;
+
+ zsbt_tid_recompress_picksplit(&cxt, items);
+ zsbt_tid_recompress_newpage(&cxt, oldopaque->zs_lokey, (oldopaque->zs_flags & ZSBT_ROOT));
+
+ foreach(lc, items)
+ {
+ ZSTidArrayItem *item = (ZSTidArrayItem *) lfirst(lc);
+
+ zsbt_tid_recompress_add_to_page(&cxt, item);
+ }
+
+ /*
+ * Ok, we now have a list of pages, to replace the original page, as private
+ * in-memory copies. Allocate buffers for them, and write them out.
+ *
+ * allocate all the pages before entering critical section, so that
+ * out-of-disk-space doesn't lead to PANIC
+ */
+ stack = cxt.stack_head;
+ Assert(stack->buf == InvalidBuffer);
+ stack->buf = oldbuf;
+ while (stack->next)
+ {
+ Page thispage = stack->page;
+ ZSBtreePageOpaque *thisopaque = ZSBtreePageGetOpaque(thispage);
+ ZSBtreeInternalPageItem *downlink;
+ Buffer nextbuf;
+
+ Assert(stack->next->buf == InvalidBuffer);
+
+ nextbuf = zspage_getnewbuf(rel, InvalidBuffer);
+ stack->next->buf = nextbuf;
+
+ thisopaque->zs_next = BufferGetBlockNumber(nextbuf);
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = thisopaque->zs_hikey;
+ downlink->childblk = BufferGetBlockNumber(nextbuf);
+ downlinks = lappend(downlinks, downlink);
+
+ stack = stack->next;
+ }
+ /* last one in the chain */
+ ZSBtreePageGetOpaque(stack->page)->zs_next = orignextblk;
+
+ /*
+ * zsbt_tid_recompress_picksplit() calculated that we'd need
+ * 'cxt.num_pages' pages. Check that it matches with how many pages we
+ * actually created.
+ */
+ Assert(list_length(downlinks) + 1 == cxt.num_pages);
+
+ /* If we had to split, insert downlinks for the new pages. */
+ if (cxt.stack_head->next)
+ {
+ oldopaque = ZSBtreePageGetOpaque(cxt.stack_head->page);
+
+ if ((oldopaque->zs_flags & ZSBT_ROOT) != 0)
+ {
+ ZSBtreeInternalPageItem *downlink;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = MinZSTid;
+ downlink->childblk = BufferGetBlockNumber(cxt.stack_head->buf);
+ downlinks = lcons(downlink, downlinks);
+
+ cxt.stack_tail->next = zsbt_newroot(rel, ZS_META_ATTRIBUTE_NUM,
+ oldopaque->zs_level + 1, downlinks);
+
+ /* clear the ZSBT_ROOT flag on the old root page */
+ oldopaque->zs_flags &= ~ZSBT_ROOT;
+ }
+ else
+ {
+ cxt.stack_tail->next = zsbt_insert_downlinks(rel, ZS_META_ATTRIBUTE_NUM,
+ oldopaque->zs_lokey, BufferGetBlockNumber(oldbuf), oldopaque->zs_level + 1,
+ downlinks);
+ }
+ /* note: stack_tail is not the real tail anymore */
+ }
+
+ /* Finally, overwrite all the pages we had to modify */
+ zs_apply_split_changes(rel, cxt.stack_head);
+}
+
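+/*
+ * Binary search within a TID tree page: returns the offset number of the
+ * last item whose first TID is <= 'key', or InvalidOffsetNumber if 'key' is
+ * smaller than the first TID of every item on the page.
+ */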
+static OffsetNumber
+zsbt_binsrch_tidpage(zstid key, Page page)
+{
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ OffsetNumber low,
+ high,
+ mid;
+
+ low = FirstOffsetNumber;
+ high = maxoff + 1;
+ while (high > low)
+ {
+ ItemId iid;
+ ZSTidArrayItem *item;
+
+ mid = low + (high - low) / 2;
+
+ iid = PageGetItemId(page, mid);
+ item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (key >= item->t_firsttid)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+ return low - 1;
+}
diff --git a/src/backend/access/zedstore/zedstore_toast.c b/src/backend/access/zedstore/zedstore_toast.c
new file mode 100644
index 00000000000..947f3f158b0
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_toast.c
@@ -0,0 +1,164 @@
+/*
+ * zedstore_toast.c
+ * Routines for Toasting oversized tuples in Zedstore
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_toast.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_compression.h"
+#include "access/zedstore_internal.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/rel.h"
+
+/*
+ * Toast a datum, inside the ZedStore file.
+ *
+ * This is similar to regular toasting, but instead of using a separate index and
+ * heap, the datum is stored within the same ZedStore file as all the btrees and
+ * stuff. A chain of "toast-pages" is allocated for the datum, and each page is filled
+ * with as much of the datum as possible.
+ */
+Datum
+zedstore_toast_datum(Relation rel, AttrNumber attno, Datum value, zstid tid)
+{
+ varatt_zs_toastptr *toastptr;
+ BlockNumber firstblk = InvalidBlockNumber;
+ Buffer buf = InvalidBuffer;
+ Page page;
+ ZSToastPageOpaque *opaque;
+ Buffer prevbuf = InvalidBuffer;
+ ZSToastPageOpaque *prevopaque = NULL;
+ char *ptr;
+ int32 total_size;
+ int32 offset;
+
+ Assert(tid != InvalidZSTid);
+
+ /*
+ * TID btree will always be inserted first, so there must be > 0 blocks
+ */
+ Assert(RelationGetNumberOfBlocks(rel) != 0);
+
+ /* TODO: try to compress it in place first. Maybe just call toast_compress_datum? */
+
+ /*
+ * If that doesn't reduce it enough, allocate a toast page
+ * for it.
+ */
+ ptr = VARDATA_ANY(value);
+ total_size = VARSIZE_ANY_EXHDR(value);
+ offset = 0;
+
+ while (total_size - offset > 0)
+ {
+ Size thisbytes;
+
+ buf = zspage_getnewbuf(rel, InvalidBuffer);
+ if (prevbuf == InvalidBuffer)
+ firstblk = BufferGetBlockNumber(buf);
+
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, sizeof(ZSToastPageOpaque));
+
+ thisbytes = Min(total_size - offset, PageGetExactFreeSpace(page));
+
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_attno = attno;
+ opaque->zs_tid = tid;
+ opaque->zs_total_size = total_size;
+ opaque->zs_slice_offset = offset;
+ opaque->zs_prev = BufferIsValid(prevbuf) ? BufferGetBlockNumber(prevbuf) : InvalidBlockNumber;
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_TOAST_PAGE_ID;
+
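+ /*
+ * The datum slice is stored directly after the page header, and pd_lower is
+ * advanced past it, so that the length of the slice can be recovered when
+ * the chain is read back (see zedstore_toast_flatten).
+ */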
+ memcpy((char *) page + SizeOfPageHeaderData, ptr, thisbytes);
+ ((PageHeader) page)->pd_lower += thisbytes;
+ ptr += thisbytes;
+ offset += thisbytes;
+
+ if (prevbuf != InvalidBuffer)
+ {
+ prevopaque->zs_next = BufferGetBlockNumber(buf);
+ MarkBufferDirty(prevbuf);
+ }
+
+ /* TODO: WAL-log */
+ MarkBufferDirty(buf);
+
+ if (prevbuf != InvalidBuffer)
+ UnlockReleaseBuffer(prevbuf);
+ prevbuf = buf;
+ prevopaque = opaque;
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ toastptr = palloc0(sizeof(varatt_zs_toastptr));
+ SET_VARTAG_1B_E(toastptr, VARTAG_ZEDSTORE);
+ toastptr->zst_block = firstblk;
+
+ return PointerGetDatum(toastptr);
+}
+
+Datum
+zedstore_toast_flatten(Relation rel, AttrNumber attno, zstid tid, Datum toasted)
+{
+ varatt_zs_toastptr *toastptr = (varatt_zs_toastptr *) DatumGetPointer(toasted);
+ BlockNumber nextblk;
+ BlockNumber prevblk;
+ char *result = NULL;
+ char *ptr = NULL;
+ int32 total_size = 0;
+
+ Assert(toastptr->va_tag == VARTAG_ZEDSTORE);
+
+ prevblk = InvalidBlockNumber;
+ nextblk = toastptr->zst_block;
+
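+ /*
+ * Walk the chain of toast pages, concatenating each page's slice of the
+ * datum into the result buffer. The total size is read from the first
+ * page's special area.
+ */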
+ while (nextblk != InvalidBlockNumber)
+ {
+ Buffer buf;
+ Page page;
+ ZSToastPageOpaque *opaque;
+ uint32 size;
+
+ buf = ReadBuffer(rel, nextblk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+
+ Assert(opaque->zs_attno == attno);
+ Assert(opaque->zs_prev == prevblk);
+
+ if (prevblk == InvalidBlockNumber)
+ {
+ Assert(opaque->zs_tid == tid);
+
+ total_size = opaque->zs_total_size;
+
+ result = palloc(total_size + VARHDRSZ);
+ SET_VARSIZE(result, total_size + VARHDRSZ);
+ ptr = result + VARHDRSZ;
+ }
+
+ size = ((PageHeader) page)->pd_lower - SizeOfPageHeaderData;
+ memcpy(ptr, (char *) page + SizeOfPageHeaderData, size);
+ ptr += size;
+
+ prevblk = nextblk;
+ nextblk = opaque->zs_next;
+ UnlockReleaseBuffer(buf);
+ }
+ Assert(total_size > 0);
+ Assert(ptr == result + total_size + VARHDRSZ);
+
+ return PointerGetDatum(result);
+}
diff --git a/src/backend/access/zedstore/zedstore_tupslot.c b/src/backend/access/zedstore/zedstore_tupslot.c
new file mode 100644
index 00000000000..a992477c156
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_tupslot.c
@@ -0,0 +1,262 @@
+/*
+ * zedstore_tupslot.c
+ * Implementation of a TupleTableSlot for zedstore.
+ *
+ * This implementation is identical to a Virtual tuple slot
+ * (TTSOpsVirtual), but it has a slot_getsysattr() implementation
+ * that can fetch and compute the 'xmin' for the tuple.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tupslot.c
+ */
+#include "postgres.h"
+
+#include "access/table.h"
+#include "access/zedstore_internal.h"
+#include "executor/tuptable.h"
+#include "utils/expandeddatum.h"
+
+const TupleTableSlotOps TTSOpsZedstore;
+
+static void
+tts_zedstore_init(TupleTableSlot *slot)
+{
+ ZedstoreTupleTableSlot *zslot = (ZedstoreTupleTableSlot *) slot;
+ zslot->xmin = InvalidTransactionId;
+ zslot->cmin = InvalidCommandId;
+}
+
+static void
+tts_zedstore_release(TupleTableSlot *slot)
+{
+}
+
+static void
+tts_zedstore_clear(TupleTableSlot *slot)
+{
+ ZedstoreTupleTableSlot *zslot = (ZedstoreTupleTableSlot *) slot;
+ if (unlikely(TTS_SHOULDFREE(slot)))
+ {
+ pfree(zslot->data);
+ zslot->data = NULL;
+
+ slot->tts_flags &= ~TTS_FLAG_SHOULDFREE;
+ }
+
+ slot->tts_nvalid = 0;
+ slot->tts_flags |= TTS_FLAG_EMPTY;
+ ItemPointerSetInvalid(&slot->tts_tid);
+
+ zslot->xmin = InvalidTransactionId;
+ zslot->cmin = InvalidCommandId;
+}
+
+/*
+ * Attribute values are readily available in tts_values and tts_isnull array
+ * in a ZedstoreTupleTableSlot, so there should normally be no need to call
+ * getsomeattrs() on it.
+ */
+static void
+tts_zedstore_getsomeattrs(TupleTableSlot *slot, int natts)
+{
+ elog(ERROR, "getsomeattrs is not required to be called on a zedstore tuple table slot");
+}
+
+/*
+ * We only support fetching 'xmin', currently. It's needed for referential
+ * integrity triggers (i.e. foreign keys).
+ */
+static Datum
+tts_zedstore_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
+{
+ ZedstoreTupleTableSlot *zslot = (ZedstoreTupleTableSlot *) slot;
+ if (attnum == MinTransactionIdAttributeNumber ||
+ attnum == MinCommandIdAttributeNumber)
+ {
+ *isnull = false;
+ if (attnum == MinTransactionIdAttributeNumber)
+ return TransactionIdGetDatum(zslot->xmin);
+ else
+ {
+ Assert(attnum == MinCommandIdAttributeNumber);
+ return CommandIdGetDatum(zslot->cmin);
+ }
+ }
+ elog(ERROR, "zedstore tuple table slot does not have system attributes (except xmin and cmin)");
+
+ return 0; /* silence compiler warnings */
+}
+
+/*
+ * To materialize a zedstore slot all the datums that aren't passed by value
+ * have to be copied into the slot's memory context. To do so, compute the
+ * required size, and allocate enough memory to store all attributes. That's
+ * good for cache hit ratio, but more importantly requires only memory
+ * allocation/deallocation.
+ */
+static void
+tts_zedstore_materialize(TupleTableSlot *slot)
+{
+ ZedstoreTupleTableSlot *zslot = (ZedstoreTupleTableSlot *) slot;
+ TupleDesc desc = slot->tts_tupleDescriptor;
+ Size sz = 0;
+ char *data;
+
+ /* already materialized */
+ if (TTS_SHOULDFREE(slot))
+ return;
+
+ /* compute size of memory required */
+ for (int natt = 0; natt < desc->natts; natt++)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, natt);
+ Datum val;
+
+ if (att->attbyval || slot->tts_isnull[natt])
+ continue;
+
+ val = slot->tts_values[natt];
+
+ if (att->attlen == -1 &&
+ VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val)))
+ {
+ /*
+ * We want to flatten the expanded value so that the materialized
+ * slot doesn't depend on it.
+ */
+ sz = att_align_nominal(sz, att->attalign);
+ sz += EOH_get_flat_size(DatumGetEOHP(val));
+ }
+ else
+ {
+ sz = att_align_nominal(sz, att->attalign);
+ sz = att_addlength_datum(sz, att->attlen, val);
+ }
+ }
+
+ /* all data is byval */
+ if (sz == 0)
+ return;
+
+ /* allocate memory */
+ zslot->data = data = MemoryContextAlloc(slot->tts_mcxt, sz);
+ slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+ /* and copy all attributes into the pre-allocated space */
+ for (int natt = 0; natt < desc->natts; natt++)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, natt);
+ Datum val;
+
+ if (att->attbyval || slot->tts_isnull[natt])
+ continue;
+
+ val = slot->tts_values[natt];
+
+ if (att->attlen == -1 &&
+ VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val)))
+ {
+ Size data_length;
+
+ /*
+ * We want to flatten the expanded value so that the materialized
+ * slot doesn't depend on it.
+ */
+ ExpandedObjectHeader *eoh = DatumGetEOHP(val);
+
+ data = (char *) att_align_nominal(data,
+ att->attalign);
+ data_length = EOH_get_flat_size(eoh);
+ EOH_flatten_into(eoh, data, data_length);
+
+ slot->tts_values[natt] = PointerGetDatum(data);
+ data += data_length;
+ }
+ else
+ {
+ Size data_length = 0;
+
+ data = (char *) att_align_nominal(data, att->attalign);
+ data_length = att_addlength_datum(data_length, att->attlen, val);
+
+ memcpy(data, DatumGetPointer(val), data_length);
+
+ slot->tts_values[natt] = PointerGetDatum(data);
+ data += data_length;
+ }
+ }
+}
+
+static void
+tts_zedstore_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
+{
+ ZedstoreTupleTableSlot *zsrcslot = (ZedstoreTupleTableSlot *) srcslot;
+ ZedstoreTupleTableSlot *zdstslot = (ZedstoreTupleTableSlot *) dstslot;
+
+ TupleDesc srcdesc = srcslot->tts_tupleDescriptor;
+
+ Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);
+
+ tts_zedstore_clear(dstslot);
+
+ slot_getallattrs(srcslot);
+
+ for (int natt = 0; natt < srcdesc->natts; natt++)
+ {
+ dstslot->tts_values[natt] = srcslot->tts_values[natt];
+ dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
+ }
+
+ zdstslot->xmin = zsrcslot->xmin;
+ zdstslot->cmin = zsrcslot->cmin;
+
+ dstslot->tts_nvalid = srcdesc->natts;
+ dstslot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ /* make sure storage doesn't depend on external memory */
+ tts_zedstore_materialize(dstslot);
+}
+
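+/*
+ * Construct a palloc'd HeapTuple (or, below, a MinimalTuple) from the
+ * slot's current values. A zedstore slot never owns a heap tuple, so
+ * these always build a fresh copy.
+ */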
+static HeapTuple
+tts_zedstore_copy_heap_tuple(TupleTableSlot *slot)
+{
+ Assert(!TTS_EMPTY(slot));
+
+ return heap_form_tuple(slot->tts_tupleDescriptor,
+ slot->tts_values,
+ slot->tts_isnull);
+}
+
+static MinimalTuple
+tts_zedstore_copy_minimal_tuple(TupleTableSlot *slot)
+{
+ Assert(!TTS_EMPTY(slot));
+
+ return heap_form_minimal_tuple(slot->tts_tupleDescriptor,
+ slot->tts_values,
+ slot->tts_isnull);
+}
+
+
+const TupleTableSlotOps TTSOpsZedstore = {
+ .base_slot_size = sizeof(ZedstoreTupleTableSlot),
+ .init = tts_zedstore_init,
+ .release = tts_zedstore_release,
+ .clear = tts_zedstore_clear,
+ .getsomeattrs = tts_zedstore_getsomeattrs,
+ .getsysattr = tts_zedstore_getsysattr,
+ .materialize = tts_zedstore_materialize,
+ .copyslot = tts_zedstore_copyslot,
+
+ /*
+ * A zedstore tuple table slot can not "own" a heap tuple or a minimal
+ * tuple.
+ */
+ .get_heap_tuple = NULL,
+ .get_minimal_tuple = NULL,
+ .copy_heap_tuple = tts_zedstore_copy_heap_tuple,
+ .copy_minimal_tuple = tts_zedstore_copy_minimal_tuple
+};
diff --git a/src/backend/access/zedstore/zedstore_undo.c b/src/backend/access/zedstore/zedstore_undo.c
new file mode 100644
index 00000000000..f44f5631b7f
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_undo.c
@@ -0,0 +1,1021 @@
+/*
+ * zedstore_undo.c
+ * Temporary UNDO-logging for zedstore.
+ *
+ * XXX: This is hopefully replaced with an upstream UNDO facility later.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_undo.c
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/multixact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "commands/progress.h"
+#include "commands/vacuum.h"
+#include "lib/integerset.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "pgstat.h"
+#include "storage/lmgr.h"
+#include "utils/memutils.h"
+#include "utils/pg_rusage.h"
+#include "utils/rel.h"
+#include "utils/lsyscache.h"
+
+/*
+ * Working area for VACUUM.
+ */
+typedef struct ZSVacRelStats
+{
+ int elevel;
+ BufferAccessStrategy vac_strategy;
+
+ /* hasindex = true means two-pass strategy; false means one-pass */
+ bool hasindex;
+ /* Overall statistics about rel */
+ BlockNumber rel_pages; /* total number of pages */
+ BlockNumber tupcount_pages; /* pages whose tuples we counted */
+ double old_live_tuples; /* previous value of pg_class.reltuples */
+ double new_rel_tuples; /* new estimated total # of tuples */
+ double new_live_tuples; /* new estimated total # of live tuples */
+ double new_dead_tuples; /* new estimated total # of dead tuples */
+ BlockNumber pages_removed;
+ double tuples_deleted;
+
+ IntegerSet *dead_tids;
+} ZSVacRelStats;
+
+static bool zs_lazy_tid_reaped(ItemPointer itemptr, void *state);
+static void lazy_vacuum_index(Relation indrel,
+ IndexBulkDeleteResult **stats,
+ ZSVacRelStats *vacrelstats);
+static void lazy_cleanup_index(Relation indrel,
+ IndexBulkDeleteResult *stats,
+ ZSVacRelStats *vacrelstats);
+static ZSUndoRecPtr zsundo_trim(Relation rel, TransactionId OldestXmin);
+static void zsundo_update_oldest_ptr(Relation rel, ZSUndoRecPtr oldest_undorecptr, BlockNumber oldest_undopage, List *unused_pages);
+static ZSUndoRec *zsundo_fetch_lock(Relation rel, ZSUndoRecPtr undoptr, Buffer *buf_p, int lockmode, bool missing_ok);
+
+/*
+ * Insert the given UNDO record to the UNDO log.
+ */
+ZSUndoRecPtr
+zsundo_insert(Relation rel, ZSUndoRec *rec)
+{
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber tail_blk;
+ Buffer tail_buf = InvalidBuffer;
+ Page tail_pg = NULL;
+ ZSUndoPageOpaque *tail_opaque = NULL;
+ uint64 next_counter;
+ char *dst;
+ ZSUndoRecPtr undorecptr;
+ int offset;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+retry_lock_tail:
+ tail_blk = metaopaque->zs_undo_tail;
+
+ /*
+ * Is there space on the tail page? If not, allocate a new UNDO page.
+ */
+ if (tail_blk != InvalidBlockNumber)
+ {
+ tail_buf = ReadBuffer(rel, tail_blk);
+ LockBuffer(tail_buf, BUFFER_LOCK_EXCLUSIVE);
+ tail_pg = BufferGetPage(tail_buf);
+ tail_opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(tail_pg);
+ Assert(tail_opaque->first_undorecptr.counter == metaopaque->zs_undo_tail_first_counter);
+
+ if (IsZSUndoRecPtrValid(&tail_opaque->last_undorecptr))
+ {
+ Assert(tail_opaque->last_undorecptr.counter >= metaopaque->zs_undo_tail_first_counter);
+ next_counter = tail_opaque->last_undorecptr.counter + 1;
+ }
+ else
+ {
+ next_counter = tail_opaque->first_undorecptr.counter;
+ Assert(next_counter == metaopaque->zs_undo_tail_first_counter);
+ }
+ }
+ else
+ {
+ next_counter = metaopaque->zs_undo_tail_first_counter;
+ Assert(next_counter == metaopaque->zs_undo_oldestptr.counter);
+ }
+ if (tail_blk == InvalidBlockNumber || PageGetExactFreeSpace(tail_pg) < rec->size)
+ {
+ Buffer newbuf;
+ BlockNumber newblk;
+ Page newpage;
+ ZSUndoPageOpaque *newopaque;
+
+ /*
+ * Release the lock on the metapage while we find a new block, because
+ * that could take a while. (And accessing the Free Page Map might lock
+ * the metapage, too, causing self-deadlock.)
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+ /* new page */
+ newbuf = zspage_getnewbuf(rel, metabuf);
+
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ if (metaopaque->zs_undo_tail != tail_blk)
+ {
+ /*
+ * It should not be possible for another backend to extend the UNDO log
+ * while we're holding the tail block locked.
+ */
+ if (tail_blk != InvalidBlockNumber)
+ elog(ERROR, "UNDO tail block pointer was changed unexpectedly");
+
+ /*
+ * we don't need the new page, after all. (Or maybe we do, if the new
+ * tail block is already full, but we're not smart about it.)
+ */
+ zspage_delete_page(rel, newbuf);
+ goto retry_lock_tail;
+ }
+
+ if (tail_blk == InvalidBlockNumber)
+ next_counter = metaopaque->zs_undo_tail_first_counter;
+ else
+ next_counter = tail_opaque->last_undorecptr.counter + 1;
+
+ newblk = BufferGetBlockNumber(newbuf);
+ newpage = BufferGetPage(newbuf);
+ PageInit(newpage, BLCKSZ, sizeof(ZSUndoPageOpaque));
+ newopaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(newpage);
+ newopaque->next = InvalidBlockNumber;
+ newopaque->first_undorecptr.blkno = newblk;
+ newopaque->first_undorecptr.offset = SizeOfPageHeaderData;
+ newopaque->first_undorecptr.counter = next_counter;
+ newopaque->last_undorecptr = InvalidUndoPtr;
+ newopaque->zs_page_id = ZS_UNDO_PAGE_ID;
+
+ metaopaque->zs_undo_tail = newblk;
+ metaopaque->zs_undo_tail_first_counter = next_counter;
+ if (tail_blk == InvalidBlockNumber)
+ metaopaque->zs_undo_head = newblk;
+
+ MarkBufferDirty(metabuf);
+
+ if (tail_blk != InvalidBlockNumber)
+ {
+ tail_opaque->next = newblk;
+ MarkBufferDirty(tail_buf);
+ UnlockReleaseBuffer(tail_buf);
+ }
+
+ tail_blk = newblk;
+ tail_buf = newbuf;
+ tail_pg = newpage;
+ tail_opaque = newopaque;
+ }
+
+ UnlockReleaseBuffer(metabuf);
+
+ /* insert the record to this page */
+ offset = ((PageHeader) tail_pg)->pd_lower;
+
+ undorecptr.counter = next_counter;
+ undorecptr.blkno = tail_blk;
+ undorecptr.offset = offset;
+ rec->undorecptr = undorecptr;
+ dst = ((char *) tail_pg) + offset;
+ memcpy(dst, rec, rec->size);
+
+ tail_opaque->last_undorecptr = undorecptr;
+ ((PageHeader) tail_pg)->pd_lower += rec->size;
+
+ MarkBufferDirty(tail_buf);
+ UnlockReleaseBuffer(tail_buf);
+
+ return undorecptr;
+}
+
+/*
+ * Fetch the UNDO record with the given undo-pointer.
+ *
+ * The returned record is a palloc'd copy.
+ *
+ * If the record could not be found, returns NULL. That can happen if you try
+ * to fetch an UNDO record that has already been discarded. I.e. if undoptr
+ * is smaller than the oldest UNDO pointer stored in the metapage.
+ */
+ZSUndoRec *
+zsundo_fetch(Relation rel, ZSUndoRecPtr undoptr)
+{
+ ZSUndoRec *undorec_copy;
+ ZSUndoRec *undorec;
+ Buffer buf;
+
+ undorec = zsundo_fetch_lock(rel, undoptr, &buf, BUFFER_LOCK_SHARE, true);
+
+ if (undorec)
+ {
+ undorec_copy = palloc(undorec->size);
+ memcpy(undorec_copy, undorec, undorec->size);
+ }
+ else
+ undorec_copy = NULL;
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+
+ return undorec_copy;
+}
+
+/*
+ * Lock page containing the given UNDO record, and return pointer to it
+ * within the buffer.
+ *
+ * If missing_ok is true, it's OK if the UNDO record has been trimmed / discarded away
+ * already. Will return NULL in that case. If missing_ok is false, throws an error if
+ * the record cannot be found.
+ */
+static ZSUndoRec *
+zsundo_fetch_lock(Relation rel, ZSUndoRecPtr undoptr, Buffer *buf_p, int lockmode, bool missing_ok)
+{
+ Buffer buf;
+ Page page;
+ PageHeader pagehdr;
+ ZSUndoPageOpaque *opaque;
+ ZSUndoRec *undorec;
+
+ buf = ReadBuffer(rel, undoptr.blkno);
+ page = BufferGetPage(buf);
+ pagehdr = (PageHeader) page;
+
+ /*
+ * FIXME: If the page might've been trimmed away, there's a small chance of deadlock if
+ * the buffer now holds an unrelated page, and we or someone else is holding a lock on
+ * it already. We could optimistically try to lock the page without blocking first,
+ * and update the oldest undo pointer from the metapage if that fails. And only if the
+ * oldest undo pointer indicates that the record should still be there, wait for the lock.
+ */
+ LockBuffer(buf, lockmode);
+ if (PageIsNew(page))
+ goto record_missing;
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ goto record_missing;
+
+ /* Check that this page contains the given record */
+ if (undoptr.counter < opaque->first_undorecptr.counter ||
+ !IsZSUndoRecPtrValid(&opaque->last_undorecptr) ||
+ undoptr.counter > opaque->last_undorecptr.counter)
+ goto record_missing;
+
+ /* Sanity check that the pointer pointed to a valid place */
+ if (undoptr.offset < SizeOfPageHeaderData ||
+ undoptr.offset + sizeof(ZSUndoRec) > pagehdr->pd_lower)
+ {
+ /*
+ * This cannot be explained by the page having been recycled for some
+ * other use, so error out even if 'missing_ok' is true.
+ */
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+ }
+
+ undorec = (ZSUndoRec *) (((char *) page) + undoptr.offset);
+
+ if (memcmp(&undorec->undorecptr, &undoptr, sizeof(ZSUndoRecPtr)) != 0)
+ {
+ /*
+ * This cannot be explained by the page having been recycled for some
+ * other use, so error out even if 'missing_ok' is true.
+ */
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+ }
+
+ *buf_p = buf;
+ return undorec;
+
+record_missing:
+ UnlockReleaseBuffer(buf);
+ *buf_p = InvalidBuffer;
+
+ if (missing_ok)
+ return NULL;
+ else
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u; not an UNDO page",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+}
+
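+/*
+ * Clear the speculative insertion token on an INSERT undo record, so that
+ * the inserted row is no longer considered speculative.
+ */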
+void
+zsundo_clear_speculative_token(Relation rel, ZSUndoRecPtr undoptr)
+{
+ ZSUndoRec_Insert *undorec;
+ Buffer buf;
+
+ undorec = (ZSUndoRec_Insert *) zsundo_fetch_lock(rel, undoptr, &buf, BUFFER_LOCK_EXCLUSIVE, false);
+
+ if (undorec->rec.type != ZSUNDO_TYPE_INSERT)
+ elog(ERROR, "unexpected undo record type %d on speculatively inserted row",
+ undorec->rec.type);
+
+ undorec->speculative_token = INVALID_SPECULATIVE_TOKEN;
+
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+}
+
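+/*
+ * Callback for index_bulk_delete(): an index entry can be removed if its
+ * TID is in the set of dead TIDs collected from the TID tree.
+ */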
+static bool
+zs_lazy_tid_reaped(ItemPointer itemptr, void *state)
+{
+ ZSVacRelStats *vacrelstats = (ZSVacRelStats *) state;
+ zstid tid = ZSTidFromItemPointer(*itemptr);
+
+ return intset_is_member(vacrelstats->dead_tids, tid);
+}
+
+/*
+ * Entry point of VACUUM for zedstore tables.
+ *
+ * Vacuum on a zedstore table works quite differently from the heap. We don't
+ * scan the table. Instead, we scan just the active UNDO log, and remove any
+ * garbage left behind by aborts or deletions based on the UNDO log.
+ */
+void
+zsundo_vacuum(Relation rel, VacuumParams *params, BufferAccessStrategy bstrategy,
+ TransactionId OldestXmin)
+{
+ ZSVacRelStats *vacrelstats;
+ Relation *Irel;
+ int nindexes;
+ IndexBulkDeleteResult **indstats;
+ Form_pg_class pgcform;
+ zstid starttid;
+ zstid endtid;
+
+ /* do nothing if the table is completely empty. */
+ if (RelationGetTargetBlock(rel) == 0 ||
+ RelationGetTargetBlock(rel) == InvalidBlockNumber)
+ {
+ /* don't believe the cached value without checking */
+ BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
+
+ RelationSetTargetBlock(rel, nblocks);
+ if (nblocks == 0)
+ return;
+ }
+
+ /*
+ * Scan the UNDO log, and discard what we can.
+ */
+ (void) zsundo_trim(rel, RecentGlobalXmin);
+
+ vacrelstats = (ZSVacRelStats *) palloc0(sizeof(ZSVacRelStats));
+
+ if (params->options & VACOPT_VERBOSE)
+ vacrelstats->elevel = INFO;
+ else
+ vacrelstats->elevel = DEBUG2;
+ vacrelstats->vac_strategy = bstrategy;
+
+ /* Open all indexes of the relation */
+ vac_open_indexes(rel, RowExclusiveLock, &nindexes, &Irel);
+ vacrelstats->hasindex = (nindexes > 0);
+ indstats = (IndexBulkDeleteResult **)
+ palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
+
+ ereport(vacrelstats->elevel,
+ (errmsg("vacuuming \"%s.%s\"",
+ get_namespace_name(RelationGetNamespace(rel)),
+ RelationGetRelationName(rel))));
+
+ starttid = MinZSTid;
+ do
+ {
+ IntegerSet *dead_tids;
+
+ /* Scan the TID tree, to collect TIDs that have been marked dead. */
+ dead_tids = zsbt_collect_dead_tids(rel, starttid, &endtid);
+ vacrelstats->dead_tids = dead_tids;
+
+ if (intset_num_entries(dead_tids) > 0)
+ {
+ /* Remove index entries */
+ for (int i = 0; i < nindexes; i++)
+ lazy_vacuum_index(Irel[i],
+ &indstats[i],
+ vacrelstats);
+
+ /*
+ * Remove the attribute data for the dead rows, and finally their
+ * TID tree entries.
+ */
+ for (int attno = 1; attno <= RelationGetNumberOfAttributes(rel); attno++)
+ zsbt_attr_remove(rel, attno, dead_tids);
+ zsbt_tid_remove(rel, dead_tids);
+ }
+
+ ereport(vacrelstats->elevel,
+ (errmsg("\"%s\": removed " UINT64_FORMAT " row versions",
+ RelationGetRelationName(rel),
+ intset_num_entries(dead_tids))));
+
+ starttid = endtid;
+ } while(starttid < MaxPlusOneZSTid);
+
+ /* Do post-vacuum cleanup and statistics update for each index */
+ for (int i = 0; i < nindexes; i++)
+ lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
+
+ /* Done with indexes */
+ vac_close_indexes(nindexes, Irel, NoLock);
+
+ /*
+ * Update pg_class to reflect the info we know. The main thing we know for
+ * sure here is whether the relation has indexes. We use OldestXmin as the
+ * new frozenxid. Since we don't know the new multixact id, we pass
+ * InvalidMultiXactId to avoid updating it. Likewise, we pass false for
+ * relallvisible, as we don't know that either.
+ *
+ * FIXME: pass correct numbers for relpages, reltuples and other
+ * arguments.
+ */
+ pgcform = RelationGetForm(rel);
+ vac_update_relstats(rel,
+ pgcform->relpages,
+ pgcform->reltuples,
+ false,
+ nindexes > 0,
+ OldestXmin,
+ InvalidMultiXactId,
+ false);
+
+ /* report results to the stats collector, too */
+ pgstat_report_vacuum(RelationGetRelid(rel),
+ rel->rd_rel->relisshared,
+ pgcform->reltuples,
+ 0); /* FIXME: # of dead tuples */
+}
+
+/*
+ * lazy_vacuum_index() -- vacuum one index relation.
+ *
+ * Delete all the index entries pointing to tuples listed in
+ * vacrelstats->dead_tids, and update running statistics.
+ */
+static void
+lazy_vacuum_index(Relation indrel,
+ IndexBulkDeleteResult **stats,
+ ZSVacRelStats *vacrelstats)
+{
+ IndexVacuumInfo ivinfo;
+ PGRUsage ru0;
+
+ pg_rusage_init(&ru0);
+
+ ivinfo.index = indrel;
+ ivinfo.analyze_only = false;
+ ivinfo.estimated_count = true;
+ ivinfo.message_level = vacrelstats->elevel;
+ /* We can only provide an approximate value of num_heap_tuples here */
+ ivinfo.num_heap_tuples = vacrelstats->old_live_tuples;
+ ivinfo.strategy = vacrelstats->vac_strategy;
+
+ /* Do bulk deletion */
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ zs_lazy_tid_reaped, (void *) vacrelstats);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
+ RelationGetRelationName(indrel),
+ intset_num_entries(vacrelstats->dead_tids)),
+ errdetail_internal("%s", pg_rusage_show(&ru0))));
+}
+
+/*
+ * lazy_cleanup_index() -- do post-vacuum cleanup for one index relation.
+ */
+static void
+lazy_cleanup_index(Relation indrel,
+ IndexBulkDeleteResult *stats,
+ ZSVacRelStats *vacrelstats)
+{
+ IndexVacuumInfo ivinfo;
+ PGRUsage ru0;
+
+ pg_rusage_init(&ru0);
+
+ ivinfo.index = indrel;
+ ivinfo.analyze_only = false;
+ ivinfo.estimated_count = (vacrelstats->tupcount_pages < vacrelstats->rel_pages);
+ ivinfo.message_level = vacrelstats->elevel;
+
+ /*
+ * Now we can provide a better estimate of total number of surviving
+ * tuples (we assume indexes are more interested in that than in the
+ * number of nominally live tuples).
+ */
+ ivinfo.num_heap_tuples = vacrelstats->new_rel_tuples;
+ ivinfo.strategy = vacrelstats->vac_strategy;
+
+ stats = index_vacuum_cleanup(&ivinfo, stats);
+
+ if (!stats)
+ return;
+
+ /*
+ * Now update statistics in pg_class, but only if the index says the count
+ * is accurate.
+ */
+ if (!stats->estimated_count)
+ vac_update_relstats(indrel,
+ stats->num_pages,
+ stats->num_index_tuples,
+ 0,
+ false,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ false);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("index \"%s\" now contains %.0f row versions in %u pages",
+ RelationGetRelationName(indrel),
+ stats->num_index_tuples,
+ stats->num_pages),
+ errdetail("%.0f index row versions were removed.\n"
+ "%u index pages have been deleted, %u are currently reusable.\n"
+ "%s.",
+ stats->tuples_removed,
+ stats->pages_deleted, stats->pages_free,
+ pg_rusage_show(&ru0))));
+
+ pfree(stats);
+}
+
+/*
+ * Scan the UNDO log, starting from oldest entry. Undo the effects of any
+ * aborted transactions. Records for committed transactions can be trimmed
+ * away immediately.
+ *
+ * Returns the oldest valid UNDO ptr, after the trim.
+ */
+static ZSUndoRecPtr
+zsundo_trim(Relation rel, TransactionId OldestXmin)
+{
+ /* Scan the undo log from oldest to newest */
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber firstblk;
+ BlockNumber lastblk;
+ ZSUndoRecPtr oldest_undorecptr;
+ bool can_advance_oldestundorecptr;
+ char *ptr;
+ char *endptr;
+ List *unused_pages = NIL;
+ BlockNumber deleted_undo_pages = 0;
+
+ oldest_undorecptr = InvalidUndoPtr;
+
+ /*
+ * Ensure that only one process discards at a time. We use a page lock on the
+ * metapage for that.
+ */
+ LockPage(rel, ZS_META_BLK, ExclusiveLock);
+
+ /*
+ * Get the current oldest undo page from the metapage.
+ */
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ firstblk = metaopaque->zs_undo_head;
+
+ oldest_undorecptr = metaopaque->zs_undo_oldestptr;
+
+ /*
+ * If we assume that only one process can call TRIM at a time, then we
+ * don't need to hold the metapage locked. Alternatively, if multiple
+ * concurrent trims are possible, we could check after reading the head
+ * page, that it is the page we expect, and re-read the metapage if it's
+ * not.
+ */
+ UnlockReleaseBuffer(metabuf);
+
+ /*
+ * Loop through UNDO records, starting from the oldest page, until we
+ * hit a record that we cannot remove.
+ */
+ lastblk = firstblk;
+ can_advance_oldestundorecptr = false;
+ while (lastblk != InvalidBlockNumber)
+ {
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the UNDO page */
+ buf = ReadBuffer(rel, lastblk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ elog(ERROR, "unexpected page id on UNDO page");
+
+ /* loop through all records on the page */
+ endptr = (char *) page + ((PageHeader) page)->pd_lower;
+ ptr = (char *) page + SizeOfPageHeaderData;
+ while (ptr < endptr)
+ {
+ ZSUndoRec *undorec = (ZSUndoRec *) ptr;
+ bool did_commit;
+
+ Assert(undorec->undorecptr.blkno == lastblk);
+
+ if (undorec->undorecptr.counter < oldest_undorecptr.counter)
+ {
+ ptr += undorec->size;
+ continue;
+ }
+ oldest_undorecptr = undorec->undorecptr;
+
+ if (!TransactionIdPrecedes(undorec->xid, OldestXmin))
+ {
+ /* This is still needed. Bail out */
+ break;
+ }
+
+ /*
+ * No one thinks this transaction is in-progress anymore. If it
+ * committed, we can just trim away its UNDO record. If it aborted,
+ * we need to apply the UNDO record first. (For deletions, it's
+ * the other way round, though.)
+ *
+ * TODO: It would be much more efficient to do these in batches.
+ * So we should just collect the TIDs to mark dead here, and pass
+ * the whole list to zsbt_tid_mark_dead() after the loop.
+ */
+ did_commit = TransactionIdDidCommit(undorec->xid);
+
+ switch (undorec->type)
+ {
+ case ZSUNDO_TYPE_INSERT:
+ if (!did_commit)
+ {
+ ZSUndoRec_Insert *insertrec = (ZSUndoRec_Insert *) undorec;
+
+ for (zstid tid = insertrec->firsttid; tid < insertrec->endtid; tid++)
+ zsbt_tid_mark_dead(rel, tid, oldest_undorecptr);
+ }
+ break;
+ case ZSUNDO_TYPE_DELETE:
+ {
+ ZSUndoRec_Delete *deleterec = (ZSUndoRec_Delete *) undorec;
+
+ if (did_commit)
+ {
+ /* The deletion is now visible to everyone */
+ for (int i = 0; i < deleterec->num_tids; i++)
+ zsbt_tid_mark_dead(rel, deleterec->tids[i], oldest_undorecptr);
+ }
+ else
+ {
+ /*
+ * must clear the item's UNDO pointer, otherwise the deletion
+ * becomes visible to everyone when the UNDO record is trimmed
+ * away.
+ */
+ for (int i = 0; i < deleterec->num_tids; i++)
+ zsbt_tid_undo_deletion(rel, deleterec->tids[i], undorec->undorecptr,
+ oldest_undorecptr);
+ }
+ }
+ break;
+ case ZSUNDO_TYPE_UPDATE:
+ if (did_commit)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+
+ zsbt_tid_mark_dead(rel, updaterec->oldtid, oldest_undorecptr);
+ }
+ break;
+ }
+
+ ptr += undorec->size;
+ can_advance_oldestundorecptr = true;
+ }
+
+ if (ptr < endptr)
+ {
+ UnlockReleaseBuffer(buf);
+ break;
+ }
+ else
+ {
+ /* We processed all records on the page. Step to the next one, if any. */
+ Assert(ptr == endptr);
+ unused_pages = lappend_int(unused_pages, lastblk);
+ lastblk = opaque->next;
+ UnlockReleaseBuffer(buf);
+ if (lastblk != InvalidBlockNumber)
+ deleted_undo_pages++;
+ }
+ }
+
+ if (can_advance_oldestundorecptr && lastblk == InvalidBlockNumber)
+ {
+ /*
+ * We stopped after the last valid record. Advance by one, to the next
+ * record which hasn't been created yet, and which is still needed.
+ */
+ oldest_undorecptr.counter++;
+ oldest_undorecptr.blkno = InvalidBlockNumber;
+ oldest_undorecptr.offset = 0;
+ }
+
+ if (can_advance_oldestundorecptr)
+ zsundo_update_oldest_ptr(rel, oldest_undorecptr, lastblk, unused_pages);
+
+ UnlockPage(rel, ZS_META_BLK, ExclusiveLock);
+
+ return oldest_undorecptr;
+}
+
+/* Update metapage with the oldest value */
+static void
+zsundo_update_oldest_ptr(Relation rel, ZSUndoRecPtr oldest_undorecptr,
+ BlockNumber oldest_undopage, List *unused_pages)
+{
+ /* Scan the undo log from oldest to newest */
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ ListCell *lc;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ metaopaque->zs_undo_oldestptr = oldest_undorecptr;
+ if (oldest_undopage == InvalidBlockNumber)
+ {
+ metaopaque->zs_undo_head = InvalidBlockNumber;
+ metaopaque->zs_undo_tail = InvalidBlockNumber;
+ metaopaque->zs_undo_tail_first_counter = oldest_undorecptr.counter;
+ }
+ else
+ metaopaque->zs_undo_head = oldest_undopage;
+
+ /* TODO: WAL-log */
+
+ MarkBufferDirty(metabuf);
+ UnlockReleaseBuffer(metabuf);
+
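+ /*
+ * Recycle the UNDO pages that the trim stepped past. The buffer locks on
+ * them were released earlier, so re-check that each page still looks like
+ * a trimmed UNDO page before deleting it.
+ */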
+ foreach(lc, unused_pages)
+ {
+ BlockNumber blk = (BlockNumber) lfirst_int(lc);
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+
+ /* check that the page still looks like what we'd expect. */
+ buf = ReadBuffer(rel, blk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ if (PageIsEmpty(page) ||
+ PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSUndoPageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+
+ /* FIXME: Also check here that the max UndoRecPtr on the page is less
+ * than the new 'oldest_undorecptr'
+ */
+
+ zspage_delete_page(rel, buf);
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+/*
+ * Return the current "Oldest undo pointer". The effects of any actions with
+ * undo pointer older than this is known to be visible to everyone. (i.e.
+ * an inserted tuple is known to be visible, and a deleted tuple is known to
+ * be invisible.)
+ */
+ZSUndoRecPtr
+zsundo_get_oldest_undo_ptr(Relation rel)
+{
+ /* do nothing if the table is completely empty. */
+ if (RelationGetTargetBlock(rel) == 0 ||
+ RelationGetTargetBlock(rel) == InvalidBlockNumber)
+ {
+ /* don't believe a cached 0 size without checking */
+ BlockNumber nblocks;
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ RelationSetTargetBlock(rel, nblocks);
+ if (nblocks == 0)
+ return InvalidUndoPtr;
+ }
+
+ /*
+ * Scan the UNDO log, to discard as much of it as possible. This
+ * advances the oldest UNDO pointer past as many transactions as possible.
+ *
+ * TODO:
+ * We could get the latest cached value directly from the metapage, but
+ * this allows trimming the UNDO log more aggressively, whenever we're
+ * scanning. Fetching records from the UNDO log is pretty expensive,
+ * so until that is somehow sped up, it is a good tradeoff to be
+ * aggressive about that.
+ */
+ return zsundo_trim(rel, RecentGlobalXmin);
+}
+
+
+/*
+ * Higher-level functions for constructing UNDO records, with caching.
+ *
+ * If you perform multiple operations in the same transaction and command, we
+ * reuse the same UNDO record for them. There's a one-element cache of each
+ * operation type, so this only takes effect in simple cases.
+ *
+ * TODO: make the caching work in more cases, e.g. with a hash table.
+ * Currently, we do this for DELETEs and INSERTs. We could perhaps do this
+ * for UPDATEs as well, although they're a bit more tricky, as we need
+ * to also store the 'ctid' pointer to the new tuple in an UPDATE.
+ */
+ZSUndoRecPtr
+zsundo_create_for_delete(Relation rel, TransactionId xid, CommandId cid, zstid tid,
+ bool changedPart, ZSUndoRecPtr prev_undo_ptr)
+{
+ ZSUndoRec_Delete undorec;
+ ZSUndoRecPtr undoptr;
+
+ static RelFileNode cached_relfilenode;
+ static TransactionId cached_xid;
+ static CommandId cached_cid;
+ static bool cached_changedPart;
+ static ZSUndoRecPtr cached_prev_undo_ptr;
+ static ZSUndoRecPtr cached_undo_ptr;
+
+ if (RelFileNodeEquals(rel->rd_node, cached_relfilenode) &&
+ xid == cached_xid &&
+ cid == cached_cid &&
+ changedPart == cached_changedPart &&
+ prev_undo_ptr.counter == cached_prev_undo_ptr.counter)
+ {
+ Buffer buf;
+ ZSUndoRec_Delete *undorec_p;
+
+ undorec_p = (ZSUndoRec_Delete *) zsundo_fetch_lock(rel, cached_undo_ptr,
+ &buf, BUFFER_LOCK_EXCLUSIVE, false);
+
+ if (undorec_p->rec.type != ZSUNDO_TYPE_DELETE)
+ elog(ERROR, "unexpected undo record type %d, expected DELETE", undorec_p->rec.type);
+
+ /* Is there space for a new TID in the record? */
+ if (undorec_p->num_tids < ZSUNDO_NUM_TIDS_PER_DELETE)
+ {
+ undorec_p->tids[undorec_p->num_tids] = tid;
+ undorec_p->num_tids++;
+
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ return cached_undo_ptr;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * Cache miss. Create a new UNDO record.
+ */
+ undorec.rec.size = sizeof(ZSUndoRec_Delete);
+ undorec.rec.type = ZSUNDO_TYPE_DELETE;
+ undorec.rec.xid = xid;
+ undorec.rec.cid = cid;
+ undorec.changedPart = changedPart;
+ undorec.rec.prevundorec = prev_undo_ptr;
+ undorec.tids[0] = tid;
+ undorec.num_tids = 1;
+
+ undoptr = zsundo_insert(rel, &undorec.rec);
+
+ cached_relfilenode = rel->rd_node;
+ cached_xid = xid;
+ cached_cid = cid;
+ cached_changedPart = changedPart;
+ cached_prev_undo_ptr = prev_undo_ptr;
+ cached_undo_ptr = undoptr;
+
+ return undoptr;
+}
+
+/*
+ * Create an UNDO record for insertion.
+ *
+ * The undo record stores the 'tid' of the row, as well as visibility information.
+ *
+ * There's a primitive caching mechanism here: If you perform multiple insertions
+ * with the same visibility information and consecutive TIDs, we will keep extending
+ * the range of TIDs in the same UNDO record, instead of creating new records.
+ * That greatly reduces the space required for UNDO log of bulk inserts.
+ */
+ZSUndoRecPtr
+zsundo_create_for_insert(Relation rel, TransactionId xid, CommandId cid, zstid tid,
+ int nitems, uint32 speculative_token, ZSUndoRecPtr prev_undo_ptr)
+{
+ ZSUndoRec_Insert undorec;
+ ZSUndoRecPtr undoptr;
+
+ static RelFileNode cached_relfilenode;
+ static TransactionId cached_xid;
+ static CommandId cached_cid;
+ static zstid cached_endtid;
+ static ZSUndoRecPtr cached_prev_undo_ptr;
+ static ZSUndoRecPtr cached_undo_ptr;
+
+ if (speculative_token == INVALID_SPECULATIVE_TOKEN &&
+ RelFileNodeEquals(rel->rd_node, cached_relfilenode) &&
+ xid == cached_xid &&
+ cid == cached_cid &&
+ tid == cached_endtid &&
+ prev_undo_ptr.counter == cached_prev_undo_ptr.counter)
+ {
+ Buffer buf;
+ ZSUndoRec_Insert *undorec_p;
+
+ undorec_p = (ZSUndoRec_Insert *) zsundo_fetch_lock(rel, cached_undo_ptr,
+ &buf, BUFFER_LOCK_EXCLUSIVE, false);
+
+ if (undorec_p->rec.type != ZSUNDO_TYPE_INSERT)
+ elog(ERROR, "unexpected undo record type %d, expected INSERT", undorec_p->rec.type);
+
+ /* Extend the range of the old record to cover the new TID */
+ Assert(undorec_p->endtid == tid);
+ Assert(undorec_p->speculative_token == INVALID_SPECULATIVE_TOKEN);
+ undorec_p->endtid = tid + nitems;
+
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ cached_endtid = tid + nitems;
+ return cached_undo_ptr;
+ }
+
+ /*
+ * Cache miss. Create a new UNDO record.
+ */
+ undorec.rec.size = sizeof(ZSUndoRec_Insert);
+ undorec.rec.type = ZSUNDO_TYPE_INSERT;
+ /* undorecptr will be filled in by zsundo_insert() */
+ undorec.rec.xid = xid;
+ undorec.rec.cid = cid;
+ undorec.rec.prevundorec = prev_undo_ptr;
+ undorec.firsttid = tid;
+ undorec.endtid = tid + nitems;
+ undorec.speculative_token = speculative_token;
+
+ undoptr = zsundo_insert(rel, &undorec.rec);
+
+ if (speculative_token == INVALID_SPECULATIVE_TOKEN)
+ {
+ cached_relfilenode = rel->rd_node;
+ cached_xid = xid;
+ cached_cid = cid;
+ cached_endtid = tid + nitems;
+ cached_prev_undo_ptr = prev_undo_ptr;
+ cached_undo_ptr = undoptr;
+ }
+
+ return undoptr;
+}
diff --git a/src/backend/access/zedstore/zedstore_visibility.c b/src/backend/access/zedstore/zedstore_visibility.c
new file mode 100644
index 00000000000..c70297f5b75
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_visibility.c
@@ -0,0 +1,864 @@
+/*
+ * zedstore_visibility.c
+ * Routines for MVCC in Zedstore
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_visibility.c
+ */
+#include "postgres.h"
+
+#include "access/tableam.h"
+#include "access/xact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "storage/procarray.h"
+
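+/*
+ * Check whether a new tuple lock of mode 'newmode' is compatible with an
+ * already-held lock of mode 'mode'.
+ */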
+static bool
+zs_tuplelock_compatible(LockTupleMode mode, LockTupleMode newmode)
+{
+ switch (newmode)
+ {
+ case LockTupleKeyShare:
+ return mode == LockTupleKeyShare ||
+ mode == LockTupleShare ||
+ mode == LockTupleNoKeyExclusive;
+
+ case LockTupleShare:
+ return mode == LockTupleKeyShare ||
+ mode == LockTupleShare;
+
+ case LockTupleNoKeyExclusive:
+ return mode == LockTupleKeyShare;
+ case LockTupleExclusive:
+ return false;
+
+ default:
+ elog(ERROR, "unknown tuple lock mode %d", newmode);
+ }
+}
+
+/*
+ * Like HeapTupleSatisfiesUpdate.
+ *
+ * When this returns TM_Ok, it also returns a flag in *undo_record_needed, to indicate
+ * whether the old UNDO record is still of interest to anyone. If the old record
+ * belonged to an aborted deleting transaction, for example, it can be ignored.
+ *
+ * This does more than HeapTupleSatisfiesUpdate. If HeapTupleSatisfiesUpdate sees
+ * an updated or locked tuple, it returns TM_BeingModified, and the caller has to
+ * check if the tuple lock is compatible with the update. zs_SatisfiesUpdate
+ * checks if the new lock mode is compatible with the old one, and returns TM_Ok
+ * if so. Waiting for conflicting locks is left to the caller.
+ *
+ * This is also used for tuple locking (e.g. SELECT FOR UPDATE). 'mode' indicates
+ * the lock mode. For a genuine UPDATE, pass LockTupleExclusive or
+ * LockTupleNoKeyExclusive depending on whether key columns are being modified.
+ *
+ * If the tuple was UPDATEd, *next_tid is set to the TID of the new row version.
+ */
+TM_Result
+zs_SatisfiesUpdate(Relation rel, Snapshot snapshot,
+ ZSUndoRecPtr recent_oldest_undo,
+ zstid item_tid, ZSUndoRecPtr item_undoptr,
+ LockTupleMode mode,
+ bool *undo_record_needed, TM_FailureData *tmfd,
+ zstid *next_tid, ZSUndoSlotVisibility *visi_info)
+{
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+ int chain_depth = 0;
+
+ *undo_record_needed = true;
+
+ undo_ptr = item_undoptr;
+
+fetch_undo_record:
+ chain_depth++;
+
+retry_fetch:
+ /* Is it visible? */
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ {
+ /*
+ * The old UNDO record is no longer visible to anyone, so we don't
+ * need to keep it. If this record was not the one directly referenced
+ * from the item, then we must keep it, though. For example, if there
+ * is a chain (item -> LOCK_TUPLE -> INSERT), and the INSERT record is
+ * no longer needed by anyone, we must still keep the pointer to the LOCK
+ * record.
+ */
+ if (chain_depth == 1)
+ *undo_record_needed = false;
+
+ if (visi_info)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ }
+ return TM_Ok;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, undo_ptr);
+ if (!undorec)
+ {
+ recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ if (undo_ptr.counter >= recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto retry_fetch;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ if (visi_info)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ }
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ if (undorec->cid >= snapshot->curcid)
+ return TM_Invisible; /* inserted after scan started */
+ }
+ else if (TransactionIdIsInProgress(undorec->xid))
+ return TM_Invisible; /* inserter has not committed yet */
+ else if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /* it must have aborted or crashed */
+ return TM_Invisible;
+ }
+
+ /* The tuple is visible to us. But can we lock it? */
+
+ /*
+ * No conflict with this lock. Look at the previous UNDO record, there
+ * might be more locks.
+ *
+ * FIXME: Shouldn't we drill down to the INSERT record and check if
+ * that's visible to us first, before looking at the lockers?
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ ZSUndoRec_TupleLock *lock_undorec = (ZSUndoRec_TupleLock *) undorec;
+
+ /*
+ * If any subtransaction of the current top transaction already holds
+ * a lock as strong as or stronger than what we're requesting, we
+ * effectively hold the desired lock already. We *must* succeed
+ * without trying to take the tuple lock, else we will deadlock
+ * against anyone wanting to acquire a stronger lock.
+ */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ if (lock_undorec->lockmode >= mode)
+ {
+ *undo_record_needed = true;
+ return TM_Ok;
+ }
+ }
+ else if (!zs_tuplelock_compatible(lock_undorec->lockmode, mode) &&
+ TransactionIdIsInProgress(undorec->xid))
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+ return TM_BeingModified;
+ }
+
+ /*
+ * No conflict with this lock. Look at the previous UNDO record, there
+ * might be more locks.
+ *
+ * FIXME: Shouldn't we drill down to the INSERT record and check if
+ * that's visible to us first, before looking at the lockers?
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE)
+ {
+ ZSUndoRec_Delete *deleterec = (ZSUndoRec_Delete *) undorec;
+ if (visi_info)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ }
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ if (undorec->cid >= snapshot->curcid)
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = undorec->cid;
+ return TM_SelfModified; /* deleted/updated after scan started */
+ }
+ else
+ return TM_Invisible; /* deleted before scan started */
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+
+ return TM_BeingModified;
+ }
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /* deleter must have aborted or crashed. We have to keep following the
+ * undo chain, in case there are LOCK records that are still visible
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+ if (deleterec->changedPart)
+ {
+ ItemPointerSet(&tmfd->ctid, MovedPartitionsBlockNumber, MovedPartitionsOffsetNumber);
+ *next_tid = InvalidZSTid;
+ return TM_Updated;
+ }
+ else
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ return TM_Deleted;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ /* updated-away tuple */
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ LockTupleMode old_lockmode;
+ if (visi_info)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ }
+
+ *next_tid = updaterec->newtid;
+ old_lockmode = updaterec->key_update ? LockTupleExclusive : LockTupleNoKeyExclusive;
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ if (zs_tuplelock_compatible(old_lockmode, mode))
+ return TM_Ok;
+
+ if (undorec->cid >= snapshot->curcid)
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = undorec->cid;
+ return TM_SelfModified; /* deleted/updated after scan started */
+ }
+ else
+ return TM_Invisible; /* deleted before scan started */
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ {
+ if (zs_tuplelock_compatible(old_lockmode, mode))
+ return TM_Ok;
+
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+
+ return TM_BeingModified;
+ }
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /* deleter must have aborted or crashed. We have to keep following the
+ * undo chain, in case there are LOCK records that are still visible
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ if (zs_tuplelock_compatible(old_lockmode, mode))
+ return TM_Ok;
+
+ tmfd->ctid = ItemPointerFromZSTid(((ZSUndoRec_Update *) undorec)->newtid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+ return TM_Updated;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+
+/*
+ * Like HeapTupleSatisfiesAny
+ */
+static bool
+zs_SatisfiesAny(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr, ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ undo_ptr = item_undoptr;
+
+fetch_undo_record:
+ /* If this record is "old", then the record is visible. */
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ return true;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE ||
+ undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+
+ return true;
+}
+
+/*
+ * helper function to zs_SatisfiesMVCC(), to check if the given XID
+ * is visible to the snapshot.
+ */
+static bool
+xid_is_visible(Snapshot snapshot, TransactionId xid, CommandId cid, bool *aborted)
+{
+ *aborted = false;
+ if (TransactionIdIsCurrentTransactionId(xid))
+ {
+ if (cid >= snapshot->curcid)
+ return false;
+ else
+ return true;
+ }
+ else if (XidInMVCCSnapshot(xid, snapshot))
+ return false;
+ else if (TransactionIdDidCommit(xid))
+ {
+ return true;
+ }
+ else
+ {
+ /* it must have aborted or crashed */
+ *aborted = true;
+ return false;
+ }
+}
+
+/*
+ * Like HeapTupleSatisfiesMVCC
+ */
+static bool
+zs_SatisfiesMVCC(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr,
+ TransactionId *obsoleting_xid, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ Snapshot snapshot = scan->snapshot;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+ bool aborted;
+
+ undo_ptr = item_undoptr;
+
+fetch_undo_record:
+ /* If this record is "old", then the record is visible. */
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ /* Inserted tuple */
+ bool result;
+ result = xid_is_visible(snapshot, undorec->xid, undorec->cid, &aborted);
+ if (!result && !aborted)
+ *obsoleting_xid = undorec->xid;
+
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ return result;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* we don't care about tuple locks here. Follow the link to the
+ * previous UNDO record for this tuple. */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ if (next_tid)
+ *next_tid = updaterec->newtid;
+ }
+
+ /*
+ * Deleted or updated-away. They are treated the same in an MVCC snapshot.
+ * They only need different treatment when updating or locking the row,
+ * in SatisfiesUpdate().
+ */
+ if (xid_is_visible(snapshot, undorec->xid, undorec->cid, &aborted))
+ {
+ /* we can see the deletion */
+ return false;
+ }
+ else
+ {
+ if (!aborted)
+ *obsoleting_xid = undorec->xid;
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * Like HeapTupleSatisfiesSelf
+ */
+static bool
+zs_SatisfiesSelf(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr,
+ zstid *next_tid, ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ ZSUndoRec *undorec;
+ ZSUndoRecPtr undo_ptr;
+
+ undo_ptr = item_undoptr;
+
+fetch_undo_record:
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+
+ /* Inserted tuple */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true; /* inserted by me */
+ else if (TransactionIdIsInProgress(undorec->xid))
+ return false;
+ else if (TransactionIdDidCommit(undorec->xid))
+ return true;
+ else
+ {
+ /* it must have aborted or crashed */
+ return false;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* we don't care about tuple locks here. Follow the link to the
+ * previous UNDO record for this tuple. */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ if (next_tid)
+ *next_tid = updaterec->newtid;
+ }
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ /* deleted by me */
+ return false;
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true;
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * Deleter must have aborted or crashed. But we have to keep following the
+ * undo chain, to check if the insertion was visible in the first
+ * place.
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ return false;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * Like HeapTupleSatisfiesDirty
+ */
+static bool
+zs_SatisfiesDirty(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr,
+ zstid *next_tid, ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ Snapshot snapshot = scan->snapshot;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ snapshot->xmin = snapshot->xmax = InvalidTransactionId;
+ snapshot->speculativeToken = INVALID_SPECULATIVE_TOKEN;
+
+ undo_ptr = item_undoptr;
+
+fetch_undo_record:
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ ZSUndoRec_Insert *insertrec = (ZSUndoRec_Insert *) undorec;
+ snapshot->speculativeToken = insertrec->speculative_token;
+
+ /*
+ * HACK: For SnapshotDirty we need to set xmin/xmax/... in the snapshot
+ * itself, based on the tuple. Hence, we can't set the visi_info values
+ * here the same way as for other snapshot types; we only set them in the
+ * TransactionIdIsInProgress() cases below.
+ */
+
+ /* Inserted tuple */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true; /* inserted by me */
+ else if (TransactionIdIsInProgress(undorec->xid))
+ {
+ snapshot->xmin = undorec->xid;
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ return true;
+ }
+ else if (TransactionIdDidCommit(undorec->xid))
+ {
+ return true;
+ }
+ else
+ {
+ /* it must have aborted or crashed */
+ return false;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* locked tuple. */
+ /* look at the previous UNDO record to find the insert record */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ if (next_tid)
+ *next_tid = updaterec->newtid;
+ }
+
+ /* deleted or updated-away tuple */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ /* deleted by me */
+ return false;
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ {
+ /*
+ * TODO: is it necessary to set the snapshot's xmax here? It gets
+ * populated in the snapshot from visi_info later, by the caller.
+ */
+ snapshot->xmax = undorec->xid;
+ visi_info->xmax = undorec->xid;
+ return true;
+ }
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * Deleter must have aborted or crashed. But we have to keep following the
+ * undo chain, to check if the insertion was visible in the first
+ * place.
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ return false;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * True if tuple might be visible to some transaction; false if it's
+ * surely dead to everyone, ie, vacuumable.
+ */
+static bool
+zs_SatisfiesNonVacuumable(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr,
+ ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ TransactionId OldestXmin = scan->snapshot->xmin;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ Assert(TransactionIdIsValid(OldestXmin));
+
+ undo_ptr = item_undoptr;
+
+fetch_undo_record:
+
+ /* Is it visible? */
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+
+ /* Inserted tuple */
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true; /* inserter has not committed yet */
+
+ if (TransactionIdDidCommit(undorec->xid))
+ return true;
+
+ /* it must have aborted or crashed */
+ return false;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ /* deleted or updated-away tuple */
+ ZSUndoRecPtr prevptr;
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true; /* delete-in-progress */
+ else if (TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * Deleter committed. But perhaps it was recent enough that some open
+ * transactions could still see the tuple.
+ */
+ if (!TransactionIdPrecedes(undorec->xid, OldestXmin))
+ {
+ visi_info->nonvacuumable_status = ZSNV_RECENTLY_DEAD;
+ return true;
+ }
+
+ return false;
+ }
+
+ /*
+ * The deleting transaction did not commit. But before concluding
+ * that the tuple is live, we have to check if the inserting
+ * XID is live.
+ */
+ do {
+ prevptr = undorec->prevundorec;
+
+ if (prevptr.counter < scan->recent_oldest_undo.counter)
+ return true;
+ undorec = zsundo_fetch(rel, prevptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel);
+ if (prevptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ prevptr.counter, prevptr.blkno, prevptr.offset);
+ return true;
+ }
+ } while(undorec->type == ZSUNDO_TYPE_TUPLE_LOCK);
+
+ Assert(undorec->type == ZSUNDO_TYPE_INSERT);
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true; /* insert-in-progress */
+ else if (TransactionIdDidCommit(undorec->xid))
+ return true; /* inserter committed */
+
+ /* inserter must have aborted or crashed */
+ return false;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* look at the previous UNDO record, to find the Insert record */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * Like HeapTupleSatisfiesVisibility
+ *
+ * If next_tid is not NULL, and the tuple was UPDATEd, *next_tid is set to
+ * the TID of the new row version.
+ */
+bool
+zs_SatisfiesVisibility(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr,
+ TransactionId *obsoleting_xid, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info)
+{
+ ZSUndoRecPtr undo_ptr;
+
+	/* initialize as invalid; if we find a valid one, populate it below */
+ if (next_tid)
+ *next_tid = InvalidZSTid;
+
+ /* The caller should've filled in the recent_oldest_undo pointer */
+ Assert(scan->recent_oldest_undo.counter != 0);
+
+ *obsoleting_xid = InvalidTransactionId;
+
+	/*
+	 * Items with an invalid undo record are considered visible. Normally,
+	 * only the META column stores a valid undo record; all other columns
+	 * store an invalid undo pointer. The visibility check is performed on the
+	 * META column, and the rest of the columns are fetched only if it is
+	 * visible. For in-place updates, columns other than the META column may
+	 * have a valid undo record, in which case the visibility check needs to
+	 * be performed on them as well. The META column can sometimes also have
+	 * items with an invalid undo pointer, see zsbt_undo_item_deletion().
+	 */
+ undo_ptr = item_undoptr;
+ if (!IsZSUndoRecPtrValid(&undo_ptr))
+ return true;
+
+ switch (scan->snapshot->snapshot_type)
+ {
+ case SNAPSHOT_MVCC:
+ return zs_SatisfiesMVCC(scan, item_undoptr, obsoleting_xid, next_tid, visi_info);
+
+ case SNAPSHOT_SELF:
+ return zs_SatisfiesSelf(scan, item_undoptr, next_tid, visi_info);
+
+ case SNAPSHOT_ANY:
+ return zs_SatisfiesAny(scan, item_undoptr, visi_info);
+
+ case SNAPSHOT_TOAST:
+ elog(ERROR, "SnapshotToast not implemented in zedstore");
+ break;
+
+ case SNAPSHOT_DIRTY:
+ return zs_SatisfiesDirty(scan, item_undoptr, next_tid, visi_info);
+
+ case SNAPSHOT_HISTORIC_MVCC:
+ elog(ERROR, "SnapshotHistoricMVCC not implemented in zedstore yet");
+ break;
+
+ case SNAPSHOT_NON_VACUUMABLE:
+ return zs_SatisfiesNonVacuumable(scan, item_undoptr, visi_info);
+ }
+
+ return false; /* keep compiler quiet */
+}
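+
+/*
+ * Illustrative sketch (not part of the patch): a caller that has set up a
+ * ZSTidTreeScan and filled in scan->recent_oldest_undo would drive the
+ * visibility check for one item roughly like this:
+ *
+ *    TransactionId obsoleting_xid;
+ *    zstid         next_tid;
+ *    ZSUndoSlotVisibility visi_info = InvalidUndoSlotVisibility;
+ *
+ *    if (zs_SatisfiesVisibility(scan, item_undoptr,
+ *                               &obsoleting_xid, &next_tid, &visi_info))
+ *    {
+ *        // item is visible to scan->snapshot; go fetch the attribute data
+ *    }
+ */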
diff --git a/src/backend/access/zedstore/zedstore_wal.c b/src/backend/access/zedstore/zedstore_wal.c
new file mode 100644
index 00000000000..761e2c84781
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_wal.c
@@ -0,0 +1,32 @@
+/*
+ * zedstore_wal.c
+ * WAL-logging for zedstore.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_wal.c
+ */
+#include "postgres.h"
+
+#include "access/xlogreader.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_wal.h"
+#include "lib/stringinfo.h"
+
+void
+zedstore_redo(XLogReaderState *record)
+{
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ switch (info)
+ {
+ case WAL_ZEDSTORE_INIT_METAPAGE:
+ zsmeta_initmetapage_redo(record);
+ break;
+ default:
+ elog(PANIC, "zedstore_redo: unknown op code %u", info);
+ }
+}
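+
+/*
+ * Sketch (not part of the patch): as more of zedstore becomes WAL-logged,
+ * each new record type would get its own WAL_ZEDSTORE_* opcode and redo
+ * routine, plus another arm in the switch above, e.g. (names hypothetical):
+ *
+ *    case WAL_ZEDSTORE_BTREE_ADD_LEAF_ITEMS:
+ *        zsbt_add_leaf_items_redo(record);
+ *        break;
+ */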
diff --git a/src/backend/access/zedstore/zedstoream_handler.c b/src/backend/access/zedstore/zedstoream_handler.c
new file mode 100644
index 00000000000..83517e5ca56
--- /dev/null
+++ b/src/backend/access/zedstore/zedstoream_handler.c
@@ -0,0 +1,3245 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstoream_handler.c
+ * ZedStore table access method code
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstoream_handler.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "miscadmin.h"
+
+#include "access/genam.h"
+#include "access/heapam.h"
+#include "access/multixact.h"
+#include "access/relscan.h"
+#include "access/tableam.h"
+#include "access/tsmapi.h"
+#include "access/tupdesc_details.h"
+#include "access/tuptoaster.h"
+#include "access/xact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undo.h"
+#include "catalog/catalog.h"
+#include "catalog/index.h"
+#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
+#include "commands/progress.h"
+#include "commands/vacuum.h"
+#include "executor/executor.h"
+#include "optimizer/plancat.h"
+#include "pgstat.h"
+#include "storage/lmgr.h"
+#include "storage/predicate.h"
+#include "storage/procarray.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+
+
+typedef enum
+{
+ ZSSCAN_STATE_UNSTARTED,
+ ZSSCAN_STATE_SCANNING,
+ ZSSCAN_STATE_FINISHED_RANGE,
+ ZSSCAN_STATE_FINISHED
+} zs_scan_state;
+
+typedef struct ZedStoreProjectData
+{
+ int num_proj_atts;
+ Bitmapset *project_columns;
+ int *proj_atts;
+ ZSTidTreeScan tid_scan;
+ ZSAttrTreeScan *attr_scans;
+ MemoryContext context;
+} ZedStoreProjectData;
+
+typedef struct ZedStoreDescData
+{
+ /* scan parameters */
+	TableScanDescData rs_scan;	/* AM-independent part of the scan descriptor */
+ ZedStoreProjectData proj_data;
+
+ bool started;
+ zstid cur_range_start;
+ zstid cur_range_end;
+
+ /* These fields are used for bitmap scans, to hold a "block's" worth of data */
+#define MAX_ITEMS_PER_LOGICAL_BLOCK MaxHeapTuplesPerPage
+ int bmscan_ntuples;
+ zstid *bmscan_tids;
+ Datum **bmscan_datums;
+ bool **bmscan_isnulls;
+ int bmscan_nexttuple;
+
+	/* These fields are used for TABLESAMPLE scans */
+ zstid max_tid_to_scan;
+ zstid next_tid_to_scan;
+
+} ZedStoreDescData;
+
+typedef struct ZedStoreDescData *ZedStoreDesc;
+
+typedef struct ZedStoreIndexFetchData
+{
+ IndexFetchTableData idx_fetch_data;
+ ZedStoreProjectData proj_data;
+} ZedStoreIndexFetchData;
+
+typedef struct ZedStoreIndexFetchData *ZedStoreIndexFetch;
+
+typedef struct ParallelZSScanDescData *ParallelZSScanDesc;
+
+static IndexFetchTableData *zedstoream_begin_index_fetch(Relation rel);
+static void zedstoream_end_index_fetch(IndexFetchTableData *scan);
+static bool zedstoream_fetch_row(ZedStoreIndexFetchData *fetch,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot);
+
+static Size zs_parallelscan_estimate(Relation rel);
+static Size zs_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan);
+static void zs_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan);
+static bool zs_parallelscan_nextrange(Relation rel, ParallelZSScanDesc pzscan,
+ zstid *start, zstid *end);
+static void zsbt_fill_missing_attribute_value(TupleDesc tupleDesc, int attno, Datum *datum, bool *isnull);
+
+/* Typedef for callback function for getnextslot() */
+typedef void (*GetNextSlotCallback) (ZSTidTreeScan *scan, zstid tid, void *state);
+
+/* ----------------------------------------------------------------
+ * storage AM support routines for zedstoream
+ * ----------------------------------------------------------------
+ */
+
+static bool
+zedstoream_fetch_row_version(Relation rel,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot)
+{
+ IndexFetchTableData *fetcher;
+ bool result;
+
+ fetcher = zedstoream_begin_index_fetch(rel);
+
+ result = zedstoream_fetch_row((ZedStoreIndexFetchData *) fetcher,
+ tid_p, snapshot, slot);
+ if (result)
+ {
+ /* FIXME: heapam acquires the predicate lock first, and then
+ * calls CheckForSerializableConflictOut(). We do it in the
+ * opposite order, because CheckForSerializableConflictOut()
+ * call as done in zsbt_get_last_tid() already. Does it matter?
+ * I'm not sure.
+ */
+ PredicateLockTID(rel, tid_p, snapshot);
+ }
+ ExecMaterializeSlot(slot);
+ slot->tts_tableOid = RelationGetRelid(rel);
+ slot->tts_tid = *tid_p;
+
+ zedstoream_end_index_fetch(fetcher);
+
+ return result;
+}
+
+static void
+zedstoream_get_latest_tid(TableScanDesc sscan,
+ ItemPointer tid)
+{
+ zstid ztid = ZSTidFromItemPointer(*tid);
+ zsbt_find_latest_tid(sscan->rs_rd, &ztid, sscan->rs_snapshot);
+ *tid = ItemPointerFromZSTid(ztid);
+}
+
+static inline void
+zedstoream_insert_internal(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, struct BulkInsertStateData *bistate, uint32 speculative_token)
+{
+ AttrNumber attno;
+ Datum *d;
+ bool *isnulls;
+ zstid tid;
+ TransactionId xid = GetCurrentTransactionId();
+ bool isnull;
+ Datum datum;
+ MemoryContext oldcontext;
+ MemoryContext insert_mcontext;
+
+	/*
+	 * The insert code performs allocations while creating and merging items.
+	 * These are small allocations, but they add up with the number of
+	 * columns and rows being inserted. Hence, create a memory context to
+	 * track them and free them wholesale, instead of freeing them retail.
+	 * TODO: in the long term, see if we can avoid creating the context here,
+	 * free retail in the normal case, and only create a context for page
+	 * splits, maybe.
+	 */
+ insert_mcontext = AllocSetContextCreate(CurrentMemoryContext,
+ "ZedstoreAMContext",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(insert_mcontext);
+
+ if (slot->tts_tupleDescriptor->natts != relation->rd_att->natts)
+ elog(ERROR, "slot's attribute count doesn't match relcache entry");
+
+ slot_getallattrs(slot);
+ d = slot->tts_values;
+ isnulls = slot->tts_isnull;
+
+ tid = InvalidZSTid;
+
+ isnull = true;
+ zsbt_tid_multi_insert(relation,
+ &tid, 1,
+ xid, cid, speculative_token, InvalidUndoPtr);
+
+ /*
+ * We only need to check for table-level SSI locks. Our
+ * new tuple can't possibly conflict with existing tuple locks, and
+ * page locks are only consolidated versions of tuple locks; they do not
+ * lock "gaps" as index page locks do.
+ */
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
+
+ for (attno = 1; attno <= relation->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr(slot->tts_tupleDescriptor, attno - 1);
+ datum = d[attno - 1];
+ isnull = isnulls[attno - 1];
+
+ if (!isnull && attr->attlen < 0 && VARATT_IS_EXTERNAL(datum))
+ datum = PointerGetDatum(heap_tuple_fetch_attr((struct varlena *) DatumGetPointer(datum)));
+
+ /* If this datum is too large, toast it */
+ if (!isnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(datum) > MaxZedStoreDatumSize)
+ {
+ datum = zedstore_toast_datum(relation, attno, datum, tid);
+ }
+
+ zsbt_attr_multi_insert(relation, attno,
+ &datum, &isnull, &tid, 1);
+ }
+
+ slot->tts_tableOid = RelationGetRelid(relation);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(insert_mcontext);
+
+ /* Note: speculative insertions are counted too, even if aborted later */
+ pgstat_count_heap_insert(relation, 1);
+}
+
+static void
+zedstoream_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, struct BulkInsertStateData *bistate)
+{
+ zedstoream_insert_internal(relation, slot, cid, options, bistate, INVALID_SPECULATIVE_TOKEN);
+}
+
+static void
+zedstoream_insert_speculative(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, BulkInsertState bistate, uint32 specToken)
+{
+ zedstoream_insert_internal(relation, slot, cid, options, bistate, specToken);
+}
+
+static void
+zedstoream_complete_speculative(Relation relation, TupleTableSlot *slot, uint32 spekToken,
+ bool succeeded)
+{
+ zstid tid;
+
+ tid = ZSTidFromItemPointer(slot->tts_tid);
+ zsbt_tid_clear_speculative_token(relation, tid, spekToken, true /* for complete */);
+	/*
+	 * If the speculative insertion failed (there was a conflict), mark the
+	 * tuple as dead.
+	 */
+ if (!succeeded)
+ {
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(relation);
+
+ zsbt_tid_mark_dead(relation, tid, recent_oldest_undo);
+ }
+}
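+
+/*
+ * Sketch (not part of the patch) of how the executor drives speculative
+ * insertion (INSERT ... ON CONFLICT) through the two callbacks above; the
+ * token comes from the speculative-insertion lock machinery:
+ *
+ *    uint32 token = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
+ *    zedstoream_insert_speculative(rel, slot, cid, 0, NULL, token);
+ *    ... probe the indexes for conflicts ...
+ *    zedstoream_complete_speculative(rel, slot, token, !conflict);
+ */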
+
+static void
+zedstoream_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
+ CommandId cid, int options, BulkInsertState bistate)
+{
+ AttrNumber attno;
+ int i;
+ bool slotgetandset = true;
+ TransactionId xid = GetCurrentTransactionId();
+ Datum *datums;
+ bool *isnulls;
+ zstid *tids;
+
+ datums = palloc0(ntuples * sizeof(Datum));
+ isnulls = palloc(ntuples * sizeof(bool));
+ tids = palloc0(ntuples * sizeof(zstid));
+
+ for (i = 0; i < ntuples; i++)
+ isnulls[i] = true;
+
+ zsbt_tid_multi_insert(relation, tids, ntuples,
+ xid, cid, INVALID_SPECULATIVE_TOKEN, InvalidUndoPtr);
+
+ /*
+ * We only need to check for table-level SSI locks. Our
+ * new tuple can't possibly conflict with existing tuple locks, and
+ * page locks are only consolidated versions of tuple locks; they do not
+ * lock "gaps" as index page locks do.
+ */
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
+
+ for (attno = 1; attno <= relation->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr((slots[0])->tts_tupleDescriptor, attno - 1);
+
+ for (i = 0; i < ntuples; i++)
+ {
+ Datum datum = slots[i]->tts_values[attno - 1];
+ bool isnull = slots[i]->tts_isnull[attno - 1];
+
+ if (slotgetandset)
+ {
+ slot_getallattrs(slots[i]);
+ }
+
+ /* If this datum is too large, toast it */
+ if (!isnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(datum) > MaxZedStoreDatumSize)
+ {
+ datum = zedstore_toast_datum(relation, attno, datum, tids[i]);
+ }
+ datums[i] = datum;
+ isnulls[i] = isnull;
+ }
+
+ zsbt_attr_multi_insert(relation, attno,
+ datums, isnulls, tids, ntuples);
+
+ slotgetandset = false;
+ }
+
+ for (i = 0; i < ntuples; i++)
+ {
+ slots[i]->tts_tableOid = RelationGetRelid(relation);
+ slots[i]->tts_tid = ItemPointerFromZSTid(tids[i]);
+ }
+
+ pgstat_count_heap_insert(relation, ntuples);
+
+ pfree(tids);
+ pfree(datums);
+ pfree(isnulls);
+}
+
+static TM_Result
+zedstoream_delete(Relation relation, ItemPointer tid_p, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart)
+{
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ TM_Result result = TM_Ok;
+
+retry:
+ result = zsbt_tid_delete(relation, tid, xid, cid,
+ snapshot, crosscheck, wait, hufd, changingPart);
+
+ if (result != TM_Ok)
+ {
+ if (result == TM_Invisible)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("attempted to delete invisible tuple")));
+ else if (result == TM_BeingModified && wait)
+ {
+ TransactionId xwait = hufd->xmax;
+
+ /* TODO: use something like heap_acquire_tuplock() for priority */
+ if (!TransactionIdIsCurrentTransactionId(xwait))
+ {
+ XactLockTableWait(xwait, relation, tid_p, XLTW_Delete);
+ goto retry;
+ }
+ }
+ }
+
+ /*
+ * Check for SSI conflicts.
+ */
+ CheckForSerializableConflictIn(relation, tid_p, ItemPointerGetBlockNumber(tid_p));
+
+ if (result == TM_Ok)
+ pgstat_count_heap_delete(relation);
+
+ return result;
+}
+
+
+/*
+ * Each tuple lock mode has a corresponding heavyweight lock, and one or two
+ * corresponding MultiXactStatuses (one to merely lock tuples, another one to
+ * update them). This table (and the macros below) helps us determine the
+ * heavyweight lock mode and MultiXactStatus values to use for any particular
+ * tuple lock strength.
+ *
+ * Don't look at lockstatus/updstatus directly! Use get_mxact_status_for_lock
+ * instead.
+ */
+static const struct
+{
+ LOCKMODE hwlock;
+ int lockstatus;
+ int updstatus;
+}
+
+ tupleLockExtraInfo[MaxLockTupleMode + 1] =
+{
+ { /* LockTupleKeyShare */
+ AccessShareLock,
+ MultiXactStatusForKeyShare,
+ -1 /* KeyShare does not allow updating tuples */
+ },
+ { /* LockTupleShare */
+ RowShareLock,
+ MultiXactStatusForShare,
+ -1 /* Share does not allow updating tuples */
+ },
+ { /* LockTupleNoKeyExclusive */
+ ExclusiveLock,
+ MultiXactStatusForNoKeyUpdate,
+ MultiXactStatusNoKeyUpdate
+ },
+ { /* LockTupleExclusive */
+ AccessExclusiveLock,
+ MultiXactStatusForUpdate,
+ MultiXactStatusUpdate
+ }
+};
+
+
+/*
+ * Acquire heavyweight locks on tuples, using a LockTupleMode strength value.
+ * This is more readable than having every caller translate it to lock.h's
+ * LOCKMODE.
+ */
+#define LockTupleTuplock(rel, tup, mode) \
+ LockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
+#define UnlockTupleTuplock(rel, tup, mode) \
+ UnlockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
+#define ConditionalLockTupleTuplock(rel, tup, mode) \
+ ConditionalLockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
+
+/*
+ * Acquire heavyweight lock on the given tuple, in preparation for acquiring
+ * its normal, Xmax-based tuple lock.
+ *
+ * have_tuple_lock is an input and output parameter: on input, it indicates
+ * whether the lock has previously been acquired (and this function does
+ * nothing in that case). If this function returns success, have_tuple_lock
+ * has been flipped to true.
+ *
+ * Returns false if it was unable to obtain the lock; this can only happen if
+ * wait_policy is Skip.
+ *
+ * XXX: This is identical to heap_acquire_tuplock
+ */
+
+static bool
+zs_acquire_tuplock(Relation relation, ItemPointer tid, LockTupleMode mode,
+ LockWaitPolicy wait_policy, bool *have_tuple_lock)
+{
+ if (*have_tuple_lock)
+ return true;
+
+ switch (wait_policy)
+ {
+ case LockWaitBlock:
+ LockTupleTuplock(relation, tid, mode);
+ break;
+
+ case LockWaitSkip:
+ if (!ConditionalLockTupleTuplock(relation, tid, mode))
+ return false;
+ break;
+
+ case LockWaitError:
+ if (!ConditionalLockTupleTuplock(relation, tid, mode))
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("could not obtain lock on row in relation \"%s\"",
+ RelationGetRelationName(relation))));
+ break;
+ }
+ *have_tuple_lock = true;
+
+ return true;
+}
+
+
+static TM_Result
+zedstoream_lock_tuple(Relation relation, ItemPointer tid_p, Snapshot snapshot,
+ TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
+ LockWaitPolicy wait_policy, uint8 flags,
+ TM_FailureData *tmfd)
+{
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ TM_Result result;
+ bool have_tuple_lock = false;
+ zstid next_tid = tid;
+ SnapshotData SnapshotDirty;
+ bool locked_something = false;
+ ZSUndoSlotVisibility visi_info = InvalidUndoSlotVisibility;
+ bool follow_updates = false;
+
+ slot->tts_tableOid = RelationGetRelid(relation);
+ slot->tts_tid = *tid_p;
+
+ tmfd->traversed = false;
+ /*
+ * For now, we lock just the first attribute. As long as everyone
+ * does that, that's enough.
+ */
+retry:
+ result = zsbt_tid_lock(relation, tid, xid, cid, mode, follow_updates,
+ snapshot, tmfd, &next_tid, &visi_info);
+
+ ((ZedstoreTupleTableSlot*)slot)->xmin = visi_info.xmin;
+ ((ZedstoreTupleTableSlot*)slot)->cmin = visi_info.cmin;
+
+ if (result == TM_Invisible)
+ {
+ /*
+ * This is possible, but only when locking a tuple for ON CONFLICT
+ * UPDATE and some other cases handled below. We return this value
+ * here rather than throwing an error in order to give that case the
+ * opportunity to throw a more specific error.
+ */
+		/*
+		 * This can also happen if we're locking an UPDATE chain in KEY SHARE
+		 * mode: a tuple has been inserted, and then updated, by a different
+		 * transaction. The updating transaction is still in progress. We can
+		 * lock the row in KEY SHARE mode, assuming the key columns were not
+		 * updated, and we will try to lock all the row versions, even the
+		 * still-in-progress UPDATEs. It's possible that an UPDATE aborts
+		 * while we're chasing the update chain, so that the updated tuple
+		 * becomes invisible to us. That's OK.
+		 */
+ if (mode == LockTupleKeyShare && locked_something)
+ return TM_Ok;
+
+		/*
+		 * This can also happen if the caller asked for the latest version of
+		 * the tuple, and the tuple was inserted by our own transaction. In
+		 * that case we have to check cmin against cid: cmin >= current CID
+		 * means our command cannot see the tuple, so we should ignore it.
+		 */
+ Assert(visi_info.cmin != InvalidCommandId);
+ if ((flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION) != 0 &&
+ TransactionIdIsCurrentTransactionId(visi_info.xmin) &&
+ visi_info.cmin >= cid)
+ {
+ tmfd->xmax = visi_info.xmin;
+ tmfd->cmax = visi_info.cmin;
+ return TM_SelfModified;
+ }
+
+ return TM_Invisible;
+ }
+ else if (result == TM_Updated ||
+ (result == TM_SelfModified && tmfd->cmax >= cid))
+ {
+ /*
+ * The other transaction is an update and it already committed.
+ *
+ * If the caller asked for the latest version, find it.
+ */
+ if ((flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION) != 0 && next_tid != tid)
+ {
+ if (have_tuple_lock)
+ {
+ UnlockTupleTuplock(relation, tid_p, mode);
+ have_tuple_lock = false;
+ }
+
+ if (ItemPointerIndicatesMovedPartitions(&tmfd->ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
+ /* it was updated, so look at the updated version */
+ *tid_p = ItemPointerFromZSTid(next_tid);
+
+ /* signal that a tuple later in the chain is getting locked */
+ tmfd->traversed = true;
+
+ /* loop back to fetch next in chain */
+
+ /* FIXME: In the corresponding code in heapam, we cross-check the xmin/xmax
+ * of the old and new tuple. Should we do the same here?
+ */
+
+ InitDirtySnapshot(SnapshotDirty);
+ snapshot = &SnapshotDirty;
+ tid = next_tid;
+ goto retry;
+ }
+
+ return result;
+ }
+ else if (result == TM_Deleted)
+ {
+ /*
+ * The other transaction is a delete and it already committed.
+ */
+ return result;
+ }
+ else if (result == TM_BeingModified)
+ {
+ TransactionId xwait = tmfd->xmax;
+
+ /*
+ * Acquire tuple lock to establish our priority for the tuple, or
+ * die trying. LockTuple will release us when we are next-in-line
+ * for the tuple. We must do this even if we are share-locking.
+ *
+ * If we are forced to "start over" below, we keep the tuple lock;
+ * this arranges that we stay at the head of the line while
+ * rechecking tuple state.
+ */
+ if (!zs_acquire_tuplock(relation, tid_p, mode, wait_policy,
+ &have_tuple_lock))
+ {
+ /*
+ * This can only happen if wait_policy is Skip and the lock
+ * couldn't be obtained.
+ */
+ return TM_WouldBlock;
+ }
+
+ /* wait for regular transaction to end, or die trying */
+ switch (wait_policy)
+ {
+ case LockWaitBlock:
+ XactLockTableWait(xwait, relation, tid_p, XLTW_Lock);
+ break;
+ case LockWaitSkip:
+ if (!ConditionalXactLockTableWait(xwait))
+ {
+ return TM_WouldBlock;
+ }
+ break;
+ case LockWaitError:
+ if (!ConditionalXactLockTableWait(xwait))
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("could not obtain lock on row in relation \"%s\"",
+ RelationGetRelationName(relation))));
+ break;
+ }
+
+ /*
+ * xwait is done. Retry.
+ */
+ goto retry;
+ }
+ if (result == TM_Ok)
+ locked_something = true;
+
+ /*
+ * Now that we have successfully marked the tuple as locked, we can
+ * release the lmgr tuple lock, if we had it.
+ */
+ if (have_tuple_lock)
+ {
+ UnlockTupleTuplock(relation, tid_p, mode);
+ have_tuple_lock = false;
+ }
+
+ if (mode == LockTupleKeyShare)
+ {
+ /* lock all row versions, if it's a KEY SHARE lock */
+ follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
+ if (result == TM_Ok && tid != next_tid && next_tid != InvalidZSTid)
+ {
+ tid = next_tid;
+ goto retry;
+ }
+ }
+
+ /* Fetch the tuple, too. */
+ if (!zedstoream_fetch_row_version(relation, tid_p, SnapshotAny, slot))
+ elog(ERROR, "could not fetch locked tuple");
+
+ return TM_Ok;
+}
+
+/* like heap_tuple_attr_equals */
+static bool
+zs_tuple_attr_equals(int attrnum, TupleTableSlot *slot1, TupleTableSlot *slot2)
+{
+ TupleDesc tupdesc = slot1->tts_tupleDescriptor;
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than tableOID; we cannot expect these to be consistent in a HOT chain,
+ * or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapDetermineModifiedColumns do
+ * a single heap_deform_tuple call on each tuple, instead? But that
+ * doesn't work for system columns ...
+ */
+ value1 = slot_getattr(slot1, attrnum, &isnull1);
+ value2 = slot_getattr(slot2, attrnum, &isnull2);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = TupleDescAttr(tupdesc, attrnum - 1);
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+}
+
+static bool
+is_key_update(Relation relation, TupleTableSlot *oldslot, TupleTableSlot *newslot)
+{
+ Bitmapset *key_attrs;
+ Bitmapset *interesting_attrs;
+ Bitmapset *modified_attrs;
+ int attnum;
+
+ /*
+ * Fetch the list of attributes to be checked for various operations.
+ *
+ * For HOT considerations, this is wasted effort if we fail to update or
+ * have to put the new tuple on a different page. But we must compute the
+ * list before obtaining buffer lock --- in the worst case, if we are
+ * doing an update on one of the relevant system catalogs, we could
+ * deadlock if we try to fetch the list later. In any case, the relcache
+ * caches the data so this is usually pretty cheap.
+ *
+ * We also need columns used by the replica identity and columns that are
+ * considered the "key" of rows in the table.
+ *
+ * Note that we get copies of each bitmap, so we need not worry about
+ * relcache flush happening midway through.
+ */
+ key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
+
+ interesting_attrs = NULL;
+ interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
+
+ /* Determine columns modified by the update. */
+ modified_attrs = NULL;
+ while ((attnum = bms_first_member(interesting_attrs)) >= 0)
+ {
+ attnum += FirstLowInvalidHeapAttributeNumber;
+
+ if (!zs_tuple_attr_equals(attnum, oldslot, newslot))
+ modified_attrs = bms_add_member(modified_attrs,
+ attnum - FirstLowInvalidHeapAttributeNumber);
+ }
+
+ return bms_overlap(modified_attrs, key_attrs);
+}
+
+static TM_Result
+zedstoream_update(Relation relation, ItemPointer otid_p, TupleTableSlot *slot,
+ CommandId cid, Snapshot snapshot, Snapshot crosscheck,
+ bool wait, TM_FailureData *hufd,
+ LockTupleMode *lockmode, bool *update_indexes)
+{
+ zstid otid = ZSTidFromItemPointer(*otid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ AttrNumber attno;
+ bool key_update;
+ Datum *d;
+ bool *isnulls;
+ TM_Result result;
+ zstid newtid;
+ TupleTableSlot *oldslot;
+ IndexFetchTableData *fetcher;
+ MemoryContext oldcontext;
+ MemoryContext insert_mcontext;
+
+	/*
+	 * The update code performs allocations while creating and merging items,
+	 * just like insert. These are small allocations, but they add up with the
+	 * number of columns and rows involved. Hence, create a memory context to
+	 * track them and free them wholesale, instead of freeing them retail.
+	 * TODO: in the long term, see if we can avoid creating the context here,
+	 * free retail in the normal case, and only create a context for page
+	 * splits, maybe.
+	 */
+ insert_mcontext = AllocSetContextCreate(CurrentMemoryContext,
+ "ZedstoreAMContext",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(insert_mcontext);
+
+ slot_getallattrs(slot);
+ d = slot->tts_values;
+ isnulls = slot->tts_isnull;
+
+ oldslot = table_slot_create(relation, NULL);
+ fetcher = zedstoream_begin_index_fetch(relation);
+
+ /*
+ * The meta-attribute holds the visibility information, including the "t_ctid"
+ * pointer to the updated version. All the real attributes are just inserted,
+ * as if for a new row.
+ */
+retry:
+ newtid = InvalidZSTid;
+
+ /*
+ * Fetch the old row, so that we can figure out which columns were modified.
+ *
+ * FIXME: if we have to follow the update chain, we should look at the
+ * currently latest tuple version, rather than the one visible to our snapshot.
+ */
+	if (!zedstoream_fetch_row((ZedStoreIndexFetchData *) fetcher,
+							  otid_p, SnapshotAny, oldslot))
+	{
+		/* release the resources acquired above before bailing out */
+		zedstoream_end_index_fetch(fetcher);
+		ExecDropSingleTupleTableSlot(oldslot);
+		MemoryContextSwitchTo(oldcontext);
+		MemoryContextDelete(insert_mcontext);
+		return TM_Invisible;
+	}
+ key_update = is_key_update(relation, oldslot, slot);
+
+ *lockmode = key_update ? LockTupleExclusive : LockTupleNoKeyExclusive;
+
+ result = zsbt_tid_update(relation, otid,
+ xid, cid, key_update, snapshot, crosscheck,
+ wait, hufd, &newtid);
+
+ *update_indexes = (result == TM_Ok);
+ if (result == TM_Ok)
+ {
+ /*
+ * Check for SSI conflicts.
+ */
+ CheckForSerializableConflictIn(relation, otid_p, ItemPointerGetBlockNumber(otid_p));
+
+ for (attno = 1; attno <= relation->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr(relation->rd_att, attno - 1);
+ Datum newdatum = d[attno - 1];
+ bool newisnull = isnulls[attno - 1];
+
+ if (!newisnull && attr->attlen < 0 && VARATT_IS_EXTERNAL(newdatum))
+ newdatum = PointerGetDatum(heap_tuple_fetch_attr((struct varlena *) DatumGetPointer(newdatum)));
+
+ /* If this datum is too large, toast it */
+ if (!newisnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(newdatum) > MaxZedStoreDatumSize)
+ {
+ newdatum = zedstore_toast_datum(relation, attno, newdatum, newtid);
+ }
+
+ zsbt_attr_multi_insert(relation, attno,
+ &newdatum, &newisnull, &newtid, 1);
+ }
+
+ slot->tts_tableOid = RelationGetRelid(relation);
+ slot->tts_tid = ItemPointerFromZSTid(newtid);
+
+ pgstat_count_heap_update(relation, false);
+ }
+ else
+ {
+ if (result == TM_Invisible)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("attempted to update invisible tuple")));
+ else if (result == TM_BeingModified && wait)
+ {
+ TransactionId xwait = hufd->xmax;
+
+ /* TODO: use something like heap_acquire_tuplock() for priority */
+ if (!TransactionIdIsCurrentTransactionId(xwait))
+ {
+				XactLockTableWait(xwait, relation, otid_p, XLTW_Update);
+ goto retry;
+ }
+ }
+ }
+
+ zedstoream_end_index_fetch(fetcher);
+ ExecDropSingleTupleTableSlot(oldslot);
+
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(insert_mcontext);
+
+ return result;
+}
+
+static const TupleTableSlotOps *
+zedstoream_slot_callbacks(Relation relation)
+{
+ return &TTSOpsZedstore;
+}
+
+static void
+zs_initialize_proj_attributes(TupleDesc tupledesc, ZedStoreProjectData *proj_data)
+{
+ MemoryContext oldcontext;
+
+ if (proj_data->num_proj_atts != 0)
+ return;
+
+ oldcontext = MemoryContextSwitchTo(proj_data->context);
+ /* add one for meta-attribute */
+ proj_data->proj_atts = palloc((tupledesc->natts + 1) * sizeof(int));
+ proj_data->attr_scans = palloc0(tupledesc->natts * sizeof(ZSAttrTreeScan));
+ proj_data->tid_scan.active = false;
+
+ proj_data->proj_atts[proj_data->num_proj_atts++] = ZS_META_ATTRIBUTE_NUM;
+
+	/*
+	 * Convert the Bitmapset of projected columns into an array of the
+	 * attribute numbers of the required columns.
+	 */
+ for (int idx = 0; idx < tupledesc->natts; idx++)
+ {
+ int att_no = idx + 1;
+
+		/*
+		 * Never project dropped columns; NULL will be returned for them in
+		 * the slot by default.
+		 */
+ if (TupleDescAttr(tupledesc, idx)->attisdropped)
+ continue;
+
+		/* an empty project_columns set also means that all columns are needed */
+ if (proj_data->project_columns == NULL ||
+ bms_is_member(att_no, proj_data->project_columns))
+ proj_data->proj_atts[proj_data->num_proj_atts++] = att_no;
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+}
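+
+/*
+ * Illustration (not part of the patch): for a three-column table where only
+ * columns 1 and 3 are projected, a caller would pass
+ *
+ *    Bitmapset *proj = NULL;
+ *    proj = bms_add_member(proj, 1);
+ *    proj = bms_add_member(proj, 3);
+ *
+ * and zs_initialize_proj_attributes() above would fill proj_atts with
+ * { ZS_META_ATTRIBUTE_NUM, 1, 3 } and set num_proj_atts to 3.
+ */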
+
+static void
+zs_initialize_proj_attributes_extended(ZedStoreDesc scan, TupleDesc tupledesc)
+{
+ MemoryContext oldcontext;
+ ZedStoreProjectData *proj_data = &scan->proj_data;
+
+ /* if already initialized return */
+ if (proj_data->num_proj_atts != 0)
+ return;
+
+ zs_initialize_proj_attributes(tupledesc, proj_data);
+
+ oldcontext = MemoryContextSwitchTo(proj_data->context);
+ /* Extra setup for bitmap and sample scans */
+ if ((scan->rs_scan.rs_flags & SO_TYPE_BITMAPSCAN) ||
+ (scan->rs_scan.rs_flags & SO_TYPE_SAMPLESCAN) ||
+ (scan->rs_scan.rs_flags & SO_TYPE_ANALYZE))
+ {
+ scan->bmscan_ntuples = 0;
+ scan->bmscan_tids = palloc(MAX_ITEMS_PER_LOGICAL_BLOCK * sizeof(zstid));
+
+ scan->bmscan_datums = palloc(proj_data->num_proj_atts * sizeof(Datum *));
+ scan->bmscan_isnulls = palloc(proj_data->num_proj_atts * sizeof(bool *));
+ for (int i = 0; i < proj_data->num_proj_atts; i++)
+ {
+ scan->bmscan_datums[i] = palloc(MAX_ITEMS_PER_LOGICAL_BLOCK * sizeof(Datum));
+ scan->bmscan_isnulls[i] = palloc(MAX_ITEMS_PER_LOGICAL_BLOCK * sizeof(bool));
+ }
+ }
+ MemoryContextSwitchTo(oldcontext);
+}
+
+static TableScanDesc
+zedstoream_beginscan_with_column_projection(Relation relation, Snapshot snapshot,
+ int nkeys, ScanKey key,
+ ParallelTableScanDesc parallel_scan,
+ uint32 flags,
+ Bitmapset *project_columns)
+{
+ ZedStoreDesc scan;
+
+	/* ANALYZE scans have no snapshot, but we need one */
+ if (!snapshot)
+ {
+ Assert(!(flags & SO_TYPE_SAMPLESCAN));
+ snapshot = SnapshotAny;
+ }
+
+ /*
+ * allocate and initialize scan descriptor
+ */
+ scan = (ZedStoreDesc) palloc0(sizeof(ZedStoreDescData));
+
+ scan->rs_scan.rs_rd = relation;
+ scan->rs_scan.rs_snapshot = snapshot;
+ scan->rs_scan.rs_nkeys = nkeys;
+ scan->rs_scan.rs_flags = flags;
+ scan->rs_scan.rs_parallel = parallel_scan;
+
+ /*
+ * we can use page-at-a-time mode if it's an MVCC-safe snapshot
+ */
+
+ /*
+ * we do this here instead of in initscan() because heap_rescan also calls
+ * initscan() and we don't want to allocate memory again
+ */
+ if (nkeys > 0)
+ scan->rs_scan.rs_key = (ScanKey) palloc(sizeof(ScanKeyData) * nkeys);
+ else
+ scan->rs_scan.rs_key = NULL;
+
+ scan->proj_data.context = CurrentMemoryContext;
+ scan->proj_data.project_columns = project_columns;
+
+ /*
+ * For a seqscan in a serializable transaction, acquire a predicate lock
+ * on the entire relation. This is required not only to lock all the
+ * matching tuples, but also to conflict with new insertions into the
+ * table. In an indexscan, we take page locks on the index pages covering
+ * the range specified in the scan qual, but in a heap scan there is
+ * nothing more fine-grained to lock. A bitmap scan is a different story,
+ * there we have already scanned the index and locked the index pages
+ * covering the predicate. But in that case we still have to lock any
+ * matching heap tuples.
+ */
+ if (!(flags & SO_TYPE_BITMAPSCAN) &&
+ !(flags & SO_TYPE_ANALYZE))
+ PredicateLockRelation(relation, snapshot);
+
+ /*
+ * Currently, we don't have a stats counter for bitmap heap scans (but the
+ * underlying bitmap index scans will be counted) or sample scans (we only
+ * update stats for tuple fetches there)
+ */
+ if (!(flags & SO_TYPE_BITMAPSCAN) && !(flags & SO_TYPE_SAMPLESCAN))
+ pgstat_count_heap_scan(relation);
+
+ return (TableScanDesc) scan;
+}
+
+static TableScanDesc
+zedstoream_beginscan(Relation relation, Snapshot snapshot,
+ int nkeys, ScanKey key,
+ ParallelTableScanDesc parallel_scan,
+ uint32 flags)
+{
+ return zedstoream_beginscan_with_column_projection(relation, snapshot,
+ nkeys, key, parallel_scan, flags, NULL);
+}
+
+static void
+zedstoream_endscan(TableScanDesc sscan)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ ZedStoreProjectData *proj_data = &scan->proj_data;
+
+ if (proj_data->proj_atts)
+ pfree(proj_data->proj_atts);
+
+ if (proj_data->num_proj_atts > 0)
+ {
+ zsbt_tid_end_scan(&proj_data->tid_scan);
+ for (int i = 1; i < proj_data->num_proj_atts; i++)
+ zsbt_attr_end_scan(&proj_data->attr_scans[i - 1]);
+ }
+
+ if (scan->rs_scan.rs_flags & SO_TEMP_SNAPSHOT)
+ UnregisterSnapshot(scan->rs_scan.rs_snapshot);
+
+ if (proj_data->attr_scans)
+ pfree(proj_data->attr_scans);
+ pfree(scan);
+}
+
+static void
+zedstoream_rescan(TableScanDesc sscan, struct ScanKeyData *key,
+ bool set_params, bool allow_strat,
+ bool allow_sync, bool allow_pagemode)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+
+ /* these params don't do much in zedstore yet, but whatever */
+ if (set_params)
+ {
+ if (allow_strat)
+ scan->rs_scan.rs_flags |= SO_ALLOW_STRAT;
+ else
+ scan->rs_scan.rs_flags &= ~SO_ALLOW_STRAT;
+
+ if (allow_sync)
+ scan->rs_scan.rs_flags |= SO_ALLOW_SYNC;
+ else
+ scan->rs_scan.rs_flags &= ~SO_ALLOW_SYNC;
+
+ if (allow_pagemode && scan->rs_scan.rs_snapshot &&
+ IsMVCCSnapshot(scan->rs_scan.rs_snapshot))
+ scan->rs_scan.rs_flags |= SO_ALLOW_PAGEMODE;
+ else
+ scan->rs_scan.rs_flags &= ~SO_ALLOW_PAGEMODE;
+ }
+
+ if (scan->proj_data.num_proj_atts > 0)
+ {
+ zsbt_tid_reset_scan(&scan->proj_data.tid_scan,
+ scan->cur_range_start, scan->cur_range_end, scan->cur_range_start - 1);
+ }
+}
+
+static bool
+zedstoream_getnextslot_internal(TableScanDesc sscan, ScanDirection direction,
+ TupleTableSlot *slot, GetNextSlotCallback callback,
+ void *callback_state)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ ZedStoreProjectData *scan_proj = &scan->proj_data;
+ int slot_natts = slot->tts_tupleDescriptor->natts;
+ Datum *slot_values = slot->tts_values;
+ bool *slot_isnull = slot->tts_isnull;
+ zstid this_tid;
+ Datum datum;
+ bool isnull;
+ ZSUndoSlotVisibility *visi_info;
+ uint8 slotno;
+
+ if (direction != ForwardScanDirection && scan->rs_scan.rs_parallel)
+ elog(ERROR, "parallel backward scan not implemented");
+
+ if (!scan->started)
+ {
+ MemoryContext oldcontext;
+
+ zs_initialize_proj_attributes(slot->tts_tupleDescriptor, scan_proj);
+
+ if (scan->rs_scan.rs_parallel)
+ {
+ /* Allocate next range of TIDs to scan */
+ if (!zs_parallelscan_nextrange(scan->rs_scan.rs_rd,
+ (ParallelZSScanDesc) scan->rs_scan.rs_parallel,
+ &scan->cur_range_start, &scan->cur_range_end))
+ {
+ ExecClearTuple(slot);
+ return false;
+ }
+ }
+ else
+ {
+ scan->cur_range_start = MinZSTid;
+ scan->cur_range_end = MaxPlusOneZSTid;
+ }
+
+ oldcontext = MemoryContextSwitchTo(scan_proj->context);
+ zsbt_tid_begin_scan(scan->rs_scan.rs_rd,
+ scan->cur_range_start,
+ scan->cur_range_end,
+ scan->rs_scan.rs_snapshot,
+ &scan_proj->tid_scan);
+ scan_proj->tid_scan.serializable = true;
+ for (int i = 1; i < scan_proj->num_proj_atts; i++)
+ {
+ int attno = scan_proj->proj_atts[i];
+
+ zsbt_attr_begin_scan(scan->rs_scan.rs_rd,
+ slot->tts_tupleDescriptor,
+ attno,
+ &scan_proj->attr_scans[i - 1]);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ scan->started = true;
+ }
+ Assert((scan_proj->num_proj_atts - 1) <= slot_natts);
+
+ /*
+ * Initialize the slot.
+ *
+ * We initialize all columns to NULL. The values for columns that are projected
+ * will be set to the actual values below, but it's important that non-projected
+ * columns are NULL.
+ */
+ ExecClearTuple(slot);
+ for (int i = 0; i < slot_natts; i++)
+ slot_isnull[i] = true;
+
+ /*
+ * Find the next visible TID.
+ */
+ for (;;)
+ {
+ this_tid = zsbt_tid_scan_next(&scan_proj->tid_scan, direction);
+ if (this_tid == InvalidZSTid)
+ {
+ if (scan->rs_scan.rs_parallel)
+ {
+ /* Allocate next range of TIDs to scan */
+ if (!zs_parallelscan_nextrange(scan->rs_scan.rs_rd,
+ (ParallelZSScanDesc) scan->rs_scan.rs_parallel,
+ &scan->cur_range_start, &scan->cur_range_end))
+ {
+ ExecClearTuple(slot);
+ return false;
+ }
+
+ zsbt_tid_reset_scan(&scan_proj->tid_scan,
+ scan->cur_range_start, scan->cur_range_end, scan->cur_range_start - 1);
+ continue;
+ }
+ else
+ {
+ ExecClearTuple(slot);
+ return false;
+ }
+ }
+ Assert (this_tid < scan->cur_range_end);
+ break;
+ }
+
+ if (callback)
+ callback(&scan_proj->tid_scan, this_tid, callback_state);
+
+ /* Note: We don't need to predicate-lock tuples in Serializable mode,
+ * because in a sequential scan, we predicate-locked the whole table.
+ */
+
+ /* Fetch the datums of each attribute for this row */
+ for (int i = 1; i < scan_proj->num_proj_atts; i++)
+ {
+ ZSAttrTreeScan *btscan = &scan_proj->attr_scans[i - 1];
+ Form_pg_attribute attr = btscan->attdesc;
+ int natt;
+
+ if (!zsbt_attr_fetch(btscan, &datum, &isnull, this_tid))
+ zsbt_fill_missing_attribute_value(slot->tts_tupleDescriptor, btscan->attno,
+ &datum, &isnull);
+
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ natt = scan_proj->proj_atts[i];
+
+ if (!isnull && attr->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(scan->rs_scan.rs_rd, natt, this_tid, datum);
+ }
+
+ /* Check that the values coming out of the b-tree are aligned properly */
+ if (!isnull && attr->attlen == -1)
+ {
+ Assert (VARATT_IS_1B(datum) || INTALIGN(datum) == datum);
+ }
+
+ Assert(natt > 0);
+ slot_values[natt - 1] = datum;
+ slot_isnull[natt - 1] = isnull;
+ }
+
+ /* Fill in the rest of the fields in the slot, and return the tuple */
+ slotno = ZSTidScanCurUndoSlotNo(&scan_proj->tid_scan);
+ visi_info = &scan_proj->tid_scan.array_iter.undoslot_visibility[slotno];
+ ((ZedstoreTupleTableSlot*)slot)->xmin = visi_info->xmin;
+ ((ZedstoreTupleTableSlot*)slot)->cmin = visi_info->cmin;
+
+ slot->tts_tableOid = RelationGetRelid(scan->rs_scan.rs_rd);
+ slot->tts_tid = ItemPointerFromZSTid(this_tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ pgstat_count_heap_getnext(scan->rs_scan.rs_rd);
+ return true;
+}
+
+static bool
+zedstoream_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
+{
+ return zedstoream_getnextslot_internal(sscan, direction, slot, NULL, NULL);
+}
+
+static bool
+zedstoream_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ zstid ztid = ZSTidFromItemPointer(*tid);
+
+ if (scan->max_tid_to_scan == InvalidZSTid)
+ {
+ /*
+ * get the max tid once and store it
+ */
+ scan->max_tid_to_scan = zsbt_get_last_tid(sscan->rs_rd);
+ }
+
+ /*
+ * FIXME: should we get lowest TID as well to further optimize the check.
+ */
+ if (ztid <= scan->max_tid_to_scan)
+ return true;
+ else
+ return false;
+}
+
+static bool
+zedstoream_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
+ Snapshot snapshot)
+{
+ /*
+ * TODO: we didn't keep any visibility information about the tuple in the
+ * slot, so we have to fetch it again. A custom slot type might be a
+ * good idea..
+ */
+ zstid tid = ZSTidFromItemPointer(slot->tts_tid);
+ ZSTidTreeScan meta_scan;
+ bool found;
+
+ /* Use the meta-data tree for the visibility information. */
+ zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, &meta_scan);
+
+ found = zsbt_tid_scan_next(&meta_scan, ForwardScanDirection) != InvalidZSTid;
+
+ zsbt_tid_end_scan(&meta_scan);
+
+ return found;
+}
+
+static TransactionId
+zedstoream_compute_xid_horizon_for_tuples(Relation rel,
+ ItemPointerData *items,
+ int nitems)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function %s not implemented yet", __func__)));
+
+}
+
+static IndexFetchTableData *
+zedstoream_begin_index_fetch(Relation rel)
+{
+ ZedStoreIndexFetch zscan = palloc0(sizeof(ZedStoreIndexFetchData));
+
+ zscan->idx_fetch_data.rel = rel;
+ zscan->proj_data.context = CurrentMemoryContext;
+
+ return (IndexFetchTableData *) zscan;
+}
+
+static void
+zedstoream_fetch_set_column_projection(struct IndexFetchTableData *scan,
+ Bitmapset *project_columns)
+{
+ ZedStoreIndexFetch zscan = (ZedStoreIndexFetch) scan;
+ zscan->proj_data.project_columns = project_columns;
+}
+
+static void
+zedstoream_reset_index_fetch(IndexFetchTableData *scan)
+{
+ /* TODO: we could close the scans here, but currently we don't bother */
+}
+
+static void
+zedstoream_end_index_fetch(IndexFetchTableData *scan)
+{
+ ZedStoreIndexFetch zscan = (ZedStoreIndexFetch) scan;
+ ZedStoreProjectData *zscan_proj = &zscan->proj_data;
+
+ if (zscan_proj->num_proj_atts > 0)
+ {
+ zsbt_tid_end_scan(&zscan_proj->tid_scan);
+ for (int i = 1; i < zscan_proj->num_proj_atts; i++)
+ zsbt_attr_end_scan(&zscan_proj->attr_scans[i - 1]);
+ }
+
+ if (zscan_proj->proj_atts)
+ pfree(zscan_proj->proj_atts);
+
+ if (zscan_proj->attr_scans)
+ pfree(zscan_proj->attr_scans);
+ pfree(zscan);
+}
+
+static bool
+zedstoream_index_fetch_tuple(struct IndexFetchTableData *scan,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot,
+ bool *call_again, bool *all_dead)
+{
+ bool result;
+
+ /*
+ * we don't do in-place updates, so this is essentially the same as
+ * fetch_row_version.
+ */
+ if (call_again)
+ *call_again = false;
+ if (all_dead)
+ *all_dead = false;
+
+ result = zedstoream_fetch_row((ZedStoreIndexFetchData *) scan, tid_p, snapshot, slot);
+ if (result)
+ {
+ /* FIXME: heapam acquires the predicate lock first, and then
+ * calls CheckForSerializableConflictOut(). We do it in the
+ * opposite order, because CheckForSerializableConflictOut()
+ * call as done in zsbt_get_last_tid() already. Does it matter?
+ * I'm not sure.
+ */
+ PredicateLockTID(scan->rel, tid_p, snapshot);
+ }
+ return result;
+}
+
+/*
+ * Shared implementation of fetch_row_version and index_fetch_tuple callbacks.
+ */
+static bool
+zedstoream_fetch_row(ZedStoreIndexFetchData *fetch,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot)
+{
+ Relation rel = fetch->idx_fetch_data.rel;
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ bool found = true;
+ ZedStoreProjectData *fetch_proj = &fetch->proj_data;
+
+ /* first time here, initialize */
+ if (fetch_proj->num_proj_atts == 0)
+ zs_initialize_proj_attributes(slot->tts_tupleDescriptor, fetch_proj);
+ else
+ {
+		/* If previous fetches are still open, close them first */
+ zsbt_tid_end_scan(&fetch_proj->tid_scan);
+ for (int i = 1; i < fetch_proj->num_proj_atts; i++)
+ zsbt_attr_end_scan(&fetch_proj->attr_scans[i - 1]);
+ }
+
+ /*
+ * Initialize the slot.
+ *
+ * If we're not fetching all columns, initialize the unfetched values
+ * in the slot to NULL. (Actually, this initializes all to NULL, and the
+ * code below will overwrite them for the columns that are projected)
+ */
+ ExecClearTuple(slot);
+ for (int i = 0; i < slot->tts_tupleDescriptor->natts; i++)
+ slot->tts_isnull[i] = true;
+
+ zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, &fetch_proj->tid_scan);
+ fetch_proj->tid_scan.serializable = true;
+ found = zsbt_tid_scan_next(&fetch_proj->tid_scan, ForwardScanDirection) != InvalidZSTid;
+ if (found)
+ {
+ for (int i = 1; i < fetch_proj->num_proj_atts; i++)
+ {
+ int natt = fetch_proj->proj_atts[i];
+ ZSAttrTreeScan *btscan = &fetch_proj->attr_scans[i - 1];
+ Form_pg_attribute attr;
+ Datum datum;
+ bool isnull;
+
+ zsbt_attr_begin_scan(rel, slot->tts_tupleDescriptor, natt, btscan);
+ attr = btscan->attdesc;
+ if (zsbt_attr_fetch(btscan, &datum, &isnull, tid))
+ {
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ if (!isnull && attr->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(rel, natt, tid, datum);
+ }
+ }
+ else
+ zsbt_fill_missing_attribute_value(slot->tts_tupleDescriptor, btscan->attno,
+ &datum, &isnull);
+
+ slot->tts_values[natt - 1] = datum;
+ slot->tts_isnull[natt - 1] = isnull;
+ }
+ }
+
+ if (found)
+ {
+ ZSUndoSlotVisibility *visi_info;
+ uint8 slotno = ZSTidScanCurUndoSlotNo(&fetch_proj->tid_scan);
+ visi_info = &fetch_proj->tid_scan.array_iter.undoslot_visibility[slotno];
+ ((ZedstoreTupleTableSlot*)slot)->xmin = visi_info->xmin;
+ ((ZedstoreTupleTableSlot*)slot)->cmin = visi_info->cmin;
+
+ slot->tts_tableOid = RelationGetRelid(rel);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+ return true;
+ }
+
+ return false;
+}
+
+static void
+zedstoream_index_validate_scan(Relation baseRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ Snapshot snapshot,
+ ValidateIndexState *state)
+{
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ ExprState *predicate;
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
+ int attno;
+ TableScanDesc scan;
+ ItemPointerData idx_ptr;
+ bool tuplesort_empty = false;
+ Bitmapset *proj = NULL;
+
+ /*
+ * sanity checks
+ */
+ Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+ /*
+ * Need an EState for evaluation of index expressions and partial-index
+ * predicates. Also a slot to hold the current tuple.
+ */
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ slot = table_slot_create(baseRelation, NULL);
+
+ /* Arrange for econtext's scan tuple to be the tuple under test */
+ econtext->ecxt_scantuple = slot;
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+
+ /*
+ * Prepare for scan of the base relation. We need just those tuples
+ * satisfying the passed-in reference snapshot. We must disable syncscan
+ * here, because it's critical that we read from block zero forward to
+ * match the sorted TIDs.
+ */
+
+ /*
+ * TODO: It would be very good to fetch only the columns we need.
+ */
+ for (attno = 0; attno < indexInfo->ii_NumIndexKeyAttrs; attno++)
+ {
+ Assert(indexInfo->ii_IndexAttrNumbers[attno] <= baseRelation->rd_att->natts);
+ proj = bms_add_member(proj, indexInfo->ii_IndexAttrNumbers[attno]);
+ }
+ PopulateNeededColumnsForNode((Node *)indexInfo->ii_Predicate,
+ baseRelation->rd_att->natts,
+ &proj);
+ PopulateNeededColumnsForNode((Node *)indexInfo->ii_Expressions,
+ baseRelation->rd_att->natts,
+ &proj);
+
+ scan = table_beginscan_with_column_projection(baseRelation, /* relation */
+ snapshot, /* snapshot */
+ 0, /* number of keys */
+ NULL, /* scan key */
+ proj);
+
+ /*
+ * Scan all tuples matching the snapshot.
+ */
+ ItemPointerSet(&idx_ptr, 0, 0); /* this is less than any real TID */
+ while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
+ {
+ ItemPointerData tup_ptr = slot->tts_tid;
+ int cmp;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * TODO: Once we have in-place updates, like HOT, this will need
+ * to work harder, like heapam's function.
+ */
+
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
+
+ if (tuplesort_empty)
+ cmp = -1;
+ else
+ {
+ while ((cmp = ItemPointerCompare(&tup_ptr, &idx_ptr)) > 0)
+ {
+ Datum ts_val;
+ bool ts_isnull;
+
+ tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+ &ts_val, &ts_isnull, NULL);
+ if (!tuplesort_empty)
+ {
+ Assert(!ts_isnull);
+ itemptr_decode(&idx_ptr, DatumGetInt64(ts_val));
+
+ /* If int8 is pass-by-ref, free (encoded) TID Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(ts_val));
+#endif
+ break;
+ }
+ else
+ {
+ /* Be tidy */
+ ItemPointerSetInvalid(&idx_ptr);
+ cmp = -1;
+ }
+ }
+ }
+ if (cmp < 0)
+ {
+ /* This item is not in the index */
+
+ /*
+ * In a partial index, discard tuples that don't satisfy the
+ * predicate.
+ */
+ if (predicate != NULL)
+ {
+ if (!ExecQual(predicate, econtext))
+ continue;
+ }
+
+ /*
+ * For the current heap tuple, extract all the attributes we use in
+ * this index, and note which are null. This also performs evaluation
+ * of any expressions needed.
+ */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ /* Call the AM's callback routine to process the tuple */
+ index_insert(indexRelation, values, isnull, &tup_ptr, baseRelation,
+ indexInfo->ii_Unique ?
+ UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+ indexInfo);
+
+ state->tups_inserted += 1;
+ }
+ }
+
+ table_endscan(scan);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ FreeExecutorState(estate);
+
+ /* These may have been pointing to the now-gone estate */
+ indexInfo->ii_ExpressionsState = NIL;
+ indexInfo->ii_PredicateState = NULL;
+}
+
+/*
+ * This is called by the getnextslot function after checking the visibility of
+ * a TID item based on the passed-in snapshot. We have the information needed
+ * to set tupleIsAlive, which controls whether the tuple should be excluded
+ * from unique-checking or not.
+ */
+static void
+index_build_getnextslot_callback(ZSTidTreeScan *scan, zstid tid, void *state)
+{
+ bool *tupleIsAlive = (bool*)state;
+ uint8 slotno = ZSTidScanCurUndoSlotNo(scan);
+ ZSUndoSlotVisibility *visi_info = &scan->array_iter.undoslot_visibility[slotno];
+
+ Assert(tid != InvalidZSTid);
+
+	/*
+	 * This callback is called after the visibility check, based on the
+	 * snapshot passed to the scan.
+	 */
+ *tupleIsAlive = true;
+
+	/*
+	 * If the NonVacuumableSnapshot is used for scanning, then we need to
+	 * check for recently dead tuples.
+	 *
+	 * TODO: Heap also checks for DELETE_IN_PROGRESS; do we need that as well?
+	 */
+ if (visi_info->nonvacuumable_status == ZSNV_RECENTLY_DEAD)
+ {
+ Assert(scan->snapshot->snapshot_type == SNAPSHOT_NON_VACUUMABLE);
+ *tupleIsAlive = false;
+ return;
+ }
+
+ Assert(visi_info->nonvacuumable_status == ZSNV_NONE);
+}
+
+static double
+zedstoream_index_build_range_scan(Relation baseRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ bool anyvisible,
+ bool progress,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state,
+ TableScanDesc scan)
+{
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ double reltuples;
+ ExprState *predicate;
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
+ Snapshot snapshot;
+ SnapshotData NonVacuumableSnapshot;
+ bool need_unregister_snapshot = false;
+ TransactionId OldestXmin;
+ bool tupleIsAlive;
+
+#ifdef USE_ASSERT_CHECKING
+ bool checking_uniqueness;
+ /* See whether we're verifying uniqueness/exclusion properties */
+ checking_uniqueness = (indexInfo->ii_Unique ||
+ indexInfo->ii_ExclusionOps != NULL);
+
+ /*
+ * "Any visible" mode is not compatible with uniqueness checks; make sure
+ * only one of those is requested.
+ */
+ Assert(!(anyvisible && checking_uniqueness));
+#endif
+
+ /*
+ * sanity checks
+ */
+ Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+ /*
+ * Need an EState for evaluation of index expressions and partial-index
+ * predicates. Also a slot to hold the current tuple.
+ */
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ slot = table_slot_create(baseRelation, NULL);
+
+ /* Arrange for econtext's scan tuple to be the tuple under test */
+ econtext->ecxt_scantuple = slot;
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+
+ /*
+ * Prepare for scan of the base relation. In a normal index build, we use
+ * SnapshotAny because we must retrieve all tuples and do our own time
+ * qual checks (because we have to index RECENTLY_DEAD tuples). In a
+ * concurrent build, or during bootstrap, we take a regular MVCC snapshot
+ * and index whatever's live according to that.
+ */
+ OldestXmin = InvalidTransactionId;
+
+ /* okay to ignore lazy VACUUMs here */
+ if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
+ OldestXmin = GetOldestXmin(baseRelation, PROCARRAY_FLAGS_VACUUM);
+
+ if (!scan)
+ {
+ int attno;
+ Bitmapset *proj = NULL;
+
+ /*
+ * Serial index build.
+ *
+ * Must begin our own zedstore scan in this case. We may also need to
+ * register a snapshot whose lifetime is under our direct control.
+ */
+ if (!TransactionIdIsValid(OldestXmin))
+ {
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ need_unregister_snapshot = true;
+ }
+ else
+ {
+ /* leave out completely dead items even with SnapshotAny */
+ InitNonVacuumableSnapshot(NonVacuumableSnapshot, OldestXmin);
+ snapshot = &NonVacuumableSnapshot;
+ }
+
+ for (attno = 0; attno < indexInfo->ii_NumIndexKeyAttrs; attno++)
+ {
+ Assert(indexInfo->ii_IndexAttrNumbers[attno] <= baseRelation->rd_att->natts);
+ proj = bms_add_member(proj, indexInfo->ii_IndexAttrNumbers[attno]);
+ }
+ PopulateNeededColumnsForNode((Node *)indexInfo->ii_Predicate,
+ baseRelation->rd_att->natts,
+ &proj);
+ PopulateNeededColumnsForNode((Node *)indexInfo->ii_Expressions,
+ baseRelation->rd_att->natts,
+ &proj);
+
+ scan = table_beginscan_with_column_projection(baseRelation, /* relation */
+ snapshot, /* snapshot */
+ 0, /* number of keys */
+ NULL, /* scan key */
+ proj);
+
+ if (start_blockno != 0 || numblocks != InvalidBlockNumber)
+ {
+ ZedStoreDesc zscan = (ZedStoreDesc) scan;
+ ZedStoreProjectData *zscan_proj = &zscan->proj_data;
+
+ zscan->cur_range_start = ZSTidFromBlkOff(start_blockno, 1);
+ zscan->cur_range_end = ZSTidFromBlkOff(numblocks, 1);
+
+ /* FIXME: when can 'num_proj_atts' be 0? */
+ if (zscan_proj->num_proj_atts > 0)
+ {
+ zsbt_tid_begin_scan(zscan->rs_scan.rs_rd,
+ zscan->cur_range_start,
+ zscan->cur_range_end,
+ zscan->rs_scan.rs_snapshot,
+ &zscan_proj->tid_scan);
+ for (int i = 1; i < zscan_proj->num_proj_atts; i++)
+ {
+ int natt = zscan_proj->proj_atts[i];
+
+ zsbt_attr_begin_scan(zscan->rs_scan.rs_rd,
+ RelationGetDescr(zscan->rs_scan.rs_rd),
+ natt,
+ &zscan_proj->attr_scans[i - 1]);
+ }
+ }
+ }
+ }
+ else
+ {
+ /*
+ * Parallel index build.
+ *
+ * Parallel case never registers/unregisters own snapshot. Snapshot
+ * is taken from parallel zedstore scan, and is SnapshotAny or an MVCC
+ * snapshot, based on same criteria as serial case.
+ */
+ Assert(!IsBootstrapProcessingMode());
+ Assert(allow_sync);
+ Assert(start_blockno == 0);
+ Assert(numblocks == InvalidBlockNumber);
+ snapshot = scan->rs_snapshot;
+
+ if (snapshot == SnapshotAny)
+ {
+ /* leave out completely dead items even with SnapshotAny */
+ InitNonVacuumableSnapshot(NonVacuumableSnapshot, OldestXmin);
+ snapshot = &NonVacuumableSnapshot;
+ }
+ }
+
+ /*
+ * Must call GetOldestXmin() with SnapshotAny. Should never call
+ * GetOldestXmin() with MVCC snapshot. (It's especially worth checking
+ * this for parallel builds, since ambuild routines that support parallel
+ * builds must work these details out for themselves.)
+ */
+ Assert(snapshot == &NonVacuumableSnapshot || IsMVCCSnapshot(snapshot));
+ Assert(snapshot == &NonVacuumableSnapshot ? TransactionIdIsValid(OldestXmin) :
+ !TransactionIdIsValid(OldestXmin));
+ Assert(snapshot == &NonVacuumableSnapshot || !anyvisible);
+
+ reltuples = 0;
+
+ /*
+ * Scan all tuples in the base relation.
+ */
+ while (zedstoream_getnextslot_internal(scan, ForwardScanDirection, slot,
+ &index_build_getnextslot_callback, &tupleIsAlive))
+ {
+ HeapTuple heapTuple;
+
+ if (numblocks != InvalidBlockNumber &&
+ ItemPointerGetBlockNumber(&slot->tts_tid) >= numblocks)
+ break;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* table_scan_getnextslot did the visibility check */
+ if (tupleIsAlive)
+ reltuples += 1;
+
+ /*
+ * TODO: Once we have in-place updates, like HOT, this will need
+ * to work harder, to figure out which tuple version to index.
+ */
+
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
+
+ /*
+ * In a partial index, discard tuples that don't satisfy the
+ * predicate.
+ */
+ if (predicate != NULL)
+ {
+ if (!ExecQual(predicate, econtext))
+ continue;
+ }
+
+ /*
+ * For the current heap tuple, extract all the attributes we use in
+ * this index, and note which are null. This also performs evaluation
+ * of any expressions needed.
+ */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ /* Call the AM's callback routine to process the tuple */
+ heapTuple = ExecCopySlotHeapTuple(slot);
+ heapTuple->t_self = slot->tts_tid;
+ callback(indexRelation, heapTuple, values, isnull, tupleIsAlive,
+ callback_state);
+ pfree(heapTuple);
+ }
+
+ table_endscan(scan);
+
+ /* we can now forget our snapshot, if set and registered by us */
+ if (need_unregister_snapshot)
+ UnregisterSnapshot(snapshot);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ FreeExecutorState(estate);
+
+ /* These may have been pointing to the now-gone estate */
+ indexInfo->ii_ExpressionsState = NIL;
+ indexInfo->ii_PredicateState = NULL;
+
+ return reltuples;
+}
+
+static void
+zedstoream_finish_bulk_insert(Relation relation, int options)
+{
+ /*
+ * If we skipped writing WAL, then we need to sync the zedstore (but not
+ * indexes since those use WAL anyway / don't go through tableam)
+ */
+ if (options & HEAP_INSERT_SKIP_WAL)
+ heap_sync(relation);
+}
+
+/* ------------------------------------------------------------------------
+ * DDL related callbacks for zedstore AM.
+ * ------------------------------------------------------------------------
+ */
+
+static void
+zedstoream_relation_set_new_filenode(Relation rel,
+ const RelFileNode *newrnode,
+ char persistence,
+ TransactionId *freezeXid,
+ MultiXactId *minmulti)
+{
+ SMgrRelation srel;
+
+ /*
+ * Initialize to the minimum XID that could put tuples in the table. We
+ * know that no xacts older than RecentXmin are still running, so that
+ * will do.
+ */
+ *freezeXid = RecentXmin;
+
+ /*
+ * Similarly, initialize the minimum Multixact to the first value that
+ * could possibly be stored in tuples in the table. Running transactions
+ * could reuse values from their local cache, so we are careful to
+ * consider all currently running multis.
+ *
+ * XXX this could be refined further, but is it worth the hassle?
+ */
+ *minmulti = GetOldestMultiXactId();
+
+ srel = RelationCreateStorage(*newrnode, persistence);
+
+ /*
+ * If required, set up an init fork for an unlogged table so that it can
+ * be correctly reinitialized on restart. An immediate sync is required
+ * even if the page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved the
+ * redo pointer past our xlog record. Recovery may as well remove it
+ * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+ * record. Therefore, logging is necessary even if wal_level=minimal.
+ */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+ rel->rd_rel->relkind == RELKIND_MATVIEW ||
+ rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+ smgrcreate(srel, INIT_FORKNUM, false);
+ log_smgrcreate(newrnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+}
+
+static void
+zedstoream_relation_nontransactional_truncate(Relation rel)
+{
+ zsmeta_invalidate_cache(rel);
+ RelationTruncate(rel, 0);
+}
+
+static void
+zedstoream_relation_copy_data(Relation rel, const RelFileNode *newrnode)
+{
+ SMgrRelation dstrel;
+
+ dstrel = smgropen(*newrnode, rel->rd_backend);
+ RelationOpenSmgr(rel);
+
+ /*
+ * Since we copy the file directly without looking at the shared buffers,
+ * we'd better first flush out any pages of the source relation that are
+ * in shared buffers. We assume no new changes will be made while we are
+ * holding exclusive lock on the rel.
+ */
+ FlushRelationBuffers(rel);
+
+ /*
+	 * Create the new storage, copy the relation's data into it, and
+	 * schedule unlinking of the old physical file.
+ *
+ * NOTE: any conflict in relfilenode value will be caught in
+ * RelationCreateStorage().
+ *
+ * NOTE: There is only the main fork in zedstore. Otherwise
+ * this would need to copy other forks, too.
+ */
+ RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence);
+
+ /* copy main fork */
+ RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
+ rel->rd_rel->relpersistence);
+
+ /* drop old relation, and close new one */
+ RelationDropStorage(rel);
+ smgrclose(dstrel);
+}
+
+/*
+ * Subroutine of the zedstoream_relation_copy_for_cluster() callback.
+ *
+ * Creates the TID item with correct visibility information for the
+ * given tuple in the old table. Returns the tid of the tuple in the
+ * new table, or InvalidZSTid if this tuple can be left out completely.
+ *
+ * FIXME: This breaks UPDATE chains. I.e. after this is done, an UPDATE
+ * looks like DELETE + INSERT, instead of an UPDATE, to any transaction that
+ * might try to follow the update chain.
+ */
+static zstid
+zs_cluster_process_tuple(Relation OldHeap, Relation NewHeap,
+ zstid oldtid, ZSUndoRecPtr old_undoptr,
+ ZSUndoRecPtr recent_oldest_undo,
+ TransactionId OldestXmin)
+{
+ TransactionId this_xmin;
+ CommandId this_cmin;
+ TransactionId this_xmax;
+ CommandId this_cmax;
+ bool this_changedPart;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ /*
+ * Follow the chain of UNDO records for this tuple, to find the
+ * transaction that originally inserted the row (xmin/cmin), and
+ * the transaction that deleted or updated it away, if any (xmax/cmax)
+ */
+ this_xmin = FrozenTransactionId;
+ this_cmin = InvalidCommandId;
+ this_xmax = InvalidTransactionId;
+ this_cmax = InvalidCommandId;
+
+ undo_ptr = old_undoptr;
+ for (;;)
+ {
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ {
+ /* This tuple version is visible to everyone. */
+ break;
+ }
+
+ /* Fetch the next UNDO record. */
+ undorec = zsundo_fetch(OldHeap, undo_ptr);
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ if (!TransactionIdIsCurrentTransactionId(undorec->xid) &&
+ !TransactionIdIsInProgress(undorec->xid) &&
+ !TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * inserter aborted or crashed. This row is not visible to
+ * anyone. Including any later tuple versions we might have
+ * seen.
+ */
+ this_xmin = InvalidTransactionId;
+ break;
+ }
+ else
+ {
+ /* Inserter committed. */
+ this_xmin = undorec->xid;
+ this_cmin = undorec->cid;
+
+ /* we know everything there is to know about this tuple version. */
+ break;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* Ignore tuple locks for now.
+ *
+ * FIXME: we should propagate them to the new copy of the table
+ */
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ /* Row was deleted (or updated away). */
+ if (!TransactionIdIsCurrentTransactionId(undorec->xid) &&
+ !TransactionIdIsInProgress(undorec->xid) &&
+ !TransactionIdDidCommit(undorec->xid))
+ {
+ /* deleter aborted or crashed. The previous record should
+ * be an insertion (possibly with some tuple-locking in
+ * between). We'll remember the tuple when we see the
+ * insertion.
+ */
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ else
+ {
+ /* deleter committed or is still in progress. */
+ if (TransactionIdPrecedes(undorec->xid, OldestXmin))
+ {
+ /* the deletion is visible to everyone. We can skip the row completely. */
+ this_xmin = InvalidTransactionId;
+ break;
+ }
+ else
+ {
+ /* deleter committed or is in progress. Remember that it was
+ * deleted by this XID.
+ */
+ this_xmax = undorec->xid;
+ this_cmax = undorec->cid;
+ if (undorec->type == ZSUNDO_TYPE_DELETE)
+ this_changedPart = ((ZSUndoRec_Delete *) undorec)->changedPart;
+ else
+ this_changedPart = false;
+
+ /* follow the UNDO chain to find information about the inserting
+ * transaction (xmin/cmin)
+ */
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ }
+ }
+ }
+
+ /*
+ * We now know the visibility of this tuple. Re-create it in the new table.
+ */
+ if (this_xmin != InvalidTransactionId)
+ {
+ /* Insert the first version of the row. */
+ zstid newtid = InvalidZSTid;
+
+ /* First, insert the tuple. */
+ zsbt_tid_multi_insert(NewHeap,
+ &newtid, 1,
+ this_xmin,
+ this_cmin,
+ INVALID_SPECULATIVE_TOKEN,
+ InvalidUndoPtr);
+
+ /* And if the tuple was deleted/updated away, do the same in the new table. */
+ if (this_xmax != InvalidTransactionId)
+ {
+ TM_Result delete_result;
+
+ /* tuple was deleted. */
+ delete_result = zsbt_tid_delete(NewHeap, newtid,
+ this_xmax, this_cmax,
+ NULL, NULL, false, NULL, this_changedPart);
+ if (delete_result != TM_Ok)
+ elog(ERROR, "tuple deletion failed during table rewrite");
+ }
+ return newtid;
+ }
+ else
+ return InvalidZSTid;
+}
+
+
+static void
+zedstoream_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
+ Relation OldIndex, bool use_sort,
+ TransactionId OldestXmin,
+ TransactionId *xid_cutoff,
+ MultiXactId *multi_cutoff,
+ double *num_tuples,
+ double *tups_vacuumed,
+ double *tups_recently_dead)
+{
+ TupleDesc olddesc;
+ ZSTidTreeScan tid_scan;
+ ZSAttrTreeScan *attr_scans;
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(OldHeap);
+ int attno;
+ IndexScanDesc indexScan;
+
+	olddesc = RelationGetDescr(OldHeap);
+
+ attr_scans = palloc(olddesc->natts * sizeof(ZSAttrTreeScan));
+
+ /*
+ * Scan the old table. We ignore any old updated-away tuple versions,
+ * and only stop at the latest tuple version of each row. At the latest
+ * version, follow the update chain to get all the old versions of that
+ * row, too. That way, the whole update chain is processed in one go,
+ * and can be reproduced in the new table.
+ */
+ zsbt_tid_begin_scan(OldHeap, MinZSTid, MaxPlusOneZSTid,
+ SnapshotAny, &tid_scan);
+
+ for (attno = 1; attno <= olddesc->natts; attno++)
+ {
+ if (TupleDescAttr(olddesc, attno - 1)->attisdropped)
+ continue;
+
+ zsbt_attr_begin_scan(OldHeap,
+ olddesc,
+ attno,
+ &attr_scans[attno - 1]);
+ }
+
+ /* TODO: sorting not implemented yet. (it would require materializing each
+ * row into a HeapTuple or something like that, which could carry the xmin/xmax
+ * information through the sorter).
+ */
+ use_sort = false;
+
+ /*
+	 * Prepare to scan the OldHeap. To ensure we see recently-dead tuples
+	 * that still need to be copied, we scan with SnapshotAny and determine
+	 * visibility ourselves by following each tuple's UNDO chain.
+ */
+ if (OldIndex != NULL && !use_sort)
+ {
+ const int ci_index[] = {
+ PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_INDEX_RELID
+ };
+ int64 ci_val[2];
+
+ /* Set phase and OIDOldIndex to columns */
+ ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[1] = RelationGetRelid(OldIndex);
+ pgstat_progress_update_multi_param(2, ci_index, ci_val);
+
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+ index_rescan(indexScan, NULL, 0, NULL, 0);
+ }
+ else
+ {
+ /* In scan-and-sort mode and also VACUUM FULL, set phase */
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+
+ indexScan = NULL;
+
+ /* Set total heap blocks */
+ /* TODO */
+#if 0
+ pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ heapScan->rs_nblocks);
+#endif
+ }
+
+ for (;;)
+ {
+ zstid old_tid;
+ ZSUndoRecPtr old_undoptr;
+ zstid new_tid;
+ Datum datum;
+ bool isnull;
+ zstid fetchtid = InvalidZSTid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ if (indexScan != NULL)
+ {
+ ItemPointer itemptr;
+
+ itemptr = index_getnext_tid(indexScan, ForwardScanDirection);
+ if (!itemptr)
+ break;
+
+ /* Since we used no scan keys, should never need to recheck */
+ if (indexScan->xs_recheck)
+ elog(ERROR, "CLUSTER does not support lossy index conditions");
+
+ fetchtid = ZSTidFromItemPointer(*itemptr);
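+			/*
+			 * Reposition the TID tree scan to just before the TID we got
+			 * from the index, so that the next call returns that TID if it
+			 * still exists in the TID tree.
+			 */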
+ zsbt_tid_reset_scan(&tid_scan, MinZSTid, MaxPlusOneZSTid, fetchtid - 1);
+ old_tid = zsbt_tid_scan_next(&tid_scan, ForwardScanDirection);
+ if (old_tid == InvalidZSTid)
+ continue;
+ }
+ else
+ {
+ old_tid = zsbt_tid_scan_next(&tid_scan, ForwardScanDirection);
+ if (old_tid == InvalidZSTid)
+ break;
+ fetchtid = old_tid;
+ }
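+		/*
+		 * If the index returned a TID that no longer exists in the TID
+		 * tree, skip it.
+		 */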
+ if (old_tid != fetchtid)
+ continue;
+
+ old_undoptr = tid_scan.array_iter.undoslots[ZSTidScanCurUndoSlotNo(&tid_scan)];
+
+ new_tid = zs_cluster_process_tuple(OldHeap, NewHeap,
+ old_tid, old_undoptr,
+ recent_oldest_undo,
+ OldestXmin);
+ if (new_tid != InvalidZSTid)
+ {
+ /* Fetch the attributes and write them out */
+ for (attno = 1; attno <= olddesc->natts; attno++)
+ {
+ Form_pg_attribute att = TupleDescAttr(olddesc, attno - 1);
+
+ if (att->attisdropped)
+ {
+ datum = (Datum) 0;
+ isnull = true;
+ }
+ else
+ {
+ if (!zsbt_attr_fetch(&attr_scans[attno - 1], &datum, &isnull, old_tid))
+ zsbt_fill_missing_attribute_value(olddesc, attno, &datum, &isnull);
+ }
+
+ /* flatten and re-toast any ZS-TOASTed values */
+ if (!isnull && att->attlen == -1)
+ {
+ if (VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(OldHeap, attno, old_tid, datum);
+ }
+
+ if (VARSIZE_ANY_EXHDR(datum) > MaxZedStoreDatumSize)
+ {
+ datum = zedstore_toast_datum(NewHeap, attno, datum, new_tid);
+ }
+ }
+
+ zsbt_attr_multi_insert(NewHeap, attno, &datum, &isnull, &new_tid, 1);
+ }
+ }
+ }
+
+ if (indexScan != NULL)
+ index_endscan(indexScan);
+
+ zsbt_tid_end_scan(&tid_scan);
+ for (attno = 1; attno <= olddesc->natts; attno++)
+ {
+ if (TupleDescAttr(olddesc, attno - 1)->attisdropped)
+ continue;
+
+ zsbt_attr_end_scan(&attr_scans[attno - 1]);
+ }
+}
+
+/*
+ * FIXME: The ANALYZE API is problematic for us. acquire_sample_rows() calls
+ * RelationGetNumberOfBlocks() directly on the relation, and chooses the
+ * block numbers to sample based on that. But the logical block numbers
+ * have little to do with physical ones in zedstore.
+ */
+static bool
+zedstoream_scan_analyze_next_block(TableScanDesc sscan, BlockNumber blockno,
+ BufferAccessStrategy bstrategy)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ Relation rel = scan->rs_scan.rs_rd;
+ int ntuples;
+ ZSTidTreeScan tid_scan;
+ zstid tid;
+
+ /* TODO: for now, assume that we need all columns */
+ zs_initialize_proj_attributes_extended(scan, RelationGetDescr(rel));
+
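+	/*
+	 * Collect all the TIDs on this logical block first; the per-attribute
+	 * datums for those TIDs are then fetched and copied below, and
+	 * zedstoream_scan_analyze_next_tuple() hands them out one row at a time.
+	 */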
+ ntuples = 0;
+ zsbt_tid_begin_scan(scan->rs_scan.rs_rd,
+ ZSTidFromBlkOff(blockno, 1),
+ ZSTidFromBlkOff(blockno + 1, 1),
+ scan->rs_scan.rs_snapshot,
+ &tid_scan);
+ /*
+ * TODO: it would be good to pass the next expected TID down to zsbt_scan_next,
+ * so that it could skip over to it more efficiently.
+ */
+ ntuples = 0;
+ while ((tid = zsbt_tid_scan_next(&tid_scan, ForwardScanDirection)) != InvalidZSTid)
+ {
+ Assert(ZSTidGetBlockNumber(tid) == blockno);
+ scan->bmscan_tids[ntuples] = tid;
+ ntuples++;
+ }
+ zsbt_tid_end_scan(&tid_scan);
+
+ if (ntuples)
+ {
+ TupleDesc reldesc = RelationGetDescr(scan->rs_scan.rs_rd);
+
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ {
+ int attno = scan->proj_data.proj_atts[i];
+ ZSAttrTreeScan attr_scan;
+ Datum datum;
+ bool isnull;
+ Datum *datums = scan->bmscan_datums[i];
+ bool *isnulls = scan->bmscan_isnulls[i];
+
+ zsbt_attr_begin_scan(scan->rs_scan.rs_rd,
+ reldesc,
+ attno,
+ &attr_scan);
+ for (int n = 0; n < ntuples; n++)
+ {
+ zstid tid = scan->bmscan_tids[n];
+
+ if (zsbt_attr_fetch(&attr_scan, &datum, &isnull, tid))
+ {
+ Assert(ZSTidGetBlockNumber(tid) == blockno);
+ }
+ else
+ zsbt_fill_missing_attribute_value(reldesc, attno, &datum, &isnull);
+
+ /*
+ * have to make a copy because we close the scan immediately.
+ * FIXME: I think this leaks into a too-long-lived context
+ */
+ if (!isnull)
+ datum = zs_datumCopy(datum,
+ attr_scan.attdesc->attbyval,
+ attr_scan.attdesc->attlen);
+ datums[n] = datum;
+ isnulls[n] = isnull;
+ }
+ zsbt_attr_end_scan(&attr_scan);
+ }
+ }
+
+ scan->bmscan_nexttuple = 0;
+ scan->bmscan_ntuples = ntuples;
+
+ return true;
+}
+
+static bool
+zedstoream_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin,
+ double *liverows, double *deadrows,
+ TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ zstid tid;
+
+ if (scan->bmscan_nexttuple >= scan->bmscan_ntuples)
+ return false;
+ /*
+	 * The projection attributes were created based on the relation's tuple
+	 * descriptor, so it had better match the TupleTableSlot's descriptor.
+ */
+ Assert((scan->proj_data.num_proj_atts - 1) <= slot->tts_tupleDescriptor->natts);
+ tid = scan->bmscan_tids[scan->bmscan_nexttuple];
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ {
+ int natt = scan->proj_data.proj_atts[i];
+ Form_pg_attribute att = TupleDescAttr(slot->tts_tupleDescriptor, natt - 1);
+
+ Datum datum;
+ bool isnull;
+
+ datum = (scan->bmscan_datums[i])[scan->bmscan_nexttuple];
+ isnull = (scan->bmscan_isnulls[i])[scan->bmscan_nexttuple];
+
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ if (!isnull && att->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(scan->rs_scan.rs_rd, natt, tid, datum);
+ }
+
+ slot->tts_values[natt - 1] = datum;
+ slot->tts_isnull[natt - 1] = isnull;
+ }
+ slot->tts_tableOid = RelationGetRelid(scan->rs_scan.rs_rd);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ scan->bmscan_nexttuple++;
+ (*liverows)++;
+
+ return true;
+}
+
+/* ------------------------------------------------------------------------
+ * Miscellaneous callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+/*
+ * FIXME: Implement this properly for zedstore. The return value is used,
+ * for example, by ANALYZE to decide which blocks to sample.
+ */
+static uint64
+zedstoream_relation_size(Relation rel, ForkNumber forkNumber)
+{
+ uint64 nblocks = 0;
+
+ /* Open it at the smgr level if not already done */
+ RelationOpenSmgr(rel);
+ nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+ return nblocks * BLCKSZ;
+}
+
+/*
+ * Zedstore stores TOAST chunks within the table file itself, so it doesn't
+ * need a separate toast table. Returning false from this callback prevents
+ * a toast table from being created.
+ */
+static bool
+zedstoream_relation_needs_toast_table(Relation rel)
+{
+ return false;
+}
+
+/* ------------------------------------------------------------------------
+ * Planner related callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+/*
+ * Currently this is an exact duplicate of heapam_estimate_rel_size().
+ * TODO: tune it for zedstore's storage layout.
+ */
+static void
+zedstoream_relation_estimate_size(Relation rel, int32 *attr_widths,
+ BlockNumber *pages, double *tuples,
+ double *allvisfrac)
+{
+ BlockNumber curpages;
+ BlockNumber relpages;
+ double reltuples;
+ BlockNumber relallvisible;
+ double density;
+
+ /* it has storage, ok to call the smgr */
+ curpages = RelationGetNumberOfBlocks(rel);
+
+ /* coerce values in pg_class to more desirable types */
+ relpages = (BlockNumber) rel->rd_rel->relpages;
+ reltuples = (double) rel->rd_rel->reltuples;
+ relallvisible = (BlockNumber) rel->rd_rel->relallvisible;
+
+ /*
+ * HACK: if the relation has never yet been vacuumed, use a minimum size
+ * estimate of 10 pages. The idea here is to avoid assuming a
+ * newly-created table is really small, even if it currently is, because
+ * that may not be true once some data gets loaded into it. Once a vacuum
+ * or analyze cycle has been done on it, it's more reasonable to believe
+ * the size is somewhat stable.
+ *
+ * (Note that this is only an issue if the plan gets cached and used again
+ * after the table has been filled. What we're trying to avoid is using a
+ * nestloop-type plan on a table that has grown substantially since the
+ * plan was made. Normally, autovacuum/autoanalyze will occur once enough
+ * inserts have happened and cause cached-plan invalidation; but that
+ * doesn't happen instantaneously, and it won't happen at all for cases
+ * such as temporary tables.)
+ *
+ * We approximate "never vacuumed" by "has relpages = 0", which means this
+ * will also fire on genuinely empty relations. Not great, but
+ * fortunately that's a seldom-seen case in the real world, and it
+ * shouldn't degrade the quality of the plan too much anyway to err in
+ * this direction.
+ *
+ * If the table has inheritance children, we don't apply this heuristic.
+ * Totally empty parent tables are quite common, so we should be willing
+ * to believe that they are empty.
+ */
+ if (curpages < 10 &&
+ relpages == 0 &&
+ !rel->rd_rel->relhassubclass)
+ curpages = 10;
+
+ /* report estimated # pages */
+ *pages = curpages;
+ /* quick exit if rel is clearly empty */
+ if (curpages == 0)
+ {
+ *tuples = 0;
+ *allvisfrac = 0;
+ return;
+ }
+
+ /* estimate number of tuples from previous tuple density */
+ if (relpages > 0)
+ density = reltuples / (double) relpages;
+ else
+ {
+ /*
+ * When we have no data because the relation was truncated, estimate
+ * tuple width from attribute datatypes. We assume here that the
+ * pages are completely full, which is OK for tables (since they've
+ * presumably not been VACUUMed yet) but is probably an overestimate
+ * for indexes. Fortunately get_relation_info() can clamp the
+ * overestimate to the parent table's size.
+ *
+ * Note: this code intentionally disregards alignment considerations,
+ * because (a) that would be gilding the lily considering how crude
+ * the estimate is, and (b) it creates platform dependencies in the
+ * default plans which are kind of a headache for regression testing.
+ */
+ int32 tuple_width;
+
+ tuple_width = get_rel_data_width(rel, attr_widths);
+ tuple_width += MAXALIGN(SizeofHeapTupleHeader);
+ tuple_width += sizeof(ItemIdData);
+ /* note: integer division is intentional here */
+ density = (BLCKSZ - SizeOfPageHeaderData) / tuple_width;
+ }
+ *tuples = rint(density * (double) curpages);
+
+ /*
+ * We use relallvisible as-is, rather than scaling it up like we do for
+ * the pages and tuples counts, on the theory that any pages added since
+ * the last VACUUM are most likely not marked all-visible. But costsize.c
+ * wants it converted to a fraction.
+ */
+ if (relallvisible == 0 || curpages <= 0)
+ *allvisfrac = 0;
+ else if ((double) relallvisible >= curpages)
+ *allvisfrac = 1;
+ else
+ *allvisfrac = (double) relallvisible / curpages;
+}
+
+/* ------------------------------------------------------------------------
+ * Executor related callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+static bool
+zedstoream_scan_bitmap_next_block(TableScanDesc sscan,
+ TBMIterateResult *tbmres)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ BlockNumber tid_blkno = tbmres->blockno;
+ int ntuples;
+ ZSTidTreeScan tid_scan;
+ zstid tid;
+ int noff;
+
+ zs_initialize_proj_attributes_extended(scan, RelationGetDescr(scan->rs_scan.rs_rd));
+
+ /*
+ * Our strategy for a bitmap scan is to scan the tree of each attribute,
+ * starting at the given logical block number, and store all the datums
+	 * in the scan struct. zedstoream_scan_bitmap_next_tuple() then just
+	 * needs to store the datums of the next TID in the slot.
+ *
+ * An alternative would be to keep the scans of each attribute open,
+ * like in a sequential scan. I'm not sure which is better.
+ */
+ ntuples = 0;
+ zsbt_tid_begin_scan(scan->rs_scan.rs_rd,
+ ZSTidFromBlkOff(tid_blkno, 1),
+ ZSTidFromBlkOff(tid_blkno + 1, 1),
+ scan->rs_scan.rs_snapshot,
+ &tid_scan);
+ tid_scan.serializable = true;
+ noff = 0;
+ while ((tid = zsbt_tid_scan_next(&tid_scan, ForwardScanDirection)) != InvalidZSTid)
+ {
+ ItemPointerData itemptr;
+
+ Assert(ZSTidGetBlockNumber(tid) == tid_blkno);
+
+ ItemPointerSet(&itemptr, tid_blkno, ZSTidGetOffsetNumber(tid));
+
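+		/*
+		 * If the bitmap page is exact (ntuples != -1), only collect TIDs
+		 * whose offset numbers appear in the bitmap; for a lossy page we
+		 * take every TID on this logical block.
+		 */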
+ if (tbmres->ntuples != -1)
+ {
+			while (noff < tbmres->ntuples && ZSTidGetOffsetNumber(tid) > tbmres->offsets[noff])
+ {
+ /*
+ * Acquire predicate lock on all tuples that we scan, even those that are
+ * not visible to the snapshot.
+ */
+ PredicateLockTID(scan->rs_scan.rs_rd, &itemptr, scan->rs_scan.rs_snapshot);
+
+ noff++;
+ }
+
+ if (noff == tbmres->ntuples)
+ break;
+
+ if (ZSTidGetOffsetNumber(tid) < tbmres->offsets[noff])
+ continue;
+ }
+
+ Assert(ZSTidGetBlockNumber(tid) == tid_blkno);
+
+ scan->bmscan_tids[ntuples] = tid;
+ ntuples++;
+
+ /* FIXME: heapam acquires the predicate lock first, and then
+ * calls CheckForSerializableConflictOut(). We do it in the
+ * opposite order, because CheckForSerializableConflictOut()
+ * call as done in zsbt_get_last_tid() already. Does it matter?
+ * I'm not sure.
+ */
+ PredicateLockTID(scan->rs_scan.rs_rd, &itemptr, scan->rs_scan.rs_snapshot);
+ }
+ zsbt_tid_end_scan(&tid_scan);
+
+ if (ntuples)
+ {
+ TupleDesc reldesc = RelationGetDescr(scan->rs_scan.rs_rd);
+
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ {
+ int attno = scan->proj_data.proj_atts[i];
+ ZSAttrTreeScan attr_scan;
+ Datum datum;
+ bool isnull;
+ Datum *datums = scan->bmscan_datums[i];
+ bool *isnulls = scan->bmscan_isnulls[i];
+
+ zsbt_attr_begin_scan(scan->rs_scan.rs_rd,
+ reldesc,
+ attno,
+ &attr_scan);
+ for (int n = 0; n < ntuples; n++)
+ {
+ if (!zsbt_attr_fetch(&attr_scan, &datum, &isnull, scan->bmscan_tids[n]))
+ zsbt_fill_missing_attribute_value(reldesc, attno, &datum, &isnull);
+
+ /* have to make a copy because we close the scan immediately. */
+ if (!isnull)
+ datum = zs_datumCopy(datum,
+ attr_scan.attdesc->attbyval,
+ attr_scan.attdesc->attlen);
+ datums[n] = datum;
+ isnulls[n] = isnull;
+ }
+ zsbt_attr_end_scan(&attr_scan);
+ }
+ }
+ scan->bmscan_nexttuple = 0;
+ scan->bmscan_ntuples = ntuples;
+
+ return ntuples > 0;
+}
+
+static bool
+zedstoream_scan_bitmap_next_tuple(TableScanDesc sscan,
+ TBMIterateResult *tbmres,
+ TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ zstid tid;
+
+ if (scan->bmscan_nexttuple >= scan->bmscan_ntuples)
+ return false;
+ /*
+	 * The projection attributes were created based on the relation's tuple
+	 * descriptor, so it had better match the TupleTableSlot's descriptor.
+ */
+ Assert((scan->proj_data.num_proj_atts - 1) <= slot->tts_tupleDescriptor->natts);
+ tid = scan->bmscan_tids[scan->bmscan_nexttuple];
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ {
+ int natt = scan->proj_data.proj_atts[i];
+ Form_pg_attribute att = TupleDescAttr(slot->tts_tupleDescriptor, natt - 1);
+ Datum datum;
+ bool isnull;
+
+ datum = (scan->bmscan_datums[i])[scan->bmscan_nexttuple];
+ isnull = (scan->bmscan_isnulls[i])[scan->bmscan_nexttuple];
+
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ if (!isnull && att->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(scan->rs_scan.rs_rd, natt, tid, datum);
+ }
+
+ slot->tts_values[natt - 1] = datum;
+ slot->tts_isnull[natt - 1] = isnull;
+ }
+ slot->tts_tableOid = RelationGetRelid(scan->rs_scan.rs_rd);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ scan->bmscan_nexttuple++;
+
+ pgstat_count_heap_fetch(scan->rs_scan.rs_rd);
+
+ return true;
+}
+
+static bool
+zedstoream_scan_sample_next_block(TableScanDesc sscan, SampleScanState *scanstate)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ Relation rel = scan->rs_scan.rs_rd;
+ TsmRoutine *tsm = scanstate->tsmroutine;
+ int ntuples;
+ ZSTidTreeScan tid_scan;
+ zstid tid;
+ BlockNumber blockno;
+
+ /* TODO: for now, assume that we need all columns */
+ zs_initialize_proj_attributes_extended(scan, RelationGetDescr(rel));
+
+ if (scan->max_tid_to_scan == InvalidZSTid)
+ {
+ /*
+ * get the max tid once and store it, used to calculate max blocks to
+ * scan either for SYSTEM or BERNOULLI sampling.
+ */
+ scan->max_tid_to_scan = zsbt_get_last_tid(rel);
+ /*
+ * TODO: should get lowest tid instead of starting from 0
+ */
+ scan->next_tid_to_scan = ZSTidFromBlkOff(0, 1);
+ }
+
+ if (tsm->NextSampleBlock)
+ {
+ /* Adding one below to convert block number to number of blocks. */
+ blockno = tsm->NextSampleBlock(scanstate,
+ ZSTidGetBlockNumber(scan->max_tid_to_scan) + 1);
+
+ if (!BlockNumberIsValid(blockno))
+ return false;
+ }
+ else
+ {
+ /* scanning table sequentially */
+ if (scan->next_tid_to_scan > scan->max_tid_to_scan)
+ return false;
+
+ blockno = ZSTidGetBlockNumber(scan->next_tid_to_scan);
+ /* move on to next block of tids for next iteration of scan */
+ scan->next_tid_to_scan = ZSTidFromBlkOff(blockno + 1, 1);
+ }
+
+ Assert(BlockNumberIsValid(blockno));
+
+ ntuples = 0;
+ zsbt_tid_begin_scan(scan->rs_scan.rs_rd,
+ ZSTidFromBlkOff(blockno, 1),
+ ZSTidFromBlkOff(blockno + 1, 1),
+ scan->rs_scan.rs_snapshot,
+ &tid_scan);
+ while ((tid = zsbt_tid_scan_next(&tid_scan, ForwardScanDirection)) != InvalidZSTid)
+ {
+ Assert(ZSTidGetBlockNumber(tid) == blockno);
+ scan->bmscan_tids[ntuples] = tid;
+ ntuples++;
+ }
+ zsbt_tid_end_scan(&tid_scan);
+
+ scan->bmscan_nexttuple = 0;
+ scan->bmscan_ntuples = ntuples;
+
+ return true;
+}
+
+static bool
+zedstoream_scan_sample_next_tuple(TableScanDesc sscan, SampleScanState *scanstate,
+ TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ TsmRoutine *tsm = scanstate->tsmroutine;
+ zstid tid;
+ BlockNumber blockno;
+ OffsetNumber tupoffset;
+ bool found;
+
+ /* all tuples on this block are invisible */
+ if (scan->bmscan_ntuples == 0)
+ return false;
+
+ blockno = ZSTidGetBlockNumber(scan->bmscan_tids[0]);
+
+ /* find which visible tuple in this block to sample */
+ for (;;)
+ {
+ zstid lasttid_for_block = scan->bmscan_tids[scan->bmscan_ntuples - 1];
+ OffsetNumber maxoffset = ZSTidGetOffsetNumber(lasttid_for_block);
+
+ /* Ask the tablesample method which tuples to check on this page. */
+ tupoffset = tsm->NextSampleTuple(scanstate, blockno, maxoffset);
+
+ if (!OffsetNumberIsValid(tupoffset))
+ return false;
+
+ tid = ZSTidFromBlkOff(blockno, tupoffset);
+
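+		/*
+		 * The sampler picked an offset; check whether it corresponds to one
+		 * of the visible TIDs we collected for this block, and skip it
+		 * otherwise.
+		 */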
+ found = false;
+ for (int n = 0; n < scan->bmscan_ntuples; n++)
+ {
+ if (scan->bmscan_tids[n] == tid)
+ {
+ /* visible tuple */
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ break;
+ else
+ continue;
+ }
+
+ /*
+	 * The projection attributes were created based on the relation's tuple
+	 * descriptor, so it had better match the TupleTableSlot's descriptor.
+ */
+ Assert((scan->proj_data.num_proj_atts - 1) <= slot->tts_tupleDescriptor->natts);
+ /* fetch values for tuple pointed by tid to sample */
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ {
+ int attno = scan->proj_data.proj_atts[i];
+ ZSAttrTreeScan attr_scan;
+ Form_pg_attribute attr;
+ Datum datum;
+ bool isnull;
+
+ zsbt_attr_begin_scan(scan->rs_scan.rs_rd,
+ slot->tts_tupleDescriptor,
+ attno,
+ &attr_scan);
+ attr = attr_scan.attdesc;
+
+ if (zsbt_attr_fetch(&attr_scan, &datum, &isnull, tid))
+ {
+ Assert(ZSTidGetBlockNumber(tid) == blockno);
+ }
+ else
+ {
+ zsbt_fill_missing_attribute_value(slot->tts_tupleDescriptor, attno,
+ &datum, &isnull);
+ }
+
+ /*
+ * have to make a copy because we close the scan immediately.
+ * FIXME: I think this leaks into a too-long-lived context
+ */
+ if (!isnull)
+ datum = zs_datumCopy(datum, attr->attbyval, attr->attlen);
+
+ slot->tts_values[attno - 1] = datum;
+ slot->tts_isnull[attno - 1] = isnull;
+
+ zsbt_attr_end_scan(&attr_scan);
+ }
+ slot->tts_tableOid = RelationGetRelid(scan->rs_scan.rs_rd);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ return true;
+}
+
+static void
+zedstoream_vacuum_rel(Relation onerel, VacuumParams *params,
+ BufferAccessStrategy bstrategy)
+{
+ zsundo_vacuum(onerel, params, bstrategy,
+ GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM));
+}
+
+static const TableAmRoutine zedstoream_methods = {
+ .type = T_TableAmRoutine,
+ .scans_leverage_column_projection = true,
+
+ .slot_callbacks = zedstoream_slot_callbacks,
+
+ .scan_begin = zedstoream_beginscan,
+ .scan_begin_with_column_projection = zedstoream_beginscan_with_column_projection,
+ .scan_end = zedstoream_endscan,
+ .scan_rescan = zedstoream_rescan,
+ .scan_getnextslot = zedstoream_getnextslot,
+
+ .parallelscan_estimate = zs_parallelscan_estimate,
+ .parallelscan_initialize = zs_parallelscan_initialize,
+ .parallelscan_reinitialize = zs_parallelscan_reinitialize,
+
+ .index_fetch_begin = zedstoream_begin_index_fetch,
+ .index_fetch_reset = zedstoream_reset_index_fetch,
+ .index_fetch_end = zedstoream_end_index_fetch,
+ .index_fetch_set_column_projection = zedstoream_fetch_set_column_projection,
+ .index_fetch_tuple = zedstoream_index_fetch_tuple,
+
+ .tuple_insert = zedstoream_insert,
+ .tuple_insert_speculative = zedstoream_insert_speculative,
+ .tuple_complete_speculative = zedstoream_complete_speculative,
+ .multi_insert = zedstoream_multi_insert,
+ .tuple_delete = zedstoream_delete,
+ .tuple_update = zedstoream_update,
+ .tuple_lock = zedstoream_lock_tuple,
+ .finish_bulk_insert = zedstoream_finish_bulk_insert,
+
+ .tuple_fetch_row_version = zedstoream_fetch_row_version,
+ .tuple_get_latest_tid = zedstoream_get_latest_tid,
+ .tuple_tid_valid = zedstoream_tuple_tid_valid,
+ .tuple_satisfies_snapshot = zedstoream_tuple_satisfies_snapshot,
+ .compute_xid_horizon_for_tuples = zedstoream_compute_xid_horizon_for_tuples,
+
+ .relation_set_new_filenode = zedstoream_relation_set_new_filenode,
+ .relation_nontransactional_truncate = zedstoream_relation_nontransactional_truncate,
+ .relation_copy_data = zedstoream_relation_copy_data,
+ .relation_copy_for_cluster = zedstoream_relation_copy_for_cluster,
+ .relation_vacuum = zedstoream_vacuum_rel,
+ .scan_analyze_next_block = zedstoream_scan_analyze_next_block,
+ .scan_analyze_next_tuple = zedstoream_scan_analyze_next_tuple,
+
+ .index_build_range_scan = zedstoream_index_build_range_scan,
+ .index_validate_scan = zedstoream_index_validate_scan,
+
+ .relation_size = zedstoream_relation_size,
+ .relation_needs_toast_table = zedstoream_relation_needs_toast_table,
+ .relation_estimate_size = zedstoream_relation_estimate_size,
+
+ .scan_bitmap_next_block = zedstoream_scan_bitmap_next_block,
+ .scan_bitmap_next_tuple = zedstoream_scan_bitmap_next_tuple,
+ .scan_sample_next_block = zedstoream_scan_sample_next_block,
+ .scan_sample_next_tuple = zedstoream_scan_sample_next_tuple
+};
+
+Datum
+zedstore_tableam_handler(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_POINTER(&zedstoream_methods);
+}
+
+
+/*
+ * Routines for dividing up the TID range for parallel seq scans
+ */
+
+typedef struct ParallelZSScanDescData
+{
+ ParallelTableScanDescData base;
+
+ zstid pzs_endtid; /* last tid + 1 in relation at start of scan */
+ pg_atomic_uint64 pzs_allocatedtid_blk; /* TID space allocated to workers so far. (in 65536 increments) */
+} ParallelZSScanDescData;
+typedef struct ParallelZSScanDescData *ParallelZSScanDesc;
+
+static Size
+zs_parallelscan_estimate(Relation rel)
+{
+ return sizeof(ParallelZSScanDescData);
+}
+
+static Size
+zs_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
+{
+ ParallelZSScanDesc zpscan = (ParallelZSScanDesc) pscan;
+
+ zpscan->base.phs_relid = RelationGetRelid(rel);
+ zpscan->pzs_endtid = zsbt_get_last_tid(rel);
+ pg_atomic_init_u64(&zpscan->pzs_allocatedtid_blk, 0);
+
+ return sizeof(ParallelZSScanDescData);
+}
+
+static void
+zs_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
+{
+ ParallelZSScanDesc bpscan = (ParallelZSScanDesc) pscan;
+
+ pg_atomic_write_u64(&bpscan->pzs_allocatedtid_blk, 0);
+}
+
+/*
+ * get the next TID range to scan
+ *
+ * Returns true if there is more to scan, false otherwise.
+ *
+ * Get the next TID range to scan. Even if there are no TIDs left to scan,
+ * another backend could have grabbed a range to scan and not yet finished
+ * looking at it, so it doesn't follow that the scan is done when the first
+ * backend gets 'false' return.
+ */
+static bool
+zs_parallelscan_nextrange(Relation rel, ParallelZSScanDesc pzscan,
+ zstid *start, zstid *end)
+{
+ uint64 allocatedtid_blk;
+
+ /*
+	 * pzs_allocatedtid_blk tracks how much of the TID space has been
+	 * allocated to workers already, in units of 2^16 TIDs. When the
+	 * allocated TID reaches pzs_endtid, all TIDs have been allocated.
+	 *
+	 * Because we use an atomic fetch-and-add to fetch the current value, the
+	 * counter will keep incrementing past pzs_endtid: workers still bump it
+	 * when they try to allocate the next range even though everything has
+	 * already been handed out. The counter must be 64 bits wide because of
+	 * that, to avoid wrapping around when pzs_endtid is close to 2^32.
+	 * That's also one reason we do this at a granularity of 2^16 TIDs, even
+	 * though zedstore isn't block-oriented.
+ *
+ * TODO: we divide the TID space into chunks of 2^16 TIDs each. That's
+ * pretty inefficient, there's a fair amount of overhead in re-starting
+ * the B-tree scans between each range. We probably should use much larger
+ * ranges. But this is good for testing.
+ */
+ allocatedtid_blk = pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtid_blk, 1);
+ *start = ZSTidFromBlkOff(allocatedtid_blk, 1);
+ *end = ZSTidFromBlkOff(allocatedtid_blk + 1, 1);
+
+ return *start < pzscan->pzs_endtid;
+}
+
+/*
+ * Get the value for a row, when no value has been stored in the attribute tree.
+ *
+ * This is used after ALTER TABLE ADD COLUMN, when reading rows that were
+ * created before the column was added. Usually the missing value is
+ * implicitly NULL, but a different value can be specified with DEFAULT in
+ * the ALTER TABLE command.
+ */
+static void
+zsbt_fill_missing_attribute_value(TupleDesc tupleDesc, int attno, Datum *datum, bool *isnull)
+{
+ Form_pg_attribute attr = TupleDescAttr(tupleDesc, attno - 1);
+
+ *isnull = true;
+ *datum = (Datum) 0;
+
+ /* This means catalog doesn't have the default value for this attribute */
+ if (!attr->atthasmissing)
+ return;
+
+ if (tupleDesc->constr &&
+ tupleDesc->constr->missing)
+ {
+ AttrMissing *attrmiss = NULL;
+ /*
+ * If there are missing values we want to put them into the
+ * tuple.
+ */
+ attrmiss = tupleDesc->constr->missing;
+
+ if (attrmiss[attno - 1].am_present)
+ {
+ *isnull = false;
+ if (attr->attbyval)
+ *datum = fetch_att(&attrmiss[attno - 1].am_value, attr->attbyval, attr->attlen);
+ else
+ *datum = zs_datumCopy(attrmiss[attno - 1].am_value, attr->attbyval, attr->attlen);
+ }
+ }
+}
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 92285e848f5..880ed7272c5 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1064,7 +1064,11 @@ acquire_sample_rows(Relation onerel, int elevel,
* we're done.
*/
if (numrows < targrows)
- rows[numrows++] = ExecCopySlotHeapTuple(slot);
+ {
+ rows[numrows] = ExecCopySlotHeapTuple(slot);
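+			/*
+			 * ExecCopySlotHeapTuple() doesn't preserve the TID, so copy it
+			 * from the slot explicitly; later code sorts the sample rows by
+			 * t_self.
+			 */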
+ rows[numrows]->t_self = slot->tts_tid;
+ numrows++;
+ }
else
{
/*
@@ -1086,6 +1090,7 @@ acquire_sample_rows(Relation onerel, int elevel,
Assert(k >= 0 && k < targrows);
heap_freetuple(rows[k]);
rows[k] = ExecCopySlotHeapTuple(slot);
+ rows[k]->t_self = slot->tts_tid;
}
rowstoskip -= 1;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3aeef30b281..7645ab24bd8 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2114,9 +2114,26 @@ CopyTo(CopyState cstate)
{
TupleTableSlot *slot;
TableScanDesc scandesc;
+ Bitmapset *proj = NULL;
- scandesc = table_beginscan(cstate->rel, GetActiveSnapshot(), 0, NULL);
slot = table_slot_create(cstate->rel, NULL);
+ if (table_scans_leverage_column_projection(cstate->rel))
+ {
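+			/*
+			 * COPY TO outputs only the columns listed in attnumlist, so
+			 * project just those.
+			 */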
+ foreach(cur, cstate->attnumlist)
+ {
+ int attnum = lfirst_int(cur);
+ Assert(attnum <= slot->tts_tupleDescriptor->natts);
+ proj = bms_add_member(proj, attnum);
+ }
+
+ scandesc = table_beginscan_with_column_projection(cstate->rel,
+ GetActiveSnapshot(),
+ 0, NULL, proj);
+ }
+ else
+ {
+ scandesc = table_beginscan(cstate->rel, GetActiveSnapshot(), 0, NULL);
+ }
processed = 0;
while (table_scan_getnextslot(scandesc, ForwardScanDirection, slot))
@@ -2133,6 +2150,8 @@ CopyTo(CopyState cstate)
ExecDropSingleTupleTableSlot(slot);
table_endscan(scandesc);
+ if (proj)
+ pfree(proj);
}
else
{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fb2be10794b..d956c17b6e6 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -9680,7 +9680,16 @@ validateCheckConstraint(Relation rel, HeapTuple constrtup)
econtext->ecxt_scantuple = slot;
snapshot = RegisterSnapshot(GetLatestSnapshot());
- scan = table_beginscan(rel, snapshot, 0, NULL);
+ if (table_scans_leverage_column_projection(rel))
+ {
+ Bitmapset *proj = NULL;
+ PopulateNeededColumnsForNode((Node*)exprstate->expr, slot->tts_tupleDescriptor->natts, &proj);
+ scan = table_beginscan_with_column_projection(rel, snapshot, 0, NULL, proj);
+ }
+ else
+ {
+ scan = table_beginscan(rel, snapshot, 0, NULL);
+ }
/*
* Switch to per-tuple memory context and reset it for each tuple
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 2d9a8e9d541..95c212d471e 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2840,6 +2840,10 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
if (newtuple != trigtuple)
heap_freetuple(newtuple);
}
+
+	/* Make sure the slot is not dependent on the original tuple */
+ ExecMaterializeSlot(slot);
+
if (should_free)
heap_freetuple(trigtuple);
@@ -3124,6 +3128,10 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
newtuple = NULL;
}
}
+
+	/* Make sure the new slot is not dependent on the original tuple */
+ ExecMaterializeSlot(newslot);
+
if (should_free_trig)
heap_freetuple(trigtuple);
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index c0e4a5376c3..06d16a3ccb0 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -20,6 +20,7 @@
#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
#include "utils/memutils.h"
@@ -300,3 +301,72 @@ ExecScanReScan(ScanState *node)
}
}
}
+
+typedef struct neededColumnContext
+{
+ Bitmapset **mask;
+ int n;
+} neededColumnContext;
+
+static bool
+neededColumnContextWalker(Node *node, neededColumnContext *c)
+{
+ if (node == NULL)
+ return false;
+
+ if (IsA(node, Var))
+ {
+ Var *var = (Var *)node;
+
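+		/*
+		 * System columns have negative attnos and don't need an entry in
+		 * the projection mask.
+		 */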
+ if (var->varattno >= 0)
+ {
+ Assert(var->varattno <= c->n);
+ *(c->mask) = bms_add_member(*(c->mask), var->varattno);
+ }
+
+ /*
+		 * A zero varattno means a whole-row variable, so set the bits for
+		 * all the columns.
+ */
+ if (var->varattno == 0)
+			*(c->mask) = bms_add_range(*(c->mask), 1, c->n);
+
+ return false;
+ }
+ return expression_tree_walker(node, neededColumnContextWalker, (void * )c);
+}
+
+/*
+ * n specifies the number of allowed entries in mask: we use
+ * it for bounds-checking in the walker above.
+ */
+void
+PopulateNeededColumnsForNode(Node *expr, int n, Bitmapset **scanCols)
+{
+ neededColumnContext c;
+
+ c.mask = scanCols;
+ c.n = n;
+
+ neededColumnContextWalker(expr, &c);
+}
+
+Bitmapset *
+PopulateNeededColumnsForScan(ScanState *scanstate, int ncol)
+{
+ Bitmapset *result = NULL;
+ Plan *plan = scanstate->ps.plan;
+
+ PopulateNeededColumnsForNode((Node *) plan->targetlist, ncol, &result);
+ PopulateNeededColumnsForNode((Node *) plan->qual, ncol, &result);
+
+ if (IsA(plan, IndexScan))
+ {
+ PopulateNeededColumnsForNode((Node *) ((IndexScan *) plan)->indexqualorig, ncol, &result);
+ PopulateNeededColumnsForNode((Node *) ((IndexScan *) plan)->indexorderbyorig, ncol, &result);
+ }
+ else if (IsA(plan, BitmapHeapScan))
+ PopulateNeededColumnsForNode((Node *) ((BitmapHeapScan *) plan)->bitmapqualorig, ncol, &result);
+
+ return result;
+}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index ac7aa81f674..2c699e6c465 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -115,6 +115,13 @@ IndexNext(IndexScanState *node)
node->iss_NumScanKeys,
node->iss_NumOrderByKeys);
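+		/*
+		 * Tell the table AM's index fetch machinery which columns this plan
+		 * actually needs, so that only those columns are fetched from the
+		 * underlying table.
+		 */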
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ Bitmapset *proj = NULL;
+ proj = PopulateNeededColumnsForScan(&node->ss, node->ss.ss_currentRelation->rd_att->natts);
+ table_index_fetch_set_column_projection(scandesc->xs_heapfetch, proj);
+ }
+
node->iss_ScanDesc = scandesc;
/*
@@ -897,6 +904,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
{
IndexScanState *indexstate;
Relation currentRelation;
+ const TupleTableSlotOps *table_slot_ops;
LOCKMODE lockmode;
/*
@@ -923,11 +931,19 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ss_currentScanDesc = NULL; /* no heap scan here */
/*
- * get the scan type from the relation descriptor.
+ * Initialize the scan slot.
+ *
+ * With the reorder queue, we will sometimes use the reorderqueue's slot,
+ * which uses heap ops, and sometimes the table AM's slot directly. We
+ * have to set scanopsfixed to false, unless the table AM also uses heap
+ * ops.
*/
+ table_slot_ops = table_slot_callbacks(currentRelation);
ExecInitScanTupleSlot(estate, &indexstate->ss,
RelationGetDescr(currentRelation),
- table_slot_callbacks(currentRelation));
+ table_slot_ops);
+ if (node->indexorderby && table_slot_ops != &TTSOpsHeapTuple)
+ indexstate->ss.ps.scanopsfixed = false;
/*
* Initialize result type and projection.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 436b43f8ca5..bacaa7b5c23 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -31,6 +31,7 @@
#include "access/tableam.h"
#include "executor/execdebug.h"
#include "executor/nodeSeqscan.h"
+#include "nodes/nodeFuncs.h"
#include "utils/rel.h"
static TupleTableSlot *SeqNext(SeqScanState *node);
@@ -68,9 +69,20 @@ SeqNext(SeqScanState *node)
* We reach here if the scan is not parallel, or if we're serially
* executing a scan that was planned to be parallel.
*/
- scandesc = table_beginscan(node->ss.ss_currentRelation,
- estate->es_snapshot,
- 0, NULL);
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ Bitmapset *proj = PopulateNeededColumnsForScan(&node->ss,
+ node->ss.ss_currentRelation->rd_att->natts);
+ scandesc = table_beginscan_with_column_projection(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL, proj);
+ }
+ else
+ {
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ }
node->ss.ss_currentScanDesc = scandesc;
}
@@ -270,14 +282,22 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
{
EState *estate = node->ss.ps.state;
ParallelTableScanDesc pscan;
+ Bitmapset *proj = NULL;
pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
+
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ proj = PopulateNeededColumnsForScan(&node->ss,
+ node->ss.ss_currentRelation->rd_att->natts);
+ }
+
table_parallelscan_initialize(node->ss.ss_currentRelation,
pscan,
estate->es_snapshot);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
- table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+ table_beginscan_parallel(node->ss.ss_currentRelation, pscan, proj);
}
/* ----------------------------------------------------------------
@@ -307,8 +327,19 @@ ExecSeqScanInitializeWorker(SeqScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelTableScanDesc pscan;
+ Bitmapset *proj = NULL;
+
+ /*
+	 * FIXME: this duplicates work done in ExecSeqScanInitializeDSM. Once the
+	 * plan carries the projection list, this overhead will go away.
+ */
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ proj = PopulateNeededColumnsForScan(&node->ss,
+ node->ss.ss_currentRelation->rd_att->natts);
+ }
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->ss.ss_currentScanDesc =
- table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+ table_beginscan_parallel(node->ss.ss_currentRelation, pscan, proj);
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0c036209f09..6ad97982403 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -825,6 +825,9 @@ use_physical_tlist(PlannerInfo *root, Path *path, int flags)
rel->rtekind != RTE_CTE)
return false;
+ if (rel->rtekind == RTE_RELATION && rel->leverage_column_projection)
+ return false;
+
/*
* Can't do it with inheritance cases either (mainly because Append
* doesn't project; this test may be unnecessary now that
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index cf1761401dd..567b92ee2a1 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -126,6 +126,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
*/
relation = table_open(relationObjectId, NoLock);
+ if (relation->rd_tableam)
+ rel->leverage_column_projection = relation->rd_tableam->scans_leverage_column_projection;
/* Temporary and unlogged relations are inaccessible during recovery. */
if (!RelationNeedsWAL(relation) && RecoveryInProgress())
ereport(ERROR,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 46d03f3b9b3..7dde3597fff 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1281,6 +1281,7 @@ check_default_partition_contents(Relation parent, Relation default_rel,
TableScanDesc scan;
MemoryContext oldCxt;
TupleTableSlot *tupslot;
+ Bitmapset *proj = NULL;
/* Lock already taken above. */
if (part_relid != RelationGetRelid(default_rel))
@@ -1345,7 +1346,15 @@ check_default_partition_contents(Relation parent, Relation default_rel,
econtext = GetPerTupleExprContext(estate);
snapshot = RegisterSnapshot(GetLatestSnapshot());
tupslot = table_slot_create(part_rel, &estate->es_tupleTable);
- scan = table_beginscan(part_rel, snapshot, 0, NULL);
+ if (table_scans_leverage_column_projection(part_rel))
+ {
+ PopulateNeededColumnsForNode((Node*)partqualstate->expr, tupslot->tts_tupleDescriptor->natts, &proj);
+ scan = table_beginscan_with_column_projection(part_rel, snapshot, 0, NULL, proj);
+ }
+ else
+ {
+ scan = table_beginscan(part_rel, snapshot, 0, NULL);
+ }
/*
* Switch to per-tuple memory context and reset it for each tuple
@@ -1375,6 +1384,9 @@ check_default_partition_contents(Relation parent, Relation default_rel,
if (RelationGetRelid(default_rel) != RelationGetRelid(part_rel))
table_close(part_rel, NoLock); /* keep the lock until commit */
+
+ if (proj)
+ pfree(proj);
}
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5315d93af06..0839085d30e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -154,6 +154,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
case RM_COMMIT_TS_ID:
case RM_REPLORIGIN_ID:
case RM_GENERIC_ID:
+ case RM_ZEDSTORE_ID:
/* just deal with xid, and done */
ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
buf.origptr);
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 78fb90fb1bd..48c10f3003b 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -2547,8 +2547,6 @@ PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot)
void
PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot)
{
- PREDICATELOCKTARGETTAG tag;
- ItemPointer tid;
TransactionId targetxmin;
if (!SerializationNeededForRead(relation, snapshot))
@@ -2579,6 +2577,17 @@ PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot)
}
}
+ PredicateLockTID(relation, &(tuple->t_self), snapshot);
+}
+
+void
+PredicateLockTID(Relation relation, ItemPointer tid, Snapshot snapshot)
+{
+ PREDICATELOCKTARGETTAG tag;
+
+ if (!SerializationNeededForRead(relation, snapshot))
+ return;
+
/*
* Do quick-but-not-definitive test for a relation lock first. This will
* never cause a return when the relation is *not* locked, but will
@@ -2591,7 +2600,6 @@ PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot)
if (PredicateLockExists(&tag))
return;
- tid = &(tuple->t_self);
SET_PREDICATELOCKTARGETTAG_TUPLE(tag,
relation->rd_node.dbNode,
relation->rd_id,
@@ -4054,14 +4062,11 @@ XidIsConcurrent(TransactionId xid)
* currently no known reason to call this function from an index AM.
*/
void
-CheckForSerializableConflictOut(bool visible, Relation relation,
+heap_CheckForSerializableConflictOut(bool visible, Relation relation,
HeapTuple tuple, Buffer buffer,
Snapshot snapshot)
{
TransactionId xid;
- SERIALIZABLEXIDTAG sxidtag;
- SERIALIZABLEXID *sxid;
- SERIALIZABLEXACT *sxact;
HTSV_Result htsvResult;
if (!SerializationNeededForRead(relation, snapshot))
@@ -4125,6 +4130,19 @@ CheckForSerializableConflictOut(bool visible, Relation relation,
Assert(TransactionIdIsValid(xid));
Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+ return CheckForSerializableConflictOut(relation, xid, snapshot);
+}
+
+void
+CheckForSerializableConflictOut(Relation relation, TransactionId xid, Snapshot snapshot)
+{
+ SERIALIZABLEXIDTAG sxidtag;
+ SERIALIZABLEXID *sxid;
+ SERIALIZABLEXACT *sxact;
+
+ if (!SerializationNeededForRead(relation, snapshot))
+ return;
+
/*
* Find top level xid. Bail out if xid is too early to be a conflict, or
* if it's our own xid.
@@ -4439,8 +4457,7 @@ CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
* tuple itself.
*/
void
-CheckForSerializableConflictIn(Relation relation, HeapTuple tuple,
- Buffer buffer)
+CheckForSerializableConflictIn(Relation relation, ItemPointer tid, BlockNumber blkno)
{
PREDICATELOCKTARGETTAG targettag;
@@ -4470,22 +4487,22 @@ CheckForSerializableConflictIn(Relation relation, HeapTuple tuple,
* It is not possible to take and hold a lock across the checks for all
* granularities because each target could be in a separate partition.
*/
- if (tuple != NULL)
+ if (tid != NULL)
{
SET_PREDICATELOCKTARGETTAG_TUPLE(targettag,
relation->rd_node.dbNode,
relation->rd_id,
- ItemPointerGetBlockNumber(&(tuple->t_self)),
- ItemPointerGetOffsetNumber(&(tuple->t_self)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
CheckTargetForConflictsIn(&targettag);
}
- if (BufferIsValid(buffer))
+ if (blkno != InvalidBlockNumber)
{
SET_PREDICATELOCKTARGETTAG_PAGE(targettag,
relation->rd_node.dbNode,
relation->rd_id,
- BufferGetBlockNumber(buffer));
+ blkno);
CheckTargetForConflictsIn(&targettag);
}
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca4b1c..c0dc97b3072 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -22,6 +22,7 @@
#include "access/spgxlog.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/zedstore_wal.h"
#include "catalog/storage_xlog.h"
#include "commands/dbcommands_xlog.h"
#include "commands/sequence.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5f..2a7d88848f8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_ZEDSTORE_ID, "Zedstore", zedstore_redo, zedstore_desc, zedstore_identify, NULL, NULL, NULL)
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7f81703b783..0b8dcc66322 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -162,6 +162,7 @@ typedef struct TableAmRoutine
{
/* this must be set to T_TableAmRoutine */
NodeTag type;
+ bool scans_leverage_column_projection;
/* ------------------------------------------------------------------------
@@ -202,6 +203,13 @@ typedef struct TableAmRoutine
ParallelTableScanDesc pscan,
uint32 flags);
+ TableScanDesc (*scan_begin_with_column_projection)(Relation relation,
+ Snapshot snapshot,
+ int nkeys, struct ScanKeyData *key,
+ ParallelTableScanDesc parallel_scan,
+ uint32 flags,
+ Bitmapset *project_column);
+
/*
* Release resources and deallocate scan. If TableScanDesc.temp_snap,
* TableScanDesc.rs_snapshot needs to be unregistered.
@@ -277,6 +285,13 @@ typedef struct TableAmRoutine
*/
void (*index_fetch_end) (struct IndexFetchTableData *data);
+ /*
+ * Set column projections for AMs that leverage column projection for
+ * scanning.
+ */
+ void (*index_fetch_set_column_projection) (struct IndexFetchTableData *data,
+ Bitmapset *project_column);
+
/*
* Fetch tuple at `tid` into `slot`, after doing a visibility test
* according to `snapshot`. If a tuple was found and passed the visibility
@@ -742,6 +757,12 @@ table_beginscan(Relation rel, Snapshot snapshot,
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
+static inline bool
+table_scans_leverage_column_projection(Relation relation)
+{
+ return relation->rd_tableam->scans_leverage_column_projection;
+}
+
/*
* Like table_beginscan(), but for scanning catalog. It'll automatically use a
* snapshot appropriate for scanning catalog relations.
@@ -771,6 +792,19 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
+static inline TableScanDesc
+table_beginscan_with_column_projection(Relation relation, Snapshot snapshot,
+ int nkeys, struct ScanKeyData *key,
+ Bitmapset *project_column)
+{
+ uint32 flags = SO_TYPE_SEQSCAN |
+ SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
+
+ Assert(relation->rd_tableam->scans_leverage_column_projection);
+ return relation->rd_tableam->scan_begin_with_column_projection(
+ relation, snapshot, nkeys, key, NULL, flags, project_column);
+}
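+
+/*
+ * A hypothetical caller (sketch only, not taken from this patch) that wants
+ * just columns 1 and 3 could build the projection set like this, assuming
+ * attribute numbers are used directly as bitmap members:
+ *
+ *   Bitmapset    *proj = NULL;
+ *   TableScanDesc scan;
+ *
+ *   proj = bms_add_member(proj, 1);
+ *   proj = bms_add_member(proj, 3);
+ *   scan = table_beginscan_with_column_projection(rel, snapshot, 0, NULL, proj);
+ */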
+
/*
* table_beginscan_bm is an alternative entry point for setting up a
* TableScanDesc for a bitmap heap scan. Although that scan technology is
@@ -905,7 +939,8 @@ extern void table_parallelscan_initialize(Relation rel,
* Caller must hold a suitable lock on the relation.
*/
extern TableScanDesc table_beginscan_parallel(Relation rel,
- ParallelTableScanDesc pscan);
+ ParallelTableScanDesc pscan,
+ Bitmapset *proj);
/*
* Restart a parallel scan. Call this in the leader process. Caller is
@@ -955,6 +990,13 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
scan->rel->rd_tableam->index_fetch_end(scan);
}
+static inline void
+table_index_fetch_set_column_projection(struct IndexFetchTableData *scan,
+ Bitmapset *project_column)
+{
+ scan->rel->rd_tableam->index_fetch_set_column_projection(scan, project_column);
+}
+
/*
* Fetches, as part of an index scan, tuple at `tid` into `slot`, after doing
* a visibility test according to `snapshot`. If a tuple was found and passed
diff --git a/src/include/access/zedstore_compression.h b/src/include/access/zedstore_compression.h
new file mode 100644
index 00000000000..6f61b2d3029
--- /dev/null
+++ b/src/include/access/zedstore_compression.h
@@ -0,0 +1,16 @@
+/*
+ * zedstore_compression.h
+ * internal declarations for ZedStore compression
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_compression.h
+ */
+#ifndef ZEDSTORE_COMPRESSION_H
+#define ZEDSTORE_COMPRESSION_H
+
+extern int zs_try_compress(const char *src, char *dst, int srcSize, int dstCapacity);
+extern void zs_decompress(const char *src, char *dst, int compressedSize, int uncompressedSize);
+
+#endif /* ZEDSTORE_COMPRESSION_H */
diff --git a/src/include/access/zedstore_internal.h b/src/include/access/zedstore_internal.h
new file mode 100644
index 00000000000..dc627ac2570
--- /dev/null
+++ b/src/include/access/zedstore_internal.h
@@ -0,0 +1,1035 @@
+/*
+ * zedstore_internal.h
+ * internal declarations for ZedStore tables
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_internal.h
+ */
+#ifndef ZEDSTORE_INTERNAL_H
+#define ZEDSTORE_INTERNAL_H
+
+#include "access/tableam.h"
+#include "access/zedstore_compression.h"
+#include "access/zedstore_undo.h"
+#include "lib/integerset.h"
+#include "storage/bufmgr.h"
+#include "storage/smgr.h"
+#include "utils/datum.h"
+
+#define ZS_META_ATTRIBUTE_NUM 0
+
+/*
+ * Throughout ZedStore, we pass around TIDs as uint64's, rather than ItemPointers,
+ * for speed.
+ */
+typedef uint64 zstid;
+
+#define InvalidZSTid 0
+#define MinZSTid 1 /* blk 0, off 1 */
+#define MaxZSTid ((uint64) MaxBlockNumber << 16 | 0xffff)
+/* note: if this is converted to ItemPointer, it is invalid */
+#define MaxPlusOneZSTid (MaxZSTid + 1)
+
+#define MaxZSTidOffsetNumber 129
+
+static inline zstid
+ZSTidFromBlkOff(BlockNumber blk, OffsetNumber off)
+{
+ Assert(off != 0);
+
+ return (uint64) blk * (MaxZSTidOffsetNumber - 1) + off;
+}
+
+static inline zstid
+ZSTidFromItemPointer(ItemPointerData iptr)
+{
+ Assert(ItemPointerIsValid(&iptr));
+ return ZSTidFromBlkOff(ItemPointerGetBlockNumber(&iptr),
+ ItemPointerGetOffsetNumber(&iptr));
+}
+
+static inline ItemPointerData
+ItemPointerFromZSTid(zstid tid)
+{
+ ItemPointerData iptr;
+ BlockNumber blk;
+ OffsetNumber off;
+
+ blk = (tid - 1) / (MaxZSTidOffsetNumber - 1);
+ off = (tid - 1) % (MaxZSTidOffsetNumber - 1) + 1;
+
+ ItemPointerSet(&iptr, blk, off);
+ Assert(ItemPointerIsValid(&iptr));
+ return iptr;
+}
+
+static inline BlockNumber
+ZSTidGetBlockNumber(zstid tid)
+{
+ return (BlockNumber) ((tid - 1) / (MaxZSTidOffsetNumber - 1));
+}
+
+static inline OffsetNumber
+ZSTidGetOffsetNumber(zstid tid)
+{
+ return (OffsetNumber) ((tid - 1) % (MaxZSTidOffsetNumber - 1) + 1);
+}
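+
+/*
+ * As a worked example of the mapping above: with MaxZSTidOffsetNumber = 129,
+ * each block number covers 128 consecutive TIDs, so block 0 maps to TIDs
+ * 1..128, block 1 to TIDs 129..256, and so on.
+ *
+ *   ZSTidFromBlkOff(1, 1)     == 129
+ *   ZSTidGetBlockNumber(129)  == 1
+ *   ZSTidGetOffsetNumber(129) == 1
+ *
+ * i.e. converting to an ItemPointer and back is lossless as long as the
+ * offset stays below MaxZSTidOffsetNumber.
+ */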
+
+/*
+ * A ZedStore table contains different kinds of pages, all in the same file.
+ *
+ * Block 0 is always a metapage. It contains the block numbers of the other
+ * data structures stored within the file, like the per-attribute B-trees,
+ * and the UNDO log. In addition, if there are overly large datums in
+ * the table, they are chopped into separate "toast" pages.
+ */
+#define ZS_META_PAGE_ID 0xF083
+#define ZS_BTREE_PAGE_ID 0xF084
+#define ZS_UNDO_PAGE_ID 0xF085
+#define ZS_TOAST_PAGE_ID 0xF086
+#define ZS_FREE_PAGE_ID 0xF087
+
+/* flags for zedstore b-tree pages */
+#define ZSBT_ROOT 0x0001
+
+typedef struct ZSBtreePageOpaque
+{
+ AttrNumber zs_attno;
+ BlockNumber zs_next;
+ zstid zs_lokey; /* inclusive */
+ zstid zs_hikey; /* exclusive */
+ uint16 zs_level; /* 0 = leaf */
+ uint16 zs_flags;
+ uint16 padding; /* padding, to put zs_page_id last */
+ uint16 zs_page_id; /* always ZS_BTREE_PAGE_ID */
+} ZSBtreePageOpaque;
+
+#define ZSBtreePageGetOpaque(page) ((ZSBtreePageOpaque *) PageGetSpecialPointer(page))
+
+/*
+ * Internal B-tree page layout.
+ *
+ * The "contents" of the page is an array of ZSBtreeInternalPageItem. The number
+ * of items can be deduced from pd_lower.
+ */
+typedef struct ZSBtreeInternalPageItem
+{
+ zstid tid;
+ BlockNumber childblk;
+} ZSBtreeInternalPageItem;
+
+static inline ZSBtreeInternalPageItem *
+ZSBtreeInternalPageGetItems(Page page)
+{
+ ZSBtreeInternalPageItem *items;
+
+ items = (ZSBtreeInternalPageItem *) PageGetContents(page);
+
+ return items;
+}
+static inline int
+ZSBtreeInternalPageGetNumItems(Page page)
+{
+ ZSBtreeInternalPageItem *begin;
+ ZSBtreeInternalPageItem *end;
+
+ begin = (ZSBtreeInternalPageItem *) PageGetContents(page);
+ end = (ZSBtreeInternalPageItem *) ((char *) page + ((PageHeader) page)->pd_lower);
+
+ return end - begin;
+}
+
+static inline bool
+ZSBtreeInternalPageIsFull(Page page)
+{
+ PageHeader phdr = (PageHeader) page;
+
+ return phdr->pd_upper - phdr->pd_lower < sizeof(ZSBtreeInternalPageItem);
+}
+
+/*
+ * Attribute B-tree leaf page layout
+ *
+ * Leaf pages in the attribute trees are packed with "array items", which
+ * contain the actual user data for the column, in a compact format. Each
+ * array item contains the datums for a range of TIDs. The ranges of two
+ * items never overlap, but there can be gaps, if a row has been deleted
+ * or updated.
+ *
+ * Each array item consists of a fixed header, a list of TIDs of the rows
+ * contained in it, a NULL bitmap (if there are any NULLs), and the actual
+ * Datum data. The TIDs are encoded using Simple-8b encoding, like in the
+ * TID tree.
+ *
+ * The data (including the TID codewords) can be compressed. In that case,
+ * ZSAttributeCompressedItem is used. The fields are mostly the same as in
+ * ZSAttributeArrayItem, and we cast between the two liberally.
+ *
+ * The datums are packed in a custom format. Fixed-width datatypes are
+ * stored as is, but without any alignment padding. Variable-length
+ * datatypes are *not* stored in the usual Postgres varlen format; the
+ * following encoding is used instead:
+ *
+ * Each varlen datum begins with a one or two byte header, to store the
+ * size. If the size of the datum, excluding the varlen header, is <=
+ * 128, then a one byte header is used. Otherwise, the high bit of the
+ * first byte is set, and two bytes are used to represent the size.
+ * Two bytes is always enough, because if a datum is larger than a page,
+ * it must be toasted.
+ *
+ * Traditional Postgres toasted datums should not be seen on-disk in
+ * zedstore. However, "zedstore-toasted" datums, i.e. datums that have been
+ * stored on separate toast blocks within zedstore, are possible. They
+ * are stored with magic 0xFF 0xFF as the two header bytes, followed by
+ * the block number of the first toast block.
+ *
+ * 0xxxxxxx [up to 128 bytes of data follows]
+ * 1xxxxxxx xxxxxxxx [data]
+ * 11111111 11111111 toast pointer.
+ *
+ * XXX Heikki: I'm not sure if this special encoding makes sense. Perhaps
+ * just storing normal Postgres varlenas would be better. Having a custom
+ * encoding felt like a good idea, but I'm not sure we're actually gaining
+ * anything. If we also did alignment padding, like the rest of Postgres
+ * does, then we could avoid some memory copies when decoding the array.
+ *
+ * TODO: squeeze harder: eliminate padding, use high bits of t_tid for flags or size
+ */
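+
+/*
+ * For illustration (assuming the size is kept in the low bits of the header
+ * byte): a 100-byte varlena value would be stored as a single size byte with
+ * value 100 followed by the 100 data bytes; a 1000-byte value needs the
+ * two-byte header form with the high bit of the first byte set; and a value
+ * that has been moved to zedstore toast pages is stored as 0xFF 0xFF followed
+ * by the starting toast block number.
+ */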
+typedef struct ZSAttributeArrayItem
+{
+ uint16 t_size;
+ uint16 t_flags;
+
+ uint16 t_num_elements;
+ uint16 t_num_codewords;
+
+ zstid t_firsttid;
+ zstid t_endtid;
+
+ uint64 t_tid_codewords[FLEXIBLE_ARRAY_MEMBER];
+
+ /* NULL bitmap follows, if ZSBT_HAS_NULLS is set */
+
+ /* The Datum data follows */
+} ZSAttributeArrayItem;
+
+typedef struct ZSAttributeCompressedItem
+{
+ uint16 t_size;
+ uint16 t_flags;
+
+ uint16 t_num_elements;
+ uint16 t_num_codewords;
+
+ zstid t_firsttid;
+ zstid t_endtid;
+
+ uint16 t_uncompressed_size;
+
+ /* compressed data follows */
+ char t_payload[FLEXIBLE_ARRAY_MEMBER];
+
+} ZSAttributeCompressedItem;
+
+/*
+ * The two structs above are stored on disk. ZSExplodedItem is a third
+ * representation of an array item that is only used in memory, when
+ * repacking items on a page. It is distinguished by t_size == 0.
+ */
+typedef struct ZSExplodedItem
+{
+ uint16 t_size; /* dummy 0 */
+ uint16 t_flags;
+
+ uint16 t_num_elements;
+
+ zstid *tids;
+
+ bits8 *nullbitmap;
+
+ char *datumdata;
+ int datumdatasz;
+} ZSExplodedItem;
+
+#define ZSBT_ATTR_COMPRESSED 0x0001
+#define ZSBT_HAS_NULLS 0x0002
+
+#define ZSBT_ATTR_BITMAPLEN(nelems) (((int) (nelems) + 7) / 8)
+
+static inline void
+zsbt_attr_item_setnull(bits8 *nullbitmap, int n)
+{
+ nullbitmap[n / 8] |= (1 << (n % 8));
+}
+
+static inline bool
+zsbt_attr_item_isnull(bits8 *nullbitmap, int n)
+{
+ return (nullbitmap[n / 8] & (1 << (n % 8))) != 0;
+}
+
+/*
+ * TID B-tree leaf page layout
+ *
+ * Leaf pages are packed with ZSTidArrayItems. Each ZSTidArrayItem represents
+ * a range of tuples, starting at 't_firsttid', up to 't_endtid' - 1. For each
+ * tuple, we store its TID and the UNDO pointer. The TIDs and UNDO pointers are specially
+ * encoded, so that they take less space.
+ *
+ * Item format:
+ *
+ * We make use of some assumptions / observations on the TIDs and UNDO pointers
+ * to pack them tightly:
+ *
+ * - TIDs are kept in ascending order, and the gap between two TIDs
+ * is usually very small. On a newly loaded table, all TIDs are
+ * consecutive.
+ *
+ * - It's common for the UNDO pointer to be old so that the tuple is
+ * visible to everyone. In that case we don't need to keep the exact value.
+ *
+ * - Nearby TIDs are likely to have only a few distinct UNDO pointer values.
+ *
+ *
+ * Each item looks like this:
+ *
+ * Header | 1-16 TID codewords | 0-2 UNDO pointers | UNDO "slotwords"
+ *
+ * The fixed-size header contains the start and end of the TID range that
+ * this item represents, and information on how many UNDO slots and codewords
+ * follow in the variable-size part.
+ *
+ * After the fixed-size header comes the list of TIDs. They are encoded in
+ * Simple-8b codewords. Simple-8b is an encoding scheme to pack multiple
+ * integers in 64-bit codewords. A single codeword can pack e.g. three 20-bit
+ * integers, or 20 3-bit integers, or a number of different combinations.
+ * Therefore, small integers pack more tightly than larger integers. We encode
+ * the difference between each TID, so in the common case that there are few
+ * gaps between the TIDs, we only need a few bits per tuple. The first encoded
+ * integer is always 0, because the first TID is stored explicitly in
+ * t_firsttid. (TODO: storing the first constant 0 is obviously a waste of
+ * space. Also, since there cannot be duplicates, we could store "delta - 1",
+ * which would allow a tighter representation in some cases.)
+ *
+ * After the TID codewords come the so-called "UNDO slots". They represent all the
+ * distinct UNDO pointers in the group of TIDs that this item covers.
+ * Logically, there are 4 slots. Slots 0 and 1 are special, representing
+ * all-visible "old" TIDs, and "dead" TIDs. They are not stored in the item
+ * itself, to save space, but logically, they can be thought to be part of
+ * every item. They are included in 't_num_undo_slots', so the number of UNDO
+ * pointers physically stored on an item is actually 't_num_undo_slots - 2'.
+ *
+ * With the 4 UNDO slots, we can represent an UNDO pointer using a 2-bit
+ * slot number. If you update a tuple with a new UNDO pointer, and all four
+ * slots are already in use, the item needs to be split. Hopefully that doesn't
+ * happen too often (see assumptions above).
+ *
+ * After the UNDO slots come "UNDO slotwords". The slotwords contain the slot
+ * number of each tuple in the item. The slot numbers are packed in 64 bit
+ * integers, with 2 bits for each tuple.
+ *
+ * Representing UNDO pointers as distinct slots also has the advantage that
+ * when we're scanning the TID array, we can check the few UNDO pointers in
+ * the slots against the current snapshot, and remember the visibility of
+ * each slot, instead of checking every UNDO pointer separately. That
+ * considerably speeds up visibility checks when reading. That's one
+ * advantage of this special encoding scheme, compared to e.g. using a
+ * general-purpose compression algorithm on an array of TIDs and UNDO pointers.
+ *
+ * The physical size of an item depends on how many tuples it covers, the
+ * number of codewords needed to encode the TIDs, and how many distinct UNDO
+ * pointers they have.
+ */
+typedef struct
+{
+ uint16 t_size;
+ uint16 t_num_tids;
+ uint16 t_num_codewords;
+ uint16 t_num_undo_slots;
+
+ zstid t_firsttid;
+ zstid t_endtid;
+
+ /* Followed by UNDO slots, and then followed by codewords */
+ uint64 t_payload[FLEXIBLE_ARRAY_MEMBER];
+
+} ZSTidArrayItem;
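+
+/*
+ * As an illustration of the layout described above: an item covering 128
+ * freshly loaded, consecutive TIDs, all inserted by the same transaction,
+ * would have t_num_tids = 128, t_firsttid = the first TID, t_endtid =
+ * first TID + 128, a handful of codewords encoding the deltas
+ * (0, 1, 1, ..., 1), one "normal" UNDO slot pointing at the inserter's UNDO
+ * record (t_num_undo_slots = 3, counting the two implicit slots), and
+ * ZSBT_NUM_SLOTWORDS(128) = 4 slotwords in which every 2-bit slot number is 2.
+ */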
+
+/*
+ * We use 2 bits for the UNDO slot number for every tuple. We can therefore
+ * fit 32 slot numbers in each 64-bit "slotword".
+ */
+#define ZSBT_ITEM_UNDO_SLOT_BITS 2
+#define ZSBT_MAX_ITEM_UNDO_SLOTS (1 << (ZSBT_ITEM_UNDO_SLOT_BITS))
+#define ZSBT_ITEM_UNDO_SLOT_MASK (ZSBT_MAX_ITEM_UNDO_SLOTS - 1)
+#define ZSBT_SLOTNOS_PER_WORD (64 / ZSBT_ITEM_UNDO_SLOT_BITS)
+
+/*
+ * To keep the item size and time needed to work with them reasonable,
+ * limit the size of an item to max 16 codewords and 128 TIDs.
+ */
+#define ZSBT_MAX_ITEM_CODEWORDS 16
+#define ZSBT_MAX_ITEM_TIDS 128
+
+#define ZSBT_OLD_UNDO_SLOT 0
+#define ZSBT_DEAD_UNDO_SLOT 1
+#define ZSBT_FIRST_NORMAL_UNDO_SLOT 2
+
+/* Number of UNDO slotwords needed for a given number of tuples */
+#define ZSBT_NUM_SLOTWORDS(num_tids) ((num_tids + ZSBT_SLOTNOS_PER_WORD - 1) / ZSBT_SLOTNOS_PER_WORD)
+
+static inline size_t
+SizeOfZSTidArrayItem(int num_tids, int num_undo_slots, int num_codewords)
+{
+ Size sz;
+
+ sz = offsetof(ZSTidArrayItem, t_payload);
+ sz += num_codewords * sizeof(uint64);
+ sz += (num_undo_slots - ZSBT_FIRST_NORMAL_UNDO_SLOT) * sizeof(ZSUndoRecPtr);
+ sz += ZSBT_NUM_SLOTWORDS(num_tids) * sizeof(uint64);
+
+ return sz;
+}
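+
+/*
+ * For example, on a typical 64-bit platform (8-byte alignment assumed), a
+ * maximally full item with 128 TIDs, 16 codewords and 4 UNDO slots takes
+ * 24 bytes of header + 16 * 8 bytes of codewords + 2 * 16 bytes of stored
+ * UNDO pointers + 4 * 8 bytes of slotwords = 216 bytes.
+ */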
+
+/*
+ * Get pointers to the TID codewords, UNDO slots, and slotwords from an item.
+ *
+ * Note: this is also used to get the pointers when constructing a new item, so
+ * don't assert here that the data is valid!
+ */
+static inline void
+ZSTidArrayItemDecode(ZSTidArrayItem *item, uint64 **codewords,
+ ZSUndoRecPtr **slots, uint64 **slotwords)
+{
+ char *p = (char *) item->t_payload;
+
+ *codewords = (uint64 *) p;
+ p += item->t_num_codewords * sizeof(uint64);
+ *slots = (ZSUndoRecPtr *) p;
+ p += (item->t_num_undo_slots - ZSBT_FIRST_NORMAL_UNDO_SLOT) * sizeof(ZSUndoRecPtr);
+ *slotwords = (uint64 *) p;
+}
+
+/*
+ * Toast page layout.
+ *
+ * When an overly large datum is stored, it is divided into chunks, and each
+ * chunk is stored on a dedicated toast page. The toast pages of a datum form
+ * a list; each page has a next/prev pointer.
+ */
+/*
+ * Maximum size of an individual untoasted Datum stored in ZedStore. Datums
+ * larger than this need to be toasted.
+ *
+ * A datum needs to fit on a B-tree page, with page and item headers.
+ *
+ * XXX: 500 accounts for all the headers. Need to compute this correctly...
+ */
+#define MaxZedStoreDatumSize (BLCKSZ - 500)
+
+typedef struct ZSToastPageOpaque
+{
+ AttrNumber zs_attno;
+
+ /* these are only set on the first page. */
+ zstid zs_tid;
+ uint32 zs_total_size;
+
+ uint32 zs_slice_offset;
+ BlockNumber zs_prev;
+ BlockNumber zs_next;
+ uint16 zs_flags;
+ uint16 padding1; /* padding, to put zs_page_id last */
+ uint16 padding2; /* padding, to put zs_page_id last */
+ uint16 zs_page_id;
+} ZSToastPageOpaque;
+
+/*
+ * "Toast pointer" of a datum that's stored in zedstore toast pages.
+ *
+ * This looks somewhat like a normal TOAST pointer, but we mustn't let these
+ * escape out of zedstore code, because the rest of the system doesn't know
+ * how to deal with them.
+ *
+ * This must look like varattrib_1b_e!
+ */
+typedef struct varatt_zs_toastptr
+{
+ /* varattrib_1b_e */
+ uint8 va_header;
+ uint8 va_tag; /* VARTAG_ZEDSTORE in zedstore toast datums */
+
+ /* first block */
+ BlockNumber zst_block;
+} varatt_zs_toastptr;
+
+/*
+ * va_tag value. this should be distinguishable from the values in
+ * vartag_external
+ */
+#define VARTAG_ZEDSTORE 10
+
+/*
+ * Versions of datumGetSize and datumCopy that know about ZedStore-toasted
+ * datums.
+ */
+static inline Size
+zs_datumGetSize(Datum value, bool typByVal, int typLen)
+{
+ if (typLen > 0)
+ return typLen;
+ else if (typLen == -1)
+ {
+ if (VARATT_IS_EXTERNAL(value) && VARTAG_EXTERNAL(value) == VARTAG_ZEDSTORE)
+ return sizeof(varatt_zs_toastptr);
+ else
+ return VARSIZE_ANY(value);
+ }
+ else
+ return datumGetSize(value, typByVal, typLen);
+}
+
+static inline Datum
+zs_datumCopy(Datum value, bool typByVal, int typLen)
+{
+ if (typLen < 0 && VARATT_IS_EXTERNAL(value) && VARTAG_EXTERNAL(value) == VARTAG_ZEDSTORE)
+ {
+ char *result = palloc(sizeof(varatt_zs_toastptr));
+
+ memcpy(result, DatumGetPointer(value), sizeof(varatt_zs_toastptr));
+
+ return PointerGetDatum(result);
+ }
+ else
+ return datumCopy(value, typByVal, typLen);
+}
+
+/*
+ * Block 0 on every ZedStore table is a metapage.
+ *
+ * It contains a directory of b-tree roots for each attribute, and lots more.
+ */
+#define ZS_META_BLK 0
+
+/*
+ * The metapage stores one of these for each attribute.
+ */
+typedef struct ZSRootDirItem
+{
+ BlockNumber root;
+} ZSRootDirItem;
+
+typedef struct ZSMetaPage
+{
+ int nattributes;
+ ZSRootDirItem tree_root_dir[FLEXIBLE_ARRAY_MEMBER]; /* one for each attribute */
+} ZSMetaPage;
+
+/*
+ * it's not clear what we should store in the "opaque" special area, and what
+ * as page contents, on a metapage. But have at least the page_id field here,
+ * so that tools like pg_filedump can recognize it as a zedstore metapage.
+ */
+typedef struct ZSMetaPageOpaque
+{
+ /*
+ * Head and tail page of the UNDO log.
+ *
+ * 'zs_undo_tail' is the newest page, where new UNDO records will be
+ * inserted, and 'zs_undo_head' is the oldest page.
+ * 'zs_undo_tail_first_counter' is the UNDO counter value of the first
+ * record on the tail page (or if the tail page is empty, the counter
+ * value the first record on the tail page will have, when it's
+ * inserted). If there is no UNDO log at all,
+ * 'zs_undo_tail_first_counter' is the new counter value to use. It's
+ * actually redundant, except when there is no UNDO log at all, but it's
+ * a nice cross-check at other times.
+ */
+ BlockNumber zs_undo_head;
+ BlockNumber zs_undo_tail;
+ uint64 zs_undo_tail_first_counter;
+
+ /*
+ * Oldest UNDO record that is still needed. Anything older than this can
+ * be discarded, and considered as visible to everyone.
+ */
+ ZSUndoRecPtr zs_undo_oldestptr;
+
+ BlockNumber zs_fpm_head; /* head of the Free Page Map list */
+
+ uint16 zs_flags;
+ uint16 zs_page_id;
+} ZSMetaPageOpaque;
+
+/*
+ * Codes populated by zs_SatisfiesNonVacuumable. Only the minimum set of
+ * values needed here is defined; the heap equivalent has more states.
+ */
+typedef enum
+{
+ ZSNV_NONE,
+ ZSNV_RECENTLY_DEAD /* tuple is dead, but not deletable yet */
+} ZSNV_Result;
+
+typedef struct ZSUndoSlotVisibility
+{
+ TransactionId xmin;
+ TransactionId xmax;
+ CommandId cmin;
+ uint32 speculativeToken;
+ ZSNV_Result nonvacuumable_status;
+} ZSUndoSlotVisibility;
+
+static const ZSUndoSlotVisibility InvalidUndoSlotVisibility = {
+ .xmin = InvalidTransactionId,
+ .xmax = InvalidTransactionId,
+ .cmin = InvalidCommandId,
+ .speculativeToken = INVALID_SPECULATIVE_TOKEN,
+ .nonvacuumable_status = ZSNV_NONE
+};
+
+typedef struct ZSTidItemIterator
+{
+ int tids_allocated_size;
+ zstid *tids;
+ uint8 *tid_undoslotnos;
+ int num_tids;
+ MemoryContext context;
+
+ ZSUndoRecPtr undoslots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ ZSUndoSlotVisibility undoslot_visibility[ZSBT_MAX_ITEM_UNDO_SLOTS];
+} ZSTidItemIterator;
+
+/*
+ * Holds the state of an in-progress scan on a zedstore Tid tree.
+ */
+typedef struct ZSTidTreeScan
+{
+ Relation rel;
+
+ /*
+ * memory context that should be used for any allocations that go with the scan,
+ * like the decompression buffers. This isn't a dedicated context, you must still
+ * free everything to avoid leaking! We need this because the getnext function
+ * might be called in a short-lived memory context that is reset between calls.
+ */
+ MemoryContext context;
+
+ bool active;
+ Buffer lastbuf;
+ OffsetNumber lastoff;
+ Snapshot snapshot;
+
+ /*
+ * starttid and endtid define a range of TIDs to scan. currtid is the previous
+ * TID that was returned from the scan. They determine what zsbt_tid_scan_next()
+ * will return.
+ */
+ zstid starttid;
+ zstid endtid;
+ zstid currtid;
+
+ /* in the "real" UNDO-log, this would probably be a global variable */
+ ZSUndoRecPtr recent_oldest_undo;
+
+ /* should this scan do predicate locking? Or check for conflicts? */
+ bool serializable;
+ bool acquire_predicate_tuple_locks;
+
+ /*
+ * These fields are used, when the scan is processing an array item.
+ */
+ ZSTidItemIterator array_iter;
+ int array_curr_idx;
+} ZSTidTreeScan;
+
+/*
+ * Convenience function to get the index (aka slot number) into the undo and
+ * visibility arrays. Note that it effectively reads "next_idx - 1", i.e. it
+ * only works after returning from the TID scan function, once next_idx has
+ * been incremented.
+ */
+static inline uint8
+ZSTidScanCurUndoSlotNo(ZSTidTreeScan *scan)
+{
+ Assert(scan->array_curr_idx >= 0 && scan->array_curr_idx < scan->array_iter.num_tids);
+ Assert(scan->array_iter.tid_undoslotnos != NULL);
+ return (scan->array_iter.tid_undoslotnos[scan->array_curr_idx]);
+}
+
+/*
+ * Holds the state of an in-progress scan on a zedstore attribute tree.
+ */
+typedef struct ZSAttrTreeScan
+{
+ Relation rel;
+ AttrNumber attno;
+ Form_pg_attribute attdesc;
+
+ /*
+ * memory context that should be used for any allocations that go with the scan,
+ * like the decompression buffers. This isn't a dedicated context, you must still
+ * free everything to avoid leaking! We need this because the getnext function
+ * might be called in a short-lived memory context that is reset between calls.
+ */
+ MemoryContext context;
+
+ bool active;
+ Buffer lastbuf;
+ OffsetNumber lastoff;
+
+ /*
+ * These fields are used, when the scan is processing an array tuple.
+ * They are filled in by zsbt_attr_item_extract().
+ */
+ int array_datums_allocated_size;
+ Datum *array_datums;
+ bool *array_isnulls;
+ zstid *array_tids;
+ int array_num_elements;
+
+ int array_curr_idx;
+
+ /* working areas for zsbt_attr_item_extract() */
+ char *decompress_buf;
+ int decompress_buf_size;
+ char *attr_buf;
+ int attr_buf_size;
+
+} ZSAttrTreeScan;
+
+/*
+ * We keep this cached copy of the information in the metapage in
+ * backend-private memory, in RelationData->rd_amcache.
+ *
+ * The cache contains the block numbers of the roots of all the tree
+ * structures, for quick searches, as well as the rightmost leaf page, for
+ * quick insertions to the end.
+ *
+ * Use zsmeta_get_cache() to get the cached struct.
+ *
+ * This is used together with smgr_targblock. smgr_targblock tracks the
+ * physical size of the relation file. This struct is only considered valid
+ * when smgr_targblock is valid. So in effect, we invalidate this whenever
+ * a smgr invalidation happens. Logically, the lifetime of this is the same
+ * as smgr_targblocks/smgr_fsm_nblocks/smgr_vm_nblocks, but there's no way
+ * to attach an AM-specific struct directly to SmgrRelation.
+ */
+typedef struct ZSMetaCacheData
+{
+ int cache_nattributes;
+
+ /* For each attribute */
+ struct {
+ BlockNumber root; /* root of the b-tree */
+ BlockNumber rightmost; /* right most leaf page */
+ zstid rightmost_lokey; /* lokey of rightmost leaf */
+ } cache_attrs[FLEXIBLE_ARRAY_MEMBER];
+
+} ZSMetaCacheData;
+
+extern ZSMetaCacheData *zsmeta_populate_cache(Relation rel);
+
+static inline ZSMetaCacheData *
+zsmeta_get_cache(Relation rel)
+{
+ if (rel->rd_amcache == NULL || RelationGetTargetBlock(rel) == InvalidBlockNumber)
+ zsmeta_populate_cache(rel);
+ return (ZSMetaCacheData *) rel->rd_amcache;
+}
+
+/*
+ * Blow away the cached ZSMetaCacheData struct. Next call to zsmeta_get_cache()
+ * will reload it from the metapage.
+ */
+static inline void
+zsmeta_invalidate_cache(Relation rel)
+{
+ if (rel->rd_amcache != NULL)
+ {
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }
+}
+
+/*
+ * zs_split_stack is used during page split, or page merge, to keep track
+ * of all the modified pages. The page split (or merge) routines don't
+ * modify pages directly, but they construct a list of 'zs_split_stack'
+ * entries. Each entry holds a buffer, and a temporary in-memory copy of
+ * a page that should be written to the buffer, once everything is completed.
+ * All the buffers are exclusively-locked.
+ */
+typedef struct zs_split_stack zs_split_stack;
+
+struct zs_split_stack
+{
+ zs_split_stack *next;
+
+ Buffer buf;
+ Page page; /* temp in-memory copy of page */
+ bool recycle; /* should the page be added to the FPM? */
+};
+
+/* prototypes for functions in zedstore_tidpage.c */
+extern void zsbt_tid_begin_scan(Relation rel, zstid starttid, zstid endtid,
+ Snapshot snapshot, ZSTidTreeScan *scan);
+extern void zsbt_tid_reset_scan(ZSTidTreeScan *scan, zstid starttid, zstid endtid, zstid currtid);
+extern void zsbt_tid_end_scan(ZSTidTreeScan *scan);
+extern bool zsbt_tid_scan_next_array(ZSTidTreeScan *scan, zstid nexttid, ScanDirection direction);
+
+/*
+ * Return the next TID in the scan.
+ *
+ * The next TID means the first TID > scan->currtid. Each call moves
+ * scan->currtid to the last returned TID. You can call zsbt_tid_reset_scan()
+ * to change the position; scan->starttid and scan->endtid define the
+ * boundaries of the search.
+ */
+static inline zstid
+zsbt_tid_scan_next(ZSTidTreeScan *scan, ScanDirection direction)
+{
+ zstid nexttid;
+ int idx;
+
+ Assert(scan->active);
+
+ if (direction == ForwardScanDirection)
+ nexttid = scan->currtid + 1;
+ else if (direction == BackwardScanDirection)
+ nexttid = scan->currtid - 1;
+ else
+ nexttid = scan->currtid;
+
+ if (scan->array_iter.num_tids == 0 ||
+ nexttid < scan->array_iter.tids[0] ||
+ nexttid > scan->array_iter.tids[scan->array_iter.num_tids - 1])
+ {
+ scan->array_curr_idx = -1;
+ if (!zsbt_tid_scan_next_array(scan, nexttid, direction))
+ {
+ scan->currtid = nexttid;
+ return InvalidZSTid;
+ }
+ }
+
+ /*
+ * Optimize for the common case that we're scanning forward from the previous
+ * TID.
+ */
+ if (scan->array_curr_idx >= 0 && scan->array_iter.tids[scan->array_curr_idx] < nexttid)
+ idx = scan->array_curr_idx + 1;
+ else
+ idx = 0;
+
+ for (; idx < scan->array_iter.num_tids; idx++)
+ {
+ zstid this_tid = scan->array_iter.tids[idx];
+
+ if (this_tid >= scan->endtid)
+ {
+ scan->currtid = nexttid;
+ return InvalidZSTid;
+ }
+
+ if (this_tid >= nexttid)
+ {
+ /*
+ * Callers using SnapshotDirty need some extra visibility information.
+ */
+ if (scan->snapshot->snapshot_type == SNAPSHOT_DIRTY)
+ {
+ int slotno = scan->array_iter.tid_undoslotnos[idx];
+ ZSUndoSlotVisibility *visi_info = &scan->array_iter.undoslot_visibility[slotno];
+
+ if (visi_info->xmin != FrozenTransactionId)
+ scan->snapshot->xmin = visi_info->xmin;
+ scan->snapshot->xmax = visi_info->xmax;
+ scan->snapshot->speculativeToken = visi_info->speculativeToken;
+ }
+
+ /* on next call, continue the scan at the next TID */
+ scan->currtid = this_tid;
+ scan->array_curr_idx = idx;
+ return this_tid;
+ }
+ }
+
+ /*
+ * unreachable, because zsbt_tid_scan_next_array() should never return an array
+ * that doesn't contain a matching TID.
+ */
+ Assert(false);
+ return InvalidZSTid;
+}
+
+
+extern void zsbt_tid_multi_insert(Relation rel,
+ zstid *tids, int ntuples,
+ TransactionId xid, CommandId cid,
+ uint32 speculative_token, ZSUndoRecPtr prevundoptr);
+extern TM_Result zsbt_tid_delete(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart);
+extern TM_Result zsbt_tid_update(Relation rel, zstid otid,
+ TransactionId xid,
+ CommandId cid, bool key_update, Snapshot snapshot, Snapshot crosscheck,
+ bool wait, TM_FailureData *hufd, zstid *newtid_p);
+extern void zsbt_tid_clear_speculative_token(Relation rel, zstid tid, uint32 spectoken, bool forcomplete);
+extern void zsbt_tid_mark_dead(Relation rel, zstid tid, ZSUndoRecPtr recent_oldest_undo);
+extern IntegerSet *zsbt_collect_dead_tids(Relation rel, zstid starttid, zstid *endtid);
+extern void zsbt_tid_remove(Relation rel, IntegerSet *tids);
+extern TM_Result zsbt_tid_lock(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ LockTupleMode lockmode, bool follow_updates,
+ Snapshot snapshot, TM_FailureData *hufd,
+ zstid *next_tid, ZSUndoSlotVisibility *visi_info);
+extern void zsbt_tid_undo_deletion(Relation rel, zstid tid, ZSUndoRecPtr undoptr, ZSUndoRecPtr recent_oldest_undo);
+extern zstid zsbt_get_last_tid(Relation rel);
+extern void zsbt_find_latest_tid(Relation rel, zstid *tid, Snapshot snapshot);
+
+/* prototypes for functions in zedstore_tiditem.c */
+extern List *zsbt_tid_item_create_for_range(zstid tid, int nelements, ZSUndoRecPtr undo_ptr);
+extern List *zsbt_tid_item_add_tids(ZSTidArrayItem *orig, zstid firsttid, int nelements,
+ ZSUndoRecPtr undo_ptr, bool *modified_orig);
+extern void zsbt_tid_item_unpack(ZSTidArrayItem *item, ZSTidItemIterator *iter);
+extern List *zsbt_tid_item_change_undoptr(ZSTidArrayItem *orig, zstid target_tid, ZSUndoRecPtr undoptr, ZSUndoRecPtr recent_oldest_undo);
+extern List *zsbt_tid_item_remove_tids(ZSTidArrayItem *orig, zstid *nexttid, IntegerSet *remove_tids,
+ ZSUndoRecPtr recent_oldest_undo);
+
+
+/* prototypes for functions in zedstore_attpage.c */
+extern void zsbt_attr_begin_scan(Relation rel, TupleDesc tdesc, AttrNumber attno,
+ ZSAttrTreeScan *scan);
+extern void zsbt_attr_end_scan(ZSAttrTreeScan *scan);
+extern bool zsbt_attr_scan_fetch_array(ZSAttrTreeScan *scan, zstid tid);
+
+extern void zsbt_attr_multi_insert(Relation rel, AttrNumber attno,
+ Datum *datums, bool *isnulls, zstid *tids, int ndatums);
+
+/* prototypes for functions in zedstore_attitem.c */
+extern List *zsbt_attr_create_items(Form_pg_attribute att,
+ Datum *datums, bool *isnulls, zstid *tids, int nelements);
+extern void zsbt_split_item(Form_pg_attribute attr, ZSExplodedItem *origitem, zstid first_right_tid,
+ ZSExplodedItem **leftitem_p, ZSExplodedItem **rightitem_p);
+extern ZSExplodedItem *zsbt_attr_remove_from_item(Form_pg_attribute attr,
+ ZSAttributeArrayItem *olditem,
+ zstid *removetids);
+extern List *zsbt_attr_recompress_items(Form_pg_attribute attr, List *olditems);
+
+extern void zsbt_attr_item_extract(ZSAttrTreeScan *scan, ZSAttributeArrayItem *item);
+
+
+/* prototypes for functions in zedstore_btree.c */
+extern zs_split_stack *zsbt_newroot(Relation rel, AttrNumber attno, int level, List *downlinks);
+extern zs_split_stack *zsbt_insert_downlinks(Relation rel, AttrNumber attno,
+ zstid leftlokey, BlockNumber leftblkno, int level,
+ List *downlinks);
+extern void zsbt_attr_remove(Relation rel, AttrNumber attno, IntegerSet *tids);
+extern zs_split_stack *zsbt_unlink_page(Relation rel, AttrNumber attno, Buffer buf, int level);
+extern zs_split_stack *zs_new_split_stack_entry(Buffer buf, Page page);
+extern void zs_apply_split_changes(Relation rel, zs_split_stack *stack);
+extern Buffer zsbt_descend(Relation rel, AttrNumber attno, zstid key, int level, bool readonly);
+extern Buffer zsbt_find_and_lock_leaf_containing_tid(Relation rel, AttrNumber attno,
+ Buffer buf, zstid nexttid, int lockmode);
+extern bool zsbt_page_is_expected(Relation rel, AttrNumber attno, zstid key, int level, Buffer buf);
+
+/*
+ * Return the value of the row identified by 'tid' in a scan.
+ *
+ * 'tid' must be greater than any previously returned item.
+ *
+ * Returns true if a matching item is found, false otherwise. After
+ * a false return, it's OK to call this again with another greater TID.
+ */
+static inline bool
+zsbt_attr_fetch(ZSAttrTreeScan *scan, Datum *datum, bool *isnull, zstid tid)
+{
+ int idx;
+
+ /*
+ * Fetch the next item from the scan. The item we're looking for might
+ * already be in scan->array_*.
+ */
+ if (scan->array_num_elements == 0 ||
+ tid < scan->array_tids[0] ||
+ scan->array_tids[scan->array_num_elements - 1] < tid)
+ {
+ if (!zsbt_attr_scan_fetch_array(scan, tid))
+ return false;
+ scan->array_curr_idx = -1;
+ }
+ Assert(scan->array_num_elements > 0 &&
+ scan->array_tids[0] <= tid &&
+ scan->array_tids[scan->array_num_elements - 1] >= tid);
+
+ /*
+ * Optimize for the common case that we're scanning forward from the previous
+ * TID.
+ */
+ if (scan->array_curr_idx != -1 && scan->array_tids[scan->array_curr_idx] < tid)
+ idx = scan->array_curr_idx + 1;
+ else
+ idx = 0;
+
+ for (; idx < scan->array_num_elements; idx++)
+ {
+ zstid this_tid = scan->array_tids[idx];
+
+ if (this_tid == tid)
+ {
+ *isnull = scan->array_isnulls[idx];
+ *datum = scan->array_datums[idx];
+ scan->array_curr_idx = idx;
+ return true;
+ }
+ if (this_tid > tid)
+ return false;
+ }
+
+ return false;
+}
+
+extern PGDLLIMPORT const TupleTableSlotOps TTSOpsZedstore;
+
+/* prototypes for functions in zedstore_meta.c */
+extern void zsmeta_initmetapage(Relation rel);
+extern void zsmeta_initmetapage_redo(XLogReaderState *record);
+extern BlockNumber zsmeta_get_root_for_attribute(Relation rel, AttrNumber attno, bool for_update);
+extern void zsmeta_update_root_for_attribute(Relation rel, AttrNumber attno, Buffer metabuf, BlockNumber rootblk);
+extern void zsmeta_add_root_for_new_attributes(Relation rel, Page page);
+
+/* prototypes for functions in zedstore_visibility.c */
+extern TM_Result zs_SatisfiesUpdate(Relation rel, Snapshot snapshot,
+ ZSUndoRecPtr recent_oldest_undo,
+ zstid item_tid, ZSUndoRecPtr item_undoptr,
+ LockTupleMode mode,
+ bool *undo_record_needed,
+ TM_FailureData *tmfd, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info);
+extern bool zs_SatisfiesVisibility(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr,
+ TransactionId *obsoleting_xid, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info);
+
+/* prototypes for functions in zedstore_toast.c */
+extern Datum zedstore_toast_datum(Relation rel, AttrNumber attno, Datum value, zstid tid);
+extern Datum zedstore_toast_flatten(Relation rel, AttrNumber attno, zstid tid, Datum toasted);
+
+/* prototypes for functions in zedstore_freepagemap.c */
+extern Buffer zspage_getnewbuf(Relation rel, Buffer metabuf);
+extern Buffer zspage_extendrel_newbuf(Relation rel);
+extern void zspage_delete_page(Relation rel, Buffer buf);
+
+typedef struct ZedstoreTupleTableSlot
+{
+ TupleTableSlot base;
+ TransactionId xmin;
+ CommandId cmin;
+
+ char *data; /* data for materialized slots */
+} ZedstoreTupleTableSlot;
+
+#endif /* ZEDSTORE_INTERNAL_H */
diff --git a/src/include/access/zedstore_simple8b.h b/src/include/access/zedstore_simple8b.h
new file mode 100644
index 00000000000..ebaaa368c9d
--- /dev/null
+++ b/src/include/access/zedstore_simple8b.h
@@ -0,0 +1,21 @@
+/*
+ * zedstore_simple8b.h
+ * Simple-8b integer encoding routines used by ZedStore
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_simple8b.h
+ */
+#ifndef ZEDSTORE_SIMPLE8B_H
+#define ZEDSTORE_SIMPLE8B_H
+
+extern uint64 simple8b_encode(const uint64 *ints, int num_ints, int *num_encoded);
+extern uint64 simple8b_encode_consecutive(const uint64 firstint, const uint64 secondint, int num_ints,
+ int *num_encoded);
+extern int simple8b_decode(uint64 codeword, uint64 *decoded);
+
+extern void simple8b_decode_words(uint64 *codewords, int num_codewords,
+ uint64 *dst, int num_integers);
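+
+/*
+ * Rough usage sketch (inferred from the prototypes above, not a spec): the
+ * encoder consumes as many integers from 'ints' as fit into one 64-bit
+ * codeword and reports that count via 'num_encoded'; repeated calls thus
+ * turn an array of small integers (e.g. TID deltas) into a sequence of
+ * codewords, and simple8b_decode_words() expands such a sequence back into
+ * 'num_integers' values in 'dst'.
+ */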
+
+#endif /* ZEDSTORE_SIMPLE8B_H */
diff --git a/src/include/access/zedstore_undo.h b/src/include/access/zedstore_undo.h
new file mode 100644
index 00000000000..a6d48585041
--- /dev/null
+++ b/src/include/access/zedstore_undo.h
@@ -0,0 +1,203 @@
+/*
+ * zedstore_undo.h
+ * internal declarations for ZedStore undo logging
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_undo.h
+ */
+#ifndef ZEDSTORE_UNDO_H
+#define ZEDSTORE_UNDO_H
+
+#include "commands/vacuum.h"
+#include "utils/relcache.h"
+
+/* this must match the definition in zedstore_internal.h */
+typedef uint64 zstid;
+
+/*
+ * An UNDO-pointer.
+ *
+ * In the "real" UNDO-logging work from EDB, an UndoRecPtr is only 64 bits.
+ * But we make life easier for ourselves by encoding more information in it.
+ *
+ * 'counter' is a number that's incremented every time a new undo record is
+ * created. It can be used to determine if an undo pointer is too old to be
+ * of interest to anyone.
+ *
+ * 'blkno' and 'offset' are the physical location of the UNDO record. They
+ * can be used to easily fetch a given record.
+ */
+typedef struct
+{
+ uint64 counter;
+ BlockNumber blkno;
+ int32 offset;
+} ZSUndoRecPtr;
+
+/* TODO: assert that blkno and offset match, too, if counter matches */
+#define ZSUndoRecPtrEquals(a, b) ((a).counter == (b).counter)
+
+#define INVALID_SPECULATIVE_TOKEN 0
+
+typedef struct
+{
+ int16 size; /* size of this record, including header */
+ uint8 type; /* ZSUNDO_TYPE_* */
+ ZSUndoRecPtr undorecptr;
+ TransactionId xid;
+ CommandId cid;
+
+ /*
+ * UNDO-record of the inserter. This is needed if a row is inserted, and
+ * deleted, and there are some active snapshots that don't consider even
+ * the insertion as visible.
+ *
+ * This is also used in Insert records, if the record represents the
+ * new tuple version of an UPDATE, rather than an INSERT. It's needed to
+ * dig into possible KEY SHARE locks held on the row, which didn't prevent
+ * the tuple from being updated.
+ */
+ ZSUndoRecPtr prevundorec;
+} ZSUndoRec;
+
+#define ZSUNDO_TYPE_INSERT 1
+#define ZSUNDO_TYPE_DELETE 2
+#define ZSUNDO_TYPE_UPDATE 3
+#define ZSUNDO_TYPE_TUPLE_LOCK 4
+
+/*
+ * Type-specific record formats.
+ *
+ * We store similar info as zheap for INSERT/UPDATE/DELETE. See zheap README.
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+ zstid firsttid;
+ zstid endtid; /* exclusive */
+ uint32 speculative_token; /* Only used for INSERT records */
+
+} ZSUndoRec_Insert;
+
+#define ZSUNDO_NUM_TIDS_PER_DELETE 10
+
+typedef struct
+{
+ ZSUndoRec rec;
+
+ bool changedPart; /* tuple was moved to a different partition by UPDATE */
+
+ /*
+ * One deletion record can represent deleting up to
+ * ZSUNDO_NUM_TIDS_PER_DELETE tuples. The 'rec.tid' field is unused.
+ */
+ uint16 num_tids;
+ zstid tids[ZSUNDO_NUM_TIDS_PER_DELETE];
+
+ /*
+ * TODO: It might be good to move the deleted tuple to the undo-log, so
+ * that the space can immediately be reused. But currently, we don't do
+ * that. Or even better, move the old tuple to the undo-log lazily, if
+ * the space is needed for a new insertion, before the old tuple becomes
+ * recyclable.
+ */
+} ZSUndoRec_Delete;
+
+/*
+ * This is used for an UPDATE, to mark the old tuple version as updated.
+ * It's the same as a deletion, except this stores the TID of the new tuple
+ * version, so it can be followed in READ COMMITTED mode.
+ *
+ * The ZSUndoRec_Insert record is used for the insertion of the new tuple
+ * version.
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+
+ zstid oldtid;
+ zstid newtid;
+
+ bool key_update; /* were key columns updated?
+ * (for conflicting with FOR KEY SHARE) */
+
+} ZSUndoRec_Update;
+
+/*
+ * This is used when a tuple is locked e.g. with SELECT FOR UPDATE.
+ * The tuple isn't really changed in any way, but the undo record gives
+ * a place to store the XID of the locking transaction.
+ *
+ * In case of a FOR SHARE lock, there can be multiple lockers. Each locker
+ * will create a new undo record with its own XID that points to the previous
+ * record. So the records will form a chain, leading finally to the insertion
+ * record (or beyond the UNDO horizon, meaning the tuple's insertion is visible
+ * to everyone)
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+ zstid tid;
+
+ /*
+ * XXX: Is it OK to store this on disk? The enum values could change. Then
+ * again, no one should care about old locks that were acquired before
+ * last restart. Except with two-phase commit prepared transactions.
+ */
+ LockTupleMode lockmode;
+} ZSUndoRec_TupleLock;
+
+typedef struct
+{
+ BlockNumber next;
+ ZSUndoRecPtr first_undorecptr; /* note: this is set even if the page is empty! */
+ ZSUndoRecPtr last_undorecptr;
+ uint16 padding0; /* padding, to put zs_page_id last */
+ uint16 padding1; /* padding, to put zs_page_id last */
+ uint16 padding2; /* padding, to put zs_page_id last */
+ uint16 zs_page_id; /* ZS_UNDO_PAGE_ID */
+} ZSUndoPageOpaque;
+
+/*
+ * "invalid" undo pointer. The value is chosen so that an invalid pointer
+ * is less than any real UNDO pointer value. Therefore, a tuple with an
+ * invalid UNDO pointer is considered visible to everyone.
+ */
+static const ZSUndoRecPtr InvalidUndoPtr = {
+ .blkno = InvalidBlockNumber,
+ .offset = InvalidOffsetNumber,
+ .counter = 0
+};
+
+/*
+ * A special value used on TID items, to mean that a tuple is not visible to
+ * anyone
+ */
+static const ZSUndoRecPtr DeadUndoPtr = {
+ .blkno = InvalidBlockNumber,
+ .offset = InvalidOffsetNumber,
+ .counter = 1
+};
+
+static inline bool
+IsZSUndoRecPtrValid(ZSUndoRecPtr *uptr)
+{
+ return uptr->counter != 0;
+}
+
+/* prototypes for functions in zedstore_undo.c */
+extern ZSUndoRecPtr zsundo_insert(Relation rel, ZSUndoRec *rec);
+extern ZSUndoRec *zsundo_fetch(Relation rel, ZSUndoRecPtr undorecptr);
+extern void zsundo_clear_speculative_token(Relation rel, ZSUndoRecPtr undoptr);
+extern void zsundo_vacuum(Relation rel, VacuumParams *params, BufferAccessStrategy bstrategy,
+ TransactionId OldestXmin);
+extern ZSUndoRecPtr zsundo_get_oldest_undo_ptr(Relation rel);
+extern ZSUndoRecPtr zsundo_create_for_delete(Relation rel, TransactionId xid, CommandId cid, zstid tid,
+ bool changedPart, ZSUndoRecPtr prev_undo_ptr);
+extern ZSUndoRecPtr zsundo_create_for_insert(Relation rel, TransactionId xid, CommandId cid,
+ zstid tid, int nitems,
+ uint32 speculative_token, ZSUndoRecPtr prev_undo_ptr);
+
+#endif /* ZEDSTORE_UNDO_H */
diff --git a/src/include/access/zedstore_wal.h b/src/include/access/zedstore_wal.h
new file mode 100644
index 00000000000..0ee09a0d6e2
--- /dev/null
+++ b/src/include/access/zedstore_wal.h
@@ -0,0 +1,35 @@
+/*
+ * zedstore_wal.h
+ * internal declarations for ZedStore wal logging
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_wal.h
+ */
+#ifndef ZEDSTORE_WAL_H
+#define ZEDSTORE_WAL_H
+
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+
+#define WAL_ZEDSTORE_INIT_METAPAGE 0x00
+
+extern void zedstore_redo(XLogReaderState *record);
+extern void zedstore_desc(StringInfo buf, XLogReaderState *record);
+extern const char *zedstore_identify(uint8 info);
+
+/*
+ * WAL record for initializing zedstore metapage (WAL_ZEDSTORE_INIT_METAPAGE)
+ *
+ * These records always use a full-page image, so this data is really just
+ * for debugging purposes.
+ */
+typedef struct wal_zedstore_init_metapage
+{
+ int32 natts; /* number of attributes. */
+} wal_zedstore_init_metapage;
+
+#define SizeOfZSWalInitMetapage (offsetof(wal_zedstore_init_metapage, natts) + sizeof(int32))
+
+#endif /* ZEDSTORE_WAL_H */
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 393b41dd684..f370f63460c 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
{ oid => '3580', oid_symbol => 'BRIN_AM_OID',
descr => 'block range index (BRIN) access method',
amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '6668', oid_symbol => 'ZEDSTORE_TABLE_AM_OID',
+ descr => 'zedstore table access method',
+ amname => 'zedstore', amhandler => 'zedstore_tableam_handler', amtype => 't' },
]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b88e886f7d7..94a0da964ff 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -873,6 +873,11 @@
proname => 'heap_tableam_handler', provolatile => 'v',
prorettype => 'table_am_handler', proargtypes => 'internal',
prosrc => 'heap_tableam_handler' },
+{ oid => '6669', oid_symbol => 'ZEDSTORE_TABLE_AM_HANDLER_OID',
+ descr => 'column-oriented table access method handler',
+ proname => 'zedstore_tableam_handler', provolatile => 'v',
+ prorettype => 'table_am_handler', proargtypes => 'internal',
+ prosrc => 'zedstore_tableam_handler' },
# Index access method handlers
{ oid => '330', descr => 'btree index access method handler',
@@ -10704,4 +10709,23 @@
proname => 'pg_partition_root', prorettype => 'regclass',
proargtypes => 'regclass', prosrc => 'pg_partition_root' },
+# zedstore inspection functions
+{ oid => '7000', descr => 'get zedstore page type',
+ proname => 'pg_zs_page_type', prorettype => 'text',
+ proargtypes => 'regclass int8', prosrc => 'pg_zs_page_type' },
+{ oid => '7001', descr => 'show stats about active zedstore undo pages',
+ proname => 'pg_zs_undo_pages', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int4,int4,int8,int8}',
+ proargmodes => '{i,o,o,o,o,o}',
+ proargnames => '{relid,blkno,nrecords,freespace,firstrecptr,lastrecptr}',
+ prosrc => 'pg_zs_undo_pages' },
+{ oid => '7002', descr => 'show stats about zedstore btree pages',
+ proname => 'pg_zs_btree_pages', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int8,int4,int4,int8,int8,int4,int4,int4,int4,int4}',
+ proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{relid,blkno,nextblk,attno,level,lokey,hikey,nitems,ncompressed,totalsz,uncompressedsz,freespace}',
+ prosrc => 'pg_zs_btree_pages' },
+
]
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 39c8b3bbbe8..8ed695b746d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -595,5 +595,9 @@ extern void CheckCmdReplicaIdentity(Relation rel, CmdType cmd);
extern void CheckSubscriptionRelkind(char relkind, const char *nspname,
const char *relname);
+extern void
+PopulateNeededColumnsForNode(Node *expr, int n, Bitmapset **scanCols);
+extern Bitmapset *
+PopulateNeededColumnsForScan(ScanState *scanstate, int ncol);
#endif /* EXECUTOR_H */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 13b147d85d5..ae6cc675a8a 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -686,6 +686,7 @@ typedef struct RelOptInfo
PlannerInfo *subroot; /* if subquery */
List *subplan_params; /* if subquery */
int rel_parallel_workers; /* wanted number of parallel workers */
+ bool leverage_column_projection;
/* Information about foreign tables and foreign joins */
Oid serverid; /* identifies server for the table or join */
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 512213aa32f..bff131319cc 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -350,6 +350,9 @@
/* Define to 1 if you have the `ldap_r' library (-lldap_r). */
#undef HAVE_LIBLDAP_R
+/* Define to 1 if you have the `lz4' library (-llz4). */
+#undef HAVE_LIBLZ4
+
/* Define to 1 if you have the `m' library (-lm). */
#undef HAVE_LIBM
@@ -389,6 +392,9 @@
/* Define to 1 if `long long int' works and is 64 bits. */
#undef HAVE_LONG_LONG_INT_64
+/* Define to 1 if you have the <lz4.h> header file. */
+#undef HAVE_LZ4_H
+
/* Define to 1 if you have the <mbarrier.h> header file. */
#undef HAVE_MBARRIER_H
@@ -929,6 +935,9 @@
/* Define to 1 to build with LLVM based JIT support. (--with-llvm) */
#undef USE_LLVM
+/* Define to 1 to build with LZ4 support. (--with-lz4) */
+#undef USE_LZ4
+
/* Define to select named POSIX semaphores. */
#undef USE_NAMED_POSIX_SEMAPHORES
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 376245ecd70..866c3a76f91 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -58,15 +58,18 @@ extern void RegisterPredicateLockingXid(TransactionId xid);
extern void PredicateLockRelation(Relation relation, Snapshot snapshot);
extern void PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot);
extern void PredicateLockTuple(Relation relation, HeapTuple tuple, Snapshot snapshot);
+extern void PredicateLockTID(Relation relation, ItemPointer tid, Snapshot snapshot);
extern void PredicateLockPageSplit(Relation relation, BlockNumber oldblkno, BlockNumber newblkno);
extern void PredicateLockPageCombine(Relation relation, BlockNumber oldblkno, BlockNumber newblkno);
extern void TransferPredicateLocksToHeapRelation(Relation relation);
extern void ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe);
/* conflict detection (may also trigger rollback) */
-extern void CheckForSerializableConflictOut(bool valid, Relation relation, HeapTuple tuple,
- Buffer buffer, Snapshot snapshot);
-extern void CheckForSerializableConflictIn(Relation relation, HeapTuple tuple, Buffer buffer);
+extern void heap_CheckForSerializableConflictOut(bool valid, Relation relation, HeapTuple tuple,
+ Buffer buffer, Snapshot snapshot);
+extern void CheckForSerializableConflictOut(Relation relation, TransactionId xid,
+ Snapshot snapshot);
+extern void CheckForSerializableConflictIn(Relation relation, ItemPointer tid, BlockNumber blkno);
extern void CheckTableForSerializableConflictIn(Relation relation);
/* final rollback checking */
diff --git a/src/test/isolation/expected/multiple-row-versions_1.out b/src/test/isolation/expected/multiple-row-versions_1.out
new file mode 100644
index 00000000000..f4f140cab14
--- /dev/null
+++ b/src/test/isolation/expected/multiple-row-versions_1.out
@@ -0,0 +1,25 @@
+Parsed test spec with 4 sessions
+
+starting permutation: rx1 wx2 c2 wx3 ry3 wy4 rz4 c4 c3 wz1 c1
+step rx1: SELECT * FROM t WHERE id = 1000000;
+id txt
+
+1000000
+step wx2: UPDATE t SET txt = 'b' WHERE id = 1000000;
+step c2: COMMIT;
+step wx3: UPDATE t SET txt = 'c' WHERE id = 1000000;
+step ry3: SELECT * FROM t WHERE id = 500000;
+id txt
+
+500000
+step wy4: UPDATE t SET txt = 'd' WHERE id = 500000;
+step rz4: SELECT * FROM t WHERE id = 1;
+id txt
+
+1
+step c4: COMMIT;
+step c3: COMMIT;
+ERROR: could not serialize access due to read/write dependencies among transactions
+step wz1: UPDATE t SET txt = 'a' WHERE id = 1;
+ERROR: could not serialize access due to read/write dependencies among transactions
+step c1: COMMIT;
diff --git a/src/test/isolation/expected/read-only-anomaly-2_1.out b/src/test/isolation/expected/read-only-anomaly-2_1.out
new file mode 100644
index 00000000000..2e365520319
--- /dev/null
+++ b/src/test/isolation/expected/read-only-anomaly-2_1.out
@@ -0,0 +1,45 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s2wx s2c s3c
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+ERROR: could not serialize access due to read/write dependencies among transactions
+step s2c: COMMIT;
+step s3c: COMMIT;
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s3r s3c s2wx
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s3r: SELECT id, balance FROM bank_account WHERE id IN ('X', 'Y') ORDER BY id;
+id balance
+
+X 0
+Y 20
+step s3c: COMMIT;
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+ERROR: could not serialize access due to read/write dependencies among transactions
diff --git a/src/test/isolation/expected/read-only-anomaly-3_1.out b/src/test/isolation/expected/read-only-anomaly-3_1.out
new file mode 100644
index 00000000000..d9a5a8e49d4
--- /dev/null
+++ b/src/test/isolation/expected/read-only-anomaly-3_1.out
@@ -0,0 +1,27 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s3r s2wx s2c s3c
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s3r: SELECT id, balance FROM bank_account WHERE id IN ('X', 'Y') ORDER BY id; <waiting ...>
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+step s3r: <... completed>
+id balance
+
+X 0
+Y 20
+error in steps s2wx s3r: ERROR: could not serialize access due to read/write dependencies among transactions
+step s2c: COMMIT;
+step s3c: COMMIT;
diff --git a/src/test/isolation/expected/serializable-parallel_1.out b/src/test/isolation/expected/serializable-parallel_1.out
new file mode 100644
index 00000000000..2e365520319
--- /dev/null
+++ b/src/test/isolation/expected/serializable-parallel_1.out
@@ -0,0 +1,45 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s2wx s2c s3c
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+ERROR: could not serialize access due to read/write dependencies among transactions
+step s2c: COMMIT;
+step s3c: COMMIT;
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s3r s3c s2wx
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s3r: SELECT id, balance FROM bank_account WHERE id IN ('X', 'Y') ORDER BY id;
+id balance
+
+X 0
+Y 20
+step s3c: COMMIT;
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+ERROR: could not serialize access due to read/write dependencies among transactions
diff --git a/src/test/isolation/expected/vacuum-reltuples_1.out b/src/test/isolation/expected/vacuum-reltuples_1.out
new file mode 100644
index 00000000000..fabcd824db2
--- /dev/null
+++ b/src/test/isolation/expected/vacuum-reltuples_1.out
@@ -0,0 +1,59 @@
+Parsed test spec with 2 sessions
+
+starting permutation: modify vac stats
+step modify:
+ insert into smalltbl select max(id)+1 from smalltbl;
+
+step vac:
+ vacuum smalltbl;
+
+step stats:
+ select relpages, reltuples from pg_class
+ where oid='smalltbl'::regclass;
+
+relpages reltuples
+
+4 20
+
+starting permutation: modify open fetch1 vac close stats
+step modify:
+ insert into smalltbl select max(id)+1 from smalltbl;
+
+step open:
+ begin;
+ declare c1 cursor for select 1 as dummy from smalltbl;
+
+step fetch1:
+ fetch next from c1;
+
+dummy
+
+1
+step vac:
+ vacuum smalltbl;
+
+step close:
+ commit;
+
+step stats:
+ select relpages, reltuples from pg_class
+ where oid='smalltbl'::regclass;
+
+relpages reltuples
+
+4 20
+
+starting permutation: modify vac stats
+step modify:
+ insert into smalltbl select max(id)+1 from smalltbl;
+
+step vac:
+ vacuum smalltbl;
+
+step stats:
+ select relpages, reltuples from pg_class
+ where oid='smalltbl'::regclass;
+
+relpages reltuples
+
+4 20
diff --git a/src/test/isolation/specs/read-only-anomaly-2.spec b/src/test/isolation/specs/read-only-anomaly-2.spec
index 9812f49ee47..2b17fcb5218 100644
--- a/src/test/isolation/specs/read-only-anomaly-2.spec
+++ b/src/test/isolation/specs/read-only-anomaly-2.spec
@@ -18,13 +18,15 @@ teardown
}
session "s1"
-setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; }
+setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ SET enable_seqscan=off; }
step "s1ry" { SELECT balance FROM bank_account WHERE id = 'Y'; }
step "s1wy" { UPDATE bank_account SET balance = 20 WHERE id = 'Y'; }
step "s1c" { COMMIT; }
session "s2"
-setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; }
+setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ SET enable_seqscan=off; }
step "s2rx" { SELECT balance FROM bank_account WHERE id = 'X'; }
step "s2ry" { SELECT balance FROM bank_account WHERE id = 'Y'; }
step "s2wx" { UPDATE bank_account SET balance = -11 WHERE id = 'X'; }
diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index a24cfd4e016..63573eabd2c 100644
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -49,12 +49,18 @@ $(top_builddir)/src/port/pg_config_paths.h: | submake-libpgport
install: all installdirs
$(INSTALL_PROGRAM) pg_regress$(X) '$(DESTDIR)$(pgxsdir)/$(subdir)/pg_regress$(X)'
+ $(INSTALL_PROGRAM) gpdiff.pl '$(DESTDIR)$(pgxsdir)/$(subdir)/gpdiff.pl'
+ $(INSTALL_PROGRAM) atmsort.pl '$(DESTDIR)$(pgxsdir)/$(subdir)/atmsort.pl'
+ $(INSTALL_PROGRAM) atmsort.pm '$(DESTDIR)$(pgxsdir)/$(subdir)/atmsort.pm'
installdirs:
$(MKDIR_P) '$(DESTDIR)$(pgxsdir)/$(subdir)'
uninstall:
rm -f '$(DESTDIR)$(pgxsdir)/$(subdir)/pg_regress$(X)'
+ rm -f '$(DESTDIR)$(pgxsdir)/$(subdir)/gpdiff.pl'
+ rm -f '$(DESTDIR)$(pgxsdir)/$(subdir)/atmsort.pl'
+ rm -f '$(DESTDIR)$(pgxsdir)/$(subdir)/atmsort.pm'
# Build dynamically-loaded object file for CREATE FUNCTION ... LANGUAGE C.
diff --git a/src/test/regress/atmsort.pl b/src/test/regress/atmsort.pl
new file mode 100755
index 00000000000..142b998bceb
--- /dev/null
+++ b/src/test/regress/atmsort.pl
@@ -0,0 +1,346 @@
+#!/usr/bin/env perl
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# Pod::Usage is loaded lazily when needed, if the --help or other such option
+# is actually used. Loading the module takes some time, which adds up when
+# running hundreds of regression tests, and gpdiff.pl calls this script twice
+# for every test. See lazy_pod2usage().
+#use Pod::Usage;
+
+use Getopt::Long;
+#use Data::Dumper; # only used by commented-out debug statements.
+use strict;
+use warnings;
+
+use File::Spec;
+
+use FindBin;
+use lib "$FindBin::Bin";
+use atmsort;
+
+=head1 NAME
+
+B<atmsort.pl> - [A] [T]est [M]echanism Sort: sort the contents of SQL log files to aid diff comparison
+
+=head1 SYNOPSIS
+
+B<atmsort.pl> [options] logfile [logfile...]
+
+Options:
+
+ -help brief help message
+ -man full documentation
+ -ignore_plans ignore explain plan content in query output
+ -init <file> load initialization file
+
+=head1 OPTIONS
+
+=over 8
+
+=item B<-help>
+
+ Prints a brief help message and exits.
+
+=item B<-man>
+
+ Prints the manual page and exits.
+
+=item B<-ignore_plans>
+
+Specify this option to ignore any explain plan diffs between the
+input files. It completely ignores any plan content in
+the input files, thus masking differences in plans between them.
+
+For example, for the following plan:
+explain select i from foo where i > 10;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..2.72 rows=45 width=4)
+ -> Seq Scan on foo (cost=0.00..1.55 rows=45 width=4)
+ Filter: i > 10
+ Settings: optimizer=on
+(4 rows)
+
+atmsort.pl -ignore_plans will reduce this to:
+
+explain select i from foo where i > 10;
+QUERY PLAN
+___________
+REGRESS_IGNORE:{
+REGRESS_IGNORE: 'child' => [
+REGRESS_IGNORE: {
+REGRESS_IGNORE: 'id' => 2,
+REGRESS_IGNORE: 'parent' => 1,
+REGRESS_IGNORE: 'short' => 'Seq Scan on foo'
+REGRESS_IGNORE: }
+REGRESS_IGNORE: ],
+REGRESS_IGNORE: 'id' => 1,
+REGRESS_IGNORE: 'short' => 'Gather Motion'
+REGRESS_IGNORE:}
+REGRESS_IGNORE:(4 rows)
+
+
+=item B<-init> <file>
+
+Specify an initialization file containing a series of directives
+(mainly for match_subs) that get applied to the input files. To
+specify multiple initialization files, use multiple init arguments,
+eg:
+
+ -init file1 -init file2
+
+
+=back
+
+=back
+
+=head1 DESCRIPTION
+
+atmsort reads sql log files from STDIN and sorts the query output for
+all SELECT statements that do *not* have an ORDER BY, writing the
+result to STDOUT. This change to the log facilitates diff comparison,
+since unORDERed query output does not have a guaranteed order. Note
+that for diff to work correctly, statements that do use ORDER BY must
+have a fully-specified order.
+
+The log content must look something like:
+
+ SELECT a, b, c, d
+ from foo
+ ORDER BY 1,2,3,4;
+ a | b | c | d
+ ------------+-----------------+-----------+---------------
+ 1 | 1 | 1 | 1
+ 1 | 1 | 1 | 2
+ 3 | 2 | 2 | 5
+ (3 rows)
+
+The log file must contain SELECT statements, followed by the query
+output in the standard PostgreSQL format, ie a set of named columns, a
+separator line constructed of dashes and plus signs, and the rows,
+followed by an "(N rows)" row count. The SELECT statement must be
+unambiguous, eg no embedded SQL keywords like INSERT, UPDATE, or
+DELETE, and it must be terminated with a semicolon. Normally, the
+query output is sorted, but if the statement contains an ORDER BY
+clause the query output for that query is not sorted.
+
+=head2 EXPLAIN PLAN
+
+atmsort can also use explain.pl to process EXPLAIN and EXPLAIN ANALYZE
+output in a configuration-independent way. It strips out all timing,
+segment, and slice information, reducing the plan to a simple nested
+perl structure. For example, for the following plan:
+
+explain analyze select * from customer;
+
+ QUERY PLAN
+------------------------------------------------------------------------
+ Gather Motion 2:1 (slice1) (cost=0.00..698.88 rows=25088 width=550)
+ Rows out: 150000 rows at destination with 0.230 ms to first row,
+ 386 ms to end, start offset by 8.254 ms.
+ -> Seq Scan on customer (cost=0.00..698.88 rows=25088 width=550)
+ Rows out: Avg 75000.0 rows x 2 workers. Max 75001 rows (seg0)
+ with 0.056 ms to first row, 26 ms to end, start offset by 7.332 ms.
+ Slice statistics:
+ (slice0) Executor memory: 186K bytes.
+ (slice1) Executor memory: 130K bytes avg x 2 workers,
+ 130K bytes max (seg0).
+ Total runtime: 413.401 ms
+(8 rows)
+
+atmsort reduces the plan to:
+
+ QUERY PLAN
+------------------------------------------------------------------------
+{
+ 'child' => [
+ {
+ 'id' => 2,
+ 'parent' => 1,
+ 'short' => 'Seq Scan on customer'
+ }
+ ],
+ 'id' => 1,
+ 'short' => 'Gather Motion'
+ }
+(8 rows)
+
+
+=head2 Advanced Usage
+
+atmsort supports several "commands" that allow finer-grained control
+over the comparison process for SELECT queries. These commands are
+specified in comments in the following form:
+
+ --
+ -- order 1
+ --
+ SELECT a, b, c, d
+ from foo
+ ORDER BY 1;
+
+or
+
+ SELECT a, b, c, d
+ from foo
+ ORDER BY 1; -- order 1
+
+The supported commands are:
+
+=over 13
+
+=item -- order column number[, column number...]
+
+ The order directive is used to compare
+ "partially-ordered" query
+ output. The specified columns are assumed
+ to be ordered, and the remaining columns are
+ sorted to allow for deterministic comparison.
+
+=item -- order none
+
+ The order none directive can be used to specify that the SELECT's
+ output is not ordered. This can be necessary if the default
+ heuristic that checks if there is an ORDER BY in the query gets
+ fooled, e.g. by an ORDER BY in a subquery that doesn't force the
+ overall result to be ordered.
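+
+ For example, in this hypothetical query the ORDER BY only orders the
+ subquery, not the outer result, so the directive tells atmsort to sort
+ the output before comparison:
+
+   SELECT x FROM (SELECT x FROM t ORDER BY x) AS sub; -- order none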
+
+=item -- ignore
+
+The ignore directive prefixes the SELECT output with REGRESS_IGNORE. The
+diff command can use the -I flag to ignore lines with this prefix.
+
+=item -- mvd colnum[, colnum...] -> colnum[, colnum...] [; <additional specs>]
+
+mvd is designed to support Multi-Value Dependencies for OLAP queries.
+The syntax "col1,col2->col3,col4" indicates that the col1 and col2
+values determine the col3, col4 result order.
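+
+For example, in a hypothetical query the directive below states that the
+value of column 1 determines the result order of columns 2 and 3:
+
+  SELECT a, b, c FROM foo; -- mvd 1->2,3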
+
+=item -- start_ignore
+
+Ignore all results until the next "end_ignore" directive. The
+start_ignore directive prefixes all subsequent output with REGRESS_IGNORE,
+and all other formatting directives are ignored as well. The diff
+command can use the -I flag to ignore lines with this prefix.
+
+=item -- end_ignore
+
+ Ends the ignored region that started with "start_ignore"
+
+=item -- start_matchsubs
+
+Starts a list of match/substitution expressions, where the match and
+substitution are specified as perl "m" and "s" operators for a single
+line of input. atmsort will compile the expressions and use them to
+process the current input file. The format is:
+
+ -- start_matchsubs
+ --
+ -- # first, a match expression
+ -- m/match this/
+ -- # next, a substitute expression
+ -- s/match this/substitute this/
+ --
+ -- # and can have more matchsubs after this...
+ --
+ -- end_matchsubs
+
+ Blank lines are ignored, and comments may be used if they are
+ prefixed with "#", the perl comment character, eg:
+
+ -- # this is a comment
+
+ Multiple match and substitute pairs may be specified. See "man
+ perlre" for more information on perl regular expressions.
+
+=item -- end_matchsubs
+
+ Ends the match/substitution region that started with "start_matchsubs"
+
+=item -- start_matchignore
+
+Similar to matchsubs, starts a list of match/ignore expressions as a
+set of perl match operators. Each line that matches one of the
+specified expressions is elided from the atmsort output. Note that
+there isn't an "ignore" expression -- just a list of individual match
+operators.
+
+=item -- end_matchignore
+
+ Ends the match/ignore region that started with "start_matchignore"
+
+=item -- force_explain
+
+Normally, atmsort can detect that a SQL query is being EXPLAINed, and
+the explain processing will happen automatically. However, if the
+query is complex, you may need to tag it with a comment to force the
+explain. Using this command for non-EXPLAIN statements is
+inadvisable.
+
+=back
+
+Note that you can combine the directives for a single query, but each
+directive must be on a separate line. Multiple mvd specifications
+must be on a single mvd line, separated by semicolons. Note that
+start_ignore overrides all directives until the next end_ignore.
+
+=head1 CAVEATS/LIMITATIONS
+
+atmsort cannot handle "unsorted" SELECT queries where the output has
+strings with embedded newlines or pipe ("|") characters due to
+limitations with the parser in the "tablelizer" function. Queries
+with these characteristics must have an ORDER BY clause to avoid
+potential erroneous comparison.
+
+=cut
+
+# Calls pod2usage, but loads the module first.
+sub lazy_pod2usage
+{
+ require Pod::Usage;
+ Pod::Usage::pod2usage(@_);
+}
+
+my $glob_id = "";
+
+my $glob_init;
+
+my $glob_orderwarn;
+my $glob_verbose;
+my $glob_fqo;
+
+my $man = 0;
+my $help = 0;
+my $ignore_plans;
+my @init_file;
+my $verbose;
+my $orderwarn;
+
+GetOptions(
+ 'help|?' => \$help, man => \$man,
+ 'gpd_ignore_plans|gp_ignore_plans|ignore_plans' => \$ignore_plans,
+ 'gpd_init|gp_init|init:s' => \@init_file,
+ 'order_warn|orderwarn' => \$orderwarn,
+ 'verbose' => \$verbose
+ )
+ or lazy_pod2usage(2);
+
+lazy_pod2usage(-msg => $glob_id, -exitstatus => 1) if $help;
+lazy_pod2usage(-msg => $glob_id, -exitstatus => 0, -verbose => 2) if $man;
+
+push @{$glob_init}, @init_file;
+
+my %args;
+
+$args{IGNORE_PLANS} = $ignore_plans if (defined ($ignore_plans));
+@{$args{INIT_FILES}} = @init_file if (scalar(@init_file));
+$args{ORDER_WARN} = $orderwarn if (defined ($orderwarn));
+$args{VERBOSE} = $verbose if (defined ($verbose));
+
+atmsort::atmsort_init(%args);
+
+atmsort::run_fhs(*STDIN, *STDOUT);
+
+exit();
diff --git a/src/test/regress/atmsort.pm b/src/test/regress/atmsort.pm
new file mode 100644
index 00000000000..bbee018255b
--- /dev/null
+++ b/src/test/regress/atmsort.pm
@@ -0,0 +1,1371 @@
+#
+# This is the workhorse of atmsort.pl, extracted into a module so that it
+# can be called more efficiently from other perl programs.
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# Public interface:
+#
+# atmsort_init(args in a hash)
+#
+# followed by:
+#
+# run_fhs(input file handle, output file handle)
+# or
+# run(input filename, output filename)
+#
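+# A minimal usage sketch (the caller and file name below are hypothetical;
+# see gpdiff.pl and atmsort.pl for the real callers):
+#
+#   use atmsort;
+#   atmsort::atmsort_init(IGNORE_PLANS => 1);
+#   my $sorted_file = atmsort::run('results/foo.out');
+#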
+package atmsort;
+
+#use Data::Dumper; # only used by commented-out debug statements.
+use strict;
+use warnings;
+use File::Temp qw/ tempfile /;
+
+# optional set of prefixes to identify sql statements, query output,
+# and sorted lines (for testing purposes)
+#my $apref = 'a: ';
+#my $bpref = 'b: ';
+#my $cpref = 'c: ';
+#my $dpref = 'S: ';
+my $apref = '';
+my $bpref = '';
+my $cpref = '';
+my $dpref = '';
+
+my $glob_ignore_plans;
+my @glob_init;
+
+my $glob_orderwarn;
+my $glob_verbose;
+my $glob_fqo;
+
+my $atmsort_outfh;
+
+sub atmsort_init
+{
+ my %args = (
+ # defaults
+ IGNORE_PLANS => 0,
+ INIT_FILES => [],
+ ORDER_WARN => 0,
+ VERBOSE => 0,
+
+ # override the defaults from argument list
+ @_
+ );
+
+ $glob_ignore_plans = 0;
+ @glob_init = ();
+
+ $glob_orderwarn = 0;
+ $glob_verbose = 0;
+ $glob_fqo = {count => 0};
+
+ my $ignore_plans;
+ my @init_file;
+ my $verbose;
+ my $orderwarn;
+
+ $glob_ignore_plans = $args{IGNORE_PLANS};
+
+ @glob_init = @{$args{INIT_FILES}};
+
+ $glob_orderwarn = $args{ORDER_WARN};
+ $glob_verbose = $args{VERBOSE};
+
+ _process_init_files();
+}
+
+sub _process_init_files
+{
+ # allow multiple init files
+ if (@glob_init)
+ {
+ my $devnullfh;
+ my $init_file_fh;
+
+ open $devnullfh, "> /dev/null" or die "can't open /dev/null: $!";
+
+ for my $init_file (@glob_init)
+ {
+ die "no such file: $init_file"
+ unless (-e $init_file);
+
+ # Perform initialization from this init_file by passing it
+ # to bigloop. Open the file, and pass that as the input file
+ # handle, and redirect output to /dev/null.
+ open $init_file_fh, "< $init_file" or die "could not open $init_file: $!";
+
+ atmsort_bigloop($init_file_fh, $devnullfh);
+
+ close $init_file_fh;
+ }
+
+ close $devnullfh;
+ }
+}
+
+my $glob_match_then_sub_fnlist;
+
+sub _build_match_subs
+{
+ my ($here_matchsubs, $whomatch) = @_;
+
+ my $stat = [1];
+
+ # filter out the comments and blank lines
+ $here_matchsubs =~ s/^\s*(?:#.*)?(?:[\r\n]|\x0D\x0A)//gm;
+
+ # split up the document into separate lines
+ my @foo = split(/\n/, $here_matchsubs);
+
+ my $ii = 0;
+
+ my $matchsubs_arr = [];
+ my $msa;
+
+ # build an array of arrays of match/subs pairs
+ while ($ii < scalar(@foo))
+ {
+ my $lin = $foo[$ii];
+
+ if (defined($msa))
+ {
+ push @{$msa}, $lin;
+
+ push @{$matchsubs_arr}, $msa;
+
+ undef $msa;
+ }
+ else
+ {
+ $msa = [$lin];
+ }
+ $ii++;
+ next;
+ } # end while
+
+# print Data::Dumper->Dump($matchsubs_arr);
+
+ my $bigdef;
+
+ my $fn1;
+
+ # build a lambda function for each expression, and load it into an
+ # array
+ my $mscount = 1;
+
+ for my $defi (@{$matchsubs_arr})
+ {
+ unless (2 == scalar(@{$defi}))
+ {
+ my $err1 = "bad definition: " . Data::Dumper->Dump([$defi]);
+ $stat->[0] = 1;
+ $stat->[1] = $err1;
+ return $stat;
+ }
+
+ $bigdef = '$fn1 = sub { my $ini = shift; '. "\n";
+ $bigdef .= 'if ($ini =~ ' . $defi->[0];
+ $bigdef .= ') { ' . "\n";
+# $bigdef .= 'print "match\n";' . "\n";
+ $bigdef .= '$ini =~ ' . $defi->[1];
+ $bigdef .= '; }' . "\n";
+ $bigdef .= 'return $ini; }' . "\n";
+
+# print $bigdef;
+
+ if (eval $bigdef)
+ {
+ my $cmt = $whomatch . " matchsubs \#" . $mscount;
+ $mscount++;
+
+ # store the function pointer and the text of the function
+ # definition
+ push @{$glob_match_then_sub_fnlist},
+ [$fn1, $bigdef, $cmt, $defi->[0], $defi->[1]];
+
+ if ($glob_verbose && defined $atmsort_outfh)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: Defined $cmt\t$defi->[0]\t$defi->[1]\n"
+ }
+ }
+ else
+ {
+ my $err1 = "bad eval: $bigdef";
+ $stat->[0] = 1;
+ $stat->[1] = $err1;
+ return $stat;
+ }
+
+ }
+
+# print Data::Dumper->Dump($glob_match_then_sub_fnlist);
+
+ return $stat;
+
+} # end _build_match_subs
+
+sub match_then_subs
+{
+ my $ini = shift;
+
+ for my $ff (@{$glob_match_then_sub_fnlist})
+ {
+ # get the function and execute it
+ my $fn1 = $ff->[0];
+ if (!$glob_verbose)
+ {
+ $ini = &$fn1($ini);
+ }
+ else
+ {
+ my $subs = &$fn1($ini);
+ unless ($subs eq $ini)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: was: $ini";
+ print $atmsort_outfh "REGRESS_IGNORE: matched $ff->[-3]\t$ff->[-2]\t$ff->[-1]\n"
+ }
+
+ $ini = &$fn1($ini);
+ }
+
+ }
+ return $ini;
+}
+
+my $glob_match_then_ignore_fnlist;
+
+sub _build_match_ignores
+{
+ my ($here_matchignores, $whomatch) = @_;
+
+ my $stat = [1];
+
+ # filter out the comments and blank lines
+ $here_matchignores =~ s/^\s*(?:#.*)?(?:[\r\n]|\x0D\x0A)//gm;
+
+ # split up the document into separate lines
+ my @matchignores_arr = split(/\n/, $here_matchignores);
+
+ my $bigdef;
+
+ my $fn1;
+
+ # build a lambda function for each expression, and load it into an
+ # array
+ my $mscount = 1;
+
+ for my $defi (@matchignores_arr)
+ {
+ $bigdef = '$fn1 = sub { my $ini = shift; '. "\n";
+ $bigdef .= 'return ($ini =~ ' . $defi;
+ $bigdef .= ') ; } ' . "\n";
+# print $bigdef;
+
+ if (eval $bigdef)
+ {
+ my $cmt = $whomatch . " matchignores \#" . $mscount;
+ $mscount++;
+
+ # store the function pointer and the text of the function
+ # definition
+ push @{$glob_match_then_ignore_fnlist},
+ [$fn1, $bigdef, $cmt, $defi, "(ignore)"];
+ if ($glob_verbose && defined $atmsort_outfh)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: Defined $cmt\t$defi\n"
+ }
+
+ }
+ else
+ {
+ my $err1 = "bad eval: $bigdef";
+ $stat->[0] = 1;
+ $stat->[1] = $err1;
+ return $stat;
+ }
+
+ }
+
+# print Data::Dumper->Dump($glob_match_then_ignore_fnlist);
+
+ return $stat;
+
+} # end _build_match_ignores
+
+# if the input matches, return 1 (ignore), else return 0 (keep)
+sub match_then_ignore
+{
+ my $ini = shift;
+
+ for my $ff (@{$glob_match_then_ignore_fnlist})
+ {
+ # get the function and execute it
+ my $fn1 = $ff->[0];
+
+ if (&$fn1($ini))
+ {
+ if ($glob_verbose)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: matched $ff->[-3]\t$ff->[-2]\t$ff->[-1]\n"
+ }
+ return 1; # matched
+ }
+ }
+ return 0; # no match
+}
+
+# convert a postgresql psql formatted table into an array of hashes
+sub tablelizer
+{
+ my ($ini, $got_line1) = @_;
+
+ # first, split into separate lines, then find all the column headings
+
+ my @lines = split(/\n/, $ini);
+
+ return undef
+ unless (scalar(@lines));
+
+ # if the first line is supplied, then it has the column headers,
+ # so don't try to find them (or the ---+---- separator) in
+ # "lines"
+ my $line1 = $got_line1;
+ $line1 = shift @lines
+ unless (defined($got_line1));
+
+ # look for <space>|<space>
+ my @colheads = split(/\s+\|\s+/, $line1);
+
+ # fixup first, last column head (remove leading,trailing spaces)
+
+ $colheads[0] =~ s/^(\s+|\s+$)//;
+ $colheads[-1] =~ s/^(\s+|\s+$)//;
+
+ return undef
+ unless (scalar(@lines));
+
+ shift @lines # skip dashed separator (unless it was skipped already)
+ unless (defined($got_line1));
+
+ my @rows;
+
+ for my $lin (@lines)
+ {
+ my @cols = split(/\|/, $lin, scalar(@colheads));
+ last
+ unless (scalar(@cols) == scalar(@colheads));
+
+ my $rowh = {};
+
+ for my $colhdcnt (0..(scalar(@colheads)-1))
+ {
+ my $rawcol = shift @cols;
+
+ $rawcol =~ s/^(\s+|\s+$)//;
+
+ my $colhd = $colheads[$colhdcnt];
+ $rowh->{($colhdcnt+1)} = $rawcol;
+ }
+ push @rows, $rowh;
+ }
+
+ return \@rows;
+}
+# reformat the EXPLAIN output according to the directive hash
+sub format_explain
+{
+ my ($outarr, $directive) = @_;
+ my $prefix = "";
+ my $xopt = "perl"; # normal case
+
+ $directive = {} unless (defined($directive));
+
+ # Ignore plan content if its between start_ignore and end_ignore blocks
+ # or if -ignore_plans is specified.
+ $prefix = "REGRESS_IGNORE:"
+ if (exists($directive->{ignore})) || ($glob_ignore_plans);
+
+ my @tmp_lines;
+
+ if (scalar(@{$outarr}))
+ {
+ @tmp_lines = (
+ "QUERY PLAN\n",
+ ("-" x 71) . "\n",
+ @{$outarr},
+ "(111 rows)\n"
+ );
+ }
+
+ # Apply prefix to each line, if requested.
+ if (defined($prefix) && length($prefix))
+ {
+ foreach my $line (@tmp_lines)
+ {
+ $line = $prefix . $line;
+ }
+ }
+
+ # Put back newlines and print
+ foreach my $line (@tmp_lines)
+ {
+ print $atmsort_outfh $line;
+ }
+
+ return \@tmp_lines;
+}
+
+# reformat the query output according to the directive hash
+sub format_query_output
+{
+ my ($fqostate, $has_order, $outarr, $directive) = @_;
+ my $prefix = "";
+
+ $directive = {} unless (defined($directive));
+
+ $fqostate->{count} += 1;
+
+ if ($glob_verbose)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: start fqo $fqostate->{count}\n";
+ }
+
+ # EXPLAIN
+ #
+ # EXPLAIN (COSTS OFF) output is *not* processed. The output with COSTS OFF
+ # shouldn't contain anything that varies across runs, and shouldn't need
+ # sanitizing.
+ #
+ # However when -ignore_plans is specified we also need to process
+ # EXPLAIN (COSTS OFF) to ignore the segments information.
+ if (exists($directive->{explain})
+ && ($glob_ignore_plans
+ || $directive->{explain} ne 'costs_off')
+ && (!exists($directive->{explain_processing})
+ || ($directive->{explain_processing} =~ m/on/)))
+ {
+ format_explain($outarr, $directive);
+ if ($glob_verbose)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: end fqo $fqostate->{count}\n";
+ }
+ return;
+ }
+
+ $prefix = "REGRESS_IGNORE:"
+ if (exists($directive->{ignore}));
+
+ if (exists($directive->{sortlines}))
+ {
+ my $firstline = $directive->{firstline};
+ my $ordercols = $directive->{order};
+ my $mvdlist = $directive->{mvd};
+
+ # lines already have newline terminator, so just rejoin them.
+ my $lines = join ("", @{$outarr});
+
+ my $ah1 = tablelizer($lines, $firstline);
+
+ unless (defined($ah1) && scalar(@{$ah1}))
+ {
+# print "No tablelizer hash for $lines, $firstline\n";
+# print STDERR "No tablelizer hash for $lines, $firstline\n";
+
+ if ($glob_verbose)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: end fqo $fqostate->{count}\n";
+ }
+
+ return;
+ }
+
+ my @allcols = sort (keys(%{$ah1->[0]}));
+
+ my @presortcols;
+ if (defined($ordercols) && length($ordercols))
+ {
+# $ordercols =~ s/^.*order\s*//;
+ $ordercols =~ s/\n//gm;
+ $ordercols =~ s/\s//gm;
+
+ @presortcols = split(/\s*\,\s*/, $ordercols);
+ }
+
+ my @mvdcols;
+ my @mvd_deps;
+ my @mvd_nodeps;
+ my @mvdspec;
+ if (defined($mvdlist) && length($mvdlist))
+ {
+ $mvdlist =~ s/\n//gm;
+ $mvdlist =~ s/\s//gm;
+
+ # find all the mvd specifications (separated by semicolons)
+ my @allspecs = split(/\;/, $mvdlist);
+
+# print "allspecs:", Data::Dumper->Dump(\@allspecs);
+
+ for my $item (@allspecs)
+ {
+ my $realspec;
+ # split the specification list, separating the
+ # specification columns on the left hand side (LHS)
+ # from the "dependent" columns on the right hand side (RHS)
+ my @colset = split(/\-\>/, $item, 2);
+ unless (scalar(@colset) == 2)
+ {
+ print $atmsort_outfh "invalid colset for $item\n";
+ print STDERR "invalid colset for $item\n";
+ next;
+ }
+ # specification columns (LHS)
+ my @scols = split(/\,/, $colset[0]);
+ unless (scalar(@scols))
+ {
+ print $atmsort_outfh "invalid dependency specification: $colset[0]\n";
+ print STDERR
+ "invalid dependency specification: $colset[0]\n";
+ next;
+ }
+ # dependent columns (RHS)
+ my @dcols = split(/\,/, $colset[1]);
+ unless (scalar(@dcols))
+ {
+ print $atmsort_outfh "invalid specified dependency: $colset[1]\n";
+ print STDERR "invalid specified dependency: $colset[1]\n";
+ next;
+ }
+ $realspec = {};
+ my $scol2 = [];
+ my $dcol2 = [];
+ my $sdcol = [];
+ $realspec->{spec} = $item;
+ push @{$scol2}, @scols;
+ push @{$dcol2}, @dcols;
+ push @{$sdcol}, @scols, @dcols;
+ $realspec->{scol} = $scol2;
+ $realspec->{dcol} = $dcol2;
+ $realspec->{allcol} = $sdcol;
+
+ push @mvdcols, @scols, @dcols;
+ # find all the dependent columns
+ push @mvd_deps, @dcols;
+ push @mvdspec, $realspec;
+ }
+
+ # find all the mvd cols which are *not* dependent. Need
+ # to handle the case of self-dependency, eg "mvd 1->1", so
+ # must build set of all columns, then strip out the
+ # "dependent" cols. So this is the set of all LHS columns
+ # which are never on the RHS.
+ my %get_nodeps;
+
+ for my $col (@mvdcols)
+ {
+ $get_nodeps{$col} = 1;
+ }
+
+ # remove dependent cols
+ for my $col (@mvd_deps)
+ {
+ if (exists($get_nodeps{$col}))
+ {
+ delete $get_nodeps{$col};
+ }
+ }
+ # now sorted and unique, with no dependents
+ @mvd_nodeps = sort (keys(%get_nodeps));
+# print "mvdspec:", Data::Dumper->Dump(\@mvdspec);
+# print "mvd no deps:", Data::Dumper->Dump(\@mvd_nodeps);
+ }
+
+ my %unsorth = map { $_ => 1 } @allcols;
+
+ # clear sorted column list if just "order 0"
+ if ((1 == scalar(@presortcols))
+ && ($presortcols[0] eq "0"))
+ {
+ @presortcols = ();
+ }
+
+
+ for my $col (@presortcols)
+ {
+ if (exists($unsorth{$col}))
+ {
+ delete $unsorth{$col};
+ }
+ }
+ for my $col (@mvdcols)
+ {
+ if (exists($unsorth{$col}))
+ {
+ delete $unsorth{$col};
+ }
+ }
+ my @unsortcols = sort(keys(%unsorth));
+
+# print Data::Dumper->Dump([$ah1]);
+
+ if (scalar(@presortcols))
+ {
+ my $hd1 = "sorted columns " . join(", ", @presortcols);
+
+ print $hd1, "\n", "-"x(length($hd1)), "\n";
+
+ for my $h_row (@{$ah1})
+ {
+ my @collist;
+
+ @collist = ();
+
+# print "hrow:",Data::Dumper->Dump([$h_row]), "\n";
+
+ for my $col (@presortcols)
+ {
+# print "col: ($col)\n";
+ if (exists($h_row->{$col}))
+ {
+ push @collist, $h_row->{$col};
+ }
+ else
+ {
+ my $maxcol = scalar(@allcols);
+ my $errstr =
+ "specified ORDER column out of range: $col vs $maxcol\n";
+ print $atmsort_outfh $errstr;
+ print STDERR $errstr;
+ last;
+ }
+ }
+ print $atmsort_outfh $prefix, join(' | ', @collist), "\n";
+ }
+ }
+
+ if (scalar(@mvdspec))
+ {
+ my @outi;
+
+ my $hd1 = "multivalue dependency specifications";
+
+ print $hd1, "\n", "-"x(length($hd1)), "\n";
+
+ for my $mspec (@mvdspec)
+ {
+ $hd1 = $mspec->{spec};
+ print $hd1, "\n", "-"x(length($hd1)), "\n";
+
+ for my $h_row (@{$ah1})
+ {
+ my @collist;
+
+ @collist = ();
+
+# print "hrow:",Data::Dumper->Dump([$h_row]), "\n";
+
+ for my $col (@{$mspec->{allcol}})
+ {
+# print "col: ($col)\n";
+ if (exists($h_row->{$col}))
+ {
+ push @collist, $h_row->{$col};
+ }
+ else
+ {
+ my $maxcol = scalar(@allcols);
+ my $errstr =
+ "specified MVD column out of range: $col vs $maxcol\n";
+ print $errstr;
+ print STDERR $errstr;
+ last;
+ }
+
+ }
+ push @outi, join(' | ', @collist);
+ }
+ my @ggg= sort @outi;
+
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $prefix, $line, "\n";
+ }
+ @outi = ();
+ }
+ }
+ my $hd2 = "unsorted columns " . join(", ", @unsortcols);
+
+ # the "unsorted" comparison must include all columns which are
+ # not sorted or part of an mvd specification, plus the sorted
+ # columns, plus the non-dependent mvd columns which aren't
+ # already in the list
+ if ((scalar(@presortcols))
+ || scalar(@mvd_nodeps))
+ {
+ if (scalar(@presortcols))
+ {
+ if (scalar(@mvd_deps))
+ {
+ my %get_presort;
+
+ for my $col (@presortcols)
+ {
+ $get_presort{$col} = 1;
+ }
+ # remove "dependent" (RHS) columns
+ for my $col (@mvd_deps)
+ {
+ if (exists($get_presort{$col}))
+ {
+ delete $get_presort{$col};
+ }
+ }
+ # now sorted and unique, minus all mvd dependent cols
+ @presortcols = sort (keys(%get_presort));
+
+ }
+
+ if (scalar(@presortcols))
+ {
+ $hd2 .= " ( " . join(", ", @presortcols) . ")";
+ # have to compare all columns as unsorted
+ push @unsortcols, @presortcols;
+ }
+ }
+ if (scalar(@mvd_nodeps))
+ {
+ my %get_nodeps;
+
+ for my $col (@mvd_nodeps)
+ {
+ $get_nodeps{$col} = 1;
+ }
+ # remove "nodeps" which are already in the output list
+ for my $col (@unsortcols)
+ {
+ if (exists($get_nodeps{$col}))
+ {
+ delete $get_nodeps{$col};
+ }
+ }
+ # now sorted and unique, minus all unsorted/sorted cols
+ @mvd_nodeps = sort (keys(%get_nodeps));
+ if (scalar(@mvd_nodeps))
+ {
+ $hd2 .= " (( " . join(", ", @mvd_nodeps) . "))";
+ # have to compare all columns as unsorted
+ push @unsortcols, @mvd_nodeps;
+ }
+
+ }
+
+ }
+
+ print $hd2, "\n", "-"x(length($hd2)), "\n";
+
+ my @finalunsort;
+
+ if (scalar(@unsortcols))
+ {
+ for my $h_row (@{$ah1})
+ {
+ my @collist;
+
+ @collist = ();
+
+ for my $col (@unsortcols)
+ {
+ if (exists($h_row->{$col}))
+ {
+ push @collist, $h_row->{$col};
+ }
+ else
+ {
+ my $maxcol = scalar(@allcols);
+ my $errstr =
+ "specified UNSORT column out of range: $col vs $maxcol\n";
+ print $errstr;
+ print STDERR $errstr;
+ last;
+ }
+
+ }
+ push @finalunsort, join(' | ', @collist);
+ }
+ my @ggg= sort @finalunsort;
+
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $prefix, $line, "\n";
+ }
+ }
+
+ if ($glob_verbose)
+ {
+ print "REGRESS_IGNORE: end fqo $fqostate->{count}\n";
+ }
+
+ return;
+ } # end order
+
+
+ if ($has_order)
+ {
+ my @ggg= @{$outarr};
+
+ if ($glob_orderwarn)
+ {
+ # If no ordering cols were specified (no directive), and the SELECT has
+ # ORDER BY, see if the number of ORDER BY cols matches all cols in
+ # the select list. Treat the ORDER BY cols as a comma-separated
+ # list and count them. Works ok for simple ORDER BY clauses.
+ if (defined($directive->{sql_statement}))
+ {
+ my @ocols = ($directive->{sql_statement} =~ m/select.*order\s+by\s+(.*)\;/ism);
+
+ if (scalar(@ocols))
+ {
+ my $fl2 = $directive->{firstline};
+ # lines already have newline terminator, so just rejoin them.
+ my $line2 = join ("", @{$outarr});
+
+ my $ah2 = tablelizer($line2, $fl2);
+ if (defined($ah2) && scalar(@{$ah2}))
+ {
+ my $allcol_count = scalar(keys(%{$ah2->[0]}));
+
+ # In order to count the number of ORDER BY columns we
+ # can transliterate over comma and increment by one to
+ # account for the last column not having a trailing
+ # comma. This is faster than splitting over the comma
+ # since we don't need to allocate the returned array.
+ my $ocol_count = ($ocols[0] =~ tr/,//) + 1;
+
+ if ($ocol_count < $allcol_count)
+ {
+ print "REGRESS_IGNORE: ORDER_WARNING: OUTPUT ",
+ $allcol_count, " columns, but ORDER BY on ",
+ $ocol_count, " \n";
+ }
+ }
+ }
+ }
+ } # end if $glob_orderwarn
+
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $dpref, $prefix, $line;
+ }
+ }
+ else
+ {
+ my @ggg= sort @{$outarr};
+
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $bpref, $prefix, $line;
+ }
+ }
+
+ if ($glob_verbose)
+ {
+ print "REGRESS_IGNORE: end fqo $fqostate->{count}\n";
+ }
+}
+
+
+# The caller should've opened ATMSORT_INFILE and ATMSORT_OUTFILE file handles.
+sub atmsort_bigloop
+{
+ my $infh = shift;
+ $atmsort_outfh = shift;
+
+ my $sql_statement = "";
+ my @outarr;
+
+ my $getrows = 0;
+ my $getstatement = 0;
+ my $has_order = 0;
+ my $copy_to_stdout_result = 0;
+ my $describe_mode = 0;
+ my $directive = {};
+ my $big_ignore = 0;
+ my %define_match_expression;
+
+ print $atmsort_outfh "REGRESS_IGNORE: formatted by atmsort.pm\n";
+
+ L_bigwhile:
+ while (<$infh>) # big while
+ {
+ reprocess_row:
+ my $ini = $_;
+
+ # look for match/substitution or match/ignore expressions
+ if (%define_match_expression)
+ {
+ if ($ini =~ m/\-\-\s*end\_match(subs|ignore)\s*$/)
+ {
+ if ($define_match_expression{"type"} ne $1)
+ {
+ die "Non-matching operation end_match" . $1 . ", " .
+ "expected end_match" . $define_match_expression{"type"};
+ }
+ }
+ else
+ {
+ $define_match_expression{"expr"} .= $ini;
+ goto L_push_outarr;
+ }
+
+ my @foo = split(/\n/, $define_match_expression{"expr"}, 2);
+
+ unless (2 == scalar(@foo))
+ {
+ $ini .= "REGRESS_IGNORE: bad match definition\n";
+ undef %define_match_expression;
+ goto L_push_outarr;
+ }
+
+ my $stat;
+
+ my $doc1 = $foo[1];
+
+ # strip off leading comment characters
+ $doc1 =~ s/^\s*\-\-//gm;
+
+ if ($define_match_expression{"type"} eq 'subs')
+ {
+ $stat = _build_match_subs($doc1, "USER");
+ }
+ else
+ {
+ $stat = _build_match_ignores($doc1, "USER");
+ }
+
+ if (scalar(@{$stat}) > 1)
+ {
+ my $outi = $stat->[1];
+
+ # print a message showing the error
+ $outi =~ s/^(.*)/REGRESS_IGNORE: ($1)/gm;
+ $ini .= $outi;
+ }
+ else
+ {
+ $ini .= "REGRESS_IGNORE: defined new match expression\n";
+ }
+
+ undef %define_match_expression;
+ goto L_push_outarr;
+ } # end defined match expression
+
+ if ($big_ignore > 0)
+ {
+ if ($ini =~ m/\-\-\s*end\_ignore\s*$/)
+ {
+ $big_ignore--;
+ }
+ print $atmsort_outfh "REGRESS_IGNORE:", $ini;
+ next;
+ }
+
+ if ($getrows) # getting rows from SELECT output
+ {
+ # The end of "result set" for a COPY TO STDOUT is a bit tricky
+ # to find. There is no explicit marker for it. We look for a
+ # line that looks like a SQL comment or a new query, or an ERROR.
+ # This is not bullet-proof, but works for the current tests.
+ if ($copy_to_stdout_result &&
+ ($ini =~ m/(?:\-\-|ERROR|copy|create|drop|select|insert|update)/i))
+ {
+ my @ggg = sort @outarr;
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $bpref, $line;
+ }
+
+ @outarr = ();
+ $getrows = 0;
+ $has_order = 0;
+ $copy_to_stdout_result = 0;
+
+ # Process the row again, in case it begins another
+ # COPY TO STDOUT statement, or another query.
+ goto reprocess_row;
+ }
+
+ my $end_of_table = 0;
+
+ if ($describe_mode)
+ {
+ # \d tables don't always end with a row count, and there may be
+ # more than one of them per command. So we allow any of the
+ # following to end the table:
+ # - a blank line
+ # - a row that doesn't have the same number of column separators
+ # as the header line
+ # - a row count (checked below)
+ if ($ini =~ m/^$/)
+ {
+ $end_of_table = 1;
+ }
+ elsif (exists($directive->{firstline}))
+ {
+ # Count the number of column separators in the table header
+ # and our current line.
+ my $headerSeparators = ($directive->{firstline} =~ tr/\|//);
+ my $lineSeparators = ($ini =~ tr/\|//);
+
+ if ($headerSeparators != $lineSeparators)
+ {
+ $end_of_table = 1;
+ }
+ }
+
+ # Don't reset describe_mode at the end of the table; there may
+ # be more tables still to go.
+ }
+
+ # regex example: (5 rows)
+ if ($ini =~ m/^\s*\(\d+\s+row(?:s)*\)\s*$/)
+ {
+ # Always ignore the rowcount for explain plan out as the
+ # skeleton plans might be the same even if the row counts
+ # differ because of session level GUCs.
+ if (exists($directive->{explain}))
+ {
+ $ini = 'REGRESS_IGNORE:' . $ini;
+ }
+
+ $end_of_table = 1;
+ }
+
+ if ($end_of_table)
+ {
+ format_query_output($glob_fqo,
+ $has_order, \@outarr, $directive);
+
+ $directive = {};
+ @outarr = ();
+ $getrows = 0;
+ $has_order = 0;
+ }
+ }
+ else # finding SQL statement or start of SELECT output
+ {
+ # To avoid hunting for gpdiff commands which are contained inside
+ # comments first establish if the line contains a comment with any
+ # trailing characters at all.
+ my $has_comment = ((m/\s*\-\-.+$/) ? 1 : 0);
+
+ if ($has_comment && $ini =~ m/\-\-\s*start\_match(subs|ignore)\s*$/)
+ {
+ $define_match_expression{"type"} = $1;
+ $define_match_expression{"expr"} = $ini;
+ goto L_push_outarr;
+ }
+ if ($has_comment && ($ini =~ m/\-\-\s*start\_ignore\s*$/))
+ {
+ $big_ignore += 1;
+
+ for my $line (@outarr)
+ {
+ print $atmsort_outfh $apref, $line;
+ }
+ @outarr = ();
+
+ print $atmsort_outfh 'REGRESS_IGNORE:', $ini;
+ next;
+ }
+
+ # EXPLAIN (COSTS OFF) ...
+ if ($ini =~ m/explain\s*\(.*costs\s+off.*\)/i)
+ {
+ $directive->{explain} = "costs_off";
+ }
+ # Note: \d is for the psql "describe"
+ elsif ($ini =~ m/(?:insert|update|delete|select|^\s*\\d|copy|execute)/i)
+ {
+ $copy_to_stdout_result = 0;
+ $has_order = 0;
+ $sql_statement = "";
+
+ if ($ini =~ m/explain.*(?:insert|update|delete|select|execute)/i)
+ {
+ $directive->{explain} = 'normal';
+ }
+
+ # Should we apply more heuristics to try to find the end of \d
+ # output?
+ $describe_mode = ($ini =~ m/^\s*\\d/);
+ }
+
+ # Catching multiple commands and capturing the parens matches
+ # makes it possible to check just the first character since
+ # each command has a unique first character. This allows us to
+ # use fewer regular expression matches in this hot section.
+ if ($has_comment &&
+ $ini =~ m/\-\-\s*((force_explain)\s*(operator)?\s*$|(ignore)\s*$|(order)\s+(\d+|none).*$|(mvd)\s+\d+.*$|(explain_processing_(on|off))\s+.*$)/)
+ {
+ my $full_command = $1;
+ my $cmd = substr($full_command, 0, 1);
+ if ($cmd eq 'i')
+ {
+ $directive->{ignore} = 'ignore';
+ }
+ elsif ($cmd eq 'o')
+ {
+ my $olist = $ini;
+ $olist =~ s/^.*\-\-\s*order//;
+ if ($olist =~ /none/)
+ {
+ $directive->{order_none} = 1;
+ }
+ else
+ {
+ $directive->{order} = $olist;
+ }
+ }
+ elsif ($cmd eq 'f')
+ {
+ if (defined($3))
+ {
+ $directive->{explain} = 'operator';
+ }
+ else
+ {
+ $directive->{explain} = 'normal';
+ }
+ }
+ elsif ($cmd eq 'e')
+ {
+ $full_command =~ m/(on|off)$/;
+ $directive->{explain_processing} = $1;
+ }
+ else
+ {
+ my $olist = $ini;
+ $olist =~ s/^.*\-\-\s*mvd//;
+ $directive->{mvd} = $olist;
+ }
+ }
+
+ if ($ini =~ m/select/i)
+ {
+ $getstatement = 1;
+ $sql_statement .= $ini;
+ }
+ if (index($ini, ';') != -1)
+ {
+ $getstatement = 0;
+ }
+
+ # prune notices with segment info if they are duplicates
+ if ($ini =~ m/^\s*(?:NOTICE|ERROR|HINT|DETAIL|WARNING)\:.*\(seg.*pid.*\)/)
+ {
+ $ini =~ s/\s+(?:\W)?(?:\W)?\(seg.*pid.*\)//;
+
+ my $outsize = scalar(@outarr);
+
+ my $lastguy = -1;
+
+ L_checkfor:
+ for my $jj (1..$outsize)
+ {
+ my $checkstr = $outarr[$lastguy];
+
+ #remove trailing spaces for comparison
+ $checkstr =~ s/\s+$//;
+
+ my $skinny = $ini;
+ $skinny =~ s/\s+$//;
+
+ # stop when no more notices
+ last L_checkfor
+ if ($checkstr !~ m/^\s*(?:NOTICE|ERROR|HINT|DETAIL|WARNING)\:/);
+
+ # discard this line if matches a previous notice
+ if ($skinny eq $checkstr)
+ {
+ if (0) # debug code
+ {
+ $ini = "DUP: " . $ini;
+ last L_checkfor;
+ }
+ next L_bigwhile;
+ }
+ $lastguy--;
+ } # end for
+
+ } # end if pruning notices
+
+ # MPP-1492 allow:
+ # copy (select ...) to stdout
+ # \copy (select ...) to stdout
+ # and special case these guys:
+ # copy test1 to stdout
+ # \copy test1 to stdout
+ my $matches_copy_to_stdout = 0;
+ if ($ini =~ m/^(?:\\)?copy\s+(?:(?:\(select.*\))|\S+)\s+to stdout.*$/i)
+ {
+ $matches_copy_to_stdout = 1;
+ }
+
+ # Try to detect the beginning of result set, as printed by psql
+ #
+ # Examples:
+ #
+ # hdr
+ # ----------
+ #
+ # a | b
+ # ---+---
+ #
+ # The previous line should be the header. It should have a space at the
+ # beginning and end. This line should consist of dashes and plus signs,
+ # with at least three dashes for each column.
+ #
+ if (($matches_copy_to_stdout && $ini !~ m/order by/i) ||
+ (scalar(@outarr) > 1 && $outarr[-1] =~ m/^\s+.*\s$/ &&
+ $ini =~ m/^(?:(?:\-\-)(?:\-)+(?:\+(?:\-)+)*)$/))
+ # special case for copy select
+ { # sort this region
+
+ $directive->{firstline} = $outarr[-1];
+
+ if (exists($directive->{order}) ||
+ exists($directive->{mvd}))
+ {
+ $directive->{sortlines} = $outarr[-1];
+ }
+
+ # special case for copy select
+ if ($matches_copy_to_stdout)
+ {
+ $copy_to_stdout_result = 1;
+ $sql_statement = "";
+ }
+ # special case for explain
+ if (exists($directive->{explain}) &&
+ ($ini =~ m/^\s*(?:(?:\-\-)(?:\-)+(?:\+(?:\-)+)*)+\s*$/) &&
+ (scalar(@outarr) && $outarr[-1] =~ m/QUERY PLAN/))
+ {
+ # ENGINF-88: fixup explain headers
+ $outarr[-1] = "QUERY PLAN\n";
+ $ini = ("_" x length($outarr[-1])) . "\n";
+ }
+
+ $getstatement = 0;
+
+ for my $line (@outarr)
+ {
+ print $atmsort_outfh $apref, $line;
+ }
+ @outarr = ();
+
+ print $atmsort_outfh $apref, $ini;
+
+ # If there is an ORDER BY in the query, then the results must
+ # be in the order that we have memorized in the expected
+ # output. Otherwise, the order of the rows is not
+ # well-defined, so we sort them before comparing, to mask out
+ # any differences in the order.
+ #
+ # This isn't foolproof, and will get fooled by ORDER BYs in
+ # subqueries, for example. But it catches the common cases.
+ if (defined($directive->{explain}))
+ {
+ $has_order = 1; # Do not reorder EXPLAIN output
+ }
+ elsif (defined($sql_statement)
+ && length($sql_statement)
+ && !defined($directive->{order_none})
+ # multiline match
+ && ($sql_statement =~ m/select.*order.*by/is))
+ {
+ # There was an ORDER BY. But if it was part of an
+ # "agg() OVER (ORDER BY ...)" or "WITHIN GROUP (ORDER BY
+ # ...)" construct, ignore it, because those constructs
+ # don't mean that the final result has to be in order.
+ my $t = $sql_statement;
+ $t =~ s/over\s*\(order\s+by.*\)/xx/isg;
+ $t =~ s/over\s*\((partition\s+by.*)?order\s+by.*\)/xx/isg;
+ $t =~ s/window\s+\w+\s+as\s+\((partition\s+by.*)?order\s+by.*\)/xx/isg;
+ $t =~ s/within\s+group\s*\((order\s+by.*)\)/xx/isg;
+
+ if ($t =~ m/order\s+by/is)
+ {
+ $has_order = 1; # so do *not* sort output
+ }
+ else
+ {
+ $has_order = 0; # need to sort query output
+ }
+ }
+ else
+ {
+ $has_order = 0; # need to sort query output
+ }
+ $directive->{sql_statement} = $sql_statement;
+ $sql_statement = '';
+
+ $getrows = 1;
+ next;
+ } # end sort this region
+ } # end finding SQL
+
+ # if MATCH then SUBSTITUTE
+ # see HERE document for definitions
+ $ini = match_then_subs($ini);
+
+ # if MATCH then IGNORE
+ # see HERE document for definitions
+ if ( match_then_ignore($ini))
+ {
+ next; # ignore matching lines
+ }
+
+L_push_outarr:
+
+ push @outarr, $ini;
+
+ } # end big while
+
+ for my $line (@outarr)
+ {
+ print $atmsort_outfh $cpref, $line;
+ }
+} # end bigloop
+
+
+# The argument is the input filename. The output filename is returned as it
+# is generated in this function to avoid races around the temporary filename
+# creation.
+sub run
+{
+ my $infname = shift;
+
+ open my $infh, '<', $infname or die "could not open $infname: $!";
+ my ($outfh, $outfname) = tempfile();
+
+ run_fhs($infh, $outfh);
+
+ close $infh;
+ close $outfh;
+ return $outfname;
+}
+
+# The arguments are input and output file handles
+sub run_fhs
+{
+ my $infh = shift;
+ my $outfh = shift;
+
+
+ # loop over input file.
+ atmsort_bigloop($infh, $outfh);
+}
+
+1;
diff --git a/src/test/regress/expected/.gitignore b/src/test/regress/expected/.gitignore
index 93c56c85a09..0eb69843729 100644
--- a/src/test/regress/expected/.gitignore
+++ b/src/test/regress/expected/.gitignore
@@ -5,5 +5,6 @@
/largeobject.out
/largeobject_1.out
/misc.out
+/misc_1.out
/security_label.out
/tablespace.out
diff --git a/src/test/regress/expected/alter_table_1.out b/src/test/regress/expected/alter_table_1.out
new file mode 100644
index 00000000000..d03ce1f4499
--- /dev/null
+++ b/src/test/regress/expected/alter_table_1.out
@@ -0,0 +1,4062 @@
+--
+-- ALTER_TABLE
+--
+-- Clean up in case a prior regression run failed
+SET client_min_messages TO 'warning';
+DROP ROLE IF EXISTS regress_alter_table_user1;
+RESET client_min_messages;
+CREATE USER regress_alter_table_user1;
+--
+-- add attribute
+--
+CREATE TABLE attmp (initial int4);
+COMMENT ON TABLE attmp_wrong IS 'table comment';
+ERROR: relation "attmp_wrong" does not exist
+COMMENT ON TABLE attmp IS 'table comment';
+COMMENT ON TABLE attmp IS NULL;
+ALTER TABLE attmp ADD COLUMN xmin integer; -- fails
+ERROR: column name "xmin" conflicts with a system column name
+ALTER TABLE attmp ADD COLUMN a int4 default 3;
+ALTER TABLE attmp ADD COLUMN b name;
+ALTER TABLE attmp ADD COLUMN c text;
+ALTER TABLE attmp ADD COLUMN d float8;
+ALTER TABLE attmp ADD COLUMN e float4;
+ALTER TABLE attmp ADD COLUMN f int2;
+ALTER TABLE attmp ADD COLUMN g polygon;
+ALTER TABLE attmp ADD COLUMN i char;
+ALTER TABLE attmp ADD COLUMN k int4;
+ALTER TABLE attmp ADD COLUMN l tid;
+ALTER TABLE attmp ADD COLUMN m xid;
+ALTER TABLE attmp ADD COLUMN n oidvector;
+--ALTER TABLE attmp ADD COLUMN o lock;
+ALTER TABLE attmp ADD COLUMN p boolean;
+ALTER TABLE attmp ADD COLUMN q point;
+ALTER TABLE attmp ADD COLUMN r lseg;
+ALTER TABLE attmp ADD COLUMN s path;
+ALTER TABLE attmp ADD COLUMN t box;
+ALTER TABLE attmp ADD COLUMN v timestamp;
+ALTER TABLE attmp ADD COLUMN w interval;
+ALTER TABLE attmp ADD COLUMN x float8[];
+ALTER TABLE attmp ADD COLUMN y float4[];
+ALTER TABLE attmp ADD COLUMN z int2[];
+INSERT INTO attmp (a, b, c, d, e, f, g, i, k, l, m, n, p, q, r, s, t,
+ v, w, x, y, z)
+ VALUES (4, 'name', 'text', 4.1, 4.1, 2, '(4.1,4.1,3.1,3.1)',
+ 'c',
+ 314159, '(1,1)', '512',
+ '1 2 3 4 5 6 7 8', true, '(1.1,1.1)', '(4.1,4.1,3.1,3.1)',
+ '(0,2,4.1,4.1,3.1,3.1)', '(4.1,4.1,3.1,3.1)',
+ 'epoch', '01:00:10', '{1.0,2.0,3.0,4.0}', '{1.0,2.0,3.0,4.0}', '{1,2,3,4}');
+SELECT * FROM attmp;
+ initial | a | b | c | d | e | f | g | i | k | l | m | n | p | q | r | s | t | v | w | x | y | z
+---------+---+------+------+-----+-----+---+-----------------------+---+--------+-------+-----+-----------------+---+-----------+-----------------------+-----------------------------+---------------------+--------------------------+------------------+-----------+-----------+-----------
+ | 4 | name | text | 4.1 | 4.1 | 2 | ((4.1,4.1),(3.1,3.1)) | c | 314159 | (1,1) | 512 | 1 2 3 4 5 6 7 8 | t | (1.1,1.1) | [(4.1,4.1),(3.1,3.1)] | ((0,2),(4.1,4.1),(3.1,3.1)) | (4.1,4.1),(3.1,3.1) | Thu Jan 01 00:00:00 1970 | @ 1 hour 10 secs | {1,2,3,4} | {1,2,3,4} | {1,2,3,4}
+(1 row)
+
+DROP TABLE attmp;
+-- the wolf bug - schema mods caused inconsistent row descriptors
+CREATE TABLE attmp (
+ initial int4
+);
+ALTER TABLE attmp ADD COLUMN a int4;
+ALTER TABLE attmp ADD COLUMN b name;
+ALTER TABLE attmp ADD COLUMN c text;
+ALTER TABLE attmp ADD COLUMN d float8;
+ALTER TABLE attmp ADD COLUMN e float4;
+ALTER TABLE attmp ADD COLUMN f int2;
+ALTER TABLE attmp ADD COLUMN g polygon;
+ALTER TABLE attmp ADD COLUMN i char;
+ALTER TABLE attmp ADD COLUMN k int4;
+ALTER TABLE attmp ADD COLUMN l tid;
+ALTER TABLE attmp ADD COLUMN m xid;
+ALTER TABLE attmp ADD COLUMN n oidvector;
+--ALTER TABLE attmp ADD COLUMN o lock;
+ALTER TABLE attmp ADD COLUMN p boolean;
+ALTER TABLE attmp ADD COLUMN q point;
+ALTER TABLE attmp ADD COLUMN r lseg;
+ALTER TABLE attmp ADD COLUMN s path;
+ALTER TABLE attmp ADD COLUMN t box;
+ALTER TABLE attmp ADD COLUMN v timestamp;
+ALTER TABLE attmp ADD COLUMN w interval;
+ALTER TABLE attmp ADD COLUMN x float8[];
+ALTER TABLE attmp ADD COLUMN y float4[];
+ALTER TABLE attmp ADD COLUMN z int2[];
+INSERT INTO attmp (a, b, c, d, e, f, g, i, k, l, m, n, p, q, r, s, t,
+ v, w, x, y, z)
+ VALUES (4, 'name', 'text', 4.1, 4.1, 2, '(4.1,4.1,3.1,3.1)',
+ 'c',
+ 314159, '(1,1)', '512',
+ '1 2 3 4 5 6 7 8', true, '(1.1,1.1)', '(4.1,4.1,3.1,3.1)',
+ '(0,2,4.1,4.1,3.1,3.1)', '(4.1,4.1,3.1,3.1)',
+ 'epoch', '01:00:10', '{1.0,2.0,3.0,4.0}', '{1.0,2.0,3.0,4.0}', '{1,2,3,4}');
+SELECT * FROM attmp;
+ initial | a | b | c | d | e | f | g | i | k | l | m | n | p | q | r | s | t | v | w | x | y | z
+---------+---+------+------+-----+-----+---+-----------------------+---+--------+-------+-----+-----------------+---+-----------+-----------------------+-----------------------------+---------------------+--------------------------+------------------+-----------+-----------+-----------
+ | 4 | name | text | 4.1 | 4.1 | 2 | ((4.1,4.1),(3.1,3.1)) | c | 314159 | (1,1) | 512 | 1 2 3 4 5 6 7 8 | t | (1.1,1.1) | [(4.1,4.1),(3.1,3.1)] | ((0,2),(4.1,4.1),(3.1,3.1)) | (4.1,4.1),(3.1,3.1) | Thu Jan 01 00:00:00 1970 | @ 1 hour 10 secs | {1,2,3,4} | {1,2,3,4} | {1,2,3,4}
+(1 row)
+
+CREATE INDEX attmp_idx ON attmp (a, (d + e), b);
+ALTER INDEX attmp_idx ALTER COLUMN 0 SET STATISTICS 1000;
+ERROR: column number must be in range from 1 to 32767
+LINE 1: ALTER INDEX attmp_idx ALTER COLUMN 0 SET STATISTICS 1000;
+ ^
+ALTER INDEX attmp_idx ALTER COLUMN 1 SET STATISTICS 1000;
+ERROR: cannot alter statistics on non-expression column "a" of index "attmp_idx"
+HINT: Alter statistics on table column instead.
+ALTER INDEX attmp_idx ALTER COLUMN 2 SET STATISTICS 1000;
+\d+ attmp_idx
+ Index "public.attmp_idx"
+ Column | Type | Key? | Definition | Storage | Stats target
+--------+------------------+------+------------+---------+--------------
+ a | integer | yes | a | plain |
+ expr | double precision | yes | (d + e) | plain | 1000
+ b | cstring | yes | b | plain |
+btree, for table "public.attmp"
+
+ALTER INDEX attmp_idx ALTER COLUMN 3 SET STATISTICS 1000;
+ERROR: cannot alter statistics on non-expression column "b" of index "attmp_idx"
+HINT: Alter statistics on table column instead.
+ALTER INDEX attmp_idx ALTER COLUMN 4 SET STATISTICS 1000;
+ERROR: column number 4 of relation "attmp_idx" does not exist
+ALTER INDEX attmp_idx ALTER COLUMN 2 SET STATISTICS -1;
+DROP TABLE attmp;
+--
+-- rename - check on both non-temp and temp tables
+--
+CREATE TABLE attmp (regtable int);
+CREATE TEMP TABLE attmp (attmptable int);
+ALTER TABLE attmp RENAME TO attmp_new;
+SELECT * FROM attmp;
+ regtable
+----------
+(0 rows)
+
+SELECT * FROM attmp_new;
+ attmptable
+------------
+(0 rows)
+
+ALTER TABLE attmp RENAME TO attmp_new2;
+SELECT * FROM attmp; -- should fail
+ERROR: relation "attmp" does not exist
+LINE 1: SELECT * FROM attmp;
+ ^
+SELECT * FROM attmp_new;
+ attmptable
+------------
+(0 rows)
+
+SELECT * FROM attmp_new2;
+ regtable
+----------
+(0 rows)
+
+DROP TABLE attmp_new;
+DROP TABLE attmp_new2;
+-- check rename of partitioned tables and indexes also
+CREATE TABLE part_attmp (a int primary key) partition by range (a);
+CREATE TABLE part_attmp1 PARTITION OF part_attmp FOR VALUES FROM (0) TO (100);
+ALTER INDEX part_attmp_pkey RENAME TO part_attmp_index;
+ALTER INDEX part_attmp1_pkey RENAME TO part_attmp1_index;
+ALTER TABLE part_attmp RENAME TO part_at2tmp;
+ALTER TABLE part_attmp1 RENAME TO part_at2tmp1;
+SET ROLE regress_alter_table_user1;
+ALTER INDEX part_attmp_index RENAME TO fail;
+ERROR: must be owner of index part_attmp_index
+ALTER INDEX part_attmp1_index RENAME TO fail;
+ERROR: must be owner of index part_attmp1_index
+ALTER TABLE part_at2tmp RENAME TO fail;
+ERROR: must be owner of table part_at2tmp
+ALTER TABLE part_at2tmp1 RENAME TO fail;
+ERROR: must be owner of table part_at2tmp1
+RESET ROLE;
+DROP TABLE part_at2tmp;
+--
+-- check renaming to a table's array type's autogenerated name
+-- (the array type's name should get out of the way)
+--
+CREATE TABLE attmp_array (id int);
+CREATE TABLE attmp_array2 (id int);
+SELECT typname FROM pg_type WHERE oid = 'attmp_array[]'::regtype;
+ typname
+--------------
+ _attmp_array
+(1 row)
+
+SELECT typname FROM pg_type WHERE oid = 'attmp_array2[]'::regtype;
+ typname
+---------------
+ _attmp_array2
+(1 row)
+
+ALTER TABLE attmp_array2 RENAME TO _attmp_array;
+SELECT typname FROM pg_type WHERE oid = 'attmp_array[]'::regtype;
+ typname
+---------------
+ __attmp_array
+(1 row)
+
+SELECT typname FROM pg_type WHERE oid = '_attmp_array[]'::regtype;
+ typname
+----------------
+ ___attmp_array
+(1 row)
+
+DROP TABLE _attmp_array;
+DROP TABLE attmp_array;
+-- renaming to table's own array type's name is an interesting corner case
+CREATE TABLE attmp_array (id int);
+SELECT typname FROM pg_type WHERE oid = 'attmp_array[]'::regtype;
+ typname
+--------------
+ _attmp_array
+(1 row)
+
+ALTER TABLE attmp_array RENAME TO _attmp_array;
+SELECT typname FROM pg_type WHERE oid = '_attmp_array[]'::regtype;
+ typname
+---------------
+ __attmp_array
+(1 row)
+
+DROP TABLE _attmp_array;
+-- ALTER TABLE ... RENAME on non-table relations
+-- renaming indexes (FIXME: this should probably test the index's functionality)
+ALTER INDEX IF EXISTS __onek_unique1 RENAME TO attmp_onek_unique1;
+NOTICE: relation "__onek_unique1" does not exist, skipping
+ALTER INDEX IF EXISTS __attmp_onek_unique1 RENAME TO onek_unique1;
+NOTICE: relation "__attmp_onek_unique1" does not exist, skipping
+ALTER INDEX onek_unique1 RENAME TO attmp_onek_unique1;
+ALTER INDEX attmp_onek_unique1 RENAME TO onek_unique1;
+SET ROLE regress_alter_table_user1;
+ALTER INDEX onek_unique1 RENAME TO fail; -- permission denied
+ERROR: must be owner of index onek_unique1
+RESET ROLE;
+-- renaming views
+CREATE VIEW attmp_view (unique1) AS SELECT unique1 FROM tenk1;
+ALTER TABLE attmp_view RENAME TO attmp_view_new;
+SET ROLE regress_alter_table_user1;
+ALTER VIEW attmp_view_new RENAME TO fail; -- permission denied
+ERROR: must be owner of view attmp_view_new
+RESET ROLE;
+-- hack to ensure we get an indexscan here
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+-- 5 values, sorted
+SELECT unique1 FROM tenk1 WHERE unique1 < 5;
+ unique1
+---------
+ 0
+ 1
+ 2
+ 3
+ 4
+(5 rows)
+
+reset enable_seqscan;
+reset enable_bitmapscan;
+DROP VIEW attmp_view_new;
+-- toast-like relation name
+alter table stud_emp rename to pg_toast_stud_emp;
+alter table pg_toast_stud_emp rename to stud_emp;
+-- renaming index should rename constraint as well
+ALTER TABLE onek ADD CONSTRAINT onek_unique1_constraint UNIQUE (unique1);
+ALTER INDEX onek_unique1_constraint RENAME TO onek_unique1_constraint_foo;
+ALTER TABLE onek DROP CONSTRAINT onek_unique1_constraint_foo;
+-- renaming constraint
+ALTER TABLE onek ADD CONSTRAINT onek_check_constraint CHECK (unique1 >= 0);
+ALTER TABLE onek RENAME CONSTRAINT onek_check_constraint TO onek_check_constraint_foo;
+ALTER TABLE onek DROP CONSTRAINT onek_check_constraint_foo;
+-- renaming constraint should rename index as well
+ALTER TABLE onek ADD CONSTRAINT onek_unique1_constraint UNIQUE (unique1);
+DROP INDEX onek_unique1_constraint; -- to see whether it's there
+ERROR: cannot drop index onek_unique1_constraint because constraint onek_unique1_constraint on table onek requires it
+HINT: You can drop constraint onek_unique1_constraint on table onek instead.
+ALTER TABLE onek RENAME CONSTRAINT onek_unique1_constraint TO onek_unique1_constraint_foo;
+DROP INDEX onek_unique1_constraint_foo; -- to see whether it's there
+ERROR: cannot drop index onek_unique1_constraint_foo because constraint onek_unique1_constraint_foo on table onek requires it
+HINT: You can drop constraint onek_unique1_constraint_foo on table onek instead.
+ALTER TABLE onek DROP CONSTRAINT onek_unique1_constraint_foo;
+-- renaming constraints vs. inheritance
+CREATE TABLE constraint_rename_test (a int CONSTRAINT con1 CHECK (a > 0), b int, c int);
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+Check constraints:
+ "con1" CHECK (a > 0)
+
+CREATE TABLE constraint_rename_test2 (a int CONSTRAINT con1 CHECK (a > 0), d int) INHERITS (constraint_rename_test);
+NOTICE: merging column "a" with inherited definition
+NOTICE: merging constraint "con1" with inherited definition
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+ALTER TABLE constraint_rename_test2 RENAME CONSTRAINT con1 TO con1foo; -- fail
+ERROR: cannot rename inherited constraint "con1"
+ALTER TABLE ONLY constraint_rename_test RENAME CONSTRAINT con1 TO con1foo; -- fail
+ERROR: inherited constraint "con1" must be renamed in child tables too
+ALTER TABLE constraint_rename_test RENAME CONSTRAINT con1 TO con1foo; -- ok
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+ALTER TABLE constraint_rename_test ADD CONSTRAINT con2 CHECK (b > 0) NO INHERIT;
+ALTER TABLE ONLY constraint_rename_test RENAME CONSTRAINT con2 TO con2foo; -- ok
+ALTER TABLE constraint_rename_test RENAME CONSTRAINT con2foo TO con2bar; -- ok
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+ "con2bar" CHECK (b > 0) NO INHERIT
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+ALTER TABLE constraint_rename_test ADD CONSTRAINT con3 PRIMARY KEY (a);
+ALTER TABLE constraint_rename_test RENAME CONSTRAINT con3 TO con3foo; -- ok
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | not null |
+ b | integer | | |
+ c | integer | | |
+Indexes:
+ "con3foo" PRIMARY KEY, btree (a)
+Check constraints:
+ "con1foo" CHECK (a > 0)
+ "con2bar" CHECK (b > 0) NO INHERIT
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | not null |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+DROP TABLE constraint_rename_test2;
+DROP TABLE constraint_rename_test;
+ALTER TABLE IF EXISTS constraint_not_exist RENAME CONSTRAINT con3 TO con3foo; -- ok
+NOTICE: relation "constraint_not_exist" does not exist, skipping
+ALTER TABLE IF EXISTS constraint_rename_test ADD CONSTRAINT con4 UNIQUE (a);
+NOTICE: relation "constraint_rename_test" does not exist, skipping
+-- renaming constraints with cache reset of target relation
+CREATE TABLE constraint_rename_cache (a int,
+ CONSTRAINT chk_a CHECK (a > 0),
+ PRIMARY KEY (a));
+ALTER TABLE constraint_rename_cache
+ RENAME CONSTRAINT chk_a TO chk_a_new;
+ALTER TABLE constraint_rename_cache
+ RENAME CONSTRAINT constraint_rename_cache_pkey TO constraint_rename_pkey_new;
+CREATE TABLE like_constraint_rename_cache
+ (LIKE constraint_rename_cache INCLUDING ALL);
+\d like_constraint_rename_cache
+ Table "public.like_constraint_rename_cache"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | not null |
+Indexes:
+ "like_constraint_rename_cache_pkey" PRIMARY KEY, btree (a)
+Check constraints:
+ "chk_a_new" CHECK (a > 0)
+
+DROP TABLE constraint_rename_cache;
+DROP TABLE like_constraint_rename_cache;
+-- FOREIGN KEY CONSTRAINT adding TEST
+CREATE TABLE attmp2 (a int primary key);
+CREATE TABLE attmp3 (a int, b int);
+CREATE TABLE attmp4 (a int, b int, unique(a,b));
+CREATE TABLE attmp5 (a int, b int);
+-- Insert rows into attmp2 (pktable)
+INSERT INTO attmp2 values (1);
+INSERT INTO attmp2 values (2);
+INSERT INTO attmp2 values (3);
+INSERT INTO attmp2 values (4);
+-- Insert rows into attmp3
+INSERT INTO attmp3 values (1,10);
+INSERT INTO attmp3 values (1,20);
+INSERT INTO attmp3 values (5,50);
+-- Try (and fail) to add constraint due to invalid source columns
+ALTER TABLE attmp3 add constraint attmpconstr foreign key(c) references attmp2 match full;
+ERROR: column "c" referenced in foreign key constraint does not exist
+-- Try (and fail) to add constraint due to invalid destination columns explicitly given
+ALTER TABLE attmp3 add constraint attmpconstr foreign key(a) references attmp2(b) match full;
+ERROR: column "b" referenced in foreign key constraint does not exist
+-- Try (and fail) to add constraint due to invalid data
+ALTER TABLE attmp3 add constraint attmpconstr foreign key (a) references attmp2 match full;
+ERROR: insert or update on table "attmp3" violates foreign key constraint "attmpconstr"
+DETAIL: Key (a)=(5) is not present in table "attmp2".
+-- Delete failing row
+DELETE FROM attmp3 where a=5;
+-- Try (and succeed)
+ALTER TABLE attmp3 add constraint attmpconstr foreign key (a) references attmp2 match full;
+ALTER TABLE attmp3 drop constraint attmpconstr;
+INSERT INTO attmp3 values (5,50);
+-- Try NOT VALID and then VALIDATE CONSTRAINT, but fails. Delete failure then re-validate
+ALTER TABLE attmp3 add constraint attmpconstr foreign key (a) references attmp2 match full NOT VALID;
+ALTER TABLE attmp3 validate constraint attmpconstr;
+ERROR: insert or update on table "attmp3" violates foreign key constraint "attmpconstr"
+DETAIL: Key (a)=(5) is not present in table "attmp2".
+-- Delete failing row
+DELETE FROM attmp3 where a=5;
+-- Try (and succeed) and repeat to show it works on already valid constraint
+ALTER TABLE attmp3 validate constraint attmpconstr;
+ALTER TABLE attmp3 validate constraint attmpconstr;
+-- Try a non-verified CHECK constraint
+ALTER TABLE attmp3 ADD CONSTRAINT b_greater_than_ten CHECK (b > 10); -- fail
+ERROR: check constraint "b_greater_than_ten" is violated by some row
+ALTER TABLE attmp3 ADD CONSTRAINT b_greater_than_ten CHECK (b > 10) NOT VALID; -- succeeds
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- fails
+ERROR: check constraint "b_greater_than_ten" is violated by some row
+DELETE FROM attmp3 WHERE NOT b > 10;
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- succeeds
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- succeeds
+-- Test inherited NOT VALID CHECK constraints
+select * from attmp3;
+ a | b
+---+----
+ 1 | 20
+(1 row)
+
+CREATE TABLE attmp6 () INHERITS (attmp3);
+CREATE TABLE attmp7 () INHERITS (attmp3);
+INSERT INTO attmp6 VALUES (6, 30), (7, 16);
+ALTER TABLE attmp3 ADD CONSTRAINT b_le_20 CHECK (b <= 20) NOT VALID;
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_le_20; -- fails
+ERROR: check constraint "b_le_20" is violated by some row
+DELETE FROM attmp6 WHERE b > 20;
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_le_20; -- succeeds
+-- An already validated constraint must not be revalidated
+CREATE FUNCTION boo(int) RETURNS int IMMUTABLE STRICT LANGUAGE plpgsql AS $$ BEGIN RAISE NOTICE 'boo: %', $1; RETURN $1; END; $$;
+INSERT INTO attmp7 VALUES (8, 18);
+ALTER TABLE attmp7 ADD CONSTRAINT identity CHECK (b = boo(b));
+NOTICE: boo: 18
+ALTER TABLE attmp3 ADD CONSTRAINT IDENTITY check (b = boo(b)) NOT VALID;
+NOTICE: merging constraint "identity" with inherited definition
+ALTER TABLE attmp3 VALIDATE CONSTRAINT identity;
+NOTICE: boo: 16
+NOTICE: boo: 20
+-- A NO INHERIT constraint should not be looked for in children during VALIDATE CONSTRAINT
+create table parent_noinh_convalid (a int);
+create table child_noinh_convalid () inherits (parent_noinh_convalid);
+insert into parent_noinh_convalid values (1);
+insert into child_noinh_convalid values (1);
+alter table parent_noinh_convalid add constraint check_a_is_2 check (a = 2) no inherit not valid;
+-- fail, because of the row in parent
+alter table parent_noinh_convalid validate constraint check_a_is_2;
+ERROR: check constraint "check_a_is_2" is violated by some row
+delete from only parent_noinh_convalid;
+-- ok (parent itself contains no violating rows)
+alter table parent_noinh_convalid validate constraint check_a_is_2;
+select convalidated from pg_constraint where conrelid = 'parent_noinh_convalid'::regclass and conname = 'check_a_is_2';
+ convalidated
+--------------
+ t
+(1 row)
+
+-- cleanup
+drop table parent_noinh_convalid, child_noinh_convalid;
+-- Try (and fail) to create constraint from attmp5(a) to attmp4(a) - unique constraint on
+-- attmp4 is a,b
+ALTER TABLE attmp5 add constraint attmpconstr foreign key(a) references attmp4(a) match full;
+ERROR: there is no unique constraint matching given keys for referenced table "attmp4"
+DROP TABLE attmp7;
+DROP TABLE attmp6;
+DROP TABLE attmp5;
+DROP TABLE attmp4;
+DROP TABLE attmp3;
+DROP TABLE attmp2;
+-- NOT VALID with plan invalidation -- ensure we don't use a constraint for
+-- exclusion until validated
+set constraint_exclusion TO 'partition';
+create table nv_parent (d date, check (false) no inherit not valid);
+-- not valid constraint added at creation time should automatically become valid
+\d nv_parent
+ Table "public.nv_parent"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ d | date | | |
+Check constraints:
+ "nv_parent_check" CHECK (false) NO INHERIT
+
+create table nv_child_2010 () inherits (nv_parent);
+create table nv_child_2011 () inherits (nv_parent);
+alter table nv_child_2010 add check (d between '2010-01-01'::date and '2010-12-31'::date) not valid;
+alter table nv_child_2011 add check (d between '2011-01-01'::date and '2011-12-31'::date) not valid;
+explain (costs off) select * from nv_parent where d between '2011-08-01' and '2011-08-31';
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2010
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2011
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+(7 rows)
+
+create table nv_child_2009 (check (d between '2009-01-01'::date and '2009-12-31'::date)) inherits (nv_parent);
+explain (costs off) select * from nv_parent where d between '2011-08-01'::date and '2011-08-31'::date;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2010
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2011
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+(7 rows)
+
+explain (costs off) select * from nv_parent where d between '2009-08-01'::date and '2009-08-31'::date;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2010
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2011
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2009
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+(9 rows)
+
+-- after validation, the constraint should be used
+alter table nv_child_2011 VALIDATE CONSTRAINT nv_child_2011_d_check;
+explain (costs off) select * from nv_parent where d between '2009-08-01'::date and '2009-08-31'::date;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2010
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2009
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+(7 rows)
+
+-- add an inherited NOT VALID constraint
+alter table nv_parent add check (d between '2001-01-01'::date and '2099-12-31'::date) not valid;
+\d nv_child_2009
+ Table "public.nv_child_2009"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ d | date | | |
+Check constraints:
+ "nv_child_2009_d_check" CHECK (d >= '01-01-2009'::date AND d <= '12-31-2009'::date)
+ "nv_parent_d_check" CHECK (d >= '01-01-2001'::date AND d <= '12-31-2099'::date) NOT VALID
+Inherits: nv_parent
+
+-- we leave nv_parent and children around to help test pg_dump logic
+-- Foreign key adding test with mixed types
+-- Note: these tables are TEMP to avoid name conflicts when this test
+-- is run in parallel with foreign_key.sql.
+CREATE TEMP TABLE PKTABLE (ptest1 int PRIMARY KEY);
+INSERT INTO PKTABLE VALUES(42);
+CREATE TEMP TABLE FKTABLE (ftest1 inet);
+-- This next should fail, because int=inet does not exist
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+ERROR: foreign key constraint "fktable_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: inet and integer.
+-- This should also fail for the same reason, but here we
+-- give the column name
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable(ptest1);
+ERROR: foreign key constraint "fktable_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: inet and integer.
+DROP TABLE FKTABLE;
+-- This should succeed, even though they are different types,
+-- because int=int8 exists and is a member of the integer opfamily
+CREATE TEMP TABLE FKTABLE (ftest1 int8);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+-- Check it actually works
+INSERT INTO FKTABLE VALUES(42); -- should succeed
+INSERT INTO FKTABLE VALUES(43); -- should fail
+ERROR: insert or update on table "fktable" violates foreign key constraint "fktable_ftest1_fkey"
+DETAIL: Key (ftest1)=(43) is not present in table "pktable".
+DROP TABLE FKTABLE;
+-- This should fail, because we'd have to cast numeric to int which is
+-- not an implicit coercion (or use numeric=numeric, but that's not part
+-- of the integer opfamily)
+CREATE TEMP TABLE FKTABLE (ftest1 numeric);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+ERROR: foreign key constraint "fktable_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: numeric and integer.
+DROP TABLE FKTABLE;
+DROP TABLE PKTABLE;
+-- On the other hand, this should work because int implicitly promotes to
+-- numeric, and we allow promotion on the FK side
+CREATE TEMP TABLE PKTABLE (ptest1 numeric PRIMARY KEY);
+INSERT INTO PKTABLE VALUES(42);
+CREATE TEMP TABLE FKTABLE (ftest1 int);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+-- Check it actually works
+INSERT INTO FKTABLE VALUES(42); -- should succeed
+INSERT INTO FKTABLE VALUES(43); -- should fail
+ERROR: insert or update on table "fktable" violates foreign key constraint "fktable_ftest1_fkey"
+DETAIL: Key (ftest1)=(43) is not present in table "pktable".
+DROP TABLE FKTABLE;
+DROP TABLE PKTABLE;
+CREATE TEMP TABLE PKTABLE (ptest1 int, ptest2 inet,
+ PRIMARY KEY(ptest1, ptest2));
+-- This should fail, because we just chose really odd types
+CREATE TEMP TABLE FKTABLE (ftest1 cidr, ftest2 timestamp);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2) references pktable;
+ERROR: foreign key constraint "fktable_ftest1_ftest2_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: cidr and integer.
+DROP TABLE FKTABLE;
+-- Again, so should this...
+CREATE TEMP TABLE FKTABLE (ftest1 cidr, ftest2 timestamp);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2)
+ references pktable(ptest1, ptest2);
+ERROR: foreign key constraint "fktable_ftest1_ftest2_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: cidr and integer.
+DROP TABLE FKTABLE;
+-- This fails because we mixed up the column ordering
+CREATE TEMP TABLE FKTABLE (ftest1 int, ftest2 inet);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2)
+ references pktable(ptest2, ptest1);
+ERROR: foreign key constraint "fktable_ftest1_ftest2_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest2" are of incompatible types: integer and inet.
+-- As does this...
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest2, ftest1)
+ references pktable(ptest1, ptest2);
+ERROR: foreign key constraint "fktable_ftest2_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest2" and "ptest1" are of incompatible types: inet and integer.
+DROP TABLE FKTABLE;
+DROP TABLE PKTABLE;
+-- Test that ALTER CONSTRAINT updates trigger deferrability properly
+CREATE TEMP TABLE PKTABLE (ptest1 int primary key);
+CREATE TEMP TABLE FKTABLE (ftest1 int);
+ALTER TABLE FKTABLE ADD CONSTRAINT fknd FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdd FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE INITIALLY DEFERRED;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdi FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE INITIALLY IMMEDIATE;
+ALTER TABLE FKTABLE ADD CONSTRAINT fknd2 FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE INITIALLY DEFERRED;
+ALTER TABLE FKTABLE ALTER CONSTRAINT fknd2 NOT DEFERRABLE;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdd2 FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE;
+ALTER TABLE FKTABLE ALTER CONSTRAINT fkdd2 DEFERRABLE INITIALLY DEFERRED;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdi2 FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE;
+ALTER TABLE FKTABLE ALTER CONSTRAINT fkdi2 DEFERRABLE INITIALLY IMMEDIATE;
+SELECT conname, tgfoid::regproc, tgtype, tgdeferrable, tginitdeferred
+FROM pg_trigger JOIN pg_constraint con ON con.oid = tgconstraint
+WHERE tgrelid = 'pktable'::regclass
+ORDER BY 1,2,3;
+ conname | tgfoid | tgtype | tgdeferrable | tginitdeferred
+---------+------------------------+--------+--------------+----------------
+ fkdd | "RI_FKey_cascade_del" | 9 | f | f
+ fkdd | "RI_FKey_noaction_upd" | 17 | t | t
+ fkdd2 | "RI_FKey_cascade_del" | 9 | f | f
+ fkdd2 | "RI_FKey_noaction_upd" | 17 | t | t
+ fkdi | "RI_FKey_cascade_del" | 9 | f | f
+ fkdi | "RI_FKey_noaction_upd" | 17 | t | f
+ fkdi2 | "RI_FKey_cascade_del" | 9 | f | f
+ fkdi2 | "RI_FKey_noaction_upd" | 17 | t | f
+ fknd | "RI_FKey_cascade_del" | 9 | f | f
+ fknd | "RI_FKey_noaction_upd" | 17 | f | f
+ fknd2 | "RI_FKey_cascade_del" | 9 | f | f
+ fknd2 | "RI_FKey_noaction_upd" | 17 | f | f
+(12 rows)
+
+SELECT conname, tgfoid::regproc, tgtype, tgdeferrable, tginitdeferred
+FROM pg_trigger JOIN pg_constraint con ON con.oid = tgconstraint
+WHERE tgrelid = 'fktable'::regclass
+ORDER BY 1,2,3;
+ conname | tgfoid | tgtype | tgdeferrable | tginitdeferred
+---------+---------------------+--------+--------------+----------------
+ fkdd | "RI_FKey_check_ins" | 5 | t | t
+ fkdd | "RI_FKey_check_upd" | 17 | t | t
+ fkdd2 | "RI_FKey_check_ins" | 5 | t | t
+ fkdd2 | "RI_FKey_check_upd" | 17 | t | t
+ fkdi | "RI_FKey_check_ins" | 5 | t | f
+ fkdi | "RI_FKey_check_upd" | 17 | t | f
+ fkdi2 | "RI_FKey_check_ins" | 5 | t | f
+ fkdi2 | "RI_FKey_check_upd" | 17 | t | f
+ fknd | "RI_FKey_check_ins" | 5 | f | f
+ fknd | "RI_FKey_check_upd" | 17 | f | f
+ fknd2 | "RI_FKey_check_ins" | 5 | f | f
+ fknd2 | "RI_FKey_check_upd" | 17 | f | f
+(12 rows)
+
+-- temp tables should go away by themselves, need not drop them.
+-- test check constraint adding
+create table atacc1 ( test int );
+-- add a check constraint
+alter table atacc1 add constraint atacc_test1 check (test>3);
+-- should fail
+insert into atacc1 (test) values (2);
+ERROR: new row for relation "atacc1" violates check constraint "atacc_test1"
+DETAIL: Failing row contains (2).
+-- should succeed
+insert into atacc1 (test) values (4);
+drop table atacc1;
+-- let's do one where the check fails when added
+create table atacc1 ( test int );
+-- insert a soon to be failing row
+insert into atacc1 (test) values (2);
+-- add a check constraint (fails)
+alter table atacc1 add constraint atacc_test1 check (test>3);
+ERROR: check constraint "atacc_test1" is violated by some row
+insert into atacc1 (test) values (4);
+drop table atacc1;
+-- let's do one where the check fails because the column doesn't exist
+create table atacc1 ( test int );
+-- add a check constraint (fails)
+alter table atacc1 add constraint atacc_test1 check (test1>3);
+ERROR: column "test1" does not exist
+HINT: Perhaps you meant to reference the column "atacc1.test".
+drop table atacc1;
+-- something a little more complicated
+create table atacc1 ( test int, test2 int, test3 int);
+-- add a check constraint (fails)
+alter table atacc1 add constraint atacc_test1 check (test+test2<test3*4);
+-- should fail
+insert into atacc1 (test,test2,test3) values (4,4,2);
+ERROR: new row for relation "atacc1" violates check constraint "atacc_test1"
+DETAIL: Failing row contains (4, 4, 2).
+-- should succeed
+insert into atacc1 (test,test2,test3) values (4,4,5);
+drop table atacc1;
+-- lets do some naming tests
+create table atacc1 (test int check (test>3), test2 int);
+alter table atacc1 add check (test2>test);
+-- should fail for $2
+insert into atacc1 (test2, test) values (3, 4);
+ERROR: new row for relation "atacc1" violates check constraint "atacc1_check"
+DETAIL: Failing row contains (4, 3).
+drop table atacc1;
+-- inheritance related tests
+create table atacc1 (test int);
+create table atacc2 (test2 int);
+create table atacc3 (test3 int) inherits (atacc1, atacc2);
+alter table atacc2 add constraint foo check (test2>0);
+-- fail and then succeed on atacc2
+insert into atacc2 (test2) values (-3);
+ERROR: new row for relation "atacc2" violates check constraint "foo"
+DETAIL: Failing row contains (-3).
+insert into atacc2 (test2) values (3);
+-- fail and then succeed on atacc3
+insert into atacc3 (test2) values (-3);
+ERROR: new row for relation "atacc3" violates check constraint "foo"
+DETAIL: Failing row contains (null, -3, null).
+insert into atacc3 (test2) values (3);
+drop table atacc3;
+drop table atacc2;
+drop table atacc1;
+-- same things with one created with INHERIT
+create table atacc1 (test int);
+create table atacc2 (test2 int);
+create table atacc3 (test3 int) inherits (atacc1, atacc2);
+alter table atacc3 no inherit atacc2;
+-- fail
+alter table atacc3 no inherit atacc2;
+ERROR: relation "atacc2" is not a parent of relation "atacc3"
+-- make sure it really isn't a child
+insert into atacc3 (test2) values (3);
+select test2 from atacc2;
+ test2
+-------
+(0 rows)
+
+-- fail due to missing constraint
+alter table atacc2 add constraint foo check (test2>0);
+alter table atacc3 inherit atacc2;
+ERROR: child table is missing constraint "foo"
+-- fail due to missing column
+alter table atacc3 rename test2 to testx;
+alter table atacc3 inherit atacc2;
+ERROR: child table is missing column "test2"
+-- fail due to mismatched data type
+alter table atacc3 add test2 bool;
+alter table atacc3 inherit atacc2;
+ERROR: child table "atacc3" has different type for column "test2"
+alter table atacc3 drop test2;
+-- succeed
+alter table atacc3 add test2 int;
+update atacc3 set test2 = 4 where test2 is null;
+alter table atacc3 add constraint foo check (test2>0);
+alter table atacc3 inherit atacc2;
+-- fail due to duplicates and circular inheritance
+alter table atacc3 inherit atacc2;
+ERROR: relation "atacc2" would be inherited from more than once
+alter table atacc2 inherit atacc3;
+ERROR: circular inheritance not allowed
+DETAIL: "atacc3" is already a child of "atacc2".
+alter table atacc2 inherit atacc2;
+ERROR: circular inheritance not allowed
+DETAIL: "atacc2" is already a child of "atacc2".
+-- test that we really are a child now (should see 4 not 3 and cascade should go through)
+select test2 from atacc2;
+ test2
+-------
+ 4
+(1 row)
+
+drop table atacc2 cascade;
+NOTICE: drop cascades to table atacc3
+drop table atacc1;
+-- adding only to a parent is allowed as of 9.2
+create table atacc1 (test int);
+create table atacc2 (test2 int) inherits (atacc1);
+-- ok:
+alter table atacc1 add constraint foo check (test>0) no inherit;
+-- check constraint is not there on child
+insert into atacc2 (test) values (-3);
+-- check constraint is there on parent
+insert into atacc1 (test) values (-3);
+ERROR: new row for relation "atacc1" violates check constraint "foo"
+DETAIL: Failing row contains (-3).
+insert into atacc1 (test) values (3);
+-- fail, violating row:
+alter table atacc2 add constraint foo check (test>0) no inherit;
+ERROR: check constraint "foo" is violated by some row
+drop table atacc2;
+drop table atacc1;
+-- test unique constraint adding
+create table atacc1 ( test int ) ;
+-- add a unique constraint
+alter table atacc1 add constraint atacc_test1 unique (test);
+-- insert first value
+insert into atacc1 (test) values (2);
+-- should fail
+insert into atacc1 (test) values (2);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test)=(2) already exists.
+-- should succeed
+insert into atacc1 (test) values (4);
+-- try to create duplicates via alter table using - should fail
+alter table atacc1 alter column test type integer using 0;
+ERROR: could not create unique index "atacc_test1"
+DETAIL: Key (test)=(0) is duplicated.
+drop table atacc1;
+-- let's do one where the unique constraint fails when added
+create table atacc1 ( test int );
+-- insert soon to be failing rows
+insert into atacc1 (test) values (2);
+insert into atacc1 (test) values (2);
+-- add a unique constraint (fails)
+alter table atacc1 add constraint atacc_test1 unique (test);
+ERROR: could not create unique index "atacc_test1"
+DETAIL: Key (test)=(2) is duplicated.
+insert into atacc1 (test) values (3);
+drop table atacc1;
+-- let's do one where the unique constraint fails
+-- because the column doesn't exist
+create table atacc1 ( test int );
+-- add a unique constraint (fails)
+alter table atacc1 add constraint atacc_test1 unique (test1);
+ERROR: column "test1" named in key does not exist
+drop table atacc1;
+-- something a little more complicated
+create table atacc1 ( test int, test2 int);
+-- add a unique constraint
+alter table atacc1 add constraint atacc_test1 unique (test, test2);
+-- insert initial value
+insert into atacc1 (test,test2) values (4,4);
+-- should fail
+insert into atacc1 (test,test2) values (4,4);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test, test2)=(4, 4) already exists.
+-- should all succeed
+insert into atacc1 (test,test2) values (4,5);
+insert into atacc1 (test,test2) values (5,4);
+insert into atacc1 (test,test2) values (5,5);
+drop table atacc1;
+-- lets do some naming tests
+create table atacc1 (test int, test2 int, unique(test));
+alter table atacc1 add unique (test2);
+-- should fail for @@ second one @@
+insert into atacc1 (test2, test) values (3, 3);
+insert into atacc1 (test2, test) values (2, 3);
+ERROR: duplicate key value violates unique constraint "atacc1_test_key"
+DETAIL: Key (test)=(3) already exists.
+drop table atacc1;
+-- test primary key constraint adding
+create table atacc1 ( id serial, test int) ;
+-- add a primary key constraint
+alter table atacc1 add constraint atacc_test1 primary key (test);
+-- insert first value
+insert into atacc1 (test) values (2);
+-- should fail
+insert into atacc1 (test) values (2);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test)=(2) already exists.
+-- should succeed
+insert into atacc1 (test) values (4);
+-- inserting NULL should fail
+insert into atacc1 (test) values(NULL);
+ERROR: null value in column "test" violates not-null constraint
+DETAIL: Failing row contains (4, null).
+-- try adding a second primary key (should fail)
+alter table atacc1 add constraint atacc_oid1 primary key(id);
+ERROR: multiple primary keys for table "atacc1" are not allowed
+-- drop first primary key constraint
+alter table atacc1 drop constraint atacc_test1 restrict;
+-- try adding a primary key on oid (should succeed)
+alter table atacc1 add constraint atacc_oid1 primary key(id);
+drop table atacc1;
+-- let's do one where the primary key constraint fails when added
+create table atacc1 ( test int );
+-- insert soon to be failing rows
+insert into atacc1 (test) values (2);
+insert into atacc1 (test) values (2);
+-- add a primary key (fails)
+alter table atacc1 add constraint atacc_test1 primary key (test);
+ERROR: could not create unique index "atacc_test1"
+DETAIL: Key (test)=(2) is duplicated.
+insert into atacc1 (test) values (3);
+drop table atacc1;
+-- let's do another one where the primary key constraint fails when added
+create table atacc1 ( test int );
+-- insert soon to be failing row
+insert into atacc1 (test) values (NULL);
+-- add a primary key (fails)
+alter table atacc1 add constraint atacc_test1 primary key (test);
+ERROR: column "test" contains null values
+insert into atacc1 (test) values (3);
+drop table atacc1;
+-- let's do one where the primary key constraint fails
+-- because the column doesn't exist
+create table atacc1 ( test int );
+-- add a primary key constraint (fails)
+alter table atacc1 add constraint atacc_test1 primary key (test1);
+ERROR: column "test1" of relation "atacc1" does not exist
+drop table atacc1;
+-- adding a new column as primary key to a non-empty table.
+-- should fail unless the column has a non-null default value.
+create table atacc1 ( test int );
+insert into atacc1 (test) values (0);
+-- add a primary key column without a default (fails).
+alter table atacc1 add column test2 int primary key;
+ERROR: column "test2" contains null values
+-- now add a primary key column with a default (succeeds).
+alter table atacc1 add column test2 int default 0 primary key;
+drop table atacc1;
+-- this combination used to have order-of-execution problems (bug #15580)
+create table atacc1 (a int);
+insert into atacc1 values(1);
+alter table atacc1
+ add column b float8 not null default random(),
+ add primary key(a);
+drop table atacc1;
+-- something a little more complicated
+create table atacc1 ( test int, test2 int);
+-- add a primary key constraint
+alter table atacc1 add constraint atacc_test1 primary key (test, test2);
+-- try adding a second primary key - should fail
+alter table atacc1 add constraint atacc_test2 primary key (test);
+ERROR: multiple primary keys for table "atacc1" are not allowed
+-- insert initial value
+insert into atacc1 (test,test2) values (4,4);
+-- should fail
+insert into atacc1 (test,test2) values (4,4);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test, test2)=(4, 4) already exists.
+insert into atacc1 (test,test2) values (NULL,3);
+ERROR: null value in column "test" violates not-null constraint
+DETAIL: Failing row contains (null, 3).
+insert into atacc1 (test,test2) values (3, NULL);
+ERROR: null value in column "test2" violates not-null constraint
+DETAIL: Failing row contains (3, null).
+insert into atacc1 (test,test2) values (NULL,NULL);
+ERROR: null value in column "test" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+-- should all succeed
+insert into atacc1 (test,test2) values (4,5);
+insert into atacc1 (test,test2) values (5,4);
+insert into atacc1 (test,test2) values (5,5);
+drop table atacc1;
+-- lets do some naming tests
+create table atacc1 (test int, test2 int, primary key(test));
+-- only first should succeed
+insert into atacc1 (test2, test) values (3, 3);
+insert into atacc1 (test2, test) values (2, 3);
+ERROR: duplicate key value violates unique constraint "atacc1_pkey"
+DETAIL: Key (test)=(3) already exists.
+insert into atacc1 (test2, test) values (1, NULL);
+ERROR: null value in column "test" violates not-null constraint
+DETAIL: Failing row contains (null, 1).
+drop table atacc1;
+-- alter table / alter column [set/drop] not null tests
+-- try altering system catalogs, should fail
+alter table pg_class alter column relname drop not null;
+ERROR: permission denied: "pg_class" is a system catalog
+alter table pg_class alter relname set not null;
+ERROR: permission denied: "pg_class" is a system catalog
+-- try altering non-existent table, should fail
+alter table non_existent alter column bar set not null;
+ERROR: relation "non_existent" does not exist
+alter table non_existent alter column bar drop not null;
+ERROR: relation "non_existent" does not exist
+-- test setting columns to null and not null and vice versa
+-- test checking for null values and primary key
+create table atacc1 (test int not null);
+alter table atacc1 add constraint "atacc1_pkey" primary key (test);
+alter table atacc1 alter column test drop not null;
+ERROR: column "test" is in a primary key
+alter table atacc1 drop constraint "atacc1_pkey";
+alter table atacc1 alter column test drop not null;
+insert into atacc1 values (null);
+alter table atacc1 alter test set not null;
+ERROR: column "test" contains null values
+delete from atacc1;
+alter table atacc1 alter test set not null;
+-- try altering a non-existent column, should fail
+alter table atacc1 alter bar set not null;
+ERROR: column "bar" of relation "atacc1" does not exist
+alter table atacc1 alter bar drop not null;
+ERROR: column "bar" of relation "atacc1" does not exist
+-- try creating a view and altering that, should fail
+create view myview as select * from atacc1;
+alter table myview alter column test drop not null;
+ERROR: "myview" is not a table or foreign table
+alter table myview alter column test set not null;
+ERROR: "myview" is not a table or foreign table
+drop view myview;
+drop table atacc1;
+-- set not null verified by constraints
+create table atacc1 (test_a int, test_b int);
+insert into atacc1 values (null, 1);
+-- constraint not cover all values, should fail
+alter table atacc1 add constraint atacc1_constr_or check(test_a is not null or test_b < 10);
+alter table atacc1 alter test_a set not null;
+ERROR: column "test_a" contains null values
+alter table atacc1 drop constraint atacc1_constr_or;
+-- not valid constraint, should fail
+alter table atacc1 add constraint atacc1_constr_invalid check(test_a is not null) not valid;
+alter table atacc1 alter test_a set not null;
+ERROR: column "test_a" contains null values
+alter table atacc1 drop constraint atacc1_constr_invalid;
+-- with valid constraint
+update atacc1 set test_a = 1;
+alter table atacc1 add constraint atacc1_constr_a_valid check(test_a is not null);
+alter table atacc1 alter test_a set not null;
+delete from atacc1;
+insert into atacc1 values (2, null);
+alter table atacc1 alter test_a drop not null;
+-- test multiple set not null at same time
+-- test_a checked by atacc1_constr_a_valid, test_b should fail by table scan
+alter table atacc1 alter test_a set not null, alter test_b set not null;
+ERROR: column "test_b" contains null values
+-- commands order has no importance
+alter table atacc1 alter test_b set not null, alter test_a set not null;
+ERROR: column "test_b" contains null values
+-- valid one by table scan, one by check constraints
+update atacc1 set test_b = 1;
+alter table atacc1 alter test_b set not null, alter test_a set not null;
+alter table atacc1 alter test_a drop not null, alter test_b drop not null;
+-- both column has check constraints
+alter table atacc1 add constraint atacc1_constr_b_valid check(test_b is not null);
+alter table atacc1 alter test_b set not null, alter test_a set not null;
+drop table atacc1;
+-- test inheritance
+create table parent (a int);
+create table child (b varchar(255)) inherits (parent);
+alter table parent alter a set not null;
+insert into parent values (NULL);
+ERROR: null value in column "a" violates not-null constraint
+DETAIL: Failing row contains (null).
+insert into child (a, b) values (NULL, 'foo');
+ERROR: null value in column "a" violates not-null constraint
+DETAIL: Failing row contains (null, foo).
+alter table parent alter a drop not null;
+insert into parent values (NULL);
+insert into child (a, b) values (NULL, 'foo');
+alter table only parent alter a set not null;
+ERROR: column "a" contains null values
+alter table child alter a set not null;
+ERROR: column "a" contains null values
+delete from parent;
+alter table only parent alter a set not null;
+insert into parent values (NULL);
+ERROR: null value in column "a" violates not-null constraint
+DETAIL: Failing row contains (null).
+alter table child alter a set not null;
+insert into child (a, b) values (NULL, 'foo');
+ERROR: null value in column "a" violates not-null constraint
+DETAIL: Failing row contains (null, foo).
+delete from child;
+alter table child alter a set not null;
+insert into child (a, b) values (NULL, 'foo');
+ERROR: null value in column "a" violates not-null constraint
+DETAIL: Failing row contains (null, foo).
+drop table child;
+drop table parent;
+-- test setting and removing default values
+create table def_test (
+ c1 int4 default 5,
+ c2 text default 'initial_default'
+);
+insert into def_test default values;
+alter table def_test alter column c1 drop default;
+insert into def_test default values;
+alter table def_test alter column c2 drop default;
+insert into def_test default values;
+alter table def_test alter column c1 set default 10;
+alter table def_test alter column c2 set default 'new_default';
+insert into def_test default values;
+select * from def_test;
+ c1 | c2
+----+-----------------
+ 5 | initial_default
+ | initial_default
+ |
+ 10 | new_default
+(4 rows)
+
+-- set defaults to an incorrect type: this should fail
+alter table def_test alter column c1 set default 'wrong_datatype';
+ERROR: invalid input syntax for type integer: "wrong_datatype"
+alter table def_test alter column c2 set default 20;
+-- set defaults on a non-existent column: this should fail
+alter table def_test alter column c3 set default 30;
+ERROR: column "c3" of relation "def_test" does not exist
+-- set defaults on views: we need to create a view, add a rule
+-- to allow insertions into it, and then alter the view to add
+-- a default
+create view def_view_test as select * from def_test;
+create rule def_view_test_ins as
+ on insert to def_view_test
+ do instead insert into def_test select new.*;
+insert into def_view_test default values;
+alter table def_view_test alter column c1 set default 45;
+insert into def_view_test default values;
+alter table def_view_test alter column c2 set default 'view_default';
+insert into def_view_test default values;
+select * from def_view_test;
+ c1 | c2
+----+-----------------
+ 5 | initial_default
+ | initial_default
+ |
+ 10 | new_default
+ |
+ 45 |
+ 45 | view_default
+(7 rows)
+
+drop rule def_view_test_ins on def_view_test;
+drop view def_view_test;
+drop table def_test;
+-- alter table / drop column tests
+-- try altering system catalogs, should fail
+alter table pg_class drop column relname;
+ERROR: permission denied: "pg_class" is a system catalog
+-- try altering non-existent table, should fail
+alter table nosuchtable drop column bar;
+ERROR: relation "nosuchtable" does not exist
+-- test dropping columns
+create table atacc1 (a int4 not null, b int4, c int4 not null, d int4);
+insert into atacc1 values (1, 2, 3, 4);
+alter table atacc1 drop a;
+alter table atacc1 drop a;
+ERROR: column "a" of relation "atacc1" does not exist
+-- SELECTs
+select * from atacc1;
+ b | c | d
+---+---+---
+ 2 | 3 | 4
+(1 row)
+
+select * from atacc1 order by a;
+ERROR: column "a" does not exist
+LINE 1: select * from atacc1 order by a;
+ ^
+select * from atacc1 order by "........pg.dropped.1........";
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select * from atacc1 order by "........pg.dropped.1........"...
+ ^
+select * from atacc1 group by a;
+ERROR: column "a" does not exist
+LINE 1: select * from atacc1 group by a;
+ ^
+select * from atacc1 group by "........pg.dropped.1........";
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select * from atacc1 group by "........pg.dropped.1........"...
+ ^
+select atacc1.* from atacc1;
+ b | c | d
+---+---+---
+ 2 | 3 | 4
+(1 row)
+
+select a from atacc1;
+ERROR: column "a" does not exist
+LINE 1: select a from atacc1;
+ ^
+select atacc1.a from atacc1;
+ERROR: column atacc1.a does not exist
+LINE 1: select atacc1.a from atacc1;
+ ^
+select b,c,d from atacc1;
+ b | c | d
+---+---+---
+ 2 | 3 | 4
+(1 row)
+
+select a,b,c,d from atacc1;
+ERROR: column "a" does not exist
+LINE 1: select a,b,c,d from atacc1;
+ ^
+select * from atacc1 where a = 1;
+ERROR: column "a" does not exist
+LINE 1: select * from atacc1 where a = 1;
+ ^
+select "........pg.dropped.1........" from atacc1;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select "........pg.dropped.1........" from atacc1;
+ ^
+select atacc1."........pg.dropped.1........" from atacc1;
+ERROR: column atacc1.........pg.dropped.1........ does not exist
+LINE 1: select atacc1."........pg.dropped.1........" from atacc1;
+ ^
+select "........pg.dropped.1........",b,c,d from atacc1;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select "........pg.dropped.1........",b,c,d from atacc1;
+ ^
+select * from atacc1 where "........pg.dropped.1........" = 1;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select * from atacc1 where "........pg.dropped.1........" = ...
+ ^
+-- UPDATEs
+update atacc1 set a = 3;
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: update atacc1 set a = 3;
+ ^
+update atacc1 set b = 2 where a = 3;
+ERROR: column "a" does not exist
+LINE 1: update atacc1 set b = 2 where a = 3;
+ ^
+update atacc1 set "........pg.dropped.1........" = 3;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: update atacc1 set "........pg.dropped.1........" = 3;
+ ^
+update atacc1 set b = 2 where "........pg.dropped.1........" = 3;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: update atacc1 set b = 2 where "........pg.dropped.1........"...
+ ^
+-- INSERTs
+insert into atacc1 values (10, 11, 12, 13);
+ERROR: INSERT has more expressions than target columns
+LINE 1: insert into atacc1 values (10, 11, 12, 13);
+ ^
+insert into atacc1 values (default, 11, 12, 13);
+ERROR: INSERT has more expressions than target columns
+LINE 1: insert into atacc1 values (default, 11, 12, 13);
+ ^
+insert into atacc1 values (11, 12, 13);
+insert into atacc1 (a) values (10);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a) values (10);
+ ^
+insert into atacc1 (a) values (default);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a) values (default);
+ ^
+insert into atacc1 (a,b,c,d) values (10,11,12,13);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a,b,c,d) values (10,11,12,13);
+ ^
+insert into atacc1 (a,b,c,d) values (default,11,12,13);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a,b,c,d) values (default,11,12,13);
+ ^
+insert into atacc1 (b,c,d) values (11,12,13);
+insert into atacc1 ("........pg.dropped.1........") values (10);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........") values (...
+ ^
+insert into atacc1 ("........pg.dropped.1........") values (default);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........") values (...
+ ^
+insert into atacc1 ("........pg.dropped.1........",b,c,d) values (10,11,12,13);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........",b,c,d) va...
+ ^
+insert into atacc1 ("........pg.dropped.1........",b,c,d) values (default,11,12,13);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........",b,c,d) va...
+ ^
+-- DELETEs
+delete from atacc1 where a = 3;
+ERROR: column "a" does not exist
+LINE 1: delete from atacc1 where a = 3;
+ ^
+delete from atacc1 where "........pg.dropped.1........" = 3;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: delete from atacc1 where "........pg.dropped.1........" = 3;
+ ^
+delete from atacc1;
+-- try dropping a non-existent column, should fail
+alter table atacc1 drop bar;
+ERROR: column "bar" of relation "atacc1" does not exist
+-- try removing an oid column, should succeed (as it's nonexistent)
+alter table atacc1 SET WITHOUT OIDS;
+-- try adding an oid column, should fail (not supported)
+alter table atacc1 SET WITH OIDS;
+ERROR: syntax error at or near "WITH"
+LINE 1: alter table atacc1 SET WITH OIDS;
+ ^
+-- try dropping the xmin column, should fail
+alter table atacc1 drop xmin;
+ERROR: cannot drop system column "xmin"
+-- try creating a view and altering that, should fail
+create view myview as select * from atacc1;
+select * from myview;
+ b | c | d
+---+---+---
+(0 rows)
+
+alter table myview drop d;
+ERROR: "myview" is not a table, composite type, or foreign table
+drop view myview;
+-- test some commands to make sure they fail on the dropped column
+analyze atacc1(a);
+ERROR: column "a" of relation "atacc1" does not exist
+analyze atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+vacuum analyze atacc1(a);
+ERROR: column "a" of relation "atacc1" does not exist
+vacuum analyze atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+comment on column atacc1.a is 'testing';
+ERROR: column "a" of relation "atacc1" does not exist
+comment on column atacc1."........pg.dropped.1........" is 'testing';
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set storage plain;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set storage plain;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set statistics 0;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set statistics 0;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set default 3;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set default 3;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a drop default;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" drop default;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set not null;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set not null;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a drop not null;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" drop not null;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 rename a to x;
+ERROR: column "a" does not exist
+alter table atacc1 rename "........pg.dropped.1........" to x;
+ERROR: column "........pg.dropped.1........" does not exist
+alter table atacc1 add primary key(a);
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 add primary key("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 add unique(a);
+ERROR: column "a" named in key does not exist
+alter table atacc1 add unique("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" named in key does not exist
+alter table atacc1 add check (a > 3);
+ERROR: column "a" does not exist
+alter table atacc1 add check ("........pg.dropped.1........" > 3);
+ERROR: column "........pg.dropped.1........" does not exist
+create table atacc2 (id int4 unique);
+alter table atacc1 add foreign key (a) references atacc2(id);
+ERROR: column "a" referenced in foreign key constraint does not exist
+alter table atacc1 add foreign key ("........pg.dropped.1........") references atacc2(id);
+ERROR: column "........pg.dropped.1........" referenced in foreign key constraint does not exist
+alter table atacc2 add foreign key (id) references atacc1(a);
+ERROR: column "a" referenced in foreign key constraint does not exist
+alter table atacc2 add foreign key (id) references atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" referenced in foreign key constraint does not exist
+drop table atacc2;
+create index "testing_idx" on atacc1(a);
+ERROR: column "a" does not exist
+create index "testing_idx" on atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" does not exist
+-- test create as and select into
+insert into atacc1 values (21, 22, 23);
+create table attest1 as select * from atacc1;
+select * from attest1;
+ b | c | d
+----+----+----
+ 21 | 22 | 23
+(1 row)
+
+drop table attest1;
+select * into attest2 from atacc1;
+select * from attest2;
+ b | c | d
+----+----+----
+ 21 | 22 | 23
+(1 row)
+
+drop table attest2;
+-- try dropping all columns
+alter table atacc1 drop c;
+alter table atacc1 drop d;
+alter table atacc1 drop b;
+select * from atacc1;
+--
+(1 row)
+
+drop table atacc1;
+-- test constraint error reporting in presence of dropped columns
+create table atacc1 (id serial primary key, value int check (value < 10));
+insert into atacc1(value) values (100);
+ERROR: new row for relation "atacc1" violates check constraint "atacc1_value_check"
+DETAIL: Failing row contains (1, 100).
+alter table atacc1 drop column value;
+alter table atacc1 add column value int check (value < 10);
+insert into atacc1(value) values (100);
+ERROR: new row for relation "atacc1" violates check constraint "atacc1_value_check"
+DETAIL: Failing row contains (2, 100).
+insert into atacc1(id, value) values (null, 0);
+ERROR: null value in column "id" violates not-null constraint
+DETAIL: Failing row contains (null, 0).
+drop table atacc1;
+-- test inheritance
+create table parent (a int, b int, c int);
+insert into parent values (1, 2, 3);
+alter table parent drop a;
+create table child (d varchar(255)) inherits (parent);
+insert into child values (12, 13, 'testing');
+select * from parent;
+ b | c
+----+----
+ 2 | 3
+ 12 | 13
+(2 rows)
+
+select * from child;
+ b | c | d
+----+----+---------
+ 12 | 13 | testing
+(1 row)
+
+alter table parent drop c;
+select * from parent;
+ b
+----
+ 2
+ 12
+(2 rows)
+
+select * from child;
+ b | d
+----+---------
+ 12 | testing
+(1 row)
+
+drop table child;
+drop table parent;
+-- check error cases for inheritance column merging
+create table parent (a float8, b numeric(10,4), c text collate "C");
+create table child (a float4) inherits (parent); -- fail
+NOTICE: merging column "a" with inherited definition
+ERROR: column "a" has a type conflict
+DETAIL: double precision versus real
+create table child (b decimal(10,7)) inherits (parent); -- fail
+NOTICE: moving and merging column "b" with inherited definition
+DETAIL: User-specified column moved to the position of the inherited column.
+ERROR: column "b" has a type conflict
+DETAIL: numeric(10,4) versus numeric(10,7)
+create table child (c text collate "POSIX") inherits (parent); -- fail
+NOTICE: moving and merging column "c" with inherited definition
+DETAIL: User-specified column moved to the position of the inherited column.
+ERROR: column "c" has a collation conflict
+DETAIL: "C" versus "POSIX"
+create table child (a double precision, b decimal(10,4)) inherits (parent);
+NOTICE: merging column "a" with inherited definition
+NOTICE: merging column "b" with inherited definition
+drop table child;
+drop table parent;
+-- test copy in/out
+create table attest (a int4, b int4, c int4);
+insert into attest values (1,2,3);
+alter table attest drop a;
+copy attest to stdout;
+2 3
+copy attest(a) to stdout;
+ERROR: column "a" of relation "attest" does not exist
+copy attest("........pg.dropped.1........") to stdout;
+ERROR: column "........pg.dropped.1........" of relation "attest" does not exist
+copy attest from stdin;
+ERROR: extra data after last expected column
+CONTEXT: COPY attest, line 1: "10 11 12"
+select * from attest;
+ b | c
+---+---
+ 2 | 3
+(1 row)
+
+copy attest from stdin;
+select * from attest;
+ b | c
+----+----
+ 2 | 3
+ 21 | 22
+(2 rows)
+
+copy attest(a) from stdin;
+ERROR: column "a" of relation "attest" does not exist
+copy attest("........pg.dropped.1........") from stdin;
+ERROR: column "........pg.dropped.1........" of relation "attest" does not exist
+copy attest(b,c) from stdin;
+select * from attest;
+ b | c
+----+----
+ 2 | 3
+ 21 | 22
+ 31 | 32
+(3 rows)
+
+drop table attest;
+-- test inheritance
+create table dropColumn (a int, b int, e int);
+create table dropColumnChild (c int) inherits (dropColumn);
+create table dropColumnAnother (d int) inherits (dropColumnChild);
+-- these two should fail
+alter table dropColumnchild drop column a;
+ERROR: cannot drop inherited column "a"
+alter table only dropColumnChild drop column b;
+ERROR: cannot drop inherited column "b"
+-- these three should work
+alter table only dropColumn drop column e;
+alter table dropColumnChild drop column c;
+alter table dropColumn drop column a;
+create table renameColumn (a int);
+create table renameColumnChild (b int) inherits (renameColumn);
+create table renameColumnAnother (c int) inherits (renameColumnChild);
+-- these three should fail
+alter table renameColumnChild rename column a to d;
+ERROR: cannot rename inherited column "a"
+alter table only renameColumnChild rename column a to d;
+ERROR: inherited column "a" must be renamed in child tables too
+alter table only renameColumn rename column a to d;
+ERROR: inherited column "a" must be renamed in child tables too
+-- these should work
+alter table renameColumn rename column a to d;
+alter table renameColumnChild rename column b to a;
+-- these should work
+alter table if exists doesnt_exist_tab rename column a to d;
+NOTICE: relation "doesnt_exist_tab" does not exist, skipping
+alter table if exists doesnt_exist_tab rename column b to a;
+NOTICE: relation "doesnt_exist_tab" does not exist, skipping
+-- this should work
+alter table renameColumn add column w int;
+-- this should fail
+alter table only renameColumn add column x int;
+ERROR: column must be added to child tables too
+-- Test corner cases in dropping of inherited columns
+create table p1 (f1 int, f2 int);
+create table c1 (f1 int not null) inherits(p1);
+NOTICE: merging column "f1" with inherited definition
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+-- should work
+alter table p1 drop column f1;
+-- c1.f1 is still there, but no longer inherited
+select f1 from c1;
+ f1
+----
+(0 rows)
+
+alter table c1 drop column f1;
+select f1 from c1;
+ERROR: column "f1" does not exist
+LINE 1: select f1 from c1;
+ ^
+HINT: Perhaps you meant to reference the column "c1.f2".
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1 (f1 int, f2 int);
+create table c1 () inherits(p1);
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+alter table p1 drop column f1;
+-- c1.f1 is dropped now, since there is no local definition for it
+select f1 from c1;
+ERROR: column "f1" does not exist
+LINE 1: select f1 from c1;
+ ^
+HINT: Perhaps you meant to reference the column "c1.f2".
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1 (f1 int, f2 int);
+create table c1 () inherits(p1);
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+alter table only p1 drop column f1;
+-- c1.f1 is NOT dropped, but must now be considered non-inherited
+alter table c1 drop column f1;
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1 (f1 int, f2 int);
+create table c1 (f1 int not null) inherits(p1);
+NOTICE: merging column "f1" with inherited definition
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+alter table only p1 drop column f1;
+-- c1.f1 is still there, but no longer inherited
+alter table c1 drop column f1;
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1(id int, name text);
+create table p2(id2 int, name text, height int);
+create table c1(age int) inherits(p1,p2);
+NOTICE: merging multiple inherited definitions of column "name"
+create table gc1() inherits (c1);
+select relname, attname, attinhcount, attislocal
+from pg_class join pg_attribute on (pg_class.oid = pg_attribute.attrelid)
+where relname in ('p1','p2','c1','gc1') and attnum > 0 and not attisdropped
+order by relname, attnum;
+ relname | attname | attinhcount | attislocal
+---------+---------+-------------+------------
+ c1 | id | 1 | f
+ c1 | name | 2 | f
+ c1 | id2 | 1 | f
+ c1 | height | 1 | f
+ c1 | age | 0 | t
+ gc1 | id | 1 | f
+ gc1 | name | 1 | f
+ gc1 | id2 | 1 | f
+ gc1 | height | 1 | f
+ gc1 | age | 1 | f
+ p1 | id | 0 | t
+ p1 | name | 0 | t
+ p2 | id2 | 0 | t
+ p2 | name | 0 | t
+ p2 | height | 0 | t
+(15 rows)
+
+-- should work
+alter table only p1 drop column name;
+-- should work. Now c1.name is local and inhcount is 0.
+alter table p2 drop column name;
+-- should be rejected since it's inherited
+alter table gc1 drop column name;
+ERROR: cannot drop inherited column "name"
+-- should work, and drop gc1.name along with it
+alter table c1 drop column name;
+-- should fail: column does not exist
+alter table gc1 drop column name;
+ERROR: column "name" of relation "gc1" does not exist
+-- should work and drop the attribute in all tables
+alter table p2 drop column height;
+-- IF EXISTS test
+create table dropColumnExists ();
+alter table dropColumnExists drop column non_existing; --fail
+ERROR: column "non_existing" of relation "dropcolumnexists" does not exist
+alter table dropColumnExists drop column if exists non_existing; --succeed
+NOTICE: column "non_existing" of relation "dropcolumnexists" does not exist, skipping
+select relname, attname, attinhcount, attislocal
+from pg_class join pg_attribute on (pg_class.oid = pg_attribute.attrelid)
+where relname in ('p1','p2','c1','gc1') and attnum > 0 and not attisdropped
+order by relname, attnum;
+ relname | attname | attinhcount | attislocal
+---------+---------+-------------+------------
+ c1 | id | 1 | f
+ c1 | id2 | 1 | f
+ c1 | age | 0 | t
+ gc1 | id | 1 | f
+ gc1 | id2 | 1 | f
+ gc1 | age | 1 | f
+ p1 | id | 0 | t
+ p2 | id2 | 0 | t
+(8 rows)
+
+drop table p1, p2 cascade;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to table c1
+drop cascades to table gc1
+-- test attinhcount tracking with merged columns
+create table depth0();
+create table depth1(c text) inherits (depth0);
+create table depth2() inherits (depth1);
+alter table depth0 add c text;
+NOTICE: merging definition of column "c" for child "depth1"
+select attrelid::regclass, attname, attinhcount, attislocal
+from pg_attribute
+where attnum > 0 and attrelid::regclass in ('depth0', 'depth1', 'depth2')
+order by attrelid::regclass::text, attnum;
+ attrelid | attname | attinhcount | attislocal
+----------+---------+-------------+------------
+ depth0 | c | 0 | t
+ depth1 | c | 1 | t
+ depth2 | c | 1 | f
+(3 rows)
+
+-- test renumbering of child-table columns in inherited operations
+create table p1 (f1 int);
+create table c1 (f2 text, f3 int) inherits (p1);
+alter table p1 add column a1 int check (a1 > 0);
+alter table p1 add column f2 text;
+NOTICE: merging definition of column "f2" for child "c1"
+insert into p1 values (1,2,'abc');
+insert into c1 values(11,'xyz',33,0); -- should fail
+ERROR: new row for relation "c1" violates check constraint "p1_a1_check"
+DETAIL: Failing row contains (11, xyz, 33, 0).
+insert into c1 values(11,'xyz',33,22);
+select * from p1;
+ f1 | a1 | f2
+----+----+-----
+ 1 | 2 | abc
+ 11 | 22 | xyz
+(2 rows)
+
+update p1 set a1 = a1 + 1, f2 = upper(f2);
+select * from p1;
+ f1 | a1 | f2
+----+----+-----
+ 1 | 3 | ABC
+ 11 | 23 | XYZ
+(2 rows)
+
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+-- test that operations with a dropped column do not try to reference
+-- its datatype
+create domain mytype as text;
+create temp table foo (f1 text, f2 mytype, f3 text);
+insert into foo values('bb','cc','dd');
+select * from foo;
+ f1 | f2 | f3
+----+----+----
+ bb | cc | dd
+(1 row)
+
+drop domain mytype cascade;
+NOTICE: drop cascades to column f2 of table foo
+select * from foo;
+ f1 | f3
+----+----
+ bb | dd
+(1 row)
+
+insert into foo values('qq','rr');
+select * from foo;
+ f1 | f3
+----+----
+ bb | dd
+ qq | rr
+(2 rows)
+
+update foo set f3 = 'zz';
+select * from foo;
+ f1 | f3
+----+----
+ bb | zz
+ qq | zz
+(2 rows)
+
+select f3,max(f1) from foo group by f3;
+ f3 | max
+----+-----
+ zz | qq
+(1 row)
+
+-- Simple tests for alter table column type
+alter table foo alter f1 TYPE integer; -- fails
+ERROR: column "f1" cannot be cast automatically to type integer
+HINT: You might need to specify "USING f1::integer".
+alter table foo alter f1 TYPE varchar(10);
+create table anothertab (atcol1 serial8, atcol2 boolean,
+ constraint anothertab_chk check (atcol1 <= 3));
+insert into anothertab (atcol1, atcol2) values (default, true);
+insert into anothertab (atcol1, atcol2) values (default, false);
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------
+ 1 | t
+ 2 | f
+(2 rows)
+
+alter table anothertab alter column atcol1 type boolean; -- fails
+ERROR: column "atcol1" cannot be cast automatically to type boolean
+HINT: You might need to specify "USING atcol1::boolean".
+alter table anothertab alter column atcol1 type boolean using atcol1::int; -- fails
+ERROR: result of USING clause for column "atcol1" cannot be cast automatically to type boolean
+HINT: You might need to add an explicit cast.
+alter table anothertab alter column atcol1 type integer;
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------
+ 1 | t
+ 2 | f
+(2 rows)
+
+insert into anothertab (atcol1, atcol2) values (45, null); -- fails
+ERROR: new row for relation "anothertab" violates check constraint "anothertab_chk"
+DETAIL: Failing row contains (45, null).
+insert into anothertab (atcol1, atcol2) values (default, null);
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------
+ 1 | t
+ 2 | f
+ 3 |
+(3 rows)
+
+alter table anothertab alter column atcol2 type text
+ using case when atcol2 is true then 'IT WAS TRUE'
+ when atcol2 is false then 'IT WAS FALSE'
+ else 'IT WAS NULL!' end;
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------------
+ 1 | IT WAS TRUE
+ 2 | IT WAS FALSE
+ 3 | IT WAS NULL!
+(3 rows)
+
+alter table anothertab alter column atcol1 type boolean
+ using case when atcol1 % 2 = 0 then true else false end; -- fails
+ERROR: default for column "atcol1" cannot be cast automatically to type boolean
+alter table anothertab alter column atcol1 drop default;
+alter table anothertab alter column atcol1 type boolean
+ using case when atcol1 % 2 = 0 then true else false end; -- fails
+ERROR: operator does not exist: boolean <= integer
+HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
+alter table anothertab drop constraint anothertab_chk;
+alter table anothertab drop constraint anothertab_chk; -- fails
+ERROR: constraint "anothertab_chk" of relation "anothertab" does not exist
+alter table anothertab drop constraint IF EXISTS anothertab_chk; -- succeeds
+NOTICE: constraint "anothertab_chk" of relation "anothertab" does not exist, skipping
+alter table anothertab alter column atcol1 type boolean
+ using case when atcol1 % 2 = 0 then true else false end;
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------------
+ f | IT WAS TRUE
+ t | IT WAS FALSE
+ f | IT WAS NULL!
+(3 rows)
+
+drop table anothertab;
+-- Test index handling in alter table column type (cf. bugs #15835, #15865)
+create table anothertab(f1 int primary key, f2 int unique,
+ f3 int, f4 int, f5 int);
+alter table anothertab
+ add exclude using btree (f3 with =);
+alter table anothertab
+ add exclude using btree (f4 with =) where (f4 is not null);
+alter table anothertab
+ add exclude using btree (f4 with =) where (f5 > 0);
+alter table anothertab
+ add unique(f1,f4);
+create index on anothertab(f2,f3);
+create unique index on anothertab(f4);
+\d anothertab
+ Table "public.anothertab"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ f1 | integer | | not null |
+ f2 | integer | | |
+ f3 | integer | | |
+ f4 | integer | | |
+ f5 | integer | | |
+Indexes:
+ "anothertab_pkey" PRIMARY KEY, btree (f1)
+ "anothertab_f1_f4_key" UNIQUE CONSTRAINT, btree (f1, f4)
+ "anothertab_f2_f3_idx" btree (f2, f3)
+ "anothertab_f2_key" UNIQUE CONSTRAINT, btree (f2)
+ "anothertab_f3_excl" EXCLUDE USING btree (f3 WITH =)
+ "anothertab_f4_excl" EXCLUDE USING btree (f4 WITH =) WHERE (f4 IS NOT NULL)
+ "anothertab_f4_excl1" EXCLUDE USING btree (f4 WITH =) WHERE (f5 > 0)
+ "anothertab_f4_idx" UNIQUE, btree (f4)
+
+alter table anothertab alter column f1 type bigint;
+alter table anothertab
+ alter column f2 type bigint,
+ alter column f3 type bigint,
+ alter column f4 type bigint;
+alter table anothertab alter column f5 type bigint;
+\d anothertab
+ Table "public.anothertab"
+ Column | Type | Collation | Nullable | Default
+--------+--------+-----------+----------+---------
+ f1 | bigint | | not null |
+ f2 | bigint | | |
+ f3 | bigint | | |
+ f4 | bigint | | |
+ f5 | bigint | | |
+Indexes:
+ "anothertab_pkey" PRIMARY KEY, btree (f1)
+ "anothertab_f1_f4_key" UNIQUE CONSTRAINT, btree (f1, f4)
+ "anothertab_f2_f3_idx" btree (f2, f3)
+ "anothertab_f2_key" UNIQUE CONSTRAINT, btree (f2)
+ "anothertab_f3_excl" EXCLUDE USING btree (f3 WITH =)
+ "anothertab_f4_excl" EXCLUDE USING btree (f4 WITH =) WHERE (f4 IS NOT NULL)
+ "anothertab_f4_excl1" EXCLUDE USING btree (f4 WITH =) WHERE (f5 > 0)
+ "anothertab_f4_idx" UNIQUE, btree (f4)
+
+drop table anothertab;
+create table another (f1 int, f2 text);
+insert into another values(1, 'one');
+insert into another values(2, 'two');
+insert into another values(3, 'three');
+select * from another;
+ f1 | f2
+----+-------
+ 1 | one
+ 2 | two
+ 3 | three
+(3 rows)
+
+alter table another
+ alter f1 type text using f2 || ' more',
+ alter f2 type bigint using f1 * 10;
+select * from another;
+ f1 | f2
+------------+----
+ one more | 10
+ two more | 20
+ three more | 30
+(3 rows)
+
+drop table another;
+-- table's row type
+create table tab1 (a int, b text);
+create table tab2 (x int, y tab1);
+alter table tab1 alter column b type varchar; -- fails
+ERROR: cannot alter table "tab1" because column "tab2.y" uses its row type
+-- Alter column type that's part of a partitioned index
+create table at_partitioned (a int, b text) partition by range (a);
+create table at_part_1 partition of at_partitioned for values from (0) to (1000);
+insert into at_partitioned values (512, '0.123');
+create table at_part_2 (b text, a int);
+insert into at_part_2 values ('1.234', 1024);
+create index on at_partitioned (b);
+create index on at_partitioned (a);
+\d at_part_1
+ Table "public.at_part_1"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+Partition of: at_partitioned FOR VALUES FROM (0) TO (1000)
+Indexes:
+ "at_part_1_a_idx" btree (a)
+ "at_part_1_b_idx" btree (b)
+
+\d at_part_2
+ Table "public.at_part_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ b | text | | |
+ a | integer | | |
+
+alter table at_partitioned attach partition at_part_2 for values from (1000) to (2000);
+\d at_part_2
+ Table "public.at_part_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ b | text | | |
+ a | integer | | |
+Partition of: at_partitioned FOR VALUES FROM (1000) TO (2000)
+Indexes:
+ "at_part_2_a_idx" btree (a)
+ "at_part_2_b_idx" btree (b)
+
+alter table at_partitioned alter column b type numeric using b::numeric;
+\d at_part_1
+ Table "public.at_part_1"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | numeric | | |
+Partition of: at_partitioned FOR VALUES FROM (0) TO (1000)
+Indexes:
+ "at_part_1_a_idx" btree (a)
+ "at_part_1_b_idx" btree (b)
+
+\d at_part_2
+ Table "public.at_part_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ b | numeric | | |
+ a | integer | | |
+Partition of: at_partitioned FOR VALUES FROM (1000) TO (2000)
+Indexes:
+ "at_part_2_a_idx" btree (a)
+ "at_part_2_b_idx" btree (b)
+
+drop table at_partitioned;
+-- Alter column type when no table rewrite is required
+-- Also check that comments are preserved
+create table at_partitioned(id int, name varchar(64), unique (id, name))
+ partition by hash(id);
+comment on constraint at_partitioned_id_name_key on at_partitioned is 'parent constraint';
+comment on index at_partitioned_id_name_key is 'parent index';
+create table at_partitioned_0 partition of at_partitioned
+ for values with (modulus 2, remainder 0);
+comment on constraint at_partitioned_0_id_name_key on at_partitioned_0 is 'child 0 constraint';
+comment on index at_partitioned_0_id_name_key is 'child 0 index';
+create table at_partitioned_1 partition of at_partitioned
+ for values with (modulus 2, remainder 1);
+comment on constraint at_partitioned_1_id_name_key on at_partitioned_1 is 'child 1 constraint';
+comment on index at_partitioned_1_id_name_key is 'child 1 index';
+insert into at_partitioned values(1, 'foo');
+insert into at_partitioned values(3, 'bar');
+create temp table old_oids as
+ select relname, oid as oldoid, relfilenode as oldfilenode
+ from pg_class where relname like 'at_partitioned%';
+select relname,
+ c.oid = oldoid as orig_oid,
+ case relfilenode
+ when 0 then 'none'
+ when c.oid then 'own'
+ when oldfilenode then 'orig'
+ else 'OTHER'
+ end as storage,
+ obj_description(c.oid, 'pg_class') as desc
+ from pg_class c left join old_oids using (relname)
+ where relname like 'at_partitioned%'
+ order by relname;
+ relname | orig_oid | storage | desc
+------------------------------+----------+---------+---------------
+ at_partitioned | t | none |
+ at_partitioned_0 | t | own |
+ at_partitioned_0_id_name_key | t | own | child 0 index
+ at_partitioned_1 | t | own |
+ at_partitioned_1_id_name_key | t | own | child 1 index
+ at_partitioned_id_name_key | t | none | parent index
+(6 rows)
+
+select conname, obj_description(oid, 'pg_constraint') as desc
+ from pg_constraint where conname like 'at_partitioned%'
+ order by conname;
+ conname | desc
+------------------------------+--------------------
+ at_partitioned_0_id_name_key | child 0 constraint
+ at_partitioned_1_id_name_key | child 1 constraint
+ at_partitioned_id_name_key | parent constraint
+(3 rows)
+
+alter table at_partitioned alter column name type varchar(127);
+-- Note: these tests currently show the wrong behavior for comments :-(
+select relname,
+ c.oid = oldoid as orig_oid,
+ case relfilenode
+ when 0 then 'none'
+ when c.oid then 'own'
+ when oldfilenode then 'orig'
+ else 'OTHER'
+ end as storage,
+ obj_description(c.oid, 'pg_class') as desc
+ from pg_class c left join old_oids using (relname)
+ where relname like 'at_partitioned%'
+ order by relname;
+ relname | orig_oid | storage | desc
+------------------------------+----------+---------+--------------
+ at_partitioned | t | none |
+ at_partitioned_0 | t | own |
+ at_partitioned_0_id_name_key | f | own | parent index
+ at_partitioned_1 | t | own |
+ at_partitioned_1_id_name_key | f | own | parent index
+ at_partitioned_id_name_key | f | none | parent index
+(6 rows)
+
+select conname, obj_description(oid, 'pg_constraint') as desc
+ from pg_constraint where conname like 'at_partitioned%'
+ order by conname;
+ conname | desc
+------------------------------+-------------------
+ at_partitioned_0_id_name_key |
+ at_partitioned_1_id_name_key |
+ at_partitioned_id_name_key | parent constraint
+(3 rows)
+
+-- Don't remove this DROP, it exposes bug #15672
+drop table at_partitioned;
+-- disallow recursive containment of row types
+create temp table recur1 (f1 int);
+alter table recur1 add column f2 recur1; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+alter table recur1 add column f2 recur1[]; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+create domain array_of_recur1 as recur1[];
+alter table recur1 add column f2 array_of_recur1; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+create temp table recur2 (f1 int, f2 recur1);
+alter table recur1 add column f2 recur2; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+alter table recur1 add column f2 int;
+alter table recur1 alter column f2 type recur2; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+-- SET STORAGE may need to add a TOAST table
+create table test_storage (a text);
+alter table test_storage alter a set storage plain;
+alter table test_storage add b int default 0; -- rewrite table to remove its TOAST table
+alter table test_storage alter a set storage extended; -- re-add TOAST table
+select reltoastrelid <> 0 as has_toast_table
+from pg_class
+where oid = 'test_storage'::regclass;
+ has_toast_table
+-----------------
+ f
+(1 row)
+
+-- ALTER COLUMN TYPE with a check constraint and a child table (bug #13779)
+CREATE TABLE test_inh_check (a float check (a > 10.2), b float);
+CREATE TABLE test_inh_check_child() INHERITS(test_inh_check);
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | double precision | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | double precision | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(2 rows)
+
+ALTER TABLE test_inh_check ALTER COLUMN a TYPE numeric;
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(2 rows)
+
+-- also try noinherit, local, and local+inherited cases
+ALTER TABLE test_inh_check ADD CONSTRAINT bnoinherit CHECK (b > 100) NO INHERIT;
+ALTER TABLE test_inh_check_child ADD CONSTRAINT blocal CHECK (b < 1000);
+ALTER TABLE test_inh_check_child ADD CONSTRAINT bmerged CHECK (b > 1);
+ALTER TABLE test_inh_check ADD CONSTRAINT bmerged CHECK (b > 1);
+NOTICE: merging constraint "bmerged" with inherited definition
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "bmerged" CHECK (b > 1::double precision)
+ "bnoinherit" CHECK (b > 100::double precision) NO INHERIT
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "blocal" CHECK (b < 1000::double precision)
+ "bmerged" CHECK (b > 1::double precision)
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | bmerged | 0 | t | f
+ test_inh_check | bnoinherit | 0 | t | t
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | blocal | 0 | t | f
+ test_inh_check_child | bmerged | 1 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(6 rows)
+
+ALTER TABLE test_inh_check ALTER COLUMN b TYPE numeric;
+NOTICE: merging constraint "bmerged" with inherited definition
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | numeric | | |
+ b | numeric | | |
+Check constraints:
+ "bmerged" CHECK (b::double precision > 1::double precision)
+ "bnoinherit" CHECK (b::double precision > 100::double precision) NO INHERIT
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | numeric | | |
+ b | numeric | | |
+Check constraints:
+ "blocal" CHECK (b::double precision < 1000::double precision)
+ "bmerged" CHECK (b::double precision > 1::double precision)
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | bmerged | 0 | t | f
+ test_inh_check | bnoinherit | 0 | t | t
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | blocal | 0 | t | f
+ test_inh_check_child | bmerged | 1 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(6 rows)
+
+-- ALTER COLUMN TYPE with different schema in children
+-- Bug at https://postgr.es/m/20170102225618.GA10071@telsasoft.com
+CREATE TABLE test_type_diff (f1 int);
+CREATE TABLE test_type_diff_c (extra smallint) INHERITS (test_type_diff);
+ALTER TABLE test_type_diff ADD COLUMN f2 int;
+INSERT INTO test_type_diff_c VALUES (1, 2, 3);
+ALTER TABLE test_type_diff ALTER COLUMN f2 TYPE bigint USING f2::bigint;
+CREATE TABLE test_type_diff2 (int_two int2, int_four int4, int_eight int8);
+CREATE TABLE test_type_diff2_c1 (int_four int4, int_eight int8, int_two int2);
+CREATE TABLE test_type_diff2_c2 (int_eight int8, int_two int2, int_four int4);
+CREATE TABLE test_type_diff2_c3 (int_two int2, int_four int4, int_eight int8);
+ALTER TABLE test_type_diff2_c1 INHERIT test_type_diff2;
+ALTER TABLE test_type_diff2_c2 INHERIT test_type_diff2;
+ALTER TABLE test_type_diff2_c3 INHERIT test_type_diff2;
+INSERT INTO test_type_diff2_c1 VALUES (1, 2, 3);
+INSERT INTO test_type_diff2_c2 VALUES (4, 5, 6);
+INSERT INTO test_type_diff2_c3 VALUES (7, 8, 9);
+ALTER TABLE test_type_diff2 ALTER COLUMN int_four TYPE int8 USING int_four::int8;
+-- whole-row references are disallowed
+ALTER TABLE test_type_diff2 ALTER COLUMN int_four TYPE int4 USING (pg_column_size(test_type_diff2));
+ERROR: cannot convert whole-row table reference
+DETAIL: USING expression contains a whole-row table reference.
+-- check for rollback of ANALYZE corrupting table property flags (bug #11638)
+CREATE TABLE check_fk_presence_1 (id int PRIMARY KEY, t text);
+CREATE TABLE check_fk_presence_2 (id int REFERENCES check_fk_presence_1, t text);
+BEGIN;
+ALTER TABLE check_fk_presence_2 DROP CONSTRAINT check_fk_presence_2_id_fkey;
+ANALYZE check_fk_presence_2;
+ROLLBACK;
+\d check_fk_presence_2
+ Table "public.check_fk_presence_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ id | integer | | |
+ t | text | | |
+Foreign-key constraints:
+ "check_fk_presence_2_id_fkey" FOREIGN KEY (id) REFERENCES check_fk_presence_1(id)
+
+DROP TABLE check_fk_presence_1, check_fk_presence_2;
+-- check column addition within a view (bug #14876)
+create table at_base_table(id int, stuff text);
+insert into at_base_table values (23, 'skidoo');
+create view at_view_1 as select * from at_base_table bt;
+create view at_view_2 as select *, to_json(v1) as j from at_view_1 v1;
+\d+ at_view_1
+ View "public.at_view_1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+View definition:
+ SELECT bt.id,
+ bt.stuff
+ FROM at_base_table bt;
+
+\d+ at_view_2
+ View "public.at_view_2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+ j | json | | | | extended |
+View definition:
+ SELECT v1.id,
+ v1.stuff,
+ to_json(v1.*) AS j
+ FROM at_view_1 v1;
+
+explain (verbose, costs off) select * from at_view_2;
+ QUERY PLAN
+----------------------------------------------------------
+ Seq Scan on public.at_base_table bt
+ Output: bt.id, bt.stuff, to_json(ROW(bt.id, bt.stuff))
+(2 rows)
+
+select * from at_view_2;
+ id | stuff | j
+----+--------+----------------------------
+ 23 | skidoo | {"id":23,"stuff":"skidoo"}
+(1 row)
+
+create or replace view at_view_1 as select *, 2+2 as more from at_base_table bt;
+\d+ at_view_1
+ View "public.at_view_1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+ more | integer | | | | plain |
+View definition:
+ SELECT bt.id,
+ bt.stuff,
+ 2 + 2 AS more
+ FROM at_base_table bt;
+
+\d+ at_view_2
+ View "public.at_view_2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+ j | json | | | | extended |
+View definition:
+ SELECT v1.id,
+ v1.stuff,
+ to_json(v1.*) AS j
+ FROM at_view_1 v1;
+
+explain (verbose, costs off) select * from at_view_2;
+ QUERY PLAN
+----------------------------------------------------------------
+ Seq Scan on public.at_base_table bt
+ Output: bt.id, bt.stuff, to_json(ROW(bt.id, bt.stuff, NULL))
+(2 rows)
+
+select * from at_view_2;
+ id | stuff | j
+----+--------+----------------------------------------
+ 23 | skidoo | {"id":23,"stuff":"skidoo","more":null}
+(1 row)
+
+drop view at_view_2;
+drop view at_view_1;
+drop table at_base_table;
+--
+-- lock levels
+--
+drop type lockmodes;
+ERROR: type "lockmodes" does not exist
+create type lockmodes as enum (
+ 'SIReadLock'
+,'AccessShareLock'
+,'RowShareLock'
+,'RowExclusiveLock'
+,'ShareUpdateExclusiveLock'
+,'ShareLock'
+,'ShareRowExclusiveLock'
+,'ExclusiveLock'
+,'AccessExclusiveLock'
+);
+drop view my_locks;
+ERROR: view "my_locks" does not exist
+create or replace view my_locks as
+select case when c.relname like 'pg_toast%' then 'pg_toast' else c.relname end, max(mode::lockmodes) as max_lockmode
+from pg_locks l join pg_class c on l.relation = c.oid
+where virtualtransaction = (
+ select virtualtransaction
+ from pg_locks
+ where transactionid = txid_current()::integer)
+and locktype = 'relation'
+and relnamespace != (select oid from pg_namespace where nspname = 'pg_catalog')
+and c.relname != 'my_locks'
+group by c.relname;
+create table alterlock (f1 int primary key, f2 text);
+insert into alterlock values (1, 'foo');
+create table alterlock2 (f3 int primary key, f1 int);
+insert into alterlock2 values (1, 1);
+begin; alter table alterlock alter column f2 set statistics 150;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+rollback;
+begin; alter table alterlock cluster on alterlock_pkey;
+select * from my_locks order by 1;
+ relname | max_lockmode
+----------------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+ alterlock_pkey | ShareUpdateExclusiveLock
+(2 rows)
+
+commit;
+begin; alter table alterlock set without cluster;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock set (fillfactor = 100);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock reset (fillfactor);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock set (toast.autovacuum_enabled = off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock set (autovacuum_enabled = off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock alter column f2 set (n_distinct = 1);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+rollback;
+-- test that mixing options with different lock levels works as expected
+begin; alter table alterlock set (autovacuum_enabled = off, fillfactor = 80);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock alter column f2 set storage extended;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+---------------------
+ alterlock | AccessExclusiveLock
+(1 row)
+
+rollback;
+begin; alter table alterlock alter column f2 set default 'x';
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+---------------------
+ alterlock | AccessExclusiveLock
+(1 row)
+
+rollback;
+begin;
+create trigger ttdummy
+ before delete or update on alterlock
+ for each row
+ execute procedure
+ ttdummy (1, 1);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+-----------------------
+ alterlock | ShareRowExclusiveLock
+(1 row)
+
+rollback;
+begin;
+select * from my_locks order by 1;
+ relname | max_lockmode
+---------+--------------
+(0 rows)
+
+alter table alterlock2 add foreign key (f1) references alterlock (f1);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------------+-----------------------
+ alterlock | ShareRowExclusiveLock
+ alterlock2 | ShareRowExclusiveLock
+ alterlock2_pkey | AccessShareLock
+ alterlock_pkey | AccessShareLock
+(4 rows)
+
+rollback;
+begin;
+alter table alterlock2
+add constraint alterlock2nv foreign key (f1) references alterlock (f1) NOT VALID;
+select * from my_locks order by 1;
+ relname | max_lockmode
+------------+-----------------------
+ alterlock | ShareRowExclusiveLock
+ alterlock2 | ShareRowExclusiveLock
+(2 rows)
+
+commit;
+begin;
+alter table alterlock2 validate constraint alterlock2nv;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------------+--------------------------
+ alterlock | RowShareLock
+ alterlock2 | ShareUpdateExclusiveLock
+ alterlock2_pkey | AccessShareLock
+ alterlock_pkey | AccessShareLock
+(4 rows)
+
+rollback;
+create or replace view my_locks as
+select case when c.relname like 'pg_toast%' then 'pg_toast' else c.relname end, max(mode::lockmodes) as max_lockmode
+from pg_locks l join pg_class c on l.relation = c.oid
+where virtualtransaction = (
+ select virtualtransaction
+ from pg_locks
+ where transactionid = txid_current()::integer)
+and locktype = 'relation'
+and relnamespace != (select oid from pg_namespace where nspname = 'pg_catalog')
+and c.relname = 'my_locks'
+group by c.relname;
+-- raise exception
+alter table my_locks set (autovacuum_enabled = false);
+ERROR: unrecognized parameter "autovacuum_enabled"
+alter view my_locks set (autovacuum_enabled = false);
+ERROR: unrecognized parameter "autovacuum_enabled"
+alter table my_locks reset (autovacuum_enabled);
+alter view my_locks reset (autovacuum_enabled);
+begin;
+alter view my_locks set (security_barrier=off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+----------+---------------------
+ my_locks | AccessExclusiveLock
+(1 row)
+
+alter view my_locks reset (security_barrier);
+rollback;
+-- this test intentionally applies the ALTER TABLE command against a view, but
+-- uses a view option so we expect this to succeed. This form of SQL is
+-- accepted for historical reasons, as shown in the docs for ALTER VIEW
+begin;
+alter table my_locks set (security_barrier=off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+----------+---------------------
+ my_locks | AccessExclusiveLock
+(1 row)
+
+alter table my_locks reset (security_barrier);
+rollback;
+-- cleanup
+drop table alterlock2;
+drop table alterlock;
+drop view my_locks;
+drop type lockmodes;
+--
+-- alter function
+--
+create function test_strict(text) returns text as
+ 'select coalesce($1, ''got passed a null'');'
+ language sql returns null on null input;
+select test_strict(NULL);
+ test_strict
+-------------
+
+(1 row)
+
+alter function test_strict(text) called on null input;
+select test_strict(NULL);
+ test_strict
+-------------------
+ got passed a null
+(1 row)
+
+create function non_strict(text) returns text as
+ 'select coalesce($1, ''got passed a null'');'
+ language sql called on null input;
+select non_strict(NULL);
+ non_strict
+-------------------
+ got passed a null
+(1 row)
+
+alter function non_strict(text) returns null on null input;
+select non_strict(NULL);
+ non_strict
+------------
+
+(1 row)
+
+--
+-- alter object set schema
+--
+create schema alter1;
+create schema alter2;
+create table alter1.t1(f1 serial primary key, f2 int check (f2 > 0));
+create view alter1.v1 as select * from alter1.t1;
+create function alter1.plus1(int) returns int as 'select $1+1' language sql;
+create domain alter1.posint integer check (value > 0);
+create type alter1.ctype as (f1 int, f2 text);
+create function alter1.same(alter1.ctype, alter1.ctype) returns boolean language sql
+as 'select $1.f1 is not distinct from $2.f1 and $1.f2 is not distinct from $2.f2';
+create operator alter1.=(procedure = alter1.same, leftarg = alter1.ctype, rightarg = alter1.ctype);
+create operator class alter1.ctype_hash_ops default for type alter1.ctype using hash as
+ operator 1 alter1.=(alter1.ctype, alter1.ctype);
+create conversion alter1.latin1_to_utf8 for 'latin1' to 'utf8' from iso8859_1_to_utf8;
+create text search parser alter1.prs(start = prsd_start, gettoken = prsd_nexttoken, end = prsd_end, lextypes = prsd_lextype);
+create text search configuration alter1.cfg(parser = alter1.prs);
+create text search template alter1.tmpl(init = dsimple_init, lexize = dsimple_lexize);
+create text search dictionary alter1.dict(template = alter1.tmpl);
+insert into alter1.t1(f2) values(11);
+insert into alter1.t1(f2) values(12);
+alter table alter1.t1 set schema alter1; -- no-op, same schema
+alter table alter1.t1 set schema alter2;
+alter table alter1.v1 set schema alter2;
+alter function alter1.plus1(int) set schema alter2;
+alter domain alter1.posint set schema alter2;
+alter operator class alter1.ctype_hash_ops using hash set schema alter2;
+alter operator family alter1.ctype_hash_ops using hash set schema alter2;
+alter operator alter1.=(alter1.ctype, alter1.ctype) set schema alter2;
+alter function alter1.same(alter1.ctype, alter1.ctype) set schema alter2;
+alter type alter1.ctype set schema alter1; -- no-op, same schema
+alter type alter1.ctype set schema alter2;
+alter conversion alter1.latin1_to_utf8 set schema alter2;
+alter text search parser alter1.prs set schema alter2;
+alter text search configuration alter1.cfg set schema alter2;
+alter text search template alter1.tmpl set schema alter2;
+alter text search dictionary alter1.dict set schema alter2;
+-- this should succeed because nothing is left in alter1
+drop schema alter1;
+insert into alter2.t1(f2) values(13);
+insert into alter2.t1(f2) values(14);
+select * from alter2.t1;
+ f1 | f2
+----+----
+ 1 | 11
+ 2 | 12
+ 3 | 13
+ 4 | 14
+(4 rows)
+
+select * from alter2.v1;
+ f1 | f2
+----+----
+ 1 | 11
+ 2 | 12
+ 3 | 13
+ 4 | 14
+(4 rows)
+
+select alter2.plus1(41);
+ plus1
+-------
+ 42
+(1 row)
+
+-- clean up
+drop schema alter2 cascade;
+NOTICE: drop cascades to 13 other objects
+DETAIL: drop cascades to table alter2.t1
+drop cascades to view alter2.v1
+drop cascades to function alter2.plus1(integer)
+drop cascades to type alter2.posint
+drop cascades to type alter2.ctype
+drop cascades to function alter2.same(alter2.ctype,alter2.ctype)
+drop cascades to operator alter2.=(alter2.ctype,alter2.ctype)
+drop cascades to operator family alter2.ctype_hash_ops for access method hash
+drop cascades to conversion alter2.latin1_to_utf8
+drop cascades to text search parser alter2.prs
+drop cascades to text search configuration alter2.cfg
+drop cascades to text search template alter2.tmpl
+drop cascades to text search dictionary alter2.dict
+--
+-- composite types
+--
+CREATE TYPE test_type AS (a int);
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+
+ALTER TYPE nosuchtype ADD ATTRIBUTE b text; -- fails
+ERROR: relation "nosuchtype" does not exist
+ALTER TYPE test_type ADD ATTRIBUTE b text;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+
+ALTER TYPE test_type ADD ATTRIBUTE b text; -- fails
+ERROR: column "b" of relation "test_type" already exists
+ALTER TYPE test_type ALTER ATTRIBUTE b SET DATA TYPE varchar;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+-------------------+-----------+----------+---------
+ a | integer | | |
+ b | character varying | | |
+
+ALTER TYPE test_type ALTER ATTRIBUTE b SET DATA TYPE integer;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+
+ALTER TYPE test_type DROP ATTRIBUTE b;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+
+ALTER TYPE test_type DROP ATTRIBUTE c; -- fails
+ERROR: column "c" of relation "test_type" does not exist
+ALTER TYPE test_type DROP ATTRIBUTE IF EXISTS c;
+NOTICE: column "c" of relation "test_type" does not exist, skipping
+ALTER TYPE test_type DROP ATTRIBUTE a, ADD ATTRIBUTE d boolean;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ d | boolean | | |
+
+ALTER TYPE test_type RENAME ATTRIBUTE a TO aa;
+ERROR: column "a" does not exist
+ALTER TYPE test_type RENAME ATTRIBUTE d TO dd;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ dd | boolean | | |
+
+DROP TYPE test_type;
+CREATE TYPE test_type1 AS (a int, b text);
+CREATE TABLE test_tbl1 (x int, y test_type1);
+ALTER TYPE test_type1 ALTER ATTRIBUTE b TYPE varchar; -- fails
+ERROR: cannot alter type "test_type1" because column "test_tbl1.y" uses it
+CREATE TYPE test_type2 AS (a int, b text);
+CREATE TABLE test_tbl2 OF test_type2;
+CREATE TABLE test_tbl2_subclass () INHERITS (test_tbl2);
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 ADD ATTRIBUTE c text; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 ADD ATTRIBUTE c text CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 ALTER ATTRIBUTE b TYPE varchar; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 ALTER ATTRIBUTE b TYPE varchar CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+-------------------+-----------+----------+---------
+ a | integer | | |
+ b | character varying | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+-------------------+-----------+----------+---------
+ a | integer | | |
+ b | character varying | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 DROP ATTRIBUTE b; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 DROP ATTRIBUTE b CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 RENAME ATTRIBUTE a TO aa; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 RENAME ATTRIBUTE a TO aa CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ aa | integer | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ aa | integer | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+\d test_tbl2_subclass
+ Table "public.test_tbl2_subclass"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ aa | integer | | |
+ c | text | | |
+Inherits: test_tbl2
+
+DROP TABLE test_tbl2_subclass;
+CREATE TYPE test_typex AS (a int, b text);
+CREATE TABLE test_tblx (x int, y test_typex check ((y).a > 0));
+ALTER TYPE test_typex DROP ATTRIBUTE a; -- fails
+ERROR: cannot drop column a of composite type test_typex because other objects depend on it
+DETAIL: constraint test_tblx_y_check on table test_tblx depends on column a of composite type test_typex
+HINT: Use DROP ... CASCADE to drop the dependent objects too.
+ALTER TYPE test_typex DROP ATTRIBUTE a CASCADE;
+NOTICE: drop cascades to constraint test_tblx_y_check on table test_tblx
+\d test_tblx
+ Table "public.test_tblx"
+ Column | Type | Collation | Nullable | Default
+--------+------------+-----------+----------+---------
+ x | integer | | |
+ y | test_typex | | |
+
+DROP TABLE test_tblx;
+DROP TYPE test_typex;
+-- This test isn't that interesting on its own, but the purpose is to leave
+-- behind a table to test pg_upgrade with. The table has a composite type
+-- column in it, and the composite type has a dropped attribute.
+CREATE TYPE test_type3 AS (a int);
+CREATE TABLE test_tbl3 (c) AS SELECT '(1)'::test_type3;
+ALTER TYPE test_type3 DROP ATTRIBUTE a, ADD ATTRIBUTE b int;
+CREATE TYPE test_type_empty AS ();
+DROP TYPE test_type_empty;
+--
+-- typed tables: OF / NOT OF
+--
+CREATE TYPE tt_t0 AS (z inet, x int, y numeric(8,2));
+ALTER TYPE tt_t0 DROP ATTRIBUTE z;
+CREATE TABLE tt0 (x int NOT NULL, y numeric(8,2)); -- OK
+CREATE TABLE tt1 (x int, y bigint); -- wrong base type
+CREATE TABLE tt2 (x int, y numeric(9,2)); -- wrong typmod
+CREATE TABLE tt3 (y numeric(8,2), x int); -- wrong column order
+CREATE TABLE tt4 (x int); -- too few columns
+CREATE TABLE tt5 (x int, y numeric(8,2), z int); -- too few columns
+CREATE TABLE tt6 () INHERITS (tt0); -- can't have a parent
+CREATE TABLE tt7 (x int, q text, y numeric(8,2));
+ALTER TABLE tt7 DROP q; -- OK
+ALTER TABLE tt0 OF tt_t0;
+ALTER TABLE tt1 OF tt_t0;
+ERROR: table "tt1" has different type for column "y"
+ALTER TABLE tt2 OF tt_t0;
+ERROR: table "tt2" has different type for column "y"
+ALTER TABLE tt3 OF tt_t0;
+ERROR: table has column "y" where type requires "x"
+ALTER TABLE tt4 OF tt_t0;
+ERROR: table is missing column "y"
+ALTER TABLE tt5 OF tt_t0;
+ERROR: table has extra column "z"
+ALTER TABLE tt6 OF tt_t0;
+ERROR: typed tables cannot inherit
+ALTER TABLE tt7 OF tt_t0;
+CREATE TYPE tt_t1 AS (x int, y numeric(8,2));
+ALTER TABLE tt7 OF tt_t1; -- reassign an already-typed table
+ALTER TABLE tt7 NOT OF;
+\d tt7
+ Table "public.tt7"
+ Column | Type | Collation | Nullable | Default
+--------+--------------+-----------+----------+---------
+ x | integer | | |
+ y | numeric(8,2) | | |
+
+-- make sure we can drop a constraint on the parent but it remains on the child
+CREATE TABLE test_drop_constr_parent (c text CHECK (c IS NOT NULL));
+CREATE TABLE test_drop_constr_child () INHERITS (test_drop_constr_parent);
+ALTER TABLE ONLY test_drop_constr_parent DROP CONSTRAINT "test_drop_constr_parent_c_check";
+-- should fail
+INSERT INTO test_drop_constr_child (c) VALUES (NULL);
+ERROR: new row for relation "test_drop_constr_child" violates check constraint "test_drop_constr_parent_c_check"
+DETAIL: Failing row contains (null).
+DROP TABLE test_drop_constr_parent CASCADE;
+NOTICE: drop cascades to table test_drop_constr_child
+--
+-- IF EXISTS test
+--
+ALTER TABLE IF EXISTS tt8 ADD COLUMN f int;
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 ADD CONSTRAINT xxx PRIMARY KEY(f);
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 ADD CHECK (f BETWEEN 0 AND 10);
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 ALTER COLUMN f SET DEFAULT 0;
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 RENAME COLUMN f TO f1;
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 SET SCHEMA alter2;
+NOTICE: relation "tt8" does not exist, skipping
+CREATE TABLE tt8(a int);
+CREATE SCHEMA alter2;
+ALTER TABLE IF EXISTS tt8 ADD COLUMN f int;
+ALTER TABLE IF EXISTS tt8 ADD CONSTRAINT xxx PRIMARY KEY(f);
+ALTER TABLE IF EXISTS tt8 ADD CHECK (f BETWEEN 0 AND 10);
+ALTER TABLE IF EXISTS tt8 ALTER COLUMN f SET DEFAULT 0;
+ALTER TABLE IF EXISTS tt8 RENAME COLUMN f TO f1;
+ALTER TABLE IF EXISTS tt8 SET SCHEMA alter2;
+\d alter2.tt8
+ Table "alter2.tt8"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ f1 | integer | | not null | 0
+Indexes:
+ "xxx" PRIMARY KEY, btree (f1)
+Check constraints:
+ "tt8_f_check" CHECK (f1 >= 0 AND f1 <= 10)
+
+DROP TABLE alter2.tt8;
+DROP SCHEMA alter2;
+--
+-- Check conflicts between index and CHECK constraint names
+--
+CREATE TABLE tt9(c integer);
+ALTER TABLE tt9 ADD CHECK(c > 1);
+ALTER TABLE tt9 ADD CHECK(c > 2); -- picks nonconflicting name
+ALTER TABLE tt9 ADD CONSTRAINT foo CHECK(c > 3);
+ALTER TABLE tt9 ADD CONSTRAINT foo CHECK(c > 4); -- fail, dup name
+ERROR: constraint "foo" for relation "tt9" already exists
+ALTER TABLE tt9 ADD UNIQUE(c);
+ALTER TABLE tt9 ADD UNIQUE(c); -- picks nonconflicting name
+ALTER TABLE tt9 ADD CONSTRAINT tt9_c_key UNIQUE(c); -- fail, dup name
+ERROR: relation "tt9_c_key" already exists
+ALTER TABLE tt9 ADD CONSTRAINT foo UNIQUE(c); -- fail, dup name
+ERROR: constraint "foo" for relation "tt9" already exists
+ALTER TABLE tt9 ADD CONSTRAINT tt9_c_key CHECK(c > 5); -- fail, dup name
+ERROR: constraint "tt9_c_key" for relation "tt9" already exists
+ALTER TABLE tt9 ADD CONSTRAINT tt9_c_key2 CHECK(c > 6);
+ALTER TABLE tt9 ADD UNIQUE(c); -- picks nonconflicting name
+\d tt9
+ Table "public.tt9"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c | integer | | |
+Indexes:
+ "tt9_c_key" UNIQUE CONSTRAINT, btree (c)
+ "tt9_c_key1" UNIQUE CONSTRAINT, btree (c)
+ "tt9_c_key3" UNIQUE CONSTRAINT, btree (c)
+Check constraints:
+ "foo" CHECK (c > 3)
+ "tt9_c_check" CHECK (c > 1)
+ "tt9_c_check1" CHECK (c > 2)
+ "tt9_c_key2" CHECK (c > 6)
+
+DROP TABLE tt9;
+-- Check that comments on constraints and indexes are not lost at ALTER TABLE.
+CREATE TABLE comment_test (
+ id int,
+ positive_col int CHECK (positive_col > 0),
+ indexed_col int,
+ CONSTRAINT comment_test_pk PRIMARY KEY (id));
+CREATE INDEX comment_test_index ON comment_test(indexed_col);
+COMMENT ON COLUMN comment_test.id IS 'Column ''id'' on comment_test';
+COMMENT ON INDEX comment_test_index IS 'Simple index on comment_test';
+COMMENT ON CONSTRAINT comment_test_positive_col_check ON comment_test IS 'CHECK constraint on comment_test.positive_col';
+COMMENT ON CONSTRAINT comment_test_pk ON comment_test IS 'PRIMARY KEY constraint of comment_test';
+COMMENT ON INDEX comment_test_pk IS 'Index backing the PRIMARY KEY of comment_test';
+SELECT col_description('comment_test'::regclass, 1) as comment;
+ comment
+-----------------------------
+ Column 'id' on comment_test
+(1 row)
+
+SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ index | comment
+--------------------+-----------------------------------------------
+ comment_test_index | Simple index on comment_test
+ comment_test_pk | Index backing the PRIMARY KEY of comment_test
+(2 rows)
+
+SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ constraint | comment
+---------------------------------+-----------------------------------------------
+ comment_test_pk | PRIMARY KEY constraint of comment_test
+ comment_test_positive_col_check | CHECK constraint on comment_test.positive_col
+(2 rows)
+
+-- Change the datatype of all the columns. ALTER TABLE is optimized to not
+-- rebuild an index if the new data type is binary compatible with the old
+-- one. Do a dummy ALTER TABLE that doesn't change the datatype first,
+-- to test that no-op codepath, and then another one that does.
+ALTER TABLE comment_test ALTER COLUMN indexed_col SET DATA TYPE int;
+ALTER TABLE comment_test ALTER COLUMN indexed_col SET DATA TYPE text;
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE int;
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE text;
+ALTER TABLE comment_test ALTER COLUMN positive_col SET DATA TYPE int;
+ALTER TABLE comment_test ALTER COLUMN positive_col SET DATA TYPE bigint;
+-- Check that the comments are intact.
+SELECT col_description('comment_test'::regclass, 1) as comment;
+ comment
+-----------------------------
+ Column 'id' on comment_test
+(1 row)
+
+SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ index | comment
+--------------------+-----------------------------------------------
+ comment_test_index | Simple index on comment_test
+ comment_test_pk | Index backing the PRIMARY KEY of comment_test
+(2 rows)
+
+SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ constraint | comment
+---------------------------------+-----------------------------------------------
+ comment_test_pk | PRIMARY KEY constraint of comment_test
+ comment_test_positive_col_check | CHECK constraint on comment_test.positive_col
+(2 rows)
+
+-- Check compatibility for foreign keys and comments. This is done
+-- separately as rebuilding the column type of the parent leads
+-- to an error and would reduce the test scope.
+CREATE TABLE comment_test_child (
+ id text CONSTRAINT comment_test_child_fk REFERENCES comment_test);
+CREATE INDEX comment_test_child_fk ON comment_test_child(id);
+COMMENT ON COLUMN comment_test_child.id IS 'Column ''id'' on comment_test_child';
+COMMENT ON INDEX comment_test_child_fk IS 'Index backing the FOREIGN KEY of comment_test_child';
+COMMENT ON CONSTRAINT comment_test_child_fk ON comment_test_child IS 'FOREIGN KEY constraint of comment_test_child';
+-- Change column type of parent
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE text;
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE int USING id::integer;
+ERROR: foreign key constraint "comment_test_child_fk" cannot be implemented
+DETAIL: Key columns "id" and "id" are of incompatible types: text and integer.
+-- Comments should be intact
+SELECT col_description('comment_test_child'::regclass, 1) as comment;
+ comment
+-----------------------------------
+ Column 'id' on comment_test_child
+(1 row)
+
+SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test_child'::regclass ORDER BY 1, 2;
+ index | comment
+-----------------------+-----------------------------------------------------
+ comment_test_child_fk | Index backing the FOREIGN KEY of comment_test_child
+(1 row)
+
+SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test_child'::regclass ORDER BY 1, 2;
+ constraint | comment
+-----------------------+----------------------------------------------
+ comment_test_child_fk | FOREIGN KEY constraint of comment_test_child
+(1 row)
+
+-- Check that we map relation oids to filenodes and back correctly. Only
+-- display bad mappings so the test output doesn't change all the time. A
+-- filenode function call can return NULL for a relation dropped concurrently
+-- with the call's surrounding query, so ignore a NULL mapped_oid for
+-- relations that no longer exist after all calls finish.
+CREATE TEMP TABLE filenode_mapping AS
+SELECT
+ oid, mapped_oid, reltablespace, relfilenode, relname
+FROM pg_class,
+ pg_filenode_relation(reltablespace, pg_relation_filenode(oid)) AS mapped_oid
+WHERE relkind IN ('r', 'i', 'S', 't', 'm') AND mapped_oid IS DISTINCT FROM oid;
+SELECT m.* FROM filenode_mapping m LEFT JOIN pg_class c ON c.oid = m.oid
+WHERE c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL;
+ oid | mapped_oid | reltablespace | relfilenode | relname
+-----+------------+---------------+-------------+---------
+(0 rows)
+
+-- Checks on creating and manipulation of user defined relations in
+-- pg_catalog.
+--
+-- XXX: It would be useful to add checks around trying to manipulate
+-- catalog tables, but that might have ugly consequences when run
+-- against an existing server with allow_system_table_mods = on.
+SHOW allow_system_table_mods;
+ allow_system_table_mods
+-------------------------
+ off
+(1 row)
+
+-- disallowed because of search_path issues with pg_dump
+CREATE TABLE pg_catalog.new_system_table();
+ERROR: permission denied to create "pg_catalog.new_system_table"
+DETAIL: System catalog modifications are currently disallowed.
+-- instead create in public first, move to catalog
+CREATE TABLE new_system_table(id serial primary key, othercol text);
+ALTER TABLE new_system_table SET SCHEMA pg_catalog;
+ALTER TABLE new_system_table SET SCHEMA public;
+ALTER TABLE new_system_table SET SCHEMA pg_catalog;
+-- will be ignored -- already there:
+ALTER TABLE new_system_table SET SCHEMA pg_catalog;
+ALTER TABLE new_system_table RENAME TO old_system_table;
+CREATE INDEX old_system_table__othercol ON old_system_table (othercol);
+INSERT INTO old_system_table(othercol) VALUES ('somedata'), ('otherdata');
+UPDATE old_system_table SET id = -id;
+DELETE FROM old_system_table WHERE othercol = 'somedata';
+TRUNCATE old_system_table;
+ALTER TABLE old_system_table DROP CONSTRAINT new_system_table_pkey;
+ALTER TABLE old_system_table DROP COLUMN othercol;
+DROP TABLE old_system_table;
+-- set logged
+CREATE UNLOGGED TABLE unlogged1(f1 SERIAL PRIMARY KEY, f2 TEXT);
+-- check relpersistence of an unlogged table
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^unlogged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+------------------+---------+----------------
+ unlogged1 | r | u
+ unlogged1_f1_seq | S | p
+ unlogged1_pkey | i | u
+(3 rows)
+
+CREATE UNLOGGED TABLE unlogged2(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES unlogged1); -- foreign key
+CREATE UNLOGGED TABLE unlogged3(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES unlogged3); -- self-referencing foreign key
+ALTER TABLE unlogged3 SET LOGGED; -- skip self-referencing foreign key
+ALTER TABLE unlogged2 SET LOGGED; -- fails because a foreign key to an unlogged table exists
+ERROR: could not change table "unlogged2" to logged because it references unlogged table "unlogged1"
+ALTER TABLE unlogged1 SET LOGGED;
+-- check relpersistence of an unlogged table after changing to permanent
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^unlogged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+------------------+---------+----------------
+ unlogged1 | r | p
+ unlogged1_f1_seq | S | p
+ unlogged1_pkey | i | p
+(3 rows)
+
+ALTER TABLE unlogged1 SET LOGGED; -- silently do nothing
+DROP TABLE unlogged3;
+DROP TABLE unlogged2;
+DROP TABLE unlogged1;
+-- set unlogged
+CREATE TABLE logged1(f1 SERIAL PRIMARY KEY, f2 TEXT);
+-- check relpersistence of a permanent table
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^logged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^logged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^logged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+----------------+---------+----------------
+ logged1 | r | p
+ logged1_f1_seq | S | p
+ logged1_pkey | i | p
+(3 rows)
+
+CREATE TABLE logged2(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES logged1); -- foreign key
+CREATE TABLE logged3(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES logged3); -- self-referencing foreign key
+ALTER TABLE logged1 SET UNLOGGED; -- fails because a foreign key from a permanent table exists
+ERROR: could not change table "logged1" to unlogged because it references logged table "logged2"
+ALTER TABLE logged3 SET UNLOGGED; -- skip self-referencing foreign key
+ALTER TABLE logged2 SET UNLOGGED;
+ALTER TABLE logged1 SET UNLOGGED;
+-- check relpersistence of a permanent table after changing to unlogged
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^logged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^logged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^logged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+----------------+---------+----------------
+ logged1 | r | u
+ logged1_f1_seq | S | p
+ logged1_pkey | i | u
+(3 rows)
+
+ALTER TABLE logged1 SET UNLOGGED; -- silently do nothing
+DROP TABLE logged3;
+DROP TABLE logged2;
+DROP TABLE logged1;
+-- test ADD COLUMN IF NOT EXISTS
+CREATE TABLE test_add_column(c1 integer);
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN c2 integer;
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN c2 integer; -- fail because c2 already exists
+ERROR: column "c2" of relation "test_add_column" already exists
+ALTER TABLE ONLY test_add_column
+ ADD COLUMN c2 integer; -- fail because c2 already exists
+ERROR: column "c2" of relation "test_add_column" already exists
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer; -- skipping because c2 already exists
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+ALTER TABLE ONLY test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer; -- skipping because c2 already exists
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN c2 integer, -- fail because c2 already exists
+ ADD COLUMN c3 integer;
+ERROR: column "c2" of relation "test_add_column" already exists
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
+	ADD COLUMN c3 integer; -- c3 does not exist yet, so it is added
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
+ ADD COLUMN IF NOT EXISTS c3 integer; -- skipping because c3 already exists
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+NOTICE: column "c3" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
+ ADD COLUMN IF NOT EXISTS c3 integer, -- skipping because c3 already exists
+ ADD COLUMN c4 integer;
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+NOTICE: column "c3" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | |
+ c4 | integer | | |
+
+DROP TABLE test_add_column;
+-- unsupported constraint types for partitioned tables
+CREATE TABLE partitioned (
+ a int,
+ b int
+) PARTITION BY RANGE (a, (a+b+1));
+ALTER TABLE partitioned ADD EXCLUDE USING gist (a WITH &&);
+ERROR: exclusion constraints are not supported on partitioned tables
+LINE 1: ALTER TABLE partitioned ADD EXCLUDE USING gist (a WITH &&);
+ ^
+-- cannot drop column that is part of the partition key
+ALTER TABLE partitioned DROP COLUMN a;
+ERROR: cannot drop column "a" because it is part of the partition key of relation "partitioned"
+ALTER TABLE partitioned ALTER COLUMN a TYPE char(5);
+ERROR: cannot alter column "a" because it is part of the partition key of relation "partitioned"
+ALTER TABLE partitioned DROP COLUMN b;
+ERROR: cannot drop column "b" because it is part of the partition key of relation "partitioned"
+ALTER TABLE partitioned ALTER COLUMN b TYPE char(5);
+ERROR: cannot alter column "b" because it is part of the partition key of relation "partitioned"
+-- partitioned table cannot participate in regular inheritance
+CREATE TABLE nonpartitioned (
+ a int,
+ b int
+);
+ALTER TABLE partitioned INHERIT nonpartitioned;
+ERROR: cannot change inheritance of partitioned table
+ALTER TABLE nonpartitioned INHERIT partitioned;
+ERROR: cannot inherit from partitioned table "partitioned"
+-- cannot add NO INHERIT constraint to partitioned tables
+ALTER TABLE partitioned ADD CONSTRAINT chk_a CHECK (a > 0) NO INHERIT;
+ERROR: cannot add NO INHERIT constraint to partitioned table "partitioned"
+DROP TABLE partitioned, nonpartitioned;
+--
+-- ATTACH PARTITION
+--
+-- check that target table is partitioned
+CREATE TABLE unparted (
+ a int
+);
+CREATE TABLE fail_part (like unparted);
+ALTER TABLE unparted ATTACH PARTITION fail_part FOR VALUES IN ('a');
+ERROR: table "unparted" is not partitioned
+DROP TABLE unparted, fail_part;
+-- check that partition bound is compatible
+CREATE TABLE list_parted (
+ a int NOT NULL,
+ b char(2) COLLATE "C",
+ CONSTRAINT check_a CHECK (a > 0)
+) PARTITION BY LIST (a);
+CREATE TABLE fail_part (LIKE list_parted);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES FROM (1) TO (10);
+ERROR: invalid bound specification for a list partition
+LINE 1: ...list_parted ATTACH PARTITION fail_part FOR VALUES FROM (1) T...
+ ^
+DROP TABLE fail_part;
+-- check that the table being attached exists
+ALTER TABLE list_parted ATTACH PARTITION nonexistent FOR VALUES IN (1);
+ERROR: relation "nonexistent" does not exist
+-- check ownership of the source table
+CREATE ROLE regress_test_me;
+CREATE ROLE regress_test_not_me;
+CREATE TABLE not_owned_by_me (LIKE list_parted);
+ALTER TABLE not_owned_by_me OWNER TO regress_test_not_me;
+SET SESSION AUTHORIZATION regress_test_me;
+CREATE TABLE owned_by_me (
+ a int
+) PARTITION BY LIST (a);
+ALTER TABLE owned_by_me ATTACH PARTITION not_owned_by_me FOR VALUES IN (1);
+ERROR: must be owner of table not_owned_by_me
+RESET SESSION AUTHORIZATION;
+DROP TABLE owned_by_me, not_owned_by_me;
+DROP ROLE regress_test_not_me;
+DROP ROLE regress_test_me;
+-- check that the table being attached is not part of regular inheritance
+CREATE TABLE parent (LIKE list_parted);
+CREATE TABLE child () INHERITS (parent);
+ALTER TABLE list_parted ATTACH PARTITION child FOR VALUES IN (1);
+ERROR: cannot attach inheritance child as partition
+ALTER TABLE list_parted ATTACH PARTITION parent FOR VALUES IN (1);
+ERROR: cannot attach inheritance parent as partition
+DROP TABLE parent CASCADE;
+NOTICE: drop cascades to table child
+-- check any TEMP-ness
+CREATE TEMP TABLE temp_parted (a int) PARTITION BY LIST (a);
+CREATE TABLE perm_part (a int);
+ALTER TABLE temp_parted ATTACH PARTITION perm_part FOR VALUES IN (1);
+ERROR: cannot attach a permanent relation as partition of temporary relation "temp_parted"
+DROP TABLE temp_parted, perm_part;
+-- check that the table being attached is not a typed table
+CREATE TYPE mytype AS (a int);
+CREATE TABLE fail_part OF mytype;
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: cannot attach a typed table as partition
+DROP TYPE mytype CASCADE;
+NOTICE: drop cascades to table fail_part
+-- check that the table being attached has only columns present in the parent
+CREATE TABLE fail_part (like list_parted, c int);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: table "fail_part" contains column "c" not found in parent "list_parted"
+DETAIL: The new partition may contain only the columns present in parent.
+DROP TABLE fail_part;
+-- check that the table being attached has every column of the parent
+CREATE TABLE fail_part (a int NOT NULL);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table is missing column "b"
+DROP TABLE fail_part;
+-- check that columns match in type, collation and NOT NULL status
+CREATE TABLE fail_part (
+ b char(3),
+ a int NOT NULL
+);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table "fail_part" has different type for column "b"
+ALTER TABLE fail_part ALTER b TYPE char (2) COLLATE "POSIX";
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table "fail_part" has different collation for column "b"
+DROP TABLE fail_part;
+-- check that the table being attached has all constraints of the parent
+CREATE TABLE fail_part (
+ b char(2) COLLATE "C",
+ a int NOT NULL
+);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table is missing constraint "check_a"
+-- check that the constraint matches in definition with parent's constraint
+ALTER TABLE fail_part ADD CONSTRAINT check_a CHECK (a >= 0);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table "fail_part" has different definition for check constraint "check_a"
+DROP TABLE fail_part;
+-- check the attributes and constraints after partition is attached
+CREATE TABLE part_1 (
+ a int NOT NULL,
+ b char(2) COLLATE "C",
+ CONSTRAINT check_a CHECK (a > 0)
+);
+ALTER TABLE list_parted ATTACH PARTITION part_1 FOR VALUES IN (1);
+-- attislocal and conislocal are always false for merged attributes and constraints respectively.
+SELECT attislocal, attinhcount FROM pg_attribute WHERE attrelid = 'part_1'::regclass AND attnum > 0;
+ attislocal | attinhcount
+------------+-------------
+ f | 1
+ f | 1
+(2 rows)
+
+SELECT conislocal, coninhcount FROM pg_constraint WHERE conrelid = 'part_1'::regclass AND conname = 'check_a';
+ conislocal | coninhcount
+------------+-------------
+ f | 1
+(1 row)
+
+-- check that the new partition won't overlap with an existing partition
+CREATE TABLE fail_part (LIKE part_1 INCLUDING CONSTRAINTS);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: partition "fail_part" would overlap partition "part_1"
+DROP TABLE fail_part;
+-- check that an existing table can be attached as a default partition
+CREATE TABLE def_part (LIKE list_parted INCLUDING CONSTRAINTS);
+ALTER TABLE list_parted ATTACH PARTITION def_part DEFAULT;
+-- check attaching default partition fails if a default partition already
+-- exists
+CREATE TABLE fail_def_part (LIKE part_1 INCLUDING CONSTRAINTS);
+ALTER TABLE list_parted ATTACH PARTITION fail_def_part DEFAULT;
+ERROR: partition "fail_def_part" conflicts with existing default partition "def_part"
+-- check validation when attaching list partitions
+CREATE TABLE list_parted2 (
+ a int,
+ b char
+) PARTITION BY LIST (a);
+-- check that violating rows are correctly reported
+CREATE TABLE part_2 (LIKE list_parted2);
+INSERT INTO part_2 VALUES (3, 'a');
+ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
+ERROR: partition constraint is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM part_2;
+ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
+-- check partition cannot be attached if default has some row for its values
+CREATE TABLE list_parted2_def PARTITION OF list_parted2 DEFAULT;
+INSERT INTO list_parted2_def VALUES (11, 'z');
+CREATE TABLE part_3 (LIKE list_parted2);
+ALTER TABLE list_parted2 ATTACH PARTITION part_3 FOR VALUES IN (11);
+ERROR: updated partition constraint for default partition would be violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM list_parted2_def WHERE a = 11;
+ALTER TABLE list_parted2 ATTACH PARTITION part_3 FOR VALUES IN (11);
+-- adding constraints that describe the desired partition constraint
+-- (or more restrictive) will help skip the validation scan
+CREATE TABLE part_3_4 (
+ LIKE list_parted2,
+ CONSTRAINT check_a CHECK (a IN (3))
+);
+-- however, if a list partition does not accept nulls, there should be
+-- an explicit NOT NULL constraint on the partition key column for the
+-- validation scan to be skipped;
+ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
+-- adding a NOT NULL constraint will cause the scan to be skipped
+ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
+ALTER TABLE part_3_4 ALTER a SET NOT NULL;
+ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
+INFO: partition constraint for table "part_3_4" is implied by existing constraints
+-- check whether the default partition scan is skipped
+ALTER TABLE list_parted2_def ADD CONSTRAINT check_a CHECK (a IN (5, 6));
+CREATE TABLE part_55_66 PARTITION OF list_parted2 FOR VALUES IN (55, 66);
+INFO: updated partition constraint for default partition "list_parted2_def" is implied by existing constraints
+-- check validation when attaching range partitions
+CREATE TABLE range_parted (
+ a int,
+ b int
+) PARTITION BY RANGE (a, b);
+-- check that violating rows are correctly reported
+CREATE TABLE part1 (
+ a int NOT NULL CHECK (a = 1),
+ b int NOT NULL CHECK (b >= 1 AND b <= 10)
+);
+INSERT INTO part1 VALUES (1, 10);
+-- Remember the TO bound is exclusive
+ALTER TABLE range_parted ATTACH PARTITION part1 FOR VALUES FROM (1, 1) TO (1, 10);
+ERROR: partition constraint is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM part1;
+ALTER TABLE range_parted ATTACH PARTITION part1 FOR VALUES FROM (1, 1) TO (1, 10);
+-- adding constraints that describe the desired partition constraint
+-- (or more restrictive) will help skip the validation scan
+CREATE TABLE part2 (
+ a int NOT NULL CHECK (a = 1),
+ b int NOT NULL CHECK (b >= 10 AND b < 18)
+);
+ALTER TABLE range_parted ATTACH PARTITION part2 FOR VALUES FROM (1, 10) TO (1, 20);
+INFO: partition constraint for table "part2" is implied by existing constraints
+-- Create default partition
+CREATE TABLE partr_def1 PARTITION OF range_parted DEFAULT;
+-- Only one default partition is allowed, hence the following should give an error
+CREATE TABLE partr_def2 (LIKE part1 INCLUDING CONSTRAINTS);
+ALTER TABLE range_parted ATTACH PARTITION partr_def2 DEFAULT;
+ERROR: partition "partr_def2" conflicts with existing default partition "partr_def1"
+-- Overlapping partitions cannot be attached, hence the following should give an error
+INSERT INTO partr_def1 VALUES (2, 10);
+CREATE TABLE part3 (LIKE range_parted);
+ALTER TABLE range_parted ATTACH partition part3 FOR VALUES FROM (2, 10) TO (2, 20);
+ERROR: updated partition constraint for default partition would be violated by some row
+-- Attaching partitions should be successful when there are no overlapping rows
+ALTER TABLE range_parted ATTACH partition part3 FOR VALUES FROM (3, 10) TO (3, 20);
+-- check that leaf partitions are scanned when attaching a partitioned
+-- table
+CREATE TABLE part_5 (
+ LIKE list_parted2
+) PARTITION BY LIST (b);
+-- check that violating rows are correctly reported
+CREATE TABLE part_5_a PARTITION OF part_5 FOR VALUES IN ('a');
+INSERT INTO part_5_a (a, b) VALUES (6, 'a');
+ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
+ERROR: partition constraint is violated by some row
+-- delete the faulting row and also add a constraint to skip the scan
+DELETE FROM part_5_a WHERE a NOT IN (3);
+ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 5);
+ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
+INFO: partition constraint for table "part_5" is implied by existing constraints
+ALTER TABLE list_parted2 DETACH PARTITION part_5;
+ALTER TABLE part_5 DROP CONSTRAINT check_a;
+-- scan should again be skipped, even though NOT NULL is now a column property
+ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IN (5)), ALTER a SET NOT NULL;
+ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
+INFO: partition constraint for table "part_5" is implied by existing constraints
+-- Check the case where attnos of the partitioning columns in the table being
+-- attached differ from those in the parent. It should not affect the constraint-
+-- checking logic that allows the validation scan to be skipped.
+CREATE TABLE part_6 (
+ c int,
+ LIKE list_parted2,
+ CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 6)
+);
+ALTER TABLE part_6 DROP c;
+ALTER TABLE list_parted2 ATTACH PARTITION part_6 FOR VALUES IN (6);
+INFO: partition constraint for table "part_6" is implied by existing constraints
+-- Similar to above, but the table being attached is a partitioned table
+-- whose partition still has different attnos for the root partitioning
+-- columns.
+CREATE TABLE part_7 (
+ LIKE list_parted2,
+ CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 7)
+) PARTITION BY LIST (b);
+CREATE TABLE part_7_a_null (
+ c int,
+ d int,
+ e int,
+ LIKE list_parted2, -- 'a' will have attnum = 4
+ CONSTRAINT check_b CHECK (b IS NULL OR b = 'a'),
+ CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 7)
+);
+ALTER TABLE part_7_a_null DROP c, DROP d, DROP e;
+ALTER TABLE part_7 ATTACH PARTITION part_7_a_null FOR VALUES IN ('a', null);
+INFO: partition constraint for table "part_7_a_null" is implied by existing constraints
+ALTER TABLE list_parted2 ATTACH PARTITION part_7 FOR VALUES IN (7);
+INFO: partition constraint for table "part_7" is implied by existing constraints
+INFO: updated partition constraint for default partition "list_parted2_def" is implied by existing constraints
+-- Same example, but check this time that the constraint correctly detects
+-- violating rows
+ALTER TABLE list_parted2 DETACH PARTITION part_7;
+ALTER TABLE part_7 DROP CONSTRAINT check_a; -- thusly, scan won't be skipped
+INSERT INTO part_7 (a, b) VALUES (8, null), (9, 'a');
+SELECT tableoid::regclass, a, b FROM part_7 order by a;
+ tableoid | a | b
+---------------+---+---
+ part_7_a_null | 8 |
+ part_7_a_null | 9 | a
+(2 rows)
+
+ALTER TABLE list_parted2 ATTACH PARTITION part_7 FOR VALUES IN (7);
+INFO: updated partition constraint for default partition "list_parted2_def" is implied by existing constraints
+ERROR: partition constraint is violated by some row
+-- check that leaf partitions of default partition are scanned when
+-- attaching a partitioned table.
+ALTER TABLE part_5 DROP CONSTRAINT check_a;
+CREATE TABLE part5_def PARTITION OF part_5 DEFAULT PARTITION BY LIST(a);
+CREATE TABLE part5_def_p1 PARTITION OF part5_def FOR VALUES IN (5);
+INSERT INTO part5_def_p1 VALUES (5, 'y');
+CREATE TABLE part5_p1 (LIKE part_5);
+ALTER TABLE part_5 ATTACH PARTITION part5_p1 FOR VALUES IN ('y');
+ERROR: updated partition constraint for default partition would be violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM part5_def_p1 WHERE b = 'y';
+ALTER TABLE part_5 ATTACH PARTITION part5_p1 FOR VALUES IN ('y');
+-- check that the table being attached is not already a partition
+ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
+ERROR: "part_2" is already a partition
+-- check that circular inheritance is not allowed
+ALTER TABLE part_5 ATTACH PARTITION list_parted2 FOR VALUES IN ('b');
+ERROR: circular inheritance not allowed
+DETAIL: "part_5" is already a child of "list_parted2".
+ALTER TABLE list_parted2 ATTACH PARTITION list_parted2 FOR VALUES IN (0);
+ERROR: circular inheritance not allowed
+DETAIL: "list_parted2" is already a child of "list_parted2".
+-- If a partitioned table being created or an existing table being attached
+-- as a partition does not have a constraint that would allow the validation scan
+-- to be skipped, but an individual partition does, then the partition's
+-- validation scan is skipped.
+CREATE TABLE quuux (a int, b text) PARTITION BY LIST (a);
+CREATE TABLE quuux_default PARTITION OF quuux DEFAULT PARTITION BY LIST (b);
+CREATE TABLE quuux_default1 PARTITION OF quuux_default (
+ CONSTRAINT check_1 CHECK (a IS NOT NULL AND a = 1)
+) FOR VALUES IN ('b');
+CREATE TABLE quuux1 (a int, b text);
+ALTER TABLE quuux ATTACH PARTITION quuux1 FOR VALUES IN (1); -- validate!
+CREATE TABLE quuux2 (a int, b text);
+ALTER TABLE quuux ATTACH PARTITION quuux2 FOR VALUES IN (2); -- skip validation
+INFO: updated partition constraint for default partition "quuux_default1" is implied by existing constraints
+DROP TABLE quuux1, quuux2;
+-- should validate for quuux1, but not for quuux2
+CREATE TABLE quuux1 PARTITION OF quuux FOR VALUES IN (1);
+CREATE TABLE quuux2 PARTITION OF quuux FOR VALUES IN (2);
+INFO: updated partition constraint for default partition "quuux_default1" is implied by existing constraints
+DROP TABLE quuux;
+-- check validation when attaching hash partitions
+-- Use hand-rolled hash functions and operator class to get predictable results
+-- on different machines. part_test_int4_ops is defined in insert.sql.
+-- check that the new partition won't overlap with an existing partition
+CREATE TABLE hash_parted (
+ a int,
+ b int
+) PARTITION BY HASH (a part_test_int4_ops);
+CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH (MODULUS 4, REMAINDER 0);
+CREATE TABLE fail_part (LIKE hpart_1);
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 4);
+ERROR: partition "fail_part" would overlap partition "hpart_1"
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 0);
+ERROR: partition "fail_part" would overlap partition "hpart_1"
+DROP TABLE fail_part;
+-- check validation when attaching hash partitions
+-- check that violating rows are correctly reported
+CREATE TABLE hpart_2 (LIKE hash_parted);
+INSERT INTO hpart_2 VALUES (3, 0);
+ALTER TABLE hash_parted ATTACH PARTITION hpart_2 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
+ERROR: partition constraint is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM hpart_2;
+ALTER TABLE hash_parted ATTACH PARTITION hpart_2 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
+-- check that leaf partitions are scanned when attaching a partitioned
+-- table
+CREATE TABLE hpart_5 (
+ LIKE hash_parted
+) PARTITION BY LIST (b);
+-- check that violating rows are correctly reported
+CREATE TABLE hpart_5_a PARTITION OF hpart_5 FOR VALUES IN ('1', '2', '3');
+INSERT INTO hpart_5_a (a, b) VALUES (7, 1);
+ALTER TABLE hash_parted ATTACH PARTITION hpart_5 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
+ERROR: partition constraint is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM hpart_5_a;
+ALTER TABLE hash_parted ATTACH PARTITION hpart_5 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
+-- check that the table being attached has a valid modulus and remainder value
+CREATE TABLE fail_part(LIKE hash_parted);
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 0, REMAINDER 1);
+ERROR: modulus for hash partition must be a positive integer
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 8);
+ERROR: remainder for hash partition must be less than modulus
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 3, REMAINDER 2);
+ERROR: every hash partition modulus must be a factor of the next larger modulus
+DROP TABLE fail_part;
+--
+-- DETACH PARTITION
+--
+-- check that the table is partitioned at all
+CREATE TABLE regular_table (a int);
+ALTER TABLE regular_table DETACH PARTITION any_name;
+ERROR: table "regular_table" is not partitioned
+DROP TABLE regular_table;
+-- check that the partition being detached exists at all
+ALTER TABLE list_parted2 DETACH PARTITION part_4;
+ERROR: relation "part_4" does not exist
+ALTER TABLE hash_parted DETACH PARTITION hpart_4;
+ERROR: relation "hpart_4" does not exist
+-- check that the partition being detached is actually a partition of the parent
+CREATE TABLE not_a_part (a int);
+ALTER TABLE list_parted2 DETACH PARTITION not_a_part;
+ERROR: relation "not_a_part" is not a partition of relation "list_parted2"
+ALTER TABLE list_parted2 DETACH PARTITION part_1;
+ERROR: relation "part_1" is not a partition of relation "list_parted2"
+ALTER TABLE hash_parted DETACH PARTITION not_a_part;
+ERROR: relation "not_a_part" is not a partition of relation "hash_parted"
+DROP TABLE not_a_part;
+-- check that, after being detached, attinhcount/coninhcount are dropped to 0 and
+-- attislocal/conislocal are set to true
+ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
+SELECT attinhcount, attislocal FROM pg_attribute WHERE attrelid = 'part_3_4'::regclass AND attnum > 0;
+ attinhcount | attislocal
+-------------+------------
+ 0 | t
+ 0 | t
+(2 rows)
+
+SELECT coninhcount, conislocal FROM pg_constraint WHERE conrelid = 'part_3_4'::regclass AND conname = 'check_a';
+ coninhcount | conislocal
+-------------+------------
+ 0 | t
+(1 row)
+
+DROP TABLE part_3_4;
+-- check that a detached partition is not dropped on dropping a partitioned table
+CREATE TABLE range_parted2 (
+ a int
+) PARTITION BY RANGE(a);
+CREATE TABLE part_rp PARTITION OF range_parted2 FOR VALUES FROM (0) to (100);
+ALTER TABLE range_parted2 DETACH PARTITION part_rp;
+DROP TABLE range_parted2;
+SELECT * from part_rp;
+ a
+---
+(0 rows)
+
+DROP TABLE part_rp;
+-- Check ALTER TABLE commands for partitioned tables and partitions
+-- cannot add/drop column to/from *only* the parent
+ALTER TABLE ONLY list_parted2 ADD COLUMN c int;
+ERROR: column must be added to child tables too
+ALTER TABLE ONLY list_parted2 DROP COLUMN b;
+ERROR: cannot drop column from only the partitioned table when partitions exist
+HINT: Do not specify the ONLY keyword.
+-- cannot add a column to partition or drop an inherited one
+ALTER TABLE part_2 ADD COLUMN c text;
+ERROR: cannot add column to a partition
+ALTER TABLE part_2 DROP COLUMN b;
+ERROR: cannot drop inherited column "b"
+-- Nor rename, alter type
+ALTER TABLE part_2 RENAME COLUMN b to c;
+ERROR: cannot rename inherited column "b"
+ALTER TABLE part_2 ALTER COLUMN b TYPE text;
+ERROR: cannot alter inherited column "b"
+-- cannot add/drop NOT NULL or check constraints to *only* the parent, when
+-- partitions exist
+ALTER TABLE ONLY list_parted2 ALTER b SET NOT NULL;
+ERROR: constraint must be added to child tables too
+DETAIL: Column "b" of relation "part_2" is not already NOT NULL.
+HINT: Do not specify the ONLY keyword.
+ALTER TABLE ONLY list_parted2 ADD CONSTRAINT check_b CHECK (b <> 'zz');
+ERROR: constraint must be added to child tables too
+ALTER TABLE list_parted2 ALTER b SET NOT NULL;
+ALTER TABLE ONLY list_parted2 ALTER b DROP NOT NULL;
+ERROR: cannot remove constraint from only the partitioned table when partitions exist
+HINT: Do not specify the ONLY keyword.
+ALTER TABLE list_parted2 ADD CONSTRAINT check_b CHECK (b <> 'zz');
+ALTER TABLE ONLY list_parted2 DROP CONSTRAINT check_b;
+ERROR: cannot remove constraint from only the partitioned table when partitions exist
+HINT: Do not specify the ONLY keyword.
+-- It's alright though, if no partitions are yet created
+CREATE TABLE parted_no_parts (a int) PARTITION BY LIST (a);
+ALTER TABLE ONLY parted_no_parts ALTER a SET NOT NULL;
+ALTER TABLE ONLY parted_no_parts ADD CONSTRAINT check_a CHECK (a > 0);
+ALTER TABLE ONLY parted_no_parts ALTER a DROP NOT NULL;
+ALTER TABLE ONLY parted_no_parts DROP CONSTRAINT check_a;
+DROP TABLE parted_no_parts;
+-- cannot drop inherited NOT NULL or check constraints from partition
+ALTER TABLE list_parted2 ALTER b SET NOT NULL, ADD CONSTRAINT check_a2 CHECK (a > 0);
+ALTER TABLE part_2 ALTER b DROP NOT NULL;
+ERROR: column "b" is marked NOT NULL in parent table
+ALTER TABLE part_2 DROP CONSTRAINT check_a2;
+ERROR: cannot drop inherited constraint "check_a2" of relation "part_2"
+-- Doesn't make sense to add NO INHERIT constraints on partitioned tables
+ALTER TABLE list_parted2 add constraint check_b2 check (b <> 'zz') NO INHERIT;
+ERROR: cannot add NO INHERIT constraint to partitioned table "list_parted2"
+-- check that a partition cannot participate in regular inheritance
+CREATE TABLE inh_test () INHERITS (part_2);
+ERROR: cannot inherit from partition "part_2"
+CREATE TABLE inh_test (LIKE part_2);
+ALTER TABLE inh_test INHERIT part_2;
+ERROR: cannot inherit from a partition
+ALTER TABLE part_2 INHERIT inh_test;
+ERROR: cannot change inheritance of a partition
+-- cannot drop or alter type of partition key columns of lower level
+-- partitioned tables; for example, part_5, which is list_parted2's
+-- partition, is partitioned on b;
+ALTER TABLE list_parted2 DROP COLUMN b;
+ERROR: cannot drop column "b" because it is part of the partition key of relation "part_5"
+ALTER TABLE list_parted2 ALTER COLUMN b TYPE text;
+ERROR: cannot alter column "b" because it is part of the partition key of relation "part_5"
+-- dropping non-partition key columns should be allowed on the parent table.
+ALTER TABLE list_parted DROP COLUMN b;
+SELECT * FROM list_parted;
+ a
+---
+(0 rows)
+
+-- cleanup
+DROP TABLE list_parted, list_parted2, range_parted;
+DROP TABLE fail_def_part;
+DROP TABLE hash_parted;
+-- more tests for certain multi-level partitioning scenarios
+create table p (a int, b int) partition by range (a, b);
+create table p1 (b int, a int not null) partition by range (b);
+create table p11 (like p1);
+alter table p11 drop a;
+alter table p11 add a int;
+alter table p11 drop a;
+alter table p11 add a int not null;
+-- attnum for key attribute 'a' is different in p, p1, and p11
+select attrelid::regclass, attname, attnum
+from pg_attribute
+where attname = 'a'
+ and (attrelid = 'p'::regclass
+ or attrelid = 'p1'::regclass
+ or attrelid = 'p11'::regclass)
+order by attrelid::regclass::text;
+ attrelid | attname | attnum
+----------+---------+--------
+ p | a | 1
+ p1 | a | 2
+ p11 | a | 4
+(3 rows)
+
+alter table p1 attach partition p11 for values from (2) to (5);
+insert into p1 (a, b) values (2, 3);
+-- check that partition validation scan correctly detects violating rows
+alter table p attach partition p1 for values from (1, 2) to (1, 10);
+ERROR: partition constraint is violated by some row
+-- cleanup
+drop table p;
+drop table p1;
+-- validate constraint on partitioned tables should only scan leaf partitions
+create table parted_validate_test (a int) partition by list (a);
+create table parted_validate_test_1 partition of parted_validate_test for values in (0, 1);
+alter table parted_validate_test add constraint parted_validate_test_chka check (a > 0) not valid;
+alter table parted_validate_test validate constraint parted_validate_test_chka;
+drop table parted_validate_test;
+-- test alter column options
+CREATE TABLE attmp(i integer);
+INSERT INTO attmp VALUES (1);
+ALTER TABLE attmp ALTER COLUMN i SET (n_distinct = 1, n_distinct_inherited = 2);
+ALTER TABLE attmp ALTER COLUMN i RESET (n_distinct_inherited);
+ANALYZE attmp;
+DROP TABLE attmp;
+DROP USER regress_alter_table_user1;
+-- check that violating rows are correctly reported when attaching as the
+-- default partition
+create table defpart_attach_test (a int) partition by list (a);
+create table defpart_attach_test1 partition of defpart_attach_test for values in (1);
+create table defpart_attach_test_d (b int, a int);
+alter table defpart_attach_test_d drop b;
+insert into defpart_attach_test_d values (1), (2);
+-- error because its constraint as the default partition would be violated
+-- by the row containing 1
+alter table defpart_attach_test attach partition defpart_attach_test_d default;
+ERROR: partition constraint is violated by some row
+delete from defpart_attach_test_d where a = 1;
+alter table defpart_attach_test_d add check (a > 1);
+-- should be attached successfully and without needing to be scanned
+alter table defpart_attach_test attach partition defpart_attach_test_d default;
+INFO: partition constraint for table "defpart_attach_test_d" is implied by existing constraints
+-- check that attaching a partition correctly reports any rows in the default
+-- partition that should not be there for the new partition to be attached
+-- successfully
+create table defpart_attach_test_2 (like defpart_attach_test_d);
+alter table defpart_attach_test attach partition defpart_attach_test_2 for values in (2);
+ERROR: updated partition constraint for default partition would be violated by some row
+drop table defpart_attach_test;
+-- check combinations of temporary and permanent relations when attaching
+-- partitions.
+create table perm_part_parent (a int) partition by list (a);
+create temp table temp_part_parent (a int) partition by list (a);
+create table perm_part_child (a int);
+create temp table temp_part_child (a int);
+alter table temp_part_parent attach partition perm_part_child default; -- error
+ERROR: cannot attach a permanent relation as partition of temporary relation "temp_part_parent"
+alter table perm_part_parent attach partition temp_part_child default; -- error
+ERROR: cannot attach a temporary relation as partition of permanent relation "perm_part_parent"
+alter table temp_part_parent attach partition temp_part_child default; -- ok
+drop table perm_part_parent cascade;
+drop table temp_part_parent cascade;
+-- check that attaching partitions to a table while it is being used is
+-- prevented
+create table tab_part_attach (a int) partition by list (a);
+create or replace function func_part_attach() returns trigger
+ language plpgsql as $$
+ begin
+ execute 'create table tab_part_attach_1 (a int)';
+ execute 'alter table tab_part_attach attach partition tab_part_attach_1 for values in (1)';
+ return null;
+ end $$;
+create trigger trig_part_attach before insert on tab_part_attach
+ for each statement execute procedure func_part_attach();
+insert into tab_part_attach values (1);
+ERROR: cannot ALTER TABLE "tab_part_attach" because it is being used by active queries in this session
+CONTEXT: SQL statement "alter table tab_part_attach attach partition tab_part_attach_1 for values in (1)"
+PL/pgSQL function func_part_attach() line 4 at EXECUTE
+drop table tab_part_attach;
+drop function func_part_attach();
+-- test case where the partitioning operator is a SQL function whose
+-- evaluation results in the table's relcache being rebuilt partway through
+-- the execution of an ATTACH PARTITION command
+create function at_test_sql_partop (int4, int4) returns int language sql
+as $$ select case when $1 = $2 then 0 when $1 > $2 then 1 else -1 end; $$;
+create operator class at_test_sql_partop for type int4 using btree as
+ operator 1 < (int4, int4), operator 2 <= (int4, int4),
+ operator 3 = (int4, int4), operator 4 >= (int4, int4),
+ operator 5 > (int4, int4), function 1 at_test_sql_partop(int4, int4);
+create table at_test_sql_partop (a int) partition by range (a at_test_sql_partop);
+create table at_test_sql_partop_1 (a int);
+alter table at_test_sql_partop attach partition at_test_sql_partop_1 for values from (0) to (10);
+drop table at_test_sql_partop;
+drop operator class at_test_sql_partop using btree;
+drop function at_test_sql_partop;
diff --git a/src/test/regress/expected/box.out b/src/test/regress/expected/box.out
index 4d0f169214f..f5c0ec106de 100644
--- a/src/test/regress/expected/box.out
+++ b/src/test/regress/expected/box.out
@@ -253,6 +253,7 @@ INSERT INTO box_temp
('(-infinity,0)(0,infinity)'),
('(-infinity,-infinity)(infinity,infinity)');
SET enable_seqscan = false;
+SET enable_bitmapscan = false;
SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
f1
----------------------------
@@ -476,6 +477,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM box_temp WHERE f1 ~= '(20,20),(40,40)';
(2 rows)
RESET enable_seqscan;
+RESET enable_bitmapscan;
DROP INDEX box_spgist;
--
-- Test the SP-GiST index on the larger volume of data
diff --git a/src/test/regress/expected/brin_1.out b/src/test/regress/expected/brin_1.out
new file mode 100644
index 00000000000..bce6bff01af
--- /dev/null
+++ b/src/test/regress/expected/brin_1.out
@@ -0,0 +1,509 @@
+CREATE TABLE brintest (byteacol bytea,
+ charcol "char",
+ namecol name,
+ int8col bigint,
+ int2col smallint,
+ int4col integer,
+ textcol text,
+ oidcol oid,
+ tidcol tid,
+ float4col real,
+ float8col double precision,
+ macaddrcol macaddr,
+ inetcol inet,
+ cidrcol cidr,
+ bpcharcol character,
+ datecol date,
+ timecol time without time zone,
+ timestampcol timestamp without time zone,
+ timestamptzcol timestamp with time zone,
+ intervalcol interval,
+ timetzcol time with time zone,
+ bitcol bit(10),
+ varbitcol bit varying(16),
+ numericcol numeric,
+ uuidcol uuid,
+ int4rangecol int4range,
+ lsncol pg_lsn,
+ boxcol box
+) WITH (fillfactor=10);
+INSERT INTO brintest SELECT
+ repeat(stringu1, 8)::bytea,
+ substr(stringu1, 1, 1)::"char",
+ stringu1::name, 142857 * tenthous,
+ thousand,
+ twothousand,
+ repeat(stringu1, 8),
+ unique1::oid,
+ format('(%s,%s)', tenthous, twenty)::tid,
+ (four + 1.0)/(hundred+1),
+ odd::float8 / (tenthous + 1),
+ format('%s:00:%s:00:%s:00', to_hex(odd), to_hex(even), to_hex(hundred))::macaddr,
+ inet '10.2.3.4/24' + tenthous,
+ cidr '10.2.3/24' + tenthous,
+ substr(stringu1, 1, 1)::bpchar,
+ date '1995-08-15' + tenthous,
+ time '01:20:30' + thousand * interval '18.5 second',
+ timestamp '1942-07-23 03:05:09' + tenthous * interval '36.38 hours',
+ timestamptz '1972-10-10 03:00' + thousand * interval '1 hour',
+ justify_days(justify_hours(tenthous * interval '12 minutes')),
+ timetz '01:30:20+02' + hundred * interval '15 seconds',
+ thousand::bit(10),
+ tenthous::bit(16)::varbit,
+ tenthous::numeric(36,30) * fivethous * even / (hundred + 1),
+ format('%s%s-%s-%s-%s-%s%s%s', to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'))::uuid,
+ int4range(thousand, twothousand),
+ format('%s/%s%s', odd, even, tenthous)::pg_lsn,
+ box(point(odd, even), point(thousand, twothousand))
+FROM tenk1 ORDER BY unique2 LIMIT 100;
+-- throw in some NULL's and different values
+INSERT INTO brintest (inetcol, cidrcol, int4rangecol) SELECT
+ inet 'fe80::6e40:8ff:fea9:8c46' + tenthous,
+ cidr 'fe80::6e40:8ff:fea9:8c46' + tenthous,
+ 'empty'::int4range
+FROM tenk1 ORDER BY thousand, tenthous LIMIT 25;
+CREATE INDEX brinidx ON brintest USING brin (
+ byteacol,
+ charcol,
+ namecol,
+ int8col,
+ int2col,
+ int4col,
+ textcol,
+ oidcol,
+ tidcol,
+ float4col,
+ float8col,
+ macaddrcol,
+ inetcol inet_inclusion_ops,
+ inetcol inet_minmax_ops,
+ cidrcol inet_inclusion_ops,
+ cidrcol inet_minmax_ops,
+ bpcharcol,
+ datecol,
+ timecol,
+ timestampcol,
+ timestamptzcol,
+ intervalcol,
+ timetzcol,
+ bitcol,
+ varbitcol,
+ numericcol,
+ uuidcol,
+ int4rangecol,
+ lsncol,
+ boxcol
+) with (pages_per_range = 1);
+CREATE TABLE brinopers (colname name, typ text,
+ op text[], value text[], matches int[],
+ check (cardinality(op) = cardinality(value)),
+ check (cardinality(op) = cardinality(matches)));
+INSERT INTO brinopers VALUES
+ ('byteacol', 'bytea',
+ '{>, >=, =, <=, <}',
+ '{AAAAAA, AAAAAA, BNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAA, ZZZZZZ, ZZZZZZ}',
+ '{100, 100, 1, 100, 100}'),
+ ('charcol', '"char"',
+ '{>, >=, =, <=, <}',
+ '{A, A, M, Z, Z}',
+ '{97, 100, 6, 100, 98}'),
+ ('namecol', 'name',
+ '{>, >=, =, <=, <}',
+ '{AAAAAA, AAAAAA, MAAAAA, ZZAAAA, ZZAAAA}',
+ '{100, 100, 2, 100, 100}'),
+ ('int2col', 'int2',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 999, 999}',
+ '{100, 100, 1, 100, 100}'),
+ ('int2col', 'int4',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 999, 1999}',
+ '{100, 100, 1, 100, 100}'),
+ ('int2col', 'int8',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 999, 1428427143}',
+ '{100, 100, 1, 100, 100}'),
+ ('int4col', 'int2',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 1999, 1999}',
+ '{100, 100, 1, 100, 100}'),
+ ('int4col', 'int4',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 1999, 1999}',
+ '{100, 100, 1, 100, 100}'),
+ ('int4col', 'int8',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 1999, 1428427143}',
+ '{100, 100, 1, 100, 100}'),
+ ('int8col', 'int2',
+ '{>, >=}',
+ '{0, 0}',
+ '{100, 100}'),
+ ('int8col', 'int4',
+ '{>, >=}',
+ '{0, 0}',
+ '{100, 100}'),
+ ('int8col', 'int8',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 1257141600, 1428427143, 1428427143}',
+ '{100, 100, 1, 100, 100}'),
+ ('textcol', 'text',
+ '{>, >=, =, <=, <}',
+ '{ABABAB, ABABAB, BNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAA, ZZAAAA, ZZAAAA}',
+ '{100, 100, 1, 100, 100}'),
+ ('oidcol', 'oid',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 8800, 9999, 9999}',
+ '{100, 100, 1, 100, 100}'),
+ ('tidcol', 'tid',
+ '{>, >=, =, <=, <}',
+ '{"(0,0)", "(0,0)", "(8800,0)", "(9999,19)", "(9999,19)"}',
+ '{100, 100, 1, 100, 100}'),
+ ('float4col', 'float4',
+ '{>, >=, =, <=, <}',
+ '{0.0103093, 0.0103093, 1, 1, 1}',
+ '{100, 100, 4, 100, 96}'),
+ ('float4col', 'float8',
+ '{>, >=, =, <=, <}',
+ '{0.0103093, 0.0103093, 1, 1, 1}',
+ '{100, 100, 4, 100, 96}'),
+ ('float8col', 'float4',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 0, 1.98, 1.98}',
+ '{99, 100, 1, 100, 100}'),
+ ('float8col', 'float8',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 0, 1.98, 1.98}',
+ '{99, 100, 1, 100, 100}'),
+ ('macaddrcol', 'macaddr',
+ '{>, >=, =, <=, <}',
+ '{00:00:01:00:00:00, 00:00:01:00:00:00, 2c:00:2d:00:16:00, ff:fe:00:00:00:00, ff:fe:00:00:00:00}',
+ '{99, 100, 2, 100, 100}'),
+ ('inetcol', 'inet',
+ '{&&, =, <, <=, >, >=, >>=, >>, <<=, <<}',
+ '{10/8, 10.2.14.231/24, 255.255.255.255, 255.255.255.255, 0.0.0.0, 0.0.0.0, 10.2.14.231/24, 10.2.14.231/25, 10.2.14.231/8, 0/0}',
+ '{100, 1, 100, 100, 125, 125, 2, 2, 100, 100}'),
+ ('inetcol', 'inet',
+ '{&&, >>=, <<=, =}',
+ '{fe80::6e40:8ff:fea9:a673/32, fe80::6e40:8ff:fea9:8c46, fe80::6e40:8ff:fea9:a673/32, fe80::6e40:8ff:fea9:8c46}',
+ '{25, 1, 25, 1}'),
+ ('inetcol', 'cidr',
+ '{&&, <, <=, >, >=, >>=, >>, <<=, <<}',
+ '{10/8, 255.255.255.255, 255.255.255.255, 0.0.0.0, 0.0.0.0, 10.2.14/24, 10.2.14/25, 10/8, 0/0}',
+ '{100, 100, 100, 125, 125, 2, 2, 100, 100}'),
+ ('inetcol', 'cidr',
+ '{&&, >>=, <<=, =}',
+ '{fe80::/32, fe80::6e40:8ff:fea9:8c46, fe80::/32, fe80::6e40:8ff:fea9:8c46}',
+ '{25, 1, 25, 1}'),
+ ('cidrcol', 'inet',
+ '{&&, =, <, <=, >, >=, >>=, >>, <<=, <<}',
+ '{10/8, 10.2.14/24, 255.255.255.255, 255.255.255.255, 0.0.0.0, 0.0.0.0, 10.2.14.231/24, 10.2.14.231/25, 10.2.14.231/8, 0/0}',
+ '{100, 2, 100, 100, 125, 125, 2, 2, 100, 100}'),
+ ('cidrcol', 'inet',
+ '{&&, >>=, <<=, =}',
+ '{fe80::6e40:8ff:fea9:a673/32, fe80::6e40:8ff:fea9:8c46, fe80::6e40:8ff:fea9:a673/32, fe80::6e40:8ff:fea9:8c46}',
+ '{25, 1, 25, 1}'),
+ ('cidrcol', 'cidr',
+ '{&&, =, <, <=, >, >=, >>=, >>, <<=, <<}',
+ '{10/8, 10.2.14/24, 255.255.255.255, 255.255.255.255, 0.0.0.0, 0.0.0.0, 10.2.14/24, 10.2.14/25, 10/8, 0/0}',
+ '{100, 2, 100, 100, 125, 125, 2, 2, 100, 100}'),
+ ('cidrcol', 'cidr',
+ '{&&, >>=, <<=, =}',
+ '{fe80::/32, fe80::6e40:8ff:fea9:8c46, fe80::/32, fe80::6e40:8ff:fea9:8c46}',
+ '{25, 1, 25, 1}'),
+ ('bpcharcol', 'bpchar',
+ '{>, >=, =, <=, <}',
+ '{A, A, W, Z, Z}',
+ '{97, 100, 6, 100, 98}'),
+ ('datecol', 'date',
+ '{>, >=, =, <=, <}',
+ '{1995-08-15, 1995-08-15, 2009-12-01, 2022-12-30, 2022-12-30}',
+ '{100, 100, 1, 100, 100}'),
+ ('timecol', 'time',
+ '{>, >=, =, <=, <}',
+ '{01:20:30, 01:20:30, 02:28:57, 06:28:31.5, 06:28:31.5}',
+ '{100, 100, 1, 100, 100}'),
+ ('timestampcol', 'timestamp',
+ '{>, >=, =, <=, <}',
+ '{1942-07-23 03:05:09, 1942-07-23 03:05:09, 1964-03-24 19:26:45, 1984-01-20 22:42:21, 1984-01-20 22:42:21}',
+ '{100, 100, 1, 100, 100}'),
+ ('timestampcol', 'timestamptz',
+ '{>, >=, =, <=, <}',
+ '{1942-07-23 03:05:09, 1942-07-23 03:05:09, 1964-03-24 19:26:45, 1984-01-20 22:42:21, 1984-01-20 22:42:21}',
+ '{100, 100, 1, 100, 100}'),
+ ('timestamptzcol', 'timestamptz',
+ '{>, >=, =, <=, <}',
+ '{1972-10-10 03:00:00-04, 1972-10-10 03:00:00-04, 1972-10-19 09:00:00-07, 1972-11-20 19:00:00-03, 1972-11-20 19:00:00-03}',
+ '{100, 100, 1, 100, 100}'),
+ ('intervalcol', 'interval',
+ '{>, >=, =, <=, <}',
+ '{00:00:00, 00:00:00, 1 mons 13 days 12:24, 2 mons 23 days 07:48:00, 1 year}',
+ '{100, 100, 1, 100, 100}'),
+ ('timetzcol', 'timetz',
+ '{>, >=, =, <=, <}',
+ '{01:30:20+02, 01:30:20+02, 01:35:50+02, 23:55:05+02, 23:55:05+02}',
+ '{99, 100, 2, 100, 100}'),
+ ('bitcol', 'bit(10)',
+ '{>, >=, =, <=, <}',
+ '{0000000010, 0000000010, 0011011110, 1111111000, 1111111000}',
+ '{100, 100, 1, 100, 100}'),
+ ('varbitcol', 'varbit(16)',
+ '{>, >=, =, <=, <}',
+ '{0000000000000100, 0000000000000100, 0001010001100110, 1111111111111000, 1111111111111000}',
+ '{100, 100, 1, 100, 100}'),
+ ('numericcol', 'numeric',
+ '{>, >=, =, <=, <}',
+ '{0.00, 0.01, 2268164.347826086956521739130434782609, 99470151.9, 99470151.9}',
+ '{100, 100, 1, 100, 100}'),
+ ('uuidcol', 'uuid',
+ '{>, >=, =, <=, <}',
+ '{00040004-0004-0004-0004-000400040004, 00040004-0004-0004-0004-000400040004, 52225222-5222-5222-5222-522252225222, 99989998-9998-9998-9998-999899989998, 99989998-9998-9998-9998-999899989998}',
+ '{100, 100, 1, 100, 100}'),
+ ('int4rangecol', 'int4range',
+ '{<<, &<, &&, &>, >>, @>, <@, =, <, <=, >, >=}',
+ '{"[10000,)","[10000,)","(,]","[3,4)","[36,44)","(1500,1501]","[3,4)","[222,1222)","[36,44)","[43,1043)","[367,4466)","[519,)"}',
+ '{53, 53, 53, 53, 50, 22, 72, 1, 74, 75, 34, 21}'),
+ ('int4rangecol', 'int4range',
+ '{@>, <@, =, <=, >, >=}',
+ '{empty, empty, empty, empty, empty, empty}',
+ '{125, 72, 72, 72, 53, 125}'),
+ ('int4rangecol', 'int4',
+ '{@>}',
+ '{1500}',
+ '{22}'),
+ ('lsncol', 'pg_lsn',
+ '{>, >=, =, <=, <, IS, IS NOT}',
+ '{0/1200, 0/1200, 44/455222, 198/1999799, 198/1999799, NULL, NULL}',
+ '{100, 100, 1, 100, 100, 25, 100}'),
+ ('boxcol', 'point',
+ '{@>}',
+ '{"(500,43)"}',
+ '{11}'),
+ ('boxcol', 'box',
+ '{<<, &<, &&, &>, >>, <<|, &<|, |&>, |>>, @>, <@, ~=}',
+ '{"((1000,2000),(3000,4000))","((1,2),(3000,4000))","((1,2),(3000,4000))","((1,2),(3000,4000))","((1,2),(3,4))","((1000,2000),(3000,4000))","((1,2000),(3,4000))","((1000,2),(3000,4))","((1,2),(3,4))","((1,2),(300,400))","((1,2),(3000,4000))","((222,1222),(44,45))"}',
+ '{100, 100, 100, 99, 96, 100, 100, 99, 96, 1, 99, 1}');
+DO $x$
+DECLARE
+ r record;
+ r2 record;
+ cond text;
+ idx_ctids tid[];
+ ss_ctids tid[];
+ count int;
+ plan_ok bool;
+ plan_line text;
+BEGIN
+ FOR r IN SELECT colname, oper, typ, value[ordinality], matches[ordinality] FROM brinopers, unnest(op) WITH ORDINALITY AS oper LOOP
+
+ -- prepare the condition
+ IF r.value IS NULL THEN
+ cond := format('%I %s %L', r.colname, r.oper, r.value);
+ ELSE
+ cond := format('%I %s %L::%s', r.colname, r.oper, r.value, r.typ);
+ END IF;
+
+ -- run the query using the brin index
+ SET enable_seqscan = 0;
+ SET enable_bitmapscan = 1;
+
+ plan_ok := false;
+ FOR plan_line IN EXECUTE format($y$EXPLAIN SELECT array_agg(ctid) FROM brintest WHERE %s $y$, cond) LOOP
+ IF plan_line LIKE '%Bitmap Heap Scan on brintest%' THEN
+ plan_ok := true;
+ END IF;
+ END LOOP;
+ IF NOT plan_ok THEN
+ RAISE WARNING 'did not get bitmap indexscan plan for %', r;
+ END IF;
+
+ EXECUTE format($y$SELECT array_agg(ctid) FROM brintest WHERE %s $y$, cond)
+ INTO idx_ctids;
+
+ -- run the query using a seqscan
+ SET enable_seqscan = 1;
+ SET enable_bitmapscan = 0;
+
+ plan_ok := false;
+ FOR plan_line IN EXECUTE format($y$EXPLAIN SELECT array_agg(ctid) FROM brintest WHERE %s $y$, cond) LOOP
+ IF plan_line LIKE '%Seq Scan on brintest%' THEN
+ plan_ok := true;
+ END IF;
+ END LOOP;
+ IF NOT plan_ok THEN
+ RAISE WARNING 'did not get seqscan plan for %', r;
+ END IF;
+
+ EXECUTE format($y$SELECT array_agg(ctid) FROM brintest WHERE %s $y$, cond)
+ INTO ss_ctids;
+
+ -- make sure both return the same results
+ count := array_length(idx_ctids, 1);
+
+ IF NOT (count = array_length(ss_ctids, 1) AND
+ idx_ctids @> ss_ctids AND
+ idx_ctids <@ ss_ctids) THEN
+ -- report the results of each scan to make the differences obvious
+ RAISE WARNING 'something not right in %: count %', r, count;
+ SET enable_seqscan = 1;
+ SET enable_bitmapscan = 0;
+ FOR r2 IN EXECUTE 'SELECT ' || r.colname || ' FROM brintest WHERE ' || cond LOOP
+ RAISE NOTICE 'seqscan: %', r2;
+ END LOOP;
+
+ SET enable_seqscan = 0;
+ SET enable_bitmapscan = 1;
+ FOR r2 IN EXECUTE 'SELECT ' || r.colname || ' FROM brintest WHERE ' || cond LOOP
+ RAISE NOTICE 'bitmapscan: %', r2;
+ END LOOP;
+ END IF;
+
+ -- make sure we found expected number of matches
+ IF count != r.matches THEN RAISE WARNING 'unexpected number of results % for %', count, r; END IF;
+ END LOOP;
+END;
+$x$;
+RESET enable_seqscan;
+RESET enable_bitmapscan;
+INSERT INTO brintest SELECT
+ repeat(stringu1, 42)::bytea,
+ substr(stringu1, 1, 1)::"char",
+ stringu1::name, 142857 * tenthous,
+ thousand,
+ twothousand,
+ repeat(stringu1, 42),
+ unique1::oid,
+ format('(%s,%s)', tenthous, twenty)::tid,
+ (four + 1.0)/(hundred+1),
+ odd::float8 / (tenthous + 1),
+ format('%s:00:%s:00:%s:00', to_hex(odd), to_hex(even), to_hex(hundred))::macaddr,
+ inet '10.2.3.4' + tenthous,
+ cidr '10.2.3/24' + tenthous,
+ substr(stringu1, 1, 1)::bpchar,
+ date '1995-08-15' + tenthous,
+ time '01:20:30' + thousand * interval '18.5 second',
+ timestamp '1942-07-23 03:05:09' + tenthous * interval '36.38 hours',
+ timestamptz '1972-10-10 03:00' + thousand * interval '1 hour',
+ justify_days(justify_hours(tenthous * interval '12 minutes')),
+ timetz '01:30:20' + hundred * interval '15 seconds',
+ thousand::bit(10),
+ tenthous::bit(16)::varbit,
+ tenthous::numeric(36,30) * fivethous * even / (hundred + 1),
+ format('%s%s-%s-%s-%s-%s%s%s', to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'))::uuid,
+ int4range(thousand, twothousand),
+ format('%s/%s%s', odd, even, tenthous)::pg_lsn,
+ box(point(odd, even), point(thousand, twothousand))
+FROM tenk1 ORDER BY unique2 LIMIT 5 OFFSET 5;
+SELECT brin_desummarize_range('brinidx', 0);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+VACUUM brintest; -- force a summarization cycle in brinidx
+UPDATE brintest SET int8col = int8col * int4col;
+UPDATE brintest SET textcol = '' WHERE textcol IS NOT NULL;
+-- Tests for brin_summarize_new_values
+SELECT brin_summarize_new_values('brintest'); -- error, not an index
+ERROR: "brintest" is not an index
+SELECT brin_summarize_new_values('tenk1_unique1'); -- error, not a BRIN index
+ERROR: "tenk1_unique1" is not a BRIN index
+SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
+ brin_summarize_new_values
+---------------------------
+ 5
+(1 row)
+
+-- Tests for brin_desummarize_range
+SELECT brin_desummarize_range('brinidx', -1); -- error, invalid range
+ERROR: block number out of range: -1
+SELECT brin_desummarize_range('brinidx', 0);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+SELECT brin_desummarize_range('brinidx', 0);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+SELECT brin_desummarize_range('brinidx', 100000000);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+-- Test brin_summarize_range
+CREATE TABLE brin_summarize (
+ value int
+) WITH (fillfactor=10, autovacuum_enabled=false);
+CREATE INDEX brin_summarize_idx ON brin_summarize USING brin (value) WITH (pages_per_range=2);
+-- Fill a few pages
+DO $$
+DECLARE curtid tid;
+BEGIN
+ LOOP
+ INSERT INTO brin_summarize VALUES (1) RETURNING ctid INTO curtid;
+ EXIT WHEN curtid > tid '(2, 0)';
+ END LOOP;
+END;
+$$;
+-- summarize one range
+SELECT brin_summarize_range('brin_summarize_idx', 0);
+ brin_summarize_range
+----------------------
+ 0
+(1 row)
+
+-- nothing: already summarized
+SELECT brin_summarize_range('brin_summarize_idx', 1);
+ brin_summarize_range
+----------------------
+ 0
+(1 row)
+
+-- summarize one range
+SELECT brin_summarize_range('brin_summarize_idx', 2);
+ brin_summarize_range
+----------------------
+ 1
+(1 row)
+
+-- nothing: page doesn't exist in table
+SELECT brin_summarize_range('brin_summarize_idx', 4294967295);
+ brin_summarize_range
+----------------------
+ 2
+(1 row)
+
+-- invalid block number values
+SELECT brin_summarize_range('brin_summarize_idx', -1);
+ERROR: block number out of range: -1
+SELECT brin_summarize_range('brin_summarize_idx', 4294967296);
+ERROR: block number out of range: 4294967296
+-- test brin cost estimates behave sanely based on correlation of values
+CREATE TABLE brin_test (a INT, b INT);
+INSERT INTO brin_test SELECT x/100,x%100 FROM generate_series(1,10000) x(x);
+CREATE INDEX brin_test_a_idx ON brin_test USING brin (a) WITH (pages_per_range = 2);
+CREATE INDEX brin_test_b_idx ON brin_test USING brin (b) WITH (pages_per_range = 2);
+VACUUM ANALYZE brin_test;
+-- Ensure brin index is used when columns are perfectly correlated
+EXPLAIN (COSTS OFF) SELECT * FROM brin_test WHERE a = 1;
+ QUERY PLAN
+--------------------------------------------
+ Bitmap Heap Scan on brin_test
+ Recheck Cond: (a = 1)
+ -> Bitmap Index Scan on brin_test_a_idx
+ Index Cond: (a = 1)
+(4 rows)
+
+-- Ensure brin index is not used when values are not correlated
+EXPLAIN (COSTS OFF) SELECT * FROM brin_test WHERE b = 1;
+ QUERY PLAN
+-----------------------
+ Seq Scan on brin_test
+ Filter: (b = 1)
+(2 rows)
+
diff --git a/src/test/regress/expected/cluster_1.out b/src/test/regress/expected/cluster_1.out
new file mode 100644
index 00000000000..a707ea30cb7
--- /dev/null
+++ b/src/test/regress/expected/cluster_1.out
@@ -0,0 +1,475 @@
+--
+-- CLUSTER
+--
+CREATE TABLE clstr_tst_s (rf_a SERIAL PRIMARY KEY,
+ b INT);
+CREATE TABLE clstr_tst (a SERIAL PRIMARY KEY,
+ b INT,
+ c TEXT,
+ d TEXT,
+ CONSTRAINT clstr_tst_con FOREIGN KEY (b) REFERENCES clstr_tst_s);
+CREATE INDEX clstr_tst_b ON clstr_tst (b);
+CREATE INDEX clstr_tst_c ON clstr_tst (c);
+CREATE INDEX clstr_tst_c_b ON clstr_tst (c,b);
+CREATE INDEX clstr_tst_b_c ON clstr_tst (b,c);
+INSERT INTO clstr_tst_s (b) VALUES (0);
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+CREATE TABLE clstr_tst_inh () INHERITS (clstr_tst);
+INSERT INTO clstr_tst (b, c) VALUES (11, 'once');
+INSERT INTO clstr_tst (b, c) VALUES (10, 'diez');
+INSERT INTO clstr_tst (b, c) VALUES (31, 'treinta y uno');
+INSERT INTO clstr_tst (b, c) VALUES (22, 'veintidos');
+INSERT INTO clstr_tst (b, c) VALUES (3, 'tres');
+INSERT INTO clstr_tst (b, c) VALUES (20, 'veinte');
+INSERT INTO clstr_tst (b, c) VALUES (23, 'veintitres');
+INSERT INTO clstr_tst (b, c) VALUES (21, 'veintiuno');
+INSERT INTO clstr_tst (b, c) VALUES (4, 'cuatro');
+INSERT INTO clstr_tst (b, c) VALUES (14, 'catorce');
+INSERT INTO clstr_tst (b, c) VALUES (2, 'dos');
+INSERT INTO clstr_tst (b, c) VALUES (18, 'dieciocho');
+INSERT INTO clstr_tst (b, c) VALUES (27, 'veintisiete');
+INSERT INTO clstr_tst (b, c) VALUES (25, 'veinticinco');
+INSERT INTO clstr_tst (b, c) VALUES (13, 'trece');
+INSERT INTO clstr_tst (b, c) VALUES (28, 'veintiocho');
+INSERT INTO clstr_tst (b, c) VALUES (32, 'treinta y dos');
+INSERT INTO clstr_tst (b, c) VALUES (5, 'cinco');
+INSERT INTO clstr_tst (b, c) VALUES (29, 'veintinueve');
+INSERT INTO clstr_tst (b, c) VALUES (1, 'uno');
+INSERT INTO clstr_tst (b, c) VALUES (24, 'veinticuatro');
+INSERT INTO clstr_tst (b, c) VALUES (30, 'treinta');
+INSERT INTO clstr_tst (b, c) VALUES (12, 'doce');
+INSERT INTO clstr_tst (b, c) VALUES (17, 'diecisiete');
+INSERT INTO clstr_tst (b, c) VALUES (9, 'nueve');
+INSERT INTO clstr_tst (b, c) VALUES (19, 'diecinueve');
+INSERT INTO clstr_tst (b, c) VALUES (26, 'veintiseis');
+INSERT INTO clstr_tst (b, c) VALUES (15, 'quince');
+INSERT INTO clstr_tst (b, c) VALUES (7, 'siete');
+INSERT INTO clstr_tst (b, c) VALUES (16, 'dieciseis');
+INSERT INTO clstr_tst (b, c) VALUES (8, 'ocho');
+-- This entry is needed to test that TOASTED values are copied correctly.
+INSERT INTO clstr_tst (b, c, d) VALUES (6, 'seis', repeat('xyzzy', 100000));
+CLUSTER clstr_tst_c ON clstr_tst;
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+(32 rows)
+
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY a;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 1 | 11 | once | |
+ 2 | 10 | diez | |
+ 3 | 31 | treinta y uno | |
+ 4 | 22 | veintidos | |
+ 5 | 3 | tres | |
+ 6 | 20 | veinte | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 9 | 4 | cuatro | |
+ 10 | 14 | catorce | |
+ 11 | 2 | dos | |
+ 12 | 18 | dieciocho | |
+ 13 | 27 | veintisiete | |
+ 14 | 25 | veinticinco | |
+ 15 | 13 | trece | |
+ 16 | 28 | veintiocho | |
+ 17 | 32 | treinta y dos | |
+ 18 | 5 | cinco | |
+ 19 | 29 | veintinueve | |
+ 20 | 1 | uno | |
+ 21 | 24 | veinticuatro | |
+ 22 | 30 | treinta | |
+ 23 | 12 | doce | |
+ 24 | 17 | diecisiete | |
+ 25 | 9 | nueve | |
+ 26 | 19 | diecinueve | |
+ 27 | 26 | veintiseis | |
+ 28 | 15 | quince | |
+ 29 | 7 | siete | |
+ 30 | 16 | dieciseis | |
+ 31 | 8 | ocho | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+(32 rows)
+
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY b;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 20 | 1 | uno | |
+ 11 | 2 | dos | |
+ 5 | 3 | tres | |
+ 9 | 4 | cuatro | |
+ 18 | 5 | cinco | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 31 | 8 | ocho | |
+ 25 | 9 | nueve | |
+ 2 | 10 | diez | |
+ 1 | 11 | once | |
+ 23 | 12 | doce | |
+ 15 | 13 | trece | |
+ 10 | 14 | catorce | |
+ 28 | 15 | quince | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 12 | 18 | dieciocho | |
+ 26 | 19 | diecinueve | |
+ 6 | 20 | veinte | |
+ 8 | 21 | veintiuno | |
+ 4 | 22 | veintidos | |
+ 7 | 23 | veintitres | |
+ 21 | 24 | veinticuatro | |
+ 14 | 25 | veinticinco | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 16 | 28 | veintiocho | |
+ 19 | 29 | veintinueve | |
+ 22 | 30 | treinta | |
+ 3 | 31 | treinta y uno | |
+ 17 | 32 | treinta y dos | |
+(32 rows)
+
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY c;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+(32 rows)
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+----------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+(33 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------
+ clstr_tst_con
+ clstr_tst_pkey
+(2 rows)
+
+SELECT relname, relkind,
+ EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
+FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
+ relname | relkind | hastoast
+----------------------+---------+----------
+ clstr_tst | r | f
+ clstr_tst_a_seq | S | f
+ clstr_tst_b | i | f
+ clstr_tst_b_c | i | f
+ clstr_tst_c | i | f
+ clstr_tst_c_b | i | f
+ clstr_tst_inh | r | f
+ clstr_tst_pkey | i | f
+ clstr_tst_s | r | f
+ clstr_tst_s_pkey | i | f
+ clstr_tst_s_rf_a_seq | S | f
+(11 rows)
+
+-- Verify that indisclustered is correctly set
+SELECT pg_class.relname FROM pg_index, pg_class, pg_class AS pg_class_2
+WHERE pg_class.oid=indexrelid
+ AND indrelid=pg_class_2.oid
+ AND pg_class_2.relname = 'clstr_tst'
+ AND indisclustered;
+ relname
+-------------
+ clstr_tst_c
+(1 row)
+
+-- Try changing indisclustered
+ALTER TABLE clstr_tst CLUSTER ON clstr_tst_b_c;
+SELECT pg_class.relname FROM pg_index, pg_class, pg_class AS pg_class_2
+WHERE pg_class.oid=indexrelid
+ AND indrelid=pg_class_2.oid
+ AND pg_class_2.relname = 'clstr_tst'
+ AND indisclustered;
+ relname
+---------------
+ clstr_tst_b_c
+(1 row)
+
+-- Try turning off all clustering
+ALTER TABLE clstr_tst SET WITHOUT CLUSTER;
+SELECT pg_class.relname FROM pg_index, pg_class, pg_class AS pg_class_2
+WHERE pg_class.oid=indexrelid
+ AND indrelid=pg_class_2.oid
+ AND pg_class_2.relname = 'clstr_tst'
+ AND indisclustered;
+ relname
+---------
+(0 rows)
+
+-- Verify that clustering all tables does in fact cluster the right ones
+CREATE USER regress_clstr_user;
+CREATE TABLE clstr_1 (a INT PRIMARY KEY);
+CREATE TABLE clstr_2 (a INT PRIMARY KEY);
+CREATE TABLE clstr_3 (a INT PRIMARY KEY);
+ALTER TABLE clstr_1 OWNER TO regress_clstr_user;
+ALTER TABLE clstr_3 OWNER TO regress_clstr_user;
+GRANT SELECT ON clstr_2 TO regress_clstr_user;
+INSERT INTO clstr_1 VALUES (2);
+INSERT INTO clstr_1 VALUES (1);
+INSERT INTO clstr_2 VALUES (2);
+INSERT INTO clstr_2 VALUES (1);
+INSERT INTO clstr_3 VALUES (2);
+INSERT INTO clstr_3 VALUES (1);
+-- "CLUSTER <tablename>" on a table that hasn't been clustered
+CLUSTER clstr_2;
+ERROR: there is no previously clustered index for table "clstr_2"
+CLUSTER clstr_1_pkey ON clstr_1;
+CLUSTER clstr_2 USING clstr_2_pkey;
+SELECT * FROM clstr_1 UNION ALL
+ SELECT * FROM clstr_2 UNION ALL
+ SELECT * FROM clstr_3;
+ a
+---
+ 1
+ 2
+ 1
+ 2
+ 2
+ 1
+(6 rows)
+
+-- revert to the original state
+DELETE FROM clstr_1;
+DELETE FROM clstr_2;
+DELETE FROM clstr_3;
+INSERT INTO clstr_1 VALUES (2);
+INSERT INTO clstr_1 VALUES (1);
+INSERT INTO clstr_2 VALUES (2);
+INSERT INTO clstr_2 VALUES (1);
+INSERT INTO clstr_3 VALUES (2);
+INSERT INTO clstr_3 VALUES (1);
+-- this user can only cluster clstr_1 and clstr_3, but the latter
+-- has not been clustered
+SET SESSION AUTHORIZATION regress_clstr_user;
+CLUSTER;
+SELECT * FROM clstr_1 UNION ALL
+ SELECT * FROM clstr_2 UNION ALL
+ SELECT * FROM clstr_3;
+ a
+---
+ 1
+ 2
+ 2
+ 1
+ 2
+ 1
+(6 rows)
+
+-- cluster a single table using the indisclustered bit previously set
+DELETE FROM clstr_1;
+INSERT INTO clstr_1 VALUES (2);
+INSERT INTO clstr_1 VALUES (1);
+CLUSTER clstr_1;
+SELECT * FROM clstr_1;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+-- Test MVCC-safety of cluster. There isn't much we can do to verify the
+-- results with a single backend...
+CREATE TABLE clustertest (key int PRIMARY KEY);
+INSERT INTO clustertest VALUES (10);
+INSERT INTO clustertest VALUES (20);
+INSERT INTO clustertest VALUES (30);
+INSERT INTO clustertest VALUES (40);
+INSERT INTO clustertest VALUES (50);
+-- Use a transaction so that updates are not committed when CLUSTER sees 'em
+BEGIN;
+-- Test update where the old row version is found first in the scan
+UPDATE clustertest SET key = 100 WHERE key = 10;
+-- Test update where the new row version is found first in the scan
+UPDATE clustertest SET key = 35 WHERE key = 40;
+-- Test longer update chain
+UPDATE clustertest SET key = 60 WHERE key = 50;
+UPDATE clustertest SET key = 70 WHERE key = 60;
+UPDATE clustertest SET key = 80 WHERE key = 70;
+SELECT * FROM clustertest;
+ key
+-----
+ 20
+ 30
+ 100
+ 35
+ 80
+(5 rows)
+
+CLUSTER clustertest_pkey ON clustertest;
+SELECT * FROM clustertest;
+ key
+-----
+ 20
+ 30
+ 35
+ 80
+ 100
+(5 rows)
+
+COMMIT;
+SELECT * FROM clustertest;
+ key
+-----
+ 20
+ 30
+ 35
+ 80
+ 100
+(5 rows)
+
+-- check that temp tables can be clustered
+create temp table clstr_temp (col1 int primary key, col2 text);
+insert into clstr_temp values (2, 'two'), (1, 'one');
+cluster clstr_temp using clstr_temp_pkey;
+select * from clstr_temp;
+ col1 | col2
+------+------
+ 1 | one
+ 2 | two
+(2 rows)
+
+drop table clstr_temp;
+RESET SESSION AUTHORIZATION;
+-- Check that partitioned tables cannot be clustered
+CREATE TABLE clstrpart (a int) PARTITION BY RANGE (a);
+CREATE INDEX clstrpart_idx ON clstrpart (a);
+ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+ERROR: cannot mark index clustered in partitioned table
+CLUSTER clstrpart USING clstrpart_idx;
+ERROR: cannot cluster a partitioned table
+DROP TABLE clstrpart;
+-- Test CLUSTER with external tuplesorting
+create table clstr_4 as select * from tenk1;
+create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
+-- ensure we don't use the index in CLUSTER nor the checking SELECTs
+set enable_indexscan = off;
+-- Use external sort:
+set maintenance_work_mem = '1MB';
+cluster clstr_4 using cluster_sort;
+select * from
+(select hundred, lag(hundred) over () as lhundred,
+ thousand, lag(thousand) over () as lthousand,
+ tenthous, lag(tenthous) over () as ltenthous from clstr_4) ss
+where row(hundred, thousand, tenthous) <= row(lhundred, lthousand, ltenthous);
+ hundred | lhundred | thousand | lthousand | tenthous | ltenthous
+---------+----------+----------+-----------+----------+-----------
+(0 rows)
+
+reset enable_indexscan;
+reset maintenance_work_mem;
+-- clean up
+DROP TABLE clustertest;
+DROP TABLE clstr_1;
+DROP TABLE clstr_2;
+DROP TABLE clstr_3;
+DROP TABLE clstr_4;
+DROP USER regress_clstr_user;
diff --git a/src/test/regress/expected/combocid_1.out b/src/test/regress/expected/combocid_1.out
new file mode 100644
index 00000000000..533a2f2fe46
--- /dev/null
+++ b/src/test/regress/expected/combocid_1.out
@@ -0,0 +1,169 @@
+--
+-- Tests for some likely failure cases with combo cmin/cmax mechanism
+--
+CREATE TEMP TABLE combocidtest (foobar int);
+BEGIN;
+-- a few dummy ops to push up the CommandId counter
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest VALUES (1);
+INSERT INTO combocidtest VALUES (2);
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+(2 rows)
+
+SAVEPOINT s1;
+UPDATE combocidtest SET foobar = foobar + 10;
+-- here we should see only updated tuples
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,3) | 12 | 11
+ (0,4) | 12 | 12
+(2 rows)
+
+ROLLBACK TO s1;
+-- now we should see old tuples, but with combo CIDs starting at 0
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+(2 rows)
+
+COMMIT;
+-- combo data is not there anymore, but should still see tuples
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+(2 rows)
+
+-- Test combo cids with portals
+BEGIN;
+INSERT INTO combocidtest VALUES (333);
+DECLARE c CURSOR FOR SELECT ctid,cmin,* FROM combocidtest;
+DELETE FROM combocidtest;
+FETCH ALL FROM c;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,5) | 0 | 333
+(3 rows)
+
+ROLLBACK;
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+(2 rows)
+
+-- check behavior with locked tuples
+BEGIN;
+-- a few dummy ops to push up the CommandId counter
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest VALUES (444);
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+SAVEPOINT s1;
+-- this doesn't affect cmin
+SELECT ctid,cmin,* FROM combocidtest FOR UPDATE;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+-- but this does
+UPDATE combocidtest SET foobar = foobar + 10;
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,7) | 12 | 11
+ (0,8) | 12 | 12
+ (0,9) | 12 | 454
+(3 rows)
+
+ROLLBACK TO s1;
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+COMMIT;
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+-- test for bug reported in
+-- CABRT9RC81YUf1=jsmWopcKJEro=VoeG2ou6sPwyOUTx_qteRsg@mail.gmail.com
+CREATE TABLE IF NOT EXISTS testcase(
+ id int PRIMARY KEY,
+ balance numeric
+);
+INSERT INTO testcase VALUES (1, 0);
+BEGIN;
+SELECT * FROM testcase WHERE testcase.id = 1 FOR UPDATE;
+ id | balance
+----+---------
+ 1 | 0
+(1 row)
+
+UPDATE testcase SET balance = balance + 400 WHERE id=1;
+SAVEPOINT subxact;
+UPDATE testcase SET balance = balance - 100 WHERE id=1;
+ROLLBACK TO SAVEPOINT subxact;
+-- should return one tuple
+SELECT * FROM testcase WHERE id = 1 FOR UPDATE;
+ id | balance
+----+---------
+ 1 | 400
+(1 row)
+
+ROLLBACK;
+DROP TABLE testcase;
diff --git a/src/test/regress/expected/create_am.out b/src/test/regress/expected/create_am.out
index 84da403afc5..a9b8a5d9fab 100644
--- a/src/test/regress/expected/create_am.out
+++ b/src/test/regress/expected/create_am.out
@@ -126,11 +126,12 @@ ERROR: function int4in(internal) does not exist
CREATE ACCESS METHOD bogus TYPE TABLE HANDLER bthandler;
ERROR: function bthandler must return type table_am_handler
SELECT amname, amhandler, amtype FROM pg_am where amtype = 't' ORDER BY 1, 2;
- amname | amhandler | amtype
---------+----------------------+--------
- heap | heap_tableam_handler | t
- heap2 | heap_tableam_handler | t
-(2 rows)
+ amname | amhandler | amtype
+----------+--------------------------+--------
+ heap | heap_tableam_handler | t
+ heap2 | heap_tableam_handler | t
+ zedstore | zedstore_tableam_handler | t
+(3 rows)
-- First create tables employing the new AM using USING
-- plain CREATE TABLE
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c6d575a2f99..5d83629c430 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1797,7 +1797,9 @@ SELECT count(*) FROM dupindexcols
--
-- Check ordering of =ANY indexqual results (bug in 9.2.0)
--
-vacuum tenk1; -- ensure we get consistent plans here
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = OFF;
explain (costs off)
SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)
@@ -1838,6 +1840,7 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+RESET enable_indexscan;
SET enable_indexonlyscan = OFF;
explain (costs off)
SELECT thousand, tenthous FROM tenk1
@@ -1860,6 +1863,8 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+RESET enable_seqscan;
+RESET enable_bitmapscan;
RESET enable_indexonlyscan;
--
-- Check elimination of constant-NULL subexpressions
diff --git a/src/test/regress/expected/create_index_1.out b/src/test/regress/expected/create_index_1.out
new file mode 100644
index 00000000000..ddf993abc9e
--- /dev/null
+++ b/src/test/regress/expected/create_index_1.out
@@ -0,0 +1,2326 @@
+--
+-- CREATE_INDEX
+-- Create ancillary data structures (i.e. indices)
+--
+--
+-- BTREE
+--
+CREATE INDEX onek_unique1 ON onek USING btree(unique1 int4_ops);
+CREATE INDEX IF NOT EXISTS onek_unique1 ON onek USING btree(unique1 int4_ops);
+NOTICE: relation "onek_unique1" already exists, skipping
+CREATE INDEX IF NOT EXISTS ON onek USING btree(unique1 int4_ops);
+ERROR: syntax error at or near "ON"
+LINE 1: CREATE INDEX IF NOT EXISTS ON onek USING btree(unique1 int4_...
+ ^
+CREATE INDEX onek_unique2 ON onek USING btree(unique2 int4_ops);
+CREATE INDEX onek_hundred ON onek USING btree(hundred int4_ops);
+CREATE INDEX onek_stringu1 ON onek USING btree(stringu1 name_ops);
+CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
+CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
+CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
+CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
+CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
+CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
+CREATE INDEX rix ON road USING btree (name text_ops);
+CREATE INDEX iix ON ihighway USING btree (name text_ops);
+CREATE INDEX six ON shighway USING btree (name text_ops);
+-- test comments
+COMMENT ON INDEX six_wrong IS 'bad index';
+ERROR: relation "six_wrong" does not exist
+COMMENT ON INDEX six IS 'good index';
+COMMENT ON INDEX six IS NULL;
+--
+-- BTREE ascending/descending cases
+--
+-- we load int4/text from pure descending data (each key is a new
+-- low key) and name/f8 from pure ascending data (each key is a new
+-- high key). we had a bug where new low keys would sometimes be
+-- "lost".
+--
+CREATE INDEX bt_i4_index ON bt_i4_heap USING btree (seqno int4_ops);
+CREATE INDEX bt_name_index ON bt_name_heap USING btree (seqno name_ops);
+CREATE INDEX bt_txt_index ON bt_txt_heap USING btree (seqno text_ops);
+CREATE INDEX bt_f8_index ON bt_f8_heap USING btree (seqno float8_ops);
+--
+-- BTREE partial indices
+--
+CREATE INDEX onek2_u1_prtl ON onek2 USING btree(unique1 int4_ops)
+ where unique1 < 20 or unique1 > 980;
+CREATE INDEX onek2_u2_prtl ON onek2 USING btree(unique2 int4_ops)
+ where stringu1 < 'B';
+CREATE INDEX onek2_stu1_prtl ON onek2 USING btree(stringu1 name_ops)
+ where onek2.stringu1 >= 'J' and onek2.stringu1 < 'K';
+--
+-- GiST (rtree-equivalent opclasses only)
+--
+CREATE INDEX grect2ind ON fast_emp4000 USING gist (home_base);
+CREATE INDEX gpolygonind ON polygon_tbl USING gist (f1);
+CREATE INDEX gcircleind ON circle_tbl USING gist (f1);
+INSERT INTO POINT_TBL(f1) VALUES (NULL);
+CREATE INDEX gpointind ON point_tbl USING gist (f1);
+CREATE TEMP TABLE gpolygon_tbl AS
+ SELECT polygon(home_base) AS f1 FROM slow_emp4000;
+INSERT INTO gpolygon_tbl VALUES ( '(1000,0,0,1000)' );
+INSERT INTO gpolygon_tbl VALUES ( '(0,1000,1000,1000)' );
+CREATE TEMP TABLE gcircle_tbl AS
+ SELECT circle(home_base) AS f1 FROM slow_emp4000;
+CREATE INDEX ggpolygonind ON gpolygon_tbl USING gist (f1);
+CREATE INDEX ggcircleind ON gcircle_tbl USING gist (f1);
+--
+-- Test GiST indexes
+--
+-- get non-indexed results for comparison purposes
+SET enable_seqscan = ON;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = OFF;
+SELECT * FROM fast_emp4000
+ WHERE home_base @ '(200,200),(2000,1000)'::box
+ ORDER BY (home_base[0])[0];
+ home_base
+-----------------------
+ (337,455),(240,359)
+ (1444,403),(1346,344)
+(2 rows)
+
+SELECT count(*) FROM fast_emp4000 WHERE home_base && '(1000,1000,0,0)'::box;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM fast_emp4000 WHERE home_base IS NULL;
+ count
+-------
+ 278
+(1 row)
+
+SELECT * FROM polygon_tbl WHERE f1 ~ '((1,1),(2,2),(2,1))'::polygon
+ ORDER BY (poly_center(f1))[0];
+ f1
+---------------------
+ ((2,0),(2,4),(0,0))
+(1 row)
+
+SELECT * FROM circle_tbl WHERE f1 && circle(point(1,-2), 1)
+ ORDER BY area(f1);
+ f1
+---------------
+ <(1,2),3>
+ <(1,3),5>
+ <(1,2),100>
+ <(100,1),115>
+(4 rows)
+
+SELECT count(*) FROM gpolygon_tbl WHERE f1 && '(1000,1000,0,0)'::polygon;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM gcircle_tbl WHERE f1 && '<(500,500),500>'::circle;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ box '(0,0,100,100)';
+ count
+-------
+ 3
+(1 row)
+
+SELECT count(*) FROM point_tbl WHERE box '(0,0,100,100)' @> f1;
+ count
+-------
+ 3
+(1 row)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ polygon '(0,0),(0,100),(100,100),(50,50),(100,0),(0,0)';
+ count
+-------
+ 5
+(1 row)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ circle '<(50,50),50>';
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 << '(0.0, 0.0)';
+ count
+-------
+ 3
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
+ count
+-------
+ 3
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
+ count
+-------
+ 4
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 ~= '(-5, -12)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
+ f1
+-------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+ (-5,-12)
+ (5.1,34.5)
+ (1e+300,Infinity)
+ (NaN,NaN)
+
+(10 rows)
+
+SELECT * FROM point_tbl WHERE f1 IS NULL;
+ f1
+----
+
+(1 row)
+
+SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
+ f1
+-------------------
+ (1e-300,-1e-300)
+ (0,0)
+ (-3,4)
+ (-10,0)
+ (10,10)
+ (-5,-12)
+ (5.1,34.5)
+ (1e+300,Infinity)
+ (NaN,NaN)
+(9 rows)
+
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ f1
+------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+(5 rows)
+
+SELECT * FROM gpolygon_tbl ORDER BY f1 <-> '(0,0)'::point LIMIT 10;
+ f1
+-------------------------------------------------
+ ((240,359),(240,455),(337,455),(337,359))
+ ((662,163),(662,187),(759,187),(759,163))
+ ((1000,0),(0,1000))
+ ((0,1000),(1000,1000))
+ ((1346,344),(1346,403),(1444,403),(1444,344))
+ ((278,1409),(278,1457),(369,1457),(369,1409))
+ ((907,1156),(907,1201),(948,1201),(948,1156))
+ ((1517,971),(1517,1043),(1594,1043),(1594,971))
+ ((175,1820),(175,1850),(259,1850),(259,1820))
+ ((2424,81),(2424,160),(2424,160),(2424,81))
+(10 rows)
+
+SELECT circle_center(f1), round(radius(f1)) as radius FROM gcircle_tbl ORDER BY f1 <-> '(200,300)'::point LIMIT 10;
+ circle_center | radius
+----------------+--------
+ (288.5,407) | 68
+ (710.5,175) | 50
+ (323.5,1433) | 51
+ (927.5,1178.5) | 30
+ (1395,373.5) | 57
+ (1555.5,1007) | 53
+ (217,1835) | 45
+ (489,2421.5) | 22
+ (2424,120.5) | 40
+ (751.5,2655) | 20
+(10 rows)
+
+-- Now check the results from plain indexscan
+SET enable_seqscan = OFF;
+SET enable_indexscan = ON;
+SET enable_bitmapscan = OFF;
+EXPLAIN (COSTS OFF)
+SELECT * FROM fast_emp4000
+ WHERE home_base @ '(200,200),(2000,1000)'::box
+ ORDER BY (home_base[0])[0];
+ QUERY PLAN
+----------------------------------------------------------------
+ Sort
+ Sort Key: ((home_base[0])[0])
+ -> Index Only Scan using grect2ind on fast_emp4000
+ Index Cond: (home_base @ '(2000,1000),(200,200)'::box)
+(4 rows)
+
+SELECT * FROM fast_emp4000
+ WHERE home_base @ '(200,200),(2000,1000)'::box
+ ORDER BY (home_base[0])[0];
+ home_base
+-----------------------
+ (337,455),(240,359)
+ (1444,403),(1346,344)
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM fast_emp4000 WHERE home_base && '(1000,1000,0,0)'::box;
+ QUERY PLAN
+-------------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using grect2ind on fast_emp4000
+ Index Cond: (home_base && '(1000,1000),(0,0)'::box)
+(3 rows)
+
+SELECT count(*) FROM fast_emp4000 WHERE home_base && '(1000,1000,0,0)'::box;
+ count
+-------
+ 2
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM fast_emp4000 WHERE home_base IS NULL;
+ QUERY PLAN
+-------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using grect2ind on fast_emp4000
+ Index Cond: (home_base IS NULL)
+(3 rows)
+
+SELECT count(*) FROM fast_emp4000 WHERE home_base IS NULL;
+ count
+-------
+ 278
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM polygon_tbl WHERE f1 ~ '((1,1),(2,2),(2,1))'::polygon
+ ORDER BY (poly_center(f1))[0];
+ QUERY PLAN
+-----------------------------------------------------------
+ Sort
+ Sort Key: ((poly_center(f1))[0])
+ -> Index Scan using gpolygonind on polygon_tbl
+ Index Cond: (f1 ~ '((1,1),(2,2),(2,1))'::polygon)
+(4 rows)
+
+SELECT * FROM polygon_tbl WHERE f1 ~ '((1,1),(2,2),(2,1))'::polygon
+ ORDER BY (poly_center(f1))[0];
+ f1
+---------------------
+ ((2,0),(2,4),(0,0))
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM circle_tbl WHERE f1 && circle(point(1,-2), 1)
+ ORDER BY area(f1);
+ QUERY PLAN
+--------------------------------------------------
+ Sort
+ Sort Key: (area(f1))
+ -> Index Scan using gcircleind on circle_tbl
+ Index Cond: (f1 && '<(1,-2),1>'::circle)
+(4 rows)
+
+SELECT * FROM circle_tbl WHERE f1 && circle(point(1,-2), 1)
+ ORDER BY area(f1);
+ f1
+---------------
+ <(1,2),3>
+ <(1,3),5>
+ <(1,2),100>
+ <(100,1),115>
+(4 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM gpolygon_tbl WHERE f1 && '(1000,1000,0,0)'::polygon;
+ QUERY PLAN
+------------------------------------------------------------
+ Aggregate
+ -> Index Scan using ggpolygonind on gpolygon_tbl
+ Index Cond: (f1 && '((1000,1000),(0,0))'::polygon)
+(3 rows)
+
+SELECT count(*) FROM gpolygon_tbl WHERE f1 && '(1000,1000,0,0)'::polygon;
+ count
+-------
+ 2
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM gcircle_tbl WHERE f1 && '<(500,500),500>'::circle;
+ QUERY PLAN
+-------------------------------------------------------
+ Aggregate
+ -> Index Scan using ggcircleind on gcircle_tbl
+ Index Cond: (f1 && '<(500,500),500>'::circle)
+(3 rows)
+
+SELECT count(*) FROM gcircle_tbl WHERE f1 && '<(500,500),500>'::circle;
+ count
+-------
+ 2
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl WHERE f1 <@ box '(0,0,100,100)';
+ QUERY PLAN
+----------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '(100,100),(0,0)'::box)
+(3 rows)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ box '(0,0,100,100)';
+ count
+-------
+ 3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl WHERE box '(0,0,100,100)' @> f1;
+ QUERY PLAN
+----------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '(100,100),(0,0)'::box)
+(3 rows)
+
+SELECT count(*) FROM point_tbl WHERE box '(0,0,100,100)' @> f1;
+ count
+-------
+ 3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl WHERE f1 <@ polygon '(0,0),(0,100),(100,100),(50,50),(100,0),(0,0)';
+ QUERY PLAN
+----------------------------------------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '((0,0),(0,100),(100,100),(50,50),(100,0),(0,0))'::polygon)
+(3 rows)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ polygon '(0,0),(0,100),(100,100),(50,50),(100,0),(0,0)';
+ count
+-------
+ 4
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl WHERE f1 <@ circle '<(50,50),50>';
+ QUERY PLAN
+----------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '<(50,50),50>'::circle)
+(3 rows)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ circle '<(50,50),50>';
+ count
+-------
+ 1
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 << '(0.0, 0.0)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 << '(0,0)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 << '(0.0, 0.0)';
+ count
+-------
+ 3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 >> '(0,0)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
+ count
+-------
+ 3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 <^ '(0,0)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
+ count
+-------
+ 1
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 >^ '(0,0)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
+ count
+-------
+ 4
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 ~= '(-5, -12)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 ~= '(-5,-12)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 ~= '(-5, -12)';
+ count
+-------
+ 1
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
+ QUERY PLAN
+----------------------------------------------
+ Index Only Scan using gpointind on point_tbl
+ Order By: (f1 <-> '(0,1)'::point)
+(2 rows)
+
+SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
+ f1
+-------------------
+ (10,10)
+ (NaN,NaN)
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (-5,-12)
+ (5.1,34.5)
+
+ (1e+300,Infinity)
+(10 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl WHERE f1 IS NULL;
+ QUERY PLAN
+----------------------------------------------
+ Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 IS NULL)
+(2 rows)
+
+SELECT * FROM point_tbl WHERE f1 IS NULL;
+ f1
+----
+
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
+ QUERY PLAN
+----------------------------------------------
+ Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 IS NOT NULL)
+ Order By: (f1 <-> '(0,1)'::point)
+(3 rows)
+
+SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
+ f1
+-------------------
+ (10,10)
+ (NaN,NaN)
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (-5,-12)
+ (5.1,34.5)
+ (1e+300,Infinity)
+(9 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ QUERY PLAN
+------------------------------------------------
+ Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '(10,10),(-10,-10)'::box)
+ Order By: (f1 <-> '(0,1)'::point)
+(3 rows)
+
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ f1
+------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM gpolygon_tbl ORDER BY f1 <-> '(0,0)'::point LIMIT 10;
+ QUERY PLAN
+-----------------------------------------------------
+ Limit
+ -> Index Scan using ggpolygonind on gpolygon_tbl
+ Order By: (f1 <-> '(0,0)'::point)
+(3 rows)
+
+SELECT * FROM gpolygon_tbl ORDER BY f1 <-> '(0,0)'::point LIMIT 10;
+ f1
+-------------------------------------------------
+ ((240,359),(240,455),(337,455),(337,359))
+ ((662,163),(662,187),(759,187),(759,163))
+ ((1000,0),(0,1000))
+ ((0,1000),(1000,1000))
+ ((1346,344),(1346,403),(1444,403),(1444,344))
+ ((278,1409),(278,1457),(369,1457),(369,1409))
+ ((907,1156),(907,1201),(948,1201),(948,1156))
+ ((1517,971),(1517,1043),(1594,1043),(1594,971))
+ ((175,1820),(175,1850),(259,1850),(259,1820))
+ ((2424,81),(2424,160),(2424,160),(2424,81))
+(10 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT circle_center(f1), round(radius(f1)) as radius FROM gcircle_tbl ORDER BY f1 <-> '(200,300)'::point LIMIT 10;
+ QUERY PLAN
+---------------------------------------------------
+ Limit
+ -> Index Scan using ggcircleind on gcircle_tbl
+ Order By: (f1 <-> '(200,300)'::point)
+(3 rows)
+
+SELECT circle_center(f1), round(radius(f1)) as radius FROM gcircle_tbl ORDER BY f1 <-> '(200,300)'::point LIMIT 10;
+ circle_center | radius
+----------------+--------
+ (288.5,407) | 68
+ (710.5,175) | 50
+ (323.5,1433) | 51
+ (927.5,1178.5) | 30
+ (1395,373.5) | 57
+ (1555.5,1007) | 53
+ (217,1835) | 45
+ (489,2421.5) | 22
+ (2424,120.5) | 40
+ (751.5,2655) | 20
+(10 rows)
+
+-- Now check the results from bitmap indexscan
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = ON;
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ QUERY PLAN
+------------------------------------------------------------
+ Sort
+ Sort Key: ((f1 <-> '(0,1)'::point))
+ -> Bitmap Heap Scan on point_tbl
+ Recheck Cond: (f1 <@ '(10,10),(-10,-10)'::box)
+ -> Bitmap Index Scan on gpointind
+ Index Cond: (f1 <@ '(10,10),(-10,-10)'::box)
+(6 rows)
+
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ f1
+------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+(5 rows)
+
+RESET enable_seqscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+--
+-- GIN over int[] and text[]
+--
+-- Note: GIN currently supports only bitmap scans, not plain indexscans
+--
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = ON;
+CREATE INDEX intarrayidx ON array_index_op_test USING gin (i);
+explain (costs off)
+SELECT * FROM array_index_op_test WHERE i @> '{32}' ORDER BY seqno;
+ QUERY PLAN
+----------------------------------------------------
+ Sort
+ Sort Key: seqno
+ -> Bitmap Heap Scan on array_index_op_test
+ Recheck Cond: (i @> '{32}'::integer[])
+ -> Bitmap Index Scan on intarrayidx
+ Index Cond: (i @> '{32}'::integer[])
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE i @> '{32}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{32}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE i @> '{17}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+(8 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{17}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+(8 rows)
+
+SELECT * FROM array_index_op_test WHERE i @> '{32,17}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+(3 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{32,17}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(11 rows)
+
+SELECT * FROM array_index_op_test WHERE i <@ '{38,34,32,89}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------+----------------------------------------------------------------------------------------------------------------------------
+ 40 | {34} | {AAAAAAAAAAAAAA10611,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAA31334,AAAAA70466,AAAAAAAA81587,AAAAAAA74623}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 101 | {} | {}
+(4 rows)
+
+SELECT * FROM array_index_op_test WHERE i = '{47,77}' ORDER BY seqno;
+ seqno | i | t
+-------+---------+-----------------------------------------------------------------------------------------------------------------
+ 95 | {47,77} | {AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA74076,AAAAAAAAAA18107,AAAAA40681,AAAAAAAAAAAAAAA35875,AAAAA60038,AAAAAAA56483}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE i = '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE i @> '{}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 1 | {92,75,71,52,64,83} | {AAAAAAAA44066,AAAAAA1059,AAAAAAAAAAA176,AAAAAAA48038}
+ 2 | {3,6} | {AAAAAA98232,AAAAAAAA79710,AAAAAAAAAAAAAAAAA69675,AAAAAAAAAAAAAAAA55798,AAAAAAAAA12793}
+ 3 | {37,64,95,43,3,41,13,30,11,43} | {AAAAAAAAAA48845,AAAAA75968,AAAAA95309,AAA54451,AAAAAAAAAA22292,AAAAAAA99836,A96617,AA17009,AAAAAAAAAAAAAA95246}
+ 4 | {71,39,99,55,33,75,45} | {AAAAAAAAA53663,AAAAAAAAAAAAAAA67062,AAAAAAAAAA64777,AAA99043,AAAAAAAAAAAAAAAAAAA91804,39557}
+ 5 | {50,42,77,50,4} | {AAAAAAAAAAAAAAAAA26540,AAAAAAA79710,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA176,AAAAA95309,AAAAAAAAAAA46154,AAAAAA66777,AAAAAAAAA27249,AAAAAAAAAA64777,AAAAAAAAAAAAAAAAAAA70104}
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 7 | {12,51,88,64,8} | {AAAAAAAAAAAAAAAAAA12591,AAAAAAAAAAAAAAAAA50407,AAAAAAAAAAAA67946}
+ 8 | {60,84} | {AAAAAAA81898,AAAAAA1059,AAAAAAAAAAAA81511,AAAAA961,AAAAAAAAAAAAAAAA31334,AAAAA64741,AA6416,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAAAA50407}
+ 9 | {56,52,35,27,80,44,81,22} | {AAAAAAAAAAAAAAA73034,AAAAAAAAAAAAA7929,AAAAAAA66161,AA88409,39557,A27153,AAAAAAAA9523,AAAAAAAAAAA99000}
+ 10 | {71,5,45} | {AAAAAAAAAAA21658,AAAAAAAAAAAA21089,AAA54451,AAAAAAAAAAAAAAAAAA54141,AAAAAAAAAAAAAA28620,AAAAAAAAAAA21658,AAAAAAAAAAA74076,AAAAAAAAA27249}
+ 11 | {41,86,74,48,22,74,47,50} | {AAAAAAAA9523,AAAAAAAAAAAA37562,AAAAAAAAAAAAAAAA14047,AAAAAAAAAAA46154,AAAA41702,AAAAAAAAAAAAAAAAA764,AAAAA62737,39557}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 13 | {3,52,34,23} | {AAAAAA98232,AAAA49534,AAAAAAAAAAA21658}
+ 14 | {78,57,19} | {AAAA8857,AAAAAAAAAAAAAAA73034,AAAAAAAA81587,AAAAAAAAAAAAAAA68526,AAAAA75968,AAAAAAAAAAAAAA65909,AAAAAAAAA10012,AAAAAAAAAAAAAA65909}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 16 | {14,63,85,11} | {AAAAAA66777}
+ 17 | {7,10,81,85} | {AAAAAA43678,AAAAAAA12144,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAAAAA15356}
+ 18 | {1} | {AAAAAAAAAAA33576,AAAAA95309,64261,AAA59323,AAAAAAAAAAAAAA95246,55847,AAAAAAAAAAAA67946,AAAAAAAAAAAAAAAAAA64374}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 20 | {72,89,70,51,54,37,8,49,79} | {AAAAAA58494}
+ 21 | {2,8,65,10,5,79,43} | {AAAAAAAAAAAAAAAAA88852,AAAAAAAAAAAAAAAAAAA91804,AAAAA64669,AAAAAAAAAAAAAAAA1443,AAAAAAAAAAAAAAAA23657,AAAAA12179,AAAAAAAAAAAAAAAAA88852,AAAAAAAAAAAAAAAA31334,AAAAAAAAAAAAAAAA41303,AAAAAAAAAAAAAAAAAAA85420}
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 23 | {40,90,5,38,72,40,30,10,43,55} | {A6053,AAAAAAAAAAA6119,AA44673,AAAAAAAAAAAAAAAAA764,AA17009,AAAAA17383,AAAAA70514,AAAAA33250,AAAAA95309,AAAAAAAAAAAA37562}
+ 24 | {94,61,99,35,48} | {AAAAAAAAAAA50956,AAAAAAAAAAA15165,AAAA85070,AAAAAAAAAAAAAAA36627,AAAAA961,AAAAAAAAAA55219}
+ 25 | {31,1,10,11,27,79,38} | {AAAAAAAAAAAAAAAAAA59334,45449}
+ 26 | {71,10,9,69,75} | {47735,AAAAAAA21462,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAAAAA91804,AAAAAAAAA72121,AAAAAAAAAAAAAAAAAAA1205,AAAAA41597,AAAA8857,AAAAAAAAAAAAAAAAAAA15356,AA17009}
+ 27 | {94} | {AA6416,A6053,AAAAAAA21462,AAAAAAA57334,AAAAAAAAAAAAAAAAAA12591,AA88409,AAAAAAAAAAAAA70254}
+ 28 | {14,33,6,34,14} | {AAAAAAAAAAAAAAA13198,AAAAAAAA69452,AAAAAAAAAAA82945,AAAAAAA12144,AAAAAAAAA72121,AAAAAAAAAA18601}
+ 29 | {39,21} | {AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAAAAA38885,AAAA85070,AAAAAAAAAAAAAAAAAAA70104,AAAAA66674,AAAAAAAAAAAAA62007,AAAAAAAA69452,AAAAAAA1242,AAAAAAAAAAAAAAAA1729,AAAA35194}
+ 30 | {26,81,47,91,34} | {AAAAAAAAAAAAAAAAAAA70104,AAAAAAA80240}
+ 31 | {80,24,18,21,54} | {AAAAAAAAAAAAAAA13198,AAAAAAAAAAAAAAAAAAA70415,A27153,AAAAAAAAA53663,AAAAAAAAAAAAAAAAA50407,A68938}
+ 32 | {58,79,82,80,67,75,98,10,41} | {AAAAAAAAAAAAAAAAAA61286,AAA54451,AAAAAAAAAAAAAAAAAAA87527,A96617,51533}
+ 33 | {74,73} | {A85417,AAAAAAA56483,AAAAA17383,AAAAAAAAAAAAA62159,AAAAAAAAAAAA52814,AAAAAAAAAAAAA85723,AAAAAAAAAAAAAAAAAA55796}
+ 34 | {70,45} | {AAAAAAAAAAAAAAAAAA71621,AAAAAAAAAAAAAA28620,AAAAAAAAAA55219,AAAAAAAA23648,AAAAAAAAAA22292,AAAAAAA1242}
+ 35 | {23,40} | {AAAAAAAAAAAA52814,AAAA48949,AAAAAAAAA34727,AAAA8857,AAAAAAAAAAAAAAAAAAA62179,AAAAAAAAAAAAAAA68526,AAAAAAA99836,AAAAAAAA50094,AAAA91194,AAAAAAAAAAAAA73084}
+ 36 | {79,82,14,52,30,5,79} | {AAAAAAAAA53663,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA89194,AA88409,AAAAAAAAAAAAAAA81326,AAAAAAAAAAAAAAAAA63050,AAAAAAAAAAAAAAAA33598}
+ 37 | {53,11,81,39,3,78,58,64,74} | {AAAAAAAAAAAAAAAAAAA17075,AAAAAAA66161,AAAAAAAA23648,AAAAAAAAAAAAAA10611}
+ 38 | {59,5,4,95,28} | {AAAAAAAAAAA82945,A96617,47735,AAAAA12179,AAAAA64669,AAAAAA99807,AA74433,AAAAAAAAAAAAAAAAA59387}
+ 39 | {82,43,99,16,74} | {AAAAAAAAAAAAAAA67062,AAAAAAA57334,AAAAAAAAAAAAAA65909,A27153,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAA64777,AAAAAAAAAAAA81511,AAAAAAAAAAAAAA65909,AAAAAAAAAAAAAA28620}
+ 40 | {34} | {AAAAAAAAAAAAAA10611,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAA31334,AAAAA70466,AAAAAAAA81587,AAAAAAA74623}
+ 41 | {19,26,63,12,93,73,27,94} | {AAAAAAA79710,AAAAAAAAAA55219,AAAA41702,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA71621,AAAAAAAAAAAAAAAAA63050,AAAAAAA99836,AAAAAAAAAAAAAA8666}
+ 42 | {15,76,82,75,8,91} | {AAAAAAAAAAA176,AAAAAA38063,45449,AAAAAA54032,AAAAAAA81898,AA6416,AAAAAAAAAAAAAAAAAAA62179,45449,AAAAA60038,AAAAAAAA81587}
+ 43 | {39,87,91,97,79,28} | {AAAAAAAAAAA74076,A96617,AAAAAAAAAAAAAAAAAAA89194,AAAAAAAAAAAAAAAAAA55796,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAA67946}
+ 44 | {40,58,68,29,54} | {AAAAAAA81898,AAAAAA66777,AAAAAA98232}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 46 | {53,24} | {AAAAAAAAAAA53908,AAAAAA54032,AAAAA17383,AAAA48949,AAAAAAAAAA18601,AAAAA64669,45449,AAAAAAAAAAA98051,AAAAAAAAAAAAAAAAAA71621}
+ 47 | {98,23,64,12,75,61} | {AAA59323,AAAAA95309,AAAAAAAAAAAAAAAA31334,AAAAAAAAA27249,AAAAA17383,AAAAAAAAAAAA37562,AAAAAA1059,A84822,55847,AAAAA70466}
+ 48 | {76,14} | {AAAAAAAAAAAAA59671,AAAAAAAAAAAAAAAAAAA91804,AAAAAA66777,AAAAAAAAAAAAAAAAAAA89194,AAAAAAAAAAAAAAA36627,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAA73084,AAAAAAA79710,AAAAAAAAAAAAAAA40402,AAAAAAAAAAAAAAAAAAA65037}
+ 49 | {56,5,54,37,49} | {AA21643,AAAAAAAAAAA92631,AAAAAAAA81587}
+ 50 | {20,12,37,64,93} | {AAAAAAAAAA5483,AAAAAAAAAAAAAAAAAAA1205,AA6416,AAAAAAAAAAAAAAAAA63050,AAAAAAAAAAAAAAAAAA47955}
+ 51 | {47} | {AAAAAAAAAAAAAA96505,AAAAAAAAAAAAAAAAAA36842,AAAAA95309,AAAAAAAA81587,AA6416,AAAA91194,AAAAAA58494,AAAAAA1059,AAAAAAAA69452}
+ 52 | {89,0} | {AAAAAAAAAAAAAAAAAA47955,AAAAAAA48038,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAAAAA73084,AAAAA70466,AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA46154,AA66862}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 54 | {70,47} | {AAAAAAAAAAAAAAAAAA54141,AAAAA40681,AAAAAAA48038,AAAAAAAAAAAAAAAA29150,AAAAA41597,AAAAAAAAAAAAAAAAAA59334,AA15322}
+ 55 | {47,79,47,64,72,25,71,24,93} | {AAAAAAAAAAAAAAAAAA55796,AAAAA62737}
+ 56 | {33,7,60,54,93,90,77,85,39} | {AAAAAAAAAAAAAAAAAA32918,AA42406}
+ 57 | {23,45,10,42,36,21,9,96} | {AAAAAAAAAAAAAAAAAAA70415}
+ 58 | {92} | {AAAAAAAAAAAAAAAA98414,AAAAAAAA23648,AAAAAAAAAAAAAAAAAA55796,AA25381,AAAAAAAAAAA6119}
+ 59 | {9,69,46,77} | {39557,AAAAAAA89932,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAAAAAAAAA26540,AAA20874,AA6416,AAAAAAAAAAAAAAAAAA47955}
+ 60 | {62,2,59,38,89} | {AAAAAAA89932,AAAAAAAAAAAAAAAAAAA15356,AA99927,AA17009,AAAAAAAAAAAAAAA35875}
+ 61 | {72,2,44,95,54,54,13} | {AAAAAAAAAAAAAAAAAAA91804}
+ 62 | {83,72,29,73} | {AAAAAAAAAAAAA15097,AAAA8857,AAAAAAAAAAAA35809,AAAAAAAAAAAA52814,AAAAAAAAAAAAAAAAAAA38885,AAAAAAAAAAAAAAAAAA24183,AAAAAA43678,A96617}
+ 63 | {11,4,61,87} | {AAAAAAAAA27249,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAA13198,AAA20874,39557,51533,AAAAAAAAAAA53908,AAAAAAAAAAAAAA96505,AAAAAAAA78938}
+ 64 | {26,19,34,24,81,78} | {A96617,AAAAAAAAAAAAAAAAAAA70104,A68938,AAAAAAAAAAA53908,AAAAAAAAAAAAAAA453,AA17009,AAAAAAA80240}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 66 | {31,23,70,52,4,33,48,25} | {AAAAAAAAAAAAAAAAA69675,AAAAAAAA50094,AAAAAAAAAAA92631,AAAA35194,39557,AAAAAAA99836}
+ 67 | {31,94,7,10} | {AAAAAA38063,A96617,AAAA35194,AAAAAAAAAAAA67946}
+ 68 | {90,43,38} | {AA75092,AAAAAAAAAAAAAAAAA69675,AAAAAAAAAAA92631,AAAAAAAAA10012,AAAAAAAAAAAAA7929,AA21643}
+ 69 | {67,35,99,85,72,86,44} | {AAAAAAAAAAAAAAAAAAA1205,AAAAAAAA50094,AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAAAAAAA47955}
+ 70 | {56,70,83} | {AAAA41702,AAAAAAAAAAA82945,AA21643,AAAAAAAAAAA99000,A27153,AA25381,AAAAAAAAAAAAAA96505,AAAAAAA1242}
+ 71 | {74,26} | {AAAAAAAAAAA50956,AA74433,AAAAAAA21462,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAA36627,AAAAAAAAAAAAA70254,AAAAAAAAAA43419,39557}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 73 | {88,25,96,78,65,15,29,19} | {AAA54451,AAAAAAAAA27249,AAAAAAA9228,AAAAAAAAAAAAAAA67062,AAAAAAAAAAAAAAAAAAA70415,AAAAA17383,AAAAAAAAAAAAAAAA33598}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 75 | {12,96,83,24,71,89,55} | {AAAA48949,AAAAAAAA29716,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAAA67946,AAAAAAAAAAAAAAAA29150,AAA28075,AAAAAAAAAAAAAAAAA43052}
+ 76 | {92,55,10,7} | {AAAAAAAAAAAAAAA67062}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 78 | {55,89,44,84,34} | {AAAAAAAAAAA6119,AAAAAAAAAAAAAA8666,AA99927,AA42406,AAAAAAA81898,AAAAAAA9228,AAAAAAAAAAA92631,AA21643,AAAAAAAAAAAAAA28620}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 80 | {74,89,44,80,0} | {AAAA35194,AAAAAAAA79710,AAA20874,AAAAAAAAAAAAAAAAAAA70104,AAAAAAAAAAAAA73084,AAAAAAA57334,AAAAAAA9228,AAAAAAAAAAAAA62007}
+ 81 | {63,77,54,48,61,53,97} | {AAAAAAAAAAAAAAA81326,AAAAAAAAAA22292,AA25381,AAAAAAAAAAA74076,AAAAAAA81898,AAAAAAAAA72121}
+ 82 | {34,60,4,79,78,16,86,89,42,50} | {AAAAA40681,AAAAAAAAAAAAAAAAAA12591,AAAAAAA80240,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA70104}
+ 83 | {14,10} | {AAAAAAAAAA22292,AAAAAAAAAAAAA70254,AAAAAAAAAAA6119}
+ 84 | {11,83,35,13,96,94} | {AAAAA95309,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAAAAA24183}
+ 85 | {39,60} | {AAAAAAAAAAAAAAAA55798,AAAAAAAAAA22292,AAAAAAA66161,AAAAAAA21462,AAAAAAAAAAAAAAAAAA12591,55847,AAAAAA98232,AAAAAAAAAAA46154}
+ 86 | {33,81,72,74,45,36,82} | {AAAAAAAA81587,AAAAAAAAAAAAAA96505,45449,AAAA80176}
+ 87 | {57,27,50,12,97,68} | {AAAAAAAAAAAAAAAAA26540,AAAAAAAAA10012,AAAAAAAAAAAA35809,AAAAAAAAAAAAAAAA29150,AAAAAAAAAAA82945,AAAAAA66777,31228,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAA96505}
+ 88 | {41,90,77,24,6,24} | {AAAA35194,AAAA35194,AAAAAAA80240,AAAAAAAAAAA46154,AAAAAA58494,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA59334,AAAAAAAAAAAAAAAAAAA91804,AA74433}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 90 | {88,75} | {AAAAA60038,AAAAAAAA23648,AAAAAAAAAAA99000,AAAA41702,AAAAAAAAAAAAA22860,AAAAAAAAAAAAAAA68526}
+ 91 | {78} | {AAAAAAAAAAAAA62007,AAA99043}
+ 92 | {85,63,49,45} | {AAAAAAA89932,AAAAAAAAAAAAA22860,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAAA21089}
+ 93 | {11} | {AAAAAAAAAAA176,AAAAAAAAAAAAAA8666,AAAAAAAAAAAAAAA453,AAAAAAAAAAAAA85723,A68938,AAAAAAAAAAAAA9821,AAAAAAA48038,AAAAAAAAAAAAAAAAA59387,AA99927,AAAAA17383}
+ 94 | {98,9,85,62,88,91,60,61,38,86} | {AAAAAAAA81587,AAAAA17383,AAAAAAAA81587}
+ 95 | {47,77} | {AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA74076,AAAAAAAAAA18107,AAAAA40681,AAAAAAAAAAAAAAA35875,AAAAA60038,AAAAAAA56483}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+ 97 | {54,2,86,65} | {47735,AAAAAAA99836,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAA29150,AAAAAAA80240,AAAAAAAAAAAAAAAA98414,AAAAAAA56483,AAAAAAAAAAAAAAAA29150,AAAAAAA39692,AA21643}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 99 | {37,86} | {AAAAAAAAAAAAAAAAAA32918,AAAAA70514,AAAAAAAAA10012,AAAAAAAAAAAAAAAAA59387,AAAAAAAAAA64777,AAAAAAAAAAAAAAAAAAA15356}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+ 101 | {} | {}
+ 102 | {NULL} | {NULL}
+(102 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{}' ORDER BY seqno;
+ seqno | i | t
+-------+---+---
+(0 rows)
+
+SELECT * FROM array_index_op_test WHERE i <@ '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+SELECT * FROM array_op_test WHERE i = '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+--------+--------
+ 102 | {NULL} | {NULL}
+(1 row)
+
+SELECT * FROM array_op_test WHERE i @> '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+---+---
+(0 rows)
+
+SELECT * FROM array_op_test WHERE i && '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+---+---
+(0 rows)
+
+SELECT * FROM array_op_test WHERE i <@ '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+CREATE INDEX textarrayidx ON array_index_op_test USING gin (t);
+explain (costs off)
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAAA72908}' ORDER BY seqno;
+ QUERY PLAN
+------------------------------------------------------------
+ Sort
+ Sort Key: seqno
+ -> Bitmap Heap Scan on array_index_op_test
+ Recheck Cond: (t @> '{AAAAAAAA72908}'::text[])
+ -> Bitmap Index Scan on textarrayidx
+ Index Cond: (t @> '{AAAAAAAA72908}'::text[])
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAAA72908}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+(4 rows)
+
+SELECT * FROM array_index_op_test WHERE t && '{AAAAAAAA72908}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+(4 rows)
+
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAAAAA646}' ORDER BY seqno;
+ seqno | i | t
+-------+------------------+--------------------------------------------------------------------
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+(3 rows)
+
+SELECT * FROM array_index_op_test WHERE t && '{AAAAAAAAAA646}' ORDER BY seqno;
+ seqno | i | t
+-------+------------------+--------------------------------------------------------------------
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+(3 rows)
+
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAAA72908,AAAAAAAAAA646}' ORDER BY seqno;
+ seqno | i | t
+-------+------+--------------------------------------------------------------------
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE t && '{AAAAAAAA72908,AAAAAAAAAA646}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE t <@ '{AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}' ORDER BY seqno;
+ seqno | i | t
+-------+--------------------+-----------------------------------------------------------------------------------------------------------
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 101 | {} | {}
+(3 rows)
+
+SELECT * FROM array_index_op_test WHERE t = '{AAAAAAAAAA646,A87088}' ORDER BY seqno;
+ seqno | i | t
+-------+------------+------------------------
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE t = '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE t @> '{}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 1 | {92,75,71,52,64,83} | {AAAAAAAA44066,AAAAAA1059,AAAAAAAAAAA176,AAAAAAA48038}
+ 2 | {3,6} | {AAAAAA98232,AAAAAAAA79710,AAAAAAAAAAAAAAAAA69675,AAAAAAAAAAAAAAAA55798,AAAAAAAAA12793}
+ 3 | {37,64,95,43,3,41,13,30,11,43} | {AAAAAAAAAA48845,AAAAA75968,AAAAA95309,AAA54451,AAAAAAAAAA22292,AAAAAAA99836,A96617,AA17009,AAAAAAAAAAAAAA95246}
+ 4 | {71,39,99,55,33,75,45} | {AAAAAAAAA53663,AAAAAAAAAAAAAAA67062,AAAAAAAAAA64777,AAA99043,AAAAAAAAAAAAAAAAAAA91804,39557}
+ 5 | {50,42,77,50,4} | {AAAAAAAAAAAAAAAAA26540,AAAAAAA79710,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA176,AAAAA95309,AAAAAAAAAAA46154,AAAAAA66777,AAAAAAAAA27249,AAAAAAAAAA64777,AAAAAAAAAAAAAAAAAAA70104}
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 7 | {12,51,88,64,8} | {AAAAAAAAAAAAAAAAAA12591,AAAAAAAAAAAAAAAAA50407,AAAAAAAAAAAA67946}
+ 8 | {60,84} | {AAAAAAA81898,AAAAAA1059,AAAAAAAAAAAA81511,AAAAA961,AAAAAAAAAAAAAAAA31334,AAAAA64741,AA6416,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAAAA50407}
+ 9 | {56,52,35,27,80,44,81,22} | {AAAAAAAAAAAAAAA73034,AAAAAAAAAAAAA7929,AAAAAAA66161,AA88409,39557,A27153,AAAAAAAA9523,AAAAAAAAAAA99000}
+ 10 | {71,5,45} | {AAAAAAAAAAA21658,AAAAAAAAAAAA21089,AAA54451,AAAAAAAAAAAAAAAAAA54141,AAAAAAAAAAAAAA28620,AAAAAAAAAAA21658,AAAAAAAAAAA74076,AAAAAAAAA27249}
+ 11 | {41,86,74,48,22,74,47,50} | {AAAAAAAA9523,AAAAAAAAAAAA37562,AAAAAAAAAAAAAAAA14047,AAAAAAAAAAA46154,AAAA41702,AAAAAAAAAAAAAAAAA764,AAAAA62737,39557}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 13 | {3,52,34,23} | {AAAAAA98232,AAAA49534,AAAAAAAAAAA21658}
+ 14 | {78,57,19} | {AAAA8857,AAAAAAAAAAAAAAA73034,AAAAAAAA81587,AAAAAAAAAAAAAAA68526,AAAAA75968,AAAAAAAAAAAAAA65909,AAAAAAAAA10012,AAAAAAAAAAAAAA65909}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 16 | {14,63,85,11} | {AAAAAA66777}
+ 17 | {7,10,81,85} | {AAAAAA43678,AAAAAAA12144,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAAAAA15356}
+ 18 | {1} | {AAAAAAAAAAA33576,AAAAA95309,64261,AAA59323,AAAAAAAAAAAAAA95246,55847,AAAAAAAAAAAA67946,AAAAAAAAAAAAAAAAAA64374}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 20 | {72,89,70,51,54,37,8,49,79} | {AAAAAA58494}
+ 21 | {2,8,65,10,5,79,43} | {AAAAAAAAAAAAAAAAA88852,AAAAAAAAAAAAAAAAAAA91804,AAAAA64669,AAAAAAAAAAAAAAAA1443,AAAAAAAAAAAAAAAA23657,AAAAA12179,AAAAAAAAAAAAAAAAA88852,AAAAAAAAAAAAAAAA31334,AAAAAAAAAAAAAAAA41303,AAAAAAAAAAAAAAAAAAA85420}
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 23 | {40,90,5,38,72,40,30,10,43,55} | {A6053,AAAAAAAAAAA6119,AA44673,AAAAAAAAAAAAAAAAA764,AA17009,AAAAA17383,AAAAA70514,AAAAA33250,AAAAA95309,AAAAAAAAAAAA37562}
+ 24 | {94,61,99,35,48} | {AAAAAAAAAAA50956,AAAAAAAAAAA15165,AAAA85070,AAAAAAAAAAAAAAA36627,AAAAA961,AAAAAAAAAA55219}
+ 25 | {31,1,10,11,27,79,38} | {AAAAAAAAAAAAAAAAAA59334,45449}
+ 26 | {71,10,9,69,75} | {47735,AAAAAAA21462,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAAAAA91804,AAAAAAAAA72121,AAAAAAAAAAAAAAAAAAA1205,AAAAA41597,AAAA8857,AAAAAAAAAAAAAAAAAAA15356,AA17009}
+ 27 | {94} | {AA6416,A6053,AAAAAAA21462,AAAAAAA57334,AAAAAAAAAAAAAAAAAA12591,AA88409,AAAAAAAAAAAAA70254}
+ 28 | {14,33,6,34,14} | {AAAAAAAAAAAAAAA13198,AAAAAAAA69452,AAAAAAAAAAA82945,AAAAAAA12144,AAAAAAAAA72121,AAAAAAAAAA18601}
+ 29 | {39,21} | {AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAAAAA38885,AAAA85070,AAAAAAAAAAAAAAAAAAA70104,AAAAA66674,AAAAAAAAAAAAA62007,AAAAAAAA69452,AAAAAAA1242,AAAAAAAAAAAAAAAA1729,AAAA35194}
+ 30 | {26,81,47,91,34} | {AAAAAAAAAAAAAAAAAAA70104,AAAAAAA80240}
+ 31 | {80,24,18,21,54} | {AAAAAAAAAAAAAAA13198,AAAAAAAAAAAAAAAAAAA70415,A27153,AAAAAAAAA53663,AAAAAAAAAAAAAAAAA50407,A68938}
+ 32 | {58,79,82,80,67,75,98,10,41} | {AAAAAAAAAAAAAAAAAA61286,AAA54451,AAAAAAAAAAAAAAAAAAA87527,A96617,51533}
+ 33 | {74,73} | {A85417,AAAAAAA56483,AAAAA17383,AAAAAAAAAAAAA62159,AAAAAAAAAAAA52814,AAAAAAAAAAAAA85723,AAAAAAAAAAAAAAAAAA55796}
+ 34 | {70,45} | {AAAAAAAAAAAAAAAAAA71621,AAAAAAAAAAAAAA28620,AAAAAAAAAA55219,AAAAAAAA23648,AAAAAAAAAA22292,AAAAAAA1242}
+ 35 | {23,40} | {AAAAAAAAAAAA52814,AAAA48949,AAAAAAAAA34727,AAAA8857,AAAAAAAAAAAAAAAAAAA62179,AAAAAAAAAAAAAAA68526,AAAAAAA99836,AAAAAAAA50094,AAAA91194,AAAAAAAAAAAAA73084}
+ 36 | {79,82,14,52,30,5,79} | {AAAAAAAAA53663,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA89194,AA88409,AAAAAAAAAAAAAAA81326,AAAAAAAAAAAAAAAAA63050,AAAAAAAAAAAAAAAA33598}
+ 37 | {53,11,81,39,3,78,58,64,74} | {AAAAAAAAAAAAAAAAAAA17075,AAAAAAA66161,AAAAAAAA23648,AAAAAAAAAAAAAA10611}
+ 38 | {59,5,4,95,28} | {AAAAAAAAAAA82945,A96617,47735,AAAAA12179,AAAAA64669,AAAAAA99807,AA74433,AAAAAAAAAAAAAAAAA59387}
+ 39 | {82,43,99,16,74} | {AAAAAAAAAAAAAAA67062,AAAAAAA57334,AAAAAAAAAAAAAA65909,A27153,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAA64777,AAAAAAAAAAAA81511,AAAAAAAAAAAAAA65909,AAAAAAAAAAAAAA28620}
+ 40 | {34} | {AAAAAAAAAAAAAA10611,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAA31334,AAAAA70466,AAAAAAAA81587,AAAAAAA74623}
+ 41 | {19,26,63,12,93,73,27,94} | {AAAAAAA79710,AAAAAAAAAA55219,AAAA41702,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA71621,AAAAAAAAAAAAAAAAA63050,AAAAAAA99836,AAAAAAAAAAAAAA8666}
+ 42 | {15,76,82,75,8,91} | {AAAAAAAAAAA176,AAAAAA38063,45449,AAAAAA54032,AAAAAAA81898,AA6416,AAAAAAAAAAAAAAAAAAA62179,45449,AAAAA60038,AAAAAAAA81587}
+ 43 | {39,87,91,97,79,28} | {AAAAAAAAAAA74076,A96617,AAAAAAAAAAAAAAAAAAA89194,AAAAAAAAAAAAAAAAAA55796,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAA67946}
+ 44 | {40,58,68,29,54} | {AAAAAAA81898,AAAAAA66777,AAAAAA98232}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 46 | {53,24} | {AAAAAAAAAAA53908,AAAAAA54032,AAAAA17383,AAAA48949,AAAAAAAAAA18601,AAAAA64669,45449,AAAAAAAAAAA98051,AAAAAAAAAAAAAAAAAA71621}
+ 47 | {98,23,64,12,75,61} | {AAA59323,AAAAA95309,AAAAAAAAAAAAAAAA31334,AAAAAAAAA27249,AAAAA17383,AAAAAAAAAAAA37562,AAAAAA1059,A84822,55847,AAAAA70466}
+ 48 | {76,14} | {AAAAAAAAAAAAA59671,AAAAAAAAAAAAAAAAAAA91804,AAAAAA66777,AAAAAAAAAAAAAAAAAAA89194,AAAAAAAAAAAAAAA36627,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAA73084,AAAAAAA79710,AAAAAAAAAAAAAAA40402,AAAAAAAAAAAAAAAAAAA65037}
+ 49 | {56,5,54,37,49} | {AA21643,AAAAAAAAAAA92631,AAAAAAAA81587}
+ 50 | {20,12,37,64,93} | {AAAAAAAAAA5483,AAAAAAAAAAAAAAAAAAA1205,AA6416,AAAAAAAAAAAAAAAAA63050,AAAAAAAAAAAAAAAAAA47955}
+ 51 | {47} | {AAAAAAAAAAAAAA96505,AAAAAAAAAAAAAAAAAA36842,AAAAA95309,AAAAAAAA81587,AA6416,AAAA91194,AAAAAA58494,AAAAAA1059,AAAAAAAA69452}
+ 52 | {89,0} | {AAAAAAAAAAAAAAAAAA47955,AAAAAAA48038,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAAAAA73084,AAAAA70466,AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA46154,AA66862}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 54 | {70,47} | {AAAAAAAAAAAAAAAAAA54141,AAAAA40681,AAAAAAA48038,AAAAAAAAAAAAAAAA29150,AAAAA41597,AAAAAAAAAAAAAAAAAA59334,AA15322}
+ 55 | {47,79,47,64,72,25,71,24,93} | {AAAAAAAAAAAAAAAAAA55796,AAAAA62737}
+ 56 | {33,7,60,54,93,90,77,85,39} | {AAAAAAAAAAAAAAAAAA32918,AA42406}
+ 57 | {23,45,10,42,36,21,9,96} | {AAAAAAAAAAAAAAAAAAA70415}
+ 58 | {92} | {AAAAAAAAAAAAAAAA98414,AAAAAAAA23648,AAAAAAAAAAAAAAAAAA55796,AA25381,AAAAAAAAAAA6119}
+ 59 | {9,69,46,77} | {39557,AAAAAAA89932,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAAAAAAAAA26540,AAA20874,AA6416,AAAAAAAAAAAAAAAAAA47955}
+ 60 | {62,2,59,38,89} | {AAAAAAA89932,AAAAAAAAAAAAAAAAAAA15356,AA99927,AA17009,AAAAAAAAAAAAAAA35875}
+ 61 | {72,2,44,95,54,54,13} | {AAAAAAAAAAAAAAAAAAA91804}
+ 62 | {83,72,29,73} | {AAAAAAAAAAAAA15097,AAAA8857,AAAAAAAAAAAA35809,AAAAAAAAAAAA52814,AAAAAAAAAAAAAAAAAAA38885,AAAAAAAAAAAAAAAAAA24183,AAAAAA43678,A96617}
+ 63 | {11,4,61,87} | {AAAAAAAAA27249,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAA13198,AAA20874,39557,51533,AAAAAAAAAAA53908,AAAAAAAAAAAAAA96505,AAAAAAAA78938}
+ 64 | {26,19,34,24,81,78} | {A96617,AAAAAAAAAAAAAAAAAAA70104,A68938,AAAAAAAAAAA53908,AAAAAAAAAAAAAAA453,AA17009,AAAAAAA80240}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 66 | {31,23,70,52,4,33,48,25} | {AAAAAAAAAAAAAAAAA69675,AAAAAAAA50094,AAAAAAAAAAA92631,AAAA35194,39557,AAAAAAA99836}
+ 67 | {31,94,7,10} | {AAAAAA38063,A96617,AAAA35194,AAAAAAAAAAAA67946}
+ 68 | {90,43,38} | {AA75092,AAAAAAAAAAAAAAAAA69675,AAAAAAAAAAA92631,AAAAAAAAA10012,AAAAAAAAAAAAA7929,AA21643}
+ 69 | {67,35,99,85,72,86,44} | {AAAAAAAAAAAAAAAAAAA1205,AAAAAAAA50094,AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAAAAAAA47955}
+ 70 | {56,70,83} | {AAAA41702,AAAAAAAAAAA82945,AA21643,AAAAAAAAAAA99000,A27153,AA25381,AAAAAAAAAAAAAA96505,AAAAAAA1242}
+ 71 | {74,26} | {AAAAAAAAAAA50956,AA74433,AAAAAAA21462,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAA36627,AAAAAAAAAAAAA70254,AAAAAAAAAA43419,39557}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 73 | {88,25,96,78,65,15,29,19} | {AAA54451,AAAAAAAAA27249,AAAAAAA9228,AAAAAAAAAAAAAAA67062,AAAAAAAAAAAAAAAAAAA70415,AAAAA17383,AAAAAAAAAAAAAAAA33598}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 75 | {12,96,83,24,71,89,55} | {AAAA48949,AAAAAAAA29716,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAAA67946,AAAAAAAAAAAAAAAA29150,AAA28075,AAAAAAAAAAAAAAAAA43052}
+ 76 | {92,55,10,7} | {AAAAAAAAAAAAAAA67062}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 78 | {55,89,44,84,34} | {AAAAAAAAAAA6119,AAAAAAAAAAAAAA8666,AA99927,AA42406,AAAAAAA81898,AAAAAAA9228,AAAAAAAAAAA92631,AA21643,AAAAAAAAAAAAAA28620}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 80 | {74,89,44,80,0} | {AAAA35194,AAAAAAAA79710,AAA20874,AAAAAAAAAAAAAAAAAAA70104,AAAAAAAAAAAAA73084,AAAAAAA57334,AAAAAAA9228,AAAAAAAAAAAAA62007}
+ 81 | {63,77,54,48,61,53,97} | {AAAAAAAAAAAAAAA81326,AAAAAAAAAA22292,AA25381,AAAAAAAAAAA74076,AAAAAAA81898,AAAAAAAAA72121}
+ 82 | {34,60,4,79,78,16,86,89,42,50} | {AAAAA40681,AAAAAAAAAAAAAAAAAA12591,AAAAAAA80240,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA70104}
+ 83 | {14,10} | {AAAAAAAAAA22292,AAAAAAAAAAAAA70254,AAAAAAAAAAA6119}
+ 84 | {11,83,35,13,96,94} | {AAAAA95309,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAAAAA24183}
+ 85 | {39,60} | {AAAAAAAAAAAAAAAA55798,AAAAAAAAAA22292,AAAAAAA66161,AAAAAAA21462,AAAAAAAAAAAAAAAAAA12591,55847,AAAAAA98232,AAAAAAAAAAA46154}
+ 86 | {33,81,72,74,45,36,82} | {AAAAAAAA81587,AAAAAAAAAAAAAA96505,45449,AAAA80176}
+ 87 | {57,27,50,12,97,68} | {AAAAAAAAAAAAAAAAA26540,AAAAAAAAA10012,AAAAAAAAAAAA35809,AAAAAAAAAAAAAAAA29150,AAAAAAAAAAA82945,AAAAAA66777,31228,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAA96505}
+ 88 | {41,90,77,24,6,24} | {AAAA35194,AAAA35194,AAAAAAA80240,AAAAAAAAAAA46154,AAAAAA58494,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA59334,AAAAAAAAAAAAAAAAAAA91804,AA74433}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 90 | {88,75} | {AAAAA60038,AAAAAAAA23648,AAAAAAAAAAA99000,AAAA41702,AAAAAAAAAAAAA22860,AAAAAAAAAAAAAAA68526}
+ 91 | {78} | {AAAAAAAAAAAAA62007,AAA99043}
+ 92 | {85,63,49,45} | {AAAAAAA89932,AAAAAAAAAAAAA22860,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAAA21089}
+ 93 | {11} | {AAAAAAAAAAA176,AAAAAAAAAAAAAA8666,AAAAAAAAAAAAAAA453,AAAAAAAAAAAAA85723,A68938,AAAAAAAAAAAAA9821,AAAAAAA48038,AAAAAAAAAAAAAAAAA59387,AA99927,AAAAA17383}
+ 94 | {98,9,85,62,88,91,60,61,38,86} | {AAAAAAAA81587,AAAAA17383,AAAAAAAA81587}
+ 95 | {47,77} | {AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA74076,AAAAAAAAAA18107,AAAAA40681,AAAAAAAAAAAAAAA35875,AAAAA60038,AAAAAAA56483}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+ 97 | {54,2,86,65} | {47735,AAAAAAA99836,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAA29150,AAAAAAA80240,AAAAAAAAAAAAAAAA98414,AAAAAAA56483,AAAAAAAAAAAAAAAA29150,AAAAAAA39692,AA21643}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 99 | {37,86} | {AAAAAAAAAAAAAAAAAA32918,AAAAA70514,AAAAAAAAA10012,AAAAAAAAAAAAAAAAA59387,AAAAAAAAAA64777,AAAAAAAAAAAAAAAAAAA15356}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+ 101 | {} | {}
+ 102 | {NULL} | {NULL}
+(102 rows)
+
+SELECT * FROM array_index_op_test WHERE t && '{}' ORDER BY seqno;
+ seqno | i | t
+-------+---+---
+(0 rows)
+
+SELECT * FROM array_index_op_test WHERE t <@ '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+-- And try it with a multicolumn GIN index
+DROP INDEX intarrayidx, textarrayidx;
+CREATE INDEX botharrayidx ON array_index_op_test USING gin (i, t);
+SELECT * FROM array_index_op_test WHERE i @> '{32}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{32}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAA80240}' ORDER BY seqno;
+ seqno | i | t
+-------+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 30 | {26,81,47,91,34} | {AAAAAAAAAAAAAAAAAAA70104,AAAAAAA80240}
+ 64 | {26,19,34,24,81,78} | {A96617,AAAAAAAAAAAAAAAAAAA70104,A68938,AAAAAAAAAAA53908,AAAAAAAAAAAAAAA453,AA17009,AAAAAAA80240}
+ 82 | {34,60,4,79,78,16,86,89,42,50} | {AAAAA40681,AAAAAAAAAAAAAAAAAA12591,AAAAAAA80240,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA70104}
+ 88 | {41,90,77,24,6,24} | {AAAA35194,AAAA35194,AAAAAAA80240,AAAAAAAAAAA46154,AAAAAA58494,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA59334,AAAAAAAAAAAAAAAAAAA91804,AA74433}
+ 97 | {54,2,86,65} | {47735,AAAAAAA99836,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAA29150,AAAAAAA80240,AAAAAAAAAAAAAAAA98414,AAAAAAA56483,AAAAAAAAAAAAAAAA29150,AAAAAAA39692,AA21643}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(7 rows)
+
+SELECT * FROM array_index_op_test WHERE t && '{AAAAAAA80240}' ORDER BY seqno;
+ seqno | i | t
+-------+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 30 | {26,81,47,91,34} | {AAAAAAAAAAAAAAAAAAA70104,AAAAAAA80240}
+ 64 | {26,19,34,24,81,78} | {A96617,AAAAAAAAAAAAAAAAAAA70104,A68938,AAAAAAAAAAA53908,AAAAAAAAAAAAAAA453,AA17009,AAAAAAA80240}
+ 82 | {34,60,4,79,78,16,86,89,42,50} | {AAAAA40681,AAAAAAAAAAAAAAAAAA12591,AAAAAAA80240,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA70104}
+ 88 | {41,90,77,24,6,24} | {AAAA35194,AAAA35194,AAAAAAA80240,AAAAAAAAAAA46154,AAAAAA58494,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA59334,AAAAAAAAAAAAAAAAAAA91804,AA74433}
+ 97 | {54,2,86,65} | {47735,AAAAAAA99836,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAA29150,AAAAAAA80240,AAAAAAAAAAAAAAAA98414,AAAAAAA56483,AAAAAAAAAAAAAAAA29150,AAAAAAA39692,AA21643}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(7 rows)
+
+SELECT * FROM array_index_op_test WHERE i @> '{32}' AND t && '{AAAAAAA80240}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------------+------------------------------------------------------------------------------
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE i && '{32}' AND t @> '{AAAAAAA80240}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------------+------------------------------------------------------------------------------
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE t = '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+SELECT * FROM array_op_test WHERE i = '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+--------+--------
+ 102 | {NULL} | {NULL}
+(1 row)
+
+SELECT * FROM array_op_test WHERE i <@ '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+RESET enable_seqscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+--
+-- Try a GIN index with a lot of items with same key. (GIN creates a posting
+-- tree when there are enough duplicates)
+--
+CREATE TABLE array_gin_test (a int[]);
+INSERT INTO array_gin_test SELECT ARRAY[1, g%5, g] FROM generate_series(1, 10000) g;
+CREATE INDEX array_gin_test_idx ON array_gin_test USING gin (a);
+SELECT COUNT(*) FROM array_gin_test WHERE a @> '{2}';
+ count
+-------
+ 2000
+(1 row)
+
+DROP TABLE array_gin_test;
+--
+-- Test GIN index's reloptions
+--
+CREATE INDEX gin_relopts_test ON array_index_op_test USING gin (i)
+ WITH (FASTUPDATE=on, GIN_PENDING_LIST_LIMIT=128);
+\d+ gin_relopts_test
+ Index "public.gin_relopts_test"
+ Column | Type | Key? | Definition | Storage | Stats target
+--------+---------+------+------------+---------+--------------
+ i | integer | yes | i | plain |
+gin, for table "public.array_index_op_test"
+Options: fastupdate=on, gin_pending_list_limit=128
+
+--
+-- HASH
+--
+CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
+CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
+CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
+CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops) WITH (fillfactor=60);
+CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
+CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
+DROP TABLE unlogged_hash_table;
+-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
+-- Test hash index build tuplesorting. Force hash tuplesort using low
+-- maintenance_work_mem setting and fillfactor:
+SET maintenance_work_mem = '1MB';
+CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
+ QUERY PLAN
+-------------------------------------------------------
+ Aggregate
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (stringu1 = 'TVAAAA'::name)
+ -> Bitmap Index Scan on hash_tuplesort_idx
+ Index Cond: (stringu1 = 'TVAAAA'::name)
+(5 rows)
+
+SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
+ count
+-------
+ 14
+(1 row)
+
+DROP INDEX hash_tuplesort_idx;
+RESET maintenance_work_mem;
+--
+-- Test functional index
+--
+CREATE TABLE func_index_heap (f1 text, f2 text);
+CREATE UNIQUE INDEX func_index_index on func_index_heap (textcat(f1,f2));
+INSERT INTO func_index_heap VALUES('ABC','DEF');
+INSERT INTO func_index_heap VALUES('AB','CDEFG');
+INSERT INTO func_index_heap VALUES('QWE','RTY');
+-- this should fail because of unique index:
+INSERT INTO func_index_heap VALUES('ABCD', 'EF');
+ERROR: duplicate key value violates unique constraint "func_index_index"
+DETAIL: Key (textcat(f1, f2))=(ABCDEF) already exists.
+-- but this shouldn't:
+INSERT INTO func_index_heap VALUES('QWERTY');
+--
+-- Same test, expressional index
+--
+DROP TABLE func_index_heap;
+CREATE TABLE func_index_heap (f1 text, f2 text);
+CREATE UNIQUE INDEX func_index_index on func_index_heap ((f1 || f2) text_ops);
+INSERT INTO func_index_heap VALUES('ABC','DEF');
+INSERT INTO func_index_heap VALUES('AB','CDEFG');
+INSERT INTO func_index_heap VALUES('QWE','RTY');
+-- this should fail because of unique index:
+INSERT INTO func_index_heap VALUES('ABCD', 'EF');
+ERROR: duplicate key value violates unique constraint "func_index_index"
+DETAIL: Key ((f1 || f2))=(ABCDEF) already exists.
+-- but this shouldn't:
+INSERT INTO func_index_heap VALUES('QWERTY');
+--
+-- Test unique index with included columns
+--
+CREATE TABLE covering_index_heap (f1 int, f2 int, f3 text);
+CREATE UNIQUE INDEX covering_index_index on covering_index_heap (f1,f2) INCLUDE(f3);
+INSERT INTO covering_index_heap VALUES(1,1,'AAA');
+INSERT INTO covering_index_heap VALUES(1,2,'AAA');
+-- this should fail because of unique index on f1,f2:
+INSERT INTO covering_index_heap VALUES(1,2,'BBB');
+ERROR: duplicate key value violates unique constraint "covering_index_index"
+DETAIL: Key (f1, f2)=(1, 2) already exists.
+-- and this shouldn't:
+INSERT INTO covering_index_heap VALUES(1,4,'AAA');
+-- Try to build index on table that already contains data
+CREATE UNIQUE INDEX covering_pkey on covering_index_heap (f1,f2) INCLUDE(f3);
+-- Try to use existing covering index as primary key
+ALTER TABLE covering_index_heap ADD CONSTRAINT covering_pkey PRIMARY KEY USING INDEX
+covering_pkey;
+DROP TABLE covering_index_heap;
+--
+-- Also try building functional, expressional, and partial indexes on
+-- tables that already contain data.
+--
+create unique index hash_f8_index_1 on hash_f8_heap(abs(random));
+create unique index hash_f8_index_2 on hash_f8_heap((seqno + 1), random);
+create unique index hash_f8_index_3 on hash_f8_heap(random) where seqno > 1000;
+--
+-- Try some concurrent index builds
+--
+-- Unfortunately this only tests about half the code paths because there are
+-- no concurrent updates happening to the table at the same time.
+CREATE TABLE concur_heap (f1 text, f2 text);
+-- empty table
+CREATE INDEX CONCURRENTLY concur_index1 ON concur_heap(f2,f1);
+CREATE INDEX CONCURRENTLY IF NOT EXISTS concur_index1 ON concur_heap(f2,f1);
+NOTICE: relation "concur_index1" already exists, skipping
+INSERT INTO concur_heap VALUES ('a','b');
+INSERT INTO concur_heap VALUES ('b','b');
+-- unique index
+CREATE UNIQUE INDEX CONCURRENTLY concur_index2 ON concur_heap(f1);
+CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
+NOTICE: relation "concur_index2" already exists, skipping
+-- check if constraint is set up properly to be enforced
+INSERT INTO concur_heap VALUES ('b','x');
+ERROR: duplicate key value violates unique constraint "concur_index2"
+DETAIL: Key (f1)=(b) already exists.
+-- check if constraint is enforced properly at build time
+CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+ERROR: could not create unique index "concur_index3"
+DETAIL: Key (f2)=(b) is duplicated.
+-- test that expression indexes and partial indexes work concurrently
+CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
+CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
+-- here we also check that you can default the index name
+CREATE INDEX CONCURRENTLY on concur_heap((f2||f1));
+-- You can't do a concurrent index build in a transaction
+BEGIN;
+CREATE INDEX CONCURRENTLY concur_index7 ON concur_heap(f1);
+ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block
+COMMIT;
+-- But you can do a regular index build in a transaction
+BEGIN;
+CREATE INDEX std_index on concur_heap(f2);
+COMMIT;
+-- Failed builds are left invalid by VACUUM FULL, fixed by REINDEX
+VACUUM FULL concur_heap;
+REINDEX TABLE concur_heap;
+ERROR: could not create unique index "concur_index3"
+DETAIL: Key (f2)=(b) is duplicated.
+DELETE FROM concur_heap WHERE f1 = 'b';
+VACUUM FULL concur_heap;
+\d concur_heap
+ Table "public.concur_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "concur_heap_expr_idx" btree ((f2 || f1))
+ "concur_index1" btree (f2, f1)
+ "concur_index2" UNIQUE, btree (f1)
+ "concur_index3" UNIQUE, btree (f2) INVALID
+ "concur_index4" btree (f2) WHERE f1 = 'a'::text
+ "concur_index5" btree (f2) WHERE f1 = 'x'::text
+ "std_index" btree (f2)
+
+REINDEX TABLE concur_heap;
+\d concur_heap
+ Table "public.concur_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "concur_heap_expr_idx" btree ((f2 || f1))
+ "concur_index1" btree (f2, f1)
+ "concur_index2" UNIQUE, btree (f1)
+ "concur_index3" UNIQUE, btree (f2)
+ "concur_index4" btree (f2) WHERE f1 = 'a'::text
+ "concur_index5" btree (f2) WHERE f1 = 'x'::text
+ "std_index" btree (f2)
+
+--
+-- Try some concurrent index drops
+--
+DROP INDEX CONCURRENTLY "concur_index2"; -- works
+DROP INDEX CONCURRENTLY IF EXISTS "concur_index2"; -- notice
+NOTICE: index "concur_index2" does not exist, skipping
+-- failures
+DROP INDEX CONCURRENTLY "concur_index2", "concur_index3";
+ERROR: DROP INDEX CONCURRENTLY does not support dropping multiple objects
+BEGIN;
+DROP INDEX CONCURRENTLY "concur_index5";
+ERROR: DROP INDEX CONCURRENTLY cannot run inside a transaction block
+ROLLBACK;
+-- successes
+DROP INDEX CONCURRENTLY IF EXISTS "concur_index3";
+DROP INDEX CONCURRENTLY "concur_index4";
+DROP INDEX CONCURRENTLY "concur_index5";
+DROP INDEX CONCURRENTLY "concur_index1";
+DROP INDEX CONCURRENTLY "concur_heap_expr_idx";
+\d concur_heap
+ Table "public.concur_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "std_index" btree (f2)
+
+DROP TABLE concur_heap;
+--
+-- Test ADD CONSTRAINT USING INDEX
+--
+CREATE TABLE cwi_test( a int , b varchar(10), c char);
+-- add some data so that all tests have something to work with.
+INSERT INTO cwi_test VALUES(1, 2), (3, 4), (5, 6);
+CREATE UNIQUE INDEX cwi_uniq_idx ON cwi_test(a , b);
+ALTER TABLE cwi_test ADD primary key USING INDEX cwi_uniq_idx;
+\d cwi_test
+ Table "public.cwi_test"
+ Column | Type | Collation | Nullable | Default
+--------+-----------------------+-----------+----------+---------
+ a | integer | | not null |
+ b | character varying(10) | | not null |
+ c | character(1) | | |
+Indexes:
+ "cwi_uniq_idx" PRIMARY KEY, btree (a, b)
+
+\d cwi_uniq_idx
+ Index "public.cwi_uniq_idx"
+ Column | Type | Key? | Definition
+--------+-----------------------+------+------------
+ a | integer | yes | a
+ b | character varying(10) | yes | b
+primary key, btree, for table "public.cwi_test"
+
+CREATE UNIQUE INDEX cwi_uniq2_idx ON cwi_test(b , a);
+ALTER TABLE cwi_test DROP CONSTRAINT cwi_uniq_idx,
+ ADD CONSTRAINT cwi_replaced_pkey PRIMARY KEY
+ USING INDEX cwi_uniq2_idx;
+NOTICE: ALTER TABLE / ADD CONSTRAINT USING INDEX will rename index "cwi_uniq2_idx" to "cwi_replaced_pkey"
+\d cwi_test
+ Table "public.cwi_test"
+ Column | Type | Collation | Nullable | Default
+--------+-----------------------+-----------+----------+---------
+ a | integer | | not null |
+ b | character varying(10) | | not null |
+ c | character(1) | | |
+Indexes:
+ "cwi_replaced_pkey" PRIMARY KEY, btree (b, a)
+
+\d cwi_replaced_pkey
+ Index "public.cwi_replaced_pkey"
+ Column | Type | Key? | Definition
+--------+-----------------------+------+------------
+ b | character varying(10) | yes | b
+ a | integer | yes | a
+primary key, btree, for table "public.cwi_test"
+
+DROP INDEX cwi_replaced_pkey; -- Should fail; a constraint depends on it
+ERROR: cannot drop index cwi_replaced_pkey because constraint cwi_replaced_pkey on table cwi_test requires it
+HINT: You can drop constraint cwi_replaced_pkey on table cwi_test instead.
+DROP TABLE cwi_test;
+-- ADD CONSTRAINT USING INDEX is forbidden on partitioned tables
+CREATE TABLE cwi_test(a int) PARTITION BY hash (a);
+create unique index on cwi_test (a);
+alter table cwi_test add primary key using index cwi_test_a_idx ;
+ERROR: ALTER TABLE / ADD CONSTRAINT USING INDEX is not supported on partitioned tables
+DROP TABLE cwi_test;
+--
+-- Check handling of indexes on system columns
+--
+CREATE TABLE syscol_table (a INT);
+-- System columns cannot be indexed
+CREATE INDEX ON syscolcol_table (ctid);
+ERROR: relation "syscolcol_table" does not exist
+-- nor used in expressions
+CREATE INDEX ON syscol_table ((ctid >= '(1000,0)'));
+ERROR: index creation on system columns is not supported
+-- nor used in predicates
+CREATE INDEX ON syscol_table (a) WHERE ctid >= '(1000,0)';
+ERROR: index creation on system columns is not supported
+DROP TABLE syscol_table;
+--
+-- Tests for IS NULL/IS NOT NULL with b-tree indexes
+--
+SELECT unique1, unique2 INTO onek_with_null FROM onek;
+INSERT INTO onek_with_null (unique1,unique2) VALUES (NULL, -1), (NULL, NULL);
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2,unique1);
+SET enable_seqscan = OFF;
+SET enable_indexscan = ON;
+SET enable_bitmapscan = ON;
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NOT NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL AND unique1 > 500;
+ count
+-------
+ 499
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique1 > 500;
+ count
+-------
+ 0
+(1 row)
+
+DROP INDEX onek_nulltest;
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2 desc,unique1);
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NOT NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL AND unique1 > 500;
+ count
+-------
+ 499
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique1 > 500;
+ count
+-------
+ 0
+(1 row)
+
+DROP INDEX onek_nulltest;
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2 desc nulls last,unique1);
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NOT NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL AND unique1 > 500;
+ count
+-------
+ 499
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique1 > 500;
+ count
+-------
+ 0
+(1 row)
+
+DROP INDEX onek_nulltest;
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2 nulls first,unique1);
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NOT NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL AND unique1 > 500;
+ count
+-------
+ 499
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique1 > 500;
+ count
+-------
+ 0
+(1 row)
+
+DROP INDEX onek_nulltest;
+-- Check initial-positioning logic too
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2);
+SET enable_seqscan = OFF;
+SET enable_indexscan = ON;
+SET enable_bitmapscan = OFF;
+SELECT unique1, unique2 FROM onek_with_null
+ ORDER BY unique2 LIMIT 2;
+ unique1 | unique2
+---------+---------
+ | -1
+ 147 | 0
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null WHERE unique2 >= -1
+ ORDER BY unique2 LIMIT 2;
+ unique1 | unique2
+---------+---------
+ | -1
+ 147 | 0
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null WHERE unique2 >= 0
+ ORDER BY unique2 LIMIT 2;
+ unique1 | unique2
+---------+---------
+ 147 | 0
+ 931 | 1
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null
+ ORDER BY unique2 DESC LIMIT 2;
+ unique1 | unique2
+---------+---------
+ |
+ 278 | 999
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null WHERE unique2 >= -1
+ ORDER BY unique2 DESC LIMIT 2;
+ unique1 | unique2
+---------+---------
+ 278 | 999
+ 0 | 998
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null WHERE unique2 < 999
+ ORDER BY unique2 DESC LIMIT 2;
+ unique1 | unique2
+---------+---------
+ 0 | 998
+ 744 | 997
+(2 rows)
+
+RESET enable_seqscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+DROP TABLE onek_with_null;
+--
+-- Check bitmap index path planning
+--
+EXPLAIN (COSTS OFF)
+SELECT * FROM tenk1
+ WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1
+ Recheck Cond: (((thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3)) OR ((thousand = 42) AND (tenthous = 42)))
+ -> BitmapOr
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: ((thousand = 42) AND (tenthous = 1))
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: ((thousand = 42) AND (tenthous = 3))
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: ((thousand = 42) AND (tenthous = 42))
+(9 rows)
+
+SELECT * FROM tenk1
+ WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+ 42 | 5530 | 0 | 2 | 2 | 2 | 42 | 42 | 42 | 42 | 42 | 84 | 85 | QBAAAA | SEIAAA | OOOOxx
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM tenk1
+ WHERE hundred = 42 AND (thousand = 42 OR thousand = 99);
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Aggregate
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: ((hundred = 42) AND ((thousand = 42) OR (thousand = 99)))
+ -> BitmapAnd
+ -> Bitmap Index Scan on tenk1_hundred
+ Index Cond: (hundred = 42)
+ -> BitmapOr
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: (thousand = 42)
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: (thousand = 99)
+(11 rows)
+
+SELECT count(*) FROM tenk1
+ WHERE hundred = 42 AND (thousand = 42 OR thousand = 99);
+ count
+-------
+ 10
+(1 row)
+
+--
+-- Check behavior with duplicate index column contents
+--
+CREATE TABLE dupindexcols AS
+ SELECT unique1 as id, stringu2::text as f1 FROM tenk1;
+CREATE INDEX dupindexcols_i ON dupindexcols (f1, id, f1 text_pattern_ops);
+ANALYZE dupindexcols;
+EXPLAIN (COSTS OFF)
+ SELECT count(*) FROM dupindexcols
+ WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
+ Aggregate
+ -> Bitmap Heap Scan on dupindexcols
+ Recheck Cond: ((f1 >= 'WA'::text) AND (f1 <= 'ZZZ'::text) AND (id < 1000) AND (f1 ~<~ 'YX'::text))
+ -> Bitmap Index Scan on dupindexcols_i
+ Index Cond: ((f1 >= 'WA'::text) AND (f1 <= 'ZZZ'::text) AND (id < 1000) AND (f1 ~<~ 'YX'::text))
+(5 rows)
+
+SELECT count(*) FROM dupindexcols
+ WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
+ count
+-------
+ 97
+(1 row)
+
+--
+-- Check ordering of =ANY indexqual results (bug in 9.2.0)
+--
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = OFF;
+explain (costs off)
+SELECT unique1 FROM tenk1
+WHERE unique1 IN (1,42,7)
+ORDER BY unique1;
+ QUERY PLAN
+-------------------------------------------------------
+ Index Only Scan using tenk1_unique1 on tenk1
+ Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+(2 rows)
+
+SELECT unique1 FROM tenk1
+WHERE unique1 IN (1,42,7)
+ORDER BY unique1;
+ unique1
+---------
+ 1
+ 7
+ 42
+(3 rows)
+
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand;
+ QUERY PLAN
+-------------------------------------------------------
+ Index Only Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: (thousand < 2)
+ Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
+(3 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand;
+ thousand | tenthous
+----------+----------
+ 0 | 3000
+ 1 | 1001
+(2 rows)
+
+RESET enable_indexscan;
+SET enable_indexonlyscan = OFF;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Sort Key: thousand
+ -> Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(4 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand;
+ thousand | tenthous
+----------+----------
+ 0 | 3000
+ 1 | 1001
+(2 rows)
+
+RESET enable_seqscan;
+RESET enable_bitmapscan;
+RESET enable_indexonlyscan;
+--
+-- Check elimination of constant-NULL subexpressions
+--
+explain (costs off)
+ select * from tenk1 where (thousand, tenthous) in ((1,1001), (null,null));
+ QUERY PLAN
+------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand = 1) AND (tenthous = 1001))
+(2 rows)
+
+--
+-- Check matching of boolean index columns to WHERE conditions and sort keys
+--
+create temp table boolindex (b bool, i int, unique(b, i), junk float);
+explain (costs off)
+ select * from boolindex order by b, i limit 10;
+ QUERY PLAN
+-------------------------------------------------------
+ Limit
+ -> Index Scan using boolindex_b_i_key on boolindex
+(2 rows)
+
+explain (costs off)
+ select * from boolindex where b order by i limit 10;
+ QUERY PLAN
+-------------------------------------------------------
+ Limit
+ -> Index Scan using boolindex_b_i_key on boolindex
+ Index Cond: (b = true)
+(3 rows)
+
+explain (costs off)
+ select * from boolindex where b = true order by i desc limit 10;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ -> Index Scan Backward using boolindex_b_i_key on boolindex
+ Index Cond: (b = true)
+(3 rows)
+
+explain (costs off)
+ select * from boolindex where not b order by i limit 10;
+ QUERY PLAN
+-------------------------------------------------------
+ Limit
+ -> Index Scan using boolindex_b_i_key on boolindex
+ Index Cond: (b = false)
+(3 rows)
+
+explain (costs off)
+ select * from boolindex where b is true order by i desc limit 10;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ -> Index Scan Backward using boolindex_b_i_key on boolindex
+ Index Cond: (b = true)
+(3 rows)
+
+explain (costs off)
+ select * from boolindex where b is false order by i desc limit 10;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ -> Index Scan Backward using boolindex_b_i_key on boolindex
+ Index Cond: (b = false)
+(3 rows)
+
+--
+-- REINDEX (VERBOSE)
+--
+CREATE TABLE reindex_verbose(id integer primary key);
+\set VERBOSITY terse \\ -- suppress machine-dependent details
+REINDEX (VERBOSE) TABLE reindex_verbose;
+INFO: index "reindex_verbose_pkey" was reindexed
+\set VERBOSITY default
+DROP TABLE reindex_verbose;
+--
+-- REINDEX CONCURRENTLY
+--
+CREATE TABLE concur_reindex_tab (c1 int);
+-- REINDEX
+REINDEX TABLE concur_reindex_tab; -- notice
+NOTICE: table "concur_reindex_tab" has no indexes to reindex
+REINDEX TABLE CONCURRENTLY concur_reindex_tab; -- notice
+NOTICE: table "concur_reindex_tab" has no indexes that can be reindexed concurrently
+ALTER TABLE concur_reindex_tab ADD COLUMN c2 text; -- add toast index
+-- Normal index with integer column
+CREATE UNIQUE INDEX concur_reindex_ind1 ON concur_reindex_tab(c1);
+-- Normal index with text column
+CREATE INDEX concur_reindex_ind2 ON concur_reindex_tab(c2);
+-- UNIQUE index with expression
+CREATE UNIQUE INDEX concur_reindex_ind3 ON concur_reindex_tab(abs(c1));
+-- Duplicate column names
+CREATE INDEX concur_reindex_ind4 ON concur_reindex_tab(c1, c1, c2);
+-- Create table for check on foreign key dependence switch with indexes swapped
+ALTER TABLE concur_reindex_tab ADD PRIMARY KEY USING INDEX concur_reindex_ind1;
+CREATE TABLE concur_reindex_tab2 (c1 int REFERENCES concur_reindex_tab);
+INSERT INTO concur_reindex_tab VALUES (1, 'a');
+INSERT INTO concur_reindex_tab VALUES (2, 'a');
+-- Reindex concurrently of exclusion constraint currently not supported
+CREATE TABLE concur_reindex_tab3 (c1 int, c2 int4range, EXCLUDE USING gist (c2 WITH &&));
+INSERT INTO concur_reindex_tab3 VALUES (3, '[1,2]');
+REINDEX INDEX CONCURRENTLY concur_reindex_tab3_c2_excl; -- error
+ERROR: concurrent index creation for exclusion constraints is not supported
+REINDEX TABLE CONCURRENTLY concur_reindex_tab3; -- succeeds with warning
+WARNING: cannot reindex exclusion constraint index "public.concur_reindex_tab3_c2_excl" concurrently, skipping
+NOTICE: table "concur_reindex_tab3" has no indexes that can be reindexed concurrently
+INSERT INTO concur_reindex_tab3 VALUES (4, '[2,4]');
+ERROR: conflicting key value violates exclusion constraint "concur_reindex_tab3_c2_excl"
+DETAIL: Key (c2)=([2,5)) conflicts with existing key (c2)=([1,3)).
+-- Check materialized views
+CREATE MATERIALIZED VIEW concur_reindex_matview AS SELECT * FROM concur_reindex_tab;
+REINDEX INDEX CONCURRENTLY concur_reindex_ind1;
+REINDEX TABLE CONCURRENTLY concur_reindex_tab;
+REINDEX TABLE CONCURRENTLY concur_reindex_matview;
+NOTICE: table "concur_reindex_matview" has no indexes that can be reindexed concurrently
+-- Check that comments are preserved
+CREATE TABLE testcomment (i int);
+CREATE INDEX testcomment_idx1 ON testcomment (i);
+COMMENT ON INDEX testcomment_idx1 IS 'test comment';
+SELECT obj_description('testcomment_idx1'::regclass, 'pg_class');
+ obj_description
+-----------------
+ test comment
+(1 row)
+
+REINDEX TABLE testcomment;
+SELECT obj_description('testcomment_idx1'::regclass, 'pg_class');
+ obj_description
+-----------------
+ test comment
+(1 row)
+
+REINDEX TABLE CONCURRENTLY testcomment ;
+SELECT obj_description('testcomment_idx1'::regclass, 'pg_class');
+ obj_description
+-----------------
+ test comment
+(1 row)
+
+DROP TABLE testcomment;
+-- Partitions
+-- Create some partitioned tables
+CREATE TABLE concur_reindex_part (c1 int, c2 int) PARTITION BY RANGE (c1);
+CREATE TABLE concur_reindex_part_0 PARTITION OF concur_reindex_part
+ FOR VALUES FROM (0) TO (10) PARTITION BY list (c2);
+CREATE TABLE concur_reindex_part_0_1 PARTITION OF concur_reindex_part_0
+ FOR VALUES IN (1);
+CREATE TABLE concur_reindex_part_0_2 PARTITION OF concur_reindex_part_0
+ FOR VALUES IN (2);
+-- This partitioned table will have no partitions.
+CREATE TABLE concur_reindex_part_10 PARTITION OF concur_reindex_part
+ FOR VALUES FROM (10) TO (20) PARTITION BY list (c2);
+-- Create some partitioned indexes
+CREATE INDEX concur_reindex_part_index ON ONLY concur_reindex_part (c1);
+CREATE INDEX concur_reindex_part_index_0 ON ONLY concur_reindex_part_0 (c1);
+ALTER INDEX concur_reindex_part_index ATTACH PARTITION concur_reindex_part_index_0;
+-- This partitioned index will have no partitions.
+CREATE INDEX concur_reindex_part_index_10 ON ONLY concur_reindex_part_10 (c1);
+ALTER INDEX concur_reindex_part_index ATTACH PARTITION concur_reindex_part_index_10;
+CREATE INDEX concur_reindex_part_index_0_1 ON ONLY concur_reindex_part_0_1 (c1);
+ALTER INDEX concur_reindex_part_index_0 ATTACH PARTITION concur_reindex_part_index_0_1;
+CREATE INDEX concur_reindex_part_index_0_2 ON ONLY concur_reindex_part_0_2 (c1);
+ALTER INDEX concur_reindex_part_index_0 ATTACH PARTITION concur_reindex_part_index_0_2;
+SELECT relid, parentrelid, level FROM pg_partition_tree('concur_reindex_part_index')
+ ORDER BY relid, level;
+ relid | parentrelid | level
+-------------------------------+-----------------------------+-------
+ concur_reindex_part_index | | 0
+ concur_reindex_part_index_0 | concur_reindex_part_index | 1
+ concur_reindex_part_index_10 | concur_reindex_part_index | 1
+ concur_reindex_part_index_0_1 | concur_reindex_part_index_0 | 2
+ concur_reindex_part_index_0_2 | concur_reindex_part_index_0 | 2
+(5 rows)
+
+-- REINDEX fails for partitioned indexes
+REINDEX INDEX concur_reindex_part_index_10;
+ERROR: REINDEX is not yet implemented for partitioned indexes
+REINDEX INDEX CONCURRENTLY concur_reindex_part_index_10;
+ERROR: REINDEX is not yet implemented for partitioned indexes
+-- REINDEX is a no-op for partitioned tables
+REINDEX TABLE concur_reindex_part_10;
+WARNING: REINDEX of partitioned tables is not yet implemented, skipping "concur_reindex_part_10"
+NOTICE: table "concur_reindex_part_10" has no indexes to reindex
+REINDEX TABLE CONCURRENTLY concur_reindex_part_10;
+WARNING: REINDEX of partitioned tables is not yet implemented, skipping "concur_reindex_part_10"
+NOTICE: table "concur_reindex_part_10" has no indexes that can be reindexed concurrently
+SELECT relid, parentrelid, level FROM pg_partition_tree('concur_reindex_part_index')
+ ORDER BY relid, level;
+ relid | parentrelid | level
+-------------------------------+-----------------------------+-------
+ concur_reindex_part_index | | 0
+ concur_reindex_part_index_0 | concur_reindex_part_index | 1
+ concur_reindex_part_index_10 | concur_reindex_part_index | 1
+ concur_reindex_part_index_0_1 | concur_reindex_part_index_0 | 2
+ concur_reindex_part_index_0_2 | concur_reindex_part_index_0 | 2
+(5 rows)
+
+-- REINDEX should preserve dependencies of partition tree.
+REINDEX INDEX CONCURRENTLY concur_reindex_part_index_0_1;
+REINDEX INDEX CONCURRENTLY concur_reindex_part_index_0_2;
+SELECT relid, parentrelid, level FROM pg_partition_tree('concur_reindex_part_index')
+ ORDER BY relid, level;
+ relid | parentrelid | level
+-------------------------------+-----------------------------+-------
+ concur_reindex_part_index | | 0
+ concur_reindex_part_index_0 | concur_reindex_part_index | 1
+ concur_reindex_part_index_10 | concur_reindex_part_index | 1
+ concur_reindex_part_index_0_1 | concur_reindex_part_index_0 | 2
+ concur_reindex_part_index_0_2 | concur_reindex_part_index_0 | 2
+(5 rows)
+
+REINDEX TABLE CONCURRENTLY concur_reindex_part_0_1;
+REINDEX TABLE CONCURRENTLY concur_reindex_part_0_2;
+SELECT relid, parentrelid, level FROM pg_partition_tree('concur_reindex_part_index')
+ ORDER BY relid, level;
+ relid | parentrelid | level
+-------------------------------+-----------------------------+-------
+ concur_reindex_part_index | | 0
+ concur_reindex_part_index_0 | concur_reindex_part_index | 1
+ concur_reindex_part_index_10 | concur_reindex_part_index | 1
+ concur_reindex_part_index_0_1 | concur_reindex_part_index_0 | 2
+ concur_reindex_part_index_0_2 | concur_reindex_part_index_0 | 2
+(5 rows)
+
+DROP TABLE concur_reindex_part;
+-- Check errors
+-- Cannot run inside a transaction block
+BEGIN;
+REINDEX TABLE CONCURRENTLY concur_reindex_tab;
+ERROR: REINDEX CONCURRENTLY cannot run inside a transaction block
+COMMIT;
+REINDEX TABLE CONCURRENTLY pg_class; -- no catalog relation
+ERROR: cannot reindex system catalogs concurrently
+REINDEX INDEX CONCURRENTLY pg_class_oid_index; -- no catalog index
+ERROR: cannot reindex system catalogs concurrently
+-- These are the toast table and index of pg_authid.
+REINDEX TABLE CONCURRENTLY pg_toast.pg_toast_1260; -- no catalog toast table
+ERROR: cannot reindex system catalogs concurrently
+REINDEX INDEX CONCURRENTLY pg_toast.pg_toast_1260_index; -- no catalog toast index
+ERROR: cannot reindex system catalogs concurrently
+REINDEX SYSTEM CONCURRENTLY postgres; -- not allowed for SYSTEM
+ERROR: cannot reindex system catalogs concurrently
+-- Warns about catalog relations
+REINDEX SCHEMA CONCURRENTLY pg_catalog;
+WARNING: cannot reindex system catalogs concurrently, skipping all
+-- Check the relation status, there should not be invalid indexes
+\d concur_reindex_tab
+ Table "public.concur_reindex_tab"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | not null |
+ c2 | text | | |
+Indexes:
+ "concur_reindex_ind1" PRIMARY KEY, btree (c1)
+ "concur_reindex_ind2" btree (c2)
+ "concur_reindex_ind3" UNIQUE, btree (abs(c1))
+ "concur_reindex_ind4" btree (c1, c1, c2)
+Referenced by:
+ TABLE "concur_reindex_tab2" CONSTRAINT "concur_reindex_tab2_c1_fkey" FOREIGN KEY (c1) REFERENCES concur_reindex_tab(c1)
+
+DROP MATERIALIZED VIEW concur_reindex_matview;
+DROP TABLE concur_reindex_tab, concur_reindex_tab2, concur_reindex_tab3;
+-- Check handling of invalid indexes
+CREATE TABLE concur_reindex_tab4 (c1 int);
+INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
+-- This trick creates an invalid index.
+CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+ERROR: could not create unique index "concur_reindex_ind5"
+DETAIL: Key (c1)=(1) is duplicated.
+-- Reindexing concurrently this index fails with the same failure.
+-- The extra index created is itself invalid, and can be dropped.
+REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
+ERROR: could not create unique index "concur_reindex_ind5_ccnew"
+DETAIL: Key (c1)=(1) is duplicated.
+\d concur_reindex_tab4
+ Table "public.concur_reindex_tab4"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+ "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
+
+DROP INDEX concur_reindex_ind5_ccnew;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- The invalid index is not processed when running REINDEX TABLE.
+REINDEX TABLE CONCURRENTLY concur_reindex_tab4;
+WARNING: cannot reindex invalid index "public.concur_reindex_ind5" concurrently, skipping
+NOTICE: table "concur_reindex_tab4" has no indexes that can be reindexed concurrently
+\d concur_reindex_tab4
+ Table "public.concur_reindex_tab4"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+
+-- But it is fixed with REINDEX INDEX.
+REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
+\d concur_reindex_tab4
+ Table "public.concur_reindex_tab4"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "concur_reindex_ind5" UNIQUE, btree (c1)
+
+DROP TABLE concur_reindex_tab4;
+-- Check handling of indexes with expressions and predicates. The
+-- definitions of the rebuilt indexes should match the original
+-- definitions.
+CREATE TABLE concur_exprs_tab (c1 int , c2 boolean);
+INSERT INTO concur_exprs_tab (c1, c2) VALUES (1369652450, FALSE),
+ (414515746, TRUE),
+ (897778963, FALSE);
+CREATE UNIQUE INDEX concur_exprs_index_expr
+ ON concur_exprs_tab ((c1::text COLLATE "C"));
+CREATE UNIQUE INDEX concur_exprs_index_pred ON concur_exprs_tab (c1)
+ WHERE (c1::text > 500000000::text COLLATE "C");
+CREATE UNIQUE INDEX concur_exprs_index_pred_2
+ ON concur_exprs_tab ((1 / c1))
+ WHERE ('-H') >= (c2::TEXT) COLLATE "C";
+SELECT pg_get_indexdef('concur_exprs_index_expr'::regclass);
+ pg_get_indexdef
+---------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_expr ON public.concur_exprs_tab USING btree (((c1)::text) COLLATE "C")
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred'::regclass);
+ pg_get_indexdef
+----------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred ON public.concur_exprs_tab USING btree (c1) WHERE ((c1)::text > ((500000000)::text COLLATE "C"))
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred_2'::regclass);
+ pg_get_indexdef
+--------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred_2 ON public.concur_exprs_tab USING btree (((1 / c1))) WHERE ('-H'::text >= ((c2)::text COLLATE "C"))
+(1 row)
+
+REINDEX TABLE CONCURRENTLY concur_exprs_tab;
+SELECT pg_get_indexdef('concur_exprs_index_expr'::regclass);
+ pg_get_indexdef
+---------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_expr ON public.concur_exprs_tab USING btree (((c1)::text) COLLATE "C")
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred'::regclass);
+ pg_get_indexdef
+----------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred ON public.concur_exprs_tab USING btree (c1) WHERE ((c1)::text > ((500000000)::text COLLATE "C"))
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred_2'::regclass);
+ pg_get_indexdef
+--------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred_2 ON public.concur_exprs_tab USING btree (((1 / c1))) WHERE ('-H'::text >= ((c2)::text COLLATE "C"))
+(1 row)
+
+-- ALTER TABLE recreates the indexes, which should keep their collations.
+ALTER TABLE concur_exprs_tab ALTER c2 TYPE TEXT;
+SELECT pg_get_indexdef('concur_exprs_index_expr'::regclass);
+ pg_get_indexdef
+---------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_expr ON public.concur_exprs_tab USING btree (((c1)::text) COLLATE "C")
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred'::regclass);
+ pg_get_indexdef
+----------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred ON public.concur_exprs_tab USING btree (c1) WHERE ((c1)::text > ((500000000)::text COLLATE "C"))
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred_2'::regclass);
+ pg_get_indexdef
+------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred_2 ON public.concur_exprs_tab USING btree (((1 / c1))) WHERE ('-H'::text >= (c2 COLLATE "C"))
+(1 row)
+
+DROP TABLE concur_exprs_tab;
+--
+-- REINDEX SCHEMA
+--
+REINDEX SCHEMA schema_to_reindex; -- failure, schema does not exist
+ERROR: schema "schema_to_reindex" does not exist
+CREATE SCHEMA schema_to_reindex;
+SET search_path = 'schema_to_reindex';
+CREATE TABLE table1(col1 SERIAL PRIMARY KEY);
+INSERT INTO table1 SELECT generate_series(1,400);
+CREATE TABLE table2(col1 SERIAL PRIMARY KEY, col2 TEXT NOT NULL);
+INSERT INTO table2 SELECT generate_series(1,400), 'abc';
+CREATE INDEX ON table2(col2);
+CREATE MATERIALIZED VIEW matview AS SELECT col1 FROM table2;
+CREATE INDEX ON matview(col1);
+CREATE VIEW view AS SELECT col2 FROM table2;
+CREATE TABLE reindex_before AS
+SELECT oid, relname, relfilenode, relkind, reltoastrelid
+ FROM pg_class
+ where relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'schema_to_reindex');
+INSERT INTO reindex_before
+SELECT oid, 'pg_toast_TABLE', relfilenode, relkind, reltoastrelid
+FROM pg_class WHERE oid IN
+ (SELECT reltoastrelid FROM reindex_before WHERE reltoastrelid > 0);
+INSERT INTO reindex_before
+SELECT oid, 'pg_toast_TABLE_index', relfilenode, relkind, reltoastrelid
+FROM pg_class where oid in
+ (select indexrelid from pg_index where indrelid in
+ (select reltoastrelid from reindex_before where reltoastrelid > 0));
+REINDEX SCHEMA schema_to_reindex;
+CREATE TABLE reindex_after AS SELECT oid, relname, relfilenode, relkind
+ FROM pg_class
+ where relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'schema_to_reindex');
+SELECT b.relname,
+ b.relkind,
+ CASE WHEN a.relfilenode = b.relfilenode THEN 'relfilenode is unchanged'
+ ELSE 'relfilenode has changed' END
+ FROM reindex_before b JOIN pg_class a ON b.oid = a.oid
+ ORDER BY 1;
+ relname | relkind | case
+------------------+---------+--------------------------
+ matview | m | relfilenode is unchanged
+ matview_col1_idx | i | relfilenode has changed
+ table1 | r | relfilenode is unchanged
+ table1_col1_seq | S | relfilenode is unchanged
+ table1_pkey | i | relfilenode has changed
+ table2 | r | relfilenode is unchanged
+ table2_col1_seq | S | relfilenode is unchanged
+ table2_col2_idx | i | relfilenode has changed
+ table2_pkey | i | relfilenode has changed
+ view | v | relfilenode is unchanged
+(10 rows)
+
+REINDEX SCHEMA schema_to_reindex;
+BEGIN;
+REINDEX SCHEMA schema_to_reindex; -- failure, cannot run in a transaction
+ERROR: REINDEX SCHEMA cannot run inside a transaction block
+END;
+-- concurrently
+REINDEX SCHEMA CONCURRENTLY schema_to_reindex;
+-- Failure for unauthorized user
+CREATE ROLE regress_reindexuser NOLOGIN;
+SET SESSION ROLE regress_reindexuser;
+REINDEX SCHEMA schema_to_reindex;
+ERROR: must be owner of schema schema_to_reindex
+-- Clean up
+RESET ROLE;
+DROP ROLE regress_reindexuser;
+DROP SCHEMA schema_to_reindex CASCADE;
+NOTICE: drop cascades to 6 other objects
+DETAIL: drop cascades to table table1
+drop cascades to table table2
+drop cascades to materialized view matview
+drop cascades to view view
+drop cascades to table reindex_before
+drop cascades to table reindex_after
diff --git a/src/test/regress/expected/fsm_1.out b/src/test/regress/expected/fsm_1.out
new file mode 100644
index 00000000000..9b5f9be13a5
--- /dev/null
+++ b/src/test/regress/expected/fsm_1.out
@@ -0,0 +1,73 @@
+--
+-- Free Space Map test
+--
+SELECT current_setting('block_size')::integer AS blocksize,
+current_setting('block_size')::integer / 8 AS strsize
+\gset
+CREATE TABLE fsm_check_size (num int, str text);
+-- Fill 3 blocks with one record each
+ALTER TABLE fsm_check_size SET (fillfactor=15);
+INSERT INTO fsm_check_size SELECT i, rpad('', :strsize, 'a')
+FROM generate_series(1,3) i;
+-- There should be no FSM
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'main') / :blocksize AS heap_nblocks,
+pg_relation_size('fsm_check_size', 'fsm') / :blocksize AS fsm_nblocks;
+ heap_nblocks | fsm_nblocks
+--------------+-------------
+ 5 | 0
+(1 row)
+
+-- The following operations are for testing the functionality of the local
+-- in-memory map. In particular, we want to be able to insert into some
+-- other block than the one at the end of the heap, without using a FSM.
+-- Fill most of the last block
+ALTER TABLE fsm_check_size SET (fillfactor=100);
+INSERT INTO fsm_check_size SELECT i, rpad('', :strsize, 'a')
+FROM generate_series(101,105) i;
+-- Make sure records can go into any block but the last one
+ALTER TABLE fsm_check_size SET (fillfactor=30);
+-- Insert large record and make sure it does not cause the relation to extend
+INSERT INTO fsm_check_size VALUES (111, rpad('', :strsize, 'a'));
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'main') / :blocksize AS heap_nblocks,
+pg_relation_size('fsm_check_size', 'fsm') / :blocksize AS fsm_nblocks;
+ heap_nblocks | fsm_nblocks
+--------------+-------------
+ 5 | 0
+(1 row)
+
+-- Extend table with enough blocks to exceed the FSM threshold
+DO $$
+DECLARE curtid tid;
+num int;
+BEGIN
+num = 11;
+ LOOP
+ INSERT INTO fsm_check_size VALUES (num, 'b') RETURNING ctid INTO curtid;
+ EXIT WHEN curtid >= tid '(4, 0)';
+ num = num + 1;
+ END LOOP;
+END;
+$$;
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'fsm') / :blocksize AS fsm_nblocks;
+ fsm_nblocks
+-------------
+ 0
+(1 row)
+
+-- Add long random string to extend TOAST table to 1 block
+INSERT INTO fsm_check_size
+VALUES(0, (SELECT string_agg(md5(chr(i)), '')
+ FROM generate_series(1, :blocksize / 100) i));
+VACUUM fsm_check_size;
+SELECT pg_relation_size(reltoastrelid, 'main') / :blocksize AS toast_nblocks,
+pg_relation_size(reltoastrelid, 'fsm') / :blocksize AS toast_fsm_nblocks
+FROM pg_class WHERE relname = 'fsm_check_size';
+ toast_nblocks | toast_fsm_nblocks
+---------------+-------------------
+ |
+(1 row)
+
+DROP TABLE fsm_check_size;
diff --git a/src/test/regress/expected/join_hash_1.out b/src/test/regress/expected/join_hash_1.out
new file mode 100644
index 00000000000..0d1fb3f7048
--- /dev/null
+++ b/src/test/regress/expected/join_hash_1.out
@@ -0,0 +1,881 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage stays within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increase just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(9 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious outer hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is not matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious outer hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+rollback;
diff --git a/src/test/regress/expected/limit_1.out b/src/test/regress/expected/limit_1.out
new file mode 100644
index 00000000000..55020ae1240
--- /dev/null
+++ b/src/test/regress/expected/limit_1.out
@@ -0,0 +1,505 @@
+--
+-- LIMIT
+-- Check the LIMIT/OFFSET feature of SELECT
+--
+SELECT ''::text AS two, unique1, unique2, stringu1
+ FROM onek WHERE unique1 > 50
+ ORDER BY unique1 LIMIT 2;
+ two | unique1 | unique2 | stringu1
+-----+---------+---------+----------
+ | 51 | 76 | ZBAAAA
+ | 52 | 985 | ACAAAA
+(2 rows)
+
+SELECT ''::text AS five, unique1, unique2, stringu1
+ FROM onek WHERE unique1 > 60
+ ORDER BY unique1 LIMIT 5;
+ five | unique1 | unique2 | stringu1
+------+---------+---------+----------
+ | 61 | 560 | JCAAAA
+ | 62 | 633 | KCAAAA
+ | 63 | 296 | LCAAAA
+ | 64 | 479 | MCAAAA
+ | 65 | 64 | NCAAAA
+(5 rows)
+
+SELECT ''::text AS two, unique1, unique2, stringu1
+ FROM onek WHERE unique1 > 60 AND unique1 < 63
+ ORDER BY unique1 LIMIT 5;
+ two | unique1 | unique2 | stringu1
+-----+---------+---------+----------
+ | 61 | 560 | JCAAAA
+ | 62 | 633 | KCAAAA
+(2 rows)
+
+SELECT ''::text AS three, unique1, unique2, stringu1
+ FROM onek WHERE unique1 > 100
+ ORDER BY unique1 LIMIT 3 OFFSET 20;
+ three | unique1 | unique2 | stringu1
+-------+---------+---------+----------
+ | 121 | 700 | REAAAA
+ | 122 | 519 | SEAAAA
+ | 123 | 777 | TEAAAA
+(3 rows)
+
+SELECT ''::text AS zero, unique1, unique2, stringu1
+ FROM onek WHERE unique1 < 50
+ ORDER BY unique1 DESC LIMIT 8 OFFSET 99;
+ zero | unique1 | unique2 | stringu1
+------+---------+---------+----------
+(0 rows)
+
+SELECT ''::text AS eleven, unique1, unique2, stringu1
+ FROM onek WHERE unique1 < 50
+ ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
+ eleven | unique1 | unique2 | stringu1
+--------+---------+---------+----------
+ | 10 | 520 | KAAAAA
+ | 9 | 49 | JAAAAA
+ | 8 | 653 | IAAAAA
+ | 7 | 647 | HAAAAA
+ | 6 | 978 | GAAAAA
+ | 5 | 541 | FAAAAA
+ | 4 | 833 | EAAAAA
+ | 3 | 431 | DAAAAA
+ | 2 | 326 | CAAAAA
+ | 1 | 214 | BAAAAA
+ | 0 | 998 | AAAAAA
+(11 rows)
+
+SELECT ''::text AS ten, unique1, unique2, stringu1
+ FROM onek
+ ORDER BY unique1 OFFSET 990;
+ ten | unique1 | unique2 | stringu1
+-----+---------+---------+----------
+ | 990 | 369 | CMAAAA
+ | 991 | 426 | DMAAAA
+ | 992 | 363 | EMAAAA
+ | 993 | 661 | FMAAAA
+ | 994 | 695 | GMAAAA
+ | 995 | 144 | HMAAAA
+ | 996 | 258 | IMAAAA
+ | 997 | 21 | JMAAAA
+ | 998 | 549 | KMAAAA
+ | 999 | 152 | LMAAAA
+(10 rows)
+
+SELECT ''::text AS five, unique1, unique2, stringu1
+ FROM onek
+ ORDER BY unique1 OFFSET 990 LIMIT 5;
+ five | unique1 | unique2 | stringu1
+------+---------+---------+----------
+ | 990 | 369 | CMAAAA
+ | 991 | 426 | DMAAAA
+ | 992 | 363 | EMAAAA
+ | 993 | 661 | FMAAAA
+ | 994 | 695 | GMAAAA
+(5 rows)
+
+SELECT ''::text AS five, unique1, unique2, stringu1
+ FROM onek
+ ORDER BY unique1 LIMIT 5 OFFSET 900;
+ five | unique1 | unique2 | stringu1
+------+---------+---------+----------
+ | 900 | 913 | QIAAAA
+ | 901 | 931 | RIAAAA
+ | 902 | 702 | SIAAAA
+ | 903 | 641 | TIAAAA
+ | 904 | 793 | UIAAAA
+(5 rows)
+
+-- Test null limit and offset. The planner would discard a simple null
+-- constant, so to ensure executor is exercised, do this:
+select * from int8_tbl limit (case when random() < 0.5 then null::bigint end);
+ q1 | q2
+------------------+-------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(5 rows)
+
+select * from int8_tbl offset (case when random() < 0.5 then null::bigint end);
+ q1 | q2
+------------------+-------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(5 rows)
+
+-- Test assorted cases involving backwards fetch from a LIMIT plan node
+begin;
+declare c1 cursor for select * from int8_tbl limit 10;
+fetch all in c1;
+ q1 | q2
+------------------+-------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(5 rows)
+
+fetch 1 in c1;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c1;
+ q1 | q2
+------------------+-------------------
+ 4567890123456789 | -4567890123456789
+(1 row)
+
+fetch backward all in c1;
+ q1 | q2
+------------------+------------------
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | 123
+ 123 | 4567890123456789
+ 123 | 456
+(4 rows)
+
+fetch backward 1 in c1;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch all in c1;
+ q1 | q2
+------------------+-------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(5 rows)
+
+declare c2 cursor for select * from int8_tbl limit 3;
+fetch all in c2;
+ q1 | q2
+------------------+------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+(3 rows)
+
+fetch 1 in c2;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c2;
+ q1 | q2
+------------------+-----
+ 4567890123456789 | 123
+(1 row)
+
+fetch backward all in c2;
+ q1 | q2
+-----+------------------
+ 123 | 4567890123456789
+ 123 | 456
+(2 rows)
+
+fetch backward 1 in c2;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch all in c2;
+ q1 | q2
+------------------+------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+(3 rows)
+
+declare c3 cursor for select * from int8_tbl offset 3;
+fetch all in c3;
+ q1 | q2
+------------------+-------------------
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(2 rows)
+
+fetch 1 in c3;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c3;
+ q1 | q2
+------------------+-------------------
+ 4567890123456789 | -4567890123456789
+(1 row)
+
+fetch backward all in c3;
+ q1 | q2
+------------------+------------------
+ 4567890123456789 | 4567890123456789
+(1 row)
+
+fetch backward 1 in c3;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch all in c3;
+ q1 | q2
+------------------+-------------------
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(2 rows)
+
+declare c4 cursor for select * from int8_tbl offset 10;
+fetch all in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch 1 in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward all in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch all in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+rollback;
+-- Stress test for variable LIMIT in conjunction with bounded-heap sorting
+SELECT
+ (SELECT n
+ FROM (VALUES (1)) AS x,
+ (SELECT n FROM generate_series(1,10) AS n
+ ORDER BY n LIMIT 1 OFFSET s-1) AS y) AS z
+ FROM generate_series(1,10) AS s;
+ z
+----
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+ 10
+(10 rows)
+
+--
+-- Test behavior of volatile and set-returning functions in conjunction
+-- with ORDER BY and LIMIT.
+--
+create temp sequence testseq;
+explain (verbose, costs off)
+select unique1, unique2, nextval('testseq')
+ from tenk1 order by unique2 limit 10;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ Output: unique1, unique2, (nextval('testseq'::regclass))
+ -> Index Scan using tenk1_unique2 on public.tenk1
+ Output: unique1, unique2, nextval('testseq'::regclass)
+(4 rows)
+
+select unique1, unique2, nextval('testseq')
+ from tenk1 order by unique2 limit 10;
+ unique1 | unique2 | nextval
+---------+---------+---------
+ 8800 | 0 | 1
+ 1891 | 1 | 2
+ 3420 | 2 | 3
+ 9850 | 3 | 4
+ 7164 | 4 | 5
+ 8009 | 5 | 6
+ 5057 | 6 | 7
+ 6701 | 7 | 8
+ 4321 | 8 | 9
+ 3043 | 9 | 10
+(10 rows)
+
+select currval('testseq');
+ currval
+---------
+ 10
+(1 row)
+
+explain (verbose, costs off)
+select unique1, unique2, nextval('testseq')
+ from tenk1 order by tenthous limit 10;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Limit
+ Output: unique1, unique2, (nextval('testseq'::regclass)), tenthous
+ -> Result
+ Output: unique1, unique2, nextval('testseq'::regclass), tenthous
+ -> Sort
+ Output: unique1, unique2, tenthous
+ Sort Key: tenk1.tenthous
+ -> Seq Scan on public.tenk1
+ Output: unique1, unique2, tenthous
+(9 rows)
+
+select unique1, unique2, nextval('testseq')
+ from tenk1 order by tenthous limit 10;
+ unique1 | unique2 | nextval
+---------+---------+---------
+ 0 | 9998 | 11
+ 1 | 2838 | 12
+ 2 | 2716 | 13
+ 3 | 5679 | 14
+ 4 | 1621 | 15
+ 5 | 5557 | 16
+ 6 | 2855 | 17
+ 7 | 8518 | 18
+ 8 | 5435 | 19
+ 9 | 4463 | 20
+(10 rows)
+
+select currval('testseq');
+ currval
+---------
+ 20
+(1 row)
+
+explain (verbose, costs off)
+select unique1, unique2, generate_series(1,10)
+ from tenk1 order by unique2 limit 7;
+ QUERY PLAN
+------------------------------------------------------------
+ Limit
+ Output: unique1, unique2, (generate_series(1, 10))
+ -> ProjectSet
+ Output: unique1, unique2, generate_series(1, 10)
+ -> Index Scan using tenk1_unique2 on public.tenk1
+ Output: unique1, unique2
+(6 rows)
+
+select unique1, unique2, generate_series(1,10)
+ from tenk1 order by unique2 limit 7;
+ unique1 | unique2 | generate_series
+---------+---------+-----------------
+ 8800 | 0 | 1
+ 8800 | 0 | 2
+ 8800 | 0 | 3
+ 8800 | 0 | 4
+ 8800 | 0 | 5
+ 8800 | 0 | 6
+ 8800 | 0 | 7
+(7 rows)
+
+explain (verbose, costs off)
+select unique1, unique2, generate_series(1,10)
+ from tenk1 order by tenthous limit 7;
+ QUERY PLAN
+--------------------------------------------------------------------
+ Limit
+ Output: unique1, unique2, (generate_series(1, 10)), tenthous
+ -> ProjectSet
+ Output: unique1, unique2, generate_series(1, 10), tenthous
+ -> Sort
+ Output: unique1, unique2, tenthous
+ Sort Key: tenk1.tenthous
+ -> Seq Scan on public.tenk1
+ Output: unique1, unique2, tenthous
+(9 rows)
+
+select unique1, unique2, generate_series(1,10)
+ from tenk1 order by tenthous limit 7;
+ unique1 | unique2 | generate_series
+---------+---------+-----------------
+ 0 | 9998 | 1
+ 0 | 9998 | 2
+ 0 | 9998 | 3
+ 0 | 9998 | 4
+ 0 | 9998 | 5
+ 0 | 9998 | 6
+ 0 | 9998 | 7
+(7 rows)
+
+-- use of random() is to keep planner from folding the expressions together
+explain (verbose, costs off)
+select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ ProjectSet
+ Output: generate_series(0, 2), generate_series(((random() * '0.1'::double precision))::integer, 2)
+ -> Result
+(3 rows)
+
+select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2;
+ s1 | s2
+----+----
+ 0 | 0
+ 1 | 1
+ 2 | 2
+(3 rows)
+
+explain (verbose, costs off)
+select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2
+order by s2 desc;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------
+ Sort
+ Output: (generate_series(0, 2)), (generate_series(((random() * '0.1'::double precision))::integer, 2))
+ Sort Key: (generate_series(((random() * '0.1'::double precision))::integer, 2)) DESC
+ -> ProjectSet
+ Output: generate_series(0, 2), generate_series(((random() * '0.1'::double precision))::integer, 2)
+ -> Result
+(6 rows)
+
+select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2
+order by s2 desc;
+ s1 | s2
+----+----
+ 2 | 2
+ 1 | 1
+ 0 | 0
+(3 rows)
+
+-- test for failure to set all aggregates' aggtranstype
+explain (verbose, costs off)
+select sum(tenthous) as s1, sum(tenthous) + random()*0 as s2
+ from tenk1 group by thousand order by thousand limit 3;
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------
+ Limit
+ Output: (sum(tenthous)), (((sum(tenthous))::double precision + (random() * '0'::double precision))), thousand
+ -> GroupAggregate
+ Output: sum(tenthous), ((sum(tenthous))::double precision + (random() * '0'::double precision)), thousand
+ Group Key: tenk1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on public.tenk1
+ Output: thousand, tenthous
+(7 rows)
+
+select sum(tenthous) as s1, sum(tenthous) + random()*0 as s2
+ from tenk1 group by thousand order by thousand limit 3;
+ s1 | s2
+-------+-------
+ 45000 | 45000
+ 45010 | 45010
+ 45020 | 45020
+(3 rows)
+
diff --git a/src/test/regress/expected/psql_crosstab_1.out b/src/test/regress/expected/psql_crosstab_1.out
new file mode 100644
index 00000000000..d8ab4c35f32
--- /dev/null
+++ b/src/test/regress/expected/psql_crosstab_1.out
@@ -0,0 +1,216 @@
+--
+-- \crosstabview
+--
+CREATE TABLE ctv_data (v, h, c, i, d) AS
+VALUES
+ ('v1','h2','foo', 3, '2015-04-01'::date),
+ ('v2','h1','bar', 3, '2015-01-02'),
+ ('v1','h0','baz', NULL, '2015-07-12'),
+ ('v0','h4','qux', 4, '2015-07-15'),
+ ('v0','h4','dbl', -3, '2014-12-15'),
+ ('v0',NULL,'qux', 5, '2014-07-15'),
+ ('v1','h2','quux',7, '2015-04-04');
+-- make plans more stable
+ANALYZE ctv_data;
+-- running \crosstabview after query uses query in buffer
+SELECT v, EXTRACT(year FROM d), count(*)
+ FROM ctv_data
+ GROUP BY 1, 2
+ ORDER BY 1, 2;
+ v | date_part | count
+----+-----------+-------
+ v0 | 2014 | 2
+ v0 | 2015 | 1
+ v1 | 2015 | 3
+ v2 | 2015 | 1
+(4 rows)
+
+-- basic usage with 3 columns
+ \crosstabview
+ v | 2014 | 2015
+----+------+------
+ v0 | 2 | 1
+ v1 | | 3
+ v2 | | 1
+(3 rows)
+
+-- ordered months in horizontal header, quoted column name
+SELECT v, to_char(d, 'Mon') AS "month name", EXTRACT(month FROM d) AS num,
+ count(*) FROM ctv_data GROUP BY 1,2,3 ORDER BY 1
+ \crosstabview v "month name" 4 num
+ v | Jan | Apr | Jul | Dec
+----+-----+-----+-----+-----
+ v0 | | | 2 | 1
+ v1 | | 2 | 1 |
+ v2 | 1 | | |
+(3 rows)
+
+-- ordered months in vertical header, ordered years in horizontal header
+SELECT EXTRACT(year FROM d) AS year, to_char(d,'Mon') AS """month"" name",
+ EXTRACT(month FROM d) AS month,
+ format('sum=%s avg=%s', sum(i), avg(i)::numeric(2,1))
+ FROM ctv_data
+ GROUP BY EXTRACT(year FROM d), to_char(d,'Mon'), EXTRACT(month FROM d)
+ORDER BY month
+\crosstabview """month"" name" year format year
+ "month" name | 2014 | 2015
+--------------+-----------------+----------------
+ Jan | | sum=3 avg=3.0
+ Apr | | sum=10 avg=5.0
+ Jul | sum=5 avg=5.0 | sum=4 avg=4.0
+ Dec | sum=-3 avg=-3.0 |
+(4 rows)
+
+-- combine contents vertically into the same cell (V/H duplicates)
+SELECT v, h, string_agg(c, E'\n') FROM ctv_data GROUP BY v, h ORDER BY 1,2,3
+ \crosstabview 1 2 3
+ v | h4 | | h0 | h2 | h1
+----+-----+-----+-----+------+-----
+ v0 | qux+| qux | | |
+ | dbl | | | |
+ v1 | | | baz | foo +|
+ | | | | quux |
+ v2 | | | | | bar
+(3 rows)
+
+-- horizontal ASC order from window function
+SELECT v,h, string_agg(c, E'\n') AS c, row_number() OVER(ORDER BY h) AS r
+FROM ctv_data GROUP BY v, h ORDER BY 1,3,2
+ \crosstabview v h c r
+ v | h0 | h1 | h2 | h4 |
+----+-----+-----+------+-----+-----
+ v0 | | | | qux+| qux
+ | | | | dbl |
+ v1 | baz | | foo +| |
+ | | | quux | |
+ v2 | | bar | | |
+(3 rows)
+
+-- horizontal DESC order from window function
+SELECT v, h, string_agg(c, E'\n') AS c, row_number() OVER(ORDER BY h DESC) AS r
+FROM ctv_data GROUP BY v, h ORDER BY 1,3,2
+ \crosstabview v h c r
+ v | | h4 | h2 | h1 | h0
+----+-----+-----+------+-----+-----
+ v0 | qux | qux+| | |
+ | | dbl | | |
+ v1 | | | foo +| | baz
+ | | | quux | |
+ v2 | | | | bar |
+(3 rows)
+
+-- horizontal ASC order from window function, NULLs pushed rightmost
+SELECT v,h, string_agg(c, E'\n') AS c, row_number() OVER(ORDER BY h NULLS LAST) AS r
+FROM ctv_data GROUP BY v, h ORDER BY 1,3,2
+ \crosstabview v h c r
+ v | h0 | h1 | h2 | h4 |
+----+-----+-----+------+-----+-----
+ v0 | | | | qux+| qux
+ | | | | dbl |
+ v1 | baz | | foo +| |
+ | | | quux | |
+ v2 | | bar | | |
+(3 rows)
+
+-- only null, no column name, 2 columns: error
+SELECT null,null \crosstabview
+\crosstabview: query must return at least three columns
+-- only null, no column name, 3 columns: works
+SELECT null,null,null \crosstabview
+ ?column? |
+----------+--
+ |
+(1 row)
+
+-- null display
+\pset null '#null#'
+SELECT v,h, string_agg(i::text, E'\n') AS i FROM ctv_data
+GROUP BY v, h ORDER BY h,v
+ \crosstabview v h i
+ v | h0 | h1 | h2 | h4 | #null#
+----+--------+----+----+----+--------
+ v1 | #null# | | 3 +| |
+ | | | 7 | |
+ v2 | | 3 | | |
+ v0 | | | | 4 +| 5
+ | | | | -3 |
+(3 rows)
+
+\pset null ''
+-- refer to columns by position
+SELECT v,h,string_agg(i::text, E'\n'), string_agg(c, E'\n')
+FROM ctv_data GROUP BY v, h ORDER BY h,v
+ \crosstabview 2 1 4
+ h | v1 | v2 | v0
+----+------+-----+-----
+ h0 | baz | |
+ h1 | | bar |
+ h2 | quux+| |
+ | foo | |
+ h4 | | | qux+
+ | | | dbl
+ | | | qux
+(5 rows)
+
+-- refer to columns by positions and names mixed
+SELECT v,h, string_agg(i::text, E'\n') AS i, string_agg(c, E'\n') AS c
+FROM ctv_data GROUP BY v, h ORDER BY h,v
+ \crosstabview 1 "h" 4
+ v | h0 | h1 | h2 | h4 |
+----+-----+-----+------+-----+-----
+ v1 | baz | | quux+| |
+ | | | foo | |
+ v2 | | bar | | |
+ v0 | | | | qux+| qux
+ | | | | dbl |
+(3 rows)
+
+-- refer to columns by quoted names, check downcasing of unquoted name
+SELECT 1 as "22", 2 as b, 3 as "Foo"
+ \crosstabview "22" B "Foo"
+ 22 | 2
+----+---
+ 1 | 3
+(1 row)
+
+-- error: bad column name
+SELECT v,h,c,i FROM ctv_data
+ \crosstabview v h j
+\crosstabview: column name not found: "j"
+-- error: need to quote name
+SELECT 1 as "22", 2 as b, 3 as "Foo"
+ \crosstabview 1 2 Foo
+\crosstabview: column name not found: "foo"
+-- error: need to not quote name
+SELECT 1 as "22", 2 as b, 3 as "Foo"
+ \crosstabview 1 "B" "Foo"
+\crosstabview: column name not found: "B"
+-- error: bad column number
+SELECT v,h,i,c FROM ctv_data
+ \crosstabview 2 1 5
+\crosstabview: column number 5 is out of range 1..4
+-- error: same H and V columns
+SELECT v,h,i,c FROM ctv_data
+ \crosstabview 2 h 4
+\crosstabview: vertical and horizontal headers must be different columns
+-- error: too many columns
+SELECT a,a,1 FROM generate_series(1,3000) AS a
+ \crosstabview
+\crosstabview: maximum number of columns (1600) exceeded
+-- error: only one column
+SELECT 1 \crosstabview
+\crosstabview: query must return at least three columns
+DROP TABLE ctv_data;
+-- check error reporting (bug #14476)
+CREATE TABLE ctv_data (x int, y int, v text);
+INSERT INTO ctv_data SELECT 1, x, '*' || x FROM generate_series(1,10) x;
+SELECT * FROM ctv_data \crosstabview
+ x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
+---+----+----+----+----+----+----+----+----+----+-----
+ 1 | *1 | *2 | *3 | *4 | *5 | *6 | *7 | *8 | *9 | *10
+(1 row)
+
+INSERT INTO ctv_data VALUES (1, 10, '*'); -- duplicate data to cause error
+SELECT * FROM ctv_data \crosstabview
+\crosstabview: query result contains multiple data values for row "1", column "10"
+DROP TABLE ctv_data;
diff --git a/src/test/regress/expected/rangefuncs_1.out b/src/test/regress/expected/rangefuncs_1.out
new file mode 100644
index 00000000000..78b177ceb0d
--- /dev/null
+++ b/src/test/regress/expected/rangefuncs_1.out
@@ -0,0 +1,2100 @@
+CREATE TABLE rngfunc2(rngfuncid int, f2 int);
+INSERT INTO rngfunc2 VALUES(1, 11);
+INSERT INTO rngfunc2 VALUES(2, 22);
+INSERT INTO rngfunc2 VALUES(1, 111);
+CREATE FUNCTION rngfunct(int) returns setof rngfunc2 as 'SELECT * FROM rngfunc2 WHERE rngfuncid = $1 ORDER BY f2;' LANGUAGE SQL;
+-- function with ORDINALITY
+select * from rngfunct(1) with ordinality as z(a,b,ord);
+ a | b | ord
+---+-----+-----
+ 1 | 11 | 1
+ 1 | 111 | 2
+(2 rows)
+
+select * from rngfunct(1) with ordinality as z(a,b,ord) where b > 100; -- ordinal 2, not 1
+ a | b | ord
+---+-----+-----
+ 1 | 111 | 2
+(1 row)
+
+-- ordinality vs. column names and types
+select a,b,ord from rngfunct(1) with ordinality as z(a,b,ord);
+ a | b | ord
+---+-----+-----
+ 1 | 11 | 1
+ 1 | 111 | 2
+(2 rows)
+
+select a,ord from unnest(array['a','b']) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ a | 1
+ b | 2
+(2 rows)
+
+select * from unnest(array['a','b']) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ a | 1
+ b | 2
+(2 rows)
+
+select a,ord from unnest(array[1.0::float8]) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ 1 | 1
+(1 row)
+
+select * from unnest(array[1.0::float8]) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ 1 | 1
+(1 row)
+
+select row_to_json(s.*) from generate_series(11,14) with ordinality s;
+ row_to_json
+-------------------------
+ {"s":11,"ordinality":1}
+ {"s":12,"ordinality":2}
+ {"s":13,"ordinality":3}
+ {"s":14,"ordinality":4}
+(4 rows)
+
+-- ordinality vs. views
+create temporary view vw_ord as select * from (values (1)) v(n) join rngfunct(1) with ordinality as z(a,b,ord) on (n=ord);
+select * from vw_ord;
+ n | a | b | ord
+---+---+----+-----
+ 1 | 1 | 11 | 1
+(1 row)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+-------------------------------------------------------------------------
+ SELECT v.n, +
+ z.a, +
+ z.b, +
+ z.ord +
+ FROM (( VALUES (1)) v(n) +
+ JOIN rngfunct(1) WITH ORDINALITY z(a, b, ord) ON ((v.n = z.ord)));
+(1 row)
+
+drop view vw_ord;
+-- multiple functions
+select * from rows from(rngfunct(1),rngfunct(2)) with ordinality as z(a,b,c,d,ord);
+ a | b | c | d | ord
+---+-----+---+----+-----
+ 1 | 11 | 2 | 22 | 1
+ 1 | 111 | | | 2
+(2 rows)
+
+create temporary view vw_ord as select * from (values (1)) v(n) join rows from(rngfunct(1),rngfunct(2)) with ordinality as z(a,b,c,d,ord) on (n=ord);
+select * from vw_ord;
+ n | a | b | c | d | ord
+---+---+----+---+----+-----
+ 1 | 1 | 11 | 2 | 22 | 1
+(1 row)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+-------------------------------------------------------------------------------------------------------
+ SELECT v.n, +
+ z.a, +
+ z.b, +
+ z.c, +
+ z.d, +
+ z.ord +
+ FROM (( VALUES (1)) v(n) +
+ JOIN ROWS FROM(rngfunct(1), rngfunct(2)) WITH ORDINALITY z(a, b, c, d, ord) ON ((v.n = z.ord)));
+(1 row)
+
+drop view vw_ord;
+-- expansions of unnest()
+select * from unnest(array[10,20],array['foo','bar'],array[1.0]);
+ unnest | unnest | unnest
+--------+--------+--------
+ 10 | foo | 1.0
+ 20 | bar |
+(2 rows)
+
+select * from unnest(array[10,20],array['foo','bar'],array[1.0]) with ordinality as z(a,b,c,ord);
+ a | b | c | ord
+----+-----+-----+-----
+ 10 | foo | 1.0 | 1
+ 20 | bar | | 2
+(2 rows)
+
+select * from rows from(unnest(array[10,20],array['foo','bar'],array[1.0])) with ordinality as z(a,b,c,ord);
+ a | b | c | ord
+----+-----+-----+-----
+ 10 | foo | 1.0 | 1
+ 20 | bar | | 2
+(2 rows)
+
+select * from rows from(unnest(array[10,20],array['foo','bar']), generate_series(101,102)) with ordinality as z(a,b,c,ord);
+ a | b | c | ord
+----+-----+-----+-----
+ 10 | foo | 101 | 1
+ 20 | bar | 102 | 2
+(2 rows)
+
+create temporary view vw_ord as select * from unnest(array[10,20],array['foo','bar'],array[1.0]) as z(a,b,c);
+select * from vw_ord;
+ a | b | c
+----+-----+-----
+ 10 | foo | 1.0
+ 20 | bar |
+(2 rows)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+----------------------------------------------------------------------------------------
+ SELECT z.a, +
+ z.b, +
+ z.c +
+ FROM UNNEST(ARRAY[10, 20], ARRAY['foo'::text, 'bar'::text], ARRAY[1.0]) z(a, b, c);
+(1 row)
+
+drop view vw_ord;
+create temporary view vw_ord as select * from rows from(unnest(array[10,20],array['foo','bar'],array[1.0])) as z(a,b,c);
+select * from vw_ord;
+ a | b | c
+----+-----+-----
+ 10 | foo | 1.0
+ 20 | bar |
+(2 rows)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+----------------------------------------------------------------------------------------
+ SELECT z.a, +
+ z.b, +
+ z.c +
+ FROM UNNEST(ARRAY[10, 20], ARRAY['foo'::text, 'bar'::text], ARRAY[1.0]) z(a, b, c);
+(1 row)
+
+drop view vw_ord;
+create temporary view vw_ord as select * from rows from(unnest(array[10,20],array['foo','bar']), generate_series(1,2)) as z(a,b,c);
+select * from vw_ord;
+ a | b | c
+----+-----+---
+ 10 | foo | 1
+ 20 | bar | 2
+(2 rows)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+----------------------------------------------------------------------------------------------------------------------
+ SELECT z.a, +
+ z.b, +
+ z.c +
+ FROM ROWS FROM(unnest(ARRAY[10, 20]), unnest(ARRAY['foo'::text, 'bar'::text]), generate_series(1, 2)) z(a, b, c);
+(1 row)
+
+drop view vw_ord;
+-- ordinality and multiple functions vs. rewind and reverse scan
+begin;
+declare rf_cur scroll cursor for select * from rows from(generate_series(1,5),generate_series(1,2)) with ordinality as g(i,j,o);
+fetch all from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+ 2 | 2 | 2
+ 3 | | 3
+ 4 | | 4
+ 5 | | 5
+(5 rows)
+
+fetch backward all from rf_cur;
+ i | j | o
+---+---+---
+ 5 | | 5
+ 4 | | 4
+ 3 | | 3
+ 2 | 2 | 2
+ 1 | 1 | 1
+(5 rows)
+
+fetch all from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+ 2 | 2 | 2
+ 3 | | 3
+ 4 | | 4
+ 5 | | 5
+(5 rows)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+(0 rows)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+(0 rows)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 5 | | 5
+(1 row)
+
+fetch absolute 1 from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+(1 row)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+ 2 | 2 | 2
+(1 row)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+ 3 | | 3
+(1 row)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+ 4 | | 4
+(1 row)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 3 | | 3
+(1 row)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 2 | 2 | 2
+(1 row)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+(1 row)
+
+commit;
+-- function with implicit LATERAL
+select * from rngfunc2, rngfunct(rngfunc2.rngfuncid) z where rngfunc2.f2 = z.f2;
+ rngfuncid | f2 | rngfuncid | f2
+-----------+-----+-----------+-----
+ 1 | 11 | 1 | 11
+ 2 | 22 | 2 | 22
+ 1 | 111 | 1 | 111
+(3 rows)
+
+-- function with implicit LATERAL and explicit ORDINALITY
+select * from rngfunc2, rngfunct(rngfunc2.rngfuncid) with ordinality as z(rngfuncid,f2,ord) where rngfunc2.f2 = z.f2;
+ rngfuncid | f2 | rngfuncid | f2 | ord
+-----------+-----+-----------+-----+-----
+ 1 | 11 | 1 | 11 | 1
+ 2 | 22 | 2 | 22 | 1
+ 1 | 111 | 1 | 111 | 2
+(3 rows)
+
+-- function in subselect
+select * from rngfunc2 where f2 in (select f2 from rngfunct(rngfunc2.rngfuncid) z where z.rngfuncid = rngfunc2.rngfuncid) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+ 2 | 22
+(3 rows)
+
+-- function in subselect
+select * from rngfunc2 where f2 in (select f2 from rngfunct(1) z where z.rngfuncid = rngfunc2.rngfuncid) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+(2 rows)
+
+-- function in subselect
+select * from rngfunc2 where f2 in (select f2 from rngfunct(rngfunc2.rngfuncid) z where z.rngfuncid = 1) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+(2 rows)
+
+-- nested functions
+select rngfunct.rngfuncid, rngfunct.f2 from rngfunct(sin(pi()/2)::int) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+(2 rows)
+
+CREATE TABLE rngfunc (rngfuncid int, rngfuncsubid int, rngfuncname text, primary key(rngfuncid,rngfuncsubid));
+INSERT INTO rngfunc VALUES(1,1,'Joe');
+INSERT INTO rngfunc VALUES(1,2,'Ed');
+INSERT INTO rngfunc VALUES(2,1,'Mary');
+-- sql, proretset = f, prorettype = b
+CREATE FUNCTION getrngfunc1(int) RETURNS int AS 'SELECT $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc1(1) AS t1;
+ t1
+----
+ 1
+(1 row)
+
+SELECT * FROM getrngfunc1(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc1(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc1
+-------------
+ 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc1(1) WITH ORDINALITY as t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = b
+CREATE FUNCTION getrngfunc2(int) RETURNS setof int AS 'SELECT rngfuncid FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc2(1) AS t1;
+ t1
+----
+ 1
+ 1
+(2 rows)
+
+SELECT * FROM getrngfunc2(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+---+---
+ 1 | 1
+ 1 | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc2(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc2
+-------------
+ 1
+ 1
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc2(1) WITH ORDINALITY AS t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+---+---
+ 1 | 1
+ 1 | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = b
+CREATE FUNCTION getrngfunc3(int) RETURNS setof text AS 'SELECT rngfuncname FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc3(1) AS t1;
+ t1
+-----
+ Joe
+ Ed
+(2 rows)
+
+SELECT * FROM getrngfunc3(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+-----+---
+ Joe | 1
+ Ed | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc3(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc3
+-------------
+ Joe
+ Ed
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc3(1) WITH ORDINALITY AS t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+-----+---
+ Joe | 1
+ Ed | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = f, prorettype = c
+CREATE FUNCTION getrngfunc4(int) RETURNS rngfunc AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc4(1) AS t1;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+SELECT * FROM getrngfunc4(1) WITH ORDINALITY AS t1(a,b,c,o);
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc4(1);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc4(1) WITH ORDINALITY AS t1(a,b,c,o);
+SELECT * FROM vw_getrngfunc;
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = c
+CREATE FUNCTION getrngfunc5(int) RETURNS setof rngfunc AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc5(1) AS t1;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+SELECT * FROM getrngfunc5(1) WITH ORDINALITY AS t1(a,b,c,o);
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc5(1);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc5(1) WITH ORDINALITY AS t1(a,b,c,o);
+SELECT * FROM vw_getrngfunc;
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = f, prorettype = record
+CREATE FUNCTION getrngfunc6(int) RETURNS RECORD AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc6(1) AS t1(rngfuncid int, rngfuncsubid int, rngfuncname text);
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+SELECT * FROM ROWS FROM( getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) ) WITH ORDINALITY;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc6(1) AS
+(rngfuncid int, rngfuncsubid int, rngfuncname text);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS
+ SELECT * FROM ROWS FROM( getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) )
+ WITH ORDINALITY;
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = record
+CREATE FUNCTION getrngfunc7(int) RETURNS setof record AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc7(1) AS t1(rngfuncid int, rngfuncsubid int, rngfuncname text);
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+SELECT * FROM ROWS FROM( getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) ) WITH ORDINALITY;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc7(1) AS
+(rngfuncid int, rngfuncsubid int, rngfuncname text);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS
+ SELECT * FROM ROWS FROM( getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) )
+ WITH ORDINALITY;
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- plpgsql, proretset = f, prorettype = b
+CREATE FUNCTION getrngfunc8(int) RETURNS int AS 'DECLARE rngfuncint int; BEGIN SELECT rngfuncid into rngfuncint FROM rngfunc WHERE rngfuncid = $1; RETURN rngfuncint; END;' LANGUAGE plpgsql;
+SELECT * FROM getrngfunc8(1) AS t1;
+ t1
+----
+ 1
+(1 row)
+
+SELECT * FROM getrngfunc8(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc8(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc8
+-------------
+ 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc8(1) WITH ORDINALITY AS t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- plpgsql, proretset = f, prorettype = c
+CREATE FUNCTION getrngfunc9(int) RETURNS rngfunc AS 'DECLARE rngfunctup rngfunc%ROWTYPE; BEGIN SELECT * into rngfunctup FROM rngfunc WHERE rngfuncid = $1; RETURN rngfunctup; END;' LANGUAGE plpgsql;
+SELECT * FROM getrngfunc9(1) AS t1;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+SELECT * FROM getrngfunc9(1) WITH ORDINALITY AS t1(a,b,c,o);
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc9(1);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc9(1) WITH ORDINALITY AS t1(a,b,c,o);
+SELECT * FROM vw_getrngfunc;
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- mix 'n match kinds, to exercise expandRTE and related logic
+select * from rows from(getrngfunc1(1),getrngfunc2(1),getrngfunc3(1),getrngfunc4(1),getrngfunc5(1),
+ getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc8(1),getrngfunc9(1))
+ with ordinality as t1(a,b,c,d,e,f,g,h,i,j,k,l,m,o,p,q,r,s,t,u);
+ a | b | c | d | e | f | g | h | i | j | k | l | m | o | p | q | r | s | t | u
+---+---+-----+---+---+-----+---+---+-----+---+---+-----+---+---+-----+---+---+---+-----+---
+ 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | 1 | Joe | 1
+ | 1 | Ed | | | | 1 | 2 | Ed | | | | 1 | 2 | Ed | | | | | 2
+(2 rows)
+
+select * from rows from(getrngfunc9(1),getrngfunc8(1),
+ getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc5(1),getrngfunc4(1),getrngfunc3(1),getrngfunc2(1),getrngfunc1(1))
+ with ordinality as t1(a,b,c,d,e,f,g,h,i,j,k,l,m,o,p,q,r,s,t,u);
+ a | b | c | d | e | f | g | h | i | j | k | l | m | o | p | q | r | s | t | u
+---+---+-----+---+---+---+-----+---+---+-----+---+---+-----+---+---+-----+-----+---+---+---
+ 1 | 1 | Joe | 1 | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | Joe | 1 | 1 | 1
+ | | | | 1 | 2 | Ed | | | | 1 | 2 | Ed | | | | Ed | 1 | | 2
+(2 rows)
+
+create temporary view vw_rngfunc as
+ select * from rows from(getrngfunc9(1),
+ getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc1(1))
+ with ordinality as t1(a,b,c,d,e,f,g,n);
+select * from vw_rngfunc;
+ a | b | c | d | e | f | g | n
+---+---+-----+---+---+-----+---+---
+ 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1
+ | | | 1 | 2 | Ed | | 2
+(2 rows)
+
+select pg_get_viewdef('vw_rngfunc');
+ pg_get_viewdef
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ SELECT t1.a, +
+ t1.b, +
+ t1.c, +
+ t1.d, +
+ t1.e, +
+ t1.f, +
+ t1.g, +
+ t1.n +
+ FROM ROWS FROM(getrngfunc9(1), getrngfunc7(1) AS (rngfuncid integer, rngfuncsubid integer, rngfuncname text), getrngfunc1(1)) WITH ORDINALITY t1(a, b, c, d, e, f, g, n);
+(1 row)
+
+drop view vw_rngfunc;
+DROP FUNCTION getrngfunc1(int);
+DROP FUNCTION getrngfunc2(int);
+DROP FUNCTION getrngfunc3(int);
+DROP FUNCTION getrngfunc4(int);
+DROP FUNCTION getrngfunc5(int);
+DROP FUNCTION getrngfunc6(int);
+DROP FUNCTION getrngfunc7(int);
+DROP FUNCTION getrngfunc8(int);
+DROP FUNCTION getrngfunc9(int);
+DROP FUNCTION rngfunct(int);
+DROP TABLE rngfunc2;
+DROP TABLE rngfunc;
+-- Rescan tests --
+CREATE TEMPORARY SEQUENCE rngfunc_rescan_seq1;
+CREATE TEMPORARY SEQUENCE rngfunc_rescan_seq2;
+CREATE TYPE rngfunc_rescan_t AS (i integer, s bigint);
+CREATE FUNCTION rngfunc_sql(int,int) RETURNS setof rngfunc_rescan_t AS 'SELECT i, nextval(''rngfunc_rescan_seq1'') FROM generate_series($1,$2) i;' LANGUAGE SQL;
+-- plpgsql functions use materialize mode
+CREATE FUNCTION rngfunc_mat(int,int) RETURNS setof rngfunc_rescan_t AS 'begin for i in $1..$2 loop return next (i, nextval(''rngfunc_rescan_seq2'')); end loop; end;' LANGUAGE plpgsql;
+--invokes ExecReScanFunctionScan - all these cases should materialize the function only once
+-- LEFT JOIN on a condition that the planner can't prove to be true is used to ensure the function
+-- is on the inner path of a nestloop join
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_sql(11,13) ON (r+i)<100;
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 11 | 1
+ 2 | 12 | 2
+ 2 | 13 | 3
+ 3 | 11 | 1
+ 3 | 12 | 2
+ 3 | 13 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_sql(11,13) WITH ORDINALITY AS f(i,s,o) ON (r+i)<100;
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 11 | 1 | 1
+ 2 | 12 | 2 | 2
+ 2 | 13 | 3 | 3
+ 3 | 11 | 1 | 1
+ 3 | 12 | 2 | 2
+ 3 | 13 | 3 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_mat(11,13) ON (r+i)<100;
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 11 | 1
+ 2 | 12 | 2
+ 2 | 13 | 3
+ 3 | 11 | 1
+ 3 | 12 | 2
+ 3 | 13 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_mat(11,13) WITH ORDINALITY AS f(i,s,o) ON (r+i)<100;
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 11 | 1 | 1
+ 2 | 12 | 2 | 2
+ 2 | 13 | 3 | 3
+ 3 | 11 | 1 | 1
+ 3 | 12 | 2 | 2
+ 3 | 13 | 3 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN ROWS FROM( rngfunc_sql(11,13), rngfunc_mat(11,13) ) WITH ORDINALITY AS f(i1,s1,i2,s2,o) ON (r+i1+i2)<100;
+ r | i1 | s1 | i2 | s2 | o
+---+----+----+----+----+---
+ 1 | 11 | 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 12 | 2 | 2
+ 1 | 13 | 3 | 13 | 3 | 3
+ 2 | 11 | 1 | 11 | 1 | 1
+ 2 | 12 | 2 | 12 | 2 | 2
+ 2 | 13 | 3 | 13 | 3 | 3
+ 3 | 11 | 1 | 11 | 1 | 1
+ 3 | 12 | 2 | 12 | 2 | 2
+ 3 | 13 | 3 | 13 | 3 | 3
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN generate_series(11,13) f(i) ON (r+i)<100;
+ r | i
+---+----
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 2 | 11
+ 2 | 12
+ 2 | 13
+ 3 | 11
+ 3 | 12
+ 3 | 13
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN generate_series(11,13) WITH ORDINALITY AS f(i,o) ON (r+i)<100;
+ r | i | o
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 11 | 1
+ 2 | 12 | 2
+ 2 | 13 | 3
+ 3 | 11 | 1
+ 3 | 12 | 2
+ 3 | 13 | 3
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN unnest(array[10,20,30]) f(i) ON (r+i)<100;
+ r | i
+---+----
+ 1 | 10
+ 1 | 20
+ 1 | 30
+ 2 | 10
+ 2 | 20
+ 2 | 30
+ 3 | 10
+ 3 | 20
+ 3 | 30
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN unnest(array[10,20,30]) WITH ORDINALITY AS f(i,o) ON (r+i)<100;
+ r | i | o
+---+----+---
+ 1 | 10 | 1
+ 1 | 20 | 2
+ 1 | 30 | 3
+ 2 | 10 | 1
+ 2 | 20 | 2
+ 2 | 30 | 3
+ 3 | 10 | 1
+ 3 | 20 | 2
+ 3 | 30 | 3
+(9 rows)
+
+--invokes ExecReScanFunctionScan with chgParam != NULL (using implied LATERAL)
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(10+r,13);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 12 | 4
+ 2 | 13 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(10+r,13) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 12 | 4 | 1
+ 2 | 13 | 5 | 2
+ 3 | 13 | 6 | 1
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(11,10+r);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 2 | 11 | 2
+ 2 | 12 | 3
+ 3 | 11 | 4
+ 3 | 12 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(11,10+r) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 2 | 11 | 2 | 1
+ 2 | 12 | 3 | 2
+ 3 | 11 | 4 | 1
+ 3 | 12 | 5 | 2
+ 3 | 13 | 6 | 3
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_sql(r1,r2);
+ r1 | r2 | i | s
+----+----+----+----
+ 11 | 12 | 11 | 1
+ 11 | 12 | 12 | 2
+ 13 | 15 | 13 | 3
+ 13 | 15 | 14 | 4
+ 13 | 15 | 15 | 5
+ 16 | 20 | 16 | 6
+ 16 | 20 | 17 | 7
+ 16 | 20 | 18 | 8
+ 16 | 20 | 19 | 9
+ 16 | 20 | 20 | 10
+(10 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_sql(r1,r2) WITH ORDINALITY AS f(i,s,o);
+ r1 | r2 | i | s | o
+----+----+----+----+---
+ 11 | 12 | 11 | 1 | 1
+ 11 | 12 | 12 | 2 | 2
+ 13 | 15 | 13 | 3 | 1
+ 13 | 15 | 14 | 4 | 2
+ 13 | 15 | 15 | 5 | 3
+ 16 | 20 | 16 | 6 | 1
+ 16 | 20 | 17 | 7 | 2
+ 16 | 20 | 18 | 8 | 3
+ 16 | 20 | 19 | 9 | 4
+ 16 | 20 | 20 | 10 | 5
+(10 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(10+r,13);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 12 | 4
+ 2 | 13 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(10+r,13) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 12 | 4 | 1
+ 2 | 13 | 5 | 2
+ 3 | 13 | 6 | 1
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(11,10+r);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 2 | 11 | 2
+ 2 | 12 | 3
+ 3 | 11 | 4
+ 3 | 12 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(11,10+r) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 2 | 11 | 2 | 1
+ 2 | 12 | 3 | 2
+ 3 | 11 | 4 | 1
+ 3 | 12 | 5 | 2
+ 3 | 13 | 6 | 3
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_mat(r1,r2);
+ r1 | r2 | i | s
+----+----+----+----
+ 11 | 12 | 11 | 1
+ 11 | 12 | 12 | 2
+ 13 | 15 | 13 | 3
+ 13 | 15 | 14 | 4
+ 13 | 15 | 15 | 5
+ 16 | 20 | 16 | 6
+ 16 | 20 | 17 | 7
+ 16 | 20 | 18 | 8
+ 16 | 20 | 19 | 9
+ 16 | 20 | 20 | 10
+(10 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_mat(r1,r2) WITH ORDINALITY AS f(i,s,o);
+ r1 | r2 | i | s | o
+----+----+----+----+---
+ 11 | 12 | 11 | 1 | 1
+ 11 | 12 | 12 | 2 | 2
+ 13 | 15 | 13 | 3 | 1
+ 13 | 15 | 14 | 4 | 2
+ 13 | 15 | 15 | 5 | 3
+ 16 | 20 | 16 | 6 | 1
+ 16 | 20 | 17 | 7 | 2
+ 16 | 20 | 18 | 8 | 3
+ 16 | 20 | 19 | 9 | 4
+ 16 | 20 | 20 | 10 | 5
+(10 rows)
+
+-- selective rescan of multiple functions:
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( rngfunc_sql(11,11), rngfunc_mat(10+r,13) );
+ r | i | s | i | s
+---+----+---+----+---
+ 1 | 11 | 1 | 11 | 1
+ 1 | | | 12 | 2
+ 1 | | | 13 | 3
+ 2 | 11 | 1 | 12 | 4
+ 2 | | | 13 | 5
+ 3 | 11 | 1 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( rngfunc_sql(10+r,13), rngfunc_mat(11,11) );
+ r | i | s | i | s
+---+----+---+----+---
+ 1 | 11 | 1 | 11 | 1
+ 1 | 12 | 2 | |
+ 1 | 13 | 3 | |
+ 2 | 12 | 4 | 11 | 1
+ 2 | 13 | 5 | |
+ 3 | 13 | 6 | 11 | 1
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( rngfunc_sql(10+r,13), rngfunc_mat(10+r,13) );
+ r | i | s | i | s
+---+----+---+----+---
+ 1 | 11 | 1 | 11 | 1
+ 1 | 12 | 2 | 12 | 2
+ 1 | 13 | 3 | 13 | 3
+ 2 | 12 | 4 | 12 | 4
+ 2 | 13 | 5 | 13 | 5
+ 3 | 13 | 6 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM generate_series(1,2) r1, generate_series(r1,3) r2, ROWS FROM( rngfunc_sql(10+r1,13), rngfunc_mat(10+r2,13) );
+ r1 | r2 | i | s | i | s
+----+----+----+----+----+---
+ 1 | 1 | 11 | 1 | 11 | 1
+ 1 | 1 | 12 | 2 | 12 | 2
+ 1 | 1 | 13 | 3 | 13 | 3
+ 1 | 2 | 11 | 4 | 12 | 4
+ 1 | 2 | 12 | 5 | 13 | 5
+ 1 | 2 | 13 | 6 | |
+ 1 | 3 | 11 | 7 | 13 | 6
+ 1 | 3 | 12 | 8 | |
+ 1 | 3 | 13 | 9 | |
+ 2 | 2 | 12 | 10 | 12 | 7
+ 2 | 2 | 13 | 11 | 13 | 8
+ 2 | 3 | 12 | 12 | 13 | 9
+ 2 | 3 | 13 | 13 | |
+(13 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), generate_series(10+r,20-r) f(i);
+ r | i
+---+----
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 12
+ 2 | 13
+ 2 | 14
+ 2 | 15
+ 2 | 16
+ 2 | 17
+ 2 | 18
+ 3 | 13
+ 3 | 14
+ 3 | 15
+ 3 | 16
+ 3 | 17
+(21 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), generate_series(10+r,20-r) WITH ORDINALITY AS f(i,o);
+ r | i | o
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 1 | 14 | 4
+ 1 | 15 | 5
+ 1 | 16 | 6
+ 1 | 17 | 7
+ 1 | 18 | 8
+ 1 | 19 | 9
+ 2 | 12 | 1
+ 2 | 13 | 2
+ 2 | 14 | 3
+ 2 | 15 | 4
+ 2 | 16 | 5
+ 2 | 17 | 6
+ 2 | 18 | 7
+ 3 | 13 | 1
+ 3 | 14 | 2
+ 3 | 15 | 3
+ 3 | 16 | 4
+ 3 | 17 | 5
+(21 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), unnest(array[r*10,r*20,r*30]) f(i);
+ r | i
+---+----
+ 1 | 10
+ 1 | 20
+ 1 | 30
+ 2 | 20
+ 2 | 40
+ 2 | 60
+ 3 | 30
+ 3 | 60
+ 3 | 90
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), unnest(array[r*10,r*20,r*30]) WITH ORDINALITY AS f(i,o);
+ r | i | o
+---+----+---
+ 1 | 10 | 1
+ 1 | 20 | 2
+ 1 | 30 | 3
+ 2 | 20 | 1
+ 2 | 40 | 2
+ 2 | 60 | 3
+ 3 | 30 | 1
+ 3 | 60 | 2
+ 3 | 90 | 3
+(9 rows)
+
+-- deep nesting
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(21,23) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+----
+ 1 | 1 | 10 | 21
+ 1 | 1 | 10 | 22
+ 1 | 1 | 10 | 23
+ 1 | 1 | 20 | 21
+ 1 | 1 | 20 | 22
+ 1 | 1 | 20 | 23
+ 1 | 1 | 30 | 21
+ 1 | 1 | 30 | 22
+ 1 | 1 | 30 | 23
+ 2 | 2 | 10 | 21
+ 2 | 2 | 10 | 22
+ 2 | 2 | 10 | 23
+ 2 | 2 | 20 | 21
+ 2 | 2 | 20 | 22
+ 2 | 2 | 20 | 23
+ 2 | 2 | 30 | 21
+ 2 | 2 | 30 | 22
+ 2 | 2 | 30 | 23
+ 3 | 3 | 10 | 21
+ 3 | 3 | 10 | 22
+ 3 | 3 | 10 | 23
+ 3 | 3 | 20 | 21
+ 3 | 3 | 20 | 22
+ 3 | 3 | 20 | 23
+ 3 | 3 | 30 | 21
+ 3 | 3 | 30 | 22
+ 3 | 3 | 30 | 23
+(27 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(20+r1,23) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+----
+ 1 | 1 | 10 | 21
+ 1 | 1 | 10 | 22
+ 1 | 1 | 10 | 23
+ 1 | 1 | 20 | 21
+ 1 | 1 | 20 | 22
+ 1 | 1 | 20 | 23
+ 1 | 1 | 30 | 21
+ 1 | 1 | 30 | 22
+ 1 | 1 | 30 | 23
+ 2 | 2 | 10 | 22
+ 2 | 2 | 10 | 23
+ 2 | 2 | 20 | 22
+ 2 | 2 | 20 | 23
+ 2 | 2 | 30 | 22
+ 2 | 2 | 30 | 23
+ 3 | 3 | 10 | 23
+ 3 | 3 | 20 | 23
+ 3 | 3 | 30 | 23
+(18 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(r2,r2+3) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+----
+ 1 | 1 | 10 | 10
+ 1 | 1 | 10 | 11
+ 1 | 1 | 10 | 12
+ 1 | 1 | 10 | 13
+ 1 | 1 | 20 | 20
+ 1 | 1 | 20 | 21
+ 1 | 1 | 20 | 22
+ 1 | 1 | 20 | 23
+ 1 | 1 | 30 | 30
+ 1 | 1 | 30 | 31
+ 1 | 1 | 30 | 32
+ 1 | 1 | 30 | 33
+ 2 | 2 | 10 | 10
+ 2 | 2 | 10 | 11
+ 2 | 2 | 10 | 12
+ 2 | 2 | 10 | 13
+ 2 | 2 | 20 | 20
+ 2 | 2 | 20 | 21
+ 2 | 2 | 20 | 22
+ 2 | 2 | 20 | 23
+ 2 | 2 | 30 | 30
+ 2 | 2 | 30 | 31
+ 2 | 2 | 30 | 32
+ 2 | 2 | 30 | 33
+ 3 | 3 | 10 | 10
+ 3 | 3 | 10 | 11
+ 3 | 3 | 10 | 12
+ 3 | 3 | 10 | 13
+ 3 | 3 | 20 | 20
+ 3 | 3 | 20 | 21
+ 3 | 3 | 20 | 22
+ 3 | 3 | 20 | 23
+ 3 | 3 | 30 | 30
+ 3 | 3 | 30 | 31
+ 3 | 3 | 30 | 32
+ 3 | 3 | 30 | 33
+(36 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(r1,2+r2/5) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+---
+ 1 | 1 | 10 | 1
+ 1 | 1 | 10 | 2
+ 1 | 1 | 10 | 3
+ 1 | 1 | 10 | 4
+ 1 | 1 | 20 | 1
+ 1 | 1 | 20 | 2
+ 1 | 1 | 20 | 3
+ 1 | 1 | 20 | 4
+ 1 | 1 | 20 | 5
+ 1 | 1 | 20 | 6
+ 1 | 1 | 30 | 1
+ 1 | 1 | 30 | 2
+ 1 | 1 | 30 | 3
+ 1 | 1 | 30 | 4
+ 1 | 1 | 30 | 5
+ 1 | 1 | 30 | 6
+ 1 | 1 | 30 | 7
+ 1 | 1 | 30 | 8
+ 2 | 2 | 10 | 2
+ 2 | 2 | 10 | 3
+ 2 | 2 | 10 | 4
+ 2 | 2 | 20 | 2
+ 2 | 2 | 20 | 3
+ 2 | 2 | 20 | 4
+ 2 | 2 | 20 | 5
+ 2 | 2 | 20 | 6
+ 2 | 2 | 30 | 2
+ 2 | 2 | 30 | 3
+ 2 | 2 | 30 | 4
+ 2 | 2 | 30 | 5
+ 2 | 2 | 30 | 6
+ 2 | 2 | 30 | 7
+ 2 | 2 | 30 | 8
+ 3 | 3 | 10 | 3
+ 3 | 3 | 10 | 4
+ 3 | 3 | 20 | 3
+ 3 | 3 | 20 | 4
+ 3 | 3 | 20 | 5
+ 3 | 3 | 20 | 6
+ 3 | 3 | 30 | 3
+ 3 | 3 | 30 | 4
+ 3 | 3 | 30 | 5
+ 3 | 3 | 30 | 6
+ 3 | 3 | 30 | 7
+ 3 | 3 | 30 | 8
+(45 rows)
+
+-- check handling of FULL JOIN with multiple lateral references (bug #15741)
+SELECT *
+FROM (VALUES (1),(2)) v1(r1)
+ LEFT JOIN LATERAL (
+ SELECT *
+ FROM generate_series(1, v1.r1) AS gs1
+ LEFT JOIN LATERAL (
+ SELECT *
+ FROM generate_series(1, gs1) AS gs2
+ LEFT JOIN generate_series(1, gs2) AS gs3 ON TRUE
+ ) AS ss1 ON TRUE
+ FULL JOIN generate_series(1, v1.r1) AS gs4 ON FALSE
+ ) AS ss0 ON TRUE;
+ r1 | gs1 | gs2 | gs3 | gs4
+----+-----+-----+-----+-----
+ 1 | | | | 1
+ 1 | 1 | 1 | 1 |
+ 2 | | | | 1
+ 2 | | | | 2
+ 2 | 1 | 1 | 1 |
+ 2 | 2 | 1 | 1 |
+ 2 | 2 | 2 | 1 |
+ 2 | 2 | 2 | 2 |
+(8 rows)
+
+DROP FUNCTION rngfunc_sql(int,int);
+DROP FUNCTION rngfunc_mat(int,int);
+DROP SEQUENCE rngfunc_rescan_seq1;
+DROP SEQUENCE rngfunc_rescan_seq2;
+--
+-- Test cases involving OUT parameters
+--
+CREATE FUNCTION rngfunc(in f1 int, out f2 int)
+AS 'select $1+1' LANGUAGE sql;
+SELECT rngfunc(42);
+ rngfunc
+---------
+ 43
+(1 row)
+
+SELECT * FROM rngfunc(42);
+ f2
+----
+ 43
+(1 row)
+
+SELECT * FROM rngfunc(42) AS p(x);
+ x
+----
+ 43
+(1 row)
+
+-- explicit spec of return type is OK
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int) RETURNS int
+AS 'select $1+1' LANGUAGE sql;
+-- error, wrong result type
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int) RETURNS float
+AS 'select $1+1' LANGUAGE sql;
+ERROR: function result type must be integer because of OUT parameters
+-- with multiple OUT params you must get a RECORD result
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int, out f3 text) RETURNS int
+AS 'select $1+1' LANGUAGE sql;
+ERROR: function result type must be record because of OUT parameters
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int, out f3 text)
+RETURNS record
+AS 'select $1+1' LANGUAGE sql;
+ERROR: cannot change return type of existing function
+HINT: Use DROP FUNCTION rngfunc(integer) first.
+CREATE OR REPLACE FUNCTION rngfuncr(in f1 int, out f2 int, out text)
+AS $$select $1-1, $1::text || 'z'$$ LANGUAGE sql;
+SELECT f1, rngfuncr(f1) FROM int4_tbl;
+ f1 | rngfuncr
+-------------+----------------------------
+ 0 | (-1,0z)
+ 123456 | (123455,123456z)
+ -123456 | (-123457,-123456z)
+ 2147483647 | (2147483646,2147483647z)
+ -2147483647 | (-2147483648,-2147483647z)
+(5 rows)
+
+SELECT * FROM rngfuncr(42);
+ f2 | column2
+----+---------
+ 41 | 42z
+(1 row)
+
+SELECT * FROM rngfuncr(42) AS p(a,b);
+ a | b
+----+-----
+ 41 | 42z
+(1 row)
+
+CREATE OR REPLACE FUNCTION rngfuncb(in f1 int, inout f2 int, out text)
+AS $$select $2-1, $1::text || 'z'$$ LANGUAGE sql;
+SELECT f1, rngfuncb(f1, f1/2) FROM int4_tbl;
+ f1 | rngfuncb
+-------------+----------------------------
+ 0 | (-1,0z)
+ 123456 | (61727,123456z)
+ -123456 | (-61729,-123456z)
+ 2147483647 | (1073741822,2147483647z)
+ -2147483647 | (-1073741824,-2147483647z)
+(5 rows)
+
+SELECT * FROM rngfuncb(42, 99);
+ f2 | column2
+----+---------
+ 98 | 42z
+(1 row)
+
+SELECT * FROM rngfuncb(42, 99) AS p(a,b);
+ a | b
+----+-----
+ 98 | 42z
+(1 row)
+
+-- Can reference function with or without OUT params for DROP, etc
+DROP FUNCTION rngfunc(int);
+DROP FUNCTION rngfuncr(in f2 int, out f1 int, out text);
+DROP FUNCTION rngfuncb(in f1 int, inout f2 int);
+--
+-- For my next trick, polymorphic OUT parameters
+--
+CREATE FUNCTION dup (f1 anyelement, f2 out anyelement, f3 out anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+SELECT dup(22);
+ dup
+----------------
+ (22,"{22,22}")
+(1 row)
+
+SELECT dup('xyz'); -- fails
+ERROR: could not determine polymorphic type because input has type unknown
+SELECT dup('xyz'::text);
+ dup
+-------------------
+ (xyz,"{xyz,xyz}")
+(1 row)
+
+SELECT * FROM dup('xyz'::text);
+ f2 | f3
+-----+-----------
+ xyz | {xyz,xyz}
+(1 row)
+
+-- fails, as we are attempting to rename first argument
+CREATE OR REPLACE FUNCTION dup (inout f2 anyelement, out f3 anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+ERROR: cannot change name of input parameter "f1"
+HINT: Use DROP FUNCTION dup(anyelement) first.
+DROP FUNCTION dup(anyelement);
+-- equivalent behavior, though different name exposed for input arg
+CREATE OR REPLACE FUNCTION dup (inout f2 anyelement, out f3 anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+SELECT dup(22);
+ dup
+----------------
+ (22,"{22,22}")
+(1 row)
+
+DROP FUNCTION dup(anyelement);
+-- fails, no way to deduce outputs
+CREATE FUNCTION bad (f1 int, out f2 anyelement, out f3 anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+ERROR: cannot determine result data type
+DETAIL: A function returning a polymorphic type must have at least one polymorphic argument.
+--
+-- table functions
+--
+CREATE OR REPLACE FUNCTION rngfunc()
+RETURNS TABLE(a int)
+AS $$ SELECT a FROM generate_series(1,5) a(a) $$ LANGUAGE sql;
+SELECT * FROM rngfunc();
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+DROP FUNCTION rngfunc();
+CREATE OR REPLACE FUNCTION rngfunc(int)
+RETURNS TABLE(a int, b int)
+AS $$ SELECT a, b
+ FROM generate_series(1,$1) a(a),
+ generate_series(1,$1) b(b) $$ LANGUAGE sql;
+SELECT * FROM rngfunc(3);
+ a | b
+---+---
+ 1 | 1
+ 1 | 2
+ 1 | 3
+ 2 | 1
+ 2 | 2
+ 2 | 3
+ 3 | 1
+ 3 | 2
+ 3 | 3
+(9 rows)
+
+DROP FUNCTION rngfunc(int);
+-- case that causes change of typmod knowledge during inlining
+CREATE OR REPLACE FUNCTION rngfunc()
+RETURNS TABLE(a varchar(5))
+AS $$ SELECT 'hello'::varchar(5) $$ LANGUAGE sql STABLE;
+SELECT * FROM rngfunc() GROUP BY 1;
+ a
+-------
+ hello
+(1 row)
+
+DROP FUNCTION rngfunc();
+--
+-- some tests on SQL functions with RETURNING
+--
+create temp table tt(f1 serial, data text);
+create function insert_tt(text) returns int as
+$$ insert into tt(data) values($1) returning f1 $$
+language sql;
+select insert_tt('foo');
+ insert_tt
+-----------
+ 1
+(1 row)
+
+select insert_tt('bar');
+ insert_tt
+-----------
+ 2
+(1 row)
+
+select * from tt;
+ f1 | data
+----+------
+ 1 | foo
+ 2 | bar
+(2 rows)
+
+-- insert will execute to completion even if function needs just 1 row
+create or replace function insert_tt(text) returns int as
+$$ insert into tt(data) values($1),($1||$1) returning f1 $$
+language sql;
+select insert_tt('fool');
+ insert_tt
+-----------
+ 3
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+(4 rows)
+
+-- setof does what's expected
+create or replace function insert_tt2(text,text) returns setof int as
+$$ insert into tt(data) values($1),($2) returning f1 $$
+language sql;
+select insert_tt2('foolish','barrish');
+ insert_tt2
+------------
+ 5
+ 6
+(2 rows)
+
+select * from insert_tt2('baz','quux');
+ insert_tt2
+------------
+ 7
+ 8
+(2 rows)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+(8 rows)
+
+-- limit doesn't prevent execution to completion
+select insert_tt2('foolish','barrish') limit 1;
+ insert_tt2
+------------
+ 9
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+ 9 | foolish
+ 10 | barrish
+(10 rows)
+
+-- triggers will fire, too
+create function noticetrigger() returns trigger as $$
+begin
+ raise notice 'noticetrigger % %', new.f1, new.data;
+ return null;
+end $$ language plpgsql;
+create trigger tnoticetrigger after insert on tt for each row
+execute procedure noticetrigger();
+select insert_tt2('foolme','barme') limit 1;
+NOTICE: noticetrigger 11 foolme
+NOTICE: noticetrigger 12 barme
+ insert_tt2
+------------
+ 11
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+ 9 | foolish
+ 10 | barrish
+ 11 | foolme
+ 12 | barme
+(12 rows)
+
+-- and rules work
+create temp table tt_log(f1 int, data text);
+create rule insert_tt_rule as on insert to tt do also
+ insert into tt_log values(new.*);
+select insert_tt2('foollog','barlog') limit 1;
+NOTICE: noticetrigger 13 foollog
+NOTICE: noticetrigger 14 barlog
+ insert_tt2
+------------
+ 13
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+ 9 | foolish
+ 10 | barrish
+ 11 | foolme
+ 12 | barme
+ 13 | foollog
+ 14 | barlog
+(14 rows)
+
+-- note that nextval() gets executed a second time in the rule expansion,
+-- which is expected.
+select * from tt_log;
+ f1 | data
+----+---------
+ 15 | foollog
+ 16 | barlog
+(2 rows)
+
+-- test case for a whole-row-variable bug
+create function rngfunc1(n integer, out a text, out b text)
+ returns setof record
+ language sql
+ as $$ select 'foo ' || i, 'bar ' || i from generate_series(1,$1) i $$;
+set work_mem='64kB';
+select t.a, t, t.a from rngfunc1(10000) t limit 1;
+ a | t | a
+-------+-------------------+-------
+ foo 1 | ("foo 1","bar 1") | foo 1
+(1 row)
+
+reset work_mem;
+select t.a, t, t.a from rngfunc1(10000) t limit 1;
+ a | t | a
+-------+-------------------+-------
+ foo 1 | ("foo 1","bar 1") | foo 1
+(1 row)
+
+drop function rngfunc1(n integer);
+-- test use of SQL functions returning record
+-- this is supported in some cases where the query doesn't specify
+-- the actual record type ...
+create function array_to_set(anyarray) returns setof record as $$
+ select i AS "index", $1[i] AS "value" from generate_subscripts($1, 1) i
+$$ language sql strict immutable;
+select array_to_set(array['one', 'two']);
+ array_to_set
+--------------
+ (1,one)
+ (2,two)
+(2 rows)
+
+select * from array_to_set(array['one', 'two']) as t(f1 int,f2 text);
+ f1 | f2
+----+-----
+ 1 | one
+ 2 | two
+(2 rows)
+
+select * from array_to_set(array['one', 'two']); -- fail
+ERROR: a column definition list is required for functions returning "record"
+LINE 1: select * from array_to_set(array['one', 'two']);
+ ^
+create temp table rngfunc(f1 int8, f2 int8);
+create function testrngfunc() returns record as $$
+ insert into rngfunc values (1,2) returning *;
+$$ language sql;
+select testrngfunc();
+ testrngfunc
+-------------
+ (1,2)
+(1 row)
+
+select * from testrngfunc() as t(f1 int8,f2 int8);
+ f1 | f2
+----+----
+ 1 | 2
+(1 row)
+
+select * from testrngfunc(); -- fail
+ERROR: a column definition list is required for functions returning "record"
+LINE 1: select * from testrngfunc();
+ ^
+drop function testrngfunc();
+create function testrngfunc() returns setof record as $$
+ insert into rngfunc values (1,2), (3,4) returning *;
+$$ language sql;
+select testrngfunc();
+ testrngfunc
+-------------
+ (1,2)
+ (3,4)
+(2 rows)
+
+select * from testrngfunc() as t(f1 int8,f2 int8);
+ f1 | f2
+----+----
+ 1 | 2
+ 3 | 4
+(2 rows)
+
+select * from testrngfunc(); -- fail
+ERROR: a column definition list is required for functions returning "record"
+LINE 1: select * from testrngfunc();
+ ^
+drop function testrngfunc();
+--
+-- Check some cases involving added/dropped columns in a rowtype result
+--
+create temp table users (userid text, seq int, email text, todrop bool, moredrop int, enabled bool);
+insert into users values ('id',1,'email',true,11,true);
+insert into users values ('id2',2,'email2',true,12,true);
+alter table users drop column todrop;
+create or replace function get_first_user() returns users as
+$$ SELECT * FROM users ORDER BY userid LIMIT 1; $$
+language sql stable;
+SELECT get_first_user();
+ get_first_user
+-------------------
+ (id,1,email,11,t)
+(1 row)
+
+SELECT * FROM get_first_user();
+ userid | seq | email | moredrop | enabled
+--------+-----+-------+----------+---------
+ id | 1 | email | 11 | t
+(1 row)
+
+create or replace function get_users() returns setof users as
+$$ SELECT * FROM users ORDER BY userid; $$
+language sql stable;
+SELECT get_users();
+ get_users
+---------------------
+ (id,1,email,11,t)
+ (id2,2,email2,12,t)
+(2 rows)
+
+SELECT * FROM get_users();
+ userid | seq | email | moredrop | enabled
+--------+-----+--------+----------+---------
+ id | 1 | email | 11 | t
+ id2 | 2 | email2 | 12 | t
+(2 rows)
+
+SELECT * FROM get_users() WITH ORDINALITY; -- make sure ordinality copes
+ userid | seq | email | moredrop | enabled | ordinality
+--------+-----+--------+----------+---------+------------
+ id | 1 | email | 11 | t | 1
+ id2 | 2 | email2 | 12 | t | 2
+(2 rows)
+
+-- multiple functions vs. dropped columns
+SELECT * FROM ROWS FROM(generate_series(10,11), get_users()) WITH ORDINALITY;
+ generate_series | userid | seq | email | moredrop | enabled | ordinality
+-----------------+--------+-----+--------+----------+---------+------------
+ 10 | id | 1 | email | 11 | t | 1
+ 11 | id2 | 2 | email2 | 12 | t | 2
+(2 rows)
+
+SELECT * FROM ROWS FROM(get_users(), generate_series(10,11)) WITH ORDINALITY;
+ userid | seq | email | moredrop | enabled | generate_series | ordinality
+--------+-----+--------+----------+---------+-----------------+------------
+ id | 1 | email | 11 | t | 10 | 1
+ id2 | 2 | email2 | 12 | t | 11 | 2
+(2 rows)
+
+-- check that we can cope with post-parsing changes in rowtypes
+create temp view usersview as
+SELECT * FROM ROWS FROM(get_users(), generate_series(10,11)) WITH ORDINALITY;
+select * from usersview;
+ userid | seq | email | moredrop | enabled | generate_series | ordinality
+--------+-----+--------+----------+---------+-----------------+------------
+ id | 1 | email | 11 | t | 10 | 1
+ id2 | 2 | email2 | 12 | t | 11 | 2
+(2 rows)
+
+alter table users add column junk text;
+select * from usersview;
+ userid | seq | email | moredrop | enabled | generate_series | ordinality
+--------+-----+--------+----------+---------+-----------------+------------
+ id | 1 | email | 11 | t | 10 | 1
+ id2 | 2 | email2 | 12 | t | 11 | 2
+(2 rows)
+
+begin;
+alter table users drop column moredrop;
+select * from usersview; -- expect clean failure
+ERROR: attribute 5 of type record has been dropped
+rollback;
+alter table users alter column seq type numeric;
+select * from usersview; -- expect clean failure
+ERROR: attribute 2 of type record has wrong type
+DETAIL: Table has type numeric, but query expects integer.
+drop view usersview;
+drop function get_first_user();
+drop function get_users();
+drop table users;
+-- this won't get inlined because of type coercion, but it shouldn't fail
+create or replace function rngfuncbar() returns setof text as
+$$ select 'foo'::varchar union all select 'bar'::varchar ; $$
+language sql stable;
+select rngfuncbar();
+ rngfuncbar
+------------
+ foo
+ bar
+(2 rows)
+
+select * from rngfuncbar();
+ rngfuncbar
+------------
+ foo
+ bar
+(2 rows)
+
+drop function rngfuncbar();
+-- check handling of a SQL function with multiple OUT params (bug #5777)
+create or replace function rngfuncbar(out integer, out numeric) as
+$$ select (1, 2.1) $$ language sql;
+select * from rngfuncbar();
+ column1 | column2
+---------+---------
+ 1 | 2.1
+(1 row)
+
+create or replace function rngfuncbar(out integer, out numeric) as
+$$ select (1, 2) $$ language sql;
+select * from rngfuncbar(); -- fail
+ERROR: function return row and query-specified return row do not match
+DETAIL: Returned type integer at ordinal position 2, but query expects numeric.
+create or replace function rngfuncbar(out integer, out numeric) as
+$$ select (1, 2.1, 3) $$ language sql;
+select * from rngfuncbar(); -- fail
+ERROR: function return row and query-specified return row do not match
+DETAIL: Returned row contains 3 attributes, but query expects 2.
+drop function rngfuncbar();
+-- check whole-row-Var handling in nested lateral functions (bug #11703)
+create function extractq2(t int8_tbl) returns int8 as $$
+ select t.q2
+$$ language sql immutable;
+explain (verbose, costs off)
+select x from int8_tbl, extractq2(int8_tbl) f(x);
+ QUERY PLAN
+------------------------------------
+ Nested Loop
+ Output: f.x
+ -> Seq Scan on public.int8_tbl
+ Output: int8_tbl.q2
+ -> Function Scan on f
+ Output: f.x
+ Function Call: int8_tbl.q2
+(7 rows)
+
+select x from int8_tbl, extractq2(int8_tbl) f(x);
+ x
+-------------------
+ 456
+ 4567890123456789
+ 123
+ 4567890123456789
+ -4567890123456789
+(5 rows)
+
+create function extractq2_2(t int8_tbl) returns table(ret1 int8) as $$
+ select extractq2(t) offset 0
+$$ language sql immutable;
+explain (verbose, costs off)
+select x from int8_tbl, extractq2_2(int8_tbl) f(x);
+ QUERY PLAN
+-----------------------------------
+ Nested Loop
+ Output: ((int8_tbl.*).q2)
+ -> Seq Scan on public.int8_tbl
+ Output: int8_tbl.*
+ -> Result
+ Output: (int8_tbl.*).q2
+(6 rows)
+
+select x from int8_tbl, extractq2_2(int8_tbl) f(x);
+ x
+-------------------
+ 456
+ 4567890123456789
+ 123
+ 4567890123456789
+ -4567890123456789
+(5 rows)
+
+-- without the "offset 0", this function gets optimized quite differently
+create function extractq2_2_opt(t int8_tbl) returns table(ret1 int8) as $$
+ select extractq2(t)
+$$ language sql immutable;
+explain (verbose, costs off)
+select x from int8_tbl, extractq2_2_opt(int8_tbl) f(x);
+ QUERY PLAN
+-----------------------------
+ Seq Scan on public.int8_tbl
+ Output: int8_tbl.q2
+(2 rows)
+
+select x from int8_tbl, extractq2_2_opt(int8_tbl) f(x);
+ x
+-------------------
+ 456
+ 4567890123456789
+ 123
+ 4567890123456789
+ -4567890123456789
+(5 rows)
+
+-- check handling of nulls in SRF results (bug #7808)
+create type rngfunc2 as (a integer, b text);
+select *, row_to_json(u) from unnest(array[(1,'foo')::rngfunc2, null::rngfunc2]) u;
+ a | b | row_to_json
+---+-----+---------------------
+ 1 | foo | {"a":1,"b":"foo"}
+ | | {"a":null,"b":null}
+(2 rows)
+
+select *, row_to_json(u) from unnest(array[null::rngfunc2, null::rngfunc2]) u;
+ a | b | row_to_json
+---+---+---------------------
+ | | {"a":null,"b":null}
+ | | {"a":null,"b":null}
+(2 rows)
+
+select *, row_to_json(u) from unnest(array[null::rngfunc2, (1,'foo')::rngfunc2, null::rngfunc2]) u;
+ a | b | row_to_json
+---+-----+---------------------
+ | | {"a":null,"b":null}
+ 1 | foo | {"a":1,"b":"foo"}
+ | | {"a":null,"b":null}
+(3 rows)
+
+select *, row_to_json(u) from unnest(array[]::rngfunc2[]) u;
+ a | b | row_to_json
+---+---+-------------
+(0 rows)
+
+drop type rngfunc2;
diff --git a/src/test/regress/expected/reloptions_1.out b/src/test/regress/expected/reloptions_1.out
new file mode 100644
index 00000000000..fd0b73a3656
--- /dev/null
+++ b/src/test/regress/expected/reloptions_1.out
@@ -0,0 +1,219 @@
+-- Simple create
+CREATE TABLE reloptions_test(i INT) WITH (FiLLFaCToR=30,
+ autovacuum_enabled = false, autovacuum_analyze_scale_factor = 0.2);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+------------------------------------------------------------------------------
+ {fillfactor=30,autovacuum_enabled=false,autovacuum_analyze_scale_factor=0.2}
+(1 row)
+
+-- Fail min/max values check
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=2);
+ERROR: value 2 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=110);
+ERROR: value 110 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor = -10.0);
+ERROR: value -10.0 out of bounds for option "autovacuum_analyze_scale_factor"
+DETAIL: Valid values are between "0.000000" and "100.000000".
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor = 110.0);
+ERROR: value 110.0 out of bounds for option "autovacuum_analyze_scale_factor"
+DETAIL: Valid values are between "0.000000" and "100.000000".
+-- Fail when option and namespace do not exist
+CREATE TABLE reloptions_test2(i INT) WITH (not_existing_option=2);
+ERROR: unrecognized parameter "not_existing_option"
+CREATE TABLE reloptions_test2(i INT) WITH (not_existing_namespace.fillfactor=2);
+ERROR: unrecognized parameter namespace "not_existing_namespace"
+-- Fail while setting improper values
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=-30.1);
+ERROR: value -30.1 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor='string');
+ERROR: invalid value for integer option "fillfactor": string
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=true);
+ERROR: invalid value for integer option "fillfactor": true
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_enabled=12);
+ERROR: invalid value for boolean option "autovacuum_enabled": 12
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_enabled=30.5);
+ERROR: invalid value for boolean option "autovacuum_enabled": 30.5
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_enabled='string');
+ERROR: invalid value for boolean option "autovacuum_enabled": string
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor='string');
+ERROR: invalid value for floating point option "autovacuum_analyze_scale_factor": string
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor=true);
+ERROR: invalid value for floating point option "autovacuum_analyze_scale_factor": true
+-- Fail if option is specified twice
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=30, fillfactor=40);
+ERROR: parameter "fillfactor" specified more than once
+-- Specifying name only for a non-Boolean option should fail
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor);
+ERROR: invalid value for integer option "fillfactor": true
+-- Simple ALTER TABLE
+ALTER TABLE reloptions_test SET (fillfactor=31,
+ autovacuum_analyze_scale_factor = 0.3);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+------------------------------------------------------------------------------
+ {autovacuum_enabled=false,fillfactor=31,autovacuum_analyze_scale_factor=0.3}
+(1 row)
+
+-- Set boolean option to true without specifying value
+ALTER TABLE reloptions_test SET (autovacuum_enabled, fillfactor=32);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+-----------------------------------------------------------------------------
+ {autovacuum_analyze_scale_factor=0.3,autovacuum_enabled=true,fillfactor=32}
+(1 row)
+
+-- Check that RESET works well
+ALTER TABLE reloptions_test RESET (fillfactor);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+---------------------------------------------------------------
+ {autovacuum_analyze_scale_factor=0.3,autovacuum_enabled=true}
+(1 row)
+
+-- Resetting all values causes the column to become null
+ALTER TABLE reloptions_test RESET (autovacuum_enabled,
+ autovacuum_analyze_scale_factor);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass AND
+ reloptions IS NULL;
+ reloptions
+------------
+
+(1 row)
+
+-- RESET fails if a value is specified
+ALTER TABLE reloptions_test RESET (fillfactor=12);
+ERROR: RESET must not include values for parameters
+-- Test vacuum_truncate option
+DROP TABLE reloptions_test;
+CREATE TABLE reloptions_test(i INT NOT NULL, j text)
+ WITH (vacuum_truncate=false,
+ toast.vacuum_truncate=false,
+ autovacuum_enabled=false);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+--------------------------------------------------
+ {vacuum_truncate=false,autovacuum_enabled=false}
+(1 row)
+
+INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
+ERROR: null value in column "i" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+VACUUM reloptions_test;
+SELECT pg_relation_size('reloptions_test') > 0;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT reloptions FROM pg_class WHERE oid =
+ (SELECT reltoastrelid FROM pg_class
+ WHERE oid = 'reloptions_test'::regclass);
+ reloptions
+------------
+(0 rows)
+
+ALTER TABLE reloptions_test RESET (vacuum_truncate);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+----------------------------
+ {autovacuum_enabled=false}
+(1 row)
+
+INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
+ERROR: null value in column "i" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+VACUUM reloptions_test;
+SELECT pg_relation_size('reloptions_test') = 0;
+ ?column?
+----------
+ f
+(1 row)
+
+-- Test toast.* options
+DROP TABLE reloptions_test;
+CREATE TABLE reloptions_test (s VARCHAR)
+ WITH (toast.autovacuum_vacuum_cost_delay = 23);
+SELECT reltoastrelid as toast_oid
+ FROM pg_class WHERE oid = 'reloptions_test'::regclass \gset
+SELECT reloptions FROM pg_class WHERE oid = :toast_oid;
+ reloptions
+------------
+(0 rows)
+
+ALTER TABLE reloptions_test SET (toast.autovacuum_vacuum_cost_delay = 24);
+SELECT reloptions FROM pg_class WHERE oid = :toast_oid;
+ reloptions
+------------
+(0 rows)
+
+ALTER TABLE reloptions_test RESET (toast.autovacuum_vacuum_cost_delay);
+SELECT reloptions FROM pg_class WHERE oid = :toast_oid;
+ reloptions
+------------
+(0 rows)
+
+-- Fail on non-existent options in toast namespace
+CREATE TABLE reloptions_test2 (i int) WITH (toast.not_existing_option = 42);
+ERROR: unrecognized parameter "not_existing_option"
+-- Mix TOAST & heap
+DROP TABLE reloptions_test;
+CREATE TABLE reloptions_test (s VARCHAR) WITH
+ (toast.autovacuum_vacuum_cost_delay = 23,
+ autovacuum_vacuum_cost_delay = 24, fillfactor = 40);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+-------------------------------------------------
+ {autovacuum_vacuum_cost_delay=24,fillfactor=40}
+(1 row)
+
+SELECT reloptions FROM pg_class WHERE oid = (
+ SELECT reltoastrelid FROM pg_class WHERE oid = 'reloptions_test'::regclass);
+ reloptions
+------------
+(0 rows)
+
+--
+-- CREATE INDEX, ALTER INDEX for btrees
+--
+CREATE INDEX reloptions_test_idx ON reloptions_test (s) WITH (fillfactor=30);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test_idx'::regclass;
+ reloptions
+-----------------
+ {fillfactor=30}
+(1 row)
+
+-- Fail when option and namespace do not exist
+CREATE INDEX reloptions_test_idx ON reloptions_test (s)
+ WITH (not_existing_option=2);
+ERROR: unrecognized parameter "not_existing_option"
+CREATE INDEX reloptions_test_idx ON reloptions_test (s)
+ WITH (not_existing_ns.fillfactor=2);
+ERROR: unrecognized parameter namespace "not_existing_ns"
+-- Check allowed ranges
+CREATE INDEX reloptions_test_idx2 ON reloptions_test (s) WITH (fillfactor=1);
+ERROR: value 1 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE INDEX reloptions_test_idx2 ON reloptions_test (s) WITH (fillfactor=130);
+ERROR: value 130 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+-- Check ALTER
+ALTER INDEX reloptions_test_idx SET (fillfactor=40);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test_idx'::regclass;
+ reloptions
+-----------------
+ {fillfactor=40}
+(1 row)
+
+-- Check ALTER on empty reloption list
+CREATE INDEX reloptions_test_idx3 ON reloptions_test (s);
+ALTER INDEX reloptions_test_idx3 SET (fillfactor=40);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test_idx3'::regclass;
+ reloptions
+-----------------
+ {fillfactor=40}
+(1 row)
+
diff --git a/src/test/regress/expected/rowsecurity_1.out b/src/test/regress/expected/rowsecurity_1.out
new file mode 100644
index 00000000000..5c8c0ea8935
--- /dev/null
+++ b/src/test/regress/expected/rowsecurity_1.out
@@ -0,0 +1,4018 @@
+--
+-- Test of Row-level security feature
+--
+-- Clean up in case a prior regression run failed
+-- Suppress NOTICE messages when users/groups don't exist
+SET client_min_messages TO 'warning';
+DROP USER IF EXISTS regress_rls_alice;
+DROP USER IF EXISTS regress_rls_bob;
+DROP USER IF EXISTS regress_rls_carol;
+DROP USER IF EXISTS regress_rls_dave;
+DROP USER IF EXISTS regress_rls_exempt_user;
+DROP ROLE IF EXISTS regress_rls_group1;
+DROP ROLE IF EXISTS regress_rls_group2;
+DROP SCHEMA IF EXISTS regress_rls_schema CASCADE;
+RESET client_min_messages;
+-- initial setup
+CREATE USER regress_rls_alice NOLOGIN;
+CREATE USER regress_rls_bob NOLOGIN;
+CREATE USER regress_rls_carol NOLOGIN;
+CREATE USER regress_rls_dave NOLOGIN;
+CREATE USER regress_rls_exempt_user BYPASSRLS NOLOGIN;
+CREATE ROLE regress_rls_group1 NOLOGIN;
+CREATE ROLE regress_rls_group2 NOLOGIN;
+GRANT regress_rls_group1 TO regress_rls_bob;
+GRANT regress_rls_group2 TO regress_rls_carol;
+CREATE SCHEMA regress_rls_schema;
+GRANT ALL ON SCHEMA regress_rls_schema to public;
+SET search_path = regress_rls_schema;
+-- setup of malicious function
+CREATE OR REPLACE FUNCTION f_leak(text) RETURNS bool
+ COST 0.0000001 LANGUAGE plpgsql
+ AS 'BEGIN RAISE NOTICE ''f_leak => %'', $1; RETURN true; END';
+GRANT EXECUTE ON FUNCTION f_leak(text) TO public;
+-- BASIC Row-Level Security Scenario
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE uaccount (
+ pguser name primary key,
+ seclv int
+);
+GRANT SELECT ON uaccount TO public;
+INSERT INTO uaccount VALUES
+ ('regress_rls_alice', 99),
+ ('regress_rls_bob', 1),
+ ('regress_rls_carol', 2),
+ ('regress_rls_dave', 3);
+CREATE TABLE category (
+ cid int primary key,
+ cname text
+);
+GRANT ALL ON category TO public;
+INSERT INTO category VALUES
+ (11, 'novel'),
+ (22, 'science fiction'),
+ (33, 'technology'),
+ (44, 'manga');
+CREATE TABLE document (
+ did int primary key,
+ cid int references category(cid),
+ dlevel int not null,
+ dauthor name,
+ dtitle text
+);
+GRANT ALL ON document TO public;
+INSERT INTO document VALUES
+ ( 1, 11, 1, 'regress_rls_bob', 'my first novel'),
+ ( 2, 11, 2, 'regress_rls_bob', 'my second novel'),
+ ( 3, 22, 2, 'regress_rls_bob', 'my science fiction'),
+ ( 4, 44, 1, 'regress_rls_bob', 'my first manga'),
+ ( 5, 44, 2, 'regress_rls_bob', 'my second manga'),
+ ( 6, 22, 1, 'regress_rls_carol', 'great science fiction'),
+ ( 7, 33, 2, 'regress_rls_carol', 'great technology book'),
+ ( 8, 44, 1, 'regress_rls_carol', 'great manga'),
+ ( 9, 22, 1, 'regress_rls_dave', 'awesome science fiction'),
+ (10, 33, 2, 'regress_rls_dave', 'awesome technology book');
+ALTER TABLE document ENABLE ROW LEVEL SECURITY;
+-- user's security level must be higher than or equal to document's
+CREATE POLICY p1 ON document AS PERMISSIVE
+ USING (dlevel <= (SELECT seclv FROM uaccount WHERE pguser = current_user));
+-- try to create a policy of bogus type
+CREATE POLICY p1 ON document AS UGLY
+ USING (dlevel <= (SELECT seclv FROM uaccount WHERE pguser = current_user));
+ERROR: unrecognized row security option "ugly"
+LINE 1: CREATE POLICY p1 ON document AS UGLY
+ ^
+HINT: Only PERMISSIVE or RESTRICTIVE policies are supported currently.
+-- but Dave isn't allowed to do anything at cid 50 or above
+-- this is to make sure that we sort the policies by name first
+-- when applying WITH CHECK, a later INSERT by Dave should fail due
+-- to p1r first
+CREATE POLICY p2r ON document AS RESTRICTIVE TO regress_rls_dave
+ USING (cid <> 44 AND cid < 50);
+-- and Dave isn't allowed to see manga documents
+CREATE POLICY p1r ON document AS RESTRICTIVE TO regress_rls_dave
+ USING (cid <> 44);
+\dp
+ Access privileges
+ Schema | Name | Type | Access privileges | Column privileges | Policies
+--------------------+----------+-------+---------------------------------------------+-------------------+--------------------------------------------
+ regress_rls_schema | category | table | regress_rls_alice=arwdDxt/regress_rls_alice+| |
+ | | | =arwdDxt/regress_rls_alice | |
+ regress_rls_schema | document | table | regress_rls_alice=arwdDxt/regress_rls_alice+| | p1: +
+ | | | =arwdDxt/regress_rls_alice | | (u): (dlevel <= ( SELECT uaccount.seclv +
+ | | | | | FROM uaccount +
+ | | | | | WHERE (uaccount.pguser = CURRENT_USER)))+
+ | | | | | p2r (RESTRICTIVE): +
+ | | | | | (u): ((cid <> 44) AND (cid < 50)) +
+ | | | | | to: regress_rls_dave +
+ | | | | | p1r (RESTRICTIVE): +
+ | | | | | (u): (cid <> 44) +
+ | | | | | to: regress_rls_dave
+ regress_rls_schema | uaccount | table | regress_rls_alice=arwdDxt/regress_rls_alice+| |
+ | | | =r/regress_rls_alice | |
+(3 rows)
+
+\d document
+ Table "regress_rls_schema.document"
+ Column | Type | Collation | Nullable | Default
+---------+---------+-----------+----------+---------
+ did | integer | | not null |
+ cid | integer | | |
+ dlevel | integer | | not null |
+ dauthor | name | | |
+ dtitle | text | | |
+Indexes:
+ "document_pkey" PRIMARY KEY, btree (did)
+Foreign-key constraints:
+ "document_cid_fkey" FOREIGN KEY (cid) REFERENCES category(cid)
+Policies:
+ POLICY "p1"
+ USING ((dlevel <= ( SELECT uaccount.seclv
+ FROM uaccount
+ WHERE (uaccount.pguser = CURRENT_USER))))
+ POLICY "p1r" AS RESTRICTIVE
+ TO regress_rls_dave
+ USING ((cid <> 44))
+ POLICY "p2r" AS RESTRICTIVE
+ TO regress_rls_dave
+ USING (((cid <> 44) AND (cid < 50)))
+
+SELECT * FROM pg_policies WHERE schemaname = 'regress_rls_schema' AND tablename = 'document' ORDER BY policyname;
+ schemaname | tablename | policyname | permissive | roles | cmd | qual | with_check
+--------------------+-----------+------------+-------------+--------------------+-----+--------------------------------------------+------------
+ regress_rls_schema | document | p1 | PERMISSIVE | {public} | ALL | (dlevel <= ( SELECT uaccount.seclv +|
+ | | | | | | FROM uaccount +|
+ | | | | | | WHERE (uaccount.pguser = CURRENT_USER))) |
+ regress_rls_schema | document | p1r | RESTRICTIVE | {regress_rls_dave} | ALL | (cid <> 44) |
+ regress_rls_schema | document | p2r | RESTRICTIVE | {regress_rls_dave} | ALL | ((cid <> 44) AND (cid < 50)) |
+(3 rows)
+
+-- viewpoint from regress_rls_bob
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+(5 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-------------------+-------------------------+-----------------
+ 11 | 1 | 1 | regress_rls_bob | my first novel | novel
+ 44 | 4 | 1 | regress_rls_bob | my first manga | manga
+ 22 | 6 | 1 | regress_rls_carol | great science fiction | science fiction
+ 44 | 8 | 1 | regress_rls_carol | great manga | manga
+ 22 | 9 | 1 | regress_rls_dave | awesome science fiction | science fiction
+(5 rows)
+
+-- try a sampled version
+SELECT * FROM document TABLESAMPLE BERNOULLI(50) REPEATABLE(0)
+ WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+-- viewpoint from regress_rls_carol
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => awesome technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+(10 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => awesome technology book
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-------------------+-------------------------+-----------------
+ 11 | 1 | 1 | regress_rls_bob | my first novel | novel
+ 11 | 2 | 2 | regress_rls_bob | my second novel | novel
+ 22 | 3 | 2 | regress_rls_bob | my science fiction | science fiction
+ 44 | 4 | 1 | regress_rls_bob | my first manga | manga
+ 44 | 5 | 2 | regress_rls_bob | my second manga | manga
+ 22 | 6 | 1 | regress_rls_carol | great science fiction | science fiction
+ 33 | 7 | 2 | regress_rls_carol | great technology book | technology
+ 44 | 8 | 1 | regress_rls_carol | great manga | manga
+ 22 | 9 | 1 | regress_rls_dave | awesome science fiction | science fiction
+ 33 | 10 | 2 | regress_rls_dave | awesome technology book | technology
+(10 rows)
+
+-- try a sampled version
+SELECT * FROM document TABLESAMPLE BERNOULLI(50) REPEATABLE(0)
+ WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document WHERE f_leak(dtitle);
+ QUERY PLAN
+----------------------------------------------------
+ Seq Scan on document
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle);
+ QUERY PLAN
+-----------------------------------------------------------
+ Hash Join
+ Hash Cond: (category.cid = document.cid)
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on category
+ -> Hash
+ -> Seq Scan on document
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+(9 rows)
+
+-- viewpoint from regress_rls_dave
+SET SESSION AUTHORIZATION regress_rls_dave;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => awesome technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+(7 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => awesome technology book
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-------------------+-------------------------+-----------------
+ 11 | 1 | 1 | regress_rls_bob | my first novel | novel
+ 11 | 2 | 2 | regress_rls_bob | my second novel | novel
+ 22 | 3 | 2 | regress_rls_bob | my science fiction | science fiction
+ 22 | 6 | 1 | regress_rls_carol | great science fiction | science fiction
+ 33 | 7 | 2 | regress_rls_carol | great technology book | technology
+ 22 | 9 | 1 | regress_rls_dave | awesome science fiction | science fiction
+ 33 | 10 | 2 | regress_rls_dave | awesome technology book | technology
+(7 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document WHERE f_leak(dtitle);
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
+ Seq Scan on document
+ Filter: ((cid <> 44) AND (cid <> 44) AND (cid < 50) AND (dlevel <= $0) AND f_leak(dtitle))
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle);
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------
+ Hash Join
+ Hash Cond: (category.cid = document.cid)
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on category
+ -> Hash
+ -> Seq Scan on document
+ Filter: ((cid <> 44) AND (cid <> 44) AND (cid < 50) AND (dlevel <= $0) AND f_leak(dtitle))
+(9 rows)
+
+-- 44 would technically fail for both p2r and p1r, but we should get an error
+-- back from p1r for this because it sorts first
+INSERT INTO document VALUES (100, 44, 1, 'regress_rls_dave', 'testing sorting of policies'); -- fail
+ERROR: new row violates row-level security policy "p1r" for table "document"
+-- Just to see a p2r error
+INSERT INTO document VALUES (100, 55, 1, 'regress_rls_dave', 'testing sorting of policies'); -- fail
+ERROR: new row violates row-level security policy "p2r" for table "document"
+-- only owner can change policies
+ALTER POLICY p1 ON document USING (true); --fail
+ERROR: must be owner of table document
+DROP POLICY p1 ON document; --fail
+ERROR: must be owner of relation document
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY p1 ON document USING (dauthor = current_user);
+-- viewpoint from regress_rls_bob again
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+--------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+(5 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER by did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-----------------+--------------------+-----------------
+ 11 | 1 | 1 | regress_rls_bob | my first novel | novel
+ 11 | 2 | 2 | regress_rls_bob | my second novel | novel
+ 22 | 3 | 2 | regress_rls_bob | my science fiction | science fiction
+ 44 | 4 | 1 | regress_rls_bob | my first manga | manga
+ 44 | 5 | 2 | regress_rls_bob | my second manga | manga
+(5 rows)
+
+-- viewpoint from regress_rls_carol again
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => great manga
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+(3 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER by did;
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => great manga
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-------------------+-----------------------+-----------------
+ 22 | 6 | 1 | regress_rls_carol | great science fiction | science fiction
+ 33 | 7 | 2 | regress_rls_carol | great technology book | technology
+ 44 | 8 | 1 | regress_rls_carol | great manga | manga
+(3 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document WHERE f_leak(dtitle);
+ QUERY PLAN
+---------------------------------------------------------
+ Seq Scan on document
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle);
+ QUERY PLAN
+---------------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on document
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+ -> Index Scan using category_pkey on category
+ Index Cond: (cid = document.cid)
+(5 rows)
+
+-- interaction of FK/PK constraints
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE POLICY p2 ON category
+ USING (CASE WHEN current_user = 'regress_rls_bob' THEN cid IN (11, 33)
+ WHEN current_user = 'regress_rls_carol' THEN cid IN (22, 44)
+ ELSE false END);
+ALTER TABLE category ENABLE ROW LEVEL SECURITY;
+-- cannot delete PK referenced by invisible FK
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM document d FULL OUTER JOIN category c on d.cid = c.cid ORDER BY d.did, c.cid;
+ did | cid | dlevel | dauthor | dtitle | cid | cname
+-----+-----+--------+-----------------+--------------------+-----+------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel | 11 | novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel | 11 | novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction | |
+ 4 | 44 | 1 | regress_rls_bob | my first manga | |
+ 5 | 44 | 2 | regress_rls_bob | my second manga | |
+ | | | | | 33 | technology
+(6 rows)
+
+DELETE FROM category WHERE cid = 33; -- fails with FK violation
+ERROR: update or delete on table "category" violates foreign key constraint "document_cid_fkey" on table "document"
+DETAIL: Key is still referenced from table "document".
+-- can insert FK referencing invisible PK
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM document d FULL OUTER JOIN category c on d.cid = c.cid ORDER BY d.did, c.cid;
+ did | cid | dlevel | dauthor | dtitle | cid | cname
+-----+-----+--------+-------------------+-----------------------+-----+-----------------
+ 6 | 22 | 1 | regress_rls_carol | great science fiction | 22 | science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book | |
+ 8 | 44 | 1 | regress_rls_carol | great manga | 44 | manga
+(3 rows)
+
+INSERT INTO document VALUES (11, 33, 1, current_user, 'hoge');
+-- UNIQUE or PRIMARY KEY constraint violation DOES reveal presence of row
+SET SESSION AUTHORIZATION regress_rls_bob;
+INSERT INTO document VALUES (8, 44, 1, 'regress_rls_bob', 'my third manga'); -- Must fail with unique violation, revealing presence of did we can't see
+ERROR: duplicate key value violates unique constraint "document_pkey"
+SELECT * FROM document WHERE did = 8; -- and confirm we can't see it
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+---------+--------
+(0 rows)
+
+-- RLS policies are checked before constraints
+INSERT INTO document VALUES (8, 44, 1, 'regress_rls_carol', 'my third manga'); -- Should fail with RLS check violation, not duplicate key violation
+ERROR: new row violates row-level security policy for table "document"
+UPDATE document SET did = 8, dauthor = 'regress_rls_carol' WHERE did = 5; -- Should fail with RLS check violation, not duplicate key violation
+ERROR: new row violates row-level security policy for table "document"
+-- database superuser does bypass RLS policy when enabled
+RESET SESSION AUTHORIZATION;
+SET row_security TO ON;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+-- database superuser does bypass RLS policy when disabled
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+-- database non-superuser with bypass privilege can bypass RLS policy when disabled
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+-- RLS policy does not apply to table owner when RLS enabled.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+-- RLS policy does not apply to table owner when RLS disabled.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO OFF;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+--
+-- Table inheritance and RLS policy
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+CREATE TABLE t1 (id int not null primary key, a int, junk1 text, b text);
+ALTER TABLE t1 DROP COLUMN junk1; -- just a disturbing factor
+GRANT ALL ON t1 TO public;
+COPY t1 FROM stdin WITH ;
+CREATE TABLE t2 (c float) INHERITS (t1);
+GRANT ALL ON t2 TO public;
+COPY t2 FROM stdin;
+CREATE TABLE t3 (id int not null primary key, c text, b text, a int);
+ALTER TABLE t3 INHERIT t1;
+GRANT ALL ON t3 TO public;
+COPY t3(id, a,b,c) FROM stdin;
+CREATE POLICY p1 ON t1 FOR ALL TO PUBLIC USING (a % 2 = 0); -- be even number
+CREATE POLICY p2 ON t2 FOR ALL TO PUBLIC USING (a % 2 = 1); -- be odd number
+ALTER TABLE t1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE t2 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM t1;
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 104 | 4 | dad
+ 202 | 2 | bcd
+ 204 | 4 | def
+ 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+-------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t2
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t3
+ Filter: ((a % 2) = 0)
+(7 rows)
+
+SELECT * FROM t1 WHERE f_leak(b);
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => yyy
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 104 | 4 | dad
+ 202 | 2 | bcd
+ 204 | 4 | def
+ 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(7 rows)
+
+-- reference to system column
+SELECT tableoid::regclass, * FROM t1;
+ tableoid | id | a | b
+----------+-----+---+-----
+ t1 | 102 | 2 | bbb
+ t1 | 104 | 4 | dad
+ t2 | 202 | 2 | bcd
+ t2 | 204 | 4 | def
+ t3 | 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT *, t1 FROM t1;
+ QUERY PLAN
+-------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t2
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t3
+ Filter: ((a % 2) = 0)
+(7 rows)
+
+-- reference to whole-row reference
+SELECT *, t1 FROM t1;
+ id | a | b | t1
+-----+---+-----+-------------
+ 102 | 2 | bbb | (102,2,bbb)
+ 104 | 4 | dad | (104,4,dad)
+ 202 | 2 | bcd | (202,2,bcd)
+ 204 | 4 | def | (204,4,def)
+ 302 | 2 | yyy | (302,2,yyy)
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT *, t1 FROM t1;
+ QUERY PLAN
+-------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t2
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t3
+ Filter: ((a % 2) = 0)
+(7 rows)
+
+-- for share/update lock
+SELECT * FROM t1 FOR SHARE;
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 104 | 4 | dad
+ 202 | 2 | bcd
+ 204 | 4 | def
+ 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 FOR SHARE;
+ QUERY PLAN
+-------------------------------------
+ LockRows
+ -> Append
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t2
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t3
+ Filter: ((a % 2) = 0)
+(8 rows)
+
+SELECT * FROM t1 WHERE f_leak(b) FOR SHARE;
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => yyy
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 104 | 4 | dad
+ 202 | 2 | bcd
+ 204 | 4 | def
+ 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b) FOR SHARE;
+ QUERY PLAN
+-----------------------------------------------------
+ LockRows
+ -> Append
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(8 rows)
+
+-- union all query
+SELECT a, b, tableoid::regclass FROM t2 UNION ALL SELECT a, b, tableoid::regclass FROM t3;
+ a | b | tableoid
+---+-----+----------
+ 1 | abc | t2
+ 3 | cde | t2
+ 1 | xxx | t3
+ 2 | yyy | t3
+ 3 | zzz | t3
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT a, b, tableoid::regclass FROM t2 UNION ALL SELECT a, b, tableoid::regclass FROM t3;
+ QUERY PLAN
+-------------------------------
+ Append
+ -> Seq Scan on t2
+ Filter: ((a % 2) = 1)
+ -> Seq Scan on t3
+(4 rows)
+
+-- superuser is allowed to bypass RLS checks
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+SELECT * FROM t1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => cde
+NOTICE: f_leak => def
+NOTICE: f_leak => xxx
+NOTICE: f_leak => yyy
+NOTICE: f_leak => zzz
+ id | a | b
+-----+---+-----
+ 101 | 1 | aba
+ 102 | 2 | bbb
+ 103 | 3 | ccc
+ 104 | 4 | dad
+ 201 | 1 | abc
+ 202 | 2 | bcd
+ 203 | 3 | cde
+ 204 | 4 | def
+ 301 | 1 | xxx
+ 302 | 2 | yyy
+ 303 | 3 | zzz
+(11 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: f_leak(b)
+ -> Seq Scan on t2
+ Filter: f_leak(b)
+ -> Seq Scan on t3
+ Filter: f_leak(b)
+(7 rows)
+
+-- non-superuser with bypass privilege can bypass RLS policy when disabled
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+SELECT * FROM t1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => cde
+NOTICE: f_leak => def
+NOTICE: f_leak => xxx
+NOTICE: f_leak => yyy
+NOTICE: f_leak => zzz
+ id | a | b
+-----+---+-----
+ 101 | 1 | aba
+ 102 | 2 | bbb
+ 103 | 3 | ccc
+ 104 | 4 | dad
+ 201 | 1 | abc
+ 202 | 2 | bcd
+ 203 | 3 | cde
+ 204 | 4 | def
+ 301 | 1 | xxx
+ 302 | 2 | yyy
+ 303 | 3 | zzz
+(11 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: f_leak(b)
+ -> Seq Scan on t2
+ Filter: f_leak(b)
+ -> Seq Scan on t3
+ Filter: f_leak(b)
+(7 rows)
+
+--
+-- Partitioned Tables
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE part_document (
+ did int,
+ cid int,
+ dlevel int not null,
+ dauthor name,
+ dtitle text
+) PARTITION BY RANGE (cid);
+GRANT ALL ON part_document TO public;
+-- Create partitions for document categories
+CREATE TABLE part_document_fiction PARTITION OF part_document FOR VALUES FROM (11) to (12);
+CREATE TABLE part_document_satire PARTITION OF part_document FOR VALUES FROM (55) to (56);
+CREATE TABLE part_document_nonfiction PARTITION OF part_document FOR VALUES FROM (99) to (100);
+GRANT ALL ON part_document_fiction TO public;
+GRANT ALL ON part_document_satire TO public;
+GRANT ALL ON part_document_nonfiction TO public;
+INSERT INTO part_document VALUES
+ ( 1, 11, 1, 'regress_rls_bob', 'my first novel'),
+ ( 2, 11, 2, 'regress_rls_bob', 'my second novel'),
+ ( 3, 99, 2, 'regress_rls_bob', 'my science textbook'),
+ ( 4, 55, 1, 'regress_rls_bob', 'my first satire'),
+ ( 5, 99, 2, 'regress_rls_bob', 'my history book'),
+ ( 6, 11, 1, 'regress_rls_carol', 'great science fiction'),
+ ( 7, 99, 2, 'regress_rls_carol', 'great technology book'),
+ ( 8, 55, 2, 'regress_rls_carol', 'great satire'),
+ ( 9, 11, 1, 'regress_rls_dave', 'awesome science fiction'),
+ (10, 99, 2, 'regress_rls_dave', 'awesome technology book');
+ALTER TABLE part_document ENABLE ROW LEVEL SECURITY;
+-- Create policy on parent
+-- user's security level must be higher than or equal to document's
+CREATE POLICY pp1 ON part_document AS PERMISSIVE
+ USING (dlevel <= (SELECT seclv FROM uaccount WHERE pguser = current_user));
+-- Dave is only allowed to see cid < 55
+CREATE POLICY pp1r ON part_document AS RESTRICTIVE TO regress_rls_dave
+ USING (cid < 55);
+\d+ part_document
+ Partitioned table "regress_rls_schema.part_document"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+---------+---------+-----------+----------+---------+----------+--------------+-------------
+ did | integer | | | | plain | |
+ cid | integer | | | | plain | |
+ dlevel | integer | | not null | | plain | |
+ dauthor | name | | | | plain | |
+ dtitle | text | | | | extended | |
+Partition key: RANGE (cid)
+Policies:
+ POLICY "pp1"
+ USING ((dlevel <= ( SELECT uaccount.seclv
+ FROM uaccount
+ WHERE (uaccount.pguser = CURRENT_USER))))
+ POLICY "pp1r" AS RESTRICTIVE
+ TO regress_rls_dave
+ USING ((cid < 55))
+Partitions: part_document_fiction FOR VALUES FROM (11) TO (12),
+ part_document_nonfiction FOR VALUES FROM (99) TO (100),
+ part_document_satire FOR VALUES FROM (55) TO (56)
+
+SELECT * FROM pg_policies WHERE schemaname = 'regress_rls_schema' AND tablename like '%part_document%' ORDER BY policyname;
+ schemaname | tablename | policyname | permissive | roles | cmd | qual | with_check
+--------------------+---------------+------------+-------------+--------------------+-----+--------------------------------------------+------------
+ regress_rls_schema | part_document | pp1 | PERMISSIVE | {public} | ALL | (dlevel <= ( SELECT uaccount.seclv +|
+ | | | | | | FROM uaccount +|
+ | | | | | | WHERE (uaccount.pguser = CURRENT_USER))) |
+ regress_rls_schema | part_document | pp1r | RESTRICTIVE | {regress_rls_dave} | ALL | (cid < 55) |
+(2 rows)
+
+-- viewpoint from regress_rls_bob
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => my first satire
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+-----------------------------------------------------
+ Append
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on part_document_fiction
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_satire
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_nonfiction
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+(10 rows)
+
+-- viewpoint from regress_rls_carol
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => my first satire
+NOTICE: f_leak => great satire
+NOTICE: f_leak => my science textbook
+NOTICE: f_leak => my history book
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => awesome technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+(10 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+-----------------------------------------------------
+ Append
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on part_document_fiction
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_satire
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_nonfiction
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+(10 rows)
+
+-- viewpoint from regress_rls_dave
+SET SESSION AUTHORIZATION regress_rls_dave;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+--------------------------------------------------------------
+ Seq Scan on part_document_fiction
+ Filter: ((cid < 55) AND (dlevel <= $0) AND f_leak(dtitle))
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+(5 rows)
+
+-- pp1 ERROR
+INSERT INTO part_document VALUES (100, 11, 5, 'regress_rls_dave', 'testing pp1'); -- fail
+ERROR: new row violates row-level security policy for table "part_document"
+-- pp1r ERROR
+INSERT INTO part_document VALUES (100, 99, 1, 'regress_rls_dave', 'testing pp1r'); -- fail
+ERROR: new row violates row-level security policy "pp1r" for table "part_document"
+-- Show that RLS policy does not apply for direct inserts to children
+-- This should fail with RLS POLICY pp1r violation.
+INSERT INTO part_document VALUES (100, 55, 1, 'regress_rls_dave', 'testing RLS with partitions'); -- fail
+ERROR: new row violates row-level security policy "pp1r" for table "part_document"
+-- But this should succeed.
+INSERT INTO part_document_satire VALUES (100, 55, 1, 'regress_rls_dave', 'testing RLS with partitions'); -- success
+-- We still cannot see the row using the parent
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+-- But we can if we look directly
+SELECT * FROM part_document_satire WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first satire
+NOTICE: f_leak => great satire
+NOTICE: f_leak => testing RLS with partitions
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(3 rows)
+
+-- Turn on RLS and create policy on child to show RLS is checked before constraints
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER TABLE part_document_satire ENABLE ROW LEVEL SECURITY;
+CREATE POLICY pp3 ON part_document_satire AS RESTRICTIVE
+ USING (cid < 55);
+-- This should fail with RLS violation now.
+SET SESSION AUTHORIZATION regress_rls_dave;
+INSERT INTO part_document_satire VALUES (101, 55, 1, 'regress_rls_dave', 'testing RLS with partitions'); -- fail
+ERROR: new row violates row-level security policy for table "part_document_satire"
+-- And now we cannot see directly into the partition either, due to RLS
+SELECT * FROM part_document_satire WHERE f_leak(dtitle) ORDER BY did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+---------+--------
+(0 rows)
+
+-- The parent looks same as before
+-- viewpoint from regress_rls_dave
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+--------------------------------------------------------------
+ Seq Scan on part_document_fiction
+ Filter: ((cid < 55) AND (dlevel <= $0) AND f_leak(dtitle))
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+(5 rows)
+
+-- viewpoint from regress_rls_carol
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => my first satire
+NOTICE: f_leak => great satire
+NOTICE: f_leak => testing RLS with partitions
+NOTICE: f_leak => my science textbook
+NOTICE: f_leak => my history book
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => awesome technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(11 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+-----------------------------------------------------
+ Append
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on part_document_fiction
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_satire
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_nonfiction
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+(10 rows)
+
+-- only owner can change policies
+ALTER POLICY pp1 ON part_document USING (true); --fail
+ERROR: must be owner of table part_document
+DROP POLICY pp1 ON part_document; --fail
+ERROR: must be owner of relation part_document
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY pp1 ON part_document USING (dauthor = current_user);
+-- viewpoint from regress_rls_bob again
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my first satire
+NOTICE: f_leak => my science textbook
+NOTICE: f_leak => my history book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+---------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+(5 rows)
+
+-- viewpoint from regress_rls_carol again
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great satire
+NOTICE: f_leak => great technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+(3 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+---------------------------------------------------------------
+ Append
+ -> Seq Scan on part_document_fiction
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+ -> Seq Scan on part_document_satire
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+ -> Seq Scan on part_document_nonfiction
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+(7 rows)
+
+-- database superuser does bypass RLS policy when enabled
+RESET SESSION AUTHORIZATION;
+SET row_security TO ON;
+SELECT * FROM part_document ORDER BY did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(11 rows)
+
+SELECT * FROM part_document_satire ORDER by did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(3 rows)
+
+-- database non-superuser with bypass privilege can bypass RLS policy when disabled
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+SELECT * FROM part_document ORDER BY did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(11 rows)
+
+SELECT * FROM part_document_satire ORDER by did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(3 rows)
+
+-- RLS policy does not apply to table owner when RLS enabled.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+SELECT * FROM part_document ORDER by did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(11 rows)
+
+SELECT * FROM part_document_satire ORDER by did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(3 rows)
+
+-- When RLS disabled, other users get ERROR.
+SET SESSION AUTHORIZATION regress_rls_dave;
+SET row_security TO OFF;
+SELECT * FROM part_document ORDER by did;
+ERROR: query would be affected by row-level security policy for table "part_document"
+SELECT * FROM part_document_satire ORDER by did;
+ERROR: query would be affected by row-level security policy for table "part_document_satire"
+-- Check behavior with a policy that uses a SubPlan not an InitPlan.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+CREATE POLICY pp3 ON part_document AS RESTRICTIVE
+ USING ((SELECT dlevel <= seclv FROM uaccount WHERE pguser = current_user));
+SET SESSION AUTHORIZATION regress_rls_carol;
+INSERT INTO part_document VALUES (100, 11, 5, 'regress_rls_carol', 'testing pp3'); -- fail
+ERROR: new row violates row-level security policy "pp3" for table "part_document"
+----- Dependencies -----
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+CREATE TABLE dependee (x integer, y integer);
+CREATE TABLE dependent (x integer, y integer);
+CREATE POLICY d1 ON dependent FOR ALL
+ TO PUBLIC
+ USING (x = (SELECT d.x FROM dependee d WHERE d.y = y));
+DROP TABLE dependee; -- Should fail without CASCADE due to dependency on row security qual?
+ERROR: cannot drop table dependee because other objects depend on it
+DETAIL: policy d1 on table dependent depends on table dependee
+HINT: Use DROP ... CASCADE to drop the dependent objects too.
+DROP TABLE dependee CASCADE;
+NOTICE: drop cascades to policy d1 on table dependent
+EXPLAIN (COSTS OFF) SELECT * FROM dependent; -- After drop, should be unqualified
+ QUERY PLAN
+-----------------------
+ Seq Scan on dependent
+(1 row)
+
+----- RECURSION ----
+--
+-- Simple recursion
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE rec1 (x integer, y integer);
+CREATE POLICY r1 ON rec1 USING (x = (SELECT r.x FROM rec1 r WHERE y = r.y));
+ALTER TABLE rec1 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rec1; -- fail, direct recursion
+ERROR: infinite recursion detected in policy for relation "rec1"
+--
+-- Mutual recursion
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE rec2 (a integer, b integer);
+ALTER POLICY r1 ON rec1 USING (x = (SELECT a FROM rec2 WHERE b = y));
+CREATE POLICY r2 ON rec2 USING (a = (SELECT x FROM rec1 WHERE y = b));
+ALTER TABLE rec2 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rec1; -- fail, mutual recursion
+ERROR: infinite recursion detected in policy for relation "rec1"
+--
+-- Mutual recursion via views
+--
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW rec1v AS SELECT * FROM rec1;
+CREATE VIEW rec2v AS SELECT * FROM rec2;
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY r1 ON rec1 USING (x = (SELECT a FROM rec2v WHERE b = y));
+ALTER POLICY r2 ON rec2 USING (a = (SELECT x FROM rec1v WHERE y = b));
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rec1; -- fail, mutual recursion via views
+ERROR: infinite recursion detected in policy for relation "rec1"
+--
+-- Mutual recursion via .s.b views
+--
+SET SESSION AUTHORIZATION regress_rls_bob;
+DROP VIEW rec1v, rec2v CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to policy r1 on table rec1
+drop cascades to policy r2 on table rec2
+CREATE VIEW rec1v WITH (security_barrier) AS SELECT * FROM rec1;
+CREATE VIEW rec2v WITH (security_barrier) AS SELECT * FROM rec2;
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE POLICY r1 ON rec1 USING (x = (SELECT a FROM rec2v WHERE b = y));
+CREATE POLICY r2 ON rec2 USING (a = (SELECT x FROM rec1v WHERE y = b));
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rec1; -- fail, mutual recursion via s.b. views
+ERROR: infinite recursion detected in policy for relation "rec1"
+--
+-- recursive RLS and VIEWs in policy
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE s1 (a int, b text);
+INSERT INTO s1 (SELECT x, md5(x::text) FROM generate_series(-10,10) x);
+CREATE TABLE s2 (x int, y text);
+INSERT INTO s2 (SELECT x, md5(x::text) FROM generate_series(-6,6) x);
+GRANT SELECT ON s1, s2 TO regress_rls_bob;
+CREATE POLICY p1 ON s1 USING (a in (select x from s2 where y like '%2f%'));
+CREATE POLICY p2 ON s2 USING (x in (select a from s1 where b like '%22%'));
+CREATE POLICY p3 ON s1 FOR INSERT WITH CHECK (a = (SELECT a FROM s1));
+ALTER TABLE s1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE s2 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW v2 AS SELECT * FROM s2 WHERE y like '%af%';
+SELECT * FROM s1 WHERE f_leak(b); -- fail (infinite recursion)
+ERROR: infinite recursion detected in policy for relation "s1"
+INSERT INTO s1 VALUES (1, 'foo'); -- fail (infinite recursion)
+ERROR: infinite recursion detected in policy for relation "s1"
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP POLICY p3 on s1;
+ALTER POLICY p2 ON s2 USING (x % 2 = 0);
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM s1 WHERE f_leak(b); -- OK
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+ a | b
+---+----------------------------------
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM only s1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------
+ Seq Scan on s1
+ Filter: ((hashed SubPlan 1) AND f_leak(b))
+ SubPlan 1
+ -> Seq Scan on s2
+ Filter: (((x % 2) = 0) AND (y ~~ '%2f%'::text))
+(5 rows)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY p1 ON s1 USING (a in (select x from v2)); -- using VIEW in RLS policy
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM s1 WHERE f_leak(b); -- OK
+NOTICE: f_leak => 0267aaf632e87a63288a08331f22c7c3
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+ a | b
+----+----------------------------------
+ -4 | 0267aaf632e87a63288a08331f22c7c3
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM s1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------
+ Seq Scan on s1
+ Filter: ((hashed SubPlan 1) AND f_leak(b))
+ SubPlan 1
+ -> Seq Scan on s2
+ Filter: (((x % 2) = 0) AND (y ~~ '%af%'::text))
+(5 rows)
+
+SELECT (SELECT x FROM s1 LIMIT 1) xx, * FROM s2 WHERE y like '%28%';
+ xx | x | y
+----+----+----------------------------------
+ -6 | -6 | 596a3d04481816330f07e4f97510c28f
+ -4 | -4 | 0267aaf632e87a63288a08331f22c7c3
+ 2 | 2 | c81e728d9d4c2f636f067f89cc14862c
+(3 rows)
+
+EXPLAIN (COSTS OFF) SELECT (SELECT x FROM s1 LIMIT 1) xx, * FROM s2 WHERE y like '%28%';
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Seq Scan on s2
+ Filter: (((x % 2) = 0) AND (y ~~ '%28%'::text))
+ SubPlan 2
+ -> Limit
+ -> Seq Scan on s1
+ Filter: (hashed SubPlan 1)
+ SubPlan 1
+ -> Seq Scan on s2 s2_1
+ Filter: (((x % 2) = 0) AND (y ~~ '%af%'::text))
+(9 rows)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY p2 ON s2 USING (x in (select a from s1 where b like '%d2%'));
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM s1 WHERE f_leak(b); -- fail (infinite recursion via view)
+ERROR: infinite recursion detected in policy for relation "s1"
+-- prepared statement with regress_rls_alice privilege
+PREPARE p1(int) AS SELECT * FROM t1 WHERE a <= $1;
+EXECUTE p1(2);
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 202 | 2 | bcd
+ 302 | 2 | yyy
+(3 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE p1(2);
+ QUERY PLAN
+----------------------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a <= 2) AND ((a % 2) = 0))
+ -> Seq Scan on t2
+ Filter: ((a <= 2) AND ((a % 2) = 0))
+ -> Seq Scan on t3
+ Filter: ((a <= 2) AND ((a % 2) = 0))
+(7 rows)
+
+-- superuser is allowed to bypass RLS checks
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+SELECT * FROM t1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => cde
+NOTICE: f_leak => def
+NOTICE: f_leak => xxx
+NOTICE: f_leak => yyy
+NOTICE: f_leak => zzz
+ id | a | b
+-----+---+-----
+ 101 | 1 | aba
+ 102 | 2 | bbb
+ 103 | 3 | ccc
+ 104 | 4 | dad
+ 201 | 1 | abc
+ 202 | 2 | bcd
+ 203 | 3 | cde
+ 204 | 4 | def
+ 301 | 1 | xxx
+ 302 | 2 | yyy
+ 303 | 3 | zzz
+(11 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: f_leak(b)
+ -> Seq Scan on t2
+ Filter: f_leak(b)
+ -> Seq Scan on t3
+ Filter: f_leak(b)
+(7 rows)
+
+-- plan cache should be invalidated
+EXECUTE p1(2);
+ id | a | b
+-----+---+-----
+ 101 | 1 | aba
+ 102 | 2 | bbb
+ 201 | 1 | abc
+ 202 | 2 | bcd
+ 301 | 1 | xxx
+ 302 | 2 | yyy
+(6 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE p1(2);
+ QUERY PLAN
+--------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: (a <= 2)
+ -> Seq Scan on t2
+ Filter: (a <= 2)
+ -> Seq Scan on t3
+ Filter: (a <= 2)
+(7 rows)
+
+PREPARE p2(int) AS SELECT * FROM t1 WHERE a = $1;
+EXECUTE p2(2);
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 202 | 2 | bcd
+ 302 | 2 | yyy
+(3 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE p2(2);
+ QUERY PLAN
+-------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: (a = 2)
+ -> Seq Scan on t2
+ Filter: (a = 2)
+ -> Seq Scan on t3
+ Filter: (a = 2)
+(7 rows)
+
+-- also, case when privilege switch from superuser
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+EXECUTE p2(2);
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 202 | 2 | bcd
+ 302 | 2 | yyy
+(3 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE p2(2);
+ QUERY PLAN
+---------------------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a = 2) AND ((a % 2) = 0))
+ -> Seq Scan on t2
+ Filter: ((a = 2) AND ((a % 2) = 0))
+ -> Seq Scan on t3
+ Filter: ((a = 2) AND ((a % 2) = 0))
+(7 rows)
+
+--
+-- UPDATE / DELETE and Row-level security
+--
+SET SESSION AUTHORIZATION regress_rls_bob;
+EXPLAIN (COSTS OFF) UPDATE t1 SET b = b || b WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Update on t1
+ Update on t1
+ Update on t2
+ Update on t3
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(10 rows)
+
+UPDATE t1 SET b = b || b WHERE f_leak(b);
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => yyy
+EXPLAIN (COSTS OFF) UPDATE only t1 SET b = b || '_updt' WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Update on t1
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(3 rows)
+
+UPDATE only t1 SET b = b || '_updt' WHERE f_leak(b);
+NOTICE: f_leak => bbbbbb
+NOTICE: f_leak => daddad
+-- returning clause with system column
+UPDATE only t1 SET b = b WHERE f_leak(b) RETURNING tableoid::regclass, *, t1;
+NOTICE: f_leak => bbbbbb_updt
+NOTICE: f_leak => daddad_updt
+ tableoid | id | a | b | t1
+----------+-----+---+-------------+---------------------
+ t1 | 102 | 2 | bbbbbb_updt | (102,2,bbbbbb_updt)
+ t1 | 104 | 4 | daddad_updt | (104,4,daddad_updt)
+(2 rows)
+
+UPDATE t1 SET b = b WHERE f_leak(b) RETURNING *;
+NOTICE: f_leak => bbbbbb_updt
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => bcdbcd
+NOTICE: f_leak => defdef
+NOTICE: f_leak => yyyyyy
+ id | a | b
+-----+---+-------------
+ 102 | 2 | bbbbbb_updt
+ 104 | 4 | daddad_updt
+ 202 | 2 | bcdbcd
+ 204 | 4 | defdef
+ 302 | 2 | yyyyyy
+(5 rows)
+
+UPDATE t1 SET b = b WHERE f_leak(b) RETURNING tableoid::regclass, *, t1;
+NOTICE: f_leak => bbbbbb_updt
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => bcdbcd
+NOTICE: f_leak => defdef
+NOTICE: f_leak => yyyyyy
+ tableoid | id | a | b | t1
+----------+-----+---+-------------+---------------------
+ t1 | 102 | 2 | bbbbbb_updt | (102,2,bbbbbb_updt)
+ t1 | 104 | 4 | daddad_updt | (104,4,daddad_updt)
+ t2 | 202 | 2 | bcdbcd | (202,2,bcdbcd)
+ t2 | 204 | 4 | defdef | (204,4,defdef)
+ t3 | 302 | 2 | yyyyyy | (302,2,yyyyyy)
+(5 rows)
+
+-- updates with from clause
+EXPLAIN (COSTS OFF) UPDATE t2 SET b=t2.b FROM t3
+WHERE t2.a = 3 and t3.a = 2 AND f_leak(t2.b) AND f_leak(t3.b);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Update on t2
+ -> Nested Loop
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Seq Scan on t3
+ Filter: ((a = 2) AND f_leak(b))
+(6 rows)
+
+UPDATE t2 SET b=t2.b FROM t3
+WHERE t2.a = 3 and t3.a = 2 AND f_leak(t2.b) AND f_leak(t3.b);
+NOTICE: f_leak => cde
+NOTICE: f_leak => yyyyyy
+EXPLAIN (COSTS OFF) UPDATE t1 SET b=t1.b FROM t2
+WHERE t1.a = 3 and t2.a = 3 AND f_leak(t1.b) AND f_leak(t2.b);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Update on t1
+ Update on t1
+ Update on t2 t2_1
+ Update on t3
+ -> Nested Loop
+ -> Seq Scan on t1
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Nested Loop
+ -> Seq Scan on t2 t2_1
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Nested Loop
+ -> Seq Scan on t3
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+(19 rows)
+
+UPDATE t1 SET b=t1.b FROM t2
+WHERE t1.a = 3 and t2.a = 3 AND f_leak(t1.b) AND f_leak(t2.b);
+EXPLAIN (COSTS OFF) UPDATE t2 SET b=t2.b FROM t1
+WHERE t1.a = 3 and t2.a = 3 AND f_leak(t1.b) AND f_leak(t2.b);
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Update on t2
+ -> Nested Loop
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Append
+ -> Seq Scan on t1
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t2_1
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+(11 rows)
+
+UPDATE t2 SET b=t2.b FROM t1
+WHERE t1.a = 3 and t2.a = 3 AND f_leak(t1.b) AND f_leak(t2.b);
+NOTICE: f_leak => cde
+-- updates with from clause self join
+EXPLAIN (COSTS OFF) UPDATE t2 t2_1 SET b = t2_2.b FROM t2 t2_2
+WHERE t2_1.a = 3 AND t2_2.a = t2_1.a AND t2_2.b = t2_1.b
+AND f_leak(t2_1.b) AND f_leak(t2_2.b) RETURNING *, t2_1, t2_2;
+ QUERY PLAN
+-----------------------------------------------------------------
+ Update on t2 t2_1
+ -> Nested Loop
+ Join Filter: (t2_1.b = t2_2.b)
+ -> Seq Scan on t2 t2_1
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Seq Scan on t2 t2_2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+(7 rows)
+
+UPDATE t2 t2_1 SET b = t2_2.b FROM t2 t2_2
+WHERE t2_1.a = 3 AND t2_2.a = t2_1.a AND t2_2.b = t2_1.b
+AND f_leak(t2_1.b) AND f_leak(t2_2.b) RETURNING *, t2_1, t2_2;
+NOTICE: f_leak => cde
+NOTICE: f_leak => cde
+ id | a | b | c | id | a | b | c | t2_1 | t2_2
+-----+---+-----+-----+-----+---+-----+-----+-----------------+-----------------
+ 203 | 3 | cde | 3.3 | 203 | 3 | cde | 3.3 | (203,3,cde,3.3) | (203,3,cde,3.3)
+(1 row)
+
+EXPLAIN (COSTS OFF) UPDATE t1 t1_1 SET b = t1_2.b FROM t1 t1_2
+WHERE t1_1.a = 4 AND t1_2.a = t1_1.a AND t1_2.b = t1_1.b
+AND f_leak(t1_1.b) AND f_leak(t1_2.b) RETURNING *, t1_1, t1_2;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Update on t1 t1_1
+ Update on t1 t1_1
+ Update on t2 t1_1_1
+ Update on t3 t1_1_2
+ -> Nested Loop
+ Join Filter: (t1_1.b = t1_2.b)
+ -> Seq Scan on t1 t1_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Append
+ -> Seq Scan on t1 t1_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_2_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Nested Loop
+ Join Filter: (t1_1_1.b = t1_2.b)
+ -> Seq Scan on t2 t1_1_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Append
+ -> Seq Scan on t1 t1_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_2_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Nested Loop
+ Join Filter: (t1_1_2.b = t1_2.b)
+ -> Seq Scan on t3 t1_1_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Append
+ -> Seq Scan on t1 t1_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_2_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+(37 rows)
+
+UPDATE t1 t1_1 SET b = t1_2.b FROM t1 t1_2
+WHERE t1_1.a = 4 AND t1_2.a = t1_1.a AND t1_2.b = t1_1.b
+AND f_leak(t1_1.b) AND f_leak(t1_2.b) RETURNING *, t1_1, t1_2;
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => defdef
+NOTICE: f_leak => defdef
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => defdef
+ id | a | b | id | a | b | t1_1 | t1_2
+-----+---+-------------+-----+---+-------------+---------------------+---------------------
+ 104 | 4 | daddad_updt | 104 | 4 | daddad_updt | (104,4,daddad_updt) | (104,4,daddad_updt)
+ 204 | 4 | defdef | 204 | 4 | defdef | (204,4,defdef) | (204,4,defdef)
+(2 rows)
+
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+SELECT * FROM t1 ORDER BY a,b;
+ id | a | b
+-----+---+-------------
+ 101 | 1 | aba
+ 201 | 1 | abc
+ 301 | 1 | xxx
+ 102 | 2 | bbbbbb_updt
+ 202 | 2 | bcdbcd
+ 302 | 2 | yyyyyy
+ 103 | 3 | ccc
+ 203 | 3 | cde
+ 303 | 3 | zzz
+ 104 | 4 | daddad_updt
+ 204 | 4 | defdef
+(11 rows)
+
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+EXPLAIN (COSTS OFF) DELETE FROM only t1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Delete on t1
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(3 rows)
+
+EXPLAIN (COSTS OFF) DELETE FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Delete on t1
+ Delete on t1
+ Delete on t2
+ Delete on t3
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(10 rows)
+
+DELETE FROM only t1 WHERE f_leak(b) RETURNING tableoid::regclass, *, t1;
+NOTICE: f_leak => bbbbbb_updt
+NOTICE: f_leak => daddad_updt
+ tableoid | id | a | b | t1
+----------+-----+---+-------------+---------------------
+ t1 | 102 | 2 | bbbbbb_updt | (102,2,bbbbbb_updt)
+ t1 | 104 | 4 | daddad_updt | (104,4,daddad_updt)
+(2 rows)
+
+DELETE FROM t1 WHERE f_leak(b) RETURNING tableoid::regclass, *, t1;
+NOTICE: f_leak => bcdbcd
+NOTICE: f_leak => defdef
+NOTICE: f_leak => yyyyyy
+ tableoid | id | a | b | t1
+----------+-----+---+--------+----------------
+ t2 | 202 | 2 | bcdbcd | (202,2,bcdbcd)
+ t2 | 204 | 4 | defdef | (204,4,defdef)
+ t3 | 302 | 2 | yyyyyy | (302,2,yyyyyy)
+(3 rows)
+
+--
+-- S.b. view on top of Row-level security
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE b1 (a int, b text);
+INSERT INTO b1 (SELECT x, md5(x::text) FROM generate_series(-10,10) x);
+CREATE POLICY p1 ON b1 USING (a % 2 = 0);
+ALTER TABLE b1 ENABLE ROW LEVEL SECURITY;
+GRANT ALL ON b1 TO regress_rls_bob;
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW bv1 WITH (security_barrier) AS SELECT * FROM b1 WHERE a > 0 WITH CHECK OPTION;
+GRANT ALL ON bv1 TO regress_rls_carol;
+SET SESSION AUTHORIZATION regress_rls_carol;
+EXPLAIN (COSTS OFF) SELECT * FROM bv1 WHERE f_leak(b);
+ QUERY PLAN
+---------------------------------------------
+ Subquery Scan on bv1
+ Filter: f_leak(bv1.b)
+ -> Seq Scan on b1
+ Filter: ((a > 0) AND ((a % 2) = 0))
+(4 rows)
+
+SELECT * FROM bv1 WHERE f_leak(b);
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+NOTICE: f_leak => c9f0f895fb98ab9159f51fd0297e236d
+NOTICE: f_leak => d3d9446802a44259755d38e6d163e820
+ a | b
+----+----------------------------------
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+(5 rows)
+
+INSERT INTO bv1 VALUES (-1, 'xxx'); -- should fail view WCO
+ERROR: new row violates row-level security policy for table "b1"
+INSERT INTO bv1 VALUES (11, 'xxx'); -- should fail RLS check
+ERROR: new row violates row-level security policy for table "b1"
+INSERT INTO bv1 VALUES (12, 'xxx'); -- ok
+EXPLAIN (COSTS OFF) UPDATE bv1 SET b = 'yyy' WHERE a = 4 AND f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Update on b1
+ -> Seq Scan on b1
+ Filter: ((a > 0) AND (a = 4) AND ((a % 2) = 0) AND f_leak(b))
+(3 rows)
+
+UPDATE bv1 SET b = 'yyy' WHERE a = 4 AND f_leak(b);
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+EXPLAIN (COSTS OFF) DELETE FROM bv1 WHERE a = 6 AND f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Delete on b1
+ -> Seq Scan on b1
+ Filter: ((a > 0) AND (a = 6) AND ((a % 2) = 0) AND f_leak(b))
+(3 rows)
+
+DELETE FROM bv1 WHERE a = 6 AND f_leak(b);
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM b1;
+ a | b
+-----+----------------------------------
+ -10 | 1b0fd9efa5279c4203b7c70233f86dbf
+ -9 | 252e691406782824eec43d7eadc3d256
+ -8 | a8d2ec85eaf98407310b72eb73dda247
+ -7 | 74687a12d3915d3c4d83f1af7b3683d5
+ -6 | 596a3d04481816330f07e4f97510c28f
+ -5 | 47c1b025fa18ea96c33fbb6718688c0f
+ -4 | 0267aaf632e87a63288a08331f22c7c3
+ -3 | b3149ecea4628efd23d2f86e5a723472
+ -2 | 5d7b9adcbe1c629ec722529dd12e5129
+ -1 | 6bb61e3b7bce0931da574d19d1d82c88
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 1 | c4ca4238a0b923820dcc509a6f75849b
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 5 | e4da3b7fbbce2345d7772b0674a318d5
+ 7 | 8f14e45fceea167a5a36dedd4bea2543
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | xxx
+ 4 | yyy
+(21 rows)
+
+--
+-- INSERT ... ON CONFLICT DO UPDATE and Row-level security
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP POLICY p1 ON document;
+DROP POLICY p1r ON document;
+CREATE POLICY p1 ON document FOR SELECT USING (true);
+CREATE POLICY p2 ON document FOR INSERT WITH CHECK (dauthor = current_user);
+CREATE POLICY p3 ON document FOR UPDATE
+ USING (cid = (SELECT cid from category WHERE cname = 'novel'))
+ WITH CHECK (dauthor = current_user);
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Exists...
+SELECT * FROM document WHERE did = 2;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+-----------------
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+(1 row)
+
+-- ...so violates actual WITH CHECK OPTION within UPDATE (not INSERT, since
+-- alternative UPDATE path happens to be taken):
+INSERT INTO document VALUES (2, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_carol', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, dauthor = EXCLUDED.dauthor;
+ERROR: new row violates row-level security policy for table "document"
+-- Violates USING qual for UPDATE policy p3.
+--
+-- UPDATE path is taken, but UPDATE fails purely because *existing* row to be
+-- updated is not a "novel"/cid 11 (row is not leaked, even though we have
+-- SELECT privileges sufficient to see the row in this instance):
+INSERT INTO document VALUES (33, 22, 1, 'regress_rls_bob', 'okay science fiction'); -- preparation for next statement
+INSERT INTO document VALUES (33, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'Some novel, replaces sci-fi') -- takes UPDATE path
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle;
+ERROR: new row violates row-level security policy (USING expression) for table "document"
+-- Fine (we UPDATE, since INSERT WCOs and UPDATE security barrier quals + WCOs
+-- not violated):
+INSERT INTO document VALUES (2, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle RETURNING *;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+----------------
+ 2 | 11 | 2 | regress_rls_bob | my first novel
+(1 row)
+
+-- Fine (we INSERT, so "cid = 33" ("technology") isn't evaluated):
+INSERT INTO document VALUES (78, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'some technology novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, cid = 33 RETURNING *;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+-----------------------
+ 78 | 11 | 1 | regress_rls_bob | some technology novel
+(1 row)
+
+-- Fine (same query, but we UPDATE, so "cid = 33", ("technology") is not the
+-- case in respect of *existing* tuple):
+INSERT INTO document VALUES (78, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'some technology novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, cid = 33 RETURNING *;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+-----------------------
+ 78 | 33 | 1 | regress_rls_bob | some technology novel
+(1 row)
+
+-- Same query a third time, but now fails due to existing tuple finally not
+-- passing quals:
+INSERT INTO document VALUES (78, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'some technology novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, cid = 33 RETURNING *;
+ERROR: new row violates row-level security policy (USING expression) for table "document"
+-- Don't fail just because INSERT doesn't satisfy WITH CHECK option that
+-- originated as a barrier/USING() qual from the UPDATE. Note that the UPDATE
+-- path *isn't* taken, and so UPDATE-related policy does not apply:
+INSERT INTO document VALUES (79, (SELECT cid from category WHERE cname = 'technology'), 1, 'regress_rls_bob', 'technology book, can only insert')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle RETURNING *;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+----------------------------------
+ 79 | 33 | 1 | regress_rls_bob | technology book, can only insert
+(1 row)
+
+-- But this time, the same statement fails, because the UPDATE path is taken,
+-- and updating the row just inserted falls afoul of security barrier qual
+-- (enforced as WCO) -- what we might have updated target tuple to is
+-- irrelevant, in fact.
+INSERT INTO document VALUES (79, (SELECT cid from category WHERE cname = 'technology'), 1, 'regress_rls_bob', 'technology book, can only insert')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle RETURNING *;
+ERROR: new row violates row-level security policy (USING expression) for table "document"
+-- Test default USING qual enforced as WCO
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP POLICY p1 ON document;
+DROP POLICY p2 ON document;
+DROP POLICY p3 ON document;
+CREATE POLICY p3_with_default ON document FOR UPDATE
+ USING (cid = (SELECT cid from category WHERE cname = 'novel'));
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Just because WCO-style enforcement of USING quals occurs with
+-- existing/target tuple does not mean that the implementation can be allowed
+-- to fail to also enforce this qual against the final tuple appended to
+-- relation (since in the absence of an explicit WCO, this is also interpreted
+-- as an UPDATE/ALL WCO in general).
+--
+-- UPDATE path is taken here (fails due to existing tuple). Note that this is
+-- not reported as a "USING expression", because it's an RLS UPDATE check that originated as
+-- a USING qual for the purposes of RLS in general, as opposed to an explicit
+-- USING qual that is ordinarily a security barrier. We leave it up to the
+-- UPDATE to make this fail:
+INSERT INTO document VALUES (79, (SELECT cid from category WHERE cname = 'technology'), 1, 'regress_rls_bob', 'technology book, can only insert')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle RETURNING *;
+ERROR: new row violates row-level security policy for table "document"
+-- UPDATE path is taken here. Existing tuple passes, since its cid
+-- corresponds to "novel", but default USING qual is enforced against
+-- post-UPDATE tuple too (as always when updating with a policy that lacks an
+-- explicit WCO), and so this fails:
+INSERT INTO document VALUES (2, (SELECT cid from category WHERE cname = 'technology'), 1, 'regress_rls_bob', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET cid = EXCLUDED.cid, dtitle = EXCLUDED.dtitle RETURNING *;
+ERROR: new row violates row-level security policy for table "document"
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP POLICY p3_with_default ON document;
+--
+-- Test ALL policies with ON CONFLICT DO UPDATE (much the same as existing UPDATE
+-- tests)
+--
+CREATE POLICY p3_with_all ON document FOR ALL
+ USING (cid = (SELECT cid from category WHERE cname = 'novel'))
+ WITH CHECK (dauthor = current_user);
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Fails, since ALL WCO is enforced in insert path:
+INSERT INTO document VALUES (80, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_carol', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, cid = 33;
+ERROR: new row violates row-level security policy for table "document"
+-- Fails, since ALL policy USING qual is enforced (existing, target tuple is in
+-- violation, since it has the "manga" cid):
+INSERT INTO document VALUES (4, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle;
+ERROR: new row violates row-level security policy (USING expression) for table "document"
+-- Fails, since ALL WCO are enforced:
+INSERT INTO document VALUES (1, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dauthor = 'regress_rls_carol';
+ERROR: new row violates row-level security policy for table "document"
+--
+-- ROLE/GROUP
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE z1 (a int, b text);
+CREATE TABLE z2 (a int, b text);
+GRANT SELECT ON z1,z2 TO regress_rls_group1, regress_rls_group2,
+ regress_rls_bob, regress_rls_carol;
+INSERT INTO z1 VALUES
+ (1, 'aba'),
+ (2, 'bbb'),
+ (3, 'ccc'),
+ (4, 'dad');
+CREATE POLICY p1 ON z1 TO regress_rls_group1 USING (a % 2 = 0);
+CREATE POLICY p2 ON z1 TO regress_rls_group2 USING (a % 2 = 1);
+ALTER TABLE z1 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM z1 WHERE f_leak(b);
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM z1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+PREPARE plancache_test AS SELECT * FROM z1 WHERE f_leak(b);
+EXPLAIN (COSTS OFF) EXECUTE plancache_test;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+PREPARE plancache_test2 AS WITH q AS MATERIALIZED (SELECT * FROM z1 WHERE f_leak(b)) SELECT * FROM q,z2;
+EXPLAIN (COSTS OFF) EXECUTE plancache_test2;
+ QUERY PLAN
+-------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z2
+(7 rows)
+
+PREPARE plancache_test3 AS WITH q AS MATERIALIZED (SELECT * FROM z2) SELECT * FROM q,z1 WHERE f_leak(z1.b);
+EXPLAIN (COSTS OFF) EXECUTE plancache_test3;
+ QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z2
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(7 rows)
+
+SET ROLE regress_rls_group1;
+SELECT * FROM z1 WHERE f_leak(b);
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM z1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test2;
+ QUERY PLAN
+-------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z2
+(7 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test3;
+ QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z2
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(7 rows)
+
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM z1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => ccc
+ a | b
+---+-----
+ 1 | aba
+ 3 | ccc
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM z1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test2;
+ QUERY PLAN
+-------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z2
+(7 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test3;
+ QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z2
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(7 rows)
+
+SET ROLE regress_rls_group2;
+SELECT * FROM z1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => ccc
+ a | b
+---+-----
+ 1 | aba
+ 3 | ccc
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM z1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test2;
+ QUERY PLAN
+-------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z2
+(7 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test3;
+ QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z2
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(7 rows)
+
+--
+-- Views should follow policy for view owner.
+--
+-- View and Table owner are the same.
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE VIEW rls_view AS SELECT * FROM z1 WHERE f_leak(b);
+GRANT SELECT ON rls_view TO regress_rls_bob;
+-- Query as role that is not owner of view or table. Should return all records.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rls_view;
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 1 | aba
+ 2 | bbb
+ 3 | ccc
+ 4 | dad
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+---------------------
+ Seq Scan on z1
+ Filter: f_leak(b)
+(2 rows)
+
+-- Query as view/table owner. Should return all records.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM rls_view;
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 1 | aba
+ 2 | bbb
+ 3 | ccc
+ 4 | dad
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+---------------------
+ Seq Scan on z1
+ Filter: f_leak(b)
+(2 rows)
+
+DROP VIEW rls_view;
+-- View and Table owners are different.
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW rls_view AS SELECT * FROM z1 WHERE f_leak(b);
+GRANT SELECT ON rls_view TO regress_rls_alice;
+-- Query as role that is not owner of view but is owner of table.
+-- Should return records based on view owner policies.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM rls_view;
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+-- Query as role that is not owner of table but is owner of view.
+-- Should return records based on view owner policies.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rls_view;
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+-- Query as role that is not the owner of the table or view without permissions.
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM rls_view; --fail - permission denied.
+ERROR: permission denied for view rls_view
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view; --fail - permission denied.
+ERROR: permission denied for view rls_view
+-- Query as role that is not the owner of the table or view with permissions.
+SET SESSION AUTHORIZATION regress_rls_bob;
+GRANT SELECT ON rls_view TO regress_rls_carol;
+SELECT * FROM rls_view;
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_rls_bob;
+DROP VIEW rls_view;
+--
+-- Command specific
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE x1 (a int, b text, c text);
+GRANT ALL ON x1 TO PUBLIC;
+INSERT INTO x1 VALUES
+ (1, 'abc', 'regress_rls_bob'),
+ (2, 'bcd', 'regress_rls_bob'),
+ (3, 'cde', 'regress_rls_carol'),
+ (4, 'def', 'regress_rls_carol'),
+ (5, 'efg', 'regress_rls_bob'),
+ (6, 'fgh', 'regress_rls_bob'),
+ (7, 'fgh', 'regress_rls_carol'),
+ (8, 'fgh', 'regress_rls_carol');
+CREATE POLICY p0 ON x1 FOR ALL USING (c = current_user);
+CREATE POLICY p1 ON x1 FOR SELECT USING (a % 2 = 0);
+CREATE POLICY p2 ON x1 FOR INSERT WITH CHECK (a % 2 = 1);
+CREATE POLICY p3 ON x1 FOR UPDATE USING (a % 2 = 0);
+CREATE POLICY p4 ON x1 FOR DELETE USING (a < 8);
+ALTER TABLE x1 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM x1 WHERE f_leak(b) ORDER BY a ASC;
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => efg
+NOTICE: f_leak => fgh
+NOTICE: f_leak => fgh
+ a | b | c
+---+-----+-------------------
+ 1 | abc | regress_rls_bob
+ 2 | bcd | regress_rls_bob
+ 4 | def | regress_rls_carol
+ 5 | efg | regress_rls_bob
+ 6 | fgh | regress_rls_bob
+ 8 | fgh | regress_rls_carol
+(6 rows)
+
+UPDATE x1 SET b = b || '_updt' WHERE f_leak(b) RETURNING *;
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => efg
+NOTICE: f_leak => fgh
+NOTICE: f_leak => fgh
+ a | b | c
+---+----------+-------------------
+ 1 | abc_updt | regress_rls_bob
+ 2 | bcd_updt | regress_rls_bob
+ 4 | def_updt | regress_rls_carol
+ 5 | efg_updt | regress_rls_bob
+ 6 | fgh_updt | regress_rls_bob
+ 8 | fgh_updt | regress_rls_carol
+(6 rows)
+
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM x1 WHERE f_leak(b) ORDER BY a ASC;
+NOTICE: f_leak => cde
+NOTICE: f_leak => fgh
+NOTICE: f_leak => bcd_updt
+NOTICE: f_leak => def_updt
+NOTICE: f_leak => fgh_updt
+NOTICE: f_leak => fgh_updt
+ a | b | c
+---+----------+-------------------
+ 2 | bcd_updt | regress_rls_bob
+ 3 | cde | regress_rls_carol
+ 4 | def_updt | regress_rls_carol
+ 6 | fgh_updt | regress_rls_bob
+ 7 | fgh | regress_rls_carol
+ 8 | fgh_updt | regress_rls_carol
+(6 rows)
+
+UPDATE x1 SET b = b || '_updt' WHERE f_leak(b) RETURNING *;
+NOTICE: f_leak => cde
+NOTICE: f_leak => fgh
+NOTICE: f_leak => bcd_updt
+NOTICE: f_leak => def_updt
+NOTICE: f_leak => fgh_updt
+NOTICE: f_leak => fgh_updt
+ a | b | c
+---+---------------+-------------------
+ 3 | cde_updt | regress_rls_carol
+ 7 | fgh_updt | regress_rls_carol
+ 2 | bcd_updt_updt | regress_rls_bob
+ 4 | def_updt_updt | regress_rls_carol
+ 6 | fgh_updt_updt | regress_rls_bob
+ 8 | fgh_updt_updt | regress_rls_carol
+(6 rows)
+
+DELETE FROM x1 WHERE f_leak(b) RETURNING *;
+NOTICE: f_leak => cde_updt
+NOTICE: f_leak => fgh_updt
+NOTICE: f_leak => bcd_updt_updt
+NOTICE: f_leak => def_updt_updt
+NOTICE: f_leak => fgh_updt_updt
+NOTICE: f_leak => fgh_updt_updt
+ a | b | c
+---+---------------+-------------------
+ 3 | cde_updt | regress_rls_carol
+ 7 | fgh_updt | regress_rls_carol
+ 2 | bcd_updt_updt | regress_rls_bob
+ 4 | def_updt_updt | regress_rls_carol
+ 6 | fgh_updt_updt | regress_rls_bob
+ 8 | fgh_updt_updt | regress_rls_carol
+(6 rows)
+
+--
+-- Duplicate Policy Names
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE y1 (a int, b text);
+CREATE TABLE y2 (a int, b text);
+GRANT ALL ON y1, y2 TO regress_rls_bob;
+CREATE POLICY p1 ON y1 FOR ALL USING (a % 2 = 0);
+CREATE POLICY p2 ON y1 FOR SELECT USING (a > 2);
+CREATE POLICY p1 ON y1 FOR SELECT USING (a % 2 = 1); --fail
+ERROR: policy "p1" for table "y1" already exists
+CREATE POLICY p1 ON y2 FOR ALL USING (a % 2 = 0); --OK
+ALTER TABLE y1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE y2 ENABLE ROW LEVEL SECURITY;
+--
+-- Expression structure with SBV
+--
+-- Create view as table owner. RLS should NOT be applied.
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE VIEW rls_sbv WITH (security_barrier) AS
+ SELECT * FROM y1 WHERE f_leak(b);
+EXPLAIN (COSTS OFF) SELECT * FROM rls_sbv WHERE (a = 1);
+ QUERY PLAN
+-----------------------------------
+ Seq Scan on y1
+ Filter: (f_leak(b) AND (a = 1))
+(2 rows)
+
+DROP VIEW rls_sbv;
+-- Create view as role that does not own table. RLS should be applied.
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW rls_sbv WITH (security_barrier) AS
+ SELECT * FROM y1 WHERE f_leak(b);
+EXPLAIN (COSTS OFF) SELECT * FROM rls_sbv WHERE (a = 1);
+ QUERY PLAN
+------------------------------------------------------------------
+ Seq Scan on y1
+ Filter: ((a = 1) AND ((a > 2) OR ((a % 2) = 0)) AND f_leak(b))
+(2 rows)
+
+DROP VIEW rls_sbv;
+--
+-- Expression structure
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+INSERT INTO y2 (SELECT x, md5(x::text) FROM generate_series(0,20) x);
+CREATE POLICY p2 ON y2 USING (a % 3 = 0);
+CREATE POLICY p3 ON y2 USING (a % 4 = 0);
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM y2 WHERE f_leak(b);
+NOTICE: f_leak => cfcd208495d565ef66e7dff9f98764da
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => eccbc87e4b5ce2fe28308fd9f2a7baf3
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+NOTICE: f_leak => c9f0f895fb98ab9159f51fd0297e236d
+NOTICE: f_leak => 45c48cce2e2d7fbdea1afc51c7c6ad26
+NOTICE: f_leak => d3d9446802a44259755d38e6d163e820
+NOTICE: f_leak => c20ad4d76fe97759aa27a0c99bff6710
+NOTICE: f_leak => aab3238922bcc25a6f606eb525ffdc56
+NOTICE: f_leak => 9bf31c7ff062936a96d3c8bd1f8f2ff3
+NOTICE: f_leak => c74d97b01eae257e44aa9d5bade97baf
+NOTICE: f_leak => 6f4922f45568161a8cdf4ad2299f6d23
+NOTICE: f_leak => 98f13708210194c475687be6106a3b84
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 15 | 9bf31c7ff062936a96d3c8bd1f8f2ff3
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+(14 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM y2 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on y2
+ Filter: ((((a % 4) = 0) OR ((a % 3) = 0) OR ((a % 2) = 0)) AND f_leak(b))
+(2 rows)
+
+--
+-- Qual push-down of leaky functions, when not referring to table
+--
+SELECT * FROM y2 WHERE f_leak('abc');
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 15 | 9bf31c7ff062936a96d3c8bd1f8f2ff3
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+(14 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM y2 WHERE f_leak('abc');
+ QUERY PLAN
+---------------------------------------------------------------------------------------
+ Seq Scan on y2
+ Filter: (f_leak('abc'::text) AND (((a % 4) = 0) OR ((a % 3) = 0) OR ((a % 2) = 0)))
+(2 rows)
+
+CREATE TABLE test_qual_pushdown (
+ abc text
+);
+INSERT INTO test_qual_pushdown VALUES ('abc'),('def');
+SELECT * FROM y2 JOIN test_qual_pushdown ON (b = abc) WHERE f_leak(abc);
+NOTICE: f_leak => abc
+NOTICE: f_leak => def
+ a | b | abc
+---+---+-----
+(0 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM y2 JOIN test_qual_pushdown ON (b = abc) WHERE f_leak(abc);
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Hash Join
+ Hash Cond: (test_qual_pushdown.abc = y2.b)
+ -> Seq Scan on test_qual_pushdown
+ Filter: f_leak(abc)
+ -> Hash
+ -> Seq Scan on y2
+ Filter: (((a % 4) = 0) OR ((a % 3) = 0) OR ((a % 2) = 0))
+(7 rows)
+
+SELECT * FROM y2 JOIN test_qual_pushdown ON (b = abc) WHERE f_leak(b);
+NOTICE: f_leak => cfcd208495d565ef66e7dff9f98764da
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => eccbc87e4b5ce2fe28308fd9f2a7baf3
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+NOTICE: f_leak => c9f0f895fb98ab9159f51fd0297e236d
+NOTICE: f_leak => 45c48cce2e2d7fbdea1afc51c7c6ad26
+NOTICE: f_leak => d3d9446802a44259755d38e6d163e820
+NOTICE: f_leak => c20ad4d76fe97759aa27a0c99bff6710
+NOTICE: f_leak => aab3238922bcc25a6f606eb525ffdc56
+NOTICE: f_leak => 9bf31c7ff062936a96d3c8bd1f8f2ff3
+NOTICE: f_leak => c74d97b01eae257e44aa9d5bade97baf
+NOTICE: f_leak => 6f4922f45568161a8cdf4ad2299f6d23
+NOTICE: f_leak => 98f13708210194c475687be6106a3b84
+ a | b | abc
+---+---+-----
+(0 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM y2 JOIN test_qual_pushdown ON (b = abc) WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------------------------------------
+ Hash Join
+ Hash Cond: (test_qual_pushdown.abc = y2.b)
+ -> Seq Scan on test_qual_pushdown
+ -> Hash
+ -> Seq Scan on y2
+ Filter: ((((a % 4) = 0) OR ((a % 3) = 0) OR ((a % 2) = 0)) AND f_leak(b))
+(6 rows)
+
+DROP TABLE test_qual_pushdown;
+--
+-- Plancache invalidate on user change.
+--
+RESET SESSION AUTHORIZATION;
+DROP TABLE t1 CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to table t2
+drop cascades to table t3
+CREATE TABLE t1 (a integer);
+GRANT SELECT ON t1 TO regress_rls_bob, regress_rls_carol;
+CREATE POLICY p1 ON t1 TO regress_rls_bob USING ((a % 2) = 0);
+CREATE POLICY p2 ON t1 TO regress_rls_carol USING ((a % 4) = 0);
+ALTER TABLE t1 ENABLE ROW LEVEL SECURITY;
+-- Prepare as regress_rls_bob
+SET ROLE regress_rls_bob;
+PREPARE role_inval AS SELECT * FROM t1;
+-- Check plan
+EXPLAIN (COSTS OFF) EXECUTE role_inval;
+ QUERY PLAN
+-------------------------
+ Seq Scan on t1
+ Filter: ((a % 2) = 0)
+(2 rows)
+
+-- Change to regress_rls_carol
+SET ROLE regress_rls_carol;
+-- Check plan- should be different
+EXPLAIN (COSTS OFF) EXECUTE role_inval;
+ QUERY PLAN
+-------------------------
+ Seq Scan on t1
+ Filter: ((a % 4) = 0)
+(2 rows)
+
+-- Change back to regress_rls_bob
+SET ROLE regress_rls_bob;
+-- Check plan- should be back to original
+EXPLAIN (COSTS OFF) EXECUTE role_inval;
+ QUERY PLAN
+-------------------------
+ Seq Scan on t1
+ Filter: ((a % 2) = 0)
+(2 rows)
+
+--
+-- CTE and RLS
+--
+RESET SESSION AUTHORIZATION;
+DROP TABLE t1 CASCADE;
+CREATE TABLE t1 (a integer, b text);
+CREATE POLICY p1 ON t1 USING (a % 2 = 0);
+ALTER TABLE t1 ENABLE ROW LEVEL SECURITY;
+GRANT ALL ON t1 TO regress_rls_bob;
+INSERT INTO t1 (SELECT x, md5(x::text) FROM generate_series(0,20) x);
+SET SESSION AUTHORIZATION regress_rls_bob;
+WITH cte1 AS MATERIALIZED (SELECT * FROM t1 WHERE f_leak(b)) SELECT * FROM cte1;
+NOTICE: f_leak => cfcd208495d565ef66e7dff9f98764da
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+NOTICE: f_leak => c9f0f895fb98ab9159f51fd0297e236d
+NOTICE: f_leak => d3d9446802a44259755d38e6d163e820
+NOTICE: f_leak => c20ad4d76fe97759aa27a0c99bff6710
+NOTICE: f_leak => aab3238922bcc25a6f606eb525ffdc56
+NOTICE: f_leak => c74d97b01eae257e44aa9d5bade97baf
+NOTICE: f_leak => 6f4922f45568161a8cdf4ad2299f6d23
+NOTICE: f_leak => 98f13708210194c475687be6106a3b84
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+(11 rows)
+
+EXPLAIN (COSTS OFF)
+WITH cte1 AS MATERIALIZED (SELECT * FROM t1 WHERE f_leak(b)) SELECT * FROM cte1;
+ QUERY PLAN
+-------------------------------------------------
+ CTE Scan on cte1
+ CTE cte1
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(4 rows)
+
+WITH cte1 AS (UPDATE t1 SET a = a + 1 RETURNING *) SELECT * FROM cte1; --fail
+ERROR: new row violates row-level security policy for table "t1"
+WITH cte1 AS (UPDATE t1 SET a = a RETURNING *) SELECT * FROM cte1; --ok
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+(11 rows)
+
+WITH cte1 AS (INSERT INTO t1 VALUES (21, 'Fail') RETURNING *) SELECT * FROM cte1; --fail
+ERROR: new row violates row-level security policy for table "t1"
+WITH cte1 AS (INSERT INTO t1 VALUES (20, 'Success') RETURNING *) SELECT * FROM cte1; --ok
+ a | b
+----+---------
+ 20 | Success
+(1 row)
+
+--
+-- Rename Policy
+--
+RESET SESSION AUTHORIZATION;
+ALTER POLICY p1 ON t1 RENAME TO p1; --fail
+ERROR: policy "p1" for table "t1" already exists
+SELECT polname, relname
+ FROM pg_policy pol
+ JOIN pg_class pc ON (pc.oid = pol.polrelid)
+ WHERE relname = 't1';
+ polname | relname
+---------+---------
+ p1 | t1
+(1 row)
+
+ALTER POLICY p1 ON t1 RENAME TO p2; --ok
+SELECT polname, relname
+ FROM pg_policy pol
+ JOIN pg_class pc ON (pc.oid = pol.polrelid)
+ WHERE relname = 't1';
+ polname | relname
+---------+---------
+ p2 | t1
+(1 row)
+
+--
+-- Check INSERT SELECT
+--
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE TABLE t2 (a integer, b text);
+INSERT INTO t2 (SELECT * FROM t1);
+EXPLAIN (COSTS OFF) INSERT INTO t2 (SELECT * FROM t1);
+ QUERY PLAN
+-------------------------------
+ Insert on t2
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+(3 rows)
+
+SELECT * FROM t2;
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(12 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t2;
+ QUERY PLAN
+----------------
+ Seq Scan on t2
+(1 row)
+
+CREATE TABLE t3 AS SELECT * FROM t1;
+SELECT * FROM t3;
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(12 rows)
+
+SELECT * INTO t4 FROM t1;
+SELECT * FROM t4;
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(12 rows)
+
+--
+-- RLS with JOIN
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE blog (id integer, author text, post text);
+CREATE TABLE comment (blog_id integer, message text);
+GRANT ALL ON blog, comment TO regress_rls_bob;
+CREATE POLICY blog_1 ON blog USING (id % 2 = 0);
+ALTER TABLE blog ENABLE ROW LEVEL SECURITY;
+INSERT INTO blog VALUES
+ (1, 'alice', 'blog #1'),
+ (2, 'bob', 'blog #1'),
+ (3, 'alice', 'blog #2'),
+ (4, 'alice', 'blog #3'),
+ (5, 'john', 'blog #1');
+INSERT INTO comment VALUES
+ (1, 'cool blog'),
+ (1, 'fun blog'),
+ (3, 'crazy blog'),
+ (5, 'what?'),
+ (4, 'insane!'),
+ (2, 'who did it?');
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Check RLS JOIN with Non-RLS.
+SELECT id, author, message FROM blog JOIN comment ON id = blog_id;
+ id | author | message
+----+--------+-------------
+ 4 | alice | insane!
+ 2 | bob | who did it?
+(2 rows)
+
+-- Check Non-RLS JOIN with RLS.
+SELECT id, author, message FROM comment JOIN blog ON id = blog_id;
+ id | author | message
+----+--------+-------------
+ 4 | alice | insane!
+ 2 | bob | who did it?
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE POLICY comment_1 ON comment USING (blog_id < 4);
+ALTER TABLE comment ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Check RLS JOIN RLS
+SELECT id, author, message FROM blog JOIN comment ON id = blog_id;
+ id | author | message
+----+--------+-------------
+ 2 | bob | who did it?
+(1 row)
+
+SELECT id, author, message FROM comment JOIN blog ON id = blog_id;
+ id | author | message
+----+--------+-------------
+ 2 | bob | who did it?
+(1 row)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP TABLE blog, comment;
+--
+-- Default Deny Policy
+--
+RESET SESSION AUTHORIZATION;
+DROP POLICY p2 ON t1;
+ALTER TABLE t1 OWNER TO regress_rls_alice;
+-- Check that default deny does not apply to superuser.
+RESET SESSION AUTHORIZATION;
+SELECT * FROM t1;
+ a | b
+----+----------------------------------
+ 1 | c4ca4238a0b923820dcc509a6f75849b
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 5 | e4da3b7fbbce2345d7772b0674a318d5
+ 7 | 8f14e45fceea167a5a36dedd4bea2543
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 11 | 6512bd43d9caa6e02c990b0a82652dca
+ 13 | c51ce410c124a10e0db5e4b97fc2af39
+ 15 | 9bf31c7ff062936a96d3c8bd1f8f2ff3
+ 17 | 70efdf2ec9b086079795c442636b55fb
+ 19 | 1f0e3dad99908345f7439f8ffabdffc4
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(22 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+----------------
+ Seq Scan on t1
+(1 row)
+
+-- Check that default deny does not apply to table owner.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM t1;
+ a | b
+----+----------------------------------
+ 1 | c4ca4238a0b923820dcc509a6f75849b
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 5 | e4da3b7fbbce2345d7772b0674a318d5
+ 7 | 8f14e45fceea167a5a36dedd4bea2543
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 11 | 6512bd43d9caa6e02c990b0a82652dca
+ 13 | c51ce410c124a10e0db5e4b97fc2af39
+ 15 | 9bf31c7ff062936a96d3c8bd1f8f2ff3
+ 17 | 70efdf2ec9b086079795c442636b55fb
+ 19 | 1f0e3dad99908345f7439f8ffabdffc4
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(22 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+----------------
+ Seq Scan on t1
+(1 row)
+
+-- Check that default deny applies to non-owner/non-superuser when RLS on.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+SELECT * FROM t1;
+ a | b
+---+---
+(0 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+--------------------------
+ Result
+ One-Time Filter: false
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM t1;
+ a | b
+---+---
+(0 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+--------------------------
+ Result
+ One-Time Filter: false
+(2 rows)
+
+--
+-- COPY TO/FROM
+--
+RESET SESSION AUTHORIZATION;
+DROP TABLE copy_t CASCADE;
+ERROR: table "copy_t" does not exist
+CREATE TABLE copy_t (a integer, b text);
+CREATE POLICY p1 ON copy_t USING (a % 2 = 0);
+ALTER TABLE copy_t ENABLE ROW LEVEL SECURITY;
+GRANT ALL ON copy_t TO regress_rls_bob, regress_rls_exempt_user;
+INSERT INTO copy_t (SELECT x, md5(x::text) FROM generate_series(0,10) x);
+-- Check COPY TO as Superuser/owner.
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ',';
+0,cfcd208495d565ef66e7dff9f98764da
+1,c4ca4238a0b923820dcc509a6f75849b
+2,c81e728d9d4c2f636f067f89cc14862c
+3,eccbc87e4b5ce2fe28308fd9f2a7baf3
+4,a87ff679a2f3e71d9181a67b7542122c
+5,e4da3b7fbbce2345d7772b0674a318d5
+6,1679091c5a880faf6fb5e6087eb1b2dc
+7,8f14e45fceea167a5a36dedd4bea2543
+8,c9f0f895fb98ab9159f51fd0297e236d
+9,45c48cce2e2d7fbdea1afc51c7c6ad26
+10,d3d9446802a44259755d38e6d163e820
+SET row_security TO ON;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ',';
+0,cfcd208495d565ef66e7dff9f98764da
+1,c4ca4238a0b923820dcc509a6f75849b
+2,c81e728d9d4c2f636f067f89cc14862c
+3,eccbc87e4b5ce2fe28308fd9f2a7baf3
+4,a87ff679a2f3e71d9181a67b7542122c
+5,e4da3b7fbbce2345d7772b0674a318d5
+6,1679091c5a880faf6fb5e6087eb1b2dc
+7,8f14e45fceea167a5a36dedd4bea2543
+8,c9f0f895fb98ab9159f51fd0297e236d
+9,45c48cce2e2d7fbdea1afc51c7c6ad26
+10,d3d9446802a44259755d38e6d163e820
+-- Check COPY TO as user with permissions.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO OFF;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --fail - would be affected by RLS
+ERROR: query would be affected by row-level security policy for table "copy_t"
+SET row_security TO ON;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --ok
+0,cfcd208495d565ef66e7dff9f98764da
+2,c81e728d9d4c2f636f067f89cc14862c
+4,a87ff679a2f3e71d9181a67b7542122c
+6,1679091c5a880faf6fb5e6087eb1b2dc
+8,c9f0f895fb98ab9159f51fd0297e236d
+10,d3d9446802a44259755d38e6d163e820
+-- Check COPY TO as user with permissions and BYPASSRLS
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --ok
+0,cfcd208495d565ef66e7dff9f98764da
+1,c4ca4238a0b923820dcc509a6f75849b
+2,c81e728d9d4c2f636f067f89cc14862c
+3,eccbc87e4b5ce2fe28308fd9f2a7baf3
+4,a87ff679a2f3e71d9181a67b7542122c
+5,e4da3b7fbbce2345d7772b0674a318d5
+6,1679091c5a880faf6fb5e6087eb1b2dc
+7,8f14e45fceea167a5a36dedd4bea2543
+8,c9f0f895fb98ab9159f51fd0297e236d
+9,45c48cce2e2d7fbdea1afc51c7c6ad26
+10,d3d9446802a44259755d38e6d163e820
+SET row_security TO ON;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --ok
+0,cfcd208495d565ef66e7dff9f98764da
+1,c4ca4238a0b923820dcc509a6f75849b
+2,c81e728d9d4c2f636f067f89cc14862c
+3,eccbc87e4b5ce2fe28308fd9f2a7baf3
+4,a87ff679a2f3e71d9181a67b7542122c
+5,e4da3b7fbbce2345d7772b0674a318d5
+6,1679091c5a880faf6fb5e6087eb1b2dc
+7,8f14e45fceea167a5a36dedd4bea2543
+8,c9f0f895fb98ab9159f51fd0297e236d
+9,45c48cce2e2d7fbdea1afc51c7c6ad26
+10,d3d9446802a44259755d38e6d163e820
+-- Check COPY TO as user without permissions. SET row_security TO OFF;
+SET SESSION AUTHORIZATION regress_rls_carol;
+SET row_security TO OFF;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --fail - would be affected by RLS
+ERROR: query would be affected by row-level security policy for table "copy_t"
+SET row_security TO ON;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --fail - permission denied
+ERROR: permission denied for table copy_t
+-- Check COPY relation TO; keep it just one row to avoid reordering issues
+RESET SESSION AUTHORIZATION;
+SET row_security TO ON;
+CREATE TABLE copy_rel_to (a integer, b text);
+CREATE POLICY p1 ON copy_rel_to USING (a % 2 = 0);
+ALTER TABLE copy_rel_to ENABLE ROW LEVEL SECURITY;
+GRANT ALL ON copy_rel_to TO regress_rls_bob, regress_rls_exempt_user;
+INSERT INTO copy_rel_to VALUES (1, md5('1'));
+-- Check COPY TO as Superuser/owner.
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ',';
+1,c4ca4238a0b923820dcc509a6f75849b
+SET row_security TO ON;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ',';
+1,c4ca4238a0b923820dcc509a6f75849b
+-- Check COPY TO as user with permissions.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO OFF;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --fail - would be affected by RLS
+ERROR: query would be affected by row-level security policy for table "copy_rel_to"
+SET row_security TO ON;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --ok
+-- Check COPY TO as user with permissions and BYPASSRLS
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --ok
+1,c4ca4238a0b923820dcc509a6f75849b
+SET row_security TO ON;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --ok
+1,c4ca4238a0b923820dcc509a6f75849b
+-- Check COPY TO as user without permissions. SET row_security TO OFF;
+SET SESSION AUTHORIZATION regress_rls_carol;
+SET row_security TO OFF;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --fail - permission denied
+ERROR: permission denied for table copy_rel_to
+SET row_security TO ON;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --fail - permission denied
+ERROR: permission denied for table copy_rel_to
+-- Check COPY FROM as Superuser/owner.
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+COPY copy_t FROM STDIN; --ok
+SET row_security TO ON;
+COPY copy_t FROM STDIN; --ok
+-- Check COPY FROM as user with permissions.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO OFF;
+COPY copy_t FROM STDIN; --fail - would be affected by RLS.
+ERROR: query would be affected by row-level security policy for table "copy_t"
+SET row_security TO ON;
+COPY copy_t FROM STDIN; --fail - COPY FROM not supported by RLS.
+ERROR: COPY FROM not supported with row-level security
+HINT: Use INSERT statements instead.
+-- Check COPY FROM as user with permissions and BYPASSRLS
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO ON;
+COPY copy_t FROM STDIN; --ok
+-- Check COPY FROM as user without permissions.
+SET SESSION AUTHORIZATION regress_rls_carol;
+SET row_security TO OFF;
+COPY copy_t FROM STDIN; --fail - permission denied.
+ERROR: permission denied for table copy_t
+SET row_security TO ON;
+COPY copy_t FROM STDIN; --fail - permission denied.
+ERROR: permission denied for table copy_t
+RESET SESSION AUTHORIZATION;
+DROP TABLE copy_t;
+DROP TABLE copy_rel_to CASCADE;
+-- Check WHERE CURRENT OF
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE current_check (currentid int, payload text, rlsuser text);
+GRANT ALL ON current_check TO PUBLIC;
+INSERT INTO current_check VALUES
+ (1, 'abc', 'regress_rls_bob'),
+ (2, 'bcd', 'regress_rls_bob'),
+ (3, 'cde', 'regress_rls_bob'),
+ (4, 'def', 'regress_rls_bob');
+CREATE POLICY p1 ON current_check FOR SELECT USING (currentid % 2 = 0);
+CREATE POLICY p2 ON current_check FOR DELETE USING (currentid = 4 AND rlsuser = current_user);
+CREATE POLICY p3 ON current_check FOR UPDATE USING (currentid = 4) WITH CHECK (rlsuser = current_user);
+ALTER TABLE current_check ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Can SELECT even rows
+SELECT * FROM current_check;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+ 4 | def | regress_rls_bob
+(2 rows)
+
+-- Cannot UPDATE row 2
+UPDATE current_check SET payload = payload || '_new' WHERE currentid = 2 RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+---------
+(0 rows)
+
+BEGIN;
+DECLARE current_check_cursor SCROLL CURSOR FOR SELECT * FROM current_check;
+-- Returns rows that can be seen according to SELECT policy, like plain SELECT
+-- above (even rows)
+FETCH ABSOLUTE 1 FROM current_check_cursor;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+(1 row)
+
+-- Still cannot UPDATE row 2 through cursor
+UPDATE current_check SET payload = payload || '_new' WHERE CURRENT OF current_check_cursor RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+---------
+(0 rows)
+
+-- Can update row 4 through cursor, which is the next visible row
+FETCH RELATIVE 1 FROM current_check_cursor;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 4 | def | regress_rls_bob
+(1 row)
+
+UPDATE current_check SET payload = payload || '_new' WHERE CURRENT OF current_check_cursor RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 4 | def_new | regress_rls_bob
+(1 row)
+
+SELECT * FROM current_check;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+ 4 | def_new | regress_rls_bob
+(2 rows)
+
+-- Plan should be a subquery TID scan
+EXPLAIN (COSTS OFF) UPDATE current_check SET payload = payload WHERE CURRENT OF current_check_cursor;
+ QUERY PLAN
+-------------------------------------------------------------
+ Update on current_check
+ -> Tid Scan on current_check
+ TID Cond: CURRENT OF current_check_cursor
+ Filter: ((currentid = 4) AND ((currentid % 2) = 0))
+(4 rows)
+
+-- Similarly can only delete row 4
+FETCH ABSOLUTE 1 FROM current_check_cursor;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+(1 row)
+
+DELETE FROM current_check WHERE CURRENT OF current_check_cursor RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+---------
+(0 rows)
+
+FETCH RELATIVE 1 FROM current_check_cursor;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 4 | def | regress_rls_bob
+(1 row)
+
+DELETE FROM current_check WHERE CURRENT OF current_check_cursor RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 4 | def_new | regress_rls_bob
+(1 row)
+
+SELECT * FROM current_check;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+(1 row)
+
+COMMIT;
+--
+-- check pg_stats view filtering
+--
+SET row_security TO ON;
+SET SESSION AUTHORIZATION regress_rls_alice;
+ANALYZE current_check;
+-- Stats visible
+SELECT row_security_active('current_check');
+ row_security_active
+---------------------
+ f
+(1 row)
+
+SELECT attname, most_common_vals FROM pg_stats
+ WHERE tablename = 'current_check'
+ ORDER BY 1;
+ attname | most_common_vals
+-----------+-------------------
+ currentid | {4}
+ payload |
+ rlsuser | {regress_rls_bob}
+(3 rows)
+
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Stats not visible
+SELECT row_security_active('current_check');
+ row_security_active
+---------------------
+ t
+(1 row)
+
+SELECT attname, most_common_vals FROM pg_stats
+ WHERE tablename = 'current_check'
+ ORDER BY 1;
+ attname | most_common_vals
+---------+------------------
+(0 rows)
+
+--
+-- Collation support
+--
+BEGIN;
+CREATE TABLE coll_t (c) AS VALUES ('bar'::text);
+CREATE POLICY coll_p ON coll_t USING (c < ('foo'::text COLLATE "C"));
+ALTER TABLE coll_t ENABLE ROW LEVEL SECURITY;
+GRANT SELECT ON coll_t TO regress_rls_alice;
+SELECT (string_to_array(polqual, ':'))[7] AS inputcollid FROM pg_policy WHERE polrelid = 'coll_t'::regclass;
+ inputcollid
+------------------
+ inputcollid 950
+(1 row)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM coll_t;
+ c
+-----
+ bar
+(1 row)
+
+ROLLBACK;
+--
+-- Shared Object Dependencies
+--
+RESET SESSION AUTHORIZATION;
+BEGIN;
+CREATE ROLE regress_rls_eve;
+CREATE ROLE regress_rls_frank;
+CREATE TABLE tbl1 (c) AS VALUES ('bar'::text);
+GRANT SELECT ON TABLE tbl1 TO regress_rls_eve;
+CREATE POLICY P ON tbl1 TO regress_rls_eve, regress_rls_frank USING (true);
+SELECT refclassid::regclass, deptype
+ FROM pg_depend
+ WHERE classid = 'pg_policy'::regclass
+ AND refobjid = 'tbl1'::regclass;
+ refclassid | deptype
+------------+---------
+ pg_class | a
+(1 row)
+
+SELECT refclassid::regclass, deptype
+ FROM pg_shdepend
+ WHERE classid = 'pg_policy'::regclass
+ AND refobjid IN ('regress_rls_eve'::regrole, 'regress_rls_frank'::regrole);
+ refclassid | deptype
+------------+---------
+ pg_authid | r
+ pg_authid | r
+(2 rows)
+
+SAVEPOINT q;
+DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
+ERROR: role "regress_rls_eve" cannot be dropped because some objects depend on it
+DETAIL: privileges for table tbl1
+target of policy p on table tbl1
+ROLLBACK TO q;
+ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
+SAVEPOINT q;
+DROP ROLE regress_rls_eve; --fails due to dependency on GRANT SELECT
+ERROR: role "regress_rls_eve" cannot be dropped because some objects depend on it
+DETAIL: privileges for table tbl1
+ROLLBACK TO q;
+REVOKE ALL ON TABLE tbl1 FROM regress_rls_eve;
+SAVEPOINT q;
+DROP ROLE regress_rls_eve; --succeeds
+ROLLBACK TO q;
+SAVEPOINT q;
+DROP ROLE regress_rls_frank; --fails due to dependency on POLICY p
+ERROR: role "regress_rls_frank" cannot be dropped because some objects depend on it
+DETAIL: target of policy p on table tbl1
+ROLLBACK TO q;
+DROP POLICY p ON tbl1;
+SAVEPOINT q;
+DROP ROLE regress_rls_frank; -- succeeds
+ROLLBACK TO q;
+ROLLBACK; -- cleanup
+--
+-- Converting table to view
+--
+BEGIN;
+CREATE TABLE t (c int);
+CREATE POLICY p ON t USING (c % 2 = 1);
+ALTER TABLE t ENABLE ROW LEVEL SECURITY;
+SAVEPOINT q;
+CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD
+ SELECT * FROM generate_series(1,5) t0(c); -- fails due to row level security enabled
+ERROR: could not convert table "t" to a view because it has row security enabled
+ROLLBACK TO q;
+ALTER TABLE t DISABLE ROW LEVEL SECURITY;
+SAVEPOINT q;
+CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD
+ SELECT * FROM generate_series(1,5) t0(c); -- fails due to policy p on t
+ERROR: could not convert table "t" to a view because it has row security policies
+ROLLBACK TO q;
+DROP POLICY p ON t;
+CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD
+ SELECT * FROM generate_series(1,5) t0(c); -- succeeds
+ROLLBACK;
+--
+-- Policy expression handling
+--
+BEGIN;
+CREATE TABLE t (c) AS VALUES ('bar'::text);
+CREATE POLICY p ON t USING (max(c)); -- fails: aggregate functions are not allowed in policy expressions
+ERROR: aggregate functions are not allowed in policy expressions
+ROLLBACK;
+--
+-- Non-target relations are only subject to SELECT policies
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE r1 (a int);
+CREATE TABLE r2 (a int);
+INSERT INTO r1 VALUES (10), (20);
+INSERT INTO r2 VALUES (10), (20);
+GRANT ALL ON r1, r2 TO regress_rls_bob;
+CREATE POLICY p1 ON r1 USING (true);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON r2 FOR SELECT USING (true);
+CREATE POLICY p2 ON r2 FOR INSERT WITH CHECK (false);
+CREATE POLICY p3 ON r2 FOR UPDATE USING (false);
+CREATE POLICY p4 ON r2 FOR DELETE USING (false);
+ALTER TABLE r2 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM r1;
+ a
+----
+ 10
+ 20
+(2 rows)
+
+SELECT * FROM r2;
+ a
+----
+ 10
+ 20
+(2 rows)
+
+-- r2 is read-only
+INSERT INTO r2 VALUES (2); -- Not allowed
+ERROR: new row violates row-level security policy for table "r2"
+UPDATE r2 SET a = 2 RETURNING *; -- Updates nothing
+ a
+---
+(0 rows)
+
+DELETE FROM r2 RETURNING *; -- Deletes nothing
+ a
+---
+(0 rows)
+
+-- r2 can be used as a non-target relation in DML
+INSERT INTO r1 SELECT a + 1 FROM r2 RETURNING *; -- OK
+ a
+----
+ 11
+ 21
+(2 rows)
+
+UPDATE r1 SET a = r2.a + 2 FROM r2 WHERE r1.a = r2.a RETURNING *; -- OK
+ a | a
+----+----
+ 12 | 10
+ 22 | 20
+(2 rows)
+
+DELETE FROM r1 USING r2 WHERE r1.a = r2.a + 2 RETURNING *; -- OK
+ a | a
+----+----
+ 12 | 10
+ 22 | 20
+(2 rows)
+
+SELECT * FROM r1;
+ a
+----
+ 11
+ 21
+(2 rows)
+
+SELECT * FROM r2;
+ a
+----
+ 10
+ 20
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP TABLE r1;
+DROP TABLE r2;
+--
+-- FORCE ROW LEVEL SECURITY applies RLS to owners too
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security = on;
+CREATE TABLE r1 (a int);
+INSERT INTO r1 VALUES (10), (20);
+CREATE POLICY p1 ON r1 USING (false);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- No error, but no rows
+TABLE r1;
+ a
+---
+(0 rows)
+
+-- RLS error
+INSERT INTO r1 VALUES (1);
+ERROR: new row violates row-level security policy for table "r1"
+-- No error (unable to see any rows to update)
+UPDATE r1 SET a = 1;
+TABLE r1;
+ a
+---
+(0 rows)
+
+-- No error (unable to see any rows to delete)
+DELETE FROM r1;
+TABLE r1;
+ a
+---
+(0 rows)
+
+SET row_security = off;
+-- these all fail, would be affected by RLS
+TABLE r1;
+ERROR: query would be affected by row-level security policy for table "r1"
+HINT: To disable the policy for the table's owner, use ALTER TABLE NO FORCE ROW LEVEL SECURITY.
+UPDATE r1 SET a = 1;
+ERROR: query would be affected by row-level security policy for table "r1"
+HINT: To disable the policy for the table's owner, use ALTER TABLE NO FORCE ROW LEVEL SECURITY.
+DELETE FROM r1;
+ERROR: query would be affected by row-level security policy for table "r1"
+HINT: To disable the policy for the table's owner, use ALTER TABLE NO FORCE ROW LEVEL SECURITY.
+DROP TABLE r1;
+--
+-- FORCE ROW LEVEL SECURITY does not break RI
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security = on;
+CREATE TABLE r1 (a int PRIMARY KEY);
+CREATE TABLE r2 (a int REFERENCES r1);
+INSERT INTO r1 VALUES (10), (20);
+INSERT INTO r2 VALUES (10), (20);
+-- Create policies on r2 which prevent the
+-- owner from seeing any rows, but RI should
+-- still see them.
+CREATE POLICY p1 ON r2 USING (false);
+ALTER TABLE r2 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r2 FORCE ROW LEVEL SECURITY;
+-- Errors due to rows in r2
+DELETE FROM r1;
+ERROR: update or delete on table "r1" violates foreign key constraint "r2_a_fkey" on table "r2"
+DETAIL: Key (a)=(10) is still referenced from table "r2".
+-- Reset r2 to no-RLS
+DROP POLICY p1 ON r2;
+ALTER TABLE r2 NO FORCE ROW LEVEL SECURITY;
+ALTER TABLE r2 DISABLE ROW LEVEL SECURITY;
+-- clean out r2 for INSERT test below
+DELETE FROM r2;
+-- Change r1 to not allow rows to be seen
+CREATE POLICY p1 ON r1 USING (false);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- No rows seen
+TABLE r1;
+ a
+---
+(0 rows)
+
+-- No error, RI still sees that row exists in r1
+INSERT INTO r2 VALUES (10);
+DROP TABLE r2;
+DROP TABLE r1;
+-- Ensure cascaded DELETE works
+CREATE TABLE r1 (a int PRIMARY KEY);
+CREATE TABLE r2 (a int REFERENCES r1 ON DELETE CASCADE);
+INSERT INTO r1 VALUES (10), (20);
+INSERT INTO r2 VALUES (10), (20);
+-- Create policies on r2 which prevent the
+-- owner from seeing any rows, but RI should
+-- still see them.
+CREATE POLICY p1 ON r2 USING (false);
+ALTER TABLE r2 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r2 FORCE ROW LEVEL SECURITY;
+-- Deletes all records from both
+DELETE FROM r1;
+-- Remove FORCE from r2
+ALTER TABLE r2 NO FORCE ROW LEVEL SECURITY;
+-- As owner, we now bypass RLS
+-- verify no rows in r2 now
+TABLE r2;
+ a
+---
+(0 rows)
+
+DROP TABLE r2;
+DROP TABLE r1;
+-- Ensure cascaded UPDATE works
+CREATE TABLE r1 (a int PRIMARY KEY);
+CREATE TABLE r2 (a int REFERENCES r1 ON UPDATE CASCADE);
+INSERT INTO r1 VALUES (10), (20);
+INSERT INTO r2 VALUES (10), (20);
+-- Create policies on r2 which prevent the
+-- owner from seeing any rows, but RI should
+-- still see them.
+CREATE POLICY p1 ON r2 USING (false);
+ALTER TABLE r2 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r2 FORCE ROW LEVEL SECURITY;
+-- Updates records in both
+UPDATE r1 SET a = a+5;
+-- Remove FORCE from r2
+ALTER TABLE r2 NO FORCE ROW LEVEL SECURITY;
+-- As owner, we now bypass RLS
+-- verify records in r2 updated
+TABLE r2;
+ a
+----
+ 15
+ 25
+(2 rows)
+
+DROP TABLE r2;
+DROP TABLE r1;
+--
+-- Test INSERT+RETURNING applies SELECT policies as
+-- WithCheckOptions (meaning an error is thrown)
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security = on;
+CREATE TABLE r1 (a int);
+CREATE POLICY p1 ON r1 FOR SELECT USING (false);
+CREATE POLICY p2 ON r1 FOR INSERT WITH CHECK (true);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- Works fine
+INSERT INTO r1 VALUES (10), (20);
+-- No error, but no rows
+TABLE r1;
+ a
+---
+(0 rows)
+
+SET row_security = off;
+-- fail, would be affected by RLS
+TABLE r1;
+ERROR: query would be affected by row-level security policy for table "r1"
+HINT: To disable the policy for the table's owner, use ALTER TABLE NO FORCE ROW LEVEL SECURITY.
+SET row_security = on;
+-- Error
+INSERT INTO r1 VALUES (10), (20) RETURNING *;
+ERROR: new row violates row-level security policy for table "r1"
+DROP TABLE r1;
+--
+-- Test UPDATE+RETURNING applies SELECT policies as
+-- WithCheckOptions (meaning an error is thrown)
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security = on;
+CREATE TABLE r1 (a int PRIMARY KEY);
+CREATE POLICY p1 ON r1 FOR SELECT USING (a < 20);
+CREATE POLICY p2 ON r1 FOR UPDATE USING (a < 20) WITH CHECK (true);
+CREATE POLICY p3 ON r1 FOR INSERT WITH CHECK (true);
+INSERT INTO r1 VALUES (10);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- Works fine
+UPDATE r1 SET a = 30;
+-- Show updated rows
+ALTER TABLE r1 NO FORCE ROW LEVEL SECURITY;
+TABLE r1;
+ a
+----
+ 30
+(1 row)
+
+-- reset value in r1 for test with RETURNING
+UPDATE r1 SET a = 10;
+-- Verify row reset
+TABLE r1;
+ a
+----
+ 10
+(1 row)
+
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- Error
+UPDATE r1 SET a = 30 RETURNING *;
+ERROR: new row violates row-level security policy for table "r1"
+-- UPDATE path of INSERT ... ON CONFLICT DO UPDATE should also error out
+INSERT INTO r1 VALUES (10)
+ ON CONFLICT (a) DO UPDATE SET a = 30 RETURNING *;
+ERROR: new row violates row-level security policy for table "r1"
+-- Should still error out without RETURNING (use of arbiter always requires
+-- SELECT permissions)
+INSERT INTO r1 VALUES (10)
+ ON CONFLICT (a) DO UPDATE SET a = 30;
+ERROR: new row violates row-level security policy for table "r1"
+INSERT INTO r1 VALUES (10)
+ ON CONFLICT ON CONSTRAINT r1_pkey DO UPDATE SET a = 30;
+ERROR: new row violates row-level security policy for table "r1"
+DROP TABLE r1;
+-- Check dependency handling
+RESET SESSION AUTHORIZATION;
+CREATE TABLE dep1 (c1 int);
+CREATE TABLE dep2 (c1 int);
+CREATE POLICY dep_p1 ON dep1 TO regress_rls_bob USING (c1 > (select max(dep2.c1) from dep2));
+ALTER POLICY dep_p1 ON dep1 TO regress_rls_bob,regress_rls_carol;
+-- Should return one
+SELECT count(*) = 1 FROM pg_depend
+ WHERE objid = (SELECT oid FROM pg_policy WHERE polname = 'dep_p1')
+ AND refobjid = (SELECT oid FROM pg_class WHERE relname = 'dep2');
+ ?column?
+----------
+ t
+(1 row)
+
+ALTER POLICY dep_p1 ON dep1 USING (true);
+-- Should return one
+SELECT count(*) = 1 FROM pg_shdepend
+ WHERE objid = (SELECT oid FROM pg_policy WHERE polname = 'dep_p1')
+ AND refobjid = (SELECT oid FROM pg_authid WHERE rolname = 'regress_rls_bob');
+ ?column?
+----------
+ t
+(1 row)
+
+-- Should return one
+SELECT count(*) = 1 FROM pg_shdepend
+ WHERE objid = (SELECT oid FROM pg_policy WHERE polname = 'dep_p1')
+ AND refobjid = (SELECT oid FROM pg_authid WHERE rolname = 'regress_rls_carol');
+ ?column?
+----------
+ t
+(1 row)
+
+-- Should return zero
+SELECT count(*) = 0 FROM pg_depend
+ WHERE objid = (SELECT oid FROM pg_policy WHERE polname = 'dep_p1')
+ AND refobjid = (SELECT oid FROM pg_class WHERE relname = 'dep2');
+ ?column?
+----------
+ t
+(1 row)
+
+-- DROP OWNED BY testing
+RESET SESSION AUTHORIZATION;
+CREATE ROLE regress_rls_dob_role1;
+CREATE ROLE regress_rls_dob_role2;
+CREATE TABLE dob_t1 (c1 int);
+CREATE TABLE dob_t2 (c1 int) PARTITION BY RANGE (c1);
+CREATE POLICY p1 ON dob_t1 TO regress_rls_dob_role1 USING (true);
+DROP OWNED BY regress_rls_dob_role1;
+DROP POLICY p1 ON dob_t1; -- should fail, already gone
+ERROR: policy "p1" for table "dob_t1" does not exist
+CREATE POLICY p1 ON dob_t1 TO regress_rls_dob_role1,regress_rls_dob_role2 USING (true);
+DROP OWNED BY regress_rls_dob_role1;
+DROP POLICY p1 ON dob_t1; -- should succeed
+CREATE POLICY p1 ON dob_t2 TO regress_rls_dob_role1,regress_rls_dob_role2 USING (true);
+DROP OWNED BY regress_rls_dob_role1;
+DROP POLICY p1 ON dob_t2; -- should succeed
+DROP USER regress_rls_dob_role1;
+DROP USER regress_rls_dob_role2;
+-- Bug #15708: view + table with RLS should check policies as view owner
+CREATE TABLE ref_tbl (a int);
+INSERT INTO ref_tbl VALUES (1);
+CREATE TABLE rls_tbl (a int);
+INSERT INTO rls_tbl VALUES (10);
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON rls_tbl USING (EXISTS (SELECT 1 FROM ref_tbl));
+GRANT SELECT ON ref_tbl TO regress_rls_bob;
+GRANT SELECT ON rls_tbl TO regress_rls_bob;
+CREATE VIEW rls_view AS SELECT * FROM rls_tbl;
+ALTER VIEW rls_view OWNER TO regress_rls_bob;
+GRANT SELECT ON rls_view TO regress_rls_alice;
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM ref_tbl; -- Permission denied
+ERROR: permission denied for table ref_tbl
+SELECT * FROM rls_tbl; -- Permission denied
+ERROR: permission denied for table rls_tbl
+SELECT * FROM rls_view; -- OK
+ a
+----
+ 10
+(1 row)
+
+RESET SESSION AUTHORIZATION;
+DROP VIEW rls_view;
+DROP TABLE rls_tbl;
+DROP TABLE ref_tbl;
+-- Leaky operator test
+CREATE TABLE rls_tbl (a int);
+INSERT INTO rls_tbl SELECT x/10 FROM generate_series(1, 100) x;
+ANALYZE rls_tbl;
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+GRANT SELECT ON rls_tbl TO regress_rls_alice;
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE FUNCTION op_leak(int, int) RETURNS bool
+ AS 'BEGIN RAISE NOTICE ''op_leak => %, %'', $1, $2; RETURN $1 < $2; END'
+ LANGUAGE plpgsql;
+CREATE OPERATOR <<< (procedure = op_leak, leftarg = int, rightarg = int,
+ restrict = scalarltsel);
+SELECT * FROM rls_tbl WHERE a <<< 1000;
+ a
+---
+(0 rows)
+
+DROP OPERATOR <<< (int, int);
+DROP FUNCTION op_leak(int, int);
+RESET SESSION AUTHORIZATION;
+DROP TABLE rls_tbl;
+--
+-- Clean up objects
+--
+RESET SESSION AUTHORIZATION;
+DROP SCHEMA regress_rls_schema CASCADE;
+NOTICE: drop cascades to 29 other objects
+DETAIL: drop cascades to function f_leak(text)
+drop cascades to table uaccount
+drop cascades to table category
+drop cascades to table document
+drop cascades to table part_document
+drop cascades to table dependent
+drop cascades to table rec1
+drop cascades to table rec2
+drop cascades to view rec1v
+drop cascades to view rec2v
+drop cascades to table s1
+drop cascades to table s2
+drop cascades to view v2
+drop cascades to table b1
+drop cascades to view bv1
+drop cascades to table z1
+drop cascades to table z2
+drop cascades to table x1
+drop cascades to table y1
+drop cascades to table y2
+drop cascades to table t1
+drop cascades to table t2
+drop cascades to table t3
+drop cascades to table t4
+drop cascades to table current_check
+drop cascades to table dep1
+drop cascades to table dep2
+drop cascades to table dob_t1
+drop cascades to table dob_t2
+DROP USER regress_rls_alice;
+DROP USER regress_rls_bob;
+DROP USER regress_rls_carol;
+DROP USER regress_rls_dave;
+DROP USER regress_rls_exempt_user;
+DROP ROLE regress_rls_group1;
+DROP ROLE regress_rls_group2;
+-- Arrange to have a few policies left over, for testing
+-- pg_dump/pg_restore
+CREATE SCHEMA regress_rls_schema;
+CREATE TABLE rls_tbl (c1 int);
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON rls_tbl USING (c1 > 5);
+CREATE POLICY p2 ON rls_tbl FOR SELECT USING (c1 <= 3);
+CREATE POLICY p3 ON rls_tbl FOR UPDATE USING (c1 <= 3) WITH CHECK (c1 > 5);
+CREATE POLICY p4 ON rls_tbl FOR DELETE USING (c1 <= 3);
+CREATE TABLE rls_tbl_force (c1 int);
+ALTER TABLE rls_tbl_force ENABLE ROW LEVEL SECURITY;
+ALTER TABLE rls_tbl_force FORCE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON rls_tbl_force USING (c1 = 5) WITH CHECK (c1 < 5);
+CREATE POLICY p2 ON rls_tbl_force FOR SELECT USING (c1 = 8);
+CREATE POLICY p3 ON rls_tbl_force FOR UPDATE USING (c1 = 8) WITH CHECK (c1 >= 5);
+CREATE POLICY p4 ON rls_tbl_force FOR DELETE USING (c1 = 8);
diff --git a/src/test/regress/expected/strings_1.out b/src/test/regress/expected/strings_1.out
new file mode 100644
index 00000000000..a5c324a8b76
--- /dev/null
+++ b/src/test/regress/expected/strings_1.out
@@ -0,0 +1,1823 @@
+--
+-- STRINGS
+-- Test various data entry syntaxes.
+--
+-- SQL string continuation syntax
+-- E021-03 character string literals
+SELECT 'first line'
+' - next line'
+ ' - third line'
+ AS "Three lines to one";
+ Three lines to one
+-------------------------------------
+ first line - next line - third line
+(1 row)
+
+-- illegal string continuation syntax
+SELECT 'first line'
+' - next line' /* this comment is not allowed here */
+' - third line'
+ AS "Illegal comment within continuation";
+ERROR: syntax error at or near "' - third line'"
+LINE 3: ' - third line'
+ ^
+-- Unicode escapes
+SET standard_conforming_strings TO on;
+SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
+ data
+------
+ data
+(1 row)
+
+SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061" UESCAPE '*';
+ dat\+000061
+-------------
+ dat\+000061
+(1 row)
+
+SELECT U&' \' UESCAPE '!' AS "tricky";
+ tricky
+--------
+ \
+(1 row)
+
+SELECT 'tricky' AS U&"\" UESCAPE '!';
+ \
+--------
+ tricky
+(1 row)
+
+SELECT U&'wrong: \061';
+ERROR: invalid Unicode escape value at or near "\061'"
+LINE 1: SELECT U&'wrong: \061';
+ ^
+SELECT U&'wrong: \+0061';
+ERROR: invalid Unicode escape value at or near "\+0061'"
+LINE 1: SELECT U&'wrong: \+0061';
+ ^
+SELECT U&'wrong: +0061' UESCAPE '+';
+ERROR: invalid Unicode escape character at or near "+'"
+LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
+ ^
+SET standard_conforming_strings TO off;
+SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061" UESCAPE '*';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061...
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&' \' UESCAPE '!' AS "tricky";
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&' \' UESCAPE '!' AS "tricky";
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT 'tricky' AS U&"\" UESCAPE '!';
+ \
+--------
+ tricky
+(1 row)
+
+SELECT U&'wrong: \061';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'wrong: \061';
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&'wrong: \+0061';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'wrong: \+0061';
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&'wrong: +0061' UESCAPE '+';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+RESET standard_conforming_strings;
+-- bytea
+SET bytea_output TO hex;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\x De Ad Be Ef '::bytea;
+ bytea
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\xDeAdBeE'::bytea;
+ERROR: invalid hexadecimal data: odd number of digits
+LINE 1: SELECT E'\\xDeAdBeE'::bytea;
+ ^
+SELECT E'\\xDeAdBeEx'::bytea;
+ERROR: invalid hexadecimal digit: "x"
+LINE 1: SELECT E'\\xDeAdBeEx'::bytea;
+ ^
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------
+ \xde00beef
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+--------------------
+ \x4465416442654566
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465006442654566
+(1 row)
+
+SELECT E'De\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\678dBeEf'::bytea;
+ERROR: invalid input syntax for type bytea
+LINE 1: SELECT E'De\\678dBeEf'::bytea;
+ ^
+SET bytea_output TO escape;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\x De Ad Be Ef '::bytea;
+ bytea
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------------
+ \336\000\276\357
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+----------
+ DeAdBeEf
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+-------------
+ De\000dBeEf
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+----------
+ DeSdBeEf
+(1 row)
+
+--
+-- test conversions between various string types
+-- E021-10 implicit casting among the character data types
+--
+SELECT CAST(f1 AS text) AS "text(char)" FROM CHAR_TBL;
+ text(char)
+------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(f1 AS text) AS "text(varchar)" FROM VARCHAR_TBL;
+ text(varchar)
+---------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(name 'namefield' AS text) AS "text(name)";
+ text(name)
+------------
+ namefield
+(1 row)
+
+-- since this is an explicit cast, it should truncate w/o error:
+SELECT CAST(f1 AS char(10)) AS "char(text)" FROM TEXT_TBL;
+ char(text)
+------------
+ doh!
+ hi de ho n
+(2 rows)
+
+-- note: implicit-cast case is tested in char.sql
+SELECT CAST(f1 AS char(20)) AS "char(text)" FROM TEXT_TBL;
+ char(text)
+----------------------
+ doh!
+ hi de ho neighbor
+(2 rows)
+
+SELECT CAST(f1 AS char(10)) AS "char(varchar)" FROM VARCHAR_TBL;
+ char(varchar)
+---------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(name 'namefield' AS char(10)) AS "char(name)";
+ char(name)
+------------
+ namefield
+(1 row)
+
+SELECT CAST(f1 AS varchar) AS "varchar(text)" FROM TEXT_TBL;
+ varchar(text)
+-------------------
+ doh!
+ hi de ho neighbor
+(2 rows)
+
+SELECT CAST(f1 AS varchar) AS "varchar(char)" FROM CHAR_TBL;
+ varchar(char)
+---------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(name 'namefield' AS varchar) AS "varchar(name)";
+ varchar(name)
+---------------
+ namefield
+(1 row)
+
+--
+-- test SQL string functions
+-- E### and T### are feature reference numbers from SQL99
+--
+-- E021-09 trim function
+SELECT TRIM(BOTH FROM ' bunch o blanks ') = 'bunch o blanks' AS "bunch o blanks";
+ bunch o blanks
+----------------
+ t
+(1 row)
+
+SELECT TRIM(LEADING FROM ' bunch o blanks ') = 'bunch o blanks ' AS "bunch o blanks ";
+ bunch o blanks
+------------------
+ t
+(1 row)
+
+SELECT TRIM(TRAILING FROM ' bunch o blanks ') = ' bunch o blanks' AS " bunch o blanks";
+ bunch o blanks
+------------------
+ t
+(1 row)
+
+SELECT TRIM(BOTH 'x' FROM 'xxxxxsome Xsxxxxx') = 'some Xs' AS "some Xs";
+ some Xs
+---------
+ t
+(1 row)
+
+-- E021-06 substring expression
+SELECT SUBSTRING('1234567890' FROM 3) = '34567890' AS "34567890";
+ 34567890
+----------
+ t
+(1 row)
+
+SELECT SUBSTRING('1234567890' FROM 4 FOR 3) = '456' AS "456";
+ 456
+-----
+ t
+(1 row)
+
+-- T581 regular expression substring (with SQL's bizarre regexp syntax)
+SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
+ bcd
+-----
+ bcd
+(1 row)
+
+-- No match should return NULL
+SELECT SUBSTRING('abcdefg' FROM '#"(b_d)#"%' FOR '#') IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+-- Null inputs should return NULL
+SELECT SUBSTRING('abcdefg' FROM '%' FOR NULL) IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+SELECT SUBSTRING(NULL FROM '%' FOR '#') IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM NULL FOR '#') IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+-- The first and last parts should act non-greedy
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*' FOR '#') AS "abcdefg";
+ abcdefg
+---------
+ abcdefg
+(1 row)
+
+-- Vertical bar in any part affects only that part
+SELECT SUBSTRING('abcdefg' FROM 'a|b#"%#"g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"x|g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a#"%|ab#"g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+-- Can't have more than two part separators
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*#"x' FOR '#') AS "error";
+ERROR: SQL regular expression may not contain more than two escape-double-quote separators
+CONTEXT: SQL function "substring" statement 1
+-- Postgres extension: with 0 or 1 separator, assume parts 1 and 3 are empty
+SELECT SUBSTRING('abcdefg' FROM 'a#"%g' FOR '#') AS "bcdefg";
+ bcdefg
+--------
+ bcdefg
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a%g' FOR '#') AS "abcdefg";
+ abcdefg
+---------
+ abcdefg
+(1 row)
+
+-- substring() with just two arguments is not allowed by SQL spec;
+-- we accept it, but we interpret the pattern as a POSIX regexp not SQL
+SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde";
+ cde
+-----
+ cde
+(1 row)
+
+-- With a parenthesized subexpression, return only what matches the subexpr
+SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde";
+ cde
+-----
+ cde
+(1 row)
+
+-- PostgreSQL extension to allow using back reference in replace string;
+SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3');
+ regexp_replace
+----------------
+ (111) 222-3333
+(1 row)
+
+SELECT regexp_replace('AAA BBB CCC ', E'\\s+', ' ', 'g');
+ regexp_replace
+----------------
+ AAA BBB CCC
+(1 row)
+
+SELECT regexp_replace('AAA', '^|$', 'Z', 'g');
+ regexp_replace
+----------------
+ ZAAAZ
+(1 row)
+
+SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'gi');
+ regexp_replace
+----------------
+ Z Z
+(1 row)
+
+-- invalid regexp option
+SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'z');
+ERROR: invalid regular expression option: "z"
+-- set so we can tell NULL from empty string
+\pset null '\\N'
+-- return all matches from regexp
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
+ regexp_matches
+----------------
+ {bar,beque}
+(1 row)
+
+-- test case insensitive
+SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
+ regexp_matches
+----------------
+ {bAR,bEqUE}
+(1 row)
+
+-- global option - more than one match
+SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+ regexp_matches
+----------------
+ {bar,beque}
+ {bazil,barf}
+(2 rows)
+
+-- empty capture group (matched empty string)
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
+ regexp_matches
+----------------
+ {bar,"",beque}
+(1 row)
+
+-- no match
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(.+)(beque)$re$);
+ regexp_matches
+----------------
+(0 rows)
+
+-- optional capture group did not match, null entry in array
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(.+)?(beque)$re$);
+ regexp_matches
+------------------
+ {bar,NULL,beque}
+(1 row)
+
+-- no capture groups
+SELECT regexp_matches('foobarbequebaz', $re$barbeque$re$);
+ regexp_matches
+----------------
+ {barbeque}
+(1 row)
+
+-- start/end-of-line matches are of zero length
+SELECT regexp_matches('foo' || chr(10) || 'bar' || chr(10) || 'bequq' || chr(10) || 'baz', '^', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {""}
+ {""}
+ {""}
+(4 rows)
+
+SELECT regexp_matches('foo' || chr(10) || 'bar' || chr(10) || 'bequq' || chr(10) || 'baz', '$', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {""}
+ {""}
+ {""}
+(4 rows)
+
+SELECT regexp_matches('1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4' || chr(10), '^.?', 'mg');
+ regexp_matches
+----------------
+ {1}
+ {2}
+ {3}
+ {4}
+ {""}
+(5 rows)
+
+SELECT regexp_matches(chr(10) || '1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4' || chr(10), '.?$', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {1}
+ {""}
+ {2}
+ {""}
+ {3}
+ {""}
+ {4}
+ {""}
+ {""}
+(10 rows)
+
+SELECT regexp_matches(chr(10) || '1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4', '.?$', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {1}
+ {""}
+ {2}
+ {""}
+ {3}
+ {""}
+ {4}
+ {""}
+(9 rows)
+
+-- give me errors
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$, 'gz');
+ERROR: invalid regular expression option: "z"
+SELECT regexp_matches('foobarbequebaz', $re$(barbeque$re$);
+ERROR: invalid regular expression: parentheses () not balanced
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque){2,1}$re$);
+ERROR: invalid regular expression: invalid repetition count(s)
+-- split string on regexp
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', $re$\s+$re$) AS foo;
+ foo | length
+-------+--------
+ the | 3
+ quick | 5
+ brown | 5
+ fox | 3
+ jumps | 5
+ over | 4
+ the | 3
+ lazy | 4
+ dog | 3
+(9 rows)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', $re$\s+$re$);
+ regexp_split_to_array
+-----------------------------------------------
+ {the,quick,brown,fox,jumps,over,the,lazy,dog}
+(1 row)
+
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', $re$\s*$re$) AS foo;
+ foo | length
+-----+--------
+ t | 1
+ h | 1
+ e | 1
+ q | 1
+ u | 1
+ i | 1
+ c | 1
+ k | 1
+ b | 1
+ r | 1
+ o | 1
+ w | 1
+ n | 1
+ f | 1
+ o | 1
+ x | 1
+ j | 1
+ u | 1
+ m | 1
+ p | 1
+ s | 1
+ o | 1
+ v | 1
+ e | 1
+ r | 1
+ t | 1
+ h | 1
+ e | 1
+ l | 1
+ a | 1
+ z | 1
+ y | 1
+ d | 1
+ o | 1
+ g | 1
+(35 rows)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', $re$\s*$re$);
+ regexp_split_to_array
+-------------------------------------------------------------------------
+ {t,h,e,q,u,i,c,k,b,r,o,w,n,f,o,x,j,u,m,p,s,o,v,e,r,t,h,e,l,a,z,y,d,o,g}
+(1 row)
+
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', '') AS foo;
+ foo | length
+-----+--------
+ t | 1
+ h | 1
+ e | 1
+ | 1
+ q | 1
+ u | 1
+ i | 1
+ c | 1
+ k | 1
+ | 1
+ b | 1
+ r | 1
+ o | 1
+ w | 1
+ n | 1
+ | 1
+ f | 1
+ o | 1
+ x | 1
+ | 1
+ j | 1
+ u | 1
+ m | 1
+ p | 1
+ s | 1
+ | 1
+ o | 1
+ v | 1
+ e | 1
+ r | 1
+ | 1
+ t | 1
+ h | 1
+ e | 1
+ | 1
+ l | 1
+ a | 1
+ z | 1
+ y | 1
+ | 1
+ d | 1
+ o | 1
+ g | 1
+(43 rows)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', '');
+ regexp_split_to_array
+---------------------------------------------------------------------------------------------------------
+ {t,h,e," ",q,u,i,c,k," ",b,r,o,w,n," ",f,o,x," ",j,u,m,p,s," ",o,v,e,r," ",t,h,e," ",l,a,z,y," ",d,o,g}
+(1 row)
+
+-- case insensitive
+SELECT foo, length(foo) FROM regexp_split_to_table('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'i') AS foo;
+ foo | length
+---------------------------+--------
+ th | 2
+ QUick bROWn FOx jUMPs ov | 25
+ r Th | 4
+ lazy dOG | 9
+(4 rows)
+
+SELECT regexp_split_to_array('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'i');
+ regexp_split_to_array
+-----------------------------------------------------
+ {th," QUick bROWn FOx jUMPs ov","r Th"," lazy dOG"}
+(1 row)
+
+-- no match of pattern
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', 'nomatch') AS foo;
+ foo | length
+---------------------------------------------+--------
+ the quick brown fox jumps over the lazy dog | 43
+(1 row)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', 'nomatch');
+ regexp_split_to_array
+-------------------------------------------------
+ {"the quick brown fox jumps over the lazy dog"}
+(1 row)
+
+-- some corner cases
+SELECT regexp_split_to_array('123456','1');
+ regexp_split_to_array
+-----------------------
+ {"",23456}
+(1 row)
+
+SELECT regexp_split_to_array('123456','6');
+ regexp_split_to_array
+-----------------------
+ {12345,""}
+(1 row)
+
+SELECT regexp_split_to_array('123456','.');
+ regexp_split_to_array
+------------------------
+ {"","","","","","",""}
+(1 row)
+
+SELECT regexp_split_to_array('123456','');
+ regexp_split_to_array
+-----------------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT regexp_split_to_array('123456','(?:)');
+ regexp_split_to_array
+-----------------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT regexp_split_to_array('1','');
+ regexp_split_to_array
+-----------------------
+ {1}
+(1 row)
+
+-- errors
+SELECT foo, length(foo) FROM regexp_split_to_table('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'zippy') AS foo;
+ERROR: invalid regular expression option: "z"
+SELECT regexp_split_to_array('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'iz');
+ERROR: invalid regular expression option: "z"
+-- global option meaningless for regexp_split
+SELECT foo, length(foo) FROM regexp_split_to_table('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'g') AS foo;
+ERROR: regexp_split_to_table() does not support the "global" option
+SELECT regexp_split_to_array('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'g');
+ERROR: regexp_split_to_array() does not support the "global" option
+-- change NULL-display back
+\pset null ''
+-- E021-11 position expression
+SELECT POSITION('4' IN '1234567890') = '4' AS "4";
+ 4
+---
+ t
+(1 row)
+
+SELECT POSITION('5' IN '1234567890') = '5' AS "5";
+ 5
+---
+ t
+(1 row)
+
+-- T312 character overlay function
+SELECT OVERLAY('abcdef' PLACING '45' FROM 4) AS "abc45f";
+ abc45f
+--------
+ abc45f
+(1 row)
+
+SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5) AS "yabadaba";
+ yabadaba
+----------
+ yabadaba
+(1 row)
+
+SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0) AS "yabadabadoo";
+ yabadabadoo
+-------------
+ yabadabadoo
+(1 row)
+
+SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4) AS "bubba";
+ bubba
+-------
+ bubba
+(1 row)
+
+--
+-- test LIKE
+-- Be sure to form every test as a LIKE/NOT LIKE pair.
+--
+-- simplest examples
+-- E061-04 like predicate
+SELECT 'hawkeye' LIKE 'h%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'h%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' LIKE 'H%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'H%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' LIKE 'indio%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'indio%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' LIKE 'h%eye' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'h%eye' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE '_ndio' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE '_ndio' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE 'in__o' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE 'in__o' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE 'in_o' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' NOT LIKE 'in_o' AS "true";
+ true
+------
+ t
+(1 row)
+
+-- unused escape character
+SELECT 'hawkeye' LIKE 'h%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'h%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE 'ind_o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE 'ind_o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+-- escape character
+-- E061-05 like predicate with escape clause
+SELECT 'h%' LIKE 'h#%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%' NOT LIKE 'h#%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'h%wkeye' LIKE 'h#%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'h%wkeye' NOT LIKE 'h#%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%wkeye' LIKE 'h#%%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%wkeye' NOT LIKE 'h#%%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'h%awkeye' LIKE 'h#%a%k%e' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%awkeye' NOT LIKE 'h#%a%k%e' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE '_ndio' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE '_ndio' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'i_dio' LIKE 'i$_d_o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'i_dio' NOT LIKE 'i$_d_o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'i_dio' LIKE 'i$_nd_o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'i_dio' NOT LIKE 'i$_nd_o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'i_dio' LIKE 'i$_d%o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'i_dio' NOT LIKE 'i$_d%o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+-- escape character same as pattern character
+SELECT 'maca' LIKE 'm%aca' ESCAPE '%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'maca' NOT LIKE 'm%aca' ESCAPE '%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'ma%a' LIKE 'm%a%%a' ESCAPE '%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'ma%a' NOT LIKE 'm%a%%a' ESCAPE '%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'bear' LIKE 'b_ear' ESCAPE '_' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'bear' NOT LIKE 'b_ear' ESCAPE '_' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'be_r' LIKE 'b_e__r' ESCAPE '_' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'be_r' NOT LIKE 'b_e__r' ESCAPE '_' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'be_r' LIKE '__e__r' ESCAPE '_' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'be_r' NOT LIKE '__e__r' ESCAPE '_' AS "true";
+ true
+------
+ t
+(1 row)
+
+--
+-- test ILIKE (case-insensitive LIKE)
+-- Be sure to form every test as an ILIKE/NOT ILIKE pair.
+--
+SELECT 'hawkeye' ILIKE 'h%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT ILIKE 'h%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' ILIKE 'H%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT ILIKE 'H%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' ILIKE 'H%Eye' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT ILIKE 'H%Eye' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'Hawkeye' ILIKE 'h%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'Hawkeye' NOT ILIKE 'h%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+--
+-- test %/_ combination cases, cf bugs #4821 and #5478
+--
+SELECT 'foo' LIKE '_%' as t, 'f' LIKE '_%' as t, '' LIKE '_%' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'foo' LIKE '%_' as t, 'f' LIKE '%_' as t, '' LIKE '%_' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'foo' LIKE '__%' as t, 'foo' LIKE '___%' as t, 'foo' LIKE '____%' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'foo' LIKE '%__' as t, 'foo' LIKE '%___' as t, 'foo' LIKE '%____' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'jack' LIKE '%____%' AS t;
+ t
+---
+ t
+(1 row)
+
+--
+-- basic tests of LIKE with indexes
+--
+CREATE TABLE texttest (a text PRIMARY KEY, b int);
+SELECT * FROM texttest WHERE a LIKE '%1%';
+ a | b
+---+---
+(0 rows)
+
+CREATE TABLE byteatest (a bytea PRIMARY KEY, b int);
+SELECT * FROM byteatest WHERE a LIKE '%1%';
+ a | b
+---+---
+(0 rows)
+
+DROP TABLE texttest, byteatest;
+--
+-- test implicit type conversion
+--
+-- E021-07 character concatenation
+SELECT 'unknown' || ' and unknown' AS "Concat unknown types";
+ Concat unknown types
+----------------------
+ unknown and unknown
+(1 row)
+
+SELECT text 'text' || ' and unknown' AS "Concat text to unknown type";
+ Concat text to unknown type
+-----------------------------
+ text and unknown
+(1 row)
+
+SELECT char(20) 'characters' || ' and text' AS "Concat char to unknown type";
+ Concat char to unknown type
+-----------------------------
+ characters and text
+(1 row)
+
+SELECT text 'text' || char(20) ' and characters' AS "Concat text to char";
+ Concat text to char
+---------------------
+ text and characters
+(1 row)
+
+SELECT text 'text' || varchar ' and varchar' AS "Concat text to varchar";
+ Concat text to varchar
+------------------------
+ text and varchar
+(1 row)
+
+--
+-- test substr with toasted text values
+--
+CREATE TABLE toasttest(f1 text);
+insert into toasttest values(repeat('1234567890',10000));
+insert into toasttest values(repeat('1234567890',10000));
+--
+-- Ensure that some values are uncompressed, to test the faster substring
+-- operation used in that case
+--
+alter table toasttest alter column f1 set storage external;
+insert into toasttest values(repeat('1234567890',10000));
+insert into toasttest values(repeat('1234567890',10000));
+-- If the starting position is zero or less, then return from the start of the string
+-- adjusting the length to be consistent with the "negative start" per SQL.
+SELECT substr(f1, -1, 5) from toasttest;
+ substr
+--------
+ 123
+ 123
+ 123
+ 123
+(4 rows)
+
+-- If the length is less than zero, an ERROR is thrown.
+SELECT substr(f1, 5, -1) from toasttest;
+ERROR: negative substring length not allowed
+-- If no third argument (length) is provided, the length to the end of the
+-- string is assumed.
+SELECT substr(f1, 99995) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+-- If start plus length is > string length, the result is truncated to
+-- string length
+SELECT substr(f1, 99995, 10) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+TRUNCATE TABLE toasttest;
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+-- expect >0 blocks
+SELECT pg_relation_size(reltoastrelid) = 0 AS is_empty
+ FROM pg_class where relname = 'toasttest';
+ is_empty
+----------
+
+(1 row)
+
+TRUNCATE TABLE toasttest;
+ALTER TABLE toasttest set (toast_tuple_target = 4080);
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+-- expect 0 blocks
+SELECT pg_relation_size(reltoastrelid) = 0 AS is_empty
+ FROM pg_class where relname = 'toasttest';
+ is_empty
+----------
+
+(1 row)
+
+DROP TABLE toasttest;
+--
+-- test substr with toasted bytea values
+--
+CREATE TABLE toasttest(f1 bytea);
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+--
+-- Ensure that some values are uncompressed, to test the faster substring
+-- operation used in that case
+--
+alter table toasttest alter column f1 set storage external;
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+-- If the starting position is zero or less, then return from the start of the string
+-- adjusting the length to be consistent with the "negative start" per SQL.
+SELECT substr(f1, -1, 5) from toasttest;
+ substr
+--------
+ 123
+ 123
+ 123
+ 123
+(4 rows)
+
+-- If the length is less than zero, an ERROR is thrown.
+SELECT substr(f1, 5, -1) from toasttest;
+ERROR: negative substring length not allowed
+-- If no third argument (length) is provided, the length to the end of the
+-- string is assumed.
+SELECT substr(f1, 99995) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+-- If start plus length is > string length, the result is truncated to
+-- string length
+SELECT substr(f1, 99995, 10) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+DROP TABLE toasttest;
+-- test internally compressing datums
+-- this tests compressing a datum to a very small size which exercises a
+-- corner case in packed-varlena handling: even though small, the compressed
+-- datum must be given a 4-byte header because there are no bits to indicate
+-- compression in a 1-byte header
+CREATE TABLE toasttest (c char(4096));
+INSERT INTO toasttest VALUES('x');
+SELECT length(c), c::text FROM toasttest;
+ length | c
+--------+---
+ 1 | x
+(1 row)
+
+SELECT c FROM toasttest;
+ c
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ x
+(1 row)
+
+DROP TABLE toasttest;
+--
+-- test length
+--
+SELECT length('abcdef') AS "length_6";
+ length_6
+----------
+ 6
+(1 row)
+
+--
+-- test strpos
+--
+SELECT strpos('abcdef', 'cd') AS "pos_3";
+ pos_3
+-------
+ 3
+(1 row)
+
+SELECT strpos('abcdef', 'xy') AS "pos_0";
+ pos_0
+-------
+ 0
+(1 row)
+
+--
+-- test replace
+--
+SELECT replace('abcdef', 'de', '45') AS "abc45f";
+ abc45f
+--------
+ abc45f
+(1 row)
+
+SELECT replace('yabadabadoo', 'ba', '123') AS "ya123da123doo";
+ ya123da123doo
+---------------
+ ya123da123doo
+(1 row)
+
+SELECT replace('yabadoo', 'bad', '') AS "yaoo";
+ yaoo
+------
+ yaoo
+(1 row)
+
+--
+-- test split_part
+--
+select split_part('joeuser@mydatabase','@',0) AS "an error";
+ERROR: field position must be greater than zero
+select split_part('joeuser@mydatabase','@',1) AS "joeuser";
+ joeuser
+---------
+ joeuser
+(1 row)
+
+select split_part('joeuser@mydatabase','@',2) AS "mydatabase";
+ mydatabase
+------------
+ mydatabase
+(1 row)
+
+select split_part('joeuser@mydatabase','@',3) AS "empty string";
+ empty string
+--------------
+
+(1 row)
+
+select split_part('@joeuser@mydatabase@','@',2) AS "joeuser";
+ joeuser
+---------
+ joeuser
+(1 row)
+
+--
+-- test to_hex
+--
+select to_hex(256*256*256 - 1) AS "ffffff";
+ ffffff
+--------
+ ffffff
+(1 row)
+
+select to_hex(256::bigint*256::bigint*256::bigint*256::bigint - 1) AS "ffffffff";
+ ffffffff
+----------
+ ffffffff
+(1 row)
+
+--
+-- MD5 test suite - from IETF RFC 1321
+-- (see: ftp://ftp.rfc-editor.org/in-notes/rfc1321.txt)
+--
+select md5('') = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('a') = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abc') = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('message digest') = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abcdefghijklmnopqrstuvwxyz') = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789') = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890') = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5(''::bytea) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('a'::bytea) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abc'::bytea) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('message digest'::bytea) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abcdefghijklmnopqrstuvwxyz'::bytea) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'::bytea) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890'::bytea) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+--
+-- SHA-2
+--
+SET bytea_output TO hex;
+SELECT sha224('');
+ sha224
+------------------------------------------------------------
+ \xd14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac5b3e42f
+(1 row)
+
+SELECT sha224('The quick brown fox jumps over the lazy dog.');
+ sha224
+------------------------------------------------------------
+ \x619cba8e8e05826e9b8c519c0a5c68f4fb653e8a3d8aa04bb2c8cd4c
+(1 row)
+
+SELECT sha256('');
+ sha256
+--------------------------------------------------------------------
+ \xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
+(1 row)
+
+SELECT sha256('The quick brown fox jumps over the lazy dog.');
+ sha256
+--------------------------------------------------------------------
+ \xef537f25c895bfa782526529a9b63d97aa631564d5d789c2b765448c8635fb6c
+(1 row)
+
+SELECT sha384('');
+ sha384
+----------------------------------------------------------------------------------------------------
+ \x38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b
+(1 row)
+
+SELECT sha384('The quick brown fox jumps over the lazy dog.');
+ sha384
+----------------------------------------------------------------------------------------------------
+ \xed892481d8272ca6df370bf706e4d7bc1b5739fa2177aae6c50e946678718fc67a7af2819a021c2fc34e91bdb63409d7
+(1 row)
+
+SELECT sha512('');
+ sha512
+------------------------------------------------------------------------------------------------------------------------------------
+ \xcf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e
+(1 row)
+
+SELECT sha512('The quick brown fox jumps over the lazy dog.');
+ sha512
+------------------------------------------------------------------------------------------------------------------------------------
+ \x91ea1245f20d46ae9a037a989f54f1f790f0a47607eeb8a14d12890cea77a1bbc6c7ed9cf205e67b7f2b8fd4c7dfd3a7a8617e45f3c463d481c7e586c39ac1ed
+(1 row)
+
+--
+-- test behavior of escape_string_warning and standard_conforming_strings options
+--
+set escape_string_warning = off;
+set standard_conforming_strings = off;
+show escape_string_warning;
+ escape_string_warning
+-----------------------
+ off
+(1 row)
+
+show standard_conforming_strings;
+ standard_conforming_strings
+-----------------------------
+ off
+(1 row)
+
+set escape_string_warning = on;
+set standard_conforming_strings = on;
+show escape_string_warning;
+ escape_string_warning
+-----------------------
+ on
+(1 row)
+
+show standard_conforming_strings;
+ standard_conforming_strings
+-----------------------------
+ on
+(1 row)
+
+select 'a\bcd' as f1, 'a\b''cd' as f2, 'a\b''''cd' as f3, 'abcd\' as f4, 'ab\''cd' as f5, '\\' as f6;
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+set standard_conforming_strings = off;
+select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd' as f5, '\\\\' as f6;
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3,...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3,...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3,...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: ...bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' ...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: ...'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd'...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: ...'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd' as f5, '\\\\' as ...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+set escape_string_warning = off;
+set standard_conforming_strings = on;
+select 'a\bcd' as f1, 'a\b''cd' as f2, 'a\b''''cd' as f3, 'abcd\' as f4, 'ab\''cd' as f5, '\\' as f6;
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+set standard_conforming_strings = off;
+select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd' as f5, '\\\\' as f6;
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+--
+-- Additional string functions
+--
+SET bytea_output TO escape;
+SELECT initcap('hi THOMAS');
+ initcap
+-----------
+ Hi Thomas
+(1 row)
+
+SELECT lpad('hi', 5, 'xy');
+ lpad
+-------
+ xyxhi
+(1 row)
+
+SELECT lpad('hi', 5);
+ lpad
+-------
+ hi
+(1 row)
+
+SELECT lpad('hi', -5, 'xy');
+ lpad
+------
+
+(1 row)
+
+SELECT lpad('hello', 2);
+ lpad
+------
+ he
+(1 row)
+
+SELECT lpad('hi', 5, '');
+ lpad
+------
+ hi
+(1 row)
+
+SELECT rpad('hi', 5, 'xy');
+ rpad
+-------
+ hixyx
+(1 row)
+
+SELECT rpad('hi', 5);
+ rpad
+-------
+ hi
+(1 row)
+
+SELECT rpad('hi', -5, 'xy');
+ rpad
+------
+
+(1 row)
+
+SELECT rpad('hello', 2);
+ rpad
+------
+ he
+(1 row)
+
+SELECT rpad('hi', 5, '');
+ rpad
+------
+ hi
+(1 row)
+
+SELECT ltrim('zzzytrim', 'xyz');
+ ltrim
+-------
+ trim
+(1 row)
+
+SELECT translate('', '14', 'ax');
+ translate
+-----------
+
+(1 row)
+
+SELECT translate('12345', '14', 'ax');
+ translate
+-----------
+ a23x5
+(1 row)
+
+SELECT ascii('x');
+ ascii
+-------
+ 120
+(1 row)
+
+SELECT ascii('');
+ ascii
+-------
+ 0
+(1 row)
+
+SELECT chr(65);
+ chr
+-----
+ A
+(1 row)
+
+SELECT chr(0);
+ERROR: null character not permitted
+SELECT repeat('Pg', 4);
+ repeat
+----------
+ PgPgPgPg
+(1 row)
+
+SELECT repeat('Pg', -4);
+ repeat
+--------
+
+(1 row)
+
+SELECT trim(E'\\000'::bytea from E'\\000Tom\\000'::bytea);
+ btrim
+-------
+ Tom
+(1 row)
+
+SELECT btrim(E'\\000trim\\000'::bytea, E'\\000'::bytea);
+ btrim
+-------
+ trim
+(1 row)
+
+SELECT btrim(''::bytea, E'\\000'::bytea);
+ btrim
+-------
+
+(1 row)
+
+SELECT btrim(E'\\000trim\\000'::bytea, ''::bytea);
+ btrim
+--------------
+ \000trim\000
+(1 row)
+
+SELECT encode(overlay(E'Th\\000omas'::bytea placing E'Th\\001omas'::bytea from 2),'escape');
+ encode
+-------------
+ TTh\x01omas
+(1 row)
+
+SELECT encode(overlay(E'Th\\000omas'::bytea placing E'\\002\\003'::bytea from 8),'escape');
+ encode
+--------------------
+ Th\000omas\x02\x03
+(1 row)
+
+SELECT encode(overlay(E'Th\\000omas'::bytea placing E'\\002\\003'::bytea from 5 for 3),'escape');
+ encode
+-----------------
+ Th\000o\x02\x03
+(1 row)
+
diff --git a/src/test/regress/expected/tablesample_1.out b/src/test/regress/expected/tablesample_1.out
new file mode 100644
index 00000000000..5ea0cfe166e
--- /dev/null
+++ b/src/test/regress/expected/tablesample_1.out
@@ -0,0 +1,299 @@
+CREATE TABLE test_tablesample (id int, name text) WITH (fillfactor=10);
+-- use fillfactor so we don't have to load too much data to get multiple pages
+INSERT INTO test_tablesample
+ SELECT i, repeat(i::text, 200) FROM generate_series(0, 9) s(i);
+SELECT t.id FROM test_tablesample AS t TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (100.0/11) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (50) REPEATABLE (0);
+ id
+----
+ 3
+ 4
+ 5
+ 7
+ 8
+(5 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (5.5) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+-- 100% should give repeatable count results (ie, all rows) in any case
+SELECT count(*) FROM test_tablesample TABLESAMPLE SYSTEM (100);
+ count
+-------
+ 10
+(1 row)
+
+SELECT count(*) FROM test_tablesample TABLESAMPLE SYSTEM (100) REPEATABLE (1+2);
+ count
+-------
+ 10
+(1 row)
+
+SELECT count(*) FROM test_tablesample TABLESAMPLE SYSTEM (100) REPEATABLE (0.4);
+ count
+-------
+ 10
+(1 row)
+
+CREATE VIEW test_tablesample_v1 AS
+ SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (10*2) REPEATABLE (2);
+CREATE VIEW test_tablesample_v2 AS
+ SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (99);
+\d+ test_tablesample_v1
+ View "public.test_tablesample_v1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ id | integer | | | | plain |
+View definition:
+ SELECT test_tablesample.id
+ FROM test_tablesample TABLESAMPLE system ((10 * 2)) REPEATABLE (2);
+
+\d+ test_tablesample_v2
+ View "public.test_tablesample_v2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ id | integer | | | | plain |
+View definition:
+ SELECT test_tablesample.id
+ FROM test_tablesample TABLESAMPLE system (99);
+
+-- check a sampled query doesn't affect cursor in progress
+BEGIN;
+DECLARE tablesample_cur CURSOR FOR
+ SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+FETCH FIRST FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH FIRST FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+CLOSE tablesample_cur;
+END;
+EXPLAIN (COSTS OFF)
+ SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (2);
+ QUERY PLAN
+--------------------------------------------------------------------
+ Sample Scan on test_tablesample
+ Sampling: system ('50'::real) REPEATABLE ('2'::double precision)
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+ SELECT * FROM test_tablesample_v1;
+ QUERY PLAN
+--------------------------------------------------------------------
+ Sample Scan on test_tablesample
+ Sampling: system ('20'::real) REPEATABLE ('2'::double precision)
+(2 rows)
+
+-- check inheritance behavior
+explain (costs off)
+ select count(*) from person tablesample bernoulli (100);
+ QUERY PLAN
+-------------------------------------------------
+ Aggregate
+ -> Append
+ -> Sample Scan on person
+ Sampling: bernoulli ('100'::real)
+ -> Sample Scan on emp
+ Sampling: bernoulli ('100'::real)
+ -> Sample Scan on student
+ Sampling: bernoulli ('100'::real)
+ -> Sample Scan on stud_emp
+ Sampling: bernoulli ('100'::real)
+(10 rows)
+
+select count(*) from person tablesample bernoulli (100);
+ count
+-------
+ 58
+(1 row)
+
+select count(*) from person;
+ count
+-------
+ 58
+(1 row)
+
+-- check that collations get assigned within the tablesample arguments
+SELECT count(*) FROM test_tablesample TABLESAMPLE bernoulli (('1'::text < '0'::text)::int);
+ count
+-------
+ 0
+(1 row)
+
+-- check behavior during rescans, as well as correct handling of min/max pct
+select * from
+ (values (0),(100)) v(pct),
+ lateral (select count(*) from tenk1 tablesample bernoulli (pct)) ss;
+ pct | count
+-----+-------
+ 0 | 0
+ 100 | 0
+(2 rows)
+
+select * from
+ (values (0),(100)) v(pct),
+ lateral (select count(*) from tenk1 tablesample system (pct)) ss;
+ pct | count
+-----+-------
+ 0 | 0
+ 100 | 10000
+(2 rows)
+
+explain (costs off)
+select pct, count(unique1) from
+ (values (0),(100)) v(pct),
+ lateral (select * from tenk1 tablesample bernoulli (pct)) ss
+ group by pct;
+ QUERY PLAN
+--------------------------------------------------------
+ HashAggregate
+ Group Key: "*VALUES*".column1
+ -> Nested Loop
+ -> Values Scan on "*VALUES*"
+ -> Sample Scan on tenk1
+ Sampling: bernoulli ("*VALUES*".column1)
+(6 rows)
+
+select pct, count(unique1) from
+ (values (0),(100)) v(pct),
+ lateral (select * from tenk1 tablesample bernoulli (pct)) ss
+ group by pct;
+ pct | count
+-----+-------
+(0 rows)
+
+select pct, count(unique1) from
+ (values (0),(100)) v(pct),
+ lateral (select * from tenk1 tablesample system (pct)) ss
+ group by pct;
+ pct | count
+-----+-------
+ 100 | 10000
+(1 row)
+
+-- errors
+SELECT id FROM test_tablesample TABLESAMPLE FOOBAR (1);
+ERROR: tablesample method foobar does not exist
+LINE 1: SELECT id FROM test_tablesample TABLESAMPLE FOOBAR (1);
+ ^
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (NULL);
+ERROR: TABLESAMPLE parameter cannot be null
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (NULL);
+ERROR: TABLESAMPLE REPEATABLE parameter cannot be null
+SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (-1);
+ERROR: sample percentage must be between 0 and 100
+SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (200);
+ERROR: sample percentage must be between 0 and 100
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (-1);
+ERROR: sample percentage must be between 0 and 100
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (200);
+ERROR: sample percentage must be between 0 and 100
+SELECT id FROM test_tablesample_v1 TABLESAMPLE BERNOULLI (1);
+ERROR: TABLESAMPLE clause can only be applied to tables and materialized views
+LINE 1: SELECT id FROM test_tablesample_v1 TABLESAMPLE BERNOULLI (1)...
+ ^
+INSERT INTO test_tablesample_v1 VALUES(1);
+ERROR: cannot insert into view "test_tablesample_v1"
+DETAIL: Views containing TABLESAMPLE are not automatically updatable.
+HINT: To enable inserting into the view, provide an INSTEAD OF INSERT trigger or an unconditional ON INSERT DO INSTEAD rule.
+WITH query_select AS (SELECT * FROM test_tablesample)
+SELECT * FROM query_select TABLESAMPLE BERNOULLI (5.5) REPEATABLE (1);
+ERROR: TABLESAMPLE clause can only be applied to tables and materialized views
+LINE 2: SELECT * FROM query_select TABLESAMPLE BERNOULLI (5.5) REPEA...
+ ^
+SELECT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPLE BERNOULLI (5);
+ERROR: syntax error at or near "TABLESAMPLE"
+LINE 1: ...CT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPL...
+ ^
+-- check partitioned tables support tablesample
+create table parted_sample (a int) partition by list (a);
+create table parted_sample_1 partition of parted_sample for values in (1);
+create table parted_sample_2 partition of parted_sample for values in (2);
+explain (costs off)
+ select * from parted_sample tablesample bernoulli (100);
+ QUERY PLAN
+-------------------------------------------
+ Append
+ -> Sample Scan on parted_sample_1
+ Sampling: bernoulli ('100'::real)
+ -> Sample Scan on parted_sample_2
+ Sampling: bernoulli ('100'::real)
+(5 rows)
+
+drop table parted_sample, parted_sample_1, parted_sample_2;
diff --git a/src/test/regress/expected/transactions_1.out b/src/test/regress/expected/transactions_1.out
new file mode 100644
index 00000000000..b136c3595a2
--- /dev/null
+++ b/src/test/regress/expected/transactions_1.out
@@ -0,0 +1,944 @@
+--
+-- TRANSACTIONS
+--
+BEGIN;
+SELECT *
+ INTO TABLE xacttest
+ FROM aggtest;
+INSERT INTO xacttest (a, b) VALUES (777, 777.777);
+END;
+-- should retrieve one value--
+SELECT a FROM xacttest WHERE a > 100;
+ a
+-----
+ 777
+(1 row)
+
+BEGIN;
+CREATE TABLE disappear (a int4);
+DELETE FROM aggtest;
+-- should be empty
+SELECT * FROM aggtest;
+ a | b
+---+---
+(0 rows)
+
+ABORT;
+-- should not exist
+SELECT oid FROM pg_class WHERE relname = 'disappear';
+ oid
+-----
+(0 rows)
+
+-- should have members again
+SELECT * FROM aggtest;
+ a | b
+-----+---------
+ 56 | 7.8
+ 100 | 99.097
+ 0 | 0.09561
+ 42 | 324.78
+(4 rows)
+
+-- Read-only tests
+CREATE TABLE writetest (a int);
+CREATE TEMPORARY TABLE temptest (a int);
+BEGIN;
+SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+SET TRANSACTION READ WRITE; --fail
+ERROR: transaction read-write mode must be set before any query
+COMMIT;
+BEGIN;
+SET TRANSACTION READ ONLY; -- ok
+SET TRANSACTION READ WRITE; -- ok
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+SAVEPOINT x;
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+SET TRANSACTION READ ONLY; -- ok
+SET TRANSACTION READ WRITE; --fail
+ERROR: cannot set transaction read-write mode inside a read-only transaction
+COMMIT;
+BEGIN;
+SET TRANSACTION READ WRITE; -- ok
+SAVEPOINT x;
+SET TRANSACTION READ WRITE; -- ok
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+SET TRANSACTION READ ONLY; -- ok
+SET TRANSACTION READ WRITE; --fail
+ERROR: cannot set transaction read-write mode inside a read-only transaction
+COMMIT;
+BEGIN;
+SET TRANSACTION READ WRITE; -- ok
+SAVEPOINT x;
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+ROLLBACK TO SAVEPOINT x;
+SHOW transaction_read_only; -- off
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SAVEPOINT y;
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+RELEASE SAVEPOINT y;
+SHOW transaction_read_only; -- off
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+COMMIT;
+SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY;
+DROP TABLE writetest; -- fail
+ERROR: cannot execute DROP TABLE in a read-only transaction
+INSERT INTO writetest VALUES (1); -- fail
+ERROR: cannot execute INSERT in a read-only transaction
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+DELETE FROM temptest; -- ok
+UPDATE temptest SET a = 0 FROM writetest WHERE temptest.a = 1 AND writetest.a = temptest.a; -- ok
+PREPARE test AS UPDATE writetest SET a = 0; -- ok
+EXECUTE test; -- fail
+ERROR: cannot execute UPDATE in a read-only transaction
+SELECT * FROM writetest, temptest; -- ok
+ a | a
+---+---
+(0 rows)
+
+CREATE TABLE test AS SELECT * FROM writetest; -- fail
+ERROR: cannot execute CREATE TABLE AS in a read-only transaction
+START TRANSACTION READ WRITE;
+DROP TABLE writetest; -- ok
+COMMIT;
+-- Subtransactions, basic tests
+-- create & drop tables
+SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE;
+CREATE TABLE trans_foobar (a int);
+BEGIN;
+ CREATE TABLE trans_foo (a int);
+ SAVEPOINT one;
+ DROP TABLE trans_foo;
+ CREATE TABLE trans_bar (a int);
+ ROLLBACK TO SAVEPOINT one;
+ RELEASE SAVEPOINT one;
+ SAVEPOINT two;
+ CREATE TABLE trans_baz (a int);
+ RELEASE SAVEPOINT two;
+ drop TABLE trans_foobar;
+ CREATE TABLE trans_barbaz (a int);
+COMMIT;
+-- should exist: trans_barbaz, trans_baz, trans_foo
+SELECT * FROM trans_foo; -- should be empty
+ a
+---
+(0 rows)
+
+SELECT * FROM trans_bar; -- shouldn't exist
+ERROR: relation "trans_bar" does not exist
+LINE 1: SELECT * FROM trans_bar;
+ ^
+SELECT * FROM trans_barbaz; -- should be empty
+ a
+---
+(0 rows)
+
+SELECT * FROM trans_baz; -- should be empty
+ a
+---
+(0 rows)
+
+-- inserts
+BEGIN;
+ INSERT INTO trans_foo VALUES (1);
+ SAVEPOINT one;
+ INSERT into trans_bar VALUES (1);
+ERROR: relation "trans_bar" does not exist
+LINE 1: INSERT into trans_bar VALUES (1);
+ ^
+ ROLLBACK TO one;
+ RELEASE SAVEPOINT one;
+ SAVEPOINT two;
+ INSERT into trans_barbaz VALUES (1);
+ RELEASE two;
+ SAVEPOINT three;
+ SAVEPOINT four;
+ INSERT INTO trans_foo VALUES (2);
+ RELEASE SAVEPOINT four;
+ ROLLBACK TO SAVEPOINT three;
+ RELEASE SAVEPOINT three;
+ INSERT INTO trans_foo VALUES (3);
+COMMIT;
+SELECT * FROM trans_foo; -- should have 1 and 3
+ a
+---
+ 1
+ 3
+(2 rows)
+
+SELECT * FROM trans_barbaz; -- should have 1
+ a
+---
+ 1
+(1 row)
+
+-- test whole-tree commit
+BEGIN;
+ SAVEPOINT one;
+ SELECT trans_foo;
+ERROR: column "trans_foo" does not exist
+LINE 1: SELECT trans_foo;
+ ^
+ ROLLBACK TO SAVEPOINT one;
+ RELEASE SAVEPOINT one;
+ SAVEPOINT two;
+ CREATE TABLE savepoints (a int);
+ SAVEPOINT three;
+ INSERT INTO savepoints VALUES (1);
+ SAVEPOINT four;
+ INSERT INTO savepoints VALUES (2);
+ SAVEPOINT five;
+ INSERT INTO savepoints VALUES (3);
+ ROLLBACK TO SAVEPOINT five;
+COMMIT;
+COMMIT; -- should not be in a transaction block
+WARNING: there is no transaction in progress
+SELECT * FROM savepoints;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+-- test whole-tree rollback
+BEGIN;
+ SAVEPOINT one;
+ DELETE FROM savepoints WHERE a=1;
+ RELEASE SAVEPOINT one;
+ SAVEPOINT two;
+ DELETE FROM savepoints WHERE a=1;
+ SAVEPOINT three;
+ DELETE FROM savepoints WHERE a=2;
+ROLLBACK;
+COMMIT; -- should not be in a transaction block
+WARNING: there is no transaction in progress
+SELECT * FROM savepoints;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+-- test whole-tree commit on an aborted subtransaction
+BEGIN;
+ INSERT INTO savepoints VALUES (4);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (5);
+ SELECT trans_foo;
+ERROR: column "trans_foo" does not exist
+LINE 1: SELECT trans_foo;
+ ^
+COMMIT;
+SELECT * FROM savepoints;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+BEGIN;
+ INSERT INTO savepoints VALUES (6);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (7);
+ RELEASE SAVEPOINT one;
+ INSERT INTO savepoints VALUES (8);
+COMMIT;
+-- rows 6 and 8 should have been created by the same xact
+SELECT a.xmin = b.xmin FROM savepoints a, savepoints b WHERE a.a=6 AND b.a=8;
+ ?column?
+----------
+ t
+(1 row)
+
+-- rows 6 and 7 should have been created by different xacts
+SELECT a.xmin = b.xmin FROM savepoints a, savepoints b WHERE a.a=6 AND b.a=7;
+ ?column?
+----------
+ t
+(1 row)
+
+BEGIN;
+ INSERT INTO savepoints VALUES (9);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (10);
+ ROLLBACK TO SAVEPOINT one;
+ INSERT INTO savepoints VALUES (11);
+COMMIT;
+SELECT a FROM savepoints WHERE a in (9, 10, 11);
+ a
+----
+ 9
+ 11
+(2 rows)
+
+-- rows 9 and 11 should have been created by different xacts
+SELECT a.xmin = b.xmin FROM savepoints a, savepoints b WHERE a.a=9 AND b.a=11;
+ ?column?
+----------
+ t
+(1 row)
+
+BEGIN;
+ INSERT INTO savepoints VALUES (12);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (13);
+ SAVEPOINT two;
+ INSERT INTO savepoints VALUES (14);
+ ROLLBACK TO SAVEPOINT one;
+ INSERT INTO savepoints VALUES (15);
+ SAVEPOINT two;
+ INSERT INTO savepoints VALUES (16);
+ SAVEPOINT three;
+ INSERT INTO savepoints VALUES (17);
+COMMIT;
+SELECT a FROM savepoints WHERE a BETWEEN 12 AND 17;
+ a
+----
+ 12
+ 15
+ 16
+ 17
+(4 rows)
+
+BEGIN;
+ INSERT INTO savepoints VALUES (18);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (19);
+ SAVEPOINT two;
+ INSERT INTO savepoints VALUES (20);
+ ROLLBACK TO SAVEPOINT one;
+ INSERT INTO savepoints VALUES (21);
+ ROLLBACK TO SAVEPOINT one;
+ INSERT INTO savepoints VALUES (22);
+COMMIT;
+SELECT a FROM savepoints WHERE a BETWEEN 18 AND 22;
+ a
+----
+ 18
+ 22
+(2 rows)
+
+DROP TABLE savepoints;
+-- only in a transaction block:
+SAVEPOINT one;
+ERROR: SAVEPOINT can only be used in transaction blocks
+ROLLBACK TO SAVEPOINT one;
+ERROR: ROLLBACK TO SAVEPOINT can only be used in transaction blocks
+RELEASE SAVEPOINT one;
+ERROR: RELEASE SAVEPOINT can only be used in transaction blocks
+-- Only "rollback to" allowed in aborted state
+BEGIN;
+ SAVEPOINT one;
+ SELECT 0/0;
+ERROR: division by zero
+ SAVEPOINT two; -- ignored till the end of ...
+ERROR: current transaction is aborted, commands ignored until end of transaction block
+ RELEASE SAVEPOINT one; -- ignored till the end of ...
+ERROR: current transaction is aborted, commands ignored until end of transaction block
+ ROLLBACK TO SAVEPOINT one;
+ SELECT 1;
+ ?column?
+----------
+ 1
+(1 row)
+
+COMMIT;
+SELECT 1; -- this should work
+ ?column?
+----------
+ 1
+(1 row)
+
+-- check non-transactional behavior of cursors
+BEGIN;
+ DECLARE c CURSOR FOR SELECT unique2 FROM tenk1 ORDER BY unique2;
+ SAVEPOINT one;
+ FETCH 10 FROM c;
+ unique2
+---------
+ 0
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+(10 rows)
+
+ ROLLBACK TO SAVEPOINT one;
+ FETCH 10 FROM c;
+ unique2
+---------
+ 10
+ 11
+ 12
+ 13
+ 14
+ 15
+ 16
+ 17
+ 18
+ 19
+(10 rows)
+
+ RELEASE SAVEPOINT one;
+ FETCH 10 FROM c;
+ unique2
+---------
+ 20
+ 21
+ 22
+ 23
+ 24
+ 25
+ 26
+ 27
+ 28
+ 29
+(10 rows)
+
+ CLOSE c;
+ DECLARE c CURSOR FOR SELECT unique2/0 FROM tenk1 ORDER BY unique2;
+ SAVEPOINT two;
+ FETCH 10 FROM c;
+ERROR: division by zero
+ ROLLBACK TO SAVEPOINT two;
+ -- c is now dead to the world ...
+ FETCH 10 FROM c;
+ERROR: portal "c" cannot be run
+ ROLLBACK TO SAVEPOINT two;
+ RELEASE SAVEPOINT two;
+ FETCH 10 FROM c;
+ERROR: portal "c" cannot be run
+COMMIT;
+--
+-- Check that "stable" functions are really stable. They should not be
+-- able to see the partial results of the calling query. (Ideally we would
+-- also check that they don't see commits of concurrent transactions, but
+-- that's a mite hard to do within the limitations of pg_regress.)
+--
+select * from xacttest;
+ a | b
+-----+---------
+ 56 | 7.8
+ 100 | 99.097
+ 0 | 0.09561
+ 42 | 324.78
+ 777 | 777.777
+(5 rows)
+
+create or replace function max_xacttest() returns smallint language sql as
+'select max(a) from xacttest' stable;
+begin;
+update xacttest set a = max_xacttest() + 10 where a > 0;
+select * from xacttest;
+ a | b
+-----+---------
+ 0 | 0.09561
+ 787 | 7.8
+ 787 | 99.097
+ 787 | 324.78
+ 787 | 777.777
+(5 rows)
+
+rollback;
+-- But a volatile function can see the partial results of the calling query
+create or replace function max_xacttest() returns smallint language sql as
+'select max(a) from xacttest' volatile;
+begin;
+update xacttest set a = max_xacttest() + 10 where a > 0;
+select * from xacttest;
+ a | b
+-----+---------
+ 0 | 0.09561
+ 787 | 7.8
+ 797 | 99.097
+ 807 | 324.78
+ 817 | 777.777
+(5 rows)
+
+rollback;
+-- Now the same test with plpgsql (since it depends on SPI which is different)
+create or replace function max_xacttest() returns smallint language plpgsql as
+'begin return max(a) from xacttest; end' stable;
+begin;
+update xacttest set a = max_xacttest() + 10 where a > 0;
+select * from xacttest;
+ a | b
+-----+---------
+ 0 | 0.09561
+ 787 | 7.8
+ 787 | 99.097
+ 787 | 324.78
+ 787 | 777.777
+(5 rows)
+
+rollback;
+create or replace function max_xacttest() returns smallint language plpgsql as
+'begin return max(a) from xacttest; end' volatile;
+begin;
+update xacttest set a = max_xacttest() + 10 where a > 0;
+select * from xacttest;
+ a | b
+-----+---------
+ 0 | 0.09561
+ 787 | 7.8
+ 797 | 99.097
+ 807 | 324.78
+ 817 | 777.777
+(5 rows)
+
+rollback;
+-- test case for problems with dropping an open relation during abort
+BEGIN;
+ savepoint x;
+ CREATE TABLE koju (a INT UNIQUE);
+ INSERT INTO koju VALUES (1);
+ INSERT INTO koju VALUES (1);
+ERROR: duplicate key value violates unique constraint "koju_a_key"
+DETAIL: Key (a)=(1) already exists.
+ rollback to x;
+ CREATE TABLE koju (a INT UNIQUE);
+ INSERT INTO koju VALUES (1);
+ INSERT INTO koju VALUES (1);
+ERROR: duplicate key value violates unique constraint "koju_a_key"
+DETAIL: Key (a)=(1) already exists.
+ROLLBACK;
+DROP TABLE trans_foo;
+DROP TABLE trans_baz;
+DROP TABLE trans_barbaz;
+-- test case for problems with revalidating an open relation during abort
+create function inverse(int) returns float8 as
+$$
+begin
+ analyze revalidate_bug;
+ return 1::float8/$1;
+exception
+ when division_by_zero then return 0;
+end$$ language plpgsql volatile;
+create table revalidate_bug (c float8 unique);
+insert into revalidate_bug values (1);
+insert into revalidate_bug values (inverse(0));
+drop table revalidate_bug;
+drop function inverse(int);
+-- verify that cursors created during an aborted subtransaction are
+-- closed, but that we do not rollback the effect of any FETCHs
+-- performed in the aborted subtransaction
+begin;
+savepoint x;
+create table abc (a int);
+insert into abc values (5);
+insert into abc values (10);
+declare foo cursor for select * from abc;
+fetch from foo;
+ a
+---
+ 5
+(1 row)
+
+rollback to x;
+-- should fail
+fetch from foo;
+ERROR: cursor "foo" does not exist
+commit;
+begin;
+create table abc (a int);
+insert into abc values (5);
+insert into abc values (10);
+insert into abc values (15);
+declare foo cursor for select * from abc;
+fetch from foo;
+ a
+---
+ 5
+(1 row)
+
+savepoint x;
+fetch from foo;
+ a
+----
+ 10
+(1 row)
+
+rollback to x;
+fetch from foo;
+ a
+----
+ 15
+(1 row)
+
+abort;
+-- Test for proper cleanup after a failure in a cursor portal
+-- that was created in an outer subtransaction
+CREATE FUNCTION invert(x float8) RETURNS float8 LANGUAGE plpgsql AS
+$$ begin return 1/x; end $$;
+CREATE FUNCTION create_temp_tab() RETURNS text
+LANGUAGE plpgsql AS $$
+BEGIN
+ CREATE TEMP TABLE new_table (f1 float8);
+ -- case of interest is that we fail while holding an open
+ -- relcache reference to new_table
+ INSERT INTO new_table SELECT invert(0.0);
+ RETURN 'foo';
+END $$;
+BEGIN;
+DECLARE ok CURSOR FOR SELECT * FROM int8_tbl;
+DECLARE ctt CURSOR FOR SELECT create_temp_tab();
+FETCH ok;
+ q1 | q2
+-----+-----
+ 123 | 456
+(1 row)
+
+SAVEPOINT s1;
+FETCH ok; -- should work
+ q1 | q2
+-----+------------------
+ 123 | 4567890123456789
+(1 row)
+
+FETCH ctt; -- error occurs here
+ERROR: division by zero
+CONTEXT: PL/pgSQL function invert(double precision) line 1 at RETURN
+SQL statement "INSERT INTO new_table SELECT invert(0.0)"
+PL/pgSQL function create_temp_tab() line 6 at SQL statement
+ROLLBACK TO s1;
+FETCH ok; -- should work
+ q1 | q2
+------------------+-----
+ 4567890123456789 | 123
+(1 row)
+
+FETCH ctt; -- must be rejected
+ERROR: portal "ctt" cannot be run
+COMMIT;
+DROP FUNCTION create_temp_tab();
+DROP FUNCTION invert(x float8);
+-- Tests for AND CHAIN
+CREATE TABLE abc (a int);
+-- set nondefault value so we have something to override below
+SET default_transaction_read_only = on;
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ WRITE, DEFERRABLE;
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+INSERT INTO abc VALUES (1);
+INSERT INTO abc VALUES (2);
+COMMIT AND CHAIN; -- TBLOCK_END
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+INSERT INTO abc VALUES ('error');
+ERROR: invalid input syntax for type integer: "error"
+LINE 1: INSERT INTO abc VALUES ('error');
+ ^
+INSERT INTO abc VALUES (3); -- check it's really aborted
+ERROR: current transaction is aborted, commands ignored until end of transaction block
+COMMIT AND CHAIN; -- TBLOCK_ABORT_END
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+INSERT INTO abc VALUES (4);
+COMMIT;
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ WRITE, DEFERRABLE;
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+SAVEPOINT x;
+INSERT INTO abc VALUES ('error');
+ERROR: invalid input syntax for type integer: "error"
+LINE 1: INSERT INTO abc VALUES ('error');
+ ^
+COMMIT AND CHAIN; -- TBLOCK_ABORT_PENDING
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+INSERT INTO abc VALUES (5);
+COMMIT;
+-- different mix of options just for fun
+START TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ WRITE, NOT DEFERRABLE;
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ serializable
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ off
+(1 row)
+
+INSERT INTO abc VALUES (6);
+ROLLBACK AND CHAIN; -- TBLOCK_ABORT_PENDING
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ serializable
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ off
+(1 row)
+
+INSERT INTO abc VALUES ('error');
+ERROR: invalid input syntax for type integer: "error"
+LINE 1: INSERT INTO abc VALUES ('error');
+ ^
+ROLLBACK AND CHAIN; -- TBLOCK_ABORT_END
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ serializable
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ off
+(1 row)
+
+ROLLBACK;
+SELECT * FROM abc ORDER BY 1;
+ a
+---
+ 1
+ 2
+ 4
+ 5
+(4 rows)
+
+RESET default_transaction_read_only;
+DROP TABLE abc;
+-- Test assorted behaviors around the implicit transaction block created
+-- when multiple SQL commands are sent in a single Query message. These
+-- tests rely on the fact that psql will not break SQL commands apart at a
+-- backslash-quoted semicolon, but will send them as one Query.
+create temp table i_table (f1 int);
+-- psql will show only the last result in a multi-statement Query
+SELECT 1\; SELECT 2\; SELECT 3;
+ ?column?
+----------
+ 3
+(1 row)
+
+-- this implicitly commits:
+insert into i_table values(1)\; select * from i_table;
+ f1
+----
+ 1
+(1 row)
+
+-- 1/0 error will cause rolling back the whole implicit transaction
+insert into i_table values(2)\; select * from i_table\; select 1/0;
+ERROR: division by zero
+select * from i_table;
+ f1
+----
+ 1
+(1 row)
+
+rollback; -- we are not in a transaction at this point
+WARNING: there is no transaction in progress
+-- can use regular begin/commit/rollback within a single Query
+begin\; insert into i_table values(3)\; commit;
+rollback; -- we are not in a transaction at this point
+WARNING: there is no transaction in progress
+begin\; insert into i_table values(4)\; rollback;
+rollback; -- we are not in a transaction at this point
+WARNING: there is no transaction in progress
+-- begin converts implicit transaction into a regular one that
+-- can extend past the end of the Query
+select 1\; begin\; insert into i_table values(5);
+commit;
+select 1\; begin\; insert into i_table values(6);
+rollback;
+-- commit in implicit-transaction state commits but issues a warning.
+insert into i_table values(7)\; commit\; insert into i_table values(8)\; select 1/0;
+WARNING: there is no transaction in progress
+ERROR: division by zero
+-- similarly, rollback aborts but issues a warning.
+insert into i_table values(9)\; rollback\; select 2;
+WARNING: there is no transaction in progress
+ ?column?
+----------
+ 2
+(1 row)
+
+select * from i_table;
+ f1
+----
+ 1
+ 3
+ 5
+ 7
+(4 rows)
+
+rollback; -- we are not in a transaction at this point
+WARNING: there is no transaction in progress
+-- implicit transaction block is still a transaction block, for e.g. VACUUM
+SELECT 1\; VACUUM;
+ERROR: VACUUM cannot run inside a transaction block
+SELECT 1\; COMMIT\; VACUUM;
+WARNING: there is no transaction in progress
+ERROR: VACUUM cannot run inside a transaction block
+-- we disallow savepoint-related commands in implicit-transaction state
+SELECT 1\; SAVEPOINT sp;
+ERROR: SAVEPOINT can only be used in transaction blocks
+SELECT 1\; COMMIT\; SAVEPOINT sp;
+WARNING: there is no transaction in progress
+ERROR: SAVEPOINT can only be used in transaction blocks
+ROLLBACK TO SAVEPOINT sp\; SELECT 2;
+ERROR: ROLLBACK TO SAVEPOINT can only be used in transaction blocks
+SELECT 2\; RELEASE SAVEPOINT sp\; SELECT 3;
+ERROR: RELEASE SAVEPOINT can only be used in transaction blocks
+-- but this is OK, because the BEGIN converts it to a regular xact
+SELECT 1\; BEGIN\; SAVEPOINT sp\; ROLLBACK TO SAVEPOINT sp\; COMMIT;
+-- Test for successful cleanup of an aborted transaction at session exit.
+-- THIS MUST BE THE LAST TEST IN THIS FILE.
+begin;
+select 1/0;
+ERROR: division by zero
+rollback to X;
+ERROR: savepoint "x" does not exist
+-- DO NOT ADD ANYTHING HERE.
diff --git a/src/test/regress/expected/triggers_1.out b/src/test/regress/expected/triggers_1.out
new file mode 100644
index 00000000000..5a8f055aed9
--- /dev/null
+++ b/src/test/regress/expected/triggers_1.out
@@ -0,0 +1,2850 @@
+--
+-- TRIGGERS
+--
+create table pkeys (pkey1 int4 not null, pkey2 text not null);
+create table fkeys (fkey1 int4, fkey2 text, fkey3 int);
+create table fkeys2 (fkey21 int4, fkey22 text, pkey23 int not null);
+create index fkeys_i on fkeys (fkey1, fkey2);
+create index fkeys2_i on fkeys2 (fkey21, fkey22);
+create index fkeys2p_i on fkeys2 (pkey23);
+insert into pkeys values (10, '1');
+insert into pkeys values (20, '2');
+insert into pkeys values (30, '3');
+insert into pkeys values (40, '4');
+insert into pkeys values (50, '5');
+insert into pkeys values (60, '6');
+create unique index pkeys_i on pkeys (pkey1, pkey2);
+--
+-- For fkeys:
+-- (fkey1, fkey2) --> pkeys (pkey1, pkey2)
+-- (fkey3) --> fkeys2 (pkey23)
+--
+create trigger check_fkeys_pkey_exist
+ before insert or update on fkeys
+ for each row
+ execute function
+ check_primary_key ('fkey1', 'fkey2', 'pkeys', 'pkey1', 'pkey2');
+create trigger check_fkeys_pkey2_exist
+ before insert or update on fkeys
+ for each row
+ execute function check_primary_key ('fkey3', 'fkeys2', 'pkey23');
+--
+-- For fkeys2:
+-- (fkey21, fkey22) --> pkeys (pkey1, pkey2)
+--
+create trigger check_fkeys2_pkey_exist
+ before insert or update on fkeys2
+ for each row
+ execute procedure
+ check_primary_key ('fkey21', 'fkey22', 'pkeys', 'pkey1', 'pkey2');
+-- Test comments
+COMMENT ON TRIGGER check_fkeys2_pkey_bad ON fkeys2 IS 'wrong';
+ERROR: trigger "check_fkeys2_pkey_bad" for table "fkeys2" does not exist
+COMMENT ON TRIGGER check_fkeys2_pkey_exist ON fkeys2 IS 'right';
+COMMENT ON TRIGGER check_fkeys2_pkey_exist ON fkeys2 IS NULL;
+--
+-- For pkeys:
+-- ON DELETE/UPDATE (pkey1, pkey2) CASCADE:
+-- fkeys (fkey1, fkey2) and fkeys2 (fkey21, fkey22)
+--
+create trigger check_pkeys_fkey_cascade
+ before delete or update on pkeys
+ for each row
+ execute procedure
+ check_foreign_key (2, 'cascade', 'pkey1', 'pkey2',
+ 'fkeys', 'fkey1', 'fkey2', 'fkeys2', 'fkey21', 'fkey22');
+--
+-- For fkeys2:
+-- ON DELETE/UPDATE (pkey23) RESTRICT:
+-- fkeys (fkey3)
+--
+create trigger check_fkeys2_fkey_restrict
+ before delete or update on fkeys2
+ for each row
+ execute procedure check_foreign_key (1, 'restrict', 'pkey23', 'fkeys', 'fkey3');
+insert into fkeys2 values (10, '1', 1);
+insert into fkeys2 values (30, '3', 2);
+insert into fkeys2 values (40, '4', 5);
+insert into fkeys2 values (50, '5', 3);
+-- no key in pkeys
+insert into fkeys2 values (70, '5', 3);
+ERROR: tuple references non-existent key
+DETAIL: Trigger "check_fkeys2_pkey_exist" found tuple referencing non-existent key in "pkeys".
+insert into fkeys values (10, '1', 2);
+insert into fkeys values (30, '3', 3);
+insert into fkeys values (40, '4', 2);
+insert into fkeys values (50, '5', 2);
+-- no key in pkeys
+insert into fkeys values (70, '5', 1);
+ERROR: tuple references non-existent key
+DETAIL: Trigger "check_fkeys_pkey_exist" found tuple referencing non-existent key in "pkeys".
+-- no key in fkeys2
+insert into fkeys values (60, '6', 4);
+ERROR: tuple references non-existent key
+DETAIL: Trigger "check_fkeys_pkey2_exist" found tuple referencing non-existent key in "fkeys2".
+delete from pkeys where pkey1 = 30 and pkey2 = '3';
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
+ERROR: "check_fkeys2_fkey_restrict": tuple is referenced in "fkeys"
+CONTEXT: SQL statement "delete from fkeys2 where fkey21 = $1 and fkey22 = $2 "
+delete from pkeys where pkey1 = 40 and pkey2 = '4';
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys2 are deleted
+update pkeys set pkey1 = 7, pkey2 = '70' where pkey1 = 50 and pkey2 = '5';
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
+ERROR: "check_fkeys2_fkey_restrict": tuple is referenced in "fkeys"
+CONTEXT: SQL statement "delete from fkeys2 where fkey21 = $1 and fkey22 = $2 "
+update pkeys set pkey1 = 7, pkey2 = '70' where pkey1 = 10 and pkey2 = '1';
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys2 are deleted
+SELECT trigger_name, event_manipulation, event_object_schema, event_object_table,
+ action_order, action_condition, action_orientation, action_timing,
+ action_reference_old_table, action_reference_new_table
+ FROM information_schema.triggers
+ WHERE event_object_table in ('pkeys', 'fkeys', 'fkeys2')
+ ORDER BY trigger_name COLLATE "C", 2;
+ trigger_name | event_manipulation | event_object_schema | event_object_table | action_order | action_condition | action_orientation | action_timing | action_reference_old_table | action_reference_new_table
+----------------------------+--------------------+---------------------+--------------------+--------------+------------------+--------------------+---------------+----------------------------+----------------------------
+ check_fkeys2_fkey_restrict | DELETE | public | fkeys2 | 1 | | ROW | BEFORE | |
+ check_fkeys2_fkey_restrict | UPDATE | public | fkeys2 | 1 | | ROW | BEFORE | |
+ check_fkeys2_pkey_exist | INSERT | public | fkeys2 | 1 | | ROW | BEFORE | |
+ check_fkeys2_pkey_exist | UPDATE | public | fkeys2 | 2 | | ROW | BEFORE | |
+ check_fkeys_pkey2_exist | INSERT | public | fkeys | 1 | | ROW | BEFORE | |
+ check_fkeys_pkey2_exist | UPDATE | public | fkeys | 1 | | ROW | BEFORE | |
+ check_fkeys_pkey_exist | INSERT | public | fkeys | 2 | | ROW | BEFORE | |
+ check_fkeys_pkey_exist | UPDATE | public | fkeys | 2 | | ROW | BEFORE | |
+ check_pkeys_fkey_cascade | DELETE | public | pkeys | 1 | | ROW | BEFORE | |
+ check_pkeys_fkey_cascade | UPDATE | public | pkeys | 1 | | ROW | BEFORE | |
+(10 rows)
+
+DROP TABLE pkeys;
+DROP TABLE fkeys;
+DROP TABLE fkeys2;
+-- Check behavior when trigger returns unmodified trigtuple
+create table trigtest (f1 int, f2 text);
+create trigger trigger_return_old
+ before insert or delete or update on trigtest
+ for each row execute procedure trigger_return_old();
+insert into trigtest values(1, 'foo');
+select * from trigtest;
+ f1 | f2
+----+-----
+ 1 | foo
+(1 row)
+
+update trigtest set f2 = f2 || 'bar';
+select * from trigtest;
+ f1 | f2
+----+-----
+ 1 | foo
+(1 row)
+
+delete from trigtest;
+select * from trigtest;
+ f1 | f2
+----+----
+(0 rows)
+
+drop table trigtest;
+create sequence ttdummy_seq increment 10 start 0 minvalue 0;
+create table tttest (
+ price_id int4,
+ price_val int4,
+ price_on int4,
+ price_off int4 default 999999
+);
+create trigger ttdummy
+ before delete or update on tttest
+ for each row
+ execute procedure
+ ttdummy (price_on, price_off);
+create trigger ttserial
+ before insert or update on tttest
+ for each row
+ execute procedure
+ autoinc (price_on, ttdummy_seq);
+insert into tttest values (1, 1, null);
+insert into tttest values (2, 2, null);
+insert into tttest values (3, 3, 0);
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 999999
+ 3 | 3 | 30 | 999999
+(3 rows)
+
+delete from tttest where price_id = 2;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 3 | 3 | 30 | 999999
+ 2 | 2 | 20 | 40
+(3 rows)
+
+-- what do we see ?
+-- get current prices
+select * from tttest where price_off = 999999;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 3 | 3 | 30 | 999999
+(2 rows)
+
+-- change price for price_id == 3
+update tttest set price_val = 30 where price_id = 3;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 40
+ 3 | 30 | 50 | 999999
+ 3 | 3 | 30 | 50
+(4 rows)
+
+-- now we want to change price_id in ALL tuples
+-- this does not get us what we need
+update tttest set price_id = 5 where price_id = 3;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 40
+ 3 | 3 | 30 | 50
+ 5 | 30 | 60 | 999999
+ 3 | 30 | 50 | 60
+(5 rows)
+
+-- restore data as before last update:
+select set_ttdummy(0);
+ set_ttdummy
+-------------
+ 1
+(1 row)
+
+delete from tttest where price_id = 5;
+update tttest set price_off = 999999 where price_val = 30;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 40
+ 3 | 3 | 30 | 50
+ 3 | 30 | 50 | 999999
+(4 rows)
+
+-- and try change price_id now!
+update tttest set price_id = 5 where price_id = 3;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 40
+ 5 | 3 | 30 | 50
+ 5 | 30 | 50 | 999999
+(4 rows)
+
+-- isn't it what we need ?
+select set_ttdummy(1);
+ set_ttdummy
+-------------
+ 0
+(1 row)
+
+-- we want to correct some "date"
+update tttest set price_on = -1 where price_id = 1;
+ERROR: ttdummy (tttest): you cannot change price_on and/or price_off columns (use set_ttdummy)
+-- but this doesn't work
+-- try in this way
+select set_ttdummy(0);
+ set_ttdummy
+-------------
+ 1
+(1 row)
+
+update tttest set price_on = -1 where price_id = 1;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 2 | 2 | 20 | 40
+ 5 | 3 | 30 | 50
+ 5 | 30 | 50 | 999999
+ 1 | 1 | -1 | 999999
+(4 rows)
+
+-- isn't it what we need ?
+-- get price for price_id == 5 as it was @ "date" 35
+select * from tttest where price_on <= 35 and price_off > 35 and price_id = 5;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 5 | 3 | 30 | 50
+(1 row)
+
+drop table tttest;
+drop sequence ttdummy_seq;
+--
+-- tests for per-statement triggers
+--
+CREATE TABLE log_table (tstamp timestamp default timeofday()::timestamp);
+CREATE TABLE main_table (a int unique, b int);
+COPY main_table (a,b) FROM stdin;
+CREATE FUNCTION trigger_func() RETURNS trigger LANGUAGE plpgsql AS '
+BEGIN
+ RAISE NOTICE ''trigger_func(%) called: action = %, when = %, level = %'', TG_ARGV[0], TG_OP, TG_WHEN, TG_LEVEL;
+ RETURN NULL;
+END;';
+CREATE TRIGGER before_ins_stmt_trig BEFORE INSERT ON main_table
+FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func('before_ins_stmt');
+CREATE TRIGGER after_ins_stmt_trig AFTER INSERT ON main_table
+FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func('after_ins_stmt');
+--
+-- if neither 'FOR EACH ROW' nor 'FOR EACH STATEMENT' was specified,
+-- CREATE TRIGGER should default to 'FOR EACH STATEMENT'
+--
+CREATE TRIGGER after_upd_stmt_trig AFTER UPDATE ON main_table
+EXECUTE PROCEDURE trigger_func('after_upd_stmt');
+-- Both insert and update statement level triggers (before and after) should
+-- fire. Doesn't fire UPDATE before trigger, but only because one isn't
+-- defined.
+INSERT INTO main_table (a, b) VALUES (5, 10) ON CONFLICT (a)
+ DO UPDATE SET b = EXCLUDED.b;
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+CREATE TRIGGER after_upd_row_trig AFTER UPDATE ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('after_upd_row');
+INSERT INTO main_table DEFAULT VALUES;
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+UPDATE main_table SET a = a + 1 WHERE b < 30;
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+-- UPDATE that affects zero rows should still call per-statement trigger
+UPDATE main_table SET a = a + 2 WHERE b > 100;
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+-- constraint now unneeded
+ALTER TABLE main_table DROP CONSTRAINT main_table_a_key;
+-- COPY should fire per-row and per-statement INSERT triggers
+COPY main_table (a, b) FROM stdin;
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+SELECT * FROM main_table ORDER BY a, b;
+ a | b
+----+----
+ 6 | 10
+ 21 | 20
+ 30 | 40
+ 31 | 10
+ 50 | 35
+ 50 | 60
+ 81 | 15
+ |
+(8 rows)
+
+--
+-- test triggers with WHEN clause
+--
+CREATE TRIGGER modified_a BEFORE UPDATE OF a ON main_table
+FOR EACH ROW WHEN (OLD.a <> NEW.a) EXECUTE PROCEDURE trigger_func('modified_a');
+CREATE TRIGGER modified_any BEFORE UPDATE OF a ON main_table
+FOR EACH ROW WHEN (OLD.* IS DISTINCT FROM NEW.*) EXECUTE PROCEDURE trigger_func('modified_any');
+CREATE TRIGGER insert_a AFTER INSERT ON main_table
+FOR EACH ROW WHEN (NEW.a = 123) EXECUTE PROCEDURE trigger_func('insert_a');
+CREATE TRIGGER delete_a AFTER DELETE ON main_table
+FOR EACH ROW WHEN (OLD.a = 123) EXECUTE PROCEDURE trigger_func('delete_a');
+CREATE TRIGGER insert_when BEFORE INSERT ON main_table
+FOR EACH STATEMENT WHEN (true) EXECUTE PROCEDURE trigger_func('insert_when');
+CREATE TRIGGER delete_when AFTER DELETE ON main_table
+FOR EACH STATEMENT WHEN (true) EXECUTE PROCEDURE trigger_func('delete_when');
+SELECT trigger_name, event_manipulation, event_object_schema, event_object_table,
+ action_order, action_condition, action_orientation, action_timing,
+ action_reference_old_table, action_reference_new_table
+ FROM information_schema.triggers
+ WHERE event_object_table IN ('main_table')
+ ORDER BY trigger_name COLLATE "C", 2;
+ trigger_name | event_manipulation | event_object_schema | event_object_table | action_order | action_condition | action_orientation | action_timing | action_reference_old_table | action_reference_new_table
+----------------------+--------------------+---------------------+--------------------+--------------+--------------------------------+--------------------+---------------+----------------------------+----------------------------
+ after_ins_stmt_trig | INSERT | public | main_table | 1 | | STATEMENT | AFTER | |
+ after_upd_row_trig | UPDATE | public | main_table | 1 | | ROW | AFTER | |
+ after_upd_stmt_trig | UPDATE | public | main_table | 1 | | STATEMENT | AFTER | |
+ before_ins_stmt_trig | INSERT | public | main_table | 1 | | STATEMENT | BEFORE | |
+ delete_a | DELETE | public | main_table | 1 | (old.a = 123) | ROW | AFTER | |
+ delete_when | DELETE | public | main_table | 1 | true | STATEMENT | AFTER | |
+ insert_a | INSERT | public | main_table | 1 | (new.a = 123) | ROW | AFTER | |
+ insert_when | INSERT | public | main_table | 2 | true | STATEMENT | BEFORE | |
+ modified_a | UPDATE | public | main_table | 1 | (old.a <> new.a) | ROW | BEFORE | |
+ modified_any | UPDATE | public | main_table | 2 | (old.* IS DISTINCT FROM new.*) | ROW | BEFORE | |
+(10 rows)
+
+INSERT INTO main_table (a) VALUES (123), (456);
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(insert_when) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(insert_a) called: action = INSERT, when = AFTER, level = ROW
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+COPY main_table FROM stdin;
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(insert_when) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(insert_a) called: action = INSERT, when = AFTER, level = ROW
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+DELETE FROM main_table WHERE a IN (123, 456);
+NOTICE: trigger_func(delete_a) called: action = DELETE, when = AFTER, level = ROW
+NOTICE: trigger_func(delete_a) called: action = DELETE, when = AFTER, level = ROW
+NOTICE: trigger_func(delete_when) called: action = DELETE, when = AFTER, level = STATEMENT
+UPDATE main_table SET a = 50, b = 60;
+NOTICE: trigger_func(modified_any) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_any) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+SELECT * FROM main_table ORDER BY a, b;
+ a | b
+----+----
+ 6 | 10
+ 21 | 20
+ 30 | 40
+ 31 | 10
+ 50 | 35
+ 50 | 60
+ 81 | 15
+ |
+(8 rows)
+
+SELECT pg_get_triggerdef(oid, true) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_a';
+ pg_get_triggerdef
+-------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE TRIGGER modified_a BEFORE UPDATE OF a ON main_table FOR EACH ROW WHEN (old.a <> new.a) EXECUTE FUNCTION trigger_func('modified_a')
+(1 row)
+
+SELECT pg_get_triggerdef(oid, false) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_a';
+ pg_get_triggerdef
+----------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE TRIGGER modified_a BEFORE UPDATE OF a ON public.main_table FOR EACH ROW WHEN ((old.a <> new.a)) EXECUTE FUNCTION trigger_func('modified_a')
+(1 row)
+
+SELECT pg_get_triggerdef(oid, true) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_any';
+ pg_get_triggerdef
+-------------------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE TRIGGER modified_any BEFORE UPDATE OF a ON main_table FOR EACH ROW WHEN (old.* IS DISTINCT FROM new.*) EXECUTE FUNCTION trigger_func('modified_any')
+(1 row)
+
+-- Test RENAME TRIGGER
+ALTER TRIGGER modified_a ON main_table RENAME TO modified_modified_a;
+SELECT count(*) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_a';
+ count
+-------
+ 0
+(1 row)
+
+SELECT count(*) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_modified_a';
+ count
+-------
+ 1
+(1 row)
+
+DROP TRIGGER modified_modified_a ON main_table;
+DROP TRIGGER modified_any ON main_table;
+DROP TRIGGER insert_a ON main_table;
+DROP TRIGGER delete_a ON main_table;
+DROP TRIGGER insert_when ON main_table;
+DROP TRIGGER delete_when ON main_table;
+-- Test WHEN condition accessing system columns.
+create table table_with_oids(a int);
+insert into table_with_oids values (1);
+create trigger oid_unchanged_trig after update on table_with_oids
+ for each row
+ when (new.tableoid = old.tableoid AND new.tableoid <> 0)
+ execute procedure trigger_func('after_upd_oid_unchanged');
+update table_with_oids set a = a + 1;
+NOTICE: trigger_func(after_upd_oid_unchanged) called: action = UPDATE, when = AFTER, level = ROW
+drop table table_with_oids;
+-- Test column-level triggers
+DROP TRIGGER after_upd_row_trig ON main_table;
+CREATE TRIGGER before_upd_a_row_trig BEFORE UPDATE OF a ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_upd_a_row');
+CREATE TRIGGER after_upd_b_row_trig AFTER UPDATE OF b ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('after_upd_b_row');
+CREATE TRIGGER after_upd_a_b_row_trig AFTER UPDATE OF a, b ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('after_upd_a_b_row');
+CREATE TRIGGER before_upd_a_stmt_trig BEFORE UPDATE OF a ON main_table
+FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func('before_upd_a_stmt');
+CREATE TRIGGER after_upd_b_stmt_trig AFTER UPDATE OF b ON main_table
+FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func('after_upd_b_stmt');
+SELECT pg_get_triggerdef(oid) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'after_upd_a_b_row_trig';
+ pg_get_triggerdef
+-------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE TRIGGER after_upd_a_b_row_trig AFTER UPDATE OF a, b ON public.main_table FOR EACH ROW EXECUTE FUNCTION trigger_func('after_upd_a_b_row')
+(1 row)
+
+UPDATE main_table SET a = 50;
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+UPDATE main_table SET b = 10;
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+--
+-- Test case for bug with BEFORE trigger followed by AFTER trigger with WHEN
+--
+CREATE TABLE some_t (some_col boolean NOT NULL);
+CREATE FUNCTION dummy_update_func() RETURNS trigger AS $$
+BEGIN
+ RAISE NOTICE 'dummy_update_func(%) called: action = %, old = %, new = %',
+ TG_ARGV[0], TG_OP, OLD, NEW;
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER some_trig_before BEFORE UPDATE ON some_t FOR EACH ROW
+ EXECUTE PROCEDURE dummy_update_func('before');
+CREATE TRIGGER some_trig_aftera AFTER UPDATE ON some_t FOR EACH ROW
+ WHEN (NOT OLD.some_col AND NEW.some_col)
+ EXECUTE PROCEDURE dummy_update_func('aftera');
+CREATE TRIGGER some_trig_afterb AFTER UPDATE ON some_t FOR EACH ROW
+ WHEN (NOT NEW.some_col)
+ EXECUTE PROCEDURE dummy_update_func('afterb');
+INSERT INTO some_t VALUES (TRUE);
+UPDATE some_t SET some_col = TRUE;
+NOTICE: dummy_update_func(before) called: action = UPDATE, old = (t), new = (t)
+UPDATE some_t SET some_col = FALSE;
+NOTICE: dummy_update_func(before) called: action = UPDATE, old = (t), new = (f)
+NOTICE: dummy_update_func(afterb) called: action = UPDATE, old = (t), new = (f)
+UPDATE some_t SET some_col = TRUE;
+NOTICE: dummy_update_func(before) called: action = UPDATE, old = (f), new = (t)
+NOTICE: dummy_update_func(aftera) called: action = UPDATE, old = (f), new = (t)
+DROP TABLE some_t;
+-- bogus cases
+CREATE TRIGGER error_upd_and_col BEFORE UPDATE OR UPDATE OF a ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('error_upd_and_col');
+ERROR: duplicate trigger events specified at or near "ON"
+LINE 1: ...ER error_upd_and_col BEFORE UPDATE OR UPDATE OF a ON main_ta...
+ ^
+CREATE TRIGGER error_upd_a_a BEFORE UPDATE OF a, a ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('error_upd_a_a');
+ERROR: column "a" specified more than once
+CREATE TRIGGER error_ins_a BEFORE INSERT OF a ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('error_ins_a');
+ERROR: syntax error at or near "OF"
+LINE 1: CREATE TRIGGER error_ins_a BEFORE INSERT OF a ON main_table
+ ^
+CREATE TRIGGER error_ins_when BEFORE INSERT OR UPDATE ON main_table
+FOR EACH ROW WHEN (OLD.a <> NEW.a)
+EXECUTE PROCEDURE trigger_func('error_ins_old');
+ERROR: INSERT trigger's WHEN condition cannot reference OLD values
+LINE 2: FOR EACH ROW WHEN (OLD.a <> NEW.a)
+ ^
+CREATE TRIGGER error_del_when BEFORE DELETE OR UPDATE ON main_table
+FOR EACH ROW WHEN (OLD.a <> NEW.a)
+EXECUTE PROCEDURE trigger_func('error_del_new');
+ERROR: DELETE trigger's WHEN condition cannot reference NEW values
+LINE 2: FOR EACH ROW WHEN (OLD.a <> NEW.a)
+ ^
+CREATE TRIGGER error_del_when BEFORE INSERT OR UPDATE ON main_table
+FOR EACH ROW WHEN (NEW.tableoid <> 0)
+EXECUTE PROCEDURE trigger_func('error_when_sys_column');
+ERROR: BEFORE trigger's WHEN condition cannot reference NEW system columns
+LINE 2: FOR EACH ROW WHEN (NEW.tableoid <> 0)
+ ^
+CREATE TRIGGER error_stmt_when BEFORE UPDATE OF a ON main_table
+FOR EACH STATEMENT WHEN (OLD.* IS DISTINCT FROM NEW.*)
+EXECUTE PROCEDURE trigger_func('error_stmt_when');
+ERROR: statement trigger's WHEN condition cannot reference column values
+LINE 2: FOR EACH STATEMENT WHEN (OLD.* IS DISTINCT FROM NEW.*)
+ ^
+-- check dependency restrictions
+ALTER TABLE main_table DROP COLUMN b;
+ERROR: cannot drop column b of table main_table because other objects depend on it
+DETAIL: trigger after_upd_b_row_trig on table main_table depends on column b of table main_table
+trigger after_upd_a_b_row_trig on table main_table depends on column b of table main_table
+trigger after_upd_b_stmt_trig on table main_table depends on column b of table main_table
+HINT: Use DROP ... CASCADE to drop the dependent objects too.
+-- this should succeed, but we'll roll it back to keep the triggers around
+begin;
+DROP TRIGGER after_upd_a_b_row_trig ON main_table;
+DROP TRIGGER after_upd_b_row_trig ON main_table;
+DROP TRIGGER after_upd_b_stmt_trig ON main_table;
+ALTER TABLE main_table DROP COLUMN b;
+rollback;
+-- Test enable/disable triggers
+create table trigtest (i serial primary key);
+-- test that disabling RI triggers works
+create table trigtest2 (i int references trigtest(i) on delete cascade);
+create function trigtest() returns trigger as $$
+begin
+ raise notice '% % % %', TG_RELNAME, TG_OP, TG_WHEN, TG_LEVEL;
+ return new;
+end;$$ language plpgsql;
+create trigger trigtest_b_row_tg before insert or update or delete on trigtest
+for each row execute procedure trigtest();
+create trigger trigtest_a_row_tg after insert or update or delete on trigtest
+for each row execute procedure trigtest();
+create trigger trigtest_b_stmt_tg before insert or update or delete on trigtest
+for each statement execute procedure trigtest();
+create trigger trigtest_a_stmt_tg after insert or update or delete on trigtest
+for each statement execute procedure trigtest();
+insert into trigtest default values;
+NOTICE: trigtest INSERT BEFORE STATEMENT
+NOTICE: trigtest INSERT BEFORE ROW
+NOTICE: trigtest INSERT AFTER ROW
+NOTICE: trigtest INSERT AFTER STATEMENT
+alter table trigtest disable trigger trigtest_b_row_tg;
+insert into trigtest default values;
+NOTICE: trigtest INSERT BEFORE STATEMENT
+NOTICE: trigtest INSERT AFTER ROW
+NOTICE: trigtest INSERT AFTER STATEMENT
+alter table trigtest disable trigger user;
+insert into trigtest default values;
+alter table trigtest enable trigger trigtest_a_stmt_tg;
+insert into trigtest default values;
+NOTICE: trigtest INSERT AFTER STATEMENT
+set session_replication_role = replica;
+insert into trigtest default values; -- does not trigger
+alter table trigtest enable always trigger trigtest_a_stmt_tg;
+insert into trigtest default values; -- now it does
+NOTICE: trigtest INSERT AFTER STATEMENT
+reset session_replication_role;
+insert into trigtest2 values(1);
+insert into trigtest2 values(2);
+delete from trigtest where i=2;
+NOTICE: trigtest DELETE AFTER STATEMENT
+select * from trigtest2;
+ i
+---
+ 1
+(1 row)
+
+alter table trigtest disable trigger all;
+delete from trigtest where i=1;
+select * from trigtest2;
+ i
+---
+ 1
+(1 row)
+
+-- ensure we still insert, even when all triggers are disabled
+insert into trigtest default values;
+select * from trigtest;
+ i
+---
+ 3
+ 4
+ 5
+ 6
+ 7
+(5 rows)
+
+drop table trigtest2;
+drop table trigtest;
+-- dump trigger data
+CREATE TABLE trigger_test (
+ i int,
+ v varchar
+);
+CREATE OR REPLACE FUNCTION trigger_data() RETURNS trigger
+LANGUAGE plpgsql AS $$
+
+declare
+
+ argstr text;
+ relid text;
+
+begin
+
+ relid := TG_relid::regclass;
+
+ -- plpgsql can't discover its trigger data in a hash like perl and python
+ -- can, or by a sort of reflection like tcl can,
+ -- so we have to hard code the names.
+ raise NOTICE 'TG_NAME: %', TG_name;
+ raise NOTICE 'TG_WHEN: %', TG_when;
+ raise NOTICE 'TG_LEVEL: %', TG_level;
+ raise NOTICE 'TG_OP: %', TG_op;
+ raise NOTICE 'TG_RELID::regclass: %', relid;
+ raise NOTICE 'TG_RELNAME: %', TG_relname;
+ raise NOTICE 'TG_TABLE_NAME: %', TG_table_name;
+ raise NOTICE 'TG_TABLE_SCHEMA: %', TG_table_schema;
+ raise NOTICE 'TG_NARGS: %', TG_nargs;
+
+ argstr := '[';
+ for i in 0 .. TG_nargs - 1 loop
+ if i > 0 then
+ argstr := argstr || ', ';
+ end if;
+ argstr := argstr || TG_argv[i];
+ end loop;
+ argstr := argstr || ']';
+ raise NOTICE 'TG_ARGV: %', argstr;
+
+ if TG_OP != 'INSERT' then
+ raise NOTICE 'OLD: %', OLD;
+ end if;
+
+ if TG_OP != 'DELETE' then
+ raise NOTICE 'NEW: %', NEW;
+ end if;
+
+ if TG_OP = 'DELETE' then
+ return OLD;
+ else
+ return NEW;
+ end if;
+
+end;
+$$;
+CREATE TRIGGER show_trigger_data_trig
+BEFORE INSERT OR UPDATE OR DELETE ON trigger_test
+FOR EACH ROW EXECUTE PROCEDURE trigger_data(23,'skidoo');
+insert into trigger_test values(1,'insert');
+NOTICE: TG_NAME: show_trigger_data_trig
+NOTICE: TG_WHEN: BEFORE
+NOTICE: TG_LEVEL: ROW
+NOTICE: TG_OP: INSERT
+NOTICE: TG_RELID::regclass: trigger_test
+NOTICE: TG_RELNAME: trigger_test
+NOTICE: TG_TABLE_NAME: trigger_test
+NOTICE: TG_TABLE_SCHEMA: public
+NOTICE: TG_NARGS: 2
+NOTICE: TG_ARGV: [23, skidoo]
+NOTICE: NEW: (1,insert)
+update trigger_test set v = 'update' where i = 1;
+NOTICE: TG_NAME: show_trigger_data_trig
+NOTICE: TG_WHEN: BEFORE
+NOTICE: TG_LEVEL: ROW
+NOTICE: TG_OP: UPDATE
+NOTICE: TG_RELID::regclass: trigger_test
+NOTICE: TG_RELNAME: trigger_test
+NOTICE: TG_TABLE_NAME: trigger_test
+NOTICE: TG_TABLE_SCHEMA: public
+NOTICE: TG_NARGS: 2
+NOTICE: TG_ARGV: [23, skidoo]
+NOTICE: OLD: (1,insert)
+NOTICE: NEW: (1,update)
+delete from trigger_test;
+NOTICE: TG_NAME: show_trigger_data_trig
+NOTICE: TG_WHEN: BEFORE
+NOTICE: TG_LEVEL: ROW
+NOTICE: TG_OP: DELETE
+NOTICE: TG_RELID::regclass: trigger_test
+NOTICE: TG_RELNAME: trigger_test
+NOTICE: TG_TABLE_NAME: trigger_test
+NOTICE: TG_TABLE_SCHEMA: public
+NOTICE: TG_NARGS: 2
+NOTICE: TG_ARGV: [23, skidoo]
+NOTICE: OLD: (1,update)
+DROP TRIGGER show_trigger_data_trig on trigger_test;
+DROP FUNCTION trigger_data();
+DROP TABLE trigger_test;
+--
+-- Test use of row comparisons on OLD/NEW
+--
+CREATE TABLE trigger_test (f1 int, f2 text, f3 text);
+-- this is the obvious (and wrong...) way to compare rows
+CREATE FUNCTION mytrigger() RETURNS trigger LANGUAGE plpgsql as $$
+begin
+ if row(old.*) = row(new.*) then
+ raise notice 'row % not changed', new.f1;
+ else
+ raise notice 'row % changed', new.f1;
+ end if;
+ return new;
+end$$;
+CREATE TRIGGER t
+BEFORE UPDATE ON trigger_test
+FOR EACH ROW EXECUTE PROCEDURE mytrigger();
+INSERT INTO trigger_test VALUES(1, 'foo', 'bar');
+INSERT INTO trigger_test VALUES(2, 'baz', 'quux');
+UPDATE trigger_test SET f3 = 'bar';
+NOTICE: row 1 not changed
+NOTICE: row 2 changed
+UPDATE trigger_test SET f3 = NULL;
+NOTICE: row 1 changed
+NOTICE: row 2 changed
+-- this demonstrates that the above isn't really working as desired:
+UPDATE trigger_test SET f3 = NULL;
+NOTICE: row 1 changed
+NOTICE: row 2 changed
+-- the right way when considering nulls is
+CREATE OR REPLACE FUNCTION mytrigger() RETURNS trigger LANGUAGE plpgsql as $$
+begin
+ if row(old.*) is distinct from row(new.*) then
+ raise notice 'row % changed', new.f1;
+ else
+ raise notice 'row % not changed', new.f1;
+ end if;
+ return new;
+end$$;
+UPDATE trigger_test SET f3 = 'bar';
+NOTICE: row 1 changed
+NOTICE: row 2 changed
+UPDATE trigger_test SET f3 = NULL;
+NOTICE: row 1 changed
+NOTICE: row 2 changed
+UPDATE trigger_test SET f3 = NULL;
+NOTICE: row 1 not changed
+NOTICE: row 2 not changed
+DROP TABLE trigger_test;
+DROP FUNCTION mytrigger();
+-- Test snapshot management in serializable transactions involving triggers
+-- per bug report in 6bc73d4c0910042358k3d1adff3qa36f8df75198ecea@mail.gmail.com
+CREATE FUNCTION serializable_update_trig() RETURNS trigger LANGUAGE plpgsql AS
+$$
+declare
+ rec record;
+begin
+ new.description = 'updated in trigger';
+ return new;
+end;
+$$;
+CREATE TABLE serializable_update_tab (
+ id int,
+ filler text,
+ description text
+);
+CREATE TRIGGER serializable_update_trig BEFORE UPDATE ON serializable_update_tab
+ FOR EACH ROW EXECUTE PROCEDURE serializable_update_trig();
+INSERT INTO serializable_update_tab SELECT a, repeat('xyzxz', 100), 'new'
+ FROM generate_series(1, 50) a;
+BEGIN;
+SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+UPDATE serializable_update_tab SET description = 'no no', id = 1 WHERE id = 1;
+COMMIT;
+SELECT description FROM serializable_update_tab WHERE id = 1;
+ description
+--------------------
+ updated in trigger
+(1 row)
+
+DROP TABLE serializable_update_tab;
+-- minimal update trigger
+CREATE TABLE min_updates_test (
+ f1 text,
+ f2 int,
+ f3 int);
+INSERT INTO min_updates_test VALUES ('a',1,2),('b','2',null);
+CREATE TRIGGER z_min_update
+BEFORE UPDATE ON min_updates_test
+FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();
+\set QUIET false
+UPDATE min_updates_test SET f1 = f1;
+UPDATE 0
+UPDATE min_updates_test SET f2 = f2 + 1;
+UPDATE 2
+UPDATE min_updates_test SET f3 = 2 WHERE f3 is null;
+UPDATE 1
+\set QUIET true
+SELECT * FROM min_updates_test;
+ f1 | f2 | f3
+----+----+----
+ a | 2 | 2
+ b | 3 | 2
+(2 rows)
+
+DROP TABLE min_updates_test;
+--
+-- Test triggers on views
+--
+CREATE VIEW main_view AS SELECT a, b FROM main_table;
+-- VIEW trigger function
+CREATE OR REPLACE FUNCTION view_trigger() RETURNS trigger
+LANGUAGE plpgsql AS $$
+declare
+ argstr text := '';
+begin
+ for i in 0 .. TG_nargs - 1 loop
+ if i > 0 then
+ argstr := argstr || ', ';
+ end if;
+ argstr := argstr || TG_argv[i];
+ end loop;
+
+ raise notice '% % % % (%)', TG_RELNAME, TG_WHEN, TG_OP, TG_LEVEL, argstr;
+
+ if TG_LEVEL = 'ROW' then
+ if TG_OP = 'INSERT' then
+ raise NOTICE 'NEW: %', NEW;
+ INSERT INTO main_table VALUES (NEW.a, NEW.b);
+ RETURN NEW;
+ end if;
+
+ if TG_OP = 'UPDATE' then
+ raise NOTICE 'OLD: %, NEW: %', OLD, NEW;
+ UPDATE main_table SET a = NEW.a, b = NEW.b WHERE a = OLD.a AND b = OLD.b;
+ if NOT FOUND then RETURN NULL; end if;
+ RETURN NEW;
+ end if;
+
+ if TG_OP = 'DELETE' then
+ raise NOTICE 'OLD: %', OLD;
+ DELETE FROM main_table WHERE a = OLD.a AND b = OLD.b;
+ if NOT FOUND then RETURN NULL; end if;
+ RETURN OLD;
+ end if;
+ end if;
+
+ RETURN NULL;
+end;
+$$;
+-- Before row triggers aren't allowed on views
+CREATE TRIGGER invalid_trig BEFORE INSERT ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_ins_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+CREATE TRIGGER invalid_trig BEFORE UPDATE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_upd_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+CREATE TRIGGER invalid_trig BEFORE DELETE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_del_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+-- After row triggers aren't allowed on views
+CREATE TRIGGER invalid_trig AFTER INSERT ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_ins_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+CREATE TRIGGER invalid_trig AFTER UPDATE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_upd_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+CREATE TRIGGER invalid_trig AFTER DELETE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_del_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+-- Truncate triggers aren't allowed on views
+CREATE TRIGGER invalid_trig BEFORE TRUNCATE ON main_view
+EXECUTE PROCEDURE trigger_func('before_tru_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have TRUNCATE triggers.
+CREATE TRIGGER invalid_trig AFTER TRUNCATE ON main_view
+EXECUTE PROCEDURE trigger_func('before_tru_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have TRUNCATE triggers.
+-- INSTEAD OF triggers aren't allowed on tables
+CREATE TRIGGER invalid_trig INSTEAD OF INSERT ON main_table
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_ins');
+ERROR: "main_table" is a table
+DETAIL: Tables cannot have INSTEAD OF triggers.
+CREATE TRIGGER invalid_trig INSTEAD OF UPDATE ON main_table
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_upd');
+ERROR: "main_table" is a table
+DETAIL: Tables cannot have INSTEAD OF triggers.
+CREATE TRIGGER invalid_trig INSTEAD OF DELETE ON main_table
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_del');
+ERROR: "main_table" is a table
+DETAIL: Tables cannot have INSTEAD OF triggers.
+-- Don't support WHEN clauses with INSTEAD OF triggers
+CREATE TRIGGER invalid_trig INSTEAD OF UPDATE ON main_view
+FOR EACH ROW WHEN (OLD.a <> NEW.a) EXECUTE PROCEDURE view_trigger('instead_of_upd');
+ERROR: INSTEAD OF triggers cannot have WHEN conditions
+-- Don't support column-level INSTEAD OF triggers
+CREATE TRIGGER invalid_trig INSTEAD OF UPDATE OF a ON main_view
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_upd');
+ERROR: INSTEAD OF triggers cannot have column lists
+-- Don't support statement-level INSTEAD OF triggers
+CREATE TRIGGER invalid_trig INSTEAD OF UPDATE ON main_view
+EXECUTE PROCEDURE view_trigger('instead_of_upd');
+ERROR: INSTEAD OF triggers must be FOR EACH ROW
+-- Valid INSTEAD OF triggers
+CREATE TRIGGER instead_of_insert_trig INSTEAD OF INSERT ON main_view
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_ins');
+CREATE TRIGGER instead_of_update_trig INSTEAD OF UPDATE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_upd');
+CREATE TRIGGER instead_of_delete_trig INSTEAD OF DELETE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_del');
+-- Valid BEFORE statement VIEW triggers
+CREATE TRIGGER before_ins_stmt_trig BEFORE INSERT ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('before_view_ins_stmt');
+CREATE TRIGGER before_upd_stmt_trig BEFORE UPDATE ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('before_view_upd_stmt');
+CREATE TRIGGER before_del_stmt_trig BEFORE DELETE ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('before_view_del_stmt');
+-- Valid AFTER statement VIEW triggers
+CREATE TRIGGER after_ins_stmt_trig AFTER INSERT ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('after_view_ins_stmt');
+CREATE TRIGGER after_upd_stmt_trig AFTER UPDATE ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('after_view_upd_stmt');
+CREATE TRIGGER after_del_stmt_trig AFTER DELETE ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('after_view_del_stmt');
+\set QUIET false
+-- Insert into view using trigger
+INSERT INTO main_view VALUES (20, 30);
+NOTICE: main_view BEFORE INSERT STATEMENT (before_view_ins_stmt)
+NOTICE: main_view INSTEAD OF INSERT ROW (instead_of_ins)
+NOTICE: NEW: (20,30)
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER INSERT STATEMENT (after_view_ins_stmt)
+INSERT 0 1
+INSERT INTO main_view VALUES (21, 31) RETURNING a, b;
+NOTICE: main_view BEFORE INSERT STATEMENT (before_view_ins_stmt)
+NOTICE: main_view INSTEAD OF INSERT ROW (instead_of_ins)
+NOTICE: NEW: (21,31)
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER INSERT STATEMENT (after_view_ins_stmt)
+ a | b
+----+----
+ 21 | 31
+(1 row)
+
+INSERT 0 1
+-- Table trigger will prevent updates
+UPDATE main_view SET b = 31 WHERE a = 20;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view INSTEAD OF UPDATE ROW (instead_of_upd)
+NOTICE: OLD: (20,30), NEW: (20,31)
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+UPDATE 0
+UPDATE main_view SET b = 32 WHERE a = 21 AND b = 31 RETURNING a, b;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view INSTEAD OF UPDATE ROW (instead_of_upd)
+NOTICE: OLD: (21,31), NEW: (21,32)
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+ a | b
+---+---
+(0 rows)
+
+UPDATE 0
+-- Remove table trigger to allow updates
+DROP TRIGGER before_upd_a_row_trig ON main_table;
+DROP TRIGGER
+UPDATE main_view SET b = 31 WHERE a = 20;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view INSTEAD OF UPDATE ROW (instead_of_upd)
+NOTICE: OLD: (20,30), NEW: (20,31)
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+UPDATE 1
+UPDATE main_view SET b = 32 WHERE a = 21 AND b = 31 RETURNING a, b;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view INSTEAD OF UPDATE ROW (instead_of_upd)
+NOTICE: OLD: (21,31), NEW: (21,32)
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+ a | b
+----+----
+ 21 | 32
+(1 row)
+
+UPDATE 1
+-- Before and after stmt triggers should fire even when no rows are affected
+UPDATE main_view SET b = 0 WHERE false;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+UPDATE 0
+-- Delete from view using trigger
+DELETE FROM main_view WHERE a IN (20,21);
+NOTICE: main_view BEFORE DELETE STATEMENT (before_view_del_stmt)
+NOTICE: main_view INSTEAD OF DELETE ROW (instead_of_del)
+NOTICE: OLD: (21,10)
+NOTICE: main_view INSTEAD OF DELETE ROW (instead_of_del)
+NOTICE: OLD: (20,31)
+NOTICE: main_view INSTEAD OF DELETE ROW (instead_of_del)
+NOTICE: OLD: (21,32)
+NOTICE: main_view AFTER DELETE STATEMENT (after_view_del_stmt)
+DELETE 3
+DELETE FROM main_view WHERE a = 31 RETURNING a, b;
+NOTICE: main_view BEFORE DELETE STATEMENT (before_view_del_stmt)
+NOTICE: main_view INSTEAD OF DELETE ROW (instead_of_del)
+NOTICE: OLD: (31,10)
+NOTICE: main_view AFTER DELETE STATEMENT (after_view_del_stmt)
+ a | b
+----+----
+ 31 | 10
+(1 row)
+
+DELETE 1
+\set QUIET true
+-- Describe view should list triggers
+\d main_view
+ View "public.main_view"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+Triggers:
+ after_del_stmt_trig AFTER DELETE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_del_stmt')
+ after_ins_stmt_trig AFTER INSERT ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_ins_stmt')
+ after_upd_stmt_trig AFTER UPDATE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_upd_stmt')
+ before_del_stmt_trig BEFORE DELETE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_del_stmt')
+ before_ins_stmt_trig BEFORE INSERT ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_ins_stmt')
+ before_upd_stmt_trig BEFORE UPDATE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_upd_stmt')
+ instead_of_delete_trig INSTEAD OF DELETE ON main_view FOR EACH ROW EXECUTE FUNCTION view_trigger('instead_of_del')
+ instead_of_insert_trig INSTEAD OF INSERT ON main_view FOR EACH ROW EXECUTE FUNCTION view_trigger('instead_of_ins')
+ instead_of_update_trig INSTEAD OF UPDATE ON main_view FOR EACH ROW EXECUTE FUNCTION view_trigger('instead_of_upd')
+
+-- Test dropping view triggers
+DROP TRIGGER instead_of_insert_trig ON main_view;
+DROP TRIGGER instead_of_delete_trig ON main_view;
+\d+ main_view
+ View "public.main_view"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+ b | integer | | | | plain |
+View definition:
+ SELECT main_table.a,
+ main_table.b
+ FROM main_table;
+Triggers:
+ after_del_stmt_trig AFTER DELETE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_del_stmt')
+ after_ins_stmt_trig AFTER INSERT ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_ins_stmt')
+ after_upd_stmt_trig AFTER UPDATE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_upd_stmt')
+ before_del_stmt_trig BEFORE DELETE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_del_stmt')
+ before_ins_stmt_trig BEFORE INSERT ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_ins_stmt')
+ before_upd_stmt_trig BEFORE UPDATE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_upd_stmt')
+ instead_of_update_trig INSTEAD OF UPDATE ON main_view FOR EACH ROW EXECUTE FUNCTION view_trigger('instead_of_upd')
+
+DROP VIEW main_view;
+--
+-- Test triggers on a join view
+--
+CREATE TABLE country_table (
+ country_id serial primary key,
+ country_name text unique not null,
+ continent text not null
+);
+INSERT INTO country_table (country_name, continent)
+ VALUES ('Japan', 'Asia'),
+ ('UK', 'Europe'),
+ ('USA', 'North America')
+ RETURNING *;
+ country_id | country_name | continent
+------------+--------------+---------------
+ 1 | Japan | Asia
+ 2 | UK | Europe
+ 3 | USA | North America
+(3 rows)
+
+CREATE TABLE city_table (
+ city_id serial primary key,
+ city_name text not null,
+ population bigint,
+ country_id int references country_table
+);
+CREATE VIEW city_view AS
+ SELECT city_id, city_name, population, country_name, continent
+ FROM city_table ci
+ LEFT JOIN country_table co ON co.country_id = ci.country_id;
+CREATE FUNCTION city_insert() RETURNS trigger LANGUAGE plpgsql AS $$
+declare
+ ctry_id int;
+begin
+ if NEW.country_name IS NOT NULL then
+ SELECT country_id, continent INTO ctry_id, NEW.continent
+ FROM country_table WHERE country_name = NEW.country_name;
+ if NOT FOUND then
+ raise exception 'No such country: "%"', NEW.country_name;
+ end if;
+ else
+ NEW.continent := NULL;
+ end if;
+
+ if NEW.city_id IS NOT NULL then
+ INSERT INTO city_table
+ VALUES(NEW.city_id, NEW.city_name, NEW.population, ctry_id);
+ else
+ INSERT INTO city_table(city_name, population, country_id)
+ VALUES(NEW.city_name, NEW.population, ctry_id)
+ RETURNING city_id INTO NEW.city_id;
+ end if;
+
+ RETURN NEW;
+end;
+$$;
+CREATE TRIGGER city_insert_trig INSTEAD OF INSERT ON city_view
+FOR EACH ROW EXECUTE PROCEDURE city_insert();
+CREATE FUNCTION city_delete() RETURNS trigger LANGUAGE plpgsql AS $$
+begin
+ DELETE FROM city_table WHERE city_id = OLD.city_id;
+ if NOT FOUND then RETURN NULL; end if;
+ RETURN OLD;
+end;
+$$;
+CREATE TRIGGER city_delete_trig INSTEAD OF DELETE ON city_view
+FOR EACH ROW EXECUTE PROCEDURE city_delete();
+CREATE FUNCTION city_update() RETURNS trigger LANGUAGE plpgsql AS $$
+declare
+ ctry_id int;
+begin
+ if NEW.country_name IS DISTINCT FROM OLD.country_name then
+ SELECT country_id, continent INTO ctry_id, NEW.continent
+ FROM country_table WHERE country_name = NEW.country_name;
+ if NOT FOUND then
+ raise exception 'No such country: "%"', NEW.country_name;
+ end if;
+
+ UPDATE city_table SET city_name = NEW.city_name,
+ population = NEW.population,
+ country_id = ctry_id
+ WHERE city_id = OLD.city_id;
+ else
+ UPDATE city_table SET city_name = NEW.city_name,
+ population = NEW.population
+ WHERE city_id = OLD.city_id;
+ NEW.continent := OLD.continent;
+ end if;
+
+ if NOT FOUND then RETURN NULL; end if;
+ RETURN NEW;
+end;
+$$;
+CREATE TRIGGER city_update_trig INSTEAD OF UPDATE ON city_view
+FOR EACH ROW EXECUTE PROCEDURE city_update();
+\set QUIET false
+-- INSERT .. RETURNING
+INSERT INTO city_view(city_name) VALUES('Tokyo') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 1 | Tokyo | | |
+(1 row)
+
+INSERT 0 1
+INSERT INTO city_view(city_name, population) VALUES('London', 7556900) RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 2 | London | 7556900 | |
+(1 row)
+
+INSERT 0 1
+INSERT INTO city_view(city_name, country_name) VALUES('Washington DC', 'USA') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+---------------+------------+--------------+---------------
+ 3 | Washington DC | | USA | North America
+(1 row)
+
+INSERT 0 1
+INSERT INTO city_view(city_id, city_name) VALUES(123456, 'New York') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 123456 | New York | | |
+(1 row)
+
+INSERT 0 1
+INSERT INTO city_view VALUES(234567, 'Birmingham', 1016800, 'UK', 'EU') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+------------+------------+--------------+-----------
+ 234567 | Birmingham | 1016800 | UK | Europe
+(1 row)
+
+INSERT 0 1
+-- UPDATE .. RETURNING
+UPDATE city_view SET country_name = 'Japon' WHERE city_name = 'Tokyo'; -- error
+ERROR: No such country: "Japon"
+CONTEXT: PL/pgSQL function city_update() line 9 at RAISE
+UPDATE city_view SET country_name = 'Japan' WHERE city_name = 'Takyo'; -- no match
+UPDATE 0
+UPDATE city_view SET country_name = 'Japan' WHERE city_name = 'Tokyo' RETURNING *; -- OK
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 1 | Tokyo | | Japan | Asia
+(1 row)
+
+UPDATE 1
+UPDATE city_view SET population = 13010279 WHERE city_name = 'Tokyo' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 1 | Tokyo | 13010279 | Japan | Asia
+(1 row)
+
+UPDATE 1
+UPDATE city_view SET country_name = 'UK' WHERE city_name = 'New York' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 123456 | New York | | UK | Europe
+(1 row)
+
+UPDATE 1
+UPDATE city_view SET country_name = 'USA', population = 8391881 WHERE city_name = 'New York' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+---------------
+ 123456 | New York | 8391881 | USA | North America
+(1 row)
+
+UPDATE 1
+UPDATE city_view SET continent = 'EU' WHERE continent = 'Europe' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+------------+------------+--------------+-----------
+ 234567 | Birmingham | 1016800 | UK | Europe
+(1 row)
+
+UPDATE 1
+UPDATE city_view v1 SET country_name = v2.country_name FROM city_view v2
+ WHERE v2.city_name = 'Birmingham' AND v1.city_name = 'London' RETURNING *;
+ city_id | city_name | population | country_name | continent | city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------+---------+------------+------------+--------------+-----------
+ 2 | London | 7556900 | UK | Europe | 234567 | Birmingham | 1016800 | UK | Europe
+(1 row)
+
+UPDATE 1
+-- DELETE .. RETURNING
+DELETE FROM city_view WHERE city_name = 'Birmingham' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+------------+------------+--------------+-----------
+ 234567 | Birmingham | 1016800 | UK | Europe
+(1 row)
+
+DELETE 1
+\set QUIET true
+-- read-only view with WHERE clause
+CREATE VIEW european_city_view AS
+ SELECT * FROM city_view WHERE continent = 'Europe';
+SELECT count(*) FROM european_city_view;
+ count
+-------
+ 1
+(1 row)
+
+CREATE FUNCTION no_op_trig_fn() RETURNS trigger LANGUAGE plpgsql
+AS 'begin RETURN NULL; end';
+CREATE TRIGGER no_op_trig INSTEAD OF INSERT OR UPDATE OR DELETE
+ON european_city_view FOR EACH ROW EXECUTE PROCEDURE no_op_trig_fn();
+\set QUIET false
+INSERT INTO european_city_view VALUES (0, 'x', 10000, 'y', 'z');
+INSERT 0 0
+UPDATE european_city_view SET population = 10000;
+UPDATE 0
+DELETE FROM european_city_view;
+DELETE 0
+\set QUIET true
+-- rules bypassing no-op triggers
+CREATE RULE european_city_insert_rule AS ON INSERT TO european_city_view
+DO INSTEAD INSERT INTO city_view
+VALUES (NEW.city_id, NEW.city_name, NEW.population, NEW.country_name, NEW.continent)
+RETURNING *;
+CREATE RULE european_city_update_rule AS ON UPDATE TO european_city_view
+DO INSTEAD UPDATE city_view SET
+ city_name = NEW.city_name,
+ population = NEW.population,
+ country_name = NEW.country_name
+WHERE city_id = OLD.city_id
+RETURNING NEW.*;
+CREATE RULE european_city_delete_rule AS ON DELETE TO european_city_view
+DO INSTEAD DELETE FROM city_view WHERE city_id = OLD.city_id RETURNING *;
+\set QUIET false
+-- INSERT not limited by view's WHERE clause, but UPDATE AND DELETE are
+INSERT INTO european_city_view(city_name, country_name)
+ VALUES ('Cambridge', 'USA') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+---------------
+ 4 | Cambridge | | USA | North America
+(1 row)
+
+INSERT 0 1
+UPDATE european_city_view SET country_name = 'UK'
+ WHERE city_name = 'Cambridge';
+UPDATE 0
+DELETE FROM european_city_view WHERE city_name = 'Cambridge';
+DELETE 0
+-- UPDATE and DELETE via rule and trigger
+UPDATE city_view SET country_name = 'UK'
+ WHERE city_name = 'Cambridge' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 4 | Cambridge | | UK | Europe
+(1 row)
+
+UPDATE 1
+UPDATE european_city_view SET population = 122800
+ WHERE city_name = 'Cambridge' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 4 | Cambridge | 122800 | UK | Europe
+(1 row)
+
+UPDATE 1
+DELETE FROM european_city_view WHERE city_name = 'Cambridge' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 4 | Cambridge | 122800 | UK | Europe
+(1 row)
+
+DELETE 1
+-- join UPDATE test
+UPDATE city_view v SET population = 599657
+ FROM city_table ci, country_table co
+ WHERE ci.city_name = 'Washington DC' and co.country_name = 'USA'
+ AND v.city_id = ci.city_id AND v.country_name = co.country_name
+ RETURNING co.country_id, v.country_name,
+ v.city_id, v.city_name, v.population;
+ country_id | country_name | city_id | city_name | population
+------------+--------------+---------+---------------+------------
+ 3 | USA | 3 | Washington DC | 599657
+(1 row)
+
+UPDATE 1
+\set QUIET true
+SELECT * FROM city_view;
+ city_id | city_name | population | country_name | continent
+---------+---------------+------------+--------------+---------------
+ 1 | Tokyo | 13010279 | Japan | Asia
+ 123456 | New York | 8391881 | USA | North America
+ 2 | London | 7556900 | UK | Europe
+ 3 | Washington DC | 599657 | USA | North America
+(4 rows)
+
+DROP TABLE city_table CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view city_view
+drop cascades to view european_city_view
+DROP TABLE country_table;
+-- Test pg_trigger_depth()
+create table depth_a (id int not null primary key);
+create table depth_b (id int not null primary key);
+create table depth_c (id int not null primary key);
+create function depth_a_tf() returns trigger
+ language plpgsql as $$
+begin
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ insert into depth_b values (new.id);
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ return new;
+end;
+$$;
+create trigger depth_a_tr before insert on depth_a
+ for each row execute procedure depth_a_tf();
+create function depth_b_tf() returns trigger
+ language plpgsql as $$
+begin
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ begin
+ execute 'insert into depth_c values (' || new.id::text || ')';
+ exception
+ when sqlstate 'U9999' then
+ raise notice 'SQLSTATE = U9999: depth = %', pg_trigger_depth();
+ end;
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ if new.id = 1 then
+ execute 'insert into depth_c values (' || new.id::text || ')';
+ end if;
+ return new;
+end;
+$$;
+create trigger depth_b_tr before insert on depth_b
+ for each row execute procedure depth_b_tf();
+create function depth_c_tf() returns trigger
+ language plpgsql as $$
+begin
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ if new.id = 1 then
+ raise exception sqlstate 'U9999';
+ end if;
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ return new;
+end;
+$$;
+create trigger depth_c_tr before insert on depth_c
+ for each row execute procedure depth_c_tf();
+select pg_trigger_depth();
+ pg_trigger_depth
+------------------
+ 0
+(1 row)
+
+insert into depth_a values (1);
+NOTICE: depth_a_tr: depth = 1
+NOTICE: depth_b_tr: depth = 2
+NOTICE: depth_c_tr: depth = 3
+NOTICE: SQLSTATE = U9999: depth = 2
+NOTICE: depth_b_tr: depth = 2
+NOTICE: depth_c_tr: depth = 3
+ERROR: U9999
+CONTEXT: PL/pgSQL function depth_c_tf() line 5 at RAISE
+SQL statement "insert into depth_c values (1)"
+PL/pgSQL function depth_b_tf() line 12 at EXECUTE
+SQL statement "insert into depth_b values (new.id)"
+PL/pgSQL function depth_a_tf() line 4 at SQL statement
+select pg_trigger_depth();
+ pg_trigger_depth
+------------------
+ 0
+(1 row)
+
+insert into depth_a values (2);
+NOTICE: depth_a_tr: depth = 1
+NOTICE: depth_b_tr: depth = 2
+NOTICE: depth_c_tr: depth = 3
+NOTICE: depth_c_tr: depth = 3
+NOTICE: depth_b_tr: depth = 2
+NOTICE: depth_a_tr: depth = 1
+select pg_trigger_depth();
+ pg_trigger_depth
+------------------
+ 0
+(1 row)
+
+drop table depth_a, depth_b, depth_c;
+drop function depth_a_tf();
+drop function depth_b_tf();
+drop function depth_c_tf();
+--
+-- Test updates to rows during firing of BEFORE ROW triggers.
+-- As of 9.2, such cases should be rejected (see bug #6123).
+--
+create temp table parent (
+ aid int not null primary key,
+ val1 text,
+ val2 text,
+ val3 text,
+ val4 text,
+ bcnt int not null default 0);
+create temp table child (
+ bid int not null primary key,
+ aid int not null,
+ val1 text);
+create function parent_upd_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if old.val1 <> new.val1 then
+ new.val2 = new.val1;
+ delete from child where child.aid = new.aid and child.val1 = new.val1;
+ end if;
+ return new;
+end;
+$$;
+create trigger parent_upd_trig before update on parent
+ for each row execute procedure parent_upd_func();
+create function parent_del_func()
+ returns trigger language plpgsql as
+$$
+begin
+ delete from child where aid = old.aid;
+ return old;
+end;
+$$;
+create trigger parent_del_trig before delete on parent
+ for each row execute procedure parent_del_func();
+create function child_ins_func()
+ returns trigger language plpgsql as
+$$
+begin
+ update parent set bcnt = bcnt + 1 where aid = new.aid;
+ return new;
+end;
+$$;
+create trigger child_ins_trig after insert on child
+ for each row execute procedure child_ins_func();
+create function child_del_func()
+ returns trigger language plpgsql as
+$$
+begin
+ update parent set bcnt = bcnt - 1 where aid = old.aid;
+ return old;
+end;
+$$;
+create trigger child_del_trig after delete on child
+ for each row execute procedure child_del_func();
+insert into parent values (1, 'a', 'a', 'a', 'a', 0);
+insert into child values (10, 1, 'b');
+select * from parent; select * from child;
+ aid | val1 | val2 | val3 | val4 | bcnt
+-----+------+------+------+------+------
+ 1 | a | a | a | a | 1
+(1 row)
+
+ bid | aid | val1
+-----+-----+------
+ 10 | 1 | b
+(1 row)
+
+update parent set val1 = 'b' where aid = 1; -- should fail
+ERROR: tuple to be updated was already modified by an operation triggered by the current command
+HINT: Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.
+select * from parent; select * from child;
+ aid | val1 | val2 | val3 | val4 | bcnt
+-----+------+------+------+------+------
+ 1 | a | a | a | a | 1
+(1 row)
+
+ bid | aid | val1
+-----+-----+------
+ 10 | 1 | b
+(1 row)
+
+delete from parent where aid = 1; -- should fail
+ERROR: tuple to be deleted was already modified by an operation triggered by the current command
+HINT: Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.
+select * from parent; select * from child;
+ aid | val1 | val2 | val3 | val4 | bcnt
+-----+------+------+------+------+------
+ 1 | a | a | a | a | 1
+(1 row)
+
+ bid | aid | val1
+-----+-----+------
+ 10 | 1 | b
+(1 row)
+
+-- replace the trigger function with one that restarts the deletion after
+-- having modified a child
+create or replace function parent_del_func()
+ returns trigger language plpgsql as
+$$
+begin
+ delete from child where aid = old.aid;
+ if found then
+ delete from parent where aid = old.aid;
+ return null; -- cancel outer deletion
+ end if;
+ return old;
+end;
+$$;
+delete from parent where aid = 1;
+select * from parent; select * from child;
+ aid | val1 | val2 | val3 | val4 | bcnt
+-----+------+------+------+------+------
+(0 rows)
+
+ bid | aid | val1
+-----+-----+------
+(0 rows)
+
+drop table parent, child;
+drop function parent_upd_func();
+drop function parent_del_func();
+drop function child_ins_func();
+drop function child_del_func();
+-- similar case, but with a self-referencing FK so that parent and child
+-- rows can be affected by a single operation
+create temp table self_ref_trigger (
+ id int primary key,
+ parent int references self_ref_trigger,
+ data text,
+ nchildren int not null default 0
+);
+create function self_ref_trigger_ins_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if new.parent is not null then
+ update self_ref_trigger set nchildren = nchildren + 1
+ where id = new.parent;
+ end if;
+ return new;
+end;
+$$;
+create trigger self_ref_trigger_ins_trig before insert on self_ref_trigger
+ for each row execute procedure self_ref_trigger_ins_func();
+create function self_ref_trigger_del_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if old.parent is not null then
+ update self_ref_trigger set nchildren = nchildren - 1
+ where id = old.parent;
+ end if;
+ return old;
+end;
+$$;
+create trigger self_ref_trigger_del_trig before delete on self_ref_trigger
+ for each row execute procedure self_ref_trigger_del_func();
+insert into self_ref_trigger values (1, null, 'root');
+insert into self_ref_trigger values (2, 1, 'root child A');
+insert into self_ref_trigger values (3, 1, 'root child B');
+insert into self_ref_trigger values (4, 2, 'grandchild 1');
+insert into self_ref_trigger values (5, 3, 'grandchild 2');
+update self_ref_trigger set data = 'root!' where id = 1;
+select * from self_ref_trigger;
+ id | parent | data | nchildren
+----+--------+--------------+-----------
+ 2 | 1 | root child A | 1
+ 4 | 2 | grandchild 1 | 0
+ 3 | 1 | root child B | 1
+ 5 | 3 | grandchild 2 | 0
+ 1 | | root! | 2
+(5 rows)
+
+delete from self_ref_trigger;
+ERROR: tuple to be deleted was already modified by an operation triggered by the current command
+HINT: Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.
+select * from self_ref_trigger;
+ id | parent | data | nchildren
+----+--------+--------------+-----------
+ 2 | 1 | root child A | 1
+ 4 | 2 | grandchild 1 | 0
+ 3 | 1 | root child B | 1
+ 5 | 3 | grandchild 2 | 0
+ 1 | | root! | 2
+(5 rows)
+
+drop table self_ref_trigger;
+drop function self_ref_trigger_ins_func();
+drop function self_ref_trigger_del_func();
+--
+-- Check that statement triggers work correctly even with all children excluded
+--
+create table stmt_trig_on_empty_upd (a int);
+create table stmt_trig_on_empty_upd1 () inherits (stmt_trig_on_empty_upd);
+create function update_stmt_notice() returns trigger as $$
+begin
+ raise notice 'updating %', TG_TABLE_NAME;
+ return null;
+end;
+$$ language plpgsql;
+create trigger before_stmt_trigger
+ before update on stmt_trig_on_empty_upd
+ execute procedure update_stmt_notice();
+create trigger before_stmt_trigger
+ before update on stmt_trig_on_empty_upd1
+ execute procedure update_stmt_notice();
+-- inherited no-op update
+update stmt_trig_on_empty_upd set a = a where false returning a+1 as aa;
+NOTICE: updating stmt_trig_on_empty_upd
+ aa
+----
+(0 rows)
+
+-- simple no-op update
+update stmt_trig_on_empty_upd1 set a = a where false returning a+1 as aa;
+NOTICE: updating stmt_trig_on_empty_upd1
+ aa
+----
+(0 rows)
+
+drop table stmt_trig_on_empty_upd cascade;
+NOTICE: drop cascades to table stmt_trig_on_empty_upd1
+drop function update_stmt_notice();
+--
+-- Check that index creation (or DDL in general) is prohibited in a trigger
+--
+create table trigger_ddl_table (
+ col1 integer,
+ col2 integer
+);
+create function trigger_ddl_func() returns trigger as $$
+begin
+ alter table trigger_ddl_table add primary key (col1);
+ return new;
+end$$ language plpgsql;
+create trigger trigger_ddl_func before insert on trigger_ddl_table for each row
+ execute procedure trigger_ddl_func();
+insert into trigger_ddl_table values (1, 42); -- fail
+ERROR: cannot ALTER TABLE "trigger_ddl_table" because it is being used by active queries in this session
+CONTEXT: SQL statement "alter table trigger_ddl_table add primary key (col1)"
+PL/pgSQL function trigger_ddl_func() line 3 at SQL statement
+create or replace function trigger_ddl_func() returns trigger as $$
+begin
+ create index on trigger_ddl_table (col2);
+ return new;
+end$$ language plpgsql;
+insert into trigger_ddl_table values (1, 42); -- fail
+ERROR: cannot CREATE INDEX "trigger_ddl_table" because it is being used by active queries in this session
+CONTEXT: SQL statement "create index on trigger_ddl_table (col2)"
+PL/pgSQL function trigger_ddl_func() line 3 at SQL statement
+drop table trigger_ddl_table;
+drop function trigger_ddl_func();
+--
+-- Verify behavior of before and after triggers with INSERT...ON CONFLICT
+-- DO UPDATE
+--
+create table upsert (key int4 primary key, color text);
+create function upsert_before_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if (TG_OP = 'UPDATE') then
+ raise warning 'before update (old): %', old.*::text;
+ raise warning 'before update (new): %', new.*::text;
+ elsif (TG_OP = 'INSERT') then
+ raise warning 'before insert (new): %', new.*::text;
+ if new.key % 2 = 0 then
+ new.key := new.key + 1;
+ new.color := new.color || ' trig modified';
+ raise warning 'before insert (new, modified): %', new.*::text;
+ end if;
+ end if;
+ return new;
+end;
+$$;
+create trigger upsert_before_trig before insert or update on upsert
+ for each row execute procedure upsert_before_func();
+create function upsert_after_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if (TG_OP = 'UPDATE') then
+ raise warning 'after update (old): %', old.*::text;
+ raise warning 'after update (new): %', new.*::text;
+ elsif (TG_OP = 'INSERT') then
+ raise warning 'after insert (new): %', new.*::text;
+ end if;
+ return null;
+end;
+$$;
+create trigger upsert_after_trig after insert or update on upsert
+ for each row execute procedure upsert_after_func();
+insert into upsert values(1, 'black') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (1,black)
+WARNING: after insert (new): (1,black)
+insert into upsert values(2, 'red') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (2,red)
+WARNING: before insert (new, modified): (3,"red trig modified")
+WARNING: after insert (new): (3,"red trig modified")
+insert into upsert values(3, 'orange') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (3,orange)
+WARNING: before update (old): (3,"red trig modified")
+WARNING: before update (new): (3,"updated red trig modified")
+WARNING: after update (old): (3,"red trig modified")
+WARNING: after update (new): (3,"updated red trig modified")
+insert into upsert values(4, 'green') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (4,green)
+WARNING: before insert (new, modified): (5,"green trig modified")
+WARNING: after insert (new): (5,"green trig modified")
+insert into upsert values(5, 'purple') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (5,purple)
+WARNING: before update (old): (5,"green trig modified")
+WARNING: before update (new): (5,"updated green trig modified")
+WARNING: after update (old): (5,"green trig modified")
+WARNING: after update (new): (5,"updated green trig modified")
+insert into upsert values(6, 'white') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (6,white)
+WARNING: before insert (new, modified): (7,"white trig modified")
+WARNING: after insert (new): (7,"white trig modified")
+insert into upsert values(7, 'pink') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (7,pink)
+WARNING: before update (old): (7,"white trig modified")
+WARNING: before update (new): (7,"updated white trig modified")
+WARNING: after update (old): (7,"white trig modified")
+WARNING: after update (new): (7,"updated white trig modified")
+insert into upsert values(8, 'yellow') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (8,yellow)
+WARNING: before insert (new, modified): (9,"yellow trig modified")
+WARNING: after insert (new): (9,"yellow trig modified")
+select * from upsert;
+ key | color
+-----+-----------------------------
+ 1 | black
+ 3 | updated red trig modified
+ 5 | updated green trig modified
+ 7 | updated white trig modified
+ 9 | yellow trig modified
+(5 rows)
+
+drop table upsert;
+drop function upsert_before_func();
+drop function upsert_after_func();
+--
+-- Verify that triggers with transition tables are not allowed on
+-- views
+--
+create table my_table (i int);
+create view my_view as select * from my_table;
+create function my_trigger_function() returns trigger as $$ begin end; $$ language plpgsql;
+create trigger my_trigger after update on my_view referencing old table as old_table
+ for each statement execute procedure my_trigger_function();
+ERROR: "my_view" is a view
+DETAIL: Triggers on views cannot have transition tables.
+drop function my_trigger_function();
+drop view my_view;
+drop table my_table;
+--
+-- Verify cases that are unsupported with partitioned tables
+--
+create table parted_trig (a int) partition by list (a);
+create function trigger_nothing() returns trigger
+ language plpgsql as $$ begin end; $$;
+create trigger failed before insert or update or delete on parted_trig
+ for each row execute procedure trigger_nothing();
+ERROR: "parted_trig" is a partitioned table
+DETAIL: Partitioned tables cannot have BEFORE / FOR EACH ROW triggers.
+create trigger failed instead of update on parted_trig
+ for each row execute procedure trigger_nothing();
+ERROR: "parted_trig" is a table
+DETAIL: Tables cannot have INSTEAD OF triggers.
+create trigger failed after update on parted_trig
+ referencing old table as old_table
+ for each row execute procedure trigger_nothing();
+ERROR: "parted_trig" is a partitioned table
+DETAIL: Triggers on partitioned tables cannot have transition tables.
+drop table parted_trig;
+--
+-- Verify trigger creation for partitioned tables, and drop behavior
+--
+create table trigpart (a int, b int) partition by range (a);
+create table trigpart1 partition of trigpart for values from (0) to (1000);
+create trigger trg1 after insert on trigpart for each row execute procedure trigger_nothing();
+create table trigpart2 partition of trigpart for values from (1000) to (2000);
+create table trigpart3 (like trigpart);
+alter table trigpart attach partition trigpart3 for values from (2000) to (3000);
+select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
+ where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
+ tgrelid | tgname | tgfoid
+-----------+--------+-----------------
+ trigpart | trg1 | trigger_nothing
+ trigpart1 | trg1 | trigger_nothing
+ trigpart2 | trg1 | trigger_nothing
+ trigpart3 | trg1 | trigger_nothing
+(4 rows)
+
+drop trigger trg1 on trigpart1; -- fail
+ERROR: cannot drop trigger trg1 on table trigpart1 because trigger trg1 on table trigpart requires it
+HINT: You can drop trigger trg1 on table trigpart instead.
+drop trigger trg1 on trigpart2; -- fail
+ERROR: cannot drop trigger trg1 on table trigpart2 because trigger trg1 on table trigpart requires it
+HINT: You can drop trigger trg1 on table trigpart instead.
+drop trigger trg1 on trigpart3; -- fail
+ERROR: cannot drop trigger trg1 on table trigpart3 because trigger trg1 on table trigpart requires it
+HINT: You can drop trigger trg1 on table trigpart instead.
+drop table trigpart2; -- ok, trigger should be gone in that partition
+select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
+ where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
+ tgrelid | tgname | tgfoid
+-----------+--------+-----------------
+ trigpart | trg1 | trigger_nothing
+ trigpart1 | trg1 | trigger_nothing
+ trigpart3 | trg1 | trigger_nothing
+(3 rows)
+
+drop trigger trg1 on trigpart; -- ok, all gone
+select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
+ where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
+ tgrelid | tgname | tgfoid
+---------+--------+--------
+(0 rows)
+
+drop table trigpart;
+drop function trigger_nothing();
+--
+-- Verify that triggers are fired for partitioned tables
+--
+create table parted_stmt_trig (a int) partition by list (a);
+create table parted_stmt_trig1 partition of parted_stmt_trig for values in (1);
+create table parted_stmt_trig2 partition of parted_stmt_trig for values in (2);
+create table parted2_stmt_trig (a int) partition by list (a);
+create table parted2_stmt_trig1 partition of parted2_stmt_trig for values in (1);
+create table parted2_stmt_trig2 partition of parted2_stmt_trig for values in (2);
+create or replace function trigger_notice() returns trigger as $$
+ begin
+ raise notice 'trigger % on % % % for %', TG_NAME, TG_TABLE_NAME, TG_WHEN, TG_OP, TG_LEVEL;
+ if TG_LEVEL = 'ROW' then
+ return NEW;
+ end if;
+ return null;
+ end;
+ $$ language plpgsql;
+-- insert/update/delete statement-level triggers on the parent
+create trigger trig_ins_before before insert on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_ins_after after insert on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_upd_before before update on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_upd_after after update on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_del_before before delete on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_del_after after delete on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+-- insert/update/delete row-level triggers on the parent
+create trigger trig_ins_after_parent after insert on parted_stmt_trig
+ for each row execute procedure trigger_notice();
+create trigger trig_upd_after_parent after update on parted_stmt_trig
+ for each row execute procedure trigger_notice();
+create trigger trig_del_after_parent after delete on parted_stmt_trig
+ for each row execute procedure trigger_notice();
+-- insert/update/delete row-level triggers on the first partition
+create trigger trig_ins_before_child before insert on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_ins_after_child after insert on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_upd_before_child before update on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_upd_after_child after update on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_del_before_child before delete on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_del_after_child after delete on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+-- insert/update/delete statement-level triggers on the parent
+create trigger trig_ins_before_3 before insert on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_ins_after_3 after insert on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_upd_before_3 before update on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_upd_after_3 after update on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_del_before_3 before delete on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_del_after_3 after delete on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+with ins (a) as (
+ insert into parted2_stmt_trig values (1), (2) returning a
+) insert into parted_stmt_trig select a from ins returning tableoid::regclass, a;
+NOTICE: trigger trig_ins_before on parted_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_3 on parted2_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig2 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_3 on parted2_stmt_trig AFTER INSERT for STATEMENT
+NOTICE: trigger trig_ins_after on parted_stmt_trig AFTER INSERT for STATEMENT
+ tableoid | a
+-------------------+---
+ parted_stmt_trig1 | 1
+ parted_stmt_trig2 | 2
+(2 rows)
+
+with upd as (
+ update parted2_stmt_trig set a = a
+) update parted_stmt_trig set a = a;
+NOTICE: trigger trig_upd_before on parted_stmt_trig BEFORE UPDATE for STATEMENT
+NOTICE: trigger trig_upd_before_child on parted_stmt_trig1 BEFORE UPDATE for ROW
+NOTICE: trigger trig_upd_before_3 on parted2_stmt_trig BEFORE UPDATE for STATEMENT
+NOTICE: trigger trig_upd_after_child on parted_stmt_trig1 AFTER UPDATE for ROW
+NOTICE: trigger trig_upd_after_parent on parted_stmt_trig1 AFTER UPDATE for ROW
+NOTICE: trigger trig_upd_after_parent on parted_stmt_trig2 AFTER UPDATE for ROW
+NOTICE: trigger trig_upd_after on parted_stmt_trig AFTER UPDATE for STATEMENT
+NOTICE: trigger trig_upd_after_3 on parted2_stmt_trig AFTER UPDATE for STATEMENT
+delete from parted_stmt_trig;
+NOTICE: trigger trig_del_before on parted_stmt_trig BEFORE DELETE for STATEMENT
+NOTICE: trigger trig_del_before_child on parted_stmt_trig1 BEFORE DELETE for ROW
+NOTICE: trigger trig_del_after_parent on parted_stmt_trig2 AFTER DELETE for ROW
+NOTICE: trigger trig_del_after on parted_stmt_trig AFTER DELETE for STATEMENT
+-- insert via copy on the parent
+copy parted_stmt_trig(a) from stdin;
+NOTICE: trigger trig_ins_before on parted_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig2 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after on parted_stmt_trig AFTER INSERT for STATEMENT
+-- insert via copy on the first partition
+copy parted_stmt_trig1(a) from stdin;
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig1 AFTER INSERT for ROW
+-- Disabling a trigger in the parent table should disable children triggers too
+alter table parted_stmt_trig disable trigger trig_ins_after_parent;
+insert into parted_stmt_trig values (1);
+NOTICE: trigger trig_ins_before on parted_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after on parted_stmt_trig AFTER INSERT for STATEMENT
+alter table parted_stmt_trig enable trigger trig_ins_after_parent;
+insert into parted_stmt_trig values (1);
+NOTICE: trigger trig_ins_before on parted_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after on parted_stmt_trig AFTER INSERT for STATEMENT
+drop table parted_stmt_trig, parted2_stmt_trig;
+-- Verify that triggers fire in alphabetical order
+create table parted_trig (a int) partition by range (a);
+create table parted_trig_1 partition of parted_trig for values from (0) to (1000)
+ partition by range (a);
+create table parted_trig_1_1 partition of parted_trig_1 for values from (0) to (100);
+create table parted_trig_2 partition of parted_trig for values from (1000) to (2000);
+create trigger zzz after insert on parted_trig for each row execute procedure trigger_notice();
+create trigger mmm after insert on parted_trig_1_1 for each row execute procedure trigger_notice();
+create trigger aaa after insert on parted_trig_1 for each row execute procedure trigger_notice();
+create trigger bbb after insert on parted_trig for each row execute procedure trigger_notice();
+create trigger qqq after insert on parted_trig_1_1 for each row execute procedure trigger_notice();
+insert into parted_trig values (50), (1500);
+NOTICE: trigger aaa on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger bbb on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger mmm on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger qqq on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger zzz on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger bbb on parted_trig_2 AFTER INSERT for ROW
+NOTICE: trigger zzz on parted_trig_2 AFTER INSERT for ROW
+drop table parted_trig;
+-- Verify propagation of trigger arguments to partitions
+create table parted_trig (a int) partition by list (a);
+create table parted_trig1 partition of parted_trig for values in (1);
+create or replace function trigger_notice() returns trigger as $$
+ declare
+ arg1 text = TG_ARGV[0];
+ arg2 integer = TG_ARGV[1];
+ begin
+ raise notice 'trigger % on % % % for % args % %',
+ TG_NAME, TG_TABLE_NAME, TG_WHEN, TG_OP, TG_LEVEL, arg1, arg2;
+ return null;
+ end;
+ $$ language plpgsql;
+create trigger aaa after insert on parted_trig
+ for each row execute procedure trigger_notice('quirky', 1);
+-- Verify propagation of trigger arguments to partitions attached after creating trigger
+create table parted_trig2 partition of parted_trig for values in (2);
+create table parted_trig3 (like parted_trig);
+alter table parted_trig attach partition parted_trig3 for values in (3);
+insert into parted_trig values (1), (2), (3);
+NOTICE: trigger aaa on parted_trig1 AFTER INSERT for ROW args quirky 1
+NOTICE: trigger aaa on parted_trig2 AFTER INSERT for ROW args quirky 1
+NOTICE: trigger aaa on parted_trig3 AFTER INSERT for ROW args quirky 1
+drop table parted_trig;
+-- test irregular partitions (i.e., different column definitions),
+-- including that the WHEN clause works
+create function bark(text) returns bool language plpgsql immutable
+ as $$ begin raise notice '% <- woof!', $1; return true; end; $$;
+create or replace function trigger_notice_ab() returns trigger as $$
+ begin
+ raise notice 'trigger % on % % % for %: (a,b)=(%,%)',
+ TG_NAME, TG_TABLE_NAME, TG_WHEN, TG_OP, TG_LEVEL,
+ NEW.a, NEW.b;
+ if TG_LEVEL = 'ROW' then
+ return NEW;
+ end if;
+ return null;
+ end;
+ $$ language plpgsql;
+create table parted_irreg_ancestor (fd text, b text, fd2 int, fd3 int, a int)
+ partition by range (b);
+alter table parted_irreg_ancestor drop column fd,
+ drop column fd2, drop column fd3;
+create table parted_irreg (fd int, a int, fd2 int, b text)
+ partition by range (b);
+alter table parted_irreg drop column fd, drop column fd2;
+alter table parted_irreg_ancestor attach partition parted_irreg
+ for values from ('aaaa') to ('zzzz');
+create table parted1_irreg (b text, fd int, a int);
+alter table parted1_irreg drop column fd;
+alter table parted_irreg attach partition parted1_irreg
+ for values from ('aaaa') to ('bbbb');
+create trigger parted_trig after insert on parted_irreg
+ for each row execute procedure trigger_notice_ab();
+create trigger parted_trig_odd after insert on parted_irreg for each row
+ when (bark(new.b) AND new.a % 2 = 1) execute procedure trigger_notice_ab();
+-- we should hear barking for every insert, but parted_trig_odd only emits
+-- noise for odd values of a. parted_trig does it for all inserts.
+insert into parted_irreg values (1, 'aardvark'), (2, 'aanimals');
+NOTICE: aardvark <- woof!
+NOTICE: aanimals <- woof!
+NOTICE: trigger parted_trig on parted1_irreg AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig_odd on parted1_irreg AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig on parted1_irreg AFTER INSERT for ROW: (a,b)=(2,aanimals)
+insert into parted1_irreg values ('aardwolf', 2);
+NOTICE: aardwolf <- woof!
+NOTICE: trigger parted_trig on parted1_irreg AFTER INSERT for ROW: (a,b)=(2,aardwolf)
+insert into parted_irreg_ancestor values ('aasvogel', 3);
+NOTICE: aasvogel <- woof!
+NOTICE: trigger parted_trig on parted1_irreg AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+NOTICE: trigger parted_trig_odd on parted1_irreg AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+drop table parted_irreg_ancestor;
+--
+-- Constraint triggers and partitioned tables
+create table parted_constr_ancestor (a int, b text)
+ partition by range (b);
+create table parted_constr (a int, b text)
+ partition by range (b);
+alter table parted_constr_ancestor attach partition parted_constr
+ for values from ('aaaa') to ('zzzz');
+create table parted1_constr (a int, b text);
+alter table parted_constr attach partition parted1_constr
+ for values from ('aaaa') to ('bbbb');
+create constraint trigger parted_trig after insert on parted_constr_ancestor
+ deferrable
+ for each row execute procedure trigger_notice_ab();
+create constraint trigger parted_trig_two after insert on parted_constr
+ deferrable initially deferred
+ for each row when (bark(new.b) AND new.a % 2 = 1)
+ execute procedure trigger_notice_ab();
+-- The immediate constraint is fired immediately; the WHEN clause of the
+-- deferred constraint is also called immediately. The deferred constraint
+-- is fired at commit time.
+begin;
+insert into parted_constr values (1, 'aardvark');
+NOTICE: aardvark <- woof!
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(1,aardvark)
+insert into parted1_constr values (2, 'aardwolf');
+NOTICE: aardwolf <- woof!
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(2,aardwolf)
+insert into parted_constr_ancestor values (3, 'aasvogel');
+NOTICE: aasvogel <- woof!
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+commit;
+NOTICE: trigger parted_trig_two on parted1_constr AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig_two on parted1_constr AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+-- The WHEN clause is immediate, and both constraint triggers are fired at
+-- commit time.
+begin;
+set constraints parted_trig deferred;
+insert into parted_constr values (1, 'aardvark');
+NOTICE: aardvark <- woof!
+insert into parted1_constr values (2, 'aardwolf'), (3, 'aasvogel');
+NOTICE: aardwolf <- woof!
+NOTICE: aasvogel <- woof!
+commit;
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig_two on parted1_constr AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(2,aardwolf)
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+NOTICE: trigger parted_trig_two on parted1_constr AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+drop table parted_constr_ancestor;
+drop function bark(text);
+-- Test that the WHEN clause is set properly to partitions
+create table parted_trigger (a int, b text) partition by range (a);
+create table parted_trigger_1 partition of parted_trigger for values from (0) to (1000);
+create table parted_trigger_2 (drp int, a int, b text);
+alter table parted_trigger_2 drop column drp;
+alter table parted_trigger attach partition parted_trigger_2 for values from (1000) to (2000);
+create trigger parted_trigger after update on parted_trigger
+ for each row when (new.a % 2 = 1 and length(old.b) >= 2) execute procedure trigger_notice_ab();
+create table parted_trigger_3 (b text, a int) partition by range (length(b));
+create table parted_trigger_3_1 partition of parted_trigger_3 for values from (1) to (3);
+create table parted_trigger_3_2 partition of parted_trigger_3 for values from (3) to (5);
+alter table parted_trigger attach partition parted_trigger_3 for values from (2000) to (3000);
+insert into parted_trigger values
+ (0, 'a'), (1, 'bbb'), (2, 'bcd'), (3, 'c'),
+ (1000, 'c'), (1001, 'ddd'), (1002, 'efg'), (1003, 'f'),
+ (2000, 'e'), (2001, 'fff'), (2002, 'ghi'), (2003, 'h');
+update parted_trigger set a = a + 2; -- notice for odd 'a' values, long 'b' values
+NOTICE: trigger parted_trigger on parted_trigger_1 AFTER UPDATE for ROW: (a,b)=(3,bbb)
+NOTICE: trigger parted_trigger on parted_trigger_2 AFTER UPDATE for ROW: (a,b)=(1003,ddd)
+NOTICE: trigger parted_trigger on parted_trigger_3_2 AFTER UPDATE for ROW: (a,b)=(2003,fff)
+drop table parted_trigger;
+-- try a constraint trigger, also
+create table parted_referenced (a int);
+create table unparted_trigger (a int, b text); -- for comparison purposes
+create table parted_trigger (a int, b text) partition by range (a);
+create table parted_trigger_1 partition of parted_trigger for values from (0) to (1000);
+create table parted_trigger_2 (drp int, a int, b text);
+alter table parted_trigger_2 drop column drp;
+alter table parted_trigger attach partition parted_trigger_2 for values from (1000) to (2000);
+create constraint trigger parted_trigger after update on parted_trigger
+ from parted_referenced
+ for each row execute procedure trigger_notice_ab();
+create constraint trigger parted_trigger after update on unparted_trigger
+ from parted_referenced
+ for each row execute procedure trigger_notice_ab();
+create table parted_trigger_3 (b text, a int) partition by range (length(b));
+create table parted_trigger_3_1 partition of parted_trigger_3 for values from (1) to (3);
+create table parted_trigger_3_2 partition of parted_trigger_3 for values from (3) to (5);
+alter table parted_trigger attach partition parted_trigger_3 for values from (2000) to (3000);
+select tgname, conname, t.tgrelid::regclass, t.tgconstrrelid::regclass,
+ c.conrelid::regclass, c.confrelid::regclass
+ from pg_trigger t join pg_constraint c on (t.tgconstraint = c.oid)
+ where tgname = 'parted_trigger'
+ order by t.tgrelid::regclass::text;
+ tgname | conname | tgrelid | tgconstrrelid | conrelid | confrelid
+----------------+----------------+--------------------+-------------------+--------------------+-----------
+ parted_trigger | parted_trigger | parted_trigger | parted_referenced | parted_trigger | -
+ parted_trigger | parted_trigger | parted_trigger_1 | parted_referenced | parted_trigger_1 | -
+ parted_trigger | parted_trigger | parted_trigger_2 | parted_referenced | parted_trigger_2 | -
+ parted_trigger | parted_trigger | parted_trigger_3 | parted_referenced | parted_trigger_3 | -
+ parted_trigger | parted_trigger | parted_trigger_3_1 | parted_referenced | parted_trigger_3_1 | -
+ parted_trigger | parted_trigger | parted_trigger_3_2 | parted_referenced | parted_trigger_3_2 | -
+ parted_trigger | parted_trigger | unparted_trigger | parted_referenced | unparted_trigger | -
+(7 rows)
+
+drop table parted_referenced, parted_trigger, unparted_trigger;
+-- verify that the "AFTER UPDATE OF columns" event is propagated correctly
+create table parted_trigger (a int, b text) partition by range (a);
+create table parted_trigger_1 partition of parted_trigger for values from (0) to (1000);
+create table parted_trigger_2 (drp int, a int, b text);
+alter table parted_trigger_2 drop column drp;
+alter table parted_trigger attach partition parted_trigger_2 for values from (1000) to (2000);
+create trigger parted_trigger after update of b on parted_trigger
+ for each row execute procedure trigger_notice_ab();
+create table parted_trigger_3 (b text, a int) partition by range (length(b));
+create table parted_trigger_3_1 partition of parted_trigger_3 for values from (1) to (4);
+create table parted_trigger_3_2 partition of parted_trigger_3 for values from (4) to (8);
+alter table parted_trigger attach partition parted_trigger_3 for values from (2000) to (3000);
+insert into parted_trigger values (0, 'a'), (1000, 'c'), (2000, 'e'), (2001, 'eeee');
+update parted_trigger set a = a + 2; -- no notices here
+update parted_trigger set b = b || 'b'; -- all triggers should fire
+NOTICE: trigger parted_trigger on parted_trigger_1 AFTER UPDATE for ROW: (a,b)=(2,ab)
+NOTICE: trigger parted_trigger on parted_trigger_2 AFTER UPDATE for ROW: (a,b)=(1002,cb)
+NOTICE: trigger parted_trigger on parted_trigger_3_1 AFTER UPDATE for ROW: (a,b)=(2002,eb)
+NOTICE: trigger parted_trigger on parted_trigger_3_2 AFTER UPDATE for ROW: (a,b)=(2003,eeeeb)
+drop table parted_trigger;
+drop function trigger_notice_ab();
+-- Make sure we don't end up with unnecessary copies of triggers, when
+-- cloning them.
+create table trg_clone (a int) partition by range (a);
+create table trg_clone1 partition of trg_clone for values from (0) to (1000);
+alter table trg_clone add constraint uniq unique (a) deferrable;
+create table trg_clone2 partition of trg_clone for values from (1000) to (2000);
+create table trg_clone3 partition of trg_clone for values from (2000) to (3000)
+ partition by range (a);
+create table trg_clone_3_3 partition of trg_clone3 for values from (2000) to (2100);
+select tgrelid::regclass, count(*) from pg_trigger
+ where tgrelid::regclass in ('trg_clone', 'trg_clone1', 'trg_clone2',
+ 'trg_clone3', 'trg_clone_3_3')
+ group by tgrelid::regclass order by tgrelid::regclass;
+ tgrelid | count
+---------------+-------
+ trg_clone | 1
+ trg_clone1 | 1
+ trg_clone2 | 1
+ trg_clone3 | 1
+ trg_clone_3_3 | 1
+(5 rows)
+
+drop table trg_clone;
+--
+-- Test the interaction between transition tables and both kinds of
+-- inheritance. We'll dump the contents of the transition tables in a
+-- format that shows the attribute order, so that we can distinguish
+-- tuple formats (though not dropped attributes).
+--
+create or replace function dump_insert() returns trigger language plpgsql as
+$$
+ begin
+ raise notice 'trigger = %, new table = %',
+ TG_NAME,
+ (select string_agg(new_table::text, ', ' order by a) from new_table);
+ return null;
+ end;
+$$;
+create or replace function dump_update() returns trigger language plpgsql as
+$$
+ begin
+ raise notice 'trigger = %, old table = %, new table = %',
+ TG_NAME,
+ (select string_agg(old_table::text, ', ' order by a) from old_table),
+ (select string_agg(new_table::text, ', ' order by a) from new_table);
+ return null;
+ end;
+$$;
+create or replace function dump_delete() returns trigger language plpgsql as
+$$
+ begin
+ raise notice 'trigger = %, old table = %',
+ TG_NAME,
+ (select string_agg(old_table::text, ', ' order by a) from old_table);
+ return null;
+ end;
+$$;
+--
+-- Verify behavior of statement triggers on partition hierarchy with
+-- transition tables. Tuples should appear to each trigger in the
+-- format of the relation the trigger is attached to.
+--
+-- set up a partition hierarchy with some different TupleDescriptors
+create table parent (a text, b int) partition by list (a);
+-- a child matching parent
+create table child1 partition of parent for values in ('AAA');
+-- a child with a dropped column
+create table child2 (x int, a text, b int);
+alter table child2 drop column x;
+alter table parent attach partition child2 for values in ('BBB');
+-- a child with a different column order
+create table child3 (b int, a text);
+alter table parent attach partition child3 for values in ('CCC');
+create trigger parent_insert_trig
+ after insert on parent referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger parent_update_trig
+ after update on parent referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger parent_delete_trig
+ after delete on parent referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child1_insert_trig
+ after insert on child1 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child1_update_trig
+ after update on child1 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child1_delete_trig
+ after delete on child1 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child2_insert_trig
+ after insert on child2 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child2_update_trig
+ after update on child2 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child2_delete_trig
+ after delete on child2 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child3_insert_trig
+ after insert on child3 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child3_update_trig
+ after update on child3 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child3_delete_trig
+ after delete on child3 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+SELECT trigger_name, event_manipulation, event_object_schema, event_object_table,
+ action_order, action_condition, action_orientation, action_timing,
+ action_reference_old_table, action_reference_new_table
+ FROM information_schema.triggers
+ WHERE event_object_table IN ('parent', 'child1', 'child2', 'child3')
+ ORDER BY trigger_name COLLATE "C", 2;
+ trigger_name | event_manipulation | event_object_schema | event_object_table | action_order | action_condition | action_orientation | action_timing | action_reference_old_table | action_reference_new_table
+--------------------+--------------------+---------------------+--------------------+--------------+------------------+--------------------+---------------+----------------------------+----------------------------
+ child1_delete_trig | DELETE | public | child1 | 1 | | STATEMENT | AFTER | old_table |
+ child1_insert_trig | INSERT | public | child1 | 1 | | STATEMENT | AFTER | | new_table
+ child1_update_trig | UPDATE | public | child1 | 1 | | STATEMENT | AFTER | old_table | new_table
+ child2_delete_trig | DELETE | public | child2 | 1 | | STATEMENT | AFTER | old_table |
+ child2_insert_trig | INSERT | public | child2 | 1 | | STATEMENT | AFTER | | new_table
+ child2_update_trig | UPDATE | public | child2 | 1 | | STATEMENT | AFTER | old_table | new_table
+ child3_delete_trig | DELETE | public | child3 | 1 | | STATEMENT | AFTER | old_table |
+ child3_insert_trig | INSERT | public | child3 | 1 | | STATEMENT | AFTER | | new_table
+ child3_update_trig | UPDATE | public | child3 | 1 | | STATEMENT | AFTER | old_table | new_table
+ parent_delete_trig | DELETE | public | parent | 1 | | STATEMENT | AFTER | old_table |
+ parent_insert_trig | INSERT | public | parent | 1 | | STATEMENT | AFTER | | new_table
+ parent_update_trig | UPDATE | public | parent | 1 | | STATEMENT | AFTER | old_table | new_table
+(12 rows)
+
+-- insert directly into children sees respective child-format tuples
+insert into child1 values ('AAA', 42);
+NOTICE: trigger = child1_insert_trig, new table = (AAA,42)
+insert into child2 values ('BBB', 42);
+NOTICE: trigger = child2_insert_trig, new table = (BBB,42)
+insert into child3 values (42, 'CCC');
+NOTICE: trigger = child3_insert_trig, new table = (42,CCC)
+-- update via parent sees parent-format tuples
+update parent set b = b + 1;
+NOTICE: trigger = parent_update_trig, old table = (AAA,42), (BBB,42), (CCC,42), new table = (AAA,43), (BBB,43), (CCC,43)
+-- delete via parent sees parent-format tuples
+delete from parent;
+NOTICE: trigger = parent_delete_trig, old table = (AAA,43), (BBB,43), (CCC,43)
+-- insert into parent sees parent-format tuples
+insert into parent values ('AAA', 42);
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42)
+insert into parent values ('BBB', 42);
+NOTICE: trigger = parent_insert_trig, new table = (BBB,42)
+insert into parent values ('CCC', 42);
+NOTICE: trigger = parent_insert_trig, new table = (CCC,42)
+-- delete from children sees respective child-format tuples
+delete from child1;
+NOTICE: trigger = child1_delete_trig, old table = (AAA,42)
+delete from child2;
+NOTICE: trigger = child2_delete_trig, old table = (BBB,42)
+delete from child3;
+NOTICE: trigger = child3_delete_trig, old table = (42,CCC)
+-- copy into parent sees parent-format tuples
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,42)
+-- DML affecting parent sees tuples collected from children even if
+-- there is no transition table trigger on the children
+drop trigger child1_insert_trig on child1;
+drop trigger child1_update_trig on child1;
+drop trigger child1_delete_trig on child1;
+drop trigger child2_insert_trig on child2;
+drop trigger child2_update_trig on child2;
+drop trigger child2_delete_trig on child2;
+drop trigger child3_insert_trig on child3;
+drop trigger child3_update_trig on child3;
+drop trigger child3_delete_trig on child3;
+delete from parent;
+NOTICE: trigger = parent_delete_trig, old table = (AAA,42), (BBB,42), (CCC,42)
+-- copy into parent sees tuples collected from children even if there
+-- is no transition-table trigger on the children
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,42)
+-- insert into parent with a before trigger on a child tuple before
+-- insertion, and we capture the newly modified row in parent format
+create or replace function intercept_insert() returns trigger language plpgsql as
+$$
+ begin
+ new.b = new.b + 1000;
+ return new;
+ end;
+$$;
+create trigger intercept_insert_child3
+ before insert on child3
+ for each row execute procedure intercept_insert();
+-- insert, parent trigger sees post-modification parent-format tuple
+insert into parent values ('AAA', 42), ('BBB', 42), ('CCC', 66);
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,1066)
+-- copy, parent trigger sees post-modification parent-format tuple
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,1234)
+drop table child1, child2, child3, parent;
+drop function intercept_insert();
+--
+-- Verify prohibition of row triggers with transition triggers on
+-- partitions
+--
+create table parent (a text, b int) partition by list (a);
+create table child partition of parent for values in ('AAA');
+-- adding row trigger with transition table fails
+create trigger child_row_trig
+ after insert on child referencing new table as new_table
+ for each row execute procedure dump_insert();
+ERROR: ROW triggers with transition tables are not supported on partitions
+-- detaching it first works
+alter table parent detach partition child;
+create trigger child_row_trig
+ after insert on child referencing new table as new_table
+ for each row execute procedure dump_insert();
+-- but now we're not allowed to reattach it
+alter table parent attach partition child for values in ('AAA');
+ERROR: trigger "child_row_trig" prevents table "child" from becoming a partition
+DETAIL: ROW triggers with transition tables are not supported on partitions
+-- drop the trigger, and now we're allowed to attach it again
+drop trigger child_row_trig on child;
+alter table parent attach partition child for values in ('AAA');
+drop table child, parent;
+--
+-- Verify behavior of statement triggers on (non-partition)
+-- inheritance hierarchy with transition tables; similar to the
+-- partition case, except there is no rerouting on insertion and child
+-- tables can have extra columns
+--
+-- set up inheritance hierarchy with different TupleDescriptors
+create table parent (a text, b int);
+-- a child matching parent
+create table child1 () inherits (parent);
+-- a child with a different column order
+create table child2 (b int, a text);
+alter table child2 inherit parent;
+-- a child with an extra column
+create table child3 (c text) inherits (parent);
+create trigger parent_insert_trig
+ after insert on parent referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger parent_update_trig
+ after update on parent referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger parent_delete_trig
+ after delete on parent referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child1_insert_trig
+ after insert on child1 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child1_update_trig
+ after update on child1 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child1_delete_trig
+ after delete on child1 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child2_insert_trig
+ after insert on child2 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child2_update_trig
+ after update on child2 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child2_delete_trig
+ after delete on child2 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child3_insert_trig
+ after insert on child3 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child3_update_trig
+ after update on child3 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child3_delete_trig
+ after delete on child3 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+-- insert directly into children sees respective child-format tuples
+insert into child1 values ('AAA', 42);
+NOTICE: trigger = child1_insert_trig, new table = (AAA,42)
+insert into child2 values (42, 'BBB');
+NOTICE: trigger = child2_insert_trig, new table = (42,BBB)
+insert into child3 values ('CCC', 42, 'foo');
+NOTICE: trigger = child3_insert_trig, new table = (CCC,42,foo)
+-- update via parent sees parent-format tuples
+update parent set b = b + 1;
+NOTICE: trigger = parent_update_trig, old table = (AAA,42), (BBB,42), (CCC,42), new table = (AAA,43), (BBB,43), (CCC,43)
+-- delete via parent sees parent-format tuples
+delete from parent;
+NOTICE: trigger = parent_delete_trig, old table = (AAA,43), (BBB,43), (CCC,43)
+-- reinsert values into children for next test...
+insert into child1 values ('AAA', 42);
+NOTICE: trigger = child1_insert_trig, new table = (AAA,42)
+insert into child2 values (42, 'BBB');
+NOTICE: trigger = child2_insert_trig, new table = (42,BBB)
+insert into child3 values ('CCC', 42, 'foo');
+NOTICE: trigger = child3_insert_trig, new table = (CCC,42,foo)
+-- delete from children sees respective child-format tuples
+delete from child1;
+NOTICE: trigger = child1_delete_trig, old table = (AAA,42)
+delete from child2;
+NOTICE: trigger = child2_delete_trig, old table = (42,BBB)
+delete from child3;
+NOTICE: trigger = child3_delete_trig, old table = (CCC,42,foo)
+-- copy into parent sees parent-format tuples (no rerouting, so these
+-- are really inserted into the parent)
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,42)
+-- same behavior for copy if there is an index (interesting because rows are
+-- captured by a different code path in copy.c if there are indexes)
+create index on parent(b);
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (DDD,42)
+-- DML affecting parent sees tuples collected from children even if
+-- there is no transition table trigger on the children
+drop trigger child1_insert_trig on child1;
+drop trigger child1_update_trig on child1;
+drop trigger child1_delete_trig on child1;
+drop trigger child2_insert_trig on child2;
+drop trigger child2_update_trig on child2;
+drop trigger child2_delete_trig on child2;
+drop trigger child3_insert_trig on child3;
+drop trigger child3_update_trig on child3;
+drop trigger child3_delete_trig on child3;
+delete from parent;
+NOTICE: trigger = parent_delete_trig, old table = (AAA,42), (BBB,42), (CCC,42), (DDD,42)
+drop table child1, child2, child3, parent;
+--
+-- Verify prohibition of row triggers with transition triggers on
+-- inheritance children
+--
+create table parent (a text, b int);
+create table child () inherits (parent);
+-- adding row trigger with transition table fails
+create trigger child_row_trig
+ after insert on child referencing new table as new_table
+ for each row execute procedure dump_insert();
+ERROR: ROW triggers with transition tables are not supported on inheritance children
+-- disinheriting it first works
+alter table child no inherit parent;
+create trigger child_row_trig
+ after insert on child referencing new table as new_table
+ for each row execute procedure dump_insert();
+-- but now we're not allowed to make it inherit anymore
+alter table child inherit parent;
+ERROR: trigger "child_row_trig" prevents table "child" from becoming an inheritance child
+DETAIL: ROW triggers with transition tables are not supported in inheritance hierarchies.
+-- drop the trigger, and now we're allowed to make it inherit again
+drop trigger child_row_trig on child;
+alter table child inherit parent;
+drop table child, parent;
+--
+-- Verify behavior of queries with wCTEs, where multiple transition
+-- tuplestores can be active at the same time because there are
+-- multiple DML statements that might fire triggers with transition
+-- tables
+--
+create table table1 (a int);
+create table table2 (a text);
+create trigger table1_trig
+ after insert on table1 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger table2_trig
+ after insert on table2 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+with wcte as (insert into table1 values (42))
+ insert into table2 values ('hello world');
+NOTICE: trigger = table2_trig, new table = ("hello world")
+NOTICE: trigger = table1_trig, new table = (42)
+with wcte as (insert into table1 values (43))
+ insert into table1 values (44);
+NOTICE: trigger = table1_trig, new table = (43), (44)
+select * from table1;
+ a
+----
+ 42
+ 44
+ 43
+(3 rows)
+
+select * from table2;
+ a
+-------------
+ hello world
+(1 row)
+
+drop table table1;
+drop table table2;
+--
+-- Verify behavior of INSERT ... ON CONFLICT DO UPDATE ... with
+-- transition tables.
+--
+create table my_table (a int primary key, b text);
+create trigger my_table_insert_trig
+ after insert on my_table referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger my_table_update_trig
+ after update on my_table referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+-- inserts only
+insert into my_table values (1, 'AAA'), (2, 'BBB')
+ on conflict (a) do
+ update set b = my_table.b || ':' || excluded.b;
+NOTICE: trigger = my_table_update_trig, old table = <NULL>, new table = <NULL>
+NOTICE: trigger = my_table_insert_trig, new table = (1,AAA), (2,BBB)
+-- mixture of inserts and updates
+insert into my_table values (1, 'AAA'), (2, 'BBB'), (3, 'CCC'), (4, 'DDD')
+ on conflict (a) do
+ update set b = my_table.b || ':' || excluded.b;
+NOTICE: trigger = my_table_update_trig, old table = (1,AAA), (2,BBB), new table = (1,AAA:AAA), (2,BBB:BBB)
+NOTICE: trigger = my_table_insert_trig, new table = (3,CCC), (4,DDD)
+-- updates only
+insert into my_table values (3, 'CCC'), (4, 'DDD')
+ on conflict (a) do
+ update set b = my_table.b || ':' || excluded.b;
+NOTICE: trigger = my_table_update_trig, old table = (3,CCC), (4,DDD), new table = (3,CCC:CCC), (4,DDD:DDD)
+NOTICE: trigger = my_table_insert_trig, new table = <NULL>
+--
+-- now using a partitioned table
+--
+create table iocdu_tt_parted (a int primary key, b text) partition by list (a);
+create table iocdu_tt_parted1 partition of iocdu_tt_parted for values in (1);
+create table iocdu_tt_parted2 partition of iocdu_tt_parted for values in (2);
+create table iocdu_tt_parted3 partition of iocdu_tt_parted for values in (3);
+create table iocdu_tt_parted4 partition of iocdu_tt_parted for values in (4);
+create trigger iocdu_tt_parted_insert_trig
+ after insert on iocdu_tt_parted referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger iocdu_tt_parted_update_trig
+ after update on iocdu_tt_parted referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+-- inserts only
+insert into iocdu_tt_parted values (1, 'AAA'), (2, 'BBB')
+ on conflict (a) do
+ update set b = iocdu_tt_parted.b || ':' || excluded.b;
+NOTICE: trigger = iocdu_tt_parted_update_trig, old table = <NULL>, new table = <NULL>
+NOTICE: trigger = iocdu_tt_parted_insert_trig, new table = (1,AAA), (2,BBB)
+-- mixture of inserts and updates
+insert into iocdu_tt_parted values (1, 'AAA'), (2, 'BBB'), (3, 'CCC'), (4, 'DDD')
+ on conflict (a) do
+ update set b = iocdu_tt_parted.b || ':' || excluded.b;
+NOTICE: trigger = iocdu_tt_parted_update_trig, old table = (1,AAA), (2,BBB), new table = (1,AAA:AAA), (2,BBB:BBB)
+NOTICE: trigger = iocdu_tt_parted_insert_trig, new table = (3,CCC), (4,DDD)
+-- updates only
+insert into iocdu_tt_parted values (3, 'CCC'), (4, 'DDD')
+ on conflict (a) do
+ update set b = iocdu_tt_parted.b || ':' || excluded.b;
+NOTICE: trigger = iocdu_tt_parted_update_trig, old table = (3,CCC), (4,DDD), new table = (3,CCC:CCC), (4,DDD:DDD)
+NOTICE: trigger = iocdu_tt_parted_insert_trig, new table = <NULL>
+drop table iocdu_tt_parted;
+--
+-- Verify that you can't create a trigger with transition tables for
+-- more than one event.
+--
+create trigger my_table_multievent_trig
+ after insert or update on my_table referencing new table as new_table
+ for each statement execute procedure dump_insert();
+ERROR: transition tables cannot be specified for triggers with more than one event
+--
+-- Verify that you can't create a trigger with transition tables with
+-- a column list.
+--
+create trigger my_table_col_update_trig
+ after update of b on my_table referencing new table as new_table
+ for each statement execute procedure dump_insert();
+ERROR: transition tables cannot be specified for triggers with column lists
+drop table my_table;
+--
+-- Test firing of triggers with transition tables by foreign key cascades
+--
+create table refd_table (a int primary key, b text);
+create table trig_table (a int, b text,
+ foreign key (a) references refd_table on update cascade on delete cascade
+);
+create trigger trig_table_before_trig
+ before insert or update or delete on trig_table
+ for each statement execute procedure trigger_func('trig_table');
+create trigger trig_table_insert_trig
+ after insert on trig_table referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger trig_table_update_trig
+ after update on trig_table referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger trig_table_delete_trig
+ after delete on trig_table referencing old table as old_table
+ for each statement execute procedure dump_delete();
+insert into refd_table values
+ (1, 'one'),
+ (2, 'two'),
+ (3, 'three');
+insert into trig_table values
+ (1, 'one a'),
+ (1, 'one b'),
+ (2, 'two a'),
+ (2, 'two b'),
+ (3, 'three a'),
+ (3, 'three b');
+NOTICE: trigger_func(trig_table) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger = trig_table_insert_trig, new table = (1,"one a"), (1,"one b"), (2,"two a"), (2,"two b"), (3,"three a"), (3,"three b")
+update refd_table set a = 11 where b = 'one';
+NOTICE: trigger_func(trig_table) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = trig_table_update_trig, old table = (1,"one a"), (1,"one b"), new table = (11,"one a"), (11,"one b")
+select * from trig_table;
+ a | b
+----+---------
+ 2 | two a
+ 2 | two b
+ 3 | three a
+ 3 | three b
+ 11 | one a
+ 11 | one b
+(6 rows)
+
+delete from refd_table where length(b) = 3;
+NOTICE: trigger_func(trig_table) called: action = DELETE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = trig_table_delete_trig, old table = (2,"two a"), (2,"two b"), (11,"one a"), (11,"one b")
+select * from trig_table;
+ a | b
+---+---------
+ 3 | three a
+ 3 | three b
+(2 rows)
+
+drop table refd_table, trig_table;
+--
+-- self-referential FKs are even more fun
+--
+create table self_ref (a int primary key,
+ b int references self_ref(a) on delete cascade);
+create trigger self_ref_before_trig
+ before delete on self_ref
+ for each statement execute procedure trigger_func('self_ref');
+create trigger self_ref_r_trig
+ after delete on self_ref referencing old table as old_table
+ for each row execute procedure dump_delete();
+create trigger self_ref_s_trig
+ after delete on self_ref referencing old table as old_table
+ for each statement execute procedure dump_delete();
+insert into self_ref values (1, null), (2, 1), (3, 2);
+delete from self_ref where a = 1;
+NOTICE: trigger_func(self_ref) called: action = DELETE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = self_ref_r_trig, old table = (1,), (2,1)
+NOTICE: trigger_func(self_ref) called: action = DELETE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = self_ref_r_trig, old table = (1,), (2,1)
+NOTICE: trigger = self_ref_s_trig, old table = (1,), (2,1)
+NOTICE: trigger = self_ref_r_trig, old table = (3,2)
+NOTICE: trigger = self_ref_s_trig, old table = (3,2)
+-- without AR trigger, cascaded deletes all end up in one transition table
+drop trigger self_ref_r_trig on self_ref;
+insert into self_ref values (1, null), (2, 1), (3, 2), (4, 3);
+delete from self_ref where a = 1;
+NOTICE: trigger_func(self_ref) called: action = DELETE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = self_ref_s_trig, old table = (1,), (2,1), (3,2), (4,3)
+drop table self_ref;
+-- cleanup
+drop function dump_insert();
+drop function dump_update();
+drop function dump_delete();
diff --git a/src/test/regress/expected/tsrf_1.out b/src/test/regress/expected/tsrf_1.out
new file mode 100644
index 00000000000..a0f7d80c699
--- /dev/null
+++ b/src/test/regress/expected/tsrf_1.out
@@ -0,0 +1,712 @@
+--
+-- tsrf - targetlist set returning function tests
+--
+-- simple srf
+SELECT generate_series(1, 3);
+ generate_series
+-----------------
+ 1
+ 2
+ 3
+(3 rows)
+
+-- parallel iteration
+SELECT generate_series(1, 3), generate_series(3,5);
+ generate_series | generate_series
+-----------------+-----------------
+ 1 | 3
+ 2 | 4
+ 3 | 5
+(3 rows)
+
+-- parallel iteration, different number of rows
+SELECT generate_series(1, 2), generate_series(1,4);
+ generate_series | generate_series
+-----------------+-----------------
+ 1 | 1
+ 2 | 2
+ | 3
+ | 4
+(4 rows)
+
+-- srf, with SRF argument
+SELECT generate_series(1, generate_series(1, 3));
+ generate_series
+-----------------
+ 1
+ 1
+ 2
+ 1
+ 2
+ 3
+(6 rows)
+
+-- but we've traditionally rejected the same in FROM
+SELECT * FROM generate_series(1, generate_series(1, 3));
+ERROR: set-returning functions must appear at top level of FROM
+LINE 1: SELECT * FROM generate_series(1, generate_series(1, 3));
+ ^
+-- srf, with two SRF arguments
+SELECT generate_series(generate_series(1,3), generate_series(2, 4));
+ generate_series
+-----------------
+ 1
+ 2
+ 2
+ 3
+ 3
+ 4
+(6 rows)
+
+-- check proper nesting of SRFs in different expressions
+explain (verbose, costs off)
+SELECT generate_series(1, generate_series(1, 3)), generate_series(2, 4);
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ ProjectSet
+ Output: generate_series(1, (generate_series(1, 3))), (generate_series(2, 4))
+ -> ProjectSet
+ Output: generate_series(1, 3), generate_series(2, 4)
+ -> Result
+(5 rows)
+
+SELECT generate_series(1, generate_series(1, 3)), generate_series(2, 4);
+ generate_series | generate_series
+-----------------+-----------------
+ 1 | 2
+ 1 | 3
+ 2 | 3
+ 1 | 4
+ 2 | 4
+ 3 | 4
+(6 rows)
+
+CREATE TABLE few(id int, dataa text, datab text);
+INSERT INTO few VALUES(1, 'a', 'foo'),(2, 'a', 'bar'),(3, 'b', 'bar');
+-- SRF with a provably-dummy relation
+explain (verbose, costs off)
+SELECT unnest(ARRAY[1, 2]) FROM few WHERE false;
+ QUERY PLAN
+--------------------------------------
+ ProjectSet
+ Output: unnest('{1,2}'::integer[])
+ -> Result
+ One-Time Filter: false
+(4 rows)
+
+SELECT unnest(ARRAY[1, 2]) FROM few WHERE false;
+ unnest
+--------
+(0 rows)
+
+-- SRF shouldn't prevent upper query from recognizing lower as dummy
+explain (verbose, costs off)
+SELECT * FROM few f1,
+ (SELECT unnest(ARRAY[1,2]) FROM few f2 WHERE false OFFSET 0) ss;
+ QUERY PLAN
+------------------------------------------------
+ Result
+ Output: f1.id, f1.dataa, f1.datab, ss.unnest
+ One-Time Filter: false
+(3 rows)
+
+SELECT * FROM few f1,
+ (SELECT unnest(ARRAY[1,2]) FROM few f2 WHERE false OFFSET 0) ss;
+ id | dataa | datab | unnest
+----+-------+-------+--------
+(0 rows)
+
+-- SRF output order of sorting is maintained, if SRF is not referenced
+SELECT few.id, generate_series(1,3) g FROM few ORDER BY id DESC;
+ id | g
+----+---
+ 3 | 1
+ 3 | 2
+ 3 | 3
+ 2 | 1
+ 2 | 2
+ 2 | 3
+ 1 | 1
+ 1 | 2
+ 1 | 3
+(9 rows)
+
+-- but SRFs can be referenced in sort
+SELECT few.id, generate_series(1,3) g FROM few ORDER BY id, g DESC;
+ id | g
+----+---
+ 1 | 3
+ 1 | 2
+ 1 | 1
+ 2 | 3
+ 2 | 2
+ 2 | 1
+ 3 | 3
+ 3 | 2
+ 3 | 1
+(9 rows)
+
+SELECT few.id, generate_series(1,3) g FROM few ORDER BY id, generate_series(1,3) DESC;
+ id | g
+----+---
+ 1 | 3
+ 1 | 2
+ 1 | 1
+ 2 | 3
+ 2 | 2
+ 2 | 1
+ 3 | 3
+ 3 | 2
+ 3 | 1
+(9 rows)
+
+-- it's weird to have ORDER BYs that increase the number of results
+SELECT few.id FROM few ORDER BY id, generate_series(1,3) DESC;
+ id
+----
+ 1
+ 1
+ 1
+ 2
+ 2
+ 2
+ 3
+ 3
+ 3
+(9 rows)
+
+-- SRFs are computed after aggregation
+SET enable_hashagg TO 0; -- stable output order
+SELECT few.dataa, count(*), min(id), max(id), unnest('{1,1,3}'::int[]) FROM few WHERE few.id = 1 GROUP BY few.dataa;
+ dataa | count | min | max | unnest
+-------+-------+-----+-----+--------
+ a | 1 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 3
+(3 rows)
+
+-- unless referenced in GROUP BY clause
+SELECT few.dataa, count(*), min(id), max(id), unnest('{1,1,3}'::int[]) FROM few WHERE few.id = 1 GROUP BY few.dataa, unnest('{1,1,3}'::int[]);
+ dataa | count | min | max | unnest
+-------+-------+-----+-----+--------
+ a | 2 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 3
+(2 rows)
+
+SELECT few.dataa, count(*), min(id), max(id), unnest('{1,1,3}'::int[]) FROM few WHERE few.id = 1 GROUP BY few.dataa, 5;
+ dataa | count | min | max | unnest
+-------+-------+-----+-----+--------
+ a | 2 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 3
+(2 rows)
+
+RESET enable_hashagg;
+-- check HAVING works when GROUP BY does [not] reference SRF output
+SELECT dataa, generate_series(1,1), count(*) FROM few GROUP BY 1 HAVING count(*) > 1;
+ dataa | generate_series | count
+-------+-----------------+-------
+ a | 1 | 2
+(1 row)
+
+SELECT dataa, generate_series(1,1), count(*) FROM few GROUP BY 1, 2 HAVING count(*) > 1;
+ dataa | generate_series | count
+-------+-----------------+-------
+ a | 1 | 2
+(1 row)
+
+-- it's weird to have GROUP BYs that increase the number of results
+SELECT few.dataa, count(*) FROM few WHERE dataa = 'a' GROUP BY few.dataa ORDER BY 2;
+ dataa | count
+-------+-------
+ a | 2
+(1 row)
+
+SELECT few.dataa, count(*) FROM few WHERE dataa = 'a' GROUP BY few.dataa, unnest('{1,1,3}'::int[]) ORDER BY 2;
+ dataa | count
+-------+-------
+ a | 2
+ a | 4
+(2 rows)
+
+-- SRFs are not allowed if they'd need to be conditionally executed
+SELECT q1, case when q1 > 0 then generate_series(1,3) else 0 end FROM int8_tbl;
+ERROR: set-returning functions are not allowed in CASE
+LINE 1: SELECT q1, case when q1 > 0 then generate_series(1,3) else 0...
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+SELECT q1, coalesce(generate_series(1,3), 0) FROM int8_tbl;
+ERROR: set-returning functions are not allowed in COALESCE
+LINE 1: SELECT q1, coalesce(generate_series(1,3), 0) FROM int8_tbl;
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+-- SRFs are not allowed in aggregate arguments
+SELECT min(generate_series(1, 3)) FROM few;
+ERROR: aggregate function calls cannot contain set-returning function calls
+LINE 1: SELECT min(generate_series(1, 3)) FROM few;
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+-- ... unless they're within a sub-select
+SELECT sum((3 = ANY(SELECT generate_series(1,4)))::int);
+ sum
+-----
+ 1
+(1 row)
+
+SELECT sum((3 = ANY(SELECT lag(x) over(order by x)
+ FROM generate_series(1,4) x))::int);
+ sum
+-----
+ 1
+(1 row)
+
+-- SRFs are not allowed in window function arguments, either
+SELECT min(generate_series(1, 3)) OVER() FROM few;
+ERROR: window function calls cannot contain set-returning function calls
+LINE 1: SELECT min(generate_series(1, 3)) OVER() FROM few;
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+-- SRFs are normally computed after window functions
+SELECT id,lag(id) OVER(), count(*) OVER(), generate_series(1,3) FROM few;
+ id | lag | count | generate_series
+----+-----+-------+-----------------
+ 1 | | 3 | 1
+ 1 | | 3 | 2
+ 1 | | 3 | 3
+ 2 | 1 | 3 | 1
+ 2 | 1 | 3 | 2
+ 2 | 1 | 3 | 3
+ 3 | 2 | 3 | 1
+ 3 | 2 | 3 | 2
+ 3 | 2 | 3 | 3
+(9 rows)
+
+-- unless referencing SRFs
+SELECT SUM(count(*)) OVER(PARTITION BY generate_series(1,3) ORDER BY generate_series(1,3)), generate_series(1,3) g FROM few GROUP BY g;
+ sum | g
+-----+---
+ 3 | 1
+ 3 | 2
+ 3 | 3
+(3 rows)
+
+-- sorting + grouping
+SELECT few.dataa, count(*), min(id), max(id), generate_series(1,3) FROM few GROUP BY few.dataa ORDER BY 5, 1;
+ dataa | count | min | max | generate_series
+-------+-------+-----+-----+-----------------
+ a | 2 | 1 | 2 | 1
+ b | 1 | 3 | 3 | 1
+ a | 2 | 1 | 2 | 2
+ b | 1 | 3 | 3 | 2
+ a | 2 | 1 | 2 | 3
+ b | 1 | 3 | 3 | 3
+(6 rows)
+
+-- grouping sets are a bit special, they produce NULLs in columns not actually NULL
+set enable_hashagg = false;
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab);
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | | 1 | 2
+ a | | 2 | 2
+ b | bar | 1 | 1
+ b | bar | 2 | 1
+ b | | 1 | 1
+ b | | 2 | 1
+ | | 1 | 3
+ | | 2 | 3
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | foo | 1 | 1
+ | foo | 2 | 1
+(16 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab) ORDER BY dataa;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | | 1 | 2
+ a | | 2 | 2
+ b | bar | 1 | 1
+ b | bar | 2 | 1
+ b | | 1 | 1
+ b | | 2 | 1
+ | | 1 | 3
+ | | 2 | 3
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | foo | 1 | 1
+ | foo | 2 | 1
+(16 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab) ORDER BY g;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | foo | 1 | 1
+ a | | 1 | 2
+ b | bar | 1 | 1
+ b | | 1 | 1
+ | | 1 | 3
+ | bar | 1 | 2
+ | foo | 1 | 1
+ | foo | 2 | 1
+ a | bar | 2 | 1
+ b | | 2 | 1
+ a | foo | 2 | 1
+ | bar | 2 | 2
+ a | | 2 | 2
+ | | 2 | 3
+ b | bar | 2 | 1
+(16 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab, g);
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | bar | | 2
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | foo | | 2
+ a | | | 4
+ b | bar | 1 | 1
+ b | bar | 2 | 1
+ b | bar | | 2
+ b | | | 2
+ | | | 6
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | bar | | 4
+ | foo | 1 | 1
+ | foo | 2 | 1
+ | foo | | 2
+ a | | 1 | 2
+ b | | 1 | 1
+ | | 1 | 3
+ a | | 2 | 2
+ b | | 2 | 1
+ | | 2 | 3
+(24 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab, g) ORDER BY dataa;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | foo | | 2
+ a | | | 4
+ a | | 2 | 2
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | bar | | 2
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | | 1 | 2
+ b | bar | 1 | 1
+ b | | | 2
+ b | | 1 | 1
+ b | bar | 2 | 1
+ b | bar | | 2
+ b | | 2 | 1
+ | | 2 | 3
+ | | | 6
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | bar | | 4
+ | foo | 1 | 1
+ | foo | 2 | 1
+ | foo | | 2
+ | | 1 | 3
+(24 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab, g) ORDER BY g;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | foo | 1 | 1
+ b | bar | 1 | 1
+ | bar | 1 | 2
+ | foo | 1 | 1
+ a | | 1 | 2
+ b | | 1 | 1
+ | | 1 | 3
+ a | | 2 | 2
+ b | | 2 | 1
+ | bar | 2 | 2
+ | | 2 | 3
+ | foo | 2 | 1
+ a | bar | 2 | 1
+ a | foo | 2 | 1
+ b | bar | 2 | 1
+ a | | | 4
+ b | bar | | 2
+ b | | | 2
+ | | | 6
+ a | foo | | 2
+ a | bar | | 2
+ | bar | | 4
+ | foo | | 2
+(24 rows)
+
+reset enable_hashagg;
+-- case with degenerate ORDER BY
+explain (verbose, costs off)
+select 'foo' as f, generate_series(1,2) as g from few order by 1;
+ QUERY PLAN
+------------------------------------------------
+ ProjectSet
+ Output: ('foo'::text), generate_series(1, 2)
+ -> Seq Scan on public.few
+ Output: 'foo'::text
+(4 rows)
+
+select 'foo' as f, generate_series(1,2) as g from few order by 1;
+ f | g
+-----+---
+ foo | 1
+ foo | 2
+ foo | 1
+ foo | 2
+ foo | 1
+ foo | 2
+(6 rows)
+
+-- data modification
+CREATE TABLE fewmore AS SELECT generate_series(1,3) AS data;
+INSERT INTO fewmore VALUES(generate_series(4,5));
+SELECT * FROM fewmore;
+ data
+------
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+-- SRFs are not allowed in UPDATE (they once were, but it was nonsense)
+UPDATE fewmore SET data = generate_series(4,9);
+ERROR: set-returning functions are not allowed in UPDATE
+LINE 1: UPDATE fewmore SET data = generate_series(4,9);
+ ^
+-- SRFs are not allowed in RETURNING
+INSERT INTO fewmore VALUES(1) RETURNING generate_series(1,3);
+ERROR: set-returning functions are not allowed in RETURNING
+LINE 1: INSERT INTO fewmore VALUES(1) RETURNING generate_series(1,3)...
+ ^
+-- nor standalone VALUES (but surely this is a bug?)
+VALUES(1, generate_series(1,2));
+ERROR: set-returning functions are not allowed in VALUES
+LINE 1: VALUES(1, generate_series(1,2));
+ ^
+-- We allow tSRFs that are not at top level
+SELECT int4mul(generate_series(1,2), 10);
+ int4mul
+---------
+ 10
+ 20
+(2 rows)
+
+SELECT generate_series(1,3) IS DISTINCT FROM 2;
+ ?column?
+----------
+ t
+ f
+ t
+(3 rows)
+
+-- but SRFs in function RTEs must be at top level (annoying restriction)
+SELECT * FROM int4mul(generate_series(1,2), 10);
+ERROR: set-returning functions must appear at top level of FROM
+LINE 1: SELECT * FROM int4mul(generate_series(1,2), 10);
+ ^
+-- DISTINCT ON is evaluated before tSRF evaluation if SRF is not
+-- referenced either in ORDER BY or in the DISTINCT ON list. The ORDER
+-- BY reference can be implicitly generated, if there's no other ORDER BY.
+-- implicit reference (via implicit ORDER) to all columns
+SELECT DISTINCT ON (a) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b);
+ a | b | g
+---+---+---
+ 1 | 1 | 1
+ 3 | 2 | 1
+ 5 | 3 | 1
+(3 rows)
+
+-- unreferenced in DISTINCT ON or ORDER BY
+SELECT DISTINCT ON (a) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b)
+ORDER BY a, b DESC;
+ a | b | g
+---+---+---
+ 1 | 4 | 1
+ 1 | 4 | 2
+ 1 | 4 | 3
+ 3 | 2 | 1
+ 3 | 2 | 2
+ 3 | 2 | 3
+ 5 | 3 | 1
+ 5 | 3 | 2
+ 5 | 3 | 3
+(9 rows)
+
+-- referenced in ORDER BY
+SELECT DISTINCT ON (a) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b)
+ORDER BY a, b DESC, g DESC;
+ a | b | g
+---+---+---
+ 1 | 4 | 3
+ 3 | 2 | 3
+ 5 | 3 | 3
+(3 rows)
+
+-- referenced in ORDER BY and DISTINCT ON
+SELECT DISTINCT ON (a, b, g) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b)
+ORDER BY a, b DESC, g DESC;
+ a | b | g
+---+---+---
+ 1 | 4 | 3
+ 1 | 4 | 2
+ 1 | 4 | 1
+ 1 | 1 | 3
+ 1 | 1 | 2
+ 1 | 1 | 1
+ 3 | 2 | 3
+ 3 | 2 | 2
+ 3 | 2 | 1
+ 3 | 1 | 3
+ 3 | 1 | 2
+ 3 | 1 | 1
+ 5 | 3 | 3
+ 5 | 3 | 2
+ 5 | 3 | 1
+ 5 | 1 | 3
+ 5 | 1 | 2
+ 5 | 1 | 1
+(18 rows)
+
+-- only SRF mentioned in DISTINCT ON
+SELECT DISTINCT ON (g) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b);
+ a | b | g
+---+---+---
+ 3 | 2 | 1
+ 5 | 1 | 2
+ 3 | 1 | 3
+(3 rows)
+
+-- LIMIT / OFFSET is evaluated after SRF evaluation
+SELECT a, generate_series(1,2) FROM (VALUES(1),(2),(3)) r(a) LIMIT 2 OFFSET 2;
+ a | generate_series
+---+-----------------
+ 2 | 1
+ 2 | 2
+(2 rows)
+
+-- SRFs are not allowed in LIMIT.
+SELECT 1 LIMIT generate_series(1,3);
+ERROR: set-returning functions are not allowed in LIMIT
+LINE 1: SELECT 1 LIMIT generate_series(1,3);
+ ^
+-- tSRF in correlated subquery, referencing table outside
+SELECT (SELECT generate_series(1,3) LIMIT 1 OFFSET few.id) FROM few;
+ generate_series
+-----------------
+ 2
+ 3
+
+(3 rows)
+
+-- tSRF in correlated subquery, referencing SRF outside
+SELECT (SELECT generate_series(1,3) LIMIT 1 OFFSET g.i) FROM generate_series(0,3) g(i);
+ generate_series
+-----------------
+ 1
+ 2
+ 3
+
+(4 rows)
+
+-- Operators can return sets too
+CREATE OPERATOR |@| (PROCEDURE = unnest, RIGHTARG = ANYARRAY);
+SELECT |@|ARRAY[1,2,3];
+ ?column?
+----------
+ 1
+ 2
+ 3
+(3 rows)
+
+-- Some fun cases involving duplicate SRF calls
+explain (verbose, costs off)
+select generate_series(1,3) as x, generate_series(1,3) + 1 as xp1;
+ QUERY PLAN
+------------------------------------------------------------------
+ Result
+ Output: (generate_series(1, 3)), ((generate_series(1, 3)) + 1)
+ -> ProjectSet
+ Output: generate_series(1, 3)
+ -> Result
+(5 rows)
+
+select generate_series(1,3) as x, generate_series(1,3) + 1 as xp1;
+ x | xp1
+---+-----
+ 1 | 2
+ 2 | 3
+ 3 | 4
+(3 rows)
+
+explain (verbose, costs off)
+select generate_series(1,3)+1 order by generate_series(1,3);
+ QUERY PLAN
+------------------------------------------------------------------------
+ Sort
+ Output: (((generate_series(1, 3)) + 1)), (generate_series(1, 3))
+ Sort Key: (generate_series(1, 3))
+ -> Result
+ Output: ((generate_series(1, 3)) + 1), (generate_series(1, 3))
+ -> ProjectSet
+ Output: generate_series(1, 3)
+ -> Result
+(8 rows)
+
+select generate_series(1,3)+1 order by generate_series(1,3);
+ ?column?
+----------
+ 2
+ 3
+ 4
+(3 rows)
+
+-- Check that SRFs of same nesting level run in lockstep
+explain (verbose, costs off)
+select generate_series(1,3) as x, generate_series(3,6) + 1 as y;
+ QUERY PLAN
+------------------------------------------------------------------
+ Result
+ Output: (generate_series(1, 3)), ((generate_series(3, 6)) + 1)
+ -> ProjectSet
+ Output: generate_series(1, 3), generate_series(3, 6)
+ -> Result
+(5 rows)
+
+select generate_series(1,3) as x, generate_series(3,6) + 1 as y;
+ x | y
+---+---
+ 1 | 4
+ 2 | 5
+ 3 | 6
+ | 7
+(4 rows)
+
+-- Clean up
+DROP TABLE few;
+DROP TABLE fewmore;
diff --git a/src/test/regress/expected/updatable_views_1.out b/src/test/regress/expected/updatable_views_1.out
new file mode 100644
index 00000000000..2a8ed174e21
--- /dev/null
+++ b/src/test/regress/expected/updatable_views_1.out
@@ -0,0 +1,3018 @@
+--
+-- UPDATABLE VIEWS
+--
+-- avoid bit-exact output here because operations may not be bit-exact.
+SET extra_float_digits = 0;
+-- check that non-updatable views and columns are rejected with useful error
+-- messages
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW ro_view1 AS SELECT DISTINCT a, b FROM base_tbl; -- DISTINCT not supported
+CREATE VIEW ro_view2 AS SELECT a, b FROM base_tbl GROUP BY a, b; -- GROUP BY not supported
+CREATE VIEW ro_view3 AS SELECT 1 FROM base_tbl HAVING max(a) > 0; -- HAVING not supported
+CREATE VIEW ro_view4 AS SELECT count(*) FROM base_tbl; -- Aggregate functions not supported
+CREATE VIEW ro_view5 AS SELECT a, rank() OVER() FROM base_tbl; -- Window functions not supported
+CREATE VIEW ro_view6 AS SELECT a, b FROM base_tbl UNION SELECT -a, b FROM base_tbl; -- Set ops not supported
+CREATE VIEW ro_view7 AS WITH t AS (SELECT a, b FROM base_tbl) SELECT * FROM t; -- WITH not supported
+CREATE VIEW ro_view8 AS SELECT a, b FROM base_tbl ORDER BY a OFFSET 1; -- OFFSET not supported
+CREATE VIEW ro_view9 AS SELECT a, b FROM base_tbl ORDER BY a LIMIT 1; -- LIMIT not supported
+CREATE VIEW ro_view10 AS SELECT 1 AS a; -- No base relations
+CREATE VIEW ro_view11 AS SELECT b1.a, b2.b FROM base_tbl b1, base_tbl b2; -- Multiple base relations
+CREATE VIEW ro_view12 AS SELECT * FROM generate_series(1, 10) AS g(a); -- SRF in rangetable
+CREATE VIEW ro_view13 AS SELECT a, b FROM (SELECT * FROM base_tbl) AS t; -- Subselect in rangetable
+CREATE VIEW rw_view14 AS SELECT ctid, a, b FROM base_tbl; -- System columns may be part of an updatable view
+CREATE VIEW rw_view15 AS SELECT a, upper(b) FROM base_tbl; -- Expression/function may be part of an updatable view
+CREATE VIEW rw_view16 AS SELECT a, b, a AS aa FROM base_tbl; -- Repeated column may be part of an updatable view
+CREATE VIEW ro_view17 AS SELECT * FROM ro_view1; -- Base relation not updatable
+CREATE VIEW ro_view18 AS SELECT * FROM (VALUES(1)) AS tmp(a); -- VALUES in rangetable
+CREATE SEQUENCE uv_seq;
+CREATE VIEW ro_view19 AS SELECT * FROM uv_seq; -- View based on a sequence
+CREATE VIEW ro_view20 AS SELECT a, b, generate_series(1, a) g FROM base_tbl; -- SRF in targetlist not supported
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ ro_view1 | NO
+ ro_view10 | NO
+ ro_view11 | NO
+ ro_view12 | NO
+ ro_view13 | NO
+ ro_view17 | NO
+ ro_view18 | NO
+ ro_view19 | NO
+ ro_view2 | NO
+ ro_view20 | NO
+ ro_view3 | NO
+ ro_view4 | NO
+ ro_view5 | NO
+ ro_view6 | NO
+ ro_view7 | NO
+ ro_view8 | NO
+ ro_view9 | NO
+ rw_view14 | YES
+ rw_view15 | YES
+ rw_view16 | YES
+(20 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ ro_view1 | NO | NO
+ ro_view10 | NO | NO
+ ro_view11 | NO | NO
+ ro_view12 | NO | NO
+ ro_view13 | NO | NO
+ ro_view17 | NO | NO
+ ro_view18 | NO | NO
+ ro_view19 | NO | NO
+ ro_view2 | NO | NO
+ ro_view20 | NO | NO
+ ro_view3 | NO | NO
+ ro_view4 | NO | NO
+ ro_view5 | NO | NO
+ ro_view6 | NO | NO
+ ro_view7 | NO | NO
+ ro_view8 | NO | NO
+ ro_view9 | NO | NO
+ rw_view14 | YES | YES
+ rw_view15 | YES | YES
+ rw_view16 | YES | YES
+(20 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ ro_view1 | a | NO
+ ro_view1 | b | NO
+ ro_view10 | a | NO
+ ro_view11 | a | NO
+ ro_view11 | b | NO
+ ro_view12 | a | NO
+ ro_view13 | a | NO
+ ro_view13 | b | NO
+ ro_view17 | a | NO
+ ro_view17 | b | NO
+ ro_view18 | a | NO
+ ro_view19 | last_value | NO
+ ro_view19 | log_cnt | NO
+ ro_view19 | is_called | NO
+ ro_view2 | a | NO
+ ro_view2 | b | NO
+ ro_view20 | a | NO
+ ro_view20 | b | NO
+ ro_view20 | g | NO
+ ro_view3 | ?column? | NO
+ ro_view4 | count | NO
+ ro_view5 | a | NO
+ ro_view5 | rank | NO
+ ro_view6 | a | NO
+ ro_view6 | b | NO
+ ro_view7 | a | NO
+ ro_view7 | b | NO
+ ro_view8 | a | NO
+ ro_view8 | b | NO
+ ro_view9 | a | NO
+ ro_view9 | b | NO
+ rw_view14 | ctid | NO
+ rw_view14 | a | YES
+ rw_view14 | b | YES
+ rw_view15 | a | YES
+ rw_view15 | upper | NO
+ rw_view16 | a | YES
+ rw_view16 | b | YES
+ rw_view16 | aa | YES
+(39 rows)
+
+-- Read-only views
+DELETE FROM ro_view1;
+ERROR: cannot delete from view "ro_view1"
+DETAIL: Views containing DISTINCT are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view2;
+ERROR: cannot delete from view "ro_view2"
+DETAIL: Views containing GROUP BY are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view3;
+ERROR: cannot delete from view "ro_view3"
+DETAIL: Views containing HAVING are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view4;
+ERROR: cannot delete from view "ro_view4"
+DETAIL: Views that return aggregate functions are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view5;
+ERROR: cannot delete from view "ro_view5"
+DETAIL: Views that return window functions are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view6;
+ERROR: cannot delete from view "ro_view6"
+DETAIL: Views containing UNION, INTERSECT, or EXCEPT are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+UPDATE ro_view7 SET a=a+1;
+ERROR: cannot update view "ro_view7"
+DETAIL: Views containing WITH are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view8 SET a=a+1;
+ERROR: cannot update view "ro_view8"
+DETAIL: Views containing LIMIT or OFFSET are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view9 SET a=a+1;
+ERROR: cannot update view "ro_view9"
+DETAIL: Views containing LIMIT or OFFSET are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view10 SET a=a+1;
+ERROR: cannot update view "ro_view10"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view11 SET a=a+1;
+ERROR: cannot update view "ro_view11"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view12 SET a=a+1;
+ERROR: cannot update view "ro_view12"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+INSERT INTO ro_view13 VALUES (3, 'Row 3');
+ERROR: cannot insert into view "ro_view13"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable inserting into the view, provide an INSTEAD OF INSERT trigger or an unconditional ON INSERT DO INSTEAD rule.
+-- Partially updatable view
+INSERT INTO rw_view14 VALUES (null, 3, 'Row 3'); -- should fail
+ERROR: cannot insert into column "ctid" of view "rw_view14"
+DETAIL: View columns that refer to system columns are not updatable.
+INSERT INTO rw_view14 (a, b) VALUES (3, 'Row 3'); -- should be OK
+UPDATE rw_view14 SET ctid=null WHERE a=3; -- should fail
+ERROR: cannot update column "ctid" of view "rw_view14"
+DETAIL: View columns that refer to system columns are not updatable.
+UPDATE rw_view14 SET b='ROW 3' WHERE a=3; -- should be OK
+SELECT * FROM base_tbl;
+ a | b
+----+--------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 2 | Row 2
+ 3 | ROW 3
+(6 rows)
+
+DELETE FROM rw_view14 WHERE a=3; -- should be OK
+-- Partially updatable view
+INSERT INTO rw_view15 VALUES (3, 'ROW 3'); -- should fail
+ERROR: cannot insert into column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view15 (a) VALUES (3); -- should be OK
+INSERT INTO rw_view15 (a) VALUES (3) ON CONFLICT DO NOTHING; -- succeeds
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+INSERT INTO rw_view15 (a) VALUES (3) ON CONFLICT (a) DO NOTHING; -- succeeds
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+INSERT INTO rw_view15 (a) VALUES (3) ON CONFLICT (a) DO UPDATE set a = excluded.a; -- succeeds
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+INSERT INTO rw_view15 (a) VALUES (3) ON CONFLICT (a) DO UPDATE set upper = 'blarg'; -- fails
+ERROR: cannot insert into column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+ALTER VIEW rw_view15 ALTER COLUMN upper SET DEFAULT 'NOT SET';
+INSERT INTO rw_view15 (a) VALUES (4); -- should fail
+ERROR: cannot insert into column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view15 SET upper='ROW 3' WHERE a=3; -- should fail
+ERROR: cannot update column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view15 SET upper=DEFAULT WHERE a=3; -- should fail
+ERROR: cannot update column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view15 SET a=4 WHERE a=3; -- should be OK
+SELECT * FROM base_tbl;
+ a | b
+----+-------------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 2 | Row 2
+ 4 | Unspecified
+(6 rows)
+
+DELETE FROM rw_view15 WHERE a=4; -- should be OK
+-- Partially updatable view
+INSERT INTO rw_view16 VALUES (3, 'Row 3', 3); -- should fail
+ERROR: multiple assignments to same column "a"
+INSERT INTO rw_view16 (a, b) VALUES (3, 'Row 3'); -- should be OK
+UPDATE rw_view16 SET a=3, aa=-3 WHERE a=3; -- should fail
+ERROR: multiple assignments to same column "a"
+UPDATE rw_view16 SET aa=-3 WHERE a=3; -- should be OK
+SELECT * FROM base_tbl;
+ a | b
+----+--------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 2 | Row 2
+ -3 | Row 3
+(6 rows)
+
+DELETE FROM rw_view16 WHERE a=-3; -- should be OK
+-- Read-only views
+INSERT INTO ro_view17 VALUES (3, 'ROW 3');
+ERROR: cannot insert into view "ro_view1"
+DETAIL: Views containing DISTINCT are not automatically updatable.
+HINT: To enable inserting into the view, provide an INSTEAD OF INSERT trigger or an unconditional ON INSERT DO INSTEAD rule.
+DELETE FROM ro_view18;
+ERROR: cannot delete from view "ro_view18"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+UPDATE ro_view19 SET last_value=1000;
+ERROR: cannot update view "ro_view19"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view20 SET b=upper(b);
+ERROR: cannot update view "ro_view20"
+DETAIL: Views that return set-returning functions are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 16 other objects
+DETAIL: drop cascades to view ro_view1
+drop cascades to view ro_view17
+drop cascades to view ro_view2
+drop cascades to view ro_view3
+drop cascades to view ro_view4
+drop cascades to view ro_view5
+drop cascades to view ro_view6
+drop cascades to view ro_view7
+drop cascades to view ro_view8
+drop cascades to view ro_view9
+drop cascades to view ro_view11
+drop cascades to view ro_view13
+drop cascades to view rw_view14
+drop cascades to view rw_view15
+drop cascades to view rw_view16
+drop cascades to view ro_view20
+DROP VIEW ro_view10, ro_view12, ro_view18;
+DROP SEQUENCE uv_seq CASCADE;
+NOTICE: drop cascades to view ro_view19
+-- simple updatable view
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a>0;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name = 'rw_view1';
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+(1 row)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name = 'rw_view1';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | YES | YES
+(1 row)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name = 'rw_view1'
+ ORDER BY ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | YES
+ rw_view1 | b | YES
+(2 rows)
+
+INSERT INTO rw_view1 VALUES (3, 'Row 3');
+INSERT INTO rw_view1 (a) VALUES (4);
+UPDATE rw_view1 SET a=5 WHERE a=4;
+DELETE FROM rw_view1 WHERE b='Row 2';
+SELECT * FROM base_tbl;
+ a | b
+----+-------------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 3 | Row 3
+ 5 | Unspecified
+(6 rows)
+
+EXPLAIN (costs off) UPDATE rw_view1 SET a=6 WHERE a=5;
+ QUERY PLAN
+--------------------------------------------------
+ Update on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: ((a > 0) AND (a = 5))
+(3 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view1 WHERE a=5;
+ QUERY PLAN
+--------------------------------------------------
+ Delete on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: ((a > 0) AND (a = 5))
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- view on top of view
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT b AS bb, a AS aa FROM base_tbl WHERE a>0;
+CREATE VIEW rw_view2 AS SELECT aa AS aaa, bb AS bbb FROM rw_view1 WHERE aa<10;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name = 'rw_view2';
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view2 | YES
+(1 row)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name = 'rw_view2';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view2 | YES | YES
+(1 row)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name = 'rw_view2'
+ ORDER BY ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view2 | aaa | YES
+ rw_view2 | bbb | YES
+(2 rows)
+
+INSERT INTO rw_view2 VALUES (3, 'Row 3');
+INSERT INTO rw_view2 (aaa) VALUES (4);
+SELECT * FROM rw_view2;
+ aaa | bbb
+-----+-------------
+ 1 | Row 1
+ 2 | Row 2
+ 3 | Row 3
+ 4 | Unspecified
+(4 rows)
+
+UPDATE rw_view2 SET bbb='Row 4' WHERE aaa=4;
+DELETE FROM rw_view2 WHERE aaa=2;
+SELECT * FROM rw_view2;
+ aaa | bbb
+-----+-------
+ 1 | Row 1
+ 3 | Row 3
+ 4 | Row 4
+(3 rows)
+
+EXPLAIN (costs off) UPDATE rw_view2 SET aaa=5 WHERE aaa=4;
+ QUERY PLAN
+--------------------------------------------------------
+ Update on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: ((a < 10) AND (a > 0) AND (a = 4))
+(3 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view2 WHERE aaa=4;
+ QUERY PLAN
+--------------------------------------------------------
+ Delete on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: ((a < 10) AND (a > 0) AND (a = 4))
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- view on top of view with rules
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a>0 OFFSET 0; -- not updatable without rules/triggers
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a<10;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | NO | NO
+ rw_view2 | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
+ DO INSTEAD INSERT INTO base_tbl VALUES (NEW.a, NEW.b) RETURNING *;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+ rw_view2 | YES
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | NO | YES
+ rw_view2 | NO | YES
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE RULE rw_view1_upd_rule AS ON UPDATE TO rw_view1
+ DO INSTEAD UPDATE base_tbl SET b=NEW.b WHERE a=OLD.a RETURNING NEW.*;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+ rw_view2 | YES
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | NO | YES
+ rw_view2 | NO | YES
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE RULE rw_view1_del_rule AS ON DELETE TO rw_view1
+ DO INSTEAD DELETE FROM base_tbl WHERE a=OLD.a RETURNING OLD.*;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+ rw_view2 | YES
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | YES | YES
+ rw_view2 | YES | YES
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | YES
+ rw_view1 | b | YES
+ rw_view2 | a | YES
+ rw_view2 | b | YES
+(4 rows)
+
+INSERT INTO rw_view2 VALUES (3, 'Row 3') RETURNING *;
+ a | b
+---+-------
+ 3 | Row 3
+(1 row)
+
+UPDATE rw_view2 SET b='Row three' WHERE a=3 RETURNING *;
+ a | b
+---+-----------
+ 3 | Row three
+(1 row)
+
+SELECT * FROM rw_view2;
+ a | b
+---+-----------
+ 1 | Row 1
+ 2 | Row 2
+ 3 | Row three
+(3 rows)
+
+DELETE FROM rw_view2 WHERE a=3 RETURNING *;
+ a | b
+---+-----------
+ 3 | Row three
+(1 row)
+
+SELECT * FROM rw_view2;
+ a | b
+---+-------
+ 1 | Row 1
+ 2 | Row 2
+(2 rows)
+
+EXPLAIN (costs off) UPDATE rw_view2 SET a=3 WHERE a=2;
+ QUERY PLAN
+----------------------------------------------------------------
+ Update on base_tbl
+ -> Nested Loop
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (a = 2)
+ -> Subquery Scan on rw_view1
+ Filter: ((rw_view1.a < 10) AND (rw_view1.a = 2))
+ -> Bitmap Heap Scan on base_tbl base_tbl_1
+ Recheck Cond: (a > 0)
+ -> Bitmap Index Scan on base_tbl_pkey
+ Index Cond: (a > 0)
+(10 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view2 WHERE a=2;
+ QUERY PLAN
+----------------------------------------------------------------
+ Delete on base_tbl
+ -> Nested Loop
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (a = 2)
+ -> Subquery Scan on rw_view1
+ Filter: ((rw_view1.a < 10) AND (rw_view1.a = 2))
+ -> Bitmap Heap Scan on base_tbl base_tbl_1
+ Recheck Cond: (a > 0)
+ -> Bitmap Index Scan on base_tbl_pkey
+ Index Cond: (a > 0)
+(10 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- view on top of view with triggers
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a>0 OFFSET 0; -- not updatable without rules/triggers
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a<10;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into,
+ is_trigger_updatable, is_trigger_deletable,
+ is_trigger_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ rw_view1 | NO | NO | NO | NO | NO
+ rw_view2 | NO | NO | NO | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE FUNCTION rw_view1_trig_fn()
+RETURNS trigger AS
+$$
+BEGIN
+ IF TG_OP = 'INSERT' THEN
+ INSERT INTO base_tbl VALUES (NEW.a, NEW.b);
+ RETURN NEW;
+ ELSIF TG_OP = 'UPDATE' THEN
+ UPDATE base_tbl SET b=NEW.b WHERE a=OLD.a;
+ RETURN NEW;
+ ELSIF TG_OP = 'DELETE' THEN
+ DELETE FROM base_tbl WHERE a=OLD.a;
+ RETURN OLD;
+ END IF;
+END;
+$$
+LANGUAGE plpgsql;
+CREATE TRIGGER rw_view1_ins_trig INSTEAD OF INSERT ON rw_view1
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into,
+ is_trigger_updatable, is_trigger_deletable,
+ is_trigger_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ rw_view1 | NO | NO | NO | NO | YES
+ rw_view2 | NO | NO | NO | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE TRIGGER rw_view1_upd_trig INSTEAD OF UPDATE ON rw_view1
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into,
+ is_trigger_updatable, is_trigger_deletable,
+ is_trigger_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ rw_view1 | NO | NO | YES | NO | YES
+ rw_view2 | NO | NO | NO | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE TRIGGER rw_view1_del_trig INSTEAD OF DELETE ON rw_view1
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into,
+ is_trigger_updatable, is_trigger_deletable,
+ is_trigger_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ rw_view1 | NO | NO | YES | YES | YES
+ rw_view2 | NO | NO | NO | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+INSERT INTO rw_view2 VALUES (3, 'Row 3') RETURNING *;
+ a | b
+---+-------
+ 3 | Row 3
+(1 row)
+
+UPDATE rw_view2 SET b='Row three' WHERE a=3 RETURNING *;
+ a | b
+---+-----------
+ 3 | Row three
+(1 row)
+
+SELECT * FROM rw_view2;
+ a | b
+---+-----------
+ 1 | Row 1
+ 2 | Row 2
+ 3 | Row three
+(3 rows)
+
+DELETE FROM rw_view2 WHERE a=3 RETURNING *;
+ a | b
+---+-----------
+ 3 | Row three
+(1 row)
+
+SELECT * FROM rw_view2;
+ a | b
+---+-------
+ 1 | Row 1
+ 2 | Row 2
+(2 rows)
+
+EXPLAIN (costs off) UPDATE rw_view2 SET a=3 WHERE a=2;
+ QUERY PLAN
+----------------------------------------------------------
+ Update on rw_view1 rw_view1_1
+ -> Subquery Scan on rw_view1
+ Filter: ((rw_view1.a < 10) AND (rw_view1.a = 2))
+ -> Bitmap Heap Scan on base_tbl
+ Recheck Cond: (a > 0)
+ -> Bitmap Index Scan on base_tbl_pkey
+ Index Cond: (a > 0)
+(7 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view2 WHERE a=2;
+ QUERY PLAN
+----------------------------------------------------------
+ Delete on rw_view1 rw_view1_1
+ -> Subquery Scan on rw_view1
+ Filter: ((rw_view1.a < 10) AND (rw_view1.a = 2))
+ -> Bitmap Heap Scan on base_tbl
+ Recheck Cond: (a > 0)
+ -> Bitmap Index Scan on base_tbl_pkey
+ Index Cond: (a > 0)
+(7 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+DROP FUNCTION rw_view1_trig_fn();
+-- update using whole row from view
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT b AS bb, a AS aa FROM base_tbl;
+CREATE FUNCTION rw_view1_aa(x rw_view1)
+ RETURNS int AS $$ SELECT x.aa $$ LANGUAGE sql;
+UPDATE rw_view1 v SET bb='Updated row 2' WHERE rw_view1_aa(v)=2
+ RETURNING rw_view1_aa(v), v.bb;
+ rw_view1_aa | bb
+-------------+---------------
+ 2 | Updated row 2
+(1 row)
+
+SELECT * FROM base_tbl;
+ a | b
+----+---------------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 2 | Updated row 2
+(5 rows)
+
+EXPLAIN (costs off)
+UPDATE rw_view1 v SET bb='Updated row 2' WHERE rw_view1_aa(v)=2
+ RETURNING rw_view1_aa(v), v.bb;
+ QUERY PLAN
+--------------------------------------------------
+ Update on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (a = 2)
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to function rw_view1_aa(rw_view1)
+-- permissions checks
+CREATE USER regress_view_user1;
+CREATE USER regress_view_user2;
+SET SESSION AUTHORIZATION regress_view_user1;
+CREATE TABLE base_tbl(a int, b text, c float);
+INSERT INTO base_tbl VALUES (1, 'Row 1', 1.0);
+CREATE VIEW rw_view1 AS SELECT b AS bb, c AS cc, a AS aa FROM base_tbl;
+INSERT INTO rw_view1 VALUES ('Row 2', 2.0, 2);
+GRANT SELECT ON base_tbl TO regress_view_user2;
+GRANT SELECT ON rw_view1 TO regress_view_user2;
+GRANT UPDATE (a,c) ON base_tbl TO regress_view_user2;
+GRANT UPDATE (bb,cc) ON rw_view1 TO regress_view_user2;
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user2;
+CREATE VIEW rw_view2 AS SELECT b AS bb, c AS cc, a AS aa FROM base_tbl;
+SELECT * FROM base_tbl; -- ok
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+ 2 | Row 2 | 2
+(2 rows)
+
+SELECT * FROM rw_view1; -- ok
+ bb | cc | aa
+-------+----+----
+ Row 1 | 1 | 1
+ Row 2 | 2 | 2
+(2 rows)
+
+SELECT * FROM rw_view2; -- ok
+ bb | cc | aa
+-------+----+----
+ Row 1 | 1 | 1
+ Row 2 | 2 | 2
+(2 rows)
+
+INSERT INTO base_tbl VALUES (3, 'Row 3', 3.0); -- not allowed
+ERROR: permission denied for table base_tbl
+INSERT INTO rw_view1 VALUES ('Row 3', 3.0, 3); -- not allowed
+ERROR: permission denied for view rw_view1
+INSERT INTO rw_view2 VALUES ('Row 3', 3.0, 3); -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE base_tbl SET a=a, c=c; -- ok
+UPDATE base_tbl SET b=b; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view1 SET bb=bb, cc=cc; -- ok
+UPDATE rw_view1 SET aa=aa; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET aa=aa, cc=cc; -- ok
+UPDATE rw_view2 SET bb=bb; -- not allowed
+ERROR: permission denied for table base_tbl
+DELETE FROM base_tbl; -- not allowed
+ERROR: permission denied for table base_tbl
+DELETE FROM rw_view1; -- not allowed
+ERROR: permission denied for view rw_view1
+DELETE FROM rw_view2; -- not allowed
+ERROR: permission denied for table base_tbl
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user1;
+GRANT INSERT, DELETE ON base_tbl TO regress_view_user2;
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user2;
+INSERT INTO base_tbl VALUES (3, 'Row 3', 3.0); -- ok
+INSERT INTO rw_view1 VALUES ('Row 4', 4.0, 4); -- not allowed
+ERROR: permission denied for view rw_view1
+INSERT INTO rw_view2 VALUES ('Row 4', 4.0, 4); -- ok
+DELETE FROM base_tbl WHERE a=1; -- ok
+DELETE FROM rw_view1 WHERE aa=2; -- not allowed
+ERROR: permission denied for view rw_view1
+DELETE FROM rw_view2 WHERE aa=2; -- ok
+SELECT * FROM base_tbl;
+ a | b | c
+---+-------+---
+ 3 | Row 3 | 3
+ 4 | Row 4 | 4
+(2 rows)
+
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user1;
+REVOKE INSERT, DELETE ON base_tbl FROM regress_view_user2;
+GRANT INSERT, DELETE ON rw_view1 TO regress_view_user2;
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user2;
+INSERT INTO base_tbl VALUES (5, 'Row 5', 5.0); -- not allowed
+ERROR: permission denied for table base_tbl
+INSERT INTO rw_view1 VALUES ('Row 5', 5.0, 5); -- ok
+INSERT INTO rw_view2 VALUES ('Row 6', 6.0, 6); -- not allowed
+ERROR: permission denied for table base_tbl
+DELETE FROM base_tbl WHERE a=3; -- not allowed
+ERROR: permission denied for table base_tbl
+DELETE FROM rw_view1 WHERE aa=3; -- ok
+DELETE FROM rw_view2 WHERE aa=4; -- not allowed
+ERROR: permission denied for table base_tbl
+SELECT * FROM base_tbl;
+ a | b | c
+---+-------+---
+ 4 | Row 4 | 4
+ 5 | Row 5 | 5
+(2 rows)
+
+RESET SESSION AUTHORIZATION;
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- nested-view permissions
+CREATE TABLE base_tbl(a int, b text, c float);
+INSERT INTO base_tbl VALUES (1, 'Row 1', 1.0);
+SET SESSION AUTHORIZATION regress_view_user1;
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl;
+SELECT * FROM rw_view1; -- not allowed
+ERROR: permission denied for table base_tbl
+SELECT * FROM rw_view1 FOR UPDATE; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view1 SET b = 'foo' WHERE a = 1; -- not allowed
+ERROR: permission denied for table base_tbl
+SET SESSION AUTHORIZATION regress_view_user2;
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1;
+SELECT * FROM rw_view2; -- not allowed
+ERROR: permission denied for view rw_view1
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for view rw_view1
+RESET SESSION AUTHORIZATION;
+GRANT SELECT ON base_tbl TO regress_view_user1;
+SET SESSION AUTHORIZATION regress_view_user1;
+SELECT * FROM rw_view1;
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+(1 row)
+
+SELECT * FROM rw_view1 FOR UPDATE; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view1 SET b = 'foo' WHERE a = 1; -- not allowed
+ERROR: permission denied for table base_tbl
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2; -- not allowed
+ERROR: permission denied for view rw_view1
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for view rw_view1
+SET SESSION AUTHORIZATION regress_view_user1;
+GRANT SELECT ON rw_view1 TO regress_view_user2;
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2;
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+(1 row)
+
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for view rw_view1
+RESET SESSION AUTHORIZATION;
+GRANT UPDATE ON base_tbl TO regress_view_user1;
+SET SESSION AUTHORIZATION regress_view_user1;
+SELECT * FROM rw_view1;
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+(1 row)
+
+SELECT * FROM rw_view1 FOR UPDATE;
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+(1 row)
+
+UPDATE rw_view1 SET b = 'foo' WHERE a = 1;
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2;
+ a | b | c
+---+-----+---
+ 1 | foo | 1
+(1 row)
+
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for view rw_view1
+SET SESSION AUTHORIZATION regress_view_user1;
+GRANT UPDATE ON rw_view1 TO regress_view_user2;
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2;
+ a | b | c
+---+-----+---
+ 1 | foo | 1
+(1 row)
+
+SELECT * FROM rw_view2 FOR UPDATE;
+ a | b | c
+---+-----+---
+ 1 | foo | 1
+(1 row)
+
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1;
+RESET SESSION AUTHORIZATION;
+REVOKE UPDATE ON base_tbl FROM regress_view_user1;
+SET SESSION AUTHORIZATION regress_view_user1;
+SELECT * FROM rw_view1;
+ a | b | c
+---+-----+---
+ 1 | bar | 1
+(1 row)
+
+SELECT * FROM rw_view1 FOR UPDATE; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view1 SET b = 'foo' WHERE a = 1; -- not allowed
+ERROR: permission denied for table base_tbl
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2;
+ a | b | c
+---+-----+---
+ 1 | bar | 1
+(1 row)
+
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for table base_tbl
+RESET SESSION AUTHORIZATION;
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+DROP USER regress_view_user1;
+DROP USER regress_view_user2;
+-- column defaults
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified', c serial);
+INSERT INTO base_tbl VALUES (1, 'Row 1');
+INSERT INTO base_tbl VALUES (2, 'Row 2');
+INSERT INTO base_tbl VALUES (3);
+CREATE VIEW rw_view1 AS SELECT a AS aa, b AS bb FROM base_tbl;
+ALTER VIEW rw_view1 ALTER COLUMN bb SET DEFAULT 'View default';
+INSERT INTO rw_view1 VALUES (4, 'Row 4');
+INSERT INTO rw_view1 (aa) VALUES (5);
+SELECT * FROM base_tbl;
+ a | b | c
+---+--------------+---
+ 1 | Row 1 | 1
+ 2 | Row 2 | 2
+ 3 | Unspecified | 3
+ 4 | Row 4 | 4
+ 5 | View default | 5
+(5 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- Table having triggers
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl VALUES (1, 'Row 1');
+INSERT INTO base_tbl VALUES (2, 'Row 2');
+CREATE FUNCTION rw_view1_trig_fn()
+RETURNS trigger AS
+$$
+BEGIN
+ IF TG_OP = 'INSERT' THEN
+ UPDATE base_tbl SET b=NEW.b WHERE a=1;
+ RETURN NULL;
+ END IF;
+ RETURN NULL;
+END;
+$$
+LANGUAGE plpgsql;
+CREATE TRIGGER rw_view1_ins_trig AFTER INSERT ON base_tbl
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+CREATE VIEW rw_view1 AS SELECT a AS aa, b AS bb FROM base_tbl;
+INSERT INTO rw_view1 VALUES (3, 'Row 3');
+select * from base_tbl;
+ a | b
+---+-------
+ 2 | Row 2
+ 3 | Row 3
+ 1 | Row 3
+(3 rows)
+
+DROP VIEW rw_view1;
+DROP TRIGGER rw_view1_ins_trig on base_tbl;
+DROP FUNCTION rw_view1_trig_fn();
+DROP TABLE base_tbl;
+-- view with ORDER BY
+CREATE TABLE base_tbl (a int, b int);
+INSERT INTO base_tbl VALUES (1,2), (4,5), (3,-3);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl ORDER BY a+b;
+SELECT * FROM rw_view1;
+ a | b
+---+----
+ 3 | -3
+ 1 | 2
+ 4 | 5
+(3 rows)
+
+INSERT INTO rw_view1 VALUES (7,-8);
+SELECT * FROM rw_view1;
+ a | b
+---+----
+ 7 | -8
+ 3 | -3
+ 1 | 2
+ 4 | 5
+(4 rows)
+
+EXPLAIN (verbose, costs off) UPDATE rw_view1 SET b = b + 1 RETURNING *;
+ QUERY PLAN
+-------------------------------------------------------------
+ Update on public.base_tbl
+ Output: base_tbl.a, base_tbl.b
+ -> Seq Scan on public.base_tbl
+ Output: base_tbl.a, (base_tbl.b + 1), base_tbl.ctid
+(4 rows)
+
+UPDATE rw_view1 SET b = b + 1 RETURNING *;
+ a | b
+---+----
+ 1 | 3
+ 4 | 6
+ 3 | -2
+ 7 | -7
+(4 rows)
+
+SELECT * FROM rw_view1;
+ a | b
+---+----
+ 7 | -7
+ 3 | -2
+ 1 | 3
+ 4 | 6
+(4 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- multiple array-column updates
+CREATE TABLE base_tbl (a int, arr int[]);
+INSERT INTO base_tbl VALUES (1,ARRAY[2]), (3,ARRAY[4]);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl;
+UPDATE rw_view1 SET arr[1] = 42, arr[2] = 77 WHERE a = 3;
+SELECT * FROM rw_view1;
+ a | arr
+---+---------
+ 1 | {2}
+ 3 | {42,77}
+(2 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- views with updatable and non-updatable columns
+CREATE TABLE base_tbl(a float);
+INSERT INTO base_tbl SELECT i/10.0 FROM generate_series(1,10) g(i);
+CREATE VIEW rw_view1 AS
+ SELECT ctid, sin(a) s, a, cos(a) c
+ FROM base_tbl
+ WHERE a != 0
+ ORDER BY abs(a);
+INSERT INTO rw_view1 VALUES (null, null, 1.1, null); -- should fail
+ERROR: cannot insert into column "ctid" of view "rw_view1"
+DETAIL: View columns that refer to system columns are not updatable.
+INSERT INTO rw_view1 (s, c, a) VALUES (null, null, 1.1); -- should fail
+ERROR: cannot insert into column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view1 (a) VALUES (1.1) RETURNING a, s, c; -- OK
+ a | s | c
+-----+-------------------+-------------------
+ 1.1 | 0.891207360061435 | 0.453596121425577
+(1 row)
+
+UPDATE rw_view1 SET s = s WHERE a = 1.1; -- should fail
+ERROR: cannot update column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view1 SET a = 1.05 WHERE a = 1.1 RETURNING s; -- OK
+ s
+-------------------
+ 0.867423225594017
+(1 row)
+
+DELETE FROM rw_view1 WHERE a = 1.05; -- OK
+CREATE VIEW rw_view2 AS
+ SELECT s, c, s/c t, a base_a, ctid
+ FROM rw_view1;
+INSERT INTO rw_view2 VALUES (null, null, null, 1.1, null); -- should fail
+ERROR: cannot insert into column "t" of view "rw_view2"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view2(s, c, base_a) VALUES (null, null, 1.1); -- should fail
+ERROR: cannot insert into column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view2(base_a) VALUES (1.1) RETURNING t; -- OK
+ t
+------------------
+ 1.96475965724865
+(1 row)
+
+UPDATE rw_view2 SET s = s WHERE base_a = 1.1; -- should fail
+ERROR: cannot update column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view2 SET t = t WHERE base_a = 1.1; -- should fail
+ERROR: cannot update column "t" of view "rw_view2"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view2 SET base_a = 1.05 WHERE base_a = 1.1; -- OK
+DELETE FROM rw_view2 WHERE base_a = 1.05 RETURNING base_a, s, c, t; -- OK
+ base_a | s | c | t
+--------+-------------------+-------------------+------------------
+ 1.05 | 0.867423225594017 | 0.497571047891727 | 1.74331530998317
+(1 row)
+
+CREATE VIEW rw_view3 AS
+ SELECT s, c, s/c t, ctid
+ FROM rw_view1;
+INSERT INTO rw_view3 VALUES (null, null, null, null); -- should fail
+ERROR: cannot insert into column "t" of view "rw_view3"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view3(s) VALUES (null); -- should fail
+ERROR: cannot insert into column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view3 SET s = s; -- should fail
+ERROR: cannot update column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+DELETE FROM rw_view3 WHERE s = sin(0.1); -- should be OK
+SELECT * FROM base_tbl ORDER BY a;
+ a
+-----
+ 0.2
+ 0.3
+ 0.4
+ 0.5
+ 0.6
+ 0.7
+ 0.8
+ 0.9
+ 1
+(9 rows)
+
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+ rw_view2 | YES
+ rw_view3 | NO
+(3 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | YES | YES
+ rw_view2 | YES | YES
+ rw_view3 | NO | NO
+(3 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | ctid | NO
+ rw_view1 | s | NO
+ rw_view1 | a | YES
+ rw_view1 | c | NO
+ rw_view2 | s | NO
+ rw_view2 | c | NO
+ rw_view2 | t | NO
+ rw_view2 | base_a | YES
+ rw_view2 | ctid | NO
+ rw_view3 | s | NO
+ rw_view3 | c | NO
+ rw_view3 | t | NO
+ rw_view3 | ctid | NO
+(13 rows)
+
+SELECT events & 4 != 0 AS upd,
+ events & 8 != 0 AS ins,
+ events & 16 != 0 AS del
+ FROM pg_catalog.pg_relation_is_updatable('rw_view3'::regclass, false) t(events);
+ upd | ins | del
+-----+-----+-----
+ f | f | t
+(1 row)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+drop cascades to view rw_view3
+-- inheritance tests
+CREATE TABLE base_tbl_parent (a int);
+CREATE TABLE base_tbl_child (CHECK (a > 0)) INHERITS (base_tbl_parent);
+INSERT INTO base_tbl_parent SELECT * FROM generate_series(-8, -1);
+INSERT INTO base_tbl_child SELECT * FROM generate_series(1, 8);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl_parent;
+CREATE VIEW rw_view2 AS SELECT * FROM ONLY base_tbl_parent;
+SELECT * FROM rw_view1 ORDER BY a;
+ a
+----
+ -8
+ -7
+ -6
+ -5
+ -4
+ -3
+ -2
+ -1
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+(16 rows)
+
+SELECT * FROM ONLY rw_view1 ORDER BY a;
+ a
+----
+ -8
+ -7
+ -6
+ -5
+ -4
+ -3
+ -2
+ -1
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+(16 rows)
+
+SELECT * FROM rw_view2 ORDER BY a;
+ a
+----
+ -8
+ -7
+ -6
+ -5
+ -4
+ -3
+ -2
+ -1
+(8 rows)
+
+INSERT INTO rw_view1 VALUES (-100), (100);
+INSERT INTO rw_view2 VALUES (-200), (200);
+UPDATE rw_view1 SET a = a*10 WHERE a IN (-1, 1); -- Should produce -10 and 10
+UPDATE ONLY rw_view1 SET a = a*10 WHERE a IN (-2, 2); -- Should produce -20 and 20
+UPDATE rw_view2 SET a = a*10 WHERE a IN (-3, 3); -- Should produce -30 only
+UPDATE ONLY rw_view2 SET a = a*10 WHERE a IN (-4, 4); -- Should produce -40 only
+DELETE FROM rw_view1 WHERE a IN (-5, 5); -- Should delete -5 and 5
+DELETE FROM ONLY rw_view1 WHERE a IN (-6, 6); -- Should delete -6 and 6
+DELETE FROM rw_view2 WHERE a IN (-7, 7); -- Should delete -7 only
+DELETE FROM ONLY rw_view2 WHERE a IN (-8, 8); -- Should delete -8 only
+SELECT * FROM ONLY base_tbl_parent ORDER BY a;
+ a
+------
+ -200
+ -100
+ -40
+ -30
+ -20
+ -10
+ 100
+ 200
+(8 rows)
+
+SELECT * FROM base_tbl_child ORDER BY a;
+ a
+----
+ 3
+ 4
+ 7
+ 8
+ 10
+ 20
+(6 rows)
+
+CREATE TABLE other_tbl_parent (id int);
+CREATE TABLE other_tbl_child () INHERITS (other_tbl_parent);
+INSERT INTO other_tbl_parent VALUES (7),(200);
+INSERT INTO other_tbl_child VALUES (8),(100);
+EXPLAIN (costs off)
+UPDATE rw_view1 SET a = a + 1000 FROM other_tbl_parent WHERE a = id;
+ QUERY PLAN
+---------------------------------------------------------------
+ Update on base_tbl_parent
+ Update on base_tbl_parent
+ Update on base_tbl_child
+ -> Merge Join
+ Merge Cond: (base_tbl_parent.a = other_tbl_parent.id)
+ -> Sort
+ Sort Key: base_tbl_parent.a
+ -> Seq Scan on base_tbl_parent
+ -> Sort
+ Sort Key: other_tbl_parent.id
+ -> Append
+ -> Seq Scan on other_tbl_parent
+ -> Seq Scan on other_tbl_child
+ -> Merge Join
+ Merge Cond: (base_tbl_child.a = other_tbl_parent.id)
+ -> Sort
+ Sort Key: base_tbl_child.a
+ -> Seq Scan on base_tbl_child
+ -> Sort
+ Sort Key: other_tbl_parent.id
+ -> Append
+ -> Seq Scan on other_tbl_parent
+ -> Seq Scan on other_tbl_child
+(23 rows)
+
+UPDATE rw_view1 SET a = a + 1000 FROM other_tbl_parent WHERE a = id;
+SELECT * FROM ONLY base_tbl_parent ORDER BY a;
+ a
+------
+ -200
+ -100
+ -40
+ -30
+ -20
+ -10
+ 1100
+ 1200
+(8 rows)
+
+SELECT * FROM base_tbl_child ORDER BY a;
+ a
+------
+ 3
+ 4
+ 10
+ 20
+ 1007
+ 1008
+(6 rows)
+
+DROP TABLE base_tbl_parent, base_tbl_child CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+DROP TABLE other_tbl_parent CASCADE;
+NOTICE: drop cascades to table other_tbl_child
+-- simple WITH CHECK OPTION
+CREATE TABLE base_tbl (a int, b int DEFAULT 10);
+INSERT INTO base_tbl VALUES (1,2), (2,3), (1,-1);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a < b
+ WITH LOCAL CHECK OPTION;
+\d+ rw_view1
+ View "public.rw_view1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+ b | integer | | | | plain |
+View definition:
+ SELECT base_tbl.a,
+ base_tbl.b
+ FROM base_tbl
+ WHERE base_tbl.a < base_tbl.b;
+Options: check_option=local
+
+SELECT * FROM information_schema.views WHERE table_name = 'rw_view1';
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+------------------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view1 | SELECT base_tbl.a, +| LOCAL | YES | YES | NO | NO | NO
+ | | | base_tbl.b +| | | | | |
+ | | | FROM base_tbl +| | | | | |
+ | | | WHERE (base_tbl.a < base_tbl.b); | | | | | |
+(1 row)
+
+INSERT INTO rw_view1 VALUES(3,4); -- ok
+INSERT INTO rw_view1 VALUES(4,3); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (4, 3).
+INSERT INTO rw_view1 VALUES(5,null); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (5, null).
+UPDATE rw_view1 SET b = 5 WHERE a = 3; -- ok
+UPDATE rw_view1 SET b = -5 WHERE a = 3; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (3, -5).
+INSERT INTO rw_view1(a) VALUES (9); -- ok
+INSERT INTO rw_view1(a) VALUES (10); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (10, 10).
+SELECT * FROM base_tbl;
+ a | b
+---+----
+ 1 | 2
+ 2 | 3
+ 1 | -1
+ 3 | 5
+ 9 | 10
+(5 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- WITH LOCAL/CASCADED CHECK OPTION
+CREATE TABLE base_tbl (a int);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a > 0;
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a < 10
+ WITH CHECK OPTION; -- implicitly cascaded
+\d+ rw_view2
+ View "public.rw_view2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+View definition:
+ SELECT rw_view1.a
+ FROM rw_view1
+ WHERE rw_view1.a < 10;
+Options: check_option=cascaded
+
+SELECT * FROM information_schema.views WHERE table_name = 'rw_view2';
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+----------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view2 | SELECT rw_view1.a +| CASCADED | YES | YES | NO | NO | NO
+ | | | FROM rw_view1 +| | | | | |
+ | | | WHERE (rw_view1.a < 10); | | | | | |
+(1 row)
+
+INSERT INTO rw_view2 VALUES (-5); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (-5).
+INSERT INTO rw_view2 VALUES (5); -- ok
+INSERT INTO rw_view2 VALUES (15); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (15).
+SELECT * FROM base_tbl;
+ a
+---
+ 5
+(1 row)
+
+UPDATE rw_view2 SET a = a - 10; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (-5).
+UPDATE rw_view2 SET a = a + 10; -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (15).
+CREATE OR REPLACE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a < 10
+ WITH LOCAL CHECK OPTION;
+\d+ rw_view2
+ View "public.rw_view2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+View definition:
+ SELECT rw_view1.a
+ FROM rw_view1
+ WHERE rw_view1.a < 10;
+Options: check_option=local
+
+SELECT * FROM information_schema.views WHERE table_name = 'rw_view2';
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+----------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view2 | SELECT rw_view1.a +| LOCAL | YES | YES | NO | NO | NO
+ | | | FROM rw_view1 +| | | | | |
+ | | | WHERE (rw_view1.a < 10); | | | | | |
+(1 row)
+
+INSERT INTO rw_view2 VALUES (-10); -- ok, but not in view
+INSERT INTO rw_view2 VALUES (20); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (20).
+SELECT * FROM base_tbl;
+ a
+-----
+ 5
+ -10
+(2 rows)
+
+ALTER VIEW rw_view1 SET (check_option=here); -- invalid
+ERROR: invalid value for "check_option" option
+DETAIL: Valid values are "local" and "cascaded".
+ALTER VIEW rw_view1 SET (check_option=local);
+INSERT INTO rw_view2 VALUES (-20); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (-20).
+INSERT INTO rw_view2 VALUES (30); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (30).
+ALTER VIEW rw_view2 RESET (check_option);
+\d+ rw_view2
+ View "public.rw_view2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+View definition:
+ SELECT rw_view1.a
+ FROM rw_view1
+ WHERE rw_view1.a < 10;
+
+SELECT * FROM information_schema.views WHERE table_name = 'rw_view2';
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+----------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view2 | SELECT rw_view1.a +| NONE | YES | YES | NO | NO | NO
+ | | | FROM rw_view1 +| | | | | |
+ | | | WHERE (rw_view1.a < 10); | | | | | |
+(1 row)
+
+INSERT INTO rw_view2 VALUES (30); -- ok, but not in view
+SELECT * FROM base_tbl;
+ a
+-----
+ 5
+ -10
+ 30
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- WITH CHECK OPTION with no local view qual
+CREATE TABLE base_tbl (a int);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WITH CHECK OPTION;
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a > 0;
+CREATE VIEW rw_view3 AS SELECT * FROM rw_view2 WITH CHECK OPTION;
+SELECT * FROM information_schema.views WHERE table_name LIKE E'rw\\_view_' ORDER BY table_name;
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+---------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view1 | SELECT base_tbl.a +| CASCADED | YES | YES | NO | NO | NO
+ | | | FROM base_tbl; | | | | | |
+ regression | public | rw_view2 | SELECT rw_view1.a +| NONE | YES | YES | NO | NO | NO
+ | | | FROM rw_view1 +| | | | | |
+ | | | WHERE (rw_view1.a > 0); | | | | | |
+ regression | public | rw_view3 | SELECT rw_view2.a +| CASCADED | YES | YES | NO | NO | NO
+ | | | FROM rw_view2; | | | | | |
+(3 rows)
+
+INSERT INTO rw_view1 VALUES (-1); -- ok
+INSERT INTO rw_view1 VALUES (1); -- ok
+INSERT INTO rw_view2 VALUES (-2); -- ok, but not in view
+INSERT INTO rw_view2 VALUES (2); -- ok
+INSERT INTO rw_view3 VALUES (-3); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (-3).
+INSERT INTO rw_view3 VALUES (3); -- ok
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+drop cascades to view rw_view3
+-- WITH CHECK OPTION with scalar array ops
+CREATE TABLE base_tbl (a int, b int[]);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a = ANY (b)
+ WITH CHECK OPTION;
+INSERT INTO rw_view1 VALUES (1, ARRAY[1,2,3]); -- ok
+INSERT INTO rw_view1 VALUES (10, ARRAY[4,5]); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (10, {4,5}).
+UPDATE rw_view1 SET b[2] = -b[2] WHERE a = 1; -- ok
+UPDATE rw_view1 SET b[1] = -b[1] WHERE a = 1; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (1, {-1,-2,3}).
+PREPARE ins(int, int[]) AS INSERT INTO rw_view1 VALUES($1, $2);
+EXECUTE ins(2, ARRAY[1,2,3]); -- ok
+EXECUTE ins(10, ARRAY[4,5]); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (10, {4,5}).
+DEALLOCATE PREPARE ins;
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- WITH CHECK OPTION with subquery
+CREATE TABLE base_tbl (a int);
+CREATE TABLE ref_tbl (a int PRIMARY KEY);
+INSERT INTO ref_tbl SELECT * FROM generate_series(1,10);
+CREATE VIEW rw_view1 AS
+ SELECT * FROM base_tbl b
+ WHERE EXISTS(SELECT 1 FROM ref_tbl r WHERE r.a = b.a)
+ WITH CHECK OPTION;
+INSERT INTO rw_view1 VALUES (5); -- ok
+INSERT INTO rw_view1 VALUES (15); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (15).
+UPDATE rw_view1 SET a = a + 5; -- ok
+UPDATE rw_view1 SET a = a + 5; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (15).
+EXPLAIN (costs off) INSERT INTO rw_view1 VALUES (5);
+ QUERY PLAN
+---------------------------------------------------------
+ Insert on base_tbl b
+ -> Result
+ SubPlan 1
+ -> Index Only Scan using ref_tbl_pkey on ref_tbl r
+ Index Cond: (a = b.a)
+ SubPlan 2
+ -> Seq Scan on ref_tbl r_1
+(7 rows)
+
+EXPLAIN (costs off) UPDATE rw_view1 SET a = a + 5;
+ QUERY PLAN
+-----------------------------------------------------------
+ Update on base_tbl b
+ -> Hash Join
+ Hash Cond: (b.a = r.a)
+ -> Seq Scan on base_tbl b
+ -> Hash
+ -> Seq Scan on ref_tbl r
+ SubPlan 1
+ -> Index Only Scan using ref_tbl_pkey on ref_tbl r_1
+ Index Cond: (a = b.a)
+ SubPlan 2
+ -> Seq Scan on ref_tbl r_2
+(11 rows)
+
+DROP TABLE base_tbl, ref_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- WITH CHECK OPTION with BEFORE trigger on base table
+CREATE TABLE base_tbl (a int, b int);
+CREATE FUNCTION base_tbl_trig_fn()
+RETURNS trigger AS
+$$
+BEGIN
+ NEW.b := 10;
+ RETURN NEW;
+END;
+$$
+LANGUAGE plpgsql;
+CREATE TRIGGER base_tbl_trig BEFORE INSERT OR UPDATE ON base_tbl
+ FOR EACH ROW EXECUTE PROCEDURE base_tbl_trig_fn();
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a < b WITH CHECK OPTION;
+INSERT INTO rw_view1 VALUES (5,0); -- ok
+INSERT INTO rw_view1 VALUES (15, 20); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (15, 10).
+UPDATE rw_view1 SET a = 20, b = 30; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (20, 10).
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+DROP FUNCTION base_tbl_trig_fn();
+-- WITH LOCAL CHECK OPTION with INSTEAD OF trigger on base view
+CREATE TABLE base_tbl (a int, b int);
+CREATE VIEW rw_view1 AS SELECT a FROM base_tbl WHERE a < b;
+CREATE FUNCTION rw_view1_trig_fn()
+RETURNS trigger AS
+$$
+BEGIN
+ IF TG_OP = 'INSERT' THEN
+ INSERT INTO base_tbl VALUES (NEW.a, 10);
+ RETURN NEW;
+ ELSIF TG_OP = 'UPDATE' THEN
+ UPDATE base_tbl SET a=NEW.a WHERE a=OLD.a;
+ RETURN NEW;
+ ELSIF TG_OP = 'DELETE' THEN
+ DELETE FROM base_tbl WHERE a=OLD.a;
+ RETURN OLD;
+ END IF;
+END;
+$$
+LANGUAGE plpgsql;
+CREATE TRIGGER rw_view1_trig
+ INSTEAD OF INSERT OR UPDATE OR DELETE ON rw_view1
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+CREATE VIEW rw_view2 AS
+ SELECT * FROM rw_view1 WHERE a > 0 WITH LOCAL CHECK OPTION;
+INSERT INTO rw_view2 VALUES (-5); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (-5).
+INSERT INTO rw_view2 VALUES (5); -- ok
+INSERT INTO rw_view2 VALUES (50); -- ok, but not in view
+UPDATE rw_view2 SET a = a - 10; -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (-5).
+SELECT * FROM base_tbl;
+ a | b
+----+----
+ 5 | 10
+ 50 | 10
+(2 rows)
+
+-- Check option won't cascade down to base view with INSTEAD OF triggers
+ALTER VIEW rw_view2 SET (check_option=cascaded);
+INSERT INTO rw_view2 VALUES (100); -- ok, but not in view (doesn't fail rw_view1's check)
+UPDATE rw_view2 SET a = 200 WHERE a = 5; -- ok, but not in view (doesn't fail rw_view1's check)
+SELECT * FROM base_tbl;
+ a | b
+-----+----
+ 50 | 10
+ 100 | 10
+ 200 | 10
+(3 rows)
+
+-- Neither local nor cascaded check options work with INSTEAD rules
+DROP TRIGGER rw_view1_trig ON rw_view1;
+CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
+ DO INSTEAD INSERT INTO base_tbl VALUES (NEW.a, 10);
+CREATE RULE rw_view1_upd_rule AS ON UPDATE TO rw_view1
+ DO INSTEAD UPDATE base_tbl SET a=NEW.a WHERE a=OLD.a;
+INSERT INTO rw_view2 VALUES (-10); -- ok, but not in view (doesn't fail rw_view2's check)
+INSERT INTO rw_view2 VALUES (5); -- ok
+INSERT INTO rw_view2 VALUES (20); -- ok, but not in view (doesn't fail rw_view1's check)
+UPDATE rw_view2 SET a = 30 WHERE a = 5; -- ok, but not in view (doesn't fail rw_view1's check)
+INSERT INTO rw_view2 VALUES (5); -- ok
+UPDATE rw_view2 SET a = -5 WHERE a = 5; -- ok, but not in view (doesn't fail rw_view2's check)
+SELECT * FROM base_tbl;
+ a | b
+-----+----
+ 50 | 10
+ 100 | 10
+ 200 | 10
+ -10 | 10
+ 20 | 10
+ 30 | 10
+ -5 | 10
+(7 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+DROP FUNCTION rw_view1_trig_fn();
+CREATE TABLE base_tbl (a int);
+CREATE VIEW rw_view1 AS SELECT a,10 AS b FROM base_tbl;
+CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
+ DO INSTEAD INSERT INTO base_tbl VALUES (NEW.a);
+CREATE VIEW rw_view2 AS
+ SELECT * FROM rw_view1 WHERE a > b WITH LOCAL CHECK OPTION;
+INSERT INTO rw_view2 VALUES (2,3); -- ok, but not in view (doesn't fail rw_view2's check)
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- security barrier view
+CREATE TABLE base_tbl (person text, visibility text);
+INSERT INTO base_tbl VALUES ('Tom', 'public'),
+ ('Dick', 'private'),
+ ('Harry', 'public');
+CREATE VIEW rw_view1 AS
+ SELECT person FROM base_tbl WHERE visibility = 'public';
+CREATE FUNCTION snoop(anyelement)
+RETURNS boolean AS
+$$
+BEGIN
+ RAISE NOTICE 'snooped value: %', $1;
+ RETURN true;
+END;
+$$
+LANGUAGE plpgsql COST 0.000001;
+CREATE OR REPLACE FUNCTION leakproof(anyelement)
+RETURNS boolean AS
+$$
+BEGIN
+ RETURN true;
+END;
+$$
+LANGUAGE plpgsql STRICT IMMUTABLE LEAKPROOF;
+SELECT * FROM rw_view1 WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Dick
+NOTICE: snooped value: Harry
+ person
+--------
+ Tom
+ Harry
+(2 rows)
+
+UPDATE rw_view1 SET person=person WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Dick
+NOTICE: snooped value: Harry
+DELETE FROM rw_view1 WHERE NOT snoop(person);
+NOTICE: snooped value: Dick
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+ALTER VIEW rw_view1 SET (security_barrier = true);
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name = 'rw_view1';
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+(1 row)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name = 'rw_view1';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | YES | YES
+(1 row)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name = 'rw_view1'
+ ORDER BY ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | person | YES
+(1 row)
+
+SELECT * FROM rw_view1 WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+ person
+--------
+ Tom
+ Harry
+(2 rows)
+
+UPDATE rw_view1 SET person=person WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+DELETE FROM rw_view1 WHERE NOT snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+EXPLAIN (costs off) SELECT * FROM rw_view1 WHERE snoop(person);
+ QUERY PLAN
+-----------------------------------------------
+ Subquery Scan on rw_view1
+ Filter: snoop(rw_view1.person)
+ -> Seq Scan on base_tbl
+ Filter: (visibility = 'public'::text)
+(4 rows)
+
+EXPLAIN (costs off) UPDATE rw_view1 SET person=person WHERE snoop(person);
+ QUERY PLAN
+-------------------------------------------------------------------
+ Update on base_tbl
+ -> Seq Scan on base_tbl
+ Filter: ((visibility = 'public'::text) AND snoop(person))
+(3 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view1 WHERE NOT snoop(person);
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Delete on base_tbl
+ -> Seq Scan on base_tbl
+ Filter: ((visibility = 'public'::text) AND (NOT snoop(person)))
+(3 rows)
+
+-- security barrier view on top of security barrier view
+CREATE VIEW rw_view2 WITH (security_barrier = true) AS
+ SELECT * FROM rw_view1 WHERE snoop(person);
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name = 'rw_view2';
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view2 | YES
+(1 row)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name = 'rw_view2';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view2 | YES | YES
+(1 row)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name = 'rw_view2'
+ ORDER BY ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view2 | person | YES
+(1 row)
+
+SELECT * FROM rw_view2 WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+NOTICE: snooped value: Harry
+ person
+--------
+ Tom
+ Harry
+(2 rows)
+
+UPDATE rw_view2 SET person=person WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+NOTICE: snooped value: Harry
+DELETE FROM rw_view2 WHERE NOT snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+NOTICE: snooped value: Harry
+EXPLAIN (costs off) SELECT * FROM rw_view2 WHERE snoop(person);
+ QUERY PLAN
+-----------------------------------------------------
+ Subquery Scan on rw_view2
+ Filter: snoop(rw_view2.person)
+ -> Subquery Scan on rw_view1
+ Filter: snoop(rw_view1.person)
+ -> Seq Scan on base_tbl
+ Filter: (visibility = 'public'::text)
+(6 rows)
+
+EXPLAIN (costs off) UPDATE rw_view2 SET person=person WHERE snoop(person);
+ QUERY PLAN
+-------------------------------------------------------------------------------------
+ Update on base_tbl
+ -> Seq Scan on base_tbl
+ Filter: ((visibility = 'public'::text) AND snoop(person) AND snoop(person))
+(3 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view2 WHERE NOT snoop(person);
+ QUERY PLAN
+-------------------------------------------------------------------------------------------
+ Delete on base_tbl
+ -> Seq Scan on base_tbl
+ Filter: ((visibility = 'public'::text) AND snoop(person) AND (NOT snoop(person)))
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- security barrier view on top of table with rules
+CREATE TABLE base_tbl(id int PRIMARY KEY, data text, deleted boolean);
+INSERT INTO base_tbl VALUES (1, 'Row 1', false), (2, 'Row 2', true);
+CREATE RULE base_tbl_ins_rule AS ON INSERT TO base_tbl
+ WHERE EXISTS (SELECT 1 FROM base_tbl t WHERE t.id = new.id)
+ DO INSTEAD
+ UPDATE base_tbl SET data = new.data, deleted = false WHERE id = new.id;
+CREATE RULE base_tbl_del_rule AS ON DELETE TO base_tbl
+ DO INSTEAD
+ UPDATE base_tbl SET deleted = true WHERE id = old.id;
+CREATE VIEW rw_view1 WITH (security_barrier=true) AS
+ SELECT id, data FROM base_tbl WHERE NOT deleted;
+SELECT * FROM rw_view1;
+ id | data
+----+-------
+ 1 | Row 1
+(1 row)
+
+EXPLAIN (costs off) DELETE FROM rw_view1 WHERE id = 1 AND snoop(data);
+ QUERY PLAN
+-------------------------------------------------------------------
+ Update on base_tbl base_tbl_1
+ -> Nested Loop
+ -> Index Scan using base_tbl_pkey on base_tbl base_tbl_1
+ Index Cond: (id = 1)
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (id = 1)
+ Filter: ((NOT deleted) AND snoop(data))
+(7 rows)
+
+DELETE FROM rw_view1 WHERE id = 1 AND snoop(data);
+NOTICE: snooped value: Row 1
+EXPLAIN (costs off) INSERT INTO rw_view1 VALUES (2, 'New row 2');
+ QUERY PLAN
+-----------------------------------------------------------
+ Insert on base_tbl
+ InitPlan 1 (returns $0)
+ -> Index Only Scan using base_tbl_pkey on base_tbl t
+ Index Cond: (id = 2)
+ -> Result
+ One-Time Filter: ($0 IS NOT TRUE)
+
+ Update on base_tbl
+ InitPlan 1 (returns $0)
+ -> Index Only Scan using base_tbl_pkey on base_tbl t
+ Index Cond: (id = 2)
+ -> Result
+ One-Time Filter: $0
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (id = 2)
+(15 rows)
+
+INSERT INTO rw_view1 VALUES (2, 'New row 2');
+SELECT * FROM base_tbl;
+ id | data | deleted
+----+-----------+---------
+ 1 | Row 1 | t
+ 2 | New row 2 | f
+(2 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- security barrier view based on inheritance set
+CREATE TABLE t1 (a int, b float, c text);
+CREATE INDEX t1_a_idx ON t1(a);
+INSERT INTO t1
+SELECT i,i,'t1' FROM generate_series(1,10) g(i);
+ANALYZE t1;
+CREATE TABLE t11 (d text) INHERITS (t1);
+CREATE INDEX t11_a_idx ON t11(a);
+INSERT INTO t11
+SELECT i,i,'t11','t11d' FROM generate_series(1,10) g(i);
+ANALYZE t11;
+CREATE TABLE t12 (e int[]) INHERITS (t1);
+CREATE INDEX t12_a_idx ON t12(a);
+INSERT INTO t12
+SELECT i,i,'t12','{1,2}'::int[] FROM generate_series(1,10) g(i);
+ANALYZE t12;
+CREATE TABLE t111 () INHERITS (t11, t12);
+NOTICE: merging multiple inherited definitions of column "a"
+NOTICE: merging multiple inherited definitions of column "b"
+NOTICE: merging multiple inherited definitions of column "c"
+CREATE INDEX t111_a_idx ON t111(a);
+INSERT INTO t111
+SELECT i,i,'t111','t111d','{1,1,1}'::int[] FROM generate_series(1,10) g(i);
+ANALYZE t111;
+CREATE VIEW v1 WITH (security_barrier=true) AS
+SELECT *, (SELECT d FROM t11 WHERE t11.a = t1.a LIMIT 1) AS d
+FROM t1
+WHERE a > 5 AND EXISTS(SELECT 1 FROM t12 WHERE t12.a = t1.a);
+SELECT * FROM v1 WHERE a=3; -- should not see anything
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM v1 WHERE a=8;
+ a | b | c | d
+---+---+------+------
+ 8 | 8 | t1 | t11d
+ 8 | 8 | t11 | t11d
+ 8 | 8 | t12 | t11d
+ 8 | 8 | t111 | t11d
+(4 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+UPDATE v1 SET a=100 WHERE snoop(a) AND leakproof(a) AND a < 7 AND a != 6;
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------------------------
+ Update on public.t1
+ Update on public.t1
+ Update on public.t11
+ Update on public.t12
+ Update on public.t111
+ -> Index Scan using t1_a_idx on public.t1
+ Output: 100, t1.b, t1.c, t1.ctid
+ Index Cond: ((t1.a > 5) AND (t1.a < 7))
+ Filter: ((t1.a <> 6) AND (alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t1.a) AND leakproof(t1.a))
+ SubPlan 1
+ -> Append
+ -> Seq Scan on public.t12 t12_1
+ Filter: (t12_1.a = t1.a)
+ -> Seq Scan on public.t111 t111_1
+ Filter: (t111_1.a = t1.a)
+ SubPlan 2
+ -> Append
+ -> Seq Scan on public.t12 t12_2
+ Output: t12_2.a
+ -> Seq Scan on public.t111 t111_2
+ Output: t111_2.a
+ -> Index Scan using t11_a_idx on public.t11
+ Output: 100, t11.b, t11.c, t11.d, t11.ctid
+ Index Cond: ((t11.a > 5) AND (t11.a < 7))
+ Filter: ((t11.a <> 6) AND (alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t11.a) AND leakproof(t11.a))
+ -> Index Scan using t12_a_idx on public.t12
+ Output: 100, t12.b, t12.c, t12.e, t12.ctid
+ Index Cond: ((t12.a > 5) AND (t12.a < 7))
+ Filter: ((t12.a <> 6) AND (alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t12.a) AND leakproof(t12.a))
+ -> Index Scan using t111_a_idx on public.t111
+ Output: 100, t111.b, t111.c, t111.d, t111.e, t111.ctid
+ Index Cond: ((t111.a > 5) AND (t111.a < 7))
+ Filter: ((t111.a <> 6) AND (alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t111.a) AND leakproof(t111.a))
+(33 rows)
+
+UPDATE v1 SET a=100 WHERE snoop(a) AND leakproof(a) AND a < 7 AND a != 6;
+SELECT * FROM v1 WHERE a=100; -- Nothing should have been changed to 100
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM t1 WHERE a=100; -- Nothing should have been changed to 100
+ a | b | c
+---+---+---
+(0 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+UPDATE v1 SET a=a+1 WHERE snoop(a) AND leakproof(a) AND a = 8;
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
+ Update on public.t1
+ Update on public.t1
+ Update on public.t11
+ Update on public.t12
+ Update on public.t111
+ -> Index Scan using t1_a_idx on public.t1
+ Output: (t1.a + 1), t1.b, t1.c, t1.ctid
+ Index Cond: ((t1.a > 5) AND (t1.a = 8))
+ Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t1.a) AND leakproof(t1.a))
+ SubPlan 1
+ -> Append
+ -> Seq Scan on public.t12 t12_1
+ Filter: (t12_1.a = t1.a)
+ -> Seq Scan on public.t111 t111_1
+ Filter: (t111_1.a = t1.a)
+ SubPlan 2
+ -> Append
+ -> Seq Scan on public.t12 t12_2
+ Output: t12_2.a
+ -> Seq Scan on public.t111 t111_2
+ Output: t111_2.a
+ -> Index Scan using t11_a_idx on public.t11
+ Output: (t11.a + 1), t11.b, t11.c, t11.d, t11.ctid
+ Index Cond: ((t11.a > 5) AND (t11.a = 8))
+ Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t11.a) AND leakproof(t11.a))
+ -> Index Scan using t12_a_idx on public.t12
+ Output: (t12.a + 1), t12.b, t12.c, t12.e, t12.ctid
+ Index Cond: ((t12.a > 5) AND (t12.a = 8))
+ Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t12.a) AND leakproof(t12.a))
+ -> Index Scan using t111_a_idx on public.t111
+ Output: (t111.a + 1), t111.b, t111.c, t111.d, t111.e, t111.ctid
+ Index Cond: ((t111.a > 5) AND (t111.a = 8))
+ Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t111.a) AND leakproof(t111.a))
+(33 rows)
+
+UPDATE v1 SET a=a+1 WHERE snoop(a) AND leakproof(a) AND a = 8;
+NOTICE: snooped value: 8
+NOTICE: snooped value: 8
+NOTICE: snooped value: 8
+NOTICE: snooped value: 8
+SELECT * FROM v1 WHERE b=8;
+ a | b | c | d
+---+---+------+------
+ 9 | 8 | t1 | t11d
+ 9 | 8 | t11 | t11d
+ 9 | 8 | t12 | t11d
+ 9 | 8 | t111 | t11d
+(4 rows)
+
+DELETE FROM v1 WHERE snoop(a) AND leakproof(a); -- should not delete everything, just where a>5
+NOTICE: snooped value: 6
+NOTICE: snooped value: 7
+NOTICE: snooped value: 9
+NOTICE: snooped value: 10
+NOTICE: snooped value: 9
+NOTICE: snooped value: 6
+NOTICE: snooped value: 7
+NOTICE: snooped value: 9
+NOTICE: snooped value: 9
+NOTICE: snooped value: 10
+NOTICE: snooped value: 6
+NOTICE: snooped value: 7
+NOTICE: snooped value: 9
+NOTICE: snooped value: 9
+NOTICE: snooped value: 10
+NOTICE: snooped value: 6
+NOTICE: snooped value: 7
+NOTICE: snooped value: 9
+NOTICE: snooped value: 9
+NOTICE: snooped value: 10
+TABLE t1; -- verify all a<=5 are intact
+ a | b | c
+---+---+------
+ 1 | 1 | t1
+ 2 | 2 | t1
+ 3 | 3 | t1
+ 4 | 4 | t1
+ 5 | 5 | t1
+ 1 | 1 | t11
+ 2 | 2 | t11
+ 3 | 3 | t11
+ 4 | 4 | t11
+ 5 | 5 | t11
+ 1 | 1 | t12
+ 2 | 2 | t12
+ 3 | 3 | t12
+ 4 | 4 | t12
+ 5 | 5 | t12
+ 1 | 1 | t111
+ 2 | 2 | t111
+ 3 | 3 | t111
+ 4 | 4 | t111
+ 5 | 5 | t111
+(20 rows)
+
+DROP TABLE t1, t11, t12, t111 CASCADE;
+NOTICE: drop cascades to view v1
+DROP FUNCTION snoop(anyelement);
+DROP FUNCTION leakproof(anyelement);
+CREATE TABLE tx1 (a integer);
+CREATE TABLE tx2 (b integer);
+CREATE TABLE tx3 (c integer);
+CREATE VIEW vx1 AS SELECT a FROM tx1 WHERE EXISTS(SELECT 1 FROM tx2 JOIN tx3 ON b=c);
+INSERT INTO vx1 values (1);
+SELECT * FROM tx1;
+ a
+---
+ 1
+(1 row)
+
+SELECT * FROM vx1;
+ a
+---
+(0 rows)
+
+DROP VIEW vx1;
+DROP TABLE tx1;
+DROP TABLE tx2;
+DROP TABLE tx3;
+CREATE TABLE tx1 (a integer);
+CREATE TABLE tx2 (b integer);
+CREATE TABLE tx3 (c integer);
+CREATE VIEW vx1 AS SELECT a FROM tx1 WHERE EXISTS(SELECT 1 FROM tx2 JOIN tx3 ON b=c);
+INSERT INTO vx1 VALUES (1);
+INSERT INTO vx1 VALUES (1);
+SELECT * FROM tx1;
+ a
+---
+ 1
+ 1
+(2 rows)
+
+SELECT * FROM vx1;
+ a
+---
+(0 rows)
+
+DROP VIEW vx1;
+DROP TABLE tx1;
+DROP TABLE tx2;
+DROP TABLE tx3;
+CREATE TABLE tx1 (a integer, b integer);
+CREATE TABLE tx2 (b integer, c integer);
+CREATE TABLE tx3 (c integer, d integer);
+ALTER TABLE tx1 DROP COLUMN b;
+ALTER TABLE tx2 DROP COLUMN c;
+ALTER TABLE tx3 DROP COLUMN d;
+CREATE VIEW vx1 AS SELECT a FROM tx1 WHERE EXISTS(SELECT 1 FROM tx2 JOIN tx3 ON b=c);
+INSERT INTO vx1 VALUES (1);
+INSERT INTO vx1 VALUES (1);
+SELECT * FROM tx1;
+ a
+---
+ 1
+ 1
+(2 rows)
+
+SELECT * FROM vx1;
+ a
+---
+(0 rows)
+
+DROP VIEW vx1;
+DROP TABLE tx1;
+DROP TABLE tx2;
+DROP TABLE tx3;
+--
+-- Test handling of vars from correlated subqueries in quals from outer
+-- security barrier views, per bug #13988
+--
+CREATE TABLE t1 (a int, b text, c int);
+INSERT INTO t1 VALUES (1, 'one', 10);
+CREATE TABLE t2 (cc int);
+INSERT INTO t2 VALUES (10), (20);
+CREATE VIEW v1 WITH (security_barrier = true) AS
+ SELECT * FROM t1 WHERE (a > 0)
+ WITH CHECK OPTION;
+CREATE VIEW v2 WITH (security_barrier = true) AS
+ SELECT * FROM v1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.cc = v1.c)
+ WITH CHECK OPTION;
+INSERT INTO v2 VALUES (2, 'two', 20); -- ok
+INSERT INTO v2 VALUES (-2, 'minus two', 20); -- not allowed
+ERROR: new row violates check option for view "v1"
+DETAIL: Failing row contains (-2, minus two, 20).
+INSERT INTO v2 VALUES (3, 'three', 30); -- not allowed
+ERROR: new row violates check option for view "v2"
+DETAIL: Failing row contains (3, three, 30).
+UPDATE v2 SET b = 'ONE' WHERE a = 1; -- ok
+UPDATE v2 SET a = -1 WHERE a = 1; -- not allowed
+ERROR: new row violates check option for view "v1"
+DETAIL: Failing row contains (-1, ONE, 10).
+UPDATE v2 SET c = 30 WHERE a = 1; -- not allowed
+ERROR: new row violates check option for view "v2"
+DETAIL: Failing row contains (1, ONE, 30).
+DELETE FROM v2 WHERE a = 2; -- ok
+SELECT * FROM v2;
+ a | b | c
+---+-----+----
+ 1 | ONE | 10
+(1 row)
+
+DROP VIEW v2;
+DROP VIEW v1;
+DROP TABLE t2;
+DROP TABLE t1;
+--
+-- Test CREATE OR REPLACE VIEW turning a non-updatable view into an
+-- auto-updatable view and adding check options in a single step
+--
+CREATE TABLE t1 (a int, b text);
+CREATE VIEW v1 AS SELECT null::int AS a;
+CREATE OR REPLACE VIEW v1 AS SELECT * FROM t1 WHERE a > 0 WITH CHECK OPTION;
+INSERT INTO v1 VALUES (1, 'ok'); -- ok
+INSERT INTO v1 VALUES (-1, 'invalid'); -- should fail
+ERROR: new row violates check option for view "v1"
+DETAIL: Failing row contains (-1, invalid).
+DROP VIEW v1;
+DROP TABLE t1;
+-- check that an auto-updatable view on a partitioned table works correctly
+create table uv_pt (a int, b int, v varchar) partition by range (a, b);
+create table uv_pt1 (b int not null, v varchar, a int not null) partition by range (b);
+create table uv_pt11 (like uv_pt1);
+alter table uv_pt11 drop a;
+alter table uv_pt11 add a int;
+alter table uv_pt11 drop a;
+alter table uv_pt11 add a int not null;
+alter table uv_pt1 attach partition uv_pt11 for values from (2) to (5);
+alter table uv_pt attach partition uv_pt1 for values from (1, 2) to (1, 10);
+create view uv_ptv as select * from uv_pt;
+select events & 4 != 0 AS upd,
+ events & 8 != 0 AS ins,
+ events & 16 != 0 AS del
+ from pg_catalog.pg_relation_is_updatable('uv_pt'::regclass, false) t(events);
+ upd | ins | del
+-----+-----+-----
+ t | t | t
+(1 row)
+
+select pg_catalog.pg_column_is_updatable('uv_pt'::regclass, 1::smallint, false);
+ pg_column_is_updatable
+------------------------
+ t
+(1 row)
+
+select pg_catalog.pg_column_is_updatable('uv_pt'::regclass, 2::smallint, false);
+ pg_column_is_updatable
+------------------------
+ t
+(1 row)
+
+select table_name, is_updatable, is_insertable_into
+ from information_schema.views where table_name = 'uv_ptv';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ uv_ptv | YES | YES
+(1 row)
+
+select table_name, column_name, is_updatable
+ from information_schema.columns where table_name = 'uv_ptv' order by column_name;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ uv_ptv | a | YES
+ uv_ptv | b | YES
+ uv_ptv | v | YES
+(3 rows)
+
+insert into uv_ptv values (1, 2);
+select tableoid::regclass, * from uv_pt;
+ tableoid | a | b | v
+----------+---+---+---
+ uv_pt11 | 1 | 2 |
+(1 row)
+
+create view uv_ptv_wco as select * from uv_pt where a = 0 with check option;
+insert into uv_ptv_wco values (1, 2);
+ERROR: new row violates check option for view "uv_ptv_wco"
+DETAIL: Failing row contains (1, 2, null).
+drop view uv_ptv, uv_ptv_wco;
+drop table uv_pt, uv_pt1, uv_pt11;
+-- check that wholerow vars appearing in WITH CHECK OPTION constraint expressions
+-- work fine with partitioned tables
+create table wcowrtest (a int) partition by list (a);
+create table wcowrtest1 partition of wcowrtest for values in (1);
+create view wcowrtest_v as select * from wcowrtest where wcowrtest = '(2)'::wcowrtest with check option;
+insert into wcowrtest_v values (1);
+ERROR: new row violates check option for view "wcowrtest_v"
+DETAIL: Failing row contains (1).
+alter table wcowrtest add b text;
+create table wcowrtest2 (b text, c int, a int);
+alter table wcowrtest2 drop c;
+alter table wcowrtest attach partition wcowrtest2 for values in (2);
+create table sometable (a int, b text);
+insert into sometable values (1, 'a'), (2, 'b');
+create view wcowrtest_v2 as
+ select *
+ from wcowrtest r
+ where r in (select s from sometable s where r.a = s.a)
+with check option;
+-- WITH CHECK qual will be processed with wcowrtest2's
+-- rowtype after tuple-routing
+insert into wcowrtest_v2 values (2, 'no such row in sometable');
+ERROR: new row violates check option for view "wcowrtest_v2"
+DETAIL: Failing row contains (2, no such row in sometable).
+drop view wcowrtest_v, wcowrtest_v2;
+drop table wcowrtest, sometable;
+-- Check INSERT .. ON CONFLICT DO UPDATE works correctly when the view's
+-- columns are named and ordered differently than the underlying table's.
+create table uv_iocu_tab (a text unique, b float);
+insert into uv_iocu_tab values ('xyxyxy', 0);
+create view uv_iocu_view as
+ select b, b+1 as c, a, '2.0'::text as two from uv_iocu_tab;
+insert into uv_iocu_view (a, b) values ('xyxyxy', 1)
+ on conflict (a) do update set b = uv_iocu_view.b;
+select * from uv_iocu_tab;
+ a | b
+--------+---
+ xyxyxy | 0
+(1 row)
+
+insert into uv_iocu_view (a, b) values ('xyxyxy', 1)
+ on conflict (a) do update set b = excluded.b;
+select * from uv_iocu_tab;
+ a | b
+--------+---
+ xyxyxy | 1
+(1 row)
+
+-- OK to access view columns that are not present in underlying base
+-- relation in the ON CONFLICT portion of the query
+insert into uv_iocu_view (a, b) values ('xyxyxy', 3)
+ on conflict (a) do update set b = cast(excluded.two as float);
+select * from uv_iocu_tab;
+ a | b
+--------+---
+ xyxyxy | 2
+(1 row)
+
+explain (costs off)
+insert into uv_iocu_view (a, b) values ('xyxyxy', 3)
+ on conflict (a) do update set b = excluded.b where excluded.c > 0;
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Insert on uv_iocu_tab
+ Conflict Resolution: UPDATE
+ Conflict Arbiter Indexes: uv_iocu_tab_a_key
+ Conflict Filter: ((excluded.b + '1'::double precision) > '0'::double precision)
+ -> Result
+(5 rows)
+
+insert into uv_iocu_view (a, b) values ('xyxyxy', 3)
+ on conflict (a) do update set b = excluded.b where excluded.c > 0;
+select * from uv_iocu_tab;
+ a | b
+--------+---
+ xyxyxy | 3
+(1 row)
+
+drop view uv_iocu_view;
+drop table uv_iocu_tab;
+-- Test whole-row references to the view
+create table uv_iocu_tab (a int unique, b text);
+create view uv_iocu_view as
+ select b as bb, a as aa, uv_iocu_tab::text as cc from uv_iocu_tab;
+insert into uv_iocu_view (aa,bb) values (1,'x');
+explain (costs off)
+insert into uv_iocu_view (aa,bb) values (1,'y')
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*
+ where excluded.aa > 0
+ and excluded.bb != ''
+ and excluded.cc is not null;
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
+ Insert on uv_iocu_tab
+ Conflict Resolution: UPDATE
+ Conflict Arbiter Indexes: uv_iocu_tab_a_key
+ Conflict Filter: ((excluded.a > 0) AND (excluded.b <> ''::text) AND ((excluded.*)::text IS NOT NULL))
+ -> Result
+(5 rows)
+
+insert into uv_iocu_view (aa,bb) values (1,'y')
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*
+ where excluded.aa > 0
+ and excluded.bb != ''
+ and excluded.cc is not null;
+select * from uv_iocu_view;
+ bb | aa | cc
+-------------------------+----+---------------------------------
+ Rejected: (y,1,"(1,y)") | 1 | (1,"Rejected: (y,1,""(1,y)"")")
+(1 row)
+
+-- Test omitting a column of the base relation
+delete from uv_iocu_view;
+insert into uv_iocu_view (aa,bb) values (1,'x');
+insert into uv_iocu_view (aa) values (1)
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*;
+select * from uv_iocu_view;
+ bb | aa | cc
+-----------------------+----+-------------------------------
+ Rejected: (,1,"(1,)") | 1 | (1,"Rejected: (,1,""(1,)"")")
+(1 row)
+
+alter table uv_iocu_tab alter column b set default 'table default';
+insert into uv_iocu_view (aa) values (1)
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*;
+select * from uv_iocu_view;
+ bb | aa | cc
+-------------------------------------------------------+----+---------------------------------------------------------------------
+ Rejected: ("table default",1,"(1,""table default"")") | 1 | (1,"Rejected: (""table default"",1,""(1,""""table default"""")"")")
+(1 row)
+
+alter view uv_iocu_view alter column bb set default 'view default';
+insert into uv_iocu_view (aa) values (1)
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*;
+select * from uv_iocu_view;
+ bb | aa | cc
+-----------------------------------------------------+----+-------------------------------------------------------------------
+ Rejected: ("view default",1,"(1,""view default"")") | 1 | (1,"Rejected: (""view default"",1,""(1,""""view default"""")"")")
+(1 row)
+
+-- Should fail to update non-updatable columns
+insert into uv_iocu_view (aa) values (1)
+ on conflict (aa) do update set cc = 'XXX';
+ERROR: cannot insert into column "cc" of view "uv_iocu_view"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+drop view uv_iocu_view;
+drop table uv_iocu_tab;
+-- ON CONFLICT DO UPDATE permissions checks
+create user regress_view_user1;
+create user regress_view_user2;
+set session authorization regress_view_user1;
+create table base_tbl(a int unique, b text, c float);
+insert into base_tbl values (1,'xxx',1.0);
+create view rw_view1 as select b as bb, c as cc, a as aa from base_tbl;
+grant select (aa,bb) on rw_view1 to regress_view_user2;
+grant insert on rw_view1 to regress_view_user2;
+grant update (bb) on rw_view1 to regress_view_user2;
+set session authorization regress_view_user2;
+insert into rw_view1 values ('yyy',2.0,1)
+ on conflict (aa) do update set bb = excluded.cc; -- Not allowed
+ERROR: permission denied for view rw_view1
+insert into rw_view1 values ('yyy',2.0,1)
+ on conflict (aa) do update set bb = rw_view1.cc; -- Not allowed
+ERROR: permission denied for view rw_view1
+insert into rw_view1 values ('yyy',2.0,1)
+ on conflict (aa) do update set bb = excluded.bb; -- OK
+insert into rw_view1 values ('zzz',2.0,1)
+ on conflict (aa) do update set bb = rw_view1.bb||'xxx'; -- OK
+insert into rw_view1 values ('zzz',2.0,1)
+ on conflict (aa) do update set cc = 3.0; -- Not allowed
+ERROR: permission denied for view rw_view1
+reset session authorization;
+select * from base_tbl;
+ a | b | c
+---+--------+---
+ 1 | yyyxxx | 1
+(1 row)
+
+set session authorization regress_view_user1;
+grant select (a,b) on base_tbl to regress_view_user2;
+grant insert (a,b) on base_tbl to regress_view_user2;
+grant update (a,b) on base_tbl to regress_view_user2;
+set session authorization regress_view_user2;
+create view rw_view2 as select b as bb, c as cc, a as aa from base_tbl;
+insert into rw_view2 (aa,bb) values (1,'xxx')
+ on conflict (aa) do update set bb = excluded.bb; -- Not allowed
+ERROR: permission denied for table base_tbl
+create view rw_view3 as select b as bb, a as aa from base_tbl;
+insert into rw_view3 (aa,bb) values (1,'xxx')
+ on conflict (aa) do update set bb = excluded.bb; -- OK
+reset session authorization;
+select * from base_tbl;
+ a | b | c
+---+-----+---
+ 1 | xxx | 1
+(1 row)
+
+set session authorization regress_view_user2;
+create view rw_view4 as select aa, bb, cc FROM rw_view1;
+insert into rw_view4 (aa,bb) values (1,'yyy')
+ on conflict (aa) do update set bb = excluded.bb; -- Not allowed
+ERROR: permission denied for view rw_view1
+create view rw_view5 as select aa, bb FROM rw_view1;
+insert into rw_view5 (aa,bb) values (1,'yyy')
+ on conflict (aa) do update set bb = excluded.bb; -- OK
+reset session authorization;
+select * from base_tbl;
+ a | b | c
+---+-----+---
+ 1 | yyy | 1
+(1 row)
+
+drop view rw_view5;
+drop view rw_view4;
+drop view rw_view3;
+drop view rw_view2;
+drop view rw_view1;
+drop table base_tbl;
+drop user regress_view_user1;
+drop user regress_view_user2;
+-- Test single- and multi-row inserts with table and view defaults.
+-- Table defaults should be used, unless overridden by view defaults.
+create table base_tab_def (a int, b text default 'Table default',
+ c text default 'Table default', d text, e text);
+create view base_tab_def_view as select * from base_tab_def;
+alter view base_tab_def_view alter b set default 'View default';
+alter view base_tab_def_view alter d set default 'View default';
+insert into base_tab_def values (1);
+insert into base_tab_def values (2), (3);
+insert into base_tab_def values (4, default, default, default, default);
+insert into base_tab_def values (5, default, default, default, default),
+ (6, default, default, default, default);
+insert into base_tab_def_view values (11);
+insert into base_tab_def_view values (12), (13);
+insert into base_tab_def_view values (14, default, default, default, default);
+insert into base_tab_def_view values (15, default, default, default, default),
+ (16, default, default, default, default);
+insert into base_tab_def_view values (17), (default);
+select * from base_tab_def order by a;
+ a | b | c | d | e
+----+---------------+---------------+--------------+---
+ 1 | Table default | Table default | |
+ 2 | Table default | Table default | |
+ 3 | Table default | Table default | |
+ 4 | Table default | Table default | |
+ 5 | Table default | Table default | |
+ 6 | Table default | Table default | |
+ 11 | View default | Table default | View default |
+ 12 | View default | Table default | View default |
+ 13 | View default | Table default | View default |
+ 14 | View default | Table default | View default |
+ 15 | View default | Table default | View default |
+ 16 | View default | Table default | View default |
+ 17 | View default | Table default | View default |
+ | View default | Table default | View default |
+(14 rows)
+
+-- Adding an INSTEAD OF trigger should cause NULLs to be inserted instead of
+-- table defaults, where there are no view defaults.
+create function base_tab_def_view_instrig_func() returns trigger
+as
+$$
+begin
+ insert into base_tab_def values (new.a, new.b, new.c, new.d, new.e);
+ return new;
+end;
+$$
+language plpgsql;
+create trigger base_tab_def_view_instrig instead of insert on base_tab_def_view
+ for each row execute function base_tab_def_view_instrig_func();
+truncate base_tab_def;
+insert into base_tab_def values (1);
+insert into base_tab_def values (2), (3);
+insert into base_tab_def values (4, default, default, default, default);
+insert into base_tab_def values (5, default, default, default, default),
+ (6, default, default, default, default);
+insert into base_tab_def_view values (11);
+insert into base_tab_def_view values (12), (13);
+insert into base_tab_def_view values (14, default, default, default, default);
+insert into base_tab_def_view values (15, default, default, default, default),
+ (16, default, default, default, default);
+insert into base_tab_def_view values (17), (default);
+select * from base_tab_def order by a;
+ a | b | c | d | e
+----+---------------+---------------+--------------+---
+ 1 | Table default | Table default | |
+ 2 | Table default | Table default | |
+ 3 | Table default | Table default | |
+ 4 | Table default | Table default | |
+ 5 | Table default | Table default | |
+ 6 | Table default | Table default | |
+ 11 | View default | | View default |
+ 12 | View default | | View default |
+ 13 | View default | | View default |
+ 14 | View default | | View default |
+ 15 | View default | | View default |
+ 16 | View default | | View default |
+ 17 | View default | | View default |
+ | View default | | View default |
+(14 rows)
+
+-- Using an unconditional DO INSTEAD rule should also cause NULLs to be
+-- inserted where there are no view defaults.
+drop trigger base_tab_def_view_instrig on base_tab_def_view;
+drop function base_tab_def_view_instrig_func;
+create rule base_tab_def_view_ins_rule as on insert to base_tab_def_view
+ do instead insert into base_tab_def values (new.a, new.b, new.c, new.d, new.e);
+truncate base_tab_def;
+insert into base_tab_def values (1);
+insert into base_tab_def values (2), (3);
+insert into base_tab_def values (4, default, default, default, default);
+insert into base_tab_def values (5, default, default, default, default),
+ (6, default, default, default, default);
+insert into base_tab_def_view values (11);
+insert into base_tab_def_view values (12), (13);
+insert into base_tab_def_view values (14, default, default, default, default);
+insert into base_tab_def_view values (15, default, default, default, default),
+ (16, default, default, default, default);
+insert into base_tab_def_view values (17), (default);
+select * from base_tab_def order by a;
+ a | b | c | d | e
+----+---------------+---------------+--------------+---
+ 1 | Table default | Table default | |
+ 2 | Table default | Table default | |
+ 3 | Table default | Table default | |
+ 4 | Table default | Table default | |
+ 5 | Table default | Table default | |
+ 6 | Table default | Table default | |
+ 11 | View default | | View default |
+ 12 | View default | | View default |
+ 13 | View default | | View default |
+ 14 | View default | | View default |
+ 15 | View default | | View default |
+ 16 | View default | | View default |
+ 17 | View default | | View default |
+ | View default | | View default |
+(14 rows)
+
+-- A DO ALSO rule should cause each row to be inserted twice. The first
+-- insert should behave the same as an auto-updatable view (using table
+-- defaults, unless overridden by view defaults). The second insert should
+-- behave the same as a rule-updatable view (inserting NULLs where there are
+-- no view defaults).
+drop rule base_tab_def_view_ins_rule on base_tab_def_view;
+create rule base_tab_def_view_ins_rule as on insert to base_tab_def_view
+ do also insert into base_tab_def values (new.a, new.b, new.c, new.d, new.e);
+truncate base_tab_def;
+insert into base_tab_def values (1);
+insert into base_tab_def values (2), (3);
+insert into base_tab_def values (4, default, default, default, default);
+insert into base_tab_def values (5, default, default, default, default),
+ (6, default, default, default, default);
+insert into base_tab_def_view values (11);
+insert into base_tab_def_view values (12), (13);
+insert into base_tab_def_view values (14, default, default, default, default);
+insert into base_tab_def_view values (15, default, default, default, default),
+ (16, default, default, default, default);
+insert into base_tab_def_view values (17), (default);
+select * from base_tab_def order by a, c NULLS LAST;
+ a | b | c | d | e
+----+---------------+---------------+--------------+---
+ 1 | Table default | Table default | |
+ 2 | Table default | Table default | |
+ 3 | Table default | Table default | |
+ 4 | Table default | Table default | |
+ 5 | Table default | Table default | |
+ 6 | Table default | Table default | |
+ 11 | View default | Table default | View default |
+ 11 | View default | | View default |
+ 12 | View default | Table default | View default |
+ 12 | View default | | View default |
+ 13 | View default | Table default | View default |
+ 13 | View default | | View default |
+ 14 | View default | Table default | View default |
+ 14 | View default | | View default |
+ 15 | View default | Table default | View default |
+ 15 | View default | | View default |
+ 16 | View default | Table default | View default |
+ 16 | View default | | View default |
+ 17 | View default | Table default | View default |
+ 17 | View default | | View default |
+ | View default | Table default | View default |
+ | View default | | View default |
+(22 rows)
+
+drop view base_tab_def_view;
+drop table base_tab_def;
+-- Test defaults with array assignments
+create table base_tab (a serial, b int[], c text, d text default 'Table default');
+create view base_tab_view as select c, a, b from base_tab;
+alter view base_tab_view alter column c set default 'View default';
+insert into base_tab_view (b[1], b[2], c, b[5], b[4], a, b[3])
+values (1, 2, default, 5, 4, default, 3), (10, 11, 'C value', 14, 13, 100, 12);
+select * from base_tab order by a;
+ a | b | c | d
+-----+------------------+--------------+---------------
+ 1 | {1,2,3,4,5} | View default | Table default
+ 100 | {10,11,12,13,14} | C value | Table default
+(2 rows)
+
+drop view base_tab_view;
+drop table base_tab;
diff --git a/src/test/regress/expected/vacuum_1.out b/src/test/regress/expected/vacuum_1.out
new file mode 100644
index 00000000000..667cd5f5bc0
--- /dev/null
+++ b/src/test/regress/expected/vacuum_1.out
@@ -0,0 +1,342 @@
+--
+-- VACUUM
+--
+CREATE TABLE vactst (i INT);
+INSERT INTO vactst VALUES (1);
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst VALUES (0);
+SELECT count(*) FROM vactst;
+ count
+-------
+ 2049
+(1 row)
+
+DELETE FROM vactst WHERE i != 0;
+SELECT * FROM vactst;
+ i
+---
+ 0
+(1 row)
+
+VACUUM FULL vactst;
+UPDATE vactst SET i = i + 1;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst VALUES (0);
+SELECT count(*) FROM vactst;
+ count
+-------
+ 2049
+(1 row)
+
+DELETE FROM vactst WHERE i != 0;
+VACUUM (FULL) vactst;
+DELETE FROM vactst;
+SELECT * FROM vactst;
+ i
+---
+(0 rows)
+
+VACUUM (FULL, FREEZE) vactst;
+VACUUM (ANALYZE, FULL) vactst;
+CREATE TABLE vaccluster (i INT PRIMARY KEY);
+ALTER TABLE vaccluster CLUSTER ON vaccluster_pkey;
+CLUSTER vaccluster;
+CREATE FUNCTION do_analyze() RETURNS VOID VOLATILE LANGUAGE SQL
+ AS 'ANALYZE pg_am';
+CREATE FUNCTION wrap_do_analyze(c INT) RETURNS INT IMMUTABLE LANGUAGE SQL
+ AS 'SELECT $1 FROM do_analyze()';
+CREATE INDEX ON vaccluster(wrap_do_analyze(i));
+INSERT INTO vaccluster VALUES (1), (2);
+ANALYZE vaccluster;
+ERROR: ANALYZE cannot be executed from VACUUM or ANALYZE
+CONTEXT: SQL function "do_analyze" statement 1
+SQL function "wrap_do_analyze" statement 1
+-- Test ANALYZE in transaction, where the transaction surrounding
+-- analyze performed modifications. This tests for the bug at
+-- https://postgr.es/m/c7988239-d42c-ddc4-41db-171b23b35e4f%40ssinger.info
+-- (which hopefully is unlikely to be reintroduced), but also seems
+-- independently worthwhile to cover.
+INSERT INTO vactst SELECT generate_series(1, 300);
+DELETE FROM vactst WHERE i % 7 = 0; -- delete a few rows outside
+BEGIN;
+INSERT INTO vactst SELECT generate_series(301, 400);
+DELETE FROM vactst WHERE i % 5 <> 0; -- delete a few rows inside
+ANALYZE vactst;
+COMMIT;
+VACUUM FULL pg_am;
+VACUUM FULL pg_class;
+VACUUM FULL pg_database;
+VACUUM FULL vaccluster;
+ERROR: ANALYZE cannot be executed from VACUUM or ANALYZE
+CONTEXT: SQL function "do_analyze" statement 1
+SQL function "wrap_do_analyze" statement 1
+VACUUM FULL vactst;
+VACUUM (DISABLE_PAGE_SKIPPING) vaccluster;
+-- INDEX_CLEANUP option
+CREATE TABLE no_index_cleanup (i INT PRIMARY KEY, t TEXT);
+-- Use uncompressed data stored in toast.
+CREATE INDEX no_index_cleanup_idx ON no_index_cleanup(t);
+ALTER TABLE no_index_cleanup ALTER COLUMN t SET STORAGE EXTERNAL;
+INSERT INTO no_index_cleanup(i, t) VALUES (generate_series(1,30),
+ repeat('1234567890',300));
+-- index cleanup option is ignored if VACUUM FULL
+VACUUM (INDEX_CLEANUP TRUE, FULL TRUE) no_index_cleanup;
+VACUUM (FULL TRUE) no_index_cleanup;
+-- Toast inherits the value from its parent table.
+ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = false);
+DELETE FROM no_index_cleanup WHERE i < 15;
+-- Nothing is cleaned up.
+VACUUM no_index_cleanup;
+-- Both parent relation and toast are cleaned up.
+ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = true);
+VACUUM no_index_cleanup;
+-- Parameter is set for both the parent table and its toast relation.
+INSERT INTO no_index_cleanup(i, t) VALUES (generate_series(31,60),
+ repeat('1234567890',300));
+DELETE FROM no_index_cleanup WHERE i < 45;
+-- Only toast index is cleaned up.
+ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = false,
+ toast.vacuum_index_cleanup = true);
+VACUUM no_index_cleanup;
+-- Only parent is cleaned up.
+ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = true,
+ toast.vacuum_index_cleanup = false);
+VACUUM no_index_cleanup;
+-- Test some extra relations.
+VACUUM (INDEX_CLEANUP FALSE) vaccluster;
+VACUUM (INDEX_CLEANUP FALSE) vactst; -- index cleanup option is ignored if no indexes
+VACUUM (INDEX_CLEANUP FALSE, FREEZE TRUE) vaccluster;
+-- TRUNCATE option
+CREATE TABLE vac_truncate_test(i INT NOT NULL, j text)
+ WITH (vacuum_truncate=true, autovacuum_enabled=false);
+INSERT INTO vac_truncate_test VALUES (1, NULL), (NULL, NULL);
+ERROR: null value in column "i" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+VACUUM (TRUNCATE FALSE) vac_truncate_test;
+SELECT pg_relation_size('vac_truncate_test') > 0;
+ ?column?
+----------
+ t
+(1 row)
+
+VACUUM vac_truncate_test;
+SELECT pg_relation_size('vac_truncate_test') = 0;
+ ?column?
+----------
+ f
+(1 row)
+
+VACUUM (TRUNCATE FALSE, FULL TRUE) vac_truncate_test;
+DROP TABLE vac_truncate_test;
+-- partitioned table
+CREATE TABLE vacparted (a int, b char) PARTITION BY LIST (a);
+CREATE TABLE vacparted1 PARTITION OF vacparted FOR VALUES IN (1);
+INSERT INTO vacparted VALUES (1, 'a');
+UPDATE vacparted SET b = 'b';
+VACUUM (ANALYZE) vacparted;
+VACUUM (FULL) vacparted;
+VACUUM (FREEZE) vacparted;
+-- check behavior with duplicate column mentions
+VACUUM ANALYZE vacparted(a,b,a);
+ERROR: column "a" of relation "vacparted" appears more than once
+ANALYZE vacparted(a,b,b);
+ERROR: column "b" of relation "vacparted" appears more than once
+-- multiple tables specified
+VACUUM vaccluster, vactst;
+VACUUM vacparted, does_not_exist;
+ERROR: relation "does_not_exist" does not exist
+VACUUM (FREEZE) vacparted, vaccluster, vactst;
+VACUUM (FREEZE) does_not_exist, vaccluster;
+ERROR: relation "does_not_exist" does not exist
+VACUUM ANALYZE vactst, vacparted (a);
+VACUUM ANALYZE vactst (does_not_exist), vacparted (b);
+ERROR: column "does_not_exist" of relation "vactst" does not exist
+VACUUM FULL vacparted, vactst;
+VACUUM FULL vactst, vacparted (a, b), vaccluster (i);
+ERROR: ANALYZE option must be specified when a column list is provided
+ANALYZE vactst, vacparted;
+ANALYZE vacparted (b), vactst;
+ANALYZE vactst, does_not_exist, vacparted;
+ERROR: relation "does_not_exist" does not exist
+ANALYZE vactst (i), vacparted (does_not_exist);
+ERROR: column "does_not_exist" of relation "vacparted" does not exist
+-- parenthesized syntax for ANALYZE
+ANALYZE (VERBOSE) does_not_exist;
+ERROR: relation "does_not_exist" does not exist
+ANALYZE (nonexistent-arg) does_not_exist;
+ERROR: syntax error at or near "arg"
+LINE 1: ANALYZE (nonexistent-arg) does_not_exist;
+ ^
+ANALYZE (nonexistentarg) does_not_exit;
+ERROR: unrecognized ANALYZE option "nonexistentarg"
+LINE 1: ANALYZE (nonexistentarg) does_not_exit;
+ ^
+-- ensure argument order independence, and that SKIP_LOCKED on non-existing
+-- relation still errors out.
+ANALYZE (SKIP_LOCKED, VERBOSE) does_not_exist;
+ERROR: relation "does_not_exist" does not exist
+ANALYZE (VERBOSE, SKIP_LOCKED) does_not_exist;
+ERROR: relation "does_not_exist" does not exist
+-- SKIP_LOCKED option
+VACUUM (SKIP_LOCKED) vactst;
+VACUUM (SKIP_LOCKED, FULL) vactst;
+ANALYZE (SKIP_LOCKED) vactst;
+-- ensure VACUUM and ANALYZE don't have a problem with serializable
+SET default_transaction_isolation = serializable;
+VACUUM vactst;
+ANALYZE vactst;
+RESET default_transaction_isolation;
+BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ANALYZE vactst;
+COMMIT;
+DROP TABLE vaccluster;
+DROP TABLE vactst;
+DROP TABLE vacparted;
+DROP TABLE no_index_cleanup;
+-- relation ownership, WARNING logs generated as all are skipped.
+CREATE TABLE vacowned (a int);
+CREATE TABLE vacowned_parted (a int) PARTITION BY LIST (a);
+CREATE TABLE vacowned_part1 PARTITION OF vacowned_parted FOR VALUES IN (1);
+CREATE TABLE vacowned_part2 PARTITION OF vacowned_parted FOR VALUES IN (2);
+CREATE ROLE regress_vacuum;
+SET ROLE regress_vacuum;
+-- Simple table
+VACUUM vacowned;
+WARNING: skipping "vacowned" --- only table or database owner can vacuum it
+ANALYZE vacowned;
+WARNING: skipping "vacowned" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned;
+WARNING: skipping "vacowned" --- only table or database owner can vacuum it
+-- Catalog
+VACUUM pg_catalog.pg_class;
+WARNING: skipping "pg_class" --- only superuser or database owner can vacuum it
+ANALYZE pg_catalog.pg_class;
+WARNING: skipping "pg_class" --- only superuser or database owner can analyze it
+VACUUM (ANALYZE) pg_catalog.pg_class;
+WARNING: skipping "pg_class" --- only superuser or database owner can vacuum it
+-- Shared catalog
+VACUUM pg_catalog.pg_authid;
+WARNING: skipping "pg_authid" --- only superuser can vacuum it
+ANALYZE pg_catalog.pg_authid;
+WARNING: skipping "pg_authid" --- only superuser can analyze it
+VACUUM (ANALYZE) pg_catalog.pg_authid;
+WARNING: skipping "pg_authid" --- only superuser can vacuum it
+-- Partitioned table and its partitions, nothing owned by other user.
+-- Relations are not listed in a single command to test ownership
+-- independently.
+VACUUM vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+VACUUM vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+ANALYZE vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can analyze it
+WARNING: skipping "vacowned_part1" --- only table or database owner can analyze it
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+ANALYZE vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can analyze it
+ANALYZE vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+RESET ROLE;
+-- Partitioned table and one partition owned by other user.
+ALTER TABLE vacowned_parted OWNER TO regress_vacuum;
+ALTER TABLE vacowned_part1 OWNER TO regress_vacuum;
+SET ROLE regress_vacuum;
+VACUUM vacowned_parted;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM vacowned_part1;
+VACUUM vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+ANALYZE vacowned_parted;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+ANALYZE vacowned_part1;
+ANALYZE vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned_parted;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part1;
+VACUUM (ANALYZE) vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+RESET ROLE;
+-- Only one partition owned by other user.
+ALTER TABLE vacowned_parted OWNER TO CURRENT_USER;
+SET ROLE regress_vacuum;
+VACUUM vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM vacowned_part1;
+VACUUM vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+ANALYZE vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can analyze it
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+ANALYZE vacowned_part1;
+ANALYZE vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part1;
+VACUUM (ANALYZE) vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+RESET ROLE;
+-- Only partitioned table owned by other user.
+ALTER TABLE vacowned_parted OWNER TO regress_vacuum;
+ALTER TABLE vacowned_part1 OWNER TO CURRENT_USER;
+SET ROLE regress_vacuum;
+VACUUM vacowned_parted;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+VACUUM vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+ANALYZE vacowned_parted;
+WARNING: skipping "vacowned_part1" --- only table or database owner can analyze it
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+ANALYZE vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can analyze it
+ANALYZE vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned_parted;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+RESET ROLE;
+DROP TABLE vacowned;
+DROP TABLE vacowned_parted;
+DROP ROLE regress_vacuum;
diff --git a/src/test/regress/expected/zedstore.out b/src/test/regress/expected/zedstore.out
new file mode 100644
index 00000000000..96c951fc027
--- /dev/null
+++ b/src/test/regress/expected/zedstore.out
@@ -0,0 +1,609 @@
+-- simple tests to iteratively build the zedstore
+-- create and drop works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+drop table t_zedstore;
+-- insert and select works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+insert into t_zedstore select i,i+1,i+2 from generate_series(1, 10)i;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 4 | 5 | 6
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+(10 rows)
+
+-- selecting only a few columns works
+select c1, c3 from t_zedstore;
+ c1 | c3
+----+----
+ 1 | 3
+ 2 | 4
+ 3 | 5
+ 4 | 6
+ 5 | 7
+ 6 | 8
+ 7 | 9
+ 8 | 10
+ 9 | 11
+ 10 | 12
+(10 rows)
+
+-- only a few columns in the output and where clause works
+select c3 from t_zedstore where c2 > 5;
+ c3
+----
+ 7
+ 8
+ 9
+ 10
+ 11
+ 12
+(6 rows)
+
+-- Test abort works
+begin;
+insert into t_zedstore select i,i+1,i+2 from generate_series(21, 25)i;
+abort;
+insert into t_zedstore select i,i+1,i+2 from generate_series(31, 35)i;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 4 | 5 | 6
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(15 rows)
+
+--
+-- Test indexing
+--
+create index on t_zedstore (c1);
+set enable_seqscan=off;
+set enable_indexscan=on;
+set enable_bitmapscan=off;
+-- index scan
+select * from t_zedstore where c1 = 5;
+ c1 | c2 | c3
+----+----+----
+ 5 | 6 | 7
+(1 row)
+
+-- index-only scan
+select c1 from t_zedstore where c1 = 5;
+ c1
+----
+ 5
+(1 row)
+
+-- bitmap scan
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+select c1, c2 from t_zedstore where c1 between 5 and 10;
+ c1 | c2
+----+----
+ 5 | 6
+ 6 | 7
+ 7 | 8
+ 8 | 9
+ 9 | 10
+ 10 | 11
+(6 rows)
+
+--
+-- Test DELETE and UPDATE
+--
+delete from t_zedstore where c2 = 5;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(14 rows)
+
+delete from t_zedstore where c2 < 5;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(11 rows)
+
+update t_zedstore set c2 = 100 where c1 = 8;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+-----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+ 8 | 100 | 10
+(11 rows)
+
+--
+-- Test page deletion, by deleting a bigger range of values
+--
+insert into t_zedstore select i,i+1,i+2 from generate_series(10000, 15000)i;
+delete from t_zedstore where c1 >= 10000;
+--
+-- Test VACUUM
+--
+vacuum t_zedstore;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+-----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+ 8 | 100 | 10
+(11 rows)
+
+--
+-- Test toasting
+--
+create table t_zedtoast(c1 int, t text) USING zedstore;
+insert into t_zedtoast select i, repeat('x', 10000) from generate_series(1, 10) i;
+select c1, length(t) from t_zedtoast;
+ c1 | length
+----+--------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+ 6 | 10000
+ 7 | 10000
+ 8 | 10000
+ 9 | 10000
+ 10 | 10000
+(10 rows)
+
+--
+-- Test NULL values
+--
+create table t_zednullvalues(c1 int, c2 int) USING zedstore;
+insert into t_zednullvalues values(1, NULL), (NULL, 2);
+select * from t_zednullvalues;
+ c1 | c2
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+select c2 from t_zednullvalues;
+ c2
+----
+
+ 2
+(2 rows)
+
+update t_zednullvalues set c1 = 1, c2 = NULL;
+select * from t_zednullvalues;
+ c1 | c2
+----+----
+ 1 |
+ 1 |
+(2 rows)
+
+--
+-- Test COPY
+--
+create table t_zedcopy(a serial, b int, c text not null default 'stuff', d text,e text) USING zedstore;
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+COPY t_zedcopy (b, d) from stdin;
+COPY t_zedcopy (b, d) from stdin;
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 1 | 1 | stuff | test_1 |
+ 2 | 2 | stuff | test_2 |
+ 3 | 3 | stuff | test_3 |
+ 4 | 4 | stuff | test_4 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(12 rows)
+
+COPY t_zedcopy (a, d, e) to stdout;
+9999 NN \N
+10000 41 51
+1 test_1 \N
+2 test_2 \N
+3 test_3 \N
+4 test_4 \N
+5 test_5 \N
+10001 42 52
+10002 43 53
+10003 44 54
+10004 45 55
+10005 46 56
+--
+-- Also test delete and update on the table that was populated with COPY.
+-- This exercises splitting the array item. (A table not populated with
+-- COPY only contains single items, at the moment.)
+--
+delete from t_zedcopy where b = 4;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 1 | 1 | stuff | test_1 |
+ 2 | 2 | stuff | test_2 |
+ 3 | 3 | stuff | test_3 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(11 rows)
+
+delete from t_zedcopy where b < 3;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 3 | 3 | stuff | test_3 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(9 rows)
+
+update t_zedcopy set b = 100 where b = 5;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+-----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 3 | 3 | stuff | test_3 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+ 5 | 100 | stuff | test_5 |
+(9 rows)
+
+-- Test rolling back COPY
+begin;
+COPY t_zedcopy (b, d) from stdin;
+rollback;
+select count(*) from t_zedcopy where b >= 20000;
+ count
+-------
+ 0
+(1 row)
+
+--
+-- Test zero column table
+--
+create table t_zwithzerocols() using zedstore;
+insert into t_zwithzerocols select t.* from t_zwithzerocols t right join generate_series(1,1) on true;
+select count(*) from t_zwithzerocols;
+ count
+-------
+ 1
+(1 row)
+
+-- Test for alter table add column
+create table t_zaddcol(a int) using zedstore;
+insert into t_zaddcol select * from generate_series(1, 3);
+-- rewrite case
+alter table t_zaddcol add column b int generated always as (a + 1) stored;
+select * from t_zaddcol;
+ a | b
+---+---
+ 1 | 2
+ 2 | 3
+ 3 | 4
+(3 rows)
+
+-- test alter table add column with no default
+create table t_zaddcol_simple(a int) using zedstore;
+insert into t_zaddcol_simple values (1);
+alter table t_zaddcol_simple add b int;
+select * from t_zaddcol_simple;
+ a | b
+---+---
+ 1 |
+(1 row)
+
+insert into t_zaddcol_simple values(2,3);
+select * from t_zaddcol_simple;
+ a | b
+---+---
+ 1 |
+ 2 | 3
+(2 rows)
+
+-- fixed length default value stored in catalog
+alter table t_zaddcol add column c int default 3;
+select * from t_zaddcol;
+ a | b | c
+---+---+---
+ 1 | 2 | 3
+ 2 | 3 | 3
+ 3 | 4 | 3
+(3 rows)
+
+-- variable length default value stored in catalog
+alter table t_zaddcol add column d text default 'abcdefgh';
+select d from t_zaddcol;
+ d
+----------
+ abcdefgh
+ abcdefgh
+ abcdefgh
+(3 rows)
+
+-- insert after add column
+insert into t_zaddcol values (2);
+select * from t_zaddcol;
+ a | b | c | d
+---+---+---+----------
+ 1 | 2 | 3 | abcdefgh
+ 2 | 3 | 3 | abcdefgh
+ 3 | 4 | 3 | abcdefgh
+ 2 | 3 | 3 | abcdefgh
+(4 rows)
+
+insert into t_zaddcol (a, c, d) values (3,5, 'test_insert');
+select b,c,d from t_zaddcol;
+ b | c | d
+---+---+-------------
+ 2 | 3 | abcdefgh
+ 3 | 3 | abcdefgh
+ 4 | 3 | abcdefgh
+ 3 | 3 | abcdefgh
+ 4 | 5 | test_insert
+(5 rows)
+
+--
+-- Test TABLESAMPLE
+--
+-- regular test tablesample.sql doesn't directly work for zedstore, as
+-- it uses fillfactor to create a specific block layout for
+-- heap. Hence, output differs between heap and zedstore tables while
+-- sampling. We need to use many tuples here to have multiple logical
+-- blocks, as we don't have a way to force the TIDs to spread / jump for zedstore.
+--
+CREATE TABLE t_ztablesample (id int, name text) using zedstore;
+INSERT INTO t_ztablesample
+ SELECT i, repeat(i::text, 2) FROM generate_series(0, 299) s(i);
+-- let's delete half of the rows (even-numbered ids) to limit the output
+DELETE FROM t_ztablesample WHERE id%2 = 0;
+-- should return ALL visible tuples from SOME blocks
+SELECT ctid,t.id FROM t_ztablesample AS t TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ ctid | id
+---------+-----
+ (1,2) | 129
+ (1,4) | 131
+ (1,6) | 133
+ (1,8) | 135
+ (1,10) | 137
+ (1,12) | 139
+ (1,14) | 141
+ (1,16) | 143
+ (1,18) | 145
+ (1,20) | 147
+ (1,22) | 149
+ (1,24) | 151
+ (1,26) | 153
+ (1,28) | 155
+ (1,30) | 157
+ (1,32) | 159
+ (1,34) | 161
+ (1,36) | 163
+ (1,38) | 165
+ (1,40) | 167
+ (1,42) | 169
+ (1,44) | 171
+ (1,46) | 173
+ (1,48) | 175
+ (1,50) | 177
+ (1,52) | 179
+ (1,54) | 181
+ (1,56) | 183
+ (1,58) | 185
+ (1,60) | 187
+ (1,62) | 189
+ (1,64) | 191
+ (1,66) | 193
+ (1,68) | 195
+ (1,70) | 197
+ (1,72) | 199
+ (1,74) | 201
+ (1,76) | 203
+ (1,78) | 205
+ (1,80) | 207
+ (1,82) | 209
+ (1,84) | 211
+ (1,86) | 213
+ (1,88) | 215
+ (1,90) | 217
+ (1,92) | 219
+ (1,94) | 221
+ (1,96) | 223
+ (1,98) | 225
+ (1,100) | 227
+ (1,102) | 229
+ (1,104) | 231
+ (1,106) | 233
+ (1,108) | 235
+ (1,110) | 237
+ (1,112) | 239
+ (1,114) | 241
+ (1,116) | 243
+ (1,118) | 245
+ (1,120) | 247
+ (1,122) | 249
+ (1,124) | 251
+ (1,126) | 253
+ (1,128) | 255
+ (2,2) | 257
+ (2,4) | 259
+ (2,6) | 261
+ (2,8) | 263
+ (2,10) | 265
+ (2,12) | 267
+ (2,14) | 269
+ (2,16) | 271
+ (2,18) | 273
+ (2,20) | 275
+ (2,22) | 277
+ (2,24) | 279
+ (2,26) | 281
+ (2,28) | 283
+ (2,30) | 285
+ (2,32) | 287
+ (2,34) | 289
+ (2,36) | 291
+ (2,38) | 293
+ (2,40) | 295
+ (2,42) | 297
+ (2,44) | 299
+(86 rows)
+
+-- should return SOME visible tuples but from ALL the blocks
+SELECT ctid,id FROM t_ztablesample TABLESAMPLE BERNOULLI (50) REPEATABLE (0);
+ ctid | id
+---------+-----
+ (0,4) | 3
+ (0,6) | 5
+ (0,8) | 7
+ (0,20) | 19
+ (0,30) | 29
+ (0,42) | 41
+ (0,44) | 43
+ (0,48) | 47
+ (0,52) | 51
+ (0,54) | 53
+ (0,56) | 55
+ (0,62) | 61
+ (0,64) | 63
+ (0,66) | 65
+ (0,76) | 75
+ (0,80) | 79
+ (0,82) | 81
+ (0,84) | 83
+ (0,88) | 87
+ (0,90) | 89
+ (0,92) | 91
+ (0,98) | 97
+ (0,106) | 105
+ (0,108) | 107
+ (0,122) | 121
+ (0,126) | 125
+ (1,2) | 129
+ (1,4) | 131
+ (1,6) | 133
+ (1,8) | 135
+ (1,10) | 137
+ (1,12) | 139
+ (1,20) | 147
+ (1,24) | 151
+ (1,26) | 153
+ (1,28) | 155
+ (1,30) | 157
+ (1,32) | 159
+ (1,34) | 161
+ (1,40) | 167
+ (1,44) | 171
+ (1,46) | 173
+ (1,58) | 185
+ (1,66) | 193
+ (1,68) | 195
+ (1,70) | 197
+ (1,78) | 205
+ (1,80) | 207
+ (1,88) | 215
+ (1,92) | 219
+ (1,96) | 223
+ (1,100) | 227
+ (1,102) | 229
+ (1,106) | 233
+ (1,112) | 239
+ (1,116) | 243
+ (1,120) | 247
+ (1,122) | 249
+ (1,126) | 253
+ (2,2) | 257
+ (2,6) | 261
+ (2,8) | 263
+ (2,10) | 265
+ (2,12) | 267
+ (2,16) | 271
+ (2,18) | 273
+ (2,24) | 279
+ (2,26) | 281
+ (2,28) | 283
+ (2,30) | 285
+ (2,34) | 289
+ (2,36) | 291
+ (2,42) | 297
+ (2,44) | 299
+(74 rows)
+
diff --git a/src/test/regress/gpdiff.pl b/src/test/regress/gpdiff.pl
new file mode 100755
index 00000000000..02ec0272c65
--- /dev/null
+++ b/src/test/regress/gpdiff.pl
@@ -0,0 +1,260 @@
+#!/usr/bin/env perl
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# Pod::Usage is loaded lazily when needed, if the --help or other such option
+# is actually used. Loading the module takes some time, which adds up when
+# running hundreds of regression tests, since this script is invoked for
+# every test. See lazy_pod2usage().
+#use Pod::Usage;
+
+use strict;
+use warnings;
+use File::Spec;
+use Getopt::Long qw(GetOptions);
+Getopt::Long::Configure qw(pass_through);
+
+# Load atmsort module from the same dir as this script
+use FindBin;
+use lib "$FindBin::Bin";
+use atmsort;
+
+=head1 NAME
+
+B<gpdiff.pl>
+
+=head1 SYNOPSIS
+
+B<gpdiff.pl> [options] logfile [logfile...]
+
+Options:
+
+Normally, gpdiff takes the standard "diff" options and passes them
+directly to the diff program. Try `diff --help' for more information
+on the standard options. The following options are specific to gpdiff:
+
+ -help brief help message
+ -man full documentation
+ -version print gpdiff version and underlying diff version
+ -verbose print verbose info
+ -gpd_ignore_plans ignore explain plan content in input files
+ -gpd_init <file> load initialization file
+
+=head1 OPTIONS
+
+=over 8
+
+=item B<-help>
+
+ Prints a brief help message and exits.
+
+=item B<-man>
+
+ Prints the manual page and exits.
+
+=item B<-version>
+
+ Prints the gpdiff version and underlying diff version
+
+=item B<-verbose>
+
+ Prints verbose information.
+
+=item B<-gpd_ignore_plans>
+
+Specify this option to ignore any explain plan diffs between the
+input files. This will completely ignore any plan content in
+the input files, thus masking differences in plans between them.
+
+=item B<-gpd_init> <file>
+
+Specify an initialization file containing a series of directives
+(mainly for match_subs) that get applied to the input files. To
+specify multiple initialization files, repeat the -gpd_init argument, eg:
+
+ -gpd_init file1 -gpd_init file2
+
+=back
+
+=head1 DESCRIPTION
+
+gpdiff compares files using diff after processing them with
+atmsort.pm. This comparison is designed to handle the cases where
+query output order may differ or plans may differ. Type "atmsort.pl
+--man" for more details. gpdiff is invoked by pg_regress as part of
+"make installcheck-world". In this case the diff options are
+something like:
+
+ "-w -I NOTICE: -I HINT: -I CONTEXT: -I REGRESS_IGNORE:".
+
+Like diff, gpdiff can compare two files, a file and directory, a
+directory and file, and two directories. However, when gpdiff compares
+two directories, it only returns the exit status of the diff
+comparison of the final two files.
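+
+As an illustrative sketch (the file names here are placeholders, not part
+of the regression suite), a manual invocation against a single
+result/expected pair could look like:
+
+    gpdiff.pl -w -I NOTICE: -I HINT: -I CONTEXT: -I REGRESS_IGNORE: \
+        results/zedstore.out expected/zedstore.out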
+
+=head1 BUGS
+
+While the exit status is set correctly for most cases,
+STDERR messages from diff are not displayed.
+
+=cut
+
+# Calls pod2usage, but loads the module first.
+sub lazy_pod2usage
+{
+ require Pod::Usage;
+ Pod::Usage::pod2usage(@_);
+}
+
+my %glob_atmsort_args;
+
+my $glob_ignore_plans;
+my $glob_init_file = [];
+
+sub gpdiff_files
+{
+ my ($f1, $f2, $d2d) = @_;
+ my @tmpfils;
+ my $newf1;
+ my $newf2;
+
+ atmsort::atmsort_init(%glob_atmsort_args);
+ $newf1 = atmsort::run($f1);
+ $newf2 = atmsort::run($f2);
+
+ my $args = join(' ', @ARGV, $newf1, $newf2);
+
+# print "args: $args\n";
+
+ my $outi =`diff $args`;
+
+ my $stat = $? >> 8; # diff status
+
+ # prefix the diff output with the file names for a "directory to
+ # directory" diff
+ if (defined($d2d) && length($outi))
+ {
+ $outi = "diff $f1 $f2\n" . $outi;
+ }
+
+ # replace temp file name references with actual file names
+ $outi =~ s/\Q$newf1\E/\Q$f1\E/gm;
+ $outi =~ s/\Q$newf2\E/\Q$f2\E/gm;
+
+ print $outi;
+
+#my $stat = WEXITVALUE($?); # diff status
+
+ unlink $newf1;
+ unlink $newf2;
+
+ return ($stat);
+}
+
+sub filefunc
+{
+ my ($f1, $f2, $d2d) = @_;
+
+ if ((-f $f1) && (-f $f2))
+ {
+ return (gpdiff_files($f1, $f2, $d2d));
+ }
+
+ # if f1 is a directory, do the filefunc of every file in that directory
+ if ((-d $f1) && (-d $f2))
+ {
+ my $dir = $f1;
+ my ($dir_h, $stat);
+
+ if (opendir($dir_h, $dir))
+ {
+ my $fnam;
+ while ($fnam = readdir($dir_h))
+ {
+ # ignore ., ..
+ next if ($fnam eq '.' || $fnam eq '..');
+
+ my $absname = File::Spec->rel2abs(
+ File::Spec->catfile($dir, $fnam));
+
+ # specify that this is a directory comparison
+ $d2d = {} unless (defined($d2d));
+ $d2d->{dir} = 1;
+ $stat = filefunc($absname, $f2, $d2d);
+ }
+ closedir $dir_h;
+ }
+ return $stat;
+ }
+
+ # if f2 is a directory, find the corresponding file in that directory
+ if ((-f $f1) && (-d $f2))
+ {
+ my $stat;
+ my @foo = File::Spec->splitpath($f1);
+
+ return 0 unless (scalar(@foo));
+ my $basenam = $foo[-1];
+
+ my $fnam = File::Spec->rel2abs(File::Spec->catfile( $f2, $basenam));
+
+ $stat = filefunc($f1, $fnam, $d2d);
+
+ return $stat;
+ }
+
+ # find f2 in dir f1
+ if ((-f $f2) && (-d $f1))
+ {
+ my $stat;
+ my @foo = File::Spec->splitpath($f2);
+
+ return 0 unless (scalar(@foo));
+ my $basenam = $foo[-1];
+
+ my $fnam = File::Spec->rel2abs( File::Spec->catfile( $f1, $basenam));
+
+ $stat = filefunc($fnam, $f2, $d2d);
+
+ return $stat;
+ }
+
+ return 0;
+}
+
+sub print_version
+{
+ print "(PostgreSQL)";
+ exit(0);
+}
+
+if (1)
+{
+ my $pmsg = "";
+
+ GetOptions(
+ "man" => sub { lazy_pod2usage(-msg => $pmsg, -exitstatus => 0, -verbose => 2) },
+ "help" => sub { lazy_pod2usage(-msg => $pmsg, -exitstatus => 1) },
+ "version|v" => \&print_version ,
+ "verbose|Verbose" => \$glob_atmsort_args{VERBOSE},
+ "gpd_ignore_plans|gp_ignore_plans" => \$glob_atmsort_args{IGNORE_PLANS},
+ "gpd_init|gp_init_file=s" => \@{$glob_atmsort_args{INIT_FILES}}
+ );
+
+ lazy_pod2usage(-msg => $pmsg, -exitstatus => 1) unless (scalar(@ARGV) >= 2);
+
+ my $f2 = pop @ARGV;
+ my $f1 = pop @ARGV;
+
+ for my $fname ($f1, $f2)
+ {
+ unless (-e $fname)
+ {
+ print STDERR "gpdiff: $fname: No such file or directory\n";
+ }
+ }
+ exit(2) unless ((-e $f1) && (-e $f2));
+
+ exit(filefunc($f1, $f2));
+}
diff --git a/src/test/regress/output/misc_1.source b/src/test/regress/output/misc_1.source
new file mode 100644
index 00000000000..c29c54c414c
--- /dev/null
+++ b/src/test/regress/output/misc_1.source
@@ -0,0 +1,692 @@
+--
+-- MISC
+--
+--
+-- BTREE
+--
+UPDATE onek
+ SET unique1 = onek.unique1 + 1;
+UPDATE onek
+ SET unique1 = onek.unique1 - 1;
+--
+-- BTREE partial
+--
+-- UPDATE onek2
+-- SET unique1 = onek2.unique1 + 1;
+--UPDATE onek2
+-- SET unique1 = onek2.unique1 - 1;
+--
+-- BTREE shutting out non-functional updates
+--
+-- the following two tests seem to take a long time on some
+-- systems. This non-func update stuff needs to be examined
+-- more closely. - jolly (2/22/96)
+--
+UPDATE tmp
+ SET stringu1 = reverse_name(onek.stringu1)
+ FROM onek
+ WHERE onek.stringu1 = 'JBAAAA' and
+ onek.stringu1 = tmp.stringu1;
+UPDATE tmp
+ SET stringu1 = reverse_name(onek2.stringu1)
+ FROM onek2
+ WHERE onek2.stringu1 = 'JCAAAA' and
+ onek2.stringu1 = tmp.stringu1;
+DROP TABLE tmp;
+--UPDATE person*
+-- SET age = age + 1;
+--UPDATE person*
+-- SET age = age + 3
+-- WHERE name = 'linda';
+--
+-- copy
+--
+COPY onek TO '@abs_builddir@/results/onek.data';
+DELETE FROM onek;
+COPY onek FROM '@abs_builddir@/results/onek.data';
+SELECT unique1 FROM onek WHERE unique1 < 2 ORDER BY unique1;
+ unique1
+---------
+ 0
+ 1
+(2 rows)
+
+DELETE FROM onek2;
+COPY onek2 FROM '@abs_builddir@/results/onek.data';
+SELECT unique1 FROM onek2 WHERE unique1 < 2 ORDER BY unique1;
+ unique1
+---------
+ 0
+ 1
+(2 rows)
+
+COPY BINARY stud_emp TO '@abs_builddir@/results/stud_emp.data';
+DELETE FROM stud_emp;
+COPY BINARY stud_emp FROM '@abs_builddir@/results/stud_emp.data';
+SELECT * FROM stud_emp;
+ name | age | location | salary | manager | gpa | percent
+-------+-----+------------+--------+---------+-----+---------
+ jeff | 23 | (8,7.7) | 600 | sharon | 3.5 |
+ cim | 30 | (10.5,4.7) | 400 | | 3.4 |
+ linda | 19 | (0.9,6.1) | 100 | | 2.9 |
+(3 rows)
+
+-- COPY aggtest FROM stdin;
+-- 56 7.8
+-- 100 99.097
+-- 0 0.09561
+-- 42 324.78
+-- .
+-- COPY aggtest TO stdout;
+--
+-- inheritance stress test
+--
+SELECT * FROM a_star*;
+ class | a
+-------+----
+ a | 1
+ a | 2
+ a |
+ b | 3
+ b | 4
+ b |
+ b |
+ c | 5
+ c | 6
+ c |
+ c |
+ d | 7
+ d | 8
+ d | 9
+ d | 10
+ d |
+ d | 11
+ d | 12
+ d | 13
+ d |
+ d |
+ d |
+ d | 14
+ d |
+ d |
+ d |
+ d |
+ e | 15
+ e | 16
+ e | 17
+ e |
+ e | 18
+ e |
+ e |
+ f | 19
+ f | 20
+ f | 21
+ f | 22
+ f |
+ f | 24
+ f | 25
+ f | 26
+ f |
+ f |
+ f |
+ f | 27
+ f |
+ f |
+ f |
+ f |
+(50 rows)
+
+SELECT *
+ FROM b_star* x
+ WHERE x.b = text 'bumble' or x.a < 3;
+ class | a | b
+-------+---+--------
+ b | | bumble
+(1 row)
+
+SELECT class, a
+ FROM c_star* x
+ WHERE x.c ~ text 'hi';
+ class | a
+-------+----
+ c | 5
+ c |
+ d | 7
+ d | 8
+ d | 10
+ d |
+ d | 12
+ d |
+ d |
+ d |
+ e | 15
+ e | 16
+ e |
+ e |
+ f | 19
+ f | 20
+ f | 21
+ f |
+ f | 24
+ f |
+ f |
+ f |
+(22 rows)
+
+SELECT class, b, c
+ FROM d_star* x
+ WHERE x.a < 100;
+ class | b | c
+-------+---------+------------
+ d | grumble | hi sunita
+ d | stumble | hi koko
+ d | rumble |
+ d | | hi kristin
+ d | fumble |
+ d | | hi avi
+ d | |
+ d | |
+(8 rows)
+
+SELECT class, c FROM e_star* x WHERE x.c NOTNULL;
+ class | c
+-------+-------------
+ e | hi carol
+ e | hi bob
+ e | hi michelle
+ e | hi elisa
+ f | hi claire
+ f | hi mike
+ f | hi marcel
+ f | hi keith
+ f | hi marc
+ f | hi allison
+ f | hi jeff
+ f | hi carl
+(12 rows)
+
+SELECT * FROM f_star* x WHERE x.c ISNULL;
+ class | a | c | e | f
+-------+----+---+-----+-------------------------------------------
+ f | 22 | | -7 | ((111,555),(222,666),(333,777),(444,888))
+ f | 25 | | -9 |
+ f | 26 | | | ((11111,33333),(22222,44444))
+ f | | | -11 | ((1111111,3333333),(2222222,4444444))
+ f | 27 | | |
+ f | | | -12 |
+ f | | | | ((11111111,33333333),(22222222,44444444))
+ f | | | |
+(8 rows)
+
+-- grouping and aggregation on inherited sets have been busted in the past...
+SELECT sum(a) FROM a_star*;
+ sum
+-----
+ 355
+(1 row)
+
+SELECT class, sum(a) FROM a_star* GROUP BY class ORDER BY class;
+ class | sum
+-------+-----
+ a | 3
+ b | 7
+ c | 11
+ d | 84
+ e | 66
+ f | 184
+(6 rows)
+
+ALTER TABLE f_star RENAME COLUMN f TO ff;
+ALTER TABLE e_star* RENAME COLUMN e TO ee;
+ALTER TABLE d_star* RENAME COLUMN d TO dd;
+ALTER TABLE c_star* RENAME COLUMN c TO cc;
+ALTER TABLE b_star* RENAME COLUMN b TO bb;
+ALTER TABLE a_star* RENAME COLUMN a TO aa;
+SELECT class, aa
+ FROM a_star* x
+ WHERE aa ISNULL;
+ class | aa
+-------+----
+ a |
+ b |
+ b |
+ c |
+ c |
+ d |
+ d |
+ d |
+ d |
+ d |
+ d |
+ d |
+ d |
+ e |
+ e |
+ e |
+ f |
+ f |
+ f |
+ f |
+ f |
+ f |
+ f |
+ f |
+(24 rows)
+
+-- As of Postgres 7.1, ALTER implicitly recurses,
+-- so this should be same as ALTER a_star*
+ALTER TABLE a_star RENAME COLUMN aa TO foo;
+SELECT class, foo
+ FROM a_star* x
+ WHERE x.foo >= 2;
+ class | foo
+-------+-----
+ a | 2
+ b | 3
+ b | 4
+ c | 5
+ c | 6
+ d | 7
+ d | 8
+ d | 9
+ d | 10
+ d | 11
+ d | 12
+ d | 13
+ d | 14
+ e | 15
+ e | 16
+ e | 17
+ e | 18
+ f | 19
+ f | 20
+ f | 21
+ f | 22
+ f | 24
+ f | 25
+ f | 26
+ f | 27
+(25 rows)
+
+ALTER TABLE a_star RENAME COLUMN foo TO aa;
+SELECT *
+ from a_star*
+ WHERE aa < 1000;
+ class | aa
+-------+----
+ a | 1
+ a | 2
+ b | 3
+ b | 4
+ c | 5
+ c | 6
+ d | 7
+ d | 8
+ d | 9
+ d | 10
+ d | 11
+ d | 12
+ d | 13
+ d | 14
+ e | 15
+ e | 16
+ e | 17
+ e | 18
+ f | 19
+ f | 20
+ f | 21
+ f | 22
+ f | 24
+ f | 25
+ f | 26
+ f | 27
+(26 rows)
+
+ALTER TABLE f_star ADD COLUMN f int4;
+UPDATE f_star SET f = 10;
+ALTER TABLE e_star* ADD COLUMN e int4;
+--UPDATE e_star* SET e = 42;
+SELECT * FROM e_star*;
+ class | aa | cc | ee | e
+-------+----+-------------+-----+---
+ e | 15 | hi carol | -1 |
+ e | 16 | hi bob | |
+ e | 17 | | -2 |
+ e | | hi michelle | -3 |
+ e | 18 | | |
+ e | | hi elisa | |
+ e | | | -4 |
+ f | 19 | hi claire | -5 |
+ f | 20 | hi mike | -6 |
+ f | 21 | hi marcel | |
+ f | 22 | | -7 |
+ f | | hi keith | -8 |
+ f | 24 | hi marc | |
+ f | 25 | | -9 |
+ f | 26 | | |
+ f | | hi allison | -10 |
+ f | | hi jeff | |
+ f | | | -11 |
+ f | 27 | | |
+ f | | hi carl | |
+ f | | | -12 |
+ f | | | |
+ f | | | |
+(23 rows)
+
+ALTER TABLE a_star* ADD COLUMN a text;
+NOTICE: merging definition of column "a" for child "d_star"
+-- That ALTER TABLE should have added TOAST tables.
+SELECT relname, reltoastrelid <> 0 AS has_toast_table
+ FROM pg_class
+ WHERE oid::regclass IN ('a_star', 'c_star')
+ ORDER BY 1;
+ relname | has_toast_table
+---------+-----------------
+ a_star | f
+ c_star | f
+(2 rows)
+
+--UPDATE b_star*
+-- SET a = text 'gazpacho'
+-- WHERE aa > 4;
+SELECT class, aa, a FROM a_star*;
+ class | aa | a
+-------+----+---
+ a | 1 |
+ a | 2 |
+ a | |
+ b | 3 |
+ b | 4 |
+ b | |
+ b | |
+ c | 5 |
+ c | 6 |
+ c | |
+ c | |
+ d | 7 |
+ d | 8 |
+ d | 9 |
+ d | 10 |
+ d | |
+ d | 11 |
+ d | 12 |
+ d | 13 |
+ d | |
+ d | |
+ d | |
+ d | 14 |
+ d | |
+ d | |
+ d | |
+ d | |
+ e | 15 |
+ e | 16 |
+ e | 17 |
+ e | |
+ e | 18 |
+ e | |
+ e | |
+ f | 19 |
+ f | 20 |
+ f | 21 |
+ f | 22 |
+ f | |
+ f | 24 |
+ f | 25 |
+ f | 26 |
+ f | |
+ f | |
+ f | |
+ f | 27 |
+ f | |
+ f | |
+ f | |
+ f | |
+(50 rows)
+
+--
+-- versions
+--
+--
+-- postquel functions
+--
+--
+-- mike does post_hacking,
+-- joe and sally play basketball, and
+-- everyone else does nothing.
+--
+SELECT p.name, name(p.hobbies) FROM ONLY person p;
+ name | name
+-------+-------------
+ mike | posthacking
+ joe | basketball
+ sally | basketball
+(3 rows)
+
+--
+-- as above, but jeff also does post_hacking.
+--
+SELECT p.name, name(p.hobbies) FROM person* p;
+ name | name
+-------+-------------
+ mike | posthacking
+ joe | basketball
+ sally | basketball
+ jeff | posthacking
+(4 rows)
+
+--
+-- the next two queries demonstrate how functions generate bogus duplicates.
+-- this is a "feature" ..
+--
+SELECT DISTINCT hobbies_r.name, name(hobbies_r.equipment) FROM hobbies_r
+ ORDER BY 1,2;
+ name | name
+-------------+---------------
+ basketball | hightops
+ posthacking | advil
+ posthacking | peet's coffee
+ skywalking | guts
+(4 rows)
+
+SELECT hobbies_r.name, (hobbies_r.equipment).name FROM hobbies_r;
+ name | name
+-------------+---------------
+ posthacking | advil
+ posthacking | peet's coffee
+ posthacking | advil
+ posthacking | peet's coffee
+ basketball | hightops
+ basketball | hightops
+ skywalking | guts
+(7 rows)
+
+--
+-- mike needs advil and peet's coffee,
+-- joe and sally need hightops, and
+-- everyone else is fine.
+--
+SELECT p.name, name(p.hobbies), name(equipment(p.hobbies)) FROM ONLY person p;
+ name | name | name
+-------+-------------+---------------
+ mike | posthacking | advil
+ mike | posthacking | peet's coffee
+ joe | basketball | hightops
+ sally | basketball | hightops
+(4 rows)
+
+--
+-- as above, but jeff needs advil and peet's coffee as well.
+--
+SELECT p.name, name(p.hobbies), name(equipment(p.hobbies)) FROM person* p;
+ name | name | name
+-------+-------------+---------------
+ mike | posthacking | advil
+ mike | posthacking | peet's coffee
+ joe | basketball | hightops
+ sally | basketball | hightops
+ jeff | posthacking | advil
+ jeff | posthacking | peet's coffee
+(6 rows)
+
+--
+-- just like the last two, but make sure that the target list fixup and
+-- unflattening is being done correctly.
+--
+SELECT name(equipment(p.hobbies)), p.name, name(p.hobbies) FROM ONLY person p;
+ name | name | name
+---------------+-------+-------------
+ advil | mike | posthacking
+ peet's coffee | mike | posthacking
+ hightops | joe | basketball
+ hightops | sally | basketball
+(4 rows)
+
+SELECT (p.hobbies).equipment.name, p.name, name(p.hobbies) FROM person* p;
+ name | name | name
+---------------+-------+-------------
+ advil | mike | posthacking
+ peet's coffee | mike | posthacking
+ hightops | joe | basketball
+ hightops | sally | basketball
+ advil | jeff | posthacking
+ peet's coffee | jeff | posthacking
+(6 rows)
+
+SELECT (p.hobbies).equipment.name, name(p.hobbies), p.name FROM ONLY person p;
+ name | name | name
+---------------+-------------+-------
+ advil | posthacking | mike
+ peet's coffee | posthacking | mike
+ hightops | basketball | joe
+ hightops | basketball | sally
+(4 rows)
+
+SELECT name(equipment(p.hobbies)), name(p.hobbies), p.name FROM person* p;
+ name | name | name
+---------------+-------------+-------
+ advil | posthacking | mike
+ peet's coffee | posthacking | mike
+ hightops | basketball | joe
+ hightops | basketball | sally
+ advil | posthacking | jeff
+ peet's coffee | posthacking | jeff
+(6 rows)
+
+SELECT name(equipment(hobby_construct(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_1a(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_1b(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_1c(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_2a(text 'skywalking'));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_2b(text 'skywalking'));
+ name
+---------------
+ advil
+ peet's coffee
+ hightops
+ guts
+(4 rows)
+
+SELECT hobbies_by_name('basketball');
+ hobbies_by_name
+-----------------
+ joe
+(1 row)
+
+SELECT name, overpaid(emp.*) FROM emp;
+ name | overpaid
+--------+----------
+ sharon | t
+ sam | t
+ bill | t
+ jeff | f
+ cim | f
+ linda | f
+(6 rows)
+
+--
+-- Try a few cases with SQL-spec row constructor expressions
+--
+SELECT * FROM equipment(ROW('skywalking', 'mer'));
+ name | hobby
+------+------------
+ guts | skywalking
+(1 row)
+
+SELECT name(equipment(ROW('skywalking', 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT *, name(equipment(h.*)) FROM hobbies_r h;
+ name | person | name
+-------------+--------+---------------
+ posthacking | mike | advil
+ posthacking | mike | peet's coffee
+ posthacking | jeff | advil
+ posthacking | jeff | peet's coffee
+ basketball | joe | hightops
+ basketball | sally | hightops
+ skywalking | | guts
+(7 rows)
+
+SELECT *, (equipment(CAST((h.*) AS hobbies_r))).name FROM hobbies_r h;
+ name | person | name
+-------------+--------+---------------
+ posthacking | mike | advil
+ posthacking | mike | peet's coffee
+ posthacking | jeff | advil
+ posthacking | jeff | peet's coffee
+ basketball | joe | hightops
+ basketball | sally | hightops
+ skywalking | | guts
+(7 rows)
+
+--
+-- functional joins
+--
+--
+-- instance rules
+--
+--
+-- rewrite rules
+--
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index fc0f14122bb..e7197a6c334 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
# ----------
# Another group of parallel tests
# ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 zedstore
# rules cannot run concurrently with any test that creates
# a view or rule in the public schema
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 117a9544eaf..bc172a6ea6c 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -57,14 +57,16 @@ char *host_platform = HOST_TUPLE;
static char *shellprog = SHELLPROG;
#endif
+static char gpdiffprog[MAXPGPATH] = "diff";
+
/*
* On Windows we use -w in diff switches to avoid problems with inconsistent
* newline representation. The actual result files will generally have
* Windows-style newlines, but the comparison files might or might not.
*/
#ifndef WIN32
-const char *basic_diff_opts = "";
-const char *pretty_diff_opts = "-U3";
+const char *basic_diff_opts = "-I REGRESS_IGNORE:";
+const char *pretty_diff_opts = "-I REGRESS_IGNORE: -U3";
#else
const char *basic_diff_opts = "-w";
const char *pretty_diff_opts = "-w -U3";
@@ -94,6 +96,8 @@ static bool port_specified_by_user = false;
static char *dlpath = PKGLIBDIR;
static char *user = NULL;
static _stringlist *extraroles = NULL;
+static char *initfile = NULL;
+static bool ignore_plans_tuple_order_diff = false;
static char *config_auth_datadir = NULL;
/* internal variables */
@@ -124,6 +128,9 @@ static void header(const char *fmt,...) pg_attribute_printf(1, 2);
static void status(const char *fmt,...) pg_attribute_printf(1, 2);
static void psql_command(const char *database, const char *query,...) pg_attribute_printf(2, 3);
+static int
+run_diff(const char *cmd, const char *filename);
+
/*
* allow core files if possible.
*/
@@ -1353,11 +1360,14 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
char diff[MAXPGPATH];
char cmd[MAXPGPATH * 3];
char best_expect_file[MAXPGPATH];
+ char diff_opts[MAXPGPATH];
+ char m_pretty_diff_opts[MAXPGPATH];
FILE *difffile;
int best_line_count;
int i;
int l;
const char *platform_expectfile;
+ const char *ignore_plans_opts;
/*
* We can pass either the resultsfile or the expectfile, they should have
@@ -1378,13 +1388,36 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
strcpy(++p, platform_expectfile);
}
+ if (ignore_plans_tuple_order_diff)
+ ignore_plans_opts = " -gpd_ignore_plans";
+ else
+ ignore_plans_opts = "";
+
/* Name to use for temporary diff file */
snprintf(diff, sizeof(diff), "%s.diff", resultsfile);
+ /* Add init file arguments if provided via commandline */
+ if (initfile)
+ {
+ snprintf(diff_opts, sizeof(diff_opts),
+ "%s%s --gpd_init %s", basic_diff_opts, ignore_plans_opts, initfile);
+
+ snprintf(m_pretty_diff_opts, sizeof(m_pretty_diff_opts),
+ "%s%s --gpd_init %s", pretty_diff_opts, ignore_plans_opts, initfile);
+ }
+ else
+ {
+ snprintf(diff_opts, sizeof(diff_opts),
+ "%s%s", basic_diff_opts, ignore_plans_opts);
+
+ snprintf(m_pretty_diff_opts, sizeof(m_pretty_diff_opts),
+ "%s%s", pretty_diff_opts, ignore_plans_opts);
+ }
+
/* OK, run the diff */
snprintf(cmd, sizeof(cmd),
- "diff %s \"%s\" \"%s\" > \"%s\"",
- basic_diff_opts, expectfile, resultsfile, diff);
+ "%s %s \"%s\" \"%s\" > \"%s\"",
+ gpdiffprog, diff_opts, expectfile, resultsfile, diff);
/* Is the diff file empty? */
if (run_diff(cmd, diff) == 0)
@@ -1416,8 +1449,8 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
}
snprintf(cmd, sizeof(cmd),
- "diff %s \"%s\" \"%s\" > \"%s\"",
- basic_diff_opts, alt_expectfile, resultsfile, diff);
+ "%s %s \"%s\" \"%s\" > \"%s\"",
+ gpdiffprog, diff_opts, alt_expectfile, resultsfile, diff);
if (run_diff(cmd, diff) == 0)
{
@@ -1444,8 +1477,8 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
if (platform_expectfile)
{
snprintf(cmd, sizeof(cmd),
- "diff %s \"%s\" \"%s\" > \"%s\"",
- basic_diff_opts, default_expectfile, resultsfile, diff);
+ "%s %s \"%s\" \"%s\" > \"%s\"",
+ gpdiffprog, diff_opts, default_expectfile, resultsfile, diff);
if (run_diff(cmd, diff) == 0)
{
@@ -1478,10 +1511,9 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
fclose(difffile);
}
- /* Run diff */
snprintf(cmd, sizeof(cmd),
- "diff %s \"%s\" \"%s\" >> \"%s\"",
- pretty_diff_opts, best_expect_file, resultsfile, difffilename);
+ "%s %s \"%s\" \"%s\" >> \"%s\"",
+ gpdiffprog, m_pretty_diff_opts, best_expect_file, resultsfile, difffilename);
run_diff(cmd, difffilename);
unlink(diff);
@@ -1905,6 +1937,27 @@ run_single_test(const char *test, test_function tfunc)
status_end();
}
+/*
+ * Find the gpdiff.pl binary.
+ */
+static void
+find_helper_programs(const char *argv0)
+{
+ if (find_other_exec(argv0, "gpdiff.pl", "(PostgreSQL)", gpdiffprog) != 0)
+ {
+ char full_path[MAXPGPATH];
+
+ if (find_my_exec(argv0, full_path) < 0)
+ strlcpy(full_path, progname, sizeof(full_path));
+
+ fprintf(stderr,
+ _("The program \"gpdiff.pl\" is needed by %s "
+ "but was not found in the same directory as \"%s\".\n"),
+ progname, full_path);
+ exit(1);
+ }
+}
+
/*
* Create the summary-output files (making them empty if already existing)
*/
@@ -2098,6 +2151,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
{"load-extension", required_argument, NULL, 22},
{"config-auth", required_argument, NULL, 24},
{"max-concurrent-tests", required_argument, NULL, 25},
+ {"ignore-plans-and-tuple-order-diff", no_argument, NULL, 26},
+ {"ignore-tuple-order-diff", no_argument, NULL, 27},
{NULL, 0, NULL, 0}
};
@@ -2107,6 +2162,7 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
int option_index;
char buf[MAXPGPATH * 4];
char buf2[MAXPGPATH * 4];
+ bool ignore_tuple_order_diff = false;
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -2217,6 +2273,14 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
case 25:
max_concurrent_tests = atoi(optarg);
break;
+ case 26:
+ /* ignore plans means also ignore tuple order differences */
+ ignore_plans_tuple_order_diff = true;
+ ignore_tuple_order_diff = true;
+ break;
+ case 27:
+ ignore_tuple_order_diff = true;
+ break;
default:
/* getopt_long already emitted a complaint */
fprintf(stderr, _("\nTry \"%s -h\" for more information.\n"),
@@ -2260,6 +2324,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
/*
* Initialization
*/
+ if (ignore_tuple_order_diff || ignore_plans_tuple_order_diff)
+ find_helper_programs(argv[0]);
open_result_files();
initialize_environment();
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 68ac56acdb2..882cc8b9b2c 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -134,6 +134,7 @@ test: sysviews
test: tsrf
test: tidscan
test: collate.icu.utf8
+test: zedstore
test: rules
test: psql
test: psql_crosstab
diff --git a/src/test/regress/sql/box.sql b/src/test/regress/sql/box.sql
index cd3e00261f7..1d219bd8ca2 100644
--- a/src/test/regress/sql/box.sql
+++ b/src/test/regress/sql/box.sql
@@ -148,6 +148,7 @@ INSERT INTO box_temp
('(-infinity,-infinity)(infinity,infinity)');
SET enable_seqscan = false;
+SET enable_bitmapscan = false;
SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
EXPLAIN (COSTS OFF) SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
@@ -186,6 +187,7 @@ SELECT * FROM box_temp WHERE f1 ~= '(20,20),(40,40)';
EXPLAIN (COSTS OFF) SELECT * FROM box_temp WHERE f1 ~= '(20,20),(40,40)';
RESET enable_seqscan;
+RESET enable_bitmapscan;
DROP INDEX box_spgist;
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index f96bebf410d..26f958fe624 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -679,7 +679,9 @@ SELECT count(*) FROM dupindexcols
-- Check ordering of =ANY indexqual results (bug in 9.2.0)
--
-vacuum tenk1; -- ensure we get consistent plans here
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = OFF;
explain (costs off)
SELECT unique1 FROM tenk1
@@ -699,6 +701,7 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+RESET enable_indexscan;
SET enable_indexonlyscan = OFF;
explain (costs off)
@@ -710,6 +713,8 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+RESET enable_seqscan;
+RESET enable_bitmapscan;
RESET enable_indexonlyscan;
--
diff --git a/src/test/regress/sql/zedstore.sql b/src/test/regress/sql/zedstore.sql
new file mode 100644
index 00000000000..11a94df8c11
--- /dev/null
+++ b/src/test/regress/sql/zedstore.sql
@@ -0,0 +1,188 @@
+-- simple tests to iteratively build the zedstore
+-- create and drop works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+drop table t_zedstore;
+-- insert and select works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+insert into t_zedstore select i,i+1,i+2 from generate_series(1, 10)i;
+select * from t_zedstore;
+-- selecting only few columns work
+select c1, c3 from t_zedstore;
+-- only few columns in output and where clause work
+select c3 from t_zedstore where c2 > 5;
+
+-- Test abort works
+begin;
+insert into t_zedstore select i,i+1,i+2 from generate_series(21, 25)i;
+abort;
+insert into t_zedstore select i,i+1,i+2 from generate_series(31, 35)i;
+select * from t_zedstore;
+
+--
+-- Test indexing
+--
+create index on t_zedstore (c1);
+set enable_seqscan=off;
+set enable_indexscan=on;
+set enable_bitmapscan=off;
+
+-- index scan
+select * from t_zedstore where c1 = 5;
+
+-- index-only scan
+select c1 from t_zedstore where c1 = 5;
+
+-- bitmap scan
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+select c1, c2 from t_zedstore where c1 between 5 and 10;
+
+--
+-- Test DELETE and UPDATE
+--
+delete from t_zedstore where c2 = 5;
+select * from t_zedstore;
+delete from t_zedstore where c2 < 5;
+select * from t_zedstore;
+
+update t_zedstore set c2 = 100 where c1 = 8;
+select * from t_zedstore;
+
+--
+-- Test page deletion, by deleting a bigger range of values
+--
+insert into t_zedstore select i,i+1,i+2 from generate_series(10000, 15000)i;
+delete from t_zedstore where c1 >= 10000;
+
+--
+-- Test VACUUM
+--
+vacuum t_zedstore;
+select * from t_zedstore;
+
+--
+-- Test toasting
+--
+create table t_zedtoast(c1 int, t text) USING zedstore;
+insert into t_zedtoast select i, repeat('x', 10000) from generate_series(1, 10) i;
+
+select c1, length(t) from t_zedtoast;
+
+--
+-- Test NULL values
+--
+create table t_zednullvalues(c1 int, c2 int) USING zedstore;
+insert into t_zednullvalues values(1, NULL), (NULL, 2);
+select * from t_zednullvalues;
+select c2 from t_zednullvalues;
+update t_zednullvalues set c1 = 1, c2 = NULL;
+select * from t_zednullvalues;
+
+--
+-- Test COPY
+--
+create table t_zedcopy(a serial, b int, c text not null default 'stuff', d text,e text) USING zedstore;
+
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+9999 \N \\N \NN \N
+10000 21 31 41 51
+\.
+
+COPY t_zedcopy (b, d) from stdin;
+1 test_1
+\.
+
+COPY t_zedcopy (b, d) from stdin;
+2 test_2
+3 test_3
+4 test_4
+5 test_5
+\.
+
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+10001 22 32 42 52
+10002 23 33 43 53
+10003 24 34 44 54
+10004 25 35 45 55
+10005 26 36 46 56
+\.
+
+select * from t_zedcopy;
+COPY t_zedcopy (a, d, e) to stdout;
+
+--
+-- Also test delete and update on the table that was populated with COPY.
+-- This exercises splitting the array item. (A table not populated with
+-- COPY only contains single items, at the moment.)
+--
+
+delete from t_zedcopy where b = 4;
+select * from t_zedcopy;
+delete from t_zedcopy where b < 3;
+select * from t_zedcopy;
+
+update t_zedcopy set b = 100 where b = 5;
+select * from t_zedcopy;
+
+
+-- Test rolling back COPY
+begin;
+COPY t_zedcopy (b, d) from stdin;
+20001 test_1
+20002 test_2
+20003 test_3
+20004 test_4
+\.
+rollback;
+select count(*) from t_zedcopy where b >= 20000;
+
+--
+-- Test zero column table
+--
+create table t_zwithzerocols() using zedstore;
+insert into t_zwithzerocols select t.* from t_zwithzerocols t right join generate_series(1,1) on true;
+select count(*) from t_zwithzerocols;
+
+-- Test for alter table add column
+create table t_zaddcol(a int) using zedstore;
+insert into t_zaddcol select * from generate_series(1, 3);
+-- rewrite case
+alter table t_zaddcol add column b int generated always as (a + 1) stored;
+select * from t_zaddcol;
+-- test alter table add column with no default
+create table t_zaddcol_simple(a int) using zedstore;
+insert into t_zaddcol_simple values (1);
+alter table t_zaddcol_simple add b int;
+select * from t_zaddcol_simple;
+insert into t_zaddcol_simple values(2,3);
+select * from t_zaddcol_simple;
+-- fixed length default value stored in catalog
+alter table t_zaddcol add column c int default 3;
+select * from t_zaddcol;
+-- variable length default value stored in catalog
+alter table t_zaddcol add column d text default 'abcdefgh';
+select d from t_zaddcol;
+-- insert after add column
+insert into t_zaddcol values (2);
+select * from t_zaddcol;
+insert into t_zaddcol (a, c, d) values (3,5, 'test_insert');
+select b,c,d from t_zaddcol;
+
+--
+-- Test TABLESAMPLE
+--
+-- The regular tablesample.sql test doesn't directly work for zedstore, as
+-- it uses fillfactor to create a specific block layout for heap. Hence,
+-- the output differs between heap and zedstore tables while sampling.
+-- We need to use many tuples here to get multiple logical blocks, as we
+-- don't have a way to force the TIDs to spread / jump for zedstore.
+--
+CREATE TABLE t_ztablesample (id int, name text) using zedstore;
+INSERT INTO t_ztablesample
+ SELECT i, repeat(i::text, 2) FROM generate_series(0, 299) s(i);
+-- lets delete half (even numbered ids) rows to limit the output
+DELETE FROM t_ztablesample WHERE id%2 = 0;
+-- should return ALL visible tuples from SOME blocks
+SELECT ctid,t.id FROM t_ztablesample AS t TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+-- should return SOME visible tuples but from ALL the blocks
+SELECT ctid,id FROM t_ztablesample TABLESAMPLE BERNOULLI (50) REPEATABLE (0);
diff --git a/src/test/storageperf/driver.sql b/src/test/storageperf/driver.sql
new file mode 100644
index 00000000000..73981d1c94b
--- /dev/null
+++ b/src/test/storageperf/driver.sql
@@ -0,0 +1,36 @@
+--
+-- Main script, to run all the tests, and print the results.
+--
+--
+
+-- First run the tests using heap.
+DROP SCHEMA IF EXISTS storagetest_heap CASCADE;
+CREATE SCHEMA storagetest_heap;
+SET search_path='storagetest_heap';
+
+CREATE TABLE results (testname text, val numeric) USING heap;
+
+SET default_table_access_method=heap;
+\i tests.sql
+
+
+-- Repeat with zedstore
+
+DROP SCHEMA IF EXISTS storagetest_zedstore CASCADE;
+CREATE SCHEMA storagetest_zedstore;
+SET search_path='storagetest_zedstore';
+
+CREATE TABLE results (testname text, val numeric) USING heap;
+
+SET default_table_access_method=zedstore;
+\i tests.sql
+
+
+SET search_path='public';
+
+SELECT COALESCE(h.testname, zs.testname) as testname,
+ h.val as heap,
+ zs.val as zedstore,
+ round(zs.val / h.val, 2) as "heap / zedstore"
+FROM storagetest_heap.results h
+FULL OUTER JOIN storagetest_zedstore.results zs ON (h.testname = zs.testname);
diff --git a/src/test/storageperf/sql/nullcol.sql b/src/test/storageperf/sql/nullcol.sql
new file mode 100644
index 00000000000..1977d0c8c77
--- /dev/null
+++ b/src/test/storageperf/sql/nullcol.sql
@@ -0,0 +1,38 @@
+-- Tests with a narrow, single-column table, with some nulls.
+
+CREATE UNLOGGED TABLE nullcol (i int4);
+
+-- Populate the table with a bunch of INSERT ... SELECT statements.
+-- Measure how long it takes, and the resulting table size.
+select extract(epoch from now()) as before
+\gset
+
+INSERT INTO nullcol SELECT CASE WHEN g % 2 = 0 THEN NULL ELSE g END FROM generate_series(1, 100000) g ;
+INSERT INTO nullcol SELECT NULL FROM generate_series(1, 100000) g;
+INSERT INTO nullcol SELECT CASE WHEN g % 2 = 0 THEN NULL ELSE g END FROM generate_series(1, 100000) g ;
+INSERT INTO nullcol SELECT g FROM generate_series(1, 100000) g;
+INSERT INTO nullcol SELECT CASE WHEN g % 2 = 0 THEN NULL ELSE g END FROM generate_series(1, 100000) g ;
+
+select extract(epoch from now()) as after
+\gset
+
+INSERT INTO results (testname, val) VALUES ('nullcol, insert-select, size', pg_total_relation_size('nullcol'));
+INSERT INTO results (testname, val) VALUES ('nullcol, insert-select, time', :after - :before);
+
+COPY nullcol TO '/tmp/nullcol.data'; -- dump the data, for COPY test below.
+
+--
+-- Truncate and populate it again with the same data, but this time using COPY.
+--
+TRUNCATE nullcol;
+
+select extract(epoch from now()) as before
+\gset
+
+COPY nullcol FROM '/tmp/nullcol.data';
+
+select extract(epoch from now()) as after
+\gset
+
+INSERT INTO results (testname, val) VALUES ('nullcol, COPY, size', pg_total_relation_size('nullcol'));
+INSERT INTO results (testname, val) VALUES ('nullcol, COPY, time', :after - :before);
diff --git a/src/test/storageperf/sql/onecol.sql b/src/test/storageperf/sql/onecol.sql
new file mode 100644
index 00000000000..b2df228dd13
--- /dev/null
+++ b/src/test/storageperf/sql/onecol.sql
@@ -0,0 +1,85 @@
+-- Tests with a narrow, single-column table.
+
+CREATE UNLOGGED TABLE onecol (i int4);
+
+-- Populate the table with a bunch of INSERT ... SELECT statements.
+-- Measure how long it takes, and the resulting table size.
+select extract(epoch from now()) as before
+\gset
+
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+
+select extract(epoch from now()) as after
+\gset
+
+INSERT INTO results (testname, val) VALUES ('onecol, insert-select, size', pg_total_relation_size('onecol'));
+INSERT INTO results (testname, val) VALUES ('onecol, insert-select, time', :after - :before);
+
+COPY onecol TO '/tmp/onecol.data'; -- dump the data, for COPY test below.
+
+--
+-- Truncate and populate it again with the same data, but this time using COPY.
+--
+TRUNCATE onecol;
+
+select extract(epoch from now()) as before
+\gset
+
+COPY onecol FROM '/tmp/onecol.data';
+
+select extract(epoch from now()) as after
+\gset
+
+INSERT INTO results (testname, val) VALUES ('onecol, COPY, size', pg_total_relation_size('onecol'));
+INSERT INTO results (testname, val) VALUES ('onecol, COPY, time', :after - :before);
+
+--
+-- SELECT
+--
+
+VACUUM FREEZE onecol;
+
+select extract(epoch from now()) as before
+\gset
+
+SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+
+select extract(epoch from now()) as after
+\gset
+
+INSERT INTO results (testname, val) VALUES ('SELECT, time', :after - :before);
+
+--
+-- Delete half of the rows
+--
+
+select extract(epoch from now()) as before
+\gset
+
+DELETE FROM onecol WHERE i%2 = 0;
+
+select extract(epoch from now()) as after
+\gset
+
+INSERT INTO results (testname, val) VALUES ('onecol, deleted half, size', pg_total_relation_size('onecol'));
+INSERT INTO results (testname, val) VALUES ('onecol, deleted half, time', :after - :before);
+
+--
+-- And vacuum the deleted rows away
+--
+select extract(epoch from now()) as before
+\gset
+
+VACUUM onecol;
+
+select extract(epoch from now()) as after
+\gset
+
+INSERT INTO results (testname, val) VALUES ('onecol, vacuumed, size', pg_total_relation_size('onecol'));
+INSERT INTO results (testname, val) VALUES ('onecol, vacuumed, time', :after - :before);
diff --git a/src/test/storageperf/tests.sql b/src/test/storageperf/tests.sql
new file mode 100644
index 00000000000..18cf7a08bd3
--- /dev/null
+++ b/src/test/storageperf/tests.sql
@@ -0,0 +1,4 @@
+-- Test "schedule". List all the tests you want to run here.
+
+\i sql/onecol.sql
+\i sql/nullcol.sql
On Thu, Aug 15, 2019 at 01:05:49PM +0300, Heikki Linnakangas wrote:
We've continued hacking on Zedstore, here's a new patch version against
current PostgreSQL master (commit f1bf619acdf). If you want to follow the
development in real-time, we're working on this branch:
https://github.com/greenplum-db/postgres/tree/zedstore
Thanks for pursuing this. It's an exciting development and I started
looking at how we'd put it to use. I imagine we'd use it in favour of ZFS
tablespaces, which I hope to retire.
I've only done a very brief experiment so far. Some thoughts:
. I was missing a way to check for compression ratio; it looks like zedstore
with lz4 gets ~4.6x for our largest customer's largest table. zfs using
compress=gzip-1 gives 6x compression across all their partitioned tables,
and I'm surprised it beats zedstore.
. What do you think about pg_restore --no-tableam; similar to
--no-tablespaces, it would allow restoring a table to a different AM:
PGOPTIONS='-c default_table_access_method=zedstore' pg_restore --no-tableam ./pg_dump.dat -d postgres
Otherwise, the dump says "SET default_table_access_method=heap", which
overrides any value from PGOPTIONS and precludes restoring to new AM.
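To make the conflict concrete, here's roughly what the dump contains today
(a minimal SQL illustration; the table name is made up):
-- pg_dump pins the AM that was in effect when the table was dumped:
SET default_table_access_method = heap;
CREATE TABLE some_table (id integer);
-- Because of that SET, a session-level
-- PGOPTIONS='-c default_table_access_method=zedstore' has no effect during
-- restore; a --no-tableam switch would simply omit the SET so the session
-- default wins.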
. It occurred to me that indices won't be compressed. That's no fault of
zedstore, but it may mean that some sites would need to retain their ZFS
tablespace, and suggests the possibility of a separate, future project
(I wonder if there's some way a new meta-AM could "enable" compression of
other index AMs, to avoid the need to implement zbtree, zhash, zgin, ...).
. it'd be nice if there were an ALTER TABLE SET ACCESS METHOD, to allow
migrating data. Otherwise I think the alternative is:
begin; lock table t;
CREATE TABLE new_t (LIKE t INCLUDING ALL) USING zedstore; -- INCLUDING ALL also copies the indexes
INSERT INTO new_t SELECT * FROM t;
DROP TABLE t;
ALTER TABLE new_t RENAME TO t; -- plus renaming the copied indexes, attach/inherit, etc.
commit;
. Speaking of which, I think LIKE needs a new option for ACCESS METHOD, which
is otherwise lost.
Cheers,
Justin
On Sun, Aug 18, 2019 at 12:35 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
. I was missing a way to check for compression ratio;
Here are the ways to check compression ratio for zedstore:
Table level:
select sum(uncompressedsz::numeric) / sum(totalsz) as compratio from
pg_zs_btree_pages(<tablename>);
Per column level:
select attno, count(*), sum(uncompressedsz::numeric) / sum(totalsz) as
compratio from pg_zs_btree_pages(<tablename>) group by attno order by attno;
it looks like zedstore
with lz4 gets ~4.6x for our largest customer's largest table. zfs using
compress=gzip-1 gives 6x compression across all their partitioned
tables,
and I'm surprised it beats zedstore .
What kind of tables did you use? Is it possible to give us the schema
of the table? Did you perform 'INSERT INTO ... SELECT' or COPY?
Currently COPY gives better compression ratios than single INSERTs
because it generates fewer metadata pages. The per-column compression
ratio query above will show which columns have a lower compression
ratio.
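As a quick way to see that effect, something like this should work (a
sketch; the table and file names are made up, and COPY to/from a
server-side file needs the appropriate permissions):
create table zs_ins(i int) using zedstore;
insert into zs_ins select g from generate_series(1, 100000) g;
select sum(uncompressedsz::numeric) / sum(totalsz) as ins_ratio
from pg_zs_btree_pages('zs_ins');
copy zs_ins to '/tmp/zs_ins.data';
create table zs_copy(i int) using zedstore;
copy zs_copy from '/tmp/zs_ins.data';
select sum(uncompressedsz::numeric) / sum(totalsz) as copy_ratio
from pg_zs_btree_pages('zs_copy');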
We plan to add other compression algorithms such as RLE and delta
encoding which, along with LZ4, should give better compression ratios
for a column store.
On Mon, Aug 19, 2019 at 04:15:30PM -0700, Alexandra Wang wrote:
On Sun, Aug 18, 2019 at 12:35 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
. I was missing a way to check for compression ratio;
Here are the ways to check compression ratio for zedstore:
Table level:
SELECT sum(uncompressedsz::numeric)/sum(totalsz) AS compratio FROM pg_zs_btree_pages(<tablename>);
postgres=# SELECT sum(uncompressedsz::numeric)/sum(totalsz) AS compratio FROM pg_zs_btree_pages('child.cdrs_huawei_pgwrecord_2019_07_01');
compratio | 4.2730304163521529
For a fair test, I created a separate ZFS tablespace for storing just a copy of
that table.
ts=# CREATE TABLE test TABLESPACE testcomp AS SELECT * FROM child.cdrs_huawei_pgwrecord_2019_07_01;
SELECT 39933381
Time: 882417.775 ms (14:42.418)
zfs/testJTP20190819 compressratio 6.01x -
zfs/testJTP20190819 compression gzip-1 inherited from zfs
Per column level:
select attno, count(*), sum(uncompressedsz::numeric)/sum(totalsz) as compratio from pg_zs_btree_pages(<tablename>) group by attno order by attno;
Order by 3; I see we have SOME highly compressed columns.
It's still surprising to me that's as low as it is, given their content: phone
numbers and IPv4 addresses in text form, using characters limited to
[[:digit:].]
(I realize we can probably save space using inet type.)
0 | 4743 | 1.00000000000000000000
32 | 21912 | 1.05953637381493823513
80 | 36441 | 1.2416446300175039
4 | 45059 | 1.3184106811322728
83 | 45059 | 1.3184106811322728
52 | 39208 | 1.3900788061770992
...
74 | 3464 | 10.8258665101057364
17 | 3535 | 10.8776086243096534
3 | 7092 | 11.0388009154683678
11 | 3518 | 11.4396055611832109
65 | 3333 | 14.6594723104237634
35 | 14077 | 15.1642131499381887
...
43 | 1601 | 21.4200106784573211
79 | 1599 | 21.4487670806076829
89 | 1934 | 23.6292134031933401
33 | 1934 | 23.6292134031933401
It seems clear the columns with high n_distinct have low compress ratio, and
columns with high compress ratio are those with n_distinct=1...
CREATE TEMP TABLE zs AS SELECT zs.*, n_distinct, avg_width, a.attname FROM (SELECT 'child.cdrs_huawei_pgwrecord_2019_07_01'::regclass t)t , LATERAL (SELECT attno, count(*), sum(uncompressedsz::numeric)/sum(totalsz) AS compratio FROM pg_zs_btree_pages(t) GROUP BY attno)zs , pg_attribute a, pg_class c, pg_stats s WHERE a.attrelid=t AND a.attnum=zs.attno AND c.oid=a.attrelid AND c.relname=s.tablename AND s.attname=a.attname;
n_distinct | compratio
------------+------------------------
217141 | 1.2416446300175039
154829 | 1.5306062496764190
144486 | 1.3900788061770992
128334 | 1.5395022739568842
121324 | 1.4005533187886683
86341 | 1.6262709389296389
84073 | 4.4379336418590519
65413 | 5.1890181028038757
63703 | 5.5029855093836425
63637 | 5.3648468796642262
46450 | 1.3184106811322728
46450 | 1.3184106811322728
43029 | 1.8003513772661308
39363 | 1.5845730687475706
36720 | 1.4751147557399539
36445 | 1.8403087513759131
36445 | 1.5453935268318613
11455 | 1.05953637381493823513
2862 | 9.8649823666870671
2625 | 2.3573614181847621
1376 | 1.7895024285340428
1335 | 2.2812551964262787
807 | 7.1192324141359373
610 | 7.9373623460089360
16 | 11.4396055611832109
10 | 5.5429763442365557
7 | 5.0440578041440675
7 | 5.2000132813261135
4 | 6.9741514753325536
4 | 4.2872818036896340
3 | 1.9080838412634827
3 | 2.9915954457453485
3 | 2.3056387009407882
2 | 10.8776086243096534
2 | 5.5950929307378287
2 | 18.5796576388128741
2 | 10.8258665101057364
2 | 9.1112820658021406
2 | 3.4986057630739795
2 | 4.6250999234025238
2 | 11.0388009154683678
1 | 15.1642131499381887
1 | 2.8855860118178798
1 | 23.6292134031933401
1 | 21.4200106784573211
[...]
it looks like zedstore
with lz4 gets ~4.6x for our largest customer's largest table. zfs using
compress=gzip-1 gives 6x compression across all their partitioned
tables,
and I'm surprised it beats zedstore.
What kind of tables did you use? Is it possible to give us the schema
of the table? Did you perform 'INSERT INTO ... SELECT' or COPY?
I did this:
|time ~/src/postgresql.bin/bin/pg_restore /srv/cdrperfbackup/ts/final/child.cdrs_huawei_pgwrecord_2019_07_01 -f- |PGOPTIONS='-cdefault_table_access_method=zedstore' psql --port 5678 postgres --host /tmp
...
COPY 39933381
...
real 100m25.764s
child | cdrs_huawei_pgwrecord_2019_07_01 | table | pryzbyj | permanent | 8277 MB |
postgres=# SELECT array_to_string(array_agg(format_type(atttypid, atttypmod) ||CASE WHEN attnotnull THEN ' not null' ELSE '' END ORDER BY attnum),',') FROM pg_attribute WHERE attrelid='child.cdrs_huawei_pgwrecord_2019_07_01'::regclass AND attnum>0;
array_to_string | text not null,text,text not null,text not null,text not null,text,text,text,boolean,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,timestamp without time zone not null,bigint not null,text not null,text,text,text,text,text,text,text,text,text,text not null,text,boolean,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,text,timestamp with time zone,timestamp with time zone,text,text,boolean,text,text,boolean,boolean,text not null,text not null
On 20/08/2019 05:04, Justin Pryzby wrote:
it looks like zedstore
with lz4 gets ~4.6x for our largest customer's largest table. zfs using
compress=gzip-1 gives 6x compression across all their partitioned
tables,
and I'm surprised it beats zedstore .
I did a quick test, with 10 million random IP addresses, in text format.
I loaded it into a zedstore table ("create table ips (ip text) using
zedstore"), and poked around a little bit to see how the space is used.
postgres=# select lokey, nitems, ncompressed, totalsz, uncompressedsz,
freespace from pg_zs_btree_pages('ips') where attno=1 and level=0 limit 10;
lokey | nitems | ncompressed | totalsz | uncompressedsz | freespace
-------+--------+-------------+---------+----------------+-----------
1 | 4 | 4 | 6785 | 7885 | 1320
537 | 5 | 5 | 7608 | 8818 | 492
1136 | 4 | 4 | 6762 | 7888 | 1344
1673 | 5 | 5 | 7548 | 8776 | 540
2269 | 4 | 4 | 6841 | 7895 | 1256
2807 | 5 | 5 | 7555 | 8784 | 540
3405 | 5 | 5 | 7567 | 8772 | 524
4001 | 4 | 4 | 6791 | 7899 | 1320
4538 | 5 | 5 | 7596 | 8776 | 500
5136 | 4 | 4 | 6750 | 7875 | 1360
(10 rows)
There's on average about 10% of free space on the pages. We're losing
quite a bit to ZFS compression right there. I'm sure there's some
free space on the heap pages as well, but ZFS compression will squeeze
it out.
The compression ratio is indeed not very good. I think one reason is
that zedstore does LZ4 in relatively small chunks, while ZFS surely
compresses large blocks in one go. Looking at the above, there is on
average 125 datums packed into each "item" (avg(hikey-lokey) / nitems).
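Both figures can be recomputed directly from pg_zs_btree_pages(); a sketch,
assuming the default 8k block size and that hikey is exposed alongside lokey:
select round(avg(freespace) / 8192.0 * 100, 1) as avg_free_pct,
       round(avg((hikey - lokey)::numeric / nitems), 1) as avg_datums_per_item
from pg_zs_btree_pages('ips')
where attno = 1 and level = 0;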
I did a quick test with the "lz4" command-line utility, compressing flat
files containing random IP addresses.
$ lz4 /tmp/125-ips.txt
Compressed filename will be : /tmp/125-ips.txt.lz4
Compressed 1808 bytes into 1519 bytes ==> 84.02%
$ lz4 /tmp/550-ips.txt
Compressed filename will be : /tmp/550-ips.txt.lz4
Compressed 7863 bytes into 6020 bytes ==> 76.56%
$ lz4 /tmp/750-ips.txt
Compressed filename will be : /tmp/750-ips.txt.lz4
Compressed 10646 bytes into 8035 bytes ==> 75.47%
The first case is roughly what we do in zedstore currently: we compress
about 125 datums as one chunk. The second case is roughly what we would
get if we collected 8k worth of datums and compressed them all as
one chunk. And the third case simulates the case where we allow the
input to be larger than 8k, so that the compressed chunk just fits on an
8k page. Not too much difference between the second and third case, but
it's pretty clear that we're being hurt by splitting the input into such
small chunks.
The downside of using a larger compression chunk size is that random
access becomes more expensive. Need to give the on-disk format some more
thought. Although I actually don't feel too bad about the current
compression ratio, perfect can be the enemy of good.
- Heikki
Thanks Ashwin and Heikki for your responses. I have one more question here.
If a BTree index is created on a zedstore table, the t_tid field of the
index tuple contains a physical tid that does not actually point to the
data block; instead it contains something from which the logical tid can
be derived. So, when an IndexScan is performed on a zedstore table, it
fetches the physical tid from the index page, derives the logical tid from
it, and then retrieves the data corresponding to this logical tid from the
zedstore table. For that, it kind of performs a SeqScan on the zedstore
table for the given tid. From this it appears to me as if an Index Scan is
effectively no better than a SeqScan for a zedstore table. If that is true,
will we be able to get the benefit of IndexScan on zedstore tables? Please
let me know if I am missing something here.
AFAIU, the following user level query on zedstore table
select * from zed_tab where a = 3;
gets internally converted to
select * from zed_tab where tid = 3; -- assuming that index is created
on column 'a' and the logical tid associated with a = 3 is 3.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Thu, Aug 15, 2019 at 3:08 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 14/08/2019 20:32, Ashwin Agrawal wrote:
On Wed, Aug 14, 2019 at 2:51 AM Ashutosh Sharma wrote:
2) Is there a chance that IndexOnlyScan would ever be required for
zedstore tables considering the design approach taken for it?
We have not given much thought to IndexOnlyScans so far. But I think
IndexOnlyScan definitely would be beneficial for zedstore as
well. Even for normal index scans as well, fetching as many columns
possible from Index itself and only getting rest of required columns
from the table would be good for zedstore. It would help to further
cut down IO. Ideally, for visibility checking only TidTree needs to be
scanned and visibility checked with the same, so the cost of checking
is much lower compared to heap (if VM can't be consulted) but still is
a cost. Also, with vacuum, if UNDO log gets trimmed, the visibility
checks are pretty cheap. Still given all that, having VM type thing to
optimize the same further would help.
Hmm, yeah. An index-only scan on a zedstore table could perform the "VM
checks" by checking the TID tree in the zedstore. It's not as compact as
the 2 bits per TID in the heapam's visibility map, but it's pretty good.
Further, I tried creating a zedstore table with btree index on one of
its columns and loaded around 50 lakh (5 million) records into the table. When the
indexed column was scanned (with enable_seqscan flag set to off), it
went for IndexOnlyScan and that took around 15-20 times more than it
would take for IndexOnly Scan on heap table just because IndexOnlyScan
in zedstore always goes to heap as the visibility check fails.
Currently, an index-only scan on zedstore should be pretty much the same
speed as a regular index scan. All the visibility checks will fail, and
you end up fetching every row from the table, just like a regular index
scan. So I think what you're seeing is that the index fetches on a
zedstore table is much slower than on heap.
Ideally, on a column store the index fetches would only fetch the needed
columns, but I don't think that's been implemented yet, so all the
columns are fetched. That can make a big difference, if you have a wide
table with lots of columns, but only actually need a few of them. Was
your test case something like that?
We haven't spent much effort on optimizing index fetches yet, so I hope
there's many other little tweaks there as well, that we can do to make
it faster.
- Heikki
On Mon, Aug 26, 2019 at 5:36 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:
Thanks Ashwin and Heikki for your responses. I've one more query here,
If BTree index is created on a zedstore table, the t_tid field of
Index tuple contains the physical tid that is not actually pointing to
the data block instead it contains something from which the logical
tid can be derived. So, when IndexScan is performed on a zedstore
table, it fetches the physical tid from the index page and derives the
logical tid out of it and then retrieves the data corresponding to
this logical tid from the zedstore table. For that, it kind of
performs SeqScan on the zedstore table for the given tid.
Nope, it won't perform a seqscan. Since zedstore is itself laid out as a
btree with the logical TID as its key, it can quickly find which page the
logical TID belongs to and access only that page. It doesn't need to
perform a seqscan for that. That's one of the rationales for laying things
out in btree fashion: it easily connects the logical to the physical world
without keeping any external mapping.
AFAIU, the following user level query on zedstore table
select * from zed_tab where a = 3;
gets internally converted to
select * from zed_tab where tid = 3; -- assuming that index is created
on column 'a' and the logical tid associated with a = 3 is 3.
So, for this it will first access only the TID btree and find the leaf page
with tid=3. It performs the visibility checks for the tuple, and only if the
tuple is visible does it fetch the columns for that TID, again using the
per-column btrees to fetch just the leaf pages for that logical tid.
Hope that helps to clarify the confusion.
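To see the effect at the plan level, here is a small sketch (the table,
data and index here are just illustrative):
create table zed_tab(a int, b text) using zedstore;
create index on zed_tab (a);
insert into zed_tab select g, g::text from generate_series(1, 1000) g;
set enable_seqscan = off;
set enable_bitmapscan = off;
explain (costs off) select * from zed_tab where a = 3;
-- expect an Index Scan: the index returns a logical TID, and zedstore
-- descends its TID btree straight to the leaf page holding that TID,
-- rather than scanning the whole table.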
On Tue, Aug 27, 2019 at 6:03 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
Hope that helps to clarify the confusion.
Thanks for the explanation. Yes, it does clarify my doubt to some extent.
My point is, once we find the leaf page containing the given tid, we go
through each item in the page until we find the data corresponding to the
given tid, which means we essentially perform a sequential scan at the page
level. I'm referring to the below loop in zsbt_attr_scan_fetch_array():
for (off = FirstOffsetNumber; off <= maxoff; off++)
{
ItemId iid = PageGetItemId(page, off);
ZSAttributeArrayItem *item = (ZSAttributeArrayItem *)
PageGetItem(page, iid);
if (item->t_endtid <= nexttid)
continue;
if (item->t_firsttid > nexttid)
break;
But that's not true for an IndexScan on a heap table, because there the
index tuple contains the exact physical location of the tuple in the heap,
so there is no need to scan the entire page.
Further, here are some minor comments that I could find while doing a quick
code walkthrough.
1) In zsundo_insert_finish(), there is a double call to
BufferGetPage(undobuf); is that required?
2) In zedstoream_fetch_row(), why is zsbt_tid_begin_scan() being called
twice? I'm referring to the below code.
if (fetch_proj->num_proj_atts == 0)
{
....
....
zsbt_tid_begin_scan(rel, tid, tid + 1,
snapshot,
&fetch_proj->tid_scan);
fetch_proj->tid_scan.serializable = true;
for (int i = 1; i < fetch_proj->num_proj_atts; i++)
{
int attno = fetch_proj->proj_atts[i];
zsbt_attr_begin_scan(rel, reldesc, attno,
&fetch_proj->attr_scans[i - 1]);
}
MemoryContextSwitchTo(oldcontext);
zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot,
&fetch_proj->tid_scan);
}
Also, for all types of update operation (be it a key or non-key update) we
create a new tid for the new version of the tuple. Can't we use the tid
associated with the old tuple for the cases where there are no concurrent
transactions to which the old tuple is still visible?
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Tue, Aug 27, 2019 at 12:03 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:
My point is, once we find the leaf page containing the given tid, we go
through each item in the page until we find the data corresponding to the
given tid which means we kind of perform a sequential scan at the page
level. I'm referring to the below loop in zsbt_attr_scan_fetch_array().

for (off = FirstOffsetNumber; off <= maxoff; off++)
{
    ItemId iid = PageGetItemId(page, off);
    ZSAttributeArrayItem *item = (ZSAttributeArrayItem *)
        PageGetItem(page, iid);

    if (item->t_endtid <= nexttid)
        continue;

    if (item->t_firsttid > nexttid)
        break;

But that's not true for IndexScan in case of heap table because there the
index tuple contains the exact physical location of tuple in the heap. So,
there is no need to scan the entire page.
You are correct that we currently go through each item in the leaf page that
contains the given tid. Specifically, the logic to retrieve all the attribute
items inside a ZSAttStream has now moved to decode_attstream() in the latest
code, and then in zsbt_attr_fetch() we again loop through each item we
previously retrieved from decode_attstream() and look for the given tid. One
optimization we can do is to tell decode_attstream() to stop decoding at the
tid we are interested in. We can also apply other tricks to speed up the
lookups in the page: for a fixed-length attribute, it is easy to do a binary
search instead of a linear search, and for a variable-length attribute, we
can probably try something that we didn't think of yet.
1) In zsundo_insert_finish(), there is a double call to
BufferGetPage(undobuf); Is that required ?
Fixed, thanks!
2) In zedstoream_fetch_row(), why is zsbt_tid_begin_scan() being called
twice? I'm referring to the below code.
if (fetch_proj->num_proj_atts == 0)
{
....
....
zsbt_tid_begin_scan(rel, tid, tid + 1,
snapshot,
&fetch_proj->tid_scan);
fetch_proj->tid_scan.serializable = true;

for (int i = 1; i < fetch_proj->num_proj_atts; i++)
{
    int attno = fetch_proj->proj_atts[i];

    zsbt_attr_begin_scan(rel, reldesc, attno,
                         &fetch_proj->attr_scans[i - 1]);
}
MemoryContextSwitchTo(oldcontext);

zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot,
                    &fetch_proj->tid_scan);
}
I removed the second call, thanks!
Also, for all types of update operation (be it key or non-key update) we
create a new tid for the new version of tuple. Can't we use the tid
associated with the old tuple for the cases where there is no concurrent
transactions to whom the old tuple is still visible.
Zedstore currently implements update as delete+insert, hence the old tid is
not reused. We don't store the tuple in our UNDO log; we only store the
transaction information there. Reusing the tid of the old tuple means
putting the old tuple in the UNDO log, which we have not implemented yet.
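That behaviour is easy to observe from SQL; a small sketch (the table name
is made up):
create table zs_upd(a int, b int) using zedstore;
insert into zs_upd values (1, 1);
select ctid, * from zs_upd;
update zs_upd set b = 2 where a = 1;
select ctid, * from zs_upd;
-- the updated row comes back with a new ctid (logical tid), since the
-- update is carried out as a delete plus an insert.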
Thanks for reporting, this is very helpful! Patches are welcome as well!
On Wed, Aug 28, 2019 at 5:30 AM Alexandra Wang <lewang@pivotal.io> wrote:
You are correct that we currently go through each item in the leaf page that
contains the given tid, specifically, the logic to retrieve all the attribute
items inside a ZSAttStream is now moved to decode_attstream() in the latest
code, and then in zsbt_attr_fetch() we again loop through each item we
previously retrieved from decode_attstream() and look for the given tid.
Okay. Any idea why this new way of storing attribute data as streams
(lowerstream and upperstream) has been chosen just for the attributes but
not for tids? Are only the attribute blocks compressed but not the tid blocks?
One optimization we can do is to tell decode_attstream() to stop decoding
at the tid we are interested in. We can also apply other tricks to speed up
the lookups in the page: for a fixed-length attribute, it is easy to do a
binary search instead of a linear search, and for a variable-length
attribute, we can probably try something that we didn't think of yet.
I think we can probably ask decode_attstream() to stop once it has found
the tid that we are searching for but then we only need to do that for
Index Scans.
Zedstore currently implement update as delete+insert, hence the old tid is
not reused. We don't store the tuple in our UNDO log, and we only store the
transaction information in the UNDO log. Reusing the tid of the old tuple
means putting the old tuple in the UNDO log, which we have not implemented yet.
Okay, so that means performing an update on a non-key attribute would also
require changes in the indexes. In short, a HOT update is currently not
possible with a zedstore table. Am I right?
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On 29/08/2019 14:30, Ashutosh Sharma wrote:
On Wed, Aug 28, 2019 at 5:30 AM Alexandra Wang <lewang@pivotal.io> wrote:
You are correct that we currently go through each item in the leaf page that
contains the given tid, specifically, the logic to retrieve all the attribute
items inside a ZSAttStream is now moved to decode_attstream() in the latest
code, and then in zsbt_attr_fetch() we again loop through each item we
previously retrieved from decode_attstream() and look for the given
tid.
Okay. Any idea why this new way of storing attribute data as streams
(lowerstream and upperstream) has been chosen just for the attributes
but not for tids. Are only attribute blocks compressed but not the tids
blocks?
Right, only attribute blocks are currently compressed. Tid blocks need
to be modified when there are UPDATEs or DELETEs, so I think having to
decompress and recompress them would be more costly. Also, there is no
user data on the TID tree, and the Simple-8b encoded codewords used to
represent the TIDs are already pretty compact. I'm not sure how much
gain you would get from passing it through a general purpose compressor.
I could be wrong though. We could certainly try it out, and see how it
performs.
One optimization we can do is to tell decode_attstream() to stop decoding
at the tid we are interested in. We can also apply other tricks to speed up
the lookups in the page: for a fixed-length attribute, it is easy to do a
binary search instead of a linear search, and for a variable-length
attribute, we can probably try something that we didn't think of yet.
I think we can probably ask decode_attstream() to stop once it has found
the tid that we are searching for but then we only need to do that for
Index Scans.
I've been thinking that we should add a few "bookmarks" on long streams,
so that you could skip e.g. to the midpoint in a stream. It's a tradeoff
though; when you add more information for random access, it makes the
representation less compact.
Zedstore currently implement update as delete+insert, hence the old tid is
not reused. We don't store the tuple in our UNDO log, and we only store the
transaction information in the UNDO log. Reusing the tid of the old tuple
means putting the old tuple in the UNDO log, which we have not implemented
yet.
Okay, so that means performing update on a non-key attribute would also
require changes in the index table. In short, HOT update is currently
not possible with zedstore table. Am I right?
That's right. There's a lot of potential gain for doing HOT updates. For
example, if you UPDATE one column on every row on a table, ideally you
would only modify the attribute tree containing that column. But that
hasn't been implemented.
- Heikki
Thanks Heikki for your reply. After quite some time, today I got a chance
to look back into the code. I could see that you have changed the tuple
insertion and update mechanism a bit. As per the latest changes, all the
tuples being inserted/updated in a transaction are spooled into a hash
table and then flushed at transaction commit, and probably due to this
change, I could see that the server crashes when trying to perform an
UPDATE operation on a zedstore table having 10 lakh (1 million) records.
See the example below:
create table t1(a int, b int) using zedstore;
insert into t1 select i, i+10 from generate_series(1, 1000000) i;
postgres=# update t1 set b = 200;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The above update statement crashed due to an extensive memory leak.
Further, the UPDATE operation on a zedstore table is very slow. I think
that's because for a zedstore table we have to update all the btree data
structures even if only one column is updated, and that really sucks.
Please let me know if there is some other reason for it.
I also found some typos when going through the writeup in
zedstore_internal.h and thought of correcting those. Attached is the
patch with the changes.
Thanks,
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachment: fix_typos.patch
diff --git a/src/include/access/zedstore_internal.h b/src/include/access/zedstore_internal.h
index 5330c70..c2b726b 100644
--- a/src/include/access/zedstore_internal.h
+++ b/src/include/access/zedstore_internal.h
@@ -177,7 +177,7 @@ ZSBtreeInternalPageIsFull(Page page)
* Attribute B-tree leaf page layout
*
* Leaf pages in the attribute trees don't follow the normal page layout
- * with line pointers and items. They use the standard page page header,
+ * with line pointers and items. They use the standard page header,
* with pd_lower and pd_upper, but the data stored in the lower and upper
* parts are different from the normal usage.
*
@@ -226,7 +226,7 @@ ZSBtreeInternalPageIsFull(Page page)
* replacing both with one larger compressed stream.
*
* The names "lower" and "upper" refer to the physical location of
- * the stream on the page. The data in the in the lower attstream
+ * the stream on the page. The data in the lower attstream
* have higher-numbered TIDs than the data in the upper attstream.
* No overlap is allowed. This works well with the usual usage
* pattern that new data is added to the end (i.e. with increasing
@@ -255,7 +255,7 @@ ZSBtreeInternalPageIsFull(Page page)
* fits, and second time to compress just what fits. But that would be twice
* as slow. In practice, the wasted space doesn't matter much. We try to
* keep each chunk relatively small, to minimize the waste. And because we
- * now the next chunk wouldn't fit on the page anyway, there isn't much else
+ * know the next chunk wouldn't fit on the page anyway, there isn't much else
* we could do with the wasted space, anyway.
*/
typedef struct
On Tue, Sep 17, 2019 at 4:15 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:
create table t1(a int, b int) using zedstore;
insert into t1 select i, i+10 from generate_series(1, 1000000) i;
postgres=# update t1 set b = 200;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Above update statement crashed due to some extensive memory leak.
Thank you for reporting! We have located the memory leak and also
noticed some other memory-related bugs. We are working on the fixes,
please stay tuned!
I also found some typos when going through the writeup in
zedstore_internal.h and thought of correcting those. Attached is the
patch with the changes.
Applied. Thank you!
On Thu, Sep 19, 2019 at 8:10 AM Alexandra Wang <lewang@pivotal.io> wrote:
On Tue, Sep 17, 2019 at 4:15 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
create table t1(a int, b int) using zedstore;
insert into t1 select i, i+10 from generate_series(1, 1000000) i;
postgres=# update t1 set b = 200;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Above update statement crashed due to some extensive memory leak.
Thank you for reporting! We have located the memory leak and also
noticed some other memory related bugs. We are working on the fixes
please stay tuned!
Cool. As I suspected earlier, it's basically "ZedstoreAMTupleBuffers"
context that is completely exhausting the memory and it is being used
to spool the tuples.
I also found some typos when going through the writeup in
zedstore_internal.h and thought of correcting those. Attached is the
patch with the changes.
Applied. Thank you!
Thanks for that.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Thu, Sep 19, 2019 at 11:35 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Thu, Sep 19, 2019 at 8:10 AM Alexandra Wang <lewang@pivotal.io> wrote:
On Tue, Sep 17, 2019 at 4:15 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
create table t1(a int, b int) using zedstore;
insert into t1 select i, i+10 from generate_series(1, 1000000) i;
postgres=# update t1 set b = 200;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Above update statement crashed due to some extensive memory leak.
Thank you for reporting! We have located the memory leak and also
noticed some other memory related bugs. We are working on the fixes
please stay tuned!
Cool. As I suspected earlier, it's basically "ZedstoreAMTupleBuffers"
context that is completely exhausting the memory and it is being used
to spool the tuples.
Some more updates on top of this:
When doing an update operation, for each tuple being modified,
*tuplebuffers_insert()* says that there is no entry for the relation
being modified in the hash table, although it was already added when
the first tuple in the table was updated. Why is that? I mean, if I
have added an entry in the *tuplebuffers* hash table for, say, table
t1, should a subsequent call to tuplebuffers_insert() really say that
there is no entry for table t1 in *tuplebuffers*? Shouldn't that only
happen once all the tuples in tupbuffer->attbuffers have been flushed?
Because of this, for each tuple, tupbuffer->attbuffers is allocated,
resulting in a lot of memory consumption. OTOH, for an insert on the
same table, tuplebuffers_insert() says there is no entry for table t1
in the hash only for the first tuple; from the second tuple onwards
that doesn't happen. I think this is why the memory leak is happening
in case of the update operation. Please let me know if I'm missing
something here; I just didn't get a chance to spend much time on this.
Thank you.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
When doing update operation, for each tuple being modified,
*tuplebuffers_insert()* says that there is no entry for the relation
being modified in the hash table although it was already added when
the first tuple in the table was updated. Why is it so?
Currently, when doing an update, it will actually flush the tuple
buffers every time we update a tuple. As a result, we only ever spool
up one tuple at a time. This is a good place to put in an optimization
like the one implemented for insert, but I haven't gotten around to
looking into that yet.
The memory leak is actually happening because it isn't freeing the
attbuffers after flushing. Alexandra Wang and I have a working
branch[1] where we tried to plug the leak by freeing the attbuffers,
but it has exposed an issue with triggers that I need to understand
before I push the fix into the main zedstore branch.
I don't like our solution of freeing the buffers either, because they
could easily be reused. I'm going to take a stab at making that better
before merging in the fix.
[1]: https://github.com/l-wang/postgres-1/tree/zedstore-fix-memory-issues
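To illustrate the reuse-instead-of-free idea, a minimal standalone C
sketch (not zedstore code; the attbuffer_sketch type and its helpers
are made up for illustration) could look like the following. The
buffer is reset after every flush and its allocation is kept around
for the next batch, so memory stays bounded by the largest batch
instead of growing with the number of rows:
/* Illustrative only: a per-attribute spool buffer that is reset and reused
 * after each flush instead of being freed and reallocated for every row. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct attbuffer_sketch
{
    char   *data;                   /* spooled, still-unflushed datums */
    size_t  capacity;               /* allocated bytes */
    size_t  used;                   /* bytes currently spooled */
} attbuffer_sketch;
static void
attbuffer_append(attbuffer_sketch *buf, const void *datum, size_t len)
{
    if (buf->used + len > buf->capacity)
    {
        buf->capacity = (buf->used + len) * 2;
        buf->data = realloc(buf->data, buf->capacity);
    }
    memcpy(buf->data + buf->used, datum, len);
    buf->used += len;
}
static void
attbuffer_flush(attbuffer_sketch *buf)
{
    /* ... the real code would write buf->data into the attribute tree ... */
    buf->used = 0;                  /* reset, but keep the allocation */
}
int
main(void)
{
    attbuffer_sketch buf = {NULL, 0, 0};
    for (int row = 0; row < 1000000; row++)
    {
        int datum = row;
        attbuffer_append(&buf, &datum, sizeof(datum));
        if (buf.used >= 60 * sizeof(datum))   /* spool ~60 tuples, then flush */
            attbuffer_flush(&buf);
    }
    attbuffer_flush(&buf);
    printf("peak buffer size: %zu bytes\n", buf.capacity);
    free(buf.data);                 /* freed once at the end, not per row */
    return 0;
}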
On Fri, Sep 20, 2019 at 5:48 AM Taylor Vesely <tvesely@pivotal.io> wrote:
When doing update operation, for each tuple being modified,
*tuplebuffers_insert()* says that there is no entry for the relation
being modified in the hash table although it was already added when
the first tuple in the table was updated. Why is it so?
Currently, when doing an update, it will actually flush the tuple
buffers every time we update a tuple. As a result, we only ever spool
up one tuple at a time. This is a good place to put in an optimization
like was implemented for insert, but I haven't gotten around to
looking into that yet.
Okay, so that's the root cause. Spooling just one tuple at a time
(where at least 60 tuples could be spooled), and then not freeing it
at all, is altogether the reason for this extensive memory leak.
The memory leak is actually happening because it isn't freeing the
attbuffers after flushing. Alexandra Wang and I have a working
branch[1] where we tried to plug the leak by freeing the attbuffers,
but it has exposed an issue with triggers that I need to understand
before I push the fix into the main zedstore branch.
I don't like our solution of freeing the buffers either, because they
could easily be reused. I'm going to take a stab at making that better
before merging in the fix.
That's right, why do we need to free the memory after flushing the data
in attbuffers? We can simply reuse it for the next set of data to be updated.
[1] https://github.com/l-wang/postgres-1/tree/zedstore-fix-memory-issues
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Hi Alexandra,
On Tue, Sep 17, 2019 at 4:45 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Thu, Aug 29, 2019 at 5:39 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 29/08/2019 14:30, Ashutosh Sharma wrote:
On Wed, Aug 28, 2019 at 5:30 AM Alexandra Wang <lewang@pivotal.io> wrote:
Further, the UPDATE operation on zedstore table is very slow. I think
that's because in case of zedstore table we have to update all the
btree data structures even if one column is updated and that really
sucks. Please let me know if there is some other reason for it.
There was no answer for this in your previous reply; it seems like you
missed it. As I said earlier, I tried performing the UPDATE operation
with an optimised build and found that updating around 10 lakh (1
million) records in a zedstore table takes around 24k ms, whereas for
a normal heap table it takes 2k ms. Is that because in case of a
zedstore table we have to update all the Btree data structures even if
only one column is updated, or is there some other reason for it? If
yes, could you please let us know. FYI, I'm trying to update a table
with just two columns.
Further, in the latest code I'm getting this warning message when it
is compiled using the -O2 optimisation flag.
zedstore_tidpage.c: In function ‘zsbt_collect_dead_tids’:
zedstore_tidpage.c:978:10: warning: ‘page’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
opaque = ZSBtreePageGetOpaque(page);
^
Attached is the patch that fixes it.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachment: fix_warning_zsbt.patch
diff --git a/src/backend/access/zedstore/zedstore_tidpage.c b/src/backend/access/zedstore/zedstore_tidpage.c
index 7730ef3..f590f79 100644
--- a/src/backend/access/zedstore/zedstore_tidpage.c
+++ b/src/backend/access/zedstore/zedstore_tidpage.c
@@ -956,9 +956,10 @@ zsbt_collect_dead_tids(Relation rel, zstid starttid, zstid *endtid, uint64 *num_
buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, nexttid, 0, true);
if (!BufferIsValid(buf))
return result;
- page = BufferGetPage(buf);
}
+ page = BufferGetPage(buf);
+
maxoff = PageGetMaxOffsetNumber(page);
for (off = FirstOffsetNumber; off <= maxoff; off++)
{
Hi Ashutosh,
Sorry I indeed missed your question, thanks for the reminder!
On Wed, Sep 25, 2019 at 4:10 AM Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:
Further, the UPDATE operation on zedstore table is very slow. I think
that's because in case of zedstore table we have to update all the
btree data structures even if one column is updated and that really
sucks. Please let me know if there is some other reason for it.
There was no answer for this in your previous reply. It seems like you
missed it. As I said earlier, I tried performing UPDATE operation with
optimised build and found that to update around 10 lacs record in
zedstore table it takes around 24k ms whereas for normal heap table it
takes 2k ms. Is that because in case of zedstore table we have to
update all the Btree data structures even if one column is updated or
there is some other reason for it. If yes, could you please let us
know. FYI, I'm trying to update the table with just two columns.
The Zedstore UPDATE operation currently fetches the old rows, updates
the undo pointers stored in the tid btree, and inserts new rows into
all the attribute btrees with the new tids. So the performance of
updating one column is no different from updating all the columns.
That said, the wider the table is, the longer the update takes,
regardless of whether one column or all the columns are updated.
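To make that concrete, here is a toy standalone C model (not the
actual zedstore code; NATTS and the counters are invented for
illustration) of why an update costs roughly the same no matter how
many columns changed: every updated row appends one new datum to every
attribute tree.
/* Toy model: an UPDATE rewrites the whole row under a new tid, so it appends
 * one datum to each attribute tree regardless of which column changed. */
#include <stdio.h>
#define NATTS 24                        /* imagine a 24-column table */
static long attr_tree_inserts[NATTS];
static void
update_row(int changed_column)
{
    (void) changed_column;              /* unused: the cost is the same either way */
    for (int attno = 0; attno < NATTS; attno++)
        attr_tree_inserts[attno]++;     /* new datum inserted with the new tid */
}
int
main(void)
{
    for (long row = 0; row < 1000000; row++)
        update_row(1);                  /* "set b = 200" touches one column... */
    long total = 0;
    for (int attno = 0; attno < NATTS; attno++)
        total += attr_tree_inserts[attno];
    /* ...yet every attribute tree receives 1M new entries. */
    printf("attribute-tree insertions for 1M single-column updates: %ld\n", total);
    return 0;
}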
However, since your test table only has two columns, and we also
tested the same on a one-column table and got similar results as
yours, there is definitely room for optimizations. The attached file
zedstore_update_flames_lz4_first_update.svg is the profiling result
for the update query on a one-column table with 1M records. It spent
most of the time in zedstoream_fetch_row() and zsbt_tid_update(). For
zedstoream_fetch_row(), Taylor and I had some interesting findings
which I'm going to talk about next; I haven't dived into
zsbt_tid_update() yet and need to think about it more.
To understand what slows down zedstore UPDATE, Taylor and I did the
following test and profiling on a zedstore table with only one column.
postgres=# create table onecol(a int) using zedstore;
postgres=# insert into onecol select i from generate_series(1, 1000000) i;
-- Create view to count zedstore pages group by page types
postgres=# CREATE VIEW pg_zs_page_counts AS
SELECT
c.relnamespace::regnamespace,
c.oid,
c.relname,
pg_zs_page_type(c.oid, generate_series(0, c.relpages - 1)),
count(*)
FROM pg_am am
JOIN pg_class c ON (c.relam = am.oid)
WHERE am.amname='zedstore'
GROUP BY 1,2,3,4;
postgres=# select * from pg_zs_page_counts;
relnamespace | oid | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
public | 32768 | onecol | BTREE | 640
public | 32768 | onecol | FREE | 90
public | 32768 | onecol | META | 1
(3 rows)
-- Run update query the first time
postgres=# update onecol set a = 200; -- profiling attached in zedstore_update_flames_lz4_first_update.svg
Time: 28760.199 ms (00:28.760)
postgres=# select * from pg_zs_page_counts;
relnamespace | oid | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
public | 32768 | onecol | BTREE | 6254
public | 32768 | onecol | FREE | 26915
public | 32768 | onecol | META | 1
(6 rows)
postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 0;
count
-------
5740
(1 row)
postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 1;
count
-------
514
(1 row)
postgres=# select * from pg_zs_btree_pages('onecol') where attno = 1 and totalsz > 0;
 blkno | nextblk    | attno | level | lokey   | hikey           | nitems | ncompressed | totalsz | uncompressedsz | freespace
-------+------------+-------+-------+---------+-----------------+--------+-------------+---------+----------------+-----------
   730 |       6580 |     1 |     0 |  999901 |         1182451 |      1 |           1 |    3156 |         778480 |      4980
  6580 |      13030 |     1 |     0 | 1182451 |         1380771 |      2 |           1 |    8125 |         859104 |        11
 13030 |      19478 |     1 |     0 | 1380771 |         1579091 |      2 |           1 |    8125 |         859104 |        11
 19478 |      25931 |     1 |     0 | 1579091 |         1777411 |      2 |           1 |    8125 |         859104 |        11
 25931 |      32380 |     1 |     0 | 1777411 |         1975731 |      2 |           1 |    8125 |         859104 |        11
 32380 | 4294967295 |     1 |     0 | 1975731 | 281474976645120 |      2 |           1 |    2033 |         105016 |      6103
(6 rows)
-- Run update query the second time
postgres=# update onecol set a = 200; -- profiling attached in zedstore_update_flames_lz4_second_update.svg
Time: 267135.703 ms (04:27.136)
As you can see, it took 28s to run the update query the first time,
which was slow but expected. However, when we ran the same update
query again it took 4 mins and 27s, almost 10x slower than the first
run. The profiling result of the second update is attached; it shows
that 57% of all the time is spent in decode_chunk_fixed(), which is
used for decoding a chunk in an attstream so that we can confirm
whether the tid of interest is in that chunk and fetch it if so.
Right now, each chunk contains at most 60 tids for fixed-length
attributes and at most 30 tids for varlena attributes, and we decode
all the tids each chunk contains one by one.
Going back to our test, before and after the first UPDATE the BTREE
page count increased from 640 to 6254; however, only 6 out of the 514
attribute btree pages actually store data. It seems like a bug that we
left behind 508 empty btree pages; we should fix it, but let's put it
aside as a separate problem. With 6 pages we stored 1M rows, so each
page contains as many as 198,320 tids. This is the reason why the
second UPDATE spent so much time decoding chunks. The btree structure
only helps us locate the page for a given tid, but once we get to the
page, the better compression we have, the more chunks we can pack in
one page, and the more calls per page to decode_chunk(). Even worse,
unlike INSERT, UPDATE currently initializes a new fetcher every time
it fetches a new row, which means it doesn't remember the last
position the decoder was at in the attstream, so every time it fetches
a new row the decoder starts all over from the beginning of the
attstream, and we are talking about an attstream that could have
198,320 records. We also haven't done any optimization inside
decode_chunk() itself, like checking the first and last tid, stopping
decoding once the tid is found, or doing a binary search for
fixed-length attributes.
So, I think the things that slow down the second UPDATE are also part
of the reasons why the first UPDATE is slow. We still haven't done any
optimization for UPDATE so far, probably because we didn't expect it
to be better than heap, but we should try to make it not too much
worse.
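As a rough sketch of the kind of decode_chunk() optimization mentioned
above (standalone C, not the actual zedstore code; it simply assumes
the tids within one decoded chunk are sorted), a cheap first/last
check plus a binary search avoids walking all 60 tids one by one:
/* Illustrative sketch: once a chunk's tids are decoded into a sorted array,
 * find a target tid with binary search instead of a linear scan, and bail
 * out early if the target falls outside [first, last]. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
typedef uint64_t zstid_sketch;
static bool
chunk_contains_tid(const zstid_sketch *tids, int ntids, zstid_sketch target, int *idx)
{
    if (ntids == 0 || target < tids[0] || target > tids[ntids - 1])
        return false;                   /* cheap first/last check */
    int lo = 0, hi = ntids - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (tids[mid] == target)
        {
            *idx = mid;
            return true;
        }
        else if (tids[mid] < target)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}
int
main(void)
{
    zstid_sketch tids[60];
    for (int i = 0; i < 60; i++)
        tids[i] = 1000 + i;             /* a made-up chunk of 60 tids */
    int idx;
    if (chunk_contains_tid(tids, 60, 1042, &idx))
        printf("tid 1042 found at position %d\n", idx);
    if (!chunk_contains_tid(tids, 60, 5000, &idx))
        printf("tid 5000 is not in this chunk\n");
    return 0;
}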
Further, in the latest code I'm getting this warning message when it
is compiled using the -O2 optimisation flag.
zedstore_tidpage.c: In function ‘zsbt_collect_dead_tids’:
zedstore_tidpage.c:978:10: warning: ‘page’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
opaque = ZSBtreePageGetOpaque(page);
^
Attached is the patch that fixes it.
Applied. Thanks!
On Fri, Sep 27, 2019 at 3:09 PM Alexandra Wang <lewang@pivotal.io> wrote:
Hi Ashutosh,
Sorry I indeed missed your question, thanks for the reminder!
On Wed, Sep 25, 2019 at 4:10 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Further, the UPDATE operation on zedstore table is very slow. I think
that's because in case of zedstore table we have to update all the
btree data structures even if one column is updated and that really
sucks. Please let me know if there is some other reason for it.
There was no answer for this in your previous reply. It seems like you
missed it. As I said earlier, I tried performing UPDATE operation with
optimised build and found that to update around 10 lacs record in
zedstore table it takes around 24k ms whereas for normal heap table it
takes 2k ms. Is that because in case of zedstore table we have to
update all the Btree data structures even if one column is updated or
there is some other reason for it. If yes, could you please let us
know. FYI, I'm trying to update the table with just two columns.
Zedstore UPDATE operation currently fetches the old rows, updates the
undo pointers stored in the tid btree, and insert new rows into all
the attribute btrees with the new tids. So performance of updating one
column makes no difference from updating all the columns. That said,
the wider the table is, the longer it takes to update, regardless
updating one column or all the columns.
However, since your test table only has two columns, and we also
tested the same on a one-column table and got similar results as
yours, there is definitely room for optimizations. Attached file
zedstore_update_flames_lz4_first_update.svg is the profiling results
for the update query on a one-column table with 1M records. It spent
most of the time in zedstoream_fetch_row() and zsbt_tid_update(). For
zedstoream_fetch_row(), Taylor and I had some interesting findings
which I'm going to talk about next, I haven't dived into
zsbt_tid_update() yet and need to think about it more.
To understand what slows down zedstore UPDATE, Taylor and I did the
following test and profiling on a zedstore table with only one column.
postgres=# create table onecol(a int) using zedstore;
postgres=# insert into onecol select i from generate_series(1, 1000000) i;
-- Create view to count zedstore pages group by page types
postgres=# CREATE VIEW pg_zs_page_counts AS
SELECT
c.relnamespace::regnamespace,
c.oid,
c.relname,
pg_zs_page_type(c.oid, generate_series(0, c.relpages - 1)),
count(*)
FROM pg_am am
JOIN pg_class c ON (c.relam = am.oid)
WHERE am.amname='zedstore'
GROUP BY 1,2,3,4;
postgres=# select * from pg_zs_page_counts;
relnamespace | oid | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
public | 32768 | onecol | BTREE | 640
public | 32768 | onecol | FREE | 90
public | 32768 | onecol | META | 1
(3 rows)
-- Run update query the first time
postgres=# update onecol set a = 200; -- profiling attached in zedstore_update_flames_lz4_first_update.svg
Time: 28760.199 ms (00:28.760)
postgres=# select * from pg_zs_page_counts;
relnamespace | oid | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
public | 32768 | onecol | BTREE | 6254
public | 32768 | onecol | FREE | 26915
public | 32768 | onecol | META | 1
(6 rows)
Oops, the first UPDATE created a lot of free pages.
Just FYI, when the second update was run, it took around 5 mins (which
is almost 10-12 times more than what the 1st UPDATE took), but this
time no more free pages were added; instead, the already available
free pages were used. Here are the stats observed before and after the
second update,
before:
=====
postgres=# select * from pg_zs_page_counts;
relnamespace | oid | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
public | 16390 | t1 | FREE | 26915
public | 16390 | t1 | BTREE | 7277
public | 16390 | t1 | META | 1
(3 rows)
after:
====
postgres=# select * from pg_zs_page_counts;
relnamespace | oid | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
public | 16390 | t1 | FREE | 26370
public | 16390 | t1 | BTREE | 7822
public | 16390 | t1 | META | 1
(3 rows)
You may see that around 545 pages got added this time and they were
all taken from the free pages list.
postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 0;
count
-------
5740
(1 row)
This could be because currently the tid blocks are not compressed,
unlike the other attribute blocks.
postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 1;
count
-------
514
(1 row)
postgres=# select * from pg_zs_btree_pages('onecol') where attno = 1 and totalsz > 0;
blkno | nextblk | attno | level | lokey | hikey | nitems | ncompressed | totalsz | uncompressedsz | freespace
-------+------------+-------+-------+---------+-----------------+--------+-------------+---------+----------------+-----------
730 | 6580 | 1 | 0 | 999901 | 1182451 | 1 | 1 | 3156 | 778480 | 4980
6580 | 13030 | 1 | 0 | 1182451 | 1380771 | 2 | 1 | 8125 | 859104 | 11
13030 | 19478 | 1 | 0 | 1380771 | 1579091 | 2 | 1 | 8125 | 859104 | 11
19478 | 25931 | 1 | 0 | 1579091 | 1777411 | 2 | 1 | 8125 | 859104 | 11
25931 | 32380 | 1 | 0 | 1777411 | 1975731 | 2 | 1 | 8125 | 859104 | 11
32380 | 4294967295 | 1 | 0 | 1975731 | 281474976645120 | 2 | 1 | 2033 | 105016 | 6103
(6 rows)
-- Run update query the second time
postgres=# update onecol set a = 200; -- profiling attached in zedstore_update_flames_lz4_second_update.svg
Time: 267135.703 ms (04:27.136)
As you can see, it took 28s to run the update query for the first
time, it was slow but expected. However, when we run the same update
query again it took 4 mins and 27s, almost 10x slower than the first
run. The profiling result of the second update is attached, it shows
that 57% of all the time it's doing decode_chunk_fixed(), which is
used for decoding a chunk in a attstream so that we can confirm
whether the tid of interest is in that chunk and fetch it if true.
Right now, each chunk contains at most 60 tids for fixed length
attributes and at most 30 tids for varlena attributes, and we decode
all the tids each chunk contains one by one.
Going back to our test, before and after the first UPDATE, the BTREE
page counts increased from 640 to 6254, however, only 6 out of the 514
attribute btree pages actually store data. It seems like a bug that we
left behind 508 empty btree pages, we should fix it, but let's put it
aside as a seperate problem. With 6 pages we stored 1M rows, each page
contains as many as 198,320 tids. This is the reason why the second
UPDATE spent so much time at decoding chunks. The btree structure only
helps us locate the page for a given tid, but once we get to the page,
the better compression we have, the more chunks we can pack in one
page, the more calls per page to decode_chunk(). Even worse, unlike
INSERT, UPDATE currently initialize a new fetcher every time it
fetches a new row, which means it doesn't remember the last position
the decoder was at in the attstream, so everytime it fetches a new
row, the decoder starts all over from the beginning of the attstream,
and we are talking about an attstream that could have 198,320 records.
We also haven't done any optimization inside of decode_chunk() itself,
like checking first and last tid, stop decoding once found the tid, or
doing binary search for fixed length attributes.
So, I think what slows down the second UPDATE are also part of the
reasons why the first UPDATE is slow. We still haven't done any
optimization for UPDATE so far, probably because we didn't expect it
to be better than heap, but we should try to make it not too much
worse.
That's right, if the situation is too bad, it would be difficult to
accept. So, some fix is certainly required here.
Further, In the latest code I'm getting this warning message when it
is compiled using -O2 optimisation flag.
zedstore_tidpage.c: In function ‘zsbt_collect_dead_tids’:
zedstore_tidpage.c:978:10: warning: ‘page’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
opaque = ZSBtreePageGetOpaque(page);
^
Attached is the patch that fixes it.
Applied. Thanks!
Thanks for that, and for sharing the detailed information on why the
update operation on a zedstore table is so slow.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Hi,
I got a chance to spend some time looking into the recent changes done
in the zedstore code, basically the functions for packing datums into
the attribute streams and handling attribute leaf pages. I didn't find
any issues, but there are some minor comments that I came across while
reviewing. I have worked on those and attached is the patch with the
changes. See if the changes look meaningful to you.
Thanks,
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachment: minor_corrections.patch
diff --git a/src/backend/access/zedstore/zedstore_attpage.c b/src/backend/access/zedstore/zedstore_attpage.c
index 4c435f2..42ec1c2 100644
--- a/src/backend/access/zedstore/zedstore_attpage.c
+++ b/src/backend/access/zedstore/zedstore_attpage.c
@@ -513,7 +513,7 @@ zsbt_attr_add(Relation rel, AttrNumber attno, attstream_buffer *attbuf)
* modify the existing data. The new combined stream might even be smaller
* than the old stream, if the last codewords are packed more tighthly.
* But at the moment, append_attstreams_inplace() doesn't do anything
- * that smart. So we asume that the existing data didn't change, and we
+ * that smart. So we assume that the existing data didn't change, and we
* only need to WAL log the new data at the end of the stream.
*/
((PageHeader) origpage)->pd_lower = new_pd_lower;
diff --git a/src/backend/access/zedstore/zedstore_attstream.c b/src/backend/access/zedstore/zedstore_attstream.c
index 4d3ba05..c0c24e7 100644
--- a/src/backend/access/zedstore/zedstore_attstream.c
+++ b/src/backend/access/zedstore/zedstore_attstream.c
@@ -1935,12 +1935,13 @@ decode_chunk_varlen(zstid *lasttid, char *chunk,
else
{
zs_toast_header_external hdr;
+ BlockNumber toastblkno;
+ varatt_zs_toastptr *toastptr;
memcpy(&hdr, p, sizeof(zs_toast_header_external));
p += sizeof(zs_toast_header_external);
- BlockNumber toastblkno = hdr.toast_blkno;
- varatt_zs_toastptr *toastptr;
+ toastblkno = hdr.toast_blkno;
toastptr = palloc0(sizeof(varatt_zs_toastptr));
SET_VARTAG_1B_E(toastptr, VARTAG_ZEDSTORE);
diff --git a/src/backend/access/zedstore/zedstore_tuplebuffer.c b/src/backend/access/zedstore/zedstore_tuplebuffer.c
index a771f81..ef45b6f 100644
--- a/src/backend/access/zedstore/zedstore_tuplebuffer.c
+++ b/src/backend/access/zedstore/zedstore_tuplebuffer.c
@@ -361,7 +361,7 @@ zsbt_attbuffer_flush(Relation rel, AttrNumber attno, attbuffer *attbuffer, bool
while ((all && chunks->len - chunks->cursor > 0) ||
chunks->len - chunks->cursor > ATTBUFFER_SIZE)
{
- zsbt_attr_add(rel, attno, &attbuffer->chunks);
+ zsbt_attr_add(rel, attno, chunks);
}
}
On 15/10/2019 13:49, Ashutosh Sharma wrote:
Hi,
I got chance to spend some time looking into the recent changes done
in the zedstore code, basically the functions for packing datums into
the attribute streams and handling attribute leaf pages. I didn't find
any issues but there are some minor comments that I found when
reviewing. I have worked on those and attached is the patch with the
changes. See if the changes looks meaningful to you.
Thanks for looking! Applied to the development repository
(https://github.com/greenplum-db/postgres/tree/zedstore/)
- Heikki
On Thu, Oct 17, 2019 at 2:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 15/10/2019 13:49, Ashutosh Sharma wrote:
Hi,
I got chance to spend some time looking into the recent changes done
in the zedstore code, basically the functions for packing datums into
the attribute streams and handling attribute leaf pages. I didn't find
any issues but there are some minor comments that I found when
reviewing. I have worked on those and attached is the patch with the
changes. See if the changes look meaningful to you.
Thanks for looking! Applied to the development repository
Thank you. Here is one more observation:
When a zedstore table is queried using *invalid* ctid, the server
crashes due to assertion failure. See below,
postgres=# select * from t1 where ctid = '(0, 0)';
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
I believe the above should have either returned 0 rows or failed with
some user-friendly error.
Further, when the same table is queried using some non-existing ctid,
the query returns 0 rows. See below,
postgres=# select count(*) from t1;
count
-------
2
(1 row)
postgres=# select * from t1 where ctid = '(0, 2)';
a | b
---+------
2 | str2
(1 row)
postgres=# select * from t1 where ctid = '(0, 3)';
a | b
---+---
(0 rows)
postgres=# select * from t1 where ctid = '(0, 4)';
a | b
---+---
(0 rows)
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
When a zedstore table is queried using *invalid* ctid, the server
crashes due to assertion failure. See below,
postgres=# select * from t2 where ctid = '(0, 0)';
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Thank you for pointing that out! I will look into fixing that some
time this week. If we run without assertions the query still fails
with this error because zedstoream_tuple_tid_valid incorrectly reports
the TID as valid:
ERROR: arrived at incorrect block 2 while descending zedstore btree
I believe the above should have either returned 0 rows or failed with
some user-friendly error.
Agreed. I think it should match the behavior of heap as closely as
possible.
Alex Wang and I have been doing some performance analysis of the most
recent version of the zedstore branch, and have some interesting
statistics to share.
We specifically focused on TPC-DS query 2, because it plays to what
should be the strength of zedstore - namely, it does a full table scan
of only a subset of columns. I've attached the explain verbose output
for reference.
We scan two columns of 'catalog_sales', and two columns of 'web_sales'.
-> Parallel Append
-> Parallel Seq Scan on tpcds.catalog_sales
Output: catalog_sales.cs_ext_sales_price,
catalog_sales.cs_sold_date_sk
-> Parallel Seq Scan on tpcds.web_sales
Output: web_sales.ws_ext_sales_price, web_sales.ws_sold_date_sk
For heap, it needs to do a full table scan of both tables, and we need
to read the entire table into memory. For our dataset, that totals
around 119GB of data.
***HEAP***
tpcds=# select pg_size_pretty(pg_relation_size('web_sales'));
pg_size_pretty
----------------
39 GB
(1 row)
tpcds=# select pg_size_pretty(pg_relation_size('catalog_sales'));
pg_size_pretty
----------------
80 GB
(1 row)
***/HEAP***
With Zedstore the total relation size is smaller because of
compression. When scanning the table, we only scan the blocks with
data we are interested in, and leave the rest alone. So the total
size we need to scan for these tables totals around 4GB
***ZEDSTORE***
zedstore=# select pg_size_pretty(pg_relation_size('web_sales'));
pg_size_pretty
----------------
20 GB
(1 row)
zedstore=# select pg_size_pretty(pg_relation_size('catalog_sales'));
pg_size_pretty
----------------
40 GB
(1 row)
zedstore=# with zedstore_tables as (select d.oid, f.*
zedstore(# from (select c.oid
zedstore(# from pg_am am
zedstore(# join pg_class c on (c.relam = am.oid)
zedstore(# where am.amname = 'zedstore') d,
zedstore(# pg_zs_btree_pages(d.oid) f)
zedstore-# select zs.attno, att.attname, zs.oid::regclass, count(zs.attno) as pages,
zedstore-# pg_size_pretty(count(zs.attno) * 8 * 1024) from zedstore_tables zs
zedstore-# left join pg_attribute att on zs.attno = att.attnum
zedstore-# and zs.oid = att.attrelid
zedstore-# where zs.oid in ('catalog_sales'::regclass, 'web_sales'::regclass)
zedstore-# and (att.attname in ('cs_ext_sales_price','cs_sold_date_sk','ws_ext_sales_price','ws_sold_date_sk')
zedstore(# or zs.attno = 0)
zedstore-# group by zs.attno, att.attname, zs.oid
zedstore-# order by zs.oid, zs.attno;
attno | attname | oid | pages | pg_size_pretty
-------+--------------------+---------------+--------+----------------
0 | | catalog_sales | 39549 | 309 MB
1 | cs_sold_date_sk | catalog_sales | 2441 | 19 MB
24 | cs_ext_sales_price | catalog_sales | 289158 | 2259 MB
0 | | web_sales | 20013 | 156 MB
1 | ws_sold_date_sk | web_sales | 17578 | 137 MB
24 | ws_ext_sales_price | web_sales | 144860 | 1132 MB
***/ZEDSTORE ***
On our test machine, our tables were stored on a single spinning disk,
so our read speed was pretty abysmal with this query. This query is
I/O bound for us, so it was the single largest factor. With heap, the
tables are scanned sequentially, and therefore can scan around 150MB of
table data per second:
***HEAP***
avg-cpu: %user %nice %system %iowait %steal %idle
8.54 0.00 1.85 11.62 0.00 77.98
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm
%wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdd 1685.33 0.00 157069.33 0.00 18.67 0.00 1.10
0.00 1.56 0.00 2.62 93.20 0.00 0.59 100.00
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm
%wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdd 1655.33 0.00 154910.67 0.00 21.33 0.00 1.27
0.00 1.62 0.00 2.68 93.58 0.00 0.60 100.13
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm
%wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdd 1746.33 0.00 155121.33 0.00 28.00 0.00 1.58
0.00 1.48 0.00 2.61 88.83 0.00 0.57 100.00
***/HEAP***
Because zedstore resembled random I/O, the read speed was
significantly hindered on our single disk. As a result, we saw ~150x
slower read speeds.
***ZEDSTORE***
avg-cpu: %user %nice %system %iowait %steal %idle
6.24 0.00 1.22 6.34 0.00 86.20
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm
%wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdb 129.33 0.00 1034.67 0.00 0.00 0.00 0.00
0.00 15.89 0.00 2.05 8.00 0.00 7.67 99.20
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm
%wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdb 120.67 0.00 965.33 0.00 0.00 0.00 0.00
0.00 16.51 0.00 1.99 8.00 0.00 8.21 99.07
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm
%wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sdb 121.00 0.00 968.00 0.00 0.00 0.00 0.00
0.00 16.76 0.00 2.02 8.00 0.00 8.19 99.07
***/ZEDSTORE***
The total query time:
***HEAP***
Execution Time: 758807.571 ms
***/HEAP***
***ZEDSTORE***
Execution Time: 2111576.259 ms
***/ZEDSTORE***
Every attribute in zedstore is stored in a btree with the TID as a
key. Unlike heap, the TID is a logical address, and not a physical
one. The pages of one attribute are interspersed with the pages of all
other attributes. When you do a sequential scan on zedstore the pages
are, therefore, not stored in sequential order, so the access pattern
can resemble random I/O.
On our system, query time for zedstore was around 3x slower than heap
for this query. If your storage does not handle semi-random reads very
well, then zedstore can be very slow. This setup was a worst case
scenario because random read was 150x slower than with sequential
read. On hardware with better random I/O zedstore would really shine.
On a side note, a second run of this query with zedstore was finished
in around 57 seconds, because the ~4GB of column data was already in
the relcache. The data size is smaller because we only store the
relevant columns in memory, also the datums are compressed and
encoded. Conversely, subsequently running the same query with heap
still takes around 750 seconds because our system cannot store 119GB
of relation data in the relcache/system caches.
Our main takeaway with this is that anything we can do to group
together data that is accessed together can help zedstore to have
larger, more frequent sequential reads.
On Mon, Oct 28, 2019 at 3:22 PM Taylor Vesely <tvesely@pivotal.io> wrote:
When a zedstore table is queried using *invalid* ctid, the server
crashes due to assertion failure. See below,
postgres=# select * from t1 where ctid = '(0, 0)';
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
I believe above should have either returned 0 rows or failed with some
user friendly error.
We pushed a fix for this today. It now returns zero rows, like the
equivalent query with heap. Thanks for reporting!
Hello,
We (David and I) recently observed that a Zedstore table can be considerably
bloated when we load data into it with concurrent copies. Also, we found that
concurrent insert performance was less than desirable. This is a detailed
analysis of the extent of the problem, the cause of the problem and how we
fixed it. This has input from much of our team: Alex, Ashwin, David, Heikki,
Melanie, Taylor and myself.
An example of the bloat that we observed:
TPC-DS scale = 270:
Table        heap   zed(serial)   zed(16 parallel COPYs)
web_sales    39G    19G           39G
We found that it was caused by inefficient page splits resulting from
out-of-tid-order inserts into full (or nearly full) attribute tree leaf pages.
The degree of under-utilization was significant - attribute tree leaves with a
serial data load had 6-8x more datums than the attribute tree leaves resulting
from a parallel load with 16 sessions.
Consider the scenario below:
Assumptions:
1. Let us consider two concurrent copy commands executing (sessions S1 and
S2).
2. The table has only one (fixed-length for sake of argument) attribute 'a'.
3. For attribute 'a', a full attribute tree leaf page can accommodate 1500
datums.
TID allocations:
S1: 1-1000
S2: 1001-2000, 2001-3000
Order of operations:
1. S2 writes datums for tids 1001-2000, 2001-3000.
The resulting leaves are:
L1:
lokey = 1 hikey = 2500
firsttid = 1001 lasttid = 2500
L2:
lokey = 2501 hikey = MaxZSTid
firsttid = 2501 lasttid = 3000
2. S1 now writes datums for tids 1-1000.
We have to split L1 into L1' and L1''.
L1':
lokey = 1 hikey = 1500
firsttid = 1 lasttid = 1500
L1'': [under-utilized page]
lokey = 1501 hikey = 2000
firsttid = 1501 lasttid = 2000
L2:
lokey = 2501 hikey = MaxZSTid
firsttid = 2501 lasttid = 3000
Note: The lokeys/hikeys reflect ranges of what CAN be inserted, whereas
firsttid and lasttid reflect what has actually been inserted.
L1'' will be an under-utilized page that is not going to be filled again
because it inherits the tight hikey from L1. In this example, the space wastage
in L1'' is 66%, but it could very easily be close to 100%, especially under
concurrent workloads which mix single and multi-inserts, or even unequally
sized multi-inserts.
Solution (kudos to Ashwin!):
For every multi-insert (and only multi-insert, not for singleton inserts),
allocate N times more tids. Each session will keep these extra tids in a
buffer. Subsequent calls to multi-insert would use these buffered tids. If at
any time a tid allocation request cannot be met by the remaining buffered tids,
a new batch of N times the number of tids requested will again be allocated.
If we take the same example above and say we allocated N=5 times the number of
tids upon the first request for 1000 tids:
TID allocations:
S1: 1-5000
S2: 5001-10000
Order of operations:
1. S2 writes datums for tids 5001-6000, 6001-7000.
The resulting leaves are:
L1:
lokey = 1 hikey = 6500
firsttid = 5001 lasttid = 6500
L2:
lokey = 6501 hikey = MaxZSTid
firsttid = 6501 lasttid = 7000
2. S1 writes datums for tids 1-1000.
L1 will be split into L1' and L1''.
L1':
lokey = 1 hikey = 5500
firsttid = 1 lasttid = 1000
L1'' [under-utilized page]:
lokey = 5501 hikey = 6500
firsttid = 5501 lasttid = 6500
L2:
lokey = 6501 hikey = MaxZSTid
firsttid = 6501 lasttid = 7000
Subsequent inserts by S1 will land on L1', whose hikey isn't restrictive.
However, we do end up with the inefficient page L1''. With a high enough value
of N, we reduce the frequency of such pages. We could further reduce this
wastage by incorporating a special left split (since L1 was already full, we
don't change it at all -> we simply update its lokey -> L1 becomes L1'' and we
fork off a new leaf to its left: L1'). This would look like:
L1':
lokey = 1 hikey = 5000
firsttid = 1 lasttid = 1000
L1'':
lokey = 5001 hikey = 6500
firsttid = 5001 lasttid = 6500
We found that with a high enough value of N, we did not get significant space
benefits from the left split. Thus, we decided to only incorporate N.
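Here is a minimal standalone C simulation of the buffered allocation
described above (not the actual patch; the allocator, the
session_tid_buffer struct and the function names are invented). Each
session over-allocates N times the requested tids and serves later
multi-inserts from its private buffer, so the number of trips to the
shared tid tree drops by a factor of N:
/* Toy simulation: per-session tid buffering reduces trips to the shared
 * tid allocator (the tid tree) by a factor of N. */
#include <stdint.h>
#include <stdio.h>
static uint64_t next_tid = 1;           /* shared allocator state */
static long     allocator_calls = 0;    /* contention proxy: tid-tree accesses */
typedef struct session_tid_buffer
{
    uint64_t next;                      /* next buffered tid to hand out */
    uint64_t end;                       /* one past the last buffered tid */
} session_tid_buffer;
/* Request 'want' tids; over-allocate N times that amount when the buffer
 * cannot satisfy the request. */
static uint64_t
multi_insert_get_tids(session_tid_buffer *buf, uint64_t want, int N)
{
    if (buf->end - buf->next < want)
    {
        uint64_t batch = want * N;
        buf->next = next_tid;           /* one access to the shared tid tree */
        buf->end = next_tid + batch;
        next_tid += batch;
        allocator_calls++;
    }
    uint64_t first = buf->next;
    buf->next += want;
    return first;
}
int
main(void)
{
    const int N = 10;                   /* over-allocation factor */
    session_tid_buffer s1 = {0, 0};
    /* one session loading 1,000,000 rows in 1,000-row multi-inserts */
    for (int batch = 0; batch < 1000; batch++)
        multi_insert_get_tids(&s1, 1000, N);
    printf("tid-tree allocations with N=%d: %ld (vs. 1000 without buffering)\n",
           N, allocator_calls);
    return 0;
}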
Results: [TPC-DS scale = 50, 16 conc copies]
Table            zed    N=10   N=100   N=1000   heap   zed(serial)
catalog_sales    15G    9.1G   7.7G    7.5G     15G    8.0G
catalog_returns  1.5G   0.9G   0.7G    0.7G     1.2G   0.8G
store_returns    2.1G   1.2G   1.1G    1.1G     1.9G   1.2G
store_sales      17G    11G    10.1G   10.1G    21G    10G
Load time:
N=10 30min
N=100 10min
N=1000 7min
zed 100min
heap 8min
'zed' refers to the zedstore branch without our fix. We see that with N = 10,
we get closer to what we get with serial inserts. For N = 100, we even beat
serial insert.
We can attribute the differences in runtime to the fact that by lowering the
number of tid range requests, we reduce the contention on the tid tree - which
is a bottleneck for concurrent loads. A significant win!
How N relates to the other parameters in play:
Let S be the number of concurrent sessions.
Let T be the average number of rows that a session wants to write, in t-sized
multi-insert batches.
Let A be the number of attributes.
Number of times a session multi-inserts into the tid tree without buffered
allocation = T/t
Number of times a session multi-inserts into the tid tree with buffered
allocation = T/(Nt)
Total number of multi-inserts into the tid tree = Mt = ST/(Nt)
Also, the total number of adverse insert cases where we could have bloat ∝ Mt.
So, bloat ∝ Mt
Run-time of a parallel data load ∝ Mt * A
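Plugging some made-up example numbers into the formulas above
(back-of-the-envelope only; the S, T and t values below are
illustrative, not measurements), a small C snippet shows how quickly
Mt drops as N grows:
/* Back-of-the-envelope evaluation of Mt = S*T/(N*t) for a few values of N. */
#include <stdio.h>
int
main(void)
{
    double S = 16;                      /* concurrent sessions (example value) */
    double T = 1000000;                 /* rows per session (example value) */
    double t = 1000;                    /* rows per multi-insert batch (example value) */
    double Ns[] = {1, 10, 100, 1000};
    for (int i = 0; i < 4; i++)
    {
        double N = Ns[i];
        double per_session = T / (N * t);   /* tid-tree multi-inserts per session */
        double Mt = S * per_session;        /* total tid-tree multi-inserts */
        printf("N=%4.0f: per-session allocations = %6.0f, Mt = %7.0f\n",
               N, per_session, Mt);
    }
    return 0;
}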
So the guidance would be to increase N with the increase in S or in T (t will
be relatively constant for a certain table - it is constrained by the size of a
row and the copy buffer) and also if the table is significantly wide.
We can see that it is difficult to provide a default for N; it really should be
a GUC. Also, SINGLE_INSERT_TID_RESERVATION_THRESHOLD and
SINGLE_INSERT_TID_RESERVATION_SIZE should be turned into GUCs. In our
implementation, we treat MULTI_INSERT_TID_RESERVATION_FACTOR = N. We leave the
GUC implementation for later.
Cost of killing the extra unused tids not consumed by multi-inserts:
The maximum number of tids that can be wasted (W) is capped at (tN - 1) * S.
This is the worst case, where the last tid allocation request used only 1 tid
out of the tN tids it received and buffered, for every session. So the average
case is ~ (tN/2) * S wasted tids. The number of times the tid tree has to be
accessed to delete these (tN/2) * S tids is S. So, taking tid wastage into
account, the average number of accesses to the tid tree = Mt + S = ST/(Nt) + S.
The additional cost is thus only S, so the cost of tid killing is not really
significant.
Regards,
Soumyadeep & David
Hello,
On Wed, Oct 30, 2019 at 3:34 PM Taylor Vesely <tvesely@pivotal.io> wrote:
Because zedstore resembled random I/O, the read speed was
significantly hindered on our single disk. As a result, we saw ~150x
slower read speeds.
Deep and I have committed a fix for this. The root cause of this problem is
that attribute tree (and tid tree) pages are not contiguous enough on disk,
especially if we are loading data concurrently into the same table. The effect
of the non-contiguity is especially felt on rotational disks and is magnified
by increasing the number of sessions that load the table concurrently.
Since a base requirement for a column store is that blocks for a single column
be physically adjacent (or nearly adjacent), we sought to optimize for this
requirement.
What we have done is to introduce attribute-level free page maps (FPMs) and to
batch up relation extension requests. We have introduced a new reloption
zedstore_rel_extension_factor: whenever we want to extend the relation by a
single page for the tid/attribute tree, we extend it by
zedstore_rel_extension_factor number of blocks. We return the one block
requested and prepend the extra blocks to the attribute-level FPM. This makes
the blocks available to other concurrent backends and thus, in spite of
backend-interleaved flushes into the same attribute tree, we see more
contiguity of leaf blocks.
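A standalone toy model of that scheme (not the patch itself; the
constants and names below are invented, with EXTENSION_FACTOR playing
the role of zedstore_rel_extension_factor): each attribute keeps its
own free list, and when it runs dry the relation is extended by a
whole batch of blocks, one of which is returned while the rest are
pushed onto that attribute's list. Even when backends interleave
requests across attributes, each attribute then receives mostly
consecutive block numbers:
/* Toy model of per-attribute free lists fed by batched relation extension. */
#include <stdio.h>
#define NATTRS 3
#define EXTENSION_FACTOR 8
static long next_block = 1;                     /* "end of relation" */
static long freelist[NATTRS][EXTENSION_FACTOR]; /* per-attribute free lists */
static int  nfree[NATTRS];
static long
get_block_for_attribute(int attno)
{
    if (nfree[attno] == 0)
    {
        /* Extend the relation by EXTENSION_FACTOR blocks at once; keep one,
         * put the rest on this attribute's free list. */
        long first = next_block;
        next_block += EXTENSION_FACTOR;
        for (int i = EXTENSION_FACTOR - 1; i > 0; i--)
            freelist[attno][nfree[attno]++] = first + i;
        return first;
    }
    return freelist[attno][--nfree[attno]];
}
int
main(void)
{
    /* Interleave block requests across attributes, as concurrent flushes
     * into the same table would. */
    for (int round = 0; round < 4; round++)
        for (int attno = 0; attno < NATTRS; attno++)
            printf("attr %d gets block %ld\n", attno, get_block_for_attribute(attno));
    return 0;
}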
We reason about contiguity of blocks by making some major assumptions:
We consider that two blocks are near each other if they have block numbers that
are close to each other (refer: BufferGetBlockNumber()). Also, we assume that
two successive relation extension requests yield blocks with block numbers that
are close to each other.
Recycling of pages for the attribute and tid trees is now also done at the
attribute-tree level.
Experiment results and methodology:
Metric used to measure performance -> I/O read time reported by the “I/O
Timings” field in explain (analyze, buffers, timing, verbose) output with the
track_io_timing GUC on. Before every explain run, we restart the database to
flush the buffers and clear the OS page cache.
Experiment parameters: TPC-DS Scale = 270, table = store_sales, opt_level = -O2
#parallel COPY sessions loading store_sales = 16.
N = zedstore_rel_extension_factor
GUCs used:
shared_buffers: 10GB
max_wal_size: 1GB
checkpoint_flush_after: 1MB
max_parallel_workers: 8
max_parallel_maintenance_workers: 8
maintenance_work_mem: 4GB
log_statement: all
effective_cache_size: 32GB
track_io_timing: on
For rotational disks:
Query: select ss_sold_date_sk from store_sales;
Heap:          Table size = 112G. I/O time = 115s.  Total exec time = 212s
Zed (w/o fix): Table size = 59G.  I/O time = 634s.  Total exec time = 730s
Zed (N=32):    Table size = 59G.  I/O time = 91s.   Total exec time = 175s
Zed (N=512):   Table size = 59G.  I/O time = 7s.    Total exec time = 87s
Zed (N=4096):  Table size = 59G.  I/O time = 2.5s.  Total exec time = 82s
Query: select * from store_sales;
Heap:          Table size = 112G. I/O time = 130s.  Total exec time = 214s
Zed (w/o fix): Table size = 59G.  I/O time = 2401s. Total exec time = 2813s
Zed (N=32):    Table size = 59G.  I/O time = 929s.  Total exec time = 1300s
Zed (N=512):   Table size = 59G.  I/O time = 485s.  Total exec time = 847s
Zed (N=4096):  Table size = 59G.  I/O time = 354s.  Total exec time = 716s
We also saw discernible differences in I/O time for scale = 50, table size
= 10G
for Zedstore and 21G for heap. Results not reported for brevity.
Our fix doesn't impact COPY performance, so we saw no difference in the time
taken to load the data into store_sales.
For NVMe SSDs:
We see no discernible differences in I/O times with and without the fix
(performance for select * was slightly worse for N=4096). Here
are some of the results:
Query: select ss_sold_date_sk from store_sales;
Heap: Table size = 112G. I/O time = 59s. Total exec time = 123s
Zed (w/o fix): Table size = 59G. I/O time = 20s. Total exec time = 79s
Zed (N=4096): Table size = 59G. I/O time = 21s. Total exec time = 87s
Query: select * from store_sales;
Heap: Table size = 112G. I/O time = 64s. Total exec time = 127s
Zed (w/o fix): Table size = 61G. I/O time = 449s. Total exec time = 757s
Zed (N=4096): Table size = 61G. I/O time = 487s. Total exec time = 812s
Analysis of fix:
The following query inspects the (block distance) absolute difference between
two logically adjacent leaf blocks for the ss_sold_date_sk attribute of
store_sales. It shows us the distribution of the block distances in the
ss_sold_date_sk attribute tree. Output is limited for brevity.
with blk_dist(dist) as (select abs(nextblk - blkno) as dist from
pg_zs_btree_pages('store_sales'::regclass) where attno=1 and level=0 and
nextblk != 4294967295)
select dist, count(dist) as cnt from blk_dist group by
dist order by cnt desc limit 5;
W/o fix: #parallel_copies=16           W/ fix: #parallel_copies=16, extension_factor=16
 dist | cnt                             dist | cnt
------+-----                           ------+------
   25 |  89                                1 | 3228
   26 |  83                                2 | 3192
   23 |  78                                3 | 2664
    1 |  75                                4 | 2218
   29 |  74                                5 | 1866
We can see that by increasing zedstore_rel_extension_factor, we end up with
many more small block distances.
Implications of fix:
1. We have to keep track of the FPM heads for the attribute/tid trees in the
meta-page, and since we don't have an extensible meta-page yet, we further
limit the number of columns Zedstore can support. We will get around to it
eventually.
2. Worst case extra space wasted on disk from extra free pages that could
linger after a bulk load = zedstore_rel_extension_factor * #attributes * 8192
bytes.
For zedstore_rel_extension_factor = 16, #attributes = 23 (i.e. 24 trees,
counting the tid tree):
wastage = 16*24*8192/1024/1024 = 3M
For zedstore_rel_extension_factor = 4096, #attributes = 23:
wastage = 4096*24*8192/1024/1024 = 768M
Note: The free pages left behind can, of course, be used by subsequent
operations on the table.
In conclusion, increasing zedstore_rel_extension_factor for a wide table may
lead to bloating of the relfile. The percentage of bloat would also be
magnified if the table doesn't have a lot of data.
3. Amount of extra WAL being written (since we are placing/removing the extra
blocks on the FPMs, something we never did without this fix) is independent of
zedstore_rel_extension_factor, and we found that we had written approximately
14M extra WAL for every 1G relfile.
Guidance on setting zedstore_rel_extension_factor:
Users should set a high zedstore_rel_extension_factor when they are loading
data on rotational disks, with or without a high degree of concurrency, and
when they have a significant data size.
Attached is a patch with our changes [1]. Also attached is a rebased version
of Zedstore on latest PG master [2]. Github branch for Zedstore: [3]
[1]: 0001-Attribute-level-FPMs-and-rel-extension-batching.patch
[2]: v4-zedstore.patch
[3]: https://github.com/greenplum-db/postgres/tree/zedstore
--
Alex & Deep
Attachment: 0001-Attribute-level-FPMs-and-rel-extension-batching.patch
From bf35797308493f4be5e54869701f60ca6cd01cce Mon Sep 17 00:00:00 2001
From: Alexandra Wang <lewang@pivotal.io>
Date: Wed, 4 Mar 2020 11:58:47 -0800
Subject: [PATCH 1/1] Attribute-level FPMs and rel extension batching
To reduce the non-contuguity of blocks for attribute and tid tree
(especially leaves), which in turn improves I/O readahead and I/O
performance, we introduced the following:
1. Attribute level FPM
2. A new reloption: zedstore_rel_extension_factor, which designates the
number of blocks that will be fetched from the storage manager when we
request a single block. Out of the blocks fetched from the storage
manager, 1 block will be returned (to meet the request for a single
block) and the rest will be prepended to the attribute-level FPM.
From now on, the table level FPM, in ZSMetaPageOpaque, will only serve
UNDO pages.
Even TOAST pages, given their access pattern of being accessed
immediately following an attribute leaf, will be served from the
attribute-level FPM.
By increasing zedstore_rel_extension_factor (such as 16, 32, 512, 4096
etc.) in our experiments, we found that we get a very significant
reduction in I/O time for reads (as reported by track_io_timing and
explain (analyze, verbose, timing, buffers)) on rotational disks and
especially where the data was loaded into the same table by many
concurrent sessions (such as 16).
Notes:
1. We also use the term attribute-level FPM when we refer to the tid
tree (attno = 0).
2. Since we store the FPM heads for each attribute in the meta-page, we
hit the limit on the number of columns we can support. This causes the extra
wide table test to fail in create_table.sql and sanity_check.out. Added
FIXMEs for them to be done when we tackle meta-page overflow.
Discussion (including benchmarks):
Co-authored-by: Soumyadeep Chakraborty <sochakraborty@pivotal.io>
---
src/backend/access/common/reloptions.c | 13 +++-
src/backend/access/zedstore/README | 18 +++--
.../access/zedstore/zedstore_attpage.c | 4 +-
src/backend/access/zedstore/zedstore_btree.c | 8 +-
.../access/zedstore/zedstore_freepagemap.c | 74 ++++++++++++++++---
src/backend/access/zedstore/zedstore_meta.c | 10 ++-
.../access/zedstore/zedstore_tidpage.c | 10 +--
src/backend/access/zedstore/zedstore_toast.c | 4 +-
.../access/zedstore/zedstore_undolog.c | 4 +-
src/include/access/zedstore_internal.h | 11 +--
src/include/utils/rel.h | 8 ++
src/test/regress/expected/brin_1.out | 2 +-
src/test/regress/expected/create_table.out | 16 ++--
src/test/regress/expected/sanity_check.out | 1 -
src/test/regress/sql/create_table.sql | 15 ++--
15 files changed, 142 insertions(+), 56 deletions(-)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 48377ace24..9a1d85ef6f 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -350,6 +350,15 @@ static relopt_int intRelOpts[] =
},
-1, 0, 1024
},
+ {
+ {
+ "zedstore_rel_extension_factor",
+ "Extend zedstore relations by zedstore_rel_extension_factor #blocks.",
+ RELOPT_KIND_HEAP,
+ ShareUpdateExclusiveLock
+ },
+ ZEDSTORE_DEFAULT_REL_EXTENSION_FACTOR, 1, INT_MAX
+ },
/* list terminator */
{{NULL}}
@@ -1524,7 +1533,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
offsetof(StdRdOptions, vacuum_index_cleanup)},
{"vacuum_truncate", RELOPT_TYPE_BOOL,
- offsetof(StdRdOptions, vacuum_truncate)}
+ offsetof(StdRdOptions, vacuum_truncate)},
+ {"zedstore_rel_extension_factor", RELOPT_TYPE_INT,
+ offsetof(StdRdOptions, zedstore_rel_extension_factor)}
};
return (bytea *) build_reloptions(reloptions, validate, kind,
diff --git a/src/backend/access/zedstore/README b/src/backend/access/zedstore/README
index ad0f753770..e2c6ae461f 100644
--- a/src/backend/access/zedstore/README
+++ b/src/backend/access/zedstore/README
@@ -269,17 +269,25 @@ TOAST Page:
Free Pages Map
--------------
-There is a simple Free Pages Map, which is just a linked list of unused
-blocks. The block number of the first unused page in the list is stored
+The Free Page Map structure used in Zedstore is simply a linked list of unused
+blocks. There are multiple free page maps, with one free page map for the table,
+to cater to allocation requests for UNDO pages. There is a free page map for the
+tid tree and a separate free page map for each attribute (we use the term
+attribute-level free page map for the free page map for the tid tree as well,
+considering it as attribute 0).
+The block number of the first unused page for each of these lists is stored
in the metapage. Each unused block contains link to the next unused
block in the chain. When a block comes unused, it is added to the
head of the list.
+By batching page allocations and by having attribute-level free page maps, we
+ensure that each attribute B-tree gets more contiguous ranges of blocks, even under
+concurrent inserts to the same table to allow I/O readahead to be effective.
+The batching factor we use is the reloption: zedstore_rel_extension_factor.
+
TODO: That doesn't scale very well, and the pages are reused in LIFO
order. We'll probably want to do something smarter to avoid making the
-metapage a bottleneck for this, as well as try to batch the page
-allocations so that each attribute B-tree would get contiguous ranges
-of blocks, to allow I/O readahead to be effective.
+metapage a bottleneck for this.
Enhancement ideas / alternative designs
diff --git a/src/backend/access/zedstore/zedstore_attpage.c b/src/backend/access/zedstore/zedstore_attpage.c
index 1b42235f82..52c2a5a1b4 100644
--- a/src/backend/access/zedstore/zedstore_attpage.c
+++ b/src/backend/access/zedstore/zedstore_attpage.c
@@ -909,7 +909,7 @@ zsbt_attr_repack_writeback_pages(zsbt_attr_repack_context *cxt,
Assert(stack->next->buf == InvalidBuffer);
- nextbuf = zspage_getnewbuf(rel);
+ nextbuf = zspage_getnewbuf(rel, attno);
stack->next->buf = nextbuf;
thisopaque->zs_next = BufferGetBlockNumber(nextbuf);
@@ -955,7 +955,7 @@ zsbt_attr_repack_writeback_pages(zsbt_attr_repack_context *cxt,
}
/* Finally, overwrite all the pages we had to modify */
- zs_apply_split_changes(rel, cxt->stack_head, NULL);
+ zs_apply_split_changes(rel, cxt->stack_head, NULL, attno);
}
static void
diff --git a/src/backend/access/zedstore/zedstore_btree.c b/src/backend/access/zedstore/zedstore_btree.c
index 0b649c52c0..1e0ab08380 100644
--- a/src/backend/access/zedstore/zedstore_btree.c
+++ b/src/backend/access/zedstore/zedstore_btree.c
@@ -315,7 +315,7 @@ zsbt_newroot(Relation rel, AttrNumber attno, int level, List *downlinks)
ListCell *lc;
int i;
- newrootbuf = zspage_getnewbuf(rel);
+ newrootbuf = zspage_getnewbuf(rel, attno);
metabuf = ReadBuffer(rel, ZS_META_BLK);
LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
@@ -531,7 +531,7 @@ zsbt_split_internal_page(Relation rel, AttrNumber attno, Buffer origbuf,
BlockNumber blkno;
ZSBtreeInternalPageItem *downlink;
- buf = zspage_getnewbuf(rel);
+ buf = zspage_getnewbuf(rel, attno);
blkno = BufferGetBlockNumber(buf);
page = palloc(BLCKSZ);
PageInit(page, BLCKSZ, sizeof(ZSBtreePageOpaque));
@@ -795,7 +795,7 @@ zs_new_split_stack_entry(Buffer buf, Page page)
* entries.
*/
void
-zs_apply_split_changes(Relation rel, zs_split_stack *stack, zs_pending_undo_op *undo_op)
+zs_apply_split_changes(Relation rel, zs_split_stack *stack, zs_pending_undo_op *undo_op, AttrNumber attrNumber)
{
zs_split_stack *head = stack;
bool wal_needed = RelationNeedsWAL(rel);
@@ -908,7 +908,7 @@ zs_apply_split_changes(Relation rel, zs_split_stack *stack, zs_pending_undo_op *
/* add this page to the Free Page Map for recycling */
if (stack->recycle)
- zspage_delete_page(rel, stack->buf, InvalidBuffer);
+ zspage_delete_page(rel, stack->buf, InvalidBuffer, attrNumber);
UnlockReleaseBuffer(stack->buf);
diff --git a/src/backend/access/zedstore/zedstore_freepagemap.c b/src/backend/access/zedstore/zedstore_freepagemap.c
index 85546c80ba..47db7fa84e 100644
--- a/src/backend/access/zedstore/zedstore_freepagemap.c
+++ b/src/backend/access/zedstore/zedstore_freepagemap.c
@@ -123,22 +123,27 @@ zspage_is_unused(Buffer buf)
* That's unfortunate, but hopefully won't happen too often.
*/
Buffer
-zspage_getnewbuf(Relation rel)
+zspage_getnewbuf(Relation rel, AttrNumber attrNumber)
{
Buffer buf;
BlockNumber blk;
Buffer metabuf;
Page metapage;
+ ZSMetaPage *metapg;
ZSMetaPageOpaque *metaopaque;
metabuf = ReadBuffer(rel, ZS_META_BLK);
LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
- metapage = BufferGetPage(metabuf);
+ metapage = BufferGetPage(metabuf);
metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ metapg = (ZSMetaPage *) PageGetContents(metapage);
+
+ if (attrNumber == ZS_INVALID_ATTRIBUTE_NUM)
+ blk = metaopaque->zs_fpm_head;
+ else
+ blk = metapg->tree_root_dir[attrNumber].fpm_head;
- /* Get a block from the FPM. */
- blk = metaopaque->zs_fpm_head;
if (blk == ZS_META_BLK)
{
/* metapage, not expected */
@@ -163,7 +168,10 @@ zspage_getnewbuf(Relation rel)
opaque = (ZSFreePageOpaque *) PageGetSpecialPointer(page);
next_free_blkno = opaque->zs_next;
- metaopaque->zs_fpm_head = next_free_blkno;
+ if (attrNumber == ZS_INVALID_ATTRIBUTE_NUM)
+ metaopaque->zs_fpm_head = next_free_blkno;
+ else
+ metapg->tree_root_dir[attrNumber].fpm_head = next_free_blkno;
if (RelationNeedsWAL(rel))
{
@@ -191,10 +199,38 @@ zspage_getnewbuf(Relation rel)
}
else
{
- /* No free pages. Have to extend the relation. */
- UnlockReleaseBuffer(metabuf);
+ /*
+ * No free pages in the FPM. Have to extend the relation.
+ * 1. We extend the relation by zedstore_rel_extension_factor #blocks.
+ * 2. Out of the zedstore_rel_extension_factor #blocks returned by the
+ * storage manager, we return the first block. The other blocks
+ * returned are prepended to the attribute level FPM.
+ */
+ StdRdOptions *rd_options = (StdRdOptions *)rel->rd_options;
+ int extension_factor = rd_options ? rd_options->zedstore_rel_extension_factor : ZEDSTORE_DEFAULT_REL_EXTENSION_FACTOR;
+
buf = zspage_extendrel_newbuf(rel);
blk = BufferGetBlockNumber(buf);
+
+ Buffer *extrabufs = palloc((extension_factor - 1) * sizeof(Buffer));
+ for (int i = 0; i < extension_factor - 1; i++) {
+ extrabufs[i] = zspage_extendrel_newbuf(rel);
+ /*
+ * We unlock the extrabuf here to prevent hitting MAX_SIMUL_LWLOCKS.
+ * It is safe to unlock the extrabuf here as it cannot be referenced
+ * by other backends until it is put on the attribute-level FPM.
+ * We grab the lock again in the following loop before placing the
+ * page on the FPM.
+ */
+ LockBuffer(extrabufs[i], BUFFER_LOCK_UNLOCK);
+ }
+
+ for (int i = extension_factor - 2; i >=0; i--) {
+ LockBuffer(extrabufs[i], BUFFER_LOCK_EXCLUSIVE);
+ zspage_delete_page(rel, extrabufs[i], metabuf, attrNumber);
+ UnlockReleaseBuffer(extrabufs[i]);
+ }
+ UnlockReleaseBuffer(metabuf);
}
return buf;
@@ -280,11 +316,12 @@ zspage_extendrel_newbuf(Relation rel)
* you can use zspage_mark_page_deleted() to avoid it.
*/
void
-zspage_delete_page(Relation rel, Buffer buf, Buffer metabuf)
+zspage_delete_page(Relation rel, Buffer buf, Buffer metabuf, AttrNumber attrNumber)
{
bool release_metabuf;
BlockNumber blk = BufferGetBlockNumber(buf);
Page metapage;
+ ZSMetaPage *metapg;
ZSMetaPageOpaque *metaopaque;
Page page;
BlockNumber next_free_blkno;
@@ -299,12 +336,27 @@ zspage_delete_page(Relation rel, Buffer buf, Buffer metabuf)
release_metabuf = false;
metapage = BufferGetPage(metabuf);
+ metapg = (ZSMetaPage *) PageGetContents(metapage);
metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
page = BufferGetPage(buf);
- next_free_blkno = metaopaque->zs_fpm_head;
- zspage_mark_page_deleted(page, next_free_blkno);
- metaopaque->zs_fpm_head = blk;
+
+ if (attrNumber != ZS_INVALID_ATTRIBUTE_NUM)
+ {
+ /*
+ * Add the page to the attribute specific free page map.
+ */
+ next_free_blkno = metapg->tree_root_dir[attrNumber].fpm_head;
+ zspage_mark_page_deleted(page, next_free_blkno);
+ metapg->tree_root_dir[attrNumber].fpm_head = blk;
+ }
+ else
+ {
+ next_free_blkno = metaopaque->zs_fpm_head;
+ zspage_mark_page_deleted(page, next_free_blkno);
+ metaopaque->zs_fpm_head = blk;
+ }
+
MarkBufferDirty(metabuf);
MarkBufferDirty(buf);
diff --git a/src/backend/access/zedstore/zedstore_meta.c b/src/backend/access/zedstore/zedstore_meta.c
index f3e859091e..50496dc5cb 100644
--- a/src/backend/access/zedstore/zedstore_meta.c
+++ b/src/backend/access/zedstore/zedstore_meta.c
@@ -130,7 +130,10 @@ zsmeta_expand_metapage_for_new_attributes(Relation rel)
/* Initialize the new attribute roots to InvalidBlockNumber */
for (int i = metapg->nattributes; i < natts; i++)
+ {
metapg->tree_root_dir[i].root = InvalidBlockNumber;
+ metapg->tree_root_dir[i].fpm_head = InvalidBlockNumber;
+ }
metapg->nattributes = natts;
((PageHeader) page)->pd_lower = new_pd_lower;
@@ -194,7 +197,10 @@ zsmeta_initmetapage_internal(int natts)
metapg->nattributes = natts;
for (int i = 0; i < natts; i++)
+ {
metapg->tree_root_dir[i].root = InvalidBlockNumber;
+ metapg->tree_root_dir[i].fpm_head = InvalidBlockNumber;
+ }
((PageHeader) page)->pd_lower = new_pd_lower;
return page;
@@ -449,7 +455,7 @@ zsmeta_get_root_for_attribute(Relation rel, AttrNumber attno, bool readonly)
LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
/* TODO: release lock on metapage while we do I/O */
- rootbuf = zspage_getnewbuf(rel);
+ rootbuf = zspage_getnewbuf(rel, attno);
LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
metapg = (ZSMetaPage *) PageGetContents(page);
@@ -461,7 +467,7 @@ zsmeta_get_root_for_attribute(Relation rel, AttrNumber attno, bool readonly)
* finding a free page. We won't need the page we allocated,
* after all.
*/
- zspage_delete_page(rel, rootbuf, metabuf);
+ zspage_delete_page(rel, rootbuf, metabuf, attno);
}
else
{
diff --git a/src/backend/access/zedstore/zedstore_tidpage.c b/src/backend/access/zedstore/zedstore_tidpage.c
index 02ebd0e8f5..83b89db857 100644
--- a/src/backend/access/zedstore/zedstore_tidpage.c
+++ b/src/backend/access/zedstore/zedstore_tidpage.c
@@ -1194,7 +1194,7 @@ zsbt_tid_remove(Relation rel, IntegerSet *tids)
}
/* apply the changes */
- zs_apply_split_changes(rel, stack, NULL);
+ zs_apply_split_changes(rel, stack, NULL, 0);
}
ReleaseBuffer(buf);
@@ -1458,7 +1458,7 @@ zsbt_tid_add_items(Relation rel, Buffer buf, List *newitems, zs_pending_undo_op
}
/* apply the changes */
- zs_apply_split_changes(rel, stack, undo_op);
+ zs_apply_split_changes(rel, stack, undo_op, 0);
}
list_free(items);
@@ -1633,7 +1633,7 @@ zsbt_tid_replace_item(Relation rel, Buffer buf, OffsetNumber targetoff, List *ne
}
/* apply the changes */
- zs_apply_split_changes(rel, stack, undo_op);
+ zs_apply_split_changes(rel, stack, undo_op, 0);
}
list_free(items);
@@ -1840,7 +1840,7 @@ zsbt_tid_recompress_replace(Relation rel, Buffer oldbuf, List *items, zs_pending
Assert(stack->next->buf == InvalidBuffer);
- nextbuf = zspage_getnewbuf(rel);
+ nextbuf = zspage_getnewbuf(rel, 0);
stack->next->buf = nextbuf;
thisopaque->zs_next = BufferGetBlockNumber(nextbuf);
@@ -1894,7 +1894,7 @@ zsbt_tid_recompress_replace(Relation rel, Buffer oldbuf, List *items, zs_pending
}
/* Finally, overwrite all the pages we had to modify */
- zs_apply_split_changes(rel, cxt.stack_head, undo_op);
+ zs_apply_split_changes(rel, cxt.stack_head, undo_op, 0);
}
static OffsetNumber
diff --git a/src/backend/access/zedstore/zedstore_toast.c b/src/backend/access/zedstore/zedstore_toast.c
index a41ef65bc6..70eb1277b9 100644
--- a/src/backend/access/zedstore/zedstore_toast.c
+++ b/src/backend/access/zedstore/zedstore_toast.c
@@ -93,7 +93,7 @@ zedstore_toast_datum(Relation rel, AttrNumber attno, Datum value, zstid tid)
{
Size thisbytes;
- buf = zspage_getnewbuf(rel);
+ buf = zspage_getnewbuf(rel, ZS_INVALID_ATTRIBUTE_NUM);
if (prevbuf == InvalidBuffer)
firstblk = BufferGetBlockNumber(buf);
@@ -245,7 +245,7 @@ zedstore_toast_delete(Relation rel, Form_pg_attribute attr, zstid tid, BlockNumb
Assert(opaque->zs_attno == attr->attnum);
nextblk = opaque->zs_next;
- zspage_delete_page(rel, buf, InvalidBuffer);
+ zspage_delete_page(rel, buf, InvalidBuffer, ZS_INVALID_ATTRIBUTE_NUM);
UnlockReleaseBuffer(buf);
}
}
diff --git a/src/backend/access/zedstore/zedstore_undolog.c b/src/backend/access/zedstore/zedstore_undolog.c
index 07a6d2bee8..8749e4a827 100644
--- a/src/backend/access/zedstore/zedstore_undolog.c
+++ b/src/backend/access/zedstore/zedstore_undolog.c
@@ -119,7 +119,7 @@ retry_lock_tail:
LockBuffer(tail_buf, BUFFER_LOCK_UNLOCK);
/* new page */
- newbuf = zspage_getnewbuf(rel);
+ newbuf = zspage_getnewbuf(rel, ZS_INVALID_ATTRIBUTE_NUM);
LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
if (metaopaque->zs_undo_tail != tail_blk)
@@ -129,7 +129,7 @@ retry_lock_tail:
* the new page, after all. (Or maybe we do, if the new
* tail block is already full, but we're not smart about it.)
*/
- zspage_delete_page(rel, newbuf, metabuf);
+ zspage_delete_page(rel, newbuf, metabuf, ZS_INVALID_ATTRIBUTE_NUM);
UnlockReleaseBuffer(newbuf);
goto retry_lock_tail;
}
diff --git a/src/include/access/zedstore_internal.h b/src/include/access/zedstore_internal.h
index f3ecdd5fc3..aab1a8afa0 100644
--- a/src/include/access/zedstore_internal.h
+++ b/src/include/access/zedstore_internal.h
@@ -22,6 +22,7 @@
struct zs_pending_undo_op;
#define ZS_META_ATTRIBUTE_NUM 0
+#define ZS_INVALID_ATTRIBUTE_NUM (-1)
#define INVALID_SPECULATIVE_TOKEN 0
@@ -529,6 +530,7 @@ zs_datumCopy(Datum value, bool typByVal, int typLen)
typedef struct ZSRootDirItem
{
BlockNumber root;
+ BlockNumber fpm_head;
} ZSRootDirItem;
typedef struct ZSMetaPage
@@ -566,8 +568,7 @@ typedef struct ZSMetaPageOpaque
*/
ZSUndoRecPtr zs_undo_oldestptr;
- BlockNumber zs_fpm_head; /* head of the Free Page Map list */
-
+ BlockNumber zs_fpm_head; /* head of the Free Page Map list for UNDO pages */
uint16 zs_flags;
uint16 zs_page_id;
} ZSMetaPageOpaque;
@@ -960,7 +961,7 @@ extern zs_split_stack *zsbt_insert_downlinks(Relation rel, AttrNumber attno,
extern void zsbt_attr_remove(Relation rel, AttrNumber attno, IntegerSet *tids);
extern zs_split_stack *zsbt_unlink_page(Relation rel, AttrNumber attno, Buffer buf, int level);
extern zs_split_stack *zs_new_split_stack_entry(Buffer buf, Page page);
-extern void zs_apply_split_changes(Relation rel, zs_split_stack *stack, struct zs_pending_undo_op *undo_op);
+extern void zs_apply_split_changes(Relation rel, zs_split_stack *stack, struct zs_pending_undo_op *undo_op, AttrNumber attrNumber);
extern Buffer zsbt_descend(Relation rel, AttrNumber attno, zstid key, int level, bool readonly);
extern Buffer zsbt_find_and_lock_leaf_containing_tid(Relation rel, AttrNumber attno,
Buffer buf, zstid nexttid, int lockmode);
@@ -1050,9 +1051,9 @@ extern Datum zedstore_toast_flatten(Relation rel, AttrNumber attno, zstid tid, D
extern void zedstore_toast_delete(Relation rel, Form_pg_attribute attr, zstid tid, BlockNumber blkno);
/* prototypes for functions in zedstore_freepagemap.c */
-extern Buffer zspage_getnewbuf(Relation rel);
+extern Buffer zspage_getnewbuf(Relation rel, AttrNumber attrNumber);
extern void zspage_mark_page_deleted(Page page, BlockNumber next_free_blk);
-extern void zspage_delete_page(Relation rel, Buffer buf, Buffer metabuf);
+extern void zspage_delete_page(Relation rel, Buffer buf, Buffer metabuf, AttrNumber attrNumber);
typedef struct ZedstoreTupleTableSlot
{
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 31d8a1a10e..ca57165835 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -271,8 +271,16 @@ typedef struct StdRdOptions
int parallel_workers; /* max number of parallel workers */
bool vacuum_index_cleanup; /* enables index vacuuming and cleanup */
bool vacuum_truncate; /* enables vacuum to truncate a relation */
+ /*
+ * request zedstore_rel_extension_factor #blocks from storage manager
+ * whenever we need to extend the relation by one block for attribute/tid
+ * tree pages.
+ */
+ int zedstore_rel_extension_factor;
} StdRdOptions;
+#define ZEDSTORE_DEFAULT_REL_EXTENSION_FACTOR 1
+
#define HEAP_MIN_FILLFACTOR 10
#define HEAP_DEFAULT_FILLFACTOR 100
diff --git a/src/test/regress/expected/brin_1.out b/src/test/regress/expected/brin_1.out
index b40ab37e47..29afe3604c 100644
--- a/src/test/regress/expected/brin_1.out
+++ b/src/test/regress/expected/brin_1.out
@@ -411,7 +411,7 @@ ERROR: "tenk1_unique1" is not a BRIN index
SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
brin_summarize_new_values
---------------------------
- 8
+ 9
(1 row)
-- Tests for brin_desummarize_range
diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out
index f63016871c..35d436a9a8 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -275,16 +275,16 @@ CREATE TABLE IF NOT EXISTS as_select1 AS EXECUTE select1;
NOTICE: relation "as_select1" already exists, skipping
DROP TABLE as_select1;
DEALLOCATE select1;
+-- FIXME: enable this test when we introduce meta-page overflow for zedstore
-- create an extra wide table to test for issues related to that
-- (temporarily hide query, to avoid the long CREATE TABLE stmt)
-\set ECHO none
-INSERT INTO extra_wide_table(firstc, lastc) VALUES('first col', 'last col');
-SELECT firstc, lastc FROM extra_wide_table;
- firstc | lastc
------------+----------
- first col | last col
-(1 row)
-
+-- \set ECHO none
+-- SELECT 'CREATE TABLE extra_wide_table(firstc text, '|| array_to_string(array_agg('c'||i||' bool'),',')||', lastc text);'
+-- FROM generate_series(1, 1100) g(i)
+-- \gexec
+-- \set ECHO all
+-- INSERT INTO extra_wide_table(firstc, lastc) VALUES('first col', 'last col');
+-- SELECT firstc, lastc FROM extra_wide_table;
-- check that tables with oids cannot be created anymore
CREATE TABLE withoid() WITH OIDS;
ERROR: syntax error at or near "OIDS"
diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out
index 070de78e85..0c9785c179 100644
--- a/src/test/regress/expected/sanity_check.out
+++ b/src/test/regress/expected/sanity_check.out
@@ -43,7 +43,6 @@ dupindexcols|t
e_star|f
emp|f
equipment_r|f
-extra_wide_table|f
f_star|f
fast_emp4000|t
float4_tbl|f
diff --git a/src/test/regress/sql/create_table.sql b/src/test/regress/sql/create_table.sql
index e835b65ac4..d5d9e221e0 100644
--- a/src/test/regress/sql/create_table.sql
+++ b/src/test/regress/sql/create_table.sql
@@ -285,15 +285,16 @@ CREATE TABLE IF NOT EXISTS as_select1 AS EXECUTE select1;
DROP TABLE as_select1;
DEALLOCATE select1;
+-- FIXME: enable this test when we introduce meta-page overflow for zedstore
-- create an extra wide table to test for issues related to that
-- (temporarily hide query, to avoid the long CREATE TABLE stmt)
-\set ECHO none
-SELECT 'CREATE TABLE extra_wide_table(firstc text, '|| array_to_string(array_agg('c'||i||' bool'),',')||', lastc text);'
-FROM generate_series(1, 1100) g(i)
-\gexec
-\set ECHO all
-INSERT INTO extra_wide_table(firstc, lastc) VALUES('first col', 'last col');
-SELECT firstc, lastc FROM extra_wide_table;
+-- \set ECHO none
+-- SELECT 'CREATE TABLE extra_wide_table(firstc text, '|| array_to_string(array_agg('c'||i||' bool'),',')||', lastc text);'
+-- FROM generate_series(1, 1100) g(i)
+-- \gexec
+-- \set ECHO all
+-- INSERT INTO extra_wide_table(firstc, lastc) VALUES('first col', 'last col');
+-- SELECT firstc, lastc FROM extra_wide_table;
-- check that tables with oids cannot be created anymore
CREATE TABLE withoid() WITH OIDS;
--
2.24.1
v4-zedstore.patch
README.md | 27 +
configure | 118 +
configure.in | 19 +
contrib/file_fdw/file_fdw.c | 35 +-
contrib/postgres_fdw/postgres_fdw.c | 56 +-
src/backend/access/Makefile | 2 +-
src/backend/access/common/reloptions.c | 13 +-
src/backend/access/heap/heapam_handler.c | 110 +-
src/backend/access/nbtree/nbtsort.c | 3 +-
src/backend/access/rmgrdesc/Makefile | 3 +-
src/backend/access/rmgrdesc/zedstoredesc.c | 135 +
src/backend/access/table/tableam.c | 220 +-
src/backend/access/transam/rmgr.c | 1 +
src/backend/access/transam/xact.c | 21 +
src/backend/access/zedstore/Makefile | 23 +
src/backend/access/zedstore/README | 362 ++
src/backend/access/zedstore/zedstore_attpage.c | 1068 +++++
src/backend/access/zedstore/zedstore_attstream.c | 2523 ++++++++++
src/backend/access/zedstore/zedstore_btree.c | 1002 ++++
src/backend/access/zedstore/zedstore_compression.c | 92 +
src/backend/access/zedstore/zedstore_freepagemap.c | 438 ++
src/backend/access/zedstore/zedstore_inspect.c | 1191 +++++
src/backend/access/zedstore/zedstore_meta.c | 509 ++
src/backend/access/zedstore/zedstore_simple8b.c | 391 ++
src/backend/access/zedstore/zedstore_tid.c | 159 +
src/backend/access/zedstore/zedstore_tiditem.c | 932 ++++
src/backend/access/zedstore/zedstore_tidpage.c | 2064 ++++++++
src/backend/access/zedstore/zedstore_toast.c | 324 ++
src/backend/access/zedstore/zedstore_tuplebuffer.c | 583 +++
src/backend/access/zedstore/zedstore_tupslot.c | 268 ++
src/backend/access/zedstore/zedstore_undolog.c | 651 +++
src/backend/access/zedstore/zedstore_undorec.c | 893 ++++
src/backend/access/zedstore/zedstore_visibility.c | 930 ++++
src/backend/access/zedstore/zedstore_wal.c | 107 +
src/backend/access/zedstore/zedstoream_handler.c | 3197 +++++++++++++
src/backend/commands/analyze.c | 234 +-
src/backend/commands/copy.c | 21 +-
src/backend/commands/tablecmds.c | 11 +-
src/backend/commands/trigger.c | 8 +
src/backend/executor/execScan.c | 70 +
src/backend/executor/nodeIndexscan.c | 20 +-
src/backend/executor/nodeSeqscan.c | 41 +-
src/backend/optimizer/path/allpaths.c | 87 +-
src/backend/optimizer/plan/createplan.c | 3 +
src/backend/optimizer/util/plancat.c | 2 +
src/backend/partitioning/partbounds.c | 14 +-
src/backend/replication/logical/decode.c | 1 +
src/backend/storage/buffer/bufmgr.c | 22 +
src/backend/utils/adt/tid.c | 7 +-
src/bin/pg_waldump/rmgrdesc.c | 1 +
src/include/access/rmgrlist.h | 1 +
src/include/access/tableam.h | 182 +-
src/include/access/xlogrecord.h | 2 +-
src/include/access/zedstore_compression.h | 16 +
src/include/access/zedstore_internal.h | 1083 +++++
src/include/access/zedstore_simple8b.h | 21 +
src/include/access/zedstore_tid.h | 84 +
src/include/access/zedstore_undolog.h | 108 +
src/include/access/zedstore_undorec.h | 196 +
src/include/access/zedstore_wal.h | 220 +
src/include/access/zedstoream.h | 17 +
src/include/catalog/pg_am.dat | 3 +
src/include/catalog/pg_cast.dat | 11 +
src/include/catalog/pg_operator.dat | 26 +
src/include/catalog/pg_proc.dat | 99 +
src/include/catalog/pg_statistic.h | 3 +
src/include/catalog/pg_type.dat | 6 +
src/include/commands/vacuum.h | 6 +
src/include/executor/executor.h | 4 +
src/include/fmgr.h | 2 +-
src/include/foreign/fdwapi.h | 7 +-
src/include/nodes/pathnodes.h | 1 +
src/include/pg_config.h.in | 9 +
src/include/storage/bufmgr.h | 1 +
src/include/storage/itemptr.h | 6 +
src/include/utils/rel.h | 8 +
.../isolation/expected/multiple-row-versions_1.out | 25 +
.../isolation/expected/read-only-anomaly-2_1.out | 45 +
.../isolation/expected/read-only-anomaly-3_1.out | 27 +
.../isolation/expected/serializable-parallel_1.out | 45 +
src/test/isolation/expected/vacuum-reltuples_1.out | 59 +
src/test/isolation/specs/read-only-anomaly-2.spec | 6 +-
src/test/regress/GNUmakefile | 6 +
src/test/regress/atmsort.pl | 346 ++
src/test/regress/atmsort.pm | 1371 ++++++
src/test/regress/expected/.gitignore | 1 +
src/test/regress/expected/alter_table_1.out | 4285 +++++++++++++++++
src/test/regress/expected/box.out | 2 +
src/test/regress/expected/brin_1.out | 528 +++
src/test/regress/expected/cluster_1.out | 582 +++
src/test/regress/expected/combocid_1.out | 169 +
src/test/regress/expected/create_am.out | 11 +-
src/test/regress/expected/create_index.out | 7 +-
src/test/regress/expected/create_index_1.out | 2598 ++++++++++
src/test/regress/expected/create_table.out | 16 +-
src/test/regress/expected/fsm_1.out | 73 +
src/test/regress/expected/join_hash_1.out | 1015 ++++
src/test/regress/expected/limit_1.out | 505 ++
src/test/regress/expected/opr_sanity.out | 6 +
src/test/regress/expected/psql_1.out | 4977 ++++++++++++++++++++
src/test/regress/expected/psql_crosstab_1.out | 216 +
src/test/regress/expected/rangefuncs_1.out | 2100 +++++++++
src/test/regress/expected/reloptions_1.out | 219 +
src/test/regress/expected/rowsecurity_1.out | 4052 ++++++++++++++++
src/test/regress/expected/sanity_check.out | 1 -
src/test/regress/expected/select_parallel_1.out | 1150 +++++
src/test/regress/expected/strings_1.out | 1973 ++++++++
src/test/regress/expected/tablesample_1.out | 300 ++
src/test/regress/expected/transactions_1.out | 1022 ++++
src/test/regress/expected/triggers_1.out | 2850 +++++++++++
src/test/regress/expected/tsrf_1.out | 712 +++
src/test/regress/expected/updatable_views_1.out | 3018 ++++++++++++
src/test/regress/expected/update_1.out | 937 ++++
src/test/regress/expected/vacuum_1.out | 382 ++
src/test/regress/expected/zedstore.out | 666 +++
src/test/regress/expected/zstidscan.out | 213 +
src/test/regress/gpdiff.pl | 260 +
src/test/regress/output/misc_1.source | 692 +++
src/test/regress/parallel_schedule | 10 +-
src/test/regress/pg_regress.c | 88 +-
src/test/regress/serial_schedule | 4 +-
src/test/regress/sql/box.sql | 2 +
src/test/regress/sql/create_index.sql | 7 +-
src/test/regress/sql/create_table.sql | 15 +-
src/test/regress/sql/zedstore.sql | 211 +
src/test/regress/sql/zstidscan.sql | 69 +
src/test/storageperf/driver.sql | 45 +
src/test/storageperf/sql/inlinecompress.sql | 133 +
src/test/storageperf/sql/lockperf.sql | 49 +
src/test/storageperf/sql/nullcol.sql | 45 +
src/test/storageperf/sql/onecol.sql | 158 +
src/test/storageperf/sql/toast.sql | 133 +
src/test/storageperf/tests.sql | 7 +
133 files changed, 59309 insertions(+), 269 deletions(-)
diff --git README.md README.md
new file mode 100644
index 0000000000..54901cc4b1
--- /dev/null
+++ README.md
@@ -0,0 +1,27 @@
+Zedstore - Column Store for PostgreSQL
+--------------------------------------
+
+Zedstore is a column store for PostgreSQL. It is under development. The development happens here, you are in the right place.
+
+This is a fork of the PostgreSQL repository. All the interesting Zedstore stuff is in the [src/backend/access/zedstore/](src/backend/access/zedstore/) subdirectory. There are only a few modifications to the rest of PostgreSQL outside that subdirectory. Eventually, any modifications needed to accommodate Zedstore need to be submitted as separate patches and committed to PostgreSQL itself.
+
+Join the discussion on pgsql-hackers:
+
+https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w@mail.gmail.com
+
+
+Try it out
+----------
+
+Clone the repository and compile it. Use the --with-lz4 configure option; otherwise zedstore will be horrendously slow:
+
+ ./configure --with-lz4
+
+To use zedstore:
+
+ CREATE TABLE mytable (t text) using zedstore;
+
+Or you can set it as the default for all your tables, in postgresql.conf:
+
+ default_table_access_method = 'zedstore'
+
diff --git configure configure
index 899116517c..e048fbfb19 100755
--- configure
+++ configure
@@ -699,6 +699,7 @@ with_gnu_ld
LD
LDFLAGS_SL
LDFLAGS_EX
+with_lz4
with_zlib
with_system_tzdata
with_libxslt
@@ -866,6 +867,7 @@ with_libxml
with_libxslt
with_system_tzdata
with_zlib
+with_lz4
with_gnu_ld
enable_largefile
'
@@ -1570,6 +1572,7 @@ Optional Packages:
--with-system-tzdata=DIR
use system time zone data in DIR
--without-zlib do not use Zlib
+ --with-lz4 build with LZ4 support
--with-gnu-ld assume the C compiler uses GNU ld [default=no]
Some influential environment variables:
@@ -8413,6 +8416,41 @@ fi
+#
+# LZ4
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with LZ4 support" >&5
+$as_echo_n "checking whether to build with LZ4 support... " >&6; }
+
+
+
+# Check whether --with-lz4 was given.
+if test "${with_lz4+set}" = set; then :
+ withval=$with_lz4;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LZ4 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-lz4 option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_lz4=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_lz4" >&5
+$as_echo "$with_lz4" >&6; }
+
+
#
# Assignments
#
@@ -11904,6 +11942,56 @@ fi
fi
+if test "$with_lz4" = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for LZ4_compress_default in -llz4" >&5
+$as_echo_n "checking for LZ4_compress_default in -llz4... " >&6; }
+if ${ac_cv_lib_lz4_LZ4_compress_default+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-llz4 $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char LZ4_compress_default ();
+int
+main ()
+{
+return LZ4_compress_default ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_lz4_LZ4_compress_default=yes
+else
+ ac_cv_lib_lz4_LZ4_compress_default=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_lz4_LZ4_compress_default" >&5
+$as_echo "$ac_cv_lib_lz4_LZ4_compress_default" >&6; }
+if test "x$ac_cv_lib_lz4_LZ4_compress_default" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBLZ4 1
+_ACEOF
+
+ LIBS="-llz4 $LIBS"
+
+else
+ as_fn_error $? "library 'lz4' is required for LZ4 support" "$LINENO" 5
+fi
+
+fi
+
if test "$enable_spinlocks" = yes; then
$as_echo "#define HAVE_SPINLOCKS 1" >>confdefs.h
@@ -13103,6 +13191,36 @@ Use --without-zlib to disable zlib support." "$LINENO" 5
fi
+fi
+
+if test "$with_lz4" = yes; then
+ for ac_header in lz4.h
+do :
+ ac_fn_c_check_header_mongrel "$LINENO" "lz4.h" "ac_cv_header_lz4_h" "$ac_includes_default"
+if test "x$ac_cv_header_lz4_h" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LZ4_H 1
+_ACEOF
+
+else
+ for ac_header in lz4.h
+do :
+ ac_fn_c_check_header_mongrel "$LINENO" "lz4.h" "ac_cv_header_lz4_h" "$ac_includes_default"
+if test "x$ac_cv_header_lz4_h" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LZ4_H 1
+_ACEOF
+
+else
+ as_fn_error $? "lz4.h header file is required for LZ4" "$LINENO" 5
+fi
+
+done
+
+fi
+
+done
+
fi
if test "$with_gssapi" = yes ; then
diff --git configure.in configure.in
index ecdf172396..f7d414ccad 100644
--- configure.in
+++ configure.in
@@ -990,6 +990,16 @@ PGAC_ARG_BOOL(with, zlib, yes,
[do not use Zlib])
AC_SUBST(with_zlib)
+#
+# LZ4
+#
+AC_MSG_CHECKING([whether to build with LZ4 support])
+PGAC_ARG_BOOL(with, lz4, no,
+ [build with LZ4 support],
+ [AC_DEFINE([USE_LZ4], 1, [Define to 1 to build with LZ4 support. (--with-lz4)])])
+AC_MSG_RESULT([$with_lz4])
+AC_SUBST(with_lz4)
+
#
# Assignments
#
@@ -1177,6 +1187,10 @@ failure. It is possible the compiler isn't looking in the proper directory.
Use --without-zlib to disable zlib support.])])
fi
+if test "$with_lz4" = yes; then
+ AC_CHECK_LIB(lz4, LZ4_compress_default, [], [AC_MSG_ERROR([library 'lz4' is required for LZ4 support])])
+fi
+
if test "$enable_spinlocks" = yes; then
AC_DEFINE(HAVE_SPINLOCKS, 1, [Define to 1 if you have spinlocks.])
else
@@ -1388,6 +1402,11 @@ failure. It is possible the compiler isn't looking in the proper directory.
Use --without-zlib to disable zlib support.])])
fi
+if test "$with_lz4" = yes; then
+ AC_CHECK_HEADERS(lz4.h, [],
+ [AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is required for LZ4])])])
+fi
+
if test "$with_gssapi" = yes ; then
AC_CHECK_HEADERS(gssapi/gssapi.h, [],
[AC_CHECK_HEADERS(gssapi.h, [], [AC_MSG_ERROR([gssapi.h header file is required for GSSAPI])])])
diff --git contrib/file_fdw/file_fdw.c contrib/file_fdw/file_fdw.c
index fbcf7ca9c9..a8c52310bf 100644
--- contrib/file_fdw/file_fdw.c
+++ contrib/file_fdw/file_fdw.c
@@ -19,6 +19,7 @@
#include "access/reloptions.h"
#include "access/sysattr.h"
#include "access/table.h"
+#include "access/tableam.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_foreign_table.h"
#include "commands/copy.h"
@@ -158,10 +159,8 @@ static void estimate_size(PlannerInfo *root, RelOptInfo *baserel,
static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
FileFdwPlanState *fdw_private,
Cost *startup_cost, Cost *total_cost);
-static int file_acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows);
-
+static void file_acquire_sample_rows(Relation onerel, int elevel,
+ AnalyzeSampleContext *context);
/*
* Foreign-data wrapper handler function: return a struct with pointers
@@ -1092,14 +1091,16 @@ estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
* may be meaningless, but it's OK because we don't use the estimates
* currently (the planner only pays attention to correlation for indexscans).
*/
-static int
+static void
file_acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows)
+ AnalyzeSampleContext *context)
{
int numrows = 0;
+ int targrows = 0;
+ double totalrows = 0;
double rowstoskip = -1; /* -1 means not set yet */
ReservoirStateData rstate;
+ HeapTuple tuple;
TupleDesc tupDesc;
Datum *values;
bool *nulls;
@@ -1112,6 +1113,8 @@ file_acquire_sample_rows(Relation onerel, int elevel,
MemoryContext oldcontext = CurrentMemoryContext;
MemoryContext tupcontext;
+ targrows = context->targrows;
+
Assert(onerel);
Assert(targrows > 0);
@@ -1145,8 +1148,6 @@ file_acquire_sample_rows(Relation onerel, int elevel,
errcallback.previous = error_context_stack;
error_context_stack = &errcallback;
- *totalrows = 0;
- *totaldeadrows = 0;
for (;;)
{
/* Check for user-requested abort or sleep */
@@ -1171,7 +1172,8 @@ file_acquire_sample_rows(Relation onerel, int elevel,
*/
if (numrows < targrows)
{
- rows[numrows++] = heap_form_tuple(tupDesc, values, nulls);
+ tuple = heap_form_tuple(tupDesc, values, nulls);
+ AnalyzeRecordSampleRow(context, NULL, tuple, ANALYZE_SAMPLE_DATA, numrows++, false /* replace */, false);
}
else
{
@@ -1181,7 +1183,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
* not-yet-incremented value of totalrows as t.
*/
if (rowstoskip < 0)
- rowstoskip = reservoir_get_next_S(&rstate, *totalrows, targrows);
+ rowstoskip = reservoir_get_next_S(&rstate, totalrows, targrows);
if (rowstoskip <= 0)
{
@@ -1192,14 +1194,14 @@ file_acquire_sample_rows(Relation onerel, int elevel,
int k = (int) (targrows * sampler_random_fract(rstate.randstate));
Assert(k >= 0 && k < targrows);
- heap_freetuple(rows[k]);
- rows[k] = heap_form_tuple(tupDesc, values, nulls);
+ tuple = heap_form_tuple(tupDesc, values, nulls);
+ AnalyzeRecordSampleRow(context, NULL, tuple, ANALYZE_SAMPLE_DATA, k, true /* replace */, false);
}
rowstoskip -= 1;
}
- *totalrows += 1;
+ totalrows += 1;
}
/* Remove error callback. */
@@ -1220,7 +1222,8 @@ file_acquire_sample_rows(Relation onerel, int elevel,
(errmsg("\"%s\": file contains %.0f rows; "
"%d rows in sample",
RelationGetRelationName(onerel),
- *totalrows, numrows)));
+ totalrows, numrows)));
- return numrows;
+ context->totalrows += totalrows;
+ context->totalsampledrows += numrows;
}
diff --git contrib/postgres_fdw/postgres_fdw.c contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..c01fd1a6b5 100644
--- contrib/postgres_fdw/postgres_fdw.c
+++ contrib/postgres_fdw/postgres_fdw.c
@@ -17,6 +17,7 @@
#include "access/htup_details.h"
#include "access/sysattr.h"
#include "access/table.h"
+#include "access/tableam.h"
#include "catalog/pg_class.h"
#include "commands/defrem.h"
#include "commands/explain.h"
@@ -237,7 +238,6 @@ typedef struct PgFdwAnalyzeState
List *retrieved_attrs; /* attr numbers retrieved by query */
/* collected sample rows */
- HeapTuple *rows; /* array of size targrows */
int targrows; /* target # of sample rows */
int numrows; /* # of sample rows collected */
@@ -463,12 +463,11 @@ static void process_query_params(ExprContext *econtext,
FmgrInfo *param_flinfo,
List *param_exprs,
const char **param_values);
-static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows,
- double *totaldeadrows);
+static void postgresAcquireSampleRowsFunc(Relation relation, int elevel,
+ AnalyzeSampleContext *context);
static void analyze_row_processor(PGresult *res, int row,
- PgFdwAnalyzeState *astate);
+ PgFdwAnalyzeState *astate,
+ AnalyzeSampleContext *context);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -4488,11 +4487,9 @@ postgresAnalyzeForeignTable(Relation relation,
* may be meaningless, but it's OK because we don't use the estimates
* currently (the planner only pays attention to correlation for indexscans).
*/
-static int
+static void
postgresAcquireSampleRowsFunc(Relation relation, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows,
- double *totaldeadrows)
+ AnalyzeSampleContext *context)
{
PgFdwAnalyzeState astate;
ForeignTable *table;
@@ -4506,13 +4503,11 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
/* Initialize workspace state */
astate.rel = relation;
astate.attinmeta = TupleDescGetAttInMetadata(RelationGetDescr(relation));
-
- astate.rows = rows;
- astate.targrows = targrows;
+ astate.targrows = context->targrows;
astate.numrows = 0;
astate.samplerows = 0;
astate.rowstoskip = -1; /* -1 means not set yet */
- reservoir_init_selection_state(&astate.rstate, targrows);
+ reservoir_init_selection_state(&astate.rstate, astate.targrows);
/* Remember ANALYZE context, and create a per-tuple temp context */
astate.anl_cxt = CurrentMemoryContext;
@@ -4604,7 +4599,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
/* Process whatever we got. */
numrows = PQntuples(res);
for (i = 0; i < numrows; i++)
- analyze_row_processor(res, i, &astate);
+ analyze_row_processor(res, i, &astate, context);
PQclear(res);
res = NULL;
@@ -4628,10 +4623,13 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
ReleaseConnection(conn);
/* We assume that we have no dead tuple. */
- *totaldeadrows = 0.0;
+ context->totaldeadrows = 0.0;
/* We've retrieved all living tuples from foreign server. */
- *totalrows = astate.samplerows;
+ context->totalrows += astate.samplerows;
+
+ /* Increase the number of sample rows stored in the context */
+ context->totalsampledrows += astate.numrows;
/*
* Emit some interesting relation info
@@ -4640,8 +4638,6 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
(errmsg("\"%s\": table contains %.0f rows, %d rows in sample",
RelationGetRelationName(relation),
astate.samplerows, astate.numrows)));
-
- return astate.numrows;
}
/*
@@ -4650,10 +4646,11 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
* - Subsequently, replace already-sampled tuples randomly.
*/
static void
-analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
+analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate, AnalyzeSampleContext *context)
{
int targrows = astate->targrows;
int pos; /* array index to store tuple in */
+ bool replace;
MemoryContext oldcontext;
/* Always increment sample row counter. */
@@ -4667,6 +4664,7 @@ analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
{
/* First targrows rows are always included into the sample */
pos = astate->numrows++;
+ replace = false;
}
else
{
@@ -4683,7 +4681,7 @@ analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
/* Choose a random reservoir element to replace. */
pos = (int) (targrows * sampler_random_fract(astate->rstate.randstate));
Assert(pos >= 0 && pos < targrows);
- heap_freetuple(astate->rows[pos]);
+ replace = true;
}
else
{
@@ -4696,18 +4694,22 @@ analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
if (pos >= 0)
{
+ HeapTuple tuple;
/*
* Create sample tuple from current result row, and store it in the
* position determined above. The tuple has to be created in anl_cxt.
*/
oldcontext = MemoryContextSwitchTo(astate->anl_cxt);
- astate->rows[pos] = make_tuple_from_result_row(res, row,
- astate->rel,
- astate->attinmeta,
- astate->retrieved_attrs,
- NULL,
- astate->temp_cxt);
+ tuple = make_tuple_from_result_row(res, row,
+ astate->rel,
+ astate->attinmeta,
+ astate->retrieved_attrs,
+ NULL,
+ astate->temp_cxt);
+
+ /* Tuple is already created in anl_cxt, we can record it directly */
+ AnalyzeRecordSampleRow(context, NULL, tuple, ANALYZE_SAMPLE_DATA, pos, replace, false);
MemoryContextSwitchTo(oldcontext);
}
diff --git src/backend/access/Makefile src/backend/access/Makefile
index 0880e0a8bb..6d36f3bd26 100644
--- src/backend/access/Makefile
+++ src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \
- table tablesample transam
+ table tablesample transam zedstore
include $(top_srcdir)/src/backend/common.mk
diff --git src/backend/access/common/reloptions.c src/backend/access/common/reloptions.c
index ec207d3b26..ef61f585db 100644
--- src/backend/access/common/reloptions.c
+++ src/backend/access/common/reloptions.c
@@ -373,6 +373,15 @@ static relopt_int intRelOpts[] =
},
-1, 0, 1024
},
+ {
+ {
+ "zedstore_rel_extension_factor",
+ "Extend zedstore relations by zedstore_rel_extension_factor #blocks.",
+ RELOPT_KIND_HEAP,
+ ShareUpdateExclusiveLock
+ },
+ ZEDSTORE_DEFAULT_REL_EXTENSION_FACTOR, 1, INT_MAX
+ },
/* list terminator */
{{NULL}}
@@ -1547,7 +1556,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
offsetof(StdRdOptions, vacuum_index_cleanup)},
{"vacuum_truncate", RELOPT_TYPE_BOOL,
- offsetof(StdRdOptions, vacuum_truncate)}
+ offsetof(StdRdOptions, vacuum_truncate)},
+ {"zedstore_rel_extension_factor", RELOPT_TYPE_INT,
+ offsetof(StdRdOptions, zedstore_rel_extension_factor)}
};
return (bytea *) build_reloptions(reloptions, validate, kind,
diff --git src/backend/access/heap/heapam_handler.c src/backend/access/heap/heapam_handler.c
index ca52846b97..0a82fca67c 100644
--- src/backend/access/heap/heapam_handler.c
+++ src/backend/access/heap/heapam_handler.c
@@ -35,6 +35,7 @@
#include "executor/executor.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "parser/analyze.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/lmgr.h"
@@ -44,6 +45,7 @@
#include "utils/builtins.h"
#include "utils/rel.h"
+static int compare_rows(const void *a, const void *b);
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
Datum *values, bool *isnull, RewriteState rwstate);
@@ -974,10 +976,25 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
pfree(isnull);
}
+static void
+heapam_scan_analyze_beginscan(Relation onerel, AnalyzeSampleContext *context)
+{
+ context->scan = table_beginscan_analyze(onerel);
+
+ /* initialize the totalblocks analyze can scan */
+ context->totalblocks = RelationGetNumberOfBlocks(onerel);
+
+ /* reset the statistic */
+ context->liverows = 0;
+ context->deadrows = 0;
+ context->ordered = true;
+}
+
static bool
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
- BufferAccessStrategy bstrategy)
+heapam_scan_analyze_next_block(BlockNumber blockno,
+ AnalyzeSampleContext *context)
{
+ TableScanDesc scan = context->scan;
HeapScanDesc hscan = (HeapScanDesc) scan;
/*
@@ -992,7 +1009,7 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
hscan->rs_cblock = blockno;
hscan->rs_cindex = FirstOffsetNumber;
hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
- blockno, RBM_NORMAL, bstrategy);
+ blockno, RBM_NORMAL, context->bstrategy);
LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
/* in heap all blocks can contain tuples, so always return true */
@@ -1000,14 +1017,14 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
}
static bool
-heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
- double *liverows, double *deadrows,
- TupleTableSlot *slot)
+heapam_scan_analyze_next_tuple(TransactionId OldestXmin, AnalyzeSampleContext *context)
{
+ TableScanDesc scan = context->scan;
HeapScanDesc hscan = (HeapScanDesc) scan;
Page targpage;
OffsetNumber maxoffset;
BufferHeapTupleTableSlot *hslot;
+ TupleTableSlot *slot = AnalyzeGetSampleSlot(context, scan->rs_rd, ANALYZE_SAMPLE_DATA);
Assert(TTS_IS_BUFFERTUPLE(slot));
@@ -1033,7 +1050,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
if (!ItemIdIsNormal(itemid))
{
if (ItemIdIsDead(itemid))
- *deadrows += 1;
+ context->deadrows += 1;
continue;
}
@@ -1048,13 +1065,13 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
{
case HEAPTUPLE_LIVE:
sample_it = true;
- *liverows += 1;
+ context->liverows += 1;
break;
case HEAPTUPLE_DEAD:
case HEAPTUPLE_RECENTLY_DEAD:
/* Count dead and recently-dead rows */
- *deadrows += 1;
+ context->deadrows += 1;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -1080,7 +1097,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
{
sample_it = true;
- *liverows += 1;
+ context->liverows += 1;
}
break;
@@ -1109,11 +1126,11 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* concurrent transaction never commits.
*/
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(targtuple->t_data)))
- *deadrows += 1;
+ context->deadrows += 1;
else
{
sample_it = true;
- *liverows += 1;
+ context->liverows += 1;
}
break;
@@ -1142,6 +1159,71 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
return false;
}
+static void
+heapam_scan_analyze_sample_tuple(int pos, bool replace, AnalyzeSampleContext *context)
+{
+ TupleTableSlot *slot;
+ Relation onerel = context->scan->rs_rd;
+
+ Assert(pos >= 0);
+ /*
+ * heapam_scan_analyze_next_tuple should already put the tuple
+ * in the sample slot, just record it into the array of sample
+ * rows.
+ */
+ slot = AnalyzeGetSampleSlot(context, onerel, ANALYZE_SAMPLE_DATA);
+ AnalyzeRecordSampleRow(context, slot, NULL, ANALYZE_SAMPLE_DATA, pos, replace, true);
+
+ /*
+ * if replace happens, the sample rows are no longer ordered
+ * in physical position.
+ */
+ if (replace)
+ context->ordered = false;
+}
+
+static void
+heapam_scan_analyze_endscan(AnalyzeSampleContext *context)
+{
+ HeapTuple *rows = AnalyzeGetSampleRows(context, ANALYZE_SAMPLE_DATA, context->totalsampledrows);
+
+ /*
+ * If we didn't find as many tuples as we wanted then we're done. No sort
+ * is needed, since they're already in order.
+ *
+ * Otherwise we need to sort the collected tuples by position
+ * (itempointer).
+ */
+ if (!context->ordered)
+ qsort((void *)rows, context->targrows, sizeof(HeapTuple), compare_rows);
+
+ table_endscan(context->scan);
+}
+
+/*
+ * qsort comparator for sorting rows[] array
+ */
+static int
+compare_rows(const void *a, const void *b)
+{
+ HeapTuple ha = *(const HeapTuple *) a;
+ HeapTuple hb = *(const HeapTuple *) b;
+ BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
+ OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
+ BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
+ OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
+
+ if (ba < bb)
+ return -1;
+ if (ba > bb)
+ return 1;
+ if (oa < ob)
+ return -1;
+ if (oa > ob)
+ return 1;
+ return 0;
+}
+
static double
heapam_index_build_range_scan(Relation heapRelation,
Relation indexRelation,
@@ -2501,6 +2583,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
+ .scans_leverage_column_projection = false,
.slot_callbacks = heapam_slot_callbacks,
@@ -2538,8 +2621,11 @@ static const TableAmRoutine heapam_methods = {
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
.relation_vacuum = heap_vacuum_rel,
+ .scan_analyze_beginscan = heapam_scan_analyze_beginscan,
.scan_analyze_next_block = heapam_scan_analyze_next_block,
.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
+ .scan_analyze_sample_tuple = heapam_scan_analyze_sample_tuple,
+ .scan_analyze_endscan = heapam_scan_analyze_endscan,
.index_build_range_scan = heapam_index_build_range_scan,
.index_validate_scan = heapam_index_validate_scan,
diff --git src/backend/access/nbtree/nbtsort.c src/backend/access/nbtree/nbtsort.c
index e66cd36dfa..acf3ccfe63 100644
--- src/backend/access/nbtree/nbtsort.c
+++ src/backend/access/nbtree/nbtsort.c
@@ -1918,7 +1918,8 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
indexInfo = BuildIndexInfo(btspool->index);
indexInfo->ii_Concurrent = btshared->isconcurrent;
scan = table_beginscan_parallel(btspool->heap,
- ParallelTableScanFromBTShared(btshared));
+ ParallelTableScanFromBTShared(btshared),
+ NULL);
reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
true, progress, _bt_build_callback,
(void *) &buildstate, scan);
diff --git src/backend/access/rmgrdesc/Makefile src/backend/access/rmgrdesc/Makefile
index f88d72fd86..3585dbd7e4 100644
--- src/backend/access/rmgrdesc/Makefile
+++ src/backend/access/rmgrdesc/Makefile
@@ -29,6 +29,7 @@ OBJS = \
standbydesc.o \
tblspcdesc.o \
xactdesc.o \
- xlogdesc.o
+ xlogdesc.o \
+ zedstoredesc.o
include $(top_srcdir)/src/backend/common.mk
diff --git src/backend/access/rmgrdesc/zedstoredesc.c src/backend/access/rmgrdesc/zedstoredesc.c
new file mode 100644
index 0000000000..c2499d9e6f
--- /dev/null
+++ src/backend/access/rmgrdesc/zedstoredesc.c
@@ -0,0 +1,135 @@
+/*
+ * zedstoredesc.c
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/zedstoredesc.c
+ */
+#include "postgres.h"
+
+#include "access/xlogreader.h"
+#include "access/zedstore_tid.h"
+#include "access/zedstore_wal.h"
+#include "lib/stringinfo.h"
+
+void
+zedstore_desc(StringInfo buf, XLogReaderState *record)
+{
+ char *rec = XLogRecGetData(record);
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == WAL_ZEDSTORE_INIT_METAPAGE)
+ {
+ wal_zedstore_init_metapage *walrec = (wal_zedstore_init_metapage *) rec;
+
+ appendStringInfo(buf, "natts %d", walrec->natts);
+ }
+ else if (info == WAL_ZEDSTORE_UNDO_NEWPAGE)
+ {
+ wal_zedstore_undo_newpage *walrec = (wal_zedstore_undo_newpage *) rec;
+
+ appendStringInfo(buf, "first_counter " UINT64_FORMAT, walrec->first_counter);
+ }
+ else if (info == WAL_ZEDSTORE_UNDO_DISCARD)
+ {
+ wal_zedstore_undo_discard *walrec = (wal_zedstore_undo_discard *) rec;
+
+ appendStringInfo(buf, "oldest_undorecptr " UINT64_FORMAT ", oldest_undopage %u",
+ walrec->oldest_undorecptr.counter,
+ walrec->oldest_undopage);
+ }
+ else if (info == WAL_ZEDSTORE_BTREE_NEW_ROOT)
+ {
+ wal_zedstore_btree_new_root *walrec = (wal_zedstore_btree_new_root *) rec;
+
+ appendStringInfo(buf, "attno %d", walrec->attno);
+ }
+ else if (info == WAL_ZEDSTORE_TIDLEAF_ADD_ITEMS)
+ {
+ wal_zedstore_tidleaf_items *walrec = (wal_zedstore_tidleaf_items *) rec;
+
+ appendStringInfo(buf, "%d items, off %d", walrec->nitems, walrec->off);
+ }
+ else if (info == WAL_ZEDSTORE_TIDLEAF_REPLACE_ITEM)
+ {
+ wal_zedstore_tidleaf_items *walrec = (wal_zedstore_tidleaf_items *) rec;
+
+ appendStringInfo(buf, "%d items, off %d", walrec->nitems, walrec->off);
+ }
+ else if (info == WAL_ZEDSTORE_ATTSTREAM_CHANGE)
+ {
+ wal_zedstore_attstream_change *walrec = (wal_zedstore_attstream_change *) rec;
+
+ if (walrec->is_upper)
+ appendStringInfo(buf, "upper stream change");
+ else
+ appendStringInfo(buf, "lower stream change");
+ appendStringInfo(buf, ", new size %d", walrec->new_attstream_size);
+ }
+ else if (info == WAL_ZEDSTORE_TOAST_NEWPAGE)
+ {
+ wal_zedstore_toast_newpage *walrec = (wal_zedstore_toast_newpage *) rec;
+
+ appendStringInfo(buf, "tid (%u/%d), attno %d, offset %d/%d",
+ ZSTidGetBlockNumber(walrec->tid), ZSTidGetOffsetNumber(walrec->tid),
+ walrec->attno, walrec->offset, walrec->total_size);
+ }
+ else if (info == WAL_ZEDSTORE_FPM_DELETE_PAGE)
+ {
+ wal_zedstore_fpm_delete_page *walrec = (wal_zedstore_fpm_delete_page *) rec;
+
+ appendStringInfo(buf, "nextblkno %u", walrec->next_free_blkno);
+ }
+ else if (info == WAL_ZEDSTORE_FPM_REUSE_PAGE)
+ {
+ wal_zedstore_fpm_reuse_page *walrec = (wal_zedstore_fpm_reuse_page *) rec;
+
+ appendStringInfo(buf, "nextblkno %u", walrec->next_free_blkno);
+ }
+}
+
+const char *
+zedstore_identify(uint8 info)
+{
+ const char *id = NULL;
+
+ switch (info & ~XLR_INFO_MASK)
+ {
+ case WAL_ZEDSTORE_INIT_METAPAGE:
+ id = "INIT_METAPAGE";
+ break;
+ case WAL_ZEDSTORE_UNDO_NEWPAGE:
+ id = "UNDO_NEWPAGE";
+ break;
+ case WAL_ZEDSTORE_UNDO_DISCARD:
+ id = "UNDO_DISCARD";
+ break;
+ case WAL_ZEDSTORE_BTREE_NEW_ROOT:
+ id = "BTREE_NEW_ROOT";
+ break;
+ case WAL_ZEDSTORE_TIDLEAF_ADD_ITEMS:
+ id = "BTREE_TIDLEAF_ADD_ITEMS";
+ break;
+ case WAL_ZEDSTORE_TIDLEAF_REPLACE_ITEM:
+ id = "BTREE_TIDLEAF_REPLACE_ITEM";
+ break;
+ case WAL_ZEDSTORE_BTREE_REWRITE_PAGES:
+ id = "BTREE_REWRITE_PAGES";
+ break;
+ case WAL_ZEDSTORE_ATTSTREAM_CHANGE:
+ id = "ATTSTREAM_CHANGE";
+ break;
+ case WAL_ZEDSTORE_TOAST_NEWPAGE:
+ id = "ZSTOAST_NEWPAGE";
+ break;
+ case WAL_ZEDSTORE_FPM_DELETE_PAGE:
+ id = "FPM_DELETE_PAGE";
+ break;
+ case WAL_ZEDSTORE_FPM_REUSE_PAGE:
+ id = "FPM_REUSE_PAGE";
+ break;
+ }
+ return id;
+}
diff --git src/backend/access/table/tableam.c src/backend/access/table/tableam.c
index c814733b22..5d34681ce7 100644
--- src/backend/access/table/tableam.c
+++ src/backend/access/table/tableam.c
@@ -23,7 +23,9 @@
#include "access/heapam.h" /* for ss_* */
#include "access/tableam.h"
+#include "access/tupconvert.h"
#include "access/xact.h"
+#include "catalog/pg_type.h"
#include "optimizer/plancat.h"
#include "storage/bufmgr.h"
#include "storage/shmem.h"
@@ -158,7 +160,7 @@ table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
}
TableScanDesc
-table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan)
+table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan, Bitmapset *proj)
{
Snapshot snapshot;
uint32 flags = SO_TYPE_SEQSCAN |
@@ -180,6 +182,9 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan)
snapshot = SnapshotAny;
}
+ if (proj)
+ return relation->rd_tableam->scan_begin_with_column_projection(relation, snapshot, 0, NULL,
+ parallel_scan, flags, proj);
return relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
parallel_scan, flags);
}
@@ -647,3 +652,216 @@ table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
else
*allvisfrac = (double) relallvisible / curpages;
}
+
+/* Create the analyze sample context to acquire sample rows */
+AnalyzeSampleContext *
+CreateAnalyzeSampleContext(Relation onerel,
+ List *anl_cols,
+ int totaltargrows,
+ BufferAccessStrategy strategy)
+{
+ AnalyzeSampleContext *context;
+
+ context = (AnalyzeSampleContext *) palloc(sizeof(AnalyzeSampleContext));
+ context->parent = onerel;
+ context->anl_cols = anl_cols;
+ context->bstrategy = strategy;
+ context->totaltargrows = totaltargrows;
+ context->targrows = totaltargrows;
+ context->scan = NULL;
+ context->totalblocks = 0;
+ context->totalrows = 0;
+ context->totaldeadrows = 0;
+ context->totalsampledrows = 0;
+ context->liverows = 0;
+ context->deadrows = 0;
+ context->ordered = false;
+ context->tup_convert_map = NULL;
+
+ /* empty all sample types */
+ memset(context->sample_slots, 0, MAX_ANALYZE_SAMPLE * sizeof(TupleTableSlot *));
+ memset(context->sample_rows, 0, MAX_ANALYZE_SAMPLE * sizeof(HeapTuple *));
+
+ return context;
+}
+
+/* Destroy analyze sample context */
+void
+DestroyAnalyzeSampleContext(AnalyzeSampleContext *context)
+{
+ for (int i = 0; i < MAX_ANALYZE_SAMPLE; i++)
+ {
+ TupleTableSlot *slot = context->sample_slots[i];
+ if (slot)
+ ExecDropSingleTupleTableSlot(slot);
+ }
+}
+
+/*
+ * To acquire sample rows from an inherited table, all child
+ * relations use the same analyze sample context; this function
+ * must be called before starting to analyze a new child relation.
+ */
+void
+InitAnalyzeSampleContextForChild(AnalyzeSampleContext *context,
+ Relation child,
+ int childtargrows)
+{
+ /* Set targrows to childtargrows */
+ context->targrows = childtargrows;
+
+ /* We may need to convert from child's rowtype to parent's */
+ if (!equalTupleDescs(RelationGetDescr(child),
+ RelationGetDescr(context->parent)))
+ {
+ if (context->tup_convert_map)
+ free_conversion_map(context->tup_convert_map);
+ /* Create a convert map so it can be used when recording sample rows */
+ context->tup_convert_map =
+ convert_tuples_by_name(RelationGetDescr(child),
+ RelationGetDescr(context->parent));
+
+ /* We also cannot use previous sample slot anymore */
+ if (context->sample_slots[ANALYZE_SAMPLE_DATA])
+ {
+ ExecDropSingleTupleTableSlot(context->sample_slots[ANALYZE_SAMPLE_DATA]);
+ context->sample_slots[ANALYZE_SAMPLE_DATA] = NULL;
+ }
+ /* We also cannot use previous sample slot anymore */
+ if (context->sample_slots[ANALYZE_SAMPLE_DISKSIZE])
+ {
+ ExecDropSingleTupleTableSlot(context->sample_slots[ANALYZE_SAMPLE_DISKSIZE]);
+ context->sample_slots[ANALYZE_SAMPLE_DISKSIZE] = NULL;
+ }
+ }
+}
+
+void
+AnalyzeGetSampleStats(AnalyzeSampleContext *context,
+ int *totalsampledrows,
+ double *totalrows,
+ double *totaldeadrows)
+{
+ if (totalsampledrows)
+ *totalsampledrows = context->totalsampledrows;
+ if (totalrows)
+ *totalrows = context->totalrows;
+ if (totaldeadrows)
+ *totaldeadrows = context->totaldeadrows;
+}
+
+
+/*
+ * Get or initialize a sample slot to hold sample tuple, normally
+ * the tuple in the slot will be copied to the sample_rows[type]
+ * by AnalyzeRecordSampleRow().
+ */
+TupleTableSlot *
+AnalyzeGetSampleSlot(AnalyzeSampleContext *context,
+ Relation onerel,
+ AnalyzeSampleType type)
+{
+ TupleDesc tupdesc;
+ int attr_cnt = onerel->rd_att->natts;
+
+ if (context->sample_slots[type])
+ return context->sample_slots[type];
+
+ switch (type)
+ {
+ case ANALYZE_SAMPLE_DATA:
+ tupdesc = RelationGetDescr(onerel);
+ break;
+ case ANALYZE_SAMPLE_DISKSIZE:
+ tupdesc = CreateTemplateTupleDesc(attr_cnt);
+ for (int i = 1; i <= attr_cnt; i++)
+ TupleDescInitEntry(tupdesc, i, "", FLOAT8OID, -1, 0);
+ break;
+ default:
+ elog(ERROR, "unknown analyze sample type");
+ }
+
+ context->sample_slots[type] =
+ MakeSingleTupleTableSlot(tupdesc, table_slot_callbacks(onerel));
+ return context->sample_slots[type];
+}
+
+HeapTuple *
+AnalyzeGetSampleRows(AnalyzeSampleContext *context,
+ AnalyzeSampleType type,
+ int offset)
+{
+ Assert(offset < context->totaltargrows);
+ if (!context->sample_rows[type])
+ context->sample_rows[type] =
+ (HeapTuple *) palloc(context->totaltargrows * sizeof(HeapTuple));
+
+ return context->sample_rows[type] + offset;
+}
+
+/*
+ * Record a sample tuple into sample_rows[type].
+ *
+ * sample_tuple:
+ * Input sample tuple. Sometimes, the caller has already
+ * formed the sample tuple in its memory context, so we can
+ * record it directly.
+ * sample_slot:
+ * Slot which contains the sample tuple. We need to copy
+ * the sample tuple and then record it.
+ * pos:
+ * The position in the sample_rows[type].
+ * replace:
+ * Replace the old sample tuple in the specified position.
+ * withtid:
+ * Set the tid of the sample tuple; this is only valid when
+ * sample_slot is set.
+ *
+ * We prefer to use sample_slot if both sample_tuple and
+ * sample_slot are set; sample_slot is the most common case.
+ */
+void
+AnalyzeRecordSampleRow(AnalyzeSampleContext *context,
+ TupleTableSlot *sample_slot,
+ HeapTuple sample_tuple,
+ AnalyzeSampleType type,
+ int pos,
+ bool replace,
+ bool withtid)
+{
+ HeapTuple tuple;
+ HeapTuple *rows;
+
+ rows = AnalyzeGetSampleRows(context, type, context->totalsampledrows);
+
+ /* We need to free the old tuple if replace is true */
+ if (replace)
+ heap_freetuple(rows[pos]);
+
+ Assert(sample_slot || sample_tuple);
+ if (sample_slot)
+ tuple = ExecCopySlotHeapTuple(sample_slot);
+ else
+ tuple = sample_tuple;
+
+ /* We may need to convert from child's rowtype to parent's */
+ if (context->tup_convert_map != NULL && type == ANALYZE_SAMPLE_DATA)
+ {
+ HeapTuple newtup;
+ newtup = execute_attr_map_tuple(tuple, context->tup_convert_map);
+ heap_freetuple(tuple);
+ tuple = newtup;
+ }
+
+ if (withtid && sample_slot)
+ tuple->t_self = sample_slot->tts_tid;
+
+ /* store the tuple at the right position */
+ rows[pos] = tuple;
+}
+
+bool
+AnalyzeSampleIsValid(AnalyzeSampleContext *context, AnalyzeSampleType type)
+{
+ return context->sample_rows[type] != NULL;
+}
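+
+/*
+ * Illustrative usage sketch of the sample-context API above (for
+ * reference only; the caller code shown here is hypothetical):
+ *
+ *   ctx = CreateAnalyzeSampleContext(onerel, anl_cols, targrows, strategy);
+ *   slot = AnalyzeGetSampleSlot(ctx, onerel, ANALYZE_SAMPLE_DATA);
+ *   ... for each sampled row: fill 'slot', then ...
+ *   AnalyzeRecordSampleRow(ctx, slot, NULL, ANALYZE_SAMPLE_DATA,
+ *                          pos, false, true);
+ *   AnalyzeGetSampleStats(ctx, &sampled, &live, &dead);
+ *   DestroyAnalyzeSampleContext(ctx);
+ */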
diff --git src/backend/access/transam/rmgr.c src/backend/access/transam/rmgr.c
index 58091f6b52..74ef539770 100644
--- src/backend/access/transam/rmgr.c
+++ src/backend/access/transam/rmgr.c
@@ -20,6 +20,7 @@
#include "access/spgxlog.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/zedstore_wal.h"
#include "catalog/storage_xlog.h"
#include "commands/dbcommands_xlog.h"
#include "commands/sequence.h"
diff --git src/backend/access/transam/xact.c src/backend/access/transam/xact.c
index e3c60f23cd..fff45f829e 100644
--- src/backend/access/transam/xact.c
+++ src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogutils.h"
+#include "access/zedstoream.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
#include "catalog/storage.h"
@@ -2103,6 +2104,13 @@ CommitTransaction(void)
/* Shut down the deferred-trigger manager */
AfterTriggerEndXact(true);
+ /*
+ * Flush tuple buffers in zedstore. Cannot insert/update zedstore tables
+ * in this transaction after this. This must happen before ON COMMIT
+ * actions, so we don't fail on flushing to ON COMMIT DROP tables.
+ */
+ AtEOXact_zedstore_tuplebuffers(true);
+
/*
* Let ON COMMIT management do its thing (must happen after closing
* cursors, to avoid dangling-reference problems)
@@ -2336,6 +2344,13 @@ PrepareTransaction(void)
/* Shut down the deferred-trigger manager */
AfterTriggerEndXact(true);
+ /*
+ * Flush tuple buffers in zedstore. Cannot insert/update zedstore tables
+ * in this transaction after this. This must happen before ON COMMIT
+ * actions, so we don't fail on flushing to ON COMMIT DROP tables.
+ */
+ AtEOXact_zedstore_tuplebuffers(true);
+
/*
* Let ON COMMIT management do its thing (must happen after closing
* cursors, to avoid dangling-reference problems)
@@ -2660,6 +2675,7 @@ AbortTransaction(void)
*/
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
+ AtEOXact_zedstore_tuplebuffers(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
@@ -4749,6 +4765,8 @@ StartSubTransaction(void)
s->state = TRANS_INPROGRESS;
+ AtSubStart_zedstore_tuplebuffers();
+
/*
* Call start-of-subxact callbacks
*/
@@ -4787,6 +4805,8 @@ CommitSubTransaction(void)
s->parallelModeLevel = 0;
}
+ AtEOSubXact_zedstore_tuplebuffers(true);
+
/* Do the actual "commit", such as it is */
s->state = TRANS_COMMIT;
@@ -4964,6 +4984,7 @@ AbortSubTransaction(void)
s->parent->subTransactionId,
s->curTransactionOwner,
s->parent->curTransactionOwner);
+ AtEOSubXact_zedstore_tuplebuffers(false);
AtEOSubXact_LargeObject(false, s->subTransactionId,
s->parent->subTransactionId);
AtSubAbort_Notify();
diff --git src/backend/access/zedstore/Makefile src/backend/access/zedstore/Makefile
new file mode 100644
index 0000000000..2e3ca364df
--- /dev/null
+++ src/backend/access/zedstore/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for access/zedstore
+#
+# IDENTIFICATION
+# src/backend/access/zedstore/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/zedstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = zedstore_btree.o zedstore_tiditem.o zedstore_tidpage.o \
+ zedstore_tid.o zedstore_attstream.o zedstore_attpage.o \
+ zedstore_compression.o zedstore_simple8b.o zedstoream_handler.o \
+ zedstore_meta.o zedstore_undolog.o zedstore_undorec.o \
+ zedstore_toast.o zedstore_visibility.o zedstore_inspect.o \
+ zedstore_freepagemap.o zedstore_tupslot.o zedstore_wal.o \
+ zedstore_tuplebuffer.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git src/backend/access/zedstore/README src/backend/access/zedstore/README
new file mode 100644
index 0000000000..e2c6ae461f
--- /dev/null
+++ src/backend/access/zedstore/README
@@ -0,0 +1,362 @@
+
+src/backend/access/zedstore/README
+
+ZedStore - compressed column (and row) store for PostgreSQL
+===========================================================
+
+The purpose of this README is to provide an overview of zedstore's
+design, the major requirements/objectives it intends to fulfill, and
+high-level implementation details.
+
+Objectives
+----------
+
+* Performance improvement for queries selecting subset of columns
+(reduced IO).
+
+* Reduced on-disk footprint compared to heap table. Shorter tuple
+headers and also leveraging compression of similar type data
+
+* Be a first-class citizen in the Postgres architecture (table data can
+just independently live in columnar storage) and not be at arm's
+length through an opaque interface.
+
+* Fully MVCC compliant - basically all operations supported similar to
+heap, like update, delete, serializable transactions etc...
+
+* All Indexes supported
+
+* Hybrid row-column store, where some columns are stored together, and
+others separately. Provide flexibility of granularity on how to divide
+the columns. Columns accessed together can be stored together.
+
+* Provide better control over bloat.
+
+* Eliminate need for separate toast tables
+
+* Faster add / drop column or changing data type of column by avoiding
+full rewrite of the table.
+
+High-level design of ZedStore - B-trees for the win!
+---------------------------------------------------
+
+ZedStore consists of multiple B-trees. There is one B-tree, called the
+TID tree, which contains the visibility information of each tuple, but
+no user data. In addition to that, there is one B-tree for each
+attribute, called the attribute trees, to store the user data. Note that
+these B-tree implementations are completely unrelated to PostgreSQL's
+B-tree indexes.
+
+The TID tree, and all the attribute trees, use the TID as the key. The
+TID is used as a logical row identifier. Internally, ZedStore passes
+TIDs around as 64-bit integers (zstid), but for interfacing with the
+rest of the system, they are converted to/from ItemPointers. When
+converted to an ItemPointer, the conversion ensures that the ItemPointer
+looks valid, i.e. offset 0 is never used. However, the TID is just a
+48-bit row identifier; the traditional division into block and offset
+numbers is meaningless. There is locality of access, though; TIDs that
+are close to each other will probably also reside close to each other
+on disk. So, for example, bitmap index scans or BRIN indexes, which
+work with block numbers, still make some sense, even though the "block
+number" stored in a zedstore ItemPointer doesn't correspond to a
+physical block.
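+
+For illustration, the conversion conceptually looks like the sketch
+below. (This is a sketch only: the exact bit layout lives in
+zedstore_internal.h, and the "upper/lower bits" split shown here is an
+assumption, not necessarily what the code does.)
+
+    /* sketch: pack a zstid into an ItemPointer, keeping the offset >= 1 */
+    ItemPointerData itemptr;
+
+    ItemPointerSet(&itemptr,
+                   ZSTidGetBlockNumber(tid),    /* upper bits of the zstid */
+                   ZSTidGetOffsetNumber(tid));  /* lower bits, never 0 */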
+
+The internal pages of the B-trees are super simple and boring. The internal
+pages of the TID and attribute trees look identical. Functions that work
+with either the TID or attribute tree use ZS_META_ATTRIBUTE_NUM as the
+"attribute number", when working with the TID tree.
+
+The leaf pages look different in the TID tree and in the attribute trees.
+Let's look at the TID tree first:
+
+TID tree
+--------
+
+A TID tree page consists of multiple ZSTidArrayItems. Each ZSTidArrayItem
+represents a group of tuples, with TIDs in a particular range. The TID
+ranges of ZSTidArrayItems never overlap. For each tuple, we logically
+store the TID, and its UNDO pointer. The actual visibility information
+is stored in the UNDO log, if the tuple was recently modified.
+
+A tuple can also be marked as dead, which means that the tuple is not
+visible to anyone. Dead tuples are marked with a special constant
+UNDO pointer value, DeadUndoPtr. The TIDs of dead tuples cannot be
+reused, until all index pointers to the tuples have been removed, by
+VACUUM. VACUUM scans the TID tree to collect all the dead TIDs. (Note
+that VACUUM does not need to scan the attribute trees, and the TID tree
+is hopefully just a small fraction of the table. Vacuum on zedstore is
+therefore hopefully much faster than on heap. (Although the freeze map
+can be pretty effective on the heap, too))
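+
+For illustration, the vacuum flow is roughly the following (a sketch
+only; apart from zsbt_attr_remove(), which this patch does provide, the
+step names are descriptive, not actual functions):
+
+    collect dead TIDs by scanning the TID tree for DeadUndoPtr items
+    remove index entries pointing to those TIDs
+    for each attribute: zsbt_attr_remove(rel, attno, dead_tids)
+    remove the dead TIDs from the TID tree, making the TIDs reusable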
+
+So logically, the TID tree stores the TID and UNDO pointer for every
+tuple. However, that would take a lot of space. To reduce disk usage,
+the TID tree consists of ZSTidArrayItems, which contain the TIDs and
+their UNDO pointers in a specially encoded format. The encoded format
+is optimized for the common cases, where the gaps between TIDs are
+small, and most tuples are visible to everyone. See the comments for
+ZSTidArrayItem in zedstore_internal.h for details.
+
+Having a TID tree that's separate from the attributes helps to support
+zero-column tables (which can be the result of ADD COLUMN / DROP COLUMN
+actions as well). Plus, having meta-data stored separately from data
+helps to get better compression ratios. It also helps to simplify the
+overall design/implementation, as deletes only need to edit the TID tree
+and avoid touching the attribute btrees.
+
+
+Attribute trees
+---------------
+
+The leaf pages on the attribute tree contain two "streams" of attribute
+data, one compressed, and one uncompressed. The compressed stream is
+compressed using LZ4. (Assuming the server has been built with "configure
+--with-lz4". Otherwise, PostgreSQL's built-in pglz algorithm is used, but
+it is *much* slower). When new rows are added, the new attribute data is
+appended to the uncompressed stream, until the page gets full, at which
+point all the uncompressed data is repacked and moved to the compressed
+stream. An attribute stream consists of smaller "chunks", and each chunk
+contains the TIDs and data for 1-60 rows.
+
+In uncompressed form, an attribute stream on a page can be arbitrarily
+large, but after compression, it must fit into a physical 8k block. If
+on insert or update of a tuple, the page cannot be compressed below 8k
+anymore, the page is split. Note that because TIDs are logical rather
+than physical identifiers, we can freely move tuples from one physical
+page to another during page split. A tuple's TID never changes.
+
+The buffer cache caches compressed blocks. Likewise, WAL-logging,
+full-page images etc. work on compressed blocks. Decompression is done
+on the fly in backend-private memory, as and when needed during reads.
+For some encodings, like RLE or delta encoding, tuples can be
+constructed directly from the compressed data.
+
+
+To reconstruct a row with a given TID, the scan descends down the B-trees for
+all the columns using that TID, and fetches all attributes. Likewise, a
+sequential scan walks all the B-trees in lockstep.
+
+
+TODO: Currently, each attribute is stored in a separate attribute
+B-tree. But a hybrid row-column store would also be possible, where some
+columns were stored together in the same tree. Or even a row store, where
+all the user data was stored in a single tree, or even combined with the
+TID tree.
+
+Metapage
+--------
+
+A metapage at block 0 has links to the roots of the B-trees.
+
+
+Low-level locking / concurrency issues
+--------------------------------------
+Design principles:
+
+* Every page is self-identifying. Every page has a page type ID,
+ which indicates what kind of a page it is. For a B-tree page,
+ the page header contains the attribute number and lo/hi key.
+ That is enough information to find the downlink to the page, so
+ that it can be deleted if necessary. There is enough information
+ on each leaf page to easily re-build the internal pages from
+ scratch, in case of corruption, for example.
+
+* Concurrency control: When traversing the B-tree, or walking UNDO
+ or TOAST pages, it's possible that a concurrent process splits or
+ moves a page just when we're about to step on it. There is enough
+ information on each page to detect that case. For example, if a
+ B-tree page is split just when you are about to step on it, you
+ can detect that by looking at the lo/hi key. If a page is deleted,
+ that can be detected too, because the attribute number or lo/hikey
+ are not what you expected. In that case, start the scan from the
+ root.
+
+* Any page can fairly easily be moved, starting with just the
+ page itself. When you have a B-tree page at hand, you can re-find
+ its parent using its lokey, and modify the downlink. A toast page
+ contains the attno/TID, which can be used to find the pointer to
+ it in the b-tree. An UNDO page cannot currently be moved because
+ UNDO pointers contain the physical block number, but as soon as an
+ UNDO page expires, it can be deleted.
+
+
+MVCC
+----
+
+Undo record pointers are used to implement MVCC, like in zheap. Hence,
+transaction information is not directly stored with the data. In
+zheap, there's a small, fixed number of "transaction slots" on each
+page, but zedstore has an undo pointer with each item directly; in
+normal cases, the compression squeezes this down to almost nothing. In
+case of a bulk load, the undo record pointer is maintained for an array
+of items and not per item. The undo pointer is only stored in the
+meta-column, and all MVCC operations are performed using the meta-column only.
+
+
+Insert:
+Inserting a new row splits the row into datums. Then, while adding the
+entry for the meta-column, zedstore decides which block to insert into,
+picks a TID for it, and writes an undo record for the same. All the data
+columns are inserted using that TID.
+
+Toast:
+When an overly large datum is stored, it is divided into chunks, and
+each chunk is stored on a dedicated toast page within the same
+physical file. The toast pages of a datum form a list; each page has a
+next/prev pointer.
+
+Select:
+A property is added to the table AM to convey whether column projection
+is leveraged by the AM for scans. While scanning tables whose AM
+advertises this property, the executor parses the plan and leverages the
+target list and quals to find the columns required by the query. This
+list is passed down to the AM on beginscan. Zedstore uses this column
+projection list to pull data only from the selected columns. A virtual
+tuple table slot is used to pass back the datums for the subset of
+columns.
+
+The current table AM API requires enhancement here to pass down the
+column projection to the AM. The patch showcases two different ways of
+doing the same.
+
+* For sequential scans, a new beginscan_with_column_projection() API is
+added. The executor checks the AM property and uses this new API if the
+AM leverages column projection, else the normal beginscan() API.
+
+* For index scans, instead of modifying the begin scan API, a new API is
+added to pass the column projection list after calling begin scan (to
+populate the scan descriptor) but before fetching the tuples. A
+caller-side sketch follows below.
+
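+For illustration, a caller could build the projection as a Bitmapset of
+attribute numbers and hand it to the scan. This is a sketch only; the
+attribute numbers are made up, and the entry point shown is the
+table_beginscan_parallel() variant this patch adds to tableam.c:
+
+    Bitmapset *proj = NULL;
+
+    proj = bms_add_member(proj, 1);   /* first needed column */
+    proj = bms_add_member(proj, 4);   /* another needed column */
+    scan = table_beginscan_parallel(rel, pscan, proj);
+    /* with proj == NULL, this falls back to the plain scan_begin() path */
+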
+Delete:
+When deleting a tuple, a new undo record is created for the delete, and
+only the meta-column item is updated with this new undo record. The new
+undo record points to the previous undo record pointer (the insert undo
+record) present for the tuple. Hence, delete only operates on the
+meta-column and no data column is edited.
+
+Update:
+Update in zedstore is pretty much equivalent to a delete plus an insert.
+The delete action is performed as stated above, and a new entry is added
+with the updated values. So, no in-place update happens.
+
+Index Support:
+Building an index also leverages the columnar storage and only scans the
+columns required to build the index. Indexes work pretty much the same
+way as with heap tables: data is inserted into the table and the TID for
+the tuple gets stored in the index. On index scans, the required column
+Btrees are scanned for the given TID and the datums are passed back using
+a virtual tuple. Since only the meta-column is used to perform the
+visibility check, only visible tuples' data is fetched from the rest of
+the Btrees.
+
+Page Format
+-----------
+A ZedStore table contains different kinds of pages, all in the same
+file. Kinds of pages are meta-page, per-attribute btree internal and
+leaf pages, UNDO log page, and toast pages. Each page type has its own
+distinct data storage format.
+
+META Page:
+Block 0 is always a metapage. It contains the block numbers of the
+other data structures stored within the file, like the per-attribute
+B-trees, and the UNDO log.
+
+BTREE Page:
+
+UNDO Page:
+
+TOAST Page:
+
+
+Free Pages Map
+--------------
+
+The Free Page Map structure used in Zedstore is simply a linked list of unused
+blocks. There are multiple free page maps, with one free page map for the table,
+to cater to allocation requests for UNDO pages. There is a free page map for the
+tid tree and a separate free page map for each attribute (we use the term
+attribute-level free page map for the free page map for the tid tree as well,
+considering it as attribute 0).
+The block number of the first unused page for each of these lists is stored
+in the metapage. Each unused block contains a link to the next unused
+block in the chain. When a block becomes unused, it is added to the
+head of the list.
+
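+For illustration, allocating a block from one of these free lists is
+conceptually just popping the head of a singly-linked list (the names in
+this sketch are made up, not the actual functions):
+
+    /* sketch: allocate a page from an attribute-level free page map */
+    blkno = metapage->fpm_head[attno];            /* list head, in metapage */
+    if (blkno != InvalidBlockNumber)
+        metapage->fpm_head[attno] = read_next_free_link(rel, blkno);
+    else
+        blkno = extend_relation(rel, zedstore_rel_extension_factor);
+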
+By batching page allocations and by having attribute-level free page maps, we
+ensure that each attribute B-tree gets more contiguous ranges of blocks, even under
+concurrent inserts to the same table to allow I/O readahead to be effective.
+The batching factor we use is the reloption: zedstore_rel_extension_factor.
+
+TODO: That doesn't scale very well, and the pages are reused in LIFO
+order. We'll probably want to do something smarter to avoid making the
+metapage a bottleneck for this.
+
+
+Enhancement ideas / alternative designs
+---------------------------------------
+
+Instead of compressing all the tuples on a page in one batch, store a
+small "dictionary", e.g. in page header or meta page or separate
+dedicated page, and use it to compress tuple by tuple. That could make
+random reads and updates of individual tuples faster. Need to find how
+to create the dictionary first.
+
+Only compressed pages are cached in the page cache. If we want to cache
+uncompressed pages instead, or in addition to that, we need to invent
+a whole new kind of a buffer cache that can deal with the
+variable-size blocks. For a first version, I think we can live without
+it.
+
+Instead of storing all columns in the same file, we could store them
+in separate files (separate forks?). That would allow immediate reuse
+of space, after dropping a column. It's not clear how to use an FSM in
+that case, though. Might have to implement an integrated FSM,
+too. (Which might not be a bad idea, anyway).
+
+Design allows for hybrid row-column store, where some columns are
+stored together, and others have a dedicated B-tree. Need to have user
+facing syntax to allow specifying how to group the columns.
+
+Salient points for the design
+------------------------------
+
+* Lay out the data/tuples in a mapped fashion instead of keeping the
+logical-to-physical mapping separate from the actual data. So, keep all
+the meta-data and data logically in a single stream of the file, avoiding
+the need for separate forks/files to store meta-data and data.
+
+* Handle/treat operations at tuple level and not block level.
+
+* Stick to fixed-size physical blocks. Variable-size blocks (for
+possibly higher compression ratios) pose the need for increased
+logical-to-physical mapping maintenance, plus restrictions on concurrency
+of writes and reads to files. Hence, adopt compression to fit fixed-size
+blocks instead of the other way round.
+
+
+Predicate locking
+-----------------
+
+Predicate locks, to support SERIALIZABLE transactions, are taken just
+like with the heap. From README-SSI:
+
+* For a table scan, the entire relation will be locked.
+
+* Each tuple read which is visible to the reading transaction will be
+locked, whether or not it meets selection criteria; except that there
+is no need to acquire an SIREAD lock on a tuple when the transaction
+already holds a write lock on any tuple representing the row, since a
+rw-conflict would also create a ww-dependency which has more
+aggressive enforcement and thus will prevent any anomaly.
+
+* Modifying a heap tuple creates a rw-conflict with any transaction
+that holds a SIREAD lock on that tuple, or on the page or relation
+that contains it.
+
+* Inserting a new tuple creates a rw-conflict with any transaction
+holding a SIREAD lock on the entire relation. It doesn't conflict with
+page-level locks, because page-level locks are only used to aggregate
+tuple locks. Unlike index page locks, they don't lock "gaps" on the
+page.
+
+
+ZedStore isn't block-based, so page-level locks really just mean a
+range of TIDs. They're only used to aggregate tuple locks.
diff --git src/backend/access/zedstore/zedstore_attpage.c src/backend/access/zedstore/zedstore_attpage.c
new file mode 100644
index 0000000000..52c2a5a1b4
--- /dev/null
+++ src/backend/access/zedstore/zedstore_attpage.c
@@ -0,0 +1,1068 @@
+/*
+ * zedstore_attpage.c
+ * Routines for handling attribute leaf pages.
+ *
+ * A Zedstore table consists of multiple B-trees, one for each attribute. The
+ * functions in this file deal with a scan of one attribute tree.
+ *
+ * Operations:
+ *
+ * - Sequential scan in TID order
+ * - must be efficient with scanning multiple trees in sync
+ *
+ * - random lookups, by TID (for index scan)
+ *
+ * - range scans by TID (for bitmap index scan)
+ *
+ * NOTES:
+ * - Locking order: child before parent, left before right
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_attpage.c
+ */
+#include "postgres.h"
+
+#include "access/xlogutils.h"
+#include "access/zedstore_compression.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+typedef struct
+{
+ Page currpage;
+
+ /* first page writes over the old buffer, subsequent pages get newly-allocated buffers */
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ AttrNumber attno;
+ zstid hikey;
+
+ BlockNumber nextblkno;
+} zsbt_attr_repack_context;
+
+/* prototypes for local functions */
+static ZSAttStream *get_page_lowerstream(Page page);
+static ZSAttStream *get_page_upperstream(Page page);
+static void wal_log_attstream_change(Relation rel, Buffer buf, ZSAttStream *attstream, bool is_upper,
+ uint16 begin_offset, uint16 end_offset);
+
+static void zsbt_attr_repack_init(zsbt_attr_repack_context *cxt, AttrNumber attno, Buffer oldbuf, bool append);
+static void zsbt_attr_repack_newpage(zsbt_attr_repack_context *cxt, zstid nexttid);
+static void zsbt_attr_pack_attstream(Form_pg_attribute attr, attstream_buffer *buf, Page page);
+static void zsbt_attr_repack_writeback_pages(zsbt_attr_repack_context *cxt,
+ Relation rel, AttrNumber attno,
+ Buffer oldbuf);
+
+/* ----------------------------------------------------------------
+ * Public interface
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Begin a scan of an attribute btree.
+ *
+ * Fills in the scan struct in *scan.
+ */
+void
+zsbt_attr_begin_scan(Relation rel, TupleDesc tdesc, AttrNumber attno,
+ ZSAttrTreeScan *scan)
+{
+ scan->rel = rel;
+ scan->attno = attno;
+ scan->attdesc = TupleDescAttr(tdesc, attno - 1);
+
+ scan->context = CurrentMemoryContext;
+
+ init_attstream_decoder(&scan->decoder, scan->attdesc->attbyval, scan->attdesc->attlen);
+ scan->decoder.tmpcxt = AllocSetContextCreate(scan->context,
+ "ZedstoreAMAttrScanContext",
+ ALLOCSET_DEFAULT_SIZES);
+
+ scan->decoder_last_idx = -1;
+
+ scan->active = true;
+ scan->lastbuf = InvalidBuffer;
+ scan->lastoff = InvalidOffsetNumber;
+}
+
+void
+zsbt_attr_end_scan(ZSAttrTreeScan *scan)
+{
+ if (!scan->active)
+ return;
+
+ if (scan->lastbuf != InvalidBuffer)
+ ReleaseBuffer(scan->lastbuf);
+
+ scan->active = false;
+
+ destroy_attstream_decoder(&scan->decoder);
+}
+
+/*
+ * Load scan->array_* arrays with data that contains 'nexttid'.
+ *
+ * Return true if data containing 'nexttid' was found. The tid/Datum/isnull
+ * data are placed into scan->array_* fields. The data is valid until the
+ * next call of this function. Note that the item's range contains 'nexttid',
+ * but its TID list might not include the exact TID itself. The caller
+ * must scan the array to check for that.
+ *
+ * This is normally not used directly. Use the zsbt_attr_fetch() wrapper,
+ * instead.
+ */
+bool
+zsbt_attr_scan_fetch_array(ZSAttrTreeScan *scan, zstid nexttid)
+{
+ Buffer buf;
+ Page page;
+ ZSAttStream *stream;
+
+ if (!scan->active)
+ return false;
+
+ /*
+ * If the TID we're looking for is in the current attstream, we just
+ * need to decode more of it.
+ *
+ * TODO: We could restart the decoder, if the current attstream
+ * covers the target TID, but we already decoded past it.
+ */
+ if (scan->decoder.pos < scan->decoder.chunks_len &&
+ nexttid >= scan->decoder.firsttid &&
+ nexttid <= scan->decoder.lasttid)
+ {
+ if (nexttid <= scan->decoder.prevtid)
+ {
+ /*
+ * The target TID is in this attstream, but we already scanned
+ * past it. Restart the decoder.
+ */
+ scan->decoder.pos = 0;
+ scan->decoder.prevtid = 0;
+ }
+
+ /* Advance the scan, until we have reached the target TID */
+ while (nexttid > scan->decoder.prevtid)
+ (void) decode_attstream_cont(&scan->decoder);
+
+ if (scan->decoder.num_elements == 0 ||
+ nexttid < scan->decoder.tids[0])
+ return false;
+ else
+ return true;
+ }
+
+ /* reset the decoder */
+ scan->decoder.num_elements = 0;
+ scan->decoder.chunks_len = 0;
+ scan->decoder.pos = 0;
+ scan->decoder.prevtid = 0;
+
+ /*
+ * Descend the tree, find and lock the leaf page containing 'nexttid'.
+ */
+ buf = zsbt_find_and_lock_leaf_containing_tid(scan->rel, scan->attno,
+ scan->lastbuf, nexttid,
+ BUFFER_LOCK_SHARE);
+ scan->lastbuf = buf;
+ if (!BufferIsValid(buf))
+ {
+ /*
+ * Completely empty tree. This should only happen at the beginning
+ * of a scan - a tree cannot go missing after it's been created -
+ * but we don't currently check for that.
+ */
+ return false;
+ }
+ page = BufferGetPage(buf);
+
+ /* See if the upper stream covers the target tid */
+ stream = get_page_upperstream(page);
+ if (stream && nexttid <= stream->t_lasttid)
+ {
+ decode_attstream_begin(&scan->decoder, stream);
+ }
+ /*
+ * How about the lower stream? (We assume that the upper stream is < lower
+ * stream, and there's no overlap).
+ */
+ else
+ {
+ stream = get_page_lowerstream(page);
+ if (stream && nexttid <= stream->t_lasttid)
+ {
+ /* If there is a match, it will be in this attstream */
+ decode_attstream_begin(&scan->decoder, stream);
+ }
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * We now have the attstream we need copied into scan->decoder (or not, if
+ * no covering attstream was found)
+ */
+ if (scan->decoder.pos < scan->decoder.chunks_len &&
+ nexttid >= scan->decoder.firsttid &&
+ nexttid <= scan->decoder.lasttid)
+ {
+ /* Advance the scan, until we have reached the target TID */
+ while (nexttid > scan->decoder.prevtid)
+ (void) decode_attstream_cont(&scan->decoder);
+
+ if (scan->decoder.num_elements == 0 ||
+ nexttid < scan->decoder.tids[0])
+ return false;
+ else
+ return true;
+ }
+ else
+ return false;
+}
+
+/*
+ * Remove data for the given TIDs from the attribute tree.
+ */
+void
+zsbt_attr_remove(Relation rel, AttrNumber attno, IntegerSet *tids)
+{
+ Form_pg_attribute attr = &rel->rd_att->attrs[attno - 1];
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ zstid nexttid;
+ MemoryContext oldcontext;
+ MemoryContext tmpcontext;
+ zstid *tids_to_remove;
+ int num_to_remove;
+ int allocated_size;
+
+ tids_to_remove = palloc(1000 * sizeof(zstid));
+ allocated_size = 1000;
+
+ tmpcontext = AllocSetContextCreate(CurrentMemoryContext,
+ "ZedstoreAMVacuumContext",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(tmpcontext);
+
+ intset_begin_iterate(tids);
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = InvalidZSTid;
+
+ while (nexttid < MaxPlusOneZSTid)
+ {
+ ZSAttStream *lowerstream;
+ ZSAttStream *upperstream;
+
+ buf = zsbt_descend(rel, attno, nexttid, 0, false);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /*
+ * We now have a page at hand, that (should) contain at least one
+ * of the TIDs we want to remove.
+ */
+ num_to_remove = 0;
+ while (nexttid < opaque->zs_hikey)
+ {
+ if (num_to_remove == allocated_size)
+ {
+ tids_to_remove = repalloc(tids_to_remove, (allocated_size * 2) * sizeof(zstid));
+ allocated_size *= 2;
+ }
+ tids_to_remove[num_to_remove++] = nexttid;
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+ }
+
+
+ /* Remove the data for those TIDs, and rewrite the page */
+ lowerstream = get_page_lowerstream(page);
+ upperstream = get_page_upperstream(page);
+
+ if (lowerstream || upperstream)
+ {
+ attstream_buffer upperbuf;
+ attstream_buffer lowerbuf;
+ attstream_buffer *newbuf;
+ zsbt_attr_repack_context cxt;
+
+ upperbuf.len = 0;
+ upperbuf.cursor = 0;
+ lowerbuf.len = 0;
+ lowerbuf.cursor = 0;
+
+ if (upperstream)
+ vacuum_attstream(rel, attno, &upperbuf, upperstream,
+ tids_to_remove, num_to_remove);
+
+ if (lowerstream)
+ vacuum_attstream(rel, attno, &lowerbuf, lowerstream,
+ tids_to_remove, num_to_remove);
+
+ if (upperbuf.len - upperbuf.cursor > 0 &&
+ lowerbuf.len - lowerbuf.cursor > 0)
+ {
+ merge_attstream_buffer(attr, &upperbuf, &lowerbuf);
+ newbuf = &upperbuf;
+ }
+ else if (upperbuf.len - upperbuf.cursor > 0)
+ newbuf = &upperbuf;
+ else
+ newbuf = &lowerbuf;
+
+ /*
+ * Now we have a list of non-overlapping items, containing all the old and
+ * new data. zsbt_attr_rewrite_page() takes care of storing them on the
+ * page, splitting the page if needed.
+ */
+ zsbt_attr_repack_init(&cxt, attno, buf, false);
+ if (newbuf->len - newbuf->cursor > 0)
+ {
+ /*
+ * Then, store them on the page, creating new pages as needed.
+ */
+ zsbt_attr_pack_attstream(attr, newbuf, cxt.currpage);
+ while (newbuf->cursor < newbuf->len)
+ {
+ zsbt_attr_repack_newpage(&cxt, newbuf->firsttid);
+ zsbt_attr_pack_attstream(attr, newbuf, cxt.currpage);
+ }
+ }
+ zsbt_attr_repack_writeback_pages(&cxt, rel, attno, buf);
+ /* zsbt_attr_repack_writeback_pages() unlocked and released the buffer */
+ }
+ else
+ UnlockReleaseBuffer(buf);
+
+ /*
+ * We can now free the decompression contexts. The pointers in the 'items' list
+ * point to decompression buffers, so we cannot free them until after writing out
+ * the pages.
+ */
+ MemoryContextReset(tmpcontext);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(tmpcontext);
+}
+
+/* ----------------------------------------------------------------
+ * Internal routines
+ * ----------------------------------------------------------------
+ */
+
+static ZSAttStream *
+get_page_lowerstream(Page page)
+{
+ int lowersize;
+ ZSAttStream *lowerstream;
+
+ /* Read old uncompressed (lower) attstream */
+ lowersize = ((PageHeader) page)->pd_lower - SizeOfPageHeaderData;
+ if (lowersize > SizeOfZSAttStreamHeader)
+ {
+ lowerstream = (ZSAttStream *) (((char *) page) + SizeOfPageHeaderData);
+ Assert((lowerstream)->t_size == lowersize);
+
+ /* by convention, lower stream is always uncompressed */
+ Assert((lowerstream->t_flags & ATTSTREAM_COMPRESSED) == 0);
+ }
+ else
+ {
+ Assert (lowersize == 0);
+ lowerstream = NULL;
+ }
+
+ return lowerstream;
+}
+
+static ZSAttStream *
+get_page_upperstream(Page page)
+{
+ int uppersize;
+ ZSAttStream *upperstream;
+
+ uppersize = ((PageHeader) page)->pd_special - ((PageHeader) page)->pd_upper;
+ if (uppersize > SizeOfZSAttStreamHeader)
+ {
+ upperstream = (ZSAttStream *) (((char *) page) + ((PageHeader) page)->pd_upper);
+ Assert(upperstream->t_size == uppersize);
+ /* by convention, upper stream is always compressed */
+ Assert((upperstream->t_flags & ATTSTREAM_COMPRESSED) != 0);
+ }
+ else
+ {
+ upperstream = NULL;
+ Assert(uppersize == 0);
+ }
+ return upperstream;
+}
+
+/*
+ * Add data to attribute leaf pages.
+ *
+ * 'attbuf' contains the new attribute data that is to be added to the page.
+ *
+ * This function writes as much data as is convenient; typically, as much
+ * as fits on a single page, after compression. Some data is always written.
+ * If you want to flush all data to disk, call zsbt_attr_add() repeatedly,
+ * until 'attbuf' is empty.
+ *
+ * 'attbuf' is updated, so that on exit, it contains the data that remains,
+ * i.e. data that was not yet written out.
+ *
+ * This function handles decompressing and recompressing items, and splitting
+ * the page, as needed.
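+ *
+ * A hypothetical caller that wants to flush everything could simply loop
+ * (sketch only):
+ *
+ *     while (attbuf->len - attbuf->cursor > 0)
+ *         zsbt_attr_add(rel, attno, attbuf);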
+ */
+void
+zsbt_attr_add(Relation rel, AttrNumber attno, attstream_buffer *attbuf)
+{
+ Form_pg_attribute attr = &rel->rd_att->attrs[attno - 1];
+ Buffer origbuf;
+ Page origpage;
+ ZSBtreePageOpaque *origpageopaque;
+ ZSAttStream *lowerstream;
+ ZSAttStream *upperstream;
+ int lowerstreamsz;
+ uint16 orig_pd_lower;
+ uint16 new_pd_lower;
+ zstid firstnewtid;
+ zstid splittid;
+ zsbt_attr_repack_context cxt;
+ bool split = false;
+
+ Assert (attbuf->len - attbuf->cursor > 0);
+
+ /*
+ * Find the right place to insert the new data.
+ */
+ origbuf = zsbt_descend(rel, attno, attbuf->firsttid, 0, false);
+ origpage = BufferGetPage(origbuf);
+ origpageopaque = ZSBtreePageGetOpaque(origpage);
+ splittid = origpageopaque->zs_hikey - 1;
+
+ Assert (attbuf->firsttid <= splittid);
+
+ lowerstream = get_page_lowerstream(origpage);
+ upperstream = get_page_upperstream(origpage);
+
+ /* Is there space to add the new attstream as it is? */
+ orig_pd_lower = ((PageHeader) origpage)->pd_lower;
+
+ if (lowerstream == NULL)
+ {
+ /*
+ * No existing uncompressed data on page, see if the new data can fit
+ * into the uncompressed area.
+ */
+ Assert(orig_pd_lower == SizeOfPageHeaderData);
+
+ if (SizeOfZSAttStreamHeader + (attbuf->len - attbuf->cursor) <= PageGetExactFreeSpace(origpage))
+ {
+ ZSAttStream newhdr;
+ attstream_buffer newattbuf;
+
+ newhdr.t_flags = 0;
+ newhdr.t_decompressed_size = 0;
+ newhdr.t_decompressed_bufsize = 0;
+
+ if (attbuf->lasttid > splittid)
+ {
+ /*
+ * We should not accommodate items with tids greater than the
+ * hikey of the target leaf page. So if our attbuf does have such
+ * items, we split the attbuf into two buffers at tid: hikey - 1.
+ * This will ensure that we only insert the tids that fit into
+ * the page's range.
+ */
+ split_attstream_buffer(attbuf, &newattbuf, splittid);
+ split = true;
+ }
+
+ newhdr.t_size = SizeOfZSAttStreamHeader + (attbuf->len - attbuf->cursor);
+ newhdr.t_lasttid = attbuf->lasttid;
+ new_pd_lower = SizeOfPageHeaderData + newhdr.t_size;
+
+ START_CRIT_SECTION();
+
+ memcpy(origpage + SizeOfPageHeaderData, &newhdr, SizeOfZSAttStreamHeader);
+ memcpy(origpage + SizeOfPageHeaderData + SizeOfZSAttStreamHeader,
+ attbuf->data + attbuf->cursor, attbuf->len - attbuf->cursor);
+ ((PageHeader) origpage)->pd_lower = new_pd_lower;
+
+ MarkBufferDirty(origbuf);
+
+ if (RelationNeedsWAL(rel))
+ wal_log_attstream_change(rel, origbuf,
+ (ZSAttStream *) (origpage + SizeOfPageHeaderData), false,
+ orig_pd_lower, new_pd_lower);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(origbuf);
+ if (split)
+ {
+ /*
+ * Make attbuf represent the chunks that were on the right hand
+ * side of the split. These are the chunks that are left over.
+ */
+ pfree(attbuf->data);
+ memcpy(attbuf, &newattbuf, sizeof(attstream_buffer));
+ }
+ else
+ attbuf->cursor = attbuf->len;
+ return;
+ }
+ }
+ else
+ {
+ /*
+ * Try to append the new data to the existing uncompressed data first
+ */
+ START_CRIT_SECTION();
+
+ if (attbuf->lasttid <= splittid &&
+ append_attstream_inplace(attr, lowerstream,
+ PageGetExactFreeSpace(origpage),
+ attbuf))
+ {
+ new_pd_lower = SizeOfPageHeaderData + lowerstream->t_size;
+
+ /* fast path succeeded */
+ MarkBufferDirty(origbuf);
+
+ /*
+ * NOTE: in theory, if append_attstream_inplace() was smarter, it might
+ * modify the existing data. The new combined stream might even be smaller
+ * than the old stream, if the last codewords are packed more tightly.
+ * But at the moment, append_attstream_inplace() doesn't do anything
+ * that smart. So we assume that the existing data didn't change, and we
+ * only need to WAL log the new data at the end of the stream.
+ */
+ ((PageHeader) origpage)->pd_lower = new_pd_lower;
+
+ if (RelationNeedsWAL(rel))
+ wal_log_attstream_change(rel, origbuf, lowerstream, false,
+ orig_pd_lower, new_pd_lower);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(origbuf);
+ return;
+ }
+
+ END_CRIT_SECTION();
+ }
+
+ /*
+ * If the orig page contains already-compressed data, and it is almost full,
+ * leave the old data untouched and create a new page. This avoids repeatedly
+ * recompressing pages when inserting rows one by one. Somewhat arbitrarily,
+ * we put the threshold at 2.5%.
+ *
+ * TODO: skipping allocating new page here if attbuf->lasttid > splittid,
+ * because we don't know how to handle that without calling merge_attstream()
+ */
+ firstnewtid = attbuf->firsttid;
+ lowerstreamsz = lowerstream ? lowerstream->t_size : 0;
+ if (attbuf->lasttid <= splittid &&
+ PageGetExactFreeSpace(origpage) + lowerstreamsz < (int) (BLCKSZ * 0.025) &&
+ (lowerstream == NULL || firstnewtid > lowerstream->t_lasttid) &&
+ upperstream && firstnewtid > upperstream->t_lasttid)
+ {
+ /*
+ * Keep the original page unmodified, and allocate a new page
+ * for the new data.
+ */
+ zsbt_attr_repack_init(&cxt, attno, origbuf, true);
+ zsbt_attr_repack_newpage(&cxt, attbuf->firsttid);
+
+ /* write out the new data (or part of it) */
+ zsbt_attr_pack_attstream(attr, attbuf, cxt.currpage);
+ }
+ else
+ {
+ /*
+ * Rewrite existing data on the page, and add as much of the
+ * new data as fits. But make sure that we write at least one
+ * chunk of new data, otherwise we might get stuck in a loop
+ * without making any progress.
+ */
+ zstid mintid = attbuf->firsttid;
+ attstream_buffer rightattbuf;
+
+#if 0
+ if (upperstream && lowerstream)
+ elog(NOTICE, "merging upper %d lower %d new %d", upperstream->t_decompressed_size, lowerstream->t_size, attbuf->len - attbuf->cursor);
+ else if (upperstream)
+ elog(NOTICE, "merging upper %d new %d", upperstream->t_decompressed_size, attbuf->len - attbuf->cursor);
+ else if (lowerstream)
+ elog(NOTICE, "merging lower %d new %d", lowerstream->t_size, attbuf->len - attbuf->cursor);
+ else
+ elog(NOTICE, "merging new %d", attbuf->len - attbuf->cursor);
+#endif
+
+ /* merge the old items to the working buffer */
+ if (upperstream && lowerstream)
+ {
+ attstream_buffer tmpbuf;
+
+ mintid = Max(mintid, lowerstream->t_lasttid);
+ mintid = Max(mintid, upperstream->t_lasttid);
+
+ init_attstream_buffer_from_stream(&tmpbuf, attr->attbyval,
+ attr->attlen, upperstream, GetMemoryChunkContext(attbuf->data));
+
+ merge_attstream(attr, attbuf, lowerstream);
+
+ merge_attstream_buffer(attr, &tmpbuf, attbuf);
+
+ pfree(attbuf->data);
+ *attbuf = tmpbuf;
+ }
+ else if (lowerstream)
+ {
+ mintid = Max(mintid, lowerstream->t_lasttid);
+ merge_attstream(attr, attbuf, lowerstream);
+ }
+ else if (upperstream)
+ {
+ mintid = Max(mintid, upperstream->t_lasttid);
+ merge_attstream(attr, attbuf, upperstream);
+ }
+
+ /*
+ * Now we have a list of non-overlapping items, containing all the old and
+ * new data. Write it out, making sure that at least all the old data is
+ * written out (otherwise, we'd momentarily remove existing data!)
+ */
+ zsbt_attr_repack_init(&cxt, attno, origbuf, false);
+
+ if (attbuf->lasttid > splittid)
+ {
+ /*
+ * We should not accommodate items with tids greater than the
+ * hikey of the target leaf page. So if our attbuf does have such
+ * items, we split the attbuf into two buffers at tid: hikey - 1.
+ * This will ensure that we only insert the tids that fit into
+ * the page's range.
+ */
+ split_attstream_buffer(attbuf, &rightattbuf, splittid);
+ split = true;
+ }
+
+ zsbt_attr_pack_attstream(attr, attbuf, cxt.currpage);
+
+ while (attbuf->cursor < attbuf->len && (split || attbuf->firsttid <= mintid))
+ {
+ zsbt_attr_repack_newpage(&cxt, attbuf->firsttid);
+ zsbt_attr_pack_attstream(attr, attbuf, cxt.currpage);
+ }
+
+ if (split)
+ {
+ /*
+ * Make attbuf represent the chunks that were on the right hand
+ * side of the split. These are the chunks that are left over.
+ */
+ Assert(attbuf->cursor == attbuf->len);
+ pfree(attbuf->data);
+ memcpy(attbuf, &rightattbuf, sizeof(attstream_buffer));
+ }
+ }
+ zsbt_attr_repack_writeback_pages(&cxt, rel, attno, origbuf);
+}
+
+/*
+ * Repacker routines
+ *
+ * Usage:
+ *
+ * 1. Call zsbt_attr_repack_init() to start a repacking operation.
+ * 2. Call zsbt_attr_pack_attstream() to compress and chop a page-sized slice
+ * of incoming data, and store it on the current page copy.
+ * 3. Call zsbt_attr_repack_newpage() to allocate a new page, if you want
+ * to compress and write more data, and goto 2.
+ * 4. Call zsbt_attr_repack_writeback_pages() to finish the repacking
+ * operation, making all on-disk changes.
+ *
+ * Steps 1-3 happen on in-memory page copies. No data on-disk is
+ * modified until step 4.
+ */
+static void
+zsbt_attr_repack_init(zsbt_attr_repack_context *cxt, AttrNumber attno, Buffer origbuf, bool append)
+{
+ Page origpage;
+ ZSBtreePageOpaque *origopaque;
+ Page newpage;
+ ZSBtreePageOpaque *newopaque;
+ zs_split_stack *stack;
+
+ origpage = BufferGetPage(origbuf);
+ origopaque = ZSBtreePageGetOpaque(origpage);
+
+ cxt->stack_head = cxt->stack_tail = NULL;
+ cxt->attno = attno;
+ cxt->hikey = origopaque->zs_hikey;
+ cxt->nextblkno = origopaque->zs_next;
+
+ newpage = (Page) palloc(BLCKSZ);
+ if (append)
+ memcpy(newpage, origpage, BLCKSZ);
+ else
+ PageInit(newpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ cxt->currpage = newpage;
+
+ stack = zs_new_split_stack_entry(InvalidBuffer, /* will be assigned later */
+ newpage);
+ if (append)
+ stack->special_only = true;
+ cxt->stack_head = stack;
+ cxt->stack_tail = stack;
+
+ newopaque = ZSBtreePageGetOpaque(newpage);
+ newopaque->zs_attno = cxt->attno;
+ newopaque->zs_next = InvalidBlockNumber; /* filled in later */
+ newopaque->zs_lokey = origopaque->zs_lokey;
+ newopaque->zs_hikey = cxt->hikey; /* overwritten later, if this is not last page */
+ newopaque->zs_level = 0;
+ newopaque->zs_flags = origopaque->zs_flags & ZSBT_ROOT;
+ newopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+}
+
+static void
+zsbt_attr_repack_newpage(zsbt_attr_repack_context *cxt, zstid nexttid)
+{
+ Page newpage;
+ ZSBtreePageOpaque *newopaque;
+ zs_split_stack *stack;
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(cxt->currpage);
+
+ /* set the last tid on previous page */
+ oldopaque->zs_hikey = nexttid;
+
+ newpage = (Page) palloc(BLCKSZ);
+ PageInit(newpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+
+ stack = zs_new_split_stack_entry(InvalidBuffer, /* will be assigned later */
+ newpage);
+ cxt->stack_tail->next = stack;
+ cxt->stack_tail = stack;
+
+ cxt->currpage = newpage;
+
+ newopaque = ZSBtreePageGetOpaque(newpage);
+ newopaque->zs_attno = cxt->attno;
+ newopaque->zs_next = InvalidBlockNumber; /* filled in later */
+ newopaque->zs_lokey = nexttid;
+ newopaque->zs_hikey = cxt->hikey; /* overwritten later, if this is not last page */
+ newopaque->zs_level = 0;
+ newopaque->zs_flags = 0;
+ newopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+}
+
+/*
+ * Compress and write as much of the data from 'attbuf' onto 'page' as fits.
+ * 'attbuf' is updated in place, so that on exit, it contains the remaining chunks
+ * that did not fit on 'page'.
+ */
+static void
+zsbt_attr_pack_attstream(Form_pg_attribute attr, attstream_buffer *attbuf,
+ Page page)
+{
+ Size freespc;
+ int orig_bytes;
+ char *pstart;
+ char *pend;
+ char *dst;
+ int complete_chunks_len;
+ zstid lasttid = 0;
+ int srcSize;
+ int compressed_size;
+ ZSAttStream *hdr;
+ char compressbuf[BLCKSZ];
+
+ /* this should only be called on an empty page */
+ Assert(((PageHeader) page)->pd_lower == SizeOfPageHeaderData);
+ freespc = PageGetExactFreeSpace(page);
+
+ pstart = &attbuf->data[attbuf->cursor];
+ pend = &attbuf->data[attbuf->len];
+ orig_bytes = pend - pstart;
+
+ freespc -= SizeOfZSAttStreamHeader;
+
+ /*
+ * Try compressing.
+ *
+ * Note: we try compressing, even if the data fits uncompressed. That might seem
+ * like a waste of time, but compression is very cheap, and this leaves more free
+ * space on the page for new additions.
+ */
+ srcSize = orig_bytes;
+ compressed_size = zs_compress_destSize(pstart, compressbuf, &srcSize, freespc);
+ if (compressed_size > 0)
+ {
+ /* store compressed, in upper stream */
+ int bytes_compressed = srcSize;
+
+ /*
+ * How many complete chunks did we compress?
+ */
+ if (bytes_compressed == orig_bytes)
+ {
+ complete_chunks_len = orig_bytes;
+ lasttid = attbuf->lasttid;
+ }
+ else
+ complete_chunks_len =
+ find_chunk_for_offset(attbuf, bytes_compressed, &lasttid);
+
+ if (complete_chunks_len == 0)
+ elog(ERROR, "could not fit any chunks on page");
+
+ dst = (char *) page + ((PageHeader) page)->pd_special - (SizeOfZSAttStreamHeader + compressed_size);
+ hdr = (ZSAttStream *) dst;
+ hdr->t_size = SizeOfZSAttStreamHeader + compressed_size;
+ hdr->t_flags = ATTSTREAM_COMPRESSED;
+ hdr->t_decompressed_size = complete_chunks_len;
+ hdr->t_decompressed_bufsize = bytes_compressed;
+ hdr->t_lasttid = lasttid;
+
+ dst = hdr->t_payload;
+ memcpy(dst, compressbuf, compressed_size);
+ ((PageHeader) page)->pd_upper -= hdr->t_size;
+ }
+ else
+ {
+ /* Store uncompressed, in lower stream. */
+
+ /*
+ * How many complete chunks can we fit?
+ */
+ if (orig_bytes < freespc)
+ {
+ complete_chunks_len = orig_bytes;
+ lasttid = attbuf->lasttid;
+ }
+ else
+ complete_chunks_len =
+ find_chunk_for_offset(attbuf, freespc, &lasttid);
+
+ if (complete_chunks_len == 0)
+ elog(ERROR, "could not fit any chunks on page");
+
+ hdr = (ZSAttStream *) ((char *) page + SizeOfPageHeaderData);
+
+ hdr->t_size = SizeOfZSAttStreamHeader + complete_chunks_len;
+ hdr->t_flags = 0;
+ hdr->t_decompressed_size = 0;
+ hdr->t_decompressed_bufsize = 0;
+ hdr->t_lasttid = lasttid;
+
+ dst = hdr->t_payload;
+ memcpy(dst, pstart, complete_chunks_len);
+ ((PageHeader) page)->pd_lower += hdr->t_size;
+ }
+
+ /*
+ * Chop off the part of the chunk stream in 'attbuf' that we wrote out.
+ */
+ trim_attstream_upto_offset(attbuf, complete_chunks_len, lasttid);
+}
+
+static void
+zsbt_attr_repack_writeback_pages(zsbt_attr_repack_context *cxt,
+ Relation rel, AttrNumber attno,
+ Buffer oldbuf)
+{
+ Page oldpage = BufferGetPage(oldbuf);
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(oldpage);
+ zs_split_stack *stack;
+ List *downlinks = NIL;
+
+ /*
+ * Ok, we now have a list of pages, to replace the original page, as private
+ * in-memory copies. Allocate buffers for them, and write them out.
+ *
+ * allocate all the pages before entering critical section, so that
+ * out-of-disk-space doesn't lead to PANIC
+ */
+ stack = cxt->stack_head;
+ Assert(stack->buf == InvalidBuffer);
+ stack->buf = oldbuf;
+ while (stack->next)
+ {
+ Page thispage = stack->page;
+ ZSBtreePageOpaque *thisopaque = ZSBtreePageGetOpaque(thispage);
+ ZSBtreeInternalPageItem *downlink;
+ Buffer nextbuf;
+
+ Assert(stack->next->buf == InvalidBuffer);
+
+ nextbuf = zspage_getnewbuf(rel, attno);
+ stack->next->buf = nextbuf;
+
+ thisopaque->zs_next = BufferGetBlockNumber(nextbuf);
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = thisopaque->zs_hikey;
+ downlink->childblk = BufferGetBlockNumber(nextbuf);
+ downlinks = lappend(downlinks, downlink);
+
+ stack = stack->next;
+ }
+ /* last one in the chain */
+ ZSBtreePageGetOpaque(stack->page)->zs_next = cxt->nextblkno;
+
+ /* If we had to split, insert downlinks for the new pages. */
+ if (cxt->stack_head->next)
+ {
+ oldopaque = ZSBtreePageGetOpaque(cxt->stack_head->page);
+
+ if ((oldopaque->zs_flags & ZSBT_ROOT) != 0)
+ {
+ ZSBtreeInternalPageItem *downlink;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = MinZSTid;
+ downlink->childblk = BufferGetBlockNumber(cxt->stack_head->buf);
+ downlinks = lcons(downlink, downlinks);
+
+ cxt->stack_tail->next = zsbt_newroot(rel, attno, oldopaque->zs_level + 1, downlinks);
+
+ /* clear the ZSBT_ROOT flag on the old root page */
+ oldopaque->zs_flags &= ~ZSBT_ROOT;
+ }
+ else
+ {
+ cxt->stack_tail->next = zsbt_insert_downlinks(rel, attno,
+ oldopaque->zs_lokey,
+ BufferGetBlockNumber(oldbuf),
+ oldopaque->zs_level + 1,
+ downlinks);
+ }
+ /* note: stack_tail is not the real tail anymore */
+ }
+
+ /* Finally, overwrite all the pages we had to modify */
+ zs_apply_split_changes(rel, cxt->stack_head, NULL, attno);
+}
+
+static void
+wal_log_attstream_change(Relation rel, Buffer buf, ZSAttStream *attstream,
+ bool is_upper, uint16 begin_offset, uint16 end_offset)
+{
+ /*
+ * log only the modified portion.
+ */
+ Page page = BufferGetPage(buf);
+ XLogRecPtr recptr;
+ wal_zedstore_attstream_change xlrec;
+#ifdef USE_ASSERT_CHECKING
+ uint16 pd_lower = ((PageHeader) page)->pd_lower;
+ uint16 pd_upper = ((PageHeader) page)->pd_upper;
+ uint16 pd_special = ((PageHeader) page)->pd_special;
+#endif
+
+ Assert(begin_offset < end_offset);
+
+ memset(&xlrec, 0, sizeof(xlrec)); /* clear padding */
+ xlrec.is_upper = is_upper;
+
+ xlrec.new_attstream_size = attstream->t_size;
+ xlrec.new_decompressed_size = attstream->t_decompressed_size;
+ xlrec.new_decompressed_bufsize = attstream->t_decompressed_bufsize;
+ xlrec.new_lasttid = attstream->t_lasttid;
+
+ xlrec.begin_offset = begin_offset;
+ xlrec.end_offset = end_offset;
+
+ if (is_upper)
+ Assert(begin_offset >= pd_upper && end_offset <= pd_special);
+ else
+ Assert(begin_offset >= SizeOfPageHeaderData && end_offset <= pd_lower);
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalAttstreamChange);
+ XLogRegisterBufData(0, (char *) page + begin_offset, end_offset - begin_offset);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_ATTSTREAM_CHANGE);
+
+ PageSetLSN(page, recptr);
+}
+
+void
+zsbt_attstream_change_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_attstream_change *xlrec =
+ (wal_zedstore_attstream_change *) XLogRecGetData(record);
+ Buffer buffer;
+
+ if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+ {
+ Page page = (Page) BufferGetPage(buffer);
+ Size datasz;
+ char *data = XLogRecGetBlockData(record, 0, &datasz);
+ ZSAttStream *attstream;
+
+ Assert(datasz == xlrec->end_offset - xlrec->begin_offset);
+
+ if (xlrec->is_upper)
+ {
+ /*
+ * In the upper stream, if the size changes, the old data is moved
+ * to begin at pd_upper, and then the new data is applied.
+ *
+ * XXX: we could be much smarter about this, and not move data that
+ * we will overwrite on the next line.
+ */
+ uint16 pd_special = ((PageHeader) page)->pd_special;
+ uint16 new_pd_upper = pd_special - xlrec->new_attstream_size;
+ uint16 old_pd_upper = ((PageHeader) page)->pd_upper;
+ uint16 old_size = pd_special - old_pd_upper;
+ uint16 new_size = pd_special - new_pd_upper;
+
+ memmove(page + new_pd_upper, page + old_pd_upper, Min(old_size, new_size));
+
+ ((PageHeader) page)->pd_upper = new_pd_upper;
+ }
+ else
+ {
+ uint16 new_pd_lower = SizeOfPageHeaderData + xlrec->new_attstream_size;
+
+ ((PageHeader) page)->pd_lower = new_pd_lower;
+ }
+
+ memcpy(page + xlrec->begin_offset, data, datasz);
+
+ /*
+ * Finally, adjust the size in the attstream header to match.
+ * (If the replacement data in the WAL record covered the attstream
+ * header, this is unnecessary but harmless.)
+ */
+ attstream = (ZSAttStream *) (
+ xlrec->is_upper ? (page + ((PageHeader) page)->pd_upper) :
+ (page + SizeOfPageHeaderData));
+ attstream->t_size = xlrec->new_attstream_size;
+ attstream->t_decompressed_size = xlrec->new_decompressed_size;
+ attstream->t_decompressed_bufsize = xlrec->new_decompressed_bufsize;
+ attstream->t_lasttid = xlrec->new_lasttid;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+}
diff --git src/backend/access/zedstore/zedstore_attstream.c src/backend/access/zedstore/zedstore_attstream.c
new file mode 100644
index 0000000000..a25946c677
--- /dev/null
+++ src/backend/access/zedstore/zedstore_attstream.c
@@ -0,0 +1,2523 @@
+/*
+ * zedstore_attstream.c
+ * Routines for packing datums into "attribute streams", to be stored
+ * on attribute tree leaf pages.
+ *
+ * An attribute stream consists of "chunks", where one chunk contains
+ * the TIDs of 1-60 datums, packed in a compact form, and their datums.
+ * Each chunk begins with a 64-bit codeword, which contains the TIDs
+ * in the chunk. The TIDs are delta-encoded, so we store the difference
+ * between each TID and the previous TID in the stream, and the differences
+ * are packed into 64-bit codewords using a variant of Simple-8b encoding.
+ *
+ * For the first TID in a stream, the "previous" TID is thought to be 0,
+ * so the first TID in the stream's first chunk actually stores the
+ * absolute TID.
+ *
+ * The encoding of TIDs in the codeword is a variant of the Simple-8b
+ * algorithm. 4 bits in each codeword determine a "mode", and the remaining
+ * 60 bits encode the TIDs in a format that depends on the mode. But we also
+ * use the codeword to encode the presence of NULLs, and in the case of
+ * variable-width attributes, the length of each datum in the chunk.
+ * Therefore, fixed- and variable-length attributes use different "modes".
+ *
+ * This chunked format has a few desirable properties:
+ *
+ * - It is compact for the common case of no or few gaps between TIDs.
+ * In the best case, one codeword can pack 60 consecutive TIDs in
+ * one 64-bit codeword. It also "degrades gracefully", as TIDs are
+ * removed, so deleting a few TIDs here and there doesn't usually make
+ * the overall stream larger.
+ *
+ * - It is relatively fast to encode and decode.
+ *
+ * - A long stream can be split easily. You can cut the stream at any chunk,
+ * having to re-encode only the first chunk after the split point. Also,
+ * each chunk is relatively small, which minimizes waste when a large
+ * stream needs to be chopped into page-sized pieces.
+ *
+ * - Two streams can easily be appended to each other, without having to
+ * re-encode the chunks (although it might not result in the most compact
+ * possible codewords.)
+ *
+ * Some drawbacks:
+ *
+ * - Random access is not possible. To access a particular TID, the stream
+ * must be read starting from the beginning.
+ *
+ *
+ * Most of the functions in this file deal with 'attstream_buffers'. An
+ * attstream_buffer is an in-memory representation of an attribute stream.
+ * It is a resizeable buffer that holds the chunks without the ZSAttStream
+ * header, but the struct carries enough information to construct the
+ * ZSAttStream header when needed.
+ *
+ * Another in-memory representation is 'attstream_decoder'. It holds state
+ * when reading an attribute stream.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_attstream.c
+ */
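+
+/*
+ * As a rough worked example of the encoding described above, consider
+ * storing the TIDs 1000, 1001, 1002 and 1010 (a fixed-width, non-NULL
+ * attribute is assumed here). The chunk stores the deltas 1000, 1, 1
+ * and 8; the first delta equals the absolute TID because the "previous"
+ * TID at the start of a stream is 0. All four deltas fit comfortably in
+ * one 64-bit codeword (which mode is chosen depends on how many further
+ * TIDs follow), and the four datums are stored right after the codeword.
+ */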
+#include "postgres.h"
+
+#include "access/detoast.h"
+#include "access/toast_internals.h"
+#include "access/zedstore_internal.h"
+#include "miscadmin.h"
+#include "utils/datum.h"
+#include "utils/memutils.h"
+
+#define TARGET_CHUNK_SIZE 128
+
+/*
+ * Internal functions that operate on a single chunk.
+ */
+static bool replace_first_tid_in_place(int attlen, zstid newtid, char *chunk);
+static int skip_chunk(int attlen, char *chunk, zstid *lasttid);
+static int get_chunk_length(int attlen, char *chunk);
+static zstid get_chunk_first_tid(int attlen, char *chunk);
+static int decode_chunk(bool attbyval, int attlen, zstid *lasttid, char *chunk,
+ int *num_elems, zstid *tids, Datum *datums, bool *isnulls);
+static int encode_chunk(attstream_buffer *dst, zstid prevtid, int ntids,
+ zstid *tids, Datum *datums, bool *isnulls);
+static int chunk_num_elements(char *chunk, int attlen);
+
+/* Other internal functions. */
+static void merge_attstream_guts(Form_pg_attribute attr, attstream_buffer *buffer, char *chunks2, int chunks2len, zstid lasttid2);
+#ifdef USE_ASSERT_CHECKING
+static void verify_attstream(attstream_buffer *buffer);
+#endif
+
+
+/* ----------------------------------------------------------------------------
+ * Decoder routines
+ *
+ * To read an attribute stream, initialize a "decoder" by calling
+ * init_attstream_decoder(). Then call decode_attstream_begin()
+ * to load the decoder with data. Read the data, one chunk at a time,
+ * by calling decode_attstream_cont(), until it returns false. Each
+ * call to decode_attstream_cont() fills the arrays in the decoder
+ * struct with the TIDs, Datums and isnull-flags in current chunk.
+ * ----------------------------------------------------------------------------
+ */
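+
+/*
+ * A minimal sketch of the typical calling pattern ('attbyval', 'attlen'
+ * and 'attstream' are assumed to come from the caller):
+ *
+ *   attstream_decoder decoder;
+ *
+ *   init_attstream_decoder(&decoder, attbyval, attlen);
+ *   decode_attstream_begin(&decoder, attstream);
+ *   while (decode_attstream_cont(&decoder))
+ *   {
+ *       for (int i = 0; i < decoder.num_elements; i++)
+ *       {
+ *           ... use decoder.tids[i], decoder.datums[i],
+ *               decoder.isnulls[i] ...
+ *       }
+ *   }
+ *   destroy_attstream_decoder(&decoder);
+ */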
+
+/*
+ * Initialize a decoder.
+ */
+void
+init_attstream_decoder(attstream_decoder *decoder, bool attbyval, int16 attlen)
+{
+ decoder->cxt = CurrentMemoryContext;
+ decoder->tmpcxt = NULL; /* can be set by caller */
+
+ decoder->attbyval = attbyval;
+ decoder->attlen = attlen;
+
+ decoder->chunks_buf = NULL;
+ decoder->chunks_buf_size = 0;
+ decoder->chunks_len = 0;
+ decoder->lasttid = InvalidZSTid;
+
+ decoder->pos = 0;
+ decoder->prevtid = InvalidZSTid;
+
+ decoder->num_elements = 0;
+}
+
+void
+destroy_attstream_decoder(attstream_decoder *decoder)
+{
+ if (decoder->chunks_buf)
+ pfree(decoder->chunks_buf);
+ decoder->chunks_buf = NULL;
+ decoder->chunks_buf_size = 0;
+ decoder->chunks_len = 0;
+ decoder->num_elements = 0;
+}
+
+/*
+ * Begin reading an attribute stream.
+ */
+void
+decode_attstream_begin(attstream_decoder *decoder, ZSAttStream *attstream)
+{
+ int buf_size_needed;
+
+ if ((attstream->t_flags & ATTSTREAM_COMPRESSED) != 0)
+ buf_size_needed = attstream->t_decompressed_bufsize;
+ else
+ buf_size_needed = attstream->t_size - SizeOfZSAttStreamHeader;
+
+ if (decoder->chunks_buf_size < buf_size_needed)
+ {
+ if (decoder->chunks_buf)
+ pfree(decoder->chunks_buf);
+
+ decoder->chunks_buf = MemoryContextAlloc(decoder->cxt, buf_size_needed);
+ decoder->chunks_buf_size = buf_size_needed;
+ }
+
+ if ((attstream->t_flags & ATTSTREAM_COMPRESSED) != 0)
+ {
+ /* decompress */
+ zs_decompress(attstream->t_payload, decoder->chunks_buf,
+ attstream->t_size - SizeOfZSAttStreamHeader,
+ attstream->t_decompressed_bufsize);
+ decoder->chunks_len = attstream->t_decompressed_size;
+ decoder->compression_ratio = ((float8) buf_size_needed) / attstream->t_size;
+ }
+ else
+ {
+ memcpy(decoder->chunks_buf,
+ ((char *) attstream) + SizeOfZSAttStreamHeader,
+ attstream->t_size - SizeOfZSAttStreamHeader);
+ decoder->chunks_len = attstream->t_size - SizeOfZSAttStreamHeader;
+ decoder->compression_ratio = 1.0;
+ }
+ decoder->firsttid = get_chunk_first_tid(decoder->attlen, decoder->chunks_buf);
+ decoder->lasttid = attstream->t_lasttid;
+
+ decoder->pos = 0;
+ decoder->prevtid = 0;
+
+ decoder->num_elements = 0;
+ decoder->avg_elements_size = 0;
+}
+
+/*
+ * internal routine like decode_attstream_begin(), for reading chunks without the
+ * ZSAttStream header.
+ */
+static void
+decode_chunks_begin(attstream_decoder *decoder, char *chunks, int chunkslen, zstid lasttid)
+{
+ if (decoder->chunks_buf_size < chunkslen)
+ {
+ if (decoder->chunks_buf)
+ pfree(decoder->chunks_buf);
+
+ decoder->chunks_buf = MemoryContextAlloc(decoder->cxt, chunkslen);
+ decoder->chunks_buf_size = chunkslen;
+ }
+
+ /* XXX: do we really need to make a copy? */
+ memcpy(decoder->chunks_buf, chunks, chunkslen);
+ decoder->chunks_len = chunkslen;
+ decoder->lasttid = lasttid;
+
+ decoder->pos = 0;
+ decoder->prevtid = 0;
+
+ decoder->num_elements = 0;
+}
+
+/*
+ * Decode the next chunk in an attribute stream.
+ *
+ * The TIDs, Datums and isnull flags in 'decoder' are filled in with
+ * data from the next chunk. Returns true if there was more data,
+ * false if the end of the stream was reached.
+ *
+ * TODO: avoid extracting elements we're not interested in, by passing
+ * starttid/endtid. Or provide a separate "fast forward" function.
+ */
+bool
+decode_attstream_cont(attstream_decoder *decoder)
+{
+ zstid lasttid;
+ int total_decoded;
+ char *p;
+ char *lastp;
+ char *pend;
+ MemoryContext oldcxt;
+
+ oldcxt = CurrentMemoryContext;
+ if (decoder->tmpcxt)
+ {
+ MemoryContextReset(decoder->tmpcxt);
+ MemoryContextSwitchTo(decoder->tmpcxt);
+ }
+
+ lastp = p = decoder->chunks_buf + decoder->pos;
+ pend = decoder->chunks_buf + decoder->chunks_len;
+
+ total_decoded = 0;
+ lasttid = decoder->prevtid;
+
+ if (p >= pend)
+ return false;
+
+ while (p < pend && total_decoded + 60 < DECODER_MAX_ELEMS)
+ {
+ int num_decoded;
+
+ p += decode_chunk(decoder->attbyval, decoder->attlen, &lasttid, p,
+ &num_decoded,
+ &decoder->tids[total_decoded],
+ &decoder->datums[total_decoded],
+ &decoder->isnulls[total_decoded]);
+ total_decoded += num_decoded;
+ }
+
+ MemoryContextSwitchTo(oldcxt);
+
+ Assert(p <= pend);
+ decoder->num_elements = total_decoded;
+ decoder->avg_elements_size = ((p - lastp) / total_decoded) / decoder->compression_ratio;
+ decoder->pos = p - decoder->chunks_buf;
+ if (total_decoded > 0)
+ {
+ decoder->prevtid = decoder->tids[total_decoded - 1];
+ return true;
+ }
+ else
+ return false;
+}
+
+bool
+get_attstream_chunk_cont(attstream_decoder *decoder, zstid *prevtid, zstid *firsttid, zstid *lasttid, bytea **chunk)
+{
+ char *p;
+ char *pend;
+ MemoryContext oldcxt;
+ int len;
+ bytea *attr_data = NULL;
+
+ oldcxt = CurrentMemoryContext;
+ if(decoder->tmpcxt)
+ {
+ MemoryContextReset(decoder->tmpcxt);
+ MemoryContextSwitchTo(decoder->tmpcxt);
+ }
+
+ p = decoder->chunks_buf + decoder->pos;
+ pend = decoder->chunks_buf + decoder->chunks_len;
+ *prevtid = decoder->prevtid;
+ *firsttid = get_chunk_first_tid(decoder->attlen, p) + decoder->prevtid;
+
+ if (p < pend)
+ {
+ len = skip_chunk(decoder->attlen, p, &decoder->prevtid);
+
+ attr_data = (bytea *) palloc(len + VARHDRSZ);
+ SET_VARSIZE(attr_data, len + VARHDRSZ);
+ memcpy(VARDATA(attr_data), p, len);
+
+ decoder->num_elements = chunk_num_elements(p, decoder->attlen);
+
+ p += len;
+ }
+
+ MemoryContextSwitchTo(oldcxt);
+
+ Assert(p <= pend);
+ decoder->pos = p - decoder->chunks_buf;
+
+ if(attr_data != NULL) {
+ *lasttid = decoder->prevtid;
+ *chunk = attr_data;
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------------------
+ * Functions for constructing and manipulating attribute streams.
+ * ----------------------------------------------------------------------------
+ */
+
+static void
+enlarge_attstream_buffer_slow(attstream_buffer *buf, int needed)
+{
+ /* copied from StringInfo */
+ int newlen;
+
+ if (((Size) needed) >= (MaxAllocSize - (Size) buf->len))
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("out of memory"),
+ errdetail("Cannot enlarge attstream buffer containing %d bytes by %d more bytes.",
+ buf->len, needed)));
+
+ needed += buf->len; /* total space required now */
+
+ /* Because of the above test, we now have needed <= MaxAllocSize */
+
+ if (needed <= buf->maxlen)
+ return; /* got enough space already */
+
+ /*
+ * We don't want to allocate just a little more space with each append;
+ * for efficiency, double the buffer size each time it overflows.
+ * Actually, we might need to more than double it if 'needed' is big...
+ */
+ newlen = 2 * buf->maxlen;
+ while (needed > newlen)
+ newlen = 2 * newlen;
+
+ /*
+ * Clamp to MaxAllocSize in case we went past it. Note we are assuming
+ * here that MaxAllocSize <= INT_MAX/2, else the above loop could
+ * overflow. We will still have newlen >= needed.
+ */
+ if (newlen > (int) MaxAllocSize)
+ newlen = (int) MaxAllocSize;
+
+ buf->data = (char *) repalloc(buf->data, newlen);
+
+ buf->maxlen = newlen;
+}
+
+static inline void
+enlarge_attstream_buffer(attstream_buffer *buf, int needed)
+{
+ if (needed > buf->maxlen - buf->len)
+ enlarge_attstream_buffer_slow(buf, needed);
+}
+
+/*
+ * Pack given datums into an attstream.
+ */
+void
+create_attstream(attstream_buffer *dst, bool attbyval, int16 attlen,
+ int nelems, zstid *tids, Datum *datums, bool *isnulls)
+{
+ int num_encoded;
+ int elems_remain;
+ zstid prevtid;
+
+ Assert(nelems > 0);
+
+#define INIT_ATTSTREAM_BUF_SIZE 1024
+ dst->data = palloc(INIT_ATTSTREAM_BUF_SIZE);
+ dst->len = 0;
+ dst->maxlen = INIT_ATTSTREAM_BUF_SIZE;
+ dst->cursor = 0;
+ dst->attlen = attlen;
+ dst->attbyval = attbyval;
+
+ dst->firsttid = tids[0];
+ dst->lasttid = tids[nelems - 1];
+
+ prevtid = 0;
+ elems_remain = nelems;
+ while (elems_remain > 0)
+ {
+ num_encoded = encode_chunk(dst, prevtid, elems_remain, tids, datums, isnulls);
+ Assert(num_encoded > 0);
+ prevtid = tids[num_encoded - 1];
+ datums += num_encoded;
+ isnulls += num_encoded;
+ tids += num_encoded;
+ elems_remain -= num_encoded;
+ }
+}
+
+int
+append_attstream(attstream_buffer *buf, bool all, int nelems,
+ zstid *tids, Datum *datums, bool *isnulls)
+{
+ int num_encoded;
+ int elems_remain;
+ zstid prevtid;
+
+ /* Can we avoid enlarging the buffer by moving the existing data? */
+ if (buf->cursor > 128 * 1024 && buf->cursor > buf->len / 2)
+ {
+ memcpy(buf->data, buf->data + buf->cursor, buf->len - buf->cursor);
+ buf->len -= buf->cursor;
+ buf->cursor = 0;
+ }
+
+ Assert(nelems > 0);
+ Assert((tids[0] > buf->lasttid) || buf->cursor == buf->len);
+
+ if (buf->len - buf->cursor == 0)
+ {
+ buf->firsttid = tids[0];
+ prevtid = 0;
+ }
+ else
+ prevtid = buf->lasttid;
+ elems_remain = nelems;
+ while (elems_remain > (all ? 0 : 59))
+ {
+ num_encoded = encode_chunk(buf, prevtid, elems_remain, tids, datums, isnulls);
+ Assert(num_encoded > 0);
+ prevtid = tids[num_encoded - 1];
+ datums += num_encoded;
+ isnulls += num_encoded;
+ tids += num_encoded;
+ elems_remain -= num_encoded;
+ }
+
+ buf->lasttid = prevtid;
+
+ return nelems - elems_remain;
+}
+
+/*
+ * Split the attstream_buffer supplied in oldattbuf at splittid. The oldattbuf
+ * and the newattbuf will represent the left and right attstream_buffers
+ * resulting from the split, respectively.
+ */
+
+void
+split_attstream_buffer(attstream_buffer *oldattbuf, attstream_buffer *newattbuf, zstid splittid)
+{
+ zstid lasttid_prev_chunk;
+ zstid lasttid;
+
+ int split_chunk_pos;
+ int split_chunk_len;
+ char *split_chunk;
+ int split_chunk_num_elems = 0;
+
+ int left_chunk_num_elems = 0;
+ zstid left_chunk_tids[60];
+ Datum left_chunk_datums[60];
+ bool left_chunk_isnulls[60];
+
+ int right_chunk_num_elems = 0;
+ zstid right_chunk_tids[60];
+ Datum right_chunk_datums[60];
+ bool right_chunk_isnulls[60];
+
+ char *right_stream_second_chunk;
+ int right_stream_remaining_length;
+
+ MemoryContext attstream_memory_context = GetMemoryChunkContext(oldattbuf->data);
+ MemoryContext oldcontext;
+
+ Assert(oldattbuf->lasttid > splittid);
+
+ /*
+ * First, we find the beginning offset of the chunk that contains the
+ * splittid: the split_chunk and the lasttid of the chunk preceding the
+ * split_chunk.
+ */
+
+ split_chunk_pos =
+ find_chunk_containing_tid(oldattbuf, splittid, &lasttid_prev_chunk);
+ split_chunk = oldattbuf->data + oldattbuf->cursor + split_chunk_pos;
+ lasttid = lasttid_prev_chunk;
+
+ /*
+ * Then we decode the split_chunk so that we can extract the items that will
+ * belong to the left and right attstream_buffers respectively (called
+ * left_chunk and right_chunk). The left attstream_buffer will end with
+ * left_chunk and the right attstream_buffer will begin with right_chunk.
+ */
+
+ split_chunk_len = decode_chunk(oldattbuf->attbyval,
+ oldattbuf->attlen,
+ &lasttid,
+ split_chunk,
+ &split_chunk_num_elems,
+ left_chunk_tids,
+ left_chunk_datums,
+ left_chunk_isnulls);
+ left_chunk_num_elems = split_chunk_num_elems;
+ for (int i = 0; i < split_chunk_num_elems; i++)
+ {
+ if (left_chunk_tids[i] > splittid)
+ {
+ left_chunk_num_elems = i;
+ right_chunk_num_elems = split_chunk_num_elems - left_chunk_num_elems;
+
+ memmove(right_chunk_tids, &left_chunk_tids[i], sizeof(zstid) * right_chunk_num_elems);
+ memmove(right_chunk_datums, &left_chunk_datums[i], sizeof(Datum) * right_chunk_num_elems);
+ memmove(right_chunk_isnulls, &left_chunk_isnulls[i], sizeof(bool) * right_chunk_num_elems);
+ break;
+ }
+ }
+ right_chunk_num_elems = split_chunk_num_elems - left_chunk_num_elems;
+ oldcontext = MemoryContextSwitchTo(attstream_memory_context);
+ if (right_chunk_num_elems > 0)
+ right_stream_second_chunk = split_chunk + split_chunk_len;
+ else
+ {
+ Assert(right_chunk_num_elems == 0);
+ /*
+ * This indicates that the splittid lies on a chunk boundary. We need to
+ * treat the chunk starting at that boundary as the right chunk. In
+ * order to do so, we need to decode that chunk.
+ */
+ right_stream_second_chunk = split_chunk + split_chunk_len +
+ decode_chunk(oldattbuf->attbyval,
+ oldattbuf->attlen,
+ &lasttid,
+ split_chunk + split_chunk_len,
+ &right_chunk_num_elems,
+ right_chunk_tids,
+ right_chunk_datums,
+ right_chunk_isnulls);
+ }
+
+ /*
+ * Initialize the right stream to begin with right_chunk. Then, append all
+ * of the chunks following the right_chunk. This will complete the right
+ * attstream_buffer.
+ */
+
+ create_attstream(newattbuf, oldattbuf->attbyval, oldattbuf->attlen,
+ right_chunk_num_elems,
+ right_chunk_tids,
+ right_chunk_datums,
+ right_chunk_isnulls);
+ MemoryContextSwitchTo(oldcontext);
+ right_stream_remaining_length = (oldattbuf->data + oldattbuf->len) - right_stream_second_chunk;
+ enlarge_attstream_buffer(newattbuf, right_stream_remaining_length);
+ memcpy(newattbuf->data + newattbuf->len, right_stream_second_chunk,
+ right_stream_remaining_length);
+ newattbuf->len += right_stream_remaining_length;
+ newattbuf->lasttid = oldattbuf->lasttid;
+
+ /*
+ * Truncate the left attstream_buffer beyond the start of the split_chunk.
+ * Then, append the left_chunk to the end of the left attstream_buffer.
+ * This will complete the left attstream_buffer.
+ */
+
+ oldattbuf->len = oldattbuf->cursor + split_chunk_pos;
+ oldattbuf->lasttid = lasttid_prev_chunk;
+ if (left_chunk_num_elems > 0)
+ append_attstream(oldattbuf, true, left_chunk_num_elems,
+ left_chunk_tids, left_chunk_datums, left_chunk_isnulls);
+
+#ifdef USE_ASSERT_CHECKING
+ verify_attstream(oldattbuf);
+ verify_attstream(newattbuf);
+#endif
+}
+
+/*
+ * Trim attstream buffer from cursor to 'chunk_pos'.
+ * The chunk beginning at 'chunk_pos' will be the new first chunk of the attstream
+ * buffer after this operation is completed. 'prev_lasttid' is the lasttid of
+ * the chunk that precedes the one that starts at 'chunk_pos', in order to
+ * calculate the new absolute first tid of the attstream buffer.
+ */
+void
+trim_attstream_upto_offset(attstream_buffer *buf, int chunk_pos, zstid prev_lasttid)
+{
+ /*
+ * 'first_chunk' represents the first chunk in the resultant attstream buffer.
+ */
+ char *first_chunk;
+ int first_chunk_len;
+ zstid first_chunk_tids[60];
+ Datum first_chunk_datums[60];
+ bool first_chunk_isnulls[60];
+ int first_chunk_num_elems;
+
+ attstream_buffer tmpbuf;
+ zstid newfirsttid;
+
+ /*
+ * Trim the attstream buffer from the beginning by advancing the cursor.
+ */
+ buf->cursor += chunk_pos;
+ Assert(buf->cursor <= buf->len);
+ if (buf->cursor >= buf->len)
+ {
+ Assert(buf->cursor == buf->len);
+ return;
+ }
+
+ /* FIXME: arbitrary limit. We need some space before the split point, in
+ * case we need to re-encode the first new chunk. Compute this correctly,
+ * and perhaps reallocate a bigger buffer if needed. ATM, though, this is
+ * only used to chop large attstreams to page-sized parts, so this never
+ * gets called with a very small 'chunk_pos'.
+ */
+ if (buf->cursor < 500)
+ elog(ERROR, "cannot split");
+
+ first_chunk = buf->data + buf->cursor;
+
+ newfirsttid = prev_lasttid + get_chunk_first_tid(buf->attlen, first_chunk);
+ if (!replace_first_tid_in_place(buf->attlen, newfirsttid, first_chunk))
+ {
+ /*
+ * We need to decode and re-encode the chunk in order to ensure that the
+ * firsttid of the attstream buffer is absolute.
+ */
+ first_chunk_len = decode_chunk(buf->attbyval, buf->attlen, &prev_lasttid,
+ first_chunk,
+ &first_chunk_num_elems,
+ first_chunk_tids,
+ first_chunk_datums,
+ first_chunk_isnulls);
+ create_attstream(&tmpbuf, buf->attbyval, buf->attlen,
+ first_chunk_num_elems,
+ first_chunk_tids,
+ first_chunk_datums,
+ first_chunk_isnulls);
+
+ /*
+ * Replace the chunk with the re-encoded version.
+ * First, shave off the existing chunk. Then prepend the re-encoded
+ * version of the chunk into the attstream buffer.
+ */
+ buf->cursor += first_chunk_len;
+ if (buf->cursor < tmpbuf.len - tmpbuf.cursor)
+ elog(ERROR, "not enough work space to split");
+ buf->cursor -= (tmpbuf.len - tmpbuf.cursor);
+ memcpy(&buf->data[buf->cursor],
+ tmpbuf.data + tmpbuf.cursor,
+ tmpbuf.len - tmpbuf.cursor);
+
+ pfree(tmpbuf.data);
+
+ }
+ buf->firsttid = newfirsttid;
+#ifdef USE_ASSERT_CHECKING
+ verify_attstream(buf);
+#endif
+}
+
+/*
+ * Find the beginning offset of chunk containing 'tid'.
+ * Also populate the lasttid of the chunk preceding the one we found in 'lasttid'.
+ * Returns -1 if there are no full chunks. (FIXME: no it doesn't currently)
+ */
+int
+find_chunk_containing_tid(attstream_buffer *attbuf, zstid tid, zstid *lasttid)
+{
+ zstid prev_lasttid;
+ char *prev_chunk;
+ char *chunk;
+
+ chunk = prev_chunk = &attbuf->data[attbuf->cursor];
+ prev_lasttid = *lasttid = 0;
+
+ while (*lasttid < tid)
+ {
+ prev_lasttid = *lasttid;
+ prev_chunk = chunk;
+
+ chunk += skip_chunk(attbuf->attlen, chunk, lasttid);
+ }
+
+ *lasttid = prev_lasttid;
+
+ /*
+ * prev_chunk now points to the beginning of the chunk that contains tid
+ */
+ return prev_chunk - &attbuf->data[attbuf->cursor];
+}
+
+/*
+ * Find the beginning offset of the first chunk that is not fully contained
+ * within the first 'offset' bytes.
+ * Also, populate the lasttid of the chunk preceding the one found in 'lasttid'.
+ * Returns -1 if there are no full chunks. (FIXME: no it doesn't currently)
+ */
+int
+find_chunk_for_offset(attstream_buffer *attbuf, int offset, zstid *lasttid)
+{
+ char *chunks = attbuf->data + attbuf->cursor;
+ char *p = chunks;
+ char *pend = p + offset;
+
+ *lasttid = 0;
+ while (p + sizeof(uint64) <= pend)
+ {
+ int this_chunk_len;
+
+ this_chunk_len = get_chunk_length(attbuf->attlen, p);
+
+ if (p + this_chunk_len > pend)
+ break; /* this one is not complete */
+
+ p += skip_chunk(attbuf->attlen, p, lasttid);
+ }
+ /* 'p' now points to the first incomplete chunk */
+ return p - (char *) chunks;
+}
+
+void
+init_attstream_buffer_from_stream(attstream_buffer *buf, bool attbyval, int16 attlen,
+ ZSAttStream *attstream, MemoryContext memcontext)
+{
+ int buf_size;
+
+ if ((attstream->t_flags & ATTSTREAM_COMPRESSED) != 0)
+ buf_size = attstream->t_decompressed_bufsize;
+ else
+ buf_size = attstream->t_size - SizeOfZSAttStreamHeader;
+
+ buf->data = MemoryContextAlloc(memcontext, buf_size);
+ buf->len = 0;
+ buf->maxlen = buf_size;
+ buf->cursor = 0;
+
+ buf->attlen = attlen;
+ buf->attbyval = attbyval;
+
+ if ((attstream->t_flags & ATTSTREAM_COMPRESSED) != 0)
+ {
+ zs_decompress(attstream->t_payload, buf->data,
+ attstream->t_size - SizeOfZSAttStreamHeader,
+ attstream->t_decompressed_bufsize);
+ buf->len = attstream->t_decompressed_size;
+ }
+ else
+ {
+ memcpy(buf->data, attstream->t_payload, attstream->t_size - SizeOfZSAttStreamHeader);
+ buf->len = attstream->t_size - SizeOfZSAttStreamHeader;
+ }
+
+ buf->firsttid = get_chunk_first_tid(buf->attlen, buf->data + buf->cursor);
+ buf->lasttid = attstream->t_lasttid;
+}
+
+void
+init_attstream_buffer(attstream_buffer *buf, bool attbyval, int16 attlen)
+{
+#define ATTBUF_INIT_SIZE 1024
+ buf->data = palloc(ATTBUF_INIT_SIZE);
+ buf->len = 0;
+ buf->maxlen = ATTBUF_INIT_SIZE;
+ buf->cursor = 0;
+
+ buf->firsttid = 0;
+ buf->lasttid = 0;
+
+ buf->attlen = attlen;
+ buf->attbyval = attbyval;
+}
+
+void
+vacuum_attstream(Relation rel, AttrNumber attno, attstream_buffer *dst,
+ ZSAttStream *attstream,
+ zstid *tids_to_remove, int num_tids_to_remove)
+{
+ Form_pg_attribute attr = &rel->rd_att->attrs[attno - 1];
+ attstream_decoder decoder;
+ int removeidx;
+ zstid *tids;
+ Datum *datums;
+ bool *isnulls;
+ int num_buffered;
+ int buffer_size = 1000;
+
+ /*
+ * Decode the input, leave out the items that are to be removed, and
+ * re-encode as we go.
+ */
+ tids = palloc(buffer_size * sizeof(zstid));
+ datums = palloc(buffer_size * sizeof(Datum));
+ isnulls = palloc(buffer_size * sizeof(bool));
+
+ init_attstream_buffer(dst, attr->attbyval, attr->attlen);
+
+ init_attstream_decoder(&decoder, attr->attbyval, attr->attlen);
+ decode_attstream_begin(&decoder, attstream);
+
+ num_buffered = 0;
+ removeidx = 0;
+ while (decode_attstream_cont(&decoder))
+ {
+ for (int idx = 0; idx < decoder.num_elements; idx++)
+ {
+ zstid tid = decoder.tids[idx];
+ Datum datum = decoder.datums[idx];
+ bool isnull = decoder.isnulls[idx];
+
+ /* also "merge" in the list of tids to remove */
+ while (removeidx < num_tids_to_remove && tid > tids_to_remove[removeidx])
+ removeidx++;
+ if (removeidx < num_tids_to_remove && tid == tids_to_remove[removeidx])
+ {
+ /*
+ * This datum needs to be removed. Leave it out from the result.
+ *
+ * If it's a toasted datum, also remove the toast blocks.
+ */
+ if (attr->attlen == -1 && !isnull &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ varatt_zs_toastptr *toastptr = (varatt_zs_toastptr *) DatumGetPointer(datum);
+ BlockNumber toast_blkno = toastptr->zst_block;
+
+ zedstore_toast_delete(rel, attr, tid, toast_blkno);
+ }
+ }
+ else
+ {
+ tids[num_buffered] = tid;
+ datums[num_buffered] = datum;
+ isnulls[num_buffered] = isnull;
+ num_buffered++;
+
+ if (num_buffered == buffer_size)
+ {
+ /* encode datums that we've buffered so far */
+ int num_encoded;
+ int num_remain;
+
+ num_encoded = append_attstream(dst, false, num_buffered, tids, datums, isnulls);
+
+ /* move the remaining ones to beginning of buffer, and continue */
+ num_remain = num_buffered - num_encoded;
+ memmove(tids, &tids[num_encoded], num_remain * sizeof(zstid));
+ memmove(datums, &datums[num_encoded], num_remain * sizeof(Datum));
+ memmove(isnulls, &isnulls[num_encoded], num_remain * sizeof(bool));
+ num_buffered = num_remain;
+ }
+ }
+ }
+ }
+
+ /* flush the rest of buffered datums to the attstream */
+ if (num_buffered > 0)
+ append_attstream(dst, true, num_buffered, tids, datums, isnulls);
+
+#ifdef USE_ASSERT_CHECKING
+ verify_attstream(dst);
+#endif
+
+ pfree(datums);
+ pfree(isnulls);
+ pfree(tids);
+ destroy_attstream_decoder(&decoder);
+}
+
+/*
+ * Merge two attstreams together.
+ *
+ * This is the workhorse of repacking and re-encoding data, when
+ * new attribute data is added to a page (INSERT/UPDATE), or when
+ * some data is removed (VACUUM after a DELETE).
+ *
+ * 'buf' holds the existing stream, and 'attstream2' is the stream to
+ * merge into it. 'attstream2' can be NULL, in which case there is
+ * nothing to merge. (Removing TIDs from a stream is handled separately,
+ * in vacuum_attstream().)
+ *
+ * There are some heuristics here:
+ *
+ * - A compressed attstream is assumed to already be in a "dense"
+ * form, using maximally compact codewords. An uncompressed
+ * stream, however, might not be, so uncompressed streams are
+ * always decoded into constituent datums, and re-encoded.
+ *
+ */
+void
+merge_attstream(Form_pg_attribute attr, attstream_buffer *buf, ZSAttStream *attstream2)
+{
+ if (attstream2 == NULL)
+ return;
+
+ /*
+ * If the input is compressed, decompress it.
+ */
+ if ((attstream2->t_flags & ATTSTREAM_COMPRESSED) != 0)
+ {
+ char *decompress_buf;
+
+ decompress_buf = palloc(attstream2->t_decompressed_bufsize);
+ zs_decompress(attstream2->t_payload, decompress_buf,
+ attstream2->t_size - SizeOfZSAttStreamHeader,
+ attstream2->t_decompressed_bufsize);
+
+ merge_attstream_guts(attr, buf,
+ decompress_buf, attstream2->t_decompressed_size,
+ attstream2->t_lasttid);
+
+ pfree(decompress_buf);
+ }
+ else
+ {
+ merge_attstream_guts(attr, buf,
+ attstream2->t_payload, attstream2->t_size - SizeOfZSAttStreamHeader,
+ attstream2->t_lasttid);
+ }
+}
+
+void
+merge_attstream_buffer(Form_pg_attribute attr, attstream_buffer *buf, attstream_buffer *buf2)
+{
+ merge_attstream_guts(attr, buf,
+ buf2->data + buf2->cursor, buf2->len - buf2->cursor, buf2->lasttid);
+}
+static void
+merge_attstream_guts(Form_pg_attribute attr, attstream_buffer *buf, char *chunks2, int chunks2len, zstid lasttid2)
+{
+ attstream_decoder decoder1;
+ bool decoder1_continues;
+ int decoder1_idx;
+ attstream_decoder decoder2;
+ bool decoder2_continues;
+ int decoder2_idx;
+ Datum *result_datums;
+ bool *result_isnulls;
+ zstid *result_tids;
+ int num_buffered;
+ zstid lasttid1;
+ zstid firsttid2;
+ int buffer_size = 1000;
+ char *olddata;
+ int newmaxlen;
+
+ lasttid1 = buf->lasttid;
+ firsttid2 = get_chunk_first_tid(buf->attlen, chunks2);
+
+ /*
+ * Fast path:
+ *
+ * If the two streams don't overlap, then
+ * we can avoid re-encoding and just append one stream after the other.
+ * We only do this if the stream that comes first was compressed:
+ * otherwise it may not be optimally packed, and we want to re-encode it
+ * to make sure it's using the densest possible codewords.
+ *
+ * XXX: we don't take this fastpath, if the new stream is strictly
+ * below the old stream. We could swap the inputs and do it in that
+ * case too...
+ *
+ * FIXME: we don't actually pay attention to the compression anymore.
+ * We never repack.
+ * FIXME: this is backwards, the normal fast path is if (firsttid1 > lasttid2)
+ */
+ if (firsttid2 > lasttid1)
+ {
+ char *pos_new;
+ uint64 delta;
+
+ enlarge_attstream_buffer(buf, chunks2len);
+ pos_new = buf->data + buf->len;
+
+ memcpy(pos_new, chunks2, chunks2len);
+
+ delta = firsttid2 - lasttid1;
+ replace_first_tid_in_place(buf->attlen, delta, pos_new);
+
+ buf->len += chunks2len;
+ buf->lasttid = lasttid2;
+
+ return;
+ }
+
+ /*
+ * naive implementation: decode everything, merge arrays, and re-encode.
+ * FIXME: because this is naive, this could be costly when we have a large number of tids in the attbuffer
+ */
+ init_attstream_decoder(&decoder1, attr->attbyval, attr->attlen);
+ decode_chunks_begin(&decoder1, buf->data + buf->cursor, buf->len - buf->cursor, buf->lasttid);
+ decoder1_continues = decode_attstream_cont(&decoder1);
+ decoder1_idx = 0;
+
+ init_attstream_decoder(&decoder2, attr->attbyval, attr->attlen);
+ decode_chunks_begin(&decoder2, chunks2, chunks2len, lasttid2);
+ decoder2_continues = decode_attstream_cont(&decoder2);
+ decoder2_idx = 0;
+
+ buffer_size = 1000; /* arbitrary initial size */
+ result_tids = palloc(buffer_size * sizeof(zstid));
+ result_datums = palloc(buffer_size * sizeof(Datum));
+ result_isnulls = palloc(buffer_size * sizeof(bool));
+
+ /*
+ * Reallocate a new buffer, in the same memory context as the old one, to
+ * write the result to. (We can't write directly to the old buffer, because
+ * we'll read it simultaneously.)
+ */
+ newmaxlen = (buf->len - buf->cursor) + chunks2len + 100; /* 100 gives some headroom, to avoid
+ * repallocs */
+ olddata = buf->data;
+ buf->data = MemoryContextAlloc(GetMemoryChunkContext(olddata), newmaxlen);
+ buf->maxlen = newmaxlen;
+ buf->len = 0;
+ buf->cursor = 0;
+ buf->firsttid = 0;
+ buf->lasttid = 0;
+
+ num_buffered = 0;
+ for (;;)
+ {
+ attstream_decoder *decodernext;
+ int *decodernext_idx;
+ bool *decodernext_continues;
+ zstid tid;
+ Datum datum;
+ bool isnull;
+
+ if (decoder1_continues && decoder2_continues)
+ {
+ if (decoder1.tids[decoder1_idx] < decoder2.tids[decoder2_idx])
+ {
+ decodernext = &decoder1;
+ decodernext_idx = &decoder1_idx;
+ decodernext_continues = &decoder1_continues;
+ }
+ else if (decoder1.tids[decoder1_idx] > decoder2.tids[decoder2_idx])
+ {
+ decodernext = &decoder2;
+ decodernext_idx = &decoder2_idx;
+ decodernext_continues = &decoder2_continues;
+ }
+ else
+ elog(ERROR, "cannot merge attstreams with duplicate TIDs");
+ }
+ else if (decoder1_continues)
+ {
+ decodernext = &decoder1;
+ decodernext_idx = &decoder1_idx;
+ decodernext_continues = &decoder1_continues;
+ }
+ else if (decoder2_continues)
+ {
+ decodernext = &decoder2;
+ decodernext_idx = &decoder2_idx;
+ decodernext_continues = &decoder2_continues;
+ }
+ else
+ {
+ break; /* all done */
+ }
+
+ tid = decodernext->tids[*decodernext_idx];
+ datum = decodernext->datums[*decodernext_idx];
+ isnull = decodernext->isnulls[*decodernext_idx];
+ (*decodernext_idx)++;
+
+ result_tids[num_buffered] = tid;
+ result_datums[num_buffered] = datum;
+ result_isnulls[num_buffered] = isnull;
+ num_buffered++;
+
+ if (num_buffered == buffer_size)
+ {
+ /* encode datums that we've buffered so far */
+ int num_encoded;
+ int num_remain;
+
+ num_encoded = append_attstream(buf, false, num_buffered,
+ result_tids, result_datums, result_isnulls);
+
+ /* move the remaining ones to beginning of buffer, and continue */
+ num_remain = num_buffered - num_encoded;
+ memmove(result_tids, &result_tids[num_encoded], num_remain * sizeof(zstid));
+ memmove(result_datums, &result_datums[num_encoded], num_remain * sizeof(Datum));
+ memmove(result_isnulls, &result_isnulls[num_encoded], num_remain * sizeof(bool));
+ num_buffered = num_remain;
+ }
+
+ if (*decodernext_idx == decodernext->num_elements)
+ {
+ *decodernext_continues = decode_attstream_cont(decodernext);
+ *decodernext_idx = 0;
+ }
+ }
+
+ /* flush the rest of buffered datums to the attstream */
+ if (num_buffered > 0)
+ append_attstream(buf, true, num_buffered, result_tids, result_datums, result_isnulls);
+
+ pfree(olddata);
+ pfree(result_datums);
+ pfree(result_isnulls);
+ pfree(result_tids);
+ destroy_attstream_decoder(&decoder1);
+ destroy_attstream_decoder(&decoder2);
+}
+
+/*
+ * Append 'newstream' to 'oldstream' in place, modifying 'oldstream'.
+ *
+ * There is assumed to be 'freespace' bytes after 'oldstream', where we can
+ * write the new data.
+ *
+ * If the new data doesn't fit in the available space, does nothing and
+ * returns false.
+ *
+ * NB: This is used within a critical section, so keep it simple. No ereport
+ * or pallocs!
+ */
+bool
+append_attstream_inplace(Form_pg_attribute att, ZSAttStream *oldstream, int freespace,
+ attstream_buffer *newbuf)
+{
+ zstid firstnewtid;
+ char *pos_new;
+ zstid delta;
+
+ Assert(CritSectionCount > 0);
+
+ /*
+ * fast path requirements:
+ *
+ * - the new stream goes after the old one
+ * - there is enough space to append 'newbuf'
+ * - neither stream is compressed
+ */
+ if (oldstream->t_flags & ATTSTREAM_COMPRESSED)
+ return false;
+
+ if (freespace < newbuf->len - newbuf->cursor)
+ return false; /* no space */
+
+ firstnewtid = get_chunk_first_tid(att->attlen, newbuf->data + newbuf->cursor);
+ if (firstnewtid <= oldstream->t_lasttid)
+ {
+ /* overlap */
+ return false;
+ }
+
+ /*
+ * We can do it!
+ *
+ * The trivial way is to just append the new stream to the old stream,
+ * adjusting the first TID at the seam, so that it's a delta from the last
+ * old tid.
+ *
+ * TODO A better way: try to re-code the last old item, and first new item
+ * together. For example, if new data is added one row at a time, we currently
+ * generate a stream of single-datum chunks, with the 8-byte codeword for
+ * every datum. It would be better to combine the chunks at the seam, using
+ * more compact codewords. But if you implement that, make sure the callers
+ * are happy with that! At the moment, the caller WAL-logs the change, and
+ * doesn't expect us to change the existing data.
+ */
+ pos_new = ((char *) oldstream) + oldstream->t_size;
+ memcpy(pos_new,
+ newbuf->data + newbuf->cursor,
+ newbuf->len - newbuf->cursor);
+
+ delta = firstnewtid - oldstream->t_lasttid;
+ replace_first_tid_in_place(att->attlen, delta, pos_new);
+ oldstream->t_size += newbuf->len - newbuf->cursor;
+ oldstream->t_lasttid = newbuf->lasttid;
+
+ newbuf->cursor = newbuf->len;
+
+ return true;
+}
+
+
+#ifdef USE_ASSERT_CHECKING
+static void
+verify_attstream(attstream_buffer *attbuf)
+{
+ char *p = attbuf->data + attbuf->cursor;
+ char *pend = attbuf->data + attbuf->len;
+ zstid tid;
+
+ tid = 0;
+
+ while (p < pend)
+ {
+ p += skip_chunk(attbuf->attlen, p, &tid);
+ }
+ Assert(tid == attbuf->lasttid);
+ Assert(p == pend);
+}
+#endif
+
+void
+print_attstream(int attlen, char *chunk, int len)
+{
+ char *p = chunk;
+ char *pend = chunk + len;
+ zstid tid;
+
+ tid = 0;
+ while (p < pend)
+ {
+ uint64 codeword;
+ int len;
+
+ memcpy(&codeword, p, sizeof(uint64));
+
+ len = skip_chunk(attlen, p, &tid);
+ elog(NOTICE, "%016lX: TID %lu, %d bytes", codeword, tid, len);
+
+ p += len;
+ }
+}
+
+/* ----------------------------------------------------------------------------
+ * Functions that work with individual chunks in an attstream.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * FIXED-LENGTH CODEWORD MODES
+ * ---------------------------
+ *
+ * These modes are used with fixed-length attributes (attlen > 0). Each codeword
+ * includes a 4-bit mode selector, and between 1-60 TIDs, and in some modes, a NULL
+ * bitmap. To avoid creating too large chunks, which might not fit conveniently on
+ * a page, we avoid using the most dense modes when the resulting chunk would exceed
+ * TARGET_CHUNK_SIZE.
+ *
+ * Glossary:
+ *
+ * x Bit positions representing TIDs (or rather, deltas between TIDs.
+ * 0..9 In the lower modes that encode a lot of TIDs, the boundaries between TIDs
+ * are not shown and 'x' is used to represent all of them. In higher modes,
+ * the numbers are used to indicate which bit position encodes which TID.)
+ *
+ * N Bit positions used for a NULL bitmap
+ *
+ * w unused, wasted, bits
+ *
+ * mode 0: 0000 xxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 60 TIDs, 1 bit each
+ *
+ * mode 1: 0001 NNNN NNNNNNNN NNNNNNNN NNNNNNNN NNxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 30 TIDs, 1 bit each
+ * 30 NULL bits
+ *
+ * mode 2: 0010 xxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 30 TIDs, 2 bits each
+ *
+ * mode 3: 0011 NNNN NNNNNNNN NNNNNNNN xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 20 TIDs, 2 bits each
+ * 20 NULL bits
+ *
+ * mode 4: 0100 xxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 20 TIDs, 3 bits each
+ *
+ * mode 5: 0101 NNNN NNNNNNNN NNNxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 15 TIDs, 3 bits each
+ * 15 NULL bits
+ *
+ * mode 6: 0110 xxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 15 TIDs, 4 bits each
+ *
+ * mode 7: 0111 NNNN NNNNNNNN xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 12 TIDs, 4 bits each
+ * 12 NULL bits
+ *
+ * mode 8: 1000 xxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 12 TIDs, 5 bits each
+ *
+ * mode 9: 1001 NNNN NNNNNNxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 10 TIDs, 5 bits each
+ * 10 NULL bits
+ *
+ * mode 10: 1010 wwww NNNNNNNN 88888877 77776666 66555555 44444433 33332222 22111111
+ *
+ * 8 TIDs, 6 bits each
+ * 8 NULL bits
+ * (four bits are wasted)
+ *
+ * mode 11: 1011 NNNN NN666666 66655555 55554444 44444333 33333322 22222221 11111111
+ *
+ * 6 TIDs, 9 bits each
+ * 6 NULL bits
+ *
+ * mode 12: 1100 NNNN 44444444 44444433 33333333 33332222 22222222 22111111 11111111
+ * 4 TIDs, 14 bits each
+ * 4 NULL bits
+ *
+ * mode 13: 1101 NNN3 33333333 33333333 33222222 22222222 22222111 11111111 11111111
+ *
+ * three TIDs, 19 bits each
+ * 3 NULL bits
+ *
+ * mode 14: 1110 NN22 22222222 22222222 22222222 22211111 11111111 11111111 11111111
+ *
+ * two TIDs, 29 bits each
+ * two NULL bits
+ *
+ * mode 15: 1111 0000 Nxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * one TID, 55 bits
+ * NULL bit
+ *
+ * XXX: we store the first TID in the low bits, and subsequent TIDs in higher bits. Not
+ * sure if that's how it's usually done...
+ *
+ * XXX: We could use delta 0 to mark unused slots. That way, we wouldn't need to shift
+ * to a higher mode when we're running out of TIDs to encode. Or we could subtract one
+ * from each distance, so that value 0 didn't go wasted, and we could sometimes use
+ * more compact modes.
+ */
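+
+/*
+ * A small worked example of the layout above, using mode 12 (4 TIDs of
+ * 14 bits each, plus a 4-bit NULL bitmap): the selector 12 occupies bits
+ * 60-63, the NULL bitmap bits 56-59, and the four TID deltas bits 0-13,
+ * 14-27, 28-41 and 42-55. A decoder reads the selector from the top four
+ * bits, then extracts each delta by masking with (1 << 14) - 1 and
+ * shifting the codeword right by 14 bits per TID, and finally reads the
+ * NULL bitmap from the bits that remain below the selector.
+ */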
+static const struct codeword_mode
+{
+ uint8 bits_per_int;
+ uint8 num_ints;
+ bool nullbitmap;
+} fixed_width_modes[17] =
+{
+ {1, 60, false}, /* mode 0 */
+ {1, 30, true}, /* mode 1 */
+ {2, 30, false}, /* mode 2 */
+ {2, 20, true}, /* mode 3 */
+ {3, 20, false}, /* mode 4 */
+ {3, 15, true}, /* mode 5 */
+ {4, 15, false}, /* mode 6 */
+ {4, 12, true}, /* mode 7 */
+ {5, 12, false}, /* mode 8 */
+ {5, 10, true}, /* mode 9 */
+ {6, 8, true}, /* mode 10 */
+ {9, 6, true}, /* mode 11 */
+ {14, 4, true}, /* mode 12 */
+ {19, 3, true}, /* mode 13 */
+ {29, 2, true}, /* mode 14 */
+ {55, 1, true}, /* mode 15 */
+ {0, 0, false} /* sentinel */
+};
+
+static int
+get_chunk_length_fixed(int attlen, char *chunk)
+{
+ uint64 codeword;
+
+ memcpy(&codeword, chunk, sizeof(uint64));
+
+ {
+ int selector = (codeword >> 60);
+ int nints = fixed_width_modes[selector].num_ints;
+ int bits = fixed_width_modes[selector].bits_per_int;
+ bool has_nulls = fixed_width_modes[selector].nullbitmap;
+ int num_nulls;
+
+ /* skip over the TIDs */
+ codeword >>= bits * nints;
+
+ num_nulls = 0;
+ if (has_nulls)
+ {
+ /* count set bits in the NULL bitmap */
+ for (int i = 0; i < nints; i++)
+ {
+ if (codeword & 1)
+ num_nulls++;
+ codeword >>= 1;
+ }
+ }
+ return sizeof(uint64) + (nints - num_nulls) * attlen;
+ }
+}
+
+static zstid
+get_chunk_first_tid_fixed(int attlen, char *chunk)
+{
+ uint64 codeword;
+
+ memcpy(&codeword, chunk, sizeof(uint64));
+
+ {
+ int selector = (codeword >> 60);
+ int bits = fixed_width_modes[selector].bits_per_int;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+
+ /* get first tid */
+ return (codeword & mask);
+ }
+}
+
+static bool
+replace_first_tid_fixed(int attlen, zstid newtid, char *chunk)
+{
+ uint64 codeword;
+
+ memcpy(&codeword, chunk, sizeof(uint64));
+
+ {
+ int selector = (codeword >> 60);
+ int bits = fixed_width_modes[selector].bits_per_int;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+
+ /* check that the new TID fits in this mode's TID field */
+ if (newtid >= (UINT64CONST(1) << bits))
+ return false;
+
+ /* zero out the TID */
+ codeword &= ~mask;
+ codeword |= newtid;
+
+ memcpy(chunk, &codeword, sizeof(uint64));
+
+ return true;
+ }
+}
+
+static int
+skip_chunk_fixed(int attlen, char *chunk, zstid *lasttid)
+{
+ zstid prevtid = *lasttid;
+ char *p = chunk;
+ uint64 codeword;
+
+ memcpy(&codeword, p, sizeof(uint64));
+ p += sizeof(uint64);
+
+ {
+ int selector = (codeword >> 60);
+ int nints = fixed_width_modes[selector].num_ints;
+ int bits = fixed_width_modes[selector].bits_per_int;
+ bool has_nulls = fixed_width_modes[selector].nullbitmap;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+ int num_nulls;
+ zstid tid = prevtid;
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 val = codeword & mask;
+
+ tid += val;
+ codeword >>= bits;
+ }
+
+ num_nulls = 0;
+ if (has_nulls)
+ {
+ /* count set bits in the NULL bitmap */
+ for (int i = 0; i < nints; i++)
+ {
+ if (codeword & 1)
+ num_nulls++;
+ codeword >>= 1;
+ }
+ }
+
+ /* ignore the datums */
+ *lasttid = tid;
+ return sizeof(uint64) + (nints - num_nulls) * attlen;
+ }
+}
+
+static int
+decode_chunk_fixed(bool attbyval, int attlen, zstid *lasttid, char *chunk,
+ int *num_elems, zstid *tids, Datum *datums, bool *isnulls)
+{
+ char *p = chunk;
+ uint64 codeword;
+
+ memcpy(&codeword, p, sizeof(uint64));
+ p += sizeof(uint64);
+
+ {
+ int selector = (codeword >> 60);
+ int bits = fixed_width_modes[selector].bits_per_int;
+ bool has_nulls = fixed_width_modes[selector].nullbitmap;
+ int nints = fixed_width_modes[selector].num_ints;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+ zstid tid = *lasttid;
+ uint64 nullbitmap;
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 val = codeword & mask;
+
+ tid = tid + val;
+ tids[i] = tid;
+ codeword >>= bits;
+ }
+ *lasttid = tid;
+
+ if (has_nulls)
+ nullbitmap = codeword & UINT64CONST(0x0FFFFFFFFFFFFF);
+ else
+ nullbitmap = 0;
+
+ /* datums follow */
+ if (attbyval)
+ {
+ if (nullbitmap == 0)
+ {
+ /* FIXME: the loops below ignore alignment. 'p' might not be aligned */
+ if (attlen == sizeof(Datum))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ datums[i] = *((Datum *) p);
+ isnulls[i] = false;
+ p += sizeof(Datum);
+ }
+ }
+ else if (attlen == sizeof(int32))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ datums[i] = Int32GetDatum(*(int32 *) p);
+ isnulls[i] = false;
+ p += sizeof(int32);
+ }
+ }
+ else if (attlen == sizeof(int16))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ datums[i] = Int16GetDatum(*(int16 *) p);
+ isnulls[i] = false;
+ p += sizeof(int16);
+ }
+ }
+ else if (attlen == sizeof(char))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ datums[i] = CharGetDatum(*p);
+ isnulls[i] = false;
+ p++;
+ }
+ }
+ else
+ elog(ERROR, "unsupported byval length: %d", attlen);
+ }
+ else
+ {
+ /* FIXME: the loops below ignore alignment. 'p' might not be aligned */
+ if (attlen == sizeof(Datum))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ if ((nullbitmap & (1 << i)) == 0)
+ {
+ datums[i] = *((Datum *) p);
+ isnulls[i] = false;
+ p += sizeof(Datum);
+ }
+ else
+ {
+ datums[i] = (Datum) 0;
+ isnulls[i] = true;
+ }
+ }
+ }
+ else if (attlen == sizeof(int32))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ if ((nullbitmap & (1 << i)) == 0)
+ {
+ datums[i] = Int32GetDatum(*(int32 *) p);
+ isnulls[i] = false;
+ p += sizeof(int32);
+ }
+ else
+ {
+ datums[i] = (Datum) 0;
+ isnulls[i] = true;
+ }
+ }
+ }
+ else if (attlen == sizeof(int16))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ if ((nullbitmap & (1 << i)) == 0)
+ {
+ datums[i] = Int16GetDatum(*(int16 *) p);
+ isnulls[i] = false;
+ p += sizeof(int16);
+ }
+ else
+ {
+ datums[i] = (Datum) 0;
+ isnulls[i] = true;
+ }
+ }
+ }
+ else if (attlen == sizeof(char))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ if ((nullbitmap & (1 << i)) == 0)
+ {
+ datums[i] = CharGetDatum(*p);
+ isnulls[i] = false;
+ p++;
+ }
+ else
+ {
+ datums[i] = (Datum) 0;
+ isnulls[i] = true;
+ }
+ }
+ }
+ else
+ elog(ERROR, "unsupported byval length: %d", attlen);
+ }
+ }
+ else
+ {
+ char *datumbuf = palloc(MAXALIGN(attlen) * nints); /* XXX: attalign */
+ char *datump = datumbuf;
+
+ for (int i = 0; i < nints; i++)
+ {
+ if ((nullbitmap & (1 << i)) == 0)
+ {
+ memcpy(datump, p, attlen);
+ datums[i] = PointerGetDatum(datump);
+ isnulls[i] = false;
+ p += attlen;
+ datump += MAXALIGN(attlen);
+ }
+ else
+ {
+ datums[i] = (Datum) 0;
+ isnulls[i] = true;
+ }
+ }
+ }
+
+ *num_elems = nints;
+ return p - chunk;
+ }
+}
+
+static int
+encode_chunk_fixed(attstream_buffer *dst, zstid prevtid, int ntids,
+ zstid *tids, Datum *datums, bool *isnulls)
+{
+ /*
+ * Select the "mode" to use for this codeword.
+ *
+ * In each iteration, check if the next value can be represented in the
+ * current mode we're considering. If it's too large, then step up the
+ * mode to a wider one, and repeat. If it fits, move on to the next
+ * integer. Repeat until the codeword is full, given the current mode.
+ *
+ * Note that we don't have any way to represent unused slots in the
+ * codeword, so we require each codeword to be "full". It is always
+ * possible to produce a full codeword unless the very first delta is too
+ * large to be encoded. For example, if the first delta is small but the
+ * second is too large to be encoded, we'll end up using the last "mode",
+ * which has nints == 1.
+ */
+ bool attbyval = dst->attbyval;
+ int16 attlen = dst->attlen;
+ int selector;
+ int this_nints;
+ int this_bits;
+ bool this_supports_nulls;
+ uint64 val;
+ int i;
+ bool has_nulls;
+ int size = sizeof(uint64);
+ uint64 codeword;
+ uint64 deltas[60];
+ char *p;
+
+ selector = 0;
+ this_nints = fixed_width_modes[0].num_ints;
+ this_bits = fixed_width_modes[0].bits_per_int;
+ this_supports_nulls = fixed_width_modes[0].nullbitmap;
+
+ val = tids[0] - prevtid;
+ has_nulls = isnulls[0];
+ i = 0;
+ for (;;)
+ {
+ if (val >= (UINT64CONST(1) << this_bits) ||
+ (has_nulls && !this_supports_nulls))
+ {
+ /* Too large, or need NULL bitmap. Step up to next mode */
+ selector++;
+ this_nints = fixed_width_modes[selector].num_ints;
+ this_bits = fixed_width_modes[selector].bits_per_int;
+ this_supports_nulls = fixed_width_modes[selector].nullbitmap;
+
+ /* we might already have accepted enough deltas for this mode */
+ if (i >= this_nints)
+ break;
+ }
+ else
+ {
+ /* accept this delta; then done if codeword is full */
+ deltas[i] = val;
+ if (!isnulls[i])
+ size += attlen;
+ i++;
+ if (i >= this_nints)
+ break;
+ /* examine next delta */
+ if (i < ntids)
+ has_nulls |= isnulls[i];
+ if (i < ntids && size + attlen <= TARGET_CHUNK_SIZE)
+ {
+ val = tids[i] - tids[i - 1];
+ }
+ else
+ {
+ /*
+ * Reached end of input. Pretend that the next integer is a
+ * value that's too large to represent in Simple-8b, so that
+ * we fall out.
+ */
+ val = PG_UINT64_MAX;
+ }
+ }
+ }
+
+ Assert(i > 0);
+
+ /*
+ * Encode the integers using the selected mode.
+ */
+ codeword = 0;
+ if (has_nulls)
+ {
+ for (int i = 0; i < this_nints; i++)
+ codeword |= isnulls[i] ? (1 << i) : 0;
+ codeword <<= this_nints * this_bits;
+ }
+ for (int i = 0; i < this_nints; i++)
+ codeword |= deltas[i] << (i * this_bits);
+
+ /* add selector to the codeword, and return */
+ codeword |= (uint64) selector << 60;
+
+ /*
+ * Note: 'size' is too large at this point, if we had to "back down" to a
+ * less dense mode. That's fine for sizing the destination buffer, but we
+ * can't rely on it for the final size of the chunk.
+ */
+ enlarge_attstream_buffer(dst, size);
+ p = &dst->data[dst->len];
+ memcpy(p, (char *) &codeword, sizeof(uint64));
+ p += sizeof(uint64);
+
+ /*
+ * Now, the data
+ */
+
+ /* FIXME: the loops below ignore alignment. 'p' might not be aligned */
+ if (attbyval)
+ {
+ if (attlen == sizeof(Datum))
+ {
+ for (int i = 0; i < this_nints; i++)
+ {
+ if (!isnulls[i])
+ {
+ *((Datum *) p) = datums[i];
+ p += sizeof(Datum);
+ }
+ }
+ }
+ else if (attlen == sizeof(int32))
+ {
+ for (int i = 0; i < this_nints; i++)
+ {
+ if (!isnulls[i])
+ {
+ *((int32 *) p) = DatumGetInt32(datums[i]);
+ p += sizeof(int32);
+ }
+ }
+ }
+ else if (attlen == sizeof(int16))
+ {
+ for (int i = 0; i < this_nints; i++)
+ {
+ if (!isnulls[i])
+ {
+ *((int16 *) p) = DatumGetInt16(datums[i]);
+ p += sizeof(int16);
+ }
+ }
+ }
+ else if (attlen == sizeof(char))
+ {
+ for (int i = 0; i < this_nints; i++)
+ {
+ if (!isnulls[i])
+ *(p++) = DatumGetChar(datums[i]);
+ }
+ }
+ else
+ elog(ERROR, "unsupported byval length: %d", attlen);
+ }
+ else
+ {
+ for (int i = 0; i < this_nints; i++)
+ {
+ if (!isnulls[i])
+ {
+ memcpy(p, DatumGetPointer(datums[i]), attlen);
+ p += attlen;
+ }
+ }
+ }
+ dst->len = p - dst->data;
+ Assert(dst->len <= dst->maxlen);
+
+ return this_nints;
+}
+
+/*
+ * VARIABLE-SIZE MODES
+ * -------------------
+ *
+ * These are used with varlenas. With varlenas, we encode not only the
+ * TIDs and NULLness of each datum, but also its length, in the same
+ * codeword. The value 0 stored in the length field is used to represent
+ * a NULL; there is no separate NULL bitmap. For non-NULLs, the value
+ * stored in the length is the real length + 1.
+ *
+ * We don't track a maximum size for the chunk during encoding, but the
+ * fact that we use a smallish number of bits to store the length, depending
+ * on the mode, puts a cap on the max chunk size. For example, in mode 4,
+ * we encode 10 TIDs in a codeword with 4 bits to store the length. With four
+ * bits, each datum can be max 14 bytes long. That limits the max size of a
+ * chunk in mode 4 to 10*14 = 140 bytes. Below table shows the modes with the
+ * number of bits use for the TID and length of each datum, and the maximum
+ * chunk size they give (not including the size of the codeword itself)
+ *
+ * mode tidbits lenbits wasted ntids maxsize
+ * 0 1 1 0 30 30
+ * 1 1 2 0 20 40
+ * 2 1 3 0 15 90
+ * 3 2 3 0 12 72
+ * 4 2 4 0 10 140
+ * 5 3 4 4 8 112
+ * 6 4 4 4 7 98
+ * 7 5 5 0 6 180
+ * 8 6 6 0 5 310
+ * 9 8 7 0 4 504
+ * 10 13 7 0 3 378
+ * 11 23 7 0 2 252
+ * 12 45 15 0 1 32766
+ * 13 unused
+ * 14 toast
+ * 15 unused
+ *
+ * Modes 13 and 15 are currently unused. (The idea is that 15 could be
+ * used for various extended modes with special handling, using more
+ * bits to indicate which extended mode it is. And it seems logical to
+ * have special modes, like the toast mode, at the end. We could use
+ * 13 for another "regular" mode.)
+ *
+ * Mode 14 is special: It is used to encode a toasted datum. The toast
+ * datum is compressed with toast_compress_datum(). Unlike the other
+ * modes, the toast mode lenbits field is overloaded and is used to
+ * identify where the toast datum is stored. The toast datum could be
+ * stored either inline or in separate toast pages. The TID of the
+ * datum is stored in the codeword as is. If the toast datum is stored
+ * inline, then after the codeword comes the zs_toast_header_inline
+ * followed by the datum. If the toast datum is stored in separate
+ * toast pages, then after the codeword comes the block number of the
+ * first toast block, as a 32-bit integer.
+ *
+ * FIXME: Mode 12 is the widest mode, but it only uses up to 45 bits for
+ * the TID. That's not enough to cover the whole range of valid zstids.
+ * I think we need one more special mode, where we use full 60 bits for
+ * the TID, with the length stored separately after the codeword, for
+ * the odd case that you have a very large datum with a very high TID.
+ */
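+
+/*
+ * A small worked example of the length encoding described above: in
+ * mode 4 (10 TIDs, with 2 TID bits and 4 length bits each), a NULL datum
+ * is stored with length-field 0 and contributes no payload bytes, a
+ * 5-byte datum is stored with length-field 6, and the largest datum that
+ * fits is 14 bytes (length-field 15). The payload after the codeword is
+ * simply the non-NULL datums concatenated, which is where the 10*14 = 140
+ * byte maximum for mode 4 in the table above comes from.
+ */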
+static const struct
+{
+ uint8 bits_per_tid;
+ uint8 lenbits;
+ uint8 num_ints;
+} varlen_modes[17] =
+{
+ { 1, 1, 30 }, /* mode 0 */
+ { 1, 2, 20 }, /* mode 1 */
+ { 1, 3, 15 }, /* mode 2 */
+ { 2, 3, 12 }, /* mode 3 */
+ { 2, 4, 10 }, /* mode 4 */
+ { 3, 4, 8 }, /* mode 5 */
+ { 4, 4, 7 }, /* mode 6 */
+ { 5, 5, 6 }, /* mode 7 */
+ { 6, 6, 5 }, /* mode 8 */
+ { 8, 7, 4 }, /* mode 9 */
+ { 13, 7, 3 }, /* mode 10 */
+ { 23, 7, 2 }, /* mode 11 */
+ { 45, 15, 1 }, /* mode 12 */
+
+ /* special modes */
+ { 0, 0, 0 }, /* mode 13 (unused) */
+ { 48, 12, 1 }, /* mode 14 (toast) */
+ { 0, 0, 0 }, /* mode 15 */
+
+ { 0, 0, 0 } /* sentinel */
+};
+
+#define ZS_VARLENA_INLINE_TOAST 1
+#define ZS_VARLENA_TOAST_PAGE 0
+
+typedef struct zs_toast_header_external {
+ BlockNumber toast_blkno;
+} zs_toast_header_external;
+
+typedef struct zs_toast_header_inline
+{
+ uint32 compressed_size;
+ uint32 rawsize;
+} zs_toast_header_inline;
+
+static int
+get_toast_chunk_length(char *chunk, uint64 toast_mode_selector)
+{
+ int chunk_len;
+
+ if (toast_mode_selector == ZS_VARLENA_INLINE_TOAST)
+ {
+ zs_toast_header_inline *hdr;
+ chunk_len = sizeof(uint64) + sizeof(zs_toast_header_inline);
+
+ hdr = (zs_toast_header_inline *) (chunk + sizeof(uint64));
+ chunk_len += hdr->compressed_size;
+ }
+ else if (toast_mode_selector == ZS_VARLENA_TOAST_PAGE)
+ {
+ chunk_len = sizeof(uint64) + sizeof(zs_toast_header_external);
+ }
+ else
+ {
+ elog(ERROR, "Invalid toast chunk type");
+ }
+
+ return chunk_len;
+}
+
+static int
+get_chunk_length_varlen(char *chunk)
+{
+ uint64 codeword;
+
+ memcpy(&codeword, chunk, sizeof(uint64));
+
+ {
+ int selector = (codeword >> 60);
+ int nints = varlen_modes[selector].num_ints;
+ int tidbits = varlen_modes[selector].bits_per_tid;
+ int lenbits = varlen_modes[selector].lenbits;
+ uint64 lenmask = (UINT64CONST(1) << lenbits) - 1;
+ int total_len;
+
+ /* skip over the TIDs */
+ codeword >>= tidbits * nints;
+
+ if (selector == 14)
+ {
+ /* in-line toast datum or toast pointer */
+ return get_toast_chunk_length(chunk, codeword & lenmask);
+ }
+
+ /* Sum up the lengths */
+ total_len = 0;
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 len = codeword & lenmask;
+
+ if (len > 0)
+ total_len += len - 1;
+ codeword >>= lenbits;
+ }
+ return sizeof(uint64) + total_len;
+ }
+}
+
+static zstid
+get_chunk_first_tid_varlen(char *chunk)
+{
+ uint64 codeword;
+
+ memcpy(&codeword, chunk, sizeof(uint64));
+
+ {
+ int selector = (codeword >> 60);
+ int bits = varlen_modes[selector].bits_per_tid;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+
+ /* get first tid */
+ return (codeword & mask);
+ }
+}
+
+static bool
+replace_first_tid_varlen(zstid newtid, char *chunk)
+{
+ uint64 codeword;
+
+ memcpy(&codeword, chunk, sizeof(uint64));
+
+ {
+ int selector = (codeword >> 60);
+ int bits = varlen_modes[selector].bits_per_tid;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+
+ if (newtid > mask)
+ return false;
+
+ /* zero out the TID */
+ codeword &= ~mask;
+ codeword |= newtid;
+
+ memcpy(chunk, &codeword, sizeof(uint64));
+
+ return true;
+ }
+}
+
+static int
+skip_chunk_varlen(char *chunk, zstid *lasttid)
+{
+ zstid prevtid = *lasttid;
+ char *p = chunk;
+ uint64 codeword;
+
+ memcpy(&codeword, p, sizeof(uint64));
+ p += sizeof(uint64);
+
+ {
+ int selector = (codeword >> 60);
+ int nints = varlen_modes[selector].num_ints;
+ int tidbits = varlen_modes[selector].bits_per_tid;
+ int lenbits = varlen_modes[selector].lenbits;
+ uint64 mask = (UINT64CONST(1) << tidbits) - 1;
+ uint64 lenmask = (UINT64CONST(1) << lenbits) - 1;
+ int total_len;
+ zstid tid = prevtid;
+
+ if (selector == 14)
+ {
+ /* toast pointer */
+ *lasttid = tid + (codeword & mask);
+ codeword >>= tidbits * nints;
+ return get_toast_chunk_length(chunk, codeword & lenmask);
+ }
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 val = codeword & mask;
+
+ tid += val;
+ codeword >>= tidbits;
+ }
+
+ /* Sum up the lengths */
+ total_len = 0;
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 len = codeword & lenmask;
+
+ if (len > 0)
+ total_len += len - 1;
+ codeword >>= lenbits;
+ }
+
+ /* ignore the datums */
+ *lasttid = tid;
+ return sizeof(uint64) + total_len;
+ }
+}
+
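+/*
+ * Return the number of elements (TIDs) encoded in the given chunk's
+ * codeword, without decoding the whole chunk.
+ */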
+static int
+chunk_num_elements(char *chunk, int attlen)
+{
+ char *p = chunk;
+ uint64 codeword;
+ int selector;
+
+ memcpy(&codeword, p, sizeof(uint64));
+ p += sizeof(uint64);
+
+ selector = (codeword >> 60);
+ return attlen > 0 ? fixed_width_modes[selector].num_ints : varlen_modes[selector].num_ints;
+}
+
+static int
+decode_chunk_varlen(zstid *lasttid, char *chunk,
+ int *num_elems, zstid *tids, Datum *datums, bool *isnulls)
+{
+ char *p = chunk;
+ uint64 codeword;
+
+ memcpy(&codeword, p, sizeof(uint64));
+ p += sizeof(uint64);
+
+ {
+ int selector = (codeword >> 60);
+ int nints = varlen_modes[selector].num_ints;
+ int tidbits = varlen_modes[selector].bits_per_tid;
+ int lenbits = varlen_modes[selector].lenbits;
+ uint64 tidmask = (UINT64CONST(1) << tidbits) - 1;
+ uint64 lenmask = (UINT64CONST(1) << lenbits) - 1;
+ zstid tid = *lasttid;
+ char *datump;
+
+ if (selector == 14)
+ {
+ /* in-line toast datum or toast pointer */
+ tid += (codeword & tidmask);
+
+ if ((codeword >> (tidbits * nints)) & lenmask & ZS_VARLENA_INLINE_TOAST)
+ {
+ zs_toast_header_inline hdr;
+ uint32 len;
+
+ memcpy(&hdr, p, sizeof(zs_toast_header_inline));
+ p += sizeof(zs_toast_header_inline);
+
+ len = hdr.compressed_size;
+ datump = palloc0(len + TOAST_COMPRESS_HDRSZ);
+ SET_VARSIZE_COMPRESSED(datump, len + TOAST_COMPRESS_HDRSZ);
+ TOAST_COMPRESS_SET_RAWSIZE(datump, hdr.rawsize);
+ memcpy(datump + TOAST_COMPRESS_HDRSZ, p, len);
+ p += len;
+
+ datums[0] = PointerGetDatum(datump);
+ }
+ else
+ {
+ zs_toast_header_external hdr;
+ BlockNumber toastblkno;
+ varatt_zs_toastptr *toastptr;
+
+ memcpy(&hdr, p, sizeof(zs_toast_header_external));
+ p += sizeof(zs_toast_header_external);
+
+ toastblkno = hdr.toast_blkno;
+
+ toastptr = palloc0(sizeof(varatt_zs_toastptr));
+ SET_VARTAG_1B_E(toastptr, VARTAG_ZEDSTORE);
+ toastptr->zst_block = toastblkno;
+
+ datums[0] = PointerGetDatum(toastptr);
+ }
+
+ tids[0] = tid;
+ isnulls[0] = false;
+ *num_elems = 1;
+
+ *lasttid = tid;
+ return p - chunk;
+ }
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 val = codeword & tidmask;
+
+ tid = tid + val;
+ tids[i] = tid;
+ codeword >>= tidbits;
+ }
+ *lasttid = tid;
+
+ /* Decode the datums / isnulls */
+ datump = palloc(MAXALIGN(VARHDRSZ + ((1 << lenbits))) * nints);
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 len = codeword & lenmask;
+
+ if (len == 0)
+ {
+ datums[i] = (Datum) 0;
+ isnulls[i] = true;
+ }
+ else
+ {
+ memcpy(VARDATA(datump), p, len - 1);
+ SET_VARSIZE(datump, len - 1 + VARHDRSZ);
+
+ datums[i] = PointerGetDatum(datump);
+ isnulls[i] = false;
+
+ datump += MAXALIGN(VARHDRSZ + len - 1);
+ p += (len - 1);
+ }
+ codeword >>= lenbits;
+ }
+
+ *num_elems = nints;
+ return p - chunk;
+ }
+}
+
+static int
+encode_chunk_varlen_inline_toast(attstream_buffer *dst, zstid prevtid,
+ zstid *tids, Datum *datums)
+{
+ uint32 len;
+ uint64 codeword;
+ char *p;
+
+ zs_toast_header_inline hdr;
+ hdr.compressed_size = TOAST_COMPRESS_SIZE(datums[0]);
+ hdr.rawsize = TOAST_COMPRESS_RAWSIZE(datums[0]);
+ len = hdr.compressed_size;
+
+ codeword = UINT64CONST(14) << 12;
+ codeword = (codeword | ((uint64)ZS_VARLENA_INLINE_TOAST)) << 48;
+ codeword = codeword | (tids[0] - prevtid);
+
+ enlarge_attstream_buffer(dst, sizeof(uint64) +
+ sizeof(zs_toast_header_inline) + len);
+ p = dst->data + dst->len;
+ memcpy(p, (char *) &codeword, sizeof(uint64));
+ p += sizeof(uint64);
+ memcpy(p, (char *) &hdr, sizeof(zs_toast_header_inline));
+ p += sizeof(zs_toast_header_inline);
+ memcpy(p, (char *) TOAST_COMPRESS_RAWDATA(datums[0]), len);
+ dst->len += sizeof(uint64) + sizeof(zs_toast_header_inline) + len;
+ return 1;
+}
+
+static int
+encode_chunk_varlen_toast_page(attstream_buffer *dst, zstid prevtid, zstid *tids, Datum *datums)
+{
+ uint64 codeword;
+ char *p;
+ zs_toast_header_external hdr;
+ varatt_zs_toastptr *toastptr = (varatt_zs_toastptr *) DatumGetPointer(datums[0]);
+ hdr.toast_blkno = toastptr->zst_block;
+
+ codeword = UINT64CONST(14) << 12;
+ codeword = (codeword | ((uint64)ZS_VARLENA_TOAST_PAGE)) << 48;
+ codeword = codeword | (tids[0] - prevtid);
+
+ enlarge_attstream_buffer(dst, sizeof(uint64) + sizeof(zs_toast_header_external));
+ p = dst->data + dst->len;
+ memcpy(p, (char *) &codeword, sizeof(uint64));
+ p += sizeof(uint64);
+ memcpy(p, (char *) &hdr, sizeof(zs_toast_header_external));
+ dst->len += sizeof(uint64) + sizeof(zs_toast_header_external);
+ return 1;
+}
+
+static int
+encode_chunk_varlen(attstream_buffer *dst, zstid prevtid, int ntids,
+ zstid *tids, Datum *datums, bool *isnulls)
+{
+ /*
+ * Select the "mode" to use for this codeword.
+ *
+ * In each iteration, check if the next value can be represented in the
+ * current mode we're considering. If it's too large, then step up the
+ * mode to a wider one, and repeat. If it fits, move on to the next
+ * integer. Repeat until the codeword is full, given the current mode.
+ *
+ * Note that we don't have any way to represent unused slots in the
+ * codeword, so we require each codeword to be "full". It is always
+ * possible to produce a full codeword unless the very first delta is too
+ * large to be encoded. For example, if the first delta is small but the
+ * second is too large to be encoded, we'll end up using the last "mode",
+ * which has nints == 1.
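+ *
+ * For example, a run of 4-byte datums on consecutive TIDs has a TID delta
+ * of 1 and an encoded length of 4 + 1 = 5 for each datum, so the loop
+ * settles on mode 2 (1 TID bit, 3 length bits) and packs 15 datums into
+ * a single codeword.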
+ */
+ int selector;
+ int this_nints;
+ int this_tidbits;
+ int this_lenbits;
+ uint64 val;
+ int len;
+ int i;
+ uint64 codeword;
+ uint64 deltas[60];
+ char *p;
+
+ /* special case for inline toast */
+ if (!isnulls[0] && VARATT_IS_COMPRESSED(datums[0]))
+ return encode_chunk_varlen_inline_toast(dst, prevtid, tids, datums);
+
+ /* special case for toast pointers */
+ else if (!isnulls[0] && VARATT_IS_EXTERNAL(datums[0]) && VARTAG_EXTERNAL(datums[0]) == VARTAG_ZEDSTORE)
+ return encode_chunk_varlen_toast_page(dst, prevtid, tids, datums);
+
+ selector = 0;
+ this_nints = varlen_modes[0].num_ints;
+ this_tidbits = varlen_modes[0].bits_per_tid;
+ this_lenbits = varlen_modes[0].lenbits;
+
+ val = tids[0] - prevtid;
+ if (isnulls[0])
+ len = 0;
+ else
+ len = VARSIZE_ANY_EXHDR(datums[0]) + 1;
+ i = 0;
+ for (;;)
+ {
+ if (val >= (UINT64CONST(1) << this_tidbits) ||
+ len >= (UINT64CONST(1) << this_lenbits))
+ {
+ /* Too large TID distance, or length. Step up to next mode */
+ selector++;
+ this_nints = varlen_modes[selector].num_ints;
+ this_tidbits = varlen_modes[selector].bits_per_tid;
+ this_lenbits = varlen_modes[selector].lenbits;
+
+ /* we might already have accepted enough deltas for this mode */
+ if (i >= this_nints)
+ break;
+ }
+ else
+ {
+ /* accept this delta; then done if codeword is full */
+ deltas[i] = val;
+ i++;
+
+ if (i >= this_nints)
+ break;
+
+ /* examine next delta and length */
+ if (i < ntids)
+ {
+ val = tids[i] - tids[i - 1];
+ if (isnulls[i])
+ len = 0;
+ else
+ {
+ if ((VARATT_IS_EXTERNAL(datums[i]) && VARTAG_EXTERNAL(datums[i]) == VARTAG_ZEDSTORE) ||
+ VARATT_IS_COMPRESSED(datums[i]))
+ {
+ /* toast pointer or toast datum, bail out */
+ val = PG_UINT64_MAX;
+ len = PG_INT32_MAX;
+ }
+ else
+ len = VARSIZE_ANY_EXHDR(datums[i]) + 1;
+ }
+ }
+ else
+ {
+ /*
+ * Reached end of input. Pretend that the next integer is a
+ * value that's too large to represent in Simple-8b, so that
+ * we fall out.
+ */
+ val = PG_UINT64_MAX;
+ len = PG_INT32_MAX;
+ }
+ }
+ }
+
+ Assert(i > 0);
+
+ /*
+ * Encode the length and TID deltas using the selected mode.
+ */
+ codeword = 0;
+ for (int i = 0; i < this_nints; i++)
+ {
+ int len;
+
+ if (isnulls[i])
+ len = 0;
+ else
+ len = VARSIZE_ANY_EXHDR(datums[i]) + 1;
+ codeword |= (uint64) len << (i * this_lenbits);
+ }
+ codeword <<= this_nints * this_tidbits;
+
+ for (int i = 0; i < this_nints; i++)
+ codeword |= deltas[i] << (i * this_tidbits);
+
+ /* add selector to the codeword, and return */
+ codeword |= (uint64) selector << 60;
+
+ enlarge_attstream_buffer(dst, sizeof(uint64) + (1 << this_lenbits) * this_nints);
+ p = &dst->data[dst->len];
+
+ memcpy(p, (char *) &codeword, sizeof(uint64));
+ p += sizeof(uint64);
+
+ /*
+ * Now, the data
+ */
+ for (int i = 0; i < this_nints; i++)
+ {
+ if (!isnulls[i])
+ {
+ int len = VARSIZE_ANY_EXHDR(datums[i]);
+
+ memcpy(p, VARDATA_ANY(datums[i]), len);
+ p += len;
+ }
+ }
+
+ Assert(p - dst->data < dst->maxlen);
+ dst->len = p - dst->data;
+ return this_nints;
+}
+
+
+
+
+/*
+ * Wrapper functions over the fixed-length and varlen variants.
+ */
+
+static bool
+replace_first_tid_in_place(int attlen, zstid newtid, char *chunk)
+{
+ if (attlen > 0)
+ return replace_first_tid_fixed(attlen, newtid, chunk);
+ else
+ return replace_first_tid_varlen(newtid, chunk);
+}
+
+static int
+skip_chunk(int attlen, char *chunk, zstid *lasttid)
+{
+ if (attlen > 0)
+ return skip_chunk_fixed(attlen, chunk, lasttid);
+ else
+ return skip_chunk_varlen(chunk, lasttid);
+}
+
+static int
+get_chunk_length(int attlen, char *chunk)
+{
+ if (attlen > 0)
+ return get_chunk_length_fixed(attlen, chunk);
+ else
+ return get_chunk_length_varlen(chunk);
+}
+
+static zstid
+get_chunk_first_tid(int attlen, char *chunk)
+{
+ if (attlen > 0)
+ return get_chunk_first_tid_fixed(attlen, chunk);
+ else
+ return get_chunk_first_tid_varlen(chunk);
+}
+
+/*
+ * Decode given chunk into 'tids', 'datums', 'isnulls'.
+ * Also populate 'num_elems' - the number of elements present in the chunk.
+ * N.B. 'lasttid' is an I/O parameter. The caller must supply the absolute
+ * lasttid of the preceding chunk. This is to ensure that we can calculate
+ * absolute tids for the 'tids' array.
+ */
+static int
+decode_chunk(bool attbyval, int attlen, zstid *lasttid, char *chunk,
+ int *num_elems, zstid *tids, Datum *datums, bool *isnulls)
+{
+ if (attlen > 0)
+ return decode_chunk_fixed(attbyval, attlen, lasttid, chunk, num_elems,
+ tids, datums, isnulls);
+ else
+ return decode_chunk_varlen(lasttid, chunk, num_elems,
+ tids, datums, isnulls);
+}
+
+static int
+encode_chunk(attstream_buffer *buf, zstid prevtid, int ntids,
+ zstid *tids, Datum *datums, bool *isnulls)
+{
+ if (buf->attlen > 0)
+ return encode_chunk_fixed(buf, prevtid, ntids,
+ tids, datums, isnulls);
+ else
+ return encode_chunk_varlen(buf, prevtid, ntids,
+ tids, datums, isnulls);
+}
diff --git src/backend/access/zedstore/zedstore_btree.c src/backend/access/zedstore/zedstore_btree.c
new file mode 100644
index 0000000000..1e0ab08380
--- /dev/null
+++ src/backend/access/zedstore/zedstore_btree.c
@@ -0,0 +1,1002 @@
+/*
+ * zedstore_btree.c
+ * Common routines for handling TID and attribute B-tree structures
+ *
+ * A Zedstore table consists of multiple B-trees, one to store TIDs and
+ * visibility information of the rows, and one tree for each attribute,
+ * to hold the data. The TID and attribute trees differ at the leaf
+ * level, but the internal pages have the same layout. This file contains
+ * routines to deal with internal pages, and some other common
+ * functionality.
+ *
+ * When dealing with the TID tree, pass ZS_META_ATTRIBUTE_NUM as the
+ * attribute number.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_btree.c
+ */
+#include "postgres.h"
+
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undorec.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/procarray.h"
+#include "utils/rel.h"
+
+/* prototypes for local functions */
+static zs_split_stack *zsbt_split_internal_page(Relation rel, AttrNumber attno,
+ Buffer leftbuf, OffsetNumber newoff, List *downlinks);
+static zs_split_stack *zsbt_merge_pages(Relation rel, AttrNumber attno, Buffer leftbuf, Buffer rightbuf, bool target_is_left);
+
+static int zsbt_binsrch_internal(zstid key, ZSBtreeInternalPageItem *arr, int arr_elems);
+
+/*
+ * Find the page containing the given key TID at the given level.
+ *
+ * Level 0 means leaf. The returned buffer is exclusive-locked.
+ *
+ * If tree doesn't exist at all (probably because the table was just created
+ * or truncated), the behavior depends on the 'readonly' argument. If
+ * readonly == true, then returns InvalidBuffer. If readonly == false, then
+ * the tree is created.
+ */
+Buffer
+zsbt_descend(Relation rel, AttrNumber attno, zstid key, int level, bool readonly)
+{
+ BlockNumber next;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ ZSBtreeInternalPageItem *items;
+ int nitems;
+ int itemno;
+ int nextlevel;
+ BlockNumber failblk = InvalidBlockNumber;
+ int faillevel = -1;
+ ZSMetaCacheData *metacache;
+
+ Assert(key != InvalidZSTid);
+
+ /* Fast path for the very common case that we're looking for the rightmost page */
+ metacache = zsmeta_get_cache(rel);
+ if (level == 0 &&
+ attno < metacache->cache_nattributes &&
+ metacache->cache_attrs[attno].rightmost != InvalidBlockNumber &&
+ key >= metacache->cache_attrs[attno].rightmost_lokey)
+ {
+ next = metacache->cache_attrs[attno].rightmost;
+ nextlevel = 0;
+ }
+ else
+ {
+ /* start from root */
+ next = zsmeta_get_root_for_attribute(rel, attno, readonly);
+ if (next == InvalidBlockNumber)
+ {
+ /* completely empty tree */
+ return InvalidBuffer;
+ }
+ nextlevel = -1;
+ }
+ for (;;)
+ {
+ /*
+ * If we arrive again to a block that was a dead-end earlier, it seems
+ * that the tree is corrupt.
+ *
+ * XXX: It's theoretically possible that the block was removed, but then
+ * added back at the same location, and removed again. So perhaps retry
+ * a few times?
+ */
+ if (next == failblk || next == ZS_META_BLK)
+ elog(ERROR, "arrived at incorrect block %u while descending zedstore btree", next);
+
+ buf = ReadBuffer(rel, next);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); /* TODO: shared */
+ page = BufferGetPage(buf);
+ if (!zsbt_page_is_expected(rel, attno, key, nextlevel, buf))
+ {
+ /*
+ * We arrived at an unexpected page. This can happen with concurrent
+ * splits, or page deletions. We could try following the right-link, but
+ * there's no guarantee that's the correct page either, so let's restart
+ * from the root. If we landed here because of concurrent modifications,
+ * the next attempt should land on the correct page. Remember that we
+ * incorrectly ended up on this page, so that if this happens because
+ * the tree is corrupt, rather than concurrent splits, and we land here
+ * again, we won't loop forever.
+ */
+ UnlockReleaseBuffer(buf);
+
+ failblk = next;
+ faillevel = nextlevel;
+ nextlevel = -1;
+ zsmeta_invalidate_cache(rel);
+ next = zsmeta_get_root_for_attribute(rel, attno, readonly);
+ if (next == InvalidBlockNumber)
+ elog(ERROR, "could not find root for attribute %d", attno);
+
+ /*
+ * If the root was split after we cached the metadata, it's
+ * possible that the page we thought was the root page no longer
+ * is, but as we descend from the new root page, we'll end up on
+ * the same page again anyway. Don't treat that as an error. To
+ * avoid it, check for the root case here, and if so, reset 'failblk'.
+ */
+ if (faillevel == -1)
+ {
+ if (next == failblk)
+ elog(ERROR, "arrived at incorrect block %u while descending zedstore btree", next);
+ failblk = InvalidBlockNumber;
+ }
+ continue;
+ }
+ opaque = ZSBtreePageGetOpaque(page);
+
+ if (nextlevel == -1)
+ nextlevel = opaque->zs_level;
+
+ else if (opaque->zs_level != nextlevel)
+ elog(ERROR, "unexpected level encountered when descending tree");
+
+ if (opaque->zs_level == level)
+ break;
+
+ /* Find the downlink and follow it */
+ items = ZSBtreeInternalPageGetItems(page);
+ nitems = ZSBtreeInternalPageGetNumItems(page);
+
+ itemno = zsbt_binsrch_internal(key, items, nitems);
+ if (itemno < 0)
+ elog(ERROR, "could not descend tree for tid (%u, %u)",
+ ZSTidGetBlockNumber(key), ZSTidGetOffsetNumber(key));
+
+ next = items[itemno].childblk;
+ nextlevel--;
+
+ UnlockReleaseBuffer(buf);
+ }
+
+ if (opaque->zs_level == 0 && opaque->zs_next == InvalidBlockNumber)
+ {
+ metacache = zsmeta_get_cache(rel);
+ if (attno < metacache->cache_nattributes)
+ {
+ metacache->cache_attrs[attno].rightmost = next;
+ metacache->cache_attrs[attno].rightmost_lokey = opaque->zs_lokey;
+ }
+ }
+
+ return buf;
+}
+
+
+/*
+ * Find and lock the leaf page that contains data for scan->nexttid.
+ *
+ * If 'buf' is valid, it is a previously pinned page. We will check that
+ * page first. If it's not the correct page, it will be released.
+ *
+ * Returns InvalidBuffer if the attribute tree doesn't exist at all.
+ * That should only happen after ALTER TABLE ADD COLUMN, or on a newly
+ * created table; but none of the current callers would even try to
+ * fetch attribute data without scanning the TID tree first.
+ */
+Buffer
+zsbt_find_and_lock_leaf_containing_tid(Relation rel, AttrNumber attno,
+ Buffer buf, zstid nexttid, int lockmode)
+{
+ if (BufferIsValid(buf))
+ {
+retry:
+ LockBuffer(buf, lockmode);
+
+ /*
+ * It's possible that the page was concurrently split or recycled by
+ * another backend (or ourselves). Have to re-check that the page is
+ * still valid.
+ */
+ if (zsbt_page_is_expected(rel, attno, nexttid, 0, buf))
+ return buf;
+ else
+ {
+ /*
+ * It's not valid for the TID we're looking for, but maybe it was the
+ * right page for the previous TID. In that case, we don't need to
+ * restart from the root, we can follow the right-link instead.
+ */
+ if (nexttid > MinZSTid &&
+ zsbt_page_is_expected(rel, attno, nexttid - 1, 0, buf))
+ {
+ Page page = BufferGetPage(buf);
+ ZSBtreePageOpaque *opaque = ZSBtreePageGetOpaque(page);
+ BlockNumber next = opaque->zs_next;
+
+ if (next != InvalidBlockNumber)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ buf = ReleaseAndReadBuffer(buf, rel, next);
+ goto retry;
+ }
+ }
+
+ UnlockReleaseBuffer(buf);
+ buf = InvalidBuffer;
+ }
+ }
+
+ /* Descend the B-tree to find the correct leaf page. */
+ if (!BufferIsValid(buf))
+ buf = zsbt_descend(rel, attno, nexttid, 0, true);
+
+ return buf;
+}
+
+
+/*
+ * Check that a page is a valid B-tree page, and covers the given key.
+ *
+ * This is used when traversing the tree, to check that e.g. a concurrent page
+ * split didn't move pages around, so that the page we were walking to isn't
+ * the correct one anymore.
+ */
+bool
+zsbt_page_is_expected(Relation rel, AttrNumber attno, zstid key, int level, Buffer buf)
+{
+ Page page = BufferGetPage(buf);
+ ZSBtreePageOpaque *opaque;
+
+ /*
+ * The page might have been deleted and even reused as a completely different
+ * kind of a page, so we must be prepared for anything.
+ */
+ if (PageIsNew(page))
+ return false;
+
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSBtreePageOpaque)))
+ return false;
+
+ opaque = ZSBtreePageGetOpaque(page);
+
+ if (opaque->zs_page_id != ZS_BTREE_PAGE_ID)
+ return false;
+
+ if (opaque->zs_attno != attno)
+ return false;
+
+ if (level == -1)
+ {
+ if ((opaque->zs_flags & ZSBT_ROOT) == 0)
+ return false;
+ }
+ else
+ {
+ if (opaque->zs_level != level)
+ return false;
+ }
+
+ if (opaque->zs_lokey > key || opaque->zs_hikey <= key)
+ return false;
+
+ /* extra checks for corrupted pages */
+ if (opaque->zs_next == BufferGetBlockNumber(buf))
+ elog(ERROR, "btree page %u next-pointer points to itself", opaque->zs_next);
+
+ return true;
+}
+
+/*
+ * Create a new btree root page, containing supplied downlinks.
+ *
+ * NOTE: the very first root page of a btree, which is also the leaf, is created
+ * in zsmeta_get_root_for_attribute(), not here.
+ *
+ * XXX: What if there are too many downlinks to fit on a page? Shouldn't happen
+ * in practice.
+ */
+zs_split_stack *
+zsbt_newroot(Relation rel, AttrNumber attno, int level, List *downlinks)
+{
+ Page metapage;
+ ZSMetaPage *metapg;
+ Buffer newrootbuf;
+ Page newrootpage;
+ ZSBtreePageOpaque *newrootopaque;
+ ZSBtreeInternalPageItem *items;
+ Buffer metabuf;
+ zs_split_stack *stack1;
+ zs_split_stack *stack2;
+ ListCell *lc;
+ int i;
+
+ newrootbuf = zspage_getnewbuf(rel, attno);
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ /* allocate a new root page */
+ newrootpage = palloc(BLCKSZ);
+ PageInit(newrootpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ newrootopaque = ZSBtreePageGetOpaque(newrootpage);
+ newrootopaque->zs_attno = attno;
+ newrootopaque->zs_next = InvalidBlockNumber;
+ newrootopaque->zs_lokey = MinZSTid;
+ newrootopaque->zs_hikey = MaxPlusOneZSTid;
+ newrootopaque->zs_level = level;
+ newrootopaque->zs_flags = ZSBT_ROOT;
+ newrootopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ items = ZSBtreeInternalPageGetItems(newrootpage);
+
+ /* add all the downlinks */
+ i = 0;
+ foreach (lc, downlinks)
+ {
+ ZSBtreeInternalPageItem *downlink = (ZSBtreeInternalPageItem *) lfirst(lc);
+
+ items[i++] = *downlink;
+ }
+ ((PageHeader) newrootpage)->pd_lower += i * sizeof(ZSBtreeInternalPageItem);
+
+ /* FIXME: Check that all the downlinks fit on the page. */
+
+ /* update the metapage */
+ metapage = PageGetTempPageCopy(BufferGetPage(metabuf));
+
+ metapg = (ZSMetaPage *) PageGetContents(metapage);
+ if ((attno != ZS_META_ATTRIBUTE_NUM) && (attno <= 0 || attno > metapg->nattributes))
+ elog(ERROR, "invalid attribute number %d (table \"%s\" has only %d attributes)",
+ attno, RelationGetRelationName(rel), metapg->nattributes);
+
+ metapg->tree_root_dir[attno].root = BufferGetBlockNumber(newrootbuf);
+
+ stack1 = zs_new_split_stack_entry(metabuf, metapage);
+ stack2 = zs_new_split_stack_entry(newrootbuf, newrootpage);
+ stack2->next = stack1;
+
+ return stack2;
+}
+
+/*
+ * After page split, insert the downlink of 'rightblkno' to the parent.
+ *
+ * On entry, 'leftbuf' must be pinned exclusive-locked.
+ */
+zs_split_stack *
+zsbt_insert_downlinks(Relation rel, AttrNumber attno,
+ zstid leftlokey, BlockNumber leftblkno, int level,
+ List *downlinks)
+{
+ int numdownlinks = list_length(downlinks);
+ ZSBtreeInternalPageItem *items;
+ int nitems;
+ int itemno;
+ Buffer parentbuf;
+ Page parentpage;
+ zs_split_stack *split_stack;
+ ZSBtreeInternalPageItem *firstdownlink;
+
+ /*
+ * re-find parent
+ *
+ * TODO: this is a bit inefficient. Usually, we have just descended the
+ * tree, and if we just remembered the path we descended, we could just
+ * walk back up.
+ */
+
+ /*
+ * XXX: There was a concurrency bug here, too, observed by running
+ * "make installcheck-parallel":
+ * We're holding a lock on a page on level 0, and the root is at level 1.
+ * However, the metacache says that the page we're holding locked is
+ * the root. When we get here, to find the parent page, we will start
+ * with the cached block at level 0, and deadlock with ourselves.
+ *
+ * To fix that, invalidate the cache, if it claims that the child
+ * block is the root. I'm not sure this fixes the whole general problem,
+ * though, so this needs some more thought...
+ */
+ {
+ ZSMetaCacheData *metacache;
+
+ metacache = zsmeta_get_cache(rel);
+ if (attno < metacache->cache_nattributes &&
+ metacache->cache_attrs[attno].root == leftblkno)
+ {
+ metacache->cache_attrs[attno].root = InvalidBlockNumber;
+ }
+ }
+
+ parentbuf = zsbt_descend(rel, attno, leftlokey, level, false);
+ parentpage = BufferGetPage(parentbuf);
+
+ firstdownlink = (ZSBtreeInternalPageItem *) linitial(downlinks);
+
+ /* Find the position in the parent for the downlink */
+ items = ZSBtreeInternalPageGetItems(parentpage);
+ nitems = ZSBtreeInternalPageGetNumItems(parentpage);
+ itemno = zsbt_binsrch_internal(firstdownlink->tid, items, nitems);
+
+ /* sanity checks */
+ if (itemno < 0 || items[itemno].tid != leftlokey ||
+ items[itemno].childblk != leftblkno)
+ {
+ elog(ERROR, "could not find downlink for block %u TID (%u, %u)",
+ leftblkno, ZSTidGetBlockNumber(leftlokey),
+ ZSTidGetOffsetNumber(leftlokey));
+ }
+ itemno++;
+
+ if (PageGetExactFreeSpace(parentpage) < numdownlinks * sizeof(ZSBtreeInternalPageItem))
+ {
+ /* split internal page */
+ split_stack = zsbt_split_internal_page(rel, attno, parentbuf, itemno, downlinks);
+ }
+ else
+ {
+ ZSBtreeInternalPageItem *newitems;
+ Page newpage;
+ int i;
+ ListCell *lc;
+
+ newpage = PageGetTempPageCopySpecial(parentpage);
+
+ split_stack = zs_new_split_stack_entry(parentbuf, newpage);
+
+ /* insert the new downlink for the right page. */
+ newitems = ZSBtreeInternalPageGetItems(newpage);
+ memcpy(newitems, items, itemno * sizeof(ZSBtreeInternalPageItem));
+
+ i = itemno;
+ foreach(lc, downlinks)
+ {
+ ZSBtreeInternalPageItem *downlink = (ZSBtreeInternalPageItem *) lfirst(lc);
+
+ Assert(downlink->childblk != 0);
+ newitems[i++] = *downlink;
+ }
+
+ memcpy(&newitems[i], &items[itemno], (nitems - itemno) * sizeof(ZSBtreeInternalPageItem));
+ ((PageHeader) newpage)->pd_lower += (nitems + numdownlinks) * sizeof(ZSBtreeInternalPageItem);
+ }
+ return split_stack;
+}
+
+/*
+ * Split an internal page.
+ *
+ * The new downlinks in 'newitems' are inserted at position 'newoff' on
+ * 'origbuf', and the page is split as needed to make them fit.
+ */
+static zs_split_stack *
+zsbt_split_internal_page(Relation rel, AttrNumber attno, Buffer origbuf,
+ OffsetNumber newoff, List *newitems)
+{
+ Page origpage = BufferGetPage(origbuf);
+ ZSBtreePageOpaque *origopaque = ZSBtreePageGetOpaque(origpage);
+ Buffer buf;
+ Page page;
+ ZSBtreeInternalPageItem *origitems;
+ int orignitems;
+ zs_split_stack *stack_first;
+ zs_split_stack *stack;
+ Size splitthreshold;
+ ListCell *lc;
+ int origitemno;
+ List *downlinks = NIL;
+
+ origitems = ZSBtreeInternalPageGetItems(origpage);
+ orignitems = ZSBtreeInternalPageGetNumItems(origpage);
+
+ page = PageGetTempPageCopySpecial(origpage);
+ buf = origbuf;
+
+ stack = zs_new_split_stack_entry(buf, page);
+ stack_first = stack;
+
+ /* XXX: currently, we always do 90/10 splits */
+ splitthreshold = PageGetExactFreeSpace(page) * 0.10;
+
+ lc = list_head(newitems);
+ origitemno = 0;
+ for (;;)
+ {
+ ZSBtreeInternalPageItem *item;
+ ZSBtreeInternalPageItem *p;
+
+ if (origitemno == newoff && lc)
+ {
+ item = lfirst(lc);
+ lc = lnext(newitems, lc);
+ }
+ else
+ {
+ if (origitemno == orignitems)
+ break;
+ item = &origitems[origitemno];
+ origitemno++;
+ }
+
+ if (PageGetExactFreeSpace(page) < splitthreshold)
+ {
+ /* have to split to another page */
+ ZSBtreePageOpaque *prevopaque = ZSBtreePageGetOpaque(page);
+ ZSBtreePageOpaque *opaque;
+ BlockNumber blkno;
+ ZSBtreeInternalPageItem *downlink;
+
+ buf = zspage_getnewbuf(rel, attno);
+ blkno = BufferGetBlockNumber(buf);
+ page = palloc(BLCKSZ);
+ PageInit(page, BLCKSZ, sizeof(ZSBtreePageOpaque));
+
+ opaque = ZSBtreePageGetOpaque(page);
+ opaque->zs_attno = attno;
+ opaque->zs_next = prevopaque->zs_next;
+ opaque->zs_lokey = item->tid;
+ opaque->zs_hikey = prevopaque->zs_hikey;
+ opaque->zs_level = prevopaque->zs_level;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ prevopaque->zs_next = blkno;
+ prevopaque->zs_hikey = item->tid;
+
+ stack->next = zs_new_split_stack_entry(buf, page);
+ stack = stack->next;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = item->tid;
+ downlink->childblk = blkno;
+ downlinks = lappend(downlinks, downlink);
+ }
+
+ p = (ZSBtreeInternalPageItem *) ((char *) page + ((PageHeader) page)->pd_lower);
+ *p = *item;
+ ((PageHeader) page)->pd_lower += sizeof(ZSBtreeInternalPageItem);
+ }
+
+ /* recurse to insert downlinks, if we had to split. */
+ if (downlinks)
+ {
+ if ((origopaque->zs_flags & ZSBT_ROOT) != 0)
+ {
+ ZSBtreeInternalPageItem *downlink;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = MinZSTid;
+ downlink->childblk = BufferGetBlockNumber(origbuf);
+ downlinks = lcons(downlink, downlinks);
+
+ stack->next = zsbt_newroot(rel, attno, origopaque->zs_level + 1, downlinks);
+
+ /* clear the ZSBT_ROOT flag on the old root page */
+ ZSBtreePageGetOpaque(stack_first->page)->zs_flags &= ~ZSBT_ROOT;
+ }
+ else
+ {
+ stack->next = zsbt_insert_downlinks(rel, attno,
+ origopaque->zs_lokey,
+ BufferGetBlockNumber(origbuf),
+ origopaque->zs_level + 1,
+ downlinks);
+ }
+ }
+
+ return stack_first;
+}
+
+
+/*
+ * Removes the last item from page, and unlinks the page from the tree.
+ *
+ * NOTE: you cannot remove the only leaf. Returns NULL if the page could not
+ * be deleted.
+ */
+zs_split_stack *
+zsbt_unlink_page(Relation rel, AttrNumber attno, Buffer buf, int level)
+{
+ Page page = BufferGetPage(buf);
+ ZSBtreePageOpaque *opaque = ZSBtreePageGetOpaque(page);
+ Buffer leftbuf;
+ Buffer rightbuf;
+ zs_split_stack *stack;
+
+ /* cannot currently remove the only page at its level. */
+ if (opaque->zs_lokey == MinZSTid && opaque->zs_hikey == MaxPlusOneZSTid)
+ {
+ return NULL;
+ }
+
+ /*
+ * Find the left sibling, or if this is the leftmost page, find the
+ * right sibling.
+ */
+ if (opaque->zs_lokey != MinZSTid)
+ {
+ rightbuf = buf;
+ leftbuf = zsbt_descend(rel, attno, opaque->zs_lokey - 1, level, false);
+
+ stack = zsbt_merge_pages(rel, attno, leftbuf, rightbuf, false);
+ if (!stack)
+ {
+ UnlockReleaseBuffer(leftbuf);
+ return NULL;
+ }
+ }
+ else
+ {
+ rightbuf = zsbt_descend(rel, attno, opaque->zs_hikey, level, false);
+ leftbuf = buf;
+ stack = zsbt_merge_pages(rel, attno, leftbuf, rightbuf, true);
+ if (!stack)
+ {
+ UnlockReleaseBuffer(rightbuf);
+ return NULL;
+ }
+ }
+
+ return stack;
+}
+
+/*
+ * Page deletion:
+ *
+ * Mark page empty, remove downlink. If parent becomes empty, recursively delete it.
+ *
+ * Unlike in the nbtree index, we don't need to worry about concurrent scans. They
+ * will simply retry if they land on an unexpected page.
+ */
+static zs_split_stack *
+zsbt_merge_pages(Relation rel, AttrNumber attno, Buffer leftbuf, Buffer rightbuf, bool target_is_left)
+{
+ Buffer parentbuf;
+ Page origleftpage;
+ Page leftpage;
+ Page rightpage;
+ ZSBtreePageOpaque *leftopaque;
+ ZSBtreePageOpaque *origleftopaque;
+ ZSBtreePageOpaque *rightopaque;
+ ZSBtreeInternalPageItem *parentitems;
+ int parentnitems;
+ Page parentpage;
+ int itemno;
+ zs_split_stack *stack;
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ origleftpage = BufferGetPage(leftbuf);
+ origleftopaque = ZSBtreePageGetOpaque(origleftpage);
+ rightpage = BufferGetPage(rightbuf);
+ rightopaque = ZSBtreePageGetOpaque(rightpage);
+
+ /* find downlink for 'rightbuf' in the parent */
+ parentbuf = zsbt_descend(rel, attno, rightopaque->zs_lokey, origleftopaque->zs_level + 1, false);
+ parentpage = BufferGetPage(parentbuf);
+
+ parentitems = ZSBtreeInternalPageGetItems(parentpage);
+ parentnitems = ZSBtreeInternalPageGetNumItems(parentpage);
+ itemno = zsbt_binsrch_internal(rightopaque->zs_lokey, parentitems, parentnitems);
+ if (itemno < 0 || parentitems[itemno].childblk != BufferGetBlockNumber(rightbuf))
+ elog(ERROR, "could not find downlink to FPM page %u", BufferGetBlockNumber(rightbuf));
+
+ if (parentnitems > 1 && itemno == 0)
+ {
+ /*
+ * Don't delete the leftmost child of a parent. That would move the
+ * keyspace of the parent, so we'd need to adjust the lo/hikey of
+ * the parent page, and the parent's downlink in the grandparent.
+ * Maybe later...
+ */
+ UnlockReleaseBuffer(parentbuf);
+ elog(DEBUG1, "deleting leftmost child of a parent not implemented");
+ return NULL;
+ }
+
+ if (target_is_left)
+ {
+ /* move all items from right to left before unlinking the right page */
+ leftpage = PageGetTempPageCopy(rightpage);
+ leftopaque = ZSBtreePageGetOpaque(leftpage);
+
+ memcpy(leftopaque, origleftopaque, sizeof(ZSBtreePageOpaque));
+ }
+ else
+ {
+ /* right page is empty. */
+ leftpage = PageGetTempPageCopy(origleftpage);
+ leftopaque = ZSBtreePageGetOpaque(leftpage);
+ }
+
+ /* update left hikey */
+ leftopaque->zs_hikey = ZSBtreePageGetOpaque(rightpage)->zs_hikey;
+ leftopaque->zs_next = ZSBtreePageGetOpaque(rightpage)->zs_next;
+
+ Assert(ZSBtreePageGetOpaque(leftpage)->zs_level == ZSBtreePageGetOpaque(rightpage)->zs_level);
+
+ stack = zs_new_split_stack_entry(leftbuf, leftpage);
+ stack_head = stack_tail = stack;
+
+ /* Mark right page as empty/unused */
+ rightpage = palloc0(BLCKSZ);
+
+ stack = zs_new_split_stack_entry(rightbuf, rightpage);
+ stack->recycle = true;
+ stack_tail->next = stack;
+ stack_tail = stack;
+
+ /* remove downlink from parent */
+ if (parentnitems > 1)
+ {
+ Page newpage = PageGetTempPageCopySpecial(parentpage);
+ ZSBtreeInternalPageItem *newitems = ZSBtreeInternalPageGetItems(newpage);
+
+ memcpy(newitems, parentitems, itemno * sizeof(ZSBtreeInternalPageItem));
+ memcpy(&newitems[itemno], &parentitems[itemno + 1], (parentnitems - itemno -1) * sizeof(ZSBtreeInternalPageItem));
+
+ ((PageHeader) newpage)->pd_lower += (parentnitems - 1) * sizeof(ZSBtreeInternalPageItem);
+
+ stack = zs_new_split_stack_entry(parentbuf, newpage);
+ stack_tail->next = stack;
+ stack_tail = stack;
+ }
+ else
+ {
+ /* the parent becomes empty as well. Recursively remove it. */
+ stack_tail->next = zsbt_unlink_page(rel, attno, parentbuf, leftopaque->zs_level + 1);
+ if (stack_tail->next == NULL)
+ {
+ /* oops, couldn't remove the parent. Back out */
+ stack = stack_head;
+ while (stack)
+ {
+ zs_split_stack *next = stack->next;
+
+ pfree(stack->page);
+ pfree(stack);
+ stack = next;
+ }
+ }
+ }
+
+ return stack_head;
+}
+
+/*
+ * Allocate a new zs_split_stack struct.
+ */
+zs_split_stack *
+zs_new_split_stack_entry(Buffer buf, Page page)
+{
+ zs_split_stack *stack;
+
+ stack = palloc(sizeof(zs_split_stack));
+ stack->next = NULL;
+ stack->buf = buf;
+ stack->page = page;
+
+ /* caller can change these */
+ stack->recycle = false;
+ stack->special_only = false;
+
+ return stack;
+}
+
+#define MAX_BLOCKS_IN_REWRITE 199
+
+/*
+ * Apply all the changes represented by a list of zs_split_stack
+ * entries.
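+ *
+ * The typical flow is that a caller like zsbt_insert_downlinks() or
+ * zsbt_merge_pages() constructs the new page contents as a chain of
+ * zs_split_stack entries, with the affected buffers exclusive-locked,
+ * and then calls this function to copy the new contents into place,
+ * WAL-log the whole change as a single record, and release the buffers.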
+ */
+void
+zs_apply_split_changes(Relation rel, zs_split_stack *stack, zs_pending_undo_op *undo_op, AttrNumber attrNumber)
+{
+ zs_split_stack *head = stack;
+ bool wal_needed = RelationNeedsWAL(rel);
+ wal_zedstore_btree_rewrite_pages *xlrec = NULL;
+ int xlrecsz = 0;
+ int block_id = 0;
+ XLogRecPtr recptr;
+
+ if (wal_needed)
+ {
+ int num_pages = 0;
+ int i;
+
+ for (stack = head; stack != NULL; stack = stack->next)
+ num_pages++;
+
+ if (num_pages > MAX_BLOCKS_IN_REWRITE)
+ elog(ERROR, "cannot rewrite more than %d pages in one WAL record",
+ MAX_BLOCKS_IN_REWRITE);
+ /*
+ * +1 buffer for the UNDO page, and +1 rdata for UNDO plus +1 for the
+ * wal_zedstore_btree_rewrite_pages struct.
+ */
+ XLogEnsureRecordSpace(1 + num_pages,
+ 1 + 1 + num_pages);
+
+ xlrecsz = SizeOfZSWalBtreeRewritePages(num_pages);
+ xlrec = palloc(xlrecsz);
+
+ xlrec->numpages = num_pages;
+ i = 0;
+ for (stack = head; stack != NULL; stack = stack->next)
+ {
+ xlrec->pageinfo[i].recycle = stack->recycle;
+ xlrec->pageinfo[i].special_only = stack->special_only;
+ i++;
+ }
+ Assert(i == num_pages);
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) xlrec, xlrecsz);
+
+ if (undo_op)
+ XLogRegisterUndoOp(0, undo_op);
+ }
+
+ START_CRIT_SECTION();
+
+ block_id = 1; /* 0 is undo page */
+ for (stack = head; stack != NULL; stack = stack->next)
+ {
+ Page origpage = BufferGetPage(stack->buf);
+
+ if (stack->special_only)
+ {
+ char *orig_special_area = PageGetSpecialPointer(origpage);
+ char *new_special_area = PageGetSpecialPointer(stack->page);
+ uint16 special_size = PageGetSpecialSize(stack->page);
+
+ Assert(PageGetSpecialSize(origpage) == special_size);
+ Assert(memcmp(origpage, stack->page, BLCKSZ - special_size) == 0);
+ memcpy(orig_special_area, new_special_area, special_size);
+ pfree(stack->page);
+
+ if (wal_needed)
+ {
+ XLogRegisterBuffer(block_id, stack->buf, REGBUF_STANDARD);
+ XLogRegisterBufData(block_id, orig_special_area, special_size);
+ }
+ }
+ else
+ {
+ PageRestoreTempPage(stack->page, BufferGetPage(stack->buf));
+
+ if (wal_needed)
+ XLogRegisterBuffer(block_id, stack->buf,
+ REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+ }
+ MarkBufferDirty(stack->buf);
+
+ block_id++;
+ }
+
+ if (undo_op)
+ zsundo_finish_pending_op(undo_op, (char *) undo_op->payload);
+
+ if (wal_needed)
+ {
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_BTREE_REWRITE_PAGES);
+
+ if (undo_op)
+ PageSetLSN(BufferGetPage(undo_op->reservation.undobuf), recptr);
+
+ for (stack = head; stack != NULL; stack = stack->next)
+ PageSetLSN(BufferGetPage(stack->buf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ if (undo_op)
+ {
+ UnlockReleaseBuffer(undo_op->reservation.undobuf);
+ pfree(undo_op);
+ }
+
+ stack = head;
+ while (stack)
+ {
+ zs_split_stack *next;
+
+ /* add this page to the Free Page Map for recycling */
+ if (stack->recycle)
+ zspage_delete_page(rel, stack->buf, InvalidBuffer, attrNumber);
+
+ UnlockReleaseBuffer(stack->buf);
+
+ next = stack->next;
+ pfree(stack);
+ stack = next;
+ }
+ if (wal_needed)
+ pfree(xlrec);
+}
+
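+/*
+ * Binary search over the items of an internal page. Returns the index of
+ * the last item with tid <= key, or -1 if 'key' is smaller than the first
+ * item's tid.
+ */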
+static int
+zsbt_binsrch_internal(zstid key, ZSBtreeInternalPageItem *arr, int arr_elems)
+{
+ int low,
+ high,
+ mid;
+
+ low = 0;
+ high = arr_elems;
+ while (high > low)
+ {
+ mid = low + (high - low) / 2;
+
+ if (key >= arr[mid].tid)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+ return low - 1;
+}
+
+void
+zsbt_rewrite_pages_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_btree_rewrite_pages *xlrec =
+ (wal_zedstore_btree_rewrite_pages *) XLogRecGetData(record);
+ Buffer buffers[MAX_BLOCKS_IN_REWRITE];
+ uint8 block_id;
+
+ /* sanity checks */
+ if (record->max_block_id >= MAX_BLOCKS_IN_REWRITE)
+ elog(ERROR, "too many blocks in zedstore rewrite_pages record: %d", record->max_block_id + 1);
+ if (xlrec->numpages != record->max_block_id)
+ elog(ERROR, "number of blocks in WAL record %d does not match record struct %d",
+ record->max_block_id, xlrec->numpages);
+ if (XLogRecGetDataLen(record) != SizeOfZSWalBtreeRewritePages(xlrec->numpages))
+ elog(ERROR, "incorrect record struct size %d %ld", XLogRecGetDataLen(record), SizeOfZSWalBtreeRewritePages(xlrec->numpages));
+
+ if (XLogRecHasBlockRef(record, 0))
+ buffers[0] = XLogRedoUndoOp(record, 0);
+ else
+ buffers[0] = InvalidBuffer;
+
+ /* Iterate over blocks */
+ for (block_id = 1; block_id <= record->max_block_id; block_id++)
+ {
+ if (XLogReadBufferForRedo(record, block_id, &buffers[block_id]) == BLK_NEEDS_REDO)
+ {
+ Page page = BufferGetPage(buffers[block_id]);
+ char *special_area = PageGetSpecialPointer(page);
+ uint16 special_size = PageGetSpecialSize(page);
+ Size new_special_size;
+ char *new_special_area = XLogRecGetBlockData(record, block_id, &new_special_size);
+
+ if (!xlrec->pageinfo[block_id - 1].special_only)
+ elog(ERROR, "zedstore rewrite_pages WAL record did not contain a full-page image");
+
+ if (new_special_size != special_size)
+ elog(ERROR, "size of page's special area in WAL record does not match old page");
+
+ memcpy(special_area, new_special_area, special_size);
+ PageSetLSN(page, lsn);
+ }
+ }
+
+ /* Changes are done: unlock and release all buffers */
+ for (block_id = 0; block_id <= record->max_block_id; block_id++)
+ {
+ if (BufferIsValid(buffers[block_id]))
+ UnlockReleaseBuffer(buffers[block_id]);
+ }
+
+ /*
+ * XXX: The WAL record includes the 'recycle' flags, but we don't use them
+ * for anything. Deleting a page is WAL-logged separately. We could use the
+ * recycle flag here to catch leaked pages on crash, but it's probably not
+ * a big deal in practice.
+ */
+}
diff --git src/backend/access/zedstore/zedstore_compression.c src/backend/access/zedstore/zedstore_compression.c
new file mode 100644
index 0000000000..0eaff9c625
--- /dev/null
+++ src/backend/access/zedstore/zedstore_compression.c
@@ -0,0 +1,92 @@
+/*
+ * zedstore_compression.c
+ * Routines for compression
+ *
+ * There are two implementations at the moment: LZ4, and the Postgres
+ * pg_lzcompress(). LZ4 support requires that the server was compiled
+ * with --with-lz4.
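+ *
+ * Usage sketch (illustrative): the caller compresses as many source bytes
+ * as fit in a target output size, and remembers how many input bytes were
+ * actually consumed, so that the chunk can be decompressed later:
+ *
+ *     srcSize = chunk_len;
+ *     dstSize = zs_compress_destSize(src, dst, &srcSize, target);
+ *     ...
+ *     zs_decompress(dst, buf, dstSize, srcSize);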
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_compression.c
+ */
+#include "postgres.h"
+
+#ifdef USE_LZ4
+#include <lz4.h>
+#endif
+
+#include "access/zedstore_compression.h"
+#include "common/pg_lzcompress.h"
+#include "utils/datum.h"
+
+#ifdef USE_LZ4
+
+int
+zs_compress_destSize(const char *src, char *dst, int *srcSizePtr, int targetDstSize)
+{
+ return LZ4_compress_destSize(src, dst, srcSizePtr, targetDstSize);
+}
+
+void
+zs_decompress(const char *src, char *dst, int compressedSize, int uncompressedSize)
+{
+ int decompressed_size;
+
+ decompressed_size = LZ4_decompress_safe(src, dst, compressedSize, uncompressedSize);
+ if (decompressed_size < 0)
+ elog(ERROR, "could not decompress chunk (%d bytes compressed, %d bytes uncompressed)",
+ compressedSize, uncompressedSize);
+ if (decompressed_size != uncompressedSize)
+ elog(ERROR, "unexpected decompressed size");
+}
+
+#else
+/* PGLZ implementation */
+
+int
+zs_compress_destSize(const char *src, char *dst, int *srcSizePtr, int targetDstSize)
+{
+ int maxInputSize;
+ int compressed_size;
+
+ /*
+ * FIXME: pglz doesn't have an interface like LZ4 does, to compress up to a certain
+ * target compressed output size. We take a conservative approach and compress
+ * 'targetDstSize' bytes, and return that. Alternatively, we could guess the
+ * compression ratio, and try compressing a larger chunk hoping that it will fit
+ * in the target size, and try again if it didn't fit. Or we could enhance pglz
+ * code to do this cleverly. But it doesn't seem worth the effort, LZ4 (or something
+ * else, but not pglz) is the future.
+ */
+
+ /* reverse the computation of PGLZ_MAX_OUTPUT */
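+ /*
+ * For example, with an 8192-byte target we compress at most 8192 - 4 =
+ * 8188 input bytes, since PGLZ_MAX_OUTPUT() allows for 4 bytes of
+ * worst-case expansion on top of the input size.
+ */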
+ if (targetDstSize < 4)
+ return 0;
+
+ maxInputSize = targetDstSize - 4;
+ Assert(PGLZ_MAX_OUTPUT(maxInputSize) <= targetDstSize);
+ if (maxInputSize > *srcSizePtr)
+ maxInputSize = *srcSizePtr;
+
+ compressed_size = pglz_compress(src, maxInputSize, dst, PGLZ_strategy_always);
+ *srcSizePtr = maxInputSize;
+
+ return compressed_size;
+}
+
+void
+zs_decompress(const char *src, char *dst, int compressedSize, int uncompressedSize)
+{
+ int decompressed_size;
+
+ decompressed_size = pglz_decompress(src, compressedSize, dst, uncompressedSize, true);
+ if (decompressed_size < 0)
+ elog(ERROR, "could not decompress chunk");
+ if (decompressed_size != uncompressedSize)
+ elog(ERROR, "unexpected decompressed size");
+}
+
+#endif /* !USE_LZ4 */
diff --git src/backend/access/zedstore/zedstore_freepagemap.c src/backend/access/zedstore/zedstore_freepagemap.c
new file mode 100644
index 0000000000..47db7fa84e
--- /dev/null
+++ src/backend/access/zedstore/zedstore_freepagemap.c
@@ -0,0 +1,438 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstore_freepagemap.c
+ * ZedStore free space management
+ *
+ * The Free Page Map keeps track of unused pages in the relation.
+ *
+ * The FPM is a linked list of pages. Each page contains a pointer to the
+ * next free page.
+ *
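+ * For example (illustration only), a chain with two free blocks might
+ * look like this:
+ *
+ *   metapage zs_fpm_head -> blk 17 -> blk 42 -> InvalidBlockNumber
+ *
+ * (Each attribute tree also has its own FPM chain, anchored at the
+ * fpm_head field of its tree_root_dir[] entry in the metapage.)
+ *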
+ * Design principles:
+ *
+ * - it's ok to have a block incorrectly stored in the FPM. Before actually
+ * reusing a page, we must check that it's safe.
+ *
+ * - a deletable page must be simple to detect just by looking at the page,
+ * and perhaps a few other pages. It should *not* require scanning the
+ * whole table, or even a whole b-tree. For example, if a column is dropped,
+ * we can detect if a b-tree page belongs to the dropped column just by
+ * looking at the information (the attribute number) stored in the page
+ * header.
+ *
+ * - if a page is deletable, it should become immediately reusable. No
+ * "wait out all possible readers that might be about to follow a link
+ * to it" business. All code that reads pages need to keep pages locked
+ * while following a link, or be prepared to retry if they land on an
+ * unexpected page.
+ *
+ *
+ * TODO:
+ *
+ * - Avoid fragmentation. If B-tree page is split, try to hand out a page
+ * that's close to the old page. When the relation is extended, allocate
+ * a larger chunk at once.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_freepagemap.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "storage/bufpage.h"
+#include "storage/lmgr.h"
+#include "utils/rel.h"
+
+/*
+ * Deleted pages are initialized as having this structure, in the
+ * "special area".
+ *
+ * zs_next points to the next free block in the FPM chain.
+ */
+typedef struct ZSFreePageOpaque
+{
+ BlockNumber zs_next;
+ uint16 padding;
+ uint16 zs_page_id; /* ZS_FREE_PAGE_ID */
+} ZSFreePageOpaque;
+
+static Buffer zspage_extendrel_newbuf(Relation rel);
+
+/*
+ * zspage_is_unused()
+ *
+ * Is the current page recyclable?
+ *
+ * It can be:
+ *
+ * - an empty, all-zeros page,
+ * - explicitly marked as deleted,
+ * - an UNDO page older than oldest_undo_ptr
+ * - a b-tree page belonging to a deleted attribute
+ * - a TOAST page belonging to a dead item
+ *
+ * TODO: currently, though, we require that the page is always explicitly marked as deleted.
+ *
+ */
+static bool
+zspage_is_unused(Buffer buf)
+{
+ Page page;
+ ZSFreePageOpaque *opaque;
+
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page))
+ return false;
+
+ if (PageGetSpecialSize(page) != sizeof(ZSFreePageOpaque))
+ return false;
+ opaque = (ZSFreePageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_FREE_PAGE_ID)
+ return false;
+
+ return true;
+}
+
+/*
+ * Allocate a new page.
+ *
+ * The page is exclusive-locked, but not initialized.
+ *
+ * The head of the FPM chain is kept in the metapage, and thus this
+ * function will acquire the lock on the metapage. The caller must
+ * not be holding it, or we will self-deadlock!
+ *
+ * Unlinking the page from the FPM is WAL-logged. Once this function
+ * returns, the caller must use the page, and WAL-log its initialization,
+ * or give it back by calling zspage_delete_page().
+ *
+ * NOTE: There is a gap between this function unlinking the page from the
+ * FPM, and the caller initializing the page and linking it to somewhere
+ * else. If we crash in between, the page will be permanently leaked.
+ * That's unfortunate, but hopefully won't happen too often.
+ */
+Buffer
+zspage_getnewbuf(Relation rel, AttrNumber attrNumber)
+{
+ Buffer buf;
+ BlockNumber blk;
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPage *metapg;
+ ZSMetaPageOpaque *metaopaque;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ metapage = BufferGetPage(metabuf);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ metapg = (ZSMetaPage *) PageGetContents(metapage);
+
+ if (attrNumber == ZS_INVALID_ATTRIBUTE_NUM)
+ blk = metaopaque->zs_fpm_head;
+ else
+ blk = metapg->tree_root_dir[attrNumber].fpm_head;
+
+ if (blk == ZS_META_BLK)
+ {
+ /* metapage, not expected */
+ elog(ERROR, "could not find valid page in FPM");
+ }
+ if (blk != InvalidBlockNumber)
+ {
+ ZSFreePageOpaque *opaque;
+ Page page;
+ BlockNumber next_free_blkno;
+
+ buf = ReadBuffer(rel, blk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Check that the page really is unused. */
+ if (!zspage_is_unused(buf))
+ {
+ UnlockReleaseBuffer(buf);
+ elog(ERROR, "unexpected page found in free page list");
+ }
+ page = BufferGetPage(buf);
+ opaque = (ZSFreePageOpaque *) PageGetSpecialPointer(page);
+ next_free_blkno = opaque->zs_next;
+
+ if (attrNumber == ZS_INVALID_ATTRIBUTE_NUM)
+ metaopaque->zs_fpm_head = next_free_blkno;
+ else
+ metapg->tree_root_dir[attrNumber].fpm_head = next_free_blkno;
+
+ if (RelationNeedsWAL(rel))
+ {
+ wal_zedstore_fpm_reuse_page xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.next_free_blkno = next_free_blkno;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalFpmReusePage);
+ XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+ /*
+ * NOTE: We don't WAL-log the reused page here. It's up to the
+ * caller to WAL-log its initialization. If we crash between here
+ * and the initialization, the page is leaked. That's unfortunate,
+ * but it should be rare enough that we can live with it.
+ */
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_FPM_REUSE_PAGE);
+
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+ UnlockReleaseBuffer(metabuf);
+ }
+ else
+ {
+ /*
+ * No free pages in the FPM. Have to extend the relation.
+ * 1. We extend the relation by zedstore_rel_extension_factor #blocks.
+ * 2. Out of the zedstore_rel_extension_factor #blocks returned by the
+ * storage manager, we return the first block. The other blocks
+ * returned are prepended to the attribute level FPM.
+ */
+ StdRdOptions *rd_options = (StdRdOptions *)rel->rd_options;
+ int extension_factor = rd_options ? rd_options->zedstore_rel_extension_factor : ZEDSTORE_DEFAULT_REL_EXTENSION_FACTOR;
+
+ buf = zspage_extendrel_newbuf(rel);
+ blk = BufferGetBlockNumber(buf);
+
+ Buffer *extrabufs = palloc((extension_factor - 1) * sizeof(Buffer));
+ for (int i = 0; i < extension_factor - 1; i++) {
+ extrabufs[i] = zspage_extendrel_newbuf(rel);
+ /*
+ * We unlock the extrabuf here to prevent hitting MAX_SIMUL_LWLOCKS.
+ * It is safe to unlock the extrabuf here as it cannot be referenced
+ * by other backends until it is put on the attribute-level FPM.
+ * We grab the lock again in the following loop before placing the
+ * page on the FPM.
+ */
+ LockBuffer(extrabufs[i], BUFFER_LOCK_UNLOCK);
+ }
+
+ for (int i = extension_factor - 2; i >=0; i--) {
+ LockBuffer(extrabufs[i], BUFFER_LOCK_EXCLUSIVE);
+ zspage_delete_page(rel, extrabufs[i], metabuf, attrNumber);
+ UnlockReleaseBuffer(extrabufs[i]);
+ }
+ UnlockReleaseBuffer(metabuf);
+ }
+
+ return buf;
+}
+
+void
+zspage_reuse_page_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_fpm_reuse_page *xlrec = (wal_zedstore_fpm_reuse_page *) XLogRecGetData(record);
+ Buffer metabuf;
+
+ if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage = BufferGetPage(metabuf);
+ ZSMetaPageOpaque *metaopaque;
+
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ metaopaque->zs_fpm_head = xlrec->next_free_blkno;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * Extend the relation.
+ *
+ * Returns the new page, exclusive-locked.
+ */
+static Buffer
+zspage_extendrel_newbuf(Relation rel)
+{
+ Buffer buf;
+ bool needLock;
+
+ /*
+ * Extend the relation by one page.
+ *
+ * We have to use a lock to ensure no one else is extending the rel at
+ * the same time, else we will both try to initialize the same new
+ * page. We can skip locking for new or temp relations, however,
+ * since no one else could be accessing them.
+ */
+ needLock = !RELATION_IS_LOCAL(rel);
+
+ if (needLock)
+ LockRelationForExtension(rel, ExclusiveLock);
+
+ buf = ReadBuffer(rel, P_NEW);
+
+ /* Acquire buffer lock on new page */
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * Release the file-extension lock; it's now OK for someone else to
+ * extend the relation some more. Note that we cannot release this
+ * lock before we have buffer lock on the new page, or we risk a race
+ * condition against btvacuumscan --- see comments therein.
+ */
+ if (needLock)
+ UnlockRelationForExtension(rel, ExclusiveLock);
+
+ return buf;
+}
+
+/*
+ * Explicitly mark a page as deleted and recyclable, and add it to the FPM.
+ *
+ * The caller must hold an exclusive-lock on the page.
+ *
+ * This function needs to modify the metapage, to link the page to the
+ * FPM chain. If the caller is already holding a lock on the metapage,
+ * pass it in the 'metabuf' argument.
+ *
+ * NOTE: The deletion of the page is WAL-logged. There is a gap between
+ * the caller making the page obsolete, and calling this function, and
+ * if we crash in between, the page will be leaked. That's unfortunate,
+ * but like in zspage_getnewbuf(), we mostly just live with it. However,
+ * you can use zspage_mark_page_deleted() to avoid it.
+ */
+void
+zspage_delete_page(Relation rel, Buffer buf, Buffer metabuf, AttrNumber attrNumber)
+{
+ bool release_metabuf;
+ BlockNumber blk = BufferGetBlockNumber(buf);
+ Page metapage;
+ ZSMetaPage *metapg;
+ ZSMetaPageOpaque *metaopaque;
+ Page page;
+ BlockNumber next_free_blkno;
+
+ if (metabuf == InvalidBuffer)
+ {
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ release_metabuf = true;
+ }
+ else
+ release_metabuf = false;
+
+ metapage = BufferGetPage(metabuf);
+ metapg = (ZSMetaPage *) PageGetContents(metapage);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ page = BufferGetPage(buf);
+
+ if (attrNumber != ZS_INVALID_ATTRIBUTE_NUM)
+ {
+ /*
+ * Add the page to the attribute specific free page map.
+ */
+ next_free_blkno = metapg->tree_root_dir[attrNumber].fpm_head;
+ zspage_mark_page_deleted(page, next_free_blkno);
+ metapg->tree_root_dir[attrNumber].fpm_head = blk;
+ }
+ else
+ {
+ next_free_blkno = metaopaque->zs_fpm_head;
+ zspage_mark_page_deleted(page, next_free_blkno);
+ metaopaque->zs_fpm_head = blk;
+ }
+
+ MarkBufferDirty(metabuf);
+ MarkBufferDirty(buf);
+
+ if (RelationNeedsWAL(rel))
+ {
+ wal_zedstore_fpm_delete_page xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.next_free_blkno = next_free_blkno;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalFpmDeletePage);
+ XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+ XLogRegisterBuffer(1, buf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_FPM_DELETE_PAGE);
+
+ PageSetLSN(metapage, recptr);
+ PageSetLSN(page, recptr);
+ }
+
+ if (release_metabuf)
+ UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * Initialize a page as deleted.
+ *
+ * This is a low-level function, used by zspage_delete_page(), but it can
+ * also be used by callers that are willing to deal with managing the FPM
+ * chain and WAL-logging by themselves.
+ */
+void
+zspage_mark_page_deleted(Page page, BlockNumber next_free_blk)
+{
+ ZSFreePageOpaque *opaque;
+
+ PageInit(page, BLCKSZ, sizeof(ZSFreePageOpaque));
+ opaque = (ZSFreePageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_page_id = ZS_FREE_PAGE_ID;
+ opaque->zs_next = next_free_blk;
+}
+
+void
+zspage_delete_page_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_fpm_delete_page *xlrec = (wal_zedstore_fpm_delete_page *) XLogRecGetData(record);
+ Buffer metabuf;
+ Buffer deletedbuf;
+ Page deletedpg;
+ BlockNumber deletedblkno;
+
+ deletedbuf = XLogInitBufferForRedo(record, 1);
+ deletedpg = BufferGetPage(deletedbuf);
+ deletedblkno = BufferGetBlockNumber(deletedbuf);
+
+ zspage_mark_page_deleted(deletedpg, xlrec->next_free_blkno);
+
+ if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage = BufferGetPage(metabuf);
+ ZSMetaPageOpaque *metaopaque;
+
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ metaopaque->zs_fpm_head = deletedblkno;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+
+ PageSetLSN(deletedpg, lsn);
+ MarkBufferDirty(deletedbuf);
+
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+ UnlockReleaseBuffer(deletedbuf);
+}
diff --git src/backend/access/zedstore/zedstore_inspect.c src/backend/access/zedstore/zedstore_inspect.c
new file mode 100644
index 0000000000..2c4e6dceb8
--- /dev/null
+++ src/backend/access/zedstore/zedstore_inspect.c
@@ -0,0 +1,1191 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstore_inspect.c
+ * Debugging functions, for viewing ZedStore page contents
+ *
+ * These should probably be moved to contrib/, but it's handy to have them
+ * here during development.
+ *
+ * Example queries
+ * ---------------
+ *
+ * How many pages of each type does a table have?
+ *
+ * select count(*), pg_zs_page_type('t_zedstore', g)
+ * from generate_series(0, pg_table_size('t_zedstore') / 8192 - 1) g group by 2;
+ *
+ * count | pg_zs_page_type
+ * -------+-----------------
+ * 1 | META
+ * 3701 | BTREE
+ * 6 | UNDO
+ * (3 rows)
+ *
+ * Compression ratio of B-tree leaf pages (other pages are not compressed):
+ *
+ * select sum(uncompressedsz::numeric) / sum(totalsz) as compratio
+ * from pg_zs_btree_pages('t_zedstore') ;
+ * compratio
+ * --------------------
+ * 3.6623829559208134
+ * (1 row)
+ *
+ * Per column compression ratio and number of pages:
+ *
+ * select attno, count(*), sum(uncompressedsz::numeric) / sum(totalsz) as
+ * compratio from pg_zs_btree_pages('t_zedstore') group by attno order by
+ * attno;
+ *
+ * attno | count | compratio
+ * -------+-------+------------------------
+ * 0 | 395 | 1.00000000000000000000
+ * 1 | 56 | 1.0252948766341260
+ * 2 | 3 | 38.7542309420398383
+ * (3 rows)
+ *
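+ * The UNDO page chain and toast pages can be listed in the same way
+ * (illustrative; the SQL declarations for these functions are defined
+ * elsewhere in the patch):
+ *
+ * select * from pg_zs_undo_pages('t_zedstore');
+ * select * from pg_zs_toast_pages('t_zedstore');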
+ *
+ * Measure of leaf page randomness
+ *
+ * A run is a sequence of consecutive leaf blocks. Two blocks are consecutive
+ * if they have consecutive block numbers.
+ *
+ * select (pg_zs_calculate_adjacent_block('t_zedstore'::regclass)).*;
+ *
+ * attnum | nruns | nblocks
+ * -------+-------+---------
+ * 0 | 21 | 27
+ * 1 | 4 | 107
+ * 2 | 4 | 107
+ * 3 | 4 | 107
+ * (4 rows)
+ *
+ *
+ * Get attstreams inside an attribute leaf page. Each row represents an encoded chunk.
+ *
+ * select * from pg_zs_dump_attstreams('t_zedstore', 3);
+ *
+ * select attno, chunkno, upperstream, compressed, chunk_start, chunk_len, prevtid, firsttid, lasttid, itemcount, chunk from pg_zs_dump_attstreams('t_zedstore', 11) limit 5;
+ * attno | chunkno | upperstream | compressed | chunk_start | chunk_len | prevtid | firsttid | lasttid | itemcount | chunk
+ * -------+---------+-------------+------------+-------------+-----------+---------+----------+---------+-----------+----------------------------------------------------------------
+ * 1 | 0 | f | f | 0 | 12 | 0 | 10001 | 10001 | 1 | \x11270000000000f001000000
+ * 1 | 0 | t | t | 0 | 24 | 0 | 9931 | 9934 | 4 | \xcb660010000400c0cb260000cc260000cd260000ce260000
+ * 1 | 1 | t | t | 24 | 88 | 9934 | 9935 | 9954 | 20 | \x5555555555000030cf260000d0260000d1260000d2260000d3260000d4260000d5260000d6260000d7260000d8260000d9260000da260000db260000dc260000dd260000de260000df260000e0260000e1260000e2260000
+ * 1 | 2 | t | t | 112 | 32 | 9954 | 9955 | 9960 | 6 | \x01020408102000b0e3260000e4260000e5260000e6260000e7260000e8260000
+ * 1 | 3 | t | t | 144 | 128 | 9960 | 9961 | 9990 | 30 | \xffffff3f00000010e9260000ea260000eb260000ec260000ed260000ee260000ef260000f0260000f1260000f2260000f3260000f4260000f5260000f6260000f7260000f8260000f9260000fa260000fb260000fc260000fd260000fe260000ff26000000270000012700000227000003270000042700000527000006270000
+ * (5 rows)
+ *
+ *
+ * Decode chunks inside an attribute leaf page.
+ *
+ * select * from pg_zs_dump_attstreams('t_zedstore', 11), pg_zs_decode_chunk(attbyval,attlen,prevtid,lasttid,chunk);
+ *
+ * select chunkno, tids, datums, isnulls from pg_zs_dump_attstreams('t_zedstore', 11), pg_zs_decode_chunk(attbyval,attlen,prevtid,lasttid,chunk);
+ * chunkno | tids | datums | isnulls
+ * ---------+-----------------------+-----------------------------------------------------------+-----------
+ * 0 | {10001} | {"\\x01000000"} | {f}
+ * 0 | {9931,9932,9933,9934} | {"\\xcb260000","\\xcc260000","\\xcd260000","\\xce260000"} | {f,f,f,f}
+ * (2 rows)
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_inspect.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "miscadmin.h"
+
+#include "access/relscan.h"
+#include "access/table.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undorec.h"
+#include "commands/vacuum.h"
+#include "funcapi.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+
+Datum pg_zs_page_type(PG_FUNCTION_ARGS);
+Datum pg_zs_undo_pages(PG_FUNCTION_ARGS);
+Datum pg_zs_btree_pages(PG_FUNCTION_ARGS);
+Datum pg_zs_toast_pages(PG_FUNCTION_ARGS);
+Datum pg_zs_meta_page(PG_FUNCTION_ARGS);
+Datum pg_zs_calculate_adjacent_block(PG_FUNCTION_ARGS);
+Datum pg_zs_dump_attstreams(PG_FUNCTION_ARGS);
+Datum pg_zs_decode_chunk(PG_FUNCTION_ARGS);
+
+Datum
+pg_zs_page_type(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint64 pageno = PG_GETARG_INT64(1);
+ Relation rel;
+ uint16 zs_page_id;
+ Buffer buf;
+ Page page;
+ char *result;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ buf = ReadBuffer(rel, pageno);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
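+ /*
+ * Every zedstore page type stores its zs_page_id in the last two bytes
+ * of the page, so we can read it here without knowing the page type.
+ */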
+ zs_page_id = *((uint16 *) ((char *) page + BLCKSZ - sizeof(uint16)));
+
+ UnlockReleaseBuffer(buf);
+
+ table_close(rel, AccessShareLock);
+
+ switch (zs_page_id)
+ {
+ case ZS_META_PAGE_ID:
+ result = "META";
+ break;
+ case ZS_BTREE_PAGE_ID:
+ result = "BTREE";
+ break;
+ case ZS_UNDO_PAGE_ID:
+ result = "UNDO";
+ break;
+ case ZS_TOAST_PAGE_ID:
+ result = "TOAST";
+ break;
+ case ZS_FREE_PAGE_ID:
+ result = "FREE";
+ break;
+ default:
+ result = psprintf("UNKNOWN 0x%04x", zs_page_id);
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+/*
+ * blkno int8
+ * nrecords int4
+ * freespace int4
+ * firstrecptr int8
+ * lastrecptr int8
+ */
+Datum
+pg_zs_undo_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber firstblk;
+ BlockNumber blkno;
+ char *ptr;
+ char *endptr;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ /*
+ * Get the current oldest undo page from the metapage.
+ */
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ firstblk = metaopaque->zs_undo_head;
+
+ UnlockReleaseBuffer(metabuf);
+
+ /*
+ * Loop through UNDO records, starting from the oldest page.
+ */
+ blkno = firstblk;
+ while (blkno != InvalidBlockNumber)
+ {
+ Datum values[5];
+ bool nulls[5];
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+ int nrecords;
+ ZSUndoRecPtr firstptr = {0, 0, 0};
+ ZSUndoRecPtr lastptr = {0, 0, 0};
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the UNDO page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ {
+ elog(WARNING, "unexpected page id on UNDO page %u", blkno);
+ break;
+ }
+
+ /* loop through all records on the page */
+ endptr = (char *) page + ((PageHeader) page)->pd_lower;
+ ptr = (char *) page + SizeOfPageHeaderData;
+ nrecords = 0;
+ while (ptr < endptr)
+ {
+ ZSUndoRec *undorec = (ZSUndoRec *) ptr;
+
+ Assert(undorec->undorecptr.blkno == blkno);
+
+ lastptr = undorec->undorecptr;
+ if (nrecords == 0)
+ firstptr = lastptr;
+ nrecords++;
+
+ ptr += undorec->size;
+ }
+
+ values[0] = Int64GetDatum(blkno);
+ values[1] = Int32GetDatum(nrecords);
+ values[2] = Int32GetDatum(PageGetExactFreeSpace(page));
+ values[3] = Int64GetDatum(firstptr.counter);
+ values[4] = Int64GetDatum(lastptr.counter);
+
+ blkno = opaque->next;
+ UnlockReleaseBuffer(buf);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
+
+/*
+ * blkno int8
+ * tid int8
+ * total_size int8
+ * slice_offset int8
+ * prev int8
+ * next int8
+ * decompressed_size int4
+ * is_compressed bool
+ */
+Datum
+pg_zs_toast_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ BlockNumber blkno;
+ BlockNumber nblocks;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /* scan all blocks in physical order */
+ for (blkno = 1; blkno < nblocks; blkno++)
+ {
+ Datum values[8];
+ bool nulls[8];
+ Buffer buf;
+ Page page;
+ ZSToastPageOpaque *opaque;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * we're only interested in toast pages.
+ */
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSToastPageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_TOAST_PAGE_ID)
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+
+ values[0] = Int64GetDatum(blkno);
+ if (opaque->zs_tid)
+ {
+ values[1] = Int64GetDatum(opaque->zs_tid);
+ values[2] = Int64GetDatum(opaque->zs_total_size);
+ }
+ else
+ {
+ /* zs_tid is not set on this page */
+ nulls[1] = true;
+ nulls[2] = true;
+ }
+ values[3] = Int64GetDatum(opaque->zs_slice_offset);
+ values[4] = Int64GetDatum(opaque->zs_prev);
+ values[5] = Int64GetDatum(opaque->zs_next);
+ values[6] = Int32GetDatum(opaque->zs_decompressed_size);
+ values[7] = BoolGetDatum(opaque->zs_is_compressed);
+
+ UnlockReleaseBuffer(buf);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
+
+/*
+ * attno int4
+ * chunkno int4
+ * upperstream bool
+ * compressed bool
+ * attbyval bool
+ * attlen int2
+ *
+ * chunk_start int4
+ * chunk_len int4
+ *
+ * prevtid zstid
+ * firsttid zstid
+ * lasttid zstid
+ *
+ * chunk bytea
+ * itemcount int4
+ */
+Datum
+pg_zs_dump_attstreams(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ BlockNumber blkno = PG_GETARG_INT64(1);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Datum values[14];
+ bool nulls[14];
+
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ int chunkno;
+ int upperstream = -1; /* index of the upper stream in streams[], or -1 */
+ bool attbyval;
+ int16 attlen;
+ int chunk_start;
+ PageHeader phdr;
+
+ attstream_decoder decoder;
+
+ ZSAttStream *streams[2];
+ int nstreams = 0;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * we're only interested in B-tree pages. (Presumably, most of the pages
+ * in the relation are b-tree pages, so it makes sense to scan the whole
+ * relation in physical order)
+ */
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSBtreePageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+ table_close(rel, AccessShareLock);
+ PG_RETURN_NULL();
+ }
+
+ opaque = (ZSBtreePageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_BTREE_PAGE_ID ||
+ opaque->zs_attno == ZS_META_ATTRIBUTE_NUM ||
+ opaque->zs_level != 0)
+ {
+ UnlockReleaseBuffer(buf);
+ table_close(rel, AccessShareLock);
+ PG_RETURN_NULL();
+ }
+
+ attbyval = rel->rd_att->attrs[opaque->zs_attno - 1].attbyval;
+ attlen = rel->rd_att->attrs[opaque->zs_attno - 1].attlen;
+
+ phdr = (PageHeader) page;
+
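+ /*
+ * An attribute leaf page can hold up to two attstreams: one in the
+ * "lower" part of the page, right after the page header (bounded by
+ * pd_lower), and one in the "upper" part, between pd_upper and the
+ * special area.
+ */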
+ if (phdr->pd_lower - SizeOfPageHeaderData > SizeOfZSAttStreamHeader)
+ {
+ streams[nstreams++] = (ZSAttStream *) (((char *) page) + SizeOfPageHeaderData);
+ }
+
+ if (phdr->pd_special - phdr->pd_upper > SizeOfZSAttStreamHeader)
+ {
+ upperstream = nstreams;
+ streams[nstreams++] = (ZSAttStream *) (((char *) page) + phdr->pd_upper);
+ }
+
+ for (int i = 0; i < nstreams; i++)
+ {
+ ZSAttStream *stream = streams[i];
+ bytea *chunk;
+ zstid prevtid;
+ zstid firsttid;
+ zstid lasttid;
+
+ init_attstream_decoder(&decoder, attbyval, attlen);
+ decode_attstream_begin(&decoder, stream);
+
+ chunkno = 0;
+ chunk_start = decoder.pos;
+
+ while (get_attstream_chunk_cont(&decoder, &prevtid, &firsttid, &lasttid, &chunk))
+ {
+ values[0] = Int16GetDatum(opaque->zs_attno);
+ values[1] = Int32GetDatum(chunkno);
+ chunkno++;
+
+ values[2] = BoolGetDatum(upperstream == i);
+ values[3] = BoolGetDatum((stream->t_flags & ATTSTREAM_COMPRESSED) != 0);
+ values[4] = BoolGetDatum(attbyval);
+ values[5] = Int16GetDatum(attlen);
+
+ values[6] = Int32GetDatum(chunk_start);
+ values[7] = Int32GetDatum(decoder.pos - chunk_start);
+ chunk_start = decoder.pos;
+
+ values[8] = ZSTidGetDatum(prevtid);
+ values[9] = ZSTidGetDatum(firsttid);
+ values[10] = ZSTidGetDatum(lasttid);
+ values[11] = PointerGetDatum(chunk);
+ values[12] = Int32GetDatum(decoder.num_elements);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ }
+
+ UnlockReleaseBuffer(buf);
+ table_close(rel, AccessShareLock);
+
+ destroy_attstream_decoder(&decoder);
+
+ return (Datum) 0;
+}
+
+Datum
+pg_zs_decode_chunk(PG_FUNCTION_ARGS)
+{
+ bool attbyval = PG_GETARG_BOOL(0);
+ int attlen = PG_GETARG_INT16(1);
+ zstid prevtid = PG_GETARG_ZSTID(2);
+ zstid lasttid = PG_GETARG_ZSTID(3);
+ bytea *chunk = PG_GETARG_BYTEA_P(4);
+ attstream_decoder decoder;
+ Datum values[4];
+ bool nulls[4];
+ ZSAttStream *attstream = palloc(SizeOfZSAttStreamHeader + VARSIZE_ANY_EXHDR(chunk));
+ TupleDesc tupdesc;
+ HeapTuple tuple;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ attstream->t_decompressed_size = VARSIZE_ANY_EXHDR(chunk);
+ attstream->t_decompressed_bufsize = VARSIZE_ANY_EXHDR(chunk);
+ attstream->t_size = SizeOfZSAttStreamHeader + VARSIZE_ANY_EXHDR(chunk);
+ attstream->t_flags = 0;
+ attstream->t_lasttid = lasttid;
+ memcpy(attstream->t_payload, VARDATA_ANY(chunk), VARSIZE_ANY_EXHDR(chunk));
+
+ init_attstream_decoder(&decoder, attbyval, attlen);
+ decode_attstream_begin(&decoder, attstream);
+ decoder.prevtid = prevtid;
+
+ if (!decode_attstream_cont(&decoder))
+ PG_RETURN_NULL();
+ else
+ {
+ ArrayBuildState *astate_tids = NULL;
+ ArrayBuildState *astate_datums = NULL;
+ ArrayBuildState *astate_isnulls = NULL;
+
+ for (int i = 0; i < decoder.num_elements; i++)
+ {
+
+ bytea *attr_data;
+
+ astate_tids = accumArrayResult(astate_tids,
+ ZSTidGetDatum(decoder.tids[i]),
+ false,
+ ZSTIDOID,
+ CurrentMemoryContext);
+ if (decoder.isnulls[i])
+ {
+ astate_datums = accumArrayResult(astate_datums,
+ (Datum) 0,
+ true,
+ BYTEAOID,
+ CurrentMemoryContext);
+ }
+ else
+ {
+ /*
+ * Fixed length, attribute by value
+ */
+ if (attbyval && attlen > 0)
+ {
+ attr_data = (bytea *) palloc(attlen + VARHDRSZ);
+ SET_VARSIZE(attr_data, attlen + VARHDRSZ);
+ memcpy(VARDATA(attr_data), &decoder.datums[i], attlen);
+ }
+ else if (!attbyval && attlen > 0)
+ {
+ attr_data = (bytea *) palloc(attlen + VARHDRSZ);
+ SET_VARSIZE(attr_data, attlen + VARHDRSZ);
+ memcpy(VARDATA(attr_data),
+ DatumGetPointer(decoder.datums[i]),
+ attlen);
+ }
+ else if (attlen < 0)
+ {
+ int len;
+
+ len =
+ VARSIZE_ANY_EXHDR(DatumGetPointer(decoder.datums[i]));
+ attr_data = (bytea *) palloc(len + VARHDRSZ);
+ SET_VARSIZE(attr_data, len + VARHDRSZ);
+ memcpy(VARDATA(attr_data),
+ VARDATA_ANY(DatumGetPointer(decoder.datums[i])),
+ len);
+ }
+ astate_datums = accumArrayResult(astate_datums,
+ PointerGetDatum(attr_data),
+ false,
+ BYTEAOID,
+ CurrentMemoryContext);
+ }
+ astate_isnulls = accumArrayResult(astate_isnulls,
+ BoolGetDatum(decoder.isnulls[i]),
+ false,
+ BOOLOID,
+ CurrentMemoryContext);
+ }
+
+ values[0] = Int32GetDatum(decoder.num_elements);
+ values[1] = PointerGetDatum(makeArrayResult(astate_tids, CurrentMemoryContext));
+ values[2] = PointerGetDatum(makeArrayResult(astate_datums, CurrentMemoryContext));
+ values[3] = PointerGetDatum(makeArrayResult(astate_isnulls, CurrentMemoryContext));
+ }
+
+ destroy_attstream_decoder(&decoder);
+
+ tuple = heap_form_tuple(tupdesc, values, nulls);
+ PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
+}
+
+/*
+ * blkno int8
+ * nextblk int8
+ * attno int4
+ * level int4
+ *
+ * lokey int8
+ * hikey int8
+ *
+ * nitems int4
+ * ncompressed int4
+ * totalsz int4
+ * uncompressedsz int4
+ * freespace int4
+ */
+Datum
+pg_zs_btree_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ BlockNumber blkno;
+ BlockNumber nblocks;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /* scan all blocks in physical order */
+ for (blkno = 1; blkno < nblocks; blkno++)
+ {
+ Datum values[11];
+ bool nulls[11];
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ int nitems;
+ int ncompressed;
+ int totalsz;
+ int uncompressedsz;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * we're only interested in B-tree pages. (Presumably, most of the
+ * pages in the relation are b-tree pages, so it makes sense to scan
+ * the whole relation in physical order)
+ */
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSBtreePageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+ opaque = (ZSBtreePageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_BTREE_PAGE_ID)
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+
+ nitems = 0;
+ ncompressed = 0;
+ totalsz = 0;
+ uncompressedsz = 0;
+ if (opaque->zs_level == 0)
+ {
+ /* meta leaf page */
+ if (opaque->zs_attno == ZS_META_ATTRIBUTE_NUM)
+ {
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ nitems++;
+ totalsz += item->t_size;
+
+ uncompressedsz += item->t_size;
+ }
+ }
+ /* attribute leaf page */
+ else
+ {
+ PageHeader phdr = (PageHeader) page;
+ ZSAttStream *streams[2];
+ int nstreams = 0;
+
+ if (phdr->pd_lower - SizeOfPageHeaderData > SizeOfZSAttStreamHeader)
+ {
+ streams[nstreams++] = (ZSAttStream *) (((char *) page) + SizeOfPageHeaderData);
+ }
+
+ if (phdr->pd_special - phdr->pd_upper > SizeOfZSAttStreamHeader)
+ {
+ streams[nstreams++] = (ZSAttStream *) (((char *) page) + phdr->pd_upper);
+ }
+
+ for (int i = 0; i < nstreams; i++)
+ {
+ ZSAttStream *stream = streams[i];
+
+ totalsz += stream->t_size;
+ /*
+ * FIXME: this is wrong. We currently don't calculate the
+ * number of items in the stream
+ */
+ nitems++;
+ if ((stream->t_flags & ATTSTREAM_COMPRESSED) != 0)
+ {
+ ncompressed++;
+ uncompressedsz += stream->t_decompressed_size;
+ }
+ else
+ {
+ uncompressedsz += stream->t_size;
+ }
+ }
+ }
+ }
+ else
+ {
+ /* internal page */
+ nitems = ZSBtreeInternalPageGetNumItems(page);
+ }
+ values[0] = Int64GetDatum(blkno);
+ values[1] = Int64GetDatum(opaque->zs_next);
+ values[2] = Int32GetDatum(opaque->zs_attno);
+ values[3] = Int32GetDatum(opaque->zs_level);
+ values[4] = Int64GetDatum(opaque->zs_lokey);
+ values[5] = Int64GetDatum(opaque->zs_hikey);
+ values[6] = Int32GetDatum(nitems);
+ if (opaque->zs_level == 0)
+ {
+ values[7] = Int32GetDatum(ncompressed);
+ values[8] = Int32GetDatum(totalsz);
+ values[9] = Int32GetDatum(uncompressedsz);
+ }
+ else
+ {
+ nulls[7] = true;
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ values[10] = Int32GetDatum(PageGetExactFreeSpace(page));
+
+ UnlockReleaseBuffer(buf);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
+
+/*
+ * blkno int8
+ * undo_head int8
+ * undo_tail int8
+ * undo_tail_first_counter int8
+ * undo_oldestpointer_counter int8
+ * undo_oldestpointer_blkno int8
+ * undo_oldestpointer_offset int8
+ * fpm_head int8
+ * flags int4
+ */
+Datum
+pg_zs_meta_page(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ TupleDesc tupdesc;
+ Datum values[9];
+ bool nulls[9];
+ Buffer buf;
+ Page page;
+ ZSMetaPageOpaque *opaque;
+ HeapTuple tuple;
+ Datum result;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* open the metapage */
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ /* Read the page */
+ buf = ReadBuffer(rel, ZS_META_BLK);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSMetaPageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+ elog(ERROR, "Bad page special size");
+ }
+ opaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_META_PAGE_ID)
+ {
+ UnlockReleaseBuffer(buf);
+ elog(ERROR, "The zs_page_id does not match ZS_META_PAGE_ID. Got: %d",
+ opaque->zs_page_id);
+ }
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[0] = Int64GetDatum(ZS_META_BLK);
+ values[1] = Int64GetDatum(opaque->zs_undo_head);
+ values[2] = Int64GetDatum(opaque->zs_undo_tail);
+ values[3] = Int64GetDatum(opaque->zs_undo_tail_first_counter);
+ values[4] = Int64GetDatum(opaque->zs_undo_oldestptr.counter);
+ values[5] = Int64GetDatum(opaque->zs_undo_oldestptr.blkno);
+ values[6] = Int32GetDatum(opaque->zs_undo_oldestptr.offset);
+ values[7] = Int64GetDatum(opaque->zs_fpm_head);
+ values[8] = Int32GetDatum(opaque->zs_flags);
+
+ UnlockReleaseBuffer(buf);
+
+ table_close(rel, AccessShareLock);
+
+ tuple = heap_form_tuple(tupdesc, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ PG_RETURN_DATUM(result);
+}
+
+/*
+ * Check how contiguously each attribute's leaf blocks are laid out in the
+ * relation file.
+ *
+ * Returns the number of runs of consecutive blocks and the total number of
+ * blocks, per attribute.
+ */
+Datum
+pg_zs_calculate_adjacent_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+
+ Buffer buf;
+ Relation rel;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Tuplestorestate *tupstore;
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+
+ int *total_blocks;
+ int *num_runs;
+
+ Datum values[3];
+ bool nulls[3];
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ total_blocks = (int *) palloc0((rel->rd_att->natts + 1) * sizeof(int));
+ num_runs = (int *) palloc0((rel->rd_att->natts + 1) * sizeof(int));
+
+ for (int attnum = 0; attnum <= rel->rd_att->natts; attnum++)
+ {
+ BlockNumber blkno;
+ buf = zsbt_descend(rel, attnum, MinZSTid, 0, true);
+
+ if (buf == InvalidBuffer)
+ continue;
+
+ blkno = BufferGetBlockNumber(buf);
+
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ num_runs[attnum] = 1;
+ total_blocks[attnum] = 1;
+
+ while (opaque->zs_next != InvalidBlockNumber)
+ {
+ if (opaque->zs_next != blkno + 1)
+ {
+ num_runs[attnum]++;
+ }
+ total_blocks[attnum]++;
+
+ UnlockReleaseBuffer(buf);
+
+ buf = ReadBuffer(rel, opaque->zs_next);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ blkno = BufferGetBlockNumber(buf);
+
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ values[0] = Int32GetDatum(attnum);
+ values[1] = Int32GetDatum(num_runs[attnum]);
+ values[2] = Int32GetDatum(total_blocks[attnum]);
+ nulls[0] = false;
+ nulls[1] = false;
+ nulls[2] = false;
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
diff --git src/backend/access/zedstore/zedstore_meta.c src/backend/access/zedstore/zedstore_meta.c
new file mode 100644
index 0000000000..50496dc5cb
--- /dev/null
+++ src/backend/access/zedstore/zedstore_meta.c
@@ -0,0 +1,509 @@
+/*
+ * zedstore_meta.c
+ * Routines for handling ZedStore metapage
+ *
+ * The metapage holds a directory of B-tree root block numbers, one for each
+ * column.
+ *
+ * TODO:
+ * - extend the root block dir to an overflow page if there are too many
+ * attributes to fit on one page
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_meta.c
+ */
+#include "postgres.h"
+
+#include "access/itup.h"
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+static void zsmeta_wal_log_metapage(Buffer buf, int natts);
+
+static ZSMetaCacheData *
+zsmeta_populate_cache_from_metapage(Relation rel, Page page)
+{
+ ZSMetaCacheData *cache;
+ ZSMetaPage *metapg;
+ int natts;
+
+ if (rel->rd_amcache != NULL)
+ {
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }
+
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ natts = metapg->nattributes;
+
+ cache =
+ MemoryContextAllocZero(CacheMemoryContext,
+ offsetof(ZSMetaCacheData, cache_attrs[natts]));
+ cache->cache_nattributes = natts;
+
+ for (int i = 0; i < natts; i++)
+ {
+ cache->cache_attrs[i].root = metapg->tree_root_dir[i].root;
+ cache->cache_attrs[i].rightmost = InvalidBlockNumber;
+ }
+
+ rel->rd_amcache = cache;
+ return cache;
+}
+
+ZSMetaCacheData *
+zsmeta_populate_cache(Relation rel)
+{
+ ZSMetaCacheData *cache;
+ Buffer metabuf;
+ BlockNumber nblocks;
+
+ RelationOpenSmgr(rel);
+
+ if (rel->rd_amcache != NULL)
+ {
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }
+
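+ /*
+ * The relation size is stashed in the relation's target-block field
+ * below; zsmeta_get_root_for_attribute() checks it to detect a table
+ * that was still empty when this cache was populated.
+ */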
+ nblocks = RelationGetNumberOfBlocks(rel);
+ RelationSetTargetBlock(rel, nblocks);
+ if (nblocks == 0)
+ {
+ cache =
+ MemoryContextAllocZero(CacheMemoryContext,
+ offsetof(ZSMetaCacheData, cache_attrs));
+ cache->cache_nattributes = 0;
+ rel->rd_amcache = cache;
+ }
+ else
+ {
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ cache = zsmeta_populate_cache_from_metapage(rel, BufferGetPage(metabuf));
+ UnlockReleaseBuffer(metabuf);
+ }
+
+ return cache;
+}
+
+static void
+zsmeta_expand_metapage_for_new_attributes(Relation rel)
+{
+ int natts = RelationGetNumberOfAttributes(rel) + 1;
+ Buffer metabuf;
+ Page page;
+ ZSMetaPage *metapg;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(metabuf);
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ if (natts > metapg->nattributes)
+ {
+ int new_pd_lower;
+
+ new_pd_lower = (char *) &metapg->tree_root_dir[natts] - (char *) page;
+ if (new_pd_lower > ((PageHeader) page)->pd_upper)
+ {
+ /*
+ * The root block directory must fit on the metapage.
+ *
+ * TODO: We could extend this by overflowing to another page.
+ */
+ elog(ERROR, "too many attributes for zedstore");
+ }
+
+ START_CRIT_SECTION();
+
+ /* Initialize the new attribute roots to InvalidBlockNumber */
+ for (int i = metapg->nattributes; i < natts; i++)
+ {
+ metapg->tree_root_dir[i].root = InvalidBlockNumber;
+ metapg->tree_root_dir[i].fpm_head = InvalidBlockNumber;
+ }
+
+ metapg->nattributes = natts;
+ ((PageHeader) page)->pd_lower = new_pd_lower;
+
+ MarkBufferDirty(metabuf);
+
+ if (RelationNeedsWAL(rel))
+ zsmeta_wal_log_metapage(metabuf, natts);
+
+ END_CRIT_SECTION();
+ }
+ UnlockReleaseBuffer(metabuf);
+
+ if (rel->rd_amcache != NULL)
+ {
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }
+}
+
+static Page
+zsmeta_initmetapage_internal(int natts)
+{
+ Page page;
+ ZSMetaPageOpaque *opaque;
+ ZSMetaPage *metapg;
+ int new_pd_lower;
+
+ /*
+ * It's possible that we error out when building the metapage, if there
+ * are too many attributes, so work on a temporary copy first, before
+ * actually allocating the buffer.
+ */
+ page = palloc(BLCKSZ);
+ PageInit(page, BLCKSZ, sizeof(ZSMetaPageOpaque));
+
+ opaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_META_PAGE_ID;
+
+ /* UNDO-related fields */
+ opaque->zs_undo_oldestptr.counter = 2; /* start at 2, so that 0 is always "old", and 1 means "dead" */
+ opaque->zs_undo_head = InvalidBlockNumber;
+ opaque->zs_undo_tail = InvalidBlockNumber;
+ opaque->zs_undo_tail_first_counter = 2;
+
+ opaque->zs_fpm_head = InvalidBlockNumber;
+
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ new_pd_lower = (char *) &metapg->tree_root_dir[natts] - (char *) page;
+ if (new_pd_lower > ((PageHeader) page)->pd_upper)
+ {
+ /*
+ * The root block directory must fit on the metapage.
+ *
+ * TODO: We could extend this by overflowing to another page.
+ */
+ elog(ERROR, "too many attributes for zedstore");
+ }
+
+ metapg->nattributes = natts;
+ for (int i = 0; i < natts; i++)
+ {
+ metapg->tree_root_dir[i].root = InvalidBlockNumber;
+ metapg->tree_root_dir[i].fpm_head = InvalidBlockNumber;
+ }
+
+ ((PageHeader) page)->pd_lower = new_pd_lower;
+ return page;
+}
+
+/*
+ * Initialize the metapage for an empty relation.
+ */
+void
+zsmeta_initmetapage(Relation rel)
+{
+ Buffer buf;
+ Page page;
+ int natts = RelationGetNumberOfAttributes(rel) + 1;
+
+ /* Ok, write it out to disk */
+ buf = ReadBuffer(rel, P_NEW);
+ if (BufferGetBlockNumber(buf) != ZS_META_BLK)
+ elog(ERROR, "table is not empty");
+ page = zsmeta_initmetapage_internal(natts);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ START_CRIT_SECTION();
+ PageRestoreTempPage(page, BufferGetPage(buf));
+
+ MarkBufferDirty(buf);
+
+ if (RelationNeedsWAL(rel))
+ zsmeta_wal_log_metapage(buf, natts);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+}
+
+static void
+zsmeta_wal_log_metapage(Buffer buf, int natts)
+{
+ Page page = BufferGetPage(buf);
+ wal_zedstore_init_metapage init_rec;
+ XLogRecPtr recptr;
+
+ init_rec.natts = natts;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &init_rec, SizeOfZSWalInitMetapage);
+ XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_INIT_METAPAGE);
+
+ PageSetLSN(page, recptr);
+}
+
+static void
+zsmeta_wal_log_new_att_root(Buffer metabuf, Buffer rootbuf, AttrNumber attno)
+{
+ Page metapage = BufferGetPage(metabuf);
+ Page rootpage = BufferGetPage(rootbuf);
+ wal_zedstore_btree_new_root xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.attno = attno;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalBtreeNewRoot);
+ XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+ XLogRegisterBuffer(1, rootbuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_BTREE_NEW_ROOT);
+
+ PageSetLSN(metapage, recptr);
+ PageSetLSN(rootpage, recptr);
+}
+
+void
+zsmeta_initmetapage_redo(XLogReaderState *record)
+{
+ Buffer buf;
+
+ /*
+ * Metapage changes are so rare that we rely on full-page images
+ * for replay.
+ */
+ if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+ elog(ERROR, "zedstore metapage init WAL record did not contain a full-page image");
+
+ Assert(BufferGetBlockNumber(buf) == ZS_META_BLK);
+ UnlockReleaseBuffer(buf);
+}
+
+void
+zsmeta_new_btree_root_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_btree_new_root *xlrec =
+ (wal_zedstore_btree_new_root *) XLogRecGetData(record);
+ AttrNumber attno = xlrec->attno;
+ Buffer metabuf;
+ Buffer rootbuf;
+ Page rootpage;
+ BlockNumber rootblk;
+ ZSBtreePageOpaque *opaque;
+
+ rootbuf = XLogInitBufferForRedo(record, 1);
+ rootpage = (Page) BufferGetPage(rootbuf);
+ rootblk = BufferGetBlockNumber(rootbuf);
+ /* initialize the page to look like a root leaf */
+ PageInit(rootpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ opaque = ZSBtreePageGetOpaque(rootpage);
+ opaque->zs_attno = attno;
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_lokey = MinZSTid;
+ opaque->zs_hikey = MaxPlusOneZSTid;
+ opaque->zs_level = 0;
+ opaque->zs_flags = ZSBT_ROOT;
+ opaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ PageSetLSN(rootpage, lsn);
+ MarkBufferDirty(rootbuf);
+
+ /* Update the metapage to point to it */
+ if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage = (Page) BufferGetPage(metabuf);
+ ZSMetaPage *metapg = (ZSMetaPage *) PageGetContents(metapage);
+
+ Assert(BufferGetBlockNumber(metabuf) == ZS_META_BLK);
+ Assert(metapg->tree_root_dir[attno].root == InvalidBlockNumber);
+
+ metapg->tree_root_dir[attno].root = rootblk;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+ UnlockReleaseBuffer(rootbuf);
+}
+
+/*
+ * Get the block number of the b-tree root for given attribute.
+ *
+ * If 'readonly' is true, and the root doesn't exist yet (ie. it's an empty
+ * table), returns InvalidBlockNumber. Otherwise new root is allocated if
+ * the root doesn't exist.
+ */
+BlockNumber
+zsmeta_get_root_for_attribute(Relation rel, AttrNumber attno, bool readonly)
+{
+ Buffer metabuf;
+ ZSMetaPage *metapg;
+ BlockNumber rootblk;
+ ZSMetaCacheData *metacache;
+
+ Assert(attno == ZS_META_ATTRIBUTE_NUM || attno >= 1);
+
+ metacache = zsmeta_get_cache(rel);
+
+ if (RelationGetTargetBlock(rel) == 0 ||
+ RelationGetTargetBlock(rel) == InvalidBlockNumber)
+ {
+ BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (nblocks != 0)
+ metacache = zsmeta_populate_cache(rel);
+ else if (readonly)
+ return InvalidBlockNumber;
+ else
+ {
+ LockRelationForExtension(rel, ExclusiveLock);
+ /*
+ * Confirm number of blocks is still 0 after taking lock,
+ * before initializing a new metapage
+ */
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (nblocks == 0)
+ zsmeta_initmetapage(rel);
+ UnlockRelationForExtension(rel, ExclusiveLock);
+ metacache = zsmeta_populate_cache(rel);
+ }
+ }
+
+ /*
+ * The file has fewer attributes stored than the catalog. This can happen
+ * after ADD COLUMN, where the default value is stored in the catalog but
+ * nothing is written to the table; such an attribute must be marked with
+ * atthasmissing.
+ */
+ if (attno >= metacache->cache_nattributes)
+ {
+ if (readonly)
+ {
+ /* re-check */
+ metacache = zsmeta_populate_cache(rel);
+ if (attno >= metacache->cache_nattributes)
+ return InvalidBlockNumber;
+ }
+ else
+ {
+ zsmeta_expand_metapage_for_new_attributes(rel);
+ metacache = zsmeta_populate_cache(rel);
+ }
+ }
+
+ rootblk = metacache->cache_attrs[attno].root;
+
+ /*
+ * Don't believe a cached result that says that the root doesn't exist.
+ * It's possible that it was created after we populated the cache. If the
+ * root block number is out-of-date, that's OK because the caller will
+ * detect that case, but if the tree is missing altogether, the caller
+ * will have nothing to detect and will incorrectly return an empty result.
+ *
+ * XXX: It's a bit inefficient to repopulate the cache here, if we just
+ * did so in the zsmeta_get_cache() call above already.
+ */
+ if (readonly && rootblk == InvalidBlockNumber)
+ {
+ zsmeta_invalidate_cache(rel);
+ metacache = zsmeta_get_cache(rel);
+ rootblk = metacache->cache_attrs[attno].root;
+ }
+
+ if (!readonly && rootblk == InvalidBlockNumber)
+ {
+ /* try to allocate one */
+ Page page;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(metabuf);
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ /*
+ * Re-check that the root is still invalid, now that we have the
+ * metapage locked.
+ */
+ rootblk = metapg->tree_root_dir[attno].root;
+ if (rootblk == InvalidBlockNumber)
+ {
+ Buffer rootbuf;
+ Page rootpage;
+ ZSBtreePageOpaque *opaque;
+
+ /*
+ * Release the lock on the metapage while we find a new block, because
+ * that could take a while. (And accessing the Free Page Map might lock
+ * the metapage, too, causing self-deadlock.)
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+ /* TODO: release lock on metapage while we do I/O */
+ rootbuf = zspage_getnewbuf(rel, attno);
+
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metapg = (ZSMetaPage *) PageGetContents(page);
+ rootblk = metapg->tree_root_dir[attno].root;
+ if (rootblk != InvalidBlockNumber)
+ {
+ /*
+ * Another backend created the root page, while we were busy
+ * finding a free page. We won't need the page we allocated,
+ * after all.
+ */
+ zspage_delete_page(rel, rootbuf, metabuf, attno);
+ }
+ else
+ {
+ rootblk = BufferGetBlockNumber(rootbuf);
+
+ START_CRIT_SECTION();
+
+ metapg->tree_root_dir[attno].root = rootblk;
+
+ /* initialize the page to look like a root leaf */
+ rootpage = BufferGetPage(rootbuf);
+ PageInit(rootpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ opaque = ZSBtreePageGetOpaque(rootpage);
+ opaque->zs_attno = attno;
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_lokey = MinZSTid;
+ opaque->zs_hikey = MaxPlusOneZSTid;
+ opaque->zs_level = 0;
+ opaque->zs_flags = ZSBT_ROOT;
+ opaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ MarkBufferDirty(rootbuf);
+ MarkBufferDirty(metabuf);
+
+ if (RelationNeedsWAL(rel))
+ zsmeta_wal_log_new_att_root(metabuf, rootbuf, attno);
+
+ END_CRIT_SECTION();
+ }
+
+ UnlockReleaseBuffer(rootbuf);
+ }
+ UnlockReleaseBuffer(metabuf);
+
+ metacache->cache_attrs[attno].root = rootblk;
+ }
+
+ return rootblk;
+}
diff --git src/backend/access/zedstore/zedstore_simple8b.c src/backend/access/zedstore/zedstore_simple8b.c
new file mode 100644
index 0000000000..088b8a25f4
--- /dev/null
+++ src/backend/access/zedstore/zedstore_simple8b.c
@@ -0,0 +1,391 @@
+/*
+ * zedstore_simple8b.c
+ * Simple-8b encoding for zedstore
+ *
+ * FIXME: This is copy-pasted from src/backend/lib/integerset.c. Some of
+ * the things we do here are not relevant for the use in zedstore, or could
+ * be optimized. For example, EMPTY_CODEWORD is not used.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_simple8b.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_simple8b.h"
+
+/*
+ * Decode an array of Simple-8b codewords, known to contain 'num_integers'
+ * integers.
+ */
+void
+simple8b_decode_words(uint64 *codewords, int num_codewords,
+ uint64 *dst, int num_integers)
+{
+ int total_decoded = 0;
+
+ /* decode all the codewords */
+ for (int i = 0; i < num_codewords; i++)
+ {
+ int num_decoded;
+
+ num_decoded = simple8b_decode(codewords[i], &dst[total_decoded]);
+ total_decoded += num_decoded;
+ }
+ /*
+ * XXX: This error message is a bit specific, but it matches how this
+ * function is actually used, i.e. to encode TIDs, and the number of integers
+ * comes from the item header.
+ */
+ if (total_decoded != num_integers)
+ elog(ERROR, "number of TIDs in codewords did not match the item header");
+}
+
+/*
+ * Simple-8b encoding.
+ *
+ * The simple-8b algorithm packs between 1 and 240 integers into 64-bit words,
+ * called "codewords". The number of integers packed into a single codeword
+ * depends on the integers being packed; small integers are encoded using
+ * fewer bits than large integers. A single codeword can store a single
+ * 60-bit integer, or two 30-bit integers, for example.
+ *
+ * Since we're storing a unique, sorted, set of integers, we actually encode
+ * the *differences* between consecutive integers. That way, clusters of
+ * integers that are close to each other are packed efficiently, regardless
+ * of their absolute values.
+ *
+ * In Simple-8b, each codeword consists of a 4-bit selector, which indicates
+ * how many integers are encoded in the codeword, and the encoded integers are
+ * packed into the remaining 60 bits. The selector allows for 16 different
+ * ways of using the remaining 60 bits, called "modes". The number of integers
+ * packed into a single codeword in each mode is listed in the simple8b_modes
+ * table below. For example, consider the following codeword:
+ *
+ * 20-bit integer 20-bit integer 20-bit integer
+ * 1101 00000000000000010010 01111010000100100000 00000000000000010100
+ * ^
+ * selector
+ *
+ * The selector 1101 is 13 in decimal. From the modes table below, we see
+ * that it means that the codeword encodes three 20-bit integers. In decimal,
+ * those integers are 18, 500000 and 20. Because we encode deltas rather than
+ * absolute values, the actual values that they represent are 18, 500018 and
+ * 500038.
+ *
+ * Modes 0 and 1 are a bit special; they encode a run of 240 or 120 zeroes
+ * (which means 240 or 120 consecutive integers, since we're encoding the
+ * deltas between integers), without using the rest of the codeword bits
+ * for anything.
+ *
+ * Simple-8b cannot encode integers larger than 60 bits. Values larger than
+ * that are always stored in the 'first' field of a leaf item, never in the
+ * packed codeword. If there is a sequence of integers that are more than
+ * 2^60 apart, the codeword will go unused on those items. To represent that,
+ * we use a magic EMPTY_CODEWORD codeword value.
+ */
+static const struct simple8b_mode
+{
+ uint8 bits_per_int;
+ uint8 num_ints;
+} simple8b_modes[17] =
+
+{
+ {0, 240}, /* mode 0: 240 zeroes */
+ {0, 120}, /* mode 1: 120 zeroes */
+ {1, 60}, /* mode 2: sixty 1-bit integers */
+ {2, 30}, /* mode 3: thirty 2-bit integers */
+ {3, 20}, /* mode 4: twenty 3-bit integers */
+ {4, 15}, /* mode 5: fifteen 4-bit integers */
+ {5, 12}, /* mode 6: twelve 5-bit integers */
+ {6, 10}, /* mode 7: ten 6-bit integers */
+ {7, 8}, /* mode 8: eight 7-bit integers (four bits
+ * are wasted) */
+ {8, 7}, /* mode 9: seven 8-bit integers (four bits
+ * are wasted) */
+ {10, 6}, /* mode 10: six 10-bit integers */
+ {12, 5}, /* mode 11: five 12-bit integers */
+ {15, 4}, /* mode 12: four 15-bit integers */
+ {20, 3}, /* mode 13: three 20-bit integers */
+ {30, 2}, /* mode 14: two 30-bit integers */
+ {60, 1}, /* mode 15: one 60-bit integer */
+
+ {0, 0} /* sentinel value */
+};
+
+/*
+ * Maximum number of integers that can be encoded in a single Simple-8b
+ * codeword.
+ */
+#define SIMPLE8B_MAX_VALUES_PER_CODEWORD 240
+
+/*
+ * EMPTY_CODEWORD is a special value, used to indicate "no values".
+ * It is used if the next value is too large to be encoded with Simple-8b.
+ *
+ * This value looks like a mode-0 codeword, but we can distinguish it
+ * because a regular mode-0 codeword would have zeroes in the unused bits.
+ */
+#define EMPTY_CODEWORD UINT64CONST(0x0FFFFFFFFFFFFFFF)
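+
+/*
+ * As a concrete example of the codeword format: 0x2FFFFFFFFFFFFFFF has
+ * selector 2 in its top four bits, so it packs sixty 1-bit integers, and
+ * since all sixty payload bits are set it decodes to sixty values of 1.
+ * EMPTY_CODEWORD, defined just above, has the same all-ones payload but
+ * selector 0, which is how it is distinguished from a regular mode-0
+ * codeword.
+ */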
+
+/*
+ * Encode a number of integers into a Simple-8b codeword.
+ *
+ * Returns the encoded codeword, and sets *num_encoded to the number of
+ * input integers that were encoded. That can be zero, if the first delta
+ * is too large to be encoded.
+ */
+uint64
+simple8b_encode(const uint64 *ints, int num_ints, int *num_encoded)
+{
+ int selector;
+ int nints;
+ int bits;
+ uint64 val;
+ uint64 codeword;
+ int i;
+
+ /*
+ * Select the "mode" to use for this codeword.
+ *
+ * In each iteration, check if the next value can be represented in the
+ * current mode we're considering. If it's too large, then step up the
+ * mode to a wider one, and repeat. If it fits, move on to the next
+ * integer. Repeat until the codeword is full, given the current mode.
+ *
+ * Note that we don't have any way to represent unused slots in the
+ * codeword, so we require each codeword to be "full". It is always
+ * possible to produce a full codeword unless the very first delta is too
+ * large to be encoded. For example, if the first delta is small but the
+ * second is too large to be encoded, we'll end up using the last "mode",
+ * which has nints == 1.
+ */
+ selector = 0;
+ nints = simple8b_modes[0].num_ints;
+ bits = simple8b_modes[0].bits_per_int;
+ val = ints[0];
+ i = 0; /* number of deltas we have accepted */
+ for (;;)
+ {
+ if (val >= (UINT64CONST(1) << bits))
+ {
+ /* too large, step up to next mode */
+ selector++;
+ nints = simple8b_modes[selector].num_ints;
+ bits = simple8b_modes[selector].bits_per_int;
+ /* we might already have accepted enough deltas for this mode */
+ if (i >= nints)
+ break;
+ }
+ else
+ {
+ /* accept this delta; then done if codeword is full */
+ i++;
+ if (i >= nints)
+ break;
+ /* examine next delta */
+ if (i < num_ints)
+ val = ints[i];
+ else
+ {
+ /*
+ * Reached end of input. Pretend that the next integer is a
+ * value that's too large to represent in Simple-8b, so that
+ * we fall out.
+ */
+ val = PG_UINT64_MAX;
+ }
+ }
+ }
+
+ if (nints == 0)
+ {
+ /*
+ * The first delta is too large to be encoded with Simple-8b.
+ *
+ * If there is at least one not-too-large integer in the input, we
+ * will encode it using mode 15 (or a more compact mode). Hence, we
+ * can only get here if the *first* delta is >= 2^60.
+ */
+ Assert(i == 0);
+ *num_encoded = 0;
+ return EMPTY_CODEWORD;
+ }
+
+ /*
+ * Encode the integers using the selected mode. Note that we shift them
+ * into the codeword in reverse order, so that they will come out in the
+ * correct order in the decoder.
+ */
+ codeword = 0;
+ if (bits > 0)
+ {
+ for (i = nints - 1; i > 0; i--)
+ {
+ val = ints[i];
+ codeword |= val;
+ codeword <<= bits;
+ }
+ val = ints[0];
+ codeword |= val;
+ }
+
+ /* add selector to the codeword, and return */
+ codeword |= (uint64) selector << 60;
+
+ *num_encoded = nints;
+ return codeword;
+}
+
+/*
+ * Encode a run of identical integers into a Simple-8b codeword.
+ *
+ * This is a specialized version of simple8b_encode(), where the first input
+ * integer is 'firstint', followed by repetitions of 'secondint'. This is
+ * equivalent to calling simple8b_encode() with an input array:
+ *
+ * ints[0]: firstint
+ * ints[1]: secondint
+ * ints[2]: secondint
+ * ...
+ * ints[num_ints - 1]: secondint
+ *
+ *
+ * We need this when doing a multi-insert. A specialized version is nice
+ * for speed, and it also keeps the calling code simpler, since it doesn't
+ * need to construct an input array.
+ *
+ * TODO: This is just copy-pasted from simple8b_encode, but since we know
+ * what the input is, we could probably optimize this further.
+ */
+uint64
+simple8b_encode_consecutive(const uint64 firstint, const uint64 secondint, int num_ints,
+ int *num_encoded)
+{
+ int selector;
+ int nints;
+ int bits;
+ uint64 val;
+ uint64 codeword;
+ int i;
+
+ /*
+ * Select the "mode" to use for this codeword.
+ *
+ * In each iteration, check if the next value can be represented in the
+ * current mode we're considering. If it's too large, then step up the
+ * mode to a wider one, and repeat. If it fits, move on to the next
+ * integer. Repeat until the codeword is full, given the current mode.
+ *
+ * Note that we don't have any way to represent unused slots in the
+ * codeword, so we require each codeword to be "full". It is always
+ * possible to produce a full codeword unless the very first delta is too
+ * large to be encoded. For example, if the first delta is small but the
+ * second is too large to be encoded, we'll end up using the last "mode",
+ * which has nints == 1.
+ */
+ selector = 0;
+ nints = simple8b_modes[0].num_ints;
+ bits = simple8b_modes[0].bits_per_int;
+ val = firstint;
+ i = 0; /* number of deltas we have accepted */
+ for (;;)
+ {
+ if (val >= (UINT64CONST(1) << bits))
+ {
+ /* too large, step up to next mode */
+ selector++;
+ nints = simple8b_modes[selector].num_ints;
+ bits = simple8b_modes[selector].bits_per_int;
+ /* we might already have accepted enough deltas for this mode */
+ if (i >= nints)
+ break;
+ }
+ else
+ {
+ /* accept this delta; then done if codeword is full */
+ i++;
+ if (i >= nints)
+ break;
+ /* examine next delta */
+ if (i < num_ints)
+ val = secondint;
+ else
+ {
+ /*
+ * Reached end of input. Pretend that the next integer is a
+ * value that's too large to represent in Simple-8b, so that
+ * we fall out.
+ */
+ val = PG_UINT64_MAX;
+ }
+ }
+ }
+
+ if (nints == 0)
+ {
+ /*
+ * The first delta is too large to be encoded with Simple-8b.
+ *
+ * If there is at least one not-too-large integer in the input, we
+ * will encode it using mode 15 (or a more compact mode). Hence, we
+ * can only get here if the *first* delta is >= 2^60.
+ */
+ Assert(i == 0);
+ *num_encoded = 0;
+ return EMPTY_CODEWORD;
+ }
+
+ /*
+ * Encode the integers using the selected mode. Note that we shift them
+ * into the codeword in reverse order, so that they will come out in the
+ * correct order in the decoder.
+ */
+ codeword = 0;
+ if (bits > 0)
+ {
+ for (i = nints - 1; i > 0; i--)
+ {
+ val = secondint;
+ codeword |= val;
+ codeword <<= bits;
+ }
+ val = firstint;
+ codeword |= val;
+ }
+
+ /* add selector to the codeword, and return */
+ codeword |= (uint64) selector << 60;
+
+ *num_encoded = nints;
+ return codeword;
+}
+
+/*
+ * Decode a codeword into an array of integers.
+ * Returns the number of integers decoded.
+ */
+int
+simple8b_decode(uint64 codeword, uint64 *decoded)
+{
+ int selector = (codeword >> 60);
+ int nints = simple8b_modes[selector].num_ints;
+ int bits = simple8b_modes[selector].bits_per_int;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+
+ if (codeword == EMPTY_CODEWORD)
+ return 0;
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 val = codeword & mask;
+
+ decoded[i] = val;
+ codeword >>= bits;
+ }
+
+ return nints;
+}
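+
+/*
+ * Illustrative round-trip (a sketch only, not part of the patch; the exact
+ * selector chosen depends on the simple8b_modes table):
+ *
+ *     uint64 deltas[4] = {1, 1, 5, 1};
+ *     uint64 out[64];      /* sized generously for any mode with bits > 0 */
+ *     int    n_enc;
+ *
+ *     uint64 cw = simple8b_encode(deltas, 4, &n_enc);
+ *     int    n_dec = simple8b_decode(cw, out);
+ *
+ * Afterwards n_dec == n_enc, and out[0 .. n_dec - 1] equal the encoded
+ * prefix of 'deltas'. The top 4 bits of 'cw' hold the selector, and the
+ * remaining 60 bits hold the packed values, first value in the lowest bits.
+ */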
diff --git src/backend/access/zedstore/zedstore_tid.c src/backend/access/zedstore/zedstore_tid.c
new file mode 100644
index 0000000000..45c2385815
--- /dev/null
+++ src/backend/access/zedstore/zedstore_tid.c
@@ -0,0 +1,159 @@
+/*
+ * zedstore_tid.c
+ * Functions for the built-in type zstid
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tid.c
+ */
+
+#include "postgres.h"
+
+#include "access/zedstore_tid.h"
+#include "storage/itemptr.h"
+#include "utils/fmgrprotos.h"
+#include "utils/int8.h"
+
+
+Datum
+tidtozstid(PG_FUNCTION_ARGS)
+{
+ ItemPointerData *arg = PG_GETARG_ITEMPOINTER(0);
+ zstid tid = ZSTidFromItemPointer(*arg);
+
+ PG_RETURN_ZSTID(tid);
+}
+
+Datum
+zstidin(PG_FUNCTION_ARGS)
+{
+ char *str = PG_GETARG_CSTRING(0);
+ int64 result;
+
+ (void) scanint8(str, false, &result);
+
+ if (result > MaxZSTid || result < MinZSTid)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value \"%s\" is out of range for type %s",
+ str, "zstid")));
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
+zstidout(PG_FUNCTION_ARGS)
+{
+ zstid tid = PG_GETARG_ZSTID(0);
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), UINT64_FORMAT, tid);
+
+ PG_RETURN_CSTRING(pstrdup(buf));
+}
+
+Datum
+zstidlt(PG_FUNCTION_ARGS)
+{
+ zstid tid1 = PG_GETARG_ZSTID(0);
+ zstid tid2 = PG_GETARG_ZSTID(1);
+
+ PG_RETURN_BOOL(tid1 < tid2);
+}
+
+Datum
+zstidgt(PG_FUNCTION_ARGS)
+{
+ zstid tid1 = PG_GETARG_ZSTID(0);
+ zstid tid2 = PG_GETARG_ZSTID(1);
+
+ PG_RETURN_BOOL(tid1 > tid2);
+}
+
+Datum
+zstideq(PG_FUNCTION_ARGS)
+{
+ zstid tid1 = PG_GETARG_ZSTID(0);
+ zstid tid2 = PG_GETARG_ZSTID(1);
+
+ PG_RETURN_BOOL(tid1 == tid2);
+}
+
+Datum
+zstidle(PG_FUNCTION_ARGS)
+{
+ zstid tid1 = PG_GETARG_ZSTID(0);
+ zstid tid2 = PG_GETARG_ZSTID(1);
+
+ PG_RETURN_BOOL(tid1 <= tid2);
+}
+
+Datum
+zstidge(PG_FUNCTION_ARGS)
+{
+ zstid tid1 = PG_GETARG_ZSTID(0);
+ zstid tid2 = PG_GETARG_ZSTID(1);
+
+ PG_RETURN_BOOL(tid1 >= tid2);
+}
+
+Datum
+zstidne(PG_FUNCTION_ARGS)
+{
+ zstid tid1 = PG_GETARG_ZSTID(0);
+ zstid tid2 = PG_GETARG_ZSTID(1);
+
+ PG_RETURN_BOOL(tid1 != tid2);
+}
+
+Datum
+int2tozstid(PG_FUNCTION_ARGS)
+{
+ int16 arg = PG_GETARG_INT16(0);
+
+ if (arg < MinZSTid)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value %d is out of range for type %s",
+ arg, "zstid")));
+
+ PG_RETURN_ZSTID((zstid) arg);
+}
+
+Datum
+int4tozstid(PG_FUNCTION_ARGS)
+{
+ int32 arg = PG_GETARG_INT32(0);
+
+ if (arg < MinZSTid)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value %d is out of range for type %s",
+ arg, "zstid")));
+
+ PG_RETURN_ZSTID((zstid) arg);
+}
+
+Datum
+int8tozstid(PG_FUNCTION_ARGS)
+{
+ int64 arg = PG_GETARG_INT64(0);
+
+ if (arg > MaxZSTid || arg < MinZSTid)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value " INT64_FORMAT " is out of range for type %s",
+ arg, "zstid")));
+
+ PG_RETURN_ZSTID((zstid) arg);
+}
+
+Datum
+zstidtoint8(PG_FUNCTION_ARGS)
+{
+ zstid arg = PG_GETARG_ZSTID(0);
+
+ PG_RETURN_INT64((int64) arg);
+}
diff --git src/backend/access/zedstore/zedstore_tiditem.c src/backend/access/zedstore/zedstore_tiditem.c
new file mode 100644
index 0000000000..c4d3ec9ecc
--- /dev/null
+++ src/backend/access/zedstore/zedstore_tiditem.c
@@ -0,0 +1,932 @@
+/*
+ * zedstore_tiditem.c
+ * Routines for packing TIDs into "items"
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tiditem.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_internal.h"
+#include "access/zedstore_simple8b.h"
+
+static int remap_slots(uint8 *slotnos, int num_tids,
+ ZSUndoRecPtr *orig_slots, int num_orig_slots,
+ int target_idx, ZSUndoRecPtr target_ptr,
+ ZSUndoRecPtr *new_slots,
+ int *new_num_slots,
+ uint8 *new_slotnos,
+ ZSUndoRecPtr recent_oldest_undo);
+static ZSTidArrayItem *build_item(zstid *tids, uint64 *deltas, uint8 *slotnos, int num_tids,
+ ZSUndoRecPtr *slots, int num_slots);
+
+static void deltas_to_tids(zstid firsttid, uint64 *deltas, int num_tids, zstid *tids);
+static void slotwords_to_slotnos(uint64 *slotwords, int num_tids, uint8 *slotnos);
+static int binsrch_tid_array(zstid key, zstid *arr, int arr_elems);
+
+/*
+ * Extract TIDs from an item into iterator.
+ */
+void
+zsbt_tid_item_unpack(ZSTidArrayItem *item, ZSTidItemIterator *iter)
+{
+ ZSUndoRecPtr *slots;
+ int num_tids;
+ uint64 *slotwords;
+ uint64 *codewords;
+
+ if (iter->tids_allocated_size < item->t_num_tids)
+ {
+ if (iter->tids)
+ pfree(iter->tids);
+ if (iter->tid_undoslotnos)
+ pfree(iter->tid_undoslotnos);
+ iter->tids = MemoryContextAlloc(iter->context, item->t_num_tids * sizeof(zstid));
+ iter->tid_undoslotnos = MemoryContextAlloc(iter->context, item->t_num_tids * sizeof(uint8));
+ iter->tids_allocated_size = item->t_num_tids;
+ }
+
+ ZSTidArrayItemDecode(item, &codewords, &slots, &slotwords);
+ num_tids = item->t_num_tids;
+
+ /* decode all the codewords */
+ simple8b_decode_words(codewords, item->t_num_codewords, iter->tids, num_tids);
+
+ /* convert the deltas to TIDs */
+ deltas_to_tids(item->t_firsttid, iter->tids, num_tids, iter->tids);
+ iter->num_tids = num_tids;
+ Assert(iter->tids[num_tids - 1] == item->t_endtid - 1);
+
+ /* Expand slotwords to slotnos */
+ slotwords_to_slotnos(slotwords, num_tids, iter->tid_undoslotnos);
+
+ /* also copy out the slots to the iterator */
+ iter->undoslots[ZSBT_OLD_UNDO_SLOT] = InvalidUndoPtr;
+ iter->undoslots[ZSBT_DEAD_UNDO_SLOT] = DeadUndoPtr;
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < item->t_num_undo_slots; i++)
+ iter->undoslots[i] = slots[i - ZSBT_FIRST_NORMAL_UNDO_SLOT];
+}
+
+/*
+ * Create a ZSTidArrayItem (or items), to represent a range of contiguous TIDs,
+ * all with the same UNDO pointer.
+ */
+List *
+zsbt_tid_item_create_for_range(zstid tid, int nelements, ZSUndoRecPtr undo_ptr)
+{
+ uint64 total_encoded;
+ List *newitems = NIL;
+ uint64 codewords[ZSBT_MAX_ITEM_CODEWORDS];
+ int num_slots;
+ int slotno;
+
+ Assert(undo_ptr.counter != DeadUndoPtr.counter);
+ if (IsZSUndoRecPtrValid(&undo_ptr))
+ {
+ slotno = ZSBT_FIRST_NORMAL_UNDO_SLOT;
+ num_slots = ZSBT_FIRST_NORMAL_UNDO_SLOT + 1;
+ }
+ else
+ {
+ slotno = ZSBT_OLD_UNDO_SLOT;
+ num_slots = ZSBT_FIRST_NORMAL_UNDO_SLOT;
+ }
+
+ total_encoded = 0;
+ while (total_encoded < nelements)
+ {
+ ZSTidArrayItem *newitem;
+ Size itemsz;
+ int num_codewords;
+ int num_tids;
+ zstid firsttid = tid + total_encoded;
+ uint64 first_delta;
+ uint64 second_delta;
+ ZSUndoRecPtr *newitem_slots;
+ uint64 *slotword_p;
+ uint64 *newitem_slotwords;
+ uint64 *newitem_codewords;
+ int i;
+
+ /*
+ * The first delta is 0, because the first TID is stored explicitly in the
+ * item header (t_firsttid). The rest are at distance 1 from the previous
+ * TID.
+ */
+ first_delta = 0;
+ second_delta = 1;
+ num_tids = 0;
+ for (num_codewords = 0;
+ num_codewords < ZSBT_MAX_ITEM_CODEWORDS && total_encoded < nelements && num_tids < ZSBT_MAX_ITEM_TIDS;
+ num_codewords++)
+ {
+ uint64 codeword;
+ int num_encoded;
+
+ codeword = simple8b_encode_consecutive(first_delta, second_delta,
+ nelements - total_encoded,
+ &num_encoded);
+ if (num_encoded == 0)
+ break;
+
+ codewords[num_codewords] = codeword;
+ total_encoded += num_encoded;
+ num_tids += num_encoded;
+ first_delta = 1;
+ }
+
+ itemsz = SizeOfZSTidArrayItem(num_tids, num_slots, num_codewords);
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_num_tids = num_tids;
+ newitem->t_num_undo_slots = num_slots;
+ newitem->t_num_codewords = num_codewords;
+ newitem->t_firsttid = firsttid;
+ newitem->t_endtid = tid + total_encoded;
+
+ ZSTidArrayItemDecode(newitem, &newitem_codewords, &newitem_slots, &newitem_slotwords);
+
+ /* Fill in undo slots */
+ if (slotno == ZSBT_FIRST_NORMAL_UNDO_SLOT)
+ {
+ Assert(num_slots == ZSBT_FIRST_NORMAL_UNDO_SLOT + 1);
+ newitem_slots[0] = undo_ptr;
+ }
+
+ /* Fill in slotwords */
+ i = 0;
+ slotword_p = newitem_slotwords;
+ while (i < num_tids)
+ {
+ uint64 slotword;
+
+ slotword = 0;
+ for (int j = 0; j < ZSBT_SLOTNOS_PER_WORD && i < num_tids; j++)
+ {
+ slotword |= (uint64) slotno << (j * ZSBT_ITEM_UNDO_SLOT_BITS);
+ i++;
+ }
+ *(slotword_p++) = slotword;
+ }
+
+ /* Fill in TID codewords */
+ for (int i = 0; i < num_codewords; i++)
+ newitem_codewords[i] = codewords[i];
+
+ newitems = lappend(newitems, newitem);
+ }
+
+ return newitems;
+}
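+
+/*
+ * For illustration (a sketch; the exact codeword boundaries depend on the
+ * simple8b_modes table): calling zsbt_tid_item_create_for_range() with
+ * tid = 1000, nelements = 5 and an invalid UNDO pointer produces a single
+ * item with t_firsttid = 1000, t_endtid = 1005, the deltas {0, 1, 1, 1, 1}
+ * packed into Simple-8b codeword(s), no stored UNDO slots
+ * (t_num_undo_slots == ZSBT_FIRST_NORMAL_UNDO_SLOT), and one slotword in
+ * which every TID's slot number is ZSBT_OLD_UNDO_SLOT.
+ */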
+
+/*
+ * Add a range of contiguous TIDs to an existing item.
+ *
+ * If all the new TIDs can be merged with the existing item, returns a List
+ * with a single element, containing the new combined item that covers all
+ * the existing TIDs, and the new TIDs. *modified_orig is set to true.
+ *
+ * If some of the new TIDs can be merged with the existing item, returns a
+ * List with more than one item. The returned items together replace the
+ * original item, such that all the existing TIDs and all the new TIDs are
+ * covered. *modified_orig is set to true in that case, too.
+ *
+ * If the new TIDs could not be merged with the existing item, returns a list
+ * of new items to represent the new TIDs, just like
+ * zsbt_tid_item_create_for_range(), and *modified_orig is set to false.
+ */
+List *
+zsbt_tid_item_add_tids(ZSTidArrayItem *orig, zstid firsttid, int nelements,
+ ZSUndoRecPtr undo_ptr, bool *modified_orig)
+{
+ int num_slots;
+ int num_new_codewords;
+ uint64 new_codewords[ZSBT_MAX_ITEM_CODEWORDS];
+ ZSUndoRecPtr *orig_slots;
+ uint64 *orig_slotwords;
+ uint64 *orig_codewords;
+ int slotno;
+ uint64 first_delta;
+ uint64 second_delta;
+ int total_new_encoded;
+ Size itemsz;
+ ZSTidArrayItem *newitem;
+ ZSUndoRecPtr *newitem_slots;
+ uint64 *newitem_slotwords;
+ uint64 *newitem_codewords;
+ List *newitems;
+ int num_tids;
+ ZSUndoRecPtr *dst_slot;
+ uint64 *dst_slotword;
+ uint64 *dst_codeword;
+ int i;
+ int j;
+
+ if (orig == NULL)
+ {
+ *modified_orig = false;
+ return zsbt_tid_item_create_for_range(firsttid, nelements, undo_ptr);
+ }
+
+ /* Quick check to see if we can add the new TIDs to the previous item */
+ Assert(orig->t_endtid <= firsttid);
+
+ /*
+ * Is there room for a new codeword? Currently, we don't try to add tids to the
+ * last existing codeword, even if we perhaps could.
+ */
+ if (orig->t_num_codewords >= ZSBT_MAX_ITEM_CODEWORDS)
+ {
+ *modified_orig = false;
+ return zsbt_tid_item_create_for_range(firsttid, nelements, undo_ptr);
+ }
+
+ ZSTidArrayItemDecode(orig, &orig_codewords, &orig_slots, &orig_slotwords);
+
+ /* Is there an UNDO slot we can use? */
+ Assert(undo_ptr.counter != DeadUndoPtr.counter);
+ if (!IsZSUndoRecPtrValid(&undo_ptr))
+ {
+ slotno = ZSBT_OLD_UNDO_SLOT;
+ num_slots = orig->t_num_undo_slots;
+ }
+ else
+ {
+ for (slotno = ZSBT_FIRST_NORMAL_UNDO_SLOT; slotno < orig->t_num_undo_slots; slotno++)
+ {
+ if (orig_slots[slotno - ZSBT_FIRST_NORMAL_UNDO_SLOT].counter == undo_ptr.counter)
+ break;
+ }
+ if (slotno >= ZSBT_MAX_ITEM_UNDO_SLOTS)
+ {
+ *modified_orig = false;
+ return zsbt_tid_item_create_for_range(firsttid, nelements, undo_ptr);
+ }
+
+ if (slotno >= orig->t_num_undo_slots)
+ num_slots = orig->t_num_undo_slots + 1;
+ else
+ num_slots = orig->t_num_undo_slots;
+ }
+
+ /* ok, go ahead, create as many new codewords as fits, or is needed. */
+ first_delta = firsttid - orig->t_endtid + 1;
+ second_delta = 1;
+ total_new_encoded = 0;
+ num_new_codewords = 0;
+ while (num_new_codewords < ZSBT_MAX_ITEM_CODEWORDS - orig->t_num_codewords &&
+ total_new_encoded < nelements && orig->t_num_tids + total_new_encoded < ZSBT_MAX_ITEM_TIDS)
+ {
+ uint64 codeword;
+ int num_encoded;
+
+ codeword = simple8b_encode_consecutive(first_delta,
+ second_delta,
+ nelements - total_new_encoded,
+ &num_encoded);
+ if (num_encoded == 0)
+ break;
+
+ new_codewords[num_new_codewords] = codeword;
+ first_delta = 1;
+ num_new_codewords++;
+ total_new_encoded += num_encoded;
+ }
+
+ if (num_new_codewords == 0)
+ {
+ *modified_orig = false;
+ return zsbt_tid_item_create_for_range(firsttid, nelements, undo_ptr);
+ }
+
+ num_tids = orig->t_num_tids + total_new_encoded;
+
+ itemsz = SizeOfZSTidArrayItem(num_tids, num_slots, orig->t_num_codewords + num_new_codewords);
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_num_undo_slots = num_slots;
+ newitem->t_num_codewords = orig->t_num_codewords + num_new_codewords;
+ newitem->t_firsttid = orig->t_firsttid;
+ newitem->t_endtid = firsttid + total_new_encoded;
+ newitem->t_num_tids = newitem->t_endtid - newitem->t_firsttid;
+
+ ZSTidArrayItemDecode(newitem, &newitem_codewords, &newitem_slots, &newitem_slotwords);
+
+ /* copy existing codewords, followed by new ones */
+ dst_codeword = newitem_codewords;
+ for (int i = 0; i < orig->t_num_codewords; i++)
+ *(dst_codeword++) = orig_codewords[i];
+ for (int i = 0; i < num_new_codewords; i++)
+ *(dst_codeword++) = new_codewords[i];
+
+ /* copy existing UNDO slots, followed by new slot, if any */
+ dst_slot = newitem_slots;
+ for (i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < orig->t_num_undo_slots; i++)
+ *(dst_slot++) = orig_slots[i - ZSBT_FIRST_NORMAL_UNDO_SLOT];
+ if (num_slots > orig->t_num_undo_slots)
+ *(dst_slot++) = undo_ptr;
+
+ /*
+ * Copy and build slotwords
+ */
+ dst_slotword = newitem_slotwords;
+ /* copy full original slotwords as is */
+ for (i = 0; i < orig->t_num_tids / ZSBT_SLOTNOS_PER_WORD; i++)
+ *(dst_slotword++) = orig_slotwords[i];
+
+ /* add to the last, partial slotword. */
+ i = orig->t_num_tids;
+ j = orig->t_num_tids % ZSBT_SLOTNOS_PER_WORD;
+ if (j != 0)
+ {
+ uint64 slotword = orig_slotwords[orig->t_num_tids / ZSBT_SLOTNOS_PER_WORD];
+
+ for (; j < ZSBT_SLOTNOS_PER_WORD && i < num_tids; j++)
+ {
+ slotword |= (uint64) slotno << (j * ZSBT_ITEM_UNDO_SLOT_BITS);
+ i++;
+ }
+ *(dst_slotword++) = slotword;
+ }
+
+ /* new slotwords */
+ while (i < num_tids)
+ {
+ uint64 slotword = 0;
+
+ for (j = 0; j < ZSBT_SLOTNOS_PER_WORD && i < num_tids; j++)
+ {
+ slotword |= (uint64) slotno << (j * ZSBT_ITEM_UNDO_SLOT_BITS);
+ i++;
+ }
+ *(dst_slotword++) = slotword;
+ }
+ Assert(dst_slotword == newitem_slotwords + ZSBT_NUM_SLOTWORDS(num_tids));
+
+ /* Create more items for the remainder, if needed */
+ *modified_orig = true;
+ if (total_new_encoded < nelements)
+ newitems = zsbt_tid_item_create_for_range(newitem->t_endtid,
+ nelements - total_new_encoded,
+ undo_ptr);
+ else
+ newitems = NIL;
+ newitems = lcons(newitem, newitems);
+ return newitems;
+}
+
+/*
+ * Change the UNDO pointer of a tuple with TID 'target_tid', inside an item.
+ *
+ * Returns an item, or multiple items, to replace the original one.
+ */
+List *
+zsbt_tid_item_change_undoptr(ZSTidArrayItem *orig, zstid target_tid, ZSUndoRecPtr undoptr,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ uint64 *deltas;
+ zstid *tids;
+ int num_tids = orig->t_num_tids;
+ int target_idx = -1;
+ ZSUndoRecPtr *orig_slots_partial;
+ ZSUndoRecPtr orig_slots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ uint64 *orig_slotwords;
+ uint64 *orig_codewords;
+ List *newitems;
+ int new_slotno;
+
+ deltas = palloc(sizeof(uint64) * num_tids);
+ tids = palloc(sizeof(zstid) * num_tids);
+
+ ZSTidArrayItemDecode(orig, &orig_codewords, &orig_slots_partial, &orig_slotwords);
+
+ /* decode the codewords, to find the target TID */
+ simple8b_decode_words(orig_codewords, orig->t_num_codewords, deltas, num_tids);
+
+ deltas_to_tids(orig->t_firsttid, deltas, num_tids, tids);
+
+ target_idx = binsrch_tid_array(target_tid, tids, num_tids);
+ Assert(tids[target_idx] == target_tid);
+
+ /* Ok, we know the target TID now. Can we use one of the existing UNDO slots? */
+ new_slotno = -1;
+ if (undoptr.counter == DeadUndoPtr.counter)
+ new_slotno = ZSBT_DEAD_UNDO_SLOT;
+ if (new_slotno == -1 && undoptr.counter < recent_oldest_undo.counter)
+ new_slotno = ZSBT_OLD_UNDO_SLOT;
+
+ orig_slots[ZSBT_OLD_UNDO_SLOT] = InvalidUndoPtr;
+ orig_slots[ZSBT_DEAD_UNDO_SLOT] = DeadUndoPtr;
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < orig->t_num_undo_slots; i++)
+ orig_slots[i] = orig_slots_partial[i - ZSBT_FIRST_NORMAL_UNDO_SLOT];
+
+ if (new_slotno == -1)
+ {
+ for (int i = 0; i < orig->t_num_undo_slots; i++)
+ {
+ if (orig_slots[i].counter == undoptr.counter)
+ {
+ /* We can reuse this existing slot for the target. */
+ new_slotno = i;
+ }
+ }
+ }
+ if (new_slotno == -1 && orig->t_num_undo_slots < ZSBT_MAX_ITEM_UNDO_SLOTS)
+ {
+ /* There's a free slot we can use for the target */
+ new_slotno = orig->t_num_undo_slots;
+ }
+
+ if (new_slotno != -1)
+ {
+ int num_slots;
+ Size itemsz;
+ ZSTidArrayItem *newitem;
+ ZSUndoRecPtr *newitem_slots;
+ uint64 *newitem_slotwords;
+ uint64 *newitem_codewords;
+
+ num_slots = orig->t_num_undo_slots;
+ if (new_slotno == orig->t_num_undo_slots)
+ num_slots++;
+
+ /* Simple case */
+ itemsz = SizeOfZSTidArrayItem(orig->t_num_tids, num_slots, orig->t_num_codewords);
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_num_undo_slots = num_slots;
+ newitem->t_num_codewords = orig->t_num_codewords;
+ newitem->t_firsttid = orig->t_firsttid;
+ newitem->t_endtid = orig->t_endtid;
+ newitem->t_num_tids = orig->t_num_tids;
+
+ ZSTidArrayItemDecode(newitem, &newitem_codewords, &newitem_slots, &newitem_slotwords);
+
+ /* copy codewords. They're unmodified. */
+ for (int i = 0; i < orig->t_num_codewords; i++)
+ newitem_codewords[i] = orig_codewords[i];
+
+ /* copy existing slots, followed by new slot, if any */
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < orig->t_num_undo_slots; i++)
+ newitem_slots[i - ZSBT_FIRST_NORMAL_UNDO_SLOT] = orig_slots[i];
+ if (new_slotno == orig->t_num_undo_slots)
+ newitem_slots[new_slotno - ZSBT_FIRST_NORMAL_UNDO_SLOT] = undoptr;
+
+ /* copy slotwords */
+ for (int i = 0; i < ZSBT_NUM_SLOTWORDS(orig->t_num_tids); i++)
+ {
+ uint64 slotword;
+
+ slotword = orig_slotwords[i];
+
+ if (target_idx / ZSBT_SLOTNOS_PER_WORD == i)
+ {
+ /* this slotword contains the target TID */
+ int shift = (target_idx % ZSBT_SLOTNOS_PER_WORD) * ZSBT_ITEM_UNDO_SLOT_BITS;
+ uint64 mask;
+
+ mask = ((UINT64CONST(1) << ZSBT_ITEM_UNDO_SLOT_BITS) - 1) << shift;
+
+ slotword &= ~mask;
+ slotword |= (uint64) new_slotno << shift;
+ }
+
+ newitem_slotwords[i] = slotword;
+ }
+
+ newitems = list_make1(newitem);
+ }
+ else
+ {
+ /* Have to remap the slots. */
+ uint8 *slotnos;
+ ZSUndoRecPtr tmp_slots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ uint8 *tmp_slotnos;
+ int idx;
+
+ slotnos = palloc(orig->t_num_tids * sizeof(uint8));
+ slotwords_to_slotnos(orig_slotwords, orig->t_num_tids, slotnos);
+
+ tmp_slotnos = palloc(orig->t_num_tids * sizeof(uint8));
+
+ /* reconstruct items */
+ idx = 0;
+ newitems = NIL;
+ while (idx < orig->t_num_tids)
+ {
+ ZSTidArrayItem *newitem;
+ int num_remapped;
+ int num_tmp_slots;
+
+ num_remapped = remap_slots(&slotnos[idx], orig->t_num_tids - idx,
+ orig_slots, orig->t_num_undo_slots,
+ target_idx - idx, undoptr,
+ tmp_slots, &num_tmp_slots,
+ tmp_slotnos,
+ recent_oldest_undo);
+
+ deltas[idx] = 0;
+ newitem = build_item(&tids[idx], &deltas[idx], tmp_slotnos, num_remapped,
+ tmp_slots, num_tmp_slots);
+
+ newitems = lappend(newitems, newitem);
+ idx += newitem->t_num_tids;
+ }
+
+ pfree(slotnos);
+ pfree(tmp_slotnos);
+ }
+
+ pfree(deltas);
+ pfree(tids);
+
+ return newitems;
+}
+
+/*
+ * Completely remove a number of TIDs from an item. (for vacuum)
+ */
+List *
+zsbt_tid_item_remove_tids(ZSTidArrayItem *orig, zstid *nexttid, IntegerSet *remove_tids,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ ZSUndoRecPtr *orig_slots_partial;
+ ZSUndoRecPtr orig_slots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ uint64 *orig_slotwords;
+ uint64 *orig_codewords;
+ int total_remain;
+ uint64 *deltas;
+ zstid *tids;
+ int nelements = orig->t_num_tids;
+ List *newitems = NIL;
+ zstid tid;
+ zstid prev_tid;
+ int idx;
+ uint8 *slotnos;
+
+ deltas = palloc(sizeof(uint64) * nelements);
+ tids = palloc(sizeof(zstid) * nelements);
+ slotnos = palloc(sizeof(uint8) * nelements);
+
+ ZSTidArrayItemDecode(orig, &orig_codewords, &orig_slots_partial, &orig_slotwords);
+
+ /* decode all the codewords */
+ simple8b_decode_words(orig_codewords, orig->t_num_codewords, deltas, orig->t_num_tids);
+
+ /* also decode the slotwords */
+ orig_slots[ZSBT_OLD_UNDO_SLOT] = InvalidUndoPtr;
+ orig_slots[ZSBT_DEAD_UNDO_SLOT] = DeadUndoPtr;
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < orig->t_num_undo_slots; i++)
+ orig_slots[i] = orig_slots_partial[i - ZSBT_FIRST_NORMAL_UNDO_SLOT];
+
+ idx = 0;
+ while (idx < orig->t_num_tids)
+ {
+ uint64 slotword = orig_slotwords[idx / ZSBT_SLOTNOS_PER_WORD];
+
+ for (int j = 0; j < ZSBT_SLOTNOS_PER_WORD && idx < orig->t_num_tids; j++)
+ {
+ slotnos[idx++] = slotword & ((UINT64CONST(1) << ZSBT_ITEM_UNDO_SLOT_BITS) - 1);
+ slotword >>= ZSBT_ITEM_UNDO_SLOT_BITS;
+ }
+ }
+
+ /*
+ * Remove all the TIDs we can
+ */
+ total_remain = 0;
+ tid = orig->t_firsttid;
+ prev_tid = tid;
+ for (int i = 0; i < orig->t_num_tids; i++)
+ {
+ uint64 delta = deltas[i];
+
+ tid += delta;
+
+ while (*nexttid < tid)
+ {
+ if (!intset_iterate_next(remove_tids, nexttid))
+ *nexttid = MaxPlusOneZSTid;
+ }
+ if (tid < *nexttid)
+ {
+ deltas[total_remain] = tid - prev_tid;
+ tids[total_remain] = tid;
+ slotnos[total_remain] = slotnos[i];
+ total_remain++;
+ prev_tid = tid;
+ }
+ }
+
+ if (total_remain > 0)
+ {
+ ZSUndoRecPtr tmp_slots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ uint8 *tmp_slotnos;
+ int idx;
+
+ tmp_slotnos = palloc(total_remain * sizeof(uint8));
+
+ /*
+ * Ok, we have the surviving TIDs and their undo slot numbers in 'tids'
+ * and 'slotnos' now.
+ *
+ * Time to re-encode.
+ */
+ idx = 0;
+ while (idx < total_remain)
+ {
+ ZSTidArrayItem *newitem;
+ int num_remapped;
+ int num_tmp_slots;
+
+ num_remapped = remap_slots(&slotnos[idx], total_remain - idx,
+ orig_slots, orig->t_num_undo_slots,
+ -1, InvalidUndoPtr,
+ tmp_slots, &num_tmp_slots,
+ tmp_slotnos,
+ recent_oldest_undo);
+
+ deltas[idx] = 0;
+ newitem = build_item(&tids[idx], &deltas[idx], tmp_slotnos, num_remapped,
+ tmp_slots, num_tmp_slots);
+
+ newitems = lappend(newitems, newitem);
+ idx += newitem->t_num_tids;
+ }
+ pfree(tmp_slotnos);
+ }
+
+ pfree(deltas);
+ pfree(tids);
+ pfree(slotnos);
+
+ return newitems;
+}
+
+
+/*
+ * Convert an array of deltas to tids.
+ *
+ * Note: the input and output may point to the same array!
+ */
+static void
+deltas_to_tids(zstid firsttid, uint64 *deltas, int num_tids, zstid *tids)
+{
+ zstid prev_tid = firsttid;
+
+ for (int i = 0; i < num_tids; i++)
+ {
+ zstid tid;
+
+ tid = prev_tid + deltas[i];
+ tids[i] = tid;
+ prev_tid = tid;
+ }
+}
+
+/*
+ * Expand the slot numbers packed in slotwords, 2 bits per slotno, into
+ * a regular C array.
+ */
+static void
+slotwords_to_slotnos(uint64 *slotwords, int num_tids, uint8 *slotnos)
+{
+ uint64 *slotword_p;
+ const uint64 mask = (UINT64CONST(1) << ZSBT_ITEM_UNDO_SLOT_BITS) - 1;
+ int i;
+
+ i = 0;
+ slotword_p = slotwords;
+ while (i < num_tids)
+ {
+ uint64 slotword = *(slotword_p++);
+ int j;
+
+ /*
+ * Process four elements at a time, for speed (this is an
+ * unrolled version of the loop below).
+ */
+ j = 0;
+ while (j < ZSBT_SLOTNOS_PER_WORD && num_tids - i > 3)
+ {
+ slotnos[i] = slotword & mask;
+ slotnos[i + 1] = (slotword >> 2) & mask;
+ slotnos[i + 2] = (slotword >> 4) & mask;
+ slotnos[i + 3] = (slotword >> 6) & mask;
+ slotword = slotword >> 8;
+ i += 4;
+ j += 4;
+ }
+ /* handle the 0-3 elements at the end */
+ while (j < ZSBT_SLOTNOS_PER_WORD && num_tids - i > 0)
+ {
+ slotnos[i] = slotword & mask;
+ slotword = slotword >> 2;
+ i++;
+ j++;
+ }
+ }
+}
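+
+/*
+ * Packing example (assuming ZSBT_ITEM_UNDO_SLOT_BITS == 2, i.e. 32 slot
+ * numbers per 64-bit slotword): if the first four TIDs of an item use the
+ * slot numbers {0, 1, 2, 3}, the low byte of the first slotword is
+ * 0b11100100, and slotwords_to_slotnos() recovers {0, 1, 2, 3} in the
+ * original order.
+ */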
+
+/*
+ * Remap undo slots.
+ *
+ * We start with empty UNDO slots, and walk through the items,
+ * filling a slot whenever we encounter an UNDO pointer that we
+ * haven't assigned a slot for yet. If we run out of slots, stop.
+ */
+static int
+remap_slots(uint8 *slotnos, int num_tids,
+ ZSUndoRecPtr *orig_slots, int num_orig_slots,
+ int target_idx, ZSUndoRecPtr target_ptr,
+ ZSUndoRecPtr *new_slots,
+ int *new_num_slots,
+ uint8 *new_slotnos,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ int num_slots;
+ int8 slot_mapping[ZSBT_MAX_ITEM_UNDO_SLOTS + 1];
+ int idx;
+
+ new_slots[ZSBT_OLD_UNDO_SLOT] = InvalidUndoPtr;
+ new_slots[ZSBT_DEAD_UNDO_SLOT] = DeadUndoPtr;
+ num_slots = ZSBT_FIRST_NORMAL_UNDO_SLOT;
+
+ slot_mapping[ZSBT_OLD_UNDO_SLOT] = ZSBT_OLD_UNDO_SLOT;
+ slot_mapping[ZSBT_DEAD_UNDO_SLOT] = ZSBT_DEAD_UNDO_SLOT;
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < num_orig_slots; i++)
+ slot_mapping[i] = -1;
+
+ for (idx = 0; idx < num_tids; idx++)
+ {
+ int orig_slotno = slotnos[idx];
+ int new_slotno;
+
+ if (idx == target_idx)
+ new_slotno = -1;
+ else
+ new_slotno = slot_mapping[orig_slotno];
+ if (new_slotno == -1)
+ {
+ /* assign new slot for this. */
+ ZSUndoRecPtr this_undoptr;
+
+ if (idx == target_idx)
+ this_undoptr = target_ptr;
+ else
+ this_undoptr = orig_slots[orig_slotno];
+
+ if (this_undoptr.counter == DeadUndoPtr.counter)
+ new_slotno = ZSBT_DEAD_UNDO_SLOT;
+ else if (this_undoptr.counter < recent_oldest_undo.counter)
+ new_slotno = ZSBT_OLD_UNDO_SLOT;
+ else
+ {
+ for (int j = 0; j < num_slots; j++)
+ {
+ if (new_slots[j].counter == this_undoptr.counter)
+ {
+ /* We already had a slot for this undo pointer. Reuse it. */
+ new_slotno = j;
+ break;
+ }
+ }
+ if (new_slotno == -1)
+ {
+ if (num_slots >= ZSBT_MAX_ITEM_UNDO_SLOTS)
+ break; /* out of slots */
+ else
+ {
+ /* assign to free slot */
+ new_slots[num_slots] = this_undoptr;
+ new_slotno = num_slots;
+ num_slots++;
+ }
+ }
+ }
+
+ if (idx != target_idx)
+ slot_mapping[orig_slotno] = new_slotno;
+ }
+
+ new_slotnos[idx] = new_slotno;
+ }
+
+ *new_num_slots = num_slots;
+ return idx;
+}
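+
+/*
+ * For example (a sketch, assuming ZSBT_MAX_ITEM_UNDO_SLOTS == 4): if the
+ * TIDs in front of the target already reference two distinct "normal" UNDO
+ * pointers, filling both normal slots, and the target needs a third one,
+ * remap_slots() stops at the target and returns the number of TIDs before
+ * it. The caller then builds a new item starting at the target, and the
+ * target's UNDO pointer gets a fresh slot in that item.
+ */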
+
+/*
+ * Construct a ZSTidArrayItem.
+ *
+ * 'tids' is the list of TIDs to be packed in the item.
+ *
+ * 'deltas' contains the difference between each TID and the previous one.
+ * They could be computed from the 'tids', but since the caller has them
+ * already, we can save some effort by passing them down.
+ *
+ * 'slots' contains the UNDO slots to be stored. NOTE: it contains the
+ * special 0 and 1 slots too, but they won't be stored in the item that's
+ * created.
+ *
+ * 'slotnos' contains the UNDO slot numbers corresponding to each tuple
+ */
+static ZSTidArrayItem *
+build_item(zstid *tids, uint64 *deltas, uint8 *slotnos, int num_tids,
+ ZSUndoRecPtr *slots, int num_slots)
+{
+ int num_codewords;
+ Size itemsz;
+ ZSTidArrayItem *newitem;
+ int num_encoded;
+ uint64 codewords[ZSBT_MAX_ITEM_CODEWORDS];
+ ZSUndoRecPtr *newitem_slots;
+ uint64 *newitem_slotwords;
+ uint64 *newitem_codewords;
+ uint64 *dst_slotword;
+ int idx;
+
+ /*
+ * Create codewords.
+ */
+ num_codewords = 0;
+ num_encoded = 0;
+ while (num_encoded < num_tids && num_codewords < ZSBT_MAX_ITEM_CODEWORDS)
+ {
+ int n;
+ uint64 codeword;
+
+ codeword = simple8b_encode(&deltas[num_encoded], num_tids - num_encoded, &n);
+ if (n == 0)
+ break;
+
+ num_encoded += n;
+
+ codewords[num_codewords++] = codeword;
+ }
+
+ itemsz = SizeOfZSTidArrayItem(num_encoded, num_slots, num_codewords);
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_num_tids = num_encoded;
+ newitem->t_num_undo_slots = num_slots;
+ newitem->t_num_codewords = num_codewords;
+ newitem->t_firsttid = tids[0];
+ newitem->t_endtid = tids[num_encoded - 1] + 1;
+
+ ZSTidArrayItemDecode(newitem, &newitem_codewords, &newitem_slots, &newitem_slotwords);
+
+ /* Copy in the TID codewords */
+ for (int i = 0; i < num_codewords; i++)
+ newitem_codewords[i] = codewords[i];
+
+ /* Copy in undo slots */
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < num_slots; i++)
+ newitem_slots[i - ZSBT_FIRST_NORMAL_UNDO_SLOT] = slots[i];
+
+ /* Create slotwords */
+ dst_slotword = newitem_slotwords;
+ idx = 0;
+ while (idx < num_encoded)
+ {
+ uint64 slotword = 0;
+
+ for (int j = 0; j < ZSBT_SLOTNOS_PER_WORD && idx < num_encoded; j++)
+ slotword |= (uint64) slotnos[idx++] << (j * ZSBT_ITEM_UNDO_SLOT_BITS);
+
+ *(dst_slotword++) = slotword;
+ }
+ Assert(dst_slotword == newitem_slotwords + ZSBT_NUM_SLOTWORDS(num_encoded));
+
+ return newitem;
+}
+
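+/*
+ * Binary search in an array of TIDs.
+ *
+ * Returns the index of the last element that is <= 'key', or -1 if 'key'
+ * is smaller than all elements. For example, with arr = {10, 20, 30}:
+ * key 25 returns 1, key 30 returns 2, and key 5 returns -1.
+ */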
+static int
+binsrch_tid_array(zstid key, zstid *arr, int arr_elems)
+{
+ int low,
+ high,
+ mid;
+
+ low = 0;
+ high = arr_elems;
+ while (high > low)
+ {
+ mid = low + (high - low) / 2;
+
+ if (key >= arr[mid])
+ low = mid + 1;
+ else
+ high = mid;
+ }
+ return low - 1;
+}
diff --git src/backend/access/zedstore/zedstore_tidpage.c src/backend/access/zedstore/zedstore_tidpage.c
new file mode 100644
index 0000000000..83b89db857
--- /dev/null
+++ src/backend/access/zedstore/zedstore_tidpage.c
@@ -0,0 +1,2064 @@
+/*
+ * zedstore_tidpage.c
+ * Routines for handling the TID tree.
+ *
+ * A Zedstore table consists of multiple B-trees, one for each attribute. The
+ * functions in this file deal with one B-tree at a time; it is the caller's
+ * responsibility to tie together the scans of each btree.
+ *
+ * Operations:
+ *
+ * - Sequential scan in TID order
+ * - must be efficient with scanning multiple trees in sync
+ *
+ * - random lookups, by TID (for index scan)
+ *
+ * - range scans by TID (for bitmap index scan)
+ *
+ * NOTES:
+ * - Locking order: child before parent, left before right
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tidpage.c
+ */
+#include "postgres.h"
+
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undorec.h"
+#include "access/zedstore_wal.h"
+#include "lib/integerset.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/predicate.h"
+#include "storage/procarray.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+
+/* prototypes for local functions */
+static void zsbt_tid_recompress_replace(Relation rel, Buffer oldbuf, List *items, zs_pending_undo_op *undo_op);
+static OffsetNumber zsbt_tid_fetch(Relation rel, zstid tid,
+ Buffer *buf_p, ZSUndoRecPtr *undo_ptr_p, bool *isdead_p);
+static void zsbt_tid_add_items(Relation rel, Buffer buf, List *newitems,
+ zs_pending_undo_op *pending_undo_op);
+static void zsbt_tid_replace_item(Relation rel, Buffer buf, OffsetNumber off, List *newitems,
+ zs_pending_undo_op *pending_undo_op);
+
+static TM_Result zsbt_tid_update_lock_old(Relation rel, zstid otid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd,
+ bool *this_xact_has_lock, ZSUndoRecPtr *prevundoptr_p);
+static zstid zsbt_tid_update_insert_new(Relation rel, TransactionId xid, CommandId cid,
+ ZSUndoRecPtr prevundoptr);
+static bool zsbt_tid_mark_old_updated(Relation rel, zstid otid, zstid newtid,
+ TransactionId xid, CommandId cid, bool key_update, ZSUndoRecPtr prevrecptr);
+static OffsetNumber zsbt_binsrch_tidpage(zstid key, Page page);
+static void zsbt_wal_log_tidleaf_items_begin(int nitems, zs_pending_undo_op *undo_op);
+static void zsbt_wal_log_tidleaf_items(Relation rel, Buffer buf,
+ OffsetNumber off, bool replace, List *items,
+ zs_pending_undo_op *undo_op);
+
+/* ----------------------------------------------------------------
+ * Public interface
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Begin a scan of the btree.
+ */
+void
+zsbt_tid_begin_scan(Relation rel, zstid starttid,
+ zstid endtid, Snapshot snapshot, ZSTidTreeScan *scan)
+{
+ scan->rel = rel;
+ scan->snapshot = snapshot;
+ scan->context = CurrentMemoryContext;
+ scan->starttid = starttid;
+ scan->endtid = endtid;
+ scan->currtid = starttid - 1;
+ memset(&scan->recent_oldest_undo, 0, sizeof(scan->recent_oldest_undo));
+ memset(&scan->array_iter, 0, sizeof(scan->array_iter));
+ scan->array_iter.context = CurrentMemoryContext;
+ scan->array_curr_idx = -1;
+
+ scan->active = true;
+ scan->lastbuf = InvalidBuffer;
+ scan->lastoff = InvalidOffsetNumber;
+
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, true);
+}
+
+/*
+ * Reset the 'next' TID in a scan to the given TID.
+ */
+void
+zsbt_tid_reset_scan(ZSTidTreeScan *scan, zstid starttid, zstid endtid, zstid currtid)
+{
+ scan->starttid = starttid;
+ scan->endtid = endtid;
+ scan->currtid = currtid;
+ scan->array_curr_idx = -1;
+}
+
+void
+zsbt_tid_end_scan(ZSTidTreeScan *scan)
+{
+ if (!scan->active)
+ return;
+
+ if (scan->lastbuf != InvalidBuffer)
+ ReleaseBuffer(scan->lastbuf);
+
+ scan->active = false;
+ scan->array_iter.num_tids = 0;
+ scan->array_curr_idx = -1;
+
+ if (scan->array_iter.tids)
+ pfree(scan->array_iter.tids);
+ if (scan->array_iter.tid_undoslotnos)
+ pfree(scan->array_iter.tid_undoslotnos);
+}
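+
+/*
+ * Typical use of the scan interface, roughly (real callers go through the
+ * zsbt_tid_scan_next() wrapper, which also tracks the position within the
+ * returned batch; this is only a sketch):
+ *
+ *     ZSTidTreeScan scan;
+ *
+ *     zsbt_tid_begin_scan(rel, MinZSTid, MaxPlusOneZSTid, snapshot, &scan);
+ *     while (zsbt_tid_scan_next_array(&scan, scan.currtid + 1, ForwardScanDirection))
+ *     {
+ *         /* scan.array_iter.tids[0 .. num_tids - 1] holds the next batch of
+ *          * TIDs visible to the snapshot, in ascending order (the batch may
+ *          * include TIDs below the requested position; skip those) */
+ *         scan.currtid = scan.array_iter.tids[scan.array_iter.num_tids - 1];
+ *     }
+ *     zsbt_tid_end_scan(&scan);
+ */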
+
+/*
+ * Helper function of zsbt_tid_scan_next_array(), to extract the TIDs and their
+ * visibility information from the given array item into the scan->array_iter
+ * fields.
+ */
+static void
+zsbt_tid_scan_extract_array(ZSTidTreeScan *scan, ZSTidArrayItem *aitem)
+{
+ bool slots_visible[4];
+ int first;
+ int last;
+ int num_visible_tids;
+ int continue_at;
+
+ zsbt_tid_item_unpack(aitem, &scan->array_iter);
+
+ slots_visible[ZSBT_OLD_UNDO_SLOT] = true;
+ slots_visible[ZSBT_DEAD_UNDO_SLOT] = false;
+
+ scan->array_iter.undoslot_visibility[ZSBT_OLD_UNDO_SLOT] = InvalidUndoSlotVisibility;
+ scan->array_iter.undoslot_visibility[ZSBT_OLD_UNDO_SLOT].xmin = FrozenTransactionId;
+
+ scan->array_iter.undoslot_visibility[ZSBT_DEAD_UNDO_SLOT] = InvalidUndoSlotVisibility;
+
+ for (int i = 2; i < aitem->t_num_undo_slots; i++)
+ {
+ ZSUndoRecPtr undoptr = scan->array_iter.undoslots[i];
+ TransactionId obsoleting_xid;
+
+ scan->array_iter.undoslot_visibility[i] = InvalidUndoSlotVisibility;
+
+ slots_visible[i] = zs_SatisfiesVisibility(scan, undoptr, &obsoleting_xid,
+ NULL, &scan->array_iter.undoslot_visibility[i]);
+ if (scan->serializable && TransactionIdIsValid(obsoleting_xid))
+ CheckForSerializableConflictOut(scan->rel, obsoleting_xid, scan->snapshot);
+ }
+
+ /*
+ * Skip over elements at the beginning and end of the array that
+ * are not within the range we're interested in.
+ */
+ for (first = 0; first < scan->array_iter.num_tids; first++)
+ {
+ if (scan->array_iter.tids[first] >= scan->starttid)
+ break;
+ }
+ for (last = scan->array_iter.num_tids - 1; last >= first; last--)
+ {
+ if (scan->array_iter.tids[last] < scan->endtid)
+ break;
+ }
+
+ /* squeeze out invisible TIDs */
+ if (first == 0)
+ {
+ int j;
+
+ for (j = 0; j <= last; j++)
+ {
+ if (!slots_visible[scan->array_iter.tid_undoslotnos[j]])
+ break;
+ }
+ num_visible_tids = j;
+ continue_at = j + 1;
+ }
+ else
+ {
+ num_visible_tids = 0;
+ continue_at = first;
+ }
+
+ for (int i = continue_at; i <= last; i++)
+ {
+ /* Is this item visible? */
+ if (slots_visible[scan->array_iter.tid_undoslotnos[i]])
+ {
+ scan->array_iter.tids[num_visible_tids] = scan->array_iter.tids[i];
+ scan->array_iter.tid_undoslotnos[num_visible_tids] = scan->array_iter.tid_undoslotnos[i];
+ num_visible_tids++;
+ }
+ }
+ scan->array_iter.num_tids = num_visible_tids;
+ scan->array_curr_idx = -1;
+}
+
+/*
+ * Advance scan to next batch of TIDs.
+ *
+ * Finds the next TID array item covering TIDs >= 'nexttid', and decodes it into
+ * scan->array_iter. The values in scan->array_iter are valid until
+ * the next call to this function, zsbt_tid_reset_scan() or
+ * zsbt_tid_end_scan().
+ *
+ * Returns true if there was another item, or false if we reached the
+ * end of the scan.
+ *
+ * This is normally not used directly, see zsbt_tid_scan_next() wrapper.
+ */
+bool
+zsbt_tid_scan_next_array(ZSTidTreeScan *scan, zstid nexttid, ScanDirection direction)
+{
+ if (!scan->active)
+ return false;
+
+ /*
+ * Process items, until we find something that is visible to the snapshot.
+ *
+ * This advances nexttid as it goes.
+ */
+ while (nexttid < scan->endtid && nexttid >= scan->starttid)
+ {
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+ BlockNumber next;
+
+ /*
+ * Find and lock the leaf page containing nexttid.
+ */
+ buf = zsbt_find_and_lock_leaf_containing_tid(scan->rel, ZS_META_ATTRIBUTE_NUM,
+ scan->lastbuf, nexttid,
+ BUFFER_LOCK_SHARE);
+ if (buf != scan->lastbuf)
+ scan->lastoff = InvalidOffsetNumber;
+ scan->lastbuf = buf;
+ if (!BufferIsValid(buf))
+ {
+ /*
+ * Completely empty tree. This should only happen at the beginning of a
+ * scan - a tree cannot go missing after it's been created - but we don't
+ * currently check for that.
+ */
+ break;
+ }
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ Assert(opaque->zs_page_id == ZS_BTREE_PAGE_ID);
+
+ /*
+ * Scan the items on the page, to find the next one that covers
+ * nexttid.
+ *
+ * We check the last offset first, as an optimization
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ if (direction == ForwardScanDirection)
+ {
+ /* Search for the next item >= nexttid */
+ off = FirstOffsetNumber;
+ if (scan->lastoff > FirstOffsetNumber && scan->lastoff <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, scan->lastoff);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (nexttid >= item->t_endtid)
+ off = scan->lastoff + 1;
+ }
+
+ for (; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (nexttid >= item->t_endtid)
+ continue;
+
+ if (item->t_firsttid >= scan->endtid)
+ {
+ nexttid = scan->endtid;
+ break;
+ }
+
+ zsbt_tid_scan_extract_array(scan, item);
+
+ if (scan->array_iter.num_tids > 0)
+ {
+ if (scan->array_iter.tids[scan->array_iter.num_tids - 1] >= nexttid)
+ {
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ scan->lastoff = off;
+ return true;
+ }
+ nexttid = scan->array_iter.tids[scan->array_iter.num_tids - 1] + 1;
+ }
+ }
+ /* No more items on this page. Walk right, if possible */
+ if (nexttid < opaque->zs_hikey)
+ nexttid = opaque->zs_hikey;
+ next = opaque->zs_next;
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (next == InvalidBlockNumber || nexttid >= scan->endtid)
+ {
+ /* reached end of scan */
+ break;
+ }
+
+ scan->lastbuf = ReleaseAndReadBuffer(scan->lastbuf, scan->rel, next);
+ }
+ else
+ {
+ /* Search for the next item <= nexttid */
+ for (off = maxoff; off >= FirstOffsetNumber; off--)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (nexttid < item->t_firsttid)
+ continue;
+
+ if (item->t_endtid < scan->starttid)
+ {
+ nexttid = scan->starttid - 1;
+ break;
+ }
+
+ zsbt_tid_scan_extract_array(scan, item);
+
+ if (scan->array_iter.num_tids > 0)
+ {
+ if (scan->array_iter.tids[0] <= nexttid)
+ {
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ scan->lastoff = off;
+ return true;
+ }
+ nexttid = scan->array_iter.tids[0] - 1;
+ }
+ }
+ /* No more items on this page. Loop back to find the left sibling. */
+ if (nexttid >= opaque->zs_lokey)
+ nexttid = opaque->zs_lokey - 1;
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ if (nexttid < scan->starttid)
+ {
+ /* reached end of scan */
+ break;
+ }
+ scan->lastbuf = InvalidBuffer;
+ }
+ }
+
+ /* Reached end of scan. */
+ scan->array_iter.num_tids = 0;
+ if (BufferIsValid(scan->lastbuf))
+ ReleaseBuffer(scan->lastbuf);
+ scan->lastbuf = InvalidBuffer;
+ scan->lastoff = InvalidOffsetNumber;
+
+ return false;
+}
+
+/*
+ * Get the first tid in the tree.
+ */
+zstid
+zsbt_get_first_tid(Relation rel)
+{
+ zstid leftmostkey;
+ zstid tid;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+
+ /* Find the leftmost leaf */
+ leftmostkey = MinZSTid;
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, leftmostkey, 0, true);
+ if (!BufferIsValid(buf))
+ {
+ return MaxPlusOneZSTid;
+ }
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /* Move on to the next page if the current page has no item */
+ while (PageGetMaxOffsetNumber(page) < FirstOffsetNumber)
+ {
+ BlockNumber next = opaque->zs_next;
+
+ if (next == InvalidBlockNumber)
+ {
+ UnlockReleaseBuffer(buf);
+ return MaxPlusOneZSTid;
+ }
+ UnlockReleaseBuffer(buf);
+
+ buf = ReadBuffer(rel, next);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ }
+
+ tid = opaque->zs_lokey;
+ UnlockReleaseBuffer(buf);
+
+ return tid;
+}
+
+/*
+ * Get the last tid (plus one) in the tree.
+ */
+zstid
+zsbt_get_last_tid(Relation rel)
+{
+ zstid rightmostkey;
+ zstid tid;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+
+ /* Find the rightmost leaf */
+ rightmostkey = MaxZSTid;
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, rightmostkey, 0, true);
+ if (!BufferIsValid(buf))
+ {
+ return MinZSTid;
+ }
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /*
+ * Look at the last item, for its tid.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ if (maxoff >= FirstOffsetNumber)
+ {
+ ItemId iid = PageGetItemId(page, maxoff);
+ ZSTidArrayItem *lastitem = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ tid = lastitem->t_endtid;
+ }
+ else
+ {
+ tid = opaque->zs_lokey;
+ }
+ UnlockReleaseBuffer(buf);
+
+ return tid;
+}
+
+/*
+ * Insert multiple TIDs.
+ *
+ * Allocates 'ntuples' new consecutive TIDs at the end of the TID tree and
+ * returns the first one. The caller uses the returned TIDs when inserting
+ * the actual column data into the attribute trees.
+ */
+zstid
+zsbt_tid_multi_insert(Relation rel, int ntuples,
+ TransactionId xid, CommandId cid, uint32 speculative_token, ZSUndoRecPtr prevundoptr)
+{
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+ zstid insert_target_key;
+ List *newitems;
+ zs_pending_undo_op *undo_op;
+ zstid endtid;
+ zstid tid;
+ ZSTidArrayItem *lastitem;
+ bool modified_orig;
+
+ /*
+ * Insert to the rightmost leaf.
+ *
+ * TODO: use a Free Space Map to find suitable target.
+ */
+ insert_target_key = MaxZSTid;
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, insert_target_key, 0, false);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Look at the last item to find the next unused TID; the new TIDs are
+ * assigned starting from there.
+ */
+ if (maxoff >= FirstOffsetNumber)
+ {
+ ItemId iid = PageGetItemId(page, maxoff);
+
+ lastitem = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ endtid = lastitem->t_endtid;
+ }
+ else
+ {
+ endtid = opaque->zs_lokey;
+ lastitem = NULL;
+ }
+ tid = endtid;
+
+ /* Form an undo record */
+ if (xid != FrozenTransactionId)
+ {
+ undo_op = zsundo_create_for_insert(rel, xid, cid, tid, ntuples,
+ speculative_token, prevundoptr);
+ }
+ else
+ {
+ undo_op = NULL;
+ }
+
+ /*
+ * Create an item to represent all the TIDs, merging with the last existing
+ * item if possible.
+ */
+ newitems = zsbt_tid_item_add_tids(lastitem, tid, ntuples, undo_op ? undo_op->reservation.undorecptr : InvalidUndoPtr,
+ &modified_orig);
+
+ /*
+ * Replace the original last item with the new items, or add new items.
+ * This splits the page if necessary.
+ */
+ if (modified_orig)
+ zsbt_tid_replace_item(rel, buf, maxoff, newitems, undo_op);
+ else
+ zsbt_tid_add_items(rel, buf, newitems, undo_op);
+ /* zsbt_tid_replace/add_item unlocked 'buf' */
+ ReleaseBuffer(buf);
+
+ list_free_deep(newitems);
+
+ /* Return the TIDs to the caller */
+ return tid;
+}
+
+TM_Result
+zsbt_tid_delete(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart, bool *this_xact_has_lock)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, true);
+ ZSUndoRecPtr item_undoptr;
+ bool item_isdead;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ zs_pending_undo_op *undo_op;
+ OffsetNumber off;
+ ZSTidArrayItem *origitem;
+ Buffer buf;
+ Page page;
+ zstid next_tid;
+ List *newitems = NIL;
+
+ /* Find the item to delete. (It could be compressed) */
+ off = zsbt_tid_fetch(rel, tid, &buf, &item_undoptr, &item_isdead);
+ if (!OffsetNumberIsValid(off))
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find tuple to delete with TID (%u, %u) in TID tree",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ }
+ if (item_isdead)
+ {
+ elog(ERROR, "cannot delete tuple that is already marked DEAD (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ }
+
+ if (snapshot)
+ {
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ tid, item_undoptr, LockTupleExclusive,
+ &keep_old_undo_ptr, this_xact_has_lock,
+ hufd, &next_tid, NULL);
+ if (result != TM_Ok)
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return result;
+ }
+
+ if (crosscheck != InvalidSnapshot && result == TM_Ok)
+ {
+ /* Perform additional check for transaction-snapshot mode RI updates */
+ /* FIXME: dummy scan */
+ ZSTidTreeScan scan;
+ TransactionId obsoleting_xid;
+ ZSUndoSlotVisibility visi_info;
+
+ memset(&scan, 0, sizeof(scan));
+ scan.rel = rel;
+ scan.snapshot = crosscheck;
+ scan.recent_oldest_undo = recent_oldest_undo;
+
+ if (!zs_SatisfiesVisibility(&scan, item_undoptr, &obsoleting_xid, NULL, &visi_info))
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return TM_Updated;
+ }
+ }
+ }
+
+ /* Create UNDO record. */
+ undo_op = zsundo_create_for_delete(rel, xid, cid, tid, changingPart,
+ keep_old_undo_ptr ? item_undoptr : InvalidUndoPtr);
+
+ /* Update the tid with the new UNDO pointer. */
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, tid, undo_op->reservation.undorecptr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems, undo_op);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+
+ return TM_Ok;
+}
+
+void
+zsbt_find_latest_tid(Relation rel, zstid *tid, Snapshot snapshot)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, true);
+ ZSUndoRecPtr item_undoptr;
+ bool item_isdead;
+ int idx;
+ Buffer buf;
+ /* We only need the meta-attribute to follow the update chain */
+ zstid curr_tid = *tid;
+
+ for (;;)
+ {
+ zstid next_tid = InvalidZSTid;
+
+ if (curr_tid == InvalidZSTid)
+ break;
+
+ /* Find the item */
+ idx = zsbt_tid_fetch(rel, curr_tid, &buf, &item_undoptr, &item_isdead);
+ if (!OffsetNumberIsValid(idx) || item_isdead)
+ break;
+
+ if (snapshot)
+ {
+ /* FIXME: dummy scan */
+ ZSTidTreeScan scan;
+ TransactionId obsoleting_xid;
+ ZSUndoSlotVisibility visi_info;
+
+ memset(&scan, 0, sizeof(scan));
+ scan.rel = rel;
+ scan.snapshot = snapshot;
+ scan.recent_oldest_undo = recent_oldest_undo;
+
+ if (zs_SatisfiesVisibility(&scan, item_undoptr,
+ &obsoleting_xid, &next_tid, &visi_info))
+ {
+ *tid = curr_tid;
+ }
+
+ curr_tid = next_tid;
+ UnlockReleaseBuffer(buf);
+ }
+ }
+}
+
+/*
+ * A new TID is allocated, as we see best, and returned to the caller. This
+ * function is only called for the META attribute btree. Data columns will use
+ * the returned TID to insert new items.
+ */
+TM_Result
+zsbt_tid_update(Relation rel, zstid otid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd,
+ zstid *newtid_p, bool *this_xact_has_lock)
+{
+ TM_Result result;
+ ZSUndoRecPtr prevundoptr;
+ bool success;
+
+ /*
+ * This is currently only used on the meta-attribute. The other attributes
+ * don't need to carry visibility information, so the caller just inserts
+ * the new values with (multi_)insert() instead. This will change once we
+ * start doing the equivalent of HOT updates, where the TID doesn't change.
+ */
+ Assert(*newtid_p == InvalidZSTid);
+
+ /*
+ * Find and lock the old item.
+ *
+ * TODO: If there's free TID space left on the same page, we should keep the
+ * buffer locked, and use the same page for the new tuple.
+ */
+retry:
+ result = zsbt_tid_update_lock_old(rel, otid,
+ xid, cid, key_update, snapshot,
+ crosscheck, wait, hufd, this_xact_has_lock, &prevundoptr);
+
+ if (result != TM_Ok)
+ return result;
+
+ /* insert new version */
+ *newtid_p = zsbt_tid_update_insert_new(rel, xid, cid, prevundoptr);
+
+ /* update the old item with the "t_ctid pointer" for the new item */
+ success = zsbt_tid_mark_old_updated(rel, otid, *newtid_p, xid, cid, key_update, prevundoptr);
+ if (!success)
+ {
+ ZSUndoRecPtr oldest_undoptr = zsundo_get_oldest_undo_ptr(rel, true);
+
+ zsbt_tid_mark_dead(rel, *newtid_p, oldest_undoptr);
+ goto retry;
+ }
+
+ return TM_Ok;
+}
+
+/*
+ * Subroutine of zsbt_update(): locks the old item for update.
+ */
+static TM_Result
+zsbt_tid_update_lock_old(Relation rel, zstid otid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd, bool *this_xact_has_lock,
+ ZSUndoRecPtr *prevundoptr_p)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, true);
+ Buffer buf;
+ ZSUndoRecPtr olditem_undoptr;
+ bool olditem_isdead;
+ int idx;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ zstid next_tid;
+
+ /*
+ * Find the old item to lock.
+ */
+ idx = zsbt_tid_fetch(rel, otid, &buf, &olditem_undoptr, &olditem_isdead);
+ if (!OffsetNumberIsValid(idx) || olditem_isdead)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find old tuple to update with TID (%u, %u) in TID tree",
+ ZSTidGetBlockNumber(otid), ZSTidGetOffsetNumber(otid));
+ }
+ *prevundoptr_p = olditem_undoptr;
+
+ /*
+ * Is it visible to us?
+ */
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ otid, olditem_undoptr,
+ key_update ? LockTupleExclusive : LockTupleNoKeyExclusive,
+ &keep_old_undo_ptr, this_xact_has_lock,
+ hufd, &next_tid, NULL);
+ if (result != TM_Ok)
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return result;
+ }
+
+ if (crosscheck != InvalidSnapshot && result == TM_Ok)
+ {
+ /* Perform additional check for transaction-snapshot mode RI updates */
+ /* FIXME: dummy scan */
+ ZSTidTreeScan scan;
+ TransactionId obsoleting_xid;
+ ZSUndoSlotVisibility visi_info;
+
+ memset(&scan, 0, sizeof(scan));
+ scan.rel = rel;
+ scan.snapshot = crosscheck;
+ scan.recent_oldest_undo = recent_oldest_undo;
+
+ if (!zs_SatisfiesVisibility(&scan, olditem_undoptr, &obsoleting_xid, NULL, &visi_info))
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return TM_Updated;
+ }
+ }
+
+ /*
+ * TODO: tuple-locking not implemented. Pray that there is no competing
+ * concurrent update!
+ */
+
+ UnlockReleaseBuffer(buf);
+
+ return TM_Ok;
+}
+
+/*
+ * Subroutine of zsbt_update(): inserts the new, updated, item.
+ */
+static zstid
+zsbt_tid_update_insert_new(Relation rel, TransactionId xid, CommandId cid, ZSUndoRecPtr prevundoptr)
+{
+ return zsbt_tid_multi_insert(rel, 1, xid, cid, INVALID_SPECULATIVE_TOKEN, prevundoptr);
+}
+
+/*
+ * Subroutine of zsbt_update(): mark old item as updated.
+ */
+static bool
+zsbt_tid_mark_old_updated(Relation rel, zstid otid, zstid newtid,
+ TransactionId xid, CommandId cid, bool key_update, ZSUndoRecPtr prevrecptr)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false /* we trimmed in the zsbt_tid_update_lock_old() call */);
+ Buffer buf;
+ Page page;
+ ZSUndoRecPtr olditem_undoptr;
+ bool olditem_isdead;
+ OffsetNumber off;
+ bool keep_old_undo_ptr = true;
+ zs_pending_undo_op *undo_op;
+ List *newitems;
+ ZSTidArrayItem *origitem;
+
+ /*
+ * Find the old item to update. It could be part of a compressed item;
+ * we let zsbt_tid_fetch() handle that.
+ */
+ off = zsbt_tid_fetch(rel, otid, &buf, &olditem_undoptr, &olditem_isdead);
+ if (!OffsetNumberIsValid(off) || olditem_isdead)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find old tuple to update with TID (%u, %u) in TID tree",
+ ZSTidGetBlockNumber(otid), ZSTidGetOffsetNumber(otid));
+ }
+
+ /*
+ * Did it change while we were inserting new row version?
+ */
+ if (!ZSUndoRecPtrEquals(olditem_undoptr, prevrecptr))
+ {
+ UnlockReleaseBuffer(buf);
+ return false;
+ }
+
+ /* Prepare an UNDO record. */
+ undo_op = zsundo_create_for_update(rel, xid, cid, otid, newtid,
+ keep_old_undo_ptr ? olditem_undoptr : InvalidUndoPtr,
+ key_update);
+
+ /* Replace the ZSBreeItem with one with the updated undo pointer. */
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, otid, undo_op->reservation.undorecptr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems, undo_op);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+
+ return true;
+}
+
+TM_Result
+zsbt_tid_lock(Relation rel, zstid tid, TransactionId xid, CommandId cid,
+ LockTupleMode mode, bool follow_updates, Snapshot snapshot,
+ TM_FailureData *hufd, zstid *next_tid, bool *this_xact_has_lock,
+ ZSUndoSlotVisibility *visi_info)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, true);
+ Buffer buf;
+ Page page;
+ ZSUndoRecPtr item_undoptr;
+ bool item_isdead;
+ OffsetNumber off;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ zs_pending_undo_op *undo_op;
+ List *newitems;
+ ZSTidArrayItem *origitem;
+
+ *next_tid = tid;
+
+ off = zsbt_tid_fetch(rel, tid, &buf, &item_undoptr, &item_isdead);
+ if (!OffsetNumberIsValid(off) || item_isdead)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find tuple to lock with TID (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ }
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ tid, item_undoptr, mode,
+ &keep_old_undo_ptr, this_xact_has_lock,
+ hufd, next_tid, visi_info);
+
+ if (result != TM_Ok)
+ {
+ if (result == TM_Invisible && follow_updates &&
+ TransactionIdIsInProgress(visi_info->xmin))
+ {
+ /*
+ * need to lock tuple irrespective of its visibility on
+ * follow_updates.
+ */
+ }
+ else
+ {
+ UnlockReleaseBuffer(buf);
+ return result;
+ }
+ }
+
+ /* Create UNDO record. */
+ undo_op = zsundo_create_for_tuple_lock(rel, xid, cid, tid, mode,
+ keep_old_undo_ptr ? item_undoptr : InvalidUndoPtr);
+
+ /* Replace the item with an identical one, but with updated undo pointer. */
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, tid, undo_op->reservation.undorecptr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems, undo_op);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+ return TM_Ok;
+}
+
+/*
+ * Collect all TIDs marked as dead in the TID tree.
+ *
+ * This is used during VACUUM.
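+ *
+ * The scan starts at 'starttid' and stops early once the collected set of
+ * dead TIDs grows past maintenance_work_mem; '*endtid' is set to the next
+ * TID to continue from, so the caller can process the TIDs collected so far
+ * and then call this again with starttid = *endtid.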
+ */
+IntegerSet *
+zsbt_collect_dead_tids(Relation rel, zstid starttid, zstid *endtid, uint64 *num_live_tuples)
+{
+ Buffer buf = InvalidBuffer;
+ IntegerSet *result;
+ ZSBtreePageOpaque *opaque;
+ zstid nexttid;
+ BlockNumber nextblock;
+ ZSTidItemIterator iter;
+
+ memset(&iter, 0, sizeof(ZSTidItemIterator));
+ iter.context = CurrentMemoryContext;
+
+ result = intset_create();
+
+ nexttid = starttid;
+ nextblock = InvalidBlockNumber;
+ for (;;)
+ {
+ Page page;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ if (nextblock != InvalidBlockNumber)
+ {
+ buf = ReleaseAndReadBuffer(buf, rel, nextblock);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (!zsbt_page_is_expected(rel, ZS_META_ATTRIBUTE_NUM, nexttid, 0, buf))
+ {
+ UnlockReleaseBuffer(buf);
+ buf = InvalidBuffer;
+ }
+ }
+
+ if (!BufferIsValid(buf))
+ {
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, nexttid, 0, true);
+ if (!BufferIsValid(buf))
+ return result;
+ }
+
+ page = BufferGetPage(buf);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ zsbt_tid_item_unpack(item, &iter);
+
+ for (int j = 0; j < iter.num_tids; j++)
+ {
+ (*num_live_tuples)++;
+ if (iter.tid_undoslotnos[j] == ZSBT_DEAD_UNDO_SLOT)
+ intset_add_member(result, iter.tids[j]);
+ }
+ }
+
+ opaque = ZSBtreePageGetOpaque(page);
+ nexttid = opaque->zs_hikey;
+ nextblock = opaque->zs_next;
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (nexttid == MaxPlusOneZSTid)
+ {
+ Assert(nextblock == InvalidBlockNumber);
+ break;
+ }
+
+ if (intset_memory_usage(result) > (uint64) maintenance_work_mem * 1024)
+ break;
+ }
+
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+
+ *endtid = nexttid;
+ return result;
+}
+
+/*
+ * Mark item with given TID as dead.
+ *
+ * This is used when UNDO actions are performed, after a transaction becomes
+ * old enough.
+ */
+void
+zsbt_tid_mark_dead(Relation rel, zstid tid, ZSUndoRecPtr recent_oldest_undo)
+{
+ Buffer buf;
+ Page page;
+ ZSUndoRecPtr item_undoptr;
+ OffsetNumber off;
+ ZSTidArrayItem *origitem;
+ List *newitems;
+ bool isdead;
+
+ /* Find the item to mark dead. (It could be part of a compressed item.) */
+ off = zsbt_tid_fetch(rel, tid, &buf, &item_undoptr, &isdead);
+ if (!OffsetNumberIsValid(off))
+ {
+ /*
+ * This can happen, at least in this scenario:
+ * 1. a backend reserves a range of TIDs, by inserting them to the
+ * TID tree.
+ * 2. it "cancels" the reservation in the middle of the transaction,
+ * by removing the TIDs from the tree again.
+ * 3. It then aborts. The UNDO record for the insertion is still in
+ * place, but the backend removed the TIDs already.
+ */
+ elog(DEBUG1, "could not find tuple to mark dead with TID (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ UnlockReleaseBuffer(buf);
+ return;
+ }
+
+ /* Mark the TID as DEAD. (Unless it's already dead) */
+ if (isdead)
+ {
+ UnlockReleaseBuffer(buf);
+ return;
+ }
+
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, tid, DeadUndoPtr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems, NULL);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+}
+
+
+/*
+ * Remove items for the given TIDs from the TID tree.
+ *
+ * This is used during VACUUM.
+ */
+void
+zsbt_tid_remove(Relation rel, IntegerSet *tids)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, true);
+ zstid nexttid;
+ MemoryContext oldcontext;
+ MemoryContext tmpcontext;
+
+ tmpcontext = AllocSetContextCreate(CurrentMemoryContext,
+ "ZedstoreAMVacuumContext",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(tmpcontext);
+
+ intset_begin_iterate(tids);
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+
+ while (nexttid < MaxPlusOneZSTid)
+ {
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ List *newitems;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ /*
+ * Find the leaf page containing the next item to remove
+ */
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, nexttid, 0, false);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /*
+ * Rewrite the items on the page, removing all TIDs that need to be
+ * removed from the page.
+ */
+ newitems = NIL;
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ while (nexttid < item->t_firsttid)
+ {
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+ }
+
+ if (nexttid < item->t_endtid)
+ {
+ List *newitemsx = zsbt_tid_item_remove_tids(item, &nexttid, tids,
+ recent_oldest_undo);
+
+ newitems = list_concat(newitems, newitemsx);
+ }
+ else
+ {
+ /* keep this item unmodified */
+ newitems = lappend(newitems, item);
+ }
+ }
+
+ while (nexttid < opaque->zs_hikey)
+ {
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+ }
+
+ /* Pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ if (newitems)
+ {
+ zsbt_tid_recompress_replace(rel, buf, newitems, NULL);
+ }
+ else
+ {
+ zs_split_stack *stack;
+
+ stack = zsbt_unlink_page(rel, ZS_META_ATTRIBUTE_NUM, buf, 0);
+
+ if (!stack)
+ {
+ /* Unlinking the page failed; leave an empty page in its place. */
+ Page newpage = PageGetTempPageCopySpecial(BufferGetPage(buf));
+
+ stack = zs_new_split_stack_entry(buf, newpage);
+ }
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack, NULL, 0);
+ }
+
+ ReleaseBuffer(buf);
+
+ MemoryContextReset(tmpcontext);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(tmpcontext);
+}
+
+/*
+ * Clear an item's UNDO pointer.
+ *
+ * This is used during VACUUM, to clear out aborted deletions.
+ */
+void
+zsbt_tid_undo_deletion(Relation rel, zstid tid, ZSUndoRecPtr undoptr,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ Buffer buf;
+ Page page;
+ ZSUndoRecPtr item_undoptr;
+ bool item_isdead;
+ OffsetNumber off;
+
+ /* Find the item whose UNDO pointer needs to be cleared. (It could be part of a compressed item.) */
+ off = zsbt_tid_fetch(rel, tid, &buf, &item_undoptr, &item_isdead);
+ if (!OffsetNumberIsValid(off))
+ {
+ elog(WARNING, "could not find aborted tuple to remove with TID (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ return;
+ }
+
+ if (ZSUndoRecPtrEquals(item_undoptr, undoptr))
+ {
+ ZSTidArrayItem *origitem;
+ List *newitems;
+
+ /* FIXME: we're overwriting the undo pointer with 'invalid', meaning the
+ * tuple becomes visible to everyone. That doesn't seem right. Shouldn't
+ * we restore the previous undo pointer, if the insertion was not yet
+ * visible to everyone?
+ */
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, tid, InvalidUndoPtr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems, NULL);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+ }
+ else
+ {
+ Assert(item_isdead ||
+ item_undoptr.counter > undoptr.counter ||
+ !IsZSUndoRecPtrValid(&item_undoptr));
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+/* ----------------------------------------------------------------
+ * Internal routines
+ * ----------------------------------------------------------------
+ */
+
+void
+zsbt_tid_clear_speculative_token(Relation rel, zstid tid, uint32 spectoken, bool forcomplete)
+{
+ Buffer buf;
+ ZSUndoRecPtr item_undoptr;
+ bool item_isdead;
+ OffsetNumber off;
+
+ off = zsbt_tid_fetch(rel, tid, &buf, &item_undoptr, &item_isdead);
+ if (!OffsetNumberIsValid(off) || item_isdead)
+ elog(ERROR, "couldn't find item for meta column for inserted tuple with TID (%u, %u) in rel %s",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid), rel->rd_rel->relname.data);
+
+ zsundo_clear_speculative_token(rel, item_undoptr);
+
+ UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Fetch the item with given TID. The page containing the item is kept locked, and
+ * returned to the caller in *buf_p. This is used to locate a tuple for updating
+ * or deleting it.
+ */
+static OffsetNumber
+zsbt_tid_fetch(Relation rel, zstid tid, Buffer *buf_p, ZSUndoRecPtr *undoptr_p, bool *isdead_p)
+{
+ Buffer buf;
+ Page page;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, tid, 0, false);
+ if (buf == InvalidBuffer)
+ {
+ *buf_p = InvalidBuffer;
+ *undoptr_p = InvalidUndoPtr;
+ return InvalidOffsetNumber;
+ }
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /* Find the item on the page that covers the target TID */
+ off = zsbt_binsrch_tidpage(tid, page);
+ if (off >= FirstOffsetNumber && off <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (tid < item->t_endtid)
+ {
+ ZSTidItemIterator iter;
+
+ memset(&iter, 0, sizeof(ZSTidItemIterator));
+ iter.context = CurrentMemoryContext;
+
+ zsbt_tid_item_unpack(item, &iter);
+
+ /* TODO: could do binary search here. Better yet, integrate the
+ * unpack function with the callers */
+ for (int i = 0; i < iter.num_tids; i++)
+ {
+ if (iter.tids[i] == tid)
+ {
+ int slotno = iter.tid_undoslotnos[i];
+ ZSUndoRecPtr undoptr = iter.undoslots[slotno];
+
+ *isdead_p = (slotno == ZSBT_DEAD_UNDO_SLOT);
+ *undoptr_p = undoptr;
+ *buf_p = buf;
+
+ if (iter.tids)
+ pfree(iter.tids);
+ if (iter.tid_undoslotnos)
+ pfree(iter.tid_undoslotnos);
+
+ return off;
+ }
+ }
+
+ if (iter.tids)
+ pfree(iter.tids);
+ if (iter.tid_undoslotnos)
+ pfree(iter.tid_undoslotnos);
+ }
+ }
+ return InvalidOffsetNumber;
+}
+
+/*
+ * This helper function is used to implement INSERT.
+ *
+ * The items in 'newitems' are added to the page, to the correct position.
+ * FIXME: Actually, they're always just added to the end of the page, and that
+ * better be the correct position.
+ *
+ * This function handles splitting the page if needed.
+ */
+static void
+zsbt_tid_add_items(Relation rel, Buffer buf, List *newitems, zs_pending_undo_op *undo_op)
+{
+ Page page = BufferGetPage(buf);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ OffsetNumber off;
+ Size newitemsize;
+ ListCell *lc;
+
+ newitemsize = 0;
+ foreach(lc, newitems)
+ {
+ ZSTidArrayItem *item = (ZSTidArrayItem *) lfirst(lc);
+
+ newitemsize += sizeof(ItemIdData) + item->t_size;
+ }
+
+ if (newitemsize <= PageGetExactFreeSpace(page))
+ {
+ /* The new items fit on the page. Add them. */
+ OffsetNumber startoff;
+ OffsetNumber off;
+
+ if (RelationNeedsWAL(rel))
+ zsbt_wal_log_tidleaf_items_begin(list_length(newitems), undo_op);
+
+ START_CRIT_SECTION();
+
+ startoff = maxoff + 1;
+ off = startoff;
+ foreach(lc, newitems)
+ {
+ ZSTidArrayItem *item = (ZSTidArrayItem *) lfirst(lc);
+
+ if (!PageAddItem(page, (Item) item, item->t_size, off, true, false))
+ elog(ERROR, "could not add item to TID tree page");
+ off++;
+ }
+
+ if (undo_op)
+ zsundo_finish_pending_op(undo_op, (char *) &undo_op->payload);
+
+ MarkBufferDirty(buf);
+
+ if (RelationNeedsWAL(rel))
+ zsbt_wal_log_tidleaf_items(rel, buf, startoff, false, newitems, undo_op);
+
+ END_CRIT_SECTION();
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (undo_op)
+ {
+ UnlockReleaseBuffer(undo_op->reservation.undobuf);
+ pfree(undo_op);
+ }
+ }
+ else
+ {
+ List *items = NIL;
+
+ /* Collect all the old items on the page to a list */
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ /*
+ * Get the next item to process from the page.
+ */
+ items = lappend(items, item);
+ }
+
+ /* Add any new items to the end */
+ foreach (lc, newitems)
+ {
+ items = lappend(items, lfirst(lc));
+ }
+
+ /* Now pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ if (items)
+ {
+ zsbt_tid_recompress_replace(rel, buf, items, undo_op);
+ }
+ else
+ {
+ zs_split_stack *stack;
+
+ stack = zsbt_unlink_page(rel, ZS_META_ATTRIBUTE_NUM, buf, 0);
+
+ if (!stack)
+ {
+ /* Unlinking the page failed; leave an empty page in its place. */
+ Page newpage = PageGetTempPageCopySpecial(BufferGetPage(buf));
+
+ stack = zs_new_split_stack_entry(buf, newpage);
+ }
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack, undo_op, 0);
+ }
+
+ list_free(items);
+ }
+}
+
+
+/*
+ * This helper function is used to implement INSERT, UPDATE and DELETE.
+ *
+ * If 'newitems' is not empty, the items in the list are added to the page,
+ * to the correct position. FIXME: Actually, they're always just added to
+ * the end of the page, and that better be the correct position.
+ *
+ * This function handles decompressing and recompressing items, and splitting
+ * the page if needed.
+ */
+static void
+zsbt_tid_replace_item(Relation rel, Buffer buf, OffsetNumber targetoff, List *newitems,
+ zs_pending_undo_op *undo_op)
+{
+ Page page = BufferGetPage(buf);
+ ItemId iid;
+ ZSTidArrayItem *olditem;
+ ListCell *lc;
+ ssize_t sizediff;
+
+ /*
+ * Find the item that covers the given tid.
+ */
+ if (targetoff < FirstOffsetNumber || targetoff > PageGetMaxOffsetNumber(page))
+ elog(ERROR, "could not find item at off %d to replace", targetoff);
+ iid = PageGetItemId(page, targetoff);
+ olditem = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ /* Calculate how much free space we'll need */
+ sizediff = -(olditem->t_size + sizeof(ItemIdData));
+ foreach(lc, newitems)
+ {
+ ZSTidArrayItem *newitem = (ZSTidArrayItem *) lfirst(lc);
+
+ sizediff += newitem->t_size + sizeof(ItemIdData);
+ }
+
+ /* Can we fit them? */
+ if (sizediff <= (ssize_t) PageGetExactFreeSpace(page))
+ {
+ ZSTidArrayItem *newitem;
+ OffsetNumber off;
+
+ if (RelationNeedsWAL(rel))
+ zsbt_wal_log_tidleaf_items_begin(list_length(newitems), undo_op);
+
+ START_CRIT_SECTION();
+
+ /* Remove existing item, and add new ones */
+ if (newitems == NIL)
+ PageIndexTupleDelete(page, targetoff);
+ else
+ {
+ lc = list_head(newitems);
+ newitem = (ZSTidArrayItem *) lfirst(lc);
+ if (!PageIndexTupleOverwrite(page, targetoff, (Item) newitem, newitem->t_size))
+ elog(ERROR, "could not replace item in TID tree page at off %d", targetoff);
+ lc = lnext(newitems, lc);
+
+ off = targetoff + 1;
+ for (; lc != NULL; lc = lnext(newitems, lc))
+ {
+ newitem = (ZSTidArrayItem *) lfirst(lc);
+ if (!PageAddItem(page, (Item) newitem, newitem->t_size, off, false, false))
+ elog(ERROR, "could not add item in TID tree page at off %d", off);
+ off++;
+ }
+ }
+ MarkBufferDirty(buf);
+
+ if (undo_op)
+ zsundo_finish_pending_op(undo_op, (char *) &undo_op->payload);
+
+ if (RelationNeedsWAL(rel))
+ zsbt_wal_log_tidleaf_items(rel, buf, targetoff, true, newitems, undo_op);
+
+ END_CRIT_SECTION();
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ zstid lasttid = 0;
+ OffsetNumber off;
+
+ for (off = FirstOffsetNumber; off <= PageGetMaxOffsetNumber(page); off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ Assert(item->t_firsttid >= lasttid);
+ lasttid = item->t_endtid;
+ }
+ }
+#endif
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (undo_op)
+ {
+ UnlockReleaseBuffer(undo_op->reservation.undobuf);
+ pfree(undo_op);
+ }
+ }
+ else
+ {
+ /* Have to split the page. */
+ List *items = NIL;
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ OffsetNumber off;
+
+ /*
+ * Construct a List that contains all the items in the right order, and
+ * let zsbt_tid_recompress_page() do the heavy lifting to fit them on
+ * pages.
+ */
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (off == targetoff)
+ {
+ foreach(lc, newitems)
+ {
+ items = lappend(items, (ZSTidArrayItem *) lfirst(lc));
+ }
+ }
+ else
+ items = lappend(items, item);
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ zstid endtid = 0;
+ ListCell *lc;
+
+ foreach (lc, items)
+ {
+ ZSTidArrayItem *i = (ZSTidArrayItem *) lfirst(lc);
+
+ Assert(i->t_firsttid >= endtid);
+ Assert(i->t_endtid > i->t_firsttid);
+ endtid = i->t_endtid;
+ }
+ }
+#endif
+
+ /* Pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ if (items)
+ {
+ zsbt_tid_recompress_replace(rel, buf, items, undo_op);
+ }
+ else
+ {
+ zs_split_stack *stack;
+
+ stack = zsbt_unlink_page(rel, ZS_META_ATTRIBUTE_NUM, buf, 0);
+
+ if (!stack)
+ {
+ /* Unlinking the page failed; leave an empty page in its place. */
+ Page newpage = PageGetTempPageCopySpecial(BufferGetPage(buf));
+
+ stack = zs_new_split_stack_entry(buf, newpage);
+ }
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack, undo_op, 0);
+ }
+
+ list_free(items);
+ }
+}
+
+/*
+ * Recompressor routines
+ */
+typedef struct
+{
+ Page currpage;
+
+ /* first page writes over the old buffer, subsequent pages get newly-allocated buffers */
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ int num_pages;
+ int free_space_per_page;
+
+ zstid hikey;
+} zsbt_tid_recompress_context;
+
+static void
+zsbt_tid_recompress_newpage(zsbt_tid_recompress_context *cxt, zstid nexttid, int flags)
+{
+ Page newpage;
+ ZSBtreePageOpaque *newopaque;
+ zs_split_stack *stack;
+
+ if (cxt->currpage)
+ {
+ /* set the last tid on previous page */
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(cxt->currpage);
+
+ oldopaque->zs_hikey = nexttid;
+ }
+
+ newpage = (Page) palloc(BLCKSZ);
+ PageInit(newpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+
+ stack = zs_new_split_stack_entry(InvalidBuffer, /* will be assigned later */
+ newpage);
+ if (cxt->stack_tail)
+ cxt->stack_tail->next = stack;
+ else
+ cxt->stack_head = stack;
+ cxt->stack_tail = stack;
+
+ cxt->currpage = newpage;
+
+ newopaque = ZSBtreePageGetOpaque(newpage);
+ newopaque->zs_attno = ZS_META_ATTRIBUTE_NUM;
+ newopaque->zs_next = InvalidBlockNumber; /* filled in later */
+ newopaque->zs_lokey = nexttid;
+ newopaque->zs_hikey = cxt->hikey; /* overwritten later, if this is not last page */
+ newopaque->zs_level = 0;
+ newopaque->zs_flags = flags;
+ newopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+}
+
+static void
+zsbt_tid_recompress_add_to_page(zsbt_tid_recompress_context *cxt, ZSTidArrayItem *item)
+{
+ OffsetNumber maxoff;
+ Size freespc;
+
+ freespc = PageGetExactFreeSpace(cxt->currpage);
+ if (freespc < item->t_size + sizeof(ItemIdData) ||
+ freespc < cxt->free_space_per_page)
+ {
+ zsbt_tid_recompress_newpage(cxt, item->t_firsttid, 0);
+ }
+
+ maxoff = PageGetMaxOffsetNumber(cxt->currpage);
+ if (!PageAddItem(cxt->currpage, (Item) item, item->t_size, maxoff + 1, true, false))
+ elog(ERROR, "could not add item to TID tree page");
+}
+
+/*
+ * Subroutine of zsbt_tid_recompress_replace. Compute how much space the
+ * items will take, and compute how many pages will be needed for them, and
+ * decide how to distribute any free space that's left over among the
+ * pages.
+ *
+ * Like in B-tree indexes, we aim for 50/50 splits, except for the
+ * rightmost page where we aim for 90/10, so that most of the free space is
+ * left to the end of the index, where it's useful for new inserts. The
+ * 90/10 splits ensure that we don't waste too much space on a table
+ * that's loaded at the end, and never updated.
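+ *
+ * An illustrative example (the numbers are approximate, ignoring exact
+ * header sizes): with ~8100 usable bytes per empty page and 20000 bytes
+ * worth of items, num_pages = 3 and the total leftover free space is about
+ * 3 * 8100 - 20000 = 4300 bytes. If this is the rightmost page of the tree,
+ * each of the first two pages then reserves roughly 0.1 * 4300 / 2 = 215
+ * bytes of slack, leaving the remaining ~3870 bytes on the last page.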
+ */
+static void
+zsbt_tid_recompress_picksplit(zsbt_tid_recompress_context *cxt, List *items)
+{
+ size_t total_sz;
+ int num_pages;
+ int space_on_empty_page;
+ Size free_space_per_page;
+ ListCell *lc;
+
+ space_on_empty_page = BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - MAXALIGN(sizeof(ZSBtreePageOpaque));
+
+ /* Compute total space needed for all the items. */
+ total_sz = 0;
+ foreach(lc, items)
+ {
+ ZSTidArrayItem *item = lfirst(lc);
+
+ total_sz += sizeof(ItemIdData) + item->t_size;
+ }
+
+ /* How many pages will we need for them? */
+ num_pages = (total_sz + space_on_empty_page - 1) / space_on_empty_page;
+
+ /* If everything fits on one page, don't split */
+ if (num_pages == 1)
+ {
+ free_space_per_page = 0;
+ }
+ /* If this is the rightmost page, do a 90/10 split */
+ else if (cxt->hikey == MaxPlusOneZSTid)
+ {
+ /*
+ * What does 90/10 mean if we have to use more than two pages? It means
+ * that 10% of the items go to the last page, and 90% are distributed to
+ * all the others.
+ */
+ double total_free_space;
+
+ total_free_space = space_on_empty_page * num_pages - total_sz;
+
+ free_space_per_page = total_free_space * 0.1 / (num_pages - 1);
+ }
+ /* Otherwise, aim for an even 50/50 split */
+ else
+ {
+ free_space_per_page = (space_on_empty_page * num_pages - total_sz) / num_pages;
+ }
+
+ cxt->num_pages = num_pages;
+ cxt->free_space_per_page = free_space_per_page;
+}
+
+/*
+ * Rewrite a leaf page, with given 'items' as the new content.
+ *
+ * If there are any uncompressed items in the list, we try to compress them.
+ * Any already-compressed items are added as is.
+ *
+ * If the items no longer fit on the page, then the page is split. It is
+ * entirely possible that they don't fit even on two pages; we split the page
+ * into as many pages as needed. Hopefully not more than a few pages, though,
+ * because otherwise you might hit limits on the number of buffer pins (with
+ * tiny shared_buffers).
+ *
+ * On entry, 'oldbuf' must be pinned and exclusive-locked. On exit, the lock
+ * is released, but it's still pinned.
+ *
+ * TODO: Try to combine single items, and existing array-items, into new array
+ * items.
+ */
+static void
+zsbt_tid_recompress_replace(Relation rel, Buffer oldbuf, List *items, zs_pending_undo_op *undo_op)
+{
+ ListCell *lc;
+ zsbt_tid_recompress_context cxt;
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(BufferGetPage(oldbuf));
+ BlockNumber orignextblk;
+ zs_split_stack *stack;
+ List *downlinks = NIL;
+
+ orignextblk = oldopaque->zs_next;
+
+ cxt.currpage = NULL;
+ cxt.stack_head = cxt.stack_tail = NULL;
+ cxt.hikey = oldopaque->zs_hikey;
+
+ zsbt_tid_recompress_picksplit(&cxt, items);
+ zsbt_tid_recompress_newpage(&cxt, oldopaque->zs_lokey, (oldopaque->zs_flags & ZSBT_ROOT));
+
+ foreach(lc, items)
+ {
+ ZSTidArrayItem *item = (ZSTidArrayItem *) lfirst(lc);
+
+ zsbt_tid_recompress_add_to_page(&cxt, item);
+ }
+
+ /*
+ * Ok, we now have a list of pages to replace the original page with, as
+ * private in-memory copies. Allocate buffers for them, and write them out.
+ *
+ * Allocate all the pages before entering the critical section, so that
+ * running out of disk space doesn't lead to a PANIC.
+ */
+ stack = cxt.stack_head;
+ Assert(stack->buf == InvalidBuffer);
+ stack->buf = oldbuf;
+ while (stack->next)
+ {
+ Page thispage = stack->page;
+ ZSBtreePageOpaque *thisopaque = ZSBtreePageGetOpaque(thispage);
+ ZSBtreeInternalPageItem *downlink;
+ Buffer nextbuf;
+
+ Assert(stack->next->buf == InvalidBuffer);
+
+ nextbuf = zspage_getnewbuf(rel, 0);
+ stack->next->buf = nextbuf;
+
+ thisopaque->zs_next = BufferGetBlockNumber(nextbuf);
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = thisopaque->zs_hikey;
+ downlink->childblk = BufferGetBlockNumber(nextbuf);
+ downlinks = lappend(downlinks, downlink);
+
+ stack = stack->next;
+ }
+ /* last one in the chain */
+ ZSBtreePageGetOpaque(stack->page)->zs_next = orignextblk;
+
+ /*
+ * zsbt_tid_recompress_picksplit() calculated that we'd need
+ * 'cxt.num_pages' pages. Check that it matches with how many pages we
+ * actually created.
+ * TODO: sometimes we end up creating more pages than calculated; fix the
+ * calculation of cxt.num_pages.
+ */
+ Assert(list_length(downlinks) + 1 >= cxt.num_pages);
+
+ /* If we had to split, insert downlinks for the new pages. */
+ if (cxt.stack_head->next)
+ {
+ oldopaque = ZSBtreePageGetOpaque(cxt.stack_head->page);
+
+ if ((oldopaque->zs_flags & ZSBT_ROOT) != 0)
+ {
+ ZSBtreeInternalPageItem *downlink;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = MinZSTid;
+ downlink->childblk = BufferGetBlockNumber(cxt.stack_head->buf);
+ downlinks = lcons(downlink, downlinks);
+
+ cxt.stack_tail->next = zsbt_newroot(rel, ZS_META_ATTRIBUTE_NUM,
+ oldopaque->zs_level + 1, downlinks);
+
+ /* clear the ZSBT_ROOT flag on the old root page */
+ oldopaque->zs_flags &= ~ZSBT_ROOT;
+ }
+ else
+ {
+ cxt.stack_tail->next = zsbt_insert_downlinks(rel, ZS_META_ATTRIBUTE_NUM,
+ oldopaque->zs_lokey, BufferGetBlockNumber(oldbuf), oldopaque->zs_level + 1,
+ downlinks);
+ }
+ /* note: stack_tail is not the real tail anymore */
+ }
+
+ /* Finally, overwrite all the pages we had to modify */
+ zs_apply_split_changes(rel, cxt.stack_head, undo_op, 0);
+}
+
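+/*
+ * Binary search within a TID-tree leaf page.
+ *
+ * Returns the offset number of the last item whose first TID is <= 'key',
+ * or InvalidOffsetNumber if 'key' is smaller than the first TID on the page.
+ */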
+static OffsetNumber
+zsbt_binsrch_tidpage(zstid key, Page page)
+{
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ OffsetNumber low,
+ high,
+ mid;
+
+ low = FirstOffsetNumber;
+ high = maxoff + 1;
+ while (high > low)
+ {
+ ItemId iid;
+ ZSTidArrayItem *item;
+
+ mid = low + (high - low) / 2;
+
+ iid = PageGetItemId(page, mid);
+ item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (key >= item->t_firsttid)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+ return low - 1;
+}
+
+/*
+ * Start a WAL operation to log changes to tid tree leaf items.
+ * This allocates enough space to accommodate records for the tid leaf items and
+ * any associated undo_op.
+ */
+static void
+zsbt_wal_log_tidleaf_items_begin(int nitems, zs_pending_undo_op *undo_op)
+{
+ int nrdatas;
+
+ XLogBeginInsert();
+ /*
+ * We allocate an rdata per tid leaf item. We may need two extra
+ * rdatas for UNDO. This must be called before we enter the critical
+ * section as XLogEnsureRecordSpace() performs memory allocation.
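+ * For example, replacing one old item with three new items plus an UNDO
+ * operation needs nrdatas = 3 + 1 + 2 = 6.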
+ */
+ nrdatas = nitems + 1;
+ if (undo_op)
+ nrdatas += 2;
+ XLogEnsureRecordSpace(0, nrdatas);
+}
+
+/*
+ * This must be called after zsbt_wal_log_tidleaf_items_begin(), and from
+ * within a critical section.
+ */
+static void
+zsbt_wal_log_tidleaf_items(Relation rel, Buffer buf,
+ OffsetNumber off, bool replace, List *items,
+ zs_pending_undo_op *undo_op)
+{
+ ListCell *lc;
+ XLogRecPtr recptr;
+ wal_zedstore_tidleaf_items xlrec;
+
+ Assert(CritSectionCount > 0);
+
+ xlrec.nitems = list_length(items);
+ xlrec.off = off;
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalTidLeafItems);
+
+ foreach(lc, items)
+ {
+ void *item = (void *) lfirst(lc);
+ size_t itemsz;
+
+ itemsz = ((ZSTidArrayItem *) item)->t_size;
+
+ XLogRegisterBufData(0, item, itemsz);
+ }
+
+ if (undo_op)
+ XLogRegisterUndoOp(1, undo_op);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID,
+ replace ? WAL_ZEDSTORE_TIDLEAF_REPLACE_ITEM : WAL_ZEDSTORE_TIDLEAF_ADD_ITEMS );
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ if (undo_op)
+ PageSetLSN(BufferGetPage(undo_op->reservation.undobuf), recptr);
+}
+
+void
+zsbt_tidleaf_items_redo(XLogReaderState *record, bool replace)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_tidleaf_items *xlrec =
+ (wal_zedstore_tidleaf_items *) XLogRecGetData(record);
+ Buffer buffer;
+ Buffer undobuf;
+
+ if (XLogRecHasBlockRef(record, 1))
+ undobuf = XLogRedoUndoOp(record, 1);
+ else
+ undobuf = InvalidBuffer;
+
+ if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+ {
+ Page page = (Page) BufferGetPage(buffer);
+ OffsetNumber off = xlrec->off;
+
+ if (xlrec->nitems == 0)
+ {
+ Assert(replace);
+ PageIndexTupleDelete(page, off);
+ }
+ else
+ {
+ char itembuf[BLCKSZ + MAXIMUM_ALIGNOF];
+ char *itembufp;
+ Size datasz;
+ char *data;
+ char *p;
+ int i;
+
+ itembufp = (char *) MAXALIGN(itembuf);
+
+ data = XLogRecGetBlockData(record, 0, &datasz);
+ p = data;
+ for (i = 0; i < xlrec->nitems; i++)
+ {
+ uint16 itemsz;
+
+ /*
+ * XXX: we assume that both ZSTidArrayItem and ZSAttributeArrayItem have
+ * t_size as the first field.
+ */
+ memcpy(&itemsz, p, sizeof(uint16));
+ Assert(itemsz > 0);
+ Assert(itemsz < BLCKSZ);
+ memcpy(itembufp, p, itemsz);
+ p += itemsz;
+
+ if (replace && i == 0)
+ {
+ if (!PageIndexTupleOverwrite(page, off, (Item) itembufp, itemsz))
+ elog(ERROR, "could not replace item on zedstore btree page at off %d", off);
+ }
+ else if (PageAddItem(page, (Item) itembufp, itemsz, off, false, false)
+ == InvalidOffsetNumber)
+ {
+ elog(ERROR, "could not add item to zedstore btree page");
+ }
+ off++;
+ }
+ Assert(p - data == datasz);
+ }
+
+ /* The page was modified in either case; stamp it and mark it dirty. */
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+ if (BufferIsValid(undobuf))
+ UnlockReleaseBuffer(undobuf);
+}
diff --git src/backend/access/zedstore/zedstore_toast.c src/backend/access/zedstore/zedstore_toast.c
new file mode 100644
index 0000000000..70eb1277b9
--- /dev/null
+++ src/backend/access/zedstore/zedstore_toast.c
@@ -0,0 +1,324 @@
+/*
+ * zedstore_toast.c
+ * Routines for Toasting oversized tuples in Zedstore
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_toast.c
+ */
+#include "postgres.h"
+
+#include "access/toast_internals.h"
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/rel.h"
+
+static void zstoast_wal_log_newpage(Buffer prevbuf, Buffer buf, zstid tid, AttrNumber attno,
+ int offset, int32 total_size);
+
+/*
+ * Toast a datum, inside the ZedStore file.
+ *
+ * This is similar to regular toasting, but instead of using a separate index and
+ * heap, the datum is stored within the same ZedStore file as the b-trees and
+ * other data. A chain of "toast pages" is allocated for the datum, and each
+ * page is filled with as much of the datum as possible.
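+ *
+ * For example, a 25 kB datum that doesn't compress well ends up on a chain
+ * of four toast pages (each holding roughly 8 kB of payload). The pages are
+ * linked through zs_prev/zs_next, and the column itself stores only a small
+ * varatt_zs_toastptr pointing at the first block of the chain.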
+ */
+Datum
+zedstore_toast_datum(Relation rel, AttrNumber attno, Datum value, zstid tid)
+{
+ varatt_zs_toastptr *toastptr;
+ BlockNumber firstblk = InvalidBlockNumber;
+ Buffer buf = InvalidBuffer;
+ Page page;
+ ZSToastPageOpaque *opaque;
+ Buffer prevbuf = InvalidBuffer;
+ ZSToastPageOpaque *prevopaque = NULL;
+ char *ptr;
+ int32 total_size;
+ int32 decompressed_size = 0;
+ int32 offset;
+ bool is_compressed;
+ bool is_first;
+ Datum toasted_datum;
+
+ Assert(tid != InvalidZSTid);
+
+ /*
+ * TID btree will always be inserted first, so there must be > 0 blocks
+ */
+ Assert(RelationGetNumberOfBlocks(rel) != 0);
+
+ if (VARATT_IS_COMPRESSED(value))
+ toasted_datum = value;
+ else
+ toasted_datum = toast_compress_datum(value);
+ if (DatumGetPointer(toasted_datum) != NULL)
+ {
+ /*
+ * If the compressed datum can be stored inline, return the datum
+ * directly.
+ */
+ if (VARSIZE_ANY(toasted_datum) <= MaxZedStoreDatumSize)
+ {
+ return toasted_datum;
+ }
+
+ is_compressed = true;
+ decompressed_size = TOAST_COMPRESS_RAWSIZE(toasted_datum);
+ ptr = TOAST_COMPRESS_RAWDATA(toasted_datum);
+ total_size = VARSIZE_ANY(toasted_datum) - TOAST_COMPRESS_HDRSZ;
+ }
+ else
+ {
+ /*
+ * Compression did not reduce the size enough to matter, so toast the
+ * original, uncompressed data.
+ */
+ is_compressed = false;
+ ptr = VARDATA_ANY(value);
+ total_size = VARSIZE_ANY_EXHDR(value);
+ }
+
+
+ offset = 0;
+ is_first = true;
+ while (total_size - offset > 0)
+ {
+ Size thisbytes;
+
+ buf = zspage_getnewbuf(rel, ZS_INVALID_ATTRIBUTE_NUM);
+ if (prevbuf == InvalidBuffer)
+ firstblk = BufferGetBlockNumber(buf);
+
+ START_CRIT_SECTION();
+
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, sizeof(ZSToastPageOpaque));
+
+ thisbytes = Min(total_size - offset, PageGetExactFreeSpace(page));
+
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_tid = tid;
+ opaque->zs_attno = attno;
+ opaque->zs_total_size = total_size;
+ opaque->zs_decompressed_size = decompressed_size;
+ opaque->zs_is_compressed = is_compressed;
+ opaque->zs_slice_offset = offset;
+ opaque->zs_prev = is_first ? InvalidBlockNumber : BufferGetBlockNumber(prevbuf);
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_TOAST_PAGE_ID;
+
+ memcpy((char *) page + SizeOfPageHeaderData, ptr, thisbytes);
+ ((PageHeader) page)->pd_lower += thisbytes;
+
+ if (!is_first)
+ {
+ prevopaque->zs_next = BufferGetBlockNumber(buf);
+ MarkBufferDirty(prevbuf);
+ }
+
+ MarkBufferDirty(buf);
+
+ if (RelationNeedsWAL(rel))
+ zstoast_wal_log_newpage(prevbuf, buf, tid, attno, offset, total_size);
+
+ END_CRIT_SECTION();
+
+ if (prevbuf != InvalidBuffer)
+ UnlockReleaseBuffer(prevbuf);
+ ptr += thisbytes;
+ offset += thisbytes;
+ prevbuf = buf;
+ prevopaque = opaque;
+ is_first = false;
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ toastptr = palloc0(sizeof(varatt_zs_toastptr));
+ SET_VARTAG_1B_E(toastptr, VARTAG_ZEDSTORE);
+ toastptr->zst_block = firstblk;
+
+ return PointerGetDatum(toastptr);
+}
+
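+/*
+ * Reassemble a datum that was toasted with zedstore_toast_datum().
+ *
+ * Follows the toast-page chain starting at the block recorded in the toast
+ * pointer, and returns the datum as a plain in-memory varlena (still
+ * compressed, if it was stored compressed).
+ */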
+Datum
+zedstore_toast_flatten(Relation rel, AttrNumber attno, zstid tid, Datum toasted)
+{
+ varatt_zs_toastptr *toastptr = (varatt_zs_toastptr *) DatumGetPointer(toasted);
+ BlockNumber nextblk;
+ BlockNumber prevblk;
+ char *result = NULL;
+ char *ptr = NULL;
+ int32 total_size = 0;
+
+ Assert(toastptr->va_tag == VARTAG_ZEDSTORE);
+
+ prevblk = InvalidBlockNumber;
+ nextblk = toastptr->zst_block;
+
+ while (nextblk != InvalidBlockNumber)
+ {
+ Buffer buf;
+ Page page;
+ ZSToastPageOpaque *opaque;
+ uint32 size;
+
+ buf = ReadBuffer(rel, nextblk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+
+ Assert(opaque->zs_attno == attno);
+ Assert(opaque->zs_prev == prevblk);
+
+ if (prevblk == InvalidBlockNumber)
+ {
+ Assert(opaque->zs_tid == tid);
+
+ total_size = opaque->zs_total_size;
+
+ if (opaque->zs_is_compressed)
+ {
+ result = palloc(total_size + TOAST_COMPRESS_HDRSZ);
+
+ TOAST_COMPRESS_SET_RAWSIZE(result, opaque->zs_decompressed_size);
+ SET_VARSIZE_COMPRESSED(result, total_size + TOAST_COMPRESS_HDRSZ);
+ ptr = result + TOAST_COMPRESS_HDRSZ;
+ }
+ else
+ {
+ result = palloc(total_size + VARHDRSZ);
+ SET_VARSIZE(result, total_size + VARHDRSZ);
+ ptr = result + VARHDRSZ;
+ }
+ }
+
+ size = ((PageHeader) page)->pd_lower - SizeOfPageHeaderData;
+ memcpy(ptr, (char *) page + SizeOfPageHeaderData, size);
+ ptr += size;
+
+ prevblk = nextblk;
+ nextblk = opaque->zs_next;
+ UnlockReleaseBuffer(buf);
+ }
+ Assert(total_size > 0);
+ Assert(ptr == result + VARSIZE_ANY(result));
+
+ return PointerGetDatum(result);
+}
+
+void
+zedstore_toast_delete(Relation rel, Form_pg_attribute attr, zstid tid, BlockNumber blkno)
+{
+ BlockNumber nextblk;
+
+ nextblk = blkno;
+
+ while (nextblk != InvalidBlockNumber)
+ {
+ Buffer buf;
+ Page page;
+ ZSToastPageOpaque *opaque;
+
+ buf = ReadBuffer(rel, nextblk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+
+ if (opaque->zs_tid != tid)
+ {
+ UnlockReleaseBuffer(buf);
+ break;
+ }
+
+ Assert(opaque->zs_attno == attr->attnum);
+
+ nextblk = opaque->zs_next;
+ zspage_delete_page(rel, buf, InvalidBuffer, ZS_INVALID_ATTRIBUTE_NUM);
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+static void
+zstoast_wal_log_newpage(Buffer prevbuf, Buffer buf, zstid tid, AttrNumber attno,
+ int offset, int32 total_size)
+{
+ wal_zedstore_toast_newpage xlrec;
+ XLogRecPtr recptr;
+
+ Assert(offset <= total_size);
+
+ xlrec.tid = tid;
+ xlrec.attno = attno;
+ xlrec.offset = offset;
+ xlrec.total_size = total_size;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalToastNewPage);
+
+ /*
+ * It is easier to just force a full-page image, than WAL-log data. That
+ * means that the information in the wal_zedstore_toast_newpage struct isn't
+ * really necessary, but keep it for now, for the benefit of debugging with
+ * pg_waldump.
+ */
+ XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+ if (BufferIsValid(prevbuf))
+ XLogRegisterBuffer(1, prevbuf, REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_TOAST_NEWPAGE);
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ if (BufferIsValid(prevbuf))
+ PageSetLSN(BufferGetPage(prevbuf), recptr);
+}
+
+void
+zstoast_newpage_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+#if UNUSED
+ wal_zedstore_toast_newpage *xlrec = (wal_zedstore_toast_newpage *) XLogRecGetData(record);
+#endif
+ BlockNumber blkno;
+ Buffer buf;
+ Buffer prevbuf = InvalidBuffer;
+
+ XLogRecGetBlockTag(record, 0, NULL, NULL, &blkno);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+ elog(ERROR, "zedstore toast newpage WAL record did not contain a full-page image");
+
+ if (XLogRecHasBlockRef(record, 1))
+ {
+ if (XLogReadBufferForRedo(record, 1, &prevbuf) == BLK_NEEDS_REDO)
+ {
+ Page prevpage = BufferGetPage(prevbuf);
+ ZSToastPageOpaque *prevopaque;
+
+ prevopaque = (ZSToastPageOpaque *) PageGetSpecialPointer(prevpage);
+ prevopaque->zs_next = BufferGetBlockNumber(buf);
+
+ PageSetLSN(prevpage, lsn);
+ MarkBufferDirty(prevbuf);
+ }
+ }
+ else
+ prevbuf = InvalidBuffer;
+
+ if (BufferIsValid(prevbuf))
+ UnlockReleaseBuffer(prevbuf);
+ UnlockReleaseBuffer(buf);
+}
diff --git src/backend/access/zedstore/zedstore_tuplebuffer.c src/backend/access/zedstore/zedstore_tuplebuffer.c
new file mode 100644
index 0000000000..5e8c5635fa
--- /dev/null
+++ src/backend/access/zedstore/zedstore_tuplebuffer.c
@@ -0,0 +1,583 @@
+/*
+ * zedstore_tuplebuffer.c
+ * Buffering insertions into a zedstore table
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tuplebuffer.c
+ */
+#include "postgres.h"
+
+#include "access/detoast.h"
+#include "access/table.h"
+#include "access/zedstoream.h"
+#include "access/zedstore_internal.h"
+#include "common/hashfn.h"
+#include "miscadmin.h"
+#include "utils/datum.h"
+
+/*
+ * Single inserts:
+ * If we see more than SINGLE_INSERT_TID_RESERVATION_THRESHOLD insertions with the
+ * same XID and CID, with no "flush" calls in between, we start reserving
+ * TIDs in batches of size SINGLE_INSERT_TID_RESERVATION_SIZE. The downside of
+ * reserving TIDs in batches is that if we are left with any unused TIDs at end
+ * of transaction (or when a "flush" call comes), we need to go and kill the
+ * unused TIDs. So only do batching when it seems like we're inserting a lot of rows.
+ *
+ * Multi inserts:
+ * Whenever we see a multi-insert, we allocate MULTI_INSERT_TID_RESERVATION_FACTOR
+ * times more tids than the number requested. This is to ensure that we don't
+ * end up with inefficient page splits from out-of-tid-order inserts into full-ish
+ * btree pages. Such inserts are typically observed under highly concurrent workloads.
+ * See https://www.postgresql.org/message-id/CADwEdopF2S6uRXJRg%3DVZRfPZis80OnawAOCTSh_SrN2i1KGkMw%40mail.gmail.com
+ * for more details.
+ *
+ * TODO: expose these constants as GUCs as they are very workload sensitive.
+ */
+
+#define SINGLE_INSERT_TID_RESERVATION_THRESHOLD 5
+#define SINGLE_INSERT_TID_RESERVATION_SIZE 100
+#define MULTI_INSERT_TID_RESERVATION_FACTOR 10
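+
+/*
+ * An illustrative example of the constants above, assuming their default
+ * values: the first five single-row inserts for a given (xid, cid) each
+ * allocate one TID; the sixth reserves a batch of 100 TIDs and consumes the
+ * first one. A multi-insert of 50 rows with no TIDs reserved requests
+ * 10 * 50 = 500 TIDs, consumes the first 50, and keeps the remaining 450
+ * reserved for later inserts with the same (xid, cid).
+ */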
+
+#define ATTBUFFER_SIZE (1024 * 1024)
+
+typedef struct
+{
+ zstid buffered_tids[60];
+ Datum buffered_datums[60];
+ bool buffered_isnulls[60];
+ int num_buffered_rows;
+
+ attstream_buffer chunks;
+
+} attbuffer;
+
+typedef struct
+{
+ Oid relid; /* table's OID (hash key) */
+ char status; /* hash entry status */
+
+ int natts; /* # of attributes; can change if the table is ALTERed */
+ attbuffer *attbuffers;
+
+ uint64 num_repeated_single_inserts; /* # of repeated single inserts for the same (xid, cid) */
+
+ TransactionId reserved_tids_xid;
+ CommandId reserved_tids_cid;
+ zstid reserved_tids_start; /* inclusive */
+ zstid reserved_tids_end; /* inclusive */
+
+} tuplebuffer;
+
+
+/* define a hash table mapping relation OIDs to per-relation tuple buffers */
+#define SH_PREFIX tuplebuffers
+#define SH_ELEMENT_TYPE tuplebuffer
+#define SH_KEY_TYPE Oid
+#define SH_KEY relid
+#define SH_HASH_KEY(tb, key) murmurhash32(key)
+#define SH_EQUAL(tb, a, b) ((a) == (b))
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+
+/* prototypes for internal functions */
+static void zsbt_attbuffer_spool(Relation rel, AttrNumber attno, attbuffer *attbuffer, int ntuples, zstid *tids, Datum *datums, bool *isnulls);
+static void zsbt_attbuffer_init(Form_pg_attribute attr, attbuffer *attbuffer);
+static void zsbt_attbuffer_flush(Relation rel, AttrNumber attno, attbuffer *attbuffer, bool all);
+static void tuplebuffer_kill_unused_reserved_tids(Relation rel, tuplebuffer *tupbuffer);
+
+static MemoryContext tuplebuffers_cxt = NULL;
+static struct tuplebuffers_hash *tuplebuffers = NULL;
+
+static tuplebuffer *
+get_tuplebuffer(Relation rel)
+{
+ bool found;
+ tuplebuffer *tupbuffer;
+
+ if (tuplebuffers_cxt == NULL)
+ {
+ tuplebuffers_cxt = AllocSetContextCreate(TopTransactionContext,
+ "ZedstoreAMTupleBuffers",
+ ALLOCSET_DEFAULT_SIZES);
+ tuplebuffers = tuplebuffers_create(tuplebuffers_cxt, 10, NULL);
+ }
+retry:
+ tupbuffer = tuplebuffers_insert(tuplebuffers, RelationGetRelid(rel), &found);
+ if (!found)
+ {
+ MemoryContext oldcxt;
+ AttrNumber attno;
+ int natts;
+
+ oldcxt = MemoryContextSwitchTo(tuplebuffers_cxt);
+ natts = rel->rd_att->natts;
+ tupbuffer->attbuffers = palloc(natts * sizeof(attbuffer));
+ tupbuffer->natts = natts;
+
+ for (attno = 1; attno <= natts; attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr(rel->rd_att, attno - 1);
+ attbuffer *attbuffer = &tupbuffer->attbuffers[attno - 1];
+
+ zsbt_attbuffer_init(attr, attbuffer);
+ }
+
+ tupbuffer->reserved_tids_xid = InvalidTransactionId;
+ tupbuffer->reserved_tids_cid = InvalidCommandId;
+ tupbuffer->reserved_tids_start = InvalidZSTid;
+ tupbuffer->reserved_tids_end = InvalidZSTid;
+ tupbuffer->num_repeated_single_inserts = 0;
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else if (rel->rd_att->natts > tupbuffer->natts)
+ {
+ zsbt_tuplebuffer_flush(rel);
+ goto retry;
+ }
+
+ return tupbuffer;
+}
+
+/*
+ * Allocate TIDs for insert.
+ *
+ * First check if the reserved tids can cater to the number of tids requested for
+ * allocation (ntids). If yes, consume the tids from these reserved tids. Else,
+ * we have to request more tids by inserting into the tid tree.
+ *
+ * We reserve tids inside the tupbuffer for the same (xid, cid) combo. The way we
+ * reserve tids is slightly different for single-insert vs multi-insert.
+ *
+ * For single inserts, in the same (xid, cid) once we encounter number of inserts =
+ * SINGLE_INSERT_TID_RESERVATION_THRESHOLD, we request and reserve
+ * SINGLE_INSERT_TID_RESERVATION_SIZE number of tids.
+ *
+ * For multi-inserts, we request and reserve (ntids * MULTI_INSERT_TID_RESERVATION_FACTOR)
+ * number of tids.
+ */
+zstid
+zsbt_tuplebuffer_allocate_tids(Relation rel, TransactionId xid, CommandId cid, int ntids)
+{
+ tuplebuffer *tupbuffer;
+ zstid result;
+
+ tupbuffer = get_tuplebuffer(rel);
+
+ if (tupbuffer->reserved_tids_xid != xid ||
+ tupbuffer->reserved_tids_cid != cid)
+ {
+ /*
+ * This insertion is for a different XID or CID than before. (Or this
+ * is the first insertion.)
+ */
+ tuplebuffer_kill_unused_reserved_tids(rel, tupbuffer);
+ tupbuffer->num_repeated_single_inserts = 0;
+
+ tupbuffer->reserved_tids_xid = xid;
+ tupbuffer->reserved_tids_cid = cid;
+ tupbuffer->reserved_tids_start = InvalidZSTid;
+ tupbuffer->reserved_tids_end = InvalidZSTid;
+ }
+
+ if ((tupbuffer->reserved_tids_start != InvalidZSTid &&
+ tupbuffer->reserved_tids_end != InvalidZSTid) && ntids <=
+ (tupbuffer->reserved_tids_end - tupbuffer->reserved_tids_start + 1))
+ {
+ /* We have enough reserved tids */
+ result = tupbuffer->reserved_tids_start;
+ tupbuffer->reserved_tids_start += ntids;
+ }
+ else if (ntids == 1)
+ {
+ /* We don't have enough reserved tids for a single insert */
+ if (tupbuffer->num_repeated_single_inserts < SINGLE_INSERT_TID_RESERVATION_THRESHOLD)
+ {
+ /* We haven't seen many single inserts yet, so just allocate a single TID for this. */
+ result = zsbt_tid_multi_insert(rel, 1, xid, cid,
+ INVALID_SPECULATIVE_TOKEN, InvalidUndoPtr);
+ /* Since we don't reserve any tids, invalidate reservation fields */
+ tupbuffer->reserved_tids_start = InvalidZSTid;
+ tupbuffer->reserved_tids_end = InvalidZSTid;
+ }
+ else
+ {
+ /* We're in batch mode for single inserts. Reserve a new block of TIDs. */
+ result = zsbt_tid_multi_insert(rel, SINGLE_INSERT_TID_RESERVATION_SIZE, xid, cid,
+ INVALID_SPECULATIVE_TOKEN, InvalidUndoPtr);
+ tupbuffer->reserved_tids_start = result + 1;
+ tupbuffer->reserved_tids_end = result + SINGLE_INSERT_TID_RESERVATION_SIZE - 1;
+ }
+ tupbuffer->num_repeated_single_inserts++;
+ }
+ else
+ {
+ /* We don't have enough tids for a multi-insert. */
+
+ /*
+ * Kill the unused tids in the tuple buffer first since we will replace
+ * them with a list of fresh continuous tids.
+ */
+ tuplebuffer_kill_unused_reserved_tids(rel, tupbuffer);
+ result = zsbt_tid_multi_insert(rel, MULTI_INSERT_TID_RESERVATION_FACTOR * ntids, xid, cid,
+ INVALID_SPECULATIVE_TOKEN, InvalidUndoPtr);
+ tupbuffer->reserved_tids_end = result + (MULTI_INSERT_TID_RESERVATION_FACTOR * ntids) - 1;
+ tupbuffer->reserved_tids_start = result + ntids;
+ }
+
+ return result;
+}
+
+/* buffer more data */
+void
+zsbt_tuplebuffer_spool_tuple(Relation rel, zstid tid, Datum *datums, bool *isnulls)
+{
+ AttrNumber attno;
+ tuplebuffer *tupbuffer;
+
+ tupbuffer = get_tuplebuffer(rel);
+
+ for (attno = 1; attno <= rel->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr(rel->rd_att, attno - 1);
+ attbuffer *attbuffer = &tupbuffer->attbuffers[attno - 1];
+ Datum datum;
+ bool isnull;
+
+ datum = datums[attno - 1];
+ isnull = isnulls[attno - 1];
+
+ if (!isnull && attr->attlen < 0 && VARATT_IS_EXTERNAL(datum))
+ datum = PointerGetDatum(detoast_external_attr((struct varlena *) DatumGetPointer(datum)));
+
+ /* If this datum is too large, toast it */
+ if (!isnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(datum) > MaxZedStoreDatumSize)
+ {
+ datum = zedstore_toast_datum(rel, attno, datum, tid);
+ }
+
+ zsbt_attbuffer_spool(rel, attno, attbuffer, 1, &tid, &datum, &isnull);
+ }
+}
+
+void
+zsbt_tuplebuffer_spool_slots(Relation rel, zstid *tids, TupleTableSlot **slots, int ntuples)
+{
+ AttrNumber attno;
+ tuplebuffer *tupbuffer;
+ Datum *datums;
+ bool *isnulls;
+
+ tupbuffer = get_tuplebuffer(rel);
+
+ datums = palloc(ntuples * sizeof(Datum));
+ isnulls = palloc(ntuples * sizeof(bool));
+
+ for (attno = 1; attno <= rel->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr(rel->rd_att, attno - 1);
+ attbuffer *attbuffer = &tupbuffer->attbuffers[attno - 1];
+
+ for (int i = 0; i < ntuples; i++)
+ {
+ Datum datum;
+ bool isnull;
+
+ /* Make sure the slot has been deformed before reading its values. */
+ if (attno == 1)
+ slot_getallattrs(slots[i]);
+
+ datum = slots[i]->tts_values[attno - 1];
+ isnull = slots[i]->tts_isnull[attno - 1];
+
+ if (!isnull && attr->attlen < 0 && VARATT_IS_EXTERNAL(datum))
+ datum = PointerGetDatum(detoast_external_attr((struct varlena *) DatumGetPointer(datum)));
+
+ /* If this datum is too large, toast it */
+ if (!isnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(datum) > MaxZedStoreDatumSize)
+ {
+ datum = zedstore_toast_datum(rel, attno, datum, tids[i]);
+ }
+ datums[i] = datum;
+ isnulls[i] = isnull;
+ }
+
+ zsbt_attbuffer_spool(rel, attno, attbuffer, ntuples, tids, datums, isnulls);
+ }
+
+ pfree(datums);
+ pfree(isnulls);
+}
+
+
+static void
+zsbt_attbuffer_init(Form_pg_attribute attr, attbuffer *attbuffer)
+{
+ attstream_buffer *attbuf = &attbuffer->chunks;
+
+#define ATTBUF_INIT_SIZE 1024
+ attbuf->data = palloc(ATTBUF_INIT_SIZE);
+ attbuf->len = 0;
+ attbuf->maxlen = ATTBUF_INIT_SIZE;
+ attbuf->cursor = 0;
+
+ attbuf->firsttid = 0;
+ attbuf->lasttid = 0;
+
+ attbuf->attlen = attr->attlen;
+ attbuf->attbyval = attr->attbyval;
+
+ attbuffer->num_buffered_rows = 0;
+}
+
+static void
+zsbt_attbuffer_spool(Relation rel, AttrNumber attno, attbuffer *attbuffer,
+ int ntuples, zstid *tids, Datum *datums, bool *isnulls)
+{
+ int i;
+ attstream_buffer *chunks = &attbuffer->chunks;
+
+ for (i = 0; i < ntuples; i++)
+ {
+ Datum datum;
+
+ if (attbuffer->num_buffered_rows >= 60)
+ zsbt_attbuffer_flush(rel, attno, attbuffer, false);
+
+ if (!chunks->attbyval && !isnulls[i])
+ {
+ /* XXX: we have to make a copy of pass-by-reference values, because
+ * they need to live until end of transaction, when we flush the buffers.
+ * That's pretty inefficient!
+ */
+ MemoryContext oldcxt = MemoryContextSwitchTo(tuplebuffers_cxt);
+ datum = zs_datumCopy(datums[i], chunks->attbyval, chunks->attlen);
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ datum = datums[i];
+
+ attbuffer->buffered_tids[attbuffer->num_buffered_rows] = tids[i];
+ attbuffer->buffered_datums[attbuffer->num_buffered_rows] = datum;
+ attbuffer->buffered_isnulls[attbuffer->num_buffered_rows] = isnulls[i];
+ attbuffer->num_buffered_rows++;
+ }
+}
+
+/* flush */
+
+static void
+zsbt_attbuffer_flush(Relation rel, AttrNumber attno, attbuffer *attbuffer, bool all)
+{
+ int num_encoded;
+ int num_remain;
+ attstream_buffer *chunks = &attbuffer->chunks;
+
+ /* First encode more */
+ if (attbuffer->num_buffered_rows >= 60 ||
+ (all && attbuffer->num_buffered_rows > 0))
+ {
+ num_encoded = append_attstream(chunks, all, attbuffer->num_buffered_rows,
+ attbuffer->buffered_tids,
+ attbuffer->buffered_datums,
+ attbuffer->buffered_isnulls);
+ num_remain = attbuffer->num_buffered_rows - num_encoded;
+
+ if (!chunks->attbyval)
+ {
+ for (int i = 0; i < num_encoded; i++)
+ {
+ if (!attbuffer->buffered_isnulls[i])
+ pfree(DatumGetPointer(attbuffer->buffered_datums[i]));
+ }
+ }
+
+ memmove(attbuffer->buffered_tids, &attbuffer->buffered_tids[num_encoded], num_remain * sizeof(zstid));
+ memmove(attbuffer->buffered_datums, &attbuffer->buffered_datums[num_encoded], num_remain * sizeof(Datum));
+ memmove(attbuffer->buffered_isnulls, &attbuffer->buffered_isnulls[num_encoded], num_remain * sizeof(bool));
+ attbuffer->num_buffered_rows = num_remain;
+ }
+
+ while ((all && chunks->len - chunks->cursor > 0) ||
+ chunks->len - chunks->cursor > ATTBUFFER_SIZE)
+ {
+ zsbt_attr_add(rel, attno, chunks);
+ }
+}
+
+/*
+ * Remove any reserved but unused TIDs from the TID tree.
+ */
+static void
+tuplebuffer_kill_unused_reserved_tids(Relation rel, tuplebuffer *tupbuffer)
+{
+ IntegerSet *unused_tids;
+ zstid tid;
+
+ if ((tupbuffer->reserved_tids_start == InvalidZSTid &&
+ tupbuffer->reserved_tids_end == InvalidZSTid) ||
+ tupbuffer->reserved_tids_start > tupbuffer->reserved_tids_end)
+ return; /* no reserved TIDs */
+
+ /*
+ * XXX: We use the zsbt_tid_remove() function for this, but it's
+ * a bit too heavy-weight. It's geared towards VACUUM and removing
+ * millions of TIDs in one go. Also, we leak the IntegerSet object;
+ * usually flushing is done at end of transaction, so that's not
+ * a problem, but it could be if we need to flush a lot in the
+ * same transaction.
+ *
+ * XXX: It would be nice to adjust the UNDO record, too. Otherwise,
+ * if we abort, the poor sod that tries to discard the UNDO record
+ * will try to mark these TIDs as unused in vain.
+ */
+ unused_tids = intset_create();
+
+ for (tid = tupbuffer->reserved_tids_start;
+ tid <= tupbuffer->reserved_tids_end;
+ tid++)
+ {
+ intset_add_member(unused_tids, tid);
+ }
+
+ zsbt_tid_remove(rel, unused_tids);
+
+ tupbuffer->reserved_tids_start = InvalidZSTid;
+ tupbuffer->reserved_tids_end = InvalidZSTid;
+}
+
+static void
+tuplebuffer_flush_internal(Relation rel, tuplebuffer *tupbuffer)
+{
+ tuplebuffer_kill_unused_reserved_tids(rel, tupbuffer);
+
+ /* Flush the attribute data */
+ for (AttrNumber attno = 1; attno <= tupbuffer->natts; attno++)
+ {
+ attbuffer *attbuffer = &tupbuffer->attbuffers[attno - 1];
+
+ zsbt_attbuffer_flush(rel, attno, attbuffer, true);
+ }
+
+ tupbuffer->num_repeated_single_inserts = 0;
+}
+
+void
+zsbt_tuplebuffer_flush(Relation rel)
+{
+ tuplebuffer *tupbuffer;
+
+ if (!tuplebuffers)
+ return;
+ tupbuffer = tuplebuffers_lookup(tuplebuffers, RelationGetRelid(rel));
+ if (!tupbuffer)
+ return;
+
+ tuplebuffer_flush_internal(rel, tupbuffer);
+
+ for (int attno = 1 ; attno <= tupbuffer->natts; attno++)
+ {
+ attbuffer *attbuf = &(tupbuffer->attbuffers[attno-1]);
+ pfree(attbuf->chunks.data);
+ }
+ pfree(tupbuffer->attbuffers);
+
+ tuplebuffers_delete(tuplebuffers, RelationGetRelid(rel));
+}
+
+static void
+zsbt_tuplebuffers_flush(void)
+{
+ tuplebuffers_iterator iter;
+ tuplebuffer *tupbuffer;
+
+ tuplebuffers_start_iterate(tuplebuffers, &iter);
+ while ((tupbuffer = tuplebuffers_iterate(tuplebuffers, &iter)) != NULL)
+ {
+ Relation rel;
+
+ rel = table_open(tupbuffer->relid, NoLock);
+
+ tuplebuffer_flush_internal(rel, tupbuffer);
+
+ table_close(rel, NoLock);
+ }
+}
+
+
+/* check in a scan */
+
+
+/*
+ * End-of-transaction cleanup for zedstore.
+ *
+ * Flush tuple buffers in zedstore.
+ *
+ * We must flush everything before the top transaction commit becomes
+ * visible to others, so that they can see the data. On top transaction
+ * abort, we can simply drop everything we had buffered. That's fortunate,
+ * because we couldn't access the table during abort processing anyway.
+ *
+ * Subtransactions:
+ *
+ * After a subtransaction has been marked as aborted, we mustn't write
+ * out any attribute data belonging to the aborted subtransaction. Two
+ * reasons for that. Firstly, the TIDs belonging to an aborted
+ * subtransaction might be vacuumed away at any point. We mustn't write
+ * out attribute data for a TID that's already been vacuumed away in the
+ * TID tree. Secondly, subtransaction abort releases locks acquired in
+ * the subtransaction, and we cannot write out data if we're not holding
+ * a lock on the table. So we must throw our buffers away at subtransaction
+ * abort.
+ *
+ * Since we throw away our buffers at subtransaction abort, we must take
+ * care that the buffers are empty when a subtransaction begins. If there
+ * was any leftover buffered data for other subtransactions, we would
+ * throw away that data too, if the new subtransaction aborts.
+ *
+ * Writing out the buffers at subtransaction commit probably isn't necessary,
+ * but might as well play it safe and do it.
+ */
+void
+AtEOXact_zedstore_tuplebuffers(bool isCommit)
+{
+ if (tuplebuffers_cxt)
+ {
+ if (isCommit)
+ zsbt_tuplebuffers_flush();
+ MemoryContextDelete(tuplebuffers_cxt);
+ tuplebuffers_cxt = NULL;
+ tuplebuffers = NULL;
+ }
+}
+
+void
+AtSubStart_zedstore_tuplebuffers(void)
+{
+ if (tuplebuffers_cxt)
+ {
+ zsbt_tuplebuffers_flush();
+ MemoryContextDelete(tuplebuffers_cxt);
+ tuplebuffers_cxt = NULL;
+ tuplebuffers = NULL;
+ }
+}
+
+void
+AtEOSubXact_zedstore_tuplebuffers(bool isCommit)
+{
+ if (tuplebuffers_cxt)
+ {
+ if (isCommit)
+ zsbt_tuplebuffers_flush();
+ MemoryContextDelete(tuplebuffers_cxt);
+ tuplebuffers_cxt = NULL;
+ tuplebuffers = NULL;
+ }
+}
diff --git src/backend/access/zedstore/zedstore_tupslot.c src/backend/access/zedstore/zedstore_tupslot.c
new file mode 100644
index 0000000000..48e11ab546
--- /dev/null
+++ src/backend/access/zedstore/zedstore_tupslot.c
@@ -0,0 +1,268 @@
+/*
+ * zedstore_tupslot.c
+ * Implementation of a TupleTableSlot for zedstore.
+ *
+ * This implementation is identical to a Virtual tuple slot
+ * (TTSOpsVirtual), but it has a slot_getsysattr() implementation
+ * that can fetch and compute the 'xmin' for the tuple.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tupslot.c
+ */
+#include "postgres.h"
+
+#include "access/table.h"
+#include "access/zedstore_internal.h"
+#include "executor/tuptable.h"
+#include "utils/expandeddatum.h"
+
+const TupleTableSlotOps TTSOpsZedstore;
+
+static void
+tts_zedstore_init(TupleTableSlot *slot)
+{
+ ZedstoreTupleTableSlot *zslot = (ZedstoreTupleTableSlot *) slot;
+ zslot->visi_info = NULL;
+}
+
+static void
+tts_zedstore_release(TupleTableSlot *slot)
+{
+}
+
+static void
+tts_zedstore_clear(TupleTableSlot *slot)
+{
+ ZedstoreTupleTableSlot *zslot = (ZedstoreTupleTableSlot *) slot;
+ if (unlikely(TTS_SHOULDFREE(slot)))
+ {
+ pfree(zslot->data);
+ zslot->data = NULL;
+
+ slot->tts_flags &= ~TTS_FLAG_SHOULDFREE;
+ }
+
+ slot->tts_nvalid = 0;
+ slot->tts_flags |= TTS_FLAG_EMPTY;
+ ItemPointerSetInvalid(&slot->tts_tid);
+
+ zslot->visi_info = NULL;
+}
+
+/*
+ * Attribute values are readily available in the tts_values and tts_isnull
+ * arrays of a ZedstoreTupleTableSlot, so there should be no need to call
+ * the following function.
+ */
+static void
+tts_zedstore_getsomeattrs(TupleTableSlot *slot, int natts)
+{
+ elog(ERROR, "getsomeattrs is not required to be called on a zedstore tuple table slot");
+}
+
+/*
+ * We only support fetching 'xmin', currently. It's needed for referential
+ * integrity triggers (i.e. foreign keys).
+ */
+static Datum
+tts_zedstore_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
+{
+ ZedstoreTupleTableSlot *zslot = (ZedstoreTupleTableSlot *) slot;
+ if (attnum == MinTransactionIdAttributeNumber ||
+ attnum == MinCommandIdAttributeNumber)
+ {
+ *isnull = false;
+ if (attnum == MinTransactionIdAttributeNumber)
+ return zslot->visi_info ? TransactionIdGetDatum(zslot->visi_info->xmin) : InvalidTransactionId;
+ else
+ {
+ Assert(attnum == MinCommandIdAttributeNumber);
+ return zslot->visi_info ? CommandIdGetDatum(zslot->visi_info->cmin) : InvalidCommandId;
+ }
+ }
+ elog(ERROR, "zedstore tuple table slot does not have system attributes (except xmin and cmin)");
+
+ return 0; /* silence compiler warnings */
+}
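+
+/*
+ * Illustrative sketch, not part of the patch: a caller (for example a
+ * referential-integrity trigger) would presumably reach the xmin through the
+ * generic slot API rather than calling this directly, roughly like:
+ *
+ *     bool            isnull;
+ *     TransactionId   xmin;
+ *
+ *     xmin = DatumGetTransactionId(
+ *         slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull));
+ */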
+
+/*
+ * To materialize a zedstore slot all the datums that aren't passed by value
+ * have to be copied into the slot's memory context. To do so, compute the
+ * required size, and allocate enough memory to store all attributes. That's
+ * good for cache hit ratio, but more importantly requires only memory
+ * allocation/deallocation.
+ */
+static void
+tts_zedstore_materialize(TupleTableSlot *slot)
+{
+ ZedstoreTupleTableSlot *vslot = (ZedstoreTupleTableSlot *) slot;
+ TupleDesc desc = slot->tts_tupleDescriptor;
+ Size sz = 0;
+ char *data;
+
+ /* already materialized */
+ if (TTS_SHOULDFREE(slot))
+ return;
+
+ /* copy visibility information to go with the slot */
+ if (vslot->visi_info)
+ {
+ vslot->visi_info_buf = *vslot->visi_info;
+ vslot->visi_info = &vslot->visi_info_buf;
+ }
+
+ /* compute size of memory required */
+ for (int natt = 0; natt < desc->natts; natt++)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, natt);
+ Datum val;
+
+ if (att->attbyval || slot->tts_isnull[natt])
+ continue;
+
+ val = slot->tts_values[natt];
+
+ if (att->attlen == -1 &&
+ VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val)))
+ {
+ /*
+ * We want to flatten the expanded value so that the materialized
+ * slot doesn't depend on it.
+ */
+ sz = att_align_nominal(sz, att->attalign);
+ sz += EOH_get_flat_size(DatumGetEOHP(val));
+ }
+ else
+ {
+ sz = att_align_nominal(sz, att->attalign);
+ sz = att_addlength_datum(sz, att->attlen, val);
+ }
+ }
+
+ /* all data is byval */
+ if (sz == 0)
+ return;
+
+ /* allocate memory */
+ vslot->data = data = MemoryContextAlloc(slot->tts_mcxt, sz);
+ slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+ /* and copy all attributes into the pre-allocated space */
+ for (int natt = 0; natt < desc->natts; natt++)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, natt);
+ Datum val;
+
+ if (att->attbyval || slot->tts_isnull[natt])
+ continue;
+
+ val = slot->tts_values[natt];
+
+ if (att->attlen == -1 &&
+ VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val)))
+ {
+ Size data_length;
+
+ /*
+ * We want to flatten the expanded value so that the materialized
+ * slot doesn't depend on it.
+ */
+ ExpandedObjectHeader *eoh = DatumGetEOHP(val);
+
+ data = (char *) att_align_nominal(data,
+ att->attalign);
+ data_length = EOH_get_flat_size(eoh);
+ EOH_flatten_into(eoh, data, data_length);
+
+ slot->tts_values[natt] = PointerGetDatum(data);
+ data += data_length;
+ }
+ else
+ {
+ Size data_length = 0;
+
+ data = (char *) att_align_nominal(data, att->attalign);
+ data_length = att_addlength_datum(data_length, att->attlen, val);
+
+ memcpy(data, DatumGetPointer(val), data_length);
+
+ slot->tts_values[natt] = PointerGetDatum(data);
+ data += data_length;
+ }
+ }
+}
+
+static void
+tts_zedstore_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
+{
+ ZedstoreTupleTableSlot *zdstslot = (ZedstoreTupleTableSlot *) dstslot;
+
+ TupleDesc srcdesc = srcslot->tts_tupleDescriptor;
+
+ Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);
+
+ tts_zedstore_clear(dstslot);
+
+ slot_getallattrs(srcslot);
+
+ for (int natt = 0; natt < srcdesc->natts; natt++)
+ {
+ dstslot->tts_values[natt] = srcslot->tts_values[natt];
+ dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
+ }
+
+ if (srcslot->tts_ops == &TTSOpsZedstore)
+ zdstslot->visi_info = ((ZedstoreTupleTableSlot *) srcslot)->visi_info;
+ else
+ zdstslot->visi_info = NULL;
+
+ dstslot->tts_nvalid = srcdesc->natts;
+ dstslot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ /* make sure storage doesn't depend on external memory */
+ tts_zedstore_materialize(dstslot);
+}
+
+static HeapTuple
+tts_zedstore_copy_heap_tuple(TupleTableSlot *slot)
+{
+ Assert(!TTS_EMPTY(slot));
+
+ return heap_form_tuple(slot->tts_tupleDescriptor,
+ slot->tts_values,
+ slot->tts_isnull);
+}
+
+static MinimalTuple
+tts_zedstore_copy_minimal_tuple(TupleTableSlot *slot)
+{
+ Assert(!TTS_EMPTY(slot));
+
+ return heap_form_minimal_tuple(slot->tts_tupleDescriptor,
+ slot->tts_values,
+ slot->tts_isnull);
+}
+
+
+const TupleTableSlotOps TTSOpsZedstore = {
+ .base_slot_size = sizeof(ZedstoreTupleTableSlot),
+ .init = tts_zedstore_init,
+ .release = tts_zedstore_release,
+ .clear = tts_zedstore_clear,
+ .getsomeattrs = tts_zedstore_getsomeattrs,
+ .getsysattr = tts_zedstore_getsysattr,
+ .materialize = tts_zedstore_materialize,
+ .copyslot = tts_zedstore_copyslot,
+
+ /*
+ * A zedstore tuple table slot can not "own" a heap tuple or a minimal
+ * tuple.
+ */
+ .get_heap_tuple = NULL,
+ .get_minimal_tuple = NULL,
+ .copy_heap_tuple = tts_zedstore_copy_heap_tuple,
+ .copy_minimal_tuple = tts_zedstore_copy_minimal_tuple
+};
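+
+/*
+ * Illustrative sketch (an assumption, not shown in this file): the zedstore
+ * table AM would hand these callbacks to the executor through its
+ * slot_callbacks routine, something like:
+ *
+ *     static const TupleTableSlotOps *
+ *     zedstoream_slot_callbacks(Relation relation)
+ *     {
+ *         return &TTSOpsZedstore;
+ *     }
+ */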
diff --git src/backend/access/zedstore/zedstore_undolog.c src/backend/access/zedstore/zedstore_undolog.c
new file mode 100644
index 0000000000..8749e4a827
--- /dev/null
+++ src/backend/access/zedstore/zedstore_undolog.c
@@ -0,0 +1,651 @@
+/*
+ * zedstore_undolog.c
+ * Temporary UNDO-logging for zedstore.
+ *
+ * XXX: This file is hopefully replaced with an upstream UNDO facility later.
+ *
+ * The UNDO log is a dumb stream of bytes. It can be appended to at the
+ * head, and the tail can be discarded away. The upper layer (see
+ * zedstore_undorec.c) is responsible for dividing the log into records,
+ * and for deciding when and what to discard.
+ *
+ * The upper layer is also responsible for WAL-logging any insertions and
+ * modifications of UNDO records. This module WAL-logs creation of new UNDO
+ * pages and discarding old ones, but not the content.
+ *
+ * Insertion is a two-step process. First, you reserve the space for the
+ * UNDO record with zsundo_insert_reserve(). You get a pointer to an UNDO
+ * buffer, where you can write the record. Once you're finished, call
+ * zsundo_insert_finish().
+ *
+ * To fetch a record, use zsundo_fetch(). You may modify the record, but
+ * you must dirty the buffer and WAL-log the change yourself. You cannot
+ * change its size, however.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_undolog.c
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undolog.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/rel.h"
+
+/*
+ * Reserve space in the UNDO log for a new UNDO record.
+ *
+ * Extends the UNDO log with a new page if needed. Information about the
+ * reservation is returned in *reservation_p. reservation_p->undobuf is
+ * the buffer containing the reserved space. reservation_p->undorecptr
+ * is a pointer that can be used to fetch the record later.
+ *
+ * This doesn't make any on-disk changes. The buffer is locked, but if
+ * the backend aborts later on, before actually writing the record, no harm
+ * is done.
+ *
+ * The intended usage is to call zsundo_insert_reserve(), then lock any
+ * other pages needed for the operation. Then, write the UNDO record at
+ * reservation_p->ptr, which points directly to the buffer, in the same
+ * critical section as any other page modifications that need to be done
+ * atomically. Finally, call zsundo_insert_finish() to mark the space as
+ * used in the undo page header. (See the illustrative sketch after
+ * zsundo_insert_finish() below.)
+ *
+ * The caller is responsible for WAL-logging, and replaying the changes, in
+ * case of a crash. (If there isn't enough space on the current latest UNDO
+ * page, a new page is allocated and appended to the UNDO log. That allocation
+ * is WAL-logged separately, the caller doesn't need to care about that.)
+ */
+void
+zsundo_insert_reserve(Relation rel, size_t size, zs_undo_reservation *reservation_p)
+{
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber tail_blk;
+ Buffer tail_buf = InvalidBuffer;
+ Page tail_pg = NULL;
+ ZSUndoPageOpaque *tail_opaque = NULL;
+ uint64 next_counter;
+ int offset;
+
+ if (size > MaxUndoRecordSize)
+ elog(ERROR, "UNDO record is too large (%zu bytes, max %zu bytes)", size, MaxUndoRecordSize);
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+retry_lock_tail:
+ tail_blk = metaopaque->zs_undo_tail;
+
+ /*
+ * Is there space on the tail page? If not, allocate a new UNDO page.
+ */
+ if (tail_blk != InvalidBlockNumber)
+ {
+ tail_buf = ReadBuffer(rel, tail_blk);
+ LockBuffer(tail_buf, BUFFER_LOCK_EXCLUSIVE);
+ tail_pg = BufferGetPage(tail_buf);
+ tail_opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(tail_pg);
+ Assert(tail_opaque->first_undorecptr.counter == metaopaque->zs_undo_tail_first_counter);
+ }
+
+ if (tail_blk == InvalidBlockNumber || PageGetExactFreeSpace(tail_pg) < size)
+ {
+ Buffer newbuf;
+ BlockNumber newblk;
+ Page newpage;
+ ZSUndoPageOpaque *newopaque;
+
+ /*
+ * Release the lock on the old tail page and metapage while we find a new block,
+ * because that could take a while. (And accessing the Free Page Map might lock
+ * the metapage, too, causing self-deadlock.)
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+ if (BufferIsValid(tail_buf))
+ LockBuffer(tail_buf, BUFFER_LOCK_UNLOCK);
+
+ /* new page */
+ newbuf = zspage_getnewbuf(rel, ZS_INVALID_ATTRIBUTE_NUM);
+
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ if (metaopaque->zs_undo_tail != tail_blk)
+ {
+ /*
+ * Someone else extended the UNDO log concurrently. We don't need
+ * the new page, after all. (Or maybe we do, if the new
+ * tail block is already full, but we're not smart about it.)
+ */
+ zspage_delete_page(rel, newbuf, metabuf, ZS_INVALID_ATTRIBUTE_NUM);
+ UnlockReleaseBuffer(newbuf);
+ goto retry_lock_tail;
+ }
+ if (BufferIsValid(tail_buf))
+ LockBuffer(tail_buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (tail_blk == InvalidBlockNumber)
+ next_counter = metaopaque->zs_undo_tail_first_counter;
+ else
+ next_counter = tail_opaque->last_undorecptr.counter + 1;
+
+ START_CRIT_SECTION();
+
+ newblk = BufferGetBlockNumber(newbuf);
+ newpage = BufferGetPage(newbuf);
+ PageInit(newpage, BLCKSZ, sizeof(ZSUndoPageOpaque));
+ newopaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(newpage);
+ newopaque->next = InvalidBlockNumber;
+ newopaque->first_undorecptr.blkno = newblk;
+ newopaque->first_undorecptr.offset = SizeOfPageHeaderData;
+ newopaque->first_undorecptr.counter = next_counter;
+ newopaque->last_undorecptr = InvalidUndoPtr;
+ newopaque->zs_page_id = ZS_UNDO_PAGE_ID;
+ MarkBufferDirty(newbuf);
+
+ metaopaque->zs_undo_tail = newblk;
+ metaopaque->zs_undo_tail_first_counter = next_counter;
+ if (tail_blk == InvalidBlockNumber)
+ metaopaque->zs_undo_head = newblk;
+ MarkBufferDirty(metabuf);
+
+ if (tail_blk != InvalidBlockNumber)
+ {
+ tail_opaque->next = newblk;
+ MarkBufferDirty(tail_buf);
+ }
+
+ if (RelationNeedsWAL(rel))
+ {
+ wal_zedstore_undo_newpage xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.first_counter = next_counter;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalUndoNewPage);
+
+ XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+ if (BufferIsValid(tail_buf))
+ XLogRegisterBuffer(1, tail_buf, REGBUF_STANDARD);
+ XLogRegisterBuffer(2, newbuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_UNDO_NEWPAGE);
+
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ if (BufferIsValid(tail_buf))
+ PageSetLSN(BufferGetPage(tail_buf), recptr);
+ PageSetLSN(BufferGetPage(newbuf), recptr);
+ }
+
+ if (tail_blk != InvalidBlockNumber)
+ UnlockReleaseBuffer(tail_buf);
+
+ END_CRIT_SECTION();
+
+ Assert(size <= PageGetExactFreeSpace(newpage));
+
+ tail_blk = newblk;
+ tail_buf = newbuf;
+ tail_pg = newpage;
+ tail_opaque = newopaque;
+ }
+ else
+ {
+ if (IsZSUndoRecPtrValid(&tail_opaque->last_undorecptr))
+ {
+ Assert(tail_opaque->last_undorecptr.counter >= metaopaque->zs_undo_tail_first_counter);
+ next_counter = tail_opaque->last_undorecptr.counter + 1;
+ }
+ else
+ {
+ next_counter = tail_opaque->first_undorecptr.counter;
+ Assert(next_counter == metaopaque->zs_undo_tail_first_counter);
+ }
+ }
+
+ UnlockReleaseBuffer(metabuf);
+
+ /*
+ * All set for writing the record. But since we haven't modified the page
+ * yet, we are free to still turn back and release the lock without writing
+ * anything.
+ */
+ offset = ((PageHeader) tail_pg)->pd_lower;
+
+ /* Return the reservation to the caller */
+ reservation_p->undobuf = tail_buf;
+ reservation_p->undorecptr.counter = next_counter;
+ reservation_p->undorecptr.blkno = tail_blk;
+ reservation_p->undorecptr.offset = offset;
+ reservation_p->length = size;
+ reservation_p->ptr = ((char *) tail_pg) + offset;
+}
+
+/*
+ * Finish the insertion of an UNDO record.
+ *
+ * See zsundo_insert_reserve().
+ */
+void
+zsundo_insert_finish(zs_undo_reservation *reservation)
+{
+ Buffer undobuf = reservation->undobuf;
+ Page undopg = BufferGetPage(undobuf);
+ ZSUndoPageOpaque *opaque;
+
+ /*
+ * This should be used as part of a bigger critical section that
+ * writes a WAL record of the change. The caller must've written the
+ * data.
+ */
+ Assert(CritSectionCount > 0);
+
+ Assert(((PageHeader) undopg)->pd_lower == reservation->undorecptr.offset);
+
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(undopg);
+ opaque->last_undorecptr = reservation->undorecptr;
+
+ ((PageHeader) undopg)->pd_lower += reservation->length;
+
+ MarkBufferDirty(undobuf);
+}
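+
+/*
+ * Illustrative sketch, not part of the patch: the reserve/write/finish
+ * protocol described above, for a hypothetical record type 'MyUndoRec'
+ * (WAL-logging of the record contents is the caller's job and is only
+ * hinted at here):
+ *
+ *     zs_undo_reservation resv;
+ *
+ *     zsundo_insert_reserve(rel, sizeof(MyUndoRec), &resv);
+ *     ... lock any other pages needed for the operation ...
+ *     START_CRIT_SECTION();
+ *     memcpy(resv.ptr, &myrec, sizeof(MyUndoRec));
+ *     zsundo_insert_finish(&resv);
+ *     ... XLogInsert() the change, PageSetLSN() on all modified pages ...
+ *     END_CRIT_SECTION();
+ *     UnlockReleaseBuffer(resv.undobuf);
+ */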
+
+/*
+ * Lock page containing the given UNDO record, and return pointer to it
+ * within the buffer. Once you're done looking at the record, unlock and
+ * unpin the buffer.
+ *
+ * If lockmode is BUFFER_LOCK_EXCLUSIVE, you may modify the record. However,
+ * you cannot change its size, and you must mark the buffer dirty, and WAL-log any
+ * changes yourself.
+ *
+ * If missing_ok is true, it's OK if the UNDO record has been discarded away
+ * already. Will return NULL in that case. If missing_ok is false, throws an
+ * error if the record cannot be found.
+ */
+char *
+zsundo_fetch(Relation rel, ZSUndoRecPtr undoptr, Buffer *buf_p, int lockmode,
+ bool missing_ok)
+{
+ Buffer buf = InvalidBuffer;
+ Page page;
+ PageHeader pagehdr;
+ ZSUndoPageOpaque *opaque;
+ char *ptr;
+ Buffer metabuf = InvalidBuffer;
+
+ buf = ReadBuffer(rel, undoptr.blkno);
+ page = BufferGetPage(buf);
+ pagehdr = (PageHeader) page;
+
+ /*
+ * If the page might've been discarded away, there's a small chance that
+ * the buffer now holds an unrelated page. In that case, it's possible
+ * that we or someone else is holding a lock on it already. If we tried
+ * to lock the page unconditionally, we could accidentally break the
+ * lock ordering rules, by trying to lock a different kind of a page
+ * than we thought.
+ *
+ * To avoid that, try to lock the page optimistically, but if we would
+ * block, check in the metapage that the page hasn't been discarded away.
+ * zsundo_discard() keeps the metapage locked, so if we lock the page
+ * while holding the metapage, we can be sure that it's the UNDO page
+ * we're looking for.
+ */
+ if (!ConditionalLockBufferInMode(buf, lockmode))
+ {
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ if (metaopaque->zs_undo_oldestptr.counter > undoptr.counter)
+ {
+ /* the record has already been discarded */
+ ReleaseBuffer(buf);
+ buf = InvalidBuffer;
+ UnlockReleaseBuffer(metabuf);
+ metabuf = InvalidBuffer;
+ goto record_missing;
+ }
+ LockBuffer(buf, lockmode);
+ }
+
+ if (PageIsNew(page))
+ goto record_missing;
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ goto record_missing;
+
+ /* Check that this page contains the given record */
+ if (undoptr.counter < opaque->first_undorecptr.counter ||
+ !IsZSUndoRecPtrValid(&opaque->last_undorecptr) ||
+ undoptr.counter > opaque->last_undorecptr.counter)
+ goto record_missing;
+
+ if (BufferIsValid(metabuf))
+ {
+ UnlockReleaseBuffer(metabuf);
+ metabuf = InvalidBuffer;
+ }
+
+ /* FIXME: the callers could do a more thorough check like this,
+ * since they know the record size */
+ /* Sanity check that the pointer pointed to a valid place */
+ if (undoptr.offset < SizeOfPageHeaderData ||
+ undoptr.offset >= pagehdr->pd_lower)
+ {
+ /*
+ * this should not happen in the case that the page was recycled for
+ * other use, so error even if 'missing_ok' is true
+ */
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+ }
+
+ ptr = ((char *) page) + undoptr.offset;
+
+#if 0 /* FIXME: move this to the callers? */
+ if (memcmp(&undorec->undorecptr, &undoptr, sizeof(ZSUndoRecPtr)) != 0)
+ {
+ /*
+ * this should not happen in the case that the page was recycled for
+ * other use, so error even if 'fail_ok' is true
+ */
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+ }
+#endif
+
+ *buf_p = buf;
+ return ptr;
+
+record_missing:
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+ if (buf)
+ UnlockReleaseBuffer(buf);
+ *buf_p = InvalidBuffer;
+
+ /*
+ * If the metapage says that the page is there, but it doesn't contain the
+ * data we thought, that's an error even with 'missing_ok'.
+ */
+ if (missing_ok && !BufferIsValid(metabuf))
+ return NULL;
+ else
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u; not an UNDO page",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+}
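+
+/*
+ * Illustrative sketch, not part of the patch: updating an existing UNDO
+ * record in place, following the contract above (see
+ * zsundo_clear_speculative_token() in zedstore_undorec.c for a concrete
+ * caller that does this):
+ *
+ *     Buffer      buf;
+ *     ZSUndoRec  *rec;
+ *
+ *     rec = (ZSUndoRec *) zsundo_fetch(rel, undoptr, &buf,
+ *                                      BUFFER_LOCK_EXCLUSIVE, false);
+ *     START_CRIT_SECTION();
+ *     ... modify the record contents, keeping the same size ...
+ *     MarkBufferDirty(buf);
+ *     ... WAL-log the change ...
+ *     END_CRIT_SECTION();
+ *     UnlockReleaseBuffer(buf);
+ */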
+
+/*
+ * Discard old UNDO log, recycling any now-unused pages.
+ *
+ * Updates the metapage with the oldest value that remains after the discard.
+ */
+void
+zsundo_discard(Relation rel, ZSUndoRecPtr oldest_undorecptr)
+{
+ /* Scan the undo log from oldest to newest */
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber nextblk;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ nextblk = metaopaque->zs_undo_head;
+ while (nextblk != InvalidBlockNumber)
+ {
+ BlockNumber blk = nextblk;
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+ bool discard_this_page = false;
+ BlockNumber nextfreeblkno = InvalidBlockNumber;
+
+ buf = ReadBuffer(rel, blk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * check that the page still looks like what we'd expect.
+ *
+ * FIXME: how to recover? Should these be just warnings?
+ */
+ if (PageIsEmpty(page))
+ elog(ERROR, "corrupted zedstore table; oldest UNDO log page is empty");
+
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSUndoPageOpaque)))
+ elog(ERROR, "corrupted zedstore table; oldest page in UNDO log is not an UNDO page");
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ elog(ERROR, "corrupted zedstore table; oldest page in UNDO log has unexpected page id %d",
+ opaque->zs_page_id);
+ /* FIXME: Also check here that the max UndoRecPtr on the page is less
+ * than the new 'oldest_undorecptr'
+ */
+
+ if (!IsZSUndoRecPtrValid(&opaque->last_undorecptr) ||
+ opaque->last_undorecptr.counter < oldest_undorecptr.counter)
+ discard_this_page = true;
+
+ if (discard_this_page && blk == oldest_undorecptr.blkno)
+ elog(ERROR, "corrupted UNDO page chain, tried to discard active page");
+
+ nextblk = opaque->next;
+
+ START_CRIT_SECTION();
+
+ metaopaque->zs_undo_oldestptr = oldest_undorecptr;
+
+ if (discard_this_page)
+ {
+ if (nextblk == InvalidBlockNumber)
+ {
+ metaopaque->zs_undo_head = InvalidBlockNumber;
+ metaopaque->zs_undo_tail = InvalidBlockNumber;
+ metaopaque->zs_undo_tail_first_counter = oldest_undorecptr.counter;
+ }
+ else
+ metaopaque->zs_undo_head = nextblk;
+
+ /* Add the discarded page to the free page list */
+ nextfreeblkno = metaopaque->zs_fpm_head;
+ zspage_mark_page_deleted(page, nextfreeblkno);
+ metaopaque->zs_fpm_head = blk;
+
+ MarkBufferDirty(buf);
+ }
+
+ MarkBufferDirty(metabuf);
+
+ if (RelationNeedsWAL(rel))
+ {
+ wal_zedstore_undo_discard xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.oldest_undorecptr = oldest_undorecptr;
+ xlrec.oldest_undopage = nextblk;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalUndoDiscard);
+ XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+ if (discard_this_page)
+ {
+ XLogRegisterBuffer(1, buf, REGBUF_KEEP_DATA | REGBUF_WILL_INIT | REGBUF_STANDARD);
+ XLogRegisterBufData(1, (char *) &nextfreeblkno, sizeof(BlockNumber));
+ }
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_UNDO_DISCARD);
+
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+ }
+
+ UnlockReleaseBuffer(metabuf);
+}
+
+void
+zsundo_discard_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_undo_discard *xlrec = (wal_zedstore_undo_discard *) XLogRecGetData(record);
+ ZSUndoRecPtr oldest_undorecptr = xlrec->oldest_undorecptr;
+ BlockNumber nextblk = xlrec->oldest_undopage;
+ Buffer metabuf;
+ bool discard_this_page;
+ BlockNumber discardedblkno = InvalidBlockNumber;
+ BlockNumber nextfreeblkno = InvalidBlockNumber;
+
+ discard_this_page = XLogRecHasBlockRef(record, 1);
+ if (discard_this_page)
+ {
+ Size datalen;
+ char *data;
+
+ XLogRecGetBlockTag(record, 1, NULL, NULL, &discardedblkno);
+ data = XLogRecGetBlockData(record, 1, &datalen);
+ Assert(datalen == sizeof(BlockNumber));
+
+ memcpy(&nextfreeblkno, data, sizeof(BlockNumber));
+ }
+
+ if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage = BufferGetPage(metabuf);
+ ZSMetaPageOpaque *metaopaque;
+
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ metaopaque->zs_undo_oldestptr = oldest_undorecptr;
+
+ if (discard_this_page)
+ {
+ if (nextblk == InvalidBlockNumber)
+ {
+ metaopaque->zs_undo_head = InvalidBlockNumber;
+ metaopaque->zs_undo_tail = InvalidBlockNumber;
+ metaopaque->zs_undo_tail_first_counter = oldest_undorecptr.counter;
+ }
+ else
+ metaopaque->zs_undo_head = nextblk;
+
+ /* Add the discarded page to the free page list */
+ metaopaque->zs_fpm_head = discardedblkno;
+ }
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+
+ if (discard_this_page)
+ {
+ Buffer discardedbuf;
+ Page discardedpage;
+
+ discardedbuf = XLogInitBufferForRedo(record, 1);
+ discardedpage = BufferGetPage(discardedbuf);
+ zspage_mark_page_deleted(discardedpage, nextfreeblkno);
+
+ PageSetLSN(discardedpage, lsn);
+ MarkBufferDirty(discardedbuf);
+ UnlockReleaseBuffer(discardedbuf);
+ }
+
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
+void
+zsundo_newpage_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_undo_newpage *xlrec = (wal_zedstore_undo_newpage *) XLogRecGetData(record);
+ Buffer metabuf;
+ Buffer prevbuf;
+ Buffer newbuf;
+ BlockNumber newblk;
+ Page newpage;
+ ZSUndoPageOpaque *newopaque;
+ bool has_prev_block;
+
+ has_prev_block = XLogRecHasBlockRef(record, 1);
+ XLogRecGetBlockTag(record, 2, NULL, NULL, &newblk);
+
+ if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage = BufferGetPage(metabuf);
+ ZSMetaPageOpaque *metaopaque;
+
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ metaopaque->zs_undo_tail = newblk;
+ metaopaque->zs_undo_tail_first_counter = xlrec->first_counter;
+ if (!has_prev_block)
+ metaopaque->zs_undo_head = newblk;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+
+ if (has_prev_block)
+ {
+ if (XLogReadBufferForRedo(record, 1, &prevbuf) == BLK_NEEDS_REDO)
+ {
+ Page prevpage = BufferGetPage(prevbuf);
+ ZSUndoPageOpaque *prev_opaque;
+
+ prev_opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(prevpage);
+ prev_opaque->next = newblk;
+
+ PageSetLSN(prevpage, lsn);
+ MarkBufferDirty(prevbuf);
+ }
+ }
+ else
+ prevbuf = InvalidBuffer;
+
+ newbuf = XLogInitBufferForRedo(record, 2);
+ newblk = BufferGetBlockNumber(newbuf);
+ newpage = BufferGetPage(newbuf);
+ PageInit(newpage, BLCKSZ, sizeof(ZSUndoPageOpaque));
+ newopaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(newpage);
+ newopaque->next = InvalidBlockNumber;
+ newopaque->first_undorecptr.blkno = newblk;
+ newopaque->first_undorecptr.offset = SizeOfPageHeaderData;
+ newopaque->first_undorecptr.counter = xlrec->first_counter;
+ newopaque->last_undorecptr = InvalidUndoPtr;
+ newopaque->zs_page_id = ZS_UNDO_PAGE_ID;
+
+ PageSetLSN(newpage, lsn);
+ MarkBufferDirty(newbuf);
+
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+ if (BufferIsValid(prevbuf))
+ UnlockReleaseBuffer(prevbuf);
+ UnlockReleaseBuffer(newbuf);
+}
diff --git src/backend/access/zedstore/zedstore_undorec.c src/backend/access/zedstore/zedstore_undorec.c
new file mode 100644
index 0000000000..7c4a143f45
--- /dev/null
+++ src/backend/access/zedstore/zedstore_undorec.c
@@ -0,0 +1,893 @@
+/*
+ * zedstore_undorec.c
+ * Functions for working on UNDO records.
+ *
+ * This file contains higher-level functions for constructing UNDO records
+ * for different kinds of WAL records.
+ *
+ * If you perform multiple operations in the same transaction and command, we
+ * reuse the same UNDO record for them. There's a one-element cache for each
+ * operation type, so this only takes effect in simple cases.
+ *
+ * TODO: make the caching work in more cases. A hash table or something.
+ * Currently, we do this for DELETEs and INSERTs. We could perhaps do this
+ * for UPDATEs as well, although they're a bit more tricky, as we need
+ * to also store the 'ctid' pointer to the new tuple in an UPDATE.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_undorec.c
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/multixact.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undolog.h"
+#include "access/zedstore_undorec.h"
+#include "access/zedstore_wal.h"
+#include "commands/progress.h"
+#include "commands/vacuum.h"
+#include "lib/integerset.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "pgstat.h"
+#include "storage/lmgr.h"
+#include "utils/memutils.h"
+#include "utils/pg_rusage.h"
+#include "utils/rel.h"
+#include "utils/lsyscache.h"
+
+/*
+ * Working area for VACUUM.
+ */
+typedef struct ZSVacRelStats
+{
+ int elevel;
+ BufferAccessStrategy vac_strategy;
+
+ /* hasindex = true means two-pass strategy; false means one-pass */
+ bool hasindex;
+ /* Overall statistics about rel */
+ BlockNumber rel_pages; /* total number of pages */
+ BlockNumber tupcount_pages; /* pages whose tuples we counted */
+ double old_live_tuples; /* previous value of pg_class.reltuples */
+ double new_rel_tuples; /* new estimated total # of tuples */
+ double new_live_tuples; /* new estimated total # of live tuples */
+ double new_dead_tuples; /* new estimated total # of dead tuples */
+ BlockNumber pages_removed;
+ double tuples_deleted;
+
+ IntegerSet *dead_tids;
+} ZSVacRelStats;
+
+static bool zs_lazy_tid_reaped(ItemPointer itemptr, void *state);
+static void lazy_vacuum_index(Relation indrel,
+ IndexBulkDeleteResult **stats,
+ ZSVacRelStats *vacrelstats);
+static void lazy_cleanup_index(Relation indrel,
+ IndexBulkDeleteResult *stats,
+ ZSVacRelStats *vacrelstats);
+
+
+/*
+ * Fetch the UNDO record with the given undo-pointer.
+ *
+ * The returned record is a palloc'd copy.
+ *
+ * If the record could not be found, returns NULL. That can happen if you try
+ * to fetch an UNDO record that has already been discarded. I.e. if undoptr
+ * is smaller than the oldest UNDO pointer stored in the metapage.
+ */
+ZSUndoRec *
+zsundo_fetch_record(Relation rel, ZSUndoRecPtr undoptr)
+{
+ ZSUndoRec *undorec_copy;
+ ZSUndoRec *undorec;
+ Buffer buf;
+
+ undorec = (ZSUndoRec *) zsundo_fetch(rel, undoptr, &buf, BUFFER_LOCK_SHARE, true);
+
+ if (undorec)
+ {
+ undorec_copy = palloc(undorec->size);
+ memcpy(undorec_copy, undorec, undorec->size);
+ }
+ else
+ undorec_copy = NULL;
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+
+ return undorec_copy;
+}
+
+
+zs_pending_undo_op *
+zsundo_create_for_delete(Relation rel, TransactionId xid, CommandId cid, zstid tid,
+ bool changedPart, ZSUndoRecPtr prev_undo_ptr)
+{
+ ZSUndoRec_Delete *undorec;
+ zs_pending_undo_op *pending_op;
+
+ static RelFileNode cached_relfilenode;
+ static TransactionId cached_xid;
+ static CommandId cached_cid;
+ static bool cached_changedPart;
+ static ZSUndoRecPtr cached_prev_undo_ptr;
+ static ZSUndoRecPtr cached_undo_ptr;
+
+ if (RelFileNodeEquals(rel->rd_node, cached_relfilenode) &&
+ xid == cached_xid &&
+ cid == cached_cid &&
+ changedPart == cached_changedPart &&
+ prev_undo_ptr.counter == cached_prev_undo_ptr.counter)
+ {
+ Buffer buf;
+ ZSUndoRec_Delete *orig_undorec;
+
+ orig_undorec = (ZSUndoRec_Delete *) zsundo_fetch(rel, cached_undo_ptr,
+ &buf, BUFFER_LOCK_EXCLUSIVE, false);
+
+ if (orig_undorec->rec.type != ZSUNDO_TYPE_DELETE)
+ elog(ERROR, "unexpected undo record type %d, expected DELETE", orig_undorec->rec.type);
+
+ /* Is there space for a new TID in the record? */
+ if (orig_undorec->num_tids < ZSUNDO_NUM_TIDS_PER_DELETE)
+ {
+ pending_op = palloc(offsetof(zs_pending_undo_op, payload) + sizeof(ZSUndoRec_Delete));
+ undorec = (ZSUndoRec_Delete *) pending_op->payload;
+
+ pending_op->reservation.undobuf = buf;
+ pending_op->reservation.undorecptr = cached_undo_ptr;
+ pending_op->reservation.length = sizeof(ZSUndoRec_Delete);
+ pending_op->reservation.ptr = (char *) orig_undorec;
+ pending_op->is_update = true;
+
+ memcpy(undorec, orig_undorec, sizeof(ZSUndoRec_Delete));
+ undorec->tids[undorec->num_tids] = tid;
+ undorec->num_tids++;
+
+ return pending_op;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * Cache miss. Create a new UNDO record.
+ */
+ pending_op = palloc(offsetof(zs_pending_undo_op, payload) + sizeof(ZSUndoRec_Delete));
+ pending_op->is_update = false;
+
+ zsundo_insert_reserve(rel, sizeof(ZSUndoRec_Delete), &pending_op->reservation);
+
+ undorec = (ZSUndoRec_Delete *) pending_op->payload;
+ undorec->rec.size = sizeof(ZSUndoRec_Delete);
+ undorec->rec.type = ZSUNDO_TYPE_DELETE;
+ undorec->rec.undorecptr = pending_op->reservation.undorecptr;
+ undorec->rec.xid = xid;
+ undorec->rec.cid = cid;
+ undorec->changedPart = changedPart;
+ undorec->rec.prevundorec = prev_undo_ptr;
+ undorec->tids[0] = tid;
+ undorec->num_tids = 1;
+
+ /* XXX: this caching mechanism assumes that once we've reserved the undo record,
+ * we never change our minds and decide not to write it after all.
+ */
+ cached_relfilenode = rel->rd_node;
+ cached_xid = xid;
+ cached_cid = cid;
+ cached_changedPart = changedPart;
+ cached_prev_undo_ptr = prev_undo_ptr;
+ cached_undo_ptr = pending_op->reservation.undorecptr;
+
+ return pending_op;
+}
+
+/*
+ * Create an UNDO record for insertion.
+ *
+ * The undo record stores the 'tid' of the row, as well as visibility information.
+ *
+ * There's a primitive caching mechanism here: If you perform multiple insertions
+ * with same visibility information, and consecutive TIDs, we will keep modifying
+ * the range of TIDs in the same UNDO record, instead of creating new records.
+ * That greatly reduces the space required for UNDO log of bulk inserts.
+ */
+zs_pending_undo_op *
+zsundo_create_for_insert(Relation rel, TransactionId xid, CommandId cid, zstid tid,
+ int nitems, uint32 speculative_token, ZSUndoRecPtr prev_undo_ptr)
+{
+ ZSUndoRec_Insert *undorec;
+ zs_pending_undo_op *pending_op;
+
+ /*
+ * Create a new UNDO record.
+ */
+ pending_op = palloc(offsetof(zs_pending_undo_op, payload) + sizeof(ZSUndoRec_Insert));
+ pending_op->is_update = false;
+ zsundo_insert_reserve(rel, sizeof(ZSUndoRec_Insert), &pending_op->reservation);
+ undorec = (ZSUndoRec_Insert *) pending_op->payload;
+
+ undorec->rec.size = sizeof(ZSUndoRec_Insert);
+ undorec->rec.type = ZSUNDO_TYPE_INSERT;
+ undorec->rec.undorecptr = pending_op->reservation.undorecptr;
+ undorec->rec.xid = xid;
+ undorec->rec.cid = cid;
+ undorec->rec.prevundorec = prev_undo_ptr;
+ undorec->firsttid = tid;
+ undorec->endtid = tid + nitems;
+ undorec->speculative_token = speculative_token;
+
+ return pending_op;
+}
+
+zs_pending_undo_op *
+zsundo_create_for_update(Relation rel, TransactionId xid, CommandId cid,
+ zstid oldtid, zstid newtid, ZSUndoRecPtr prev_undo_ptr,
+ bool key_update)
+{
+ ZSUndoRec_Update *undorec;
+ zs_pending_undo_op *pending_op;
+
+ /*
+ * Create a new UNDO record.
+ */
+ pending_op = palloc(offsetof(zs_pending_undo_op, payload) + sizeof(ZSUndoRec_Update));
+ pending_op->is_update = false;
+ zsundo_insert_reserve(rel, sizeof(ZSUndoRec_Update), &pending_op->reservation);
+
+ undorec = (ZSUndoRec_Update *) pending_op->payload;
+ undorec->rec.size = sizeof(ZSUndoRec_Update);
+ undorec->rec.type = ZSUNDO_TYPE_UPDATE;
+ undorec->rec.undorecptr = pending_op->reservation.undorecptr;
+ undorec->rec.xid = xid;
+ undorec->rec.cid = cid;
+ undorec->rec.prevundorec = prev_undo_ptr;
+ undorec->oldtid = oldtid;
+ undorec->newtid = newtid;
+ undorec->key_update = key_update;
+
+ return pending_op;
+}
+
+zs_pending_undo_op *
+zsundo_create_for_tuple_lock(Relation rel, TransactionId xid, CommandId cid,
+ zstid tid, LockTupleMode lockmode,
+ ZSUndoRecPtr prev_undo_ptr)
+{
+ ZSUndoRec_TupleLock *undorec;
+ zs_pending_undo_op *pending_op;
+
+ /*
+ * Create a new UNDO record.
+ */
+ pending_op = palloc(offsetof(zs_pending_undo_op, payload) + sizeof(ZSUndoRec_TupleLock));
+ pending_op->is_update = false;
+ zsundo_insert_reserve(rel, sizeof(ZSUndoRec_TupleLock), &pending_op->reservation);
+
+ undorec = (ZSUndoRec_TupleLock *) pending_op->payload;
+ undorec->rec.size = sizeof(ZSUndoRec_TupleLock);
+ undorec->rec.type = ZSUNDO_TYPE_TUPLE_LOCK;
+ undorec->rec.undorecptr = pending_op->reservation.undorecptr;
+ undorec->rec.xid = xid;
+ undorec->rec.cid = cid;
+ undorec->rec.prevundorec = prev_undo_ptr;
+ undorec->lockmode = lockmode;
+
+ return pending_op;
+}
+
+
+/*
+ * Scan the UNDO log, starting from oldest entry. Undo the effects of any
+ * aborted transactions. Records for committed transactions can be discarded
+ * away immediately.
+ *
+ * Returns the oldest valid UNDO ptr, after discarding.
+ */
+static ZSUndoRecPtr
+zsundo_trim(Relation rel, TransactionId OldestXmin)
+{
+ /* Scan the undo log from oldest to newest */
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber firstblk;
+ BlockNumber lastblk;
+ ZSUndoRecPtr oldest_undorecptr;
+ bool can_advance_oldestundorecptr;
+ char *ptr;
+ char *endptr;
+ char *pagebuf;
+
+ pagebuf = palloc(BLCKSZ);
+
+ oldest_undorecptr = InvalidUndoPtr;
+
+ /*
+ * Ensure that only one process discards at a time. We use a page lock on the
+ * metapage for that.
+ */
+ LockPage(rel, ZS_META_BLK, ExclusiveLock);
+
+ /*
+ * Get the current oldest undo page from the metapage.
+ */
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ firstblk = metaopaque->zs_undo_head;
+
+ oldest_undorecptr = metaopaque->zs_undo_oldestptr;
+
+ /*
+ * If we assume that only one process can call TRIM at a time, then we
+ * don't need to hold the metapage locked. Alternatively, if multiple
+ * concurrent trims were possible, we could check, after reading the head
+ * page, that it is the page we expect, and re-read the metapage if it's
+ * not.
+ */
+ UnlockReleaseBuffer(metabuf);
+
+ /*
+ * Don't trim undo pages in recovery mode, to avoid writing new WAL.
+ */
+ if (RecoveryInProgress())
+ {
+ UnlockPage(rel, ZS_META_BLK, ExclusiveLock);
+ pfree(pagebuf);
+ return oldest_undorecptr;
+ }
+
+ /*
+ * Loop through UNDO records, starting from the oldest page, until we
+ * hit a record that we cannot remove.
+ */
+ lastblk = firstblk;
+ can_advance_oldestundorecptr = false;
+ while (lastblk != InvalidBlockNumber)
+ {
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the UNDO page */
+ buf = ReadBuffer(rel, lastblk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ elog(ERROR, "unexpected page id on UNDO page");
+
+ /*
+ * Make a copy of the page, because we cannot hold the
+ * lock while we reach out to the TID tree, to mark items dead.
+ * That would cause a deadlock risk (scans lock TID tree pages
+ * first, and then UNDO pages to check visibility)
+ */
+ memcpy(pagebuf, page, BLCKSZ);
+ page = pagebuf;
+ UnlockReleaseBuffer(buf);
+ buf = InvalidBuffer;
+
+ /* loop through all records on the page */
+ endptr = (char *) page + ((PageHeader) page)->pd_lower;
+ ptr = (char *) page + SizeOfPageHeaderData;
+ while (ptr < endptr)
+ {
+ ZSUndoRec *undorec = (ZSUndoRec *) ptr;
+ bool did_commit;
+
+ Assert(undorec->undorecptr.blkno == lastblk);
+
+ if (undorec->undorecptr.counter < oldest_undorecptr.counter)
+ {
+ ptr += undorec->size;
+ continue;
+ }
+ oldest_undorecptr = undorec->undorecptr;
+
+ if (!TransactionIdPrecedes(undorec->xid, OldestXmin))
+ {
+ /* This is still needed. Bail out */
+ break;
+ }
+
+ /*
+ * No one thinks this transaction is in-progress anymore. If it
+ * committed, we can just discard away its UNDO record. If it aborted,
+ * we need to apply the UNDO record first. (For deletions, it's
+ * the other way round, though.)
+ *
+ * TODO: It would be much more efficient to do these in batches.
+ * So we should just collect the TIDs to mark dead here, and pass
+ * the whole list to zsbt_tid_mark_dead() after the loop.
+ */
+ did_commit = TransactionIdDidCommit(undorec->xid);
+
+ switch (undorec->type)
+ {
+ case ZSUNDO_TYPE_INSERT:
+ if (!did_commit)
+ {
+ ZSUndoRec_Insert *insertrec = (ZSUndoRec_Insert *) undorec;
+
+ for (zstid tid = insertrec->firsttid; tid < insertrec->endtid; tid++)
+ zsbt_tid_mark_dead(rel, tid, oldest_undorecptr);
+ }
+ break;
+ case ZSUNDO_TYPE_DELETE:
+ {
+ ZSUndoRec_Delete *deleterec = (ZSUndoRec_Delete *) undorec;
+
+ if (did_commit)
+ {
+ /* The deletion is now visible to everyone */
+ for (int i = 0; i < deleterec->num_tids; i++)
+ zsbt_tid_mark_dead(rel, deleterec->tids[i], oldest_undorecptr);
+ }
+ else
+ {
+ /*
+ * must clear the item's UNDO pointer, otherwise the deletion
+ * becomes visible to everyone when the UNDO record is discarded
+ * away.
+ */
+ for (int i = 0; i < deleterec->num_tids; i++)
+ zsbt_tid_undo_deletion(rel, deleterec->tids[i], undorec->undorecptr,
+ oldest_undorecptr);
+ }
+ }
+ break;
+ case ZSUNDO_TYPE_UPDATE:
+ if (did_commit)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+
+ zsbt_tid_mark_dead(rel, updaterec->oldtid, oldest_undorecptr);
+ }
+ break;
+ }
+
+ ptr += undorec->size;
+ can_advance_oldestundorecptr = true;
+ }
+
+ if (ptr < endptr)
+ {
+ break;
+ }
+ else
+ {
+ /* We processed all records on the page. Step to the next one, if any. */
+ Assert(ptr == endptr);
+ lastblk = opaque->next;
+ }
+ }
+
+ if (can_advance_oldestundorecptr)
+ {
+ if (lastblk == InvalidBlockNumber)
+ {
+ /*
+ * We stopped after the last valid record. Advance by one, to the next
+ * record which hasn't been created yet, and which is still needed.
+ */
+ oldest_undorecptr.counter++;
+ oldest_undorecptr.blkno = InvalidBlockNumber;
+ oldest_undorecptr.offset = 0;
+ }
+
+ zsundo_discard(rel, oldest_undorecptr);
+ }
+
+ UnlockPage(rel, ZS_META_BLK, ExclusiveLock);
+
+ pfree(pagebuf);
+
+ return oldest_undorecptr;
+}
+
+void
+zsundo_finish_pending_op(zs_pending_undo_op *pendingop, char *payload)
+{
+ /*
+ * This should be used as part of a bigger critical section that
+ * writes a WAL record of the change.
+ */
+ Assert(CritSectionCount > 0);
+
+ memcpy(pendingop->reservation.ptr, payload, pendingop->reservation.length);
+
+ if (!pendingop->is_update)
+ zsundo_insert_finish(&pendingop->reservation);
+ else
+ MarkBufferDirty(pendingop->reservation.undobuf);
+}
+
+
+void
+zsundo_clear_speculative_token(Relation rel, ZSUndoRecPtr undoptr)
+{
+ ZSUndoRec_Insert *undorec;
+ Buffer buf;
+
+ undorec = (ZSUndoRec_Insert *) zsundo_fetch(rel, undoptr, &buf, BUFFER_LOCK_EXCLUSIVE, false);
+
+ if (undorec->rec.type != ZSUNDO_TYPE_INSERT)
+ elog(ERROR, "unexpected undo record type %d on speculatively inserted row",
+ undorec->rec.type);
+
+ START_CRIT_SECTION();
+
+ MarkBufferDirty(buf);
+
+ undorec->speculative_token = INVALID_SPECULATIVE_TOKEN;
+
+ /*
+ * The speculative insertion token becomes irrelevant, if we crash, so no
+ * need to WAL-log it. However, if checksums are enabled, we may need to take
+ * a full-page image of the page, if a checkpoint happened between the
+ * speculative insertion and this call.
+ */
+ if (RelationNeedsWAL(rel))
+ {
+ if (XLogHintBitIsNeeded())
+ {
+ XLogRecPtr lsn;
+
+ lsn = XLogSaveBufferForHint(buf, true);
+ PageSetLSN(BufferGetPage(buf), lsn);
+ }
+ }
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Support functions for WAL-logging the insertion/modification of an
+ * UNDO record, as part of another WAL-logged change.
+ */
+void
+XLogRegisterUndoOp(uint8 block_id, zs_pending_undo_op *undo_op)
+{
+ zs_wal_undo_op *xlrec = &undo_op->waldata;
+
+ xlrec->undoptr = undo_op->reservation.undorecptr;
+ xlrec->length = undo_op->reservation.length;
+ xlrec->is_update = undo_op->is_update;
+
+ XLogRegisterBuffer(block_id, undo_op->reservation.undobuf,
+ REGBUF_STANDARD);
+ XLogRegisterBufData(block_id, (char *) xlrec, SizeOfZSWalUndoOp);
+ XLogRegisterBufData(block_id, (char *) undo_op->payload, undo_op->reservation.length);
+}
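+
+/*
+ * Illustrative sketch, not part of the patch: roughly how a caller might tie
+ * a pending UNDO op into its own WAL record (the real callers live in the
+ * zedstore btree code; details such as block ids are made up here):
+ *
+ *     zs_pending_undo_op *op;
+ *
+ *     op = zsundo_create_for_delete(rel, xid, cid, tid, false, prev_ptr);
+ *     START_CRIT_SECTION();
+ *     ... modify the data page(s) ...
+ *     zsundo_finish_pending_op(op, op->payload);
+ *     XLogBeginInsert();
+ *     ... XLogRegisterBuffer() the data page(s) ...
+ *     XLogRegisterUndoOp(1, op);
+ *     recptr = XLogInsert(RM_ZEDSTORE_ID, ...);
+ *     ... PageSetLSN() on the data page(s) and on op->reservation.undobuf ...
+ *     END_CRIT_SECTION();
+ *     UnlockReleaseBuffer(op->reservation.undobuf);
+ */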
+
+/* redo support for the above */
+Buffer
+XLogRedoUndoOp(XLogReaderState *record, uint8 block_id)
+{
+ Buffer buffer;
+ zs_pending_undo_op op;
+
+ if (XLogReadBufferForRedo(record, block_id, &buffer) == BLK_NEEDS_REDO)
+ {
+ zs_wal_undo_op xlrec;
+ Size len;
+ char *p = XLogRecGetBlockData(record, block_id, &len);
+
+ Assert(len >= SizeOfZSWalUndoOp);
+
+ memcpy(&xlrec, p, SizeOfZSWalUndoOp);
+ p += SizeOfZSWalUndoOp;
+ len -= SizeOfZSWalUndoOp;
+ Assert(xlrec.length == len);
+
+ op.reservation.undobuf = buffer;
+ op.reservation.undorecptr = xlrec.undoptr;
+ op.reservation.length = xlrec.length;
+ op.reservation.ptr = ((char *) BufferGetPage(buffer)) + xlrec.undoptr.offset;
+ op.is_update = xlrec.is_update;
+
+ START_CRIT_SECTION();
+ zsundo_finish_pending_op(&op, p);
+ END_CRIT_SECTION();
+ }
+ return buffer;
+}
+
+
+
+static bool
+zs_lazy_tid_reaped(ItemPointer itemptr, void *state)
+{
+ ZSVacRelStats *vacrelstats = (ZSVacRelStats *) state;
+ zstid tid = ZSTidFromItemPointer(*itemptr);
+
+ return intset_is_member(vacrelstats->dead_tids, tid);
+}
+
+/*
+ * Entry point of VACUUM for zedstore tables.
+ *
+ * Vacuum on a zedstore table works quite differently from the heap. We don't
+ * scan the table. Instead, we scan just the active UNDO log, and remove any
+ * garbage left behind by aborts or deletions based on the UNDO log.
+ */
+void
+zsundo_vacuum(Relation rel, VacuumParams *params, BufferAccessStrategy bstrategy,
+ TransactionId OldestXmin)
+{
+ ZSVacRelStats *vacrelstats;
+ Relation *Irel;
+ int nindexes;
+ IndexBulkDeleteResult **indstats;
+ zstid starttid;
+ zstid endtid;
+ uint64 num_live_tuples;
+
+ /* do nothing if the table is completely empty. */
+ if (RelationGetTargetBlock(rel) == 0 ||
+ RelationGetTargetBlock(rel) == InvalidBlockNumber)
+ {
+ /* don't believe the cached value without checking */
+ BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
+
+ RelationSetTargetBlock(rel, nblocks);
+ if (nblocks == 0)
+ return;
+ }
+
+ /*
+ * Scan the UNDO log, and discard what we can.
+ */
+ (void) zsundo_trim(rel, RecentGlobalXmin);
+
+ vacrelstats = (ZSVacRelStats *) palloc0(sizeof(ZSVacRelStats));
+
+ if (params->options & VACOPT_VERBOSE)
+ vacrelstats->elevel = INFO;
+ else
+ vacrelstats->elevel = DEBUG2;
+ vacrelstats->vac_strategy = bstrategy;
+
+ /* Open all indexes of the relation */
+ vac_open_indexes(rel, RowExclusiveLock, &nindexes, &Irel);
+ vacrelstats->hasindex = (nindexes > 0);
+ indstats = (IndexBulkDeleteResult **)
+ palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
+
+ ereport(vacrelstats->elevel,
+ (errmsg("vacuuming \"%s.%s\"",
+ get_namespace_name(RelationGetNamespace(rel)),
+ RelationGetRelationName(rel))));
+
+ starttid = MinZSTid;
+ num_live_tuples = 0;
+ do
+ {
+ IntegerSet *dead_tids;
+
+ /* Scan the TID tree, to collect TIDs that have been marked dead. */
+ dead_tids = zsbt_collect_dead_tids(rel, starttid, &endtid, &num_live_tuples);
+ vacrelstats->dead_tids = dead_tids;
+
+ if (intset_num_entries(dead_tids) > 0)
+ {
+ /* Remove index entries */
+ for (int i = 0; i < nindexes; i++)
+ lazy_vacuum_index(Irel[i],
+ &indstats[i],
+ vacrelstats);
+
+ /*
+ * Remove the attribute data for the dead rows, and finally their
+ * TID tree entries.
+ */
+ for (int attno = 1; attno <= RelationGetNumberOfAttributes(rel); attno++)
+ zsbt_attr_remove(rel, attno, dead_tids);
+ zsbt_tid_remove(rel, dead_tids);
+ }
+
+ ereport(vacrelstats->elevel,
+ (errmsg("\"%s\": removed " UINT64_FORMAT " row versions",
+ RelationGetRelationName(rel),
+ intset_num_entries(dead_tids))));
+
+ starttid = endtid;
+ } while(starttid < MaxPlusOneZSTid);
+
+ /* Do post-vacuum cleanup and statistics update for each index */
+ for (int i = 0; i < nindexes; i++)
+ lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
+
+ /* Done with indexes */
+ vac_close_indexes(nindexes, Irel, NoLock);
+
+ /*
+ * Update pg_class to reflect the new info we know. The main thing we know
+ * for sure here is whether the relation has indexes. Use OldestXmin as the
+ * new frozenxid. Since we don't know the new multixact ID, pass it as
+ * invalid to avoid updating it. Also pass false for relallvisible, since we
+ * don't know that either.
+ *
+ * FIXME: pass correct numbers for other arguments.
+ */
+ vac_update_relstats(rel,
+ RelationGetNumberOfBlocks(rel),
+ num_live_tuples,
+ false,
+ nindexes > 0,
+ OldestXmin,
+ InvalidMultiXactId,
+ false);
+
+ /* report results to the stats collector, too */
+ pgstat_report_vacuum(RelationGetRelid(rel),
+ rel->rd_rel->relisshared,
+ num_live_tuples,
+ 0); /* FIXME: # of dead tuples */
+}
+
+/*
+ * lazy_vacuum_index() -- vacuum one index relation.
+ *
+ * Delete all the index entries pointing to tuples listed in
+ * vacrelstats->dead_tuples, and update running statistics.
+ */
+static void
+lazy_vacuum_index(Relation indrel,
+ IndexBulkDeleteResult **stats,
+ ZSVacRelStats *vacrelstats)
+{
+ IndexVacuumInfo ivinfo;
+ PGRUsage ru0;
+
+ pg_rusage_init(&ru0);
+
+ ivinfo.index = indrel;
+ ivinfo.analyze_only = false;
+ ivinfo.estimated_count = true;
+ ivinfo.message_level = vacrelstats->elevel;
+ /* We can only provide an approximate value of num_heap_tuples here */
+ ivinfo.num_heap_tuples = vacrelstats->old_live_tuples;
+ ivinfo.strategy = vacrelstats->vac_strategy;
+
+ /* Do bulk deletion */
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ zs_lazy_tid_reaped, (void *) vacrelstats);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
+ RelationGetRelationName(indrel),
+ intset_num_entries(vacrelstats->dead_tids)),
+ errdetail_internal("%s", pg_rusage_show(&ru0))));
+}
+
+/*
+ * lazy_cleanup_index() -- do post-vacuum cleanup for one index relation.
+ */
+static void
+lazy_cleanup_index(Relation indrel,
+ IndexBulkDeleteResult *stats,
+ ZSVacRelStats *vacrelstats)
+{
+ IndexVacuumInfo ivinfo;
+ PGRUsage ru0;
+
+ pg_rusage_init(&ru0);
+
+ ivinfo.index = indrel;
+ ivinfo.analyze_only = false;
+ ivinfo.estimated_count = (vacrelstats->tupcount_pages < vacrelstats->rel_pages);
+ ivinfo.message_level = vacrelstats->elevel;
+
+ /*
+ * Now we can provide a better estimate of total number of surviving
+ * tuples (we assume indexes are more interested in that than in the
+ * number of nominally live tuples).
+ */
+ ivinfo.num_heap_tuples = vacrelstats->new_rel_tuples;
+ ivinfo.strategy = vacrelstats->vac_strategy;
+
+ stats = index_vacuum_cleanup(&ivinfo, stats);
+
+ if (!stats)
+ return;
+
+ /*
+ * Now update statistics in pg_class, but only if the index says the count
+ * is accurate.
+ */
+ if (!stats->estimated_count)
+ vac_update_relstats(indrel,
+ stats->num_pages,
+ stats->num_index_tuples,
+ 0,
+ false,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ false);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("index \"%s\" now contains %.0f row versions in %u pages",
+ RelationGetRelationName(indrel),
+ stats->num_index_tuples,
+ stats->num_pages),
+ errdetail("%.0f index row versions were removed.\n"
+ "%u index pages have been deleted, %u are currently reusable.\n"
+ "%s.",
+ stats->tuples_removed,
+ stats->pages_deleted, stats->pages_free,
+ pg_rusage_show(&ru0))));
+
+ pfree(stats);
+}
+
+
+/*
+ * Return the current "Oldest undo pointer". The effects of any actions with
+ * undo pointer older than this is known to be visible to everyone. (i.e.
+ * an inserted tuple is known to be visible, and a deleted tuple is known to
+ * be invisible.)
+ *
+ * If 'attempt_trim' is true, this not only gets the current oldest UNDO pointer,
+ * but tries to first advance it as much as possible, by scanning and discarding
+ * old UNDO log. That's pretty expensive, but fetching records from the UNDO log
+ * is very expensive, too, so until that is somehow sped up, it is a good tradeoff
+ * to advance the discard pointer aggressively. It is only safe to trim the UNDO
+ * log when you're not holding any other page locks, however.
+ */
+ZSUndoRecPtr
+zsundo_get_oldest_undo_ptr(Relation rel, bool attempt_trim)
+{
+ ZSUndoRecPtr result;
+
+ /* do nothing if the table is completely empty. */
+ if (RelationGetTargetBlock(rel) == 0 ||
+ RelationGetTargetBlock(rel) == InvalidBlockNumber)
+ {
+ /* don't believe a cached 0 size without checking */
+ BlockNumber nblocks;
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ RelationSetTargetBlock(rel, nblocks);
+ if (nblocks == 0)
+ return InvalidUndoPtr;
+ }
+
+ /*
+ * If the caller asked for trimming the UNDO log, do that. Otherwise,
+ * just get the current value from the metapage.
+ */
+ if (attempt_trim)
+ result = zsundo_trim(rel, RecentGlobalXmin);
+ else
+ {
+ /*
+ * Get the current oldest undo page from the metapage.
+ */
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ result = metaopaque->zs_undo_oldestptr;
+ UnlockReleaseBuffer(metabuf);
+ }
+ return result;
+}
diff --git src/backend/access/zedstore/zedstore_visibility.c src/backend/access/zedstore/zedstore_visibility.c
new file mode 100644
index 0000000000..c0f189b9a9
--- /dev/null
+++ src/backend/access/zedstore/zedstore_visibility.c
@@ -0,0 +1,930 @@
+/*
+ * zedstore_visibility.c
+ * Routines for MVCC in Zedstore
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_visibility.c
+ */
+#include "postgres.h"
+
+#include "access/tableam.h"
+#include "access/xact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undorec.h"
+#include "storage/procarray.h"
+
+static bool
+zs_tuplelock_compatible(LockTupleMode mode, LockTupleMode newmode)
+{
+ switch (newmode)
+ {
+ case LockTupleKeyShare:
+ return mode == LockTupleKeyShare ||
+ mode == LockTupleShare ||
+ mode == LockTupleNoKeyExclusive;
+
+ case LockTupleShare:
+ return mode == LockTupleKeyShare ||
+ mode == LockTupleShare;
+
+ case LockTupleNoKeyExclusive:
+ return mode == LockTupleKeyShare;
+ case LockTupleExclusive:
+ return false;
+
+ default:
+ elog(ERROR, "unknown tuple lock mode %d", newmode);
+ }
+}
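+
+/*
+ * Summary of the compatibility rules above (rows: lock already held,
+ * columns: lock being requested):
+ *
+ *                  KeyShare   Share   NoKeyExcl   Excl
+ *   KeyShare          Y         Y         Y        N
+ *   Share             Y         Y         N        N
+ *   NoKeyExcl         Y         N         N        N
+ *   Excl              N         N         N        N
+ */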
+
+static bool
+am_i_holding_lock(Relation rel, ZSUndoRecPtr undo_ptr,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ ZSUndoRec *undorec;
+
+ for (;;)
+ {
+ /* Is it visible? */
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ return false;
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ return false;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE)
+ {
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true;
+ }
+ undo_ptr = undorec->prevundorec;
+ }
+}
+
+/*
+ * Like HeapTupleSatisfiesUpdate.
+ *
+ * When this returns TM_Ok, it also returns a flag in *undo_record_needed, to indicate
+ * whether the old UNDO record is still of interest to anyone. If the old record
+ * belonged to an aborted deleting transaction, for example, it can be ignored.
+ *
+ * This does more than HeapTupleSatisfiesUpdate. If HeapTupleSatisfiesUpdate sees
+ * an updated or locked tuple, it returns TM_BeingUpdated, and the caller has to
+ * check if the tuple lock is compatible with the update. zs_SatisfiesUpdate
+ * checks if the new lock mode is compatible with the old one, and returns TM_Ok
+ * if so. Waiting for conflicting locks is left to the caller.
+ *
+ * This is also used for tuple locking (e.g. SELECT FOR UPDATE). 'mode' indicates
+ * the lock mode. For a genuine UPDATE, pass LockTupleExclusive or
+ * LockTupleNoKeyExclusive depending on whether key columns are being modified.
+ *
+ * If the tuple was UPDATEd, *next_tid is set to the TID of the new row version.
+ */
+TM_Result
+zs_SatisfiesUpdate(Relation rel, Snapshot snapshot,
+ ZSUndoRecPtr recent_oldest_undo,
+ zstid item_tid, ZSUndoRecPtr item_undoptr,
+ LockTupleMode mode,
+ bool *undo_record_needed, bool *this_xact_has_lock,
+ TM_FailureData *tmfd,
+ zstid *next_tid, ZSUndoSlotVisibility *visi_info)
+{
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+ int chain_depth = 0;
+
+ *this_xact_has_lock = false;
+ *undo_record_needed = true;
+
+ undo_ptr = item_undoptr;
+
+fetch_undo_record:
+ chain_depth++;
+
+retry_fetch:
+ /* Is it visible? */
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ {
+ /*
+ * The old UNDO record is no longer visible to anyone, so we don't
+ * need to keep it. That only holds for the record directly
+ * referenced from the item, though; if we got here by following the
+ * chain, the item's undo pointer must be kept. For example, if there
+ * is a chain (item -> LOCK_TUPLE -> INSERT), and the INSERT record is
+ * no longer needed by anyone, we must still keep the pointer to the
+ * LOCK record.
+ */
+ if (chain_depth == 1)
+ *undo_record_needed = false;
+
+ if (visi_info)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ }
+ return TM_Ok;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto retry_fetch;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ if (visi_info)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ }
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ *this_xact_has_lock = true;
+ if (undorec->cid >= snapshot->curcid)
+ return TM_Invisible; /* inserted after scan started */
+ }
+ else if (TransactionIdIsInProgress(undorec->xid))
+ return TM_Invisible; /* inserter has not committed yet */
+ else if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /* it must have aborted or crashed */
+ return TM_Invisible;
+ }
+
+ /* The tuple is visible to us. But can we lock it? */
+
+ /*
+ * The insert is visible to us. Still follow the chain to the
+ * previous UNDO record; there might be tuple locks that we need to
+ * check against.
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ ZSUndoRec_TupleLock *lock_undorec = (ZSUndoRec_TupleLock *) undorec;
+
+ /*
+ * If any subtransaction of the current top transaction already holds
+ * a lock as strong as or stronger than what we're requesting, we
+ * effectively hold the desired lock already. We *must* succeed
+ * without trying to take the tuple lock, else we will deadlock
+ * against anyone wanting to acquire a stronger lock.
+ */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ *this_xact_has_lock = true;
+ if (lock_undorec->lockmode >= mode)
+ {
+ *undo_record_needed = true;
+ return TM_Ok;
+ }
+ }
+ else if (!zs_tuplelock_compatible(lock_undorec->lockmode, mode) &&
+ TransactionIdIsInProgress(undorec->xid))
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+
+ /* but am I holding a weaker lock already? */
+ if (!*this_xact_has_lock)
+ *this_xact_has_lock = am_i_holding_lock(rel, undorec->prevundorec, recent_oldest_undo);
+
+ return TM_BeingModified;
+ }
+
+ /*
+ * No conflict with this lock. Look at the previous UNDO record, there
+ * might be more locks.
+ *
+ * FIXME: Shouldn't we drill down to the INSERT record and check if
+ * that's visible to us first, before looking at the lockers?
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE)
+ {
+ ZSUndoRec_Delete *deleterec = (ZSUndoRec_Delete *) undorec;
+ if (visi_info)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ }
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ *this_xact_has_lock = true;
+ if (undorec->cid >= snapshot->curcid)
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = undorec->cid;
+ return TM_SelfModified; /* deleted/updated after scan started */
+ }
+ else
+ return TM_Invisible; /* deleted before scan started */
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+
+ /* but am I holding a weaker lock already? */
+ if (!*this_xact_has_lock)
+ *this_xact_has_lock = am_i_holding_lock(rel, undorec->prevundorec, recent_oldest_undo);
+
+ return TM_BeingModified;
+ }
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /* deleter must have aborted or crashed. We have to keep following the
+ * undo chain, in case there are LOCK records that are still visible
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+ if (deleterec->changedPart)
+ {
+ ItemPointerSet(&tmfd->ctid, MovedPartitionsBlockNumber, MovedPartitionsOffsetNumber);
+ *next_tid = InvalidZSTid;
+ return TM_Updated;
+ }
+ else
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ return TM_Deleted;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ /* updated-away tuple */
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ LockTupleMode old_lockmode;
+ if (visi_info)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ }
+
+ *next_tid = updaterec->newtid;
+ old_lockmode = updaterec->key_update ? LockTupleExclusive : LockTupleNoKeyExclusive;
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ *this_xact_has_lock = true;
+ if (zs_tuplelock_compatible(old_lockmode, mode))
+ return TM_Ok;
+
+ if (undorec->cid >= snapshot->curcid)
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = undorec->cid;
+ return TM_SelfModified; /* deleted/updated after scan started */
+ }
+ else
+ return TM_Invisible; /* deleted before scan started */
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ {
+ if (zs_tuplelock_compatible(old_lockmode, mode))
+ return TM_Ok;
+
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+
+ /* but am I holding a weaker lock already? */
+ if (!*this_xact_has_lock)
+ *this_xact_has_lock = am_i_holding_lock(rel, undorec->prevundorec, recent_oldest_undo);
+
+ return TM_BeingModified;
+ }
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /* deleter must have aborted or crashed. We have to keep following the
+ * undo chain, in case there are LOCK records that are still visible
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ if (zs_tuplelock_compatible(old_lockmode, mode))
+ return TM_Ok;
+
+ tmfd->ctid = ItemPointerFromZSTid(updaterec->newtid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+ return TM_Updated;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+
+/*
+ * Like HeapTupleSatisfiesAny
+ */
+static bool
+zs_SatisfiesAny(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr, ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ undo_ptr = item_undoptr;
+
+fetch_undo_record:
+ /* If this record is "old", then the record is visible. */
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ return true;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE ||
+ undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+
+ return true;
+}
+
+/*
+ * Helper function for zs_SatisfiesMVCC(): check if the given XID is visible
+ * to the snapshot. *aborted is set to true if the XID is known to have
+ * aborted (or crashed).
+ */
+static bool
+xid_is_visible(Snapshot snapshot, TransactionId xid, CommandId cid, bool *aborted)
+{
+ *aborted = false;
+ if (TransactionIdIsCurrentTransactionId(xid))
+ {
+ if (cid >= snapshot->curcid)
+ return false;
+ else
+ return true;
+ }
+ else if (XidInMVCCSnapshot(xid, snapshot))
+ return false;
+ else if (TransactionIdDidCommit(xid))
+ {
+ return true;
+ }
+ else
+ {
+ /* it must have aborted or crashed */
+ *aborted = true;
+ return false;
+ }
+}
+
+/*
+ * Like HeapTupleSatisfiesMVCC
+ */
+static bool
+zs_SatisfiesMVCC(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr,
+ TransactionId *obsoleting_xid, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ Snapshot snapshot = scan->snapshot;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+ bool aborted;
+
+ undo_ptr = item_undoptr;
+
+fetch_undo_record:
+ /* If this record is "old", then the record is visible. */
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ /* Inserted tuple */
+ bool result;
+ result = xid_is_visible(snapshot, undorec->xid, undorec->cid, &aborted);
+ if (!result && !aborted)
+ *obsoleting_xid = undorec->xid;
+
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ return result;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* we don't care about tuple locks here. Follow the link to the
+ * previous UNDO record for this tuple. */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ if (next_tid)
+ *next_tid = updaterec->newtid;
+ }
+
+ /*
+ * Deleted or updated-away. They are treated the same in an MVCC snapshot.
+ * They only need different treatment when updating or locking the row,
+ * in SatisfiesUpdate().
+ */
+ if (xid_is_visible(snapshot, undorec->xid, undorec->cid, &aborted))
+ {
+ /* we can see the deletion */
+ return false;
+ }
+ else
+ {
+ if (!aborted)
+ *obsoleting_xid = undorec->xid;
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * Like HeapTupleSatisfiesSelf
+ */
+static bool
+zs_SatisfiesSelf(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr,
+ zstid *next_tid, ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ ZSUndoRec *undorec;
+ ZSUndoRecPtr undo_ptr;
+
+ undo_ptr = item_undoptr;
+
+fetch_undo_record:
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+
+ /* Inserted tuple */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true; /* inserted by me */
+ else if (TransactionIdIsInProgress(undorec->xid))
+ return false;
+ else if (TransactionIdDidCommit(undorec->xid))
+ return true;
+ else
+ {
+ /* it must have aborted or crashed */
+ return false;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* we don't care about tuple locks here. Follow the link to the
+ * previous UNDO record for this tuple. */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ if (next_tid)
+ *next_tid = updaterec->newtid;
+ }
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ /* deleted by me */
+ return false;
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true;
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * Deleter must have aborted or crashed. But we have to keep following the
+ * undo chain, to check if the insertion was visible in the first
+ * place.
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ return false;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * Like HeapTupleSatisfiesDirty
+ */
+static bool
+zs_SatisfiesDirty(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr,
+ zstid *next_tid, ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ Snapshot snapshot = scan->snapshot;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ snapshot->xmin = snapshot->xmax = InvalidTransactionId;
+ snapshot->speculativeToken = INVALID_SPECULATIVE_TOKEN;
+
+ undo_ptr = item_undoptr;
+
+fetch_undo_record:
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ ZSUndoRec_Insert *insertrec = (ZSUndoRec_Insert *) undorec;
+ snapshot->speculativeToken = insertrec->speculative_token;
+
+ /*
+ * HACK: For SnapshotDirty, the xmin/xmax/speculative-token fields of
+ * the snapshot itself need to be filled in based on the tuple. Hence,
+ * we can't set the visi_info values here like we do for the other
+ * snapshot types; we only set them in the TransactionIdIsInProgress()
+ * case.
+ */
+
+ /* Inserted tuple */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true; /* inserted by me */
+ else if (TransactionIdIsInProgress(undorec->xid))
+ {
+ snapshot->xmin = undorec->xid;
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ return true;
+ }
+ else if (TransactionIdDidCommit(undorec->xid))
+ {
+ return true;
+ }
+ else
+ {
+ /* it must have aborted or crashed */
+ return false;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* locked tuple. */
+ /* look at the previous UNDO record to find the insert record */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ if (next_tid)
+ *next_tid = updaterec->newtid;
+ }
+
+ /* deleted or updated-away tuple */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ /* deleted by me */
+ return false;
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ {
+ /*
+ * TODO: is setting the snapshot's xmax here still needed? The caller
+ * populates it later, based on visi_info.
+ */
+ snapshot->xmax = undorec->xid;
+ visi_info->xmax = undorec->xid;
+ return true;
+ }
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * Deleter must have aborted or crashed. But we have to keep following the
+ * undo chain, to check if the insertion was visible in the first
+ * place.
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ return false;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * True if tuple might be visible to some transaction; false if it's
+ * surely dead to everyone, ie, vacuumable.
+ */
+static bool
+zs_SatisfiesNonVacuumable(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr,
+ ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ TransactionId OldestXmin = scan->snapshot->xmin;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ Assert(TransactionIdIsValid(OldestXmin));
+
+ undo_ptr = item_undoptr;
+
+fetch_undo_record:
+
+ /* If this record is "old", then the record is visible. */
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+
+ /* Inserted tuple */
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true; /* inserter has not committed yet */
+
+ if (TransactionIdDidCommit(undorec->xid))
+ return true;
+
+ /* it must have aborted or crashed */
+ return false;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ /* deleted or updated-away tuple */
+ ZSUndoRecPtr prevptr;
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true; /* delete-in-progress */
+ else if (TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * Deleter committed. But perhaps it was recent enough that some open
+ * transactions could still see the tuple.
+ */
+ if (!TransactionIdPrecedes(undorec->xid, OldestXmin))
+ {
+ visi_info->nonvacuumable_status = ZSNV_RECENTLY_DEAD;
+ return true;
+ }
+
+ return false;
+ }
+
+ /*
+ * The deleting transaction did not commit. But before concluding
+ * that the tuple is live, we have to check if the inserting
+ * XID is live.
+ */
+ do {
+ prevptr = undorec->prevundorec;
+
+ if (prevptr.counter < scan->recent_oldest_undo.counter)
+ return true;
+ undorec = zsundo_fetch_record(rel, prevptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (prevptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ prevptr.counter, prevptr.blkno, prevptr.offset);
+ return true;
+ }
+ } while (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK);
+
+ Assert(undorec->type == ZSUNDO_TYPE_INSERT);
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true; /* insert-in-progress */
+ else if (TransactionIdDidCommit(undorec->xid))
+ return true; /* inserter committed */
+
+ /* inserter must have aborted or crashed */
+ return false;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* look at the previous UNDO record, to find the Insert record */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * Like HeapTupleSatisfiesVisibility
+ *
+ * If next_tid is not NULL and the tuple was UPDATEd, *next_tid is set to the
+ * TID of the new row version.
+ */
+bool
+zs_SatisfiesVisibility(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr,
+ TransactionId *obsoleting_xid, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info)
+{
+ ZSUndoRecPtr undo_ptr;
+
+ /* initialize as invalid, if we find valid one populate the same */
+ if (next_tid)
+ *next_tid = InvalidZSTid;
+
+ /* The caller should've filled in the recent_oldest_undo pointer */
+ Assert(scan->recent_oldest_undo.counter != 0);
+
+ *obsoleting_xid = InvalidTransactionId;
+
+ /*
+ * Items with an invalid undo record are considered visible. Normally,
+ * only the META column stores a valid undo record; all other columns
+ * store an invalid undo pointer. The visibility check is performed on
+ * the META column, and the other columns are fetched only if it is
+ * visible. For in-place updates, however, columns other than the META
+ * column may have a valid undo record, in which case the visibility
+ * check needs to be performed on them as well. The META column can
+ * sometimes also have items with an invalid undo pointer, see
+ * zsbt_undo_item_deletion().
+ */
+ undo_ptr = item_undoptr;
+ if (!IsZSUndoRecPtrValid(&undo_ptr))
+ return true;
+
+ switch (scan->snapshot->snapshot_type)
+ {
+ case SNAPSHOT_MVCC:
+ return zs_SatisfiesMVCC(scan, item_undoptr, obsoleting_xid, next_tid, visi_info);
+
+ case SNAPSHOT_SELF:
+ return zs_SatisfiesSelf(scan, item_undoptr, next_tid, visi_info);
+
+ case SNAPSHOT_ANY:
+ return zs_SatisfiesAny(scan, item_undoptr, visi_info);
+
+ case SNAPSHOT_TOAST:
+ elog(ERROR, "SnapshotToast not implemented in zedstore");
+ break;
+
+ case SNAPSHOT_DIRTY:
+ return zs_SatisfiesDirty(scan, item_undoptr, next_tid, visi_info);
+
+ case SNAPSHOT_HISTORIC_MVCC:
+ elog(ERROR, "SnapshotHistoricMVCC not implemented in zedstore yet");
+ break;
+
+ case SNAPSHOT_NON_VACUUMABLE:
+ return zs_SatisfiesNonVacuumable(scan, item_undoptr, visi_info);
+ }
+
+ return false; /* keep compiler quiet */
+}
diff --git src/backend/access/zedstore/zedstore_wal.c src/backend/access/zedstore/zedstore_wal.c
new file mode 100644
index 0000000000..c33dd65363
--- /dev/null
+++ src/backend/access/zedstore/zedstore_wal.c
@@ -0,0 +1,107 @@
+/*
+ * zedstore_wal.c
+ * WAL-logging for zedstore.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_wal.c
+ */
+#include "postgres.h"
+
+#include "access/bufmask.h"
+#include "access/xlogreader.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undolog.h"
+#include "access/zedstore_undorec.h"
+#include "access/zedstore_wal.h"
+
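+/*
+ * Redo routine for zedstore WAL records: dispatch to the appropriate handler
+ * based on the record's info bits.
+ */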
+void
+zedstore_redo(XLogReaderState *record)
+{
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ switch (info)
+ {
+ case WAL_ZEDSTORE_INIT_METAPAGE:
+ zsmeta_initmetapage_redo(record);
+ break;
+ case WAL_ZEDSTORE_UNDO_NEWPAGE:
+ zsundo_newpage_redo(record);
+ break;
+ case WAL_ZEDSTORE_UNDO_DISCARD:
+ zsundo_discard_redo(record);
+ break;
+ case WAL_ZEDSTORE_BTREE_NEW_ROOT:
+ zsmeta_new_btree_root_redo(record);
+ break;
+ case WAL_ZEDSTORE_BTREE_REWRITE_PAGES:
+ zsbt_rewrite_pages_redo(record);
+ break;
+ case WAL_ZEDSTORE_TIDLEAF_ADD_ITEMS:
+ zsbt_tidleaf_items_redo(record, false);
+ break;
+ case WAL_ZEDSTORE_TIDLEAF_REPLACE_ITEM:
+ zsbt_tidleaf_items_redo(record, true);
+ break;
+ case WAL_ZEDSTORE_ATTSTREAM_CHANGE:
+ zsbt_attstream_change_redo(record);
+ break;
+ case WAL_ZEDSTORE_TOAST_NEWPAGE:
+ zstoast_newpage_redo(record);
+ break;
+ case WAL_ZEDSTORE_FPM_DELETE_PAGE:
+ zspage_delete_page_redo(record);
+ break;
+ case WAL_ZEDSTORE_FPM_REUSE_PAGE:
+ zspage_reuse_page_redo(record);
+ break;
+ default:
+ elog(PANIC, "zedstore_redo: unknown op code %u", info);
+ }
+}
+
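+/*
+ * Mask out the parts of a zedstore page that can legitimately differ between
+ * the primary and a WAL-replayed copy, for wal_consistency_checking.
+ */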
+void
+zedstore_mask(char *pagedata, BlockNumber blkno)
+{
+ Page page = (Page) pagedata;
+ uint16 page_id;
+
+ mask_page_lsn_and_checksum(page);
+
+ page_id = *(uint16 *) (pagedata + BLCKSZ - sizeof(uint16));
+
+ if (blkno == ZS_META_BLK)
+ {
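+ /* nothing on the metapage needs to be masked */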
+ }
+ else if (page_id == ZS_UNDO_PAGE_ID && PageGetSpecialSize(page) == sizeof(ZSUndoPageOpaque))
+ {
+ /*
+ * On INSERT undo records, mask out speculative insertion tokens.
+ */
+ char *endptr = pagedata + ((PageHeader) pagedata)->pd_lower;
+ char *ptr;
+
+ ptr = pagedata + SizeOfPageHeaderData;
+
+ while (ptr < endptr)
+ {
+ ZSUndoRec *undorec = (ZSUndoRec *) ptr;
+
+ /* minimal validation */
+ if (undorec->size < sizeof(ZSUndoRec) || ptr + undorec->size > endptr)
+ break;
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ ((ZSUndoRec_Insert *) undorec)->speculative_token = MASK_MARKER;
+ }
+
+ ptr += undorec->size;
+ }
+ }
+
+ return;
+}
diff --git src/backend/access/zedstore/zedstoream_handler.c src/backend/access/zedstore/zedstoream_handler.c
new file mode 100644
index 0000000000..e13fe2cec2
--- /dev/null
+++ src/backend/access/zedstore/zedstoream_handler.c
@@ -0,0 +1,3197 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstoream_handler.c
+ * ZedStore table access method code
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstoream_handler.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/genam.h"
+#include "access/heapam.h"
+#include "access/multixact.h"
+#include "access/relscan.h"
+#include "access/tableam.h"
+#include "access/tsmapi.h"
+#include "access/tupdesc_details.h"
+#include "access/xact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undorec.h"
+#include "catalog/catalog.h"
+#include "catalog/index.h"
+#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
+#include "commands/progress.h"
+#include "commands/vacuum.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "optimizer/plancat.h"
+#include "pgstat.h"
+#include "parser/parse_relation.h"
+#include "storage/lmgr.h"
+#include "storage/predicate.h"
+#include "storage/procarray.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+
+typedef struct ZedStoreProjectData
+{
+ int num_proj_atts;
+ Bitmapset *project_columns;
+ int *proj_atts;
+ ZSTidTreeScan tid_scan;
+ ZSAttrTreeScan *attr_scans;
+ MemoryContext context;
+} ZedStoreProjectData;
+
+typedef struct ZedStoreDescData
+{
+ /* scan parameters */
+ TableScanDescData rs_scan; /* AM-independent part of the scan descriptor */
+ ZedStoreProjectData proj_data;
+
+ bool started;
+ zstid cur_range_start;
+ zstid cur_range_end;
+
+ /* These fields are used for bitmap scans, to hold a "block's" worth of data */
+#define MAX_ITEMS_PER_LOGICAL_BLOCK MaxHeapTuplesPerPage
+ int bmscan_ntuples;
+ zstid *bmscan_tids;
+ int bmscan_nexttuple;
+
+ /* These fields are used for TABLESAMPLE scans */
+ zstid min_tid_to_scan;
+ zstid max_tid_to_scan;
+ zstid next_tid_to_scan;
+
+} ZedStoreDescData;
+
+typedef struct ZedStoreDescData *ZedStoreDesc;
+
+typedef struct ZedStoreIndexFetchData
+{
+ IndexFetchTableData idx_fetch_data;
+ ZedStoreProjectData proj_data;
+} ZedStoreIndexFetchData;
+
+typedef struct ZedStoreIndexFetchData *ZedStoreIndexFetch;
+
+typedef struct ParallelZSScanDescData *ParallelZSScanDesc;
+
+static IndexFetchTableData *zedstoream_begin_index_fetch(Relation rel);
+static void zedstoream_end_index_fetch(IndexFetchTableData *scan);
+static bool zedstoream_fetch_row(ZedStoreIndexFetchData *fetch,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot);
+static bool zs_acquire_tuplock(Relation relation, ItemPointer tid, LockTupleMode mode,
+ LockWaitPolicy wait_policy, bool *have_tuple_lock);
+
+static bool zs_blkscan_next_block(TableScanDesc sscan,
+ BlockNumber blkno, OffsetNumber *offsets, int noffsets,
+ bool predicatelocks);
+static bool zs_blkscan_next_tuple(TableScanDesc sscan, TupleTableSlot *slot);
+
+static Size zs_parallelscan_estimate(Relation rel);
+static Size zs_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan);
+static void zs_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan);
+static bool zs_parallelscan_nextrange(Relation rel, ParallelZSScanDesc pzscan,
+ zstid *start, zstid *end);
+static void zsbt_fill_missing_attribute_value(TupleDesc tupleDesc, int attno, Datum *datum, bool *isnull);
+
+/* ----------------------------------------------------------------
+ * storage AM support routines for zedstoream
+ * ----------------------------------------------------------------
+ */
+
+static bool
+zedstoream_fetch_row_version(Relation rel,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot)
+{
+ IndexFetchTableData *fetcher;
+ bool result;
+
+ zsbt_tuplebuffer_flush(rel);
+
+ fetcher = zedstoream_begin_index_fetch(rel);
+
+ result = zedstoream_fetch_row((ZedStoreIndexFetchData *) fetcher,
+ tid_p, snapshot, slot);
+ if (result)
+ {
+ /*
+ * FIXME: heapam acquires the predicate lock first, and then calls
+ * CheckForSerializableConflictOut(). We do it in the opposite order,
+ * because the CheckForSerializableConflictOut() call is already done
+ * in zsbt_get_last_tid(). Does it matter? I'm not sure.
+ *
+ * We pass in InvalidTransactionId, as we are sure that the current
+ * transaction hasn't locked tid_p.
+ */
+ PredicateLockTID(rel, tid_p, snapshot, InvalidTransactionId);
+ }
+ ExecMaterializeSlot(slot);
+ slot->tts_tableOid = RelationGetRelid(rel);
+ slot->tts_tid = *tid_p;
+
+ zedstoream_end_index_fetch(fetcher);
+
+ return result;
+}
+
+static void
+zedstoream_get_latest_tid(TableScanDesc sscan,
+ ItemPointer tid)
+{
+ zstid ztid = ZSTidFromItemPointer(*tid);
+
+ zsbt_tuplebuffer_flush(sscan->rs_rd);
+
+ zsbt_find_latest_tid(sscan->rs_rd, &ztid, sscan->rs_snapshot);
+ *tid = ItemPointerFromZSTid(ztid);
+}
+
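+/*
+ * Common guts of INSERT, shared by plain and speculative insertions. A
+ * speculative_token of INVALID_SPECULATIVE_TOKEN means a plain insert.
+ */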
+static inline void
+zedstoream_insert_internal(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, struct BulkInsertStateData *bistate, uint32 speculative_token)
+{
+ zstid tid;
+ TransactionId xid = GetCurrentTransactionId();
+ MemoryContext oldcontext;
+ MemoryContext insert_mcontext;
+
+ /*
+ * The insert code performs allocations when creating and merging items.
+ * These are small allocations, but they add up with the number of
+ * columns and rows being inserted. Hence, create a memory context to
+ * track them and free them wholesale, instead of freeing them retail.
+ * TODO: in the long term, see if we can avoid creating a context here,
+ * free retail in the normal case, and only create a context for page
+ * splits, maybe.
+ */
+ insert_mcontext = AllocSetContextCreate(CurrentMemoryContext,
+ "ZedstoreAMContext",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(insert_mcontext);
+
+ if (slot->tts_tupleDescriptor->natts != relation->rd_att->natts)
+ elog(ERROR, "slot's attribute count doesn't match relcache entry");
+
+ if (speculative_token == INVALID_SPECULATIVE_TOKEN)
+ tid = zsbt_tuplebuffer_allocate_tids(relation, xid, cid, 1);
+ else
+ tid = zsbt_tid_multi_insert(relation, 1, xid, cid, speculative_token,
+ InvalidUndoPtr);
+
+ /*
+ * We only need to check for table-level SSI locks. Our
+ * new tuple can't possibly conflict with existing tuple locks, and
+ * page locks are only consolidated versions of tuple locks; they do not
+ * lock "gaps" as index page locks do.
+ */
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
+
+ slot_getallattrs(slot);
+ zsbt_tuplebuffer_spool_tuple(relation, tid, slot->tts_values, slot->tts_isnull);
+
+ slot->tts_tableOid = RelationGetRelid(relation);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ /* XXX: should we set visi_info here? */
+
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(insert_mcontext);
+
+ /* Note: speculative insertions are counted too, even if aborted later */
+ pgstat_count_heap_insert(relation, 1);
+}
+
+static void
+zedstoream_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, struct BulkInsertStateData *bistate)
+{
+ zedstoream_insert_internal(relation, slot, cid, options, bistate, INVALID_SPECULATIVE_TOKEN);
+}
+
+static void
+zedstoream_insert_speculative(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, BulkInsertState bistate, uint32 specToken)
+{
+ zedstoream_insert_internal(relation, slot, cid, options, bistate, specToken);
+}
+
+static void
+zedstoream_complete_speculative(Relation relation, TupleTableSlot *slot, uint32 spekToken,
+ bool succeeded)
+{
+ zstid tid;
+
+ tid = ZSTidFromItemPointer(slot->tts_tid);
+ zsbt_tid_clear_speculative_token(relation, tid, spekToken, true /* for complete */);
+ /*
+ * If the speculative insertion did not succeed (i.e. there was a
+ * conflict), kill the already-inserted TID.
+ *
+ * FIXME: Shouldn't we mark the TID dead first?
+ */
+ if (!succeeded)
+ {
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(relation, true);
+
+ zsbt_tid_mark_dead(relation, tid, recent_oldest_undo);
+ }
+}
+
+static void
+zedstoream_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
+ CommandId cid, int options, BulkInsertState bistate)
+{
+ int i;
+ TransactionId xid = GetCurrentTransactionId();
+ zstid firsttid;
+ zstid *tids;
+
+ if (ntuples == 0)
+ {
+ /* COPY sometimes calls us with 0 tuples. */
+ return;
+ }
+
+ firsttid = zsbt_tuplebuffer_allocate_tids(relation, xid, cid, ntuples);
+
+ tids = palloc(ntuples * sizeof(zstid));
+ for (i = 0; i < ntuples; i++)
+ tids[i] = firsttid + i;
+
+ /*
+ * We only need to check for table-level SSI locks. Our
+ * new tuple can't possibly conflict with existing tuple locks, and
+ * page locks are only consolidated versions of tuple locks; they do not
+ * lock "gaps" as index page locks do.
+ */
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
+
+ zsbt_tuplebuffer_spool_slots(relation, tids, slots, ntuples);
+
+ for (i = 0; i < ntuples; i++)
+ {
+ slots[i]->tts_tableOid = RelationGetRelid(relation);
+ slots[i]->tts_tid = ItemPointerFromZSTid(firsttid + i);
+ }
+
+ pgstat_count_heap_insert(relation, ntuples);
+}
+
+static TM_Result
+zedstoream_delete(Relation relation, ItemPointer tid_p, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart)
+{
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ TM_Result result = TM_Ok;
+ bool this_xact_has_lock = false;
+ bool have_tuple_lock = false;
+
+ zsbt_tuplebuffer_flush(relation);
+
+retry:
+ result = zsbt_tid_delete(relation, tid, xid, cid,
+ snapshot, crosscheck, wait, hufd, changingPart,
+ &this_xact_has_lock);
+
+ if (result != TM_Ok)
+ {
+ if (result == TM_Invisible)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("attempted to delete invisible tuple")));
+ else if (result == TM_BeingModified && wait)
+ {
+ TransactionId xwait = hufd->xmax;
+
+ if (!TransactionIdIsCurrentTransactionId(xwait))
+ {
+ /*
+ * Acquire tuple lock to establish our priority for the tuple.
+ * See zedstoream_lock_tuple().
+ */
+ if (!this_xact_has_lock)
+ {
+ zs_acquire_tuplock(relation, tid_p, LockTupleExclusive, LockWaitBlock,
+ &have_tuple_lock);
+ }
+
+ XactLockTableWait(xwait, relation, tid_p, XLTW_Delete);
+ goto retry;
+ }
+ }
+ }
+
+ /*
+ * Check for SSI conflicts.
+ */
+ CheckForSerializableConflictIn(relation, tid_p, ItemPointerGetBlockNumber(tid_p));
+
+ if (result == TM_Ok)
+ pgstat_count_heap_delete(relation);
+
+ return result;
+}
+
+
+/*
+ * Each tuple lock mode has a corresponding heavyweight lock, and one or two
+ * corresponding MultiXactStatuses (one to merely lock tuples, another one to
+ * update them). This table (and the macros below) helps us determine the
+ * heavyweight lock mode and MultiXactStatus values to use for any particular
+ * tuple lock strength.
+ *
+ * Don't look at lockstatus/updstatus directly! Use get_mxact_status_for_lock
+ * instead.
+ */
+static const struct
+{
+ LOCKMODE hwlock;
+ int lockstatus;
+ int updstatus;
+}
+
+ tupleLockExtraInfo[MaxLockTupleMode + 1] =
+{
+ { /* LockTupleKeyShare */
+ AccessShareLock,
+ MultiXactStatusForKeyShare,
+ -1 /* KeyShare does not allow updating tuples */
+ },
+ { /* LockTupleShare */
+ RowShareLock,
+ MultiXactStatusForShare,
+ -1 /* Share does not allow updating tuples */
+ },
+ { /* LockTupleNoKeyExclusive */
+ ExclusiveLock,
+ MultiXactStatusForNoKeyUpdate,
+ MultiXactStatusNoKeyUpdate
+ },
+ { /* LockTupleExclusive */
+ AccessExclusiveLock,
+ MultiXactStatusForUpdate,
+ MultiXactStatusUpdate
+ }
+};
+
+
+/*
+ * Acquire heavyweight locks on tuples, using a LockTupleMode strength value.
+ * This is more readable than having every caller translate it to lock.h's
+ * LOCKMODE.
+ */
+#define LockTupleTuplock(rel, tup, mode) \
+ LockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
+#define UnlockTupleTuplock(rel, tup, mode) \
+ UnlockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
+#define ConditionalLockTupleTuplock(rel, tup, mode) \
+ ConditionalLockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
+
+/*
+ * Acquire heavyweight lock on the given tuple, in preparation for acquiring
+ * its normal, Xmax-based tuple lock.
+ *
+ * have_tuple_lock is an input and output parameter: on input, it indicates
+ * whether the lock has previously been acquired (and this function does
+ * nothing in that case). If this function returns success, have_tuple_lock
+ * has been flipped to true.
+ *
+ * Returns false if it was unable to obtain the lock; this can only happen if
+ * wait_policy is Skip.
+ *
+ * XXX: This is identical to heap_acquire_tuplock
+ */
+
+static bool
+zs_acquire_tuplock(Relation relation, ItemPointer tid, LockTupleMode mode,
+ LockWaitPolicy wait_policy, bool *have_tuple_lock)
+{
+ if (*have_tuple_lock)
+ return true;
+
+ switch (wait_policy)
+ {
+ case LockWaitBlock:
+ LockTupleTuplock(relation, tid, mode);
+ break;
+
+ case LockWaitSkip:
+ if (!ConditionalLockTupleTuplock(relation, tid, mode))
+ return false;
+ break;
+
+ case LockWaitError:
+ if (!ConditionalLockTupleTuplock(relation, tid, mode))
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("could not obtain lock on row in relation \"%s\"",
+ RelationGetRelationName(relation))));
+ break;
+ }
+ *have_tuple_lock = true;
+
+ return true;
+}
+
+
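+/*
+ * Lock a tuple in the given mode, following the update chain if needed.
+ * This is the zedstore counterpart of heapam_tuple_lock().
+ */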
+static TM_Result
+zedstoream_lock_tuple(Relation relation, ItemPointer tid_p, Snapshot snapshot,
+ TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
+ LockWaitPolicy wait_policy, uint8 flags,
+ TM_FailureData *tmfd)
+{
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ TM_Result result;
+ bool this_xact_has_lock = false;
+ bool have_tuple_lock = false;
+ zstid next_tid = tid;
+ SnapshotData SnapshotDirty;
+ bool locked_something = false;
+ ZSUndoSlotVisibility *visi_info = &((ZedstoreTupleTableSlot *) slot)->visi_info_buf;
+ bool follow_updates = false;
+
+ zsbt_tuplebuffer_flush(relation);
+
+ slot->tts_tableOid = RelationGetRelid(relation);
+ slot->tts_tid = *tid_p;
+
+ tmfd->traversed = false;
+ /*
+ * For now, we lock just the first attribute. As long as everyone
+ * does that, that's enough.
+ */
+retry:
+ result = zsbt_tid_lock(relation, tid, xid, cid, mode, follow_updates,
+ snapshot, tmfd, &next_tid, &this_xact_has_lock, visi_info);
+ ((ZedstoreTupleTableSlot *) slot)->visi_info = visi_info;
+
+ if (result == TM_Invisible)
+ {
+ /*
+ * This is possible, but only when locking a tuple for ON CONFLICT
+ * UPDATE and some other cases handled below. We return this value
+ * here rather than throwing an error in order to give that case the
+ * opportunity to throw a more specific error.
+ */
+ /*
+ * This can also happen, if we're locking an UPDATE chain for KEY SHARE mode:
+ * A tuple has been inserted, and then updated, by a different transaction.
+ * The updating transaction is still in progress. We can lock the row
+ * in KEY SHARE mode, assuming the key columns were not updated, and we will
+ * try to lock all the row versions, even the still-in-progress UPDATEs.
+ * It's possible that the UPDATE aborts while we're chasing the update chain,
+ * so that the updated tuple becomes invisible to us. That's OK.
+ */
+ if (mode == LockTupleKeyShare && locked_something)
+ return TM_Ok;
+
+ /*
+ * This can also happen if the caller asked for the latest version of
+ * the tuple, and the tuple was inserted by our own transaction. In
+ * that case we have to check cmin against cid: cmin >= current CID
+ * means our command cannot see the tuple, so we should ignore it.
+ */
+ Assert(visi_info->cmin != InvalidCommandId);
+ if ((flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION) != 0 &&
+ TransactionIdIsCurrentTransactionId(visi_info->xmin) &&
+ visi_info->cmin >= cid)
+ {
+ tmfd->xmax = visi_info->xmin;
+ tmfd->cmax = visi_info->cmin;
+ return TM_SelfModified;
+ }
+
+ return TM_Invisible;
+ }
+ else if (result == TM_Updated ||
+ (result == TM_SelfModified && tmfd->cmax >= cid))
+ {
+ /*
+ * The other transaction is an update and it already committed.
+ *
+ * If the caller asked for the latest version, find it.
+ */
+ if ((flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION) != 0 && next_tid != tid)
+ {
+ if (have_tuple_lock)
+ {
+ UnlockTupleTuplock(relation, tid_p, mode);
+ have_tuple_lock = false;
+ }
+
+ if (ItemPointerIndicatesMovedPartitions(&tmfd->ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
+ /* it was updated, so look at the updated version */
+ *tid_p = ItemPointerFromZSTid(next_tid);
+
+ /* signal that a tuple later in the chain is getting locked */
+ tmfd->traversed = true;
+
+ /* loop back to fetch next in chain */
+
+ /* FIXME: In the corresponding code in heapam, we cross-check the xmin/xmax
+ * of the old and new tuple. Should we do the same here?
+ */
+
+ InitDirtySnapshot(SnapshotDirty);
+ snapshot = &SnapshotDirty;
+ tid = next_tid;
+ goto retry;
+ }
+
+ return result;
+ }
+ else if (result == TM_Deleted)
+ {
+ /*
+ * The other transaction is a delete and it already committed.
+ */
+ return result;
+ }
+ else if (result == TM_BeingModified)
+ {
+ TransactionId xwait = tmfd->xmax;
+
+ /*
+ * Acquire tuple lock to establish our priority for the tuple, or
+ * die trying. LockTuple will release us when we are next-in-line
+ * for the tuple. We must do this even if we are share-locking,
+ * but not if we already have a weaker lock on the tuple.
+ *
+ * If we are forced to "start over" below, we keep the tuple lock;
+ * this arranges that we stay at the head of the line while
+ *
+ * Why we don't acquire the heavyweight lock when we already hold a
+ * weaker lock on the tuple: when promoting a weaker lock, acquiring
+ * the heavyweight tuple lock could deadlock against another locker
+ * that has acquired it and is waiting for our transaction to finish.
+ *
+ * Note that in that case we still need to wait for the xid if
+ * required, to avoid acquiring conflicting locks.
+ */
+ if (!this_xact_has_lock &&
+ !zs_acquire_tuplock(relation, tid_p, mode, wait_policy,
+ &have_tuple_lock))
+ {
+ /*
+ * This can only happen if wait_policy is Skip and the lock
+ * couldn't be obtained.
+ */
+ return TM_WouldBlock;
+ }
+
+ /* wait for regular transaction to end, or die trying */
+ switch (wait_policy)
+ {
+ case LockWaitBlock:
+ XactLockTableWait(xwait, relation, tid_p, XLTW_Lock);
+ break;
+ case LockWaitSkip:
+ if (!ConditionalXactLockTableWait(xwait))
+ {
+ /* FIXME: should we release the hwlock here? */
+ return TM_WouldBlock;
+ }
+ break;
+ case LockWaitError:
+ if (!ConditionalXactLockTableWait(xwait))
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("could not obtain lock on row in relation \"%s\"",
+ RelationGetRelationName(relation))));
+ break;
+ }
+
+ /*
+ * xwait is done. Retry.
+ */
+ goto retry;
+ }
+ if (result == TM_Ok)
+ locked_something = true;
+
+ /*
+ * Now that we have successfully marked the tuple as locked, we can
+ * release the lmgr tuple lock, if we had it.
+ */
+ if (have_tuple_lock)
+ {
+ UnlockTupleTuplock(relation, tid_p, mode);
+ have_tuple_lock = false;
+ }
+
+ if (mode == LockTupleKeyShare)
+ {
+ /* lock all row versions, if it's a KEY SHARE lock */
+ follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
+ if (result == TM_Ok && tid != next_tid && next_tid != InvalidZSTid)
+ {
+ tid = next_tid;
+ goto retry;
+ }
+ }
+
+ /* Fetch the tuple, too. */
+ if (!zedstoream_fetch_row_version(relation, tid_p, SnapshotAny, slot))
+ elog(ERROR, "could not fetch locked tuple");
+
+ return TM_Ok;
+}
+
+/* like heap_tuple_attr_equals */
+static bool
+zs_tuple_attr_equals(int attrnum, TupleTableSlot *slot1, TupleTableSlot *slot2)
+{
+ TupleDesc tupdesc = slot1->tts_tupleDescriptor;
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than tableOID; we cannot expect these to be consistent in a HOT chain,
+ * or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapDetermineModifiedColumns do
+ * a single heap_deform_tuple call on each tuple, instead? But that
+ * doesn't work for system columns ...
+ */
+ value1 = slot_getattr(slot1, attrnum, &isnull1);
+ value2 = slot_getattr(slot2, attrnum, &isnull2);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = TupleDescAttr(tupdesc, attrnum - 1);
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+}
+
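+/*
+ * Check whether an UPDATE modifies any of the table's "key" columns (the
+ * columns that can be referenced by foreign keys, i.e. columns of unique
+ * indexes). This determines whether the old row version must be locked in
+ * LockTupleExclusive or LockTupleNoKeyExclusive mode.
+ */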
+static bool
+is_key_update(Relation relation, TupleTableSlot *oldslot, TupleTableSlot *newslot)
+{
+ Bitmapset *key_attrs;
+ Bitmapset *interesting_attrs;
+ Bitmapset *modified_attrs;
+ int attnum;
+
+ /*
+ * Fetch the list of attributes to be checked for various operations.
+ *
+ * For HOT considerations, this is wasted effort if we fail to update or
+ * have to put the new tuple on a different page. But we must compute the
+ * list before obtaining buffer lock --- in the worst case, if we are
+ * doing an update on one of the relevant system catalogs, we could
+ * deadlock if we try to fetch the list later. In any case, the relcache
+ * caches the data so this is usually pretty cheap.
+ *
+ * We also need columns used by the replica identity and columns that are
+ * considered the "key" of rows in the table.
+ *
+ * Note that we get copies of each bitmap, so we need not worry about
+ * relcache flush happening midway through.
+ */
+ key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
+
+ interesting_attrs = NULL;
+ interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
+
+ /* Determine columns modified by the update. */
+ modified_attrs = NULL;
+ while ((attnum = bms_first_member(interesting_attrs)) >= 0)
+ {
+ attnum += FirstLowInvalidHeapAttributeNumber;
+
+ if (!zs_tuple_attr_equals(attnum, oldslot, newslot))
+ modified_attrs = bms_add_member(modified_attrs,
+ attnum - FirstLowInvalidHeapAttributeNumber);
+ }
+
+ return bms_overlap(modified_attrs, key_attrs);
+}
+
+static TM_Result
+zedstoream_update(Relation relation, ItemPointer otid_p, TupleTableSlot *slot,
+ CommandId cid, Snapshot snapshot, Snapshot crosscheck,
+ bool wait, TM_FailureData *hufd,
+ LockTupleMode *lockmode, bool *update_indexes)
+{
+ zstid otid = ZSTidFromItemPointer(*otid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ bool key_update;
+ Datum *d;
+ bool *isnulls;
+ TM_Result result;
+ zstid newtid;
+ TupleTableSlot *oldslot;
+ IndexFetchTableData *fetcher;
+ MemoryContext oldcontext;
+ MemoryContext insert_mcontext;
+ bool this_xact_has_lock = false;
+ bool have_tuple_lock = false;
+
+ zsbt_tuplebuffer_flush(relation);
+
+ /*
+ * The insert code performs allocations when creating and merging items.
+ * These are small allocations, but they add up with the number of
+ * columns and rows being inserted. Hence, create a memory context to
+ * track them and free them wholesale, instead of freeing them retail.
+ * TODO: in the long term, see if we can avoid creating a context here,
+ * free retail in the normal case, and only create a context for page
+ * splits, maybe.
+ */
+ insert_mcontext = AllocSetContextCreate(CurrentMemoryContext,
+ "ZedstoreAMContext",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(insert_mcontext);
+
+ slot_getallattrs(slot);
+ d = slot->tts_values;
+ isnulls = slot->tts_isnull;
+
+ oldslot = table_slot_create(relation, NULL);
+ fetcher = zedstoream_begin_index_fetch(relation);
+
+ /*
+ * The meta-attribute holds the visibility information, including the "t_ctid"
+ * pointer to the updated version. All the real attributes are just inserted,
+ * as if for a new row.
+ */
+retry:
+ newtid = InvalidZSTid;
+
+ /*
+ * Fetch the old row, so that we can figure out which columns were modified.
+ *
+ * FIXME: if we have to follow the update chain, we should look at the
+ * currently latest tuple version, rather than the one visible to our snapshot.
+ */
+ if (!zedstoream_fetch_row((ZedStoreIndexFetchData *) fetcher,
+ otid_p, SnapshotAny, oldslot))
+ {
+ return TM_Invisible;
+ }
+ key_update = is_key_update(relation, oldslot, slot);
+
+ *lockmode = key_update ? LockTupleExclusive : LockTupleNoKeyExclusive;
+
+ result = zsbt_tid_update(relation, otid,
+ xid, cid, key_update, snapshot, crosscheck,
+ wait, hufd, &newtid, &this_xact_has_lock);
+
+ *update_indexes = (result == TM_Ok);
+ if (result == TM_Ok)
+ {
+ /*
+ * Check for SSI conflicts.
+ */
+ CheckForSerializableConflictIn(relation, otid_p, ItemPointerGetBlockNumber(otid_p));
+
+ zsbt_tuplebuffer_spool_tuple(relation, newtid, d, isnulls);
+
+ slot->tts_tableOid = RelationGetRelid(relation);
+ slot->tts_tid = ItemPointerFromZSTid(newtid);
+
+ pgstat_count_heap_update(relation, false);
+ }
+ else
+ {
+ if (result == TM_Invisible)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("attempted to update invisible tuple")));
+ else if (result == TM_BeingModified && wait)
+ {
+ TransactionId xwait = hufd->xmax;
+
+ if (!TransactionIdIsCurrentTransactionId(xwait))
+ {
+ /*
+ * Acquire tuple lock to establish our priority for the tuple.
+ * See zedstoream_lock_tuple().
+ */
+ if (!this_xact_has_lock)
+ {
+ zs_acquire_tuplock(relation, otid_p, LockTupleExclusive, LockWaitBlock,
+ &have_tuple_lock);
+ }
+
+ XactLockTableWait(xwait, relation, otid_p, XLTW_Update);
+ goto retry;
+ }
+ }
+ }
+
+ /*
+ * Now that we have successfully updated the tuple, we can
+ * release the lmgr tuple lock, if we had it.
+ */
+ if (have_tuple_lock)
+ {
+ UnlockTupleTuplock(relation, otid_p, LockTupleExclusive);
+ have_tuple_lock = false;
+ }
+
+ zedstoream_end_index_fetch(fetcher);
+ ExecDropSingleTupleTableSlot(oldslot);
+
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(insert_mcontext);
+
+ return result;
+}
+
+static const TupleTableSlotOps *
+zedstoream_slot_callbacks(Relation relation)
+{
+ return &TTSOpsZedstore;
+}
+
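+/*
+ * Build the list of attributes to project in a scan: the meta-attribute is
+ * always included first, dropped columns are skipped, and if a column
+ * projection set was given, only those columns are added.
+ */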
+static void
+zs_initialize_proj_attributes(TupleDesc tupledesc, ZedStoreProjectData *proj_data)
+{
+ MemoryContext oldcontext;
+
+ if (proj_data->num_proj_atts != 0)
+ return;
+
+ oldcontext = MemoryContextSwitchTo(proj_data->context);
+ /* add one for meta-attribute */
+ proj_data->proj_atts = palloc((tupledesc->natts + 1) * sizeof(int));
+ proj_data->attr_scans = palloc0(tupledesc->natts * sizeof(ZSAttrTreeScan));
+ proj_data->tid_scan.active = false;
+
+ proj_data->proj_atts[proj_data->num_proj_atts++] = ZS_META_ATTRIBUTE_NUM;
+
+ /*
+ * convert booleans array into an array of the attribute numbers of the
+ * required columns.
+ */
+ for (int idx = 0; idx < tupledesc->natts; idx++)
+ {
+ int att_no = idx + 1;
+
+ /*
+ * Never project dropped columns; NULL will be returned for them in
+ * the slot by default.
+ */
+ if (TupleDescAttr(tupledesc, idx)->attisdropped)
+ continue;
+
+ /* an empty project_columns set also means that all columns are needed */
+ if (proj_data->project_columns == NULL ||
+ bms_is_member(att_no, proj_data->project_columns))
+ proj_data->proj_atts[proj_data->num_proj_atts++] = att_no;
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
+static void
+zs_initialize_proj_attributes_extended(ZedStoreDesc scan, TupleDesc tupledesc)
+{
+ MemoryContext oldcontext;
+ ZedStoreProjectData *proj_data = &scan->proj_data;
+
+ /* if already initialized return */
+ if (proj_data->num_proj_atts != 0)
+ return;
+
+ zs_initialize_proj_attributes(tupledesc, proj_data);
+
+ oldcontext = MemoryContextSwitchTo(proj_data->context);
+ /* Extra setup for bitmap and sample scans */
+ if ((scan->rs_scan.rs_flags & SO_TYPE_BITMAPSCAN) ||
+ (scan->rs_scan.rs_flags & SO_TYPE_SAMPLESCAN) ||
+ (scan->rs_scan.rs_flags & SO_TYPE_ANALYZE))
+ {
+ scan->bmscan_ntuples = 0;
+ scan->bmscan_tids = palloc(MAX_ITEMS_PER_LOGICAL_BLOCK * sizeof(zstid));
+ }
+ MemoryContextSwitchTo(oldcontext);
+}
+
+static TableScanDesc
+zedstoream_beginscan_with_column_projection(Relation relation, Snapshot snapshot,
+ int nkeys, ScanKey key,
+ ParallelTableScanDesc parallel_scan,
+ uint32 flags,
+ Bitmapset *project_columns)
+{
+ ZedStoreDesc scan;
+
+ zsbt_tuplebuffer_flush(relation);
+
+ /* Sample scans have no snapshot, but we need one */
+ if (!snapshot)
+ {
+ Assert(!(flags & SO_TYPE_SAMPLESCAN));
+ snapshot = SnapshotAny;
+ }
+
+ /*
+ * allocate and initialize scan descriptor
+ */
+ scan = (ZedStoreDesc) palloc0(sizeof(ZedStoreDescData));
+
+ scan->rs_scan.rs_rd = relation;
+ scan->rs_scan.rs_snapshot = snapshot;
+ scan->rs_scan.rs_nkeys = nkeys;
+ scan->rs_scan.rs_flags = flags;
+ scan->rs_scan.rs_parallel = parallel_scan;
+
+ /*
+ * we can use page-at-a-time mode if it's an MVCC-safe snapshot
+ */
+
+ /*
+ * we do this here instead of in initscan() because heap_rescan also calls
+ * initscan() and we don't want to allocate memory again
+ */
+ if (nkeys > 0)
+ scan->rs_scan.rs_key = (ScanKey) palloc(sizeof(ScanKeyData) * nkeys);
+ else
+ scan->rs_scan.rs_key = NULL;
+
+ scan->proj_data.context = CurrentMemoryContext;
+ scan->proj_data.project_columns = project_columns;
+
+ /*
+ * For a seqscan in a serializable transaction, acquire a predicate lock
+ * on the entire relation. This is required not only to lock all the
+ * matching tuples, but also to conflict with new insertions into the
+ * table. In an indexscan, we take page locks on the index pages covering
+ * the range specified in the scan qual, but in a heap scan there is
+ * nothing more fine-grained to lock. A bitmap scan is a different story,
+ * there we have already scanned the index and locked the index pages
+ * covering the predicate. But in that case we still have to lock any
+ * matching heap tuples.
+ */
+ if (flags & (SO_TYPE_SEQSCAN | SO_TYPE_SAMPLESCAN))
+ PredicateLockRelation(relation, snapshot);
+
+ /*
+ * Currently, we don't have a stats counter for bitmap heap scans (but the
+ * underlying bitmap index scans will be counted) or sample scans (we only
+ * update stats for tuple fetches there)
+ */
+ if (!(flags & SO_TYPE_BITMAPSCAN) && !(flags & SO_TYPE_SAMPLESCAN))
+ pgstat_count_heap_scan(relation);
+
+ return (TableScanDesc) scan;
+}
+
+static TableScanDesc
+zedstoream_beginscan(Relation relation, Snapshot snapshot,
+ int nkeys, ScanKey key,
+ ParallelTableScanDesc parallel_scan,
+ uint32 flags)
+{
+ return zedstoream_beginscan_with_column_projection(relation, snapshot,
+ nkeys, key, parallel_scan, flags, NULL);
+}
+
+static void
+zedstoream_endscan(TableScanDesc sscan)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ ZedStoreProjectData *proj_data = &scan->proj_data;
+
+ if (proj_data->proj_atts)
+ pfree(proj_data->proj_atts);
+
+ if (proj_data->num_proj_atts > 0)
+ {
+ zsbt_tid_end_scan(&proj_data->tid_scan);
+ for (int i = 1; i < proj_data->num_proj_atts; i++)
+ zsbt_attr_end_scan(&proj_data->attr_scans[i - 1]);
+ }
+
+ if (scan->rs_scan.rs_flags & SO_TEMP_SNAPSHOT)
+ UnregisterSnapshot(scan->rs_scan.rs_snapshot);
+
+ if (proj_data->attr_scans)
+ pfree(proj_data->attr_scans);
+ pfree(scan);
+}
+
+static void
+zedstoream_rescan(TableScanDesc sscan, struct ScanKeyData *key,
+ bool set_params, bool allow_strat,
+ bool allow_sync, bool allow_pagemode)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+
+ /* these params don't do much in zedstore yet, but whatever */
+ if (set_params)
+ {
+ if (allow_strat)
+ scan->rs_scan.rs_flags |= SO_ALLOW_STRAT;
+ else
+ scan->rs_scan.rs_flags &= ~SO_ALLOW_STRAT;
+
+ if (allow_sync)
+ scan->rs_scan.rs_flags |= SO_ALLOW_SYNC;
+ else
+ scan->rs_scan.rs_flags &= ~SO_ALLOW_SYNC;
+
+ if (allow_pagemode && scan->rs_scan.rs_snapshot &&
+ IsMVCCSnapshot(scan->rs_scan.rs_snapshot))
+ scan->rs_scan.rs_flags |= SO_ALLOW_PAGEMODE;
+ else
+ scan->rs_scan.rs_flags &= ~SO_ALLOW_PAGEMODE;
+ }
+
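+	/*
+	 * If the scan had already been started, reset the TID tree scan back to
+	 * the start of the current TID range, so that the next getnextslot()
+	 * call starts over from there.
+	 */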
+ if (scan->proj_data.num_proj_atts > 0)
+ {
+ zsbt_tid_reset_scan(&scan->proj_data.tid_scan,
+ scan->cur_range_start, scan->cur_range_end, scan->cur_range_start - 1);
+
+ if ((scan->rs_scan.rs_flags & SO_TYPE_SAMPLESCAN) != 0)
+ scan->next_tid_to_scan = ZSTidFromBlkOff(0, 1);
+ }
+}
+
+static bool
+zedstoream_getnextslot(TableScanDesc sscan, ScanDirection direction,
+ TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ ZedStoreProjectData *scan_proj = &scan->proj_data;
+ int slot_natts = slot->tts_tupleDescriptor->natts;
+ Datum *slot_values = slot->tts_values;
+ bool *slot_isnull = slot->tts_isnull;
+ zstid this_tid;
+ Datum datum;
+ bool isnull;
+ ZSUndoSlotVisibility *visi_info;
+ uint8 slotno;
+
+ if (direction != ForwardScanDirection && scan->rs_scan.rs_parallel)
+ elog(ERROR, "parallel backward scan not implemented");
+
+ if (!scan->started)
+ {
+ MemoryContext oldcontext;
+
+ zs_initialize_proj_attributes(slot->tts_tupleDescriptor, scan_proj);
+
+ if (scan->rs_scan.rs_parallel)
+ {
+ /* Allocate next range of TIDs to scan */
+ if (!zs_parallelscan_nextrange(scan->rs_scan.rs_rd,
+ (ParallelZSScanDesc) scan->rs_scan.rs_parallel,
+ &scan->cur_range_start, &scan->cur_range_end))
+ {
+ ExecClearTuple(slot);
+ return false;
+ }
+ }
+ else
+ {
+ scan->cur_range_start = MinZSTid;
+ scan->cur_range_end = MaxPlusOneZSTid;
+ }
+
+ oldcontext = MemoryContextSwitchTo(scan_proj->context);
+ zsbt_tid_begin_scan(scan->rs_scan.rs_rd,
+ scan->cur_range_start,
+ scan->cur_range_end,
+ scan->rs_scan.rs_snapshot,
+ &scan_proj->tid_scan);
+ scan_proj->tid_scan.serializable = true;
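+		/*
+		 * The TID tree scan is tracked separately in tid_scan; the attribute
+		 * tree scans for the projected columns start at projection index 1,
+		 * with attr_scans[i - 1] holding the scan for proj_atts[i].
+		 */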
+ for (int i = 1; i < scan_proj->num_proj_atts; i++)
+ {
+ int attno = scan_proj->proj_atts[i];
+
+ zsbt_attr_begin_scan(scan->rs_scan.rs_rd,
+ slot->tts_tupleDescriptor,
+ attno,
+ &scan_proj->attr_scans[i - 1]);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ scan->started = true;
+ }
+ Assert((scan_proj->num_proj_atts - 1) <= slot_natts);
+
+ /*
+ * Initialize the slot.
+ *
+ * We initialize all columns to NULL. The values for columns that are projected
+ * will be set to the actual values below, but it's important that non-projected
+ * columns are NULL.
+ */
+ ExecClearTuple(slot);
+ for (int i = 0; i < slot_natts; i++)
+ slot_isnull[i] = true;
+
+ /*
+ * Find the next visible TID.
+ */
+ for (;;)
+ {
+ this_tid = zsbt_tid_scan_next(&scan_proj->tid_scan, direction);
+ if (this_tid == InvalidZSTid)
+ {
+ if (scan->rs_scan.rs_parallel)
+ {
+ /* Allocate next range of TIDs to scan */
+ if (!zs_parallelscan_nextrange(scan->rs_scan.rs_rd,
+ (ParallelZSScanDesc) scan->rs_scan.rs_parallel,
+ &scan->cur_range_start, &scan->cur_range_end))
+ {
+ ExecClearTuple(slot);
+ return false;
+ }
+
+ zsbt_tid_reset_scan(&scan_proj->tid_scan,
+ scan->cur_range_start, scan->cur_range_end, scan->cur_range_start - 1);
+ continue;
+ }
+ else
+ {
+ ExecClearTuple(slot);
+ return false;
+ }
+ }
+ Assert (this_tid < scan->cur_range_end);
+ break;
+ }
+
+ /* Note: We don't need to predicate-lock tuples in Serializable mode,
+ * because in a sequential scan, we predicate-locked the whole table.
+ */
+
+ /* Fetch the datums of each attribute for this row */
+ for (int i = 1; i < scan_proj->num_proj_atts; i++)
+ {
+ ZSAttrTreeScan *btscan = &scan_proj->attr_scans[i - 1];
+ Form_pg_attribute attr = btscan->attdesc;
+ int natt;
+
+ if (!zsbt_attr_fetch(btscan, &datum, &isnull, this_tid))
+ zsbt_fill_missing_attribute_value(slot->tts_tupleDescriptor, btscan->attno,
+ &datum, &isnull);
+
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ natt = scan_proj->proj_atts[i];
+
+ if (!isnull && attr->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ MemoryContext oldcxt = CurrentMemoryContext;
+
+ if (btscan->decoder.tmpcxt)
+ MemoryContextSwitchTo(btscan->decoder.tmpcxt);
+ datum = zedstore_toast_flatten(scan->rs_scan.rs_rd, natt, this_tid, datum);
+ MemoryContextSwitchTo(oldcxt);
+ }
+
+ /* Check that the values coming out of the b-tree are aligned properly */
+ if (!isnull && attr->attlen == -1)
+ {
+ Assert (VARATT_IS_1B(datum) || INTALIGN(datum) == datum);
+ }
+
+ Assert(natt > 0);
+ slot_values[natt - 1] = datum;
+ slot_isnull[natt - 1] = isnull;
+ }
+
+ /* Fill in the rest of the fields in the slot, and return the tuple */
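+	/*
+	 * Also expose the visibility information from the TID tree through the
+	 * zedstore-specific slot; callers such as the index build code use it to
+	 * tell live tuples from recently-dead ones.
+	 */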
+ slotno = ZSTidScanCurUndoSlotNo(&scan_proj->tid_scan);
+ visi_info = &scan_proj->tid_scan.array_iter.undoslot_visibility[slotno];
+ ((ZedstoreTupleTableSlot *) slot)->visi_info = visi_info;
+
+ slot->tts_tableOid = RelationGetRelid(scan->rs_scan.rs_rd);
+ slot->tts_tid = ItemPointerFromZSTid(this_tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ pgstat_count_heap_getnext(scan->rs_scan.rs_rd);
+ return true;
+}
+
+static bool
+zedstoream_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
+{
+ ZedStoreDesc scan;
+ zstid ztid;
+
+ if (!ItemPointerIsValid(tid))
+ return false;
+
+ scan = (ZedStoreDesc) sscan;
+ ztid = ZSTidFromItemPointer(*tid);
+
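+	/*
+	 * The TID is considered valid if it falls within the TID range currently
+	 * present in the table. Look up the range bounds lazily and cache them in
+	 * the scan descriptor, so they are only fetched once per scan.
+	 */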
+ if (scan->min_tid_to_scan == InvalidZSTid)
+ {
+ /*
+ * get the min tid once and store it
+ */
+ scan->min_tid_to_scan = zsbt_get_first_tid(sscan->rs_rd);
+ }
+
+ if (scan->max_tid_to_scan == InvalidZSTid)
+ {
+ /*
+ * get the max tid once and store it
+ */
+ scan->max_tid_to_scan = zsbt_get_last_tid(sscan->rs_rd);
+ }
+
+	return ztid >= scan->min_tid_to_scan && ztid < scan->max_tid_to_scan;
+}
+
+static bool
+zedstoream_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
+ Snapshot snapshot)
+{
+ /*
+ * TODO: we didn't keep any visibility information about the tuple in the
+ * slot, so we have to fetch it again. A custom slot type might be a
+ * good idea..
+ */
+ zstid tid = ZSTidFromItemPointer(slot->tts_tid);
+ ZSTidTreeScan meta_scan;
+ bool found;
+
+ /* Use the meta-data tree for the visibility information. */
+ zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, &meta_scan);
+
+ found = zsbt_tid_scan_next(&meta_scan, ForwardScanDirection) != InvalidZSTid;
+
+ zsbt_tid_end_scan(&meta_scan);
+
+ return found;
+}
+
+static TransactionId
+zedstoream_compute_xid_horizon_for_tuples(Relation rel,
+ ItemPointerData *items,
+ int nitems)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function %s not implemented yet", __func__)));
+}
+
+static IndexFetchTableData *
+zedstoream_begin_index_fetch(Relation rel)
+{
+ ZedStoreIndexFetch zscan;
+
+ zsbt_tuplebuffer_flush(rel);
+
+ zscan = palloc0(sizeof(ZedStoreIndexFetchData));
+ zscan->idx_fetch_data.rel = rel;
+ zscan->proj_data.context = CurrentMemoryContext;
+
+ return (IndexFetchTableData *) zscan;
+}
+
+static void
+zedstoream_fetch_set_column_projection(struct IndexFetchTableData *scan,
+ Bitmapset *project_columns)
+{
+ ZedStoreIndexFetch zscan = (ZedStoreIndexFetch) scan;
+ zscan->proj_data.project_columns = project_columns;
+}
+
+static void
+zedstoream_reset_index_fetch(IndexFetchTableData *scan)
+{
+ /* TODO: we could close the scans here, but currently we don't bother */
+}
+
+static void
+zedstoream_end_index_fetch(IndexFetchTableData *scan)
+{
+ ZedStoreIndexFetch zscan = (ZedStoreIndexFetch) scan;
+ ZedStoreProjectData *zscan_proj = &zscan->proj_data;
+
+ if (zscan_proj->num_proj_atts > 0)
+ {
+ zsbt_tid_end_scan(&zscan_proj->tid_scan);
+ for (int i = 1; i < zscan_proj->num_proj_atts; i++)
+ zsbt_attr_end_scan(&zscan_proj->attr_scans[i - 1]);
+ }
+
+ if (zscan_proj->proj_atts)
+ pfree(zscan_proj->proj_atts);
+
+ if (zscan_proj->attr_scans)
+ pfree(zscan_proj->attr_scans);
+ pfree(zscan);
+}
+
+static bool
+zedstoream_index_fetch_tuple(struct IndexFetchTableData *scan,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot,
+ bool *call_again, bool *all_dead)
+{
+ bool result;
+
+ /*
+ * we don't do in-place updates, so this is essentially the same as
+ * fetch_row_version.
+ */
+ if (call_again)
+ *call_again = false;
+ if (all_dead)
+ *all_dead = false;
+
+ result = zedstoream_fetch_row((ZedStoreIndexFetchData *) scan, tid_p, snapshot, slot);
+ if (result)
+ {
+ /* FIXME: heapam acquires the predicate lock first, and then
+ * calls CheckForSerializableConflictOut(). We do it in the
+		 * opposite order, because the CheckForSerializableConflictOut() call
+		 * is already done in zsbt_get_last_tid(). Does it matter?
+ * I'm not sure.
+ *
+ * We pass in InvalidTransactionId as we are sure that the current
+ * transaction hasn't locked tid_p.
+ */
+ PredicateLockTID(scan->rel, tid_p, snapshot, InvalidTransactionId);
+ }
+
+ return result;
+}
+
+/*
+ * Shared implementation of fetch_row_version and index_fetch_tuple callbacks.
+ */
+static bool
+zedstoream_fetch_row(ZedStoreIndexFetchData *fetch,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot)
+{
+ Relation rel = fetch->idx_fetch_data.rel;
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ bool found = true;
+ ZedStoreProjectData *fetch_proj = &fetch->proj_data;
+
+ /* first time here, initialize */
+ if (fetch_proj->num_proj_atts == 0)
+ {
+ TupleDesc reldesc = RelationGetDescr(rel);
+ MemoryContext oldcontext;
+
+ zs_initialize_proj_attributes(slot->tts_tupleDescriptor, fetch_proj);
+
+ oldcontext = MemoryContextSwitchTo(fetch_proj->context);
+ zsbt_tid_begin_scan(rel, tid, tid + 1,
+ snapshot,
+ &fetch_proj->tid_scan);
+ fetch_proj->tid_scan.serializable = true;
+ for (int i = 1; i < fetch_proj->num_proj_atts; i++)
+ {
+ int attno = fetch_proj->proj_atts[i];
+
+ zsbt_attr_begin_scan(rel, reldesc, attno,
+ &fetch_proj->attr_scans[i - 1]);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ zsbt_tid_reset_scan(&fetch_proj->tid_scan, tid, tid + 1, tid - 1);
+
+ /*
+ * Initialize the slot.
+ *
+ * If we're not fetching all columns, initialize the unfetched values
+ * in the slot to NULL. (Actually, this initializes all to NULL, and the
+ * code below will overwrite them for the columns that are projected)
+ */
+ ExecClearTuple(slot);
+ for (int i = 0; i < slot->tts_tupleDescriptor->natts; i++)
+ slot->tts_isnull[i] = true;
+
+ found = zsbt_tid_scan_next(&fetch_proj->tid_scan, ForwardScanDirection) != InvalidZSTid;
+ if (found)
+ {
+ for (int i = 1; i < fetch_proj->num_proj_atts; i++)
+ {
+ int natt = fetch_proj->proj_atts[i];
+ ZSAttrTreeScan *btscan = &fetch_proj->attr_scans[i - 1];
+ Form_pg_attribute attr;
+ Datum datum;
+ bool isnull;
+
+ attr = btscan->attdesc;
+ if (zsbt_attr_fetch(btscan, &datum, &isnull, tid))
+ {
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ if (!isnull && attr->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ MemoryContext oldcxt = CurrentMemoryContext;
+
+ if (btscan->decoder.tmpcxt)
+ MemoryContextSwitchTo(btscan->decoder.tmpcxt);
+ datum = zedstore_toast_flatten(rel, natt, tid, datum);
+ MemoryContextSwitchTo(oldcxt);
+ }
+ }
+ else
+ zsbt_fill_missing_attribute_value(slot->tts_tupleDescriptor, btscan->attno,
+ &datum, &isnull);
+
+ slot->tts_values[natt - 1] = datum;
+ slot->tts_isnull[natt - 1] = isnull;
+ }
+ }
+
+ if (found)
+ {
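+		/*
+		 * Like in a sequential scan, expose the visibility information from
+		 * the TID tree through the zedstore-specific slot.
+		 */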
+ uint8 slotno = ZSTidScanCurUndoSlotNo(&fetch_proj->tid_scan);
+ ZSUndoSlotVisibility *visi_info;
+
+ visi_info = &fetch_proj->tid_scan.array_iter.undoslot_visibility[slotno];
+
+ ((ZedstoreTupleTableSlot *) slot)->visi_info = visi_info;
+ slot->tts_tableOid = RelationGetRelid(rel);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+ return true;
+ }
+
+ return false;
+}
+
+static void
+zedstoream_index_validate_scan(Relation baseRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ Snapshot snapshot,
+ ValidateIndexState *state)
+{
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ ExprState *predicate;
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
+ int attno;
+ TableScanDesc scan;
+ ItemPointerData idx_ptr;
+ bool tuplesort_empty = false;
+ Bitmapset *proj = NULL;
+
+ /*
+ * sanity checks
+ */
+ Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+ /*
+ * Need an EState for evaluation of index expressions and partial-index
+ * predicates. Also a slot to hold the current tuple.
+ */
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ slot = table_slot_create(baseRelation, NULL);
+
+ /* Arrange for econtext's scan tuple to be the tuple under test */
+ econtext->ecxt_scantuple = slot;
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+
+ /*
+ * Prepare for scan of the base relation. We need just those tuples
+ * satisfying the passed-in reference snapshot. We must disable syncscan
+ * here, because it's critical that we read from block zero forward to
+ * match the sorted TIDs.
+ */
+
+ /*
+ * TODO: It would be very good to fetch only the columns we need.
+ */
+ for (attno = 0; attno < indexInfo->ii_NumIndexKeyAttrs; attno++)
+ {
+ Assert(indexInfo->ii_IndexAttrNumbers[attno] <= baseRelation->rd_att->natts);
+ proj = bms_add_member(proj, indexInfo->ii_IndexAttrNumbers[attno]);
+ }
+ PopulateNeededColumnsForNode((Node *)indexInfo->ii_Predicate,
+ baseRelation->rd_att->natts,
+ &proj);
+ PopulateNeededColumnsForNode((Node *)indexInfo->ii_Expressions,
+ baseRelation->rd_att->natts,
+ &proj);
+
+ scan = table_beginscan_with_column_projection(baseRelation, /* relation */
+ snapshot, /* snapshot */
+ 0, /* number of keys */
+ NULL, /* scan key */
+ proj);
+
+ /*
+ * Scan all tuples matching the snapshot.
+ */
+ ItemPointerSet(&idx_ptr, 0, 0); /* this is less than any real TID */
+ while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
+ {
+ ItemPointerData tup_ptr = slot->tts_tid;
+ int cmp;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * TODO: Once we have in-place updates, like HOT, this will need
+ * to work harder, like heapam's function.
+ */
+
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
+
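+		/*
+		 * Merge the stream of TIDs from the table scan with the sorted TIDs
+		 * already present in the index (held in state->tuplesort): any table
+		 * TID not found in the sort is missing from the index and gets
+		 * inserted below.
+		 */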
+ if (tuplesort_empty)
+ cmp = -1;
+ else
+ {
+ while ((cmp = ItemPointerCompare(&tup_ptr, &idx_ptr)) > 0)
+ {
+ Datum ts_val;
+ bool ts_isnull;
+
+ tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+ &ts_val, &ts_isnull, NULL);
+ if (!tuplesort_empty)
+ {
+ Assert(!ts_isnull);
+ itemptr_decode(&idx_ptr, DatumGetInt64(ts_val));
+
+ /* If int8 is pass-by-ref, free (encoded) TID Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(ts_val));
+#endif
+ break;
+ }
+ else
+ {
+ /* Be tidy */
+ ItemPointerSetInvalid(&idx_ptr);
+ cmp = -1;
+ }
+ }
+ }
+ if (cmp < 0)
+ {
+ /* This item is not in the index */
+
+ /*
+ * In a partial index, discard tuples that don't satisfy the
+ * predicate.
+ */
+ if (predicate != NULL)
+ {
+ if (!ExecQual(predicate, econtext))
+ continue;
+ }
+
+ /*
+ * For the current heap tuple, extract all the attributes we use in
+ * this index, and note which are null. This also performs evaluation
+ * of any expressions needed.
+ */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ /* Call the AM's callback routine to process the tuple */
+ index_insert(indexRelation, values, isnull, &tup_ptr, baseRelation,
+ indexInfo->ii_Unique ?
+ UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+ indexInfo);
+
+ state->tups_inserted += 1;
+ }
+ }
+
+ table_endscan(scan);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ FreeExecutorState(estate);
+
+ /* These may have been pointing to the now-gone estate */
+ indexInfo->ii_ExpressionsState = NIL;
+ indexInfo->ii_PredicateState = NULL;
+}
+
+static double
+zedstoream_index_build_range_scan(Relation baseRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ bool anyvisible,
+ bool progress,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state,
+ TableScanDesc scan)
+{
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ double reltuples;
+ ExprState *predicate;
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
+ Snapshot snapshot;
+ SnapshotData NonVacuumableSnapshot;
+ bool need_unregister_snapshot = false;
+ TransactionId OldestXmin;
+ bool tupleIsAlive;
+
+#ifdef USE_ASSERT_CHECKING
+ bool checking_uniqueness;
+ /* See whether we're verifying uniqueness/exclusion properties */
+ checking_uniqueness = (indexInfo->ii_Unique ||
+ indexInfo->ii_ExclusionOps != NULL);
+
+ /*
+ * "Any visible" mode is not compatible with uniqueness checks; make sure
+ * only one of those is requested.
+ */
+ Assert(!(anyvisible && checking_uniqueness));
+#endif
+
+ /*
+ * sanity checks
+ */
+ Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+ /*
+ * Need an EState for evaluation of index expressions and partial-index
+ * predicates. Also a slot to hold the current tuple.
+ */
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ slot = table_slot_create(baseRelation, NULL);
+
+ /* Arrange for econtext's scan tuple to be the tuple under test */
+ econtext->ecxt_scantuple = slot;
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+
+ /*
+ * Prepare for scan of the base relation. In a normal index build, we use
+ * SnapshotAny because we must retrieve all tuples and do our own time
+ * qual checks (because we have to index RECENTLY_DEAD tuples). In a
+ * concurrent build, or during bootstrap, we take a regular MVCC snapshot
+ * and index whatever's live according to that.
+ */
+ OldestXmin = InvalidTransactionId;
+
+ /* okay to ignore lazy VACUUMs here */
+ if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
+ OldestXmin = GetOldestXmin(baseRelation, PROCARRAY_FLAGS_VACUUM);
+
+ zsbt_tuplebuffer_flush(baseRelation);
+ if (!scan)
+ {
+ int attno;
+ Bitmapset *proj = NULL;
+
+ /*
+ * Serial index build.
+ *
+ * Must begin our own zedstore scan in this case. We may also need to
+ * register a snapshot whose lifetime is under our direct control.
+ */
+ if (!TransactionIdIsValid(OldestXmin))
+ {
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ need_unregister_snapshot = true;
+ }
+ else
+ {
+ /* leave out completely dead items even with SnapshotAny */
+ InitNonVacuumableSnapshot(NonVacuumableSnapshot, OldestXmin);
+ snapshot = &NonVacuumableSnapshot;
+ }
+
+ for (attno = 0; attno < indexInfo->ii_NumIndexKeyAttrs; attno++)
+ {
+ Assert(indexInfo->ii_IndexAttrNumbers[attno] <= baseRelation->rd_att->natts);
+ proj = bms_add_member(proj, indexInfo->ii_IndexAttrNumbers[attno]);
+ }
+ PopulateNeededColumnsForNode((Node *)indexInfo->ii_Predicate,
+ baseRelation->rd_att->natts,
+ &proj);
+ PopulateNeededColumnsForNode((Node *)indexInfo->ii_Expressions,
+ baseRelation->rd_att->natts,
+ &proj);
+
+ scan = table_beginscan_with_column_projection(baseRelation, /* relation */
+ snapshot, /* snapshot */
+ 0, /* number of keys */
+ NULL, /* scan key */
+ proj);
+
+ if (start_blockno != 0 || numblocks != InvalidBlockNumber)
+ {
+ ZedStoreDesc zscan = (ZedStoreDesc) scan;
+ ZedStoreProjectData *zscan_proj = &zscan->proj_data;
+
+ zscan->cur_range_start = ZSTidFromBlkOff(start_blockno, 1);
+ zscan->cur_range_end = ZSTidFromBlkOff(numblocks, 1);
+
+ /* FIXME: when can 'num_proj_atts' be 0? */
+ if (zscan_proj->num_proj_atts > 0)
+ {
+ zsbt_tid_begin_scan(zscan->rs_scan.rs_rd,
+ zscan->cur_range_start,
+ zscan->cur_range_end,
+ zscan->rs_scan.rs_snapshot,
+ &zscan_proj->tid_scan);
+ for (int i = 1; i < zscan_proj->num_proj_atts; i++)
+ {
+ int natt = zscan_proj->proj_atts[i];
+
+ zsbt_attr_begin_scan(zscan->rs_scan.rs_rd,
+ RelationGetDescr(zscan->rs_scan.rs_rd),
+ natt,
+ &zscan_proj->attr_scans[i - 1]);
+ }
+ }
+ }
+ }
+ else
+ {
+ /*
+ * Parallel index build.
+ *
+ * Parallel case never registers/unregisters own snapshot. Snapshot
+ * is taken from parallel zedstore scan, and is SnapshotAny or an MVCC
+ * snapshot, based on same criteria as serial case.
+ */
+ Assert(!IsBootstrapProcessingMode());
+ Assert(allow_sync);
+ Assert(start_blockno == 0);
+ Assert(numblocks == InvalidBlockNumber);
+ snapshot = scan->rs_snapshot;
+
+ if (snapshot == SnapshotAny)
+ {
+ /* leave out completely dead items even with SnapshotAny */
+ InitNonVacuumableSnapshot(NonVacuumableSnapshot, OldestXmin);
+ snapshot = &NonVacuumableSnapshot;
+ }
+ }
+
+ /*
+ * Must call GetOldestXmin() with SnapshotAny. Should never call
+ * GetOldestXmin() with MVCC snapshot. (It's especially worth checking
+ * this for parallel builds, since ambuild routines that support parallel
+ * builds must work these details out for themselves.)
+ */
+ Assert(snapshot == &NonVacuumableSnapshot || IsMVCCSnapshot(snapshot));
+ Assert(snapshot == &NonVacuumableSnapshot ? TransactionIdIsValid(OldestXmin) :
+ !TransactionIdIsValid(OldestXmin));
+ Assert(snapshot == &NonVacuumableSnapshot || !anyvisible);
+
+ reltuples = 0;
+
+ /*
+ * Scan all tuples in the base relation.
+ */
+ while (zedstoream_getnextslot(scan, ForwardScanDirection, slot))
+ {
+ ZSUndoSlotVisibility *visi_info;
+
+ if (numblocks != InvalidBlockNumber &&
+ ItemPointerGetBlockNumber(&slot->tts_tid) >= numblocks)
+ break;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Is the tuple deleted, but still visible to old transactions?
+ *
+ * We need to include such tuples in the index, but exclude them
+ * from unique-checking.
+ *
+ * TODO: Heap checks for DELETE_IN_PROGRESS do we need as well?
+ */
+ visi_info = ((ZedstoreTupleTableSlot *) slot)->visi_info;
+ tupleIsAlive = (visi_info->nonvacuumable_status != ZSNV_RECENTLY_DEAD);
+
+ if (tupleIsAlive)
+ reltuples += 1;
+
+ /*
+ * TODO: Once we have in-place updates, like HOT, this will need
+ * to work harder, to figure out which tuple version to index.
+ */
+
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
+
+ /*
+ * In a partial index, discard tuples that don't satisfy the
+ * predicate.
+ */
+ if (predicate != NULL)
+ {
+ if (!ExecQual(predicate, econtext))
+ continue;
+ }
+
+ /*
+ * For the current heap tuple, extract all the attributes we use in
+ * this index, and note which are null. This also performs evaluation
+ * of any expressions needed.
+ */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ /* Call the AM's callback routine to process the tuple */
+ callback(indexRelation, &slot->tts_tid, values, isnull, tupleIsAlive,
+ callback_state);
+ }
+
+ table_endscan(scan);
+
+ /* we can now forget our snapshot, if set and registered by us */
+ if (need_unregister_snapshot)
+ UnregisterSnapshot(snapshot);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ FreeExecutorState(estate);
+
+ /* These may have been pointing to the now-gone estate */
+ indexInfo->ii_ExpressionsState = NIL;
+ indexInfo->ii_PredicateState = NULL;
+
+ return reltuples;
+}
+
+static void
+zedstoream_finish_bulk_insert(Relation relation, int options)
+{
+ zsbt_tuplebuffer_flush(relation);
+
+ /*
+ * If we skipped writing WAL, then we need to sync the zedstore (but not
+ * indexes since those use WAL anyway / don't go through tableam)
+ */
+ if (options & HEAP_INSERT_SKIP_WAL)
+ heap_sync(relation);
+}
+
+/* ------------------------------------------------------------------------
+ * DDL related callbacks for zedstore AM.
+ * ------------------------------------------------------------------------
+ */
+
+static void
+zedstoream_relation_set_new_filenode(Relation rel,
+ const RelFileNode *newrnode,
+ char persistence,
+ TransactionId *freezeXid,
+ MultiXactId *minmulti)
+{
+ SMgrRelation srel;
+
+ /* XXX: I think we could just throw away all data in the buffer */
+ zsbt_tuplebuffer_flush(rel);
+
+ /*
+ * Initialize to the minimum XID that could put tuples in the table. We
+ * know that no xacts older than RecentXmin are still running, so that
+ * will do.
+ */
+ *freezeXid = RecentXmin;
+
+ /*
+ * Similarly, initialize the minimum Multixact to the first value that
+ * could possibly be stored in tuples in the table. Running transactions
+ * could reuse values from their local cache, so we are careful to
+ * consider all currently running multis.
+ *
+ * XXX this could be refined further, but is it worth the hassle?
+ */
+ *minmulti = GetOldestMultiXactId();
+
+ srel = RelationCreateStorage(*newrnode, persistence);
+
+ /*
+ * If required, set up an init fork for an unlogged table so that it can
+ * be correctly reinitialized on restart. An immediate sync is required
+ * even if the page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved the
+ * redo pointer past our xlog record. Recovery may as well remove it
+ * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+ * record. Therefore, logging is necessary even if wal_level=minimal.
+ */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+ rel->rd_rel->relkind == RELKIND_MATVIEW ||
+ rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+ smgrcreate(srel, INIT_FORKNUM, false);
+ log_smgrcreate(newrnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+}
+
+static void
+zedstoream_relation_nontransactional_truncate(Relation rel)
+{
+ /* XXX: I think we could just throw away all data in the buffer */
+ zsbt_tuplebuffer_flush(rel);
+ zsmeta_invalidate_cache(rel);
+ RelationTruncate(rel, 0);
+}
+
+static void
+zedstoream_relation_copy_data(Relation rel, const RelFileNode *newrnode)
+{
+ SMgrRelation dstrel;
+
+ zsbt_tuplebuffer_flush(rel);
+
+ dstrel = smgropen(*newrnode, rel->rd_backend);
+ RelationOpenSmgr(rel);
+
+ /*
+ * Since we copy the file directly without looking at the shared buffers,
+ * we'd better first flush out any pages of the source relation that are
+ * in shared buffers. We assume no new changes will be made while we are
+ * holding exclusive lock on the rel.
+ */
+ FlushRelationBuffers(rel);
+
+ /*
+ * Create and copy all the relation, and schedule unlinking of the
+ * old physical file.
+ *
+ * NOTE: any conflict in relfilenode value will be caught in
+ * RelationCreateStorage().
+ *
+ * NOTE: There is only the main fork in zedstore. Otherwise
+ * this would need to copy other forks, too.
+ */
+ RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence);
+
+ /* copy main fork */
+ RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
+ rel->rd_rel->relpersistence);
+
+ /* drop old relation, and close new one */
+ RelationDropStorage(rel);
+ smgrclose(dstrel);
+}
+
+/*
+ * Subroutine of the zedstoream_relation_copy_for_cluster() callback.
+ *
+ * Creates the TID item with correct visibility information for the
+ * given tuple in the old table. Returns the tid of the tuple in the
+ * new table, or InvalidZSTid if this tuple can be left out completely.
+ *
+ * FIXME: This breaks UPDATE chains. I.e. after this is done, an UPDATE
+ * looks like DELETE + INSERT, instead of an UPDATE, to any transaction that
+ * might try to follow the update chain.
+ */
+static zstid
+zs_cluster_process_tuple(Relation OldHeap, Relation NewHeap,
+ zstid oldtid, ZSUndoRecPtr old_undoptr,
+ ZSUndoRecPtr recent_oldest_undo,
+ TransactionId OldestXmin)
+{
+ TransactionId this_xmin;
+ CommandId this_cmin;
+ TransactionId this_xmax;
+ CommandId this_cmax;
+ bool this_changedPart;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ /*
+ * Follow the chain of UNDO records for this tuple, to find the
+ * transaction that originally inserted the row (xmin/cmin), and
+ * the transaction that deleted or updated it away, if any (xmax/cmax)
+ */
+ this_xmin = FrozenTransactionId;
+ this_cmin = InvalidCommandId;
+ this_xmax = InvalidTransactionId;
+ this_cmax = InvalidCommandId;
+
+ undo_ptr = old_undoptr;
+ for (;;)
+ {
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ {
+ /* This tuple version is visible to everyone. */
+ break;
+ }
+
+ /* Fetch the next UNDO record. */
+ undorec = zsundo_fetch_record(OldHeap, undo_ptr);
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ if (!TransactionIdIsCurrentTransactionId(undorec->xid) &&
+ !TransactionIdIsInProgress(undorec->xid) &&
+ !TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * inserter aborted or crashed. This row is not visible to
+ * anyone. Including any later tuple versions we might have
+ * seen.
+ */
+ this_xmin = InvalidTransactionId;
+ break;
+ }
+ else
+ {
+ /* Inserter committed. */
+ this_xmin = undorec->xid;
+ this_cmin = undorec->cid;
+
+ /* we know everything there is to know about this tuple version. */
+ break;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* Ignore tuple locks for now.
+ *
+ * FIXME: we should propagate them to the new copy of the table
+ */
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ /* Row was deleted (or updated away). */
+ if (!TransactionIdIsCurrentTransactionId(undorec->xid) &&
+ !TransactionIdIsInProgress(undorec->xid) &&
+ !TransactionIdDidCommit(undorec->xid))
+ {
+ /* deleter aborted or crashed. The previous record should
+ * be an insertion (possibly with some tuple-locking in
+ * between). We'll remember the tuple when we see the
+ * insertion.
+ */
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ else
+ {
+ /* deleter committed or is still in progress. */
+ if (TransactionIdPrecedes(undorec->xid, OldestXmin))
+ {
+ /* the deletion is visible to everyone. We can skip the row completely. */
+ this_xmin = InvalidTransactionId;
+ break;
+ }
+ else
+ {
+ /* deleter committed or is in progress. Remember that it was
+ * deleted by this XID.
+ */
+ this_xmax = undorec->xid;
+ this_cmax = undorec->cid;
+ if (undorec->type == ZSUNDO_TYPE_DELETE)
+ this_changedPart = ((ZSUndoRec_Delete *) undorec)->changedPart;
+ else
+ this_changedPart = false;
+
+ /* follow the UNDO chain to find information about the inserting
+ * transaction (xmin/cmin)
+ */
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ }
+ }
+ }
+
+ /*
+ * We now know the visibility of this tuple. Re-create it in the new table.
+ */
+ if (this_xmin != InvalidTransactionId)
+ {
+ /* Insert the first version of the row. */
+ zstid newtid;
+
+ /* First, insert the tuple. */
+ newtid = zsbt_tid_multi_insert(NewHeap,
+ 1,
+ this_xmin,
+ this_cmin,
+ INVALID_SPECULATIVE_TOKEN,
+ InvalidUndoPtr);
+
+ /* And if the tuple was deleted/updated away, do the same in the new table. */
+ if (this_xmax != InvalidTransactionId)
+ {
+ TM_Result delete_result;
+ bool this_xact_has_lock;
+
+ /* tuple was deleted. */
+ delete_result = zsbt_tid_delete(NewHeap, newtid,
+ this_xmax, this_cmax,
+ NULL, NULL, false, NULL, this_changedPart,
+ &this_xact_has_lock);
+ if (delete_result != TM_Ok)
+ elog(ERROR, "tuple deletion failed during table rewrite");
+ }
+ return newtid;
+ }
+ else
+ return InvalidZSTid;
+}
+
+
+static void
+zedstoream_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
+ Relation OldIndex, bool use_sort,
+ TransactionId OldestXmin,
+ TransactionId *xid_cutoff,
+ MultiXactId *multi_cutoff,
+ double *num_tuples,
+ double *tups_vacuumed,
+ double *tups_recently_dead)
+{
+ TupleDesc olddesc;
+ ZSTidTreeScan tid_scan;
+ ZSAttrTreeScan *attr_scans;
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(OldHeap, true);
+ int attno;
+ IndexScanDesc indexScan;
+ Datum *newdatums;
+ bool *newisnulls;
+
+ zsbt_tuplebuffer_flush(OldHeap);
+
+	olddesc = RelationGetDescr(OldHeap);
+
+ attr_scans = palloc(olddesc->natts * sizeof(ZSAttrTreeScan));
+
+ /*
+ * Scan the old table. We ignore any old updated-away tuple versions,
+ * and only stop at the latest tuple version of each row. At the latest
+ * version, follow the update chain to get all the old versions of that
+ * row, too. That way, the whole update chain is processed in one go,
+ * and can be reproduced in the new table.
+ */
+ zsbt_tid_begin_scan(OldHeap, MinZSTid, MaxPlusOneZSTid,
+ SnapshotAny, &tid_scan);
+
+ for (attno = 1; attno <= olddesc->natts; attno++)
+ {
+ if (TupleDescAttr(olddesc, attno - 1)->attisdropped)
+ continue;
+
+ zsbt_attr_begin_scan(OldHeap,
+ olddesc,
+ attno,
+ &attr_scans[attno - 1]);
+ }
+
+ newdatums = palloc(olddesc->natts * sizeof(Datum));
+ newisnulls = palloc(olddesc->natts * sizeof(bool));
+
+ /* TODO: sorting not implemented yet. (it would require materializing each
+ * row into a HeapTuple or something like that, which could carry the xmin/xmax
+ * information through the sorter).
+ */
+ use_sort = false;
+
+ /*
+	 * Prepare to scan the OldHeap. To ensure we see recently-dead tuples
+	 * that still need to be copied, we scan with SnapshotAny; each tuple's
+	 * visibility is then determined in zs_cluster_process_tuple() by
+	 * following its UNDO chain.
+ */
+ if (OldIndex != NULL && !use_sort)
+ {
+ const int ci_index[] = {
+ PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_INDEX_RELID
+ };
+ int64 ci_val[2];
+
+ /* Set phase and OIDOldIndex to columns */
+ ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[1] = RelationGetRelid(OldIndex);
+ pgstat_progress_update_multi_param(2, ci_index, ci_val);
+
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+ index_rescan(indexScan, NULL, 0, NULL, 0);
+ }
+ else
+ {
+ /* In scan-and-sort mode and also VACUUM FULL, set phase */
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+
+ indexScan = NULL;
+
+ /* Set total heap blocks */
+ /* TODO */
+#if 0
+ pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ heapScan->rs_nblocks);
+#endif
+ }
+
+ for (;;)
+ {
+ zstid old_tid;
+ ZSUndoRecPtr old_undoptr;
+ zstid new_tid;
+ zstid fetchtid = InvalidZSTid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ if (indexScan != NULL)
+ {
+ ItemPointer itemptr;
+
+ itemptr = index_getnext_tid(indexScan, ForwardScanDirection);
+ if (!itemptr)
+ break;
+
+ /* Since we used no scan keys, should never need to recheck */
+ if (indexScan->xs_recheck)
+ elog(ERROR, "CLUSTER does not support lossy index conditions");
+
+ fetchtid = ZSTidFromItemPointer(*itemptr);
+ zsbt_tid_reset_scan(&tid_scan, MinZSTid, MaxPlusOneZSTid, fetchtid - 1);
+ old_tid = zsbt_tid_scan_next(&tid_scan, ForwardScanDirection);
+ if (old_tid == InvalidZSTid)
+ continue;
+ }
+ else
+ {
+ old_tid = zsbt_tid_scan_next(&tid_scan, ForwardScanDirection);
+ if (old_tid == InvalidZSTid)
+ break;
+ fetchtid = old_tid;
+ }
+ if (old_tid != fetchtid)
+ continue;
+
+ old_undoptr = tid_scan.array_iter.undoslots[ZSTidScanCurUndoSlotNo(&tid_scan)];
+
+ new_tid = zs_cluster_process_tuple(OldHeap, NewHeap,
+ old_tid, old_undoptr,
+ recent_oldest_undo,
+ OldestXmin);
+ if (new_tid != InvalidZSTid)
+ {
+ /* Fetch the attributes and write them out */
+ for (attno = 1; attno <= olddesc->natts; attno++)
+ {
+ Form_pg_attribute att = TupleDescAttr(olddesc, attno - 1);
+ Datum datum;
+ bool isnull;
+
+ if (att->attisdropped)
+ {
+ datum = (Datum) 0;
+ isnull = true;
+ }
+ else
+ {
+ if (!zsbt_attr_fetch(&attr_scans[attno - 1], &datum, &isnull, old_tid))
+ zsbt_fill_missing_attribute_value(olddesc, attno, &datum, &isnull);
+ }
+
+ /* flatten and re-toast any ZS-TOASTed values */
+ if (!isnull && att->attlen == -1)
+ {
+ if (VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(OldHeap, attno, old_tid, datum);
+ }
+ }
+ newdatums[attno - 1] = datum;
+ newisnulls[attno - 1] = isnull;
+ }
+
+ zsbt_tuplebuffer_spool_tuple(NewHeap, new_tid, newdatums, newisnulls);
+ }
+ }
+
+ if (indexScan != NULL)
+ index_endscan(indexScan);
+
+ zsbt_tid_end_scan(&tid_scan);
+ for (attno = 1; attno <= olddesc->natts; attno++)
+ {
+ if (TupleDescAttr(olddesc, attno - 1)->attisdropped)
+ continue;
+
+ zsbt_attr_end_scan(&attr_scans[attno - 1]);
+ }
+
+ zsbt_tuplebuffer_flush(NewHeap);
+}
+
+static void
+zedstoream_scan_analyze_beginscan(Relation onerel, AnalyzeSampleContext *context)
+{
+ zstid tid;
+ List *va_cols = context->anl_cols;
+ Bitmapset *project_columns = NULL;
+
+	/* zedstore can restrict sampling to just the specified columns */
+ if (!va_cols)
+ context->scan = table_beginscan_analyze(onerel);
+ else
+ {
+ ListCell *le;
+
+ foreach(le, va_cols)
+ {
+ char *col = strVal(lfirst(le));
+
+ project_columns =
+ bms_add_member(project_columns, attnameAttNum(onerel, col, false));
+ }
+
+ context->scan =
+ zedstoream_beginscan_with_column_projection(onerel, NULL, 0, NULL,
+ NULL, SO_TYPE_ANALYZE,
+ project_columns);
+ }
+
+	/* zedstore uses logical block numbers to acquire sample rows */
+ tid = zsbt_get_last_tid(onerel);
+ context->totalblocks = ZSTidGetBlockNumber(tid) + 1;
+}
+
+/*
+ * Get next logical block.
+ */
+static bool
+zedstoream_scan_analyze_next_block(BlockNumber blockno,
+ AnalyzeSampleContext *context)
+{
+ return zs_blkscan_next_block(context->scan, blockno, NULL, -1, false);
+}
+
+static bool
+zedstoream_scan_analyze_next_tuple(TransactionId OldestXmin, AnalyzeSampleContext *context)
+{
+ int i;
+ bool result;
+ AttrNumber attno;
+ TableScanDesc scan = context->scan;
+ ZedStoreDesc sscan = (ZedStoreDesc) scan;
+ ZSAttrTreeScan *attr_scan;
+ TupleTableSlot *slot = AnalyzeGetSampleSlot(context, scan->rs_rd, ANALYZE_SAMPLE_DATA);
+
+ result = zs_blkscan_next_tuple(scan, slot);
+
+ if (result)
+ {
+		/* provide extra disk-size info when analyzing all columns */
+ if (!context->anl_cols)
+ {
+ slot = AnalyzeGetSampleSlot(context, scan->rs_rd, ANALYZE_SAMPLE_DISKSIZE);
+
+ ExecClearTuple(slot);
+
+ for (i = 0; i < scan->rs_rd->rd_att->natts; i++)
+ slot->tts_isnull[i] = true;
+
+ for (i = 1; i < sscan->proj_data.num_proj_atts; i++)
+ {
+ attr_scan = &sscan->proj_data.attr_scans[i - 1];
+ attno = sscan->proj_data.proj_atts[i];
+
+ slot->tts_values[attno - 1] =
+ Float8GetDatum(attr_scan->decoder.avg_elements_size);
+ slot->tts_isnull[attno - 1] = false;
+ }
+
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+ }
+
+ context->liverows++;
+ }
+
+ return result;
+}
+
+static void
+zedstoream_scan_analyze_sample_tuple(int pos, bool replace, AnalyzeSampleContext *context)
+{
+ TupleTableSlot *slot;
+ Relation onerel = context->scan->rs_rd;
+
+ slot = AnalyzeGetSampleSlot(context, onerel, ANALYZE_SAMPLE_DATA);
+ AnalyzeRecordSampleRow(context, slot, NULL, ANALYZE_SAMPLE_DATA, pos, replace, false);
+
+	/* the disk-size sample is only collected when analyzing all columns */
+ if (!context->anl_cols)
+ {
+ slot = AnalyzeGetSampleSlot(context, onerel, ANALYZE_SAMPLE_DISKSIZE);
+ AnalyzeRecordSampleRow(context, slot, NULL, ANALYZE_SAMPLE_DISKSIZE, pos, replace, false);
+ }
+}
+
+static void
+zedstoream_scan_analyze_endscan(AnalyzeSampleContext *context)
+{
+ table_endscan(context->scan);
+}
+
+/* ------------------------------------------------------------------------
+ * Miscellaneous callbacks for the heap AM
+ * ------------------------------------------------------------------------
+ */
+
+/*
+ * FIXME: Implement this function properly for zedstore. The return value is,
+ * for example, used by ANALYZE to decide which blocks to sample.
+ */
+static uint64
+zedstoream_relation_size(Relation rel, ForkNumber forkNumber)
+{
+ uint64 nblocks = 0;
+
+ /* Open it at the smgr level if not already done */
+ RelationOpenSmgr(rel);
+ nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+ return nblocks * BLCKSZ;
+}
+
+/*
+ * Zedstore stores TOAST chunks within the table file itself, so it doesn't
+ * need a separate toast table. Returning false from this callback avoids
+ * creating one.
+ */
+static bool
+zedstoream_relation_needs_toast_table(Relation rel)
+{
+ return false;
+}
+
+/* ------------------------------------------------------------------------
+ * Planner related callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+/*
+ * Currently this is an exact duplicate of heapam_estimate_rel_size().
+ * TODO: tune it for zedstore's storage format.
+ */
+static void
+zedstoream_relation_estimate_size(Relation rel, int32 *attr_widths,
+ BlockNumber *pages, double *tuples,
+ double *allvisfrac)
+{
+ BlockNumber curpages;
+ BlockNumber relpages;
+ double reltuples;
+ BlockNumber relallvisible;
+ double density;
+
+ /* it has storage, ok to call the smgr */
+ curpages = RelationGetNumberOfBlocks(rel);
+
+ /* coerce values in pg_class to more desirable types */
+ relpages = (BlockNumber) rel->rd_rel->relpages;
+ reltuples = (double) rel->rd_rel->reltuples;
+ relallvisible = (BlockNumber) rel->rd_rel->relallvisible;
+
+ /*
+ * HACK: if the relation has never yet been vacuumed, use a minimum size
+ * estimate of 10 pages. The idea here is to avoid assuming a
+ * newly-created table is really small, even if it currently is, because
+ * that may not be true once some data gets loaded into it. Once a vacuum
+ * or analyze cycle has been done on it, it's more reasonable to believe
+ * the size is somewhat stable.
+ *
+ * (Note that this is only an issue if the plan gets cached and used again
+ * after the table has been filled. What we're trying to avoid is using a
+ * nestloop-type plan on a table that has grown substantially since the
+ * plan was made. Normally, autovacuum/autoanalyze will occur once enough
+ * inserts have happened and cause cached-plan invalidation; but that
+ * doesn't happen instantaneously, and it won't happen at all for cases
+ * such as temporary tables.)
+ *
+ * We approximate "never vacuumed" by "has relpages = 0", which means this
+ * will also fire on genuinely empty relations. Not great, but
+ * fortunately that's a seldom-seen case in the real world, and it
+ * shouldn't degrade the quality of the plan too much anyway to err in
+ * this direction.
+ *
+ * If the table has inheritance children, we don't apply this heuristic.
+ * Totally empty parent tables are quite common, so we should be willing
+ * to believe that they are empty.
+ */
+ if (curpages < 10 &&
+ relpages == 0 &&
+ !rel->rd_rel->relhassubclass)
+ curpages = 10;
+
+ /* report estimated # pages */
+ *pages = curpages;
+ /* quick exit if rel is clearly empty */
+ if (curpages == 0)
+ {
+ *tuples = 0;
+ *allvisfrac = 0;
+ return;
+ }
+
+ /* estimate number of tuples from previous tuple density */
+ if (relpages > 0)
+ density = reltuples / (double) relpages;
+ else
+ {
+ /*
+ * When we have no data because the relation was truncated, estimate
+ * tuple width from attribute datatypes. We assume here that the
+ * pages are completely full, which is OK for tables (since they've
+ * presumably not been VACUUMed yet) but is probably an overestimate
+ * for indexes. Fortunately get_relation_info() can clamp the
+ * overestimate to the parent table's size.
+ *
+ * Note: this code intentionally disregards alignment considerations,
+ * because (a) that would be gilding the lily considering how crude
+ * the estimate is, and (b) it creates platform dependencies in the
+ * default plans which are kind of a headache for regression testing.
+ */
+ int32 tuple_width;
+
+ tuple_width = get_rel_data_width(rel, attr_widths);
+ tuple_width += MAXALIGN(SizeofHeapTupleHeader);
+ tuple_width += sizeof(ItemIdData);
+ /* note: integer division is intentional here */
+ density = (BLCKSZ - SizeOfPageHeaderData) / tuple_width;
+ }
+ *tuples = rint(density * (double) curpages);
+
+ /*
+ * We use relallvisible as-is, rather than scaling it up like we do for
+ * the pages and tuples counts, on the theory that any pages added since
+ * the last VACUUM are most likely not marked all-visible. But costsize.c
+ * wants it converted to a fraction.
+ */
+ if (relallvisible == 0 || curpages <= 0)
+ *allvisfrac = 0;
+ else if ((double) relallvisible >= curpages)
+ *allvisfrac = 1;
+ else
+ *allvisfrac = (double) relallvisible / curpages;
+}
+
+/* ------------------------------------------------------------------------
+ * Executor related callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+/*
+ * zs_blkscan_next_block() and zs_blkscan_next_tuple() are used to implement
+ * bitmap scans, and sample scans. The tableam interface for those are similar
+ * enough that they can share most code.
+ */
+static bool
+zs_blkscan_next_block(TableScanDesc sscan,
+ BlockNumber blkno, OffsetNumber *offsets, int noffsets,
+ bool predicatelocks)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ ZedStoreProjectData *scan_proj = &scan->proj_data;
+ int ntuples;
+ zstid tid;
+ int idx;
+
+ if (!scan->started)
+ {
+ Relation rel = scan->rs_scan.rs_rd;
+ TupleDesc reldesc = RelationGetDescr(rel);
+ MemoryContext oldcontext;
+
+ zs_initialize_proj_attributes_extended(scan, reldesc);
+
+ oldcontext = MemoryContextSwitchTo(scan_proj->context);
+ zsbt_tid_begin_scan(rel,
+ ZSTidFromBlkOff(blkno, 1),
+ ZSTidFromBlkOff(blkno + 1, 1),
+ scan->rs_scan.rs_snapshot,
+ &scan_proj->tid_scan);
+ scan_proj->tid_scan.serializable = true;
+ for (int i = 1; i < scan_proj->num_proj_atts; i++)
+ {
+ int attno = scan_proj->proj_atts[i];
+
+ zsbt_attr_begin_scan(rel, reldesc, attno,
+ &scan_proj->attr_scans[i - 1]);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ scan->started = true;
+ }
+ else
+ {
+ zsbt_tid_reset_scan(&scan_proj->tid_scan,
+ ZSTidFromBlkOff(blkno, 1),
+ ZSTidFromBlkOff(blkno + 1, 1),
+ ZSTidFromBlkOff(blkno, 1) - 1);
+ }
+
+ /*
+	 * Our strategy for a bitmap scan is to scan the TID tree in the
+	 * next_block() function, starting at the given logical block number, and
+	 * store all the matching TIDs in the scan struct. next_tuple() will then
+	 * fetch the attribute data from the attribute trees.
+ *
+ * TODO: it might be good to pass the next expected TID down to
+ * zsbt_tid_scan_next, so that it could skip over to the next match more
+ * efficiently.
+ */
+ ntuples = 0;
+ idx = 0;
+ while ((tid = zsbt_tid_scan_next(&scan_proj->tid_scan, ForwardScanDirection)) != InvalidZSTid)
+ {
+ OffsetNumber off = ZSTidGetOffsetNumber(tid);
+ ItemPointerData itemptr;
+
+ Assert(ZSTidGetBlockNumber(tid) == blkno);
+
+ ItemPointerSet(&itemptr, blkno, off);
+
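+		/*
+		 * If the caller supplied an array of offsets (a bitmap scan), keep
+		 * only the TIDs whose offsets appear in the array. The array is in
+		 * ascending order, like the TIDs from the TID tree, so both can be
+		 * walked in a single merge-like pass. noffsets == -1 means "keep all
+		 * TIDs on the block" (sample and analyze scans).
+		 */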
+ if (noffsets != -1)
+ {
+			while (idx < noffsets && off > offsets[idx])
+ {
+ /*
+ * Acquire predicate lock on all tuples that we scan, even those that are
+ * not visible to the snapshot.
+ */
+ if (predicatelocks)
+ /*
+ * We pass in InvalidTransactionId as we are sure that the current
+ * transaction hasn't locked itemptr.
+ */
+ PredicateLockTID(scan->rs_scan.rs_rd, &itemptr, scan->rs_scan.rs_snapshot, InvalidTransactionId);
+
+ idx++;
+ }
+
+ if (idx == noffsets)
+ break;
+
+ if (off < offsets[idx])
+ continue;
+ }
+
+ /* FIXME: heapam acquires the predicate lock first, and then
+ * calls CheckForSerializableConflictOut(). We do it in the
+		 * opposite order, because the CheckForSerializableConflictOut() call
+		 * is already done in zsbt_get_last_tid(). Does it matter?
+ * I'm not sure.
+ */
+ if (predicatelocks)
+ /*
+ * We pass in InvalidTransactionId as we are sure that the current
+ * transaction hasn't locked itemptr.
+ */
+ PredicateLockTID(scan->rs_scan.rs_rd, &itemptr, scan->rs_scan.rs_snapshot, InvalidTransactionId);
+
+ scan->bmscan_tids[ntuples] = tid;
+ ntuples++;
+ }
+
+ scan->bmscan_nexttuple = 0;
+ scan->bmscan_ntuples = ntuples;
+
+ return ntuples > 0;
+}
+
+static bool
+zs_blkscan_next_tuple(TableScanDesc sscan, TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ zstid tid;
+
+ if (scan->bmscan_nexttuple >= scan->bmscan_ntuples)
+ return false;
+
+ /*
+ * Initialize the slot.
+ *
+ * We initialize all columns to NULL. The values for columns that are projected
+ * will be set to the actual values below, but it's important that non-projected
+ * columns are NULL.
+ */
+ ExecClearTuple(slot);
+ for (int i = 0; i < sscan->rs_rd->rd_att->natts; i++)
+ slot->tts_isnull[i] = true;
+
+ /*
+	 * The projection attributes were created from the relation's tuple
+	 * descriptor, so it had better match the slot's tuple descriptor.
+ */
+ Assert((scan->proj_data.num_proj_atts - 1) <= slot->tts_tupleDescriptor->natts);
+ tid = scan->bmscan_tids[scan->bmscan_nexttuple];
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ {
+ ZSAttrTreeScan *attr_scan = &scan->proj_data.attr_scans[i - 1];
+ AttrNumber attno = scan->proj_data.proj_atts[i];
+ Form_pg_attribute att = TupleDescAttr(slot->tts_tupleDescriptor, attno - 1);
+ Datum datum;
+ bool isnull;
+
+ if (!zsbt_attr_fetch(attr_scan, &datum, &isnull, tid))
+ zsbt_fill_missing_attribute_value(slot->tts_tupleDescriptor, attno, &datum, &isnull);
+
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ if (!isnull && att->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(scan->rs_scan.rs_rd, attno, tid, datum);
+ }
+
+ Assert(attno > 0);
+ slot->tts_values[attno - 1] = datum;
+ slot->tts_isnull[attno - 1] = isnull;
+ }
+
+ /* FIXME: Don't we need to set visi_info, like in a seqscan? */
+ slot->tts_tableOid = RelationGetRelid(scan->rs_scan.rs_rd);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ scan->bmscan_nexttuple++;
+
+ pgstat_count_heap_fetch(scan->rs_scan.rs_rd);
+
+ return true;
+}
+
+
+
+static bool
+zedstoream_scan_bitmap_next_block(TableScanDesc sscan,
+ TBMIterateResult *tbmres)
+{
+ return zs_blkscan_next_block(sscan, tbmres->blockno, tbmres->offsets, tbmres->ntuples, true);
+}
+
+static bool
+zedstoream_scan_bitmap_next_tuple(TableScanDesc sscan,
+ TBMIterateResult *tbmres,
+ TupleTableSlot *slot)
+{
+ return zs_blkscan_next_tuple(sscan, slot);
+}
+
+static bool
+zedstoream_scan_sample_next_block(TableScanDesc sscan, SampleScanState *scanstate)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ TsmRoutine *tsm = scanstate->tsmroutine;
+ BlockNumber blockno;
+
+ if (scan->next_tid_to_scan == InvalidZSTid)
+ {
+ /* initialize next tid with the first tid */
+ scan->next_tid_to_scan = zsbt_get_first_tid(scan->rs_scan.rs_rd);
+ }
+
+ if (scan->max_tid_to_scan == InvalidZSTid)
+ {
+ /*
+ * get the max tid once and store it, used to calculate max blocks to
+ * scan either for SYSTEM or BERNOULLI sampling.
+ */
+ scan->max_tid_to_scan = zsbt_get_last_tid(scan->rs_scan.rs_rd);
+ }
+
+ if (tsm->NextSampleBlock)
+ {
+ /* Adding one below to convert block number to number of blocks. */
+ blockno = tsm->NextSampleBlock(scanstate,
+ ZSTidGetBlockNumber(scan->max_tid_to_scan) + 1);
+
+ if (!BlockNumberIsValid(blockno))
+ return false;
+ }
+ else
+ {
+ /* scanning table sequentially */
+ if (scan->next_tid_to_scan > scan->max_tid_to_scan)
+ return false;
+
+ blockno = ZSTidGetBlockNumber(scan->next_tid_to_scan);
+ /* move on to next block of tids for next iteration of scan */
+ scan->next_tid_to_scan = ZSTidFromBlkOff(blockno + 1, 1);
+ }
+
+ Assert(BlockNumberIsValid(blockno));
+
+ /*
+ * Fetch all TIDs on the page.
+ */
+ if (!zs_blkscan_next_block(sscan, blockno, NULL, -1, false))
+ return false;
+
+ /*
+	 * Filter the list of TIDs, keeping only the TIDs that the sampling
+	 * method tells us to keep.
+ */
+ if (scan->bmscan_ntuples > 0)
+ {
+ zstid lasttid_for_block = scan->bmscan_tids[scan->bmscan_ntuples - 1];
+ OffsetNumber maxoffset = ZSTidGetOffsetNumber(lasttid_for_block);
+ OffsetNumber nextoffset;
+ int outtuples;
+ int idx;
+
+ /* ask the tablesample method which tuples to check on this page. */
+ nextoffset = tsm->NextSampleTuple(scanstate, blockno, maxoffset);
+
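+		/*
+		 * Walk the TID array and the sampler's offsets in lockstep, keeping
+		 * only the TIDs whose offset the sampler chose. Both sequences are in
+		 * ascending offset order, so one merge-like pass is enough.
+		 */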
+ outtuples = 0;
+ idx = 0;
+ while (idx < scan->bmscan_ntuples && OffsetNumberIsValid(nextoffset))
+ {
+ zstid thistid = scan->bmscan_tids[idx];
+ OffsetNumber thisoffset = ZSTidGetOffsetNumber(thistid);
+
+ if (thisoffset > nextoffset)
+ nextoffset = tsm->NextSampleTuple(scanstate, blockno, maxoffset);
+ else
+ {
+ if (thisoffset == nextoffset)
+ scan->bmscan_tids[outtuples++] = thistid;
+ idx++;
+ }
+ }
+ scan->bmscan_ntuples = outtuples;
+
+ /*
+ * Must fast forward the sampler through all offsets on this page,
+ * until it returns InvalidOffsetNumber. Otherwise, the next
+ * call will continue to return offsets for this block.
+ *
+ * FIXME: It seems bogus that the sampler isn't reset, when you call
+ * NextSampleBlock(). Perhaps we should fix this in the TSM API?
+ */
+ while (OffsetNumberIsValid(nextoffset))
+ nextoffset = tsm->NextSampleTuple(scanstate, blockno, maxoffset);
+ }
+
+ return scan->bmscan_ntuples > 0;
+}
+
+static bool
+zedstoream_scan_sample_next_tuple(TableScanDesc sscan, SampleScanState *scanstate,
+ TupleTableSlot *slot)
+{
+ /*
+	 * We already filtered the rows in the next_block() function, so all TIDs
+	 * in scan->bmscan_tids belong to the sample.
+ */
+ return zs_blkscan_next_tuple(sscan, slot);
+}
+
+static void
+zedstoream_vacuum_rel(Relation onerel, VacuumParams *params,
+ BufferAccessStrategy bstrategy)
+{
+ zsbt_tuplebuffer_flush(onerel);
+ zsundo_vacuum(onerel, params, bstrategy,
+ GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM));
+}
+
+static const TableAmRoutine zedstoream_methods = {
+ .type = T_TableAmRoutine,
+ .scans_leverage_column_projection = true,
+
+ .slot_callbacks = zedstoream_slot_callbacks,
+
+ .scan_begin = zedstoream_beginscan,
+ .scan_begin_with_column_projection = zedstoream_beginscan_with_column_projection,
+ .scan_end = zedstoream_endscan,
+ .scan_rescan = zedstoream_rescan,
+ .scan_getnextslot = zedstoream_getnextslot,
+
+ .parallelscan_estimate = zs_parallelscan_estimate,
+ .parallelscan_initialize = zs_parallelscan_initialize,
+ .parallelscan_reinitialize = zs_parallelscan_reinitialize,
+
+ .index_fetch_begin = zedstoream_begin_index_fetch,
+ .index_fetch_reset = zedstoream_reset_index_fetch,
+ .index_fetch_end = zedstoream_end_index_fetch,
+ .index_fetch_set_column_projection = zedstoream_fetch_set_column_projection,
+ .index_fetch_tuple = zedstoream_index_fetch_tuple,
+
+ .tuple_insert = zedstoream_insert,
+ .tuple_insert_speculative = zedstoream_insert_speculative,
+ .tuple_complete_speculative = zedstoream_complete_speculative,
+ .multi_insert = zedstoream_multi_insert,
+ .tuple_delete = zedstoream_delete,
+ .tuple_update = zedstoream_update,
+ .tuple_lock = zedstoream_lock_tuple,
+ .finish_bulk_insert = zedstoream_finish_bulk_insert,
+
+ .tuple_fetch_row_version = zedstoream_fetch_row_version,
+ .tuple_get_latest_tid = zedstoream_get_latest_tid,
+ .tuple_tid_valid = zedstoream_tuple_tid_valid,
+ .tuple_satisfies_snapshot = zedstoream_tuple_satisfies_snapshot,
+ .compute_xid_horizon_for_tuples = zedstoream_compute_xid_horizon_for_tuples,
+
+ .relation_set_new_filenode = zedstoream_relation_set_new_filenode,
+ .relation_nontransactional_truncate = zedstoream_relation_nontransactional_truncate,
+ .relation_copy_data = zedstoream_relation_copy_data,
+ .relation_copy_for_cluster = zedstoream_relation_copy_for_cluster,
+ .relation_vacuum = zedstoream_vacuum_rel,
+ .scan_analyze_beginscan = zedstoream_scan_analyze_beginscan,
+ .scan_analyze_next_block = zedstoream_scan_analyze_next_block,
+ .scan_analyze_next_tuple = zedstoream_scan_analyze_next_tuple,
+ .scan_analyze_sample_tuple = zedstoream_scan_analyze_sample_tuple,
+ .scan_analyze_endscan = zedstoream_scan_analyze_endscan,
+
+ .index_build_range_scan = zedstoream_index_build_range_scan,
+ .index_validate_scan = zedstoream_index_validate_scan,
+
+ .relation_size = zedstoream_relation_size,
+ .relation_needs_toast_table = zedstoream_relation_needs_toast_table,
+ .relation_estimate_size = zedstoream_relation_estimate_size,
+
+ .scan_bitmap_next_block = zedstoream_scan_bitmap_next_block,
+ .scan_bitmap_next_tuple = zedstoream_scan_bitmap_next_tuple,
+ .scan_sample_next_block = zedstoream_scan_sample_next_block,
+ .scan_sample_next_tuple = zedstoream_scan_sample_next_tuple
+};
+
+Datum
+zedstore_tableam_handler(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_POINTER(&zedstoream_methods);
+}
+
+
+/*
+ * Routines for dividing up the TID range for parallel seq scans
+ */
+
+/*
+ * Number of TIDs to assign to a parallel worker in a parallel Seq Scan in
+ * one batch.
+ *
+ * Not sure what the optimum would be. If the chunk size is too small,
+ * the parallel workers will waste effort, when two parallel workers both
+ * need to decompress and process the pages at the boundary. But on the
+ * other hand, if the chunk size is too large, we might not be able to make
+ * good use of all the parallel workers.
+ */
+#define ZS_PARALLEL_CHUNK_SIZE ((uint64) 0x100000)
+
+typedef struct ParallelZSScanDescData
+{
+ ParallelTableScanDescData base;
+
+ zstid pzs_endtid; /* last tid + 1 in relation at start of scan */
+ pg_atomic_uint64 pzs_allocatedtids; /* TID space allocated to workers so far. */
+} ParallelZSScanDescData;
+typedef struct ParallelZSScanDescData *ParallelZSScanDesc;
+
+static Size
+zs_parallelscan_estimate(Relation rel)
+{
+ return sizeof(ParallelZSScanDescData);
+}
+
+static Size
+zs_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
+{
+ ParallelZSScanDesc zpscan = (ParallelZSScanDesc) pscan;
+
+ zpscan->base.phs_relid = RelationGetRelid(rel);
+ zpscan->pzs_endtid = zsbt_get_last_tid(rel);
+ pg_atomic_init_u64(&zpscan->pzs_allocatedtids, 1);
+
+ return sizeof(ParallelZSScanDescData);
+}
+
+static void
+zs_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
+{
+ ParallelZSScanDesc bpscan = (ParallelZSScanDesc) pscan;
+
+ pg_atomic_write_u64(&bpscan->pzs_allocatedtids, 1);
+}
+
+/*
+ * Get the next TID range to scan.
+ *
+ * Returns true if there is more to scan, false otherwise.
+ *
+ * Note that even when this returns false, another backend could still be
+ * working on a range it grabbed earlier, so a 'false' return in one backend
+ * doesn't mean that the whole scan is finished.
+ */
+static bool
+zs_parallelscan_nextrange(Relation rel, ParallelZSScanDesc pzscan,
+ zstid *start, zstid *end)
+{
+ uint64 allocatedtids;
+
+ /*
+	 * pzs_allocatedtids tracks how much of the TID space has been handed out
+	 * to workers already. When pzs_allocatedtids >= pzs_endtid, all TIDs
+	 * have been allocated.
+	 *
+	 * Because we use an atomic fetch-and-add to fetch the current value, the
+	 * pzs_allocatedtids counter will exceed pzs_endtid: workers keep
+	 * incrementing it when they try to allocate the next chunk even though
+	 * all TIDs have been allocated already. The counter must be 64 bits wide
+	 * because of that, to avoid wrapping around when pzs_endtid is close to
+	 * the maximum zstid. That's also one reason we hand out TIDs in fairly
+	 * large chunks (ZS_PARALLEL_CHUNK_SIZE, 2^20 TIDs at a time), even
+	 * though zedstore isn't block-oriented.
+ */
+ allocatedtids = pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtids, ZS_PARALLEL_CHUNK_SIZE);
+ *start = (zstid) allocatedtids;
+ *end = (zstid) (allocatedtids + ZS_PARALLEL_CHUNK_SIZE);
+
+ return *start < pzscan->pzs_endtid;
+}
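+
+/*
+ * A minimal usage sketch (illustrative only, not the actual executor code):
+ * a parallel worker keeps asking for TID ranges until this returns false.
+ * zsbt_scan_tid_range() is a hypothetical per-range scan routine, named
+ * here just for the example.
+ *
+ *		zstid	start;
+ *		zstid	end;
+ *
+ *		while (zs_parallelscan_nextrange(rel, pzscan, &start, &end))
+ *			zsbt_scan_tid_range(rel, start, end);
+ */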
+
+/*
+ * Get the value for a row when no value has been stored in the attribute tree.
+ *
+ * This is used after ALTER TABLE ADD COLUMN, when reading rows that were
+ * created before the column was added. Usually, such missing values are
+ * implicitly NULLs, but a different value can be specified with DEFAULT in
+ * the ALTER TABLE command.
+ */
+static void
+zsbt_fill_missing_attribute_value(TupleDesc tupleDesc, int attno, Datum *datum, bool *isnull)
+{
+ Form_pg_attribute attr = TupleDescAttr(tupleDesc, attno - 1);
+
+ *isnull = true;
+ *datum = (Datum) 0;
+
+	/* The catalog doesn't have a missing value stored for this attribute */
+ if (!attr->atthasmissing)
+ return;
+
+ if (tupleDesc->constr &&
+ tupleDesc->constr->missing)
+ {
+ AttrMissing *attrmiss = NULL;
+ /*
+		 * If the tuple descriptor carries missing values, return this
+		 * attribute's entry from it.
+ */
+ attrmiss = tupleDesc->constr->missing;
+
+ if (attrmiss[attno - 1].am_present)
+ {
+ *isnull = false;
+ if (attr->attbyval)
+ *datum = fetch_att(&attrmiss[attno - 1].am_value, attr->attbyval, attr->attlen);
+ else
+ *datum = zs_datumCopy(attrmiss[attno - 1].am_value, attr->attbyval, attr->attlen);
+ }
+ }
+}
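+
+/*
+ * Illustrative sketch (an assumed caller, not the actual scan code): when a
+ * projected column's attribute tree has no entry for a given TID, the scan
+ * can fall back to the missing value like this, with 'attno' being the
+ * 1-based attribute number of the projected column:
+ *
+ *		Datum	datum;
+ *		bool	isnull;
+ *
+ *		zsbt_fill_missing_attribute_value(RelationGetDescr(rel), attno,
+ *										  &datum, &isnull);
+ *		slot->tts_values[attno - 1] = datum;
+ *		slot->tts_isnull[attno - 1] = isnull;
+ */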
diff --git src/backend/commands/analyze.c src/backend/commands/analyze.c
index 924ef37c81..09a5bfa6f1 100644
--- src/backend/commands/analyze.c
+++ src/backend/commands/analyze.c
@@ -89,19 +89,19 @@ static void do_analyze_rel(Relation onerel,
VacuumParams *params, List *va_cols,
AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
bool inh, bool in_outer_xact, int elevel);
+static void compute_disk_stats(VacAttrStats **stats, int natts,
+ TupleDesc desc, HeapTuple *rows,
+ int numrows);
static void compute_index_stats(Relation onerel, double totalrows,
AnlIndexData *indexdata, int nindexes,
HeapTuple *rows, int numrows,
MemoryContext col_context);
static VacAttrStats *examine_attribute(Relation onerel, int attnum,
Node *index_expr);
-static int acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows);
-static int compare_rows(const void *a, const void *b);
-static int acquire_inherited_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows);
+static void acquire_sample_rows(Relation onerel, int elevel,
+ AnalyzeSampleContext *context);
+static void acquire_inherited_sample_rows(Relation onerel, int elevel,
+ AnalyzeSampleContext *context);
static void update_attstats(Oid relid, bool inh,
int natts, VacAttrStats **vacattrstats);
static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
@@ -252,8 +252,6 @@ analyze_rel(Oid relid, RangeVar *relation,
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
MyPgXact->vacuumFlags |= PROC_IN_ANALYZE;
LWLockRelease(ProcArrayLock);
- pgstat_progress_start_command(PROGRESS_COMMAND_ANALYZE,
- RelationGetRelid(onerel));
/*
* Do the normal non-recursive ANALYZE. We can skip this for partitioned
@@ -323,6 +321,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ AnalyzeSampleContext *sample_context;
if (inh)
ereport(elevel,
@@ -507,6 +506,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (targrows < minrows)
targrows = minrows;
+ /* create context for acquiring sample rows */
+ sample_context = CreateAnalyzeSampleContext(onerel, va_cols, targrows,
+ vac_strategy);
+
/*
* Acquire the sample rows
*/
@@ -515,13 +518,13 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
inh ? PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS_INH :
PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS);
if (inh)
- numrows = acquire_inherited_sample_rows(onerel, elevel,
- rows, targrows,
- &totalrows, &totaldeadrows);
+ acquire_inherited_sample_rows(onerel, elevel, sample_context);
else
- numrows = (*acquirefunc) (onerel, elevel,
- rows, targrows,
- &totalrows, &totaldeadrows);
+ (*acquirefunc) (onerel, elevel, sample_context);
+
+ /* Get the sample statistics */
+ AnalyzeGetSampleStats(sample_context, &numrows, &totalrows, &totaldeadrows);
+ rows = AnalyzeGetSampleRows(sample_context, ANALYZE_SAMPLE_DATA, 0);
/*
* Compute the statistics. Temporary results during the calculations for
@@ -571,6 +574,19 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
MemoryContextResetAndDeleteChildren(col_context);
}
+ /* compute disksize ratio stats if any */
+ if (AnalyzeSampleIsValid(sample_context, ANALYZE_SAMPLE_DISKSIZE))
+ {
+ TupleTableSlot *slot =
+ AnalyzeGetSampleSlot(sample_context, onerel, ANALYZE_SAMPLE_DISKSIZE);
+ HeapTuple *rows =
+ AnalyzeGetSampleRows(sample_context, ANALYZE_SAMPLE_DISKSIZE, 0);
+
+ compute_disk_stats(vacattrstats, attr_cnt,
+ slot->tts_tupleDescriptor,
+ rows, numrows);
+ }
+
if (hasindex)
compute_index_stats(onerel, totalrows,
indexdata, nindexes,
@@ -704,6 +720,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
pg_rusage_show(&ru0))));
}
+ DestroyAnalyzeSampleContext(sample_context);
+
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -716,6 +734,41 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
anl_context = NULL;
}
+static void
+compute_disk_stats(VacAttrStats **stats, int natts,
+ TupleDesc desc, HeapTuple *rows,
+ int numrows)
+{
+ int i, j;
+ float8 attr_size = 0;
+ float8 total = 0;
+ bool isNull;
+
+ for (i = 0; i < numrows; i++)
+ {
+ HeapTuple tup = rows[i];
+
+ for (j = 0; j < natts; j++)
+ {
+ VacAttrStats *vac = stats[j];
+ Datum dat = heap_getattr(tup, j + 1, desc, &isNull);
+
+ if (!isNull)
+ {
+ attr_size = DatumGetFloat8(dat);
+ vac->disksize += attr_size;
+ total += attr_size;
+ }
+ }
+ }
+
+	/* Guard against division by zero if no non-null disk sizes were seen */
+	if (total <= 0)
+		return;
+
+	for (j = 0; j < natts; j++)
+ {
+ VacAttrStats *vac = stats[j];
+ vac->stadiskfrac = vac->disksize / total;
+ }
+}
+
/*
* Compute statistics about indexes of a relation
*/
@@ -1032,16 +1085,16 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
* block. The previous sampling method put too much credence in the row
* density near the start of the table.
*/
-static int
+static void
acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows)
+ AnalyzeSampleContext *context)
{
int numrows = 0; /* # rows now in reservoir */
+ int targrows = context->targrows;
double samplerows = 0; /* total # rows collected */
- double liverows = 0; /* # live rows seen */
- double deadrows = 0; /* # dead rows seen */
double rowstoskip = -1; /* -1 means not set yet */
+ double totalrows = 0;
+ double totaldeadrows = 0;
BlockNumber totalblocks;
TransactionId OldestXmin;
BlockSamplerData bs;
@@ -1053,7 +1106,9 @@ acquire_sample_rows(Relation onerel, int elevel,
Assert(targrows > 0);
- totalblocks = RelationGetNumberOfBlocks(onerel);
+ table_scan_analyze_beginscan(onerel, context);
+
+ totalblocks = context->totalblocks;
/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM);
@@ -1068,9 +1123,6 @@ acquire_sample_rows(Relation onerel, int elevel,
/* Prepare for sampling rows */
reservoir_init_selection_state(&rstate, targrows);
- scan = table_beginscan_analyze(onerel);
- slot = table_slot_create(onerel, NULL);
-
/* Outer loop over blocks to sample */
while (BlockSampler_HasMore(&bs))
{
@@ -1078,10 +1130,10 @@ acquire_sample_rows(Relation onerel, int elevel,
vacuum_delay_point();
- if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
+ if (!table_scan_analyze_next_block(targblock, context))
continue;
- while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+ while (table_scan_analyze_next_tuple(OldestXmin, context))
{
/*
* The first targrows sample rows are simply copied into the
@@ -1096,7 +1148,11 @@ acquire_sample_rows(Relation onerel, int elevel,
* we're done.
*/
if (numrows < targrows)
- rows[numrows++] = ExecCopySlotHeapTuple(slot);
+ {
+ table_scan_analyze_sample_tuple(numrows, false, context);
+
+ numrows++;
+ }
else
{
/*
@@ -1116,8 +1172,8 @@ acquire_sample_rows(Relation onerel, int elevel,
int k = (int) (targrows * sampler_random_fract(rstate.randstate));
Assert(k >= 0 && k < targrows);
- heap_freetuple(rows[k]);
- rows[k] = ExecCopySlotHeapTuple(slot);
+
+ table_scan_analyze_sample_tuple(k, true, context);
}
rowstoskip -= 1;
@@ -1130,19 +1186,7 @@ acquire_sample_rows(Relation onerel, int elevel,
++blksdone);
}
- ExecDropSingleTupleTableSlot(slot);
- table_endscan(scan);
-
- /*
- * If we didn't find as many tuples as we wanted then we're done. No sort
- * is needed, since they're already in order.
- *
- * Otherwise we need to sort the collected tuples by position
- * (itempointer). It's not worth worrying about corner cases where the
- * tuples are already sorted.
- */
- if (numrows == targrows)
- qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
+ table_scan_analyze_endscan(context);
/*
* Estimate total numbers of live and dead rows in relation, extrapolating
@@ -1153,13 +1197,13 @@ acquire_sample_rows(Relation onerel, int elevel,
*/
if (bs.m > 0)
{
- *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
- *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+ totalrows = floor((context->liverows / bs.m) * totalblocks + 0.5);
+ totaldeadrows = floor((context->deadrows / bs.m) * totalblocks + 0.5);
}
else
{
- *totalrows = 0.0;
- *totaldeadrows = 0.0;
+ totalrows = 0.0;
+ totaldeadrows = 0.0;
}
/*
@@ -1171,34 +1215,13 @@ acquire_sample_rows(Relation onerel, int elevel,
"%d rows in sample, %.0f estimated total rows",
RelationGetRelationName(onerel),
bs.m, totalblocks,
- liverows, deadrows,
- numrows, *totalrows)));
-
- return numrows;
-}
+ context->liverows,
+ context->deadrows,
+ numrows, totalrows)));
-/*
- * qsort comparator for sorting rows[] array
- */
-static int
-compare_rows(const void *a, const void *b)
-{
- HeapTuple ha = *(const HeapTuple *) a;
- HeapTuple hb = *(const HeapTuple *) b;
- BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
- OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
- BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
- OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
-
- if (ba < bb)
- return -1;
- if (ba > bb)
- return 1;
- if (oa < ob)
- return -1;
- if (oa > ob)
- return 1;
- return 0;
+ context->totalrows += totalrows;
+ context->totaldeadrows += totaldeadrows;
+ context->totalsampledrows += numrows;
}
@@ -1210,18 +1233,16 @@ compare_rows(const void *a, const void *b)
* We fail and return zero if there are no inheritance children, or if all
* children are foreign tables that don't support ANALYZE.
*/
-static int
+static void
acquire_inherited_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows)
+ AnalyzeSampleContext *context)
{
List *tableOIDs;
Relation *rels;
AcquireSampleRowsFunc *acquirefuncs;
double *relblocks;
double totalblocks;
- int numrows,
- nrels,
+ int nrels,
i;
ListCell *lc;
bool has_child;
@@ -1249,7 +1270,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
(errmsg("skipping analyze of \"%s.%s\" inheritance tree --- this inheritance tree contains no child tables",
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel))));
- return 0;
+ return;
}
/*
@@ -1347,7 +1368,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
(errmsg("skipping analyze of \"%s.%s\" inheritance tree --- this inheritance tree contains no analyzable child tables",
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel))));
- return 0;
+ return;
}
/*
@@ -1358,65 +1379,25 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
*/
pgstat_progress_update_param(PROGRESS_ANALYZE_CHILD_TABLES_TOTAL,
nrels);
- numrows = 0;
- *totalrows = 0;
- *totaldeadrows = 0;
for (i = 0; i < nrels; i++)
{
Relation childrel = rels[i];
AcquireSampleRowsFunc acquirefunc = acquirefuncs[i];
double childblocks = relblocks[i];
- pgstat_progress_update_param(PROGRESS_ANALYZE_CURRENT_CHILD_TABLE_RELID,
- RelationGetRelid(childrel));
-
if (childblocks > 0)
{
int childtargrows;
- childtargrows = (int) rint(targrows * childblocks / totalblocks);
+ childtargrows = (int) rint(context->totaltargrows * childblocks / totalblocks);
/* Make sure we don't overrun due to roundoff error */
- childtargrows = Min(childtargrows, targrows - numrows);
+ childtargrows = Min(childtargrows, context->totaltargrows - context->totalsampledrows);
if (childtargrows > 0)
{
- int childrows;
- double trows,
- tdrows;
+ InitAnalyzeSampleContextForChild(context, childrel, childtargrows);
/* Fetch a random sample of the child's rows */
- childrows = (*acquirefunc) (childrel, elevel,
- rows + numrows, childtargrows,
- &trows, &tdrows);
-
- /* We may need to convert from child's rowtype to parent's */
- if (childrows > 0 &&
- !equalTupleDescs(RelationGetDescr(childrel),
- RelationGetDescr(onerel)))
- {
- TupleConversionMap *map;
-
- map = convert_tuples_by_name(RelationGetDescr(childrel),
- RelationGetDescr(onerel));
- if (map != NULL)
- {
- int j;
-
- for (j = 0; j < childrows; j++)
- {
- HeapTuple newtup;
-
- newtup = execute_attr_map_tuple(rows[numrows + j], map);
- heap_freetuple(rows[numrows + j]);
- rows[numrows + j] = newtup;
- }
- free_conversion_map(map);
- }
- }
-
- /* And add to counts */
- numrows += childrows;
- *totalrows += trows;
- *totaldeadrows += tdrows;
+ (*acquirefunc) (childrel, elevel, context);
}
}
@@ -1428,8 +1409,6 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
pgstat_progress_update_param(PROGRESS_ANALYZE_CHILD_TABLES_DONE,
i + 1);
}
-
- return numrows;
}
@@ -1495,6 +1474,7 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
values[Anum_pg_statistic_staattnum - 1] = Int16GetDatum(stats->attr->attnum);
values[Anum_pg_statistic_stainherit - 1] = BoolGetDatum(inh);
values[Anum_pg_statistic_stanullfrac - 1] = Float4GetDatum(stats->stanullfrac);
+ values[Anum_pg_statistic_stadiskfrac - 1] = Float4GetDatum(stats->stadiskfrac);
values[Anum_pg_statistic_stawidth - 1] = Int32GetDatum(stats->stawidth);
values[Anum_pg_statistic_stadistinct - 1] = Float4GetDatum(stats->stadistinct);
i = Anum_pg_statistic_stakind1 - 1;
@@ -1527,7 +1507,7 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
/* XXX knows more than it should about type float4: */
arry = construct_array(numdatums, nnum,
FLOAT4OID,
- sizeof(float4), true, TYPALIGN_INT);
+ sizeof(float4), true, 'i');
values[i++] = PointerGetDatum(arry); /* stanumbersN */
}
else
diff --git src/backend/commands/copy.c src/backend/commands/copy.c
index fbde9f88e7..59cbf0cac2 100644
--- src/backend/commands/copy.c
+++ src/backend/commands/copy.c
@@ -2113,9 +2113,26 @@ CopyTo(CopyState cstate)
{
TupleTableSlot *slot;
TableScanDesc scandesc;
+ Bitmapset *proj = NULL;
- scandesc = table_beginscan(cstate->rel, GetActiveSnapshot(), 0, NULL);
slot = table_slot_create(cstate->rel, NULL);
+ if (table_scans_leverage_column_projection(cstate->rel))
+ {
+ foreach(cur, cstate->attnumlist)
+ {
+ int attnum = lfirst_int(cur);
+ Assert(attnum <= slot->tts_tupleDescriptor->natts);
+ proj = bms_add_member(proj, attnum);
+ }
+
+ scandesc = table_beginscan_with_column_projection(cstate->rel,
+ GetActiveSnapshot(),
+ 0, NULL, proj);
+ }
+ else
+ {
+ scandesc = table_beginscan(cstate->rel, GetActiveSnapshot(), 0, NULL);
+ }
processed = 0;
while (table_scan_getnextslot(scandesc, ForwardScanDirection, slot))
@@ -2132,6 +2149,8 @@ CopyTo(CopyState cstate)
ExecDropSingleTupleTableSlot(slot);
table_endscan(scandesc);
+ if (proj)
+ pfree(proj);
}
else
{
diff --git src/backend/commands/tablecmds.c src/backend/commands/tablecmds.c
index 8e35c5bd1a..0dc3a7de46 100644
--- src/backend/commands/tablecmds.c
+++ src/backend/commands/tablecmds.c
@@ -10177,7 +10177,16 @@ validateCheckConstraint(Relation rel, HeapTuple constrtup)
econtext->ecxt_scantuple = slot;
snapshot = RegisterSnapshot(GetLatestSnapshot());
- scan = table_beginscan(rel, snapshot, 0, NULL);
+ if (table_scans_leverage_column_projection(rel))
+ {
+ Bitmapset *proj = NULL;
+ PopulateNeededColumnsForNode((Node*)exprstate->expr, slot->tts_tupleDescriptor->natts, &proj);
+ scan = table_beginscan_with_column_projection(rel, snapshot, 0, NULL, proj);
+ }
+ else
+ {
+ scan = table_beginscan(rel, snapshot, 0, NULL);
+ }
/*
* Switch to per-tuple memory context and reset it for each tuple
diff --git src/backend/commands/trigger.c src/backend/commands/trigger.c
index ed551ab73a..aad158a553 100644
--- src/backend/commands/trigger.c
+++ src/backend/commands/trigger.c
@@ -2488,6 +2488,10 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
if (newtuple != trigtuple)
heap_freetuple(newtuple);
}
+
+	/* Make sure the slot is not dependent on the original tuple */
+ ExecMaterializeSlot(slot);
+
if (should_free)
heap_freetuple(trigtuple);
@@ -2773,6 +2777,10 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
newtuple = NULL;
}
}
+
+	/* Make sure the new slot is not dependent on the original tuple */
+ ExecMaterializeSlot(newslot);
+
if (should_free_trig)
heap_freetuple(trigtuple);
diff --git src/backend/executor/execScan.c src/backend/executor/execScan.c
index 642805d90c..d74d2a1e46 100644
--- src/backend/executor/execScan.c
+++ src/backend/executor/execScan.c
@@ -20,6 +20,7 @@
#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
#include "utils/memutils.h"
@@ -340,3 +341,72 @@ ExecScanReScan(ScanState *node)
}
}
}
+
+typedef struct neededColumnContext
+{
+ Bitmapset **mask;
+ int n;
+} neededColumnContext;
+
+static bool
+neededColumnContextWalker(Node *node, neededColumnContext *c)
+{
+ if (node == NULL)
+ return false;
+
+ if (IsA(node, Var))
+ {
+ Var *var = (Var *)node;
+
+ if (var->varattno >= 0)
+ {
+ Assert(var->varattno <= c->n);
+ *(c->mask) = bms_add_member(*(c->mask), var->varattno);
+ }
+
+ /*
+		 * A varattno of zero denotes a whole-row variable, so set the bits
+		 * for all columns.
+ */
+ if (var->varattno == 0)
+ bms_add_range(*(c->mask), 1, c->n);
+			*(c->mask) = bms_add_range(*(c->mask), 1, c->n);
+ return false;
+ }
+ return expression_tree_walker(node, neededColumnContextWalker, (void * )c);
+}
+
+/*
+ * n specifies the number of allowed entries in mask: we use
+ * it for bounds-checking in the walker above.
+ */
+void
+PopulateNeededColumnsForNode(Node *expr, int n, Bitmapset **scanCols)
+{
+ neededColumnContext c;
+
+ c.mask = scanCols;
+ c.n = n;
+
+ neededColumnContextWalker(expr, &c);
+}
+
+Bitmapset *
+PopulateNeededColumnsForScan(ScanState *scanstate, int ncol)
+{
+ Bitmapset *result = NULL;
+ Plan *plan = scanstate->ps.plan;
+
+ PopulateNeededColumnsForNode((Node *) plan->targetlist, ncol, &result);
+ PopulateNeededColumnsForNode((Node *) plan->qual, ncol, &result);
+
+ if (IsA(plan, IndexScan))
+ {
+ PopulateNeededColumnsForNode((Node *) ((IndexScan *) plan)->indexqualorig, ncol, &result);
+ PopulateNeededColumnsForNode((Node *) ((IndexScan *) plan)->indexorderbyorig, ncol, &result);
+ }
+ else if (IsA(plan, BitmapHeapScan))
+ PopulateNeededColumnsForNode((Node *) ((BitmapHeapScan *) plan)->bitmapqualorig, ncol, &result);
+
+ return result;
+}
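+
+/*
+ * Worked example (attribute numbers are illustrative): for a seqscan of
+ * "SELECT b FROM t WHERE c > 0", where b is attno 2 and c is attno 3, the
+ * targetlist contributes {2} and the qual contributes {3}, so the scan's
+ * projection bitmap becomes {2, 3}. A whole-row reference (varattno 0)
+ * would instead set the bits for all columns.
+ */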
diff --git src/backend/executor/nodeIndexscan.c src/backend/executor/nodeIndexscan.c
index d0a96a38e0..77aee47917 100644
--- src/backend/executor/nodeIndexscan.c
+++ src/backend/executor/nodeIndexscan.c
@@ -115,6 +115,13 @@ IndexNext(IndexScanState *node)
node->iss_NumScanKeys,
node->iss_NumOrderByKeys);
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ Bitmapset *proj = NULL;
+ proj = PopulateNeededColumnsForScan(&node->ss, node->ss.ss_currentRelation->rd_att->natts);
+ table_index_fetch_set_column_projection(scandesc->xs_heapfetch, proj);
+ }
+
node->iss_ScanDesc = scandesc;
/*
@@ -901,6 +908,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
{
IndexScanState *indexstate;
Relation currentRelation;
+ const TupleTableSlotOps *table_slot_ops;
LOCKMODE lockmode;
/*
@@ -927,11 +935,19 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ss_currentScanDesc = NULL; /* no heap scan here */
/*
- * get the scan type from the relation descriptor.
+ * Initialize the scan slot.
+ *
+ * With the reorder queue, we will sometimes use the reorderqueue's slot,
+ * which uses heap ops, and sometimes the table AM's slot directly. We
+ * have to set scanopsfixed to false, unless the table AM also uses heap
+ * ops.
*/
+ table_slot_ops = table_slot_callbacks(currentRelation);
ExecInitScanTupleSlot(estate, &indexstate->ss,
RelationGetDescr(currentRelation),
- table_slot_callbacks(currentRelation));
+ table_slot_ops);
+ if (node->indexorderby && table_slot_ops != &TTSOpsHeapTuple)
+ indexstate->ss.ps.scanopsfixed = false;
/*
* Initialize result type and projection.
diff --git src/backend/executor/nodeSeqscan.c src/backend/executor/nodeSeqscan.c
index 1a7c1e919f..51ddf4b9e6 100644
--- src/backend/executor/nodeSeqscan.c
+++ src/backend/executor/nodeSeqscan.c
@@ -31,6 +31,7 @@
#include "access/tableam.h"
#include "executor/execdebug.h"
#include "executor/nodeSeqscan.h"
+#include "nodes/nodeFuncs.h"
#include "utils/rel.h"
static TupleTableSlot *SeqNext(SeqScanState *node);
@@ -68,9 +69,20 @@ SeqNext(SeqScanState *node)
* We reach here if the scan is not parallel, or if we're serially
* executing a scan that was planned to be parallel.
*/
- scandesc = table_beginscan(node->ss.ss_currentRelation,
- estate->es_snapshot,
- 0, NULL);
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ Bitmapset *proj = PopulateNeededColumnsForScan(&node->ss,
+ node->ss.ss_currentRelation->rd_att->natts);
+ scandesc = table_beginscan_with_column_projection(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL, proj);
+ }
+ else
+ {
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ }
node->ss.ss_currentScanDesc = scandesc;
}
@@ -270,14 +282,22 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
{
EState *estate = node->ss.ps.state;
ParallelTableScanDesc pscan;
+ Bitmapset *proj = NULL;
pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
+
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ proj = PopulateNeededColumnsForScan(&node->ss,
+ node->ss.ss_currentRelation->rd_att->natts);
+ }
+
table_parallelscan_initialize(node->ss.ss_currentRelation,
pscan,
estate->es_snapshot);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
- table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+ table_beginscan_parallel(node->ss.ss_currentRelation, pscan, proj);
}
/* ----------------------------------------------------------------
@@ -307,8 +327,19 @@ ExecSeqScanInitializeWorker(SeqScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelTableScanDesc pscan;
+ Bitmapset *proj = NULL;
+
+ /*
+	 * FIXME: this duplicates work done in ExecSeqScanInitializeDSM. Once the
+	 * plan itself carries the projection list, this overhead will go away.
+ */
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ proj = PopulateNeededColumnsForScan(&node->ss,
+ node->ss.ss_currentRelation->rd_att->natts);
+ }
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->ss.ss_currentScanDesc =
- table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+ table_beginscan_parallel(node->ss.ss_currentRelation, pscan, proj);
}
diff --git src/backend/optimizer/path/allpaths.c src/backend/optimizer/path/allpaths.c
index 905bbe77d8..7d43929b8b 100644
--- src/backend/optimizer/path/allpaths.c
+++ src/backend/optimizer/path/allpaths.c
@@ -23,6 +23,7 @@
#include "catalog/pg_class.h"
#include "catalog/pg_operator.h"
#include "catalog/pg_proc.h"
+#include "catalog/pg_statistic.h"
#include "foreign/fdwapi.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -47,6 +48,7 @@
#include "partitioning/partbounds.h"
#include "partitioning/partprune.h"
#include "rewrite/rewriteManip.h"
+#include "utils/syscache.h"
#include "utils/lsyscache.h"
@@ -79,7 +81,11 @@ static void set_rel_size(PlannerInfo *root, RelOptInfo *rel,
static void set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
Index rti, RangeTblEntry *rte);
static void set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel,
- RangeTblEntry *rte);
+ Index rti, RangeTblEntry *rte);
+static void set_plain_rel_page_estimates(PlannerInfo *root,
+ RelOptInfo *rel,
+ Index rti,
+ RangeTblEntry *rte);
static void create_plain_partial_paths(PlannerInfo *root, RelOptInfo *rel);
static void set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
RangeTblEntry *rte);
@@ -409,7 +415,7 @@ set_rel_size(PlannerInfo *root, RelOptInfo *rel,
else
{
/* Plain relation */
- set_plain_rel_size(root, rel, rte);
+ set_plain_rel_size(root, rel, rti, rte);
}
break;
case RTE_SUBQUERY:
@@ -571,7 +577,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
* Set size estimates for a plain relation (no subquery, no inheritance)
*/
static void
-set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
+set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte)
{
/*
* Test any partial indexes of rel for applicability. We must do this
@@ -581,6 +587,81 @@ set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
/* Mark rel with estimated output rows, width, etc */
set_baserel_size_estimates(root, rel);
+
+ /* Estimate the pages based on the selected columns */
+ set_plain_rel_page_estimates(root, rel, rti, rte);
+}
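+
+/*
+ * Worked example for set_plain_rel_page_estimates() below (numbers are
+ * illustrative): for a zedstore relation with rel->pages = 1000, where the
+ * query references two columns whose stadiskfrac values are 0.05 and 0.15,
+ * sel = 0.20 and the scan is costed as reading rint(1000 * 0.20) = 200
+ * pages instead of 1000.
+ */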
+
+static void
+set_plain_rel_page_estimates(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte)
+{
+ Var *var;
+ List *vars;
+ double pages;
+ ListCell *lc;
+ ListCell *lc1;
+ Bitmapset *cols = NULL;
+ HeapTuple tp;
+ AttrNumber attno;
+ Selectivity sel = 0;
+
+ Assert(rel->rtekind == RTE_RELATION);
+
+ foreach(lc, rel->reltarget->exprs)
+ {
+ Node *node;
+ node = lfirst(lc);
+ vars = pull_var_clause(node,
+ PVC_RECURSE_AGGREGATES |
+ PVC_RECURSE_WINDOWFUNCS |
+ PVC_RECURSE_PLACEHOLDERS);
+ foreach(lc1, vars)
+ {
+ var = lfirst(lc1);
+ if (var->varno == rti && var->varattno >= 0)
+ cols = bms_add_member(cols, var->varattno);
+ }
+ }
+
+ foreach(lc, rel->baserestrictinfo)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+ vars = pull_var_clause((Node *)rinfo->clause,
+ PVC_RECURSE_AGGREGATES |
+ PVC_RECURSE_WINDOWFUNCS |
+ PVC_RECURSE_PLACEHOLDERS);
+ foreach(lc1, vars)
+ {
+ var = lfirst(lc1);
+ if (var->varno == rti && var->varattno >= 0)
+ cols = bms_add_member(cols, var->varattno);
+ }
+ }
+
+ attno = -1;
+ while ((attno = bms_next_member(cols, attno)) >= 0)
+ {
+ tp = SearchSysCache3(STATRELATTINH,
+ ObjectIdGetDatum(rte->relid),
+ Int16GetDatum(attno),
+ BoolGetDatum(rte->inh));
+
+ if (HeapTupleIsValid(tp))
+ {
+ sel += ((Form_pg_statistic) GETSTRUCT(tp))->stadiskfrac;
+ ReleaseSysCache(tp);
+ }
+ }
+
+ if (sel > 0)
+ {
+ pages = rel->pages * sel;
+
+ if (pages <= 1.0)
+ rel->pages = 1;
+ else
+ rel->pages = rint(pages);
+ }
}
/*
diff --git src/backend/optimizer/plan/createplan.c src/backend/optimizer/plan/createplan.c
index fc25908dc6..b8da03088c 100644
--- src/backend/optimizer/plan/createplan.c
+++ src/backend/optimizer/plan/createplan.c
@@ -825,6 +825,9 @@ use_physical_tlist(PlannerInfo *root, Path *path, int flags)
rel->rtekind != RTE_CTE)
return false;
+ if (rel->rtekind == RTE_RELATION && rel->leverage_column_projection)
+ return false;
+
/*
* Can't do it with inheritance cases either (mainly because Append
* doesn't project; this test may be unnecessary now that
diff --git src/backend/optimizer/util/plancat.c src/backend/optimizer/util/plancat.c
index d82fc5ab8b..92eabdeb61 100644
--- src/backend/optimizer/util/plancat.c
+++ src/backend/optimizer/util/plancat.c
@@ -125,6 +125,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
*/
relation = table_open(relationObjectId, NoLock);
+ if (relation->rd_tableam)
+ rel->leverage_column_projection = relation->rd_tableam->scans_leverage_column_projection;
/* Temporary and unlogged relations are inaccessible during recovery. */
if (!RelationNeedsWAL(relation) && RecoveryInProgress())
ereport(ERROR,
diff --git src/backend/partitioning/partbounds.c src/backend/partitioning/partbounds.c
index 4c47f54a57..fd95484268 100644
--- src/backend/partitioning/partbounds.c
+++ src/backend/partitioning/partbounds.c
@@ -1286,6 +1286,7 @@ check_default_partition_contents(Relation parent, Relation default_rel,
TableScanDesc scan;
MemoryContext oldCxt;
TupleTableSlot *tupslot;
+ Bitmapset *proj = NULL;
/* Lock already taken above. */
if (part_relid != RelationGetRelid(default_rel))
@@ -1350,7 +1351,15 @@ check_default_partition_contents(Relation parent, Relation default_rel,
econtext = GetPerTupleExprContext(estate);
snapshot = RegisterSnapshot(GetLatestSnapshot());
tupslot = table_slot_create(part_rel, &estate->es_tupleTable);
- scan = table_beginscan(part_rel, snapshot, 0, NULL);
+ if (table_scans_leverage_column_projection(part_rel))
+ {
+ PopulateNeededColumnsForNode((Node*)partqualstate->expr, tupslot->tts_tupleDescriptor->natts, &proj);
+ scan = table_beginscan_with_column_projection(part_rel, snapshot, 0, NULL, proj);
+ }
+ else
+ {
+ scan = table_beginscan(part_rel, snapshot, 0, NULL);
+ }
/*
* Switch to per-tuple memory context and reset it for each tuple
@@ -1381,6 +1390,9 @@ check_default_partition_contents(Relation parent, Relation default_rel,
if (RelationGetRelid(default_rel) != RelationGetRelid(part_rel))
table_close(part_rel, NoLock); /* keep the lock until commit */
+
+ if (proj)
+ pfree(proj);
}
}
diff --git src/backend/replication/logical/decode.c src/backend/replication/logical/decode.c
index c2e5e3abf8..fd62337252 100644
--- src/backend/replication/logical/decode.c
+++ src/backend/replication/logical/decode.c
@@ -151,6 +151,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
case RM_COMMIT_TS_ID:
case RM_REPLORIGIN_ID:
case RM_GENERIC_ID:
+ case RM_ZEDSTORE_ID:
/* just deal with xid, and done */
ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
buf.origptr);
diff --git src/backend/storage/buffer/bufmgr.c src/backend/storage/buffer/bufmgr.c
index e05e2b3456..ba68d2447d 100644
--- src/backend/storage/buffer/bufmgr.c
+++ src/backend/storage/buffer/bufmgr.c
@@ -3576,6 +3576,28 @@ LockBuffer(Buffer buffer, int mode)
elog(ERROR, "unrecognized buffer lock mode: %d", mode);
}
+/*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ */
+bool
+ConditionalLockBufferInMode(Buffer buffer, int mode)
+{
+ BufferDesc *buf;
+
+ Assert(BufferIsValid(buffer));
+ if (BufferIsLocal(buffer))
+ return true; /* act as though we got it */
+
+ buf = GetBufferDescriptor(buffer - 1);
+
+ if (mode == BUFFER_LOCK_SHARE)
+ return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
+ else if (mode == BUFFER_LOCK_EXCLUSIVE)
+ return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
+ else
+ elog(ERROR, "unrecognized buffer lock mode: %d", mode);
+}
+
/*
* Acquire the content_lock for the buffer, but only if we don't have to wait.
*
diff --git src/backend/utils/adt/tid.c src/backend/utils/adt/tid.c
index 4ce8375eab..13504d1aa3 100644
--- src/backend/utils/adt/tid.c
+++ src/backend/utils/adt/tid.c
@@ -29,18 +29,13 @@
#include "libpq/pqformat.h"
#include "miscadmin.h"
#include "parser/parsetree.h"
+#include "storage/itemptr.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
#include "utils/varlena.h"
-
-#define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
-#define ItemPointerGetDatum(X) PointerGetDatum(X)
-#define PG_GETARG_ITEMPOINTER(n) DatumGetItemPointer(PG_GETARG_DATUM(n))
-#define PG_RETURN_ITEMPOINTER(x) return ItemPointerGetDatum(x)
-
#define LDELIM '('
#define RDELIM ')'
#define DELIM ','
diff --git src/bin/pg_waldump/rmgrdesc.c src/bin/pg_waldump/rmgrdesc.c
index 852d8ca4b1..c0dc97b307 100644
--- src/bin/pg_waldump/rmgrdesc.c
+++ src/bin/pg_waldump/rmgrdesc.c
@@ -22,6 +22,7 @@
#include "access/spgxlog.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/zedstore_wal.h"
#include "catalog/storage_xlog.h"
#include "commands/dbcommands_xlog.h"
#include "commands/sequence.h"
diff --git src/include/access/rmgrlist.h src/include/access/rmgrlist.h
index 6c15df7e70..38ed65f9e3 100644
--- src/include/access/rmgrlist.h
+++ src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_ZEDSTORE_ID, "Zedstore", zedstore_redo, zedstore_desc, zedstore_identify, NULL, NULL, zedstore_mask)
diff --git src/include/access/tableam.h src/include/access/tableam.h
index 91f84b1107..4548ef884d 100644
--- src/include/access/tableam.h
+++ src/include/access/tableam.h
@@ -37,6 +37,66 @@ struct SampleScanState;
struct TBMIterateResult;
struct VacuumParams;
struct ValidateIndexState;
+struct TupleConversionMap;
+
+typedef enum AnalyzeSampleType
+{
+ ANALYZE_SAMPLE_DATA = 0, /* real data per column */
+ ANALYZE_SAMPLE_DISKSIZE, /* physical size per column */
+ MAX_ANALYZE_SAMPLE /* must be last */
+} AnalyzeSampleType;
+
+typedef struct AnalyzeSampleContext
+{
+ /* Filled when context is created */
+ int totaltargrows;
+ List *anl_cols;
+ Relation parent;
+ BufferAccessStrategy bstrategy;
+
+ /* Filled by table AM analyze routines */
+ BlockNumber totalblocks;
+ TableScanDesc scan;
+
+ /*
+	 * Acquiring sample rows from an inherited table invokes one sampling
+	 * iteration per child relation, so the fields below hold the statistics
+	 * of the current iteration.
+ */
+ int targrows; /* target number of sample rows */
+ double liverows;
+ double deadrows;
+ bool ordered; /* are sample rows ordered physically */
+
+ /*
+	 * Statistics accumulated across all sampling iterations.
+ */
+ int totalsampledrows; /* total number of sample rows stored */
+ double totalrows;
+ double totaldeadrows;
+
+ /*
+	 * If a child relation has a different row type than the parent,
+	 * sample tuples need to be converted to the parent's row type.
+ */
+ struct TupleConversionMap *tup_convert_map;
+
+	 * Used by table AM analyze routines to hold the temporary tuple for
+	 * each type of sample row; the tuple is copied into sample_rows[]
+	 * if it is selected by the sampler.
+ * randomly selected.
+ */
+ TupleTableSlot* sample_slots[MAX_ANALYZE_SAMPLE];
+
+	 * Stores the final sample rows that will be used to
+	 * compute statistics.
+ * used to compute statistics.
+ */
+ HeapTuple* sample_rows[MAX_ANALYZE_SAMPLE];
+} AnalyzeSampleContext;
/*
* Bitmask values for the flags argument to the scan_begin callback.
@@ -163,6 +223,7 @@ typedef struct TableAmRoutine
{
/* this must be set to T_TableAmRoutine */
NodeTag type;
+ bool scans_leverage_column_projection;
/* ------------------------------------------------------------------------
@@ -203,6 +264,13 @@ typedef struct TableAmRoutine
ParallelTableScanDesc pscan,
uint32 flags);
+ TableScanDesc (*scan_begin_with_column_projection)(Relation relation,
+ Snapshot snapshot,
+ int nkeys, struct ScanKeyData *key,
+ ParallelTableScanDesc parallel_scan,
+ uint32 flags,
+ Bitmapset *project_column);
+
/*
* Release resources and deallocate scan. If TableScanDesc.temp_snap,
* TableScanDesc.rs_snapshot needs to be unregistered.
@@ -278,6 +346,13 @@ typedef struct TableAmRoutine
*/
void (*index_fetch_end) (struct IndexFetchTableData *data);
+ /*
+	 * Set the column projection for AMs that leverage column projection
+	 * while scanning.
+ */
+ void (*index_fetch_set_column_projection) (struct IndexFetchTableData *data,
+ Bitmapset *project_column);
+
/*
* Fetch tuple at `tid` into `slot`, after doing a visibility test
* according to `snapshot`. If a tuple was found and passed the visibility
@@ -518,9 +593,10 @@ typedef struct TableAmRoutine
* clear what a good interface for non block based AMs would be, so there
* isn't one yet.
*/
- bool (*scan_analyze_next_block) (TableScanDesc scan,
- BlockNumber blockno,
- BufferAccessStrategy bstrategy);
+ void (*scan_analyze_beginscan) (Relation onerel, AnalyzeSampleContext *context);
+
+ bool (*scan_analyze_next_block) (BlockNumber blockno,
+ AnalyzeSampleContext *context);
/*
* See table_scan_analyze_next_tuple().
@@ -530,11 +606,13 @@ typedef struct TableAmRoutine
* influence autovacuum scheduling (see comment for relation_vacuum
* callback).
*/
- bool (*scan_analyze_next_tuple) (TableScanDesc scan,
- TransactionId OldestXmin,
- double *liverows,
- double *deadrows,
- TupleTableSlot *slot);
+ bool (*scan_analyze_next_tuple) (TransactionId OldestXmin,
+ AnalyzeSampleContext *context);
+
+ void (*scan_analyze_sample_tuple) (int pos, bool replace,
+ AnalyzeSampleContext *context);
+
+ void (*scan_analyze_endscan) (AnalyzeSampleContext *context);
/* see table_index_build_range_scan for reference about parameters */
double (*index_build_range_scan) (Relation table_rel,
@@ -761,6 +839,12 @@ table_beginscan(Relation rel, Snapshot snapshot,
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
+static inline bool
+table_scans_leverage_column_projection(Relation relation)
+{
+ return relation->rd_tableam->scans_leverage_column_projection;
+}
+
/*
* Like table_beginscan(), but for scanning catalog. It'll automatically use a
* snapshot appropriate for scanning catalog relations.
@@ -790,6 +874,19 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
+static inline TableScanDesc
+table_beginscan_with_column_projection(Relation relation, Snapshot snapshot,
+ int nkeys, struct ScanKeyData *key,
+ Bitmapset *project_column)
+{
+ uint32 flags = SO_TYPE_SEQSCAN |
+ SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
+
+ Assert(relation->rd_tableam->scans_leverage_column_projection);
+ return relation->rd_tableam->scan_begin_with_column_projection(
+ relation, snapshot, nkeys, key, NULL, flags, project_column);
+}
+
/*
* table_beginscan_bm is an alternative entry point for setting up a
* TableScanDesc for a bitmap heap scan. Although that scan technology is
@@ -937,7 +1034,8 @@ extern void table_parallelscan_initialize(Relation rel,
* Caller must hold a suitable lock on the relation.
*/
extern TableScanDesc table_beginscan_parallel(Relation rel,
- ParallelTableScanDesc pscan);
+ ParallelTableScanDesc pscan,
+ Bitmapset *proj);
/*
* Restart a parallel scan. Call this in the leader process. Caller is
@@ -987,6 +1085,13 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
scan->rel->rd_tableam->index_fetch_end(scan);
}
+static inline void
+table_index_fetch_set_column_projection(struct IndexFetchTableData *scan,
+ Bitmapset *project_column)
+{
+ scan->rel->rd_tableam->index_fetch_set_column_projection(scan, project_column);
+}
+
/*
* Fetches, as part of an index scan, tuple at `tid` into `slot`, after doing
* a visibility test according to `snapshot`. If a tuple was found and passed
@@ -1464,6 +1569,12 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
}
+static inline void
+table_scan_analyze_beginscan(Relation rel, struct AnalyzeSampleContext *context)
+{
+ rel->rd_tableam->scan_analyze_beginscan(rel, context);
+}
+
/*
* Prepare to analyze block `blockno` of `scan`. The scan needs to have been
* started with table_beginscan_analyze(). Note that this routine might
@@ -1473,11 +1584,10 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
* Returns false if block is unsuitable for sampling, true otherwise.
*/
static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
- BufferAccessStrategy bstrategy)
+table_scan_analyze_next_block(BlockNumber blockno,
+ struct AnalyzeSampleContext *context)
{
- return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
- bstrategy);
+ return context->scan->rs_rd->rd_tableam->scan_analyze_next_block(blockno, context);
}
/*
@@ -1491,13 +1601,21 @@ table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
* tuples.
*/
static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
- double *liverows, double *deadrows,
- TupleTableSlot *slot)
+table_scan_analyze_next_tuple(TransactionId OldestXmin, AnalyzeSampleContext *context)
+{
+ return context->scan->rs_rd->rd_tableam->scan_analyze_next_tuple(OldestXmin, context);
+}
+
+static inline void
+table_scan_analyze_sample_tuple(Index sample, bool replace, AnalyzeSampleContext *context)
{
- return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
- liverows, deadrows,
- slot);
+ context->scan->rs_rd->rd_tableam->scan_analyze_sample_tuple(sample, replace, context);
+}
+
+static inline void
+table_scan_analyze_endscan(AnalyzeSampleContext *context)
+{
+ context->scan->rs_rd->rd_tableam->scan_analyze_endscan(context);
}
/*
@@ -1816,6 +1934,32 @@ extern void table_block_relation_estimate_size(Relation rel,
Size overhead_bytes_per_tuple,
Size usable_bytes_per_page);
+/* ----------------------------------------------------------------------------
+ * Helper functions to implement analyze scan.
+ * ----------------------------------------------------------------------------
+ */
+extern AnalyzeSampleContext *
+CreateAnalyzeSampleContext(Relation onerel, List *cols, int targrows,
+ BufferAccessStrategy strategy);
+extern void DestroyAnalyzeSampleContext(AnalyzeSampleContext *context);
+extern TupleTableSlot * AnalyzeGetSampleSlot(AnalyzeSampleContext *context,
+ Relation onerel, AnalyzeSampleType type);
+extern void AnalyzeRecordSampleRow(AnalyzeSampleContext *context,
+ TupleTableSlot *sample_slot,
+ HeapTuple sample_tuple,
+ AnalyzeSampleType type, int pos,
+ bool replace, bool withtid);
+extern void InitAnalyzeSampleContextForChild(AnalyzeSampleContext *context,
+ Relation child,
+ int childtargrows);
+extern void AnalyzeGetSampleStats(AnalyzeSampleContext *context,
+ int *totalsampledrows,
+ double *totalrows,
+ double *totaldeadrows);
+extern HeapTuple *
+AnalyzeGetSampleRows(AnalyzeSampleContext *context, AnalyzeSampleType type, int offset);
+extern bool AnalyzeSampleIsValid(AnalyzeSampleContext *context, AnalyzeSampleType type);
+
/* ----------------------------------------------------------------------------
* Functions in tableamapi.c
* ----------------------------------------------------------------------------
diff --git src/include/access/xlogrecord.h src/include/access/xlogrecord.h
index acd9af0194..27f9cbe9d4 100644
--- src/include/access/xlogrecord.h
+++ src/include/access/xlogrecord.h
@@ -218,7 +218,7 @@ typedef struct XLogRecordDataHeaderLong
* need a handful of block references, but there are a few exceptions that
* need more.
*/
-#define XLR_MAX_BLOCK_ID 32
+#define XLR_MAX_BLOCK_ID 199
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG 254
diff --git src/include/access/zedstore_compression.h src/include/access/zedstore_compression.h
new file mode 100644
index 0000000000..8b1ab8586f
--- /dev/null
+++ src/include/access/zedstore_compression.h
@@ -0,0 +1,16 @@
+/*
+ * zedstore_compression.h
+ * internal declarations for ZedStore compression
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_compression.h
+ */
+#ifndef ZEDSTORE_COMPRESSION_H
+#define ZEDSTORE_COMPRESSION_H
+
+extern int zs_compress_destSize(const char *src, char *dst, int *srcSizePtr, int targetDstSize);
+extern void zs_decompress(const char *src, char *dst, int compressedSize, int uncompressedSize);
+
+#endif /* ZEDSTORE_COMPRESSION_H */
diff --git src/include/access/zedstore_internal.h src/include/access/zedstore_internal.h
new file mode 100644
index 0000000000..aab1a8afa0
--- /dev/null
+++ src/include/access/zedstore_internal.h
@@ -0,0 +1,1083 @@
+/*
+ * zedstore_internal.h
+ * internal declarations for ZedStore tables
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_internal.h
+ */
+#ifndef ZEDSTORE_INTERNAL_H
+#define ZEDSTORE_INTERNAL_H
+
+#include "access/tableam.h"
+#include "access/zedstore_compression.h"
+#include "access/zedstore_tid.h"
+#include "access/zedstore_undolog.h"
+#include "lib/integerset.h"
+#include "storage/bufmgr.h"
+#include "storage/smgr.h"
+#include "utils/datum.h"
+
+struct zs_pending_undo_op;
+
+#define ZS_META_ATTRIBUTE_NUM 0
+#define ZS_INVALID_ATTRIBUTE_NUM (-1)
+
+#define INVALID_SPECULATIVE_TOKEN 0
+
+/*
+ * attstream_buffer is an in-memory representation of an attribute stream. It is used
+ * by the operations that construct and manipulate attribute streams.
+ */
+typedef struct
+{
+ /*
+ * Enlargeable buffer. The chunks are stored in 'data', between the
+ * 'cursor' and 'len' positions. So if cursor > 0, there is some unused
+ * space before the chunks, and if data < maxlen, there is unused space
+ * after the chunks.
+ */
+ char *data; /* contains raw chunks */
+ int len;
+ int maxlen;
+ int cursor; /* beginning of remaining chunks */
+
+ /*
+ * First and last TID (inclusive) stored in the chunks.
+ */
+ zstid firsttid;
+ zstid lasttid;
+
+ /*
+ * meta-data of the attribute, so that we don't need to pass these along
+ * as separate arguments everywhere.
+ */
+ int16 attlen;
+ bool attbyval;
+} attstream_buffer;
+
+/*
+ * attstream_decoder is used to unpack an attstream into tids/datums/isnulls.
+ */
+typedef struct
+{
+ /* memory context holding the buffer */
+ MemoryContext cxt;
+
+ /* this is for holding decoded element data in the arrays, reset between decoder_attstream_cont calls */
+ MemoryContext tmpcxt;
+
+ /*
+ * meta-data of the attribute, so that we don't need to pass these along
+ * as separate arguments everywhere.
+ */
+ int16 attlen;
+ bool attbyval;
+
+ /* buffer and its allocated size */
+ char *chunks_buf;
+ int chunks_buf_size;
+
+ /* attstream compression ratio */
+ float8 compression_ratio;
+
+ /* information about the current attstream in the buffer */
+ int chunks_len;
+ zstid firsttid;
+ zstid lasttid;
+
+ /* next position within the attstream */
+ int pos;
+ zstid prevtid;
+
+ /*
+ * currently decoded batch of elements
+ */
+/* must be >= the max number of items in one codeword (that is, >= 60) */
+#define DECODER_MAX_ELEMS 90
+ zstid tids[DECODER_MAX_ELEMS];
+ Datum datums[DECODER_MAX_ELEMS];
+ bool isnulls[DECODER_MAX_ELEMS];
+ int num_elements;
+ float8 avg_elements_size; /* avg physical size of elements */
+} attstream_decoder;
+
+/*
+ * A ZedStore table contains different kinds of pages, all in the same file.
+ *
+ * Block 0 is always a metapage. It contains the block numbers of the other
+ * data structures stored within the file, like the per-attribute B-trees,
+ * and the UNDO log. In addition, if there are overly large datums in
+ * the table, they are chopped into separate "toast" pages.
+ */
+#define ZS_META_PAGE_ID 0xF083
+#define ZS_BTREE_PAGE_ID 0xF084
+#define ZS_UNDO_PAGE_ID 0xF085
+#define ZS_TOAST_PAGE_ID 0xF086
+#define ZS_FREE_PAGE_ID 0xF087
+
+/* flags for zedstore b-tree pages */
+#define ZSBT_ROOT 0x0001
+
+typedef struct ZSBtreePageOpaque
+{
+ AttrNumber zs_attno;
+ uint16 zs_level; /* 0 = leaf */
+ BlockNumber zs_next;
+ zstid zs_lokey; /* inclusive */
+ zstid zs_hikey; /* exclusive */
+ uint16 zs_flags;
+
+ uint16 padding1;
+ uint16 padding2;
+
+ uint16 zs_page_id; /* always ZS_BTREE_PAGE_ID */
+} ZSBtreePageOpaque;
+
+#define ZSBtreePageGetOpaque(page) ((ZSBtreePageOpaque *) PageGetSpecialPointer(page))
+
+/*
+ * Internal B-tree page layout.
+ *
+ * The "contents" of the page is an array of ZSBtreeInternalPageItem. The number
+ * of items can be deduced from pd_lower.
+ */
+typedef struct ZSBtreeInternalPageItem
+{
+ zstid tid;
+ BlockNumber childblk;
+} ZSBtreeInternalPageItem;
+
+static inline ZSBtreeInternalPageItem *
+ZSBtreeInternalPageGetItems(Page page)
+{
+ ZSBtreeInternalPageItem *items;
+
+ items = (ZSBtreeInternalPageItem *) PageGetContents(page);
+
+ return items;
+}
+static inline int
+ZSBtreeInternalPageGetNumItems(Page page)
+{
+ ZSBtreeInternalPageItem *begin;
+ ZSBtreeInternalPageItem *end;
+
+ begin = (ZSBtreeInternalPageItem *) PageGetContents(page);
+ end = (ZSBtreeInternalPageItem *) ((char *) page + ((PageHeader) page)->pd_lower);
+
+ return end - begin;
+}
+
+static inline bool
+ZSBtreeInternalPageIsFull(Page page)
+{
+ PageHeader phdr = (PageHeader) page;
+
+ return phdr->pd_upper - phdr->pd_lower < sizeof(ZSBtreeInternalPageItem);
+}
+
+/*
+ * Attribute B-tree leaf page layout
+ *
+ * Leaf pages in the attribute trees don't follow the normal page layout
+ * with line pointers and items. They use the standard page header,
+ * with pd_lower and pd_upper, but the data stored in the lower and upper
+ * parts are different from the normal usage.
+ *
+ * The upper and lower parts of the page contain one "attribute stream"
+ * each. An attribute stream contains attribute data for a range of rows.
+ * Logically, it contains a list of TIDs, and their Datums and isnull
+ * flags. The ranges of TIDs stored in the streams never overlap, but
+ * there can be gaps, if rows have been deleted or updated.
+ *
+ * Physically, the stream consists of "chunks", where one chunk contains
+ * the TIDs of 1-60 datums, packed in a compact form, and the datums themselves.
+ * Finally, the whole stream can be compressed. See comments in
+ * zedstore_attstream.c for a more detailed description of the chunk
+ * format.
+ *
+ * By convention, the attribute stream stored in the upper part of the
+ * page, between pd_upper and pd_special, is compressed, and the lower
+ * stream, stored between the page header and pd_lower, is uncompressed:
+ *
+ * +--------------------+
+ * | PageHeaderData |
+ * +--------------------+
+ * | lower attstream |
+ * | (uncompressed) ... |
+ * | .................. |
+ * | .................. |
+ * +--------------------+ <-pd_lower
+ * | |
+ * | (free space) |
+ * | |
+ * +--------------------+ <-pd_upper
+ * | upper attstream |
+ * | (compressed) .... |
+ * | .................. |
+ * | .................. |
+ * | .................. |
+ * +--------------------+ <-pd_special
+ * | ZSBtreePageOpaque |
+ * +--------------------+
+ *
+ * The point of having two streams is to allow fast appending of
+ * data to a page, without having to decompress and recompress
+ * the whole page. When new data is inserted, it is added to
+ * the uncompressed stream, if it fits. When a page becomes full,
+ * the uncompressed stream is merged with the compressed stream,
+ * replacing both with one larger compressed stream.
+ *
+ * The names "lower" and "upper" refer to the physical location of
+ * the stream on the page. The data in the lower attstream
+ * have higher-numbered TIDs than the data in the upper attstream.
+ * No overlap is allowed. This works well with the usual usage
+ * pattern that new data is added to the end (i.e. with increasing
+ * sequence of TIDs), and old data is archived in compressed form
+ * when a page fills up.
+ */
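+
+/*
+ * Rough sketch (illustrative only, not the actual accessor code): given the
+ * layout above, the uncompressed stream starts right after the page header
+ * and the compressed stream starts at pd_upper, e.g.
+ *
+ *		PageHeader	phdr = (PageHeader) page;
+ *		ZSAttStream *lowerstream = (ZSAttStream *) PageGetContents(page);
+ *		ZSAttStream *upperstream = (ZSAttStream *) ((char *) page + phdr->pd_upper);
+ *
+ * whether each stream is present at all can be deduced from pd_lower and
+ * pd_upper.
+ */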
+
+/*
+ * ZSAttStream represents one attribute stream, stored in the lower
+ * or upper part of an attribute leaf page. It is also used to
+ * pass around data in memory, in which case a stream can be
+ * arbitrarily long.
+ *
+ *
+ * Attstreams are compressed by feeding the stream to the compressor until
+ * all the space available on the page is used up. However, the compressor doesn't know
+ * about chunk boundaries within the stream, so it may stop the compression
+ * in the middle of a chunk. As an artifact of that, a compressed stream
+ * often contains an incomplete chunk at the end. That space goes wasted, and
+ * is ignored. 't_decompressed_size' is the total size of all complete chunks
+ * in a compressed stream, while 't_decompressed_bufsize' includes the wasted
+ * bytes at the end.
+ *
+ * XXX: We could avoid the waste by using a compressor that knows about the
+ * chunk boundaries. Or we could compress twice, first to get the size that
+ * fits, and second time to compress just what fits. But that would be twice
+ * as slow. In practice, the wasted space doesn't matter much. We try to
+ * keep each chunk relatively small, to minimize the waste. And because we
+ * know the next chunk wouldn't fit on the page anyway, there isn't much
+ * else we could do with the wasted space.
+ */
+typedef struct
+{
+ uint32 t_size; /* physical size of the stream. */
+ uint32 t_flags;
+ uint32 t_decompressed_size; /* payload size, excludes waste */
+ uint32 t_decompressed_bufsize; /* payload size, includes waste */
+ zstid t_lasttid; /* last TID stored in this stream */
+
+ char t_payload[FLEXIBLE_ARRAY_MEMBER];
+} ZSAttStream;
+
+#define SizeOfZSAttStreamHeader offsetof(ZSAttStream, t_payload)
+
+#define ATTSTREAM_COMPRESSED 1
+
+
+/*
+ * TID B-tree leaf page layout
+ *
+ * Leaf pages are packed with ZSTidArrayItems. Each ZSTidArrayItem represents
+ * a range of tuples, starting at 't_firsttid', up to 't_endtid' - 1. For each
+ * tuple, we store its TID and its UNDO pointer. The TIDs and UNDO pointers are specially
+ * encoded, so that they take less space.
+ *
+ * Item format:
+ *
+ * We make use of some assumptions / observations on the TIDs and UNDO pointers
+ * to pack them tightly:
+ *
+ * - TIDs are kept in ascending order, and the gap between two TIDs
+ * is usually very small. On a newly loaded table, all TIDs are
+ * consecutive.
+ *
+ * - It's common for the UNDO pointer to be old so that the tuple is
+ * visible to everyone. In that case we don't need to keep the exact value.
+ *
+ * - Nearby TIDs are likely to have only a few distinct UNDO pointer values.
+ *
+ *
+ * Each item looks like this:
+ *
+ * Header | 1-16 TID codewords | 0-2 UNDO pointers | UNDO "slotwords"
+ *
+ * The fixed-size header contains the start and end of the TID range that
+ * this item represents, and information on how many UNDO slots and codewords
+ * follow in the variable-size part.
+ *
+ * After the fixed-size header comes the list of TIDs. They are encoded in
+ * Simple-8b codewords. Simple-8b is an encoding scheme to pack multiple
+ * integers in 64-bit codewords. A single codeword can pack e.g. three 20-bit
+ * integers, or 20 3-bit integers, or a number of different combinations.
+ * Therefore, small integers pack more tightly than larger integers. We encode
+ * the difference between each TID, so in the common case that there are few
+ * gaps between the TIDs, we only need a few bits per tuple. The first encoded
+ * integer is always 0, because the first TID is stored explicitly in
+ * t_firsttid. (TODO: storing the first constant 0 is obviously a waste of
+ * space. Also, since there cannot be duplicates, we could store "delta - 1",
+ * which would allow a more tight representation in some cases.)
+ *
+ * After the TID codewords come the so-called "UNDO slots". They represent all the
+ * distinct UNDO pointers in the group of TIDs that this item covers.
+ * Logically, there are 4 slots. Slots 0 and 1 are special, representing
+ * all-visible "old" TIDs, and "dead" TIDs. They are not stored in the item
+ * itself, to save space, but logically, they can be thought to be part of
+ * every item. They are included in 't_num_undo_slots', so the number of UNDO
+ * pointers physically stored on an item is actually 't_num_undo_slots - 2'.
+ *
+ * With the 4 UNDO slots, we can represent an UNDO pointer using a 2-bit
+ * slot number. If you update a tuple with a new UNDO pointer, and all four
+ * slots are already in use, the item needs to be split. Hopefully that doesn't
+ * happen too often (see assumptions above).
+ *
+ * After the UNDO slots come "UNDO slotwords". The slotwords contain the slot
+ * number of each tuple in the item. The slot numbers are packed in 64 bit
+ * integers, with 2 bits for each tuple.
+ *
+ * Representing UNDO pointers as distinct slots also has the advantage that
+ * when we're scanning the TID array, we can check the few UNDO pointers in
+ * the slots against the current snapshot, and remember the visibility of
+ * each slot, instead of checking every UNDO pointer separately. That
+ * considerably speeds up visibility checks when reading. That's one
+ * advantage of this special encoding scheme, compared to e.g. using a
+ * general-purpose compression algorithm on an array of TIDs and UNDO pointers.
+ *
+ * The physical size of an item depends on how many tuples it covers, the
+ * number of codewords needed to encode the TIDs, and how many distinct UNDO
+ * pointers they have.
+ */
+typedef struct
+{
+ uint16 t_size;
+ uint16 t_num_tids;
+ uint16 t_num_codewords;
+ uint16 t_num_undo_slots;
+
+ zstid t_firsttid;
+ zstid t_endtid;
+
+ /* Followed by UNDO slots, and then followed by codewords */
+ uint64 t_payload[FLEXIBLE_ARRAY_MEMBER];
+
+} ZSTidArrayItem;
+
+/*
+ * We use 2 bits for the UNDO slot number for every tuple. We can therefore
+ * fit 32 slot numbers in each 64-bit "slotword".
+ */
+#define ZSBT_ITEM_UNDO_SLOT_BITS 2
+#define ZSBT_MAX_ITEM_UNDO_SLOTS (1 << (ZSBT_ITEM_UNDO_SLOT_BITS))
+#define ZSBT_ITEM_UNDO_SLOT_MASK (ZSBT_MAX_ITEM_UNDO_SLOTS - 1)
+#define ZSBT_SLOTNOS_PER_WORD (64 / ZSBT_ITEM_UNDO_SLOT_BITS)
+
+/*
+ * To keep the item size and time needed to work with them reasonable,
+ * limit the size of an item to max 16 codewords and 128 TIDs.
+ */
+#define ZSBT_MAX_ITEM_CODEWORDS 16
+#define ZSBT_MAX_ITEM_TIDS 128
+
+#define ZSBT_OLD_UNDO_SLOT 0
+#define ZSBT_DEAD_UNDO_SLOT 1
+#define ZSBT_FIRST_NORMAL_UNDO_SLOT 2
+
+/* Number of UNDO slotwords needed for a given number of tuples */
+#define ZSBT_NUM_SLOTWORDS(num_tids) ((num_tids + ZSBT_SLOTNOS_PER_WORD - 1) / ZSBT_SLOTNOS_PER_WORD)
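+
+/*
+ * Illustrative sketch; nothing in the zedstore code uses this helper. It
+ * shows how a tuple's 2-bit UNDO slot number can be extracted from the
+ * slotwords, following the packing described above (32 slot numbers per
+ * 64-bit word; packing from the low bits upwards is an assumption made for
+ * this example). The real decoding lives in zedstore_tiditem.c.
+ */
+static inline int
+zsbt_item_undo_slotno_example(const uint64 *slotwords, int tupleno)
+{
+	uint64		slotword = slotwords[tupleno / ZSBT_SLOTNOS_PER_WORD];
+	int			shift = (tupleno % ZSBT_SLOTNOS_PER_WORD) * ZSBT_ITEM_UNDO_SLOT_BITS;
+
+	return (int) ((slotword >> shift) & ZSBT_ITEM_UNDO_SLOT_MASK);
+}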
+
+static inline size_t
+SizeOfZSTidArrayItem(int num_tids, int num_undo_slots, int num_codewords)
+{
+ Size sz;
+
+ sz = offsetof(ZSTidArrayItem, t_payload);
+ sz += num_codewords * sizeof(uint64);
+ sz += (num_undo_slots - ZSBT_FIRST_NORMAL_UNDO_SLOT) * sizeof(ZSUndoRecPtr);
+ sz += ZSBT_NUM_SLOTWORDS(num_tids) * sizeof(uint64);
+
+ return sz;
+}
+
+/*
+ * Get pointers to the TID codewords, UNDO slots, and slotwords from an item.
+ *
+ * Note: this is also used to get the pointers when constructing a new item, so
+ * don't assert here that the data is valid!
+ */
+static inline void
+ZSTidArrayItemDecode(ZSTidArrayItem *item, uint64 **codewords,
+ ZSUndoRecPtr **slots, uint64 **slotwords)
+{
+ char *p = (char *) item->t_payload;
+
+ *codewords = (uint64 *) p;
+ p += item->t_num_codewords * sizeof(uint64);
+ *slots = (ZSUndoRecPtr *) p;
+ p += (item->t_num_undo_slots - ZSBT_FIRST_NORMAL_UNDO_SLOT) * sizeof(ZSUndoRecPtr);
+ *slotwords = (uint64 *) p;
+}
+
+/*
+ * Toast page layout.
+ *
+ * When an overly large datum is stored, it is divided into chunks, and each
+ * chunk is stored on a dedicated toast page. The toast pages of a datum form
+ * a list; each page has a next/prev pointer.
+ */
+/*
+ * Maximum size of an individual untoasted Datum stored in ZedStore. Datums
+ * larger than this need to be toasted.
+ *
+ * A datum needs to fit on a B-tree page, with page and item headers.
+ *
+ * XXX: 500 accounts for all the headers. Need to compute this correctly...
+ */
+#define MaxZedStoreDatumSize (BLCKSZ - 500)
+
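+/*
+ * Illustrative sketch; nothing in the zedstore code uses this helper. A
+ * varlena datum needs zedstore-toasting when it exceeds the limit above.
+ * (Whether compression is attempted first is outside the scope of this
+ * sketch.)
+ */
+static inline bool
+zs_datum_needs_toast_example(Datum value, int16 typlen)
+{
+	return typlen == -1 && VARSIZE_ANY(value) > MaxZedStoreDatumSize;
+}
+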
+typedef struct ZSToastPageOpaque
+{
+ AttrNumber zs_attno;
+
+ /* these are only set on the first page. */
+ zstid zs_tid;
+ uint32 zs_total_size;
+ uint32 zs_decompressed_size;
+ bool zs_is_compressed;
+
+ uint32 zs_slice_offset;
+ BlockNumber zs_prev;
+ BlockNumber zs_next;
+ uint16 zs_flags;
+ uint16 padding1; /* padding, to put zs_page_id last */
+ uint16 padding2; /* padding, to put zs_page_id last */
+ uint16 zs_page_id;
+} ZSToastPageOpaque;
+
+/*
+ * "Toast pointer" of a datum that's stored in zedstore toast pages.
+ *
+ * This looks somewhat like a normal TOAST pointer, but we mustn't let these
+ * escape out of zedstore code, because the rest of the system doesn't know
+ * how to deal with them.
+ *
+ * This must look like varattrib_1b_e!
+ */
+typedef struct varatt_zs_toastptr
+{
+ /* varattrib_1b_e */
+ uint8 va_header;
+ uint8 va_tag; /* VARTAG_ZEDSTORE in zedstore toast datums */
+
+ /* first block */
+ BlockNumber zst_block;
+} varatt_zs_toastptr;
+
+/*
+ * va_tag value. This should be distinguishable from the values in
+ * vartag_external.
+ */
+#define VARTAG_ZEDSTORE 10
+
+/*
+ * Versions of datumGetSize and datumCopy that know about ZedStore-toasted
+ * datums.
+ */
+static inline Size
+zs_datumGetSize(Datum value, bool typByVal, int typLen)
+{
+ if (typLen > 0)
+ return typLen;
+ else if (typLen == -1)
+ {
+ if (VARATT_IS_EXTERNAL(value) && VARTAG_EXTERNAL(value) == VARTAG_ZEDSTORE)
+ return sizeof(varatt_zs_toastptr);
+ else
+ return VARSIZE_ANY(value);
+ }
+ else
+ return datumGetSize(value, typByVal, typLen);
+}
+
+static inline Datum
+zs_datumCopy(Datum value, bool typByVal, int typLen)
+{
+ if (typLen < 0 && VARATT_IS_EXTERNAL(value) && VARTAG_EXTERNAL(value) == VARTAG_ZEDSTORE)
+ {
+ char *result = palloc(sizeof(varatt_zs_toastptr));
+
+ memcpy(result, DatumGetPointer(value), sizeof(varatt_zs_toastptr));
+
+ return PointerGetDatum(result);
+ }
+ else
+ return datumCopy(value, typByVal, typLen);
+}
+
+/*
+ * Block 0 on every ZedStore table is a metapage.
+ *
+ * It contains a directory of b-tree roots for each attribute, and lots more.
+ */
+#define ZS_META_BLK 0
+
+/*
+ * The metapage stores one of these for each attribute.
+ */
+typedef struct ZSRootDirItem
+{
+ BlockNumber root;
+ BlockNumber fpm_head;
+} ZSRootDirItem;
+
+typedef struct ZSMetaPage
+{
+ int nattributes;
+ ZSRootDirItem tree_root_dir[FLEXIBLE_ARRAY_MEMBER]; /* one for each attribute */
+} ZSMetaPage;
+
+/*
+ * It's not clear what we should store in the "opaque" special area, and what
+ * in the page contents proper, on a metapage. But keep at least the page_id field here,
+ * so that tools like pg_filedump can recognize it as a zedstore metapage.
+ */
+typedef struct ZSMetaPageOpaque
+{
+ /*
+ * Head and tail page of the UNDO log.
+ *
+ * 'zs_undo_tail' is the newest page, where new UNDO records will be inserted,
+ * and 'zs_undo_head' is the oldest page. 'zs_undo_tail_first_counter' is the
+ * UNDO counter value of the first record on the tail page (or if the tail
+ * page is empty, the counter value the first record on the tail page will
+ * have, when it's inserted.) If there is no UNDO log at all,
+ * 'zs_undo_tail_first_counter' is the new counter value to use. It's actually
+ * redundant, except when there is no UNDO log at all, but it's a nice
+ * cross-check at other times.
+ */
+ BlockNumber zs_undo_head;
+ BlockNumber zs_undo_tail;
+ uint64 zs_undo_tail_first_counter;
+
+ /*
+ * Oldest UNDO record that is still needed. Anything older than this can
+ * be discarded, and considered as visible to everyone.
+ */
+ ZSUndoRecPtr zs_undo_oldestptr;
+
+ BlockNumber zs_fpm_head; /* head of the Free Page Map list for UNDO pages */
+ uint16 zs_flags;
+ uint16 zs_page_id;
+} ZSMetaPageOpaque;
+
+/*
+ * Codes populated by zs_SatisfiesNonVacuumable. This has minimum values
+ * defined based on what's needed. Heap equivalent has more states.
+ */
+typedef enum
+{
+ ZSNV_NONE,
+ ZSNV_RECENTLY_DEAD /* tuple is dead, but not deletable yet */
+} ZSNV_Result;
+
+typedef struct ZSUndoSlotVisibility
+{
+ TransactionId xmin;
+ TransactionId xmax;
+ CommandId cmin;
+ uint32 speculativeToken;
+ ZSNV_Result nonvacuumable_status;
+} ZSUndoSlotVisibility;
+
+static const ZSUndoSlotVisibility InvalidUndoSlotVisibility = {
+ .xmin = InvalidTransactionId,
+ .xmax = InvalidTransactionId,
+ .cmin = InvalidCommandId,
+ .speculativeToken = INVALID_SPECULATIVE_TOKEN,
+ .nonvacuumable_status = ZSNV_NONE
+};
+
+typedef struct ZSTidItemIterator
+{
+ int tids_allocated_size;
+ zstid *tids;
+ uint8 *tid_undoslotnos;
+ int num_tids;
+ MemoryContext context;
+
+ ZSUndoRecPtr undoslots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ ZSUndoSlotVisibility undoslot_visibility[ZSBT_MAX_ITEM_UNDO_SLOTS];
+} ZSTidItemIterator;
+
+/*
+ * Holds the state of an in-progress scan on a zedstore Tid tree.
+ */
+typedef struct ZSTidTreeScan
+{
+ Relation rel;
+
+ /*
+ * memory context that should be used for any allocations that go with the scan,
+ * like the decompression buffers. This isn't a dedicated context, you must still
+ * free everything to avoid leaking! We need this because the getnext function
+ * might be called in a short-lived memory context that is reset between calls.
+ */
+ MemoryContext context;
+
+ bool active;
+ Buffer lastbuf;
+ OffsetNumber lastoff;
+ Snapshot snapshot;
+
+ /*
+ * starttid and endtid define a range of TIDs to scan. currtid is the previous
+ * TID that was returned from the scan. They determine what zsbt_tid_scan_next()
+ * will return.
+ */
+ zstid starttid;
+ zstid endtid;
+ zstid currtid;
+
+ /* in the "real" UNDO-log, this would probably be a global variable */
+ ZSUndoRecPtr recent_oldest_undo;
+
+ /* should this scan do predicate locking? Or check for conflicts? */
+ bool serializable;
+ bool acquire_predicate_tuple_locks;
+
+ /*
+ * These fields are used, when the scan is processing an array item.
+ */
+ ZSTidItemIterator array_iter;
+ int array_curr_idx;
+} ZSTidTreeScan;
+
+/*
+ * This is a convenience function to get the UNDO slot number, i.e. the index
+ * into the undo and visibility arrays, for the scan's current TID. It relies
+ * on 'array_curr_idx', so it only works after a TID scan function has
+ * returned a TID and set the current position.
+ */
+static inline uint8
+ZSTidScanCurUndoSlotNo(ZSTidTreeScan *scan)
+{
+ Assert(scan->array_curr_idx >= 0 && scan->array_curr_idx < scan->array_iter.num_tids);
+ Assert(scan->array_iter.tid_undoslotnos != NULL);
+ return (scan->array_iter.tid_undoslotnos[scan->array_curr_idx]);
+}
+
+/*
+ * Holds the state of an in-progress scan on a zedstore attribute tree.
+ */
+typedef struct ZSAttrTreeScan
+{
+ Relation rel;
+ AttrNumber attno;
+ Form_pg_attribute attdesc;
+
+ /*
+ * memory context that should be used for any allocations that go with the scan,
+ * like the decompression buffers. This isn't a dedicated context, you must still
+ * free everything to avoid leaking! We need this because the getnext function
+ * might be called in a short-lived memory context that is reset between calls.
+ */
+ MemoryContext context;
+
+ bool active;
+ Buffer lastbuf;
+ OffsetNumber lastoff;
+
+ /*
+ * These fields are used, when the scan is processing an array tuple.
+ * They are filled in by zsbt_attr_scan_fetch_array().
+ */
+ attstream_decoder decoder;
+
+ /* last index into attr_decoder arrays */
+ int decoder_last_idx;
+
+} ZSAttrTreeScan;
+
+/*
+ * We keep this cached copy of the information in the metapage in
+ * backend-private memory, in RelationData->rd_amcache.
+ *
+ * The cache contains the block numbers of the roots of all the tree
+ * structures, for quick searches, as well as the rightmost leaf page, for
+ * quick insertions to the end.
+ *
+ * Use zsmeta_get_cache() to get the cached struct.
+ *
+ * This is used together with smgr_targblock. smgr_targblock tracks the
+ * physical size of the relation file. This struct is only considered valid
+ * when smgr_targblock is valid. So in effect, we invalidate this whenever
+ * a smgr invalidation happens. Logically, the lifetime of this is the same
+ * as smgr_targblocks/smgr_fsm_nblocks/smgr_vm_nblocks, but there's no way
+ * to attach an AM-specific struct directly to SmgrRelation.
+ */
+typedef struct ZSMetaCacheData
+{
+ int cache_nattributes;
+
+ /* For each attribute */
+ struct {
+ BlockNumber root; /* root of the b-tree */
+ BlockNumber rightmost; /* right most leaf page */
+ zstid rightmost_lokey; /* lokey of rightmost leaf */
+ } cache_attrs[FLEXIBLE_ARRAY_MEMBER];
+
+} ZSMetaCacheData;
+
+extern ZSMetaCacheData *zsmeta_populate_cache(Relation rel);
+
+static inline ZSMetaCacheData *
+zsmeta_get_cache(Relation rel)
+{
+ if (rel->rd_amcache == NULL || RelationGetTargetBlock(rel) == InvalidBlockNumber)
+ zsmeta_populate_cache(rel);
+ return (ZSMetaCacheData *) rel->rd_amcache;
+}
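+
+/*
+ * Usage sketch, illustrative only; nothing in the zedstore code uses this
+ * helper. It shows looking up the cached b-tree root of an attribute.
+ * Indexing 'cache_attrs' with 'attno - 1' is an assumption made for this
+ * example; the real lookups live in the zedstore .c files.
+ */
+static inline BlockNumber
+zsmeta_cached_root_example(Relation rel, AttrNumber attno)
+{
+	ZSMetaCacheData *cache = zsmeta_get_cache(rel);
+
+	Assert(attno >= 1 && attno <= cache->cache_nattributes);
+	return cache->cache_attrs[attno - 1].root;
+}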
+
+/*
+ * Blow away the cached ZSMetaCacheData struct. Next call to zsmeta_get_cache()
+ * will reload it from the metapage.
+ */
+static inline void
+zsmeta_invalidate_cache(Relation rel)
+{
+ if (rel->rd_amcache != NULL)
+ {
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }
+}
+
+/*
+ * zs_split_stack is used during page split, or page merge, to keep track
+ * of all the modified pages. The page split (or merge) routines don't
+ * modify pages directly, but they construct a list of 'zs_split_stack'
+ * entries. Each entry holds a buffer, and a temporary in-memory copy of
+ * a page that should be written to the buffer, once everything is completed.
+ * All the buffers are exclusively-locked.
+ */
+typedef struct zs_split_stack zs_split_stack;
+
+struct zs_split_stack
+{
+ zs_split_stack *next;
+
+ Buffer buf;
+ Page page; /* temp in-memory copy of page */
+ bool recycle; /* should the page be added to the FPM? */
+	bool		special_only;	/* if set, only the "special" area was changed (the
+								 * rest of the page won't need to be WAL-logged) */
+};
+
+/* prototypes for functions in zedstore_tidpage.c */
+extern void zsbt_tid_begin_scan(Relation rel, zstid starttid, zstid endtid,
+ Snapshot snapshot, ZSTidTreeScan *scan);
+extern void zsbt_tid_reset_scan(ZSTidTreeScan *scan, zstid starttid, zstid endtid, zstid currtid);
+extern void zsbt_tid_end_scan(ZSTidTreeScan *scan);
+extern bool zsbt_tid_scan_next_array(ZSTidTreeScan *scan, zstid nexttid, ScanDirection direction);
+
+/*
+ * Return the next TID in the scan.
+ *
+ * The next TID means the first TID > scan->currtid. Each call moves
+ * scan->currtid to the last returned TID. You can call zsbt_tid_reset_scan()
+ * to change the position; scan->starttid and scan->endtid define the
+ * boundaries of the search.
+ */
+static inline zstid
+zsbt_tid_scan_next(ZSTidTreeScan *scan, ScanDirection direction)
+{
+ zstid nexttid;
+ int idx;
+
+ Assert(scan->active);
+
+ if (direction == ForwardScanDirection)
+ nexttid = scan->currtid + 1;
+ else if (direction == BackwardScanDirection)
+ nexttid = scan->currtid - 1;
+ else
+ nexttid = scan->currtid;
+
+ if (scan->array_iter.num_tids == 0 ||
+ nexttid < scan->array_iter.tids[0] ||
+ nexttid > scan->array_iter.tids[scan->array_iter.num_tids - 1])
+ {
+ scan->array_curr_idx = -1;
+ if (!zsbt_tid_scan_next_array(scan, nexttid, direction))
+ {
+ scan->currtid = nexttid;
+ return InvalidZSTid;
+ }
+ }
+
+ /*
+ * Optimize for the common case that we're scanning forward from the previous
+ * TID.
+ */
+ if (scan->array_curr_idx >= 0 && scan->array_iter.tids[scan->array_curr_idx] < nexttid)
+ idx = scan->array_curr_idx + 1;
+ else
+ idx = 0;
+
+ for (; idx < scan->array_iter.num_tids; idx++)
+ {
+ zstid this_tid = scan->array_iter.tids[idx];
+
+ if (this_tid >= scan->endtid)
+ {
+ scan->currtid = nexttid;
+ return InvalidZSTid;
+ }
+
+ if (this_tid >= nexttid)
+ {
+ /*
+ * Callers using SnapshotDirty need some extra visibility information.
+ */
+ if (scan->snapshot->snapshot_type == SNAPSHOT_DIRTY)
+ {
+ int slotno = scan->array_iter.tid_undoslotnos[idx];
+ ZSUndoSlotVisibility *visi_info = &scan->array_iter.undoslot_visibility[slotno];
+
+ if (visi_info->xmin != FrozenTransactionId)
+ scan->snapshot->xmin = visi_info->xmin;
+ scan->snapshot->xmax = visi_info->xmax;
+ scan->snapshot->speculativeToken = visi_info->speculativeToken;
+ }
+
+ /* on next call, continue the scan at the next TID */
+ scan->currtid = this_tid;
+ scan->array_curr_idx = idx;
+ return this_tid;
+ }
+ }
+
+ /*
+ * unreachable, because zsbt_tid_scan_next_array() should never return an array
+ * that doesn't contain a matching TID.
+ */
+ Assert(false);
+ return InvalidZSTid;
+}
+
+
+extern zstid zsbt_tid_multi_insert(Relation rel, int ntuples,
+ TransactionId xid, CommandId cid,
+ uint32 speculative_token, ZSUndoRecPtr prevundoptr);
+extern TM_Result zsbt_tid_delete(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart, bool *this_xact_has_lock);
+extern TM_Result zsbt_tid_update(Relation rel, zstid otid,
+ TransactionId xid,
+ CommandId cid, bool key_update, Snapshot snapshot, Snapshot crosscheck,
+ bool wait, TM_FailureData *hufd, zstid *newtid_p, bool *this_xact_has_lock);
+extern void zsbt_tid_clear_speculative_token(Relation rel, zstid tid, uint32 spectoken, bool forcomplete);
+extern void zsbt_tid_mark_dead(Relation rel, zstid tid, ZSUndoRecPtr recent_oldest_undo);
+extern IntegerSet *zsbt_collect_dead_tids(Relation rel, zstid starttid, zstid *endtid, uint64 *num_live_tuples);
+extern void zsbt_tid_remove(Relation rel, IntegerSet *tids);
+extern TM_Result zsbt_tid_lock(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ LockTupleMode lockmode, bool follow_updates,
+ Snapshot snapshot, TM_FailureData *hufd,
+ zstid *next_tid, bool *this_xact_has_lock,
+ ZSUndoSlotVisibility *visi_info);
+extern void zsbt_tid_undo_deletion(Relation rel, zstid tid, ZSUndoRecPtr undoptr, ZSUndoRecPtr recent_oldest_undo);
+extern zstid zsbt_get_first_tid(Relation rel);
+extern zstid zsbt_get_last_tid(Relation rel);
+extern void zsbt_find_latest_tid(Relation rel, zstid *tid, Snapshot snapshot);
+
+/* prototypes for functions in zedstore_tiditem.c */
+extern List *zsbt_tid_item_create_for_range(zstid tid, int nelements, ZSUndoRecPtr undo_ptr);
+extern List *zsbt_tid_item_add_tids(ZSTidArrayItem *orig, zstid firsttid, int nelements,
+ ZSUndoRecPtr undo_ptr, bool *modified_orig);
+extern void zsbt_tid_item_unpack(ZSTidArrayItem *item, ZSTidItemIterator *iter);
+extern List *zsbt_tid_item_change_undoptr(ZSTidArrayItem *orig, zstid target_tid, ZSUndoRecPtr undoptr, ZSUndoRecPtr recent_oldest_undo);
+extern List *zsbt_tid_item_remove_tids(ZSTidArrayItem *orig, zstid *nexttid, IntegerSet *remove_tids,
+ ZSUndoRecPtr recent_oldest_undo);
+
+
+/* prototypes for functions in zedstore_attpage.c */
+extern void zsbt_attr_begin_scan(Relation rel, TupleDesc tdesc, AttrNumber attno,
+ ZSAttrTreeScan *scan);
+extern void zsbt_attr_end_scan(ZSAttrTreeScan *scan);
+extern bool zsbt_attr_scan_fetch_array(ZSAttrTreeScan *scan, zstid tid);
+
+extern void zsbt_attr_add(Relation rel, AttrNumber attno, attstream_buffer *newstream);
+extern void zsbt_attstream_change_redo(XLogReaderState *record);
+
+/* prototypes for functions in zedstore_attstream.c */
+extern void create_attstream(attstream_buffer *buffer, bool attbyval, int16 attlen,
+ int nelems, zstid *tids, Datum *datums, bool *isnulls);
+extern void init_attstream_buffer(attstream_buffer *buf, bool attbyval, int16 attlen);
+extern void init_attstream_buffer_from_stream(attstream_buffer *buf, bool attbyval, int16 attlen,
+ ZSAttStream *attstream, MemoryContext memcontext);
+extern int append_attstream(attstream_buffer *buffer, bool all, int nelems,
+ zstid *tids, Datum *datums, bool *isnulls);
+extern void vacuum_attstream(Relation rel, AttrNumber attno, attstream_buffer *buffer,
+ ZSAttStream *attstream,
+ zstid *tids_to_remove, int num_tids_to_remove);
+
+extern void merge_attstream(Form_pg_attribute attr, attstream_buffer *buffer, ZSAttStream *attstream2);
+extern void merge_attstream_buffer(Form_pg_attribute attr, attstream_buffer *buffer, attstream_buffer *buffer2);
+
+extern bool append_attstream_inplace(Form_pg_attribute att, ZSAttStream *oldstream, int freespace, attstream_buffer *newstream);
+
+extern int find_chunk_for_offset(attstream_buffer *attbuf, int offset, zstid *lasttid);
+extern int find_chunk_containing_tid(attstream_buffer *attbuf, zstid tid, zstid *lasttid);
+extern void trim_attstream_upto_offset(attstream_buffer *buf, int chunk_pos, zstid prev_lasttid);
+extern void split_attstream_buffer(attstream_buffer *oldattbuf, attstream_buffer *newattbuf, zstid splittid);
+
+extern void print_attstream(int attlen, char *chunk, int len);
+
+extern void init_attstream_decoder(attstream_decoder *decoder, bool attbyval, int16 attlen);
+extern void destroy_attstream_decoder(attstream_decoder *decoder);
+extern void decode_attstream_begin(attstream_decoder *decoder, ZSAttStream *attstream);
+extern bool decode_attstream_cont(attstream_decoder *decoder);
+extern bool get_attstream_chunk_cont(attstream_decoder *decoder, zstid *prevtid, zstid *firsttid, zstid *lasttid, bytea **chunk);
+
+/* prototypes for functions in zedstore_tuplebuffer.c */
+extern zstid zsbt_tuplebuffer_allocate_tids(Relation rel, TransactionId xid, CommandId cid, int ntids);
+extern void zsbt_tuplebuffer_flush(Relation rel);
+extern void zsbt_tuplebuffer_spool_tuple(Relation rel, zstid tid, Datum *datums, bool *isnulls);
+extern void zsbt_tuplebuffer_spool_slots(Relation rel, zstid *tids, TupleTableSlot **slots, int ntuples);
+
+extern void AtEOXact_zedstream_tuplebuffers(bool isCommit);
+
+
+/* prototypes for functions in zedstore_btree.c */
+extern zs_split_stack *zsbt_newroot(Relation rel, AttrNumber attno, int level, List *downlinks);
+extern zs_split_stack *zsbt_insert_downlinks(Relation rel, AttrNumber attno,
+ zstid leftlokey, BlockNumber leftblkno, int level,
+ List *downlinks);
+extern void zsbt_attr_remove(Relation rel, AttrNumber attno, IntegerSet *tids);
+extern zs_split_stack *zsbt_unlink_page(Relation rel, AttrNumber attno, Buffer buf, int level);
+extern zs_split_stack *zs_new_split_stack_entry(Buffer buf, Page page);
+extern void zs_apply_split_changes(Relation rel, zs_split_stack *stack, struct zs_pending_undo_op *undo_op, AttrNumber attrNumber);
+extern Buffer zsbt_descend(Relation rel, AttrNumber attno, zstid key, int level, bool readonly);
+extern Buffer zsbt_find_and_lock_leaf_containing_tid(Relation rel, AttrNumber attno,
+ Buffer buf, zstid nexttid, int lockmode);
+extern bool zsbt_page_is_expected(Relation rel, AttrNumber attno, zstid key, int level, Buffer buf);
+extern void zsbt_wal_log_leaf_items(Relation rel, AttrNumber attno, Buffer buf, OffsetNumber off, bool replace, List *items, struct zs_pending_undo_op *undo_op);
+extern void zsbt_wal_log_rewrite_pages(Relation rel, AttrNumber attno, List *buffers, struct zs_pending_undo_op *undo_op);
+
+/*
+ * Return the value of row identified with 'tid' in a scan.
+ *
+ * 'tid' must be greater than any previously returned item.
+ *
+ * Returns true if a matching item is found, false otherwise. After
+ * a false return, it's OK to call this again with another greater TID.
+ */
+static inline bool
+zsbt_attr_fetch(ZSAttrTreeScan *scan, Datum *datum, bool *isnull, zstid tid)
+{
+ int idx;
+
+ /*
+ * Fetch the next item from the scan. The item we're looking for might
+	 * already be in the decoder's arrays (scan->decoder).
+ */
+ if (scan->decoder.num_elements == 0 ||
+ tid < scan->decoder.tids[0] ||
+ tid > scan->decoder.tids[scan->decoder.num_elements - 1])
+ {
+ if (!zsbt_attr_scan_fetch_array(scan, tid))
+ return false;
+ scan->decoder_last_idx = -1;
+ }
+ Assert(scan->decoder.num_elements > 0 &&
+ tid >= scan->decoder.tids[0] &&
+ tid <= scan->decoder.tids[scan->decoder.num_elements - 1]);
+
+ /*
+ * Optimize for the common case that we're scanning forward from the previous
+ * TID.
+ */
+ if (scan->decoder_last_idx != -1 && scan->decoder.tids[scan->decoder_last_idx] < tid)
+ idx = scan->decoder_last_idx + 1;
+ else
+ idx = 0;
+
+ for (; idx < scan->decoder.num_elements; idx++)
+ {
+ zstid this_tid = scan->decoder.tids[idx];
+
+ if (this_tid == tid)
+ {
+ *isnull = scan->decoder.isnulls[idx];
+ *datum = scan->decoder.datums[idx];
+ scan->decoder_last_idx = idx;
+ return true;
+ }
+ if (this_tid > tid)
+ return false;
+ }
+
+ return false;
+}
+
+extern PGDLLIMPORT const TupleTableSlotOps TTSOpsZedstore;
+
+/* prototypes for functions in zedstore_meta.c */
+extern void zsmeta_initmetapage(Relation rel);
+extern void zsmeta_initmetapage_redo(XLogReaderState *record);
+extern BlockNumber zsmeta_get_root_for_attribute(Relation rel, AttrNumber attno, bool for_update);
+extern void zsmeta_add_root_for_new_attributes(Relation rel, Page page);
+
+/* prototypes for functions in zedstore_visibility.c */
+extern TM_Result zs_SatisfiesUpdate(Relation rel, Snapshot snapshot,
+ ZSUndoRecPtr recent_oldest_undo,
+ zstid item_tid, ZSUndoRecPtr item_undoptr,
+ LockTupleMode mode,
+ bool *undo_record_needed, bool *this_xact_has_lock,
+ TM_FailureData *tmfd, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info);
+extern bool zs_SatisfiesVisibility(ZSTidTreeScan *scan, ZSUndoRecPtr item_undoptr,
+ TransactionId *obsoleting_xid, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info);
+
+/* prototypes for functions in zedstore_toast.c */
+extern Datum zedstore_toast_datum(Relation rel, AttrNumber attno, Datum value, zstid tid);
+extern Datum zedstore_toast_flatten(Relation rel, AttrNumber attno, zstid tid, Datum toasted);
+extern void zedstore_toast_delete(Relation rel, Form_pg_attribute attr, zstid tid, BlockNumber blkno);
+
+/* prototypes for functions in zedstore_freepagemap.c */
+extern Buffer zspage_getnewbuf(Relation rel, AttrNumber attrNumber);
+extern void zspage_mark_page_deleted(Page page, BlockNumber next_free_blk);
+extern void zspage_delete_page(Relation rel, Buffer buf, Buffer metabuf, AttrNumber attrNumber);
+
+typedef struct ZedstoreTupleTableSlot
+{
+ TupleTableSlot base;
+
+ char *data; /* data for materialized slots */
+
+ /*
+ * Extra visibility information. The tuple's xmin and cmin can be extracted
+ * from here, used e.g. for triggers (XXX is that true?). There's also
+ * a flag to indicate if a tuple is vacuumable or not, which can be useful
+ * if you're scanning with SnapshotAny. That's currently used in index
+ * build.
+ */
+ ZSUndoSlotVisibility *visi_info;
+
+ /*
+ * Normally, when a tuple is retrieved from a table, 'visi_info' points to
+ * TID tree scan's data structures. But sometimes it's useful to keep the
+	 * information together with the slot, e.g. when a slot is copied, so that
+ * it doesn't depend on any data outside the slot. In that case, you can
+ * fill in 'visi_info_buf', and set visi_info = &visi_info_buf.
+ */
+ ZSUndoSlotVisibility visi_info_buf;
+} ZedstoreTupleTableSlot;
+
+#endif /* ZEDSTORE_INTERNAL_H */
diff --git src/include/access/zedstore_simple8b.h src/include/access/zedstore_simple8b.h
new file mode 100644
index 0000000000..ebaaa368c9
--- /dev/null
+++ src/include/access/zedstore_simple8b.h
@@ -0,0 +1,21 @@
+/*
+ * zedstore_simple8b.h
+ *	  Simple-8b integer encoding, used to pack TIDs into 64-bit codewords
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_simple8b.h
+ */
+#ifndef ZEDSTORE_SIMPLE8B_H
+#define ZEDSTORE_SIMPLE8B_H
+
+extern uint64 simple8b_encode(const uint64 *ints, int num_ints, int *num_encoded);
+extern uint64 simple8b_encode_consecutive(const uint64 firstint, const uint64 secondint, int num_ints,
+ int *num_encoded);
+extern int simple8b_decode(uint64 codeword, uint64 *decoded);
+
+extern void simple8b_decode_words(uint64 *codewords, int num_codewords,
+ uint64 *dst, int num_integers);
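+
+/*
+ * Usage sketch, illustrative only. The semantics assumed here are that
+ * simple8b_encode packs as many of the leading input integers as fit into a
+ * single codeword, reporting the count via *num_encoded:
+ *
+ *    int		num_encoded;
+ *    uint64	codeword = simple8b_encode(deltas, ndeltas, &num_encoded);
+ *
+ * The caller would loop, advancing by 'num_encoded', until all integers have
+ * been packed into codewords.
+ */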
+
+#endif /* ZEDSTORE_SIMPLE8B_H */
diff --git src/include/access/zedstore_tid.h src/include/access/zedstore_tid.h
new file mode 100644
index 0000000000..4f0b0b5993
--- /dev/null
+++ src/include/access/zedstore_tid.h
@@ -0,0 +1,84 @@
+/*
+ * zedstore_tid.h
+ * Conversions between ItemPointers and uint64.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_tid.h
+ */
+#ifndef ZEDSTORE_TID_H
+#define ZEDSTORE_TID_H
+
+#include "storage/itemptr.h"
+
+/*
+ * Throughout ZedStore, we pass around TIDs as uint64's, rather than ItemPointers,
+ * for speed.
+ */
+typedef uint64 zstid;
+
+#define InvalidZSTid 0
+#define MinZSTid 1 /* blk 0, off 1 */
+#define MaxZSTid ((uint64) MaxBlockNumber << 16 | 0xffff)
+/* note: if this is converted to ItemPointer, it is invalid */
+#define MaxPlusOneZSTid (MaxZSTid + 1)
+
+#define MaxZSTidOffsetNumber 129
+
+#define PG_GETARG_ZSTID(n) DatumGetZSTid(PG_GETARG_DATUM(n))
+#define PG_RETURN_ZSTID(x) return ZSTidGetDatum(x)
+
+/* fmgr interface macros */
+#ifdef USE_FLOAT8_BYVAL
+#define ZSTidGetDatum(X) Int64GetDatum(X)
+#define DatumGetZSTid(X) ((zstid) (X))
+#else
+#define ZSTidGetDatum(X) PointerGetDatum(X)
+#define DatumGetZSTid(X) (* ((zstid*) DatumGetPointer(X)))
+#endif
+
+static inline zstid
+ZSTidFromBlkOff(BlockNumber blk, OffsetNumber off)
+{
+ Assert(off != 0);
+
+ return (uint64) blk * (MaxZSTidOffsetNumber - 1) + off;
+}
+
+static inline zstid
+ZSTidFromItemPointer(ItemPointerData iptr)
+{
+ Assert(ItemPointerIsValid(&iptr));
+ return ZSTidFromBlkOff(ItemPointerGetBlockNumber(&iptr),
+ ItemPointerGetOffsetNumber(&iptr));
+}
+
+static inline ItemPointerData
+ItemPointerFromZSTid(zstid tid)
+{
+ ItemPointerData iptr;
+ BlockNumber blk;
+ OffsetNumber off;
+
+ blk = (tid - 1) / (MaxZSTidOffsetNumber - 1);
+ off = (tid - 1) % (MaxZSTidOffsetNumber - 1) + 1;
+
+ ItemPointerSet(&iptr, blk, off);
+ Assert(ItemPointerIsValid(&iptr));
+ return iptr;
+}
+
+static inline BlockNumber
+ZSTidGetBlockNumber(zstid tid)
+{
+ return (BlockNumber) ((tid - 1) / (MaxZSTidOffsetNumber - 1));
+}
+
+static inline OffsetNumber
+ZSTidGetOffsetNumber(zstid tid)
+{
+ return (OffsetNumber) ((tid - 1) % (MaxZSTidOffsetNumber - 1) + 1);
+}
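+
+/*
+ * Illustrative example; nothing in the zedstore code uses this helper. With
+ * 128 offsets per block, zstids round-trip with ItemPointers for offsets
+ * 1..128. For instance, block 2, offset 3 maps to zstid 2 * 128 + 3 = 259,
+ * and back.
+ */
+static inline bool
+zstid_roundtrip_example(BlockNumber blk, OffsetNumber off)
+{
+	zstid		tid = ZSTidFromBlkOff(blk, off);
+	ItemPointerData iptr = ItemPointerFromZSTid(tid);
+
+	return ItemPointerGetBlockNumber(&iptr) == blk &&
+		ItemPointerGetOffsetNumber(&iptr) == off;
+}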
+
+#endif /* ZEDSTORE_TID_H */
diff --git src/include/access/zedstore_undolog.h src/include/access/zedstore_undolog.h
new file mode 100644
index 0000000000..da5cc67327
--- /dev/null
+++ src/include/access/zedstore_undolog.h
@@ -0,0 +1,108 @@
+/*
+ * zedstore_undolog.h
+ * internal declarations for ZedStore undo logging
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_undolog.h
+ */
+#ifndef ZEDSTORE_UNDOLOG_H
+#define ZEDSTORE_UNDOLOG_H
+
+#include "storage/buf.h"
+#include "storage/off.h"
+#include "utils/relcache.h"
+
+/*
+ * We don't support splitting UNDO records across pages, so the max size of
+ * an undo record is limited by block size.
+ */
+#define MaxUndoRecordSize (BLCKSZ - SizeOfPageHeaderData - sizeof(ZSUndoPageOpaque))
+
+/*
+ * An UNDO-pointer.
+ *
+ * In the "real" UNDO-logging work from EDB, an UndoRecPtr is only 64 bits.
+ * But we make life easier for ourselves by encoding more information in it.
+ *
+ * 'counter' is a number that's incremented every time a new undo record is
+ * created. It can be used to determine if an undo pointer is too old to be
+ * of interest to anyone.
+ *
+ * 'blkno' and 'offset' are the physical location of the UNDO record. They
+ * can be used to easily fetch a given record.
+ */
+typedef struct
+{
+ uint64 counter;
+ BlockNumber blkno;
+ int32 offset; /* int16 would suffice, but avoid padding */
+} ZSUndoRecPtr;
+
+/* TODO: assert that blkno and offset match, too, if counter matches */
+#define ZSUndoRecPtrEquals(a, b) ((a).counter == (b).counter)
+
+typedef struct
+{
+ BlockNumber next;
+ ZSUndoRecPtr first_undorecptr; /* note: this is set even if the page is empty! */
+ ZSUndoRecPtr last_undorecptr;
+ uint16 padding0; /* padding, to put zs_page_id last */
+ uint16 padding1; /* padding, to put zs_page_id last */
+ uint16 padding2; /* padding, to put zs_page_id last */
+ uint16 zs_page_id; /* ZS_UNDO_PAGE_ID */
+} ZSUndoPageOpaque;
+
+/*
+ * "invalid" undo pointer. The value is chosen so that an invalid pointer
+ * is less than any real UNDO pointer value. Therefore, a tuple with an
+ * invalid UNDO pointer is considered visible to everyone.
+ */
+static const ZSUndoRecPtr InvalidUndoPtr = {
+ .counter = 0,
+ .blkno = InvalidBlockNumber,
+ .offset = 0
+};
+
+/*
+ * A special value used on TID items, to mean that a tuple is not visible to
+ * anyone
+ */
+static const ZSUndoRecPtr DeadUndoPtr = {
+ .counter = 1,
+ .blkno = InvalidBlockNumber,
+ .offset = 0
+};
+
+static inline bool
+IsZSUndoRecPtrValid(ZSUndoRecPtr *uptr)
+{
+ return uptr->counter != 0;
+}
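+
+/*
+ * Illustrative sketch; nothing in the zedstore code uses this helper. Since
+ * 'counter' grows monotonically, comparing counters tells whether an UNDO
+ * pointer is older than the oldest record still kept (zs_undo_oldestptr in
+ * the metapage), in which case the tuple it belongs to is treated as visible
+ * to everyone.
+ */
+static inline bool
+zsundo_ptr_precedes_example(ZSUndoRecPtr ptr, ZSUndoRecPtr oldest)
+{
+	return ptr.counter < oldest.counter;
+}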
+
+/*
+ * zs_undo_reservation represents a piece of UNDO log that has been reserved for
+ * inserting a new UNDO record, but the UNDO record hasn't been written yet.
+ */
+typedef struct
+{
+ Buffer undobuf;
+ ZSUndoRecPtr undorecptr;
+ size_t length;
+
+ char *ptr;
+} zs_undo_reservation;
+
+/* prototypes for functions in zedstore_undolog.c */
+extern void zsundo_insert_reserve(Relation rel, size_t size, zs_undo_reservation *reservation_p);
+extern void zsundo_insert_finish(zs_undo_reservation *reservation);
+
+extern char *zsundo_fetch(Relation rel, ZSUndoRecPtr undoptr, Buffer *buf_p, int lockmode, bool missing_ok);
+
+extern void zsundo_discard(Relation rel, ZSUndoRecPtr oldest_undorecptr);
+
+extern void zsundo_newpage_redo(XLogReaderState *record);
+extern void zsundo_discard_redo(XLogReaderState *record);
+
+#endif							/* ZEDSTORE_UNDOLOG_H */
diff --git src/include/access/zedstore_undorec.h src/include/access/zedstore_undorec.h
new file mode 100644
index 0000000000..07f307d483
--- /dev/null
+++ src/include/access/zedstore_undorec.h
@@ -0,0 +1,196 @@
+/*
+ * zedstore_undorec.h
+ * Declarations for different kinds of UNDO records in Zedstore.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_undorec.h
+ */
+#ifndef ZEDSTORE_UNDOREC_H
+#define ZEDSTORE_UNDOREC_H
+
+#include "access/zedstore_tid.h"
+#include "nodes/lockoptions.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+#include "utils/relcache.h"
+
+#define ZSUNDO_TYPE_INSERT 1
+#define ZSUNDO_TYPE_DELETE 2
+#define ZSUNDO_TYPE_UPDATE 3
+#define ZSUNDO_TYPE_TUPLE_LOCK 4
+
+struct ZSUndoRec
+{
+ int16 size; /* size of this record, including header */
+ uint8 type; /* ZSUNDO_TYPE_* */
+ ZSUndoRecPtr undorecptr;
+ TransactionId xid;
+ CommandId cid;
+
+ /*
+ * UNDO-record of the inserter. This is needed if a row is inserted, and
+	 * deleted, and there are still active snapshots that don't consider even
+ * the insertion as visible.
+ *
+ * This is also used in Insert records, if the record represents the
+ * new tuple version of an UPDATE, rather than an INSERT. It's needed to
+ * dig into possible KEY SHARE locks held on the row, which didn't prevent
+ * the tuple from being updated.
+ */
+ ZSUndoRecPtr prevundorec;
+};
+typedef struct ZSUndoRec ZSUndoRec;
+
+/*
+ * Type-specific record formats.
+ *
+ * We store similar info as zheap for INSERT/UPDATE/DELETE. See zheap README.
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+ zstid firsttid;
+ zstid endtid; /* exclusive */
+ uint32 speculative_token; /* Only used for INSERT records */
+
+} ZSUndoRec_Insert;
+
+#define ZSUNDO_NUM_TIDS_PER_DELETE 10
+
+typedef struct
+{
+ ZSUndoRec rec;
+
+ bool changedPart; /* tuple was moved to a different partition by UPDATE */
+
+ /*
+ * One deletion record can represent deleting up to
+ * ZSUNDO_NUM_TIDS_PER_DELETE tuples. The 'rec.tid' field is unused.
+ */
+ uint16 num_tids;
+ zstid tids[ZSUNDO_NUM_TIDS_PER_DELETE];
+
+ /*
+ * TODO: It might be good to move the deleted tuple to the undo-log, so
+ * that the space can immediately be reused. But currently, we don't do
+ * that. Or even better, move the old tuple to the undo-log lazily, if
+ * the space is needed for a new insertion, before the old tuple becomes
+ * recyclable.
+ */
+} ZSUndoRec_Delete;
+
+/*
+ * This is used for an UPDATE, to mark the old tuple version as updated.
+ * It's the same as a deletion, except this stores the TID of the new tuple
+ * version, so it can be followed in READ COMMITTED mode.
+ *
+ * The ZSUndoRec_Insert record is used for the insertion of the new tuple
+ * version.
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+
+ zstid oldtid;
+ zstid newtid;
+
+ bool key_update; /* were key columns updated?
+ * (for conflicting with FOR KEY SHARE) */
+
+} ZSUndoRec_Update;
+
+/*
+ * This is used when a tuple is locked e.g. with SELECT FOR UPDATE.
+ * The tuple isn't really changed in any way, but the undo record gives
+ * a place to store the XID of the locking transaction.
+ *
+ * In case of a FOR SHARE lock, there can be multiple lockers. Each locker
+ * will create a new undo record with its own XID that points to the previous
+ * record. So the records will form a chain, leading finally to the insertion
+ * record (or beyond the UNDO horizon, meaning the tuple's insertion is visible
+ * to everyone)
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+ zstid tid;
+
+ /*
+ * XXX: Is it OK to store this on disk? The enum values could change. Then
+ * again, no one should care about old locks that were acquired before
+ * last restart. Except with two-phase commit prepared transactions.
+ */
+ LockTupleMode lockmode;
+} ZSUndoRec_TupleLock;
+
+/*
+ * These are used in WAL records, to represent insertion or modification
+ * of an UNDO record.
+ *
+ * We use this same record for all UNDO operations. It's a bit wasteful;
+ * if an existing UNDO record is modified, we wouldn't need to overwrite
+ * the whole record. Also, no need to WAL-log the command ids, because
+ * they don't matter after crash/replay.
+ */
+typedef struct
+{
+ ZSUndoRecPtr undoptr;
+ uint16 length;
+ bool is_update;
+} zs_wal_undo_op;
+
+#define SizeOfZSWalUndoOp (offsetof(zs_wal_undo_op, is_update) + sizeof(bool))
+
+/*
+ * zs_pending_undo_op encapsulates the insertion or modification of an UNDO
+ * record. The zsundo_create_* functions don't insert UNDO records directly,
+ * because the callers are not in a critical section yet, and may still need
+ * to abort. For example, to insert a new TID into the TID tree, we first
+ * construct the UNDO record for the insertion, and then lock the correct
+ * TID tree page to insert to. But if e.g. we need to split the TID page,
+ * we might still have to error out.
+ */
+struct zs_pending_undo_op
+{
+ zs_undo_reservation reservation;
+ bool is_update;
+
+ /*
+ * Payload to include in the WAL record. All the data here is redundant with
+ * the other fields in this struct, but we keep this copy here, so
+ * that we can register it as data in the WAL record.
+ */
+ zs_wal_undo_op waldata;
+
+ char payload[FLEXIBLE_ARRAY_MEMBER];
+};
+typedef struct zs_pending_undo_op zs_pending_undo_op;
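+
+/*
+ * Typical usage, sketched for illustration only (simplified; the real
+ * callers live in the zedstore .c files). 'payload' stands for the record
+ * body handed back to the UNDO machinery:
+ *
+ *    zs_pending_undo_op *op;
+ *
+ *    op = zsundo_create_for_delete(rel, xid, cid, tid, false, prev_undo_ptr);
+ *    ... lock and prepare the data page; erroring out is still safe here ...
+ *    START_CRIT_SECTION();
+ *    zsundo_finish_pending_op(op, payload);
+ *    ... apply and WAL-log the page change, registering the UNDO op with
+ *        XLogRegisterUndoOp() ...
+ *    END_CRIT_SECTION();
+ */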
+
+/* prototypes for functions in zedstore_undorec.c */
+extern struct ZSUndoRec *zsundo_fetch_record(Relation rel, ZSUndoRecPtr undorecptr);
+
+extern zs_pending_undo_op *zsundo_create_for_delete(Relation rel, TransactionId xid, CommandId cid, zstid tid,
+ bool changedPart, ZSUndoRecPtr prev_undo_ptr);
+extern zs_pending_undo_op *zsundo_create_for_insert(Relation rel, TransactionId xid, CommandId cid,
+ zstid tid, int nitems,
+ uint32 speculative_token, ZSUndoRecPtr prev_undo_ptr);
+extern zs_pending_undo_op *zsundo_create_for_update(Relation rel, TransactionId xid, CommandId cid,
+ zstid oldtid, zstid newtid, ZSUndoRecPtr prev_undo_ptr,
+ bool key_update);
+extern zs_pending_undo_op *zsundo_create_for_tuple_lock(Relation rel, TransactionId xid, CommandId cid,
+ zstid tid, LockTupleMode lockmode,
+ ZSUndoRecPtr prev_undo_ptr);
+extern void zsundo_finish_pending_op(zs_pending_undo_op *pendingop, char *payload);
+extern void zsundo_clear_speculative_token(Relation rel, ZSUndoRecPtr undoptr);
+
+extern void XLogRegisterUndoOp(uint8 block_id, zs_pending_undo_op *undo_op);
+extern Buffer XLogRedoUndoOp(XLogReaderState *record, uint8 block_id);
+
+struct VacuumParams;
+extern void zsundo_vacuum(Relation rel, struct VacuumParams *params, BufferAccessStrategy bstrategy,
+ TransactionId OldestXmin);
+extern ZSUndoRecPtr zsundo_get_oldest_undo_ptr(Relation rel, bool attempt_trim);
+
+#endif /* ZEDSTORE_UNDOREC_H */
diff --git src/include/access/zedstore_wal.h src/include/access/zedstore_wal.h
new file mode 100644
index 0000000000..8b496f3d13
--- /dev/null
+++ src/include/access/zedstore_wal.h
@@ -0,0 +1,220 @@
+/*
+ * zedstore_wal.h
+ * internal declarations for ZedStore wal logging
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_wal.h
+ */
+#ifndef ZEDSTORE_WAL_H
+#define ZEDSTORE_WAL_H
+
+#include "access/attnum.h"
+#include "access/xlogreader.h"
+#include "access/zedstore_tid.h"
+#include "access/zedstore_undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/off.h"
+
+#define WAL_ZEDSTORE_INIT_METAPAGE 0x00
+#define WAL_ZEDSTORE_UNDO_NEWPAGE 0x10
+#define WAL_ZEDSTORE_UNDO_DISCARD 0x20
+#define WAL_ZEDSTORE_BTREE_NEW_ROOT 0x30
+#define WAL_ZEDSTORE_BTREE_REWRITE_PAGES 0x40
+#define WAL_ZEDSTORE_TIDLEAF_ADD_ITEMS 0x50
+#define WAL_ZEDSTORE_TIDLEAF_REPLACE_ITEM 0x60
+#define WAL_ZEDSTORE_ATTSTREAM_CHANGE 0x70
+#define WAL_ZEDSTORE_TOAST_NEWPAGE 0x80
+#define WAL_ZEDSTORE_FPM_DELETE_PAGE 0x90
+#define WAL_ZEDSTORE_FPM_REUSE_PAGE 0xA0
+
+/* in zedstore_wal.c */
+extern void zedstore_redo(XLogReaderState *record);
+extern void zedstore_mask(char *pagedata, BlockNumber blkno);
+
+/* in zedstoredesc.c */
+extern void zedstore_desc(StringInfo buf, XLogReaderState *record);
+extern const char *zedstore_identify(uint8 info);
+
+/*
+ * WAL record for initializing zedstore metapage (WAL_ZEDSTORE_INIT_METAPAGE)
+ *
+ * These records always use a full-page image, so this data is really just
+ * for debugging purposes.
+ */
+typedef struct wal_zedstore_init_metapage
+{
+ int32 natts; /* number of attributes. */
+} wal_zedstore_init_metapage;
+
+#define SizeOfZSWalInitMetapage (offsetof(wal_zedstore_init_metapage, natts) + sizeof(int32))
+
+/*
+ * WAL record for extending the UNDO log with one page.
+ */
+typedef struct wal_zedstore_undo_newpage
+{
+ uint64 first_counter;
+} wal_zedstore_undo_newpage;
+
+#define SizeOfZSWalUndoNewPage (offsetof(wal_zedstore_undo_newpage, first_counter) + sizeof(uint64))
+
+/*
+ * WAL record for updating the oldest undo pointer on the metapage, after
+ * discarding an old portion of the UNDO log.
+ *
+ * blkref #0 is the metapage.
+ *
+ * If an old UNDO page was discarded away, advancing zs_undo_head, that page
+ * is stored as blkref #1. The new block number to store in zs_undo_head is
+ * stored as the data of blkref #0.
+ */
+typedef struct wal_zedstore_undo_discard
+{
+ ZSUndoRecPtr oldest_undorecptr;
+
+ /*
+ * Next oldest remaining block in the UNDO chain. This is not the same as
+	 * oldest_undorecptr.blkno, if we are discarding multiple UNDO blocks. We will
+ * update oldest_undorecptr in the first iteration already, so that visibility
+ * checks can use the latest value immediately. But we can't hold a potentially
+ * unlimited number of pages locked while we mark them as deleted, so they are
+ * deleted one by one, and each deletion is WAL-logged separately.
+ */
+ BlockNumber oldest_undopage;
+} wal_zedstore_undo_discard;
+
+#define SizeOfZSWalUndoDiscard (offsetof(wal_zedstore_undo_discard, oldest_undopage) + sizeof(BlockNumber))
+
+/*
+ * WAL record for creating a new, empty, root page for an attribute.
+ */
+typedef struct wal_zedstore_btree_new_root
+{
+ AttrNumber attno; /* 0 means TID tree */
+} wal_zedstore_btree_new_root;
+
+#define SizeOfZSWalBtreeNewRoot (offsetof(wal_zedstore_btree_new_root, attno) + sizeof(AttrNumber))
+
+/*
+ * WAL record for replacing/adding items to the TID tree.
+ */
+typedef struct wal_zedstore_tidleaf_items
+{
+ int16 nitems;
+ OffsetNumber off;
+
+ /* the items follow */
+} wal_zedstore_tidleaf_items;
+
+#define SizeOfZSWalTidLeafItems (offsetof(wal_zedstore_tidleaf_items, off) + sizeof(OffsetNumber))
+
+/*
+ * WAL record for page splits, and other more complicated operations where
+ * we just rewrite whole pages.
+ *
+ * block #0 is UNDO buffer, if any.
+ * The rest are the b-tree pages (numpages).
+ */
+typedef struct wal_zedstore_btree_rewrite_pages
+{
+ int numpages;
+
+ /* one of these per page. */
+ struct
+ {
+ bool recycle;
+ bool special_only;
+ } pageinfo[FLEXIBLE_ARRAY_MEMBER];
+} wal_zedstore_btree_rewrite_pages;
+
+#define SizeOfZSWalBtreeRewritePages(numpages) (offsetof(wal_zedstore_btree_rewrite_pages, pageinfo[numpages]))
+
+/*
+ * WAL record for a change to attribute leaf page.
+ *
+ * Modifies an attribute stream stored on an attribute leaf page.
+ * If 'is_upper' is set, the change applies to the upper stream,
+ * between pd_upper and pd_special, otherwise it applies to the
+ * lower stream between page header and pd_lower.
+ *
+ * new_attstream_size is the new size of the attstream. At replay,
+ * pd_lower or pd_upper is adjusted to match the new size. If
+ * the size of the upper stream changes, any existing data in the upper
+ * area on the page is conceptually moved to the beginning of the upper
+ * area, before the replacement data in the record is applied.
+ *
+ * The block data 0 contains new data, which overwrites the data
+ * between begin_offset and end_offset. Not all data in the stream
+ * needs to be overwritten, that is, begin_offset and end_offset
+ * don't need to cover the whole stream. That allows efficiently
+ * appending data to an uncompressed stream. (It's also pretty
+ * effective for the compressed stream: if a stream is
+ * decompressed, some changes are made, and the stream is
+ * recompressed, the part before the change will usually re-compress
+ * to the same bytes.)
+ */
+typedef struct wal_zedstore_attstream_change
+{
+ bool is_upper;
+
+ /*
+	 * These fields correspond to the fields in ZSAttStream. But
+ * we use smaller fields to save on WAL volume. (ZSAttStream
+ * uses larger fields for the size, so that the same struct
+ * can be used for longer streams than fit on disk, when passed
+ * around in memory.)
+ */
+ uint16 new_attstream_size;
+ uint16 new_decompressed_size;
+ uint16 new_decompressed_bufsize;
+ zstid new_lasttid;
+
+ uint16 begin_offset;
+ uint16 end_offset;
+} wal_zedstore_attstream_change;
+
+#define SizeOfZSWalAttstreamChange (offsetof(wal_zedstore_attstream_change, end_offset) + sizeof(uint16))
+
+/*
+ * WAL record for zedstore toasting. When a large datum spans multiple pages,
+ * we write one of these for every page. The chain will appear valid between
+ * every operation, except that the total size won't match the total size of
+ * all the pages until the last page is written.
+ *
+ * blkref 0: the new page being added
+ * blkref 1: the previous page in the chain
+ */
+typedef struct wal_zedstore_toast_newpage
+{
+ zstid tid;
+ AttrNumber attno;
+ int32 total_size;
+ int32 offset;
+} wal_zedstore_toast_newpage;
+
+#define SizeOfZSWalToastNewPage (offsetof(wal_zedstore_toast_newpage, offset) + sizeof(int32))
+
+typedef struct wal_zedstore_fpm_delete_page
+{
+ BlockNumber next_free_blkno;
+} wal_zedstore_fpm_delete_page;
+
+#define SizeOfZSWalFpmDeletePage (offsetof(wal_zedstore_fpm_delete_page, next_free_blkno) + sizeof(BlockNumber))
+
+typedef struct wal_zedstore_fpm_reuse_page
+{
+ BlockNumber next_free_blkno;
+} wal_zedstore_fpm_reuse_page;
+
+#define SizeOfZSWalFpmReusePage (offsetof(wal_zedstore_fpm_reuse_page, next_free_blkno) + sizeof(BlockNumber))
+
+extern void zsbt_tidleaf_items_redo(XLogReaderState *record, bool replace);
+extern void zsmeta_new_btree_root_redo(XLogReaderState *record);
+extern void zsbt_rewrite_pages_redo(XLogReaderState *record);
+extern void zstoast_newpage_redo(XLogReaderState *record);
+extern void zspage_delete_page_redo(XLogReaderState *record);
+extern void zspage_reuse_page_redo(XLogReaderState *record);
+
+#endif /* ZEDSTORE_WAL_H */
diff --git src/include/access/zedstoream.h src/include/access/zedstoream.h
new file mode 100644
index 0000000000..c707ad86d6
--- /dev/null
+++ src/include/access/zedstoream.h
@@ -0,0 +1,17 @@
+/*
+ * zedstoream.h
+ * public declarations for ZedStore
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstoream.h
+ */
+#ifndef ZEDSTOREAM_H
+#define ZEDSTOREAM_H
+
+extern void AtEOXact_zedstore_tuplebuffers(bool isCommit);
+extern void AtSubStart_zedstore_tuplebuffers(void);
+extern void AtEOSubXact_zedstore_tuplebuffers(bool isCommit);
+
+#endif /* ZEDSTOREAM_H */
diff --git src/include/catalog/pg_am.dat src/include/catalog/pg_am.dat
index 0f051277a6..5abe8751b0 100644
--- src/include/catalog/pg_am.dat
+++ src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
{ oid => '3580', oid_symbol => 'BRIN_AM_OID',
descr => 'block range index (BRIN) access method',
amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '6668', oid_symbol => 'ZEDSTORE_TABLE_AM_OID',
+ descr => 'zedstore table access method',
+ amname => 'zedstore', amhandler => 'zedstore_tableam_handler', amtype => 't' },
]
diff --git src/include/catalog/pg_cast.dat src/include/catalog/pg_cast.dat
index 01c5328ddd..7a566b83a5 100644
--- src/include/catalog/pg_cast.dat
+++ src/include/catalog/pg_cast.dat
@@ -526,4 +526,15 @@
{ castsource => 'jsonb', casttarget => 'float8', castfunc => 'float8(jsonb)',
castcontext => 'e', castmethod => 'f' },
+# zedstore
+{ castsource => 'tid', casttarget => 'zstid', castfunc => 'zstid(tid)',
+ castcontext => 'e', castmethod => 'f' },
+{ castsource => 'zstid', casttarget => 'int8', castfunc => 'int8(zstid)',
+ castcontext => 'e', castmethod => 'f' },
+{ castsource => 'int2', casttarget => 'zstid', castfunc => 'zstid(int2)',
+ castcontext => 'i', castmethod => 'f' },
+{ castsource => 'int4', casttarget => 'zstid', castfunc => 'zstid(int4)',
+ castcontext => 'i', castmethod => 'f' },
+{ castsource => 'int8', casttarget => 'zstid', castfunc => 'zstid(int8)',
+ castcontext => 'i', castmethod => 'f' },
]
diff --git src/include/catalog/pg_operator.dat src/include/catalog/pg_operator.dat
index 7c135da3b1..4757817113 100644
--- src/include/catalog/pg_operator.dat
+++ src/include/catalog/pg_operator.dat
@@ -3298,4 +3298,30 @@
oprresult => 'bool', oprcode => 'jsonb_path_match_opr(jsonb,jsonpath)',
oprrest => 'contsel', oprjoin => 'contjoinsel' },
+# zedstore
+{ oid => '7060', descr => 'equal',
+ oprname => '=', oprcanmerge => 'f', oprcanhash => 'f', oprleft => 'zstid',
+ oprright => 'zstid', oprresult => 'bool', oprcom => '=(zstid,zstid)',
+ oprnegate => '<>(zstid,zstid)', oprcode => 'zstideq', oprrest => 'eqsel',
+ oprjoin => 'eqjoinsel' },
+{ oid => '7061', descr => 'not equal',
+ oprname => '<>', oprleft => 'zstid', oprright => 'zstid', oprresult => 'bool',
+ oprcom => '<>(zstid,zstid)', oprnegate => '=(zstid,zstid)', oprcode => 'zstidne',
+ oprrest => 'neqsel', oprjoin => 'neqjoinsel' },
+{ oid => '7062', descr => 'less than',
+ oprname => '<', oprleft => 'zstid', oprright => 'zstid', oprresult => 'bool',
+ oprcom => '>(zstid,zstid)', oprnegate => '>=(zstid,zstid)', oprcode => 'zstidlt',
+ oprrest => 'scalarltsel', oprjoin => 'scalarltjoinsel' },
+{ oid => '7063', descr => 'greater than',
+ oprname => '>', oprleft => 'zstid', oprright => 'zstid', oprresult => 'bool',
+ oprcom => '<(zstid,zstid)', oprnegate => '<=(zstid,zstid)', oprcode => 'zstidgt',
+ oprrest => 'scalargtsel', oprjoin => 'scalargtjoinsel' },
+{ oid => '7064', descr => 'less than or equal',
+ oprname => '<=', oprleft => 'zstid', oprright => 'zstid', oprresult => 'bool',
+ oprcom => '>=(zstid,zstid)', oprnegate => '>(zstid,zstid)', oprcode => 'zstidle',
+ oprrest => 'scalarlesel', oprjoin => 'scalarlejoinsel' },
+{ oid => '7065', descr => 'greater than or equal',
+ oprname => '>=', oprleft => 'zstid', oprright => 'zstid', oprresult => 'bool',
+ oprcom => '<=(zstid,zstid)', oprnegate => '<(zstid,zstid)', oprcode => 'zstidge',
+ oprrest => 'scalargesel', oprjoin => 'scalargejoinsel' },
]
diff --git src/include/catalog/pg_proc.dat src/include/catalog/pg_proc.dat
index 87d25d4a4b..e809bb7615 100644
--- src/include/catalog/pg_proc.dat
+++ src/include/catalog/pg_proc.dat
@@ -873,6 +873,11 @@
proname => 'heap_tableam_handler', provolatile => 'v',
prorettype => 'table_am_handler', proargtypes => 'internal',
prosrc => 'heap_tableam_handler' },
+{ oid => '6669', oid_symbol => 'ZEDSTORE_TABLE_AM_HANDLER_OID',
+ descr => 'column-oriented table access method handler',
+ proname => 'zedstore_tableam_handler', provolatile => 'v',
+ prorettype => 'table_am_handler', proargtypes => 'internal',
+ prosrc => 'zedstore_tableam_handler' },
# Index access method handlers
{ oid => '330', descr => 'btree index access method handler',
@@ -10819,4 +10824,98 @@
proname => 'pg_partition_root', prorettype => 'regclass',
proargtypes => 'regclass', prosrc => 'pg_partition_root' },
+# zedstore inspection functions
+{ oid => '7000', descr => 'get zedstore page type',
+ proname => 'pg_zs_page_type', prorettype => 'text',
+ proargtypes => 'regclass int8', prosrc => 'pg_zs_page_type' },
+{ oid => '7001', descr => 'show stats about active zedstore undo pages',
+ proname => 'pg_zs_undo_pages', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int4,int4,int8,int8}',
+ proargmodes => '{i,o,o,o,o,o}',
+ proargnames => '{relid,blkno,nrecords,freespace,firstrecptr,lastrecptr}',
+ prosrc => 'pg_zs_undo_pages' },
+{ oid => '7002', descr => 'show stats about zedstore btree pages',
+ proname => 'pg_zs_btree_pages', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int8,int4,int4,int8,int8,int4,int4,int4,int4,int4}',
+ proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{relid,blkno,nextblk,attno,level,lokey,hikey,nitems,ncompressed,totalsz,uncompressedsz,freespace}',
+ prosrc => 'pg_zs_btree_pages' },
+{ oid => '7003', descr => 'show stats about zedstore toast pages',
+ proname => 'pg_zs_toast_pages', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int8,int8,int8,int8,int8,int4,bool}',
+ proargmodes => '{i,o,o,o,o,o,o,o,o}',
+ proargnames => '{relid,blkno,tid,total_size,slice_offset,prev,next,decompressed_size,is_compressed}',
+ prosrc => 'pg_zs_toast_pages' },
+{ oid => '7004', descr => 'show stats about the zedstore meta page',
+ proname => 'pg_zs_meta_page', prorettype => 'record',
+ proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int8,int8,int8,int8,int8,int4,int8,int4}',
+ proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{relid,blkno,undo_head,undo_tail,undo_tail_first_counter,undo_oldestpointer_counter,undo_oldestpointer_blkno,undo_oldestpointer_offset,fpm_head,flags}',
+ prosrc => 'pg_zs_meta_page' },
+{ oid => '7005', descr => 'dump individual chunks from a zedstore btree page',
+ proname => 'pg_zs_dump_attstreams', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass int8',
+ proallargtypes => '{regclass,int8,int2,int4,bool,bool,bool,int2,int4,int4,zstid,zstid,zstid,bytea,int4}',
+ proargmodes => '{i,i,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{relid,blkno,attno,chunkno,upperstream,compressed,attbyval,attlen,chunk_start,chunk_len,prevtid,firsttid,lasttid,chunk,itemcount}',
+ prosrc => 'pg_zs_dump_attstreams' },
+{ oid => '7006', descr => 'split zedstore chunks into value arrays',
+ proname => 'pg_zs_decode_chunk', prorettype => 'record',
+ proargtypes => 'bool int2 zstid zstid bytea',
+ proallargtypes => '{bool,int2,zstid,zstid,bytea,int4,_zstid,_bytea,_bool}',
+ proargmodes => '{i,i,i,i,i,o,o,o,o}',
+ proargnames => '{attbyval,attlen,prevtid,lasttid,chunk,num_elements,tids,datums,isnulls}',
+ prosrc => 'pg_zs_decode_chunk' },
+{ oid => '7007', descr => 'calculate the number of adjacent blocks per attribute',
+ proname => 'pg_zs_calculate_adjacent_block', prorettype => 'record',
+ proargtypes => 'regclass', prorows => '1000', proretset => 't',
+ proallargtypes => '{regclass,int4,int4,int4}',
+ proargmodes => '{i,o,o,o}',
+ proargnames => '{relid,attnum,num_runs,total_runs}',
+ prosrc => 'pg_zs_calculate_adjacent_block' },
+
+# zedstore
+{ oid => '7020', descr => 'input zstid',
+ proname => 'zstidin', prorettype => 'zstid', proargtypes => 'cstring',
+ prosrc => 'zstidin' },
+{ oid => '7021', descr => 'output zstid',
+ proname => 'zstidout', prorettype => 'cstring', proargtypes => 'zstid',
+ prosrc => 'zstidout' },
+{ oid => '7022',
+ proname => 'zstideq', proleakproof => 't', prorettype => 'bool',
+ proargtypes => 'zstid zstid', prosrc => 'zstideq' },
+{ oid => '7023',
+ proname => 'zstidle', proleakproof => 't', prorettype => 'bool',
+ proargtypes => 'zstid zstid', prosrc => 'zstidle' },
+{ oid => '7024',
+ proname => 'zstidge', proleakproof => 't', prorettype => 'bool',
+ proargtypes => 'zstid zstid', prosrc => 'zstidge' },
+{ oid => '7025',
+ proname => 'zstidgt', proleakproof => 't', prorettype => 'bool',
+ proargtypes => 'zstid zstid', prosrc => 'zstidgt' },
+{ oid => '7026',
+ proname => 'zstidlt', proleakproof => 't', prorettype => 'bool',
+ proargtypes => 'zstid zstid', prosrc => 'zstidlt' },
+{ oid => '7027',
+ proname => 'zstidne', proleakproof => 't', prorettype => 'bool',
+ proargtypes => 'zstid zstid', prosrc => 'zstidne' },
+{ oid => '7028', descr => 'convert tid to zstid',
+ proname => 'zstid', provolatile => 's', prorettype => 'zstid',
+ proargtypes => 'tid', prosrc => 'tidtozstid' },
+{ oid => '7029', descr => 'convert int2 to zstid',
+ proname => 'zstid', prorettype => 'zstid', proargtypes => 'int2',
+ prosrc => 'int2tozstid' },
+{ oid => '7030', descr => 'convert int4 to zstid',
+ proname => 'zstid', prorettype => 'zstid', proargtypes => 'int4',
+ prosrc => 'int4tozstid' },
+{ oid => '7031', descr => 'convert int8 to zstid',
+ proname => 'zstid', prorettype => 'zstid', proargtypes => 'int8',
+ prosrc => 'int8tozstid' },
+{ oid => '7032', descr => 'convert zstid to int8',
+ proname => 'int8', prorettype => 'int8', proargtypes => 'zstid',
+ prosrc => 'zstidtoint8' },
]
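
For illustration only (not part of the patch): the inspection functions above are
ordinary set-returning SQL functions taking a regclass, so once a table uses the new
access method they can be queried directly. The table name below is invented for the
example, and "USING zedstore" assumes the access method is registered under that name
by the rest of the patch:

    CREATE TABLE zs_demo (a int, b text) USING zedstore;
    INSERT INTO zs_demo SELECT g, md5(g::text) FROM generate_series(1, 10000) g;

    -- what kind of page is block 0?
    SELECT pg_zs_page_type('zs_demo', 0);

    -- per-page stats of the attribute/TID btrees
    SELECT attno, level, nitems, ncompressed, totalsz, uncompressedsz
      FROM pg_zs_btree_pages('zs_demo');

    -- active undo pages
    SELECT * FROM pg_zs_undo_pages('zs_demo');
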
diff --git src/include/catalog/pg_statistic.h src/include/catalog/pg_statistic.h
index 40a7260165..a069546917 100644
--- src/include/catalog/pg_statistic.h
+++ src/include/catalog/pg_statistic.h
@@ -36,6 +36,9 @@ CATALOG(pg_statistic,2619,StatisticRelationId)
/* the fraction of the column's entries that are NULL: */
float4 stanullfrac;
+ /* this column's fraction of the combined on-disk size of all columns: */
+ float4 stadiskfrac;
+
/*
* stawidth is the average width in bytes of non-null entries. For
* fixed-width datatypes this is of course the same as the typlen, but for
diff --git src/include/catalog/pg_type.dat src/include/catalog/pg_type.dat
index 2e6110e3f2..274451a22c 100644
--- src/include/catalog/pg_type.dat
+++ src/include/catalog/pg_type.dat
@@ -621,4 +621,10 @@
typoutput => 'anycompatiblerange_out', typreceive => '-', typsend => '-',
typalign => 'd', typstorage => 'x' },
+
+{ oid => '7050', array_type_oid => '7051',
+ descr => 'zedstore logical tuple id',
+ typname => 'zstid', typlen => '8', typbyval => 'FLOAT8PASSBYVAL',
+ typcategory => 'N', typinput => 'zstidin', typoutput => 'zstidout',
+ typreceive => '-', typsend => '-', typalign => 'd' },
]
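
As a rough sketch of how the new type behaves (again illustration only): zstid is an
8-byte pass-by-value row identifier whose support functions are the pg_proc entries
added earlier, so a session might exercise it like this:

    SELECT '42'::zstid;                 -- zstidin / zstidout
    SELECT zstid(42), zstid(42::int8);  -- casts from the integer types
    SELECT int8(zstid(42));             -- and back to int8
    SELECT zstidlt(zstid(1), zstid(2)); -- comparison support functions
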
diff --git src/include/commands/vacuum.h src/include/commands/vacuum.h
index 2779bea5c9..fefde301d2 100644
--- src/include/commands/vacuum.h
+++ src/include/commands/vacuum.h
@@ -151,6 +151,12 @@ typedef struct VacAttrStats
int numvalues[STATISTIC_NUM_SLOTS];
Datum *stavalues[STATISTIC_NUM_SLOTS];
+ /*
+ * These fields are to be filled in by compute_disk_stats
+ */
+ float4 stadiskfrac; /* fraction of the physical size */
+ float8 disksize; /* value of the physical size */
+
/*
* These fields describe the stavalues[n] element types. They will be
* initialized to match attrtypid, but a custom typanalyze function might
diff --git src/include/executor/executor.h src/include/executor/executor.h
index 94890512dc..517fe25907 100644
--- src/include/executor/executor.h
+++ src/include/executor/executor.h
@@ -601,5 +601,9 @@ extern void CheckCmdReplicaIdentity(Relation rel, CmdType cmd);
extern void CheckSubscriptionRelkind(char relkind, const char *nspname,
const char *relname);
+extern void
+PopulateNeededColumnsForNode(Node *expr, int n, Bitmapset **scanCols);
+extern Bitmapset *
+PopulateNeededColumnsForScan(ScanState *scanstate, int ncol);
#endif /* EXECUTOR_H */
diff --git src/include/fmgr.h src/include/fmgr.h
index 453df2220f..0a64173bb0 100644
--- src/include/fmgr.h
+++ src/include/fmgr.h
@@ -324,7 +324,7 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
*/
#define DatumGetByteaP(X) ((bytea *) PG_DETOAST_DATUM(X))
#define DatumGetTextP(X) ((text *) PG_DETOAST_DATUM(X))
-#define DatumGetBpCharP(X) ((BpChar *) PG_DETOAST_DATUM(X))
+#define DatumGetBpharP(X) ((BpChar *) PG_DETOAST_DATUM(X))
#define DatumGetVarCharP(X) ((VarChar *) PG_DETOAST_DATUM(X))
#define PG_GETARG_BYTEA_P(n) DatumGetByteaP(PG_GETARG_DATUM(n))
#define PG_GETARG_TEXT_P(n) DatumGetTextP(PG_GETARG_DATUM(n))
diff --git src/include/foreign/fdwapi.h src/include/foreign/fdwapi.h
index 95556dfb15..ee8c2afc6c 100644
--- src/include/foreign/fdwapi.h
+++ src/include/foreign/fdwapi.h
@@ -18,6 +18,7 @@
/* To avoid including explain.h here, reference ExplainState thus: */
struct ExplainState;
+struct AnalyzeSampleContext;
/*
@@ -139,10 +140,8 @@ typedef void (*ExplainForeignModify_function) (ModifyTableState *mtstate,
typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
struct ExplainState *es);
-typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows,
- double *totaldeadrows);
+typedef void (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+ struct AnalyzeSampleContext *context);
typedef bool (*AnalyzeForeignTable_function) (Relation relation,
AcquireSampleRowsFunc *func,
diff --git src/include/nodes/pathnodes.h src/include/nodes/pathnodes.h
index 0ceb809644..a89fbddd9c 100644
--- src/include/nodes/pathnodes.h
+++ src/include/nodes/pathnodes.h
@@ -687,6 +687,7 @@ typedef struct RelOptInfo
PlannerInfo *subroot; /* if subquery */
List *subplan_params; /* if subquery */
int rel_parallel_workers; /* wanted number of parallel workers */
+ bool leverage_column_projection;
/* Information about foreign tables and foreign joins */
Oid serverid; /* identifies server for the table or join */
diff --git src/include/pg_config.h.in src/include/pg_config.h.in
index 41ad209380..dc1c87d239 100644
--- src/include/pg_config.h.in
+++ src/include/pg_config.h.in
@@ -319,6 +319,9 @@
/* Define to 1 if you have the `ldap_r' library (-lldap_r). */
#undef HAVE_LIBLDAP_R
+/* Define to 1 if you have the `lz4' library (-llz4). */
+#undef HAVE_LIBLZ4
+
/* Define to 1 if you have the `m' library (-lm). */
#undef HAVE_LIBM
@@ -358,6 +361,9 @@
/* Define to 1 if `long long int' works and is 64 bits. */
#undef HAVE_LONG_LONG_INT_64
+/* Define to 1 if you have the <lz4.h> header file. */
+#undef HAVE_LZ4_H
+
/* Define to 1 if you have the <mbarrier.h> header file. */
#undef HAVE_MBARRIER_H
@@ -877,6 +883,9 @@
/* Define to 1 to build with LLVM based JIT support. (--with-llvm) */
#undef USE_LLVM
+/* Define to 1 to build with LZ4 support. (--with-lz4) */
+#undef USE_LZ4
+
/* Define to select named POSIX semaphores. */
#undef USE_NAMED_POSIX_SEMAPHORES
diff --git src/include/storage/bufmgr.h src/include/storage/bufmgr.h
index d2a5b52f6e..a6e1537223 100644
--- src/include/storage/bufmgr.h
+++ src/include/storage/bufmgr.h
@@ -210,6 +210,7 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferInMode(Buffer buffer, int mode);
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
diff --git src/include/storage/itemptr.h src/include/storage/itemptr.h
index 944f6fe6bd..77e57a3fca 100644
--- src/include/storage/itemptr.h
+++ src/include/storage/itemptr.h
@@ -195,6 +195,12 @@ typedef ItemPointerData *ItemPointer;
#define ItemPointerSetMovedPartitions(pointer) \
ItemPointerSet((pointer), MovedPartitionsBlockNumber, MovedPartitionsOffsetNumber)
+/* fmgr interface macros */
+#define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X) PointerGetDatum(X)
+#define PG_GETARG_ITEMPOINTER(n) DatumGetItemPointer(PG_GETARG_DATUM(n))
+#define PG_RETURN_ITEMPOINTER(x) return ItemPointerGetDatum(x)
+
/* ----------------
* externs
* ----------------
diff --git src/include/utils/rel.h src/include/utils/rel.h
index 39cdcddc2b..e91b64c50e 100644
--- src/include/utils/rel.h
+++ src/include/utils/rel.h
@@ -282,8 +282,16 @@ typedef struct StdRdOptions
int parallel_workers; /* max number of parallel workers */
bool vacuum_index_cleanup; /* enables index vacuuming and cleanup */
bool vacuum_truncate; /* enables vacuum to truncate a relation */
+ /*
+ * request zedstore_rel_extension_factor blocks from the storage manager
+ * whenever we need to extend the relation by one block for attribute/tid
+ * tree pages.
+ */
+ int zedstore_rel_extension_factor;
} StdRdOptions;
+#define ZEDSTORE_DEFAULT_REL_EXTENSION_FACTOR 1
+
#define HEAP_MIN_FILLFACTOR 10
#define HEAP_DEFAULT_FILLFACTOR 100
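
Usage sketch for the new relation option (illustrative; it assumes the option is wired
into the reloptions machinery elsewhere in the patch, and the table name and values are
made up for the example):

    CREATE TABLE zs_bulk (a int, b text)
      USING zedstore
      WITH (zedstore_rel_extension_factor = 16);

    ALTER TABLE zs_bulk SET (zedstore_rel_extension_factor = 4);
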
diff --git src/test/isolation/expected/multiple-row-versions_1.out src/test/isolation/expected/multiple-row-versions_1.out
new file mode 100644
index 0000000000..f4f140cab1
--- /dev/null
+++ src/test/isolation/expected/multiple-row-versions_1.out
@@ -0,0 +1,25 @@
+Parsed test spec with 4 sessions
+
+starting permutation: rx1 wx2 c2 wx3 ry3 wy4 rz4 c4 c3 wz1 c1
+step rx1: SELECT * FROM t WHERE id = 1000000;
+id txt
+
+1000000
+step wx2: UPDATE t SET txt = 'b' WHERE id = 1000000;
+step c2: COMMIT;
+step wx3: UPDATE t SET txt = 'c' WHERE id = 1000000;
+step ry3: SELECT * FROM t WHERE id = 500000;
+id txt
+
+500000
+step wy4: UPDATE t SET txt = 'd' WHERE id = 500000;
+step rz4: SELECT * FROM t WHERE id = 1;
+id txt
+
+1
+step c4: COMMIT;
+step c3: COMMIT;
+ERROR: could not serialize access due to read/write dependencies among transactions
+step wz1: UPDATE t SET txt = 'a' WHERE id = 1;
+ERROR: could not serialize access due to read/write dependencies among transactions
+step c1: COMMIT;
diff --git src/test/isolation/expected/read-only-anomaly-2_1.out src/test/isolation/expected/read-only-anomaly-2_1.out
new file mode 100644
index 0000000000..2e36552031
--- /dev/null
+++ src/test/isolation/expected/read-only-anomaly-2_1.out
@@ -0,0 +1,45 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s2wx s2c s3c
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+ERROR: could not serialize access due to read/write dependencies among transactions
+step s2c: COMMIT;
+step s3c: COMMIT;
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s3r s3c s2wx
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s3r: SELECT id, balance FROM bank_account WHERE id IN ('X', 'Y') ORDER BY id;
+id balance
+
+X 0
+Y 20
+step s3c: COMMIT;
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+ERROR: could not serialize access due to read/write dependencies among transactions
diff --git src/test/isolation/expected/read-only-anomaly-3_1.out src/test/isolation/expected/read-only-anomaly-3_1.out
new file mode 100644
index 0000000000..d9a5a8e49d
--- /dev/null
+++ src/test/isolation/expected/read-only-anomaly-3_1.out
@@ -0,0 +1,27 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s3r s2wx s2c s3c
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s3r: SELECT id, balance FROM bank_account WHERE id IN ('X', 'Y') ORDER BY id; <waiting ...>
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+step s3r: <... completed>
+id balance
+
+X 0
+Y 20
+error in steps s2wx s3r: ERROR: could not serialize access due to read/write dependencies among transactions
+step s2c: COMMIT;
+step s3c: COMMIT;
diff --git src/test/isolation/expected/serializable-parallel_1.out src/test/isolation/expected/serializable-parallel_1.out
new file mode 100644
index 0000000000..2e36552031
--- /dev/null
+++ src/test/isolation/expected/serializable-parallel_1.out
@@ -0,0 +1,45 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s2wx s2c s3c
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+ERROR: could not serialize access due to read/write dependencies among transactions
+step s2c: COMMIT;
+step s3c: COMMIT;
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s3r s3c s2wx
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s3r: SELECT id, balance FROM bank_account WHERE id IN ('X', 'Y') ORDER BY id;
+id balance
+
+X 0
+Y 20
+step s3c: COMMIT;
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+ERROR: could not serialize access due to read/write dependencies among transactions
diff --git src/test/isolation/expected/vacuum-reltuples_1.out src/test/isolation/expected/vacuum-reltuples_1.out
new file mode 100644
index 0000000000..151b4a3f63
--- /dev/null
+++ src/test/isolation/expected/vacuum-reltuples_1.out
@@ -0,0 +1,59 @@
+Parsed test spec with 2 sessions
+
+starting permutation: modify vac stats
+step modify:
+ insert into smalltbl select max(id)+1 from smalltbl;
+
+step vac:
+ vacuum smalltbl;
+
+step stats:
+ select relpages, reltuples from pg_class
+ where oid='smalltbl'::regclass;
+
+relpages reltuples
+
+4 21
+
+starting permutation: modify open fetch1 vac close stats
+step modify:
+ insert into smalltbl select max(id)+1 from smalltbl;
+
+step open:
+ begin;
+ declare c1 cursor for select 1 as dummy from smalltbl;
+
+step fetch1:
+ fetch next from c1;
+
+dummy
+
+1
+step vac:
+ vacuum smalltbl;
+
+step close:
+ commit;
+
+step stats:
+ select relpages, reltuples from pg_class
+ where oid='smalltbl'::regclass;
+
+relpages reltuples
+
+4 21
+
+starting permutation: modify vac stats
+step modify:
+ insert into smalltbl select max(id)+1 from smalltbl;
+
+step vac:
+ vacuum smalltbl;
+
+step stats:
+ select relpages, reltuples from pg_class
+ where oid='smalltbl'::regclass;
+
+relpages reltuples
+
+4 21
diff --git src/test/isolation/specs/read-only-anomaly-2.spec src/test/isolation/specs/read-only-anomaly-2.spec
index 9812f49ee4..2b17fcb521 100644
--- src/test/isolation/specs/read-only-anomaly-2.spec
+++ src/test/isolation/specs/read-only-anomaly-2.spec
@@ -18,13 +18,15 @@ teardown
}
session "s1"
-setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; }
+setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ SET enable_seqscan=off; }
step "s1ry" { SELECT balance FROM bank_account WHERE id = 'Y'; }
step "s1wy" { UPDATE bank_account SET balance = 20 WHERE id = 'Y'; }
step "s1c" { COMMIT; }
session "s2"
-setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; }
+setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ SET enable_seqscan=off; }
step "s2rx" { SELECT balance FROM bank_account WHERE id = 'X'; }
step "s2ry" { SELECT balance FROM bank_account WHERE id = 'Y'; }
step "s2wx" { UPDATE bank_account SET balance = -11 WHERE id = 'X'; }
diff --git src/test/regress/GNUmakefile src/test/regress/GNUmakefile
index 1a3164065f..ea81c3afa2 100644
--- src/test/regress/GNUmakefile
+++ src/test/regress/GNUmakefile
@@ -49,12 +49,18 @@ $(top_builddir)/src/port/pg_config_paths.h: | submake-libpgport
install: all installdirs
$(INSTALL_PROGRAM) pg_regress$(X) '$(DESTDIR)$(pgxsdir)/$(subdir)/pg_regress$(X)'
+ $(INSTALL_PROGRAM) gpdiff.pl '$(DESTDIR)$(pgxsdir)/$(subdir)/gpdiff.pl'
+ $(INSTALL_PROGRAM) atmsort.pl '$(DESTDIR)$(pgxsdir)/$(subdir)/atmsort.pl'
+ $(INSTALL_PROGRAM) atmsort.pm '$(DESTDIR)$(pgxsdir)/$(subdir)/atmsort.pm'
installdirs:
$(MKDIR_P) '$(DESTDIR)$(pgxsdir)/$(subdir)'
uninstall:
rm -f '$(DESTDIR)$(pgxsdir)/$(subdir)/pg_regress$(X)'
+ rm -f '$(DESTDIR)$(pgxsdir)/$(subdir)/gpdiff.pl'
+ rm -f '$(DESTDIR)$(pgxsdir)/$(subdir)/atmsort.pl'
+ rm -f '$(DESTDIR)$(pgxsdir)/$(subdir)/atmsort.pm'
# Build dynamically-loaded object file for CREATE FUNCTION ... LANGUAGE C.
diff --git src/test/regress/atmsort.pl src/test/regress/atmsort.pl
new file mode 100755
index 0000000000..142b998bce
--- /dev/null
+++ src/test/regress/atmsort.pl
@@ -0,0 +1,346 @@
+#!/usr/bin/env perl
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# Pod::Usage is loaded lazily when needed, if the --help or other such option
+# is actually used. Loading the module takes some time, which adds up when
+# running hundreds of regression tests, and gpdiff.pl calls this script twice
+# for every test. See lazy_pod2usage().
+#use Pod::Usage;
+
+use Getopt::Long;
+#use Data::Dumper; # only used by commented-out debug statements.
+use strict;
+use warnings;
+
+use File::Spec;
+
+use FindBin;
+use lib "$FindBin::Bin";
+use atmsort;
+
+=head1 NAME
+
+B<atmsort.pl> - [A] [T]est [M]echanism Sort: sort the contents of SQL log files to aid diff comparison
+
+=head1 SYNOPSIS
+
+B<atmsort.pl> [options] logfile [logfile...]
+
+Options:
+
+ -help brief help message
+ -man full documentation
+ -ignore_plans ignore explain plan content in query output
+ -init <file> load initialization file
+
+=head1 OPTIONS
+
+=over 8
+
+=item B<-help>
+
+ Prints a brief help message and exits.
+
+=item B<-man>
+
+ Prints the manual page and exits.
+
+=item B<-ignore_plans>
+
+Specify this option to ignore explain plan differences between the
+input files. All plan content in the input files is masked out, so
+differences between the plans do not show up in the diff.
+
+For example, for the following plan:
+explain select i from foo where i > 10;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..2.72 rows=45 width=4)
+ -> Seq Scan on foo (cost=0.00..1.55 rows=45 width=4)
+ Filter: i > 10
+ Settings: optimizer=on
+(4 rows)
+
+atmsort.pl -ignore_plans will reduce this to:
+
+explain select i from foo where i > 10;
+QUERY PLAN
+___________
+REGRESS_IGNORE:{
+REGRESS_IGNORE: 'child' => [
+REGRESS_IGNORE: {
+REGRESS_IGNORE: 'id' => 2,
+REGRESS_IGNORE: 'parent' => 1,
+REGRESS_IGNORE: 'short' => 'Seq Scan on foo'
+REGRESS_IGNORE: }
+REGRESS_IGNORE: ],
+REGRESS_IGNORE: 'id' => 1,
+REGRESS_IGNORE: 'short' => 'Gather Motion'
+REGRESS_IGNORE:}
+REGRESS_IGNORE:(4 rows)
+
+
+=item B<-init> <file>
+
+Specify an initialization file containing a series of directives
+(mainly for match_subs) that get applied to the input files. To
+specify multiple initialization files, use multiple init arguments,
+eg:
+
+ -init file1 -init file2
+
+
+=back
+
+=back
+
+=head1 DESCRIPTION
+
+atmsort reads sql log files from STDIN and sorts the query output for
+all SELECT statements that do *not* have an ORDER BY, writing the
+result to STDOUT. This change to the log facilitates diff comparison,
+since unORDERed query output does not have a guaranteed order. Note
+that for diff to work correctly, statements that do use ORDER BY must
+have a fully-specified order.
+
+The log content must look something like:
+
+ SELECT a, b, c, d
+ from foo
+ ORDER BY 1,2,3,4;
+ a | b | c | d
+ ------------+-----------------+-----------+---------------
+ 1 | 1 | 1 | 1
+ 1 | 1 | 1 | 2
+ 3 | 2 | 2 | 5
+ (3 rows)
+
+The log file must contain SELECT statements, followed by the query
+output in the standard PostgreSQL format, ie a set of named columns, a
+separator line constructed of dashes and plus signs, and the rows,
+followed by an "(N rows)" row count. The SELECT statement must be
+unambiguous, eg no embedded SQL keywords like INSERT, UPDATE, or
+DELETE, and it must be terminated with a semicolon. Normally, the
+query output is sorted, but if the statement contains an ORDER BY
+clause the query output for that query is not sorted.
+
+=head2 EXPLAIN PLAN
+
+atmsort can also use explain.pl to process EXPLAIN and EXPLAIN ANALYZE
+output in a configuration-independent way. It strips out all timing,
+segment, and slice information, reducing the plan to a simple nested
+perl structure. For example, for the following plan:
+
+explain analyze select * from customer;
+
+ QUERY PLAN
+------------------------------------------------------------------------
+ Gather Motion 2:1 (slice1) (cost=0.00..698.88 rows=25088 width=550)
+ Rows out: 150000 rows at destination with 0.230 ms to first row,
+ 386 ms to end, start offset by 8.254 ms.
+ -> Seq Scan on customer (cost=0.00..698.88 rows=25088 width=550)
+ Rows out: Avg 75000.0 rows x 2 workers. Max 75001 rows (seg0)
+ with 0.056 ms to first row, 26 ms to end, start offset by 7.332 ms.
+ Slice statistics:
+ (slice0) Executor memory: 186K bytes.
+ (slice1) Executor memory: 130K bytes avg x 2 workers,
+ 130K bytes max (seg0).
+ Total runtime: 413.401 ms
+(8 rows)
+
+atmsort reduces the plan to:
+
+ QUERY PLAN
+------------------------------------------------------------------------
+{
+ 'child' => [
+ {
+ 'id' => 2,
+ 'parent' => 1,
+ 'short' => 'Seq Scan on customer'
+ }
+ ],
+ 'id' => 1,
+ 'short' => 'Gather Motion'
+ }
+(8 rows)
+
+
+=head2 Advanced Usage
+
+atmsort supports several "commands" that allow finer-grained control
+over the comparison process for SELECT queries. These commands are
+specified in comments in the following form:
+
+ --
+ -- order 1
+ --
+ SELECT a, b, c, d
+ from foo
+ ORDER BY 1;
+
+or
+
+ SELECT a, b, c, d
+ from foo
+ ORDER BY 1; -- order 1
+
+The supported commands are:
+
+=over 13
+
+=item -- order column number[, column number...]
+
+ The order directive is used to compare
+ "partially-ordered" query
+ output. The specified columns are assumed
+ to be ordered, and the remaining columns are
+ sorted to allow for deterministic comparison.
+
+=item -- order none
+
+ The order none directive can be used to specify that the SELECT's
+ output is not ordered. This can be necessary if the default
+ heuristic that checks if there is an ORDER BY in the query gets
+ fooled, e.g by an ORDER BY in a subquery that doesn't force the
+ overall result to be ordered.
+
+=item -- ignore
+
+The ignore directive prefixes the SELECT output with REGRESS_IGNORE. The
+diff command can use the -I flag to ignore lines with this prefix.
+
+=item -- mvd colnum[, colnum...] -> colnum[, colnum...] [; <additional specs>]
+
+mvd is designed to support Multi-Value Dependencies for OLAP queries.
+The syntax "col1,col2->col3,col4" indicates that the col1 and col2
+values determine the col3, col4 result order.
+
+=item -- start_ignore
+
+Ignore all results until the next "end_ignore" directive. The
+start_ignore directive prefixes all subsequent output with REGRESS_IGNORE,
+and all other formatting directives are ignored as well. The diff
+command can use the -I flag to ignore lines with this prefix.
+
+=item -- end_ignore
+
+ Ends the ignored region that started with "start_ignore"
+
+=item -- start_matchsubs
+
+Starts a list of match/substitution expressions, where the match and
+substitution are specified as perl "m" and "s" operators for a single
+line of input. atmsort will compile the expressions and use them to
+process the current input file. The format is:
+
+ -- start_matchsubs
+ --
+ -- # first, a match expression
+ -- m/match this/
+ -- # next, a substitute expression
+ -- s/match this/substitute this/
+ --
+ -- # and can have more matchsubs after this...
+ --
+ -- end_matchsubs
+
+ Blank lines are ignored, and comments may be used if they are
+ prefixed with "#", the perl comment character, eg:
+
+ -- # this is a comment
+
+ Multiple match and substitute pairs may be specified. See "man
+ perlre" for more information on perl regular expressions.
+
+=item -- end_matchsubs
+
+ Ends the match/substitution region that started with "start_matchsubs"
+
+=item -- start_matchignore
+
+Similar to matchsubs, starts a list of match/ignore expressions as a
+set of perl match operators. Each line that matches one of the
+specified expressions is elided from the atmsort output. Note that
+there isn't an "ignore" expression -- just a list of individual match
+operators.
+
+=item -- end_matchignore
+
+ Ends the match/ignore region that started with "start_matchignore"
+
+=item -- force_explain
+
+Normally, atmsort can detect that a SQL query is being EXPLAINed, and
+the explain processing will happen automatically. However, if the
+query is complex, you may need to tag it with a comment to force the
+explain. Using this command for non-EXPLAIN statements is
+inadvisable.
+
+=back
+
+Note that you can combine the directives for a single query, but each
+directive must be on a separate line. Multiple mvd specifications
+must be on a single mvd line, separated by semicolons. Note that
+start_ignore overrides all directives until the next end_ignore.
+
+=head1 CAVEATS/LIMITATIONS
+
+atmsort cannot handle "unsorted" SELECT queries where the output has
+strings with embedded newlines or pipe ("|") characters due to
+limitations with the parser in the "tablelizer" function. Queries
+with these characteristics must have an ORDER BY clause to avoid
+potential erroneous comparison.
+
+=cut
+
+# Calls pod2usage, but loads the module first.
+sub lazy_pod2usage
+{
+ require Pod::Usage;
+ Pod::Usage::pod2usage(@_);
+}
+
+my $glob_id = "";
+
+my $glob_init;
+
+my $glob_orderwarn;
+my $glob_verbose;
+my $glob_fqo;
+
+my $man = 0;
+my $help = 0;
+my $ignore_plans;
+my @init_file;
+my $verbose;
+my $orderwarn;
+
+GetOptions(
+ 'help|?' => \$help, man => \$man,
+ 'gpd_ignore_plans|gp_ignore_plans|ignore_plans' => \$ignore_plans,
+ 'gpd_init|gp_init|init:s' => \@init_file,
+ 'order_warn|orderwarn' => \$orderwarn,
+ 'verbose' => \$verbose
+ )
+ or lazy_pod2usage(2);
+
+lazy_pod2usage(-msg => $glob_id, -exitstatus => 1) if $help;
+lazy_pod2usage(-msg => $glob_id, -exitstatus => 0, -verbose => 2) if $man;
+
+push @{$glob_init}, @init_file;
+
+my %args;
+
+$args{IGNORE_PLANS} = $ignore_plans if (defined ($ignore_plans));
+@{$args{INIT_FILES}} = @init_file if (scalar(@init_file));
+$args{ORDER_WARN} = $orderwarn if (defined ($orderwarn));
+$args{VERBOSE} = $verbose if (defined ($verbose));
+
+atmsort::atmsort_init(%args);
+
+atmsort::run_fhs(*STDIN, *STDOUT);
+
+exit();
diff --git src/test/regress/atmsort.pm src/test/regress/atmsort.pm
new file mode 100644
index 0000000000..bbee018255
--- /dev/null
+++ src/test/regress/atmsort.pm
@@ -0,0 +1,1371 @@
+#
+# This is the workhorse of atmsort.pl, extracted into a module so that it
+# can be called more efficiently from other perl programs.
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# Public interface:
+#
+# atmsort_init(args in a hash)
+#
+# followed by:
+#
+# run_fhs(input file handle, output file handle)
+# or
+# run(input filename, output filename)
+#
+package atmsort;
+
+#use Data::Dumper; # only used by commented-out debug statements.
+use strict;
+use warnings;
+use File::Temp qw/ tempfile /;
+
+# optional set of prefixes to identify sql statements, query output,
+# and sorted lines (for testing purposes)
+#my $apref = 'a: ';
+#my $bpref = 'b: ';
+#my $cpref = 'c: ';
+#my $dpref = 'S: ';
+my $apref = '';
+my $bpref = '';
+my $cpref = '';
+my $dpref = '';
+
+my $glob_ignore_plans;
+my @glob_init;
+
+my $glob_orderwarn;
+my $glob_verbose;
+my $glob_fqo;
+
+my $atmsort_outfh;
+
+sub atmsort_init
+{
+ my %args = (
+ # defaults
+ IGNORE_PLANS => 0,
+ INIT_FILES => [],
+ ORDER_WARN => 0,
+ VERBOSE => 0,
+
+ # override the defaults from argument list
+ @_
+ );
+
+ $glob_ignore_plans = 0;
+ @glob_init = ();
+
+ $glob_orderwarn = 0;
+ $glob_verbose = 0;
+ $glob_fqo = {count => 0};
+
+ my $ignore_plans;
+ my @init_file;
+ my $verbose;
+ my $orderwarn;
+
+ $glob_ignore_plans = $args{IGNORE_PLANS};
+
+ @glob_init = @{$args{INIT_FILES}};
+
+ $glob_orderwarn = $args{ORDER_WARN};
+ $glob_verbose = $args{VERBOSE};
+
+ _process_init_files();
+}
+
+sub _process_init_files
+{
+ # allow multiple init files
+ if (@glob_init)
+ {
+ my $devnullfh;
+ my $init_file_fh;
+
+ open $devnullfh, "> /dev/null" or die "can't open /dev/null: $!";
+
+ for my $init_file (@glob_init)
+ {
+ die "no such file: $init_file"
+ unless (-e $init_file);
+
+ # Perform initialization from this init_file by passing it
+ # to bigloop. Open the file, and pass that as the input file
+ # handle, and redirect output to /dev/null.
+ open $init_file_fh, "< $init_file" or die "could not open $init_file: $!";
+
+ atmsort_bigloop($init_file_fh, $devnullfh);
+
+ close $init_file_fh;
+ }
+
+ close $devnullfh;
+ }
+}
+
+my $glob_match_then_sub_fnlist;
+
+sub _build_match_subs
+{
+ my ($here_matchsubs, $whomatch) = @_;
+
+ my $stat = [1];
+
+ # filter out the comments and blank lines
+ $here_matchsubs =~ s/^\s*(?:#.*)?(?:[\r\n]|\x0D\x0A)//gm;
+
+ # split up the document into separate lines
+ my @foo = split(/\n/, $here_matchsubs);
+
+ my $ii = 0;
+
+ my $matchsubs_arr = [];
+ my $msa;
+
+ # build an array of arrays of match/subs pairs
+ while ($ii < scalar(@foo))
+ {
+ my $lin = $foo[$ii];
+
+ if (defined($msa))
+ {
+ push @{$msa}, $lin;
+
+ push @{$matchsubs_arr}, $msa;
+
+ undef $msa;
+ }
+ else
+ {
+ $msa = [$lin];
+ }
+ $ii++;
+ next;
+ } # end while
+
+# print Data::Dumper->Dump($matchsubs_arr);
+
+ my $bigdef;
+
+ my $fn1;
+
+ # build a lambda function for each expression, and load it into an
+ # array
+ my $mscount = 1;
+
+ for my $defi (@{$matchsubs_arr})
+ {
+ unless (2 == scalar(@{$defi}))
+ {
+ my $err1 = "bad definition: " . Data::Dumper->Dump([$defi]);
+ $stat->[0] = 1;
+ $stat->[1] = $err1;
+ return $stat;
+ }
+
+ $bigdef = '$fn1 = sub { my $ini = shift; '. "\n";
+ $bigdef .= 'if ($ini =~ ' . $defi->[0];
+ $bigdef .= ') { ' . "\n";
+# $bigdef .= 'print "match\n";' . "\n";
+ $bigdef .= '$ini =~ ' . $defi->[1];
+ $bigdef .= '; }' . "\n";
+ $bigdef .= 'return $ini; }' . "\n";
+
+# print $bigdef;
+
+ if (eval $bigdef)
+ {
+ my $cmt = $whomatch . " matchsubs \#" . $mscount;
+ $mscount++;
+
+ # store the function pointer and the text of the function
+ # definition
+ push @{$glob_match_then_sub_fnlist},
+ [$fn1, $bigdef, $cmt, $defi->[0], $defi->[1]];
+
+ if ($glob_verbose && defined $atmsort_outfh)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: Defined $cmt\t$defi->[0]\t$defi->[1]\n"
+ }
+ }
+ else
+ {
+ my $err1 = "bad eval: $bigdef";
+ $stat->[0] = 1;
+ $stat->[1] = $err1;
+ return $stat;
+ }
+
+ }
+
+# print Data::Dumper->Dump($glob_match_then_sub_fnlist);
+
+ return $stat;
+
+} # end _build_match_subs
+
+sub match_then_subs
+{
+ my $ini = shift;
+
+ for my $ff (@{$glob_match_then_sub_fnlist})
+ {
+ # get the function and execute it
+ my $fn1 = $ff->[0];
+ if (!$glob_verbose)
+ {
+ $ini = &$fn1($ini);
+ }
+ else
+ {
+ my $subs = &$fn1($ini);
+ unless ($subs eq $ini)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: was: $ini";
+ print $atmsort_outfh "REGRESS_IGNORE: matched $ff->[-3]\t$ff->[-2]\t$ff->[-1]\n"
+ }
+
+ $ini = &$fn1($ini);
+ }
+
+ }
+ return $ini;
+}
+
+my $glob_match_then_ignore_fnlist;
+
+sub _build_match_ignores
+{
+ my ($here_matchignores, $whomatch) = @_;
+
+ my $stat = [1];
+
+ # filter out the comments and blank lines
+ $here_matchignores =~ s/^\s*(?:#.*)?(?:[\r\n]|\x0D\x0A)//gm;
+
+ # split up the document into separate lines
+ my @matchignores_arr = split(/\n/, $here_matchignores);
+
+ my $bigdef;
+
+ my $fn1;
+
+ # build a lambda function for each expression, and load it into an
+ # array
+ my $mscount = 1;
+
+ for my $defi (@matchignores_arr)
+ {
+ $bigdef = '$fn1 = sub { my $ini = shift; '. "\n";
+ $bigdef .= 'return ($ini =~ ' . $defi;
+ $bigdef .= ') ; } ' . "\n";
+# print $bigdef;
+
+ if (eval $bigdef)
+ {
+ my $cmt = $whomatch . " matchignores \#" . $mscount;
+ $mscount++;
+
+ # store the function pointer and the text of the function
+ # definition
+ push @{$glob_match_then_ignore_fnlist},
+ [$fn1, $bigdef, $cmt, $defi, "(ignore)"];
+ if ($glob_verbose && defined $atmsort_outfh)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: Defined $cmt\t$defi\n"
+ }
+
+ }
+ else
+ {
+ my $err1 = "bad eval: $bigdef";
+ $stat->[0] = 1;
+ $stat->[1] = $err1;
+ return $stat;
+ }
+
+ }
+
+# print Data::Dumper->Dump($glob_match_then_ignore_fnlist);
+
+ return $stat;
+
+} # end _build_match_ignores
+
+# if the input matches, return 1 (ignore), else return 0 (keep)
+sub match_then_ignore
+{
+ my $ini = shift;
+
+ for my $ff (@{$glob_match_then_ignore_fnlist})
+ {
+ # get the function and execute it
+ my $fn1 = $ff->[0];
+
+ if (&$fn1($ini))
+ {
+ if ($glob_verbose)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: matched $ff->[-3]\t$ff->[-2]\t$ff->[-1]\n"
+ }
+ return 1; # matched
+ }
+ }
+ return 0; # no match
+}
+
+# convert a postgresql psql formatted table into an array of hashes
+sub tablelizer
+{
+ my ($ini, $got_line1) = @_;
+
+ # first, split into separate lines, the find all the column headings
+
+ my @lines = split(/\n/, $ini);
+
+ return undef
+ unless (scalar(@lines));
+
+ # if the first line is supplied, then it has the column headers,
+ # so don't try to find them (or the ---+---- separator) in
+ # "lines"
+ my $line1 = $got_line1;
+ $line1 = shift @lines
+ unless (defined($got_line1));
+
+ # look for <space>|<space>
+ my @colheads = split(/\s+\|\s+/, $line1);
+
+ # fixup first, last column head (remove leading,trailing spaces)
+
+ $colheads[0] =~ s/^(\s+|\s+$)//;
+ $colheads[-1] =~ s/^(\s+|\s+$)//;
+
+ return undef
+ unless (scalar(@lines));
+
+ shift @lines # skip dashed separator (unless it was skipped already)
+ unless (defined($got_line1));
+
+ my @rows;
+
+ for my $lin (@lines)
+ {
+ my @cols = split(/\|/, $lin, scalar(@colheads));
+ last
+ unless (scalar(@cols) == scalar(@colheads));
+
+ my $rowh = {};
+
+ for my $colhdcnt (0..(scalar(@colheads)-1))
+ {
+ my $rawcol = shift @cols;
+
+ $rawcol =~ s/^(\s+|\s+$)//;
+
+ my $colhd = $colheads[$colhdcnt];
+ $rowh->{($colhdcnt+1)} = $rawcol;
+ }
+ push @rows, $rowh;
+ }
+
+ return \@rows;
+}
+# reformat the EXPLAIN output according to the directive hash
+sub format_explain
+{
+ my ($outarr, $directive) = @_;
+ my $prefix = "";
+ my $xopt = "perl"; # normal case
+
+ $directive = {} unless (defined($directive));
+
+ # Ignore plan content if its between start_ignore and end_ignore blocks
+ # or if -ignore_plans is specified.
+ $prefix = "REGRESS_IGNORE:"
+ if (exists($directive->{ignore})) || ($glob_ignore_plans);
+
+ my @tmp_lines;
+
+ if (scalar(@{$outarr}))
+ {
+ @tmp_lines = (
+ "QUERY PLAN\n",
+ ("-" x 71) . "\n",
+ @{$outarr},
+ "(111 rows)\n"
+ );
+ }
+
+ # Apply prefix to each line, if requested.
+ if (defined($prefix) && length($prefix))
+ {
+ foreach my $line (@tmp_lines)
+ {
+ $line = $prefix . $line;
+ }
+ }
+
+ # Put back newlines and print
+ foreach my $line (@tmp_lines)
+ {
+ print $atmsort_outfh $line;
+ }
+
+ return \@tmp_lines;
+}
+
+# reformat the query output according to the directive hash
+sub format_query_output
+{
+ my ($fqostate, $has_order, $outarr, $directive) = @_;
+ my $prefix = "";
+
+ $directive = {} unless (defined($directive));
+
+ $fqostate->{count} += 1;
+
+ if ($glob_verbose)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: start fqo $fqostate->{count}\n";
+ }
+
+ # EXPLAIN
+ #
+ # EXPLAIN (COSTS OFF) output is *not* processed. The output with COSTS OFF
+ # shouldn't contain anything that varies across runs, and shouldn't need
+ # sanitizing.
+ #
+ # However when -ignore_plans is specified we also need to process
+ # EXPLAIN (COSTS OFF) to ignore the segments information.
+ if (exists($directive->{explain})
+ && ($glob_ignore_plans
+ || $directive->{explain} ne 'costs_off')
+ && (!exists($directive->{explain_processing})
+ || ($directive->{explain_processing} =~ m/on/)))
+ {
+ format_explain($outarr, $directive);
+ if ($glob_verbose)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: end fqo $fqostate->{count}\n";
+ }
+ return;
+ }
+
+ $prefix = "REGRESS_IGNORE:"
+ if (exists($directive->{ignore}));
+
+ if (exists($directive->{sortlines}))
+ {
+ my $firstline = $directive->{firstline};
+ my $ordercols = $directive->{order};
+ my $mvdlist = $directive->{mvd};
+
+ # lines already have newline terminator, so just rejoin them.
+ my $lines = join ("", @{$outarr});
+
+ my $ah1 = tablelizer($lines, $firstline);
+
+ unless (defined($ah1) && scalar(@{$ah1}))
+ {
+# print "No tablelizer hash for $lines, $firstline\n";
+# print STDERR "No tablelizer hash for $lines, $firstline\n";
+
+ if ($glob_verbose)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: end fqo $fqostate->{count}\n";
+ }
+
+ return;
+ }
+
+ my @allcols = sort (keys(%{$ah1->[0]}));
+
+ my @presortcols;
+ if (defined($ordercols) && length($ordercols))
+ {
+# $ordercols =~ s/^.*order\s*//;
+ $ordercols =~ s/\n//gm;
+ $ordercols =~ s/\s//gm;
+
+ @presortcols = split(/\s*\,\s*/, $ordercols);
+ }
+
+ my @mvdcols;
+ my @mvd_deps;
+ my @mvd_nodeps;
+ my @mvdspec;
+ if (defined($mvdlist) && length($mvdlist))
+ {
+ $mvdlist =~ s/\n//gm;
+ $mvdlist =~ s/\s//gm;
+
+ # find all the mvd specifications (separated by semicolons)
+ my @allspecs = split(/\;/, $mvdlist);
+
+# print "allspecs:", Data::Dumper->Dump(\@allspecs);
+
+ for my $item (@allspecs)
+ {
+ my $realspec;
+ # split the specification list, separating the
+ # specification columns on the left hand side (LHS)
+ # from the "dependent" columns on the right hand side (RHS)
+ my @colset = split(/\-\>/, $item, 2);
+ unless (scalar(@colset) == 2)
+ {
+ print $atmsort_outfh "invalid colset for $item\n";
+ print STDERR "invalid colset for $item\n";
+ next;
+ }
+ # specification columns (LHS)
+ my @scols = split(/\,/, $colset[0]);
+ unless (scalar(@scols))
+ {
+ print $atmsort_outfh "invalid dependency specification: $colset[0]\n";
+ print STDERR
+ "invalid dependency specification: $colset[0]\n";
+ next;
+ }
+ # dependent columns (RHS)
+ my @dcols = split(/\,/, $colset[1]);
+ unless (scalar(@dcols))
+ {
+ print $atmsort_outfh "invalid specified dependency: $colset[1]\n";
+ print STDERR "invalid specified dependency: $colset[1]\n";
+ next;
+ }
+ $realspec = {};
+ my $scol2 = [];
+ my $dcol2 = [];
+ my $sdcol = [];
+ $realspec->{spec} = $item;
+ push @{$scol2}, @scols;
+ push @{$dcol2}, @dcols;
+ push @{$sdcol}, @scols, @dcols;
+ $realspec->{scol} = $scol2;
+ $realspec->{dcol} = $dcol2;
+ $realspec->{allcol} = $sdcol;
+
+ push @mvdcols, @scols, @dcols;
+ # find all the dependent columns
+ push @mvd_deps, @dcols;
+ push @mvdspec, $realspec;
+ }
+
+ # find all the mvd cols which are *not* dependent. Need
+ # to handle the case of self-dependency, eg "mvd 1->1", so
+ # must build set of all columns, then strip out the
+ # "dependent" cols. So this is the set of all LHS columns
+ # which are never on the RHS.
+ my %get_nodeps;
+
+ for my $col (@mvdcols)
+ {
+ $get_nodeps{$col} = 1;
+ }
+
+ # remove dependent cols
+ for my $col (@mvd_deps)
+ {
+ if (exists($get_nodeps{$col}))
+ {
+ delete $get_nodeps{$col};
+ }
+ }
+ # now sorted and unique, with no dependents
+ @mvd_nodeps = sort (keys(%get_nodeps));
+# print "mvdspec:", Data::Dumper->Dump(\@mvdspec);
+# print "mvd no deps:", Data::Dumper->Dump(\@mvd_nodeps);
+ }
+
+ my %unsorth = map { $_ => 1 } @allcols;
+
+ # clear sorted column list if just "order 0"
+ if ((1 == scalar(@presortcols))
+ && ($presortcols[0] eq "0"))
+ {
+ @presortcols = ();
+ }
+
+
+ for my $col (@presortcols)
+ {
+ if (exists($unsorth{$col}))
+ {
+ delete $unsorth{$col};
+ }
+ }
+ for my $col (@mvdcols)
+ {
+ if (exists($unsorth{$col}))
+ {
+ delete $unsorth{$col};
+ }
+ }
+ my @unsortcols = sort(keys(%unsorth));
+
+# print Data::Dumper->Dump([$ah1]);
+
+ if (scalar(@presortcols))
+ {
+ my $hd1 = "sorted columns " . join(", ", @presortcols);
+
+ print $hd1, "\n", "-"x(length($hd1)), "\n";
+
+ for my $h_row (@{$ah1})
+ {
+ my @collist;
+
+ @collist = ();
+
+# print "hrow:",Data::Dumper->Dump([$h_row]), "\n";
+
+ for my $col (@presortcols)
+ {
+# print "col: ($col)\n";
+ if (exists($h_row->{$col}))
+ {
+ push @collist, $h_row->{$col};
+ }
+ else
+ {
+ my $maxcol = scalar(@allcols);
+ my $errstr =
+ "specified ORDER column out of range: $col vs $maxcol\n";
+ print $atmsort_outfh $errstr;
+ print STDERR $errstr;
+ last;
+ }
+ }
+ print $atmsort_outfh $prefix, join(' | ', @collist), "\n";
+ }
+ }
+
+ if (scalar(@mvdspec))
+ {
+ my @outi;
+
+ my $hd1 = "multivalue dependency specifications";
+
+ print $hd1, "\n", "-"x(length($hd1)), "\n";
+
+ for my $mspec (@mvdspec)
+ {
+ $hd1 = $mspec->{spec};
+ print $hd1, "\n", "-"x(length($hd1)), "\n";
+
+ for my $h_row (@{$ah1})
+ {
+ my @collist;
+
+ @collist = ();
+
+# print "hrow:",Data::Dumper->Dump([$h_row]), "\n";
+
+ for my $col (@{$mspec->{allcol}})
+ {
+# print "col: ($col)\n";
+ if (exists($h_row->{$col}))
+ {
+ push @collist, $h_row->{$col};
+ }
+ else
+ {
+ my $maxcol = scalar(@allcols);
+ my $errstr =
+ "specified MVD column out of range: $col vs $maxcol\n";
+ print $errstr;
+ print STDERR $errstr;
+ last;
+ }
+
+ }
+ push @outi, join(' | ', @collist);
+ }
+ my @ggg= sort @outi;
+
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $prefix, $line, "\n";
+ }
+ @outi = ();
+ }
+ }
+ my $hd2 = "unsorted columns " . join(", ", @unsortcols);
+
+ # the "unsorted" comparison must include all columns which are
+ # not sorted or part of an mvd specification, plus the sorted
+ # columns, plus the non-dependent mvd columns which aren't
+ # already in the list
+ if ((scalar(@presortcols))
+ || scalar(@mvd_nodeps))
+ {
+ if (scalar(@presortcols))
+ {
+ if (scalar(@mvd_deps))
+ {
+ my %get_presort;
+
+ for my $col (@presortcols)
+ {
+ $get_presort{$col} = 1;
+ }
+ # remove "dependent" (RHS) columns
+ for my $col (@mvd_deps)
+ {
+ if (exists($get_presort{$col}))
+ {
+ delete $get_presort{$col};
+ }
+ }
+ # now sorted and unique, minus all mvd dependent cols
+ @presortcols = sort (keys(%get_presort));
+
+ }
+
+ if (scalar(@presortcols))
+ {
+ $hd2 .= " ( " . join(", ", @presortcols) . ")";
+ # have to compare all columns as unsorted
+ push @unsortcols, @presortcols;
+ }
+ }
+ if (scalar(@mvd_nodeps))
+ {
+ my %get_nodeps;
+
+ for my $col (@mvd_nodeps)
+ {
+ $get_nodeps{$col} = 1;
+ }
+ # remove "nodeps" which are already in the output list
+ for my $col (@unsortcols)
+ {
+ if (exists($get_nodeps{$col}))
+ {
+ delete $get_nodeps{$col};
+ }
+ }
+ # now sorted and unique, minus all unsorted/sorted cols
+ @mvd_nodeps = sort (keys(%get_nodeps));
+ if (scalar(@mvd_nodeps))
+ {
+ $hd2 .= " (( " . join(", ", @mvd_nodeps) . "))";
+ # have to compare all columns as unsorted
+ push @unsortcols, @mvd_nodeps;
+ }
+
+ }
+
+ }
+
+ print $hd2, "\n", "-"x(length($hd2)), "\n";
+
+ my @finalunsort;
+
+ if (scalar(@unsortcols))
+ {
+ for my $h_row (@{$ah1})
+ {
+ my @collist;
+
+ @collist = ();
+
+ for my $col (@unsortcols)
+ {
+ if (exists($h_row->{$col}))
+ {
+ push @collist, $h_row->{$col};
+ }
+ else
+ {
+ my $maxcol = scalar(@allcols);
+ my $errstr =
+ "specified UNSORT column out of range: $col vs $maxcol\n";
+ print $errstr;
+ print STDERR $errstr;
+ last;
+ }
+
+ }
+ push @finalunsort, join(' | ', @collist);
+ }
+ my @ggg= sort @finalunsort;
+
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $prefix, $line, "\n";
+ }
+ }
+
+ if ($glob_verbose)
+ {
+ print "REGRESS_IGNORE: end fqo $fqostate->{count}\n";
+ }
+
+ return;
+ } # end order
+
+
+ if ($has_order)
+ {
+ my @ggg= @{$outarr};
+
+ if ($glob_orderwarn)
+ {
+ # If no ordering cols specified (no directive), and SELECT has
+ # ORDER BY, see if number of order by cols matches all cols in
+ # selected lists. Treat the order by cols as a comma separated
+ # list and count them. Works ok for simple ORDER BY clauses
+ if (defined($directive->{sql_statement}))
+ {
+ my @ocols = ($directive->{sql_statement} =~ m/select.*order\s+by\s+(.*)\;/ism);
+
+ if (scalar(@ocols))
+ {
+ my $fl2 = $directive->{firstline};
+ # lines already have newline terminator, so just rejoin them.
+ my $line2 = join ("", @{$outarr});
+
+ my $ah2 = tablelizer($line2, $fl2);
+ if (defined($ah2) && scalar(@{$ah2}))
+ {
+ my $allcol_count = scalar(keys(%{$ah2->[0]}));
+
+ # In order to count the number of ORDER BY columns we
+ # can transliterate over comma and increment by one to
+ # account for the last column not having a trailing
+ # comma. This is faster than splitting over the comma
+ # since we don't need to allocate the returned array.
+ my $ocol_count = ($ocols[0] =~ tr/,//) + 1;
+
+ if ($ocol_count < $allcol_count)
+ {
+ print "REGRESS_IGNORE: ORDER_WARNING: OUTPUT ",
+ $allcol_count, " columns, but ORDER BY on ",
+ $ocol_count, " \n";
+ }
+ }
+ }
+ }
+ } # end if $glob_orderwarn
+
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $dpref, $prefix, $line;
+ }
+ }
+ else
+ {
+ my @ggg= sort @{$outarr};
+
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $bpref, $prefix, $line;
+ }
+ }
+
+ if ($glob_verbose)
+ {
+ print "REGRESS_IGNORE: end fqo $fqostate->{count}\n";
+ }
+}
+
+
+# The caller should've opened ATMSORT_INFILE and ATMSORT_OUTFILE file handles.
+sub atmsort_bigloop
+{
+ my $infh = shift;
+ $atmsort_outfh = shift;
+
+ my $sql_statement = "";
+ my @outarr;
+
+ my $getrows = 0;
+ my $getstatement = 0;
+ my $has_order = 0;
+ my $copy_to_stdout_result = 0;
+ my $describe_mode = 0;
+ my $directive = {};
+ my $big_ignore = 0;
+ my %define_match_expression;
+
+ print $atmsort_outfh "REGRESS_IGNORE: formatted by atmsort.pm\n";
+
+ L_bigwhile:
+ while (<$infh>) # big while
+ {
+ reprocess_row:
+ my $ini = $_;
+
+ # look for match/substitution or match/ignore expressions
+ if (%define_match_expression)
+ {
+ if ($ini =~ m/\-\-\s*end\_match(subs|ignore)\s*$/)
+ {
+ if ($define_match_expression{"type"} ne $1)
+ {
+ die "Non-matching operation end_match" . $1 . ", " .
+ "expected end_match" . $define_match_expression{"type"};
+ }
+ }
+ else
+ {
+ $define_match_expression{"expr"} .= $ini;
+ goto L_push_outarr;
+ }
+
+ my @foo = split(/\n/, $define_match_expression{"expr"}, 2);
+
+ unless (2 == scalar(@foo))
+ {
+ $ini .= "REGRESS_IGNORE: bad match definition\n";
+ undef %define_match_expression;
+ goto L_push_outarr;
+ }
+
+ my $stat;
+
+ my $doc1 = $foo[1];
+
+ # strip off leading comment characters
+ $doc1 =~ s/^\s*\-\-//gm;
+
+ if ($define_match_expression{"type"} eq 'subs')
+ {
+ $stat = _build_match_subs($doc1, "USER");
+ }
+ else
+ {
+ $stat = _build_match_ignores($doc1, "USER");
+ }
+
+ if (scalar(@{$stat}) > 1)
+ {
+ my $outi = $stat->[1];
+
+ # print a message showing the error
+ $outi =~ s/^(.*)/REGRESS_IGNORE: ($1)/gm;
+ $ini .= $outi;
+ }
+ else
+ {
+ $ini .= "REGRESS_IGNORE: defined new match expression\n";
+ }
+
+ undef %define_match_expression;
+ goto L_push_outarr;
+ } # end defined match expression
+
+ if ($big_ignore > 0)
+ {
+ if ($ini =~ m/\-\-\s*end\_ignore\s*$/)
+ {
+ $big_ignore--;
+ }
+ print $atmsort_outfh "REGRESS_IGNORE:", $ini;
+ next;
+ }
+
+ if ($getrows) # getting rows from SELECT output
+ {
+ # The end of "result set" for a COPY TO STDOUT is a bit tricky
+ # to find. There is no explicit marker for it. We look for a
+ # line that looks like a SQL comment or a new query, or an ERROR.
+ # This is not bullet-proof, but works for the current tests.
+ if ($copy_to_stdout_result &&
+ ($ini =~ m/(?:\-\-|ERROR|copy|create|drop|select|insert|update)/i))
+ {
+ my @ggg = sort @outarr;
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $bpref, $line;
+ }
+
+ @outarr = ();
+ $getrows = 0;
+ $has_order = 0;
+ $copy_to_stdout_result = 0;
+
+ # Process the row again, in case it begins another
+ # COPY TO STDOUT statement, or another query.
+ goto reprocess_row;
+ }
+
+ my $end_of_table = 0;
+
+ if ($describe_mode)
+ {
+ # \d tables don't always end with a row count, and there may be
+ # more than one of them per command. So we allow any of the
+ # following to end the table:
+ # - a blank line
+ # - a row that doesn't have the same number of column separators
+ # as the header line
+ # - a row count (checked below)
+ if ($ini =~ m/^$/)
+ {
+ $end_of_table = 1;
+ }
+ elsif (exists($directive->{firstline}))
+ {
+ # Count the number of column separators in the table header
+ # and our current line.
+ my $headerSeparators = ($directive->{firstline} =~ tr/\|//);
+ my $lineSeparators = ($ini =~ tr/\|//);
+
+ if ($headerSeparators != $lineSeparators)
+ {
+ $end_of_table = 1;
+ }
+ }
+
+ # Don't reset describe_mode at the end of the table; there may
+ # be more tables still to go.
+ }
+
+ # regex example: (5 rows)
+ if ($ini =~ m/^\s*\(\d+\s+row(?:s)*\)\s*$/)
+ {
+ # Always ignore the rowcount for explain plan out as the
+ # skeleton plans might be the same even if the row counts
+ # differ because of session level GUCs.
+ if (exists($directive->{explain}))
+ {
+ $ini = 'REGRESS_IGNORE:' . $ini;
+ }
+
+ $end_of_table = 1;
+ }
+
+ if ($end_of_table)
+ {
+ format_query_output($glob_fqo,
+ $has_order, \@outarr, $directive);
+
+ $directive = {};
+ @outarr = ();
+ $getrows = 0;
+ $has_order = 0;
+ }
+ }
+ else # finding SQL statement or start of SELECT output
+ {
+ # To avoid hunting for gpdiff commands which are contained inside
+ # comments first establish if the line contains a comment with any
+ # trailing characters at all.
+ my $has_comment = ((m/\s*\-\-.+$/) ? 1 : 0);
+
+ if ($has_comment && $ini =~ m/\-\-\s*start\_match(subs|ignore)\s*$/)
+ {
+ $define_match_expression{"type"} = $1;
+ $define_match_expression{"expr"} = $ini;
+ goto L_push_outarr;
+ }
+ if ($has_comment && ($ini =~ m/\-\-\s*start\_ignore\s*$/))
+ {
+ $big_ignore += 1;
+
+ for my $line (@outarr)
+ {
+ print $atmsort_outfh $apref, $line;
+ }
+ @outarr = ();
+
+ print $atmsort_outfh 'REGRESS_IGNORE:', $ini;
+ next;
+ }
+
+ # EXPLAIN (COSTS OFF) ...
+ if ($ini =~ m/explain\s*\(.*costs\s+off.*\)/i)
+ {
+ $directive->{explain} = "costs_off";
+ }
+ # Note: \d is for the psql "describe"
+ elsif ($ini =~ m/(?:insert|update|delete|select|^\s*\\d|copy|execute)/i)
+ {
+ $copy_to_stdout_result = 0;
+ $has_order = 0;
+ $sql_statement = "";
+
+ if ($ini =~ m/explain.*(?:insert|update|delete|select|execute)/i)
+ {
+ $directive->{explain} = 'normal';
+ }
+
+ # Should we apply more heuristics to try to find the end of \d
+ # output?
+ $describe_mode = ($ini =~ m/^\s*\\d/);
+ }
+
+ # Catching multiple commands and capturing the parens matches
+ # makes it possible to check just the first character since
+ # each command has a unique first character. This allows us to
+ # use fewer regular expression matches in this hot section.
+ if ($has_comment &&
+ $ini =~ m/\-\-\s*((force_explain)\s*(operator)?\s*$|(ignore)\s*$|(order)\s+(\d+|none).*$|(mvd)\s+\d+.*$|(explain_processing_(on|off))\s+.*$)/)
+ {
+ my $full_command = $1;
+ my $cmd = substr($full_command, 0, 1);
+ if ($cmd eq 'i')
+ {
+ $directive->{ignore} = 'ignore';
+ }
+ elsif ($cmd eq 'o')
+ {
+ my $olist = $ini;
+ $olist =~ s/^.*\-\-\s*order//;
+ if ($olist =~ /none/)
+ {
+ $directive->{order_none} = 1;
+ }
+ else
+ {
+ $directive->{order} = $olist;
+ }
+ }
+ elsif ($cmd eq 'f')
+ {
+ if (defined($3))
+ {
+ $directive->{explain} = 'operator';
+ }
+ else
+ {
+ $directive->{explain} = 'normal';
+ }
+ }
+ elsif ($cmd eq 'e')
+ {
+ $full_command =~ m/(on|off)$/;
+ $directive->{explain_processing} = $1;
+ }
+ else
+ {
+ my $olist = $ini;
+ $olist =~ s/^.*\-\-\s*mvd//;
+ $directive->{mvd} = $olist;
+ }
+ }
+
+ if ($ini =~ m/select/i)
+ {
+ $getstatement = 1;
+ $sql_statement .= $ini;
+ }
+ if (index($ini, ';') != -1)
+ {
+ $getstatement = 0;
+ }
+
+ # prune notices with segment info if they are duplicates
+ if ($ini =~ m/^\s*(?:NOTICE|ERROR|HINT|DETAIL|WARNING)\:.*\(seg.*pid.*\)/)
+ {
+ $ini =~ s/\s+(?:\W)?(?:\W)?\(seg.*pid.*\)//;
+
+ my $outsize = scalar(@outarr);
+
+ my $lastguy = -1;
+
+ L_checkfor:
+ for my $jj (1..$outsize)
+ {
+ my $checkstr = $outarr[$lastguy];
+
+ #remove trailing spaces for comparison
+ $checkstr =~ s/\s+$//;
+
+ my $skinny = $ini;
+ $skinny =~ s/\s+$//;
+
+ # stop when no more notices
+ last L_checkfor
+ if ($checkstr !~ m/^\s*(?:NOTICE|ERROR|HINT|DETAIL|WARNING)\:/);
+
+ # discard this line if matches a previous notice
+ if ($skinny eq $checkstr)
+ {
+ if (0) # debug code
+ {
+ $ini = "DUP: " . $ini;
+ last L_checkfor;
+ }
+ next L_bigwhile;
+ }
+ $lastguy--;
+ } # end for
+
+ } # end if pruning notices
+
+ # MPP-1492 allow:
+ # copy (select ...) to stdout
+ # \copy (select ...) to stdout
+ # and special case these guys:
+ # copy test1 to stdout
+ # \copy test1 to stdout
+ my $matches_copy_to_stdout = 0;
+ if ($ini =~ m/^(?:\\)?copy\s+(?:(?:\(select.*\))|\S+)\s+to stdout.*$/i)
+ {
+ $matches_copy_to_stdout = 1;
+ }
+
+ # Try to detect the beginning of result set, as printed by psql
+ #
+ # Examples:
+ #
+ # hdr
+ # ----------
+ #
+ # a | b
+ # ---+---
+ #
+ # The previous line should be the header. It should have a space at the
+ # beginning and end. This line should consist of dashes and plus signs,
+ # with at least three dashes for each column.
+ #
+ if (($matches_copy_to_stdout && $ini !~ m/order by/i) ||
+ (scalar(@outarr) > 1 && $outarr[-1] =~ m/^\s+.*\s$/ &&
+ $ini =~ m/^(?:(?:\-\-)(?:\-)+(?:\+(?:\-)+)*)$/))
+ # special case for copy select
+ { # sort this region
+
+ $directive->{firstline} = $outarr[-1];
+
+ if (exists($directive->{order}) ||
+ exists($directive->{mvd}))
+ {
+ $directive->{sortlines} = $outarr[-1];
+ }
+
+ # special case for copy select
+ if ($matches_copy_to_stdout)
+ {
+ $copy_to_stdout_result = 1;
+ $sql_statement = "";
+ }
+ # special case for explain
+ if (exists($directive->{explain}) &&
+ ($ini =~ m/^\s*(?:(?:\-\-)(?:\-)+(?:\+(?:\-)+)*)+\s*$/) &&
+ (scalar(@outarr) && $outarr[-1] =~ m/QUERY PLAN/))
+ {
+ # ENGINF-88: fixup explain headers
+ $outarr[-1] = "QUERY PLAN\n";
+ $ini = ("_" x length($outarr[-1])) . "\n";
+ }
+
+ $getstatement = 0;
+
+ for my $line (@outarr)
+ {
+ print $atmsort_outfh $apref, $line;
+ }
+ @outarr = ();
+
+ print $atmsort_outfh $apref, $ini;
+
+ # If there is an ORDER BY in the query, then the results must
+ # be in the order that we have memorized in the expected
+ # output. Otherwise, the order of the rows is not
+ # well-defined, so we sort them before comparing, to mask out
+ # any differences in the order.
+ #
+ # This isn't foolproof, and will get fooled by ORDER BYs in
+            # subqueries, for example. But it catches the common cases.
+ if (defined($directive->{explain}))
+ {
+ $has_order = 1; # Do not reorder EXPLAIN output
+ }
+ elsif (defined($sql_statement)
+ && length($sql_statement)
+ && !defined($directive->{order_none})
+ # multiline match
+ && ($sql_statement =~ m/select.*order.*by/is))
+ {
+ # There was an ORDER BY. But if it was part of an
+ # "agg() OVER (ORDER BY ...)" or "WITHIN GROUP (ORDER BY
+ # ...)" construct, ignore it, because those constructs
+ # don't mean that the final result has to be in order.
+ my $t = $sql_statement;
+ $t =~ s/over\s*\(order\s+by.*\)/xx/isg;
+ $t =~ s/over\s*\((partition\s+by.*)?order\s+by.*\)/xx/isg;
+ $t =~ s/window\s+\w+\s+as\s+\((partition\s+by.*)?order\s+by.*\)/xx/isg;
+ $t =~ s/within\s+group\s*\((order\s+by.*)\)/xx/isg;
+
+ if ($t =~ m/order\s+by/is)
+ {
+ $has_order = 1; # so do *not* sort output
+ }
+ else
+ {
+ $has_order = 0; # need to sort query output
+ }
+ }
+ else
+ {
+ $has_order = 0; # need to sort query output
+ }
+ $directive->{sql_statement} = $sql_statement;
+ $sql_statement = '';
+
+ $getrows = 1;
+ next;
+ } # end sort this region
+ } # end finding SQL
+
+ # if MATCH then SUBSTITUTE
+ # see HERE document for definitions
+ $ini = match_then_subs($ini);
+
+ # if MATCH then IGNORE
+ # see HERE document for definitions
+ if ( match_then_ignore($ini))
+ {
+ next; # ignore matching lines
+ }
+
+L_push_outarr:
+
+ push @outarr, $ini;
+
+ } # end big while
+
+ for my $line (@outarr)
+ {
+ print $atmsort_outfh $cpref, $line;
+ }
+} # end bigloop
+
+
+# The argument is the input filename. The output filename is returned, as it
+# is generated in this function, to avoid races around temporary filename
+# creation.
+sub run
+{
+ my $infname = shift;
+
+ open my $infh, '<', $infname or die "could not open $infname: $!";
+ my ($outfh, $outfname) = tempfile();
+
+ run_fhs($infh, $outfh);
+
+ close $infh;
+ close $outfh;
+ return $outfname;
+}
+
+# The arguments are the input and output file handles
+sub run_fhs
+{
+ my $infh = shift;
+ my $outfh = shift;
+
+
+ # loop over input file.
+ atmsort_bigloop($infh, $outfh);
+}
+
+1;
diff --git src/test/regress/expected/.gitignore src/test/regress/expected/.gitignore
index 93c56c85a0..0eb6984372 100644
--- src/test/regress/expected/.gitignore
+++ src/test/regress/expected/.gitignore
@@ -5,5 +5,6 @@
/largeobject.out
/largeobject_1.out
/misc.out
+/misc_1.out
/security_label.out
/tablespace.out
diff --git src/test/regress/expected/alter_table_1.out src/test/regress/expected/alter_table_1.out
new file mode 100644
index 0000000000..53653661e9
--- /dev/null
+++ src/test/regress/expected/alter_table_1.out
@@ -0,0 +1,4285 @@
+--
+-- ALTER_TABLE
+--
+-- Clean up in case a prior regression run failed
+SET client_min_messages TO 'warning';
+DROP ROLE IF EXISTS regress_alter_table_user1;
+RESET client_min_messages;
+CREATE USER regress_alter_table_user1;
+--
+-- add attribute
+--
+CREATE TABLE attmp (initial int4);
+COMMENT ON TABLE attmp_wrong IS 'table comment';
+ERROR: relation "attmp_wrong" does not exist
+COMMENT ON TABLE attmp IS 'table comment';
+COMMENT ON TABLE attmp IS NULL;
+ALTER TABLE attmp ADD COLUMN xmin integer; -- fails
+ERROR: column name "xmin" conflicts with a system column name
+ALTER TABLE attmp ADD COLUMN a int4 default 3;
+ALTER TABLE attmp ADD COLUMN b name;
+ALTER TABLE attmp ADD COLUMN c text;
+ALTER TABLE attmp ADD COLUMN d float8;
+ALTER TABLE attmp ADD COLUMN e float4;
+ALTER TABLE attmp ADD COLUMN f int2;
+ALTER TABLE attmp ADD COLUMN g polygon;
+ALTER TABLE attmp ADD COLUMN i char;
+ALTER TABLE attmp ADD COLUMN k int4;
+ALTER TABLE attmp ADD COLUMN l tid;
+ALTER TABLE attmp ADD COLUMN m xid;
+ALTER TABLE attmp ADD COLUMN n oidvector;
+--ALTER TABLE attmp ADD COLUMN o lock;
+ALTER TABLE attmp ADD COLUMN p boolean;
+ALTER TABLE attmp ADD COLUMN q point;
+ALTER TABLE attmp ADD COLUMN r lseg;
+ALTER TABLE attmp ADD COLUMN s path;
+ALTER TABLE attmp ADD COLUMN t box;
+ALTER TABLE attmp ADD COLUMN v timestamp;
+ALTER TABLE attmp ADD COLUMN w interval;
+ALTER TABLE attmp ADD COLUMN x float8[];
+ALTER TABLE attmp ADD COLUMN y float4[];
+ALTER TABLE attmp ADD COLUMN z int2[];
+INSERT INTO attmp (a, b, c, d, e, f, g, i, k, l, m, n, p, q, r, s, t,
+ v, w, x, y, z)
+ VALUES (4, 'name', 'text', 4.1, 4.1, 2, '(4.1,4.1,3.1,3.1)',
+ 'c',
+ 314159, '(1,1)', '512',
+ '1 2 3 4 5 6 7 8', true, '(1.1,1.1)', '(4.1,4.1,3.1,3.1)',
+ '(0,2,4.1,4.1,3.1,3.1)', '(4.1,4.1,3.1,3.1)',
+ 'epoch', '01:00:10', '{1.0,2.0,3.0,4.0}', '{1.0,2.0,3.0,4.0}', '{1,2,3,4}');
+SELECT * FROM attmp;
+ initial | a | b | c | d | e | f | g | i | k | l | m | n | p | q | r | s | t | v | w | x | y | z
+---------+---+------+------+-----+-----+---+-----------------------+---+--------+-------+-----+-----------------+---+-----------+-----------------------+-----------------------------+---------------------+--------------------------+------------------+-----------+-----------+-----------
+ | 4 | name | text | 4.1 | 4.1 | 2 | ((4.1,4.1),(3.1,3.1)) | c | 314159 | (1,1) | 512 | 1 2 3 4 5 6 7 8 | t | (1.1,1.1) | [(4.1,4.1),(3.1,3.1)] | ((0,2),(4.1,4.1),(3.1,3.1)) | (4.1,4.1),(3.1,3.1) | Thu Jan 01 00:00:00 1970 | @ 1 hour 10 secs | {1,2,3,4} | {1,2,3,4} | {1,2,3,4}
+(1 row)
+
+DROP TABLE attmp;
+-- the wolf bug - schema mods caused inconsistent row descriptors
+CREATE TABLE attmp (
+ initial int4
+);
+ALTER TABLE attmp ADD COLUMN a int4;
+ALTER TABLE attmp ADD COLUMN b name;
+ALTER TABLE attmp ADD COLUMN c text;
+ALTER TABLE attmp ADD COLUMN d float8;
+ALTER TABLE attmp ADD COLUMN e float4;
+ALTER TABLE attmp ADD COLUMN f int2;
+ALTER TABLE attmp ADD COLUMN g polygon;
+ALTER TABLE attmp ADD COLUMN i char;
+ALTER TABLE attmp ADD COLUMN k int4;
+ALTER TABLE attmp ADD COLUMN l tid;
+ALTER TABLE attmp ADD COLUMN m xid;
+ALTER TABLE attmp ADD COLUMN n oidvector;
+--ALTER TABLE attmp ADD COLUMN o lock;
+ALTER TABLE attmp ADD COLUMN p boolean;
+ALTER TABLE attmp ADD COLUMN q point;
+ALTER TABLE attmp ADD COLUMN r lseg;
+ALTER TABLE attmp ADD COLUMN s path;
+ALTER TABLE attmp ADD COLUMN t box;
+ALTER TABLE attmp ADD COLUMN v timestamp;
+ALTER TABLE attmp ADD COLUMN w interval;
+ALTER TABLE attmp ADD COLUMN x float8[];
+ALTER TABLE attmp ADD COLUMN y float4[];
+ALTER TABLE attmp ADD COLUMN z int2[];
+INSERT INTO attmp (a, b, c, d, e, f, g, i, k, l, m, n, p, q, r, s, t,
+ v, w, x, y, z)
+ VALUES (4, 'name', 'text', 4.1, 4.1, 2, '(4.1,4.1,3.1,3.1)',
+ 'c',
+ 314159, '(1,1)', '512',
+ '1 2 3 4 5 6 7 8', true, '(1.1,1.1)', '(4.1,4.1,3.1,3.1)',
+ '(0,2,4.1,4.1,3.1,3.1)', '(4.1,4.1,3.1,3.1)',
+ 'epoch', '01:00:10', '{1.0,2.0,3.0,4.0}', '{1.0,2.0,3.0,4.0}', '{1,2,3,4}');
+SELECT * FROM attmp;
+ initial | a | b | c | d | e | f | g | i | k | l | m | n | p | q | r | s | t | v | w | x | y | z
+---------+---+------+------+-----+-----+---+-----------------------+---+--------+-------+-----+-----------------+---+-----------+-----------------------+-----------------------------+---------------------+--------------------------+------------------+-----------+-----------+-----------
+ | 4 | name | text | 4.1 | 4.1 | 2 | ((4.1,4.1),(3.1,3.1)) | c | 314159 | (1,1) | 512 | 1 2 3 4 5 6 7 8 | t | (1.1,1.1) | [(4.1,4.1),(3.1,3.1)] | ((0,2),(4.1,4.1),(3.1,3.1)) | (4.1,4.1),(3.1,3.1) | Thu Jan 01 00:00:00 1970 | @ 1 hour 10 secs | {1,2,3,4} | {1,2,3,4} | {1,2,3,4}
+(1 row)
+
+CREATE INDEX attmp_idx ON attmp (a, (d + e), b);
+ALTER INDEX attmp_idx ALTER COLUMN 0 SET STATISTICS 1000;
+ERROR: column number must be in range from 1 to 32767
+LINE 1: ALTER INDEX attmp_idx ALTER COLUMN 0 SET STATISTICS 1000;
+ ^
+ALTER INDEX attmp_idx ALTER COLUMN 1 SET STATISTICS 1000;
+ERROR: cannot alter statistics on non-expression column "a" of index "attmp_idx"
+HINT: Alter statistics on table column instead.
+ALTER INDEX attmp_idx ALTER COLUMN 2 SET STATISTICS 1000;
+\d+ attmp_idx
+ Index "public.attmp_idx"
+ Column | Type | Key? | Definition | Storage | Stats target
+--------+------------------+------+------------+---------+--------------
+ a | integer | yes | a | plain |
+ expr | double precision | yes | (d + e) | plain | 1000
+ b | cstring | yes | b | plain |
+btree, for table "public.attmp"
+
+ALTER INDEX attmp_idx ALTER COLUMN 3 SET STATISTICS 1000;
+ERROR: cannot alter statistics on non-expression column "b" of index "attmp_idx"
+HINT: Alter statistics on table column instead.
+ALTER INDEX attmp_idx ALTER COLUMN 4 SET STATISTICS 1000;
+ERROR: column number 4 of relation "attmp_idx" does not exist
+ALTER INDEX attmp_idx ALTER COLUMN 2 SET STATISTICS -1;
+DROP TABLE attmp;
+--
+-- rename - check on both non-temp and temp tables
+--
+CREATE TABLE attmp (regtable int);
+CREATE TEMP TABLE attmp (attmptable int);
+ALTER TABLE attmp RENAME TO attmp_new;
+SELECT * FROM attmp;
+ regtable
+----------
+(0 rows)
+
+SELECT * FROM attmp_new;
+ attmptable
+------------
+(0 rows)
+
+ALTER TABLE attmp RENAME TO attmp_new2;
+SELECT * FROM attmp; -- should fail
+ERROR: relation "attmp" does not exist
+LINE 1: SELECT * FROM attmp;
+ ^
+SELECT * FROM attmp_new;
+ attmptable
+------------
+(0 rows)
+
+SELECT * FROM attmp_new2;
+ regtable
+----------
+(0 rows)
+
+DROP TABLE attmp_new;
+DROP TABLE attmp_new2;
+-- check rename of partitioned tables and indexes also
+CREATE TABLE part_attmp (a int primary key) partition by range (a);
+CREATE TABLE part_attmp1 PARTITION OF part_attmp FOR VALUES FROM (0) TO (100);
+ALTER INDEX part_attmp_pkey RENAME TO part_attmp_index;
+ALTER INDEX part_attmp1_pkey RENAME TO part_attmp1_index;
+ALTER TABLE part_attmp RENAME TO part_at2tmp;
+ALTER TABLE part_attmp1 RENAME TO part_at2tmp1;
+SET ROLE regress_alter_table_user1;
+ALTER INDEX part_attmp_index RENAME TO fail;
+ERROR: must be owner of index part_attmp_index
+ALTER INDEX part_attmp1_index RENAME TO fail;
+ERROR: must be owner of index part_attmp1_index
+ALTER TABLE part_at2tmp RENAME TO fail;
+ERROR: must be owner of table part_at2tmp
+ALTER TABLE part_at2tmp1 RENAME TO fail;
+ERROR: must be owner of table part_at2tmp1
+RESET ROLE;
+DROP TABLE part_at2tmp;
+--
+-- check renaming to a table's array type's autogenerated name
+-- (the array type's name should get out of the way)
+--
+CREATE TABLE attmp_array (id int);
+CREATE TABLE attmp_array2 (id int);
+SELECT typname FROM pg_type WHERE oid = 'attmp_array[]'::regtype;
+ typname
+--------------
+ _attmp_array
+(1 row)
+
+SELECT typname FROM pg_type WHERE oid = 'attmp_array2[]'::regtype;
+ typname
+---------------
+ _attmp_array2
+(1 row)
+
+ALTER TABLE attmp_array2 RENAME TO _attmp_array;
+SELECT typname FROM pg_type WHERE oid = 'attmp_array[]'::regtype;
+ typname
+---------------
+ __attmp_array
+(1 row)
+
+SELECT typname FROM pg_type WHERE oid = '_attmp_array[]'::regtype;
+ typname
+----------------
+ ___attmp_array
+(1 row)
+
+DROP TABLE _attmp_array;
+DROP TABLE attmp_array;
+-- renaming to table's own array type's name is an interesting corner case
+CREATE TABLE attmp_array (id int);
+SELECT typname FROM pg_type WHERE oid = 'attmp_array[]'::regtype;
+ typname
+--------------
+ _attmp_array
+(1 row)
+
+ALTER TABLE attmp_array RENAME TO _attmp_array;
+SELECT typname FROM pg_type WHERE oid = '_attmp_array[]'::regtype;
+ typname
+---------------
+ __attmp_array
+(1 row)
+
+DROP TABLE _attmp_array;
+-- ALTER TABLE ... RENAME on non-table relations
+-- renaming indexes (FIXME: this should probably test the index's functionality)
+ALTER INDEX IF EXISTS __onek_unique1 RENAME TO attmp_onek_unique1;
+NOTICE: relation "__onek_unique1" does not exist, skipping
+ALTER INDEX IF EXISTS __attmp_onek_unique1 RENAME TO onek_unique1;
+NOTICE: relation "__attmp_onek_unique1" does not exist, skipping
+ALTER INDEX onek_unique1 RENAME TO attmp_onek_unique1;
+ALTER INDEX attmp_onek_unique1 RENAME TO onek_unique1;
+SET ROLE regress_alter_table_user1;
+ALTER INDEX onek_unique1 RENAME TO fail; -- permission denied
+ERROR: must be owner of index onek_unique1
+RESET ROLE;
+-- renaming views
+CREATE VIEW attmp_view (unique1) AS SELECT unique1 FROM tenk1;
+ALTER TABLE attmp_view RENAME TO attmp_view_new;
+SET ROLE regress_alter_table_user1;
+ALTER VIEW attmp_view_new RENAME TO fail; -- permission denied
+ERROR: must be owner of view attmp_view_new
+RESET ROLE;
+-- hack to ensure we get an indexscan here
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+-- 5 values, sorted
+SELECT unique1 FROM tenk1 WHERE unique1 < 5;
+ unique1
+---------
+ 0
+ 1
+ 2
+ 3
+ 4
+(5 rows)
+
+reset enable_seqscan;
+reset enable_bitmapscan;
+DROP VIEW attmp_view_new;
+-- toast-like relation name
+alter table stud_emp rename to pg_toast_stud_emp;
+alter table pg_toast_stud_emp rename to stud_emp;
+-- renaming index should rename constraint as well
+ALTER TABLE onek ADD CONSTRAINT onek_unique1_constraint UNIQUE (unique1);
+ALTER INDEX onek_unique1_constraint RENAME TO onek_unique1_constraint_foo;
+ALTER TABLE onek DROP CONSTRAINT onek_unique1_constraint_foo;
+-- renaming constraint
+ALTER TABLE onek ADD CONSTRAINT onek_check_constraint CHECK (unique1 >= 0);
+ALTER TABLE onek RENAME CONSTRAINT onek_check_constraint TO onek_check_constraint_foo;
+ALTER TABLE onek DROP CONSTRAINT onek_check_constraint_foo;
+-- renaming constraint should rename index as well
+ALTER TABLE onek ADD CONSTRAINT onek_unique1_constraint UNIQUE (unique1);
+DROP INDEX onek_unique1_constraint; -- to see whether it's there
+ERROR: cannot drop index onek_unique1_constraint because constraint onek_unique1_constraint on table onek requires it
+HINT: You can drop constraint onek_unique1_constraint on table onek instead.
+ALTER TABLE onek RENAME CONSTRAINT onek_unique1_constraint TO onek_unique1_constraint_foo;
+DROP INDEX onek_unique1_constraint_foo; -- to see whether it's there
+ERROR: cannot drop index onek_unique1_constraint_foo because constraint onek_unique1_constraint_foo on table onek requires it
+HINT: You can drop constraint onek_unique1_constraint_foo on table onek instead.
+ALTER TABLE onek DROP CONSTRAINT onek_unique1_constraint_foo;
+-- renaming constraints vs. inheritance
+CREATE TABLE constraint_rename_test (a int CONSTRAINT con1 CHECK (a > 0), b int, c int);
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+Check constraints:
+ "con1" CHECK (a > 0)
+
+CREATE TABLE constraint_rename_test2 (a int CONSTRAINT con1 CHECK (a > 0), d int) INHERITS (constraint_rename_test);
+NOTICE: merging column "a" with inherited definition
+NOTICE: merging constraint "con1" with inherited definition
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+ALTER TABLE constraint_rename_test2 RENAME CONSTRAINT con1 TO con1foo; -- fail
+ERROR: cannot rename inherited constraint "con1"
+ALTER TABLE ONLY constraint_rename_test RENAME CONSTRAINT con1 TO con1foo; -- fail
+ERROR: inherited constraint "con1" must be renamed in child tables too
+ALTER TABLE constraint_rename_test RENAME CONSTRAINT con1 TO con1foo; -- ok
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+ALTER TABLE constraint_rename_test ADD CONSTRAINT con2 CHECK (b > 0) NO INHERIT;
+ALTER TABLE ONLY constraint_rename_test RENAME CONSTRAINT con2 TO con2foo; -- ok
+ALTER TABLE constraint_rename_test RENAME CONSTRAINT con2foo TO con2bar; -- ok
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+ "con2bar" CHECK (b > 0) NO INHERIT
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+ALTER TABLE constraint_rename_test ADD CONSTRAINT con3 PRIMARY KEY (a);
+ALTER TABLE constraint_rename_test RENAME CONSTRAINT con3 TO con3foo; -- ok
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | not null |
+ b | integer | | |
+ c | integer | | |
+Indexes:
+ "con3foo" PRIMARY KEY, btree (a)
+Check constraints:
+ "con1foo" CHECK (a > 0)
+ "con2bar" CHECK (b > 0) NO INHERIT
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | not null |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+DROP TABLE constraint_rename_test2;
+DROP TABLE constraint_rename_test;
+ALTER TABLE IF EXISTS constraint_not_exist RENAME CONSTRAINT con3 TO con3foo; -- ok
+NOTICE: relation "constraint_not_exist" does not exist, skipping
+ALTER TABLE IF EXISTS constraint_rename_test ADD CONSTRAINT con4 UNIQUE (a);
+NOTICE: relation "constraint_rename_test" does not exist, skipping
+-- renaming constraints with cache reset of target relation
+CREATE TABLE constraint_rename_cache (a int,
+ CONSTRAINT chk_a CHECK (a > 0),
+ PRIMARY KEY (a));
+ALTER TABLE constraint_rename_cache
+ RENAME CONSTRAINT chk_a TO chk_a_new;
+ALTER TABLE constraint_rename_cache
+ RENAME CONSTRAINT constraint_rename_cache_pkey TO constraint_rename_pkey_new;
+CREATE TABLE like_constraint_rename_cache
+ (LIKE constraint_rename_cache INCLUDING ALL);
+\d like_constraint_rename_cache
+ Table "public.like_constraint_rename_cache"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | not null |
+Indexes:
+ "like_constraint_rename_cache_pkey" PRIMARY KEY, btree (a)
+Check constraints:
+ "chk_a_new" CHECK (a > 0)
+
+DROP TABLE constraint_rename_cache;
+DROP TABLE like_constraint_rename_cache;
+-- FOREIGN KEY CONSTRAINT adding TEST
+CREATE TABLE attmp2 (a int primary key);
+CREATE TABLE attmp3 (a int, b int);
+CREATE TABLE attmp4 (a int, b int, unique(a,b));
+CREATE TABLE attmp5 (a int, b int);
+-- Insert rows into attmp2 (pktable)
+INSERT INTO attmp2 values (1);
+INSERT INTO attmp2 values (2);
+INSERT INTO attmp2 values (3);
+INSERT INTO attmp2 values (4);
+-- Insert rows into attmp3
+INSERT INTO attmp3 values (1,10);
+INSERT INTO attmp3 values (1,20);
+INSERT INTO attmp3 values (5,50);
+-- Try (and fail) to add constraint due to invalid source columns
+ALTER TABLE attmp3 add constraint attmpconstr foreign key(c) references attmp2 match full;
+ERROR: column "c" referenced in foreign key constraint does not exist
+-- Try (and fail) to add constraint due to invalid destination columns explicitly given
+ALTER TABLE attmp3 add constraint attmpconstr foreign key(a) references attmp2(b) match full;
+ERROR: column "b" referenced in foreign key constraint does not exist
+-- Try (and fail) to add constraint due to invalid data
+ALTER TABLE attmp3 add constraint attmpconstr foreign key (a) references attmp2 match full;
+ERROR: insert or update on table "attmp3" violates foreign key constraint "attmpconstr"
+DETAIL: Key (a)=(5) is not present in table "attmp2".
+-- Delete failing row
+DELETE FROM attmp3 where a=5;
+-- Try (and succeed)
+ALTER TABLE attmp3 add constraint attmpconstr foreign key (a) references attmp2 match full;
+ALTER TABLE attmp3 drop constraint attmpconstr;
+INSERT INTO attmp3 values (5,50);
+-- Try NOT VALID and then VALIDATE CONSTRAINT, but fails. Delete failure then re-validate
+ALTER TABLE attmp3 add constraint attmpconstr foreign key (a) references attmp2 match full NOT VALID;
+ALTER TABLE attmp3 validate constraint attmpconstr;
+ERROR: insert or update on table "attmp3" violates foreign key constraint "attmpconstr"
+DETAIL: Key (a)=(5) is not present in table "attmp2".
+-- Delete failing row
+DELETE FROM attmp3 where a=5;
+-- Try (and succeed) and repeat to show it works on already valid constraint
+ALTER TABLE attmp3 validate constraint attmpconstr;
+ALTER TABLE attmp3 validate constraint attmpconstr;
+-- Try a non-verified CHECK constraint
+ALTER TABLE attmp3 ADD CONSTRAINT b_greater_than_ten CHECK (b > 10); -- fail
+ERROR: check constraint "b_greater_than_ten" of relation "attmp3" is violated by some row
+ALTER TABLE attmp3 ADD CONSTRAINT b_greater_than_ten CHECK (b > 10) NOT VALID; -- succeeds
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- fails
+ERROR: check constraint "b_greater_than_ten" of relation "attmp3" is violated by some row
+DELETE FROM attmp3 WHERE NOT b > 10;
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- succeeds
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- succeeds
+-- Test inherited NOT VALID CHECK constraints
+select * from attmp3;
+ a | b
+---+----
+ 1 | 20
+(1 row)
+
+CREATE TABLE attmp6 () INHERITS (attmp3);
+CREATE TABLE attmp7 () INHERITS (attmp3);
+INSERT INTO attmp6 VALUES (6, 30), (7, 16);
+ALTER TABLE attmp3 ADD CONSTRAINT b_le_20 CHECK (b <= 20) NOT VALID;
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_le_20; -- fails
+ERROR: check constraint "b_le_20" of relation "attmp6" is violated by some row
+DELETE FROM attmp6 WHERE b > 20;
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_le_20; -- succeeds
+-- An already validated constraint must not be revalidated
+CREATE FUNCTION boo(int) RETURNS int IMMUTABLE STRICT LANGUAGE plpgsql AS $$ BEGIN RAISE NOTICE 'boo: %', $1; RETURN $1; END; $$;
+INSERT INTO attmp7 VALUES (8, 18);
+ALTER TABLE attmp7 ADD CONSTRAINT identity CHECK (b = boo(b));
+NOTICE: boo: 18
+ALTER TABLE attmp3 ADD CONSTRAINT IDENTITY check (b = boo(b)) NOT VALID;
+NOTICE: merging constraint "identity" with inherited definition
+ALTER TABLE attmp3 VALIDATE CONSTRAINT identity;
+NOTICE: boo: 16
+NOTICE: boo: 20
+-- A NO INHERIT constraint should not be looked for in children during VALIDATE CONSTRAINT
+create table parent_noinh_convalid (a int);
+create table child_noinh_convalid () inherits (parent_noinh_convalid);
+insert into parent_noinh_convalid values (1);
+insert into child_noinh_convalid values (1);
+alter table parent_noinh_convalid add constraint check_a_is_2 check (a = 2) no inherit not valid;
+-- fail, because of the row in parent
+alter table parent_noinh_convalid validate constraint check_a_is_2;
+ERROR: check constraint "check_a_is_2" of relation "parent_noinh_convalid" is violated by some row
+delete from only parent_noinh_convalid;
+-- ok (parent itself contains no violating rows)
+alter table parent_noinh_convalid validate constraint check_a_is_2;
+select convalidated from pg_constraint where conrelid = 'parent_noinh_convalid'::regclass and conname = 'check_a_is_2';
+ convalidated
+--------------
+ t
+(1 row)
+
+-- cleanup
+drop table parent_noinh_convalid, child_noinh_convalid;
+-- Try (and fail) to create constraint from attmp5(a) to attmp4(a) - unique constraint on
+-- attmp4 is a,b
+ALTER TABLE attmp5 add constraint attmpconstr foreign key(a) references attmp4(a) match full;
+ERROR: there is no unique constraint matching given keys for referenced table "attmp4"
+DROP TABLE attmp7;
+DROP TABLE attmp6;
+DROP TABLE attmp5;
+DROP TABLE attmp4;
+DROP TABLE attmp3;
+DROP TABLE attmp2;
+-- NOT VALID with plan invalidation -- ensure we don't use a constraint for
+-- exclusion until validated
+set constraint_exclusion TO 'partition';
+create table nv_parent (d date, check (false) no inherit not valid);
+-- not valid constraint added at creation time should automatically become valid
+\d nv_parent
+ Table "public.nv_parent"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ d | date | | |
+Check constraints:
+ "nv_parent_check" CHECK (false) NO INHERIT
+
+create table nv_child_2010 () inherits (nv_parent);
+create table nv_child_2011 () inherits (nv_parent);
+alter table nv_child_2010 add check (d between '2010-01-01'::date and '2010-12-31'::date) not valid;
+alter table nv_child_2011 add check (d between '2011-01-01'::date and '2011-12-31'::date) not valid;
+explain (costs off) select * from nv_parent where d between '2011-08-01' and '2011-08-31';
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2010 nv_parent_1
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2011 nv_parent_2
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+(7 rows)
+
+create table nv_child_2009 (check (d between '2009-01-01'::date and '2009-12-31'::date)) inherits (nv_parent);
+explain (costs off) select * from nv_parent where d between '2011-08-01'::date and '2011-08-31'::date;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2010 nv_parent_1
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2011 nv_parent_2
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+(7 rows)
+
+explain (costs off) select * from nv_parent where d between '2009-08-01'::date and '2009-08-31'::date;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2010 nv_parent_1
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2011 nv_parent_2
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2009 nv_parent_3
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+(9 rows)
+
+-- after validation, the constraint should be used
+alter table nv_child_2011 VALIDATE CONSTRAINT nv_child_2011_d_check;
+explain (costs off) select * from nv_parent where d between '2009-08-01'::date and '2009-08-31'::date;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2010 nv_parent_1
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2009 nv_parent_2
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+(7 rows)
+
+-- add an inherited NOT VALID constraint
+alter table nv_parent add check (d between '2001-01-01'::date and '2099-12-31'::date) not valid;
+\d nv_child_2009
+ Table "public.nv_child_2009"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ d | date | | |
+Check constraints:
+ "nv_child_2009_d_check" CHECK (d >= '01-01-2009'::date AND d <= '12-31-2009'::date)
+ "nv_parent_d_check" CHECK (d >= '01-01-2001'::date AND d <= '12-31-2099'::date) NOT VALID
+Inherits: nv_parent
+
+-- we leave nv_parent and children around to help test pg_dump logic
+-- Foreign key adding test with mixed types
+-- Note: these tables are TEMP to avoid name conflicts when this test
+-- is run in parallel with foreign_key.sql.
+CREATE TEMP TABLE PKTABLE (ptest1 int PRIMARY KEY);
+INSERT INTO PKTABLE VALUES(42);
+CREATE TEMP TABLE FKTABLE (ftest1 inet);
+-- This next should fail, because int=inet does not exist
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+ERROR: foreign key constraint "fktable_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: inet and integer.
+-- This should also fail for the same reason, but here we
+-- give the column name
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable(ptest1);
+ERROR: foreign key constraint "fktable_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: inet and integer.
+DROP TABLE FKTABLE;
+-- This should succeed, even though they are different types,
+-- because int=int8 exists and is a member of the integer opfamily
+CREATE TEMP TABLE FKTABLE (ftest1 int8);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+-- Check it actually works
+INSERT INTO FKTABLE VALUES(42); -- should succeed
+INSERT INTO FKTABLE VALUES(43); -- should fail
+ERROR: insert or update on table "fktable" violates foreign key constraint "fktable_ftest1_fkey"
+DETAIL: Key (ftest1)=(43) is not present in table "pktable".
+DROP TABLE FKTABLE;
+-- This should fail, because we'd have to cast numeric to int which is
+-- not an implicit coercion (or use numeric=numeric, but that's not part
+-- of the integer opfamily)
+CREATE TEMP TABLE FKTABLE (ftest1 numeric);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+ERROR: foreign key constraint "fktable_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: numeric and integer.
+DROP TABLE FKTABLE;
+DROP TABLE PKTABLE;
+-- On the other hand, this should work because int implicitly promotes to
+-- numeric, and we allow promotion on the FK side
+CREATE TEMP TABLE PKTABLE (ptest1 numeric PRIMARY KEY);
+INSERT INTO PKTABLE VALUES(42);
+CREATE TEMP TABLE FKTABLE (ftest1 int);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+-- Check it actually works
+INSERT INTO FKTABLE VALUES(42); -- should succeed
+INSERT INTO FKTABLE VALUES(43); -- should fail
+ERROR: insert or update on table "fktable" violates foreign key constraint "fktable_ftest1_fkey"
+DETAIL: Key (ftest1)=(43) is not present in table "pktable".
+DROP TABLE FKTABLE;
+DROP TABLE PKTABLE;
+CREATE TEMP TABLE PKTABLE (ptest1 int, ptest2 inet,
+ PRIMARY KEY(ptest1, ptest2));
+-- This should fail, because we just chose really odd types
+CREATE TEMP TABLE FKTABLE (ftest1 cidr, ftest2 timestamp);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2) references pktable;
+ERROR: foreign key constraint "fktable_ftest1_ftest2_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: cidr and integer.
+DROP TABLE FKTABLE;
+-- Again, so should this...
+CREATE TEMP TABLE FKTABLE (ftest1 cidr, ftest2 timestamp);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2)
+ references pktable(ptest1, ptest2);
+ERROR: foreign key constraint "fktable_ftest1_ftest2_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: cidr and integer.
+DROP TABLE FKTABLE;
+-- This fails because we mixed up the column ordering
+CREATE TEMP TABLE FKTABLE (ftest1 int, ftest2 inet);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2)
+ references pktable(ptest2, ptest1);
+ERROR: foreign key constraint "fktable_ftest1_ftest2_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest2" are of incompatible types: integer and inet.
+-- As does this...
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest2, ftest1)
+ references pktable(ptest1, ptest2);
+ERROR: foreign key constraint "fktable_ftest2_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest2" and "ptest1" are of incompatible types: inet and integer.
+DROP TABLE FKTABLE;
+DROP TABLE PKTABLE;
+-- Test that ALTER CONSTRAINT updates trigger deferrability properly
+CREATE TEMP TABLE PKTABLE (ptest1 int primary key);
+CREATE TEMP TABLE FKTABLE (ftest1 int);
+ALTER TABLE FKTABLE ADD CONSTRAINT fknd FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdd FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE INITIALLY DEFERRED;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdi FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE INITIALLY IMMEDIATE;
+ALTER TABLE FKTABLE ADD CONSTRAINT fknd2 FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE INITIALLY DEFERRED;
+ALTER TABLE FKTABLE ALTER CONSTRAINT fknd2 NOT DEFERRABLE;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdd2 FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE;
+ALTER TABLE FKTABLE ALTER CONSTRAINT fkdd2 DEFERRABLE INITIALLY DEFERRED;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdi2 FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE;
+ALTER TABLE FKTABLE ALTER CONSTRAINT fkdi2 DEFERRABLE INITIALLY IMMEDIATE;
+SELECT conname, tgfoid::regproc, tgtype, tgdeferrable, tginitdeferred
+FROM pg_trigger JOIN pg_constraint con ON con.oid = tgconstraint
+WHERE tgrelid = 'pktable'::regclass
+ORDER BY 1,2,3;
+ conname | tgfoid | tgtype | tgdeferrable | tginitdeferred
+---------+------------------------+--------+--------------+----------------
+ fkdd | "RI_FKey_cascade_del" | 9 | f | f
+ fkdd | "RI_FKey_noaction_upd" | 17 | t | t
+ fkdd2 | "RI_FKey_cascade_del" | 9 | f | f
+ fkdd2 | "RI_FKey_noaction_upd" | 17 | t | t
+ fkdi | "RI_FKey_cascade_del" | 9 | f | f
+ fkdi | "RI_FKey_noaction_upd" | 17 | t | f
+ fkdi2 | "RI_FKey_cascade_del" | 9 | f | f
+ fkdi2 | "RI_FKey_noaction_upd" | 17 | t | f
+ fknd | "RI_FKey_cascade_del" | 9 | f | f
+ fknd | "RI_FKey_noaction_upd" | 17 | f | f
+ fknd2 | "RI_FKey_cascade_del" | 9 | f | f
+ fknd2 | "RI_FKey_noaction_upd" | 17 | f | f
+(12 rows)
+
+SELECT conname, tgfoid::regproc, tgtype, tgdeferrable, tginitdeferred
+FROM pg_trigger JOIN pg_constraint con ON con.oid = tgconstraint
+WHERE tgrelid = 'fktable'::regclass
+ORDER BY 1,2,3;
+ conname | tgfoid | tgtype | tgdeferrable | tginitdeferred
+---------+---------------------+--------+--------------+----------------
+ fkdd | "RI_FKey_check_ins" | 5 | t | t
+ fkdd | "RI_FKey_check_upd" | 17 | t | t
+ fkdd2 | "RI_FKey_check_ins" | 5 | t | t
+ fkdd2 | "RI_FKey_check_upd" | 17 | t | t
+ fkdi | "RI_FKey_check_ins" | 5 | t | f
+ fkdi | "RI_FKey_check_upd" | 17 | t | f
+ fkdi2 | "RI_FKey_check_ins" | 5 | t | f
+ fkdi2 | "RI_FKey_check_upd" | 17 | t | f
+ fknd | "RI_FKey_check_ins" | 5 | f | f
+ fknd | "RI_FKey_check_upd" | 17 | f | f
+ fknd2 | "RI_FKey_check_ins" | 5 | f | f
+ fknd2 | "RI_FKey_check_upd" | 17 | f | f
+(12 rows)
+
+-- temp tables should go away by themselves, need not drop them.
+-- test check constraint adding
+create table atacc1 ( test int );
+-- add a check constraint
+alter table atacc1 add constraint atacc_test1 check (test>3);
+-- should fail
+insert into atacc1 (test) values (2);
+ERROR: new row for relation "atacc1" violates check constraint "atacc_test1"
+DETAIL: Failing row contains (2).
+-- should succeed
+insert into atacc1 (test) values (4);
+drop table atacc1;
+-- let's do one where the check fails when added
+create table atacc1 ( test int );
+-- insert a soon to be failing row
+insert into atacc1 (test) values (2);
+-- add a check constraint (fails)
+alter table atacc1 add constraint atacc_test1 check (test>3);
+ERROR: check constraint "atacc_test1" of relation "atacc1" is violated by some row
+insert into atacc1 (test) values (4);
+drop table atacc1;
+-- let's do one where the check fails because the column doesn't exist
+create table atacc1 ( test int );
+-- add a check constraint (fails)
+alter table atacc1 add constraint atacc_test1 check (test1>3);
+ERROR: column "test1" does not exist
+HINT: Perhaps you meant to reference the column "atacc1.test".
+drop table atacc1;
+-- something a little more complicated
+create table atacc1 ( test int, test2 int, test3 int);
+-- add a check constraint (fails)
+alter table atacc1 add constraint atacc_test1 check (test+test2<test3*4);
+-- should fail
+insert into atacc1 (test,test2,test3) values (4,4,2);
+ERROR: new row for relation "atacc1" violates check constraint "atacc_test1"
+DETAIL: Failing row contains (4, 4, 2).
+-- should succeed
+insert into atacc1 (test,test2,test3) values (4,4,5);
+drop table atacc1;
+-- lets do some naming tests
+create table atacc1 (test int check (test>3), test2 int);
+alter table atacc1 add check (test2>test);
+-- should fail for $2
+insert into atacc1 (test2, test) values (3, 4);
+ERROR: new row for relation "atacc1" violates check constraint "atacc1_check"
+DETAIL: Failing row contains (4, 3).
+drop table atacc1;
+-- inheritance related tests
+create table atacc1 (test int);
+create table atacc2 (test2 int);
+create table atacc3 (test3 int) inherits (atacc1, atacc2);
+alter table atacc2 add constraint foo check (test2>0);
+-- fail and then succeed on atacc2
+insert into atacc2 (test2) values (-3);
+ERROR: new row for relation "atacc2" violates check constraint "foo"
+DETAIL: Failing row contains (-3).
+insert into atacc2 (test2) values (3);
+-- fail and then succeed on atacc3
+insert into atacc3 (test2) values (-3);
+ERROR: new row for relation "atacc3" violates check constraint "foo"
+DETAIL: Failing row contains (null, -3, null).
+insert into atacc3 (test2) values (3);
+drop table atacc3;
+drop table atacc2;
+drop table atacc1;
+-- same things with one created with INHERIT
+create table atacc1 (test int);
+create table atacc2 (test2 int);
+create table atacc3 (test3 int) inherits (atacc1, atacc2);
+alter table atacc3 no inherit atacc2;
+-- fail
+alter table atacc3 no inherit atacc2;
+ERROR: relation "atacc2" is not a parent of relation "atacc3"
+-- make sure it really isn't a child
+insert into atacc3 (test2) values (3);
+select test2 from atacc2;
+ test2
+-------
+(0 rows)
+
+-- fail due to missing constraint
+alter table atacc2 add constraint foo check (test2>0);
+alter table atacc3 inherit atacc2;
+ERROR: child table is missing constraint "foo"
+-- fail due to missing column
+alter table atacc3 rename test2 to testx;
+alter table atacc3 inherit atacc2;
+ERROR: child table is missing column "test2"
+-- fail due to mismatched data type
+alter table atacc3 add test2 bool;
+alter table atacc3 inherit atacc2;
+ERROR: child table "atacc3" has different type for column "test2"
+alter table atacc3 drop test2;
+-- succeed
+alter table atacc3 add test2 int;
+update atacc3 set test2 = 4 where test2 is null;
+alter table atacc3 add constraint foo check (test2>0);
+alter table atacc3 inherit atacc2;
+-- fail due to duplicates and circular inheritance
+alter table atacc3 inherit atacc2;
+ERROR: relation "atacc2" would be inherited from more than once
+alter table atacc2 inherit atacc3;
+ERROR: circular inheritance not allowed
+DETAIL: "atacc3" is already a child of "atacc2".
+alter table atacc2 inherit atacc2;
+ERROR: circular inheritance not allowed
+DETAIL: "atacc2" is already a child of "atacc2".
+-- test that we really are a child now (should see 4 not 3 and cascade should go through)
+select test2 from atacc2;
+ test2
+-------
+ 4
+(1 row)
+
+drop table atacc2 cascade;
+NOTICE: drop cascades to table atacc3
+drop table atacc1;
+-- adding only to a parent is allowed as of 9.2
+create table atacc1 (test int);
+create table atacc2 (test2 int) inherits (atacc1);
+-- ok:
+alter table atacc1 add constraint foo check (test>0) no inherit;
+-- check constraint is not there on child
+insert into atacc2 (test) values (-3);
+-- check constraint is there on parent
+insert into atacc1 (test) values (-3);
+ERROR: new row for relation "atacc1" violates check constraint "foo"
+DETAIL: Failing row contains (-3).
+insert into atacc1 (test) values (3);
+-- fail, violating row:
+alter table atacc2 add constraint foo check (test>0) no inherit;
+ERROR: check constraint "foo" of relation "atacc2" is violated by some row
+drop table atacc2;
+drop table atacc1;
+-- test unique constraint adding
+create table atacc1 ( test int ) ;
+-- add a unique constraint
+alter table atacc1 add constraint atacc_test1 unique (test);
+-- insert first value
+insert into atacc1 (test) values (2);
+-- should fail
+insert into atacc1 (test) values (2);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test)=(2) already exists.
+-- should succeed
+insert into atacc1 (test) values (4);
+-- try to create duplicates via alter table using - should fail
+alter table atacc1 alter column test type integer using 0;
+ERROR: could not create unique index "atacc_test1"
+DETAIL: Key (test)=(0) is duplicated.
+drop table atacc1;
+-- let's do one where the unique constraint fails when added
+create table atacc1 ( test int );
+-- insert soon to be failing rows
+insert into atacc1 (test) values (2);
+insert into atacc1 (test) values (2);
+-- add a unique constraint (fails)
+alter table atacc1 add constraint atacc_test1 unique (test);
+ERROR: could not create unique index "atacc_test1"
+DETAIL: Key (test)=(2) is duplicated.
+insert into atacc1 (test) values (3);
+drop table atacc1;
+-- let's do one where the unique constraint fails
+-- because the column doesn't exist
+create table atacc1 ( test int );
+-- add a unique constraint (fails)
+alter table atacc1 add constraint atacc_test1 unique (test1);
+ERROR: column "test1" named in key does not exist
+drop table atacc1;
+-- something a little more complicated
+create table atacc1 ( test int, test2 int);
+-- add a unique constraint
+alter table atacc1 add constraint atacc_test1 unique (test, test2);
+-- insert initial value
+insert into atacc1 (test,test2) values (4,4);
+-- should fail
+insert into atacc1 (test,test2) values (4,4);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test, test2)=(4, 4) already exists.
+-- should all succeed
+insert into atacc1 (test,test2) values (4,5);
+insert into atacc1 (test,test2) values (5,4);
+insert into atacc1 (test,test2) values (5,5);
+drop table atacc1;
+-- lets do some naming tests
+create table atacc1 (test int, test2 int, unique(test));
+alter table atacc1 add unique (test2);
+-- should fail for @@ second one @@
+insert into atacc1 (test2, test) values (3, 3);
+insert into atacc1 (test2, test) values (2, 3);
+ERROR: duplicate key value violates unique constraint "atacc1_test_key"
+DETAIL: Key (test)=(3) already exists.
+drop table atacc1;
+-- test primary key constraint adding
+create table atacc1 ( id serial, test int) ;
+-- add a primary key constraint
+alter table atacc1 add constraint atacc_test1 primary key (test);
+-- insert first value
+insert into atacc1 (test) values (2);
+-- should fail
+insert into atacc1 (test) values (2);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test)=(2) already exists.
+-- should succeed
+insert into atacc1 (test) values (4);
+-- inserting NULL should fail
+insert into atacc1 (test) values(NULL);
+ERROR: null value in column "test" of relation "atacc1" violates not-null constraint
+DETAIL: Failing row contains (4, null).
+-- try adding a second primary key (should fail)
+alter table atacc1 add constraint atacc_oid1 primary key(id);
+ERROR: multiple primary keys for table "atacc1" are not allowed
+-- drop first primary key constraint
+alter table atacc1 drop constraint atacc_test1 restrict;
+-- try adding a primary key on oid (should succeed)
+alter table atacc1 add constraint atacc_oid1 primary key(id);
+drop table atacc1;
+-- let's do one where the primary key constraint fails when added
+create table atacc1 ( test int );
+-- insert soon to be failing rows
+insert into atacc1 (test) values (2);
+insert into atacc1 (test) values (2);
+-- add a primary key (fails)
+alter table atacc1 add constraint atacc_test1 primary key (test);
+ERROR: could not create unique index "atacc_test1"
+DETAIL: Key (test)=(2) is duplicated.
+insert into atacc1 (test) values (3);
+drop table atacc1;
+-- let's do another one where the primary key constraint fails when added
+create table atacc1 ( test int );
+-- insert soon to be failing row
+insert into atacc1 (test) values (NULL);
+-- add a primary key (fails)
+alter table atacc1 add constraint atacc_test1 primary key (test);
+ERROR: column "test" of relation "atacc1" contains null values
+insert into atacc1 (test) values (3);
+drop table atacc1;
+-- let's do one where the primary key constraint fails
+-- because the column doesn't exist
+create table atacc1 ( test int );
+-- add a primary key constraint (fails)
+alter table atacc1 add constraint atacc_test1 primary key (test1);
+ERROR: column "test1" of relation "atacc1" does not exist
+drop table atacc1;
+-- adding a new column as primary key to a non-empty table.
+-- should fail unless the column has a non-null default value.
+create table atacc1 ( test int );
+insert into atacc1 (test) values (0);
+-- add a primary key column without a default (fails).
+alter table atacc1 add column test2 int primary key;
+ERROR: column "test2" of relation "atacc1" contains null values
+-- now add a primary key column with a default (succeeds).
+alter table atacc1 add column test2 int default 0 primary key;
+drop table atacc1;
+-- this combination used to have order-of-execution problems (bug #15580)
+create table atacc1 (a int);
+insert into atacc1 values(1);
+alter table atacc1
+ add column b float8 not null default random(),
+ add primary key(a);
+drop table atacc1;
+-- something a little more complicated
+create table atacc1 ( test int, test2 int);
+-- add a primary key constraint
+alter table atacc1 add constraint atacc_test1 primary key (test, test2);
+-- try adding a second primary key - should fail
+alter table atacc1 add constraint atacc_test2 primary key (test);
+ERROR: multiple primary keys for table "atacc1" are not allowed
+-- insert initial value
+insert into atacc1 (test,test2) values (4,4);
+-- should fail
+insert into atacc1 (test,test2) values (4,4);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test, test2)=(4, 4) already exists.
+insert into atacc1 (test,test2) values (NULL,3);
+ERROR: null value in column "test" of relation "atacc1" violates not-null constraint
+DETAIL: Failing row contains (null, 3).
+insert into atacc1 (test,test2) values (3, NULL);
+ERROR: null value in column "test2" of relation "atacc1" violates not-null constraint
+DETAIL: Failing row contains (3, null).
+insert into atacc1 (test,test2) values (NULL,NULL);
+ERROR: null value in column "test" of relation "atacc1" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+-- should all succeed
+insert into atacc1 (test,test2) values (4,5);
+insert into atacc1 (test,test2) values (5,4);
+insert into atacc1 (test,test2) values (5,5);
+drop table atacc1;
+-- lets do some naming tests
+create table atacc1 (test int, test2 int, primary key(test));
+-- only first should succeed
+insert into atacc1 (test2, test) values (3, 3);
+insert into atacc1 (test2, test) values (2, 3);
+ERROR: duplicate key value violates unique constraint "atacc1_pkey"
+DETAIL: Key (test)=(3) already exists.
+insert into atacc1 (test2, test) values (1, NULL);
+ERROR: null value in column "test" of relation "atacc1" violates not-null constraint
+DETAIL: Failing row contains (null, 1).
+drop table atacc1;
+-- alter table / alter column [set/drop] not null tests
+-- try altering system catalogs, should fail
+alter table pg_class alter column relname drop not null;
+ERROR: permission denied: "pg_class" is a system catalog
+alter table pg_class alter relname set not null;
+ERROR: permission denied: "pg_class" is a system catalog
+-- try altering non-existent table, should fail
+alter table non_existent alter column bar set not null;
+ERROR: relation "non_existent" does not exist
+alter table non_existent alter column bar drop not null;
+ERROR: relation "non_existent" does not exist
+-- test setting columns to null and not null and vice versa
+-- test checking for null values and primary key
+create table atacc1 (test int not null);
+alter table atacc1 add constraint "atacc1_pkey" primary key (test);
+alter table atacc1 alter column test drop not null;
+ERROR: column "test" is in a primary key
+alter table atacc1 drop constraint "atacc1_pkey";
+alter table atacc1 alter column test drop not null;
+insert into atacc1 values (null);
+alter table atacc1 alter test set not null;
+ERROR: column "test" of relation "atacc1" contains null values
+delete from atacc1;
+alter table atacc1 alter test set not null;
+-- try altering a non-existent column, should fail
+alter table atacc1 alter bar set not null;
+ERROR: column "bar" of relation "atacc1" does not exist
+alter table atacc1 alter bar drop not null;
+ERROR: column "bar" of relation "atacc1" does not exist
+-- try creating a view and altering that, should fail
+create view myview as select * from atacc1;
+alter table myview alter column test drop not null;
+ERROR: "myview" is not a table or foreign table
+alter table myview alter column test set not null;
+ERROR: "myview" is not a table or foreign table
+drop view myview;
+drop table atacc1;
+-- set not null verified by constraints
+create table atacc1 (test_a int, test_b int);
+insert into atacc1 values (null, 1);
+-- constraint not cover all values, should fail
+alter table atacc1 add constraint atacc1_constr_or check(test_a is not null or test_b < 10);
+alter table atacc1 alter test_a set not null;
+ERROR: column "test_a" of relation "atacc1" contains null values
+alter table atacc1 drop constraint atacc1_constr_or;
+-- not valid constraint, should fail
+alter table atacc1 add constraint atacc1_constr_invalid check(test_a is not null) not valid;
+alter table atacc1 alter test_a set not null;
+ERROR: column "test_a" of relation "atacc1" contains null values
+alter table atacc1 drop constraint atacc1_constr_invalid;
+-- with valid constraint
+update atacc1 set test_a = 1;
+alter table atacc1 add constraint atacc1_constr_a_valid check(test_a is not null);
+alter table atacc1 alter test_a set not null;
+delete from atacc1;
+insert into atacc1 values (2, null);
+alter table atacc1 alter test_a drop not null;
+-- test multiple set not null at same time
+-- test_a checked by atacc1_constr_a_valid, test_b should fail by table scan
+alter table atacc1 alter test_a set not null, alter test_b set not null;
+ERROR: column "test_b" of relation "atacc1" contains null values
+-- commands order has no importance
+alter table atacc1 alter test_b set not null, alter test_a set not null;
+ERROR: column "test_b" of relation "atacc1" contains null values
+-- valid one by table scan, one by check constraints
+update atacc1 set test_b = 1;
+alter table atacc1 alter test_b set not null, alter test_a set not null;
+alter table atacc1 alter test_a drop not null, alter test_b drop not null;
+-- both column has check constraints
+alter table atacc1 add constraint atacc1_constr_b_valid check(test_b is not null);
+alter table atacc1 alter test_b set not null, alter test_a set not null;
+drop table atacc1;
+-- test inheritance
+create table parent (a int);
+create table child (b varchar(255)) inherits (parent);
+alter table parent alter a set not null;
+insert into parent values (NULL);
+ERROR: null value in column "a" of relation "parent" violates not-null constraint
+DETAIL: Failing row contains (null).
+insert into child (a, b) values (NULL, 'foo');
+ERROR: null value in column "a" of relation "child" violates not-null constraint
+DETAIL: Failing row contains (null, foo).
+alter table parent alter a drop not null;
+insert into parent values (NULL);
+insert into child (a, b) values (NULL, 'foo');
+alter table only parent alter a set not null;
+ERROR: column "a" of relation "parent" contains null values
+alter table child alter a set not null;
+ERROR: column "a" of relation "child" contains null values
+delete from parent;
+alter table only parent alter a set not null;
+insert into parent values (NULL);
+ERROR: null value in column "a" of relation "parent" violates not-null constraint
+DETAIL: Failing row contains (null).
+alter table child alter a set not null;
+insert into child (a, b) values (NULL, 'foo');
+ERROR: null value in column "a" of relation "child" violates not-null constraint
+DETAIL: Failing row contains (null, foo).
+delete from child;
+alter table child alter a set not null;
+insert into child (a, b) values (NULL, 'foo');
+ERROR: null value in column "a" of relation "child" violates not-null constraint
+DETAIL: Failing row contains (null, foo).
+drop table child;
+drop table parent;
+-- test setting and removing default values
+create table def_test (
+ c1 int4 default 5,
+ c2 text default 'initial_default'
+);
+insert into def_test default values;
+alter table def_test alter column c1 drop default;
+insert into def_test default values;
+alter table def_test alter column c2 drop default;
+insert into def_test default values;
+alter table def_test alter column c1 set default 10;
+alter table def_test alter column c2 set default 'new_default';
+insert into def_test default values;
+select * from def_test;
+ c1 | c2
+----+-----------------
+ 5 | initial_default
+ | initial_default
+ |
+ 10 | new_default
+(4 rows)
+
+-- set defaults to an incorrect type: this should fail
+alter table def_test alter column c1 set default 'wrong_datatype';
+ERROR: invalid input syntax for type integer: "wrong_datatype"
+alter table def_test alter column c2 set default 20;
+-- set defaults on a non-existent column: this should fail
+alter table def_test alter column c3 set default 30;
+ERROR: column "c3" of relation "def_test" does not exist
+-- set defaults on views: we need to create a view, add a rule
+-- to allow insertions into it, and then alter the view to add
+-- a default
+create view def_view_test as select * from def_test;
+create rule def_view_test_ins as
+ on insert to def_view_test
+ do instead insert into def_test select new.*;
+insert into def_view_test default values;
+alter table def_view_test alter column c1 set default 45;
+insert into def_view_test default values;
+alter table def_view_test alter column c2 set default 'view_default';
+insert into def_view_test default values;
+select * from def_view_test;
+ c1 | c2
+----+-----------------
+ 5 | initial_default
+ | initial_default
+ |
+ 10 | new_default
+ |
+ 45 |
+ 45 | view_default
+(7 rows)
+
+drop rule def_view_test_ins on def_view_test;
+drop view def_view_test;
+drop table def_test;
+-- alter table / drop column tests
+-- try altering system catalogs, should fail
+alter table pg_class drop column relname;
+ERROR: permission denied: "pg_class" is a system catalog
+-- try altering non-existent table, should fail
+alter table nosuchtable drop column bar;
+ERROR: relation "nosuchtable" does not exist
+-- test dropping columns
+create table atacc1 (a int4 not null, b int4, c int4 not null, d int4);
+insert into atacc1 values (1, 2, 3, 4);
+alter table atacc1 drop a;
+alter table atacc1 drop a;
+ERROR: column "a" of relation "atacc1" does not exist
+-- SELECTs
+select * from atacc1;
+ b | c | d
+---+---+---
+ 2 | 3 | 4
+(1 row)
+
+select * from atacc1 order by a;
+ERROR: column "a" does not exist
+LINE 1: select * from atacc1 order by a;
+ ^
+select * from atacc1 order by "........pg.dropped.1........";
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select * from atacc1 order by "........pg.dropped.1........"...
+ ^
+select * from atacc1 group by a;
+ERROR: column "a" does not exist
+LINE 1: select * from atacc1 group by a;
+ ^
+select * from atacc1 group by "........pg.dropped.1........";
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select * from atacc1 group by "........pg.dropped.1........"...
+ ^
+select atacc1.* from atacc1;
+ b | c | d
+---+---+---
+ 2 | 3 | 4
+(1 row)
+
+select a from atacc1;
+ERROR: column "a" does not exist
+LINE 1: select a from atacc1;
+ ^
+select atacc1.a from atacc1;
+ERROR: column atacc1.a does not exist
+LINE 1: select atacc1.a from atacc1;
+ ^
+select b,c,d from atacc1;
+ b | c | d
+---+---+---
+ 2 | 3 | 4
+(1 row)
+
+select a,b,c,d from atacc1;
+ERROR: column "a" does not exist
+LINE 1: select a,b,c,d from atacc1;
+ ^
+select * from atacc1 where a = 1;
+ERROR: column "a" does not exist
+LINE 1: select * from atacc1 where a = 1;
+ ^
+select "........pg.dropped.1........" from atacc1;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select "........pg.dropped.1........" from atacc1;
+ ^
+select atacc1."........pg.dropped.1........" from atacc1;
+ERROR: column atacc1.........pg.dropped.1........ does not exist
+LINE 1: select atacc1."........pg.dropped.1........" from atacc1;
+ ^
+select "........pg.dropped.1........",b,c,d from atacc1;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select "........pg.dropped.1........",b,c,d from atacc1;
+ ^
+select * from atacc1 where "........pg.dropped.1........" = 1;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select * from atacc1 where "........pg.dropped.1........" = ...
+ ^
+-- UPDATEs
+update atacc1 set a = 3;
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: update atacc1 set a = 3;
+ ^
+update atacc1 set b = 2 where a = 3;
+ERROR: column "a" does not exist
+LINE 1: update atacc1 set b = 2 where a = 3;
+ ^
+update atacc1 set "........pg.dropped.1........" = 3;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: update atacc1 set "........pg.dropped.1........" = 3;
+ ^
+update atacc1 set b = 2 where "........pg.dropped.1........" = 3;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: update atacc1 set b = 2 where "........pg.dropped.1........"...
+ ^
+-- INSERTs
+insert into atacc1 values (10, 11, 12, 13);
+ERROR: INSERT has more expressions than target columns
+LINE 1: insert into atacc1 values (10, 11, 12, 13);
+ ^
+insert into atacc1 values (default, 11, 12, 13);
+ERROR: INSERT has more expressions than target columns
+LINE 1: insert into atacc1 values (default, 11, 12, 13);
+ ^
+insert into atacc1 values (11, 12, 13);
+insert into atacc1 (a) values (10);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a) values (10);
+ ^
+insert into atacc1 (a) values (default);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a) values (default);
+ ^
+insert into atacc1 (a,b,c,d) values (10,11,12,13);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a,b,c,d) values (10,11,12,13);
+ ^
+insert into atacc1 (a,b,c,d) values (default,11,12,13);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a,b,c,d) values (default,11,12,13);
+ ^
+insert into atacc1 (b,c,d) values (11,12,13);
+insert into atacc1 ("........pg.dropped.1........") values (10);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........") values (...
+ ^
+insert into atacc1 ("........pg.dropped.1........") values (default);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........") values (...
+ ^
+insert into atacc1 ("........pg.dropped.1........",b,c,d) values (10,11,12,13);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........",b,c,d) va...
+ ^
+insert into atacc1 ("........pg.dropped.1........",b,c,d) values (default,11,12,13);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........",b,c,d) va...
+ ^
+-- DELETEs
+delete from atacc1 where a = 3;
+ERROR: column "a" does not exist
+LINE 1: delete from atacc1 where a = 3;
+ ^
+delete from atacc1 where "........pg.dropped.1........" = 3;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: delete from atacc1 where "........pg.dropped.1........" = 3;
+ ^
+delete from atacc1;
+-- try dropping a non-existent column, should fail
+alter table atacc1 drop bar;
+ERROR: column "bar" of relation "atacc1" does not exist
+-- try removing an oid column, should succeed (as it's nonexistent)
+alter table atacc1 SET WITHOUT OIDS;
+-- try adding an oid column, should fail (not supported)
+alter table atacc1 SET WITH OIDS;
+ERROR: syntax error at or near "WITH"
+LINE 1: alter table atacc1 SET WITH OIDS;
+ ^
+-- try dropping the xmin column, should fail
+alter table atacc1 drop xmin;
+ERROR: cannot drop system column "xmin"
+-- try creating a view and altering that, should fail
+create view myview as select * from atacc1;
+select * from myview;
+ b | c | d
+---+---+---
+(0 rows)
+
+alter table myview drop d;
+ERROR: "myview" is not a table, composite type, or foreign table
+drop view myview;
+-- test some commands to make sure they fail on the dropped column
+analyze atacc1(a);
+ERROR: column "a" of relation "atacc1" does not exist
+analyze atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+vacuum analyze atacc1(a);
+ERROR: column "a" of relation "atacc1" does not exist
+vacuum analyze atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+comment on column atacc1.a is 'testing';
+ERROR: column "a" of relation "atacc1" does not exist
+comment on column atacc1."........pg.dropped.1........" is 'testing';
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set storage plain;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set storage plain;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set statistics 0;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set statistics 0;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set default 3;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set default 3;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a drop default;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" drop default;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set not null;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set not null;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a drop not null;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" drop not null;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 rename a to x;
+ERROR: column "a" does not exist
+alter table atacc1 rename "........pg.dropped.1........" to x;
+ERROR: column "........pg.dropped.1........" does not exist
+alter table atacc1 add primary key(a);
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 add primary key("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 add unique(a);
+ERROR: column "a" named in key does not exist
+alter table atacc1 add unique("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" named in key does not exist
+alter table atacc1 add check (a > 3);
+ERROR: column "a" does not exist
+alter table atacc1 add check ("........pg.dropped.1........" > 3);
+ERROR: column "........pg.dropped.1........" does not exist
+create table atacc2 (id int4 unique);
+alter table atacc1 add foreign key (a) references atacc2(id);
+ERROR: column "a" referenced in foreign key constraint does not exist
+alter table atacc1 add foreign key ("........pg.dropped.1........") references atacc2(id);
+ERROR: column "........pg.dropped.1........" referenced in foreign key constraint does not exist
+alter table atacc2 add foreign key (id) references atacc1(a);
+ERROR: column "a" referenced in foreign key constraint does not exist
+alter table atacc2 add foreign key (id) references atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" referenced in foreign key constraint does not exist
+drop table atacc2;
+create index "testing_idx" on atacc1(a);
+ERROR: column "a" does not exist
+create index "testing_idx" on atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" does not exist
+-- test create as and select into
+insert into atacc1 values (21, 22, 23);
+create table attest1 as select * from atacc1;
+select * from attest1;
+ b | c | d
+----+----+----
+ 21 | 22 | 23
+(1 row)
+
+drop table attest1;
+select * into attest2 from atacc1;
+select * from attest2;
+ b | c | d
+----+----+----
+ 21 | 22 | 23
+(1 row)
+
+drop table attest2;
+-- try dropping all columns
+alter table atacc1 drop c;
+alter table atacc1 drop d;
+alter table atacc1 drop b;
+select * from atacc1;
+--
+(1 row)
+
+drop table atacc1;
+-- test constraint error reporting in presence of dropped columns
+create table atacc1 (id serial primary key, value int check (value < 10));
+insert into atacc1(value) values (100);
+ERROR: new row for relation "atacc1" violates check constraint "atacc1_value_check"
+DETAIL: Failing row contains (1, 100).
+alter table atacc1 drop column value;
+alter table atacc1 add column value int check (value < 10);
+insert into atacc1(value) values (100);
+ERROR: new row for relation "atacc1" violates check constraint "atacc1_value_check"
+DETAIL: Failing row contains (2, 100).
+insert into atacc1(id, value) values (null, 0);
+ERROR: null value in column "id" of relation "atacc1" violates not-null constraint
+DETAIL: Failing row contains (null, 0).
+drop table atacc1;
+-- test inheritance
+create table parent (a int, b int, c int);
+insert into parent values (1, 2, 3);
+alter table parent drop a;
+create table child (d varchar(255)) inherits (parent);
+insert into child values (12, 13, 'testing');
+select * from parent;
+ b | c
+----+----
+ 2 | 3
+ 12 | 13
+(2 rows)
+
+select * from child;
+ b | c | d
+----+----+---------
+ 12 | 13 | testing
+(1 row)
+
+alter table parent drop c;
+select * from parent;
+ b
+----
+ 2
+ 12
+(2 rows)
+
+select * from child;
+ b | d
+----+---------
+ 12 | testing
+(1 row)
+
+drop table child;
+drop table parent;
+-- check error cases for inheritance column merging
+create table parent (a float8, b numeric(10,4), c text collate "C");
+create table child (a float4) inherits (parent); -- fail
+NOTICE: merging column "a" with inherited definition
+ERROR: column "a" has a type conflict
+DETAIL: double precision versus real
+create table child (b decimal(10,7)) inherits (parent); -- fail
+NOTICE: moving and merging column "b" with inherited definition
+DETAIL: User-specified column moved to the position of the inherited column.
+ERROR: column "b" has a type conflict
+DETAIL: numeric(10,4) versus numeric(10,7)
+create table child (c text collate "POSIX") inherits (parent); -- fail
+NOTICE: moving and merging column "c" with inherited definition
+DETAIL: User-specified column moved to the position of the inherited column.
+ERROR: column "c" has a collation conflict
+DETAIL: "C" versus "POSIX"
+create table child (a double precision, b decimal(10,4)) inherits (parent);
+NOTICE: merging column "a" with inherited definition
+NOTICE: merging column "b" with inherited definition
+drop table child;
+drop table parent;
+-- test copy in/out
+create table attest (a int4, b int4, c int4);
+insert into attest values (1,2,3);
+alter table attest drop a;
+copy attest to stdout;
+2 3
+copy attest(a) to stdout;
+ERROR: column "a" of relation "attest" does not exist
+copy attest("........pg.dropped.1........") to stdout;
+ERROR: column "........pg.dropped.1........" of relation "attest" does not exist
+copy attest from stdin;
+ERROR: extra data after last expected column
+CONTEXT: COPY attest, line 1: "10 11 12"
+select * from attest;
+ b | c
+---+---
+ 2 | 3
+(1 row)
+
+copy attest from stdin;
+select * from attest;
+ b | c
+----+----
+ 2 | 3
+ 21 | 22
+(2 rows)
+
+copy attest(a) from stdin;
+ERROR: column "a" of relation "attest" does not exist
+copy attest("........pg.dropped.1........") from stdin;
+ERROR: column "........pg.dropped.1........" of relation "attest" does not exist
+copy attest(b,c) from stdin;
+select * from attest;
+ b | c
+----+----
+ 2 | 3
+ 21 | 22
+ 31 | 32
+(3 rows)
+
+drop table attest;
+-- test inheritance
+create table dropColumn (a int, b int, e int);
+create table dropColumnChild (c int) inherits (dropColumn);
+create table dropColumnAnother (d int) inherits (dropColumnChild);
+-- these two should fail
+alter table dropColumnchild drop column a;
+ERROR: cannot drop inherited column "a"
+alter table only dropColumnChild drop column b;
+ERROR: cannot drop inherited column "b"
+-- these three should work
+alter table only dropColumn drop column e;
+alter table dropColumnChild drop column c;
+alter table dropColumn drop column a;
+create table renameColumn (a int);
+create table renameColumnChild (b int) inherits (renameColumn);
+create table renameColumnAnother (c int) inherits (renameColumnChild);
+-- these three should fail
+alter table renameColumnChild rename column a to d;
+ERROR: cannot rename inherited column "a"
+alter table only renameColumnChild rename column a to d;
+ERROR: inherited column "a" must be renamed in child tables too
+alter table only renameColumn rename column a to d;
+ERROR: inherited column "a" must be renamed in child tables too
+-- these should work
+alter table renameColumn rename column a to d;
+alter table renameColumnChild rename column b to a;
+-- these should work
+alter table if exists doesnt_exist_tab rename column a to d;
+NOTICE: relation "doesnt_exist_tab" does not exist, skipping
+alter table if exists doesnt_exist_tab rename column b to a;
+NOTICE: relation "doesnt_exist_tab" does not exist, skipping
+-- this should work
+alter table renameColumn add column w int;
+-- this should fail
+alter table only renameColumn add column x int;
+ERROR: column must be added to child tables too
+-- Test corner cases in dropping of inherited columns
+create table p1 (f1 int, f2 int);
+create table c1 (f1 int not null) inherits(p1);
+NOTICE: merging column "f1" with inherited definition
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+-- should work
+alter table p1 drop column f1;
+-- c1.f1 is still there, but no longer inherited
+select f1 from c1;
+ f1
+----
+(0 rows)
+
+alter table c1 drop column f1;
+select f1 from c1;
+ERROR: column "f1" does not exist
+LINE 1: select f1 from c1;
+ ^
+HINT: Perhaps you meant to reference the column "c1.f2".
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1 (f1 int, f2 int);
+create table c1 () inherits(p1);
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+alter table p1 drop column f1;
+-- c1.f1 is dropped now, since there is no local definition for it
+select f1 from c1;
+ERROR: column "f1" does not exist
+LINE 1: select f1 from c1;
+ ^
+HINT: Perhaps you meant to reference the column "c1.f2".
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1 (f1 int, f2 int);
+create table c1 () inherits(p1);
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+alter table only p1 drop column f1;
+-- c1.f1 is NOT dropped, but must now be considered non-inherited
+alter table c1 drop column f1;
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1 (f1 int, f2 int);
+create table c1 (f1 int not null) inherits(p1);
+NOTICE: merging column "f1" with inherited definition
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+alter table only p1 drop column f1;
+-- c1.f1 is still there, but no longer inherited
+alter table c1 drop column f1;
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1(id int, name text);
+create table p2(id2 int, name text, height int);
+create table c1(age int) inherits(p1,p2);
+NOTICE: merging multiple inherited definitions of column "name"
+create table gc1() inherits (c1);
+select relname, attname, attinhcount, attislocal
+from pg_class join pg_attribute on (pg_class.oid = pg_attribute.attrelid)
+where relname in ('p1','p2','c1','gc1') and attnum > 0 and not attisdropped
+order by relname, attnum;
+ relname | attname | attinhcount | attislocal
+---------+---------+-------------+------------
+ c1 | id | 1 | f
+ c1 | name | 2 | f
+ c1 | id2 | 1 | f
+ c1 | height | 1 | f
+ c1 | age | 0 | t
+ gc1 | id | 1 | f
+ gc1 | name | 1 | f
+ gc1 | id2 | 1 | f
+ gc1 | height | 1 | f
+ gc1 | age | 1 | f
+ p1 | id | 0 | t
+ p1 | name | 0 | t
+ p2 | id2 | 0 | t
+ p2 | name | 0 | t
+ p2 | height | 0 | t
+(15 rows)
+
+-- should work
+alter table only p1 drop column name;
+-- should work. Now c1.name is local and inhcount is 0.
+alter table p2 drop column name;
+-- should be rejected since it's inherited
+alter table gc1 drop column name;
+ERROR: cannot drop inherited column "name"
+-- should work, and drop gc1.name along with it
+alter table c1 drop column name;
+-- should fail: column does not exist
+alter table gc1 drop column name;
+ERROR: column "name" of relation "gc1" does not exist
+-- should work and drop the attribute in all tables
+alter table p2 drop column height;
+-- IF EXISTS test
+create table dropColumnExists ();
+alter table dropColumnExists drop column non_existing; --fail
+ERROR: column "non_existing" of relation "dropcolumnexists" does not exist
+alter table dropColumnExists drop column if exists non_existing; --succeed
+NOTICE: column "non_existing" of relation "dropcolumnexists" does not exist, skipping
+select relname, attname, attinhcount, attislocal
+from pg_class join pg_attribute on (pg_class.oid = pg_attribute.attrelid)
+where relname in ('p1','p2','c1','gc1') and attnum > 0 and not attisdropped
+order by relname, attnum;
+ relname | attname | attinhcount | attislocal
+---------+---------+-------------+------------
+ c1 | id | 1 | f
+ c1 | id2 | 1 | f
+ c1 | age | 0 | t
+ gc1 | id | 1 | f
+ gc1 | id2 | 1 | f
+ gc1 | age | 1 | f
+ p1 | id | 0 | t
+ p2 | id2 | 0 | t
+(8 rows)
+
+drop table p1, p2 cascade;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to table c1
+drop cascades to table gc1
+-- test attinhcount tracking with merged columns
+create table depth0();
+create table depth1(c text) inherits (depth0);
+create table depth2() inherits (depth1);
+alter table depth0 add c text;
+NOTICE: merging definition of column "c" for child "depth1"
+select attrelid::regclass, attname, attinhcount, attislocal
+from pg_attribute
+where attnum > 0 and attrelid::regclass in ('depth0', 'depth1', 'depth2')
+order by attrelid::regclass::text, attnum;
+ attrelid | attname | attinhcount | attislocal
+----------+---------+-------------+------------
+ depth0 | c | 0 | t
+ depth1 | c | 1 | t
+ depth2 | c | 1 | f
+(3 rows)
+
+-- test renumbering of child-table columns in inherited operations
+create table p1 (f1 int);
+create table c1 (f2 text, f3 int) inherits (p1);
+alter table p1 add column a1 int check (a1 > 0);
+alter table p1 add column f2 text;
+NOTICE: merging definition of column "f2" for child "c1"
+insert into p1 values (1,2,'abc');
+insert into c1 values(11,'xyz',33,0); -- should fail
+ERROR: new row for relation "c1" violates check constraint "p1_a1_check"
+DETAIL: Failing row contains (11, xyz, 33, 0).
+insert into c1 values(11,'xyz',33,22);
+select * from p1;
+ f1 | a1 | f2
+----+----+-----
+ 1 | 2 | abc
+ 11 | 22 | xyz
+(2 rows)
+
+update p1 set a1 = a1 + 1, f2 = upper(f2);
+select * from p1;
+ f1 | a1 | f2
+----+----+-----
+ 1 | 3 | ABC
+ 11 | 23 | XYZ
+(2 rows)
+
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+-- test that operations with a dropped column do not try to reference
+-- its datatype
+create domain mytype as text;
+create temp table foo (f1 text, f2 mytype, f3 text);
+insert into foo values('bb','cc','dd');
+select * from foo;
+ f1 | f2 | f3
+----+----+----
+ bb | cc | dd
+(1 row)
+
+drop domain mytype cascade;
+NOTICE: drop cascades to column f2 of table foo
+select * from foo;
+ f1 | f3
+----+----
+ bb | dd
+(1 row)
+
+insert into foo values('qq','rr');
+select * from foo;
+ f1 | f3
+----+----
+ bb | dd
+ qq | rr
+(2 rows)
+
+update foo set f3 = 'zz';
+select * from foo;
+ f1 | f3
+----+----
+ bb | zz
+ qq | zz
+(2 rows)
+
+select f3,max(f1) from foo group by f3;
+ f3 | max
+----+-----
+ zz | qq
+(1 row)
+
+-- Simple tests for alter table column type
+alter table foo alter f1 TYPE integer; -- fails
+ERROR: column "f1" cannot be cast automatically to type integer
+HINT: You might need to specify "USING f1::integer".
+alter table foo alter f1 TYPE varchar(10);
+create table anothertab (atcol1 serial8, atcol2 boolean,
+ constraint anothertab_chk check (atcol1 <= 3));
+insert into anothertab (atcol1, atcol2) values (default, true);
+insert into anothertab (atcol1, atcol2) values (default, false);
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------
+ 1 | t
+ 2 | f
+(2 rows)
+
+alter table anothertab alter column atcol1 type boolean; -- fails
+ERROR: column "atcol1" cannot be cast automatically to type boolean
+HINT: You might need to specify "USING atcol1::boolean".
+alter table anothertab alter column atcol1 type boolean using atcol1::int; -- fails
+ERROR: result of USING clause for column "atcol1" cannot be cast automatically to type boolean
+HINT: You might need to add an explicit cast.
+alter table anothertab alter column atcol1 type integer;
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------
+ 1 | t
+ 2 | f
+(2 rows)
+
+insert into anothertab (atcol1, atcol2) values (45, null); -- fails
+ERROR: new row for relation "anothertab" violates check constraint "anothertab_chk"
+DETAIL: Failing row contains (45, null).
+insert into anothertab (atcol1, atcol2) values (default, null);
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------
+ 1 | t
+ 2 | f
+ 3 |
+(3 rows)
+
+alter table anothertab alter column atcol2 type text
+ using case when atcol2 is true then 'IT WAS TRUE'
+ when atcol2 is false then 'IT WAS FALSE'
+ else 'IT WAS NULL!' end;
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------------
+ 1 | IT WAS TRUE
+ 2 | IT WAS FALSE
+ 3 | IT WAS NULL!
+(3 rows)
+
+alter table anothertab alter column atcol1 type boolean
+ using case when atcol1 % 2 = 0 then true else false end; -- fails
+ERROR: default for column "atcol1" cannot be cast automatically to type boolean
+alter table anothertab alter column atcol1 drop default;
+alter table anothertab alter column atcol1 type boolean
+ using case when atcol1 % 2 = 0 then true else false end; -- fails
+ERROR: operator does not exist: boolean <= integer
+HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
+alter table anothertab drop constraint anothertab_chk;
+alter table anothertab drop constraint anothertab_chk; -- fails
+ERROR: constraint "anothertab_chk" of relation "anothertab" does not exist
+alter table anothertab drop constraint IF EXISTS anothertab_chk; -- succeeds
+NOTICE: constraint "anothertab_chk" of relation "anothertab" does not exist, skipping
+alter table anothertab alter column atcol1 type boolean
+ using case when atcol1 % 2 = 0 then true else false end;
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------------
+ f | IT WAS TRUE
+ t | IT WAS FALSE
+ f | IT WAS NULL!
+(3 rows)
+
+drop table anothertab;
+-- Test index handling in alter table column type (cf. bugs #15835, #15865)
+create table anothertab(f1 int primary key, f2 int unique,
+ f3 int, f4 int, f5 int);
+alter table anothertab
+ add exclude using btree (f3 with =);
+alter table anothertab
+ add exclude using btree (f4 with =) where (f4 is not null);
+alter table anothertab
+ add exclude using btree (f4 with =) where (f5 > 0);
+alter table anothertab
+ add unique(f1,f4);
+create index on anothertab(f2,f3);
+create unique index on anothertab(f4);
+\d anothertab
+ Table "public.anothertab"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ f1 | integer | | not null |
+ f2 | integer | | |
+ f3 | integer | | |
+ f4 | integer | | |
+ f5 | integer | | |
+Indexes:
+ "anothertab_pkey" PRIMARY KEY, btree (f1)
+ "anothertab_f1_f4_key" UNIQUE CONSTRAINT, btree (f1, f4)
+ "anothertab_f2_f3_idx" btree (f2, f3)
+ "anothertab_f2_key" UNIQUE CONSTRAINT, btree (f2)
+ "anothertab_f3_excl" EXCLUDE USING btree (f3 WITH =)
+ "anothertab_f4_excl" EXCLUDE USING btree (f4 WITH =) WHERE (f4 IS NOT NULL)
+ "anothertab_f4_excl1" EXCLUDE USING btree (f4 WITH =) WHERE (f5 > 0)
+ "anothertab_f4_idx" UNIQUE, btree (f4)
+
+alter table anothertab alter column f1 type bigint;
+alter table anothertab
+ alter column f2 type bigint,
+ alter column f3 type bigint,
+ alter column f4 type bigint;
+alter table anothertab alter column f5 type bigint;
+\d anothertab
+ Table "public.anothertab"
+ Column | Type | Collation | Nullable | Default
+--------+--------+-----------+----------+---------
+ f1 | bigint | | not null |
+ f2 | bigint | | |
+ f3 | bigint | | |
+ f4 | bigint | | |
+ f5 | bigint | | |
+Indexes:
+ "anothertab_pkey" PRIMARY KEY, btree (f1)
+ "anothertab_f1_f4_key" UNIQUE CONSTRAINT, btree (f1, f4)
+ "anothertab_f2_f3_idx" btree (f2, f3)
+ "anothertab_f2_key" UNIQUE CONSTRAINT, btree (f2)
+ "anothertab_f3_excl" EXCLUDE USING btree (f3 WITH =)
+ "anothertab_f4_excl" EXCLUDE USING btree (f4 WITH =) WHERE (f4 IS NOT NULL)
+ "anothertab_f4_excl1" EXCLUDE USING btree (f4 WITH =) WHERE (f5 > 0)
+ "anothertab_f4_idx" UNIQUE, btree (f4)
+
+drop table anothertab;
+-- test that USING expressions are parsed before column alter type / drop steps
+create table another (f1 int, f2 text, f3 text);
+insert into another values(1, 'one', 'uno');
+insert into another values(2, 'two', 'due');
+insert into another values(3, 'three', 'tre');
+select * from another;
+ f1 | f2 | f3
+----+-------+-----
+ 1 | one | uno
+ 2 | two | due
+ 3 | three | tre
+(3 rows)
+
+alter table another
+ alter f1 type text using f2 || ' and ' || f3 || ' more',
+ alter f2 type bigint using f1 * 10,
+ drop column f3;
+select * from another;
+ f1 | f2
+--------------------+----
+ one and uno more | 10
+ two and due more | 20
+ three and tre more | 30
+(3 rows)
+
+drop table another;
+-- table's row type
+create table tab1 (a int, b text);
+create table tab2 (x int, y tab1);
+alter table tab1 alter column b type varchar; -- fails
+ERROR: cannot alter table "tab1" because column "tab2.y" uses its row type
+-- Alter column type that's part of a partitioned index
+create table at_partitioned (a int, b text) partition by range (a);
+create table at_part_1 partition of at_partitioned for values from (0) to (1000);
+insert into at_partitioned values (512, '0.123');
+create table at_part_2 (b text, a int);
+insert into at_part_2 values ('1.234', 1024);
+create index on at_partitioned (b);
+create index on at_partitioned (a);
+\d at_part_1
+ Table "public.at_part_1"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+Partition of: at_partitioned FOR VALUES FROM (0) TO (1000)
+Indexes:
+ "at_part_1_a_idx" btree (a)
+ "at_part_1_b_idx" btree (b)
+
+\d at_part_2
+ Table "public.at_part_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ b | text | | |
+ a | integer | | |
+
+alter table at_partitioned attach partition at_part_2 for values from (1000) to (2000);
+\d at_part_2
+ Table "public.at_part_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ b | text | | |
+ a | integer | | |
+Partition of: at_partitioned FOR VALUES FROM (1000) TO (2000)
+Indexes:
+ "at_part_2_a_idx" btree (a)
+ "at_part_2_b_idx" btree (b)
+
+alter table at_partitioned alter column b type numeric using b::numeric;
+\d at_part_1
+ Table "public.at_part_1"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | numeric | | |
+Partition of: at_partitioned FOR VALUES FROM (0) TO (1000)
+Indexes:
+ "at_part_1_a_idx" btree (a)
+ "at_part_1_b_idx" btree (b)
+
+\d at_part_2
+ Table "public.at_part_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ b | numeric | | |
+ a | integer | | |
+Partition of: at_partitioned FOR VALUES FROM (1000) TO (2000)
+Indexes:
+ "at_part_2_a_idx" btree (a)
+ "at_part_2_b_idx" btree (b)
+
+drop table at_partitioned;
+-- Alter column type when no table rewrite is required
+-- Also check that comments are preserved
+create table at_partitioned(id int, name varchar(64), unique (id, name))
+ partition by hash(id);
+comment on constraint at_partitioned_id_name_key on at_partitioned is 'parent constraint';
+comment on index at_partitioned_id_name_key is 'parent index';
+create table at_partitioned_0 partition of at_partitioned
+ for values with (modulus 2, remainder 0);
+comment on constraint at_partitioned_0_id_name_key on at_partitioned_0 is 'child 0 constraint';
+comment on index at_partitioned_0_id_name_key is 'child 0 index';
+create table at_partitioned_1 partition of at_partitioned
+ for values with (modulus 2, remainder 1);
+comment on constraint at_partitioned_1_id_name_key on at_partitioned_1 is 'child 1 constraint';
+comment on index at_partitioned_1_id_name_key is 'child 1 index';
+insert into at_partitioned values(1, 'foo');
+insert into at_partitioned values(3, 'bar');
+create temp table old_oids as
+ select relname, oid as oldoid, relfilenode as oldfilenode
+ from pg_class where relname like 'at_partitioned%';
+select relname,
+ c.oid = oldoid as orig_oid,
+ case relfilenode
+ when 0 then 'none'
+ when c.oid then 'own'
+ when oldfilenode then 'orig'
+ else 'OTHER'
+ end as storage,
+ obj_description(c.oid, 'pg_class') as desc
+ from pg_class c left join old_oids using (relname)
+ where relname like 'at_partitioned%'
+ order by relname;
+ relname | orig_oid | storage | desc
+------------------------------+----------+---------+---------------
+ at_partitioned | t | none |
+ at_partitioned_0 | t | own |
+ at_partitioned_0_id_name_key | t | own | child 0 index
+ at_partitioned_1 | t | own |
+ at_partitioned_1_id_name_key | t | own | child 1 index
+ at_partitioned_id_name_key | t | none | parent index
+(6 rows)
+
+select conname, obj_description(oid, 'pg_constraint') as desc
+ from pg_constraint where conname like 'at_partitioned%'
+ order by conname;
+ conname | desc
+------------------------------+--------------------
+ at_partitioned_0_id_name_key | child 0 constraint
+ at_partitioned_1_id_name_key | child 1 constraint
+ at_partitioned_id_name_key | parent constraint
+(3 rows)
+
+alter table at_partitioned alter column name type varchar(127);
+-- Note: these tests currently show the wrong behavior for comments :-(
+select relname,
+ c.oid = oldoid as orig_oid,
+ case relfilenode
+ when 0 then 'none'
+ when c.oid then 'own'
+ when oldfilenode then 'orig'
+ else 'OTHER'
+ end as storage,
+ obj_description(c.oid, 'pg_class') as desc
+ from pg_class c left join old_oids using (relname)
+ where relname like 'at_partitioned%'
+ order by relname;
+ relname | orig_oid | storage | desc
+------------------------------+----------+---------+--------------
+ at_partitioned | t | none |
+ at_partitioned_0 | t | own |
+ at_partitioned_0_id_name_key | f | own | parent index
+ at_partitioned_1 | t | own |
+ at_partitioned_1_id_name_key | f | own | parent index
+ at_partitioned_id_name_key | f | none | parent index
+(6 rows)
+
+select conname, obj_description(oid, 'pg_constraint') as desc
+ from pg_constraint where conname like 'at_partitioned%'
+ order by conname;
+ conname | desc
+------------------------------+-------------------
+ at_partitioned_0_id_name_key |
+ at_partitioned_1_id_name_key |
+ at_partitioned_id_name_key | parent constraint
+(3 rows)
+
+-- Don't remove this DROP, it exposes bug #15672
+drop table at_partitioned;
+-- disallow recursive containment of row types
+create temp table recur1 (f1 int);
+alter table recur1 add column f2 recur1; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+alter table recur1 add column f2 recur1[]; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+create domain array_of_recur1 as recur1[];
+alter table recur1 add column f2 array_of_recur1; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+create temp table recur2 (f1 int, f2 recur1);
+alter table recur1 add column f2 recur2; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+alter table recur1 add column f2 int;
+alter table recur1 alter column f2 type recur2; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+-- SET STORAGE may need to add a TOAST table
+create table test_storage (a text);
+alter table test_storage alter a set storage plain;
+alter table test_storage add b int default 0; -- rewrite table to remove its TOAST table
+alter table test_storage alter a set storage extended; -- re-add TOAST table
+select reltoastrelid <> 0 as has_toast_table
+from pg_class
+where oid = 'test_storage'::regclass;
+ has_toast_table
+-----------------
+ f
+(1 row)
+
+-- ALTER COLUMN TYPE with a check constraint and a child table (bug #13779)
+CREATE TABLE test_inh_check (a float check (a > 10.2), b float);
+CREATE TABLE test_inh_check_child() INHERITS(test_inh_check);
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | double precision | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | double precision | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(2 rows)
+
+ALTER TABLE test_inh_check ALTER COLUMN a TYPE numeric;
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(2 rows)
+
+-- also try noinherit, local, and local+inherited cases
+ALTER TABLE test_inh_check ADD CONSTRAINT bnoinherit CHECK (b > 100) NO INHERIT;
+ALTER TABLE test_inh_check_child ADD CONSTRAINT blocal CHECK (b < 1000);
+ALTER TABLE test_inh_check_child ADD CONSTRAINT bmerged CHECK (b > 1);
+ALTER TABLE test_inh_check ADD CONSTRAINT bmerged CHECK (b > 1);
+NOTICE: merging constraint "bmerged" with inherited definition
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "bmerged" CHECK (b > 1::double precision)
+ "bnoinherit" CHECK (b > 100::double precision) NO INHERIT
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "blocal" CHECK (b < 1000::double precision)
+ "bmerged" CHECK (b > 1::double precision)
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | bmerged | 0 | t | f
+ test_inh_check | bnoinherit | 0 | t | t
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | blocal | 0 | t | f
+ test_inh_check_child | bmerged | 1 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(6 rows)
+
+ALTER TABLE test_inh_check ALTER COLUMN b TYPE numeric;
+NOTICE: merging constraint "bmerged" with inherited definition
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | numeric | | |
+ b | numeric | | |
+Check constraints:
+ "bmerged" CHECK (b::double precision > 1::double precision)
+ "bnoinherit" CHECK (b::double precision > 100::double precision) NO INHERIT
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | numeric | | |
+ b | numeric | | |
+Check constraints:
+ "blocal" CHECK (b::double precision < 1000::double precision)
+ "bmerged" CHECK (b::double precision > 1::double precision)
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | bmerged | 0 | t | f
+ test_inh_check | bnoinherit | 0 | t | t
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | blocal | 0 | t | f
+ test_inh_check_child | bmerged | 1 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(6 rows)
+
+-- ALTER COLUMN TYPE with different schema in children
+-- Bug at https://postgr.es/m/20170102225618.GA10071@telsasoft.com
+CREATE TABLE test_type_diff (f1 int);
+CREATE TABLE test_type_diff_c (extra smallint) INHERITS (test_type_diff);
+ALTER TABLE test_type_diff ADD COLUMN f2 int;
+INSERT INTO test_type_diff_c VALUES (1, 2, 3);
+ALTER TABLE test_type_diff ALTER COLUMN f2 TYPE bigint USING f2::bigint;
+CREATE TABLE test_type_diff2 (int_two int2, int_four int4, int_eight int8);
+CREATE TABLE test_type_diff2_c1 (int_four int4, int_eight int8, int_two int2);
+CREATE TABLE test_type_diff2_c2 (int_eight int8, int_two int2, int_four int4);
+CREATE TABLE test_type_diff2_c3 (int_two int2, int_four int4, int_eight int8);
+ALTER TABLE test_type_diff2_c1 INHERIT test_type_diff2;
+ALTER TABLE test_type_diff2_c2 INHERIT test_type_diff2;
+ALTER TABLE test_type_diff2_c3 INHERIT test_type_diff2;
+INSERT INTO test_type_diff2_c1 VALUES (1, 2, 3);
+INSERT INTO test_type_diff2_c2 VALUES (4, 5, 6);
+INSERT INTO test_type_diff2_c3 VALUES (7, 8, 9);
+ALTER TABLE test_type_diff2 ALTER COLUMN int_four TYPE int8 USING int_four::int8;
+-- whole-row references are disallowed
+ALTER TABLE test_type_diff2 ALTER COLUMN int_four TYPE int4 USING (pg_column_size(test_type_diff2));
+ERROR: cannot convert whole-row table reference
+DETAIL: USING expression contains a whole-row table reference.
+-- check for rollback of ANALYZE corrupting table property flags (bug #11638)
+CREATE TABLE check_fk_presence_1 (id int PRIMARY KEY, t text);
+CREATE TABLE check_fk_presence_2 (id int REFERENCES check_fk_presence_1, t text);
+BEGIN;
+ALTER TABLE check_fk_presence_2 DROP CONSTRAINT check_fk_presence_2_id_fkey;
+ANALYZE check_fk_presence_2;
+ROLLBACK;
+\d check_fk_presence_2
+ Table "public.check_fk_presence_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ id | integer | | |
+ t | text | | |
+Foreign-key constraints:
+ "check_fk_presence_2_id_fkey" FOREIGN KEY (id) REFERENCES check_fk_presence_1(id)
+
+DROP TABLE check_fk_presence_1, check_fk_presence_2;
+-- check column addition within a view (bug #14876)
+create table at_base_table(id int, stuff text);
+insert into at_base_table values (23, 'skidoo');
+create view at_view_1 as select * from at_base_table bt;
+create view at_view_2 as select *, to_json(v1) as j from at_view_1 v1;
+\d+ at_view_1
+ View "public.at_view_1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+View definition:
+ SELECT bt.id,
+ bt.stuff
+ FROM at_base_table bt;
+
+\d+ at_view_2
+ View "public.at_view_2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+ j | json | | | | extended |
+View definition:
+ SELECT v1.id,
+ v1.stuff,
+ to_json(v1.*) AS j
+ FROM at_view_1 v1;
+
+explain (verbose, costs off) select * from at_view_2;
+ QUERY PLAN
+----------------------------------------------------------
+ Seq Scan on public.at_base_table bt
+ Output: bt.id, bt.stuff, to_json(ROW(bt.id, bt.stuff))
+(2 rows)
+
+select * from at_view_2;
+ id | stuff | j
+----+--------+----------------------------
+ 23 | skidoo | {"id":23,"stuff":"skidoo"}
+(1 row)
+
+create or replace view at_view_1 as select *, 2+2 as more from at_base_table bt;
+\d+ at_view_1
+ View "public.at_view_1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+ more | integer | | | | plain |
+View definition:
+ SELECT bt.id,
+ bt.stuff,
+ 2 + 2 AS more
+ FROM at_base_table bt;
+
+\d+ at_view_2
+ View "public.at_view_2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+ j | json | | | | extended |
+View definition:
+ SELECT v1.id,
+ v1.stuff,
+ to_json(v1.*) AS j
+ FROM at_view_1 v1;
+
+explain (verbose, costs off) select * from at_view_2;
+ QUERY PLAN
+----------------------------------------------------------------
+ Seq Scan on public.at_base_table bt
+ Output: bt.id, bt.stuff, to_json(ROW(bt.id, bt.stuff, NULL))
+(2 rows)
+
+select * from at_view_2;
+ id | stuff | j
+----+--------+----------------------------------------
+ 23 | skidoo | {"id":23,"stuff":"skidoo","more":null}
+(1 row)
+
+drop view at_view_2;
+drop view at_view_1;
+drop table at_base_table;
+-- check adding a column not itself requiring a rewrite, together with
+-- a column requiring a default (bug #16038)
+-- ensure that rewrites aren't silently optimized away, removing the
+-- value of the test
+CREATE FUNCTION check_ddl_rewrite(p_tablename regclass, p_ddl text)
+RETURNS boolean
+LANGUAGE plpgsql AS $$
+DECLARE
+ v_relfilenode oid;
+BEGIN
+ v_relfilenode := relfilenode FROM pg_class WHERE oid = p_tablename;
+
+ EXECUTE p_ddl;
+
+ RETURN v_relfilenode <> (SELECT relfilenode FROM pg_class WHERE oid = p_tablename);
+END;
+$$;
+CREATE TABLE rewrite_test(col text);
+INSERT INTO rewrite_test VALUES ('something');
+INSERT INTO rewrite_test VALUES (NULL);
+-- empty[12] don't need rewrite, but notempty[12]_rewrite will force one
+SELECT check_ddl_rewrite('rewrite_test', $$
+ ALTER TABLE rewrite_test
+ ADD COLUMN empty1 text,
+ ADD COLUMN notempty1_rewrite serial;
+$$);
+ check_ddl_rewrite
+-------------------
+ t
+(1 row)
+
+SELECT check_ddl_rewrite('rewrite_test', $$
+ ALTER TABLE rewrite_test
+ ADD COLUMN notempty2_rewrite serial,
+ ADD COLUMN empty2 text;
+$$);
+ check_ddl_rewrite
+-------------------
+ t
+(1 row)
+
+-- also check that fast defaults cause no problem, first without rewrite
+SELECT check_ddl_rewrite('rewrite_test', $$
+ ALTER TABLE rewrite_test
+ ADD COLUMN empty3 text,
+ ADD COLUMN notempty3_norewrite int default 42;
+$$);
+ check_ddl_rewrite
+-------------------
+ f
+(1 row)
+
+SELECT check_ddl_rewrite('rewrite_test', $$
+ ALTER TABLE rewrite_test
+ ADD COLUMN notempty4_norewrite int default 42,
+ ADD COLUMN empty4 text;
+$$);
+ check_ddl_rewrite
+-------------------
+ f
+(1 row)
+
+-- then with rewrite
+SELECT check_ddl_rewrite('rewrite_test', $$
+ ALTER TABLE rewrite_test
+ ADD COLUMN empty5 text,
+ ADD COLUMN notempty5_norewrite int default 42,
+ ADD COLUMN notempty5_rewrite serial;
+$$);
+ check_ddl_rewrite
+-------------------
+ t
+(1 row)
+
+SELECT check_ddl_rewrite('rewrite_test', $$
+ ALTER TABLE rewrite_test
+ ADD COLUMN notempty6_rewrite serial,
+ ADD COLUMN empty6 text,
+ ADD COLUMN notempty6_norewrite int default 42;
+$$);
+ check_ddl_rewrite
+-------------------
+ t
+(1 row)
+
+-- cleanup
+DROP FUNCTION check_ddl_rewrite(regclass, text);
+DROP TABLE rewrite_test;
+--
+-- lock levels
+--
+drop type lockmodes;
+ERROR: type "lockmodes" does not exist
+create type lockmodes as enum (
+ 'SIReadLock'
+,'AccessShareLock'
+,'RowShareLock'
+,'RowExclusiveLock'
+,'ShareUpdateExclusiveLock'
+,'ShareLock'
+,'ShareRowExclusiveLock'
+,'ExclusiveLock'
+,'AccessExclusiveLock'
+);
+drop view my_locks;
+ERROR: view "my_locks" does not exist
+create or replace view my_locks as
+select case when c.relname like 'pg_toast%' then 'pg_toast' else c.relname end, max(mode::lockmodes) as max_lockmode
+from pg_locks l join pg_class c on l.relation = c.oid
+where virtualtransaction = (
+ select virtualtransaction
+ from pg_locks
+ where transactionid = txid_current()::integer)
+and locktype = 'relation'
+and relnamespace != (select oid from pg_namespace where nspname = 'pg_catalog')
+and c.relname != 'my_locks'
+group by c.relname;
+create table alterlock (f1 int primary key, f2 text);
+insert into alterlock values (1, 'foo');
+create table alterlock2 (f3 int primary key, f1 int);
+insert into alterlock2 values (1, 1);
+begin; alter table alterlock alter column f2 set statistics 150;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+rollback;
+begin; alter table alterlock cluster on alterlock_pkey;
+select * from my_locks order by 1;
+ relname | max_lockmode
+----------------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+ alterlock_pkey | ShareUpdateExclusiveLock
+(2 rows)
+
+commit;
+begin; alter table alterlock set without cluster;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock set (fillfactor = 100);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock reset (fillfactor);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock set (toast.autovacuum_enabled = off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock set (autovacuum_enabled = off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock alter column f2 set (n_distinct = 1);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+rollback;
+-- test that mixing options with different lock levels works as expected
+begin; alter table alterlock set (autovacuum_enabled = off, fillfactor = 80);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock alter column f2 set storage extended;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+---------------------
+ alterlock | AccessExclusiveLock
+(1 row)
+
+rollback;
+begin; alter table alterlock alter column f2 set default 'x';
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+---------------------
+ alterlock | AccessExclusiveLock
+(1 row)
+
+rollback;
+begin;
+create trigger ttdummy
+ before delete or update on alterlock
+ for each row
+ execute procedure
+ ttdummy (1, 1);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+-----------------------
+ alterlock | ShareRowExclusiveLock
+(1 row)
+
+rollback;
+begin;
+select * from my_locks order by 1;
+ relname | max_lockmode
+---------+--------------
+(0 rows)
+
+alter table alterlock2 add foreign key (f1) references alterlock (f1);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------------+-----------------------
+ alterlock | ShareRowExclusiveLock
+ alterlock2 | ShareRowExclusiveLock
+ alterlock2_pkey | AccessShareLock
+ alterlock_pkey | AccessShareLock
+(4 rows)
+
+rollback;
+begin;
+alter table alterlock2
+add constraint alterlock2nv foreign key (f1) references alterlock (f1) NOT VALID;
+select * from my_locks order by 1;
+ relname | max_lockmode
+------------+-----------------------
+ alterlock | ShareRowExclusiveLock
+ alterlock2 | ShareRowExclusiveLock
+(2 rows)
+
+commit;
+begin;
+alter table alterlock2 validate constraint alterlock2nv;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------------+--------------------------
+ alterlock | RowShareLock
+ alterlock2 | ShareUpdateExclusiveLock
+ alterlock2_pkey | AccessShareLock
+ alterlock_pkey | AccessShareLock
+(4 rows)
+
+rollback;
+create or replace view my_locks as
+select case when c.relname like 'pg_toast%' then 'pg_toast' else c.relname end, max(mode::lockmodes) as max_lockmode
+from pg_locks l join pg_class c on l.relation = c.oid
+where virtualtransaction = (
+ select virtualtransaction
+ from pg_locks
+ where transactionid = txid_current()::integer)
+and locktype = 'relation'
+and relnamespace != (select oid from pg_namespace where nspname = 'pg_catalog')
+and c.relname = 'my_locks'
+group by c.relname;
+-- raise exception
+alter table my_locks set (autovacuum_enabled = false);
+ERROR: unrecognized parameter "autovacuum_enabled"
+alter view my_locks set (autovacuum_enabled = false);
+ERROR: unrecognized parameter "autovacuum_enabled"
+alter table my_locks reset (autovacuum_enabled);
+alter view my_locks reset (autovacuum_enabled);
+begin;
+alter view my_locks set (security_barrier=off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+----------+---------------------
+ my_locks | AccessExclusiveLock
+(1 row)
+
+alter view my_locks reset (security_barrier);
+rollback;
+-- this test intentionally applies the ALTER TABLE command against a view, but
+-- uses a view option so we expect this to succeed. This form of SQL is
+-- accepted for historical reasons, as shown in the docs for ALTER VIEW
+begin;
+alter table my_locks set (security_barrier=off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+----------+---------------------
+ my_locks | AccessExclusiveLock
+(1 row)
+
+alter table my_locks reset (security_barrier);
+rollback;
+-- cleanup
+drop table alterlock2;
+drop table alterlock;
+drop view my_locks;
+drop type lockmodes;
+--
+-- alter function
+--
+create function test_strict(text) returns text as
+ 'select coalesce($1, ''got passed a null'');'
+ language sql returns null on null input;
+select test_strict(NULL);
+ test_strict
+-------------
+
+(1 row)
+
+alter function test_strict(text) called on null input;
+select test_strict(NULL);
+ test_strict
+-------------------
+ got passed a null
+(1 row)
+
+create function non_strict(text) returns text as
+ 'select coalesce($1, ''got passed a null'');'
+ language sql called on null input;
+select non_strict(NULL);
+ non_strict
+-------------------
+ got passed a null
+(1 row)
+
+alter function non_strict(text) returns null on null input;
+select non_strict(NULL);
+ non_strict
+------------
+
+(1 row)
+
+--
+-- alter object set schema
+--
+create schema alter1;
+create schema alter2;
+create table alter1.t1(f1 serial primary key, f2 int check (f2 > 0));
+create view alter1.v1 as select * from alter1.t1;
+create function alter1.plus1(int) returns int as 'select $1+1' language sql;
+create domain alter1.posint integer check (value > 0);
+create type alter1.ctype as (f1 int, f2 text);
+create function alter1.same(alter1.ctype, alter1.ctype) returns boolean language sql
+as 'select $1.f1 is not distinct from $2.f1 and $1.f2 is not distinct from $2.f2';
+create operator alter1.=(procedure = alter1.same, leftarg = alter1.ctype, rightarg = alter1.ctype);
+create operator class alter1.ctype_hash_ops default for type alter1.ctype using hash as
+ operator 1 alter1.=(alter1.ctype, alter1.ctype);
+create conversion alter1.latin1_to_utf8 for 'latin1' to 'utf8' from iso8859_1_to_utf8;
+create text search parser alter1.prs(start = prsd_start, gettoken = prsd_nexttoken, end = prsd_end, lextypes = prsd_lextype);
+create text search configuration alter1.cfg(parser = alter1.prs);
+create text search template alter1.tmpl(init = dsimple_init, lexize = dsimple_lexize);
+create text search dictionary alter1.dict(template = alter1.tmpl);
+insert into alter1.t1(f2) values(11);
+insert into alter1.t1(f2) values(12);
+alter table alter1.t1 set schema alter1; -- no-op, same schema
+alter table alter1.t1 set schema alter2;
+alter table alter1.v1 set schema alter2;
+alter function alter1.plus1(int) set schema alter2;
+alter domain alter1.posint set schema alter2;
+alter operator class alter1.ctype_hash_ops using hash set schema alter2;
+alter operator family alter1.ctype_hash_ops using hash set schema alter2;
+alter operator alter1.=(alter1.ctype, alter1.ctype) set schema alter2;
+alter function alter1.same(alter1.ctype, alter1.ctype) set schema alter2;
+alter type alter1.ctype set schema alter1; -- no-op, same schema
+alter type alter1.ctype set schema alter2;
+alter conversion alter1.latin1_to_utf8 set schema alter2;
+alter text search parser alter1.prs set schema alter2;
+alter text search configuration alter1.cfg set schema alter2;
+alter text search template alter1.tmpl set schema alter2;
+alter text search dictionary alter1.dict set schema alter2;
+-- this should succeed because nothing is left in alter1
+drop schema alter1;
+insert into alter2.t1(f2) values(13);
+insert into alter2.t1(f2) values(14);
+select * from alter2.t1;
+ f1 | f2
+----+----
+ 1 | 11
+ 2 | 12
+ 3 | 13
+ 4 | 14
+(4 rows)
+
+select * from alter2.v1;
+ f1 | f2
+----+----
+ 1 | 11
+ 2 | 12
+ 3 | 13
+ 4 | 14
+(4 rows)
+
+select alter2.plus1(41);
+ plus1
+-------
+ 42
+(1 row)
+
+-- clean up
+drop schema alter2 cascade;
+NOTICE: drop cascades to 13 other objects
+DETAIL: drop cascades to table alter2.t1
+drop cascades to view alter2.v1
+drop cascades to function alter2.plus1(integer)
+drop cascades to type alter2.posint
+drop cascades to type alter2.ctype
+drop cascades to function alter2.same(alter2.ctype,alter2.ctype)
+drop cascades to operator alter2.=(alter2.ctype,alter2.ctype)
+drop cascades to operator family alter2.ctype_hash_ops for access method hash
+drop cascades to conversion alter2.latin1_to_utf8
+drop cascades to text search parser alter2.prs
+drop cascades to text search configuration alter2.cfg
+drop cascades to text search template alter2.tmpl
+drop cascades to text search dictionary alter2.dict
+--
+-- composite types
+--
+CREATE TYPE test_type AS (a int);
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+
+ALTER TYPE nosuchtype ADD ATTRIBUTE b text; -- fails
+ERROR: relation "nosuchtype" does not exist
+ALTER TYPE test_type ADD ATTRIBUTE b text;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+
+ALTER TYPE test_type ADD ATTRIBUTE b text; -- fails
+ERROR: column "b" of relation "test_type" already exists
+ALTER TYPE test_type ALTER ATTRIBUTE b SET DATA TYPE varchar;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+-------------------+-----------+----------+---------
+ a | integer | | |
+ b | character varying | | |
+
+ALTER TYPE test_type ALTER ATTRIBUTE b SET DATA TYPE integer;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+
+ALTER TYPE test_type DROP ATTRIBUTE b;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+
+ALTER TYPE test_type DROP ATTRIBUTE c; -- fails
+ERROR: column "c" of relation "test_type" does not exist
+ALTER TYPE test_type DROP ATTRIBUTE IF EXISTS c;
+NOTICE: column "c" of relation "test_type" does not exist, skipping
+ALTER TYPE test_type DROP ATTRIBUTE a, ADD ATTRIBUTE d boolean;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ d | boolean | | |
+
+ALTER TYPE test_type RENAME ATTRIBUTE a TO aa;
+ERROR: column "a" does not exist
+ALTER TYPE test_type RENAME ATTRIBUTE d TO dd;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ dd | boolean | | |
+
+DROP TYPE test_type;
+CREATE TYPE test_type1 AS (a int, b text);
+CREATE TABLE test_tbl1 (x int, y test_type1);
+ALTER TYPE test_type1 ALTER ATTRIBUTE b TYPE varchar; -- fails
+ERROR: cannot alter type "test_type1" because column "test_tbl1.y" uses it
+CREATE TYPE test_type2 AS (a int, b text);
+CREATE TABLE test_tbl2 OF test_type2;
+CREATE TABLE test_tbl2_subclass () INHERITS (test_tbl2);
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 ADD ATTRIBUTE c text; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 ADD ATTRIBUTE c text CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 ALTER ATTRIBUTE b TYPE varchar; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 ALTER ATTRIBUTE b TYPE varchar CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+-------------------+-----------+----------+---------
+ a | integer | | |
+ b | character varying | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+-------------------+-----------+----------+---------
+ a | integer | | |
+ b | character varying | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 DROP ATTRIBUTE b; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 DROP ATTRIBUTE b CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 RENAME ATTRIBUTE a TO aa; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 RENAME ATTRIBUTE a TO aa CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ aa | integer | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ aa | integer | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+\d test_tbl2_subclass
+ Table "public.test_tbl2_subclass"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ aa | integer | | |
+ c | text | | |
+Inherits: test_tbl2
+
+DROP TABLE test_tbl2_subclass;
+CREATE TYPE test_typex AS (a int, b text);
+CREATE TABLE test_tblx (x int, y test_typex check ((y).a > 0));
+ALTER TYPE test_typex DROP ATTRIBUTE a; -- fails
+ERROR: cannot drop column a of composite type test_typex because other objects depend on it
+DETAIL: constraint test_tblx_y_check on table test_tblx depends on column a of composite type test_typex
+HINT: Use DROP ... CASCADE to drop the dependent objects too.
+ALTER TYPE test_typex DROP ATTRIBUTE a CASCADE;
+NOTICE: drop cascades to constraint test_tblx_y_check on table test_tblx
+\d test_tblx
+ Table "public.test_tblx"
+ Column | Type | Collation | Nullable | Default
+--------+------------+-----------+----------+---------
+ x | integer | | |
+ y | test_typex | | |
+
+DROP TABLE test_tblx;
+DROP TYPE test_typex;
+-- This test isn't that interesting on its own, but the purpose is to leave
+-- behind a table to test pg_upgrade with. The table has a composite type
+-- column in it, and the composite type has a dropped attribute.
+CREATE TYPE test_type3 AS (a int);
+CREATE TABLE test_tbl3 (c) AS SELECT '(1)'::test_type3;
+ALTER TYPE test_type3 DROP ATTRIBUTE a, ADD ATTRIBUTE b int;
+CREATE TYPE test_type_empty AS ();
+DROP TYPE test_type_empty;
+--
+-- typed tables: OF / NOT OF
+--
+CREATE TYPE tt_t0 AS (z inet, x int, y numeric(8,2));
+ALTER TYPE tt_t0 DROP ATTRIBUTE z;
+CREATE TABLE tt0 (x int NOT NULL, y numeric(8,2)); -- OK
+CREATE TABLE tt1 (x int, y bigint); -- wrong base type
+CREATE TABLE tt2 (x int, y numeric(9,2)); -- wrong typmod
+CREATE TABLE tt3 (y numeric(8,2), x int); -- wrong column order
+CREATE TABLE tt4 (x int); -- too few columns
+CREATE TABLE tt5 (x int, y numeric(8,2), z int); -- too many columns
+CREATE TABLE tt6 () INHERITS (tt0); -- can't have a parent
+CREATE TABLE tt7 (x int, q text, y numeric(8,2));
+ALTER TABLE tt7 DROP q; -- OK
+ALTER TABLE tt0 OF tt_t0;
+ALTER TABLE tt1 OF tt_t0;
+ERROR: table "tt1" has different type for column "y"
+ALTER TABLE tt2 OF tt_t0;
+ERROR: table "tt2" has different type for column "y"
+ALTER TABLE tt3 OF tt_t0;
+ERROR: table has column "y" where type requires "x"
+ALTER TABLE tt4 OF tt_t0;
+ERROR: table is missing column "y"
+ALTER TABLE tt5 OF tt_t0;
+ERROR: table has extra column "z"
+ALTER TABLE tt6 OF tt_t0;
+ERROR: typed tables cannot inherit
+ALTER TABLE tt7 OF tt_t0;
+CREATE TYPE tt_t1 AS (x int, y numeric(8,2));
+ALTER TABLE tt7 OF tt_t1; -- reassign an already-typed table
+ALTER TABLE tt7 NOT OF;
+\d tt7
+ Table "public.tt7"
+ Column | Type | Collation | Nullable | Default
+--------+--------------+-----------+----------+---------
+ x | integer | | |
+ y | numeric(8,2) | | |
+
+-- make sure we can drop a constraint on the parent but it remains on the child
+CREATE TABLE test_drop_constr_parent (c text CHECK (c IS NOT NULL));
+CREATE TABLE test_drop_constr_child () INHERITS (test_drop_constr_parent);
+ALTER TABLE ONLY test_drop_constr_parent DROP CONSTRAINT "test_drop_constr_parent_c_check";
+-- should fail
+INSERT INTO test_drop_constr_child (c) VALUES (NULL);
+ERROR: new row for relation "test_drop_constr_child" violates check constraint "test_drop_constr_parent_c_check"
+DETAIL: Failing row contains (null).
+DROP TABLE test_drop_constr_parent CASCADE;
+NOTICE: drop cascades to table test_drop_constr_child
+--
+-- IF EXISTS test
+--
+ALTER TABLE IF EXISTS tt8 ADD COLUMN f int;
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 ADD CONSTRAINT xxx PRIMARY KEY(f);
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 ADD CHECK (f BETWEEN 0 AND 10);
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 ALTER COLUMN f SET DEFAULT 0;
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 RENAME COLUMN f TO f1;
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 SET SCHEMA alter2;
+NOTICE: relation "tt8" does not exist, skipping
+CREATE TABLE tt8(a int);
+CREATE SCHEMA alter2;
+ALTER TABLE IF EXISTS tt8 ADD COLUMN f int;
+ALTER TABLE IF EXISTS tt8 ADD CONSTRAINT xxx PRIMARY KEY(f);
+ALTER TABLE IF EXISTS tt8 ADD CHECK (f BETWEEN 0 AND 10);
+ALTER TABLE IF EXISTS tt8 ALTER COLUMN f SET DEFAULT 0;
+ALTER TABLE IF EXISTS tt8 RENAME COLUMN f TO f1;
+ALTER TABLE IF EXISTS tt8 SET SCHEMA alter2;
+\d alter2.tt8
+ Table "alter2.tt8"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ f1 | integer | | not null | 0
+Indexes:
+ "xxx" PRIMARY KEY, btree (f1)
+Check constraints:
+ "tt8_f_check" CHECK (f1 >= 0 AND f1 <= 10)
+
+DROP TABLE alter2.tt8;
+DROP SCHEMA alter2;
+--
+-- Check conflicts between index and CHECK constraint names
+--
+CREATE TABLE tt9(c integer);
+ALTER TABLE tt9 ADD CHECK(c > 1);
+ALTER TABLE tt9 ADD CHECK(c > 2); -- picks nonconflicting name
+ALTER TABLE tt9 ADD CONSTRAINT foo CHECK(c > 3);
+ALTER TABLE tt9 ADD CONSTRAINT foo CHECK(c > 4); -- fail, dup name
+ERROR: constraint "foo" for relation "tt9" already exists
+ALTER TABLE tt9 ADD UNIQUE(c);
+ALTER TABLE tt9 ADD UNIQUE(c); -- picks nonconflicting name
+ALTER TABLE tt9 ADD CONSTRAINT tt9_c_key UNIQUE(c); -- fail, dup name
+ERROR: relation "tt9_c_key" already exists
+ALTER TABLE tt9 ADD CONSTRAINT foo UNIQUE(c); -- fail, dup name
+ERROR: constraint "foo" for relation "tt9" already exists
+ALTER TABLE tt9 ADD CONSTRAINT tt9_c_key CHECK(c > 5); -- fail, dup name
+ERROR: constraint "tt9_c_key" for relation "tt9" already exists
+ALTER TABLE tt9 ADD CONSTRAINT tt9_c_key2 CHECK(c > 6);
+ALTER TABLE tt9 ADD UNIQUE(c); -- picks nonconflicting name
+\d tt9
+ Table "public.tt9"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c | integer | | |
+Indexes:
+ "tt9_c_key" UNIQUE CONSTRAINT, btree (c)
+ "tt9_c_key1" UNIQUE CONSTRAINT, btree (c)
+ "tt9_c_key3" UNIQUE CONSTRAINT, btree (c)
+Check constraints:
+ "foo" CHECK (c > 3)
+ "tt9_c_check" CHECK (c > 1)
+ "tt9_c_check1" CHECK (c > 2)
+ "tt9_c_key2" CHECK (c > 6)
+
+DROP TABLE tt9;
+-- Check that comments on constraints and indexes are not lost at ALTER TABLE.
+CREATE TABLE comment_test (
+ id int,
+ positive_col int CHECK (positive_col > 0),
+ indexed_col int,
+ CONSTRAINT comment_test_pk PRIMARY KEY (id));
+CREATE INDEX comment_test_index ON comment_test(indexed_col);
+COMMENT ON COLUMN comment_test.id IS 'Column ''id'' on comment_test';
+COMMENT ON INDEX comment_test_index IS 'Simple index on comment_test';
+COMMENT ON CONSTRAINT comment_test_positive_col_check ON comment_test IS 'CHECK constraint on comment_test.positive_col';
+COMMENT ON CONSTRAINT comment_test_pk ON comment_test IS 'PRIMARY KEY constraint of comment_test';
+COMMENT ON INDEX comment_test_pk IS 'Index backing the PRIMARY KEY of comment_test';
+SELECT col_description('comment_test'::regclass, 1) as comment;
+ comment
+-----------------------------
+ Column 'id' on comment_test
+(1 row)
+
+SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ index | comment
+--------------------+-----------------------------------------------
+ comment_test_index | Simple index on comment_test
+ comment_test_pk | Index backing the PRIMARY KEY of comment_test
+(2 rows)
+
+SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ constraint | comment
+---------------------------------+-----------------------------------------------
+ comment_test_pk | PRIMARY KEY constraint of comment_test
+ comment_test_positive_col_check | CHECK constraint on comment_test.positive_col
+(2 rows)
+
+-- Change the datatype of all the columns. ALTER TABLE is optimized to not
+-- rebuild an index if the new data type is binary compatible with the old
+-- one. To check that, do a dummy ALTER TABLE that doesn't change the datatype
+-- first, to test the no-op codepath, and then another one that does.
+ALTER TABLE comment_test ALTER COLUMN indexed_col SET DATA TYPE int;
+ALTER TABLE comment_test ALTER COLUMN indexed_col SET DATA TYPE text;
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE int;
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE text;
+ALTER TABLE comment_test ALTER COLUMN positive_col SET DATA TYPE int;
+ALTER TABLE comment_test ALTER COLUMN positive_col SET DATA TYPE bigint;
+-- Check that the comments are intact.
+SELECT col_description('comment_test'::regclass, 1) as comment;
+ comment
+-----------------------------
+ Column 'id' on comment_test
+(1 row)
+
+SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ index | comment
+--------------------+-----------------------------------------------
+ comment_test_index | Simple index on comment_test
+ comment_test_pk | Index backing the PRIMARY KEY of comment_test
+(2 rows)
+
+SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ constraint | comment
+---------------------------------+-----------------------------------------------
+ comment_test_pk | PRIMARY KEY constraint of comment_test
+ comment_test_positive_col_check | CHECK constraint on comment_test.positive_col
+(2 rows)
+
+-- Check compatibility for foreign keys and comments. This is done
+-- separately as rebuilding the column type of the parent leads
+-- to an error and would reduce the test scope.
+CREATE TABLE comment_test_child (
+ id text CONSTRAINT comment_test_child_fk REFERENCES comment_test);
+CREATE INDEX comment_test_child_fk ON comment_test_child(id);
+COMMENT ON COLUMN comment_test_child.id IS 'Column ''id'' on comment_test_child';
+COMMENT ON INDEX comment_test_child_fk IS 'Index backing the FOREIGN KEY of comment_test_child';
+COMMENT ON CONSTRAINT comment_test_child_fk ON comment_test_child IS 'FOREIGN KEY constraint of comment_test_child';
+-- Change column type of parent
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE text;
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE int USING id::integer;
+ERROR: foreign key constraint "comment_test_child_fk" cannot be implemented
+DETAIL: Key columns "id" and "id" are of incompatible types: text and integer.
+-- Comments should be intact
+SELECT col_description('comment_test_child'::regclass, 1) as comment;
+ comment
+-----------------------------------
+ Column 'id' on comment_test_child
+(1 row)
+
+SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test_child'::regclass ORDER BY 1, 2;
+ index | comment
+-----------------------+-----------------------------------------------------
+ comment_test_child_fk | Index backing the FOREIGN KEY of comment_test_child
+(1 row)
+
+SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test_child'::regclass ORDER BY 1, 2;
+ constraint | comment
+-----------------------+----------------------------------------------
+ comment_test_child_fk | FOREIGN KEY constraint of comment_test_child
+(1 row)
+
+-- Check that we map relation oids to filenodes and back correctly. Only
+-- display bad mappings so the test output doesn't change all the time. A
+-- filenode function call can return NULL for a relation dropped concurrently
+-- with the call's surrounding query, so ignore a NULL mapped_oid for
+-- relations that no longer exist after all calls finish.
+CREATE TEMP TABLE filenode_mapping AS
+SELECT
+ oid, mapped_oid, reltablespace, relfilenode, relname
+FROM pg_class,
+ pg_filenode_relation(reltablespace, pg_relation_filenode(oid)) AS mapped_oid
+WHERE relkind IN ('r', 'i', 'S', 't', 'm') AND mapped_oid IS DISTINCT FROM oid;
+SELECT m.* FROM filenode_mapping m LEFT JOIN pg_class c ON c.oid = m.oid
+WHERE c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL;
+ oid | mapped_oid | reltablespace | relfilenode | relname
+-----+------------+---------------+-------------+---------
+(0 rows)
+
+-- Checks on creation and manipulation of user-defined relations in
+-- pg_catalog.
+SHOW allow_system_table_mods;
+ allow_system_table_mods
+-------------------------
+ off
+(1 row)
+
+-- disallowed because of search_path issues with pg_dump
+CREATE TABLE pg_catalog.new_system_table();
+ERROR: permission denied to create "pg_catalog.new_system_table"
+DETAIL: System catalog modifications are currently disallowed.
+-- instead create in public first, move to catalog
+CREATE TABLE new_system_table(id serial primary key, othercol text);
+ALTER TABLE new_system_table SET SCHEMA pg_catalog;
+ALTER TABLE new_system_table SET SCHEMA public;
+ALTER TABLE new_system_table SET SCHEMA pg_catalog;
+-- will be ignored -- already there:
+ALTER TABLE new_system_table SET SCHEMA pg_catalog;
+ALTER TABLE new_system_table RENAME TO old_system_table;
+CREATE INDEX old_system_table__othercol ON old_system_table (othercol);
+INSERT INTO old_system_table(othercol) VALUES ('somedata'), ('otherdata');
+UPDATE old_system_table SET id = -id;
+DELETE FROM old_system_table WHERE othercol = 'somedata';
+TRUNCATE old_system_table;
+ALTER TABLE old_system_table DROP CONSTRAINT new_system_table_pkey;
+ALTER TABLE old_system_table DROP COLUMN othercol;
+DROP TABLE old_system_table;
+-- set logged
+CREATE UNLOGGED TABLE unlogged1(f1 SERIAL PRIMARY KEY, f2 TEXT);
+-- check relpersistence of an unlogged table
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^unlogged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+------------------+---------+----------------
+ unlogged1 | r | u
+ unlogged1_f1_seq | S | p
+ unlogged1_pkey | i | u
+(3 rows)
+
+CREATE UNLOGGED TABLE unlogged2(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES unlogged1); -- foreign key
+CREATE UNLOGGED TABLE unlogged3(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES unlogged3); -- self-referencing foreign key
+ALTER TABLE unlogged3 SET LOGGED; -- skip self-referencing foreign key
+ALTER TABLE unlogged2 SET LOGGED; -- fails because a foreign key to an unlogged table exists
+ERROR: could not change table "unlogged2" to logged because it references unlogged table "unlogged1"
+ALTER TABLE unlogged1 SET LOGGED;
+-- check relpersistence of an unlogged table after changing to permanent
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^unlogged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+------------------+---------+----------------
+ unlogged1 | r | p
+ unlogged1_f1_seq | S | p
+ unlogged1_pkey | i | p
+(3 rows)
+
+ALTER TABLE unlogged1 SET LOGGED; -- silently do nothing
+DROP TABLE unlogged3;
+DROP TABLE unlogged2;
+DROP TABLE unlogged1;
+-- set unlogged
+CREATE TABLE logged1(f1 SERIAL PRIMARY KEY, f2 TEXT);
+-- check relpersistence of a permanent table
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^logged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^logged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^logged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+----------------+---------+----------------
+ logged1 | r | p
+ logged1_f1_seq | S | p
+ logged1_pkey | i | p
+(3 rows)
+
+CREATE TABLE logged2(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES logged1); -- foreign key
+CREATE TABLE logged3(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES logged3); -- self-referencing foreign key
+ALTER TABLE logged1 SET UNLOGGED; -- fails because a foreign key from a permanent table exists
+ERROR: could not change table "logged1" to unlogged because it references logged table "logged2"
+ALTER TABLE logged3 SET UNLOGGED; -- skip self-referencing foreign key
+ALTER TABLE logged2 SET UNLOGGED;
+ALTER TABLE logged1 SET UNLOGGED;
+-- check relpersistence of a permanent table after changing to unlogged
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^logged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^logged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^logged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+----------------+---------+----------------
+ logged1 | r | u
+ logged1_f1_seq | S | p
+ logged1_pkey | i | u
+(3 rows)
+
+ALTER TABLE logged1 SET UNLOGGED; -- silently do nothing
+DROP TABLE logged3;
+DROP TABLE logged2;
+DROP TABLE logged1;
+-- test ADD COLUMN IF NOT EXISTS
+CREATE TABLE test_add_column(c1 integer);
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN c2 integer;
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN c2 integer; -- fail because c2 already exists
+ERROR: column "c2" of relation "test_add_column" already exists
+ALTER TABLE ONLY test_add_column
+ ADD COLUMN c2 integer; -- fail because c2 already exists
+ERROR: column "c2" of relation "test_add_column" already exists
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer; -- skipping because c2 already exists
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+ALTER TABLE ONLY test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer; -- skipping because c2 already exists
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN c2 integer, -- fail because c2 already exists
+ ADD COLUMN c3 integer primary key;
+ERROR: column "c2" of relation "test_add_column" already exists
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
+ ADD COLUMN c3 integer primary key;
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | not null |
+Indexes:
+ "test_add_column_pkey" PRIMARY KEY, btree (c3)
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
+ ADD COLUMN IF NOT EXISTS c3 integer primary key; -- skipping because c3 already exists
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+NOTICE: column "c3" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | not null |
+Indexes:
+ "test_add_column_pkey" PRIMARY KEY, btree (c3)
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
+ ADD COLUMN IF NOT EXISTS c3 integer, -- skipping because c3 already exists
+ ADD COLUMN c4 integer REFERENCES test_add_column;
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+NOTICE: column "c3" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | not null |
+ c4 | integer | | |
+Indexes:
+ "test_add_column_pkey" PRIMARY KEY, btree (c3)
+Foreign-key constraints:
+ "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+Referenced by:
+ TABLE "test_add_column" CONSTRAINT "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c4 integer REFERENCES test_add_column;
+NOTICE: column "c4" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | not null |
+ c4 | integer | | |
+Indexes:
+ "test_add_column_pkey" PRIMARY KEY, btree (c3)
+Foreign-key constraints:
+ "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+Referenced by:
+ TABLE "test_add_column" CONSTRAINT "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c5 SERIAL CHECK (c5 > 8);
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------------------------------------------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | not null |
+ c4 | integer | | |
+ c5 | integer | | not null | nextval('test_add_column_c5_seq'::regclass)
+Indexes:
+ "test_add_column_pkey" PRIMARY KEY, btree (c3)
+Check constraints:
+ "test_add_column_c5_check" CHECK (c5 > 8)
+Foreign-key constraints:
+ "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+Referenced by:
+ TABLE "test_add_column" CONSTRAINT "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c5 SERIAL CHECK (c5 > 10);
+NOTICE: column "c5" of relation "test_add_column" already exists, skipping
+\d test_add_column*
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------------------------------------------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | not null |
+ c4 | integer | | |
+ c5 | integer | | not null | nextval('test_add_column_c5_seq'::regclass)
+Indexes:
+ "test_add_column_pkey" PRIMARY KEY, btree (c3)
+Check constraints:
+ "test_add_column_c5_check" CHECK (c5 > 8)
+Foreign-key constraints:
+ "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+Referenced by:
+ TABLE "test_add_column" CONSTRAINT "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+
+ Sequence "public.test_add_column_c5_seq"
+ Type | Start | Minimum | Maximum | Increment | Cycles? | Cache
+---------+-------+---------+------------+-----------+---------+-------
+ integer | 1 | 1 | 2147483647 | 1 | no | 1
+Owned by: public.test_add_column.c5
+
+ Index "public.test_add_column_pkey"
+ Column | Type | Key? | Definition
+--------+---------+------+------------
+ c3 | integer | yes | c3
+primary key, btree, for table "public.test_add_column"
+
+DROP TABLE test_add_column;
+\d test_add_column*
+-- assorted cases with multiple ALTER TABLE steps
+CREATE TABLE ataddindex(f1 INT);
+INSERT INTO ataddindex VALUES (42), (43);
+CREATE UNIQUE INDEX ataddindexi0 ON ataddindex(f1);
+ALTER TABLE ataddindex
+ ADD PRIMARY KEY USING INDEX ataddindexi0,
+ ALTER f1 TYPE BIGINT;
+\d ataddindex
+ Table "public.ataddindex"
+ Column | Type | Collation | Nullable | Default
+--------+--------+-----------+----------+---------
+ f1 | bigint | | not null |
+Indexes:
+ "ataddindexi0" PRIMARY KEY, btree (f1)
+
+DROP TABLE ataddindex;
+CREATE TABLE ataddindex(f1 VARCHAR(10));
+INSERT INTO ataddindex(f1) VALUES ('foo'), ('a');
+ALTER TABLE ataddindex
+ ALTER f1 SET DATA TYPE TEXT,
+ ADD EXCLUDE ((f1 LIKE 'a') WITH =);
+\d ataddindex
+ Table "public.ataddindex"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+Indexes:
+ "ataddindex_expr_excl" EXCLUDE USING btree ((f1 ~~ 'a'::text) WITH =)
+
+DROP TABLE ataddindex;
+-- unsupported constraint types for partitioned tables
+CREATE TABLE partitioned (
+ a int,
+ b int
+) PARTITION BY RANGE (a, (a+b+1));
+ALTER TABLE partitioned ADD EXCLUDE USING gist (a WITH &&);
+ERROR: exclusion constraints are not supported on partitioned tables
+LINE 1: ALTER TABLE partitioned ADD EXCLUDE USING gist (a WITH &&);
+ ^
+-- cannot drop column that is part of the partition key
+ALTER TABLE partitioned DROP COLUMN a;
+ERROR: cannot drop column "a" because it is part of the partition key of relation "partitioned"
+ALTER TABLE partitioned ALTER COLUMN a TYPE char(5);
+ERROR: cannot alter column "a" because it is part of the partition key of relation "partitioned"
+ALTER TABLE partitioned DROP COLUMN b;
+ERROR: cannot drop column "b" because it is part of the partition key of relation "partitioned"
+ALTER TABLE partitioned ALTER COLUMN b TYPE char(5);
+ERROR: cannot alter column "b" because it is part of the partition key of relation "partitioned"
+-- partitioned table cannot participate in regular inheritance
+CREATE TABLE nonpartitioned (
+ a int,
+ b int
+);
+ALTER TABLE partitioned INHERIT nonpartitioned;
+ERROR: cannot change inheritance of partitioned table
+ALTER TABLE nonpartitioned INHERIT partitioned;
+ERROR: cannot inherit from partitioned table "partitioned"
+-- cannot add NO INHERIT constraint to partitioned tables
+ALTER TABLE partitioned ADD CONSTRAINT chk_a CHECK (a > 0) NO INHERIT;
+ERROR: cannot add NO INHERIT constraint to partitioned table "partitioned"
+DROP TABLE partitioned, nonpartitioned;
+--
+-- ATTACH PARTITION
+--
+-- check that target table is partitioned
+CREATE TABLE unparted (
+ a int
+);
+CREATE TABLE fail_part (like unparted);
+ALTER TABLE unparted ATTACH PARTITION fail_part FOR VALUES IN ('a');
+ERROR: table "unparted" is not partitioned
+DROP TABLE unparted, fail_part;
+-- check that partition bound is compatible
+CREATE TABLE list_parted (
+ a int NOT NULL,
+ b char(2) COLLATE "C",
+ CONSTRAINT check_a CHECK (a > 0)
+) PARTITION BY LIST (a);
+CREATE TABLE fail_part (LIKE list_parted);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES FROM (1) TO (10);
+ERROR: invalid bound specification for a list partition
+LINE 1: ...list_parted ATTACH PARTITION fail_part FOR VALUES FROM (1) T...
+ ^
+DROP TABLE fail_part;
+-- check that the table being attached exists
+ALTER TABLE list_parted ATTACH PARTITION nonexistent FOR VALUES IN (1);
+ERROR: relation "nonexistent" does not exist
+-- check ownership of the source table
+CREATE ROLE regress_test_me;
+CREATE ROLE regress_test_not_me;
+CREATE TABLE not_owned_by_me (LIKE list_parted);
+ALTER TABLE not_owned_by_me OWNER TO regress_test_not_me;
+SET SESSION AUTHORIZATION regress_test_me;
+CREATE TABLE owned_by_me (
+ a int
+) PARTITION BY LIST (a);
+ALTER TABLE owned_by_me ATTACH PARTITION not_owned_by_me FOR VALUES IN (1);
+ERROR: must be owner of table not_owned_by_me
+RESET SESSION AUTHORIZATION;
+DROP TABLE owned_by_me, not_owned_by_me;
+DROP ROLE regress_test_not_me;
+DROP ROLE regress_test_me;
+-- check that the table being attached is not part of regular inheritance
+CREATE TABLE parent (LIKE list_parted);
+CREATE TABLE child () INHERITS (parent);
+ALTER TABLE list_parted ATTACH PARTITION child FOR VALUES IN (1);
+ERROR: cannot attach inheritance child as partition
+ALTER TABLE list_parted ATTACH PARTITION parent FOR VALUES IN (1);
+ERROR: cannot attach inheritance parent as partition
+DROP TABLE parent CASCADE;
+NOTICE: drop cascades to table child
+-- check any TEMP-ness
+CREATE TEMP TABLE temp_parted (a int) PARTITION BY LIST (a);
+CREATE TABLE perm_part (a int);
+ALTER TABLE temp_parted ATTACH PARTITION perm_part FOR VALUES IN (1);
+ERROR: cannot attach a permanent relation as partition of temporary relation "temp_parted"
+DROP TABLE temp_parted, perm_part;
+-- check that the table being attached is not a typed table
+CREATE TYPE mytype AS (a int);
+CREATE TABLE fail_part OF mytype;
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: cannot attach a typed table as partition
+DROP TYPE mytype CASCADE;
+NOTICE: drop cascades to table fail_part
+-- check that the table being attached has only columns present in the parent
+CREATE TABLE fail_part (like list_parted, c int);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: table "fail_part" contains column "c" not found in parent "list_parted"
+DETAIL: The new partition may contain only the columns present in parent.
+DROP TABLE fail_part;
+-- check that the table being attached has every column of the parent
+CREATE TABLE fail_part (a int NOT NULL);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table is missing column "b"
+DROP TABLE fail_part;
+-- check that columns match in type, collation and NOT NULL status
+CREATE TABLE fail_part (
+ b char(3),
+ a int NOT NULL
+);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table "fail_part" has different type for column "b"
+ALTER TABLE fail_part ALTER b TYPE char (2) COLLATE "POSIX";
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table "fail_part" has different collation for column "b"
+DROP TABLE fail_part;
+-- check that the table being attached has all constraints of the parent
+CREATE TABLE fail_part (
+ b char(2) COLLATE "C",
+ a int NOT NULL
+);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table is missing constraint "check_a"
+-- check that the constraint matches in definition with parent's constraint
+ALTER TABLE fail_part ADD CONSTRAINT check_a CHECK (a >= 0);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table "fail_part" has different definition for check constraint "check_a"
+DROP TABLE fail_part;
+-- check the attributes and constraints after partition is attached
+CREATE TABLE part_1 (
+ a int NOT NULL,
+ b char(2) COLLATE "C",
+ CONSTRAINT check_a CHECK (a > 0)
+);
+ALTER TABLE list_parted ATTACH PARTITION part_1 FOR VALUES IN (1);
+-- attislocal and conislocal are always false for merged attributes and constraints respectively.
+SELECT attislocal, attinhcount FROM pg_attribute WHERE attrelid = 'part_1'::regclass AND attnum > 0;
+ attislocal | attinhcount
+------------+-------------
+ f | 1
+ f | 1
+(2 rows)
+
+SELECT conislocal, coninhcount FROM pg_constraint WHERE conrelid = 'part_1'::regclass AND conname = 'check_a';
+ conislocal | coninhcount
+------------+-------------
+ f | 1
+(1 row)
+
+-- check that the new partition won't overlap with an existing partition
+CREATE TABLE fail_part (LIKE part_1 INCLUDING CONSTRAINTS);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: partition "fail_part" would overlap partition "part_1"
+DROP TABLE fail_part;
+-- check that an existing table can be attached as a default partition
+CREATE TABLE def_part (LIKE list_parted INCLUDING CONSTRAINTS);
+ALTER TABLE list_parted ATTACH PARTITION def_part DEFAULT;
+-- check attaching default partition fails if a default partition already
+-- exists
+CREATE TABLE fail_def_part (LIKE part_1 INCLUDING CONSTRAINTS);
+ALTER TABLE list_parted ATTACH PARTITION fail_def_part DEFAULT;
+ERROR: partition "fail_def_part" conflicts with existing default partition "def_part"
+-- check validation when attaching list partitions
+CREATE TABLE list_parted2 (
+ a int,
+ b char
+) PARTITION BY LIST (a);
+-- check that violating rows are correctly reported
+CREATE TABLE part_2 (LIKE list_parted2);
+INSERT INTO part_2 VALUES (3, 'a');
+ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
+ERROR: partition constraint of relation "part_2" is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM part_2;
+ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
+-- check partition cannot be attached if default has some row for its values
+CREATE TABLE list_parted2_def PARTITION OF list_parted2 DEFAULT;
+INSERT INTO list_parted2_def VALUES (11, 'z');
+CREATE TABLE part_3 (LIKE list_parted2);
+ALTER TABLE list_parted2 ATTACH PARTITION part_3 FOR VALUES IN (11);
+ERROR: updated partition constraint for default partition "list_parted2_def" would be violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM list_parted2_def WHERE a = 11;
+ALTER TABLE list_parted2 ATTACH PARTITION part_3 FOR VALUES IN (11);
+-- adding constraints that describe the desired partition constraint
+-- (or more restrictive) will help skip the validation scan
+CREATE TABLE part_3_4 (
+ LIKE list_parted2,
+ CONSTRAINT check_a CHECK (a IN (3))
+);
+-- however, if a list partition does not accept nulls, there should be
+-- an explicit NOT NULL constraint on the partition key column for the
+-- validation scan to be skipped;
+ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
+-- adding a NOT NULL constraint will cause the scan to be skipped
+ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
+ALTER TABLE part_3_4 ALTER a SET NOT NULL;
+ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
+-- check if default partition scan skipped
+ALTER TABLE list_parted2_def ADD CONSTRAINT check_a CHECK (a IN (5, 6));
+CREATE TABLE part_55_66 PARTITION OF list_parted2 FOR VALUES IN (55, 66);
+-- check validation when attaching range partitions
+CREATE TABLE range_parted (
+ a int,
+ b int
+) PARTITION BY RANGE (a, b);
+-- check that violating rows are correctly reported
+CREATE TABLE part1 (
+ a int NOT NULL CHECK (a = 1),
+ b int NOT NULL CHECK (b >= 1 AND b <= 10)
+);
+INSERT INTO part1 VALUES (1, 10);
+-- Remember the TO bound is exclusive
+ALTER TABLE range_parted ATTACH PARTITION part1 FOR VALUES FROM (1, 1) TO (1, 10);
+ERROR: partition constraint of relation "part1" is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM part1;
+ALTER TABLE range_parted ATTACH PARTITION part1 FOR VALUES FROM (1, 1) TO (1, 10);
+-- adding constraints that describe the desired partition constraint
+-- (or more restrictive) will help skip the validation scan
+CREATE TABLE part2 (
+ a int NOT NULL CHECK (a = 1),
+ b int NOT NULL CHECK (b >= 10 AND b < 18)
+);
+ALTER TABLE range_parted ATTACH PARTITION part2 FOR VALUES FROM (1, 10) TO (1, 20);
+-- Create default partition
+CREATE TABLE partr_def1 PARTITION OF range_parted DEFAULT;
+-- Only one default partition is allowed, hence, following should give error
+CREATE TABLE partr_def2 (LIKE part1 INCLUDING CONSTRAINTS);
+ALTER TABLE range_parted ATTACH PARTITION partr_def2 DEFAULT;
+ERROR: partition "partr_def2" conflicts with existing default partition "partr_def1"
+-- Overlapping partitions cannot be attached, hence, following should give error
+INSERT INTO partr_def1 VALUES (2, 10);
+CREATE TABLE part3 (LIKE range_parted);
+ALTER TABLE range_parted ATTACH partition part3 FOR VALUES FROM (2, 10) TO (2, 20);
+ERROR: updated partition constraint for default partition "partr_def1" would be violated by some row
+-- Attaching partitions should be successful when there are no overlapping rows
+ALTER TABLE range_parted ATTACH partition part3 FOR VALUES FROM (3, 10) TO (3, 20);
+-- check that leaf partitions are scanned when attaching a partitioned
+-- table
+CREATE TABLE part_5 (
+ LIKE list_parted2
+) PARTITION BY LIST (b);
+-- check that violating rows are correctly reported
+CREATE TABLE part_5_a PARTITION OF part_5 FOR VALUES IN ('a');
+INSERT INTO part_5_a (a, b) VALUES (6, 'a');
+ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
+ERROR: partition constraint of relation "part_5_a" is violated by some row
+-- delete the faulting row and also add a constraint to skip the scan
+DELETE FROM part_5_a WHERE a NOT IN (3);
+ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 5);
+ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
+ALTER TABLE list_parted2 DETACH PARTITION part_5;
+ALTER TABLE part_5 DROP CONSTRAINT check_a;
+-- scan should again be skipped, even though NOT NULL is now a column property
+ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IN (5)), ALTER a SET NOT NULL;
+ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
+-- Check the case where the attnos of the partitioning columns in the table being
+-- attached differ from those in the parent. It should not affect the constraint-
+-- checking logic that allows the scan to be skipped.
+CREATE TABLE part_6 (
+ c int,
+ LIKE list_parted2,
+ CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 6)
+);
+ALTER TABLE part_6 DROP c;
+ALTER TABLE list_parted2 ATTACH PARTITION part_6 FOR VALUES IN (6);
+-- Similar to above, but the table being attached is a partitioned table
+-- whose partition has still different attnos for the root partitioning
+-- columns.
+CREATE TABLE part_7 (
+ LIKE list_parted2,
+ CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 7)
+) PARTITION BY LIST (b);
+CREATE TABLE part_7_a_null (
+ c int,
+ d int,
+ e int,
+ LIKE list_parted2, -- 'a' will have attnum = 4
+ CONSTRAINT check_b CHECK (b IS NULL OR b = 'a'),
+ CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 7)
+);
+ALTER TABLE part_7_a_null DROP c, DROP d, DROP e;
+ALTER TABLE part_7 ATTACH PARTITION part_7_a_null FOR VALUES IN ('a', null);
+ALTER TABLE list_parted2 ATTACH PARTITION part_7 FOR VALUES IN (7);
+-- Same example, but check this time that the constraint correctly detects
+-- violating rows
+ALTER TABLE list_parted2 DETACH PARTITION part_7;
+ALTER TABLE part_7 DROP CONSTRAINT check_a; -- thusly, scan won't be skipped
+INSERT INTO part_7 (a, b) VALUES (8, null), (9, 'a');
+SELECT tableoid::regclass, a, b FROM part_7 order by a;
+ tableoid | a | b
+---------------+---+---
+ part_7_a_null | 8 |
+ part_7_a_null | 9 | a
+(2 rows)
+
+ALTER TABLE list_parted2 ATTACH PARTITION part_7 FOR VALUES IN (7);
+ERROR: partition constraint of relation "part_7_a_null" is violated by some row
+-- check that leaf partitions of default partition are scanned when
+-- attaching a partitioned table.
+ALTER TABLE part_5 DROP CONSTRAINT check_a;
+CREATE TABLE part5_def PARTITION OF part_5 DEFAULT PARTITION BY LIST(a);
+CREATE TABLE part5_def_p1 PARTITION OF part5_def FOR VALUES IN (5);
+INSERT INTO part5_def_p1 VALUES (5, 'y');
+CREATE TABLE part5_p1 (LIKE part_5);
+ALTER TABLE part_5 ATTACH PARTITION part5_p1 FOR VALUES IN ('y');
+ERROR: updated partition constraint for default partition "part5_def_p1" would be violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM part5_def_p1 WHERE b = 'y';
+ALTER TABLE part_5 ATTACH PARTITION part5_p1 FOR VALUES IN ('y');
+-- check that the table being attached is not already a partition
+ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
+ERROR: "part_2" is already a partition
+-- check that circular inheritance is not allowed
+ALTER TABLE part_5 ATTACH PARTITION list_parted2 FOR VALUES IN ('b');
+ERROR: circular inheritance not allowed
+DETAIL: "part_5" is already a child of "list_parted2".
+ALTER TABLE list_parted2 ATTACH PARTITION list_parted2 FOR VALUES IN (0);
+ERROR: circular inheritance not allowed
+DETAIL: "list_parted2" is already a child of "list_parted2".
+-- If a partitioned table being created or an existing table being attached
+-- as a partition does not have a constraint that would allow validation scan
+-- to be skipped, but an individual partition does, then the partition's
+-- validation scan is skipped.
+CREATE TABLE quuux (a int, b text) PARTITION BY LIST (a);
+CREATE TABLE quuux_default PARTITION OF quuux DEFAULT PARTITION BY LIST (b);
+CREATE TABLE quuux_default1 PARTITION OF quuux_default (
+ CONSTRAINT check_1 CHECK (a IS NOT NULL AND a = 1)
+) FOR VALUES IN ('b');
+CREATE TABLE quuux1 (a int, b text);
+ALTER TABLE quuux ATTACH PARTITION quuux1 FOR VALUES IN (1); -- validate!
+CREATE TABLE quuux2 (a int, b text);
+ALTER TABLE quuux ATTACH PARTITION quuux2 FOR VALUES IN (2); -- skip validation
+DROP TABLE quuux1, quuux2;
+-- should validate for quuux1, but not for quuux2
+CREATE TABLE quuux1 PARTITION OF quuux FOR VALUES IN (1);
+CREATE TABLE quuux2 PARTITION OF quuux FOR VALUES IN (2);
+DROP TABLE quuux;
+-- check validation when attaching hash partitions
+-- Use hand-rolled hash functions and operator class to get predictable results
+-- on different machines. part_test_int4_ops is defined in insert.sql.
+-- check that the new partition won't overlap with an existing partition
+CREATE TABLE hash_parted (
+ a int,
+ b int
+) PARTITION BY HASH (a part_test_int4_ops);
+CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH (MODULUS 4, REMAINDER 0);
+CREATE TABLE fail_part (LIKE hpart_1);
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 4);
+ERROR: partition "fail_part" would overlap partition "hpart_1"
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 0);
+ERROR: partition "fail_part" would overlap partition "hpart_1"
+DROP TABLE fail_part;
+-- check validation when attaching hash partitions
+-- check that violating rows are correctly reported
+CREATE TABLE hpart_2 (LIKE hash_parted);
+INSERT INTO hpart_2 VALUES (3, 0);
+ALTER TABLE hash_parted ATTACH PARTITION hpart_2 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
+ERROR: partition constraint of relation "hpart_2" is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM hpart_2;
+ALTER TABLE hash_parted ATTACH PARTITION hpart_2 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
+-- check that leaf partitions are scanned when attaching a partitioned
+-- table
+CREATE TABLE hpart_5 (
+ LIKE hash_parted
+) PARTITION BY LIST (b);
+-- check that violating rows are correctly reported
+CREATE TABLE hpart_5_a PARTITION OF hpart_5 FOR VALUES IN ('1', '2', '3');
+INSERT INTO hpart_5_a (a, b) VALUES (7, 1);
+ALTER TABLE hash_parted ATTACH PARTITION hpart_5 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
+ERROR: partition constraint of relation "hpart_5_a" is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM hpart_5_a;
+ALTER TABLE hash_parted ATTACH PARTITION hpart_5 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
+-- check that the table being attached has a valid modulus and remainder value
+CREATE TABLE fail_part(LIKE hash_parted);
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 0, REMAINDER 1);
+ERROR: modulus for hash partition must be a positive integer
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 8);
+ERROR: remainder for hash partition must be less than modulus
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 3, REMAINDER 2);
+ERROR: every hash partition modulus must be a factor of the next larger modulus
+DROP TABLE fail_part;
+--
+-- DETACH PARTITION
+--
+-- check that the table is partitioned at all
+CREATE TABLE regular_table (a int);
+ALTER TABLE regular_table DETACH PARTITION any_name;
+ERROR: table "regular_table" is not partitioned
+DROP TABLE regular_table;
+-- check that the partition being detached exists at all
+ALTER TABLE list_parted2 DETACH PARTITION part_4;
+ERROR: relation "part_4" does not exist
+ALTER TABLE hash_parted DETACH PARTITION hpart_4;
+ERROR: relation "hpart_4" does not exist
+-- check that the partition being detached is actually a partition of the parent
+CREATE TABLE not_a_part (a int);
+ALTER TABLE list_parted2 DETACH PARTITION not_a_part;
+ERROR: relation "not_a_part" is not a partition of relation "list_parted2"
+ALTER TABLE list_parted2 DETACH PARTITION part_1;
+ERROR: relation "part_1" is not a partition of relation "list_parted2"
+ALTER TABLE hash_parted DETACH PARTITION not_a_part;
+ERROR: relation "not_a_part" is not a partition of relation "hash_parted"
+DROP TABLE not_a_part;
+-- check that, after being detached, attinhcount/coninhcount is dropped to 0 and
+-- attislocal/conislocal is set to true
+ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
+SELECT attinhcount, attislocal FROM pg_attribute WHERE attrelid = 'part_3_4'::regclass AND attnum > 0;
+ attinhcount | attislocal
+-------------+------------
+ 0 | t
+ 0 | t
+(2 rows)
+
+SELECT coninhcount, conislocal FROM pg_constraint WHERE conrelid = 'part_3_4'::regclass AND conname = 'check_a';
+ coninhcount | conislocal
+-------------+------------
+ 0 | t
+(1 row)
+
+DROP TABLE part_3_4;
+-- check that a detached partition is not dropped on dropping a partitioned table
+CREATE TABLE range_parted2 (
+ a int
+) PARTITION BY RANGE(a);
+CREATE TABLE part_rp PARTITION OF range_parted2 FOR VALUES FROM (0) to (100);
+ALTER TABLE range_parted2 DETACH PARTITION part_rp;
+DROP TABLE range_parted2;
+SELECT * from part_rp;
+ a
+---
+(0 rows)
+
+DROP TABLE part_rp;
+-- Check ALTER TABLE commands for partitioned tables and partitions
+-- cannot add/drop column to/from *only* the parent
+ALTER TABLE ONLY list_parted2 ADD COLUMN c int;
+ERROR: column must be added to child tables too
+ALTER TABLE ONLY list_parted2 DROP COLUMN b;
+ERROR: cannot drop column from only the partitioned table when partitions exist
+HINT: Do not specify the ONLY keyword.
+-- cannot add a column to partition or drop an inherited one
+ALTER TABLE part_2 ADD COLUMN c text;
+ERROR: cannot add column to a partition
+ALTER TABLE part_2 DROP COLUMN b;
+ERROR: cannot drop inherited column "b"
+-- Nor rename, alter type
+ALTER TABLE part_2 RENAME COLUMN b to c;
+ERROR: cannot rename inherited column "b"
+ALTER TABLE part_2 ALTER COLUMN b TYPE text;
+ERROR: cannot alter inherited column "b"
+-- cannot add/drop NOT NULL or check constraints to *only* the parent, when
+-- partitions exist
+ALTER TABLE ONLY list_parted2 ALTER b SET NOT NULL;
+ERROR: constraint must be added to child tables too
+DETAIL: Column "b" of relation "part_2" is not already NOT NULL.
+HINT: Do not specify the ONLY keyword.
+ALTER TABLE ONLY list_parted2 ADD CONSTRAINT check_b CHECK (b <> 'zz');
+ERROR: constraint must be added to child tables too
+ALTER TABLE list_parted2 ALTER b SET NOT NULL;
+ALTER TABLE ONLY list_parted2 ALTER b DROP NOT NULL;
+ERROR: cannot remove constraint from only the partitioned table when partitions exist
+HINT: Do not specify the ONLY keyword.
+ALTER TABLE list_parted2 ADD CONSTRAINT check_b CHECK (b <> 'zz');
+ALTER TABLE ONLY list_parted2 DROP CONSTRAINT check_b;
+ERROR: cannot remove constraint from only the partitioned table when partitions exist
+HINT: Do not specify the ONLY keyword.
+-- It's alright though, if no partitions are yet created
+CREATE TABLE parted_no_parts (a int) PARTITION BY LIST (a);
+ALTER TABLE ONLY parted_no_parts ALTER a SET NOT NULL;
+ALTER TABLE ONLY parted_no_parts ADD CONSTRAINT check_a CHECK (a > 0);
+ALTER TABLE ONLY parted_no_parts ALTER a DROP NOT NULL;
+ALTER TABLE ONLY parted_no_parts DROP CONSTRAINT check_a;
+DROP TABLE parted_no_parts;
+-- cannot drop inherited NOT NULL or check constraints from partition
+ALTER TABLE list_parted2 ALTER b SET NOT NULL, ADD CONSTRAINT check_a2 CHECK (a > 0);
+ALTER TABLE part_2 ALTER b DROP NOT NULL;
+ERROR: column "b" is marked NOT NULL in parent table
+ALTER TABLE part_2 DROP CONSTRAINT check_a2;
+ERROR: cannot drop inherited constraint "check_a2" of relation "part_2"
+-- Doesn't make sense to add NO INHERIT constraints on partitioned tables
+ALTER TABLE list_parted2 add constraint check_b2 check (b <> 'zz') NO INHERIT;
+ERROR: cannot add NO INHERIT constraint to partitioned table "list_parted2"
+-- check that a partition cannot participate in regular inheritance
+CREATE TABLE inh_test () INHERITS (part_2);
+ERROR: cannot inherit from partition "part_2"
+CREATE TABLE inh_test (LIKE part_2);
+ALTER TABLE inh_test INHERIT part_2;
+ERROR: cannot inherit from a partition
+ALTER TABLE part_2 INHERIT inh_test;
+ERROR: cannot change inheritance of a partition
+-- cannot drop or alter type of partition key columns of lower level
+-- partitioned tables; for example, part_5, which is list_parted2's
+-- partition, is partitioned on b;
+ALTER TABLE list_parted2 DROP COLUMN b;
+ERROR: cannot drop column "b" because it is part of the partition key of relation "part_5"
+ALTER TABLE list_parted2 ALTER COLUMN b TYPE text;
+ERROR: cannot alter column "b" because it is part of the partition key of relation "part_5"
+-- dropping non-partition key columns should be allowed on the parent table.
+ALTER TABLE list_parted DROP COLUMN b;
+SELECT * FROM list_parted;
+ a
+---
+(0 rows)
+
+-- cleanup
+DROP TABLE list_parted, list_parted2, range_parted;
+DROP TABLE fail_def_part;
+DROP TABLE hash_parted;
+-- more tests for certain multi-level partitioning scenarios
+create table p (a int, b int) partition by range (a, b);
+create table p1 (b int, a int not null) partition by range (b);
+create table p11 (like p1);
+alter table p11 drop a;
+alter table p11 add a int;
+alter table p11 drop a;
+alter table p11 add a int not null;
+-- attnum for key attribute 'a' is different in p, p1, and p11
+select attrelid::regclass, attname, attnum
+from pg_attribute
+where attname = 'a'
+ and (attrelid = 'p'::regclass
+ or attrelid = 'p1'::regclass
+ or attrelid = 'p11'::regclass)
+order by attrelid::regclass::text;
+ attrelid | attname | attnum
+----------+---------+--------
+ p | a | 1
+ p1 | a | 2
+ p11 | a | 4
+(3 rows)
+
+alter table p1 attach partition p11 for values from (2) to (5);
+insert into p1 (a, b) values (2, 3);
+-- check that partition validation scan correctly detects violating rows
+alter table p attach partition p1 for values from (1, 2) to (1, 10);
+ERROR: partition constraint of relation "p11" is violated by some row
+-- cleanup
+drop table p;
+drop table p1;
+-- validate constraint on partitioned tables should only scan leaf partitions
+create table parted_validate_test (a int) partition by list (a);
+create table parted_validate_test_1 partition of parted_validate_test for values in (0, 1);
+alter table parted_validate_test add constraint parted_validate_test_chka check (a > 0) not valid;
+alter table parted_validate_test validate constraint parted_validate_test_chka;
+drop table parted_validate_test;
+-- test alter column options
+CREATE TABLE attmp(i integer);
+INSERT INTO attmp VALUES (1);
+ALTER TABLE attmp ALTER COLUMN i SET (n_distinct = 1, n_distinct_inherited = 2);
+ALTER TABLE attmp ALTER COLUMN i RESET (n_distinct_inherited);
+ANALYZE attmp;
+DROP TABLE attmp;
+DROP USER regress_alter_table_user1;
+-- check that violating rows are correctly reported when attaching as the
+-- default partition
+create table defpart_attach_test (a int) partition by list (a);
+create table defpart_attach_test1 partition of defpart_attach_test for values in (1);
+create table defpart_attach_test_d (b int, a int);
+alter table defpart_attach_test_d drop b;
+insert into defpart_attach_test_d values (1), (2);
+-- error because its constraint as the default partition would be violated
+-- by the row containing 1
+alter table defpart_attach_test attach partition defpart_attach_test_d default;
+ERROR: partition constraint of relation "defpart_attach_test_d" is violated by some row
+delete from defpart_attach_test_d where a = 1;
+alter table defpart_attach_test_d add check (a > 1);
+-- should be attached successfully and without needing to be scanned
+alter table defpart_attach_test attach partition defpart_attach_test_d default;
+-- check that attaching a partition correctly reports any rows in the default
+-- partition that should not be there for the new partition to be attached
+-- successfully
+create table defpart_attach_test_2 (like defpart_attach_test_d);
+alter table defpart_attach_test attach partition defpart_attach_test_2 for values in (2);
+ERROR: updated partition constraint for default partition "defpart_attach_test_d" would be violated by some row
+drop table defpart_attach_test;
+-- check combinations of temporary and permanent relations when attaching
+-- partitions.
+create table perm_part_parent (a int) partition by list (a);
+create temp table temp_part_parent (a int) partition by list (a);
+create table perm_part_child (a int);
+create temp table temp_part_child (a int);
+alter table temp_part_parent attach partition perm_part_child default; -- error
+ERROR: cannot attach a permanent relation as partition of temporary relation "temp_part_parent"
+alter table perm_part_parent attach partition temp_part_child default; -- error
+ERROR: cannot attach a temporary relation as partition of permanent relation "perm_part_parent"
+alter table temp_part_parent attach partition temp_part_child default; -- ok
+drop table perm_part_parent cascade;
+drop table temp_part_parent cascade;
+-- check that attaching partitions to a table while it is being used is
+-- prevented
+create table tab_part_attach (a int) partition by list (a);
+create or replace function func_part_attach() returns trigger
+ language plpgsql as $$
+ begin
+ execute 'create table tab_part_attach_1 (a int)';
+ execute 'alter table tab_part_attach attach partition tab_part_attach_1 for values in (1)';
+ return null;
+ end $$;
+create trigger trig_part_attach before insert on tab_part_attach
+ for each statement execute procedure func_part_attach();
+insert into tab_part_attach values (1);
+ERROR: cannot ALTER TABLE "tab_part_attach" because it is being used by active queries in this session
+CONTEXT: SQL statement "alter table tab_part_attach attach partition tab_part_attach_1 for values in (1)"
+PL/pgSQL function func_part_attach() line 4 at EXECUTE
+drop table tab_part_attach;
+drop function func_part_attach();
+-- test case where the partitioning operator is a SQL function whose
+-- evaluation results in the table's relcache being rebuilt partway through
+-- the execution of an ATTACH PARTITION command
+create function at_test_sql_partop (int4, int4) returns int language sql
+as $$ select case when $1 = $2 then 0 when $1 > $2 then 1 else -1 end; $$;
+create operator class at_test_sql_partop for type int4 using btree as
+ operator 1 < (int4, int4), operator 2 <= (int4, int4),
+ operator 3 = (int4, int4), operator 4 >= (int4, int4),
+ operator 5 > (int4, int4), function 1 at_test_sql_partop(int4, int4);
+create table at_test_sql_partop (a int) partition by range (a at_test_sql_partop);
+create table at_test_sql_partop_1 (a int);
+alter table at_test_sql_partop attach partition at_test_sql_partop_1 for values from (0) to (10);
+drop table at_test_sql_partop;
+drop operator class at_test_sql_partop using btree;
+drop function at_test_sql_partop;
+/* Test case for bug #16242 */
+-- We create a parent and child where the child has missing
+-- non-null attribute values, and arrange to pass them through
+-- tuple conversion from the child to the parent tupdesc
+create table bar1 (a integer, b integer not null default 1)
+ partition by range (a);
+create table bar2 (a integer);
+insert into bar2 values (1);
+alter table bar2 add column b integer not null default 1;
+-- (at this point bar2 contains tuple with natts=1)
+alter table bar1 attach partition bar2 default;
+-- this works:
+select * from bar1;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+-- this exercises tuple conversion:
+create function xtrig()
+ returns trigger language plpgsql
+as $$
+ declare
+ r record;
+ begin
+ for r in select * from old loop
+ raise info 'a=%, b=%', r.a, r.b;
+ end loop;
+ return NULL;
+ end;
+$$;
+create trigger xtrig
+ after update on bar1
+ referencing old table as old
+ for each statement execute procedure xtrig();
+update bar1 set a = a + 1;
+INFO: a=1, b=1
+/* End test case for bug #16242 */
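A side note on the mechanics behind the case above: since PostgreSQL 11, ALTER TABLE ... ADD COLUMN ... DEFAULT <constant> does not rewrite existing rows, so bar2's pre-existing tuple really is stored with natts=1 and the default is filled in at read time from the catalog. An illustrative query (not part of the expected file) that makes this visible, assuming the bar2 table created above:

SELECT attname, atthasmissing, attmissingval
FROM pg_attribute
WHERE attrelid = 'bar2'::regclass AND attnum > 0;
-- atthasmissing should be true for column "b", with attmissingval recording
-- the default (1) that tuple conversion has to supply for the old row.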
diff --git src/test/regress/expected/box.out src/test/regress/expected/box.out
index 4d0f169214..f5c0ec106d 100644
--- src/test/regress/expected/box.out
+++ src/test/regress/expected/box.out
@@ -253,6 +253,7 @@ INSERT INTO box_temp
('(-infinity,0)(0,infinity)'),
('(-infinity,-infinity)(infinity,infinity)');
SET enable_seqscan = false;
+SET enable_bitmapscan = false;
SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
f1
----------------------------
@@ -476,6 +477,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM box_temp WHERE f1 ~= '(20,20),(40,40)';
(2 rows)

RESET enable_seqscan;
+RESET enable_bitmapscan;
DROP INDEX box_spgist;
--
-- Test the SP-GiST index on the larger volume of data
diff --git src/test/regress/expected/brin_1.out src/test/regress/expected/brin_1.out
new file mode 100644
index 0000000000..8c85680368
--- /dev/null
+++ src/test/regress/expected/brin_1.out
@@ -0,0 +1,528 @@
+CREATE TABLE brintest (byteacol bytea,
+ charcol "char",
+ namecol name,
+ int8col bigint,
+ int2col smallint,
+ int4col integer,
+ textcol text,
+ oidcol oid,
+ tidcol tid,
+ float4col real,
+ float8col double precision,
+ macaddrcol macaddr,
+ inetcol inet,
+ cidrcol cidr,
+ bpcharcol character,
+ datecol date,
+ timecol time without time zone,
+ timestampcol timestamp without time zone,
+ timestamptzcol timestamp with time zone,
+ intervalcol interval,
+ timetzcol time with time zone,
+ bitcol bit(10),
+ varbitcol bit varying(16),
+ numericcol numeric,
+ uuidcol uuid,
+ int4rangecol int4range,
+ lsncol pg_lsn,
+ boxcol box
+) WITH (fillfactor=10);
+INSERT INTO brintest SELECT
+ repeat(stringu1, 8)::bytea,
+ substr(stringu1, 1, 1)::"char",
+ stringu1::name, 142857 * tenthous,
+ thousand,
+ twothousand,
+ repeat(stringu1, 8),
+ unique1::oid,
+ format('(%s,%s)', tenthous, twenty)::tid,
+ (four + 1.0)/(hundred+1),
+ odd::float8 / (tenthous + 1),
+ format('%s:00:%s:00:%s:00', to_hex(odd), to_hex(even), to_hex(hundred))::macaddr,
+ inet '10.2.3.4/24' + tenthous,
+ cidr '10.2.3/24' + tenthous,
+ substr(stringu1, 1, 1)::bpchar,
+ date '1995-08-15' + tenthous,
+ time '01:20:30' + thousand * interval '18.5 second',
+ timestamp '1942-07-23 03:05:09' + tenthous * interval '36.38 hours',
+ timestamptz '1972-10-10 03:00' + thousand * interval '1 hour',
+ justify_days(justify_hours(tenthous * interval '12 minutes')),
+ timetz '01:30:20+02' + hundred * interval '15 seconds',
+ thousand::bit(10),
+ tenthous::bit(16)::varbit,
+ tenthous::numeric(36,30) * fivethous * even / (hundred + 1),
+ format('%s%s-%s-%s-%s-%s%s%s', to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'))::uuid,
+ int4range(thousand, twothousand),
+ format('%s/%s%s', odd, even, tenthous)::pg_lsn,
+ box(point(odd, even), point(thousand, twothousand))
+FROM tenk1 ORDER BY unique2 LIMIT 100;
+-- throw in some NULL's and different values
+INSERT INTO brintest (inetcol, cidrcol, int4rangecol) SELECT
+ inet 'fe80::6e40:8ff:fea9:8c46' + tenthous,
+ cidr 'fe80::6e40:8ff:fea9:8c46' + tenthous,
+ 'empty'::int4range
+FROM tenk1 ORDER BY thousand, tenthous LIMIT 25;
+CREATE INDEX brinidx ON brintest USING brin (
+ byteacol,
+ charcol,
+ namecol,
+ int8col,
+ int2col,
+ int4col,
+ textcol,
+ oidcol,
+ tidcol,
+ float4col,
+ float8col,
+ macaddrcol,
+ inetcol inet_inclusion_ops,
+ inetcol inet_minmax_ops,
+ cidrcol inet_inclusion_ops,
+ cidrcol inet_minmax_ops,
+ bpcharcol,
+ datecol,
+ timecol,
+ timestampcol,
+ timestamptzcol,
+ intervalcol,
+ timetzcol,
+ bitcol,
+ varbitcol,
+ numericcol,
+ uuidcol,
+ int4rangecol,
+ lsncol,
+ boxcol
+) with (pages_per_range = 1);
+CREATE TABLE brinopers (colname name, typ text,
+ op text[], value text[], matches int[],
+ check (cardinality(op) = cardinality(value)),
+ check (cardinality(op) = cardinality(matches)));
+INSERT INTO brinopers VALUES
+ ('byteacol', 'bytea',
+ '{>, >=, =, <=, <}',
+ '{AAAAAA, AAAAAA, BNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAA, ZZZZZZ, ZZZZZZ}',
+ '{100, 100, 1, 100, 100}'),
+ ('charcol', '"char"',
+ '{>, >=, =, <=, <}',
+ '{A, A, M, Z, Z}',
+ '{97, 100, 6, 100, 98}'),
+ ('namecol', 'name',
+ '{>, >=, =, <=, <}',
+ '{AAAAAA, AAAAAA, MAAAAA, ZZAAAA, ZZAAAA}',
+ '{100, 100, 2, 100, 100}'),
+ ('int2col', 'int2',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 999, 999}',
+ '{100, 100, 1, 100, 100}'),
+ ('int2col', 'int4',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 999, 1999}',
+ '{100, 100, 1, 100, 100}'),
+ ('int2col', 'int8',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 999, 1428427143}',
+ '{100, 100, 1, 100, 100}'),
+ ('int4col', 'int2',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 1999, 1999}',
+ '{100, 100, 1, 100, 100}'),
+ ('int4col', 'int4',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 1999, 1999}',
+ '{100, 100, 1, 100, 100}'),
+ ('int4col', 'int8',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 1999, 1428427143}',
+ '{100, 100, 1, 100, 100}'),
+ ('int8col', 'int2',
+ '{>, >=}',
+ '{0, 0}',
+ '{100, 100}'),
+ ('int8col', 'int4',
+ '{>, >=}',
+ '{0, 0}',
+ '{100, 100}'),
+ ('int8col', 'int8',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 1257141600, 1428427143, 1428427143}',
+ '{100, 100, 1, 100, 100}'),
+ ('textcol', 'text',
+ '{>, >=, =, <=, <}',
+ '{ABABAB, ABABAB, BNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAA, ZZAAAA, ZZAAAA}',
+ '{100, 100, 1, 100, 100}'),
+ ('oidcol', 'oid',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 8800, 9999, 9999}',
+ '{100, 100, 1, 100, 100}'),
+ ('tidcol', 'tid',
+ '{>, >=, =, <=, <}',
+ '{"(0,0)", "(0,0)", "(8800,0)", "(9999,19)", "(9999,19)"}',
+ '{100, 100, 1, 100, 100}'),
+ ('float4col', 'float4',
+ '{>, >=, =, <=, <}',
+ '{0.0103093, 0.0103093, 1, 1, 1}',
+ '{100, 100, 4, 100, 96}'),
+ ('float4col', 'float8',
+ '{>, >=, =, <=, <}',
+ '{0.0103093, 0.0103093, 1, 1, 1}',
+ '{100, 100, 4, 100, 96}'),
+ ('float8col', 'float4',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 0, 1.98, 1.98}',
+ '{99, 100, 1, 100, 100}'),
+ ('float8col', 'float8',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 0, 1.98, 1.98}',
+ '{99, 100, 1, 100, 100}'),
+ ('macaddrcol', 'macaddr',
+ '{>, >=, =, <=, <}',
+ '{00:00:01:00:00:00, 00:00:01:00:00:00, 2c:00:2d:00:16:00, ff:fe:00:00:00:00, ff:fe:00:00:00:00}',
+ '{99, 100, 2, 100, 100}'),
+ ('inetcol', 'inet',
+ '{&&, =, <, <=, >, >=, >>=, >>, <<=, <<}',
+ '{10/8, 10.2.14.231/24, 255.255.255.255, 255.255.255.255, 0.0.0.0, 0.0.0.0, 10.2.14.231/24, 10.2.14.231/25, 10.2.14.231/8, 0/0}',
+ '{100, 1, 100, 100, 125, 125, 2, 2, 100, 100}'),
+ ('inetcol', 'inet',
+ '{&&, >>=, <<=, =}',
+ '{fe80::6e40:8ff:fea9:a673/32, fe80::6e40:8ff:fea9:8c46, fe80::6e40:8ff:fea9:a673/32, fe80::6e40:8ff:fea9:8c46}',
+ '{25, 1, 25, 1}'),
+ ('inetcol', 'cidr',
+ '{&&, <, <=, >, >=, >>=, >>, <<=, <<}',
+ '{10/8, 255.255.255.255, 255.255.255.255, 0.0.0.0, 0.0.0.0, 10.2.14/24, 10.2.14/25, 10/8, 0/0}',
+ '{100, 100, 100, 125, 125, 2, 2, 100, 100}'),
+ ('inetcol', 'cidr',
+ '{&&, >>=, <<=, =}',
+ '{fe80::/32, fe80::6e40:8ff:fea9:8c46, fe80::/32, fe80::6e40:8ff:fea9:8c46}',
+ '{25, 1, 25, 1}'),
+ ('cidrcol', 'inet',
+ '{&&, =, <, <=, >, >=, >>=, >>, <<=, <<}',
+ '{10/8, 10.2.14/24, 255.255.255.255, 255.255.255.255, 0.0.0.0, 0.0.0.0, 10.2.14.231/24, 10.2.14.231/25, 10.2.14.231/8, 0/0}',
+ '{100, 2, 100, 100, 125, 125, 2, 2, 100, 100}'),
+ ('cidrcol', 'inet',
+ '{&&, >>=, <<=, =}',
+ '{fe80::6e40:8ff:fea9:a673/32, fe80::6e40:8ff:fea9:8c46, fe80::6e40:8ff:fea9:a673/32, fe80::6e40:8ff:fea9:8c46}',
+ '{25, 1, 25, 1}'),
+ ('cidrcol', 'cidr',
+ '{&&, =, <, <=, >, >=, >>=, >>, <<=, <<}',
+ '{10/8, 10.2.14/24, 255.255.255.255, 255.255.255.255, 0.0.0.0, 0.0.0.0, 10.2.14/24, 10.2.14/25, 10/8, 0/0}',
+ '{100, 2, 100, 100, 125, 125, 2, 2, 100, 100}'),
+ ('cidrcol', 'cidr',
+ '{&&, >>=, <<=, =}',
+ '{fe80::/32, fe80::6e40:8ff:fea9:8c46, fe80::/32, fe80::6e40:8ff:fea9:8c46}',
+ '{25, 1, 25, 1}'),
+ ('bpcharcol', 'bpchar',
+ '{>, >=, =, <=, <}',
+ '{A, A, W, Z, Z}',
+ '{97, 100, 6, 100, 98}'),
+ ('datecol', 'date',
+ '{>, >=, =, <=, <}',
+ '{1995-08-15, 1995-08-15, 2009-12-01, 2022-12-30, 2022-12-30}',
+ '{100, 100, 1, 100, 100}'),
+ ('timecol', 'time',
+ '{>, >=, =, <=, <}',
+ '{01:20:30, 01:20:30, 02:28:57, 06:28:31.5, 06:28:31.5}',
+ '{100, 100, 1, 100, 100}'),
+ ('timestampcol', 'timestamp',
+ '{>, >=, =, <=, <}',
+ '{1942-07-23 03:05:09, 1942-07-23 03:05:09, 1964-03-24 19:26:45, 1984-01-20 22:42:21, 1984-01-20 22:42:21}',
+ '{100, 100, 1, 100, 100}'),
+ ('timestampcol', 'timestamptz',
+ '{>, >=, =, <=, <}',
+ '{1942-07-23 03:05:09, 1942-07-23 03:05:09, 1964-03-24 19:26:45, 1984-01-20 22:42:21, 1984-01-20 22:42:21}',
+ '{100, 100, 1, 100, 100}'),
+ ('timestamptzcol', 'timestamptz',
+ '{>, >=, =, <=, <}',
+ '{1972-10-10 03:00:00-04, 1972-10-10 03:00:00-04, 1972-10-19 09:00:00-07, 1972-11-20 19:00:00-03, 1972-11-20 19:00:00-03}',
+ '{100, 100, 1, 100, 100}'),
+ ('intervalcol', 'interval',
+ '{>, >=, =, <=, <}',
+ '{00:00:00, 00:00:00, 1 mons 13 days 12:24, 2 mons 23 days 07:48:00, 1 year}',
+ '{100, 100, 1, 100, 100}'),
+ ('timetzcol', 'timetz',
+ '{>, >=, =, <=, <}',
+ '{01:30:20+02, 01:30:20+02, 01:35:50+02, 23:55:05+02, 23:55:05+02}',
+ '{99, 100, 2, 100, 100}'),
+ ('bitcol', 'bit(10)',
+ '{>, >=, =, <=, <}',
+ '{0000000010, 0000000010, 0011011110, 1111111000, 1111111000}',
+ '{100, 100, 1, 100, 100}'),
+ ('varbitcol', 'varbit(16)',
+ '{>, >=, =, <=, <}',
+ '{0000000000000100, 0000000000000100, 0001010001100110, 1111111111111000, 1111111111111000}',
+ '{100, 100, 1, 100, 100}'),
+ ('numericcol', 'numeric',
+ '{>, >=, =, <=, <}',
+ '{0.00, 0.01, 2268164.347826086956521739130434782609, 99470151.9, 99470151.9}',
+ '{100, 100, 1, 100, 100}'),
+ ('uuidcol', 'uuid',
+ '{>, >=, =, <=, <}',
+ '{00040004-0004-0004-0004-000400040004, 00040004-0004-0004-0004-000400040004, 52225222-5222-5222-5222-522252225222, 99989998-9998-9998-9998-999899989998, 99989998-9998-9998-9998-999899989998}',
+ '{100, 100, 1, 100, 100}'),
+ ('int4rangecol', 'int4range',
+ '{<<, &<, &&, &>, >>, @>, <@, =, <, <=, >, >=}',
+ '{"[10000,)","[10000,)","(,]","[3,4)","[36,44)","(1500,1501]","[3,4)","[222,1222)","[36,44)","[43,1043)","[367,4466)","[519,)"}',
+ '{53, 53, 53, 53, 50, 22, 72, 1, 74, 75, 34, 21}'),
+ ('int4rangecol', 'int4range',
+ '{@>, <@, =, <=, >, >=}',
+ '{empty, empty, empty, empty, empty, empty}',
+ '{125, 72, 72, 72, 53, 125}'),
+ ('int4rangecol', 'int4',
+ '{@>}',
+ '{1500}',
+ '{22}'),
+ ('lsncol', 'pg_lsn',
+ '{>, >=, =, <=, <, IS, IS NOT}',
+ '{0/1200, 0/1200, 44/455222, 198/1999799, 198/1999799, NULL, NULL}',
+ '{100, 100, 1, 100, 100, 25, 100}'),
+ ('boxcol', 'point',
+ '{@>}',
+ '{"(500,43)"}',
+ '{11}'),
+ ('boxcol', 'box',
+ '{<<, &<, &&, &>, >>, <<|, &<|, |&>, |>>, @>, <@, ~=}',
+ '{"((1000,2000),(3000,4000))","((1,2),(3000,4000))","((1,2),(3000,4000))","((1,2),(3000,4000))","((1,2),(3,4))","((1000,2000),(3000,4000))","((1,2000),(3,4000))","((1000,2),(3000,4))","((1,2),(3,4))","((1,2),(300,400))","((1,2),(3000,4000))","((222,1222),(44,45))"}',
+ '{100, 100, 100, 99, 96, 100, 100, 99, 96, 1, 99, 1}');
+DO $x$
+DECLARE
+ r record;
+ r2 record;
+ cond text;
+ idx_ctids tid[];
+ ss_ctids tid[];
+ count int;
+ plan_ok bool;
+ plan_line text;
+BEGIN
+ FOR r IN SELECT colname, oper, typ, value[ordinality], matches[ordinality] FROM brinopers, unnest(op) WITH ORDINALITY AS oper LOOP
+
+ -- prepare the condition
+ IF r.value IS NULL THEN
+ cond := format('%I %s %L', r.colname, r.oper, r.value);
+ ELSE
+ cond := format('%I %s %L::%s', r.colname, r.oper, r.value, r.typ);
+ END IF;
+
+ -- run the query using the brin index
+ SET enable_seqscan = 0;
+ SET enable_bitmapscan = 1;
+
+ plan_ok := false;
+ FOR plan_line IN EXECUTE format($y$EXPLAIN SELECT array_agg(ctid) FROM brintest WHERE %s $y$, cond) LOOP
+ IF plan_line LIKE '%Bitmap Heap Scan on brintest%' THEN
+ plan_ok := true;
+ END IF;
+ END LOOP;
+ IF NOT plan_ok THEN
+ RAISE WARNING 'did not get bitmap indexscan plan for %', r;
+ END IF;
+
+ EXECUTE format($y$SELECT array_agg(ctid) FROM brintest WHERE %s $y$, cond)
+ INTO idx_ctids;
+
+ -- run the query using a seqscan
+ SET enable_seqscan = 1;
+ SET enable_bitmapscan = 0;
+
+ plan_ok := false;
+ FOR plan_line IN EXECUTE format($y$EXPLAIN SELECT array_agg(ctid) FROM brintest WHERE %s $y$, cond) LOOP
+ IF plan_line LIKE '%Seq Scan on brintest%' THEN
+ plan_ok := true;
+ END IF;
+ END LOOP;
+ IF NOT plan_ok THEN
+ RAISE WARNING 'did not get seqscan plan for %', r;
+ END IF;
+
+ EXECUTE format($y$SELECT array_agg(ctid) FROM brintest WHERE %s $y$, cond)
+ INTO ss_ctids;
+
+ -- make sure both return the same results
+ count := array_length(idx_ctids, 1);
+
+ IF NOT (count = array_length(ss_ctids, 1) AND
+ idx_ctids @> ss_ctids AND
+ idx_ctids <@ ss_ctids) THEN
+ -- report the results of each scan to make the differences obvious
+ RAISE WARNING 'something not right in %: count %', r, count;
+ SET enable_seqscan = 1;
+ SET enable_bitmapscan = 0;
+ FOR r2 IN EXECUTE 'SELECT ' || r.colname || ' FROM brintest WHERE ' || cond LOOP
+ RAISE NOTICE 'seqscan: %', r2;
+ END LOOP;
+
+ SET enable_seqscan = 0;
+ SET enable_bitmapscan = 1;
+ FOR r2 IN EXECUTE 'SELECT ' || r.colname || ' FROM brintest WHERE ' || cond LOOP
+ RAISE NOTICE 'bitmapscan: %', r2;
+ END LOOP;
+ END IF;
+
+ -- make sure we found expected number of matches
+ IF count != r.matches THEN RAISE WARNING 'unexpected number of results % for %', count, r; END IF;
+ END LOOP;
+END;
+$x$;
+RESET enable_seqscan;
+RESET enable_bitmapscan;
+INSERT INTO brintest SELECT
+ repeat(stringu1, 42)::bytea,
+ substr(stringu1, 1, 1)::"char",
+ stringu1::name, 142857 * tenthous,
+ thousand,
+ twothousand,
+ repeat(stringu1, 42),
+ unique1::oid,
+ format('(%s,%s)', tenthous, twenty)::tid,
+ (four + 1.0)/(hundred+1),
+ odd::float8 / (tenthous + 1),
+ format('%s:00:%s:00:%s:00', to_hex(odd), to_hex(even), to_hex(hundred))::macaddr,
+ inet '10.2.3.4' + tenthous,
+ cidr '10.2.3/24' + tenthous,
+ substr(stringu1, 1, 1)::bpchar,
+ date '1995-08-15' + tenthous,
+ time '01:20:30' + thousand * interval '18.5 second',
+ timestamp '1942-07-23 03:05:09' + tenthous * interval '36.38 hours',
+ timestamptz '1972-10-10 03:00' + thousand * interval '1 hour',
+ justify_days(justify_hours(tenthous * interval '12 minutes')),
+ timetz '01:30:20' + hundred * interval '15 seconds',
+ thousand::bit(10),
+ tenthous::bit(16)::varbit,
+ tenthous::numeric(36,30) * fivethous * even / (hundred + 1),
+ format('%s%s-%s-%s-%s-%s%s%s', to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'))::uuid,
+ int4range(thousand, twothousand),
+ format('%s/%s%s', odd, even, tenthous)::pg_lsn,
+ box(point(odd, even), point(thousand, twothousand))
+FROM tenk1 ORDER BY unique2 LIMIT 5 OFFSET 5;
+SELECT brin_desummarize_range('brinidx', 0);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+VACUUM brintest; -- force a summarization cycle in brinidx
+UPDATE brintest SET int8col = int8col * int4col;
+UPDATE brintest SET textcol = '' WHERE textcol IS NOT NULL;
+-- Tests for brin_summarize_new_values
+SELECT brin_summarize_new_values('brintest'); -- error, not an index
+ERROR: "brintest" is not an index
+SELECT brin_summarize_new_values('tenk1_unique1'); -- error, not a BRIN index
+ERROR: "tenk1_unique1" is not a BRIN index
+SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
+ brin_summarize_new_values
+---------------------------
+ 9
+(1 row)
+
+-- Tests for brin_desummarize_range
+SELECT brin_desummarize_range('brinidx', -1); -- error, invalid range
+ERROR: block number out of range: -1
+SELECT brin_desummarize_range('brinidx', 0);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+SELECT brin_desummarize_range('brinidx', 0);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+SELECT brin_desummarize_range('brinidx', 100000000);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+-- Test brin_summarize_range
+CREATE TABLE brin_summarize (
+ value int
+) WITH (fillfactor=10, autovacuum_enabled=false);
+CREATE INDEX brin_summarize_idx ON brin_summarize USING brin (value) WITH (pages_per_range=2);
+-- Fill a few pages
+DO $$
+DECLARE curtid tid;
+BEGIN
+ LOOP
+ INSERT INTO brin_summarize VALUES (1) RETURNING ctid INTO curtid;
+ EXIT WHEN curtid > tid '(2, 0)';
+ END LOOP;
+END;
+$$;
+-- summarize one range
+SELECT brin_summarize_range('brin_summarize_idx', 0);
+ brin_summarize_range
+----------------------
+ 0
+(1 row)
+
+-- nothing: already summarized
+SELECT brin_summarize_range('brin_summarize_idx', 1);
+ brin_summarize_range
+----------------------
+ 0
+(1 row)
+
+-- summarize one range
+SELECT brin_summarize_range('brin_summarize_idx', 2);
+ brin_summarize_range
+----------------------
+ 1
+(1 row)
+
+-- nothing: page doesn't exist in table
+SELECT brin_summarize_range('brin_summarize_idx', 4294967295);
+ brin_summarize_range
+----------------------
+ 2
+(1 row)
+
+-- invalid block number values
+SELECT brin_summarize_range('brin_summarize_idx', -1);
+ERROR: block number out of range: -1
+SELECT brin_summarize_range('brin_summarize_idx', 4294967296);
+ERROR: block number out of range: 4294967296
+-- test value merging in add_value
+CREATE TABLE brintest_2 (n numrange);
+CREATE INDEX brinidx_2 ON brintest_2 USING brin (n);
+INSERT INTO brintest_2 VALUES ('empty');
+INSERT INTO brintest_2 VALUES (numrange(0, 2^1000::numeric));
+INSERT INTO brintest_2 VALUES ('(-1, 0)');
+SELECT brin_desummarize_range('brinidx', 0);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+SELECT brin_summarize_range('brinidx', 0);
+ brin_summarize_range
+----------------------
+ 1
+(1 row)
+
+DROP TABLE brintest_2;
+-- test brin cost estimates behave sanely based on correlation of values
+CREATE TABLE brin_test (a INT, b INT);
+INSERT INTO brin_test SELECT x/100,x%100 FROM generate_series(1,10000) x(x);
+CREATE INDEX brin_test_a_idx ON brin_test USING brin (a) WITH (pages_per_range = 2);
+CREATE INDEX brin_test_b_idx ON brin_test USING brin (b) WITH (pages_per_range = 2);
+VACUUM ANALYZE brin_test;
+-- Ensure brin index is used when columns are perfectly correlated
+EXPLAIN (COSTS OFF) SELECT * FROM brin_test WHERE a = 1;
+ QUERY PLAN
+--------------------------------------------
+ Bitmap Heap Scan on brin_test
+ Recheck Cond: (a = 1)
+ -> Bitmap Index Scan on brin_test_a_idx
+ Index Cond: (a = 1)
+(4 rows)
+
+-- Ensure brin index is not used when values are not correlated
+EXPLAIN (COSTS OFF) SELECT * FROM brin_test WHERE b = 1;
+ QUERY PLAN
+-----------------------
+ Seq Scan on brin_test
+ Filter: (b = 1)
+(2 rows)
+
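The contrasting plans above fall out of the statistics gathered by VACUUM ANALYZE: column "a" increases with physical row order while "b" cycles every 100 rows, so only "a" gives BRIN ranges that are worth scanning. An illustrative companion query (not part of the expected file) showing the correlation figures the BRIN cost estimate consults:

SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'brin_test' AND attname IN ('a', 'b');
-- correlation should be close to 1 for "a" and close to 0 for "b".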
diff --git src/test/regress/expected/cluster_1.out src/test/regress/expected/cluster_1.out
new file mode 100644
index 0000000000..7167f05199
--- /dev/null
+++ src/test/regress/expected/cluster_1.out
@@ -0,0 +1,582 @@
+--
+-- CLUSTER
+--
+CREATE TABLE clstr_tst_s (rf_a SERIAL PRIMARY KEY,
+ b INT);
+CREATE TABLE clstr_tst (a SERIAL PRIMARY KEY,
+ b INT,
+ c TEXT,
+ d TEXT,
+ CONSTRAINT clstr_tst_con FOREIGN KEY (b) REFERENCES clstr_tst_s);
+CREATE INDEX clstr_tst_b ON clstr_tst (b);
+CREATE INDEX clstr_tst_c ON clstr_tst (c);
+CREATE INDEX clstr_tst_c_b ON clstr_tst (c,b);
+CREATE INDEX clstr_tst_b_c ON clstr_tst (b,c);
+INSERT INTO clstr_tst_s (b) VALUES (0);
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+CREATE TABLE clstr_tst_inh () INHERITS (clstr_tst);
+INSERT INTO clstr_tst (b, c) VALUES (11, 'once');
+INSERT INTO clstr_tst (b, c) VALUES (10, 'diez');
+INSERT INTO clstr_tst (b, c) VALUES (31, 'treinta y uno');
+INSERT INTO clstr_tst (b, c) VALUES (22, 'veintidos');
+INSERT INTO clstr_tst (b, c) VALUES (3, 'tres');
+INSERT INTO clstr_tst (b, c) VALUES (20, 'veinte');
+INSERT INTO clstr_tst (b, c) VALUES (23, 'veintitres');
+INSERT INTO clstr_tst (b, c) VALUES (21, 'veintiuno');
+INSERT INTO clstr_tst (b, c) VALUES (4, 'cuatro');
+INSERT INTO clstr_tst (b, c) VALUES (14, 'catorce');
+INSERT INTO clstr_tst (b, c) VALUES (2, 'dos');
+INSERT INTO clstr_tst (b, c) VALUES (18, 'dieciocho');
+INSERT INTO clstr_tst (b, c) VALUES (27, 'veintisiete');
+INSERT INTO clstr_tst (b, c) VALUES (25, 'veinticinco');
+INSERT INTO clstr_tst (b, c) VALUES (13, 'trece');
+INSERT INTO clstr_tst (b, c) VALUES (28, 'veintiocho');
+INSERT INTO clstr_tst (b, c) VALUES (32, 'treinta y dos');
+INSERT INTO clstr_tst (b, c) VALUES (5, 'cinco');
+INSERT INTO clstr_tst (b, c) VALUES (29, 'veintinueve');
+INSERT INTO clstr_tst (b, c) VALUES (1, 'uno');
+INSERT INTO clstr_tst (b, c) VALUES (24, 'veinticuatro');
+INSERT INTO clstr_tst (b, c) VALUES (30, 'treinta');
+INSERT INTO clstr_tst (b, c) VALUES (12, 'doce');
+INSERT INTO clstr_tst (b, c) VALUES (17, 'diecisiete');
+INSERT INTO clstr_tst (b, c) VALUES (9, 'nueve');
+INSERT INTO clstr_tst (b, c) VALUES (19, 'diecinueve');
+INSERT INTO clstr_tst (b, c) VALUES (26, 'veintiseis');
+INSERT INTO clstr_tst (b, c) VALUES (15, 'quince');
+INSERT INTO clstr_tst (b, c) VALUES (7, 'siete');
+INSERT INTO clstr_tst (b, c) VALUES (16, 'dieciseis');
+INSERT INTO clstr_tst (b, c) VALUES (8, 'ocho');
+-- This entry is needed to test that TOASTED values are copied correctly.
+INSERT INTO clstr_tst (b, c, d) VALUES (6, 'seis', repeat('xyzzy', 100000));
+CLUSTER clstr_tst_c ON clstr_tst;
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+(32 rows)
+
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY a;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 1 | 11 | once | |
+ 2 | 10 | diez | |
+ 3 | 31 | treinta y uno | |
+ 4 | 22 | veintidos | |
+ 5 | 3 | tres | |
+ 6 | 20 | veinte | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 9 | 4 | cuatro | |
+ 10 | 14 | catorce | |
+ 11 | 2 | dos | |
+ 12 | 18 | dieciocho | |
+ 13 | 27 | veintisiete | |
+ 14 | 25 | veinticinco | |
+ 15 | 13 | trece | |
+ 16 | 28 | veintiocho | |
+ 17 | 32 | treinta y dos | |
+ 18 | 5 | cinco | |
+ 19 | 29 | veintinueve | |
+ 20 | 1 | uno | |
+ 21 | 24 | veinticuatro | |
+ 22 | 30 | treinta | |
+ 23 | 12 | doce | |
+ 24 | 17 | diecisiete | |
+ 25 | 9 | nueve | |
+ 26 | 19 | diecinueve | |
+ 27 | 26 | veintiseis | |
+ 28 | 15 | quince | |
+ 29 | 7 | siete | |
+ 30 | 16 | dieciseis | |
+ 31 | 8 | ocho | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+(32 rows)
+
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY b;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 20 | 1 | uno | |
+ 11 | 2 | dos | |
+ 5 | 3 | tres | |
+ 9 | 4 | cuatro | |
+ 18 | 5 | cinco | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 31 | 8 | ocho | |
+ 25 | 9 | nueve | |
+ 2 | 10 | diez | |
+ 1 | 11 | once | |
+ 23 | 12 | doce | |
+ 15 | 13 | trece | |
+ 10 | 14 | catorce | |
+ 28 | 15 | quince | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 12 | 18 | dieciocho | |
+ 26 | 19 | diecinueve | |
+ 6 | 20 | veinte | |
+ 8 | 21 | veintiuno | |
+ 4 | 22 | veintidos | |
+ 7 | 23 | veintitres | |
+ 21 | 24 | veinticuatro | |
+ 14 | 25 | veinticinco | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 16 | 28 | veintiocho | |
+ 19 | 29 | veintinueve | |
+ 22 | 30 | treinta | |
+ 3 | 31 | treinta y uno | |
+ 17 | 32 | treinta y dos | |
+(32 rows)
+
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY c;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+(32 rows)
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+----------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+(33 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------
+ clstr_tst_con
+ clstr_tst_pkey
+(2 rows)
+
+SELECT relname, relkind,
+ EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
+FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
+ relname | relkind | hastoast
+----------------------+---------+----------
+ clstr_tst | r | f
+ clstr_tst_a_seq | S | f
+ clstr_tst_b | i | f
+ clstr_tst_b_c | i | f
+ clstr_tst_c | i | f
+ clstr_tst_c_b | i | f
+ clstr_tst_inh | r | f
+ clstr_tst_pkey | i | f
+ clstr_tst_s | r | f
+ clstr_tst_s_pkey | i | f
+ clstr_tst_s_rf_a_seq | S | f
+(11 rows)
+
+-- Verify that indisclustered is correctly set
+SELECT pg_class.relname FROM pg_index, pg_class, pg_class AS pg_class_2
+WHERE pg_class.oid=indexrelid
+ AND indrelid=pg_class_2.oid
+ AND pg_class_2.relname = 'clstr_tst'
+ AND indisclustered;
+ relname
+-------------
+ clstr_tst_c
+(1 row)
+
+-- Try changing indisclustered
+ALTER TABLE clstr_tst CLUSTER ON clstr_tst_b_c;
+SELECT pg_class.relname FROM pg_index, pg_class, pg_class AS pg_class_2
+WHERE pg_class.oid=indexrelid
+ AND indrelid=pg_class_2.oid
+ AND pg_class_2.relname = 'clstr_tst'
+ AND indisclustered;
+ relname
+---------------
+ clstr_tst_b_c
+(1 row)
+
+-- Try turning off all clustering
+ALTER TABLE clstr_tst SET WITHOUT CLUSTER;
+SELECT pg_class.relname FROM pg_index, pg_class, pg_class AS pg_class_2
+WHERE pg_class.oid=indexrelid
+ AND indrelid=pg_class_2.oid
+ AND pg_class_2.relname = 'clstr_tst'
+ AND indisclustered;
+ relname
+---------
+(0 rows)
+
+-- Verify that clustering all tables does in fact cluster the right ones
+CREATE USER regress_clstr_user;
+CREATE TABLE clstr_1 (a INT PRIMARY KEY);
+CREATE TABLE clstr_2 (a INT PRIMARY KEY);
+CREATE TABLE clstr_3 (a INT PRIMARY KEY);
+ALTER TABLE clstr_1 OWNER TO regress_clstr_user;
+ALTER TABLE clstr_3 OWNER TO regress_clstr_user;
+GRANT SELECT ON clstr_2 TO regress_clstr_user;
+INSERT INTO clstr_1 VALUES (2);
+INSERT INTO clstr_1 VALUES (1);
+INSERT INTO clstr_2 VALUES (2);
+INSERT INTO clstr_2 VALUES (1);
+INSERT INTO clstr_3 VALUES (2);
+INSERT INTO clstr_3 VALUES (1);
+-- "CLUSTER <tablename>" on a table that hasn't been clustered
+CLUSTER clstr_2;
+ERROR: there is no previously clustered index for table "clstr_2"
+CLUSTER clstr_1_pkey ON clstr_1;
+CLUSTER clstr_2 USING clstr_2_pkey;
+SELECT * FROM clstr_1 UNION ALL
+ SELECT * FROM clstr_2 UNION ALL
+ SELECT * FROM clstr_3;
+ a
+---
+ 1
+ 2
+ 1
+ 2
+ 2
+ 1
+(6 rows)
+
+-- revert to the original state
+DELETE FROM clstr_1;
+DELETE FROM clstr_2;
+DELETE FROM clstr_3;
+INSERT INTO clstr_1 VALUES (2);
+INSERT INTO clstr_1 VALUES (1);
+INSERT INTO clstr_2 VALUES (2);
+INSERT INTO clstr_2 VALUES (1);
+INSERT INTO clstr_3 VALUES (2);
+INSERT INTO clstr_3 VALUES (1);
+-- this user can only cluster clstr_1 and clstr_3, but the latter
+-- has not been clustered
+SET SESSION AUTHORIZATION regress_clstr_user;
+CLUSTER;
+SELECT * FROM clstr_1 UNION ALL
+ SELECT * FROM clstr_2 UNION ALL
+ SELECT * FROM clstr_3;
+ a
+---
+ 1
+ 2
+ 2
+ 1
+ 2
+ 1
+(6 rows)
+
+-- cluster a single table using the indisclustered bit previously set
+DELETE FROM clstr_1;
+INSERT INTO clstr_1 VALUES (2);
+INSERT INTO clstr_1 VALUES (1);
+CLUSTER clstr_1;
+SELECT * FROM clstr_1;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+-- Test MVCC-safety of cluster. There isn't much we can do to verify the
+-- results with a single backend...
+CREATE TABLE clustertest (key int PRIMARY KEY);
+INSERT INTO clustertest VALUES (10);
+INSERT INTO clustertest VALUES (20);
+INSERT INTO clustertest VALUES (30);
+INSERT INTO clustertest VALUES (40);
+INSERT INTO clustertest VALUES (50);
+-- Use a transaction so that updates are not committed when CLUSTER sees 'em
+BEGIN;
+-- Test update where the old row version is found first in the scan
+UPDATE clustertest SET key = 100 WHERE key = 10;
+-- Test update where the new row version is found first in the scan
+UPDATE clustertest SET key = 35 WHERE key = 40;
+-- Test longer update chain
+UPDATE clustertest SET key = 60 WHERE key = 50;
+UPDATE clustertest SET key = 70 WHERE key = 60;
+UPDATE clustertest SET key = 80 WHERE key = 70;
+SELECT * FROM clustertest;
+ key
+-----
+ 20
+ 30
+ 100
+ 35
+ 80
+(5 rows)
+
+CLUSTER clustertest_pkey ON clustertest;
+SELECT * FROM clustertest;
+ key
+-----
+ 20
+ 30
+ 35
+ 80
+ 100
+(5 rows)
+
+COMMIT;
+SELECT * FROM clustertest;
+ key
+-----
+ 20
+ 30
+ 35
+ 80
+ 100
+(5 rows)
+
+-- check that temp tables can be clustered
+create temp table clstr_temp (col1 int primary key, col2 text);
+insert into clstr_temp values (2, 'two'), (1, 'one');
+cluster clstr_temp using clstr_temp_pkey;
+select * from clstr_temp;
+ col1 | col2
+------+------
+ 1 | one
+ 2 | two
+(2 rows)
+
+drop table clstr_temp;
+RESET SESSION AUTHORIZATION;
+-- Check that partitioned tables cannot be clustered
+CREATE TABLE clstrpart (a int) PARTITION BY RANGE (a);
+CREATE INDEX clstrpart_idx ON clstrpart (a);
+ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+ERROR: cannot mark index clustered in partitioned table
+CLUSTER clstrpart USING clstrpart_idx;
+ERROR: cannot cluster a partitioned table
+DROP TABLE clstrpart;
+-- Test CLUSTER with external tuplesorting
+create table clstr_4 as select * from tenk1;
+create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
+-- ensure we don't use the index in CLUSTER nor the checking SELECTs
+set enable_indexscan = off;
+-- Use external sort:
+set maintenance_work_mem = '1MB';
+cluster clstr_4 using cluster_sort;
+select * from
+(select hundred, lag(hundred) over () as lhundred,
+ thousand, lag(thousand) over () as lthousand,
+ tenthous, lag(tenthous) over () as ltenthous from clstr_4) ss
+where row(hundred, thousand, tenthous) <= row(lhundred, lthousand, ltenthous);
+ hundred | lhundred | thousand | lthousand | tenthous | ltenthous
+---------+----------+----------+-----------+----------+-----------
+(0 rows)
+
+reset enable_indexscan;
+reset maintenance_work_mem;
+-- test CLUSTER on expression index
+CREATE TABLE clstr_expression(id serial primary key, a int, b text COLLATE "C");
+INSERT INTO clstr_expression(a, b) SELECT g.i % 42, 'prefix'||g.i FROM generate_series(1, 133) g(i);
+CREATE INDEX clstr_expression_minus_a ON clstr_expression ((-a), b);
+CREATE INDEX clstr_expression_upper_b ON clstr_expression ((upper(b)));
+-- verify indexes work before cluster
+BEGIN;
+SET LOCAL enable_seqscan = false;
+EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
+ QUERY PLAN
+---------------------------------------------------------------
+ Index Scan using clstr_expression_upper_b on clstr_expression
+ Index Cond: (upper(b) = 'PREFIX3'::text)
+(2 rows)
+
+SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
+ id | a | b
+----+---+---------
+ 3 | 3 | prefix3
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
+ QUERY PLAN
+---------------------------------------------------------------
+ Index Scan using clstr_expression_minus_a on clstr_expression
+ Index Cond: ((- a) = '-3'::integer)
+(2 rows)
+
+SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
+ id | a | b
+-----+---+-----------
+ 129 | 3 | prefix129
+ 3 | 3 | prefix3
+ 45 | 3 | prefix45
+ 87 | 3 | prefix87
+(4 rows)
+
+COMMIT;
+-- and after clustering on clstr_expression_minus_a
+CLUSTER clstr_expression USING clstr_expression_minus_a;
+BEGIN;
+SET LOCAL enable_seqscan = false;
+EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
+ QUERY PLAN
+---------------------------------------------------------------
+ Index Scan using clstr_expression_upper_b on clstr_expression
+ Index Cond: (upper(b) = 'PREFIX3'::text)
+(2 rows)
+
+SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
+ id | a | b
+----+---+---------
+ 3 | 3 | prefix3
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
+ QUERY PLAN
+---------------------------------------------------------------
+ Index Scan using clstr_expression_minus_a on clstr_expression
+ Index Cond: ((- a) = '-3'::integer)
+(2 rows)
+
+SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
+ id | a | b
+-----+---+-----------
+ 129 | 3 | prefix129
+ 3 | 3 | prefix3
+ 45 | 3 | prefix45
+ 87 | 3 | prefix87
+(4 rows)
+
+COMMIT;
+-- and after clustering on clstr_expression_upper_b
+CLUSTER clstr_expression USING clstr_expression_upper_b;
+BEGIN;
+SET LOCAL enable_seqscan = false;
+EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
+ QUERY PLAN
+---------------------------------------------------------------
+ Index Scan using clstr_expression_upper_b on clstr_expression
+ Index Cond: (upper(b) = 'PREFIX3'::text)
+(2 rows)
+
+SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
+ id | a | b
+----+---+---------
+ 3 | 3 | prefix3
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
+ QUERY PLAN
+---------------------------------------------------------------
+ Index Scan using clstr_expression_minus_a on clstr_expression
+ Index Cond: ((- a) = '-3'::integer)
+(2 rows)
+
+SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
+ id | a | b
+-----+---+-----------
+ 129 | 3 | prefix129
+ 3 | 3 | prefix3
+ 45 | 3 | prefix45
+ 87 | 3 | prefix87
+(4 rows)
+
+COMMIT;
+-- clean up
+DROP TABLE clustertest;
+DROP TABLE clstr_1;
+DROP TABLE clstr_2;
+DROP TABLE clstr_3;
+DROP TABLE clstr_4;
+DROP TABLE clstr_expression;
+DROP USER regress_clstr_user;
diff --git src/test/regress/expected/combocid_1.out src/test/regress/expected/combocid_1.out
new file mode 100644
index 0000000000..533a2f2fe4
--- /dev/null
+++ src/test/regress/expected/combocid_1.out
@@ -0,0 +1,169 @@
+--
+-- Tests for some likely failure cases with combo cmin/cmax mechanism
+--
+CREATE TEMP TABLE combocidtest (foobar int);
+BEGIN;
+-- a few dummy ops to push up the CommandId counter
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest VALUES (1);
+INSERT INTO combocidtest VALUES (2);
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+(2 rows)
+
+SAVEPOINT s1;
+UPDATE combocidtest SET foobar = foobar + 10;
+-- here we should see only updated tuples
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,3) | 12 | 11
+ (0,4) | 12 | 12
+(2 rows)
+
+ROLLBACK TO s1;
+-- now we should see old tuples, but with combo CIDs starting at 0
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+(2 rows)
+
+COMMIT;
+-- combo data is not there anymore, but should still see tuples
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+(2 rows)
+
+-- Test combo cids with portals
+BEGIN;
+INSERT INTO combocidtest VALUES (333);
+DECLARE c CURSOR FOR SELECT ctid,cmin,* FROM combocidtest;
+DELETE FROM combocidtest;
+FETCH ALL FROM c;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,5) | 0 | 333
+(3 rows)
+
+ROLLBACK;
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+(2 rows)
+
+-- check behavior with locked tuples
+BEGIN;
+-- a few dummy ops to push up the CommandId counter
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest VALUES (444);
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+SAVEPOINT s1;
+-- this doesn't affect cmin
+SELECT ctid,cmin,* FROM combocidtest FOR UPDATE;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+-- but this does
+UPDATE combocidtest SET foobar = foobar + 10;
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,7) | 12 | 11
+ (0,8) | 12 | 12
+ (0,9) | 12 | 454
+(3 rows)
+
+ROLLBACK TO s1;
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+COMMIT;
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+-- test for bug reported in
+-- CABRT9RC81YUf1=jsmWopcKJEro=VoeG2ou6sPwyOUTx_qteRsg@mail.gmail.com
+CREATE TABLE IF NOT EXISTS testcase(
+ id int PRIMARY KEY,
+ balance numeric
+);
+INSERT INTO testcase VALUES (1, 0);
+BEGIN;
+SELECT * FROM testcase WHERE testcase.id = 1 FOR UPDATE;
+ id | balance
+----+---------
+ 1 | 0
+(1 row)
+
+UPDATE testcase SET balance = balance + 400 WHERE id=1;
+SAVEPOINT subxact;
+UPDATE testcase SET balance = balance - 100 WHERE id=1;
+ROLLBACK TO SAVEPOINT subxact;
+-- should return one tuple
+SELECT * FROM testcase WHERE id = 1 FOR UPDATE;
+ id | balance
+----+---------
+ 1 | 400
+(1 row)
+
+ROLLBACK;
+DROP TABLE testcase;
diff --git src/test/regress/expected/create_am.out src/test/regress/expected/create_am.out
index 84da403afc..a9b8a5d9fa 100644
--- src/test/regress/expected/create_am.out
+++ src/test/regress/expected/create_am.out
@@ -126,11 +126,12 @@ ERROR: function int4in(internal) does not exist
CREATE ACCESS METHOD bogus TYPE TABLE HANDLER bthandler;
ERROR: function bthandler must return type table_am_handler
SELECT amname, amhandler, amtype FROM pg_am where amtype = 't' ORDER BY 1, 2;
- amname | amhandler | amtype
---------+----------------------+--------
- heap | heap_tableam_handler | t
- heap2 | heap_tableam_handler | t
-(2 rows)
+ amname | amhandler | amtype
+----------+--------------------------+--------
+ heap | heap_tableam_handler | t
+ heap2 | heap_tableam_handler | t
+ zedstore | zedstore_tableam_handler | t
+(3 rows)

-- First create tables employing the new AM using USING
-- plain CREATE TABLE
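With zedstore_tableam_handler registered in pg_am as shown in this hunk, the AM is picked the same way as any other table access method. A minimal sketch (the zs_demo table is hypothetical, assuming a build with this patch applied):

CREATE TABLE zs_demo (a int, b text) USING zedstore;
INSERT INTO zs_demo VALUES (1, 'one'), (2, 'two');
SELECT ctid, a, b FROM zs_demo;  -- ctid here is zedstore's logical row identifier
-- Presumably the alternate *_1.out files in this patch were produced by running
-- the regression suite with the AM as the default:
SET default_table_access_method = zedstore;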
diff --git src/test/regress/expected/create_index.out src/test/regress/expected/create_index.out
index ae95bb38a6..3862ae575f 100644
--- src/test/regress/expected/create_index.out
+++ src/test/regress/expected/create_index.out
@@ -1872,7 +1872,9 @@ SELECT count(*) FROM dupindexcols
--
-- Check ordering of =ANY indexqual results (bug in 9.2.0)
--
-vacuum tenk1; -- ensure we get consistent plans here
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = OFF;
explain (costs off)
SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)
@@ -1913,6 +1915,7 @@ ORDER BY thousand;
1 | 1001
(2 rows)

+RESET enable_indexscan;
SET enable_indexonlyscan = OFF;
explain (costs off)
SELECT thousand, tenthous FROM tenk1
@@ -1935,6 +1938,8 @@ ORDER BY thousand;
1 | 1001
(2 rows)

+RESET enable_seqscan;
+RESET enable_bitmapscan;
RESET enable_indexonlyscan;
--
-- Check elimination of constant-NULL subexpressions
diff --git src/test/regress/expected/create_index_1.out src/test/regress/expected/create_index_1.out
new file mode 100644
index 0000000000..614419e7fd
--- /dev/null
+++ src/test/regress/expected/create_index_1.out
@@ -0,0 +1,2598 @@
+--
+-- CREATE_INDEX
+-- Create ancillary data structures (i.e. indices)
+--
+--
+-- BTREE
+--
+CREATE INDEX onek_unique1 ON onek USING btree(unique1 int4_ops);
+CREATE INDEX IF NOT EXISTS onek_unique1 ON onek USING btree(unique1 int4_ops);
+NOTICE: relation "onek_unique1" already exists, skipping
+CREATE INDEX IF NOT EXISTS ON onek USING btree(unique1 int4_ops);
+ERROR: syntax error at or near "ON"
+LINE 1: CREATE INDEX IF NOT EXISTS ON onek USING btree(unique1 int4_...
+ ^
+CREATE INDEX onek_unique2 ON onek USING btree(unique2 int4_ops);
+CREATE INDEX onek_hundred ON onek USING btree(hundred int4_ops);
+CREATE INDEX onek_stringu1 ON onek USING btree(stringu1 name_ops);
+CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
+CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
+CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
+CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
+CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
+CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
+CREATE INDEX rix ON road USING btree (name text_ops);
+CREATE INDEX iix ON ihighway USING btree (name text_ops);
+CREATE INDEX six ON shighway USING btree (name text_ops);
+-- test comments
+COMMENT ON INDEX six_wrong IS 'bad index';
+ERROR: relation "six_wrong" does not exist
+COMMENT ON INDEX six IS 'good index';
+COMMENT ON INDEX six IS NULL;
+--
+-- BTREE ascending/descending cases
+--
+-- we load int4/text from pure descending data (each key is a new
+-- low key) and name/f8 from pure ascending data (each key is a new
+-- high key). we had a bug where new low keys would sometimes be
+-- "lost".
+--
+CREATE INDEX bt_i4_index ON bt_i4_heap USING btree (seqno int4_ops);
+CREATE INDEX bt_name_index ON bt_name_heap USING btree (seqno name_ops);
+CREATE INDEX bt_txt_index ON bt_txt_heap USING btree (seqno text_ops);
+CREATE INDEX bt_f8_index ON bt_f8_heap USING btree (seqno float8_ops);
+--
+-- BTREE partial indices
+--
+CREATE INDEX onek2_u1_prtl ON onek2 USING btree(unique1 int4_ops)
+ where unique1 < 20 or unique1 > 980;
+CREATE INDEX onek2_u2_prtl ON onek2 USING btree(unique2 int4_ops)
+ where stringu1 < 'B';
+CREATE INDEX onek2_stu1_prtl ON onek2 USING btree(stringu1 name_ops)
+ where onek2.stringu1 >= 'J' and onek2.stringu1 < 'K';
+--
+-- GiST (rtree-equivalent opclasses only)
+--
+CREATE INDEX grect2ind ON fast_emp4000 USING gist (home_base);
+CREATE INDEX gpolygonind ON polygon_tbl USING gist (f1);
+CREATE INDEX gcircleind ON circle_tbl USING gist (f1);
+INSERT INTO POINT_TBL(f1) VALUES (NULL);
+CREATE INDEX gpointind ON point_tbl USING gist (f1);
+CREATE TEMP TABLE gpolygon_tbl AS
+ SELECT polygon(home_base) AS f1 FROM slow_emp4000;
+INSERT INTO gpolygon_tbl VALUES ( '(1000,0,0,1000)' );
+INSERT INTO gpolygon_tbl VALUES ( '(0,1000,1000,1000)' );
+CREATE TEMP TABLE gcircle_tbl AS
+ SELECT circle(home_base) AS f1 FROM slow_emp4000;
+CREATE INDEX ggpolygonind ON gpolygon_tbl USING gist (f1);
+CREATE INDEX ggcircleind ON gcircle_tbl USING gist (f1);
+--
+-- Test GiST indexes
+--
+-- get non-indexed results for comparison purposes
+SET enable_seqscan = ON;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = OFF;
+SELECT * FROM fast_emp4000
+ WHERE home_base @ '(200,200),(2000,1000)'::box
+ ORDER BY (home_base[0])[0];
+ home_base
+-----------------------
+ (337,455),(240,359)
+ (1444,403),(1346,344)
+(2 rows)
+
+SELECT count(*) FROM fast_emp4000 WHERE home_base && '(1000,1000,0,0)'::box;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM fast_emp4000 WHERE home_base IS NULL;
+ count
+-------
+ 278
+(1 row)
+
+SELECT * FROM polygon_tbl WHERE f1 ~ '((1,1),(2,2),(2,1))'::polygon
+ ORDER BY (poly_center(f1))[0];
+ f1
+---------------------
+ ((2,0),(2,4),(0,0))
+(1 row)
+
+SELECT * FROM circle_tbl WHERE f1 && circle(point(1,-2), 1)
+ ORDER BY area(f1);
+ f1
+---------------
+ <(1,2),3>
+ <(1,3),5>
+ <(1,2),100>
+ <(100,1),115>
+(4 rows)
+
+SELECT count(*) FROM gpolygon_tbl WHERE f1 && '(1000,1000,0,0)'::polygon;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM gcircle_tbl WHERE f1 && '<(500,500),500>'::circle;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ box '(0,0,100,100)';
+ count
+-------
+ 3
+(1 row)
+
+SELECT count(*) FROM point_tbl WHERE box '(0,0,100,100)' @> f1;
+ count
+-------
+ 3
+(1 row)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ polygon '(0,0),(0,100),(100,100),(50,50),(100,0),(0,0)';
+ count
+-------
+ 5
+(1 row)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ circle '<(50,50),50>';
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 << '(0.0, 0.0)';
+ count
+-------
+ 3
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
+ count
+-------
+ 3
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
+ count
+-------
+ 4
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 ~= '(-5, -12)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
+ f1
+-------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+ (-5,-12)
+ (5.1,34.5)
+ (1e+300,Infinity)
+ (NaN,NaN)
+
+(10 rows)
+
+SELECT * FROM point_tbl WHERE f1 IS NULL;
+ f1
+----
+
+(1 row)
+
+SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
+ f1
+-------------------
+ (1e-300,-1e-300)
+ (0,0)
+ (-3,4)
+ (-10,0)
+ (10,10)
+ (-5,-12)
+ (5.1,34.5)
+ (1e+300,Infinity)
+ (NaN,NaN)
+(9 rows)
+
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ f1
+------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+(5 rows)
+
+SELECT * FROM gpolygon_tbl ORDER BY f1 <-> '(0,0)'::point LIMIT 10;
+ f1
+-------------------------------------------------
+ ((240,359),(240,455),(337,455),(337,359))
+ ((662,163),(662,187),(759,187),(759,163))
+ ((1000,0),(0,1000))
+ ((0,1000),(1000,1000))
+ ((1346,344),(1346,403),(1444,403),(1444,344))
+ ((278,1409),(278,1457),(369,1457),(369,1409))
+ ((907,1156),(907,1201),(948,1201),(948,1156))
+ ((1517,971),(1517,1043),(1594,1043),(1594,971))
+ ((175,1820),(175,1850),(259,1850),(259,1820))
+ ((2424,81),(2424,160),(2424,160),(2424,81))
+(10 rows)
+
+SELECT circle_center(f1), round(radius(f1)) as radius FROM gcircle_tbl ORDER BY f1 <-> '(200,300)'::point LIMIT 10;
+ circle_center | radius
+----------------+--------
+ (288.5,407) | 68
+ (710.5,175) | 50
+ (323.5,1433) | 51
+ (927.5,1178.5) | 30
+ (1395,373.5) | 57
+ (1555.5,1007) | 53
+ (217,1835) | 45
+ (489,2421.5) | 22
+ (2424,120.5) | 40
+ (751.5,2655) | 20
+(10 rows)
+
+-- Now check the results from plain indexscan
+SET enable_seqscan = OFF;
+SET enable_indexscan = ON;
+SET enable_bitmapscan = OFF;
+EXPLAIN (COSTS OFF)
+SELECT * FROM fast_emp4000
+ WHERE home_base @ '(200,200),(2000,1000)'::box
+ ORDER BY (home_base[0])[0];
+ QUERY PLAN
+----------------------------------------------------------------
+ Sort
+ Sort Key: ((home_base[0])[0])
+ -> Index Only Scan using grect2ind on fast_emp4000
+ Index Cond: (home_base @ '(2000,1000),(200,200)'::box)
+(4 rows)
+
+SELECT * FROM fast_emp4000
+ WHERE home_base @ '(200,200),(2000,1000)'::box
+ ORDER BY (home_base[0])[0];
+ home_base
+-----------------------
+ (337,455),(240,359)
+ (1444,403),(1346,344)
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM fast_emp4000 WHERE home_base && '(1000,1000,0,0)'::box;
+ QUERY PLAN
+-------------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using grect2ind on fast_emp4000
+ Index Cond: (home_base && '(1000,1000),(0,0)'::box)
+(3 rows)
+
+SELECT count(*) FROM fast_emp4000 WHERE home_base && '(1000,1000,0,0)'::box;
+ count
+-------
+ 2
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM fast_emp4000 WHERE home_base IS NULL;
+ QUERY PLAN
+-------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using grect2ind on fast_emp4000
+ Index Cond: (home_base IS NULL)
+(3 rows)
+
+SELECT count(*) FROM fast_emp4000 WHERE home_base IS NULL;
+ count
+-------
+ 278
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM polygon_tbl WHERE f1 ~ '((1,1),(2,2),(2,1))'::polygon
+ ORDER BY (poly_center(f1))[0];
+ QUERY PLAN
+-----------------------------------------------------------
+ Sort
+ Sort Key: ((poly_center(f1))[0])
+ -> Index Scan using gpolygonind on polygon_tbl
+ Index Cond: (f1 ~ '((1,1),(2,2),(2,1))'::polygon)
+(4 rows)
+
+SELECT * FROM polygon_tbl WHERE f1 ~ '((1,1),(2,2),(2,1))'::polygon
+ ORDER BY (poly_center(f1))[0];
+ f1
+---------------------
+ ((2,0),(2,4),(0,0))
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM circle_tbl WHERE f1 && circle(point(1,-2), 1)
+ ORDER BY area(f1);
+ QUERY PLAN
+--------------------------------------------------
+ Sort
+ Sort Key: (area(f1))
+ -> Index Scan using gcircleind on circle_tbl
+ Index Cond: (f1 && '<(1,-2),1>'::circle)
+(4 rows)
+
+SELECT * FROM circle_tbl WHERE f1 && circle(point(1,-2), 1)
+ ORDER BY area(f1);
+ f1
+---------------
+ <(1,2),3>
+ <(1,3),5>
+ <(1,2),100>
+ <(100,1),115>
+(4 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM gpolygon_tbl WHERE f1 && '(1000,1000,0,0)'::polygon;
+ QUERY PLAN
+------------------------------------------------------------
+ Aggregate
+ -> Index Scan using ggpolygonind on gpolygon_tbl
+ Index Cond: (f1 && '((1000,1000),(0,0))'::polygon)
+(3 rows)
+
+SELECT count(*) FROM gpolygon_tbl WHERE f1 && '(1000,1000,0,0)'::polygon;
+ count
+-------
+ 2
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM gcircle_tbl WHERE f1 && '<(500,500),500>'::circle;
+ QUERY PLAN
+-------------------------------------------------------
+ Aggregate
+ -> Index Scan using ggcircleind on gcircle_tbl
+ Index Cond: (f1 && '<(500,500),500>'::circle)
+(3 rows)
+
+SELECT count(*) FROM gcircle_tbl WHERE f1 && '<(500,500),500>'::circle;
+ count
+-------
+ 2
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl WHERE f1 <@ box '(0,0,100,100)';
+ QUERY PLAN
+----------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '(100,100),(0,0)'::box)
+(3 rows)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ box '(0,0,100,100)';
+ count
+-------
+ 3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl WHERE box '(0,0,100,100)' @> f1;
+ QUERY PLAN
+----------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '(100,100),(0,0)'::box)
+(3 rows)
+
+SELECT count(*) FROM point_tbl WHERE box '(0,0,100,100)' @> f1;
+ count
+-------
+ 3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl WHERE f1 <@ polygon '(0,0),(0,100),(100,100),(50,50),(100,0),(0,0)';
+ QUERY PLAN
+----------------------------------------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '((0,0),(0,100),(100,100),(50,50),(100,0),(0,0))'::polygon)
+(3 rows)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ polygon '(0,0),(0,100),(100,100),(50,50),(100,0),(0,0)';
+ count
+-------
+ 4
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl WHERE f1 <@ circle '<(50,50),50>';
+ QUERY PLAN
+----------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '<(50,50),50>'::circle)
+(3 rows)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ circle '<(50,50),50>';
+ count
+-------
+ 1
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 << '(0.0, 0.0)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 << '(0,0)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 << '(0.0, 0.0)';
+ count
+-------
+ 3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 >> '(0,0)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
+ count
+-------
+ 3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 <^ '(0,0)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
+ count
+-------
+ 1
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 >^ '(0,0)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
+ count
+-------
+ 4
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 ~= '(-5, -12)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 ~= '(-5,-12)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 ~= '(-5, -12)';
+ count
+-------
+ 1
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
+ QUERY PLAN
+----------------------------------------------
+ Index Only Scan using gpointind on point_tbl
+ Order By: (f1 <-> '(0,1)'::point)
+(2 rows)
+
+SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
+ f1
+-------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+ (-5,-12)
+ (5.1,34.5)
+ (1e+300,Infinity)
+ (NaN,NaN)
+
+(10 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl WHERE f1 IS NULL;
+ QUERY PLAN
+----------------------------------------------
+ Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 IS NULL)
+(2 rows)
+
+SELECT * FROM point_tbl WHERE f1 IS NULL;
+ f1
+----
+
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
+ QUERY PLAN
+----------------------------------------------
+ Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 IS NOT NULL)
+ Order By: (f1 <-> '(0,1)'::point)
+(3 rows)
+
+SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
+ f1
+-------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+ (-5,-12)
+ (5.1,34.5)
+ (1e+300,Infinity)
+ (NaN,NaN)
+(9 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ QUERY PLAN
+------------------------------------------------
+ Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '(10,10),(-10,-10)'::box)
+ Order By: (f1 <-> '(0,1)'::point)
+(3 rows)
+
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ f1
+------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM gpolygon_tbl ORDER BY f1 <-> '(0,0)'::point LIMIT 10;
+ QUERY PLAN
+-----------------------------------------------------
+ Limit
+ -> Index Scan using ggpolygonind on gpolygon_tbl
+ Order By: (f1 <-> '(0,0)'::point)
+(3 rows)
+
+SELECT * FROM gpolygon_tbl ORDER BY f1 <-> '(0,0)'::point LIMIT 10;
+ f1
+-------------------------------------------------
+ ((240,359),(240,455),(337,455),(337,359))
+ ((662,163),(662,187),(759,187),(759,163))
+ ((1000,0),(0,1000))
+ ((0,1000),(1000,1000))
+ ((1346,344),(1346,403),(1444,403),(1444,344))
+ ((278,1409),(278,1457),(369,1457),(369,1409))
+ ((907,1156),(907,1201),(948,1201),(948,1156))
+ ((1517,971),(1517,1043),(1594,1043),(1594,971))
+ ((175,1820),(175,1850),(259,1850),(259,1820))
+ ((2424,81),(2424,160),(2424,160),(2424,81))
+(10 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT circle_center(f1), round(radius(f1)) as radius FROM gcircle_tbl ORDER BY f1 <-> '(200,300)'::point LIMIT 10;
+ QUERY PLAN
+---------------------------------------------------
+ Limit
+ -> Index Scan using ggcircleind on gcircle_tbl
+ Order By: (f1 <-> '(200,300)'::point)
+(3 rows)
+
+SELECT circle_center(f1), round(radius(f1)) as radius FROM gcircle_tbl ORDER BY f1 <-> '(200,300)'::point LIMIT 10;
+ circle_center | radius
+----------------+--------
+ (288.5,407) | 68
+ (710.5,175) | 50
+ (323.5,1433) | 51
+ (927.5,1178.5) | 30
+ (1395,373.5) | 57
+ (1555.5,1007) | 53
+ (217,1835) | 45
+ (489,2421.5) | 22
+ (2424,120.5) | 40
+ (751.5,2655) | 20
+(10 rows)
+
+-- Now check the results from bitmap indexscan
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = ON;
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ QUERY PLAN
+------------------------------------------------------------
+ Sort
+ Sort Key: ((f1 <-> '(0,1)'::point))
+ -> Bitmap Heap Scan on point_tbl
+ Recheck Cond: (f1 <@ '(10,10),(-10,-10)'::box)
+ -> Bitmap Index Scan on gpointind
+ Index Cond: (f1 <@ '(10,10),(-10,-10)'::box)
+(6 rows)
+
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ f1
+------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+(5 rows)
+
+RESET enable_seqscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+--
+-- GIN over int[] and text[]
+--
+-- Note: GIN currently supports only bitmap scans, not plain indexscans
+--
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = ON;
+CREATE INDEX intarrayidx ON array_index_op_test USING gin (i);
+explain (costs off)
+SELECT * FROM array_index_op_test WHERE i @> '{32}' ORDER BY seqno;
+ QUERY PLAN
+----------------------------------------------------
+ Sort
+ Sort Key: seqno
+ -> Bitmap Heap Scan on array_index_op_test
+ Recheck Cond: (i @> '{32}'::integer[])
+ -> Bitmap Index Scan on intarrayidx
+ Index Cond: (i @> '{32}'::integer[])
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE i @> '{32}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{32}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE i @> '{17}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+(8 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{17}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+(8 rows)
+
+SELECT * FROM array_index_op_test WHERE i @> '{32,17}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+(3 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{32,17}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(11 rows)
+
+SELECT * FROM array_index_op_test WHERE i <@ '{38,34,32,89}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------+----------------------------------------------------------------------------------------------------------------------------
+ 40 | {34} | {AAAAAAAAAAAAAA10611,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAA31334,AAAAA70466,AAAAAAAA81587,AAAAAAA74623}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 101 | {} | {}
+(4 rows)
+
+SELECT * FROM array_index_op_test WHERE i = '{47,77}' ORDER BY seqno;
+ seqno | i | t
+-------+---------+-----------------------------------------------------------------------------------------------------------------
+ 95 | {47,77} | {AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA74076,AAAAAAAAAA18107,AAAAA40681,AAAAAAAAAAAAAAA35875,AAAAA60038,AAAAAAA56483}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE i = '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE i @> '{}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 1 | {92,75,71,52,64,83} | {AAAAAAAA44066,AAAAAA1059,AAAAAAAAAAA176,AAAAAAA48038}
+ 2 | {3,6} | {AAAAAA98232,AAAAAAAA79710,AAAAAAAAAAAAAAAAA69675,AAAAAAAAAAAAAAAA55798,AAAAAAAAA12793}
+ 3 | {37,64,95,43,3,41,13,30,11,43} | {AAAAAAAAAA48845,AAAAA75968,AAAAA95309,AAA54451,AAAAAAAAAA22292,AAAAAAA99836,A96617,AA17009,AAAAAAAAAAAAAA95246}
+ 4 | {71,39,99,55,33,75,45} | {AAAAAAAAA53663,AAAAAAAAAAAAAAA67062,AAAAAAAAAA64777,AAA99043,AAAAAAAAAAAAAAAAAAA91804,39557}
+ 5 | {50,42,77,50,4} | {AAAAAAAAAAAAAAAAA26540,AAAAAAA79710,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA176,AAAAA95309,AAAAAAAAAAA46154,AAAAAA66777,AAAAAAAAA27249,AAAAAAAAAA64777,AAAAAAAAAAAAAAAAAAA70104}
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 7 | {12,51,88,64,8} | {AAAAAAAAAAAAAAAAAA12591,AAAAAAAAAAAAAAAAA50407,AAAAAAAAAAAA67946}
+ 8 | {60,84} | {AAAAAAA81898,AAAAAA1059,AAAAAAAAAAAA81511,AAAAA961,AAAAAAAAAAAAAAAA31334,AAAAA64741,AA6416,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAAAA50407}
+ 9 | {56,52,35,27,80,44,81,22} | {AAAAAAAAAAAAAAA73034,AAAAAAAAAAAAA7929,AAAAAAA66161,AA88409,39557,A27153,AAAAAAAA9523,AAAAAAAAAAA99000}
+ 10 | {71,5,45} | {AAAAAAAAAAA21658,AAAAAAAAAAAA21089,AAA54451,AAAAAAAAAAAAAAAAAA54141,AAAAAAAAAAAAAA28620,AAAAAAAAAAA21658,AAAAAAAAAAA74076,AAAAAAAAA27249}
+ 11 | {41,86,74,48,22,74,47,50} | {AAAAAAAA9523,AAAAAAAAAAAA37562,AAAAAAAAAAAAAAAA14047,AAAAAAAAAAA46154,AAAA41702,AAAAAAAAAAAAAAAAA764,AAAAA62737,39557}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 13 | {3,52,34,23} | {AAAAAA98232,AAAA49534,AAAAAAAAAAA21658}
+ 14 | {78,57,19} | {AAAA8857,AAAAAAAAAAAAAAA73034,AAAAAAAA81587,AAAAAAAAAAAAAAA68526,AAAAA75968,AAAAAAAAAAAAAA65909,AAAAAAAAA10012,AAAAAAAAAAAAAA65909}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 16 | {14,63,85,11} | {AAAAAA66777}
+ 17 | {7,10,81,85} | {AAAAAA43678,AAAAAAA12144,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAAAAA15356}
+ 18 | {1} | {AAAAAAAAAAA33576,AAAAA95309,64261,AAA59323,AAAAAAAAAAAAAA95246,55847,AAAAAAAAAAAA67946,AAAAAAAAAAAAAAAAAA64374}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 20 | {72,89,70,51,54,37,8,49,79} | {AAAAAA58494}
+ 21 | {2,8,65,10,5,79,43} | {AAAAAAAAAAAAAAAAA88852,AAAAAAAAAAAAAAAAAAA91804,AAAAA64669,AAAAAAAAAAAAAAAA1443,AAAAAAAAAAAAAAAA23657,AAAAA12179,AAAAAAAAAAAAAAAAA88852,AAAAAAAAAAAAAAAA31334,AAAAAAAAAAAAAAAA41303,AAAAAAAAAAAAAAAAAAA85420}
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 23 | {40,90,5,38,72,40,30,10,43,55} | {A6053,AAAAAAAAAAA6119,AA44673,AAAAAAAAAAAAAAAAA764,AA17009,AAAAA17383,AAAAA70514,AAAAA33250,AAAAA95309,AAAAAAAAAAAA37562}
+ 24 | {94,61,99,35,48} | {AAAAAAAAAAA50956,AAAAAAAAAAA15165,AAAA85070,AAAAAAAAAAAAAAA36627,AAAAA961,AAAAAAAAAA55219}
+ 25 | {31,1,10,11,27,79,38} | {AAAAAAAAAAAAAAAAAA59334,45449}
+ 26 | {71,10,9,69,75} | {47735,AAAAAAA21462,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAAAAA91804,AAAAAAAAA72121,AAAAAAAAAAAAAAAAAAA1205,AAAAA41597,AAAA8857,AAAAAAAAAAAAAAAAAAA15356,AA17009}
+ 27 | {94} | {AA6416,A6053,AAAAAAA21462,AAAAAAA57334,AAAAAAAAAAAAAAAAAA12591,AA88409,AAAAAAAAAAAAA70254}
+ 28 | {14,33,6,34,14} | {AAAAAAAAAAAAAAA13198,AAAAAAAA69452,AAAAAAAAAAA82945,AAAAAAA12144,AAAAAAAAA72121,AAAAAAAAAA18601}
+ 29 | {39,21} | {AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAAAAA38885,AAAA85070,AAAAAAAAAAAAAAAAAAA70104,AAAAA66674,AAAAAAAAAAAAA62007,AAAAAAAA69452,AAAAAAA1242,AAAAAAAAAAAAAAAA1729,AAAA35194}
+ 30 | {26,81,47,91,34} | {AAAAAAAAAAAAAAAAAAA70104,AAAAAAA80240}
+ 31 | {80,24,18,21,54} | {AAAAAAAAAAAAAAA13198,AAAAAAAAAAAAAAAAAAA70415,A27153,AAAAAAAAA53663,AAAAAAAAAAAAAAAAA50407,A68938}
+ 32 | {58,79,82,80,67,75,98,10,41} | {AAAAAAAAAAAAAAAAAA61286,AAA54451,AAAAAAAAAAAAAAAAAAA87527,A96617,51533}
+ 33 | {74,73} | {A85417,AAAAAAA56483,AAAAA17383,AAAAAAAAAAAAA62159,AAAAAAAAAAAA52814,AAAAAAAAAAAAA85723,AAAAAAAAAAAAAAAAAA55796}
+ 34 | {70,45} | {AAAAAAAAAAAAAAAAAA71621,AAAAAAAAAAAAAA28620,AAAAAAAAAA55219,AAAAAAAA23648,AAAAAAAAAA22292,AAAAAAA1242}
+ 35 | {23,40} | {AAAAAAAAAAAA52814,AAAA48949,AAAAAAAAA34727,AAAA8857,AAAAAAAAAAAAAAAAAAA62179,AAAAAAAAAAAAAAA68526,AAAAAAA99836,AAAAAAAA50094,AAAA91194,AAAAAAAAAAAAA73084}
+ 36 | {79,82,14,52,30,5,79} | {AAAAAAAAA53663,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA89194,AA88409,AAAAAAAAAAAAAAA81326,AAAAAAAAAAAAAAAAA63050,AAAAAAAAAAAAAAAA33598}
+ 37 | {53,11,81,39,3,78,58,64,74} | {AAAAAAAAAAAAAAAAAAA17075,AAAAAAA66161,AAAAAAAA23648,AAAAAAAAAAAAAA10611}
+ 38 | {59,5,4,95,28} | {AAAAAAAAAAA82945,A96617,47735,AAAAA12179,AAAAA64669,AAAAAA99807,AA74433,AAAAAAAAAAAAAAAAA59387}
+ 39 | {82,43,99,16,74} | {AAAAAAAAAAAAAAA67062,AAAAAAA57334,AAAAAAAAAAAAAA65909,A27153,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAA64777,AAAAAAAAAAAA81511,AAAAAAAAAAAAAA65909,AAAAAAAAAAAAAA28620}
+ 40 | {34} | {AAAAAAAAAAAAAA10611,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAA31334,AAAAA70466,AAAAAAAA81587,AAAAAAA74623}
+ 41 | {19,26,63,12,93,73,27,94} | {AAAAAAA79710,AAAAAAAAAA55219,AAAA41702,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA71621,AAAAAAAAAAAAAAAAA63050,AAAAAAA99836,AAAAAAAAAAAAAA8666}
+ 42 | {15,76,82,75,8,91} | {AAAAAAAAAAA176,AAAAAA38063,45449,AAAAAA54032,AAAAAAA81898,AA6416,AAAAAAAAAAAAAAAAAAA62179,45449,AAAAA60038,AAAAAAAA81587}
+ 43 | {39,87,91,97,79,28} | {AAAAAAAAAAA74076,A96617,AAAAAAAAAAAAAAAAAAA89194,AAAAAAAAAAAAAAAAAA55796,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAA67946}
+ 44 | {40,58,68,29,54} | {AAAAAAA81898,AAAAAA66777,AAAAAA98232}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 46 | {53,24} | {AAAAAAAAAAA53908,AAAAAA54032,AAAAA17383,AAAA48949,AAAAAAAAAA18601,AAAAA64669,45449,AAAAAAAAAAA98051,AAAAAAAAAAAAAAAAAA71621}
+ 47 | {98,23,64,12,75,61} | {AAA59323,AAAAA95309,AAAAAAAAAAAAAAAA31334,AAAAAAAAA27249,AAAAA17383,AAAAAAAAAAAA37562,AAAAAA1059,A84822,55847,AAAAA70466}
+ 48 | {76,14} | {AAAAAAAAAAAAA59671,AAAAAAAAAAAAAAAAAAA91804,AAAAAA66777,AAAAAAAAAAAAAAAAAAA89194,AAAAAAAAAAAAAAA36627,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAA73084,AAAAAAA79710,AAAAAAAAAAAAAAA40402,AAAAAAAAAAAAAAAAAAA65037}
+ 49 | {56,5,54,37,49} | {AA21643,AAAAAAAAAAA92631,AAAAAAAA81587}
+ 50 | {20,12,37,64,93} | {AAAAAAAAAA5483,AAAAAAAAAAAAAAAAAAA1205,AA6416,AAAAAAAAAAAAAAAAA63050,AAAAAAAAAAAAAAAAAA47955}
+ 51 | {47} | {AAAAAAAAAAAAAA96505,AAAAAAAAAAAAAAAAAA36842,AAAAA95309,AAAAAAAA81587,AA6416,AAAA91194,AAAAAA58494,AAAAAA1059,AAAAAAAA69452}
+ 52 | {89,0} | {AAAAAAAAAAAAAAAAAA47955,AAAAAAA48038,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAAAAA73084,AAAAA70466,AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA46154,AA66862}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 54 | {70,47} | {AAAAAAAAAAAAAAAAAA54141,AAAAA40681,AAAAAAA48038,AAAAAAAAAAAAAAAA29150,AAAAA41597,AAAAAAAAAAAAAAAAAA59334,AA15322}
+ 55 | {47,79,47,64,72,25,71,24,93} | {AAAAAAAAAAAAAAAAAA55796,AAAAA62737}
+ 56 | {33,7,60,54,93,90,77,85,39} | {AAAAAAAAAAAAAAAAAA32918,AA42406}
+ 57 | {23,45,10,42,36,21,9,96} | {AAAAAAAAAAAAAAAAAAA70415}
+ 58 | {92} | {AAAAAAAAAAAAAAAA98414,AAAAAAAA23648,AAAAAAAAAAAAAAAAAA55796,AA25381,AAAAAAAAAAA6119}
+ 59 | {9,69,46,77} | {39557,AAAAAAA89932,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAAAAAAAAA26540,AAA20874,AA6416,AAAAAAAAAAAAAAAAAA47955}
+ 60 | {62,2,59,38,89} | {AAAAAAA89932,AAAAAAAAAAAAAAAAAAA15356,AA99927,AA17009,AAAAAAAAAAAAAAA35875}
+ 61 | {72,2,44,95,54,54,13} | {AAAAAAAAAAAAAAAAAAA91804}
+ 62 | {83,72,29,73} | {AAAAAAAAAAAAA15097,AAAA8857,AAAAAAAAAAAA35809,AAAAAAAAAAAA52814,AAAAAAAAAAAAAAAAAAA38885,AAAAAAAAAAAAAAAAAA24183,AAAAAA43678,A96617}
+ 63 | {11,4,61,87} | {AAAAAAAAA27249,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAA13198,AAA20874,39557,51533,AAAAAAAAAAA53908,AAAAAAAAAAAAAA96505,AAAAAAAA78938}
+ 64 | {26,19,34,24,81,78} | {A96617,AAAAAAAAAAAAAAAAAAA70104,A68938,AAAAAAAAAAA53908,AAAAAAAAAAAAAAA453,AA17009,AAAAAAA80240}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 66 | {31,23,70,52,4,33,48,25} | {AAAAAAAAAAAAAAAAA69675,AAAAAAAA50094,AAAAAAAAAAA92631,AAAA35194,39557,AAAAAAA99836}
+ 67 | {31,94,7,10} | {AAAAAA38063,A96617,AAAA35194,AAAAAAAAAAAA67946}
+ 68 | {90,43,38} | {AA75092,AAAAAAAAAAAAAAAAA69675,AAAAAAAAAAA92631,AAAAAAAAA10012,AAAAAAAAAAAAA7929,AA21643}
+ 69 | {67,35,99,85,72,86,44} | {AAAAAAAAAAAAAAAAAAA1205,AAAAAAAA50094,AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAAAAAAA47955}
+ 70 | {56,70,83} | {AAAA41702,AAAAAAAAAAA82945,AA21643,AAAAAAAAAAA99000,A27153,AA25381,AAAAAAAAAAAAAA96505,AAAAAAA1242}
+ 71 | {74,26} | {AAAAAAAAAAA50956,AA74433,AAAAAAA21462,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAA36627,AAAAAAAAAAAAA70254,AAAAAAAAAA43419,39557}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 73 | {88,25,96,78,65,15,29,19} | {AAA54451,AAAAAAAAA27249,AAAAAAA9228,AAAAAAAAAAAAAAA67062,AAAAAAAAAAAAAAAAAAA70415,AAAAA17383,AAAAAAAAAAAAAAAA33598}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 75 | {12,96,83,24,71,89,55} | {AAAA48949,AAAAAAAA29716,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAAA67946,AAAAAAAAAAAAAAAA29150,AAA28075,AAAAAAAAAAAAAAAAA43052}
+ 76 | {92,55,10,7} | {AAAAAAAAAAAAAAA67062}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 78 | {55,89,44,84,34} | {AAAAAAAAAAA6119,AAAAAAAAAAAAAA8666,AA99927,AA42406,AAAAAAA81898,AAAAAAA9228,AAAAAAAAAAA92631,AA21643,AAAAAAAAAAAAAA28620}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 80 | {74,89,44,80,0} | {AAAA35194,AAAAAAAA79710,AAA20874,AAAAAAAAAAAAAAAAAAA70104,AAAAAAAAAAAAA73084,AAAAAAA57334,AAAAAAA9228,AAAAAAAAAAAAA62007}
+ 81 | {63,77,54,48,61,53,97} | {AAAAAAAAAAAAAAA81326,AAAAAAAAAA22292,AA25381,AAAAAAAAAAA74076,AAAAAAA81898,AAAAAAAAA72121}
+ 82 | {34,60,4,79,78,16,86,89,42,50} | {AAAAA40681,AAAAAAAAAAAAAAAAAA12591,AAAAAAA80240,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA70104}
+ 83 | {14,10} | {AAAAAAAAAA22292,AAAAAAAAAAAAA70254,AAAAAAAAAAA6119}
+ 84 | {11,83,35,13,96,94} | {AAAAA95309,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAAAAA24183}
+ 85 | {39,60} | {AAAAAAAAAAAAAAAA55798,AAAAAAAAAA22292,AAAAAAA66161,AAAAAAA21462,AAAAAAAAAAAAAAAAAA12591,55847,AAAAAA98232,AAAAAAAAAAA46154}
+ 86 | {33,81,72,74,45,36,82} | {AAAAAAAA81587,AAAAAAAAAAAAAA96505,45449,AAAA80176}
+ 87 | {57,27,50,12,97,68} | {AAAAAAAAAAAAAAAAA26540,AAAAAAAAA10012,AAAAAAAAAAAA35809,AAAAAAAAAAAAAAAA29150,AAAAAAAAAAA82945,AAAAAA66777,31228,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAA96505}
+ 88 | {41,90,77,24,6,24} | {AAAA35194,AAAA35194,AAAAAAA80240,AAAAAAAAAAA46154,AAAAAA58494,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA59334,AAAAAAAAAAAAAAAAAAA91804,AA74433}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 90 | {88,75} | {AAAAA60038,AAAAAAAA23648,AAAAAAAAAAA99000,AAAA41702,AAAAAAAAAAAAA22860,AAAAAAAAAAAAAAA68526}
+ 91 | {78} | {AAAAAAAAAAAAA62007,AAA99043}
+ 92 | {85,63,49,45} | {AAAAAAA89932,AAAAAAAAAAAAA22860,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAAA21089}
+ 93 | {11} | {AAAAAAAAAAA176,AAAAAAAAAAAAAA8666,AAAAAAAAAAAAAAA453,AAAAAAAAAAAAA85723,A68938,AAAAAAAAAAAAA9821,AAAAAAA48038,AAAAAAAAAAAAAAAAA59387,AA99927,AAAAA17383}
+ 94 | {98,9,85,62,88,91,60,61,38,86} | {AAAAAAAA81587,AAAAA17383,AAAAAAAA81587}
+ 95 | {47,77} | {AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA74076,AAAAAAAAAA18107,AAAAA40681,AAAAAAAAAAAAAAA35875,AAAAA60038,AAAAAAA56483}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+ 97 | {54,2,86,65} | {47735,AAAAAAA99836,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAA29150,AAAAAAA80240,AAAAAAAAAAAAAAAA98414,AAAAAAA56483,AAAAAAAAAAAAAAAA29150,AAAAAAA39692,AA21643}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 99 | {37,86} | {AAAAAAAAAAAAAAAAAA32918,AAAAA70514,AAAAAAAAA10012,AAAAAAAAAAAAAAAAA59387,AAAAAAAAAA64777,AAAAAAAAAAAAAAAAAAA15356}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+ 101 | {} | {}
+ 102 | {NULL} | {NULL}
+(102 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{}' ORDER BY seqno;
+ seqno | i | t
+-------+---+---
+(0 rows)
+
+SELECT * FROM array_index_op_test WHERE i <@ '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+SELECT * FROM array_op_test WHERE i = '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+--------+--------
+ 102 | {NULL} | {NULL}
+(1 row)
+
+SELECT * FROM array_op_test WHERE i @> '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+---+---
+(0 rows)
+
+SELECT * FROM array_op_test WHERE i && '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+---+---
+(0 rows)
+
+SELECT * FROM array_op_test WHERE i <@ '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+CREATE INDEX textarrayidx ON array_index_op_test USING gin (t);
+explain (costs off)
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAAA72908}' ORDER BY seqno;
+ QUERY PLAN
+------------------------------------------------------------
+ Sort
+ Sort Key: seqno
+ -> Bitmap Heap Scan on array_index_op_test
+ Recheck Cond: (t @> '{AAAAAAAA72908}'::text[])
+ -> Bitmap Index Scan on textarrayidx
+ Index Cond: (t @> '{AAAAAAAA72908}'::text[])
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAAA72908}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+(4 rows)
+
+SELECT * FROM array_index_op_test WHERE t && '{AAAAAAAA72908}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+(4 rows)
+
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAAAAA646}' ORDER BY seqno;
+ seqno | i | t
+-------+------------------+--------------------------------------------------------------------
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+(3 rows)
+
+SELECT * FROM array_index_op_test WHERE t && '{AAAAAAAAAA646}' ORDER BY seqno;
+ seqno | i | t
+-------+------------------+--------------------------------------------------------------------
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+(3 rows)
+
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAAA72908,AAAAAAAAAA646}' ORDER BY seqno;
+ seqno | i | t
+-------+------+--------------------------------------------------------------------
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE t && '{AAAAAAAA72908,AAAAAAAAAA646}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE t <@ '{AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}' ORDER BY seqno;
+ seqno | i | t
+-------+--------------------+-----------------------------------------------------------------------------------------------------------
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 101 | {} | {}
+(3 rows)
+
+SELECT * FROM array_index_op_test WHERE t = '{AAAAAAAAAA646,A87088}' ORDER BY seqno;
+ seqno | i | t
+-------+------------+------------------------
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE t = '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE t @> '{}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 1 | {92,75,71,52,64,83} | {AAAAAAAA44066,AAAAAA1059,AAAAAAAAAAA176,AAAAAAA48038}
+ 2 | {3,6} | {AAAAAA98232,AAAAAAAA79710,AAAAAAAAAAAAAAAAA69675,AAAAAAAAAAAAAAAA55798,AAAAAAAAA12793}
+ 3 | {37,64,95,43,3,41,13,30,11,43} | {AAAAAAAAAA48845,AAAAA75968,AAAAA95309,AAA54451,AAAAAAAAAA22292,AAAAAAA99836,A96617,AA17009,AAAAAAAAAAAAAA95246}
+ 4 | {71,39,99,55,33,75,45} | {AAAAAAAAA53663,AAAAAAAAAAAAAAA67062,AAAAAAAAAA64777,AAA99043,AAAAAAAAAAAAAAAAAAA91804,39557}
+ 5 | {50,42,77,50,4} | {AAAAAAAAAAAAAAAAA26540,AAAAAAA79710,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA176,AAAAA95309,AAAAAAAAAAA46154,AAAAAA66777,AAAAAAAAA27249,AAAAAAAAAA64777,AAAAAAAAAAAAAAAAAAA70104}
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 7 | {12,51,88,64,8} | {AAAAAAAAAAAAAAAAAA12591,AAAAAAAAAAAAAAAAA50407,AAAAAAAAAAAA67946}
+ 8 | {60,84} | {AAAAAAA81898,AAAAAA1059,AAAAAAAAAAAA81511,AAAAA961,AAAAAAAAAAAAAAAA31334,AAAAA64741,AA6416,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAAAA50407}
+ 9 | {56,52,35,27,80,44,81,22} | {AAAAAAAAAAAAAAA73034,AAAAAAAAAAAAA7929,AAAAAAA66161,AA88409,39557,A27153,AAAAAAAA9523,AAAAAAAAAAA99000}
+ 10 | {71,5,45} | {AAAAAAAAAAA21658,AAAAAAAAAAAA21089,AAA54451,AAAAAAAAAAAAAAAAAA54141,AAAAAAAAAAAAAA28620,AAAAAAAAAAA21658,AAAAAAAAAAA74076,AAAAAAAAA27249}
+ 11 | {41,86,74,48,22,74,47,50} | {AAAAAAAA9523,AAAAAAAAAAAA37562,AAAAAAAAAAAAAAAA14047,AAAAAAAAAAA46154,AAAA41702,AAAAAAAAAAAAAAAAA764,AAAAA62737,39557}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 13 | {3,52,34,23} | {AAAAAA98232,AAAA49534,AAAAAAAAAAA21658}
+ 14 | {78,57,19} | {AAAA8857,AAAAAAAAAAAAAAA73034,AAAAAAAA81587,AAAAAAAAAAAAAAA68526,AAAAA75968,AAAAAAAAAAAAAA65909,AAAAAAAAA10012,AAAAAAAAAAAAAA65909}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 16 | {14,63,85,11} | {AAAAAA66777}
+ 17 | {7,10,81,85} | {AAAAAA43678,AAAAAAA12144,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAAAAA15356}
+ 18 | {1} | {AAAAAAAAAAA33576,AAAAA95309,64261,AAA59323,AAAAAAAAAAAAAA95246,55847,AAAAAAAAAAAA67946,AAAAAAAAAAAAAAAAAA64374}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 20 | {72,89,70,51,54,37,8,49,79} | {AAAAAA58494}
+ 21 | {2,8,65,10,5,79,43} | {AAAAAAAAAAAAAAAAA88852,AAAAAAAAAAAAAAAAAAA91804,AAAAA64669,AAAAAAAAAAAAAAAA1443,AAAAAAAAAAAAAAAA23657,AAAAA12179,AAAAAAAAAAAAAAAAA88852,AAAAAAAAAAAAAAAA31334,AAAAAAAAAAAAAAAA41303,AAAAAAAAAAAAAAAAAAA85420}
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 23 | {40,90,5,38,72,40,30,10,43,55} | {A6053,AAAAAAAAAAA6119,AA44673,AAAAAAAAAAAAAAAAA764,AA17009,AAAAA17383,AAAAA70514,AAAAA33250,AAAAA95309,AAAAAAAAAAAA37562}
+ 24 | {94,61,99,35,48} | {AAAAAAAAAAA50956,AAAAAAAAAAA15165,AAAA85070,AAAAAAAAAAAAAAA36627,AAAAA961,AAAAAAAAAA55219}
+ 25 | {31,1,10,11,27,79,38} | {AAAAAAAAAAAAAAAAAA59334,45449}
+ 26 | {71,10,9,69,75} | {47735,AAAAAAA21462,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAAAAA91804,AAAAAAAAA72121,AAAAAAAAAAAAAAAAAAA1205,AAAAA41597,AAAA8857,AAAAAAAAAAAAAAAAAAA15356,AA17009}
+ 27 | {94} | {AA6416,A6053,AAAAAAA21462,AAAAAAA57334,AAAAAAAAAAAAAAAAAA12591,AA88409,AAAAAAAAAAAAA70254}
+ 28 | {14,33,6,34,14} | {AAAAAAAAAAAAAAA13198,AAAAAAAA69452,AAAAAAAAAAA82945,AAAAAAA12144,AAAAAAAAA72121,AAAAAAAAAA18601}
+ 29 | {39,21} | {AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAAAAA38885,AAAA85070,AAAAAAAAAAAAAAAAAAA70104,AAAAA66674,AAAAAAAAAAAAA62007,AAAAAAAA69452,AAAAAAA1242,AAAAAAAAAAAAAAAA1729,AAAA35194}
+ 30 | {26,81,47,91,34} | {AAAAAAAAAAAAAAAAAAA70104,AAAAAAA80240}
+ 31 | {80,24,18,21,54} | {AAAAAAAAAAAAAAA13198,AAAAAAAAAAAAAAAAAAA70415,A27153,AAAAAAAAA53663,AAAAAAAAAAAAAAAAA50407,A68938}
+ 32 | {58,79,82,80,67,75,98,10,41} | {AAAAAAAAAAAAAAAAAA61286,AAA54451,AAAAAAAAAAAAAAAAAAA87527,A96617,51533}
+ 33 | {74,73} | {A85417,AAAAAAA56483,AAAAA17383,AAAAAAAAAAAAA62159,AAAAAAAAAAAA52814,AAAAAAAAAAAAA85723,AAAAAAAAAAAAAAAAAA55796}
+ 34 | {70,45} | {AAAAAAAAAAAAAAAAAA71621,AAAAAAAAAAAAAA28620,AAAAAAAAAA55219,AAAAAAAA23648,AAAAAAAAAA22292,AAAAAAA1242}
+ 35 | {23,40} | {AAAAAAAAAAAA52814,AAAA48949,AAAAAAAAA34727,AAAA8857,AAAAAAAAAAAAAAAAAAA62179,AAAAAAAAAAAAAAA68526,AAAAAAA99836,AAAAAAAA50094,AAAA91194,AAAAAAAAAAAAA73084}
+ 36 | {79,82,14,52,30,5,79} | {AAAAAAAAA53663,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA89194,AA88409,AAAAAAAAAAAAAAA81326,AAAAAAAAAAAAAAAAA63050,AAAAAAAAAAAAAAAA33598}
+ 37 | {53,11,81,39,3,78,58,64,74} | {AAAAAAAAAAAAAAAAAAA17075,AAAAAAA66161,AAAAAAAA23648,AAAAAAAAAAAAAA10611}
+ 38 | {59,5,4,95,28} | {AAAAAAAAAAA82945,A96617,47735,AAAAA12179,AAAAA64669,AAAAAA99807,AA74433,AAAAAAAAAAAAAAAAA59387}
+ 39 | {82,43,99,16,74} | {AAAAAAAAAAAAAAA67062,AAAAAAA57334,AAAAAAAAAAAAAA65909,A27153,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAA64777,AAAAAAAAAAAA81511,AAAAAAAAAAAAAA65909,AAAAAAAAAAAAAA28620}
+ 40 | {34} | {AAAAAAAAAAAAAA10611,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAA31334,AAAAA70466,AAAAAAAA81587,AAAAAAA74623}
+ 41 | {19,26,63,12,93,73,27,94} | {AAAAAAA79710,AAAAAAAAAA55219,AAAA41702,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA71621,AAAAAAAAAAAAAAAAA63050,AAAAAAA99836,AAAAAAAAAAAAAA8666}
+ 42 | {15,76,82,75,8,91} | {AAAAAAAAAAA176,AAAAAA38063,45449,AAAAAA54032,AAAAAAA81898,AA6416,AAAAAAAAAAAAAAAAAAA62179,45449,AAAAA60038,AAAAAAAA81587}
+ 43 | {39,87,91,97,79,28} | {AAAAAAAAAAA74076,A96617,AAAAAAAAAAAAAAAAAAA89194,AAAAAAAAAAAAAAAAAA55796,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAA67946}
+ 44 | {40,58,68,29,54} | {AAAAAAA81898,AAAAAA66777,AAAAAA98232}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 46 | {53,24} | {AAAAAAAAAAA53908,AAAAAA54032,AAAAA17383,AAAA48949,AAAAAAAAAA18601,AAAAA64669,45449,AAAAAAAAAAA98051,AAAAAAAAAAAAAAAAAA71621}
+ 47 | {98,23,64,12,75,61} | {AAA59323,AAAAA95309,AAAAAAAAAAAAAAAA31334,AAAAAAAAA27249,AAAAA17383,AAAAAAAAAAAA37562,AAAAAA1059,A84822,55847,AAAAA70466}
+ 48 | {76,14} | {AAAAAAAAAAAAA59671,AAAAAAAAAAAAAAAAAAA91804,AAAAAA66777,AAAAAAAAAAAAAAAAAAA89194,AAAAAAAAAAAAAAA36627,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAA73084,AAAAAAA79710,AAAAAAAAAAAAAAA40402,AAAAAAAAAAAAAAAAAAA65037}
+ 49 | {56,5,54,37,49} | {AA21643,AAAAAAAAAAA92631,AAAAAAAA81587}
+ 50 | {20,12,37,64,93} | {AAAAAAAAAA5483,AAAAAAAAAAAAAAAAAAA1205,AA6416,AAAAAAAAAAAAAAAAA63050,AAAAAAAAAAAAAAAAAA47955}
+ 51 | {47} | {AAAAAAAAAAAAAA96505,AAAAAAAAAAAAAAAAAA36842,AAAAA95309,AAAAAAAA81587,AA6416,AAAA91194,AAAAAA58494,AAAAAA1059,AAAAAAAA69452}
+ 52 | {89,0} | {AAAAAAAAAAAAAAAAAA47955,AAAAAAA48038,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAAAAA73084,AAAAA70466,AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA46154,AA66862}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 54 | {70,47} | {AAAAAAAAAAAAAAAAAA54141,AAAAA40681,AAAAAAA48038,AAAAAAAAAAAAAAAA29150,AAAAA41597,AAAAAAAAAAAAAAAAAA59334,AA15322}
+ 55 | {47,79,47,64,72,25,71,24,93} | {AAAAAAAAAAAAAAAAAA55796,AAAAA62737}
+ 56 | {33,7,60,54,93,90,77,85,39} | {AAAAAAAAAAAAAAAAAA32918,AA42406}
+ 57 | {23,45,10,42,36,21,9,96} | {AAAAAAAAAAAAAAAAAAA70415}
+ 58 | {92} | {AAAAAAAAAAAAAAAA98414,AAAAAAAA23648,AAAAAAAAAAAAAAAAAA55796,AA25381,AAAAAAAAAAA6119}
+ 59 | {9,69,46,77} | {39557,AAAAAAA89932,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAAAAAAAAA26540,AAA20874,AA6416,AAAAAAAAAAAAAAAAAA47955}
+ 60 | {62,2,59,38,89} | {AAAAAAA89932,AAAAAAAAAAAAAAAAAAA15356,AA99927,AA17009,AAAAAAAAAAAAAAA35875}
+ 61 | {72,2,44,95,54,54,13} | {AAAAAAAAAAAAAAAAAAA91804}
+ 62 | {83,72,29,73} | {AAAAAAAAAAAAA15097,AAAA8857,AAAAAAAAAAAA35809,AAAAAAAAAAAA52814,AAAAAAAAAAAAAAAAAAA38885,AAAAAAAAAAAAAAAAAA24183,AAAAAA43678,A96617}
+ 63 | {11,4,61,87} | {AAAAAAAAA27249,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAA13198,AAA20874,39557,51533,AAAAAAAAAAA53908,AAAAAAAAAAAAAA96505,AAAAAAAA78938}
+ 64 | {26,19,34,24,81,78} | {A96617,AAAAAAAAAAAAAAAAAAA70104,A68938,AAAAAAAAAAA53908,AAAAAAAAAAAAAAA453,AA17009,AAAAAAA80240}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 66 | {31,23,70,52,4,33,48,25} | {AAAAAAAAAAAAAAAAA69675,AAAAAAAA50094,AAAAAAAAAAA92631,AAAA35194,39557,AAAAAAA99836}
+ 67 | {31,94,7,10} | {AAAAAA38063,A96617,AAAA35194,AAAAAAAAAAAA67946}
+ 68 | {90,43,38} | {AA75092,AAAAAAAAAAAAAAAAA69675,AAAAAAAAAAA92631,AAAAAAAAA10012,AAAAAAAAAAAAA7929,AA21643}
+ 69 | {67,35,99,85,72,86,44} | {AAAAAAAAAAAAAAAAAAA1205,AAAAAAAA50094,AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAAAAAAA47955}
+ 70 | {56,70,83} | {AAAA41702,AAAAAAAAAAA82945,AA21643,AAAAAAAAAAA99000,A27153,AA25381,AAAAAAAAAAAAAA96505,AAAAAAA1242}
+ 71 | {74,26} | {AAAAAAAAAAA50956,AA74433,AAAAAAA21462,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAA36627,AAAAAAAAAAAAA70254,AAAAAAAAAA43419,39557}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 73 | {88,25,96,78,65,15,29,19} | {AAA54451,AAAAAAAAA27249,AAAAAAA9228,AAAAAAAAAAAAAAA67062,AAAAAAAAAAAAAAAAAAA70415,AAAAA17383,AAAAAAAAAAAAAAAA33598}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 75 | {12,96,83,24,71,89,55} | {AAAA48949,AAAAAAAA29716,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAAA67946,AAAAAAAAAAAAAAAA29150,AAA28075,AAAAAAAAAAAAAAAAA43052}
+ 76 | {92,55,10,7} | {AAAAAAAAAAAAAAA67062}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 78 | {55,89,44,84,34} | {AAAAAAAAAAA6119,AAAAAAAAAAAAAA8666,AA99927,AA42406,AAAAAAA81898,AAAAAAA9228,AAAAAAAAAAA92631,AA21643,AAAAAAAAAAAAAA28620}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 80 | {74,89,44,80,0} | {AAAA35194,AAAAAAAA79710,AAA20874,AAAAAAAAAAAAAAAAAAA70104,AAAAAAAAAAAAA73084,AAAAAAA57334,AAAAAAA9228,AAAAAAAAAAAAA62007}
+ 81 | {63,77,54,48,61,53,97} | {AAAAAAAAAAAAAAA81326,AAAAAAAAAA22292,AA25381,AAAAAAAAAAA74076,AAAAAAA81898,AAAAAAAAA72121}
+ 82 | {34,60,4,79,78,16,86,89,42,50} | {AAAAA40681,AAAAAAAAAAAAAAAAAA12591,AAAAAAA80240,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA70104}
+ 83 | {14,10} | {AAAAAAAAAA22292,AAAAAAAAAAAAA70254,AAAAAAAAAAA6119}
+ 84 | {11,83,35,13,96,94} | {AAAAA95309,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAAAAA24183}
+ 85 | {39,60} | {AAAAAAAAAAAAAAAA55798,AAAAAAAAAA22292,AAAAAAA66161,AAAAAAA21462,AAAAAAAAAAAAAAAAAA12591,55847,AAAAAA98232,AAAAAAAAAAA46154}
+ 86 | {33,81,72,74,45,36,82} | {AAAAAAAA81587,AAAAAAAAAAAAAA96505,45449,AAAA80176}
+ 87 | {57,27,50,12,97,68} | {AAAAAAAAAAAAAAAAA26540,AAAAAAAAA10012,AAAAAAAAAAAA35809,AAAAAAAAAAAAAAAA29150,AAAAAAAAAAA82945,AAAAAA66777,31228,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAA96505}
+ 88 | {41,90,77,24,6,24} | {AAAA35194,AAAA35194,AAAAAAA80240,AAAAAAAAAAA46154,AAAAAA58494,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA59334,AAAAAAAAAAAAAAAAAAA91804,AA74433}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 90 | {88,75} | {AAAAA60038,AAAAAAAA23648,AAAAAAAAAAA99000,AAAA41702,AAAAAAAAAAAAA22860,AAAAAAAAAAAAAAA68526}
+ 91 | {78} | {AAAAAAAAAAAAA62007,AAA99043}
+ 92 | {85,63,49,45} | {AAAAAAA89932,AAAAAAAAAAAAA22860,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAAA21089}
+ 93 | {11} | {AAAAAAAAAAA176,AAAAAAAAAAAAAA8666,AAAAAAAAAAAAAAA453,AAAAAAAAAAAAA85723,A68938,AAAAAAAAAAAAA9821,AAAAAAA48038,AAAAAAAAAAAAAAAAA59387,AA99927,AAAAA17383}
+ 94 | {98,9,85,62,88,91,60,61,38,86} | {AAAAAAAA81587,AAAAA17383,AAAAAAAA81587}
+ 95 | {47,77} | {AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA74076,AAAAAAAAAA18107,AAAAA40681,AAAAAAAAAAAAAAA35875,AAAAA60038,AAAAAAA56483}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+ 97 | {54,2,86,65} | {47735,AAAAAAA99836,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAA29150,AAAAAAA80240,AAAAAAAAAAAAAAAA98414,AAAAAAA56483,AAAAAAAAAAAAAAAA29150,AAAAAAA39692,AA21643}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 99 | {37,86} | {AAAAAAAAAAAAAAAAAA32918,AAAAA70514,AAAAAAAAA10012,AAAAAAAAAAAAAAAAA59387,AAAAAAAAAA64777,AAAAAAAAAAAAAAAAAAA15356}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+ 101 | {} | {}
+ 102 | {NULL} | {NULL}
+(102 rows)
+
+SELECT * FROM array_index_op_test WHERE t && '{}' ORDER BY seqno;
+ seqno | i | t
+-------+---+---
+(0 rows)
+
+SELECT * FROM array_index_op_test WHERE t <@ '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+-- And try it with a multicolumn GIN index
+DROP INDEX intarrayidx, textarrayidx;
+CREATE INDEX botharrayidx ON array_index_op_test USING gin (i, t);
+SELECT * FROM array_index_op_test WHERE i @> '{32}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{32}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAA80240}' ORDER BY seqno;
+ seqno | i | t
+-------+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 30 | {26,81,47,91,34} | {AAAAAAAAAAAAAAAAAAA70104,AAAAAAA80240}
+ 64 | {26,19,34,24,81,78} | {A96617,AAAAAAAAAAAAAAAAAAA70104,A68938,AAAAAAAAAAA53908,AAAAAAAAAAAAAAA453,AA17009,AAAAAAA80240}
+ 82 | {34,60,4,79,78,16,86,89,42,50} | {AAAAA40681,AAAAAAAAAAAAAAAAAA12591,AAAAAAA80240,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA70104}
+ 88 | {41,90,77,24,6,24} | {AAAA35194,AAAA35194,AAAAAAA80240,AAAAAAAAAAA46154,AAAAAA58494,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA59334,AAAAAAAAAAAAAAAAAAA91804,AA74433}
+ 97 | {54,2,86,65} | {47735,AAAAAAA99836,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAA29150,AAAAAAA80240,AAAAAAAAAAAAAAAA98414,AAAAAAA56483,AAAAAAAAAAAAAAAA29150,AAAAAAA39692,AA21643}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(7 rows)
+
+SELECT * FROM array_index_op_test WHERE t && '{AAAAAAA80240}' ORDER BY seqno;
+ seqno | i | t
+-------+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 30 | {26,81,47,91,34} | {AAAAAAAAAAAAAAAAAAA70104,AAAAAAA80240}
+ 64 | {26,19,34,24,81,78} | {A96617,AAAAAAAAAAAAAAAAAAA70104,A68938,AAAAAAAAAAA53908,AAAAAAAAAAAAAAA453,AA17009,AAAAAAA80240}
+ 82 | {34,60,4,79,78,16,86,89,42,50} | {AAAAA40681,AAAAAAAAAAAAAAAAAA12591,AAAAAAA80240,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA70104}
+ 88 | {41,90,77,24,6,24} | {AAAA35194,AAAA35194,AAAAAAA80240,AAAAAAAAAAA46154,AAAAAA58494,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA59334,AAAAAAAAAAAAAAAAAAA91804,AA74433}
+ 97 | {54,2,86,65} | {47735,AAAAAAA99836,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAA29150,AAAAAAA80240,AAAAAAAAAAAAAAAA98414,AAAAAAA56483,AAAAAAAAAAAAAAAA29150,AAAAAAA39692,AA21643}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(7 rows)
+
+SELECT * FROM array_index_op_test WHERE i @> '{32}' AND t && '{AAAAAAA80240}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------------+------------------------------------------------------------------------------
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE i && '{32}' AND t @> '{AAAAAAA80240}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------------+------------------------------------------------------------------------------
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE t = '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+SELECT * FROM array_op_test WHERE i = '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+--------+--------
+ 102 | {NULL} | {NULL}
+(1 row)
+
+SELECT * FROM array_op_test WHERE i <@ '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+RESET enable_seqscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+--
+-- Try a GIN index with a lot of items with same key. (GIN creates a posting
+-- tree when there are enough duplicates)
+--
+CREATE TABLE array_gin_test (a int[]);
+INSERT INTO array_gin_test SELECT ARRAY[1, g%5, g] FROM generate_series(1, 10000) g;
+CREATE INDEX array_gin_test_idx ON array_gin_test USING gin (a);
+SELECT COUNT(*) FROM array_gin_test WHERE a @> '{2}';
+ count
+-------
+ 2000
+(1 row)
+
+DROP TABLE array_gin_test;
+--
+-- Test GIN index's reloptions
+--
+CREATE INDEX gin_relopts_test ON array_index_op_test USING gin (i)
+ WITH (FASTUPDATE=on, GIN_PENDING_LIST_LIMIT=128);
+\d+ gin_relopts_test
+ Index "public.gin_relopts_test"
+ Column | Type | Key? | Definition | Storage | Stats target
+--------+---------+------+------------+---------+--------------
+ i | integer | yes | i | plain |
+gin, for table "public.array_index_op_test"
+Options: fastupdate=on, gin_pending_list_limit=128
+
+--
+-- HASH
+--
+CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
+CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
+CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
+CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops) WITH (fillfactor=60);
+CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
+CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
+DROP TABLE unlogged_hash_table;
+-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
+-- Test hash index build tuplesorting. Force hash tuplesort using low
+-- maintenance_work_mem setting and fillfactor:
+SET maintenance_work_mem = '1MB';
+CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
+ QUERY PLAN
+-------------------------------------------------------
+ Aggregate
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (stringu1 = 'TVAAAA'::name)
+ -> Bitmap Index Scan on hash_tuplesort_idx
+ Index Cond: (stringu1 = 'TVAAAA'::name)
+(5 rows)
+
+SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
+ count
+-------
+ 14
+(1 row)
+
+DROP INDEX hash_tuplesort_idx;
+RESET maintenance_work_mem;
+--
+-- Test functional index
+--
+CREATE TABLE func_index_heap (f1 text, f2 text);
+CREATE UNIQUE INDEX func_index_index on func_index_heap (textcat(f1,f2));
+INSERT INTO func_index_heap VALUES('ABC','DEF');
+INSERT INTO func_index_heap VALUES('AB','CDEFG');
+INSERT INTO func_index_heap VALUES('QWE','RTY');
+-- this should fail because of unique index:
+INSERT INTO func_index_heap VALUES('ABCD', 'EF');
+ERROR: duplicate key value violates unique constraint "func_index_index"
+DETAIL: Key (textcat(f1, f2))=(ABCDEF) already exists.
+-- but this shouldn't:
+INSERT INTO func_index_heap VALUES('QWERTY');
+-- while we're here, see that the metadata looks sane
+\d func_index_heap
+ Table "public.func_index_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "func_index_index" UNIQUE, btree (textcat(f1, f2))
+
+\d func_index_index
+ Index "public.func_index_index"
+ Column | Type | Key? | Definition
+---------+------+------+-----------------
+ textcat | text | yes | textcat(f1, f2)
+unique, btree, for table "public.func_index_heap"
+
+--
+-- Same test, expressional index
+--
+DROP TABLE func_index_heap;
+CREATE TABLE func_index_heap (f1 text, f2 text);
+CREATE UNIQUE INDEX func_index_index on func_index_heap ((f1 || f2) text_ops);
+INSERT INTO func_index_heap VALUES('ABC','DEF');
+INSERT INTO func_index_heap VALUES('AB','CDEFG');
+INSERT INTO func_index_heap VALUES('QWE','RTY');
+-- this should fail because of unique index:
+INSERT INTO func_index_heap VALUES('ABCD', 'EF');
+ERROR: duplicate key value violates unique constraint "func_index_index"
+DETAIL: Key ((f1 || f2))=(ABCDEF) already exists.
+-- but this shouldn't:
+INSERT INTO func_index_heap VALUES('QWERTY');
+-- while we're here, see that the metadata looks sane
+\d func_index_heap
+ Table "public.func_index_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "func_index_index" UNIQUE, btree ((f1 || f2))
+
+\d func_index_index
+ Index "public.func_index_index"
+ Column | Type | Key? | Definition
+--------+------+------+------------
+ expr | text | yes | (f1 || f2)
+unique, btree, for table "public.func_index_heap"
+
+-- this should fail because of unsafe column type (anonymous record)
+create index on func_index_heap ((f1 || f2), (row(f1, f2)));
+ERROR: column "row" has pseudo-type record
+--
+-- Test unique index with included columns
+--
+CREATE TABLE covering_index_heap (f1 int, f2 int, f3 text);
+CREATE UNIQUE INDEX covering_index_index on covering_index_heap (f1,f2) INCLUDE(f3);
+INSERT INTO covering_index_heap VALUES(1,1,'AAA');
+INSERT INTO covering_index_heap VALUES(1,2,'AAA');
+-- this should fail because of unique index on f1,f2:
+INSERT INTO covering_index_heap VALUES(1,2,'BBB');
+ERROR: duplicate key value violates unique constraint "covering_index_index"
+DETAIL: Key (f1, f2)=(1, 2) already exists.
+-- and this shouldn't:
+INSERT INTO covering_index_heap VALUES(1,4,'AAA');
+-- Try to build index on table that already contains data
+CREATE UNIQUE INDEX covering_pkey on covering_index_heap (f1,f2) INCLUDE(f3);
+-- Try to use existing covering index as primary key
+ALTER TABLE covering_index_heap ADD CONSTRAINT covering_pkey PRIMARY KEY USING INDEX
+covering_pkey;
+DROP TABLE covering_index_heap;
+--
+-- Also try building functional, expressional, and partial indexes on
+-- tables that already contain data.
+--
+create unique index hash_f8_index_1 on hash_f8_heap(abs(random));
+create unique index hash_f8_index_2 on hash_f8_heap((seqno + 1), random);
+create unique index hash_f8_index_3 on hash_f8_heap(random) where seqno > 1000;
+--
+-- Try some concurrent index builds
+--
+-- Unfortunately this only tests about half the code paths because there are
+-- no concurrent updates happening to the table at the same time.
+CREATE TABLE concur_heap (f1 text, f2 text);
+-- empty table
+CREATE INDEX CONCURRENTLY concur_index1 ON concur_heap(f2,f1);
+CREATE INDEX CONCURRENTLY IF NOT EXISTS concur_index1 ON concur_heap(f2,f1);
+NOTICE: relation "concur_index1" already exists, skipping
+INSERT INTO concur_heap VALUES ('a','b');
+INSERT INTO concur_heap VALUES ('b','b');
+-- unique index
+CREATE UNIQUE INDEX CONCURRENTLY concur_index2 ON concur_heap(f1);
+CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
+NOTICE: relation "concur_index2" already exists, skipping
+-- check if constraint is set up properly to be enforced
+INSERT INTO concur_heap VALUES ('b','x');
+ERROR: duplicate key value violates unique constraint "concur_index2"
+DETAIL: Key (f1)=(b) already exists.
+-- check if constraint is enforced properly at build time
+CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+ERROR: could not create unique index "concur_index3"
+DETAIL: Key (f2)=(b) is duplicated.
+-- test that expression indexes and partial indexes work concurrently
+CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
+CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
+-- here we also check that you can default the index name
+CREATE INDEX CONCURRENTLY on concur_heap((f2||f1));
+-- You can't do a concurrent index build in a transaction
+BEGIN;
+CREATE INDEX CONCURRENTLY concur_index7 ON concur_heap(f1);
+ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block
+COMMIT;
+-- But you can do a regular index build in a transaction
+BEGIN;
+CREATE INDEX std_index on concur_heap(f2);
+COMMIT;
+-- Failed builds are left invalid by VACUUM FULL, fixed by REINDEX
+VACUUM FULL concur_heap;
+REINDEX TABLE concur_heap;
+ERROR: could not create unique index "concur_index3"
+DETAIL: Key (f2)=(b) is duplicated.
+DELETE FROM concur_heap WHERE f1 = 'b';
+VACUUM FULL concur_heap;
+\d concur_heap
+ Table "public.concur_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "concur_heap_expr_idx" btree ((f2 || f1))
+ "concur_index1" btree (f2, f1)
+ "concur_index2" UNIQUE, btree (f1)
+ "concur_index3" UNIQUE, btree (f2) INVALID
+ "concur_index4" btree (f2) WHERE f1 = 'a'::text
+ "concur_index5" btree (f2) WHERE f1 = 'x'::text
+ "std_index" btree (f2)
+
+REINDEX TABLE concur_heap;
+\d concur_heap
+ Table "public.concur_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "concur_heap_expr_idx" btree ((f2 || f1))
+ "concur_index1" btree (f2, f1)
+ "concur_index2" UNIQUE, btree (f1)
+ "concur_index3" UNIQUE, btree (f2)
+ "concur_index4" btree (f2) WHERE f1 = 'a'::text
+ "concur_index5" btree (f2) WHERE f1 = 'x'::text
+ "std_index" btree (f2)
+
+-- Temporary tables with concurrent builds and on-commit actions
+-- CONCURRENTLY used with CREATE INDEX and DROP INDEX is ignored.
+-- PRESERVE ROWS, the default.
+CREATE TEMP TABLE concur_temp (f1 int, f2 text)
+ ON COMMIT PRESERVE ROWS;
+INSERT INTO concur_temp VALUES (1, 'foo'), (2, 'bar');
+CREATE INDEX CONCURRENTLY concur_temp_ind ON concur_temp(f1);
+DROP INDEX CONCURRENTLY concur_temp_ind;
+DROP TABLE concur_temp;
+-- ON COMMIT DROP
+BEGIN;
+CREATE TEMP TABLE concur_temp (f1 int, f2 text)
+ ON COMMIT DROP;
+INSERT INTO concur_temp VALUES (1, 'foo'), (2, 'bar');
+-- Fails when running in a transaction.
+CREATE INDEX CONCURRENTLY concur_temp_ind ON concur_temp(f1);
+ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block
+COMMIT;
+-- ON COMMIT DELETE ROWS
+CREATE TEMP TABLE concur_temp (f1 int, f2 text)
+ ON COMMIT DELETE ROWS;
+INSERT INTO concur_temp VALUES (1, 'foo'), (2, 'bar');
+CREATE INDEX CONCURRENTLY concur_temp_ind ON concur_temp(f1);
+DROP INDEX CONCURRENTLY concur_temp_ind;
+DROP TABLE concur_temp;
+--
+-- Try some concurrent index drops
+--
+DROP INDEX CONCURRENTLY "concur_index2"; -- works
+DROP INDEX CONCURRENTLY IF EXISTS "concur_index2"; -- notice
+NOTICE: index "concur_index2" does not exist, skipping
+-- failures
+DROP INDEX CONCURRENTLY "concur_index2", "concur_index3";
+ERROR: DROP INDEX CONCURRENTLY does not support dropping multiple objects
+BEGIN;
+DROP INDEX CONCURRENTLY "concur_index5";
+ERROR: DROP INDEX CONCURRENTLY cannot run inside a transaction block
+ROLLBACK;
+-- successes
+DROP INDEX CONCURRENTLY IF EXISTS "concur_index3";
+DROP INDEX CONCURRENTLY "concur_index4";
+DROP INDEX CONCURRENTLY "concur_index5";
+DROP INDEX CONCURRENTLY "concur_index1";
+DROP INDEX CONCURRENTLY "concur_heap_expr_idx";
+\d concur_heap
+ Table "public.concur_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "std_index" btree (f2)
+
+DROP TABLE concur_heap;
+--
+-- Test ADD CONSTRAINT USING INDEX
+--
+CREATE TABLE cwi_test( a int , b varchar(10), c char);
+-- add some data so that all tests have something to work with.
+INSERT INTO cwi_test VALUES(1, 2), (3, 4), (5, 6);
+CREATE UNIQUE INDEX cwi_uniq_idx ON cwi_test(a , b);
+ALTER TABLE cwi_test ADD primary key USING INDEX cwi_uniq_idx;
+\d cwi_test
+ Table "public.cwi_test"
+ Column | Type | Collation | Nullable | Default
+--------+-----------------------+-----------+----------+---------
+ a | integer | | not null |
+ b | character varying(10) | | not null |
+ c | character(1) | | |
+Indexes:
+ "cwi_uniq_idx" PRIMARY KEY, btree (a, b)
+
+\d cwi_uniq_idx
+ Index "public.cwi_uniq_idx"
+ Column | Type | Key? | Definition
+--------+-----------------------+------+------------
+ a | integer | yes | a
+ b | character varying(10) | yes | b
+primary key, btree, for table "public.cwi_test"
+
+CREATE UNIQUE INDEX cwi_uniq2_idx ON cwi_test(b , a);
+ALTER TABLE cwi_test DROP CONSTRAINT cwi_uniq_idx,
+ ADD CONSTRAINT cwi_replaced_pkey PRIMARY KEY
+ USING INDEX cwi_uniq2_idx;
+NOTICE: ALTER TABLE / ADD CONSTRAINT USING INDEX will rename index "cwi_uniq2_idx" to "cwi_replaced_pkey"
+\d cwi_test
+ Table "public.cwi_test"
+ Column | Type | Collation | Nullable | Default
+--------+-----------------------+-----------+----------+---------
+ a | integer | | not null |
+ b | character varying(10) | | not null |
+ c | character(1) | | |
+Indexes:
+ "cwi_replaced_pkey" PRIMARY KEY, btree (b, a)
+
+\d cwi_replaced_pkey
+ Index "public.cwi_replaced_pkey"
+ Column | Type | Key? | Definition
+--------+-----------------------+------+------------
+ b | character varying(10) | yes | b
+ a | integer | yes | a
+primary key, btree, for table "public.cwi_test"
+
+DROP INDEX cwi_replaced_pkey; -- Should fail; a constraint depends on it
+ERROR: cannot drop index cwi_replaced_pkey because constraint cwi_replaced_pkey on table cwi_test requires it
+HINT: You can drop constraint cwi_replaced_pkey on table cwi_test instead.
+-- Check that non-default index options are rejected
+CREATE UNIQUE INDEX cwi_uniq3_idx ON cwi_test(a desc);
+ALTER TABLE cwi_test ADD UNIQUE USING INDEX cwi_uniq3_idx; -- fail
+ERROR: index "cwi_uniq3_idx" column number 1 does not have default sorting behavior
+LINE 1: ALTER TABLE cwi_test ADD UNIQUE USING INDEX cwi_uniq3_idx;
+ ^
+DETAIL: Cannot create a primary key or unique constraint using such an index.
+CREATE UNIQUE INDEX cwi_uniq4_idx ON cwi_test(b collate "POSIX");
+ALTER TABLE cwi_test ADD UNIQUE USING INDEX cwi_uniq4_idx; -- fail
+ERROR: index "cwi_uniq4_idx" column number 1 does not have default sorting behavior
+LINE 1: ALTER TABLE cwi_test ADD UNIQUE USING INDEX cwi_uniq4_idx;
+ ^
+DETAIL: Cannot create a primary key or unique constraint using such an index.
+DROP TABLE cwi_test;
+-- ADD CONSTRAINT USING INDEX is forbidden on partitioned tables
+CREATE TABLE cwi_test(a int) PARTITION BY hash (a);
+create unique index on cwi_test (a);
+alter table cwi_test add primary key using index cwi_test_a_idx ;
+ERROR: ALTER TABLE / ADD CONSTRAINT USING INDEX is not supported on partitioned tables
+DROP TABLE cwi_test;
+--
+-- Check handling of indexes on system columns
+--
+CREATE TABLE syscol_table (a INT);
+-- System columns cannot be indexed
+CREATE INDEX ON syscolcol_table (ctid);
+ERROR: relation "syscolcol_table" does not exist
+-- nor used in expressions
+CREATE INDEX ON syscol_table ((ctid >= '(1000,0)'));
+ERROR: index creation on system columns is not supported
+-- nor used in predicates
+CREATE INDEX ON syscol_table (a) WHERE ctid >= '(1000,0)';
+ERROR: index creation on system columns is not supported
+DROP TABLE syscol_table;
+--
+-- Tests for IS NULL/IS NOT NULL with b-tree indexes
+--
+SELECT unique1, unique2 INTO onek_with_null FROM onek;
+INSERT INTO onek_with_null (unique1,unique2) VALUES (NULL, -1), (NULL, NULL);
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2,unique1);
+SET enable_seqscan = OFF;
+SET enable_indexscan = ON;
+SET enable_bitmapscan = ON;
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NOT NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL AND unique1 > 500;
+ count
+-------
+ 499
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique1 > 500;
+ count
+-------
+ 0
+(1 row)
+
+DROP INDEX onek_nulltest;
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2 desc,unique1);
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NOT NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL AND unique1 > 500;
+ count
+-------
+ 499
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique1 > 500;
+ count
+-------
+ 0
+(1 row)
+
+DROP INDEX onek_nulltest;
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2 desc nulls last,unique1);
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NOT NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL AND unique1 > 500;
+ count
+-------
+ 499
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique1 > 500;
+ count
+-------
+ 0
+(1 row)
+
+DROP INDEX onek_nulltest;
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2 nulls first,unique1);
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NOT NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL AND unique1 > 500;
+ count
+-------
+ 499
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique1 > 500;
+ count
+-------
+ 0
+(1 row)
+
+DROP INDEX onek_nulltest;
+-- Check initial-positioning logic too
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2);
+SET enable_seqscan = OFF;
+SET enable_indexscan = ON;
+SET enable_bitmapscan = OFF;
+SELECT unique1, unique2 FROM onek_with_null
+ ORDER BY unique2 LIMIT 2;
+ unique1 | unique2
+---------+---------
+ | -1
+ 147 | 0
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null WHERE unique2 >= -1
+ ORDER BY unique2 LIMIT 2;
+ unique1 | unique2
+---------+---------
+ | -1
+ 147 | 0
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null WHERE unique2 >= 0
+ ORDER BY unique2 LIMIT 2;
+ unique1 | unique2
+---------+---------
+ 147 | 0
+ 931 | 1
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null
+ ORDER BY unique2 DESC LIMIT 2;
+ unique1 | unique2
+---------+---------
+ |
+ 278 | 999
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null WHERE unique2 >= -1
+ ORDER BY unique2 DESC LIMIT 2;
+ unique1 | unique2
+---------+---------
+ 278 | 999
+ 0 | 998
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null WHERE unique2 < 999
+ ORDER BY unique2 DESC LIMIT 2;
+ unique1 | unique2
+---------+---------
+ 0 | 998
+ 744 | 997
+(2 rows)
+
+RESET enable_seqscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+DROP TABLE onek_with_null;
+--
+-- Check bitmap index path planning
+--
+EXPLAIN (COSTS OFF)
+SELECT * FROM tenk1
+ WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1
+ Recheck Cond: (((thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3)) OR ((thousand = 42) AND (tenthous = 42)))
+ -> BitmapOr
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: ((thousand = 42) AND (tenthous = 1))
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: ((thousand = 42) AND (tenthous = 3))
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: ((thousand = 42) AND (tenthous = 42))
+(9 rows)
+
+SELECT * FROM tenk1
+ WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+ 42 | 5530 | 0 | 2 | 2 | 2 | 42 | 42 | 42 | 42 | 42 | 84 | 85 | QBAAAA | SEIAAA | OOOOxx
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM tenk1
+ WHERE hundred = 42 AND (thousand = 42 OR thousand = 99);
+ QUERY PLAN
+---------------------------------------------------------------------------------
+ Aggregate
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: ((hundred = 42) AND ((thousand = 42) OR (thousand = 99)))
+ -> BitmapAnd
+ -> Bitmap Index Scan on tenk1_hundred
+ Index Cond: (hundred = 42)
+ -> BitmapOr
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: (thousand = 42)
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: (thousand = 99)
+(11 rows)
+
+SELECT count(*) FROM tenk1
+ WHERE hundred = 42 AND (thousand = 42 OR thousand = 99);
+ count
+-------
+ 10
+(1 row)
+
+--
+-- Check behavior with duplicate index column contents
+--
+CREATE TABLE dupindexcols AS
+ SELECT unique1 as id, stringu2::text as f1 FROM tenk1;
+CREATE INDEX dupindexcols_i ON dupindexcols (f1, id, f1 text_pattern_ops);
+ANALYZE dupindexcols;
+EXPLAIN (COSTS OFF)
+ SELECT count(*) FROM dupindexcols
+ WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
+ Aggregate
+ -> Bitmap Heap Scan on dupindexcols
+ Recheck Cond: ((f1 >= 'WA'::text) AND (f1 <= 'ZZZ'::text) AND (id < 1000) AND (f1 ~<~ 'YX'::text))
+ -> Bitmap Index Scan on dupindexcols_i
+ Index Cond: ((f1 >= 'WA'::text) AND (f1 <= 'ZZZ'::text) AND (id < 1000) AND (f1 ~<~ 'YX'::text))
+(5 rows)
+
+SELECT count(*) FROM dupindexcols
+ WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
+ count
+-------
+ 97
+(1 row)
+
+--
+-- Check ordering of =ANY indexqual results (bug in 9.2.0)
+--
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = OFF;
+explain (costs off)
+SELECT unique1 FROM tenk1
+WHERE unique1 IN (1,42,7)
+ORDER BY unique1;
+ QUERY PLAN
+-------------------------------------------------------
+ Index Only Scan using tenk1_unique1 on tenk1
+ Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+(2 rows)
+
+SELECT unique1 FROM tenk1
+WHERE unique1 IN (1,42,7)
+ORDER BY unique1;
+ unique1
+---------
+ 1
+ 7
+ 42
+(3 rows)
+
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand;
+ QUERY PLAN
+-------------------------------------------------------
+ Index Only Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: (thousand < 2)
+ Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
+(3 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand;
+ thousand | tenthous
+----------+----------
+ 0 | 3000
+ 1 | 1001
+(2 rows)
+
+RESET enable_indexscan;
+SET enable_indexonlyscan = OFF;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Sort Key: thousand
+ -> Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(4 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand;
+ thousand | tenthous
+----------+----------
+ 0 | 3000
+ 1 | 1001
+(2 rows)
+
+RESET enable_seqscan;
+RESET enable_bitmapscan;
+RESET enable_indexonlyscan;
+--
+-- Check elimination of constant-NULL subexpressions
+--
+explain (costs off)
+ select * from tenk1 where (thousand, tenthous) in ((1,1001), (null,null));
+ QUERY PLAN
+------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand = 1) AND (tenthous = 1001))
+(2 rows)
+
+--
+-- Check matching of boolean index columns to WHERE conditions and sort keys
+--
+create temp table boolindex (b bool, i int, unique(b, i), junk float);
+explain (costs off)
+ select * from boolindex order by b, i limit 10;
+ QUERY PLAN
+-------------------------------------------------------
+ Limit
+ -> Index Scan using boolindex_b_i_key on boolindex
+(2 rows)
+
+explain (costs off)
+ select * from boolindex where b order by i limit 10;
+ QUERY PLAN
+-------------------------------------------------------
+ Limit
+ -> Index Scan using boolindex_b_i_key on boolindex
+ Index Cond: (b = true)
+(3 rows)
+
+explain (costs off)
+ select * from boolindex where b = true order by i desc limit 10;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ -> Index Scan Backward using boolindex_b_i_key on boolindex
+ Index Cond: (b = true)
+(3 rows)
+
+explain (costs off)
+ select * from boolindex where not b order by i limit 10;
+ QUERY PLAN
+-------------------------------------------------------
+ Limit
+ -> Index Scan using boolindex_b_i_key on boolindex
+ Index Cond: (b = false)
+(3 rows)
+
+explain (costs off)
+ select * from boolindex where b is true order by i desc limit 10;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ -> Index Scan Backward using boolindex_b_i_key on boolindex
+ Index Cond: (b = true)
+(3 rows)
+
+explain (costs off)
+ select * from boolindex where b is false order by i desc limit 10;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ -> Index Scan Backward using boolindex_b_i_key on boolindex
+ Index Cond: (b = false)
+(3 rows)
+
+--
+-- REINDEX (VERBOSE)
+--
+CREATE TABLE reindex_verbose(id integer primary key);
+\set VERBOSITY terse \\ -- suppress machine-dependent details
+REINDEX (VERBOSE) TABLE reindex_verbose;
+INFO: index "reindex_verbose_pkey" was reindexed
+\set VERBOSITY default
+DROP TABLE reindex_verbose;
+--
+-- REINDEX CONCURRENTLY
+--
+CREATE TABLE concur_reindex_tab (c1 int);
+-- REINDEX
+REINDEX TABLE concur_reindex_tab; -- notice
+NOTICE: table "concur_reindex_tab" has no indexes to reindex
+REINDEX TABLE CONCURRENTLY concur_reindex_tab; -- notice
+NOTICE: table "concur_reindex_tab" has no indexes that can be reindexed concurrently
+ALTER TABLE concur_reindex_tab ADD COLUMN c2 text; -- add toast index
+-- Normal index with integer column
+CREATE UNIQUE INDEX concur_reindex_ind1 ON concur_reindex_tab(c1);
+-- Normal index with text column
+CREATE INDEX concur_reindex_ind2 ON concur_reindex_tab(c2);
+-- UNIQUE index with expression
+CREATE UNIQUE INDEX concur_reindex_ind3 ON concur_reindex_tab(abs(c1));
+-- Duplicate column names
+CREATE INDEX concur_reindex_ind4 ON concur_reindex_tab(c1, c1, c2);
+-- Create table for check on foreign key dependence switch with indexes swapped
+ALTER TABLE concur_reindex_tab ADD PRIMARY KEY USING INDEX concur_reindex_ind1;
+CREATE TABLE concur_reindex_tab2 (c1 int REFERENCES concur_reindex_tab);
+INSERT INTO concur_reindex_tab VALUES (1, 'a');
+INSERT INTO concur_reindex_tab VALUES (2, 'a');
+-- Reindex concurrently of exclusion constraint currently not supported
+CREATE TABLE concur_reindex_tab3 (c1 int, c2 int4range, EXCLUDE USING gist (c2 WITH &&));
+INSERT INTO concur_reindex_tab3 VALUES (3, '[1,2]');
+REINDEX INDEX CONCURRENTLY concur_reindex_tab3_c2_excl; -- error
+ERROR: concurrent index creation for exclusion constraints is not supported
+REINDEX TABLE CONCURRENTLY concur_reindex_tab3; -- succeeds with warning
+WARNING: cannot reindex exclusion constraint index "public.concur_reindex_tab3_c2_excl" concurrently, skipping
+NOTICE: table "concur_reindex_tab3" has no indexes that can be reindexed concurrently
+INSERT INTO concur_reindex_tab3 VALUES (4, '[2,4]');
+ERROR: conflicting key value violates exclusion constraint "concur_reindex_tab3_c2_excl"
+DETAIL: Key (c2)=([2,5)) conflicts with existing key (c2)=([1,3)).
+-- Check materialized views
+CREATE MATERIALIZED VIEW concur_reindex_matview AS SELECT * FROM concur_reindex_tab;
+-- Dependency lookup before and after the follow-up REINDEX commands.
+-- These should remain consistent.
+SELECT pg_describe_object(classid, objid, objsubid) as obj,
+ pg_describe_object(refclassid,refobjid,refobjsubid) as objref,
+ deptype
+FROM pg_depend
+WHERE classid = 'pg_class'::regclass AND
+ objid in ('concur_reindex_tab'::regclass,
+ 'concur_reindex_ind1'::regclass,
+ 'concur_reindex_ind2'::regclass,
+ 'concur_reindex_ind3'::regclass,
+ 'concur_reindex_ind4'::regclass,
+ 'concur_reindex_matview'::regclass)
+ ORDER BY 1, 2;
+ obj | objref | deptype
+------------------------------------------+------------------------------------------------------------+---------
+ index concur_reindex_ind1 | constraint concur_reindex_ind1 on table concur_reindex_tab | i
+ index concur_reindex_ind2 | column c2 of table concur_reindex_tab | a
+ index concur_reindex_ind3 | column c1 of table concur_reindex_tab | a
+ index concur_reindex_ind3 | table concur_reindex_tab | a
+ index concur_reindex_ind4 | column c1 of table concur_reindex_tab | a
+ index concur_reindex_ind4 | column c1 of table concur_reindex_tab | a
+ index concur_reindex_ind4 | column c2 of table concur_reindex_tab | a
+ materialized view concur_reindex_matview | schema public | n
+ table concur_reindex_tab | schema public | n
+(9 rows)
+
+REINDEX INDEX CONCURRENTLY concur_reindex_ind1;
+REINDEX TABLE CONCURRENTLY concur_reindex_tab;
+REINDEX TABLE CONCURRENTLY concur_reindex_matview;
+NOTICE: table "concur_reindex_matview" has no indexes that can be reindexed concurrently
+SELECT pg_describe_object(classid, objid, objsubid) as obj,
+ pg_describe_object(refclassid,refobjid,refobjsubid) as objref,
+ deptype
+FROM pg_depend
+WHERE classid = 'pg_class'::regclass AND
+ objid in ('concur_reindex_tab'::regclass,
+ 'concur_reindex_ind1'::regclass,
+ 'concur_reindex_ind2'::regclass,
+ 'concur_reindex_ind3'::regclass,
+ 'concur_reindex_ind4'::regclass,
+ 'concur_reindex_matview'::regclass)
+ ORDER BY 1, 2;
+ obj | objref | deptype
+------------------------------------------+------------------------------------------------------------+---------
+ index concur_reindex_ind1 | constraint concur_reindex_ind1 on table concur_reindex_tab | i
+ index concur_reindex_ind2 | column c2 of table concur_reindex_tab | a
+ index concur_reindex_ind3 | column c1 of table concur_reindex_tab | a
+ index concur_reindex_ind3 | table concur_reindex_tab | a
+ index concur_reindex_ind4 | column c1 of table concur_reindex_tab | a
+ index concur_reindex_ind4 | column c1 of table concur_reindex_tab | a
+ index concur_reindex_ind4 | column c2 of table concur_reindex_tab | a
+ materialized view concur_reindex_matview | schema public | n
+ table concur_reindex_tab | schema public | n
+(9 rows)
+
+-- Check that comments are preserved
+CREATE TABLE testcomment (i int);
+CREATE INDEX testcomment_idx1 ON testcomment (i);
+COMMENT ON INDEX testcomment_idx1 IS 'test comment';
+SELECT obj_description('testcomment_idx1'::regclass, 'pg_class');
+ obj_description
+-----------------
+ test comment
+(1 row)
+
+REINDEX TABLE testcomment;
+SELECT obj_description('testcomment_idx1'::regclass, 'pg_class');
+ obj_description
+-----------------
+ test comment
+(1 row)
+
+REINDEX TABLE CONCURRENTLY testcomment ;
+SELECT obj_description('testcomment_idx1'::regclass, 'pg_class');
+ obj_description
+-----------------
+ test comment
+(1 row)
+
+DROP TABLE testcomment;
+-- Check that indisclustered updates are preserved
+CREATE TABLE concur_clustered(i int);
+CREATE INDEX concur_clustered_i_idx ON concur_clustered(i);
+ALTER TABLE concur_clustered CLUSTER ON concur_clustered_i_idx;
+REINDEX TABLE CONCURRENTLY concur_clustered;
+SELECT indexrelid::regclass, indisclustered FROM pg_index
+ WHERE indrelid = 'concur_clustered'::regclass;
+ indexrelid | indisclustered
+------------------------+----------------
+ concur_clustered_i_idx | t
+(1 row)
+
+DROP TABLE concur_clustered;
+-- Partitions
+-- Create some partitioned tables
+CREATE TABLE concur_reindex_part (c1 int, c2 int) PARTITION BY RANGE (c1);
+CREATE TABLE concur_reindex_part_0 PARTITION OF concur_reindex_part
+ FOR VALUES FROM (0) TO (10) PARTITION BY list (c2);
+CREATE TABLE concur_reindex_part_0_1 PARTITION OF concur_reindex_part_0
+ FOR VALUES IN (1);
+CREATE TABLE concur_reindex_part_0_2 PARTITION OF concur_reindex_part_0
+ FOR VALUES IN (2);
+-- This partitioned table will have no partitions.
+CREATE TABLE concur_reindex_part_10 PARTITION OF concur_reindex_part
+ FOR VALUES FROM (10) TO (20) PARTITION BY list (c2);
+-- Create some partitioned indexes
+CREATE INDEX concur_reindex_part_index ON ONLY concur_reindex_part (c1);
+CREATE INDEX concur_reindex_part_index_0 ON ONLY concur_reindex_part_0 (c1);
+ALTER INDEX concur_reindex_part_index ATTACH PARTITION concur_reindex_part_index_0;
+-- This partitioned index will have no partitions.
+CREATE INDEX concur_reindex_part_index_10 ON ONLY concur_reindex_part_10 (c1);
+ALTER INDEX concur_reindex_part_index ATTACH PARTITION concur_reindex_part_index_10;
+CREATE INDEX concur_reindex_part_index_0_1 ON ONLY concur_reindex_part_0_1 (c1);
+ALTER INDEX concur_reindex_part_index_0 ATTACH PARTITION concur_reindex_part_index_0_1;
+CREATE INDEX concur_reindex_part_index_0_2 ON ONLY concur_reindex_part_0_2 (c1);
+ALTER INDEX concur_reindex_part_index_0 ATTACH PARTITION concur_reindex_part_index_0_2;
+SELECT relid, parentrelid, level FROM pg_partition_tree('concur_reindex_part_index')
+ ORDER BY relid, level;
+ relid | parentrelid | level
+-------------------------------+-----------------------------+-------
+ concur_reindex_part_index | | 0
+ concur_reindex_part_index_0 | concur_reindex_part_index | 1
+ concur_reindex_part_index_10 | concur_reindex_part_index | 1
+ concur_reindex_part_index_0_1 | concur_reindex_part_index_0 | 2
+ concur_reindex_part_index_0_2 | concur_reindex_part_index_0 | 2
+(5 rows)
+
+-- REINDEX fails for partitioned indexes
+REINDEX INDEX concur_reindex_part_index_10;
+ERROR: REINDEX is not yet implemented for partitioned indexes
+REINDEX INDEX CONCURRENTLY concur_reindex_part_index_10;
+ERROR: REINDEX is not yet implemented for partitioned indexes
+-- REINDEX is a no-op for partitioned tables
+REINDEX TABLE concur_reindex_part_10;
+WARNING: REINDEX of partitioned tables is not yet implemented, skipping "concur_reindex_part_10"
+NOTICE: table "concur_reindex_part_10" has no indexes to reindex
+REINDEX TABLE CONCURRENTLY concur_reindex_part_10;
+WARNING: REINDEX of partitioned tables is not yet implemented, skipping "concur_reindex_part_10"
+NOTICE: table "concur_reindex_part_10" has no indexes that can be reindexed concurrently
+SELECT relid, parentrelid, level FROM pg_partition_tree('concur_reindex_part_index')
+ ORDER BY relid, level;
+ relid | parentrelid | level
+-------------------------------+-----------------------------+-------
+ concur_reindex_part_index | | 0
+ concur_reindex_part_index_0 | concur_reindex_part_index | 1
+ concur_reindex_part_index_10 | concur_reindex_part_index | 1
+ concur_reindex_part_index_0_1 | concur_reindex_part_index_0 | 2
+ concur_reindex_part_index_0_2 | concur_reindex_part_index_0 | 2
+(5 rows)
+
+-- REINDEX should preserve dependencies of partition tree.
+SELECT pg_describe_object(classid, objid, objsubid) as obj,
+ pg_describe_object(refclassid,refobjid,refobjsubid) as objref,
+ deptype
+FROM pg_depend
+WHERE classid = 'pg_class'::regclass AND
+ objid in ('concur_reindex_part'::regclass,
+ 'concur_reindex_part_0'::regclass,
+ 'concur_reindex_part_0_1'::regclass,
+ 'concur_reindex_part_0_2'::regclass,
+ 'concur_reindex_part_index'::regclass,
+ 'concur_reindex_part_index_0'::regclass,
+ 'concur_reindex_part_index_0_1'::regclass,
+ 'concur_reindex_part_index_0_2'::regclass)
+ ORDER BY 1, 2;
+ obj | objref | deptype
+------------------------------------------+--------------------------------------------+---------
+ column c1 of table concur_reindex_part | table concur_reindex_part | i
+ column c2 of table concur_reindex_part_0 | table concur_reindex_part_0 | i
+ index concur_reindex_part_index | column c1 of table concur_reindex_part | a
+ index concur_reindex_part_index_0 | column c1 of table concur_reindex_part_0 | a
+ index concur_reindex_part_index_0 | index concur_reindex_part_index | P
+ index concur_reindex_part_index_0 | table concur_reindex_part_0 | S
+ index concur_reindex_part_index_0_1 | column c1 of table concur_reindex_part_0_1 | a
+ index concur_reindex_part_index_0_1 | index concur_reindex_part_index_0 | P
+ index concur_reindex_part_index_0_1 | table concur_reindex_part_0_1 | S
+ index concur_reindex_part_index_0_2 | column c1 of table concur_reindex_part_0_2 | a
+ index concur_reindex_part_index_0_2 | index concur_reindex_part_index_0 | P
+ index concur_reindex_part_index_0_2 | table concur_reindex_part_0_2 | S
+ table concur_reindex_part | schema public | n
+ table concur_reindex_part_0 | schema public | n
+ table concur_reindex_part_0 | table concur_reindex_part | a
+ table concur_reindex_part_0_1 | schema public | n
+ table concur_reindex_part_0_1 | table concur_reindex_part_0 | a
+ table concur_reindex_part_0_2 | schema public | n
+ table concur_reindex_part_0_2 | table concur_reindex_part_0 | a
+(19 rows)
+
+REINDEX INDEX CONCURRENTLY concur_reindex_part_index_0_1;
+REINDEX INDEX CONCURRENTLY concur_reindex_part_index_0_2;
+SELECT relid, parentrelid, level FROM pg_partition_tree('concur_reindex_part_index')
+ ORDER BY relid, level;
+ relid | parentrelid | level
+-------------------------------+-----------------------------+-------
+ concur_reindex_part_index | | 0
+ concur_reindex_part_index_0 | concur_reindex_part_index | 1
+ concur_reindex_part_index_10 | concur_reindex_part_index | 1
+ concur_reindex_part_index_0_1 | concur_reindex_part_index_0 | 2
+ concur_reindex_part_index_0_2 | concur_reindex_part_index_0 | 2
+(5 rows)
+
+REINDEX TABLE CONCURRENTLY concur_reindex_part_0_1;
+REINDEX TABLE CONCURRENTLY concur_reindex_part_0_2;
+SELECT pg_describe_object(classid, objid, objsubid) as obj,
+ pg_describe_object(refclassid,refobjid,refobjsubid) as objref,
+ deptype
+FROM pg_depend
+WHERE classid = 'pg_class'::regclass AND
+ objid in ('concur_reindex_part'::regclass,
+ 'concur_reindex_part_0'::regclass,
+ 'concur_reindex_part_0_1'::regclass,
+ 'concur_reindex_part_0_2'::regclass,
+ 'concur_reindex_part_index'::regclass,
+ 'concur_reindex_part_index_0'::regclass,
+ 'concur_reindex_part_index_0_1'::regclass,
+ 'concur_reindex_part_index_0_2'::regclass)
+ ORDER BY 1, 2;
+ obj | objref | deptype
+------------------------------------------+--------------------------------------------+---------
+ column c1 of table concur_reindex_part | table concur_reindex_part | i
+ column c2 of table concur_reindex_part_0 | table concur_reindex_part_0 | i
+ index concur_reindex_part_index | column c1 of table concur_reindex_part | a
+ index concur_reindex_part_index_0 | column c1 of table concur_reindex_part_0 | a
+ index concur_reindex_part_index_0 | index concur_reindex_part_index | P
+ index concur_reindex_part_index_0 | table concur_reindex_part_0 | S
+ index concur_reindex_part_index_0_1 | column c1 of table concur_reindex_part_0_1 | a
+ index concur_reindex_part_index_0_1 | index concur_reindex_part_index_0 | P
+ index concur_reindex_part_index_0_1 | table concur_reindex_part_0_1 | S
+ index concur_reindex_part_index_0_2 | column c1 of table concur_reindex_part_0_2 | a
+ index concur_reindex_part_index_0_2 | index concur_reindex_part_index_0 | P
+ index concur_reindex_part_index_0_2 | table concur_reindex_part_0_2 | S
+ table concur_reindex_part | schema public | n
+ table concur_reindex_part_0 | schema public | n
+ table concur_reindex_part_0 | table concur_reindex_part | a
+ table concur_reindex_part_0_1 | schema public | n
+ table concur_reindex_part_0_1 | table concur_reindex_part_0 | a
+ table concur_reindex_part_0_2 | schema public | n
+ table concur_reindex_part_0_2 | table concur_reindex_part_0 | a
+(19 rows)
+
+SELECT relid, parentrelid, level FROM pg_partition_tree('concur_reindex_part_index')
+ ORDER BY relid, level;
+ relid | parentrelid | level
+-------------------------------+-----------------------------+-------
+ concur_reindex_part_index | | 0
+ concur_reindex_part_index_0 | concur_reindex_part_index | 1
+ concur_reindex_part_index_10 | concur_reindex_part_index | 1
+ concur_reindex_part_index_0_1 | concur_reindex_part_index_0 | 2
+ concur_reindex_part_index_0_2 | concur_reindex_part_index_0 | 2
+(5 rows)
+
+DROP TABLE concur_reindex_part;
+-- Check errors
+-- Cannot run inside a transaction block
+BEGIN;
+REINDEX TABLE CONCURRENTLY concur_reindex_tab;
+ERROR: REINDEX CONCURRENTLY cannot run inside a transaction block
+COMMIT;
+REINDEX TABLE CONCURRENTLY pg_class; -- no catalog relation
+ERROR: cannot reindex system catalogs concurrently
+REINDEX INDEX CONCURRENTLY pg_class_oid_index; -- no catalog index
+ERROR: cannot reindex system catalogs concurrently
+-- These are the toast table and index of pg_authid.
+REINDEX TABLE CONCURRENTLY pg_toast.pg_toast_1260; -- no catalog toast table
+ERROR: cannot reindex system catalogs concurrently
+REINDEX INDEX CONCURRENTLY pg_toast.pg_toast_1260_index; -- no catalog toast index
+ERROR: cannot reindex system catalogs concurrently
+REINDEX SYSTEM CONCURRENTLY postgres; -- not allowed for SYSTEM
+ERROR: cannot reindex system catalogs concurrently
+-- Warns about catalog relations
+REINDEX SCHEMA CONCURRENTLY pg_catalog;
+WARNING: cannot reindex system catalogs concurrently, skipping all
+-- Check the relation status, there should not be invalid indexes
+\d concur_reindex_tab
+ Table "public.concur_reindex_tab"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | not null |
+ c2 | text | | |
+Indexes:
+ "concur_reindex_ind1" PRIMARY KEY, btree (c1)
+ "concur_reindex_ind2" btree (c2)
+ "concur_reindex_ind3" UNIQUE, btree (abs(c1))
+ "concur_reindex_ind4" btree (c1, c1, c2)
+Referenced by:
+ TABLE "concur_reindex_tab2" CONSTRAINT "concur_reindex_tab2_c1_fkey" FOREIGN KEY (c1) REFERENCES concur_reindex_tab(c1)
+
+DROP MATERIALIZED VIEW concur_reindex_matview;
+DROP TABLE concur_reindex_tab, concur_reindex_tab2, concur_reindex_tab3;
+-- Check handling of invalid indexes
+CREATE TABLE concur_reindex_tab4 (c1 int);
+INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
+-- This trick creates an invalid index.
+CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+ERROR: could not create unique index "concur_reindex_ind5"
+DETAIL: Key (c1)=(1) is duplicated.
+-- Reindexing concurrently this index fails with the same failure.
+-- The extra index created is itself invalid, and can be dropped.
+REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
+ERROR: could not create unique index "concur_reindex_ind5_ccnew"
+DETAIL: Key (c1)=(1) is duplicated.
+\d concur_reindex_tab4
+ Table "public.concur_reindex_tab4"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+ "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
+
+DROP INDEX concur_reindex_ind5_ccnew;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- The invalid index is not processed when running REINDEX TABLE.
+REINDEX TABLE CONCURRENTLY concur_reindex_tab4;
+WARNING: cannot reindex invalid index "public.concur_reindex_ind5" concurrently, skipping
+NOTICE: table "concur_reindex_tab4" has no indexes that can be reindexed concurrently
+\d concur_reindex_tab4
+ Table "public.concur_reindex_tab4"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+
+-- But it is fixed with REINDEX INDEX.
+REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
+\d concur_reindex_tab4
+ Table "public.concur_reindex_tab4"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "concur_reindex_ind5" UNIQUE, btree (c1)
+
+DROP TABLE concur_reindex_tab4;
+-- Check handling of indexes with expressions and predicates. The
+-- definitions of the rebuilt indexes should match the original
+-- definitions.
+CREATE TABLE concur_exprs_tab (c1 int , c2 boolean);
+INSERT INTO concur_exprs_tab (c1, c2) VALUES (1369652450, FALSE),
+ (414515746, TRUE),
+ (897778963, FALSE);
+CREATE UNIQUE INDEX concur_exprs_index_expr
+ ON concur_exprs_tab ((c1::text COLLATE "C"));
+CREATE UNIQUE INDEX concur_exprs_index_pred ON concur_exprs_tab (c1)
+ WHERE (c1::text > 500000000::text COLLATE "C");
+CREATE UNIQUE INDEX concur_exprs_index_pred_2
+ ON concur_exprs_tab ((1 / c1))
+ WHERE ('-H') >= (c2::TEXT) COLLATE "C";
+SELECT pg_get_indexdef('concur_exprs_index_expr'::regclass);
+ pg_get_indexdef
+---------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_expr ON public.concur_exprs_tab USING btree (((c1)::text) COLLATE "C")
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred'::regclass);
+ pg_get_indexdef
+----------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred ON public.concur_exprs_tab USING btree (c1) WHERE ((c1)::text > ((500000000)::text COLLATE "C"))
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred_2'::regclass);
+ pg_get_indexdef
+--------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred_2 ON public.concur_exprs_tab USING btree (((1 / c1))) WHERE ('-H'::text >= ((c2)::text COLLATE "C"))
+(1 row)
+
+REINDEX TABLE CONCURRENTLY concur_exprs_tab;
+SELECT pg_get_indexdef('concur_exprs_index_expr'::regclass);
+ pg_get_indexdef
+---------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_expr ON public.concur_exprs_tab USING btree (((c1)::text) COLLATE "C")
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred'::regclass);
+ pg_get_indexdef
+----------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred ON public.concur_exprs_tab USING btree (c1) WHERE ((c1)::text > ((500000000)::text COLLATE "C"))
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred_2'::regclass);
+ pg_get_indexdef
+--------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred_2 ON public.concur_exprs_tab USING btree (((1 / c1))) WHERE ('-H'::text >= ((c2)::text COLLATE "C"))
+(1 row)
+
+-- ALTER TABLE recreates the indexes, which should keep their collations.
+ALTER TABLE concur_exprs_tab ALTER c2 TYPE TEXT;
+SELECT pg_get_indexdef('concur_exprs_index_expr'::regclass);
+ pg_get_indexdef
+---------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_expr ON public.concur_exprs_tab USING btree (((c1)::text) COLLATE "C")
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred'::regclass);
+ pg_get_indexdef
+----------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred ON public.concur_exprs_tab USING btree (c1) WHERE ((c1)::text > ((500000000)::text COLLATE "C"))
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred_2'::regclass);
+ pg_get_indexdef
+------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred_2 ON public.concur_exprs_tab USING btree (((1 / c1))) WHERE ('-H'::text >= (c2 COLLATE "C"))
+(1 row)
+
+DROP TABLE concur_exprs_tab;
+-- Temporary tables and on-commit actions, where CONCURRENTLY is ignored.
+-- ON COMMIT PRESERVE ROWS, the default.
+CREATE TEMP TABLE concur_temp_tab_1 (c1 int, c2 text)
+ ON COMMIT PRESERVE ROWS;
+INSERT INTO concur_temp_tab_1 VALUES (1, 'foo'), (2, 'bar');
+CREATE INDEX concur_temp_ind_1 ON concur_temp_tab_1(c2);
+REINDEX TABLE CONCURRENTLY concur_temp_tab_1;
+REINDEX INDEX CONCURRENTLY concur_temp_ind_1;
+-- Still fails in transaction blocks
+BEGIN;
+REINDEX INDEX CONCURRENTLY concur_temp_ind_1;
+ERROR: REINDEX CONCURRENTLY cannot run inside a transaction block
+COMMIT;
+-- ON COMMIT DELETE ROWS
+CREATE TEMP TABLE concur_temp_tab_2 (c1 int, c2 text)
+ ON COMMIT DELETE ROWS;
+CREATE INDEX concur_temp_ind_2 ON concur_temp_tab_2(c2);
+REINDEX TABLE CONCURRENTLY concur_temp_tab_2;
+REINDEX INDEX CONCURRENTLY concur_temp_ind_2;
+-- ON COMMIT DROP
+BEGIN;
+CREATE TEMP TABLE concur_temp_tab_3 (c1 int, c2 text)
+ ON COMMIT PRESERVE ROWS;
+INSERT INTO concur_temp_tab_3 VALUES (1, 'foo'), (2, 'bar');
+CREATE INDEX concur_temp_ind_3 ON concur_temp_tab_3(c2);
+-- Fails when running in a transaction
+REINDEX INDEX CONCURRENTLY concur_temp_ind_3;
+ERROR: REINDEX CONCURRENTLY cannot run inside a transaction block
+COMMIT;
+-- REINDEX SCHEMA processes all temporary relations
+CREATE TABLE reindex_temp_before AS
+SELECT oid, relname, relfilenode, relkind, reltoastrelid
+ FROM pg_class
+ WHERE relname IN ('concur_temp_ind_1', 'concur_temp_ind_2');
+SELECT pg_my_temp_schema()::regnamespace as temp_schema_name \gset
+REINDEX SCHEMA CONCURRENTLY :temp_schema_name;
+SELECT b.relname,
+ b.relkind,
+ CASE WHEN a.relfilenode = b.relfilenode THEN 'relfilenode is unchanged'
+ ELSE 'relfilenode has changed' END
+ FROM reindex_temp_before b JOIN pg_class a ON b.oid = a.oid
+ ORDER BY 1;
+ relname | relkind | case
+-------------------+---------+-------------------------
+ concur_temp_ind_1 | i | relfilenode has changed
+ concur_temp_ind_2 | i | relfilenode has changed
+(2 rows)
+
+DROP TABLE concur_temp_tab_1, concur_temp_tab_2, reindex_temp_before;
+--
+-- REINDEX SCHEMA
+--
+REINDEX SCHEMA schema_to_reindex; -- failure, schema does not exist
+ERROR: schema "schema_to_reindex" does not exist
+CREATE SCHEMA schema_to_reindex;
+SET search_path = 'schema_to_reindex';
+CREATE TABLE table1(col1 SERIAL PRIMARY KEY);
+INSERT INTO table1 SELECT generate_series(1,400);
+CREATE TABLE table2(col1 SERIAL PRIMARY KEY, col2 TEXT NOT NULL);
+INSERT INTO table2 SELECT generate_series(1,400), 'abc';
+CREATE INDEX ON table2(col2);
+CREATE MATERIALIZED VIEW matview AS SELECT col1 FROM table2;
+CREATE INDEX ON matview(col1);
+CREATE VIEW view AS SELECT col2 FROM table2;
+CREATE TABLE reindex_before AS
+SELECT oid, relname, relfilenode, relkind, reltoastrelid
+ FROM pg_class
+ where relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'schema_to_reindex');
+INSERT INTO reindex_before
+SELECT oid, 'pg_toast_TABLE', relfilenode, relkind, reltoastrelid
+FROM pg_class WHERE oid IN
+ (SELECT reltoastrelid FROM reindex_before WHERE reltoastrelid > 0);
+INSERT INTO reindex_before
+SELECT oid, 'pg_toast_TABLE_index', relfilenode, relkind, reltoastrelid
+FROM pg_class where oid in
+ (select indexrelid from pg_index where indrelid in
+ (select reltoastrelid from reindex_before where reltoastrelid > 0));
+REINDEX SCHEMA schema_to_reindex;
+CREATE TABLE reindex_after AS SELECT oid, relname, relfilenode, relkind
+ FROM pg_class
+ where relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'schema_to_reindex');
+SELECT b.relname,
+ b.relkind,
+ CASE WHEN a.relfilenode = b.relfilenode THEN 'relfilenode is unchanged'
+ ELSE 'relfilenode has changed' END
+ FROM reindex_before b JOIN pg_class a ON b.oid = a.oid
+ ORDER BY 1;
+ relname | relkind | case
+------------------+---------+--------------------------
+ matview | m | relfilenode is unchanged
+ matview_col1_idx | i | relfilenode has changed
+ table1 | r | relfilenode is unchanged
+ table1_col1_seq | S | relfilenode is unchanged
+ table1_pkey | i | relfilenode has changed
+ table2 | r | relfilenode is unchanged
+ table2_col1_seq | S | relfilenode is unchanged
+ table2_col2_idx | i | relfilenode has changed
+ table2_pkey | i | relfilenode has changed
+ view | v | relfilenode is unchanged
+(10 rows)
+
+REINDEX SCHEMA schema_to_reindex;
+BEGIN;
+REINDEX SCHEMA schema_to_reindex; -- failure, cannot run in a transaction
+ERROR: REINDEX SCHEMA cannot run inside a transaction block
+END;
+-- concurrently
+REINDEX SCHEMA CONCURRENTLY schema_to_reindex;
+-- Failure for unauthorized user
+CREATE ROLE regress_reindexuser NOLOGIN;
+SET SESSION ROLE regress_reindexuser;
+REINDEX SCHEMA schema_to_reindex;
+ERROR: must be owner of schema schema_to_reindex
+-- Permission failures with toast tables and indexes (pg_authid here)
+RESET ROLE;
+GRANT USAGE ON SCHEMA pg_toast TO regress_reindexuser;
+SET SESSION ROLE regress_reindexuser;
+REINDEX TABLE pg_toast.pg_toast_1260;
+ERROR: must be owner of table pg_toast_1260
+REINDEX INDEX pg_toast.pg_toast_1260_index;
+ERROR: must be owner of index pg_toast_1260_index
+-- Clean up
+RESET ROLE;
+REVOKE USAGE ON SCHEMA pg_toast FROM regress_reindexuser;
+DROP ROLE regress_reindexuser;
+DROP SCHEMA schema_to_reindex CASCADE;
+NOTICE: drop cascades to 6 other objects
+DETAIL: drop cascades to table table1
+drop cascades to table table2
+drop cascades to materialized view matview
+drop cascades to view view
+drop cascades to table reindex_before
+drop cascades to table reindex_after
diff --git src/test/regress/expected/create_table.out src/test/regress/expected/create_table.out
index c5e95edbed..044ecb6b85 100644
--- src/test/regress/expected/create_table.out
+++ src/test/regress/expected/create_table.out
@@ -275,16 +275,16 @@ CREATE TABLE IF NOT EXISTS as_select1 AS EXECUTE select1;
NOTICE: relation "as_select1" already exists, skipping
DROP TABLE as_select1;
DEALLOCATE select1;
+-- FIXME: enable this test when we introduce meta-page overflow for zedstore
-- create an extra wide table to test for issues related to that
-- (temporarily hide query, to avoid the long CREATE TABLE stmt)
-\set ECHO none
-INSERT INTO extra_wide_table(firstc, lastc) VALUES('first col', 'last col');
-SELECT firstc, lastc FROM extra_wide_table;
- firstc | lastc
------------+----------
- first col | last col
-(1 row)
-
+-- \set ECHO none
+-- SELECT 'CREATE TABLE extra_wide_table(firstc text, '|| array_to_string(array_agg('c'||i||' bool'),',')||', lastc text);'
+-- FROM generate_series(1, 1100) g(i)
+-- \gexec
+-- \set ECHO all
+-- INSERT INTO extra_wide_table(firstc, lastc) VALUES('first col', 'last col');
+-- SELECT firstc, lastc FROM extra_wide_table;
-- check that tables with oids cannot be created anymore
CREATE TABLE withoid() WITH OIDS;
ERROR: syntax error at or near "OIDS"
diff --git src/test/regress/expected/fsm_1.out src/test/regress/expected/fsm_1.out
new file mode 100644
index 0000000000..9b5f9be13a
--- /dev/null
+++ src/test/regress/expected/fsm_1.out
@@ -0,0 +1,73 @@
+--
+-- Free Space Map test
+--
+SELECT current_setting('block_size')::integer AS blocksize,
+current_setting('block_size')::integer / 8 AS strsize
+\gset
+CREATE TABLE fsm_check_size (num int, str text);
+-- Fill 3 blocks with one record each
+ALTER TABLE fsm_check_size SET (fillfactor=15);
+INSERT INTO fsm_check_size SELECT i, rpad('', :strsize, 'a')
+FROM generate_series(1,3) i;
+-- There should be no FSM
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'main') / :blocksize AS heap_nblocks,
+pg_relation_size('fsm_check_size', 'fsm') / :blocksize AS fsm_nblocks;
+ heap_nblocks | fsm_nblocks
+--------------+-------------
+ 5 | 0
+(1 row)
+
+-- The following operations are for testing the functionality of the local
+-- in-memory map. In particular, we want to be able to insert into some
+-- other block than the one at the end of the heap, without using a FSM.
+-- Fill most of the last block
+ALTER TABLE fsm_check_size SET (fillfactor=100);
+INSERT INTO fsm_check_size SELECT i, rpad('', :strsize, 'a')
+FROM generate_series(101,105) i;
+-- Make sure records can go into any block but the last one
+ALTER TABLE fsm_check_size SET (fillfactor=30);
+-- Insert large record and make sure it does not cause the relation to extend
+INSERT INTO fsm_check_size VALUES (111, rpad('', :strsize, 'a'));
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'main') / :blocksize AS heap_nblocks,
+pg_relation_size('fsm_check_size', 'fsm') / :blocksize AS fsm_nblocks;
+ heap_nblocks | fsm_nblocks
+--------------+-------------
+ 5 | 0
+(1 row)
+
+-- Extend table with enough blocks to exceed the FSM threshold
+DO $$
+DECLARE curtid tid;
+num int;
+BEGIN
+num = 11;
+ LOOP
+ INSERT INTO fsm_check_size VALUES (num, 'b') RETURNING ctid INTO curtid;
+ EXIT WHEN curtid >= tid '(4, 0)';
+ num = num + 1;
+ END LOOP;
+END;
+$$;
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'fsm') / :blocksize AS fsm_nblocks;
+ fsm_nblocks
+-------------
+ 0
+(1 row)
+
+-- Add long random string to extend TOAST table to 1 block
+INSERT INTO fsm_check_size
+VALUES(0, (SELECT string_agg(md5(chr(i)), '')
+ FROM generate_series(1, :blocksize / 100) i));
+VACUUM fsm_check_size;
+SELECT pg_relation_size(reltoastrelid, 'main') / :blocksize AS toast_nblocks,
+pg_relation_size(reltoastrelid, 'fsm') / :blocksize AS toast_fsm_nblocks
+FROM pg_class WHERE relname = 'fsm_check_size';
+ toast_nblocks | toast_fsm_nblocks
+---------------+-------------------
+ |
+(1 row)
+
+DROP TABLE fsm_check_size;
diff --git src/test/regress/expected/join_hash_1.out src/test/regress/expected/join_hash_1.out
new file mode 100644
index 0000000000..b745ef8358
--- /dev/null
+++ src/test/regress/expected/join_hash_1.out
@@ -0,0 +1,1015 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage stays within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increase just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(9 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious outer hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is not matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious outer hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+rollback;
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: ((hjtest_1.id = (SubPlan 1)) AND ((SubPlan 2) = (SubPlan 3)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan 4) < 50)
+ SubPlan 4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ -> Hash
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan 5) < 55)
+ SubPlan 5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan 1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan 3
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan 2
+ -> Result
+ Output: (hjtest_1.b * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: (((SubPlan 1) = hjtest_1.id) AND ((SubPlan 3) = (SubPlan 2)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan 5) < 55)
+ SubPlan 5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ -> Hash
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan 4) < 50)
+ SubPlan 4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan 2
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan 1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan 3
+ -> Result
+ Output: (hjtest_2.c * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+ROLLBACK;
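join_hash_1.out is the alternate expected output for the hash join tests; the batch-count
expectations differ from the stock heap results, presumably because zedstore relations occupy
a different number of blocks and therefore get different size estimates. The
hash_join_batches() helper defined at the top of this file can be pointed at any hash-join
query to see whether the executor had to grow the number of batches at run time; a minimal
sketch, assuming the helper and the test tables created inside the transaction above are
still in place:

    SELECT original, final, final > original AS grew_batches
      FROM hash_join_batches($$
        SELECT count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
      $$);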
diff --git src/test/regress/expected/limit_1.out src/test/regress/expected/limit_1.out
new file mode 100644
index 0000000000..55020ae124
--- /dev/null
+++ src/test/regress/expected/limit_1.out
@@ -0,0 +1,505 @@
+--
+-- LIMIT
+-- Check the LIMIT/OFFSET feature of SELECT
+--
+SELECT ''::text AS two, unique1, unique2, stringu1
+ FROM onek WHERE unique1 > 50
+ ORDER BY unique1 LIMIT 2;
+ two | unique1 | unique2 | stringu1
+-----+---------+---------+----------
+ | 51 | 76 | ZBAAAA
+ | 52 | 985 | ACAAAA
+(2 rows)
+
+SELECT ''::text AS five, unique1, unique2, stringu1
+ FROM onek WHERE unique1 > 60
+ ORDER BY unique1 LIMIT 5;
+ five | unique1 | unique2 | stringu1
+------+---------+---------+----------
+ | 61 | 560 | JCAAAA
+ | 62 | 633 | KCAAAA
+ | 63 | 296 | LCAAAA
+ | 64 | 479 | MCAAAA
+ | 65 | 64 | NCAAAA
+(5 rows)
+
+SELECT ''::text AS two, unique1, unique2, stringu1
+ FROM onek WHERE unique1 > 60 AND unique1 < 63
+ ORDER BY unique1 LIMIT 5;
+ two | unique1 | unique2 | stringu1
+-----+---------+---------+----------
+ | 61 | 560 | JCAAAA
+ | 62 | 633 | KCAAAA
+(2 rows)
+
+SELECT ''::text AS three, unique1, unique2, stringu1
+ FROM onek WHERE unique1 > 100
+ ORDER BY unique1 LIMIT 3 OFFSET 20;
+ three | unique1 | unique2 | stringu1
+-------+---------+---------+----------
+ | 121 | 700 | REAAAA
+ | 122 | 519 | SEAAAA
+ | 123 | 777 | TEAAAA
+(3 rows)
+
+SELECT ''::text AS zero, unique1, unique2, stringu1
+ FROM onek WHERE unique1 < 50
+ ORDER BY unique1 DESC LIMIT 8 OFFSET 99;
+ zero | unique1 | unique2 | stringu1
+------+---------+---------+----------
+(0 rows)
+
+SELECT ''::text AS eleven, unique1, unique2, stringu1
+ FROM onek WHERE unique1 < 50
+ ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
+ eleven | unique1 | unique2 | stringu1
+--------+---------+---------+----------
+ | 10 | 520 | KAAAAA
+ | 9 | 49 | JAAAAA
+ | 8 | 653 | IAAAAA
+ | 7 | 647 | HAAAAA
+ | 6 | 978 | GAAAAA
+ | 5 | 541 | FAAAAA
+ | 4 | 833 | EAAAAA
+ | 3 | 431 | DAAAAA
+ | 2 | 326 | CAAAAA
+ | 1 | 214 | BAAAAA
+ | 0 | 998 | AAAAAA
+(11 rows)
+
+SELECT ''::text AS ten, unique1, unique2, stringu1
+ FROM onek
+ ORDER BY unique1 OFFSET 990;
+ ten | unique1 | unique2 | stringu1
+-----+---------+---------+----------
+ | 990 | 369 | CMAAAA
+ | 991 | 426 | DMAAAA
+ | 992 | 363 | EMAAAA
+ | 993 | 661 | FMAAAA
+ | 994 | 695 | GMAAAA
+ | 995 | 144 | HMAAAA
+ | 996 | 258 | IMAAAA
+ | 997 | 21 | JMAAAA
+ | 998 | 549 | KMAAAA
+ | 999 | 152 | LMAAAA
+(10 rows)
+
+SELECT ''::text AS five, unique1, unique2, stringu1
+ FROM onek
+ ORDER BY unique1 OFFSET 990 LIMIT 5;
+ five | unique1 | unique2 | stringu1
+------+---------+---------+----------
+ | 990 | 369 | CMAAAA
+ | 991 | 426 | DMAAAA
+ | 992 | 363 | EMAAAA
+ | 993 | 661 | FMAAAA
+ | 994 | 695 | GMAAAA
+(5 rows)
+
+SELECT ''::text AS five, unique1, unique2, stringu1
+ FROM onek
+ ORDER BY unique1 LIMIT 5 OFFSET 900;
+ five | unique1 | unique2 | stringu1
+------+---------+---------+----------
+ | 900 | 913 | QIAAAA
+ | 901 | 931 | RIAAAA
+ | 902 | 702 | SIAAAA
+ | 903 | 641 | TIAAAA
+ | 904 | 793 | UIAAAA
+(5 rows)
+
+-- Test null limit and offset. The planner would discard a simple null
+-- constant, so to ensure the executor is exercised, do this:
+select * from int8_tbl limit (case when random() < 0.5 then null::bigint end);
+ q1 | q2
+------------------+-------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(5 rows)
+
+select * from int8_tbl offset (case when random() < 0.5 then null::bigint end);
+ q1 | q2
+------------------+-------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(5 rows)
+
+-- Test assorted cases involving backwards fetch from a LIMIT plan node
+begin;
+declare c1 cursor for select * from int8_tbl limit 10;
+fetch all in c1;
+ q1 | q2
+------------------+-------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(5 rows)
+
+fetch 1 in c1;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c1;
+ q1 | q2
+------------------+-------------------
+ 4567890123456789 | -4567890123456789
+(1 row)
+
+fetch backward all in c1;
+ q1 | q2
+------------------+------------------
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | 123
+ 123 | 4567890123456789
+ 123 | 456
+(4 rows)
+
+fetch backward 1 in c1;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch all in c1;
+ q1 | q2
+------------------+-------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(5 rows)
+
+declare c2 cursor for select * from int8_tbl limit 3;
+fetch all in c2;
+ q1 | q2
+------------------+------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+(3 rows)
+
+fetch 1 in c2;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c2;
+ q1 | q2
+------------------+-----
+ 4567890123456789 | 123
+(1 row)
+
+fetch backward all in c2;
+ q1 | q2
+-----+------------------
+ 123 | 4567890123456789
+ 123 | 456
+(2 rows)
+
+fetch backward 1 in c2;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch all in c2;
+ q1 | q2
+------------------+------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+(3 rows)
+
+declare c3 cursor for select * from int8_tbl offset 3;
+fetch all in c3;
+ q1 | q2
+------------------+-------------------
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(2 rows)
+
+fetch 1 in c3;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c3;
+ q1 | q2
+------------------+-------------------
+ 4567890123456789 | -4567890123456789
+(1 row)
+
+fetch backward all in c3;
+ q1 | q2
+------------------+------------------
+ 4567890123456789 | 4567890123456789
+(1 row)
+
+fetch backward 1 in c3;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch all in c3;
+ q1 | q2
+------------------+-------------------
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(2 rows)
+
+declare c4 cursor for select * from int8_tbl offset 10;
+fetch all in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch 1 in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward all in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch all in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+rollback;
+-- Stress test for variable LIMIT in conjunction with bounded-heap sorting
+SELECT
+ (SELECT n
+ FROM (VALUES (1)) AS x,
+ (SELECT n FROM generate_series(1,10) AS n
+ ORDER BY n LIMIT 1 OFFSET s-1) AS y) AS z
+ FROM generate_series(1,10) AS s;
+ z
+----
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+ 10
+(10 rows)
+
+--
+-- Test behavior of volatile and set-returning functions in conjunction
+-- with ORDER BY and LIMIT.
+--
+create temp sequence testseq;
+explain (verbose, costs off)
+select unique1, unique2, nextval('testseq')
+ from tenk1 order by unique2 limit 10;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ Output: unique1, unique2, (nextval('testseq'::regclass))
+ -> Index Scan using tenk1_unique2 on public.tenk1
+ Output: unique1, unique2, nextval('testseq'::regclass)
+(4 rows)
+
+select unique1, unique2, nextval('testseq')
+ from tenk1 order by unique2 limit 10;
+ unique1 | unique2 | nextval
+---------+---------+---------
+ 8800 | 0 | 1
+ 1891 | 1 | 2
+ 3420 | 2 | 3
+ 9850 | 3 | 4
+ 7164 | 4 | 5
+ 8009 | 5 | 6
+ 5057 | 6 | 7
+ 6701 | 7 | 8
+ 4321 | 8 | 9
+ 3043 | 9 | 10
+(10 rows)
+
+select currval('testseq');
+ currval
+---------
+ 10
+(1 row)
+
+explain (verbose, costs off)
+select unique1, unique2, nextval('testseq')
+ from tenk1 order by tenthous limit 10;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Limit
+ Output: unique1, unique2, (nextval('testseq'::regclass)), tenthous
+ -> Result
+ Output: unique1, unique2, nextval('testseq'::regclass), tenthous
+ -> Sort
+ Output: unique1, unique2, tenthous
+ Sort Key: tenk1.tenthous
+ -> Seq Scan on public.tenk1
+ Output: unique1, unique2, tenthous
+(9 rows)
+
+select unique1, unique2, nextval('testseq')
+ from tenk1 order by tenthous limit 10;
+ unique1 | unique2 | nextval
+---------+---------+---------
+ 0 | 9998 | 11
+ 1 | 2838 | 12
+ 2 | 2716 | 13
+ 3 | 5679 | 14
+ 4 | 1621 | 15
+ 5 | 5557 | 16
+ 6 | 2855 | 17
+ 7 | 8518 | 18
+ 8 | 5435 | 19
+ 9 | 4463 | 20
+(10 rows)
+
+select currval('testseq');
+ currval
+---------
+ 20
+(1 row)
+
+explain (verbose, costs off)
+select unique1, unique2, generate_series(1,10)
+ from tenk1 order by unique2 limit 7;
+ QUERY PLAN
+------------------------------------------------------------
+ Limit
+ Output: unique1, unique2, (generate_series(1, 10))
+ -> ProjectSet
+ Output: unique1, unique2, generate_series(1, 10)
+ -> Index Scan using tenk1_unique2 on public.tenk1
+ Output: unique1, unique2
+(6 rows)
+
+select unique1, unique2, generate_series(1,10)
+ from tenk1 order by unique2 limit 7;
+ unique1 | unique2 | generate_series
+---------+---------+-----------------
+ 8800 | 0 | 1
+ 8800 | 0 | 2
+ 8800 | 0 | 3
+ 8800 | 0 | 4
+ 8800 | 0 | 5
+ 8800 | 0 | 6
+ 8800 | 0 | 7
+(7 rows)
+
+explain (verbose, costs off)
+select unique1, unique2, generate_series(1,10)
+ from tenk1 order by tenthous limit 7;
+ QUERY PLAN
+--------------------------------------------------------------------
+ Limit
+ Output: unique1, unique2, (generate_series(1, 10)), tenthous
+ -> ProjectSet
+ Output: unique1, unique2, generate_series(1, 10), tenthous
+ -> Sort
+ Output: unique1, unique2, tenthous
+ Sort Key: tenk1.tenthous
+ -> Seq Scan on public.tenk1
+ Output: unique1, unique2, tenthous
+(9 rows)
+
+select unique1, unique2, generate_series(1,10)
+ from tenk1 order by tenthous limit 7;
+ unique1 | unique2 | generate_series
+---------+---------+-----------------
+ 0 | 9998 | 1
+ 0 | 9998 | 2
+ 0 | 9998 | 3
+ 0 | 9998 | 4
+ 0 | 9998 | 5
+ 0 | 9998 | 6
+ 0 | 9998 | 7
+(7 rows)
+
+-- use of random() is to keep planner from folding the expressions together
+explain (verbose, costs off)
+select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ ProjectSet
+ Output: generate_series(0, 2), generate_series(((random() * '0.1'::double precision))::integer, 2)
+ -> Result
+(3 rows)
+
+select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2;
+ s1 | s2
+----+----
+ 0 | 0
+ 1 | 1
+ 2 | 2
+(3 rows)
+
+explain (verbose, costs off)
+select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2
+order by s2 desc;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------
+ Sort
+ Output: (generate_series(0, 2)), (generate_series(((random() * '0.1'::double precision))::integer, 2))
+ Sort Key: (generate_series(((random() * '0.1'::double precision))::integer, 2)) DESC
+ -> ProjectSet
+ Output: generate_series(0, 2), generate_series(((random() * '0.1'::double precision))::integer, 2)
+ -> Result
+(6 rows)
+
+select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2
+order by s2 desc;
+ s1 | s2
+----+----
+ 2 | 2
+ 1 | 1
+ 0 | 0
+(3 rows)
+
+-- test for failure to set all aggregates' aggtranstype
+explain (verbose, costs off)
+select sum(tenthous) as s1, sum(tenthous) + random()*0 as s2
+ from tenk1 group by thousand order by thousand limit 3;
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------
+ Limit
+ Output: (sum(tenthous)), (((sum(tenthous))::double precision + (random() * '0'::double precision))), thousand
+ -> GroupAggregate
+ Output: sum(tenthous), ((sum(tenthous))::double precision + (random() * '0'::double precision)), thousand
+ Group Key: tenk1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on public.tenk1
+ Output: thousand, tenthous
+(7 rows)
+
+select sum(tenthous) as s1, sum(tenthous) + random()*0 as s2
+ from tenk1 group by thousand order by thousand limit 3;
+ s1 | s2
+-------+-------
+ 45000 | 45000
+ 45010 | 45010
+ 45020 | 45020
+(3 rows)
+
diff --git src/test/regress/expected/opr_sanity.out src/test/regress/expected/opr_sanity.out
index 3c0b21d633..6283e3f771 100644
--- src/test/regress/expected/opr_sanity.out
+++ src/test/regress/expected/opr_sanity.out
@@ -832,6 +832,12 @@ macaddr8_gt(macaddr8,macaddr8)
macaddr8_ge(macaddr8,macaddr8)
macaddr8_ne(macaddr8,macaddr8)
macaddr8_cmp(macaddr8,macaddr8)
+zstideq(zstid,zstid)
+zstidle(zstid,zstid)
+zstidge(zstid,zstid)
+zstidgt(zstid,zstid)
+zstidlt(zstid,zstid)
+zstidne(zstid,zstid)
-- restore normal output mode
\a\t
-- List of functions used by libpq's fe-lobj.c
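The opr_sanity expected output gains six entries for the zstid comparison functions the patch
introduces (zstid presumably being the patch's logical tuple-identifier type). Once the patch
is built, the new catalog entries can be listed with a plain query; this is just an
illustrative lookup, not part of the regression tests:

    SELECT proname, oid::regprocedure AS signature
      FROM pg_proc
     WHERE proname LIKE 'zstid%'
     ORDER BY proname;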
diff --git src/test/regress/expected/psql_1.out src/test/regress/expected/psql_1.out
new file mode 100644
index 0000000000..8581af1b1b
--- /dev/null
+++ src/test/regress/expected/psql_1.out
@@ -0,0 +1,4977 @@
+--
+-- Tests for psql features that aren't closely connected to any
+-- specific server features
+--
+-- \set
+-- fail: invalid name
+\set invalid/name foo
+invalid variable name: "invalid/name"
+-- fail: invalid value for special variable
+\set AUTOCOMMIT foo
+unrecognized value "foo" for "AUTOCOMMIT": Boolean expected
+\set FETCH_COUNT foo
+invalid value "foo" for "FETCH_COUNT": integer expected
+-- check handling of built-in boolean variable
+\echo :ON_ERROR_ROLLBACK
+off
+\set ON_ERROR_ROLLBACK
+\echo :ON_ERROR_ROLLBACK
+on
+\set ON_ERROR_ROLLBACK foo
+unrecognized value "foo" for "ON_ERROR_ROLLBACK"
+Available values are: on, off, interactive.
+\echo :ON_ERROR_ROLLBACK
+on
+\set ON_ERROR_ROLLBACK on
+\echo :ON_ERROR_ROLLBACK
+on
+\unset ON_ERROR_ROLLBACK
+\echo :ON_ERROR_ROLLBACK
+off
+-- \g and \gx
+SELECT 1 as one, 2 as two \g
+ one | two
+-----+-----
+ 1 | 2
+(1 row)
+
+\gx
+-[ RECORD 1 ]
+one | 1
+two | 2
+
+SELECT 3 as three, 4 as four \gx
+-[ RECORD 1 ]
+three | 3
+four | 4
+
+\g
+ three | four
+-------+------
+ 3 | 4
+(1 row)
+
+-- \gx should work in FETCH_COUNT mode too
+\set FETCH_COUNT 1
+SELECT 1 as one, 2 as two \g
+ one | two
+-----+-----
+ 1 | 2
+(1 row)
+
+\gx
+-[ RECORD 1 ]
+one | 1
+two | 2
+
+SELECT 3 as three, 4 as four \gx
+-[ RECORD 1 ]
+three | 3
+four | 4
+
+\g
+ three | four
+-------+------
+ 3 | 4
+(1 row)
+
+\unset FETCH_COUNT
+-- \gset
+select 10 as test01, 20 as test02, 'Hello' as test03 \gset pref01_
+\echo :pref01_test01 :pref01_test02 :pref01_test03
+10 20 Hello
+-- should fail: bad variable name
+select 10 as "bad name"
+\gset
+invalid variable name: "bad name"
+-- multiple backslash commands in one line
+select 1 as x, 2 as y \gset pref01_ \\ \echo :pref01_x
+1
+select 3 as x, 4 as y \gset pref01_ \echo :pref01_x \echo :pref01_y
+3
+4
+select 5 as x, 6 as y \gset pref01_ \\ \g \echo :pref01_x :pref01_y
+ x | y
+---+---
+ 5 | 6
+(1 row)
+
+5 6
+select 7 as x, 8 as y \g \gset pref01_ \echo :pref01_x :pref01_y
+ x | y
+---+---
+ 7 | 8
+(1 row)
+
+7 8
+-- NULL should unset the variable
+\set var2 xyz
+select 1 as var1, NULL as var2, 3 as var3 \gset
+\echo :var1 :var2 :var3
+1 :var2 3
+-- \gset requires just one tuple
+select 10 as test01, 20 as test02 from generate_series(1,3) \gset
+more than one row returned for \gset
+select 10 as test01, 20 as test02 from generate_series(1,0) \gset
+no rows returned for \gset
+-- \gset should work in FETCH_COUNT mode too
+\set FETCH_COUNT 1
+select 1 as x, 2 as y \gset pref01_ \\ \echo :pref01_x
+1
+select 3 as x, 4 as y \gset pref01_ \echo :pref01_x \echo :pref01_y
+3
+4
+select 10 as test01, 20 as test02 from generate_series(1,3) \gset
+more than one row returned for \gset
+select 10 as test01, 20 as test02 from generate_series(1,0) \gset
+no rows returned for \gset
+\unset FETCH_COUNT
+-- \gdesc
+SELECT
+ NULL AS zero,
+ 1 AS one,
+ 2.0 AS two,
+ 'three' AS three,
+ $1 AS four,
+ sin($2) as five,
+ 'foo'::varchar(4) as six,
+ CURRENT_DATE AS now
+\gdesc
+ Column | Type
+--------+----------------------
+ zero | text
+ one | integer
+ two | numeric
+ three | text
+ four | text
+ five | double precision
+ six | character varying(4)
+ now | date
+(8 rows)
+
+-- should work with tuple-returning utilities, such as EXECUTE
+PREPARE test AS SELECT 1 AS first, 2 AS second;
+EXECUTE test \gdesc
+ Column | Type
+--------+---------
+ first | integer
+ second | integer
+(2 rows)
+
+EXPLAIN EXECUTE test \gdesc
+ Column | Type
+------------+------
+ QUERY PLAN | text
+(1 row)
+
+-- should fail cleanly - syntax error
+SELECT 1 + \gdesc
+ERROR: syntax error at end of input
+LINE 1: SELECT 1 +
+ ^
+-- check behavior with empty results
+SELECT \gdesc
+The command has no result, or the result has no columns.
+CREATE TABLE bububu(a int) \gdesc
+The command has no result, or the result has no columns.
+-- subject command should not have executed
+TABLE bububu; -- fail
+ERROR: relation "bububu" does not exist
+LINE 1: TABLE bububu;
+ ^
+-- query buffer should remain unchanged
+SELECT 1 AS x, 'Hello', 2 AS y, true AS "dirty\name"
+\gdesc
+ Column | Type
+------------+---------
+ x | integer
+ ?column? | text
+ y | integer
+ dirty\name | boolean
+(4 rows)
+
+\g
+ x | ?column? | y | dirty\name
+---+----------+---+------------
+ 1 | Hello | 2 | t
+(1 row)
+
+-- all on one line
+SELECT 3 AS x, 'Hello', 4 AS y, true AS "dirty\name" \gdesc \g
+ Column | Type
+------------+---------
+ x | integer
+ ?column? | text
+ y | integer
+ dirty\name | boolean
+(4 rows)
+
+ x | ?column? | y | dirty\name
+---+----------+---+------------
+ 3 | Hello | 4 | t
+(1 row)
+
+-- \gexec
+create temporary table gexec_test(a int, b text, c date, d float);
+select format('create index on gexec_test(%I)', attname)
+from pg_attribute
+where attrelid = 'gexec_test'::regclass and attnum > 0
+order by attnum
+\gexec
+create index on gexec_test(a)
+create index on gexec_test(b)
+create index on gexec_test(c)
+create index on gexec_test(d)
+-- \gexec should work in FETCH_COUNT mode too
+-- (though the fetch limit applies to the executed queries not the meta query)
+\set FETCH_COUNT 1
+select 'select 1 as ones', 'select x.y, x.y*2 as double from generate_series(1,4) as x(y)'
+union all
+select 'drop table gexec_test', NULL
+union all
+select 'drop table gexec_test', 'select ''2000-01-01''::date as party_over'
+\gexec
+select 1 as ones
+ ones
+------
+ 1
+(1 row)
+
+select x.y, x.y*2 as double from generate_series(1,4) as x(y)
+ y | double
+---+--------
+ 1 | 2
+ 2 | 4
+ 3 | 6
+ 4 | 8
+(4 rows)
+
+drop table gexec_test
+drop table gexec_test
+ERROR: table "gexec_test" does not exist
+select '2000-01-01'::date as party_over
+ party_over
+------------
+ 01-01-2000
+(1 row)
+
+\unset FETCH_COUNT
+-- show all pset options
+\pset
+border 1
+columns 0
+csv_fieldsep ','
+expanded off
+fieldsep '|'
+fieldsep_zero off
+footer on
+format aligned
+linestyle ascii
+null ''
+numericlocale off
+pager 1
+pager_min_lines 0
+recordsep '\n'
+recordsep_zero off
+tableattr
+title
+tuples_only off
+unicode_border_linestyle single
+unicode_column_linestyle single
+unicode_header_linestyle single
+-- test multi-line headers, wrapping, and newline indicators
+-- in aligned, unaligned, and wrapped formats
+prepare q as select array_to_string(array_agg(repeat('x',2*n)),E'\n') as "ab
+
+c", array_to_string(array_agg(repeat('y',20-2*n)),E'\n') as "a
+bc" from generate_series(1,10) as n(n) group by n>1 order by n>1;
+\pset linestyle ascii
+\pset expanded off
+\pset columns 40
+\pset border 0
+\pset format unaligned
+execute q;
+ab
+
+c|a
+bc
+xx|yyyyyyyyyyyyyyyyyy
+xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+(2 rows)
+\pset format aligned
+execute q;
+ ab + a +
+ + bc
+ c
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx +yyyyyyyyyyyyyyyy +
+xxxxxx +yyyyyyyyyyyyyy +
+xxxxxxxx +yyyyyyyyyyyy +
+xxxxxxxxxx +yyyyyyyyyy +
+xxxxxxxxxxxx +yyyyyyyy +
+xxxxxxxxxxxxxx +yyyyyy +
+xxxxxxxxxxxxxxxx +yyyy +
+xxxxxxxxxxxxxxxxxx +yy +
+xxxxxxxxxxxxxxxxxxxx
+(2 rows)
+
+\pset format wrapped
+execute q;
+ ab + a +
+ + bc
+ c
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx +yyyyyyyyyyyyyyyy +
+xxxxxx +yyyyyyyyyyyyyy +
+xxxxxxxx +yyyyyyyyyyyy +
+xxxxxxxxxx +yyyyyyyyyy +
+xxxxxxxxxxxx +yyyyyyyy +
+xxxxxxxxxxxxxx +yyyyyy +
+xxxxxxxxxxxxxxxx +yyyy +
+xxxxxxxxxxxxxxxxxx +yy +
+xxxxxxxxxxxxxxxxxxxx
+(2 rows)
+
+\pset border 1
+\pset format unaligned
+execute q;
+ab
+
+c|a
+bc
+xx|yyyyyyyyyyyyyyyyyy
+xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+(2 rows)
+\pset format aligned
+execute q;
+ ab +| a +
+ +| bc
+ c |
+----------------------+--------------------
+ xx | yyyyyyyyyyyyyyyyyy
+ xxxx +| yyyyyyyyyyyyyyyy +
+ xxxxxx +| yyyyyyyyyyyyyy +
+ xxxxxxxx +| yyyyyyyyyyyy +
+ xxxxxxxxxx +| yyyyyyyyyy +
+ xxxxxxxxxxxx +| yyyyyyyy +
+ xxxxxxxxxxxxxx +| yyyyyy +
+ xxxxxxxxxxxxxxxx +| yyyy +
+ xxxxxxxxxxxxxxxxxx +| yy +
+ xxxxxxxxxxxxxxxxxxxx |
+(2 rows)
+
+\pset format wrapped
+execute q;
+ ab +| a +
+ +| bc
+ c |
+-------------------+--------------------
+ xx | yyyyyyyyyyyyyyyyyy
+ xxxx +| yyyyyyyyyyyyyyyy +
+ xxxxxx +| yyyyyyyyyyyyyy +
+ xxxxxxxx +| yyyyyyyyyyyy +
+ xxxxxxxxxx +| yyyyyyyyyy +
+ xxxxxxxxxxxx +| yyyyyyyy +
+ xxxxxxxxxxxxxx +| yyyyyy +
+ xxxxxxxxxxxxxxxx +| yyyy +
+ xxxxxxxxxxxxxxxxx.| yy +
+.x +|
+ xxxxxxxxxxxxxxxxx.|
+.xxx |
+(2 rows)
+
+\pset border 2
+\pset format unaligned
+execute q;
+ab
+
+c|a
+bc
+xx|yyyyyyyyyyyyyyyyyy
+xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+(2 rows)
+\pset format aligned
+execute q;
++----------------------+--------------------+
+| ab +| a +|
+| +| bc |
+| c | |
++----------------------+--------------------+
+| xx | yyyyyyyyyyyyyyyyyy |
+| xxxx +| yyyyyyyyyyyyyyyy +|
+| xxxxxx +| yyyyyyyyyyyyyy +|
+| xxxxxxxx +| yyyyyyyyyyyy +|
+| xxxxxxxxxx +| yyyyyyyyyy +|
+| xxxxxxxxxxxx +| yyyyyyyy +|
+| xxxxxxxxxxxxxx +| yyyyyy +|
+| xxxxxxxxxxxxxxxx +| yyyy +|
+| xxxxxxxxxxxxxxxxxx +| yy +|
+| xxxxxxxxxxxxxxxxxxxx | |
++----------------------+--------------------+
+(2 rows)
+
+\pset format wrapped
+execute q;
++-----------------+--------------------+
+| ab +| a +|
+| +| bc |
+| c | |
++-----------------+--------------------+
+| xx | yyyyyyyyyyyyyyyyyy |
+| xxxx +| yyyyyyyyyyyyyyyy +|
+| xxxxxx +| yyyyyyyyyyyyyy +|
+| xxxxxxxx +| yyyyyyyyyyyy +|
+| xxxxxxxxxx +| yyyyyyyyyy +|
+| xxxxxxxxxxxx +| yyyyyyyy +|
+| xxxxxxxxxxxxxx +| yyyyyy +|
+| xxxxxxxxxxxxxxx.| yyyy +|
+|.x +| yy +|
+| xxxxxxxxxxxxxxx.| |
+|.xxx +| |
+| xxxxxxxxxxxxxxx.| |
+|.xxxxx | |
++-----------------+--------------------+
+(2 rows)
+
+\pset expanded on
+\pset columns 20
+\pset border 0
+\pset format unaligned
+execute q;
+ab
+
+c|xx
+a
+bc|yyyyyyyyyyyyyyyyyy
+
+ab
+
+c|xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx
+a
+bc|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+\pset format aligned
+execute q;
+* Record 1
+ab+ xx
+ +
+c
+a + yyyyyyyyyyyyyyyyyy
+bc
+* Record 2
+ab+ xxxx +
+ + xxxxxx +
+c xxxxxxxx +
+ xxxxxxxxxx +
+ xxxxxxxxxxxx +
+ xxxxxxxxxxxxxx +
+ xxxxxxxxxxxxxxxx +
+ xxxxxxxxxxxxxxxxxx +
+ xxxxxxxxxxxxxxxxxxxx
+a + yyyyyyyyyyyyyyyy +
+bc yyyyyyyyyyyyyy +
+ yyyyyyyyyyyy +
+ yyyyyyyyyy +
+ yyyyyyyy +
+ yyyyyy +
+ yyyy +
+ yy +
+
+
+\pset format wrapped
+execute q;
+* Record 1
+ab+ xx
+ +
+c
+a + yyyyyyyyyyyyyyy.
+bc .yyy
+* Record 2
+ab+ xxxx +
+ + xxxxxx +
+c xxxxxxxx +
+ xxxxxxxxxx +
+ xxxxxxxxxxxx +
+ xxxxxxxxxxxxxx +
+ xxxxxxxxxxxxxxx.
+ .x +
+ xxxxxxxxxxxxxxx.
+ .xxx +
+ xxxxxxxxxxxxxxx.
+ .xxxxx
+a + yyyyyyyyyyyyyyy.
+bc .y +
+ yyyyyyyyyyyyyy +
+ yyyyyyyyyyyy +
+ yyyyyyyyyy +
+ yyyyyyyy +
+ yyyyyy +
+ yyyy +
+ yy +
+
+
+\pset border 1
+\pset format unaligned
+execute q;
+ab
+
+c|xx
+a
+bc|yyyyyyyyyyyyyyyyyy
+
+ab
+
+c|xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx
+a
+bc|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+\pset format aligned
+execute q;
+-[ RECORD 1 ]------------
+ab+| xx
+ +|
+c |
+a +| yyyyyyyyyyyyyyyyyy
+bc |
+-[ RECORD 2 ]------------
+ab+| xxxx +
+ +| xxxxxx +
+c | xxxxxxxx +
+ | xxxxxxxxxx +
+ | xxxxxxxxxxxx +
+ | xxxxxxxxxxxxxx +
+ | xxxxxxxxxxxxxxxx +
+ | xxxxxxxxxxxxxxxxxx +
+ | xxxxxxxxxxxxxxxxxxxx
+a +| yyyyyyyyyyyyyyyy +
+bc | yyyyyyyyyyyyyy +
+ | yyyyyyyyyyyy +
+ | yyyyyyyyyy +
+ | yyyyyyyy +
+ | yyyyyy +
+ | yyyy +
+ | yy +
+ |
+
+\pset format wrapped
+execute q;
+-[ RECORD 1 ]------
+ab+| xx
+ +|
+c |
+a +| yyyyyyyyyyyyyy.
+bc |.yyyy
+-[ RECORD 2 ]------
+ab+| xxxx +
+ +| xxxxxx +
+c | xxxxxxxx +
+ | xxxxxxxxxx +
+ | xxxxxxxxxxxx +
+ | xxxxxxxxxxxxxx+
+ | xxxxxxxxxxxxxx.
+ |.xx +
+ | xxxxxxxxxxxxxx.
+ |.xxxx +
+ | xxxxxxxxxxxxxx.
+ |.xxxxxx
+a +| yyyyyyyyyyyyyy.
+bc |.yy +
+ | yyyyyyyyyyyyyy+
+ | yyyyyyyyyyyy +
+ | yyyyyyyyyy +
+ | yyyyyyyy +
+ | yyyyyy +
+ | yyyy +
+ | yy +
+ |
+
+\pset border 2
+\pset format unaligned
+execute q;
+ab
+
+c|xx
+a
+bc|yyyyyyyyyyyyyyyyyy
+
+ab
+
+c|xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx
+a
+bc|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+\pset format aligned
+execute q;
++-[ RECORD 1 ]--------------+
+| ab+| xx |
+| +| |
+| c | |
+| a +| yyyyyyyyyyyyyyyyyy |
+| bc | |
++-[ RECORD 2 ]--------------+
+| ab+| xxxx +|
+| +| xxxxxx +|
+| c | xxxxxxxx +|
+| | xxxxxxxxxx +|
+| | xxxxxxxxxxxx +|
+| | xxxxxxxxxxxxxx +|
+| | xxxxxxxxxxxxxxxx +|
+| | xxxxxxxxxxxxxxxxxx +|
+| | xxxxxxxxxxxxxxxxxxxx |
+| a +| yyyyyyyyyyyyyyyy +|
+| bc | yyyyyyyyyyyyyy +|
+| | yyyyyyyyyyyy +|
+| | yyyyyyyyyy +|
+| | yyyyyyyy +|
+| | yyyyyy +|
+| | yyyy +|
+| | yy +|
+| | |
++----+----------------------+
+
+\pset format wrapped
+execute q;
++-[ RECORD 1 ]-----+
+| ab+| xx |
+| +| |
+| c | |
+| a +| yyyyyyyyyyy.|
+| bc |.yyyyyyy |
++-[ RECORD 2 ]-----+
+| ab+| xxxx +|
+| +| xxxxxx +|
+| c | xxxxxxxx +|
+| | xxxxxxxxxx +|
+| | xxxxxxxxxxx.|
+| |.x +|
+| | xxxxxxxxxxx.|
+| |.xxx +|
+| | xxxxxxxxxxx.|
+| |.xxxxx +|
+| | xxxxxxxxxxx.|
+| |.xxxxxxx +|
+| | xxxxxxxxxxx.|
+| |.xxxxxxxxx |
+| a +| yyyyyyyyyyy.|
+| bc |.yyyyy +|
+| | yyyyyyyyyyy.|
+| |.yyy +|
+| | yyyyyyyyyyy.|
+| |.y +|
+| | yyyyyyyyyy +|
+| | yyyyyyyy +|
+| | yyyyyy +|
+| | yyyy +|
+| | yy +|
+| | |
++----+-------------+
+
+\pset linestyle old-ascii
+\pset expanded off
+\pset columns 40
+\pset border 0
+\pset format unaligned
+execute q;
+ab
+
+c|a
+bc
+xx|yyyyyyyyyyyyyyyyyy
+xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+(2 rows)
+\pset format aligned
+execute q;
+ ab a
+ + bc
+ c +
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx yyyyyyyyyyyyyyyy
+xxxxxx yyyyyyyyyyyyyy
+xxxxxxxx yyyyyyyyyyyy
+xxxxxxxxxx yyyyyyyyyy
+xxxxxxxxxxxx yyyyyyyy
+xxxxxxxxxxxxxx yyyyyy
+xxxxxxxxxxxxxxxx yyyy
+xxxxxxxxxxxxxxxxxx yy
+xxxxxxxxxxxxxxxxxxxx
+(2 rows)
+
+\pset format wrapped
+execute q;
+ ab a
+ + bc
+ c +
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx yyyyyyyyyyyyyyyy
+xxxxxx yyyyyyyyyyyyyy
+xxxxxxxx yyyyyyyyyyyy
+xxxxxxxxxx yyyyyyyyyy
+xxxxxxxxxxxx yyyyyyyy
+xxxxxxxxxxxxxx yyyyyy
+xxxxxxxxxxxxxxxx yyyy
+xxxxxxxxxxxxxxxxxx yy
+xxxxxxxxxxxxxxxxxxxx
+(2 rows)
+
+\pset border 1
+\pset format unaligned
+execute q;
+ab
+
+c|a
+bc
+xx|yyyyyyyyyyyyyyyyyy
+xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+(2 rows)
+\pset format aligned
+execute q;
+ ab | a
++ |+ bc
++ c |+
+----------------------+--------------------
+ xx | yyyyyyyyyyyyyyyyyy
+ xxxx | yyyyyyyyyyyyyyyy
+ xxxxxx : yyyyyyyyyyyyyy
+ xxxxxxxx : yyyyyyyyyyyy
+ xxxxxxxxxx : yyyyyyyyyy
+ xxxxxxxxxxxx : yyyyyyyy
+ xxxxxxxxxxxxxx : yyyyyy
+ xxxxxxxxxxxxxxxx : yyyy
+ xxxxxxxxxxxxxxxxxx : yy
+ xxxxxxxxxxxxxxxxxxxx :
+(2 rows)
+
+\pset format wrapped
+execute q;
+ ab | a
++ |+ bc
++ c |+
+-------------------+--------------------
+ xx | yyyyyyyyyyyyyyyyyy
+ xxxx | yyyyyyyyyyyyyyyy
+ xxxxxx : yyyyyyyyyyyyyy
+ xxxxxxxx : yyyyyyyyyyyy
+ xxxxxxxxxx : yyyyyyyyyy
+ xxxxxxxxxxxx : yyyyyyyy
+ xxxxxxxxxxxxxx : yyyyyy
+ xxxxxxxxxxxxxxxx : yyyy
+ xxxxxxxxxxxxxxxxx : yy
+ x :
+ xxxxxxxxxxxxxxxxx
+ xxx
+(2 rows)
+
+\pset border 2
+\pset format unaligned
+execute q;
+ab
+
+c|a
+bc
+xx|yyyyyyyyyyyyyyyyyy
+xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+(2 rows)
+\pset format aligned
+execute q;
++----------------------+--------------------+
+| ab | a |
+|+ |+ bc |
+|+ c |+ |
++----------------------+--------------------+
+| xx | yyyyyyyyyyyyyyyyyy |
+| xxxx | yyyyyyyyyyyyyyyy |
+| xxxxxx : yyyyyyyyyyyyyy |
+| xxxxxxxx : yyyyyyyyyyyy |
+| xxxxxxxxxx : yyyyyyyyyy |
+| xxxxxxxxxxxx : yyyyyyyy |
+| xxxxxxxxxxxxxx : yyyyyy |
+| xxxxxxxxxxxxxxxx : yyyy |
+| xxxxxxxxxxxxxxxxxx : yy |
+| xxxxxxxxxxxxxxxxxxxx : |
++----------------------+--------------------+
+(2 rows)
+
+\pset format wrapped
+execute q;
++-----------------+--------------------+
+| ab | a |
+|+ |+ bc |
+|+ c |+ |
++-----------------+--------------------+
+| xx | yyyyyyyyyyyyyyyyyy |
+| xxxx | yyyyyyyyyyyyyyyy |
+| xxxxxx : yyyyyyyyyyyyyy |
+| xxxxxxxx : yyyyyyyyyyyy |
+| xxxxxxxxxx : yyyyyyyyyy |
+| xxxxxxxxxxxx : yyyyyyyy |
+| xxxxxxxxxxxxxx : yyyyyy |
+| xxxxxxxxxxxxxxx : yyyy |
+| x : yy |
+| xxxxxxxxxxxxxxx : |
+| xxx |
+| xxxxxxxxxxxxxxx |
+| xxxxx |
++-----------------+--------------------+
+(2 rows)
+
+\pset expanded on
+\pset columns 20
+\pset border 0
+\pset format unaligned
+execute q;
+ab
+
+c|xx
+a
+bc|yyyyyyyyyyyyyyyyyy
+
+ab
+
+c|xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx
+a
+bc|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+\pset format aligned
+execute q;
+* Record 1
+ ab xx
++
++c
+ a yyyyyyyyyyyyyyyyyy
++bc
+* Record 2
+ ab xxxx
++ xxxxxx
++c xxxxxxxx
+ xxxxxxxxxx
+ xxxxxxxxxxxx
+ xxxxxxxxxxxxxx
+ xxxxxxxxxxxxxxxx
+ xxxxxxxxxxxxxxxxxx
+ xxxxxxxxxxxxxxxxxxxx
+ a yyyyyyyyyyyyyyyy
++bc yyyyyyyyyyyyyy
+ yyyyyyyyyyyy
+ yyyyyyyyyy
+ yyyyyyyy
+ yyyyyy
+ yyyy
+ yy
+
+
+\pset format wrapped
+execute q;
+* Record 1
+ ab xx
++
++c
+ a yyyyyyyyyyyyyyyy
++bc yy
+* Record 2
+ ab xxxx
++ xxxxxx
++c xxxxxxxx
+ xxxxxxxxxx
+ xxxxxxxxxxxx
+ xxxxxxxxxxxxxx
+ xxxxxxxxxxxxxxxx
+ xxxxxxxxxxxxxxxx
+ xx
+ xxxxxxxxxxxxxxxx
+ xxxx
+ a yyyyyyyyyyyyyyyy
++bc yyyyyyyyyyyyyy
+ yyyyyyyyyyyy
+ yyyyyyyyyy
+ yyyyyyyy
+ yyyyyy
+ yyyy
+ yy
+
+
+\pset border 1
+\pset format unaligned
+execute q;
+ab
+
+c|xx
+a
+bc|yyyyyyyyyyyyyyyyyy
+
+ab
+
+c|xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx
+a
+bc|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+\pset format aligned
+execute q;
+-[ RECORD 1 ]-------------
+ ab | xx
++ ;
++c ;
+ a | yyyyyyyyyyyyyyyyyy
++bc ;
+-[ RECORD 2 ]-------------
+ ab | xxxx
++ : xxxxxx
++c : xxxxxxxx
+ : xxxxxxxxxx
+ : xxxxxxxxxxxx
+ : xxxxxxxxxxxxxx
+ : xxxxxxxxxxxxxxxx
+ : xxxxxxxxxxxxxxxxxx
+ : xxxxxxxxxxxxxxxxxxxx
+ a | yyyyyyyyyyyyyyyy
++bc : yyyyyyyyyyyyyy
+ : yyyyyyyyyyyy
+ : yyyyyyyyyy
+ : yyyyyyyy
+ : yyyyyy
+ : yyyy
+ : yy
+ :
+
+\pset format wrapped
+execute q;
+-[ RECORD 1 ]-------
+ ab | xx
++ ;
++c ;
+ a | yyyyyyyyyyyyyy
++bc ; yyyy
+-[ RECORD 2 ]-------
+ ab | xxxx
++ : xxxxxx
++c : xxxxxxxx
+ : xxxxxxxxxx
+ : xxxxxxxxxxxx
+ : xxxxxxxxxxxxxx
+ : xxxxxxxxxxxxxx
+ ; xx
+ : xxxxxxxxxxxxxx
+ ; xxxx
+ : xxxxxxxxxxxxxx
+ ; xxxxxx
+ a | yyyyyyyyyyyyyy
++bc ; yy
+ : yyyyyyyyyyyyyy
+ : yyyyyyyyyyyy
+ : yyyyyyyyyy
+ : yyyyyyyy
+ : yyyyyy
+ : yyyy
+ : yy
+ :
+
+\pset border 2
+\pset format unaligned
+execute q;
+ab
+
+c|xx
+a
+bc|yyyyyyyyyyyyyyyyyy
+
+ab
+
+c|xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx
+a
+bc|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+\pset format aligned
+execute q;
++-[ RECORD 1 ]--------------+
+| ab | xx |
+|+ ; |
+|+c ; |
+| a | yyyyyyyyyyyyyyyyyy |
+|+bc ; |
++-[ RECORD 2 ]--------------+
+| ab | xxxx |
+|+ : xxxxxx |
+|+c : xxxxxxxx |
+| : xxxxxxxxxx |
+| : xxxxxxxxxxxx |
+| : xxxxxxxxxxxxxx |
+| : xxxxxxxxxxxxxxxx |
+| : xxxxxxxxxxxxxxxxxx |
+| : xxxxxxxxxxxxxxxxxxxx |
+| a | yyyyyyyyyyyyyyyy |
+|+bc : yyyyyyyyyyyyyy |
+| : yyyyyyyyyyyy |
+| : yyyyyyyyyy |
+| : yyyyyyyy |
+| : yyyyyy |
+| : yyyy |
+| : yy |
+| : |
++----+----------------------+
+
+\pset format wrapped
+execute q;
++-[ RECORD 1 ]-----+
+| ab | xx |
+|+ ; |
+|+c ; |
+| a | yyyyyyyyyyy |
+|+bc ; yyyyyyy |
++-[ RECORD 2 ]-----+
+| ab | xxxx |
+|+ : xxxxxx |
+|+c : xxxxxxxx |
+| : xxxxxxxxxx |
+| : xxxxxxxxxxx |
+| ; x |
+| : xxxxxxxxxxx |
+| ; xxx |
+| : xxxxxxxxxxx |
+| ; xxxxx |
+| : xxxxxxxxxxx |
+| ; xxxxxxx |
+| : xxxxxxxxxxx |
+| ; xxxxxxxxx |
+| a | yyyyyyyyyyy |
+|+bc ; yyyyy |
+| : yyyyyyyyyyy |
+| ; yyy |
+| : yyyyyyyyyyy |
+| ; y |
+| : yyyyyyyyyy |
+| : yyyyyyyy |
+| : yyyyyy |
+| : yyyy |
+| : yy |
+| : |
++----+-------------+
+
+deallocate q;
+-- test single-line header and data
+prepare q as select repeat('x',2*n) as "0123456789abcdef", repeat('y',20-2*n) as "0123456789" from generate_series(1,10) as n;
+\pset linestyle ascii
+\pset expanded off
+\pset columns 40
+\pset border 0
+\pset format unaligned
+execute q;
+0123456789abcdef|0123456789
+xx|yyyyyyyyyyyyyyyyyy
+xxxx|yyyyyyyyyyyyyyyy
+xxxxxx|yyyyyyyyyyyyyy
+xxxxxxxx|yyyyyyyyyyyy
+xxxxxxxxxx|yyyyyyyyyy
+xxxxxxxxxxxx|yyyyyyyy
+xxxxxxxxxxxxxx|yyyyyy
+xxxxxxxxxxxxxxxx|yyyy
+xxxxxxxxxxxxxxxxxx|yy
+xxxxxxxxxxxxxxxxxxxx|
+(10 rows)
+\pset format aligned
+execute q;
+ 0123456789abcdef 0123456789
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx yyyyyyyyyyyyyyyy
+xxxxxx yyyyyyyyyyyyyy
+xxxxxxxx yyyyyyyyyyyy
+xxxxxxxxxx yyyyyyyyyy
+xxxxxxxxxxxx yyyyyyyy
+xxxxxxxxxxxxxx yyyyyy
+xxxxxxxxxxxxxxxx yyyy
+xxxxxxxxxxxxxxxxxx yy
+xxxxxxxxxxxxxxxxxxxx
+(10 rows)
+
+\pset format wrapped
+execute q;
+ 0123456789abcdef 0123456789
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx yyyyyyyyyyyyyyyy
+xxxxxx yyyyyyyyyyyyyy
+xxxxxxxx yyyyyyyyyyyy
+xxxxxxxxxx yyyyyyyyyy
+xxxxxxxxxxxx yyyyyyyy
+xxxxxxxxxxxxxx yyyyyy
+xxxxxxxxxxxxxxxx yyyy
+xxxxxxxxxxxxxxxxxx yy
+xxxxxxxxxxxxxxxxxxxx
+(10 rows)
+
+\pset border 1
+\pset format unaligned
+execute q;
+0123456789abcdef|0123456789
+xx|yyyyyyyyyyyyyyyyyy
+xxxx|yyyyyyyyyyyyyyyy
+xxxxxx|yyyyyyyyyyyyyy
+xxxxxxxx|yyyyyyyyyyyy
+xxxxxxxxxx|yyyyyyyyyy
+xxxxxxxxxxxx|yyyyyyyy
+xxxxxxxxxxxxxx|yyyyyy
+xxxxxxxxxxxxxxxx|yyyy
+xxxxxxxxxxxxxxxxxx|yy
+xxxxxxxxxxxxxxxxxxxx|
+(10 rows)
+\pset format aligned
+execute q;
+ 0123456789abcdef | 0123456789
+----------------------+--------------------
+ xx | yyyyyyyyyyyyyyyyyy
+ xxxx | yyyyyyyyyyyyyyyy
+ xxxxxx | yyyyyyyyyyyyyy
+ xxxxxxxx | yyyyyyyyyyyy
+ xxxxxxxxxx | yyyyyyyyyy
+ xxxxxxxxxxxx | yyyyyyyy
+ xxxxxxxxxxxxxx | yyyyyy
+ xxxxxxxxxxxxxxxx | yyyy
+ xxxxxxxxxxxxxxxxxx | yy
+ xxxxxxxxxxxxxxxxxxxx |
+(10 rows)
+
+\pset format wrapped
+execute q;
+ 0123456789abcdef | 0123456789
+---------------------+------------------
+ xx | yyyyyyyyyyyyyyyy.
+ |.yy
+ xxxx | yyyyyyyyyyyyyyyy
+ xxxxxx | yyyyyyyyyyyyyy
+ xxxxxxxx | yyyyyyyyyyyy
+ xxxxxxxxxx | yyyyyyyyyy
+ xxxxxxxxxxxx | yyyyyyyy
+ xxxxxxxxxxxxxx | yyyyyy
+ xxxxxxxxxxxxxxxx | yyyy
+ xxxxxxxxxxxxxxxxxx | yy
+ xxxxxxxxxxxxxxxxxxx.|
+.x |
+(10 rows)
+
+\pset border 2
+\pset format unaligned
+execute q;
+0123456789abcdef|0123456789
+xx|yyyyyyyyyyyyyyyyyy
+xxxx|yyyyyyyyyyyyyyyy
+xxxxxx|yyyyyyyyyyyyyy
+xxxxxxxx|yyyyyyyyyyyy
+xxxxxxxxxx|yyyyyyyyyy
+xxxxxxxxxxxx|yyyyyyyy
+xxxxxxxxxxxxxx|yyyyyy
+xxxxxxxxxxxxxxxx|yyyy
+xxxxxxxxxxxxxxxxxx|yy
+xxxxxxxxxxxxxxxxxxxx|
+(10 rows)
+\pset format aligned
+execute q;
++----------------------+--------------------+
+| 0123456789abcdef | 0123456789 |
++----------------------+--------------------+
+| xx | yyyyyyyyyyyyyyyyyy |
+| xxxx | yyyyyyyyyyyyyyyy |
+| xxxxxx | yyyyyyyyyyyyyy |
+| xxxxxxxx | yyyyyyyyyyyy |
+| xxxxxxxxxx | yyyyyyyyyy |
+| xxxxxxxxxxxx | yyyyyyyy |
+| xxxxxxxxxxxxxx | yyyyyy |
+| xxxxxxxxxxxxxxxx | yyyy |
+| xxxxxxxxxxxxxxxxxx | yy |
+| xxxxxxxxxxxxxxxxxxxx | |
++----------------------+--------------------+
+(10 rows)
+
+\pset format wrapped
+execute q;
++--------------------+-----------------+
+| 0123456789abcdef | 0123456789 |
++--------------------+-----------------+
+| xx | yyyyyyyyyyyyyyy.|
+| |.yyy |
+| xxxx | yyyyyyyyyyyyyyy.|
+| |.y |
+| xxxxxx | yyyyyyyyyyyyyy |
+| xxxxxxxx | yyyyyyyyyyyy |
+| xxxxxxxxxx | yyyyyyyyyy |
+| xxxxxxxxxxxx | yyyyyyyy |
+| xxxxxxxxxxxxxx | yyyyyy |
+| xxxxxxxxxxxxxxxx | yyyy |
+| xxxxxxxxxxxxxxxxxx | yy |
+| xxxxxxxxxxxxxxxxxx.| |
+|.xx | |
++--------------------+-----------------+
+(10 rows)
+
+\pset expanded on
+\pset columns 30
+\pset border 0
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
+* Record 1
+0123456789abcdef xx
+0123456789 yyyyyyyyyyyyyyyyyy
+* Record 2
+0123456789abcdef xxxx
+0123456789 yyyyyyyyyyyyyyyy
+* Record 3
+0123456789abcdef xxxxxx
+0123456789 yyyyyyyyyyyyyy
+* Record 4
+0123456789abcdef xxxxxxxx
+0123456789 yyyyyyyyyyyy
+* Record 5
+0123456789abcdef xxxxxxxxxx
+0123456789 yyyyyyyyyy
+* Record 6
+0123456789abcdef xxxxxxxxxxxx
+0123456789 yyyyyyyy
+* Record 7
+0123456789abcdef xxxxxxxxxxxxxx
+0123456789 yyyyyy
+* Record 8
+0123456789abcdef xxxxxxxxxxxxxxxx
+0123456789 yyyy
+* Record 9
+0123456789abcdef xxxxxxxxxxxxxxxxxx
+0123456789 yy
+* Record 10
+0123456789abcdef xxxxxxxxxxxxxxxxxxxx
+0123456789
+
+\pset format wrapped
+execute q;
+* Record 1
+0123456789abcdef xx
+0123456789 yyyyyyyyyyyy.
+ .yyyyyy
+* Record 2
+0123456789abcdef xxxx
+0123456789 yyyyyyyyyyyy.
+ .yyyy
+* Record 3
+0123456789abcdef xxxxxx
+0123456789 yyyyyyyyyyyy.
+ .yy
+* Record 4
+0123456789abcdef xxxxxxxx
+0123456789 yyyyyyyyyyyy
+* Record 5
+0123456789abcdef xxxxxxxxxx
+0123456789 yyyyyyyyyy
+* Record 6
+0123456789abcdef xxxxxxxxxxxx
+0123456789 yyyyyyyy
+* Record 7
+0123456789abcdef xxxxxxxxxxxx.
+ .xx
+0123456789 yyyyyy
+* Record 8
+0123456789abcdef xxxxxxxxxxxx.
+ .xxxx
+0123456789 yyyy
+* Record 9
+0123456789abcdef xxxxxxxxxxxx.
+ .xxxxxx
+0123456789 yy
+* Record 10
+0123456789abcdef xxxxxxxxxxxx.
+ .xxxxxxxx
+0123456789
+
+\pset border 1
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
+-[ RECORD 1 ]----+---------------------
+0123456789abcdef | xx
+0123456789 | yyyyyyyyyyyyyyyyyy
+-[ RECORD 2 ]----+---------------------
+0123456789abcdef | xxxx
+0123456789 | yyyyyyyyyyyyyyyy
+-[ RECORD 3 ]----+---------------------
+0123456789abcdef | xxxxxx
+0123456789 | yyyyyyyyyyyyyy
+-[ RECORD 4 ]----+---------------------
+0123456789abcdef | xxxxxxxx
+0123456789 | yyyyyyyyyyyy
+-[ RECORD 5 ]----+---------------------
+0123456789abcdef | xxxxxxxxxx
+0123456789 | yyyyyyyyyy
+-[ RECORD 6 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxx
+0123456789 | yyyyyyyy
+-[ RECORD 7 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxx
+0123456789 | yyyyyy
+-[ RECORD 8 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxx
+0123456789 | yyyy
+-[ RECORD 9 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxx
+0123456789 | yy
+-[ RECORD 10 ]---+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxxxx
+0123456789 |
+
+\pset format wrapped
+execute q;
+-[ RECORD 1 ]----+-----------
+0123456789abcdef | xx
+0123456789 | yyyyyyyyyy.
+ |.yyyyyyyy
+-[ RECORD 2 ]----+-----------
+0123456789abcdef | xxxx
+0123456789 | yyyyyyyyyy.
+ |.yyyyyy
+-[ RECORD 3 ]----+-----------
+0123456789abcdef | xxxxxx
+0123456789 | yyyyyyyyyy.
+ |.yyyy
+-[ RECORD 4 ]----+-----------
+0123456789abcdef | xxxxxxxx
+0123456789 | yyyyyyyyyy.
+ |.yy
+-[ RECORD 5 ]----+-----------
+0123456789abcdef | xxxxxxxxxx
+0123456789 | yyyyyyyyyy
+-[ RECORD 6 ]----+-----------
+0123456789abcdef | xxxxxxxxxx.
+ |.xx
+0123456789 | yyyyyyyy
+-[ RECORD 7 ]----+-----------
+0123456789abcdef | xxxxxxxxxx.
+ |.xxxx
+0123456789 | yyyyyy
+-[ RECORD 8 ]----+-----------
+0123456789abcdef | xxxxxxxxxx.
+ |.xxxxxx
+0123456789 | yyyy
+-[ RECORD 9 ]----+-----------
+0123456789abcdef | xxxxxxxxxx.
+ |.xxxxxxxx
+0123456789 | yy
+-[ RECORD 10 ]---+-----------
+0123456789abcdef | xxxxxxxxxx.
+ |.xxxxxxxxxx
+0123456789 |
+
+\pset border 2
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
++-[ RECORD 1 ]-----+----------------------+
+| 0123456789abcdef | xx |
+| 0123456789 | yyyyyyyyyyyyyyyyyy |
++-[ RECORD 2 ]-----+----------------------+
+| 0123456789abcdef | xxxx |
+| 0123456789 | yyyyyyyyyyyyyyyy |
++-[ RECORD 3 ]-----+----------------------+
+| 0123456789abcdef | xxxxxx |
+| 0123456789 | yyyyyyyyyyyyyy |
++-[ RECORD 4 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxx |
+| 0123456789 | yyyyyyyyyyyy |
++-[ RECORD 5 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxx |
+| 0123456789 | yyyyyyyyyy |
++-[ RECORD 6 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxx |
+| 0123456789 | yyyyyyyy |
++-[ RECORD 7 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxx |
+| 0123456789 | yyyyyy |
++-[ RECORD 8 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxx |
+| 0123456789 | yyyy |
++-[ RECORD 9 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxxx |
+| 0123456789 | yy |
++-[ RECORD 10 ]----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxxxxx |
+| 0123456789 | |
++------------------+----------------------+
+
+\pset format wrapped
+execute q;
++-[ RECORD 1 ]-----+---------+
+| 0123456789abcdef | xx |
+| 0123456789 | yyyyyyy.|
+| |.yyyyyyy.|
+| |.yyyy |
++-[ RECORD 2 ]-----+---------+
+| 0123456789abcdef | xxxx |
+| 0123456789 | yyyyyyy.|
+| |.yyyyyyy.|
+| |.yy |
++-[ RECORD 3 ]-----+---------+
+| 0123456789abcdef | xxxxxx |
+| 0123456789 | yyyyyyy.|
+| |.yyyyyyy |
++-[ RECORD 4 ]-----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.x |
+| 0123456789 | yyyyyyy.|
+| |.yyyyy |
++-[ RECORD 5 ]-----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.xxx |
+| 0123456789 | yyyyyyy.|
+| |.yyy |
++-[ RECORD 6 ]-----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.xxxxx |
+| 0123456789 | yyyyyyy.|
+| |.y |
++-[ RECORD 7 ]-----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.xxxxxxx |
+| 0123456789 | yyyyyy |
++-[ RECORD 8 ]-----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.xxxxxxx.|
+| |.xx |
+| 0123456789 | yyyy |
++-[ RECORD 9 ]-----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.xxxxxxx.|
+| |.xxxx |
+| 0123456789 | yy |
++-[ RECORD 10 ]----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.xxxxxxx.|
+| |.xxxxxx |
+| 0123456789 | |
++------------------+---------+
+
+\pset expanded on
+\pset columns 20
+\pset border 0
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
+* Record 1
+0123456789abcdef xx
+0123456789 yyyyyyyyyyyyyyyyyy
+* Record 2
+0123456789abcdef xxxx
+0123456789 yyyyyyyyyyyyyyyy
+* Record 3
+0123456789abcdef xxxxxx
+0123456789 yyyyyyyyyyyyyy
+* Record 4
+0123456789abcdef xxxxxxxx
+0123456789 yyyyyyyyyyyy
+* Record 5
+0123456789abcdef xxxxxxxxxx
+0123456789 yyyyyyyyyy
+* Record 6
+0123456789abcdef xxxxxxxxxxxx
+0123456789 yyyyyyyy
+* Record 7
+0123456789abcdef xxxxxxxxxxxxxx
+0123456789 yyyyyy
+* Record 8
+0123456789abcdef xxxxxxxxxxxxxxxx
+0123456789 yyyy
+* Record 9
+0123456789abcdef xxxxxxxxxxxxxxxxxx
+0123456789 yy
+* Record 10
+0123456789abcdef xxxxxxxxxxxxxxxxxxxx
+0123456789
+
+\pset format wrapped
+execute q;
+* Record 1
+0123456789abcdef xx
+0123456789 yyy.
+ .yyy.
+ .yyy.
+ .yyy.
+ .yyy.
+ .yyy
+* Record 2
+0123456789abcdef xxx.
+ .x
+0123456789 yyy.
+ .yyy.
+ .yyy.
+ .yyy.
+ .yyy.
+ .y
+* Record 3
+0123456789abcdef xxx.
+ .xxx
+0123456789 yyy.
+ .yyy.
+ .yyy.
+ .yyy.
+ .yy
+* Record 4
+0123456789abcdef xxx.
+ .xxx.
+ .xx
+0123456789 yyy.
+ .yyy.
+ .yyy.
+ .yyy
+* Record 5
+0123456789abcdef xxx.
+ .xxx.
+ .xxx.
+ .x
+0123456789 yyy.
+ .yyy.
+ .yyy.
+ .y
+* Record 6
+0123456789abcdef xxx.
+ .xxx.
+ .xxx.
+ .xxx
+0123456789 yyy.
+ .yyy.
+ .yy
+* Record 7
+0123456789abcdef xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xx
+0123456789 yyy.
+ .yyy
+* Record 8
+0123456789abcdef xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .x
+0123456789 yyy.
+ .y
+* Record 9
+0123456789abcdef xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xxx
+0123456789 yy
+* Record 10
+0123456789abcdef xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xx
+0123456789
+
+\pset border 1
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
+-[ RECORD 1 ]----+---------------------
+0123456789abcdef | xx
+0123456789 | yyyyyyyyyyyyyyyyyy
+-[ RECORD 2 ]----+---------------------
+0123456789abcdef | xxxx
+0123456789 | yyyyyyyyyyyyyyyy
+-[ RECORD 3 ]----+---------------------
+0123456789abcdef | xxxxxx
+0123456789 | yyyyyyyyyyyyyy
+-[ RECORD 4 ]----+---------------------
+0123456789abcdef | xxxxxxxx
+0123456789 | yyyyyyyyyyyy
+-[ RECORD 5 ]----+---------------------
+0123456789abcdef | xxxxxxxxxx
+0123456789 | yyyyyyyyyy
+-[ RECORD 6 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxx
+0123456789 | yyyyyyyy
+-[ RECORD 7 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxx
+0123456789 | yyyyyy
+-[ RECORD 8 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxx
+0123456789 | yyyy
+-[ RECORD 9 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxx
+0123456789 | yy
+-[ RECORD 10 ]---+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxxxx
+0123456789 |
+
+\pset format wrapped
+execute q;
+-[ RECORD 1 ]----+----
+0123456789abcdef | xx
+0123456789 | yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy
+-[ RECORD 2 ]----+----
+0123456789abcdef | xxx.
+ |.x
+0123456789 | yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy.
+ |.y
+-[ RECORD 3 ]----+----
+0123456789abcdef | xxx.
+ |.xxx
+0123456789 | yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy.
+ |.yy
+-[ RECORD 4 ]----+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xx
+0123456789 | yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy
+-[ RECORD 5 ]----+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xxx.
+ |.x
+0123456789 | yyy.
+ |.yyy.
+ |.yyy.
+ |.y
+-[ RECORD 6 ]----+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx
+0123456789 | yyy.
+ |.yyy.
+ |.yy
+-[ RECORD 7 ]----+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xx
+0123456789 | yyy.
+ |.yyy
+-[ RECORD 8 ]----+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.x
+0123456789 | yyy.
+ |.y
+-[ RECORD 9 ]----+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx
+0123456789 | yy
+-[ RECORD 10 ]---+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xx
+0123456789 |
+
+\pset border 2
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
++-[ RECORD 1 ]-----+----------------------+
+| 0123456789abcdef | xx |
+| 0123456789 | yyyyyyyyyyyyyyyyyy |
++-[ RECORD 2 ]-----+----------------------+
+| 0123456789abcdef | xxxx |
+| 0123456789 | yyyyyyyyyyyyyyyy |
++-[ RECORD 3 ]-----+----------------------+
+| 0123456789abcdef | xxxxxx |
+| 0123456789 | yyyyyyyyyyyyyy |
++-[ RECORD 4 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxx |
+| 0123456789 | yyyyyyyyyyyy |
++-[ RECORD 5 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxx |
+| 0123456789 | yyyyyyyyyy |
++-[ RECORD 6 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxx |
+| 0123456789 | yyyyyyyy |
++-[ RECORD 7 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxx |
+| 0123456789 | yyyyyy |
++-[ RECORD 8 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxx |
+| 0123456789 | yyyy |
++-[ RECORD 9 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxxx |
+| 0123456789 | yy |
++-[ RECORD 10 ]----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxxxxx |
+| 0123456789 | |
++------------------+----------------------+
+
+\pset format wrapped
+execute q;
++-[ RECORD 1 ]-----+-----+
+| 0123456789abcdef | xx |
+| 0123456789 | yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy |
++-[ RECORD 2 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.x |
+| 0123456789 | yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.y |
++-[ RECORD 3 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx |
+| 0123456789 | yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yy |
++-[ RECORD 4 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xx |
+| 0123456789 | yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy |
++-[ RECORD 5 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.x |
+| 0123456789 | yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.y |
++-[ RECORD 6 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx |
+| 0123456789 | yyy.|
+| |.yyy.|
+| |.yy |
++-[ RECORD 7 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xx |
+| 0123456789 | yyy.|
+| |.yyy |
++-[ RECORD 8 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.x |
+| 0123456789 | yyy.|
+| |.y |
++-[ RECORD 9 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx |
+| 0123456789 | yy |
++-[ RECORD 10 ]----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xx |
+| 0123456789 | |
++------------------+-----+
+
+\pset linestyle old-ascii
+\pset expanded off
+\pset columns 40
+\pset border 0
+\pset format unaligned
+execute q;
+0123456789abcdef|0123456789
+xx|yyyyyyyyyyyyyyyyyy
+xxxx|yyyyyyyyyyyyyyyy
+xxxxxx|yyyyyyyyyyyyyy
+xxxxxxxx|yyyyyyyyyyyy
+xxxxxxxxxx|yyyyyyyyyy
+xxxxxxxxxxxx|yyyyyyyy
+xxxxxxxxxxxxxx|yyyyyy
+xxxxxxxxxxxxxxxx|yyyy
+xxxxxxxxxxxxxxxxxx|yy
+xxxxxxxxxxxxxxxxxxxx|
+(10 rows)
+\pset format aligned
+execute q;
+ 0123456789abcdef 0123456789
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx yyyyyyyyyyyyyyyy
+xxxxxx yyyyyyyyyyyyyy
+xxxxxxxx yyyyyyyyyyyy
+xxxxxxxxxx yyyyyyyyyy
+xxxxxxxxxxxx yyyyyyyy
+xxxxxxxxxxxxxx yyyyyy
+xxxxxxxxxxxxxxxx yyyy
+xxxxxxxxxxxxxxxxxx yy
+xxxxxxxxxxxxxxxxxxxx
+(10 rows)
+
+\pset format wrapped
+execute q;
+ 0123456789abcdef 0123456789
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx yyyyyyyyyyyyyyyy
+xxxxxx yyyyyyyyyyyyyy
+xxxxxxxx yyyyyyyyyyyy
+xxxxxxxxxx yyyyyyyyyy
+xxxxxxxxxxxx yyyyyyyy
+xxxxxxxxxxxxxx yyyyyy
+xxxxxxxxxxxxxxxx yyyy
+xxxxxxxxxxxxxxxxxx yy
+xxxxxxxxxxxxxxxxxxxx
+(10 rows)
+
+\pset border 1
+\pset format unaligned
+execute q;
+0123456789abcdef|0123456789
+xx|yyyyyyyyyyyyyyyyyy
+xxxx|yyyyyyyyyyyyyyyy
+xxxxxx|yyyyyyyyyyyyyy
+xxxxxxxx|yyyyyyyyyyyy
+xxxxxxxxxx|yyyyyyyyyy
+xxxxxxxxxxxx|yyyyyyyy
+xxxxxxxxxxxxxx|yyyyyy
+xxxxxxxxxxxxxxxx|yyyy
+xxxxxxxxxxxxxxxxxx|yy
+xxxxxxxxxxxxxxxxxxxx|
+(10 rows)
+\pset format aligned
+execute q;
+ 0123456789abcdef | 0123456789
+----------------------+--------------------
+ xx | yyyyyyyyyyyyyyyyyy
+ xxxx | yyyyyyyyyyyyyyyy
+ xxxxxx | yyyyyyyyyyyyyy
+ xxxxxxxx | yyyyyyyyyyyy
+ xxxxxxxxxx | yyyyyyyyyy
+ xxxxxxxxxxxx | yyyyyyyy
+ xxxxxxxxxxxxxx | yyyyyy
+ xxxxxxxxxxxxxxxx | yyyy
+ xxxxxxxxxxxxxxxxxx | yy
+ xxxxxxxxxxxxxxxxxxxx |
+(10 rows)
+
+\pset format wrapped
+execute q;
+ 0123456789abcdef | 0123456789
+---------------------+------------------
+ xx | yyyyyyyyyyyyyyyy
+ ; yy
+ xxxx | yyyyyyyyyyyyyyyy
+ xxxxxx | yyyyyyyyyyyyyy
+ xxxxxxxx | yyyyyyyyyyyy
+ xxxxxxxxxx | yyyyyyyyyy
+ xxxxxxxxxxxx | yyyyyyyy
+ xxxxxxxxxxxxxx | yyyyyy
+ xxxxxxxxxxxxxxxx | yyyy
+ xxxxxxxxxxxxxxxxxx | yy
+ xxxxxxxxxxxxxxxxxxx |
+ x
+(10 rows)
+
+\pset border 2
+\pset format unaligned
+execute q;
+0123456789abcdef|0123456789
+xx|yyyyyyyyyyyyyyyyyy
+xxxx|yyyyyyyyyyyyyyyy
+xxxxxx|yyyyyyyyyyyyyy
+xxxxxxxx|yyyyyyyyyyyy
+xxxxxxxxxx|yyyyyyyyyy
+xxxxxxxxxxxx|yyyyyyyy
+xxxxxxxxxxxxxx|yyyyyy
+xxxxxxxxxxxxxxxx|yyyy
+xxxxxxxxxxxxxxxxxx|yy
+xxxxxxxxxxxxxxxxxxxx|
+(10 rows)
+\pset format aligned
+execute q;
++----------------------+--------------------+
+| 0123456789abcdef | 0123456789 |
++----------------------+--------------------+
+| xx | yyyyyyyyyyyyyyyyyy |
+| xxxx | yyyyyyyyyyyyyyyy |
+| xxxxxx | yyyyyyyyyyyyyy |
+| xxxxxxxx | yyyyyyyyyyyy |
+| xxxxxxxxxx | yyyyyyyyyy |
+| xxxxxxxxxxxx | yyyyyyyy |
+| xxxxxxxxxxxxxx | yyyyyy |
+| xxxxxxxxxxxxxxxx | yyyy |
+| xxxxxxxxxxxxxxxxxx | yy |
+| xxxxxxxxxxxxxxxxxxxx | |
++----------------------+--------------------+
+(10 rows)
+
+\pset format wrapped
+execute q;
++--------------------+-----------------+
+| 0123456789abcdef | 0123456789 |
++--------------------+-----------------+
+| xx | yyyyyyyyyyyyyyy |
+| ; yyy |
+| xxxx | yyyyyyyyyyyyyyy |
+| ; y |
+| xxxxxx | yyyyyyyyyyyyyy |
+| xxxxxxxx | yyyyyyyyyyyy |
+| xxxxxxxxxx | yyyyyyyyyy |
+| xxxxxxxxxxxx | yyyyyyyy |
+| xxxxxxxxxxxxxx | yyyyyy |
+| xxxxxxxxxxxxxxxx | yyyy |
+| xxxxxxxxxxxxxxxxxx | yy |
+| xxxxxxxxxxxxxxxxxx | |
+| xx |
++--------------------+-----------------+
+(10 rows)
+
+\pset expanded on
+\pset border 0
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
+* Record 1
+0123456789abcdef xx
+0123456789 yyyyyyyyyyyyyyyyyy
+* Record 2
+0123456789abcdef xxxx
+0123456789 yyyyyyyyyyyyyyyy
+* Record 3
+0123456789abcdef xxxxxx
+0123456789 yyyyyyyyyyyyyy
+* Record 4
+0123456789abcdef xxxxxxxx
+0123456789 yyyyyyyyyyyy
+* Record 5
+0123456789abcdef xxxxxxxxxx
+0123456789 yyyyyyyyyy
+* Record 6
+0123456789abcdef xxxxxxxxxxxx
+0123456789 yyyyyyyy
+* Record 7
+0123456789abcdef xxxxxxxxxxxxxx
+0123456789 yyyyyy
+* Record 8
+0123456789abcdef xxxxxxxxxxxxxxxx
+0123456789 yyyy
+* Record 9
+0123456789abcdef xxxxxxxxxxxxxxxxxx
+0123456789 yy
+* Record 10
+0123456789abcdef xxxxxxxxxxxxxxxxxxxx
+0123456789
+
+\pset format wrapped
+execute q;
+* Record 1
+0123456789abcdef xx
+0123456789 yyyyyyyyyyyyyyyyyy
+* Record 2
+0123456789abcdef xxxx
+0123456789 yyyyyyyyyyyyyyyy
+* Record 3
+0123456789abcdef xxxxxx
+0123456789 yyyyyyyyyyyyyy
+* Record 4
+0123456789abcdef xxxxxxxx
+0123456789 yyyyyyyyyyyy
+* Record 5
+0123456789abcdef xxxxxxxxxx
+0123456789 yyyyyyyyyy
+* Record 6
+0123456789abcdef xxxxxxxxxxxx
+0123456789 yyyyyyyy
+* Record 7
+0123456789abcdef xxxxxxxxxxxxxx
+0123456789 yyyyyy
+* Record 8
+0123456789abcdef xxxxxxxxxxxxxxxx
+0123456789 yyyy
+* Record 9
+0123456789abcdef xxxxxxxxxxxxxxxxxx
+0123456789 yy
+* Record 10
+0123456789abcdef xxxxxxxxxxxxxxxxxxxx
+0123456789
+
+\pset border 1
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
+-[ RECORD 1 ]----+---------------------
+0123456789abcdef | xx
+0123456789 | yyyyyyyyyyyyyyyyyy
+-[ RECORD 2 ]----+---------------------
+0123456789abcdef | xxxx
+0123456789 | yyyyyyyyyyyyyyyy
+-[ RECORD 3 ]----+---------------------
+0123456789abcdef | xxxxxx
+0123456789 | yyyyyyyyyyyyyy
+-[ RECORD 4 ]----+---------------------
+0123456789abcdef | xxxxxxxx
+0123456789 | yyyyyyyyyyyy
+-[ RECORD 5 ]----+---------------------
+0123456789abcdef | xxxxxxxxxx
+0123456789 | yyyyyyyyyy
+-[ RECORD 6 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxx
+0123456789 | yyyyyyyy
+-[ RECORD 7 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxx
+0123456789 | yyyyyy
+-[ RECORD 8 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxx
+0123456789 | yyyy
+-[ RECORD 9 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxx
+0123456789 | yy
+-[ RECORD 10 ]---+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxxxx
+0123456789 |
+
+\pset format wrapped
+execute q;
+-[ RECORD 1 ]----+---------------------
+0123456789abcdef | xx
+0123456789 | yyyyyyyyyyyyyyyyyy
+-[ RECORD 2 ]----+---------------------
+0123456789abcdef | xxxx
+0123456789 | yyyyyyyyyyyyyyyy
+-[ RECORD 3 ]----+---------------------
+0123456789abcdef | xxxxxx
+0123456789 | yyyyyyyyyyyyyy
+-[ RECORD 4 ]----+---------------------
+0123456789abcdef | xxxxxxxx
+0123456789 | yyyyyyyyyyyy
+-[ RECORD 5 ]----+---------------------
+0123456789abcdef | xxxxxxxxxx
+0123456789 | yyyyyyyyyy
+-[ RECORD 6 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxx
+0123456789 | yyyyyyyy
+-[ RECORD 7 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxx
+0123456789 | yyyyyy
+-[ RECORD 8 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxx
+0123456789 | yyyy
+-[ RECORD 9 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxx
+0123456789 | yy
+-[ RECORD 10 ]---+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxxxx
+0123456789 |
+
+\pset border 2
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
++-[ RECORD 1 ]-----+----------------------+
+| 0123456789abcdef | xx |
+| 0123456789 | yyyyyyyyyyyyyyyyyy |
++-[ RECORD 2 ]-----+----------------------+
+| 0123456789abcdef | xxxx |
+| 0123456789 | yyyyyyyyyyyyyyyy |
++-[ RECORD 3 ]-----+----------------------+
+| 0123456789abcdef | xxxxxx |
+| 0123456789 | yyyyyyyyyyyyyy |
++-[ RECORD 4 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxx |
+| 0123456789 | yyyyyyyyyyyy |
++-[ RECORD 5 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxx |
+| 0123456789 | yyyyyyyyyy |
++-[ RECORD 6 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxx |
+| 0123456789 | yyyyyyyy |
++-[ RECORD 7 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxx |
+| 0123456789 | yyyyyy |
++-[ RECORD 8 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxx |
+| 0123456789 | yyyy |
++-[ RECORD 9 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxxx |
+| 0123456789 | yy |
++-[ RECORD 10 ]----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxxxxx |
+| 0123456789 | |
++------------------+----------------------+
+
+\pset format wrapped
+execute q;
++-[ RECORD 1 ]-----+-------------------+
+| 0123456789abcdef | xx |
+| 0123456789 | yyyyyyyyyyyyyyyyy |
+| ; y |
++-[ RECORD 2 ]-----+-------------------+
+| 0123456789abcdef | xxxx |
+| 0123456789 | yyyyyyyyyyyyyyyy |
++-[ RECORD 3 ]-----+-------------------+
+| 0123456789abcdef | xxxxxx |
+| 0123456789 | yyyyyyyyyyyyyy |
++-[ RECORD 4 ]-----+-------------------+
+| 0123456789abcdef | xxxxxxxx |
+| 0123456789 | yyyyyyyyyyyy |
++-[ RECORD 5 ]-----+-------------------+
+| 0123456789abcdef | xxxxxxxxxx |
+| 0123456789 | yyyyyyyyyy |
++-[ RECORD 6 ]-----+-------------------+
+| 0123456789abcdef | xxxxxxxxxxxx |
+| 0123456789 | yyyyyyyy |
++-[ RECORD 7 ]-----+-------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxx |
+| 0123456789 | yyyyyy |
++-[ RECORD 8 ]-----+-------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxx |
+| 0123456789 | yyyy |
++-[ RECORD 9 ]-----+-------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxx |
+| ; x |
+| 0123456789 | yy |
++-[ RECORD 10 ]----+-------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxx |
+| ; xxx |
+| 0123456789 | |
++------------------+-------------------+
+
+deallocate q;
+\pset linestyle ascii
+\pset border 1
+-- support table for output-format tests (useful to create a footer)
+create table psql_serial_tab (id serial);
+-- test header/footer/tuples_only behavior in aligned/unaligned/wrapped cases
+\pset format aligned
+\pset expanded off
+\d psql_serial_tab_id_seq
+ Sequence "public.psql_serial_tab_id_seq"
+ Type | Start | Minimum | Maximum | Increment | Cycles? | Cache
+---------+-------+---------+------------+-----------+---------+-------
+ integer | 1 | 1 | 2147483647 | 1 | no | 1
+Owned by: public.psql_serial_tab.id
+
+\pset tuples_only true
+\df exp
+ pg_catalog | exp | double precision | double precision | func
+ pg_catalog | exp | numeric | numeric | func
+
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+Sequence "public.psql_serial_tab_id_seq"
+-[ RECORD 1 ]---------
+Type | integer
+Start | 1
+Minimum | 1
+Maximum | 2147483647
+Increment | 1
+Cycles? | no
+Cache | 1
+
+Owned by: public.psql_serial_tab.id
+
+\pset tuples_only true
+\df exp
+Schema | pg_catalog
+Name | exp
+Result data type | double precision
+Argument data types | double precision
+Type | func
+--------------------+-----------------
+Schema | pg_catalog
+Name | exp
+Result data type | numeric
+Argument data types | numeric
+Type | func
+
+\pset tuples_only false
+-- empty table is a special case for this format
+select 1 where false;
+(0 rows)
+
+\pset format unaligned
+\pset expanded off
+\d psql_serial_tab_id_seq
+Sequence "public.psql_serial_tab_id_seq"
+Type|Start|Minimum|Maximum|Increment|Cycles?|Cache
+integer|1|1|2147483647|1|no|1
+Owned by: public.psql_serial_tab.id
+\pset tuples_only true
+\df exp
+pg_catalog|exp|double precision|double precision|func
+pg_catalog|exp|numeric|numeric|func
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+Sequence "public.psql_serial_tab_id_seq"
+
+Type|integer
+Start|1
+Minimum|1
+Maximum|2147483647
+Increment|1
+Cycles?|no
+Cache|1
+
+Owned by: public.psql_serial_tab.id
+\pset tuples_only true
+\df exp
+Schema|pg_catalog
+Name|exp
+Result data type|double precision
+Argument data types|double precision
+Type|func
+
+Schema|pg_catalog
+Name|exp
+Result data type|numeric
+Argument data types|numeric
+Type|func
+\pset tuples_only false
+\pset format wrapped
+\pset expanded off
+\d psql_serial_tab_id_seq
+ Sequence "public.psql_serial_tab_id_seq"
+ Type | Start | Minimum | Maximum | Increment | Cycles? | Cache
+---------+-------+---------+------------+-----------+---------+-------
+ integer | 1 | 1 | 2147483647 | 1 | no | 1
+Owned by: public.psql_serial_tab.id
+
+\pset tuples_only true
+\df exp
+ pg_catalog | exp | double precision | double precision | func
+ pg_catalog | exp | numeric | numeric | func
+
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+Sequence "public.psql_serial_tab_id_seq"
+-[ RECORD 1 ]---------
+Type | integer
+Start | 1
+Minimum | 1
+Maximum | 2147483647
+Increment | 1
+Cycles? | no
+Cache | 1
+
+Owned by: public.psql_serial_tab.id
+
+\pset tuples_only true
+\df exp
+Schema | pg_catalog
+Name | exp
+Result data type | double precision
+Argument data types | double precision
+Type | func
+--------------------+-----------------
+Schema | pg_catalog
+Name | exp
+Result data type | numeric
+Argument data types | numeric
+Type | func
+
+\pset tuples_only false
+-- check conditional tableam display
+-- Create a heap2 table am handler with heapam handler
+CREATE ACCESS METHOD heap_psql TYPE TABLE HANDLER heap_tableam_handler;
+CREATE TABLE tbl_heap_psql(f1 int, f2 char(100)) using heap_psql;
+CREATE TABLE tbl_heap(f1 int, f2 char(100)) using heap;
+\d+ tbl_heap_psql
+ Table "public.tbl_heap_psql"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+--------+----------------+-----------+----------+---------+----------+--------------+-------------
+ f1 | integer | | | | plain | |
+ f2 | character(100) | | | | extended | |
+
+\d+ tbl_heap
+ Table "public.tbl_heap"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+--------+----------------+-----------+----------+---------+----------+--------------+-------------
+ f1 | integer | | | | plain | |
+ f2 | character(100) | | | | extended | |
+
+\set HIDE_TABLEAM off
+\d+ tbl_heap_psql
+ Table "public.tbl_heap_psql"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+--------+----------------+-----------+----------+---------+----------+--------------+-------------
+ f1 | integer | | | | plain | |
+ f2 | character(100) | | | | extended | |
+Access method: heap_psql
+
+\d+ tbl_heap
+ Table "public.tbl_heap"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+--------+----------------+-----------+----------+---------+----------+--------------+-------------
+ f1 | integer | | | | plain | |
+ f2 | character(100) | | | | extended | |
+Access method: heap
+
+\set HIDE_TABLEAM on
+DROP TABLE tbl_heap, tbl_heap_psql;
+DROP ACCESS METHOD heap_psql;
+-- test numericlocale (as best we can without control of psql's locale)
+\pset format aligned
+\pset expanded off
+\pset numericlocale true
+select n, -n as m, n * 111 as x, '1e90'::float8 as f
+from generate_series(0,3) n;
+ n | m | x | f
+---+----+-----+-------
+ 0 | 0 | 0 | 1e+90
+ 1 | -1 | 111 | 1e+90
+ 2 | -2 | 222 | 1e+90
+ 3 | -3 | 333 | 1e+90
+(4 rows)
+
+\pset numericlocale false
+-- test asciidoc output format
+\pset format asciidoc
+\pset border 1
+\pset expanded off
+\d psql_serial_tab_id_seq
+
+.Sequence "public.psql_serial_tab_id_seq"
+[options="header",cols="<l,>l,>l,>l,>l,<l,>l",frame="none"]
+|====
+^l|Type ^l|Start ^l|Minimum ^l|Maximum ^l|Increment ^l|Cycles? ^l|Cache
+|integer |1 |1 |2147483647 |1 |no |1
+|====
+
+....
+Owned by: public.psql_serial_tab.id
+....
+\pset tuples_only true
+\df exp
+
+[cols="<l,<l,<l,<l,<l",frame="none"]
+|====
+|pg_catalog |exp |double precision |double precision |func
+|pg_catalog |exp |numeric |numeric |func
+|====
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+
+.Sequence "public.psql_serial_tab_id_seq"
+[cols="h,l",frame="none"]
+|====
+2+^|Record 1
+<l|Type <l|integer
+<l|Start >l|1
+<l|Minimum >l|1
+<l|Maximum >l|2147483647
+<l|Increment >l|1
+<l|Cycles? <l|no
+<l|Cache >l|1
+|====
+
+....
+Owned by: public.psql_serial_tab.id
+....
+\pset tuples_only true
+\df exp
+
+[cols="h,l",frame="none"]
+|====
+2+|
+<l|Schema <l|pg_catalog
+<l|Name <l|exp
+<l|Result data type <l|double precision
+<l|Argument data types <l|double precision
+<l|Type <l|func
+2+|
+<l|Schema <l|pg_catalog
+<l|Name <l|exp
+<l|Result data type <l|numeric
+<l|Argument data types <l|numeric
+<l|Type <l|func
+|====
+\pset tuples_only false
+prepare q as
+ select 'some|text' as "a|title", ' ' as "empty ", n as int
+ from generate_series(1,2) as n;
+\pset expanded off
+\pset border 0
+execute q;
+
+[options="header",cols="<l,<l,>l",frame="none",grid="none"]
+|====
+^l|a\|title ^l|empty ^l|int
+|some\|text | |1
+|some\|text | |2
+|====
+
+....
+(2 rows)
+....
+\pset border 1
+execute q;
+
+[options="header",cols="<l,<l,>l",frame="none"]
+|====
+^l|a\|title ^l|empty ^l|int
+|some\|text | |1
+|some\|text | |2
+|====
+
+....
+(2 rows)
+....
+\pset border 2
+execute q;
+
+[options="header",cols="<l,<l,>l",frame="all",grid="all"]
+|====
+^l|a\|title ^l|empty ^l|int
+|some\|text | |1
+|some\|text | |2
+|====
+
+....
+(2 rows)
+....
+\pset expanded on
+\pset border 0
+execute q;
+
+[cols="h,l",frame="none",grid="none"]
+|====
+2+^|Record 1
+<l|a\|title <l|some\|text
+<l|empty <l|
+<l|int >l|1
+2+^|Record 2
+<l|a\|title <l|some\|text
+<l|empty <l|
+<l|int >l|2
+|====
+\pset border 1
+execute q;
+
+[cols="h,l",frame="none"]
+|====
+2+^|Record 1
+<l|a\|title <l|some\|text
+<l|empty <l|
+<l|int >l|1
+2+^|Record 2
+<l|a\|title <l|some\|text
+<l|empty <l|
+<l|int >l|2
+|====
+\pset border 2
+execute q;
+
+[cols="h,l",frame="all",grid="all"]
+|====
+2+^|Record 1
+<l|a\|title <l|some\|text
+<l|empty <l|
+<l|int >l|1
+2+^|Record 2
+<l|a\|title <l|some\|text
+<l|empty <l|
+<l|int >l|2
+|====
+deallocate q;
+-- test csv output format
+\pset format csv
+\pset border 1
+\pset expanded off
+\d psql_serial_tab_id_seq
+Type,Start,Minimum,Maximum,Increment,Cycles?,Cache
+integer,1,1,2147483647,1,no,1
+\pset tuples_only true
+\df exp
+pg_catalog,exp,double precision,double precision,func
+pg_catalog,exp,numeric,numeric,func
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+Type,integer
+Start,1
+Minimum,1
+Maximum,2147483647
+Increment,1
+Cycles?,no
+Cache,1
+\pset tuples_only true
+\df exp
+Schema,pg_catalog
+Name,exp
+Result data type,double precision
+Argument data types,double precision
+Type,func
+Schema,pg_catalog
+Name,exp
+Result data type,numeric
+Argument data types,numeric
+Type,func
+\pset tuples_only false
+prepare q as
+ select 'some"text' as "a""title", E' <foo>\n<bar>' as "junk",
+ ' ' as "empty", n as int
+ from generate_series(1,2) as n;
+\pset expanded off
+execute q;
+"a""title",junk,empty,int
+"some""text"," <foo>
+<bar>", ,1
+"some""text"," <foo>
+<bar>", ,2
+\pset expanded on
+execute q;
+"a""title","some""text"
+junk," <foo>
+<bar>"
+empty,
+int,1
+"a""title","some""text"
+junk," <foo>
+<bar>"
+empty,
+int,2
+deallocate q;
+-- special cases
+\pset expanded off
+select 'comma,comma' as comma, 'semi;semi' as semi;
+comma,semi
+"comma,comma",semi;semi
+\pset csv_fieldsep ';'
+select 'comma,comma' as comma, 'semi;semi' as semi;
+comma;semi
+comma,comma;"semi;semi"
+select '\.' as data;
+data
+"\."
+\pset csv_fieldsep '.'
+select '\' as d1, '' as d2;
+"d1"."d2"
+"\".""
+-- illegal csv separators
+\pset csv_fieldsep ''
+\pset: csv_fieldsep must be a single one-byte character
+\pset csv_fieldsep '\0'
+\pset: csv_fieldsep must be a single one-byte character
+\pset csv_fieldsep '\n'
+\pset: csv_fieldsep cannot be a double quote, a newline, or a carriage return
+\pset csv_fieldsep '\r'
+\pset: csv_fieldsep cannot be a double quote, a newline, or a carriage return
+\pset csv_fieldsep '"'
+\pset: csv_fieldsep cannot be a double quote, a newline, or a carriage return
+\pset csv_fieldsep ',,'
+\pset: csv_fieldsep must be a single one-byte character
+\pset csv_fieldsep ','
+-- test html output format
+\pset format html
+\pset border 1
+\pset expanded off
+\d psql_serial_tab_id_seq
+<table border="1">
+ <caption>Sequence "public.psql_serial_tab_id_seq"</caption>
+ <tr>
+ <th align="center">Type</th>
+ <th align="center">Start</th>
+ <th align="center">Minimum</th>
+ <th align="center">Maximum</th>
+ <th align="center">Increment</th>
+ <th align="center">Cycles?</th>
+ <th align="center">Cache</th>
+ </tr>
+ <tr valign="top">
+ <td align="left">integer</td>
+ <td align="right">1</td>
+ <td align="right">1</td>
+ <td align="right">2147483647</td>
+ <td align="right">1</td>
+ <td align="left">no</td>
+ <td align="right">1</td>
+ </tr>
+</table>
+<p>Owned by: public.psql_serial_tab.id<br />
+</p>
+\pset tuples_only true
+\df exp
+<table border="1">
+ <tr valign="top">
+ <td align="left">pg_catalog</td>
+ <td align="left">exp</td>
+ <td align="left">double precision</td>
+ <td align="left">double precision</td>
+ <td align="left">func</td>
+ </tr>
+ <tr valign="top">
+ <td align="left">pg_catalog</td>
+ <td align="left">exp</td>
+ <td align="left">numeric</td>
+ <td align="left">numeric</td>
+ <td align="left">func</td>
+ </tr>
+</table>
+
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+<table border="1">
+ <caption>Sequence "public.psql_serial_tab_id_seq"</caption>
+
+ <tr><td colspan="2" align="center">Record 1</td></tr>
+ <tr valign="top">
+ <th>Type</th>
+ <td align="left">integer</td>
+ </tr>
+ <tr valign="top">
+ <th>Start</th>
+ <td align="right">1</td>
+ </tr>
+ <tr valign="top">
+ <th>Minimum</th>
+ <td align="right">1</td>
+ </tr>
+ <tr valign="top">
+ <th>Maximum</th>
+ <td align="right">2147483647</td>
+ </tr>
+ <tr valign="top">
+ <th>Increment</th>
+ <td align="right">1</td>
+ </tr>
+ <tr valign="top">
+ <th>Cycles?</th>
+ <td align="left">no</td>
+ </tr>
+ <tr valign="top">
+ <th>Cache</th>
+ <td align="right">1</td>
+ </tr>
+</table>
+<p>Owned by: public.psql_serial_tab.id<br />
+</p>
+\pset tuples_only true
+\df exp
+<table border="1">
+
+ <tr><td colspan="2"> </td></tr>
+ <tr valign="top">
+ <th>Schema</th>
+ <td align="left">pg_catalog</td>
+ </tr>
+ <tr valign="top">
+ <th>Name</th>
+ <td align="left">exp</td>
+ </tr>
+ <tr valign="top">
+ <th>Result data type</th>
+ <td align="left">double precision</td>
+ </tr>
+ <tr valign="top">
+ <th>Argument data types</th>
+ <td align="left">double precision</td>
+ </tr>
+ <tr valign="top">
+ <th>Type</th>
+ <td align="left">func</td>
+ </tr>
+
+ <tr><td colspan="2"> </td></tr>
+ <tr valign="top">
+ <th>Schema</th>
+ <td align="left">pg_catalog</td>
+ </tr>
+ <tr valign="top">
+ <th>Name</th>
+ <td align="left">exp</td>
+ </tr>
+ <tr valign="top">
+ <th>Result data type</th>
+ <td align="left">numeric</td>
+ </tr>
+ <tr valign="top">
+ <th>Argument data types</th>
+ <td align="left">numeric</td>
+ </tr>
+ <tr valign="top">
+ <th>Type</th>
+ <td align="left">func</td>
+ </tr>
+</table>
+
+\pset tuples_only false
+prepare q as
+ select 'some"text' as "a&title", E' <foo>\n<bar>' as "junk",
+ ' ' as "empty", n as int
+ from generate_series(1,2) as n;
+\pset expanded off
+\pset border 0
+execute q;
+<table border="0">
+ <tr>
+ <th align="center">a&title</th>
+ <th align="center">junk</th>
+ <th align="center">empty</th>
+ <th align="center">int</th>
+ </tr>
+ <tr valign="top">
+ <td align="left">some"text</td>
+ <td align="left"> <foo><br />
+<bar></td>
+ <td align="left"> </td>
+ <td align="right">1</td>
+ </tr>
+ <tr valign="top">
+ <td align="left">some"text</td>
+ <td align="left"> <foo><br />
+<bar></td>
+ <td align="left"> </td>
+ <td align="right">2</td>
+ </tr>
+</table>
+<p>(2 rows)<br />
+</p>
+\pset border 1
+execute q;
+<table border="1">
+ <tr>
+ <th align="center">a&title</th>
+ <th align="center">junk</th>
+ <th align="center">empty</th>
+ <th align="center">int</th>
+ </tr>
+ <tr valign="top">
+ <td align="left">some"text</td>
+ <td align="left"> <foo><br />
+<bar></td>
+ <td align="left"> </td>
+ <td align="right">1</td>
+ </tr>
+ <tr valign="top">
+ <td align="left">some"text</td>
+ <td align="left"> <foo><br />
+<bar></td>
+ <td align="left"> </td>
+ <td align="right">2</td>
+ </tr>
+</table>
+<p>(2 rows)<br />
+</p>
+\pset tableattr foobar
+execute q;
+<table border="1" foobar>
+ <tr>
+ <th align="center">a&title</th>
+ <th align="center">junk</th>
+ <th align="center">empty</th>
+ <th align="center">int</th>
+ </tr>
+ <tr valign="top">
+ <td align="left">some"text</td>
+ <td align="left"> <foo><br />
+<bar></td>
+ <td align="left"> </td>
+ <td align="right">1</td>
+ </tr>
+ <tr valign="top">
+ <td align="left">some"text</td>
+ <td align="left"> <foo><br />
+<bar></td>
+ <td align="left"> </td>
+ <td align="right">2</td>
+ </tr>
+</table>
+<p>(2 rows)<br />
+</p>
+\pset tableattr
+\pset expanded on
+\pset border 0
+execute q;
+<table border="0">
+
+ <tr><td colspan="2" align="center">Record 1</td></tr>
+ <tr valign="top">
+ <th>a&title</th>
+ <td align="left">some"text</td>
+ </tr>
+ <tr valign="top">
+ <th>junk</th>
+ <td align="left"> <foo><br />
+<bar></td>
+ </tr>
+ <tr valign="top">
+ <th>empty</th>
+ <td align="left"> </td>
+ </tr>
+ <tr valign="top">
+ <th>int</th>
+ <td align="right">1</td>
+ </tr>
+
+ <tr><td colspan="2" align="center">Record 2</td></tr>
+ <tr valign="top">
+ <th>a&title</th>
+ <td align="left">some"text</td>
+ </tr>
+ <tr valign="top">
+ <th>junk</th>
+ <td align="left"> <foo><br />
+<bar></td>
+ </tr>
+ <tr valign="top">
+ <th>empty</th>
+ <td align="left"> </td>
+ </tr>
+ <tr valign="top">
+ <th>int</th>
+ <td align="right">2</td>
+ </tr>
+</table>
+
+\pset border 1
+execute q;
+<table border="1">
+
+ <tr><td colspan="2" align="center">Record 1</td></tr>
+ <tr valign="top">
+ <th>a&title</th>
+ <td align="left">some"text</td>
+ </tr>
+ <tr valign="top">
+ <th>junk</th>
+ <td align="left"> <foo><br />
+<bar></td>
+ </tr>
+ <tr valign="top">
+ <th>empty</th>
+ <td align="left"> </td>
+ </tr>
+ <tr valign="top">
+ <th>int</th>
+ <td align="right">1</td>
+ </tr>
+
+ <tr><td colspan="2" align="center">Record 2</td></tr>
+ <tr valign="top">
+ <th>a&title</th>
+ <td align="left">some"text</td>
+ </tr>
+ <tr valign="top">
+ <th>junk</th>
+ <td align="left"> <foo><br />
+<bar></td>
+ </tr>
+ <tr valign="top">
+ <th>empty</th>
+ <td align="left"> </td>
+ </tr>
+ <tr valign="top">
+ <th>int</th>
+ <td align="right">2</td>
+ </tr>
+</table>
+
+\pset tableattr foobar
+execute q;
+<table border="1" foobar>
+
+ <tr><td colspan="2" align="center">Record 1</td></tr>
+ <tr valign="top">
+ <th>a&title</th>
+ <td align="left">some"text</td>
+ </tr>
+ <tr valign="top">
+ <th>junk</th>
+ <td align="left"> <foo><br />
+<bar></td>
+ </tr>
+ <tr valign="top">
+ <th>empty</th>
+ <td align="left"> </td>
+ </tr>
+ <tr valign="top">
+ <th>int</th>
+ <td align="right">1</td>
+ </tr>
+
+ <tr><td colspan="2" align="center">Record 2</td></tr>
+ <tr valign="top">
+ <th>a&title</th>
+ <td align="left">some"text</td>
+ </tr>
+ <tr valign="top">
+ <th>junk</th>
+ <td align="left"> <foo><br />
+<bar></td>
+ </tr>
+ <tr valign="top">
+ <th>empty</th>
+ <td align="left"> </td>
+ </tr>
+ <tr valign="top">
+ <th>int</th>
+ <td align="right">2</td>
+ </tr>
+</table>
+
+\pset tableattr
+deallocate q;
+-- test latex output format
+\pset format latex
+\pset border 1
+\pset expanded off
+\d psql_serial_tab_id_seq
+\begin{center}
+Sequence "public.psql\_serial\_tab\_id\_seq"
+\end{center}
+
+\begin{tabular}{l | r | r | r | r | l | r}
+\textit{Type} & \textit{Start} & \textit{Minimum} & \textit{Maximum} & \textit{Increment} & \textit{Cycles?} & \textit{Cache} \\
+\hline
+integer & 1 & 1 & 2147483647 & 1 & no & 1 \\
+\end{tabular}
+
+\noindent Owned by: public.psql\_serial\_tab.id \\
+
+\pset tuples_only true
+\df exp
+\begin{tabular}{l | l | l | l | l}
+pg\_catalog & exp & double precision & double precision & func \\
+pg\_catalog & exp & numeric & numeric & func \\
+\end{tabular}
+
+\noindent
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+\begin{center}
+Sequence "public.psql\_serial\_tab\_id\_seq"
+\end{center}
+
+\begin{tabular}{c|l}
+\multicolumn{2}{c}{\textit{Record 1}} \\
+\hline
+Type & integer \\
+Start & 1 \\
+Minimum & 1 \\
+Maximum & 2147483647 \\
+Increment & 1 \\
+Cycles? & no \\
+Cache & 1 \\
+\end{tabular}
+
+\noindent Owned by: public.psql\_serial\_tab.id \\
+
+\pset tuples_only true
+\df exp
+\begin{tabular}{c|l}
+\hline
+Schema & pg\_catalog \\
+Name & exp \\
+Result data type & double precision \\
+Argument data types & double precision \\
+Type & func \\
+\hline
+Schema & pg\_catalog \\
+Name & exp \\
+Result data type & numeric \\
+Argument data types & numeric \\
+Type & func \\
+\end{tabular}
+
+\noindent
+\pset tuples_only false
+prepare q as
+ select 'some\more_text' as "a$title", E' #<foo>%&^~|\n{bar}' as "junk",
+ ' ' as "empty", n as int
+ from generate_series(1,2) as n;
+\pset expanded off
+\pset border 0
+execute q;
+\begin{tabular}{lllr}
+\textit{a\$title} & \textit{junk} & \textit{empty} & \textit{int} \\
+\hline
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 1 \\
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 2 \\
+\end{tabular}
+
+\noindent (2 rows) \\
+
+\pset border 1
+execute q;
+\begin{tabular}{l | l | l | r}
+\textit{a\$title} & \textit{junk} & \textit{empty} & \textit{int} \\
+\hline
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 1 \\
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 2 \\
+\end{tabular}
+
+\noindent (2 rows) \\
+
+\pset border 2
+execute q;
+\begin{tabular}{| l | l | l | r |}
+\hline
+\textit{a\$title} & \textit{junk} & \textit{empty} & \textit{int} \\
+\hline
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 1 \\
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 2 \\
+\hline
+\end{tabular}
+
+\noindent (2 rows) \\
+
+\pset border 3
+execute q;
+\begin{tabular}{| l | l | l | r |}
+\hline
+\textit{a\$title} & \textit{junk} & \textit{empty} & \textit{int} \\
+\hline
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 1 \\
+\hline
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 2 \\
+\hline
+\end{tabular}
+
+\noindent (2 rows) \\
+
+\pset expanded on
+\pset border 0
+execute q;
+\begin{tabular}{cl}
+\multicolumn{2}{c}{\textit{Record 1}} \\
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\multicolumn{2}{c}{\textit{Record 2}} \\
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\end{tabular}
+
+\noindent
+\pset border 1
+execute q;
+\begin{tabular}{c|l}
+\multicolumn{2}{c}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\multicolumn{2}{c}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\end{tabular}
+
+\noindent
+\pset border 2
+execute q;
+\begin{tabular}{|c|l|}
+\hline
+\multicolumn{2}{|c|}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\hline
+\multicolumn{2}{|c|}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\hline
+\end{tabular}
+
+\noindent
+\pset border 3
+execute q;
+\begin{tabular}{|c|l|}
+\hline
+\multicolumn{2}{|c|}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\hline
+\multicolumn{2}{|c|}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\hline
+\end{tabular}
+
+\noindent
+deallocate q;
+-- test latex-longtable output format
+\pset format latex-longtable
+\pset border 1
+\pset expanded off
+\d psql_serial_tab_id_seq
+\begin{longtable}{l | r | r | r | r | l | r}
+\small\textbf{\textit{Type}} & \small\textbf{\textit{Start}} & \small\textbf{\textit{Minimum}} & \small\textbf{\textit{Maximum}} & \small\textbf{\textit{Increment}} & \small\textbf{\textit{Cycles?}} & \small\textbf{\textit{Cache}} \\
+\midrule
+\endfirsthead
+\small\textbf{\textit{Type}} & \small\textbf{\textit{Start}} & \small\textbf{\textit{Minimum}} & \small\textbf{\textit{Maximum}} & \small\textbf{\textit{Increment}} & \small\textbf{\textit{Cycles?}} & \small\textbf{\textit{Cache}} \\
+\midrule
+\endhead
+\caption[Sequence "public.psql\_serial\_tab\_id\_seq" (Continued)]{Sequence "public.psql\_serial\_tab\_id\_seq"}
+\endfoot
+\caption[Sequence "public.psql\_serial\_tab\_id\_seq"]{Sequence "public.psql\_serial\_tab\_id\_seq"}
+\endlastfoot
+\raggedright{integer}
+&
+\raggedright{1}
+&
+\raggedright{1}
+&
+\raggedright{2147483647}
+&
+\raggedright{1}
+&
+\raggedright{no}
+&
+\raggedright{1} \tabularnewline
+\end{longtable}
+\pset tuples_only true
+\df exp
+\begin{longtable}{l | l | l | l | l}
+\raggedright{pg\_catalog}
+&
+\raggedright{exp}
+&
+\raggedright{double precision}
+&
+\raggedright{double precision}
+&
+\raggedright{func} \tabularnewline
+\raggedright{pg\_catalog}
+&
+\raggedright{exp}
+&
+\raggedright{numeric}
+&
+\raggedright{numeric}
+&
+\raggedright{func} \tabularnewline
+\end{longtable}
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+\begin{center}
+Sequence "public.psql\_serial\_tab\_id\_seq"
+\end{center}
+
+\begin{tabular}{c|l}
+\multicolumn{2}{c}{\textit{Record 1}} \\
+\hline
+Type & integer \\
+Start & 1 \\
+Minimum & 1 \\
+Maximum & 2147483647 \\
+Increment & 1 \\
+Cycles? & no \\
+Cache & 1 \\
+\end{tabular}
+
+\noindent Owned by: public.psql\_serial\_tab.id \\
+
+\pset tuples_only true
+\df exp
+\begin{tabular}{c|l}
+\hline
+Schema & pg\_catalog \\
+Name & exp \\
+Result data type & double precision \\
+Argument data types & double precision \\
+Type & func \\
+\hline
+Schema & pg\_catalog \\
+Name & exp \\
+Result data type & numeric \\
+Argument data types & numeric \\
+Type & func \\
+\end{tabular}
+
+\noindent
+\pset tuples_only false
+prepare q as
+ select 'some\more_text' as "a$title", E' #<foo>%&^~|\n{bar}' as "junk",
+ ' ' as "empty", n as int
+ from generate_series(1,2) as n;
+\pset expanded off
+\pset border 0
+execute q;
+\begin{longtable}{lllr}
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endfirsthead
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endhead
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{1} \tabularnewline
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{2} \tabularnewline
+\end{longtable}
+\pset border 1
+execute q;
+\begin{longtable}{l | l | l | r}
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endfirsthead
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endhead
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{1} \tabularnewline
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{2} \tabularnewline
+\end{longtable}
+\pset border 2
+execute q;
+\begin{longtable}{| l | l | l | r |}
+\toprule
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endfirsthead
+\toprule
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endhead
+\bottomrule
+\endfoot
+\bottomrule
+\endlastfoot
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{1} \tabularnewline
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{2} \tabularnewline
+\end{longtable}
+\pset border 3
+execute q;
+\begin{longtable}{| l | l | l | r |}
+\toprule
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endfirsthead
+\toprule
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\endhead
+\bottomrule
+\endfoot
+\bottomrule
+\endlastfoot
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{1} \tabularnewline
+ \hline
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{2} \tabularnewline
+ \hline
+\end{longtable}
+\pset tableattr lr
+execute q;
+\begin{longtable}{| p{lr\textwidth} | p{lr\textwidth} | p{lr\textwidth} | r |}
+\toprule
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endfirsthead
+\toprule
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\endhead
+\bottomrule
+\endfoot
+\bottomrule
+\endlastfoot
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{1} \tabularnewline
+ \hline
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{2} \tabularnewline
+ \hline
+\end{longtable}
+\pset tableattr
+\pset expanded on
+\pset border 0
+execute q;
+\begin{tabular}{cl}
+\multicolumn{2}{c}{\textit{Record 1}} \\
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\multicolumn{2}{c}{\textit{Record 2}} \\
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\end{tabular}
+
+\noindent
+\pset border 1
+execute q;
+\begin{tabular}{c|l}
+\multicolumn{2}{c}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\multicolumn{2}{c}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\end{tabular}
+
+\noindent
+\pset border 2
+execute q;
+\begin{tabular}{|c|l|}
+\hline
+\multicolumn{2}{|c|}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\hline
+\multicolumn{2}{|c|}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\hline
+\end{tabular}
+
+\noindent
+\pset border 3
+execute q;
+\begin{tabular}{|c|l|}
+\hline
+\multicolumn{2}{|c|}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\hline
+\multicolumn{2}{|c|}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\hline
+\end{tabular}
+
+\noindent
+\pset tableattr lr
+execute q;
+\begin{tabular}{|c|l|}
+\hline
+\multicolumn{2}{|c|}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\hline
+\multicolumn{2}{|c|}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\hline
+\end{tabular}
+
+\noindent
+\pset tableattr
+deallocate q;
+-- test troff-ms output format
+\pset format troff-ms
+\pset border 1
+\pset expanded off
+\d psql_serial_tab_id_seq
+.LP
+.DS C
+Sequence "public.psql_serial_tab_id_seq"
+.DE
+.LP
+.TS
+center;
+l | r | r | r | r | l | r.
+\fIType\fP \fIStart\fP \fIMinimum\fP \fIMaximum\fP \fIIncrement\fP \fICycles?\fP \fICache\fP
+_
+integer 1 1 2147483647 1 no 1
+.TE
+.DS L
+Owned by: public.psql_serial_tab.id
+.DE
+\pset tuples_only true
+\df exp
+.LP
+.TS
+center;
+l | l | l | l | l.
+pg_catalog exp double precision double precision func
+pg_catalog exp numeric numeric func
+.TE
+.DS L
+.DE
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+.LP
+.DS C
+Sequence "public.psql_serial_tab_id_seq"
+.DE
+.LP
+.TS
+center;
+c s.
+\fIRecord 1\fP
+_
+.T&
+c | l.
+Type integer
+Start 1
+Minimum 1
+Maximum 2147483647
+Increment 1
+Cycles? no
+Cache 1
+.TE
+.DS L
+Owned by: public.psql_serial_tab.id
+.DE
+\pset tuples_only true
+\df exp
+.LP
+.TS
+center;
+c l;
+_
+Schema pg_catalog
+Name exp
+Result data type double precision
+Argument data types double precision
+Type func
+_
+Schema pg_catalog
+Name exp
+Result data type numeric
+Argument data types numeric
+Type func
+.TE
+.DS L
+.DE
+\pset tuples_only false
+prepare q as
+ select 'some\text' as "a\title", E' <foo>\n<bar>' as "junk",
+ ' ' as "empty", n as int
+ from generate_series(1,2) as n;
+\pset expanded off
+\pset border 0
+execute q;
+.LP
+.TS
+center;
+lllr.
+\fIa\(rstitle\fP \fIjunk\fP \fIempty\fP \fIint\fP
+_
+some\(rstext <foo>
+<bar> 1
+some\(rstext <foo>
+<bar> 2
+.TE
+.DS L
+(2 rows)
+.DE
+\pset border 1
+execute q;
+.LP
+.TS
+center;
+l | l | l | r.
+\fIa\(rstitle\fP \fIjunk\fP \fIempty\fP \fIint\fP
+_
+some\(rstext <foo>
+<bar> 1
+some\(rstext <foo>
+<bar> 2
+.TE
+.DS L
+(2 rows)
+.DE
+\pset border 2
+execute q;
+.LP
+.TS
+center box;
+l | l | l | r.
+\fIa\(rstitle\fP \fIjunk\fP \fIempty\fP \fIint\fP
+_
+some\(rstext <foo>
+<bar> 1
+some\(rstext <foo>
+<bar> 2
+.TE
+.DS L
+(2 rows)
+.DE
+\pset expanded on
+\pset border 0
+execute q;
+.LP
+.TS
+center;
+c s.
+\fIRecord 1\fP
+.T&
+c l.
+a\(rstitle some\(rstext
+junk <foo>
+<bar>
+empty
+int 1
+.T&
+c s.
+\fIRecord 2\fP
+.T&
+c l.
+a\(rstitle some\(rstext
+junk <foo>
+<bar>
+empty
+int 2
+.TE
+.DS L
+.DE
+\pset border 1
+execute q;
+.LP
+.TS
+center;
+c s.
+\fIRecord 1\fP
+_
+.T&
+c | l.
+a\(rstitle some\(rstext
+junk <foo>
+<bar>
+empty
+int 1
+.T&
+c s.
+\fIRecord 2\fP
+_
+.T&
+c | l.
+a\(rstitle some\(rstext
+junk <foo>
+<bar>
+empty
+int 2
+.TE
+.DS L
+.DE
+\pset border 2
+execute q;
+.LP
+.TS
+center box;
+c s.
+\fIRecord 1\fP
+_
+.T&
+c l.
+a\(rstitle some\(rstext
+junk <foo>
+<bar>
+empty
+int 1
+_
+.T&
+c s.
+\fIRecord 2\fP
+_
+.T&
+c l.
+a\(rstitle some\(rstext
+junk <foo>
+<bar>
+empty
+int 2
+.TE
+.DS L
+.DE
+deallocate q;
+-- check ambiguous format requests
+\pset format a
+\pset: ambiguous abbreviation "a" matches both "aligned" and "asciidoc"
+\pset format l
+-- clean up after output format tests
+drop table psql_serial_tab;
+\pset format aligned
+\pset expanded off
+\pset border 1
+-- \echo and allied features
+\echo this is a test
+this is a test
+\echo -n without newline
+without newline\echo with -n newline
+with -n newline
+\echo '-n' with newline
+-n with newline
+\set foo bar
+\echo foo = :foo
+foo = bar
+\qecho this is a test
+this is a test
+\qecho foo = :foo
+foo = bar
+\warn this is a test
+this is a test
+\warn foo = :foo
+foo = bar
+-- tests for \if ... \endif
+\if true
+ select 'okay';
+ ?column?
+----------
+ okay
+(1 row)
+
+ select 'still okay';
+ ?column?
+------------
+ still okay
+(1 row)
+
+\else
+ not okay;
+ still not okay
+\endif
+-- at this point query buffer should still have last valid line
+\g
+ ?column?
+------------
+ still okay
+(1 row)
+
+-- \if should work okay on part of a query
+select
+ \if true
+ 42
+ \else
+ (bogus
+ \endif
+ forty_two;
+ forty_two
+-----------
+ 42
+(1 row)
+
+select \if false \\ (bogus \else \\ 42 \endif \\ forty_two;
+ forty_two
+-----------
+ 42
+(1 row)
+
+-- test a large nested if using a variety of true-equivalents
+\if true
+ \if 1
+ \if yes
+ \if on
+ \echo 'all true'
+all true
+ \else
+ \echo 'should not print #1-1'
+ \endif
+ \else
+ \echo 'should not print #1-2'
+ \endif
+ \else
+ \echo 'should not print #1-3'
+ \endif
+\else
+ \echo 'should not print #1-4'
+\endif
+-- test a variety of false-equivalents in an if/elif/else structure
+\if false
+ \echo 'should not print #2-1'
+\elif 0
+ \echo 'should not print #2-2'
+\elif no
+ \echo 'should not print #2-3'
+\elif off
+ \echo 'should not print #2-4'
+\else
+ \echo 'all false'
+all false
+\endif
+-- test true-false elif after initial true branch
+\if true
+ \echo 'should print #2-5'
+should print #2-5
+\elif true
+ \echo 'should not print #2-6'
+\elif false
+ \echo 'should not print #2-7'
+\else
+ \echo 'should not print #2-8'
+\endif
+-- test simple true-then-else
+\if true
+ \echo 'first thing true'
+first thing true
+\else
+ \echo 'should not print #3-1'
+\endif
+-- test simple false-true-else
+\if false
+ \echo 'should not print #4-1'
+\elif true
+ \echo 'second thing true'
+second thing true
+\else
+ \echo 'should not print #5-1'
+\endif
+-- invalid boolean expressions are false
+\if invalid boolean expression
+unrecognized value "invalid boolean expression" for "\if expression": Boolean expected
+ \echo 'will not print #6-1'
+\else
+ \echo 'will print anyway #6-2'
+will print anyway #6-2
+\endif
+-- test un-matched endif
+\endif
+\endif: no matching \if
+-- test un-matched else
+\else
+\else: no matching \if
+-- test un-matched elif
+\elif
+\elif: no matching \if
+-- test double-else error
+\if true
+\else
+\else
+\else: cannot occur after \else
+\endif
+-- test elif out-of-order
+\if false
+\else
+\elif
+\elif: cannot occur after \else
+\endif
+-- test if-endif matching in a false branch
+\if false
+ \if false
+ \echo 'should not print #7-1'
+ \else
+ \echo 'should not print #7-2'
+ \endif
+ \echo 'should not print #7-3'
+\else
+ \echo 'should print #7-4'
+should print #7-4
+\endif
+-- show that vars and backticks are not expanded when ignoring extra args
+\set foo bar
+\echo :foo :'foo' :"foo"
+bar 'bar' "bar"
+\pset fieldsep | `nosuchcommand` :foo :'foo' :"foo"
+\pset: extra argument "nosuchcommand" ignored
+\pset: extra argument ":foo" ignored
+\pset: extra argument ":'foo'" ignored
+\pset: extra argument ":"foo"" ignored
+-- show that vars and backticks are not expanded and commands are ignored
+-- when in a false if-branch
+\set try_to_quit '\\q'
+\if false
+ :try_to_quit
+ \echo `nosuchcommand` :foo :'foo' :"foo"
+ \pset fieldsep | `nosuchcommand` :foo :'foo' :"foo"
+ \a
+ \C arg1
+ \c arg1 arg2 arg3 arg4
+ \cd arg1
+ \conninfo
+ \copy arg1 arg2 arg3 arg4 arg5 arg6
+ \copyright
+ SELECT 1 as one, 2, 3 \crosstabview
+ \dt arg1
+ \e arg1 arg2
+ \ef whole_line
+ \ev whole_line
+ \echo arg1 arg2 arg3 arg4 arg5
+ \echo arg1
+ \encoding arg1
+ \errverbose
+ \f arg1
+ \g arg1
+ \gx arg1
+ \gexec
+ SELECT 1 AS one \gset
+ \h
+ \?
+ \html
+ \i arg1
+ \ir arg1
+ \l arg1
+ \lo arg1 arg2
+invalid command \lo
+ \lo_list
+ \o arg1
+ \p
+ \password arg1
+ \prompt arg1 arg2
+ \pset arg1 arg2
+ \q
+ \reset
+ \s arg1
+ \set arg1 arg2 arg3 arg4 arg5 arg6 arg7
+ \setenv arg1 arg2
+ \sf whole_line
+ \sv whole_line
+ \t arg1
+ \T arg1
+ \timing arg1
+ \unset arg1
+ \w arg1
+ \watch arg1
+ \x arg1
+ -- \else here is eaten as part of OT_FILEPIPE argument
+ \w |/no/such/file \else
+ -- \endif here is eaten as part of whole-line argument
+ \! whole_line \endif
+ \z
+\else
+ \echo 'should print #8-1'
+should print #8-1
+\endif
+-- :{?...} defined variable test
+\set i 1
+\if :{?i}
+ \echo '#9-1 ok, variable i is defined'
+#9-1 ok, variable i is defined
+\else
+ \echo 'should not print #9-2'
+\endif
+\if :{?no_such_variable}
+ \echo 'should not print #10-1'
+\else
+ \echo '#10-2 ok, variable no_such_variable is not defined'
+#10-2 ok, variable no_such_variable is not defined
+\endif
+SELECT :{?i} AS i_is_defined;
+ i_is_defined
+--------------
+ t
+(1 row)
+
+SELECT NOT :{?no_such_var} AS no_such_var_is_not_defined;
+ no_such_var_is_not_defined
+----------------------------
+ t
+(1 row)
+
+-- SHOW_CONTEXT
+\set SHOW_CONTEXT never
+do $$
+begin
+ raise notice 'foo';
+ raise exception 'bar';
+end $$;
+NOTICE: foo
+ERROR: bar
+\set SHOW_CONTEXT errors
+do $$
+begin
+ raise notice 'foo';
+ raise exception 'bar';
+end $$;
+NOTICE: foo
+ERROR: bar
+CONTEXT: PL/pgSQL function inline_code_block line 4 at RAISE
+\set SHOW_CONTEXT always
+do $$
+begin
+ raise notice 'foo';
+ raise exception 'bar';
+end $$;
+NOTICE: foo
+CONTEXT: PL/pgSQL function inline_code_block line 3 at RAISE
+ERROR: bar
+CONTEXT: PL/pgSQL function inline_code_block line 4 at RAISE
+-- test printing and clearing the query buffer
+SELECT 1;
+ ?column?
+----------
+ 1
+(1 row)
+
+\p
+SELECT 1;
+SELECT 2 \r
+\p
+SELECT 1;
+SELECT 3 \p
+SELECT 3
+UNION SELECT 4 \p
+SELECT 3
+UNION SELECT 4
+UNION SELECT 5
+ORDER BY 1;
+ ?column?
+----------
+ 3
+ 4
+ 5
+(3 rows)
+
+\r
+\p
+SELECT 3
+UNION SELECT 4
+UNION SELECT 5
+ORDER BY 1;
+-- tests for special result variables
+-- working query, 2 rows selected
+SELECT 1 AS stuff UNION SELECT 2;
+ stuff
+-------
+ 1
+ 2
+(2 rows)
+
+\echo 'error:' :ERROR
+error: false
+\echo 'error code:' :SQLSTATE
+error code: 00000
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 2
+-- syntax error
+SELECT 1 UNION;
+ERROR: syntax error at or near ";"
+LINE 1: SELECT 1 UNION;
+ ^
+\echo 'error:' :ERROR
+error: true
+\echo 'error code:' :SQLSTATE
+error code: 42601
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 0
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: syntax error at or near ";"
+\echo 'last error code:' :LAST_ERROR_SQLSTATE
+last error code: 42601
+-- empty query
+;
+\echo 'error:' :ERROR
+error: false
+\echo 'error code:' :SQLSTATE
+error code: 00000
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 0
+-- must have kept previous values
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: syntax error at or near ";"
+\echo 'last error code:' :LAST_ERROR_SQLSTATE
+last error code: 42601
+-- other query error
+DROP TABLE this_table_does_not_exist;
+ERROR: table "this_table_does_not_exist" does not exist
+\echo 'error:' :ERROR
+error: true
+\echo 'error code:' :SQLSTATE
+error code: 42P01
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 0
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: table "this_table_does_not_exist" does not exist
+\echo 'last error code:' :LAST_ERROR_SQLSTATE
+last error code: 42P01
+-- nondefault verbosity error settings (except verbose, which is too unstable)
+\set VERBOSITY terse
+SELECT 1 UNION;
+ERROR: syntax error at or near ";" at character 15
+\echo 'error:' :ERROR
+error: true
+\echo 'error code:' :SQLSTATE
+error code: 42601
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: syntax error at or near ";"
+\set VERBOSITY sqlstate
+SELECT 1/0;
+ERROR: 22012
+\echo 'error:' :ERROR
+error: true
+\echo 'error code:' :SQLSTATE
+error code: 22012
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: division by zero
+\set VERBOSITY default
+-- working \gdesc
+SELECT 3 AS three, 4 AS four \gdesc
+ Column | Type
+--------+---------
+ three | integer
+ four | integer
+(2 rows)
+
+\echo 'error:' :ERROR
+error: false
+\echo 'error code:' :SQLSTATE
+error code: 00000
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 2
+-- \gdesc with an error
+SELECT 4 AS \gdesc
+ERROR: syntax error at end of input
+LINE 1: SELECT 4 AS
+ ^
+\echo 'error:' :ERROR
+error: true
+\echo 'error code:' :SQLSTATE
+error code: 42601
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 0
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: syntax error at end of input
+\echo 'last error code:' :LAST_ERROR_SQLSTATE
+last error code: 42601
+-- check row count for a cursor-fetched query
+\set FETCH_COUNT 10
+select unique2 from tenk1 order by unique2 limit 19;
+ unique2
+---------
+ 0
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+ 10
+ 11
+ 12
+ 13
+ 14
+ 15
+ 16
+ 17
+ 18
+(19 rows)
+
+\echo 'error:' :ERROR
+error: false
+\echo 'error code:' :SQLSTATE
+error code: 00000
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 19
+-- cursor-fetched query with an error after the first group
+select 1/(15-unique2) from tenk1 order by unique2 limit 19;
+ ?column?
+----------
+ 0
+ 0
+ 0
+ 0
+ 0
+ 0
+ 0
+ 0
+ 0
+ 0
+ERROR: division by zero
+\echo 'error:' :ERROR
+error: true
+\echo 'error code:' :SQLSTATE
+error code: 22012
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 0
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: division by zero
+\echo 'last error code:' :LAST_ERROR_SQLSTATE
+last error code: 22012
+\unset FETCH_COUNT
+create schema testpart;
+create role regress_partitioning_role;
+alter schema testpart owner to regress_partitioning_role;
+set role to regress_partitioning_role;
+-- run test inside own schema and hide other partitions
+set search_path to testpart;
+create table testtable_apple(logdate date);
+create table testtable_orange(logdate date);
+create index testtable_apple_index on testtable_apple(logdate);
+create index testtable_orange_index on testtable_orange(logdate);
+create table testpart_apple(logdate date) partition by range(logdate);
+create table testpart_orange(logdate date) partition by range(logdate);
+create index testpart_apple_index on testpart_apple(logdate);
+create index testpart_orange_index on testpart_orange(logdate);
+-- only partition related object should be displayed
+\dP test*apple*
+ List of partitioned relations
+ Schema | Name | Owner | Type | Parent name | Table
+----------+----------------------+---------------------------+-------------------+-------------+----------------
+ testpart | testpart_apple | regress_partitioning_role | partitioned table | |
+ testpart | testpart_apple_index | regress_partitioning_role | partitioned index | | testpart_apple
+(2 rows)
+
+\dPt test*apple*
+ List of partitioned tables
+ Schema | Name | Owner | Parent name
+----------+----------------+---------------------------+-------------
+ testpart | testpart_apple | regress_partitioning_role |
+(1 row)
+
+\dPi test*apple*
+ List of partitioned indexes
+ Schema | Name | Owner | Parent name | Table
+----------+----------------------+---------------------------+-------------+----------------
+ testpart | testpart_apple_index | regress_partitioning_role | | testpart_apple
+(1 row)
+
+drop table testtable_apple;
+drop table testtable_orange;
+drop table testpart_apple;
+drop table testpart_orange;
+create table parent_tab (id int) partition by range (id);
+create index parent_index on parent_tab (id);
+create table child_0_10 partition of parent_tab
+ for values from (0) to (10);
+create table child_10_20 partition of parent_tab
+ for values from (10) to (20);
+create table child_20_30 partition of parent_tab
+ for values from (20) to (30);
+insert into parent_tab values (generate_series(0,29));
+create table child_30_40 partition of parent_tab
+for values from (30) to (40)
+ partition by range(id);
+create table child_30_35 partition of child_30_40
+ for values from (30) to (35);
+create table child_35_40 partition of child_30_40
+ for values from (35) to (40);
+insert into parent_tab values (generate_series(30,39));
+\dPt
+ List of partitioned tables
+ Schema | Name | Owner
+----------+------------+---------------------------
+ testpart | parent_tab | regress_partitioning_role
+(1 row)
+
+\dPi
+ List of partitioned indexes
+ Schema | Name | Owner | Table
+----------+--------------+---------------------------+------------
+ testpart | parent_index | regress_partitioning_role | parent_tab
+(1 row)
+
+\dP testpart.*
+ List of partitioned relations
+ Schema | Name | Owner | Type | Parent name | Table
+----------+--------------------+---------------------------+-------------------+--------------+-------------
+ testpart | parent_tab | regress_partitioning_role | partitioned table | |
+ testpart | child_30_40 | regress_partitioning_role | partitioned table | parent_tab |
+ testpart | parent_index | regress_partitioning_role | partitioned index | | parent_tab
+ testpart | child_30_40_id_idx | regress_partitioning_role | partitioned index | parent_index | child_30_40
+(4 rows)
+
+\dP
+ List of partitioned relations
+ Schema | Name | Owner | Type | Table
+----------+--------------+---------------------------+-------------------+------------
+ testpart | parent_tab | regress_partitioning_role | partitioned table |
+ testpart | parent_index | regress_partitioning_role | partitioned index | parent_tab
+(2 rows)
+
+\dPtn
+ List of partitioned tables
+ Schema | Name | Owner | Parent name
+----------+-------------+---------------------------+-------------
+ testpart | parent_tab | regress_partitioning_role |
+ testpart | child_30_40 | regress_partitioning_role | parent_tab
+(2 rows)
+
+\dPin
+ List of partitioned indexes
+ Schema | Name | Owner | Parent name | Table
+----------+--------------------+---------------------------+--------------+-------------
+ testpart | parent_index | regress_partitioning_role | | parent_tab
+ testpart | child_30_40_id_idx | regress_partitioning_role | parent_index | child_30_40
+(2 rows)
+
+\dPn
+ List of partitioned relations
+ Schema | Name | Owner | Type | Parent name | Table
+----------+--------------------+---------------------------+-------------------+--------------+-------------
+ testpart | parent_tab | regress_partitioning_role | partitioned table | |
+ testpart | child_30_40 | regress_partitioning_role | partitioned table | parent_tab |
+ testpart | parent_index | regress_partitioning_role | partitioned index | | parent_tab
+ testpart | child_30_40_id_idx | regress_partitioning_role | partitioned index | parent_index | child_30_40
+(4 rows)
+
+\dPn testpart.*
+ List of partitioned relations
+ Schema | Name | Owner | Type | Parent name | Table
+----------+--------------------+---------------------------+-------------------+--------------+-------------
+ testpart | parent_tab | regress_partitioning_role | partitioned table | |
+ testpart | child_30_40 | regress_partitioning_role | partitioned table | parent_tab |
+ testpart | parent_index | regress_partitioning_role | partitioned index | | parent_tab
+ testpart | child_30_40_id_idx | regress_partitioning_role | partitioned index | parent_index | child_30_40
+(4 rows)
+
+drop table parent_tab cascade;
+drop schema testpart;
+set search_path to default;
+set role to default;
+drop role regress_partitioning_role;
+-- \d on toast table (use pg_statistic's toast table, which has a known name)
+\d pg_toast.pg_toast_2619
+TOAST table "pg_toast.pg_toast_2619"
+ Column | Type
+------------+---------
+ chunk_id | oid
+ chunk_seq | integer
+ chunk_data | bytea
+Owning table: "pg_catalog.pg_statistic"
+Indexes:
+ "pg_toast_2619_index" PRIMARY KEY, btree (chunk_id, chunk_seq)
+
+-- check printing info about access methods
+\dA
+List of access methods
+ Name | Type
+----------+-------
+ brin | Index
+ btree | Index
+ gin | Index
+ gist | Index
+ hash | Index
+ heap | Table
+ heap2 | Table
+ spgist | Index
+ zedstore | Table
+(9 rows)
+
+\dA *
+List of access methods
+ Name | Type
+----------+-------
+ brin | Index
+ btree | Index
+ gin | Index
+ gist | Index
+ hash | Index
+ heap | Table
+ heap2 | Table
+ spgist | Index
+ zedstore | Table
+(9 rows)
+
+\dA h*
+List of access methods
+ Name | Type
+-------+-------
+ hash | Index
+ heap | Table
+ heap2 | Table
+(3 rows)
+
+\dA foo
+List of access methods
+ Name | Type
+------+------
+(0 rows)
+
+\dA foo bar
+List of access methods
+ Name | Type
+------+------
+(0 rows)
+
+\dA: extra argument "bar" ignored
+\dA+
+ List of access methods
+ Name | Type | Handler | Description
+----------+-------+--------------------------+----------------------------------------
+ brin | Index | brinhandler | block range index (BRIN) access method
+ btree | Index | bthandler | b-tree index access method
+ gin | Index | ginhandler | GIN index access method
+ gist | Index | gisthandler | GiST index access method
+ hash | Index | hashhandler | hash index access method
+ heap | Table | heap_tableam_handler | heap table access method
+ heap2 | Table | heap_tableam_handler |
+ spgist | Index | spghandler | SP-GiST index access method
+ zedstore | Table | zedstore_tableam_handler | zedstore table access method
+(9 rows)
+
+\dA+ *
+ List of access methods
+ Name | Type | Handler | Description
+----------+-------+--------------------------+----------------------------------------
+ brin | Index | brinhandler | block range index (BRIN) access method
+ btree | Index | bthandler | b-tree index access method
+ gin | Index | ginhandler | GIN index access method
+ gist | Index | gisthandler | GiST index access method
+ hash | Index | hashhandler | hash index access method
+ heap | Table | heap_tableam_handler | heap table access method
+ heap2 | Table | heap_tableam_handler |
+ spgist | Index | spghandler | SP-GiST index access method
+ zedstore | Table | zedstore_tableam_handler | zedstore table access method
+(9 rows)
+
+\dA+ h*
+ List of access methods
+ Name | Type | Handler | Description
+-------+-------+----------------------+--------------------------
+ hash | Index | hashhandler | hash index access method
+ heap | Table | heap_tableam_handler | heap table access method
+ heap2 | Table | heap_tableam_handler |
+(3 rows)
+
+\dA+ foo
+ List of access methods
+ Name | Type | Handler | Description
+------+------+---------+-------------
+(0 rows)
+
+\dAc brin pg*.oid*
+ List of operator classes
+ AM | Input type | Storage type | Operator class | Default?
+------+------------+--------------+----------------+----------
+ brin | oid | | oid_minmax_ops | yes
+(1 row)
+
+\dAf spgist
+ List of operator families
+ AM | Operator family | Applicable types
+--------+-----------------+------------------
+ spgist | box_ops | box
+ spgist | kd_point_ops | point
+ spgist | network_ops | inet
+ spgist | poly_ops | polygon
+ spgist | quad_point_ops | point
+ spgist | range_ops | anyrange
+ spgist | text_ops | text
+(7 rows)
+
+\dAf btree int4
+ List of operator families
+ AM | Operator family | Applicable types
+-------+-----------------+---------------------------
+ btree | integer_ops | smallint, integer, bigint
+(1 row)
+
+\dAo brin uuid_minmax_ops
+ List of operators of operator families
+ AM | Opfamily Name | Operator
+------+-----------------+-----------------
+ brin | uuid_minmax_ops | < (uuid, uuid)
+ brin | uuid_minmax_ops | <= (uuid, uuid)
+ brin | uuid_minmax_ops | = (uuid, uuid)
+ brin | uuid_minmax_ops | >= (uuid, uuid)
+ brin | uuid_minmax_ops | > (uuid, uuid)
+(5 rows)
+
+\dAo * pg_catalog.jsonb_path_ops
+ List of operators of operator families
+ AM | Opfamily Name | Operator
+-----+----------------+----------------------
+ gin | jsonb_path_ops | @> (jsonb, jsonb)
+ gin | jsonb_path_ops | @? (jsonb, jsonpath)
+ gin | jsonb_path_ops | @@ (jsonb, jsonpath)
+(3 rows)
+
+\dAp brin uuid_minmax_ops
+ List of procedures of operator families
+ AM | Operator family | Left arg type | Right arg type | Number | Proc name
+------+-----------------+---------------+----------------+--------+------------------------
+ brin | uuid_minmax_ops | uuid | uuid | 1 | brin_minmax_opcinfo
+ brin | uuid_minmax_ops | uuid | uuid | 2 | brin_minmax_add_value
+ brin | uuid_minmax_ops | uuid | uuid | 3 | brin_minmax_consistent
+ brin | uuid_minmax_ops | uuid | uuid | 4 | brin_minmax_union
+(4 rows)
+
+\dAp * pg_catalog.uuid_ops
+ List of procedures of operator families
+ AM | Operator family | Left arg type | Right arg type | Number | Proc name
+-------+-----------------+---------------+----------------+--------+--------------------
+ btree | uuid_ops | uuid | uuid | 1 | uuid_cmp
+ btree | uuid_ops | uuid | uuid | 2 | uuid_sortsupport
+ btree | uuid_ops | uuid | uuid | 4 | btequalimage
+ hash | uuid_ops | uuid | uuid | 1 | uuid_hash
+ hash | uuid_ops | uuid | uuid | 2 | uuid_hash_extended
+(5 rows)
+
diff --git src/test/regress/expected/psql_crosstab_1.out src/test/regress/expected/psql_crosstab_1.out
new file mode 100644
index 0000000000..d8ab4c35f3
--- /dev/null
+++ src/test/regress/expected/psql_crosstab_1.out
@@ -0,0 +1,216 @@
+--
+-- \crosstabview
+--
+CREATE TABLE ctv_data (v, h, c, i, d) AS
+VALUES
+ ('v1','h2','foo', 3, '2015-04-01'::date),
+ ('v2','h1','bar', 3, '2015-01-02'),
+ ('v1','h0','baz', NULL, '2015-07-12'),
+ ('v0','h4','qux', 4, '2015-07-15'),
+ ('v0','h4','dbl', -3, '2014-12-15'),
+ ('v0',NULL,'qux', 5, '2014-07-15'),
+ ('v1','h2','quux',7, '2015-04-04');
+-- make plans more stable
+ANALYZE ctv_data;
+-- running \crosstabview after query uses query in buffer
+SELECT v, EXTRACT(year FROM d), count(*)
+ FROM ctv_data
+ GROUP BY 1, 2
+ ORDER BY 1, 2;
+ v | date_part | count
+----+-----------+-------
+ v0 | 2014 | 2
+ v0 | 2015 | 1
+ v1 | 2015 | 3
+ v2 | 2015 | 1
+(4 rows)
+
+-- basic usage with 3 columns
+ \crosstabview
+ v | 2014 | 2015
+----+------+------
+ v0 | 2 | 1
+ v1 | | 3
+ v2 | | 1
+(3 rows)
+
+-- ordered months in horizontal header, quoted column name
+SELECT v, to_char(d, 'Mon') AS "month name", EXTRACT(month FROM d) AS num,
+ count(*) FROM ctv_data GROUP BY 1,2,3 ORDER BY 1
+ \crosstabview v "month name" 4 num
+ v | Jan | Apr | Jul | Dec
+----+-----+-----+-----+-----
+ v0 | | | 2 | 1
+ v1 | | 2 | 1 |
+ v2 | 1 | | |
+(3 rows)
+
+-- ordered months in vertical header, ordered years in horizontal header
+SELECT EXTRACT(year FROM d) AS year, to_char(d,'Mon') AS """month"" name",
+ EXTRACT(month FROM d) AS month,
+ format('sum=%s avg=%s', sum(i), avg(i)::numeric(2,1))
+ FROM ctv_data
+ GROUP BY EXTRACT(year FROM d), to_char(d,'Mon'), EXTRACT(month FROM d)
+ORDER BY month
+\crosstabview """month"" name" year format year
+ "month" name | 2014 | 2015
+--------------+-----------------+----------------
+ Jan | | sum=3 avg=3.0
+ Apr | | sum=10 avg=5.0
+ Jul | sum=5 avg=5.0 | sum=4 avg=4.0
+ Dec | sum=-3 avg=-3.0 |
+(4 rows)
+
+-- combine contents vertically into the same cell (V/H duplicates)
+SELECT v, h, string_agg(c, E'\n') FROM ctv_data GROUP BY v, h ORDER BY 1,2,3
+ \crosstabview 1 2 3
+ v | h4 | | h0 | h2 | h1
+----+-----+-----+-----+------+-----
+ v0 | qux+| qux | | |
+ | dbl | | | |
+ v1 | | | baz | foo +|
+ | | | | quux |
+ v2 | | | | | bar
+(3 rows)
+
+-- horizontal ASC order from window function
+SELECT v,h, string_agg(c, E'\n') AS c, row_number() OVER(ORDER BY h) AS r
+FROM ctv_data GROUP BY v, h ORDER BY 1,3,2
+ \crosstabview v h c r
+ v | h0 | h1 | h2 | h4 |
+----+-----+-----+------+-----+-----
+ v0 | | | | qux+| qux
+ | | | | dbl |
+ v1 | baz | | foo +| |
+ | | | quux | |
+ v2 | | bar | | |
+(3 rows)
+
+-- horizontal DESC order from window function
+SELECT v, h, string_agg(c, E'\n') AS c, row_number() OVER(ORDER BY h DESC) AS r
+FROM ctv_data GROUP BY v, h ORDER BY 1,3,2
+ \crosstabview v h c r
+ v | | h4 | h2 | h1 | h0
+----+-----+-----+------+-----+-----
+ v0 | qux | qux+| | |
+ | | dbl | | |
+ v1 | | | foo +| | baz
+ | | | quux | |
+ v2 | | | | bar |
+(3 rows)
+
+-- horizontal ASC order from window function, NULLs pushed rightmost
+SELECT v,h, string_agg(c, E'\n') AS c, row_number() OVER(ORDER BY h NULLS LAST) AS r
+FROM ctv_data GROUP BY v, h ORDER BY 1,3,2
+ \crosstabview v h c r
+ v | h0 | h1 | h2 | h4 |
+----+-----+-----+------+-----+-----
+ v0 | | | | qux+| qux
+ | | | | dbl |
+ v1 | baz | | foo +| |
+ | | | quux | |
+ v2 | | bar | | |
+(3 rows)
+
+-- only null, no column name, 2 columns: error
+SELECT null,null \crosstabview
+\crosstabview: query must return at least three columns
+-- only null, no column name, 3 columns: works
+SELECT null,null,null \crosstabview
+ ?column? |
+----------+--
+ |
+(1 row)
+
+-- null display
+\pset null '#null#'
+SELECT v,h, string_agg(i::text, E'\n') AS i FROM ctv_data
+GROUP BY v, h ORDER BY h,v
+ \crosstabview v h i
+ v | h0 | h1 | h2 | h4 | #null#
+----+--------+----+----+----+--------
+ v1 | #null# | | 3 +| |
+ | | | 7 | |
+ v2 | | 3 | | |
+ v0 | | | | 4 +| 5
+ | | | | -3 |
+(3 rows)
+
+\pset null ''
+-- refer to columns by position
+SELECT v,h,string_agg(i::text, E'\n'), string_agg(c, E'\n')
+FROM ctv_data GROUP BY v, h ORDER BY h,v
+ \crosstabview 2 1 4
+ h | v1 | v2 | v0
+----+------+-----+-----
+ h0 | baz | |
+ h1 | | bar |
+ h2 | quux+| |
+ | foo | |
+ h4 | | | qux+
+ | | | dbl
+ | | | qux
+(5 rows)
+
+-- refer to columns by positions and names mixed
+SELECT v,h, string_agg(i::text, E'\n') AS i, string_agg(c, E'\n') AS c
+FROM ctv_data GROUP BY v, h ORDER BY h,v
+ \crosstabview 1 "h" 4
+ v | h0 | h1 | h2 | h4 |
+----+-----+-----+------+-----+-----
+ v1 | baz | | quux+| |
+ | | | foo | |
+ v2 | | bar | | |
+ v0 | | | | qux+| qux
+ | | | | dbl |
+(3 rows)
+
+-- refer to columns by quoted names, check downcasing of unquoted name
+SELECT 1 as "22", 2 as b, 3 as "Foo"
+ \crosstabview "22" B "Foo"
+ 22 | 2
+----+---
+ 1 | 3
+(1 row)
+
+-- error: bad column name
+SELECT v,h,c,i FROM ctv_data
+ \crosstabview v h j
+\crosstabview: column name not found: "j"
+-- error: need to quote name
+SELECT 1 as "22", 2 as b, 3 as "Foo"
+ \crosstabview 1 2 Foo
+\crosstabview: column name not found: "foo"
+-- error: need to not quote name
+SELECT 1 as "22", 2 as b, 3 as "Foo"
+ \crosstabview 1 "B" "Foo"
+\crosstabview: column name not found: "B"
+-- error: bad column number
+SELECT v,h,i,c FROM ctv_data
+ \crosstabview 2 1 5
+\crosstabview: column number 5 is out of range 1..4
+-- error: same H and V columns
+SELECT v,h,i,c FROM ctv_data
+ \crosstabview 2 h 4
+\crosstabview: vertical and horizontal headers must be different columns
+-- error: too many columns
+SELECT a,a,1 FROM generate_series(1,3000) AS a
+ \crosstabview
+\crosstabview: maximum number of columns (1600) exceeded
+-- error: only one column
+SELECT 1 \crosstabview
+\crosstabview: query must return at least three columns
+DROP TABLE ctv_data;
+-- check error reporting (bug #14476)
+CREATE TABLE ctv_data (x int, y int, v text);
+INSERT INTO ctv_data SELECT 1, x, '*' || x FROM generate_series(1,10) x;
+SELECT * FROM ctv_data \crosstabview
+ x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
+---+----+----+----+----+----+----+----+----+----+-----
+ 1 | *1 | *2 | *3 | *4 | *5 | *6 | *7 | *8 | *9 | *10
+(1 row)
+
+INSERT INTO ctv_data VALUES (1, 10, '*'); -- duplicate data to cause error
+SELECT * FROM ctv_data \crosstabview
+\crosstabview: query result contains multiple data values for row "1", column "10"
+DROP TABLE ctv_data;
diff --git src/test/regress/expected/rangefuncs_1.out src/test/regress/expected/rangefuncs_1.out
new file mode 100644
index 0000000000..78b177ceb0
--- /dev/null
+++ src/test/regress/expected/rangefuncs_1.out
@@ -0,0 +1,2100 @@
+CREATE TABLE rngfunc2(rngfuncid int, f2 int);
+INSERT INTO rngfunc2 VALUES(1, 11);
+INSERT INTO rngfunc2 VALUES(2, 22);
+INSERT INTO rngfunc2 VALUES(1, 111);
+CREATE FUNCTION rngfunct(int) returns setof rngfunc2 as 'SELECT * FROM rngfunc2 WHERE rngfuncid = $1 ORDER BY f2;' LANGUAGE SQL;
+-- function with ORDINALITY
+select * from rngfunct(1) with ordinality as z(a,b,ord);
+ a | b | ord
+---+-----+-----
+ 1 | 11 | 1
+ 1 | 111 | 2
+(2 rows)
+
+select * from rngfunct(1) with ordinality as z(a,b,ord) where b > 100; -- ordinal 2, not 1
+ a | b | ord
+---+-----+-----
+ 1 | 111 | 2
+(1 row)
+
+-- ordinality vs. column names and types
+select a,b,ord from rngfunct(1) with ordinality as z(a,b,ord);
+ a | b | ord
+---+-----+-----
+ 1 | 11 | 1
+ 1 | 111 | 2
+(2 rows)
+
+select a,ord from unnest(array['a','b']) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ a | 1
+ b | 2
+(2 rows)
+
+select * from unnest(array['a','b']) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ a | 1
+ b | 2
+(2 rows)
+
+select a,ord from unnest(array[1.0::float8]) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ 1 | 1
+(1 row)
+
+select * from unnest(array[1.0::float8]) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ 1 | 1
+(1 row)
+
+select row_to_json(s.*) from generate_series(11,14) with ordinality s;
+ row_to_json
+-------------------------
+ {"s":11,"ordinality":1}
+ {"s":12,"ordinality":2}
+ {"s":13,"ordinality":3}
+ {"s":14,"ordinality":4}
+(4 rows)
+
+-- ordinality vs. views
+create temporary view vw_ord as select * from (values (1)) v(n) join rngfunct(1) with ordinality as z(a,b,ord) on (n=ord);
+select * from vw_ord;
+ n | a | b | ord
+---+---+----+-----
+ 1 | 1 | 11 | 1
+(1 row)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+-------------------------------------------------------------------------
+ SELECT v.n, +
+ z.a, +
+ z.b, +
+ z.ord +
+ FROM (( VALUES (1)) v(n) +
+ JOIN rngfunct(1) WITH ORDINALITY z(a, b, ord) ON ((v.n = z.ord)));
+(1 row)
+
+drop view vw_ord;
+-- multiple functions
+select * from rows from(rngfunct(1),rngfunct(2)) with ordinality as z(a,b,c,d,ord);
+ a | b | c | d | ord
+---+-----+---+----+-----
+ 1 | 11 | 2 | 22 | 1
+ 1 | 111 | | | 2
+(2 rows)
+
+create temporary view vw_ord as select * from (values (1)) v(n) join rows from(rngfunct(1),rngfunct(2)) with ordinality as z(a,b,c,d,ord) on (n=ord);
+select * from vw_ord;
+ n | a | b | c | d | ord
+---+---+----+---+----+-----
+ 1 | 1 | 11 | 2 | 22 | 1
+(1 row)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+-------------------------------------------------------------------------------------------------------
+ SELECT v.n, +
+ z.a, +
+ z.b, +
+ z.c, +
+ z.d, +
+ z.ord +
+ FROM (( VALUES (1)) v(n) +
+ JOIN ROWS FROM(rngfunct(1), rngfunct(2)) WITH ORDINALITY z(a, b, c, d, ord) ON ((v.n = z.ord)));
+(1 row)
+
+drop view vw_ord;
+-- expansions of unnest()
+select * from unnest(array[10,20],array['foo','bar'],array[1.0]);
+ unnest | unnest | unnest
+--------+--------+--------
+ 10 | foo | 1.0
+ 20 | bar |
+(2 rows)
+
+select * from unnest(array[10,20],array['foo','bar'],array[1.0]) with ordinality as z(a,b,c,ord);
+ a | b | c | ord
+----+-----+-----+-----
+ 10 | foo | 1.0 | 1
+ 20 | bar | | 2
+(2 rows)
+
+select * from rows from(unnest(array[10,20],array['foo','bar'],array[1.0])) with ordinality as z(a,b,c,ord);
+ a | b | c | ord
+----+-----+-----+-----
+ 10 | foo | 1.0 | 1
+ 20 | bar | | 2
+(2 rows)
+
+select * from rows from(unnest(array[10,20],array['foo','bar']), generate_series(101,102)) with ordinality as z(a,b,c,ord);
+ a | b | c | ord
+----+-----+-----+-----
+ 10 | foo | 101 | 1
+ 20 | bar | 102 | 2
+(2 rows)
+
+create temporary view vw_ord as select * from unnest(array[10,20],array['foo','bar'],array[1.0]) as z(a,b,c);
+select * from vw_ord;
+ a | b | c
+----+-----+-----
+ 10 | foo | 1.0
+ 20 | bar |
+(2 rows)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+----------------------------------------------------------------------------------------
+ SELECT z.a, +
+ z.b, +
+ z.c +
+ FROM UNNEST(ARRAY[10, 20], ARRAY['foo'::text, 'bar'::text], ARRAY[1.0]) z(a, b, c);
+(1 row)
+
+drop view vw_ord;
+create temporary view vw_ord as select * from rows from(unnest(array[10,20],array['foo','bar'],array[1.0])) as z(a,b,c);
+select * from vw_ord;
+ a | b | c
+----+-----+-----
+ 10 | foo | 1.0
+ 20 | bar |
+(2 rows)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+----------------------------------------------------------------------------------------
+ SELECT z.a, +
+ z.b, +
+ z.c +
+ FROM UNNEST(ARRAY[10, 20], ARRAY['foo'::text, 'bar'::text], ARRAY[1.0]) z(a, b, c);
+(1 row)
+
+drop view vw_ord;
+create temporary view vw_ord as select * from rows from(unnest(array[10,20],array['foo','bar']), generate_series(1,2)) as z(a,b,c);
+select * from vw_ord;
+ a | b | c
+----+-----+---
+ 10 | foo | 1
+ 20 | bar | 2
+(2 rows)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+----------------------------------------------------------------------------------------------------------------------
+ SELECT z.a, +
+ z.b, +
+ z.c +
+ FROM ROWS FROM(unnest(ARRAY[10, 20]), unnest(ARRAY['foo'::text, 'bar'::text]), generate_series(1, 2)) z(a, b, c);
+(1 row)
+
+drop view vw_ord;
+-- ordinality and multiple functions vs. rewind and reverse scan
+begin;
+declare rf_cur scroll cursor for select * from rows from(generate_series(1,5),generate_series(1,2)) with ordinality as g(i,j,o);
+fetch all from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+ 2 | 2 | 2
+ 3 | | 3
+ 4 | | 4
+ 5 | | 5
+(5 rows)
+
+fetch backward all from rf_cur;
+ i | j | o
+---+---+---
+ 5 | | 5
+ 4 | | 4
+ 3 | | 3
+ 2 | 2 | 2
+ 1 | 1 | 1
+(5 rows)
+
+fetch all from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+ 2 | 2 | 2
+ 3 | | 3
+ 4 | | 4
+ 5 | | 5
+(5 rows)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+(0 rows)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+(0 rows)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 5 | | 5
+(1 row)
+
+fetch absolute 1 from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+(1 row)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+ 2 | 2 | 2
+(1 row)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+ 3 | | 3
+(1 row)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+ 4 | | 4
+(1 row)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 3 | | 3
+(1 row)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 2 | 2 | 2
+(1 row)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+(1 row)
+
+commit;
+-- function with implicit LATERAL
+select * from rngfunc2, rngfunct(rngfunc2.rngfuncid) z where rngfunc2.f2 = z.f2;
+ rngfuncid | f2 | rngfuncid | f2
+-----------+-----+-----------+-----
+ 1 | 11 | 1 | 11
+ 2 | 22 | 2 | 22
+ 1 | 111 | 1 | 111
+(3 rows)
+
+-- function with implicit LATERAL and explicit ORDINALITY
+select * from rngfunc2, rngfunct(rngfunc2.rngfuncid) with ordinality as z(rngfuncid,f2,ord) where rngfunc2.f2 = z.f2;
+ rngfuncid | f2 | rngfuncid | f2 | ord
+-----------+-----+-----------+-----+-----
+ 1 | 11 | 1 | 11 | 1
+ 2 | 22 | 2 | 22 | 1
+ 1 | 111 | 1 | 111 | 2
+(3 rows)
+
+-- function in subselect
+select * from rngfunc2 where f2 in (select f2 from rngfunct(rngfunc2.rngfuncid) z where z.rngfuncid = rngfunc2.rngfuncid) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+ 2 | 22
+(3 rows)
+
+-- function in subselect
+select * from rngfunc2 where f2 in (select f2 from rngfunct(1) z where z.rngfuncid = rngfunc2.rngfuncid) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+(2 rows)
+
+-- function in subselect
+select * from rngfunc2 where f2 in (select f2 from rngfunct(rngfunc2.rngfuncid) z where z.rngfuncid = 1) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+(2 rows)
+
+-- nested functions
+select rngfunct.rngfuncid, rngfunct.f2 from rngfunct(sin(pi()/2)::int) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+(2 rows)
+
+CREATE TABLE rngfunc (rngfuncid int, rngfuncsubid int, rngfuncname text, primary key(rngfuncid,rngfuncsubid));
+INSERT INTO rngfunc VALUES(1,1,'Joe');
+INSERT INTO rngfunc VALUES(1,2,'Ed');
+INSERT INTO rngfunc VALUES(2,1,'Mary');
+-- sql, proretset = f, prorettype = b
+CREATE FUNCTION getrngfunc1(int) RETURNS int AS 'SELECT $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc1(1) AS t1;
+ t1
+----
+ 1
+(1 row)
+
+SELECT * FROM getrngfunc1(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc1(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc1
+-------------
+ 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc1(1) WITH ORDINALITY as t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = b
+CREATE FUNCTION getrngfunc2(int) RETURNS setof int AS 'SELECT rngfuncid FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc2(1) AS t1;
+ t1
+----
+ 1
+ 1
+(2 rows)
+
+SELECT * FROM getrngfunc2(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+---+---
+ 1 | 1
+ 1 | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc2(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc2
+-------------
+ 1
+ 1
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc2(1) WITH ORDINALITY AS t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+---+---
+ 1 | 1
+ 1 | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = b
+CREATE FUNCTION getrngfunc3(int) RETURNS setof text AS 'SELECT rngfuncname FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc3(1) AS t1;
+ t1
+-----
+ Joe
+ Ed
+(2 rows)
+
+SELECT * FROM getrngfunc3(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+-----+---
+ Joe | 1
+ Ed | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc3(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc3
+-------------
+ Joe
+ Ed
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc3(1) WITH ORDINALITY AS t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+-----+---
+ Joe | 1
+ Ed | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = f, prorettype = c
+CREATE FUNCTION getrngfunc4(int) RETURNS rngfunc AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc4(1) AS t1;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+SELECT * FROM getrngfunc4(1) WITH ORDINALITY AS t1(a,b,c,o);
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc4(1);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc4(1) WITH ORDINALITY AS t1(a,b,c,o);
+SELECT * FROM vw_getrngfunc;
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = c
+CREATE FUNCTION getrngfunc5(int) RETURNS setof rngfunc AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc5(1) AS t1;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+SELECT * FROM getrngfunc5(1) WITH ORDINALITY AS t1(a,b,c,o);
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc5(1);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc5(1) WITH ORDINALITY AS t1(a,b,c,o);
+SELECT * FROM vw_getrngfunc;
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = f, prorettype = record
+CREATE FUNCTION getrngfunc6(int) RETURNS RECORD AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc6(1) AS t1(rngfuncid int, rngfuncsubid int, rngfuncname text);
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+SELECT * FROM ROWS FROM( getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) ) WITH ORDINALITY;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc6(1) AS
+(rngfuncid int, rngfuncsubid int, rngfuncname text);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS
+ SELECT * FROM ROWS FROM( getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) )
+ WITH ORDINALITY;
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = record
+CREATE FUNCTION getrngfunc7(int) RETURNS setof record AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc7(1) AS t1(rngfuncid int, rngfuncsubid int, rngfuncname text);
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+SELECT * FROM ROWS FROM( getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) ) WITH ORDINALITY;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc7(1) AS
+(rngfuncid int, rngfuncsubid int, rngfuncname text);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS
+ SELECT * FROM ROWS FROM( getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) )
+ WITH ORDINALITY;
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- plpgsql, proretset = f, prorettype = b
+CREATE FUNCTION getrngfunc8(int) RETURNS int AS 'DECLARE rngfuncint int; BEGIN SELECT rngfuncid into rngfuncint FROM rngfunc WHERE rngfuncid = $1; RETURN rngfuncint; END;' LANGUAGE plpgsql;
+SELECT * FROM getrngfunc8(1) AS t1;
+ t1
+----
+ 1
+(1 row)
+
+SELECT * FROM getrngfunc8(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc8(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc8
+-------------
+ 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc8(1) WITH ORDINALITY AS t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- plpgsql, proretset = f, prorettype = c
+CREATE FUNCTION getrngfunc9(int) RETURNS rngfunc AS 'DECLARE rngfunctup rngfunc%ROWTYPE; BEGIN SELECT * into rngfunctup FROM rngfunc WHERE rngfuncid = $1; RETURN rngfunctup; END;' LANGUAGE plpgsql;
+SELECT * FROM getrngfunc9(1) AS t1;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+SELECT * FROM getrngfunc9(1) WITH ORDINALITY AS t1(a,b,c,o);
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc9(1);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc9(1) WITH ORDINALITY AS t1(a,b,c,o);
+SELECT * FROM vw_getrngfunc;
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- mix 'n match kinds, to exercise expandRTE and related logic
+select * from rows from(getrngfunc1(1),getrngfunc2(1),getrngfunc3(1),getrngfunc4(1),getrngfunc5(1),
+ getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc8(1),getrngfunc9(1))
+ with ordinality as t1(a,b,c,d,e,f,g,h,i,j,k,l,m,o,p,q,r,s,t,u);
+ a | b | c | d | e | f | g | h | i | j | k | l | m | o | p | q | r | s | t | u
+---+---+-----+---+---+-----+---+---+-----+---+---+-----+---+---+-----+---+---+---+-----+---
+ 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | 1 | Joe | 1
+ | 1 | Ed | | | | 1 | 2 | Ed | | | | 1 | 2 | Ed | | | | | 2
+(2 rows)
+
+select * from rows from(getrngfunc9(1),getrngfunc8(1),
+ getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc5(1),getrngfunc4(1),getrngfunc3(1),getrngfunc2(1),getrngfunc1(1))
+ with ordinality as t1(a,b,c,d,e,f,g,h,i,j,k,l,m,o,p,q,r,s,t,u);
+ a | b | c | d | e | f | g | h | i | j | k | l | m | o | p | q | r | s | t | u
+---+---+-----+---+---+---+-----+---+---+-----+---+---+-----+---+---+-----+-----+---+---+---
+ 1 | 1 | Joe | 1 | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | Joe | 1 | 1 | 1
+ | | | | 1 | 2 | Ed | | | | 1 | 2 | Ed | | | | Ed | 1 | | 2
+(2 rows)
+
+create temporary view vw_rngfunc as
+ select * from rows from(getrngfunc9(1),
+ getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc1(1))
+ with ordinality as t1(a,b,c,d,e,f,g,n);
+select * from vw_rngfunc;
+ a | b | c | d | e | f | g | n
+---+---+-----+---+---+-----+---+---
+ 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1
+ | | | 1 | 2 | Ed | | 2
+(2 rows)
+
+select pg_get_viewdef('vw_rngfunc');
+ pg_get_viewdef
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ SELECT t1.a, +
+ t1.b, +
+ t1.c, +
+ t1.d, +
+ t1.e, +
+ t1.f, +
+ t1.g, +
+ t1.n +
+ FROM ROWS FROM(getrngfunc9(1), getrngfunc7(1) AS (rngfuncid integer, rngfuncsubid integer, rngfuncname text), getrngfunc1(1)) WITH ORDINALITY t1(a, b, c, d, e, f, g, n);
+(1 row)
+
+drop view vw_rngfunc;
+DROP FUNCTION getrngfunc1(int);
+DROP FUNCTION getrngfunc2(int);
+DROP FUNCTION getrngfunc3(int);
+DROP FUNCTION getrngfunc4(int);
+DROP FUNCTION getrngfunc5(int);
+DROP FUNCTION getrngfunc6(int);
+DROP FUNCTION getrngfunc7(int);
+DROP FUNCTION getrngfunc8(int);
+DROP FUNCTION getrngfunc9(int);
+DROP FUNCTION rngfunct(int);
+DROP TABLE rngfunc2;
+DROP TABLE rngfunc;
+-- Rescan tests --
+CREATE TEMPORARY SEQUENCE rngfunc_rescan_seq1;
+CREATE TEMPORARY SEQUENCE rngfunc_rescan_seq2;
+CREATE TYPE rngfunc_rescan_t AS (i integer, s bigint);
+CREATE FUNCTION rngfunc_sql(int,int) RETURNS setof rngfunc_rescan_t AS 'SELECT i, nextval(''rngfunc_rescan_seq1'') FROM generate_series($1,$2) i;' LANGUAGE SQL;
+-- plpgsql functions use materialize mode
+CREATE FUNCTION rngfunc_mat(int,int) RETURNS setof rngfunc_rescan_t AS 'begin for i in $1..$2 loop return next (i, nextval(''rngfunc_rescan_seq2'')); end loop; end;' LANGUAGE plpgsql;
+--invokes ExecReScanFunctionScan - all these cases should materialize the function only once
+-- LEFT JOIN on a condition that the planner can't prove to be true is used to ensure the function
+-- is on the inner path of a nestloop join
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_sql(11,13) ON (r+i)<100;
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 11 | 1
+ 2 | 12 | 2
+ 2 | 13 | 3
+ 3 | 11 | 1
+ 3 | 12 | 2
+ 3 | 13 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_sql(11,13) WITH ORDINALITY AS f(i,s,o) ON (r+i)<100;
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 11 | 1 | 1
+ 2 | 12 | 2 | 2
+ 2 | 13 | 3 | 3
+ 3 | 11 | 1 | 1
+ 3 | 12 | 2 | 2
+ 3 | 13 | 3 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_mat(11,13) ON (r+i)<100;
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 11 | 1
+ 2 | 12 | 2
+ 2 | 13 | 3
+ 3 | 11 | 1
+ 3 | 12 | 2
+ 3 | 13 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_mat(11,13) WITH ORDINALITY AS f(i,s,o) ON (r+i)<100;
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 11 | 1 | 1
+ 2 | 12 | 2 | 2
+ 2 | 13 | 3 | 3
+ 3 | 11 | 1 | 1
+ 3 | 12 | 2 | 2
+ 3 | 13 | 3 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN ROWS FROM( rngfunc_sql(11,13), rngfunc_mat(11,13) ) WITH ORDINALITY AS f(i1,s1,i2,s2,o) ON (r+i1+i2)<100;
+ r | i1 | s1 | i2 | s2 | o
+---+----+----+----+----+---
+ 1 | 11 | 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 12 | 2 | 2
+ 1 | 13 | 3 | 13 | 3 | 3
+ 2 | 11 | 1 | 11 | 1 | 1
+ 2 | 12 | 2 | 12 | 2 | 2
+ 2 | 13 | 3 | 13 | 3 | 3
+ 3 | 11 | 1 | 11 | 1 | 1
+ 3 | 12 | 2 | 12 | 2 | 2
+ 3 | 13 | 3 | 13 | 3 | 3
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN generate_series(11,13) f(i) ON (r+i)<100;
+ r | i
+---+----
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 2 | 11
+ 2 | 12
+ 2 | 13
+ 3 | 11
+ 3 | 12
+ 3 | 13
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN generate_series(11,13) WITH ORDINALITY AS f(i,o) ON (r+i)<100;
+ r | i | o
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 11 | 1
+ 2 | 12 | 2
+ 2 | 13 | 3
+ 3 | 11 | 1
+ 3 | 12 | 2
+ 3 | 13 | 3
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN unnest(array[10,20,30]) f(i) ON (r+i)<100;
+ r | i
+---+----
+ 1 | 10
+ 1 | 20
+ 1 | 30
+ 2 | 10
+ 2 | 20
+ 2 | 30
+ 3 | 10
+ 3 | 20
+ 3 | 30
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN unnest(array[10,20,30]) WITH ORDINALITY AS f(i,o) ON (r+i)<100;
+ r | i | o
+---+----+---
+ 1 | 10 | 1
+ 1 | 20 | 2
+ 1 | 30 | 3
+ 2 | 10 | 1
+ 2 | 20 | 2
+ 2 | 30 | 3
+ 3 | 10 | 1
+ 3 | 20 | 2
+ 3 | 30 | 3
+(9 rows)
+
+--invokes ExecReScanFunctionScan with chgParam != NULL (using implied LATERAL)
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(10+r,13);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 12 | 4
+ 2 | 13 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(10+r,13) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 12 | 4 | 1
+ 2 | 13 | 5 | 2
+ 3 | 13 | 6 | 1
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(11,10+r);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 2 | 11 | 2
+ 2 | 12 | 3
+ 3 | 11 | 4
+ 3 | 12 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(11,10+r) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 2 | 11 | 2 | 1
+ 2 | 12 | 3 | 2
+ 3 | 11 | 4 | 1
+ 3 | 12 | 5 | 2
+ 3 | 13 | 6 | 3
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_sql(r1,r2);
+ r1 | r2 | i | s
+----+----+----+----
+ 11 | 12 | 11 | 1
+ 11 | 12 | 12 | 2
+ 13 | 15 | 13 | 3
+ 13 | 15 | 14 | 4
+ 13 | 15 | 15 | 5
+ 16 | 20 | 16 | 6
+ 16 | 20 | 17 | 7
+ 16 | 20 | 18 | 8
+ 16 | 20 | 19 | 9
+ 16 | 20 | 20 | 10
+(10 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_sql(r1,r2) WITH ORDINALITY AS f(i,s,o);
+ r1 | r2 | i | s | o
+----+----+----+----+---
+ 11 | 12 | 11 | 1 | 1
+ 11 | 12 | 12 | 2 | 2
+ 13 | 15 | 13 | 3 | 1
+ 13 | 15 | 14 | 4 | 2
+ 13 | 15 | 15 | 5 | 3
+ 16 | 20 | 16 | 6 | 1
+ 16 | 20 | 17 | 7 | 2
+ 16 | 20 | 18 | 8 | 3
+ 16 | 20 | 19 | 9 | 4
+ 16 | 20 | 20 | 10 | 5
+(10 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(10+r,13);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 12 | 4
+ 2 | 13 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(10+r,13) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 12 | 4 | 1
+ 2 | 13 | 5 | 2
+ 3 | 13 | 6 | 1
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(11,10+r);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 2 | 11 | 2
+ 2 | 12 | 3
+ 3 | 11 | 4
+ 3 | 12 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(11,10+r) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 2 | 11 | 2 | 1
+ 2 | 12 | 3 | 2
+ 3 | 11 | 4 | 1
+ 3 | 12 | 5 | 2
+ 3 | 13 | 6 | 3
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_mat(r1,r2);
+ r1 | r2 | i | s
+----+----+----+----
+ 11 | 12 | 11 | 1
+ 11 | 12 | 12 | 2
+ 13 | 15 | 13 | 3
+ 13 | 15 | 14 | 4
+ 13 | 15 | 15 | 5
+ 16 | 20 | 16 | 6
+ 16 | 20 | 17 | 7
+ 16 | 20 | 18 | 8
+ 16 | 20 | 19 | 9
+ 16 | 20 | 20 | 10
+(10 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_mat(r1,r2) WITH ORDINALITY AS f(i,s,o);
+ r1 | r2 | i | s | o
+----+----+----+----+---
+ 11 | 12 | 11 | 1 | 1
+ 11 | 12 | 12 | 2 | 2
+ 13 | 15 | 13 | 3 | 1
+ 13 | 15 | 14 | 4 | 2
+ 13 | 15 | 15 | 5 | 3
+ 16 | 20 | 16 | 6 | 1
+ 16 | 20 | 17 | 7 | 2
+ 16 | 20 | 18 | 8 | 3
+ 16 | 20 | 19 | 9 | 4
+ 16 | 20 | 20 | 10 | 5
+(10 rows)
+
+-- selective rescan of multiple functions:
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( rngfunc_sql(11,11), rngfunc_mat(10+r,13) );
+ r | i | s | i | s
+---+----+---+----+---
+ 1 | 11 | 1 | 11 | 1
+ 1 | | | 12 | 2
+ 1 | | | 13 | 3
+ 2 | 11 | 1 | 12 | 4
+ 2 | | | 13 | 5
+ 3 | 11 | 1 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( rngfunc_sql(10+r,13), rngfunc_mat(11,11) );
+ r | i | s | i | s
+---+----+---+----+---
+ 1 | 11 | 1 | 11 | 1
+ 1 | 12 | 2 | |
+ 1 | 13 | 3 | |
+ 2 | 12 | 4 | 11 | 1
+ 2 | 13 | 5 | |
+ 3 | 13 | 6 | 11 | 1
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( rngfunc_sql(10+r,13), rngfunc_mat(10+r,13) );
+ r | i | s | i | s
+---+----+---+----+---
+ 1 | 11 | 1 | 11 | 1
+ 1 | 12 | 2 | 12 | 2
+ 1 | 13 | 3 | 13 | 3
+ 2 | 12 | 4 | 12 | 4
+ 2 | 13 | 5 | 13 | 5
+ 3 | 13 | 6 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM generate_series(1,2) r1, generate_series(r1,3) r2, ROWS FROM( rngfunc_sql(10+r1,13), rngfunc_mat(10+r2,13) );
+ r1 | r2 | i | s | i | s
+----+----+----+----+----+---
+ 1 | 1 | 11 | 1 | 11 | 1
+ 1 | 1 | 12 | 2 | 12 | 2
+ 1 | 1 | 13 | 3 | 13 | 3
+ 1 | 2 | 11 | 4 | 12 | 4
+ 1 | 2 | 12 | 5 | 13 | 5
+ 1 | 2 | 13 | 6 | |
+ 1 | 3 | 11 | 7 | 13 | 6
+ 1 | 3 | 12 | 8 | |
+ 1 | 3 | 13 | 9 | |
+ 2 | 2 | 12 | 10 | 12 | 7
+ 2 | 2 | 13 | 11 | 13 | 8
+ 2 | 3 | 12 | 12 | 13 | 9
+ 2 | 3 | 13 | 13 | |
+(13 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), generate_series(10+r,20-r) f(i);
+ r | i
+---+----
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 12
+ 2 | 13
+ 2 | 14
+ 2 | 15
+ 2 | 16
+ 2 | 17
+ 2 | 18
+ 3 | 13
+ 3 | 14
+ 3 | 15
+ 3 | 16
+ 3 | 17
+(21 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), generate_series(10+r,20-r) WITH ORDINALITY AS f(i,o);
+ r | i | o
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 1 | 14 | 4
+ 1 | 15 | 5
+ 1 | 16 | 6
+ 1 | 17 | 7
+ 1 | 18 | 8
+ 1 | 19 | 9
+ 2 | 12 | 1
+ 2 | 13 | 2
+ 2 | 14 | 3
+ 2 | 15 | 4
+ 2 | 16 | 5
+ 2 | 17 | 6
+ 2 | 18 | 7
+ 3 | 13 | 1
+ 3 | 14 | 2
+ 3 | 15 | 3
+ 3 | 16 | 4
+ 3 | 17 | 5
+(21 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), unnest(array[r*10,r*20,r*30]) f(i);
+ r | i
+---+----
+ 1 | 10
+ 1 | 20
+ 1 | 30
+ 2 | 20
+ 2 | 40
+ 2 | 60
+ 3 | 30
+ 3 | 60
+ 3 | 90
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), unnest(array[r*10,r*20,r*30]) WITH ORDINALITY AS f(i,o);
+ r | i | o
+---+----+---
+ 1 | 10 | 1
+ 1 | 20 | 2
+ 1 | 30 | 3
+ 2 | 20 | 1
+ 2 | 40 | 2
+ 2 | 60 | 3
+ 3 | 30 | 1
+ 3 | 60 | 2
+ 3 | 90 | 3
+(9 rows)
+
+-- deep nesting
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(21,23) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+----
+ 1 | 1 | 10 | 21
+ 1 | 1 | 10 | 22
+ 1 | 1 | 10 | 23
+ 1 | 1 | 20 | 21
+ 1 | 1 | 20 | 22
+ 1 | 1 | 20 | 23
+ 1 | 1 | 30 | 21
+ 1 | 1 | 30 | 22
+ 1 | 1 | 30 | 23
+ 2 | 2 | 10 | 21
+ 2 | 2 | 10 | 22
+ 2 | 2 | 10 | 23
+ 2 | 2 | 20 | 21
+ 2 | 2 | 20 | 22
+ 2 | 2 | 20 | 23
+ 2 | 2 | 30 | 21
+ 2 | 2 | 30 | 22
+ 2 | 2 | 30 | 23
+ 3 | 3 | 10 | 21
+ 3 | 3 | 10 | 22
+ 3 | 3 | 10 | 23
+ 3 | 3 | 20 | 21
+ 3 | 3 | 20 | 22
+ 3 | 3 | 20 | 23
+ 3 | 3 | 30 | 21
+ 3 | 3 | 30 | 22
+ 3 | 3 | 30 | 23
+(27 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(20+r1,23) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+----
+ 1 | 1 | 10 | 21
+ 1 | 1 | 10 | 22
+ 1 | 1 | 10 | 23
+ 1 | 1 | 20 | 21
+ 1 | 1 | 20 | 22
+ 1 | 1 | 20 | 23
+ 1 | 1 | 30 | 21
+ 1 | 1 | 30 | 22
+ 1 | 1 | 30 | 23
+ 2 | 2 | 10 | 22
+ 2 | 2 | 10 | 23
+ 2 | 2 | 20 | 22
+ 2 | 2 | 20 | 23
+ 2 | 2 | 30 | 22
+ 2 | 2 | 30 | 23
+ 3 | 3 | 10 | 23
+ 3 | 3 | 20 | 23
+ 3 | 3 | 30 | 23
+(18 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(r2,r2+3) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+----
+ 1 | 1 | 10 | 10
+ 1 | 1 | 10 | 11
+ 1 | 1 | 10 | 12
+ 1 | 1 | 10 | 13
+ 1 | 1 | 20 | 20
+ 1 | 1 | 20 | 21
+ 1 | 1 | 20 | 22
+ 1 | 1 | 20 | 23
+ 1 | 1 | 30 | 30
+ 1 | 1 | 30 | 31
+ 1 | 1 | 30 | 32
+ 1 | 1 | 30 | 33
+ 2 | 2 | 10 | 10
+ 2 | 2 | 10 | 11
+ 2 | 2 | 10 | 12
+ 2 | 2 | 10 | 13
+ 2 | 2 | 20 | 20
+ 2 | 2 | 20 | 21
+ 2 | 2 | 20 | 22
+ 2 | 2 | 20 | 23
+ 2 | 2 | 30 | 30
+ 2 | 2 | 30 | 31
+ 2 | 2 | 30 | 32
+ 2 | 2 | 30 | 33
+ 3 | 3 | 10 | 10
+ 3 | 3 | 10 | 11
+ 3 | 3 | 10 | 12
+ 3 | 3 | 10 | 13
+ 3 | 3 | 20 | 20
+ 3 | 3 | 20 | 21
+ 3 | 3 | 20 | 22
+ 3 | 3 | 20 | 23
+ 3 | 3 | 30 | 30
+ 3 | 3 | 30 | 31
+ 3 | 3 | 30 | 32
+ 3 | 3 | 30 | 33
+(36 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(r1,2+r2/5) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+---
+ 1 | 1 | 10 | 1
+ 1 | 1 | 10 | 2
+ 1 | 1 | 10 | 3
+ 1 | 1 | 10 | 4
+ 1 | 1 | 20 | 1
+ 1 | 1 | 20 | 2
+ 1 | 1 | 20 | 3
+ 1 | 1 | 20 | 4
+ 1 | 1 | 20 | 5
+ 1 | 1 | 20 | 6
+ 1 | 1 | 30 | 1
+ 1 | 1 | 30 | 2
+ 1 | 1 | 30 | 3
+ 1 | 1 | 30 | 4
+ 1 | 1 | 30 | 5
+ 1 | 1 | 30 | 6
+ 1 | 1 | 30 | 7
+ 1 | 1 | 30 | 8
+ 2 | 2 | 10 | 2
+ 2 | 2 | 10 | 3
+ 2 | 2 | 10 | 4
+ 2 | 2 | 20 | 2
+ 2 | 2 | 20 | 3
+ 2 | 2 | 20 | 4
+ 2 | 2 | 20 | 5
+ 2 | 2 | 20 | 6
+ 2 | 2 | 30 | 2
+ 2 | 2 | 30 | 3
+ 2 | 2 | 30 | 4
+ 2 | 2 | 30 | 5
+ 2 | 2 | 30 | 6
+ 2 | 2 | 30 | 7
+ 2 | 2 | 30 | 8
+ 3 | 3 | 10 | 3
+ 3 | 3 | 10 | 4
+ 3 | 3 | 20 | 3
+ 3 | 3 | 20 | 4
+ 3 | 3 | 20 | 5
+ 3 | 3 | 20 | 6
+ 3 | 3 | 30 | 3
+ 3 | 3 | 30 | 4
+ 3 | 3 | 30 | 5
+ 3 | 3 | 30 | 6
+ 3 | 3 | 30 | 7
+ 3 | 3 | 30 | 8
+(45 rows)
+
+-- check handling of FULL JOIN with multiple lateral references (bug #15741)
+SELECT *
+FROM (VALUES (1),(2)) v1(r1)
+ LEFT JOIN LATERAL (
+ SELECT *
+ FROM generate_series(1, v1.r1) AS gs1
+ LEFT JOIN LATERAL (
+ SELECT *
+ FROM generate_series(1, gs1) AS gs2
+ LEFT JOIN generate_series(1, gs2) AS gs3 ON TRUE
+ ) AS ss1 ON TRUE
+ FULL JOIN generate_series(1, v1.r1) AS gs4 ON FALSE
+ ) AS ss0 ON TRUE;
+ r1 | gs1 | gs2 | gs3 | gs4
+----+-----+-----+-----+-----
+ 1 | | | | 1
+ 1 | 1 | 1 | 1 |
+ 2 | | | | 1
+ 2 | | | | 2
+ 2 | 1 | 1 | 1 |
+ 2 | 2 | 1 | 1 |
+ 2 | 2 | 2 | 1 |
+ 2 | 2 | 2 | 2 |
+(8 rows)
+
+DROP FUNCTION rngfunc_sql(int,int);
+DROP FUNCTION rngfunc_mat(int,int);
+DROP SEQUENCE rngfunc_rescan_seq1;
+DROP SEQUENCE rngfunc_rescan_seq2;
+--
+-- Test cases involving OUT parameters
+--
+CREATE FUNCTION rngfunc(in f1 int, out f2 int)
+AS 'select $1+1' LANGUAGE sql;
+SELECT rngfunc(42);
+ rngfunc
+---------
+ 43
+(1 row)
+
+SELECT * FROM rngfunc(42);
+ f2
+----
+ 43
+(1 row)
+
+SELECT * FROM rngfunc(42) AS p(x);
+ x
+----
+ 43
+(1 row)
+
+-- explicit spec of return type is OK
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int) RETURNS int
+AS 'select $1+1' LANGUAGE sql;
+-- error, wrong result type
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int) RETURNS float
+AS 'select $1+1' LANGUAGE sql;
+ERROR: function result type must be integer because of OUT parameters
+-- with multiple OUT params you must get a RECORD result
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int, out f3 text) RETURNS int
+AS 'select $1+1' LANGUAGE sql;
+ERROR: function result type must be record because of OUT parameters
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int, out f3 text)
+RETURNS record
+AS 'select $1+1' LANGUAGE sql;
+ERROR: cannot change return type of existing function
+HINT: Use DROP FUNCTION rngfunc(integer) first.
+CREATE OR REPLACE FUNCTION rngfuncr(in f1 int, out f2 int, out text)
+AS $$select $1-1, $1::text || 'z'$$ LANGUAGE sql;
+SELECT f1, rngfuncr(f1) FROM int4_tbl;
+ f1 | rngfuncr
+-------------+----------------------------
+ 0 | (-1,0z)
+ 123456 | (123455,123456z)
+ -123456 | (-123457,-123456z)
+ 2147483647 | (2147483646,2147483647z)
+ -2147483647 | (-2147483648,-2147483647z)
+(5 rows)
+
+SELECT * FROM rngfuncr(42);
+ f2 | column2
+----+---------
+ 41 | 42z
+(1 row)
+
+SELECT * FROM rngfuncr(42) AS p(a,b);
+ a | b
+----+-----
+ 41 | 42z
+(1 row)
+
+CREATE OR REPLACE FUNCTION rngfuncb(in f1 int, inout f2 int, out text)
+AS $$select $2-1, $1::text || 'z'$$ LANGUAGE sql;
+SELECT f1, rngfuncb(f1, f1/2) FROM int4_tbl;
+ f1 | rngfuncb
+-------------+----------------------------
+ 0 | (-1,0z)
+ 123456 | (61727,123456z)
+ -123456 | (-61729,-123456z)
+ 2147483647 | (1073741822,2147483647z)
+ -2147483647 | (-1073741824,-2147483647z)
+(5 rows)
+
+SELECT * FROM rngfuncb(42, 99);
+ f2 | column2
+----+---------
+ 98 | 42z
+(1 row)
+
+SELECT * FROM rngfuncb(42, 99) AS p(a,b);
+ a | b
+----+-----
+ 98 | 42z
+(1 row)
+
+-- Can reference function with or without OUT params for DROP, etc
+DROP FUNCTION rngfunc(int);
+DROP FUNCTION rngfuncr(in f2 int, out f1 int, out text);
+DROP FUNCTION rngfuncb(in f1 int, inout f2 int);
+--
+-- For my next trick, polymorphic OUT parameters
+--
+CREATE FUNCTION dup (f1 anyelement, f2 out anyelement, f3 out anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+SELECT dup(22);
+ dup
+----------------
+ (22,"{22,22}")
+(1 row)
+
+SELECT dup('xyz'); -- fails
+ERROR: could not determine polymorphic type because input has type unknown
+SELECT dup('xyz'::text);
+ dup
+-------------------
+ (xyz,"{xyz,xyz}")
+(1 row)
+
+SELECT * FROM dup('xyz'::text);
+ f2 | f3
+-----+-----------
+ xyz | {xyz,xyz}
+(1 row)
+
+-- fails, as we are attempting to rename first argument
+CREATE OR REPLACE FUNCTION dup (inout f2 anyelement, out f3 anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+ERROR: cannot change name of input parameter "f1"
+HINT: Use DROP FUNCTION dup(anyelement) first.
+DROP FUNCTION dup(anyelement);
+-- equivalent behavior, though different name exposed for input arg
+CREATE OR REPLACE FUNCTION dup (inout f2 anyelement, out f3 anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+SELECT dup(22);
+ dup
+----------------
+ (22,"{22,22}")
+(1 row)
+
+DROP FUNCTION dup(anyelement);
+-- fails, no way to deduce outputs
+CREATE FUNCTION bad (f1 int, out f2 anyelement, out f3 anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+ERROR: cannot determine result data type
+DETAIL: A function returning a polymorphic type must have at least one polymorphic argument.
+--
+-- table functions
+--
+CREATE OR REPLACE FUNCTION rngfunc()
+RETURNS TABLE(a int)
+AS $$ SELECT a FROM generate_series(1,5) a(a) $$ LANGUAGE sql;
+SELECT * FROM rngfunc();
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+DROP FUNCTION rngfunc();
+CREATE OR REPLACE FUNCTION rngfunc(int)
+RETURNS TABLE(a int, b int)
+AS $$ SELECT a, b
+ FROM generate_series(1,$1) a(a),
+ generate_series(1,$1) b(b) $$ LANGUAGE sql;
+SELECT * FROM rngfunc(3);
+ a | b
+---+---
+ 1 | 1
+ 1 | 2
+ 1 | 3
+ 2 | 1
+ 2 | 2
+ 2 | 3
+ 3 | 1
+ 3 | 2
+ 3 | 3
+(9 rows)
+
+DROP FUNCTION rngfunc(int);
+-- case that causes change of typmod knowledge during inlining
+CREATE OR REPLACE FUNCTION rngfunc()
+RETURNS TABLE(a varchar(5))
+AS $$ SELECT 'hello'::varchar(5) $$ LANGUAGE sql STABLE;
+SELECT * FROM rngfunc() GROUP BY 1;
+ a
+-------
+ hello
+(1 row)
+
+DROP FUNCTION rngfunc();
+--
+-- some tests on SQL functions with RETURNING
+--
+create temp table tt(f1 serial, data text);
+create function insert_tt(text) returns int as
+$$ insert into tt(data) values($1) returning f1 $$
+language sql;
+select insert_tt('foo');
+ insert_tt
+-----------
+ 1
+(1 row)
+
+select insert_tt('bar');
+ insert_tt
+-----------
+ 2
+(1 row)
+
+select * from tt;
+ f1 | data
+----+------
+ 1 | foo
+ 2 | bar
+(2 rows)
+
+-- insert will execute to completion even if function needs just 1 row
+create or replace function insert_tt(text) returns int as
+$$ insert into tt(data) values($1),($1||$1) returning f1 $$
+language sql;
+select insert_tt('fool');
+ insert_tt
+-----------
+ 3
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+(4 rows)
+
+-- setof does what's expected
+create or replace function insert_tt2(text,text) returns setof int as
+$$ insert into tt(data) values($1),($2) returning f1 $$
+language sql;
+select insert_tt2('foolish','barrish');
+ insert_tt2
+------------
+ 5
+ 6
+(2 rows)
+
+select * from insert_tt2('baz','quux');
+ insert_tt2
+------------
+ 7
+ 8
+(2 rows)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+(8 rows)
+
+-- limit doesn't prevent execution to completion
+select insert_tt2('foolish','barrish') limit 1;
+ insert_tt2
+------------
+ 9
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+ 9 | foolish
+ 10 | barrish
+(10 rows)
+
+-- triggers will fire, too
+create function noticetrigger() returns trigger as $$
+begin
+ raise notice 'noticetrigger % %', new.f1, new.data;
+ return null;
+end $$ language plpgsql;
+create trigger tnoticetrigger after insert on tt for each row
+execute procedure noticetrigger();
+select insert_tt2('foolme','barme') limit 1;
+NOTICE: noticetrigger 11 foolme
+NOTICE: noticetrigger 12 barme
+ insert_tt2
+------------
+ 11
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+ 9 | foolish
+ 10 | barrish
+ 11 | foolme
+ 12 | barme
+(12 rows)
+
+-- and rules work
+create temp table tt_log(f1 int, data text);
+create rule insert_tt_rule as on insert to tt do also
+ insert into tt_log values(new.*);
+select insert_tt2('foollog','barlog') limit 1;
+NOTICE: noticetrigger 13 foollog
+NOTICE: noticetrigger 14 barlog
+ insert_tt2
+------------
+ 13
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+ 9 | foolish
+ 10 | barrish
+ 11 | foolme
+ 12 | barme
+ 13 | foollog
+ 14 | barlog
+(14 rows)
+
+-- note that nextval() gets executed a second time in the rule expansion,
+-- which is expected.
+select * from tt_log;
+ f1 | data
+----+---------
+ 15 | foollog
+ 16 | barlog
+(2 rows)
+
+-- test case for a whole-row-variable bug
+create function rngfunc1(n integer, out a text, out b text)
+ returns setof record
+ language sql
+ as $$ select 'foo ' || i, 'bar ' || i from generate_series(1,$1) i $$;
+set work_mem='64kB';
+select t.a, t, t.a from rngfunc1(10000) t limit 1;
+ a | t | a
+-------+-------------------+-------
+ foo 1 | ("foo 1","bar 1") | foo 1
+(1 row)
+
+reset work_mem;
+select t.a, t, t.a from rngfunc1(10000) t limit 1;
+ a | t | a
+-------+-------------------+-------
+ foo 1 | ("foo 1","bar 1") | foo 1
+(1 row)
+
+drop function rngfunc1(n integer);
+-- test use of SQL functions returning record
+-- this is supported in some cases where the query doesn't specify
+-- the actual record type ...
+create function array_to_set(anyarray) returns setof record as $$
+ select i AS "index", $1[i] AS "value" from generate_subscripts($1, 1) i
+$$ language sql strict immutable;
+select array_to_set(array['one', 'two']);
+ array_to_set
+--------------
+ (1,one)
+ (2,two)
+(2 rows)
+
+select * from array_to_set(array['one', 'two']) as t(f1 int,f2 text);
+ f1 | f2
+----+-----
+ 1 | one
+ 2 | two
+(2 rows)
+
+select * from array_to_set(array['one', 'two']); -- fail
+ERROR: a column definition list is required for functions returning "record"
+LINE 1: select * from array_to_set(array['one', 'two']);
+ ^
+create temp table rngfunc(f1 int8, f2 int8);
+create function testrngfunc() returns record as $$
+ insert into rngfunc values (1,2) returning *;
+$$ language sql;
+select testrngfunc();
+ testrngfunc
+-------------
+ (1,2)
+(1 row)
+
+select * from testrngfunc() as t(f1 int8,f2 int8);
+ f1 | f2
+----+----
+ 1 | 2
+(1 row)
+
+select * from testrngfunc(); -- fail
+ERROR: a column definition list is required for functions returning "record"
+LINE 1: select * from testrngfunc();
+ ^
+drop function testrngfunc();
+create function testrngfunc() returns setof record as $$
+ insert into rngfunc values (1,2), (3,4) returning *;
+$$ language sql;
+select testrngfunc();
+ testrngfunc
+-------------
+ (1,2)
+ (3,4)
+(2 rows)
+
+select * from testrngfunc() as t(f1 int8,f2 int8);
+ f1 | f2
+----+----
+ 1 | 2
+ 3 | 4
+(2 rows)
+
+select * from testrngfunc(); -- fail
+ERROR: a column definition list is required for functions returning "record"
+LINE 1: select * from testrngfunc();
+ ^
+drop function testrngfunc();
+--
+-- Check some cases involving added/dropped columns in a rowtype result
+--
+create temp table users (userid text, seq int, email text, todrop bool, moredrop int, enabled bool);
+insert into users values ('id',1,'email',true,11,true);
+insert into users values ('id2',2,'email2',true,12,true);
+alter table users drop column todrop;
+create or replace function get_first_user() returns users as
+$$ SELECT * FROM users ORDER BY userid LIMIT 1; $$
+language sql stable;
+SELECT get_first_user();
+ get_first_user
+-------------------
+ (id,1,email,11,t)
+(1 row)
+
+SELECT * FROM get_first_user();
+ userid | seq | email | moredrop | enabled
+--------+-----+-------+----------+---------
+ id | 1 | email | 11 | t
+(1 row)
+
+create or replace function get_users() returns setof users as
+$$ SELECT * FROM users ORDER BY userid; $$
+language sql stable;
+SELECT get_users();
+ get_users
+---------------------
+ (id,1,email,11,t)
+ (id2,2,email2,12,t)
+(2 rows)
+
+SELECT * FROM get_users();
+ userid | seq | email | moredrop | enabled
+--------+-----+--------+----------+---------
+ id | 1 | email | 11 | t
+ id2 | 2 | email2 | 12 | t
+(2 rows)
+
+SELECT * FROM get_users() WITH ORDINALITY; -- make sure ordinality copes
+ userid | seq | email | moredrop | enabled | ordinality
+--------+-----+--------+----------+---------+------------
+ id | 1 | email | 11 | t | 1
+ id2 | 2 | email2 | 12 | t | 2
+(2 rows)
+
+-- multiple functions vs. dropped columns
+SELECT * FROM ROWS FROM(generate_series(10,11), get_users()) WITH ORDINALITY;
+ generate_series | userid | seq | email | moredrop | enabled | ordinality
+-----------------+--------+-----+--------+----------+---------+------------
+ 10 | id | 1 | email | 11 | t | 1
+ 11 | id2 | 2 | email2 | 12 | t | 2
+(2 rows)
+
+SELECT * FROM ROWS FROM(get_users(), generate_series(10,11)) WITH ORDINALITY;
+ userid | seq | email | moredrop | enabled | generate_series | ordinality
+--------+-----+--------+----------+---------+-----------------+------------
+ id | 1 | email | 11 | t | 10 | 1
+ id2 | 2 | email2 | 12 | t | 11 | 2
+(2 rows)
+
+-- check that we can cope with post-parsing changes in rowtypes
+create temp view usersview as
+SELECT * FROM ROWS FROM(get_users(), generate_series(10,11)) WITH ORDINALITY;
+select * from usersview;
+ userid | seq | email | moredrop | enabled | generate_series | ordinality
+--------+-----+--------+----------+---------+-----------------+------------
+ id | 1 | email | 11 | t | 10 | 1
+ id2 | 2 | email2 | 12 | t | 11 | 2
+(2 rows)
+
+alter table users add column junk text;
+select * from usersview;
+ userid | seq | email | moredrop | enabled | generate_series | ordinality
+--------+-----+--------+----------+---------+-----------------+------------
+ id | 1 | email | 11 | t | 10 | 1
+ id2 | 2 | email2 | 12 | t | 11 | 2
+(2 rows)
+
+begin;
+alter table users drop column moredrop;
+select * from usersview; -- expect clean failure
+ERROR: attribute 5 of type record has been dropped
+rollback;
+alter table users alter column seq type numeric;
+select * from usersview; -- expect clean failure
+ERROR: attribute 2 of type record has wrong type
+DETAIL: Table has type numeric, but query expects integer.
+drop view usersview;
+drop function get_first_user();
+drop function get_users();
+drop table users;
+-- this won't get inlined because of type coercion, but it shouldn't fail
+create or replace function rngfuncbar() returns setof text as
+$$ select 'foo'::varchar union all select 'bar'::varchar ; $$
+language sql stable;
+select rngfuncbar();
+ rngfuncbar
+------------
+ foo
+ bar
+(2 rows)
+
+select * from rngfuncbar();
+ rngfuncbar
+------------
+ foo
+ bar
+(2 rows)
+
+drop function rngfuncbar();
+-- check handling of a SQL function with multiple OUT params (bug #5777)
+create or replace function rngfuncbar(out integer, out numeric) as
+$$ select (1, 2.1) $$ language sql;
+select * from rngfuncbar();
+ column1 | column2
+---------+---------
+ 1 | 2.1
+(1 row)
+
+create or replace function rngfuncbar(out integer, out numeric) as
+$$ select (1, 2) $$ language sql;
+select * from rngfuncbar(); -- fail
+ERROR: function return row and query-specified return row do not match
+DETAIL: Returned type integer at ordinal position 2, but query expects numeric.
+create or replace function rngfuncbar(out integer, out numeric) as
+$$ select (1, 2.1, 3) $$ language sql;
+select * from rngfuncbar(); -- fail
+ERROR: function return row and query-specified return row do not match
+DETAIL: Returned row contains 3 attributes, but query expects 2.
+drop function rngfuncbar();
+-- check whole-row-Var handling in nested lateral functions (bug #11703)
+create function extractq2(t int8_tbl) returns int8 as $$
+ select t.q2
+$$ language sql immutable;
+explain (verbose, costs off)
+select x from int8_tbl, extractq2(int8_tbl) f(x);
+ QUERY PLAN
+------------------------------------
+ Nested Loop
+ Output: f.x
+ -> Seq Scan on public.int8_tbl
+ Output: int8_tbl.q2
+ -> Function Scan on f
+ Output: f.x
+ Function Call: int8_tbl.q2
+(7 rows)
+
+select x from int8_tbl, extractq2(int8_tbl) f(x);
+ x
+-------------------
+ 456
+ 4567890123456789
+ 123
+ 4567890123456789
+ -4567890123456789
+(5 rows)
+
+create function extractq2_2(t int8_tbl) returns table(ret1 int8) as $$
+ select extractq2(t) offset 0
+$$ language sql immutable;
+explain (verbose, costs off)
+select x from int8_tbl, extractq2_2(int8_tbl) f(x);
+ QUERY PLAN
+-----------------------------------
+ Nested Loop
+ Output: ((int8_tbl.*).q2)
+ -> Seq Scan on public.int8_tbl
+ Output: int8_tbl.*
+ -> Result
+ Output: (int8_tbl.*).q2
+(6 rows)
+
+select x from int8_tbl, extractq2_2(int8_tbl) f(x);
+ x
+-------------------
+ 456
+ 4567890123456789
+ 123
+ 4567890123456789
+ -4567890123456789
+(5 rows)
+
+-- without the "offset 0", this function gets optimized quite differently
+create function extractq2_2_opt(t int8_tbl) returns table(ret1 int8) as $$
+ select extractq2(t)
+$$ language sql immutable;
+explain (verbose, costs off)
+select x from int8_tbl, extractq2_2_opt(int8_tbl) f(x);
+ QUERY PLAN
+-----------------------------
+ Seq Scan on public.int8_tbl
+ Output: int8_tbl.q2
+(2 rows)
+
+select x from int8_tbl, extractq2_2_opt(int8_tbl) f(x);
+ x
+-------------------
+ 456
+ 4567890123456789
+ 123
+ 4567890123456789
+ -4567890123456789
+(5 rows)
+
+-- check handling of nulls in SRF results (bug #7808)
+create type rngfunc2 as (a integer, b text);
+select *, row_to_json(u) from unnest(array[(1,'foo')::rngfunc2, null::rngfunc2]) u;
+ a | b | row_to_json
+---+-----+---------------------
+ 1 | foo | {"a":1,"b":"foo"}
+ | | {"a":null,"b":null}
+(2 rows)
+
+select *, row_to_json(u) from unnest(array[null::rngfunc2, null::rngfunc2]) u;
+ a | b | row_to_json
+---+---+---------------------
+ | | {"a":null,"b":null}
+ | | {"a":null,"b":null}
+(2 rows)
+
+select *, row_to_json(u) from unnest(array[null::rngfunc2, (1,'foo')::rngfunc2, null::rngfunc2]) u;
+ a | b | row_to_json
+---+-----+---------------------
+ | | {"a":null,"b":null}
+ 1 | foo | {"a":1,"b":"foo"}
+ | | {"a":null,"b":null}
+(3 rows)
+
+select *, row_to_json(u) from unnest(array[]::rngfunc2[]) u;
+ a | b | row_to_json
+---+---+-------------
+(0 rows)
+
+drop type rngfunc2;
diff --git src/test/regress/expected/reloptions_1.out src/test/regress/expected/reloptions_1.out
new file mode 100644
index 0000000000..d766b195f8
--- /dev/null
+++ src/test/regress/expected/reloptions_1.out
@@ -0,0 +1,219 @@
+-- Simple create
+CREATE TABLE reloptions_test(i INT) WITH (FiLLFaCToR=30,
+ autovacuum_enabled = false, autovacuum_analyze_scale_factor = 0.2);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+------------------------------------------------------------------------------
+ {fillfactor=30,autovacuum_enabled=false,autovacuum_analyze_scale_factor=0.2}
+(1 row)
+
+-- Fail min/max values check
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=2);
+ERROR: value 2 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=110);
+ERROR: value 110 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor = -10.0);
+ERROR: value -10.0 out of bounds for option "autovacuum_analyze_scale_factor"
+DETAIL: Valid values are between "0.000000" and "100.000000".
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor = 110.0);
+ERROR: value 110.0 out of bounds for option "autovacuum_analyze_scale_factor"
+DETAIL: Valid values are between "0.000000" and "100.000000".
+-- Fail when option and namespace do not exist
+CREATE TABLE reloptions_test2(i INT) WITH (not_existing_option=2);
+ERROR: unrecognized parameter "not_existing_option"
+CREATE TABLE reloptions_test2(i INT) WITH (not_existing_namespace.fillfactor=2);
+ERROR: unrecognized parameter namespace "not_existing_namespace"
+-- Fail while setting improper values
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=-30.1);
+ERROR: value -30.1 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor='string');
+ERROR: invalid value for integer option "fillfactor": string
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=true);
+ERROR: invalid value for integer option "fillfactor": true
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_enabled=12);
+ERROR: invalid value for boolean option "autovacuum_enabled": 12
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_enabled=30.5);
+ERROR: invalid value for boolean option "autovacuum_enabled": 30.5
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_enabled='string');
+ERROR: invalid value for boolean option "autovacuum_enabled": string
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor='string');
+ERROR: invalid value for floating point option "autovacuum_analyze_scale_factor": string
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor=true);
+ERROR: invalid value for floating point option "autovacuum_analyze_scale_factor": true
+-- Fail if option is specified twice
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=30, fillfactor=40);
+ERROR: parameter "fillfactor" specified more than once
+-- Specifying name only for a non-Boolean option should fail
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor);
+ERROR: invalid value for integer option "fillfactor": true
+-- Simple ALTER TABLE
+ALTER TABLE reloptions_test SET (fillfactor=31,
+ autovacuum_analyze_scale_factor = 0.3);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+------------------------------------------------------------------------------
+ {autovacuum_enabled=false,fillfactor=31,autovacuum_analyze_scale_factor=0.3}
+(1 row)
+
+-- Set boolean option to true without specifying value
+ALTER TABLE reloptions_test SET (autovacuum_enabled, fillfactor=32);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+-----------------------------------------------------------------------------
+ {autovacuum_analyze_scale_factor=0.3,autovacuum_enabled=true,fillfactor=32}
+(1 row)
+
+-- Check that RESET works well
+ALTER TABLE reloptions_test RESET (fillfactor);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+---------------------------------------------------------------
+ {autovacuum_analyze_scale_factor=0.3,autovacuum_enabled=true}
+(1 row)
+
+-- Resetting all values causes the column to become null
+ALTER TABLE reloptions_test RESET (autovacuum_enabled,
+ autovacuum_analyze_scale_factor);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass AND
+ reloptions IS NULL;
+ reloptions
+------------
+
+(1 row)
+
+-- RESET fails if a value is specified
+ALTER TABLE reloptions_test RESET (fillfactor=12);
+ERROR: RESET must not include values for parameters
+-- Test vacuum_truncate option
+DROP TABLE reloptions_test;
+CREATE TABLE reloptions_test(i INT NOT NULL, j text)
+ WITH (vacuum_truncate=false,
+ toast.vacuum_truncate=false,
+ autovacuum_enabled=false);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+--------------------------------------------------
+ {vacuum_truncate=false,autovacuum_enabled=false}
+(1 row)
+
+INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
+ERROR: null value in column "i" of relation "reloptions_test" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+VACUUM reloptions_test;
+SELECT pg_relation_size('reloptions_test') > 0;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT reloptions FROM pg_class WHERE oid =
+ (SELECT reltoastrelid FROM pg_class
+ WHERE oid = 'reloptions_test'::regclass);
+ reloptions
+------------
+(0 rows)
+
+ALTER TABLE reloptions_test RESET (vacuum_truncate);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+----------------------------
+ {autovacuum_enabled=false}
+(1 row)
+
+INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
+ERROR: null value in column "i" of relation "reloptions_test" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+VACUUM reloptions_test;
+SELECT pg_relation_size('reloptions_test') = 0;
+ ?column?
+----------
+ f
+(1 row)
+
+-- Test toast.* options
+DROP TABLE reloptions_test;
+CREATE TABLE reloptions_test (s VARCHAR)
+ WITH (toast.autovacuum_vacuum_cost_delay = 23);
+SELECT reltoastrelid as toast_oid
+ FROM pg_class WHERE oid = 'reloptions_test'::regclass \gset
+SELECT reloptions FROM pg_class WHERE oid = :toast_oid;
+ reloptions
+------------
+(0 rows)
+
+ALTER TABLE reloptions_test SET (toast.autovacuum_vacuum_cost_delay = 24);
+SELECT reloptions FROM pg_class WHERE oid = :toast_oid;
+ reloptions
+------------
+(0 rows)
+
+ALTER TABLE reloptions_test RESET (toast.autovacuum_vacuum_cost_delay);
+SELECT reloptions FROM pg_class WHERE oid = :toast_oid;
+ reloptions
+------------
+(0 rows)
+
+-- Fail on non-existent options in toast namespace
+CREATE TABLE reloptions_test2 (i int) WITH (toast.not_existing_option = 42);
+ERROR: unrecognized parameter "not_existing_option"
+-- Mix TOAST & heap
+DROP TABLE reloptions_test;
+CREATE TABLE reloptions_test (s VARCHAR) WITH
+ (toast.autovacuum_vacuum_cost_delay = 23,
+ autovacuum_vacuum_cost_delay = 24, fillfactor = 40);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+-------------------------------------------------
+ {autovacuum_vacuum_cost_delay=24,fillfactor=40}
+(1 row)
+
+SELECT reloptions FROM pg_class WHERE oid = (
+ SELECT reltoastrelid FROM pg_class WHERE oid = 'reloptions_test'::regclass);
+ reloptions
+------------
+(0 rows)
+
+--
+-- CREATE INDEX, ALTER INDEX for btrees
+--
+CREATE INDEX reloptions_test_idx ON reloptions_test (s) WITH (fillfactor=30);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test_idx'::regclass;
+ reloptions
+-----------------
+ {fillfactor=30}
+(1 row)
+
+-- Fail when option and namespace do not exist
+CREATE INDEX reloptions_test_idx ON reloptions_test (s)
+ WITH (not_existing_option=2);
+ERROR: unrecognized parameter "not_existing_option"
+CREATE INDEX reloptions_test_idx ON reloptions_test (s)
+ WITH (not_existing_ns.fillfactor=2);
+ERROR: unrecognized parameter namespace "not_existing_ns"
+-- Check allowed ranges
+CREATE INDEX reloptions_test_idx2 ON reloptions_test (s) WITH (fillfactor=1);
+ERROR: value 1 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE INDEX reloptions_test_idx2 ON reloptions_test (s) WITH (fillfactor=130);
+ERROR: value 130 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+-- Check ALTER
+ALTER INDEX reloptions_test_idx SET (fillfactor=40);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test_idx'::regclass;
+ reloptions
+-----------------
+ {fillfactor=40}
+(1 row)
+
+-- Check ALTER on empty reloption list
+CREATE INDEX reloptions_test_idx3 ON reloptions_test (s);
+ALTER INDEX reloptions_test_idx3 SET (fillfactor=40);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test_idx3'::regclass;
+ reloptions
+-----------------
+ {fillfactor=40}
+(1 row)
+
diff --git src/test/regress/expected/rowsecurity_1.out src/test/regress/expected/rowsecurity_1.out
new file mode 100644
index 0000000000..79b1f73e08
--- /dev/null
+++ src/test/regress/expected/rowsecurity_1.out
@@ -0,0 +1,4052 @@
+--
+-- Test of Row-level security feature
+--
+-- Clean up in case a prior regression run failed
+-- Suppress NOTICE messages when users/groups don't exist
+SET client_min_messages TO 'warning';
+DROP USER IF EXISTS regress_rls_alice;
+DROP USER IF EXISTS regress_rls_bob;
+DROP USER IF EXISTS regress_rls_carol;
+DROP USER IF EXISTS regress_rls_dave;
+DROP USER IF EXISTS regress_rls_exempt_user;
+DROP ROLE IF EXISTS regress_rls_group1;
+DROP ROLE IF EXISTS regress_rls_group2;
+DROP SCHEMA IF EXISTS regress_rls_schema CASCADE;
+RESET client_min_messages;
+-- initial setup
+CREATE USER regress_rls_alice NOLOGIN;
+CREATE USER regress_rls_bob NOLOGIN;
+CREATE USER regress_rls_carol NOLOGIN;
+CREATE USER regress_rls_dave NOLOGIN;
+CREATE USER regress_rls_exempt_user BYPASSRLS NOLOGIN;
+CREATE ROLE regress_rls_group1 NOLOGIN;
+CREATE ROLE regress_rls_group2 NOLOGIN;
+GRANT regress_rls_group1 TO regress_rls_bob;
+GRANT regress_rls_group2 TO regress_rls_carol;
+CREATE SCHEMA regress_rls_schema;
+GRANT ALL ON SCHEMA regress_rls_schema to public;
+SET search_path = regress_rls_schema;
+-- setup of malicious function
+CREATE OR REPLACE FUNCTION f_leak(text) RETURNS bool
+ COST 0.0000001 LANGUAGE plpgsql
+ AS 'BEGIN RAISE NOTICE ''f_leak => %'', $1; RETURN true; END';
+GRANT EXECUTE ON FUNCTION f_leak(text) TO public;
+-- BASIC Row-Level Security Scenario
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE uaccount (
+ pguser name primary key,
+ seclv int
+);
+GRANT SELECT ON uaccount TO public;
+INSERT INTO uaccount VALUES
+ ('regress_rls_alice', 99),
+ ('regress_rls_bob', 1),
+ ('regress_rls_carol', 2),
+ ('regress_rls_dave', 3);
+CREATE TABLE category (
+ cid int primary key,
+ cname text
+);
+GRANT ALL ON category TO public;
+INSERT INTO category VALUES
+ (11, 'novel'),
+ (22, 'science fiction'),
+ (33, 'technology'),
+ (44, 'manga');
+CREATE TABLE document (
+ did int primary key,
+ cid int references category(cid),
+ dlevel int not null,
+ dauthor name,
+ dtitle text
+);
+GRANT ALL ON document TO public;
+INSERT INTO document VALUES
+ ( 1, 11, 1, 'regress_rls_bob', 'my first novel'),
+ ( 2, 11, 2, 'regress_rls_bob', 'my second novel'),
+ ( 3, 22, 2, 'regress_rls_bob', 'my science fiction'),
+ ( 4, 44, 1, 'regress_rls_bob', 'my first manga'),
+ ( 5, 44, 2, 'regress_rls_bob', 'my second manga'),
+ ( 6, 22, 1, 'regress_rls_carol', 'great science fiction'),
+ ( 7, 33, 2, 'regress_rls_carol', 'great technology book'),
+ ( 8, 44, 1, 'regress_rls_carol', 'great manga'),
+ ( 9, 22, 1, 'regress_rls_dave', 'awesome science fiction'),
+ (10, 33, 2, 'regress_rls_dave', 'awesome technology book');
+ALTER TABLE document ENABLE ROW LEVEL SECURITY;
+-- user's security level must be higher than or equal to document's
+CREATE POLICY p1 ON document AS PERMISSIVE
+ USING (dlevel <= (SELECT seclv FROM uaccount WHERE pguser = current_user));
+-- try to create a policy of bogus type
+CREATE POLICY p1 ON document AS UGLY
+ USING (dlevel <= (SELECT seclv FROM uaccount WHERE pguser = current_user));
+ERROR: unrecognized row security option "ugly"
+LINE 1: CREATE POLICY p1 ON document AS UGLY
+ ^
+HINT: Only PERMISSIVE or RESTRICTIVE policies are supported currently.
+-- but Dave isn't allowed to anything at cid 50 or above
+-- this is to make sure that we sort the policies by name first
+-- when applying WITH CHECK, a later INSERT by Dave should fail due
+-- to p1r first
+CREATE POLICY p2r ON document AS RESTRICTIVE TO regress_rls_dave
+ USING (cid <> 44 AND cid < 50);
+-- and Dave isn't allowed to see manga documents
+CREATE POLICY p1r ON document AS RESTRICTIVE TO regress_rls_dave
+ USING (cid <> 44);
+\dp
+ Access privileges
+ Schema | Name | Type | Access privileges | Column privileges | Policies
+--------------------+----------+-------+---------------------------------------------+-------------------+--------------------------------------------
+ regress_rls_schema | category | table | regress_rls_alice=arwdDxt/regress_rls_alice+| |
+ | | | =arwdDxt/regress_rls_alice | |
+ regress_rls_schema | document | table | regress_rls_alice=arwdDxt/regress_rls_alice+| | p1: +
+ | | | =arwdDxt/regress_rls_alice | | (u): (dlevel <= ( SELECT uaccount.seclv +
+ | | | | | FROM uaccount +
+ | | | | | WHERE (uaccount.pguser = CURRENT_USER)))+
+ | | | | | p2r (RESTRICTIVE): +
+ | | | | | (u): ((cid <> 44) AND (cid < 50)) +
+ | | | | | to: regress_rls_dave +
+ | | | | | p1r (RESTRICTIVE): +
+ | | | | | (u): (cid <> 44) +
+ | | | | | to: regress_rls_dave
+ regress_rls_schema | uaccount | table | regress_rls_alice=arwdDxt/regress_rls_alice+| |
+ | | | =r/regress_rls_alice | |
+(3 rows)
+
+\d document
+ Table "regress_rls_schema.document"
+ Column | Type | Collation | Nullable | Default
+---------+---------+-----------+----------+---------
+ did | integer | | not null |
+ cid | integer | | |
+ dlevel | integer | | not null |
+ dauthor | name | | |
+ dtitle | text | | |
+Indexes:
+ "document_pkey" PRIMARY KEY, btree (did)
+Foreign-key constraints:
+ "document_cid_fkey" FOREIGN KEY (cid) REFERENCES category(cid)
+Policies:
+ POLICY "p1"
+ USING ((dlevel <= ( SELECT uaccount.seclv
+ FROM uaccount
+ WHERE (uaccount.pguser = CURRENT_USER))))
+ POLICY "p1r" AS RESTRICTIVE
+ TO regress_rls_dave
+ USING ((cid <> 44))
+ POLICY "p2r" AS RESTRICTIVE
+ TO regress_rls_dave
+ USING (((cid <> 44) AND (cid < 50)))
+
+SELECT * FROM pg_policies WHERE schemaname = 'regress_rls_schema' AND tablename = 'document' ORDER BY policyname;
+ schemaname | tablename | policyname | permissive | roles | cmd | qual | with_check
+--------------------+-----------+------------+-------------+--------------------+-----+--------------------------------------------+------------
+ regress_rls_schema | document | p1 | PERMISSIVE | {public} | ALL | (dlevel <= ( SELECT uaccount.seclv +|
+ | | | | | | FROM uaccount +|
+ | | | | | | WHERE (uaccount.pguser = CURRENT_USER))) |
+ regress_rls_schema | document | p1r | RESTRICTIVE | {regress_rls_dave} | ALL | (cid <> 44) |
+ regress_rls_schema | document | p2r | RESTRICTIVE | {regress_rls_dave} | ALL | ((cid <> 44) AND (cid < 50)) |
+(3 rows)
+
+-- viewpoint from regress_rls_bob
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+(5 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-------------------+-------------------------+-----------------
+ 11 | 1 | 1 | regress_rls_bob | my first novel | novel
+ 44 | 4 | 1 | regress_rls_bob | my first manga | manga
+ 22 | 6 | 1 | regress_rls_carol | great science fiction | science fiction
+ 44 | 8 | 1 | regress_rls_carol | great manga | manga
+ 22 | 9 | 1 | regress_rls_dave | awesome science fiction | science fiction
+(5 rows)
+
+-- try a sampled version
+SELECT * FROM document TABLESAMPLE BERNOULLI(50) REPEATABLE(0)
+ WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+-- viewpoint from regress_rls_carol
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => awesome technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+(10 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => awesome technology book
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-------------------+-------------------------+-----------------
+ 11 | 1 | 1 | regress_rls_bob | my first novel | novel
+ 11 | 2 | 2 | regress_rls_bob | my second novel | novel
+ 22 | 3 | 2 | regress_rls_bob | my science fiction | science fiction
+ 44 | 4 | 1 | regress_rls_bob | my first manga | manga
+ 44 | 5 | 2 | regress_rls_bob | my second manga | manga
+ 22 | 6 | 1 | regress_rls_carol | great science fiction | science fiction
+ 33 | 7 | 2 | regress_rls_carol | great technology book | technology
+ 44 | 8 | 1 | regress_rls_carol | great manga | manga
+ 22 | 9 | 1 | regress_rls_dave | awesome science fiction | science fiction
+ 33 | 10 | 2 | regress_rls_dave | awesome technology book | technology
+(10 rows)
+
+-- try a sampled version
+SELECT * FROM document TABLESAMPLE BERNOULLI(50) REPEATABLE(0)
+ WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document WHERE f_leak(dtitle);
+ QUERY PLAN
+----------------------------------------------------
+ Seq Scan on document
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle);
+ QUERY PLAN
+-----------------------------------------------------------
+ Hash Join
+ Hash Cond: (category.cid = document.cid)
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on category
+ -> Hash
+ -> Seq Scan on document
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+(9 rows)
+
+-- viewpoint from regress_rls_dave
+SET SESSION AUTHORIZATION regress_rls_dave;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => awesome technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+(7 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => awesome technology book
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-------------------+-------------------------+-----------------
+ 11 | 1 | 1 | regress_rls_bob | my first novel | novel
+ 11 | 2 | 2 | regress_rls_bob | my second novel | novel
+ 22 | 3 | 2 | regress_rls_bob | my science fiction | science fiction
+ 22 | 6 | 1 | regress_rls_carol | great science fiction | science fiction
+ 33 | 7 | 2 | regress_rls_carol | great technology book | technology
+ 22 | 9 | 1 | regress_rls_dave | awesome science fiction | science fiction
+ 33 | 10 | 2 | regress_rls_dave | awesome technology book | technology
+(7 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document WHERE f_leak(dtitle);
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
+ Seq Scan on document
+ Filter: ((cid <> 44) AND (cid <> 44) AND (cid < 50) AND (dlevel <= $0) AND f_leak(dtitle))
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle);
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------
+ Hash Join
+ Hash Cond: (category.cid = document.cid)
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on category
+ -> Hash
+ -> Seq Scan on document
+ Filter: ((cid <> 44) AND (cid <> 44) AND (cid < 50) AND (dlevel <= $0) AND f_leak(dtitle))
+(9 rows)
+
+-- 44 would technically fail for both p2r and p1r, but we should get an error
+-- back from p1r for this because it sorts first
+INSERT INTO document VALUES (100, 44, 1, 'regress_rls_dave', 'testing sorting of policies'); -- fail
+ERROR: new row violates row-level security policy "p1r" for table "document"
+-- Just to see a p2r error
+INSERT INTO document VALUES (100, 55, 1, 'regress_rls_dave', 'testing sorting of policies'); -- fail
+ERROR: new row violates row-level security policy "p2r" for table "document"
+-- only owner can change policies
+ALTER POLICY p1 ON document USING (true); --fail
+ERROR: must be owner of table document
+DROP POLICY p1 ON document; --fail
+ERROR: must be owner of relation document
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY p1 ON document USING (dauthor = current_user);
+-- viewpoint from regress_rls_bob again
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+--------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+(5 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER by did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-----------------+--------------------+-----------------
+ 11 | 1 | 1 | regress_rls_bob | my first novel | novel
+ 11 | 2 | 2 | regress_rls_bob | my second novel | novel
+ 22 | 3 | 2 | regress_rls_bob | my science fiction | science fiction
+ 44 | 4 | 1 | regress_rls_bob | my first manga | manga
+ 44 | 5 | 2 | regress_rls_bob | my second manga | manga
+(5 rows)
+
+-- viewpoint from regress_rls_carol again
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => great manga
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+(3 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER by did;
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => great manga
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-------------------+-----------------------+-----------------
+ 22 | 6 | 1 | regress_rls_carol | great science fiction | science fiction
+ 33 | 7 | 2 | regress_rls_carol | great technology book | technology
+ 44 | 8 | 1 | regress_rls_carol | great manga | manga
+(3 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document WHERE f_leak(dtitle);
+ QUERY PLAN
+---------------------------------------------------------
+ Seq Scan on document
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle);
+ QUERY PLAN
+---------------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on document
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+ -> Index Scan using category_pkey on category
+ Index Cond: (cid = document.cid)
+(5 rows)
+
+-- interaction of FK/PK constraints
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE POLICY p2 ON category
+ USING (CASE WHEN current_user = 'regress_rls_bob' THEN cid IN (11, 33)
+ WHEN current_user = 'regress_rls_carol' THEN cid IN (22, 44)
+ ELSE false END);
+ALTER TABLE category ENABLE ROW LEVEL SECURITY;
+-- cannot delete PK referenced by invisible FK
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM document d FULL OUTER JOIN category c on d.cid = c.cid ORDER BY d.did, c.cid;
+ did | cid | dlevel | dauthor | dtitle | cid | cname
+-----+-----+--------+-----------------+--------------------+-----+------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel | 11 | novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel | 11 | novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction | |
+ 4 | 44 | 1 | regress_rls_bob | my first manga | |
+ 5 | 44 | 2 | regress_rls_bob | my second manga | |
+ | | | | | 33 | technology
+(6 rows)
+
+DELETE FROM category WHERE cid = 33; -- fails with FK violation
+ERROR: update or delete on table "category" violates foreign key constraint "document_cid_fkey" on table "document"
+DETAIL: Key is still referenced from table "document".
+-- can insert FK referencing invisible PK
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM document d FULL OUTER JOIN category c on d.cid = c.cid ORDER BY d.did, c.cid;
+ did | cid | dlevel | dauthor | dtitle | cid | cname
+-----+-----+--------+-------------------+-----------------------+-----+-----------------
+ 6 | 22 | 1 | regress_rls_carol | great science fiction | 22 | science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book | |
+ 8 | 44 | 1 | regress_rls_carol | great manga | 44 | manga
+(3 rows)
+
+INSERT INTO document VALUES (11, 33, 1, current_user, 'hoge');
+-- UNIQUE or PRIMARY KEY constraint violation DOES reveal presence of row
+SET SESSION AUTHORIZATION regress_rls_bob;
+INSERT INTO document VALUES (8, 44, 1, 'regress_rls_bob', 'my third manga'); -- Must fail with unique violation, revealing presence of did we can't see
+ERROR: duplicate key value violates unique constraint "document_pkey"
+SELECT * FROM document WHERE did = 8; -- and confirm we can't see it
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+---------+--------
+(0 rows)
+
+-- RLS policies are checked before constraints
+INSERT INTO document VALUES (8, 44, 1, 'regress_rls_carol', 'my third manga'); -- Should fail with RLS check violation, not duplicate key violation
+ERROR: new row violates row-level security policy for table "document"
+UPDATE document SET did = 8, dauthor = 'regress_rls_carol' WHERE did = 5; -- Should fail with RLS check violation, not duplicate key violation
+ERROR: new row violates row-level security policy for table "document"
+-- database superuser does bypass RLS policy when enabled
+RESET SESSION AUTHORIZATION;
+SET row_security TO ON;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+-- database superuser does bypass RLS policy when disabled
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+-- database non-superuser with bypass privilege can bypass RLS policy when disabled
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+-- RLS policy does not apply to table owner when RLS enabled.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+-- RLS policy does not apply to table owner when RLS disabled.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO OFF;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+--
+-- Table inheritance and RLS policy
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+CREATE TABLE t1 (id int not null primary key, a int, junk1 text, b text);
+ALTER TABLE t1 DROP COLUMN junk1; -- just a disturbing factor
+GRANT ALL ON t1 TO public;
+COPY t1 FROM stdin WITH ;
+CREATE TABLE t2 (c float) INHERITS (t1);
+GRANT ALL ON t2 TO public;
+COPY t2 FROM stdin;
+CREATE TABLE t3 (id int not null primary key, c text, b text, a int);
+ALTER TABLE t3 INHERIT t1;
+GRANT ALL ON t3 TO public;
+COPY t3(id, a,b,c) FROM stdin;
+CREATE POLICY p1 ON t1 FOR ALL TO PUBLIC USING (a % 2 = 0); -- be even number
+CREATE POLICY p2 ON t2 FOR ALL TO PUBLIC USING (a % 2 = 1); -- be odd number
+ALTER TABLE t1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE t2 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM t1;
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 104 | 4 | dad
+ 202 | 2 | bcd
+ 204 | 4 | def
+ 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+-------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t2 t1_1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t3 t1_2
+ Filter: ((a % 2) = 0)
+(7 rows)
+
+SELECT * FROM t1 WHERE f_leak(b);
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => yyy
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 104 | 4 | dad
+ 202 | 2 | bcd
+ 204 | 4 | def
+ 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(7 rows)
+
+-- reference to system column
+SELECT tableoid::regclass, * FROM t1;
+ tableoid | id | a | b
+----------+-----+---+-----
+ t1 | 102 | 2 | bbb
+ t1 | 104 | 4 | dad
+ t2 | 202 | 2 | bcd
+ t2 | 204 | 4 | def
+ t3 | 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT *, t1 FROM t1;
+ QUERY PLAN
+-------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t2 t1_1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t3 t1_2
+ Filter: ((a % 2) = 0)
+(7 rows)
+
+-- reference to whole-row reference
+SELECT *, t1 FROM t1;
+ id | a | b | t1
+-----+---+-----+-------------
+ 102 | 2 | bbb | (102,2,bbb)
+ 104 | 4 | dad | (104,4,dad)
+ 202 | 2 | bcd | (202,2,bcd)
+ 204 | 4 | def | (204,4,def)
+ 302 | 2 | yyy | (302,2,yyy)
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT *, t1 FROM t1;
+ QUERY PLAN
+-------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t2 t1_1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t3 t1_2
+ Filter: ((a % 2) = 0)
+(7 rows)
+
+-- for share/update lock
+SELECT * FROM t1 FOR SHARE;
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 104 | 4 | dad
+ 202 | 2 | bcd
+ 204 | 4 | def
+ 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 FOR SHARE;
+ QUERY PLAN
+-------------------------------------
+ LockRows
+ -> Append
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t2 t1_1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t3 t1_2
+ Filter: ((a % 2) = 0)
+(8 rows)
+
+SELECT * FROM t1 WHERE f_leak(b) FOR SHARE;
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => yyy
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 104 | 4 | dad
+ 202 | 2 | bcd
+ 204 | 4 | def
+ 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b) FOR SHARE;
+ QUERY PLAN
+-----------------------------------------------------
+ LockRows
+ -> Append
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(8 rows)
+
+-- union all query
+SELECT a, b, tableoid::regclass FROM t2 UNION ALL SELECT a, b, tableoid::regclass FROM t3;
+ a | b | tableoid
+---+-----+----------
+ 1 | abc | t2
+ 3 | cde | t2
+ 1 | xxx | t3
+ 2 | yyy | t3
+ 3 | zzz | t3
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT a, b, tableoid::regclass FROM t2 UNION ALL SELECT a, b, tableoid::regclass FROM t3;
+ QUERY PLAN
+-------------------------------
+ Append
+ -> Seq Scan on t2
+ Filter: ((a % 2) = 1)
+ -> Seq Scan on t3
+(4 rows)
+
+-- superuser is allowed to bypass RLS checks
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+SELECT * FROM t1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => cde
+NOTICE: f_leak => def
+NOTICE: f_leak => xxx
+NOTICE: f_leak => yyy
+NOTICE: f_leak => zzz
+ id | a | b
+-----+---+-----
+ 101 | 1 | aba
+ 102 | 2 | bbb
+ 103 | 3 | ccc
+ 104 | 4 | dad
+ 201 | 1 | abc
+ 202 | 2 | bcd
+ 203 | 3 | cde
+ 204 | 4 | def
+ 301 | 1 | xxx
+ 302 | 2 | yyy
+ 303 | 3 | zzz
+(11 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: f_leak(b)
+ -> Seq Scan on t2 t1_1
+ Filter: f_leak(b)
+ -> Seq Scan on t3 t1_2
+ Filter: f_leak(b)
+(7 rows)
+
+-- non-superuser with bypass privilege can bypass RLS policy when disabled
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+SELECT * FROM t1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => cde
+NOTICE: f_leak => def
+NOTICE: f_leak => xxx
+NOTICE: f_leak => yyy
+NOTICE: f_leak => zzz
+ id | a | b
+-----+---+-----
+ 101 | 1 | aba
+ 102 | 2 | bbb
+ 103 | 3 | ccc
+ 104 | 4 | dad
+ 201 | 1 | abc
+ 202 | 2 | bcd
+ 203 | 3 | cde
+ 204 | 4 | def
+ 301 | 1 | xxx
+ 302 | 2 | yyy
+ 303 | 3 | zzz
+(11 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: f_leak(b)
+ -> Seq Scan on t2 t1_1
+ Filter: f_leak(b)
+ -> Seq Scan on t3 t1_2
+ Filter: f_leak(b)
+(7 rows)
+
+--
+-- Partitioned Tables
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE part_document (
+ did int,
+ cid int,
+ dlevel int not null,
+ dauthor name,
+ dtitle text
+) PARTITION BY RANGE (cid);
+GRANT ALL ON part_document TO public;
+-- Create partitions for document categories
+CREATE TABLE part_document_fiction PARTITION OF part_document FOR VALUES FROM (11) to (12);
+CREATE TABLE part_document_satire PARTITION OF part_document FOR VALUES FROM (55) to (56);
+CREATE TABLE part_document_nonfiction PARTITION OF part_document FOR VALUES FROM (99) to (100);
+GRANT ALL ON part_document_fiction TO public;
+GRANT ALL ON part_document_satire TO public;
+GRANT ALL ON part_document_nonfiction TO public;
+INSERT INTO part_document VALUES
+ ( 1, 11, 1, 'regress_rls_bob', 'my first novel'),
+ ( 2, 11, 2, 'regress_rls_bob', 'my second novel'),
+ ( 3, 99, 2, 'regress_rls_bob', 'my science textbook'),
+ ( 4, 55, 1, 'regress_rls_bob', 'my first satire'),
+ ( 5, 99, 2, 'regress_rls_bob', 'my history book'),
+ ( 6, 11, 1, 'regress_rls_carol', 'great science fiction'),
+ ( 7, 99, 2, 'regress_rls_carol', 'great technology book'),
+ ( 8, 55, 2, 'regress_rls_carol', 'great satire'),
+ ( 9, 11, 1, 'regress_rls_dave', 'awesome science fiction'),
+ (10, 99, 2, 'regress_rls_dave', 'awesome technology book');
+ALTER TABLE part_document ENABLE ROW LEVEL SECURITY;
+-- Create policy on parent
+-- user's security level must be higher than or equal to document's
+CREATE POLICY pp1 ON part_document AS PERMISSIVE
+ USING (dlevel <= (SELECT seclv FROM uaccount WHERE pguser = current_user));
+-- Dave is only allowed to see cid < 55
+CREATE POLICY pp1r ON part_document AS RESTRICTIVE TO regress_rls_dave
+ USING (cid < 55);
+\d+ part_document
+ Partitioned table "regress_rls_schema.part_document"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+---------+---------+-----------+----------+---------+----------+--------------+-------------
+ did | integer | | | | plain | |
+ cid | integer | | | | plain | |
+ dlevel | integer | | not null | | plain | |
+ dauthor | name | | | | plain | |
+ dtitle | text | | | | extended | |
+Partition key: RANGE (cid)
+Policies:
+ POLICY "pp1"
+ USING ((dlevel <= ( SELECT uaccount.seclv
+ FROM uaccount
+ WHERE (uaccount.pguser = CURRENT_USER))))
+ POLICY "pp1r" AS RESTRICTIVE
+ TO regress_rls_dave
+ USING ((cid < 55))
+Partitions: part_document_fiction FOR VALUES FROM (11) TO (12),
+ part_document_nonfiction FOR VALUES FROM (99) TO (100),
+ part_document_satire FOR VALUES FROM (55) TO (56)
+
+SELECT * FROM pg_policies WHERE schemaname = 'regress_rls_schema' AND tablename like '%part_document%' ORDER BY policyname;
+ schemaname | tablename | policyname | permissive | roles | cmd | qual | with_check
+--------------------+---------------+------------+-------------+--------------------+-----+--------------------------------------------+------------
+ regress_rls_schema | part_document | pp1 | PERMISSIVE | {public} | ALL | (dlevel <= ( SELECT uaccount.seclv +|
+ | | | | | | FROM uaccount +|
+ | | | | | | WHERE (uaccount.pguser = CURRENT_USER))) |
+ regress_rls_schema | part_document | pp1r | RESTRICTIVE | {regress_rls_dave} | ALL | (cid < 55) |
+(2 rows)
+
+-- viewpoint from regress_rls_bob
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => my first satire
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on part_document_fiction part_document
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_satire part_document_1
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_nonfiction part_document_2
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+(10 rows)
+
+-- viewpoint from regress_rls_carol
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => my first satire
+NOTICE: f_leak => great satire
+NOTICE: f_leak => my science textbook
+NOTICE: f_leak => my history book
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => awesome technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+(10 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on part_document_fiction part_document
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_satire part_document_1
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_nonfiction part_document_2
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+(10 rows)
+
+-- viewpoint from regress_rls_dave
+SET SESSION AUTHORIZATION regress_rls_dave;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+--------------------------------------------------------------
+ Seq Scan on part_document_fiction part_document
+ Filter: ((cid < 55) AND (dlevel <= $0) AND f_leak(dtitle))
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+(5 rows)
+
+-- pp1 ERROR
+INSERT INTO part_document VALUES (100, 11, 5, 'regress_rls_dave', 'testing pp1'); -- fail
+ERROR: new row violates row-level security policy for table "part_document"
+-- pp1r ERROR
+INSERT INTO part_document VALUES (100, 99, 1, 'regress_rls_dave', 'testing pp1r'); -- fail
+ERROR: new row violates row-level security policy "pp1r" for table "part_document"
+-- Show that RLS policy does not apply for direct inserts to children
+-- This should fail with RLS POLICY pp1r violation.
+INSERT INTO part_document VALUES (100, 55, 1, 'regress_rls_dave', 'testing RLS with partitions'); -- fail
+ERROR: new row violates row-level security policy "pp1r" for table "part_document"
+-- But this should succeed.
+INSERT INTO part_document_satire VALUES (100, 55, 1, 'regress_rls_dave', 'testing RLS with partitions'); -- success
+-- We still cannot see the row using the parent
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+-- But we can if we look directly
+SELECT * FROM part_document_satire WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first satire
+NOTICE: f_leak => great satire
+NOTICE: f_leak => testing RLS with partitions
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(3 rows)
+
+-- Turn on RLS and create policy on child to show RLS is checked before constraints
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER TABLE part_document_satire ENABLE ROW LEVEL SECURITY;
+CREATE POLICY pp3 ON part_document_satire AS RESTRICTIVE
+ USING (cid < 55);
+-- This should fail with RLS violation now.
+SET SESSION AUTHORIZATION regress_rls_dave;
+INSERT INTO part_document_satire VALUES (101, 55, 1, 'regress_rls_dave', 'testing RLS with partitions'); -- fail
+ERROR: new row violates row-level security policy for table "part_document_satire"
+-- And now we cannot see directly into the partition either, due to RLS
+SELECT * FROM part_document_satire WHERE f_leak(dtitle) ORDER BY did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+---------+--------
+(0 rows)
+
+-- The parent looks the same as before
+-- viewpoint from regress_rls_dave
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+--------------------------------------------------------------
+ Seq Scan on part_document_fiction part_document
+ Filter: ((cid < 55) AND (dlevel <= $0) AND f_leak(dtitle))
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+(5 rows)
+
+-- viewpoint from regress_rls_carol
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => my first satire
+NOTICE: f_leak => great satire
+NOTICE: f_leak => testing RLS with partitions
+NOTICE: f_leak => my science textbook
+NOTICE: f_leak => my history book
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => awesome technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(11 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on part_document_fiction part_document
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_satire part_document_1
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_nonfiction part_document_2
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+(10 rows)
+
+-- only owner can change policies
+ALTER POLICY pp1 ON part_document USING (true); --fail
+ERROR: must be owner of table part_document
+DROP POLICY pp1 ON part_document; --fail
+ERROR: must be owner of relation part_document
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY pp1 ON part_document USING (dauthor = current_user);
+-- viewpoint from regress_rls_bob again
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my first satire
+NOTICE: f_leak => my science textbook
+NOTICE: f_leak => my history book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+---------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+(5 rows)
+
+-- viewpoint from regress_rls_carol again
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great satire
+NOTICE: f_leak => great technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+(3 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+---------------------------------------------------------------
+ Append
+ -> Seq Scan on part_document_fiction part_document
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+ -> Seq Scan on part_document_satire part_document_1
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+ -> Seq Scan on part_document_nonfiction part_document_2
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+(7 rows)
+
+-- database superuser does bypass RLS policy when enabled
+RESET SESSION AUTHORIZATION;
+SET row_security TO ON;
+SELECT * FROM part_document ORDER BY did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(11 rows)
+
+SELECT * FROM part_document_satire ORDER by did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(3 rows)
+
+-- database non-superuser with bypass privilege can bypass RLS policy when disabled
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+SELECT * FROM part_document ORDER BY did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(11 rows)
+
+SELECT * FROM part_document_satire ORDER by did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(3 rows)
+
+-- RLS policy does not apply to table owner when RLS enabled.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+SELECT * FROM part_document ORDER by did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(11 rows)
+
+SELECT * FROM part_document_satire ORDER by did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(3 rows)
+
+-- When RLS disabled, other users get ERROR.
+SET SESSION AUTHORIZATION regress_rls_dave;
+SET row_security TO OFF;
+SELECT * FROM part_document ORDER by did;
+ERROR: query would be affected by row-level security policy for table "part_document"
+SELECT * FROM part_document_satire ORDER by did;
+ERROR: query would be affected by row-level security policy for table "part_document_satire"
+-- Check behavior with a policy that uses a SubPlan not an InitPlan.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+CREATE POLICY pp3 ON part_document AS RESTRICTIVE
+ USING ((SELECT dlevel <= seclv FROM uaccount WHERE pguser = current_user));
+SET SESSION AUTHORIZATION regress_rls_carol;
+INSERT INTO part_document VALUES (100, 11, 5, 'regress_rls_carol', 'testing pp3'); -- fail
+ERROR: new row violates row-level security policy "pp3" for table "part_document"
+----- Dependencies -----
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+CREATE TABLE dependee (x integer, y integer);
+CREATE TABLE dependent (x integer, y integer);
+CREATE POLICY d1 ON dependent FOR ALL
+ TO PUBLIC
+ USING (x = (SELECT d.x FROM dependee d WHERE d.y = y));
+DROP TABLE dependee; -- Should fail without CASCADE due to dependency on row security qual?
+ERROR: cannot drop table dependee because other objects depend on it
+DETAIL: policy d1 on table dependent depends on table dependee
+HINT: Use DROP ... CASCADE to drop the dependent objects too.
+DROP TABLE dependee CASCADE;
+NOTICE: drop cascades to policy d1 on table dependent
+EXPLAIN (COSTS OFF) SELECT * FROM dependent; -- After drop, should be unqualified
+ QUERY PLAN
+-----------------------
+ Seq Scan on dependent
+(1 row)
+
+----- RECURSION ----
+--
+-- Simple recursion
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE rec1 (x integer, y integer);
+CREATE POLICY r1 ON rec1 USING (x = (SELECT r.x FROM rec1 r WHERE y = r.y));
+ALTER TABLE rec1 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rec1; -- fail, direct recursion
+ERROR: infinite recursion detected in policy for relation "rec1"
+--
+-- Mutual recursion
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE rec2 (a integer, b integer);
+ALTER POLICY r1 ON rec1 USING (x = (SELECT a FROM rec2 WHERE b = y));
+CREATE POLICY r2 ON rec2 USING (a = (SELECT x FROM rec1 WHERE y = b));
+ALTER TABLE rec2 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rec1; -- fail, mutual recursion
+ERROR: infinite recursion detected in policy for relation "rec1"
+--
+-- Mutual recursion via views
+--
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW rec1v AS SELECT * FROM rec1;
+CREATE VIEW rec2v AS SELECT * FROM rec2;
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY r1 ON rec1 USING (x = (SELECT a FROM rec2v WHERE b = y));
+ALTER POLICY r2 ON rec2 USING (a = (SELECT x FROM rec1v WHERE y = b));
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rec1; -- fail, mutual recursion via views
+ERROR: infinite recursion detected in policy for relation "rec1"
+--
+-- Mutual recursion via .s.b views
+--
+SET SESSION AUTHORIZATION regress_rls_bob;
+DROP VIEW rec1v, rec2v CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to policy r1 on table rec1
+drop cascades to policy r2 on table rec2
+CREATE VIEW rec1v WITH (security_barrier) AS SELECT * FROM rec1;
+CREATE VIEW rec2v WITH (security_barrier) AS SELECT * FROM rec2;
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE POLICY r1 ON rec1 USING (x = (SELECT a FROM rec2v WHERE b = y));
+CREATE POLICY r2 ON rec2 USING (a = (SELECT x FROM rec1v WHERE y = b));
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rec1; -- fail, mutual recursion via s.b. views
+ERROR: infinite recursion detected in policy for relation "rec1"
+--
+-- recursive RLS and VIEWs in policy
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE s1 (a int, b text);
+INSERT INTO s1 (SELECT x, md5(x::text) FROM generate_series(-10,10) x);
+CREATE TABLE s2 (x int, y text);
+INSERT INTO s2 (SELECT x, md5(x::text) FROM generate_series(-6,6) x);
+GRANT SELECT ON s1, s2 TO regress_rls_bob;
+CREATE POLICY p1 ON s1 USING (a in (select x from s2 where y like '%2f%'));
+CREATE POLICY p2 ON s2 USING (x in (select a from s1 where b like '%22%'));
+CREATE POLICY p3 ON s1 FOR INSERT WITH CHECK (a = (SELECT a FROM s1));
+ALTER TABLE s1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE s2 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW v2 AS SELECT * FROM s2 WHERE y like '%af%';
+SELECT * FROM s1 WHERE f_leak(b); -- fail (infinite recursion)
+ERROR: infinite recursion detected in policy for relation "s1"
+INSERT INTO s1 VALUES (1, 'foo'); -- fail (infinite recursion)
+ERROR: infinite recursion detected in policy for relation "s1"
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP POLICY p3 on s1;
+ALTER POLICY p2 ON s2 USING (x % 2 = 0);
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM s1 WHERE f_leak(b); -- OK
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+ a | b
+---+----------------------------------
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM only s1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------
+ Seq Scan on s1
+ Filter: ((hashed SubPlan 1) AND f_leak(b))
+ SubPlan 1
+ -> Seq Scan on s2
+ Filter: (((x % 2) = 0) AND (y ~~ '%2f%'::text))
+(5 rows)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY p1 ON s1 USING (a in (select x from v2)); -- using VIEW in RLS policy
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM s1 WHERE f_leak(b); -- OK
+NOTICE: f_leak => 0267aaf632e87a63288a08331f22c7c3
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+ a | b
+----+----------------------------------
+ -4 | 0267aaf632e87a63288a08331f22c7c3
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM s1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------
+ Seq Scan on s1
+ Filter: ((hashed SubPlan 1) AND f_leak(b))
+ SubPlan 1
+ -> Seq Scan on s2
+ Filter: (((x % 2) = 0) AND (y ~~ '%af%'::text))
+(5 rows)
+
+SELECT (SELECT x FROM s1 LIMIT 1) xx, * FROM s2 WHERE y like '%28%';
+ xx | x | y
+----+----+----------------------------------
+ -6 | -6 | 596a3d04481816330f07e4f97510c28f
+ -4 | -4 | 0267aaf632e87a63288a08331f22c7c3
+ 2 | 2 | c81e728d9d4c2f636f067f89cc14862c
+(3 rows)
+
+EXPLAIN (COSTS OFF) SELECT (SELECT x FROM s1 LIMIT 1) xx, * FROM s2 WHERE y like '%28%';
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Seq Scan on s2
+ Filter: (((x % 2) = 0) AND (y ~~ '%28%'::text))
+ SubPlan 2
+ -> Limit
+ -> Seq Scan on s1
+ Filter: (hashed SubPlan 1)
+ SubPlan 1
+ -> Seq Scan on s2 s2_1
+ Filter: (((x % 2) = 0) AND (y ~~ '%af%'::text))
+(9 rows)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY p2 ON s2 USING (x in (select a from s1 where b like '%d2%'));
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM s1 WHERE f_leak(b); -- fail (infinite recursion via view)
+ERROR: infinite recursion detected in policy for relation "s1"
+-- prepared statement with regress_rls_alice privilege
+PREPARE p1(int) AS SELECT * FROM t1 WHERE a <= $1;
+EXECUTE p1(2);
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 202 | 2 | bcd
+ 302 | 2 | yyy
+(3 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE p1(2);
+ QUERY PLAN
+----------------------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a <= 2) AND ((a % 2) = 0))
+ -> Seq Scan on t2 t1_1
+ Filter: ((a <= 2) AND ((a % 2) = 0))
+ -> Seq Scan on t3 t1_2
+ Filter: ((a <= 2) AND ((a % 2) = 0))
+(7 rows)
+
+-- superuser is allowed to bypass RLS checks
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+SELECT * FROM t1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => cde
+NOTICE: f_leak => def
+NOTICE: f_leak => xxx
+NOTICE: f_leak => yyy
+NOTICE: f_leak => zzz
+ id | a | b
+-----+---+-----
+ 101 | 1 | aba
+ 102 | 2 | bbb
+ 103 | 3 | ccc
+ 104 | 4 | dad
+ 201 | 1 | abc
+ 202 | 2 | bcd
+ 203 | 3 | cde
+ 204 | 4 | def
+ 301 | 1 | xxx
+ 302 | 2 | yyy
+ 303 | 3 | zzz
+(11 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: f_leak(b)
+ -> Seq Scan on t2 t1_1
+ Filter: f_leak(b)
+ -> Seq Scan on t3 t1_2
+ Filter: f_leak(b)
+(7 rows)
+
+-- plan cache should be invalidated
+EXECUTE p1(2);
+ id | a | b
+-----+---+-----
+ 101 | 1 | aba
+ 102 | 2 | bbb
+ 201 | 1 | abc
+ 202 | 2 | bcd
+ 301 | 1 | xxx
+ 302 | 2 | yyy
+(6 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE p1(2);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: (a <= 2)
+ -> Seq Scan on t2 t1_1
+ Filter: (a <= 2)
+ -> Seq Scan on t3 t1_2
+ Filter: (a <= 2)
+(7 rows)
+
+PREPARE p2(int) AS SELECT * FROM t1 WHERE a = $1;
+EXECUTE p2(2);
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 202 | 2 | bcd
+ 302 | 2 | yyy
+(3 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE p2(2);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: (a = 2)
+ -> Seq Scan on t2 t1_1
+ Filter: (a = 2)
+ -> Seq Scan on t3 t1_2
+ Filter: (a = 2)
+(7 rows)
+
+-- also, the case where privileges switch away from superuser
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+EXECUTE p2(2);
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 202 | 2 | bcd
+ 302 | 2 | yyy
+(3 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE p2(2);
+ QUERY PLAN
+---------------------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a = 2) AND ((a % 2) = 0))
+ -> Seq Scan on t2 t1_1
+ Filter: ((a = 2) AND ((a % 2) = 0))
+ -> Seq Scan on t3 t1_2
+ Filter: ((a = 2) AND ((a % 2) = 0))
+(7 rows)
+
+--
+-- UPDATE / DELETE and Row-level security
+--
+SET SESSION AUTHORIZATION regress_rls_bob;
+EXPLAIN (COSTS OFF) UPDATE t1 SET b = b || b WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Update on t1
+ Update on t1
+ Update on t2 t1_1
+ Update on t3 t1_2
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(10 rows)
+
+UPDATE t1 SET b = b || b WHERE f_leak(b);
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => yyy
+EXPLAIN (COSTS OFF) UPDATE only t1 SET b = b || '_updt' WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Update on t1
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(3 rows)
+
+UPDATE only t1 SET b = b || '_updt' WHERE f_leak(b);
+NOTICE: f_leak => bbbbbb
+NOTICE: f_leak => daddad
+-- returning clause with system column
+UPDATE only t1 SET b = b WHERE f_leak(b) RETURNING tableoid::regclass, *, t1;
+NOTICE: f_leak => bbbbbb_updt
+NOTICE: f_leak => daddad_updt
+ tableoid | id | a | b | t1
+----------+-----+---+-------------+---------------------
+ t1 | 102 | 2 | bbbbbb_updt | (102,2,bbbbbb_updt)
+ t1 | 104 | 4 | daddad_updt | (104,4,daddad_updt)
+(2 rows)
+
+UPDATE t1 SET b = b WHERE f_leak(b) RETURNING *;
+NOTICE: f_leak => bbbbbb_updt
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => bcdbcd
+NOTICE: f_leak => defdef
+NOTICE: f_leak => yyyyyy
+ id | a | b
+-----+---+-------------
+ 102 | 2 | bbbbbb_updt
+ 104 | 4 | daddad_updt
+ 202 | 2 | bcdbcd
+ 204 | 4 | defdef
+ 302 | 2 | yyyyyy
+(5 rows)
+
+UPDATE t1 SET b = b WHERE f_leak(b) RETURNING tableoid::regclass, *, t1;
+NOTICE: f_leak => bbbbbb_updt
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => bcdbcd
+NOTICE: f_leak => defdef
+NOTICE: f_leak => yyyyyy
+ tableoid | id | a | b | t1
+----------+-----+---+-------------+---------------------
+ t1 | 102 | 2 | bbbbbb_updt | (102,2,bbbbbb_updt)
+ t1 | 104 | 4 | daddad_updt | (104,4,daddad_updt)
+ t2 | 202 | 2 | bcdbcd | (202,2,bcdbcd)
+ t2 | 204 | 4 | defdef | (204,4,defdef)
+ t3 | 302 | 2 | yyyyyy | (302,2,yyyyyy)
+(5 rows)
+
+-- updates with from clause
+EXPLAIN (COSTS OFF) UPDATE t2 SET b=t2.b FROM t3
+WHERE t2.a = 3 and t3.a = 2 AND f_leak(t2.b) AND f_leak(t3.b);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Update on t2
+ -> Nested Loop
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Seq Scan on t3
+ Filter: ((a = 2) AND f_leak(b))
+(6 rows)
+
+UPDATE t2 SET b=t2.b FROM t3
+WHERE t2.a = 3 and t3.a = 2 AND f_leak(t2.b) AND f_leak(t3.b);
+NOTICE: f_leak => cde
+NOTICE: f_leak => yyyyyy
+EXPLAIN (COSTS OFF) UPDATE t1 SET b=t1.b FROM t2
+WHERE t1.a = 3 and t2.a = 3 AND f_leak(t1.b) AND f_leak(t2.b);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Update on t1
+ Update on t1
+ Update on t2 t1_1
+ Update on t3 t1_2
+ -> Nested Loop
+ -> Seq Scan on t1
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Nested Loop
+ -> Seq Scan on t2 t1_1
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Nested Loop
+ -> Seq Scan on t3 t1_2
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+(19 rows)
+
+UPDATE t1 SET b=t1.b FROM t2
+WHERE t1.a = 3 and t2.a = 3 AND f_leak(t1.b) AND f_leak(t2.b);
+EXPLAIN (COSTS OFF) UPDATE t2 SET b=t2.b FROM t1
+WHERE t1.a = 3 and t2.a = 3 AND f_leak(t1.b) AND f_leak(t2.b);
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Update on t2
+ -> Nested Loop
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Append
+ -> Seq Scan on t1
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_1
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+(11 rows)
+
+UPDATE t2 SET b=t2.b FROM t1
+WHERE t1.a = 3 and t2.a = 3 AND f_leak(t1.b) AND f_leak(t2.b);
+NOTICE: f_leak => cde
+-- updates with from clause self join
+EXPLAIN (COSTS OFF) UPDATE t2 t2_1 SET b = t2_2.b FROM t2 t2_2
+WHERE t2_1.a = 3 AND t2_2.a = t2_1.a AND t2_2.b = t2_1.b
+AND f_leak(t2_1.b) AND f_leak(t2_2.b) RETURNING *, t2_1, t2_2;
+ QUERY PLAN
+-----------------------------------------------------------------
+ Update on t2 t2_1
+ -> Nested Loop
+ Join Filter: (t2_1.b = t2_2.b)
+ -> Seq Scan on t2 t2_1
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Seq Scan on t2 t2_2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+(7 rows)
+
+UPDATE t2 t2_1 SET b = t2_2.b FROM t2 t2_2
+WHERE t2_1.a = 3 AND t2_2.a = t2_1.a AND t2_2.b = t2_1.b
+AND f_leak(t2_1.b) AND f_leak(t2_2.b) RETURNING *, t2_1, t2_2;
+NOTICE: f_leak => cde
+NOTICE: f_leak => cde
+ id | a | b | c | id | a | b | c | t2_1 | t2_2
+-----+---+-----+-----+-----+---+-----+-----+-----------------+-----------------
+ 203 | 3 | cde | 3.3 | 203 | 3 | cde | 3.3 | (203,3,cde,3.3) | (203,3,cde,3.3)
+(1 row)
+
+EXPLAIN (COSTS OFF) UPDATE t1 t1_1 SET b = t1_2.b FROM t1 t1_2
+WHERE t1_1.a = 4 AND t1_2.a = t1_1.a AND t1_2.b = t1_1.b
+AND f_leak(t1_1.b) AND f_leak(t1_2.b) RETURNING *, t1_1, t1_2;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Update on t1 t1_1
+ Update on t1 t1_1
+ Update on t2 t1_1_1
+ Update on t3 t1_1_2
+ -> Nested Loop
+ Join Filter: (t1_1.b = t1_2.b)
+ -> Seq Scan on t1 t1_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Append
+ -> Seq Scan on t1 t1_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_2_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Nested Loop
+ Join Filter: (t1_1_1.b = t1_2.b)
+ -> Seq Scan on t2 t1_1_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Append
+ -> Seq Scan on t1 t1_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_2_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Nested Loop
+ Join Filter: (t1_1_2.b = t1_2.b)
+ -> Seq Scan on t3 t1_1_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Append
+ -> Seq Scan on t1 t1_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_2_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+(37 rows)
+
+UPDATE t1 t1_1 SET b = t1_2.b FROM t1 t1_2
+WHERE t1_1.a = 4 AND t1_2.a = t1_1.a AND t1_2.b = t1_1.b
+AND f_leak(t1_1.b) AND f_leak(t1_2.b) RETURNING *, t1_1, t1_2;
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => defdef
+NOTICE: f_leak => defdef
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => defdef
+ id | a | b | id | a | b | t1_1 | t1_2
+-----+---+-------------+-----+---+-------------+---------------------+---------------------
+ 104 | 4 | daddad_updt | 104 | 4 | daddad_updt | (104,4,daddad_updt) | (104,4,daddad_updt)
+ 204 | 4 | defdef | 204 | 4 | defdef | (204,4,defdef) | (204,4,defdef)
+(2 rows)
+
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+SELECT * FROM t1 ORDER BY a,b;
+ id | a | b
+-----+---+-------------
+ 101 | 1 | aba
+ 201 | 1 | abc
+ 301 | 1 | xxx
+ 102 | 2 | bbbbbb_updt
+ 202 | 2 | bcdbcd
+ 302 | 2 | yyyyyy
+ 103 | 3 | ccc
+ 203 | 3 | cde
+ 303 | 3 | zzz
+ 104 | 4 | daddad_updt
+ 204 | 4 | defdef
+(11 rows)
+
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+EXPLAIN (COSTS OFF) DELETE FROM only t1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Delete on t1
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(3 rows)
+
+EXPLAIN (COSTS OFF) DELETE FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Delete on t1
+ Delete on t1
+ Delete on t2 t1_1
+ Delete on t3 t1_2
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(10 rows)
+
+DELETE FROM only t1 WHERE f_leak(b) RETURNING tableoid::regclass, *, t1;
+NOTICE: f_leak => bbbbbb_updt
+NOTICE: f_leak => daddad_updt
+ tableoid | id | a | b | t1
+----------+-----+---+-------------+---------------------
+ t1 | 102 | 2 | bbbbbb_updt | (102,2,bbbbbb_updt)
+ t1 | 104 | 4 | daddad_updt | (104,4,daddad_updt)
+(2 rows)
+
+DELETE FROM t1 WHERE f_leak(b) RETURNING tableoid::regclass, *, t1;
+NOTICE: f_leak => bcdbcd
+NOTICE: f_leak => defdef
+NOTICE: f_leak => yyyyyy
+ tableoid | id | a | b | t1
+----------+-----+---+--------+----------------
+ t2 | 202 | 2 | bcdbcd | (202,2,bcdbcd)
+ t2 | 204 | 4 | defdef | (204,4,defdef)
+ t3 | 302 | 2 | yyyyyy | (302,2,yyyyyy)
+(3 rows)
+
+--
+-- S.b. view on top of Row-level security
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE b1 (a int, b text);
+INSERT INTO b1 (SELECT x, md5(x::text) FROM generate_series(-10,10) x);
+CREATE POLICY p1 ON b1 USING (a % 2 = 0);
+ALTER TABLE b1 ENABLE ROW LEVEL SECURITY;
+GRANT ALL ON b1 TO regress_rls_bob;
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW bv1 WITH (security_barrier) AS SELECT * FROM b1 WHERE a > 0 WITH CHECK OPTION;
+GRANT ALL ON bv1 TO regress_rls_carol;
+SET SESSION AUTHORIZATION regress_rls_carol;
+EXPLAIN (COSTS OFF) SELECT * FROM bv1 WHERE f_leak(b);
+ QUERY PLAN
+---------------------------------------------
+ Subquery Scan on bv1
+ Filter: f_leak(bv1.b)
+ -> Seq Scan on b1
+ Filter: ((a > 0) AND ((a % 2) = 0))
+(4 rows)
+
+SELECT * FROM bv1 WHERE f_leak(b);
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+NOTICE: f_leak => c9f0f895fb98ab9159f51fd0297e236d
+NOTICE: f_leak => d3d9446802a44259755d38e6d163e820
+ a | b
+----+----------------------------------
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+(5 rows)
+
+INSERT INTO bv1 VALUES (-1, 'xxx'); -- should fail view WCO
+ERROR: new row violates row-level security policy for table "b1"
+INSERT INTO bv1 VALUES (11, 'xxx'); -- should fail RLS check
+ERROR: new row violates row-level security policy for table "b1"
+INSERT INTO bv1 VALUES (12, 'xxx'); -- ok
+EXPLAIN (COSTS OFF) UPDATE bv1 SET b = 'yyy' WHERE a = 4 AND f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Update on b1
+ -> Seq Scan on b1
+ Filter: ((a > 0) AND (a = 4) AND ((a % 2) = 0) AND f_leak(b))
+(3 rows)
+
+UPDATE bv1 SET b = 'yyy' WHERE a = 4 AND f_leak(b);
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+EXPLAIN (COSTS OFF) DELETE FROM bv1 WHERE a = 6 AND f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Delete on b1
+ -> Seq Scan on b1
+ Filter: ((a > 0) AND (a = 6) AND ((a % 2) = 0) AND f_leak(b))
+(3 rows)
+
+DELETE FROM bv1 WHERE a = 6 AND f_leak(b);
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM b1;
+ a | b
+-----+----------------------------------
+ -10 | 1b0fd9efa5279c4203b7c70233f86dbf
+ -9 | 252e691406782824eec43d7eadc3d256
+ -8 | a8d2ec85eaf98407310b72eb73dda247
+ -7 | 74687a12d3915d3c4d83f1af7b3683d5
+ -6 | 596a3d04481816330f07e4f97510c28f
+ -5 | 47c1b025fa18ea96c33fbb6718688c0f
+ -4 | 0267aaf632e87a63288a08331f22c7c3
+ -3 | b3149ecea4628efd23d2f86e5a723472
+ -2 | 5d7b9adcbe1c629ec722529dd12e5129
+ -1 | 6bb61e3b7bce0931da574d19d1d82c88
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 1 | c4ca4238a0b923820dcc509a6f75849b
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 5 | e4da3b7fbbce2345d7772b0674a318d5
+ 7 | 8f14e45fceea167a5a36dedd4bea2543
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | xxx
+ 4 | yyy
+(21 rows)
+
+--
+-- INSERT ... ON CONFLICT DO UPDATE and Row-level security
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP POLICY p1 ON document;
+DROP POLICY p1r ON document;
+CREATE POLICY p1 ON document FOR SELECT USING (true);
+CREATE POLICY p2 ON document FOR INSERT WITH CHECK (dauthor = current_user);
+CREATE POLICY p3 ON document FOR UPDATE
+ USING (cid = (SELECT cid from category WHERE cname = 'novel'))
+ WITH CHECK (dauthor = current_user);
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Exists...
+SELECT * FROM document WHERE did = 2;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+-----------------
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+(1 row)
+
+-- ...so violates actual WITH CHECK OPTION within UPDATE (not INSERT, since
+-- alternative UPDATE path happens to be taken):
+INSERT INTO document VALUES (2, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_carol', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, dauthor = EXCLUDED.dauthor;
+ERROR: new row violates row-level security policy for table "document"
+-- Violates USING qual for UPDATE policy p3.
+--
+-- UPDATE path is taken, but UPDATE fails purely because *existing* row to be
+-- updated is not a "novel"/cid 11 (row is not leaked, even though we have
+-- SELECT privileges sufficient to see the row in this instance):
+INSERT INTO document VALUES (33, 22, 1, 'regress_rls_bob', 'okay science fiction'); -- preparation for next statement
+INSERT INTO document VALUES (33, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'Some novel, replaces sci-fi') -- takes UPDATE path
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle;
+ERROR: new row violates row-level security policy (USING expression) for table "document"
+-- Fine (we UPDATE, since INSERT WCOs and UPDATE security barrier quals + WCOs
+-- not violated):
+INSERT INTO document VALUES (2, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle RETURNING *;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+----------------
+ 2 | 11 | 2 | regress_rls_bob | my first novel
+(1 row)
+
+-- Fine (we INSERT, so "cid = 33" ("technology") isn't evaluated):
+INSERT INTO document VALUES (78, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'some technology novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, cid = 33 RETURNING *;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+-----------------------
+ 78 | 11 | 1 | regress_rls_bob | some technology novel
+(1 row)
+
+-- Fine (same query, but we UPDATE, so "cid = 33", ("technology") is not the
+-- case in respect of *existing* tuple):
+INSERT INTO document VALUES (78, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'some technology novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, cid = 33 RETURNING *;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+-----------------------
+ 78 | 33 | 1 | regress_rls_bob | some technology novel
+(1 row)
+
+-- Same query a third time, but now fails due to existing tuple finally not
+-- passing quals:
+INSERT INTO document VALUES (78, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'some technology novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, cid = 33 RETURNING *;
+ERROR: new row violates row-level security policy (USING expression) for table "document"
+-- Don't fail just because INSERT doesn't satisfy WITH CHECK option that
+-- originated as a barrier/USING() qual from the UPDATE. Note that the UPDATE
+-- path *isn't* taken, and so UPDATE-related policy does not apply:
+INSERT INTO document VALUES (79, (SELECT cid from category WHERE cname = 'technology'), 1, 'regress_rls_bob', 'technology book, can only insert')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle RETURNING *;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+----------------------------------
+ 79 | 33 | 1 | regress_rls_bob | technology book, can only insert
+(1 row)
+
+-- But this time, the same statement fails, because the UPDATE path is taken,
+-- and updating the row just inserted falls afoul of security barrier qual
+-- (enforced as WCO) -- what we might have updated target tuple to is
+-- irrelevant, in fact.
+INSERT INTO document VALUES (79, (SELECT cid from category WHERE cname = 'technology'), 1, 'regress_rls_bob', 'technology book, can only insert')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle RETURNING *;
+ERROR: new row violates row-level security policy (USING expression) for table "document"
+-- Test default USING qual enforced as WCO
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP POLICY p1 ON document;
+DROP POLICY p2 ON document;
+DROP POLICY p3 ON document;
+CREATE POLICY p3_with_default ON document FOR UPDATE
+ USING (cid = (SELECT cid from category WHERE cname = 'novel'));
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Just because WCO-style enforcement of USING quals occurs with
+-- existing/target tuple does not mean that the implementation can be allowed
+-- to fail to also enforce this qual against the final tuple appended to
+-- relation (since in the absence of an explicit WCO, this is also interpreted
+-- as an UPDATE/ALL WCO in general).
+--
+-- UPDATE path is taken here (fails due to existing tuple). Note that this is
+-- not reported as a "USING expression", because it's an RLS UPDATE check that originated as
+-- a USING qual for the purposes of RLS in general, as opposed to an explicit
+-- USING qual that is ordinarily a security barrier. We leave it up to the
+-- UPDATE to make this fail:
+INSERT INTO document VALUES (79, (SELECT cid from category WHERE cname = 'technology'), 1, 'regress_rls_bob', 'technology book, can only insert')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle RETURNING *;
+ERROR: new row violates row-level security policy for table "document"
+-- UPDATE path is taken here. Existing tuple passes, since its cid
+-- corresponds to "novel", but default USING qual is enforced against
+-- post-UPDATE tuple too (as always when updating with a policy that lacks an
+-- explicit WCO), and so this fails:
+INSERT INTO document VALUES (2, (SELECT cid from category WHERE cname = 'technology'), 1, 'regress_rls_bob', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET cid = EXCLUDED.cid, dtitle = EXCLUDED.dtitle RETURNING *;
+ERROR: new row violates row-level security policy for table "document"
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP POLICY p3_with_default ON document;
+--
+-- Test ALL policies with ON CONFLICT DO UPDATE (much the same as existing UPDATE
+-- tests)
+--
+CREATE POLICY p3_with_all ON document FOR ALL
+ USING (cid = (SELECT cid from category WHERE cname = 'novel'))
+ WITH CHECK (dauthor = current_user);
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Fails, since ALL WCO is enforced in insert path:
+INSERT INTO document VALUES (80, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_carol', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, cid = 33;
+ERROR: new row violates row-level security policy for table "document"
+-- Fails, since ALL policy USING qual is enforced (existing, target tuple is in
+-- violation, since it has the "manga" cid):
+INSERT INTO document VALUES (4, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle;
+ERROR: new row violates row-level security policy (USING expression) for table "document"
+-- Fails, since ALL WCO are enforced:
+INSERT INTO document VALUES (1, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dauthor = 'regress_rls_carol';
+ERROR: new row violates row-level security policy for table "document"
+--
+-- ROLE/GROUP
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE z1 (a int, b text);
+CREATE TABLE z2 (a int, b text);
+GRANT SELECT ON z1,z2 TO regress_rls_group1, regress_rls_group2,
+ regress_rls_bob, regress_rls_carol;
+INSERT INTO z1 VALUES
+ (1, 'aba'),
+ (2, 'bbb'),
+ (3, 'ccc'),
+ (4, 'dad');
+CREATE POLICY p1 ON z1 TO regress_rls_group1 USING (a % 2 = 0);
+CREATE POLICY p2 ON z1 TO regress_rls_group2 USING (a % 2 = 1);
+ALTER TABLE z1 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM z1 WHERE f_leak(b);
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM z1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+PREPARE plancache_test AS SELECT * FROM z1 WHERE f_leak(b);
+EXPLAIN (COSTS OFF) EXECUTE plancache_test;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+PREPARE plancache_test2 AS WITH q AS MATERIALIZED (SELECT * FROM z1 WHERE f_leak(b)) SELECT * FROM q,z2;
+EXPLAIN (COSTS OFF) EXECUTE plancache_test2;
+ QUERY PLAN
+-------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z2
+(7 rows)
+
+PREPARE plancache_test3 AS WITH q AS MATERIALIZED (SELECT * FROM z2) SELECT * FROM q,z1 WHERE f_leak(z1.b);
+EXPLAIN (COSTS OFF) EXECUTE plancache_test3;
+ QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z2
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(7 rows)
+
+SET ROLE regress_rls_group1;
+SELECT * FROM z1 WHERE f_leak(b);
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM z1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test2;
+ QUERY PLAN
+-------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z2
+(7 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test3;
+ QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z2
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(7 rows)
+
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM z1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => ccc
+ a | b
+---+-----
+ 1 | aba
+ 3 | ccc
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM z1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test2;
+ QUERY PLAN
+-------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z2
+(7 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test3;
+ QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z2
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(7 rows)
+
+SET ROLE regress_rls_group2;
+SELECT * FROM z1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => ccc
+ a | b
+---+-----
+ 1 | aba
+ 3 | ccc
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM z1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test2;
+ QUERY PLAN
+-------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z2
+(7 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test3;
+ QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z2
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(7 rows)
+
+--
+-- Views should follow policy for view owner.
+--
+-- View and Table owner are the same.
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE VIEW rls_view AS SELECT * FROM z1 WHERE f_leak(b);
+GRANT SELECT ON rls_view TO regress_rls_bob;
+-- Query as role that is not owner of view or table. Should return all records.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rls_view;
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 1 | aba
+ 2 | bbb
+ 3 | ccc
+ 4 | dad
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+---------------------
+ Seq Scan on z1
+ Filter: f_leak(b)
+(2 rows)
+
+-- Query as view/table owner. Should return all records.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM rls_view;
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 1 | aba
+ 2 | bbb
+ 3 | ccc
+ 4 | dad
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+---------------------
+ Seq Scan on z1
+ Filter: f_leak(b)
+(2 rows)
+
+DROP VIEW rls_view;
+-- View and Table owners are different.
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW rls_view AS SELECT * FROM z1 WHERE f_leak(b);
+GRANT SELECT ON rls_view TO regress_rls_alice;
+-- Query as role that is not owner of view but is owner of table.
+-- Should return records based on view owner policies.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM rls_view;
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+-- Query as role that is not owner of table but is owner of view.
+-- Should return records based on view owner policies.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rls_view;
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+-- Query as role that is not the owner of the table or view without permissions.
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM rls_view; --fail - permission denied.
+ERROR: permission denied for view rls_view
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view; --fail - permission denied.
+ERROR: permission denied for view rls_view
+-- Query as role that is not the owner of the table or view with permissions.
+SET SESSION AUTHORIZATION regress_rls_bob;
+GRANT SELECT ON rls_view TO regress_rls_carol;
+SELECT * FROM rls_view;
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_rls_bob;
+DROP VIEW rls_view;
+--
+-- Command specific
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE x1 (a int, b text, c text);
+GRANT ALL ON x1 TO PUBLIC;
+INSERT INTO x1 VALUES
+ (1, 'abc', 'regress_rls_bob'),
+ (2, 'bcd', 'regress_rls_bob'),
+ (3, 'cde', 'regress_rls_carol'),
+ (4, 'def', 'regress_rls_carol'),
+ (5, 'efg', 'regress_rls_bob'),
+ (6, 'fgh', 'regress_rls_bob'),
+ (7, 'fgh', 'regress_rls_carol'),
+ (8, 'fgh', 'regress_rls_carol');
+CREATE POLICY p0 ON x1 FOR ALL USING (c = current_user);
+CREATE POLICY p1 ON x1 FOR SELECT USING (a % 2 = 0);
+CREATE POLICY p2 ON x1 FOR INSERT WITH CHECK (a % 2 = 1);
+CREATE POLICY p3 ON x1 FOR UPDATE USING (a % 2 = 0);
+CREATE POLICY p4 ON x1 FOR DELETE USING (a < 8);
+ALTER TABLE x1 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM x1 WHERE f_leak(b) ORDER BY a ASC;
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => efg
+NOTICE: f_leak => fgh
+NOTICE: f_leak => fgh
+ a | b | c
+---+-----+-------------------
+ 1 | abc | regress_rls_bob
+ 2 | bcd | regress_rls_bob
+ 4 | def | regress_rls_carol
+ 5 | efg | regress_rls_bob
+ 6 | fgh | regress_rls_bob
+ 8 | fgh | regress_rls_carol
+(6 rows)
+
+UPDATE x1 SET b = b || '_updt' WHERE f_leak(b) RETURNING *;
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => efg
+NOTICE: f_leak => fgh
+NOTICE: f_leak => fgh
+ a | b | c
+---+----------+-------------------
+ 1 | abc_updt | regress_rls_bob
+ 2 | bcd_updt | regress_rls_bob
+ 4 | def_updt | regress_rls_carol
+ 5 | efg_updt | regress_rls_bob
+ 6 | fgh_updt | regress_rls_bob
+ 8 | fgh_updt | regress_rls_carol
+(6 rows)
+
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM x1 WHERE f_leak(b) ORDER BY a ASC;
+NOTICE: f_leak => cde
+NOTICE: f_leak => fgh
+NOTICE: f_leak => bcd_updt
+NOTICE: f_leak => def_updt
+NOTICE: f_leak => fgh_updt
+NOTICE: f_leak => fgh_updt
+ a | b | c
+---+----------+-------------------
+ 2 | bcd_updt | regress_rls_bob
+ 3 | cde | regress_rls_carol
+ 4 | def_updt | regress_rls_carol
+ 6 | fgh_updt | regress_rls_bob
+ 7 | fgh | regress_rls_carol
+ 8 | fgh_updt | regress_rls_carol
+(6 rows)
+
+UPDATE x1 SET b = b || '_updt' WHERE f_leak(b) RETURNING *;
+NOTICE: f_leak => cde
+NOTICE: f_leak => fgh
+NOTICE: f_leak => bcd_updt
+NOTICE: f_leak => def_updt
+NOTICE: f_leak => fgh_updt
+NOTICE: f_leak => fgh_updt
+ a | b | c
+---+---------------+-------------------
+ 3 | cde_updt | regress_rls_carol
+ 7 | fgh_updt | regress_rls_carol
+ 2 | bcd_updt_updt | regress_rls_bob
+ 4 | def_updt_updt | regress_rls_carol
+ 6 | fgh_updt_updt | regress_rls_bob
+ 8 | fgh_updt_updt | regress_rls_carol
+(6 rows)
+
+DELETE FROM x1 WHERE f_leak(b) RETURNING *;
+NOTICE: f_leak => cde_updt
+NOTICE: f_leak => fgh_updt
+NOTICE: f_leak => bcd_updt_updt
+NOTICE: f_leak => def_updt_updt
+NOTICE: f_leak => fgh_updt_updt
+NOTICE: f_leak => fgh_updt_updt
+ a | b | c
+---+---------------+-------------------
+ 3 | cde_updt | regress_rls_carol
+ 7 | fgh_updt | regress_rls_carol
+ 2 | bcd_updt_updt | regress_rls_bob
+ 4 | def_updt_updt | regress_rls_carol
+ 6 | fgh_updt_updt | regress_rls_bob
+ 8 | fgh_updt_updt | regress_rls_carol
+(6 rows)
+
+--
+-- Duplicate Policy Names
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE y1 (a int, b text);
+CREATE TABLE y2 (a int, b text);
+GRANT ALL ON y1, y2 TO regress_rls_bob;
+CREATE POLICY p1 ON y1 FOR ALL USING (a % 2 = 0);
+CREATE POLICY p2 ON y1 FOR SELECT USING (a > 2);
+CREATE POLICY p1 ON y1 FOR SELECT USING (a % 2 = 1); --fail
+ERROR: policy "p1" for table "y1" already exists
+CREATE POLICY p1 ON y2 FOR ALL USING (a % 2 = 0); --OK
+ALTER TABLE y1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE y2 ENABLE ROW LEVEL SECURITY;
+--
+-- Expression structure with SBV
+--
+-- Create view as table owner. RLS should NOT be applied.
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE VIEW rls_sbv WITH (security_barrier) AS
+ SELECT * FROM y1 WHERE f_leak(b);
+EXPLAIN (COSTS OFF) SELECT * FROM rls_sbv WHERE (a = 1);
+ QUERY PLAN
+-----------------------------------
+ Seq Scan on y1
+ Filter: (f_leak(b) AND (a = 1))
+(2 rows)
+
+DROP VIEW rls_sbv;
+-- Create view as role that does not own table. RLS should be applied.
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW rls_sbv WITH (security_barrier) AS
+ SELECT * FROM y1 WHERE f_leak(b);
+EXPLAIN (COSTS OFF) SELECT * FROM rls_sbv WHERE (a = 1);
+ QUERY PLAN
+------------------------------------------------------------------
+ Seq Scan on y1
+ Filter: ((a = 1) AND ((a > 2) OR ((a % 2) = 0)) AND f_leak(b))
+(2 rows)
+
+DROP VIEW rls_sbv;
+--
+-- Expression structure
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+INSERT INTO y2 (SELECT x, md5(x::text) FROM generate_series(0,20) x);
+CREATE POLICY p2 ON y2 USING (a % 3 = 0);
+CREATE POLICY p3 ON y2 USING (a % 4 = 0);
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM y2 WHERE f_leak(b);
+NOTICE: f_leak => cfcd208495d565ef66e7dff9f98764da
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => eccbc87e4b5ce2fe28308fd9f2a7baf3
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+NOTICE: f_leak => c9f0f895fb98ab9159f51fd0297e236d
+NOTICE: f_leak => 45c48cce2e2d7fbdea1afc51c7c6ad26
+NOTICE: f_leak => d3d9446802a44259755d38e6d163e820
+NOTICE: f_leak => c20ad4d76fe97759aa27a0c99bff6710
+NOTICE: f_leak => aab3238922bcc25a6f606eb525ffdc56
+NOTICE: f_leak => 9bf31c7ff062936a96d3c8bd1f8f2ff3
+NOTICE: f_leak => c74d97b01eae257e44aa9d5bade97baf
+NOTICE: f_leak => 6f4922f45568161a8cdf4ad2299f6d23
+NOTICE: f_leak => 98f13708210194c475687be6106a3b84
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 15 | 9bf31c7ff062936a96d3c8bd1f8f2ff3
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+(14 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM y2 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on y2
+ Filter: ((((a % 4) = 0) OR ((a % 3) = 0) OR ((a % 2) = 0)) AND f_leak(b))
+(2 rows)
+
+--
+-- Qual push-down of leaky functions, when not referring to table
+--
+SELECT * FROM y2 WHERE f_leak('abc');
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 15 | 9bf31c7ff062936a96d3c8bd1f8f2ff3
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+(14 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM y2 WHERE f_leak('abc');
+ QUERY PLAN
+---------------------------------------------------------------------------------------
+ Seq Scan on y2
+ Filter: (f_leak('abc'::text) AND (((a % 4) = 0) OR ((a % 3) = 0) OR ((a % 2) = 0)))
+(2 rows)
+
+CREATE TABLE test_qual_pushdown (
+ abc text
+);
+INSERT INTO test_qual_pushdown VALUES ('abc'),('def');
+SELECT * FROM y2 JOIN test_qual_pushdown ON (b = abc) WHERE f_leak(abc);
+NOTICE: f_leak => abc
+NOTICE: f_leak => def
+ a | b | abc
+---+---+-----
+(0 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM y2 JOIN test_qual_pushdown ON (b = abc) WHERE f_leak(abc);
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Hash Join
+ Hash Cond: (test_qual_pushdown.abc = y2.b)
+ -> Seq Scan on test_qual_pushdown
+ Filter: f_leak(abc)
+ -> Hash
+ -> Seq Scan on y2
+ Filter: (((a % 4) = 0) OR ((a % 3) = 0) OR ((a % 2) = 0))
+(7 rows)
+
+SELECT * FROM y2 JOIN test_qual_pushdown ON (b = abc) WHERE f_leak(b);
+NOTICE: f_leak => cfcd208495d565ef66e7dff9f98764da
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => eccbc87e4b5ce2fe28308fd9f2a7baf3
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+NOTICE: f_leak => c9f0f895fb98ab9159f51fd0297e236d
+NOTICE: f_leak => 45c48cce2e2d7fbdea1afc51c7c6ad26
+NOTICE: f_leak => d3d9446802a44259755d38e6d163e820
+NOTICE: f_leak => c20ad4d76fe97759aa27a0c99bff6710
+NOTICE: f_leak => aab3238922bcc25a6f606eb525ffdc56
+NOTICE: f_leak => 9bf31c7ff062936a96d3c8bd1f8f2ff3
+NOTICE: f_leak => c74d97b01eae257e44aa9d5bade97baf
+NOTICE: f_leak => 6f4922f45568161a8cdf4ad2299f6d23
+NOTICE: f_leak => 98f13708210194c475687be6106a3b84
+ a | b | abc
+---+---+-----
+(0 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM y2 JOIN test_qual_pushdown ON (b = abc) WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------------------------------------
+ Hash Join
+ Hash Cond: (test_qual_pushdown.abc = y2.b)
+ -> Seq Scan on test_qual_pushdown
+ -> Hash
+ -> Seq Scan on y2
+ Filter: ((((a % 4) = 0) OR ((a % 3) = 0) OR ((a % 2) = 0)) AND f_leak(b))
+(6 rows)
+
+DROP TABLE test_qual_pushdown;
+--
+-- Plancache invalidate on user change.
+--
+RESET SESSION AUTHORIZATION;
+DROP TABLE t1 CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to table t2
+drop cascades to table t3
+CREATE TABLE t1 (a integer);
+GRANT SELECT ON t1 TO regress_rls_bob, regress_rls_carol;
+CREATE POLICY p1 ON t1 TO regress_rls_bob USING ((a % 2) = 0);
+CREATE POLICY p2 ON t1 TO regress_rls_carol USING ((a % 4) = 0);
+ALTER TABLE t1 ENABLE ROW LEVEL SECURITY;
+-- Prepare as regress_rls_bob
+SET ROLE regress_rls_bob;
+PREPARE role_inval AS SELECT * FROM t1;
+-- Check plan
+EXPLAIN (COSTS OFF) EXECUTE role_inval;
+ QUERY PLAN
+-------------------------
+ Seq Scan on t1
+ Filter: ((a % 2) = 0)
+(2 rows)
+
+-- Change to regress_rls_carol
+SET ROLE regress_rls_carol;
+-- Check plan- should be different
+EXPLAIN (COSTS OFF) EXECUTE role_inval;
+ QUERY PLAN
+-------------------------
+ Seq Scan on t1
+ Filter: ((a % 4) = 0)
+(2 rows)
+
+-- Change back to regress_rls_bob
+SET ROLE regress_rls_bob;
+-- Check plan- should be back to original
+EXPLAIN (COSTS OFF) EXECUTE role_inval;
+ QUERY PLAN
+-------------------------
+ Seq Scan on t1
+ Filter: ((a % 2) = 0)
+(2 rows)
+
+--
+-- CTE and RLS
+--
+RESET SESSION AUTHORIZATION;
+DROP TABLE t1 CASCADE;
+CREATE TABLE t1 (a integer, b text);
+CREATE POLICY p1 ON t1 USING (a % 2 = 0);
+ALTER TABLE t1 ENABLE ROW LEVEL SECURITY;
+GRANT ALL ON t1 TO regress_rls_bob;
+INSERT INTO t1 (SELECT x, md5(x::text) FROM generate_series(0,20) x);
+SET SESSION AUTHORIZATION regress_rls_bob;
+WITH cte1 AS MATERIALIZED (SELECT * FROM t1 WHERE f_leak(b)) SELECT * FROM cte1;
+NOTICE: f_leak => cfcd208495d565ef66e7dff9f98764da
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+NOTICE: f_leak => c9f0f895fb98ab9159f51fd0297e236d
+NOTICE: f_leak => d3d9446802a44259755d38e6d163e820
+NOTICE: f_leak => c20ad4d76fe97759aa27a0c99bff6710
+NOTICE: f_leak => aab3238922bcc25a6f606eb525ffdc56
+NOTICE: f_leak => c74d97b01eae257e44aa9d5bade97baf
+NOTICE: f_leak => 6f4922f45568161a8cdf4ad2299f6d23
+NOTICE: f_leak => 98f13708210194c475687be6106a3b84
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+(11 rows)
+
+EXPLAIN (COSTS OFF)
+WITH cte1 AS MATERIALIZED (SELECT * FROM t1 WHERE f_leak(b)) SELECT * FROM cte1;
+ QUERY PLAN
+-------------------------------------------------
+ CTE Scan on cte1
+ CTE cte1
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(4 rows)
+
+WITH cte1 AS (UPDATE t1 SET a = a + 1 RETURNING *) SELECT * FROM cte1; --fail
+ERROR: new row violates row-level security policy for table "t1"
+WITH cte1 AS (UPDATE t1 SET a = a RETURNING *) SELECT * FROM cte1; --ok
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+(11 rows)
+
+WITH cte1 AS (INSERT INTO t1 VALUES (21, 'Fail') RETURNING *) SELECT * FROM cte1; --fail
+ERROR: new row violates row-level security policy for table "t1"
+WITH cte1 AS (INSERT INTO t1 VALUES (20, 'Success') RETURNING *) SELECT * FROM cte1; --ok
+ a | b
+----+---------
+ 20 | Success
+(1 row)
+
+--
+-- Rename Policy
+--
+RESET SESSION AUTHORIZATION;
+ALTER POLICY p1 ON t1 RENAME TO p1; --fail
+ERROR: policy "p1" for table "t1" already exists
+SELECT polname, relname
+ FROM pg_policy pol
+ JOIN pg_class pc ON (pc.oid = pol.polrelid)
+ WHERE relname = 't1';
+ polname | relname
+---------+---------
+ p1 | t1
+(1 row)
+
+ALTER POLICY p1 ON t1 RENAME TO p2; --ok
+SELECT polname, relname
+ FROM pg_policy pol
+ JOIN pg_class pc ON (pc.oid = pol.polrelid)
+ WHERE relname = 't1';
+ polname | relname
+---------+---------
+ p2 | t1
+(1 row)
+
+--
+-- Check INSERT SELECT
+--
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE TABLE t2 (a integer, b text);
+INSERT INTO t2 (SELECT * FROM t1);
+EXPLAIN (COSTS OFF) INSERT INTO t2 (SELECT * FROM t1);
+ QUERY PLAN
+-------------------------------
+ Insert on t2
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+(3 rows)
+
+SELECT * FROM t2;
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(12 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t2;
+ QUERY PLAN
+----------------
+ Seq Scan on t2
+(1 row)
+
+CREATE TABLE t3 AS SELECT * FROM t1;
+SELECT * FROM t3;
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(12 rows)
+
+SELECT * INTO t4 FROM t1;
+SELECT * FROM t4;
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(12 rows)
+
+--
+-- RLS with JOIN
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE blog (id integer, author text, post text);
+CREATE TABLE comment (blog_id integer, message text);
+GRANT ALL ON blog, comment TO regress_rls_bob;
+CREATE POLICY blog_1 ON blog USING (id % 2 = 0);
+ALTER TABLE blog ENABLE ROW LEVEL SECURITY;
+INSERT INTO blog VALUES
+ (1, 'alice', 'blog #1'),
+ (2, 'bob', 'blog #1'),
+ (3, 'alice', 'blog #2'),
+ (4, 'alice', 'blog #3'),
+ (5, 'john', 'blog #1');
+INSERT INTO comment VALUES
+ (1, 'cool blog'),
+ (1, 'fun blog'),
+ (3, 'crazy blog'),
+ (5, 'what?'),
+ (4, 'insane!'),
+ (2, 'who did it?');
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Check RLS JOIN with Non-RLS.
+SELECT id, author, message FROM blog JOIN comment ON id = blog_id;
+ id | author | message
+----+--------+-------------
+ 4 | alice | insane!
+ 2 | bob | who did it?
+(2 rows)
+
+-- Check Non-RLS JOIN with RLS.
+SELECT id, author, message FROM comment JOIN blog ON id = blog_id;
+ id | author | message
+----+--------+-------------
+ 4 | alice | insane!
+ 2 | bob | who did it?
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE POLICY comment_1 ON comment USING (blog_id < 4);
+ALTER TABLE comment ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Check RLS JOIN RLS
+SELECT id, author, message FROM blog JOIN comment ON id = blog_id;
+ id | author | message
+----+--------+-------------
+ 2 | bob | who did it?
+(1 row)
+
+SELECT id, author, message FROM comment JOIN blog ON id = blog_id;
+ id | author | message
+----+--------+-------------
+ 2 | bob | who did it?
+(1 row)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP TABLE blog, comment;
+--
+-- Default Deny Policy
+--
+RESET SESSION AUTHORIZATION;
+DROP POLICY p2 ON t1;
+ALTER TABLE t1 OWNER TO regress_rls_alice;
+-- Check that default deny does not apply to superuser.
+RESET SESSION AUTHORIZATION;
+SELECT * FROM t1;
+ a | b
+----+----------------------------------
+ 1 | c4ca4238a0b923820dcc509a6f75849b
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 5 | e4da3b7fbbce2345d7772b0674a318d5
+ 7 | 8f14e45fceea167a5a36dedd4bea2543
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 11 | 6512bd43d9caa6e02c990b0a82652dca
+ 13 | c51ce410c124a10e0db5e4b97fc2af39
+ 15 | 9bf31c7ff062936a96d3c8bd1f8f2ff3
+ 17 | 70efdf2ec9b086079795c442636b55fb
+ 19 | 1f0e3dad99908345f7439f8ffabdffc4
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(22 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+----------------
+ Seq Scan on t1
+(1 row)
+
+-- Check that default deny does not apply to table owner.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM t1;
+ a | b
+----+----------------------------------
+ 1 | c4ca4238a0b923820dcc509a6f75849b
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 5 | e4da3b7fbbce2345d7772b0674a318d5
+ 7 | 8f14e45fceea167a5a36dedd4bea2543
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 11 | 6512bd43d9caa6e02c990b0a82652dca
+ 13 | c51ce410c124a10e0db5e4b97fc2af39
+ 15 | 9bf31c7ff062936a96d3c8bd1f8f2ff3
+ 17 | 70efdf2ec9b086079795c442636b55fb
+ 19 | 1f0e3dad99908345f7439f8ffabdffc4
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(22 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+----------------
+ Seq Scan on t1
+(1 row)
+
+-- Check that default deny applies to non-owner/non-superuser when RLS on.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+SELECT * FROM t1;
+ a | b
+---+---
+(0 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+--------------------------
+ Result
+ One-Time Filter: false
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM t1;
+ a | b
+---+---
+(0 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+--------------------------
+ Result
+ One-Time Filter: false
+(2 rows)
+
+--
+-- COPY TO/FROM
+--
+RESET SESSION AUTHORIZATION;
+DROP TABLE copy_t CASCADE;
+ERROR: table "copy_t" does not exist
+CREATE TABLE copy_t (a integer, b text);
+CREATE POLICY p1 ON copy_t USING (a % 2 = 0);
+ALTER TABLE copy_t ENABLE ROW LEVEL SECURITY;
+GRANT ALL ON copy_t TO regress_rls_bob, regress_rls_exempt_user;
+INSERT INTO copy_t (SELECT x, md5(x::text) FROM generate_series(0,10) x);
+-- Check COPY TO as Superuser/owner.
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ',';
+0,cfcd208495d565ef66e7dff9f98764da
+1,c4ca4238a0b923820dcc509a6f75849b
+2,c81e728d9d4c2f636f067f89cc14862c
+3,eccbc87e4b5ce2fe28308fd9f2a7baf3
+4,a87ff679a2f3e71d9181a67b7542122c
+5,e4da3b7fbbce2345d7772b0674a318d5
+6,1679091c5a880faf6fb5e6087eb1b2dc
+7,8f14e45fceea167a5a36dedd4bea2543
+8,c9f0f895fb98ab9159f51fd0297e236d
+9,45c48cce2e2d7fbdea1afc51c7c6ad26
+10,d3d9446802a44259755d38e6d163e820
+SET row_security TO ON;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ',';
+0,cfcd208495d565ef66e7dff9f98764da
+1,c4ca4238a0b923820dcc509a6f75849b
+2,c81e728d9d4c2f636f067f89cc14862c
+3,eccbc87e4b5ce2fe28308fd9f2a7baf3
+4,a87ff679a2f3e71d9181a67b7542122c
+5,e4da3b7fbbce2345d7772b0674a318d5
+6,1679091c5a880faf6fb5e6087eb1b2dc
+7,8f14e45fceea167a5a36dedd4bea2543
+8,c9f0f895fb98ab9159f51fd0297e236d
+9,45c48cce2e2d7fbdea1afc51c7c6ad26
+10,d3d9446802a44259755d38e6d163e820
+-- Check COPY TO as user with permissions.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO OFF;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --fail - would be affected by RLS
+ERROR: query would be affected by row-level security policy for table "copy_t"
+SET row_security TO ON;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --ok
+0,cfcd208495d565ef66e7dff9f98764da
+2,c81e728d9d4c2f636f067f89cc14862c
+4,a87ff679a2f3e71d9181a67b7542122c
+6,1679091c5a880faf6fb5e6087eb1b2dc
+8,c9f0f895fb98ab9159f51fd0297e236d
+10,d3d9446802a44259755d38e6d163e820
+-- Check COPY TO as user with permissions and BYPASSRLS
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --ok
+0,cfcd208495d565ef66e7dff9f98764da
+1,c4ca4238a0b923820dcc509a6f75849b
+2,c81e728d9d4c2f636f067f89cc14862c
+3,eccbc87e4b5ce2fe28308fd9f2a7baf3
+4,a87ff679a2f3e71d9181a67b7542122c
+5,e4da3b7fbbce2345d7772b0674a318d5
+6,1679091c5a880faf6fb5e6087eb1b2dc
+7,8f14e45fceea167a5a36dedd4bea2543
+8,c9f0f895fb98ab9159f51fd0297e236d
+9,45c48cce2e2d7fbdea1afc51c7c6ad26
+10,d3d9446802a44259755d38e6d163e820
+SET row_security TO ON;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --ok
+0,cfcd208495d565ef66e7dff9f98764da
+1,c4ca4238a0b923820dcc509a6f75849b
+2,c81e728d9d4c2f636f067f89cc14862c
+3,eccbc87e4b5ce2fe28308fd9f2a7baf3
+4,a87ff679a2f3e71d9181a67b7542122c
+5,e4da3b7fbbce2345d7772b0674a318d5
+6,1679091c5a880faf6fb5e6087eb1b2dc
+7,8f14e45fceea167a5a36dedd4bea2543
+8,c9f0f895fb98ab9159f51fd0297e236d
+9,45c48cce2e2d7fbdea1afc51c7c6ad26
+10,d3d9446802a44259755d38e6d163e820
+-- Check COPY TO as user without permissions. SET row_security TO OFF;
+SET SESSION AUTHORIZATION regress_rls_carol;
+SET row_security TO OFF;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --fail - would be affected by RLS
+ERROR: query would be affected by row-level security policy for table "copy_t"
+SET row_security TO ON;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --fail - permission denied
+ERROR: permission denied for table copy_t
+-- Check COPY relation TO; keep it just one row to avoid reordering issues
+RESET SESSION AUTHORIZATION;
+SET row_security TO ON;
+CREATE TABLE copy_rel_to (a integer, b text);
+CREATE POLICY p1 ON copy_rel_to USING (a % 2 = 0);
+ALTER TABLE copy_rel_to ENABLE ROW LEVEL SECURITY;
+GRANT ALL ON copy_rel_to TO regress_rls_bob, regress_rls_exempt_user;
+INSERT INTO copy_rel_to VALUES (1, md5('1'));
+-- Check COPY TO as Superuser/owner.
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ',';
+1,c4ca4238a0b923820dcc509a6f75849b
+SET row_security TO ON;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ',';
+1,c4ca4238a0b923820dcc509a6f75849b
+-- Check COPY TO as user with permissions.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO OFF;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --fail - would be affected by RLS
+ERROR: query would be affected by row-level security policy for table "copy_rel_to"
+SET row_security TO ON;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --ok
+-- Check COPY TO as user with permissions and BYPASSRLS
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --ok
+1,c4ca4238a0b923820dcc509a6f75849b
+SET row_security TO ON;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --ok
+1,c4ca4238a0b923820dcc509a6f75849b
+-- Check COPY TO as user without permissions. SET row_security TO OFF;
+SET SESSION AUTHORIZATION regress_rls_carol;
+SET row_security TO OFF;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --fail - permission denied
+ERROR: permission denied for table copy_rel_to
+SET row_security TO ON;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --fail - permission denied
+ERROR: permission denied for table copy_rel_to
+-- Check COPY FROM as Superuser/owner.
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+COPY copy_t FROM STDIN; --ok
+SET row_security TO ON;
+COPY copy_t FROM STDIN; --ok
+-- Check COPY FROM as user with permissions.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO OFF;
+COPY copy_t FROM STDIN; --fail - would be affected by RLS.
+ERROR: query would be affected by row-level security policy for table "copy_t"
+SET row_security TO ON;
+COPY copy_t FROM STDIN; --fail - COPY FROM not supported by RLS.
+ERROR: COPY FROM not supported with row-level security
+HINT: Use INSERT statements instead.
+-- Check COPY FROM as user with permissions and BYPASSRLS
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO ON;
+COPY copy_t FROM STDIN; --ok
+-- Check COPY FROM as user without permissions.
+SET SESSION AUTHORIZATION regress_rls_carol;
+SET row_security TO OFF;
+COPY copy_t FROM STDIN; --fail - permission denied.
+ERROR: permission denied for table copy_t
+SET row_security TO ON;
+COPY copy_t FROM STDIN; --fail - permission denied.
+ERROR: permission denied for table copy_t
+RESET SESSION AUTHORIZATION;
+DROP TABLE copy_t;
+DROP TABLE copy_rel_to CASCADE;
+-- Check WHERE CURRENT OF
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE current_check (currentid int, payload text, rlsuser text);
+GRANT ALL ON current_check TO PUBLIC;
+INSERT INTO current_check VALUES
+ (1, 'abc', 'regress_rls_bob'),
+ (2, 'bcd', 'regress_rls_bob'),
+ (3, 'cde', 'regress_rls_bob'),
+ (4, 'def', 'regress_rls_bob');
+CREATE POLICY p1 ON current_check FOR SELECT USING (currentid % 2 = 0);
+CREATE POLICY p2 ON current_check FOR DELETE USING (currentid = 4 AND rlsuser = current_user);
+CREATE POLICY p3 ON current_check FOR UPDATE USING (currentid = 4) WITH CHECK (rlsuser = current_user);
+ALTER TABLE current_check ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Can SELECT even rows
+SELECT * FROM current_check;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+ 4 | def | regress_rls_bob
+(2 rows)
+
+-- Cannot UPDATE row 2
+UPDATE current_check SET payload = payload || '_new' WHERE currentid = 2 RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+---------
+(0 rows)
+
+BEGIN;
+DECLARE current_check_cursor SCROLL CURSOR FOR SELECT * FROM current_check;
+-- Returns rows that can be seen according to SELECT policy, like plain SELECT
+-- above (even rows)
+FETCH ABSOLUTE 1 FROM current_check_cursor;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+(1 row)
+
+-- Still cannot UPDATE row 2 through cursor
+UPDATE current_check SET payload = payload || '_new' WHERE CURRENT OF current_check_cursor RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+---------
+(0 rows)
+
+-- Can update row 4 through cursor, which is the next visible row
+FETCH RELATIVE 1 FROM current_check_cursor;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 4 | def | regress_rls_bob
+(1 row)
+
+UPDATE current_check SET payload = payload || '_new' WHERE CURRENT OF current_check_cursor RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 4 | def_new | regress_rls_bob
+(1 row)
+
+SELECT * FROM current_check;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+ 4 | def_new | regress_rls_bob
+(2 rows)
+
+-- Plan should be a subquery TID scan
+EXPLAIN (COSTS OFF) UPDATE current_check SET payload = payload WHERE CURRENT OF current_check_cursor;
+ QUERY PLAN
+-------------------------------------------------------------
+ Update on current_check
+ -> Tid Scan on current_check
+ TID Cond: CURRENT OF current_check_cursor
+ Filter: ((currentid = 4) AND ((currentid % 2) = 0))
+(4 rows)
+
+-- Similarly can only delete row 4
+FETCH ABSOLUTE 1 FROM current_check_cursor;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+(1 row)
+
+DELETE FROM current_check WHERE CURRENT OF current_check_cursor RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+---------
+(0 rows)
+
+FETCH RELATIVE 1 FROM current_check_cursor;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 4 | def | regress_rls_bob
+(1 row)
+
+DELETE FROM current_check WHERE CURRENT OF current_check_cursor RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 4 | def_new | regress_rls_bob
+(1 row)
+
+SELECT * FROM current_check;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+(1 row)
+
+COMMIT;
+--
+-- check pg_stats view filtering
+--
+SET row_security TO ON;
+SET SESSION AUTHORIZATION regress_rls_alice;
+ANALYZE current_check;
+-- Stats visible
+SELECT row_security_active('current_check');
+ row_security_active
+---------------------
+ f
+(1 row)
+
+SELECT attname, most_common_vals FROM pg_stats
+ WHERE tablename = 'current_check'
+ ORDER BY 1;
+ attname | most_common_vals
+-----------+-------------------
+ currentid | {4}
+ payload |
+ rlsuser | {regress_rls_bob}
+(3 rows)
+
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Stats not visible
+SELECT row_security_active('current_check');
+ row_security_active
+---------------------
+ t
+(1 row)
+
+SELECT attname, most_common_vals FROM pg_stats
+ WHERE tablename = 'current_check'
+ ORDER BY 1;
+ attname | most_common_vals
+---------+------------------
+(0 rows)
+
+--
+-- Collation support
+--
+BEGIN;
+CREATE TABLE coll_t (c) AS VALUES ('bar'::text);
+CREATE POLICY coll_p ON coll_t USING (c < ('foo'::text COLLATE "C"));
+ALTER TABLE coll_t ENABLE ROW LEVEL SECURITY;
+GRANT SELECT ON coll_t TO regress_rls_alice;
+SELECT (string_to_array(polqual, ':'))[7] AS inputcollid FROM pg_policy WHERE polrelid = 'coll_t'::regclass;
+ inputcollid
+------------------
+ inputcollid 950
+(1 row)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM coll_t;
+ c
+-----
+ bar
+(1 row)
+
+ROLLBACK;
+--
+-- Shared Object Dependencies
+--
+RESET SESSION AUTHORIZATION;
+BEGIN;
+CREATE ROLE regress_rls_eve;
+CREATE ROLE regress_rls_frank;
+CREATE TABLE tbl1 (c) AS VALUES ('bar'::text);
+GRANT SELECT ON TABLE tbl1 TO regress_rls_eve;
+CREATE POLICY P ON tbl1 TO regress_rls_eve, regress_rls_frank USING (true);
+SELECT refclassid::regclass, deptype
+ FROM pg_depend
+ WHERE classid = 'pg_policy'::regclass
+ AND refobjid = 'tbl1'::regclass;
+ refclassid | deptype
+------------+---------
+ pg_class | a
+(1 row)
+
+SELECT refclassid::regclass, deptype
+ FROM pg_shdepend
+ WHERE classid = 'pg_policy'::regclass
+ AND refobjid IN ('regress_rls_eve'::regrole, 'regress_rls_frank'::regrole);
+ refclassid | deptype
+------------+---------
+ pg_authid | r
+ pg_authid | r
+(2 rows)
+
+SAVEPOINT q;
+DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
+ERROR: role "regress_rls_eve" cannot be dropped because some objects depend on it
+DETAIL: privileges for table tbl1
+target of policy p on table tbl1
+ROLLBACK TO q;
+ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
+SAVEPOINT q;
+DROP ROLE regress_rls_eve; --fails due to dependency on GRANT SELECT
+ERROR: role "regress_rls_eve" cannot be dropped because some objects depend on it
+DETAIL: privileges for table tbl1
+ROLLBACK TO q;
+REVOKE ALL ON TABLE tbl1 FROM regress_rls_eve;
+SAVEPOINT q;
+DROP ROLE regress_rls_eve; --succeeds
+ROLLBACK TO q;
+SAVEPOINT q;
+DROP ROLE regress_rls_frank; --fails due to dependency on POLICY p
+ERROR: role "regress_rls_frank" cannot be dropped because some objects depend on it
+DETAIL: target of policy p on table tbl1
+ROLLBACK TO q;
+DROP POLICY p ON tbl1;
+SAVEPOINT q;
+DROP ROLE regress_rls_frank; -- succeeds
+ROLLBACK TO q;
+ROLLBACK; -- cleanup
+--
+-- Converting table to view
+--
+BEGIN;
+CREATE TABLE t (c int);
+CREATE POLICY p ON t USING (c % 2 = 1);
+ALTER TABLE t ENABLE ROW LEVEL SECURITY;
+SAVEPOINT q;
+CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD
+ SELECT * FROM generate_series(1,5) t0(c); -- fails due to row level security enabled
+ERROR: could not convert table "t" to a view because it has row security enabled
+ROLLBACK TO q;
+ALTER TABLE t DISABLE ROW LEVEL SECURITY;
+SAVEPOINT q;
+CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD
+ SELECT * FROM generate_series(1,5) t0(c); -- fails due to policy p on t
+ERROR: could not convert table "t" to a view because it has row security policies
+ROLLBACK TO q;
+DROP POLICY p ON t;
+CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD
+ SELECT * FROM generate_series(1,5) t0(c); -- succeeds
+ROLLBACK;
+--
+-- Policy expression handling
+--
+BEGIN;
+CREATE TABLE t (c) AS VALUES ('bar'::text);
+CREATE POLICY p ON t USING (max(c)); -- fails: aggregate functions are not allowed in policy expressions
+ERROR: aggregate functions are not allowed in policy expressions
+ROLLBACK;
+--
+-- Non-target relations are only subject to SELECT policies
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE r1 (a int);
+CREATE TABLE r2 (a int);
+INSERT INTO r1 VALUES (10), (20);
+INSERT INTO r2 VALUES (10), (20);
+GRANT ALL ON r1, r2 TO regress_rls_bob;
+CREATE POLICY p1 ON r1 USING (true);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON r2 FOR SELECT USING (true);
+CREATE POLICY p2 ON r2 FOR INSERT WITH CHECK (false);
+CREATE POLICY p3 ON r2 FOR UPDATE USING (false);
+CREATE POLICY p4 ON r2 FOR DELETE USING (false);
+ALTER TABLE r2 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM r1;
+ a
+----
+ 10
+ 20
+(2 rows)
+
+SELECT * FROM r2;
+ a
+----
+ 10
+ 20
+(2 rows)
+
+-- r2 is read-only
+INSERT INTO r2 VALUES (2); -- Not allowed
+ERROR: new row violates row-level security policy for table "r2"
+UPDATE r2 SET a = 2 RETURNING *; -- Updates nothing
+ a
+---
+(0 rows)
+
+DELETE FROM r2 RETURNING *; -- Deletes nothing
+ a
+---
+(0 rows)
+
+-- r2 can be used as a non-target relation in DML
+INSERT INTO r1 SELECT a + 1 FROM r2 RETURNING *; -- OK
+ a
+----
+ 11
+ 21
+(2 rows)
+
+UPDATE r1 SET a = r2.a + 2 FROM r2 WHERE r1.a = r2.a RETURNING *; -- OK
+ a | a
+----+----
+ 12 | 10
+ 22 | 20
+(2 rows)
+
+DELETE FROM r1 USING r2 WHERE r1.a = r2.a + 2 RETURNING *; -- OK
+ a | a
+----+----
+ 12 | 10
+ 22 | 20
+(2 rows)
+
+SELECT * FROM r1;
+ a
+----
+ 11
+ 21
+(2 rows)
+
+SELECT * FROM r2;
+ a
+----
+ 10
+ 20
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP TABLE r1;
+DROP TABLE r2;
+--
+-- FORCE ROW LEVEL SECURITY applies RLS to owners too
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security = on;
+CREATE TABLE r1 (a int);
+INSERT INTO r1 VALUES (10), (20);
+CREATE POLICY p1 ON r1 USING (false);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- No error, but no rows
+TABLE r1;
+ a
+---
+(0 rows)
+
+-- RLS error
+INSERT INTO r1 VALUES (1);
+ERROR: new row violates row-level security policy for table "r1"
+-- No error (unable to see any rows to update)
+UPDATE r1 SET a = 1;
+TABLE r1;
+ a
+---
+(0 rows)
+
+-- No error (unable to see any rows to delete)
+DELETE FROM r1;
+TABLE r1;
+ a
+---
+(0 rows)
+
+SET row_security = off;
+-- these all fail, would be affected by RLS
+TABLE r1;
+ERROR: query would be affected by row-level security policy for table "r1"
+HINT: To disable the policy for the table's owner, use ALTER TABLE NO FORCE ROW LEVEL SECURITY.
+UPDATE r1 SET a = 1;
+ERROR: query would be affected by row-level security policy for table "r1"
+HINT: To disable the policy for the table's owner, use ALTER TABLE NO FORCE ROW LEVEL SECURITY.
+DELETE FROM r1;
+ERROR: query would be affected by row-level security policy for table "r1"
+HINT: To disable the policy for the table's owner, use ALTER TABLE NO FORCE ROW LEVEL SECURITY.
+DROP TABLE r1;
+--
+-- FORCE ROW LEVEL SECURITY does not break RI
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security = on;
+CREATE TABLE r1 (a int PRIMARY KEY);
+CREATE TABLE r2 (a int REFERENCES r1);
+INSERT INTO r1 VALUES (10), (20);
+INSERT INTO r2 VALUES (10), (20);
+-- Create policies on r2 which prevent the
+-- owner from seeing any rows, but RI should
+-- still see them.
+CREATE POLICY p1 ON r2 USING (false);
+ALTER TABLE r2 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r2 FORCE ROW LEVEL SECURITY;
+-- Errors due to rows in r2
+DELETE FROM r1;
+ERROR: update or delete on table "r1" violates foreign key constraint "r2_a_fkey" on table "r2"
+DETAIL: Key (a)=(10) is still referenced from table "r2".
+-- Reset r2 to no-RLS
+DROP POLICY p1 ON r2;
+ALTER TABLE r2 NO FORCE ROW LEVEL SECURITY;
+ALTER TABLE r2 DISABLE ROW LEVEL SECURITY;
+-- clean out r2 for INSERT test below
+DELETE FROM r2;
+-- Change r1 to not allow rows to be seen
+CREATE POLICY p1 ON r1 USING (false);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- No rows seen
+TABLE r1;
+ a
+---
+(0 rows)
+
+-- No error, RI still sees that row exists in r1
+INSERT INTO r2 VALUES (10);
+DROP TABLE r2;
+DROP TABLE r1;
+-- Ensure cascaded DELETE works
+CREATE TABLE r1 (a int PRIMARY KEY);
+CREATE TABLE r2 (a int REFERENCES r1 ON DELETE CASCADE);
+INSERT INTO r1 VALUES (10), (20);
+INSERT INTO r2 VALUES (10), (20);
+-- Create policies on r2 which prevent the
+-- owner from seeing any rows, but RI should
+-- still see them.
+CREATE POLICY p1 ON r2 USING (false);
+ALTER TABLE r2 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r2 FORCE ROW LEVEL SECURITY;
+-- Deletes all records from both
+DELETE FROM r1;
+-- Remove FORCE from r2
+ALTER TABLE r2 NO FORCE ROW LEVEL SECURITY;
+-- As owner, we now bypass RLS
+-- verify no rows in r2 now
+TABLE r2;
+ a
+---
+(0 rows)
+
+DROP TABLE r2;
+DROP TABLE r1;
+-- Ensure cascaded UPDATE works
+CREATE TABLE r1 (a int PRIMARY KEY);
+CREATE TABLE r2 (a int REFERENCES r1 ON UPDATE CASCADE);
+INSERT INTO r1 VALUES (10), (20);
+INSERT INTO r2 VALUES (10), (20);
+-- Create policies on r2 which prevent the
+-- owner from seeing any rows, but RI should
+-- still see them.
+CREATE POLICY p1 ON r2 USING (false);
+ALTER TABLE r2 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r2 FORCE ROW LEVEL SECURITY;
+-- Updates records in both
+UPDATE r1 SET a = a+5;
+-- Remove FORCE from r2
+ALTER TABLE r2 NO FORCE ROW LEVEL SECURITY;
+-- As owner, we now bypass RLS
+-- verify records in r2 updated
+TABLE r2;
+ a
+----
+ 15
+ 25
+(2 rows)
+
+DROP TABLE r2;
+DROP TABLE r1;
+--
+-- Test INSERT+RETURNING applies SELECT policies as
+-- WithCheckOptions (meaning an error is thrown)
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security = on;
+CREATE TABLE r1 (a int);
+CREATE POLICY p1 ON r1 FOR SELECT USING (false);
+CREATE POLICY p2 ON r1 FOR INSERT WITH CHECK (true);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- Works fine
+INSERT INTO r1 VALUES (10), (20);
+-- No error, but no rows
+TABLE r1;
+ a
+---
+(0 rows)
+
+SET row_security = off;
+-- fail, would be affected by RLS
+TABLE r1;
+ERROR: query would be affected by row-level security policy for table "r1"
+HINT: To disable the policy for the table's owner, use ALTER TABLE NO FORCE ROW LEVEL SECURITY.
+SET row_security = on;
+-- Error
+INSERT INTO r1 VALUES (10), (20) RETURNING *;
+ERROR: new row violates row-level security policy for table "r1"
+DROP TABLE r1;
+--
+-- Test UPDATE+RETURNING applies SELECT policies as
+-- WithCheckOptions (meaning an error is thrown)
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security = on;
+CREATE TABLE r1 (a int PRIMARY KEY);
+CREATE POLICY p1 ON r1 FOR SELECT USING (a < 20);
+CREATE POLICY p2 ON r1 FOR UPDATE USING (a < 20) WITH CHECK (true);
+CREATE POLICY p3 ON r1 FOR INSERT WITH CHECK (true);
+INSERT INTO r1 VALUES (10);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- Works fine
+UPDATE r1 SET a = 30;
+-- Show updated rows
+ALTER TABLE r1 NO FORCE ROW LEVEL SECURITY;
+TABLE r1;
+ a
+----
+ 30
+(1 row)
+
+-- reset value in r1 for test with RETURNING
+UPDATE r1 SET a = 10;
+-- Verify row reset
+TABLE r1;
+ a
+----
+ 10
+(1 row)
+
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- Error
+UPDATE r1 SET a = 30 RETURNING *;
+ERROR: new row violates row-level security policy for table "r1"
+-- UPDATE path of INSERT ... ON CONFLICT DO UPDATE should also error out
+INSERT INTO r1 VALUES (10)
+ ON CONFLICT (a) DO UPDATE SET a = 30 RETURNING *;
+ERROR: new row violates row-level security policy for table "r1"
+-- Should still error out without RETURNING (use of arbiter always requires
+-- SELECT permissions)
+INSERT INTO r1 VALUES (10)
+ ON CONFLICT (a) DO UPDATE SET a = 30;
+ERROR: new row violates row-level security policy for table "r1"
+INSERT INTO r1 VALUES (10)
+ ON CONFLICT ON CONSTRAINT r1_pkey DO UPDATE SET a = 30;
+ERROR: new row violates row-level security policy for table "r1"
+DROP TABLE r1;
+-- Check dependency handling
+RESET SESSION AUTHORIZATION;
+CREATE TABLE dep1 (c1 int);
+CREATE TABLE dep2 (c1 int);
+CREATE POLICY dep_p1 ON dep1 TO regress_rls_bob USING (c1 > (select max(dep2.c1) from dep2));
+ALTER POLICY dep_p1 ON dep1 TO regress_rls_bob,regress_rls_carol;
+-- Should return one
+SELECT count(*) = 1 FROM pg_depend
+ WHERE objid = (SELECT oid FROM pg_policy WHERE polname = 'dep_p1')
+ AND refobjid = (SELECT oid FROM pg_class WHERE relname = 'dep2');
+ ?column?
+----------
+ t
+(1 row)
+
+ALTER POLICY dep_p1 ON dep1 USING (true);
+-- Should return one
+SELECT count(*) = 1 FROM pg_shdepend
+ WHERE objid = (SELECT oid FROM pg_policy WHERE polname = 'dep_p1')
+ AND refobjid = (SELECT oid FROM pg_authid WHERE rolname = 'regress_rls_bob');
+ ?column?
+----------
+ t
+(1 row)
+
+-- Should return one
+SELECT count(*) = 1 FROM pg_shdepend
+ WHERE objid = (SELECT oid FROM pg_policy WHERE polname = 'dep_p1')
+ AND refobjid = (SELECT oid FROM pg_authid WHERE rolname = 'regress_rls_carol');
+ ?column?
+----------
+ t
+(1 row)
+
+-- Should return zero
+SELECT count(*) = 0 FROM pg_depend
+ WHERE objid = (SELECT oid FROM pg_policy WHERE polname = 'dep_p1')
+ AND refobjid = (SELECT oid FROM pg_class WHERE relname = 'dep2');
+ ?column?
+----------
+ t
+(1 row)
+
+-- DROP OWNED BY testing
+RESET SESSION AUTHORIZATION;
+CREATE ROLE regress_rls_dob_role1;
+CREATE ROLE regress_rls_dob_role2;
+CREATE TABLE dob_t1 (c1 int);
+CREATE TABLE dob_t2 (c1 int) PARTITION BY RANGE (c1);
+CREATE POLICY p1 ON dob_t1 TO regress_rls_dob_role1 USING (true);
+DROP OWNED BY regress_rls_dob_role1;
+DROP POLICY p1 ON dob_t1; -- should fail, already gone
+ERROR: policy "p1" for table "dob_t1" does not exist
+CREATE POLICY p1 ON dob_t1 TO regress_rls_dob_role1,regress_rls_dob_role2 USING (true);
+DROP OWNED BY regress_rls_dob_role1;
+DROP POLICY p1 ON dob_t1; -- should succeed
+CREATE POLICY p1 ON dob_t2 TO regress_rls_dob_role1,regress_rls_dob_role2 USING (true);
+DROP OWNED BY regress_rls_dob_role1;
+DROP POLICY p1 ON dob_t2; -- should succeed
+DROP USER regress_rls_dob_role1;
+DROP USER regress_rls_dob_role2;
+-- Bug #15708: view + table with RLS should check policies as view owner
+CREATE TABLE ref_tbl (a int);
+INSERT INTO ref_tbl VALUES (1);
+CREATE TABLE rls_tbl (a int);
+INSERT INTO rls_tbl VALUES (10);
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON rls_tbl USING (EXISTS (SELECT 1 FROM ref_tbl));
+GRANT SELECT ON ref_tbl TO regress_rls_bob;
+GRANT SELECT ON rls_tbl TO regress_rls_bob;
+CREATE VIEW rls_view AS SELECT * FROM rls_tbl;
+ALTER VIEW rls_view OWNER TO regress_rls_bob;
+GRANT SELECT ON rls_view TO regress_rls_alice;
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM ref_tbl; -- Permission denied
+ERROR: permission denied for table ref_tbl
+SELECT * FROM rls_tbl; -- Permission denied
+ERROR: permission denied for table rls_tbl
+SELECT * FROM rls_view; -- OK
+ a
+----
+ 10
+(1 row)
+
+RESET SESSION AUTHORIZATION;
+DROP VIEW rls_view;
+DROP TABLE rls_tbl;
+DROP TABLE ref_tbl;
+-- Leaky operator test
+CREATE TABLE rls_tbl (a int);
+INSERT INTO rls_tbl SELECT x/10 FROM generate_series(1, 100) x;
+ANALYZE rls_tbl;
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+GRANT SELECT ON rls_tbl TO regress_rls_alice;
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE FUNCTION op_leak(int, int) RETURNS bool
+ AS 'BEGIN RAISE NOTICE ''op_leak => %, %'', $1, $2; RETURN $1 < $2; END'
+ LANGUAGE plpgsql;
+CREATE OPERATOR <<< (procedure = op_leak, leftarg = int, rightarg = int,
+ restrict = scalarltsel);
+SELECT * FROM rls_tbl WHERE a <<< 1000;
+ a
+---
+(0 rows)
+
+DROP OPERATOR <<< (int, int);
+DROP FUNCTION op_leak(int, int);
+RESET SESSION AUTHORIZATION;
+DROP TABLE rls_tbl;
+-- Bug #16006: whole-row Vars in a policy don't play nice with sub-selects
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE rls_tbl (a int, b int, c int);
+CREATE POLICY p1 ON rls_tbl USING (rls_tbl >= ROW(1,1,1));
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+ALTER TABLE rls_tbl FORCE ROW LEVEL SECURITY;
+INSERT INTO rls_tbl SELECT 10, 20, 30;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO rls_tbl
+ SELECT * FROM (SELECT b, c FROM rls_tbl ORDER BY a) ss;
+ QUERY PLAN
+--------------------------------------------------------------------
+ Insert on regress_rls_schema.rls_tbl
+ -> Subquery Scan on ss
+ Output: ss.b, ss.c, NULL::integer
+ -> Sort
+ Output: rls_tbl_1.b, rls_tbl_1.c, rls_tbl_1.a
+ Sort Key: rls_tbl_1.a
+ -> Seq Scan on regress_rls_schema.rls_tbl rls_tbl_1
+ Output: rls_tbl_1.b, rls_tbl_1.c, rls_tbl_1.a
+ Filter: (rls_tbl_1.* >= '(1,1,1)'::record)
+(9 rows)
+
+INSERT INTO rls_tbl
+ SELECT * FROM (SELECT b, c FROM rls_tbl ORDER BY a) ss;
+SELECT * FROM rls_tbl;
+ a | b | c
+----+----+----
+ 10 | 20 | 30
+ 20 | 30 |
+(2 rows)
+
+DROP TABLE rls_tbl;
+RESET SESSION AUTHORIZATION;
+--
+-- Clean up objects
+--
+RESET SESSION AUTHORIZATION;
+DROP SCHEMA regress_rls_schema CASCADE;
+NOTICE: drop cascades to 29 other objects
+DETAIL: drop cascades to function f_leak(text)
+drop cascades to table uaccount
+drop cascades to table category
+drop cascades to table document
+drop cascades to table part_document
+drop cascades to table dependent
+drop cascades to table rec1
+drop cascades to table rec2
+drop cascades to view rec1v
+drop cascades to view rec2v
+drop cascades to table s1
+drop cascades to table s2
+drop cascades to view v2
+drop cascades to table b1
+drop cascades to view bv1
+drop cascades to table z1
+drop cascades to table z2
+drop cascades to table x1
+drop cascades to table y1
+drop cascades to table y2
+drop cascades to table t1
+drop cascades to table t2
+drop cascades to table t3
+drop cascades to table t4
+drop cascades to table current_check
+drop cascades to table dep1
+drop cascades to table dep2
+drop cascades to table dob_t1
+drop cascades to table dob_t2
+DROP USER regress_rls_alice;
+DROP USER regress_rls_bob;
+DROP USER regress_rls_carol;
+DROP USER regress_rls_dave;
+DROP USER regress_rls_exempt_user;
+DROP ROLE regress_rls_group1;
+DROP ROLE regress_rls_group2;
+-- Arrange to have a few policies left over, for testing
+-- pg_dump/pg_restore
+CREATE SCHEMA regress_rls_schema;
+CREATE TABLE rls_tbl (c1 int);
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON rls_tbl USING (c1 > 5);
+CREATE POLICY p2 ON rls_tbl FOR SELECT USING (c1 <= 3);
+CREATE POLICY p3 ON rls_tbl FOR UPDATE USING (c1 <= 3) WITH CHECK (c1 > 5);
+CREATE POLICY p4 ON rls_tbl FOR DELETE USING (c1 <= 3);
+CREATE TABLE rls_tbl_force (c1 int);
+ALTER TABLE rls_tbl_force ENABLE ROW LEVEL SECURITY;
+ALTER TABLE rls_tbl_force FORCE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON rls_tbl_force USING (c1 = 5) WITH CHECK (c1 < 5);
+CREATE POLICY p2 ON rls_tbl_force FOR SELECT USING (c1 = 8);
+CREATE POLICY p3 ON rls_tbl_force FOR UPDATE USING (c1 = 8) WITH CHECK (c1 >= 5);
+CREATE POLICY p4 ON rls_tbl_force FOR DELETE USING (c1 = 8);
diff --git src/test/regress/expected/sanity_check.out src/test/regress/expected/sanity_check.out
index 192445878d..088d35fdc9 100644
--- src/test/regress/expected/sanity_check.out
+++ src/test/regress/expected/sanity_check.out
@@ -43,7 +43,6 @@ dupindexcols|t
e_star|f
emp|f
equipment_r|f
-extra_wide_table|f
f_star|f
fast_emp4000|t
float4_tbl|f
diff --git src/test/regress/expected/select_parallel_1.out src/test/regress/expected/select_parallel_1.out
new file mode 100644
index 0000000000..c63aa334c0
--- /dev/null
+++ src/test/regress/expected/select_parallel_1.out
@@ -0,0 +1,1150 @@
+--
+-- PARALLEL
+--
+create function sp_parallel_restricted(int) returns int as
+ $$begin return $1; end$$ language plpgsql parallel restricted;
+-- Serializable isolation would disable parallel query, so explicitly use an
+-- arbitrary other level.
+begin isolation level repeatable read;
+-- encourage use of parallel plans
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+-- Parallel Append with partial-subplans
+explain (costs off)
+ select round(avg(aa)), sum(aa) from a_star;
+ QUERY PLAN
+-----------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 3
+ -> Partial Aggregate
+ -> Parallel Append
+ -> Parallel Seq Scan on d_star
+ -> Parallel Seq Scan on f_star
+ -> Parallel Seq Scan on e_star
+ -> Parallel Seq Scan on b_star
+ -> Parallel Seq Scan on c_star
+ -> Parallel Seq Scan on a_star
+(11 rows)
+
+select round(avg(aa)), sum(aa) from a_star a1;
+ round | sum
+-------+-----
+ 14 | 355
+(1 row)
+
+-- Parallel Append with both partial and non-partial subplans
+alter table c_star set (parallel_workers = 0);
+alter table d_star set (parallel_workers = 0);
+explain (costs off)
+ select round(avg(aa)), sum(aa) from a_star;
+ QUERY PLAN
+-----------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 3
+ -> Partial Aggregate
+ -> Parallel Append
+ -> Seq Scan on d_star
+ -> Seq Scan on c_star
+ -> Parallel Seq Scan on f_star
+ -> Parallel Seq Scan on e_star
+ -> Parallel Seq Scan on b_star
+ -> Parallel Seq Scan on a_star
+(11 rows)
+
+select round(avg(aa)), sum(aa) from a_star a2;
+ round | sum
+-------+-----
+ 14 | 355
+(1 row)
+
+-- Parallel Append with only non-partial subplans
+alter table a_star set (parallel_workers = 0);
+alter table b_star set (parallel_workers = 0);
+alter table e_star set (parallel_workers = 0);
+alter table f_star set (parallel_workers = 0);
+explain (costs off)
+ select round(avg(aa)), sum(aa) from a_star;
+ QUERY PLAN
+--------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 3
+ -> Partial Aggregate
+ -> Parallel Append
+ -> Seq Scan on d_star
+ -> Seq Scan on f_star
+ -> Seq Scan on e_star
+ -> Seq Scan on b_star
+ -> Seq Scan on c_star
+ -> Seq Scan on a_star
+(11 rows)
+
+select round(avg(aa)), sum(aa) from a_star a3;
+ round | sum
+-------+-----
+ 14 | 355
+(1 row)
+
+-- Temporary hack to investigate whether extra vacuum/analyze is happening
+select relname, relpages, reltuples
+from pg_class
+where relname like '__star' order by relname;
+ relname | relpages | reltuples
+---------+----------+-----------
+ a_star | 5 | 3
+ b_star | 6 | 4
+ c_star | 6 | 4
+ d_star | 8 | 16
+ e_star | 7 | 7
+ f_star | 8 | 16
+(6 rows)
+
+-- Disable Parallel Append
+alter table a_star reset (parallel_workers);
+alter table b_star reset (parallel_workers);
+alter table c_star reset (parallel_workers);
+alter table d_star reset (parallel_workers);
+alter table e_star reset (parallel_workers);
+alter table f_star reset (parallel_workers);
+set enable_parallel_append to off;
+explain (costs off)
+ select round(avg(aa)), sum(aa) from a_star;
+ QUERY PLAN
+-----------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Append
+ -> Parallel Seq Scan on a_star
+ -> Parallel Seq Scan on b_star
+ -> Parallel Seq Scan on c_star
+ -> Parallel Seq Scan on d_star
+ -> Parallel Seq Scan on e_star
+ -> Parallel Seq Scan on f_star
+(11 rows)
+
+select round(avg(aa)), sum(aa) from a_star a4;
+ round | sum
+-------+-----
+ 14 | 355
+(1 row)
+
+reset enable_parallel_append;
+-- Parallel Append that runs serially
+create function sp_test_func() returns setof text as
+$$ select 'foo'::varchar union all select 'bar'::varchar $$
+language sql stable;
+select sp_test_func() order by 1;
+ sp_test_func
+--------------
+ bar
+ foo
+(2 rows)
+
+-- Parallel Append is not to be used when the subpath depends on the outer param
+create table part_pa_test(a int, b int) partition by range(a);
+create table part_pa_test_p1 partition of part_pa_test for values from (minvalue) to (0);
+create table part_pa_test_p2 partition of part_pa_test for values from (0) to (maxvalue);
+explain (costs off)
+ select (select max((select pa1.b from part_pa_test pa1 where pa1.a = pa2.a)))
+ from part_pa_test pa2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 3
+ -> Parallel Append
+ -> Parallel Seq Scan on part_pa_test_p1 pa2
+ -> Parallel Seq Scan on part_pa_test_p2 pa2_1
+ SubPlan 2
+ -> Result
+ SubPlan 1
+ -> Append
+ -> Seq Scan on part_pa_test_p1 pa1
+ Filter: (a = pa2.a)
+ -> Seq Scan on part_pa_test_p2 pa1_1
+ Filter: (a = pa2.a)
+(14 rows)
+
+drop table part_pa_test;
+-- test with leader participation disabled
+set parallel_leader_participation = off;
+explain (costs off)
+ select count(*) from tenk1 where stringu1 = 'GRAAAA';
+ QUERY PLAN
+---------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Seq Scan on tenk1
+ Filter: (stringu1 = 'GRAAAA'::name)
+(6 rows)
+
+select count(*) from tenk1 where stringu1 = 'GRAAAA';
+ count
+-------
+ 15
+(1 row)
+
+-- test with leader participation disabled, but no workers available (so
+-- the leader will have to run the plan despite the setting)
+set max_parallel_workers = 0;
+explain (costs off)
+ select count(*) from tenk1 where stringu1 = 'GRAAAA';
+ QUERY PLAN
+---------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Seq Scan on tenk1
+ Filter: (stringu1 = 'GRAAAA'::name)
+(6 rows)
+
+select count(*) from tenk1 where stringu1 = 'GRAAAA';
+ count
+-------
+ 15
+(1 row)
+
+reset max_parallel_workers;
+reset parallel_leader_participation;
+-- test that parallel_restricted function doesn't run in worker
+alter table tenk1 set (parallel_workers = 4);
+explain (verbose, costs off)
+select sp_parallel_restricted(unique1) from tenk1
+ where stringu1 = 'GRAAAA' order by 1;
+ QUERY PLAN
+---------------------------------------------------------
+ Sort
+ Output: (sp_parallel_restricted(unique1))
+ Sort Key: (sp_parallel_restricted(tenk1.unique1))
+ -> Gather
+ Output: sp_parallel_restricted(unique1)
+ Workers Planned: 4
+ -> Parallel Seq Scan on public.tenk1
+ Output: unique1
+ Filter: (tenk1.stringu1 = 'GRAAAA'::name)
+(9 rows)
+
+-- test parallel plan when group by expression is in target list.
+explain (costs off)
+ select length(stringu1) from tenk1 group by length(stringu1);
+ QUERY PLAN
+---------------------------------------------------
+ Finalize HashAggregate
+ Group Key: (length((stringu1)::text))
+ -> Gather
+ Workers Planned: 4
+ -> Partial HashAggregate
+ Group Key: length((stringu1)::text)
+ -> Parallel Seq Scan on tenk1
+(7 rows)
+
+select length(stringu1) from tenk1 group by length(stringu1);
+ length
+--------
+ 6
+(1 row)
+
+explain (costs off)
+ select stringu1, count(*) from tenk1 group by stringu1 order by stringu1;
+ QUERY PLAN
+----------------------------------------------------
+ Sort
+ Sort Key: stringu1
+ -> Finalize HashAggregate
+ Group Key: stringu1
+ -> Gather
+ Workers Planned: 4
+ -> Partial HashAggregate
+ Group Key: stringu1
+ -> Parallel Seq Scan on tenk1
+(9 rows)
+
+-- test that parallel plan for aggregates is not selected when
+-- target list contains parallel restricted clause.
+explain (costs off)
+ select sum(sp_parallel_restricted(unique1)) from tenk1
+ group by(sp_parallel_restricted(unique1));
+ QUERY PLAN
+-------------------------------------------------------------------
+ HashAggregate
+ Group Key: sp_parallel_restricted(unique1)
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Index Only Scan using tenk1_unique1 on tenk1
+(5 rows)
+
+-- test prepared statement
+prepare tenk1_count(integer) As select count((unique1)) from tenk1 where hundred > $1;
+explain (costs off) execute tenk1_count(1);
+ QUERY PLAN
+----------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Seq Scan on tenk1
+ Filter: (hundred > 1)
+(6 rows)
+
+execute tenk1_count(1);
+ count
+-------
+ 9800
+(1 row)
+
+deallocate tenk1_count;
+-- test parallel plans for queries containing un-correlated subplans.
+alter table tenk2 set (parallel_workers = 0);
+explain (costs off)
+ select count(*) from tenk1 where (two, four) not in
+ (select hundred, thousand from tenk2 where thousand > 100);
+ QUERY PLAN
+------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Seq Scan on tenk1
+ Filter: (NOT (hashed SubPlan 1))
+ SubPlan 1
+ -> Seq Scan on tenk2
+ Filter: (thousand > 100)
+(9 rows)
+
+select count(*) from tenk1 where (two, four) not in
+ (select hundred, thousand from tenk2 where thousand > 100);
+ count
+-------
+ 10000
+(1 row)
+
+-- this is not parallel-safe due to use of random() within SubLink's testexpr:
+explain (costs off)
+ select * from tenk1 where (unique1 + random())::integer not in
+ (select ten from tenk2);
+ QUERY PLAN
+------------------------------------
+ Seq Scan on tenk1
+ Filter: (NOT (hashed SubPlan 1))
+ SubPlan 1
+ -> Seq Scan on tenk2
+(4 rows)
+
+alter table tenk2 reset (parallel_workers);
+-- test parallel plan for a query containing initplan.
+set enable_indexscan = off;
+set enable_indexonlyscan = off;
+set enable_bitmapscan = off;
+alter table tenk2 set (parallel_workers = 2);
+explain (costs off)
+ select count(*) from tenk1
+ where tenk1.unique1 = (Select max(tenk2.unique1) from tenk2);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ InitPlan 1 (returns $2)
+ -> Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Seq Scan on tenk2
+ -> Gather
+ Workers Planned: 4
+ Params Evaluated: $2
+ -> Parallel Seq Scan on tenk1
+ Filter: (unique1 = $2)
+(12 rows)
+
+select count(*) from tenk1
+ where tenk1.unique1 = (Select max(tenk2.unique1) from tenk2);
+ count
+-------
+ 1
+(1 row)
+
+reset enable_indexscan;
+reset enable_indexonlyscan;
+reset enable_bitmapscan;
+alter table tenk2 reset (parallel_workers);
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+explain (costs off)
+ select count((unique1)) from tenk1 where hundred > 1;
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Index Scan using tenk1_hundred on tenk1
+ Index Cond: (hundred > 1)
+(6 rows)
+
+select count((unique1)) from tenk1 where hundred > 1;
+ count
+-------
+ 9800
+(1 row)
+
+-- test parallel index-only scans.
+explain (costs off)
+ select count(*) from tenk1 where thousand > 95;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Index Only Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: (thousand > 95)
+(6 rows)
+
+select count(*) from tenk1 where thousand > 95;
+ count
+-------
+ 9040
+(1 row)
+
+-- test rescan cases too
+set enable_material = false;
+explain (costs off)
+select * from
+ (select count(unique1) from tenk1 where hundred > 10) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Nested Loop Left Join
+ -> Values Scan on "*VALUES*"
+ -> Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Index Scan using tenk1_hundred on tenk1
+ Index Cond: (hundred > 10)
+(8 rows)
+
+select * from
+ (select count(unique1) from tenk1 where hundred > 10) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ count | x
+-------+---
+ 8900 | 1
+ 8900 | 2
+ 8900 | 3
+(3 rows)
+
+explain (costs off)
+select * from
+ (select count(*) from tenk1 where thousand > 99) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Nested Loop Left Join
+ -> Values Scan on "*VALUES*"
+ -> Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Index Only Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: (thousand > 99)
+(8 rows)
+
+select * from
+ (select count(*) from tenk1 where thousand > 99) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ count | x
+-------+---
+ 9000 | 1
+ 9000 | 2
+ 9000 | 3
+(3 rows)
+
+reset enable_material;
+reset enable_seqscan;
+reset enable_bitmapscan;
+-- test parallel bitmap heap scan.
+set enable_seqscan to off;
+set enable_indexscan to off;
+set enable_hashjoin to off;
+set enable_mergejoin to off;
+set enable_material to off;
+-- test prefetching, if the platform allows it
+DO $$
+BEGIN
+ SET effective_io_concurrency = 50;
+EXCEPTION WHEN invalid_parameter_value THEN
+END $$;
+set work_mem='64kB'; --set small work mem to force lossy pages
+explain (costs off)
+ select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
+ QUERY PLAN
+------------------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ -> Seq Scan on tenk2
+ Filter: (thousand = 0)
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Bitmap Heap Scan on tenk1
+ Recheck Cond: (hundred > 1)
+ -> Bitmap Index Scan on tenk1_hundred
+ Index Cond: (hundred > 1)
+(10 rows)
+
+select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
+ count
+-------
+ 98000
+(1 row)
+
+create table bmscantest (a int, t text);
+insert into bmscantest select r, 'fooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo' FROM generate_series(1,100000) r;
+create index i_bmtest ON bmscantest(a);
+select count(*) from bmscantest where a>1;
+ count
+-------
+ 99999
+(1 row)
+
+-- test accumulation of stats for parallel nodes
+reset enable_seqscan;
+alter table tenk2 set (parallel_workers = 0);
+explain (analyze, timing off, summary off, costs off)
+ select count(*) from tenk1, tenk2 where tenk1.hundred > 1
+ and tenk2.thousand=0;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+ -> Nested Loop (actual rows=98000 loops=1)
+ -> Seq Scan on tenk2 (actual rows=10 loops=1)
+ Filter: (thousand = 0)
+ Rows Removed by Filter: 9990
+ -> Gather (actual rows=9800 loops=10)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
+ Filter: (hundred > 1)
+ Rows Removed by Filter: 40
+(11 rows)
+
+alter table tenk2 reset (parallel_workers);
+reset work_mem;
+create function explain_parallel_sort_stats() returns setof text
+language plpgsql as
+$$
+declare ln text;
+begin
+ for ln in
+ explain (analyze, timing off, summary off, costs off)
+ select * from
+ (select ten from tenk1 where ten < 100 order by ten) ss
+ right join (values (1),(2),(3)) v(x) on true
+ loop
+ ln := regexp_replace(ln, 'Memory: \S*', 'Memory: xxx');
+ return next ln;
+ end loop;
+end;
+$$;
+select * from explain_parallel_sort_stats();
+ explain_parallel_sort_stats
+--------------------------------------------------------------------------
+ Nested Loop Left Join (actual rows=30000 loops=1)
+ -> Values Scan on "*VALUES*" (actual rows=3 loops=1)
+ -> Gather Merge (actual rows=10000 loops=3)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Sort (actual rows=2000 loops=15)
+ Sort Key: tenk1.ten
+ Sort Method: quicksort Memory: xxx
+ Worker 0: Sort Method: quicksort Memory: xxx
+ Worker 1: Sort Method: quicksort Memory: xxx
+ Worker 2: Sort Method: quicksort Memory: xxx
+ Worker 3: Sort Method: quicksort Memory: xxx
+ -> Parallel Seq Scan on tenk1 (actual rows=2000 loops=15)
+ Filter: (ten < 100)
+(14 rows)
+
+reset enable_indexscan;
+reset enable_hashjoin;
+reset enable_mergejoin;
+reset enable_material;
+reset effective_io_concurrency;
+drop table bmscantest;
+drop function explain_parallel_sort_stats();
+-- test parallel merge join path.
+set enable_hashjoin to off;
+set enable_nestloop to off;
+explain (costs off)
+ select count(*) from tenk1, tenk2 where tenk1.unique1 = tenk2.unique1;
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Merge Join
+ Merge Cond: (tenk1.unique1 = tenk2.unique1)
+ -> Parallel Index Only Scan using tenk1_unique1 on tenk1
+ -> Index Only Scan using tenk2_unique1 on tenk2
+(8 rows)
+
+select count(*) from tenk1, tenk2 where tenk1.unique1 = tenk2.unique1;
+ count
+-------
+ 10000
+(1 row)
+
+reset enable_hashjoin;
+reset enable_nestloop;
+-- test gather merge
+set enable_hashagg = false;
+explain (costs off)
+ select count(*) from tenk1 group by twenty;
+ QUERY PLAN
+----------------------------------------------------
+ Finalize GroupAggregate
+ Group Key: twenty
+ -> Gather Merge
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Group Key: twenty
+ -> Sort
+ Sort Key: twenty
+ -> Parallel Seq Scan on tenk1
+(9 rows)
+
+select count(*) from tenk1 group by twenty;
+ count
+-------
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+(20 rows)
+
+--test expressions in targetlist are pushed down for gather merge
+create function sp_simple_func(var1 integer) returns integer
+as $$
+begin
+ return var1 + 10;
+end;
+$$ language plpgsql PARALLEL SAFE;
+explain (costs off, verbose)
+ select ten, sp_simple_func(ten) from tenk1 where ten < 100 order by ten;
+ QUERY PLAN
+-----------------------------------------------------
+ Gather Merge
+ Output: ten, (sp_simple_func(ten))
+ Workers Planned: 4
+ -> Result
+ Output: ten, sp_simple_func(ten)
+ -> Sort
+ Output: ten
+ Sort Key: tenk1.ten
+ -> Parallel Seq Scan on public.tenk1
+ Output: ten
+ Filter: (tenk1.ten < 100)
+(11 rows)
+
+drop function sp_simple_func(integer);
+-- test handling of SRFs in targetlist (bug in 10.0)
+explain (costs off)
+ select count(*), generate_series(1,2) from tenk1 group by twenty;
+ QUERY PLAN
+----------------------------------------------------------
+ ProjectSet
+ -> Finalize GroupAggregate
+ Group Key: twenty
+ -> Gather Merge
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Group Key: twenty
+ -> Sort
+ Sort Key: twenty
+ -> Parallel Seq Scan on tenk1
+(10 rows)
+
+select count(*), generate_series(1,2) from tenk1 group by twenty;
+ count | generate_series
+-------+-----------------
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+(40 rows)
+
+-- test gather merge with parallel leader participation disabled
+set parallel_leader_participation = off;
+explain (costs off)
+ select count(*) from tenk1 group by twenty;
+ QUERY PLAN
+----------------------------------------------------
+ Finalize GroupAggregate
+ Group Key: twenty
+ -> Gather Merge
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Group Key: twenty
+ -> Sort
+ Sort Key: twenty
+ -> Parallel Seq Scan on tenk1
+(9 rows)
+
+select count(*) from tenk1 group by twenty;
+ count
+-------
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+(20 rows)
+
+reset parallel_leader_participation;
+--test rescan behavior of gather merge
+set enable_material = false;
+explain (costs off)
+select * from
+ (select string4, count(unique2)
+ from tenk1 group by string4 order by string4) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ QUERY PLAN
+----------------------------------------------------------
+ Nested Loop Left Join
+ -> Values Scan on "*VALUES*"
+ -> Finalize GroupAggregate
+ Group Key: tenk1.string4
+ -> Gather Merge
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Group Key: tenk1.string4
+ -> Sort
+ Sort Key: tenk1.string4
+ -> Parallel Seq Scan on tenk1
+(11 rows)
+
+select * from
+ (select string4, count(unique2)
+ from tenk1 group by string4 order by string4) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ string4 | count | x
+---------+-------+---
+ AAAAxx | 2500 | 1
+ HHHHxx | 2500 | 1
+ OOOOxx | 2500 | 1
+ VVVVxx | 2500 | 1
+ AAAAxx | 2500 | 2
+ HHHHxx | 2500 | 2
+ OOOOxx | 2500 | 2
+ VVVVxx | 2500 | 2
+ AAAAxx | 2500 | 3
+ HHHHxx | 2500 | 3
+ OOOOxx | 2500 | 3
+ VVVVxx | 2500 | 3
+(12 rows)
+
+reset enable_material;
+reset enable_hashagg;
+-- check parallelized int8 aggregate (bug #14897)
+explain (costs off)
+select avg(unique1::int8) from tenk1;
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Index Only Scan using tenk1_unique1 on tenk1
+(5 rows)
+
+select avg(unique1::int8) from tenk1;
+ avg
+-----------------------
+ 4999.5000000000000000
+(1 row)
+
+-- gather merge test with a LIMIT
+explain (costs off)
+ select fivethous from tenk1 order by fivethous limit 4;
+ QUERY PLAN
+----------------------------------------------
+ Limit
+ -> Gather Merge
+ Workers Planned: 4
+ -> Sort
+ Sort Key: fivethous
+ -> Parallel Seq Scan on tenk1
+(6 rows)
+
+select fivethous from tenk1 order by fivethous limit 4;
+ fivethous
+-----------
+ 0
+ 0
+ 1
+ 1
+(4 rows)
+
+-- gather merge test with 0 worker
+set max_parallel_workers = 0;
+explain (costs off)
+ select string4 from tenk1 order by string4 limit 5;
+ QUERY PLAN
+----------------------------------------------
+ Limit
+ -> Gather Merge
+ Workers Planned: 4
+ -> Sort
+ Sort Key: string4
+ -> Parallel Seq Scan on tenk1
+(6 rows)
+
+select string4 from tenk1 order by string4 limit 5;
+ string4
+---------
+ AAAAxx
+ AAAAxx
+ AAAAxx
+ AAAAxx
+ AAAAxx
+(5 rows)
+
+-- gather merge test with 0 workers, with parallel leader
+-- participation disabled (the leader will have to run the plan
+-- despite the setting)
+set parallel_leader_participation = off;
+explain (costs off)
+ select string4 from tenk1 order by string4 limit 5;
+ QUERY PLAN
+----------------------------------------------
+ Limit
+ -> Gather Merge
+ Workers Planned: 4
+ -> Sort
+ Sort Key: string4
+ -> Parallel Seq Scan on tenk1
+(6 rows)
+
+select string4 from tenk1 order by string4 limit 5;
+ string4
+---------
+ AAAAxx
+ AAAAxx
+ AAAAxx
+ AAAAxx
+ AAAAxx
+(5 rows)
+
+reset parallel_leader_participation;
+reset max_parallel_workers;
+SAVEPOINT settings;
+SET LOCAL force_parallel_mode = 1;
+explain (costs off)
+ select stringu1::int2 from tenk1 where unique1 = 1;
+ QUERY PLAN
+-----------------------------------------------
+ Gather
+ Workers Planned: 1
+ Single Copy: true
+ -> Index Scan using tenk1_unique1 on tenk1
+ Index Cond: (unique1 = 1)
+(5 rows)
+
+ROLLBACK TO SAVEPOINT settings;
+-- exercise record typmod remapping between backends
+CREATE FUNCTION make_record(n int)
+ RETURNS RECORD LANGUAGE plpgsql PARALLEL SAFE AS
+$$
+BEGIN
+ RETURN CASE n
+ WHEN 1 THEN ROW(1)
+ WHEN 2 THEN ROW(1, 2)
+ WHEN 3 THEN ROW(1, 2, 3)
+ WHEN 4 THEN ROW(1, 2, 3, 4)
+ ELSE ROW(1, 2, 3, 4, 5)
+ END;
+END;
+$$;
+SAVEPOINT settings;
+SET LOCAL force_parallel_mode = 1;
+SELECT make_record(x) FROM (SELECT generate_series(1, 5) x) ss ORDER BY x;
+ make_record
+-------------
+ (1)
+ (1,2)
+ (1,2,3)
+ (1,2,3,4)
+ (1,2,3,4,5)
+(5 rows)
+
+ROLLBACK TO SAVEPOINT settings;
+DROP function make_record(n int);
+-- test the sanity of parallel query after the active role is dropped.
+drop role if exists regress_parallel_worker;
+NOTICE: role "regress_parallel_worker" does not exist, skipping
+create role regress_parallel_worker;
+set role regress_parallel_worker;
+reset session authorization;
+drop role regress_parallel_worker;
+set force_parallel_mode = 1;
+select count(*) from tenk1;
+ count
+-------
+ 10000
+(1 row)
+
+reset force_parallel_mode;
+reset role;
+-- Window function calculation can't be pushed to workers.
+explain (costs off, verbose)
+ select count(*) from tenk1 a where (unique1, two) in
+ (select unique1, row_number() over() from tenk1 b);
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
+ Aggregate
+ Output: count(*)
+ -> Hash Semi Join
+ Hash Cond: ((a.unique1 = b.unique1) AND (a.two = (row_number() OVER (?))))
+ -> Gather
+ Output: a.unique1, a.two
+ Workers Planned: 4
+ -> Parallel Seq Scan on public.tenk1 a
+ Output: a.unique1, a.two
+ -> Hash
+ Output: b.unique1, (row_number() OVER (?))
+ -> WindowAgg
+ Output: b.unique1, row_number() OVER (?)
+ -> Gather
+ Output: b.unique1
+ Workers Planned: 4
+ -> Parallel Index Only Scan using tenk1_unique1 on public.tenk1 b
+ Output: b.unique1
+(18 rows)
+
+-- LIMIT/OFFSET within sub-selects can't be pushed to workers.
+explain (costs off)
+ select * from tenk1 a where two in
+ (select two from tenk1 b where stringu1 like '%AAAA' limit 3);
+ QUERY PLAN
+---------------------------------------------------------------
+ Hash Semi Join
+ Hash Cond: (a.two = b.two)
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Seq Scan on tenk1 a
+ -> Hash
+ -> Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Seq Scan on tenk1 b
+ Filter: (stringu1 ~~ '%AAAA'::text)
+(11 rows)
+
+-- to increase the parallel query test coverage
+SAVEPOINT settings;
+SET LOCAL force_parallel_mode = 1;
+EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM tenk1;
+ QUERY PLAN
+-------------------------------------------------------------
+ Gather (actual rows=10000 loops=1)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Parallel Seq Scan on tenk1 (actual rows=2000 loops=5)
+(4 rows)
+
+ROLLBACK TO SAVEPOINT settings;
+-- provoke error in worker
+SAVEPOINT settings;
+SET LOCAL force_parallel_mode = 1;
+select stringu1::int2 from tenk1 where unique1 = 1;
+ERROR: invalid input syntax for type smallint: "BAAAAA"
+CONTEXT: parallel worker
+ROLLBACK TO SAVEPOINT settings;
+-- test interaction with set-returning functions
+SAVEPOINT settings;
+-- multiple subqueries under a single Gather node
+-- must set parallel_setup_cost > 0 to discourage multiple Gather nodes
+SET LOCAL parallel_setup_cost = 10;
+EXPLAIN (COSTS OFF)
+SELECT unique1 FROM tenk1 WHERE fivethous = tenthous + 1
+UNION ALL
+SELECT unique1 FROM tenk1 WHERE fivethous = tenthous + 1;
+ QUERY PLAN
+----------------------------------------------------
+ Gather
+ Workers Planned: 4
+ -> Parallel Append
+ -> Parallel Seq Scan on tenk1
+ Filter: (fivethous = (tenthous + 1))
+ -> Parallel Seq Scan on tenk1 tenk1_1
+ Filter: (fivethous = (tenthous + 1))
+(7 rows)
+
+ROLLBACK TO SAVEPOINT settings;
+-- can't use multiple subqueries under a single Gather node due to initPlans
+EXPLAIN (COSTS OFF)
+SELECT unique1 FROM tenk1 WHERE fivethous =
+ (SELECT unique1 FROM tenk1 WHERE fivethous = 1 LIMIT 1)
+UNION ALL
+SELECT unique1 FROM tenk1 WHERE fivethous =
+ (SELECT unique2 FROM tenk1 WHERE fivethous = 1 LIMIT 1)
+ORDER BY 1;
+ QUERY PLAN
+--------------------------------------------------------------------
+ Sort
+ Sort Key: tenk1.unique1
+ -> Append
+ -> Gather
+ Workers Planned: 4
+ Params Evaluated: $1
+ InitPlan 1 (returns $1)
+ -> Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Seq Scan on tenk1 tenk1_2
+ Filter: (fivethous = 1)
+ -> Parallel Seq Scan on tenk1
+ Filter: (fivethous = $1)
+ -> Gather
+ Workers Planned: 4
+ Params Evaluated: $3
+ InitPlan 2 (returns $3)
+ -> Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Seq Scan on tenk1 tenk1_3
+ Filter: (fivethous = 1)
+ -> Parallel Seq Scan on tenk1 tenk1_1
+ Filter: (fivethous = $3)
+(25 rows)
+
+-- test interaction with SRFs
+SELECT * FROM information_schema.foreign_data_wrapper_options
+ORDER BY 1, 2, 3;
+ foreign_data_wrapper_catalog | foreign_data_wrapper_name | option_name | option_value
+------------------------------+---------------------------+-------------+--------------
+(0 rows)
+
+-- test passing expanded-value representations to workers
+CREATE FUNCTION make_some_array(int,int) returns int[] as
+$$declare x int[];
+ begin
+ x[1] := $1;
+ x[2] := $2;
+ return x;
+ end$$ language plpgsql parallel safe;
+CREATE TABLE fooarr(f1 text, f2 int[], f3 text);
+INSERT INTO fooarr VALUES('1', ARRAY[1,2], 'one');
+PREPARE pstmt(text, int[]) AS SELECT * FROM fooarr WHERE f1 = $1 AND f2 = $2;
+EXPLAIN (COSTS OFF) EXECUTE pstmt('1', make_some_array(1,2));
+ QUERY PLAN
+------------------------------------------------------------------
+ Gather
+ Workers Planned: 3
+ -> Parallel Seq Scan on fooarr
+ Filter: ((f1 = '1'::text) AND (f2 = '{1,2}'::integer[]))
+(4 rows)
+
+EXECUTE pstmt('1', make_some_array(1,2));
+ f1 | f2 | f3
+----+-------+-----
+ 1 | {1,2} | one
+(1 row)
+
+DEALLOCATE pstmt;
+-- test interaction between subquery and partial_paths
+CREATE VIEW tenk1_vw_sec WITH (security_barrier) AS SELECT * FROM tenk1;
+EXPLAIN (COSTS OFF)
+SELECT 1 FROM tenk1_vw_sec
+ WHERE (SELECT sum(f1) FROM int4_tbl WHERE f1 < unique1) < 100;
+ QUERY PLAN
+-------------------------------------------------------------------
+ Subquery Scan on tenk1_vw_sec
+ Filter: ((SubPlan 1) < 100)
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Index Only Scan using tenk1_unique1 on tenk1
+ SubPlan 1
+ -> Aggregate
+ -> Seq Scan on int4_tbl
+ Filter: (f1 < tenk1_vw_sec.unique1)
+(9 rows)
+
+rollback;
diff --git src/test/regress/expected/strings_1.out src/test/regress/expected/strings_1.out
new file mode 100644
index 0000000000..5959baf837
--- /dev/null
+++ src/test/regress/expected/strings_1.out
@@ -0,0 +1,1973 @@
+--
+-- STRINGS
+-- Test various data entry syntaxes.
+--
+-- SQL string continuation syntax
+-- E021-03 character string literals
+SELECT 'first line'
+' - next line'
+ ' - third line'
+ AS "Three lines to one";
+ Three lines to one
+-------------------------------------
+ first line - next line - third line
+(1 row)
+
+-- illegal string continuation syntax
+SELECT 'first line'
+' - next line' /* this comment is not allowed here */
+' - third line'
+ AS "Illegal comment within continuation";
+ERROR: syntax error at or near "' - third line'"
+LINE 3: ' - third line'
+ ^
+-- Unicode escapes
+SET standard_conforming_strings TO on;
+SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
+ data
+------
+ data
+(1 row)
+
+SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061" UESCAPE '*';
+ dat\+000061
+-------------
+ dat\+000061
+(1 row)
+
+SELECT U&'a\\b' AS "a\b";
+ a\b
+-----
+ a\b
+(1 row)
+
+SELECT U&' \' UESCAPE '!' AS "tricky";
+ tricky
+--------
+ \
+(1 row)
+
+SELECT 'tricky' AS U&"\" UESCAPE '!';
+ \
+--------
+ tricky
+(1 row)
+
+SELECT U&'wrong: \061';
+ERROR: invalid Unicode escape
+LINE 1: SELECT U&'wrong: \061';
+ ^
+HINT: Unicode escapes must be \XXXX or \+XXXXXX.
+SELECT U&'wrong: \+0061';
+ERROR: invalid Unicode escape
+LINE 1: SELECT U&'wrong: \+0061';
+ ^
+HINT: Unicode escapes must be \XXXX or \+XXXXXX.
+SELECT U&'wrong: +0061' UESCAPE +;
+ERROR: UESCAPE must be followed by a simple string literal at or near "+"
+LINE 1: SELECT U&'wrong: +0061' UESCAPE +;
+ ^
+SELECT U&'wrong: +0061' UESCAPE '+';
+ERROR: invalid Unicode escape character at or near "'+'"
+LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
+ ^
+SELECT U&'wrong: \db99';
+ERROR: invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \db99';
+ ^
+SELECT U&'wrong: \db99xy';
+ERROR: invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \db99xy';
+ ^
+SELECT U&'wrong: \db99\\';
+ERROR: invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \db99\\';
+ ^
+SELECT U&'wrong: \db99\0061';
+ERROR: invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \db99\0061';
+ ^
+SELECT U&'wrong: \+00db99\+000061';
+ERROR: invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \+00db99\+000061';
+ ^
+SELECT U&'wrong: \+2FFFFF';
+ERROR: invalid Unicode escape value
+LINE 1: SELECT U&'wrong: \+2FFFFF';
+ ^
+-- while we're here, check the same cases in E-style literals
+SELECT E'd\u0061t\U00000061' AS "data";
+ data
+------
+ data
+(1 row)
+
+SELECT E'a\\b' AS "a\b";
+ a\b
+-----
+ a\b
+(1 row)
+
+SELECT E'wrong: \u061';
+ERROR: invalid Unicode escape
+LINE 1: SELECT E'wrong: \u061';
+ ^
+HINT: Unicode escapes must be \uXXXX or \UXXXXXXXX.
+SELECT E'wrong: \U0061';
+ERROR: invalid Unicode escape
+LINE 1: SELECT E'wrong: \U0061';
+ ^
+HINT: Unicode escapes must be \uXXXX or \UXXXXXXXX.
+SELECT E'wrong: \udb99';
+ERROR: invalid Unicode surrogate pair at or near "'"
+LINE 1: SELECT E'wrong: \udb99';
+ ^
+SELECT E'wrong: \udb99xy';
+ERROR: invalid Unicode surrogate pair at or near "x"
+LINE 1: SELECT E'wrong: \udb99xy';
+ ^
+SELECT E'wrong: \udb99\\';
+ERROR: invalid Unicode surrogate pair at or near "\"
+LINE 1: SELECT E'wrong: \udb99\\';
+ ^
+SELECT E'wrong: \udb99\u0061';
+ERROR: invalid Unicode surrogate pair at or near "\u0061"
+LINE 1: SELECT E'wrong: \udb99\u0061';
+ ^
+SELECT E'wrong: \U0000db99\U00000061';
+ERROR: invalid Unicode surrogate pair at or near "\U00000061"
+LINE 1: SELECT E'wrong: \U0000db99\U00000061';
+ ^
+SELECT E'wrong: \U002FFFFF';
+ERROR: invalid Unicode escape value at or near "\U002FFFFF"
+LINE 1: SELECT E'wrong: \U002FFFFF';
+ ^
+SET standard_conforming_strings TO off;
+SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061" UESCAPE '*';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061...
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&' \' UESCAPE '!' AS "tricky";
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&' \' UESCAPE '!' AS "tricky";
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT 'tricky' AS U&"\" UESCAPE '!';
+ \
+--------
+ tricky
+(1 row)
+
+SELECT U&'wrong: \061';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'wrong: \061';
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&'wrong: \+0061';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'wrong: \+0061';
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&'wrong: +0061' UESCAPE '+';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+RESET standard_conforming_strings;
+-- bytea
+SET bytea_output TO hex;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\x De Ad Be Ef '::bytea;
+ bytea
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\xDeAdBeE'::bytea;
+ERROR: invalid hexadecimal data: odd number of digits
+LINE 1: SELECT E'\\xDeAdBeE'::bytea;
+ ^
+SELECT E'\\xDeAdBeEx'::bytea;
+ERROR: invalid hexadecimal digit: "x"
+LINE 1: SELECT E'\\xDeAdBeEx'::bytea;
+ ^
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------
+ \xde00beef
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+--------------------
+ \x4465416442654566
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465006442654566
+(1 row)
+
+SELECT E'De\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\678dBeEf'::bytea;
+ERROR: invalid input syntax for type bytea
+LINE 1: SELECT E'De\\678dBeEf'::bytea;
+ ^
+SET bytea_output TO escape;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\x De Ad Be Ef '::bytea;
+ bytea
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------------
+ \336\000\276\357
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+----------
+ DeAdBeEf
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+-------------
+ De\000dBeEf
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+----------
+ DeSdBeEf
+(1 row)
+
+--
+-- test conversions between various string types
+-- E021-10 implicit casting among the character data types
+--
+SELECT CAST(f1 AS text) AS "text(char)" FROM CHAR_TBL;
+ text(char)
+------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(f1 AS text) AS "text(varchar)" FROM VARCHAR_TBL;
+ text(varchar)
+---------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(name 'namefield' AS text) AS "text(name)";
+ text(name)
+------------
+ namefield
+(1 row)
+
+-- since this is an explicit cast, it should truncate w/o error:
+SELECT CAST(f1 AS char(10)) AS "char(text)" FROM TEXT_TBL;
+ char(text)
+------------
+ doh!
+ hi de ho n
+(2 rows)
+
+-- note: implicit-cast case is tested in char.sql
+SELECT CAST(f1 AS char(20)) AS "char(text)" FROM TEXT_TBL;
+ char(text)
+----------------------
+ doh!
+ hi de ho neighbor
+(2 rows)
+
+SELECT CAST(f1 AS char(10)) AS "char(varchar)" FROM VARCHAR_TBL;
+ char(varchar)
+---------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(name 'namefield' AS char(10)) AS "char(name)";
+ char(name)
+------------
+ namefield
+(1 row)
+
+SELECT CAST(f1 AS varchar) AS "varchar(text)" FROM TEXT_TBL;
+ varchar(text)
+-------------------
+ doh!
+ hi de ho neighbor
+(2 rows)
+
+SELECT CAST(f1 AS varchar) AS "varchar(char)" FROM CHAR_TBL;
+ varchar(char)
+---------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(name 'namefield' AS varchar) AS "varchar(name)";
+ varchar(name)
+---------------
+ namefield
+(1 row)
+
+--
+-- test SQL string functions
+-- E### and T### are feature reference numbers from SQL99
+--
+-- E021-09 trim function
+SELECT TRIM(BOTH FROM ' bunch o blanks ') = 'bunch o blanks' AS "bunch o blanks";
+ bunch o blanks
+----------------
+ t
+(1 row)
+
+SELECT TRIM(LEADING FROM ' bunch o blanks ') = 'bunch o blanks ' AS "bunch o blanks ";
+ bunch o blanks
+------------------
+ t
+(1 row)
+
+SELECT TRIM(TRAILING FROM ' bunch o blanks ') = ' bunch o blanks' AS " bunch o blanks";
+ bunch o blanks
+------------------
+ t
+(1 row)
+
+SELECT TRIM(BOTH 'x' FROM 'xxxxxsome Xsxxxxx') = 'some Xs' AS "some Xs";
+ some Xs
+---------
+ t
+(1 row)
+
+-- E021-06 substring expression
+SELECT SUBSTRING('1234567890' FROM 3) = '34567890' AS "34567890";
+ 34567890
+----------
+ t
+(1 row)
+
+SELECT SUBSTRING('1234567890' FROM 4 FOR 3) = '456' AS "456";
+ 456
+-----
+ t
+(1 row)
+
+-- T581 regular expression substring (with SQL's bizarre regexp syntax)
+SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
+ bcd
+-----
+ bcd
+(1 row)
+
+-- No match should return NULL
+SELECT SUBSTRING('abcdefg' FROM '#"(b_d)#"%' FOR '#') IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+-- Null inputs should return NULL
+SELECT SUBSTRING('abcdefg' FROM '%' FOR NULL) IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+SELECT SUBSTRING(NULL FROM '%' FOR '#') IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM NULL FOR '#') IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+-- The first and last parts should act non-greedy
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*' FOR '#') AS "abcdefg";
+ abcdefg
+---------
+ abcdefg
+(1 row)
+
+-- Vertical bar in any part affects only that part
+SELECT SUBSTRING('abcdefg' FROM 'a|b#"%#"g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a#"%#"x|g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a#"%|ab#"g' FOR '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+-- Can't have more than two part separators
+SELECT SUBSTRING('abcdefg' FROM 'a*#"%#"g*#"x' FOR '#') AS "error";
+ERROR: SQL regular expression may not contain more than two escape-double-quote separators
+CONTEXT: SQL function "substring" statement 1
+-- Postgres extension: with 0 or 1 separator, assume parts 1 and 3 are empty
+SELECT SUBSTRING('abcdefg' FROM 'a#"%g' FOR '#') AS "bcdefg";
+ bcdefg
+--------
+ bcdefg
+(1 row)
+
+SELECT SUBSTRING('abcdefg' FROM 'a%g' FOR '#') AS "abcdefg";
+ abcdefg
+---------
+ abcdefg
+(1 row)
+
+-- substring() with just two arguments is not allowed by SQL spec;
+-- we accept it, but we interpret the pattern as a POSIX regexp not SQL
+SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde";
+ cde
+-----
+ cde
+(1 row)
+
+-- With a parenthesized subexpression, return only what matches the subexpr
+SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde";
+ cde
+-----
+ cde
+(1 row)
+
+-- Check behavior of SIMILAR TO, which uses largely the same regexp variant
+SELECT 'abcdefg' SIMILAR TO '_bcd%' AS true;
+ true
+------
+ t
+(1 row)
+
+SELECT 'abcdefg' SIMILAR TO 'bcd%' AS false;
+ false
+-------
+ f
+(1 row)
+
+SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '#' AS false;
+ false
+-------
+ f
+(1 row)
+
+SELECT 'abcd%' SIMILAR TO '_bcd#%' ESCAPE '#' AS true;
+ true
+------
+ t
+(1 row)
+
+-- Postgres uses '\' as the default escape character, which is not per spec
+SELECT 'abcdefg' SIMILAR TO '_bcd\%' AS false;
+ false
+-------
+ f
+(1 row)
+
+-- and an empty string to mean "no escape", which is also not per spec
+SELECT 'abcd\efg' SIMILAR TO '_bcd\%' ESCAPE '' AS true;
+ true
+------
+ t
+(1 row)
+
+-- these behaviors are per spec, though:
+SELECT 'abcdefg' SIMILAR TO '_bcd%' ESCAPE NULL AS null;
+ null
+------
+
+(1 row)
+
+SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '##' AS error;
+ERROR: invalid escape string
+HINT: Escape string must be empty or one character.
+-- Test back reference in regexp_replace
+SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3');
+ regexp_replace
+----------------
+ (111) 222-3333
+(1 row)
+
+SELECT regexp_replace('AAA BBB CCC ', E'\\s+', ' ', 'g');
+ regexp_replace
+----------------
+ AAA BBB CCC
+(1 row)
+
+SELECT regexp_replace('AAA', '^|$', 'Z', 'g');
+ regexp_replace
+----------------
+ ZAAAZ
+(1 row)
+
+SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'gi');
+ regexp_replace
+----------------
+ Z Z
+(1 row)
+
+-- invalid regexp option
+SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'z');
+ERROR: invalid regular expression option: "z"
+-- set so we can tell NULL from empty string
+\pset null '\\N'
+-- return all matches from regexp
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
+ regexp_matches
+----------------
+ {bar,beque}
+(1 row)
+
+-- test case insensitive
+SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
+ regexp_matches
+----------------
+ {bAR,bEqUE}
+(1 row)
+
+-- global option - more than one match
+SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+ regexp_matches
+----------------
+ {bar,beque}
+ {bazil,barf}
+(2 rows)
+
+-- empty capture group (matched empty string)
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
+ regexp_matches
+----------------
+ {bar,"",beque}
+(1 row)
+
+-- no match
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(.+)(beque)$re$);
+ regexp_matches
+----------------
+(0 rows)
+
+-- optional capture group did not match, null entry in array
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(.+)?(beque)$re$);
+ regexp_matches
+------------------
+ {bar,NULL,beque}
+(1 row)
+
+-- no capture groups
+SELECT regexp_matches('foobarbequebaz', $re$barbeque$re$);
+ regexp_matches
+----------------
+ {barbeque}
+(1 row)
+
+-- start/end-of-line matches are of zero length
+SELECT regexp_matches('foo' || chr(10) || 'bar' || chr(10) || 'bequq' || chr(10) || 'baz', '^', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {""}
+ {""}
+ {""}
+(4 rows)
+
+SELECT regexp_matches('foo' || chr(10) || 'bar' || chr(10) || 'bequq' || chr(10) || 'baz', '$', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {""}
+ {""}
+ {""}
+(4 rows)
+
+SELECT regexp_matches('1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4' || chr(10), '^.?', 'mg');
+ regexp_matches
+----------------
+ {1}
+ {2}
+ {3}
+ {4}
+ {""}
+(5 rows)
+
+SELECT regexp_matches(chr(10) || '1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4' || chr(10), '.?$', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {1}
+ {""}
+ {2}
+ {""}
+ {3}
+ {""}
+ {4}
+ {""}
+ {""}
+(10 rows)
+
+SELECT regexp_matches(chr(10) || '1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4', '.?$', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {1}
+ {""}
+ {2}
+ {""}
+ {3}
+ {""}
+ {4}
+ {""}
+(9 rows)
+
+-- give me errors
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$, 'gz');
+ERROR: invalid regular expression option: "z"
+SELECT regexp_matches('foobarbequebaz', $re$(barbeque$re$);
+ERROR: invalid regular expression: parentheses () not balanced
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque){2,1}$re$);
+ERROR: invalid regular expression: invalid repetition count(s)
+-- split string on regexp
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', $re$\s+$re$) AS foo;
+ foo | length
+-------+--------
+ the | 3
+ quick | 5
+ brown | 5
+ fox | 3
+ jumps | 5
+ over | 4
+ the | 3
+ lazy | 4
+ dog | 3
+(9 rows)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', $re$\s+$re$);
+ regexp_split_to_array
+-----------------------------------------------
+ {the,quick,brown,fox,jumps,over,the,lazy,dog}
+(1 row)
+
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', $re$\s*$re$) AS foo;
+ foo | length
+-----+--------
+ t | 1
+ h | 1
+ e | 1
+ q | 1
+ u | 1
+ i | 1
+ c | 1
+ k | 1
+ b | 1
+ r | 1
+ o | 1
+ w | 1
+ n | 1
+ f | 1
+ o | 1
+ x | 1
+ j | 1
+ u | 1
+ m | 1
+ p | 1
+ s | 1
+ o | 1
+ v | 1
+ e | 1
+ r | 1
+ t | 1
+ h | 1
+ e | 1
+ l | 1
+ a | 1
+ z | 1
+ y | 1
+ d | 1
+ o | 1
+ g | 1
+(35 rows)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', $re$\s*$re$);
+ regexp_split_to_array
+-------------------------------------------------------------------------
+ {t,h,e,q,u,i,c,k,b,r,o,w,n,f,o,x,j,u,m,p,s,o,v,e,r,t,h,e,l,a,z,y,d,o,g}
+(1 row)
+
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', '') AS foo;
+ foo | length
+-----+--------
+ t | 1
+ h | 1
+ e | 1
+ | 1
+ q | 1
+ u | 1
+ i | 1
+ c | 1
+ k | 1
+ | 1
+ b | 1
+ r | 1
+ o | 1
+ w | 1
+ n | 1
+ | 1
+ f | 1
+ o | 1
+ x | 1
+ | 1
+ j | 1
+ u | 1
+ m | 1
+ p | 1
+ s | 1
+ | 1
+ o | 1
+ v | 1
+ e | 1
+ r | 1
+ | 1
+ t | 1
+ h | 1
+ e | 1
+ | 1
+ l | 1
+ a | 1
+ z | 1
+ y | 1
+ | 1
+ d | 1
+ o | 1
+ g | 1
+(43 rows)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', '');
+ regexp_split_to_array
+---------------------------------------------------------------------------------------------------------
+ {t,h,e," ",q,u,i,c,k," ",b,r,o,w,n," ",f,o,x," ",j,u,m,p,s," ",o,v,e,r," ",t,h,e," ",l,a,z,y," ",d,o,g}
+(1 row)
+
+-- case insensitive
+SELECT foo, length(foo) FROM regexp_split_to_table('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'i') AS foo;
+ foo | length
+---------------------------+--------
+ th | 2
+ QUick bROWn FOx jUMPs ov | 25
+ r Th | 4
+ lazy dOG | 9
+(4 rows)
+
+SELECT regexp_split_to_array('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'i');
+ regexp_split_to_array
+-----------------------------------------------------
+ {th," QUick bROWn FOx jUMPs ov","r Th"," lazy dOG"}
+(1 row)
+
+-- no match of pattern
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', 'nomatch') AS foo;
+ foo | length
+---------------------------------------------+--------
+ the quick brown fox jumps over the lazy dog | 43
+(1 row)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', 'nomatch');
+ regexp_split_to_array
+-------------------------------------------------
+ {"the quick brown fox jumps over the lazy dog"}
+(1 row)
+
+-- some corner cases
+SELECT regexp_split_to_array('123456','1');
+ regexp_split_to_array
+-----------------------
+ {"",23456}
+(1 row)
+
+SELECT regexp_split_to_array('123456','6');
+ regexp_split_to_array
+-----------------------
+ {12345,""}
+(1 row)
+
+SELECT regexp_split_to_array('123456','.');
+ regexp_split_to_array
+------------------------
+ {"","","","","","",""}
+(1 row)
+
+SELECT regexp_split_to_array('123456','');
+ regexp_split_to_array
+-----------------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT regexp_split_to_array('123456','(?:)');
+ regexp_split_to_array
+-----------------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT regexp_split_to_array('1','');
+ regexp_split_to_array
+-----------------------
+ {1}
+(1 row)
+
+-- errors
+SELECT foo, length(foo) FROM regexp_split_to_table('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'zippy') AS foo;
+ERROR: invalid regular expression option: "z"
+SELECT regexp_split_to_array('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'iz');
+ERROR: invalid regular expression option: "z"
+-- global option meaningless for regexp_split
+SELECT foo, length(foo) FROM regexp_split_to_table('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'g') AS foo;
+ERROR: regexp_split_to_table() does not support the "global" option
+SELECT regexp_split_to_array('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'g');
+ERROR: regexp_split_to_array() does not support the "global" option
+-- change NULL-display back
+\pset null ''
+-- E021-11 position expression
+SELECT POSITION('4' IN '1234567890') = '4' AS "4";
+ 4
+---
+ t
+(1 row)
+
+SELECT POSITION('5' IN '1234567890') = '5' AS "5";
+ 5
+---
+ t
+(1 row)
+
+-- T312 character overlay function
+SELECT OVERLAY('abcdef' PLACING '45' FROM 4) AS "abc45f";
+ abc45f
+--------
+ abc45f
+(1 row)
+
+SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5) AS "yabadaba";
+ yabadaba
+----------
+ yabadaba
+(1 row)
+
+SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0) AS "yabadabadoo";
+ yabadabadoo
+-------------
+ yabadabadoo
+(1 row)
+
+SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4) AS "bubba";
+ bubba
+-------
+ bubba
+(1 row)
+
+--
+-- test LIKE
+-- Be sure to form every test as a LIKE/NOT LIKE pair.
+--
+-- simplest examples
+-- E061-04 like predicate
+SELECT 'hawkeye' LIKE 'h%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'h%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' LIKE 'H%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'H%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' LIKE 'indio%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'indio%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' LIKE 'h%eye' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'h%eye' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE '_ndio' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE '_ndio' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE 'in__o' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE 'in__o' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE 'in_o' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' NOT LIKE 'in_o' AS "true";
+ true
+------
+ t
+(1 row)
+
+-- unused escape character
+SELECT 'hawkeye' LIKE 'h%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'h%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE 'ind_o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE 'ind_o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+-- escape character
+-- E061-05 like predicate with escape clause
+SELECT 'h%' LIKE 'h#%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%' NOT LIKE 'h#%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'h%wkeye' LIKE 'h#%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'h%wkeye' NOT LIKE 'h#%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%wkeye' LIKE 'h#%%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%wkeye' NOT LIKE 'h#%%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'h%awkeye' LIKE 'h#%a%k%e' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%awkeye' NOT LIKE 'h#%a%k%e' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE '_ndio' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE '_ndio' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'i_dio' LIKE 'i$_d_o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'i_dio' NOT LIKE 'i$_d_o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'i_dio' LIKE 'i$_nd_o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'i_dio' NOT LIKE 'i$_nd_o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'i_dio' LIKE 'i$_d%o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'i_dio' NOT LIKE 'i$_d%o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+-- escape character same as pattern character
+SELECT 'maca' LIKE 'm%aca' ESCAPE '%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'maca' NOT LIKE 'm%aca' ESCAPE '%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'ma%a' LIKE 'm%a%%a' ESCAPE '%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'ma%a' NOT LIKE 'm%a%%a' ESCAPE '%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'bear' LIKE 'b_ear' ESCAPE '_' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'bear' NOT LIKE 'b_ear' ESCAPE '_' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'be_r' LIKE 'b_e__r' ESCAPE '_' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'be_r' NOT LIKE 'b_e__r' ESCAPE '_' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'be_r' LIKE '__e__r' ESCAPE '_' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'be_r' NOT LIKE '__e__r' ESCAPE '_' AS "true";
+ true
+------
+ t
+(1 row)
+
+--
+-- test ILIKE (case-insensitive LIKE)
+-- Be sure to form every test as an ILIKE/NOT ILIKE pair.
+--
+SELECT 'hawkeye' ILIKE 'h%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT ILIKE 'h%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' ILIKE 'H%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT ILIKE 'H%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' ILIKE 'H%Eye' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT ILIKE 'H%Eye' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'Hawkeye' ILIKE 'h%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'Hawkeye' NOT ILIKE 'h%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+--
+-- test %/_ combination cases, cf bugs #4821 and #5478
+--
+SELECT 'foo' LIKE '_%' as t, 'f' LIKE '_%' as t, '' LIKE '_%' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'foo' LIKE '%_' as t, 'f' LIKE '%_' as t, '' LIKE '%_' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'foo' LIKE '__%' as t, 'foo' LIKE '___%' as t, 'foo' LIKE '____%' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'foo' LIKE '%__' as t, 'foo' LIKE '%___' as t, 'foo' LIKE '%____' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'jack' LIKE '%____%' AS t;
+ t
+---
+ t
+(1 row)
+
+--
+-- basic tests of LIKE with indexes
+--
+CREATE TABLE texttest (a text PRIMARY KEY, b int);
+SELECT * FROM texttest WHERE a LIKE '%1%';
+ a | b
+---+---
+(0 rows)
+
+CREATE TABLE byteatest (a bytea PRIMARY KEY, b int);
+SELECT * FROM byteatest WHERE a LIKE '%1%';
+ a | b
+---+---
+(0 rows)
+
+DROP TABLE texttest, byteatest;
+--
+-- test implicit type conversion
+--
+-- E021-07 character concatenation
+SELECT 'unknown' || ' and unknown' AS "Concat unknown types";
+ Concat unknown types
+----------------------
+ unknown and unknown
+(1 row)
+
+SELECT text 'text' || ' and unknown' AS "Concat text to unknown type";
+ Concat text to unknown type
+-----------------------------
+ text and unknown
+(1 row)
+
+SELECT char(20) 'characters' || ' and text' AS "Concat char to unknown type";
+ Concat char to unknown type
+-----------------------------
+ characters and text
+(1 row)
+
+SELECT text 'text' || char(20) ' and characters' AS "Concat text to char";
+ Concat text to char
+---------------------
+ text and characters
+(1 row)
+
+SELECT text 'text' || varchar ' and varchar' AS "Concat text to varchar";
+ Concat text to varchar
+------------------------
+ text and varchar
+(1 row)
+
+--
+-- test substr with toasted text values
+--
+CREATE TABLE toasttest(f1 text);
+insert into toasttest values(repeat('1234567890',10000));
+insert into toasttest values(repeat('1234567890',10000));
+--
+-- Ensure that some values are uncompressed, to test the faster substring
+-- operation used in that case
+--
+alter table toasttest alter column f1 set storage external;
+insert into toasttest values(repeat('1234567890',10000));
+insert into toasttest values(repeat('1234567890',10000));
+-- If the starting position is zero or less, then return from the start of the string
+-- adjusting the length to be consistent with the "negative start" per SQL.
+SELECT substr(f1, -1, 5) from toasttest;
+ substr
+--------
+ 123
+ 123
+ 123
+ 123
+(4 rows)
+
+-- If the length is less than zero, an ERROR is thrown.
+SELECT substr(f1, 5, -1) from toasttest;
+ERROR: negative substring length not allowed
+-- If no third argument (length) is provided, the length to the end of the
+-- string is assumed.
+SELECT substr(f1, 99995) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+-- If start plus length is > string length, the result is truncated to
+-- string length
+SELECT substr(f1, 99995, 10) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+TRUNCATE TABLE toasttest;
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+-- expect >0 blocks
+SELECT pg_relation_size(reltoastrelid) = 0 AS is_empty
+ FROM pg_class where relname = 'toasttest';
+ is_empty
+----------
+
+(1 row)
+
+TRUNCATE TABLE toasttest;
+ALTER TABLE toasttest set (toast_tuple_target = 4080);
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+-- expect 0 blocks
+SELECT pg_relation_size(reltoastrelid) = 0 AS is_empty
+ FROM pg_class where relname = 'toasttest';
+ is_empty
+----------
+
+(1 row)
+
+DROP TABLE toasttest;
+--
+-- test substr with toasted bytea values
+--
+CREATE TABLE toasttest(f1 bytea);
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+--
+-- Ensure that some values are uncompressed, to test the faster substring
+-- operation used in that case
+--
+alter table toasttest alter column f1 set storage external;
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+-- If the starting position is zero or less, then return from the start of the string
+-- adjusting the length to be consistent with the "negative start" per SQL.
+SELECT substr(f1, -1, 5) from toasttest;
+ substr
+--------
+ 123
+ 123
+ 123
+ 123
+(4 rows)
+
+-- If the length is less than zero, an ERROR is thrown.
+SELECT substr(f1, 5, -1) from toasttest;
+ERROR: negative substring length not allowed
+-- If no third argument (length) is provided, the length to the end of the
+-- string is assumed.
+SELECT substr(f1, 99995) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+-- If start plus length is > string length, the result is truncated to
+-- string length
+SELECT substr(f1, 99995, 10) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+DROP TABLE toasttest;
+-- test internally compressing datums
+-- this tests compressing a datum to a very small size which exercises a
+-- corner case in packed-varlena handling: even though small, the compressed
+-- datum must be given a 4-byte header because there are no bits to indicate
+-- compression in a 1-byte header
+CREATE TABLE toasttest (c char(4096));
+INSERT INTO toasttest VALUES('x');
+SELECT length(c), c::text FROM toasttest;
+ length | c
+--------+---
+ 1 | x
+(1 row)
+
+SELECT c FROM toasttest;
+ c
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ x
+(1 row)
+
+DROP TABLE toasttest;
+--
+-- test length
+--
+SELECT length('abcdef') AS "length_6";
+ length_6
+----------
+ 6
+(1 row)
+
+--
+-- test strpos
+--
+SELECT strpos('abcdef', 'cd') AS "pos_3";
+ pos_3
+-------
+ 3
+(1 row)
+
+SELECT strpos('abcdef', 'xy') AS "pos_0";
+ pos_0
+-------
+ 0
+(1 row)
+
+SELECT strpos('abcdef', '') AS "pos_1";
+ pos_1
+-------
+ 1
+(1 row)
+
+SELECT strpos('', 'xy') AS "pos_0";
+ pos_0
+-------
+ 0
+(1 row)
+
+SELECT strpos('', '') AS "pos_1";
+ pos_1
+-------
+ 1
+(1 row)
+
+--
+-- test replace
+--
+SELECT replace('abcdef', 'de', '45') AS "abc45f";
+ abc45f
+--------
+ abc45f
+(1 row)
+
+SELECT replace('yabadabadoo', 'ba', '123') AS "ya123da123doo";
+ ya123da123doo
+---------------
+ ya123da123doo
+(1 row)
+
+SELECT replace('yabadoo', 'bad', '') AS "yaoo";
+ yaoo
+------
+ yaoo
+(1 row)
+
+--
+-- test split_part
+--
+select split_part('joeuser@mydatabase','@',0) AS "an error";
+ERROR: field position must be greater than zero
+select split_part('joeuser@mydatabase','@',1) AS "joeuser";
+ joeuser
+---------
+ joeuser
+(1 row)
+
+select split_part('joeuser@mydatabase','@',2) AS "mydatabase";
+ mydatabase
+------------
+ mydatabase
+(1 row)
+
+select split_part('joeuser@mydatabase','@',3) AS "empty string";
+ empty string
+--------------
+
+(1 row)
+
+select split_part('@joeuser@mydatabase@','@',2) AS "joeuser";
+ joeuser
+---------
+ joeuser
+(1 row)
+
+--
+-- test to_hex
+--
+select to_hex(256*256*256 - 1) AS "ffffff";
+ ffffff
+--------
+ ffffff
+(1 row)
+
+select to_hex(256::bigint*256::bigint*256::bigint*256::bigint - 1) AS "ffffffff";
+ ffffffff
+----------
+ ffffffff
+(1 row)
+
+--
+-- MD5 test suite - from IETF RFC 1321
+-- (see: ftp://ftp.rfc-editor.org/in-notes/rfc1321.txt)
+--
+select md5('') = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('a') = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abc') = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('message digest') = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abcdefghijklmnopqrstuvwxyz') = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789') = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890') = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5(''::bytea) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('a'::bytea) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abc'::bytea) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('message digest'::bytea) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abcdefghijklmnopqrstuvwxyz'::bytea) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'::bytea) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890'::bytea) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+--
+-- SHA-2
+--
+SET bytea_output TO hex;
+SELECT sha224('');
+ sha224
+------------------------------------------------------------
+ \xd14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac5b3e42f
+(1 row)
+
+SELECT sha224('The quick brown fox jumps over the lazy dog.');
+ sha224
+------------------------------------------------------------
+ \x619cba8e8e05826e9b8c519c0a5c68f4fb653e8a3d8aa04bb2c8cd4c
+(1 row)
+
+SELECT sha256('');
+ sha256
+--------------------------------------------------------------------
+ \xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
+(1 row)
+
+SELECT sha256('The quick brown fox jumps over the lazy dog.');
+ sha256
+--------------------------------------------------------------------
+ \xef537f25c895bfa782526529a9b63d97aa631564d5d789c2b765448c8635fb6c
+(1 row)
+
+SELECT sha384('');
+ sha384
+----------------------------------------------------------------------------------------------------
+ \x38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b
+(1 row)
+
+SELECT sha384('The quick brown fox jumps over the lazy dog.');
+ sha384
+----------------------------------------------------------------------------------------------------
+ \xed892481d8272ca6df370bf706e4d7bc1b5739fa2177aae6c50e946678718fc67a7af2819a021c2fc34e91bdb63409d7
+(1 row)
+
+SELECT sha512('');
+ sha512
+------------------------------------------------------------------------------------------------------------------------------------
+ \xcf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e
+(1 row)
+
+SELECT sha512('The quick brown fox jumps over the lazy dog.');
+ sha512
+------------------------------------------------------------------------------------------------------------------------------------
+ \x91ea1245f20d46ae9a037a989f54f1f790f0a47607eeb8a14d12890cea77a1bbc6c7ed9cf205e67b7f2b8fd4c7dfd3a7a8617e45f3c463d481c7e586c39ac1ed
+(1 row)
+
+--
+-- test behavior of escape_string_warning and standard_conforming_strings options
+--
+set escape_string_warning = off;
+set standard_conforming_strings = off;
+show escape_string_warning;
+ escape_string_warning
+-----------------------
+ off
+(1 row)
+
+show standard_conforming_strings;
+ standard_conforming_strings
+-----------------------------
+ off
+(1 row)
+
+set escape_string_warning = on;
+set standard_conforming_strings = on;
+show escape_string_warning;
+ escape_string_warning
+-----------------------
+ on
+(1 row)
+
+show standard_conforming_strings;
+ standard_conforming_strings
+-----------------------------
+ on
+(1 row)
+
+select 'a\bcd' as f1, 'a\b''cd' as f2, 'a\b''''cd' as f3, 'abcd\' as f4, 'ab\''cd' as f5, '\\' as f6;
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+set standard_conforming_strings = off;
+select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd' as f5, '\\\\' as f6;
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3,...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3,...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3,...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: ...bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' ...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: ...'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd'...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: ...'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd' as f5, '\\\\' as ...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+set escape_string_warning = off;
+set standard_conforming_strings = on;
+select 'a\bcd' as f1, 'a\b''cd' as f2, 'a\b''''cd' as f3, 'abcd\' as f4, 'ab\''cd' as f5, '\\' as f6;
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+set standard_conforming_strings = off;
+select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd' as f5, '\\\\' as f6;
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+--
+-- Additional string functions
+--
+SET bytea_output TO escape;
+SELECT initcap('hi THOMAS');
+ initcap
+-----------
+ Hi Thomas
+(1 row)
+
+SELECT lpad('hi', 5, 'xy');
+ lpad
+-------
+ xyxhi
+(1 row)
+
+SELECT lpad('hi', 5);
+ lpad
+-------
+ hi
+(1 row)
+
+SELECT lpad('hi', -5, 'xy');
+ lpad
+------
+
+(1 row)
+
+SELECT lpad('hello', 2);
+ lpad
+------
+ he
+(1 row)
+
+SELECT lpad('hi', 5, '');
+ lpad
+------
+ hi
+(1 row)
+
+SELECT rpad('hi', 5, 'xy');
+ rpad
+-------
+ hixyx
+(1 row)
+
+SELECT rpad('hi', 5);
+ rpad
+-------
+ hi
+(1 row)
+
+SELECT rpad('hi', -5, 'xy');
+ rpad
+------
+
+(1 row)
+
+SELECT rpad('hello', 2);
+ rpad
+------
+ he
+(1 row)
+
+SELECT rpad('hi', 5, '');
+ rpad
+------
+ hi
+(1 row)
+
+SELECT ltrim('zzzytrim', 'xyz');
+ ltrim
+-------
+ trim
+(1 row)
+
+SELECT translate('', '14', 'ax');
+ translate
+-----------
+
+(1 row)
+
+SELECT translate('12345', '14', 'ax');
+ translate
+-----------
+ a23x5
+(1 row)
+
+SELECT ascii('x');
+ ascii
+-------
+ 120
+(1 row)
+
+SELECT ascii('');
+ ascii
+-------
+ 0
+(1 row)
+
+SELECT chr(65);
+ chr
+-----
+ A
+(1 row)
+
+SELECT chr(0);
+ERROR: null character not permitted
+SELECT repeat('Pg', 4);
+ repeat
+----------
+ PgPgPgPg
+(1 row)
+
+SELECT repeat('Pg', -4);
+ repeat
+--------
+
+(1 row)
+
+SELECT trim(E'\\000'::bytea from E'\\000Tom\\000'::bytea);
+ btrim
+-------
+ Tom
+(1 row)
+
+SELECT btrim(E'\\000trim\\000'::bytea, E'\\000'::bytea);
+ btrim
+-------
+ trim
+(1 row)
+
+SELECT btrim(''::bytea, E'\\000'::bytea);
+ btrim
+-------
+
+(1 row)
+
+SELECT btrim(E'\\000trim\\000'::bytea, ''::bytea);
+ btrim
+--------------
+ \000trim\000
+(1 row)
+
+SELECT encode(overlay(E'Th\\000omas'::bytea placing E'Th\\001omas'::bytea from 2),'escape');
+ encode
+-------------
+ TTh\x01omas
+(1 row)
+
+SELECT encode(overlay(E'Th\\000omas'::bytea placing E'\\002\\003'::bytea from 8),'escape');
+ encode
+--------------------
+ Th\000omas\x02\x03
+(1 row)
+
+SELECT encode(overlay(E'Th\\000omas'::bytea placing E'\\002\\003'::bytea from 5 for 3),'escape');
+ encode
+-----------------
+ Th\000o\x02\x03
+(1 row)
+
diff --git src/test/regress/expected/tablesample_1.out src/test/regress/expected/tablesample_1.out
new file mode 100644
index 0000000000..8f0c7a2669
--- /dev/null
+++ src/test/regress/expected/tablesample_1.out
@@ -0,0 +1,300 @@
+CREATE TABLE test_tablesample (id int, name text) WITH (fillfactor=10);
+-- use fillfactor so we don't have to load too much data to get multiple pages
+INSERT INTO test_tablesample
+ SELECT i, repeat(i::text, 200) FROM generate_series(0, 9) s(i);
+SELECT t.id FROM test_tablesample AS t TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (100.0/11) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (50) REPEATABLE (0);
+ id
+----
+ 3
+ 4
+ 5
+ 7
+ 8
+(5 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (5.5) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+-- 100% should give repeatable count results (ie, all rows) in any case
+SELECT count(*) FROM test_tablesample TABLESAMPLE SYSTEM (100);
+ count
+-------
+ 10
+(1 row)
+
+SELECT count(*) FROM test_tablesample TABLESAMPLE SYSTEM (100) REPEATABLE (1+2);
+ count
+-------
+ 10
+(1 row)
+
+SELECT count(*) FROM test_tablesample TABLESAMPLE SYSTEM (100) REPEATABLE (0.4);
+ count
+-------
+ 10
+(1 row)
+
+CREATE VIEW test_tablesample_v1 AS
+ SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (10*2) REPEATABLE (2);
+CREATE VIEW test_tablesample_v2 AS
+ SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (99);
+\d+ test_tablesample_v1
+ View "public.test_tablesample_v1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ id | integer | | | | plain |
+View definition:
+ SELECT test_tablesample.id
+ FROM test_tablesample TABLESAMPLE system ((10 * 2)) REPEATABLE (2);
+
+\d+ test_tablesample_v2
+ View "public.test_tablesample_v2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ id | integer | | | | plain |
+View definition:
+ SELECT test_tablesample.id
+ FROM test_tablesample TABLESAMPLE system (99);
+
+-- check a sampled query doesn't affect cursor in progress
+BEGIN;
+DECLARE tablesample_cur CURSOR FOR
+ SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+FETCH FIRST FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH FIRST FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+CLOSE tablesample_cur;
+END;
+EXPLAIN (COSTS OFF)
+ SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (2);
+ QUERY PLAN
+--------------------------------------------------------------------
+ Sample Scan on test_tablesample
+ Sampling: system ('50'::real) REPEATABLE ('2'::double precision)
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+ SELECT * FROM test_tablesample_v1;
+ QUERY PLAN
+--------------------------------------------------------------------
+ Sample Scan on test_tablesample
+ Sampling: system ('20'::real) REPEATABLE ('2'::double precision)
+(2 rows)
+
+-- check inheritance behavior
+explain (costs off)
+ select count(*) from person tablesample bernoulli (100);
+ QUERY PLAN
+-------------------------------------------------
+ Aggregate
+ -> Append
+ -> Sample Scan on person
+ Sampling: bernoulli ('100'::real)
+ -> Sample Scan on emp
+ Sampling: bernoulli ('100'::real)
+ -> Sample Scan on student
+ Sampling: bernoulli ('100'::real)
+ -> Sample Scan on stud_emp
+ Sampling: bernoulli ('100'::real)
+(10 rows)
+
+select count(*) from person tablesample bernoulli (100);
+ count
+-------
+ 58
+(1 row)
+
+select count(*) from person;
+ count
+-------
+ 58
+(1 row)
+
+-- check that collations get assigned within the tablesample arguments
+SELECT count(*) FROM test_tablesample TABLESAMPLE bernoulli (('1'::text < '0'::text)::int);
+ count
+-------
+ 0
+(1 row)
+
+-- check behavior during rescans, as well as correct handling of min/max pct
+select * from
+ (values (0),(100)) v(pct),
+ lateral (select count(*) from tenk1 tablesample bernoulli (pct)) ss;
+ pct | count
+-----+-------
+ 0 | 0
+ 100 | 10000
+(2 rows)
+
+select * from
+ (values (0),(100)) v(pct),
+ lateral (select count(*) from tenk1 tablesample system (pct)) ss;
+ pct | count
+-----+-------
+ 0 | 0
+ 100 | 10000
+(2 rows)
+
+explain (costs off)
+select pct, count(unique1) from
+ (values (0),(100)) v(pct),
+ lateral (select * from tenk1 tablesample bernoulli (pct)) ss
+ group by pct;
+ QUERY PLAN
+--------------------------------------------------------
+ HashAggregate
+ Group Key: "*VALUES*".column1
+ -> Nested Loop
+ -> Values Scan on "*VALUES*"
+ -> Sample Scan on tenk1
+ Sampling: bernoulli ("*VALUES*".column1)
+(6 rows)
+
+select pct, count(unique1) from
+ (values (0),(100)) v(pct),
+ lateral (select * from tenk1 tablesample bernoulli (pct)) ss
+ group by pct;
+ pct | count
+-----+-------
+ 100 | 10000
+(1 row)
+
+select pct, count(unique1) from
+ (values (0),(100)) v(pct),
+ lateral (select * from tenk1 tablesample system (pct)) ss
+ group by pct;
+ pct | count
+-----+-------
+ 100 | 10000
+(1 row)
+
+-- errors
+SELECT id FROM test_tablesample TABLESAMPLE FOOBAR (1);
+ERROR: tablesample method foobar does not exist
+LINE 1: SELECT id FROM test_tablesample TABLESAMPLE FOOBAR (1);
+ ^
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (NULL);
+ERROR: TABLESAMPLE parameter cannot be null
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (NULL);
+ERROR: TABLESAMPLE REPEATABLE parameter cannot be null
+SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (-1);
+ERROR: sample percentage must be between 0 and 100
+SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (200);
+ERROR: sample percentage must be between 0 and 100
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (-1);
+ERROR: sample percentage must be between 0 and 100
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (200);
+ERROR: sample percentage must be between 0 and 100
+SELECT id FROM test_tablesample_v1 TABLESAMPLE BERNOULLI (1);
+ERROR: TABLESAMPLE clause can only be applied to tables and materialized views
+LINE 1: SELECT id FROM test_tablesample_v1 TABLESAMPLE BERNOULLI (1)...
+ ^
+INSERT INTO test_tablesample_v1 VALUES(1);
+ERROR: cannot insert into view "test_tablesample_v1"
+DETAIL: Views containing TABLESAMPLE are not automatically updatable.
+HINT: To enable inserting into the view, provide an INSTEAD OF INSERT trigger or an unconditional ON INSERT DO INSTEAD rule.
+WITH query_select AS (SELECT * FROM test_tablesample)
+SELECT * FROM query_select TABLESAMPLE BERNOULLI (5.5) REPEATABLE (1);
+ERROR: TABLESAMPLE clause can only be applied to tables and materialized views
+LINE 2: SELECT * FROM query_select TABLESAMPLE BERNOULLI (5.5) REPEA...
+ ^
+SELECT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPLE BERNOULLI (5);
+ERROR: syntax error at or near "TABLESAMPLE"
+LINE 1: ...CT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPL...
+ ^
+-- check partitioned tables support tablesample
+create table parted_sample (a int) partition by list (a);
+create table parted_sample_1 partition of parted_sample for values in (1);
+create table parted_sample_2 partition of parted_sample for values in (2);
+explain (costs off)
+ select * from parted_sample tablesample bernoulli (100);
+ QUERY PLAN
+-------------------------------------------
+ Append
+ -> Sample Scan on parted_sample_1
+ Sampling: bernoulli ('100'::real)
+ -> Sample Scan on parted_sample_2
+ Sampling: bernoulli ('100'::real)
+(5 rows)
+
+drop table parted_sample, parted_sample_1, parted_sample_2;
diff --git src/test/regress/expected/transactions_1.out src/test/regress/expected/transactions_1.out
new file mode 100644
index 0000000000..672b918303
--- /dev/null
+++ src/test/regress/expected/transactions_1.out
@@ -0,0 +1,1022 @@
+--
+-- TRANSACTIONS
+--
+BEGIN;
+SELECT *
+ INTO TABLE xacttest
+ FROM aggtest;
+INSERT INTO xacttest (a, b) VALUES (777, 777.777);
+END;
+-- should retrieve one value--
+SELECT a FROM xacttest WHERE a > 100;
+ a
+-----
+ 777
+(1 row)
+
+BEGIN;
+CREATE TABLE disappear (a int4);
+DELETE FROM aggtest;
+-- should be empty
+SELECT * FROM aggtest;
+ a | b
+---+---
+(0 rows)
+
+ABORT;
+-- should not exist
+SELECT oid FROM pg_class WHERE relname = 'disappear';
+ oid
+-----
+(0 rows)
+
+-- should have members again
+SELECT * FROM aggtest;
+ a | b
+-----+---------
+ 56 | 7.8
+ 100 | 99.097
+ 0 | 0.09561
+ 42 | 324.78
+(4 rows)
+
+-- Read-only tests
+CREATE TABLE writetest (a int);
+CREATE TEMPORARY TABLE temptest (a int);
+BEGIN;
+SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+SET TRANSACTION READ WRITE; --fail
+ERROR: transaction read-write mode must be set before any query
+COMMIT;
+BEGIN;
+SET TRANSACTION READ ONLY; -- ok
+SET TRANSACTION READ WRITE; -- ok
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+SAVEPOINT x;
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+SET TRANSACTION READ ONLY; -- ok
+SET TRANSACTION READ WRITE; --fail
+ERROR: cannot set transaction read-write mode inside a read-only transaction
+COMMIT;
+BEGIN;
+SET TRANSACTION READ WRITE; -- ok
+SAVEPOINT x;
+SET TRANSACTION READ WRITE; -- ok
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+SET TRANSACTION READ ONLY; -- ok
+SET TRANSACTION READ WRITE; --fail
+ERROR: cannot set transaction read-write mode inside a read-only transaction
+COMMIT;
+BEGIN;
+SET TRANSACTION READ WRITE; -- ok
+SAVEPOINT x;
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+ROLLBACK TO SAVEPOINT x;
+SHOW transaction_read_only; -- off
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SAVEPOINT y;
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+RELEASE SAVEPOINT y;
+SHOW transaction_read_only; -- off
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+COMMIT;
+SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY;
+DROP TABLE writetest; -- fail
+ERROR: cannot execute DROP TABLE in a read-only transaction
+INSERT INTO writetest VALUES (1); -- fail
+ERROR: cannot execute INSERT in a read-only transaction
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+DELETE FROM temptest; -- ok
+UPDATE temptest SET a = 0 FROM writetest WHERE temptest.a = 1 AND writetest.a = temptest.a; -- ok
+PREPARE test AS UPDATE writetest SET a = 0; -- ok
+EXECUTE test; -- fail
+ERROR: cannot execute UPDATE in a read-only transaction
+SELECT * FROM writetest, temptest; -- ok
+ a | a
+---+---
+(0 rows)
+
+CREATE TABLE test AS SELECT * FROM writetest; -- fail
+ERROR: cannot execute CREATE TABLE AS in a read-only transaction
+START TRANSACTION READ WRITE;
+DROP TABLE writetest; -- ok
+COMMIT;
+-- Subtransactions, basic tests
+-- create & drop tables
+SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE;
+CREATE TABLE trans_foobar (a int);
+BEGIN;
+ CREATE TABLE trans_foo (a int);
+ SAVEPOINT one;
+ DROP TABLE trans_foo;
+ CREATE TABLE trans_bar (a int);
+ ROLLBACK TO SAVEPOINT one;
+ RELEASE SAVEPOINT one;
+ SAVEPOINT two;
+ CREATE TABLE trans_baz (a int);
+ RELEASE SAVEPOINT two;
+ drop TABLE trans_foobar;
+ CREATE TABLE trans_barbaz (a int);
+COMMIT;
+-- should exist: trans_barbaz, trans_baz, trans_foo
+SELECT * FROM trans_foo; -- should be empty
+ a
+---
+(0 rows)
+
+SELECT * FROM trans_bar; -- shouldn't exist
+ERROR: relation "trans_bar" does not exist
+LINE 1: SELECT * FROM trans_bar;
+ ^
+SELECT * FROM trans_barbaz; -- should be empty
+ a
+---
+(0 rows)
+
+SELECT * FROM trans_baz; -- should be empty
+ a
+---
+(0 rows)
+
+-- inserts
+BEGIN;
+ INSERT INTO trans_foo VALUES (1);
+ SAVEPOINT one;
+ INSERT into trans_bar VALUES (1);
+ERROR: relation "trans_bar" does not exist
+LINE 1: INSERT into trans_bar VALUES (1);
+ ^
+ ROLLBACK TO one;
+ RELEASE SAVEPOINT one;
+ SAVEPOINT two;
+ INSERT into trans_barbaz VALUES (1);
+ RELEASE two;
+ SAVEPOINT three;
+ SAVEPOINT four;
+ INSERT INTO trans_foo VALUES (2);
+ RELEASE SAVEPOINT four;
+ ROLLBACK TO SAVEPOINT three;
+ RELEASE SAVEPOINT three;
+ INSERT INTO trans_foo VALUES (3);
+COMMIT;
+SELECT * FROM trans_foo; -- should have 1 and 3
+ a
+---
+ 1
+ 3
+(2 rows)
+
+SELECT * FROM trans_barbaz; -- should have 1
+ a
+---
+ 1
+(1 row)
+
+-- test whole-tree commit
+BEGIN;
+ SAVEPOINT one;
+ SELECT trans_foo;
+ERROR: column "trans_foo" does not exist
+LINE 1: SELECT trans_foo;
+ ^
+ ROLLBACK TO SAVEPOINT one;
+ RELEASE SAVEPOINT one;
+ SAVEPOINT two;
+ CREATE TABLE savepoints (a int);
+ SAVEPOINT three;
+ INSERT INTO savepoints VALUES (1);
+ SAVEPOINT four;
+ INSERT INTO savepoints VALUES (2);
+ SAVEPOINT five;
+ INSERT INTO savepoints VALUES (3);
+ ROLLBACK TO SAVEPOINT five;
+COMMIT;
+COMMIT; -- should not be in a transaction block
+WARNING: there is no transaction in progress
+SELECT * FROM savepoints;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+-- test whole-tree rollback
+BEGIN;
+ SAVEPOINT one;
+ DELETE FROM savepoints WHERE a=1;
+ RELEASE SAVEPOINT one;
+ SAVEPOINT two;
+ DELETE FROM savepoints WHERE a=1;
+ SAVEPOINT three;
+ DELETE FROM savepoints WHERE a=2;
+ROLLBACK;
+COMMIT; -- should not be in a transaction block
+WARNING: there is no transaction in progress
+SELECT * FROM savepoints;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+-- test whole-tree commit on an aborted subtransaction
+BEGIN;
+ INSERT INTO savepoints VALUES (4);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (5);
+ SELECT trans_foo;
+ERROR: column "trans_foo" does not exist
+LINE 1: SELECT trans_foo;
+ ^
+COMMIT;
+SELECT * FROM savepoints;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+BEGIN;
+ INSERT INTO savepoints VALUES (6);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (7);
+ RELEASE SAVEPOINT one;
+ INSERT INTO savepoints VALUES (8);
+COMMIT;
+-- rows 6 and 8 should have been created by the same xact
+SELECT a.xmin = b.xmin FROM savepoints a, savepoints b WHERE a.a=6 AND b.a=8;
+ ?column?
+----------
+ t
+(1 row)
+
+-- rows 6 and 7 should have been created by different xacts
+SELECT a.xmin = b.xmin FROM savepoints a, savepoints b WHERE a.a=6 AND b.a=7;
+ ?column?
+----------
+ t
+(1 row)
+
+BEGIN;
+ INSERT INTO savepoints VALUES (9);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (10);
+ ROLLBACK TO SAVEPOINT one;
+ INSERT INTO savepoints VALUES (11);
+COMMIT;
+SELECT a FROM savepoints WHERE a in (9, 10, 11);
+ a
+----
+ 9
+ 11
+(2 rows)
+
+-- rows 9 and 11 should have been created by different xacts
+SELECT a.xmin = b.xmin FROM savepoints a, savepoints b WHERE a.a=9 AND b.a=11;
+ ?column?
+----------
+ t
+(1 row)
+
+BEGIN;
+ INSERT INTO savepoints VALUES (12);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (13);
+ SAVEPOINT two;
+ INSERT INTO savepoints VALUES (14);
+ ROLLBACK TO SAVEPOINT one;
+ INSERT INTO savepoints VALUES (15);
+ SAVEPOINT two;
+ INSERT INTO savepoints VALUES (16);
+ SAVEPOINT three;
+ INSERT INTO savepoints VALUES (17);
+COMMIT;
+SELECT a FROM savepoints WHERE a BETWEEN 12 AND 17;
+ a
+----
+ 12
+ 15
+ 16
+ 17
+(4 rows)
+
+BEGIN;
+ INSERT INTO savepoints VALUES (18);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (19);
+ SAVEPOINT two;
+ INSERT INTO savepoints VALUES (20);
+ ROLLBACK TO SAVEPOINT one;
+ INSERT INTO savepoints VALUES (21);
+ ROLLBACK TO SAVEPOINT one;
+ INSERT INTO savepoints VALUES (22);
+COMMIT;
+SELECT a FROM savepoints WHERE a BETWEEN 18 AND 22;
+ a
+----
+ 18
+ 22
+(2 rows)
+
+DROP TABLE savepoints;
+-- only in a transaction block:
+SAVEPOINT one;
+ERROR: SAVEPOINT can only be used in transaction blocks
+ROLLBACK TO SAVEPOINT one;
+ERROR: ROLLBACK TO SAVEPOINT can only be used in transaction blocks
+RELEASE SAVEPOINT one;
+ERROR: RELEASE SAVEPOINT can only be used in transaction blocks
+-- Only "rollback to" allowed in aborted state
+BEGIN;
+ SAVEPOINT one;
+ SELECT 0/0;
+ERROR: division by zero
+ SAVEPOINT two; -- ignored till the end of ...
+ERROR: current transaction is aborted, commands ignored until end of transaction block
+ RELEASE SAVEPOINT one; -- ignored till the end of ...
+ERROR: current transaction is aborted, commands ignored until end of transaction block
+ ROLLBACK TO SAVEPOINT one;
+ SELECT 1;
+ ?column?
+----------
+ 1
+(1 row)
+
+COMMIT;
+SELECT 1; -- this should work
+ ?column?
+----------
+ 1
+(1 row)
+
+-- check non-transactional behavior of cursors
+BEGIN;
+ DECLARE c CURSOR FOR SELECT unique2 FROM tenk1 ORDER BY unique2;
+ SAVEPOINT one;
+ FETCH 10 FROM c;
+ unique2
+---------
+ 0
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+(10 rows)
+
+ ROLLBACK TO SAVEPOINT one;
+ FETCH 10 FROM c;
+ unique2
+---------
+ 10
+ 11
+ 12
+ 13
+ 14
+ 15
+ 16
+ 17
+ 18
+ 19
+(10 rows)
+
+ RELEASE SAVEPOINT one;
+ FETCH 10 FROM c;
+ unique2
+---------
+ 20
+ 21
+ 22
+ 23
+ 24
+ 25
+ 26
+ 27
+ 28
+ 29
+(10 rows)
+
+ CLOSE c;
+ DECLARE c CURSOR FOR SELECT unique2/0 FROM tenk1 ORDER BY unique2;
+ SAVEPOINT two;
+ FETCH 10 FROM c;
+ERROR: division by zero
+ ROLLBACK TO SAVEPOINT two;
+ -- c is now dead to the world ...
+ FETCH 10 FROM c;
+ERROR: portal "c" cannot be run
+ ROLLBACK TO SAVEPOINT two;
+ RELEASE SAVEPOINT two;
+ FETCH 10 FROM c;
+ERROR: portal "c" cannot be run
+COMMIT;
+--
+-- Check that "stable" functions are really stable. They should not be
+-- able to see the partial results of the calling query. (Ideally we would
+-- also check that they don't see commits of concurrent transactions, but
+-- that's a mite hard to do within the limitations of pg_regress.)
+--
+select * from xacttest;
+ a | b
+-----+---------
+ 56 | 7.8
+ 100 | 99.097
+ 0 | 0.09561
+ 42 | 324.78
+ 777 | 777.777
+(5 rows)
+
+create or replace function max_xacttest() returns smallint language sql as
+'select max(a) from xacttest' stable;
+begin;
+update xacttest set a = max_xacttest() + 10 where a > 0;
+select * from xacttest;
+ a | b
+-----+---------
+ 0 | 0.09561
+ 787 | 7.8
+ 787 | 99.097
+ 787 | 324.78
+ 787 | 777.777
+(5 rows)
+
+rollback;
+-- But a volatile function can see the partial results of the calling query
+create or replace function max_xacttest() returns smallint language sql as
+'select max(a) from xacttest' volatile;
+begin;
+update xacttest set a = max_xacttest() + 10 where a > 0;
+select * from xacttest;
+ a | b
+-----+---------
+ 0 | 0.09561
+ 787 | 7.8
+ 797 | 99.097
+ 807 | 324.78
+ 817 | 777.777
+(5 rows)
+
+rollback;
+-- Now the same test with plpgsql (since it depends on SPI which is different)
+create or replace function max_xacttest() returns smallint language plpgsql as
+'begin return max(a) from xacttest; end' stable;
+begin;
+update xacttest set a = max_xacttest() + 10 where a > 0;
+select * from xacttest;
+ a | b
+-----+---------
+ 0 | 0.09561
+ 787 | 7.8
+ 787 | 99.097
+ 787 | 324.78
+ 787 | 777.777
+(5 rows)
+
+rollback;
+create or replace function max_xacttest() returns smallint language plpgsql as
+'begin return max(a) from xacttest; end' volatile;
+begin;
+update xacttest set a = max_xacttest() + 10 where a > 0;
+select * from xacttest;
+ a | b
+-----+---------
+ 0 | 0.09561
+ 787 | 7.8
+ 797 | 99.097
+ 807 | 324.78
+ 817 | 777.777
+(5 rows)
+
+rollback;
+-- test case for problems with dropping an open relation during abort
+BEGIN;
+ savepoint x;
+ CREATE TABLE koju (a INT UNIQUE);
+ INSERT INTO koju VALUES (1);
+ INSERT INTO koju VALUES (1);
+ERROR: duplicate key value violates unique constraint "koju_a_key"
+DETAIL: Key (a)=(1) already exists.
+ rollback to x;
+ CREATE TABLE koju (a INT UNIQUE);
+ INSERT INTO koju VALUES (1);
+ INSERT INTO koju VALUES (1);
+ERROR: duplicate key value violates unique constraint "koju_a_key"
+DETAIL: Key (a)=(1) already exists.
+ROLLBACK;
+DROP TABLE trans_foo;
+DROP TABLE trans_baz;
+DROP TABLE trans_barbaz;
+-- test case for problems with revalidating an open relation during abort
+create function inverse(int) returns float8 as
+$$
+begin
+ analyze revalidate_bug;
+ return 1::float8/$1;
+exception
+ when division_by_zero then return 0;
+end$$ language plpgsql volatile;
+create table revalidate_bug (c float8 unique);
+insert into revalidate_bug values (1);
+insert into revalidate_bug values (inverse(0));
+drop table revalidate_bug;
+drop function inverse(int);
+-- verify that cursors created during an aborted subtransaction are
+-- closed, but that we do not roll back the effect of any FETCHes
+-- performed in the aborted subtransaction
+begin;
+savepoint x;
+create table abc (a int);
+insert into abc values (5);
+insert into abc values (10);
+declare foo cursor for select * from abc;
+fetch from foo;
+ a
+---
+ 5
+(1 row)
+
+rollback to x;
+-- should fail
+fetch from foo;
+ERROR: cursor "foo" does not exist
+commit;
+begin;
+create table abc (a int);
+insert into abc values (5);
+insert into abc values (10);
+insert into abc values (15);
+declare foo cursor for select * from abc;
+fetch from foo;
+ a
+---
+ 5
+(1 row)
+
+savepoint x;
+fetch from foo;
+ a
+----
+ 10
+(1 row)
+
+rollback to x;
+fetch from foo;
+ a
+----
+ 15
+(1 row)
+
+abort;
+-- Test for proper cleanup after a failure in a cursor portal
+-- that was created in an outer subtransaction
+CREATE FUNCTION invert(x float8) RETURNS float8 LANGUAGE plpgsql AS
+$$ begin return 1/x; end $$;
+CREATE FUNCTION create_temp_tab() RETURNS text
+LANGUAGE plpgsql AS $$
+BEGIN
+ CREATE TEMP TABLE new_table (f1 float8);
+ -- case of interest is that we fail while holding an open
+ -- relcache reference to new_table
+ INSERT INTO new_table SELECT invert(0.0);
+ RETURN 'foo';
+END $$;
+BEGIN;
+DECLARE ok CURSOR FOR SELECT * FROM int8_tbl;
+DECLARE ctt CURSOR FOR SELECT create_temp_tab();
+FETCH ok;
+ q1 | q2
+-----+-----
+ 123 | 456
+(1 row)
+
+SAVEPOINT s1;
+FETCH ok; -- should work
+ q1 | q2
+-----+------------------
+ 123 | 4567890123456789
+(1 row)
+
+FETCH ctt; -- error occurs here
+ERROR: division by zero
+CONTEXT: PL/pgSQL function invert(double precision) line 1 at RETURN
+SQL statement "INSERT INTO new_table SELECT invert(0.0)"
+PL/pgSQL function create_temp_tab() line 6 at SQL statement
+ROLLBACK TO s1;
+FETCH ok; -- should work
+ q1 | q2
+------------------+-----
+ 4567890123456789 | 123
+(1 row)
+
+FETCH ctt; -- must be rejected
+ERROR: portal "ctt" cannot be run
+COMMIT;
+DROP FUNCTION create_temp_tab();
+DROP FUNCTION invert(x float8);
+-- Tests for AND CHAIN
+CREATE TABLE abc (a int);
+-- set nondefault value so we have something to override below
+SET default_transaction_read_only = on;
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ WRITE, DEFERRABLE;
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+INSERT INTO abc VALUES (1);
+INSERT INTO abc VALUES (2);
+COMMIT AND CHAIN; -- TBLOCK_END
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+INSERT INTO abc VALUES ('error');
+ERROR: invalid input syntax for type integer: "error"
+LINE 1: INSERT INTO abc VALUES ('error');
+ ^
+INSERT INTO abc VALUES (3); -- check it's really aborted
+ERROR: current transaction is aborted, commands ignored until end of transaction block
+COMMIT AND CHAIN; -- TBLOCK_ABORT_END
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+INSERT INTO abc VALUES (4);
+COMMIT;
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ WRITE, DEFERRABLE;
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+SAVEPOINT x;
+INSERT INTO abc VALUES ('error');
+ERROR: invalid input syntax for type integer: "error"
+LINE 1: INSERT INTO abc VALUES ('error');
+ ^
+COMMIT AND CHAIN; -- TBLOCK_ABORT_PENDING
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+INSERT INTO abc VALUES (5);
+COMMIT;
+-- different mix of options just for fun
+START TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ WRITE, NOT DEFERRABLE;
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ serializable
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ off
+(1 row)
+
+INSERT INTO abc VALUES (6);
+ROLLBACK AND CHAIN; -- TBLOCK_ABORT_PENDING
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ serializable
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ off
+(1 row)
+
+INSERT INTO abc VALUES ('error');
+ERROR: invalid input syntax for type integer: "error"
+LINE 1: INSERT INTO abc VALUES ('error');
+ ^
+ROLLBACK AND CHAIN; -- TBLOCK_ABORT_END
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ serializable
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ off
+(1 row)
+
+ROLLBACK;
+-- not allowed outside a transaction block
+COMMIT AND CHAIN; -- error
+ERROR: COMMIT AND CHAIN can only be used in transaction blocks
+ROLLBACK AND CHAIN; -- error
+ERROR: ROLLBACK AND CHAIN can only be used in transaction blocks
+SELECT * FROM abc ORDER BY 1;
+ a
+---
+ 1
+ 2
+ 4
+ 5
+(4 rows)
+
+RESET default_transaction_read_only;
+DROP TABLE abc;
+-- Test assorted behaviors around the implicit transaction block created
+-- when multiple SQL commands are sent in a single Query message. These
+-- tests rely on the fact that psql will not break SQL commands apart at a
+-- backslash-quoted semicolon, but will send them as one Query.
+create temp table i_table (f1 int);
+-- psql will show only the last result in a multi-statement Query
+SELECT 1\; SELECT 2\; SELECT 3;
+ ?column?
+----------
+ 3
+(1 row)
+
+-- this implicitly commits:
+insert into i_table values(1)\; select * from i_table;
+ f1
+----
+ 1
+(1 row)
+
+-- a 1/0 error will cause the whole implicit transaction to be rolled back
+insert into i_table values(2)\; select * from i_table\; select 1/0;
+ERROR: division by zero
+select * from i_table;
+ f1
+----
+ 1
+(1 row)
+
+rollback; -- we are not in a transaction at this point
+WARNING: there is no transaction in progress
+-- can use regular begin/commit/rollback within a single Query
+begin\; insert into i_table values(3)\; commit;
+rollback; -- we are not in a transaction at this point
+WARNING: there is no transaction in progress
+begin\; insert into i_table values(4)\; rollback;
+rollback; -- we are not in a transaction at this point
+WARNING: there is no transaction in progress
+-- begin converts implicit transaction into a regular one that
+-- can extend past the end of the Query
+select 1\; begin\; insert into i_table values(5);
+commit;
+select 1\; begin\; insert into i_table values(6);
+rollback;
+-- commit in implicit-transaction state commits but issues a warning.
+insert into i_table values(7)\; commit\; insert into i_table values(8)\; select 1/0;
+WARNING: there is no transaction in progress
+ERROR: division by zero
+-- similarly, rollback aborts but issues a warning.
+insert into i_table values(9)\; rollback\; select 2;
+WARNING: there is no transaction in progress
+ ?column?
+----------
+ 2
+(1 row)
+
+select * from i_table;
+ f1
+----
+ 1
+ 3
+ 5
+ 7
+(4 rows)
+
+rollback; -- we are not in a transaction at this point
+WARNING: there is no transaction in progress
+-- an implicit transaction block is still a transaction block, e.g. for VACUUM
+SELECT 1\; VACUUM;
+ERROR: VACUUM cannot run inside a transaction block
+SELECT 1\; COMMIT\; VACUUM;
+WARNING: there is no transaction in progress
+ERROR: VACUUM cannot run inside a transaction block
+-- we disallow savepoint-related commands in implicit-transaction state
+SELECT 1\; SAVEPOINT sp;
+ERROR: SAVEPOINT can only be used in transaction blocks
+SELECT 1\; COMMIT\; SAVEPOINT sp;
+WARNING: there is no transaction in progress
+ERROR: SAVEPOINT can only be used in transaction blocks
+ROLLBACK TO SAVEPOINT sp\; SELECT 2;
+ERROR: ROLLBACK TO SAVEPOINT can only be used in transaction blocks
+SELECT 2\; RELEASE SAVEPOINT sp\; SELECT 3;
+ERROR: RELEASE SAVEPOINT can only be used in transaction blocks
+-- but this is OK, because the BEGIN converts it to a regular xact
+SELECT 1\; BEGIN\; SAVEPOINT sp\; ROLLBACK TO SAVEPOINT sp\; COMMIT;
+-- Tests for AND CHAIN in implicit transaction blocks
+SET TRANSACTION READ ONLY\; COMMIT AND CHAIN; -- error
+ERROR: COMMIT AND CHAIN can only be used in transaction blocks
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SET TRANSACTION READ ONLY\; ROLLBACK AND CHAIN; -- error
+ERROR: ROLLBACK AND CHAIN can only be used in transaction blocks
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+CREATE TABLE abc (a int);
+-- COMMIT/ROLLBACK + COMMIT/ROLLBACK AND CHAIN
+INSERT INTO abc VALUES (7)\; COMMIT\; INSERT INTO abc VALUES (8)\; COMMIT AND CHAIN; -- 7 commit, 8 error
+WARNING: there is no transaction in progress
+ERROR: COMMIT AND CHAIN can only be used in transaction blocks
+INSERT INTO abc VALUES (9)\; ROLLBACK\; INSERT INTO abc VALUES (10)\; ROLLBACK AND CHAIN; -- 9 rollback, 10 error
+WARNING: there is no transaction in progress
+ERROR: ROLLBACK AND CHAIN can only be used in transaction blocks
+-- COMMIT/ROLLBACK AND CHAIN + COMMIT/ROLLBACK
+INSERT INTO abc VALUES (11)\; COMMIT AND CHAIN\; INSERT INTO abc VALUES (12)\; COMMIT; -- 11 error, 12 not reached
+ERROR: COMMIT AND CHAIN can only be used in transaction blocks
+INSERT INTO abc VALUES (13)\; ROLLBACK AND CHAIN\; INSERT INTO abc VALUES (14)\; ROLLBACK; -- 13 error, 14 not reached
+ERROR: ROLLBACK AND CHAIN can only be used in transaction blocks
+-- START TRANSACTION + COMMIT/ROLLBACK AND CHAIN
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ\; INSERT INTO abc VALUES (15)\; COMMIT AND CHAIN; -- 15 ok
+SHOW transaction_isolation; -- transaction is active at this point
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+COMMIT;
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ\; INSERT INTO abc VALUES (16)\; ROLLBACK AND CHAIN; -- 16 ok
+SHOW transaction_isolation; -- transaction is active at this point
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+ROLLBACK;
+-- START TRANSACTION + COMMIT/ROLLBACK + COMMIT/ROLLBACK AND CHAIN
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ\; INSERT INTO abc VALUES (17)\; COMMIT\; INSERT INTO abc VALUES (18)\; COMMIT AND CHAIN; -- 17 commit, 18 error
+ERROR: COMMIT AND CHAIN can only be used in transaction blocks
+SHOW transaction_isolation; -- out of transaction block
+ transaction_isolation
+-----------------------
+ read committed
+(1 row)
+
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ\; INSERT INTO abc VALUES (19)\; ROLLBACK\; INSERT INTO abc VALUES (20)\; ROLLBACK AND CHAIN; -- 19 rollback, 20 error
+ERROR: ROLLBACK AND CHAIN can only be used in transaction blocks
+SHOW transaction_isolation; -- out of transaction block
+ transaction_isolation
+-----------------------
+ read committed
+(1 row)
+
+SELECT * FROM abc ORDER BY 1;
+ a
+----
+ 7
+ 15
+ 17
+(3 rows)
+
+DROP TABLE abc;
+-- Test for successful cleanup of an aborted transaction at session exit.
+-- THIS MUST BE THE LAST TEST IN THIS FILE.
+begin;
+select 1/0;
+ERROR: division by zero
+rollback to X;
+ERROR: savepoint "x" does not exist
+-- DO NOT ADD ANYTHING HERE.
diff --git src/test/regress/expected/triggers_1.out src/test/regress/expected/triggers_1.out
new file mode 100644
index 0000000000..5a8f055aed
--- /dev/null
+++ src/test/regress/expected/triggers_1.out
@@ -0,0 +1,2850 @@
+--
+-- TRIGGERS
+--
+create table pkeys (pkey1 int4 not null, pkey2 text not null);
+create table fkeys (fkey1 int4, fkey2 text, fkey3 int);
+create table fkeys2 (fkey21 int4, fkey22 text, pkey23 int not null);
+create index fkeys_i on fkeys (fkey1, fkey2);
+create index fkeys2_i on fkeys2 (fkey21, fkey22);
+create index fkeys2p_i on fkeys2 (pkey23);
+insert into pkeys values (10, '1');
+insert into pkeys values (20, '2');
+insert into pkeys values (30, '3');
+insert into pkeys values (40, '4');
+insert into pkeys values (50, '5');
+insert into pkeys values (60, '6');
+create unique index pkeys_i on pkeys (pkey1, pkey2);
+--
+-- For fkeys:
+-- (fkey1, fkey2) --> pkeys (pkey1, pkey2)
+-- (fkey3) --> fkeys2 (pkey23)
+--
+create trigger check_fkeys_pkey_exist
+ before insert or update on fkeys
+ for each row
+ execute function
+ check_primary_key ('fkey1', 'fkey2', 'pkeys', 'pkey1', 'pkey2');
+create trigger check_fkeys_pkey2_exist
+ before insert or update on fkeys
+ for each row
+ execute function check_primary_key ('fkey3', 'fkeys2', 'pkey23');
+--
+-- For fkeys2:
+-- (fkey21, fkey22) --> pkeys (pkey1, pkey2)
+--
+create trigger check_fkeys2_pkey_exist
+ before insert or update on fkeys2
+ for each row
+ execute procedure
+ check_primary_key ('fkey21', 'fkey22', 'pkeys', 'pkey1', 'pkey2');
+-- Test comments
+COMMENT ON TRIGGER check_fkeys2_pkey_bad ON fkeys2 IS 'wrong';
+ERROR: trigger "check_fkeys2_pkey_bad" for table "fkeys2" does not exist
+COMMENT ON TRIGGER check_fkeys2_pkey_exist ON fkeys2 IS 'right';
+COMMENT ON TRIGGER check_fkeys2_pkey_exist ON fkeys2 IS NULL;
+--
+-- For pkeys:
+-- ON DELETE/UPDATE (pkey1, pkey2) CASCADE:
+-- fkeys (fkey1, fkey2) and fkeys2 (fkey21, fkey22)
+--
+create trigger check_pkeys_fkey_cascade
+ before delete or update on pkeys
+ for each row
+ execute procedure
+ check_foreign_key (2, 'cascade', 'pkey1', 'pkey2',
+ 'fkeys', 'fkey1', 'fkey2', 'fkeys2', 'fkey21', 'fkey22');
+--
+-- For fkeys2:
+-- ON DELETE/UPDATE (pkey23) RESTRICT:
+-- fkeys (fkey3)
+--
+create trigger check_fkeys2_fkey_restrict
+ before delete or update on fkeys2
+ for each row
+ execute procedure check_foreign_key (1, 'restrict', 'pkey23', 'fkeys', 'fkey3');
+insert into fkeys2 values (10, '1', 1);
+insert into fkeys2 values (30, '3', 2);
+insert into fkeys2 values (40, '4', 5);
+insert into fkeys2 values (50, '5', 3);
+-- no key in pkeys
+insert into fkeys2 values (70, '5', 3);
+ERROR: tuple references non-existent key
+DETAIL: Trigger "check_fkeys2_pkey_exist" found tuple referencing non-existent key in "pkeys".
+insert into fkeys values (10, '1', 2);
+insert into fkeys values (30, '3', 3);
+insert into fkeys values (40, '4', 2);
+insert into fkeys values (50, '5', 2);
+-- no key in pkeys
+insert into fkeys values (70, '5', 1);
+ERROR: tuple references non-existent key
+DETAIL: Trigger "check_fkeys_pkey_exist" found tuple referencing non-existent key in "pkeys".
+-- no key in fkeys2
+insert into fkeys values (60, '6', 4);
+ERROR: tuple references non-existent key
+DETAIL: Trigger "check_fkeys_pkey2_exist" found tuple referencing non-existent key in "fkeys2".
+delete from pkeys where pkey1 = 30 and pkey2 = '3';
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
+ERROR: "check_fkeys2_fkey_restrict": tuple is referenced in "fkeys"
+CONTEXT: SQL statement "delete from fkeys2 where fkey21 = $1 and fkey22 = $2 "
+delete from pkeys where pkey1 = 40 and pkey2 = '4';
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys2 are deleted
+update pkeys set pkey1 = 7, pkey2 = '70' where pkey1 = 50 and pkey2 = '5';
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
+ERROR: "check_fkeys2_fkey_restrict": tuple is referenced in "fkeys"
+CONTEXT: SQL statement "delete from fkeys2 where fkey21 = $1 and fkey22 = $2 "
+update pkeys set pkey1 = 7, pkey2 = '70' where pkey1 = 10 and pkey2 = '1';
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys2 are deleted
+SELECT trigger_name, event_manipulation, event_object_schema, event_object_table,
+ action_order, action_condition, action_orientation, action_timing,
+ action_reference_old_table, action_reference_new_table
+ FROM information_schema.triggers
+ WHERE event_object_table in ('pkeys', 'fkeys', 'fkeys2')
+ ORDER BY trigger_name COLLATE "C", 2;
+ trigger_name | event_manipulation | event_object_schema | event_object_table | action_order | action_condition | action_orientation | action_timing | action_reference_old_table | action_reference_new_table
+----------------------------+--------------------+---------------------+--------------------+--------------+------------------+--------------------+---------------+----------------------------+----------------------------
+ check_fkeys2_fkey_restrict | DELETE | public | fkeys2 | 1 | | ROW | BEFORE | |
+ check_fkeys2_fkey_restrict | UPDATE | public | fkeys2 | 1 | | ROW | BEFORE | |
+ check_fkeys2_pkey_exist | INSERT | public | fkeys2 | 1 | | ROW | BEFORE | |
+ check_fkeys2_pkey_exist | UPDATE | public | fkeys2 | 2 | | ROW | BEFORE | |
+ check_fkeys_pkey2_exist | INSERT | public | fkeys | 1 | | ROW | BEFORE | |
+ check_fkeys_pkey2_exist | UPDATE | public | fkeys | 1 | | ROW | BEFORE | |
+ check_fkeys_pkey_exist | INSERT | public | fkeys | 2 | | ROW | BEFORE | |
+ check_fkeys_pkey_exist | UPDATE | public | fkeys | 2 | | ROW | BEFORE | |
+ check_pkeys_fkey_cascade | DELETE | public | pkeys | 1 | | ROW | BEFORE | |
+ check_pkeys_fkey_cascade | UPDATE | public | pkeys | 1 | | ROW | BEFORE | |
+(10 rows)
+
+DROP TABLE pkeys;
+DROP TABLE fkeys;
+DROP TABLE fkeys2;
+-- Check behavior when trigger returns unmodified trigtuple
+create table trigtest (f1 int, f2 text);
+create trigger trigger_return_old
+ before insert or delete or update on trigtest
+ for each row execute procedure trigger_return_old();
+insert into trigtest values(1, 'foo');
+select * from trigtest;
+ f1 | f2
+----+-----
+ 1 | foo
+(1 row)
+
+update trigtest set f2 = f2 || 'bar';
+select * from trigtest;
+ f1 | f2
+----+-----
+ 1 | foo
+(1 row)
+
+delete from trigtest;
+select * from trigtest;
+ f1 | f2
+----+----
+(0 rows)
+
+drop table trigtest;
+create sequence ttdummy_seq increment 10 start 0 minvalue 0;
+create table tttest (
+ price_id int4,
+ price_val int4,
+ price_on int4,
+ price_off int4 default 999999
+);
+create trigger ttdummy
+ before delete or update on tttest
+ for each row
+ execute procedure
+ ttdummy (price_on, price_off);
+create trigger ttserial
+ before insert or update on tttest
+ for each row
+ execute procedure
+ autoinc (price_on, ttdummy_seq);
+insert into tttest values (1, 1, null);
+insert into tttest values (2, 2, null);
+insert into tttest values (3, 3, 0);
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 999999
+ 3 | 3 | 30 | 999999
+(3 rows)
+
+delete from tttest where price_id = 2;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 3 | 3 | 30 | 999999
+ 2 | 2 | 20 | 40
+(3 rows)
+
+-- what do we see?
+-- get current prices
+select * from tttest where price_off = 999999;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 3 | 3 | 30 | 999999
+(2 rows)
+
+-- change price for price_id == 3
+update tttest set price_val = 30 where price_id = 3;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 40
+ 3 | 30 | 50 | 999999
+ 3 | 3 | 30 | 50
+(4 rows)
+
+-- now we want to change price_id in ALL tuples
+-- this does not get us what we need
+update tttest set price_id = 5 where price_id = 3;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 40
+ 3 | 3 | 30 | 50
+ 5 | 30 | 60 | 999999
+ 3 | 30 | 50 | 60
+(5 rows)
+
+-- restore data as before last update:
+select set_ttdummy(0);
+ set_ttdummy
+-------------
+ 1
+(1 row)
+
+delete from tttest where price_id = 5;
+update tttest set price_off = 999999 where price_val = 30;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 40
+ 3 | 3 | 30 | 50
+ 3 | 30 | 50 | 999999
+(4 rows)
+
+-- and try change price_id now!
+update tttest set price_id = 5 where price_id = 3;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 40
+ 5 | 3 | 30 | 50
+ 5 | 30 | 50 | 999999
+(4 rows)
+
+-- isn't this what we need?
+select set_ttdummy(1);
+ set_ttdummy
+-------------
+ 0
+(1 row)
+
+-- we want to correct some "date"
+update tttest set price_on = -1 where price_id = 1;
+ERROR: ttdummy (tttest): you cannot change price_on and/or price_off columns (use set_ttdummy)
+-- but this doesn't work
+-- try it this way
+select set_ttdummy(0);
+ set_ttdummy
+-------------
+ 1
+(1 row)
+
+update tttest set price_on = -1 where price_id = 1;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 2 | 2 | 20 | 40
+ 5 | 3 | 30 | 50
+ 5 | 30 | 50 | 999999
+ 1 | 1 | -1 | 999999
+(4 rows)
+
+-- isn't this what we need?
+-- get price for price_id == 5 as it was @ "date" 35
+select * from tttest where price_on <= 35 and price_off > 35 and price_id = 5;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 5 | 3 | 30 | 50
+(1 row)
+
+drop table tttest;
+drop sequence ttdummy_seq;
+--
+-- tests for per-statement triggers
+--
+CREATE TABLE log_table (tstamp timestamp default timeofday()::timestamp);
+CREATE TABLE main_table (a int unique, b int);
+COPY main_table (a,b) FROM stdin;
+CREATE FUNCTION trigger_func() RETURNS trigger LANGUAGE plpgsql AS '
+BEGIN
+ RAISE NOTICE ''trigger_func(%) called: action = %, when = %, level = %'', TG_ARGV[0], TG_OP, TG_WHEN, TG_LEVEL;
+ RETURN NULL;
+END;';
+CREATE TRIGGER before_ins_stmt_trig BEFORE INSERT ON main_table
+FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func('before_ins_stmt');
+CREATE TRIGGER after_ins_stmt_trig AFTER INSERT ON main_table
+FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func('after_ins_stmt');
+--
+-- if neither 'FOR EACH ROW' nor 'FOR EACH STATEMENT' was specified,
+-- CREATE TRIGGER should default to 'FOR EACH STATEMENT'
+--
+CREATE TRIGGER after_upd_stmt_trig AFTER UPDATE ON main_table
+EXECUTE PROCEDURE trigger_func('after_upd_stmt');
+-- Both insert and update statement-level triggers (before and after) should
+-- fire. The UPDATE before trigger doesn't fire, but only because one isn't
+-- defined.
+INSERT INTO main_table (a, b) VALUES (5, 10) ON CONFLICT (a)
+ DO UPDATE SET b = EXCLUDED.b;
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+CREATE TRIGGER after_upd_row_trig AFTER UPDATE ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('after_upd_row');
+INSERT INTO main_table DEFAULT VALUES;
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+UPDATE main_table SET a = a + 1 WHERE b < 30;
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+-- UPDATE that affects zero rows should still call per-statement trigger
+UPDATE main_table SET a = a + 2 WHERE b > 100;
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+-- constraint now unneeded
+ALTER TABLE main_table DROP CONSTRAINT main_table_a_key;
+-- COPY should fire per-row and per-statement INSERT triggers
+COPY main_table (a, b) FROM stdin;
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+SELECT * FROM main_table ORDER BY a, b;
+ a | b
+----+----
+ 6 | 10
+ 21 | 20
+ 30 | 40
+ 31 | 10
+ 50 | 35
+ 50 | 60
+ 81 | 15
+ |
+(8 rows)
+
+--
+-- test triggers with WHEN clause
+--
+CREATE TRIGGER modified_a BEFORE UPDATE OF a ON main_table
+FOR EACH ROW WHEN (OLD.a <> NEW.a) EXECUTE PROCEDURE trigger_func('modified_a');
+CREATE TRIGGER modified_any BEFORE UPDATE OF a ON main_table
+FOR EACH ROW WHEN (OLD.* IS DISTINCT FROM NEW.*) EXECUTE PROCEDURE trigger_func('modified_any');
+CREATE TRIGGER insert_a AFTER INSERT ON main_table
+FOR EACH ROW WHEN (NEW.a = 123) EXECUTE PROCEDURE trigger_func('insert_a');
+CREATE TRIGGER delete_a AFTER DELETE ON main_table
+FOR EACH ROW WHEN (OLD.a = 123) EXECUTE PROCEDURE trigger_func('delete_a');
+CREATE TRIGGER insert_when BEFORE INSERT ON main_table
+FOR EACH STATEMENT WHEN (true) EXECUTE PROCEDURE trigger_func('insert_when');
+CREATE TRIGGER delete_when AFTER DELETE ON main_table
+FOR EACH STATEMENT WHEN (true) EXECUTE PROCEDURE trigger_func('delete_when');
+SELECT trigger_name, event_manipulation, event_object_schema, event_object_table,
+ action_order, action_condition, action_orientation, action_timing,
+ action_reference_old_table, action_reference_new_table
+ FROM information_schema.triggers
+ WHERE event_object_table IN ('main_table')
+ ORDER BY trigger_name COLLATE "C", 2;
+ trigger_name | event_manipulation | event_object_schema | event_object_table | action_order | action_condition | action_orientation | action_timing | action_reference_old_table | action_reference_new_table
+----------------------+--------------------+---------------------+--------------------+--------------+--------------------------------+--------------------+---------------+----------------------------+----------------------------
+ after_ins_stmt_trig | INSERT | public | main_table | 1 | | STATEMENT | AFTER | |
+ after_upd_row_trig | UPDATE | public | main_table | 1 | | ROW | AFTER | |
+ after_upd_stmt_trig | UPDATE | public | main_table | 1 | | STATEMENT | AFTER | |
+ before_ins_stmt_trig | INSERT | public | main_table | 1 | | STATEMENT | BEFORE | |
+ delete_a | DELETE | public | main_table | 1 | (old.a = 123) | ROW | AFTER | |
+ delete_when | DELETE | public | main_table | 1 | true | STATEMENT | AFTER | |
+ insert_a | INSERT | public | main_table | 1 | (new.a = 123) | ROW | AFTER | |
+ insert_when | INSERT | public | main_table | 2 | true | STATEMENT | BEFORE | |
+ modified_a | UPDATE | public | main_table | 1 | (old.a <> new.a) | ROW | BEFORE | |
+ modified_any | UPDATE | public | main_table | 2 | (old.* IS DISTINCT FROM new.*) | ROW | BEFORE | |
+(10 rows)
+
+INSERT INTO main_table (a) VALUES (123), (456);
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(insert_when) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(insert_a) called: action = INSERT, when = AFTER, level = ROW
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+COPY main_table FROM stdin;
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(insert_when) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(insert_a) called: action = INSERT, when = AFTER, level = ROW
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+DELETE FROM main_table WHERE a IN (123, 456);
+NOTICE: trigger_func(delete_a) called: action = DELETE, when = AFTER, level = ROW
+NOTICE: trigger_func(delete_a) called: action = DELETE, when = AFTER, level = ROW
+NOTICE: trigger_func(delete_when) called: action = DELETE, when = AFTER, level = STATEMENT
+UPDATE main_table SET a = 50, b = 60;
+NOTICE: trigger_func(modified_any) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_any) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+SELECT * FROM main_table ORDER BY a, b;
+ a | b
+----+----
+ 6 | 10
+ 21 | 20
+ 30 | 40
+ 31 | 10
+ 50 | 35
+ 50 | 60
+ 81 | 15
+ |
+(8 rows)
+
+SELECT pg_get_triggerdef(oid, true) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_a';
+ pg_get_triggerdef
+-------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE TRIGGER modified_a BEFORE UPDATE OF a ON main_table FOR EACH ROW WHEN (old.a <> new.a) EXECUTE FUNCTION trigger_func('modified_a')
+(1 row)
+
+SELECT pg_get_triggerdef(oid, false) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_a';
+ pg_get_triggerdef
+----------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE TRIGGER modified_a BEFORE UPDATE OF a ON public.main_table FOR EACH ROW WHEN ((old.a <> new.a)) EXECUTE FUNCTION trigger_func('modified_a')
+(1 row)
+
+SELECT pg_get_triggerdef(oid, true) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_any';
+ pg_get_triggerdef
+-------------------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE TRIGGER modified_any BEFORE UPDATE OF a ON main_table FOR EACH ROW WHEN (old.* IS DISTINCT FROM new.*) EXECUTE FUNCTION trigger_func('modified_any')
+(1 row)
+
+-- Test RENAME TRIGGER
+ALTER TRIGGER modified_a ON main_table RENAME TO modified_modified_a;
+SELECT count(*) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_a';
+ count
+-------
+ 0
+(1 row)
+
+SELECT count(*) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_modified_a';
+ count
+-------
+ 1
+(1 row)
+
+DROP TRIGGER modified_modified_a ON main_table;
+DROP TRIGGER modified_any ON main_table;
+DROP TRIGGER insert_a ON main_table;
+DROP TRIGGER delete_a ON main_table;
+DROP TRIGGER insert_when ON main_table;
+DROP TRIGGER delete_when ON main_table;
+-- Test WHEN condition accessing system columns.
+create table table_with_oids(a int);
+insert into table_with_oids values (1);
+create trigger oid_unchanged_trig after update on table_with_oids
+ for each row
+ when (new.tableoid = old.tableoid AND new.tableoid <> 0)
+ execute procedure trigger_func('after_upd_oid_unchanged');
+update table_with_oids set a = a + 1;
+NOTICE: trigger_func(after_upd_oid_unchanged) called: action = UPDATE, when = AFTER, level = ROW
+drop table table_with_oids;
+-- Test column-level triggers
+DROP TRIGGER after_upd_row_trig ON main_table;
+CREATE TRIGGER before_upd_a_row_trig BEFORE UPDATE OF a ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_upd_a_row');
+CREATE TRIGGER after_upd_b_row_trig AFTER UPDATE OF b ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('after_upd_b_row');
+CREATE TRIGGER after_upd_a_b_row_trig AFTER UPDATE OF a, b ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('after_upd_a_b_row');
+CREATE TRIGGER before_upd_a_stmt_trig BEFORE UPDATE OF a ON main_table
+FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func('before_upd_a_stmt');
+CREATE TRIGGER after_upd_b_stmt_trig AFTER UPDATE OF b ON main_table
+FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func('after_upd_b_stmt');
+SELECT pg_get_triggerdef(oid) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'after_upd_a_b_row_trig';
+ pg_get_triggerdef
+-------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE TRIGGER after_upd_a_b_row_trig AFTER UPDATE OF a, b ON public.main_table FOR EACH ROW EXECUTE FUNCTION trigger_func('after_upd_a_b_row')
+(1 row)
+
+UPDATE main_table SET a = 50;
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+UPDATE main_table SET b = 10;
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+--
+-- Test case for bug with BEFORE trigger followed by AFTER trigger with WHEN
+--
+CREATE TABLE some_t (some_col boolean NOT NULL);
+CREATE FUNCTION dummy_update_func() RETURNS trigger AS $$
+BEGIN
+ RAISE NOTICE 'dummy_update_func(%) called: action = %, old = %, new = %',
+ TG_ARGV[0], TG_OP, OLD, NEW;
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER some_trig_before BEFORE UPDATE ON some_t FOR EACH ROW
+ EXECUTE PROCEDURE dummy_update_func('before');
+CREATE TRIGGER some_trig_aftera AFTER UPDATE ON some_t FOR EACH ROW
+ WHEN (NOT OLD.some_col AND NEW.some_col)
+ EXECUTE PROCEDURE dummy_update_func('aftera');
+CREATE TRIGGER some_trig_afterb AFTER UPDATE ON some_t FOR EACH ROW
+ WHEN (NOT NEW.some_col)
+ EXECUTE PROCEDURE dummy_update_func('afterb');
+INSERT INTO some_t VALUES (TRUE);
+UPDATE some_t SET some_col = TRUE;
+NOTICE: dummy_update_func(before) called: action = UPDATE, old = (t), new = (t)
+UPDATE some_t SET some_col = FALSE;
+NOTICE: dummy_update_func(before) called: action = UPDATE, old = (t), new = (f)
+NOTICE: dummy_update_func(afterb) called: action = UPDATE, old = (t), new = (f)
+UPDATE some_t SET some_col = TRUE;
+NOTICE: dummy_update_func(before) called: action = UPDATE, old = (f), new = (t)
+NOTICE: dummy_update_func(aftera) called: action = UPDATE, old = (f), new = (t)
+DROP TABLE some_t;
+-- bogus cases
+CREATE TRIGGER error_upd_and_col BEFORE UPDATE OR UPDATE OF a ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('error_upd_and_col');
+ERROR: duplicate trigger events specified at or near "ON"
+LINE 1: ...ER error_upd_and_col BEFORE UPDATE OR UPDATE OF a ON main_ta...
+ ^
+CREATE TRIGGER error_upd_a_a BEFORE UPDATE OF a, a ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('error_upd_a_a');
+ERROR: column "a" specified more than once
+CREATE TRIGGER error_ins_a BEFORE INSERT OF a ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('error_ins_a');
+ERROR: syntax error at or near "OF"
+LINE 1: CREATE TRIGGER error_ins_a BEFORE INSERT OF a ON main_table
+ ^
+CREATE TRIGGER error_ins_when BEFORE INSERT OR UPDATE ON main_table
+FOR EACH ROW WHEN (OLD.a <> NEW.a)
+EXECUTE PROCEDURE trigger_func('error_ins_old');
+ERROR: INSERT trigger's WHEN condition cannot reference OLD values
+LINE 2: FOR EACH ROW WHEN (OLD.a <> NEW.a)
+ ^
+CREATE TRIGGER error_del_when BEFORE DELETE OR UPDATE ON main_table
+FOR EACH ROW WHEN (OLD.a <> NEW.a)
+EXECUTE PROCEDURE trigger_func('error_del_new');
+ERROR: DELETE trigger's WHEN condition cannot reference NEW values
+LINE 2: FOR EACH ROW WHEN (OLD.a <> NEW.a)
+ ^
+CREATE TRIGGER error_del_when BEFORE INSERT OR UPDATE ON main_table
+FOR EACH ROW WHEN (NEW.tableoid <> 0)
+EXECUTE PROCEDURE trigger_func('error_when_sys_column');
+ERROR: BEFORE trigger's WHEN condition cannot reference NEW system columns
+LINE 2: FOR EACH ROW WHEN (NEW.tableoid <> 0)
+ ^
+CREATE TRIGGER error_stmt_when BEFORE UPDATE OF a ON main_table
+FOR EACH STATEMENT WHEN (OLD.* IS DISTINCT FROM NEW.*)
+EXECUTE PROCEDURE trigger_func('error_stmt_when');
+ERROR: statement trigger's WHEN condition cannot reference column values
+LINE 2: FOR EACH STATEMENT WHEN (OLD.* IS DISTINCT FROM NEW.*)
+ ^
+-- check dependency restrictions
+ALTER TABLE main_table DROP COLUMN b;
+ERROR: cannot drop column b of table main_table because other objects depend on it
+DETAIL: trigger after_upd_b_row_trig on table main_table depends on column b of table main_table
+trigger after_upd_a_b_row_trig on table main_table depends on column b of table main_table
+trigger after_upd_b_stmt_trig on table main_table depends on column b of table main_table
+HINT: Use DROP ... CASCADE to drop the dependent objects too.
+-- this should succeed, but we'll roll it back to keep the triggers around
+begin;
+DROP TRIGGER after_upd_a_b_row_trig ON main_table;
+DROP TRIGGER after_upd_b_row_trig ON main_table;
+DROP TRIGGER after_upd_b_stmt_trig ON main_table;
+ALTER TABLE main_table DROP COLUMN b;
+rollback;
+-- Test enable/disable triggers
+create table trigtest (i serial primary key);
+-- test that disabling RI triggers works
+create table trigtest2 (i int references trigtest(i) on delete cascade);
+create function trigtest() returns trigger as $$
+begin
+ raise notice '% % % %', TG_RELNAME, TG_OP, TG_WHEN, TG_LEVEL;
+ return new;
+end;$$ language plpgsql;
+create trigger trigtest_b_row_tg before insert or update or delete on trigtest
+for each row execute procedure trigtest();
+create trigger trigtest_a_row_tg after insert or update or delete on trigtest
+for each row execute procedure trigtest();
+create trigger trigtest_b_stmt_tg before insert or update or delete on trigtest
+for each statement execute procedure trigtest();
+create trigger trigtest_a_stmt_tg after insert or update or delete on trigtest
+for each statement execute procedure trigtest();
+insert into trigtest default values;
+NOTICE: trigtest INSERT BEFORE STATEMENT
+NOTICE: trigtest INSERT BEFORE ROW
+NOTICE: trigtest INSERT AFTER ROW
+NOTICE: trigtest INSERT AFTER STATEMENT
+alter table trigtest disable trigger trigtest_b_row_tg;
+insert into trigtest default values;
+NOTICE: trigtest INSERT BEFORE STATEMENT
+NOTICE: trigtest INSERT AFTER ROW
+NOTICE: trigtest INSERT AFTER STATEMENT
+alter table trigtest disable trigger user;
+insert into trigtest default values;
+alter table trigtest enable trigger trigtest_a_stmt_tg;
+insert into trigtest default values;
+NOTICE: trigtest INSERT AFTER STATEMENT
+set session_replication_role = replica;
+insert into trigtest default values; -- does not trigger
+alter table trigtest enable always trigger trigtest_a_stmt_tg;
+insert into trigtest default values; -- now it does
+NOTICE: trigtest INSERT AFTER STATEMENT
+reset session_replication_role;
+insert into trigtest2 values(1);
+insert into trigtest2 values(2);
+delete from trigtest where i=2;
+NOTICE: trigtest DELETE AFTER STATEMENT
+select * from trigtest2;
+ i
+---
+ 1
+(1 row)
+
+alter table trigtest disable trigger all;
+delete from trigtest where i=1;
+select * from trigtest2;
+ i
+---
+ 1
+(1 row)
+
+-- ensure we still insert, even when all triggers are disabled
+insert into trigtest default values;
+select * from trigtest;
+ i
+---
+ 3
+ 4
+ 5
+ 6
+ 7
+(5 rows)
+
+drop table trigtest2;
+drop table trigtest;
+-- dump trigger data
+CREATE TABLE trigger_test (
+ i int,
+ v varchar
+);
+CREATE OR REPLACE FUNCTION trigger_data() RETURNS trigger
+LANGUAGE plpgsql AS $$
+
+declare
+
+ argstr text;
+ relid text;
+
+begin
+
+ relid := TG_relid::regclass;
+
+ -- plpgsql can't discover its trigger data in a hash like perl and python
+ -- can, or by a sort of reflection like tcl can,
+ -- so we have to hard code the names.
+ raise NOTICE 'TG_NAME: %', TG_name;
+ raise NOTICE 'TG_WHEN: %', TG_when;
+ raise NOTICE 'TG_LEVEL: %', TG_level;
+ raise NOTICE 'TG_OP: %', TG_op;
+ raise NOTICE 'TG_RELID::regclass: %', relid;
+ raise NOTICE 'TG_RELNAME: %', TG_relname;
+ raise NOTICE 'TG_TABLE_NAME: %', TG_table_name;
+ raise NOTICE 'TG_TABLE_SCHEMA: %', TG_table_schema;
+ raise NOTICE 'TG_NARGS: %', TG_nargs;
+
+ argstr := '[';
+ for i in 0 .. TG_nargs - 1 loop
+ if i > 0 then
+ argstr := argstr || ', ';
+ end if;
+ argstr := argstr || TG_argv[i];
+ end loop;
+ argstr := argstr || ']';
+ raise NOTICE 'TG_ARGV: %', argstr;
+
+ if TG_OP != 'INSERT' then
+ raise NOTICE 'OLD: %', OLD;
+ end if;
+
+ if TG_OP != 'DELETE' then
+ raise NOTICE 'NEW: %', NEW;
+ end if;
+
+ if TG_OP = 'DELETE' then
+ return OLD;
+ else
+ return NEW;
+ end if;
+
+end;
+$$;
+CREATE TRIGGER show_trigger_data_trig
+BEFORE INSERT OR UPDATE OR DELETE ON trigger_test
+FOR EACH ROW EXECUTE PROCEDURE trigger_data(23,'skidoo');
+insert into trigger_test values(1,'insert');
+NOTICE: TG_NAME: show_trigger_data_trig
+NOTICE: TG_WHEN: BEFORE
+NOTICE: TG_LEVEL: ROW
+NOTICE: TG_OP: INSERT
+NOTICE: TG_RELID::regclass: trigger_test
+NOTICE: TG_RELNAME: trigger_test
+NOTICE: TG_TABLE_NAME: trigger_test
+NOTICE: TG_TABLE_SCHEMA: public
+NOTICE: TG_NARGS: 2
+NOTICE: TG_ARGV: [23, skidoo]
+NOTICE: NEW: (1,insert)
+update trigger_test set v = 'update' where i = 1;
+NOTICE: TG_NAME: show_trigger_data_trig
+NOTICE: TG_WHEN: BEFORE
+NOTICE: TG_LEVEL: ROW
+NOTICE: TG_OP: UPDATE
+NOTICE: TG_RELID::regclass: trigger_test
+NOTICE: TG_RELNAME: trigger_test
+NOTICE: TG_TABLE_NAME: trigger_test
+NOTICE: TG_TABLE_SCHEMA: public
+NOTICE: TG_NARGS: 2
+NOTICE: TG_ARGV: [23, skidoo]
+NOTICE: OLD: (1,insert)
+NOTICE: NEW: (1,update)
+delete from trigger_test;
+NOTICE: TG_NAME: show_trigger_data_trig
+NOTICE: TG_WHEN: BEFORE
+NOTICE: TG_LEVEL: ROW
+NOTICE: TG_OP: DELETE
+NOTICE: TG_RELID::regclass: trigger_test
+NOTICE: TG_RELNAME: trigger_test
+NOTICE: TG_TABLE_NAME: trigger_test
+NOTICE: TG_TABLE_SCHEMA: public
+NOTICE: TG_NARGS: 2
+NOTICE: TG_ARGV: [23, skidoo]
+NOTICE: OLD: (1,update)
+DROP TRIGGER show_trigger_data_trig on trigger_test;
+DROP FUNCTION trigger_data();
+DROP TABLE trigger_test;
+--
+-- Test use of row comparisons on OLD/NEW
+--
+CREATE TABLE trigger_test (f1 int, f2 text, f3 text);
+-- this is the obvious (and wrong...) way to compare rows
+CREATE FUNCTION mytrigger() RETURNS trigger LANGUAGE plpgsql as $$
+begin
+ if row(old.*) = row(new.*) then
+ raise notice 'row % not changed', new.f1;
+ else
+ raise notice 'row % changed', new.f1;
+ end if;
+ return new;
+end$$;
+CREATE TRIGGER t
+BEFORE UPDATE ON trigger_test
+FOR EACH ROW EXECUTE PROCEDURE mytrigger();
+INSERT INTO trigger_test VALUES(1, 'foo', 'bar');
+INSERT INTO trigger_test VALUES(2, 'baz', 'quux');
+UPDATE trigger_test SET f3 = 'bar';
+NOTICE: row 1 not changed
+NOTICE: row 2 changed
+UPDATE trigger_test SET f3 = NULL;
+NOTICE: row 1 changed
+NOTICE: row 2 changed
+-- this demonstrates that the above isn't really working as desired:
+UPDATE trigger_test SET f3 = NULL;
+NOTICE: row 1 changed
+NOTICE: row 2 changed
+-- the right way when considering nulls is
+CREATE OR REPLACE FUNCTION mytrigger() RETURNS trigger LANGUAGE plpgsql as $$
+begin
+ if row(old.*) is distinct from row(new.*) then
+ raise notice 'row % changed', new.f1;
+ else
+ raise notice 'row % not changed', new.f1;
+ end if;
+ return new;
+end$$;
+UPDATE trigger_test SET f3 = 'bar';
+NOTICE: row 1 changed
+NOTICE: row 2 changed
+UPDATE trigger_test SET f3 = NULL;
+NOTICE: row 1 changed
+NOTICE: row 2 changed
+UPDATE trigger_test SET f3 = NULL;
+NOTICE: row 1 not changed
+NOTICE: row 2 not changed
+DROP TABLE trigger_test;
+DROP FUNCTION mytrigger();
+-- Test snapshot management in serializable transactions involving triggers
+-- per bug report in 6bc73d4c0910042358k3d1adff3qa36f8df75198ecea@mail.gmail.com
+CREATE FUNCTION serializable_update_trig() RETURNS trigger LANGUAGE plpgsql AS
+$$
+declare
+ rec record;
+begin
+ new.description = 'updated in trigger';
+ return new;
+end;
+$$;
+CREATE TABLE serializable_update_tab (
+ id int,
+ filler text,
+ description text
+);
+CREATE TRIGGER serializable_update_trig BEFORE UPDATE ON serializable_update_tab
+ FOR EACH ROW EXECUTE PROCEDURE serializable_update_trig();
+INSERT INTO serializable_update_tab SELECT a, repeat('xyzxz', 100), 'new'
+ FROM generate_series(1, 50) a;
+BEGIN;
+SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+UPDATE serializable_update_tab SET description = 'no no', id = 1 WHERE id = 1;
+COMMIT;
+SELECT description FROM serializable_update_tab WHERE id = 1;
+ description
+--------------------
+ updated in trigger
+(1 row)
+
+DROP TABLE serializable_update_tab;
+-- minimal update trigger
+CREATE TABLE min_updates_test (
+ f1 text,
+ f2 int,
+ f3 int);
+INSERT INTO min_updates_test VALUES ('a',1,2),('b','2',null);
+CREATE TRIGGER z_min_update
+BEFORE UPDATE ON min_updates_test
+FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();
+\set QUIET false
+UPDATE min_updates_test SET f1 = f1;
+UPDATE 0
+UPDATE min_updates_test SET f2 = f2 + 1;
+UPDATE 2
+UPDATE min_updates_test SET f3 = 2 WHERE f3 is null;
+UPDATE 1
+\set QUIET true
+SELECT * FROM min_updates_test;
+ f1 | f2 | f3
+----+----+----
+ a | 2 | 2
+ b | 3 | 2
+(2 rows)
+
+DROP TABLE min_updates_test;
+--
+-- Test triggers on views
+--
+CREATE VIEW main_view AS SELECT a, b FROM main_table;
+-- VIEW trigger function
+CREATE OR REPLACE FUNCTION view_trigger() RETURNS trigger
+LANGUAGE plpgsql AS $$
+declare
+ argstr text := '';
+begin
+ for i in 0 .. TG_nargs - 1 loop
+ if i > 0 then
+ argstr := argstr || ', ';
+ end if;
+ argstr := argstr || TG_argv[i];
+ end loop;
+
+ raise notice '% % % % (%)', TG_RELNAME, TG_WHEN, TG_OP, TG_LEVEL, argstr;
+
+ if TG_LEVEL = 'ROW' then
+ if TG_OP = 'INSERT' then
+ raise NOTICE 'NEW: %', NEW;
+ INSERT INTO main_table VALUES (NEW.a, NEW.b);
+ RETURN NEW;
+ end if;
+
+ if TG_OP = 'UPDATE' then
+ raise NOTICE 'OLD: %, NEW: %', OLD, NEW;
+ UPDATE main_table SET a = NEW.a, b = NEW.b WHERE a = OLD.a AND b = OLD.b;
+ if NOT FOUND then RETURN NULL; end if;
+ RETURN NEW;
+ end if;
+
+ if TG_OP = 'DELETE' then
+ raise NOTICE 'OLD: %', OLD;
+ DELETE FROM main_table WHERE a = OLD.a AND b = OLD.b;
+ if NOT FOUND then RETURN NULL; end if;
+ RETURN OLD;
+ end if;
+ end if;
+
+ RETURN NULL;
+end;
+$$;
+-- Before row triggers aren't allowed on views
+CREATE TRIGGER invalid_trig BEFORE INSERT ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_ins_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+CREATE TRIGGER invalid_trig BEFORE UPDATE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_upd_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+CREATE TRIGGER invalid_trig BEFORE DELETE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_del_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+-- After row triggers aren't allowed on views
+CREATE TRIGGER invalid_trig AFTER INSERT ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_ins_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+CREATE TRIGGER invalid_trig AFTER UPDATE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_upd_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+CREATE TRIGGER invalid_trig AFTER DELETE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_del_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+-- Truncate triggers aren't allowed on views
+CREATE TRIGGER invalid_trig BEFORE TRUNCATE ON main_view
+EXECUTE PROCEDURE trigger_func('before_tru_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have TRUNCATE triggers.
+CREATE TRIGGER invalid_trig AFTER TRUNCATE ON main_view
+EXECUTE PROCEDURE trigger_func('before_tru_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have TRUNCATE triggers.
+-- INSTEAD OF triggers aren't allowed on tables
+CREATE TRIGGER invalid_trig INSTEAD OF INSERT ON main_table
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_ins');
+ERROR: "main_table" is a table
+DETAIL: Tables cannot have INSTEAD OF triggers.
+CREATE TRIGGER invalid_trig INSTEAD OF UPDATE ON main_table
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_upd');
+ERROR: "main_table" is a table
+DETAIL: Tables cannot have INSTEAD OF triggers.
+CREATE TRIGGER invalid_trig INSTEAD OF DELETE ON main_table
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_del');
+ERROR: "main_table" is a table
+DETAIL: Tables cannot have INSTEAD OF triggers.
+-- Don't support WHEN clauses with INSTEAD OF triggers
+CREATE TRIGGER invalid_trig INSTEAD OF UPDATE ON main_view
+FOR EACH ROW WHEN (OLD.a <> NEW.a) EXECUTE PROCEDURE view_trigger('instead_of_upd');
+ERROR: INSTEAD OF triggers cannot have WHEN conditions
+-- Don't support column-level INSTEAD OF triggers
+CREATE TRIGGER invalid_trig INSTEAD OF UPDATE OF a ON main_view
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_upd');
+ERROR: INSTEAD OF triggers cannot have column lists
+-- Don't support statement-level INSTEAD OF triggers
+CREATE TRIGGER invalid_trig INSTEAD OF UPDATE ON main_view
+EXECUTE PROCEDURE view_trigger('instead_of_upd');
+ERROR: INSTEAD OF triggers must be FOR EACH ROW
+-- Valid INSTEAD OF triggers
+CREATE TRIGGER instead_of_insert_trig INSTEAD OF INSERT ON main_view
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_ins');
+CREATE TRIGGER instead_of_update_trig INSTEAD OF UPDATE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_upd');
+CREATE TRIGGER instead_of_delete_trig INSTEAD OF DELETE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_del');
+-- Valid BEFORE statement VIEW triggers
+CREATE TRIGGER before_ins_stmt_trig BEFORE INSERT ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('before_view_ins_stmt');
+CREATE TRIGGER before_upd_stmt_trig BEFORE UPDATE ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('before_view_upd_stmt');
+CREATE TRIGGER before_del_stmt_trig BEFORE DELETE ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('before_view_del_stmt');
+-- Valid AFTER statement VIEW triggers
+CREATE TRIGGER after_ins_stmt_trig AFTER INSERT ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('after_view_ins_stmt');
+CREATE TRIGGER after_upd_stmt_trig AFTER UPDATE ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('after_view_upd_stmt');
+CREATE TRIGGER after_del_stmt_trig AFTER DELETE ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('after_view_del_stmt');
+\set QUIET false
+-- Insert into view using trigger
+INSERT INTO main_view VALUES (20, 30);
+NOTICE: main_view BEFORE INSERT STATEMENT (before_view_ins_stmt)
+NOTICE: main_view INSTEAD OF INSERT ROW (instead_of_ins)
+NOTICE: NEW: (20,30)
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER INSERT STATEMENT (after_view_ins_stmt)
+INSERT 0 1
+INSERT INTO main_view VALUES (21, 31) RETURNING a, b;
+NOTICE: main_view BEFORE INSERT STATEMENT (before_view_ins_stmt)
+NOTICE: main_view INSTEAD OF INSERT ROW (instead_of_ins)
+NOTICE: NEW: (21,31)
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER INSERT STATEMENT (after_view_ins_stmt)
+ a | b
+----+----
+ 21 | 31
+(1 row)
+
+INSERT 0 1
+-- Table trigger will prevent updates
+UPDATE main_view SET b = 31 WHERE a = 20;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view INSTEAD OF UPDATE ROW (instead_of_upd)
+NOTICE: OLD: (20,30), NEW: (20,31)
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+UPDATE 0
+UPDATE main_view SET b = 32 WHERE a = 21 AND b = 31 RETURNING a, b;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view INSTEAD OF UPDATE ROW (instead_of_upd)
+NOTICE: OLD: (21,31), NEW: (21,32)
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+ a | b
+---+---
+(0 rows)
+
+UPDATE 0
+-- Remove table trigger to allow updates
+DROP TRIGGER before_upd_a_row_trig ON main_table;
+DROP TRIGGER
+UPDATE main_view SET b = 31 WHERE a = 20;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view INSTEAD OF UPDATE ROW (instead_of_upd)
+NOTICE: OLD: (20,30), NEW: (20,31)
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+UPDATE 1
+UPDATE main_view SET b = 32 WHERE a = 21 AND b = 31 RETURNING a, b;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view INSTEAD OF UPDATE ROW (instead_of_upd)
+NOTICE: OLD: (21,31), NEW: (21,32)
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+ a | b
+----+----
+ 21 | 32
+(1 row)
+
+UPDATE 1
+-- Before and after stmt triggers should fire even when no rows are affected
+UPDATE main_view SET b = 0 WHERE false;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+UPDATE 0
+-- Delete from view using trigger
+DELETE FROM main_view WHERE a IN (20,21);
+NOTICE: main_view BEFORE DELETE STATEMENT (before_view_del_stmt)
+NOTICE: main_view INSTEAD OF DELETE ROW (instead_of_del)
+NOTICE: OLD: (21,10)
+NOTICE: main_view INSTEAD OF DELETE ROW (instead_of_del)
+NOTICE: OLD: (20,31)
+NOTICE: main_view INSTEAD OF DELETE ROW (instead_of_del)
+NOTICE: OLD: (21,32)
+NOTICE: main_view AFTER DELETE STATEMENT (after_view_del_stmt)
+DELETE 3
+DELETE FROM main_view WHERE a = 31 RETURNING a, b;
+NOTICE: main_view BEFORE DELETE STATEMENT (before_view_del_stmt)
+NOTICE: main_view INSTEAD OF DELETE ROW (instead_of_del)
+NOTICE: OLD: (31,10)
+NOTICE: main_view AFTER DELETE STATEMENT (after_view_del_stmt)
+ a | b
+----+----
+ 31 | 10
+(1 row)
+
+DELETE 1
+\set QUIET true
+-- Describe view should list triggers
+\d main_view
+ View "public.main_view"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+Triggers:
+ after_del_stmt_trig AFTER DELETE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_del_stmt')
+ after_ins_stmt_trig AFTER INSERT ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_ins_stmt')
+ after_upd_stmt_trig AFTER UPDATE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_upd_stmt')
+ before_del_stmt_trig BEFORE DELETE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_del_stmt')
+ before_ins_stmt_trig BEFORE INSERT ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_ins_stmt')
+ before_upd_stmt_trig BEFORE UPDATE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_upd_stmt')
+ instead_of_delete_trig INSTEAD OF DELETE ON main_view FOR EACH ROW EXECUTE FUNCTION view_trigger('instead_of_del')
+ instead_of_insert_trig INSTEAD OF INSERT ON main_view FOR EACH ROW EXECUTE FUNCTION view_trigger('instead_of_ins')
+ instead_of_update_trig INSTEAD OF UPDATE ON main_view FOR EACH ROW EXECUTE FUNCTION view_trigger('instead_of_upd')
+
+-- Test dropping view triggers
+DROP TRIGGER instead_of_insert_trig ON main_view;
+DROP TRIGGER instead_of_delete_trig ON main_view;
+\d+ main_view
+ View "public.main_view"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+ b | integer | | | | plain |
+View definition:
+ SELECT main_table.a,
+ main_table.b
+ FROM main_table;
+Triggers:
+ after_del_stmt_trig AFTER DELETE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_del_stmt')
+ after_ins_stmt_trig AFTER INSERT ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_ins_stmt')
+ after_upd_stmt_trig AFTER UPDATE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_upd_stmt')
+ before_del_stmt_trig BEFORE DELETE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_del_stmt')
+ before_ins_stmt_trig BEFORE INSERT ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_ins_stmt')
+ before_upd_stmt_trig BEFORE UPDATE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_upd_stmt')
+ instead_of_update_trig INSTEAD OF UPDATE ON main_view FOR EACH ROW EXECUTE FUNCTION view_trigger('instead_of_upd')
+
+DROP VIEW main_view;
+--
+-- Test triggers on a join view
+--
+CREATE TABLE country_table (
+ country_id serial primary key,
+ country_name text unique not null,
+ continent text not null
+);
+INSERT INTO country_table (country_name, continent)
+ VALUES ('Japan', 'Asia'),
+ ('UK', 'Europe'),
+ ('USA', 'North America')
+ RETURNING *;
+ country_id | country_name | continent
+------------+--------------+---------------
+ 1 | Japan | Asia
+ 2 | UK | Europe
+ 3 | USA | North America
+(3 rows)
+
+CREATE TABLE city_table (
+ city_id serial primary key,
+ city_name text not null,
+ population bigint,
+ country_id int references country_table
+);
+CREATE VIEW city_view AS
+ SELECT city_id, city_name, population, country_name, continent
+ FROM city_table ci
+ LEFT JOIN country_table co ON co.country_id = ci.country_id;
+CREATE FUNCTION city_insert() RETURNS trigger LANGUAGE plpgsql AS $$
+declare
+ ctry_id int;
+begin
+ if NEW.country_name IS NOT NULL then
+ SELECT country_id, continent INTO ctry_id, NEW.continent
+ FROM country_table WHERE country_name = NEW.country_name;
+ if NOT FOUND then
+ raise exception 'No such country: "%"', NEW.country_name;
+ end if;
+ else
+ NEW.continent := NULL;
+ end if;
+
+ if NEW.city_id IS NOT NULL then
+ INSERT INTO city_table
+ VALUES(NEW.city_id, NEW.city_name, NEW.population, ctry_id);
+ else
+ INSERT INTO city_table(city_name, population, country_id)
+ VALUES(NEW.city_name, NEW.population, ctry_id)
+ RETURNING city_id INTO NEW.city_id;
+ end if;
+
+ RETURN NEW;
+end;
+$$;
+CREATE TRIGGER city_insert_trig INSTEAD OF INSERT ON city_view
+FOR EACH ROW EXECUTE PROCEDURE city_insert();
+CREATE FUNCTION city_delete() RETURNS trigger LANGUAGE plpgsql AS $$
+begin
+ DELETE FROM city_table WHERE city_id = OLD.city_id;
+ if NOT FOUND then RETURN NULL; end if;
+ RETURN OLD;
+end;
+$$;
+CREATE TRIGGER city_delete_trig INSTEAD OF DELETE ON city_view
+FOR EACH ROW EXECUTE PROCEDURE city_delete();
+CREATE FUNCTION city_update() RETURNS trigger LANGUAGE plpgsql AS $$
+declare
+ ctry_id int;
+begin
+ if NEW.country_name IS DISTINCT FROM OLD.country_name then
+ SELECT country_id, continent INTO ctry_id, NEW.continent
+ FROM country_table WHERE country_name = NEW.country_name;
+ if NOT FOUND then
+ raise exception 'No such country: "%"', NEW.country_name;
+ end if;
+
+ UPDATE city_table SET city_name = NEW.city_name,
+ population = NEW.population,
+ country_id = ctry_id
+ WHERE city_id = OLD.city_id;
+ else
+ UPDATE city_table SET city_name = NEW.city_name,
+ population = NEW.population
+ WHERE city_id = OLD.city_id;
+ NEW.continent := OLD.continent;
+ end if;
+
+ if NOT FOUND then RETURN NULL; end if;
+ RETURN NEW;
+end;
+$$;
+CREATE TRIGGER city_update_trig INSTEAD OF UPDATE ON city_view
+FOR EACH ROW EXECUTE PROCEDURE city_update();
+\set QUIET false
+-- INSERT .. RETURNING
+INSERT INTO city_view(city_name) VALUES('Tokyo') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 1 | Tokyo | | |
+(1 row)
+
+INSERT 0 1
+INSERT INTO city_view(city_name, population) VALUES('London', 7556900) RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 2 | London | 7556900 | |
+(1 row)
+
+INSERT 0 1
+INSERT INTO city_view(city_name, country_name) VALUES('Washington DC', 'USA') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+---------------+------------+--------------+---------------
+ 3 | Washington DC | | USA | North America
+(1 row)
+
+INSERT 0 1
+INSERT INTO city_view(city_id, city_name) VALUES(123456, 'New York') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 123456 | New York | | |
+(1 row)
+
+INSERT 0 1
+INSERT INTO city_view VALUES(234567, 'Birmingham', 1016800, 'UK', 'EU') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+------------+------------+--------------+-----------
+ 234567 | Birmingham | 1016800 | UK | Europe
+(1 row)
+
+INSERT 0 1
+-- UPDATE .. RETURNING
+UPDATE city_view SET country_name = 'Japon' WHERE city_name = 'Tokyo'; -- error
+ERROR: No such country: "Japon"
+CONTEXT: PL/pgSQL function city_update() line 9 at RAISE
+UPDATE city_view SET country_name = 'Japan' WHERE city_name = 'Takyo'; -- no match
+UPDATE 0
+UPDATE city_view SET country_name = 'Japan' WHERE city_name = 'Tokyo' RETURNING *; -- OK
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 1 | Tokyo | | Japan | Asia
+(1 row)
+
+UPDATE 1
+UPDATE city_view SET population = 13010279 WHERE city_name = 'Tokyo' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 1 | Tokyo | 13010279 | Japan | Asia
+(1 row)
+
+UPDATE 1
+UPDATE city_view SET country_name = 'UK' WHERE city_name = 'New York' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 123456 | New York | | UK | Europe
+(1 row)
+
+UPDATE 1
+UPDATE city_view SET country_name = 'USA', population = 8391881 WHERE city_name = 'New York' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+---------------
+ 123456 | New York | 8391881 | USA | North America
+(1 row)
+
+UPDATE 1
+UPDATE city_view SET continent = 'EU' WHERE continent = 'Europe' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+------------+------------+--------------+-----------
+ 234567 | Birmingham | 1016800 | UK | Europe
+(1 row)
+
+UPDATE 1
+UPDATE city_view v1 SET country_name = v2.country_name FROM city_view v2
+ WHERE v2.city_name = 'Birmingham' AND v1.city_name = 'London' RETURNING *;
+ city_id | city_name | population | country_name | continent | city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------+---------+------------+------------+--------------+-----------
+ 2 | London | 7556900 | UK | Europe | 234567 | Birmingham | 1016800 | UK | Europe
+(1 row)
+
+UPDATE 1
+-- DELETE .. RETURNING
+DELETE FROM city_view WHERE city_name = 'Birmingham' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+------------+------------+--------------+-----------
+ 234567 | Birmingham | 1016800 | UK | Europe
+(1 row)
+
+DELETE 1
+\set QUIET true
+-- read-only view with WHERE clause
+CREATE VIEW european_city_view AS
+ SELECT * FROM city_view WHERE continent = 'Europe';
+SELECT count(*) FROM european_city_view;
+ count
+-------
+ 1
+(1 row)
+
+CREATE FUNCTION no_op_trig_fn() RETURNS trigger LANGUAGE plpgsql
+AS 'begin RETURN NULL; end';
+CREATE TRIGGER no_op_trig INSTEAD OF INSERT OR UPDATE OR DELETE
+ON european_city_view FOR EACH ROW EXECUTE PROCEDURE no_op_trig_fn();
+\set QUIET false
+INSERT INTO european_city_view VALUES (0, 'x', 10000, 'y', 'z');
+INSERT 0 0
+UPDATE european_city_view SET population = 10000;
+UPDATE 0
+DELETE FROM european_city_view;
+DELETE 0
+\set QUIET true
+-- rules bypassing no-op triggers
+CREATE RULE european_city_insert_rule AS ON INSERT TO european_city_view
+DO INSTEAD INSERT INTO city_view
+VALUES (NEW.city_id, NEW.city_name, NEW.population, NEW.country_name, NEW.continent)
+RETURNING *;
+CREATE RULE european_city_update_rule AS ON UPDATE TO european_city_view
+DO INSTEAD UPDATE city_view SET
+ city_name = NEW.city_name,
+ population = NEW.population,
+ country_name = NEW.country_name
+WHERE city_id = OLD.city_id
+RETURNING NEW.*;
+CREATE RULE european_city_delete_rule AS ON DELETE TO european_city_view
+DO INSTEAD DELETE FROM city_view WHERE city_id = OLD.city_id RETURNING *;
+\set QUIET false
+-- INSERT not limited by view's WHERE clause, but UPDATE AND DELETE are
+INSERT INTO european_city_view(city_name, country_name)
+ VALUES ('Cambridge', 'USA') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+---------------
+ 4 | Cambridge | | USA | North America
+(1 row)
+
+INSERT 0 1
+UPDATE european_city_view SET country_name = 'UK'
+ WHERE city_name = 'Cambridge';
+UPDATE 0
+DELETE FROM european_city_view WHERE city_name = 'Cambridge';
+DELETE 0
+-- UPDATE and DELETE via rule and trigger
+UPDATE city_view SET country_name = 'UK'
+ WHERE city_name = 'Cambridge' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 4 | Cambridge | | UK | Europe
+(1 row)
+
+UPDATE 1
+UPDATE european_city_view SET population = 122800
+ WHERE city_name = 'Cambridge' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 4 | Cambridge | 122800 | UK | Europe
+(1 row)
+
+UPDATE 1
+DELETE FROM european_city_view WHERE city_name = 'Cambridge' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 4 | Cambridge | 122800 | UK | Europe
+(1 row)
+
+DELETE 1
+-- join UPDATE test
+UPDATE city_view v SET population = 599657
+ FROM city_table ci, country_table co
+ WHERE ci.city_name = 'Washington DC' and co.country_name = 'USA'
+ AND v.city_id = ci.city_id AND v.country_name = co.country_name
+ RETURNING co.country_id, v.country_name,
+ v.city_id, v.city_name, v.population;
+ country_id | country_name | city_id | city_name | population
+------------+--------------+---------+---------------+------------
+ 3 | USA | 3 | Washington DC | 599657
+(1 row)
+
+UPDATE 1
+\set QUIET true
+SELECT * FROM city_view;
+ city_id | city_name | population | country_name | continent
+---------+---------------+------------+--------------+---------------
+ 1 | Tokyo | 13010279 | Japan | Asia
+ 123456 | New York | 8391881 | USA | North America
+ 2 | London | 7556900 | UK | Europe
+ 3 | Washington DC | 599657 | USA | North America
+(4 rows)
+
+DROP TABLE city_table CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view city_view
+drop cascades to view european_city_view
+DROP TABLE country_table;
+-- Test pg_trigger_depth()
+create table depth_a (id int not null primary key);
+create table depth_b (id int not null primary key);
+create table depth_c (id int not null primary key);
+create function depth_a_tf() returns trigger
+ language plpgsql as $$
+begin
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ insert into depth_b values (new.id);
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ return new;
+end;
+$$;
+create trigger depth_a_tr before insert on depth_a
+ for each row execute procedure depth_a_tf();
+create function depth_b_tf() returns trigger
+ language plpgsql as $$
+begin
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ begin
+ execute 'insert into depth_c values (' || new.id::text || ')';
+ exception
+ when sqlstate 'U9999' then
+ raise notice 'SQLSTATE = U9999: depth = %', pg_trigger_depth();
+ end;
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ if new.id = 1 then
+ execute 'insert into depth_c values (' || new.id::text || ')';
+ end if;
+ return new;
+end;
+$$;
+create trigger depth_b_tr before insert on depth_b
+ for each row execute procedure depth_b_tf();
+create function depth_c_tf() returns trigger
+ language plpgsql as $$
+begin
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ if new.id = 1 then
+ raise exception sqlstate 'U9999';
+ end if;
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ return new;
+end;
+$$;
+create trigger depth_c_tr before insert on depth_c
+ for each row execute procedure depth_c_tf();
+select pg_trigger_depth();
+ pg_trigger_depth
+------------------
+ 0
+(1 row)
+
+insert into depth_a values (1);
+NOTICE: depth_a_tr: depth = 1
+NOTICE: depth_b_tr: depth = 2
+NOTICE: depth_c_tr: depth = 3
+NOTICE: SQLSTATE = U9999: depth = 2
+NOTICE: depth_b_tr: depth = 2
+NOTICE: depth_c_tr: depth = 3
+ERROR: U9999
+CONTEXT: PL/pgSQL function depth_c_tf() line 5 at RAISE
+SQL statement "insert into depth_c values (1)"
+PL/pgSQL function depth_b_tf() line 12 at EXECUTE
+SQL statement "insert into depth_b values (new.id)"
+PL/pgSQL function depth_a_tf() line 4 at SQL statement
+select pg_trigger_depth();
+ pg_trigger_depth
+------------------
+ 0
+(1 row)
+
+insert into depth_a values (2);
+NOTICE: depth_a_tr: depth = 1
+NOTICE: depth_b_tr: depth = 2
+NOTICE: depth_c_tr: depth = 3
+NOTICE: depth_c_tr: depth = 3
+NOTICE: depth_b_tr: depth = 2
+NOTICE: depth_a_tr: depth = 1
+select pg_trigger_depth();
+ pg_trigger_depth
+------------------
+ 0
+(1 row)
+
+drop table depth_a, depth_b, depth_c;
+drop function depth_a_tf();
+drop function depth_b_tf();
+drop function depth_c_tf();
+--
+-- Test updates to rows during firing of BEFORE ROW triggers.
+-- As of 9.2, such cases should be rejected (see bug #6123).
+--
+create temp table parent (
+ aid int not null primary key,
+ val1 text,
+ val2 text,
+ val3 text,
+ val4 text,
+ bcnt int not null default 0);
+create temp table child (
+ bid int not null primary key,
+ aid int not null,
+ val1 text);
+create function parent_upd_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if old.val1 <> new.val1 then
+ new.val2 = new.val1;
+ delete from child where child.aid = new.aid and child.val1 = new.val1;
+ end if;
+ return new;
+end;
+$$;
+create trigger parent_upd_trig before update on parent
+ for each row execute procedure parent_upd_func();
+create function parent_del_func()
+ returns trigger language plpgsql as
+$$
+begin
+ delete from child where aid = old.aid;
+ return old;
+end;
+$$;
+create trigger parent_del_trig before delete on parent
+ for each row execute procedure parent_del_func();
+create function child_ins_func()
+ returns trigger language plpgsql as
+$$
+begin
+ update parent set bcnt = bcnt + 1 where aid = new.aid;
+ return new;
+end;
+$$;
+create trigger child_ins_trig after insert on child
+ for each row execute procedure child_ins_func();
+create function child_del_func()
+ returns trigger language plpgsql as
+$$
+begin
+ update parent set bcnt = bcnt - 1 where aid = old.aid;
+ return old;
+end;
+$$;
+create trigger child_del_trig after delete on child
+ for each row execute procedure child_del_func();
+insert into parent values (1, 'a', 'a', 'a', 'a', 0);
+insert into child values (10, 1, 'b');
+select * from parent; select * from child;
+ aid | val1 | val2 | val3 | val4 | bcnt
+-----+------+------+------+------+------
+ 1 | a | a | a | a | 1
+(1 row)
+
+ bid | aid | val1
+-----+-----+------
+ 10 | 1 | b
+(1 row)
+
+update parent set val1 = 'b' where aid = 1; -- should fail
+ERROR: tuple to be updated was already modified by an operation triggered by the current command
+HINT: Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.
+select * from parent; select * from child;
+ aid | val1 | val2 | val3 | val4 | bcnt
+-----+------+------+------+------+------
+ 1 | a | a | a | a | 1
+(1 row)
+
+ bid | aid | val1
+-----+-----+------
+ 10 | 1 | b
+(1 row)
+
+delete from parent where aid = 1; -- should fail
+ERROR: tuple to be deleted was already modified by an operation triggered by the current command
+HINT: Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.
+select * from parent; select * from child;
+ aid | val1 | val2 | val3 | val4 | bcnt
+-----+------+------+------+------+------
+ 1 | a | a | a | a | 1
+(1 row)
+
+ bid | aid | val1
+-----+-----+------
+ 10 | 1 | b
+(1 row)
+
+-- replace the trigger function with one that restarts the deletion after
+-- having modified a child
+create or replace function parent_del_func()
+ returns trigger language plpgsql as
+$$
+begin
+ delete from child where aid = old.aid;
+ if found then
+ delete from parent where aid = old.aid;
+ return null; -- cancel outer deletion
+ end if;
+ return old;
+end;
+$$;
+delete from parent where aid = 1;
+select * from parent; select * from child;
+ aid | val1 | val2 | val3 | val4 | bcnt
+-----+------+------+------+------+------
+(0 rows)
+
+ bid | aid | val1
+-----+-----+------
+(0 rows)
+
+drop table parent, child;
+drop function parent_upd_func();
+drop function parent_del_func();
+drop function child_ins_func();
+drop function child_del_func();
+-- similar case, but with a self-referencing FK so that parent and child
+-- rows can be affected by a single operation
+create temp table self_ref_trigger (
+ id int primary key,
+ parent int references self_ref_trigger,
+ data text,
+ nchildren int not null default 0
+);
+create function self_ref_trigger_ins_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if new.parent is not null then
+ update self_ref_trigger set nchildren = nchildren + 1
+ where id = new.parent;
+ end if;
+ return new;
+end;
+$$;
+create trigger self_ref_trigger_ins_trig before insert on self_ref_trigger
+ for each row execute procedure self_ref_trigger_ins_func();
+create function self_ref_trigger_del_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if old.parent is not null then
+ update self_ref_trigger set nchildren = nchildren - 1
+ where id = old.parent;
+ end if;
+ return old;
+end;
+$$;
+create trigger self_ref_trigger_del_trig before delete on self_ref_trigger
+ for each row execute procedure self_ref_trigger_del_func();
+insert into self_ref_trigger values (1, null, 'root');
+insert into self_ref_trigger values (2, 1, 'root child A');
+insert into self_ref_trigger values (3, 1, 'root child B');
+insert into self_ref_trigger values (4, 2, 'grandchild 1');
+insert into self_ref_trigger values (5, 3, 'grandchild 2');
+update self_ref_trigger set data = 'root!' where id = 1;
+select * from self_ref_trigger;
+ id | parent | data | nchildren
+----+--------+--------------+-----------
+ 2 | 1 | root child A | 1
+ 4 | 2 | grandchild 1 | 0
+ 3 | 1 | root child B | 1
+ 5 | 3 | grandchild 2 | 0
+ 1 | | root! | 2
+(5 rows)
+
+delete from self_ref_trigger;
+ERROR: tuple to be deleted was already modified by an operation triggered by the current command
+HINT: Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.
+select * from self_ref_trigger;
+ id | parent | data | nchildren
+----+--------+--------------+-----------
+ 2 | 1 | root child A | 1
+ 4 | 2 | grandchild 1 | 0
+ 3 | 1 | root child B | 1
+ 5 | 3 | grandchild 2 | 0
+ 1 | | root! | 2
+(5 rows)
+
+drop table self_ref_trigger;
+drop function self_ref_trigger_ins_func();
+drop function self_ref_trigger_del_func();
+--
+-- Check that statement triggers work correctly even with all children excluded
+--
+create table stmt_trig_on_empty_upd (a int);
+create table stmt_trig_on_empty_upd1 () inherits (stmt_trig_on_empty_upd);
+create function update_stmt_notice() returns trigger as $$
+begin
+ raise notice 'updating %', TG_TABLE_NAME;
+ return null;
+end;
+$$ language plpgsql;
+create trigger before_stmt_trigger
+ before update on stmt_trig_on_empty_upd
+ execute procedure update_stmt_notice();
+create trigger before_stmt_trigger
+ before update on stmt_trig_on_empty_upd1
+ execute procedure update_stmt_notice();
+-- inherited no-op update
+update stmt_trig_on_empty_upd set a = a where false returning a+1 as aa;
+NOTICE: updating stmt_trig_on_empty_upd
+ aa
+----
+(0 rows)
+
+-- simple no-op update
+update stmt_trig_on_empty_upd1 set a = a where false returning a+1 as aa;
+NOTICE: updating stmt_trig_on_empty_upd1
+ aa
+----
+(0 rows)
+
+drop table stmt_trig_on_empty_upd cascade;
+NOTICE: drop cascades to table stmt_trig_on_empty_upd1
+drop function update_stmt_notice();
+--
+-- Check that index creation (or DDL in general) is prohibited in a trigger
+--
+create table trigger_ddl_table (
+ col1 integer,
+ col2 integer
+);
+create function trigger_ddl_func() returns trigger as $$
+begin
+ alter table trigger_ddl_table add primary key (col1);
+ return new;
+end$$ language plpgsql;
+create trigger trigger_ddl_func before insert on trigger_ddl_table for each row
+ execute procedure trigger_ddl_func();
+insert into trigger_ddl_table values (1, 42); -- fail
+ERROR: cannot ALTER TABLE "trigger_ddl_table" because it is being used by active queries in this session
+CONTEXT: SQL statement "alter table trigger_ddl_table add primary key (col1)"
+PL/pgSQL function trigger_ddl_func() line 3 at SQL statement
+create or replace function trigger_ddl_func() returns trigger as $$
+begin
+ create index on trigger_ddl_table (col2);
+ return new;
+end$$ language plpgsql;
+insert into trigger_ddl_table values (1, 42); -- fail
+ERROR: cannot CREATE INDEX "trigger_ddl_table" because it is being used by active queries in this session
+CONTEXT: SQL statement "create index on trigger_ddl_table (col2)"
+PL/pgSQL function trigger_ddl_func() line 3 at SQL statement
+drop table trigger_ddl_table;
+drop function trigger_ddl_func();
+--
+-- Verify behavior of before and after triggers with INSERT...ON CONFLICT
+-- DO UPDATE
+--
+create table upsert (key int4 primary key, color text);
+create function upsert_before_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if (TG_OP = 'UPDATE') then
+ raise warning 'before update (old): %', old.*::text;
+ raise warning 'before update (new): %', new.*::text;
+ elsif (TG_OP = 'INSERT') then
+ raise warning 'before insert (new): %', new.*::text;
+ if new.key % 2 = 0 then
+ new.key := new.key + 1;
+ new.color := new.color || ' trig modified';
+ raise warning 'before insert (new, modified): %', new.*::text;
+ end if;
+ end if;
+ return new;
+end;
+$$;
+create trigger upsert_before_trig before insert or update on upsert
+ for each row execute procedure upsert_before_func();
+create function upsert_after_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if (TG_OP = 'UPDATE') then
+ raise warning 'after update (old): %', old.*::text;
+ raise warning 'after update (new): %', new.*::text;
+ elsif (TG_OP = 'INSERT') then
+ raise warning 'after insert (new): %', new.*::text;
+ end if;
+ return null;
+end;
+$$;
+create trigger upsert_after_trig after insert or update on upsert
+ for each row execute procedure upsert_after_func();
+insert into upsert values(1, 'black') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (1,black)
+WARNING: after insert (new): (1,black)
+insert into upsert values(2, 'red') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (2,red)
+WARNING: before insert (new, modified): (3,"red trig modified")
+WARNING: after insert (new): (3,"red trig modified")
+insert into upsert values(3, 'orange') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (3,orange)
+WARNING: before update (old): (3,"red trig modified")
+WARNING: before update (new): (3,"updated red trig modified")
+WARNING: after update (old): (3,"red trig modified")
+WARNING: after update (new): (3,"updated red trig modified")
+insert into upsert values(4, 'green') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (4,green)
+WARNING: before insert (new, modified): (5,"green trig modified")
+WARNING: after insert (new): (5,"green trig modified")
+insert into upsert values(5, 'purple') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (5,purple)
+WARNING: before update (old): (5,"green trig modified")
+WARNING: before update (new): (5,"updated green trig modified")
+WARNING: after update (old): (5,"green trig modified")
+WARNING: after update (new): (5,"updated green trig modified")
+insert into upsert values(6, 'white') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (6,white)
+WARNING: before insert (new, modified): (7,"white trig modified")
+WARNING: after insert (new): (7,"white trig modified")
+insert into upsert values(7, 'pink') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (7,pink)
+WARNING: before update (old): (7,"white trig modified")
+WARNING: before update (new): (7,"updated white trig modified")
+WARNING: after update (old): (7,"white trig modified")
+WARNING: after update (new): (7,"updated white trig modified")
+insert into upsert values(8, 'yellow') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (8,yellow)
+WARNING: before insert (new, modified): (9,"yellow trig modified")
+WARNING: after insert (new): (9,"yellow trig modified")
+select * from upsert;
+ key | color
+-----+-----------------------------
+ 1 | black
+ 3 | updated red trig modified
+ 5 | updated green trig modified
+ 7 | updated white trig modified
+ 9 | yellow trig modified
+(5 rows)
+
+drop table upsert;
+drop function upsert_before_func();
+drop function upsert_after_func();
+--
+-- Verify that triggers with transition tables are not allowed on
+-- views
+--
+create table my_table (i int);
+create view my_view as select * from my_table;
+create function my_trigger_function() returns trigger as $$ begin end; $$ language plpgsql;
+create trigger my_trigger after update on my_view referencing old table as old_table
+ for each statement execute procedure my_trigger_function();
+ERROR: "my_view" is a view
+DETAIL: Triggers on views cannot have transition tables.
+drop function my_trigger_function();
+drop view my_view;
+drop table my_table;
+--
+-- Verify cases that are unsupported with partitioned tables
+--
+create table parted_trig (a int) partition by list (a);
+create function trigger_nothing() returns trigger
+ language plpgsql as $$ begin end; $$;
+create trigger failed before insert or update or delete on parted_trig
+ for each row execute procedure trigger_nothing();
+ERROR: "parted_trig" is a partitioned table
+DETAIL: Partitioned tables cannot have BEFORE / FOR EACH ROW triggers.
+create trigger failed instead of update on parted_trig
+ for each row execute procedure trigger_nothing();
+ERROR: "parted_trig" is a table
+DETAIL: Tables cannot have INSTEAD OF triggers.
+create trigger failed after update on parted_trig
+ referencing old table as old_table
+ for each row execute procedure trigger_nothing();
+ERROR: "parted_trig" is a partitioned table
+DETAIL: Triggers on partitioned tables cannot have transition tables.
+drop table parted_trig;
+--
+-- Verify trigger creation for partitioned tables, and drop behavior
+--
+create table trigpart (a int, b int) partition by range (a);
+create table trigpart1 partition of trigpart for values from (0) to (1000);
+create trigger trg1 after insert on trigpart for each row execute procedure trigger_nothing();
+create table trigpart2 partition of trigpart for values from (1000) to (2000);
+create table trigpart3 (like trigpart);
+alter table trigpart attach partition trigpart3 for values from (2000) to (3000);
+select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
+ where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
+ tgrelid | tgname | tgfoid
+-----------+--------+-----------------
+ trigpart | trg1 | trigger_nothing
+ trigpart1 | trg1 | trigger_nothing
+ trigpart2 | trg1 | trigger_nothing
+ trigpart3 | trg1 | trigger_nothing
+(4 rows)
+
+drop trigger trg1 on trigpart1; -- fail
+ERROR: cannot drop trigger trg1 on table trigpart1 because trigger trg1 on table trigpart requires it
+HINT: You can drop trigger trg1 on table trigpart instead.
+drop trigger trg1 on trigpart2; -- fail
+ERROR: cannot drop trigger trg1 on table trigpart2 because trigger trg1 on table trigpart requires it
+HINT: You can drop trigger trg1 on table trigpart instead.
+drop trigger trg1 on trigpart3; -- fail
+ERROR: cannot drop trigger trg1 on table trigpart3 because trigger trg1 on table trigpart requires it
+HINT: You can drop trigger trg1 on table trigpart instead.
+drop table trigpart2; -- ok, trigger should be gone in that partition
+select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
+ where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
+ tgrelid | tgname | tgfoid
+-----------+--------+-----------------
+ trigpart | trg1 | trigger_nothing
+ trigpart1 | trg1 | trigger_nothing
+ trigpart3 | trg1 | trigger_nothing
+(3 rows)
+
+drop trigger trg1 on trigpart; -- ok, all gone
+select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
+ where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
+ tgrelid | tgname | tgfoid
+---------+--------+--------
+(0 rows)
+
+drop table trigpart;
+drop function trigger_nothing();
+--
+-- Verify that triggers are fired for partitioned tables
+--
+create table parted_stmt_trig (a int) partition by list (a);
+create table parted_stmt_trig1 partition of parted_stmt_trig for values in (1);
+create table parted_stmt_trig2 partition of parted_stmt_trig for values in (2);
+create table parted2_stmt_trig (a int) partition by list (a);
+create table parted2_stmt_trig1 partition of parted2_stmt_trig for values in (1);
+create table parted2_stmt_trig2 partition of parted2_stmt_trig for values in (2);
+create or replace function trigger_notice() returns trigger as $$
+ begin
+ raise notice 'trigger % on % % % for %', TG_NAME, TG_TABLE_NAME, TG_WHEN, TG_OP, TG_LEVEL;
+ if TG_LEVEL = 'ROW' then
+ return NEW;
+ end if;
+ return null;
+ end;
+ $$ language plpgsql;
+-- insert/update/delete statement-level triggers on the parent
+create trigger trig_ins_before before insert on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_ins_after after insert on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_upd_before before update on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_upd_after after update on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_del_before before delete on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_del_after after delete on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+-- insert/update/delete row-level triggers on the parent
+create trigger trig_ins_after_parent after insert on parted_stmt_trig
+ for each row execute procedure trigger_notice();
+create trigger trig_upd_after_parent after update on parted_stmt_trig
+ for each row execute procedure trigger_notice();
+create trigger trig_del_after_parent after delete on parted_stmt_trig
+ for each row execute procedure trigger_notice();
+-- insert/update/delete row-level triggers on the first partition
+create trigger trig_ins_before_child before insert on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_ins_after_child after insert on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_upd_before_child before update on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_upd_after_child after update on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_del_before_child before delete on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_del_after_child after delete on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+-- insert/update/delete statement-level triggers on the parent
+create trigger trig_ins_before_3 before insert on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_ins_after_3 after insert on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_upd_before_3 before update on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_upd_after_3 after update on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_del_before_3 before delete on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_del_after_3 after delete on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+with ins (a) as (
+ insert into parted2_stmt_trig values (1), (2) returning a
+) insert into parted_stmt_trig select a from ins returning tableoid::regclass, a;
+NOTICE: trigger trig_ins_before on parted_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_3 on parted2_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig2 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_3 on parted2_stmt_trig AFTER INSERT for STATEMENT
+NOTICE: trigger trig_ins_after on parted_stmt_trig AFTER INSERT for STATEMENT
+ tableoid | a
+-------------------+---
+ parted_stmt_trig1 | 1
+ parted_stmt_trig2 | 2
+(2 rows)
+
+with upd as (
+ update parted2_stmt_trig set a = a
+) update parted_stmt_trig set a = a;
+NOTICE: trigger trig_upd_before on parted_stmt_trig BEFORE UPDATE for STATEMENT
+NOTICE: trigger trig_upd_before_child on parted_stmt_trig1 BEFORE UPDATE for ROW
+NOTICE: trigger trig_upd_before_3 on parted2_stmt_trig BEFORE UPDATE for STATEMENT
+NOTICE: trigger trig_upd_after_child on parted_stmt_trig1 AFTER UPDATE for ROW
+NOTICE: trigger trig_upd_after_parent on parted_stmt_trig1 AFTER UPDATE for ROW
+NOTICE: trigger trig_upd_after_parent on parted_stmt_trig2 AFTER UPDATE for ROW
+NOTICE: trigger trig_upd_after on parted_stmt_trig AFTER UPDATE for STATEMENT
+NOTICE: trigger trig_upd_after_3 on parted2_stmt_trig AFTER UPDATE for STATEMENT
+delete from parted_stmt_trig;
+NOTICE: trigger trig_del_before on parted_stmt_trig BEFORE DELETE for STATEMENT
+NOTICE: trigger trig_del_before_child on parted_stmt_trig1 BEFORE DELETE for ROW
+NOTICE: trigger trig_del_after_parent on parted_stmt_trig2 AFTER DELETE for ROW
+NOTICE: trigger trig_del_after on parted_stmt_trig AFTER DELETE for STATEMENT
+-- insert via copy on the parent
+copy parted_stmt_trig(a) from stdin;
+NOTICE: trigger trig_ins_before on parted_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig2 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after on parted_stmt_trig AFTER INSERT for STATEMENT
+-- insert via copy on the first partition
+copy parted_stmt_trig1(a) from stdin;
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig1 AFTER INSERT for ROW
+-- Disabling a trigger in the parent table should disable children triggers too
+alter table parted_stmt_trig disable trigger trig_ins_after_parent;
+insert into parted_stmt_trig values (1);
+NOTICE: trigger trig_ins_before on parted_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after on parted_stmt_trig AFTER INSERT for STATEMENT
+alter table parted_stmt_trig enable trigger trig_ins_after_parent;
+insert into parted_stmt_trig values (1);
+NOTICE: trigger trig_ins_before on parted_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after on parted_stmt_trig AFTER INSERT for STATEMENT
+drop table parted_stmt_trig, parted2_stmt_trig;
+-- Verify that triggers fire in alphabetical order
+create table parted_trig (a int) partition by range (a);
+create table parted_trig_1 partition of parted_trig for values from (0) to (1000)
+ partition by range (a);
+create table parted_trig_1_1 partition of parted_trig_1 for values from (0) to (100);
+create table parted_trig_2 partition of parted_trig for values from (1000) to (2000);
+create trigger zzz after insert on parted_trig for each row execute procedure trigger_notice();
+create trigger mmm after insert on parted_trig_1_1 for each row execute procedure trigger_notice();
+create trigger aaa after insert on parted_trig_1 for each row execute procedure trigger_notice();
+create trigger bbb after insert on parted_trig for each row execute procedure trigger_notice();
+create trigger qqq after insert on parted_trig_1_1 for each row execute procedure trigger_notice();
+insert into parted_trig values (50), (1500);
+NOTICE: trigger aaa on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger bbb on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger mmm on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger qqq on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger zzz on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger bbb on parted_trig_2 AFTER INSERT for ROW
+NOTICE: trigger zzz on parted_trig_2 AFTER INSERT for ROW
+drop table parted_trig;
+-- Verify propagation of trigger arguments to partitions
+create table parted_trig (a int) partition by list (a);
+create table parted_trig1 partition of parted_trig for values in (1);
+create or replace function trigger_notice() returns trigger as $$
+ declare
+ arg1 text = TG_ARGV[0];
+ arg2 integer = TG_ARGV[1];
+ begin
+ raise notice 'trigger % on % % % for % args % %',
+ TG_NAME, TG_TABLE_NAME, TG_WHEN, TG_OP, TG_LEVEL, arg1, arg2;
+ return null;
+ end;
+ $$ language plpgsql;
+create trigger aaa after insert on parted_trig
+ for each row execute procedure trigger_notice('quirky', 1);
+-- Verify propagation of trigger arguments to partitions attached after creating trigger
+create table parted_trig2 partition of parted_trig for values in (2);
+create table parted_trig3 (like parted_trig);
+alter table parted_trig attach partition parted_trig3 for values in (3);
+insert into parted_trig values (1), (2), (3);
+NOTICE: trigger aaa on parted_trig1 AFTER INSERT for ROW args quirky 1
+NOTICE: trigger aaa on parted_trig2 AFTER INSERT for ROW args quirky 1
+NOTICE: trigger aaa on parted_trig3 AFTER INSERT for ROW args quirky 1
+drop table parted_trig;
+-- test irregular partitions (i.e., different column definitions),
+-- including that the WHEN clause works
+create function bark(text) returns bool language plpgsql immutable
+ as $$ begin raise notice '% <- woof!', $1; return true; end; $$;
+create or replace function trigger_notice_ab() returns trigger as $$
+ begin
+ raise notice 'trigger % on % % % for %: (a,b)=(%,%)',
+ TG_NAME, TG_TABLE_NAME, TG_WHEN, TG_OP, TG_LEVEL,
+ NEW.a, NEW.b;
+ if TG_LEVEL = 'ROW' then
+ return NEW;
+ end if;
+ return null;
+ end;
+ $$ language plpgsql;
+create table parted_irreg_ancestor (fd text, b text, fd2 int, fd3 int, a int)
+ partition by range (b);
+alter table parted_irreg_ancestor drop column fd,
+ drop column fd2, drop column fd3;
+create table parted_irreg (fd int, a int, fd2 int, b text)
+ partition by range (b);
+alter table parted_irreg drop column fd, drop column fd2;
+alter table parted_irreg_ancestor attach partition parted_irreg
+ for values from ('aaaa') to ('zzzz');
+create table parted1_irreg (b text, fd int, a int);
+alter table parted1_irreg drop column fd;
+alter table parted_irreg attach partition parted1_irreg
+ for values from ('aaaa') to ('bbbb');
+create trigger parted_trig after insert on parted_irreg
+ for each row execute procedure trigger_notice_ab();
+create trigger parted_trig_odd after insert on parted_irreg for each row
+ when (bark(new.b) AND new.a % 2 = 1) execute procedure trigger_notice_ab();
+-- we should hear barking for every insert, but parted_trig_odd only emits
+-- noise for odd values of a. parted_trig does it for all inserts.
+insert into parted_irreg values (1, 'aardvark'), (2, 'aanimals');
+NOTICE: aardvark <- woof!
+NOTICE: aanimals <- woof!
+NOTICE: trigger parted_trig on parted1_irreg AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig_odd on parted1_irreg AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig on parted1_irreg AFTER INSERT for ROW: (a,b)=(2,aanimals)
+insert into parted1_irreg values ('aardwolf', 2);
+NOTICE: aardwolf <- woof!
+NOTICE: trigger parted_trig on parted1_irreg AFTER INSERT for ROW: (a,b)=(2,aardwolf)
+insert into parted_irreg_ancestor values ('aasvogel', 3);
+NOTICE: aasvogel <- woof!
+NOTICE: trigger parted_trig on parted1_irreg AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+NOTICE: trigger parted_trig_odd on parted1_irreg AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+drop table parted_irreg_ancestor;
+--
+-- Constraint triggers and partitioned tables
+create table parted_constr_ancestor (a int, b text)
+ partition by range (b);
+create table parted_constr (a int, b text)
+ partition by range (b);
+alter table parted_constr_ancestor attach partition parted_constr
+ for values from ('aaaa') to ('zzzz');
+create table parted1_constr (a int, b text);
+alter table parted_constr attach partition parted1_constr
+ for values from ('aaaa') to ('bbbb');
+create constraint trigger parted_trig after insert on parted_constr_ancestor
+ deferrable
+ for each row execute procedure trigger_notice_ab();
+create constraint trigger parted_trig_two after insert on parted_constr
+ deferrable initially deferred
+ for each row when (bark(new.b) AND new.a % 2 = 1)
+ execute procedure trigger_notice_ab();
+-- The immediate constraint is fired immediately; the WHEN clause of the
+-- deferred constraint is also called immediately. The deferred constraint
+-- is fired at commit time.
+begin;
+insert into parted_constr values (1, 'aardvark');
+NOTICE: aardvark <- woof!
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(1,aardvark)
+insert into parted1_constr values (2, 'aardwolf');
+NOTICE: aardwolf <- woof!
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(2,aardwolf)
+insert into parted_constr_ancestor values (3, 'aasvogel');
+NOTICE: aasvogel <- woof!
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+commit;
+NOTICE: trigger parted_trig_two on parted1_constr AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig_two on parted1_constr AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+-- The WHEN clause is immediate, and both constraint triggers are fired at
+-- commit time.
+begin;
+set constraints parted_trig deferred;
+insert into parted_constr values (1, 'aardvark');
+NOTICE: aardvark <- woof!
+insert into parted1_constr values (2, 'aardwolf'), (3, 'aasvogel');
+NOTICE: aardwolf <- woof!
+NOTICE: aasvogel <- woof!
+commit;
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig_two on parted1_constr AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(2,aardwolf)
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+NOTICE: trigger parted_trig_two on parted1_constr AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+drop table parted_constr_ancestor;
+drop function bark(text);
+-- Test that the WHEN clause is set properly to partitions
+create table parted_trigger (a int, b text) partition by range (a);
+create table parted_trigger_1 partition of parted_trigger for values from (0) to (1000);
+create table parted_trigger_2 (drp int, a int, b text);
+alter table parted_trigger_2 drop column drp;
+alter table parted_trigger attach partition parted_trigger_2 for values from (1000) to (2000);
+create trigger parted_trigger after update on parted_trigger
+ for each row when (new.a % 2 = 1 and length(old.b) >= 2) execute procedure trigger_notice_ab();
+create table parted_trigger_3 (b text, a int) partition by range (length(b));
+create table parted_trigger_3_1 partition of parted_trigger_3 for values from (1) to (3);
+create table parted_trigger_3_2 partition of parted_trigger_3 for values from (3) to (5);
+alter table parted_trigger attach partition parted_trigger_3 for values from (2000) to (3000);
+insert into parted_trigger values
+ (0, 'a'), (1, 'bbb'), (2, 'bcd'), (3, 'c'),
+ (1000, 'c'), (1001, 'ddd'), (1002, 'efg'), (1003, 'f'),
+ (2000, 'e'), (2001, 'fff'), (2002, 'ghi'), (2003, 'h');
+update parted_trigger set a = a + 2; -- notice for odd 'a' values, long 'b' values
+NOTICE: trigger parted_trigger on parted_trigger_1 AFTER UPDATE for ROW: (a,b)=(3,bbb)
+NOTICE: trigger parted_trigger on parted_trigger_2 AFTER UPDATE for ROW: (a,b)=(1003,ddd)
+NOTICE: trigger parted_trigger on parted_trigger_3_2 AFTER UPDATE for ROW: (a,b)=(2003,fff)
+drop table parted_trigger;
+-- try a constraint trigger, also
+create table parted_referenced (a int);
+create table unparted_trigger (a int, b text); -- for comparison purposes
+create table parted_trigger (a int, b text) partition by range (a);
+create table parted_trigger_1 partition of parted_trigger for values from (0) to (1000);
+create table parted_trigger_2 (drp int, a int, b text);
+alter table parted_trigger_2 drop column drp;
+alter table parted_trigger attach partition parted_trigger_2 for values from (1000) to (2000);
+create constraint trigger parted_trigger after update on parted_trigger
+ from parted_referenced
+ for each row execute procedure trigger_notice_ab();
+create constraint trigger parted_trigger after update on unparted_trigger
+ from parted_referenced
+ for each row execute procedure trigger_notice_ab();
+create table parted_trigger_3 (b text, a int) partition by range (length(b));
+create table parted_trigger_3_1 partition of parted_trigger_3 for values from (1) to (3);
+create table parted_trigger_3_2 partition of parted_trigger_3 for values from (3) to (5);
+alter table parted_trigger attach partition parted_trigger_3 for values from (2000) to (3000);
+select tgname, conname, t.tgrelid::regclass, t.tgconstrrelid::regclass,
+ c.conrelid::regclass, c.confrelid::regclass
+ from pg_trigger t join pg_constraint c on (t.tgconstraint = c.oid)
+ where tgname = 'parted_trigger'
+ order by t.tgrelid::regclass::text;
+ tgname | conname | tgrelid | tgconstrrelid | conrelid | confrelid
+----------------+----------------+--------------------+-------------------+--------------------+-----------
+ parted_trigger | parted_trigger | parted_trigger | parted_referenced | parted_trigger | -
+ parted_trigger | parted_trigger | parted_trigger_1 | parted_referenced | parted_trigger_1 | -
+ parted_trigger | parted_trigger | parted_trigger_2 | parted_referenced | parted_trigger_2 | -
+ parted_trigger | parted_trigger | parted_trigger_3 | parted_referenced | parted_trigger_3 | -
+ parted_trigger | parted_trigger | parted_trigger_3_1 | parted_referenced | parted_trigger_3_1 | -
+ parted_trigger | parted_trigger | parted_trigger_3_2 | parted_referenced | parted_trigger_3_2 | -
+ parted_trigger | parted_trigger | unparted_trigger | parted_referenced | unparted_trigger | -
+(7 rows)
+
+drop table parted_referenced, parted_trigger, unparted_trigger;
+-- verify that the "AFTER UPDATE OF columns" event is propagated correctly
+create table parted_trigger (a int, b text) partition by range (a);
+create table parted_trigger_1 partition of parted_trigger for values from (0) to (1000);
+create table parted_trigger_2 (drp int, a int, b text);
+alter table parted_trigger_2 drop column drp;
+alter table parted_trigger attach partition parted_trigger_2 for values from (1000) to (2000);
+create trigger parted_trigger after update of b on parted_trigger
+ for each row execute procedure trigger_notice_ab();
+create table parted_trigger_3 (b text, a int) partition by range (length(b));
+create table parted_trigger_3_1 partition of parted_trigger_3 for values from (1) to (4);
+create table parted_trigger_3_2 partition of parted_trigger_3 for values from (4) to (8);
+alter table parted_trigger attach partition parted_trigger_3 for values from (2000) to (3000);
+insert into parted_trigger values (0, 'a'), (1000, 'c'), (2000, 'e'), (2001, 'eeee');
+update parted_trigger set a = a + 2; -- no notices here
+update parted_trigger set b = b || 'b'; -- all triggers should fire
+NOTICE: trigger parted_trigger on parted_trigger_1 AFTER UPDATE for ROW: (a,b)=(2,ab)
+NOTICE: trigger parted_trigger on parted_trigger_2 AFTER UPDATE for ROW: (a,b)=(1002,cb)
+NOTICE: trigger parted_trigger on parted_trigger_3_1 AFTER UPDATE for ROW: (a,b)=(2002,eb)
+NOTICE: trigger parted_trigger on parted_trigger_3_2 AFTER UPDATE for ROW: (a,b)=(2003,eeeeb)
+drop table parted_trigger;
+drop function trigger_notice_ab();
+-- Make sure we don't end up with unnecessary copies of triggers, when
+-- cloning them.
+create table trg_clone (a int) partition by range (a);
+create table trg_clone1 partition of trg_clone for values from (0) to (1000);
+alter table trg_clone add constraint uniq unique (a) deferrable;
+create table trg_clone2 partition of trg_clone for values from (1000) to (2000);
+create table trg_clone3 partition of trg_clone for values from (2000) to (3000)
+ partition by range (a);
+create table trg_clone_3_3 partition of trg_clone3 for values from (2000) to (2100);
+select tgrelid::regclass, count(*) from pg_trigger
+ where tgrelid::regclass in ('trg_clone', 'trg_clone1', 'trg_clone2',
+ 'trg_clone3', 'trg_clone_3_3')
+ group by tgrelid::regclass order by tgrelid::regclass;
+ tgrelid | count
+---------------+-------
+ trg_clone | 1
+ trg_clone1 | 1
+ trg_clone2 | 1
+ trg_clone3 | 1
+ trg_clone_3_3 | 1
+(5 rows)
+
+drop table trg_clone;
+--
+-- Test the interaction between transition tables and both kinds of
+-- inheritance. We'll dump the contents of the transition tables in a
+-- format that shows the attribute order, so that we can distinguish
+-- tuple formats (though not dropped attributes).
+--
+create or replace function dump_insert() returns trigger language plpgsql as
+$$
+ begin
+ raise notice 'trigger = %, new table = %',
+ TG_NAME,
+ (select string_agg(new_table::text, ', ' order by a) from new_table);
+ return null;
+ end;
+$$;
+create or replace function dump_update() returns trigger language plpgsql as
+$$
+ begin
+ raise notice 'trigger = %, old table = %, new table = %',
+ TG_NAME,
+ (select string_agg(old_table::text, ', ' order by a) from old_table),
+ (select string_agg(new_table::text, ', ' order by a) from new_table);
+ return null;
+ end;
+$$;
+create or replace function dump_delete() returns trigger language plpgsql as
+$$
+ begin
+ raise notice 'trigger = %, old table = %',
+ TG_NAME,
+ (select string_agg(old_table::text, ', ' order by a) from old_table);
+ return null;
+ end;
+$$;
+--
+-- Verify behavior of statement triggers on partition hierarchy with
+-- transition tables. Tuples should appear to each trigger in the
+-- format of the relation the trigger is attached to.
+--
+-- set up a partition hierarchy with some different TupleDescriptors
+create table parent (a text, b int) partition by list (a);
+-- a child matching parent
+create table child1 partition of parent for values in ('AAA');
+-- a child with a dropped column
+create table child2 (x int, a text, b int);
+alter table child2 drop column x;
+alter table parent attach partition child2 for values in ('BBB');
+-- a child with a different column order
+create table child3 (b int, a text);
+alter table parent attach partition child3 for values in ('CCC');
+create trigger parent_insert_trig
+ after insert on parent referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger parent_update_trig
+ after update on parent referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger parent_delete_trig
+ after delete on parent referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child1_insert_trig
+ after insert on child1 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child1_update_trig
+ after update on child1 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child1_delete_trig
+ after delete on child1 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child2_insert_trig
+ after insert on child2 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child2_update_trig
+ after update on child2 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child2_delete_trig
+ after delete on child2 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child3_insert_trig
+ after insert on child3 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child3_update_trig
+ after update on child3 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child3_delete_trig
+ after delete on child3 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+SELECT trigger_name, event_manipulation, event_object_schema, event_object_table,
+ action_order, action_condition, action_orientation, action_timing,
+ action_reference_old_table, action_reference_new_table
+ FROM information_schema.triggers
+ WHERE event_object_table IN ('parent', 'child1', 'child2', 'child3')
+ ORDER BY trigger_name COLLATE "C", 2;
+ trigger_name | event_manipulation | event_object_schema | event_object_table | action_order | action_condition | action_orientation | action_timing | action_reference_old_table | action_reference_new_table
+--------------------+--------------------+---------------------+--------------------+--------------+------------------+--------------------+---------------+----------------------------+----------------------------
+ child1_delete_trig | DELETE | public | child1 | 1 | | STATEMENT | AFTER | old_table |
+ child1_insert_trig | INSERT | public | child1 | 1 | | STATEMENT | AFTER | | new_table
+ child1_update_trig | UPDATE | public | child1 | 1 | | STATEMENT | AFTER | old_table | new_table
+ child2_delete_trig | DELETE | public | child2 | 1 | | STATEMENT | AFTER | old_table |
+ child2_insert_trig | INSERT | public | child2 | 1 | | STATEMENT | AFTER | | new_table
+ child2_update_trig | UPDATE | public | child2 | 1 | | STATEMENT | AFTER | old_table | new_table
+ child3_delete_trig | DELETE | public | child3 | 1 | | STATEMENT | AFTER | old_table |
+ child3_insert_trig | INSERT | public | child3 | 1 | | STATEMENT | AFTER | | new_table
+ child3_update_trig | UPDATE | public | child3 | 1 | | STATEMENT | AFTER | old_table | new_table
+ parent_delete_trig | DELETE | public | parent | 1 | | STATEMENT | AFTER | old_table |
+ parent_insert_trig | INSERT | public | parent | 1 | | STATEMENT | AFTER | | new_table
+ parent_update_trig | UPDATE | public | parent | 1 | | STATEMENT | AFTER | old_table | new_table
+(12 rows)
+
+-- insert directly into children sees respective child-format tuples
+insert into child1 values ('AAA', 42);
+NOTICE: trigger = child1_insert_trig, new table = (AAA,42)
+insert into child2 values ('BBB', 42);
+NOTICE: trigger = child2_insert_trig, new table = (BBB,42)
+insert into child3 values (42, 'CCC');
+NOTICE: trigger = child3_insert_trig, new table = (42,CCC)
+-- update via parent sees parent-format tuples
+update parent set b = b + 1;
+NOTICE: trigger = parent_update_trig, old table = (AAA,42), (BBB,42), (CCC,42), new table = (AAA,43), (BBB,43), (CCC,43)
+-- delete via parent sees parent-format tuples
+delete from parent;
+NOTICE: trigger = parent_delete_trig, old table = (AAA,43), (BBB,43), (CCC,43)
+-- insert into parent sees parent-format tuples
+insert into parent values ('AAA', 42);
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42)
+insert into parent values ('BBB', 42);
+NOTICE: trigger = parent_insert_trig, new table = (BBB,42)
+insert into parent values ('CCC', 42);
+NOTICE: trigger = parent_insert_trig, new table = (CCC,42)
+-- delete from children sees respective child-format tuples
+delete from child1;
+NOTICE: trigger = child1_delete_trig, old table = (AAA,42)
+delete from child2;
+NOTICE: trigger = child2_delete_trig, old table = (BBB,42)
+delete from child3;
+NOTICE: trigger = child3_delete_trig, old table = (42,CCC)
+-- copy into parent sees parent-format tuples
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,42)
+-- DML affecting parent sees tuples collected from children even if
+-- there is no transition table trigger on the children
+drop trigger child1_insert_trig on child1;
+drop trigger child1_update_trig on child1;
+drop trigger child1_delete_trig on child1;
+drop trigger child2_insert_trig on child2;
+drop trigger child2_update_trig on child2;
+drop trigger child2_delete_trig on child2;
+drop trigger child3_insert_trig on child3;
+drop trigger child3_update_trig on child3;
+drop trigger child3_delete_trig on child3;
+delete from parent;
+NOTICE: trigger = parent_delete_trig, old table = (AAA,42), (BBB,42), (CCC,42)
+-- copy into parent sees tuples collected from children even if there
+-- is no transition-table trigger on the children
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,42)
+-- insert into parent with a before trigger on a child tuple before
+-- insertion, and we capture the newly modified row in parent format
+create or replace function intercept_insert() returns trigger language plpgsql as
+$$
+ begin
+ new.b = new.b + 1000;
+ return new;
+ end;
+$$;
+create trigger intercept_insert_child3
+ before insert on child3
+ for each row execute procedure intercept_insert();
+-- insert, parent trigger sees post-modification parent-format tuple
+insert into parent values ('AAA', 42), ('BBB', 42), ('CCC', 66);
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,1066)
+-- copy, parent trigger sees post-modification parent-format tuple
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,1234)
+drop table child1, child2, child3, parent;
+drop function intercept_insert();
+--
+-- Verify prohibition of row triggers with transition triggers on
+-- partitions
+--
+create table parent (a text, b int) partition by list (a);
+create table child partition of parent for values in ('AAA');
+-- adding row trigger with transition table fails
+create trigger child_row_trig
+ after insert on child referencing new table as new_table
+ for each row execute procedure dump_insert();
+ERROR: ROW triggers with transition tables are not supported on partitions
+-- detaching it first works
+alter table parent detach partition child;
+create trigger child_row_trig
+ after insert on child referencing new table as new_table
+ for each row execute procedure dump_insert();
+-- but now we're not allowed to reattach it
+alter table parent attach partition child for values in ('AAA');
+ERROR: trigger "child_row_trig" prevents table "child" from becoming a partition
+DETAIL: ROW triggers with transition tables are not supported on partitions
+-- drop the trigger, and now we're allowed to attach it again
+drop trigger child_row_trig on child;
+alter table parent attach partition child for values in ('AAA');
+drop table child, parent;
+--
+-- Verify behavior of statement triggers on (non-partition)
+-- inheritance hierarchy with transition tables; similar to the
+-- partition case, except there is no rerouting on insertion and child
+-- tables can have extra columns
+--
+-- set up inheritance hierarchy with different TupleDescriptors
+create table parent (a text, b int);
+-- a child matching parent
+create table child1 () inherits (parent);
+-- a child with a different column order
+create table child2 (b int, a text);
+alter table child2 inherit parent;
+-- a child with an extra column
+create table child3 (c text) inherits (parent);
+create trigger parent_insert_trig
+ after insert on parent referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger parent_update_trig
+ after update on parent referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger parent_delete_trig
+ after delete on parent referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child1_insert_trig
+ after insert on child1 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child1_update_trig
+ after update on child1 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child1_delete_trig
+ after delete on child1 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child2_insert_trig
+ after insert on child2 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child2_update_trig
+ after update on child2 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child2_delete_trig
+ after delete on child2 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child3_insert_trig
+ after insert on child3 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child3_update_trig
+ after update on child3 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child3_delete_trig
+ after delete on child3 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+-- insert directly into children sees respective child-format tuples
+insert into child1 values ('AAA', 42);
+NOTICE: trigger = child1_insert_trig, new table = (AAA,42)
+insert into child2 values (42, 'BBB');
+NOTICE: trigger = child2_insert_trig, new table = (42,BBB)
+insert into child3 values ('CCC', 42, 'foo');
+NOTICE: trigger = child3_insert_trig, new table = (CCC,42,foo)
+-- update via parent sees parent-format tuples
+update parent set b = b + 1;
+NOTICE: trigger = parent_update_trig, old table = (AAA,42), (BBB,42), (CCC,42), new table = (AAA,43), (BBB,43), (CCC,43)
+-- delete via parent sees parent-format tuples
+delete from parent;
+NOTICE: trigger = parent_delete_trig, old table = (AAA,43), (BBB,43), (CCC,43)
+-- reinsert values into children for next test...
+insert into child1 values ('AAA', 42);
+NOTICE: trigger = child1_insert_trig, new table = (AAA,42)
+insert into child2 values (42, 'BBB');
+NOTICE: trigger = child2_insert_trig, new table = (42,BBB)
+insert into child3 values ('CCC', 42, 'foo');
+NOTICE: trigger = child3_insert_trig, new table = (CCC,42,foo)
+-- delete from children sees respective child-format tuples
+delete from child1;
+NOTICE: trigger = child1_delete_trig, old table = (AAA,42)
+delete from child2;
+NOTICE: trigger = child2_delete_trig, old table = (42,BBB)
+delete from child3;
+NOTICE: trigger = child3_delete_trig, old table = (CCC,42,foo)
+-- copy into parent sees parent-format tuples (no rerouting, so these
+-- are really inserted into the parent)
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,42)
+-- same behavior for copy if there is an index (interesting because rows are
+-- captured by a different code path in copy.c if there are indexes)
+create index on parent(b);
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (DDD,42)
+-- DML affecting parent sees tuples collected from children even if
+-- there is no transition table trigger on the children
+drop trigger child1_insert_trig on child1;
+drop trigger child1_update_trig on child1;
+drop trigger child1_delete_trig on child1;
+drop trigger child2_insert_trig on child2;
+drop trigger child2_update_trig on child2;
+drop trigger child2_delete_trig on child2;
+drop trigger child3_insert_trig on child3;
+drop trigger child3_update_trig on child3;
+drop trigger child3_delete_trig on child3;
+delete from parent;
+NOTICE: trigger = parent_delete_trig, old table = (AAA,42), (BBB,42), (CCC,42), (DDD,42)
+drop table child1, child2, child3, parent;
+--
+-- Verify prohibition of row triggers with transition triggers on
+-- inheritance children
+--
+create table parent (a text, b int);
+create table child () inherits (parent);
+-- adding row trigger with transition table fails
+create trigger child_row_trig
+ after insert on child referencing new table as new_table
+ for each row execute procedure dump_insert();
+ERROR: ROW triggers with transition tables are not supported on inheritance children
+-- disinheriting it first works
+alter table child no inherit parent;
+create trigger child_row_trig
+ after insert on child referencing new table as new_table
+ for each row execute procedure dump_insert();
+-- but now we're not allowed to make it inherit anymore
+alter table child inherit parent;
+ERROR: trigger "child_row_trig" prevents table "child" from becoming an inheritance child
+DETAIL: ROW triggers with transition tables are not supported in inheritance hierarchies.
+-- drop the trigger, and now we're allowed to make it inherit again
+drop trigger child_row_trig on child;
+alter table child inherit parent;
+drop table child, parent;
+--
+-- Verify behavior of queries with wCTEs, where multiple transition
+-- tuplestores can be active at the same time because there are
+-- multiple DML statements that might fire triggers with transition
+-- tables
+--
+create table table1 (a int);
+create table table2 (a text);
+create trigger table1_trig
+ after insert on table1 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger table2_trig
+ after insert on table2 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+with wcte as (insert into table1 values (42))
+ insert into table2 values ('hello world');
+NOTICE: trigger = table2_trig, new table = ("hello world")
+NOTICE: trigger = table1_trig, new table = (42)
+with wcte as (insert into table1 values (43))
+ insert into table1 values (44);
+NOTICE: trigger = table1_trig, new table = (43), (44)
+select * from table1;
+ a
+----
+ 42
+ 44
+ 43
+(3 rows)
+
+select * from table2;
+ a
+-------------
+ hello world
+(1 row)
+
+drop table table1;
+drop table table2;
+--
+-- Verify behavior of INSERT ... ON CONFLICT DO UPDATE ... with
+-- transition tables.
+--
+create table my_table (a int primary key, b text);
+create trigger my_table_insert_trig
+ after insert on my_table referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger my_table_update_trig
+ after update on my_table referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+-- inserts only
+insert into my_table values (1, 'AAA'), (2, 'BBB')
+ on conflict (a) do
+ update set b = my_table.b || ':' || excluded.b;
+NOTICE: trigger = my_table_update_trig, old table = <NULL>, new table = <NULL>
+NOTICE: trigger = my_table_insert_trig, new table = (1,AAA), (2,BBB)
+-- mixture of inserts and updates
+insert into my_table values (1, 'AAA'), (2, 'BBB'), (3, 'CCC'), (4, 'DDD')
+ on conflict (a) do
+ update set b = my_table.b || ':' || excluded.b;
+NOTICE: trigger = my_table_update_trig, old table = (1,AAA), (2,BBB), new table = (1,AAA:AAA), (2,BBB:BBB)
+NOTICE: trigger = my_table_insert_trig, new table = (3,CCC), (4,DDD)
+-- updates only
+insert into my_table values (3, 'CCC'), (4, 'DDD')
+ on conflict (a) do
+ update set b = my_table.b || ':' || excluded.b;
+NOTICE: trigger = my_table_update_trig, old table = (3,CCC), (4,DDD), new table = (3,CCC:CCC), (4,DDD:DDD)
+NOTICE: trigger = my_table_insert_trig, new table = <NULL>
+--
+-- now using a partitioned table
+--
+create table iocdu_tt_parted (a int primary key, b text) partition by list (a);
+create table iocdu_tt_parted1 partition of iocdu_tt_parted for values in (1);
+create table iocdu_tt_parted2 partition of iocdu_tt_parted for values in (2);
+create table iocdu_tt_parted3 partition of iocdu_tt_parted for values in (3);
+create table iocdu_tt_parted4 partition of iocdu_tt_parted for values in (4);
+create trigger iocdu_tt_parted_insert_trig
+ after insert on iocdu_tt_parted referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger iocdu_tt_parted_update_trig
+ after update on iocdu_tt_parted referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+-- inserts only
+insert into iocdu_tt_parted values (1, 'AAA'), (2, 'BBB')
+ on conflict (a) do
+ update set b = iocdu_tt_parted.b || ':' || excluded.b;
+NOTICE: trigger = iocdu_tt_parted_update_trig, old table = <NULL>, new table = <NULL>
+NOTICE: trigger = iocdu_tt_parted_insert_trig, new table = (1,AAA), (2,BBB)
+-- mixture of inserts and updates
+insert into iocdu_tt_parted values (1, 'AAA'), (2, 'BBB'), (3, 'CCC'), (4, 'DDD')
+ on conflict (a) do
+ update set b = iocdu_tt_parted.b || ':' || excluded.b;
+NOTICE: trigger = iocdu_tt_parted_update_trig, old table = (1,AAA), (2,BBB), new table = (1,AAA:AAA), (2,BBB:BBB)
+NOTICE: trigger = iocdu_tt_parted_insert_trig, new table = (3,CCC), (4,DDD)
+-- updates only
+insert into iocdu_tt_parted values (3, 'CCC'), (4, 'DDD')
+ on conflict (a) do
+ update set b = iocdu_tt_parted.b || ':' || excluded.b;
+NOTICE: trigger = iocdu_tt_parted_update_trig, old table = (3,CCC), (4,DDD), new table = (3,CCC:CCC), (4,DDD:DDD)
+NOTICE: trigger = iocdu_tt_parted_insert_trig, new table = <NULL>
+drop table iocdu_tt_parted;
+--
+-- Verify that you can't create a trigger with transition tables for
+-- more than one event.
+--
+create trigger my_table_multievent_trig
+ after insert or update on my_table referencing new table as new_table
+ for each statement execute procedure dump_insert();
+ERROR: transition tables cannot be specified for triggers with more than one event
+--
+-- Verify that you can't create a trigger with transition tables with
+-- a column list.
+--
+create trigger my_table_col_update_trig
+ after update of b on my_table referencing new table as new_table
+ for each statement execute procedure dump_insert();
+ERROR: transition tables cannot be specified for triggers with column lists
+drop table my_table;
+--
+-- Test firing of triggers with transition tables by foreign key cascades
+--
+create table refd_table (a int primary key, b text);
+create table trig_table (a int, b text,
+ foreign key (a) references refd_table on update cascade on delete cascade
+);
+create trigger trig_table_before_trig
+ before insert or update or delete on trig_table
+ for each statement execute procedure trigger_func('trig_table');
+create trigger trig_table_insert_trig
+ after insert on trig_table referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger trig_table_update_trig
+ after update on trig_table referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger trig_table_delete_trig
+ after delete on trig_table referencing old table as old_table
+ for each statement execute procedure dump_delete();
+insert into refd_table values
+ (1, 'one'),
+ (2, 'two'),
+ (3, 'three');
+insert into trig_table values
+ (1, 'one a'),
+ (1, 'one b'),
+ (2, 'two a'),
+ (2, 'two b'),
+ (3, 'three a'),
+ (3, 'three b');
+NOTICE: trigger_func(trig_table) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger = trig_table_insert_trig, new table = (1,"one a"), (1,"one b"), (2,"two a"), (2,"two b"), (3,"three a"), (3,"three b")
+update refd_table set a = 11 where b = 'one';
+NOTICE: trigger_func(trig_table) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = trig_table_update_trig, old table = (1,"one a"), (1,"one b"), new table = (11,"one a"), (11,"one b")
+select * from trig_table;
+ a | b
+----+---------
+ 2 | two a
+ 2 | two b
+ 3 | three a
+ 3 | three b
+ 11 | one a
+ 11 | one b
+(6 rows)
+
+delete from refd_table where length(b) = 3;
+NOTICE: trigger_func(trig_table) called: action = DELETE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = trig_table_delete_trig, old table = (2,"two a"), (2,"two b"), (11,"one a"), (11,"one b")
+select * from trig_table;
+ a | b
+---+---------
+ 3 | three a
+ 3 | three b
+(2 rows)
+
+drop table refd_table, trig_table;
+--
+-- self-referential FKs are even more fun
+--
+create table self_ref (a int primary key,
+ b int references self_ref(a) on delete cascade);
+create trigger self_ref_before_trig
+ before delete on self_ref
+ for each statement execute procedure trigger_func('self_ref');
+create trigger self_ref_r_trig
+ after delete on self_ref referencing old table as old_table
+ for each row execute procedure dump_delete();
+create trigger self_ref_s_trig
+ after delete on self_ref referencing old table as old_table
+ for each statement execute procedure dump_delete();
+insert into self_ref values (1, null), (2, 1), (3, 2);
+delete from self_ref where a = 1;
+NOTICE: trigger_func(self_ref) called: action = DELETE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = self_ref_r_trig, old table = (1,), (2,1)
+NOTICE: trigger_func(self_ref) called: action = DELETE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = self_ref_r_trig, old table = (1,), (2,1)
+NOTICE: trigger = self_ref_s_trig, old table = (1,), (2,1)
+NOTICE: trigger = self_ref_r_trig, old table = (3,2)
+NOTICE: trigger = self_ref_s_trig, old table = (3,2)
+-- without AR trigger, cascaded deletes all end up in one transition table
+drop trigger self_ref_r_trig on self_ref;
+insert into self_ref values (1, null), (2, 1), (3, 2), (4, 3);
+delete from self_ref where a = 1;
+NOTICE: trigger_func(self_ref) called: action = DELETE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = self_ref_s_trig, old table = (1,), (2,1), (3,2), (4,3)
+drop table self_ref;
+-- cleanup
+drop function dump_insert();
+drop function dump_update();
+drop function dump_delete();
diff --git src/test/regress/expected/tsrf_1.out src/test/regress/expected/tsrf_1.out
new file mode 100644
index 0000000000..a0f7d80c69
--- /dev/null
+++ src/test/regress/expected/tsrf_1.out
@@ -0,0 +1,712 @@
+--
+-- tsrf - targetlist set returning function tests
+--
+-- simple srf
+SELECT generate_series(1, 3);
+ generate_series
+-----------------
+ 1
+ 2
+ 3
+(3 rows)
+
+-- parallel iteration
+SELECT generate_series(1, 3), generate_series(3,5);
+ generate_series | generate_series
+-----------------+-----------------
+ 1 | 3
+ 2 | 4
+ 3 | 5
+(3 rows)
+
+-- parallel iteration, different number of rows
+SELECT generate_series(1, 2), generate_series(1,4);
+ generate_series | generate_series
+-----------------+-----------------
+ 1 | 1
+ 2 | 2
+ | 3
+ | 4
+(4 rows)
+
+-- srf, with SRF argument
+SELECT generate_series(1, generate_series(1, 3));
+ generate_series
+-----------------
+ 1
+ 1
+ 2
+ 1
+ 2
+ 3
+(6 rows)
+
+-- but we've traditionally rejected the same in FROM
+SELECT * FROM generate_series(1, generate_series(1, 3));
+ERROR: set-returning functions must appear at top level of FROM
+LINE 1: SELECT * FROM generate_series(1, generate_series(1, 3));
+ ^
+-- srf, with two SRF arguments
+SELECT generate_series(generate_series(1,3), generate_series(2, 4));
+ generate_series
+-----------------
+ 1
+ 2
+ 2
+ 3
+ 3
+ 4
+(6 rows)
+
+-- check proper nesting of SRFs in different expressions
+explain (verbose, costs off)
+SELECT generate_series(1, generate_series(1, 3)), generate_series(2, 4);
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ ProjectSet
+ Output: generate_series(1, (generate_series(1, 3))), (generate_series(2, 4))
+ -> ProjectSet
+ Output: generate_series(1, 3), generate_series(2, 4)
+ -> Result
+(5 rows)
+
+SELECT generate_series(1, generate_series(1, 3)), generate_series(2, 4);
+ generate_series | generate_series
+-----------------+-----------------
+ 1 | 2
+ 1 | 3
+ 2 | 3
+ 1 | 4
+ 2 | 4
+ 3 | 4
+(6 rows)
+
+CREATE TABLE few(id int, dataa text, datab text);
+INSERT INTO few VALUES(1, 'a', 'foo'),(2, 'a', 'bar'),(3, 'b', 'bar');
+-- SRF with a provably-dummy relation
+explain (verbose, costs off)
+SELECT unnest(ARRAY[1, 2]) FROM few WHERE false;
+ QUERY PLAN
+--------------------------------------
+ ProjectSet
+ Output: unnest('{1,2}'::integer[])
+ -> Result
+ One-Time Filter: false
+(4 rows)
+
+SELECT unnest(ARRAY[1, 2]) FROM few WHERE false;
+ unnest
+--------
+(0 rows)
+
+-- SRF shouldn't prevent upper query from recognizing lower as dummy
+explain (verbose, costs off)
+SELECT * FROM few f1,
+ (SELECT unnest(ARRAY[1,2]) FROM few f2 WHERE false OFFSET 0) ss;
+ QUERY PLAN
+------------------------------------------------
+ Result
+ Output: f1.id, f1.dataa, f1.datab, ss.unnest
+ One-Time Filter: false
+(3 rows)
+
+SELECT * FROM few f1,
+ (SELECT unnest(ARRAY[1,2]) FROM few f2 WHERE false OFFSET 0) ss;
+ id | dataa | datab | unnest
+----+-------+-------+--------
+(0 rows)
+
+-- SRF output order of sorting is maintained, if SRF is not referenced
+SELECT few.id, generate_series(1,3) g FROM few ORDER BY id DESC;
+ id | g
+----+---
+ 3 | 1
+ 3 | 2
+ 3 | 3
+ 2 | 1
+ 2 | 2
+ 2 | 3
+ 1 | 1
+ 1 | 2
+ 1 | 3
+(9 rows)
+
+-- but SRFs can be referenced in sort
+SELECT few.id, generate_series(1,3) g FROM few ORDER BY id, g DESC;
+ id | g
+----+---
+ 1 | 3
+ 1 | 2
+ 1 | 1
+ 2 | 3
+ 2 | 2
+ 2 | 1
+ 3 | 3
+ 3 | 2
+ 3 | 1
+(9 rows)
+
+SELECT few.id, generate_series(1,3) g FROM few ORDER BY id, generate_series(1,3) DESC;
+ id | g
+----+---
+ 1 | 3
+ 1 | 2
+ 1 | 1
+ 2 | 3
+ 2 | 2
+ 2 | 1
+ 3 | 3
+ 3 | 2
+ 3 | 1
+(9 rows)
+
+-- it's weird to have ORDER BYs that increase the number of results
+SELECT few.id FROM few ORDER BY id, generate_series(1,3) DESC;
+ id
+----
+ 1
+ 1
+ 1
+ 2
+ 2
+ 2
+ 3
+ 3
+ 3
+(9 rows)
+
+-- SRFs are computed after aggregation
+SET enable_hashagg TO 0; -- stable output order
+SELECT few.dataa, count(*), min(id), max(id), unnest('{1,1,3}'::int[]) FROM few WHERE few.id = 1 GROUP BY few.dataa;
+ dataa | count | min | max | unnest
+-------+-------+-----+-----+--------
+ a | 1 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 3
+(3 rows)
+
+-- unless referenced in GROUP BY clause
+SELECT few.dataa, count(*), min(id), max(id), unnest('{1,1,3}'::int[]) FROM few WHERE few.id = 1 GROUP BY few.dataa, unnest('{1,1,3}'::int[]);
+ dataa | count | min | max | unnest
+-------+-------+-----+-----+--------
+ a | 2 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 3
+(2 rows)
+
+SELECT few.dataa, count(*), min(id), max(id), unnest('{1,1,3}'::int[]) FROM few WHERE few.id = 1 GROUP BY few.dataa, 5;
+ dataa | count | min | max | unnest
+-------+-------+-----+-----+--------
+ a | 2 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 3
+(2 rows)
+
+RESET enable_hashagg;
+-- check HAVING works when GROUP BY does [not] reference SRF output
+SELECT dataa, generate_series(1,1), count(*) FROM few GROUP BY 1 HAVING count(*) > 1;
+ dataa | generate_series | count
+-------+-----------------+-------
+ a | 1 | 2
+(1 row)
+
+SELECT dataa, generate_series(1,1), count(*) FROM few GROUP BY 1, 2 HAVING count(*) > 1;
+ dataa | generate_series | count
+-------+-----------------+-------
+ a | 1 | 2
+(1 row)
+
+-- it's weird to have GROUP BYs that increase the number of results
+SELECT few.dataa, count(*) FROM few WHERE dataa = 'a' GROUP BY few.dataa ORDER BY 2;
+ dataa | count
+-------+-------
+ a | 2
+(1 row)
+
+SELECT few.dataa, count(*) FROM few WHERE dataa = 'a' GROUP BY few.dataa, unnest('{1,1,3}'::int[]) ORDER BY 2;
+ dataa | count
+-------+-------
+ a | 2
+ a | 4
+(2 rows)
+
+-- SRFs are not allowed if they'd need to be conditionally executed
+SELECT q1, case when q1 > 0 then generate_series(1,3) else 0 end FROM int8_tbl;
+ERROR: set-returning functions are not allowed in CASE
+LINE 1: SELECT q1, case when q1 > 0 then generate_series(1,3) else 0...
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+SELECT q1, coalesce(generate_series(1,3), 0) FROM int8_tbl;
+ERROR: set-returning functions are not allowed in COALESCE
+LINE 1: SELECT q1, coalesce(generate_series(1,3), 0) FROM int8_tbl;
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+-- SRFs are not allowed in aggregate arguments
+SELECT min(generate_series(1, 3)) FROM few;
+ERROR: aggregate function calls cannot contain set-returning function calls
+LINE 1: SELECT min(generate_series(1, 3)) FROM few;
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+-- ... unless they're within a sub-select
+SELECT sum((3 = ANY(SELECT generate_series(1,4)))::int);
+ sum
+-----
+ 1
+(1 row)
+
+SELECT sum((3 = ANY(SELECT lag(x) over(order by x)
+ FROM generate_series(1,4) x))::int);
+ sum
+-----
+ 1
+(1 row)
+
+-- SRFs are not allowed in window function arguments, either
+SELECT min(generate_series(1, 3)) OVER() FROM few;
+ERROR: window function calls cannot contain set-returning function calls
+LINE 1: SELECT min(generate_series(1, 3)) OVER() FROM few;
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+-- SRFs are normally computed after window functions
+SELECT id,lag(id) OVER(), count(*) OVER(), generate_series(1,3) FROM few;
+ id | lag | count | generate_series
+----+-----+-------+-----------------
+ 1 | | 3 | 1
+ 1 | | 3 | 2
+ 1 | | 3 | 3
+ 2 | 1 | 3 | 1
+ 2 | 1 | 3 | 2
+ 2 | 1 | 3 | 3
+ 3 | 2 | 3 | 1
+ 3 | 2 | 3 | 2
+ 3 | 2 | 3 | 3
+(9 rows)
+
+-- unless referencing SRFs
+SELECT SUM(count(*)) OVER(PARTITION BY generate_series(1,3) ORDER BY generate_series(1,3)), generate_series(1,3) g FROM few GROUP BY g;
+ sum | g
+-----+---
+ 3 | 1
+ 3 | 2
+ 3 | 3
+(3 rows)
+
+-- sorting + grouping
+SELECT few.dataa, count(*), min(id), max(id), generate_series(1,3) FROM few GROUP BY few.dataa ORDER BY 5, 1;
+ dataa | count | min | max | generate_series
+-------+-------+-----+-----+-----------------
+ a | 2 | 1 | 2 | 1
+ b | 1 | 3 | 3 | 1
+ a | 2 | 1 | 2 | 2
+ b | 1 | 3 | 3 | 2
+ a | 2 | 1 | 2 | 3
+ b | 1 | 3 | 3 | 3
+(6 rows)
+
+-- grouping sets are a bit special, they produce NULLs in columns not actually NULL
+set enable_hashagg = false;
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab);
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | | 1 | 2
+ a | | 2 | 2
+ b | bar | 1 | 1
+ b | bar | 2 | 1
+ b | | 1 | 1
+ b | | 2 | 1
+ | | 1 | 3
+ | | 2 | 3
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | foo | 1 | 1
+ | foo | 2 | 1
+(16 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab) ORDER BY dataa;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | | 1 | 2
+ a | | 2 | 2
+ b | bar | 1 | 1
+ b | bar | 2 | 1
+ b | | 1 | 1
+ b | | 2 | 1
+ | | 1 | 3
+ | | 2 | 3
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | foo | 1 | 1
+ | foo | 2 | 1
+(16 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab) ORDER BY g;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | foo | 1 | 1
+ a | | 1 | 2
+ b | bar | 1 | 1
+ b | | 1 | 1
+ | | 1 | 3
+ | bar | 1 | 2
+ | foo | 1 | 1
+ | foo | 2 | 1
+ a | bar | 2 | 1
+ b | | 2 | 1
+ a | foo | 2 | 1
+ | bar | 2 | 2
+ a | | 2 | 2
+ | | 2 | 3
+ b | bar | 2 | 1
+(16 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab, g);
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | bar | | 2
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | foo | | 2
+ a | | | 4
+ b | bar | 1 | 1
+ b | bar | 2 | 1
+ b | bar | | 2
+ b | | | 2
+ | | | 6
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | bar | | 4
+ | foo | 1 | 1
+ | foo | 2 | 1
+ | foo | | 2
+ a | | 1 | 2
+ b | | 1 | 1
+ | | 1 | 3
+ a | | 2 | 2
+ b | | 2 | 1
+ | | 2 | 3
+(24 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab, g) ORDER BY dataa;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | foo | | 2
+ a | | | 4
+ a | | 2 | 2
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | bar | | 2
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | | 1 | 2
+ b | bar | 1 | 1
+ b | | | 2
+ b | | 1 | 1
+ b | bar | 2 | 1
+ b | bar | | 2
+ b | | 2 | 1
+ | | 2 | 3
+ | | | 6
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | bar | | 4
+ | foo | 1 | 1
+ | foo | 2 | 1
+ | foo | | 2
+ | | 1 | 3
+(24 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab, g) ORDER BY g;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | foo | 1 | 1
+ b | bar | 1 | 1
+ | bar | 1 | 2
+ | foo | 1 | 1
+ a | | 1 | 2
+ b | | 1 | 1
+ | | 1 | 3
+ a | | 2 | 2
+ b | | 2 | 1
+ | bar | 2 | 2
+ | | 2 | 3
+ | foo | 2 | 1
+ a | bar | 2 | 1
+ a | foo | 2 | 1
+ b | bar | 2 | 1
+ a | | | 4
+ b | bar | | 2
+ b | | | 2
+ | | | 6
+ a | foo | | 2
+ a | bar | | 2
+ | bar | | 4
+ | foo | | 2
+(24 rows)
+
+reset enable_hashagg;
+-- case with degenerate ORDER BY
+explain (verbose, costs off)
+select 'foo' as f, generate_series(1,2) as g from few order by 1;
+ QUERY PLAN
+------------------------------------------------
+ ProjectSet
+ Output: ('foo'::text), generate_series(1, 2)
+ -> Seq Scan on public.few
+ Output: 'foo'::text
+(4 rows)
+
+select 'foo' as f, generate_series(1,2) as g from few order by 1;
+ f | g
+-----+---
+ foo | 1
+ foo | 2
+ foo | 1
+ foo | 2
+ foo | 1
+ foo | 2
+(6 rows)
+
+-- data modification
+CREATE TABLE fewmore AS SELECT generate_series(1,3) AS data;
+INSERT INTO fewmore VALUES(generate_series(4,5));
+SELECT * FROM fewmore;
+ data
+------
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+-- SRFs are not allowed in UPDATE (they once were, but it was nonsense)
+UPDATE fewmore SET data = generate_series(4,9);
+ERROR: set-returning functions are not allowed in UPDATE
+LINE 1: UPDATE fewmore SET data = generate_series(4,9);
+ ^
+-- SRFs are not allowed in RETURNING
+INSERT INTO fewmore VALUES(1) RETURNING generate_series(1,3);
+ERROR: set-returning functions are not allowed in RETURNING
+LINE 1: INSERT INTO fewmore VALUES(1) RETURNING generate_series(1,3)...
+ ^
+-- nor standalone VALUES (but surely this is a bug?)
+VALUES(1, generate_series(1,2));
+ERROR: set-returning functions are not allowed in VALUES
+LINE 1: VALUES(1, generate_series(1,2));
+ ^
+-- We allow tSRFs that are not at top level
+SELECT int4mul(generate_series(1,2), 10);
+ int4mul
+---------
+ 10
+ 20
+(2 rows)
+
+SELECT generate_series(1,3) IS DISTINCT FROM 2;
+ ?column?
+----------
+ t
+ f
+ t
+(3 rows)
+
+-- but SRFs in function RTEs must be at top level (annoying restriction)
+SELECT * FROM int4mul(generate_series(1,2), 10);
+ERROR: set-returning functions must appear at top level of FROM
+LINE 1: SELECT * FROM int4mul(generate_series(1,2), 10);
+ ^
+-- DISTINCT ON is evaluated before tSRF evaluation if SRF is not
+-- referenced either in ORDER BY or in the DISTINCT ON list. The ORDER
+-- BY reference can be implicitly generated, if there's no other ORDER BY.
+-- implicit reference (via implicit ORDER) to all columns
+SELECT DISTINCT ON (a) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b);
+ a | b | g
+---+---+---
+ 1 | 1 | 1
+ 3 | 2 | 1
+ 5 | 3 | 1
+(3 rows)
+
+-- unreferenced in DISTINCT ON or ORDER BY
+SELECT DISTINCT ON (a) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b)
+ORDER BY a, b DESC;
+ a | b | g
+---+---+---
+ 1 | 4 | 1
+ 1 | 4 | 2
+ 1 | 4 | 3
+ 3 | 2 | 1
+ 3 | 2 | 2
+ 3 | 2 | 3
+ 5 | 3 | 1
+ 5 | 3 | 2
+ 5 | 3 | 3
+(9 rows)
+
+-- referenced in ORDER BY
+SELECT DISTINCT ON (a) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b)
+ORDER BY a, b DESC, g DESC;
+ a | b | g
+---+---+---
+ 1 | 4 | 3
+ 3 | 2 | 3
+ 5 | 3 | 3
+(3 rows)
+
+-- referenced in ORDER BY and DISTINCT ON
+SELECT DISTINCT ON (a, b, g) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b)
+ORDER BY a, b DESC, g DESC;
+ a | b | g
+---+---+---
+ 1 | 4 | 3
+ 1 | 4 | 2
+ 1 | 4 | 1
+ 1 | 1 | 3
+ 1 | 1 | 2
+ 1 | 1 | 1
+ 3 | 2 | 3
+ 3 | 2 | 2
+ 3 | 2 | 1
+ 3 | 1 | 3
+ 3 | 1 | 2
+ 3 | 1 | 1
+ 5 | 3 | 3
+ 5 | 3 | 2
+ 5 | 3 | 1
+ 5 | 1 | 3
+ 5 | 1 | 2
+ 5 | 1 | 1
+(18 rows)
+
+-- only SRF mentioned in DISTINCT ON
+SELECT DISTINCT ON (g) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b);
+ a | b | g
+---+---+---
+ 3 | 2 | 1
+ 5 | 1 | 2
+ 3 | 1 | 3
+(3 rows)
+
+-- LIMIT / OFFSET is evaluated after SRF evaluation
+SELECT a, generate_series(1,2) FROM (VALUES(1),(2),(3)) r(a) LIMIT 2 OFFSET 2;
+ a | generate_series
+---+-----------------
+ 2 | 1
+ 2 | 2
+(2 rows)
+
+-- SRFs are not allowed in LIMIT.
+SELECT 1 LIMIT generate_series(1,3);
+ERROR: set-returning functions are not allowed in LIMIT
+LINE 1: SELECT 1 LIMIT generate_series(1,3);
+ ^
+-- tSRF in correlated subquery, referencing table outside
+SELECT (SELECT generate_series(1,3) LIMIT 1 OFFSET few.id) FROM few;
+ generate_series
+-----------------
+ 2
+ 3
+
+(3 rows)
+
+-- tSRF in correlated subquery, referencing SRF outside
+SELECT (SELECT generate_series(1,3) LIMIT 1 OFFSET g.i) FROM generate_series(0,3) g(i);
+ generate_series
+-----------------
+ 1
+ 2
+ 3
+
+(4 rows)
+
+-- Operators can return sets too
+CREATE OPERATOR |@| (PROCEDURE = unnest, RIGHTARG = ANYARRAY);
+SELECT |@|ARRAY[1,2,3];
+ ?column?
+----------
+ 1
+ 2
+ 3
+(3 rows)
+
+-- Some fun cases involving duplicate SRF calls
+explain (verbose, costs off)
+select generate_series(1,3) as x, generate_series(1,3) + 1 as xp1;
+ QUERY PLAN
+------------------------------------------------------------------
+ Result
+ Output: (generate_series(1, 3)), ((generate_series(1, 3)) + 1)
+ -> ProjectSet
+ Output: generate_series(1, 3)
+ -> Result
+(5 rows)
+
+select generate_series(1,3) as x, generate_series(1,3) + 1 as xp1;
+ x | xp1
+---+-----
+ 1 | 2
+ 2 | 3
+ 3 | 4
+(3 rows)
+
+explain (verbose, costs off)
+select generate_series(1,3)+1 order by generate_series(1,3);
+ QUERY PLAN
+------------------------------------------------------------------------
+ Sort
+ Output: (((generate_series(1, 3)) + 1)), (generate_series(1, 3))
+ Sort Key: (generate_series(1, 3))
+ -> Result
+ Output: ((generate_series(1, 3)) + 1), (generate_series(1, 3))
+ -> ProjectSet
+ Output: generate_series(1, 3)
+ -> Result
+(8 rows)
+
+select generate_series(1,3)+1 order by generate_series(1,3);
+ ?column?
+----------
+ 2
+ 3
+ 4
+(3 rows)
+
+-- Check that SRFs of same nesting level run in lockstep
+explain (verbose, costs off)
+select generate_series(1,3) as x, generate_series(3,6) + 1 as y;
+ QUERY PLAN
+------------------------------------------------------------------
+ Result
+ Output: (generate_series(1, 3)), ((generate_series(3, 6)) + 1)
+ -> ProjectSet
+ Output: generate_series(1, 3), generate_series(3, 6)
+ -> Result
+(5 rows)
+
+select generate_series(1,3) as x, generate_series(3,6) + 1 as y;
+ x | y
+---+---
+ 1 | 4
+ 2 | 5
+ 3 | 6
+ | 7
+(4 rows)
+
+-- Clean up
+DROP TABLE few;
+DROP TABLE fewmore;
diff --git src/test/regress/expected/updatable_views_1.out src/test/regress/expected/updatable_views_1.out
new file mode 100644
index 0000000000..27db0cecd1
--- /dev/null
+++ src/test/regress/expected/updatable_views_1.out
@@ -0,0 +1,3018 @@
+--
+-- UPDATABLE VIEWS
+--
+-- avoid bit-exact output here because operations may not be bit-exact.
+SET extra_float_digits = 0;
+-- check that non-updatable views and columns are rejected with useful error
+-- messages
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW ro_view1 AS SELECT DISTINCT a, b FROM base_tbl; -- DISTINCT not supported
+CREATE VIEW ro_view2 AS SELECT a, b FROM base_tbl GROUP BY a, b; -- GROUP BY not supported
+CREATE VIEW ro_view3 AS SELECT 1 FROM base_tbl HAVING max(a) > 0; -- HAVING not supported
+CREATE VIEW ro_view4 AS SELECT count(*) FROM base_tbl; -- Aggregate functions not supported
+CREATE VIEW ro_view5 AS SELECT a, rank() OVER() FROM base_tbl; -- Window functions not supported
+CREATE VIEW ro_view6 AS SELECT a, b FROM base_tbl UNION SELECT -a, b FROM base_tbl; -- Set ops not supported
+CREATE VIEW ro_view7 AS WITH t AS (SELECT a, b FROM base_tbl) SELECT * FROM t; -- WITH not supported
+CREATE VIEW ro_view8 AS SELECT a, b FROM base_tbl ORDER BY a OFFSET 1; -- OFFSET not supported
+CREATE VIEW ro_view9 AS SELECT a, b FROM base_tbl ORDER BY a LIMIT 1; -- LIMIT not supported
+CREATE VIEW ro_view10 AS SELECT 1 AS a; -- No base relations
+CREATE VIEW ro_view11 AS SELECT b1.a, b2.b FROM base_tbl b1, base_tbl b2; -- Multiple base relations
+CREATE VIEW ro_view12 AS SELECT * FROM generate_series(1, 10) AS g(a); -- SRF in rangetable
+CREATE VIEW ro_view13 AS SELECT a, b FROM (SELECT * FROM base_tbl) AS t; -- Subselect in rangetable
+CREATE VIEW rw_view14 AS SELECT ctid, a, b FROM base_tbl; -- System columns may be part of an updatable view
+CREATE VIEW rw_view15 AS SELECT a, upper(b) FROM base_tbl; -- Expression/function may be part of an updatable view
+CREATE VIEW rw_view16 AS SELECT a, b, a AS aa FROM base_tbl; -- Repeated column may be part of an updatable view
+CREATE VIEW ro_view17 AS SELECT * FROM ro_view1; -- Base relation not updatable
+CREATE VIEW ro_view18 AS SELECT * FROM (VALUES(1)) AS tmp(a); -- VALUES in rangetable
+CREATE SEQUENCE uv_seq;
+CREATE VIEW ro_view19 AS SELECT * FROM uv_seq; -- View based on a sequence
+CREATE VIEW ro_view20 AS SELECT a, b, generate_series(1, a) g FROM base_tbl; -- SRF in targetlist not supported
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ ro_view1 | NO
+ ro_view10 | NO
+ ro_view11 | NO
+ ro_view12 | NO
+ ro_view13 | NO
+ ro_view17 | NO
+ ro_view18 | NO
+ ro_view19 | NO
+ ro_view2 | NO
+ ro_view20 | NO
+ ro_view3 | NO
+ ro_view4 | NO
+ ro_view5 | NO
+ ro_view6 | NO
+ ro_view7 | NO
+ ro_view8 | NO
+ ro_view9 | NO
+ rw_view14 | YES
+ rw_view15 | YES
+ rw_view16 | YES
+(20 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ ro_view1 | NO | NO
+ ro_view10 | NO | NO
+ ro_view11 | NO | NO
+ ro_view12 | NO | NO
+ ro_view13 | NO | NO
+ ro_view17 | NO | NO
+ ro_view18 | NO | NO
+ ro_view19 | NO | NO
+ ro_view2 | NO | NO
+ ro_view20 | NO | NO
+ ro_view3 | NO | NO
+ ro_view4 | NO | NO
+ ro_view5 | NO | NO
+ ro_view6 | NO | NO
+ ro_view7 | NO | NO
+ ro_view8 | NO | NO
+ ro_view9 | NO | NO
+ rw_view14 | YES | YES
+ rw_view15 | YES | YES
+ rw_view16 | YES | YES
+(20 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ ro_view1 | a | NO
+ ro_view1 | b | NO
+ ro_view10 | a | NO
+ ro_view11 | a | NO
+ ro_view11 | b | NO
+ ro_view12 | a | NO
+ ro_view13 | a | NO
+ ro_view13 | b | NO
+ ro_view17 | a | NO
+ ro_view17 | b | NO
+ ro_view18 | a | NO
+ ro_view19 | last_value | NO
+ ro_view19 | log_cnt | NO
+ ro_view19 | is_called | NO
+ ro_view2 | a | NO
+ ro_view2 | b | NO
+ ro_view20 | a | NO
+ ro_view20 | b | NO
+ ro_view20 | g | NO
+ ro_view3 | ?column? | NO
+ ro_view4 | count | NO
+ ro_view5 | a | NO
+ ro_view5 | rank | NO
+ ro_view6 | a | NO
+ ro_view6 | b | NO
+ ro_view7 | a | NO
+ ro_view7 | b | NO
+ ro_view8 | a | NO
+ ro_view8 | b | NO
+ ro_view9 | a | NO
+ ro_view9 | b | NO
+ rw_view14 | ctid | NO
+ rw_view14 | a | YES
+ rw_view14 | b | YES
+ rw_view15 | a | YES
+ rw_view15 | upper | NO
+ rw_view16 | a | YES
+ rw_view16 | b | YES
+ rw_view16 | aa | YES
+(39 rows)
+
+-- Read-only views
+DELETE FROM ro_view1;
+ERROR: cannot delete from view "ro_view1"
+DETAIL: Views containing DISTINCT are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view2;
+ERROR: cannot delete from view "ro_view2"
+DETAIL: Views containing GROUP BY are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view3;
+ERROR: cannot delete from view "ro_view3"
+DETAIL: Views containing HAVING are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view4;
+ERROR: cannot delete from view "ro_view4"
+DETAIL: Views that return aggregate functions are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view5;
+ERROR: cannot delete from view "ro_view5"
+DETAIL: Views that return window functions are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view6;
+ERROR: cannot delete from view "ro_view6"
+DETAIL: Views containing UNION, INTERSECT, or EXCEPT are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+UPDATE ro_view7 SET a=a+1;
+ERROR: cannot update view "ro_view7"
+DETAIL: Views containing WITH are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view8 SET a=a+1;
+ERROR: cannot update view "ro_view8"
+DETAIL: Views containing LIMIT or OFFSET are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view9 SET a=a+1;
+ERROR: cannot update view "ro_view9"
+DETAIL: Views containing LIMIT or OFFSET are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view10 SET a=a+1;
+ERROR: cannot update view "ro_view10"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view11 SET a=a+1;
+ERROR: cannot update view "ro_view11"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view12 SET a=a+1;
+ERROR: cannot update view "ro_view12"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+INSERT INTO ro_view13 VALUES (3, 'Row 3');
+ERROR: cannot insert into view "ro_view13"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable inserting into the view, provide an INSTEAD OF INSERT trigger or an unconditional ON INSERT DO INSTEAD rule.
+-- Partially updatable view
+INSERT INTO rw_view14 VALUES (null, 3, 'Row 3'); -- should fail
+ERROR: cannot insert into column "ctid" of view "rw_view14"
+DETAIL: View columns that refer to system columns are not updatable.
+INSERT INTO rw_view14 (a, b) VALUES (3, 'Row 3'); -- should be OK
+UPDATE rw_view14 SET ctid=null WHERE a=3; -- should fail
+ERROR: cannot update column "ctid" of view "rw_view14"
+DETAIL: View columns that refer to system columns are not updatable.
+UPDATE rw_view14 SET b='ROW 3' WHERE a=3; -- should be OK
+SELECT * FROM base_tbl;
+ a | b
+----+--------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 2 | Row 2
+ 3 | ROW 3
+(6 rows)
+
+DELETE FROM rw_view14 WHERE a=3; -- should be OK
+-- Partially updatable view
+INSERT INTO rw_view15 VALUES (3, 'ROW 3'); -- should fail
+ERROR: cannot insert into column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view15 (a) VALUES (3); -- should be OK
+INSERT INTO rw_view15 (a) VALUES (3) ON CONFLICT DO NOTHING; -- succeeds
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+INSERT INTO rw_view15 (a) VALUES (3) ON CONFLICT (a) DO NOTHING; -- succeeds
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+INSERT INTO rw_view15 (a) VALUES (3) ON CONFLICT (a) DO UPDATE set a = excluded.a; -- succeeds
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+INSERT INTO rw_view15 (a) VALUES (3) ON CONFLICT (a) DO UPDATE set upper = 'blarg'; -- fails
+ERROR: cannot insert into column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+ALTER VIEW rw_view15 ALTER COLUMN upper SET DEFAULT 'NOT SET';
+INSERT INTO rw_view15 (a) VALUES (4); -- should fail
+ERROR: cannot insert into column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view15 SET upper='ROW 3' WHERE a=3; -- should fail
+ERROR: cannot update column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view15 SET upper=DEFAULT WHERE a=3; -- should fail
+ERROR: cannot update column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view15 SET a=4 WHERE a=3; -- should be OK
+SELECT * FROM base_tbl;
+ a | b
+----+-------------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 2 | Row 2
+ 4 | Unspecified
+(6 rows)
+
+DELETE FROM rw_view15 WHERE a=4; -- should be OK
+-- Partially updatable view
+INSERT INTO rw_view16 VALUES (3, 'Row 3', 3); -- should fail
+ERROR: multiple assignments to same column "a"
+INSERT INTO rw_view16 (a, b) VALUES (3, 'Row 3'); -- should be OK
+UPDATE rw_view16 SET a=3, aa=-3 WHERE a=3; -- should fail
+ERROR: multiple assignments to same column "a"
+UPDATE rw_view16 SET aa=-3 WHERE a=3; -- should be OK
+SELECT * FROM base_tbl;
+ a | b
+----+--------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 2 | Row 2
+ -3 | Row 3
+(6 rows)
+
+DELETE FROM rw_view16 WHERE a=-3; -- should be OK
+-- Read-only views
+INSERT INTO ro_view17 VALUES (3, 'ROW 3');
+ERROR: cannot insert into view "ro_view1"
+DETAIL: Views containing DISTINCT are not automatically updatable.
+HINT: To enable inserting into the view, provide an INSTEAD OF INSERT trigger or an unconditional ON INSERT DO INSTEAD rule.
+DELETE FROM ro_view18;
+ERROR: cannot delete from view "ro_view18"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+UPDATE ro_view19 SET last_value=1000;
+ERROR: cannot update view "ro_view19"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view20 SET b=upper(b);
+ERROR: cannot update view "ro_view20"
+DETAIL: Views that return set-returning functions are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 16 other objects
+DETAIL: drop cascades to view ro_view1
+drop cascades to view ro_view17
+drop cascades to view ro_view2
+drop cascades to view ro_view3
+drop cascades to view ro_view4
+drop cascades to view ro_view5
+drop cascades to view ro_view6
+drop cascades to view ro_view7
+drop cascades to view ro_view8
+drop cascades to view ro_view9
+drop cascades to view ro_view11
+drop cascades to view ro_view13
+drop cascades to view rw_view14
+drop cascades to view rw_view15
+drop cascades to view rw_view16
+drop cascades to view ro_view20
+DROP VIEW ro_view10, ro_view12, ro_view18;
+DROP SEQUENCE uv_seq CASCADE;
+NOTICE: drop cascades to view ro_view19
+-- simple updatable view
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a>0;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name = 'rw_view1';
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+(1 row)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name = 'rw_view1';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | YES | YES
+(1 row)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name = 'rw_view1'
+ ORDER BY ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | YES
+ rw_view1 | b | YES
+(2 rows)
+
+INSERT INTO rw_view1 VALUES (3, 'Row 3');
+INSERT INTO rw_view1 (a) VALUES (4);
+UPDATE rw_view1 SET a=5 WHERE a=4;
+DELETE FROM rw_view1 WHERE b='Row 2';
+SELECT * FROM base_tbl;
+ a | b
+----+-------------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 3 | Row 3
+ 5 | Unspecified
+(6 rows)
+
+EXPLAIN (costs off) UPDATE rw_view1 SET a=6 WHERE a=5;
+ QUERY PLAN
+--------------------------------------------------
+ Update on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: ((a > 0) AND (a = 5))
+(3 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view1 WHERE a=5;
+ QUERY PLAN
+--------------------------------------------------
+ Delete on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: ((a > 0) AND (a = 5))
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- view on top of view
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT b AS bb, a AS aa FROM base_tbl WHERE a>0;
+CREATE VIEW rw_view2 AS SELECT aa AS aaa, bb AS bbb FROM rw_view1 WHERE aa<10;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name = 'rw_view2';
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view2 | YES
+(1 row)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name = 'rw_view2';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view2 | YES | YES
+(1 row)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name = 'rw_view2'
+ ORDER BY ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view2 | aaa | YES
+ rw_view2 | bbb | YES
+(2 rows)
+
+INSERT INTO rw_view2 VALUES (3, 'Row 3');
+INSERT INTO rw_view2 (aaa) VALUES (4);
+SELECT * FROM rw_view2;
+ aaa | bbb
+-----+-------------
+ 1 | Row 1
+ 2 | Row 2
+ 3 | Row 3
+ 4 | Unspecified
+(4 rows)
+
+UPDATE rw_view2 SET bbb='Row 4' WHERE aaa=4;
+DELETE FROM rw_view2 WHERE aaa=2;
+SELECT * FROM rw_view2;
+ aaa | bbb
+-----+-------
+ 1 | Row 1
+ 3 | Row 3
+ 4 | Row 4
+(3 rows)
+
+EXPLAIN (costs off) UPDATE rw_view2 SET aaa=5 WHERE aaa=4;
+ QUERY PLAN
+--------------------------------------------------------
+ Update on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: ((a < 10) AND (a > 0) AND (a = 4))
+(3 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view2 WHERE aaa=4;
+ QUERY PLAN
+--------------------------------------------------------
+ Delete on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: ((a < 10) AND (a > 0) AND (a = 4))
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- view on top of view with rules
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a>0 OFFSET 0; -- not updatable without rules/triggers
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a<10;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | NO | NO
+ rw_view2 | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
+ DO INSTEAD INSERT INTO base_tbl VALUES (NEW.a, NEW.b) RETURNING *;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+ rw_view2 | YES
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | NO | YES
+ rw_view2 | NO | YES
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE RULE rw_view1_upd_rule AS ON UPDATE TO rw_view1
+ DO INSTEAD UPDATE base_tbl SET b=NEW.b WHERE a=OLD.a RETURNING NEW.*;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+ rw_view2 | YES
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | NO | YES
+ rw_view2 | NO | YES
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE RULE rw_view1_del_rule AS ON DELETE TO rw_view1
+ DO INSTEAD DELETE FROM base_tbl WHERE a=OLD.a RETURNING OLD.*;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+ rw_view2 | YES
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | YES | YES
+ rw_view2 | YES | YES
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | YES
+ rw_view1 | b | YES
+ rw_view2 | a | YES
+ rw_view2 | b | YES
+(4 rows)
+
+INSERT INTO rw_view2 VALUES (3, 'Row 3') RETURNING *;
+ a | b
+---+-------
+ 3 | Row 3
+(1 row)
+
+UPDATE rw_view2 SET b='Row three' WHERE a=3 RETURNING *;
+ a | b
+---+-----------
+ 3 | Row three
+(1 row)
+
+SELECT * FROM rw_view2;
+ a | b
+---+-----------
+ 1 | Row 1
+ 2 | Row 2
+ 3 | Row three
+(3 rows)
+
+DELETE FROM rw_view2 WHERE a=3 RETURNING *;
+ a | b
+---+-----------
+ 3 | Row three
+(1 row)
+
+SELECT * FROM rw_view2;
+ a | b
+---+-------
+ 1 | Row 1
+ 2 | Row 2
+(2 rows)
+
+EXPLAIN (costs off) UPDATE rw_view2 SET a=3 WHERE a=2;
+ QUERY PLAN
+----------------------------------------------------------------
+ Update on base_tbl
+ -> Nested Loop
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (a = 2)
+ -> Subquery Scan on rw_view1
+ Filter: ((rw_view1.a < 10) AND (rw_view1.a = 2))
+ -> Bitmap Heap Scan on base_tbl base_tbl_1
+ Recheck Cond: (a > 0)
+ -> Bitmap Index Scan on base_tbl_pkey
+ Index Cond: (a > 0)
+(10 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view2 WHERE a=2;
+ QUERY PLAN
+----------------------------------------------------------------
+ Delete on base_tbl
+ -> Nested Loop
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (a = 2)
+ -> Subquery Scan on rw_view1
+ Filter: ((rw_view1.a < 10) AND (rw_view1.a = 2))
+ -> Bitmap Heap Scan on base_tbl base_tbl_1
+ Recheck Cond: (a > 0)
+ -> Bitmap Index Scan on base_tbl_pkey
+ Index Cond: (a > 0)
+(10 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- view on top of view with triggers
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a>0 OFFSET 0; -- not updatable without rules/triggers
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a<10;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into,
+ is_trigger_updatable, is_trigger_deletable,
+ is_trigger_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ rw_view1 | NO | NO | NO | NO | NO
+ rw_view2 | NO | NO | NO | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE FUNCTION rw_view1_trig_fn()
+RETURNS trigger AS
+$$
+BEGIN
+ IF TG_OP = 'INSERT' THEN
+ INSERT INTO base_tbl VALUES (NEW.a, NEW.b);
+ RETURN NEW;
+ ELSIF TG_OP = 'UPDATE' THEN
+ UPDATE base_tbl SET b=NEW.b WHERE a=OLD.a;
+ RETURN NEW;
+ ELSIF TG_OP = 'DELETE' THEN
+ DELETE FROM base_tbl WHERE a=OLD.a;
+ RETURN OLD;
+ END IF;
+END;
+$$
+LANGUAGE plpgsql;
+CREATE TRIGGER rw_view1_ins_trig INSTEAD OF INSERT ON rw_view1
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into,
+ is_trigger_updatable, is_trigger_deletable,
+ is_trigger_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ rw_view1 | NO | NO | NO | NO | YES
+ rw_view2 | NO | NO | NO | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE TRIGGER rw_view1_upd_trig INSTEAD OF UPDATE ON rw_view1
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into,
+ is_trigger_updatable, is_trigger_deletable,
+ is_trigger_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ rw_view1 | NO | NO | YES | NO | YES
+ rw_view2 | NO | NO | NO | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE TRIGGER rw_view1_del_trig INSTEAD OF DELETE ON rw_view1
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into,
+ is_trigger_updatable, is_trigger_deletable,
+ is_trigger_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ rw_view1 | NO | NO | YES | YES | YES
+ rw_view2 | NO | NO | NO | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+INSERT INTO rw_view2 VALUES (3, 'Row 3') RETURNING *;
+ a | b
+---+-------
+ 3 | Row 3
+(1 row)
+
+UPDATE rw_view2 SET b='Row three' WHERE a=3 RETURNING *;
+ a | b
+---+-----------
+ 3 | Row three
+(1 row)
+
+SELECT * FROM rw_view2;
+ a | b
+---+-----------
+ 1 | Row 1
+ 2 | Row 2
+ 3 | Row three
+(3 rows)
+
+DELETE FROM rw_view2 WHERE a=3 RETURNING *;
+ a | b
+---+-----------
+ 3 | Row three
+(1 row)
+
+SELECT * FROM rw_view2;
+ a | b
+---+-------
+ 1 | Row 1
+ 2 | Row 2
+(2 rows)
+
+EXPLAIN (costs off) UPDATE rw_view2 SET a=3 WHERE a=2;
+ QUERY PLAN
+----------------------------------------------------------
+ Update on rw_view1 rw_view1_1
+ -> Subquery Scan on rw_view1
+ Filter: ((rw_view1.a < 10) AND (rw_view1.a = 2))
+ -> Bitmap Heap Scan on base_tbl
+ Recheck Cond: (a > 0)
+ -> Bitmap Index Scan on base_tbl_pkey
+ Index Cond: (a > 0)
+(7 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view2 WHERE a=2;
+ QUERY PLAN
+----------------------------------------------------------
+ Delete on rw_view1 rw_view1_1
+ -> Subquery Scan on rw_view1
+ Filter: ((rw_view1.a < 10) AND (rw_view1.a = 2))
+ -> Bitmap Heap Scan on base_tbl
+ Recheck Cond: (a > 0)
+ -> Bitmap Index Scan on base_tbl_pkey
+ Index Cond: (a > 0)
+(7 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+DROP FUNCTION rw_view1_trig_fn();
+-- update using whole row from view
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT b AS bb, a AS aa FROM base_tbl;
+CREATE FUNCTION rw_view1_aa(x rw_view1)
+ RETURNS int AS $$ SELECT x.aa $$ LANGUAGE sql;
+UPDATE rw_view1 v SET bb='Updated row 2' WHERE rw_view1_aa(v)=2
+ RETURNING rw_view1_aa(v), v.bb;
+ rw_view1_aa | bb
+-------------+---------------
+ 2 | Updated row 2
+(1 row)
+
+SELECT * FROM base_tbl;
+ a | b
+----+---------------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 2 | Updated row 2
+(5 rows)
+
+EXPLAIN (costs off)
+UPDATE rw_view1 v SET bb='Updated row 2' WHERE rw_view1_aa(v)=2
+ RETURNING rw_view1_aa(v), v.bb;
+ QUERY PLAN
+--------------------------------------------------
+ Update on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (a = 2)
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to function rw_view1_aa(rw_view1)
+-- permissions checks
+CREATE USER regress_view_user1;
+CREATE USER regress_view_user2;
+SET SESSION AUTHORIZATION regress_view_user1;
+CREATE TABLE base_tbl(a int, b text, c float);
+INSERT INTO base_tbl VALUES (1, 'Row 1', 1.0);
+CREATE VIEW rw_view1 AS SELECT b AS bb, c AS cc, a AS aa FROM base_tbl;
+INSERT INTO rw_view1 VALUES ('Row 2', 2.0, 2);
+GRANT SELECT ON base_tbl TO regress_view_user2;
+GRANT SELECT ON rw_view1 TO regress_view_user2;
+GRANT UPDATE (a,c) ON base_tbl TO regress_view_user2;
+GRANT UPDATE (bb,cc) ON rw_view1 TO regress_view_user2;
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user2;
+CREATE VIEW rw_view2 AS SELECT b AS bb, c AS cc, a AS aa FROM base_tbl;
+SELECT * FROM base_tbl; -- ok
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+ 2 | Row 2 | 2
+(2 rows)
+
+SELECT * FROM rw_view1; -- ok
+ bb | cc | aa
+-------+----+----
+ Row 1 | 1 | 1
+ Row 2 | 2 | 2
+(2 rows)
+
+SELECT * FROM rw_view2; -- ok
+ bb | cc | aa
+-------+----+----
+ Row 1 | 1 | 1
+ Row 2 | 2 | 2
+(2 rows)
+
+INSERT INTO base_tbl VALUES (3, 'Row 3', 3.0); -- not allowed
+ERROR: permission denied for table base_tbl
+INSERT INTO rw_view1 VALUES ('Row 3', 3.0, 3); -- not allowed
+ERROR: permission denied for view rw_view1
+INSERT INTO rw_view2 VALUES ('Row 3', 3.0, 3); -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE base_tbl SET a=a, c=c; -- ok
+UPDATE base_tbl SET b=b; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view1 SET bb=bb, cc=cc; -- ok
+UPDATE rw_view1 SET aa=aa; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET aa=aa, cc=cc; -- ok
+UPDATE rw_view2 SET bb=bb; -- not allowed
+ERROR: permission denied for table base_tbl
+DELETE FROM base_tbl; -- not allowed
+ERROR: permission denied for table base_tbl
+DELETE FROM rw_view1; -- not allowed
+ERROR: permission denied for view rw_view1
+DELETE FROM rw_view2; -- not allowed
+ERROR: permission denied for table base_tbl
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user1;
+GRANT INSERT, DELETE ON base_tbl TO regress_view_user2;
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user2;
+INSERT INTO base_tbl VALUES (3, 'Row 3', 3.0); -- ok
+INSERT INTO rw_view1 VALUES ('Row 4', 4.0, 4); -- not allowed
+ERROR: permission denied for view rw_view1
+INSERT INTO rw_view2 VALUES ('Row 4', 4.0, 4); -- ok
+DELETE FROM base_tbl WHERE a=1; -- ok
+DELETE FROM rw_view1 WHERE aa=2; -- not allowed
+ERROR: permission denied for view rw_view1
+DELETE FROM rw_view2 WHERE aa=2; -- ok
+SELECT * FROM base_tbl;
+ a | b | c
+---+-------+---
+ 3 | Row 3 | 3
+ 4 | Row 4 | 4
+(2 rows)
+
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user1;
+REVOKE INSERT, DELETE ON base_tbl FROM regress_view_user2;
+GRANT INSERT, DELETE ON rw_view1 TO regress_view_user2;
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user2;
+INSERT INTO base_tbl VALUES (5, 'Row 5', 5.0); -- not allowed
+ERROR: permission denied for table base_tbl
+INSERT INTO rw_view1 VALUES ('Row 5', 5.0, 5); -- ok
+INSERT INTO rw_view2 VALUES ('Row 6', 6.0, 6); -- not allowed
+ERROR: permission denied for table base_tbl
+DELETE FROM base_tbl WHERE a=3; -- not allowed
+ERROR: permission denied for table base_tbl
+DELETE FROM rw_view1 WHERE aa=3; -- ok
+DELETE FROM rw_view2 WHERE aa=4; -- not allowed
+ERROR: permission denied for table base_tbl
+SELECT * FROM base_tbl;
+ a | b | c
+---+-------+---
+ 4 | Row 4 | 4
+ 5 | Row 5 | 5
+(2 rows)
+
+RESET SESSION AUTHORIZATION;
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- nested-view permissions
+CREATE TABLE base_tbl(a int, b text, c float);
+INSERT INTO base_tbl VALUES (1, 'Row 1', 1.0);
+SET SESSION AUTHORIZATION regress_view_user1;
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl;
+SELECT * FROM rw_view1; -- not allowed
+ERROR: permission denied for table base_tbl
+SELECT * FROM rw_view1 FOR UPDATE; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view1 SET b = 'foo' WHERE a = 1; -- not allowed
+ERROR: permission denied for table base_tbl
+SET SESSION AUTHORIZATION regress_view_user2;
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1;
+SELECT * FROM rw_view2; -- not allowed
+ERROR: permission denied for view rw_view1
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for view rw_view1
+RESET SESSION AUTHORIZATION;
+GRANT SELECT ON base_tbl TO regress_view_user1;
+SET SESSION AUTHORIZATION regress_view_user1;
+SELECT * FROM rw_view1;
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+(1 row)
+
+SELECT * FROM rw_view1 FOR UPDATE; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view1 SET b = 'foo' WHERE a = 1; -- not allowed
+ERROR: permission denied for table base_tbl
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2; -- not allowed
+ERROR: permission denied for view rw_view1
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for view rw_view1
+SET SESSION AUTHORIZATION regress_view_user1;
+GRANT SELECT ON rw_view1 TO regress_view_user2;
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2;
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+(1 row)
+
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for view rw_view1
+RESET SESSION AUTHORIZATION;
+GRANT UPDATE ON base_tbl TO regress_view_user1;
+SET SESSION AUTHORIZATION regress_view_user1;
+SELECT * FROM rw_view1;
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+(1 row)
+
+SELECT * FROM rw_view1 FOR UPDATE;
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+(1 row)
+
+UPDATE rw_view1 SET b = 'foo' WHERE a = 1;
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2;
+ a | b | c
+---+-----+---
+ 1 | foo | 1
+(1 row)
+
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for view rw_view1
+SET SESSION AUTHORIZATION regress_view_user1;
+GRANT UPDATE ON rw_view1 TO regress_view_user2;
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2;
+ a | b | c
+---+-----+---
+ 1 | foo | 1
+(1 row)
+
+SELECT * FROM rw_view2 FOR UPDATE;
+ a | b | c
+---+-----+---
+ 1 | foo | 1
+(1 row)
+
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1;
+RESET SESSION AUTHORIZATION;
+REVOKE UPDATE ON base_tbl FROM regress_view_user1;
+SET SESSION AUTHORIZATION regress_view_user1;
+SELECT * FROM rw_view1;
+ a | b | c
+---+-----+---
+ 1 | bar | 1
+(1 row)
+
+SELECT * FROM rw_view1 FOR UPDATE; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view1 SET b = 'foo' WHERE a = 1; -- not allowed
+ERROR: permission denied for table base_tbl
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2;
+ a | b | c
+---+-----+---
+ 1 | bar | 1
+(1 row)
+
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for table base_tbl
+RESET SESSION AUTHORIZATION;
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+DROP USER regress_view_user1;
+DROP USER regress_view_user2;
+-- column defaults
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified', c serial);
+INSERT INTO base_tbl VALUES (1, 'Row 1');
+INSERT INTO base_tbl VALUES (2, 'Row 2');
+INSERT INTO base_tbl VALUES (3);
+CREATE VIEW rw_view1 AS SELECT a AS aa, b AS bb FROM base_tbl;
+ALTER VIEW rw_view1 ALTER COLUMN bb SET DEFAULT 'View default';
+INSERT INTO rw_view1 VALUES (4, 'Row 4');
+INSERT INTO rw_view1 (aa) VALUES (5);
+SELECT * FROM base_tbl;
+ a | b | c
+---+--------------+---
+ 1 | Row 1 | 1
+ 2 | Row 2 | 2
+ 3 | Unspecified | 3
+ 4 | Row 4 | 4
+ 5 | View default | 5
+(5 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- Table having triggers
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl VALUES (1, 'Row 1');
+INSERT INTO base_tbl VALUES (2, 'Row 2');
+CREATE FUNCTION rw_view1_trig_fn()
+RETURNS trigger AS
+$$
+BEGIN
+ IF TG_OP = 'INSERT' THEN
+ UPDATE base_tbl SET b=NEW.b WHERE a=1;
+ RETURN NULL;
+ END IF;
+ RETURN NULL;
+END;
+$$
+LANGUAGE plpgsql;
+CREATE TRIGGER rw_view1_ins_trig AFTER INSERT ON base_tbl
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+CREATE VIEW rw_view1 AS SELECT a AS aa, b AS bb FROM base_tbl;
+INSERT INTO rw_view1 VALUES (3, 'Row 3');
+select * from base_tbl;
+ a | b
+---+-------
+ 2 | Row 2
+ 3 | Row 3
+ 1 | Row 3
+(3 rows)
+
+DROP VIEW rw_view1;
+DROP TRIGGER rw_view1_ins_trig on base_tbl;
+DROP FUNCTION rw_view1_trig_fn();
+DROP TABLE base_tbl;
+-- view with ORDER BY
+CREATE TABLE base_tbl (a int, b int);
+INSERT INTO base_tbl VALUES (1,2), (4,5), (3,-3);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl ORDER BY a+b;
+SELECT * FROM rw_view1;
+ a | b
+---+----
+ 3 | -3
+ 1 | 2
+ 4 | 5
+(3 rows)
+
+INSERT INTO rw_view1 VALUES (7,-8);
+SELECT * FROM rw_view1;
+ a | b
+---+----
+ 7 | -8
+ 3 | -3
+ 1 | 2
+ 4 | 5
+(4 rows)
+
+EXPLAIN (verbose, costs off) UPDATE rw_view1 SET b = b + 1 RETURNING *;
+ QUERY PLAN
+-------------------------------------------------------------
+ Update on public.base_tbl
+ Output: base_tbl.a, base_tbl.b
+ -> Seq Scan on public.base_tbl
+ Output: base_tbl.a, (base_tbl.b + 1), base_tbl.ctid
+(4 rows)
+
+UPDATE rw_view1 SET b = b + 1 RETURNING *;
+ a | b
+---+----
+ 1 | 3
+ 4 | 6
+ 3 | -2
+ 7 | -7
+(4 rows)
+
+SELECT * FROM rw_view1;
+ a | b
+---+----
+ 7 | -7
+ 3 | -2
+ 1 | 3
+ 4 | 6
+(4 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- multiple array-column updates
+CREATE TABLE base_tbl (a int, arr int[]);
+INSERT INTO base_tbl VALUES (1,ARRAY[2]), (3,ARRAY[4]);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl;
+UPDATE rw_view1 SET arr[1] = 42, arr[2] = 77 WHERE a = 3;
+SELECT * FROM rw_view1;
+ a | arr
+---+---------
+ 1 | {2}
+ 3 | {42,77}
+(2 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- views with updatable and non-updatable columns
+CREATE TABLE base_tbl(a float);
+INSERT INTO base_tbl SELECT i/10.0 FROM generate_series(1,10) g(i);
+CREATE VIEW rw_view1 AS
+ SELECT ctid, sin(a) s, a, cos(a) c
+ FROM base_tbl
+ WHERE a != 0
+ ORDER BY abs(a);
+INSERT INTO rw_view1 VALUES (null, null, 1.1, null); -- should fail
+ERROR: cannot insert into column "ctid" of view "rw_view1"
+DETAIL: View columns that refer to system columns are not updatable.
+INSERT INTO rw_view1 (s, c, a) VALUES (null, null, 1.1); -- should fail
+ERROR: cannot insert into column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view1 (a) VALUES (1.1) RETURNING a, s, c; -- OK
+ a | s | c
+-----+-------------------+-------------------
+ 1.1 | 0.891207360061435 | 0.453596121425577
+(1 row)
+
+UPDATE rw_view1 SET s = s WHERE a = 1.1; -- should fail
+ERROR: cannot update column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view1 SET a = 1.05 WHERE a = 1.1 RETURNING s; -- OK
+ s
+-------------------
+ 0.867423225594017
+(1 row)
+
+DELETE FROM rw_view1 WHERE a = 1.05; -- OK
+CREATE VIEW rw_view2 AS
+ SELECT s, c, s/c t, a base_a, ctid
+ FROM rw_view1;
+INSERT INTO rw_view2 VALUES (null, null, null, 1.1, null); -- should fail
+ERROR: cannot insert into column "t" of view "rw_view2"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view2(s, c, base_a) VALUES (null, null, 1.1); -- should fail
+ERROR: cannot insert into column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view2(base_a) VALUES (1.1) RETURNING t; -- OK
+ t
+------------------
+ 1.96475965724865
+(1 row)
+
+UPDATE rw_view2 SET s = s WHERE base_a = 1.1; -- should fail
+ERROR: cannot update column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view2 SET t = t WHERE base_a = 1.1; -- should fail
+ERROR: cannot update column "t" of view "rw_view2"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view2 SET base_a = 1.05 WHERE base_a = 1.1; -- OK
+DELETE FROM rw_view2 WHERE base_a = 1.05 RETURNING base_a, s, c, t; -- OK
+ base_a | s | c | t
+--------+-------------------+-------------------+------------------
+ 1.05 | 0.867423225594017 | 0.497571047891727 | 1.74331530998317
+(1 row)
+
+CREATE VIEW rw_view3 AS
+ SELECT s, c, s/c t, ctid
+ FROM rw_view1;
+INSERT INTO rw_view3 VALUES (null, null, null, null); -- should fail
+ERROR: cannot insert into column "t" of view "rw_view3"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view3(s) VALUES (null); -- should fail
+ERROR: cannot insert into column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view3 SET s = s; -- should fail
+ERROR: cannot update column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+DELETE FROM rw_view3 WHERE s = sin(0.1); -- should be OK
+SELECT * FROM base_tbl ORDER BY a;
+ a
+-----
+ 0.2
+ 0.3
+ 0.4
+ 0.5
+ 0.6
+ 0.7
+ 0.8
+ 0.9
+ 1
+(9 rows)
+
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+ rw_view2 | YES
+ rw_view3 | NO
+(3 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | YES | YES
+ rw_view2 | YES | YES
+ rw_view3 | NO | NO
+(3 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | ctid | NO
+ rw_view1 | s | NO
+ rw_view1 | a | YES
+ rw_view1 | c | NO
+ rw_view2 | s | NO
+ rw_view2 | c | NO
+ rw_view2 | t | NO
+ rw_view2 | base_a | YES
+ rw_view2 | ctid | NO
+ rw_view3 | s | NO
+ rw_view3 | c | NO
+ rw_view3 | t | NO
+ rw_view3 | ctid | NO
+(13 rows)
+
+SELECT events & 4 != 0 AS upd,
+ events & 8 != 0 AS ins,
+ events & 16 != 0 AS del
+ FROM pg_catalog.pg_relation_is_updatable('rw_view3'::regclass, false) t(events);
+ upd | ins | del
+-----+-----+-----
+ f | f | t
+(1 row)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+drop cascades to view rw_view3
+-- inheritance tests
+CREATE TABLE base_tbl_parent (a int);
+CREATE TABLE base_tbl_child (CHECK (a > 0)) INHERITS (base_tbl_parent);
+INSERT INTO base_tbl_parent SELECT * FROM generate_series(-8, -1);
+INSERT INTO base_tbl_child SELECT * FROM generate_series(1, 8);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl_parent;
+CREATE VIEW rw_view2 AS SELECT * FROM ONLY base_tbl_parent;
+SELECT * FROM rw_view1 ORDER BY a;
+ a
+----
+ -8
+ -7
+ -6
+ -5
+ -4
+ -3
+ -2
+ -1
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+(16 rows)
+
+SELECT * FROM ONLY rw_view1 ORDER BY a;
+ a
+----
+ -8
+ -7
+ -6
+ -5
+ -4
+ -3
+ -2
+ -1
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+(16 rows)
+
+SELECT * FROM rw_view2 ORDER BY a;
+ a
+----
+ -8
+ -7
+ -6
+ -5
+ -4
+ -3
+ -2
+ -1
+(8 rows)
+
+INSERT INTO rw_view1 VALUES (-100), (100);
+INSERT INTO rw_view2 VALUES (-200), (200);
+UPDATE rw_view1 SET a = a*10 WHERE a IN (-1, 1); -- Should produce -10 and 10
+UPDATE ONLY rw_view1 SET a = a*10 WHERE a IN (-2, 2); -- Should produce -20 and 20
+UPDATE rw_view2 SET a = a*10 WHERE a IN (-3, 3); -- Should produce -30 only
+UPDATE ONLY rw_view2 SET a = a*10 WHERE a IN (-4, 4); -- Should produce -40 only
+DELETE FROM rw_view1 WHERE a IN (-5, 5); -- Should delete -5 and 5
+DELETE FROM ONLY rw_view1 WHERE a IN (-6, 6); -- Should delete -6 and 6
+DELETE FROM rw_view2 WHERE a IN (-7, 7); -- Should delete -7 only
+DELETE FROM ONLY rw_view2 WHERE a IN (-8, 8); -- Should delete -8 only
+SELECT * FROM ONLY base_tbl_parent ORDER BY a;
+ a
+------
+ -200
+ -100
+ -40
+ -30
+ -20
+ -10
+ 100
+ 200
+(8 rows)
+
+SELECT * FROM base_tbl_child ORDER BY a;
+ a
+----
+ 3
+ 4
+ 7
+ 8
+ 10
+ 20
+(6 rows)
+
+CREATE TABLE other_tbl_parent (id int);
+CREATE TABLE other_tbl_child () INHERITS (other_tbl_parent);
+INSERT INTO other_tbl_parent VALUES (7),(200);
+INSERT INTO other_tbl_child VALUES (8),(100);
+EXPLAIN (costs off)
+UPDATE rw_view1 SET a = a + 1000 FROM other_tbl_parent WHERE a = id;
+ QUERY PLAN
+---------------------------------------------------------------
+ Update on base_tbl_parent
+ Update on base_tbl_parent
+ Update on base_tbl_child
+ -> Merge Join
+ Merge Cond: (base_tbl_parent.a = other_tbl_parent.id)
+ -> Sort
+ Sort Key: base_tbl_parent.a
+ -> Seq Scan on base_tbl_parent
+ -> Sort
+ Sort Key: other_tbl_parent.id
+ -> Append
+ -> Seq Scan on other_tbl_parent
+ -> Seq Scan on other_tbl_child
+ -> Merge Join
+ Merge Cond: (base_tbl_child.a = other_tbl_parent.id)
+ -> Sort
+ Sort Key: base_tbl_child.a
+ -> Seq Scan on base_tbl_child
+ -> Sort
+ Sort Key: other_tbl_parent.id
+ -> Append
+ -> Seq Scan on other_tbl_parent
+ -> Seq Scan on other_tbl_child
+(23 rows)
+
+UPDATE rw_view1 SET a = a + 1000 FROM other_tbl_parent WHERE a = id;
+SELECT * FROM ONLY base_tbl_parent ORDER BY a;
+ a
+------
+ -200
+ -100
+ -40
+ -30
+ -20
+ -10
+ 1100
+ 1200
+(8 rows)
+
+SELECT * FROM base_tbl_child ORDER BY a;
+ a
+------
+ 3
+ 4
+ 10
+ 20
+ 1007
+ 1008
+(6 rows)
+
+DROP TABLE base_tbl_parent, base_tbl_child CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+DROP TABLE other_tbl_parent CASCADE;
+NOTICE: drop cascades to table other_tbl_child
+-- simple WITH CHECK OPTION
+CREATE TABLE base_tbl (a int, b int DEFAULT 10);
+INSERT INTO base_tbl VALUES (1,2), (2,3), (1,-1);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a < b
+ WITH LOCAL CHECK OPTION;
+\d+ rw_view1
+ View "public.rw_view1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+ b | integer | | | | plain |
+View definition:
+ SELECT base_tbl.a,
+ base_tbl.b
+ FROM base_tbl
+ WHERE base_tbl.a < base_tbl.b;
+Options: check_option=local
+
+SELECT * FROM information_schema.views WHERE table_name = 'rw_view1';
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+------------------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view1 | SELECT base_tbl.a, +| LOCAL | YES | YES | NO | NO | NO
+ | | | base_tbl.b +| | | | | |
+ | | | FROM base_tbl +| | | | | |
+ | | | WHERE (base_tbl.a < base_tbl.b); | | | | | |
+(1 row)
+
+INSERT INTO rw_view1 VALUES(3,4); -- ok
+INSERT INTO rw_view1 VALUES(4,3); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (4, 3).
+INSERT INTO rw_view1 VALUES(5,null); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (5, null).
+UPDATE rw_view1 SET b = 5 WHERE a = 3; -- ok
+UPDATE rw_view1 SET b = -5 WHERE a = 3; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (3, -5).
+INSERT INTO rw_view1(a) VALUES (9); -- ok
+INSERT INTO rw_view1(a) VALUES (10); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (10, 10).
+SELECT * FROM base_tbl;
+ a | b
+---+----
+ 1 | 2
+ 2 | 3
+ 1 | -1
+ 3 | 5
+ 9 | 10
+(5 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- WITH LOCAL/CASCADED CHECK OPTION
+CREATE TABLE base_tbl (a int);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a > 0;
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a < 10
+ WITH CHECK OPTION; -- implicitly cascaded
+\d+ rw_view2
+ View "public.rw_view2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+View definition:
+ SELECT rw_view1.a
+ FROM rw_view1
+ WHERE rw_view1.a < 10;
+Options: check_option=cascaded
+
+SELECT * FROM information_schema.views WHERE table_name = 'rw_view2';
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+----------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view2 | SELECT rw_view1.a +| CASCADED | YES | YES | NO | NO | NO
+ | | | FROM rw_view1 +| | | | | |
+ | | | WHERE (rw_view1.a < 10); | | | | | |
+(1 row)
+
+INSERT INTO rw_view2 VALUES (-5); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (-5).
+INSERT INTO rw_view2 VALUES (5); -- ok
+INSERT INTO rw_view2 VALUES (15); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (15).
+SELECT * FROM base_tbl;
+ a
+---
+ 5
+(1 row)
+
+UPDATE rw_view2 SET a = a - 10; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (-5).
+UPDATE rw_view2 SET a = a + 10; -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (15).
+CREATE OR REPLACE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a < 10
+ WITH LOCAL CHECK OPTION;
+\d+ rw_view2
+ View "public.rw_view2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+View definition:
+ SELECT rw_view1.a
+ FROM rw_view1
+ WHERE rw_view1.a < 10;
+Options: check_option=local
+
+SELECT * FROM information_schema.views WHERE table_name = 'rw_view2';
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+----------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view2 | SELECT rw_view1.a +| LOCAL | YES | YES | NO | NO | NO
+ | | | FROM rw_view1 +| | | | | |
+ | | | WHERE (rw_view1.a < 10); | | | | | |
+(1 row)
+
+INSERT INTO rw_view2 VALUES (-10); -- ok, but not in view
+INSERT INTO rw_view2 VALUES (20); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (20).
+SELECT * FROM base_tbl;
+ a
+-----
+ 5
+ -10
+(2 rows)
+
+ALTER VIEW rw_view1 SET (check_option=here); -- invalid
+ERROR: invalid value for enum option "check_option": here
+DETAIL: Valid values are "local" and "cascaded".
+ALTER VIEW rw_view1 SET (check_option=local);
+INSERT INTO rw_view2 VALUES (-20); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (-20).
+INSERT INTO rw_view2 VALUES (30); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (30).
+ALTER VIEW rw_view2 RESET (check_option);
+\d+ rw_view2
+ View "public.rw_view2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+View definition:
+ SELECT rw_view1.a
+ FROM rw_view1
+ WHERE rw_view1.a < 10;
+
+SELECT * FROM information_schema.views WHERE table_name = 'rw_view2';
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+----------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view2 | SELECT rw_view1.a +| NONE | YES | YES | NO | NO | NO
+ | | | FROM rw_view1 +| | | | | |
+ | | | WHERE (rw_view1.a < 10); | | | | | |
+(1 row)
+
+INSERT INTO rw_view2 VALUES (30); -- ok, but not in view
+SELECT * FROM base_tbl;
+ a
+-----
+ 5
+ -10
+ 30
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- WITH CHECK OPTION with no local view qual
+CREATE TABLE base_tbl (a int);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WITH CHECK OPTION;
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a > 0;
+CREATE VIEW rw_view3 AS SELECT * FROM rw_view2 WITH CHECK OPTION;
+SELECT * FROM information_schema.views WHERE table_name LIKE E'rw\\_view_' ORDER BY table_name;
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+---------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view1 | SELECT base_tbl.a +| CASCADED | YES | YES | NO | NO | NO
+ | | | FROM base_tbl; | | | | | |
+ regression | public | rw_view2 | SELECT rw_view1.a +| NONE | YES | YES | NO | NO | NO
+ | | | FROM rw_view1 +| | | | | |
+ | | | WHERE (rw_view1.a > 0); | | | | | |
+ regression | public | rw_view3 | SELECT rw_view2.a +| CASCADED | YES | YES | NO | NO | NO
+ | | | FROM rw_view2; | | | | | |
+(3 rows)
+
+INSERT INTO rw_view1 VALUES (-1); -- ok
+INSERT INTO rw_view1 VALUES (1); -- ok
+INSERT INTO rw_view2 VALUES (-2); -- ok, but not in view
+INSERT INTO rw_view2 VALUES (2); -- ok
+INSERT INTO rw_view3 VALUES (-3); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (-3).
+INSERT INTO rw_view3 VALUES (3); -- ok
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+drop cascades to view rw_view3
+-- WITH CHECK OPTION with scalar array ops
+CREATE TABLE base_tbl (a int, b int[]);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a = ANY (b)
+ WITH CHECK OPTION;
+INSERT INTO rw_view1 VALUES (1, ARRAY[1,2,3]); -- ok
+INSERT INTO rw_view1 VALUES (10, ARRAY[4,5]); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (10, {4,5}).
+UPDATE rw_view1 SET b[2] = -b[2] WHERE a = 1; -- ok
+UPDATE rw_view1 SET b[1] = -b[1] WHERE a = 1; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (1, {-1,-2,3}).
+PREPARE ins(int, int[]) AS INSERT INTO rw_view1 VALUES($1, $2);
+EXECUTE ins(2, ARRAY[1,2,3]); -- ok
+EXECUTE ins(10, ARRAY[4,5]); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (10, {4,5}).
+DEALLOCATE PREPARE ins;
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- WITH CHECK OPTION with subquery
+CREATE TABLE base_tbl (a int);
+CREATE TABLE ref_tbl (a int PRIMARY KEY);
+INSERT INTO ref_tbl SELECT * FROM generate_series(1,10);
+CREATE VIEW rw_view1 AS
+ SELECT * FROM base_tbl b
+ WHERE EXISTS(SELECT 1 FROM ref_tbl r WHERE r.a = b.a)
+ WITH CHECK OPTION;
+INSERT INTO rw_view1 VALUES (5); -- ok
+INSERT INTO rw_view1 VALUES (15); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (15).
+UPDATE rw_view1 SET a = a + 5; -- ok
+UPDATE rw_view1 SET a = a + 5; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (15).
+EXPLAIN (costs off) INSERT INTO rw_view1 VALUES (5);
+ QUERY PLAN
+---------------------------------------------------------
+ Insert on base_tbl b
+ -> Result
+ SubPlan 1
+ -> Index Only Scan using ref_tbl_pkey on ref_tbl r
+ Index Cond: (a = b.a)
+ SubPlan 2
+ -> Seq Scan on ref_tbl r_1
+(7 rows)
+
+EXPLAIN (costs off) UPDATE rw_view1 SET a = a + 5;
+ QUERY PLAN
+-----------------------------------------------------------
+ Update on base_tbl b
+ -> Hash Join
+ Hash Cond: (b.a = r.a)
+ -> Seq Scan on base_tbl b
+ -> Hash
+ -> Seq Scan on ref_tbl r
+ SubPlan 1
+ -> Index Only Scan using ref_tbl_pkey on ref_tbl r_1
+ Index Cond: (a = b.a)
+ SubPlan 2
+ -> Seq Scan on ref_tbl r_2
+(11 rows)
+
+DROP TABLE base_tbl, ref_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- WITH CHECK OPTION with BEFORE trigger on base table
+CREATE TABLE base_tbl (a int, b int);
+CREATE FUNCTION base_tbl_trig_fn()
+RETURNS trigger AS
+$$
+BEGIN
+ NEW.b := 10;
+ RETURN NEW;
+END;
+$$
+LANGUAGE plpgsql;
+CREATE TRIGGER base_tbl_trig BEFORE INSERT OR UPDATE ON base_tbl
+ FOR EACH ROW EXECUTE PROCEDURE base_tbl_trig_fn();
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a < b WITH CHECK OPTION;
+INSERT INTO rw_view1 VALUES (5,0); -- ok
+INSERT INTO rw_view1 VALUES (15, 20); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (15, 10).
+UPDATE rw_view1 SET a = 20, b = 30; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (20, 10).
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+DROP FUNCTION base_tbl_trig_fn();
+-- WITH LOCAL CHECK OPTION with INSTEAD OF trigger on base view
+CREATE TABLE base_tbl (a int, b int);
+CREATE VIEW rw_view1 AS SELECT a FROM base_tbl WHERE a < b;
+CREATE FUNCTION rw_view1_trig_fn()
+RETURNS trigger AS
+$$
+BEGIN
+ IF TG_OP = 'INSERT' THEN
+ INSERT INTO base_tbl VALUES (NEW.a, 10);
+ RETURN NEW;
+ ELSIF TG_OP = 'UPDATE' THEN
+ UPDATE base_tbl SET a=NEW.a WHERE a=OLD.a;
+ RETURN NEW;
+ ELSIF TG_OP = 'DELETE' THEN
+ DELETE FROM base_tbl WHERE a=OLD.a;
+ RETURN OLD;
+ END IF;
+END;
+$$
+LANGUAGE plpgsql;
+CREATE TRIGGER rw_view1_trig
+ INSTEAD OF INSERT OR UPDATE OR DELETE ON rw_view1
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+CREATE VIEW rw_view2 AS
+ SELECT * FROM rw_view1 WHERE a > 0 WITH LOCAL CHECK OPTION;
+INSERT INTO rw_view2 VALUES (-5); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (-5).
+INSERT INTO rw_view2 VALUES (5); -- ok
+INSERT INTO rw_view2 VALUES (50); -- ok, but not in view
+UPDATE rw_view2 SET a = a - 10; -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (-5).
+SELECT * FROM base_tbl;
+ a | b
+----+----
+ 5 | 10
+ 50 | 10
+(2 rows)
+
+-- Check option won't cascade down to base view with INSTEAD OF triggers
+ALTER VIEW rw_view2 SET (check_option=cascaded);
+INSERT INTO rw_view2 VALUES (100); -- ok, but not in view (doesn't fail rw_view1's check)
+UPDATE rw_view2 SET a = 200 WHERE a = 5; -- ok, but not in view (doesn't fail rw_view1's check)
+SELECT * FROM base_tbl;
+ a | b
+-----+----
+ 50 | 10
+ 100 | 10
+ 200 | 10
+(3 rows)
+
+-- Neither local nor cascaded check options work with INSTEAD rules
+DROP TRIGGER rw_view1_trig ON rw_view1;
+CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
+ DO INSTEAD INSERT INTO base_tbl VALUES (NEW.a, 10);
+CREATE RULE rw_view1_upd_rule AS ON UPDATE TO rw_view1
+ DO INSTEAD UPDATE base_tbl SET a=NEW.a WHERE a=OLD.a;
+INSERT INTO rw_view2 VALUES (-10); -- ok, but not in view (doesn't fail rw_view2's check)
+INSERT INTO rw_view2 VALUES (5); -- ok
+INSERT INTO rw_view2 VALUES (20); -- ok, but not in view (doesn't fail rw_view1's check)
+UPDATE rw_view2 SET a = 30 WHERE a = 5; -- ok, but not in view (doesn't fail rw_view1's check)
+INSERT INTO rw_view2 VALUES (5); -- ok
+UPDATE rw_view2 SET a = -5 WHERE a = 5; -- ok, but not in view (doesn't fail rw_view2's check)
+SELECT * FROM base_tbl;
+ a | b
+-----+----
+ 50 | 10
+ 100 | 10
+ 200 | 10
+ -10 | 10
+ 20 | 10
+ 30 | 10
+ -5 | 10
+(7 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+DROP FUNCTION rw_view1_trig_fn();
+CREATE TABLE base_tbl (a int);
+CREATE VIEW rw_view1 AS SELECT a,10 AS b FROM base_tbl;
+CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
+ DO INSTEAD INSERT INTO base_tbl VALUES (NEW.a);
+CREATE VIEW rw_view2 AS
+ SELECT * FROM rw_view1 WHERE a > b WITH LOCAL CHECK OPTION;
+INSERT INTO rw_view2 VALUES (2,3); -- ok, but not in view (doesn't fail rw_view2's check)
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- security barrier view
+CREATE TABLE base_tbl (person text, visibility text);
+INSERT INTO base_tbl VALUES ('Tom', 'public'),
+ ('Dick', 'private'),
+ ('Harry', 'public');
+CREATE VIEW rw_view1 AS
+ SELECT person FROM base_tbl WHERE visibility = 'public';
+CREATE FUNCTION snoop(anyelement)
+RETURNS boolean AS
+$$
+BEGIN
+ RAISE NOTICE 'snooped value: %', $1;
+ RETURN true;
+END;
+$$
+LANGUAGE plpgsql COST 0.000001;
+CREATE OR REPLACE FUNCTION leakproof(anyelement)
+RETURNS boolean AS
+$$
+BEGIN
+ RETURN true;
+END;
+$$
+LANGUAGE plpgsql STRICT IMMUTABLE LEAKPROOF;
+SELECT * FROM rw_view1 WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Dick
+NOTICE: snooped value: Harry
+ person
+--------
+ Tom
+ Harry
+(2 rows)
+
+UPDATE rw_view1 SET person=person WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Dick
+NOTICE: snooped value: Harry
+DELETE FROM rw_view1 WHERE NOT snoop(person);
+NOTICE: snooped value: Dick
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+ALTER VIEW rw_view1 SET (security_barrier = true);
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name = 'rw_view1';
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+(1 row)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name = 'rw_view1';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | YES | YES
+(1 row)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name = 'rw_view1'
+ ORDER BY ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | person | YES
+(1 row)
+
+SELECT * FROM rw_view1 WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+ person
+--------
+ Tom
+ Harry
+(2 rows)
+
+UPDATE rw_view1 SET person=person WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+DELETE FROM rw_view1 WHERE NOT snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+EXPLAIN (costs off) SELECT * FROM rw_view1 WHERE snoop(person);
+ QUERY PLAN
+-----------------------------------------------
+ Subquery Scan on rw_view1
+ Filter: snoop(rw_view1.person)
+ -> Seq Scan on base_tbl
+ Filter: (visibility = 'public'::text)
+(4 rows)
+
+EXPLAIN (costs off) UPDATE rw_view1 SET person=person WHERE snoop(person);
+ QUERY PLAN
+-------------------------------------------------------------------
+ Update on base_tbl
+ -> Seq Scan on base_tbl
+ Filter: ((visibility = 'public'::text) AND snoop(person))
+(3 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view1 WHERE NOT snoop(person);
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Delete on base_tbl
+ -> Seq Scan on base_tbl
+ Filter: ((visibility = 'public'::text) AND (NOT snoop(person)))
+(3 rows)
+
+-- security barrier view on top of security barrier view
+CREATE VIEW rw_view2 WITH (security_barrier = true) AS
+ SELECT * FROM rw_view1 WHERE snoop(person);
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name = 'rw_view2';
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view2 | YES
+(1 row)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name = 'rw_view2';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view2 | YES | YES
+(1 row)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name = 'rw_view2'
+ ORDER BY ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view2 | person | YES
+(1 row)
+
+SELECT * FROM rw_view2 WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+NOTICE: snooped value: Harry
+ person
+--------
+ Tom
+ Harry
+(2 rows)
+
+UPDATE rw_view2 SET person=person WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+NOTICE: snooped value: Harry
+DELETE FROM rw_view2 WHERE NOT snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+NOTICE: snooped value: Harry
+EXPLAIN (costs off) SELECT * FROM rw_view2 WHERE snoop(person);
+ QUERY PLAN
+-----------------------------------------------------
+ Subquery Scan on rw_view2
+ Filter: snoop(rw_view2.person)
+ -> Subquery Scan on rw_view1
+ Filter: snoop(rw_view1.person)
+ -> Seq Scan on base_tbl
+ Filter: (visibility = 'public'::text)
+(6 rows)
+
+EXPLAIN (costs off) UPDATE rw_view2 SET person=person WHERE snoop(person);
+ QUERY PLAN
+-------------------------------------------------------------------------------------
+ Update on base_tbl
+ -> Seq Scan on base_tbl
+ Filter: ((visibility = 'public'::text) AND snoop(person) AND snoop(person))
+(3 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view2 WHERE NOT snoop(person);
+ QUERY PLAN
+-------------------------------------------------------------------------------------------
+ Delete on base_tbl
+ -> Seq Scan on base_tbl
+ Filter: ((visibility = 'public'::text) AND snoop(person) AND (NOT snoop(person)))
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- security barrier view on top of table with rules
+CREATE TABLE base_tbl(id int PRIMARY KEY, data text, deleted boolean);
+INSERT INTO base_tbl VALUES (1, 'Row 1', false), (2, 'Row 2', true);
+CREATE RULE base_tbl_ins_rule AS ON INSERT TO base_tbl
+ WHERE EXISTS (SELECT 1 FROM base_tbl t WHERE t.id = new.id)
+ DO INSTEAD
+ UPDATE base_tbl SET data = new.data, deleted = false WHERE id = new.id;
+CREATE RULE base_tbl_del_rule AS ON DELETE TO base_tbl
+ DO INSTEAD
+ UPDATE base_tbl SET deleted = true WHERE id = old.id;
+CREATE VIEW rw_view1 WITH (security_barrier=true) AS
+ SELECT id, data FROM base_tbl WHERE NOT deleted;
+SELECT * FROM rw_view1;
+ id | data
+----+-------
+ 1 | Row 1
+(1 row)
+
+EXPLAIN (costs off) DELETE FROM rw_view1 WHERE id = 1 AND snoop(data);
+ QUERY PLAN
+-------------------------------------------------------------------
+ Update on base_tbl base_tbl_1
+ -> Nested Loop
+ -> Index Scan using base_tbl_pkey on base_tbl base_tbl_1
+ Index Cond: (id = 1)
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (id = 1)
+ Filter: ((NOT deleted) AND snoop(data))
+(7 rows)
+
+DELETE FROM rw_view1 WHERE id = 1 AND snoop(data);
+NOTICE: snooped value: Row 1
+EXPLAIN (costs off) INSERT INTO rw_view1 VALUES (2, 'New row 2');
+ QUERY PLAN
+-----------------------------------------------------------
+ Insert on base_tbl
+ InitPlan 1 (returns $0)
+ -> Index Only Scan using base_tbl_pkey on base_tbl t
+ Index Cond: (id = 2)
+ -> Result
+ One-Time Filter: ($0 IS NOT TRUE)
+
+ Update on base_tbl
+ InitPlan 1 (returns $0)
+ -> Index Only Scan using base_tbl_pkey on base_tbl t
+ Index Cond: (id = 2)
+ -> Result
+ One-Time Filter: $0
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (id = 2)
+(15 rows)
+
+INSERT INTO rw_view1 VALUES (2, 'New row 2');
+SELECT * FROM base_tbl;
+ id | data | deleted
+----+-----------+---------
+ 1 | Row 1 | t
+ 2 | New row 2 | f
+(2 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- security barrier view based on inheritance set
+CREATE TABLE t1 (a int, b float, c text);
+CREATE INDEX t1_a_idx ON t1(a);
+INSERT INTO t1
+SELECT i,i,'t1' FROM generate_series(1,10) g(i);
+ANALYZE t1;
+CREATE TABLE t11 (d text) INHERITS (t1);
+CREATE INDEX t11_a_idx ON t11(a);
+INSERT INTO t11
+SELECT i,i,'t11','t11d' FROM generate_series(1,10) g(i);
+ANALYZE t11;
+CREATE TABLE t12 (e int[]) INHERITS (t1);
+CREATE INDEX t12_a_idx ON t12(a);
+INSERT INTO t12
+SELECT i,i,'t12','{1,2}'::int[] FROM generate_series(1,10) g(i);
+ANALYZE t12;
+CREATE TABLE t111 () INHERITS (t11, t12);
+NOTICE: merging multiple inherited definitions of column "a"
+NOTICE: merging multiple inherited definitions of column "b"
+NOTICE: merging multiple inherited definitions of column "c"
+CREATE INDEX t111_a_idx ON t111(a);
+INSERT INTO t111
+SELECT i,i,'t111','t111d','{1,1,1}'::int[] FROM generate_series(1,10) g(i);
+ANALYZE t111;
+CREATE VIEW v1 WITH (security_barrier=true) AS
+SELECT *, (SELECT d FROM t11 WHERE t11.a = t1.a LIMIT 1) AS d
+FROM t1
+WHERE a > 5 AND EXISTS(SELECT 1 FROM t12 WHERE t12.a = t1.a);
+SELECT * FROM v1 WHERE a=3; -- should not see anything
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM v1 WHERE a=8;
+ a | b | c | d
+---+---+------+------
+ 8 | 8 | t1 | t11d
+ 8 | 8 | t11 | t11d
+ 8 | 8 | t12 | t11d
+ 8 | 8 | t111 | t11d
+(4 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+UPDATE v1 SET a=100 WHERE snoop(a) AND leakproof(a) AND a < 7 AND a != 6;
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------------------------
+ Update on public.t1
+ Update on public.t1
+ Update on public.t11
+ Update on public.t12
+ Update on public.t111
+ -> Index Scan using t1_a_idx on public.t1
+ Output: 100, t1.b, t1.c, t1.ctid
+ Index Cond: ((t1.a > 5) AND (t1.a < 7))
+ Filter: ((t1.a <> 6) AND (alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t1.a) AND leakproof(t1.a))
+ SubPlan 1
+ -> Append
+ -> Seq Scan on public.t12 t12_1
+ Filter: (t12_1.a = t1.a)
+ -> Seq Scan on public.t111 t111_1
+ Filter: (t111_1.a = t1.a)
+ SubPlan 2
+ -> Append
+ -> Seq Scan on public.t12 t12_2
+ Output: t12_2.a
+ -> Seq Scan on public.t111 t111_2
+ Output: t111_2.a
+ -> Index Scan using t11_a_idx on public.t11
+ Output: 100, t11.b, t11.c, t11.d, t11.ctid
+ Index Cond: ((t11.a > 5) AND (t11.a < 7))
+ Filter: ((t11.a <> 6) AND (alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t11.a) AND leakproof(t11.a))
+ -> Index Scan using t12_a_idx on public.t12
+ Output: 100, t12.b, t12.c, t12.e, t12.ctid
+ Index Cond: ((t12.a > 5) AND (t12.a < 7))
+ Filter: ((t12.a <> 6) AND (alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t12.a) AND leakproof(t12.a))
+ -> Index Scan using t111_a_idx on public.t111
+ Output: 100, t111.b, t111.c, t111.d, t111.e, t111.ctid
+ Index Cond: ((t111.a > 5) AND (t111.a < 7))
+ Filter: ((t111.a <> 6) AND (alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t111.a) AND leakproof(t111.a))
+(33 rows)
+
+UPDATE v1 SET a=100 WHERE snoop(a) AND leakproof(a) AND a < 7 AND a != 6;
+SELECT * FROM v1 WHERE a=100; -- Nothing should have been changed to 100
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM t1 WHERE a=100; -- Nothing should have been changed to 100
+ a | b | c
+---+---+---
+(0 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+UPDATE v1 SET a=a+1 WHERE snoop(a) AND leakproof(a) AND a = 8;
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
+ Update on public.t1
+ Update on public.t1
+ Update on public.t11
+ Update on public.t12
+ Update on public.t111
+ -> Index Scan using t1_a_idx on public.t1
+ Output: (t1.a + 1), t1.b, t1.c, t1.ctid
+ Index Cond: ((t1.a > 5) AND (t1.a = 8))
+ Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t1.a) AND leakproof(t1.a))
+ SubPlan 1
+ -> Append
+ -> Seq Scan on public.t12 t12_1
+ Filter: (t12_1.a = t1.a)
+ -> Seq Scan on public.t111 t111_1
+ Filter: (t111_1.a = t1.a)
+ SubPlan 2
+ -> Append
+ -> Seq Scan on public.t12 t12_2
+ Output: t12_2.a
+ -> Seq Scan on public.t111 t111_2
+ Output: t111_2.a
+ -> Index Scan using t11_a_idx on public.t11
+ Output: (t11.a + 1), t11.b, t11.c, t11.d, t11.ctid
+ Index Cond: ((t11.a > 5) AND (t11.a = 8))
+ Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t11.a) AND leakproof(t11.a))
+ -> Index Scan using t12_a_idx on public.t12
+ Output: (t12.a + 1), t12.b, t12.c, t12.e, t12.ctid
+ Index Cond: ((t12.a > 5) AND (t12.a = 8))
+ Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t12.a) AND leakproof(t12.a))
+ -> Index Scan using t111_a_idx on public.t111
+ Output: (t111.a + 1), t111.b, t111.c, t111.d, t111.e, t111.ctid
+ Index Cond: ((t111.a > 5) AND (t111.a = 8))
+ Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t111.a) AND leakproof(t111.a))
+(33 rows)
+
+UPDATE v1 SET a=a+1 WHERE snoop(a) AND leakproof(a) AND a = 8;
+NOTICE: snooped value: 8
+NOTICE: snooped value: 8
+NOTICE: snooped value: 8
+NOTICE: snooped value: 8
+SELECT * FROM v1 WHERE b=8;
+ a | b | c | d
+---+---+------+------
+ 9 | 8 | t1 | t11d
+ 9 | 8 | t11 | t11d
+ 9 | 8 | t12 | t11d
+ 9 | 8 | t111 | t11d
+(4 rows)
+
+DELETE FROM v1 WHERE snoop(a) AND leakproof(a); -- should not delete everything, just where a>5
+NOTICE: snooped value: 6
+NOTICE: snooped value: 7
+NOTICE: snooped value: 9
+NOTICE: snooped value: 10
+NOTICE: snooped value: 9
+NOTICE: snooped value: 6
+NOTICE: snooped value: 7
+NOTICE: snooped value: 9
+NOTICE: snooped value: 9
+NOTICE: snooped value: 10
+NOTICE: snooped value: 6
+NOTICE: snooped value: 7
+NOTICE: snooped value: 9
+NOTICE: snooped value: 9
+NOTICE: snooped value: 10
+NOTICE: snooped value: 6
+NOTICE: snooped value: 7
+NOTICE: snooped value: 9
+NOTICE: snooped value: 9
+NOTICE: snooped value: 10
+TABLE t1; -- verify all a<=5 are intact
+ a | b | c
+---+---+------
+ 1 | 1 | t1
+ 2 | 2 | t1
+ 3 | 3 | t1
+ 4 | 4 | t1
+ 5 | 5 | t1
+ 1 | 1 | t11
+ 2 | 2 | t11
+ 3 | 3 | t11
+ 4 | 4 | t11
+ 5 | 5 | t11
+ 1 | 1 | t12
+ 2 | 2 | t12
+ 3 | 3 | t12
+ 4 | 4 | t12
+ 5 | 5 | t12
+ 1 | 1 | t111
+ 2 | 2 | t111
+ 3 | 3 | t111
+ 4 | 4 | t111
+ 5 | 5 | t111
+(20 rows)
+
+DROP TABLE t1, t11, t12, t111 CASCADE;
+NOTICE: drop cascades to view v1
+DROP FUNCTION snoop(anyelement);
+DROP FUNCTION leakproof(anyelement);
+CREATE TABLE tx1 (a integer);
+CREATE TABLE tx2 (b integer);
+CREATE TABLE tx3 (c integer);
+CREATE VIEW vx1 AS SELECT a FROM tx1 WHERE EXISTS(SELECT 1 FROM tx2 JOIN tx3 ON b=c);
+INSERT INTO vx1 values (1);
+SELECT * FROM tx1;
+ a
+---
+ 1
+(1 row)
+
+SELECT * FROM vx1;
+ a
+---
+(0 rows)
+
+DROP VIEW vx1;
+DROP TABLE tx1;
+DROP TABLE tx2;
+DROP TABLE tx3;
+CREATE TABLE tx1 (a integer);
+CREATE TABLE tx2 (b integer);
+CREATE TABLE tx3 (c integer);
+CREATE VIEW vx1 AS SELECT a FROM tx1 WHERE EXISTS(SELECT 1 FROM tx2 JOIN tx3 ON b=c);
+INSERT INTO vx1 VALUES (1);
+INSERT INTO vx1 VALUES (1);
+SELECT * FROM tx1;
+ a
+---
+ 1
+ 1
+(2 rows)
+
+SELECT * FROM vx1;
+ a
+---
+(0 rows)
+
+DROP VIEW vx1;
+DROP TABLE tx1;
+DROP TABLE tx2;
+DROP TABLE tx3;
+CREATE TABLE tx1 (a integer, b integer);
+CREATE TABLE tx2 (b integer, c integer);
+CREATE TABLE tx3 (c integer, d integer);
+ALTER TABLE tx1 DROP COLUMN b;
+ALTER TABLE tx2 DROP COLUMN c;
+ALTER TABLE tx3 DROP COLUMN d;
+CREATE VIEW vx1 AS SELECT a FROM tx1 WHERE EXISTS(SELECT 1 FROM tx2 JOIN tx3 ON b=c);
+INSERT INTO vx1 VALUES (1);
+INSERT INTO vx1 VALUES (1);
+SELECT * FROM tx1;
+ a
+---
+ 1
+ 1
+(2 rows)
+
+SELECT * FROM vx1;
+ a
+---
+(0 rows)
+
+DROP VIEW vx1;
+DROP TABLE tx1;
+DROP TABLE tx2;
+DROP TABLE tx3;
+--
+-- Test handling of vars from correlated subqueries in quals from outer
+-- security barrier views, per bug #13988
+--
+CREATE TABLE t1 (a int, b text, c int);
+INSERT INTO t1 VALUES (1, 'one', 10);
+CREATE TABLE t2 (cc int);
+INSERT INTO t2 VALUES (10), (20);
+CREATE VIEW v1 WITH (security_barrier = true) AS
+ SELECT * FROM t1 WHERE (a > 0)
+ WITH CHECK OPTION;
+CREATE VIEW v2 WITH (security_barrier = true) AS
+ SELECT * FROM v1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.cc = v1.c)
+ WITH CHECK OPTION;
+INSERT INTO v2 VALUES (2, 'two', 20); -- ok
+INSERT INTO v2 VALUES (-2, 'minus two', 20); -- not allowed
+ERROR: new row violates check option for view "v1"
+DETAIL: Failing row contains (-2, minus two, 20).
+INSERT INTO v2 VALUES (3, 'three', 30); -- not allowed
+ERROR: new row violates check option for view "v2"
+DETAIL: Failing row contains (3, three, 30).
+UPDATE v2 SET b = 'ONE' WHERE a = 1; -- ok
+UPDATE v2 SET a = -1 WHERE a = 1; -- not allowed
+ERROR: new row violates check option for view "v1"
+DETAIL: Failing row contains (-1, ONE, 10).
+UPDATE v2 SET c = 30 WHERE a = 1; -- not allowed
+ERROR: new row violates check option for view "v2"
+DETAIL: Failing row contains (1, ONE, 30).
+DELETE FROM v2 WHERE a = 2; -- ok
+SELECT * FROM v2;
+ a | b | c
+---+-----+----
+ 1 | ONE | 10
+(1 row)
+
+DROP VIEW v2;
+DROP VIEW v1;
+DROP TABLE t2;
+DROP TABLE t1;
+--
+-- Test CREATE OR REPLACE VIEW turning a non-updatable view into an
+-- auto-updatable view and adding check options in a single step
+--
+CREATE TABLE t1 (a int, b text);
+CREATE VIEW v1 AS SELECT null::int AS a;
+CREATE OR REPLACE VIEW v1 AS SELECT * FROM t1 WHERE a > 0 WITH CHECK OPTION;
+INSERT INTO v1 VALUES (1, 'ok'); -- ok
+INSERT INTO v1 VALUES (-1, 'invalid'); -- should fail
+ERROR: new row violates check option for view "v1"
+DETAIL: Failing row contains (-1, invalid).
+DROP VIEW v1;
+DROP TABLE t1;
+-- check that an auto-updatable view on a partitioned table works correctly
+create table uv_pt (a int, b int, v varchar) partition by range (a, b);
+create table uv_pt1 (b int not null, v varchar, a int not null) partition by range (b);
+create table uv_pt11 (like uv_pt1);
+alter table uv_pt11 drop a;
+alter table uv_pt11 add a int;
+alter table uv_pt11 drop a;
+alter table uv_pt11 add a int not null;
+alter table uv_pt1 attach partition uv_pt11 for values from (2) to (5);
+alter table uv_pt attach partition uv_pt1 for values from (1, 2) to (1, 10);
+create view uv_ptv as select * from uv_pt;
+select events & 4 != 0 AS upd,
+ events & 8 != 0 AS ins,
+ events & 16 != 0 AS del
+ from pg_catalog.pg_relation_is_updatable('uv_pt'::regclass, false) t(events);
+ upd | ins | del
+-----+-----+-----
+ t | t | t
+(1 row)
+
+select pg_catalog.pg_column_is_updatable('uv_pt'::regclass, 1::smallint, false);
+ pg_column_is_updatable
+------------------------
+ t
+(1 row)
+
+select pg_catalog.pg_column_is_updatable('uv_pt'::regclass, 2::smallint, false);
+ pg_column_is_updatable
+------------------------
+ t
+(1 row)
+
+select table_name, is_updatable, is_insertable_into
+ from information_schema.views where table_name = 'uv_ptv';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ uv_ptv | YES | YES
+(1 row)
+
+select table_name, column_name, is_updatable
+ from information_schema.columns where table_name = 'uv_ptv' order by column_name;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ uv_ptv | a | YES
+ uv_ptv | b | YES
+ uv_ptv | v | YES
+(3 rows)
+
+insert into uv_ptv values (1, 2);
+select tableoid::regclass, * from uv_pt;
+ tableoid | a | b | v
+----------+---+---+---
+ uv_pt11 | 1 | 2 |
+(1 row)
+
+create view uv_ptv_wco as select * from uv_pt where a = 0 with check option;
+insert into uv_ptv_wco values (1, 2);
+ERROR: new row violates check option for view "uv_ptv_wco"
+DETAIL: Failing row contains (1, 2, null).
+drop view uv_ptv, uv_ptv_wco;
+drop table uv_pt, uv_pt1, uv_pt11;
+-- check that wholerow vars appearing in WITH CHECK OPTION constraint expressions
+-- work fine with partitioned tables
+create table wcowrtest (a int) partition by list (a);
+create table wcowrtest1 partition of wcowrtest for values in (1);
+create view wcowrtest_v as select * from wcowrtest where wcowrtest = '(2)'::wcowrtest with check option;
+insert into wcowrtest_v values (1);
+ERROR: new row violates check option for view "wcowrtest_v"
+DETAIL: Failing row contains (1).
+alter table wcowrtest add b text;
+create table wcowrtest2 (b text, c int, a int);
+alter table wcowrtest2 drop c;
+alter table wcowrtest attach partition wcowrtest2 for values in (2);
+create table sometable (a int, b text);
+insert into sometable values (1, 'a'), (2, 'b');
+create view wcowrtest_v2 as
+ select *
+ from wcowrtest r
+ where r in (select s from sometable s where r.a = s.a)
+with check option;
+-- WITH CHECK qual will be processed with wcowrtest2's
+-- rowtype after tuple-routing
+insert into wcowrtest_v2 values (2, 'no such row in sometable');
+ERROR: new row violates check option for view "wcowrtest_v2"
+DETAIL: Failing row contains (2, no such row in sometable).
+drop view wcowrtest_v, wcowrtest_v2;
+drop table wcowrtest, sometable;
+-- Check INSERT .. ON CONFLICT DO UPDATE works correctly when the view's
+-- columns are named and ordered differently than the underlying table's.
+create table uv_iocu_tab (a text unique, b float);
+insert into uv_iocu_tab values ('xyxyxy', 0);
+create view uv_iocu_view as
+ select b, b+1 as c, a, '2.0'::text as two from uv_iocu_tab;
+insert into uv_iocu_view (a, b) values ('xyxyxy', 1)
+ on conflict (a) do update set b = uv_iocu_view.b;
+select * from uv_iocu_tab;
+ a | b
+--------+---
+ xyxyxy | 0
+(1 row)
+
+insert into uv_iocu_view (a, b) values ('xyxyxy', 1)
+ on conflict (a) do update set b = excluded.b;
+select * from uv_iocu_tab;
+ a | b
+--------+---
+ xyxyxy | 1
+(1 row)
+
+-- OK to access view columns that are not present in underlying base
+-- relation in the ON CONFLICT portion of the query
+insert into uv_iocu_view (a, b) values ('xyxyxy', 3)
+ on conflict (a) do update set b = cast(excluded.two as float);
+select * from uv_iocu_tab;
+ a | b
+--------+---
+ xyxyxy | 2
+(1 row)
+
+explain (costs off)
+insert into uv_iocu_view (a, b) values ('xyxyxy', 3)
+ on conflict (a) do update set b = excluded.b where excluded.c > 0;
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Insert on uv_iocu_tab
+ Conflict Resolution: UPDATE
+ Conflict Arbiter Indexes: uv_iocu_tab_a_key
+ Conflict Filter: ((excluded.b + '1'::double precision) > '0'::double precision)
+ -> Result
+(5 rows)
+
+insert into uv_iocu_view (a, b) values ('xyxyxy', 3)
+ on conflict (a) do update set b = excluded.b where excluded.c > 0;
+select * from uv_iocu_tab;
+ a | b
+--------+---
+ xyxyxy | 3
+(1 row)
+
+drop view uv_iocu_view;
+drop table uv_iocu_tab;
+-- Test whole-row references to the view
+create table uv_iocu_tab (a int unique, b text);
+create view uv_iocu_view as
+ select b as bb, a as aa, uv_iocu_tab::text as cc from uv_iocu_tab;
+insert into uv_iocu_view (aa,bb) values (1,'x');
+explain (costs off)
+insert into uv_iocu_view (aa,bb) values (1,'y')
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*
+ where excluded.aa > 0
+ and excluded.bb != ''
+ and excluded.cc is not null;
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
+ Insert on uv_iocu_tab
+ Conflict Resolution: UPDATE
+ Conflict Arbiter Indexes: uv_iocu_tab_a_key
+ Conflict Filter: ((excluded.a > 0) AND (excluded.b <> ''::text) AND ((excluded.*)::text IS NOT NULL))
+ -> Result
+(5 rows)
+
+insert into uv_iocu_view (aa,bb) values (1,'y')
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*
+ where excluded.aa > 0
+ and excluded.bb != ''
+ and excluded.cc is not null;
+select * from uv_iocu_view;
+ bb | aa | cc
+-------------------------+----+---------------------------------
+ Rejected: (y,1,"(1,y)") | 1 | (1,"Rejected: (y,1,""(1,y)"")")
+(1 row)
+
+-- Test omitting a column of the base relation
+delete from uv_iocu_view;
+insert into uv_iocu_view (aa,bb) values (1,'x');
+insert into uv_iocu_view (aa) values (1)
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*;
+select * from uv_iocu_view;
+ bb | aa | cc
+-----------------------+----+-------------------------------
+ Rejected: (,1,"(1,)") | 1 | (1,"Rejected: (,1,""(1,)"")")
+(1 row)
+
+alter table uv_iocu_tab alter column b set default 'table default';
+insert into uv_iocu_view (aa) values (1)
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*;
+select * from uv_iocu_view;
+ bb | aa | cc
+-------------------------------------------------------+----+---------------------------------------------------------------------
+ Rejected: ("table default",1,"(1,""table default"")") | 1 | (1,"Rejected: (""table default"",1,""(1,""""table default"""")"")")
+(1 row)
+
+alter view uv_iocu_view alter column bb set default 'view default';
+insert into uv_iocu_view (aa) values (1)
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*;
+select * from uv_iocu_view;
+ bb | aa | cc
+-----------------------------------------------------+----+-------------------------------------------------------------------
+ Rejected: ("view default",1,"(1,""view default"")") | 1 | (1,"Rejected: (""view default"",1,""(1,""""view default"""")"")")
+(1 row)
+
+-- Should fail to update non-updatable columns
+insert into uv_iocu_view (aa) values (1)
+ on conflict (aa) do update set cc = 'XXX';
+ERROR: cannot insert into column "cc" of view "uv_iocu_view"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+drop view uv_iocu_view;
+drop table uv_iocu_tab;
+-- ON CONFLICT DO UPDATE permissions checks
+create user regress_view_user1;
+create user regress_view_user2;
+set session authorization regress_view_user1;
+create table base_tbl(a int unique, b text, c float);
+insert into base_tbl values (1,'xxx',1.0);
+create view rw_view1 as select b as bb, c as cc, a as aa from base_tbl;
+grant select (aa,bb) on rw_view1 to regress_view_user2;
+grant insert on rw_view1 to regress_view_user2;
+grant update (bb) on rw_view1 to regress_view_user2;
+set session authorization regress_view_user2;
+insert into rw_view1 values ('yyy',2.0,1)
+ on conflict (aa) do update set bb = excluded.cc; -- Not allowed
+ERROR: permission denied for view rw_view1
+insert into rw_view1 values ('yyy',2.0,1)
+ on conflict (aa) do update set bb = rw_view1.cc; -- Not allowed
+ERROR: permission denied for view rw_view1
+insert into rw_view1 values ('yyy',2.0,1)
+ on conflict (aa) do update set bb = excluded.bb; -- OK
+insert into rw_view1 values ('zzz',2.0,1)
+ on conflict (aa) do update set bb = rw_view1.bb||'xxx'; -- OK
+insert into rw_view1 values ('zzz',2.0,1)
+ on conflict (aa) do update set cc = 3.0; -- Not allowed
+ERROR: permission denied for view rw_view1
+reset session authorization;
+select * from base_tbl;
+ a | b | c
+---+--------+---
+ 1 | yyyxxx | 1
+(1 row)
+
+set session authorization regress_view_user1;
+grant select (a,b) on base_tbl to regress_view_user2;
+grant insert (a,b) on base_tbl to regress_view_user2;
+grant update (a,b) on base_tbl to regress_view_user2;
+set session authorization regress_view_user2;
+create view rw_view2 as select b as bb, c as cc, a as aa from base_tbl;
+insert into rw_view2 (aa,bb) values (1,'xxx')
+ on conflict (aa) do update set bb = excluded.bb; -- Not allowed
+ERROR: permission denied for table base_tbl
+create view rw_view3 as select b as bb, a as aa from base_tbl;
+insert into rw_view3 (aa,bb) values (1,'xxx')
+ on conflict (aa) do update set bb = excluded.bb; -- OK
+reset session authorization;
+select * from base_tbl;
+ a | b | c
+---+-----+---
+ 1 | xxx | 1
+(1 row)
+
+set session authorization regress_view_user2;
+create view rw_view4 as select aa, bb, cc FROM rw_view1;
+insert into rw_view4 (aa,bb) values (1,'yyy')
+ on conflict (aa) do update set bb = excluded.bb; -- Not allowed
+ERROR: permission denied for view rw_view1
+create view rw_view5 as select aa, bb FROM rw_view1;
+insert into rw_view5 (aa,bb) values (1,'yyy')
+ on conflict (aa) do update set bb = excluded.bb; -- OK
+reset session authorization;
+select * from base_tbl;
+ a | b | c
+---+-----+---
+ 1 | yyy | 1
+(1 row)
+
+drop view rw_view5;
+drop view rw_view4;
+drop view rw_view3;
+drop view rw_view2;
+drop view rw_view1;
+drop table base_tbl;
+drop user regress_view_user1;
+drop user regress_view_user2;
+-- Test single- and multi-row inserts with table and view defaults.
+-- Table defaults should be used, unless overridden by view defaults.
+create table base_tab_def (a int, b text default 'Table default',
+ c text default 'Table default', d text, e text);
+create view base_tab_def_view as select * from base_tab_def;
+alter view base_tab_def_view alter b set default 'View default';
+alter view base_tab_def_view alter d set default 'View default';
+insert into base_tab_def values (1);
+insert into base_tab_def values (2), (3);
+insert into base_tab_def values (4, default, default, default, default);
+insert into base_tab_def values (5, default, default, default, default),
+ (6, default, default, default, default);
+insert into base_tab_def_view values (11);
+insert into base_tab_def_view values (12), (13);
+insert into base_tab_def_view values (14, default, default, default, default);
+insert into base_tab_def_view values (15, default, default, default, default),
+ (16, default, default, default, default);
+insert into base_tab_def_view values (17), (default);
+select * from base_tab_def order by a;
+ a | b | c | d | e
+----+---------------+---------------+--------------+---
+ 1 | Table default | Table default | |
+ 2 | Table default | Table default | |
+ 3 | Table default | Table default | |
+ 4 | Table default | Table default | |
+ 5 | Table default | Table default | |
+ 6 | Table default | Table default | |
+ 11 | View default | Table default | View default |
+ 12 | View default | Table default | View default |
+ 13 | View default | Table default | View default |
+ 14 | View default | Table default | View default |
+ 15 | View default | Table default | View default |
+ 16 | View default | Table default | View default |
+ 17 | View default | Table default | View default |
+ | View default | Table default | View default |
+(14 rows)
+
+-- Adding an INSTEAD OF trigger should cause NULLs to be inserted instead of
+-- table defaults, where there are no view defaults.
+create function base_tab_def_view_instrig_func() returns trigger
+as
+$$
+begin
+ insert into base_tab_def values (new.a, new.b, new.c, new.d, new.e);
+ return new;
+end;
+$$
+language plpgsql;
+create trigger base_tab_def_view_instrig instead of insert on base_tab_def_view
+ for each row execute function base_tab_def_view_instrig_func();
+truncate base_tab_def;
+insert into base_tab_def values (1);
+insert into base_tab_def values (2), (3);
+insert into base_tab_def values (4, default, default, default, default);
+insert into base_tab_def values (5, default, default, default, default),
+ (6, default, default, default, default);
+insert into base_tab_def_view values (11);
+insert into base_tab_def_view values (12), (13);
+insert into base_tab_def_view values (14, default, default, default, default);
+insert into base_tab_def_view values (15, default, default, default, default),
+ (16, default, default, default, default);
+insert into base_tab_def_view values (17), (default);
+select * from base_tab_def order by a;
+ a | b | c | d | e
+----+---------------+---------------+--------------+---
+ 1 | Table default | Table default | |
+ 2 | Table default | Table default | |
+ 3 | Table default | Table default | |
+ 4 | Table default | Table default | |
+ 5 | Table default | Table default | |
+ 6 | Table default | Table default | |
+ 11 | View default | | View default |
+ 12 | View default | | View default |
+ 13 | View default | | View default |
+ 14 | View default | | View default |
+ 15 | View default | | View default |
+ 16 | View default | | View default |
+ 17 | View default | | View default |
+ | View default | | View default |
+(14 rows)
+
+-- Using an unconditional DO INSTEAD rule should also cause NULLs to be
+-- inserted where there are no view defaults.
+drop trigger base_tab_def_view_instrig on base_tab_def_view;
+drop function base_tab_def_view_instrig_func;
+create rule base_tab_def_view_ins_rule as on insert to base_tab_def_view
+ do instead insert into base_tab_def values (new.a, new.b, new.c, new.d, new.e);
+truncate base_tab_def;
+insert into base_tab_def values (1);
+insert into base_tab_def values (2), (3);
+insert into base_tab_def values (4, default, default, default, default);
+insert into base_tab_def values (5, default, default, default, default),
+ (6, default, default, default, default);
+insert into base_tab_def_view values (11);
+insert into base_tab_def_view values (12), (13);
+insert into base_tab_def_view values (14, default, default, default, default);
+insert into base_tab_def_view values (15, default, default, default, default),
+ (16, default, default, default, default);
+insert into base_tab_def_view values (17), (default);
+select * from base_tab_def order by a;
+ a | b | c | d | e
+----+---------------+---------------+--------------+---
+ 1 | Table default | Table default | |
+ 2 | Table default | Table default | |
+ 3 | Table default | Table default | |
+ 4 | Table default | Table default | |
+ 5 | Table default | Table default | |
+ 6 | Table default | Table default | |
+ 11 | View default | | View default |
+ 12 | View default | | View default |
+ 13 | View default | | View default |
+ 14 | View default | | View default |
+ 15 | View default | | View default |
+ 16 | View default | | View default |
+ 17 | View default | | View default |
+ | View default | | View default |
+(14 rows)
+
+-- A DO ALSO rule should cause each row to be inserted twice. The first
+-- insert should behave the same as an auto-updatable view (using table
+-- defaults, unless overridden by view defaults). The second insert should
+-- behave the same as a rule-updatable view (inserting NULLs where there are
+-- no view defaults).
+drop rule base_tab_def_view_ins_rule on base_tab_def_view;
+create rule base_tab_def_view_ins_rule as on insert to base_tab_def_view
+ do also insert into base_tab_def values (new.a, new.b, new.c, new.d, new.e);
+truncate base_tab_def;
+insert into base_tab_def values (1);
+insert into base_tab_def values (2), (3);
+insert into base_tab_def values (4, default, default, default, default);
+insert into base_tab_def values (5, default, default, default, default),
+ (6, default, default, default, default);
+insert into base_tab_def_view values (11);
+insert into base_tab_def_view values (12), (13);
+insert into base_tab_def_view values (14, default, default, default, default);
+insert into base_tab_def_view values (15, default, default, default, default),
+ (16, default, default, default, default);
+insert into base_tab_def_view values (17), (default);
+select * from base_tab_def order by a, c NULLS LAST;
+ a | b | c | d | e
+----+---------------+---------------+--------------+---
+ 1 | Table default | Table default | |
+ 2 | Table default | Table default | |
+ 3 | Table default | Table default | |
+ 4 | Table default | Table default | |
+ 5 | Table default | Table default | |
+ 6 | Table default | Table default | |
+ 11 | View default | Table default | View default |
+ 11 | View default | | View default |
+ 12 | View default | Table default | View default |
+ 12 | View default | | View default |
+ 13 | View default | Table default | View default |
+ 13 | View default | | View default |
+ 14 | View default | Table default | View default |
+ 14 | View default | | View default |
+ 15 | View default | Table default | View default |
+ 15 | View default | | View default |
+ 16 | View default | Table default | View default |
+ 16 | View default | | View default |
+ 17 | View default | Table default | View default |
+ 17 | View default | | View default |
+ | View default | Table default | View default |
+ | View default | | View default |
+(22 rows)
+
+drop view base_tab_def_view;
+drop table base_tab_def;
+-- Test defaults with array assignments
+create table base_tab (a serial, b int[], c text, d text default 'Table default');
+create view base_tab_view as select c, a, b from base_tab;
+alter view base_tab_view alter column c set default 'View default';
+insert into base_tab_view (b[1], b[2], c, b[5], b[4], a, b[3])
+values (1, 2, default, 5, 4, default, 3), (10, 11, 'C value', 14, 13, 100, 12);
+select * from base_tab order by a;
+ a | b | c | d
+-----+------------------+--------------+---------------
+ 1 | {1,2,3,4,5} | View default | Table default
+ 100 | {10,11,12,13,14} | C value | Table default
+(2 rows)
+
+drop view base_tab_view;
+drop table base_tab;
diff --git src/test/regress/expected/update_1.out src/test/regress/expected/update_1.out
new file mode 100644
index 0000000000..b0eb99decd
--- /dev/null
+++ src/test/regress/expected/update_1.out
@@ -0,0 +1,937 @@
+--
+-- UPDATE syntax tests
+--
+CREATE TABLE update_test (
+ a INT DEFAULT 10,
+ b INT,
+ c TEXT
+);
+CREATE TABLE upsert_test (
+ a INT PRIMARY KEY,
+ b TEXT
+);
+INSERT INTO update_test VALUES (5, 10, 'foo');
+INSERT INTO update_test(b, a) VALUES (15, 10);
+SELECT * FROM update_test;
+ a | b | c
+----+----+-----
+ 5 | 10 | foo
+ 10 | 15 |
+(2 rows)
+
+UPDATE update_test SET a = DEFAULT, b = DEFAULT;
+SELECT * FROM update_test;
+ a | b | c
+----+---+-----
+ 10 | | foo
+ 10 | |
+(2 rows)
+
+-- aliases for the UPDATE target table
+UPDATE update_test AS t SET b = 10 WHERE t.a = 10;
+SELECT * FROM update_test;
+ a | b | c
+----+----+-----
+ 10 | 10 | foo
+ 10 | 10 |
+(2 rows)
+
+UPDATE update_test t SET b = t.b + 10 WHERE t.a = 10;
+SELECT * FROM update_test;
+ a | b | c
+----+----+-----
+ 10 | 20 | foo
+ 10 | 20 |
+(2 rows)
+
+--
+-- Test VALUES in FROM
+--
+UPDATE update_test SET a=v.i FROM (VALUES(100, 20)) AS v(i, j)
+ WHERE update_test.b = v.j;
+SELECT * FROM update_test;
+ a | b | c
+-----+----+-----
+ 100 | 20 | foo
+ 100 | 20 |
+(2 rows)
+
+-- fail, wrong data type:
+UPDATE update_test SET a = v.* FROM (VALUES(100, 20)) AS v(i, j)
+ WHERE update_test.b = v.j;
+ERROR: column "a" is of type integer but expression is of type record
+LINE 1: UPDATE update_test SET a = v.* FROM (VALUES(100, 20)) AS v(i...
+ ^
+HINT: You will need to rewrite or cast the expression.
+--
+-- Test multiple-set-clause syntax
+--
+INSERT INTO update_test SELECT a,b+1,c FROM update_test;
+SELECT * FROM update_test;
+ a | b | c
+-----+----+-----
+ 100 | 20 | foo
+ 100 | 20 |
+ 100 | 21 | foo
+ 100 | 21 |
+(4 rows)
+
+UPDATE update_test SET (c,b,a) = ('bugle', b+11, DEFAULT) WHERE c = 'foo';
+SELECT * FROM update_test;
+ a | b | c
+-----+----+-------
+ 100 | 20 |
+ 100 | 21 |
+ 10 | 31 | bugle
+ 10 | 32 | bugle
+(4 rows)
+
+UPDATE update_test SET (c,b) = ('car', a+b), a = a + 1 WHERE a = 10;
+SELECT * FROM update_test;
+ a | b | c
+-----+----+-----
+ 100 | 20 |
+ 100 | 21 |
+ 11 | 41 | car
+ 11 | 42 | car
+(4 rows)
+
+-- fail, multi assignment to same column:
+UPDATE update_test SET (c,b) = ('car', a+b), b = a + 1 WHERE a = 10;
+ERROR: multiple assignments to same column "b"
+-- uncorrelated sub-select:
+UPDATE update_test
+ SET (b,a) = (select a,b from update_test where b = 41 and c = 'car')
+ WHERE a = 100 AND b = 20;
+SELECT * FROM update_test;
+ a | b | c
+-----+----+-----
+ 100 | 21 |
+ 11 | 41 | car
+ 11 | 42 | car
+ 41 | 11 |
+(4 rows)
+
+-- correlated sub-select:
+UPDATE update_test o
+ SET (b,a) = (select a+1,b from update_test i
+ where i.a=o.a and i.b=o.b and i.c is not distinct from o.c);
+SELECT * FROM update_test;
+ a | b | c
+----+-----+-----
+ 21 | 101 |
+ 41 | 12 | car
+ 42 | 12 | car
+ 11 | 42 |
+(4 rows)
+
+-- fail, multiple rows supplied:
+UPDATE update_test SET (b,a) = (select a+1,b from update_test);
+ERROR: more than one row returned by a subquery used as an expression
+-- set to null if no rows supplied:
+UPDATE update_test SET (b,a) = (select a+1,b from update_test where a = 1000)
+ WHERE a = 11;
+SELECT * FROM update_test;
+ a | b | c
+----+-----+-----
+ 21 | 101 |
+ 41 | 12 | car
+ 42 | 12 | car
+ | |
+(4 rows)
+
+-- *-expansion should work in this context:
+UPDATE update_test SET (a,b) = ROW(v.*) FROM (VALUES(21, 100)) AS v(i, j)
+ WHERE update_test.a = v.i;
+-- you might expect this to work, but syntactically it's not a RowExpr:
+UPDATE update_test SET (a,b) = (v.*) FROM (VALUES(21, 101)) AS v(i, j)
+ WHERE update_test.a = v.i;
+ERROR: source for a multiple-column UPDATE item must be a sub-SELECT or ROW() expression
+LINE 1: UPDATE update_test SET (a,b) = (v.*) FROM (VALUES(21, 101)) ...
+ ^
+-- if an alias for the target table is specified, don't allow references
+-- to the original table name
+UPDATE update_test AS t SET b = update_test.b + 10 WHERE t.a = 10;
+ERROR: invalid reference to FROM-clause entry for table "update_test"
+LINE 1: UPDATE update_test AS t SET b = update_test.b + 10 WHERE t.a...
+ ^
+HINT: Perhaps you meant to reference the table alias "t".
+-- Make sure that we can update to a TOASTed value.
+UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
+SELECT a, b, char_length(c) FROM update_test;
+ a | b | char_length
+----+-----+-------------
+ | |
+ 21 | 100 |
+ 41 | 12 | 10000
+ 42 | 12 | 10000
+(4 rows)
+
+-- Check multi-assignment with a Result node to handle a one-time filter.
+EXPLAIN (VERBOSE, COSTS OFF)
+UPDATE update_test t
+ SET (a, b) = (SELECT b, a FROM update_test s WHERE s.a = t.a)
+ WHERE CURRENT_USER = SESSION_USER;
+ QUERY PLAN
+------------------------------------------------------------------
+ Update on public.update_test t
+ -> Result
+ Output: $1, $2, t.c, (SubPlan 1 (returns $1,$2)), t.ctid
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Seq Scan on public.update_test t
+ Output: t.c, t.a, t.ctid
+ SubPlan 1 (returns $1,$2)
+ -> Seq Scan on public.update_test s
+ Output: s.b, s.a
+ Filter: (s.a = t.a)
+(10 rows)
+
+UPDATE update_test t
+ SET (a, b) = (SELECT b, a FROM update_test s WHERE s.a = t.a)
+ WHERE CURRENT_USER = SESSION_USER;
+SELECT a, b, char_length(c) FROM update_test;
+ a | b | char_length
+-----+----+-------------
+ | |
+ 100 | 21 |
+ 12 | 41 | 10000
+ 12 | 42 | 10000
+(4 rows)
+
+-- Test ON CONFLICT DO UPDATE
+INSERT INTO upsert_test VALUES(1, 'Boo');
+-- uncorrelated sub-select:
+WITH aaa AS (SELECT 1 AS a, 'Foo' AS b) INSERT INTO upsert_test
+ VALUES (1, 'Bar') ON CONFLICT(a)
+ DO UPDATE SET (b, a) = (SELECT b, a FROM aaa) RETURNING *;
+ a | b
+---+-----
+ 1 | Foo
+(1 row)
+
+-- correlated sub-select:
+INSERT INTO upsert_test VALUES (1, 'Baz') ON CONFLICT(a)
+ DO UPDATE SET (b, a) = (SELECT b || ', Correlated', a from upsert_test i WHERE i.a = upsert_test.a)
+ RETURNING *;
+ a | b
+---+-----------------
+ 1 | Foo, Correlated
+(1 row)
+
+-- correlated sub-select (EXCLUDED.* alias):
+INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
+ DO UPDATE SET (b, a) = (SELECT b || ', Excluded', a from upsert_test i WHERE i.a = excluded.a)
+ RETURNING *;
+ a | b
+---+---------------------------
+ 1 | Foo, Correlated, Excluded
+(1 row)
+
+-- ON CONFLICT using system attributes in RETURNING, testing both the
+-- inserting and updating paths. See bug report at:
+-- https://www.postgresql.org/message-id/73436355-6432-49B1-92ED-1FE4F7E7E100%40finefun.com.au
+CREATE FUNCTION xid_current() RETURNS xid LANGUAGE SQL AS $$SELECT (txid_current() % ((1::int8<<32)))::text::xid;$$;
+INSERT INTO upsert_test VALUES (2, 'Beeble') ON CONFLICT(a)
+ DO UPDATE SET (b, a) = (SELECT b || ', Excluded', a from upsert_test i WHERE i.a = excluded.a)
+ RETURNING tableoid::regclass, xmin = xid_current() AS xmin_correct, xmax = 0 AS xmax_correct;
+ERROR: zedstore tuple table slot does not have system attributes (except xmin and cmin)
+-- currently xmax is set after a conflict - that's probably not good,
+-- but it seems worthwhile to have to be explicit if that changes.
+INSERT INTO upsert_test VALUES (2, 'Brox') ON CONFLICT(a)
+ DO UPDATE SET (b, a) = (SELECT b || ', Excluded', a from upsert_test i WHERE i.a = excluded.a)
+ RETURNING tableoid::regclass, xmin = xid_current() AS xmin_correct, xmax = xid_current() AS xmax_correct;
+ERROR: zedstore tuple table slot does not have system attributes (except xmin and cmin)
+DROP FUNCTION xid_current();
+DROP TABLE update_test;
+DROP TABLE upsert_test;
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- When a partitioned table receives an UPDATE to the partitioned key and the
+-- new values no longer meet the partition's bound, the row must be moved to
+-- the correct partition for the new partition key (if one exists). We must
+-- also ensure that updatable views on partitioned tables properly enforce any
+-- WITH CHECK OPTION that is defined. The situation with triggers in this case
+-- also requires thorough testing as partition key updates causing row
+-- movement convert UPDATEs into DELETE+INSERT.
+CREATE TABLE range_parted (
+ a text,
+ b bigint,
+ c numeric,
+ d int,
+ e varchar
+) PARTITION BY RANGE (a, b);
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+CREATE TABLE part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+ALTER TABLE range_parted ATTACH PARTITION part_b_20_b_30 FOR VALUES FROM ('b', 20) TO ('b', 30);
+CREATE TABLE part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY RANGE (c);
+CREATE TABLE part_b_1_b_10 PARTITION OF range_parted FOR VALUES FROM ('b', 1) TO ('b', 10);
+ALTER TABLE range_parted ATTACH PARTITION part_b_10_b_20 FOR VALUES FROM ('b', 10) TO ('b', 20);
+CREATE TABLE part_a_10_a_20 PARTITION OF range_parted FOR VALUES FROM ('a', 10) TO ('a', 20);
+CREATE TABLE part_a_1_a_10 PARTITION OF range_parted FOR VALUES FROM ('a', 1) TO ('a', 10);
+-- Check that partition-key UPDATE works sanely on a partitioned table that
+-- does not have any child partitions.
+UPDATE part_b_10_b_20 set b = b - 6;
+-- Create some more partitions following the above pattern of descending bound
+-- order, but let's make the situation a bit more complex by having the
+-- attribute numbers of the columns vary from their parent partition.
+CREATE TABLE part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY range (abs(d));
+ALTER TABLE part_c_100_200 DROP COLUMN e, DROP COLUMN c, DROP COLUMN a;
+ALTER TABLE part_c_100_200 ADD COLUMN c numeric, ADD COLUMN e varchar, ADD COLUMN a text;
+ALTER TABLE part_c_100_200 DROP COLUMN b;
+ALTER TABLE part_c_100_200 ADD COLUMN b bigint;
+CREATE TABLE part_d_1_15 PARTITION OF part_c_100_200 FOR VALUES FROM (1) TO (15);
+CREATE TABLE part_d_15_20 PARTITION OF part_c_100_200 FOR VALUES FROM (15) TO (20);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_100_200 FOR VALUES FROM (100) TO (200);
+CREATE TABLE part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_1_100 FOR VALUES FROM (1) TO (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted VALUES (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted ORDER BY 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 12 | 96 | 1 |
+ part_c_1_100 | b | 13 | 97 | 2 |
+ part_d_15_20 | b | 15 | 105 | 16 |
+ part_d_15_20 | b | 17 | 105 | 19 |
+(6 rows)
+
+-- The order of subplans should be in bound order
+EXPLAIN (costs off) UPDATE range_parted set c = c - 50 WHERE c > 97;
+ QUERY PLAN
+-------------------------------------
+ Update on range_parted
+ Update on part_a_1_a_10
+ Update on part_a_10_a_20
+ Update on part_b_1_b_10
+ Update on part_c_1_100
+ Update on part_d_1_15
+ Update on part_d_15_20
+ Update on part_b_20_b_30
+ -> Seq Scan on part_a_1_a_10
+ Filter: (c > '97'::numeric)
+ -> Seq Scan on part_a_10_a_20
+ Filter: (c > '97'::numeric)
+ -> Seq Scan on part_b_1_b_10
+ Filter: (c > '97'::numeric)
+ -> Seq Scan on part_c_1_100
+ Filter: (c > '97'::numeric)
+ -> Seq Scan on part_d_1_15
+ Filter: (c > '97'::numeric)
+ -> Seq Scan on part_d_15_20
+ Filter: (c > '97'::numeric)
+ -> Seq Scan on part_b_20_b_30
+ Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_c_100_200 set c = c - 20, d = c WHERE c = 105;
+ERROR: new row for relation "part_c_100_200" violates partition constraint
+DETAIL: Failing row contains (105, 85, null, b, 15).
+-- fail, no partition key update, so no attempt to move tuple,
+-- but "a = 'a'" violates partition constraint enforced by root partition)
+UPDATE part_b_10_b_20 set a = 'a';
+ERROR: new row for relation "part_c_1_100" violates partition constraint
+DETAIL: Failing row contains (null, 1, 96, 12, a).
+-- ok, partition key update, no constraint violation
+UPDATE range_parted set d = d - 10 WHERE d > 10;
+-- ok, no partition key update, no constraint violation
+UPDATE range_parted set e = d;
+-- No row found
+UPDATE part_c_1_100 set c = c + 20 WHERE c = 98;
+-- ok, row movement
+UPDATE part_b_10_b_20 set c = c + 20 returning c, b, a;
+ c | b | a
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10 | a | 1 | 1 | 1 | 1
+ part_d_1_15 | b | 12 | 116 | 1 | 1
+ part_d_1_15 | b | 13 | 117 | 2 | 2
+ part_d_1_15 | b | 15 | 125 | 6 | 6
+ part_d_1_15 | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_b_10_b_20 set b = b - 6 WHERE c > 116 returning *;
+ERROR: new row for relation "part_d_1_15" violates partition constraint
+DETAIL: Failing row contains (2, 117, 2, b, 7).
+-- ok, row movement, with subset of rows moved into different partition.
+UPDATE range_parted set b = b - 6 WHERE c > 116 returning a, b + c;
+ a | ?column?
+---+----------
+ a | 204
+ b | 124
+ b | 134
+ b | 136
+(4 rows)
+
+:show_data;
+ partname | a | b | c | d | e
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a | 1 | 1 | 1 | 1
+ part_a_1_a_10 | a | 4 | 200 | 1 | 1
+ part_b_1_b_10 | b | 7 | 117 | 2 | 2
+ part_b_1_b_10 | b | 9 | 125 | 6 | 6
+ part_d_1_15 | b | 11 | 125 | 9 | 9
+ part_d_1_15 | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- Common table needed for multiple test scenarios.
+CREATE TABLE mintab(c1 int);
+INSERT into mintab VALUES (120);
+-- update partition key using updatable view.
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 FROM mintab) WITH CHECK OPTION;
+-- ok
+UPDATE upview set c = 199 WHERE b = 4;
+-- fail, check option violation
+UPDATE upview set c = 120 WHERE b = 4;
+ERROR: new row violates check option for view "upview"
+DETAIL: Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+UPDATE upview set a = 'b', b = 15, c = 120 WHERE b = 4;
+ERROR: new row violates check option for view "upview"
+DETAIL: Failing row contains (b, 15, 120, 1, 1).
+-- ok, row movement, check option passes
+UPDATE upview set a = 'b', b = 15 WHERE b = 4;
+:show_data;
+ partname | a | b | c | d | e
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a | 1 | 1 | 1 | 1
+ part_b_1_b_10 | b | 7 | 117 | 2 | 2
+ part_b_1_b_10 | b | 9 | 125 | 6 | 6
+ part_d_1_15 | b | 11 | 125 | 9 | 9
+ part_d_1_15 | b | 12 | 116 | 1 | 1
+ part_d_1_15 | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+DROP VIEW upview;
+-- RETURNING having whole-row vars.
+:init_range_parted;
+UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning (range_parted), *;
+ range_parted | a | b | c | d | e
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 |
+ (b,17,95,19,) | b | 17 | 95 | 19 |
+(2 rows)
+
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 12 | 96 | 1 |
+ part_c_1_100 | b | 13 | 97 | 2 |
+ part_c_1_100 | b | 15 | 95 | 16 |
+ part_c_1_100 | b | 17 | 95 | 19 |
+(6 rows)
+
+-- Transition tables with update row movement
+:init_range_parted;
+CREATE FUNCTION trans_updatetrigfunc() RETURNS trigger LANGUAGE plpgsql AS
+$$
+ begin
+ raise notice 'trigger = %, old table = %, new table = %',
+ TG_NAME,
+ (select string_agg(old_table::text, ', ' ORDER BY a) FROM old_table),
+ (select string_agg(new_table::text, ', ' ORDER BY a) FROM new_table);
+ return null;
+ end;
+$$;
+CREATE TRIGGER trans_updatetrig
+ AFTER UPDATE ON range_parted REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
+ FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end ) WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE: trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 13 | 98 | 2 |
+ part_d_15_20 | b | 15 | 106 | 16 |
+ part_d_15_20 | b | 17 | 106 | 19 |
+ part_d_1_15 | b | 12 | 110 | 1 |
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+CREATE TRIGGER trans_deletetrig
+ AFTER DELETE ON range_parted REFERENCING OLD TABLE AS old_table
+ FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+CREATE TRIGGER trans_inserttrig
+ AFTER INSERT ON range_parted REFERENCING NEW TABLE AS new_table
+ FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE: trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_d_15_20 | b | 15 | 155 | 16 |
+ part_d_15_20 | b | 17 | 155 | 19 |
+ part_d_1_15 | b | 12 | 146 | 1 |
+ part_d_1_15 | b | 13 | 147 | 2 |
+(6 rows)
+
+DROP TRIGGER trans_deletetrig ON range_parted;
+DROP TRIGGER trans_inserttrig ON range_parted;
+-- Don't drop trans_updatetrig yet. It is required below.
+-- Test with transition tuple conversion happening for rows moved into the
+-- new partition. This requires a trigger that references transition table
+-- (we already have trans_updatetrig). For inserted rows, the conversion
+-- is not usually needed, because the original tuple is already compatible with
+-- the desired transition tuple format. But conversion happens when there is a
+-- BR trigger because the trigger can change the inserted row. So install a
+-- BR triggers on those child partitions where the rows will be moved.
+CREATE FUNCTION func_parted_mod_b() RETURNS trigger AS $$
+BEGIN
+ NEW.b = NEW.b + 1;
+ return NEW;
+END $$ language plpgsql;
+CREATE TRIGGER trig_c1_100 BEFORE UPDATE OR INSERT ON part_c_1_100
+ FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d1_15 BEFORE UPDATE OR INSERT ON part_d_1_15
+ FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d15_20 BEFORE UPDATE OR INSERT ON part_d_15_20
+ FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+:init_range_parted;
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end) WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE: trigger = trans_updatetrig, old table = (b,13,96,1,), (b,14,97,2,), (b,16,105,16,), (b,18,105,19,), new table = (b,15,110,1,), (b,15,98,2,), (b,17,106,16,), (b,19,106,19,)
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 15 | 98 | 2 |
+ part_d_15_20 | b | 17 | 106 | 16 |
+ part_d_15_20 | b | 19 | 106 | 19 |
+ part_d_1_15 | b | 15 | 110 | 1 |
+(6 rows)
+
+:init_range_parted;
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE: trigger = trans_updatetrig, old table = (b,13,96,1,), (b,14,97,2,), (b,16,105,16,), (b,18,105,19,), new table = (b,15,146,1,), (b,16,147,2,), (b,17,155,16,), (b,19,155,19,)
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_d_15_20 | b | 17 | 155 | 16 |
+ part_d_15_20 | b | 19 | 155 | 19 |
+ part_d_1_15 | b | 15 | 146 | 1 |
+ part_d_1_15 | b | 16 | 147 | 2 |
+(6 rows)
+
+-- Case where per-partition tuple conversion map array is allocated, but the
+-- map is not required for the particular tuple that is routed, thanks to
+-- matching table attributes of the partition and the target table.
+:init_range_parted;
+UPDATE range_parted set b = 15 WHERE b = 1;
+NOTICE: trigger = trans_updatetrig, old table = (a,1,1,1,), new table = (a,15,1,1,)
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_10_a_20 | a | 15 | 1 | 1 |
+ part_c_1_100 | b | 13 | 96 | 1 |
+ part_c_1_100 | b | 14 | 97 | 2 |
+ part_d_15_20 | b | 16 | 105 | 16 |
+ part_d_15_20 | b | 18 | 105 | 19 |
+(6 rows)
+
+DROP TRIGGER trans_updatetrig ON range_parted;
+DROP TRIGGER trig_c1_100 ON part_c_1_100;
+DROP TRIGGER trig_d1_15 ON part_d_1_15;
+DROP TRIGGER trig_d15_20 ON part_d_15_20;
+DROP FUNCTION func_parted_mod_b();
+-- RLS policies with update-row-movement
+-----------------------------------------
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+CREATE USER regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+CREATE POLICY seeall ON range_parted AS PERMISSIVE FOR SELECT USING (true);
+CREATE POLICY policy_range_parted ON range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+ERROR: new row violates row-level security policy for table "range_parted"
+RESET SESSION AUTHORIZATION;
+-- Create a trigger on part_d_1_15
+CREATE FUNCTION func_d_1_15() RETURNS trigger AS $$
+BEGIN
+ NEW.c = NEW.c + 1; -- Make even numbers odd, or vice versa
+ return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_d_1_15 BEFORE INSERT ON part_d_1_15
+ FOR EACH ROW EXECUTE PROCEDURE func_d_1_15();
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15. Even though the UPDATE is setting 'c' to an odd number, the
+-- trigger at the destination partition again makes it an even number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error. Even though the UPDATE is setting
+-- 'c' to an even number, the trigger at the destination partition again makes
+-- it an odd number.
+UPDATE range_parted set a = 'b', c = 150 WHERE a = 'a' and c = 200;
+ERROR: new row violates row-level security policy for table "range_parted"
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP TRIGGER trig_d_1_15 ON part_d_1_15;
+DROP FUNCTION func_d_1_15();
+-- Policy expression contains SubPlan
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_subplan on range_parted
+ AS RESTRICTIVE for UPDATE USING (true)
+ WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab));
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, mintab has row with c1 = 120
+UPDATE range_parted set a = 'b', c = 122 WHERE a = 'a' and c = 200;
+ERROR: new row violates row-level security policy "policy_range_parted_subplan" for table "range_parted"
+-- ok
+UPDATE range_parted set a = 'b', c = 120 WHERE a = 'a' and c = 200;
+-- RLS policy expression contains whole row.
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+ WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- ok, should pass the RLS check
+UPDATE range_parted set a = 'b', c = 112 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, the whole row RLS check should fail
+UPDATE range_parted set a = 'b', c = 116 WHERE a = 'a' and c = 200;
+ERROR: new row violates row-level security policy "policy_range_parted_wholerow" for table "range_parted"
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP POLICY policy_range_parted ON range_parted;
+DROP POLICY policy_range_parted_subplan ON range_parted;
+DROP POLICY policy_range_parted_wholerow ON range_parted;
+REVOKE ALL ON range_parted, mintab FROM regress_range_parted_user;
+DROP USER regress_range_parted_user;
+DROP TABLE mintab;
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+CREATE FUNCTION trigfunc() returns trigger language plpgsql as
+$$
+ begin
+ raise notice 'trigger = % fired on table % during %',
+ TG_NAME, TG_TABLE_NAME, TG_OP;
+ return null;
+ end;
+$$;
+-- Triggers on root partition
+CREATE TRIGGER parent_delete_trig
+ AFTER DELETE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_update_trig
+ AFTER UPDATE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_insert_trig
+ AFTER INSERT ON range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+CREATE TRIGGER c1_delete_trig
+ AFTER DELETE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_update_trig
+ AFTER UPDATE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_insert_trig
+ AFTER INSERT ON part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+CREATE TRIGGER d1_delete_trig
+ AFTER DELETE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_update_trig
+ AFTER UPDATE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_insert_trig
+ AFTER INSERT ON part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+CREATE TRIGGER d15_delete_trig
+ AFTER DELETE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_update_trig
+ AFTER UPDATE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_insert_trig
+ AFTER INSERT ON part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or
+-- insert statement triggers should be fired.
+UPDATE range_parted set c = c - 50 WHERE c > 97;
+NOTICE: trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 12 | 96 | 1 |
+ part_c_1_100 | b | 13 | 97 | 2 |
+ part_c_1_100 | b | 15 | 55 | 16 |
+ part_c_1_100 | b | 17 | 55 | 19 |
+(6 rows)
+
+DROP TRIGGER parent_delete_trig ON range_parted;
+DROP TRIGGER parent_update_trig ON range_parted;
+DROP TRIGGER parent_insert_trig ON range_parted;
+DROP TRIGGER c1_delete_trig ON part_c_1_100;
+DROP TRIGGER c1_update_trig ON part_c_1_100;
+DROP TRIGGER c1_insert_trig ON part_c_1_100;
+DROP TRIGGER d1_delete_trig ON part_d_1_15;
+DROP TRIGGER d1_update_trig ON part_d_1_15;
+DROP TRIGGER d1_insert_trig ON part_d_1_15;
+DROP TRIGGER d15_delete_trig ON part_d_15_20;
+DROP TRIGGER d15_update_trig ON part_d_15_20;
+DROP TRIGGER d15_insert_trig ON part_d_15_20;
+-- Creating default partition for range
+:init_range_parted;
+create table part_def partition of range_parted default;
+\d+ part_def
+ Table "public.part_def"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a | text | | | | extended | |
+ b | bigint | | | | plain | |
+ c | numeric | | | | main | |
+ d | integer | | | | plain | |
+ e | character varying | | | | extended | |
+Partition of: range_parted DEFAULT
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
+
+insert into range_parted values ('c', 9);
+-- ok
+update part_def set a = 'd' where a = 'c';
+-- fail
+update part_def set a = 'a' where a = 'd';
+ERROR: new row for relation "part_def" violates partition constraint
+DETAIL: Failing row contains (a, 9, null, null, null).
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 12 | 96 | 1 |
+ part_c_1_100 | b | 13 | 97 | 2 |
+ part_d_15_20 | b | 15 | 105 | 16 |
+ part_d_15_20 | b | 17 | 105 | 19 |
+ part_def | d | 9 | | |
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- fail, default partition is not under part_a_10_a_20;
+UPDATE part_a_10_a_20 set a = 'ad' WHERE a = 'a';
+ERROR: new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL: Failing row contains (ad, 10, 200, 1, null).
+-- ok
+UPDATE range_parted set a = 'ad' WHERE a = 'a';
+UPDATE range_parted set a = 'bd' WHERE a = 'b';
+:show_data;
+ partname | a | b | c | d | e
+----------+----+----+-----+----+---
+ part_def | ad | 1 | 1 | 1 |
+ part_def | ad | 10 | 200 | 1 |
+ part_def | bd | 12 | 96 | 1 |
+ part_def | bd | 13 | 97 | 2 |
+ part_def | bd | 15 | 105 | 16 |
+ part_def | bd | 17 | 105 | 19 |
+ part_def | d | 9 | | |
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- ok
+UPDATE range_parted set a = 'a' WHERE a = 'ad';
+UPDATE range_parted set a = 'b' WHERE a = 'bd';
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 12 | 96 | 1 |
+ part_c_1_100 | b | 13 | 97 | 2 |
+ part_d_15_20 | b | 15 | 105 | 16 |
+ part_d_15_20 | b | 17 | 105 | 19 |
+ part_def | d | 9 | | |
+(7 rows)
+
+-- Cleanup: range_parted no longer needed.
+DROP TABLE range_parted;
+CREATE TABLE list_parted (
+ a text,
+ b int
+) PARTITION BY list (a);
+CREATE TABLE list_part1 PARTITION OF list_parted for VALUES in ('a', 'b');
+CREATE TABLE list_default PARTITION OF list_parted default;
+INSERT into list_part1 VALUES ('a', 1);
+INSERT into list_default VALUES ('d', 10);
+-- fail
+UPDATE list_default set a = 'a' WHERE a = 'd';
+ERROR: new row for relation "list_default" violates partition constraint
+DETAIL: Failing row contains (a, 10).
+-- ok
+UPDATE list_default set a = 'x' WHERE a = 'd';
+DROP TABLE list_parted;
+--------------
+-- Some more update-partition-key test scenarios below. This time use list
+-- partitions.
+--------------
+-- Setup for list partitions
+CREATE TABLE list_parted (a numeric, b int, c int8) PARTITION BY list (a);
+CREATE TABLE sub_parted PARTITION OF list_parted for VALUES in (1) PARTITION BY list (b);
+CREATE TABLE sub_part1(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part1 for VALUES in (1);
+CREATE TABLE sub_part2(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part2 for VALUES in (2);
+CREATE TABLE list_part1(a numeric, b int, c int8);
+ALTER TABLE list_parted ATTACH PARTITION list_part1 for VALUES in (2,3);
+INSERT into list_parted VALUES (2,5,50);
+INSERT into list_parted VALUES (3,6,60);
+INSERT into sub_parted VALUES (1,1,60);
+INSERT into sub_parted VALUES (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+UPDATE sub_parted set a = 2 WHERE c = 10;
+ERROR: new row for relation "sub_part2" violates partition constraint
+DETAIL: Failing row contains (2, 10, 2).
+-- Test update-partition-key, where the unpruned partitions do not have their
+-- partition keys updated.
+SELECT tableoid::regclass::text, * FROM list_parted WHERE a = 2 ORDER BY 1;
+ tableoid | a | b | c
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+UPDATE list_parted set b = c + a WHERE a = 2;
+SELECT tableoid::regclass::text, * FROM list_parted WHERE a = 2 ORDER BY 1;
+ tableoid | a | b | c
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-- Test the case where BR UPDATE triggers change the partition key.
+CREATE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+ NEW.b = 2; -- This is changing partition key column.
+ return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER parted_mod_b before update on sub_part1
+ for each row execute procedure func_parted_mod_b();
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+ tableoid | a | b | c
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 | 6 | 60
+ sub_part1 | 1 | 1 | 60
+ sub_part2 | 1 | 2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+UPDATE list_parted set c = 70 WHERE b = 1;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+ tableoid | a | b | c
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 | 6 | 60
+ sub_part2 | 1 | 2 | 10
+ sub_part2 | 1 | 2 | 70
+(4 rows)
+
+DROP TRIGGER parted_mod_b ON sub_part1;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+CREATE OR REPLACE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+ raise notice 'Trigger: Got OLD row %, but returning NULL', OLD;
+ return NULL;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_skip_delete before delete on sub_part2
+ for each row execute procedure func_parted_mod_b();
+UPDATE list_parted set b = 1 WHERE c = 70;
+NOTICE: Trigger: Got OLD row (2,70,1), but returning NULL
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+ tableoid | a | b | c
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 | 6 | 60
+ sub_part2 | 1 | 2 | 10
+ sub_part2 | 1 | 2 | 70
+(4 rows)
+
+-- Drop the trigger. Now the row should be moved.
+DROP TRIGGER trig_skip_delete ON sub_part2;
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+ tableoid | a | b | c
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 | 6 | 60
+ sub_part1 | 1 | 1 | 70
+ sub_part2 | 1 | 2 | 10
+(4 rows)
+
+DROP FUNCTION func_parted_mod_b();
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only
+-- once. There should not be any rows inserted.
+CREATE TABLE non_parted (id int);
+INSERT into non_parted VALUES (1), (1), (1), (2), (2), (2), (3), (3), (3);
+UPDATE list_parted t1 set a = 2 FROM non_parted t2 WHERE t1.a = t2.id and a = 1;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+ tableoid | a | b | c
+------------+---+----+----
+ list_part1 | 2 | 1 | 70
+ list_part1 | 2 | 2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 | 6 | 60
+(4 rows)
+
+DROP TABLE non_parted;
+-- Cleanup: list_parted no longer needed.
+DROP TABLE list_parted;
+-- create custom operator class and hash function, for the same reason
+-- explained in alter_table.sql
+create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
+$$ begin return (a + seed); end; $$ language 'plpgsql' immutable;
+create operator class custom_opclass for type int4 using hash as
+operator 1 = , function 2 dummy_hashint4(int4, int8);
+create table hash_parted (
+ a int,
+ b int
+) partition by hash (a custom_opclass, b custom_opclass);
+create table hpart1 partition of hash_parted for values with (modulus 2, remainder 1);
+create table hpart2 partition of hash_parted for values with (modulus 4, remainder 2);
+create table hpart3 partition of hash_parted for values with (modulus 8, remainder 0);
+create table hpart4 partition of hash_parted for values with (modulus 8, remainder 4);
+insert into hpart1 values (1, 1);
+insert into hpart2 values (2, 5);
+insert into hpart4 values (3, 4);
+-- fail
+update hpart1 set a = 3, b=4 where a = 1;
+ERROR: new row for relation "hpart1" violates partition constraint
+DETAIL: Failing row contains (3, 4).
+-- ok, row movement
+update hash_parted set b = b - 1 where b = 1;
+-- ok
+update hash_parted set b = b + 8 where b = 1;
+-- cleanup
+drop table hash_parted;
+drop operator class custom_opclass using hash;
+drop function dummy_hashint4(a int4, seed int8);
diff --git src/test/regress/expected/vacuum_1.out src/test/regress/expected/vacuum_1.out
new file mode 100644
index 0000000000..dc3edbfac5
--- /dev/null
+++ src/test/regress/expected/vacuum_1.out
@@ -0,0 +1,382 @@
+--
+-- VACUUM
+--
+CREATE TABLE vactst (i INT);
+INSERT INTO vactst VALUES (1);
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst VALUES (0);
+SELECT count(*) FROM vactst;
+ count
+-------
+ 2049
+(1 row)
+
+DELETE FROM vactst WHERE i != 0;
+SELECT * FROM vactst;
+ i
+---
+ 0
+(1 row)
+
+VACUUM FULL vactst;
+UPDATE vactst SET i = i + 1;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst VALUES (0);
+SELECT count(*) FROM vactst;
+ count
+-------
+ 2049
+(1 row)
+
+DELETE FROM vactst WHERE i != 0;
+VACUUM (FULL) vactst;
+DELETE FROM vactst;
+SELECT * FROM vactst;
+ i
+---
+(0 rows)
+
+VACUUM (FULL, FREEZE) vactst;
+VACUUM (ANALYZE, FULL) vactst;
+CREATE TABLE vaccluster (i INT PRIMARY KEY);
+ALTER TABLE vaccluster CLUSTER ON vaccluster_pkey;
+CLUSTER vaccluster;
+CREATE FUNCTION do_analyze() RETURNS VOID VOLATILE LANGUAGE SQL
+ AS 'ANALYZE pg_am';
+CREATE FUNCTION wrap_do_analyze(c INT) RETURNS INT IMMUTABLE LANGUAGE SQL
+ AS 'SELECT $1 FROM do_analyze()';
+CREATE INDEX ON vaccluster(wrap_do_analyze(i));
+INSERT INTO vaccluster VALUES (1), (2);
+ANALYZE vaccluster;
+ERROR: ANALYZE cannot be executed from VACUUM or ANALYZE
+CONTEXT: SQL function "do_analyze" statement 1
+SQL function "wrap_do_analyze" statement 1
+-- Test ANALYZE in transaction, where the transaction surrounding
+-- analyze performed modifications. This tests for the bug at
+-- https://postgr.es/m/c7988239-d42c-ddc4-41db-171b23b35e4f%40ssinger.info
+-- (which hopefully is unlikely to be reintroduced), but also seems
+-- independently worthwhile to cover.
+INSERT INTO vactst SELECT generate_series(1, 300);
+DELETE FROM vactst WHERE i % 7 = 0; -- delete a few rows outside
+BEGIN;
+INSERT INTO vactst SELECT generate_series(301, 400);
+DELETE FROM vactst WHERE i % 5 <> 0; -- delete a few rows inside
+ANALYZE vactst;
+COMMIT;
+VACUUM FULL pg_am;
+VACUUM FULL pg_class;
+VACUUM FULL pg_database;
+VACUUM FULL vaccluster;
+ERROR: ANALYZE cannot be executed from VACUUM or ANALYZE
+CONTEXT: SQL function "do_analyze" statement 1
+SQL function "wrap_do_analyze" statement 1
+VACUUM FULL vactst;
+VACUUM (DISABLE_PAGE_SKIPPING) vaccluster;
+-- PARALLEL option
+CREATE TABLE pvactst (i INT, a INT[], p POINT) with (autovacuum_enabled = off);
+INSERT INTO pvactst SELECT i, array[1,2,3], point(i, i+1) FROM generate_series(1,1000) i;
+CREATE INDEX btree_pvactst ON pvactst USING btree (i);
+CREATE INDEX hash_pvactst ON pvactst USING hash (i);
+CREATE INDEX brin_pvactst ON pvactst USING brin (i);
+CREATE INDEX gin_pvactst ON pvactst USING gin (a);
+CREATE INDEX gist_pvactst ON pvactst USING gist (p);
+CREATE INDEX spgist_pvactst ON pvactst USING spgist (p);
+-- VACUUM invokes parallel index cleanup
+SET min_parallel_index_scan_size to 0;
+VACUUM (PARALLEL 2) pvactst;
+-- VACUUM invokes parallel bulk-deletion
+UPDATE pvactst SET i = i WHERE i < 1000;
+VACUUM (PARALLEL 2) pvactst;
+UPDATE pvactst SET i = i WHERE i < 1000;
+VACUUM (PARALLEL 0) pvactst; -- disable parallel vacuum
+VACUUM (PARALLEL -1) pvactst; -- error
+ERROR: parallel vacuum degree must be between 0 and 1024
+LINE 1: VACUUM (PARALLEL -1) pvactst;
+ ^
+VACUUM (PARALLEL 2, INDEX_CLEANUP FALSE) pvactst;
+VACUUM (PARALLEL 2, FULL TRUE) pvactst; -- error, cannot use both PARALLEL and FULL
+ERROR: cannot specify both FULL and PARALLEL options
+VACUUM (PARALLEL) pvactst; -- error, cannot use PARALLEL option without parallel degree
+ERROR: parallel option requires a value between 0 and 1024
+LINE 1: VACUUM (PARALLEL) pvactst;
+ ^
+CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);
+CREATE INDEX tmp_idx1 ON tmp (a);
+VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option
+RESET min_parallel_index_scan_size;
+DROP TABLE pvactst;
+-- INDEX_CLEANUP option
+CREATE TABLE no_index_cleanup (i INT PRIMARY KEY, t TEXT);
+-- Use uncompressed data stored in toast.
+CREATE INDEX no_index_cleanup_idx ON no_index_cleanup(t);
+ALTER TABLE no_index_cleanup ALTER COLUMN t SET STORAGE EXTERNAL;
+INSERT INTO no_index_cleanup(i, t) VALUES (generate_series(1,30),
+ repeat('1234567890',300));
+-- index cleanup option is ignored if VACUUM FULL
+VACUUM (INDEX_CLEANUP TRUE, FULL TRUE) no_index_cleanup;
+VACUUM (FULL TRUE) no_index_cleanup;
+-- Toast inherits the value from its parent table.
+ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = false);
+DELETE FROM no_index_cleanup WHERE i < 15;
+-- Nothing is cleaned up.
+VACUUM no_index_cleanup;
+-- Both parent relation and toast are cleaned up.
+ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = true);
+VACUUM no_index_cleanup;
+-- Parameter is set for both the parent table and its toast relation.
+INSERT INTO no_index_cleanup(i, t) VALUES (generate_series(31,60),
+ repeat('1234567890',300));
+DELETE FROM no_index_cleanup WHERE i < 45;
+-- Only toast index is cleaned up.
+ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = false,
+ toast.vacuum_index_cleanup = true);
+VACUUM no_index_cleanup;
+-- Only parent is cleaned up.
+ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = true,
+ toast.vacuum_index_cleanup = false);
+VACUUM no_index_cleanup;
+-- Test some extra relations.
+VACUUM (INDEX_CLEANUP FALSE) vaccluster;
+VACUUM (INDEX_CLEANUP FALSE) vactst; -- index cleanup option is ignored if no indexes
+VACUUM (INDEX_CLEANUP FALSE, FREEZE TRUE) vaccluster;
+-- TRUNCATE option
+CREATE TABLE vac_truncate_test(i INT NOT NULL, j text)
+ WITH (vacuum_truncate=true, autovacuum_enabled=false);
+INSERT INTO vac_truncate_test VALUES (1, NULL), (NULL, NULL);
+ERROR: null value in column "i" of relation "vac_truncate_test" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+VACUUM (TRUNCATE FALSE) vac_truncate_test;
+SELECT pg_relation_size('vac_truncate_test') > 0;
+ ?column?
+----------
+ t
+(1 row)
+
+VACUUM vac_truncate_test;
+SELECT pg_relation_size('vac_truncate_test') = 0;
+ ?column?
+----------
+ f
+(1 row)
+
+VACUUM (TRUNCATE FALSE, FULL TRUE) vac_truncate_test;
+DROP TABLE vac_truncate_test;
+-- partitioned table
+CREATE TABLE vacparted (a int, b char) PARTITION BY LIST (a);
+CREATE TABLE vacparted1 PARTITION OF vacparted FOR VALUES IN (1);
+INSERT INTO vacparted VALUES (1, 'a');
+UPDATE vacparted SET b = 'b';
+VACUUM (ANALYZE) vacparted;
+VACUUM (FULL) vacparted;
+VACUUM (FREEZE) vacparted;
+-- check behavior with duplicate column mentions
+VACUUM ANALYZE vacparted(a,b,a);
+ERROR: column "a" of relation "vacparted" appears more than once
+ANALYZE vacparted(a,b,b);
+ERROR: column "b" of relation "vacparted" appears more than once
+-- multiple tables specified
+VACUUM vaccluster, vactst;
+VACUUM vacparted, does_not_exist;
+ERROR: relation "does_not_exist" does not exist
+VACUUM (FREEZE) vacparted, vaccluster, vactst;
+VACUUM (FREEZE) does_not_exist, vaccluster;
+ERROR: relation "does_not_exist" does not exist
+VACUUM ANALYZE vactst, vacparted (a);
+VACUUM ANALYZE vactst (does_not_exist), vacparted (b);
+ERROR: column "does_not_exist" of relation "vactst" does not exist
+VACUUM FULL vacparted, vactst;
+VACUUM FULL vactst, vacparted (a, b), vaccluster (i);
+ERROR: ANALYZE option must be specified when a column list is provided
+ANALYZE vactst, vacparted;
+ANALYZE vacparted (b), vactst;
+ANALYZE vactst, does_not_exist, vacparted;
+ERROR: relation "does_not_exist" does not exist
+ANALYZE vactst (i), vacparted (does_not_exist);
+ERROR: column "does_not_exist" of relation "vacparted" does not exist
+ANALYZE vactst, vactst;
+BEGIN; -- ANALYZE behaves differently inside a transaction block
+ANALYZE vactst, vactst;
+COMMIT;
+-- parenthesized syntax for ANALYZE
+ANALYZE (VERBOSE) does_not_exist;
+ERROR: relation "does_not_exist" does not exist
+ANALYZE (nonexistent-arg) does_not_exist;
+ERROR: syntax error at or near "arg"
+LINE 1: ANALYZE (nonexistent-arg) does_not_exist;
+ ^
+ANALYZE (nonexistentarg) does_not_exit;
+ERROR: unrecognized ANALYZE option "nonexistentarg"
+LINE 1: ANALYZE (nonexistentarg) does_not_exit;
+ ^
+-- ensure argument order independence, and that SKIP_LOCKED on non-existing
+-- relation still errors out. Suppress WARNING messages caused by concurrent
+-- autovacuums.
+SET client_min_messages TO 'ERROR';
+ANALYZE (SKIP_LOCKED, VERBOSE) does_not_exist;
+ERROR: relation "does_not_exist" does not exist
+ANALYZE (VERBOSE, SKIP_LOCKED) does_not_exist;
+ERROR: relation "does_not_exist" does not exist
+-- SKIP_LOCKED option
+VACUUM (SKIP_LOCKED) vactst;
+VACUUM (SKIP_LOCKED, FULL) vactst;
+ANALYZE (SKIP_LOCKED) vactst;
+RESET client_min_messages;
+-- ensure VACUUM and ANALYZE don't have a problem with serializable
+SET default_transaction_isolation = serializable;
+VACUUM vactst;
+ANALYZE vactst;
+RESET default_transaction_isolation;
+BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ANALYZE vactst;
+COMMIT;
+DROP TABLE vaccluster;
+DROP TABLE vactst;
+DROP TABLE vacparted;
+DROP TABLE no_index_cleanup;
+-- relation ownership, WARNING logs generated as all are skipped.
+CREATE TABLE vacowned (a int);
+CREATE TABLE vacowned_parted (a int) PARTITION BY LIST (a);
+CREATE TABLE vacowned_part1 PARTITION OF vacowned_parted FOR VALUES IN (1);
+CREATE TABLE vacowned_part2 PARTITION OF vacowned_parted FOR VALUES IN (2);
+CREATE ROLE regress_vacuum;
+SET ROLE regress_vacuum;
+-- Simple table
+VACUUM vacowned;
+WARNING: skipping "vacowned" --- only table or database owner can vacuum it
+ANALYZE vacowned;
+WARNING: skipping "vacowned" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned;
+WARNING: skipping "vacowned" --- only table or database owner can vacuum it
+-- Catalog
+VACUUM pg_catalog.pg_class;
+WARNING: skipping "pg_class" --- only superuser or database owner can vacuum it
+ANALYZE pg_catalog.pg_class;
+WARNING: skipping "pg_class" --- only superuser or database owner can analyze it
+VACUUM (ANALYZE) pg_catalog.pg_class;
+WARNING: skipping "pg_class" --- only superuser or database owner can vacuum it
+-- Shared catalog
+VACUUM pg_catalog.pg_authid;
+WARNING: skipping "pg_authid" --- only superuser can vacuum it
+ANALYZE pg_catalog.pg_authid;
+WARNING: skipping "pg_authid" --- only superuser can analyze it
+VACUUM (ANALYZE) pg_catalog.pg_authid;
+WARNING: skipping "pg_authid" --- only superuser can vacuum it
+-- Partitioned table and its partitions, nothing owned by other user.
+-- Relations are not listed in a single command to test ownership
+-- independently.
+VACUUM vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+VACUUM vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+ANALYZE vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can analyze it
+WARNING: skipping "vacowned_part1" --- only table or database owner can analyze it
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+ANALYZE vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can analyze it
+ANALYZE vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+RESET ROLE;
+-- Partitioned table and one partition owned by other user.
+ALTER TABLE vacowned_parted OWNER TO regress_vacuum;
+ALTER TABLE vacowned_part1 OWNER TO regress_vacuum;
+SET ROLE regress_vacuum;
+VACUUM vacowned_parted;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM vacowned_part1;
+VACUUM vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+ANALYZE vacowned_parted;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+ANALYZE vacowned_part1;
+ANALYZE vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned_parted;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part1;
+VACUUM (ANALYZE) vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+RESET ROLE;
+-- Only one partition owned by other user.
+ALTER TABLE vacowned_parted OWNER TO CURRENT_USER;
+SET ROLE regress_vacuum;
+VACUUM vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM vacowned_part1;
+VACUUM vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+ANALYZE vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can analyze it
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+ANALYZE vacowned_part1;
+ANALYZE vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part1;
+VACUUM (ANALYZE) vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+RESET ROLE;
+-- Only partitioned table owned by other user.
+ALTER TABLE vacowned_parted OWNER TO regress_vacuum;
+ALTER TABLE vacowned_part1 OWNER TO CURRENT_USER;
+SET ROLE regress_vacuum;
+VACUUM vacowned_parted;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+VACUUM vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+ANALYZE vacowned_parted;
+WARNING: skipping "vacowned_part1" --- only table or database owner can analyze it
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+ANALYZE vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can analyze it
+ANALYZE vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned_parted;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+RESET ROLE;
+DROP TABLE vacowned;
+DROP TABLE vacowned_parted;
+DROP ROLE regress_vacuum;
diff --git src/test/regress/expected/zedstore.out src/test/regress/expected/zedstore.out
new file mode 100644
index 0000000000..9ab09d8514
--- /dev/null
+++ src/test/regress/expected/zedstore.out
@@ -0,0 +1,666 @@
+-- simple tests to iteratively build the zedstore
+-- create and drop works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+drop table t_zedstore;
+-- insert and select works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+insert into t_zedstore select i,i+1,i+2 from generate_series(1, 10)i;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 4 | 5 | 6
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+(10 rows)
+
+-- selecting only a few columns works
+select c1, c3 from t_zedstore;
+ c1 | c3
+----+----
+ 1 | 3
+ 2 | 4
+ 3 | 5
+ 4 | 6
+ 5 | 7
+ 6 | 8
+ 7 | 9
+ 8 | 10
+ 9 | 11
+ 10 | 12
+(10 rows)
+
+-- only a few columns in the output and in the where clause works
+select c3 from t_zedstore where c2 > 5;
+ c3
+----
+ 7
+ 8
+ 9
+ 10
+ 11
+ 12
+(6 rows)
+
+-- where clause with invalid ctid works
+select * from t_zedstore where ctid = '(0,0)';
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+-- Test abort works
+begin;
+insert into t_zedstore select i,i+1,i+2 from generate_series(21, 25)i;
+abort;
+insert into t_zedstore select i,i+1,i+2 from generate_series(31, 35)i;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 4 | 5 | 6
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(15 rows)
+
+--
+-- Test indexing
+--
+create index on t_zedstore (c1);
+set enable_seqscan=off;
+set enable_indexscan=on;
+set enable_bitmapscan=off;
+-- index scan
+select * from t_zedstore where c1 = 5;
+ c1 | c2 | c3
+----+----+----
+ 5 | 6 | 7
+(1 row)
+
+-- index-only scan
+select c1 from t_zedstore where c1 = 5;
+ c1
+----
+ 5
+(1 row)
+
+-- bitmap scan
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+select c1, c2 from t_zedstore where c1 between 5 and 10;
+ c1 | c2
+----+----
+ 5 | 6
+ 6 | 7
+ 7 | 8
+ 8 | 9
+ 9 | 10
+ 10 | 11
+(6 rows)
+
+--
+-- Test DELETE and UPDATE
+--
+delete from t_zedstore where c2 = 5;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(14 rows)
+
+delete from t_zedstore where c2 < 5;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(11 rows)
+
+update t_zedstore set c2 = 100 where c1 = 8;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+-----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+ 8 | 100 | 10
+(11 rows)
+
+--
+-- Test page deletion, by deleting a bigger range of values
+--
+insert into t_zedstore select i,i+1,i+2 from generate_series(10000, 15000)i;
+delete from t_zedstore where c1 >= 10000;
+--
+-- Test VACUUM
+--
+vacuum t_zedstore;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+-----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+ 8 | 100 | 10
+(11 rows)
+
+--
+-- Test in-line toasting
+--
+create table t_zedtoast(c1 int, t text) USING zedstore;
+insert into t_zedtoast select i, repeat('x', 10000) from generate_series(1, 10) i;
+select c1, length(t) from t_zedtoast;
+ c1 | length
+----+--------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+ 6 | 10000
+ 7 | 10000
+ 8 | 10000
+ 9 | 10000
+ 10 | 10000
+(10 rows)
+
+-- TODO: this test won't actually work when we start using the UNDO framework
+-- because we will not have control over when the undo record will be removed
+delete from t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+ has_toast_pages
+-----------------
+ f
+(1 row)
+
+vacuum t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+ has_toast_pages
+-----------------
+ f
+(1 row)
+
+--
+-- Test out-of-line toasting
+--
+insert into t_zedtoast select i, repeat('x', 1000000) from generate_series(1, 10) i;
+select c1, length(t) from t_zedtoast;
+ c1 | length
+----+---------
+ 1 | 1000000
+ 2 | 1000000
+ 3 | 1000000
+ 4 | 1000000
+ 5 | 1000000
+ 6 | 1000000
+ 7 | 1000000
+ 8 | 1000000
+ 9 | 1000000
+ 10 | 1000000
+(10 rows)
+
+-- TODO: this test won't actually work when we start using the UNDO framework
+-- because we will not have control over when the undo record will be removed
+delete from t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+ has_toast_pages
+-----------------
+ t
+(1 row)
+
+vacuum t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+ has_toast_pages
+-----------------
+ f
+(1 row)
+
+--
+-- Test NULL values
+--
+create table t_zednullvalues(c1 int, c2 int) USING zedstore;
+insert into t_zednullvalues values(1, NULL), (NULL, 2);
+select * from t_zednullvalues;
+ c1 | c2
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+select c2 from t_zednullvalues;
+ c2
+----
+
+ 2
+(2 rows)
+
+update t_zednullvalues set c1 = 1, c2 = NULL;
+select * from t_zednullvalues;
+ c1 | c2
+----+----
+ 1 |
+ 1 |
+(2 rows)
+
+--
+-- Test COPY
+--
+create table t_zedcopy(a serial, b int, c text not null default 'stuff', d text,e text) USING zedstore;
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+COPY t_zedcopy (b, d) from stdin;
+COPY t_zedcopy (b, d) from stdin;
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 1 | 1 | stuff | test_1 |
+ 2 | 2 | stuff | test_2 |
+ 3 | 3 | stuff | test_3 |
+ 4 | 4 | stuff | test_4 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(12 rows)
+
+COPY t_zedcopy (a, d, e) to stdout;
+9999 NN \N
+10000 41 51
+1 test_1 \N
+2 test_2 \N
+3 test_3 \N
+4 test_4 \N
+5 test_5 \N
+10001 42 52
+10002 43 53
+10003 44 54
+10004 45 55
+10005 46 56
+--
+-- Also test delete and update on the table that was populated with COPY.
+-- This exercises splitting the array item. (A table not populated with
+-- COPY only contains single items, at the moment.)
+--
+delete from t_zedcopy where b = 4;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 1 | 1 | stuff | test_1 |
+ 2 | 2 | stuff | test_2 |
+ 3 | 3 | stuff | test_3 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(11 rows)
+
+delete from t_zedcopy where b < 3;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 3 | 3 | stuff | test_3 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(9 rows)
+
+update t_zedcopy set b = 100 where b = 5;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+-----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 3 | 3 | stuff | test_3 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+ 5 | 100 | stuff | test_5 |
+(9 rows)
+
+-- Test rolling back COPY
+begin;
+COPY t_zedcopy (b, d) from stdin;
+rollback;
+select count(*) from t_zedcopy where b >= 20000;
+ count
+-------
+ 0
+(1 row)
+
+--
+-- Test zero column table
+--
+create table t_zwithzerocols() using zedstore;
+insert into t_zwithzerocols select t.* from t_zwithzerocols t right join generate_series(1,1) on true;
+select count(*) from t_zwithzerocols;
+ count
+-------
+ 1
+(1 row)
+
+-- Test for alter table add column
+create table t_zaddcol(a int) using zedstore;
+insert into t_zaddcol select * from generate_series(1, 3);
+-- rewrite case
+alter table t_zaddcol add column b int generated always as (a + 1) stored;
+select * from t_zaddcol;
+ a | b
+---+---
+ 1 | 2
+ 2 | 3
+ 3 | 4
+(3 rows)
+
+-- test alter table add column with no default
+create table t_zaddcol_simple(a int) using zedstore;
+insert into t_zaddcol_simple values (1);
+alter table t_zaddcol_simple add b int;
+select * from t_zaddcol_simple;
+ a | b
+---+---
+ 1 |
+(1 row)
+
+insert into t_zaddcol_simple values(2,3);
+select * from t_zaddcol_simple;
+ a | b
+---+---
+ 1 |
+ 2 | 3
+(2 rows)
+
+-- fixed length default value stored in catalog
+alter table t_zaddcol add column c int default 3;
+select * from t_zaddcol;
+ a | b | c
+---+---+---
+ 1 | 2 | 3
+ 2 | 3 | 3
+ 3 | 4 | 3
+(3 rows)
+
+-- variable length default value stored in catalog
+alter table t_zaddcol add column d text default 'abcdefgh';
+select d from t_zaddcol;
+ d
+----------
+ abcdefgh
+ abcdefgh
+ abcdefgh
+(3 rows)
+
+-- insert after add column
+insert into t_zaddcol values (2);
+select * from t_zaddcol;
+ a | b | c | d
+---+---+---+----------
+ 1 | 2 | 3 | abcdefgh
+ 2 | 3 | 3 | abcdefgh
+ 3 | 4 | 3 | abcdefgh
+ 2 | 3 | 3 | abcdefgh
+(4 rows)
+
+insert into t_zaddcol (a, c, d) values (3,5, 'test_insert');
+select b,c,d from t_zaddcol;
+ b | c | d
+---+---+-------------
+ 2 | 3 | abcdefgh
+ 3 | 3 | abcdefgh
+ 4 | 3 | abcdefgh
+ 3 | 3 | abcdefgh
+ 4 | 5 | test_insert
+(5 rows)
+
+--
+-- Test TABLESAMPLE
+--
+-- The regular tablesample.sql test doesn't directly work for zedstore, as
+-- it uses fillfactor to create a specific block layout for
+-- heap. Hence, the output differs between heap and zedstore tables while
+-- sampling. We need many tuples here to get multiple logical
+-- blocks, as there is no way to force TIDs to spread / jump for zedstore.
+--
+CREATE TABLE t_ztablesample (id int, name text) using zedstore;
+INSERT INTO t_ztablesample
+ SELECT i, repeat(i::text, 2) FROM generate_series(0, 299) s(i);
+-- let's delete half the rows (even-numbered ids) to limit the output
+DELETE FROM t_ztablesample WHERE id%2 = 0;
+-- should return ALL visible tuples from SOME blocks
+SELECT ctid,t.id FROM t_ztablesample AS t TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ ctid | id
+---------+-----
+ (1,2) | 129
+ (1,4) | 131
+ (1,6) | 133
+ (1,8) | 135
+ (1,10) | 137
+ (1,12) | 139
+ (1,14) | 141
+ (1,16) | 143
+ (1,18) | 145
+ (1,20) | 147
+ (1,22) | 149
+ (1,24) | 151
+ (1,26) | 153
+ (1,28) | 155
+ (1,30) | 157
+ (1,32) | 159
+ (1,34) | 161
+ (1,36) | 163
+ (1,38) | 165
+ (1,40) | 167
+ (1,42) | 169
+ (1,44) | 171
+ (1,46) | 173
+ (1,48) | 175
+ (1,50) | 177
+ (1,52) | 179
+ (1,54) | 181
+ (1,56) | 183
+ (1,58) | 185
+ (1,60) | 187
+ (1,62) | 189
+ (1,64) | 191
+ (1,66) | 193
+ (1,68) | 195
+ (1,70) | 197
+ (1,72) | 199
+ (1,74) | 201
+ (1,76) | 203
+ (1,78) | 205
+ (1,80) | 207
+ (1,82) | 209
+ (1,84) | 211
+ (1,86) | 213
+ (1,88) | 215
+ (1,90) | 217
+ (1,92) | 219
+ (1,94) | 221
+ (1,96) | 223
+ (1,98) | 225
+ (1,100) | 227
+ (1,102) | 229
+ (1,104) | 231
+ (1,106) | 233
+ (1,108) | 235
+ (1,110) | 237
+ (1,112) | 239
+ (1,114) | 241
+ (1,116) | 243
+ (1,118) | 245
+ (1,120) | 247
+ (1,122) | 249
+ (1,124) | 251
+ (1,126) | 253
+ (1,128) | 255
+ (2,2) | 257
+ (2,4) | 259
+ (2,6) | 261
+ (2,8) | 263
+ (2,10) | 265
+ (2,12) | 267
+ (2,14) | 269
+ (2,16) | 271
+ (2,18) | 273
+ (2,20) | 275
+ (2,22) | 277
+ (2,24) | 279
+ (2,26) | 281
+ (2,28) | 283
+ (2,30) | 285
+ (2,32) | 287
+ (2,34) | 289
+ (2,36) | 291
+ (2,38) | 293
+ (2,40) | 295
+ (2,42) | 297
+ (2,44) | 299
+(86 rows)
+
+-- should return SOME visible tuples but from ALL the blocks
+SELECT ctid,id FROM t_ztablesample TABLESAMPLE BERNOULLI (50) REPEATABLE (0);
+ ctid | id
+---------+-----
+ (0,4) | 3
+ (0,6) | 5
+ (0,8) | 7
+ (0,20) | 19
+ (0,30) | 29
+ (0,42) | 41
+ (0,44) | 43
+ (0,48) | 47
+ (0,52) | 51
+ (0,54) | 53
+ (0,56) | 55
+ (0,62) | 61
+ (0,64) | 63
+ (0,66) | 65
+ (0,76) | 75
+ (0,80) | 79
+ (0,82) | 81
+ (0,84) | 83
+ (0,88) | 87
+ (0,90) | 89
+ (0,92) | 91
+ (0,98) | 97
+ (0,106) | 105
+ (0,108) | 107
+ (0,122) | 121
+ (0,126) | 125
+ (1,2) | 129
+ (1,4) | 131
+ (1,6) | 133
+ (1,8) | 135
+ (1,10) | 137
+ (1,12) | 139
+ (1,20) | 147
+ (1,24) | 151
+ (1,26) | 153
+ (1,28) | 155
+ (1,30) | 157
+ (1,32) | 159
+ (1,34) | 161
+ (1,40) | 167
+ (1,44) | 171
+ (1,46) | 173
+ (1,58) | 185
+ (1,66) | 193
+ (1,68) | 195
+ (1,70) | 197
+ (1,78) | 205
+ (1,80) | 207
+ (1,88) | 215
+ (1,92) | 219
+ (1,96) | 223
+ (1,100) | 227
+ (1,102) | 229
+ (1,106) | 233
+ (1,112) | 239
+ (1,116) | 243
+ (1,120) | 247
+ (1,122) | 249
+ (1,126) | 253
+ (2,2) | 257
+ (2,6) | 261
+ (2,8) | 263
+ (2,10) | 265
+ (2,12) | 267
+ (2,16) | 271
+ (2,18) | 273
+ (2,24) | 279
+ (2,26) | 281
+ (2,28) | 283
+ (2,30) | 285
+ (2,34) | 289
+ (2,36) | 291
+ (2,42) | 297
+ (2,44) | 299
+(74 rows)
+
diff --git src/test/regress/expected/zstidscan.out src/test/regress/expected/zstidscan.out
new file mode 100644
index 0000000000..d7f3c76a24
--- /dev/null
+++ src/test/regress/expected/zstidscan.out
@@ -0,0 +1,213 @@
+select '1'::zstid;
+ zstid
+-------
+ 1
+(1 row)
+
+select '-1'::zstid;
+ERROR: value "-1" is out of range for type zstid
+LINE 1: select '-1'::zstid;
+ ^
+select -'1'::zstid;
+ERROR: operator does not exist: - zstid
+LINE 1: select -'1'::zstid;
+ ^
+HINT: No operator matches the given name and argument type. You might need to add an explicit type cast.
+-- int2 conversion
+select 1::int2::zstid;
+ zstid
+-------
+ 1
+(1 row)
+
+select (-1)::int2::zstid;
+ERROR: value -1 is out of range for type zstid
+select -1::int2::zstid;
+ERROR: operator does not exist: - zstid
+LINE 1: select -1::int2::zstid;
+ ^
+HINT: No operator matches the given name and argument type. You might need to add an explicit type cast.
+-- int4 conversion
+select 1::zstid;
+ zstid
+-------
+ 1
+(1 row)
+
+select (-1)::zstid;
+ERROR: value -1 is out of range for type zstid
+select -1::zstid;
+ERROR: operator does not exist: - zstid
+LINE 1: select -1::zstid;
+ ^
+HINT: No operator matches the given name and argument type. You might need to add an explicit type cast.
+-- int8 conversion
+select 1::int8::zstid;
+ zstid
+-------
+ 1
+(1 row)
+
+select 1000000000000000::zstid; -- bigger than MaxZSTid
+ERROR: value 1000000000000000 is out of range for type zstid
+select (-1)::int8::zstid;
+ERROR: value -1 is out of range for type zstid
+select -1::int8::zstid;
+ERROR: operator does not exist: - zstid
+LINE 1: select -1::int8::zstid;
+ ^
+HINT: No operator matches the given name and argument type. You might need to add an explicit type cast.
+create table if not exists zstidscan(a int) using zedstore;
+insert into zstidscan values (1), (2), (3);
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan;
+ QUERY PLAN
+-----------------------
+ Seq Scan on zstidscan
+(1 row)
+
+select ctid, ctid::zstid as zstid, a from zstidscan;
+ ctid | zstid | a
+-------+-------+---
+ (0,1) | 1 | 1
+ (0,2) | 2 | 2
+ (0,3) | 3 | 3
+(3 rows)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid = 2;
+ QUERY PLAN
+----------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid = '2'::zstid)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid = 2;
+ ctid | zstid | a
+-------+-------+---
+ (0,2) | 2 | 2
+(1 row)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid >= 2;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid >= '2'::zstid)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid >= 2;
+ ctid | zstid | a
+-------+-------+---
+ (0,2) | 2 | 2
+ (0,3) | 3 | 3
+(2 rows)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <= 2;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid <= '2'::zstid)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <= 2;
+ ctid | zstid | a
+-------+-------+---
+ (0,1) | 1 | 1
+ (0,2) | 2 | 2
+(2 rows)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid < 2;
+ QUERY PLAN
+----------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid < '2'::zstid)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid < 2;
+ ctid | zstid | a
+-------+-------+---
+ (0,1) | 1 | 1
+(1 row)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid > 2;
+ QUERY PLAN
+----------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid > '2'::zstid)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid > 2;
+ ctid | zstid | a
+-------+-------+---
+ (0,3) | 3 | 3
+(1 row)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <> 2;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid <> '2'::zstid)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <> 2;
+ ctid | zstid | a
+-------+-------+---
+ (0,1) | 1 | 1
+ (0,3) | 3 | 3
+(2 rows)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid in (2,3);
+ QUERY PLAN
+----------------------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid = ANY ('{2,3}'::zstid[]))
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid in (2,3);
+ ctid | zstid | a
+-------+-------+---
+ (0,2) | 2 | 2
+ (0,3) | 3 | 3
+(2 rows)
+
+-- TODO: casting to int2 or int4 might be useful
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int2 % 3 = 0;
+ERROR: cannot cast type zstid to smallint
+LINE 1: ...zstid as zstid, a from zstidscan where ctid::zstid::int2 % 3...
+ ^
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int4 % 3 = 0;
+ERROR: cannot cast type zstid to integer
+LINE 1: ...zstid as zstid, a from zstidscan where ctid::zstid::int4 % 3...
+ ^
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int8 % 3 = 0;
+ QUERY PLAN
+---------------------------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((((ctid)::zstid)::bigint % '3'::bigint) = 0)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int8 % 3 = 0;
+ ctid | zstid | a
+-------+-------+---
+ (0,3) | 3 | 3
+(1 row)
+
+-- TODO: Add necessary functions to do these useful aggregates on zstid types
+select max(ctid::zstid) from zstidscan;
+ERROR: function max(zstid) does not exist
+LINE 1: select max(ctid::zstid) from zstidscan;
+ ^
+HINT: No function matches the given name and argument types. You might need to add explicit type casts.
+select min(ctid::zstid) from zstidscan;
+ERROR: function min(zstid) does not exist
+LINE 1: select min(ctid::zstid) from zstidscan;
+ ^
+HINT: No function matches the given name and argument types. You might need to add explicit type casts.
+drop table zstidscan;
diff --git src/test/regress/gpdiff.pl src/test/regress/gpdiff.pl
new file mode 100755
index 0000000000..02ec0272c6
--- /dev/null
+++ src/test/regress/gpdiff.pl
@@ -0,0 +1,260 @@
+#!/usr/bin/env perl
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# Pod::Usage is loaded lazily when needed, if the --help or other such option
+# is actually used. Loading the module takes some time, which adds up when
+# running hundreds of regression tests, and gpdiff.pl is invoked for every
+# test. See lazy_pod2usage().
+#use Pod::Usage;
+
+use strict;
+use warnings;
+use File::Spec;
+use Getopt::Long qw(GetOptions);
+Getopt::Long::Configure qw(pass_through);
+
+# Load atmsort module from the same dir as this script
+use FindBin;
+use lib "$FindBin::Bin";
+use atmsort;
+
+=head1 NAME
+
+B<gpdiff.pl>
+
+=head1 SYNOPSIS
+
+B<gpdiff.pl> [options] logfile [logfile...]
+
+Options:
+
+Normally, gpdiff takes the standard "diff" options and passes them
+directly to the diff program. Try `diff --help' for more information
+on the standard options. The following options are specific to gpdiff:
+
+ -help brief help message
+ -man full documentation
+ -version print gpdiff version and underlying diff version
+ -verbose print verbose info
+ -gpd_ignore_plans ignore explain plan content in input files
+ -gpd_init <file> load initialization file
+
+=head1 OPTIONS
+
+=over 8
+
+=item B<-help>
+
+    Prints a brief help message and exits.
+
+=item B<-man>
+
+ Prints the manual page and exits.
+
+=item B<-version>
+
+    Prints the gpdiff version and the underlying diff version.
+
+=item B<-verbose>
+
+ Prints verbose information.
+
+=item B<-gpd_ignore_plans>
+
+Specify this option to ignore explain plan differences between the
+input files. All plan content in the input files is ignored,
+masking any differences in plans.
+
+=item B<-gpd_init> <file>
+
+Specify an initialization file containing a series of directives
+(mainly for match_subs) that get applied to the input files. To
+specify multiple initialization files, use multiple gpd_init arguments, eg:
+
+  -gpd_init file1 -gpd_init file2
+
+=back
+
+=head1 DESCRIPTION
+
+gpdiff compares files using diff after processing them with
+atmsort.pm. This comparison is designed to handle cases where
+query output order or plans may differ. Type "atmsort.pl
+--man" for more details. gpdiff is invoked by pg_regress as part of
+"make installcheck-world". In this case the diff options are
+something like:
+
+ "-w -I NOTICE: -I HINT: -I CONTEXT: -I REGRESS_IGNORE:".
+
+Like diff, gpdiff can compare two files, a file and directory, a
+directory and file, and two directories. However, when gpdiff compares
+two directories, it only returns the exit status of the diff
+comparison of the final two files.
+
+=head1 BUGS
+
+While the exit status is set correctly for most cases,
+STDERR messages from diff are not displayed.
+
+=cut
+
+# Calls pod2usage, but loads the module first.
+sub lazy_pod2usage
+{
+ require Pod::Usage;
+ Pod::Usage::pod2usage(@_);
+}
+
+my %glob_atmsort_args;
+
+my $glob_ignore_plans;
+my $glob_init_file = [];
+
+sub gpdiff_files
+{
+ my ($f1, $f2, $d2d) = @_;
+ my @tmpfils;
+ my $newf1;
+ my $newf2;
+
+ atmsort::atmsort_init(%glob_atmsort_args);
+ $newf1 = atmsort::run($f1);
+ $newf2 = atmsort::run($f2);
+
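+ # Any options not consumed by GetOptions remain in @ARGV and are passed
+ # through to diff along with the two normalized temp files.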
+ my $args = join(' ', @ARGV, $newf1, $newf2);
+
+# print "args: $args\n";
+
+ my $outi =`diff $args`;
+
+ my $stat = $? >> 8; # diff status
+
+ # prefix the diff output with the files names for a "directory to
+ # directory" diff
+ if (defined($d2d) && length($outi))
+ {
+ $outi = "diff $f1 $f2\n" . $outi;
+ }
+
+ # replace temp file name references with actual file names
+ $outi =~ s/\Q$newf1\E/\Q$f1\E/gm;
+ $outi =~ s/\Q$newf2\E/\Q$f2\E/gm;
+
+ print $outi;
+
+#my $stat = WEXITVALUE($?); # diff status
+
+ unlink $newf1;
+ unlink $newf2;
+
+ return ($stat);
+}
+
+sub filefunc
+{
+ my ($f1, $f2, $d2d) = @_;
+
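+ # Dispatch on argument types: two plain files are diffed directly;
+ # directory arguments recurse to compare correspondingly named files.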
+ if ((-f $f1) && (-f $f2))
+ {
+ return (gpdiff_files($f1, $f2, $d2d));
+ }
+
+ # if f1 is a directory, do the filefunc of every file in that directory
+ if ((-d $f1) && (-d $f2))
+ {
+ my $dir = $f1;
+ my ($dir_h, $stat);
+
+ if (opendir($dir_h, $dir))
+ {
+ my $fnam;
+ while ($fnam = readdir($dir_h))
+ {
+ # ignore ., ..
+ next if ($fnam eq '.' || $fnam eq '..');
+
+ my $absname = File::Spec->rel2abs(
+ File::Spec->catfile($dir, $fnam));
+
+ # specify that is a directory comparison
+ $d2d = {} unless (defined($d2d));
+ $d2d->{dir} = 1;
+ $stat = filefunc($absname, $f2, $d2d);
+ }
+ closedir $dir_h;
+ }
+ return $stat;
+ }
+
+ # if f2 is a directory, find the corresponding file in that directory
+ if ((-f $f1) && (-d $f2))
+ {
+ my $stat;
+ my @foo = File::Spec->splitpath($f1);
+
+ return 0 unless (scalar(@foo));
+ my $basenam = $foo[-1];
+
+ my $fnam = File::Spec->rel2abs(File::Spec->catfile( $f2, $basenam));
+
+ $stat = filefunc($f1, $fnam, $d2d);
+
+ return $stat;
+ }
+
+ # find f2 in dir f1
+ if ((-f $f2) && (-d $f1))
+ {
+ my $stat;
+ my @foo = File::Spec->splitpath($f2);
+
+ return 0 unless (scalar(@foo));
+ my $basenam = $foo[-1];
+
+ my $fnam = File::Spec->rel2abs( File::Spec->catfile( $f1, $basenam));
+
+ $stat = filefunc($fnam, $f2, $d2d);
+
+ return $stat;
+ }
+
+ return 0;
+}
+
+sub print_version
+{
+ print "(PostgreSQL)";
+ exit(0);
+}
+
+if (1)
+{
+ my $pmsg = "";
+
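+ # Parse the gpdiff-specific options; unrecognized options are left in
+ # @ARGV ('pass_through' above) and forwarded verbatim to diff.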
+ GetOptions(
+ "man" => sub { lazy_pod2usage(-msg => $pmsg, -exitstatus => 0, -verbose => 2) },
+ "help" => sub { lazy_pod2usage(-msg => $pmsg, -exitstatus => 1) },
+ "version|v" => \&print_version ,
+ "verbose|Verbose" => \$glob_atmsort_args{VERBOSE},
+ "gpd_ignore_plans|gp_ignore_plans" => \$glob_atmsort_args{IGNORE_PLANS},
+ "gpd_init|gp_init_file=s" => \@{$glob_atmsort_args{INIT_FILES}}
+ );
+
+ lazy_pod2usage(-msg => $pmsg, -exitstatus => 1) unless (scalar(@ARGV) >= 2);
+
+ my $f2 = pop @ARGV;
+ my $f1 = pop @ARGV;
+
+ for my $fname ($f1, $f2)
+ {
+ unless (-e $fname)
+ {
+ print STDERR "gpdiff: $fname: No such file or directory\n";
+ }
+ }
+ exit(2) unless ((-e $f1) && (-e $f2));
+
+ exit(filefunc($f1, $f2));
+}
diff --git src/test/regress/output/misc_1.source src/test/regress/output/misc_1.source
new file mode 100644
index 0000000000..c29c54c414
--- /dev/null
+++ src/test/regress/output/misc_1.source
@@ -0,0 +1,692 @@
+--
+-- MISC
+--
+--
+-- BTREE
+--
+UPDATE onek
+ SET unique1 = onek.unique1 + 1;
+UPDATE onek
+ SET unique1 = onek.unique1 - 1;
+--
+-- BTREE partial
+--
+-- UPDATE onek2
+-- SET unique1 = onek2.unique1 + 1;
+--UPDATE onek2
+-- SET unique1 = onek2.unique1 - 1;
+--
+-- BTREE shutting out non-functional updates
+--
+-- the following two tests seem to take a long time on some
+-- systems. This non-func update stuff needs to be examined
+-- more closely. - jolly (2/22/96)
+--
+UPDATE tmp
+ SET stringu1 = reverse_name(onek.stringu1)
+ FROM onek
+ WHERE onek.stringu1 = 'JBAAAA' and
+ onek.stringu1 = tmp.stringu1;
+UPDATE tmp
+ SET stringu1 = reverse_name(onek2.stringu1)
+ FROM onek2
+ WHERE onek2.stringu1 = 'JCAAAA' and
+ onek2.stringu1 = tmp.stringu1;
+DROP TABLE tmp;
+--UPDATE person*
+-- SET age = age + 1;
+--UPDATE person*
+-- SET age = age + 3
+-- WHERE name = 'linda';
+--
+-- copy
+--
+COPY onek TO '@abs_builddir@/results/onek.data';
+DELETE FROM onek;
+COPY onek FROM '@abs_builddir@/results/onek.data';
+SELECT unique1 FROM onek WHERE unique1 < 2 ORDER BY unique1;
+ unique1
+---------
+ 0
+ 1
+(2 rows)
+
+DELETE FROM onek2;
+COPY onek2 FROM '@abs_builddir@/results/onek.data';
+SELECT unique1 FROM onek2 WHERE unique1 < 2 ORDER BY unique1;
+ unique1
+---------
+ 0
+ 1
+(2 rows)
+
+COPY BINARY stud_emp TO '@abs_builddir@/results/stud_emp.data';
+DELETE FROM stud_emp;
+COPY BINARY stud_emp FROM '@abs_builddir@/results/stud_emp.data';
+SELECT * FROM stud_emp;
+ name | age | location | salary | manager | gpa | percent
+-------+-----+------------+--------+---------+-----+---------
+ jeff | 23 | (8,7.7) | 600 | sharon | 3.5 |
+ cim | 30 | (10.5,4.7) | 400 | | 3.4 |
+ linda | 19 | (0.9,6.1) | 100 | | 2.9 |
+(3 rows)
+
+-- COPY aggtest FROM stdin;
+-- 56 7.8
+-- 100 99.097
+-- 0 0.09561
+-- 42 324.78
+-- .
+-- COPY aggtest TO stdout;
+--
+-- inheritance stress test
+--
+SELECT * FROM a_star*;
+ class | a
+-------+----
+ a | 1
+ a | 2
+ a |
+ b | 3
+ b | 4
+ b |
+ b |
+ c | 5
+ c | 6
+ c |
+ c |
+ d | 7
+ d | 8
+ d | 9
+ d | 10
+ d |
+ d | 11
+ d | 12
+ d | 13
+ d |
+ d |
+ d |
+ d | 14
+ d |
+ d |
+ d |
+ d |
+ e | 15
+ e | 16
+ e | 17
+ e |
+ e | 18
+ e |
+ e |
+ f | 19
+ f | 20
+ f | 21
+ f | 22
+ f |
+ f | 24
+ f | 25
+ f | 26
+ f |
+ f |
+ f |
+ f | 27
+ f |
+ f |
+ f |
+ f |
+(50 rows)
+
+SELECT *
+ FROM b_star* x
+ WHERE x.b = text 'bumble' or x.a < 3;
+ class | a | b
+-------+---+--------
+ b | | bumble
+(1 row)
+
+SELECT class, a
+ FROM c_star* x
+ WHERE x.c ~ text 'hi';
+ class | a
+-------+----
+ c | 5
+ c |
+ d | 7
+ d | 8
+ d | 10
+ d |
+ d | 12
+ d |
+ d |
+ d |
+ e | 15
+ e | 16
+ e |
+ e |
+ f | 19
+ f | 20
+ f | 21
+ f |
+ f | 24
+ f |
+ f |
+ f |
+(22 rows)
+
+SELECT class, b, c
+ FROM d_star* x
+ WHERE x.a < 100;
+ class | b | c
+-------+---------+------------
+ d | grumble | hi sunita
+ d | stumble | hi koko
+ d | rumble |
+ d | | hi kristin
+ d | fumble |
+ d | | hi avi
+ d | |
+ d | |
+(8 rows)
+
+SELECT class, c FROM e_star* x WHERE x.c NOTNULL;
+ class | c
+-------+-------------
+ e | hi carol
+ e | hi bob
+ e | hi michelle
+ e | hi elisa
+ f | hi claire
+ f | hi mike
+ f | hi marcel
+ f | hi keith
+ f | hi marc
+ f | hi allison
+ f | hi jeff
+ f | hi carl
+(12 rows)
+
+SELECT * FROM f_star* x WHERE x.c ISNULL;
+ class | a | c | e | f
+-------+----+---+-----+-------------------------------------------
+ f | 22 | | -7 | ((111,555),(222,666),(333,777),(444,888))
+ f | 25 | | -9 |
+ f | 26 | | | ((11111,33333),(22222,44444))
+ f | | | -11 | ((1111111,3333333),(2222222,4444444))
+ f | 27 | | |
+ f | | | -12 |
+ f | | | | ((11111111,33333333),(22222222,44444444))
+ f | | | |
+(8 rows)
+
+-- grouping and aggregation on inherited sets have been busted in the past...
+SELECT sum(a) FROM a_star*;
+ sum
+-----
+ 355
+(1 row)
+
+SELECT class, sum(a) FROM a_star* GROUP BY class ORDER BY class;
+ class | sum
+-------+-----
+ a | 3
+ b | 7
+ c | 11
+ d | 84
+ e | 66
+ f | 184
+(6 rows)
+
+ALTER TABLE f_star RENAME COLUMN f TO ff;
+ALTER TABLE e_star* RENAME COLUMN e TO ee;
+ALTER TABLE d_star* RENAME COLUMN d TO dd;
+ALTER TABLE c_star* RENAME COLUMN c TO cc;
+ALTER TABLE b_star* RENAME COLUMN b TO bb;
+ALTER TABLE a_star* RENAME COLUMN a TO aa;
+SELECT class, aa
+ FROM a_star* x
+ WHERE aa ISNULL;
+ class | aa
+-------+----
+ a |
+ b |
+ b |
+ c |
+ c |
+ d |
+ d |
+ d |
+ d |
+ d |
+ d |
+ d |
+ d |
+ e |
+ e |
+ e |
+ f |
+ f |
+ f |
+ f |
+ f |
+ f |
+ f |
+ f |
+(24 rows)
+
+-- As of Postgres 7.1, ALTER implicitly recurses,
+-- so this should be same as ALTER a_star*
+ALTER TABLE a_star RENAME COLUMN aa TO foo;
+SELECT class, foo
+ FROM a_star* x
+ WHERE x.foo >= 2;
+ class | foo
+-------+-----
+ a | 2
+ b | 3
+ b | 4
+ c | 5
+ c | 6
+ d | 7
+ d | 8
+ d | 9
+ d | 10
+ d | 11
+ d | 12
+ d | 13
+ d | 14
+ e | 15
+ e | 16
+ e | 17
+ e | 18
+ f | 19
+ f | 20
+ f | 21
+ f | 22
+ f | 24
+ f | 25
+ f | 26
+ f | 27
+(25 rows)
+
+ALTER TABLE a_star RENAME COLUMN foo TO aa;
+SELECT *
+ from a_star*
+ WHERE aa < 1000;
+ class | aa
+-------+----
+ a | 1
+ a | 2
+ b | 3
+ b | 4
+ c | 5
+ c | 6
+ d | 7
+ d | 8
+ d | 9
+ d | 10
+ d | 11
+ d | 12
+ d | 13
+ d | 14
+ e | 15
+ e | 16
+ e | 17
+ e | 18
+ f | 19
+ f | 20
+ f | 21
+ f | 22
+ f | 24
+ f | 25
+ f | 26
+ f | 27
+(26 rows)
+
+ALTER TABLE f_star ADD COLUMN f int4;
+UPDATE f_star SET f = 10;
+ALTER TABLE e_star* ADD COLUMN e int4;
+--UPDATE e_star* SET e = 42;
+SELECT * FROM e_star*;
+ class | aa | cc | ee | e
+-------+----+-------------+-----+---
+ e | 15 | hi carol | -1 |
+ e | 16 | hi bob | |
+ e | 17 | | -2 |
+ e | | hi michelle | -3 |
+ e | 18 | | |
+ e | | hi elisa | |
+ e | | | -4 |
+ f | 19 | hi claire | -5 |
+ f | 20 | hi mike | -6 |
+ f | 21 | hi marcel | |
+ f | 22 | | -7 |
+ f | | hi keith | -8 |
+ f | 24 | hi marc | |
+ f | 25 | | -9 |
+ f | 26 | | |
+ f | | hi allison | -10 |
+ f | | hi jeff | |
+ f | | | -11 |
+ f | 27 | | |
+ f | | hi carl | |
+ f | | | -12 |
+ f | | | |
+ f | | | |
+(23 rows)
+
+ALTER TABLE a_star* ADD COLUMN a text;
+NOTICE: merging definition of column "a" for child "d_star"
+-- That ALTER TABLE should have added TOAST tables.
+SELECT relname, reltoastrelid <> 0 AS has_toast_table
+ FROM pg_class
+ WHERE oid::regclass IN ('a_star', 'c_star')
+ ORDER BY 1;
+ relname | has_toast_table
+---------+-----------------
+ a_star | f
+ c_star | f
+(2 rows)
+
+--UPDATE b_star*
+-- SET a = text 'gazpacho'
+-- WHERE aa > 4;
+SELECT class, aa, a FROM a_star*;
+ class | aa | a
+-------+----+---
+ a | 1 |
+ a | 2 |
+ a | |
+ b | 3 |
+ b | 4 |
+ b | |
+ b | |
+ c | 5 |
+ c | 6 |
+ c | |
+ c | |
+ d | 7 |
+ d | 8 |
+ d | 9 |
+ d | 10 |
+ d | |
+ d | 11 |
+ d | 12 |
+ d | 13 |
+ d | |
+ d | |
+ d | |
+ d | 14 |
+ d | |
+ d | |
+ d | |
+ d | |
+ e | 15 |
+ e | 16 |
+ e | 17 |
+ e | |
+ e | 18 |
+ e | |
+ e | |
+ f | 19 |
+ f | 20 |
+ f | 21 |
+ f | 22 |
+ f | |
+ f | 24 |
+ f | 25 |
+ f | 26 |
+ f | |
+ f | |
+ f | |
+ f | 27 |
+ f | |
+ f | |
+ f | |
+ f | |
+(50 rows)
+
+--
+-- versions
+--
+--
+-- postquel functions
+--
+--
+-- mike does post_hacking,
+-- joe and sally play basketball, and
+-- everyone else does nothing.
+--
+SELECT p.name, name(p.hobbies) FROM ONLY person p;
+ name | name
+-------+-------------
+ mike | posthacking
+ joe | basketball
+ sally | basketball
+(3 rows)
+
+--
+-- as above, but jeff also does post_hacking.
+--
+SELECT p.name, name(p.hobbies) FROM person* p;
+ name | name
+-------+-------------
+ mike | posthacking
+ joe | basketball
+ sally | basketball
+ jeff | posthacking
+(4 rows)
+
+--
+-- the next two queries demonstrate how functions generate bogus duplicates.
+-- this is a "feature" ..
+--
+SELECT DISTINCT hobbies_r.name, name(hobbies_r.equipment) FROM hobbies_r
+ ORDER BY 1,2;
+ name | name
+-------------+---------------
+ basketball | hightops
+ posthacking | advil
+ posthacking | peet's coffee
+ skywalking | guts
+(4 rows)
+
+SELECT hobbies_r.name, (hobbies_r.equipment).name FROM hobbies_r;
+ name | name
+-------------+---------------
+ posthacking | advil
+ posthacking | peet's coffee
+ posthacking | advil
+ posthacking | peet's coffee
+ basketball | hightops
+ basketball | hightops
+ skywalking | guts
+(7 rows)
+
+--
+-- mike needs advil and peet's coffee,
+-- joe and sally need hightops, and
+-- everyone else is fine.
+--
+SELECT p.name, name(p.hobbies), name(equipment(p.hobbies)) FROM ONLY person p;
+ name | name | name
+-------+-------------+---------------
+ mike | posthacking | advil
+ mike | posthacking | peet's coffee
+ joe | basketball | hightops
+ sally | basketball | hightops
+(4 rows)
+
+--
+-- as above, but jeff needs advil and peet's coffee as well.
+--
+SELECT p.name, name(p.hobbies), name(equipment(p.hobbies)) FROM person* p;
+ name | name | name
+-------+-------------+---------------
+ mike | posthacking | advil
+ mike | posthacking | peet's coffee
+ joe | basketball | hightops
+ sally | basketball | hightops
+ jeff | posthacking | advil
+ jeff | posthacking | peet's coffee
+(6 rows)
+
+--
+-- just like the last two, but make sure that the target list fixup and
+-- unflattening is being done correctly.
+--
+SELECT name(equipment(p.hobbies)), p.name, name(p.hobbies) FROM ONLY person p;
+ name | name | name
+---------------+-------+-------------
+ advil | mike | posthacking
+ peet's coffee | mike | posthacking
+ hightops | joe | basketball
+ hightops | sally | basketball
+(4 rows)
+
+SELECT (p.hobbies).equipment.name, p.name, name(p.hobbies) FROM person* p;
+ name | name | name
+---------------+-------+-------------
+ advil | mike | posthacking
+ peet's coffee | mike | posthacking
+ hightops | joe | basketball
+ hightops | sally | basketball
+ advil | jeff | posthacking
+ peet's coffee | jeff | posthacking
+(6 rows)
+
+SELECT (p.hobbies).equipment.name, name(p.hobbies), p.name FROM ONLY person p;
+ name | name | name
+---------------+-------------+-------
+ advil | posthacking | mike
+ peet's coffee | posthacking | mike
+ hightops | basketball | joe
+ hightops | basketball | sally
+(4 rows)
+
+SELECT name(equipment(p.hobbies)), name(p.hobbies), p.name FROM person* p;
+ name | name | name
+---------------+-------------+-------
+ advil | posthacking | mike
+ peet's coffee | posthacking | mike
+ hightops | basketball | joe
+ hightops | basketball | sally
+ advil | posthacking | jeff
+ peet's coffee | posthacking | jeff
+(6 rows)
+
+SELECT name(equipment(hobby_construct(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_1a(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_1b(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_1c(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_2a(text 'skywalking'));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_2b(text 'skywalking'));
+ name
+---------------
+ advil
+ peet's coffee
+ hightops
+ guts
+(4 rows)
+
+SELECT hobbies_by_name('basketball');
+ hobbies_by_name
+-----------------
+ joe
+(1 row)
+
+SELECT name, overpaid(emp.*) FROM emp;
+ name | overpaid
+--------+----------
+ sharon | t
+ sam | t
+ bill | t
+ jeff | f
+ cim | f
+ linda | f
+(6 rows)
+
+--
+-- Try a few cases with SQL-spec row constructor expressions
+--
+SELECT * FROM equipment(ROW('skywalking', 'mer'));
+ name | hobby
+------+------------
+ guts | skywalking
+(1 row)
+
+SELECT name(equipment(ROW('skywalking', 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT *, name(equipment(h.*)) FROM hobbies_r h;
+ name | person | name
+-------------+--------+---------------
+ posthacking | mike | advil
+ posthacking | mike | peet's coffee
+ posthacking | jeff | advil
+ posthacking | jeff | peet's coffee
+ basketball | joe | hightops
+ basketball | sally | hightops
+ skywalking | | guts
+(7 rows)
+
+SELECT *, (equipment(CAST((h.*) AS hobbies_r))).name FROM hobbies_r h;
+ name | person | name
+-------------+--------+---------------
+ posthacking | mike | advil
+ posthacking | mike | peet's coffee
+ posthacking | jeff | advil
+ posthacking | jeff | peet's coffee
+ basketball | joe | hightops
+ basketball | sally | hightops
+ skywalking | | guts
+(7 rows)
+
+--
+-- functional joins
+--
+--
+-- instance rules
+--
+--
+-- rewrite rules
+--
diff --git src/test/regress/parallel_schedule src/test/regress/parallel_schedule
index d2b17dd3ea..08ac118d5b 100644
--- src/test/regress/parallel_schedule
+++ src/test/regress/parallel_schedule
@@ -78,7 +78,15 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
# ----------
# Another group of parallel tests
# ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf collate.icu.utf8 tidscan zstidscan
+
+
+# ----------
+# zedstore does a vacuum followed by checking the recycled pages, other active
+# transactions may affect the results, so it should not run in parallel with
+# other tests
+# ----------
+test: zedstore
# rules cannot run concurrently with any test that creates
# a view or rule in the public schema
diff --git src/test/regress/pg_regress.c src/test/regress/pg_regress.c
index f6a5e1b9c7..af90a43969 100644
--- src/test/regress/pg_regress.c
+++ src/test/regress/pg_regress.c
@@ -56,14 +56,16 @@ char *host_platform = HOST_TUPLE;
static char *shellprog = SHELLPROG;
#endif
+static char gpdiffprog[MAXPGPATH] = "diff";
+
/*
* On Windows we use -w in diff switches to avoid problems with inconsistent
* newline representation. The actual result files will generally have
* Windows-style newlines, but the comparison files might or might not.
*/
#ifndef WIN32
-const char *basic_diff_opts = "";
-const char *pretty_diff_opts = "-U3";
+const char *basic_diff_opts = "-I REGRESS_IGNORE:";
+const char *pretty_diff_opts = "-I REGRESS_IGNORE: -U3";
#else
const char *basic_diff_opts = "-w";
const char *pretty_diff_opts = "-w -U3";
@@ -92,6 +94,8 @@ static bool port_specified_by_user = false;
static char *dlpath = PKGLIBDIR;
static char *user = NULL;
static _stringlist *extraroles = NULL;
+static char *initfile = NULL;
+static bool ignore_plans_tuple_order_diff = false;
static char *config_auth_datadir = NULL;
/* internal variables */
@@ -122,6 +126,9 @@ static void header(const char *fmt,...) pg_attribute_printf(1, 2);
static void status(const char *fmt,...) pg_attribute_printf(1, 2);
static void psql_command(const char *database, const char *query,...) pg_attribute_printf(2, 3);
+static int
+run_diff(const char *cmd, const char *filename);
+
/*
* allow core files if possible.
*/
@@ -1369,11 +1376,14 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
char diff[MAXPGPATH];
char cmd[MAXPGPATH * 3];
char best_expect_file[MAXPGPATH];
+ char diff_opts[MAXPGPATH];
+ char m_pretty_diff_opts[MAXPGPATH];
FILE *difffile;
int best_line_count;
int i;
int l;
const char *platform_expectfile;
+ const char *ignore_plans_opts;
/*
* We can pass either the resultsfile or the expectfile, they should have
@@ -1394,13 +1404,36 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
strcpy(++p, platform_expectfile);
}
+ if (ignore_plans_tuple_order_diff)
+ ignore_plans_opts = " -gpd_ignore_plans";
+ else
+ ignore_plans_opts = "";
+
/* Name to use for temporary diff file */
snprintf(diff, sizeof(diff), "%s.diff", resultsfile);
+ /* Add init file arguments if provided via commandline */
+ if (initfile)
+ {
+ snprintf(diff_opts, sizeof(diff_opts),
+ "%s%s --gpd_init %s", basic_diff_opts, ignore_plans_opts, initfile);
+
+ snprintf(m_pretty_diff_opts, sizeof(m_pretty_diff_opts),
+ "%s%s --gpd_init %s", pretty_diff_opts, ignore_plans_opts, initfile);
+ }
+ else
+ {
+ snprintf(diff_opts, sizeof(diff_opts),
+ "%s%s", basic_diff_opts, ignore_plans_opts);
+
+ snprintf(m_pretty_diff_opts, sizeof(m_pretty_diff_opts),
+ "%s%s", pretty_diff_opts, ignore_plans_opts);
+ }
+
/* OK, run the diff */
snprintf(cmd, sizeof(cmd),
- "diff %s \"%s\" \"%s\" > \"%s\"",
- basic_diff_opts, expectfile, resultsfile, diff);
+ "%s %s \"%s\" \"%s\" > \"%s\"",
+ gpdiffprog, diff_opts, expectfile, resultsfile, diff);
/* Is the diff file empty? */
if (run_diff(cmd, diff) == 0)
@@ -1432,8 +1465,8 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
}
snprintf(cmd, sizeof(cmd),
- "diff %s \"%s\" \"%s\" > \"%s\"",
- basic_diff_opts, alt_expectfile, resultsfile, diff);
+ "%s %s \"%s\" \"%s\" > \"%s\"",
+ gpdiffprog, diff_opts, alt_expectfile, resultsfile, diff);
if (run_diff(cmd, diff) == 0)
{
@@ -1460,8 +1493,8 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
if (platform_expectfile)
{
snprintf(cmd, sizeof(cmd),
- "diff %s \"%s\" \"%s\" > \"%s\"",
- basic_diff_opts, default_expectfile, resultsfile, diff);
+ "%s %s \"%s\" \"%s\" > \"%s\"",
+ gpdiffprog, diff_opts, default_expectfile, resultsfile, diff);
if (run_diff(cmd, diff) == 0)
{
@@ -1494,10 +1527,9 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
fclose(difffile);
}
- /* Run diff */
snprintf(cmd, sizeof(cmd),
- "diff %s \"%s\" \"%s\" >> \"%s\"",
- pretty_diff_opts, best_expect_file, resultsfile, difffilename);
+ "%s %s \"%s\" \"%s\" >> \"%s\"",
+ gpdiffprog, m_pretty_diff_opts, best_expect_file, resultsfile, difffilename);
run_diff(cmd, difffilename);
unlink(diff);
@@ -1921,6 +1953,27 @@ run_single_test(const char *test, test_function tfunc)
status_end();
}
+/*
+ * Find the gpdiff.pl binary.
+ */
+static void
+find_helper_programs(const char *argv0)
+{
+ if (find_other_exec(argv0, "gpdiff.pl", "(PostgreSQL)", gpdiffprog) != 0)
+ {
+ char full_path[MAXPGPATH];
+
+ if (find_my_exec(argv0, full_path) < 0)
+ strlcpy(full_path, progname, sizeof(full_path));
+
+ fprintf(stderr,
+ _("The program \"gpdiff.pl\" is needed by %s "
+ "but was not found in the same directory as \"%s\".\n"),
+ progname, full_path);
+ exit(1);
+ }
+}
+
/*
* Create the summary-output files (making them empty if already existing)
*/
@@ -2102,6 +2155,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
{"load-extension", required_argument, NULL, 22},
{"config-auth", required_argument, NULL, 24},
{"max-concurrent-tests", required_argument, NULL, 25},
+ {"ignore-plans-and-tuple-order-diff", no_argument, NULL, 26},
+ {"ignore-tuple-order-diff", no_argument, NULL, 27},
{NULL, 0, NULL, 0}
};
@@ -2111,6 +2166,7 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
int option_index;
char buf[MAXPGPATH * 4];
char buf2[MAXPGPATH * 4];
+ bool ignore_tuple_order_diff = false;
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -2218,6 +2274,14 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
case 25:
max_concurrent_tests = atoi(optarg);
break;
+ case 26:
+ /* ignore plans means also ignore tuple order differences */
+ ignore_plans_tuple_order_diff = true;
+ ignore_tuple_order_diff = true;
+ break;
+ case 27:
+ ignore_tuple_order_diff = true;
+ break;
default:
/* getopt_long already emitted a complaint */
fprintf(stderr, _("\nTry \"%s -h\" for more information.\n"),
@@ -2261,6 +2325,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
/*
* Initialization
*/
+ if (ignore_tuple_order_diff || ignore_plans_tuple_order_diff)
+ find_helper_programs(argv[0]);
open_result_files();
initialize_environment();
diff --git src/test/regress/serial_schedule src/test/regress/serial_schedule
index acba391332..36db593fe3 100644
--- src/test/regress/serial_schedule
+++ src/test/regress/serial_schedule
@@ -132,8 +132,10 @@ test: dbsize
test: misc_functions
test: sysviews
test: tsrf
-test: tidscan
test: collate.icu.utf8
+test: tidscan
+test: zstidscan
+test: zedstore
test: rules
test: psql
test: psql_crosstab
diff --git src/test/regress/sql/box.sql src/test/regress/sql/box.sql
index cd3e00261f..1d219bd8ca 100644
--- src/test/regress/sql/box.sql
+++ src/test/regress/sql/box.sql
@@ -148,6 +148,7 @@ INSERT INTO box_temp
('(-infinity,-infinity)(infinity,infinity)');
SET enable_seqscan = false;
+SET enable_bitmapscan = false;
SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
EXPLAIN (COSTS OFF) SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
@@ -186,6 +187,7 @@ SELECT * FROM box_temp WHERE f1 ~= '(20,20),(40,40)';
EXPLAIN (COSTS OFF) SELECT * FROM box_temp WHERE f1 ~= '(20,20),(40,40)';
RESET enable_seqscan;
+RESET enable_bitmapscan;
DROP INDEX box_spgist;
diff --git src/test/regress/sql/create_index.sql src/test/regress/sql/create_index.sql
index c3246cb296..a31baa497f 100644
--- src/test/regress/sql/create_index.sql
+++ src/test/regress/sql/create_index.sql
@@ -722,7 +722,9 @@ SELECT count(*) FROM dupindexcols
-- Check ordering of =ANY indexqual results (bug in 9.2.0)
--
-vacuum tenk1; -- ensure we get consistent plans here
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = OFF;
explain (costs off)
SELECT unique1 FROM tenk1
@@ -742,6 +744,7 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+RESET enable_indexscan;
SET enable_indexonlyscan = OFF;
explain (costs off)
@@ -753,6 +756,8 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+RESET enable_seqscan;
+RESET enable_bitmapscan;
RESET enable_indexonlyscan;
--
diff --git src/test/regress/sql/create_table.sql src/test/regress/sql/create_table.sql
index 00ef81a685..c7ddd0ca74 100644
--- src/test/regress/sql/create_table.sql
+++ src/test/regress/sql/create_table.sql
@@ -285,15 +285,16 @@ CREATE TABLE IF NOT EXISTS as_select1 AS EXECUTE select1;
DROP TABLE as_select1;
DEALLOCATE select1;
+-- FIXME: enable this test when we introduce meta-page overflow for zedstore
-- create an extra wide table to test for issues related to that
-- (temporarily hide query, to avoid the long CREATE TABLE stmt)
-\set ECHO none
-SELECT 'CREATE TABLE extra_wide_table(firstc text, '|| array_to_string(array_agg('c'||i||' bool'),',')||', lastc text);'
-FROM generate_series(1, 1100) g(i)
-\gexec
-\set ECHO all
-INSERT INTO extra_wide_table(firstc, lastc) VALUES('first col', 'last col');
-SELECT firstc, lastc FROM extra_wide_table;
+-- \set ECHO none
+-- SELECT 'CREATE TABLE extra_wide_table(firstc text, '|| array_to_string(array_agg('c'||i||' bool'),',')||', lastc text);'
+-- FROM generate_series(1, 1100) g(i)
+-- \gexec
+-- \set ECHO all
+-- INSERT INTO extra_wide_table(firstc, lastc) VALUES('first col', 'last col');
+-- SELECT firstc, lastc FROM extra_wide_table;
-- check that tables with oids cannot be created anymore
CREATE TABLE withoid() WITH OIDS;
diff --git src/test/regress/sql/zedstore.sql src/test/regress/sql/zedstore.sql
new file mode 100644
index 0000000000..48dfe019ec
--- /dev/null
+++ src/test/regress/sql/zedstore.sql
@@ -0,0 +1,211 @@
+-- simple tests to iteratively build the zedstore
+-- create and drop works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+drop table t_zedstore;
+-- insert and select works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+insert into t_zedstore select i,i+1,i+2 from generate_series(1, 10)i;
+select * from t_zedstore;
+-- selecting only few columns work
+select c1, c3 from t_zedstore;
+-- only few columns in output and where clause work
+select c3 from t_zedstore where c2 > 5;
+-- where clause with invalid ctid works
+select * from t_zedstore where ctid = '(0,0)';
+
+-- Test abort works
+begin;
+insert into t_zedstore select i,i+1,i+2 from generate_series(21, 25)i;
+abort;
+insert into t_zedstore select i,i+1,i+2 from generate_series(31, 35)i;
+select * from t_zedstore;
+
+--
+-- Test indexing
+--
+create index on t_zedstore (c1);
+set enable_seqscan=off;
+set enable_indexscan=on;
+set enable_bitmapscan=off;
+
+-- index scan
+select * from t_zedstore where c1 = 5;
+
+-- index-only scan
+select c1 from t_zedstore where c1 = 5;
+
+-- bitmap scan
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+select c1, c2 from t_zedstore where c1 between 5 and 10;
+
+--
+-- Test DELETE and UPDATE
+--
+delete from t_zedstore where c2 = 5;
+select * from t_zedstore;
+delete from t_zedstore where c2 < 5;
+select * from t_zedstore;
+
+update t_zedstore set c2 = 100 where c1 = 8;
+select * from t_zedstore;
+
+--
+-- Test page deletion, by deleting a bigger range of values
+--
+insert into t_zedstore select i,i+1,i+2 from generate_series(10000, 15000)i;
+delete from t_zedstore where c1 >= 10000;
+
+--
+-- Test VACUUM
+--
+vacuum t_zedstore;
+select * from t_zedstore;
+
+--
+-- Test in-line toasting
+--
+create table t_zedtoast(c1 int, t text) USING zedstore;
+insert into t_zedtoast select i, repeat('x', 10000) from generate_series(1, 10) i;
+
+select c1, length(t) from t_zedtoast;
+
+-- TODO: this test won't actually work when we start using the UNDO framework
+-- because we will not have control over when the undo record will be removed
+delete from t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+vacuum t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+
+--
+-- Test out-of-line toasting
+--
+insert into t_zedtoast select i, repeat('x', 1000000) from generate_series(1, 10) i;
+
+select c1, length(t) from t_zedtoast;
+
+-- TODO: this test won't actually work when we start using the UNDO framework
+-- because we will not have control over when the undo record will be removed
+delete from t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+vacuum t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+
+--
+-- Test NULL values
+--
+create table t_zednullvalues(c1 int, c2 int) USING zedstore;
+insert into t_zednullvalues values(1, NULL), (NULL, 2);
+select * from t_zednullvalues;
+select c2 from t_zednullvalues;
+update t_zednullvalues set c1 = 1, c2 = NULL;
+select * from t_zednullvalues;
+
+--
+-- Test COPY
+--
+create table t_zedcopy(a serial, b int, c text not null default 'stuff', d text,e text) USING zedstore;
+
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+9999 \N \\N \NN \N
+10000 21 31 41 51
+\.
+
+COPY t_zedcopy (b, d) from stdin;
+1 test_1
+\.
+
+COPY t_zedcopy (b, d) from stdin;
+2 test_2
+3 test_3
+4 test_4
+5 test_5
+\.
+
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+10001 22 32 42 52
+10002 23 33 43 53
+10003 24 34 44 54
+10004 25 35 45 55
+10005 26 36 46 56
+\.
+
+select * from t_zedcopy;
+COPY t_zedcopy (a, d, e) to stdout;
+
+--
+-- Also test delete and update on the table that was populated with COPY.
+-- This exercises splitting the array item. (A table not populated with
+-- COPY only contains single items, at the moment.)
+--
+
+delete from t_zedcopy where b = 4;
+select * from t_zedcopy;
+delete from t_zedcopy where b < 3;
+select * from t_zedcopy;
+
+update t_zedcopy set b = 100 where b = 5;
+select * from t_zedcopy;
+
+
+-- Test rolling back COPY
+begin;
+COPY t_zedcopy (b, d) from stdin;
+20001 test_1
+20002 test_2
+20003 test_3
+20004 test_4
+\.
+rollback;
+select count(*) from t_zedcopy where b >= 20000;
+
+--
+-- Test zero column table
+--
+create table t_zwithzerocols() using zedstore;
+insert into t_zwithzerocols select t.* from t_zwithzerocols t right join generate_series(1,1) on true;
+select count(*) from t_zwithzerocols;
+
+-- Test for alter table add column
+create table t_zaddcol(a int) using zedstore;
+insert into t_zaddcol select * from generate_series(1, 3);
+-- rewrite case
+alter table t_zaddcol add column b int generated always as (a + 1) stored;
+select * from t_zaddcol;
+-- test alter table add column with no default
+create table t_zaddcol_simple(a int) using zedstore;
+insert into t_zaddcol_simple values (1);
+alter table t_zaddcol_simple add b int;
+select * from t_zaddcol_simple;
+insert into t_zaddcol_simple values(2,3);
+select * from t_zaddcol_simple;
+-- fixed length default value stored in catalog
+alter table t_zaddcol add column c int default 3;
+select * from t_zaddcol;
+-- variable length default value stored in catalog
+alter table t_zaddcol add column d text default 'abcdefgh';
+select d from t_zaddcol;
+-- insert after add column
+insert into t_zaddcol values (2);
+select * from t_zaddcol;
+insert into t_zaddcol (a, c, d) values (3,5, 'test_insert');
+select b,c,d from t_zaddcol;
+
+--
+-- Test TABLESAMPLE
+--
+-- regular test tablesample.sql doesn't directly work for zedstore as
+-- its using fillfactor to create specific block layout for
+-- heap. Hence, output differs between heap and zedstore table while
+-- sampling. We need to use many tuples here to have multiple logical
+-- blocks as don't have way to force TIDs spread / jump for zedstore.
+--
+CREATE TABLE t_ztablesample (id int, name text) using zedstore;
+INSERT INTO t_ztablesample
+ SELECT i, repeat(i::text, 2) FROM generate_series(0, 299) s(i);
+-- lets delete half (even numbered ids) rows to limit the output
+DELETE FROM t_ztablesample WHERE id%2 = 0;
+-- should return ALL visible tuples from SOME blocks
+SELECT ctid,t.id FROM t_ztablesample AS t TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+-- should return SOME visible tuples but from ALL the blocks
+SELECT ctid,id FROM t_ztablesample TABLESAMPLE BERNOULLI (50) REPEATABLE (0);
diff --git src/test/regress/sql/zstidscan.sql src/test/regress/sql/zstidscan.sql
new file mode 100644
index 0000000000..df4a3e77b9
--- /dev/null
+++ src/test/regress/sql/zstidscan.sql
@@ -0,0 +1,69 @@
+select '1'::zstid;
+select '-1'::zstid;
+select -'1'::zstid;
+
+-- int2 conversion
+select 1::int2::zstid;
+select (-1)::int2::zstid;
+select -1::int2::zstid;
+
+-- int4 conversion
+select 1::zstid;
+select (-1)::zstid;
+select -1::zstid;
+
+-- int8 conversion
+select 1::int8::zstid;
+select 1000000000000000::zstid; -- bigger than MaxZSTid
+select (-1)::int8::zstid;
+select -1::int8::zstid;
+
+create table if not exists zstidscan(a int) using zedstore;
+
+insert into zstidscan values (1), (2), (3);
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan;
+select ctid, ctid::zstid as zstid, a from zstidscan;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid = 2;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid = 2;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid >= 2;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid >= 2;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <= 2;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <= 2;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid < 2;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid < 2;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid > 2;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid > 2;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <> 2;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <> 2;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid in (2,3);
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid in (2,3);
+
+-- TODO: casting to int2 or int4 might be useful
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int2 % 3 = 0;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int4 % 3 = 0;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int8 % 3 = 0;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int8 % 3 = 0;
+
+-- TODO: Add necessary functions to do these useful aggregates on zstid types
+select max(ctid::zstid) from zstidscan;
+select min(ctid::zstid) from zstidscan;
+
+drop table zstidscan;
diff --git src/test/storageperf/driver.sql src/test/storageperf/driver.sql
new file mode 100644
index 0000000000..3a197ee6fe
--- /dev/null
+++ src/test/storageperf/driver.sql
@@ -0,0 +1,45 @@
+--
+-- Main script, to run all the tests, and print the results.
+--
+--
+
+-- First run the tests using heap.
+DROP SCHEMA IF EXISTS storagetest_heap CASCADE;
+CREATE SCHEMA storagetest_heap;
+SET search_path='storagetest_heap';
+
+CREATE TABLE results (testname text, size numeric, walsize numeric, time numeric) USING heap;
+
+SET default_table_access_method=heap;
+checkpoint;
+\i tests.sql
+
+
+-- Repeat with zedstore
+
+DROP SCHEMA IF EXISTS storagetest_zedstore CASCADE;
+CREATE SCHEMA storagetest_zedstore;
+SET search_path='storagetest_zedstore';
+
+CREATE TABLE results (testname text, size numeric, walsize numeric, time numeric) USING heap;
+
+SET default_table_access_method=zedstore;
+checkpoint;
+\i tests.sql
+
+
+SET search_path='public';
+
+SELECT COALESCE(h.testname, zs.testname) as testname,
+ h.time as "heap time",
+ h.size as "heap size",
+ h.walsize as "heap wal",
+
+ zs.time as "ZS time",
+ zs.size as "ZS size",
+ zs.walsize as "ZS wal",
+ round(zs.time / h.time, 2) as "time ratio",
+ round(zs.size / h.size, 2) as "size ratio",
+ case when zs.walsize > 0 and h.walsize > 0 then round(zs.walsize / h.walsize, 2) else null end as "wal ratio"
+FROM storagetest_heap.results h
+FULL OUTER JOIN storagetest_zedstore.results zs ON (h.testname = zs.testname);
diff --git src/test/storageperf/sql/inlinecompress.sql src/test/storageperf/sql/inlinecompress.sql
new file mode 100644
index 0000000000..9e89117d10
--- /dev/null
+++ src/test/storageperf/sql/inlinecompress.sql
@@ -0,0 +1,133 @@
+-- Tests with a single-column table of compressible text values.
+
+CREATE /* UNLOGGED */ TABLE inlinecompress (i text);
+
+-- Populate the table with a bunch of INSERT ... SELECT statements.
+-- Measure how long it takes, and the resulting table size.
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+INSERT INTO inlinecompress SELECT repeat('x', 5000) FROM generate_series(1, 10000);
+INSERT INTO inlinecompress SELECT repeat('x', 5000) FROM generate_series(1, 10000);
+INSERT INTO inlinecompress SELECT repeat('x', 5000) FROM generate_series(1, 10000);
+INSERT INTO inlinecompress SELECT repeat('x', 5000) FROM generate_series(1, 10000);
+INSERT INTO inlinecompress SELECT repeat('x', 5000) FROM generate_series(1, 10000);
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('inlinecompress, insert-select',
+ pg_total_relation_size('inlinecompress'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+COPY inlinecompress TO '/tmp/inlinecompress.data'; -- dump the data, for COPY test below.
+
+--
+-- Truncate and populate it again with the same data, but this time using COPY.
+--
+TRUNCATE inlinecompress;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+COPY inlinecompress FROM '/tmp/inlinecompress.data';
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('inlinecompress, COPY',
+ pg_total_relation_size('inlinecompress'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- SELECT
+--
+
+VACUUM FREEZE inlinecompress;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+SELECT LENGTH(i) AS len FROM inlinecompress GROUP BY len;
+SELECT LENGTH(i) AS len FROM inlinecompress GROUP BY len;
+SELECT LENGTH(i) AS len FROM inlinecompress GROUP BY len;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('inlinecompress, SELECT, seqscan',
+ pg_total_relation_size('inlinecompress'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- Bitmap scans
+--
+
+CREATE INDEX ON inlinecompress (i);
+
+set enable_seqscan=off;
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+SELECT LENGTH(i) AS len FROM inlinecompress where LENGTH(i) < 100000 GROUP BY len;
+SELECT LENGTH(i) AS len FROM inlinecompress where LENGTH(i) < 100000 GROUP BY len;
+SELECT LENGTH(i) AS len FROM inlinecompress where LENGTH(i) < 100000 GROUP BY len;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('inlinecompress, SELECT, bitmap scan',
+ pg_total_relation_size('inlinecompress'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+reset enable_seqscan;
+reset enable_indexscan;
+reset enable_bitmapscan;
+
+
+--
+-- Delete half of the rows
+--
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+DELETE FROM inlinecompress WHERE ctid::zstid::int8 % 2 = 0;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('inlinecompress, deleted half',
+ pg_total_relation_size('inlinecompress'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- And vacuum the deleted rows away
+--
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+VACUUM inlinecompress;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('inlinecompress, vacuumed',
+ pg_total_relation_size('inlinecompress'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
diff --git src/test/storageperf/sql/lockperf.sql src/test/storageperf/sql/lockperf.sql
new file mode 100644
index 0000000000..880ff9dfd8
--- /dev/null
+++ src/test/storageperf/sql/lockperf.sql
@@ -0,0 +1,49 @@
+drop view if exists public.redirector;
+drop table if exists twocol;
+
+CREATE TABLE twocol (i int4, val int4);
+
+CREATE VIEW public.redirector AS SELECT * FROM twocol;
+
+INSERT INTO twocol SELECT g, 0 FROM generate_series(1, 10) g;
+
+COPY twocol TO '/tmp/twocol.data'; -- dump the data, for COPY test below.
+TRUNCATE twocol;
+COPY twocol FROM '/tmp/twocol.data';
+
+
+-- FOR SHARE
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+\! echo "\set i random(1, 10)" > /tmp/lockperf-pgbench-script.sql
+\! echo "SELECT i FROM redirector WHERE i = :i FOR SHARE" >> /tmp/lockperf-pgbench-script.sql
+
+\! pgbench -n -t1000 -c 20 -j 4 -f /tmp/lockperf-pgbench-script.sql postgres
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('lockperf, pgbench, FOR SHARE',
+ pg_total_relation_size('twocol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+-- UPDATE
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+\! echo "\set i random(1, 10)" > /tmp/lockperf-pgbench-script.sql
+\! echo "UPDATE redirector SET val = val + 1 WHERE i = :i;" >> /tmp/lockperf-pgbench-script.sql
+
+\! pgbench -n -t1000 -c 20 -j 4 -f /tmp/lockperf-pgbench-script.sql postgres
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('lockperf, pgbench, UPDATE',
+ pg_total_relation_size('twocol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
diff --git src/test/storageperf/sql/nullcol.sql src/test/storageperf/sql/nullcol.sql
new file mode 100644
index 0000000000..bb974c0491
--- /dev/null
+++ src/test/storageperf/sql/nullcol.sql
@@ -0,0 +1,45 @@
+-- Tests with a narrow, single-column table, with some nulls.
+
+CREATE TABLE nullcol (i int4);
+
+-- Populate the table with a bunch of INSERT ... SELECT statements.
+-- Measure how long it takes, and the resulting table size.
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+INSERT INTO nullcol SELECT CASE WHEN g % 2 = 0 THEN NULL ELSE g END FROM generate_series(1, 100000) g ;
+INSERT INTO nullcol SELECT NULL FROM generate_series(1, 100000) g;
+INSERT INTO nullcol SELECT CASE WHEN g % 2 = 0 THEN NULL ELSE g END FROM generate_series(1, 100000) g ;
+INSERT INTO nullcol SELECT g FROM generate_series(1, 100000) g;
+INSERT INTO nullcol SELECT CASE WHEN g % 2 = 0 THEN NULL ELSE g END FROM generate_series(1, 100000) g ;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('nullcol, insert-select',
+ pg_total_relation_size('nullcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+COPY nullcol TO '/tmp/nullcol.data'; -- dump the data, for COPY test below.
+
+--
+-- Truncate and populate it again with the same data, but this time using COPY.
+--
+TRUNCATE nullcol;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+COPY nullcol FROM '/tmp/nullcol.data';
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('nullcol, COPY',
+ pg_total_relation_size('nullcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
diff --git src/test/storageperf/sql/onecol.sql src/test/storageperf/sql/onecol.sql
new file mode 100644
index 0000000000..f5c4c819ef
--- /dev/null
+++ src/test/storageperf/sql/onecol.sql
@@ -0,0 +1,158 @@
+-- Tests with a narrow, single-column table.
+
+CREATE /* UNLOGGED */ TABLE onecol (i int4);
+
+-- Populate the table with a bunch of INSERT ... SELECT statements.
+-- Measure how long it takes, and the resulting table size.
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('onecol, insert-select',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+COPY onecol TO '/tmp/onecol.data'; -- dump the data, for COPY test below.
+
+--
+-- Truncate and populate it again with the same data, but this time using COPY.
+--
+TRUNCATE onecol;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+COPY onecol FROM '/tmp/onecol.data';
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('onecol, COPY',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- SELECT
+--
+
+VACUUM FREEZE onecol;
+
+-- Test with seq scan parallelism enabled.
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+explain SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('onecol, SELECT, seqscan',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+-- Test with seq scan parallelism disabled.
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+SET max_parallel_workers_per_gather to 0;
+
+explain SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+VALUES ('onecol, SELECT, seqscan, parallel seqscan disabled',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+RESET max_parallel_workers_per_gather;
+--
+-- Bitmap scans
+--
+
+CREATE INDEX ON onecol (i);
+
+set enable_seqscan=off;
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+SELECT SUM(i) FROM onecol where i < 400000;
+SELECT SUM(i) FROM onecol where i < 400000;
+SELECT SUM(i) FROM onecol where i < 400000;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('onecol, SELECT, bitmap scan',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+reset enable_seqscan;
+reset enable_indexscan;
+reset enable_bitmapscan;
+
+
+--
+-- Delete half of the rows
+--
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+DELETE FROM onecol WHERE i%2 = 0;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('onecol, deleted half',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- And vacuum the deleted rows away
+--
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+VACUUM onecol;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('onecol, vacuumed',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
diff --git src/test/storageperf/sql/toast.sql src/test/storageperf/sql/toast.sql
new file mode 100644
index 0000000000..16b03c91da
--- /dev/null
+++ src/test/storageperf/sql/toast.sql
@@ -0,0 +1,133 @@
+-- Tests with a single-column table of wide, toasted text values.
+
+CREATE /* UNLOGGED */ TABLE toastcol (i text);
+
+-- Populate the table with a bunch of INSERT ... SELECT statements.
+-- Measure how long it takes, and the resulting table size.
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+INSERT INTO toastcol SELECT repeat('x', 1000000) FROM generate_series(1, 100);
+INSERT INTO toastcol SELECT repeat('x', 1000000) FROM generate_series(1, 100);
+INSERT INTO toastcol SELECT repeat('x', 1000000) FROM generate_series(1, 100);
+INSERT INTO toastcol SELECT repeat('x', 1000000) FROM generate_series(1, 100);
+INSERT INTO toastcol SELECT repeat('x', 1000000) FROM generate_series(1, 100);
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('toastcol, insert-select',
+ pg_total_relation_size('toastcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+COPY toastcol TO '/tmp/toastcol.data'; -- dump the data, for COPY test below.
+
+--
+-- Truncate and populate it again with the same data, but this time using COPY.
+--
+TRUNCATE toastcol;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+COPY toastcol FROM '/tmp/toastcol.data';
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('toastcol, COPY',
+ pg_total_relation_size('toastcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- SELECT
+--
+
+VACUUM FREEZE toastcol;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+SELECT LENGTH(i) AS len FROM toastcol GROUP BY len;
+SELECT LENGTH(i) AS len FROM toastcol GROUP BY len;
+SELECT LENGTH(i) AS len FROM toastcol GROUP BY len;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('toastcol, SELECT, seqscan',
+ pg_total_relation_size('toastcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- Bitmap scans
+--
+
+CREATE INDEX ON toastcol (i);
+
+set enable_seqscan=off;
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+SELECT LENGTH(i) AS len FROM toastcol where LENGTH(i) < 100000 GROUP BY len;
+SELECT LENGTH(i) AS len FROM toastcol where LENGTH(i) < 100000 GROUP BY len;
+SELECT LENGTH(i) AS len FROM toastcol where LENGTH(i) < 100000 GROUP BY len;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('toastcol, SELECT, bitmap scan',
+ pg_total_relation_size('toastcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+reset enable_seqscan;
+reset enable_indexscan;
+reset enable_bitmapscan;
+
+
+--
+-- Delete half of the rows
+--
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+DELETE FROM toastcol WHERE ctid::zstid::int8 % 2 = 0;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('toastcol, deleted half',
+ pg_total_relation_size('toastcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- And vacuum the deleted rows away
+--
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+VACUUM toastcol;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('toastcol, vacuumed',
+ pg_total_relation_size('toastcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
diff --git src/test/storageperf/tests.sql src/test/storageperf/tests.sql
new file mode 100644
index 0000000000..be87be6abd
--- /dev/null
+++ src/test/storageperf/tests.sql
@@ -0,0 +1,7 @@
+-- Test "schedule". List all the tests you want to run here.
+
+\i sql/onecol.sql
+\i sql/nullcol.sql
+\i sql/lockperf.sql
+\i sql/inlinecompress.sql
+\i sql/toast.sql
Hello,
We (Jacob and I) have an update for this thread.
1. We recently made some improvements to the table AM APIs for fetching
a single row (tuple_fetch_row_version()) and for locking (and fetching) a
tuple (tuple_lock()), so that they can take a set of columns. We
extracted these columns at plan time and, in some cases, at executor time.
The changes are in the same spirit as the column-oriented changes that
are already part of Zedstore - namely the ability to pass a set of
columns to sequential and index scans, among other operations.
We observed that the two table AM functions are called in contexts
which don't need the entire set of columns to be populated in the
output TupleTableSlots associated with these APIs. For instance, in
DELETE RETURNING, we don't need to fetch all of the columns, just the
ones in the RETURNING clause.
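As a hedged illustration (this table is hypothetical, not part of the patch
or its test suite): with a column-aware tuple_fetch_row_version(), only the
column named in RETURNING has to be materialized, and a column store can skip
the wide column entirely.

    -- only "id" is needed to satisfy RETURNING; "payload" need not be fetched
    CREATE TABLE wide_t (id int, payload text) USING zedstore;
    INSERT INTO wide_t SELECT g, repeat('x', 1000) FROM generate_series(1, 1000) g;
    DELETE FROM wide_t WHERE id % 10 = 0 RETURNING id;
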
We saw improvements (see results attached) for a variety of tests; we
added a number of tests to our storageperf test suite to cover these
cases. We don't see a performance improvement for UPSERT or ON CONFLICT
DO NOTHING, as in both cases an index lookup pulls in the entire row
before the call to table_tuple_lock(). We do see significant
improvements (~3x) for DELETE RETURNING and row-level locking, and
around a ~25x improvement in TidScan runtime.
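For concreteness, the row-level locking and TidScan cases above boil down to
queries of the following shape, quoted from the attached lockperf.sql and
zstidscan.sql tests (":i" is a pgbench variable); with column-aware locking
and TID scans, only the referenced columns have to be populated in the
output slot:

    SELECT i FROM redirector WHERE i = :i FOR SHARE;
    select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid = 2;
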
Please refer to src/test/storageperf for the storageperf test suite.
2. We absorbed the scanCols patch [1], replacing some of the existing
executor-level column extraction for scans with the scanCols populated
during planning as in [1].
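As a minimal illustration of the effect, using the t_zedstore(c1, c2, c3)
table from the zedstore regression test: for the query below, the plan-time
column set contains only c1 and c3, so a column store never has to read c2
at all.

    -- only c1 (target list) and c3 (qual) are requested from the scan
    SELECT c1 FROM t_zedstore WHERE c3 > 5;
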
3. We also merged Zedstore up to PG 14 commit efc5dcfd8a.
PFA the latest version of the Zedstore patch.
Regards,
Jacob and Soumyadeep
[1]: https://www.postgresql.org/message-id/CAAKRu_YxyYOCCO2e83UmHb51sky1hXgeRzQw-PoqT1iHj2ZKVg@mail.gmail.com
Attachments:
v5-zedstore.patch
diff --git a/README.md b/README.md
new file mode 100644
index 0000000000..54901cc4b1
--- /dev/null
+++ b/README.md
@@ -0,0 +1,27 @@
+Zedstore - Column Store for PostgreSQL
+--------------------------------------
+
+Zedstore is a column store for PostgreSQL. It is under development. The development happens here; you are in the right place.
+
+This is a fork of the PostgreSQL repository. All the interesting Zedstore stuff is in the [src/backend/access/zedstore/](src/backend/access/zedstore/) subdirectory. There are only a few modifications to the rest of PostgreSQL, outside that subdirectory. Eventually, any modifications needed to accommodate Zedstore need to be submitted as separate patches and committed to PostgreSQL itself.
+
+Join the discussion on pgsql-hackers:
+
+https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w@mail.gmail.com
+
+
+Try it out
+----------
+
+Clone the repository and compile it. Use the --with-lz4 configure option, otherwise zedstore will be horrendously slow:
+
+ ./configure --with-lz4
+
+To use zedstore:
+
+ CREATE TABLE mytable (t text) using zedstore;
+
+Or you can set it as the default for all your tables, in postgresql.conf:
+
+ default_table_access_method = 'zedstore'
+
diff --git a/configure b/configure
index ace4ed5dec..63dc92e528 100755
--- a/configure
+++ b/configure
@@ -699,6 +699,7 @@ with_gnu_ld
LD
LDFLAGS_SL
LDFLAGS_EX
+with_lz4
with_zlib
with_system_tzdata
with_libxslt
@@ -867,6 +868,7 @@ with_libxml
with_libxslt
with_system_tzdata
with_zlib
+with_lz4
with_gnu_ld
enable_largefile
'
@@ -1571,6 +1573,7 @@ Optional Packages:
--with-system-tzdata=DIR
use system time zone data in DIR
--without-zlib do not use Zlib
+ --with-lz4 build with LZ4 support
--with-gnu-ld assume the C compiler uses GNU ld [default=no]
Some influential environment variables:
@@ -8601,6 +8604,41 @@ fi
+#
+# LZ4
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with LZ4 support" >&5
+$as_echo_n "checking whether to build with LZ4 support... " >&6; }
+
+
+
+# Check whether --with-lz4 was given.
+if test "${with_lz4+set}" = set; then :
+ withval=$with_lz4;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LZ4 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-lz4 option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_lz4=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_lz4" >&5
+$as_echo "$with_lz4" >&6; }
+
+
#
# Assignments
#
@@ -12092,6 +12130,56 @@ fi
fi
+if test "$with_lz4" = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for LZ4_compress_default in -llz4" >&5
+$as_echo_n "checking for LZ4_compress_default in -llz4... " >&6; }
+if ${ac_cv_lib_lz4_LZ4_compress_default+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-llz4 $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char LZ4_compress_default ();
+int
+main ()
+{
+return LZ4_compress_default ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_lz4_LZ4_compress_default=yes
+else
+ ac_cv_lib_lz4_LZ4_compress_default=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_lz4_LZ4_compress_default" >&5
+$as_echo "$ac_cv_lib_lz4_LZ4_compress_default" >&6; }
+if test "x$ac_cv_lib_lz4_LZ4_compress_default" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBLZ4 1
+_ACEOF
+
+ LIBS="-llz4 $LIBS"
+
+else
+ as_fn_error $? "library 'lz4' is required for LZ4 support" "$LINENO" 5
+fi
+
+fi
+
if test "$enable_spinlocks" = yes; then
$as_echo "#define HAVE_SPINLOCKS 1" >>confdefs.h
@@ -13295,6 +13383,36 @@ Use --without-zlib to disable zlib support." "$LINENO" 5
fi
+fi
+
+if test "$with_lz4" = yes; then
+ for ac_header in lz4.h
+do :
+ ac_fn_c_check_header_mongrel "$LINENO" "lz4.h" "ac_cv_header_lz4_h" "$ac_includes_default"
+if test "x$ac_cv_header_lz4_h" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LZ4_H 1
+_ACEOF
+
+else
+ for ac_header in lz4.h
+do :
+ ac_fn_c_check_header_mongrel "$LINENO" "lz4.h" "ac_cv_header_lz4_h" "$ac_includes_default"
+if test "x$ac_cv_header_lz4_h" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LZ4_H 1
+_ACEOF
+
+else
+ as_fn_error $? "lz4.h header file is required for LZ4" "$LINENO" 5
+fi
+
+done
+
+fi
+
+done
+
fi
if test "$with_gssapi" = yes ; then
diff --git a/configure.ac b/configure.ac
index 5b91c83fd0..9131fcaa6c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -999,6 +999,16 @@ PGAC_ARG_BOOL(with, zlib, yes,
[do not use Zlib])
AC_SUBST(with_zlib)
+#
+# LZ4
+#
+AC_MSG_CHECKING([whether to build with LZ4 support])
+PGAC_ARG_BOOL(with, lz4, no,
+ [build with LZ4 support],
+ [AC_DEFINE([USE_LZ4], 1, [Define to 1 to build with LZ4 support. (--with-lz4)])])
+AC_MSG_RESULT([$with_lz4])
+AC_SUBST(with_lz4)
+
#
# Assignments
#
@@ -1186,6 +1196,10 @@ failure. It is possible the compiler isn't looking in the proper directory.
Use --without-zlib to disable zlib support.])])
fi
+if test "$with_lz4" = yes; then
+ AC_CHECK_LIB(lz4, LZ4_compress_default, [], [AC_MSG_ERROR([library 'lz4' is required for LZ4 support])])
+fi
+
if test "$enable_spinlocks" = yes; then
AC_DEFINE(HAVE_SPINLOCKS, 1, [Define to 1 if you have spinlocks.])
else
@@ -1400,6 +1414,11 @@ failure. It is possible the compiler isn't looking in the proper directory.
Use --without-zlib to disable zlib support.])])
fi
+if test "$with_lz4" = yes; then
+ AC_CHECK_HEADERS(lz4.h, [],
+ [AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is required for LZ4])])])
+fi
+
if test "$with_gssapi" = yes ; then
AC_CHECK_HEADERS(gssapi/gssapi.h, [],
[AC_CHECK_HEADERS(gssapi.h, [], [AC_MSG_ERROR([gssapi.h header file is required for GSSAPI])])])
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 072a6dc1c1..7d2606d3c9 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -19,6 +19,7 @@
#include "access/reloptions.h"
#include "access/sysattr.h"
#include "access/table.h"
+#include "access/tableam.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_foreign_table.h"
#include "commands/copy.h"
@@ -158,10 +159,8 @@ static void estimate_size(PlannerInfo *root, RelOptInfo *baserel,
static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
FileFdwPlanState *fdw_private,
Cost *startup_cost, Cost *total_cost);
-static int file_acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows);
-
+static void file_acquire_sample_rows(Relation onerel, int elevel,
+ AnalyzeSampleContext *context);
/*
* Foreign-data wrapper handler function: return a struct with pointers
@@ -1092,14 +1091,16 @@ estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
* may be meaningless, but it's OK because we don't use the estimates
* currently (the planner only pays attention to correlation for indexscans).
*/
-static int
+static void
file_acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows)
+ AnalyzeSampleContext *context)
{
int numrows = 0;
+ int targrows = 0;
+ double totalrows = 0;
double rowstoskip = -1; /* -1 means not set yet */
ReservoirStateData rstate;
+ HeapTuple tuple;
TupleDesc tupDesc;
Datum *values;
bool *nulls;
@@ -1112,6 +1113,8 @@ file_acquire_sample_rows(Relation onerel, int elevel,
MemoryContext oldcontext = CurrentMemoryContext;
MemoryContext tupcontext;
+ targrows = context->targrows;
+
Assert(onerel);
Assert(targrows > 0);
@@ -1145,8 +1148,6 @@ file_acquire_sample_rows(Relation onerel, int elevel,
errcallback.previous = error_context_stack;
error_context_stack = &errcallback;
- *totalrows = 0;
- *totaldeadrows = 0;
for (;;)
{
/* Check for user-requested abort or sleep */
@@ -1171,7 +1172,8 @@ file_acquire_sample_rows(Relation onerel, int elevel,
*/
if (numrows < targrows)
{
- rows[numrows++] = heap_form_tuple(tupDesc, values, nulls);
+ tuple = heap_form_tuple(tupDesc, values, nulls);
+ AnalyzeRecordSampleRow(context, NULL, tuple, ANALYZE_SAMPLE_DATA, numrows++, false /* replace */, false);
}
else
{
@@ -1181,7 +1183,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
* not-yet-incremented value of totalrows as t.
*/
if (rowstoskip < 0)
- rowstoskip = reservoir_get_next_S(&rstate, *totalrows, targrows);
+ rowstoskip = reservoir_get_next_S(&rstate, totalrows, targrows);
if (rowstoskip <= 0)
{
@@ -1192,14 +1194,14 @@ file_acquire_sample_rows(Relation onerel, int elevel,
int k = (int) (targrows * sampler_random_fract(rstate.randstate));
Assert(k >= 0 && k < targrows);
- heap_freetuple(rows[k]);
- rows[k] = heap_form_tuple(tupDesc, values, nulls);
+ tuple = heap_form_tuple(tupDesc, values, nulls);
+ AnalyzeRecordSampleRow(context, NULL, tuple, ANALYZE_SAMPLE_DATA, k, true /* replace */, false);
}
rowstoskip -= 1;
}
- *totalrows += 1;
+ totalrows += 1;
}
/* Remove error callback. */
@@ -1220,7 +1222,8 @@ file_acquire_sample_rows(Relation onerel, int elevel,
(errmsg("\"%s\": file contains %.0f rows; "
"%d rows in sample",
RelationGetRelationName(onerel),
- *totalrows, numrows)));
+ totalrows, numrows)));
- return numrows;
+ context->totalrows += totalrows;
+ context->totalsampledrows += numrows;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9c5aaacc51..16c7ac8c25 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -17,6 +17,7 @@
#include "access/htup_details.h"
#include "access/sysattr.h"
#include "access/table.h"
+#include "access/tableam.h"
#include "catalog/pg_class.h"
#include "commands/defrem.h"
#include "commands/explain.h"
@@ -237,7 +238,6 @@ typedef struct PgFdwAnalyzeState
List *retrieved_attrs; /* attr numbers retrieved by query */
/* collected sample rows */
- HeapTuple *rows; /* array of size targrows */
int targrows; /* target # of sample rows */
int numrows; /* # of sample rows collected */
@@ -464,12 +464,11 @@ static void process_query_params(ExprContext *econtext,
FmgrInfo *param_flinfo,
List *param_exprs,
const char **param_values);
-static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows,
- double *totaldeadrows);
+static void postgresAcquireSampleRowsFunc(Relation relation, int elevel,
+ AnalyzeSampleContext *context);
static void analyze_row_processor(PGresult *res, int row,
- PgFdwAnalyzeState *astate);
+ PgFdwAnalyzeState *astate,
+ AnalyzeSampleContext *context);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -4489,11 +4488,9 @@ postgresAnalyzeForeignTable(Relation relation,
* may be meaningless, but it's OK because we don't use the estimates
* currently (the planner only pays attention to correlation for indexscans).
*/
-static int
+static void
postgresAcquireSampleRowsFunc(Relation relation, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows,
- double *totaldeadrows)
+ AnalyzeSampleContext *context)
{
PgFdwAnalyzeState astate;
ForeignTable *table;
@@ -4507,13 +4504,11 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
/* Initialize workspace state */
astate.rel = relation;
astate.attinmeta = TupleDescGetAttInMetadata(RelationGetDescr(relation));
-
- astate.rows = rows;
- astate.targrows = targrows;
+ astate.targrows = context->targrows;
astate.numrows = 0;
astate.samplerows = 0;
astate.rowstoskip = -1; /* -1 means not set yet */
- reservoir_init_selection_state(&astate.rstate, targrows);
+ reservoir_init_selection_state(&astate.rstate, astate.targrows);
/* Remember ANALYZE context, and create a per-tuple temp context */
astate.anl_cxt = CurrentMemoryContext;
@@ -4605,7 +4600,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
/* Process whatever we got. */
numrows = PQntuples(res);
for (i = 0; i < numrows; i++)
- analyze_row_processor(res, i, &astate);
+ analyze_row_processor(res, i, &astate, context);
PQclear(res);
res = NULL;
@@ -4629,10 +4624,13 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
ReleaseConnection(conn);
/* We assume that we have no dead tuple. */
- *totaldeadrows = 0.0;
+ context->totaldeadrows = 0.0;
/* We've retrieved all living tuples from foreign server. */
- *totalrows = astate.samplerows;
+ context->totalrows += astate.samplerows;
+
+ /* Increase the number of sample rows stored in the context */
+ context->totalsampledrows += astate.numrows;
/*
* Emit some interesting relation info
@@ -4641,8 +4639,6 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
(errmsg("\"%s\": table contains %.0f rows, %d rows in sample",
RelationGetRelationName(relation),
astate.samplerows, astate.numrows)));
-
- return astate.numrows;
}
/*
@@ -4651,10 +4647,11 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
* - Subsequently, replace already-sampled tuples randomly.
*/
static void
-analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
+analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate, AnalyzeSampleContext *context)
{
int targrows = astate->targrows;
int pos; /* array index to store tuple in */
+ bool replace;
MemoryContext oldcontext;
/* Always increment sample row counter. */
@@ -4668,6 +4665,7 @@ analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
{
/* First targrows rows are always included into the sample */
pos = astate->numrows++;
+ replace = false;
}
else
{
@@ -4684,7 +4682,7 @@ analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
/* Choose a random reservoir element to replace. */
pos = (int) (targrows * sampler_random_fract(astate->rstate.randstate));
Assert(pos >= 0 && pos < targrows);
- heap_freetuple(astate->rows[pos]);
+ replace = true;
}
else
{
@@ -4697,18 +4695,22 @@ analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
if (pos >= 0)
{
+ HeapTuple tuple;
/*
* Create sample tuple from current result row, and store it in the
* position determined above. The tuple has to be created in anl_cxt.
*/
oldcontext = MemoryContextSwitchTo(astate->anl_cxt);
- astate->rows[pos] = make_tuple_from_result_row(res, row,
- astate->rel,
- astate->attinmeta,
- astate->retrieved_attrs,
- NULL,
- astate->temp_cxt);
+ tuple = make_tuple_from_result_row(res, row,
+ astate->rel,
+ astate->attinmeta,
+ astate->retrieved_attrs,
+ NULL,
+ astate->temp_cxt);
+
+ /* Tuple is already created in anl_cxt, we can record it directly */
+ AnalyzeRecordSampleRow(context, NULL, tuple, ANALYZE_SAMPLE_DATA, pos, replace, false);
MemoryContextSwitchTo(oldcontext);
}
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 0880e0a8bb..6d36f3bd26 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \
- table tablesample transam
+ table tablesample transam zedstore
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 8ccc228a8c..90203a8245 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -382,6 +382,15 @@ static relopt_int intRelOpts[] =
},
-1, 0, 1024
},
+ {
+ {
+ "zedstore_rel_extension_factor",
+ "Extend zedstore relations by zedstore_rel_extension_factor #blocks.",
+ RELOPT_KIND_HEAP,
+ ShareUpdateExclusiveLock
+ },
+ ZEDSTORE_DEFAULT_REL_EXTENSION_FACTOR, 1, INT_MAX
+ },
/* list terminator */
{{NULL}}
@@ -1859,7 +1868,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
offsetof(StdRdOptions, vacuum_index_cleanup)},
{"vacuum_truncate", RELOPT_TYPE_BOOL,
- offsetof(StdRdOptions, vacuum_truncate)}
+ offsetof(StdRdOptions, vacuum_truncate)},
+ {"zedstore_rel_extension_factor", RELOPT_TYPE_INT,
+ offsetof(StdRdOptions, zedstore_rel_extension_factor)}
};
return (bytea *) build_reloptions(reloptions, validate, kind,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..b88f6d14c3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -36,6 +36,7 @@
#include "executor/executor.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "parser/analyze.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/lmgr.h"
@@ -45,6 +46,7 @@
#include "utils/builtins.h"
#include "utils/rel.h"
+static int compare_rows(const void *a, const void *b);
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
Datum *values, bool *isnull, RewriteState rwstate);
@@ -180,7 +182,8 @@ static bool
heapam_fetch_row_version(Relation relation,
ItemPointer tid,
Snapshot snapshot,
- TupleTableSlot *slot)
+ TupleTableSlot *slot,
+ Bitmapset *project_cols)
{
BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
Buffer buffer;
@@ -348,7 +351,7 @@ static TM_Result
heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
LockWaitPolicy wait_policy, uint8 flags,
- TM_FailureData *tmfd)
+ TM_FailureData *tmfd, Bitmapset *project_cols)
{
BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
TM_Result result;
@@ -961,10 +964,25 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
pfree(isnull);
}
+static void
+heapam_scan_analyze_beginscan(Relation onerel, AnalyzeSampleContext *context)
+{
+ context->scan = table_beginscan_analyze(onerel);
+
+ /* initialize the totalblocks analyze can scan */
+ context->totalblocks = RelationGetNumberOfBlocks(onerel);
+
+ /* reset the statistics */
+ context->liverows = 0;
+ context->deadrows = 0;
+ context->ordered = true;
+}
+
static bool
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
- BufferAccessStrategy bstrategy)
+heapam_scan_analyze_next_block(BlockNumber blockno,
+ AnalyzeSampleContext *context)
{
+ TableScanDesc scan = context->scan;
HeapScanDesc hscan = (HeapScanDesc) scan;
/*
@@ -979,7 +997,7 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
hscan->rs_cblock = blockno;
hscan->rs_cindex = FirstOffsetNumber;
hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
- blockno, RBM_NORMAL, bstrategy);
+ blockno, RBM_NORMAL, context->bstrategy);
LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
/* in heap all blocks can contain tuples, so always return true */
@@ -987,14 +1005,14 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
}
static bool
-heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
- double *liverows, double *deadrows,
- TupleTableSlot *slot)
+heapam_scan_analyze_next_tuple(TransactionId OldestXmin, AnalyzeSampleContext *context)
{
+ TableScanDesc scan = context->scan;
HeapScanDesc hscan = (HeapScanDesc) scan;
Page targpage;
OffsetNumber maxoffset;
BufferHeapTupleTableSlot *hslot;
+ TupleTableSlot *slot = AnalyzeGetSampleSlot(context, scan->rs_rd, ANALYZE_SAMPLE_DATA);
Assert(TTS_IS_BUFFERTUPLE(slot));
@@ -1020,7 +1038,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
if (!ItemIdIsNormal(itemid))
{
if (ItemIdIsDead(itemid))
- *deadrows += 1;
+ context->deadrows += 1;
continue;
}
@@ -1035,13 +1053,13 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
{
case HEAPTUPLE_LIVE:
sample_it = true;
- *liverows += 1;
+ context->liverows += 1;
break;
case HEAPTUPLE_DEAD:
case HEAPTUPLE_RECENTLY_DEAD:
/* Count dead and recently-dead rows */
- *deadrows += 1;
+ context->deadrows += 1;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -1067,7 +1085,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
{
sample_it = true;
- *liverows += 1;
+ context->liverows += 1;
}
break;
@@ -1096,11 +1114,11 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* concurrent transaction never commits.
*/
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(targtuple->t_data)))
- *deadrows += 1;
+ context->deadrows += 1;
else
{
sample_it = true;
- *liverows += 1;
+ context->liverows += 1;
}
break;
@@ -1129,6 +1147,71 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
return false;
}
+static void
+heapam_scan_analyze_sample_tuple(int pos, bool replace, AnalyzeSampleContext *context)
+{
+ TupleTableSlot *slot;
+ Relation onerel = context->scan->rs_rd;
+
+ Assert(pos >= 0);
+ /*
+ * heapam_scan_analyze_next_tuple should already put the tuple
+ * in the sample slot, just record it into the array of sample
+ * rows.
+ */
+ slot = AnalyzeGetSampleSlot(context, onerel, ANALYZE_SAMPLE_DATA);
+ AnalyzeRecordSampleRow(context, slot, NULL, ANALYZE_SAMPLE_DATA, pos, replace, true);
+
+ /*
+ * if replace happens, the sample rows are no longer ordered
+ * in physical position.
+ */
+ if (replace)
+ context->ordered = false;
+}
+
+static void
+heapam_scan_analyze_endscan(AnalyzeSampleContext *context)
+{
+ HeapTuple *rows = AnalyzeGetSampleRows(context, ANALYZE_SAMPLE_DATA, context->totalsampledrows);
+
+ /*
+ * If no row was replaced while sampling, the collected tuples are
+ * already in physical order and no sort is needed.
+ *
+ * Otherwise we need to sort the collected tuples by position
+ * (itempointer).
+ */
+ if (!context->ordered)
+ qsort((void *)rows, context->targrows, sizeof(HeapTuple), compare_rows);
+
+ table_endscan(context->scan);
+}
+
+/*
+ * qsort comparator for sorting rows[] array
+ */
+static int
+compare_rows(const void *a, const void *b)
+{
+ HeapTuple ha = *(const HeapTuple *) a;
+ HeapTuple hb = *(const HeapTuple *) b;
+ BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
+ OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
+ BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
+ OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
+
+ if (ba < bb)
+ return -1;
+ if (ba > bb)
+ return 1;
+ if (oa < ob)
+ return -1;
+ if (oa > ob)
+ return 1;
+ return 0;
+}
+
static double
heapam_index_build_range_scan(Relation heapRelation,
Relation indexRelation,
@@ -2508,6 +2591,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
+ .scans_leverage_column_projection = false,
.slot_callbacks = heapam_slot_callbacks,
@@ -2544,8 +2628,11 @@ static const TableAmRoutine heapam_methods = {
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
.relation_vacuum = heap_vacuum_rel,
+ .scan_analyze_beginscan = heapam_scan_analyze_beginscan,
.scan_analyze_next_block = heapam_scan_analyze_next_block,
.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
+ .scan_analyze_sample_tuple = heapam_scan_analyze_sample_tuple,
+ .scan_analyze_endscan = heapam_scan_analyze_endscan,
.index_build_range_scan = heapam_index_build_range_scan,
.index_validate_scan = heapam_index_validate_scan,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8730de25ed..25ee10806b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1967,7 +1967,8 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
indexInfo = BuildIndexInfo(btspool->index);
indexInfo->ii_Concurrent = btshared->isconcurrent;
scan = table_beginscan_parallel(btspool->heap,
- ParallelTableScanFromBTShared(btshared));
+ ParallelTableScanFromBTShared(btshared),
+ NULL);
reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
true, progress, _bt_build_callback,
(void *) &buildstate, scan);
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index f88d72fd86..3585dbd7e4 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -29,6 +29,7 @@ OBJS = \
standbydesc.o \
tblspcdesc.o \
xactdesc.o \
- xlogdesc.o
+ xlogdesc.o \
+ zedstoredesc.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/zedstoredesc.c b/src/backend/access/rmgrdesc/zedstoredesc.c
new file mode 100644
index 0000000000..c2499d9e6f
--- /dev/null
+++ b/src/backend/access/rmgrdesc/zedstoredesc.c
@@ -0,0 +1,135 @@
+/*
+ * zedstoredesc.c
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/zedstoredesc.c
+ */
+#include "postgres.h"
+
+#include "access/xlogreader.h"
+#include "access/zedstore_tid.h"
+#include "access/zedstore_wal.h"
+#include "lib/stringinfo.h"
+
+void
+zedstore_desc(StringInfo buf, XLogReaderState *record)
+{
+ char *rec = XLogRecGetData(record);
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == WAL_ZEDSTORE_INIT_METAPAGE)
+ {
+ wal_zedstore_init_metapage *walrec = (wal_zedstore_init_metapage *) rec;
+
+ appendStringInfo(buf, "natts %d", walrec->natts);
+ }
+ else if (info == WAL_ZEDSTORE_UNDO_NEWPAGE)
+ {
+ wal_zedstore_undo_newpage *walrec = (wal_zedstore_undo_newpage *) rec;
+
+ appendStringInfo(buf, "first_counter " UINT64_FORMAT, walrec->first_counter);
+ }
+ else if (info == WAL_ZEDSTORE_UNDO_DISCARD)
+ {
+ wal_zedstore_undo_discard *walrec = (wal_zedstore_undo_discard *) rec;
+
+ appendStringInfo(buf, "oldest_undorecptr " UINT64_FORMAT ", oldest_undopage %u",
+ walrec->oldest_undorecptr.counter,
+ walrec->oldest_undopage);
+ }
+ else if (info == WAL_ZEDSTORE_BTREE_NEW_ROOT)
+ {
+ wal_zedstore_btree_new_root *walrec = (wal_zedstore_btree_new_root *) rec;
+
+ appendStringInfo(buf, "attno %d", walrec->attno);
+ }
+ else if (info == WAL_ZEDSTORE_TIDLEAF_ADD_ITEMS)
+ {
+ wal_zedstore_tidleaf_items *walrec = (wal_zedstore_tidleaf_items *) rec;
+
+ appendStringInfo(buf, "%d items, off %d", walrec->nitems, walrec->off);
+ }
+ else if (info == WAL_ZEDSTORE_TIDLEAF_REPLACE_ITEM)
+ {
+ wal_zedstore_tidleaf_items *walrec = (wal_zedstore_tidleaf_items *) rec;
+
+ appendStringInfo(buf, "%d items, off %d", walrec->nitems, walrec->off);
+ }
+ else if (info == WAL_ZEDSTORE_ATTSTREAM_CHANGE)
+ {
+ wal_zedstore_attstream_change *walrec = (wal_zedstore_attstream_change *) rec;
+
+ if (walrec->is_upper)
+ appendStringInfo(buf, "upper stream change");
+ else
+ appendStringInfo(buf, "lower stream change");
+ appendStringInfo(buf, ", new size %d", walrec->new_attstream_size);
+ }
+ else if (info == WAL_ZEDSTORE_TOAST_NEWPAGE)
+ {
+ wal_zedstore_toast_newpage *walrec = (wal_zedstore_toast_newpage *) rec;
+
+ appendStringInfo(buf, "tid (%u/%d), attno %d, offset %d/%d",
+ ZSTidGetBlockNumber(walrec->tid), ZSTidGetOffsetNumber(walrec->tid),
+ walrec->attno, walrec->offset, walrec->total_size);
+ }
+ else if (info == WAL_ZEDSTORE_FPM_DELETE_PAGE)
+ {
+ wal_zedstore_fpm_delete_page *walrec = (wal_zedstore_fpm_delete_page *) rec;
+
+ appendStringInfo(buf, "nextblkno %u", walrec->next_free_blkno);
+ }
+ else if (info == WAL_ZEDSTORE_FPM_REUSE_PAGE)
+ {
+ wal_zedstore_fpm_reuse_page *walrec = (wal_zedstore_fpm_reuse_page *) rec;
+
+ appendStringInfo(buf, "nextblkno %u", walrec->next_free_blkno);
+ }
+}
+
+const char *
+zedstore_identify(uint8 info)
+{
+ const char *id = NULL;
+
+ switch (info & ~XLR_INFO_MASK)
+ {
+ case WAL_ZEDSTORE_INIT_METAPAGE:
+ id = "INIT_METAPAGE";
+ break;
+ case WAL_ZEDSTORE_UNDO_NEWPAGE:
+ id = "UNDO_NEWPAGE";
+ break;
+ case WAL_ZEDSTORE_UNDO_DISCARD:
+ id = "UNDO_DISCARD";
+ break;
+ case WAL_ZEDSTORE_BTREE_NEW_ROOT:
+ id = "BTREE_NEW_ROOT";
+ break;
+ case WAL_ZEDSTORE_TIDLEAF_ADD_ITEMS:
+ id = "BTREE_TIDLEAF_ADD_ITEMS";
+ break;
+ case WAL_ZEDSTORE_TIDLEAF_REPLACE_ITEM:
+ id = "BTREE_TIDLEAF_REPLACE_ITEM";
+ break;
+ case WAL_ZEDSTORE_BTREE_REWRITE_PAGES:
+ id = "BTREE_REWRITE_PAGES";
+ break;
+ case WAL_ZEDSTORE_ATTSTREAM_CHANGE:
+ id = "ATTSTREAM_CHANGE";
+ break;
+ case WAL_ZEDSTORE_TOAST_NEWPAGE:
+ id = "ZSTOAST_NEWPAGE";
+ break;
+ case WAL_ZEDSTORE_FPM_DELETE_PAGE:
+ id = "FPM_DELETE_PAGE";
+ break;
+ case WAL_ZEDSTORE_FPM_REUSE_PAGE:
+ id = "FPM_REUSE_PAGE";
+ break;
+ }
+ return id;
+}
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 6438c45716..9d0dc40530 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -23,7 +23,9 @@
#include "access/syncscan.h"
#include "access/tableam.h"
+#include "access/tupconvert.h"
#include "access/xact.h"
+#include "catalog/pg_type.h"
#include "optimizer/plancat.h"
#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
@@ -172,7 +174,7 @@ table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
}
TableScanDesc
-table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan)
+table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan, Bitmapset *proj)
{
Snapshot snapshot;
uint32 flags = SO_TYPE_SEQSCAN |
@@ -194,6 +196,9 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan)
snapshot = SnapshotAny;
}
+ if (proj)
+ return relation->rd_tableam->scan_begin_with_column_projection(relation, snapshot, 0, NULL,
+ parallel_scan, flags, proj);
return relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
parallel_scan, flags);
}
@@ -763,3 +768,216 @@ table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
else
*allvisfrac = (double) relallvisible / curpages;
}
+
+/* Create the analyze sample context to acquire sample rows */
+AnalyzeSampleContext *
+CreateAnalyzeSampleContext(Relation onerel,
+ List *anl_cols,
+ int totaltargrows,
+ BufferAccessStrategy strategy)
+{
+ AnalyzeSampleContext *context;
+
+ context = (AnalyzeSampleContext *) palloc(sizeof(AnalyzeSampleContext));
+ context->parent = onerel;
+ context->anl_cols = anl_cols;
+ context->bstrategy = strategy;
+ context->totaltargrows = totaltargrows;
+ context->targrows = totaltargrows;
+ context->scan = NULL;
+ context->totalblocks = 0;
+ context->totalrows = 0;
+ context->totaldeadrows = 0;
+ context->totalsampledrows = 0;
+ context->liverows = 0;
+ context->deadrows = 0;
+ context->ordered = false;
+ context->tup_convert_map = NULL;
+
+ /* empty all sample types */
+ memset(context->sample_slots, 0, MAX_ANALYZE_SAMPLE * sizeof(TupleTableSlot *));
+ memset(context->sample_rows, 0, MAX_ANALYZE_SAMPLE * sizeof(HeapTuple *));
+
+ return context;
+}
+
+/* Destroy analyze sample context */
+void
+DestroyAnalyzeSampleContext(AnalyzeSampleContext *context)
+{
+ for (int i = 0; i < MAX_ANALYZE_SAMPLE; i++)
+ {
+ TupleTableSlot *slot = context->sample_slots[i];
+ if (slot)
+ ExecDropSingleTupleTableSlot(slot);
+ }
+}
+
+/*
+ * To acquire sample rows from an inherited table, all child
+ * relations use the same analyze sample context, this function
+ * must be called before starting analyze a new child relation.
+ */
+void
+InitAnalyzeSampleContextForChild(AnalyzeSampleContext *context,
+ Relation child,
+ int childtargrows)
+{
+ /* Set targrows to childtargrows */
+ context->targrows = childtargrows;
+
+ /* We may need to convert from child's rowtype to parent's */
+ if (!equalTupleDescs(RelationGetDescr(child),
+ RelationGetDescr(context->parent)))
+ {
+ if (context->tup_convert_map)
+ free_conversion_map(context->tup_convert_map);
+ /* Create a convert map so it can be used when recording sample rows */
+ context->tup_convert_map =
+ convert_tuples_by_name(RelationGetDescr(child),
+ RelationGetDescr(context->parent));
+
+ /* We also cannot use previous sample slot anymore */
+ if (context->sample_slots[ANALYZE_SAMPLE_DATA])
+ {
+ ExecDropSingleTupleTableSlot(context->sample_slots[ANALYZE_SAMPLE_DATA]);
+ context->sample_slots[ANALYZE_SAMPLE_DATA] = NULL;
+ }
+ /* Same for the disk-size sample slot */
+ if (context->sample_slots[ANALYZE_SAMPLE_DISKSIZE])
+ {
+ ExecDropSingleTupleTableSlot(context->sample_slots[ANALYZE_SAMPLE_DISKSIZE]);
+ context->sample_slots[ANALYZE_SAMPLE_DISKSIZE] = NULL;
+ }
+ }
+}
+
+void
+AnalyzeGetSampleStats(AnalyzeSampleContext *context,
+ int *totalsampledrows,
+ double *totalrows,
+ double *totaldeadrows)
+{
+ if (totalsampledrows)
+ *totalsampledrows = context->totalsampledrows;
+ if (totalrows)
+ *totalrows = context->totalrows;
+ if (totaldeadrows)
+ *totaldeadrows = context->totaldeadrows;
+}
+
+
+/*
+ * Get or initialize a sample slot to hold sample tuple, normally
+ * the tuple in the slot will be copied to the sample_rows[type]
+ * by AnalyzeRecordSampleRow().
+ */
+TupleTableSlot *
+AnalyzeGetSampleSlot(AnalyzeSampleContext *context,
+ Relation onerel,
+ AnalyzeSampleType type)
+{
+ TupleDesc tupdesc;
+ int attr_cnt = onerel->rd_att->natts;
+
+ if (context->sample_slots[type])
+ return context->sample_slots[type];
+
+ switch (type)
+ {
+ case ANALYZE_SAMPLE_DATA:
+ tupdesc = RelationGetDescr(onerel);
+ break;
+ case ANALYZE_SAMPLE_DISKSIZE:
+ tupdesc = CreateTemplateTupleDesc(attr_cnt);
+ for (int i = 1; i <= attr_cnt; i++)
+ TupleDescInitEntry(tupdesc, i, "", FLOAT8OID, -1, 0);
+ break;
+ default:
+ elog(ERROR, "unknown analyze sample type");
+ }
+
+ context->sample_slots[type] =
+ MakeSingleTupleTableSlot(tupdesc, table_slot_callbacks(onerel));
+ return context->sample_slots[type];
+}
+
+HeapTuple *
+AnalyzeGetSampleRows(AnalyzeSampleContext *context,
+ AnalyzeSampleType type,
+ int offset)
+{
+ Assert(offset < context->totaltargrows);
+ if (!context->sample_rows[type])
+ context->sample_rows[type] =
+ (HeapTuple *) palloc(context->totaltargrows * sizeof(HeapTuple));
+
+ return context->sample_rows[type] + offset;
+}
+
+/*
+ * Record a sample tuple into sample_rows[type].
+ *
+ * sample_tuple:
+ * Input sample tuple. Sometimes a caller has already
+ * formed the sample tuple in its own memory context, and we
+ * can record it directly.
+ * sample_slot:
+ * Slot which contains the sample tuple. We need to copy
+ * the sample tuple and then record it.
+ * pos:
+ * The position in the sample_rows[type].
+ * replace:
+ * Replace the old sample tuple in the specified position.
+ * withtid:
+ * Set the tid of sample tuple, this is only valid when
+ * sample_slot is set.
+ *
+ * We prefer to use sample_slot if both sample_tuple and
+ * sample_slot are set, sample_slot is the most common case.
+ */
+void
+AnalyzeRecordSampleRow(AnalyzeSampleContext *context,
+ TupleTableSlot *sample_slot,
+ HeapTuple sample_tuple,
+ AnalyzeSampleType type,
+ int pos,
+ bool replace,
+ bool withtid)
+{
+ HeapTuple tuple;
+ HeapTuple *rows;
+
+ rows = AnalyzeGetSampleRows(context, type, context->totalsampledrows);
+
+ /* We need to free the old tuple if replace is true */
+ if (replace)
+ heap_freetuple(rows[pos]);
+
+ Assert(sample_slot || sample_tuple);
+ if (sample_slot)
+ tuple = ExecCopySlotHeapTuple(sample_slot);
+ else
+ tuple = sample_tuple;
+
+ /* We may need to convert from child's rowtype to parent's */
+ if (context->tup_convert_map != NULL && type == ANALYZE_SAMPLE_DATA)
+ {
+ HeapTuple newtup;
+ newtup = execute_attr_map_tuple(tuple, context->tup_convert_map);
+ heap_freetuple(tuple);
+ tuple = newtup;
+ }
+
+ if (withtid && sample_slot)
+ tuple->t_self = sample_slot->tts_tid;
+
+ /* store the tuple to right position */
+ rows[pos] = tuple;
+}
+
+bool
+AnalyzeSampleIsValid(AnalyzeSampleContext *context, AnalyzeSampleType type)
+{
+ return context->sample_rows[type] != NULL;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 58091f6b52..74ef539770 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -20,6 +20,7 @@
#include "access/spgxlog.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/zedstore_wal.h"
#include "catalog/storage_xlog.h"
#include "commands/dbcommands_xlog.h"
#include "commands/sequence.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb1..973581b7f7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogutils.h"
+#include "access/zedstoream.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2131,6 +2132,13 @@ CommitTransaction(void)
/* Shut down the deferred-trigger manager */
AfterTriggerEndXact(true);
+ /*
+ * Flush tuple buffers in zedstore. Cannot insert/update zedstore tables
+ * in this transaction after this. This must happen before ON COMMIT
+ * actions, so we don't fail on flushing to ON COMMIT DROP tables.
+ */
+ AtEOXact_zedstore_tuplebuffers(true);
+
/*
* Let ON COMMIT management do its thing (must happen after closing
* cursors, to avoid dangling-reference problems)
@@ -2371,6 +2379,13 @@ PrepareTransaction(void)
/* Shut down the deferred-trigger manager */
AfterTriggerEndXact(true);
+ /*
+ * Flush tuple buffers in zedstore. Cannot insert/update zedstore tables
+ * in this transaction after this. This must happen before ON COMMIT
+ * actions, so we don't fail on flushing to ON COMMIT DROP tables.
+ */
+ AtEOXact_zedstore_tuplebuffers(true);
+
/*
* Let ON COMMIT management do its thing (must happen after closing
* cursors, to avoid dangling-reference problems)
@@ -2708,6 +2723,7 @@ AbortTransaction(void)
*/
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
+ AtEOXact_zedstore_tuplebuffers(false);
smgrDoPendingSyncs(false, is_parallel_worker);
AtEOXact_LargeObject(false);
AtAbort_Notify();
@@ -4798,6 +4814,8 @@ StartSubTransaction(void)
s->state = TRANS_INPROGRESS;
+ AtSubStart_zedstore_tuplebuffers();
+
/*
* Call start-of-subxact callbacks
*/
@@ -4836,6 +4854,8 @@ CommitSubTransaction(void)
s->parallelModeLevel = 0;
}
+ AtEOSubXact_zedstore_tuplebuffers(true);
+
/* Do the actual "commit", such as it is */
s->state = TRANS_COMMIT;
@@ -5019,6 +5039,7 @@ AbortSubTransaction(void)
s->parent->subTransactionId,
s->curTransactionOwner,
s->parent->curTransactionOwner);
+ AtEOSubXact_zedstore_tuplebuffers(false);
AtEOSubXact_LargeObject(false, s->subTransactionId,
s->parent->subTransactionId);
AtSubAbort_Notify();
diff --git a/src/backend/access/zedstore/Makefile b/src/backend/access/zedstore/Makefile
new file mode 100644
index 0000000000..2e3ca364df
--- /dev/null
+++ b/src/backend/access/zedstore/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for access/zedstore
+#
+# IDENTIFICATION
+# src/backend/access/zedstore/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/zedstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = zedstore_btree.o zedstore_tiditem.o zedstore_tidpage.o \
+ zedstore_tid.o zedstore_attstream.o zedstore_attpage.o \
+ zedstore_compression.o zedstore_simple8b.o zedstoream_handler.o \
+ zedstore_meta.o zedstore_undolog.o zedstore_undorec.o \
+ zedstore_toast.o zedstore_visibility.o zedstore_inspect.o \
+ zedstore_freepagemap.o zedstore_tupslot.o zedstore_wal.o \
+ zedstore_tuplebuffer.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/zedstore/README b/src/backend/access/zedstore/README
new file mode 100644
index 0000000000..e2c6ae461f
--- /dev/null
+++ b/src/backend/access/zedstore/README
@@ -0,0 +1,362 @@
+
+src/backend/access/zedstore/README
+
+ZedStore - compressed column (and row) store for PostgreSQL
+===========================================================
+
+The purpose of this README is to provide an overview of zedstore's
+design, the major requirements/objectives it intends to fulfill, and
+high-level implementation details.
+
+Objectives
+----------
+
+* Performance improvement for queries selecting subset of columns
+(reduced IO).
+
+* Reduced on-disk footprint compared to heap table. Shorter tuple
+headers and also leveraging compression of similar type data
+
+* Be first-class citizen in the Postgres architecture (tables data can
+just independently live in columnar storage) and not be at arm's
+length through an opaque interface.
+
+* Fully MVCC compliant - basically all operations supported similar to
+heap, like update, delete, serializable transactions etc...
+
+* All Indexes supported
+
+* Hybrid row-column store, where some columns are stored together, and
+others separately. Provide flexibility of granularity on how to divide
+the columns. Columns accessed together can be stored together.
+
+* Provide better control over bloat.
+
+* Eliminate need for separate toast tables
+
+* Faster add / drop column or changing data type of column by avoiding
+full rewrite of the table.
+
+High-level design of ZedStore - B-trees for the win!
+----------------------------------------------------
+
+ZedStore consists of multiple B-trees. There is one B-tree, called the
+TID tree, which contains the visibility information of each tuple, but
+no user data. In addition to that, there is one B-tree for each
+attribute, called the attribute trees, to store the user data. Note that
+these B-tree implementations are completely unrelated to PostgreSQL's
+B-tree indexes.
+
+The TID tree, and all the attribute trees, use the TID as the key. The
+TID is used as a logical row identifier. Internally, ZedStore passes
+TIDs around as 64-bit integers (zstid), but for interfacing with the
+rest of the system, they are converted to/from ItemPointers. When
+converted to an ItemPointer, the conversion ensures that the ItemPointer
+looks valid, i.e. offset 0 is never used. However, the TID is just a
+48-bit row identifier; the traditional division into block and offset
+numbers is meaningless. There is locality of access, though; TIDs that
+are close to each other, will probably also reside close to each other
+on disk. So, for example, bitmap index scans or BRIN indexes, which
+work with block numbers, still make some sense, even though the "block
+number" stored in a zedstore ItemPointer doesn't correspond to a
+physical block.
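+
+As a minimal sketch (illustrative; only ZSTidGetBlockNumber() and
+ZSTidGetOffsetNumber() are existing helpers, used here because the rest
+of the patch already relies on them), mapping a zstid back to an
+ItemPointer is just:
+
+    ItemPointerData iptr;
+
+    ItemPointerSet(&iptr,
+                   ZSTidGetBlockNumber(tid),    /* "block number" is just the high bits */
+                   ZSTidGetOffsetNumber(tid));  /* never 0, so the pointer looks valid */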
+
+The internal pages of the B-trees are super simple and boring. The internal
+pages of the TID and attribute trees look identical. Functions that work
+with either the TID or attribute tree use ZS_META_ATTRIBUTE_NUM as the
+"attribute number", when working with the TID tree.
+
+
+
+The leaf pages look different in the TID tree and the attribute trees. Let's
+look at the TID tree first:
+
+TID tree
+--------
+
+A TID tree page consists of multiple ZSTidArrayItems. Each ZSTidArrayItem
+represents a group of tuples, with TIDs in a particular range. The TID
+ranges of ZSTidArrayItems never overlap. For each tuple, we logically
+store the TID, and its UNDO pointer. The actual visibility information
+is stored in the UNDO log, if the tuple was recently modified.
+
+A tuple can also be marked as dead, which means that the tuple is not
+visible to anyone. Dead tuples are marked with a special constant
+UNDO pointer value, DeadUndoPtr. The TIDs of dead tuples cannot be
+reused, until all index pointers to the tuples have been removed, by
+VACUUM. VACUUM scans the TID tree to collect all the dead TIDs. (Note
+that VACUUM does not need to scan the attribute trees, and the TID tree
+is hopefully just a small fraction of the table. Vacuum on zedstore is
+therefore hopefully much faster than on heap. (Although the freeze map
+can be pretty effective on the heap, too))
+
+So logically, the TID tree stores the TID and UNDO pointer for every
+tuple. However, that would take a lot of space. To reduce disk usage,
+the TID tree consists of ZSTidArrayItems, which contain the TIDs and
+their UNDO pointers in a specially encoded format. The encoded format
+is optimized for the common cases, where the gaps between TIDs are
+small, and most tuples are visible to everyone. See the comments for
+ZSTidArrayItem in zedstore_internal.h for details.
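+
+For example, an item covering TIDs 1000-1003 logically holds something
+like the following (purely illustrative, not the physical encoding):
+
+    TID 1000 -> UNDO pointer        (recently inserted, check the UNDO log)
+    TID 1001 -> "visible to all"    (older than the UNDO horizon)
+    TID 1002 -> DeadUndoPtr         (dead, waiting for index cleanup)
+    TID 1003 -> "visible to all"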
+
+Having a TID tree that's separate from the attributes helps to support
+zero-column tables (which can be the result of ADD COLUMN / DROP COLUMN
+actions as well). Having meta-data stored separately from data also helps
+to get better compression ratios, and simplifies the overall
+design/implementation: deletes only need to edit the TID tree and avoid
+touching the attribute B-trees.
+
+
+Attribute trees
+---------------
+
+The leaf pages on the attribute tree contain two "streams" of attribute
+data, one compressed, and one uncompressed. The compressed stream is
+compressed using LZ4. (Assuming the server has been built with "configure
+--with-lz4". Otherwise, PostgreSQL's built-in pglz algorithm is used, but
+it is *much* slower). When new rows are added, the new attribute data is
+appended to the uncompressed stream, until the page gets full, at which
+point all the uncompressed data is repacked and moved to the compressed
+stream. An attribute stream consists of smaller "chunks", and each chunk
+contains the TIDs and data for 1-60 rows.
+
+In uncompressed form, an attribute stream on a page can be arbitrarily
+large, but after compression, it must fit into a physical 8k block. If
+on insert or update of a tuple, the page cannot be compressed below 8k
+anymore, the page is split. Note that because TIDs are logical rather
+than physical identifiers, we can freely move tuples from one physical
+page to another during page split. A tuple's TID never changes.
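+
+A minimal sketch of the repacking step, assuming an LZ4 build (only
+LZ4_compress_default() is the real library call; the buffer and helper
+names are illustrative):
+
+    int    compressed_size;
+
+    /* try to move the uncompressed (lower) stream into the compressed (upper) stream */
+    compressed_size = LZ4_compress_default(lower_stream_data, compressed_buf,
+                                           lower_stream_size, compressed_buf_size);
+    if (compressed_size == 0 || compressed_size > space_left_on_page)
+        split_attr_leaf_page();      /* cannot fit under 8k anymore */
+    else
+        append_to_upper_stream(compressed_buf, compressed_size);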
+
+The buffer cache caches compressed blocks. Likewise, WAL-logging,
+full-page images etc. work on compressed blocks. Uncompression is done
+on-the-fly, as and when needed in backend-private memory, when
+reading. For some encodings, like run-length or delta encoding,
+tuples can be constructed directly from compressed data.
+
+
+To reconstruct a row with given TID, scan descends down the B-trees for
+all the columns using that TID, and fetches all attributes. Likewise, a
+sequential scan walks all the B-trees in lockstep.
+
+
+TODO: Currently, each attribute is stored in a separate attribute
+B-tree. But a hybrid row-column store would also be possible, where some
+columns were stored together in the same tree. Or even a row store, where
+all the user data was stored in a single tree, or even combined with the
+TID tree.
+
+Metapage
+--------
+
+A metapage at block 0, has links to the roots of the B-trees.
+
+
+Low-level locking / concurrency issues
+--------------------------------------
+Design principles:
+
+* Every page is self-identifying. Every page has a page type ID,
+ which indicates what kind of a page it is. For a B-tree page,
+ the page header contains the attribute number and lo/hi key.
+ That is enough information to find the downlink to the page, so
+ that it can be deleted if necessary. There is enough information
+ on each leaf page to easily re-build the internal pages from
+ scratch, in case of corruption, for example.
+
+* Concurrency control: When traversing the B-tree, or walking UNDO
+ or TOAST pages, it's possible that a concurrent process splits or
+ moves a page just when we're about to step on it. There is enough
+ information on each page to detect that case. For example, if a
+ B-tree page is split just when you are about to step on it, you
+ can detect that by looking at the lo/hi key. If a page is deleted,
+ that can be detected too, because the attribute number or lo/hikey
+ are not what you expected. In that case, start the scan from the
+ root.
+
+* Any page can be fairly easily be moved, starting with just the
+ page itself. When you have a B-tree page at hand, you can re-find
+ its parent using its lokey, and modify the downlink. A toast page
+ contains the attno/TID, which can be used to find the pointer to
+ it in the b-tree. An UNDO page cannot currently be moved because
+ UNDO pointers contain the physical block number, but as soon as an
+ UNDO page expires, it can be deleted.
+
+
+MVCC
+----
+
+Undo record pointers are used to implement MVCC, like in zheap. Hence,
+transaction information if not directly stored with the data. In
+zheap, there's a small, fixed, number of "transaction slots" on each
+page, but zedstore has undo pointer with each item directly; in normal
+cases, the compression squeezes this down to almost nothing. In case
+of bulk load the undo record pointer is maintained for array of items
+and not per item. Undo pointer is only stored in meta-column and all
+MVCC operations are performed using the meta-column only.
+
+
+Insert:
+Inserting a new row splits the row into datums. While adding the entry
+for the meta-column, zedstore decides which block to insert into, picks
+a TID for the row, and writes an undo record for it. All the data
+columns are then inserted using that TID.
+
+Toast:
+When an overly large datum is stored, it is divided into chunks, and
+each chunk is stored on a dedicated toast page within the same
+physical file. The toast pages of a datum form a list; each page has a
+next/prev pointer.
+
+Select:
+A property is added to the table AM to convey whether column projection
+is leveraged by the AM for scans. While scanning a table whose AM has
+this property, the executor parses the plan, using the target list and
+quals to find the columns required by the query. This list is passed
+down to the AM on beginscan. Zedstore uses this column projection list
+to only pull data from the selected columns. A virtual tuple table slot
+is used to pass back the datums for the subset of columns.
+
+The current table AM API requires enhancement here to pass the column
+projection down to the AM. The patch showcases two different ways of
+doing this, as sketched after the list below.
+
+* For sequential scans, a new beginscan_with_column_projection() API is
+added. The executor checks the AM property, and uses this new API if the
+AM leverages column projection, else the normal beginscan() API.
+
+* For index scans, instead of modifying the beginscan API, a new API is
+added to specifically pass the column projection list after calling
+beginscan (to populate the scan descriptor) but before fetching the
+tuples.
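+
+A minimal sketch of starting a projected sequential scan (only
+scans_leverage_column_projection and scan_begin_with_column_projection
+come from this patch; the rest is illustrative):
+
+    Bitmapset  *proj = NULL;
+    uint32      flags = SO_TYPE_SEQSCAN;
+
+    proj = bms_add_member(proj, 2);          /* the query only needs column 2 */
+
+    if (rel->rd_tableam->scans_leverage_column_projection)
+        scan = rel->rd_tableam->scan_begin_with_column_projection(rel, snapshot,
+                                                                  0, NULL, NULL,
+                                                                  flags, proj);
+    else
+        scan = table_beginscan(rel, snapshot, 0, NULL);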
+
+Delete:
+When deleting a tuple, a new undo record is created for the delete and
+only the meta-column item is updated with this new undo record. The new
+undo record points to the previous undo record (the insert undo record)
+for the tuple. Hence, a delete only operates on the meta-column, and no
+data column is edited.
+
+Update:
+An update in zedstore is roughly equivalent to a delete plus an insert.
+The delete action is performed as stated above and a new entry is added
+with the updated values. So, no in-place update happens.
+
+Index Support:
+Building an index also leverages the columnar storage and only scans the
+columns required to build the index. Indexes work much like they do for
+heap tables: data is inserted into the table and the TID for the tuple
+gets stored in the index. On index scans, the required column B-trees
+are scanned for the given TID and the datums are passed back using a
+virtual tuple. Since only the meta-column is used to perform the
+visibility check, only visible tuples' data is fetched from the rest of
+the B-trees.
+
+Page Format
+-----------
+A ZedStore table contains different kinds of pages, all in the same
+file. Kinds of pages are meta-page, per-attribute btree internal and
+leaf pages, UNDO log page, and toast pages. Each page type has its own
+distinct data storage format.
+
+META Page:
+Block 0 is always a metapage. It contains the block numbers of the
+other data structures stored within the file, like the per-attribute
+B-trees, and the UNDO log.
+
+BTREE Page:
+
+UNDO Page:
+
+TOAST Page:
+
+
+Free Pages Map
+--------------
+
+The Free Page Map structure used in Zedstore is simply a linked list of unused
+blocks. There are multiple free page maps, with one free page map for the table,
+to cater to allocation requests for UNDO pages. There is a free page map for the
+tid tree and a separate free page map for each attribute (we use the term
+attribute-level free page map for the free page map for the tid tree as well,
+considering it as attribute 0).
+The block number of the first unused page for each of these lists is stored
+in the metapage. Each unused block contains link to the next unused
+block in the chain. When a block comes unused, it is added to the
+head of the list.
+
+By batching page allocations and by having attribute-level free page maps, we
+ensure that each attribute B-tree gets more contiguous ranges of blocks, even under
+concurrent inserts to the same table, so that I/O readahead can be effective.
+The batching factor we use is the reloption: zedstore_rel_extension_factor.
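+For example, assuming the access method is registered under the name
+"zedstore", the factor could be set per table with something like
+CREATE TABLE t (...) USING zedstore WITH (zedstore_rel_extension_factor = 16).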
+
+TODO: That doesn't scale very well, and the pages are reused in LIFO
+order. We'll probably want to do something smarter to avoid making the
+metapage a bottleneck for this.
+
+
+Enhancement ideas / alternative designs
+---------------------------------------
+
+Instead of compressing all the tuples on a page in one batch, store a
+small "dictionary", e.g. in page header or meta page or separate
+dedicated page, and use it to compress tuple by tuple. That could make
+random reads and updates of individual tuples faster. Need to find how
+to create the dictionary first.
+
+Only cache compressed pages in the page cache. If we want to cache
+uncompressed pages instead, or in addition to that, we need to invent
+a whole new kind of a buffer cache that can deal with the
+variable-size blocks. For a first version, I think we can live without
+it.
+
+Instead of storing all columns in the same file, we could store them
+in separate files (separate forks?). That would allow immediate reuse
+of space, after dropping a column. It's not clear how to use an FSM in
+that case, though. Might have to implement an integrated FSM,
+too. (Which might not be a bad idea, anyway).
+
+Design allows for hybrid row-column store, where some columns are
+stored together, and others have a dedicated B-tree. Need to have user
+facing syntax to allow specifying how to group the columns.
+
+Salient points for the design
+------------------------------
+
+* Lay out the data/tuples in a mapped fashion instead of keeping the
+logical-to-physical mapping separate from the actual data. So, keep all
+the meta-data and data logically in a single stream of the file, avoiding
+the need for separate forks/files to store meta-data and data.
+
+* Handle/treat operations at tuple level and not block level.
+
+* Stick to fixed-size physical blocks. Variable-size blocks (for
+possibly higher compression ratios) would require more logical-to-physical
+mapping maintenance, plus restrictions on concurrency of
+writes and reads to files. Hence, adapt the compression to fit fixed-size
+blocks instead of the other way round.
+
+
+Predicate locking
+-----------------
+
+Predicate locks, to support SERIALIZABLE transactions, are taken as
+with the heap. From README-SSI:
+
+* For a table scan, the entire relation will be locked.
+
+* Each tuple read which is visible to the reading transaction will be
+locked, whether or not it meets selection criteria; except that there
+is no need to acquire an SIREAD lock on a tuple when the transaction
+already holds a write lock on any tuple representing the row, since a
+rw-conflict would also create a ww-dependency which has more
+aggressive enforcement and thus will prevent any anomaly.
+
+* Modifying a heap tuple creates a rw-conflict with any transaction
+that holds a SIREAD lock on that tuple, or on the page or relation
+that contains it.
+
+* Inserting a new tuple creates a rw-conflict with any transaction
+holding a SIREAD lock on the entire relation. It doesn't conflict with
+page-level locks, because page-level locks are only used to aggregate
+tuple locks. Unlike index page locks, they don't lock "gaps" on the
+page.
+
+
+ZedStore isn't block-based, so page-level locks really just mean a
+range of TIDs. They're only used to aggregate tuple locks.
diff --git a/src/backend/access/zedstore/zedstore_attpage.c b/src/backend/access/zedstore/zedstore_attpage.c
new file mode 100644
index 0000000000..52c2a5a1b4
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_attpage.c
@@ -0,0 +1,1068 @@
+/*
+ * zedstore_attpage.c
+ * Routines for handling attribute leaf pages.
+ *
+ * A Zedstore table consists of multiple B-trees, one for each attribute. The
+ * functions in this file deal with a scan of one attribute tree.
+ *
+ * Operations:
+ *
+ * - Sequential scan in TID order
+ * - must be efficient with scanning multiple trees in sync
+ *
+ * - random lookups, by TID (for index scan)
+ *
+ * - range scans by TID (for bitmap index scan)
+ *
+ * NOTES:
+ * - Locking order: child before parent, left before right
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_attpage.c
+ */
+#include "postgres.h"
+
+#include "access/xlogutils.h"
+#include "access/zedstore_compression.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+typedef struct
+{
+ Page currpage;
+
+ /* first page writes over the old buffer, subsequent pages get newly-allocated buffers */
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ AttrNumber attno;
+ zstid hikey;
+
+ BlockNumber nextblkno;
+} zsbt_attr_repack_context;
+
+/* prototypes for local functions */
+static ZSAttStream *get_page_lowerstream(Page page);
+static ZSAttStream *get_page_upperstream(Page page);
+static void wal_log_attstream_change(Relation rel, Buffer buf, ZSAttStream *attstream, bool is_upper,
+ uint16 begin_offset, uint16 end_offset);
+
+static void zsbt_attr_repack_init(zsbt_attr_repack_context *cxt, AttrNumber attno, Buffer oldbuf, bool append);
+static void zsbt_attr_repack_newpage(zsbt_attr_repack_context *cxt, zstid nexttid);
+static void zsbt_attr_pack_attstream(Form_pg_attribute attr, attstream_buffer *buf, Page page);
+static void zsbt_attr_repack_writeback_pages(zsbt_attr_repack_context *cxt,
+ Relation rel, AttrNumber attno,
+ Buffer oldbuf);
+
+/* ----------------------------------------------------------------
+ * Public interface
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Begin a scan of an attribute btree.
+ *
+ * Fills in the scan struct in *scan.
+ */
+void
+zsbt_attr_begin_scan(Relation rel, TupleDesc tdesc, AttrNumber attno,
+ ZSAttrTreeScan *scan)
+{
+ scan->rel = rel;
+ scan->attno = attno;
+ scan->attdesc = TupleDescAttr(tdesc, attno - 1);
+
+ scan->context = CurrentMemoryContext;
+
+ init_attstream_decoder(&scan->decoder, scan->attdesc->attbyval, scan->attdesc->attlen);
+ scan->decoder.tmpcxt = AllocSetContextCreate(scan->context,
+ "ZedstoreAMAttrScanContext",
+ ALLOCSET_DEFAULT_SIZES);
+
+ scan->decoder_last_idx = -1;
+
+ scan->active = true;
+ scan->lastbuf = InvalidBuffer;
+ scan->lastoff = InvalidOffsetNumber;
+}
+
+void
+zsbt_attr_end_scan(ZSAttrTreeScan *scan)
+{
+ if (!scan->active)
+ return;
+
+ if (scan->lastbuf != InvalidBuffer)
+ ReleaseBuffer(scan->lastbuf);
+
+ scan->active = false;
+
+ destroy_attstream_decoder(&scan->decoder);
+}
+
+/*
+ * Load scan->array_* arrays with data that contains 'nexttid'.
+ *
+ * Return true if data containing 'nexttid' was found. The tid/Datum/isnull
+ * data are placed into scan->array_* fields. The data is valid until the
+ * next call of this function. Note that the item's range contains 'nexttid',
+ * but its TID list might not include the exact TID itself. The caller
+ * must scan the array to check for that.
+ *
+ * This is normally not used directly. Use the zsbt_attr_fetch() wrapper,
+ * instead.
+ */
+bool
+zsbt_attr_scan_fetch_array(ZSAttrTreeScan *scan, zstid nexttid)
+{
+ Buffer buf;
+ Page page;
+ ZSAttStream *stream;
+
+ if (!scan->active)
+ return false;
+
+ /*
+ * If the TID we're looking for is in the current attstream, we just
+ * need to decode more of it.
+ *
+ * TODO: We could restart the decoder, if the current attstream
+ * covers the target TID, but we already decoded past it.
+ */
+ if (scan->decoder.pos < scan->decoder.chunks_len &&
+ nexttid >= scan->decoder.firsttid &&
+ nexttid <= scan->decoder.lasttid)
+ {
+ if (nexttid <= scan->decoder.prevtid)
+ {
+ /*
+ * The target TID is in this attstream, but we already scanned
+ * past it. Restart the decoder.
+ */
+ scan->decoder.pos = 0;
+ scan->decoder.prevtid = 0;
+ }
+
+ /* Advance the scan, until we have reached the target TID */
+ while (nexttid > scan->decoder.prevtid)
+ (void) decode_attstream_cont(&scan->decoder);
+
+ if (scan->decoder.num_elements == 0 ||
+ nexttid < scan->decoder.tids[0])
+ return false;
+ else
+ return true;
+ }
+
+ /* reset the decoder */
+ scan->decoder.num_elements = 0;
+ scan->decoder.chunks_len = 0;
+ scan->decoder.pos = 0;
+ scan->decoder.prevtid = 0;
+
+ /*
+ * Descend the tree, find and lock the leaf page containing 'nexttid'.
+ */
+ buf = zsbt_find_and_lock_leaf_containing_tid(scan->rel, scan->attno,
+ scan->lastbuf, nexttid,
+ BUFFER_LOCK_SHARE);
+ scan->lastbuf = buf;
+ if (!BufferIsValid(buf))
+ {
+ /*
+ * Completely empty tree. This should only happen at the beginning
+ * of a scan - a tree cannot go missing after it's been created -
+ * but we don't currently check for that.
+ */
+ return false;
+ }
+ page = BufferGetPage(buf);
+
+ /* See if the upper stream covers the target tid */
+ stream = get_page_upperstream(page);
+ if (stream && nexttid <= stream->t_lasttid)
+ {
+ decode_attstream_begin(&scan->decoder, stream);
+ }
+ /*
+ * How about the lower stream? (We assume that the upper stream is < lower
+ * stream, and there's no overlap).
+ */
+ else
+ {
+ stream = get_page_lowerstream(page);
+ if (stream && nexttid <= stream->t_lasttid)
+ {
+ /* If there is a match, it will be in this attstream */
+ decode_attstream_begin(&scan->decoder, stream);
+ }
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * We now have the attstream we need copied into scan->decoder (or not, if
+ * no covering attstream was found)
+ */
+ if (scan->decoder.pos < scan->decoder.chunks_len &&
+ nexttid >= scan->decoder.firsttid &&
+ nexttid <= scan->decoder.lasttid)
+ {
+ /* Advance the scan, until we have reached the target TID */
+ while (nexttid > scan->decoder.prevtid)
+ (void) decode_attstream_cont(&scan->decoder);
+
+ if (scan->decoder.num_elements == 0 ||
+ nexttid < scan->decoder.tids[0])
+ return false;
+ else
+ return true;
+ }
+ else
+ return false;
+}
+
+/*
+ * Remove data for the given TIDs from the attribute tree.
+ */
+void
+zsbt_attr_remove(Relation rel, AttrNumber attno, IntegerSet *tids)
+{
+ Form_pg_attribute attr = &rel->rd_att->attrs[attno - 1];
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ zstid nexttid;
+ MemoryContext oldcontext;
+ MemoryContext tmpcontext;
+ zstid *tids_to_remove;
+ int num_to_remove;
+ int allocated_size;
+
+ tids_to_remove = palloc(1000 * sizeof(zstid));
+ allocated_size = 1000;
+
+ tmpcontext = AllocSetContextCreate(CurrentMemoryContext,
+ "ZedstoreAMVacuumContext",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(tmpcontext);
+
+ intset_begin_iterate(tids);
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = InvalidZSTid;
+
+ while (nexttid < MaxPlusOneZSTid)
+ {
+ ZSAttStream *lowerstream;
+ ZSAttStream *upperstream;
+
+ buf = zsbt_descend(rel, attno, nexttid, 0, false);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /*
+ * We now have a page at hand, that (should) contain at least one
+ * of the TIDs we want to remove.
+ */
+ num_to_remove = 0;
+ while (nexttid < opaque->zs_hikey)
+ {
+ if (num_to_remove == allocated_size)
+ {
+ tids_to_remove = repalloc(tids_to_remove, (allocated_size * 2) * sizeof(zstid));
+ allocated_size *= 2;
+ }
+ tids_to_remove[num_to_remove++] = nexttid;
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+ }
+
+ /* Remove the data for those TIDs, and rewrite the page */
+ lowerstream = get_page_lowerstream(page);
+ upperstream = get_page_upperstream(page);
+
+ if (lowerstream || upperstream)
+ {
+ attstream_buffer upperbuf;
+ attstream_buffer lowerbuf;
+ attstream_buffer *newbuf;
+ zsbt_attr_repack_context cxt;
+
+ upperbuf.len = 0;
+ upperbuf.cursor = 0;
+ lowerbuf.len = 0;
+ lowerbuf.cursor = 0;
+
+ if (upperstream)
+ vacuum_attstream(rel, attno, &upperbuf, upperstream,
+ tids_to_remove, num_to_remove);
+
+ if (lowerstream)
+ vacuum_attstream(rel, attno, &lowerbuf, lowerstream,
+ tids_to_remove, num_to_remove);
+
+ if (upperbuf.len - upperbuf.cursor > 0 &&
+ lowerbuf.len - lowerbuf.cursor > 0)
+ {
+ merge_attstream_buffer(attr, &upperbuf, &lowerbuf);
+ newbuf = &upperbuf;
+ }
+ else if (upperbuf.len - upperbuf.cursor > 0)
+ newbuf = &upperbuf;
+ else
+ newbuf = &lowerbuf;
+
+ /*
+ * Now we have a list of non-overlapping items, containing all the old and
+ * new data. The repack routines below take care of storing them on the
+ * page, splitting the page if needed.
+ */
+ zsbt_attr_repack_init(&cxt, attno, buf, false);
+ if (newbuf->len - newbuf->cursor > 0)
+ {
+ /*
+ * Then, store them on the page, creating new pages as needed.
+ */
+ zsbt_attr_pack_attstream(attr, newbuf, cxt.currpage);
+ while (newbuf->cursor < newbuf->len)
+ {
+ zsbt_attr_repack_newpage(&cxt, newbuf->firsttid);
+ zsbt_attr_pack_attstream(attr, newbuf, cxt.currpage);
+ }
+ }
+ zsbt_attr_repack_writeback_pages(&cxt, rel, attno, buf);
+ /* zsbt_attr_repack_writeback_pages() unlocked and released the buffer */
+ }
+ else
+ UnlockReleaseBuffer(buf);
+
+ /*
+ * We can now free the temporary memory. The attstream buffers built above
+ * point to memory allocated in tmpcontext, so we cannot reset it until
+ * after the pages have been written out.
+ */
+ MemoryContextReset(tmpcontext);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(tmpcontext);
+}
+
+/* ----------------------------------------------------------------
+ * Internal routines
+ * ----------------------------------------------------------------
+ */
+
+static ZSAttStream *
+get_page_lowerstream(Page page)
+{
+ int lowersize;
+ ZSAttStream *lowerstream;
+
+ /* Read old uncompressed (lower) attstream */
+ lowersize = ((PageHeader) page)->pd_lower - SizeOfPageHeaderData;
+ if (lowersize > SizeOfZSAttStreamHeader)
+ {
+ lowerstream = (ZSAttStream *) (((char *) page) + SizeOfPageHeaderData);
+ Assert((lowerstream)->t_size == lowersize);
+
+ /* by convention, lower stream is always uncompressed */
+ Assert((lowerstream->t_flags & ATTSTREAM_COMPRESSED) == 0);
+ }
+ else
+ {
+ Assert (lowersize == 0);
+ lowerstream = NULL;
+ }
+
+ return lowerstream;
+}
+
+static ZSAttStream *
+get_page_upperstream(Page page)
+{
+ int uppersize;
+ ZSAttStream *upperstream;
+
+ uppersize = ((PageHeader) page)->pd_special - ((PageHeader) page)->pd_upper;
+ if (uppersize > SizeOfZSAttStreamHeader)
+ {
+ upperstream = (ZSAttStream *) (((char *) page) + ((PageHeader) page)->pd_upper);
+ Assert(upperstream->t_size == uppersize);
+ /* by convention, upper stream is always compressed */
+ Assert((upperstream->t_flags & ATTSTREAM_COMPRESSED) != 0);
+ }
+ else
+ {
+ upperstream = NULL;
+ Assert(uppersize == 0);
+ }
+ return upperstream;
+}
+
+/*
+ * Add data to attribute leaf pages.
+ *
+ * 'attbuf' contains the new attribute data that is to be added to the page.
+ *
+ * This function writes as much data as is convenient; typically, as much
+ * as fits on a single page, after compression. Some data is always written.
+ * If you want to flush all data to disk, call zsbt_attr_add() repeatedly,
+ * until 'attbuf' is empty.
+ *
+ * 'attbuf' is updated, so that on exit, it contains the data that remains,
+ * i.e. data that was not yet written out.
+ *
+ * This function handles decompressing and recompressing items, and splitting
+ * the page, as needed.
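+ *
+ * For example (an illustrative sketch, not a quote of any existing caller),
+ * a caller that wants to flush the whole buffer could simply loop:
+ *
+ *     while (attbuf->len - attbuf->cursor > 0)
+ *         zsbt_attr_add(rel, attno, attbuf);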
+ */
+void
+zsbt_attr_add(Relation rel, AttrNumber attno, attstream_buffer *attbuf)
+{
+ Form_pg_attribute attr = &rel->rd_att->attrs[attno - 1];
+ Buffer origbuf;
+ Page origpage;
+ ZSBtreePageOpaque *origpageopaque;
+ ZSAttStream *lowerstream;
+ ZSAttStream *upperstream;
+ int lowerstreamsz;
+ uint16 orig_pd_lower;
+ uint16 new_pd_lower;
+ zstid firstnewtid;
+ zstid splittid;
+ zsbt_attr_repack_context cxt;
+ bool split = false;
+
+ Assert (attbuf->len - attbuf->cursor > 0);
+
+ /*
+ * Find the right place to insert the new data.
+ */
+ origbuf = zsbt_descend(rel, attno, attbuf->firsttid, 0, false);
+ origpage = BufferGetPage(origbuf);
+ origpageopaque = ZSBtreePageGetOpaque(origpage);
+ splittid = origpageopaque->zs_hikey - 1;
+
+ Assert (attbuf->firsttid <= splittid);
+
+ lowerstream = get_page_lowerstream(origpage);
+ upperstream = get_page_upperstream(origpage);
+
+ /* Is there space to add the new attstream as it is? */
+ orig_pd_lower = ((PageHeader) origpage)->pd_lower;
+
+ if (lowerstream == NULL)
+ {
+ /*
+ * No existing uncompressed data on page, see if the new data can fit
+ * into the uncompressed area.
+ */
+ Assert(orig_pd_lower == SizeOfPageHeaderData);
+
+ if (SizeOfZSAttStreamHeader + (attbuf->len - attbuf->cursor) <= PageGetExactFreeSpace(origpage))
+ {
+ ZSAttStream newhdr;
+ attstream_buffer newattbuf;
+
+ newhdr.t_flags = 0;
+ newhdr.t_decompressed_size = 0;
+ newhdr.t_decompressed_bufsize = 0;
+
+ if (attbuf->lasttid > splittid)
+ {
+ /*
+ * We must not put items with TIDs at or beyond the hikey of the
+ * target leaf page onto it. So if our attbuf does contain such
+ * items, we split the attbuf into two buffers at TID hikey - 1
+ * (the splittid), so that we only insert the TIDs that fall within
+ * the page's range.
+ */
+ split_attstream_buffer(attbuf, &newattbuf, splittid);
+ split = true;
+ }
+
+ newhdr.t_size = SizeOfZSAttStreamHeader + (attbuf->len - attbuf->cursor);
+ newhdr.t_lasttid = attbuf->lasttid;
+ new_pd_lower = SizeOfPageHeaderData + newhdr.t_size;
+
+ START_CRIT_SECTION();
+
+ memcpy(origpage + SizeOfPageHeaderData, &newhdr, SizeOfZSAttStreamHeader);
+ memcpy(origpage + SizeOfPageHeaderData + SizeOfZSAttStreamHeader,
+ attbuf->data + attbuf->cursor, attbuf->len - attbuf->cursor);
+ ((PageHeader) origpage)->pd_lower = new_pd_lower;
+
+ MarkBufferDirty(origbuf);
+
+ if (RelationNeedsWAL(rel))
+ wal_log_attstream_change(rel, origbuf,
+ (ZSAttStream *) (origpage + SizeOfPageHeaderData), false,
+ orig_pd_lower, new_pd_lower);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(origbuf);
+ if (split)
+ {
+ /*
+ * Make attbuf represent the chunks that were on the right hand
+ * side of the split. These are the chunks that are left over.
+ */
+ pfree(attbuf->data);
+ memcpy(attbuf, &newattbuf, sizeof(attstream_buffer));
+ }
+ else
+ attbuf->cursor = attbuf->len;
+ return;
+ }
+ }
+ else
+ {
+ /*
+ * Try to append the new data to the existing uncompressed data first
+ */
+ START_CRIT_SECTION();
+
+ if (attbuf->lasttid <= splittid &&
+ append_attstream_inplace(attr, lowerstream,
+ PageGetExactFreeSpace(origpage),
+ attbuf))
+ {
+ new_pd_lower = SizeOfPageHeaderData + lowerstream->t_size;
+
+ /* fast path succeeded */
+ MarkBufferDirty(origbuf);
+
+ /*
+ * NOTE: in theory, if append_attstream_inplace() were smarter, it might
+ * modify the existing data. The new combined stream might even be smaller
+ * than the old stream, if the last codewords are packed more tightly.
+ * But at the moment, append_attstream_inplace() doesn't do anything
+ * that smart. So we assume that the existing data didn't change, and we
+ * only need to WAL log the new data at the end of the stream.
+ */
+ ((PageHeader) origpage)->pd_lower = new_pd_lower;
+
+ if (RelationNeedsWAL(rel))
+ wal_log_attstream_change(rel, origbuf, lowerstream, false,
+ orig_pd_lower, new_pd_lower);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(origbuf);
+ return;
+ }
+
+ END_CRIT_SECTION();
+ }
+
+ /*
+ * If the orig page contains already-compressed data, and it is almost full,
+ * leave the old data untouched and create a new page. This avoids repeatedly
+ * recompressing pages when inserting rows one by one. Somewhat arbitrarily,
+ * we put the threshold at 2.5%.
+ *
+ * TODO: we skip allocating a new page here if attbuf->lasttid > splittid,
+ * because we don't know how to handle that case without calling merge_attstream().
+ */
+ firstnewtid = attbuf->firsttid;
+ lowerstreamsz = lowerstream ? lowerstream->t_size : 0;
+ if (attbuf->lasttid <= splittid &&
+ PageGetExactFreeSpace(origpage) + lowerstreamsz < (int) (BLCKSZ * 0.025) &&
+ (lowerstream == NULL || firstnewtid > lowerstream->t_lasttid) &&
+ upperstream && firstnewtid > upperstream->t_lasttid)
+ {
+ /*
+ * Keep the original page unmodified, and allocate a new page
+ * for the new data.
+ */
+ zsbt_attr_repack_init(&cxt, attno, origbuf, true);
+ zsbt_attr_repack_newpage(&cxt, attbuf->firsttid);
+
+ /* write out the new data (or part of it) */
+ zsbt_attr_pack_attstream(attr, attbuf, cxt.currpage);
+ }
+ else
+ {
+ /*
+ * Rewrite existing data on the page, and add as much of the
+ * new data as fits. But make sure that we write at least one
+ * chunk of new data, otherwise we might get stuck in a loop
+ * without making any progress.
+ */
+ zstid mintid = attbuf->firsttid;
+ attstream_buffer rightattbuf;
+
+#if 0
+ if (upperstream && lowerstream)
+ elog(NOTICE, "merging upper %d lower %d new %d", upperstream->t_decompressed_size, lowerstream->t_size, attbuf->len - attbuf->cursor);
+ else if (upperstream)
+ elog(NOTICE, "merging upper %d new %d", upperstream->t_decompressed_size, attbuf->len - attbuf->cursor);
+ else if (lowerstream)
+ elog(NOTICE, "merging lower %d new %d", lowerstream->t_size, attbuf->len - attbuf->cursor);
+ else
+ elog(NOTICE, "merging new %d", attbuf->len - attbuf->cursor);
+#endif
+
+ /* merge the old items to the working buffer */
+ if (upperstream && lowerstream)
+ {
+ attstream_buffer tmpbuf;
+
+ mintid = Max(mintid, lowerstream->t_lasttid);
+ mintid = Max(mintid, upperstream->t_lasttid);
+
+ init_attstream_buffer_from_stream(&tmpbuf, attr->attbyval,
+ attr->attlen, upperstream, GetMemoryChunkContext(attbuf->data));
+
+ merge_attstream(attr, attbuf, lowerstream);
+
+ merge_attstream_buffer(attr, &tmpbuf, attbuf);
+
+ pfree(attbuf->data);
+ *attbuf = tmpbuf;
+ }
+ else if (lowerstream)
+ {
+ mintid = Max(mintid, lowerstream->t_lasttid);
+ merge_attstream(attr, attbuf, lowerstream);
+ }
+ else if (upperstream)
+ {
+ mintid = Max(mintid, upperstream->t_lasttid);
+ merge_attstream(attr, attbuf, upperstream);
+ }
+
+ /*
+ * Now we have a list of non-overlapping items, containing all the old and
+ * new data. Write it out, making sure that at least all the old data is
+ * written out (otherwise, we'd momentarily remove existing data!)
+ */
+ zsbt_attr_repack_init(&cxt, attno, origbuf, false);
+
+ if (attbuf->lasttid > splittid)
+ {
+ /*
+ * We must not put items with TIDs at or beyond the hikey of the
+ * target leaf page onto it. So if our attbuf does contain such
+ * items, we split the attbuf into two buffers at TID hikey - 1
+ * (the splittid), so that we only insert the TIDs that fall within
+ * the page's range.
+ */
+ split_attstream_buffer(attbuf, &rightattbuf, splittid);
+ split = true;
+ }
+
+ zsbt_attr_pack_attstream(attr, attbuf, cxt.currpage);
+
+ while (attbuf->cursor < attbuf->len && (split || attbuf->firsttid <= mintid))
+ {
+ zsbt_attr_repack_newpage(&cxt, attbuf->firsttid);
+ zsbt_attr_pack_attstream(attr, attbuf, cxt.currpage);
+ }
+
+ if (split)
+ {
+ /*
+ * Make attbuf represent the chunks that were on the right hand
+ * side of the split. These are the chunks that are left over.
+ */
+ Assert(attbuf->cursor == attbuf->len);
+ pfree(attbuf->data);
+ memcpy(attbuf, &rightattbuf, sizeof(attstream_buffer));
+ }
+ }
+ zsbt_attr_repack_writeback_pages(&cxt, rel, attno, origbuf);
+}
+
+/*
+ * Repacker routines
+ *
+ * Usage:
+ *
+ * 1. Call zsbt_attr_repack_init() to start a repacking operation.
+ * 2. Call zsbt_attr_pack_attstream() to compress and chop a page-sized slice
+ * of incoming data, and store it on the current page copy.
+ * 3. Call zsbt_attr_repack_newpage() to allocate a new page, if you want
+ * to compress and write more data, and goto 2.
+ * 4. Call zsbt_attr_repack_writeback_pages() to finish the repacking
+ * operation, making all on-disk changes.
+ *
+ * Steps 1-3 happen on in-memory page copies. No data on disk is
+ * modified until step 4.
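+ *
+ * A rough sketch of the pattern (illustrative only; it mirrors how
+ * zsbt_attr_add() and zsbt_attr_remove() drive these routines):
+ *
+ *     zsbt_attr_repack_context cxt;
+ *
+ *     zsbt_attr_repack_init(&cxt, attno, origbuf, false);
+ *     zsbt_attr_pack_attstream(attr, attbuf, cxt.currpage);
+ *     while (attbuf->cursor < attbuf->len)
+ *     {
+ *         zsbt_attr_repack_newpage(&cxt, attbuf->firsttid);
+ *         zsbt_attr_pack_attstream(attr, attbuf, cxt.currpage);
+ *     }
+ *     zsbt_attr_repack_writeback_pages(&cxt, rel, attno, origbuf);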
+ */
+static void
+zsbt_attr_repack_init(zsbt_attr_repack_context *cxt, AttrNumber attno, Buffer origbuf, bool append)
+{
+ Page origpage;
+ ZSBtreePageOpaque *origopaque;
+ Page newpage;
+ ZSBtreePageOpaque *newopaque;
+ zs_split_stack *stack;
+
+ origpage = BufferGetPage(origbuf);
+ origopaque = ZSBtreePageGetOpaque(origpage);
+
+ cxt->stack_head = cxt->stack_tail = NULL;
+ cxt->attno = attno;
+ cxt->hikey = origopaque->zs_hikey;
+ cxt->nextblkno = origopaque->zs_next;
+
+ newpage = (Page) palloc(BLCKSZ);
+ if (append)
+ memcpy(newpage, origpage, BLCKSZ);
+ else
+ PageInit(newpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ cxt->currpage = newpage;
+
+ stack = zs_new_split_stack_entry(InvalidBuffer, /* will be assigned later */
+ newpage);
+ if (append)
+ stack->special_only = true;
+ cxt->stack_head = stack;
+ cxt->stack_tail = stack;
+
+ newopaque = ZSBtreePageGetOpaque(newpage);
+ newopaque->zs_attno = cxt->attno;
+ newopaque->zs_next = InvalidBlockNumber; /* filled in later */
+ newopaque->zs_lokey = origopaque->zs_lokey;
+ newopaque->zs_hikey = cxt->hikey; /* overwritten later, if this is not last page */
+ newopaque->zs_level = 0;
+ newopaque->zs_flags = origopaque->zs_flags & ZSBT_ROOT;
+ newopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+}
+
+static void
+zsbt_attr_repack_newpage(zsbt_attr_repack_context *cxt, zstid nexttid)
+{
+ Page newpage;
+ ZSBtreePageOpaque *newopaque;
+ zs_split_stack *stack;
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(cxt->currpage);
+
+ /* set the last tid on previous page */
+ oldopaque->zs_hikey = nexttid;
+
+ newpage = (Page) palloc(BLCKSZ);
+ PageInit(newpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+
+ stack = zs_new_split_stack_entry(InvalidBuffer, /* will be assigned later */
+ newpage);
+ cxt->stack_tail->next = stack;
+ cxt->stack_tail = stack;
+
+ cxt->currpage = newpage;
+
+ newopaque = ZSBtreePageGetOpaque(newpage);
+ newopaque->zs_attno = cxt->attno;
+ newopaque->zs_next = InvalidBlockNumber; /* filled in later */
+ newopaque->zs_lokey = nexttid;
+ newopaque->zs_hikey = cxt->hikey; /* overwritten later, if this is not last page */
+ newopaque->zs_level = 0;
+ newopaque->zs_flags = 0;
+ newopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+}
+
+/*
+ * Compress and write as much of the data from 'attbuf' onto 'page' as fits.
+ * 'attbuf' is updated in place, so that on exit, it contains the remaining chunks
+ * that did not fit on 'page'.
+ */
+static void
+zsbt_attr_pack_attstream(Form_pg_attribute attr, attstream_buffer *attbuf,
+ Page page)
+{
+ Size freespc;
+ int orig_bytes;
+ char *pstart;
+ char *pend;
+ char *dst;
+ int complete_chunks_len;
+ zstid lasttid = 0;
+ int srcSize;
+ int compressed_size;
+ ZSAttStream *hdr;
+ char compressbuf[BLCKSZ];
+
+ /* this should only be called on an empty page */
+ Assert(((PageHeader) page)->pd_lower == SizeOfPageHeaderData);
+ freespc = PageGetExactFreeSpace(page);
+
+ pstart = &attbuf->data[attbuf->cursor];
+ pend = &attbuf->data[attbuf->len];
+ orig_bytes = pend - pstart;
+
+ freespc -= SizeOfZSAttStreamHeader;
+
+ /*
+ * Try compressing.
+ *
+ * Note: we try compressing, even if the data fits uncompressed. That might seem
+ * like a waste of time, but compression is very cheap, and this leaves more free
+ * space on the page for new additions.
+ */
+ srcSize = orig_bytes;
+ compressed_size = zs_compress_destSize(pstart, compressbuf, &srcSize, freespc);
+ if (compressed_size > 0)
+ {
+ /* store compressed, in upper stream */
+ int bytes_compressed = srcSize;
+
+ /*
+ * How many complete chunks did we compress?
+ */
+ if (bytes_compressed == orig_bytes)
+ {
+ complete_chunks_len = orig_bytes;
+ lasttid = attbuf->lasttid;
+ }
+ else
+ complete_chunks_len =
+ find_chunk_for_offset(attbuf, bytes_compressed, &lasttid);
+
+ if (complete_chunks_len == 0)
+ elog(ERROR, "could not fit any chunks on page");
+
+ dst = (char *) page + ((PageHeader) page)->pd_special - (SizeOfZSAttStreamHeader + compressed_size);
+ hdr = (ZSAttStream *) dst;
+ hdr->t_size = SizeOfZSAttStreamHeader + compressed_size;
+ hdr->t_flags = ATTSTREAM_COMPRESSED;
+ hdr->t_decompressed_size = complete_chunks_len;
+ hdr->t_decompressed_bufsize = bytes_compressed;
+ hdr->t_lasttid = lasttid;
+
+ dst = hdr->t_payload;
+ memcpy(dst, compressbuf, compressed_size);
+ ((PageHeader) page)->pd_upper -= hdr->t_size;
+ }
+ else
+ {
+ /* Store uncompressed, in lower stream. */
+
+ /*
+ * How many complete chunks can we fit?
+ */
+ if (orig_bytes < freespc)
+ {
+ complete_chunks_len = orig_bytes;
+ lasttid = attbuf->lasttid;
+ }
+ else
+ complete_chunks_len =
+ find_chunk_for_offset(attbuf, freespc, &lasttid);
+
+ if (complete_chunks_len == 0)
+ elog(ERROR, "could not fit any chunks on page");
+
+ hdr = (ZSAttStream *) ((char *) page + SizeOfPageHeaderData);
+
+ hdr->t_size = SizeOfZSAttStreamHeader + complete_chunks_len;
+ hdr->t_flags = 0;
+ hdr->t_decompressed_size = 0;
+ hdr->t_decompressed_bufsize = 0;
+ hdr->t_lasttid = lasttid;
+
+ dst = hdr->t_payload;
+ memcpy(dst, pstart, complete_chunks_len);
+ ((PageHeader) page)->pd_lower += hdr->t_size;
+ }
+
+ /*
+ * Chop off the part of the chunk stream in 'attbuf' that we wrote out.
+ */
+ trim_attstream_upto_offset(attbuf, complete_chunks_len, lasttid);
+}
+
+static void
+zsbt_attr_repack_writeback_pages(zsbt_attr_repack_context *cxt,
+ Relation rel, AttrNumber attno,
+ Buffer oldbuf)
+{
+ Page oldpage = BufferGetPage(oldbuf);
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(oldpage);
+ zs_split_stack *stack;
+ List *downlinks = NIL;
+
+ /*
+ * Ok, we now have a list of pages to replace the original page with, as
+ * private in-memory copies. Allocate buffers for them, and write them out.
+ *
+ * Allocate all the pages before entering the critical section, so that
+ * running out of disk space doesn't lead to a PANIC.
+ */
+ stack = cxt->stack_head;
+ Assert(stack->buf == InvalidBuffer);
+ stack->buf = oldbuf;
+ while (stack->next)
+ {
+ Page thispage = stack->page;
+ ZSBtreePageOpaque *thisopaque = ZSBtreePageGetOpaque(thispage);
+ ZSBtreeInternalPageItem *downlink;
+ Buffer nextbuf;
+
+ Assert(stack->next->buf == InvalidBuffer);
+
+ nextbuf = zspage_getnewbuf(rel, attno);
+ stack->next->buf = nextbuf;
+
+ thisopaque->zs_next = BufferGetBlockNumber(nextbuf);
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = thisopaque->zs_hikey;
+ downlink->childblk = BufferGetBlockNumber(nextbuf);
+ downlinks = lappend(downlinks, downlink);
+
+ stack = stack->next;
+ }
+ /* last one in the chain */
+ ZSBtreePageGetOpaque(stack->page)->zs_next = cxt->nextblkno;
+
+ /* If we had to split, insert downlinks for the new pages. */
+ if (cxt->stack_head->next)
+ {
+ oldopaque = ZSBtreePageGetOpaque(cxt->stack_head->page);
+
+ if ((oldopaque->zs_flags & ZSBT_ROOT) != 0)
+ {
+ ZSBtreeInternalPageItem *downlink;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = MinZSTid;
+ downlink->childblk = BufferGetBlockNumber(cxt->stack_head->buf);
+ downlinks = lcons(downlink, downlinks);
+
+ cxt->stack_tail->next = zsbt_newroot(rel, attno, oldopaque->zs_level + 1, downlinks);
+
+ /* clear the ZSBT_ROOT flag on the old root page */
+ oldopaque->zs_flags &= ~ZSBT_ROOT;
+ }
+ else
+ {
+ cxt->stack_tail->next = zsbt_insert_downlinks(rel, attno,
+ oldopaque->zs_lokey,
+ BufferGetBlockNumber(oldbuf),
+ oldopaque->zs_level + 1,
+ downlinks);
+ }
+ /* note: stack_tail is not the real tail anymore */
+ }
+
+ /* Finally, overwrite all the pages we had to modify */
+ zs_apply_split_changes(rel, cxt->stack_head, NULL, attno);
+}
+
+static void
+wal_log_attstream_change(Relation rel, Buffer buf, ZSAttStream *attstream,
+ bool is_upper, uint16 begin_offset, uint16 end_offset)
+{
+ /*
+ * log only the modified portion.
+ */
+ Page page = BufferGetPage(buf);
+ XLogRecPtr recptr;
+ wal_zedstore_attstream_change xlrec;
+#ifdef USE_ASSERT_CHECKING
+ uint16 pd_lower = ((PageHeader) page)->pd_lower;
+ uint16 pd_upper = ((PageHeader) page)->pd_upper;
+ uint16 pd_special = ((PageHeader) page)->pd_special;
+#endif
+
+ Assert(begin_offset < end_offset);
+
+ memset(&xlrec, 0, sizeof(xlrec)); /* clear padding */
+ xlrec.is_upper = is_upper;
+
+ xlrec.new_attstream_size = attstream->t_size;
+ xlrec.new_decompressed_size = attstream->t_decompressed_size;
+ xlrec.new_decompressed_bufsize = attstream->t_decompressed_bufsize;
+ xlrec.new_lasttid = attstream->t_lasttid;
+
+ xlrec.begin_offset = begin_offset;
+ xlrec.end_offset = end_offset;
+
+ if (is_upper)
+ Assert(begin_offset >= pd_upper && end_offset <= pd_special);
+ else
+ Assert(begin_offset >= SizeOfPageHeaderData && end_offset <= pd_lower);
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalAttstreamChange);
+ XLogRegisterBufData(0, (char *) page + begin_offset, end_offset - begin_offset);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_ATTSTREAM_CHANGE);
+
+ PageSetLSN(page, recptr);
+}
+
+void
+zsbt_attstream_change_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_attstream_change *xlrec =
+ (wal_zedstore_attstream_change *) XLogRecGetData(record);
+ Buffer buffer;
+
+ if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+ {
+ Page page = (Page) BufferGetPage(buffer);
+ Size datasz;
+ char *data = XLogRecGetBlockData(record, 0, &datasz);
+ ZSAttStream *attstream;
+
+ Assert(datasz == xlrec->end_offset - xlrec->begin_offset);
+
+ if (xlrec->is_upper)
+ {
+ /*
+ * In the upper stream, if the size changes, the old data is moved
+ * to begin at pd_upper, and then the new data is applied.
+ *
+ * XXX: we could be much smarter about this, and not move data that
+ * we will overwrite on the next line.
+ */
+ uint16 pd_special = ((PageHeader) page)->pd_special;
+ uint16 new_pd_upper = pd_special - xlrec->new_attstream_size;
+ uint16 old_pd_upper = ((PageHeader) page)->pd_upper;
+ uint16 old_size = pd_special - old_pd_upper;
+ uint16 new_size = pd_special - new_pd_upper;
+
+ memmove(page + new_pd_upper, page + old_pd_upper, Min(old_size, new_size));
+
+ ((PageHeader) page)->pd_upper = new_pd_upper;
+ }
+ else
+ {
+ uint16 new_pd_lower = SizeOfPageHeaderData + xlrec->new_attstream_size;
+
+ ((PageHeader) page)->pd_lower = new_pd_lower;
+ }
+
+ memcpy(page + xlrec->begin_offset, data, datasz);
+
+ /*
+ * Finally, adjust the size in the attstream header to match.
+ * (if the replacement data in the WAL record covered the attstream
+ * header, this is unnecessary but harmless.)
+ */
+ attstream = (ZSAttStream *) (
+ xlrec->is_upper ? (page + ((PageHeader) page)->pd_upper) :
+ (page + SizeOfPageHeaderData));
+ attstream->t_size = xlrec->new_attstream_size;
+ attstream->t_decompressed_size = xlrec->new_decompressed_size;
+ attstream->t_decompressed_bufsize = xlrec->new_decompressed_bufsize;
+ attstream->t_lasttid = xlrec->new_lasttid;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+}
diff --git a/src/backend/access/zedstore/zedstore_attstream.c b/src/backend/access/zedstore/zedstore_attstream.c
new file mode 100644
index 0000000000..a25946c677
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_attstream.c
@@ -0,0 +1,2523 @@
+/*
+ * zedstore_attstream.c
+ * Routines for packing datums into "attribute streams", to be stored
+ * on attribute tree leaf pages.
+ *
+ * An attribute stream consists of "chunks", where one chunk contains
+ * the TIDs and datums of 1-60 rows, packed in a compact form.
+ * Each chunk begins with a 64-bit codeword, which encodes the TIDs
+ * in the chunk. The TIDs are delta-encoded, so we store the difference of
+ * each TID to the previous TID in the stream, and the differences are
+ * packed in 64-bit codewords using a variant of Simple-8b encoding.
+ *
+ * For the first TID in a stream, the "previous" TID is thought to be 0,
+ * so the first TID in the stream's first chunk actually stores the
+ * absolute TID.
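+ *
+ * For example (illustrative numbers only): a chunk holding TIDs 1000,
+ * 1001, 1002 and 1010, in a stream whose previous chunk ended at TID 900,
+ * stores the deltas 100, 1, 1 and 8 in its codeword. Removing TID 1001
+ * later only requires re-encoding this one chunk (deltas 100, 2, 8);
+ * the chunks after it are unaffected, because the chunk still ends at
+ * TID 1010.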
+ *
+ * The encoding of TIDs in the codeword is a variant of the Simple-8b
+ * algorithm. 4 bits in each codeword determine a "mode", and the remaining
+ * 60 bits encode the TIDs in a format that depends on the mode. But we also
+ * use the codeword to encode the presence of NULLs, and in the case of
+ * variable-width attributes, the length of each datum in the chunk.
+ * Therefore, fixed- and variable-length attributes use different "modes".
+ *
+ * This chunked format has a few desirable properties:
+ *
+ * - It is compact for the common case of no or few gaps between TIDs.
+ * In the best case, one codeword can pack 60 consecutive TIDs in
+ * one 64-bit codeword. It also "degrades gracefully", as TIDs are
+ * removed, so deleting a few TIDs here and there doesn't usually make
+ * the overall stream larger.
+ *
+ * - It is relatively fast to encode and decode.
+ *
+ * - A long stream can be split easily. You can cut the stream at any chunk,
+ * having to re-encode only the first chunk after the split point. Also,
+ * each chunk is relatively small, which minimizes waste when a large
+ * stream needs to be chopped into page-sized pieces.
+ *
+ * - Two streams can easily be appended to each other, without having to
+ * re-encode the chunks (although it might not result in the most compact
+ * possible codewords.)
+ *
+ * Some drawbacks:
+ *
+ * - Random access is not possible. To access a particular TID, the stream
+ * must be read starting from the beginning.
+ *
+ *
+ * Most of the functions in this file deal with 'attstream_buffers'. An
+ * attstream_buffer is an in-memory representation of an attribute stream.
+ * It is a resizeable buffer, without the ZSAttStream header, but with enough
+ * information in the attstream_buffer struct to construct the ZSAttStream
+ * header when needed.
+ *
+ * Another in-memory representation is 'attstream_decoder'. It holds state
+ * when reading an attribute stream.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_attstream.c
+ */
+#include "postgres.h"
+
+#include "access/detoast.h"
+#include "access/toast_internals.h"
+#include "access/zedstore_internal.h"
+#include "miscadmin.h"
+#include "utils/datum.h"
+#include "utils/memutils.h"
+
+#define TARGET_CHUNK_SIZE 128
+
+/*
+ * Internal functions that operate on a single chunk.
+ */
+static bool replace_first_tid_in_place(int attlen, zstid newtid, char *chunk);
+static int skip_chunk(int attlen, char *chunk, zstid *lasttid);
+static int get_chunk_length(int attlen, char *chunk);
+static zstid get_chunk_first_tid(int attlen, char *chunk);
+static int decode_chunk(bool attbyval, int attlen, zstid *lasttid, char *chunk,
+ int *num_elems, zstid *tids, Datum *datums, bool *isnulls);
+static int encode_chunk(attstream_buffer *dst, zstid prevtid, int ntids,
+ zstid *tids, Datum *datums, bool *isnulls);
+static int chunk_num_elements(char *chunk, int attlen);
+
+/* Other internal functions. */
+static void merge_attstream_guts(Form_pg_attribute attr, attstream_buffer *buffer, char *chunks2, int chunks2len, zstid lasttid2);
+#ifdef USE_ASSERT_CHECKING
+static void verify_attstream(attstream_buffer *buffer);
+#endif
+
+
+/* ----------------------------------------------------------------------------
+ * Decoder routines
+ *
+ * To read an attribute stream, initialize a "decoder" by calling
+ * init_attstream_decoder(). Then call decode_attstream_begin()
+ * to load the decoder with data. Read the data by calling
+ * decode_attstream_cont() repeatedly, until it returns false. Each
+ * call to decode_attstream_cont() fills the arrays in the decoder
+ * struct with the TIDs, Datums and isnull flags of the next chunks.
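+ *
+ * A rough usage sketch (illustrative only; 'attr' and 'stream' stand for
+ * a pg_attribute entry and a ZSAttStream obtained elsewhere, and process()
+ * for whatever the caller does with each element):
+ *
+ *     attstream_decoder decoder;
+ *
+ *     init_attstream_decoder(&decoder, attr->attbyval, attr->attlen);
+ *     decode_attstream_begin(&decoder, stream);
+ *     while (decode_attstream_cont(&decoder))
+ *     {
+ *         for (int i = 0; i < decoder.num_elements; i++)
+ *             process(decoder.tids[i], decoder.datums[i], decoder.isnulls[i]);
+ *     }
+ *     destroy_attstream_decoder(&decoder);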
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Initialize a decoder.
+ */
+void
+init_attstream_decoder(attstream_decoder *decoder, bool attbyval, int16 attlen)
+{
+ decoder->cxt = CurrentMemoryContext;
+ decoder->tmpcxt = NULL; /* can be set by caller */
+
+ decoder->attbyval = attbyval;
+ decoder->attlen = attlen;
+
+ decoder->chunks_buf = NULL;
+ decoder->chunks_buf_size = 0;
+ decoder->chunks_len = 0;
+ decoder->lasttid = InvalidZSTid;
+
+ decoder->pos = 0;
+ decoder->prevtid = InvalidZSTid;
+
+ decoder->num_elements = 0;
+}
+
+void
+destroy_attstream_decoder(attstream_decoder *decoder)
+{
+ if (decoder->chunks_buf)
+ pfree(decoder->chunks_buf);
+ decoder->chunks_buf = NULL;
+ decoder->chunks_buf_size = 0;
+ decoder->chunks_len = 0;
+ decoder->num_elements = 0;
+}
+
+/*
+ * Begin reading an attribute stream.
+ */
+void
+decode_attstream_begin(attstream_decoder *decoder, ZSAttStream *attstream)
+{
+ int buf_size_needed;
+
+ if ((attstream->t_flags & ATTSTREAM_COMPRESSED) != 0)
+ buf_size_needed = attstream->t_decompressed_bufsize;
+ else
+ buf_size_needed = attstream->t_size - SizeOfZSAttStreamHeader;
+
+ if (decoder->chunks_buf_size < buf_size_needed)
+ {
+ if (decoder->chunks_buf)
+ pfree(decoder->chunks_buf);
+
+ decoder->chunks_buf = MemoryContextAlloc(decoder->cxt, buf_size_needed);
+ decoder->chunks_buf_size = buf_size_needed;
+ }
+
+ if ((attstream->t_flags & ATTSTREAM_COMPRESSED) != 0)
+ {
+ /* decompress */
+ zs_decompress(attstream->t_payload, decoder->chunks_buf,
+ attstream->t_size - SizeOfZSAttStreamHeader,
+ attstream->t_decompressed_bufsize);
+ decoder->chunks_len = attstream->t_decompressed_size;
+ decoder->compression_ratio = ((float8) buf_size_needed) / attstream->t_size;
+ }
+ else
+ {
+ memcpy(decoder->chunks_buf,
+ ((char *) attstream) + SizeOfZSAttStreamHeader,
+ attstream->t_size - SizeOfZSAttStreamHeader);
+ decoder->chunks_len = attstream->t_size - SizeOfZSAttStreamHeader;
+ decoder->compression_ratio = 1.0;
+ }
+ decoder->firsttid = get_chunk_first_tid(decoder->attlen, decoder->chunks_buf);
+ decoder->lasttid = attstream->t_lasttid;
+
+ decoder->pos = 0;
+ decoder->prevtid = 0;
+
+ decoder->num_elements = 0;
+ decoder->avg_elements_size = 0;
+}
+
+/*
+ * internal routine like decode_attstream_begin(), for reading chunks without the
+ * ZSAttStream header.
+ */
+static void
+decode_chunks_begin(attstream_decoder *decoder, char *chunks, int chunkslen, zstid lasttid)
+{
+ if (decoder->chunks_buf_size < chunkslen)
+ {
+ if (decoder->chunks_buf)
+ pfree(decoder->chunks_buf);
+
+ decoder->chunks_buf = MemoryContextAlloc(decoder->cxt, chunkslen);
+ decoder->chunks_buf_size = chunkslen;
+ }
+
+ /* XXX: do we really need to make a copy? */
+ memcpy(decoder->chunks_buf, chunks, chunkslen);
+ decoder->chunks_len = chunkslen;
+ decoder->lasttid = lasttid;
+
+ decoder->pos = 0;
+ decoder->prevtid = 0;
+
+ decoder->num_elements = 0;
+}
+
+/*
+ * Decode the next chunks in an attribute stream.
+ *
+ * The TIDs, Datums and isnull flags in 'decoder' are filled in with
+ * data from the next chunks. Returns true if there was more data,
+ * false if the end of the stream was reached.
+ *
+ * TODO: avoid extracting elements we're not interested in, by passing
+ * starttid/endtid. Or provide a separate "fast forward" function.
+ */
+bool
+decode_attstream_cont(attstream_decoder *decoder)
+{
+ zstid lasttid;
+ int total_decoded;
+ char *p;
+ char *lastp;
+ char *pend;
+ MemoryContext oldcxt;
+
+ oldcxt = CurrentMemoryContext;
+ if (decoder->tmpcxt)
+ {
+ MemoryContextReset(decoder->tmpcxt);
+ MemoryContextSwitchTo(decoder->tmpcxt);
+ }
+
+ lastp = p = decoder->chunks_buf + decoder->pos;
+ pend = decoder->chunks_buf + decoder->chunks_len;
+
+ total_decoded = 0;
+ lasttid = decoder->prevtid;
+
+ if (p >= pend)
+ return false;
+
+ while (p < pend && total_decoded + 60 < DECODER_MAX_ELEMS)
+ {
+ int num_decoded;
+
+ p += decode_chunk(decoder->attbyval, decoder->attlen, &lasttid, p,
+ &num_decoded,
+ &decoder->tids[total_decoded],
+ &decoder->datums[total_decoded],
+ &decoder->isnulls[total_decoded]);
+ total_decoded += num_decoded;
+ }
+
+ MemoryContextSwitchTo(oldcxt);
+
+ Assert(p <= pend);
+ decoder->num_elements = total_decoded;
+ decoder->avg_elements_size = ((p - lastp) / total_decoded) / decoder->compression_ratio;
+ decoder->pos = p - decoder->chunks_buf;
+ if (total_decoded > 0)
+ {
+ decoder->prevtid = decoder->tids[total_decoded - 1];
+ return true;
+ }
+ else
+ return false;
+}
+
+bool
+get_attstream_chunk_cont(attstream_decoder *decoder, zstid *prevtid, zstid *firsttid, zstid *lasttid, bytea **chunk)
+{
+ char *p;
+ char *pend;
+ MemoryContext oldcxt;
+ int len;
+ bytea *attr_data = NULL;
+
+ oldcxt = CurrentMemoryContext;
+ if(decoder->tmpcxt)
+ {
+ MemoryContextReset(decoder->tmpcxt);
+ MemoryContextSwitchTo(decoder->tmpcxt);
+ }
+
+ p = decoder->chunks_buf + decoder->pos;
+ pend = decoder->chunks_buf + decoder->chunks_len;
+ *prevtid = decoder->prevtid;
+ *firsttid = get_chunk_first_tid(decoder->attlen, p) + decoder->prevtid;
+
+ if (p < pend)
+ {
+ len = skip_chunk(decoder->attlen, p, &decoder->prevtid);
+
+ attr_data = (bytea *) palloc(len + VARHDRSZ);
+ SET_VARSIZE(attr_data, len + VARHDRSZ);
+ memcpy(VARDATA(attr_data), p, len);
+
+ decoder->num_elements = chunk_num_elements(p, decoder->attlen);
+
+ p += len;
+ }
+
+ MemoryContextSwitchTo(oldcxt);
+
+ Assert(p <= pend);
+ decoder->pos = p - decoder->chunks_buf;
+
+ if(attr_data != NULL) {
+ *lasttid = decoder->prevtid;
+ *chunk = attr_data;
+ return true;
+ }
+
+ return false;
+}
+
+/* ----------------------------------------------------------------------------
+ * Functions for constructing and manipulating attribute streams.
+ * ----------------------------------------------------------------------------
+ */
+
+static void
+enlarge_attstream_buffer_slow(attstream_buffer *buf, int needed)
+{
+ /* copied from StringInfo */
+ int newlen;
+
+ if (((Size) needed) >= (MaxAllocSize - (Size) buf->len))
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("out of memory"),
+ errdetail("Cannot enlarge attstream buffer containing %d bytes by %d more bytes.",
+ buf->len, needed)));
+
+ needed += buf->len; /* total space required now */
+
+ /* Because of the above test, we now have needed <= MaxAllocSize */
+
+ if (needed <= buf->maxlen)
+ return; /* got enough space already */
+
+ /*
+ * We don't want to allocate just a little more space with each append;
+ * for efficiency, double the buffer size each time it overflows.
+ * Actually, we might need to more than double it if 'needed' is big...
+ */
+ newlen = 2 * buf->maxlen;
+ while (needed > newlen)
+ newlen = 2 * newlen;
+
+ /*
+ * Clamp to MaxAllocSize in case we went past it. Note we are assuming
+ * here that MaxAllocSize <= INT_MAX/2, else the above loop could
+ * overflow. We will still have newlen >= needed.
+ */
+ if (newlen > (int) MaxAllocSize)
+ newlen = (int) MaxAllocSize;
+
+ buf->data = (char *) repalloc(buf->data, newlen);
+
+ buf->maxlen = newlen;
+}
+
+static inline void
+enlarge_attstream_buffer(attstream_buffer *buf, int needed)
+{
+ if (needed > buf->maxlen - buf->len)
+ enlarge_attstream_buffer_slow(buf, needed);
+}
+
+/*
+ * Pack given datums into an attstream.
+ */
+void
+create_attstream(attstream_buffer *dst, bool attbyval, int16 attlen,
+ int nelems, zstid *tids, Datum *datums, bool *isnulls)
+{
+ int num_encoded;
+ int elems_remain;
+ zstid prevtid;
+
+ Assert(nelems > 0);
+
+#define INIT_ATTSTREAM_BUF_SIZE 1024
+ dst->data = palloc(INIT_ATTSTREAM_BUF_SIZE);
+ dst->len = 0;
+ dst->maxlen = INIT_ATTSTREAM_BUF_SIZE;
+ dst->cursor = 0;
+ dst->attlen = attlen;
+ dst->attbyval = attbyval;
+
+ dst->firsttid = tids[0];
+ dst->lasttid = tids[nelems - 1];
+
+ prevtid = 0;
+ elems_remain = nelems;
+ while (elems_remain > 0)
+ {
+ num_encoded = encode_chunk(dst, prevtid, elems_remain, tids, datums, isnulls);
+ Assert(num_encoded > 0);
+ prevtid = tids[num_encoded - 1];
+ datums += num_encoded;
+ isnulls += num_encoded;
+ tids += num_encoded;
+ elems_remain -= num_encoded;
+ }
+}
+
+int
+append_attstream(attstream_buffer *buf, bool all, int nelems,
+ zstid *tids, Datum *datums, bool *isnulls)
+{
+ int num_encoded;
+ int elems_remain;
+ zstid prevtid;
+
+ /* Can we avoid enlarging the buffer by moving the existing data? */
+ if (buf->cursor > 128 * 1024 && buf->cursor > buf->len / 2)
+ {
+ memcpy(buf->data, buf->data + buf->cursor, buf->len - buf->cursor);
+ buf->len -= buf->cursor;
+ buf->cursor = 0;
+ }
+
+ Assert(nelems > 0);
+ Assert((tids[0] > buf->lasttid) || buf->cursor == buf->len);
+
+ if (buf->len - buf->cursor == 0)
+ {
+ buf->firsttid = tids[0];
+ prevtid = 0;
+ }
+ else
+ prevtid = buf->lasttid;
+ elems_remain = nelems;
+ while (elems_remain > (all ? 0 : 59))
+ {
+ num_encoded = encode_chunk(buf, prevtid, elems_remain, tids, datums, isnulls);
+ Assert(num_encoded > 0);
+ prevtid = tids[num_encoded - 1];
+ datums += num_encoded;
+ isnulls += num_encoded;
+ tids += num_encoded;
+ elems_remain -= num_encoded;
+ }
+
+ buf->lasttid = prevtid;
+
+ return nelems - elems_remain;
+}
+
+/*
+ * Split the attstream_buffer supplied in oldattbuf at splittid. The oldattbuf
+ * and the newattbuf will represent the left and right attstream_buffers
+ * resulting from the split, respectively.
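+ *
+ * For example (illustrative numbers only): splitting a buffer that covers
+ * TIDs 1-1000 at splittid 600 leaves TIDs 1-600 in 'oldattbuf' and
+ * TIDs 601-1000 in 'newattbuf'.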
+ */
+
+void
+split_attstream_buffer(attstream_buffer *oldattbuf, attstream_buffer *newattbuf, zstid splittid)
+{
+ zstid lasttid_prev_chunk;
+ zstid lasttid;
+
+ int split_chunk_pos;
+ int split_chunk_len;
+ char *split_chunk;
+ int split_chunk_num_elems = 0;
+
+ int left_chunk_num_elems = 0;
+ zstid left_chunk_tids[60];
+ Datum left_chunk_datums[60];
+ bool left_chunk_isnulls[60];
+
+ int right_chunk_num_elems = 0;
+ zstid right_chunk_tids[60];
+ Datum right_chunk_datums[60];
+ bool right_chunk_isnulls[60];
+
+ char *right_stream_second_chunk;
+ int right_stream_remaining_length;
+
+ MemoryContext attstream_memory_context = GetMemoryChunkContext(oldattbuf->data);
+ MemoryContext oldcontext;
+
+ Assert(oldattbuf->lasttid > splittid);
+
+ /*
+ * First, we find the beginning offset of the chunk that contains the
+ * splittid: the split_chunk and the lasttid of the chunk preceding the
+ * split_chunk.
+ */
+
+ split_chunk_pos =
+ find_chunk_containing_tid(oldattbuf, splittid, &lasttid_prev_chunk);
+ split_chunk = oldattbuf->data + oldattbuf->cursor + split_chunk_pos;
+ lasttid = lasttid_prev_chunk;
+
+ /*
+ * Then we decode the split_chunk so that we can extract the items that will
+ * belong to the left and right attstream_buffers respectively (called
+ * left_chunk and right_chunk). The left attstream_buffer will end with
+ * left_chunk and the right attstream_buffer will begin with right_chunk.
+ */
+
+ split_chunk_len = decode_chunk(oldattbuf->attbyval,
+ oldattbuf->attlen,
+ &lasttid,
+ split_chunk,
+ &split_chunk_num_elems,
+ left_chunk_tids,
+ left_chunk_datums,
+ left_chunk_isnulls);
+ left_chunk_num_elems = split_chunk_num_elems;
+ for (int i = 0; i < split_chunk_num_elems; i++)
+ {
+ if (left_chunk_tids[i] > splittid)
+ {
+ left_chunk_num_elems = i;
+ right_chunk_num_elems = split_chunk_num_elems - left_chunk_num_elems;
+
+ memmove(right_chunk_tids, &left_chunk_tids[i], sizeof(zstid) * right_chunk_num_elems);
+ memmove(right_chunk_datums, &left_chunk_datums[i], sizeof(Datum) * right_chunk_num_elems);
+ memmove(right_chunk_isnulls, &left_chunk_isnulls[i], sizeof(bool) * right_chunk_num_elems);
+ break;
+ }
+ }
+ right_chunk_num_elems = split_chunk_num_elems - left_chunk_num_elems;
+ oldcontext = MemoryContextSwitchTo(attstream_memory_context);
+ if (right_chunk_num_elems > 0)
+ right_stream_second_chunk = split_chunk + split_chunk_len;
+ else
+ {
+ Assert(right_chunk_num_elems == 0);
+ /*
+ * This indicates that the splittid lies on a chunk boundary. We need to
+ * treat the chunk starting at that boundary to be the right chunk. In
+ * order to do so, we need to decode that chunk.
+ */
+ right_stream_second_chunk = split_chunk + split_chunk_len +
+ decode_chunk(oldattbuf->attbyval,
+ oldattbuf->attlen,
+ &lasttid,
+ split_chunk + split_chunk_len,
+ &right_chunk_num_elems,
+ right_chunk_tids,
+ right_chunk_datums,
+ right_chunk_isnulls);
+ }
+
+ /*
+ * Initialize the right stream to begin with right_chunk. Then, append all
+ * of the chunks following the right_chunk. This will complete the right
+ * attstream_buffer.
+ */
+
+ create_attstream(newattbuf, oldattbuf->attbyval, oldattbuf->attlen,
+ right_chunk_num_elems,
+ right_chunk_tids,
+ right_chunk_datums,
+ right_chunk_isnulls);
+ MemoryContextSwitchTo(oldcontext);
+ right_stream_remaining_length = (oldattbuf->data + oldattbuf->len) - right_stream_second_chunk;
+ enlarge_attstream_buffer(newattbuf, right_stream_remaining_length);
+ memcpy(newattbuf->data + newattbuf->len, right_stream_second_chunk,
+ right_stream_remaining_length);
+ newattbuf->len += right_stream_remaining_length;
+ newattbuf->lasttid = oldattbuf->lasttid;
+
+ /*
+ * Truncate the left attstream_buffer beyond the start of the split_chunk.
+ * Then, append the left_chunk to the end of the left attstream_buffer.
+ * This will complete the left attstream_buffer.
+ */
+
+ oldattbuf->len = oldattbuf->cursor + split_chunk_pos;
+ oldattbuf->lasttid = lasttid_prev_chunk;
+ if (left_chunk_num_elems > 0)
+ append_attstream(oldattbuf, true, left_chunk_num_elems,
+ left_chunk_tids, left_chunk_datums, left_chunk_isnulls);
+
+#ifdef USE_ASSERT_CHECKING
+ verify_attstream(oldattbuf);
+ verify_attstream(newattbuf);
+#endif
+}
+
+/*
+ * Trim attstream buffer from cursor to 'chunk_pos'.
+ * The chunk beginning at 'chunk_pos' will be the new first chunk of the attstream
+ * buffer after this operation is completed. 'prev_lasttid' is the lasttid of
+ * the chunk that precedes the one that starts at 'chunk_pos', in order to
+ * calculate the new absolute first tid of the attstream buffer.
+ */
+void
+trim_attstream_upto_offset(attstream_buffer *buf, int chunk_pos, zstid prev_lasttid)
+{
+ /*
+ * 'first_chunk' represents the first chunk in the resultant attstream buffer.
+ */
+ char *first_chunk;
+ int first_chunk_len;
+ zstid first_chunk_tids[60];
+ Datum first_chunk_datums[60];
+ bool first_chunk_isnulls[60];
+ int first_chunk_num_elems;
+
+ attstream_buffer tmpbuf;
+ zstid newfirsttid;
+
+ /*
+ * Trim the attstream buffer from the beginning by advancing the cursor.
+ */
+ buf->cursor += chunk_pos;
+ Assert(buf->cursor <= buf->len);
+ if (buf->cursor >= buf->len)
+ {
+ Assert(buf->cursor == buf->len);
+ return;
+ }
+
+ /* FIXME: arbitrary limit. We need some space before the split point, in
+ * case we need to re-encode the first new chunk. Compute this correctly,
+ * and perhaps reallocate a bigger buffer if needed. ATM, though, this is
+ * only used to chop large attstreams to page-sized parts, so this never
+ * gets called with a very small 'chunk_pos'.
+ */
+ if (buf->cursor < 500)
+ elog(ERROR, "cannot split");
+
+ first_chunk = buf->data + buf->cursor;
+
+ newfirsttid = prev_lasttid + get_chunk_first_tid(buf->attlen, first_chunk);
+ if (!replace_first_tid_in_place(buf->attlen, newfirsttid, first_chunk))
+ {
+ /*
+ * We need to decode and re-encode the chunk in order to ensure that the
+ * firsttid of the attstream buffer is absolute.
+ */
+ first_chunk_len = decode_chunk(buf->attbyval, buf->attlen, &prev_lasttid,
+ first_chunk,
+ &first_chunk_num_elems,
+ first_chunk_tids,
+ first_chunk_datums,
+ first_chunk_isnulls);
+ create_attstream(&tmpbuf, buf->attbyval, buf->attlen,
+ first_chunk_num_elems,
+ first_chunk_tids,
+ first_chunk_datums,
+ first_chunk_isnulls);
+
+ /*
+ * Replace the chunk with the re-encoded version.
+ * First, shave off the existing chunk. Then prepend the re-encoded
+ * version of the chunk into the attstream buffer.
+ */
+ buf->cursor += first_chunk_len;
+ if (buf->cursor < tmpbuf.len - tmpbuf.cursor)
+ elog(ERROR, "not enough work space to split");
+ buf->cursor -= (tmpbuf.len - tmpbuf.cursor);
+ memcpy(&buf->data[buf->cursor],
+ tmpbuf.data + tmpbuf.cursor,
+ tmpbuf.len - tmpbuf.cursor);
+
+ pfree(tmpbuf.data);
+
+ }
+ buf->firsttid = newfirsttid;
+#ifdef USE_ASSERT_CHECKING
+ verify_attstream(buf);
+#endif
+}
+
+/*
+ * Find the beginning offset of chunk containing 'tid'.
+ * Also populate the lasttid of the chunk preceding the one we found in 'lasttid'.
+ * Returns -1 if there are no full chunks. (FIXME: no it doesn't currently)
+ */
+int
+find_chunk_containing_tid(attstream_buffer *attbuf, zstid tid, zstid *lasttid)
+{
+ zstid prev_lasttid;
+ char *prev_chunk;
+ char *chunk;
+
+ chunk = prev_chunk = &attbuf->data[attbuf->cursor];
+ prev_lasttid = *lasttid = 0;
+
+ while (*lasttid < tid)
+ {
+ prev_lasttid = *lasttid;
+ prev_chunk = chunk;
+
+ chunk += skip_chunk(attbuf->attlen, chunk, lasttid);
+ }
+
+ *lasttid = prev_lasttid;
+
+ /*
+ * prev_chunk now points to the beginning of the chunk that contains tid
+ */
+ return prev_chunk - &attbuf->data[attbuf->cursor];
+}
+
+/*
+ * Find the beginning offset of the first chunk that contains the 'offset'.
+ * Also, populate the lasttid of the chunk preceding the one found in 'lasttid'.
+ * Returns -1 if there are no full chunks. (FIXME: no it doesn't currently)
+ */
+int
+find_chunk_for_offset(attstream_buffer *attbuf, int offset, zstid *lasttid)
+{
+ char *chunks = attbuf->data + attbuf->cursor;
+ char *p = chunks;
+ char *pend = p + offset;
+
+ *lasttid = 0;
+ while (p + sizeof(uint64) <= pend)
+ {
+ int this_chunk_len;
+
+ this_chunk_len = get_chunk_length(attbuf->attlen, p);
+
+ if (p + this_chunk_len > pend)
+ break; /* this one is not complete */
+
+ p += skip_chunk(attbuf->attlen, p, lasttid);
+ }
+ /* 'p' now points to the first incomplete chunk */
+ return p - (char *) chunks;
+}
+
+void
+init_attstream_buffer_from_stream(attstream_buffer *buf, bool attbyval, int16 attlen,
+ ZSAttStream *attstream, MemoryContext memcontext)
+{
+ int buf_size;
+
+ if ((attstream->t_flags & ATTSTREAM_COMPRESSED) != 0)
+ buf_size = attstream->t_decompressed_bufsize;
+ else
+ buf_size = attstream->t_size - SizeOfZSAttStreamHeader;
+
+ buf->data = MemoryContextAlloc(memcontext, buf_size);
+ buf->len = 0;
+ buf->maxlen = buf_size;
+ buf->cursor = 0;
+
+ buf->attlen = attlen;
+ buf->attbyval = attbyval;
+
+ if ((attstream->t_flags & ATTSTREAM_COMPRESSED) != 0)
+ {
+ zs_decompress(attstream->t_payload, buf->data,
+ attstream->t_size - SizeOfZSAttStreamHeader,
+ attstream->t_decompressed_bufsize);
+ buf->len = attstream->t_decompressed_size;
+ }
+ else
+ {
+ memcpy(buf->data, attstream->t_payload, attstream->t_size - SizeOfZSAttStreamHeader);
+ buf->len = attstream->t_size - SizeOfZSAttStreamHeader;
+ }
+
+ buf->firsttid = get_chunk_first_tid(buf->attlen, buf->data + buf->cursor);
+ buf->lasttid = attstream->t_lasttid;
+}
+
+void
+init_attstream_buffer(attstream_buffer *buf, bool attbyval, int16 attlen)
+{
+#define ATTBUF_INIT_SIZE 1024
+ buf->data = palloc(ATTBUF_INIT_SIZE);
+ buf->len = 0;
+ buf->maxlen = ATTBUF_INIT_SIZE;
+ buf->cursor = 0;
+
+ buf->firsttid = 0;
+ buf->lasttid = 0;
+
+ buf->attlen = attlen;
+ buf->attbyval = attbyval;
+}
+
+void
+vacuum_attstream(Relation rel, AttrNumber attno, attstream_buffer *dst,
+ ZSAttStream *attstream,
+ zstid *tids_to_remove, int num_tids_to_remove)
+{
+ Form_pg_attribute attr = &rel->rd_att->attrs[attno - 1];
+ attstream_decoder decoder;
+ int removeidx;
+ zstid *tids;
+ Datum *datums;
+ bool *isnulls;
+ int num_buffered;
+ int buffer_size = 1000;
+
+ /*
+ * Decode the input, leave out the items that are to be removed, and
+ * re-encode as we go.
+ */
+ tids = palloc(buffer_size * sizeof(zstid));
+ datums = palloc(buffer_size * sizeof(Datum));
+ isnulls = palloc(buffer_size * sizeof(bool));
+
+ init_attstream_buffer(dst, attr->attbyval, attr->attlen);
+
+ init_attstream_decoder(&decoder, attr->attbyval, attr->attlen);
+ decode_attstream_begin(&decoder, attstream);
+
+ num_buffered = 0;
+ removeidx = 0;
+ while (decode_attstream_cont(&decoder))
+ {
+ for (int idx = 0; idx < decoder.num_elements; idx++)
+ {
+ zstid tid = decoder.tids[idx];
+ Datum datum = decoder.datums[idx];
+ bool isnull = decoder.isnulls[idx];
+
+ /* also "merge" in the list of tids to remove */
+ while (removeidx < num_tids_to_remove && tid > tids_to_remove[removeidx])
+ removeidx++;
+ if (removeidx < num_tids_to_remove && tid == tids_to_remove[removeidx])
+ {
+ /*
+ * This datum needs to be removed. Leave it out from the result.
+ *
+ * If it's a toasted datum, also remove the toast blocks.
+ */
+ if (attr->attlen == -1 && !isnull &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ varatt_zs_toastptr *toastptr = (varatt_zs_toastptr *) DatumGetPointer(datum);
+ BlockNumber toast_blkno = toastptr->zst_block;
+
+ zedstore_toast_delete(rel, attr, tid, toast_blkno);
+ }
+ }
+ else
+ {
+ tids[num_buffered] = tid;
+ datums[num_buffered] = datum;
+ isnulls[num_buffered] = isnull;
+ num_buffered++;
+
+ if (num_buffered == buffer_size)
+ {
+ /* encode datums that we've buffered so far */
+ int num_encoded;
+ int num_remain;
+
+ num_encoded = append_attstream(dst, false, num_buffered, tids, datums, isnulls);
+
+ /* move the remaining ones to beginning of buffer, and continue */
+ num_remain = num_buffered - num_encoded;
+ memmove(tids, &tids[num_encoded], num_remain * sizeof(zstid));
+ memmove(datums, &datums[num_encoded], num_remain * sizeof(Datum));
+ memmove(isnulls, &isnulls[num_encoded], num_remain * sizeof(bool));
+ num_buffered = num_remain;
+ }
+ }
+ }
+ }
+
+ /* flush the rest of buffered datums to the attstream */
+ if (num_buffered > 0)
+ append_attstream(dst, true, num_buffered, tids, datums, isnulls);
+
+#ifdef USE_ASSERT_CHECKING
+ verify_attstream(dst);
+#endif
+
+ pfree(datums);
+ pfree(isnulls);
+ pfree(tids);
+ destroy_attstream_decoder(&decoder);
+}
+
+/*
+ * Merge two attstreams together.
+ *
+ * This is the workhorse of repacking and re-encoding data, when
+ * new attribute data is added to a page (INSERT/UPDATE), or when
+ * some data is removed (VACUUM after a DELETE).
+ *
+ * 'buf' holds one of the streams, as an in-memory attstream_buffer, and
+ * 'attstream2' is the other. 'attstream2' can be NULL, in which case
+ * there is nothing to merge and 'buf' is left unchanged.
+ *
+ * There are some heuristics here:
+ *
+ * - A compressed attstream is assumed to already be in a "dense"
+ * form, using maximally compact codewords. An uncompressed
+ * stream, however, might not be, so uncompressed streams are
+ * always decoded into constituent datums, and re-encoded.
+ *
+ */
+void
+merge_attstream(Form_pg_attribute attr, attstream_buffer *buf, ZSAttStream *attstream2)
+{
+ if (attstream2 == NULL)
+ return;
+
+ /*
+ * If the input is compressed, decompress it.
+ */
+ if ((attstream2->t_flags & ATTSTREAM_COMPRESSED) != 0)
+ {
+ char *decompress_buf;
+
+ decompress_buf = palloc(attstream2->t_decompressed_bufsize);
+ zs_decompress(attstream2->t_payload, decompress_buf,
+ attstream2->t_size - SizeOfZSAttStreamHeader,
+ attstream2->t_decompressed_bufsize);
+
+ merge_attstream_guts(attr, buf,
+ decompress_buf, attstream2->t_decompressed_size,
+ attstream2->t_lasttid);
+
+ pfree(decompress_buf);
+ }
+ else
+ {
+ merge_attstream_guts(attr, buf,
+ attstream2->t_payload, attstream2->t_size - SizeOfZSAttStreamHeader,
+ attstream2->t_lasttid);
+ }
+}
+
+void
+merge_attstream_buffer(Form_pg_attribute attr, attstream_buffer *buf, attstream_buffer *buf2)
+{
+ merge_attstream_guts(attr, buf,
+ buf2->data + buf2->cursor, buf2->len - buf2->cursor, buf2->lasttid);
+}
+
+static void
+merge_attstream_guts(Form_pg_attribute attr, attstream_buffer *buf, char *chunks2, int chunks2len, zstid lasttid2)
+{
+ attstream_decoder decoder1;
+ bool decoder1_continues;
+ int decoder1_idx;
+ attstream_decoder decoder2;
+ bool decoder2_continues;
+ int decoder2_idx;
+ Datum *result_datums;
+ bool *result_isnulls;
+ zstid *result_tids;
+ int num_buffered;
+ zstid lasttid1;
+ zstid firsttid2;
+ int buffer_size = 1000;
+ char *olddata;
+ int newmaxlen;
+
+ lasttid1 = buf->lasttid;
+ firsttid2 = get_chunk_first_tid(buf->attlen, chunks2);
+
+ /*
+ * Fast path:
+ *
+ * If the two streams don't overlap, then
+ * we can avoid re-encoding and just append one stream after the other.
+ * We only do this if the stream that comes first was compressed:
+ * otherwise it may not be optimally packed, and we want to re-encode it
+ * to make sure it's using densest possible codewords.
+ *
+ * XXX: we don't take this fastpath, if the new stream is strictly
+ * below the old stream. We could swap the inputs and do it in that
+ * case too...
+ *
+ * FIXME: we don't actually pay attention to the compression anymore.
+ * We never repack.
+ * FIXME: this is backwards, the normal fast path is if (firsttid1 > lasttid2)
+ */
+ if (firsttid2 > lasttid1)
+ {
+ char *pos_new;
+ uint64 delta;
+
+ enlarge_attstream_buffer(buf, chunks2len);
+ pos_new = buf->data + buf->len;
+
+ memcpy(pos_new, chunks2, chunks2len);
+
+ delta = firsttid2 - lasttid1;
+ replace_first_tid_in_place(buf->attlen, delta, pos_new);
+
+ buf->len += chunks2len;
+ buf->lasttid = lasttid2;
+
+ return;
+ }
+
+ /*
+ * naive implementation: decode everything, merge arrays, and re-encode.
+ * FIXME: because this is naive, this could be costly when we have a large number of tids in the attbuffer
+ */
+ init_attstream_decoder(&decoder1, attr->attbyval, attr->attlen);
+ decode_chunks_begin(&decoder1, buf->data + buf->cursor, buf->len - buf->cursor, buf->lasttid);
+ decoder1_continues = decode_attstream_cont(&decoder1);
+ decoder1_idx = 0;
+
+ init_attstream_decoder(&decoder2, attr->attbyval, attr->attlen);
+ decode_chunks_begin(&decoder2, chunks2, chunks2len, lasttid2);
+ decoder2_continues = decode_attstream_cont(&decoder2);
+ decoder2_idx = 0;
+
+ buffer_size = 1000; /* arbitrary initial size */
+ result_tids = palloc(buffer_size * sizeof(zstid));
+ result_datums = palloc(buffer_size * sizeof(Datum));
+ result_isnulls = palloc(buffer_size * sizeof(bool));
+
+ /*
+ * Reallocate a new buffer, in the same memory context as the old one, to
+ * write the result to. (We can't write directly to the old buffer, because
+ * we'll be reading it at the same time.)
+ */
+ newmaxlen = (buf->len - buf->cursor) + chunks2len + 100; /* 100 gives some headroom, to avoid
+ * repallocs */
+ olddata = buf->data;
+ buf->data = MemoryContextAlloc(GetMemoryChunkContext(olddata), newmaxlen);
+ buf->maxlen = newmaxlen;
+ buf->len = 0;
+ buf->cursor = 0;
+ buf->firsttid = 0;
+ buf->lasttid = 0;
+
+ num_buffered = 0;
+ for (;;)
+ {
+ attstream_decoder *decodernext;
+ int *decodernext_idx;
+ bool *decodernext_continues;
+ zstid tid;
+ Datum datum;
+ bool isnull;
+
+ if (decoder1_continues && decoder2_continues)
+ {
+ if (decoder1.tids[decoder1_idx] < decoder2.tids[decoder2_idx])
+ {
+ decodernext = &decoder1;
+ decodernext_idx = &decoder1_idx;
+ decodernext_continues = &decoder1_continues;
+ }
+ else if (decoder1.tids[decoder1_idx] > decoder2.tids[decoder2_idx])
+ {
+ decodernext = &decoder2;
+ decodernext_idx = &decoder2_idx;
+ decodernext_continues = &decoder2_continues;
+ }
+ else
+ elog(ERROR, "cannot merge attstreams with duplicate TIDs");
+ }
+ else if (decoder1_continues)
+ {
+ decodernext = &decoder1;
+ decodernext_idx = &decoder1_idx;
+ decodernext_continues = &decoder1_continues;
+ }
+ else if (decoder2_continues)
+ {
+ decodernext = &decoder2;
+ decodernext_idx = &decoder2_idx;
+ decodernext_continues = &decoder2_continues;
+ }
+ else
+ {
+ break; /* all done */
+ }
+
+ tid = decodernext->tids[*decodernext_idx];
+ datum = decodernext->datums[*decodernext_idx];
+ isnull = decodernext->isnulls[*decodernext_idx];
+ (*decodernext_idx)++;
+
+ result_tids[num_buffered] = tid;
+ result_datums[num_buffered] = datum;
+ result_isnulls[num_buffered] = isnull;
+ num_buffered++;
+
+ if (num_buffered == buffer_size)
+ {
+ /* encode datums that we've buffered so far */
+ int num_encoded;
+ int num_remain;
+
+ num_encoded = append_attstream(buf, false, num_buffered,
+ result_tids, result_datums, result_isnulls);
+
+ /* move the remaining ones to beginning of buffer, and continue */
+ num_remain = num_buffered - num_encoded;
+ memmove(result_tids, &result_tids[num_encoded], num_remain * sizeof(zstid));
+ memmove(result_datums, &result_datums[num_encoded], num_remain * sizeof(Datum));
+ memmove(result_isnulls, &result_isnulls[num_encoded], num_remain * sizeof(bool));
+ num_buffered = num_remain;
+ }
+
+ if (*decodernext_idx == decodernext->num_elements)
+ {
+ *decodernext_continues = decode_attstream_cont(decodernext);
+ *decodernext_idx = 0;
+ }
+ }
+
+ /* flush the rest of buffered datums to the attstream */
+ if (num_buffered > 0)
+ append_attstream(buf, true, num_buffered, result_tids, result_datums, result_isnulls);
+
+ pfree(olddata);
+ pfree(result_datums);
+ pfree(result_isnulls);
+ pfree(result_tids);
+ destroy_attstream_decoder(&decoder1);
+ destroy_attstream_decoder(&decoder2);
+}
+
+/*
+ * Append the data in 'newbuf' to 'oldstream' in place, modifying 'oldstream'.
+ *
+ * There are assumed to be 'freespace' bytes of free space after 'oldstream',
+ * where we can write the new data.
+ *
+ * If the new data doesn't fit in the available space, does nothing and
+ * returns false.
+ *
+ * NB: This is used within a critical section, so keep it simple. No ereport
+ * or pallocs!
+ */
+bool
+append_attstream_inplace(Form_pg_attribute att, ZSAttStream *oldstream, int freespace,
+ attstream_buffer *newbuf)
+{
+ zstid firstnewtid;
+ char *pos_new;
+ zstid delta;
+
+ Assert(CritSectionCount > 0);
+
+ /*
+ * fast path requirements:
+ *
+ * - the new stream goes after the old one
+ * - there is enough space to append 'newbuf'
+ * - neither stream is compressed
+ */
+ if (oldstream->t_flags & ATTSTREAM_COMPRESSED)
+ return false;
+
+ if (freespace < newbuf->len - newbuf->cursor)
+ return false; /* no space */
+
+ firstnewtid = get_chunk_first_tid(att->attlen, newbuf->data + newbuf->cursor);
+ if (firstnewtid <= oldstream->t_lasttid)
+ {
+ /* overlap */
+ return false;
+ }
+
+ /*
+ * We can do it!
+ *
+	 * The trivial way is to just append the new stream to the old stream,
+ * adjusting the first TID at the seam, so that it's a delta from the last
+ * old tid.
+ *
+ * TODO A better way: try to re-code the last old item, and first new item
+ * together. For example, if new data is added one row at a time, we currently
+ * generate a stream of single-datum chunks, with the 8-byte codeword for
+ * every datum. It would be better to combine the chunks at the seam, using
+ * more compact codewords. But if you implement that, make sure the callers
+ * are happy with that! At the moment, the caller WAL-logs the change, and
+ * doesn't expect us to change the existing data.
+ */
+ pos_new = ((char *) oldstream) + oldstream->t_size;
+ memcpy(pos_new,
+ newbuf->data + newbuf->cursor,
+ newbuf->len - newbuf->cursor);
+
+ delta = firstnewtid - oldstream->t_lasttid;
+ replace_first_tid_in_place(att->attlen, delta, pos_new);
+ oldstream->t_size += newbuf->len - newbuf->cursor;
+ oldstream->t_lasttid = newbuf->lasttid;
+
+ newbuf->cursor = newbuf->len;
+
+ return true;
+}
+
+
+#ifdef USE_ASSERT_CHECKING
+static void
+verify_attstream(attstream_buffer *attbuf)
+{
+ char *p = attbuf->data + attbuf->cursor;
+ char *pend = attbuf->data + attbuf->len;
+ zstid tid;
+
+ tid = 0;
+
+ while (p < pend)
+ {
+ p += skip_chunk(attbuf->attlen, p, &tid);
+ }
+ Assert(tid == attbuf->lasttid);
+ Assert(p == pend);
+}
+#endif
+
+void
+print_attstream(int attlen, char *chunk, int len)
+{
+ char *p = chunk;
+ char *pend = chunk + len;
+ zstid tid;
+
+ tid = 0;
+ while (p < pend)
+ {
+ uint64 codeword;
+		int			chunk_len;
+
+		memcpy(&codeword, p, sizeof(uint64));
+
+		/* use a name distinct from the 'len' parameter to avoid shadowing */
+		chunk_len = skip_chunk(attlen, p, &tid);
+		elog(NOTICE, "%016lX: TID %lu, %d bytes", codeword, tid, chunk_len);
+
+		p += chunk_len;
+ }
+}
+
+/* ----------------------------------------------------------------------------
+ * Functions work with individual chunks in an attstream.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * FIXED-LENGTH CODEWORD MODES
+ * ---------------------------
+ *
+ * These modes are used with fixed-length attributes (attlen > 0). Each codeword
+ * includes a 4-bit mode selector, between 1 and 60 TIDs, and in some modes a NULL
+ * bitmap. To avoid creating chunks so large that they might not fit conveniently
+ * on a page, we avoid using the densest modes when the resulting chunk would
+ * exceed TARGET_CHUNK_SIZE.
+ *
+ * Glossary:
+ *
+ * x       Bit positions representing TIDs (or rather, deltas between TIDs).
+ * 0..9    In the lower modes that encode a lot of TIDs, the boundaries between TIDs
+ *         are not shown and 'x' is used to represent all of them. In higher modes,
+ *         the numbers are used to indicate which bit position encodes which TID.
+ *
+ * N Bit positions used for a NULL bitmap
+ *
+ * w unused, wasted, bits
+ *
+ * mode 0: 0000 xxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 60 TIDs, 1 bit each
+ *
+ * mode 1: 0001 NNNN NNNNNNNN NNNNNNNN NNNNNNNN NNxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 30 TIDs, 1 bit each
+ * 30 NULL bits
+ *
+ * mode 2: 0010 xxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 30 TIDs, 2 bits each
+ *
+ * mode 3: 0011 NNNN NNNNNNNN NNNNNNNN xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 20 TIDs, 2 bits each
+ * 20 NULL bits
+ *
+ * mode 4: 0100 xxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 20 TIDs, 3 bits each
+ *
+ * mode 5: 0101 NNNN NNNNNNNN NNNxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 15 TIDs, 3 bits each
+ * 15 NULL bits
+ *
+ * mode 6: 0110 xxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 15 TIDs, 4 bits each
+ *
+ * mode 7: 0111 NNNN NNNNNNNN xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 12 TIDs, 4 bits each
+ * 12 NULL bits
+ *
+ * mode 8: 1000 xxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 12 TIDs, 5 bits each
+ *
+ * mode 9: 1001 NNNN NNNNNNxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ * 10 TIDs, 5 bits each
+ * 10 NULL bits
+ *
+ * mode 10: 1010 wwww NNNNNNNN 88888877 77776666 66555555 44444433 33332222 22111111
+ *
+ * 8 TIDs, 6 bits each
+ * 8 NULL bits
+ * (four bits are wasted)
+ *
+ * mode 11: 1011 NNNN NN666666 66655555 55554444 44444333 33333322 22222221 11111111
+ *
+ * 6 TIDs, 9 bits each
+ * 6 NULL bits
+ *
+ * mode 12: 1100 NNNN 44444444 44444433 33333333 33332222 22222222 22111111 11111111
+ * 4 TIDs, 14 bits each
+ * 4 NULL bits
+ *
+ * mode 13: 1101 NNN3 33333333 33333333 33222222 22222222 22222111 11111111 11111111
+ *
+ * three TIDs, 19 bits each
+ * 3 NULL bits
+ *
+ * mode 14: 1110 NN22 22222222 22222222 22222222 22211111 11111111 11111111 11111111
+ *
+ * two TIDs, 29 bits each
+ * two NULL bits
+ *
+ * mode 15: 1111 0000 Nxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
+ *
+ *     one TID, 55 bits
+ *     1 NULL bit
+ *     (four bits are wasted)
+ *
+ * XXX: we store the first TID in the low bits, and subsequent TIDs in higher bits. Not
+ * sure if that's how it's usually done...
+ *
+ * XXX: We could use delta 0 to mark unused slots. That way, we wouldn't need to shift
+ * to a higher mode when we're running out of TIDs to encode. Or we could subtract one
+ * from each distance, so that value 0 didn't go wasted, and we could sometimes use
+ * more compact modes.
+ */
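+
+/*
+ * Illustrative example (added for explanation; not part of the original
+ * notes above): suppose attlen = 4, the previous chunk ended at TID 1000,
+ * and the remaining values to encode are TIDs 1001, 1005, 1006 and 1100,
+ * with the third one NULL. The deltas are 1, 4, 1 and 94, and the encoder
+ * ends up choosing mode 12 (4 TIDs, 14 bits each, with a NULL bitmap).
+ * The codeword is then laid out, from the least-significant bit upwards:
+ *
+ *   bits  0-13: delta  1   (TID 1001)
+ *   bits 14-27: delta  4   (TID 1005)
+ *   bits 28-41: delta  1   (TID 1006)
+ *   bits 42-55: delta 94   (TID 1100)
+ *   bits 56-59: NULL bitmap (only the bit for the third element is set)
+ *   bits 60-63: selector 12
+ *
+ * The chunk is the 8-byte codeword followed by 3 * attlen = 12 bytes of
+ * datum data; the NULL element contributes no datum bytes.
+ */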
+static const struct codeword_mode
+{
+ uint8 bits_per_int;
+ uint8 num_ints;
+ bool nullbitmap;
+} fixed_width_modes[17] =
+{
+ {1, 60, false}, /* mode 0 */
+ {1, 30, true}, /* mode 1 */
+ {2, 30, false}, /* mode 2 */
+ {2, 20, true}, /* mode 3 */
+ {3, 20, false}, /* mode 4 */
+ {3, 15, true}, /* mode 5 */
+ {4, 15, false}, /* mode 6 */
+ {4, 12, true}, /* mode 7 */
+ {5, 12, false}, /* mode 8 */
+ {5, 10, true}, /* mode 9 */
+ {6, 8, true}, /* mode 10 */
+ {9, 6, true}, /* mode 11 */
+ {14, 4, true}, /* mode 12 */
+ {19, 3, true}, /* mode 13 */
+ {29, 2, true}, /* mode 14 */
+ {55, 1, true}, /* mode 15 */
+ {0, 0, false} /* sentinel */
+};
+
+static int
+get_chunk_length_fixed(int attlen, char *chunk)
+{
+ uint64 codeword;
+
+ memcpy(&codeword, chunk, sizeof(uint64));
+
+ {
+ int selector = (codeword >> 60);
+ int nints = fixed_width_modes[selector].num_ints;
+ int bits = fixed_width_modes[selector].bits_per_int;
+ bool has_nulls = fixed_width_modes[selector].nullbitmap;
+ int num_nulls;
+
+ /* skip over the TIDs */
+ codeword >>= bits * nints;
+
+ num_nulls = 0;
+ if (has_nulls)
+ {
+ /* count set bits in the NULL bitmap */
+ for (int i = 0; i < nints; i++)
+ {
+ if (codeword & 1)
+ num_nulls++;
+ codeword >>= 1;
+ }
+ }
+ return sizeof(uint64) + (nints - num_nulls) * attlen;
+ }
+}
+
+static zstid
+get_chunk_first_tid_fixed(int attlen, char *chunk)
+{
+ uint64 codeword;
+
+ memcpy(&codeword, chunk, sizeof(uint64));
+
+ {
+ int selector = (codeword >> 60);
+ int bits = fixed_width_modes[selector].bits_per_int;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+
+ /* get first tid */
+ return (codeword & mask);
+ }
+}
+
+static bool
+replace_first_tid_fixed(int attlen, zstid newtid, char *chunk)
+{
+ uint64 codeword;
+
+ memcpy(&codeword, chunk, sizeof(uint64));
+
+ {
+ int selector = (codeword >> 60);
+ int bits = fixed_width_modes[selector].bits_per_int;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+
+		/* the new first-TID delta must fit in the mode's TID field */
+		if (newtid >= (UINT64CONST(1) << bits))
+			return false;
+
+ /* zero out the TID */
+ codeword &= ~mask;
+ codeword |= newtid;
+
+ memcpy(chunk, &codeword, sizeof(uint64));
+
+ return true;
+ }
+}
+
+static int
+skip_chunk_fixed(int attlen, char *chunk, zstid *lasttid)
+{
+ zstid prevtid = *lasttid;
+ char *p = chunk;
+ uint64 codeword;
+
+ memcpy(&codeword, p, sizeof(uint64));
+ p += sizeof(uint64);
+
+ {
+ int selector = (codeword >> 60);
+ int nints = fixed_width_modes[selector].num_ints;
+ int bits = fixed_width_modes[selector].bits_per_int;
+ bool has_nulls = fixed_width_modes[selector].nullbitmap;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+ int num_nulls;
+ zstid tid = prevtid;
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 val = codeword & mask;
+
+ tid += val;
+ codeword >>= bits;
+ }
+
+ num_nulls = 0;
+ if (has_nulls)
+ {
+ /* count set bits in the NULL bitmap */
+ for (int i = 0; i < nints; i++)
+ {
+ if (codeword & 1)
+ num_nulls++;
+ codeword >>= 1;
+ }
+ }
+
+ /* ignore the datums */
+ *lasttid = tid;
+ return sizeof(uint64) + (nints - num_nulls) * attlen;
+ }
+}
+
+static int
+decode_chunk_fixed(bool attbyval, int attlen, zstid *lasttid, char *chunk,
+ int *num_elems, zstid *tids, Datum *datums, bool *isnulls)
+{
+ char *p = chunk;
+ uint64 codeword;
+
+ memcpy(&codeword, p, sizeof(uint64));
+ p += sizeof(uint64);
+
+ {
+ int selector = (codeword >> 60);
+ int bits = fixed_width_modes[selector].bits_per_int;
+ bool has_nulls = fixed_width_modes[selector].nullbitmap;
+ int nints = fixed_width_modes[selector].num_ints;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+ zstid tid = *lasttid;
+ uint64 nullbitmap;
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 val = codeword & mask;
+
+ tid = tid + val;
+ tids[i] = tid;
+ codeword >>= bits;
+ }
+ *lasttid = tid;
+
+ if (has_nulls)
+ nullbitmap = codeword & UINT64CONST(0x0FFFFFFFFFFFFF);
+ else
+ nullbitmap = 0;
+
+ /* datums follow */
+ if (attbyval)
+ {
+ if (nullbitmap == 0)
+ {
+ /* FIXME: the loops below ignore alignment. 'p' might not be aligned */
+ if (attlen == sizeof(Datum))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ datums[i] = *((Datum *) p);
+ isnulls[i] = false;
+ p += sizeof(Datum);
+ }
+ }
+ else if (attlen == sizeof(int32))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ datums[i] = Int32GetDatum(*(int32 *) p);
+ isnulls[i] = false;
+ p += sizeof(int32);
+ }
+ }
+ else if (attlen == sizeof(int16))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+					datums[i] = Int16GetDatum(*(int16 *) p);
+ isnulls[i] = false;
+ p += sizeof(int16);
+ }
+ }
+ else if (attlen == sizeof(char))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ datums[i] = CharGetDatum(*p);
+ isnulls[i] = false;
+ p++;
+ }
+ }
+ else
+ elog(ERROR, "unsupported byval length: %d", attlen);
+ }
+ else
+ {
+ /* FIXME: the loops below ignore alignment. 'p' might not be aligned */
+ if (attlen == sizeof(Datum))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ if ((nullbitmap & (1 << i)) == 0)
+ {
+ datums[i] = *((Datum *) p);
+ isnulls[i] = false;
+ p += sizeof(Datum);
+ }
+ else
+ {
+ datums[i] = (Datum) 0;
+ isnulls[i] = true;
+ }
+ }
+ }
+ else if (attlen == sizeof(int32))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ if ((nullbitmap & (1 << i)) == 0)
+ {
+ datums[i] = Int32GetDatum(*(int32 *) p);
+ isnulls[i] = false;
+ p += sizeof(int32);
+ }
+ else
+ {
+ datums[i] = (Datum) 0;
+ isnulls[i] = true;
+ }
+ }
+ }
+ else if (attlen == sizeof(int16))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ if ((nullbitmap & (1 << i)) == 0)
+ {
+						datums[i] = Int16GetDatum(*(int16 *) p);
+ isnulls[i] = false;
+ p += sizeof(int16);
+ }
+ else
+ {
+ datums[i] = (Datum) 0;
+ isnulls[i] = true;
+ }
+ }
+ }
+ else if (attlen == sizeof(char))
+ {
+ for (int i = 0; i < nints; i++)
+ {
+ if ((nullbitmap & (1 << i)) == 0)
+ {
+ datums[i] = CharGetDatum(*p);
+ isnulls[i] = false;
+ p++;
+ }
+ else
+ {
+ datums[i] = (Datum) 0;
+ isnulls[i] = true;
+ }
+ }
+ }
+ else
+ elog(ERROR, "unsupported byval length: %d", attlen);
+ }
+ }
+ else
+ {
+ char *datumbuf = palloc(MAXALIGN(attlen) * nints); /* XXX: attalign */
+ char *datump = datumbuf;
+
+ for (int i = 0; i < nints; i++)
+ {
+ if ((nullbitmap & (1 << i)) == 0)
+ {
+ memcpy(datump, p, attlen);
+ datums[i] = PointerGetDatum(datump);
+ isnulls[i] = false;
+ p += attlen;
+ datump += MAXALIGN(attlen);
+ }
+ else
+ {
+ datums[i] = (Datum) 0;
+ isnulls[i] = true;
+ }
+ }
+ }
+
+ *num_elems = nints;
+ return p - chunk;
+ }
+}
+
+static int
+encode_chunk_fixed(attstream_buffer *dst, zstid prevtid, int ntids,
+ zstid *tids, Datum *datums, bool *isnulls)
+{
+ /*
+ * Select the "mode" to use for this codeword.
+ *
+ * In each iteration, check if the next value can be represented in the
+ * current mode we're considering. If it's too large, then step up the
+ * mode to a wider one, and repeat. If it fits, move on to the next
+ * integer. Repeat until the codeword is full, given the current mode.
+ *
+ * Note that we don't have any way to represent unused slots in the
+ * codeword, so we require each codeword to be "full". It is always
+ * possible to produce a full codeword unless the very first delta is too
+ * large to be encoded. For example, if the first delta is small but the
+ * second is too large to be encoded, we'll end up using the last "mode",
+ * which has nints == 1.
+ */
+ bool attbyval = dst->attbyval;
+ int16 attlen = dst->attlen;
+ int selector;
+ int this_nints;
+ int this_bits;
+ bool this_supports_nulls;
+ uint64 val;
+ int i;
+ bool has_nulls;
+ int size = sizeof(uint64);
+ uint64 codeword;
+ uint64 deltas[60];
+ char *p;
+
+ selector = 0;
+ this_nints = fixed_width_modes[0].num_ints;
+ this_bits = fixed_width_modes[0].bits_per_int;
+ this_supports_nulls = fixed_width_modes[0].nullbitmap;
+
+ val = tids[0] - prevtid;
+ has_nulls = isnulls[0];
+ i = 0;
+ for (;;)
+ {
+ if (val >= (UINT64CONST(1) << this_bits) ||
+ (has_nulls && !this_supports_nulls))
+ {
+ /* Too large, or need NULL bitmap. Step up to next mode */
+ selector++;
+ this_nints = fixed_width_modes[selector].num_ints;
+ this_bits = fixed_width_modes[selector].bits_per_int;
+ this_supports_nulls = fixed_width_modes[selector].nullbitmap;
+
+ /* we might already have accepted enough deltas for this mode */
+ if (i >= this_nints)
+ break;
+ }
+ else
+ {
+ /* accept this delta; then done if codeword is full */
+ deltas[i] = val;
+ if (!isnulls[i])
+ size += attlen;
+ i++;
+ if (i >= this_nints)
+ break;
+			/* examine next delta (take care not to read past the input arrays) */
+			if (i < ntids && size + attlen <= TARGET_CHUNK_SIZE)
+			{
+				has_nulls |= isnulls[i];
+				val = tids[i] - tids[i - 1];
+			}
+ else
+ {
+ /*
+ * Reached end of input. Pretend that the next integer is a
+ * value that's too large to represent in Simple-8b, so that
+ * we fall out.
+ */
+ val = PG_UINT64_MAX;
+ }
+ }
+ }
+
+ Assert(i > 0);
+
+ /*
+ * Encode the integers using the selected mode.
+ */
+ codeword = 0;
+ if (has_nulls)
+ {
+ for (int i = 0; i < this_nints; i++)
+ codeword |= isnulls[i] ? (1 << i) : 0;
+ codeword <<= this_nints * this_bits;
+ }
+ for (int i = 0; i < this_nints; i++)
+ codeword |= deltas[i] << (i * this_bits);
+
+ /* add selector to the codeword, and return */
+ codeword |= (uint64) selector << 60;
+
+ /*
+ * Note: 'size' is too large at this point, if we had to "back down" to a
+ * less dense mode. That's fine for sizing the destination buffer, but we
+ * can't rely on it for the final size of the chunk.
+ */
+ enlarge_attstream_buffer(dst, size);
+ p = &dst->data[dst->len];
+ memcpy(p, (char *) &codeword, sizeof(uint64));
+ p += sizeof(uint64);
+
+ /*
+ * Now, the data
+ */
+
+ /* FIXME: the loops below ignore alignment. 'p' might not be aligned */
+ if (attbyval)
+ {
+ if (attlen == sizeof(Datum))
+ {
+ for (int i = 0; i < this_nints; i++)
+ {
+ if (!isnulls[i])
+ {
+ *((Datum *) p) = datums[i];
+ p += sizeof(Datum);
+ }
+ }
+ }
+ else if (attlen == sizeof(int32))
+ {
+ for (int i = 0; i < this_nints; i++)
+ {
+ if (!isnulls[i])
+ {
+ *((int32 *) p) = DatumGetInt32(datums[i]);
+ p += sizeof(int32);
+ }
+ }
+ }
+ else if (attlen == sizeof(int16))
+ {
+ for (int i = 0; i < this_nints; i++)
+ {
+ if (!isnulls[i])
+ {
+ *((int16 *) p) = DatumGetInt16(datums[i]);
+ p += sizeof(int16);
+ }
+ }
+ }
+ else if (attlen == sizeof(char))
+ {
+ for (int i = 0; i < this_nints; i++)
+ {
+ if (!isnulls[i])
+ *(p++) = DatumGetChar(datums[i]);
+ }
+ }
+ else
+ elog(ERROR, "unsupported byval length: %d", attlen);
+ }
+ else
+ {
+ for (int i = 0; i < this_nints; i++)
+ {
+ if (!isnulls[i])
+ {
+ memcpy(p, DatumGetPointer(datums[i]), attlen);
+ p += attlen;
+ }
+ }
+ }
+ dst->len = p - dst->data;
+ Assert(dst->len <= dst->maxlen);
+
+ return this_nints;
+}
+
+/*
+ * VARIABLE-SIZE MODES
+ * -------------------
+ *
+ * These are used with varlenas. With varlenas, we encode not only the
+ * TIDs and NULLness of each datum, but also its length, in the same
+ * codeword. The value 0 stored in the length field is used to represent
+ * a NULL; there is no separate NULL bitmap. For non-NULLs, the value
+ * stored in the length is the real length + 1.
+ *
+ * We don't track a maximum size for the chunk during encoding, but the
+ * fact that we use a smallish number of bits to store the length, depending
+ * on the mode, puts a cap on the max chunk size. For example, in mode 4,
+ * we encode 10 TIDs in a codeword with 4 bits to store the length. With four
+ * bits, each datum can be max 14 bytes long. That limits the max size of a
+ * chunk in mode 4 to 10*14 = 140 bytes. The table below shows the modes with the
+ * number of bits used for the TID and the length of each datum, and the maximum
+ * chunk size they give (not including the size of the codeword itself):
+ *
+ * mode tidbits lenbits wasted ntids maxsize
+ * 0 1 1 0 30 30
+ * 1 1 2 0 20 40
+ * 2 1 3 0 15 90
+ * 3 2 3 0 12 72
+ * 4 2 4 0 10 140
+ * 5 3 4 4 8 112
+ * 6 4 4 4 7 98
+ * 7 5 5 0 6 180
+ * 8 6 6 0 5 310
+ * 9 8 7 0 4 504
+ * 10 13 7 0 3 378
+ * 11 23 7 0 2 252
+ * 12 45 15 0 1 32766
+ * 13 unused
+ * 14 toast
+ * 15 unused
+ *
+ * Modes 13 and 15 are currently unused. (The idea is that 15 could be
+ * used for various extended modes with special handling, using more
+ * bits to indicate which extended mode it is. And it seems logical to
+ * have special modes, like the toast mode, at the end. We could use
+ * 13 for another "regular" mode.)
+ *
+ * Mode 14 is special: It is used to encode a toasted datum. The toast
+ * datum is compressed with toast_compress_datum(). Unlike the other
+ * modes, the toast mode lenbits field is overloaded and is used to
+ * identify where the toast datum is stored. The toast datum could be
+ * stored either inline or in separate toast pages. The TID of the
+ * datum is stored in the codeword as is. If the toast datum is stored
+ * inline, then after the codeword comes the zs_toast_header_inline
+ * followed by the datum. If the toast datum is stored in separate
+ * toast pages, then after the codeword comes the block number of the
+ * first toast block, as a 32-bit integer.
+ *
+ * FIXME: Mode 12 is the widest mode, but it only uses up to 45 bits for
+ * the TID. That's not enough to cover the whole range of valid zstids.
+ * I think we need one more special mode, where we use full 60 bits for
+ * the TID, with the length stored separately after the codeword, for
+ * the odd case that you have a very large datum with a very high TID.
+ */
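+
+/*
+ * Illustrative example (added for explanation; not part of the original
+ * notes above): if the only remaining datum to encode is a plain 100-byte
+ * varlena whose TID is 5 greater than the previous chunk's last TID, the
+ * encoder falls through to mode 12. The codeword then holds the 45-bit
+ * TID delta 5 in the low bits, the length code 101 (real length 100 + 1)
+ * in the next 15 bits, and selector 12 in the top 4 bits. The 100 payload
+ * bytes follow the codeword, without the varlena header, which is
+ * reconstructed on decode. A NULL is encoded with length code 0 and
+ * contributes no payload bytes.
+ */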
+static const struct
+{
+ uint8 bits_per_tid;
+ uint8 lenbits;
+ uint8 num_ints;
+} varlen_modes[17] =
+{
+ { 1, 1, 30 }, /* mode 0 */
+ { 1, 2, 20 }, /* mode 1 */
+ { 1, 3, 15 }, /* mode 2 */
+ { 2, 3, 12 }, /* mode 3 */
+ { 2, 4, 10 }, /* mode 4 */
+ { 3, 4, 8 }, /* mode 5 */
+ { 4, 4, 7 }, /* mode 6 */
+ { 5, 5, 6 }, /* mode 7 */
+ { 6, 6, 5 }, /* mode 8 */
+ { 8, 7, 4 }, /* mode 9 */
+ { 13, 7, 3 }, /* mode 10 */
+ { 23, 7, 2 }, /* mode 11 */
+ { 45, 15, 1 }, /* mode 12 */
+
+ /* special modes */
+ { 0, 0, 0 }, /* mode 13 (unused) */
+ { 48, 12, 1 }, /* mode 14 (toast) */
+ { 0, 0, 0 }, /* mode 15 */
+
+ { 0, 0, 0 } /* sentinel */
+};
+
+#define ZS_VARLENA_INLINE_TOAST 1
+#define ZS_VARLENA_TOAST_PAGE 0
+
+typedef struct zs_toast_header_external
+{
+ BlockNumber toast_blkno;
+} zs_toast_header_external;
+
+typedef struct zs_toast_header_inline
+{
+ uint32 compressed_size;
+ uint32 rawsize;
+} zs_toast_header_inline;
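+
+/*
+ * Illustrative example (added for explanation; not part of the original
+ * notes): a mode-14 chunk carrying an inline-toasted datum consists of the
+ * codeword (selector 14 in the top 4 bits, ZS_VARLENA_INLINE_TOAST in the
+ * 12-bit length field, and the 48-bit TID delta in the low bits), followed
+ * by a zs_toast_header_inline and then 'compressed_size' bytes of compressed
+ * data. With ZS_VARLENA_TOAST_PAGE, the header is instead a
+ * zs_toast_header_external holding the block number of the first toast
+ * page, and no datum bytes follow.
+ */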
+
+static int
+get_toast_chunk_length(char *chunk, uint64 toast_mode_selector)
+{
+ int chunk_len;
+
+ if (toast_mode_selector == ZS_VARLENA_INLINE_TOAST)
+ {
+ zs_toast_header_inline *hdr;
+ chunk_len = sizeof(uint64) + sizeof(zs_toast_header_inline);
+
+ hdr = (zs_toast_header_inline *) (chunk + sizeof(uint64));
+ chunk_len += hdr->compressed_size;
+ }
+ else if (toast_mode_selector == ZS_VARLENA_TOAST_PAGE)
+ {
+ chunk_len = sizeof(uint64) + sizeof(zs_toast_header_external);
+ }
+ else
+ {
+ elog(ERROR, "Invalid toast chunk type");
+ }
+
+ return chunk_len;
+}
+
+static int
+get_chunk_length_varlen(char *chunk)
+{
+ uint64 codeword;
+
+ memcpy(&codeword, chunk, sizeof(uint64));
+
+ {
+ int selector = (codeword >> 60);
+ int nints = varlen_modes[selector].num_ints;
+ int tidbits = varlen_modes[selector].bits_per_tid;
+ int lenbits = varlen_modes[selector].lenbits;
+ uint64 lenmask = (UINT64CONST(1) << lenbits) - 1;
+ int total_len;
+
+ /* skip over the TIDs */
+ codeword >>= tidbits * nints;
+
+ if (selector == 14)
+ {
+ /* in-line toast datum or toast pointer */
+ return get_toast_chunk_length(chunk, codeword & lenmask);
+ }
+
+ /* Sum up the lengths */
+ total_len = 0;
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 len = codeword & lenmask;
+
+ if (len > 0)
+ total_len += len - 1;
+ codeword >>= lenbits;
+ }
+ return sizeof(uint64) + total_len;
+ }
+}
+
+static zstid
+get_chunk_first_tid_varlen(char *chunk)
+{
+ uint64 codeword;
+
+ memcpy(&codeword, chunk, sizeof(uint64));
+
+ {
+ int selector = (codeword >> 60);
+ int bits = varlen_modes[selector].bits_per_tid;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+
+ /* get first tid */
+ return (codeword & mask);
+ }
+}
+
+static bool
+replace_first_tid_varlen(zstid newtid, char *chunk)
+{
+ uint64 codeword;
+
+ memcpy(&codeword, chunk, sizeof(uint64));
+
+ {
+ int selector = (codeword >> 60);
+ int bits = varlen_modes[selector].bits_per_tid;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+
+		if (newtid >= (UINT64CONST(1) << bits))
+ return false;
+
+ /* zero out the TID */
+ codeword &= ~mask;
+ codeword |= newtid;
+
+ memcpy(chunk, &codeword, sizeof(uint64));
+
+ return true;
+ }
+}
+
+static int
+skip_chunk_varlen(char *chunk, zstid *lasttid)
+{
+ zstid prevtid = *lasttid;
+ char *p = chunk;
+ uint64 codeword;
+
+ memcpy(&codeword, p, sizeof(uint64));
+ p += sizeof(uint64);
+
+ {
+ int selector = (codeword >> 60);
+ int nints = varlen_modes[selector].num_ints;
+ int tidbits = varlen_modes[selector].bits_per_tid;
+ int lenbits = varlen_modes[selector].lenbits;
+ uint64 mask = (UINT64CONST(1) << tidbits) - 1;
+ uint64 lenmask = (UINT64CONST(1) << lenbits) - 1;
+ int total_len;
+ zstid tid = prevtid;
+
+ if (selector == 14)
+ {
+ /* toast pointer */
+ *lasttid = tid + (codeword & mask);
+ codeword >>= tidbits * nints;
+ return get_toast_chunk_length(chunk, codeword & lenmask);
+ }
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 val = codeword & mask;
+
+ tid += val;
+ codeword >>= tidbits;
+ }
+
+ /* Sum up the lengths */
+ total_len = 0;
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 len = codeword & lenmask;
+
+ if (len > 0)
+ total_len += len - 1;
+ codeword >>= lenbits;
+ }
+
+ /* ignore the datums */
+ *lasttid = tid;
+ return sizeof(uint64) + total_len;
+ }
+}
+
+static int
+chunk_num_elements(char *chunk, int attlen)
+{
+ char *p = chunk;
+ uint64 codeword;
+ int selector;
+
+ memcpy(&codeword, p, sizeof(uint64));
+ p += sizeof(uint64);
+
+ selector = (codeword >> 60);
+ return attlen > 0 ? fixed_width_modes[selector].num_ints : varlen_modes[selector].num_ints;
+}
+
+static int
+decode_chunk_varlen(zstid *lasttid, char *chunk,
+ int *num_elems, zstid *tids, Datum *datums, bool *isnulls)
+{
+ char *p = chunk;
+ uint64 codeword;
+
+ memcpy(&codeword, p, sizeof(uint64));
+ p += sizeof(uint64);
+
+ {
+ int selector = (codeword >> 60);
+ int nints = varlen_modes[selector].num_ints;
+ int tidbits = varlen_modes[selector].bits_per_tid;
+ int lenbits = varlen_modes[selector].lenbits;
+ uint64 tidmask = (UINT64CONST(1) << tidbits) - 1;
+ uint64 lenmask = (UINT64CONST(1) << lenbits) - 1;
+ zstid tid = *lasttid;
+ char *datump;
+
+ if (selector == 14)
+ {
+ /* in-line toast datum or toast pointer */
+ tid += (codeword & tidmask);
+
+ if ((codeword >> (tidbits * nints)) & lenmask & ZS_VARLENA_INLINE_TOAST)
+ {
+ zs_toast_header_inline hdr;
+ uint32 len;
+
+ memcpy(&hdr, p, sizeof(zs_toast_header_inline));
+ p += sizeof(zs_toast_header_inline);
+
+ len = hdr.compressed_size;
+ datump = palloc0(len + TOAST_COMPRESS_HDRSZ);
+ SET_VARSIZE_COMPRESSED(datump, len + TOAST_COMPRESS_HDRSZ);
+ TOAST_COMPRESS_SET_RAWSIZE(datump, hdr.rawsize);
+ memcpy(datump + TOAST_COMPRESS_HDRSZ, p, len);
+ p += len;
+
+ datums[0] = PointerGetDatum(datump);
+ }
+ else
+ {
+ zs_toast_header_external hdr;
+ BlockNumber toastblkno;
+ varatt_zs_toastptr *toastptr;
+
+ memcpy(&hdr, p, sizeof(zs_toast_header_external));
+ p += sizeof(zs_toast_header_external);
+
+ toastblkno = hdr.toast_blkno;
+
+ toastptr = palloc0(sizeof(varatt_zs_toastptr));
+ SET_VARTAG_1B_E(toastptr, VARTAG_ZEDSTORE);
+ toastptr->zst_block = toastblkno;
+
+ datums[0] = PointerGetDatum(toastptr);
+ }
+
+ tids[0] = tid;
+ isnulls[0] = false;
+ *num_elems = 1;
+
+ *lasttid = tid;
+ return p - chunk;
+ }
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 val = codeword & tidmask;
+
+ tid = tid + val;
+ tids[i] = tid;
+ codeword >>= tidbits;
+ }
+ *lasttid = tid;
+
+ /* Decode the datums / isnulls */
+ datump = palloc(MAXALIGN(VARHDRSZ + ((1 << lenbits))) * nints);
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 len = codeword & lenmask;
+
+ if (len == 0)
+ {
+ datums[i] = (Datum) 0;
+ isnulls[i] = true;
+ }
+ else
+ {
+ memcpy(VARDATA(datump), p, len - 1);
+ SET_VARSIZE(datump, len - 1 + VARHDRSZ);
+
+ datums[i] = PointerGetDatum(datump);
+ isnulls[i] = false;
+
+ datump += MAXALIGN(VARHDRSZ + len - 1);
+ p += (len - 1);
+ }
+ codeword >>= lenbits;
+ }
+
+ *num_elems = nints;
+ return p - chunk;
+ }
+}
+
+static int
+encode_chunk_varlen_inline_toast(attstream_buffer *dst, zstid prevtid,
+ zstid *tids, Datum *datums)
+{
+ uint32 len;
+ uint64 codeword;
+ char *p;
+
+ zs_toast_header_inline hdr;
+ hdr.compressed_size = TOAST_COMPRESS_SIZE(datums[0]);
+ hdr.rawsize = TOAST_COMPRESS_RAWSIZE(datums[0]);
+ len = hdr.compressed_size;
+
+ codeword = UINT64CONST(14) << 12;
+ codeword = (codeword | ((uint64)ZS_VARLENA_INLINE_TOAST)) << 48;
+ codeword = codeword | (tids[0] - prevtid);
+
+ enlarge_attstream_buffer(dst, sizeof(uint64) +
+ sizeof(zs_toast_header_inline) + len);
+ p = dst->data + dst->len;
+ memcpy(p, (char *) &codeword, sizeof(uint64));
+ p += sizeof(uint64);
+ memcpy(p, (char *) &hdr, sizeof(zs_toast_header_inline));
+ p += sizeof(zs_toast_header_inline);
+ memcpy(p, (char *) TOAST_COMPRESS_RAWDATA(datums[0]), len);
+ dst->len += sizeof(uint64) + sizeof(zs_toast_header_inline) + len;
+ return 1;
+}
+
+static int
+encode_chunk_varlen_toast_page(attstream_buffer *dst, zstid prevtid, zstid *tids, Datum * datums)
+{
+	uint64		codeword;
+ char *p;
+ zs_toast_header_external hdr;
+ varatt_zs_toastptr *toastptr = (varatt_zs_toastptr *) DatumGetPointer(datums[0]);
+ hdr.toast_blkno = toastptr->zst_block;
+
+ codeword = UINT64CONST(14) << 12;
+ codeword = (codeword | ((uint64)ZS_VARLENA_TOAST_PAGE)) << 48;
+ codeword = codeword | (tids[0] - prevtid);
+
+ enlarge_attstream_buffer(dst, sizeof(uint64) + sizeof(zs_toast_header_external));
+ p = dst->data + dst->len;
+ memcpy(p, (char *) &codeword, sizeof(uint64));
+ p += sizeof(uint64);
+ memcpy(p, (char *) &hdr, sizeof(zs_toast_header_external));
+ dst->len += sizeof(uint64) + sizeof(zs_toast_header_external);
+ return 1;
+}
+
+static int
+encode_chunk_varlen(attstream_buffer *dst, zstid prevtid, int ntids,
+ zstid *tids, Datum *datums, bool *isnulls)
+{
+ /*
+ * Select the "mode" to use for this codeword.
+ *
+ * In each iteration, check if the next value can be represented in the
+ * current mode we're considering. If it's too large, then step up the
+ * mode to a wider one, and repeat. If it fits, move on to the next
+ * integer. Repeat until the codeword is full, given the current mode.
+ *
+ * Note that we don't have any way to represent unused slots in the
+ * codeword, so we require each codeword to be "full". It is always
+ * possible to produce a full codeword unless the very first delta is too
+ * large to be encoded. For example, if the first delta is small but the
+ * second is too large to be encoded, we'll end up using the last "mode",
+ * which has nints == 1.
+ */
+ int selector;
+ int this_nints;
+ int this_tidbits;
+ int this_lenbits;
+ uint64 val;
+ int len;
+ int i;
+ uint64 codeword;
+ uint64 deltas[60];
+ char *p;
+
+ /* special case for inline toast */
+ if (!isnulls[0] && VARATT_IS_COMPRESSED(datums[0]))
+ return encode_chunk_varlen_inline_toast(dst, prevtid, tids, datums);
+
+ /* special case for toast pointers */
+ else if (!isnulls[0] && VARATT_IS_EXTERNAL(datums[0]) && VARTAG_EXTERNAL(datums[0]) == VARTAG_ZEDSTORE)
+ return encode_chunk_varlen_toast_page(dst, prevtid, tids, datums);
+
+ selector = 0;
+ this_nints = varlen_modes[0].num_ints;
+ this_tidbits = varlen_modes[0].bits_per_tid;
+ this_lenbits = varlen_modes[0].lenbits;
+
+ val = tids[0] - prevtid;
+ if (isnulls[0])
+ len = 0;
+ else
+ len = VARSIZE_ANY_EXHDR(datums[0]) + 1;
+ i = 0;
+ for (;;)
+ {
+ if (val >= (UINT64CONST(1) << this_tidbits) ||
+ len >= (UINT64CONST(1) << this_lenbits))
+ {
+ /* Too large TID distance, or length. Step up to next mode */
+ selector++;
+ this_nints = varlen_modes[selector].num_ints;
+ this_tidbits = varlen_modes[selector].bits_per_tid;
+ this_lenbits = varlen_modes[selector].lenbits;
+
+ /* we might already have accepted enough deltas for this mode */
+ if (i >= this_nints)
+ break;
+ }
+ else
+ {
+ /* accept this delta; then done if codeword is full */
+ deltas[i] = val;
+ i++;
+
+ if (i >= this_nints)
+ break;
+
+ /* examine next delta and length */
+ if (i < ntids)
+ {
+ val = tids[i] - tids[i - 1];
+ if (isnulls[i])
+ len = 0;
+ else
+ {
+ if ((VARATT_IS_EXTERNAL(datums[i]) && VARTAG_EXTERNAL(datums[i]) == VARTAG_ZEDSTORE) ||
+ VARATT_IS_COMPRESSED(datums[i]))
+ {
+ /* toast pointer or toast datum, bail out */
+ val = PG_UINT64_MAX;
+ len = PG_INT32_MAX;
+ }
+ else
+ len = VARSIZE_ANY_EXHDR(datums[i]) + 1;
+ }
+ }
+ else
+ {
+ /*
+ * Reached end of input. Pretend that the next integer is a
+ * value that's too large to represent in Simple-8b, so that
+ * we fall out.
+ */
+ val = PG_UINT64_MAX;
+ len = PG_INT32_MAX;
+ }
+ }
+ }
+
+ Assert(i > 0);
+
+ /*
+ * Encode the length and TID deltas using the selected mode.
+ */
+ codeword = 0;
+ for (int i = 0; i < this_nints; i++)
+ {
+ int len;
+
+ if (isnulls[i])
+ len = 0;
+ else
+ len = VARSIZE_ANY_EXHDR(datums[i]) + 1;
+ codeword |= (uint64) len << (i * this_lenbits);
+ }
+ codeword <<= this_nints * this_tidbits;
+
+ for (int i = 0; i < this_nints; i++)
+ codeword |= deltas[i] << (i * this_tidbits);
+
+ /* add selector to the codeword, and return */
+ codeword |= (uint64) selector << 60;
+
+ enlarge_attstream_buffer(dst, sizeof(uint64) + (1 << this_lenbits) * this_nints);
+ p = &dst->data[dst->len];
+
+ memcpy(p, (char *) &codeword, sizeof(uint64));
+ p += sizeof(uint64);
+
+ /*
+ * Now, the data
+ */
+ for (int i = 0; i < this_nints; i++)
+ {
+ if (!isnulls[i])
+ {
+ int len = VARSIZE_ANY_EXHDR(datums[i]);
+
+ memcpy(p, VARDATA_ANY(datums[i]), len);
+ p += len;
+ }
+ }
+
+ Assert(p - dst->data < dst->maxlen);
+ dst->len = p - dst->data;
+ return this_nints;
+}
+
+
+
+
+/*
+ * Wrapper functions over the fixed-length and varlen variants.
+ */
+
+static bool
+replace_first_tid_in_place(int attlen, zstid newtid, char *chunk)
+{
+ if (attlen > 0)
+ return replace_first_tid_fixed(attlen, newtid, chunk);
+ else
+ return replace_first_tid_varlen(newtid, chunk);
+}
+
+static int
+skip_chunk(int attlen, char *chunk, zstid *lasttid)
+{
+ if (attlen > 0)
+ return skip_chunk_fixed(attlen, chunk, lasttid);
+ else
+ return skip_chunk_varlen(chunk, lasttid);
+}
+
+static int
+get_chunk_length(int attlen, char *chunk)
+{
+ if (attlen > 0)
+ return get_chunk_length_fixed(attlen, chunk);
+ else
+ return get_chunk_length_varlen(chunk);
+}
+
+static zstid
+get_chunk_first_tid(int attlen, char *chunk)
+{
+ if (attlen > 0)
+ return get_chunk_first_tid_fixed(attlen, chunk);
+ else
+ return get_chunk_first_tid_varlen(chunk);
+}
+
+/*
+ * Decode given chunk into 'tids', 'datums', 'isnulls'.
+ * Also populate 'num_elems' - the number of elements present in the chunk.
+ * N.B. 'lasttid' is an I/O parameter. The caller must supply the absolute
+ * lasttid of the preceding chunk. This is to ensure that we can calculate
+ * absolute tids for the 'tids' array.
+ */
+static int
+decode_chunk(bool attbyval, int attlen, zstid *lasttid, char *chunk,
+ int *num_elems, zstid *tids, Datum *datums, bool *isnulls)
+{
+ if (attlen > 0)
+ return decode_chunk_fixed(attbyval, attlen, lasttid, chunk, num_elems,
+ tids, datums, isnulls);
+ else
+ return decode_chunk_varlen(lasttid, chunk, num_elems,
+ tids, datums, isnulls);
+}
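+
+/*
+ * Illustrative usage sketch (added for explanation; not a function in this
+ * patch): decoding every chunk of an uncompressed attstream buffer. The
+ * caller carries 'lasttid' from one chunk to the next, starting from zero,
+ * because each chunk's first TID is stored as a delta from the previous
+ * chunk's last TID (verify_attstream() above walks the chunks the same way
+ * with skip_chunk()):
+ *
+ *     zstid   lasttid = 0;
+ *     char   *p = attbuf->data + attbuf->cursor;
+ *     char   *pend = attbuf->data + attbuf->len;
+ *     zstid   tids[60];
+ *     Datum   datums[60];
+ *     bool    isnulls[60];
+ *     int     num_elems;
+ *
+ *     while (p < pend)
+ *     {
+ *         p += decode_chunk(attbuf->attbyval, attbuf->attlen, &lasttid, p,
+ *                           &num_elems, tids, datums, isnulls);
+ *         ... use tids[0..num_elems - 1], datums[] and isnulls[] ...
+ *     }
+ */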
+
+static int
+encode_chunk(attstream_buffer *buf, zstid prevtid, int ntids,
+ zstid *tids, Datum *datums, bool *isnulls)
+{
+ if (buf->attlen > 0)
+ return encode_chunk_fixed(buf, prevtid, ntids,
+ tids, datums, isnulls);
+ else
+ return encode_chunk_varlen(buf, prevtid, ntids,
+ tids, datums, isnulls);
+}
diff --git a/src/backend/access/zedstore/zedstore_btree.c b/src/backend/access/zedstore/zedstore_btree.c
new file mode 100644
index 0000000000..1e0ab08380
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_btree.c
@@ -0,0 +1,1002 @@
+/*
+ * zedstore_btree.c
+ *	  Common routines for handling TID and attribute B-tree structures
+ *
+ * A Zedstore table consists of multiple B-trees, one to store TIDs and
+ * visibility information of the rows, and one tree for each attribute,
+ * to hold the data. The TID and attribute trees differ at the leaf
+ * level, but the internal pages have the same layout. This file contains
+ * routines to deal with internal pages, and some other common
+ * functionality.
+ *
+ * When dealing with the TID tree, pass ZS_META_ATTRIBUTE_NUM as the
+ * attribute number.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_btree.c
+ */
+#include "postgres.h"
+
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undorec.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/procarray.h"
+#include "utils/rel.h"
+
+/* prototypes for local functions */
+static zs_split_stack *zsbt_split_internal_page(Relation rel, AttrNumber attno,
+ Buffer leftbuf, OffsetNumber newoff, List *downlinks);
+static zs_split_stack *zsbt_merge_pages(Relation rel, AttrNumber attno, Buffer leftbuf, Buffer rightbuf, bool target_is_left);
+
+static int zsbt_binsrch_internal(zstid key, ZSBtreeInternalPageItem *arr, int arr_elems);
+
+/*
+ * Find the page containing the given key TID at the given level.
+ *
+ * Level 0 means leaf. The returned buffer is exclusive-locked.
+ *
+ * If tree doesn't exist at all (probably because the table was just created
+ * or truncated), the behavior depends on the 'readonly' argument. If
+ * readonly == true, then returns InvalidBuffer. If readonly == false, then
+ * the tree is created.
+ */
+Buffer
+zsbt_descend(Relation rel, AttrNumber attno, zstid key, int level, bool readonly)
+{
+ BlockNumber next;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ ZSBtreeInternalPageItem *items;
+ int nitems;
+ int itemno;
+ int nextlevel;
+ BlockNumber failblk = InvalidBlockNumber;
+ int faillevel = -1;
+ ZSMetaCacheData *metacache;
+
+ Assert(key != InvalidZSTid);
+
+ /* Fast path for the very common case that we're looking for the rightmost page */
+ metacache = zsmeta_get_cache(rel);
+ if (level == 0 &&
+ attno < metacache->cache_nattributes &&
+ metacache->cache_attrs[attno].rightmost != InvalidBlockNumber &&
+ key >= metacache->cache_attrs[attno].rightmost_lokey)
+ {
+ next = metacache->cache_attrs[attno].rightmost;
+ nextlevel = 0;
+ }
+ else
+ {
+ /* start from root */
+ next = zsmeta_get_root_for_attribute(rel, attno, readonly);
+ if (next == InvalidBlockNumber)
+ {
+ /* completely empty tree */
+ return InvalidBuffer;
+ }
+ nextlevel = -1;
+ }
+ for (;;)
+ {
+ /*
+ * If we arrive again to a block that was a dead-end earlier, it seems
+ * that the tree is corrupt.
+ *
+ * XXX: It's theoretically possible that the block was removed, but then
+ * added back at the same location, and removed again. So perhaps retry
+ * a few times?
+ */
+ if (next == failblk || next == ZS_META_BLK)
+ elog(ERROR, "arrived at incorrect block %u while descending zedstore btree", next);
+
+ buf = ReadBuffer(rel, next);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); /* TODO: shared */
+ page = BufferGetPage(buf);
+ if (!zsbt_page_is_expected(rel, attno, key, nextlevel, buf))
+ {
+ /*
+ * We arrived at an unexpected page. This can happen with concurrent
+ * splits, or page deletions. We could try following the right-link, but
+ * there's no guarantee that's the correct page either, so let's restart
+ * from the root. If we landed here because of concurrent modifications,
+ * the next attempt should land on the correct page. Remember that we
+ * incorrectly ended up on this page, so that if this happens because
+ * the tree is corrupt, rather than concurrent splits, and we land here
+ * again, we won't loop forever.
+ */
+ UnlockReleaseBuffer(buf);
+
+ failblk = next;
+ faillevel = nextlevel;
+ nextlevel = -1;
+ zsmeta_invalidate_cache(rel);
+ next = zsmeta_get_root_for_attribute(rel, attno, readonly);
+ if (next == InvalidBlockNumber)
+ elog(ERROR, "could not find root for attribute %d", attno);
+
+ /*
+ * If the root was split after we cached the metadata, it's
+ * possible that the page we thought was the root page no longer
+ * is, but as we descend from the new root page, we'll end up on
+			 * the same page again anyway. Don't treat that as an error. To
+			 * avoid it, check for the root case here, and if so, reset 'failblk'.
+ */
+ if (faillevel == -1)
+ {
+ if (next == failblk)
+ elog(ERROR, "arrived at incorrect block %u while descending zedstore btree", next);
+ failblk = InvalidBlockNumber;
+ }
+ continue;
+ }
+ opaque = ZSBtreePageGetOpaque(page);
+
+ if (nextlevel == -1)
+ nextlevel = opaque->zs_level;
+
+ else if (opaque->zs_level != nextlevel)
+ elog(ERROR, "unexpected level encountered when descending tree");
+
+ if (opaque->zs_level == level)
+ break;
+
+ /* Find the downlink and follow it */
+ items = ZSBtreeInternalPageGetItems(page);
+ nitems = ZSBtreeInternalPageGetNumItems(page);
+
+ itemno = zsbt_binsrch_internal(key, items, nitems);
+ if (itemno < 0)
+ elog(ERROR, "could not descend tree for tid (%u, %u)",
+ ZSTidGetBlockNumber(key), ZSTidGetOffsetNumber(key));
+
+ next = items[itemno].childblk;
+ nextlevel--;
+
+ UnlockReleaseBuffer(buf);
+ }
+
+ if (opaque->zs_level == 0 && opaque->zs_next == InvalidBlockNumber)
+ {
+ metacache = zsmeta_get_cache(rel);
+ if (attno < metacache->cache_nattributes)
+ {
+ metacache->cache_attrs[attno].rightmost = next;
+ metacache->cache_attrs[attno].rightmost_lokey = opaque->zs_lokey;
+ }
+ }
+
+ return buf;
+}
+
+
+/*
+ * Find and lock the leaf page that contains data for scan->nexttid.
+ *
+ * If 'buf' is valid, it is a previously pinned page. We will check that
+ * page first. If it's not the correct page, it will be released.
+ *
+ * Returns InvalidBuffer if the attribute tree doesn't exist at all.
+ * That should only happen after ALTER TABLE ADD COLUMN, or on a newly
+ * created table. (But none of the current callers would even try to
+ * fetch attribute data without scanning the TID tree first.)
+ */
+Buffer
+zsbt_find_and_lock_leaf_containing_tid(Relation rel, AttrNumber attno,
+ Buffer buf, zstid nexttid, int lockmode)
+{
+ if (BufferIsValid(buf))
+ {
+retry:
+ LockBuffer(buf, lockmode);
+
+ /*
+ * It's possible that the page was concurrently split or recycled by
+ * another backend (or ourselves). Have to re-check that the page is
+ * still valid.
+ */
+ if (zsbt_page_is_expected(rel, attno, nexttid, 0, buf))
+ return buf;
+ else
+ {
+ /*
+ * It's not valid for the TID we're looking for, but maybe it was the
+ * right page for the previous TID. In that case, we don't need to
+ * restart from the root, we can follow the right-link instead.
+ */
+ if (nexttid > MinZSTid &&
+ zsbt_page_is_expected(rel, attno, nexttid - 1, 0, buf))
+ {
+ Page page = BufferGetPage(buf);
+ ZSBtreePageOpaque *opaque = ZSBtreePageGetOpaque(page);
+ BlockNumber next = opaque->zs_next;
+
+ if (next != InvalidBlockNumber)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ buf = ReleaseAndReadBuffer(buf, rel, next);
+ goto retry;
+ }
+ }
+
+ UnlockReleaseBuffer(buf);
+ buf = InvalidBuffer;
+ }
+ }
+
+ /* Descend the B-tree to find the correct leaf page. */
+ if (!BufferIsValid(buf))
+ buf = zsbt_descend(rel, attno, nexttid, 0, true);
+
+ return buf;
+}
+
+
+/*
+ * Check that a page is a valid B-tree page, and covers the given key.
+ *
+ * This is used when traversing the tree, to detect the case that e.g. a
+ * concurrent page split moved pages around and the page we were walking to
+ * is no longer the correct one.
+ */
+bool
+zsbt_page_is_expected(Relation rel, AttrNumber attno, zstid key, int level, Buffer buf)
+{
+ Page page = BufferGetPage(buf);
+ ZSBtreePageOpaque *opaque;
+
+ /*
+ * The page might have been deleted and even reused as a completely different
+ * kind of a page, so we must be prepared for anything.
+ */
+ if (PageIsNew(page))
+ return false;
+
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSBtreePageOpaque)))
+ return false;
+
+ opaque = ZSBtreePageGetOpaque(page);
+
+ if (opaque->zs_page_id != ZS_BTREE_PAGE_ID)
+ return false;
+
+ if (opaque->zs_attno != attno)
+ return false;
+
+ if (level == -1)
+ {
+ if ((opaque->zs_flags & ZSBT_ROOT) == 0)
+ return false;
+ }
+ else
+ {
+ if (opaque->zs_level != level)
+ return false;
+ }
+
+ if (opaque->zs_lokey > key || opaque->zs_hikey <= key)
+ return false;
+
+ /* extra checks for corrupted pages */
+ if (opaque->zs_next == BufferGetBlockNumber(buf))
+ elog(ERROR, "btree page %u next-pointer points to itself", opaque->zs_next);
+
+ return true;
+}
+
+/*
+ * Create a new btree root page, containing supplied downlinks.
+ *
+ * NOTE: the very first root page of a btree, which is also the leaf, is created
+ * in zsmeta_get_root_for_attribute(), not here.
+ *
+ * XXX: What if there are too many downlinks to fit on a page? Shouldn't happen
+ * in practice..
+ */
+zs_split_stack *
+zsbt_newroot(Relation rel, AttrNumber attno, int level, List *downlinks)
+{
+ Page metapage;
+ ZSMetaPage *metapg;
+ Buffer newrootbuf;
+ Page newrootpage;
+ ZSBtreePageOpaque *newrootopaque;
+ ZSBtreeInternalPageItem *items;
+ Buffer metabuf;
+ zs_split_stack *stack1;
+ zs_split_stack *stack2;
+ ListCell *lc;
+ int i;
+
+ newrootbuf = zspage_getnewbuf(rel, attno);
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ /* allocate a new root page */
+ newrootpage = palloc(BLCKSZ);
+ PageInit(newrootpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ newrootopaque = ZSBtreePageGetOpaque(newrootpage);
+ newrootopaque->zs_attno = attno;
+ newrootopaque->zs_next = InvalidBlockNumber;
+ newrootopaque->zs_lokey = MinZSTid;
+ newrootopaque->zs_hikey = MaxPlusOneZSTid;
+ newrootopaque->zs_level = level;
+ newrootopaque->zs_flags = ZSBT_ROOT;
+ newrootopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ items = ZSBtreeInternalPageGetItems(newrootpage);
+
+ /* add all the downlinks */
+ i = 0;
+ foreach (lc, downlinks)
+ {
+ ZSBtreeInternalPageItem *downlink = (ZSBtreeInternalPageItem *) lfirst(lc);
+
+ items[i++] = *downlink;
+ }
+ ((PageHeader) newrootpage)->pd_lower += i * sizeof(ZSBtreeInternalPageItem);
+
+ /* FIXME: Check that all the downlinks fit on the page. */
+
+ /* update the metapage */
+ metapage = PageGetTempPageCopy(BufferGetPage(metabuf));
+
+ metapg = (ZSMetaPage *) PageGetContents(metapage);
+ if ((attno != ZS_META_ATTRIBUTE_NUM) && (attno <= 0 || attno > metapg->nattributes))
+ elog(ERROR, "invalid attribute number %d (table \"%s\" has only %d attributes)",
+ attno, RelationGetRelationName(rel), metapg->nattributes);
+
+ metapg->tree_root_dir[attno].root = BufferGetBlockNumber(newrootbuf);
+
+ stack1 = zs_new_split_stack_entry(metabuf, metapage);
+ stack2 = zs_new_split_stack_entry(newrootbuf, newrootpage);
+ stack2->next = stack1;
+
+ return stack2;
+}
+
+/*
+ * After page split, insert the downlink of 'rightblkno' to the parent.
+ *
+ * On entry, 'leftbuf' must be pinned exclusive-locked.
+ */
+zs_split_stack *
+zsbt_insert_downlinks(Relation rel, AttrNumber attno,
+ zstid leftlokey, BlockNumber leftblkno, int level,
+ List *downlinks)
+{
+ int numdownlinks = list_length(downlinks);
+ ZSBtreeInternalPageItem *items;
+ int nitems;
+ int itemno;
+ Buffer parentbuf;
+ Page parentpage;
+ zs_split_stack *split_stack;
+ ZSBtreeInternalPageItem *firstdownlink;
+
+ /*
+ * re-find parent
+ *
+ * TODO: this is a bit inefficient. Usually, we have just descended the
+ * tree, and if we just remembered the path we descended, we could just
+ * walk back up.
+ */
+
+ /*
+	 * XXX: There was a concurrency bug here, too, observed by running
+ * "make installcheck-parallel":
+ * We're holding a lock on a page on level 0, and the root is at level 1.
+ * However, the metacache says that the page we're holding locked is
+ * the root. When we get here, to find the parent page, we will start
+ * with the cached block at level 0, and deadlock with ourselves.
+ *
+ * To fix that, invalidate the cache, if it claims that the child
+	 * block is the root. I'm not sure this fixes the whole general problem,
+	 * though, so this needs some more thought...
+ */
+ {
+ ZSMetaCacheData *metacache;
+
+ metacache = zsmeta_get_cache(rel);
+ if (attno < metacache->cache_nattributes &&
+ metacache->cache_attrs[attno].root == leftblkno)
+ {
+ metacache->cache_attrs[attno].root = InvalidBlockNumber;
+ }
+ }
+
+ parentbuf = zsbt_descend(rel, attno, leftlokey, level, false);
+ parentpage = BufferGetPage(parentbuf);
+
+ firstdownlink = (ZSBtreeInternalPageItem *) linitial(downlinks);
+
+ /* Find the position in the parent for the downlink */
+ items = ZSBtreeInternalPageGetItems(parentpage);
+ nitems = ZSBtreeInternalPageGetNumItems(parentpage);
+ itemno = zsbt_binsrch_internal(firstdownlink->tid, items, nitems);
+
+ /* sanity checks */
+ if (itemno < 0 || items[itemno].tid != leftlokey ||
+ items[itemno].childblk != leftblkno)
+ {
+ elog(ERROR, "could not find downlink for block %u TID (%u, %u)",
+ leftblkno, ZSTidGetBlockNumber(leftlokey),
+ ZSTidGetOffsetNumber(leftlokey));
+ }
+ itemno++;
+
+ if (PageGetExactFreeSpace(parentpage) < numdownlinks * sizeof(ZSBtreeInternalPageItem))
+ {
+ /* split internal page */
+ split_stack = zsbt_split_internal_page(rel, attno, parentbuf, itemno, downlinks);
+ }
+ else
+ {
+ ZSBtreeInternalPageItem *newitems;
+ Page newpage;
+ int i;
+ ListCell *lc;
+
+ newpage = PageGetTempPageCopySpecial(parentpage);
+
+ split_stack = zs_new_split_stack_entry(parentbuf, newpage);
+
+ /* insert the new downlink for the right page. */
+ newitems = ZSBtreeInternalPageGetItems(newpage);
+ memcpy(newitems, items, itemno * sizeof(ZSBtreeInternalPageItem));
+
+ i = itemno;
+ foreach(lc, downlinks)
+ {
+ ZSBtreeInternalPageItem *downlink = (ZSBtreeInternalPageItem *) lfirst(lc);
+
+ Assert(downlink->childblk != 0);
+ newitems[i++] = *downlink;
+ }
+
+ memcpy(&newitems[i], &items[itemno], (nitems - itemno) * sizeof(ZSBtreeInternalPageItem));
+ ((PageHeader) newpage)->pd_lower += (nitems + numdownlinks) * sizeof(ZSBtreeInternalPageItem);
+ }
+ return split_stack;
+}
+
+/*
+ * Split an internal page.
+ *
+ * The new downlinks in 'newitems' are inserted at position 'newoff' on 'origbuf',
+ * and the page is split as needed to make them fit.
+ */
+static zs_split_stack *
+zsbt_split_internal_page(Relation rel, AttrNumber attno, Buffer origbuf,
+ OffsetNumber newoff, List *newitems)
+{
+ Page origpage = BufferGetPage(origbuf);
+ ZSBtreePageOpaque *origopaque = ZSBtreePageGetOpaque(origpage);
+ Buffer buf;
+ Page page;
+ ZSBtreeInternalPageItem *origitems;
+ int orignitems;
+ zs_split_stack *stack_first;
+ zs_split_stack *stack;
+ Size splitthreshold;
+ ListCell *lc;
+ int origitemno;
+ List *downlinks = NIL;
+
+ origitems = ZSBtreeInternalPageGetItems(origpage);
+ orignitems = ZSBtreeInternalPageGetNumItems(origpage);
+
+ page = PageGetTempPageCopySpecial(origpage);
+ buf = origbuf;
+
+ stack = zs_new_split_stack_entry(buf, page);
+ stack_first = stack;
+
+ /* XXX: currently, we always do 90/10 splits */
+ splitthreshold = PageGetExactFreeSpace(page) * 0.10;
+
+ lc = list_head(newitems);
+ origitemno = 0;
+ for (;;)
+ {
+ ZSBtreeInternalPageItem *item;
+ ZSBtreeInternalPageItem *p;
+
+ if (origitemno == newoff && lc)
+ {
+ item = lfirst(lc);
+ lc = lnext(newitems, lc);
+ }
+ else
+ {
+ if (origitemno == orignitems)
+ break;
+ item = &origitems[origitemno];
+ origitemno++;
+ }
+
+ if (PageGetExactFreeSpace(page) < splitthreshold)
+ {
+ /* have to split to another page */
+ ZSBtreePageOpaque *prevopaque = ZSBtreePageGetOpaque(page);
+			ZSBtreePageOpaque *opaque;
+ BlockNumber blkno;
+ ZSBtreeInternalPageItem *downlink;
+
+ buf = zspage_getnewbuf(rel, attno);
+ blkno = BufferGetBlockNumber(buf);
+ page = palloc(BLCKSZ);
+ PageInit(page, BLCKSZ, sizeof(ZSBtreePageOpaque));
+
+ opaque = ZSBtreePageGetOpaque(page);
+ opaque->zs_attno = attno;
+ opaque->zs_next = prevopaque->zs_next;
+ opaque->zs_lokey = item->tid;
+ opaque->zs_hikey = prevopaque->zs_hikey;
+ opaque->zs_level = prevopaque->zs_level;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ prevopaque->zs_next = blkno;
+ prevopaque->zs_hikey = item->tid;
+
+ stack->next = zs_new_split_stack_entry(buf, page);
+ stack = stack->next;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = item->tid;
+ downlink->childblk = blkno;
+ downlinks = lappend(downlinks, downlink);
+ }
+
+ p = (ZSBtreeInternalPageItem *) ((char *) page + ((PageHeader) page)->pd_lower);
+ *p = *item;
+ ((PageHeader) page)->pd_lower += sizeof(ZSBtreeInternalPageItem);
+ }
+
+ /* recurse to insert downlinks, if we had to split. */
+ if (downlinks)
+ {
+ if ((origopaque->zs_flags & ZSBT_ROOT) != 0)
+ {
+ ZSBtreeInternalPageItem *downlink;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = MinZSTid;
+ downlink->childblk = BufferGetBlockNumber(origbuf);
+ downlinks = lcons(downlink, downlinks);
+
+ stack->next = zsbt_newroot(rel, attno, origopaque->zs_level + 1, downlinks);
+
+ /* clear the ZSBT_ROOT flag on the old root page */
+ ZSBtreePageGetOpaque(stack_first->page)->zs_flags &= ~ZSBT_ROOT;
+ }
+ else
+ {
+ stack->next = zsbt_insert_downlinks(rel, attno,
+ origopaque->zs_lokey,
+ BufferGetBlockNumber(origbuf),
+ origopaque->zs_level + 1,
+ downlinks);
+ }
+ }
+
+ return stack_first;
+}
+
+
+/*
+ * Removes the last item from page, and unlinks the page from the tree.
+ *
+ * NOTE: you cannot remove the only leaf. Returns NULL if the page could not
+ * be deleted.
+ */
+zs_split_stack *
+zsbt_unlink_page(Relation rel, AttrNumber attno, Buffer buf, int level)
+{
+ Page page = BufferGetPage(buf);
+ ZSBtreePageOpaque *opaque = ZSBtreePageGetOpaque(page);
+ Buffer leftbuf;
+ Buffer rightbuf;
+ zs_split_stack *stack;
+
+ /* cannot currently remove the only page at its level. */
+ if (opaque->zs_lokey == MinZSTid && opaque->zs_hikey == MaxPlusOneZSTid)
+ {
+ return NULL;
+ }
+
+ /*
+	 * Find the left sibling, or if this is the leftmost page, find the
+	 * right sibling.
+ */
+ if (opaque->zs_lokey != MinZSTid)
+ {
+ rightbuf = buf;
+ leftbuf = zsbt_descend(rel, attno, opaque->zs_lokey - 1, level, false);
+
+ stack = zsbt_merge_pages(rel, attno, leftbuf, rightbuf, false);
+ if (!stack)
+ {
+ UnlockReleaseBuffer(leftbuf);
+ return NULL;
+ }
+ }
+ else
+ {
+ rightbuf = zsbt_descend(rel, attno, opaque->zs_hikey, level, false);
+ leftbuf = buf;
+ stack = zsbt_merge_pages(rel, attno, leftbuf, rightbuf, true);
+ if (!stack)
+ {
+ UnlockReleaseBuffer(rightbuf);
+ return NULL;
+ }
+ }
+
+ return stack;
+}
+
+/*
+ * Page deletion:
+ *
+ * Mark page empty, remove downlink. If parent becomes empty, recursively delete it.
+ *
+ * Unlike in the nbtree index, we don't need to worry about concurrent scans. They
+ * will simply retry if they land on an unexpected page.
+ */
+static zs_split_stack *
+zsbt_merge_pages(Relation rel, AttrNumber attno, Buffer leftbuf, Buffer rightbuf, bool target_is_left)
+{
+ Buffer parentbuf;
+ Page origleftpage;
+ Page leftpage;
+ Page rightpage;
+ ZSBtreePageOpaque *leftopaque;
+ ZSBtreePageOpaque *origleftopaque;
+ ZSBtreePageOpaque *rightopaque;
+ ZSBtreeInternalPageItem *parentitems;
+ int parentnitems;
+ Page parentpage;
+ int itemno;
+ zs_split_stack *stack;
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ origleftpage = BufferGetPage(leftbuf);
+ origleftopaque = ZSBtreePageGetOpaque(origleftpage);
+ rightpage = BufferGetPage(rightbuf);
+ rightopaque = ZSBtreePageGetOpaque(rightpage);
+
+ /* find downlink for 'rightbuf' in the parent */
+ parentbuf = zsbt_descend(rel, attno, rightopaque->zs_lokey, origleftopaque->zs_level + 1, false);
+ parentpage = BufferGetPage(parentbuf);
+
+ parentitems = ZSBtreeInternalPageGetItems(parentpage);
+ parentnitems = ZSBtreeInternalPageGetNumItems(parentpage);
+ itemno = zsbt_binsrch_internal(rightopaque->zs_lokey, parentitems, parentnitems);
+ if (itemno < 0 || parentitems[itemno].childblk != BufferGetBlockNumber(rightbuf))
+ elog(ERROR, "could not find downlink to FPM page %u", BufferGetBlockNumber(rightbuf));
+
+ if (parentnitems > 1 && itemno == 0)
+ {
+ /*
+ * Don't delete the leftmost child of a parent. That would move the
+ * keyspace of the parent, so we'd need to adjust the lo/hikey of
+ * the parent page, and the parent's downlink in the grandparent.
+ * Maybe later...
+ */
+ UnlockReleaseBuffer(parentbuf);
+ elog(DEBUG1, "deleting leftmost child of a parent not implemented");
+ return NULL;
+ }
+
+ if (target_is_left)
+ {
+ /* move all items from right to left before unlinking the right page */
+ leftpage = PageGetTempPageCopy(rightpage);
+ leftopaque = ZSBtreePageGetOpaque(leftpage);
+
+ memcpy(leftopaque, origleftopaque, sizeof(ZSBtreePageOpaque));
+ }
+ else
+ {
+ /* right page is empty. */
+ leftpage = PageGetTempPageCopy(origleftpage);
+ leftopaque = ZSBtreePageGetOpaque(leftpage);
+ }
+
+ /* update left hikey */
+ leftopaque->zs_hikey = ZSBtreePageGetOpaque(rightpage)->zs_hikey;
+ leftopaque->zs_next = ZSBtreePageGetOpaque(rightpage)->zs_next;
+
+ Assert(ZSBtreePageGetOpaque(leftpage)->zs_level == ZSBtreePageGetOpaque(rightpage)->zs_level);
+
+ stack = zs_new_split_stack_entry(leftbuf, leftpage);
+ stack_head = stack_tail = stack;
+
+ /* Mark right page as empty/unused */
+ rightpage = palloc0(BLCKSZ);
+
+ stack = zs_new_split_stack_entry(rightbuf, rightpage);
+ stack->recycle = true;
+ stack_tail->next = stack;
+ stack_tail = stack;
+
+ /* remove downlink from parent */
+ if (parentnitems > 1)
+ {
+ Page newpage = PageGetTempPageCopySpecial(parentpage);
+ ZSBtreeInternalPageItem *newitems = ZSBtreeInternalPageGetItems(newpage);
+
+ memcpy(newitems, parentitems, itemno * sizeof(ZSBtreeInternalPageItem));
+ memcpy(&newitems[itemno], &parentitems[itemno + 1], (parentnitems - itemno -1) * sizeof(ZSBtreeInternalPageItem));
+
+ ((PageHeader) newpage)->pd_lower += (parentnitems - 1) * sizeof(ZSBtreeInternalPageItem);
+
+ stack = zs_new_split_stack_entry(parentbuf, newpage);
+ stack_tail->next = stack;
+ stack_tail = stack;
+ }
+ else
+ {
+ /* the parent becomes empty as well. Recursively remove it. */
+ stack_tail->next = zsbt_unlink_page(rel, attno, parentbuf, leftopaque->zs_level + 1);
+ if (stack_tail->next == NULL)
+ {
+ /* oops, couldn't remove the parent. Back out */
+ stack = stack_head;
+ while (stack)
+ {
+ zs_split_stack *next = stack->next;
+
+ pfree(stack->page);
+ pfree(stack);
+ stack = next;
+ }
+ }
+ }
+
+ return stack_head;
+}
+
+/*
+ * Allocate a new zs_split_stack struct.
+ */
+zs_split_stack *
+zs_new_split_stack_entry(Buffer buf, Page page)
+{
+ zs_split_stack *stack;
+
+ stack = palloc(sizeof(zs_split_stack));
+ stack->next = NULL;
+ stack->buf = buf;
+ stack->page = page;
+
+ /* caller can change these */
+ stack->recycle = false;
+ stack->special_only = false;
+
+ return stack;
+}
+
+#define MAX_BLOCKS_IN_REWRITE 199
+
+/*
+ * Apply all the changes represented by a list of zs_split_stack
+ * entries.
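+ *
+ * The changes are applied atomically and, if the relation needs WAL, logged
+ * as a single record: block reference 0 is reserved for the UNDO page (when
+ * 'undo_op' is given), and block references 1..N correspond to the stack
+ * entries, in order.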
+ */
+void
+zs_apply_split_changes(Relation rel, zs_split_stack *stack, zs_pending_undo_op *undo_op, AttrNumber attrNumber)
+{
+ zs_split_stack *head = stack;
+ bool wal_needed = RelationNeedsWAL(rel);
+ wal_zedstore_btree_rewrite_pages *xlrec = NULL;
+ int xlrecsz = 0;
+ int block_id = 0;
+ XLogRecPtr recptr;
+
+ if (wal_needed)
+ {
+ int num_pages = 0;
+ int i;
+
+ for (stack = head; stack != NULL; stack = stack->next)
+ num_pages++;
+
+ if (num_pages > MAX_BLOCKS_IN_REWRITE)
+ elog(ERROR, "cannot rewrite more than %d pages in one WAL record",
+ MAX_BLOCKS_IN_REWRITE);
+ /*
+		 * +1 buffer, for the UNDO page.
+ * +1 rdata for UNDO, and +1 for the wal_zedstore_btree_rewrite_pages struct
+ */
+ XLogEnsureRecordSpace(1 + num_pages,
+ 1 + 1 + num_pages);
+
+ xlrecsz = SizeOfZSWalBtreeRewritePages(num_pages);
+ xlrec = palloc(xlrecsz);
+
+ xlrec->numpages = num_pages;
+ i = 0;
+ for (stack = head; stack != NULL; stack = stack->next)
+ {
+ xlrec->pageinfo[i].recycle = stack->recycle;
+ xlrec->pageinfo[i].special_only = stack->special_only;
+ i++;
+ }
+ Assert(i == num_pages);
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) xlrec, xlrecsz);
+
+ if (undo_op)
+ XLogRegisterUndoOp(0, undo_op);
+ }
+
+ START_CRIT_SECTION();
+
+ block_id = 1; /* 0 is undo page */
+ for (stack = head; stack != NULL; stack = stack->next)
+ {
+ Page origpage = BufferGetPage(stack->buf);
+
+ if (stack->special_only)
+ {
+ char *orig_special_area = PageGetSpecialPointer(origpage);
+ char *new_special_area = PageGetSpecialPointer(stack->page);
+ uint16 special_size = PageGetSpecialSize(stack->page);
+
+ Assert(PageGetSpecialSize(origpage) == special_size);
+ Assert(memcmp(origpage, stack->page, BLCKSZ - special_size) == 0);
+ memcpy(orig_special_area, new_special_area, special_size);
+ pfree(stack->page);
+
+ if (wal_needed)
+ {
+ XLogRegisterBuffer(block_id, stack->buf, REGBUF_STANDARD);
+ XLogRegisterBufData(block_id, orig_special_area, special_size);
+ }
+ }
+ else
+ {
+ PageRestoreTempPage(stack->page, BufferGetPage(stack->buf));
+
+ if (wal_needed)
+ XLogRegisterBuffer(block_id, stack->buf,
+ REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+ }
+ MarkBufferDirty(stack->buf);
+
+ block_id++;
+ }
+
+ if (undo_op)
+ zsundo_finish_pending_op(undo_op, (char *) undo_op->payload);
+
+ if (wal_needed)
+ {
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_BTREE_REWRITE_PAGES);
+
+ if (undo_op)
+ PageSetLSN(BufferGetPage(undo_op->reservation.undobuf), recptr);
+
+ for (stack = head; stack != NULL; stack = stack->next)
+ PageSetLSN(BufferGetPage(stack->buf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ if (undo_op)
+ {
+ UnlockReleaseBuffer(undo_op->reservation.undobuf);
+ pfree(undo_op);
+ }
+
+ stack = head;
+ while (stack)
+ {
+ zs_split_stack *next;
+
+ /* add this page to the Free Page Map for recycling */
+ if (stack->recycle)
+ zspage_delete_page(rel, stack->buf, InvalidBuffer, attrNumber);
+
+ UnlockReleaseBuffer(stack->buf);
+
+ next = stack->next;
+ pfree(stack);
+ stack = next;
+ }
+ if (wal_needed)
+ pfree(xlrec);
+}
+
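+/*
+ * Binary search in an array of internal page items.
+ *
+ * Returns the index of the last item whose TID is <= 'key', or -1 if the key
+ * precedes all items. For example (illustrative values only): with item TIDs
+ * {10, 20, 30}, a key of 25 returns 1, a key of 30 returns 2, and a key of 5
+ * returns -1.
+ */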
+static int
+zsbt_binsrch_internal(zstid key, ZSBtreeInternalPageItem *arr, int arr_elems)
+{
+ int low,
+ high,
+ mid;
+
+ low = 0;
+ high = arr_elems;
+ while (high > low)
+ {
+ mid = low + (high - low) / 2;
+
+ if (key >= arr[mid].tid)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+ return low - 1;
+}
+
+void
+zsbt_rewrite_pages_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_btree_rewrite_pages *xlrec =
+ (wal_zedstore_btree_rewrite_pages *) XLogRecGetData(record);
+ Buffer buffers[MAX_BLOCKS_IN_REWRITE];
+ uint8 block_id;
+
+ /* sanity checks */
+ if (record->max_block_id >= MAX_BLOCKS_IN_REWRITE)
+ elog(ERROR, "too many blocks in zedstore rewrite_pages record: %d", record->max_block_id + 1);
+ if (xlrec->numpages != record->max_block_id)
+ elog(ERROR, "number of blocks in WAL record %d does not match record struct %d",
+ record->max_block_id, xlrec->numpages);
+ if (XLogRecGetDataLen(record) != SizeOfZSWalBtreeRewritePages(xlrec->numpages))
+ elog(ERROR, "incorrect record struct size %d %ld", XLogRecGetDataLen(record), SizeOfZSWalBtreeRewritePages(xlrec->numpages));
+
+ if (XLogRecHasBlockRef(record, 0))
+ buffers[0] = XLogRedoUndoOp(record, 0);
+ else
+ buffers[0] = InvalidBuffer;
+
+ /* Iterate over blocks */
+ for (block_id = 1; block_id <= record->max_block_id; block_id++)
+ {
+ if (XLogReadBufferForRedo(record, block_id, &buffers[block_id]) == BLK_NEEDS_REDO)
+ {
+ Page page = BufferGetPage(buffers[block_id]);
+ char *special_area = PageGetSpecialPointer(page);
+ uint16 special_size = PageGetSpecialSize(page);
+ Size new_special_size;
+ char *new_special_area = XLogRecGetBlockData(record, block_id, &new_special_size);
+
+ if (!xlrec->pageinfo[block_id - 1].special_only)
+ elog(ERROR, "zedstore rewrite_pages WAL record did not contain a full-page image");
+
+ if (new_special_size != special_size)
+ elog(ERROR, "size of page's special area in WAL record does not match old page");
+
+ memcpy(special_area, new_special_area, special_size);
+ PageSetLSN(page, lsn);
+ }
+ }
+
+ /* Changes are done: unlock and release all buffers */
+ for (block_id = 0; block_id <= record->max_block_id; block_id++)
+ {
+ if (BufferIsValid(buffers[block_id]))
+ UnlockReleaseBuffer(buffers[block_id]);
+ }
+
+ /*
+ * XXX: The WAL record includes the 'recycle' flags, but we don't use them
+ * for anything. Deleting a page is WAL-logged separately. We could use the
+ * recycle flag here to catch leaked pages on crash, but it's probably not
+ * a big deal in practice.
+ */
+}
diff --git a/src/backend/access/zedstore/zedstore_compression.c b/src/backend/access/zedstore/zedstore_compression.c
new file mode 100644
index 0000000000..0eaff9c625
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_compression.c
@@ -0,0 +1,92 @@
+/*
+ * zedstore_compression.c
+ * Routines for compression
+ *
+ * There are two implementations at the moment: LZ4, and the Postgres
+ * pg_lzcompress(). LZ4 support requires that the server was compiled
+ * with --with-lz4.
+ *
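+ * Both implementations expose the same two entry points. A minimal usage
+ * sketch (illustration only; 'src', 'dst' and 'check' are caller-provided
+ * buffers, 'srclen' and 'targetsz' their sizes):
+ *
+ *		int consumed = srclen;
+ *		int complen = zs_compress_destSize(src, dst, &consumed, targetsz);
+ *
+ * On return, 'consumed' holds the number of source bytes that actually fit
+ * and 'complen' the compressed size (a non-positive value means the data
+ * could not be compressed into the target size). The data can be expanded
+ * again with:
+ *
+ *		zs_decompress(dst, check, complen, consumed);
+ *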
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_compression.c
+ */
+#include "postgres.h"
+
+#ifdef USE_LZ4
+#include <lz4.h>
+#endif
+
+#include "access/zedstore_compression.h"
+#include "common/pg_lzcompress.h"
+#include "utils/datum.h"
+
+#ifdef USE_LZ4
+
+int
+zs_compress_destSize(const char *src, char *dst, int *srcSizePtr, int targetDstSize)
+{
+ return LZ4_compress_destSize(src, dst, srcSizePtr, targetDstSize);
+}
+
+void
+zs_decompress(const char *src, char *dst, int compressedSize, int uncompressedSize)
+{
+ int decompressed_size;
+
+ decompressed_size = LZ4_decompress_safe(src, dst, compressedSize, uncompressedSize);
+ if (decompressed_size < 0)
+ elog(ERROR, "could not decompress chunk (%d bytes compressed, %d bytes uncompressed)",
+ compressedSize, uncompressedSize);
+ if (decompressed_size != uncompressedSize)
+ elog(ERROR, "unexpected decompressed size");
+}
+
+#else
+/* PGLZ implementation */
+
+int
+zs_compress_destSize(const char *src, char *dst, int *srcSizePtr, int targetDstSize)
+{
+ int maxInputSize;
+ int compressed_size;
+
+ /*
+ * FIXME: pglz doesn't have an interface like LZ4 does, to compress up to a certain
+ * target compressed output size. We take a conservative approach and compress
+ * 'targetDstSize' bytes, and return that. Alternatively, we could guess the
+	 * compression ratio, and try compressing a larger chunk hoping that it will fit
+ * in the target size, and try again if it didn't fit. Or we could enhance pglz
+ * code to do this cleverly. But it doesn't seem worth the effort, LZ4 (or something
+ * else, but not pglz) is the future.
+ */
+
+ /* reverse the computation of PGLZ_MAX_OUTPUT */
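+	/*
+	 * Illustration: PGLZ_MAX_OUTPUT(len) is defined as (len) + 4, so with a
+	 * target output size of 8192 bytes we feed pglz at most 8188 input
+	 * bytes, which keeps the worst-case compressed output within the target.
+	 */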
+ if (targetDstSize < 4)
+ return 0;
+
+ maxInputSize = targetDstSize - 4;
+ Assert(PGLZ_MAX_OUTPUT(maxInputSize) <= targetDstSize);
+ if (maxInputSize > *srcSizePtr)
+ maxInputSize = *srcSizePtr;
+
+ compressed_size = pglz_compress(src, maxInputSize, dst, PGLZ_strategy_always);
+ *srcSizePtr = maxInputSize;
+
+ return compressed_size;
+}
+
+void
+zs_decompress(const char *src, char *dst, int compressedSize, int uncompressedSize)
+{
+ int decompressed_size;
+
+ decompressed_size = pglz_decompress(src, compressedSize, dst, uncompressedSize, true);
+ if (decompressed_size < 0)
+ elog(ERROR, "could not decompress chunk");
+ if (decompressed_size != uncompressedSize)
+ elog(ERROR, "unexpected decompressed size");
+}
+
+#endif /* !USE_LZ4 */
diff --git a/src/backend/access/zedstore/zedstore_freepagemap.c b/src/backend/access/zedstore/zedstore_freepagemap.c
new file mode 100644
index 0000000000..e87fc76ae5
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_freepagemap.c
@@ -0,0 +1,438 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstore_freepagemap.c
+ * ZedStore free space management
+ *
+ * The Free Page Map keeps track of unused pages in the relation.
+ *
+ * The FPM is a linked list of pages. Each page contains a pointer to the
+ * next free page.
+ *
+ * Design principles:
+ *
+ * - it's ok to have a block incorrectly stored in the FPM. Before actually
+ * reusing a page, we must check that it's safe.
+ *
+ * - a deletable page must be simple to detect just by looking at the page,
+ * and perhaps a few other pages. It should *not* require scanning the
+ * whole table, or even a whole b-tree. For example, if a column is dropped,
+ * we can detect if a b-tree page belongs to the dropped column just by
+ * looking at the information (the attribute number) stored in the page
+ * header.
+ *
+ * - if a page is deletable, it should become immediately reusable. No
+ * "wait out all possible readers that might be about to follow a link
+ *   to it" business. All code that reads pages needs to keep pages locked
+ *   while following a link, or be prepared to retry if it lands on an
+ * unexpected page.
+ *
+ *
+ * TODO:
+ *
+ * - Avoid fragmentation. If a B-tree page is split, try to hand out a page
+ * that's close to the old page. When the relation is extended, allocate
+ * a larger chunk at once.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_freepagemap.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "storage/bufpage.h"
+#include "storage/lmgr.h"
+#include "utils/rel.h"
+
+/*
+ * Deleted pages are initialized as having this structure, in the
+ * "special area".
+ *
+ * zs_next points to the next free block in the FPM chain.
+ */
+typedef struct ZSFreePageOpaque
+{
+ BlockNumber zs_next;
+ uint16 padding;
+ uint16 zs_page_id; /* ZS_FREE_PAGE_ID */
+} ZSFreePageOpaque;
+
+static Buffer zspage_extendrel_newbuf(Relation rel);
+
+/*
+ * zspage_is_unused()
+ *
+ * Is the current page recyclable?
+ *
+ * It can be:
+ *
+ * - an empty, all-zeros page,
+ * - explicitly marked as deleted,
+ * - an UNDO page older than oldest_undo_ptr
+ * - a b-tree page belonging to a deleted attribute
+ * - a TOAST page belonging to a dead item
+ *
+ * TODO: currently though, we require that it's always explicitly marked as empty.
+ *
+ */
+static bool
+zspage_is_unused(Buffer buf)
+{
+ Page page;
+ ZSFreePageOpaque *opaque;
+
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page))
+ return false;
+
+ if (PageGetSpecialSize(page) != sizeof(ZSFreePageOpaque))
+ return false;
+ opaque = (ZSFreePageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_FREE_PAGE_ID)
+ return false;
+
+ return true;
+}
+
+/*
+ * Allocate a new page.
+ *
+ * The page is exclusive-locked, but not initialized.
+ *
+ * The head of the FPM chain is kept in the metapage, and thus this
+ * function will acquire the lock on the metapage. The caller must
+ * not be holding it, or we will self-deadlock!
+ *
+ * Unlinking the page from the FPM is WAL-logged. Once this function
+ * returns, the caller must use the page, and WAL-log its initialization,
+ * or give it back by calling zspage_delete_page().
+ *
+ * NOTE: There is a gap between this function unlinking the page from the
+ * FPM, and the caller initializing the page and linking it to somewhere
+ * else. If we crash in between, the page will be permanently leaked.
+ * That's unfortunate, but hopefully won't happen too often.
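+ *
+ * A typical call sequence (an illustrative sketch only, not copied from an
+ * actual caller):
+ *
+ *		buf = zspage_getnewbuf(rel, attno);
+ *		... initialize the page and WAL-log its initialization ...
+ *		MarkBufferDirty(buf);
+ *		UnlockReleaseBuffer(buf);
+ *
+ * or, to give the page back unused:
+ *
+ *		zspage_delete_page(rel, buf, InvalidBuffer, attno);
+ *		UnlockReleaseBuffer(buf);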
+ */
+Buffer
+zspage_getnewbuf(Relation rel, AttrNumber attrNumber)
+{
+ Buffer buf;
+ BlockNumber blk;
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPage *metapg;
+ ZSMetaPageOpaque *metaopaque;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ metapage = BufferGetPage(metabuf);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ metapg = (ZSMetaPage *) PageGetContents(metapage);
+
+ if (attrNumber == ZS_INVALID_ATTRIBUTE_NUM)
+ blk = metaopaque->zs_fpm_head;
+ else
+ blk = metapg->tree_root_dir[attrNumber].fpm_head;
+
+ if (blk == ZS_META_BLK)
+ {
+ /* metapage, not expected */
+ elog(ERROR, "could not find valid page in FPM");
+ }
+ if (blk != InvalidBlockNumber)
+ {
+ ZSFreePageOpaque *opaque;
+ Page page;
+ BlockNumber next_free_blkno;
+
+ buf = ReadBuffer(rel, blk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Check that the page really is unused. */
+ if (!zspage_is_unused(buf))
+ {
+ UnlockReleaseBuffer(buf);
+ elog(ERROR, "unexpected page found in free page list");
+ }
+ page = BufferGetPage(buf);
+ opaque = (ZSFreePageOpaque *) PageGetSpecialPointer(page);
+ next_free_blkno = opaque->zs_next;
+
+ if (attrNumber == ZS_INVALID_ATTRIBUTE_NUM)
+ metaopaque->zs_fpm_head = next_free_blkno;
+ else
+ metapg->tree_root_dir[attrNumber].fpm_head = next_free_blkno;
+
+ if (RelationNeedsWAL(rel))
+ {
+ wal_zedstore_fpm_reuse_page xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.next_free_blkno = next_free_blkno;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalFpmReusePage);
+ XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+ /*
+ * NOTE: We don't WAL-log the reused page here. It's up to the
+ * caller to WAL-log its initialization. If we crash between here
+ * and the initialization, the page is leaked. That's unfortunate,
+ * but it should be rare enough that we can live with it.
+ */
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_FPM_REUSE_PAGE);
+
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+ UnlockReleaseBuffer(metabuf);
+ }
+ else
+ {
+ /*
+ * No free pages in the FPM. Have to extend the relation.
+ * 1. We extend the relation by zedstore_rel_extension_factor #blocks.
+ * 2. Out of the zedstore_rel_extension_factor #blocks returned by the
+ * storage manager, we return the first block. The other blocks
+ * returned are prepended to the attribute level FPM.
+ */
+ StdRdOptions *rd_options = (StdRdOptions *)rel->rd_options;
+ int extension_factor = rd_options ? rd_options->zedstore_rel_extension_factor : ZEDSTORE_DEFAULT_REL_EXTENSION_FACTOR;
+ Buffer *extrabufs = palloc((extension_factor - 1) * sizeof(Buffer));
+
+ buf = zspage_extendrel_newbuf(rel);
+ blk = BufferGetBlockNumber(buf);
+
+		for (int i = 0; i < extension_factor - 1; i++)
+		{
+ extrabufs[i] = zspage_extendrel_newbuf(rel);
+ /*
+ * We unlock the extrabuf here to prevent hitting MAX_SIMUL_LWLOCKS.
+ * It is safe to unlock the extrabuf here as it cannot be referenced
+ * by other backends until it is put on the attribute-level FPM.
+ * We grab the lock again in the following loop before placing the
+ * page on the FPM.
+ */
+ LockBuffer(extrabufs[i], BUFFER_LOCK_UNLOCK);
+ }
+
+		for (int i = extension_factor - 2; i >= 0; i--)
+		{
+ LockBuffer(extrabufs[i], BUFFER_LOCK_EXCLUSIVE);
+ zspage_delete_page(rel, extrabufs[i], metabuf, attrNumber);
+ UnlockReleaseBuffer(extrabufs[i]);
+ }
+ UnlockReleaseBuffer(metabuf);
+ }
+
+ return buf;
+}
+
+void
+zspage_reuse_page_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_fpm_reuse_page *xlrec = (wal_zedstore_fpm_reuse_page *) XLogRecGetData(record);
+ Buffer metabuf;
+
+ if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage = BufferGetPage(metabuf);
+ ZSMetaPageOpaque *metaopaque;
+
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ metaopaque->zs_fpm_head = xlrec->next_free_blkno;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * Extend the relation.
+ *
+ * Returns the new page, exclusive-locked.
+ */
+static Buffer
+zspage_extendrel_newbuf(Relation rel)
+{
+ Buffer buf;
+ bool needLock;
+
+ /*
+ * Extend the relation by one page.
+ *
+ * We have to use a lock to ensure no one else is extending the rel at
+ * the same time, else we will both try to initialize the same new
+ * page. We can skip locking for new or temp relations, however,
+ * since no one else could be accessing them.
+ */
+ needLock = !RELATION_IS_LOCAL(rel);
+
+ if (needLock)
+ LockRelationForExtension(rel, ExclusiveLock);
+
+ buf = ReadBuffer(rel, P_NEW);
+
+ /* Acquire buffer lock on new page */
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * Release the file-extension lock; it's now OK for someone else to
+ * extend the relation some more. Note that we cannot release this
+ * lock before we have buffer lock on the new page, or we risk a race
+ * condition against btvacuumscan --- see comments therein.
+ */
+ if (needLock)
+ UnlockRelationForExtension(rel, ExclusiveLock);
+
+ return buf;
+}
+
+/*
+ * Explicitly mark a page as deleted and recyclable, and add it to the FPM.
+ *
+ * The caller must hold an exclusive-lock on the page.
+ *
+ * This function needs to modify the metapage, to link the page to the
+ * FPM chain. If the caller is already holding a lock on the metapage,
+ * pass it in the 'metabuf' argument.
+ *
+ * NOTE: The deletion of the page is WAL-logged. There is a gap between
+ * the caller making the page obsolete, and calling this function, and
+ * if we crash in between, the page will be leaked. That's unfortunate,
+ * but like in zspage_getnewbuf(), we mostly just live with it. However,
+ * you can use zspage_mark_page_deleted() to avoid it.
+ */
+void
+zspage_delete_page(Relation rel, Buffer buf, Buffer metabuf, AttrNumber attrNumber)
+{
+ bool release_metabuf;
+ BlockNumber blk = BufferGetBlockNumber(buf);
+ Page metapage;
+ ZSMetaPage *metapg;
+ ZSMetaPageOpaque *metaopaque;
+ Page page;
+ BlockNumber next_free_blkno;
+
+ if (metabuf == InvalidBuffer)
+ {
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ release_metabuf = true;
+ }
+ else
+ release_metabuf = false;
+
+ metapage = BufferGetPage(metabuf);
+ metapg = (ZSMetaPage *) PageGetContents(metapage);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ page = BufferGetPage(buf);
+
+ if (attrNumber != ZS_INVALID_ATTRIBUTE_NUM)
+ {
+ /*
+ * Add the page to the attribute specific free page map.
+ */
+ next_free_blkno = metapg->tree_root_dir[attrNumber].fpm_head;
+ zspage_mark_page_deleted(page, next_free_blkno);
+ metapg->tree_root_dir[attrNumber].fpm_head = blk;
+ }
+ else
+ {
+ next_free_blkno = metaopaque->zs_fpm_head;
+ zspage_mark_page_deleted(page, next_free_blkno);
+ metaopaque->zs_fpm_head = blk;
+ }
+
+ MarkBufferDirty(metabuf);
+ MarkBufferDirty(buf);
+
+ if (RelationNeedsWAL(rel))
+ {
+ wal_zedstore_fpm_delete_page xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.next_free_blkno = next_free_blkno;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalFpmDeletePage);
+ XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+ XLogRegisterBuffer(1, buf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_FPM_DELETE_PAGE);
+
+ PageSetLSN(metapage, recptr);
+ PageSetLSN(page, recptr);
+ }
+
+ if (release_metabuf)
+ UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * Initialize a page as deleted.
+ *
+ * This is a low-level function, used by zspage_delete_page(), but it can
+ * also be used by callers that are willing to deal with managing the FPM
+ * chain and WAL-logging by themselves.
+ */
+void
+zspage_mark_page_deleted(Page page, BlockNumber next_free_blk)
+{
+ ZSFreePageOpaque *opaque;
+
+ PageInit(page, BLCKSZ, sizeof(ZSFreePageOpaque));
+ opaque = (ZSFreePageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_page_id = ZS_FREE_PAGE_ID;
+ opaque->zs_next = next_free_blk;
+}
+
+void
+zspage_delete_page_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_fpm_delete_page *xlrec = (wal_zedstore_fpm_delete_page *) XLogRecGetData(record);
+ Buffer metabuf;
+ Buffer deletedbuf;
+ Page deletedpg;
+ BlockNumber deletedblkno;
+
+ deletedbuf = XLogInitBufferForRedo(record, 1);
+ deletedpg = BufferGetPage(deletedbuf);
+ deletedblkno = BufferGetBlockNumber(deletedbuf);
+
+ zspage_mark_page_deleted(deletedpg, xlrec->next_free_blkno);
+
+ if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage = BufferGetPage(metabuf);
+ ZSMetaPageOpaque *metaopaque;
+
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ metaopaque->zs_fpm_head = deletedblkno;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+
+ PageSetLSN(deletedpg, lsn);
+ MarkBufferDirty(deletedbuf);
+
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+ UnlockReleaseBuffer(deletedbuf);
+}
diff --git a/src/backend/access/zedstore/zedstore_inspect.c b/src/backend/access/zedstore/zedstore_inspect.c
new file mode 100644
index 0000000000..d8ded5ef4e
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_inspect.c
@@ -0,0 +1,1191 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstore_inspect.c
+ * Debugging functions, for viewing ZedStore page contents
+ *
+ * These should probably be moved to contrib/, but it's handy to have them
+ * here during development.
+ *
+ * Example queries
+ * ---------------
+ *
+ * How many pages of each type does a table have?
+ *
+ * select count(*), pg_zs_page_type('t_zedstore', g)
+ * from generate_series(0, pg_table_size('t_zedstore') / 8192 - 1) g group by 2;
+ *
+ * count | pg_zs_page_type
+ * -------+-----------------
+ * 1 | META
+ * 3701 | BTREE
+ * 6 | UNDO
+ * (3 rows)
+ *
+ * Compression ratio of B-tree leaf pages (other pages are not compressed):
+ *
+ * select sum(uncompressedsz::numeric) / sum(totalsz) as compratio
+ * from pg_zs_btree_pages('t_zedstore') ;
+ * compratio
+ * --------------------
+ * 3.6623829559208134
+ * (1 row)
+ *
+ * Per column compression ratio and number of pages:
+ *
+ * select attno, count(*), sum(uncompressedsz::numeric) / sum(totalsz) as
+ * compratio from pg_zs_btree_pages('t_zedstore') group by attno order by
+ * attno;
+ *
+ * attno | count | compratio
+ * -------+-------+------------------------
+ * 0 | 395 | 1.00000000000000000000
+ * 1 | 56 | 1.0252948766341260
+ * 2 | 3 | 38.7542309420398383
+ * (3 rows)
+ *
+ *
+ * Measure of leaf page randomness
+ *
+ * A run is a sequence of consecutive leaf blocks. Two blocks are consecutive
+ * if they have consecutive block numbers.
+ *
+ * select (pg_zs_calculate_adjacent_block('t_zedstore'::regclass)).*;
+ *
+ * attnum | nruns | nblocks
+ * -------+-------+---------
+ * 0 | 21 | 27
+ * 1 | 4 | 107
+ * 2 | 4 | 107
+ * 3 | 4 | 107
+ * (4 rows)
+ *
+ *
+ * Get attstreams inside an attribute leaf page. Each row represents an encoded chunk.
+ *
+ * select * from pg_zs_dump_attstreams('t_zedstore', 3);
+ *
+ * select attno, chunkno, upperstream, compressed, chunk_start, chunk_len, prevtid, firsttid, lasttid, itemcount, chunk from pg_zs_dump_attstreams('t_zedstore', 11) limit 5;
+ * attno | chunkno | upperstream | compressed | chunk_start | chunk_len | prevtid | firsttid | lasttid | itemcount | chunk
+ * -------+---------+-------------+------------+-------------+-----------+---------+----------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ * 1 | 0 | f | f | 0 | 12 | 0 | 10001 | 10001 | 1 | \x11270000000000f001000000
+ * 1 | 0 | t | t | 0 | 24 | 0 | 9931 | 9934 | 4 | \xcb660010000400c0cb260000cc260000cd260000ce260000
+ * 1 | 1 | t | t | 24 | 88 | 9934 | 9935 | 9954 | 20 | \x5555555555000030cf260000d0260000d1260000d2260000d3260000d4260000d5260000d6260000d7260000d8260000d9260000da260000db260000dc260000dd260000de260000df260000e0260000e1260000e2260000
+ * 1 | 2 | t | t | 112 | 32 | 9954 | 9955 | 9960 | 6 | \x01020408102000b0e3260000e4260000e5260000e6260000e7260000e8260000
+ * 1 | 3 | t | t | 144 | 128 | 9960 | 9961 | 9990 | 30 | \xffffff3f00000010e9260000ea260000eb260000ec260000ed260000ee260000ef260000f0260000f1260000f2260000f3260000f4260000f5260000f6260000f7260000f8260000f9260000fa260000fb260000fc260000fd260000fe260000ff26000000270000012700000227000003270000042700000527000006270000
+ * (5 rows)
+ *
+ *
+ * Decode chunks inside an attribute leaf page.
+ *
+ * select * from pg_zs_dump_attstreams('t_zedstore', 11), pg_zs_decode_chunk(attbyval,attlen,prevtid,lasttid,chunk);
+ *
+ * select chunkno, tids, datums, isnulls from pg_zs_dump_attstreams('t_zedstore', 11), pg_zs_decode_chunk(attbyval,attlen,prevtid,lasttid,chunk);
+ * chunkno | tids | datums | isnulls
+ * ---------+-----------------------+-----------------------------------------------------------+-----------
+ * 0 | {10001} | {"\\x01000000"} | {f}
+ * 0 | {9931,9932,9933,9934} | {"\\xcb260000","\\xcc260000","\\xcd260000","\\xce260000"} | {f,f,f,f}
+ * (2 rows)
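+ *
+ * Summarize the UNDO pages, one row per page (block number, number of undo
+ * records, free space, and the first/last undo record pointer on the page);
+ * this assumes the SQL-level wrapper is declared with the same name as the
+ * C function, like the other examples above:
+ *
+ * select * from pg_zs_undo_pages('t_zedstore');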
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/zedstore/zedstore_inspect.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "miscadmin.h"
+
+#include "access/relscan.h"
+#include "access/table.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undorec.h"
+#include "commands/vacuum.h"
+#include "funcapi.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+
+Datum pg_zs_page_type(PG_FUNCTION_ARGS);
+Datum pg_zs_undo_pages(PG_FUNCTION_ARGS);
+Datum pg_zs_btree_pages(PG_FUNCTION_ARGS);
+Datum pg_zs_toast_pages(PG_FUNCTION_ARGS);
+Datum pg_zs_meta_page(PG_FUNCTION_ARGS);
+Datum pg_zs_calculate_adjacent_block(PG_FUNCTION_ARGS);
+Datum pg_zs_dump_attstreams(PG_FUNCTION_ARGS);
+Datum pg_zs_decode_chunk(PG_FUNCTION_ARGS);
+
+Datum
+pg_zs_page_type(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint64 pageno = PG_GETARG_INT64(1);
+ Relation rel;
+ uint16 zs_page_id;
+ Buffer buf;
+ Page page;
+ char *result;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ buf = ReadBuffer(rel, pageno);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
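+	/*
+	 * Every zedstore page type keeps its page-id word in the last two bytes
+	 * of the page (the final field of its special area), so it can be read
+	 * here without knowing the page type in advance.
+	 */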
+ zs_page_id = *((uint16 *) ((char *) page + BLCKSZ - sizeof(uint16)));
+
+ UnlockReleaseBuffer(buf);
+
+ table_close(rel, AccessShareLock);
+
+ switch (zs_page_id)
+ {
+ case ZS_META_PAGE_ID:
+ result = "META";
+ break;
+ case ZS_BTREE_PAGE_ID:
+ result = "BTREE";
+ break;
+ case ZS_UNDO_PAGE_ID:
+ result = "UNDO";
+ break;
+ case ZS_TOAST_PAGE_ID:
+ result = "TOAST";
+ break;
+ case ZS_FREE_PAGE_ID:
+ result = "FREE";
+ break;
+ default:
+ result = psprintf("UNKNOWN 0x%04x", zs_page_id);
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+/*
+ * blkno int8
+ * nrecords int4
+ * freespace int4
+ * firstrecptr int8
+ * lastrecptr int8
+ */
+Datum
+pg_zs_undo_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber firstblk;
+ BlockNumber blkno;
+ char *ptr;
+ char *endptr;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ /*
+ * Get the current oldest undo page from the metapage.
+ */
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ firstblk = metaopaque->zs_undo_head;
+
+ UnlockReleaseBuffer(metabuf);
+
+ /*
+ * Loop through UNDO records, starting from the oldest page.
+ */
+ blkno = firstblk;
+ while (blkno != InvalidBlockNumber)
+ {
+ Datum values[5];
+ bool nulls[5];
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+ int nrecords;
+ ZSUndoRecPtr firstptr = {0, 0, 0};
+ ZSUndoRecPtr lastptr = {0, 0, 0};
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the UNDO page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ {
+ elog(WARNING, "unexpected page id on UNDO page %u", blkno);
+ break;
+ }
+
+ /* loop through all records on the page */
+ endptr = (char *) page + ((PageHeader) page)->pd_lower;
+ ptr = (char *) page + SizeOfPageHeaderData;
+ nrecords = 0;
+ while (ptr < endptr)
+ {
+ ZSUndoRec *undorec = (ZSUndoRec *) ptr;
+
+ Assert(undorec->undorecptr.blkno == blkno);
+
+ lastptr = undorec->undorecptr;
+ if (nrecords == 0)
+ firstptr = lastptr;
+ nrecords++;
+
+ ptr += undorec->size;
+ }
+
+ values[0] = Int64GetDatum(blkno);
+ values[1] = Int32GetDatum(nrecords);
+ values[2] = Int32GetDatum(PageGetExactFreeSpace(page));
+ values[3] = Int64GetDatum(firstptr.counter);
+ values[4] = Int64GetDatum(lastptr.counter);
+
+ blkno = opaque->next;
+ UnlockReleaseBuffer(buf);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
+
+/*
+ * blkno int8
+ * tid int8
+ * total_size int8
+ * prev int8
+ * next int8
+ * decompressed_size uint32
+ * is_compressed bool
+ */
+Datum
+pg_zs_toast_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ BlockNumber blkno;
+ BlockNumber nblocks;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /* scan all blocks in physical order */
+ for (blkno = 1; blkno < nblocks; blkno++)
+ {
+ Datum values[8];
+ bool nulls[8];
+ Buffer buf;
+ Page page;
+ ZSToastPageOpaque *opaque;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * we're only interested in toast pages.
+ */
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSToastPageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_TOAST_PAGE_ID)
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+
+ values[0] = Int64GetDatum(blkno);
+ if (opaque->zs_tid)
+ {
+ values[1] = Int64GetDatum(opaque->zs_tid);
+ values[2] = Int64GetDatum(opaque->zs_total_size);
+ }
+ values[3] = Int64GetDatum(opaque->zs_slice_offset);
+ values[4] = Int64GetDatum(opaque->zs_prev);
+ values[5] = Int64GetDatum(opaque->zs_next);
+ values[6] = Int32GetDatum(opaque->zs_decompressed_size);
+ values[7] = BoolGetDatum(opaque->zs_is_compressed);
+
+ UnlockReleaseBuffer(buf);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
+
+/*
+ * attno int4
+ * chunkno int4
+ * upperstream bool
+ * compressed bool
+ * attbyval bool
+ * attlen int4
+ *
+ * chunk_cursor int4
+ * chunk_len int4
+ *
+ * firsttid zstid
+ * lasttid zstid
+ *
+ * tids[] zstid
+ * datums[] bytea
+ * isnulls[] bool
+ * num_elems int4
+ */
+Datum
+pg_zs_dump_attstreams(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ BlockNumber blkno = PG_GETARG_INT64(1);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Datum values[14];
+ bool nulls[14];
+
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ int chunkno;
+	int			upperstream = -1;
+ bool attbyval;
+ int16 attlen;
+ int chunk_start;
+ PageHeader phdr;
+
+ attstream_decoder decoder;
+
+ ZSAttStream *streams[2];
+ int nstreams = 0;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * we're only interested in B-tree pages. (Presumably, most of the pages
+ * in the relation are b-tree pages, so it makes sense to scan the whole
+ * relation in physical order)
+ */
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSBtreePageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+ table_close(rel, AccessShareLock);
+ PG_RETURN_NULL();
+ }
+
+ opaque = (ZSBtreePageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_BTREE_PAGE_ID ||
+ opaque->zs_attno == ZS_META_ATTRIBUTE_NUM ||
+ opaque->zs_level != 0)
+ {
+ UnlockReleaseBuffer(buf);
+ table_close(rel, AccessShareLock);
+ PG_RETURN_NULL();
+ }
+
+ attbyval = rel->rd_att->attrs[opaque->zs_attno - 1].attbyval;
+ attlen = rel->rd_att->attrs[opaque->zs_attno - 1].attlen;
+
+ phdr = (PageHeader) page;
+
+ if (phdr->pd_lower - SizeOfPageHeaderData > SizeOfZSAttStreamHeader)
+ {
+ streams[nstreams++] = (ZSAttStream *) (((char *) page) + SizeOfPageHeaderData);
+ }
+
+ if (phdr->pd_special - phdr->pd_upper > SizeOfZSAttStreamHeader)
+ {
+ upperstream = nstreams;
+ streams[nstreams++] = (ZSAttStream *) (((char *) page) + phdr->pd_upper);
+ }
+
+ for (int i = 0; i < nstreams; i++)
+ {
+ ZSAttStream *stream = streams[i];
+ bytea *chunk;
+ zstid prevtid;
+ zstid firsttid;
+ zstid lasttid;
+
+ init_attstream_decoder(&decoder, attbyval, attlen);
+ decode_attstream_begin(&decoder, stream);
+
+ chunkno = 0;
+ chunk_start = decoder.pos;
+
+ while (get_attstream_chunk_cont(&decoder, &prevtid, &firsttid, &lasttid, &chunk))
+ {
+ values[0] = Int16GetDatum(opaque->zs_attno);
+ values[1] = Int32GetDatum(chunkno);
+ chunkno++;
+
+ values[2] = BoolGetDatum(upperstream == i);
+ values[3] = BoolGetDatum((stream->t_flags & ATTSTREAM_COMPRESSED) != 0);
+ values[4] = BoolGetDatum(attbyval);
+ values[5] = Int16GetDatum(attlen);
+
+ values[6] = Int32GetDatum(chunk_start);
+ values[7] = Int32GetDatum(decoder.pos - chunk_start);
+ chunk_start = decoder.pos;
+
+ values[8] = ZSTidGetDatum(prevtid);
+ values[9] = ZSTidGetDatum(firsttid);
+ values[10] = ZSTidGetDatum(lasttid);
+ values[11] = PointerGetDatum(chunk);
+			values[12] = Int32GetDatum(decoder.num_elements);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ }
+
+ UnlockReleaseBuffer(buf);
+ table_close(rel, AccessShareLock);
+
+ destroy_attstream_decoder(&decoder);
+
+ return (Datum) 0;
+}
+
+Datum
+pg_zs_decode_chunk(PG_FUNCTION_ARGS)
+{
+ bool attbyval = PG_GETARG_BOOL(0);
+ int attlen = PG_GETARG_INT16(1);
+ zstid prevtid = PG_GETARG_ZSTID(2);
+ zstid lasttid = PG_GETARG_ZSTID(3);
+ bytea *chunk = PG_GETARG_BYTEA_P(4);
+ attstream_decoder decoder;
+ Datum values[4];
+ bool nulls[4];
+ ZSAttStream *attstream = palloc(SizeOfZSAttStreamHeader + VARSIZE_ANY_EXHDR(chunk));
+ TupleDesc tupdesc;
+ HeapTuple tuple;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ attstream->t_decompressed_size = VARSIZE_ANY_EXHDR(chunk);
+ attstream->t_decompressed_bufsize = VARSIZE_ANY_EXHDR(chunk);
+ attstream->t_size = SizeOfZSAttStreamHeader + VARSIZE_ANY_EXHDR(chunk);
+ attstream->t_flags = 0;
+ attstream->t_lasttid = lasttid;
+ memcpy(attstream->t_payload, VARDATA_ANY(chunk), VARSIZE_ANY_EXHDR(chunk));
+
+ init_attstream_decoder(&decoder, attbyval, attlen);
+ decode_attstream_begin(&decoder, attstream);
+ decoder.prevtid = prevtid;
+
+ if (!decode_attstream_cont(&decoder))
+ PG_RETURN_NULL();
+ else
+ {
+ ArrayBuildState *astate_tids = NULL;
+ ArrayBuildState *astate_datums = NULL;
+ ArrayBuildState *astate_isnulls = NULL;
+
+ for (int i = 0; i < decoder.num_elements; i++)
+ {
+
+ bytea *attr_data = NULL;
+
+ astate_tids = accumArrayResult(astate_tids,
+ ZSTidGetDatum(decoder.tids[i]),
+ false,
+ ZSTIDOID,
+ CurrentMemoryContext);
+ if (decoder.isnulls[i])
+ {
+ astate_datums = accumArrayResult(astate_datums,
+ (Datum) 0,
+ true,
+ BYTEAOID,
+ CurrentMemoryContext);
+ }
+ else
+ {
+ /*
+ * Fixed length, attribute by value
+ */
+ if (attbyval && attlen > 0)
+ {
+ attr_data = (bytea *) palloc(attlen + VARHDRSZ);
+ SET_VARSIZE(attr_data, attlen + VARHDRSZ);
+ memcpy(VARDATA(attr_data), &decoder.datums[i], attlen);
+ }
+ else if (!attbyval && attlen > 0)
+ {
+ attr_data = (bytea *) palloc(attlen + VARHDRSZ);
+ SET_VARSIZE(attr_data, attlen + VARHDRSZ);
+ memcpy(VARDATA(attr_data),
+ DatumGetPointer(decoder.datums[i]),
+ attlen);
+ }
+ else if (attlen < 0)
+ {
+ int len;
+
+ len =
+ VARSIZE_ANY_EXHDR(DatumGetPointer(decoder.datums[i]));
+ attr_data = (bytea *) palloc(len + VARHDRSZ);
+ SET_VARSIZE(attr_data, len + VARHDRSZ);
+ memcpy(VARDATA(attr_data),
+ VARDATA_ANY(DatumGetPointer(decoder.datums[i])),
+ len);
+ }
+ astate_datums = accumArrayResult(astate_datums,
+ PointerGetDatum(attr_data),
+ false,
+ BYTEAOID,
+ CurrentMemoryContext);
+ }
+ astate_isnulls = accumArrayResult(astate_isnulls,
+ BoolGetDatum(decoder.isnulls[i]),
+ false,
+ BOOLOID,
+ CurrentMemoryContext);
+ }
+
+ values[0] = Int32GetDatum(decoder.num_elements);
+ values[1] = PointerGetDatum(makeArrayResult(astate_tids, CurrentMemoryContext));
+ values[2] = PointerGetDatum(makeArrayResult(astate_datums, CurrentMemoryContext));
+ values[3] = PointerGetDatum(makeArrayResult(astate_isnulls, CurrentMemoryContext));
+ }
+
+ destroy_attstream_decoder(&decoder);
+
+ tuple = heap_form_tuple(tupdesc, values, nulls);
+ PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
+}
+
+/*
+ * blkno int8
+ * nextblk int8
+ * attno int4
+ * level int4
+ *
+ * lokey int8
+ * hikey int8
+
+ * nitems int4
+ * ncompressed int4
+ * totalsz int4
+ * uncompressedsz int4
+ * freespace int4
+ */
+Datum
+pg_zs_btree_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ BlockNumber blkno;
+ BlockNumber nblocks;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /* scan all blocks in physical order */
+ for (blkno = 1; blkno < nblocks; blkno++)
+ {
+ Datum values[11];
+ bool nulls[11];
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ int nitems;
+ int ncompressed;
+ int totalsz;
+ int uncompressedsz;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the page */
+ buf = ReadBuffer(rel, blkno);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * we're only interested in B-tree pages. (Presumably, most of the
+ * pages in the relation are b-tree pages, so it makes sense to scan
+ * the whole relation in physical order)
+ */
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSBtreePageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+ opaque = (ZSBtreePageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_BTREE_PAGE_ID)
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+
+ nitems = 0;
+ ncompressed = 0;
+ totalsz = 0;
+ uncompressedsz = 0;
+ if (opaque->zs_level == 0)
+ {
+ /* meta leaf page */
+ if (opaque->zs_attno == ZS_META_ATTRIBUTE_NUM) {
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+
+				ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ nitems++;
+ totalsz += item->t_size;
+
+ uncompressedsz += item->t_size;
+ }
+ }
+ /* attribute leaf page */
+ else
+ {
+ PageHeader phdr = (PageHeader) page;
+ ZSAttStream *streams[2];
+ int nstreams = 0;
+
+ if (phdr->pd_lower - SizeOfPageHeaderData > SizeOfZSAttStreamHeader)
+ {
+ streams[nstreams++] = (ZSAttStream *) (((char *) page) + SizeOfPageHeaderData);
+ }
+
+ if (phdr->pd_special - phdr->pd_upper > SizeOfZSAttStreamHeader)
+ {
+ streams[nstreams++] = (ZSAttStream *) (((char *) page) + phdr->pd_upper);
+ }
+
+ for (int i = 0; i < nstreams; i++)
+ {
+ ZSAttStream *stream = streams[i];
+
+ totalsz += stream->t_size;
+ /*
+ * FIXME: this is wrong. We currently don't calculate the
+ * number of items in the stream
+ */
+ nitems++;
+ if ((stream->t_flags & ATTSTREAM_COMPRESSED) != 0)
+ {
+ ncompressed++;
+ uncompressedsz += stream->t_decompressed_size;
+ }
+ else
+ {
+ uncompressedsz += stream->t_size;
+ }
+ }
+ }
+ }
+ else
+ {
+ /* internal page */
+ nitems = ZSBtreeInternalPageGetNumItems(page);
+ }
+ values[0] = Int64GetDatum(blkno);
+ values[1] = Int64GetDatum(opaque->zs_next);
+ values[2] = Int32GetDatum(opaque->zs_attno);
+ values[3] = Int32GetDatum(opaque->zs_level);
+ values[4] = Int64GetDatum(opaque->zs_lokey);
+ values[5] = Int64GetDatum(opaque->zs_hikey);
+ values[6] = Int32GetDatum(nitems);
+ if (opaque->zs_level == 0)
+ {
+ values[7] = Int32GetDatum(ncompressed);
+ values[8] = Int32GetDatum(totalsz);
+ values[9] = Int32GetDatum(uncompressedsz);
+ }
+ else
+ {
+ nulls[7] = true;
+ nulls[8] = true;
+ nulls[9] = true;
+ }
+ values[10] = Int32GetDatum(PageGetExactFreeSpace(page));
+
+ UnlockReleaseBuffer(buf);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
+
+/*
+ * blkno int8
+ * undo_head int8
+ * undo_tail int8
+ * undo_tail_first_counter int8
+ * undo_oldestpointer_counter int8
+ * undo_oldestpointer_blkno int8
+ * undo_oldestpointer_offset int8
+ * fpm_head int8
+ * flags int4
+ */
+Datum
+pg_zs_meta_page(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ Relation rel;
+ TupleDesc tupdesc;
+ Datum values[9];
+ bool nulls[9];
+ Buffer buf;
+ Page page;
+ ZSMetaPageOpaque *opaque;
+ HeapTuple tuple;
+ Datum result;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* open the metapage */
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ /* Read the page */
+ buf = ReadBuffer(rel, ZS_META_BLK);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSMetaPageOpaque)))
+ {
+ UnlockReleaseBuffer(buf);
+		elog(ERROR, "unexpected special area size on zedstore meta page");
+ }
+ opaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_META_PAGE_ID)
+ {
+ UnlockReleaseBuffer(buf);
+		elog(ERROR, "unexpected page id on zedstore meta page: %d",
+			 opaque->zs_page_id);
+ }
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[0] = Int64GetDatum(ZS_META_BLK);
+ values[1] = Int64GetDatum(opaque->zs_undo_head);
+ values[2] = Int64GetDatum(opaque->zs_undo_tail);
+ values[3] = Int64GetDatum(opaque->zs_undo_tail_first_counter);
+ values[4] = Int64GetDatum(opaque->zs_undo_oldestptr.counter);
+ values[5] = Int64GetDatum(opaque->zs_undo_oldestptr.blkno);
+ values[6] = Int32GetDatum(opaque->zs_undo_oldestptr.offset);
+ values[7] = Int64GetDatum(opaque->zs_fpm_head);
+ values[8] = Int32GetDatum(opaque->zs_flags);
+
+ UnlockReleaseBuffer(buf);
+
+ table_close(rel, AccessShareLock);
+
+ tuple = heap_form_tuple(tupdesc, values, nulls);
+ result = HeapTupleGetDatum(tuple);
+
+ PG_RETURN_DATUM(result);
+}
+
+/*
+ * Function to check whether the leaf blocks of each attribute's B-tree are
+ * physically adjacent in the underlying relation file.
+ *
+ * Returns the number of runs of consecutive blocks per attribute and the total
+ * number of blocks per attribute.
+ */
+Datum
+pg_zs_calculate_adjacent_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+
+ Buffer buf;
+ Relation rel;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Tuplestorestate *tupstore;
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+
+ int *total_blocks;
+ int *num_runs;
+
+ Datum values[3];
+ bool nulls[3];
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use zedstore inspection functions"))));
+
+ rel = table_open(relid, AccessShareLock);
+
+ /*
+ * Reject attempts to read non-local temporary relations; we would be
+ * likely to get wrong data since we have no visibility into the owning
+ * session's local buffers.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions")));
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ total_blocks = (int *)palloc0((rel->rd_att->natts + 1) * sizeof(int));
+ num_runs = (int *)palloc0((rel->rd_att->natts + 1) * sizeof(int));
+
+ for (int attnum=0; attnum <= rel->rd_att->natts; attnum++)
+ {
+ BlockNumber blkno;
+ buf = zsbt_descend(rel, attnum, MinZSTid, 0, true);
+
+ if (buf == InvalidBuffer)
+ continue;
+
+ blkno = BufferGetBlockNumber(buf);
+
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ num_runs[attnum] = 1;
+ total_blocks[attnum] = 1;
+
+ while (opaque->zs_next != InvalidBlockNumber)
+ {
+ if (opaque->zs_next != blkno + 1)
+ {
+ num_runs[attnum]++;
+ }
+ total_blocks[attnum]++;
+
+ UnlockReleaseBuffer(buf);
+
+ buf = ReadBuffer(rel, opaque->zs_next);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ blkno = BufferGetBlockNumber(buf);
+
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ values[0] = Int32GetDatum(attnum);
+ values[1] = Int32GetDatum(num_runs[attnum]);
+ values[2] = Int32GetDatum(total_blocks[attnum]);
+ nulls[0] = false;
+ nulls[1] = false;
+ nulls[2] = false;
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ table_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/access/zedstore/zedstore_meta.c b/src/backend/access/zedstore/zedstore_meta.c
new file mode 100644
index 0000000000..50496dc5cb
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_meta.c
@@ -0,0 +1,509 @@
+/*
+ * zedstore_meta.c
+ * Routines for handling ZedStore metapage
+ *
+ * The metapage holds a directory of B-tree root block numbers, one for each
+ * column.
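+ * (The directory has one more entry than the table has columns: the code
+ * below sizes it as RelationGetNumberOfAttributes() + 1, the extra entry
+ * being for the meta-attribute, ZS_META_ATTRIBUTE_NUM.)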
+ *
+ * TODO:
+ * - extend the root block dir to an overflow page if there are too many
+ * attributes to fit on one page
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_meta.c
+ */
+#include "postgres.h"
+
+#include "access/itup.h"
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+static void zsmeta_wal_log_metapage(Buffer buf, int natts);
+
+static ZSMetaCacheData *
+zsmeta_populate_cache_from_metapage(Relation rel, Page page)
+{
+ ZSMetaCacheData *cache;
+ ZSMetaPage *metapg;
+ int natts;
+
+ if (rel->rd_amcache != NULL)
+ {
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }
+
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ natts = metapg->nattributes;
+
+ cache =
+ MemoryContextAllocZero(CacheMemoryContext,
+ offsetof(ZSMetaCacheData, cache_attrs[natts]));
+ cache->cache_nattributes = natts;
+
+ for (int i = 0; i < natts; i++)
+ {
+ cache->cache_attrs[i].root = metapg->tree_root_dir[i].root;
+ cache->cache_attrs[i].rightmost = InvalidBlockNumber;
+ }
+
+ rel->rd_amcache = cache;
+ return cache;
+}
+
+ZSMetaCacheData *
+zsmeta_populate_cache(Relation rel)
+{
+ ZSMetaCacheData *cache;
+ Buffer metabuf;
+ BlockNumber nblocks;
+
+ RelationOpenSmgr(rel);
+
+ if (rel->rd_amcache != NULL)
+ {
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ RelationSetTargetBlock(rel, nblocks);
+ if (nblocks == 0)
+ {
+ cache =
+ MemoryContextAllocZero(CacheMemoryContext,
+ offsetof(ZSMetaCacheData, cache_attrs));
+ cache->cache_nattributes = 0;
+ rel->rd_amcache = cache;
+ }
+ else
+ {
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ cache = zsmeta_populate_cache_from_metapage(rel, BufferGetPage(metabuf));
+ UnlockReleaseBuffer(metabuf);
+ }
+
+ return cache;
+}
+
+static void
+zsmeta_expand_metapage_for_new_attributes(Relation rel)
+{
+ int natts = RelationGetNumberOfAttributes(rel) + 1;
+ Buffer metabuf;
+ Page page;
+ ZSMetaPage *metapg;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(metabuf);
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ if (natts > metapg->nattributes)
+ {
+ int new_pd_lower;
+
+ new_pd_lower = (char *) &metapg->tree_root_dir[natts] - (char *) page;
+ if (new_pd_lower > ((PageHeader) page)->pd_upper)
+ {
+ /*
+ * The root block directory must fit on the metapage.
+ *
+ * TODO: We could extend this by overflowing to another page.
+ */
+ elog(ERROR, "too many attributes for zedstore");
+ }
+
+ START_CRIT_SECTION();
+
+ /* Initialize the new attribute roots to InvalidBlockNumber */
+ for (int i = metapg->nattributes; i < natts; i++)
+ {
+ metapg->tree_root_dir[i].root = InvalidBlockNumber;
+ metapg->tree_root_dir[i].fpm_head = InvalidBlockNumber;
+ }
+
+ metapg->nattributes = natts;
+ ((PageHeader) page)->pd_lower = new_pd_lower;
+
+ MarkBufferDirty(metabuf);
+
+ if (RelationNeedsWAL(rel))
+ zsmeta_wal_log_metapage(metabuf, natts);
+
+ END_CRIT_SECTION();
+ }
+ UnlockReleaseBuffer(metabuf);
+
+ if (rel->rd_amcache != NULL)
+ {
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }
+}
+
+static Page
+zsmeta_initmetapage_internal(int natts)
+{
+ Page page;
+ ZSMetaPageOpaque *opaque;
+ ZSMetaPage *metapg;
+ int new_pd_lower;
+
+ /*
+ * It's possible that we error out when building the metapage, if there
+ * are too many attributes, so work on a temporary copy first, before actually
+ * allocating the buffer.
+ */
+ page = palloc(BLCKSZ);
+ PageInit(page, BLCKSZ, sizeof(ZSMetaPageOpaque));
+
+ opaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_META_PAGE_ID;
+
+ /* UNDO-related fields */
+ opaque->zs_undo_oldestptr.counter = 2; /* start at 2, so that 0 is always "old", and 1 means "dead" */
+ opaque->zs_undo_head = InvalidBlockNumber;
+ opaque->zs_undo_tail = InvalidBlockNumber;
+ opaque->zs_undo_tail_first_counter = 2;
+
+ opaque->zs_fpm_head = InvalidBlockNumber;
+
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ new_pd_lower = (char *) &metapg->tree_root_dir[natts] - (char *) page;
+ if (new_pd_lower > ((PageHeader) page)->pd_upper)
+ {
+ /*
+ * The root block directory must fit on the metapage.
+ *
+ * TODO: We could extend this by overflowing to another page.
+ */
+ elog(ERROR, "too many attributes for zedstore");
+ }
+
+ metapg->nattributes = natts;
+ for (int i = 0; i < natts; i++)
+ {
+ metapg->tree_root_dir[i].root = InvalidBlockNumber;
+ metapg->tree_root_dir[i].fpm_head = InvalidBlockNumber;
+ }
+
+ ((PageHeader) page)->pd_lower = new_pd_lower;
+ return page;
+}
+
+/*
+ * Initialize the metapage for an empty relation.
+ */
+void
+zsmeta_initmetapage(Relation rel)
+{
+ Buffer buf;
+ Page page;
+ int natts = RelationGetNumberOfAttributes(rel) + 1;
+
+ /* Ok, write it out to disk */
+ buf = ReadBuffer(rel, P_NEW);
+ if (BufferGetBlockNumber(buf) != ZS_META_BLK)
+ elog(ERROR, "table is not empty");
+ page = zsmeta_initmetapage_internal(natts);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ START_CRIT_SECTION();
+ PageRestoreTempPage(page, BufferGetPage(buf));
+
+ MarkBufferDirty(buf);
+
+ if (RelationNeedsWAL(rel))
+ zsmeta_wal_log_metapage(buf, natts);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+}
+
+static void
+zsmeta_wal_log_metapage(Buffer buf, int natts)
+{
+ Page page = BufferGetPage(buf);
+ wal_zedstore_init_metapage init_rec;
+ XLogRecPtr recptr;
+
+ init_rec.natts = natts;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &init_rec, SizeOfZSWalInitMetapage);
+ XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_INIT_METAPAGE);
+
+ PageSetLSN(page, recptr);
+}
+
+static void
+zsmeta_wal_log_new_att_root(Buffer metabuf, Buffer rootbuf, AttrNumber attno)
+{
+ Page metapage = BufferGetPage(metabuf);
+ Page rootpage = BufferGetPage(rootbuf);
+ wal_zedstore_btree_new_root xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.attno = attno;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalBtreeNewRoot);
+ XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+ XLogRegisterBuffer(1, rootbuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_BTREE_NEW_ROOT);
+
+ PageSetLSN(metapage, recptr);
+ PageSetLSN(rootpage, recptr);
+}
+
+void
+zsmeta_initmetapage_redo(XLogReaderState *record)
+{
+ Buffer buf;
+
+ /*
+ * Metapage changes are so rare that we rely on full-page images
+ * for replay.
+ */
+ if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+ elog(ERROR, "zedstore metapage init WAL record did not contain a full-page image");
+
+ Assert(BufferGetBlockNumber(buf) == ZS_META_BLK);
+ UnlockReleaseBuffer(buf);
+}
+
+void
+zsmeta_new_btree_root_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_btree_new_root *xlrec =
+ (wal_zedstore_btree_new_root *) XLogRecGetData(record);
+ AttrNumber attno = xlrec->attno;
+ Buffer metabuf;
+ Buffer rootbuf;
+ Page rootpage;
+ BlockNumber rootblk;
+ ZSBtreePageOpaque *opaque;
+
+ rootbuf = XLogInitBufferForRedo(record, 1);
+ rootpage = (Page) BufferGetPage(rootbuf);
+ rootblk = BufferGetBlockNumber(rootbuf);
+ /* initialize the page to look like a root leaf */
+ PageInit(rootpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ opaque = ZSBtreePageGetOpaque(rootpage);
+ opaque->zs_attno = attno;
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_lokey = MinZSTid;
+ opaque->zs_hikey = MaxPlusOneZSTid;
+ opaque->zs_level = 0;
+ opaque->zs_flags = ZSBT_ROOT;
+ opaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ PageSetLSN(rootpage, lsn);
+ MarkBufferDirty(rootbuf);
+
+ /* Update the metapage to point to it */
+ if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage = (Page) BufferGetPage(metabuf);
+ ZSMetaPage *metapg = (ZSMetaPage *) PageGetContents(metapage);
+
+ Assert(BufferGetBlockNumber(metabuf) == ZS_META_BLK);
+ Assert(metapg->tree_root_dir[attno].root == InvalidBlockNumber);
+
+ metapg->tree_root_dir[attno].root = rootblk;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+ UnlockReleaseBuffer(rootbuf);
+}
+
+/*
+ * Get the block number of the b-tree root for given attribute.
+ *
+ * If 'readonly' is true and the root doesn't exist yet (i.e. it's an empty
+ * table), returns InvalidBlockNumber. Otherwise, a new root is allocated if
+ * one doesn't exist yet.
+ */
+BlockNumber
+zsmeta_get_root_for_attribute(Relation rel, AttrNumber attno, bool readonly)
+{
+ Buffer metabuf;
+ ZSMetaPage *metapg;
+ BlockNumber rootblk;
+ ZSMetaCacheData *metacache;
+
+ Assert(attno == ZS_META_ATTRIBUTE_NUM || attno >= 1);
+
+ metacache = zsmeta_get_cache(rel);
+
+ if (RelationGetTargetBlock(rel) == 0 ||
+ RelationGetTargetBlock(rel) == InvalidBlockNumber)
+ {
+ BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (nblocks != 0)
+ metacache = zsmeta_populate_cache(rel);
+ else if (readonly)
+ return InvalidBlockNumber;
+ else
+ {
+ LockRelationForExtension(rel, ExclusiveLock);
+ /*
+ * Confirm number of blocks is still 0 after taking lock,
+ * before initializing a new metapage
+ */
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (nblocks == 0)
+ zsmeta_initmetapage(rel);
+ UnlockRelationForExtension(rel, ExclusiveLock);
+ metacache = zsmeta_populate_cache(rel);
+ }
+ }
+
+ /*
+ * The file has fewer attributes stored than the catalog. This happens when
+ * a column is added with a default value: the value is stored in the
+ * catalog but is absent from the table. Such an attribute must be marked
+ * with atthasmissing.
+ */
+ if (attno >= metacache->cache_nattributes)
+ {
+ if (readonly)
+ {
+ /* re-check */
+ metacache = zsmeta_populate_cache(rel);
+ if (attno >= metacache->cache_nattributes)
+ return InvalidBlockNumber;
+ }
+ else
+ {
+ zsmeta_expand_metapage_for_new_attributes(rel);
+ metacache = zsmeta_populate_cache(rel);
+ }
+ }
+
+ rootblk = metacache->cache_attrs[attno].root;
+
+ /*
+ * Don't believe a cached result that says that the root is empty.
+ * It's possible that it was created after we populated the cache. If the
+ * root block number is out-of-date, that's OK because the caller will
+ * detect that case, but if the tree is missing altogether, the caller
+ * will have nothing to detect and will incorrectly return an empty result.
+ *
+ * XXX: It's inefficient to repopulate the cache here, if we just
+ * did so in the zsmeta_get_cache() call above already.
+ */
+ if (readonly && rootblk == InvalidBlockNumber)
+ {
+ zsmeta_invalidate_cache(rel);
+ metacache = zsmeta_get_cache(rel);
+ rootblk = metacache->cache_attrs[attno].root;
+ }
+
+ if (!readonly && rootblk == InvalidBlockNumber)
+ {
+ /* try to allocate one */
+ Page page;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(metabuf);
+ metapg = (ZSMetaPage *) PageGetContents(page);
+
+ /*
+ * Re-check that the root is still invalid, now that we have the
+ * metapage locked.
+ */
+ rootblk = metapg->tree_root_dir[attno].root;
+ if (rootblk == InvalidBlockNumber)
+ {
+ Buffer rootbuf;
+ Page rootpage;
+ ZSBtreePageOpaque *opaque;
+
+ /*
+ * Release the lock on the metapage while we find a new block, because
+ * that could take a while. (And accessing the Free Page Map might lock
+ * the metapage, too, causing self-deadlock.)
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+ rootbuf = zspage_getnewbuf(rel, attno);
+
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metapg = (ZSMetaPage *) PageGetContents(page);
+ rootblk = metapg->tree_root_dir[attno].root;
+ if (rootblk != InvalidBlockNumber)
+ {
+ /*
+ * Another backend created the root page, while we were busy
+ * finding a free page. We won't need the page we allocated,
+ * after all.
+ */
+ zspage_delete_page(rel, rootbuf, metabuf, attno);
+ }
+ else
+ {
+ rootblk = BufferGetBlockNumber(rootbuf);
+
+ START_CRIT_SECTION();
+
+ metapg->tree_root_dir[attno].root = rootblk;
+
+ /* initialize the page to look like a root leaf */
+ rootpage = BufferGetPage(rootbuf);
+ PageInit(rootpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+ opaque = ZSBtreePageGetOpaque(rootpage);
+ opaque->zs_attno = attno;
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_lokey = MinZSTid;
+ opaque->zs_hikey = MaxPlusOneZSTid;
+ opaque->zs_level = 0;
+ opaque->zs_flags = ZSBT_ROOT;
+ opaque->zs_page_id = ZS_BTREE_PAGE_ID;
+
+ MarkBufferDirty(rootbuf);
+ MarkBufferDirty(metabuf);
+
+ if (RelationNeedsWAL(rel))
+ zsmeta_wal_log_new_att_root(metabuf, rootbuf, attno);
+
+ END_CRIT_SECTION();
+ }
+
+ UnlockReleaseBuffer(rootbuf);
+ }
+ UnlockReleaseBuffer(metabuf);
+
+ metacache->cache_attrs[attno].root = rootblk;
+ }
+
+ return rootblk;
+}
diff --git a/src/backend/access/zedstore/zedstore_simple8b.c b/src/backend/access/zedstore/zedstore_simple8b.c
new file mode 100644
index 0000000000..088b8a25f4
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_simple8b.c
@@ -0,0 +1,391 @@
+/*
+ * zedstore_simple8b.c
+ * Simple-8b encoding for zedstore
+ *
+ * FIXME: This is copy-pasted from src/backend/lib/integerset.c. Some of
+ * the things we do here are not relevant for the use in zedstore, or could
+ * be optimized. For example, EMPTY_CODEWORD is not used.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_simple8b.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_simple8b.h"
+
+/*
+ * Decode an array of Simple-8b codewords, known to contain 'num_integers'
+ * integers.
+ */
+void
+simple8b_decode_words(uint64 *codewords, int num_codewords,
+ uint64 *dst, int num_integers)
+{
+ int total_decoded = 0;
+
+ /* decode all the codewords */
+ for (int i = 0; i < num_codewords; i++)
+ {
+ int num_decoded;
+
+ num_decoded = simple8b_decode(codewords[i], &dst[total_decoded]);
+ total_decoded += num_decoded;
+ }
+ /*
+ * XXX: This error message is a bit specific, but it matches how this
+ * function is actually used, i.e. to decode TIDs, and the number of integers
+ * comes from the item header.
+ */
+ if (total_decoded != num_integers)
+ elog(ERROR, "number of TIDs in codewords did not match the item header");
+}
+
+/*
+ * Simple-8b encoding.
+ *
+ * The simple-8b algorithm packs between 1 and 240 integers into 64-bit words,
+ * called "codewords". The number of integers packed into a single codeword
+ * depends on the integers being packed; small integers are encoded using
+ * fewer bits than large integers. A single codeword can store a single
+ * 60-bit integer, or two 30-bit integers, for example.
+ *
+ * Since we're storing a unique, sorted, set of integers, we actually encode
+ * the *differences* between consecutive integers. That way, clusters of
+ * integers that are close to each other are packed efficiently, regardless
+ * of their absolute values.
+ *
+ * In Simple-8b, each codeword consists of a 4-bit selector, which indicates
+ * how many integers are encoded in the codeword, and the encoded integers are
+ * packed into the remaining 60 bits. The selector allows for 16 different
+ * ways of using the remaining 60 bits, called "modes". The number of integers
+ * packed into a single codeword in each mode is listed in the simple8b_modes
+ * table below. For example, consider the following codeword:
+ *
+ * 20-bit integer 20-bit integer 20-bit integer
+ * 1101 00000000000000010010 01111010000100100000 00000000000000010100
+ * ^
+ * selector
+ *
+ * The selector 1101 is 13 in decimal. From the modes table below, we see
+ * that it means that the codeword encodes three 20-bit integers. In decimal,
+ * those integers are 18, 500000 and 20. Because we encode deltas rather than
+ * absolute values, the actual values that they represent are 18, 500018 and
+ * 500038.
+ *
+ * Modes 0 and 1 are a bit special; they encode a run of 240 or 120 zeroes
+ * (which means 240 or 120 consecutive integers, since we're encoding the
+ * deltas between integers), without using the rest of the codeword bits
+ * for anything.
+ *
+ * Simple-8b cannot encode integers larger than 60 bits. Values larger than
+ * that are always stored in the 'first' field of a leaf item, never in the
+ * packed codeword. If there is a sequence of integers that are more than
+ * 2^60 apart, the codeword will go unused on those items. To represent that,
+ * we use a magic EMPTY_CODEWORD codeword value.
+ */
+static const struct simple8b_mode
+{
+ uint8 bits_per_int;
+ uint8 num_ints;
+} simple8b_modes[17] =
+
+{
+ {0, 240}, /* mode 0: 240 zeroes */
+ {0, 120}, /* mode 1: 120 zeroes */
+ {1, 60}, /* mode 2: sixty 1-bit integers */
+ {2, 30}, /* mode 3: thirty 2-bit integers */
+ {3, 20}, /* mode 4: twenty 3-bit integers */
+ {4, 15}, /* mode 5: fifteen 4-bit integers */
+ {5, 12}, /* mode 6: twelve 5-bit integers */
+ {6, 10}, /* mode 7: ten 6-bit integers */
+ {7, 8}, /* mode 8: eight 7-bit integers (four bits
+ * are wasted) */
+ {8, 7}, /* mode 9: seven 8-bit integers (four bits
+ * are wasted) */
+ {10, 6}, /* mode 10: six 10-bit integers */
+ {12, 5}, /* mode 11: five 12-bit integers */
+ {15, 4}, /* mode 12: four 15-bit integers */
+ {20, 3}, /* mode 13: three 20-bit integers */
+ {30, 2}, /* mode 14: two 30-bit integers */
+ {60, 1}, /* mode 15: one 60-bit integer */
+
+ {0, 0} /* sentinel value */
+};
+
+/*
+ * Maximum number of integers that can be encoded in a single Simple-8b
+ * codeword.
+ */
+#define SIMPLE8B_MAX_VALUES_PER_CODEWORD 240
+
+/*
+ * EMPTY_CODEWORD is a special value, used to indicate "no values".
+ * It is used if the next value is too large to be encoded with Simple-8b.
+ *
+ * This value looks like a mode-0 codeword, but we can distinguish it
+ * because a regular mode-0 codeword would have zeroes in the unused bits.
+ */
+#define EMPTY_CODEWORD UINT64CONST(0x0FFFFFFFFFFFFFFF)
+
+/*
+ * Encode a number of integers into a Simple-8b codeword.
+ *
+ * Returns the encoded codeword, and sets *num_encoded to the number of
+ * input integers that were encoded. That can be zero, if the first delta
+ * is too large to be encoded.
+ */
+uint64
+simple8b_encode(const uint64 *ints, int num_ints, int *num_encoded)
+{
+ int selector;
+ int nints;
+ int bits;
+ uint64 val;
+ uint64 codeword;
+ int i;
+
+ /*
+ * Select the "mode" to use for this codeword.
+ *
+ * In each iteration, check if the next value can be represented in the
+ * current mode we're considering. If it's too large, then step up the
+ * mode to a wider one, and repeat. If it fits, move on to the next
+ * integer. Repeat until the codeword is full, given the current mode.
+ *
+ * Note that we don't have any way to represent unused slots in the
+ * codeword, so we require each codeword to be "full". It is always
+ * possible to produce a full codeword unless the very first delta is too
+ * large to be encoded. For example, if the first delta is small but the
+ * second is too large to be encoded, we'll end up using the last "mode",
+ * which has nints == 1.
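+ *
+ * For example, with the delta sequence {0, 1, 1, 1} and num_ints == 4, we
+ * accept the 0 in mode 0, bump to mode 2 when we hit the first 1, accept
+ * the remaining 1s, and then run out of input. The "too large" sentinel
+ * value then walks the selector up until i >= nints, which happens at
+ * mode 12 (four 15-bit integers), so all four deltas end up in a single
+ * codeword and *num_encoded is 4.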
+ */
+ selector = 0;
+ nints = simple8b_modes[0].num_ints;
+ bits = simple8b_modes[0].bits_per_int;
+ val = ints[0];
+ i = 0; /* number of deltas we have accepted */
+ for (;;)
+ {
+ if (val >= (UINT64CONST(1) << bits))
+ {
+ /* too large, step up to next mode */
+ selector++;
+ nints = simple8b_modes[selector].num_ints;
+ bits = simple8b_modes[selector].bits_per_int;
+ /* we might already have accepted enough deltas for this mode */
+ if (i >= nints)
+ break;
+ }
+ else
+ {
+ /* accept this delta; then done if codeword is full */
+ i++;
+ if (i >= nints)
+ break;
+ /* examine next delta */
+ if (i < num_ints)
+ val = ints[i];
+ else
+ {
+ /*
+ * Reached end of input. Pretend that the next integer is a
+ * value that's too large to represent in Simple-8b, so that
+ * we fall out.
+ */
+ val = PG_UINT64_MAX;
+ }
+ }
+ }
+
+ if (nints == 0)
+ {
+ /*
+ * The first delta is too large to be encoded with Simple-8b.
+ *
+ * If there is at least one not-too-large integer in the input, we
+ * will encode it using mode 15 (or a more compact mode). Hence, we
+ * can only get here if the *first* delta is >= 2^60.
+ */
+ Assert(i == 0);
+ *num_encoded = 0;
+ return EMPTY_CODEWORD;
+ }
+
+ /*
+ * Encode the integers using the selected mode. Note that we shift them
+ * into the codeword in reverse order, so that they will come out in the
+ * correct order in the decoder.
+ */
+ codeword = 0;
+ if (bits > 0)
+ {
+ for (i = nints - 1; i > 0; i--)
+ {
+ val = ints[i];
+ codeword |= val;
+ codeword <<= bits;
+ }
+ val = ints[0];
+ codeword |= val;
+ }
+
+ /* add selector to the codeword, and return */
+ codeword |= (uint64) selector << 60;
+
+ *num_encoded = nints;
+ return codeword;
+}
+
+/*
+ * Encode a repeating sequence of integers into a Simple-8b codeword.
+ *
+ * This is a special version of simple8b_encode, where the first input
+ * integer is 'firstint', followed by a number of 'secondint'. This is
+ * equivalent to calling simple8b_encode() with an input array:
+ *
+ * ints[0]: firstint
+ * ints[1]: secondint
+ * ints[2]: secondint
+ * ...
+ * ints[num_ints - 1]: secondint
+ *
+ *
+ * We need that when doing a multi-insert, and it seems nice to have a
+ * specialized version for that, for speed, but also to keep the calling
+ * code simpler, so that it doesn't need to construct an input array.
+ *
+ * TODO: This is just copy-pasted from simple8b_encode, but since we know
+ * what the input is, we could probably optimize this further.
+ */
+uint64
+simple8b_encode_consecutive(const uint64 firstint, const uint64 secondint, int num_ints,
+ int *num_encoded)
+{
+ int selector;
+ int nints;
+ int bits;
+ uint64 val;
+ uint64 codeword;
+ int i;
+
+ /*
+ * Select the "mode" to use for this codeword.
+ *
+ * In each iteration, check if the next value can be represented in the
+ * current mode we're considering. If it's too large, then step up the
+ * mode to a wider one, and repeat. If it fits, move on to the next
+ * integer. Repeat until the codeword is full, given the current mode.
+ *
+ * Note that we don't have any way to represent unused slots in the
+ * codeword, so we require each codeword to be "full". It is always
+ * possible to produce a full codeword unless the very first delta is too
+ * large to be encoded. For example, if the first delta is small but the
+ * second is too large to be encoded, we'll end up using the last "mode",
+ * which has nints == 1.
+ */
+ selector = 0;
+ nints = simple8b_modes[0].num_ints;
+ bits = simple8b_modes[0].bits_per_int;
+ val = firstint;
+ i = 0; /* number of deltas we have accepted */
+ for (;;)
+ {
+ if (val >= (UINT64CONST(1) << bits))
+ {
+ /* too large, step up to next mode */
+ selector++;
+ nints = simple8b_modes[selector].num_ints;
+ bits = simple8b_modes[selector].bits_per_int;
+ /* we might already have accepted enough deltas for this mode */
+ if (i >= nints)
+ break;
+ }
+ else
+ {
+ /* accept this delta; then done if codeword is full */
+ i++;
+ if (i >= nints)
+ break;
+ /* examine next delta */
+ if (i < num_ints)
+ val = secondint;
+ else
+ {
+ /*
+ * Reached end of input. Pretend that the next integer is a
+ * value that's too large to represent in Simple-8b, so that
+ * we fall out.
+ */
+ val = PG_UINT64_MAX;
+ }
+ }
+ }
+
+ if (nints == 0)
+ {
+ /*
+ * The first delta is too large to be encoded with Simple-8b.
+ *
+ * If there is at least one not-too-large integer in the input, we
+ * will encode it using mode 15 (or a more compact mode). Hence, we
+ * can only get here if the *first* delta is >= 2^60.
+ */
+ Assert(i == 0);
+ *num_encoded = 0;
+ return EMPTY_CODEWORD;
+ }
+
+ /*
+ * Encode the integers using the selected mode. Note that we shift them
+ * into the codeword in reverse order, so that they will come out in the
+ * correct order in the decoder.
+ */
+ codeword = 0;
+ if (bits > 0)
+ {
+ for (i = nints - 1; i > 0; i--)
+ {
+ val = secondint;
+ codeword |= val;
+ codeword <<= bits;
+ }
+ val = firstint;
+ codeword |= val;
+ }
+
+ /* add selector to the codeword, and return */
+ codeword |= (uint64) selector << 60;
+
+ *num_encoded = nints;
+ return codeword;
+}
+
+/*
+ * Decode a codeword into an array of integers.
+ * Returns the number of integers decoded.
+ */
+int
+simple8b_decode(uint64 codeword, uint64 *decoded)
+{
+ int selector = (codeword >> 60);
+ int nints = simple8b_modes[selector].num_ints;
+ int bits = simple8b_modes[selector].bits_per_int;
+ uint64 mask = (UINT64CONST(1) << bits) - 1;
+
+ if (codeword == EMPTY_CODEWORD)
+ return 0;
+
+ for (int i = 0; i < nints; i++)
+ {
+ uint64 val = codeword & mask;
+
+ decoded[i] = val;
+ codeword >>= bits;
+ }
+
+ return nints;
+}
diff --git a/src/backend/access/zedstore/zedstore_tid.c b/src/backend/access/zedstore/zedstore_tid.c
new file mode 100644
index 0000000000..45c2385815
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_tid.c
@@ -0,0 +1,159 @@
+/*
+ * zedstore_tid.c
+ * Functions for the built-in type zstid
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tid.c
+ */
+
+#include "postgres.h"
+
+#include "access/zedstore_tid.h"
+#include "storage/itemptr.h"
+#include "utils/fmgrprotos.h"
+#include "utils/int8.h"
+
+
+Datum
+tidtozstid(PG_FUNCTION_ARGS)
+{
+ ItemPointerData *arg = PG_GETARG_ITEMPOINTER(0);
+ zstid tid = ZSTidFromItemPointer(*arg);
+
+ PG_RETURN_ZSTID(tid);
+}
+
+Datum
+zstidin(PG_FUNCTION_ARGS)
+{
+ char *str = PG_GETARG_CSTRING(0);
+ int64 result;
+
+ (void) scanint8(str, false, &result);
+
+ if (result > MaxZSTid || result < MinZSTid)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value \"%s\" is out of range for type %s",
+ str, "zstid")));
+
+ PG_RETURN_INT64(result);
+}
+
+Datum
+zstidout(PG_FUNCTION_ARGS)
+{
+ zstid tid = PG_GETARG_ZSTID(0);
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), UINT64_FORMAT, tid);
+
+ PG_RETURN_CSTRING(pstrdup(buf));
+}
+
+Datum
+zstidlt(PG_FUNCTION_ARGS)
+{
+ zstid tid1 = PG_GETARG_ZSTID(0);
+ zstid tid2 = PG_GETARG_ZSTID(1);
+
+ PG_RETURN_BOOL(tid1 < tid2);
+}
+
+Datum
+zstidgt(PG_FUNCTION_ARGS)
+{
+ zstid tid1 = PG_GETARG_ZSTID(0);
+ zstid tid2 = PG_GETARG_ZSTID(1);
+
+ PG_RETURN_BOOL(tid1 > tid2);
+}
+
+Datum
+zstideq(PG_FUNCTION_ARGS)
+{
+ zstid tid1 = PG_GETARG_ZSTID(0);
+ zstid tid2 = PG_GETARG_ZSTID(1);
+
+ PG_RETURN_BOOL(tid1 == tid2);
+}
+
+Datum
+zstidle(PG_FUNCTION_ARGS)
+{
+ zstid tid1 = PG_GETARG_ZSTID(0);
+ zstid tid2 = PG_GETARG_ZSTID(1);
+
+ PG_RETURN_BOOL(tid1 <= tid2);
+}
+
+Datum
+zstidge(PG_FUNCTION_ARGS)
+{
+ zstid tid1 = PG_GETARG_ZSTID(0);
+ zstid tid2 = PG_GETARG_ZSTID(1);
+
+ PG_RETURN_BOOL(tid1 >= tid2);
+}
+
+Datum
+zstidne(PG_FUNCTION_ARGS)
+{
+ zstid tid1 = PG_GETARG_ZSTID(0);
+ zstid tid2 = PG_GETARG_ZSTID(1);
+
+ PG_RETURN_BOOL(tid1 != tid2);
+}
+
+Datum
+int2tozstid(PG_FUNCTION_ARGS)
+{
+ int16 arg = PG_GETARG_INT16(0);
+
+ if (arg < MinZSTid)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value %d is out of range for type %s",
+ arg, "zstid")));
+
+ PG_RETURN_ZSTID((zstid) arg);
+}
+
+Datum
+int4tozstid(PG_FUNCTION_ARGS)
+{
+ int32 arg = PG_GETARG_INT32(0);
+
+ if (arg < MinZSTid)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value %d is out of range for type %s",
+ arg, "zstid")));
+
+ PG_RETURN_ZSTID((zstid) arg);
+}
+
+Datum
+int8tozstid(PG_FUNCTION_ARGS)
+{
+ int64 arg = PG_GETARG_INT64(0);
+
+ if (arg > MaxZSTid || arg < MinZSTid)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value " INT64_FORMAT " is out of range for type %s",
+ arg, "zstid")));
+
+ PG_RETURN_ZSTID((zstid) arg);
+}
+
+Datum
+zstidtoint8(PG_FUNCTION_ARGS)
+{
+ zstid arg = PG_GETARG_ZSTID(0);
+
+ PG_RETURN_INT64((int64) arg);
+}
diff --git a/src/backend/access/zedstore/zedstore_tiditem.c b/src/backend/access/zedstore/zedstore_tiditem.c
new file mode 100644
index 0000000000..33c6c2097e
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_tiditem.c
@@ -0,0 +1,936 @@
+/*
+ * zedstore_tiditem.c
+ * Routines for packing TIDs into "items"
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tiditem.c
+ */
+#include "postgres.h"
+
+#include "access/zedstore_internal.h"
+#include "access/zedstore_simple8b.h"
+
+static int remap_slots(uint8 *slotnos, int num_tids,
+ ZSUndoRecPtr *orig_slots, int num_orig_slots,
+ int target_idx, ZSUndoRecPtr target_ptr,
+ ZSUndoRecPtr *new_slots,
+ int *new_num_slots,
+ uint8 *new_slotnos,
+ ZSUndoRecPtr recent_oldest_undo);
+static ZSTidArrayItem *build_item(zstid *tids, uint64 *deltas, uint8 *slotnos, int num_tids,
+ ZSUndoRecPtr *slots, int num_slots);
+
+static void deltas_to_tids(zstid firsttid, uint64 *deltas, int num_tids, zstid *tids);
+static void slotwords_to_slotnos(uint64 *slotwords, int num_tids, uint8 *slotnos);
+static int binsrch_tid_array(zstid key, zstid *arr, int arr_elems);
+
+/*
+ * Extract TIDs from an item into iterator.
+ */
+void
+zsbt_tid_item_unpack(ZSTidArrayItem *item, ZSTidItemIterator *iter)
+{
+ ZSUndoRecPtr *slots;
+ int num_tids;
+ uint64 *slotwords;
+ uint64 *codewords;
+
+ if (iter->tids_allocated_size < item->t_num_tids)
+ {
+ if (iter->tids)
+ pfree(iter->tids);
+ if (iter->tid_undoslotnos)
+ pfree(iter->tid_undoslotnos);
+ iter->tids = MemoryContextAlloc(iter->context, item->t_num_tids * sizeof(zstid));
+ iter->tid_undoslotnos = MemoryContextAlloc(iter->context, item->t_num_tids * sizeof(uint8));
+ iter->tids_allocated_size = item->t_num_tids;
+ }
+
+ ZSTidArrayItemDecode(item, &codewords, &slots, &slotwords);
+ num_tids = item->t_num_tids;
+
+ /* decode all the codewords */
+ simple8b_decode_words(codewords, item->t_num_codewords, iter->tids, num_tids);
+
+ /* convert the deltas to TIDs */
+ deltas_to_tids(item->t_firsttid, iter->tids, num_tids, iter->tids);
+ iter->num_tids = num_tids;
+ Assert(iter->tids[num_tids - 1] == item->t_endtid - 1);
+
+ /* Expand slotwords to slotnos */
+ slotwords_to_slotnos(slotwords, num_tids, iter->tid_undoslotnos);
+
+ /* also copy out the slots to the iterator */
+ InvalidateUndoVisibility(&iter->visi_infos[ZSBT_OLD_UNDO_SLOT]);
+ InvalidateUndoVisibility(&iter->visi_infos[ZSBT_DEAD_UNDO_SLOT]);
+ iter->visi_infos[ZSBT_DEAD_UNDO_SLOT].undoptr = DeadUndoPtr;
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < item->t_num_undo_slots; i++)
+ {
+ InvalidateUndoVisibility(&iter->visi_infos[i]);
+ iter->visi_infos[i].undoptr = slots[i - ZSBT_FIRST_NORMAL_UNDO_SLOT];
+ }
+}
+
+/*
+ * Create a ZSTidArrayItem (or items), to represent a range of contiguous TIDs,
+ * all with the same UNDO pointer.
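+ *
+ * A single item can hold only a limited number of TIDs and codewords
+ * (ZSBT_MAX_ITEM_TIDS / ZSBT_MAX_ITEM_CODEWORDS), so a large enough range
+ * is split across multiple items; hence the List return type.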
+ */
+List *
+zsbt_tid_item_create_for_range(zstid tid, int nelements, ZSUndoRecPtr undo_ptr)
+{
+ uint64 total_encoded;
+ List *newitems = NIL;
+ uint64 codewords[ZSBT_MAX_ITEM_CODEWORDS];
+ int num_slots;
+ int slotno;
+
+ Assert(undo_ptr.counter != DeadUndoPtr.counter);
+ if (IsZSUndoRecPtrValid(&undo_ptr))
+ {
+ slotno = ZSBT_FIRST_NORMAL_UNDO_SLOT;
+ num_slots = ZSBT_FIRST_NORMAL_UNDO_SLOT + 1;
+ }
+ else
+ {
+ slotno = ZSBT_OLD_UNDO_SLOT;
+ num_slots = ZSBT_FIRST_NORMAL_UNDO_SLOT;
+ }
+
+ total_encoded = 0;
+ while (total_encoded < nelements)
+ {
+ ZSTidArrayItem *newitem;
+ Size itemsz;
+ int num_codewords;
+ int num_tids;
+ zstid firsttid = tid + total_encoded;
+ uint64 first_delta;
+ uint64 second_delta;
+ ZSUndoRecPtr *newitem_slots;
+ uint64 *slotword_p;
+ uint64 *newitem_slotwords;
+ uint64 *newitem_codewords;
+ int i;
+
+ /*
+ * The first delta is 0, because the first TID is stored implicitly as
+ * t_firsttid. The rest are at distance 1 from the previous TID.
+ */
+ first_delta = 0;
+ second_delta = 1;
+ num_tids = 0;
+ for (num_codewords = 0;
+ num_codewords < ZSBT_MAX_ITEM_CODEWORDS && total_encoded < nelements && num_tids < ZSBT_MAX_ITEM_TIDS;
+ num_codewords++)
+ {
+ uint64 codeword;
+ int num_encoded;
+
+ codeword = simple8b_encode_consecutive(first_delta, second_delta,
+ nelements - total_encoded,
+ &num_encoded);
+ if (num_encoded == 0)
+ break;
+
+ codewords[num_codewords] = codeword;
+ total_encoded += num_encoded;
+ num_tids += num_encoded;
+ first_delta = 1;
+ }
+
+ itemsz = SizeOfZSTidArrayItem(num_tids, num_slots, num_codewords);
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_num_tids = num_tids;
+ newitem->t_num_undo_slots = num_slots;
+ newitem->t_num_codewords = num_codewords;
+ newitem->t_firsttid = firsttid;
+ newitem->t_endtid = tid + total_encoded;
+
+ ZSTidArrayItemDecode(newitem, &newitem_codewords, &newitem_slots, &newitem_slotwords);
+
+ /* Fill in undo slots */
+ if (slotno == ZSBT_FIRST_NORMAL_UNDO_SLOT)
+ {
+ Assert(num_slots == ZSBT_FIRST_NORMAL_UNDO_SLOT + 1);
+ newitem_slots[0] = undo_ptr;
+ }
+
+ /* Fill in slotwords */
+ i = 0;
+ slotword_p = newitem_slotwords;
+ while (i < num_tids)
+ {
+ uint64 slotword;
+
+ slotword = 0;
+ for (int j = 0; j < ZSBT_SLOTNOS_PER_WORD && i < num_tids; j++)
+ {
+ slotword |= (uint64) slotno << (j * ZSBT_ITEM_UNDO_SLOT_BITS);
+ i++;
+ }
+ *(slotword_p++) = slotword;
+ }
+
+ /* Fill in TID codewords */
+ for (int i = 0; i < num_codewords; i++)
+ newitem_codewords[i] = codewords[i];
+
+ newitems = lappend(newitems, newitem);
+ }
+
+ return newitems;
+}
+
+/*
+ * Add a range of contiguous TIDs to an existing item.
+ *
+ * If all the new TIDs can be merged with the existing item, returns a List
+ * with a single element, containing the new combined item that covers all
+ * the existing TIDs, and the new TIDs. *modified_orig is set to true.
+ *
+ * If some of the new TIDs can be merged with the existing item, returns a
+ * List with more than one item. The returned items together replace the
+ * original item, such that all the existing TIDs and all the new TIDs are
+ * covered. *modified_orig is set to true in that case, too.
+ *
+ * If the new TIDs could not be merged with the existing item, returns a list
+ * of new items to represent the new TIDs, just like
+ * zsbt_tid_item_create_for_range(), and *modified_orig is set to false.
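+ *
+ * For example, if the original item covers TIDs 1-1000 and TIDs 1001-2000
+ * are added, and everything fits within the item's codeword and UNDO-slot
+ * limits, the result is a single item covering TIDs 1-2000, with
+ * *modified_orig set to true.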
+ */
+List *
+zsbt_tid_item_add_tids(ZSTidArrayItem *orig, zstid firsttid, int nelements,
+ ZSUndoRecPtr undo_ptr, bool *modified_orig)
+{
+ int num_slots;
+ int num_new_codewords;
+ uint64 new_codewords[ZSBT_MAX_ITEM_CODEWORDS];
+ ZSUndoRecPtr *orig_slots;
+ uint64 *orig_slotwords;
+ uint64 *orig_codewords;
+ int slotno;
+ uint64 first_delta;
+ uint64 second_delta;
+ int total_new_encoded;
+ Size itemsz;
+ ZSTidArrayItem *newitem;
+ ZSUndoRecPtr *newitem_slots;
+ uint64 *newitem_slotwords;
+ uint64 *newitem_codewords;
+ List *newitems;
+ int num_tids;
+ ZSUndoRecPtr *dst_slot;
+ uint64 *dst_slotword;
+ uint64 *dst_codeword;
+ int i;
+ int j;
+
+ if (orig == NULL)
+ {
+ *modified_orig = false;
+ return zsbt_tid_item_create_for_range(firsttid, nelements, undo_ptr);
+ }
+
+ /* Quick check to see if we can add the new TIDs to the previous item */
+ Assert(orig->t_endtid <= firsttid);
+
+ /*
+ * Is there room for a new codeword? Currently, we don't try to add tids to the
+ * last existing codeword, even if we perhaps could.
+ */
+ if (orig->t_num_codewords >= ZSBT_MAX_ITEM_CODEWORDS)
+ {
+ *modified_orig = false;
+ return zsbt_tid_item_create_for_range(firsttid, nelements, undo_ptr);
+ }
+
+ ZSTidArrayItemDecode(orig, &orig_codewords, &orig_slots, &orig_slotwords);
+
+ /* Is there an UNDO slot we can use? */
+ Assert(undo_ptr.counter != DeadUndoPtr.counter);
+ if (!IsZSUndoRecPtrValid(&undo_ptr))
+ {
+ slotno = ZSBT_OLD_UNDO_SLOT;
+ num_slots = orig->t_num_undo_slots;
+ }
+ else
+ {
+ for (slotno = ZSBT_FIRST_NORMAL_UNDO_SLOT; slotno < orig->t_num_undo_slots; slotno++)
+ {
+ if (orig_slots[slotno - ZSBT_FIRST_NORMAL_UNDO_SLOT].counter == undo_ptr.counter)
+ break;
+ }
+ if (slotno >= ZSBT_MAX_ITEM_UNDO_SLOTS)
+ {
+ *modified_orig = false;
+ return zsbt_tid_item_create_for_range(firsttid, nelements, undo_ptr);
+ }
+
+ if (slotno >= orig->t_num_undo_slots)
+ num_slots = orig->t_num_undo_slots + 1;
+ else
+ num_slots = orig->t_num_undo_slots;
+ }
+
+ /* ok, go ahead, create as many new codewords as fits, or is needed. */
+ first_delta = firsttid - orig->t_endtid + 1;
+ second_delta = 1;
+ total_new_encoded = 0;
+ num_new_codewords = 0;
+ while (num_new_codewords < ZSBT_MAX_ITEM_CODEWORDS - orig->t_num_codewords &&
+ total_new_encoded < nelements && orig->t_num_tids + total_new_encoded < ZSBT_MAX_ITEM_TIDS)
+ {
+ uint64 codeword;
+ int num_encoded;
+
+ codeword = simple8b_encode_consecutive(first_delta,
+ second_delta,
+ nelements - total_new_encoded,
+ &num_encoded);
+ if (num_encoded == 0)
+ break;
+
+ new_codewords[num_new_codewords] = codeword;
+ first_delta = 1;
+ num_new_codewords++;
+ total_new_encoded += num_encoded;
+ }
+
+ if (num_new_codewords == 0)
+ {
+ *modified_orig = false;
+ return zsbt_tid_item_create_for_range(firsttid, nelements, undo_ptr);
+ }
+
+ num_tids = orig->t_num_tids + total_new_encoded;
+
+ itemsz = SizeOfZSTidArrayItem(num_tids, num_slots, orig->t_num_codewords + num_new_codewords);
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_num_undo_slots = num_slots;
+ newitem->t_num_codewords = orig->t_num_codewords + num_new_codewords;
+ newitem->t_firsttid = orig->t_firsttid;
+ newitem->t_endtid = firsttid + total_new_encoded;
+ newitem->t_num_tids = num_tids;
+
+ ZSTidArrayItemDecode(newitem, &newitem_codewords, &newitem_slots, &newitem_slotwords);
+
+ /* copy existing codewords, followed by new ones */
+ dst_codeword = newitem_codewords;
+ for (int i = 0; i < orig->t_num_codewords; i++)
+ *(dst_codeword++) = orig_codewords[i];
+ for (int i = 0; i < num_new_codewords; i++)
+ *(dst_codeword++) = new_codewords[i];
+
+ /* copy existing UNDO slots, followed by new slot, if any */
+ dst_slot = newitem_slots;
+ for (i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < orig->t_num_undo_slots; i++)
+ *(dst_slot++) = orig_slots[i - ZSBT_FIRST_NORMAL_UNDO_SLOT];
+ if (num_slots > orig->t_num_undo_slots)
+ *(dst_slot++) = undo_ptr;
+
+ /*
+ * Copy and build slotwords
+ */
+ dst_slotword = newitem_slotwords;
+ /* copy full original slotwords as is */
+ for (i = 0; i < orig->t_num_tids / ZSBT_SLOTNOS_PER_WORD; i++)
+ *(dst_slotword++) = orig_slotwords[i];
+
+ /* add to the last, partial slotword. */
+ i = orig->t_num_tids;
+ j = orig->t_num_tids % ZSBT_SLOTNOS_PER_WORD;
+ if (j != 0)
+ {
+ uint64 slotword = orig_slotwords[orig->t_num_tids / ZSBT_SLOTNOS_PER_WORD];
+
+ for (; j < ZSBT_SLOTNOS_PER_WORD && i < num_tids; j++)
+ {
+ slotword |= (uint64) slotno << (j * ZSBT_ITEM_UNDO_SLOT_BITS);
+ i++;
+ }
+ *(dst_slotword++) = slotword;
+ }
+
+ /* new slotwords */
+ while (i < num_tids)
+ {
+ uint64 slotword = 0;
+
+ for (j = 0; j < ZSBT_SLOTNOS_PER_WORD && i < num_tids; j++)
+ {
+ slotword |= (uint64) slotno << (j * ZSBT_ITEM_UNDO_SLOT_BITS);
+ i++;
+ }
+ *(dst_slotword++) = slotword;
+ }
+ Assert(dst_slotword == newitem_slotwords + ZSBT_NUM_SLOTWORDS(num_tids));
+
+ /* Create more items for the remainder, if needed */
+ *modified_orig = true;
+ if (total_new_encoded < nelements)
+ newitems = zsbt_tid_item_create_for_range(newitem->t_endtid,
+ nelements - total_new_encoded,
+ undo_ptr);
+ else
+ newitems = NIL;
+ newitems = lcons(newitem, newitems);
+ return newitems;
+}
+
+/*
+ * Change the UNDO pointer of a tuple with TID 'target_tid', inside an item.
+ *
+ * Returns an item, or multiple items, to replace the original one.
+ */
+List *
+zsbt_tid_item_change_undoptr(ZSTidArrayItem *orig, zstid target_tid, ZSUndoRecPtr undoptr,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ uint64 *deltas;
+ zstid *tids;
+ int num_tids = orig->t_num_tids;
+ int target_idx = -1;
+ ZSUndoRecPtr *orig_slots_partial;
+ ZSUndoRecPtr orig_slots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ uint64 *orig_slotwords;
+ uint64 *orig_codewords;
+ List *newitems;
+ int new_slotno;
+
+ deltas = palloc(sizeof(uint64) * num_tids);
+ tids = palloc(sizeof(zstid) * num_tids);
+
+ ZSTidArrayItemDecode(orig, &orig_codewords, &orig_slots_partial, &orig_slotwords);
+
+ /* decode the codewords, to find the target TID */
+ simple8b_decode_words(orig_codewords, orig->t_num_codewords, deltas, num_tids);
+
+ deltas_to_tids(orig->t_firsttid, deltas, num_tids, tids);
+
+ target_idx = binsrch_tid_array(target_tid, tids, num_tids);
+ Assert(tids[target_idx] == target_tid);
+
+ /* Ok, we know the target TID now. Can we use one of the existing UNDO slots? */
+ new_slotno = -1;
+ if (undoptr.counter == DeadUndoPtr.counter)
+ new_slotno = ZSBT_DEAD_UNDO_SLOT;
+ if (new_slotno == -1 && undoptr.counter < recent_oldest_undo.counter)
+ new_slotno = ZSBT_OLD_UNDO_SLOT;
+
+ orig_slots[ZSBT_OLD_UNDO_SLOT] = InvalidUndoPtr;
+ orig_slots[ZSBT_DEAD_UNDO_SLOT] = DeadUndoPtr;
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < orig->t_num_undo_slots; i++)
+ orig_slots[i] = orig_slots_partial[i - ZSBT_FIRST_NORMAL_UNDO_SLOT];
+
+ if (new_slotno == -1)
+ {
+ for (int i = 0; i < orig->t_num_undo_slots; i++)
+ {
+ if (orig_slots[i].counter == undoptr.counter)
+ {
+ /* We can reuse this existing slot for the target. */
+ new_slotno = i;
+ }
+ }
+ }
+ if (new_slotno == -1 && orig->t_num_undo_slots < ZSBT_MAX_ITEM_UNDO_SLOTS)
+ {
+ /* There's a free slot we can use for the target */
+ new_slotno = orig->t_num_undo_slots;
+ }
+
+ if (new_slotno != -1)
+ {
+ int num_slots;
+ Size itemsz;
+ ZSTidArrayItem *newitem;
+ ZSUndoRecPtr *newitem_slots;
+ uint64 *newitem_slotwords;
+ uint64 *newitem_codewords;
+
+ num_slots = orig->t_num_undo_slots;
+ if (new_slotno == orig->t_num_undo_slots)
+ num_slots++;
+
+ /* Simple case */
+ itemsz = SizeOfZSTidArrayItem(orig->t_num_tids, num_slots, orig->t_num_codewords);
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_num_undo_slots = num_slots;
+ newitem->t_num_codewords = orig->t_num_codewords;
+ newitem->t_firsttid = orig->t_firsttid;
+ newitem->t_endtid = orig->t_endtid;
+ newitem->t_num_tids = orig->t_num_tids;
+
+ ZSTidArrayItemDecode(newitem, &newitem_codewords, &newitem_slots, &newitem_slotwords);
+
+ /* copy codewords. They're unmodified. */
+ for (int i = 0; i < orig->t_num_codewords; i++)
+ newitem_codewords[i] = orig_codewords[i];
+
+ /* copy existing slots, followed by new slot, if any */
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < orig->t_num_undo_slots; i++)
+ newitem_slots[i - ZSBT_FIRST_NORMAL_UNDO_SLOT] = orig_slots[i];
+ if (new_slotno == orig->t_num_undo_slots)
+ newitem_slots[new_slotno - ZSBT_FIRST_NORMAL_UNDO_SLOT] = undoptr;
+
+ /* copy slotwords */
+ for (int i = 0; i < ZSBT_NUM_SLOTWORDS(orig->t_num_tids); i++)
+ {
+ uint64 slotword;
+
+ slotword = orig_slotwords[i];
+
+ if (target_idx / ZSBT_SLOTNOS_PER_WORD == i)
+ {
+ /* this slotword contains the target TID */
+ int shift = (target_idx % ZSBT_SLOTNOS_PER_WORD) * ZSBT_ITEM_UNDO_SLOT_BITS;
+ uint64 mask;
+
+ mask = ((UINT64CONST(1) << ZSBT_ITEM_UNDO_SLOT_BITS) - 1) << shift;
+
+ slotword &= ~mask;
+ slotword |= (uint64) new_slotno << shift;
+ }
+
+ newitem_slotwords[i] = slotword;
+ }
+
+ newitems = list_make1(newitem);
+ }
+ else
+ {
+ /* Have to remap the slots. */
+ uint8 *slotnos;
+ ZSUndoRecPtr tmp_slots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ uint8 *tmp_slotnos;
+ int idx;
+
+ slotnos = palloc(orig->t_num_tids * sizeof(uint8));
+ slotwords_to_slotnos(orig_slotwords, orig->t_num_tids, slotnos);
+
+ tmp_slotnos = palloc(orig->t_num_tids * sizeof(uint8));
+
+ /* reconstruct items */
+ idx = 0;
+ newitems = NIL;
+ while (idx < orig->t_num_tids)
+ {
+ ZSTidArrayItem *newitem;
+ int num_remapped;
+ int num_tmp_slots;
+
+ num_remapped = remap_slots(&slotnos[idx], orig->t_num_tids - idx,
+ orig_slots, orig->t_num_undo_slots,
+ target_idx - idx, undoptr,
+ tmp_slots, &num_tmp_slots,
+ tmp_slotnos,
+ recent_oldest_undo);
+
+ deltas[idx] = 0;
+ newitem = build_item(&tids[idx], &deltas[idx], tmp_slotnos, num_remapped,
+ tmp_slots, num_tmp_slots);
+
+ newitems = lappend(newitems, newitem);
+ idx += newitem->t_num_tids;
+ }
+
+ pfree(slotnos);
+ pfree(tmp_slotnos);
+ }
+
+ pfree(deltas);
+ pfree(tids);
+
+ return newitems;
+}
+
+/*
+ * Completely remove a number of TIDs from an item. (for vacuum)
+ */
+List *
+zsbt_tid_item_remove_tids(ZSTidArrayItem *orig, zstid *nexttid, IntegerSet *remove_tids,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ ZSUndoRecPtr *orig_slots_partial;
+ ZSUndoRecPtr orig_slots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ uint64 *orig_slotwords;
+ uint64 *orig_codewords;
+ int total_remain;
+ uint64 *deltas;
+ zstid *tids;
+ int nelements = orig->t_num_tids;
+ List *newitems = NIL;
+ zstid tid;
+ zstid prev_tid;
+ int idx;
+ uint8 *slotnos;
+
+ deltas = palloc(sizeof(uint64) * nelements);
+ tids = palloc(sizeof(zstid) * nelements);
+ slotnos = palloc(sizeof(uint8) * nelements);
+
+ ZSTidArrayItemDecode(orig, &orig_codewords, &orig_slots_partial, &orig_slotwords);
+
+ /* decode all the codewords */
+ simple8b_decode_words(orig_codewords, orig->t_num_codewords, deltas, orig->t_num_tids);
+
+ /* also decode the slotwords */
+ orig_slots[ZSBT_OLD_UNDO_SLOT] = InvalidUndoPtr;
+ orig_slots[ZSBT_DEAD_UNDO_SLOT] = DeadUndoPtr;
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < orig->t_num_undo_slots; i++)
+ orig_slots[i] = orig_slots_partial[i - ZSBT_FIRST_NORMAL_UNDO_SLOT];
+
+ idx = 0;
+ while (idx < orig->t_num_tids)
+ {
+ uint64 slotword = orig_slotwords[idx / ZSBT_SLOTNOS_PER_WORD];
+
+ for (int j = 0; j < ZSBT_SLOTNOS_PER_WORD && idx < orig->t_num_tids; j++)
+ {
+ slotnos[idx++] = slotword & ((UINT64CONST(1) << ZSBT_ITEM_UNDO_SLOT_BITS) - 1);
+ slotword >>= ZSBT_ITEM_UNDO_SLOT_BITS;
+ }
+ }
+
+ /*
+ * Remove all the TIDs we can
+ */
+ total_remain = 0;
+ tid = orig->t_firsttid;
+ prev_tid = tid;
+ for (int i = 0; i < orig->t_num_tids; i++)
+ {
+ uint64 delta = deltas[i];
+
+ tid += delta;
+
+ while (*nexttid < tid)
+ {
+ if (!intset_iterate_next(remove_tids, nexttid))
+ *nexttid = MaxPlusOneZSTid;
+ }
+ if (tid < *nexttid)
+ {
+ deltas[total_remain] = tid - prev_tid;
+ tids[total_remain] = tid;
+ slotnos[total_remain] = slotnos[i];
+ total_remain++;
+ prev_tid = tid;
+ }
+ }
+
+ if (total_remain > 0)
+ {
+ ZSUndoRecPtr tmp_slots[ZSBT_MAX_ITEM_UNDO_SLOTS];
+ uint8 *tmp_slotnos;
+ int idx;
+
+ tmp_slotnos = palloc(total_remain * sizeof(uint8));
+
+ /*
+ * Ok, we have the decoded TIDs and undo slot numbers in 'tids' and 'slotnos' now.
+ *
+ * Time to re-encode.
+ */
+ idx = 0;
+ while (idx < total_remain)
+ {
+ ZSTidArrayItem *newitem;
+ int num_remapped;
+ int num_tmp_slots;
+
+ num_remapped = remap_slots(&slotnos[idx], total_remain - idx,
+ orig_slots, orig->t_num_undo_slots,
+ -1, InvalidUndoPtr,
+ tmp_slots, &num_tmp_slots,
+ tmp_slotnos,
+ recent_oldest_undo);
+
+ deltas[idx] = 0;
+ newitem = build_item(&tids[idx], &deltas[idx], tmp_slotnos, num_remapped,
+ tmp_slots, num_tmp_slots);
+
+ newitems = lappend(newitems, newitem);
+ idx += newitem->t_num_tids;
+ }
+ pfree(tmp_slotnos);
+ }
+
+ pfree(deltas);
+ pfree(tids);
+ pfree(slotnos);
+
+ return newitems;
+}
+
+
+/*
+ * Convert an array of deltas to tids.
+ *
+ * Note: the input and output may point to the same array!
+ */
+static void
+deltas_to_tids(zstid firsttid, uint64 *deltas, int num_tids, zstid *tids)
+{
+ zstid prev_tid = firsttid;
+
+ for (int i = 0; i < num_tids; i++)
+ {
+ zstid tid;
+
+ tid = prev_tid + deltas[i];
+ tids[i] = tid;
+ prev_tid = tid;
+ }
+}
+
+/*
+ * Expand the slot numbers packed in slotwords, 2 bits per slotno, into
+ * a regular C array.
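+ *
+ * For example, with 2-bit slot numbers, a slotword whose low byte is
+ * 0xE4 (binary 11 10 01 00) unpacks to slotnos {0, 1, 2, 3} for the first
+ * four TIDs.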
+ */
+static void
+slotwords_to_slotnos(uint64 *slotwords, int num_tids, uint8 *slotnos)
+{
+ uint64 *slotword_p;
+ const uint64 mask = (UINT64CONST(1) << ZSBT_ITEM_UNDO_SLOT_BITS) - 1;
+ int i;
+
+ i = 0;
+ slotword_p = slotwords;
+ while (i < num_tids)
+ {
+ uint64 slotword = *(slotword_p++);
+ int j;
+
+ /*
+ * Process four elements at a time, for speed (this is an
+ * unrolled version of the loop below).
+ */
+ j = 0;
+ while (j < ZSBT_SLOTNOS_PER_WORD && num_tids - i > 3)
+ {
+ slotnos[i] = slotword & mask;
+ slotnos[i + 1] = (slotword >> 2) & mask;
+ slotnos[i + 2] = (slotword >> 4) & mask;
+ slotnos[i + 3] = (slotword >> 6) & mask;
+ slotword = slotword >> 8;
+ i += 4;
+ j += 4;
+ }
+ /* handle the 0-3 elements at the end */
+ while (j < ZSBT_SLOTNOS_PER_WORD && num_tids - i > 0)
+ {
+ slotnos[i] = slotword & mask;
+ slotword = slotword >> 2;
+ i++;
+ j++;
+ }
+ }
+}
+
+/*
+ * Remap undo slots.
+ *
+ * We start with empty UNDO slots, and walk through the items,
+ * filling a slot whenever we encounter an UNDO pointer that we
+ * haven't assigned a slot for yet. If we run out of slots, stop.
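+ *
+ * Returns the number of input TIDs that were processed before running out
+ * of slots; the callers build an item for that prefix and then call this
+ * again for the remainder (see the loops in the callers).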
+ */
+static int
+remap_slots(uint8 *slotnos, int num_tids,
+ ZSUndoRecPtr *orig_slots, int num_orig_slots,
+ int target_idx, ZSUndoRecPtr target_ptr,
+ ZSUndoRecPtr *new_slots,
+ int *new_num_slots,
+ uint8 *new_slotnos,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ int num_slots;
+ int8 slot_mapping[ZSBT_MAX_ITEM_UNDO_SLOTS + 1];
+ int idx;
+
+ new_slots[ZSBT_OLD_UNDO_SLOT] = InvalidUndoPtr;
+ new_slots[ZSBT_DEAD_UNDO_SLOT] = DeadUndoPtr;
+ num_slots = ZSBT_FIRST_NORMAL_UNDO_SLOT;
+
+ /*
+ * Have to remap the UNDO slots.
+ *
+ * We start with empty UNDO slots, and walk through the items,
+ * filling a slot whenever we encounter an UNDO pointer that we
+ * haven't assigned a slot for yet. If we run out of slots, stop.
+ */
+
+ slot_mapping[ZSBT_OLD_UNDO_SLOT] = ZSBT_OLD_UNDO_SLOT;
+ slot_mapping[ZSBT_DEAD_UNDO_SLOT] = ZSBT_DEAD_UNDO_SLOT;
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < num_orig_slots; i++)
+ slot_mapping[i] = -1;
+
+ for (idx = 0; idx < num_tids; idx++)
+ {
+ int orig_slotno = slotnos[idx];
+ int new_slotno;
+
+ if (idx == target_idx)
+ new_slotno = -1;
+ else
+ new_slotno = slot_mapping[orig_slotno];
+ if (new_slotno == -1)
+ {
+ /* assign new slot for this. */
+ ZSUndoRecPtr this_undoptr;
+
+ if (idx == target_idx)
+ this_undoptr = target_ptr;
+ else
+ this_undoptr = orig_slots[orig_slotno];
+
+ if (this_undoptr.counter == DeadUndoPtr.counter)
+ new_slotno = ZSBT_DEAD_UNDO_SLOT;
+ else if (this_undoptr.counter < recent_oldest_undo.counter)
+ new_slotno = ZSBT_OLD_UNDO_SLOT;
+ else
+ {
+ for (int j = 0; j < num_slots; j++)
+ {
+ if (new_slots[j].counter == this_undoptr.counter)
+ {
+ /* We already had a slot for this undo pointer. Reuse it. */
+ new_slotno = j;
+ break;
+ }
+ }
+ if (new_slotno == -1)
+ {
+ if (num_slots >= ZSBT_MAX_ITEM_UNDO_SLOTS)
+ break; /* out of slots */
+ else
+ {
+ /* assign to free slot */
+ new_slots[num_slots] = this_undoptr;
+ new_slotno = num_slots;
+ num_slots++;
+ }
+ }
+ }
+
+ if (idx != target_idx)
+ slot_mapping[orig_slotno] = new_slotno;
+ }
+
+ new_slotnos[idx] = new_slotno;
+ }
+
+ *new_num_slots = num_slots;
+ return idx;
+}
+
+/*
+ * Construct a ZSTidArrayItem.
+ *
+ * 'tids' is the list of TIDs to be packed in the item.
+ *
+ * 'deltas' contain the difference between each TID. They could be computed
+ * from the 'tids', but since the caller has them already, we can save some
+ * effort by passing them down.
+ *
+ * 'slots' contains the UNDO slots to be stored. NOTE: it contains the
+ * special 0 and 1 slots too, but they won't be stored in the item that's
+ * created.
+ *
+ * 'slotnos' contains the UNDO slot numbers corresponding to each tuple
+ */
+static ZSTidArrayItem *
+build_item(zstid *tids, uint64 *deltas, uint8 *slotnos, int num_tids,
+ ZSUndoRecPtr *slots, int num_slots)
+{
+ int num_codewords;
+ Size itemsz;
+ ZSTidArrayItem *newitem;
+ int num_encoded;
+ uint64 codewords[ZSBT_MAX_ITEM_CODEWORDS];
+ ZSUndoRecPtr *newitem_slots;
+ uint64 *newitem_slotwords;
+ uint64 *newitem_codewords;
+ uint64 *dst_slotword;
+ int idx;
+
+ /*
+ * Create codewords.
+ */
+ num_codewords = 0;
+ num_encoded = 0;
+ while (num_encoded < num_tids && num_codewords < ZSBT_MAX_ITEM_CODEWORDS)
+ {
+ int n;
+ uint64 codeword;
+
+ codeword = simple8b_encode(&deltas[num_encoded], num_tids - num_encoded, &n);
+ if (n == 0)
+ break;
+
+ num_encoded += n;
+
+ codewords[num_codewords++] = codeword;
+ }
+
+ itemsz = SizeOfZSTidArrayItem(num_encoded, num_slots, num_codewords);
+ newitem = palloc(itemsz);
+ newitem->t_size = itemsz;
+ newitem->t_num_tids = num_encoded;
+ newitem->t_num_undo_slots = num_slots;
+ newitem->t_num_codewords = num_codewords;
+ newitem->t_firsttid = tids[0];
+ newitem->t_endtid = tids[num_encoded - 1] + 1;
+
+ ZSTidArrayItemDecode(newitem, &newitem_codewords, &newitem_slots, &newitem_slotwords);
+
+ /* Copy in the TID codewords */
+ for (int i = 0; i < num_codewords; i++)
+ newitem_codewords[i] = codewords[i];
+
+ /* Copy in undo slots */
+ for (int i = ZSBT_FIRST_NORMAL_UNDO_SLOT; i < num_slots; i++)
+ newitem_slots[i - ZSBT_FIRST_NORMAL_UNDO_SLOT] = slots[i];
+
+ /* Create slotwords */
+ dst_slotword = newitem_slotwords;
+ idx = 0;
+ while (idx < num_encoded)
+ {
+ uint64 slotword = 0;
+
+ for (int j = 0; j < ZSBT_SLOTNOS_PER_WORD && idx < num_encoded; j++)
+ slotword |= (uint64) slotnos[idx++] << (j * ZSBT_ITEM_UNDO_SLOT_BITS);
+
+ *(dst_slotword++) = slotword;
+ }
+ Assert(dst_slotword == newitem_slotwords + ZSBT_NUM_SLOTWORDS(num_encoded));
+
+ return newitem;
+}
+
+static int
+binsrch_tid_array(zstid key, zstid *arr, int arr_elems)
+{
+ int low,
+ high,
+ mid;
+
+ low = 0;
+ high = arr_elems;
+ while (high > low)
+ {
+ mid = low + (high - low) / 2;
+
+ if (key >= arr[mid])
+ low = mid + 1;
+ else
+ high = mid;
+ }
+ return low - 1;
+}
diff --git a/src/backend/access/zedstore/zedstore_tidpage.c b/src/backend/access/zedstore/zedstore_tidpage.c
new file mode 100644
index 0000000000..0091033993
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_tidpage.c
@@ -0,0 +1,2058 @@
+/*
+ * zedstore_tidpage.c
+ * Routines for handling the TID tree.
+ *
+ * A Zedstore table consists of multiple B-trees, one for each attribute. The
+ * functions in this file deal with one B-tree at a time; it is the caller's
+ * responsibility to tie together the scans of the individual B-trees.
+ *
+ * Operations:
+ *
+ * - Sequential scan in TID order
+ * - must be efficient with scanning multiple trees in sync
+ *
+ * - random lookups, by TID (for index scan)
+ *
+ * - range scans by TID (for bitmap index scan)
+ *
+ * NOTES:
+ * - Locking order: child before parent, left before right
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tidpage.c
+ */
+#include "postgres.h"
+
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undorec.h"
+#include "access/zedstore_wal.h"
+#include "lib/integerset.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/predicate.h"
+#include "storage/procarray.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+
+/* prototypes for local functions */
+static void zsbt_tid_recompress_replace(Relation rel, Buffer oldbuf, List *items, zs_pending_undo_op *undo_op);
+static OffsetNumber zsbt_tid_fetch(Relation rel, zstid tid,
+ Buffer *buf_p, ZSUndoSlotVisibility *visi_info, bool *isdead_p);
+static void zsbt_tid_add_items(Relation rel, Buffer buf, List *newitems,
+ zs_pending_undo_op *pending_undo_op);
+static void zsbt_tid_replace_item(Relation rel, Buffer buf, OffsetNumber off, List *newitems,
+ zs_pending_undo_op *pending_undo_op);
+
+static TM_Result zsbt_tid_update_lock_old(Relation rel, zstid otid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd,
+ bool *this_xact_has_lock, ZSUndoRecPtr *prevundoptr_p);
+static zstid zsbt_tid_update_insert_new(Relation rel, TransactionId xid, CommandId cid,
+ ZSUndoRecPtr prevundoptr);
+static bool zsbt_tid_mark_old_updated(Relation rel, zstid otid, zstid newtid,
+ TransactionId xid, CommandId cid, bool key_update, ZSUndoRecPtr prevrecptr);
+static OffsetNumber zsbt_binsrch_tidpage(zstid key, Page page);
+static void zsbt_wal_log_tidleaf_items_begin(int nitems, zs_pending_undo_op *undo_op);
+static void zsbt_wal_log_tidleaf_items(Relation rel, Buffer buf,
+ OffsetNumber off, bool replace, List *items,
+ zs_pending_undo_op *undo_op);
+
+/* ----------------------------------------------------------------
+ * Public interface
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * Begin a scan of the btree.
+ */
+void
+zsbt_tid_begin_scan(Relation rel, zstid starttid,
+ zstid endtid, Snapshot snapshot, ZSTidTreeScan *scan)
+{
+ scan->rel = rel;
+ scan->snapshot = snapshot;
+ scan->context = CurrentMemoryContext;
+ scan->starttid = starttid;
+ scan->endtid = endtid;
+ scan->currtid = starttid - 1;
+ memset(&scan->recent_oldest_undo, 0, sizeof(scan->recent_oldest_undo));
+ memset(&scan->array_iter, 0, sizeof(scan->array_iter));
+ scan->array_iter.context = CurrentMemoryContext;
+ scan->array_curr_idx = -1;
+
+ scan->active = true;
+ scan->lastbuf = InvalidBuffer;
+ scan->lastoff = InvalidOffsetNumber;
+
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, true);
+}
+
+/*
+ * Reset the 'next' TID in a scan to the given TID.
+ */
+void
+zsbt_tid_reset_scan(ZSTidTreeScan *scan, zstid starttid, zstid endtid, zstid currtid)
+{
+ scan->starttid = starttid;
+ scan->endtid = endtid;
+ scan->currtid = currtid;
+ scan->array_curr_idx = -1;
+}
+
+void
+zsbt_tid_end_scan(ZSTidTreeScan *scan)
+{
+ if (!scan->active)
+ return;
+
+ if (scan->lastbuf != InvalidBuffer)
+ ReleaseBuffer(scan->lastbuf);
+
+ scan->active = false;
+ scan->array_iter.num_tids = 0;
+ scan->array_curr_idx = -1;
+
+ if (scan->array_iter.tids)
+ pfree(scan->array_iter.tids);
+ if (scan->array_iter.tid_undoslotnos)
+ pfree(scan->array_iter.tid_undoslotnos);
+}
+
+/*
+ * Helper function of zsbt_tid_scan_next_array(), to extract the TIDs from the
+ * given array item into the scan->array_* fields.
+ */
+static void
+zsbt_tid_scan_extract_array(ZSTidTreeScan *scan, ZSTidArrayItem *aitem)
+{
+ bool slots_visible[4];
+ int first;
+ int last;
+ int num_visible_tids;
+ int continue_at;
+
+ zsbt_tid_item_unpack(aitem, &scan->array_iter);
+
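+	/*
+	 * Undo slots 0 and 1 are special: ZSBT_OLD_UNDO_SLOT stands for tuples
+	 * that are old enough to be visible to everyone, and ZSBT_DEAD_UNDO_SLOT
+	 * for tuples that are dead and visible to no one. Only the remaining
+	 * slots carry real undo pointers that need to be checked against the
+	 * snapshot.
+	 */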
+ slots_visible[ZSBT_OLD_UNDO_SLOT] = true;
+ slots_visible[ZSBT_DEAD_UNDO_SLOT] = false;
+ scan->array_iter.visi_infos[ZSBT_OLD_UNDO_SLOT].xmin = FrozenTransactionId;
+
+ for (int i = 2; i < aitem->t_num_undo_slots; i++)
+ {
+ TransactionId obsoleting_xid;
+
+ slots_visible[i] = zs_SatisfiesVisibility(scan, &obsoleting_xid,
+ NULL, &scan->array_iter.visi_infos[i]);
+ if (scan->serializable && TransactionIdIsValid(obsoleting_xid))
+ CheckForSerializableConflictOut(scan->rel, obsoleting_xid, scan->snapshot);
+ }
+
+ /*
+ * Skip over elements at the beginning and end of the array that
+ * are not within the range we're interested in.
+ */
+ for (first = 0; first < scan->array_iter.num_tids; first++)
+ {
+ if (scan->array_iter.tids[first] >= scan->starttid)
+ break;
+ }
+ for (last = scan->array_iter.num_tids - 1; last >= first; last--)
+ {
+ if (scan->array_iter.tids[last] < scan->endtid)
+ break;
+ }
+
+ /* squeeze out invisible TIDs */
+ if (first == 0)
+ {
+ int j;
+
+ for (j = 0; j <= last; j++)
+ {
+ if (!slots_visible[scan->array_iter.tid_undoslotnos[j]])
+ break;
+ }
+ num_visible_tids = j;
+ continue_at = j + 1;
+ }
+ else
+ {
+ num_visible_tids = 0;
+ continue_at = first;
+ }
+
+ for (int i = continue_at; i <= last; i++)
+ {
+ /* Is this item visible? */
+ if (slots_visible[scan->array_iter.tid_undoslotnos[i]])
+ {
+ scan->array_iter.tids[num_visible_tids] = scan->array_iter.tids[i];
+ scan->array_iter.tid_undoslotnos[num_visible_tids] = scan->array_iter.tid_undoslotnos[i];
+ num_visible_tids++;
+ }
+ }
+ scan->array_iter.num_tids = num_visible_tids;
+ scan->array_curr_idx = -1;
+}
+
+/*
+ * Advance scan to next batch of TIDs.
+ *
+ * Finds the next TID array item >= 'nexttid', and decodes it into
+ * scan->array_iter. The values in scan->array_iter are valid until
+ * the next call to this function, zsbt_tid_reset_scan() or
+ * zsbt_tid_end_scan().
+ *
+ * Returns true if there was another item, or false if we reached the
+ * end of the scan.
+ *
+ * This is normally not used directly, see zsbt_tid_scan_next() wrapper.
+ */
+bool
+zsbt_tid_scan_next_array(ZSTidTreeScan *scan, zstid nexttid, ScanDirection direction)
+{
+ if (!scan->active)
+ return false;
+
+ /*
+ * Process items, until we find something that is visible to the snapshot.
+ *
+ * This advances nexttid as it goes.
+ */
+ while (nexttid < scan->endtid && nexttid >= scan->starttid)
+ {
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+ BlockNumber next;
+
+ /*
+ * Find and lock the leaf page containing nexttid.
+ */
+ buf = zsbt_find_and_lock_leaf_containing_tid(scan->rel, ZS_META_ATTRIBUTE_NUM,
+ scan->lastbuf, nexttid,
+ BUFFER_LOCK_SHARE);
+ if (buf != scan->lastbuf)
+ scan->lastoff = InvalidOffsetNumber;
+ scan->lastbuf = buf;
+ if (!BufferIsValid(buf))
+ {
+ /*
+ * Completely empty tree. This should only happen at the beginning of a
+ * scan - a tree cannot go missing after it's been created - but we don't
+ * currently check for that.
+ */
+ break;
+ }
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ Assert(opaque->zs_page_id == ZS_BTREE_PAGE_ID);
+
+ /*
+ * Scan the items on the page, to find the next one that covers
+ * nexttid.
+ *
+ * We check the last offset first, as an optimization
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ if (direction == ForwardScanDirection)
+ {
+ /* Search for the next item >= nexttid */
+ off = FirstOffsetNumber;
+ if (scan->lastoff > FirstOffsetNumber && scan->lastoff <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, scan->lastoff);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (nexttid >= item->t_endtid)
+ off = scan->lastoff + 1;
+ }
+
+ for (; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (nexttid >= item->t_endtid)
+ continue;
+
+ if (item->t_firsttid >= scan->endtid)
+ {
+ nexttid = scan->endtid;
+ break;
+ }
+
+ zsbt_tid_scan_extract_array(scan, item);
+
+ if (scan->array_iter.num_tids > 0)
+ {
+ if (scan->array_iter.tids[scan->array_iter.num_tids - 1] >= nexttid)
+ {
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ scan->lastoff = off;
+ return true;
+ }
+ nexttid = scan->array_iter.tids[scan->array_iter.num_tids - 1] + 1;
+ }
+ }
+ /* No more items on this page. Walk right, if possible */
+ if (nexttid < opaque->zs_hikey)
+ nexttid = opaque->zs_hikey;
+ next = opaque->zs_next;
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (next == InvalidBlockNumber || nexttid >= scan->endtid)
+ {
+ /* reached end of scan */
+ break;
+ }
+
+ scan->lastbuf = ReleaseAndReadBuffer(scan->lastbuf, scan->rel, next);
+ }
+ else
+ {
+ /* Search for the next item <= nexttid */
+ for (off = maxoff; off >= FirstOffsetNumber; off--)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (nexttid < item->t_firsttid)
+ continue;
+
+ if (item->t_endtid < scan->starttid)
+ {
+ nexttid = scan->starttid - 1;
+ break;
+ }
+
+ zsbt_tid_scan_extract_array(scan, item);
+
+ if (scan->array_iter.num_tids > 0)
+ {
+ if (scan->array_iter.tids[0] <= nexttid)
+ {
+ LockBuffer(scan->lastbuf, BUFFER_LOCK_UNLOCK);
+ scan->lastoff = off;
+ return true;
+ }
+ nexttid = scan->array_iter.tids[0] - 1;
+ }
+ }
+ /* No more items on this page. Loop back to find the left sibling. */
+ if (nexttid >= opaque->zs_lokey)
+ nexttid = opaque->zs_lokey - 1;
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ if (nexttid < scan->starttid)
+ {
+ /* reached end of scan */
+ break;
+ }
+ scan->lastbuf = InvalidBuffer;
+ }
+ }
+
+ /* Reached end of scan. */
+ scan->array_iter.num_tids = 0;
+ if (BufferIsValid(scan->lastbuf))
+ ReleaseBuffer(scan->lastbuf);
+ scan->lastbuf = InvalidBuffer;
+ scan->lastoff = InvalidOffsetNumber;
+
+ return false;
+}
+
+/*
+ * Get the first tid in the tree.
+ */
+zstid
+zsbt_get_first_tid(Relation rel)
+{
+ zstid leftmostkey;
+ zstid tid;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+
+ /* Find the leftmost leaf */
+ leftmostkey = MinZSTid;
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, leftmostkey, 0, true);
+ if (!BufferIsValid(buf))
+ {
+ return MaxPlusOneZSTid;
+ }
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /* Move on to the next page if the current page has no item */
+ while (PageGetMaxOffsetNumber(page) < FirstOffsetNumber)
+ {
+ BlockNumber next = opaque->zs_next;
+
+ if (next == InvalidBlockNumber)
+ {
+ UnlockReleaseBuffer(buf);
+ return MaxPlusOneZSTid;
+ }
+ UnlockReleaseBuffer(buf);
+
+ buf = ReadBuffer(rel, next);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ }
+
+ tid = opaque->zs_lokey;
+ UnlockReleaseBuffer(buf);
+
+ return tid;
+}
+
+/*
+ * Get the last tid (plus one) in the tree.
+ */
+zstid
+zsbt_get_last_tid(Relation rel)
+{
+ zstid rightmostkey;
+ zstid tid;
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+
+ /* Find the rightmost leaf */
+ rightmostkey = MaxZSTid;
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, rightmostkey, 0, true);
+ if (!BufferIsValid(buf))
+ {
+ return MinZSTid;
+ }
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /*
+ * Look at the last item, for its tid.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ if (maxoff >= FirstOffsetNumber)
+ {
+ ItemId iid = PageGetItemId(page, maxoff);
+ ZSTidArrayItem *lastitem = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ tid = lastitem->t_endtid;
+ }
+ else
+ {
+ tid = opaque->zs_lokey;
+ }
+ UnlockReleaseBuffer(buf);
+
+ return tid;
+}
+
+/*
+ * Insert multiple TIDs.
+ *
+ * Allocates a contiguous range of 'ntuples' new TIDs at the end of the TID
+ * tree, and returns the first TID of the range. The data columns are
+ * inserted afterwards, using the TIDs allocated here.
+ */
+zstid
+zsbt_tid_multi_insert(Relation rel, int ntuples,
+ TransactionId xid, CommandId cid, uint32 speculative_token, ZSUndoRecPtr prevundoptr)
+{
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ OffsetNumber maxoff;
+ zstid insert_target_key;
+ List *newitems;
+ zs_pending_undo_op *undo_op;
+ zstid endtid;
+ zstid tid;
+ ZSTidArrayItem *lastitem;
+ bool modified_orig;
+
+ /*
+ * Insert to the rightmost leaf.
+ *
+ * TODO: use a Free Space Map to find suitable target.
+ */
+ insert_target_key = MaxZSTid;
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, insert_target_key, 0, false);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Look at the last item, for its TID. The new TIDs are assigned starting
+ * right after it.
+ */
+ if (maxoff >= FirstOffsetNumber)
+ {
+ ItemId iid = PageGetItemId(page, maxoff);
+
+ lastitem = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ endtid = lastitem->t_endtid;
+ }
+ else
+ {
+ endtid = opaque->zs_lokey;
+ lastitem = NULL;
+ }
+ tid = endtid;
+
+ /* Form an undo record */
+ if (xid != FrozenTransactionId)
+ {
+ undo_op = zsundo_create_for_insert(rel, xid, cid, tid, ntuples,
+ speculative_token, prevundoptr);
+ }
+ else
+ {
+ undo_op = NULL;
+ }
+
+ /*
+ * Create an item to represent all the TIDs, merging with the last existing
+ * item if possible.
+ */
+ newitems = zsbt_tid_item_add_tids(lastitem, tid, ntuples, undo_op ? undo_op->reservation.undorecptr : InvalidUndoPtr,
+ &modified_orig);
+
+ /*
+ * Replace the original last item with the new items, or add new items.
+ * This splits the page if necessary.
+ */
+ if (modified_orig)
+ zsbt_tid_replace_item(rel, buf, maxoff, newitems, undo_op);
+ else
+ zsbt_tid_add_items(rel, buf, newitems, undo_op);
+ /* zsbt_tid_replace/add_item unlocked 'buf' */
+ ReleaseBuffer(buf);
+
+ list_free_deep(newitems);
+
+ /* Return the TIDs to the caller */
+ return tid;
+}
+
+TM_Result
+zsbt_tid_delete(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart, bool *this_xact_has_lock)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, true);
+ ZSUndoSlotVisibility visi_info;
+ bool item_isdead;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ zs_pending_undo_op *undo_op;
+ OffsetNumber off;
+ ZSTidArrayItem *origitem;
+ Buffer buf;
+ Page page;
+ zstid next_tid;
+ List *newitems = NIL;
+
+ /* Find the item to delete. (It could be compressed) */
+ off = zsbt_tid_fetch(rel, tid, &buf, &visi_info, &item_isdead);
+ if (!OffsetNumberIsValid(off))
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find tuple to delete with TID (%u, %u) in TID tree",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ }
+ if (item_isdead)
+ {
+ elog(ERROR, "cannot delete tuple that is already marked DEAD (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ }
+
+ if (snapshot)
+ {
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ tid, LockTupleExclusive,
+ &keep_old_undo_ptr, this_xact_has_lock,
+ hufd, &next_tid, &visi_info);
+ if (result != TM_Ok)
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return result;
+ }
+
+ if (crosscheck != InvalidSnapshot && result == TM_Ok)
+ {
+ /* Perform additional check for transaction-snapshot mode RI updates */
+ /* FIXME: dummy scan */
+ ZSTidTreeScan scan;
+ TransactionId obsoleting_xid;
+
+ memset(&scan, 0, sizeof(scan));
+ scan.rel = rel;
+ scan.snapshot = crosscheck;
+ scan.recent_oldest_undo = recent_oldest_undo;
+
+ if (!zs_SatisfiesVisibility(&scan, &obsoleting_xid, NULL, &visi_info))
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return TM_Updated;
+ }
+ }
+ }
+
+ /* Create UNDO record. */
+ undo_op = zsundo_create_for_delete(rel, xid, cid, tid, changingPart,
+ keep_old_undo_ptr ? visi_info.undoptr : InvalidUndoPtr);
+
+ /* Update the tid with the new UNDO pointer. */
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, tid, undo_op->reservation.undorecptr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems, undo_op);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+
+ return TM_Ok;
+}
+
+void
+zsbt_find_latest_tid(Relation rel, zstid *tid, Snapshot snapshot)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, true);
+ ZSUndoSlotVisibility visi_info;
+ bool item_isdead;
+ OffsetNumber off;
+ Buffer buf = InvalidBuffer;
+ /* Just using meta attribute, we can follow the update chain */
+ zstid curr_tid = *tid;
+
+ for (;;)
+ {
+ zstid next_tid = InvalidZSTid;
+
+ if (curr_tid == InvalidZSTid)
+ break;
+
+ /* Find the item */
+ off = zsbt_tid_fetch(rel, curr_tid, &buf, &visi_info, &item_isdead);
+ if (!OffsetNumberIsValid(off))
+ break;
+
+ if (item_isdead)
+ {
+ UnlockReleaseBuffer(buf);
+ break;
+ }
+
+ if (snapshot)
+ {
+ /* FIXME: dummy scan */
+ ZSTidTreeScan scan;
+ TransactionId obsoleting_xid;
+
+ memset(&scan, 0, sizeof(scan));
+ scan.rel = rel;
+ scan.snapshot = snapshot;
+ scan.recent_oldest_undo = recent_oldest_undo;
+
+ if (zs_SatisfiesVisibility(&scan,
+ &obsoleting_xid, &next_tid, &visi_info))
+ {
+ *tid = curr_tid;
+ }
+
+ curr_tid = next_tid;
+ UnlockReleaseBuffer(buf);
+ }
+ }
+}
+
+/*
+ * Update a tuple. A new TID is allocated for the new row version and returned
+ * to the caller in *newtid_p. This function is only called for the META
+ * attribute btree; the data columns will use the returned TID to insert the
+ * new items.
+ */
+TM_Result
+zsbt_tid_update(Relation rel, zstid otid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd,
+ zstid *newtid_p, bool *this_xact_has_lock)
+{
+ TM_Result result;
+ ZSUndoRecPtr prevundoptr;
+ bool success;
+
+ /*
+ * This is currently only used on the meta-attribute. The other attributes
+ * don't need to carry visibility information, so the caller just inserts
+ * the new values with (multi_)insert() instead. This will change once we
+ * start doing the equivalent of HOT updates, where the TID doesn't change.
+ */
+ Assert(*newtid_p == InvalidZSTid);
+
+ /*
+ * Find and lock the old item.
+ *
+ * TODO: If there's free TID space left on the same page, we should keep the
+ * buffer locked, and use the same page for the new tuple.
+ */
+retry:
+ result = zsbt_tid_update_lock_old(rel, otid,
+ xid, cid, key_update, snapshot,
+ crosscheck, wait, hufd, this_xact_has_lock, &prevundoptr);
+
+ if (result != TM_Ok)
+ return result;
+
+ /* insert new version */
+ *newtid_p = zsbt_tid_update_insert_new(rel, xid, cid, prevundoptr);
+
+ /* update the old item with the "t_ctid pointer" for the new item */
+ success = zsbt_tid_mark_old_updated(rel, otid, *newtid_p, xid, cid, key_update, prevundoptr);
+ if (!success)
+ {
+ ZSUndoRecPtr oldest_undoptr = zsundo_get_oldest_undo_ptr(rel, true);
+
+ zsbt_tid_mark_dead(rel, *newtid_p, oldest_undoptr);
+ goto retry;
+ }
+
+ return TM_Ok;
+}
+
+/*
+ * Subroutine of zsbt_update(): locks the old item for update.
+ */
+static TM_Result
+zsbt_tid_update_lock_old(Relation rel, zstid otid,
+ TransactionId xid, CommandId cid, bool key_update, Snapshot snapshot,
+ Snapshot crosscheck, bool wait, TM_FailureData *hufd, bool *this_xact_has_lock,
+ ZSUndoRecPtr *prevundoptr_p)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, true);
+ Buffer buf = InvalidBuffer;
+ ZSUndoSlotVisibility olditem_visi_info;
+ bool olditem_isdead = false;
+ OffsetNumber off;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ zstid next_tid;
+
+ /*
+ * Find the old item to lock.
+ */
+ off = zsbt_tid_fetch(rel, otid, &buf, &olditem_visi_info, &olditem_isdead);
+ if (!OffsetNumberIsValid(off) || olditem_isdead)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find old tuple to update with TID (%u, %u) in TID tree",
+ ZSTidGetBlockNumber(otid), ZSTidGetOffsetNumber(otid));
+ }
+ *prevundoptr_p = olditem_visi_info.undoptr;
+
+ /*
+ * Is it visible to us?
+ */
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ otid,
+ key_update ? LockTupleExclusive : LockTupleNoKeyExclusive,
+ &keep_old_undo_ptr, this_xact_has_lock,
+ hufd, &next_tid, &olditem_visi_info);
+ if (result != TM_Ok)
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return result;
+ }
+
+ if (crosscheck != InvalidSnapshot && result == TM_Ok)
+ {
+ /* Perform additional check for transaction-snapshot mode RI updates */
+ /* FIXME: dummy scan */
+ ZSTidTreeScan scan;
+ TransactionId obsoleting_xid;
+ memset(&scan, 0, sizeof(scan));
+ scan.rel = rel;
+ scan.snapshot = crosscheck;
+ scan.recent_oldest_undo = recent_oldest_undo;
+
+ if (!zs_SatisfiesVisibility(&scan, &obsoleting_xid, NULL, &olditem_visi_info))
+ {
+ UnlockReleaseBuffer(buf);
+ /* FIXME: We should fill TM_FailureData *hufd correctly */
+ return TM_Updated;
+ }
+ }
+
+ /*
+ * TODO: tuple-locking not implemented. Pray that there is no competing
+ * concurrent update!
+ */
+
+ UnlockReleaseBuffer(buf);
+
+ return TM_Ok;
+}
+
+/*
+ * Subroutine of zsbt_update(): inserts the new, updated, item.
+ */
+static zstid
+zsbt_tid_update_insert_new(Relation rel, TransactionId xid, CommandId cid, ZSUndoRecPtr prevundoptr)
+{
+ return zsbt_tid_multi_insert(rel, 1, xid, cid, INVALID_SPECULATIVE_TOKEN, prevundoptr);
+}
+
+/*
+ * Subroutine of zsbt_update(): mark old item as updated.
+ */
+static bool
+zsbt_tid_mark_old_updated(Relation rel, zstid otid, zstid newtid,
+ TransactionId xid, CommandId cid, bool key_update, ZSUndoRecPtr prevrecptr)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false /* we trimmed in the zsbt_tid_update_lock_old() call */);
+ Buffer buf;
+ Page page;
+ ZSUndoSlotVisibility olditem_visi_info;
+ bool olditem_isdead;
+ OffsetNumber off;
+ bool keep_old_undo_ptr = true;
+ zs_pending_undo_op *undo_op;
+ List *newitems;
+ ZSTidArrayItem *origitem;
+
+ /*
+ * Find the old item. It could be part of a compressed item;
+ * zsbt_tid_fetch() handles that.
+ */
+ off = zsbt_tid_fetch(rel, otid, &buf, &olditem_visi_info, &olditem_isdead);
+ if (!OffsetNumberIsValid(off) || olditem_isdead)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find old tuple to update with TID (%u, %u) in TID tree",
+ ZSTidGetBlockNumber(otid), ZSTidGetOffsetNumber(otid));
+ }
+
+ /*
+ * Did it change while we were inserting the new row version?
+ */
+ if (!ZSUndoRecPtrEquals(olditem_visi_info.undoptr, prevrecptr))
+ {
+ UnlockReleaseBuffer(buf);
+ return false;
+ }
+
+ /* Prepare an UNDO record. */
+ undo_op = zsundo_create_for_update(rel, xid, cid, otid, newtid,
+ keep_old_undo_ptr ? olditem_visi_info.undoptr : InvalidUndoPtr,
+ key_update);
+
+ /* Replace the ZSBreeItem with one with the updated undo pointer. */
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, otid, undo_op->reservation.undorecptr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems, undo_op);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+
+ return true;
+}
+
+TM_Result
+zsbt_tid_lock(Relation rel, zstid tid, TransactionId xid, CommandId cid,
+ LockTupleMode mode, bool follow_updates, Snapshot snapshot,
+ TM_FailureData *hufd, zstid *next_tid, bool *this_xact_has_lock,
+ ZSUndoSlotVisibility *visi_info)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, true);
+ Buffer buf;
+ Page page;
+ bool item_isdead;
+ OffsetNumber off;
+ TM_Result result;
+ bool keep_old_undo_ptr = true;
+ zs_pending_undo_op *undo_op;
+ List *newitems;
+ ZSTidArrayItem *origitem;
+
+ *next_tid = tid;
+
+ off = zsbt_tid_fetch(rel, tid, &buf, visi_info, &item_isdead);
+ if (!OffsetNumberIsValid(off) || item_isdead)
+ {
+ /*
+ * or should this be TM_Invisible? The heapam at least just throws
+ * an error, I think..
+ */
+ elog(ERROR, "could not find tuple to lock with TID (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ }
+ result = zs_SatisfiesUpdate(rel, snapshot, recent_oldest_undo,
+ tid, mode,
+ &keep_old_undo_ptr, this_xact_has_lock,
+ hufd, next_tid, visi_info);
+
+ if (result != TM_Ok)
+ {
+ if (result == TM_Invisible && follow_updates &&
+ TransactionIdIsInProgress(visi_info->xmin))
+ {
+ /*
+ * need to lock tuple irrespective of its visibility on
+ * follow_updates.
+ */
+ }
+ else
+ {
+ UnlockReleaseBuffer(buf);
+ return result;
+ }
+ }
+
+ /* Create UNDO record. */
+ undo_op = zsundo_create_for_tuple_lock(rel, xid, cid, tid, mode,
+ keep_old_undo_ptr ? visi_info->undoptr : InvalidUndoPtr);
+
+ /* Replace the item with an identical one, but with updated undo pointer. */
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, tid, undo_op->reservation.undorecptr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems, undo_op);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+ return TM_Ok;
+}
+
+/*
+ * Collect all TIDs marked as dead in the TID tree.
+ *
+ * This is used during VACUUM.
+ */
+IntegerSet *
+zsbt_collect_dead_tids(Relation rel, zstid starttid, zstid *endtid, uint64 *num_live_tuples)
+{
+ Buffer buf = InvalidBuffer;
+ IntegerSet *result;
+ ZSBtreePageOpaque *opaque;
+ zstid nexttid;
+ BlockNumber nextblock;
+ ZSTidItemIterator iter;
+
+ memset(&iter, 0, sizeof(ZSTidItemIterator));
+ iter.context = CurrentMemoryContext;
+
+ result = intset_create();
+
+ nexttid = starttid;
+ nextblock = InvalidBlockNumber;
+ for (;;)
+ {
+ Page page;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ if (nextblock != InvalidBlockNumber)
+ {
+ buf = ReleaseAndReadBuffer(buf, rel, nextblock);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (!zsbt_page_is_expected(rel, ZS_META_ATTRIBUTE_NUM, nexttid, 0, buf))
+ {
+ UnlockReleaseBuffer(buf);
+ buf = InvalidBuffer;
+ }
+ }
+
+ if (!BufferIsValid(buf))
+ {
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, nexttid, 0, true);
+ if (!BufferIsValid(buf))
+ return result;
+ }
+
+ page = BufferGetPage(buf);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ zsbt_tid_item_unpack(item, &iter);
+
+ for (int j = 0; j < iter.num_tids; j++)
+ {
+ (*num_live_tuples)++;
+ if (iter.tid_undoslotnos[j] == ZSBT_DEAD_UNDO_SLOT)
+ intset_add_member(result, iter.tids[j]);
+ }
+ }
+
+ opaque = ZSBtreePageGetOpaque(page);
+ nexttid = opaque->zs_hikey;
+ nextblock = opaque->zs_next;
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (nexttid == MaxPlusOneZSTid)
+ {
+ Assert(nextblock == InvalidBlockNumber);
+ break;
+ }
+
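+		/*
+		 * Stop collecting once the set grows beyond maintenance_work_mem.
+		 * The caller can continue from '*endtid' in a later call.
+		 */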
+ if (intset_memory_usage(result) > (uint64) maintenance_work_mem * 1024)
+ break;
+ }
+
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+
+ *endtid = nexttid;
+ return result;
+}
+
+/*
+ * Mark item with given TID as dead.
+ *
+ * This is used when UNDO actions are performed, after a transaction becomes
+ * old enough.
+ */
+void
+zsbt_tid_mark_dead(Relation rel, zstid tid, ZSUndoRecPtr recent_oldest_undo)
+{
+ Buffer buf = InvalidBuffer;
+ Page page;
+ ZSUndoSlotVisibility visi_info;
+ OffsetNumber off;
+ ZSTidArrayItem *origitem;
+ List *newitems;
+ bool isdead;
+
+ /* Find the item to mark dead. (It could be compressed) */
+ off = zsbt_tid_fetch(rel, tid, &buf, &visi_info, &isdead);
+ if (!OffsetNumberIsValid(off))
+ {
+ /*
+ * This can happen, at least in this scenario:
+ * 1. a backend reserves a range of TIDs, by inserting them to the
+ * TID tree.
+ * 2. it "cancels" the reservation in the middle of the transaction,
+ * by removing the TIDs from the tree again.
+ * 3. It then aborts. The UNDO record for the insertion is still in
+ * place, but the backend removed the TIDs already.
+ */
+ elog(DEBUG1, "could not find tuple to mark dead with TID (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ return;
+ }
+
+ /* Mark the TID as DEAD. (Unless it's already dead) */
+ if (isdead)
+ {
+ UnlockReleaseBuffer(buf);
+ return;
+ }
+
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, tid, DeadUndoPtr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems, NULL);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+}
+
+
+/*
+ * Remove items for the given TIDs from the TID tree.
+ *
+ * This is used during VACUUM.
+ */
+void
+zsbt_tid_remove(Relation rel, IntegerSet *tids)
+{
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, true);
+ zstid nexttid;
+ MemoryContext oldcontext;
+ MemoryContext tmpcontext;
+
+ tmpcontext = AllocSetContextCreate(CurrentMemoryContext,
+ "ZedstoreAMVacuumContext",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(tmpcontext);
+
+ intset_begin_iterate(tids);
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+
+ while (nexttid < MaxPlusOneZSTid)
+ {
+ Buffer buf;
+ Page page;
+ ZSBtreePageOpaque *opaque;
+ List *newitems;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ /*
+ * Find the leaf page containing the next item to remove
+ */
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, nexttid, 0, false);
+ page = BufferGetPage(buf);
+ opaque = ZSBtreePageGetOpaque(page);
+
+ /*
+ * Rewrite the items on the page, removing all TIDs that need to be
+ * removed from the page.
+ */
+ newitems = NIL;
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ while (nexttid < item->t_firsttid)
+ {
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+ }
+
+ if (nexttid < item->t_endtid)
+ {
+ List *newitemsx = zsbt_tid_item_remove_tids(item, &nexttid, tids,
+ recent_oldest_undo);
+
+ newitems = list_concat(newitems, newitemsx);
+ }
+ else
+ {
+ /* keep this item unmodified */
+ newitems = lappend(newitems, item);
+ }
+ }
+
+ while (nexttid < opaque->zs_hikey)
+ {
+ if (!intset_iterate_next(tids, &nexttid))
+ nexttid = MaxPlusOneZSTid;
+ }
+
+ /* Pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ if (newitems)
+ {
+ zsbt_tid_recompress_replace(rel, buf, newitems, NULL);
+ }
+ else
+ {
+ zs_split_stack *stack;
+
+ stack = zsbt_unlink_page(rel, ZS_META_ATTRIBUTE_NUM, buf, 0);
+
+ if (!stack)
+ {
+ /* failed. */
+ Page newpage = PageGetTempPageCopySpecial(BufferGetPage(buf));
+
+ stack = zs_new_split_stack_entry(buf, newpage);
+ }
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack, NULL, 0);
+ }
+
+ ReleaseBuffer(buf);
+
+ MemoryContextReset(tmpcontext);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(tmpcontext);
+}
+
+/*
+ * Clear an item's UNDO pointer.
+ *
+ * This is used during VACUUM, to clear out aborted deletions.
+ */
+void
+zsbt_tid_undo_deletion(Relation rel, zstid tid, ZSUndoRecPtr undoptr,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ Buffer buf;
+ Page page;
+ ZSUndoSlotVisibility visi_info;
+ bool item_isdead;
+ OffsetNumber off;
+
+ /* Find the item whose UNDO pointer needs to be cleared. (It could be compressed) */
+ off = zsbt_tid_fetch(rel, tid, &buf, &visi_info, &item_isdead);
+ if (!OffsetNumberIsValid(off))
+ {
+ elog(WARNING, "could not find aborted tuple to remove with TID (%u, %u)",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid));
+ return;
+ }
+
+ if (ZSUndoRecPtrEquals(visi_info.undoptr, undoptr))
+ {
+ ZSTidArrayItem *origitem;
+ List *newitems;
+
+ /* FIXME: we're overwriting the undo pointer with 'invalid', meaning the
+ * tuple becomes visible to everyone. That doesn't seem right. Shouldn't
+ * we restore the previous undo pointer, if the insertion was not yet
+ * visible to everyone?
+ */
+ page = BufferGetPage(buf);
+ origitem = (ZSTidArrayItem *) PageGetItem(page, PageGetItemId(page, off));
+ newitems = zsbt_tid_item_change_undoptr(origitem, tid, InvalidUndoPtr,
+ recent_oldest_undo);
+ zsbt_tid_replace_item(rel, buf, off, newitems, NULL);
+ list_free_deep(newitems);
+ ReleaseBuffer(buf); /* zsbt_tid_replace_item unlocked 'buf' */
+ }
+ else
+ {
+ Assert(item_isdead ||
+ visi_info.undoptr.counter > undoptr.counter ||
+ !IsZSUndoRecPtrValid(&visi_info.undoptr));
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+/* ----------------------------------------------------------------
+ * Internal routines
+ * ----------------------------------------------------------------
+ */
+
+void
+zsbt_tid_clear_speculative_token(Relation rel, zstid tid, uint32 spectoken, bool forcomplete)
+{
+ Buffer buf;
+ bool item_isdead;
+ ZSUndoSlotVisibility visi_info;
+ OffsetNumber off;
+
+ off = zsbt_tid_fetch(rel, tid, &buf, &visi_info, &item_isdead);
+ if (!OffsetNumberIsValid(off) || item_isdead)
+ elog(ERROR, "couldn't find item for meta column for inserted tuple with TID (%u, %u) in rel %s",
+ ZSTidGetBlockNumber(tid), ZSTidGetOffsetNumber(tid), rel->rd_rel->relname.data);
+
+ zsundo_clear_speculative_token(rel, visi_info.undoptr);
+
+ UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Fetch the item with given TID. The page containing the item is kept locked, and
+ * returned to the caller in *buf_p. This is used to locate a tuple for updating
+ * or deleting it. If the item with the given TID is not found, InvalidOffsetNumber
+ * is returned and buf_p is unchanged.
+ */
+static OffsetNumber
+zsbt_tid_fetch(Relation rel, zstid tid, Buffer *buf_p, ZSUndoSlotVisibility *visi_info, bool *isdead_p)
+{
+ Buffer buf;
+ Page page;
+ OffsetNumber maxoff;
+ OffsetNumber off;
+
+ buf = zsbt_descend(rel, ZS_META_ATTRIBUTE_NUM, tid, 0, false);
+ if (buf == InvalidBuffer)
+ return InvalidOffsetNumber;
+
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /* Find the item on the page that covers the target TID */
+ off = zsbt_binsrch_tidpage(tid, page);
+ if (off >= FirstOffsetNumber && off <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (tid < item->t_endtid)
+ {
+ ZSTidItemIterator iter;
+
+ memset(&iter, 0, sizeof(ZSTidItemIterator));
+ iter.context = CurrentMemoryContext;
+
+ zsbt_tid_item_unpack(item, &iter);
+
+ /* TODO: could do binary search here. Better yet, integrate the
+ * unpack function with the callers */
+ for (int i = 0; i < iter.num_tids; i++)
+ {
+ if (iter.tids[i] == tid)
+ {
+ int slotno = iter.tid_undoslotnos[i];
+
+ *isdead_p = (slotno == ZSBT_DEAD_UNDO_SLOT);
+ *visi_info = iter.visi_infos[slotno];
+ *buf_p = buf;
+
+ if (iter.tids)
+ pfree(iter.tids);
+ if (iter.tid_undoslotnos)
+ pfree(iter.tid_undoslotnos);
+
+ return off;
+ }
+ }
+
+ if (iter.tids)
+ pfree(iter.tids);
+ if (iter.tid_undoslotnos)
+ pfree(iter.tid_undoslotnos);
+ }
+ }
+ /*
+ * We found a page but the tid was not present in that page. So unlock
+ * it anyway.
+ */
+ UnlockReleaseBuffer(buf);
+ return InvalidOffsetNumber;
+}
+
+/*
+ * This helper function is used to implement INSERT.
+ *
+ * The items in 'newitems' are added to the page, to the correct position.
+ * FIXME: Actually, they're always just added to the end of the page, and that
+ * better be the correct position.
+ *
+ * This function handles splitting the page if needed.
+ */
+static void
+zsbt_tid_add_items(Relation rel, Buffer buf, List *newitems, zs_pending_undo_op *undo_op)
+{
+ Page page = BufferGetPage(buf);
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ OffsetNumber off;
+ Size newitemsize;
+ ListCell *lc;
+
+ newitemsize = 0;
+ foreach(lc, newitems)
+ {
+ ZSTidArrayItem *item = (ZSTidArrayItem *) lfirst(lc);
+
+ newitemsize += sizeof(ItemIdData) + item->t_size;
+ }
+
+ if (newitemsize <= PageGetExactFreeSpace(page))
+ {
+ /* The new items fit on the page. Add them. */
+ OffsetNumber startoff;
+ OffsetNumber off;
+
+ if (RelationNeedsWAL(rel))
+ zsbt_wal_log_tidleaf_items_begin(list_length(newitems), undo_op);
+
+ START_CRIT_SECTION();
+
+ startoff = maxoff + 1;
+ off = startoff;
+ foreach(lc, newitems)
+ {
+ ZSTidArrayItem *item = (ZSTidArrayItem *) lfirst(lc);
+
+ if (!PageAddItem(page, (Item) item, item->t_size, off, true, false))
+ elog(ERROR, "could not add item to TID tree page");
+ off++;
+ }
+
+ if (undo_op)
+ zsundo_finish_pending_op(undo_op, (char *) &undo_op->payload);
+
+ MarkBufferDirty(buf);
+
+ if (RelationNeedsWAL(rel))
+ zsbt_wal_log_tidleaf_items(rel, buf, startoff, false, newitems, undo_op);
+
+ END_CRIT_SECTION();
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (undo_op)
+ {
+ UnlockReleaseBuffer(undo_op->reservation.undobuf);
+ pfree(undo_op);
+ }
+ }
+ else
+ {
+ List *items = NIL;
+
+ /* Collect all the old items on the page to a list */
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ /*
+ * Get the next item to process from the page.
+ */
+ items = lappend(items, item);
+ }
+
+ /* Add any new items to the end */
+ foreach (lc, newitems)
+ {
+ items = lappend(items, lfirst(lc));
+ }
+
+ /* Now pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ if (items)
+ {
+ zsbt_tid_recompress_replace(rel, buf, items, undo_op);
+ }
+ else
+ {
+ zs_split_stack *stack;
+
+ stack = zsbt_unlink_page(rel, ZS_META_ATTRIBUTE_NUM, buf, 0);
+
+ if (!stack)
+ {
+ /* failed. */
+ Page newpage = PageGetTempPageCopySpecial(BufferGetPage(buf));
+
+ stack = zs_new_split_stack_entry(buf, newpage);
+ }
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack, undo_op, 0);
+ }
+
+ list_free(items);
+ }
+}
+
+
+/*
+ * This helper function is used to implement INSERT, UPDATE and DELETE.
+ *
+ * The item at 'targetoff' is replaced with the items in 'newitems'. If
+ * 'newitems' is empty, the old item is simply removed.
+ *
+ * This function handles splitting the page if the new items don't fit.
+ */
+static void
+zsbt_tid_replace_item(Relation rel, Buffer buf, OffsetNumber targetoff, List *newitems,
+ zs_pending_undo_op *undo_op)
+{
+ Page page = BufferGetPage(buf);
+ ItemId iid;
+ ZSTidArrayItem *olditem;
+ ListCell *lc;
+ ssize_t sizediff;
+
+ /*
+ * Find the item that covers the given tid.
+ */
+ if (targetoff < FirstOffsetNumber || targetoff > PageGetMaxOffsetNumber(page))
+ elog(ERROR, "could not find item at off %d to replace", targetoff);
+ iid = PageGetItemId(page, targetoff);
+ olditem = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ /* Calculate how much free space we'll need */
+ sizediff = -(olditem->t_size + sizeof(ItemIdData));
+ foreach(lc, newitems)
+ {
+ ZSTidArrayItem *newitem = (ZSTidArrayItem *) lfirst(lc);
+
+ sizediff += newitem->t_size + sizeof(ItemIdData);
+ }
+
+ /* Can we fit them? (Note: sizediff can be negative, hence the cast.) */
+ if (sizediff <= (ssize_t) PageGetExactFreeSpace(page))
+ {
+ ZSTidArrayItem *newitem;
+ OffsetNumber off;
+
+ if (RelationNeedsWAL(rel))
+ zsbt_wal_log_tidleaf_items_begin(list_length(newitems), undo_op);
+
+ START_CRIT_SECTION();
+
+ /* Remove existing item, and add new ones */
+ if (newitems == NIL)
+ PageIndexTupleDelete(page, targetoff);
+ else
+ {
+ lc = list_head(newitems);
+ newitem = (ZSTidArrayItem *) lfirst(lc);
+ if (!PageIndexTupleOverwrite(page, targetoff, (Item) newitem, newitem->t_size))
+ elog(ERROR, "could not replace item in TID tree page at off %d", targetoff);
+ lc = lnext(newitems, lc);
+
+ off = targetoff + 1;
+ for (; lc != NULL; lc = lnext(newitems, lc))
+ {
+ newitem = (ZSTidArrayItem *) lfirst(lc);
+ if (!PageAddItem(page, (Item) newitem, newitem->t_size, off, false, false))
+ elog(ERROR, "could not add item in TID tree page at off %d", off);
+ off++;
+ }
+ }
+ MarkBufferDirty(buf);
+
+ if (undo_op)
+ zsundo_finish_pending_op(undo_op, (char *) &undo_op->payload);
+
+ if (RelationNeedsWAL(rel))
+ zsbt_wal_log_tidleaf_items(rel, buf, targetoff, true, newitems, undo_op);
+
+ END_CRIT_SECTION();
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ zstid lasttid = 0;
+ OffsetNumber off;
+
+ for (off = FirstOffsetNumber; off <= PageGetMaxOffsetNumber(page); off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ Assert(item->t_firsttid >= lasttid);
+ lasttid = item->t_endtid;
+ }
+ }
+#endif
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (undo_op)
+ {
+ UnlockReleaseBuffer(undo_op->reservation.undobuf);
+ pfree(undo_op);
+ }
+ }
+ else
+ {
+ /* Have to split the page. */
+ List *items = NIL;
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ OffsetNumber off;
+
+ /*
+ * Construct a List that contains all the items in the right order, and
+ * let zsbt_tid_recompress_replace() do the heavy lifting to fit them on
+ * pages.
+ */
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId iid = PageGetItemId(page, off);
+ ZSTidArrayItem *item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (off == targetoff)
+ {
+ foreach(lc, newitems)
+ {
+ items = lappend(items, (ZSTidArrayItem *) lfirst(lc));
+ }
+ }
+ else
+ items = lappend(items, item);
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ zstid endtid = 0;
+ ListCell *lc;
+
+ foreach (lc, items)
+ {
+ ZSTidArrayItem *i = (ZSTidArrayItem *) lfirst(lc);
+
+ Assert(i->t_firsttid >= endtid);
+ Assert(i->t_endtid > i->t_firsttid);
+ endtid = i->t_endtid;
+ }
+ }
+#endif
+
+ /* Pass the list to the recompressor. */
+ IncrBufferRefCount(buf);
+ if (items)
+ {
+ zsbt_tid_recompress_replace(rel, buf, items, undo_op);
+ }
+ else
+ {
+ zs_split_stack *stack;
+
+ stack = zsbt_unlink_page(rel, ZS_META_ATTRIBUTE_NUM, buf, 0);
+
+ if (!stack)
+ {
+ /* failed. */
+ Page newpage = PageGetTempPageCopySpecial(BufferGetPage(buf));
+
+ stack = zs_new_split_stack_entry(buf, newpage);
+ }
+
+ /* apply the changes */
+ zs_apply_split_changes(rel, stack, undo_op, 0);
+ }
+
+ list_free(items);
+ }
+}
+
+/*
+ * Recompressor routines
+ */
+typedef struct
+{
+ Page currpage;
+
+ /* first page writes over the old buffer, subsequent pages get newly-allocated buffers */
+ zs_split_stack *stack_head;
+ zs_split_stack *stack_tail;
+
+ int num_pages;
+ int free_space_per_page;
+
+ zstid hikey;
+} zsbt_tid_recompress_context;
+
+static void
+zsbt_tid_recompress_newpage(zsbt_tid_recompress_context *cxt, zstid nexttid, int flags)
+{
+ Page newpage;
+ ZSBtreePageOpaque *newopaque;
+ zs_split_stack *stack;
+
+ if (cxt->currpage)
+ {
+ /* set the last tid on previous page */
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(cxt->currpage);
+
+ oldopaque->zs_hikey = nexttid;
+ }
+
+ newpage = (Page) palloc(BLCKSZ);
+ PageInit(newpage, BLCKSZ, sizeof(ZSBtreePageOpaque));
+
+ stack = zs_new_split_stack_entry(InvalidBuffer, /* will be assigned later */
+ newpage);
+ if (cxt->stack_tail)
+ cxt->stack_tail->next = stack;
+ else
+ cxt->stack_head = stack;
+ cxt->stack_tail = stack;
+
+ cxt->currpage = newpage;
+
+ newopaque = ZSBtreePageGetOpaque(newpage);
+ newopaque->zs_attno = ZS_META_ATTRIBUTE_NUM;
+ newopaque->zs_next = InvalidBlockNumber; /* filled in later */
+ newopaque->zs_lokey = nexttid;
+ newopaque->zs_hikey = cxt->hikey; /* overwritten later, if this is not last page */
+ newopaque->zs_level = 0;
+ newopaque->zs_flags = flags;
+ newopaque->zs_page_id = ZS_BTREE_PAGE_ID;
+}
+
+static void
+zsbt_tid_recompress_add_to_page(zsbt_tid_recompress_context *cxt, ZSTidArrayItem *item)
+{
+ OffsetNumber maxoff;
+ Size freespc;
+
+ freespc = PageGetExactFreeSpace(cxt->currpage);
+ if (freespc < item->t_size + sizeof(ItemIdData) ||
+ freespc < cxt->free_space_per_page)
+ {
+ zsbt_tid_recompress_newpage(cxt, item->t_firsttid, 0);
+ }
+
+ maxoff = PageGetMaxOffsetNumber(cxt->currpage);
+ if (!PageAddItem(cxt->currpage, (Item) item, item->t_size, maxoff + 1, true, false))
+ elog(ERROR, "could not add item to TID tree page");
+}
+
+/*
+ * Subroutine of zsbt_tid_recompress_replace. Compute how much space the
+ * items will take, and compute how many pages will be needed for them, and
+ * decide how to distribute any free space that's left over among the
+ * pages.
+ *
+ * Like in B-tree indexes, we aim for 50/50 splits, except for the
+ * rightmost page, where we aim for 90/10, so that most of the free space
+ * is left at the end of the index, where it's useful for new inserts. The
+ * 90/10 splits ensure that we don't waste too much space on a table
+ * that's loaded at the end, and never updated.
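+ *
+ * For example (illustrative numbers only): with roughly 8140 usable bytes
+ * on an empty page, 20000 bytes of items need three pages. On a rightmost
+ * split, each of the first two pages reserves only about 220 bytes of
+ * slack, so nearly all of the leftover free space ends up on the last
+ * page; on a non-rightmost split the slack is divided evenly.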
+ */
+static void
+zsbt_tid_recompress_picksplit(zsbt_tid_recompress_context *cxt, List *items)
+{
+ size_t total_sz;
+ int num_pages;
+ int space_on_empty_page;
+ Size free_space_per_page;
+ ListCell *lc;
+
+ space_on_empty_page = BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - MAXALIGN(sizeof(ZSBtreePageOpaque));
+
+ /* Compute total space needed for all the items. */
+ total_sz = 0;
+ foreach(lc, items)
+ {
+ ZSTidArrayItem *item = lfirst(lc);
+
+ total_sz += sizeof(ItemIdData) + item->t_size;
+ }
+
+ /* How many pages will we need for them? */
+ num_pages = (total_sz + space_on_empty_page - 1) / space_on_empty_page;
+
+ /* If everything fits on one page, don't split */
+ if (num_pages == 1)
+ {
+ free_space_per_page = 0;
+ }
+ /* If this is the rightmost page, do a 90/10 split */
+ else if (cxt->hikey == MaxPlusOneZSTid)
+ {
+ /*
+ * What does 90/10 mean if we have to use more than two pages? It means
+ * that the earlier pages are packed nearly full, and roughly 90% of the
+ * leftover free space is left on the last page.
+ */
+ double total_free_space;
+
+ total_free_space = space_on_empty_page * num_pages - total_sz;
+
+ free_space_per_page = total_free_space * 0.1 / (num_pages - 1);
+ }
+ /* Otherwise, aim for an even 50/50 split */
+ else
+ {
+ free_space_per_page = (space_on_empty_page * num_pages - total_sz) / num_pages;
+ }
+
+ cxt->num_pages = num_pages;
+ cxt->free_space_per_page = free_space_per_page;
+}
+
+/*
+ * Rewrite a leaf page, with given 'items' as the new content.
+ *
+ * If there are any uncompressed items in the list, we try to compress them.
+ * Any already-compressed items are added as is.
+ *
+ * If the items no longer fit on the page, then the page is split. It is
+ * entirely possible that they don't fit even on two pages; we split the page
+ * into as many pages as needed. Hopefully not more than a few pages, though,
+ * because otherwise you might hit limits on the number of buffer pins (with
+ * tiny shared_buffers).
+ *
+ * On entry, 'oldbuf' must be pinned and exclusive-locked. On exit, the lock
+ * is released, but it's still pinned.
+ *
+ * TODO: Try to combine single items, and existing array-items, into new array
+ * items.
+ */
+static void
+zsbt_tid_recompress_replace(Relation rel, Buffer oldbuf, List *items, zs_pending_undo_op *undo_op)
+{
+ ListCell *lc;
+ zsbt_tid_recompress_context cxt;
+ ZSBtreePageOpaque *oldopaque = ZSBtreePageGetOpaque(BufferGetPage(oldbuf));
+ BlockNumber orignextblk;
+ zs_split_stack *stack;
+ List *downlinks = NIL;
+
+ orignextblk = oldopaque->zs_next;
+
+ cxt.currpage = NULL;
+ cxt.stack_head = cxt.stack_tail = NULL;
+ cxt.hikey = oldopaque->zs_hikey;
+
+ zsbt_tid_recompress_picksplit(&cxt, items);
+ zsbt_tid_recompress_newpage(&cxt, oldopaque->zs_lokey, (oldopaque->zs_flags & ZSBT_ROOT));
+
+ foreach(lc, items)
+ {
+ ZSTidArrayItem *item = (ZSTidArrayItem *) lfirst(lc);
+
+ zsbt_tid_recompress_add_to_page(&cxt, item);
+ }
+
+ /*
+ * Ok, we now have a list of pages, to replace the original page, as private
+ * in-memory copies. Allocate buffers for them, and write them out.
+ *
+ * allocate all the pages before entering critical section, so that
+ * out-of-disk-space doesn't lead to PANIC
+ */
+ stack = cxt.stack_head;
+ Assert(stack->buf == InvalidBuffer);
+ stack->buf = oldbuf;
+ while (stack->next)
+ {
+ Page thispage = stack->page;
+ ZSBtreePageOpaque *thisopaque = ZSBtreePageGetOpaque(thispage);
+ ZSBtreeInternalPageItem *downlink;
+ Buffer nextbuf;
+
+ Assert(stack->next->buf == InvalidBuffer);
+
+ nextbuf = zspage_getnewbuf(rel, 0);
+ stack->next->buf = nextbuf;
+
+ thisopaque->zs_next = BufferGetBlockNumber(nextbuf);
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = thisopaque->zs_hikey;
+ downlink->childblk = BufferGetBlockNumber(nextbuf);
+ downlinks = lappend(downlinks, downlink);
+
+ stack = stack->next;
+ }
+ /* last one in the chain */
+ ZSBtreePageGetOpaque(stack->page)->zs_next = orignextblk;
+
+ /*
+ * zsbt_tid_recompress_picksplit() calculated that we'd need
+ * 'cxt.num_pages' pages. Check that it matches with how many pages we
+ * actually created.
+ * TODO: sometimes we may end up creating pages greater than the number of
+ * pages calculated. Correct calculation of cxt.num_pages.
+ */
+ Assert(list_length(downlinks) + 1 >= cxt.num_pages);
+
+ /* If we had to split, insert downlinks for the new pages. */
+ if (cxt.stack_head->next)
+ {
+ oldopaque = ZSBtreePageGetOpaque(cxt.stack_head->page);
+
+ if ((oldopaque->zs_flags & ZSBT_ROOT) != 0)
+ {
+ ZSBtreeInternalPageItem *downlink;
+
+ downlink = palloc(sizeof(ZSBtreeInternalPageItem));
+ downlink->tid = MinZSTid;
+ downlink->childblk = BufferGetBlockNumber(cxt.stack_head->buf);
+ downlinks = lcons(downlink, downlinks);
+
+ cxt.stack_tail->next = zsbt_newroot(rel, ZS_META_ATTRIBUTE_NUM,
+ oldopaque->zs_level + 1, downlinks);
+
+ /* clear the ZSBT_ROOT flag on the old root page */
+ oldopaque->zs_flags &= ~ZSBT_ROOT;
+ }
+ else
+ {
+ cxt.stack_tail->next = zsbt_insert_downlinks(rel, ZS_META_ATTRIBUTE_NUM,
+ oldopaque->zs_lokey, BufferGetBlockNumber(oldbuf), oldopaque->zs_level + 1,
+ downlinks);
+ }
+ /* note: stack_tail is not the real tail anymore */
+ }
+
+ /* Finally, overwrite all the pages we had to modify */
+ zs_apply_split_changes(rel, cxt.stack_head, undo_op, 0);
+}
+
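+/*
+ * Binary search within a TID tree leaf page. Returns the offset of the last
+ * item whose t_firsttid is <= 'key', or InvalidOffsetNumber if 'key' is
+ * smaller than the first TID of every item on the page.
+ */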
+static OffsetNumber
+zsbt_binsrch_tidpage(zstid key, Page page)
+{
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ OffsetNumber low,
+ high,
+ mid;
+
+ low = FirstOffsetNumber;
+ high = maxoff + 1;
+ while (high > low)
+ {
+ ItemId iid;
+ ZSTidArrayItem *item;
+
+ mid = low + (high - low) / 2;
+
+ iid = PageGetItemId(page, mid);
+ item = (ZSTidArrayItem *) PageGetItem(page, iid);
+
+ if (key >= item->t_firsttid)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+ return low - 1;
+}
+
+/*
+ * Start a WAL operation to log changes to tid tree leaf items.
+ * This allocates enough space to accommodate records for the tid leaf items and
+ * any associated undo_op.
+ */
+static void
+zsbt_wal_log_tidleaf_items_begin(int nitems, zs_pending_undo_op *undo_op)
+{
+ int nrdatas;
+
+ XLogBeginInsert();
+ /*
+ * We allocate an rdata per tid leaf item. We may need two extra
+ * rdatas for UNDO. This must be called before we enter the critical
+ * section as XLogEnsureRecordSpace() performs memory allocation.
+ */
+ nrdatas = nitems + 1;
+ if (undo_op)
+ nrdatas += 2;
+ XLogEnsureRecordSpace(0, nrdatas);
+}
+
+/*
+ * This must be called after zsbt_wal_log_tidleaf_items_begin(), and from
+ * within a critical section.
+ */
+static void
+zsbt_wal_log_tidleaf_items(Relation rel, Buffer buf,
+ OffsetNumber off, bool replace, List *items,
+ zs_pending_undo_op *undo_op)
+{
+ ListCell *lc;
+ XLogRecPtr recptr;
+ wal_zedstore_tidleaf_items xlrec;
+
+ Assert(CritSectionCount > 0);
+
+ xlrec.nitems = list_length(items);
+ xlrec.off = off;
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalTidLeafItems);
+
+ foreach(lc, items)
+ {
+ void *item = (void *) lfirst(lc);
+ size_t itemsz;
+
+ itemsz = ((ZSTidArrayItem *) item)->t_size;
+
+ XLogRegisterBufData(0, item, itemsz);
+ }
+
+ if (undo_op)
+ XLogRegisterUndoOp(1, undo_op);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID,
+ replace ? WAL_ZEDSTORE_TIDLEAF_REPLACE_ITEM : WAL_ZEDSTORE_TIDLEAF_ADD_ITEMS );
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ if (undo_op)
+ PageSetLSN(BufferGetPage(undo_op->reservation.undobuf), recptr);
+}
+
+void
+zsbt_tidleaf_items_redo(XLogReaderState *record, bool replace)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_tidleaf_items *xlrec =
+ (wal_zedstore_tidleaf_items *) XLogRecGetData(record);
+ Buffer buffer;
+ Buffer undobuf;
+
+ if (XLogRecHasBlockRef(record, 1))
+ undobuf = XLogRedoUndoOp(record, 1);
+ else
+ undobuf = InvalidBuffer;
+
+ if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+ {
+ Page page = (Page) BufferGetPage(buffer);
+ OffsetNumber off = xlrec->off;
+
+ if (xlrec->nitems == 0)
+ {
+ Assert(replace);
+ PageIndexTupleDelete(page, off);
+ }
+ else
+ {
+ char itembuf[BLCKSZ + MAXIMUM_ALIGNOF];
+ char *itembufp;
+ Size datasz;
+ char *data;
+ char *p;
+ int i;
+
+ itembufp = (char *) MAXALIGN(itembuf);
+
+ data = XLogRecGetBlockData(record, 0, &datasz);
+ p = data;
+ for (i = 0; i < xlrec->nitems; i++)
+ {
+ uint16 itemsz;
+
+ /*
+ * XXX: we assume that both ZSTidArrayItem and ZSAttributeArrayItem have
+ * t_size as the first field.
+ */
+ memcpy(&itemsz, p, sizeof(uint16));
+ Assert(itemsz > 0);
+ Assert(itemsz < BLCKSZ);
+ memcpy(itembufp, p, itemsz);
+ p += itemsz;
+
+ if (replace && i == 0)
+ {
+ if (!PageIndexTupleOverwrite(page, off, (Item) itembuf, itemsz))
+ elog(ERROR, "could not replace item on zedstore btree page at off %d", off);
+ }
+ else if (PageAddItem(page, (Item) itembufp, itemsz, off, false, false)
+ == InvalidOffsetNumber)
+ {
+ elog(ERROR, "could not add item to zedstore btree page");
+ }
+ off++;
+ }
+ Assert(p - data == datasz);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+ if (BufferIsValid(undobuf))
+ UnlockReleaseBuffer(undobuf);
+}
diff --git a/src/backend/access/zedstore/zedstore_toast.c b/src/backend/access/zedstore/zedstore_toast.c
new file mode 100644
index 0000000000..70eb1277b9
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_toast.c
@@ -0,0 +1,324 @@
+/*
+ * zedstore_toast.c
+ * Routines for Toasting oversized tuples in Zedstore
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_toast.c
+ */
+#include "postgres.h"
+
+#include "access/toast_internals.h"
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/datum.h"
+#include "utils/rel.h"
+
+static void zstoast_wal_log_newpage(Buffer prevbuf, Buffer buf, zstid tid, AttrNumber attno,
+ int offset, int32 total_size);
+
+/*
+ * Toast a datum, inside the ZedStore file.
+ *
+ * This is similar to regular toasting, but instead of using a separate toast
+ * heap and index, the datum is stored within the same ZedStore file as the
+ * btrees and the rest of the table data. A chain of "toast pages" is allocated
+ * for the datum, and each page is filled with as much of the datum as possible.
+ */
+Datum
+zedstore_toast_datum(Relation rel, AttrNumber attno, Datum value, zstid tid)
+{
+ varatt_zs_toastptr *toastptr;
+ BlockNumber firstblk = InvalidBlockNumber;
+ Buffer buf = InvalidBuffer;
+ Page page;
+ ZSToastPageOpaque *opaque;
+ Buffer prevbuf = InvalidBuffer;
+ ZSToastPageOpaque *prevopaque = NULL;
+ char *ptr;
+ int32 total_size;
+ int32 decompressed_size = 0;
+ int32 offset;
+ bool is_compressed;
+ bool is_first;
+ Datum toasted_datum;
+
+ Assert(tid != InvalidZSTid);
+
+ /*
+ * TID btree will always be inserted first, so there must be > 0 blocks
+ */
+ Assert(RelationGetNumberOfBlocks(rel) != 0);
+
+ if (VARATT_IS_COMPRESSED(value))
+ toasted_datum = value;
+ else
+ toasted_datum = toast_compress_datum(value);
+ if (DatumGetPointer(toasted_datum) != NULL)
+ {
+ /*
+ * If the compressed datum can be stored inline, return the datum
+ * directly.
+ */
+ if (VARSIZE_ANY(toasted_datum) <= MaxZedStoreDatumSize)
+ {
+ return toasted_datum;
+ }
+
+ is_compressed = true;
+ decompressed_size = TOAST_COMPRESS_RAWSIZE(toasted_datum);
+ ptr = TOAST_COMPRESS_RAWDATA(toasted_datum);
+ total_size = VARSIZE_ANY(toasted_datum) - TOAST_COMPRESS_HDRSZ;
+ }
+ else
+ {
+ /*
+ * If the compression doesn't reduce the size enough, allocate a
+ * toast page for it.
+ */
+ is_compressed = false;
+ ptr = VARDATA_ANY(value);
+ total_size = VARSIZE_ANY_EXHDR(value);
+ }
+
+
+ offset = 0;
+ is_first = true;
+ while (total_size - offset > 0)
+ {
+ Size thisbytes;
+
+ buf = zspage_getnewbuf(rel, ZS_INVALID_ATTRIBUTE_NUM);
+ if (prevbuf == InvalidBuffer)
+ firstblk = BufferGetBlockNumber(buf);
+
+ START_CRIT_SECTION();
+
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, sizeof(ZSToastPageOpaque));
+
+ thisbytes = Min(total_size - offset, PageGetExactFreeSpace(page));
+
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+ opaque->zs_tid = tid;
+ opaque->zs_attno = attno;
+ opaque->zs_total_size = total_size;
+ opaque->zs_decompressed_size = decompressed_size;
+ opaque->zs_is_compressed = is_compressed;
+ opaque->zs_slice_offset = offset;
+ opaque->zs_prev = is_first ? InvalidBlockNumber : BufferGetBlockNumber(prevbuf);
+ opaque->zs_next = InvalidBlockNumber;
+ opaque->zs_flags = 0;
+ opaque->zs_page_id = ZS_TOAST_PAGE_ID;
+
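+ /*
+ * The datum slice is stored directly after the page header, and pd_lower
+ * marks the end of the slice data on this page.
+ */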
+ memcpy((char *) page + SizeOfPageHeaderData, ptr, thisbytes);
+ ((PageHeader) page)->pd_lower += thisbytes;
+
+ if (!is_first)
+ {
+ prevopaque->zs_next = BufferGetBlockNumber(buf);
+ MarkBufferDirty(prevbuf);
+ }
+
+ MarkBufferDirty(buf);
+
+ if (RelationNeedsWAL(rel))
+ zstoast_wal_log_newpage(prevbuf, buf, tid, attno, offset, total_size);
+
+ END_CRIT_SECTION();
+
+ if (prevbuf != InvalidBuffer)
+ UnlockReleaseBuffer(prevbuf);
+ ptr += thisbytes;
+ offset += thisbytes;
+ prevbuf = buf;
+ prevopaque = opaque;
+ is_first = false;
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ toastptr = palloc0(sizeof(varatt_zs_toastptr));
+ SET_VARTAG_1B_E(toastptr, VARTAG_ZEDSTORE);
+ toastptr->zst_block = firstblk;
+
+ return PointerGetDatum(toastptr);
+}
+
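+/*
+ * Fetch a toasted datum back from its chain of toast pages, and return it as
+ * a single palloc'd varlena. If the datum was stored compressed, the result
+ * is still in compressed format.
+ */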
+Datum
+zedstore_toast_flatten(Relation rel, AttrNumber attno, zstid tid, Datum toasted)
+{
+ varatt_zs_toastptr *toastptr = (varatt_zs_toastptr *) DatumGetPointer(toasted);
+ BlockNumber nextblk;
+ BlockNumber prevblk;
+ char *result = NULL;
+ char *ptr = NULL;
+ int32 total_size = 0;
+
+ Assert(toastptr->va_tag == VARTAG_ZEDSTORE);
+
+ prevblk = InvalidBlockNumber;
+ nextblk = toastptr->zst_block;
+
+ while (nextblk != InvalidBlockNumber)
+ {
+ Buffer buf;
+ Page page;
+ ZSToastPageOpaque *opaque;
+ uint32 size;
+
+ buf = ReadBuffer(rel, nextblk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+
+ Assert(opaque->zs_attno == attno);
+ Assert(opaque->zs_prev == prevblk);
+
+ if (prevblk == InvalidBlockNumber)
+ {
+ Assert(opaque->zs_tid == tid);
+
+ total_size = opaque->zs_total_size;
+
+ if (opaque->zs_is_compressed)
+ {
+ result = palloc(total_size + TOAST_COMPRESS_HDRSZ);
+
+ TOAST_COMPRESS_SET_RAWSIZE(result, opaque->zs_decompressed_size);
+ SET_VARSIZE_COMPRESSED(result, total_size + TOAST_COMPRESS_HDRSZ);
+ ptr = result + TOAST_COMPRESS_HDRSZ;
+ }
+ else
+ {
+ result = palloc(total_size + VARHDRSZ);
+ SET_VARSIZE(result, total_size + VARHDRSZ);
+ ptr = result + VARHDRSZ;
+ }
+ }
+
+ size = ((PageHeader) page)->pd_lower - SizeOfPageHeaderData;
+ memcpy(ptr, (char *) page + SizeOfPageHeaderData, size);
+ ptr += size;
+
+ prevblk = nextblk;
+ nextblk = opaque->zs_next;
+ UnlockReleaseBuffer(buf);
+ }
+ Assert(total_size > 0);
+ Assert(ptr == result + VARSIZE_ANY(result));
+
+ return PointerGetDatum(result);
+}
+
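+/*
+ * Release the chain of toast pages belonging to the given TID, starting at
+ * 'blkno'. Each page in the chain is freed with zspage_delete_page().
+ */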
+void
+zedstore_toast_delete(Relation rel, Form_pg_attribute attr, zstid tid, BlockNumber blkno)
+{
+ BlockNumber nextblk;
+
+ nextblk = blkno;
+
+ while (nextblk != InvalidBlockNumber)
+ {
+ Buffer buf;
+ Page page;
+ ZSToastPageOpaque *opaque;
+
+ buf = ReadBuffer(rel, nextblk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ opaque = (ZSToastPageOpaque *) PageGetSpecialPointer(page);
+
+ if (opaque->zs_tid != tid)
+ {
+ UnlockReleaseBuffer(buf);
+ break;
+ }
+
+ Assert(opaque->zs_attno == attr->attnum);
+
+ nextblk = opaque->zs_next;
+ zspage_delete_page(rel, buf, InvalidBuffer, ZS_INVALID_ATTRIBUTE_NUM);
+ UnlockReleaseBuffer(buf);
+ }
+}
+
+static void
+zstoast_wal_log_newpage(Buffer prevbuf, Buffer buf, zstid tid, AttrNumber attno,
+ int offset, int32 total_size)
+{
+ wal_zedstore_toast_newpage xlrec;
+ XLogRecPtr recptr;
+
+ Assert(offset <= total_size);
+
+ xlrec.tid = tid;
+ xlrec.attno = attno;
+ xlrec.offset = offset;
+ xlrec.total_size = total_size;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalToastNewPage);
+
+ /*
+ * It is easier to just force a full-page image than to WAL-log the data.
+ * That means that the information in the wal_zedstore_toast_newpage struct
+ * isn't really necessary, but keep it for now, for the benefit of debugging
+ * with pg_waldump.
+ */
+ XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+ if (BufferIsValid(prevbuf))
+ XLogRegisterBuffer(1, prevbuf, REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_TOAST_NEWPAGE);
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ if (BufferIsValid(prevbuf))
+ PageSetLSN(BufferGetPage(prevbuf), recptr);
+}
+
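+/*
+ * REDO handler for WAL_ZEDSTORE_TOAST_NEWPAGE. The new toast page is
+ * restored from its full-page image, and the previous page in the chain
+ * (if any) is updated to point to it.
+ */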
+void
+zstoast_newpage_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+#if UNUSED
+ wal_zedstore_toast_newpage *xlrec = (wal_zedstore_toast_newpage *) XLogRecGetData(record);
+#endif
+ BlockNumber blkno;
+ Buffer buf;
+ Buffer prevbuf = InvalidBuffer;
+
+ XLogRecGetBlockTag(record, 0, NULL, NULL, &blkno);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+ elog(ERROR, "zedstore toast newpage WAL record did not contain a full-page image");
+
+ if (XLogRecHasBlockRef(record, 1))
+ {
+ if (XLogReadBufferForRedo(record, 1, &prevbuf) == BLK_NEEDS_REDO)
+ {
+ Page prevpage = BufferGetPage(prevbuf);
+ ZSToastPageOpaque *prevopaque;
+
+ prevopaque = (ZSToastPageOpaque *) PageGetSpecialPointer(prevpage);
+ prevopaque->zs_next = BufferGetBlockNumber(buf);
+
+ PageSetLSN(prevpage, lsn);
+ MarkBufferDirty(prevbuf);
+ }
+ }
+ else
+ prevbuf = InvalidBuffer;
+
+ if (BufferIsValid(prevbuf))
+ UnlockReleaseBuffer(prevbuf);
+ UnlockReleaseBuffer(buf);
+}
diff --git a/src/backend/access/zedstore/zedstore_tuplebuffer.c b/src/backend/access/zedstore/zedstore_tuplebuffer.c
new file mode 100644
index 0000000000..5e8c5635fa
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_tuplebuffer.c
@@ -0,0 +1,583 @@
+/*
+ * zedstore_tuplebuffer.c
+ * Buffering insertions into a zedstore table
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tuplebuffer.c
+ */
+#include "postgres.h"
+
+#include "access/detoast.h"
+#include "access/table.h"
+#include "access/zedstoream.h"
+#include "access/zedstore_internal.h"
+#include "common/hashfn.h"
+#include "miscadmin.h"
+#include "utils/datum.h"
+
+/*
+ * Single inserts:
+ * If we see more than SINGLE_INSERT_TID_RESERVATION_THRESHOLD insertions with the
+ * same XID and CID, with no "flush" calls in between, we start reserving
+ * TIDs in batches of size SINGLE_INSERT_TID_RESERVATION_SIZE. The downside of
+ * reserving TIDs in batches is that if we are left with any unused TIDs at end
+ * of transaction (or when a "flush" call comes), we need to go and kill the
+ * unused TIDs. So only do batching when it seems like we're inserting a lot of rows.
+ *
+ * Multi inserts:
+ * Whenever we see a multi-insert, we allocate MULTI_INSERT_TID_RESERVATION_FACTOR
+ * times more tids than the number requested. This is to ensure that we don't
+ * end up with inefficient page splits from out-of-tid-order inserts into full-ish
+ * btree pages. Such inserts are typically observed under highly concurrent workloads.
+ * See https://www.postgresql.org/message-id/CADwEdopF2S6uRXJRg%3DVZRfPZis80OnawAOCTSh_SrN2i1KGkMw%40mail.gmail.com
+ * for more details.
+ *
+ * TODO: expose these constants as GUCs as they are very workload sensitive.
+ */
+
+#define SINGLE_INSERT_TID_RESERVATION_THRESHOLD 5
+#define SINGLE_INSERT_TID_RESERVATION_SIZE 100
+#define MULTI_INSERT_TID_RESERVATION_FACTOR 10
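+
+/*
+ * For illustration, with the default values above: the first five single-row
+ * inserts for a given (xid, cid) each allocate exactly one TID from the TID
+ * tree; from the sixth one onwards we reserve TIDs in batches of 100. A
+ * multi-insert of, say, 50 rows reserves 500 TIDs up front, hands out the
+ * first 50, and keeps the remaining 450 for later inserts.
+ */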
+
+#define ATTBUFFER_SIZE (1024 * 1024)
+
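+/*
+ * Per-attribute insert buffer. Incoming rows are first collected into the
+ * small fixed-size arrays below; when 60 rows have accumulated they are
+ * encoded into the 'chunks' attstream, which is in turn written out to the
+ * attribute B-tree once it grows large enough (see zsbt_attbuffer_flush()).
+ */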
+typedef struct
+{
+ zstid buffered_tids[60];
+ Datum buffered_datums[60];
+ bool buffered_isnulls[60];
+ int num_buffered_rows;
+
+ attstream_buffer chunks;
+
+} attbuffer;
+
+typedef struct
+{
+ Oid relid; /* table's OID (hash key) */
+ char status; /* hash entry status */
+
+ int natts; /* # of attributes; can change if the table is ALTERed */
+ attbuffer *attbuffers;
+
+ uint64 num_repeated_single_inserts; /* # of repeated single inserts for the same (xid, cid) */
+
+ TransactionId reserved_tids_xid;
+ CommandId reserved_tids_cid;
+ zstid reserved_tids_start; /* inclusive */
+ zstid reserved_tids_end; /* inclusive */
+
+} tuplebuffer;
+
+
+/* define a hash table mapping relation OIDs to per-relation tuple buffers */
+#define SH_PREFIX tuplebuffers
+#define SH_ELEMENT_TYPE tuplebuffer
+#define SH_KEY_TYPE Oid
+#define SH_KEY relid
+#define SH_HASH_KEY(tb, key) murmurhash32(key)
+#define SH_EQUAL(tb, a, b) a == b
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+
+/* prototypes for internal functions */
+static void zsbt_attbuffer_spool(Relation rel, AttrNumber attno, attbuffer *attbuffer, int ntuples, zstid *tids, Datum *datums, bool *isnulls);
+static void zsbt_attbuffer_init(Form_pg_attribute attr, attbuffer *attbuffer);
+static void zsbt_attbuffer_flush(Relation rel, AttrNumber attno, attbuffer *attbuffer, bool all);
+static void tuplebuffer_kill_unused_reserved_tids(Relation rel, tuplebuffer *tupbuffer);
+
+static MemoryContext tuplebuffers_cxt = NULL;
+static struct tuplebuffers_hash *tuplebuffers = NULL;
+
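+/*
+ * Get the tuple buffer for the given relation, creating it on first use.
+ * If the relation has gained columns since the buffer was created (ALTER
+ * TABLE), the old buffer is flushed and a new one is set up for the current
+ * number of attributes.
+ */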
+static tuplebuffer *
+get_tuplebuffer(Relation rel)
+{
+ bool found;
+ tuplebuffer *tupbuffer;
+
+ if (tuplebuffers_cxt == NULL)
+ {
+ tuplebuffers_cxt = AllocSetContextCreate(TopTransactionContext,
+ "ZedstoreAMTupleBuffers",
+ ALLOCSET_DEFAULT_SIZES);
+ tuplebuffers = tuplebuffers_create(tuplebuffers_cxt, 10, NULL);
+ }
+retry:
+ tupbuffer = tuplebuffers_insert(tuplebuffers, RelationGetRelid(rel), &found);
+ if (!found)
+ {
+ MemoryContext oldcxt;
+ AttrNumber attno;
+ int natts;
+
+ oldcxt = MemoryContextSwitchTo(tuplebuffers_cxt);
+ natts = rel->rd_att->natts;
+ tupbuffer->attbuffers = palloc(natts * sizeof(attbuffer));
+ tupbuffer->natts = natts;
+
+ for (attno = 1; attno <= natts; attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr(rel->rd_att, attno - 1);
+ attbuffer *attbuffer = &tupbuffer->attbuffers[attno - 1];
+
+ zsbt_attbuffer_init(attr, attbuffer);
+ }
+
+ tupbuffer->reserved_tids_xid = InvalidTransactionId;
+ tupbuffer->reserved_tids_cid = InvalidCommandId;
+ tupbuffer->reserved_tids_start = InvalidZSTid;
+ tupbuffer->reserved_tids_end = InvalidZSTid;
+ tupbuffer->num_repeated_single_inserts = 0;
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else if (rel->rd_att->natts > tupbuffer->natts)
+ {
+ zsbt_tuplebuffer_flush(rel);
+ goto retry;
+ }
+
+ return tupbuffer;
+}
+
+/*
+ * Allocate TIDs for insert.
+ *
+ * First check if the reserved tids can cater to the number of tids requested for
+ * allocation (ntids). If yes, consume the tids from these reserved tids. Else,
+ * we have to request more tids by inserting into the tid tree.
+ *
+ * We reserve tids inside the tupbuffer for the same (xid, cid) combo. The way we
+ * reserve tids is slightly different for single-insert vs multi-insert.
+ *
+ * For single inserts, in the same (xid, cid) once we encounter number of inserts =
+ * SINGLE_INSERT_TID_RESERVATION_THRESHOLD, we request and reserve
+ * SINGLE_INSERT_TID_RESERVATION_SIZE number of tids.
+ *
+ * For multi-inserts, we request and reserve (ntids * MULTI_INSERT_TID_RESERVATION_FACTOR)
+ * number of tids.
+ */
+zstid
+zsbt_tuplebuffer_allocate_tids(Relation rel, TransactionId xid, CommandId cid, int ntids)
+{
+ tuplebuffer *tupbuffer;
+ zstid result;
+
+ tupbuffer = get_tuplebuffer(rel);
+
+ if (tupbuffer->reserved_tids_xid != xid ||
+ tupbuffer->reserved_tids_cid != cid)
+ {
+ /*
+ * This insertion is for a different XID or CID than before. (Or this
+ * is the first insertion.)
+ */
+ tuplebuffer_kill_unused_reserved_tids(rel, tupbuffer);
+ tupbuffer->num_repeated_single_inserts = 0;
+
+ tupbuffer->reserved_tids_xid = xid;
+ tupbuffer->reserved_tids_cid = cid;
+ tupbuffer->reserved_tids_start = InvalidZSTid;
+ tupbuffer->reserved_tids_end = InvalidZSTid;
+ }
+
+ if ((tupbuffer->reserved_tids_start != InvalidZSTid &&
+ tupbuffer->reserved_tids_end != InvalidZSTid) && ntids <=
+ (tupbuffer->reserved_tids_end - tupbuffer->reserved_tids_start + 1))
+ {
+ /* We have enough reserved tids */
+ result = tupbuffer->reserved_tids_start;
+ tupbuffer->reserved_tids_start += ntids;
+ }
+ else if (ntids == 1)
+ {
+ /* We don't have enough reserved tids for a single insert */
+ if (tupbuffer->num_repeated_single_inserts < SINGLE_INSERT_TID_RESERVATION_THRESHOLD)
+ {
+ /* We haven't seen many single inserts yet, so just allocate a single TID for this. */
+ result = zsbt_tid_multi_insert(rel, 1, xid, cid,
+ INVALID_SPECULATIVE_TOKEN, InvalidUndoPtr);
+ /* Since we don't reserve any tids, invalidate reservation fields */
+ tupbuffer->reserved_tids_start = InvalidZSTid;
+ tupbuffer->reserved_tids_end = InvalidZSTid;
+ }
+ else
+ {
+ /* We're in batch mode for single inserts. Reserve a new block of TIDs. */
+ result = zsbt_tid_multi_insert(rel, SINGLE_INSERT_TID_RESERVATION_SIZE, xid, cid,
+ INVALID_SPECULATIVE_TOKEN, InvalidUndoPtr);
+ tupbuffer->reserved_tids_start = result + 1;
+ tupbuffer->reserved_tids_end = result + SINGLE_INSERT_TID_RESERVATION_SIZE - 1;
+ }
+ tupbuffer->num_repeated_single_inserts++;
+ }
+ else
+ {
+ /* We don't have enough tids for a multi-insert. */
+
+ /*
+ * Kill the unused tids in the tuple buffer first since we will replace
+ * them with a list of fresh continuous tids.
+ */
+ tuplebuffer_kill_unused_reserved_tids(rel, tupbuffer);
+ result = zsbt_tid_multi_insert(rel, MULTI_INSERT_TID_RESERVATION_FACTOR * ntids, xid, cid,
+ INVALID_SPECULATIVE_TOKEN, InvalidUndoPtr);
+ tupbuffer->reserved_tids_end = result + (MULTI_INSERT_TID_RESERVATION_FACTOR * ntids) - 1;
+ tupbuffer->reserved_tids_start = result + ntids;
+ }
+
+ return result;
+}
+
+/*
+ * Buffer one tuple, to be inserted with the given TID. Oversized varlena
+ * attributes are toasted here before the values are spooled into the
+ * per-attribute buffers.
+ */
+void
+zsbt_tuplebuffer_spool_tuple(Relation rel, zstid tid, Datum *datums, bool *isnulls)
+{
+ AttrNumber attno;
+ tuplebuffer *tupbuffer;
+
+ tupbuffer = get_tuplebuffer(rel);
+
+ for (attno = 1; attno <= rel->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr(rel->rd_att, attno - 1);
+ attbuffer *attbuffer = &tupbuffer->attbuffers[attno - 1];
+ Datum datum;
+ bool isnull;
+
+ datum = datums[attno - 1];
+ isnull = isnulls[attno - 1];
+
+ if (!isnull && attr->attlen < 0 && VARATT_IS_EXTERNAL(datum))
+ datum = PointerGetDatum(detoast_external_attr((struct varlena *) DatumGetPointer(datum)));
+
+ /* If this datum is too large, toast it */
+ if (!isnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(datum) > MaxZedStoreDatumSize)
+ {
+ datum = zedstore_toast_datum(rel, attno, datum, tid);
+ }
+
+ zsbt_attbuffer_spool(rel, attno, attbuffer, 1, &tid, &datum, &isnull);
+ }
+}
+
+void
+zsbt_tuplebuffer_spool_slots(Relation rel, zstid *tids, TupleTableSlot **slots, int ntuples)
+{
+ AttrNumber attno;
+ tuplebuffer *tupbuffer;
+ Datum *datums;
+ bool *isnulls;
+
+ tupbuffer = get_tuplebuffer(rel);
+
+ datums = palloc(ntuples * sizeof(Datum));
+ isnulls = palloc(ntuples * sizeof(bool));
+
+ for (attno = 1; attno <= rel->rd_att->natts; attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr(rel->rd_att, attno - 1);
+ attbuffer *attbuffer = &tupbuffer->attbuffers[attno - 1];
+
+ for (int i = 0; i < ntuples; i++)
+ {
+ Datum datum;
+ bool isnull;
+
+ /* make sure the slot's attribute arrays are populated before use */
+ if (attno == 1)
+ slot_getallattrs(slots[i]);
+
+ datum = slots[i]->tts_values[attno - 1];
+ isnull = slots[i]->tts_isnull[attno - 1];
+
+ if (!isnull && attr->attlen < 0 && VARATT_IS_EXTERNAL(datum))
+ datum = PointerGetDatum(detoast_external_attr((struct varlena *) DatumGetPointer(datum)));
+
+ /* If this datum is too large, toast it */
+ if (!isnull && attr->attlen < 0 &&
+ VARSIZE_ANY_EXHDR(datum) > MaxZedStoreDatumSize)
+ {
+ datum = zedstore_toast_datum(rel, attno, datum, tids[i]);
+ }
+ datums[i] = datum;
+ isnulls[i] = isnull;
+ }
+
+ zsbt_attbuffer_spool(rel, attno, attbuffer, ntuples, tids, datums, isnulls);
+ }
+
+ pfree(datums);
+ pfree(isnulls);
+}
+
+
+static void
+zsbt_attbuffer_init(Form_pg_attribute attr, attbuffer *attbuffer)
+{
+ attstream_buffer *attbuf = &attbuffer->chunks;
+
+#define ATTBUF_INIT_SIZE 1024
+ attbuf->data = palloc(ATTBUF_INIT_SIZE);
+ attbuf->len = 0;
+ attbuf->maxlen = ATTBUF_INIT_SIZE;
+ attbuf->cursor = 0;
+
+ attbuf->firsttid = 0;
+ attbuf->lasttid = 0;
+
+ attbuf->attlen = attr->attlen;
+ attbuf->attbyval = attr->attbyval;
+
+ attbuffer->num_buffered_rows = 0;
+}
+
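+/*
+ * Add rows to the per-attribute buffer, flushing it whenever the row buffer
+ * fills up. Pass-by-reference datums are copied into the tuple buffer's
+ * memory context, since they must stay valid until the buffer is flushed.
+ */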
+static void
+zsbt_attbuffer_spool(Relation rel, AttrNumber attno, attbuffer *attbuffer,
+ int ntuples, zstid *tids, Datum *datums, bool *isnulls)
+{
+ int i;
+ attstream_buffer *chunks = &attbuffer->chunks;
+
+ for (i = 0; i < ntuples; i++)
+ {
+ Datum datum;
+
+ if (attbuffer->num_buffered_rows >= 60)
+ zsbt_attbuffer_flush(rel, attno, attbuffer, false);
+
+ if (!chunks->attbyval && !isnulls[i])
+ {
+ /*
+ * XXX: We have to make a copy of pass-by-reference values, because
+ * they need to live until the buffers are flushed (normally at
+ * end of transaction). That's pretty inefficient!
+ */
+ MemoryContext oldcxt = MemoryContextSwitchTo(tuplebuffers_cxt);
+ datum = zs_datumCopy(datums[i], chunks->attbyval, chunks->attlen);
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ datum = datums[i];
+
+ attbuffer->buffered_tids[attbuffer->num_buffered_rows] = tids[i];
+ attbuffer->buffered_datums[attbuffer->num_buffered_rows] = datum;
+ attbuffer->buffered_isnulls[attbuffer->num_buffered_rows] = isnulls[i];
+ attbuffer->num_buffered_rows++;
+ }
+}
+
+/*
+ * Encode the buffered rows into the attstream and, once enough data has
+ * accumulated (or 'all' is requested), write it out to the attribute B-tree.
+ */
+
+static void
+zsbt_attbuffer_flush(Relation rel, AttrNumber attno, attbuffer *attbuffer, bool all)
+{
+ int num_encoded;
+ int num_remain;
+ attstream_buffer *chunks = &attbuffer->chunks;
+
+ /* First encode more */
+ if (attbuffer->num_buffered_rows >= 60 ||
+ (all && attbuffer->num_buffered_rows > 0))
+ {
+ num_encoded = append_attstream(chunks, all, attbuffer->num_buffered_rows,
+ attbuffer->buffered_tids,
+ attbuffer->buffered_datums,
+ attbuffer->buffered_isnulls);
+ num_remain = attbuffer->num_buffered_rows - num_encoded;
+
+ if (!chunks->attbyval)
+ {
+ for (int i = 0; i < num_encoded; i++)
+ {
+ if (!attbuffer->buffered_isnulls[i])
+ pfree(DatumGetPointer(attbuffer->buffered_datums[i]));
+ }
+ }
+
+ memmove(attbuffer->buffered_tids, &attbuffer->buffered_tids[num_encoded], num_remain * sizeof(zstid));
+ memmove(attbuffer->buffered_datums, &attbuffer->buffered_datums[num_encoded], num_remain * sizeof(Datum));
+ memmove(attbuffer->buffered_isnulls, &attbuffer->buffered_isnulls[num_encoded], num_remain * sizeof(bool));
+ attbuffer->num_buffered_rows = num_remain;
+ }
+
+ while ((all && chunks->len - chunks->cursor > 0) ||
+ chunks->len - chunks->cursor > ATTBUFFER_SIZE)
+ {
+ zsbt_attr_add(rel, attno, chunks);
+ }
+}
+
+/*
+ * Remove any reserved but unused TIDs from the TID tree.
+ */
+static void
+tuplebuffer_kill_unused_reserved_tids(Relation rel, tuplebuffer *tupbuffer)
+{
+ IntegerSet *unused_tids;
+ zstid tid;
+
+ if ((tupbuffer->reserved_tids_start == InvalidZSTid &&
+ tupbuffer->reserved_tids_end == InvalidZSTid) ||
+ tupbuffer->reserved_tids_start > tupbuffer->reserved_tids_end)
+ return; /* no reserved TIDs */
+
+ /*
+ * XXX: We use the zsbt_tid_remove() function for this, but it's
+ * a bit too heavy-weight. It's geared towards VACUUM and removing
+ * millions of TIDs in one go. Also, we leak the IntegerSet object;
+ * usually flushing is done at end of transaction, so that's not
+ * a problem, but it could be if we need to flush a lot in the
+ * same transaction.
+ *
+ * XXX: It would be nice to adjust the UNDO record, too. Otherwise,
+ * if we abort, the poor sod that tries to discard the UNDO record
+ * will try to mark these TIDs as unused in vain.
+ */
+ unused_tids = intset_create();
+
+ for (tid = tupbuffer->reserved_tids_start;
+ tid <= tupbuffer->reserved_tids_end;
+ tid++)
+ {
+ intset_add_member(unused_tids, tid);
+ }
+
+ zsbt_tid_remove(rel, unused_tids);
+
+ tupbuffer->reserved_tids_start = InvalidZSTid;
+ tupbuffer->reserved_tids_end = InvalidZSTid;
+}
+
+static void
+tuplebuffer_flush_internal(Relation rel, tuplebuffer *tupbuffer)
+{
+ tuplebuffer_kill_unused_reserved_tids(rel, tupbuffer);
+
+ /* Flush the attribute data */
+ for (AttrNumber attno = 1; attno <= tupbuffer->natts; attno++)
+ {
+ attbuffer *attbuffer = &tupbuffer->attbuffers[attno - 1];
+
+ zsbt_attbuffer_flush(rel, attno, attbuffer, true);
+ }
+
+ tupbuffer->num_repeated_single_inserts = 0;
+}
+
+void
+zsbt_tuplebuffer_flush(Relation rel)
+{
+ tuplebuffer *tupbuffer;
+
+ if (!tuplebuffers)
+ return;
+ tupbuffer = tuplebuffers_lookup(tuplebuffers, RelationGetRelid(rel));
+ if (!tupbuffer)
+ return;
+
+ tuplebuffer_flush_internal(rel, tupbuffer);
+
+ for (int attno = 1 ; attno <= tupbuffer->natts; attno++)
+ {
+ attbuffer *attbuf = &(tupbuffer->attbuffers[attno-1]);
+ pfree(attbuf->chunks.data);
+ }
+ pfree(tupbuffer->attbuffers);
+
+ tuplebuffers_delete(tuplebuffers, RelationGetRelid(rel));
+}
+
+static void
+zsbt_tuplebuffers_flush(void)
+{
+ tuplebuffers_iterator iter;
+ tuplebuffer *tupbuffer;
+
+ tuplebuffers_start_iterate(tuplebuffers, &iter);
+ while ((tupbuffer = tuplebuffers_iterate(tuplebuffers, &iter)) != NULL)
+ {
+ Relation rel;
+
+ rel = table_open(tupbuffer->relid, NoLock);
+
+ tuplebuffer_flush_internal(rel, tupbuffer);
+
+ table_close(rel, NoLock);
+ }
+}
+
+
+
+/*
+ * End-of-transaction cleanup for zedstore.
+ *
+ * Flush tuple buffers in zedstore.
+ *
+ * We must flush everything before the top-level transaction commit becomes
+ * visible to others, so that they can see the data. On top-level transaction
+ * abort, we can simply drop everything we had buffered. That's fortunate,
+ * because we couldn't access the table during abort processing anyway.
+ *
+ * Subtransactions:
+ *
+ * After a subtransaction has been marked as aborted, we mustn't write
+ * out any attribute data belonging to the aborted subtransaction. Two
+ * reasons for that. Firstly, the TIDs belonging to an aborted
+ * subtransaction might be vacuumed away at any point. We mustn't write
+ * out attribute data for a TID that's already been vacuumed away in the
+ * TID tree. Secondly, subtransaction abort releases locks acquired in
+ * the subtransaction, and we cannot write out data if we're not holding
+ * a lock on the table. So we must throw our buffers away at subtransaction
+ * abort.
+ *
+ * Since we throw away our buffers at subtransaction abort, we must take
+ * care that the buffers are empty when a subtransaction begins. If there
+ * was any leftover buffered data for other subtransactions, we would
+ * throw away that data too, if the new subtransaction aborts.
+ *
+ * Writing out the buffers at subtransaction commit probably isn't necessary,
+ * but might as well play it safe and do it.
+ */
+void
+AtEOXact_zedstore_tuplebuffers(bool isCommit)
+{
+ if (tuplebuffers_cxt)
+ {
+ if (isCommit)
+ zsbt_tuplebuffers_flush();
+ MemoryContextDelete(tuplebuffers_cxt);
+ tuplebuffers_cxt = NULL;
+ tuplebuffers = NULL;
+ }
+}
+
+void
+AtSubStart_zedstore_tuplebuffers(void)
+{
+ if (tuplebuffers_cxt)
+ {
+ zsbt_tuplebuffers_flush();
+ MemoryContextDelete(tuplebuffers_cxt);
+ tuplebuffers_cxt = NULL;
+ tuplebuffers = NULL;
+ }
+}
+
+void
+AtEOSubXact_zedstore_tuplebuffers(bool isCommit)
+{
+ if (tuplebuffers_cxt)
+ {
+ if (isCommit)
+ zsbt_tuplebuffers_flush();
+ MemoryContextDelete(tuplebuffers_cxt);
+ tuplebuffers_cxt = NULL;
+ tuplebuffers = NULL;
+ }
+}
diff --git a/src/backend/access/zedstore/zedstore_tupslot.c b/src/backend/access/zedstore/zedstore_tupslot.c
new file mode 100644
index 0000000000..75944e43ff
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_tupslot.c
@@ -0,0 +1,272 @@
+/*
+ * zedstore_tupslot.c
+ * Implementation of a TupleTableSlot for zedstore.
+ *
+ * This implementation is identical to a virtual tuple slot
+ * (TTSOpsVirtual), except that it has a getsysattr() implementation
+ * that can return the 'xmin' and 'cmin' of the tuple.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_tupslot.c
+ */
+#include "postgres.h"
+
+#include "access/table.h"
+#include "access/zedstore_internal.h"
+#include "executor/tuptable.h"
+#include "utils/expandeddatum.h"
+
+const TupleTableSlotOps TTSOpsZedstore;
+
+static void
+tts_zedstore_init(TupleTableSlot *slot)
+{
+ ZedstoreTupleTableSlot *zslot = (ZedstoreTupleTableSlot *) slot;
+ zslot->visi_info = NULL;
+}
+
+static void
+tts_zedstore_release(TupleTableSlot *slot)
+{
+}
+
+static void
+tts_zedstore_clear(TupleTableSlot *slot)
+{
+ ZedstoreTupleTableSlot *zslot = (ZedstoreTupleTableSlot *) slot;
+ if (unlikely(TTS_SHOULDFREE(slot)))
+ {
+ pfree(zslot->data);
+ zslot->data = NULL;
+
+ slot->tts_flags &= ~TTS_FLAG_SHOULDFREE;
+ }
+
+ slot->tts_nvalid = 0;
+ slot->tts_flags |= TTS_FLAG_EMPTY;
+ ItemPointerSetInvalid(&slot->tts_tid);
+
+ zslot->visi_info = NULL;
+}
+
+/*
+ * Attribute values are readily available in the tts_values and tts_isnull
+ * arrays of a ZedstoreTupleTableSlot, so there should be no need to call
+ * getsomeattrs().
+ */
+static void
+tts_zedstore_getsomeattrs(TupleTableSlot *slot, int natts)
+{
+ elog(ERROR, "getsomeattrs is not required to be called on a zedstore tuple table slot");
+}
+
+/*
+ * We only support fetching 'xmin', currently. It's needed for referential
+ * integrity triggers (i.e. foreign keys).
+ */
+static Datum
+tts_zedstore_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
+{
+ ZedstoreTupleTableSlot *zslot = (ZedstoreTupleTableSlot *) slot;
+ if (attnum == MinTransactionIdAttributeNumber ||
+ attnum == MinCommandIdAttributeNumber)
+ {
+ *isnull = false;
+ if (attnum == MinTransactionIdAttributeNumber)
+ return TransactionIdGetDatum(zslot->visi_info ? zslot->visi_info->xmin : InvalidTransactionId);
+ else
+ {
+ Assert(attnum == MinCommandIdAttributeNumber);
+ return CommandIdGetDatum(zslot->visi_info ? zslot->visi_info->cmin : InvalidCommandId);
+ }
+ }
+ elog(ERROR, "zedstore tuple table slot does not have system attributes (except xmin and cmin)");
+
+ return 0; /* silence compiler warnings */
+}
+
+/*
+ * To materialize a zedstore slot all the datums that aren't passed by value
+ * have to be copied into the slot's memory context. To do so, compute the
+ * required size, and allocate enough memory to store all attributes in a
+ * single chunk. That's good for cache hit ratio, but more importantly,
+ * requires only a single memory allocation/deallocation.
+ */
+static void
+tts_zedstore_materialize(TupleTableSlot *slot)
+{
+ ZedstoreTupleTableSlot *vslot = (ZedstoreTupleTableSlot *) slot;
+ TupleDesc desc = slot->tts_tupleDescriptor;
+ Size sz = 0;
+ char *data;
+
+ /* already materialized */
+ if (TTS_SHOULDFREE(slot))
+ return;
+
+ /* copy visibility information to go with the slot */
+ if (vslot->visi_info)
+ {
+ vslot->visi_info_buf = *vslot->visi_info;
+ vslot->visi_info = &vslot->visi_info_buf;
+ }
+
+ /* compute size of memory required */
+ for (int natt = 0; natt < desc->natts; natt++)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, natt);
+ Datum val;
+
+ if (att->attbyval || slot->tts_isnull[natt])
+ continue;
+
+ val = slot->tts_values[natt];
+
+ if (att->attlen == -1 &&
+ VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val)))
+ {
+ /*
+ * We want to flatten the expanded value so that the materialized
+ * slot doesn't depend on it.
+ */
+ sz = att_align_nominal(sz, att->attalign);
+ sz += EOH_get_flat_size(DatumGetEOHP(val));
+ }
+ else
+ {
+ sz = att_align_nominal(sz, att->attalign);
+ sz = att_addlength_datum(sz, att->attlen, val);
+ }
+ }
+
+ /* all data is byval */
+ if (sz == 0)
+ return;
+
+ /* allocate memory */
+ vslot->data = data = MemoryContextAlloc(slot->tts_mcxt, sz);
+ slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+ /* and copy all attributes into the pre-allocated space */
+ for (int natt = 0; natt < desc->natts; natt++)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, natt);
+ Datum val;
+
+ if (att->attbyval || slot->tts_isnull[natt])
+ continue;
+
+ val = slot->tts_values[natt];
+
+ if (att->attlen == -1 &&
+ VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val)))
+ {
+ Size data_length;
+
+ /*
+ * We want to flatten the expanded value so that the materialized
+ * slot doesn't depend on it.
+ */
+ ExpandedObjectHeader *eoh = DatumGetEOHP(val);
+
+ data = (char *) att_align_nominal(data,
+ att->attalign);
+ data_length = EOH_get_flat_size(eoh);
+ EOH_flatten_into(eoh, data, data_length);
+
+ slot->tts_values[natt] = PointerGetDatum(data);
+ data += data_length;
+ }
+ else
+ {
+ Size data_length = 0;
+
+ data = (char *) att_align_nominal(data, att->attalign);
+ data_length = att_addlength_datum(data_length, att->attlen, val);
+
+ memcpy(data, DatumGetPointer(val), data_length);
+
+ slot->tts_values[natt] = PointerGetDatum(data);
+ data += data_length;
+ }
+ }
+}
+
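+/*
+ * Copy the contents of 'srcslot' into the zedstore slot 'dstslot'. If the
+ * source is also a zedstore slot, its visibility information is carried
+ * over; the result is then materialized so that it no longer depends on
+ * the source slot's memory.
+ */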
+static void
+tts_zedstore_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
+{
+ ZedstoreTupleTableSlot *zdstslot = (ZedstoreTupleTableSlot *) dstslot;
+
+ TupleDesc srcdesc = srcslot->tts_tupleDescriptor;
+
+ Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);
+
+ tts_zedstore_clear(dstslot);
+
+ slot_getallattrs(srcslot);
+
+ for (int natt = 0; natt < srcdesc->natts; natt++)
+ {
+ dstslot->tts_values[natt] = srcslot->tts_values[natt];
+ dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
+ }
+
+ if (srcslot->tts_ops == &TTSOpsZedstore)
+ {
+ zdstslot->visi_info = ((ZedstoreTupleTableSlot *) srcslot)->visi_info;
+ }
+ else
+ {
+ zdstslot->visi_info = NULL;
+ }
+
+ dstslot->tts_nvalid = srcdesc->natts;
+ dstslot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ /* make sure storage doesn't depend on external memory */
+ tts_zedstore_materialize(dstslot);
+}
+
+static HeapTuple
+tts_zedstore_copy_heap_tuple(TupleTableSlot *slot)
+{
+ Assert(!TTS_EMPTY(slot));
+
+ return heap_form_tuple(slot->tts_tupleDescriptor,
+ slot->tts_values,
+ slot->tts_isnull);
+}
+
+static MinimalTuple
+tts_zedstore_copy_minimal_tuple(TupleTableSlot *slot)
+{
+ Assert(!TTS_EMPTY(slot));
+
+ return heap_form_minimal_tuple(slot->tts_tupleDescriptor,
+ slot->tts_values,
+ slot->tts_isnull);
+}
+
+
+const TupleTableSlotOps TTSOpsZedstore = {
+ .base_slot_size = sizeof(ZedstoreTupleTableSlot),
+ .init = tts_zedstore_init,
+ .release = tts_zedstore_release,
+ .clear = tts_zedstore_clear,
+ .getsomeattrs = tts_zedstore_getsomeattrs,
+ .getsysattr = tts_zedstore_getsysattr,
+ .materialize = tts_zedstore_materialize,
+ .copyslot = tts_zedstore_copyslot,
+
+ /*
+ * A zedstore tuple table slot cannot "own" a heap tuple or a minimal
+ * tuple.
+ */
+ .get_heap_tuple = NULL,
+ .get_minimal_tuple = NULL,
+ .copy_heap_tuple = tts_zedstore_copy_heap_tuple,
+ .copy_minimal_tuple = tts_zedstore_copy_minimal_tuple
+};
diff --git a/src/backend/access/zedstore/zedstore_undolog.c b/src/backend/access/zedstore/zedstore_undolog.c
new file mode 100644
index 0000000000..8749e4a827
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_undolog.c
@@ -0,0 +1,651 @@
+/*
+ * zedstore_undolog.c
+ * Temporary UNDO-logging for zedstore.
+ *
+ * XXX: This file is hopefully replaced with an upstream UNDO facility later.
+ *
+ * The UNDO log is a dumb stream of bytes. It can be appended to at the
+ * head, and the tail can be discarded away. The upper layer (see
+ * zedstore_undorec.c) is responsible for dividing the log into records,
+ * and for deciding when and what to discard.
+ *
+ * The upper layer is also responsible for WAL-logging any insertions and
+ * modifications of UNDO records. This module WAL-logs creation of new UNDO
+ * pages and discarding old ones, but not the content.
+ *
+ * Insertion is a two-step process. First, you reserve the space for the
+ * UNDO record with zsundo_insert_reserve(). You get a pointer to an UNDO
+ * buffer, where you can write the record. Once you're finished, call
+ * zsundo_insert_finish().
+ *
+ * To fetch a record, use zsundo_fetch(). You may modify the record, but
+ * you must dirty the buffer and WAL-log the change yourself. You cannot
+ * change its size, however.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_undolog.c
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undolog.h"
+#include "access/zedstore_wal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/rel.h"
+
+/*
+ * Reserve space in the UNDO log for a new UNDO record.
+ *
+ * Extends the UNDO log with a new page if needed. Information about the
+ * reservation is returned in *reservation_p. reservation_p->undobuf is
+ * the buffer containing the reserved space, and reservation_p->undorecptr
+ * is a pointer that can be used to fetch the record later.
+ *
+ * This doesn't make any on-disk changes. The buffer is locked, but if
+ * the backend aborts later on, before actually writing the record, no
+ * harm is done.
+ *
+ * The intended usage is to call zsundo_insert_reserve(), then lock any
+ * other pages needed for the operation. Then, write the UNDO record at
+ * reservation_p->ptr, which points directly into the buffer, in the same
+ * critical section as any other page modifications that need to be done
+ * atomically. Finally, call zsundo_insert_finish(), to mark the space as
+ * used in the undo page header.
+ *
+ * The caller is responsible for WAL-logging, and replaying the changes, in
+ * case of a crash. (If there isn't enough space on the current latest UNDO
+ * page, a new page is allocated and appended to the UNDO log. That allocation
+ * is WAL-logged separately, the caller doesn't need to care about that.)
+ */
+void
+zsundo_insert_reserve(Relation rel, size_t size, zs_undo_reservation *reservation_p)
+{
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber tail_blk;
+ Buffer tail_buf = InvalidBuffer;
+ Page tail_pg = NULL;
+ ZSUndoPageOpaque *tail_opaque = NULL;
+ uint64 next_counter;
+ int offset;
+
+ if (size > MaxUndoRecordSize)
+ elog(ERROR, "UNDO record is too large (%zu bytes, max %zu bytes)", size, MaxUndoRecordSize);
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+retry_lock_tail:
+ tail_blk = metaopaque->zs_undo_tail;
+
+ /*
+ * Is there space on the tail page? If not, allocate a new UNDO page.
+ */
+ if (tail_blk != InvalidBlockNumber)
+ {
+ tail_buf = ReadBuffer(rel, tail_blk);
+ LockBuffer(tail_buf, BUFFER_LOCK_EXCLUSIVE);
+ tail_pg = BufferGetPage(tail_buf);
+ tail_opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(tail_pg);
+ Assert(tail_opaque->first_undorecptr.counter == metaopaque->zs_undo_tail_first_counter);
+ }
+
+ if (tail_blk == InvalidBlockNumber || PageGetExactFreeSpace(tail_pg) < size)
+ {
+ Buffer newbuf;
+ BlockNumber newblk;
+ Page newpage;
+ ZSUndoPageOpaque *newopaque;
+
+ /*
+ * Release the lock on the old tail page and metapage while we find a new block,
+ * because that could take a while. (And accessing the Free Page Map might lock
+ * the metapage, too, causing self-deadlock.)
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+ if (BufferIsValid(tail_buf))
+ LockBuffer(tail_buf, BUFFER_LOCK_UNLOCK);
+
+ /* new page */
+ newbuf = zspage_getnewbuf(rel, ZS_INVALID_ATTRIBUTE_NUM);
+
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ if (metaopaque->zs_undo_tail != tail_blk)
+ {
+ /*
+ * Someone else extended the UNDO log concurrently. We don't need
+ * the new page, after all. (Or maybe we do, if the new
+ * tail block is already full, but we're not smart about it.)
+ */
+ zspage_delete_page(rel, newbuf, metabuf, ZS_INVALID_ATTRIBUTE_NUM);
+ UnlockReleaseBuffer(newbuf);
+ goto retry_lock_tail;
+ }
+ if (BufferIsValid(tail_buf))
+ LockBuffer(tail_buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (tail_blk == InvalidBlockNumber)
+ next_counter = metaopaque->zs_undo_tail_first_counter;
+ else
+ next_counter = tail_opaque->last_undorecptr.counter + 1;
+
+ START_CRIT_SECTION();
+
+ newblk = BufferGetBlockNumber(newbuf);
+ newpage = BufferGetPage(newbuf);
+ PageInit(newpage, BLCKSZ, sizeof(ZSUndoPageOpaque));
+ newopaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(newpage);
+ newopaque->next = InvalidBlockNumber;
+ newopaque->first_undorecptr.blkno = newblk;
+ newopaque->first_undorecptr.offset = SizeOfPageHeaderData;
+ newopaque->first_undorecptr.counter = next_counter;
+ newopaque->last_undorecptr = InvalidUndoPtr;
+ newopaque->zs_page_id = ZS_UNDO_PAGE_ID;
+ MarkBufferDirty(newbuf);
+
+ metaopaque->zs_undo_tail = newblk;
+ metaopaque->zs_undo_tail_first_counter = next_counter;
+ if (tail_blk == InvalidBlockNumber)
+ metaopaque->zs_undo_head = newblk;
+ MarkBufferDirty(metabuf);
+
+ if (tail_blk != InvalidBlockNumber)
+ {
+ tail_opaque->next = newblk;
+ MarkBufferDirty(tail_buf);
+ }
+
+ if (RelationNeedsWAL(rel))
+ {
+ wal_zedstore_undo_newpage xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.first_counter = next_counter;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalUndoNewPage);
+
+ XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+ if (BufferIsValid(tail_buf))
+ XLogRegisterBuffer(1, tail_buf, REGBUF_STANDARD);
+ XLogRegisterBuffer(2, newbuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_UNDO_NEWPAGE);
+
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ if (BufferIsValid(tail_buf))
+ PageSetLSN(BufferGetPage(tail_buf), recptr);
+ PageSetLSN(BufferGetPage(newbuf), recptr);
+ }
+
+ if (tail_blk != InvalidBlockNumber)
+ UnlockReleaseBuffer(tail_buf);
+
+ END_CRIT_SECTION();
+
+ Assert(size <= PageGetExactFreeSpace(newpage));
+
+ tail_blk = newblk;
+ tail_buf = newbuf;
+ tail_pg = newpage;
+ tail_opaque = newopaque;
+ }
+ else
+ {
+ if (IsZSUndoRecPtrValid(&tail_opaque->last_undorecptr))
+ {
+ Assert(tail_opaque->last_undorecptr.counter >= metaopaque->zs_undo_tail_first_counter);
+ next_counter = tail_opaque->last_undorecptr.counter + 1;
+ }
+ else
+ {
+ next_counter = tail_opaque->first_undorecptr.counter;
+ Assert(next_counter == metaopaque->zs_undo_tail_first_counter);
+ }
+ }
+
+ UnlockReleaseBuffer(metabuf);
+
+ /*
+ * All set for writing the record. But since we haven't modified the page
+ * yet, we are free to still turn back and release the lock without writing
+ * anything.
+ */
+ offset = ((PageHeader) tail_pg)->pd_lower;
+
+ /* Return the reservation to the caller */
+ reservation_p->undobuf = tail_buf;
+ reservation_p->undorecptr.counter = next_counter;
+ reservation_p->undorecptr.blkno = tail_blk;
+ reservation_p->undorecptr.offset = offset;
+ reservation_p->length = size;
+ reservation_p->ptr = ((char *) tail_pg) + offset;
+}
+
+/*
+ * Finish the insertion of an UNDO record.
+ *
+ * See zsundo_insert_reserve().
+ */
+void
+zsundo_insert_finish(zs_undo_reservation *reservation)
+{
+ Buffer undobuf = reservation->undobuf;
+ Page undopg = BufferGetPage(undobuf);
+ ZSUndoPageOpaque *opaque;
+
+ /*
+ * This should be used as part of a bigger critical section that
+ * writes a WAL record of the change. The caller must've written the
+ * data.
+ */
+ Assert(CritSectionCount > 0);
+
+ Assert(((PageHeader) undopg)->pd_lower == reservation->undorecptr.offset);
+
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(undopg);
+ opaque->last_undorecptr = reservation->undorecptr;
+
+ ((PageHeader) undopg)->pd_lower += reservation->length;
+
+ MarkBufferDirty(undobuf);
+}
+
+/*
+ * Lock page containing the given UNDO record, and return pointer to it
+ * within the buffer. Once you're done looking at the record, unlock and
+ * unpin the buffer.
+ *
+ * If lockmode is BUFFER_LOCK_EXCLUSIVE, you may modify the record. However,
+ * you cannot change its size, and you must mark the buffer dirty, and WAL-log any
+ * changes yourself.
+ *
+ * If missing_ok is true, it's OK if the UNDO record has been discarded away
+ * already. Will return NULL in that case. If missing_ok is false, throws an
+ * error if the record cannot be found.
+ */
+char *
+zsundo_fetch(Relation rel, ZSUndoRecPtr undoptr, Buffer *buf_p, int lockmode,
+ bool missing_ok)
+{
+ Buffer buf = InvalidBuffer;
+ Page page;
+ PageHeader pagehdr;
+ ZSUndoPageOpaque *opaque;
+ char *ptr;
+ Buffer metabuf = InvalidBuffer;
+
+ buf = ReadBuffer(rel, undoptr.blkno);
+ page = BufferGetPage(buf);
+ pagehdr = (PageHeader) page;
+
+ /*
+ * If the page might've been discarded away, there's a small chance that
+ * the buffer now holds an unrelated page. In that case, it's possible
+ * that we or someone else is holding a lock on it already. If we tried
+ * to lock the page unconditionally, we could accidentally break the
+ * lock ordering rules, by trying to lock a different kind of a page
+ * than we thought.
+ *
+ * To avoid that, try to lock the page optimistically, but if we would
+ * block, check in the metapage that the page hasn't been discarded away.
+ * zsundo_discard() keeps the metapage locked, so if we lock the page
+ * while holding the metapage, we can be sure that it's the UNDO page
+ * we're looking for.
+ */
+ if (!ConditionalLockBufferInMode(buf, lockmode))
+ {
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ if (metaopaque->zs_undo_oldestptr.counter > undoptr.counter)
+ {
+ /* the record has already been discarded */
+ ReleaseBuffer(buf);
+ buf = InvalidBuffer;
+ UnlockReleaseBuffer(metabuf);
+ metabuf = InvalidBuffer;
+ goto record_missing;
+ }
+ LockBuffer(buf, lockmode);
+ }
+
+ if (PageIsNew(page))
+ goto record_missing;
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ goto record_missing;
+
+ /* Check that this page contains the given record */
+ if (undoptr.counter < opaque->first_undorecptr.counter ||
+ !IsZSUndoRecPtrValid(&opaque->last_undorecptr) ||
+ undoptr.counter > opaque->last_undorecptr.counter)
+ goto record_missing;
+
+ if (BufferIsValid(metabuf))
+ {
+ UnlockReleaseBuffer(metabuf);
+ metabuf = InvalidBuffer;
+ }
+
+ /* FIXME: the callers could do a more thorough check like this,
+ * since they know the record size */
+ /* Sanity check that the pointer pointed to a valid place */
+ if (undoptr.offset < SizeOfPageHeaderData ||
+ undoptr.offset >= pagehdr->pd_lower)
+ {
+ /*
+ * This cannot be explained by the page having been recycled for some
+ * other use, so treat it as corruption and error out even if
+ * 'missing_ok' is true.
+ */
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+ }
+
+ ptr = ((char *) page) + undoptr.offset;
+
+#if 0 /* FIXME: move this to the callers? */
+ if (memcmp(&undorec->undorecptr, &undoptr, sizeof(ZSUndoRecPtr)) != 0)
+ {
+ /*
+ * this should not happen in the case that the page was recycled for
+ * other use, so error even if 'missing_ok' is true
+ */
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+ }
+#endif
+
+ *buf_p = buf;
+ return ptr;
+
+record_missing:
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+ if (buf)
+ UnlockReleaseBuffer(buf);
+ *buf_p = InvalidBuffer;
+
+ /*
+ * If the metapage says that the page is there, but it doesn't contain the
+ * data we expected, that's an error even with 'missing_ok'.
+ */
+ if (missing_ok && !BufferIsValid(metabuf))
+ return NULL;
+ else
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u; not an UNDO page",
+ undoptr.counter, undoptr.blkno, undoptr.offset);
+}
+
+/*
+ * Discard old UNDO log, recycling any now-unused pages.
+ *
+ * Updates the metapage with the oldest value that remains after the discard.
+ */
+void
+zsundo_discard(Relation rel, ZSUndoRecPtr oldest_undorecptr)
+{
+ /* Scan the undo log from oldest to newest */
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber nextblk;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ nextblk = metaopaque->zs_undo_head;
+ while (nextblk != InvalidBlockNumber)
+ {
+ BlockNumber blk = nextblk;
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+ bool discard_this_page = false;
+ BlockNumber nextfreeblkno = InvalidBlockNumber;
+
+ buf = ReadBuffer(rel, blk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * check that the page still looks like what we'd expect.
+ *
+ * FIXME: how to recover? Should these be just warnings?
+ */
+ if (PageIsEmpty(page))
+ elog(ERROR, "corrupted zedstore table; oldest UNDO log page is empty");
+
+ if (PageGetSpecialSize(page) != MAXALIGN(sizeof(ZSUndoPageOpaque)))
+ elog(ERROR, "corrupted zedstore table; oldest page in UNDO log is not an UNDO page");
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ elog(ERROR, "corrupted zedstore table; oldest page in UNDO log has unexpected page id %d",
+ opaque->zs_page_id);
+ /* FIXME: Also check here that the max UndoRecPtr on the page is less
+ * than the new 'oldest_undorecptr'
+ */
+
+ if (!IsZSUndoRecPtrValid(&opaque->last_undorecptr) ||
+ opaque->last_undorecptr.counter < oldest_undorecptr.counter)
+ discard_this_page = true;
+
+ if (discard_this_page && blk == oldest_undorecptr.blkno)
+ elog(ERROR, "corrupted UNDO page chain, tried to discard active page");
+
+ nextblk = opaque->next;
+
+ START_CRIT_SECTION();
+
+ metaopaque->zs_undo_oldestptr = oldest_undorecptr;
+
+ if (discard_this_page)
+ {
+ if (nextblk == InvalidBlockNumber)
+ {
+ metaopaque->zs_undo_head = InvalidBlockNumber;
+ metaopaque->zs_undo_tail = InvalidBlockNumber;
+ metaopaque->zs_undo_tail_first_counter = oldest_undorecptr.counter;
+ }
+ else
+ metaopaque->zs_undo_head = nextblk;
+
+ /* Add the discarded page to the free page list */
+ nextfreeblkno = metaopaque->zs_fpm_head;
+ zspage_mark_page_deleted(page, nextfreeblkno);
+ metaopaque->zs_fpm_head = blk;
+
+ MarkBufferDirty(buf);
+ }
+
+ MarkBufferDirty(metabuf);
+
+ if (RelationNeedsWAL(rel))
+ {
+ wal_zedstore_undo_discard xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.oldest_undorecptr = oldest_undorecptr;
+ xlrec.oldest_undopage = nextblk;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfZSWalUndoDiscard);
+ XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+ if (discard_this_page)
+ {
+ XLogRegisterBuffer(1, buf, REGBUF_KEEP_DATA | REGBUF_WILL_INIT | REGBUF_STANDARD);
+ XLogRegisterBufData(1, (char *) &nextfreeblkno, sizeof(BlockNumber));
+ }
+
+ recptr = XLogInsert(RM_ZEDSTORE_ID, WAL_ZEDSTORE_UNDO_DISCARD);
+
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+ }
+
+ UnlockReleaseBuffer(metabuf);
+}
+
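+/*
+ * REDO handler for WAL_ZEDSTORE_UNDO_DISCARD: replays the metapage update
+ * made by zsundo_discard(), and re-initializes the discarded page as a free
+ * page, if one was discarded.
+ */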
+void
+zsundo_discard_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_undo_discard *xlrec = (wal_zedstore_undo_discard *) XLogRecGetData(record);
+ ZSUndoRecPtr oldest_undorecptr = xlrec->oldest_undorecptr;
+ BlockNumber nextblk = xlrec->oldest_undopage;
+ Buffer metabuf;
+ bool discard_this_page;
+ BlockNumber discardedblkno = InvalidBlockNumber;
+ BlockNumber nextfreeblkno = InvalidBlockNumber;
+
+ discard_this_page = XLogRecHasBlockRef(record, 1);
+ if (discard_this_page)
+ {
+ Size datalen;
+ char *data;
+
+ XLogRecGetBlockTag(record, 1, NULL, NULL, &discardedblkno);
+ data = XLogRecGetBlockData(record, 1, &datalen);
+ Assert(datalen == sizeof(BlockNumber));
+
+ memcpy(&nextfreeblkno, data, sizeof(BlockNumber));
+ }
+
+ if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage = BufferGetPage(metabuf);
+ ZSMetaPageOpaque *metaopaque;
+
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ metaopaque->zs_undo_oldestptr = oldest_undorecptr;
+
+ if (discard_this_page)
+ {
+ if (nextblk == InvalidBlockNumber)
+ {
+ metaopaque->zs_undo_head = InvalidBlockNumber;
+ metaopaque->zs_undo_tail = InvalidBlockNumber;
+ metaopaque->zs_undo_tail_first_counter = oldest_undorecptr.counter;
+ }
+ else
+ metaopaque->zs_undo_head = nextblk;
+
+ /* Add the discarded page to the free page list */
+ metaopaque->zs_fpm_head = discardedblkno;
+ }
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+
+ if (discard_this_page)
+ {
+ Buffer discardedbuf;
+ Page discardedpage;
+
+ discardedbuf = XLogInitBufferForRedo(record, 1);
+ discardedpage = BufferGetPage(discardedbuf);
+ zspage_mark_page_deleted(discardedpage, nextfreeblkno);
+
+ PageSetLSN(discardedpage, lsn);
+ MarkBufferDirty(discardedbuf);
+ UnlockReleaseBuffer(discardedbuf);
+ }
+
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
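+/*
+ * REDO handler for WAL_ZEDSTORE_UNDO_NEWPAGE: replays the extension of the
+ * UNDO log with a new page, as done in zsundo_insert_reserve().
+ */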
+void
+zsundo_newpage_redo(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ wal_zedstore_undo_newpage *xlrec = (wal_zedstore_undo_newpage *) XLogRecGetData(record);
+ Buffer metabuf;
+ Buffer prevbuf;
+ Buffer newbuf;
+ BlockNumber newblk;
+ Page newpage;
+ ZSUndoPageOpaque *newopaque;
+ bool has_prev_block;
+
+ has_prev_block = XLogRecHasBlockRef(record, 1);
+ XLogRecGetBlockTag(record, 2, NULL, NULL, &newblk);
+
+ if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage = BufferGetPage(metabuf);
+ ZSMetaPageOpaque *metaopaque;
+
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+ metaopaque->zs_undo_tail = newblk;
+ metaopaque->zs_undo_tail_first_counter = xlrec->first_counter;
+ if (!has_prev_block)
+ metaopaque->zs_undo_head = newblk;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+
+ if (has_prev_block)
+ {
+ if (XLogReadBufferForRedo(record, 1, &prevbuf) == BLK_NEEDS_REDO)
+ {
+ Page prevpage = BufferGetPage(prevbuf);
+ ZSUndoPageOpaque *prev_opaque;
+
+ prev_opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(prevpage);
+ prev_opaque->next = newblk;
+
+ PageSetLSN(prevpage, lsn);
+ MarkBufferDirty(prevbuf);
+ }
+ }
+ else
+ prevbuf = InvalidBuffer;
+
+ newbuf = XLogInitBufferForRedo(record, 2);
+ newblk = BufferGetBlockNumber(newbuf);
+ newpage = BufferGetPage(newbuf);
+ PageInit(newpage, BLCKSZ, sizeof(ZSUndoPageOpaque));
+ newopaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(newpage);
+ newopaque->next = InvalidBlockNumber;
+ newopaque->first_undorecptr.blkno = newblk;
+ newopaque->first_undorecptr.offset = SizeOfPageHeaderData;
+ newopaque->first_undorecptr.counter = xlrec->first_counter;
+ newopaque->last_undorecptr = InvalidUndoPtr;
+ newopaque->zs_page_id = ZS_UNDO_PAGE_ID;
+
+ PageSetLSN(newpage, lsn);
+ MarkBufferDirty(newbuf);
+
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+ if (BufferIsValid(prevbuf))
+ UnlockReleaseBuffer(prevbuf);
+ UnlockReleaseBuffer(newbuf);
+}
diff --git a/src/backend/access/zedstore/zedstore_undorec.c b/src/backend/access/zedstore/zedstore_undorec.c
new file mode 100644
index 0000000000..719b8f8035
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_undorec.c
@@ -0,0 +1,893 @@
+/*
+ * zedstore_undorec.c
+ * Functions for working on UNDO records.
+ *
+ * This file contains higher-level functions for constructing UNDO records
+ * for different kinds of WAL records.
+ *
+ * If you perform multiple operations in the same transaction and command, we
+ * reuse the same UNDO record for them. There's a one-element cache for each
+ * operation type, so this only takes effect in simple cases.
+ *
+ * TODO: make the caching work in more cases, with a hash table or something.
+ * Currently, we do this for DELETEs and INSERTs. We could perhaps do this
+ * for UPDATEs as well, although they're a bit more tricky, as we need
+ * to also store the 'ctid' pointer to the new tuple in an UPDATE.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_undorec.c
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/multixact.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undolog.h"
+#include "access/zedstore_undorec.h"
+#include "access/zedstore_wal.h"
+#include "commands/progress.h"
+#include "commands/vacuum.h"
+#include "lib/integerset.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "pgstat.h"
+#include "storage/lmgr.h"
+#include "utils/memutils.h"
+#include "utils/pg_rusage.h"
+#include "utils/rel.h"
+#include "utils/lsyscache.h"
+
+/*
+ * Working area for VACUUM.
+ */
+typedef struct ZSVacRelStats
+{
+ int elevel;
+ BufferAccessStrategy vac_strategy;
+
+ /* hasindex = true means two-pass strategy; false means one-pass */
+ bool hasindex;
+ /* Overall statistics about rel */
+ BlockNumber rel_pages; /* total number of pages */
+ BlockNumber tupcount_pages; /* pages whose tuples we counted */
+ double old_live_tuples; /* previous value of pg_class.reltuples */
+ double new_rel_tuples; /* new estimated total # of tuples */
+ double new_live_tuples; /* new estimated total # of live tuples */
+ double new_dead_tuples; /* new estimated total # of dead tuples */
+ BlockNumber pages_removed;
+ double tuples_deleted;
+
+ IntegerSet *dead_tids;
+} ZSVacRelStats;
+
+static bool zs_lazy_tid_reaped(ItemPointer itemptr, void *state);
+static void lazy_vacuum_index(Relation indrel,
+ IndexBulkDeleteResult **stats,
+ ZSVacRelStats *vacrelstats);
+static void lazy_cleanup_index(Relation indrel,
+ IndexBulkDeleteResult *stats,
+ ZSVacRelStats *vacrelstats);
+
+
+/*
+ * Fetch the UNDO record with the given undo-pointer.
+ *
+ * The returned record is a palloc'd copy.
+ *
+ * If the record could not be found, returns NULL. That can happen if you try
+ * to fetch an UNDO record that has already been discarded. I.e. if undoptr
+ * is smaller than the oldest UNDO pointer stored in the metapage.
+ */
+ZSUndoRec *
+zsundo_fetch_record(Relation rel, ZSUndoRecPtr undoptr)
+{
+ ZSUndoRec *undorec_copy;
+ ZSUndoRec *undorec;
+ Buffer buf;
+
+ undorec = (ZSUndoRec *) zsundo_fetch(rel, undoptr, &buf, BUFFER_LOCK_SHARE, true);
+
+ if (undorec)
+ {
+ undorec_copy = palloc(undorec->size);
+ memcpy(undorec_copy, undorec, undorec->size);
+ }
+ else
+ undorec_copy = NULL;
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+
+ return undorec_copy;
+}
+
+
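+/*
+ * Create (or extend) an UNDO record for deleting the given TID.
+ *
+ * If the previous call created a DELETE record for the same xid, cid and
+ * other parameters, and that record still has room (it holds up to
+ * ZSUNDO_NUM_TIDS_PER_DELETE TIDs), the new TID is appended to the existing
+ * record instead; 'is_update' is set in the returned pending op to signal
+ * that an existing record is being modified rather than a new one inserted.
+ */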
+zs_pending_undo_op *
+zsundo_create_for_delete(Relation rel, TransactionId xid, CommandId cid, zstid tid,
+ bool changedPart, ZSUndoRecPtr prev_undo_ptr)
+{
+ ZSUndoRec_Delete *undorec;
+ zs_pending_undo_op *pending_op;
+
+ static RelFileNode cached_relfilenode;
+ static TransactionId cached_xid;
+ static CommandId cached_cid;
+ static bool cached_changedPart;
+ static ZSUndoRecPtr cached_prev_undo_ptr;
+ static ZSUndoRecPtr cached_undo_ptr;
+
+ if (RelFileNodeEquals(rel->rd_node, cached_relfilenode) &&
+ xid == cached_xid &&
+ cid == cached_cid &&
+ changedPart == cached_changedPart &&
+ prev_undo_ptr.counter == cached_prev_undo_ptr.counter)
+ {
+ Buffer buf;
+ ZSUndoRec_Delete *orig_undorec;
+
+ orig_undorec = (ZSUndoRec_Delete *) zsundo_fetch(rel, cached_undo_ptr,
+ &buf, BUFFER_LOCK_EXCLUSIVE, false);
+
+ if (orig_undorec->rec.type != ZSUNDO_TYPE_DELETE)
+ elog(ERROR, "unexpected undo record type %d, expected DELETE", orig_undorec->rec.type);
+
+ /* Is there space for a new TID in the record? */
+ if (orig_undorec->num_tids < ZSUNDO_NUM_TIDS_PER_DELETE)
+ {
+ pending_op = palloc(offsetof(zs_pending_undo_op, payload) + sizeof(ZSUndoRec_Delete));
+ undorec = (ZSUndoRec_Delete *) pending_op->payload;
+
+ pending_op->reservation.undobuf = buf;
+ pending_op->reservation.undorecptr = cached_undo_ptr;
+ pending_op->reservation.length = sizeof(ZSUndoRec_Delete);
+ pending_op->reservation.ptr = (char *) orig_undorec;
+ pending_op->is_update = true;
+
+ memcpy(undorec, orig_undorec, sizeof(ZSUndoRec_Delete));
+ undorec->tids[undorec->num_tids] = tid;
+ undorec->num_tids++;
+
+ return pending_op;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * Cache miss. Create a new UNDO record.
+ */
+ pending_op = palloc(offsetof(zs_pending_undo_op, payload) + sizeof(ZSUndoRec_Delete));
+ pending_op->is_update = false;
+
+ zsundo_insert_reserve(rel, sizeof(ZSUndoRec_Delete), &pending_op->reservation);
+
+ undorec = (ZSUndoRec_Delete *) pending_op->payload;
+ undorec->rec.size = sizeof(ZSUndoRec_Delete);
+ undorec->rec.type = ZSUNDO_TYPE_DELETE;
+ undorec->rec.undorecptr = pending_op->reservation.undorecptr;
+ undorec->rec.xid = xid;
+ undorec->rec.cid = cid;
+ undorec->changedPart = changedPart;
+ undorec->rec.prevundorec = prev_undo_ptr;
+ undorec->tids[0] = tid;
+ undorec->num_tids = 1;
+
+ /* XXX: this caching mechanism assumes that once we've reserved the undo record,
+ * we never change our minds and don't write the undo record, after all.
+ */
+ cached_relfilenode = rel->rd_node;
+ cached_xid = xid;
+ cached_cid = cid;
+ cached_changedPart = changedPart;
+ cached_prev_undo_ptr = prev_undo_ptr;
+ cached_undo_ptr = pending_op->reservation.undorecptr;
+
+ return pending_op;
+}
+
+/*
+ * Create an UNDO record for insertion.
+ *
+ * The undo record stores the 'tid' of the row, as well as visibility information.
+ *
+ * There's a primitive caching mechanism here: If you perform multiple insertions
+ * with same visibility information, and consecutive TIDs, we will keep modifying
+ * the range of TIDs in the same UNDO record, instead of creating new records.
+ * That greatly reduces the space required for UNDO log of bulk inserts.
+ */
+zs_pending_undo_op *
+zsundo_create_for_insert(Relation rel, TransactionId xid, CommandId cid, zstid tid,
+ int nitems, uint32 speculative_token, ZSUndoRecPtr prev_undo_ptr)
+{
+ ZSUndoRec_Insert *undorec;
+ zs_pending_undo_op *pending_op;
+
+ /*
+ * Create a new UNDO record.
+ */
+ pending_op = palloc(offsetof(zs_pending_undo_op, payload) + sizeof(ZSUndoRec_Insert));
+ pending_op->is_update = false;
+ zsundo_insert_reserve(rel, sizeof(ZSUndoRec_Insert), &pending_op->reservation);
+ undorec = (ZSUndoRec_Insert *) pending_op->payload;
+
+ undorec->rec.size = sizeof(ZSUndoRec_Insert);
+ undorec->rec.type = ZSUNDO_TYPE_INSERT;
+ undorec->rec.undorecptr = pending_op->reservation.undorecptr;
+ undorec->rec.xid = xid;
+ undorec->rec.cid = cid;
+ undorec->rec.prevundorec = prev_undo_ptr;
+ undorec->firsttid = tid;
+ undorec->endtid = tid + nitems;
+ undorec->speculative_token = speculative_token;
+
+ return pending_op;
+}
+
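+/*
+ * Create an UNDO record for an UPDATE.
+ *
+ * The undo record stores the old and new TID of the row, whether any key
+ * columns were changed, and the visibility information.
+ */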
+zs_pending_undo_op *
+zsundo_create_for_update(Relation rel, TransactionId xid, CommandId cid,
+ zstid oldtid, zstid newtid, ZSUndoRecPtr prev_undo_ptr,
+ bool key_update)
+{
+ ZSUndoRec_Update *undorec;
+ zs_pending_undo_op *pending_op;
+
+ /*
+ * Create a new UNDO record.
+ */
+ pending_op = palloc(offsetof(zs_pending_undo_op, payload) + sizeof(ZSUndoRec_Update));
+ pending_op->is_update = false;
+ zsundo_insert_reserve(rel, sizeof(ZSUndoRec_Update), &pending_op->reservation);
+
+ undorec = (ZSUndoRec_Update *) pending_op->payload;
+ undorec->rec.size = sizeof(ZSUndoRec_Update);
+ undorec->rec.type = ZSUNDO_TYPE_UPDATE;
+ undorec->rec.undorecptr = pending_op->reservation.undorecptr;
+ undorec->rec.xid = xid;
+ undorec->rec.cid = cid;
+ undorec->rec.prevundorec = prev_undo_ptr;
+ undorec->oldtid = oldtid;
+ undorec->newtid = newtid;
+ undorec->key_update = key_update;
+
+ return pending_op;
+}
+
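+/*
+ * Create an UNDO record for locking a tuple (e.g. for SELECT FOR UPDATE).
+ *
+ * The undo record stores the lock mode, along with the visibility information.
+ */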
+zs_pending_undo_op *
+zsundo_create_for_tuple_lock(Relation rel, TransactionId xid, CommandId cid,
+ zstid tid, LockTupleMode lockmode,
+ ZSUndoRecPtr prev_undo_ptr)
+{
+ ZSUndoRec_TupleLock *undorec;
+ zs_pending_undo_op *pending_op;
+
+ /*
+ * Create a new UNDO record.
+ */
+ pending_op = palloc(offsetof(zs_pending_undo_op, payload) + sizeof(ZSUndoRec_TupleLock));
+ pending_op->is_update = false;
+ zsundo_insert_reserve(rel, sizeof(ZSUndoRec_TupleLock), &pending_op->reservation);
+
+ undorec = (ZSUndoRec_TupleLock *) pending_op->payload;
+ undorec->rec.size = sizeof(ZSUndoRec_TupleLock);
+ undorec->rec.type = ZSUNDO_TYPE_TUPLE_LOCK;
+ undorec->rec.undorecptr = pending_op->reservation.undorecptr;
+ undorec->rec.xid = xid;
+ undorec->rec.cid = cid;
+ undorec->rec.prevundorec = prev_undo_ptr;
+ undorec->lockmode = lockmode;
+
+ return pending_op;
+}
+
+
+/*
+ * Scan the UNDO log, starting from the oldest entry. Undo the effects of any
+ * aborted transactions. Records for committed transactions can be discarded
+ * immediately.
+ *
+ * Returns the oldest valid UNDO ptr, after discarding.
+ */
+static ZSUndoRecPtr
+zsundo_trim(Relation rel)
+{
+ /* Scan the undo log from oldest to newest */
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+ BlockNumber firstblk;
+ BlockNumber lastblk;
+ ZSUndoRecPtr oldest_undorecptr;
+ bool can_advance_oldestundorecptr;
+ char *ptr;
+ char *endptr;
+ char *pagebuf;
+
+ pagebuf = palloc(BLCKSZ);
+
+ oldest_undorecptr = InvalidUndoPtr;
+
+ /*
+ * Ensure that only one process discards at a time. We use a page lock on the
+ * metapage for that.
+ */
+ LockPage(rel, ZS_META_BLK, ExclusiveLock);
+
+ /*
+ * Get the current oldest undo page from the metapage.
+ */
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ firstblk = metaopaque->zs_undo_head;
+
+ oldest_undorecptr = metaopaque->zs_undo_oldestptr;
+
+ /*
+ * If we assume that only one process can call TRIM at a time, then we
+ * don't need to hold the metapage locked. Alternatively, if multiple
+ * concurrent trims were possible, we could check after reading the head
+ * page that it is the page we expect, and re-read the metapage if it's
+ * not.
+ */
+ UnlockReleaseBuffer(metabuf);
+
+ /*
+ * Don't trim undo pages during recovery, to avoid writing new WAL.
+ */
+ if (RecoveryInProgress())
+ return oldest_undorecptr;
+
+ /*
+ * Loop through UNDO records, starting from the oldest page, until we
+ * hit a record that we cannot remove.
+ */
+ lastblk = firstblk;
+ can_advance_oldestundorecptr = false;
+ while (lastblk != InvalidBlockNumber)
+ {
+ Buffer buf;
+ Page page;
+ ZSUndoPageOpaque *opaque;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Read the UNDO page */
+ buf = ReadBuffer(rel, lastblk);
+ page = BufferGetPage(buf);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ opaque = (ZSUndoPageOpaque *) PageGetSpecialPointer(page);
+
+ if (opaque->zs_page_id != ZS_UNDO_PAGE_ID)
+ elog(ERROR, "unexpected page id on UNDO page");
+
+ /*
+ * Make a copy of the page, because we cannot hold the
+ * lock while we reach out to the TID tree, to mark items dead.
+ * That would cause a deadlock risk (scans lock TID tree pages
+ * first, and then UNDO pages to check visibility)
+ */
+ memcpy(pagebuf, page, BLCKSZ);
+ page = pagebuf;
+ UnlockReleaseBuffer(buf);
+ buf = InvalidBuffer;
+
+ /* loop through all records on the page */
+ endptr = (char *) page + ((PageHeader) page)->pd_lower;
+ ptr = (char *) page + SizeOfPageHeaderData;
+ while (ptr < endptr)
+ {
+ ZSUndoRec *undorec = (ZSUndoRec *) ptr;
+ bool did_commit;
+
+ Assert(undorec->undorecptr.blkno == lastblk);
+
+ if (undorec->undorecptr.counter < oldest_undorecptr.counter)
+ {
+ ptr += undorec->size;
+ continue;
+ }
+ oldest_undorecptr = undorec->undorecptr;
+
+ if (!GlobalVisCheckRemovableXid(rel, undorec->xid))
+ {
+ /* This is still needed. Bail out */
+ break;
+ }
+
+ /*
+ * No one thinks this transaction is in-progress anymore. If it
+ * committed, we can just discard its UNDO record. If it aborted,
+ * we need to apply the UNDO record first. (For deletions, it's
+ * the other way round, though.)
+ *
+ * TODO: It would be much more efficient to do these in batches.
+ * So we should just collect the TIDs to mark dead here, and pass
+ * the whole list to zsbt_tid_mark_dead() after the loop.
+ */
+ did_commit = TransactionIdDidCommit(undorec->xid);
+
+ switch (undorec->type)
+ {
+ case ZSUNDO_TYPE_INSERT:
+ if (!did_commit)
+ {
+ ZSUndoRec_Insert *insertrec = (ZSUndoRec_Insert *) undorec;
+
+ for (zstid tid = insertrec->firsttid; tid < insertrec->endtid; tid++)
+ zsbt_tid_mark_dead(rel, tid, oldest_undorecptr);
+ }
+ break;
+ case ZSUNDO_TYPE_DELETE:
+ {
+ ZSUndoRec_Delete *deleterec = (ZSUndoRec_Delete *) undorec;
+
+ if (did_commit)
+ {
+ /* The deletion is now visible to everyone */
+ for (int i = 0; i < deleterec->num_tids; i++)
+ zsbt_tid_mark_dead(rel, deleterec->tids[i], oldest_undorecptr);
+ }
+ else
+ {
+ /*
+ * Must clear the item's UNDO pointer; otherwise the deletion
+ * becomes visible to everyone when the UNDO record is
+ * discarded.
+ */
+ for (int i = 0; i < deleterec->num_tids; i++)
+ zsbt_tid_undo_deletion(rel, deleterec->tids[i], undorec->undorecptr,
+ oldest_undorecptr);
+ }
+ }
+ break;
+ case ZSUNDO_TYPE_UPDATE:
+ if (did_commit)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+
+ zsbt_tid_mark_dead(rel, updaterec->oldtid, oldest_undorecptr);
+ }
+ break;
+ }
+
+ ptr += undorec->size;
+ can_advance_oldestundorecptr = true;
+ }
+
+ if (ptr < endptr)
+ {
+ break;
+ }
+ else
+ {
+ /* We processed all records on the page. Step to the next one, if any. */
+ Assert(ptr == endptr);
+ lastblk = opaque->next;
+ }
+ }
+
+ if (can_advance_oldestundorecptr)
+ {
+ if (lastblk == InvalidBlockNumber)
+ {
+ /*
+ * We stopped after the last valid record. Advance by one, to the next
+ * record which hasn't been created yet, and which is still needed.
+ */
+ oldest_undorecptr.counter++;
+ oldest_undorecptr.blkno = InvalidBlockNumber;
+ oldest_undorecptr.offset = 0;
+ }
+
+ zsundo_discard(rel, oldest_undorecptr);
+ }
+
+ UnlockPage(rel, ZS_META_BLK, ExclusiveLock);
+
+ pfree(pagebuf);
+
+ return oldest_undorecptr;
+}
+
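+/*
+ * Copy a previously-reserved UNDO record into place, and mark the UNDO
+ * buffer dirty. This must be called in the same critical section that
+ * WAL-logs the change that the UNDO record belongs to.
+ */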
+void
+zsundo_finish_pending_op(zs_pending_undo_op *pendingop, char *payload)
+{
+ /*
+ * This should be used as part of a bigger critical section that
+ * writes a WAL record of the change.
+ */
+ Assert(CritSectionCount > 0);
+
+ memcpy(pendingop->reservation.ptr, payload, pendingop->reservation.length);
+
+ if (!pendingop->is_update)
+ zsundo_insert_finish(&pendingop->reservation);
+ else
+ MarkBufferDirty(pendingop->reservation.undobuf);
+}
+
+
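+/*
+ * Clear the speculative insertion token on an INSERT UNDO record, once the
+ * outcome of the speculative insertion is known.
+ */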
+void
+zsundo_clear_speculative_token(Relation rel, ZSUndoRecPtr undoptr)
+{
+ ZSUndoRec_Insert *undorec;
+ Buffer buf;
+
+ undorec = (ZSUndoRec_Insert *) zsundo_fetch(rel, undoptr, &buf, BUFFER_LOCK_EXCLUSIVE, false);
+
+ if (undorec->rec.type != ZSUNDO_TYPE_INSERT)
+ elog(ERROR, "unexpected undo record type %d on speculatively inserted row",
+ undorec->rec.type);
+
+ START_CRIT_SECTION();
+
+ MarkBufferDirty(buf);
+
+ undorec->speculative_token = INVALID_SPECULATIVE_TOKEN;
+
+ /*
+ * The speculative insertion token becomes irrelevant if we crash, so there
+ * is no need to WAL-log it. However, if checksums are enabled, we may need
+ * to take a full-page image of the page, if a checkpoint happened between
+ * the speculative insertion and this call.
+ */
+ if (RelationNeedsWAL(rel))
+ {
+ if (XLogHintBitIsNeeded())
+ {
+ XLogRecPtr lsn;
+
+ lsn = XLogSaveBufferForHint(buf, true);
+ PageSetLSN(BufferGetPage(buf), lsn);
+ }
+ }
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Support functions for WAL-logging the insertion/modification of an
+ * UNDO record, as part of another WAL-logged change.
+ */
+void
+XLogRegisterUndoOp(uint8 block_id, zs_pending_undo_op *undo_op)
+{
+ zs_wal_undo_op *xlrec = &undo_op->waldata;
+
+ xlrec->undoptr = undo_op->reservation.undorecptr;
+ xlrec->length = undo_op->reservation.length;
+ xlrec->is_update = undo_op->is_update;
+
+ XLogRegisterBuffer(block_id, undo_op->reservation.undobuf,
+ REGBUF_STANDARD);
+ XLogRegisterBufData(block_id, (char *) xlrec, SizeOfZSWalUndoOp);
+ XLogRegisterBufData(block_id, (char *) undo_op->payload, undo_op->reservation.length);
+}
+
+/* redo support for the above */
+Buffer
+XLogRedoUndoOp(XLogReaderState *record, uint8 block_id)
+{
+ Buffer buffer;
+ zs_pending_undo_op op;
+
+ if (XLogReadBufferForRedo(record, block_id, &buffer) == BLK_NEEDS_REDO)
+ {
+ zs_wal_undo_op xlrec;
+ Size len;
+ char *p = XLogRecGetBlockData(record, block_id, &len);
+
+ Assert(len >= SizeOfZSWalUndoOp);
+
+ memcpy(&xlrec, p, SizeOfZSWalUndoOp);
+ p += SizeOfZSWalUndoOp;
+ len -= SizeOfZSWalUndoOp;
+ Assert(xlrec.length == len);
+
+ op.reservation.undobuf = buffer;
+ op.reservation.undorecptr = xlrec.undoptr;
+ op.reservation.length = xlrec.length;
+ op.reservation.ptr = ((char *) BufferGetPage(buffer)) + xlrec.undoptr.offset;
+ op.is_update = xlrec.is_update;
+
+ START_CRIT_SECTION();
+ zsundo_finish_pending_op(&op, p);
+ END_CRIT_SECTION();
+ }
+ return buffer;
+}
+
+
+
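+/*
+ * Callback for index_bulk_delete(): return true if the index entry points to
+ * one of the dead TIDs collected from the TID tree.
+ */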
+static bool
+zs_lazy_tid_reaped(ItemPointer itemptr, void *state)
+{
+ ZSVacRelStats *vacrelstats = (ZSVacRelStats *) state;
+ zstid tid = ZSTidFromItemPointer(*itemptr);
+
+ return intset_is_member(vacrelstats->dead_tids, tid);
+}
+
+/*
+ * Entry point of VACUUM for zedstore tables.
+ *
+ * Vacuum on a zedstore table works quite differently from the heap. We don't
+ * scan the table. Instead, we scan just the active UNDO log, and remove any
+ * garbage left behind by aborts or deletions based on the UNDO log.
+ */
+void
+zsundo_vacuum(Relation rel, VacuumParams *params, BufferAccessStrategy bstrategy,
+ TransactionId OldestXmin)
+{
+ ZSVacRelStats *vacrelstats;
+ Relation *Irel;
+ int nindexes;
+ IndexBulkDeleteResult **indstats;
+ zstid starttid;
+ zstid endtid;
+ uint64 num_live_tuples;
+
+ /* do nothing if the table is completely empty. */
+ if (RelationGetTargetBlock(rel) == 0 ||
+ RelationGetTargetBlock(rel) == InvalidBlockNumber)
+ {
+ /* don't believe the cached value without checking */
+ BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
+
+ RelationSetTargetBlock(rel, nblocks);
+ if (nblocks == 0)
+ return;
+ }
+
+ /*
+ * Scan the UNDO log, and discard what we can.
+ */
+ (void) zsundo_trim(rel);
+
+ vacrelstats = (ZSVacRelStats *) palloc0(sizeof(ZSVacRelStats));
+
+ if (params->options & VACOPT_VERBOSE)
+ vacrelstats->elevel = INFO;
+ else
+ vacrelstats->elevel = DEBUG2;
+ vacrelstats->vac_strategy = bstrategy;
+
+ /* Open all indexes of the relation */
+ vac_open_indexes(rel, RowExclusiveLock, &nindexes, &Irel);
+ vacrelstats->hasindex = (nindexes > 0);
+ indstats = (IndexBulkDeleteResult **)
+ palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
+
+ ereport(vacrelstats->elevel,
+ (errmsg("vacuuming \"%s.%s\"",
+ get_namespace_name(RelationGetNamespace(rel)),
+ RelationGetRelationName(rel))));
+
+ starttid = MinZSTid;
+ num_live_tuples = 0;
+ do
+ {
+ IntegerSet *dead_tids;
+
+ /* Scan the TID tree, to collect TIDs that have been marked dead. */
+ dead_tids = zsbt_collect_dead_tids(rel, starttid, &endtid, &num_live_tuples);
+ vacrelstats->dead_tids = dead_tids;
+
+ if (intset_num_entries(dead_tids) > 0)
+ {
+ /* Remove index entries */
+ for (int i = 0; i < nindexes; i++)
+ lazy_vacuum_index(Irel[i],
+ &indstats[i],
+ vacrelstats);
+
+ /*
+ * Remove the attribute data for the dead rows, and finally their
+ * TID tree entries.
+ */
+ for (int attno = 1; attno <= RelationGetNumberOfAttributes(rel); attno++)
+ zsbt_attr_remove(rel, attno, dead_tids);
+ zsbt_tid_remove(rel, dead_tids);
+ }
+
+ ereport(vacrelstats->elevel,
+ (errmsg("\"%s\": removed " UINT64_FORMAT " row versions",
+ RelationGetRelationName(rel),
+ intset_num_entries(dead_tids))));
+
+ starttid = endtid;
+ } while(starttid < MaxPlusOneZSTid);
+
+ /* Do post-vacuum cleanup and statistics update for each index */
+ for (int i = 0; i < nindexes; i++)
+ lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
+
+ /* Done with indexes */
+ vac_close_indexes(nindexes, Irel, NoLock);
+
+ /*
+ * Update pg_class to reflect the new info we know. The main thing we know
+ * for sure here is whether the table has indexes (relhasindex). We use
+ * OldestXmin as the new frozenxid. Since we don't know the new relminmxid,
+ * we pass InvalidMultiXactId to avoid updating it. Likewise, we pass false
+ * for relallvisible, as we don't know that either.
+ *
+ * FIXME: pass correct numbers for other arguments.
+ */
+ vac_update_relstats(rel,
+ RelationGetNumberOfBlocks(rel),
+ num_live_tuples,
+ false,
+ nindexes > 0,
+ OldestXmin,
+ InvalidMultiXactId,
+ false);
+
+ /* report results to the stats collector, too */
+ pgstat_report_vacuum(RelationGetRelid(rel),
+ rel->rd_rel->relisshared,
+ num_live_tuples,
+ 0); /* FIXME: # of dead tuples */
+}
+
+/*
+ * lazy_vacuum_index() -- vacuum one index relation.
+ *
+ *		Delete all the index entries pointing to tuples listed in
+ *		vacrelstats->dead_tids, and update running statistics.
+ */
+static void
+lazy_vacuum_index(Relation indrel,
+ IndexBulkDeleteResult **stats,
+ ZSVacRelStats *vacrelstats)
+{
+ IndexVacuumInfo ivinfo;
+ PGRUsage ru0;
+
+ pg_rusage_init(&ru0);
+
+ ivinfo.index = indrel;
+ ivinfo.analyze_only = false;
+ ivinfo.estimated_count = true;
+ ivinfo.message_level = vacrelstats->elevel;
+ /* We can only provide an approximate value of num_heap_tuples here */
+ ivinfo.num_heap_tuples = vacrelstats->old_live_tuples;
+ ivinfo.strategy = vacrelstats->vac_strategy;
+
+ /* Do bulk deletion */
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ zs_lazy_tid_reaped, (void *) vacrelstats);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
+ RelationGetRelationName(indrel),
+ intset_num_entries(vacrelstats->dead_tids)),
+ errdetail_internal("%s", pg_rusage_show(&ru0))));
+}
+
+/*
+ * lazy_cleanup_index() -- do post-vacuum cleanup for one index relation.
+ */
+static void
+lazy_cleanup_index(Relation indrel,
+ IndexBulkDeleteResult *stats,
+ ZSVacRelStats *vacrelstats)
+{
+ IndexVacuumInfo ivinfo;
+ PGRUsage ru0;
+
+ pg_rusage_init(&ru0);
+
+ ivinfo.index = indrel;
+ ivinfo.analyze_only = false;
+ ivinfo.estimated_count = (vacrelstats->tupcount_pages < vacrelstats->rel_pages);
+ ivinfo.message_level = vacrelstats->elevel;
+
+ /*
+ * Now we can provide a better estimate of total number of surviving
+ * tuples (we assume indexes are more interested in that than in the
+ * number of nominally live tuples).
+ */
+ ivinfo.num_heap_tuples = vacrelstats->new_rel_tuples;
+ ivinfo.strategy = vacrelstats->vac_strategy;
+
+ stats = index_vacuum_cleanup(&ivinfo, stats);
+
+ if (!stats)
+ return;
+
+ /*
+ * Now update statistics in pg_class, but only if the index says the count
+ * is accurate.
+ */
+ if (!stats->estimated_count)
+ vac_update_relstats(indrel,
+ stats->num_pages,
+ stats->num_index_tuples,
+ 0,
+ false,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ false);
+
+ ereport(vacrelstats->elevel,
+ (errmsg("index \"%s\" now contains %.0f row versions in %u pages",
+ RelationGetRelationName(indrel),
+ stats->num_index_tuples,
+ stats->num_pages),
+ errdetail("%.0f index row versions were removed.\n"
+ "%u index pages have been deleted, %u are currently reusable.\n"
+ "%s.",
+ stats->tuples_removed,
+ stats->pages_deleted, stats->pages_free,
+ pg_rusage_show(&ru0))));
+
+ pfree(stats);
+}
+
+
+/*
+ * Return the current "Oldest undo pointer". The effects of any actions with
+ * undo pointer older than this are known to be visible to everyone. (i.e.
+ * an inserted tuple is known to be visible, and a deleted tuple is known to
+ * be invisible.)
+ *
+ * If 'attempt_trim' is true, this not only gets the current oldest UNDO pointer,
+ * but first tries to advance it as much as possible, by scanning and discarding
+ * old UNDO log. That's pretty expensive, but fetching records from the UNDO log
+ * is very expensive, too, so until that is somehow sped up, it is a good tradeoff
+ * to advance the discard pointer aggressively. It is only safe to trim the UNDO
+ * log when you're not holding any other page locks, however.
+ */
+ZSUndoRecPtr
+zsundo_get_oldest_undo_ptr(Relation rel, bool attempt_trim)
+{
+ ZSUndoRecPtr result;
+
+ /* do nothing if the table is completely empty. */
+ if (RelationGetTargetBlock(rel) == 0 ||
+ RelationGetTargetBlock(rel) == InvalidBlockNumber)
+ {
+ /* don't believe a cached 0 size without checking */
+ BlockNumber nblocks;
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ RelationSetTargetBlock(rel, nblocks);
+ if (nblocks == 0)
+ return InvalidUndoPtr;
+ }
+
+ /*
+ * If the caller asked for trimming the UNDO log, do that. Otherwise,
+ * just get the current value from the metapage.
+ */
+ if (attempt_trim)
+ result = zsundo_trim(rel);
+ else
+ {
+ /*
+ * Get the current oldest undo page from the metapage.
+ */
+ Buffer metabuf;
+ Page metapage;
+ ZSMetaPageOpaque *metaopaque;
+
+ metabuf = ReadBuffer(rel, ZS_META_BLK);
+ metapage = BufferGetPage(metabuf);
+ LockBuffer(metabuf, BUFFER_LOCK_SHARE);
+ metaopaque = (ZSMetaPageOpaque *) PageGetSpecialPointer(metapage);
+
+ result = metaopaque->zs_undo_oldestptr;
+ UnlockReleaseBuffer(metabuf);
+ }
+ return result;
+}
diff --git a/src/backend/access/zedstore/zedstore_visibility.c b/src/backend/access/zedstore/zedstore_visibility.c
new file mode 100644
index 0000000000..b8ed9975f6
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_visibility.c
@@ -0,0 +1,919 @@
+/*
+ * zedstore_visibility.c
+ * Routines for MVCC in Zedstore
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_visibility.c
+ */
+#include "postgres.h"
+
+#include "access/tableam.h"
+#include "access/xact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undorec.h"
+#include "storage/procarray.h"
+
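+/*
+ * Is a new tuple lock of mode 'newmode' compatible with an existing lock of
+ * mode 'mode', held by another transaction?
+ */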
+static bool
+zs_tuplelock_compatible(LockTupleMode mode, LockTupleMode newmode)
+{
+ switch (newmode)
+ {
+ case LockTupleKeyShare:
+ return mode == LockTupleKeyShare ||
+ mode == LockTupleShare ||
+ mode == LockTupleNoKeyExclusive;
+
+ case LockTupleShare:
+ return mode == LockTupleKeyShare ||
+ mode == LockTupleShare;
+
+ case LockTupleNoKeyExclusive:
+ return mode == LockTupleKeyShare;
+ case LockTupleExclusive:
+ return false;
+
+ default:
+ elog(ERROR, "unknown tuple lock mode %d", newmode);
+ }
+}
+
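+/*
+ * Walk the UNDO chain starting at 'undo_ptr', and check whether any of the
+ * records were created by the current transaction, i.e. whether the current
+ * transaction already holds some kind of lock on the tuple.
+ */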
+static bool
+am_i_holding_lock(Relation rel, ZSUndoRecPtr undo_ptr,
+ ZSUndoRecPtr recent_oldest_undo)
+{
+ ZSUndoRec *undorec;
+
+ for (;;)
+ {
+ /* Is it visible? */
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ return false;
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ return false;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE)
+ {
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true;
+ }
+ undo_ptr = undorec->prevundorec;
+ }
+}
+
+/*
+ * Like HeapTupleSatisfiesUpdate.
+ *
+ * When this returns TM_Ok, it also returns a flag in *undo_record_needed, to
+ * indicate whether the old UNDO record is still of interest to anyone. If the
+ * old record belonged to an aborted deleting transaction, for example, it can
+ * be ignored.
+ *
+ * This does more than HeapTupleSatisfiesUpdate. If HeapTupleSatisfiesUpdate sees
+ * an updated or locked tuple, it returns TM_BeingModified, and the caller has to
+ * check if the tuple lock is compatible with the update. zs_SatisfiesUpdate
+ * checks if the new lock mode is compatible with the old one, and returns TM_Ok
+ * if so. Waiting for conflicting locks is left to the caller.
+ *
+ * This is also used for tuple locking (e.g. SELECT FOR UPDATE). 'mode' indicates
+ * the lock mode. For a genuine UPDATE, pass LockTupleExclusive or
+ * LockTupleNoKeyExclusive depending on whether key columns are being modified.
+ *
+ * If the tuple was UPDATEd, *next_tid is set to the TID of the new row version.
+ */
+TM_Result
+zs_SatisfiesUpdate(Relation rel, Snapshot snapshot,
+ ZSUndoRecPtr recent_oldest_undo,
+ zstid item_tid,
+ LockTupleMode mode,
+ bool *undo_record_needed, bool *this_xact_has_lock,
+ TM_FailureData *tmfd,
+ zstid *next_tid, ZSUndoSlotVisibility *visi_info)
+{
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+ int chain_depth = 0;
+
+ *this_xact_has_lock = false;
+ *undo_record_needed = true;
+
+ undo_ptr = visi_info->undoptr;
+
+fetch_undo_record:
+ chain_depth++;
+
+retry_fetch:
+ /* Is it visible? */
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ {
+ /*
+ * The old UNDO record is no longer visible to anyone, so we don't
+ * need to keep the pointer to it. If this record was not the one
+ * directly referenced from the item, though, we must still keep the
+ * pointer. For example, if there is a chain (item -> LOCK_TUPLE ->
+ * INSERT), and the INSERT record is no longer needed by anyone, we
+ * must still keep the pointer to the LOCK record.
+ */
+ if (chain_depth == 1)
+ *undo_record_needed = false;
+
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return TM_Ok;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto retry_fetch;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ *this_xact_has_lock = true;
+ if (undorec->cid >= snapshot->curcid)
+ return TM_Invisible; /* inserted after scan started */
+ }
+ else if (TransactionIdIsInProgress(undorec->xid))
+ return TM_Invisible; /* inserter has not committed yet */
+ else if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /* it must have aborted or crashed */
+ return TM_Invisible;
+ }
+
+ /* The tuple is visible to us. But can we lock it? */
+
+ /*
+ * No conflict with this lock. Look at the previous UNDO record, there
+ * might be more locks.
+ *
+ * FIXME: Shouldn't we drill down to the INSERT record and check if
+ * that's visible to us first, before looking at the lockers?
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ ZSUndoRec_TupleLock *lock_undorec = (ZSUndoRec_TupleLock *) undorec;
+
+ /*
+ * If any subtransaction of the current top transaction already holds
+ * a lock as strong as or stronger than what we're requesting, we
+ * effectively hold the desired lock already. We *must* succeed
+ * without trying to take the tuple lock, else we will deadlock
+ * against anyone wanting to acquire a stronger lock.
+ */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ *this_xact_has_lock = true;
+ if (lock_undorec->lockmode >= mode)
+ {
+ *undo_record_needed = true;
+ return TM_Ok;
+ }
+ }
+ else if (!zs_tuplelock_compatible(lock_undorec->lockmode, mode) &&
+ TransactionIdIsInProgress(undorec->xid))
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+
+ /* but am I holding a weaker lock already? */
+ if (!*this_xact_has_lock)
+ *this_xact_has_lock = am_i_holding_lock(rel, undorec->prevundorec, recent_oldest_undo);
+
+ return TM_BeingModified;
+ }
+
+ /*
+ * No conflict with this lock. Look at the previous UNDO record, there
+ * might be more locks.
+ *
+ * FIXME: Shouldn't we drill down to the INSERT record and check if
+ * that's visible to us first, before looking at the lockers?
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE)
+ {
+ ZSUndoRec_Delete *deleterec = (ZSUndoRec_Delete *) undorec;
+ if (visi_info)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ }
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ *this_xact_has_lock = true;
+ if (undorec->cid >= snapshot->curcid)
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = undorec->cid;
+ return TM_SelfModified; /* deleted/updated after scan started */
+ }
+ else
+ return TM_Invisible; /* deleted before scan started */
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+
+ /* but am I holding a weaker lock already? */
+ if (!*this_xact_has_lock)
+ *this_xact_has_lock = am_i_holding_lock(rel, undorec->prevundorec, recent_oldest_undo);
+
+ return TM_BeingModified;
+ }
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * The deleter must have aborted or crashed. We have to keep following
+ * the undo chain, in case there are LOCK records that are still visible.
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+ if (deleterec->changedPart)
+ {
+ ItemPointerSet(&tmfd->ctid, MovedPartitionsBlockNumber, MovedPartitionsOffsetNumber);
+ *next_tid = InvalidZSTid;
+ return TM_Updated;
+ }
+ else
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ return TM_Deleted;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ /* updated-away tuple */
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ LockTupleMode old_lockmode;
+ if (visi_info)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ }
+
+ *next_tid = updaterec->newtid;
+ old_lockmode = updaterec->key_update ? LockTupleExclusive : LockTupleNoKeyExclusive;
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ *this_xact_has_lock = true;
+ if (zs_tuplelock_compatible(old_lockmode, mode))
+ return TM_Ok;
+
+ if (undorec->cid >= snapshot->curcid)
+ {
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = undorec->cid;
+ return TM_SelfModified; /* deleted/updated after scan started */
+ }
+ else
+ return TM_Invisible; /* deleted before scan started */
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ {
+ if (zs_tuplelock_compatible(old_lockmode, mode))
+ return TM_Ok;
+
+ tmfd->ctid = ItemPointerFromZSTid(item_tid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+
+ /* but am I holding a weaker lock already? */
+ if (!*this_xact_has_lock)
+ *this_xact_has_lock = am_i_holding_lock(rel, undorec->prevundorec, recent_oldest_undo);
+
+ return TM_BeingModified;
+ }
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * The deleter must have aborted or crashed. We have to keep following
+ * the undo chain, in case there are LOCK records that are still visible.
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ if (zs_tuplelock_compatible(old_lockmode, mode))
+ return TM_Ok;
+
+ tmfd->ctid = ItemPointerFromZSTid(((ZSUndoRec_Update *) undorec)->newtid);
+ tmfd->xmax = undorec->xid;
+ tmfd->cmax = InvalidCommandId;
+ return TM_Updated;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+
+/*
+ * Like HeapTupleSatisfiesAny
+ */
+static bool
+zs_SatisfiesAny(ZSTidTreeScan *scan, ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ undo_ptr = visi_info->undoptr;
+
+fetch_undo_record:
+ /* If this record is "old", then the record is visible. */
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ return true;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE ||
+ undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+
+ return true;
+}
+
+/*
+ * helper function to zs_SatisfiesMVCC(), to check if the given XID
+ * is visible to the snapshot.
+ */
+static bool
+xid_is_visible(Snapshot snapshot, TransactionId xid, CommandId cid, bool *aborted)
+{
+ *aborted = false;
+ if (TransactionIdIsCurrentTransactionId(xid))
+ {
+ if (cid >= snapshot->curcid)
+ return false;
+ else
+ return true;
+ }
+ else if (XidInMVCCSnapshot(xid, snapshot))
+ return false;
+ else if (TransactionIdDidCommit(xid))
+ {
+ return true;
+ }
+ else
+ {
+ /* it must have aborted or crashed */
+ *aborted = true;
+ return false;
+ }
+}
+
+/*
+ * Like HeapTupleSatisfiesMVCC
+ */
+static bool
+zs_SatisfiesMVCC(ZSTidTreeScan *scan,
+ TransactionId *obsoleting_xid, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ Snapshot snapshot = scan->snapshot;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+ bool aborted;
+
+ undo_ptr = visi_info->undoptr;
+
+fetch_undo_record:
+ /* If this record is "old", then the record is visible. */
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ /* Inserted tuple */
+ bool result;
+ result = xid_is_visible(snapshot, undorec->xid, undorec->cid, &aborted);
+ if (!result && !aborted)
+ *obsoleting_xid = undorec->xid;
+
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ return result;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* we don't care about tuple locks here. Follow the link to the
+ * previous UNDO record for this tuple. */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ if (next_tid)
+ *next_tid = updaterec->newtid;
+ }
+
+ /*
+ * Deleted or updated-away. They are treated the same in an MVCC snapshot.
+ * They only need different treatment when updating or locking the row,
+ * in SatisfiesUpdate().
+ */
+ if (xid_is_visible(snapshot, undorec->xid, undorec->cid, &aborted))
+ {
+ /* we can see the deletion */
+ return false;
+ }
+ else
+ {
+ if (!aborted)
+ *obsoleting_xid = undorec->xid;
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * Like HeapTupleSatisfiesSelf
+ */
+static bool
+zs_SatisfiesSelf(ZSTidTreeScan *scan,
+ zstid *next_tid, ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ ZSUndoRec *undorec;
+ ZSUndoRecPtr undo_ptr;
+
+ undo_ptr = visi_info->undoptr;
+
+fetch_undo_record:
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+
+ /* Inserted tuple */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true; /* inserted by me */
+ else if (TransactionIdIsInProgress(undorec->xid))
+ return false;
+ else if (TransactionIdDidCommit(undorec->xid))
+ return true;
+ else
+ {
+ /* it must have aborted or crashed */
+ return false;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* we don't care about tuple locks here. Follow the link to the
+ * previous UNDO record for this tuple. */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ if (next_tid)
+ *next_tid = updaterec->newtid;
+ }
+
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ /* deleted by me */
+ return false;
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true;
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * Deleter must have aborted or crashed. But we have to keep following the
+ * undo chain, to check if the insertion was visible in the first
+ * place.
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ return false;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * Like HeapTupleSatisfiesDirty
+ */
+static bool
+zs_SatisfiesDirty(ZSTidTreeScan *scan,
+ zstid *next_tid, ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ Snapshot snapshot = scan->snapshot;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ snapshot->xmin = snapshot->xmax = InvalidTransactionId;
+ snapshot->speculativeToken = INVALID_SPECULATIVE_TOKEN;
+
+ undo_ptr = visi_info->undoptr;
+
+fetch_undo_record:
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ ZSUndoRec_Insert *insertrec = (ZSUndoRec_Insert *) undorec;
+ snapshot->speculativeToken = insertrec->speculative_token;
+
+ /*
+ * HACK: For SnapshotDirty, we need to set xmin/xmax/... in the
+ * snapshot based on the tuple. Hence, we can't set the visi_info
+ * values here in the same way as for the other snapshot types; we
+ * only set them in the TransactionIdIsInProgress() case.
+ */
+
+ /* Inserted tuple */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ return true; /* inserted by me */
+ else if (TransactionIdIsInProgress(undorec->xid))
+ {
+ snapshot->xmin = undorec->xid;
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+ return true;
+ }
+ else if (TransactionIdDidCommit(undorec->xid))
+ {
+ return true;
+ }
+ else
+ {
+ /* it must have aborted or crashed */
+ return false;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* locked tuple. */
+ /* look at the previous UNDO record to find the insert record */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ if (undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ ZSUndoRec_Update *updaterec = (ZSUndoRec_Update *) undorec;
+ if (next_tid)
+ *next_tid = updaterec->newtid;
+ }
+
+ /* deleted or updated-away tuple */
+ if (TransactionIdIsCurrentTransactionId(undorec->xid))
+ {
+ /* deleted by me */
+ return false;
+ }
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ {
+ /*
+ * TODO: is it necessary to set the snapshot's xmax here? The
+ * caller later populates it in the snapshot based on visi_info.
+ */
+ snapshot->xmax = undorec->xid;
+ visi_info->xmax = undorec->xid;
+ return true;
+ }
+
+ if (!TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * Deleter must have aborted or crashed. But we have to keep following the
+ * undo chain, to check if the insertion was visible in the first
+ * place.
+ */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+
+ return false;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * True if tuple might be visible to some transaction; false if it's
+ * surely dead to everyone, ie, vacuumable.
+ */
+static bool
+zs_SatisfiesNonVacuumable(ZSTidTreeScan *scan,
+ ZSUndoSlotVisibility *visi_info)
+{
+ Relation rel = scan->rel;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ undo_ptr = visi_info->undoptr;
+
+fetch_undo_record:
+
+ /* Is it visible? */
+ if (undo_ptr.counter < scan->recent_oldest_undo.counter)
+ {
+ visi_info->xmin = FrozenTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ return true;
+ }
+
+ /* have to fetch the UNDO record */
+ undorec = zsundo_fetch_record(rel, undo_ptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (undo_ptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ undo_ptr.counter, undo_ptr.blkno, undo_ptr.offset);
+ goto fetch_undo_record;
+ }
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ visi_info->xmin = undorec->xid;
+ visi_info->cmin = undorec->cid;
+
+ /* Inserted tuple */
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true; /* inserter has not committed yet */
+
+ if (TransactionIdDidCommit(undorec->xid))
+ return true;
+
+ /* it must have aborted or crashed */
+ return false;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ /* deleted or updated-away tuple */
+ ZSUndoRecPtr prevptr;
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true; /* delete-in-progress */
+ else if (TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * Deleter committed. But perhaps it was recent enough that some open
+ * transactions could still see the tuple.
+ */
+ if (!GlobalVisTestIsRemovableXid(scan->snapshot->vistest, undorec->xid))
+ {
+ visi_info->nonvacuumable_status = ZSNV_RECENTLY_DEAD;
+ return true;
+ }
+
+ return false;
+ }
+
+ /*
+ * The deleting transaction did not commit. But before concluding
+ * that the tuple is live, we have to check if the inserting
+ * XID is live.
+ */
+ do {
+ prevptr = undorec->prevundorec;
+
+ if (prevptr.counter < scan->recent_oldest_undo.counter)
+ return true;
+ undorec = zsundo_fetch_record(rel, prevptr);
+ if (!undorec)
+ {
+ scan->recent_oldest_undo = zsundo_get_oldest_undo_ptr(rel, false);
+ if (prevptr.counter >= scan->recent_oldest_undo.counter)
+ elog(ERROR, "could not find UNDO record " UINT64_FORMAT " at blk %u offset %u",
+ prevptr.counter, prevptr.blkno, prevptr.offset);
+ return true;
+ }
+ } while(undorec->type == ZSUNDO_TYPE_TUPLE_LOCK);
+
+ Assert(undorec->type == ZSUNDO_TYPE_INSERT);
+
+ if (TransactionIdIsInProgress(undorec->xid))
+ return true; /* insert-in-progress */
+ else if (TransactionIdDidCommit(undorec->xid))
+ return true; /* inserter committed */
+
+ /* inserter must have aborted or crashed */
+ return false;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* look at the previous UNDO record, to find the Insert record */
+ undo_ptr = undorec->prevundorec;
+ goto fetch_undo_record;
+ }
+ else
+ elog(ERROR, "unexpected UNDO record type: %d", undorec->type);
+}
+
+/*
+ * Like HeapTupleSatisfiesVisibility
+ *
+ * If next_tid is not NULL and the tuple was UPDATEd, *next_tid is set to the
+ * TID of the new row version.
+ */
+bool
+zs_SatisfiesVisibility(ZSTidTreeScan *scan,
+ TransactionId *obsoleting_xid, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info)
+{
+ /* initialize as invalid, if we find valid one populate the same */
+ if (next_tid)
+ *next_tid = InvalidZSTid;
+
+ /* The caller should've filled in the recent_oldest_undo pointer */
+ Assert(scan->recent_oldest_undo.counter != 0);
+
+ *obsoleting_xid = InvalidTransactionId;
+
+ /*
+ * Items with an invalid undo pointer are considered visible. Normally, the
+ * META column stores the valid undo record and all other columns store an
+ * invalid undo pointer: the visibility check is performed on the META
+ * column, and only if the row is visible are the rest of the columns
+ * fetched. For in-place updates, columns other than the META column may
+ * have a valid undo record, in which case the visibility check needs to be
+ * performed on them as well. The META column can sometimes also have items
+ * with an invalid undo pointer, see zsbt_undo_item_deletion().
+ */
+
+ if (!IsZSUndoRecPtrValid(&visi_info->undoptr))
+ return true;
+
+ switch (scan->snapshot->snapshot_type)
+ {
+ case SNAPSHOT_MVCC:
+ return zs_SatisfiesMVCC(scan, obsoleting_xid, next_tid, visi_info);
+
+ case SNAPSHOT_SELF:
+ return zs_SatisfiesSelf(scan, next_tid, visi_info);
+
+ case SNAPSHOT_ANY:
+ return zs_SatisfiesAny(scan, visi_info);
+
+ case SNAPSHOT_TOAST:
+ elog(ERROR, "SnapshotToast not implemented in zedstore");
+ break;
+
+ case SNAPSHOT_DIRTY:
+ return zs_SatisfiesDirty(scan, next_tid, visi_info);
+
+ case SNAPSHOT_HISTORIC_MVCC:
+ elog(ERROR, "SnapshotHistoricMVCC not implemented in zedstore yet");
+ break;
+
+ case SNAPSHOT_NON_VACUUMABLE:
+ return zs_SatisfiesNonVacuumable(scan, visi_info);
+ }
+
+ return false; /* keep compiler quiet */
+}
diff --git a/src/backend/access/zedstore/zedstore_wal.c b/src/backend/access/zedstore/zedstore_wal.c
new file mode 100644
index 0000000000..c33dd65363
--- /dev/null
+++ b/src/backend/access/zedstore/zedstore_wal.c
@@ -0,0 +1,107 @@
+/*
+ * zedstore_wal.c
+ * WAL-logging for zedstore.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstore_wal.c
+ */
+#include "postgres.h"
+
+#include "access/bufmask.h"
+#include "access/xlogreader.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undolog.h"
+#include "access/zedstore_undorec.h"
+#include "access/zedstore_wal.h"
+
+void
+zedstore_redo(XLogReaderState *record)
+{
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ switch (info)
+ {
+ case WAL_ZEDSTORE_INIT_METAPAGE:
+ zsmeta_initmetapage_redo(record);
+ break;
+ case WAL_ZEDSTORE_UNDO_NEWPAGE:
+ zsundo_newpage_redo(record);
+ break;
+ case WAL_ZEDSTORE_UNDO_DISCARD:
+ zsundo_discard_redo(record);
+ break;
+ case WAL_ZEDSTORE_BTREE_NEW_ROOT:
+ zsmeta_new_btree_root_redo(record);
+ break;
+ case WAL_ZEDSTORE_BTREE_REWRITE_PAGES:
+ zsbt_rewrite_pages_redo(record);
+ break;
+ case WAL_ZEDSTORE_TIDLEAF_ADD_ITEMS:
+ zsbt_tidleaf_items_redo(record, false);
+ break;
+ case WAL_ZEDSTORE_TIDLEAF_REPLACE_ITEM:
+ zsbt_tidleaf_items_redo(record, true);
+ break;
+ case WAL_ZEDSTORE_ATTSTREAM_CHANGE:
+ zsbt_attstream_change_redo(record);
+ break;
+ case WAL_ZEDSTORE_TOAST_NEWPAGE:
+ zstoast_newpage_redo(record);
+ break;
+ case WAL_ZEDSTORE_FPM_DELETE_PAGE:
+ zspage_delete_page_redo(record);
+ break;
+ case WAL_ZEDSTORE_FPM_REUSE_PAGE:
+ zspage_reuse_page_redo(record);
+ break;
+ default:
+ elog(PANIC, "zedstore_redo: unknown op code %u", info);
+ }
+}
+
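+/*
+ * Mask a zedstore page, for consistency checking purposes (see
+ * wal_consistency_checking). Fields that may legitimately differ between
+ * primary and standby, such as the speculative insertion token in INSERT
+ * UNDO records, are replaced with a fixed mask pattern.
+ */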
+void
+zedstore_mask(char *pagedata, BlockNumber blkno)
+{
+ Page page = (Page) pagedata;
+ uint16 page_id;
+
+ mask_page_lsn_and_checksum(page);
+
+ page_id = *(uint16 *) (pagedata + BLCKSZ - sizeof(uint16));
+
+ if (blkno == ZS_META_BLK)
+ {
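+ /* Currently, nothing on the metapage needs masking. */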
+ }
+ else if (page_id == ZS_UNDO_PAGE_ID && PageGetSpecialSize(page) == sizeof(ZSUndoPageOpaque))
+ {
+ /*
+ * On INSERT undo records, mask out speculative insertion tokens.
+ */
+ char *endptr = pagedata + ((PageHeader) pagedata)->pd_lower;
+ char *ptr;
+
+ ptr = pagedata + SizeOfPageHeaderData;
+
+ while (ptr < endptr)
+ {
+ ZSUndoRec *undorec = (ZSUndoRec *) ptr;
+
+ /* minimal validation */
+ if (undorec->size < sizeof(ZSUndoRec) || ptr + undorec->size > endptr)
+ break;
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ ((ZSUndoRec_Insert *) undorec)->speculative_token = MASK_MARKER;
+ }
+
+ ptr += undorec->size;
+ }
+ }
+
+ return;
+}
diff --git a/src/backend/access/zedstore/zedstoream_handler.c b/src/backend/access/zedstore/zedstoream_handler.c
new file mode 100644
index 0000000000..138e8a1572
--- /dev/null
+++ b/src/backend/access/zedstore/zedstoream_handler.c
@@ -0,0 +1,3205 @@
+/*-------------------------------------------------------------------------
+ *
+ * zedstoream_handler.c
+ * ZedStore table access method code
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/zedstore/zedstoream_handler.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/genam.h"
+#include "access/heapam.h"
+#include "access/multixact.h"
+#include "access/relscan.h"
+#include "access/tableam.h"
+#include "access/tsmapi.h"
+#include "access/tupdesc_details.h"
+#include "access/xact.h"
+#include "access/zedstore_internal.h"
+#include "access/zedstore_undorec.h"
+#include "catalog/catalog.h"
+#include "catalog/index.h"
+#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
+#include "commands/progress.h"
+#include "commands/vacuum.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "optimizer/plancat.h"
+#include "pgstat.h"
+#include "parser/parse_relation.h"
+#include "storage/lmgr.h"
+#include "storage/predicate.h"
+#include "storage/procarray.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+
+typedef struct ZedStoreProjectData
+{
+ int num_proj_atts;
+ Bitmapset *project_columns;
+ int *proj_atts;
+ ZSTidTreeScan tid_scan;
+ ZSAttrTreeScan *attr_scans;
+ MemoryContext context;
+} ZedStoreProjectData;
+
+typedef struct ZedStoreDescData
+{
+ /* scan parameters */
+ TableScanDescData rs_scan; /* AM-independent part of the scan descriptor */
+ ZedStoreProjectData proj_data;
+
+ bool started;
+ zstid cur_range_start;
+ zstid cur_range_end;
+
+ /* These fields are used for bitmap scans, to hold a "block's" worth of data */
+#define MAX_ITEMS_PER_LOGICAL_BLOCK MaxHeapTuplesPerPage
+ int bmscan_ntuples;
+ zstid *bmscan_tids;
+ int bmscan_nexttuple;
+
+ /* These fields are used for TABLESAMPLE scans */
+ zstid min_tid_to_scan;
+ zstid max_tid_to_scan;
+ zstid next_tid_to_scan;
+
+} ZedStoreDescData;
+
+typedef struct ZedStoreDescData *ZedStoreDesc;
+
+typedef struct ZedStoreIndexFetchData
+{
+ IndexFetchTableData idx_fetch_data;
+ ZedStoreProjectData proj_data;
+} ZedStoreIndexFetchData;
+
+typedef struct ZedStoreIndexFetchData *ZedStoreIndexFetch;
+
+typedef struct ParallelZSScanDescData *ParallelZSScanDesc;
+
+static IndexFetchTableData *zedstoream_begin_index_fetch(Relation rel);
+static void zedstoream_fetch_set_column_projection(struct IndexFetchTableData *scan,
+ Bitmapset *project_cols);
+static void zedstoream_end_index_fetch(IndexFetchTableData *scan);
+static bool zedstoream_fetch_row(ZedStoreIndexFetchData *fetch,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot);
+static bool zs_acquire_tuplock(Relation relation, ItemPointer tid, LockTupleMode mode,
+ LockWaitPolicy wait_policy, bool *have_tuple_lock);
+
+static bool zs_blkscan_next_block(TableScanDesc sscan,
+ BlockNumber blkno, OffsetNumber *offsets, int noffsets,
+ bool predicatelocks);
+static bool zs_blkscan_next_tuple(TableScanDesc sscan, TupleTableSlot *slot);
+
+static Size zs_parallelscan_estimate(Relation rel);
+static Size zs_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan);
+static void zs_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan);
+static bool zs_parallelscan_nextrange(Relation rel, ParallelZSScanDesc pzscan,
+ zstid *start, zstid *end);
+static void zsbt_fill_missing_attribute_value(TupleDesc tupleDesc, int attno, Datum *datum, bool *isnull);
+
+/* ----------------------------------------------------------------
+ * storage AM support routines for zedstoream
+ * ----------------------------------------------------------------
+ */
+
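+/*
+ * Fetch the row version with the given TID, check its visibility against
+ * 'snapshot', and if visible, store the projected columns in 'slot'.
+ */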
+static bool
+zedstoream_fetch_row_version(Relation rel,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot,
+ Bitmapset *project_cols)
+{
+ IndexFetchTableData *fetcher;
+ bool result;
+
+ zsbt_tuplebuffer_flush(rel);
+
+ fetcher = zedstoream_begin_index_fetch(rel);
+ zedstoream_fetch_set_column_projection(fetcher, project_cols);
+
+ result = zedstoream_fetch_row((ZedStoreIndexFetchData *) fetcher,
+ tid_p, snapshot, slot);
+ if (result)
+ {
+ /* FIXME: heapam acquires the predicate lock first, and then
+ * calls CheckForSerializableConflictOut(). We do it in the
+ * opposite order, because the CheckForSerializableConflictOut()
+ * call is already done in zsbt_get_last_tid(). Does it matter?
+ * I'm not sure.
+ *
+ * We pass in InvalidTransactionId as we are sure that the current
+ * transaction hasn't locked tid_p.
+ */
+ PredicateLockTID(rel, tid_p, snapshot, InvalidTransactionId);
+ }
+ ExecMaterializeSlot(slot);
+ slot->tts_tableOid = RelationGetRelid(rel);
+ slot->tts_tid = *tid_p;
+
+ zedstoream_end_index_fetch(fetcher);
+
+ return result;
+}
+
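+/*
+ * Like heap_get_latest_tid: resolve 'tid' to the latest version of the row
+ * that is visible to the scan's snapshot.
+ */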
+static void
+zedstoream_get_latest_tid(TableScanDesc sscan,
+ ItemPointer tid)
+{
+ zstid ztid = ZSTidFromItemPointer(*tid);
+
+ zsbt_tuplebuffer_flush(sscan->rs_rd);
+
+ zsbt_find_latest_tid(sscan->rs_rd, &ztid, sscan->rs_snapshot);
+ *tid = ItemPointerFromZSTid(ztid);
+}
+
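+/*
+ * Common workhorse for zedstoream_insert() and zedstoream_insert_speculative().
+ */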
+static inline void
+zedstoream_insert_internal(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, struct BulkInsertStateData *bistate, uint32 speculative_token)
+{
+ zstid tid;
+ TransactionId xid = GetCurrentTransactionId();
+ MemoryContext oldcontext;
+ MemoryContext insert_mcontext;
+
+ /*
+ * The insert code performs allocations for creating and merging items.
+ * These are small allocations, but they add up with the number of columns
+ * and rows being inserted. Hence, we create a memory context to track them
+ * and free them wholesale, instead of freeing them individually. TODO: in
+ * the long term, see if we can avoid creating a context here, free
+ * individually in the normal case, and only create a context for page
+ * splits, perhaps.
+ */
+ insert_mcontext = AllocSetContextCreate(CurrentMemoryContext,
+ "ZedstoreAMContext",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(insert_mcontext);
+
+ if (slot->tts_tupleDescriptor->natts != relation->rd_att->natts)
+ elog(ERROR, "slot's attribute count doesn't match relcache entry");
+
+ if (speculative_token == INVALID_SPECULATIVE_TOKEN)
+ tid = zsbt_tuplebuffer_allocate_tids(relation, xid, cid, 1);
+ else
+ tid = zsbt_tid_multi_insert(relation, 1, xid, cid, speculative_token,
+ InvalidUndoPtr);
+
+ /*
+ * We only need to check for table-level SSI locks. Our
+ * new tuple can't possibly conflict with existing tuple locks, and
+ * page locks are only consolidated versions of tuple locks; they do not
+ * lock "gaps" as index page locks do.
+ */
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
+
+ slot_getallattrs(slot);
+ zsbt_tuplebuffer_spool_tuple(relation, tid, slot->tts_values, slot->tts_isnull);
+
+ slot->tts_tableOid = RelationGetRelid(relation);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ /* XXX: should we set visi_info here? */
+
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(insert_mcontext);
+
+ /* Note: speculative insertions are counted too, even if aborted later */
+ pgstat_count_heap_insert(relation, 1);
+}
+
+static void
+zedstoream_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, struct BulkInsertStateData *bistate)
+{
+ zedstoream_insert_internal(relation, slot, cid, options, bistate, INVALID_SPECULATIVE_TOKEN);
+}
+
+static void
+zedstoream_insert_speculative(Relation relation, TupleTableSlot *slot, CommandId cid,
+ int options, BulkInsertState bistate, uint32 specToken)
+{
+ zedstoream_insert_internal(relation, slot, cid, options, bistate, specToken);
+}
+
+static void
+zedstoream_complete_speculative(Relation relation, TupleTableSlot *slot, uint32 spekToken,
+ bool succeeded)
+{
+ zstid tid;
+
+ tid = ZSTidFromItemPointer(slot->tts_tid);
+ zsbt_tid_clear_speculative_token(relation, tid, spekToken, true /* for complete */);
+ /*
+ * If the speculative insertion was not confirmed, there was a conflict,
+ * and the speculatively inserted row must be killed.
+ *
+ * FIXME: Shouldn't we mark the TID dead first?
+ */
+ if (!succeeded)
+ {
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(relation, true);
+
+ zsbt_tid_mark_dead(relation, tid, recent_oldest_undo);
+ }
+}
+
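+/*
+ * Insert multiple tuples in one call, as used by COPY. The TIDs are allocated
+ * as one consecutive range, and the tuples are spooled into the tuple buffer.
+ */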
+static void
+zedstoream_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
+ CommandId cid, int options, BulkInsertState bistate)
+{
+ int i;
+ TransactionId xid = GetCurrentTransactionId();
+ zstid firsttid;
+ zstid *tids;
+
+ if (ntuples == 0)
+ {
+ /* COPY sometimes calls us with 0 tuples. */
+ return;
+ }
+
+ firsttid = zsbt_tuplebuffer_allocate_tids(relation, xid, cid, ntuples);
+
+ tids = palloc(ntuples * sizeof(zstid));
+ for (i = 0; i < ntuples; i++)
+ tids[i] = firsttid + i;
+
+ /*
+ * We only need to check for table-level SSI locks. Our
+ * new tuple can't possibly conflict with existing tuple locks, and
+ * page locks are only consolidated versions of tuple locks; they do not
+ * lock "gaps" as index page locks do.
+ */
+ CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
+
+ zsbt_tuplebuffer_spool_slots(relation, tids, slots, ntuples);
+
+ for (i = 0; i < ntuples; i++)
+ {
+ slots[i]->tts_tableOid = RelationGetRelid(relation);
+ slots[i]->tts_tid = ItemPointerFromZSTid(firsttid + i);
+ }
+
+ pgstat_count_heap_insert(relation, ntuples);
+}
+
+static TM_Result
+zedstoream_delete(Relation relation, ItemPointer tid_p, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart)
+{
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ TM_Result result = TM_Ok;
+ bool this_xact_has_lock = false;
+ bool have_tuple_lock = false;
+
+ zsbt_tuplebuffer_flush(relation);
+
+retry:
+ result = zsbt_tid_delete(relation, tid, xid, cid,
+ snapshot, crosscheck, wait, hufd, changingPart,
+ &this_xact_has_lock);
+
+ if (result != TM_Ok)
+ {
+ if (result == TM_Invisible)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("attempted to delete invisible tuple")));
+ else if (result == TM_BeingModified && wait)
+ {
+ TransactionId xwait = hufd->xmax;
+
+ if (!TransactionIdIsCurrentTransactionId(xwait))
+ {
+ * Acquire a tuple lock to establish our priority for the tuple.
+ * See zedstoream_lock_tuple().
+ * See zedstoream_lock_tuple().
+ */
+ if (!this_xact_has_lock)
+ {
+ zs_acquire_tuplock(relation, tid_p, LockTupleExclusive, LockWaitBlock,
+ &have_tuple_lock);
+ }
+
+ XactLockTableWait(xwait, relation, tid_p, XLTW_Delete);
+ goto retry;
+ }
+ }
+ }
+
+ /*
+ * Check for SSI conflicts.
+ */
+ CheckForSerializableConflictIn(relation, tid_p, ItemPointerGetBlockNumber(tid_p));
+
+ if (result == TM_Ok)
+ pgstat_count_heap_delete(relation);
+
+ return result;
+}
+
+
+/*
+ * Each tuple lock mode has a corresponding heavyweight lock, and one or two
+ * corresponding MultiXactStatuses (one to merely lock tuples, another one to
+ * update them). This table (and the macros below) helps us determine the
+ * heavyweight lock mode and MultiXactStatus values to use for any particular
+ * tuple lock strength.
+ *
+ * Don't look at lockstatus/updstatus directly! Use get_mxact_status_for_lock
+ * instead.
+ */
+static const struct
+{
+ LOCKMODE hwlock;
+ int lockstatus;
+ int updstatus;
+}
+
+ tupleLockExtraInfo[MaxLockTupleMode + 1] =
+{
+ { /* LockTupleKeyShare */
+ AccessShareLock,
+ MultiXactStatusForKeyShare,
+ -1 /* KeyShare does not allow updating tuples */
+ },
+ { /* LockTupleShare */
+ RowShareLock,
+ MultiXactStatusForShare,
+ -1 /* Share does not allow updating tuples */
+ },
+ { /* LockTupleNoKeyExclusive */
+ ExclusiveLock,
+ MultiXactStatusForNoKeyUpdate,
+ MultiXactStatusNoKeyUpdate
+ },
+ { /* LockTupleExclusive */
+ AccessExclusiveLock,
+ MultiXactStatusForUpdate,
+ MultiXactStatusUpdate
+ }
+};
+
+
+/*
+ * Acquire heavyweight locks on tuples, using a LockTupleMode strength value.
+ * This is more readable than having every caller translate it to lock.h's
+ * LOCKMODE.
+ */
+#define LockTupleTuplock(rel, tup, mode) \
+ LockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
+#define UnlockTupleTuplock(rel, tup, mode) \
+ UnlockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
+#define ConditionalLockTupleTuplock(rel, tup, mode) \
+ ConditionalLockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
+
+/*
+ * Acquire heavyweight lock on the given tuple, in preparation for acquiring
+ * its normal, Xmax-based tuple lock.
+ *
+ * have_tuple_lock is an input and output parameter: on input, it indicates
+ * whether the lock has previously been acquired (and this function does
+ * nothing in that case). If this function returns success, have_tuple_lock
+ * has been flipped to true.
+ *
+ * Returns false if it was unable to obtain the lock; this can only happen if
+ * wait_policy is Skip.
+ *
+ * XXX: This is identical to heap_acquire_tuplock
+ */
+
+static bool
+zs_acquire_tuplock(Relation relation, ItemPointer tid, LockTupleMode mode,
+ LockWaitPolicy wait_policy, bool *have_tuple_lock)
+{
+ if (*have_tuple_lock)
+ return true;
+
+ switch (wait_policy)
+ {
+ case LockWaitBlock:
+ LockTupleTuplock(relation, tid, mode);
+ break;
+
+ case LockWaitSkip:
+ if (!ConditionalLockTupleTuplock(relation, tid, mode))
+ return false;
+ break;
+
+ case LockWaitError:
+ if (!ConditionalLockTupleTuplock(relation, tid, mode))
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("could not obtain lock on row in relation \"%s\"",
+ RelationGetRelationName(relation))));
+ break;
+ }
+ *have_tuple_lock = true;
+
+ return true;
+}
+
+
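+/*
+ * Table AM callback for locking a row (SELECT ... FOR UPDATE/SHARE and
+ * friends). The lock is recorded in the TID tree's UNDO information. If
+ * the caller asked for the latest version of the row, the update chain is
+ * followed, and finally the locked tuple is fetched into the slot.
+ */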
+static TM_Result
+zedstoream_lock_tuple(Relation relation, ItemPointer tid_p, Snapshot snapshot,
+ TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
+ LockWaitPolicy wait_policy, uint8 flags,
+ TM_FailureData *tmfd,
+ Bitmapset *project_cols)
+{
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ TM_Result result;
+ bool this_xact_has_lock = false;
+ bool have_tuple_lock = false;
+ zstid next_tid = tid;
+ SnapshotData SnapshotDirty;
+ bool locked_something = false;
+ ZSUndoSlotVisibility *visi_info = &((ZedstoreTupleTableSlot *) slot)->visi_info_buf;
+ bool follow_updates = false;
+
+ zsbt_tuplebuffer_flush(relation);
+
+ slot->tts_tableOid = RelationGetRelid(relation);
+ slot->tts_tid = *tid_p;
+
+ tmfd->traversed = false;
+ /*
+ * For now, we lock just the first attribute. As long as everyone
+ * does that, that's enough.
+ */
+retry:
+ result = zsbt_tid_lock(relation, tid, xid, cid, mode, follow_updates,
+ snapshot, tmfd, &next_tid, &this_xact_has_lock, visi_info);
+ ((ZedstoreTupleTableSlot *) slot)->visi_info = visi_info;
+
+ if (result == TM_Invisible)
+ {
+ /*
+ * This is possible, but only when locking a tuple for ON CONFLICT
+ * UPDATE and some other cases handled below. We return this value
+ * here rather than throwing an error in order to give that case the
+ * opportunity to throw a more specific error.
+ */
+ /*
+ * This can also happen, if we're locking an UPDATE chain for KEY SHARE mode:
+ * A tuple has been inserted, and then updated, by a different transaction.
+ * The updating transaction is still in progress. We can lock the row
+ * in KEY SHARE mode, assuming the key columns were not updated, and we will
+		 * try to lock all the row versions, even the still-in-progress UPDATEs.
+ * It's possible that the UPDATE aborts while we're chasing the update chain,
+ * so that the updated tuple becomes invisible to us. That's OK.
+ */
+ if (mode == LockTupleKeyShare && locked_something)
+ return TM_Ok;
+
+ /*
+ * This can also happen, if the caller asked for the latest version
+ * of the tuple and if tuple was inserted by our own transaction, we
+ * have to check cmin against cid: cmin >= current CID means our
+ * command cannot see the tuple, so we should ignore it.
+ */
+ Assert(visi_info->cmin != InvalidCommandId);
+ if ((flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION) != 0 &&
+ TransactionIdIsCurrentTransactionId(visi_info->xmin) &&
+ visi_info->cmin >= cid)
+ {
+ tmfd->xmax = visi_info->xmin;
+ tmfd->cmax = visi_info->cmin;
+ return TM_SelfModified;
+ }
+
+ return TM_Invisible;
+ }
+ else if (result == TM_Updated ||
+ (result == TM_SelfModified && tmfd->cmax >= cid))
+ {
+ /*
+ * The other transaction is an update and it already committed.
+ *
+ * If the caller asked for the latest version, find it.
+ */
+ if ((flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION) != 0 && next_tid != tid)
+ {
+ if (have_tuple_lock)
+ {
+ UnlockTupleTuplock(relation, tid_p, mode);
+ have_tuple_lock = false;
+ }
+
+ if (ItemPointerIndicatesMovedPartitions(&tmfd->ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
+ /* it was updated, so look at the updated version */
+ *tid_p = ItemPointerFromZSTid(next_tid);
+
+ /* signal that a tuple later in the chain is getting locked */
+ tmfd->traversed = true;
+
+ /* loop back to fetch next in chain */
+
+ /* FIXME: In the corresponding code in heapam, we cross-check the xmin/xmax
+ * of the old and new tuple. Should we do the same here?
+ */
+
+ InitDirtySnapshot(SnapshotDirty);
+ snapshot = &SnapshotDirty;
+ tid = next_tid;
+ goto retry;
+ }
+
+ return result;
+ }
+ else if (result == TM_Deleted)
+ {
+ /*
+ * The other transaction is a delete and it already committed.
+ */
+ return result;
+ }
+ else if (result == TM_BeingModified)
+ {
+ TransactionId xwait = tmfd->xmax;
+
+ /*
+ * Acquire tuple lock to establish our priority for the tuple, or
+ * die trying. LockTuple will release us when we are next-in-line
+ * for the tuple. We must do this even if we are share-locking,
+ * but not if we already have a weaker lock on the tuple.
+ *
+ * If we are forced to "start over" below, we keep the tuple lock;
+ * this arranges that we stay at the head of the line while
+ * rechecking tuple state.
+ *
+		 * Explanation for why we don't acquire the heavyweight lock when we
+		 * already hold a weaker lock: when promoting a weaker lock, we might
+		 * deadlock with another locker that has already acquired the
+		 * heavyweight tuple lock and is waiting for our transaction to
+		 * finish.
+		 *
+		 * Note that in this case we still need to wait for the xid if
+		 * required, to avoid acquiring conflicting locks.
+ */
+ if (!this_xact_has_lock &&
+ !zs_acquire_tuplock(relation, tid_p, mode, wait_policy,
+ &have_tuple_lock))
+ {
+ /*
+ * This can only happen if wait_policy is Skip and the lock
+ * couldn't be obtained.
+ */
+ return TM_WouldBlock;
+ }
+
+ /* wait for regular transaction to end, or die trying */
+ switch (wait_policy)
+ {
+ case LockWaitBlock:
+ XactLockTableWait(xwait, relation, tid_p, XLTW_Lock);
+ break;
+ case LockWaitSkip:
+ if (!ConditionalXactLockTableWait(xwait))
+ {
+ /* FIXME: should we release the hwlock here? */
+ return TM_WouldBlock;
+ }
+ break;
+ case LockWaitError:
+ if (!ConditionalXactLockTableWait(xwait))
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("could not obtain lock on row in relation \"%s\"",
+ RelationGetRelationName(relation))));
+ break;
+ }
+
+ /*
+ * xwait is done. Retry.
+ */
+ goto retry;
+ }
+ if (result == TM_Ok)
+ locked_something = true;
+
+ /*
+ * Now that we have successfully marked the tuple as locked, we can
+ * release the lmgr tuple lock, if we had it.
+ */
+ if (have_tuple_lock)
+ {
+ UnlockTupleTuplock(relation, tid_p, mode);
+ have_tuple_lock = false;
+ }
+
+ if (mode == LockTupleKeyShare)
+ {
+ /* lock all row versions, if it's a KEY SHARE lock */
+ follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
+ if (result == TM_Ok && tid != next_tid && next_tid != InvalidZSTid)
+ {
+ tid = next_tid;
+ goto retry;
+ }
+ }
+
+ /* Fetch the tuple, too. */
+ if (!zedstoream_fetch_row_version(relation, tid_p, SnapshotAny, slot,
+ project_cols))
+ elog(ERROR, "could not fetch locked tuple");
+
+ return TM_Ok;
+}
+
+/* like heap_tuple_attr_equals */
+static bool
+zs_tuple_attr_equals(int attrnum, TupleTableSlot *slot1, TupleTableSlot *slot2)
+{
+ TupleDesc tupdesc = slot1->tts_tupleDescriptor;
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than tableOID; we cannot expect these to be consistent in a HOT chain,
+ * or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapDetermineModifiedColumns do
+ * a single heap_deform_tuple call on each tuple, instead? But that
+ * doesn't work for system columns ...
+ */
+ value1 = slot_getattr(slot1, attrnum, &isnull1);
+ value2 = slot_getattr(slot2, attrnum, &isnull2);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = TupleDescAttr(tupdesc, attrnum - 1);
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+}
+
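+/*
+ * Did the UPDATE modify any of the relation's "key" columns (as returned by
+ * RelationGetIndexAttrBitmap(..., INDEX_ATTR_BITMAP_KEY))? Used by
+ * zedstoream_update() to choose between LockTupleNoKeyExclusive and
+ * LockTupleExclusive.
+ */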
+static bool
+is_key_update(Relation relation, TupleTableSlot *oldslot, TupleTableSlot *newslot)
+{
+ Bitmapset *key_attrs;
+ Bitmapset *interesting_attrs;
+ Bitmapset *modified_attrs;
+ int attnum;
+
+ /*
+ * Fetch the list of attributes to be checked for various operations.
+ *
+ * For HOT considerations, this is wasted effort if we fail to update or
+ * have to put the new tuple on a different page. But we must compute the
+ * list before obtaining buffer lock --- in the worst case, if we are
+ * doing an update on one of the relevant system catalogs, we could
+ * deadlock if we try to fetch the list later. In any case, the relcache
+ * caches the data so this is usually pretty cheap.
+ *
+ * We also need columns used by the replica identity and columns that are
+ * considered the "key" of rows in the table.
+ *
+ * Note that we get copies of each bitmap, so we need not worry about
+ * relcache flush happening midway through.
+ */
+ key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
+
+ interesting_attrs = NULL;
+ interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
+
+ /* Determine columns modified by the update. */
+ modified_attrs = NULL;
+ while ((attnum = bms_first_member(interesting_attrs)) >= 0)
+ {
+ attnum += FirstLowInvalidHeapAttributeNumber;
+
+ if (!zs_tuple_attr_equals(attnum, oldslot, newslot))
+ modified_attrs = bms_add_member(modified_attrs,
+ attnum - FirstLowInvalidHeapAttributeNumber);
+ }
+
+ return bms_overlap(modified_attrs, key_attrs);
+}
+
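+/*
+ * Table AM callback for UPDATE. Zedstore has no in-place updates: the old
+ * TID is marked as updated in the TID tree, and the new row version is
+ * inserted under a newly allocated TID.
+ */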
+static TM_Result
+zedstoream_update(Relation relation, ItemPointer otid_p, TupleTableSlot *slot,
+ CommandId cid, Snapshot snapshot, Snapshot crosscheck,
+ bool wait, TM_FailureData *hufd,
+ LockTupleMode *lockmode, bool *update_indexes)
+{
+ zstid otid = ZSTidFromItemPointer(*otid_p);
+ TransactionId xid = GetCurrentTransactionId();
+ bool key_update;
+ Datum *d;
+ bool *isnulls;
+ TM_Result result;
+ zstid newtid;
+ TupleTableSlot *oldslot;
+ IndexFetchTableData *fetcher;
+ MemoryContext oldcontext;
+ MemoryContext insert_mcontext;
+ bool this_xact_has_lock = false;
+ bool have_tuple_lock = false;
+
+ zsbt_tuplebuffer_flush(relation);
+
+	/*
+	 * The insert code performs many small allocations while creating and
+	 * merging items, and these add up with the number of columns and rows
+	 * being inserted. Create a dedicated memory context so they can be
+	 * freed wholesale, instead of retail. TODO: in the long term, try to
+	 * avoid creating the context here; retail-free in the normal case and
+	 * only create a context for page splits, maybe.
+	 */
+ insert_mcontext = AllocSetContextCreate(CurrentMemoryContext,
+ "ZedstoreAMContext",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(insert_mcontext);
+
+ slot_getallattrs(slot);
+ d = slot->tts_values;
+ isnulls = slot->tts_isnull;
+
+ oldslot = table_slot_create(relation, NULL);
+ fetcher = zedstoream_begin_index_fetch(relation);
+
+ /*
+ * The meta-attribute holds the visibility information, including the "t_ctid"
+ * pointer to the updated version. All the real attributes are just inserted,
+ * as if for a new row.
+ */
+retry:
+ newtid = InvalidZSTid;
+
+ /*
+ * Fetch the old row, so that we can figure out which columns were modified.
+ *
+ * FIXME: if we have to follow the update chain, we should look at the
+ * currently latest tuple version, rather than the one visible to our snapshot.
+ */
+ if (!zedstoream_fetch_row((ZedStoreIndexFetchData *) fetcher,
+ otid_p, SnapshotAny, oldslot))
+ {
+ return TM_Invisible;
+ }
+ key_update = is_key_update(relation, oldslot, slot);
+
+ *lockmode = key_update ? LockTupleExclusive : LockTupleNoKeyExclusive;
+
+ result = zsbt_tid_update(relation, otid,
+ xid, cid, key_update, snapshot, crosscheck,
+ wait, hufd, &newtid, &this_xact_has_lock);
+
+ *update_indexes = (result == TM_Ok);
+ if (result == TM_Ok)
+ {
+ /*
+ * Check for SSI conflicts.
+ */
+ CheckForSerializableConflictIn(relation, otid_p, ItemPointerGetBlockNumber(otid_p));
+
+ zsbt_tuplebuffer_spool_tuple(relation, newtid, d, isnulls);
+
+ slot->tts_tableOid = RelationGetRelid(relation);
+ slot->tts_tid = ItemPointerFromZSTid(newtid);
+
+ pgstat_count_heap_update(relation, false);
+ }
+ else
+ {
+ if (result == TM_Invisible)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("attempted to update invisible tuple")));
+ else if (result == TM_BeingModified && wait)
+ {
+ TransactionId xwait = hufd->xmax;
+
+ if (!TransactionIdIsCurrentTransactionId(xwait))
+ {
+ /*
+				 * Acquire tuple lock to establish our priority for the tuple.
+ * See zedstoream_lock_tuple().
+ */
+ if (!this_xact_has_lock)
+ {
+ zs_acquire_tuplock(relation, otid_p, LockTupleExclusive, LockWaitBlock,
+ &have_tuple_lock);
+ }
+
+ XactLockTableWait(xwait, relation, otid_p, XLTW_Update);
+ goto retry;
+ }
+ }
+ }
+
+ /*
+ * Now that we have successfully updated the tuple, we can
+ * release the lmgr tuple lock, if we had it.
+ */
+ if (have_tuple_lock)
+ {
+ UnlockTupleTuplock(relation, otid_p, LockTupleExclusive);
+ have_tuple_lock = false;
+ }
+
+ zedstoream_end_index_fetch(fetcher);
+ ExecDropSingleTupleTableSlot(oldslot);
+
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(insert_mcontext);
+
+ return result;
+}
+
+static const TupleTableSlotOps *
+zedstoream_slot_callbacks(Relation relation)
+{
+ return &TTSOpsZedstore;
+}
+
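+/*
+ * Fill in the array of projected attribute numbers in 'proj_data' from the
+ * column projection bitmap. The meta-attribute is always included; if the
+ * bitmap is empty, only the meta-attribute is scanned.
+ */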
+static void
+zs_initialize_proj_attributes(TupleDesc tupledesc, ZedStoreProjectData *proj_data)
+{
+ MemoryContext oldcontext;
+
+ if (proj_data->num_proj_atts != 0)
+ return;
+
+ proj_data->tid_scan.active = false;
+
+ if (bms_is_empty(proj_data->project_columns))
+ {
+ proj_data->proj_atts = palloc(sizeof(int));
+ proj_data->proj_atts[proj_data->num_proj_atts++] = ZS_META_ATTRIBUTE_NUM;
+ proj_data->attr_scans = NULL;
+ }
+ else
+ {
+ bool project_whole_row = contains_whole_row_col(proj_data->project_columns);
+ oldcontext = MemoryContextSwitchTo(proj_data->context);
+ /* add one for meta-attribute */
+ proj_data->proj_atts = palloc((tupledesc->natts + 1) * sizeof(int));
+ proj_data->attr_scans = palloc0(tupledesc->natts * sizeof(ZSAttrTreeScan));
+ proj_data->proj_atts[proj_data->num_proj_atts++] = ZS_META_ATTRIBUTE_NUM;
+
+ /* Convert column bitmap into an array of attnos.*/
+ for (int idx = 0; idx < tupledesc->natts; idx++)
+ {
+ int att_no = idx + 1;
+
+			/*
+			 * Never project dropped columns; NULL is returned for them in
+			 * the slot by default.
+			 */
+ if (TupleDescAttr(tupledesc, idx)->attisdropped)
+ continue;
+
+ /*
+ * If the whole row was requested, then ensure that all columns are
+ * added.
+ */
+ if (project_whole_row || bms_is_member(att_no, proj_data->project_columns))
+ proj_data->proj_atts[proj_data->num_proj_atts++] = att_no;
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ }
+}
+
+static void
+zs_initialize_proj_attributes_extended(ZedStoreDesc scan, TupleDesc tupledesc)
+{
+ MemoryContext oldcontext;
+ ZedStoreProjectData *proj_data = &scan->proj_data;
+
+ /* if already initialized return */
+ if (proj_data->num_proj_atts != 0)
+ return;
+
+ zs_initialize_proj_attributes(tupledesc, proj_data);
+
+ oldcontext = MemoryContextSwitchTo(proj_data->context);
+ /* Extra setup for bitmap and sample scans */
+ if ((scan->rs_scan.rs_flags & SO_TYPE_BITMAPSCAN) ||
+ (scan->rs_scan.rs_flags & SO_TYPE_SAMPLESCAN) ||
+ (scan->rs_scan.rs_flags & SO_TYPE_ANALYZE))
+ {
+ scan->bmscan_ntuples = 0;
+ scan->bmscan_tids = palloc(MAX_ITEMS_PER_LOGICAL_BLOCK * sizeof(zstid));
+ }
+ MemoryContextSwitchTo(oldcontext);
+}
+
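+/*
+ * Begin a table scan that fetches only the columns in 'project_columns'.
+ * zedstoream_beginscan() is a wrapper around this that requests the whole
+ * row.
+ */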
+static TableScanDesc
+zedstoream_beginscan_with_column_projection(Relation relation, Snapshot snapshot,
+ int nkeys, ScanKey key,
+ ParallelTableScanDesc parallel_scan,
+ uint32 flags,
+ Bitmapset *project_columns)
+{
+ ZedStoreDesc scan;
+
+ zsbt_tuplebuffer_flush(relation);
+
+ /* Sample scans have no snapshot, but we need one */
+ if (!snapshot)
+ {
+ Assert(!(flags & SO_TYPE_SAMPLESCAN));
+ snapshot = SnapshotAny;
+ }
+
+ /*
+ * allocate and initialize scan descriptor
+ */
+ scan = (ZedStoreDesc) palloc0(sizeof(ZedStoreDescData));
+
+ scan->rs_scan.rs_rd = relation;
+ scan->rs_scan.rs_snapshot = snapshot;
+ scan->rs_scan.rs_nkeys = nkeys;
+ scan->rs_scan.rs_flags = flags;
+ scan->rs_scan.rs_parallel = parallel_scan;
+
+ /*
+ * we can use page-at-a-time mode if it's an MVCC-safe snapshot
+ */
+
+ /*
+ * we do this here instead of in initscan() because heap_rescan also calls
+ * initscan() and we don't want to allocate memory again
+ */
+ if (nkeys > 0)
+ scan->rs_scan.rs_key = (ScanKey) palloc(sizeof(ScanKeyData) * nkeys);
+ else
+ scan->rs_scan.rs_key = NULL;
+
+ scan->proj_data.context = CurrentMemoryContext;
+ scan->proj_data.project_columns = project_columns;
+
+ /*
+ * For a seqscan in a serializable transaction, acquire a predicate lock
+ * on the entire relation. This is required not only to lock all the
+ * matching tuples, but also to conflict with new insertions into the
+ * table. In an indexscan, we take page locks on the index pages covering
+ * the range specified in the scan qual, but in a heap scan there is
+ * nothing more fine-grained to lock. A bitmap scan is a different story,
+ * there we have already scanned the index and locked the index pages
+ * covering the predicate. But in that case we still have to lock any
+ * matching heap tuples.
+ */
+ if (flags & (SO_TYPE_SEQSCAN | SO_TYPE_SAMPLESCAN))
+ PredicateLockRelation(relation, snapshot);
+
+ /*
+ * Currently, we don't have a stats counter for bitmap heap scans (but the
+ * underlying bitmap index scans will be counted) or sample scans (we only
+ * update stats for tuple fetches there)
+ */
+ if (!(flags & SO_TYPE_BITMAPSCAN) && !(flags & SO_TYPE_SAMPLESCAN))
+ pgstat_count_heap_scan(relation);
+
+ return (TableScanDesc) scan;
+}
+
+static TableScanDesc
+zedstoream_beginscan(Relation relation, Snapshot snapshot,
+ int nkeys, ScanKey key,
+ ParallelTableScanDesc parallel_scan,
+ uint32 flags)
+{
+ return zedstoream_beginscan_with_column_projection(relation, snapshot,
+ nkeys, key, parallel_scan,
+ flags,
+ bms_make_singleton(0));
+}
+
+static void
+zedstoream_endscan(TableScanDesc sscan)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ ZedStoreProjectData *proj_data = &scan->proj_data;
+
+ if (proj_data->proj_atts)
+ pfree(proj_data->proj_atts);
+
+ if (proj_data->num_proj_atts > 0)
+ {
+ zsbt_tid_end_scan(&proj_data->tid_scan);
+ for (int i = 1; i < proj_data->num_proj_atts; i++)
+ zsbt_attr_end_scan(&proj_data->attr_scans[i - 1]);
+ }
+
+ if (scan->rs_scan.rs_flags & SO_TEMP_SNAPSHOT)
+ UnregisterSnapshot(scan->rs_scan.rs_snapshot);
+
+ if (proj_data->attr_scans)
+ pfree(proj_data->attr_scans);
+ pfree(scan);
+}
+
+static void
+zedstoream_rescan(TableScanDesc sscan, struct ScanKeyData *key,
+ bool set_params, bool allow_strat,
+ bool allow_sync, bool allow_pagemode)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+
+ /* these params don't do much in zedstore yet, but whatever */
+ if (set_params)
+ {
+ if (allow_strat)
+ scan->rs_scan.rs_flags |= SO_ALLOW_STRAT;
+ else
+ scan->rs_scan.rs_flags &= ~SO_ALLOW_STRAT;
+
+ if (allow_sync)
+ scan->rs_scan.rs_flags |= SO_ALLOW_SYNC;
+ else
+ scan->rs_scan.rs_flags &= ~SO_ALLOW_SYNC;
+
+ if (allow_pagemode && scan->rs_scan.rs_snapshot &&
+ IsMVCCSnapshot(scan->rs_scan.rs_snapshot))
+ scan->rs_scan.rs_flags |= SO_ALLOW_PAGEMODE;
+ else
+ scan->rs_scan.rs_flags &= ~SO_ALLOW_PAGEMODE;
+ }
+
+ if (scan->proj_data.num_proj_atts > 0)
+ {
+ zsbt_tid_reset_scan(&scan->proj_data.tid_scan,
+ scan->cur_range_start, scan->cur_range_end, scan->cur_range_start - 1);
+
+ if ((scan->rs_scan.rs_flags & SO_TYPE_SAMPLESCAN) != 0)
+ scan->next_tid_to_scan = ZSTidFromBlkOff(0, 1);
+ }
+}
+
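+/*
+ * Fetch the next visible tuple in a scan. First the TID tree is walked to
+ * find the next visible TID, then the datum for each projected attribute is
+ * fetched from its attribute tree. Non-projected columns are returned as
+ * NULLs.
+ */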
+static bool
+zedstoream_getnextslot(TableScanDesc sscan, ScanDirection direction,
+ TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ ZedStoreProjectData *scan_proj = &scan->proj_data;
+ int slot_natts = slot->tts_tupleDescriptor->natts;
+ Datum *slot_values = slot->tts_values;
+ bool *slot_isnull = slot->tts_isnull;
+ zstid this_tid;
+ Datum datum;
+ bool isnull;
+ ZSUndoSlotVisibility *visi_info;
+ uint8 slotno;
+
+ if (direction != ForwardScanDirection && scan->rs_scan.rs_parallel)
+ elog(ERROR, "parallel backward scan not implemented");
+
+ if (!scan->started)
+ {
+ MemoryContext oldcontext;
+
+ zs_initialize_proj_attributes(slot->tts_tupleDescriptor, scan_proj);
+
+ if (scan->rs_scan.rs_parallel)
+ {
+ /* Allocate next range of TIDs to scan */
+ if (!zs_parallelscan_nextrange(scan->rs_scan.rs_rd,
+ (ParallelZSScanDesc) scan->rs_scan.rs_parallel,
+ &scan->cur_range_start, &scan->cur_range_end))
+ {
+ ExecClearTuple(slot);
+ return false;
+ }
+ }
+ else
+ {
+ scan->cur_range_start = MinZSTid;
+ scan->cur_range_end = MaxPlusOneZSTid;
+ }
+
+ oldcontext = MemoryContextSwitchTo(scan_proj->context);
+ zsbt_tid_begin_scan(scan->rs_scan.rs_rd,
+ scan->cur_range_start,
+ scan->cur_range_end,
+ scan->rs_scan.rs_snapshot,
+ &scan_proj->tid_scan);
+ scan_proj->tid_scan.serializable = true;
+ for (int i = 1; i < scan_proj->num_proj_atts; i++)
+ {
+ int attno = scan_proj->proj_atts[i];
+
+ zsbt_attr_begin_scan(scan->rs_scan.rs_rd,
+ slot->tts_tupleDescriptor,
+ attno,
+ &scan_proj->attr_scans[i - 1]);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ scan->started = true;
+ }
+ Assert((scan_proj->num_proj_atts - 1) <= slot_natts);
+
+ /*
+ * Initialize the slot.
+ *
+ * We initialize all columns to NULL. The values for columns that are projected
+ * will be set to the actual values below, but it's important that non-projected
+ * columns are NULL.
+ */
+ ExecClearTuple(slot);
+ for (int i = 0; i < slot_natts; i++)
+ slot_isnull[i] = true;
+
+ /*
+ * Find the next visible TID.
+ */
+ for (;;)
+ {
+ this_tid = zsbt_tid_scan_next(&scan_proj->tid_scan, direction);
+ if (this_tid == InvalidZSTid)
+ {
+ if (scan->rs_scan.rs_parallel)
+ {
+ /* Allocate next range of TIDs to scan */
+ if (!zs_parallelscan_nextrange(scan->rs_scan.rs_rd,
+ (ParallelZSScanDesc) scan->rs_scan.rs_parallel,
+ &scan->cur_range_start, &scan->cur_range_end))
+ {
+ ExecClearTuple(slot);
+ return false;
+ }
+
+ zsbt_tid_reset_scan(&scan_proj->tid_scan,
+ scan->cur_range_start, scan->cur_range_end, scan->cur_range_start - 1);
+ continue;
+ }
+ else
+ {
+ ExecClearTuple(slot);
+ return false;
+ }
+ }
+ Assert (this_tid < scan->cur_range_end);
+ break;
+ }
+
+ /* Note: We don't need to predicate-lock tuples in Serializable mode,
+ * because in a sequential scan, we predicate-locked the whole table.
+ */
+
+ /* Fetch the datums of each attribute for this row */
+ for (int i = 1; i < scan_proj->num_proj_atts; i++)
+ {
+ ZSAttrTreeScan *btscan = &scan_proj->attr_scans[i - 1];
+ Form_pg_attribute attr = btscan->attdesc;
+ int natt;
+
+ if (!zsbt_attr_fetch(btscan, &datum, &isnull, this_tid))
+ zsbt_fill_missing_attribute_value(slot->tts_tupleDescriptor, btscan->attno,
+ &datum, &isnull);
+
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ natt = scan_proj->proj_atts[i];
+
+ if (!isnull && attr->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ MemoryContext oldcxt = CurrentMemoryContext;
+
+ if (btscan->decoder.tmpcxt)
+ MemoryContextSwitchTo(btscan->decoder.tmpcxt);
+ datum = zedstore_toast_flatten(scan->rs_scan.rs_rd, natt, this_tid, datum);
+ MemoryContextSwitchTo(oldcxt);
+ }
+
+ /* Check that the values coming out of the b-tree are aligned properly */
+ if (!isnull && attr->attlen == -1)
+ {
+ Assert (VARATT_IS_1B(datum) || INTALIGN(datum) == datum);
+ }
+
+ Assert(natt > 0);
+ slot_values[natt - 1] = datum;
+ slot_isnull[natt - 1] = isnull;
+ }
+
+ /* Fill in the rest of the fields in the slot, and return the tuple */
+ slotno = ZSTidScanCurUndoSlotNo(&scan_proj->tid_scan);
+ visi_info = &scan_proj->tid_scan.array_iter.visi_infos[slotno];
+ ((ZedstoreTupleTableSlot *) slot)->visi_info = visi_info;
+
+ slot->tts_tableOid = RelationGetRelid(scan->rs_scan.rs_rd);
+ slot->tts_tid = ItemPointerFromZSTid(this_tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ pgstat_count_heap_getnext(scan->rs_scan.rs_rd);
+ return true;
+}
+
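+/*
+ * Table AM callback: could the given TID exist in this table? We check it
+ * against the first and last TID in the table, which are looked up once and
+ * cached in the scan descriptor.
+ */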
+static bool
+zedstoream_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
+{
+ ZedStoreDesc scan;
+ zstid ztid;
+
+ if (!ItemPointerIsValid(tid))
+ return false;
+
+ scan = (ZedStoreDesc) sscan;
+ ztid = ZSTidFromItemPointer(*tid);
+
+ if (scan->min_tid_to_scan == InvalidZSTid)
+ {
+ /*
+ * get the min tid once and store it
+ */
+ scan->min_tid_to_scan = zsbt_get_first_tid(sscan->rs_rd);
+ }
+
+ if (scan->max_tid_to_scan == InvalidZSTid)
+ {
+ /*
+ * get the max tid once and store it
+ */
+ scan->max_tid_to_scan = zsbt_get_last_tid(sscan->rs_rd);
+ }
+
+	if (ztid >= scan->min_tid_to_scan && ztid < scan->max_tid_to_scan)
+ return true;
+ else
+ return false;
+}
+
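+/*
+ * Table AM callback: is the tuple in the slot visible to the given snapshot?
+ * The visibility information lives in the TID tree, so run the visibility
+ * check against a one-TID scan of it.
+ */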
+static bool
+zedstoream_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
+ Snapshot snapshot)
+{
+ zstid tid = ZSTidFromItemPointer(slot->tts_tid);
+ TransactionId obsoleting_xid;
+ ZSTidTreeScan meta_scan;
+ bool result;
+ ZedstoreTupleTableSlot *zslot = (ZedstoreTupleTableSlot *) slot;
+
+ /* Use the meta-data tree for the visibility information. */
+ zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, &meta_scan);
+ result = zs_SatisfiesVisibility(&meta_scan, &obsoleting_xid, NULL, zslot->visi_info);
+ zsbt_tid_end_scan(&meta_scan);
+
+ return result;
+}
+
+static TransactionId
+zedstoream_compute_xid_horizon_for_tuples(Relation rel,
+ ItemPointerData *items,
+ int nitems)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function %s not implemented yet", __func__)));
+}
+
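+/*
+ * Set up state for fetching individual rows by TID (index fetches and row
+ * version fetches). The column projection defaults to the whole row; it can
+ * be narrowed with zedstoream_fetch_set_column_projection().
+ */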
+static IndexFetchTableData *
+zedstoream_begin_index_fetch(Relation rel)
+{
+ ZedStoreIndexFetch zscan;
+
+ zsbt_tuplebuffer_flush(rel);
+
+ zscan = palloc0(sizeof(ZedStoreIndexFetchData));
+ zscan->idx_fetch_data.rel = rel;
+ zscan->proj_data.context = CurrentMemoryContext;
+ zscan->proj_data.project_columns = bms_make_singleton(0);
+
+ return (IndexFetchTableData *) zscan;
+}
+
+static void
+zedstoream_fetch_set_column_projection(struct IndexFetchTableData *scan,
+ Bitmapset *project_cols)
+{
+ ZedStoreIndexFetch zscan = (ZedStoreIndexFetch) scan;
+ zscan->proj_data.project_columns = project_cols;
+}
+
+static void
+zedstoream_reset_index_fetch(IndexFetchTableData *scan)
+{
+ /* TODO: we could close the scans here, but currently we don't bother */
+}
+
+static void
+zedstoream_end_index_fetch(IndexFetchTableData *scan)
+{
+ ZedStoreIndexFetch zscan = (ZedStoreIndexFetch) scan;
+ ZedStoreProjectData *zscan_proj = &zscan->proj_data;
+
+ if (zscan_proj->num_proj_atts > 0)
+ {
+ zsbt_tid_end_scan(&zscan_proj->tid_scan);
+ for (int i = 1; i < zscan_proj->num_proj_atts; i++)
+ zsbt_attr_end_scan(&zscan_proj->attr_scans[i - 1]);
+ }
+
+ if (zscan_proj->proj_atts)
+ pfree(zscan_proj->proj_atts);
+
+ if (zscan_proj->attr_scans)
+ pfree(zscan_proj->attr_scans);
+ pfree(zscan);
+}
+
+static bool
+zedstoream_index_fetch_tuple(struct IndexFetchTableData *scan,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot,
+ bool *call_again, bool *all_dead)
+{
+ bool result;
+
+ /*
+ * we don't do in-place updates, so this is essentially the same as
+ * fetch_row_version.
+ */
+ if (call_again)
+ *call_again = false;
+ if (all_dead)
+ *all_dead = false;
+
+ result = zedstoream_fetch_row((ZedStoreIndexFetchData *) scan, tid_p, snapshot, slot);
+ if (result)
+ {
+ /* FIXME: heapam acquires the predicate lock first, and then
+ * calls CheckForSerializableConflictOut(). We do it in the
+		 * opposite order, because the CheckForSerializableConflictOut()
+		 * call is already done in zsbt_get_last_tid(). Does it matter?
+ * I'm not sure.
+ *
+ * We pass in InvalidTransactionId as we are sure that the current
+ * transaction hasn't locked tid_p.
+ */
+ PredicateLockTID(scan->rel, tid_p, snapshot, InvalidTransactionId);
+ }
+
+ return result;
+}
+
+/*
+ * Shared implementation of fetch_row_version and index_fetch_tuple callbacks.
+ */
+static bool
+zedstoream_fetch_row(ZedStoreIndexFetchData *fetch,
+ ItemPointer tid_p,
+ Snapshot snapshot,
+ TupleTableSlot *slot)
+{
+ Relation rel = fetch->idx_fetch_data.rel;
+ zstid tid = ZSTidFromItemPointer(*tid_p);
+ bool found = true;
+ ZedStoreProjectData *fetch_proj = &fetch->proj_data;
+
+ /* first time here, initialize */
+ if (fetch_proj->num_proj_atts == 0)
+ {
+ TupleDesc reldesc = RelationGetDescr(rel);
+ MemoryContext oldcontext;
+
+ zs_initialize_proj_attributes(slot->tts_tupleDescriptor, fetch_proj);
+
+ oldcontext = MemoryContextSwitchTo(fetch_proj->context);
+ zsbt_tid_begin_scan(rel, tid, tid + 1,
+ snapshot,
+ &fetch_proj->tid_scan);
+ fetch_proj->tid_scan.serializable = true;
+ for (int i = 1; i < fetch_proj->num_proj_atts; i++)
+ {
+ int attno = fetch_proj->proj_atts[i];
+
+ zsbt_attr_begin_scan(rel, reldesc, attno,
+ &fetch_proj->attr_scans[i - 1]);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ zsbt_tid_reset_scan(&fetch_proj->tid_scan, tid, tid + 1, tid - 1);
+
+ /*
+ * Initialize the slot.
+ *
+ * If we're not fetching all columns, initialize the unfetched values
+ * in the slot to NULL. (Actually, this initializes all to NULL, and the
+ * code below will overwrite them for the columns that are projected)
+ */
+ ExecClearTuple(slot);
+ for (int i = 0; i < slot->tts_tupleDescriptor->natts; i++)
+ slot->tts_isnull[i] = true;
+
+ found = zsbt_tid_scan_next(&fetch_proj->tid_scan, ForwardScanDirection) != InvalidZSTid;
+ if (found)
+ {
+ for (int i = 1; i < fetch_proj->num_proj_atts; i++)
+ {
+ int natt = fetch_proj->proj_atts[i];
+ ZSAttrTreeScan *btscan = &fetch_proj->attr_scans[i - 1];
+ Form_pg_attribute attr;
+ Datum datum;
+ bool isnull;
+
+ attr = btscan->attdesc;
+ if (zsbt_attr_fetch(btscan, &datum, &isnull, tid))
+ {
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ if (!isnull && attr->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ MemoryContext oldcxt = CurrentMemoryContext;
+
+ if (btscan->decoder.tmpcxt)
+ MemoryContextSwitchTo(btscan->decoder.tmpcxt);
+ datum = zedstore_toast_flatten(rel, natt, tid, datum);
+ MemoryContextSwitchTo(oldcxt);
+ }
+ }
+ else
+ zsbt_fill_missing_attribute_value(slot->tts_tupleDescriptor, btscan->attno,
+ &datum, &isnull);
+
+ slot->tts_values[natt - 1] = datum;
+ slot->tts_isnull[natt - 1] = isnull;
+ }
+ }
+
+ if (found)
+ {
+ uint8 slotno = ZSTidScanCurUndoSlotNo(&fetch_proj->tid_scan);
+ ZSUndoSlotVisibility *visi_info;
+
+ visi_info = &fetch_proj->tid_scan.array_iter.visi_infos[slotno];
+
+ ((ZedstoreTupleTableSlot *) slot)->visi_info = visi_info;
+ slot->tts_tableOid = RelationGetRelid(rel);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+ return true;
+ }
+
+ return false;
+}
+
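+/*
+ * Second phase of CREATE INDEX CONCURRENTLY: scan the table and insert into
+ * the index any tuples whose TIDs are missing from the sorted TID set
+ * collected in 'state'.
+ */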
+static void
+zedstoream_index_validate_scan(Relation baseRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ Snapshot snapshot,
+ ValidateIndexState *state)
+{
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ ExprState *predicate;
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
+ int attno;
+ TableScanDesc scan;
+ ItemPointerData idx_ptr;
+ bool tuplesort_empty = false;
+ Bitmapset *proj = NULL;
+
+ /*
+ * sanity checks
+ */
+ Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+ /*
+ * Need an EState for evaluation of index expressions and partial-index
+ * predicates. Also a slot to hold the current tuple.
+ */
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ slot = table_slot_create(baseRelation, NULL);
+
+ /* Arrange for econtext's scan tuple to be the tuple under test */
+ econtext->ecxt_scantuple = slot;
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+
+ /*
+ * Prepare for scan of the base relation. We need just those tuples
+ * satisfying the passed-in reference snapshot. We must disable syncscan
+ * here, because it's critical that we read from block zero forward to
+ * match the sorted TIDs.
+ */
+
+ /*
+ * TODO: It would be very good to fetch only the columns we need.
+ */
+ for (attno = 0; attno < indexInfo->ii_NumIndexKeyAttrs; attno++)
+ {
+ Assert(indexInfo->ii_IndexAttrNumbers[attno] <= baseRelation->rd_att->natts);
+ proj = bms_add_member(proj, indexInfo->ii_IndexAttrNumbers[attno]);
+ }
+ PopulateNeededColumnsForNode((Node *)indexInfo->ii_Predicate,
+ baseRelation->rd_att->natts,
+ &proj);
+ PopulateNeededColumnsForNode((Node *)indexInfo->ii_Expressions,
+ baseRelation->rd_att->natts,
+ &proj);
+
+ scan = table_beginscan_with_column_projection(baseRelation, /* relation */
+ snapshot, /* snapshot */
+ 0, /* number of keys */
+ NULL, /* scan key */
+ proj);
+
+ /*
+ * Scan all tuples matching the snapshot.
+ */
+ ItemPointerSet(&idx_ptr, 0, 0); /* this is less than any real TID */
+ while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
+ {
+ ItemPointerData tup_ptr = slot->tts_tid;
+ int cmp;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * TODO: Once we have in-place updates, like HOT, this will need
+ * to work harder, like heapam's function.
+ */
+
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
+
+ if (tuplesort_empty)
+ cmp = -1;
+ else
+ {
+ while ((cmp = ItemPointerCompare(&tup_ptr, &idx_ptr)) > 0)
+ {
+ Datum ts_val;
+ bool ts_isnull;
+
+ tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+ &ts_val, &ts_isnull, NULL);
+ if (!tuplesort_empty)
+ {
+ Assert(!ts_isnull);
+ itemptr_decode(&idx_ptr, DatumGetInt64(ts_val));
+
+ /* If int8 is pass-by-ref, free (encoded) TID Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+ pfree(DatumGetPointer(ts_val));
+#endif
+ break;
+ }
+ else
+ {
+ /* Be tidy */
+ ItemPointerSetInvalid(&idx_ptr);
+ cmp = -1;
+ }
+ }
+ }
+ if (cmp < 0)
+ {
+ /* This item is not in the index */
+
+ /*
+ * In a partial index, discard tuples that don't satisfy the
+ * predicate.
+ */
+ if (predicate != NULL)
+ {
+ if (!ExecQual(predicate, econtext))
+ continue;
+ }
+
+ /*
+ * For the current heap tuple, extract all the attributes we use in
+ * this index, and note which are null. This also performs evaluation
+ * of any expressions needed.
+ */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ /* Call the AM's callback routine to process the tuple */
+ index_insert(indexRelation, values, isnull, &tup_ptr, baseRelation,
+ indexInfo->ii_Unique ?
+ UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+ indexInfo);
+
+ state->tups_inserted += 1;
+ }
+ }
+
+ table_endscan(scan);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ FreeExecutorState(estate);
+
+ /* These may have been pointing to the now-gone estate */
+ indexInfo->ii_ExpressionsState = NIL;
+ indexInfo->ii_PredicateState = NULL;
+}
+
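+/*
+ * Scan the table (or a range of its TIDs) and pass every tuple that should
+ * be indexed to the index build callback. This parallels
+ * heapam_index_build_range_scan().
+ */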
+static double
+zedstoream_index_build_range_scan(Relation baseRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ bool anyvisible,
+ bool progress,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state,
+ TableScanDesc scan)
+{
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ double reltuples;
+ ExprState *predicate;
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
+ Snapshot snapshot;
+ SnapshotData NonVacuumableSnapshot;
+ bool need_unregister_snapshot = false;
+ TransactionId OldestXmin;
+ bool tupleIsAlive;
+
+#ifdef USE_ASSERT_CHECKING
+ bool checking_uniqueness;
+ /* See whether we're verifying uniqueness/exclusion properties */
+ checking_uniqueness = (indexInfo->ii_Unique ||
+ indexInfo->ii_ExclusionOps != NULL);
+
+ /*
+ * "Any visible" mode is not compatible with uniqueness checks; make sure
+ * only one of those is requested.
+ */
+ Assert(!(anyvisible && checking_uniqueness));
+#endif
+
+ /*
+ * sanity checks
+ */
+ Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+ /*
+ * Need an EState for evaluation of index expressions and partial-index
+ * predicates. Also a slot to hold the current tuple.
+ */
+ estate = CreateExecutorState();
+ econtext = GetPerTupleExprContext(estate);
+ slot = table_slot_create(baseRelation, NULL);
+
+ /* Arrange for econtext's scan tuple to be the tuple under test */
+ econtext->ecxt_scantuple = slot;
+
+ /* Set up execution state for predicate, if any. */
+ predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+
+ /*
+ * Prepare for scan of the base relation. In a normal index build, we use
+ * SnapshotAny because we must retrieve all tuples and do our own time
+ * qual checks (because we have to index RECENTLY_DEAD tuples). In a
+ * concurrent build, or during bootstrap, we take a regular MVCC snapshot
+ * and index whatever's live according to that.
+ */
+ OldestXmin = InvalidTransactionId;
+
+ /* okay to ignore lazy VACUUMs here */
+ if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
+ OldestXmin = GetOldestNonRemovableTransactionId(baseRelation);
+
+ zsbt_tuplebuffer_flush(baseRelation);
+ if (!scan)
+ {
+ int attno;
+ Bitmapset *proj = NULL;
+
+ /*
+ * Serial index build.
+ *
+ * Must begin our own zedstore scan in this case. We may also need to
+ * register a snapshot whose lifetime is under our direct control.
+ */
+ if (!TransactionIdIsValid(OldestXmin))
+ {
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ need_unregister_snapshot = true;
+ }
+ else
+ {
+ /* leave out completely dead items even with SnapshotAny */
+ InitNonVacuumableSnapshot(NonVacuumableSnapshot,
+ GlobalVisTestFor(baseRelation));
+ snapshot = &NonVacuumableSnapshot;
+ }
+
+ for (attno = 0; attno < indexInfo->ii_NumIndexKeyAttrs; attno++)
+ {
+ Assert(indexInfo->ii_IndexAttrNumbers[attno] <= baseRelation->rd_att->natts);
+ proj = bms_add_member(proj, indexInfo->ii_IndexAttrNumbers[attno]);
+ }
+ PopulateNeededColumnsForNode((Node *)indexInfo->ii_Predicate,
+ baseRelation->rd_att->natts,
+ &proj);
+ PopulateNeededColumnsForNode((Node *)indexInfo->ii_Expressions,
+ baseRelation->rd_att->natts,
+ &proj);
+
+ scan = table_beginscan_with_column_projection(baseRelation, /* relation */
+ snapshot, /* snapshot */
+ 0, /* number of keys */
+ NULL, /* scan key */
+ proj);
+
+ if (start_blockno != 0 || numblocks != InvalidBlockNumber)
+ {
+ ZedStoreDesc zscan = (ZedStoreDesc) scan;
+ ZedStoreProjectData *zscan_proj = &zscan->proj_data;
+
+ zscan->cur_range_start = ZSTidFromBlkOff(start_blockno, 1);
+ zscan->cur_range_end = ZSTidFromBlkOff(numblocks, 1);
+
+ /* FIXME: when can 'num_proj_atts' be 0? */
+ if (zscan_proj->num_proj_atts > 0)
+ {
+ zsbt_tid_begin_scan(zscan->rs_scan.rs_rd,
+ zscan->cur_range_start,
+ zscan->cur_range_end,
+ zscan->rs_scan.rs_snapshot,
+ &zscan_proj->tid_scan);
+ for (int i = 1; i < zscan_proj->num_proj_atts; i++)
+ {
+ int natt = zscan_proj->proj_atts[i];
+
+ zsbt_attr_begin_scan(zscan->rs_scan.rs_rd,
+ RelationGetDescr(zscan->rs_scan.rs_rd),
+ natt,
+ &zscan_proj->attr_scans[i - 1]);
+ }
+ }
+ }
+ }
+ else
+ {
+ /*
+ * Parallel index build.
+ *
+ * Parallel case never registers/unregisters own snapshot. Snapshot
+ * is taken from parallel zedstore scan, and is SnapshotAny or an MVCC
+ * snapshot, based on same criteria as serial case.
+ */
+ Assert(!IsBootstrapProcessingMode());
+ Assert(allow_sync);
+ Assert(start_blockno == 0);
+ Assert(numblocks == InvalidBlockNumber);
+ snapshot = scan->rs_snapshot;
+
+ if (snapshot == SnapshotAny)
+ {
+ /* leave out completely dead items even with SnapshotAny */
+ InitNonVacuumableSnapshot(NonVacuumableSnapshot,
+ GlobalVisTestFor(baseRelation));
+ snapshot = &NonVacuumableSnapshot;
+ }
+ }
+
+ /*
+	 * Must have called GetOldestNonRemovableTransactionId() if using
+	 * SnapshotAny, and must not have for an MVCC snapshot. (It's especially
+	 * worth checking this for parallel builds, since ambuild routines that
+	 * support parallel builds must work these details out for themselves.)
+ */
+ Assert(snapshot == &NonVacuumableSnapshot || IsMVCCSnapshot(snapshot));
+ Assert(snapshot == &NonVacuumableSnapshot ? TransactionIdIsValid(OldestXmin) :
+ !TransactionIdIsValid(OldestXmin));
+ Assert(snapshot == &NonVacuumableSnapshot || !anyvisible);
+
+ reltuples = 0;
+
+ /*
+ * Scan all tuples in the base relation.
+ */
+ while (zedstoream_getnextslot(scan, ForwardScanDirection, slot))
+ {
+ ZSUndoSlotVisibility *visi_info;
+
+ if (numblocks != InvalidBlockNumber &&
+ ItemPointerGetBlockNumber(&slot->tts_tid) >= numblocks)
+ break;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Is the tuple deleted, but still visible to old transactions?
+ *
+ * We need to include such tuples in the index, but exclude them
+ * from unique-checking.
+ *
+ * TODO: Heap checks for DELETE_IN_PROGRESS do we need as well?
+ */
+ visi_info = ((ZedstoreTupleTableSlot *) slot)->visi_info;
+ tupleIsAlive = (visi_info->nonvacuumable_status != ZSNV_RECENTLY_DEAD);
+
+ if (tupleIsAlive)
+ reltuples += 1;
+
+ /*
+ * TODO: Once we have in-place updates, like HOT, this will need
+ * to work harder, to figure out which tuple version to index.
+ */
+
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
+
+ /*
+ * In a partial index, discard tuples that don't satisfy the
+ * predicate.
+ */
+ if (predicate != NULL)
+ {
+ if (!ExecQual(predicate, econtext))
+ continue;
+ }
+
+ /*
+ * For the current heap tuple, extract all the attributes we use in
+ * this index, and note which are null. This also performs evaluation
+ * of any expressions needed.
+ */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ /* Call the AM's callback routine to process the tuple */
+ callback(indexRelation, &slot->tts_tid, values, isnull, tupleIsAlive,
+ callback_state);
+ }
+
+ table_endscan(scan);
+
+ /* we can now forget our snapshot, if set and registered by us */
+ if (need_unregister_snapshot)
+ UnregisterSnapshot(snapshot);
+
+ ExecDropSingleTupleTableSlot(slot);
+
+ FreeExecutorState(estate);
+
+ /* These may have been pointing to the now-gone estate */
+ indexInfo->ii_ExpressionsState = NIL;
+ indexInfo->ii_PredicateState = NULL;
+
+ return reltuples;
+}
+
+static void
+zedstoream_finish_bulk_insert(Relation relation, int options)
+{
+ zsbt_tuplebuffer_flush(relation);
+}
+
+/* ------------------------------------------------------------------------
+ * DDL related callbacks for zedstore AM.
+ * ------------------------------------------------------------------------
+ */
+
+static void
+zedstoream_relation_set_new_filenode(Relation rel,
+ const RelFileNode *newrnode,
+ char persistence,
+ TransactionId *freezeXid,
+ MultiXactId *minmulti)
+{
+ SMgrRelation srel;
+
+ /* XXX: I think we could just throw away all data in the buffer */
+ zsbt_tuplebuffer_flush(rel);
+
+ /*
+ * Initialize to the minimum XID that could put tuples in the table. We
+ * know that no xacts older than RecentXmin are still running, so that
+ * will do.
+ */
+ *freezeXid = RecentXmin;
+
+ /*
+ * Similarly, initialize the minimum Multixact to the first value that
+ * could possibly be stored in tuples in the table. Running transactions
+ * could reuse values from their local cache, so we are careful to
+ * consider all currently running multis.
+ *
+ * XXX this could be refined further, but is it worth the hassle?
+ */
+ *minmulti = GetOldestMultiXactId();
+
+ srel = RelationCreateStorage(*newrnode, persistence);
+
+ /*
+ * If required, set up an init fork for an unlogged table so that it can
+ * be correctly reinitialized on restart. An immediate sync is required
+ * even if the page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved the
+ * redo pointer past our xlog record. Recovery may as well remove it
+ * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+ * record. Therefore, logging is necessary even if wal_level=minimal.
+ */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+ rel->rd_rel->relkind == RELKIND_MATVIEW ||
+ rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+ smgrcreate(srel, INIT_FORKNUM, false);
+ log_smgrcreate(newrnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+}
+
+static void
+zedstoream_relation_nontransactional_truncate(Relation rel)
+{
+ /* XXX: I think we could just throw away all data in the buffer */
+ zsbt_tuplebuffer_flush(rel);
+ zsmeta_invalidate_cache(rel);
+ RelationTruncate(rel, 0);
+}
+
+static void
+zedstoream_relation_copy_data(Relation rel, const RelFileNode *newrnode)
+{
+ SMgrRelation dstrel;
+
+ zsbt_tuplebuffer_flush(rel);
+
+ dstrel = smgropen(*newrnode, rel->rd_backend);
+ RelationOpenSmgr(rel);
+
+ /*
+ * Since we copy the file directly without looking at the shared buffers,
+ * we'd better first flush out any pages of the source relation that are
+ * in shared buffers. We assume no new changes will be made while we are
+ * holding exclusive lock on the rel.
+ */
+ FlushRelationBuffers(rel);
+
+ /*
+ * Create and copy all the relation, and schedule unlinking of the
+ * old physical file.
+ *
+ * NOTE: any conflict in relfilenode value will be caught in
+ * RelationCreateStorage().
+ *
+ * NOTE: There is only the main fork in zedstore. Otherwise
+ * this would need to copy other forks, too.
+ */
+ RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence);
+
+ /* copy main fork */
+ RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
+ rel->rd_rel->relpersistence);
+
+ /* drop old relation, and close new one */
+ RelationDropStorage(rel);
+ smgrclose(dstrel);
+}
+
+/*
+ * Subroutine of the zedstoream_relation_copy_for_cluster() callback.
+ *
+ * Creates the TID item with correct visibility information for the
+ * given tuple in the old table. Returns the tid of the tuple in the
+ * new table, or InvalidZSTid if this tuple can be left out completely.
+ *
+ * FIXME: This breaks UPDATE chains. I.e. after this is done, an UPDATE
+ * looks like DELETE + INSERT, instead of an UPDATE, to any transaction that
+ * might try to follow the update chain.
+ */
+static zstid
+zs_cluster_process_tuple(Relation OldHeap, Relation NewHeap,
+ zstid oldtid, ZSUndoRecPtr old_undoptr,
+ ZSUndoRecPtr recent_oldest_undo,
+ TransactionId OldestXmin)
+{
+ TransactionId this_xmin;
+ CommandId this_cmin;
+ TransactionId this_xmax;
+ CommandId this_cmax;
+ bool this_changedPart;
+ ZSUndoRecPtr undo_ptr;
+ ZSUndoRec *undorec;
+
+ /*
+ * Follow the chain of UNDO records for this tuple, to find the
+ * transaction that originally inserted the row (xmin/cmin), and
+ * the transaction that deleted or updated it away, if any (xmax/cmax)
+ */
+ this_xmin = FrozenTransactionId;
+ this_cmin = InvalidCommandId;
+ this_xmax = InvalidTransactionId;
+ this_cmax = InvalidCommandId;
+
+ undo_ptr = old_undoptr;
+ for (;;)
+ {
+ if (undo_ptr.counter < recent_oldest_undo.counter)
+ {
+ /* This tuple version is visible to everyone. */
+ break;
+ }
+
+ /* Fetch the next UNDO record. */
+ undorec = zsundo_fetch_record(OldHeap, undo_ptr);
+
+ if (undorec->type == ZSUNDO_TYPE_INSERT)
+ {
+ if (!TransactionIdIsCurrentTransactionId(undorec->xid) &&
+ !TransactionIdIsInProgress(undorec->xid) &&
+ !TransactionIdDidCommit(undorec->xid))
+ {
+ /*
+ * inserter aborted or crashed. This row is not visible to
+ * anyone. Including any later tuple versions we might have
+ * seen.
+ */
+ this_xmin = InvalidTransactionId;
+ break;
+ }
+ else
+ {
+ /* Inserter committed. */
+ this_xmin = undorec->xid;
+ this_cmin = undorec->cid;
+
+ /* we know everything there is to know about this tuple version. */
+ break;
+ }
+ }
+ else if (undorec->type == ZSUNDO_TYPE_TUPLE_LOCK)
+ {
+ /* Ignore tuple locks for now.
+ *
+ * FIXME: we should propagate them to the new copy of the table
+ */
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ else if (undorec->type == ZSUNDO_TYPE_DELETE ||
+ undorec->type == ZSUNDO_TYPE_UPDATE)
+ {
+ /* Row was deleted (or updated away). */
+ if (!TransactionIdIsCurrentTransactionId(undorec->xid) &&
+ !TransactionIdIsInProgress(undorec->xid) &&
+ !TransactionIdDidCommit(undorec->xid))
+ {
+ /* deleter aborted or crashed. The previous record should
+ * be an insertion (possibly with some tuple-locking in
+ * between). We'll remember the tuple when we see the
+ * insertion.
+ */
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ else
+ {
+ /* deleter committed or is still in progress. */
+ if (TransactionIdPrecedes(undorec->xid, OldestXmin))
+ {
+ /* the deletion is visible to everyone. We can skip the row completely. */
+ this_xmin = InvalidTransactionId;
+ break;
+ }
+ else
+ {
+ /* deleter committed or is in progress. Remember that it was
+ * deleted by this XID.
+ */
+ this_xmax = undorec->xid;
+ this_cmax = undorec->cid;
+ if (undorec->type == ZSUNDO_TYPE_DELETE)
+ this_changedPart = ((ZSUndoRec_Delete *) undorec)->changedPart;
+ else
+ this_changedPart = false;
+
+ /* follow the UNDO chain to find information about the inserting
+ * transaction (xmin/cmin)
+ */
+ undo_ptr = undorec->prevundorec;
+ continue;
+ }
+ }
+ }
+ }
+
+ /*
+ * We now know the visibility of this tuple. Re-create it in the new table.
+ */
+ if (this_xmin != InvalidTransactionId)
+ {
+ /* Insert the first version of the row. */
+ zstid newtid;
+
+ /* First, insert the tuple. */
+ newtid = zsbt_tid_multi_insert(NewHeap,
+ 1,
+ this_xmin,
+ this_cmin,
+ INVALID_SPECULATIVE_TOKEN,
+ InvalidUndoPtr);
+
+ /* And if the tuple was deleted/updated away, do the same in the new table. */
+ if (this_xmax != InvalidTransactionId)
+ {
+ TM_Result delete_result;
+ bool this_xact_has_lock;
+
+ /* tuple was deleted. */
+ delete_result = zsbt_tid_delete(NewHeap, newtid,
+ this_xmax, this_cmax,
+ NULL, NULL, false, NULL, this_changedPart,
+ &this_xact_has_lock);
+ if (delete_result != TM_Ok)
+ elog(ERROR, "tuple deletion failed during table rewrite");
+ }
+ return newtid;
+ }
+ else
+ return InvalidZSTid;
+}
+
+
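+/*
+ * CLUSTER / VACUUM FULL: copy the table's contents, with visibility
+ * information, from OldHeap to NewHeap, optionally in index order.
+ */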
+static void
+zedstoream_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
+ Relation OldIndex, bool use_sort,
+ TransactionId OldestXmin,
+ TransactionId *xid_cutoff,
+ MultiXactId *multi_cutoff,
+ double *num_tuples,
+ double *tups_vacuumed,
+ double *tups_recently_dead)
+{
+ TupleDesc olddesc;
+ ZSTidTreeScan tid_scan;
+ ZSAttrTreeScan *attr_scans;
+ ZSUndoRecPtr recent_oldest_undo = zsundo_get_oldest_undo_ptr(OldHeap, true);
+ int attno;
+ IndexScanDesc indexScan;
+ Datum *newdatums;
+ bool *newisnulls;
+
+ zsbt_tuplebuffer_flush(OldHeap);
+
+	olddesc = RelationGetDescr(OldHeap);
+
+ attr_scans = palloc(olddesc->natts * sizeof(ZSAttrTreeScan));
+
+ /*
+ * Scan the old table. We ignore any old updated-away tuple versions,
+ * and only stop at the latest tuple version of each row. At the latest
+ * version, follow the update chain to get all the old versions of that
+ * row, too. That way, the whole update chain is processed in one go,
+ * and can be reproduced in the new table.
+ */
+ zsbt_tid_begin_scan(OldHeap, MinZSTid, MaxPlusOneZSTid,
+ SnapshotAny, &tid_scan);
+
+ for (attno = 1; attno <= olddesc->natts; attno++)
+ {
+ if (TupleDescAttr(olddesc, attno - 1)->attisdropped)
+ continue;
+
+ zsbt_attr_begin_scan(OldHeap,
+ olddesc,
+ attno,
+ &attr_scans[attno - 1]);
+ }
+
+ newdatums = palloc(olddesc->natts * sizeof(Datum));
+ newisnulls = palloc(olddesc->natts * sizeof(bool));
+
+ /* TODO: sorting not implemented yet. (it would require materializing each
+ * row into a HeapTuple or something like that, which could carry the xmin/xmax
+ * information through the sorter).
+ */
+ use_sort = false;
+
+ /*
+ * Prepare to scan the OldHeap. To ensure we see recently-dead tuples
+ * that still need to be copied, we scan with SnapshotAny and use
+ * HeapTupleSatisfiesVacuum for the visibility test.
+ */
+ if (OldIndex != NULL && !use_sort)
+ {
+ const int ci_index[] = {
+ PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_INDEX_RELID
+ };
+ int64 ci_val[2];
+
+ /* Set phase and OIDOldIndex to columns */
+ ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[1] = RelationGetRelid(OldIndex);
+ pgstat_progress_update_multi_param(2, ci_index, ci_val);
+
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+ index_rescan(indexScan, NULL, 0, NULL, 0);
+ }
+ else
+ {
+ /* In scan-and-sort mode and also VACUUM FULL, set phase */
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+
+ indexScan = NULL;
+
+ /* Set total heap blocks */
+ /* TODO */
+#if 0
+ pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ heapScan->rs_nblocks);
+#endif
+ }
+
+ for (;;)
+ {
+ zstid old_tid;
+ ZSUndoRecPtr old_undoptr;
+ zstid new_tid;
+ zstid fetchtid = InvalidZSTid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ if (indexScan != NULL)
+ {
+ ItemPointer itemptr;
+
+ itemptr = index_getnext_tid(indexScan, ForwardScanDirection);
+ if (!itemptr)
+ break;
+
+ /* Since we used no scan keys, should never need to recheck */
+ if (indexScan->xs_recheck)
+ elog(ERROR, "CLUSTER does not support lossy index conditions");
+
+ fetchtid = ZSTidFromItemPointer(*itemptr);
+ zsbt_tid_reset_scan(&tid_scan, MinZSTid, MaxPlusOneZSTid, fetchtid - 1);
+ old_tid = zsbt_tid_scan_next(&tid_scan, ForwardScanDirection);
+ if (old_tid == InvalidZSTid)
+ continue;
+ }
+ else
+ {
+ old_tid = zsbt_tid_scan_next(&tid_scan, ForwardScanDirection);
+ if (old_tid == InvalidZSTid)
+ break;
+ fetchtid = old_tid;
+ }
+ if (old_tid != fetchtid)
+ continue;
+
+ old_undoptr = tid_scan.array_iter.visi_infos[ZSTidScanCurUndoSlotNo(&tid_scan)].undoptr;
+
+ new_tid = zs_cluster_process_tuple(OldHeap, NewHeap,
+ old_tid, old_undoptr,
+ recent_oldest_undo,
+ OldestXmin);
+ if (new_tid != InvalidZSTid)
+ {
+ /* Fetch the attributes and write them out */
+ for (attno = 1; attno <= olddesc->natts; attno++)
+ {
+ Form_pg_attribute att = TupleDescAttr(olddesc, attno - 1);
+ Datum datum;
+ bool isnull;
+
+ if (att->attisdropped)
+ {
+ datum = (Datum) 0;
+ isnull = true;
+ }
+ else
+ {
+ if (!zsbt_attr_fetch(&attr_scans[attno - 1], &datum, &isnull, old_tid))
+ zsbt_fill_missing_attribute_value(olddesc, attno, &datum, &isnull);
+ }
+
+ /* flatten and re-toast any ZS-TOASTed values */
+ if (!isnull && att->attlen == -1)
+ {
+ if (VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(OldHeap, attno, old_tid, datum);
+ }
+ }
+ newdatums[attno - 1] = datum;
+ newisnulls[attno - 1] = isnull;
+ }
+
+ zsbt_tuplebuffer_spool_tuple(NewHeap, new_tid, newdatums, newisnulls);
+ }
+ }
+
+ if (indexScan != NULL)
+ index_endscan(indexScan);
+
+ zsbt_tid_end_scan(&tid_scan);
+ for (attno = 1; attno <= olddesc->natts; attno++)
+ {
+ if (TupleDescAttr(olddesc, attno - 1)->attisdropped)
+ continue;
+
+ zsbt_attr_end_scan(&attr_scans[attno - 1]);
+ }
+
+ zsbt_tuplebuffer_flush(NewHeap);
+}
+
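+/*
+ * Begin an ANALYZE sample scan. If ANALYZE was given an explicit column
+ * list, only those columns are projected; otherwise all columns are
+ * scanned. The number of logical blocks is derived from the last TID.
+ */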
+static void
+zedstoream_scan_analyze_beginscan(Relation onerel, AnalyzeSampleContext *context)
+{
+ zstid tid;
+ List *va_cols = context->anl_cols;
+ Bitmapset *project_columns = NULL;
+
+ /* zedstore can sample rows from the specified columns only */
+ if (!va_cols)
+ context->scan = table_beginscan_analyze(onerel);
+ else
+ {
+ ListCell *le;
+
+ foreach(le, va_cols)
+ {
+ char *col = strVal(lfirst(le));
+
+ project_columns =
+ bms_add_member(project_columns, attnameAttNum(onerel, col, false));
+ }
+
+ context->scan =
+ zedstoream_beginscan_with_column_projection(onerel, NULL, 0, NULL,
+ NULL, SO_TYPE_ANALYZE,
+ project_columns);
+ }
+
+ /* zedstore uses logical block numbers to acquire sample rows */
+ tid = zsbt_get_last_tid(onerel);
+ context->totalblocks = ZSTidGetBlockNumber(tid) + 1;
+}
+
+/*
+ * Get next logical block.
+ */
+static bool
+zedstoream_scan_analyze_next_block(BlockNumber blockno,
+ AnalyzeSampleContext *context)
+{
+ return zs_blkscan_next_block(context->scan, blockno, NULL, -1, false);
+}
+
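+/*
+ * Return the next sample tuple from the current logical block. When
+ * analyzing all columns, also fill the disk-size sample slot with the
+ * average element size of each projected attribute.
+ */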
+static bool
+zedstoream_scan_analyze_next_tuple(TransactionId OldestXmin, AnalyzeSampleContext *context)
+{
+ int i;
+ bool result;
+ AttrNumber attno;
+ TableScanDesc scan = context->scan;
+ ZedStoreDesc sscan = (ZedStoreDesc) scan;
+ ZSAttrTreeScan *attr_scan;
+ TupleTableSlot *slot = AnalyzeGetSampleSlot(context, scan->rs_rd, ANALYZE_SAMPLE_DATA);
+
+ result = zs_blkscan_next_tuple(scan, slot);
+
+ if (result)
+ {
+ /* provide extra disk-size info when analyzing all columns */
+ if (!context->anl_cols)
+ {
+ slot = AnalyzeGetSampleSlot(context, scan->rs_rd, ANALYZE_SAMPLE_DISKSIZE);
+
+ ExecClearTuple(slot);
+
+ for (i = 0; i < scan->rs_rd->rd_att->natts; i++)
+ slot->tts_isnull[i] = true;
+
+ for (i = 1; i < sscan->proj_data.num_proj_atts; i++)
+ {
+ attr_scan = &sscan->proj_data.attr_scans[i - 1];
+ attno = sscan->proj_data.proj_atts[i];
+
+ slot->tts_values[attno - 1] =
+ Float8GetDatum(attr_scan->decoder.avg_elements_size);
+ slot->tts_isnull[attno - 1] = false;
+ }
+
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+ }
+
+ context->liverows++;
+ }
+
+ return result;
+}
+
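+/*
+ * Record the current sample tuple (and, when present, its disk-size
+ * sample) at position 'pos' in the sample, optionally replacing an
+ * existing row.
+ */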
+static void
+zedstoream_scan_analyze_sample_tuple(int pos, bool replace, AnalyzeSampleContext *context)
+{
+ TupleTableSlot *slot;
+ Relation onerel = context->scan->rs_rd;
+
+ slot = AnalyzeGetSampleSlot(context, onerel, ANALYZE_SAMPLE_DATA);
+ AnalyzeRecordSampleRow(context, slot, NULL, ANALYZE_SAMPLE_DATA, pos, replace, false);
+
+ /* also record the disk-size sample when analyzing all columns */
+ if (!context->anl_cols)
+ {
+ slot = AnalyzeGetSampleSlot(context, onerel, ANALYZE_SAMPLE_DISKSIZE);
+ AnalyzeRecordSampleRow(context, slot, NULL, ANALYZE_SAMPLE_DISKSIZE, pos, replace, false);
+ }
+}
+
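+/*
+ * Finish an ANALYZE sample scan.
+ */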
+static void
+zedstoream_scan_analyze_endscan(AnalyzeSampleContext *context)
+{
+ table_endscan(context->scan);
+}
+
+/* ------------------------------------------------------------------------
+ * Miscellaneous callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+/*
+ * FIXME: Implement this function properly for zedstore. The return value
+ * is, for example, used by ANALYZE to decide which blocks to sample.
+ */
+static uint64
+zedstoream_relation_size(Relation rel, ForkNumber forkNumber)
+{
+ uint64 nblocks = 0;
+
+ /* Open it at the smgr level if not already done */
+ RelationOpenSmgr(rel);
+ nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+ return nblocks * BLCKSZ;
+}
+
+/*
+ * Zedstore stores TOAST chunks within the table file itself. Hence, it
+ * doesn't need a separate toast table to be created. Returning false from
+ * this callback avoids creation of a toast table.
+ */
+static bool
+zedstoream_relation_needs_toast_table(Relation rel)
+{
+ return false;
+}
+
+/* ------------------------------------------------------------------------
+ * Planner related callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+/*
+ * Currently this is an exact duplicate of heapam_estimate_rel_size().
+ * TODO: tune it based on zedstore storage.
+ */
+static void
+zedstoream_relation_estimate_size(Relation rel, int32 *attr_widths,
+ BlockNumber *pages, double *tuples,
+ double *allvisfrac)
+{
+ BlockNumber curpages;
+ BlockNumber relpages;
+ double reltuples;
+ BlockNumber relallvisible;
+ double density;
+
+ /* it has storage, ok to call the smgr */
+ curpages = RelationGetNumberOfBlocks(rel);
+
+ /* coerce values in pg_class to more desirable types */
+ relpages = (BlockNumber) rel->rd_rel->relpages;
+ reltuples = (double) rel->rd_rel->reltuples;
+ relallvisible = (BlockNumber) rel->rd_rel->relallvisible;
+
+ /*
+ * HACK: if the relation has never yet been vacuumed, use a minimum size
+ * estimate of 10 pages. The idea here is to avoid assuming a
+ * newly-created table is really small, even if it currently is, because
+ * that may not be true once some data gets loaded into it. Once a vacuum
+ * or analyze cycle has been done on it, it's more reasonable to believe
+ * the size is somewhat stable.
+ *
+ * (Note that this is only an issue if the plan gets cached and used again
+ * after the table has been filled. What we're trying to avoid is using a
+ * nestloop-type plan on a table that has grown substantially since the
+ * plan was made. Normally, autovacuum/autoanalyze will occur once enough
+ * inserts have happened and cause cached-plan invalidation; but that
+ * doesn't happen instantaneously, and it won't happen at all for cases
+ * such as temporary tables.)
+ *
+ * We approximate "never vacuumed" by "has relpages = 0", which means this
+ * will also fire on genuinely empty relations. Not great, but
+ * fortunately that's a seldom-seen case in the real world, and it
+ * shouldn't degrade the quality of the plan too much anyway to err in
+ * this direction.
+ *
+ * If the table has inheritance children, we don't apply this heuristic.
+ * Totally empty parent tables are quite common, so we should be willing
+ * to believe that they are empty.
+ */
+ if (curpages < 10 &&
+ relpages == 0 &&
+ !rel->rd_rel->relhassubclass)
+ curpages = 10;
+
+ /* report estimated # pages */
+ *pages = curpages;
+ /* quick exit if rel is clearly empty */
+ if (curpages == 0)
+ {
+ *tuples = 0;
+ *allvisfrac = 0;
+ return;
+ }
+
+ /* estimate number of tuples from previous tuple density */
+ if (relpages > 0)
+ density = reltuples / (double) relpages;
+ else
+ {
+ /*
+ * When we have no data because the relation was truncated, estimate
+ * tuple width from attribute datatypes. We assume here that the
+ * pages are completely full, which is OK for tables (since they've
+ * presumably not been VACUUMed yet) but is probably an overestimate
+ * for indexes. Fortunately get_relation_info() can clamp the
+ * overestimate to the parent table's size.
+ *
+ * Note: this code intentionally disregards alignment considerations,
+ * because (a) that would be gilding the lily considering how crude
+ * the estimate is, and (b) it creates platform dependencies in the
+ * default plans which are kind of a headache for regression testing.
+ */
+ int32 tuple_width;
+
+ tuple_width = get_rel_data_width(rel, attr_widths);
+ tuple_width += MAXALIGN(SizeofHeapTupleHeader);
+ tuple_width += sizeof(ItemIdData);
+ /* note: integer division is intentional here */
+ density = (BLCKSZ - SizeOfPageHeaderData) / tuple_width;
+ }
+ *tuples = rint(density * (double) curpages);
+
+ /*
+ * We use relallvisible as-is, rather than scaling it up like we do for
+ * the pages and tuples counts, on the theory that any pages added since
+ * the last VACUUM are most likely not marked all-visible. But costsize.c
+ * wants it converted to a fraction.
+ */
+ if (relallvisible == 0 || curpages <= 0)
+ *allvisfrac = 0;
+ else if ((double) relallvisible >= curpages)
+ *allvisfrac = 1;
+ else
+ *allvisfrac = (double) relallvisible / curpages;
+}
+
+/* ------------------------------------------------------------------------
+ * Executor related callbacks for the zedstore AM
+ * ------------------------------------------------------------------------
+ */
+
+/*
+ * zs_blkscan_next_block() and zs_blkscan_next_tuple() are used to implement
+ * bitmap scans, and sample scans. The tableam interface for those are similar
+ * enough that they can share most code.
+ */
+static bool
+zs_blkscan_next_block(TableScanDesc sscan,
+ BlockNumber blkno, OffsetNumber *offsets, int noffsets,
+ bool predicatelocks)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ ZedStoreProjectData *scan_proj = &scan->proj_data;
+ int ntuples;
+ zstid tid;
+ int idx;
+
+ if (!scan->started)
+ {
+ Relation rel = scan->rs_scan.rs_rd;
+ TupleDesc reldesc = RelationGetDescr(rel);
+ MemoryContext oldcontext;
+
+ zs_initialize_proj_attributes_extended(scan, reldesc);
+
+ oldcontext = MemoryContextSwitchTo(scan_proj->context);
+ zsbt_tid_begin_scan(rel,
+ ZSTidFromBlkOff(blkno, 1),
+ ZSTidFromBlkOff(blkno + 1, 1),
+ scan->rs_scan.rs_snapshot,
+ &scan_proj->tid_scan);
+ scan_proj->tid_scan.serializable = true;
+ for (int i = 1; i < scan_proj->num_proj_atts; i++)
+ {
+ int attno = scan_proj->proj_atts[i];
+
+ zsbt_attr_begin_scan(rel, reldesc, attno,
+ &scan_proj->attr_scans[i - 1]);
+ }
+ MemoryContextSwitchTo(oldcontext);
+ scan->started = true;
+ }
+ else
+ {
+ zsbt_tid_reset_scan(&scan_proj->tid_scan,
+ ZSTidFromBlkOff(blkno, 1),
+ ZSTidFromBlkOff(blkno + 1, 1),
+ ZSTidFromBlkOff(blkno, 1) - 1);
+ }
+
+ /*
+ * Our strategy for a bitmap scan is to scan the TID tree in
+ * next_block() function, starting at the given logical block number, and
+ * store all the matching TIDs in the scan struct. next_tuple() will
+ * fetch the attribute data from the attribute trees.
+ *
+ * TODO: it might be good to pass the next expected TID down to
+ * zsbt_tid_scan_next, so that it could skip over to the next match more
+ * efficiently.
+ */
+ ntuples = 0;
+ idx = 0;
+ while ((tid = zsbt_tid_scan_next(&scan_proj->tid_scan, ForwardScanDirection)) != InvalidZSTid)
+ {
+ OffsetNumber off = ZSTidGetOffsetNumber(tid);
+ ItemPointerData itemptr;
+
+ Assert(ZSTidGetBlockNumber(tid) == blkno);
+
+ ItemPointerSet(&itemptr, blkno, off);
+
+ if (noffsets != -1)
+ {
+ while (idx < noffsets && off > offsets[idx])
+ {
+ /*
+ * Acquire predicate lock on all tuples that we scan, even those that are
+ * not visible to the snapshot.
+ */
+ if (predicatelocks)
+ /*
+ * We pass in InvalidTransactionId as we are sure that the current
+ * transaction hasn't locked itemptr.
+ */
+ PredicateLockTID(scan->rs_scan.rs_rd, &itemptr, scan->rs_scan.rs_snapshot, InvalidTransactionId);
+
+ idx++;
+ }
+
+ if (idx == noffsets)
+ break;
+
+ if (off < offsets[idx])
+ continue;
+ }
+
+ /*
+ * FIXME: heapam acquires the predicate lock first, and then
+ * calls CheckForSerializableConflictOut(). We do it in the
+ * opposite order, because the CheckForSerializableConflictOut()
+ * call is already done in zsbt_get_last_tid(). Does it matter?
+ * I'm not sure.
+ */
+ if (predicatelocks)
+ /*
+ * We pass in InvalidTransactionId as we are sure that the current
+ * transaction hasn't locked itemptr.
+ */
+ PredicateLockTID(scan->rs_scan.rs_rd, &itemptr, scan->rs_scan.rs_snapshot, InvalidTransactionId);
+
+ scan->bmscan_tids[ntuples] = tid;
+ ntuples++;
+ }
+
+ scan->bmscan_nexttuple = 0;
+ scan->bmscan_ntuples = ntuples;
+
+ return ntuples > 0;
+}
+
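+/*
+ * Fetch the attribute data for the next TID collected by
+ * zs_blkscan_next_block(), and store it in the given slot.
+ */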
+static bool
+zs_blkscan_next_tuple(TableScanDesc sscan, TupleTableSlot *slot)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ zstid tid;
+
+ if (scan->bmscan_nexttuple >= scan->bmscan_ntuples)
+ return false;
+
+ /*
+ * Initialize the slot.
+ *
+ * We initialize all columns to NULL. The values for columns that are projected
+ * will be set to the actual values below, but it's important that non-projected
+ * columns are NULL.
+ */
+ ExecClearTuple(slot);
+ for (int i = 0; i < sscan->rs_rd->rd_att->natts; i++)
+ slot->tts_isnull[i] = true;
+
+ /*
+ * The projection attributes were created based on the relation's tuple
+ * descriptor, so it had better match the TupleTableSlot's.
+ */
+ Assert((scan->proj_data.num_proj_atts - 1) <= slot->tts_tupleDescriptor->natts);
+ tid = scan->bmscan_tids[scan->bmscan_nexttuple];
+ for (int i = 1; i < scan->proj_data.num_proj_atts; i++)
+ {
+ ZSAttrTreeScan *attr_scan = &scan->proj_data.attr_scans[i - 1];
+ AttrNumber attno = scan->proj_data.proj_atts[i];
+ Form_pg_attribute att = TupleDescAttr(slot->tts_tupleDescriptor, attno - 1);
+ Datum datum;
+ bool isnull;
+
+ if (!zsbt_attr_fetch(attr_scan, &datum, &isnull, tid))
+ zsbt_fill_missing_attribute_value(slot->tts_tupleDescriptor, attno, &datum, &isnull);
+
+ /*
+ * flatten any ZS-TOASTed values, because the rest of the system
+ * doesn't know how to deal with them.
+ */
+ if (!isnull && att->attlen == -1 &&
+ VARATT_IS_EXTERNAL(datum) && VARTAG_EXTERNAL(datum) == VARTAG_ZEDSTORE)
+ {
+ datum = zedstore_toast_flatten(scan->rs_scan.rs_rd, attno, tid, datum);
+ }
+
+ Assert(attno > 0);
+ slot->tts_values[attno - 1] = datum;
+ slot->tts_isnull[attno - 1] = isnull;
+ }
+
+ /* FIXME: Don't we need to set visi_info, like in a seqscan? */
+ slot->tts_tableOid = RelationGetRelid(scan->rs_scan.rs_rd);
+ slot->tts_tid = ItemPointerFromZSTid(tid);
+ slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+ scan->bmscan_nexttuple++;
+
+ pgstat_count_heap_fetch(scan->rs_scan.rs_rd);
+
+ return true;
+}
+
+
+
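+/*
+ * Bitmap heap scan callbacks. These are thin wrappers around
+ * zs_blkscan_next_block() and zs_blkscan_next_tuple().
+ */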
+static bool
+zedstoream_scan_bitmap_next_block(TableScanDesc sscan,
+ TBMIterateResult *tbmres)
+{
+ return zs_blkscan_next_block(sscan, tbmres->blockno, tbmres->offsets, tbmres->ntuples, true);
+}
+
+static bool
+zedstoream_scan_bitmap_next_tuple(TableScanDesc sscan,
+ TBMIterateResult *tbmres,
+ TupleTableSlot *slot)
+{
+ return zs_blkscan_next_tuple(sscan, slot);
+}
+
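+/*
+ * Select the next logical block for a TABLESAMPLE scan, fetch all TIDs
+ * on it, and filter them down to the ones the sampling method tells us
+ * to keep.
+ */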
+static bool
+zedstoream_scan_sample_next_block(TableScanDesc sscan, SampleScanState *scanstate)
+{
+ ZedStoreDesc scan = (ZedStoreDesc) sscan;
+ TsmRoutine *tsm = scanstate->tsmroutine;
+ BlockNumber blockno;
+
+ if (scan->next_tid_to_scan == InvalidZSTid)
+ {
+ /* initialize next tid with the first tid */
+ scan->next_tid_to_scan = zsbt_get_first_tid(scan->rs_scan.rs_rd);
+ }
+
+ if (scan->max_tid_to_scan == InvalidZSTid)
+ {
+ /*
+ * Get the max TID once and store it; it is used to calculate the number
+ * of blocks to scan for either SYSTEM or BERNOULLI sampling.
+ */
+ scan->max_tid_to_scan = zsbt_get_last_tid(scan->rs_scan.rs_rd);
+ }
+
+ if (tsm->NextSampleBlock)
+ {
+ /* Adding one below to convert block number to number of blocks. */
+ blockno = tsm->NextSampleBlock(scanstate,
+ ZSTidGetBlockNumber(scan->max_tid_to_scan) + 1);
+
+ if (!BlockNumberIsValid(blockno))
+ return false;
+ }
+ else
+ {
+ /* scanning table sequentially */
+ if (scan->next_tid_to_scan > scan->max_tid_to_scan)
+ return false;
+
+ blockno = ZSTidGetBlockNumber(scan->next_tid_to_scan);
+ /* move on to next block of tids for next iteration of scan */
+ scan->next_tid_to_scan = ZSTidFromBlkOff(blockno + 1, 1);
+ }
+
+ Assert(BlockNumberIsValid(blockno));
+
+ /*
+ * Fetch all TIDs on the page.
+ */
+ if (!zs_blkscan_next_block(sscan, blockno, NULL, -1, false))
+ return false;
+
+ /*
+ * Filter the list of TIDs, keeping only the TIDs that the sampling method
+ * tells us to keep.
+ */
+ if (scan->bmscan_ntuples > 0)
+ {
+ zstid lasttid_for_block = scan->bmscan_tids[scan->bmscan_ntuples - 1];
+ OffsetNumber maxoffset = ZSTidGetOffsetNumber(lasttid_for_block);
+ OffsetNumber nextoffset;
+ int outtuples;
+ int idx;
+
+ /* ask the tablesample method which tuples to check on this page. */
+ nextoffset = tsm->NextSampleTuple(scanstate, blockno, maxoffset);
+
+ outtuples = 0;
+ idx = 0;
+ while (idx < scan->bmscan_ntuples && OffsetNumberIsValid(nextoffset))
+ {
+ zstid thistid = scan->bmscan_tids[idx];
+ OffsetNumber thisoffset = ZSTidGetOffsetNumber(thistid);
+
+ if (thisoffset > nextoffset)
+ nextoffset = tsm->NextSampleTuple(scanstate, blockno, maxoffset);
+ else
+ {
+ if (thisoffset == nextoffset)
+ scan->bmscan_tids[outtuples++] = thistid;
+ idx++;
+ }
+ }
+ scan->bmscan_ntuples = outtuples;
+
+ /*
+ * Must fast forward the sampler through all offsets on this page,
+ * until it returns InvalidOffsetNumber. Otherwise, the next
+ * call will continue to return offsets for this block.
+ *
+ * FIXME: It seems bogus that the sampler isn't reset, when you call
+ * NextSampleBlock(). Perhaps we should fix this in the TSM API?
+ */
+ while (OffsetNumberIsValid(nextoffset))
+ nextoffset = tsm->NextSampleTuple(scanstate, blockno, maxoffset);
+ }
+
+ return scan->bmscan_ntuples > 0;
+}
+
+static bool
+zedstoream_scan_sample_next_tuple(TableScanDesc sscan, SampleScanState *scanstate,
+ TupleTableSlot *slot)
+{
+ /*
+ * We already filtered the rows in the next_block() function, so all TIDs
+ * in scan->bmscan_tids belong to the sample.
+ */
+ return zs_blkscan_next_tuple(sscan, slot);
+}
+
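+/*
+ * VACUUM: flush any buffered inserts, then vacuum the UNDO log using the
+ * oldest non-removable XID as the cutoff.
+ */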
+static void
+zedstoream_vacuum_rel(Relation onerel, VacuumParams *params,
+ BufferAccessStrategy bstrategy)
+{
+ zsbt_tuplebuffer_flush(onerel);
+ zsundo_vacuum(onerel, params, bstrategy,
+ GetOldestNonRemovableTransactionId(onerel));
+}
+
+static const TableAmRoutine zedstoream_methods = {
+ .type = T_TableAmRoutine,
+ .scans_leverage_column_projection = true,
+
+ .slot_callbacks = zedstoream_slot_callbacks,
+
+ .scan_begin = zedstoream_beginscan,
+ .scan_begin_with_column_projection = zedstoream_beginscan_with_column_projection,
+ .scan_end = zedstoream_endscan,
+ .scan_rescan = zedstoream_rescan,
+ .scan_getnextslot = zedstoream_getnextslot,
+
+ .parallelscan_estimate = zs_parallelscan_estimate,
+ .parallelscan_initialize = zs_parallelscan_initialize,
+ .parallelscan_reinitialize = zs_parallelscan_reinitialize,
+
+ .index_fetch_begin = zedstoream_begin_index_fetch,
+ .index_fetch_reset = zedstoream_reset_index_fetch,
+ .index_fetch_end = zedstoream_end_index_fetch,
+ .index_fetch_set_column_projection = zedstoream_fetch_set_column_projection,
+ .index_fetch_tuple = zedstoream_index_fetch_tuple,
+
+ .tuple_insert = zedstoream_insert,
+ .tuple_insert_speculative = zedstoream_insert_speculative,
+ .tuple_complete_speculative = zedstoream_complete_speculative,
+ .multi_insert = zedstoream_multi_insert,
+ .tuple_delete = zedstoream_delete,
+ .tuple_update = zedstoream_update,
+ .tuple_lock = zedstoream_lock_tuple,
+ .finish_bulk_insert = zedstoream_finish_bulk_insert,
+
+ .tuple_fetch_row_version = zedstoream_fetch_row_version,
+ .tuple_get_latest_tid = zedstoream_get_latest_tid,
+ .tuple_tid_valid = zedstoream_tuple_tid_valid,
+ .tuple_satisfies_snapshot = zedstoream_tuple_satisfies_snapshot,
+ .compute_xid_horizon_for_tuples = zedstoream_compute_xid_horizon_for_tuples,
+
+ .relation_set_new_filenode = zedstoream_relation_set_new_filenode,
+ .relation_nontransactional_truncate = zedstoream_relation_nontransactional_truncate,
+ .relation_copy_data = zedstoream_relation_copy_data,
+ .relation_copy_for_cluster = zedstoream_relation_copy_for_cluster,
+ .relation_vacuum = zedstoream_vacuum_rel,
+ .scan_analyze_beginscan = zedstoream_scan_analyze_beginscan,
+ .scan_analyze_next_block = zedstoream_scan_analyze_next_block,
+ .scan_analyze_next_tuple = zedstoream_scan_analyze_next_tuple,
+ .scan_analyze_sample_tuple = zedstoream_scan_analyze_sample_tuple,
+ .scan_analyze_endscan = zedstoream_scan_analyze_endscan,
+
+ .index_build_range_scan = zedstoream_index_build_range_scan,
+ .index_validate_scan = zedstoream_index_validate_scan,
+
+ .relation_size = zedstoream_relation_size,
+ .relation_needs_toast_table = zedstoream_relation_needs_toast_table,
+ .relation_estimate_size = zedstoream_relation_estimate_size,
+
+ .scan_bitmap_next_block = zedstoream_scan_bitmap_next_block,
+ .scan_bitmap_next_tuple = zedstoream_scan_bitmap_next_tuple,
+ .scan_sample_next_block = zedstoream_scan_sample_next_block,
+ .scan_sample_next_tuple = zedstoream_scan_sample_next_tuple
+};
+
+Datum
+zedstore_tableam_handler(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_POINTER(&zedstoream_methods);
+}
+
+
+/*
+ * Routines for dividing up the TID range for parallel seq scans
+ */
+
+/*
+ * Number of TIDs to assign to a parallel worker in a parallel Seq Scan in
+ * one batch.
+ *
+ * Not sure what the optimum would be. If the chunk size is too small,
+ * the parallel workers will waste effort, when two parallel workers both
+ * need to decompress and process the pages at the boundary. But on the
+ * other hand, if the chunk size is too large, we might not be able to make
+ * good use of all the parallel workers.
+ */
+#define ZS_PARALLEL_CHUNK_SIZE ((uint64) 0x100000)
+
+typedef struct ParallelZSScanDescData
+{
+ ParallelTableScanDescData base;
+
+ zstid pzs_endtid; /* last tid + 1 in relation at start of scan */
+ pg_atomic_uint64 pzs_allocatedtids; /* TID space allocated to workers so far. */
+} ParallelZSScanDescData;
+typedef struct ParallelZSScanDescData *ParallelZSScanDesc;
+
+static Size
+zs_parallelscan_estimate(Relation rel)
+{
+ return sizeof(ParallelZSScanDescData);
+}
+
+static Size
+zs_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
+{
+ ParallelZSScanDesc zpscan = (ParallelZSScanDesc) pscan;
+
+ zpscan->base.phs_relid = RelationGetRelid(rel);
+ zpscan->pzs_endtid = zsbt_get_last_tid(rel);
+ pg_atomic_init_u64(&zpscan->pzs_allocatedtids, 1);
+
+ return sizeof(ParallelZSScanDescData);
+}
+
+static void
+zs_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
+{
+ ParallelZSScanDesc bpscan = (ParallelZSScanDesc) pscan;
+
+ pg_atomic_write_u64(&bpscan->pzs_allocatedtids, 1);
+}
+
+/*
+ * Get the next TID range to scan.
+ *
+ * Returns true if there is more to scan, false otherwise.
+ *
+ * Even if there are no TIDs left to scan, another backend could have
+ * grabbed a range to scan and not yet finished looking at it, so it
+ * doesn't follow that the scan is done when the first backend gets a
+ * 'false' return.
+ */
+static bool
+zs_parallelscan_nextrange(Relation rel, ParallelZSScanDesc pzscan,
+ zstid *start, zstid *end)
+{
+ uint64 allocatedtids;
+
+ /*
+ * pzs_allocatedtids tracks how much of the TID space has been allocated
+ * to workers already. When pzs_allocatedtids >= pzs_endtid, all TIDs have
+ * been allocated.
+ *
+ * Because we use an atomic fetch-and-add to fetch the current value, the
+ * pzs_allocatedtids counter will exceed pzs_endtid, because workers will
+ * still increment the value when they try to allocate the next chunk even
+ * though all TIDs have been allocated already. The counter must be 64 bits
+ * wide because of that, to avoid wrapping around when pzs_endtid is close
+ * to 2^32. That's also one reason we hand out TIDs in chunks of
+ * ZS_PARALLEL_CHUNK_SIZE, even though zedstore isn't block-oriented.
+ */
+ allocatedtids = pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtids, ZS_PARALLEL_CHUNK_SIZE);
+ *start = (zstid) allocatedtids;
+ *end = (zstid) (allocatedtids + ZS_PARALLEL_CHUNK_SIZE);
+
+ return *start < pzscan->pzs_endtid;
+}
+
+/*
+ * Get the value for a row, when no value has been stored in the attribute tree.
+ *
+ * This is used after ALTER TABLE ADD COLUMN, when reading rows that were
+ * created before the column was added. Usually, missing values are
+ * implicitly NULLs, but a different value can also be specified with
+ * DEFAULT in the ALTER TABLE command.
+ */
+static void
+zsbt_fill_missing_attribute_value(TupleDesc tupleDesc, int attno, Datum *datum, bool *isnull)
+{
+ Form_pg_attribute attr = TupleDescAttr(tupleDesc, attno - 1);
+
+ *isnull = true;
+ *datum = (Datum) 0;
+
+ /* This means the catalog doesn't have a default value for this attribute */
+ if (!attr->atthasmissing)
+ return;
+
+ if (tupleDesc->constr &&
+ tupleDesc->constr->missing)
+ {
+ AttrMissing *attrmiss = NULL;
+ /*
+ * If there are missing values we want to put them into the
+ * tuple.
+ */
+ attrmiss = tupleDesc->constr->missing;
+
+ if (attrmiss[attno - 1].am_present)
+ {
+ *isnull = false;
+ if (attr->attbyval)
+ *datum = fetch_att(&attrmiss[attno - 1].am_value, attr->attbyval, attr->attlen);
+ else
+ *datum = zs_datumCopy(attrmiss[attno - 1].am_value, attr->attbyval, attr->attlen);
+ }
+ }
+}
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..296f1ad24d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -89,19 +89,19 @@ static void do_analyze_rel(Relation onerel,
VacuumParams *params, List *va_cols,
AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
bool inh, bool in_outer_xact, int elevel);
+static void compute_disk_stats(VacAttrStats **stats, int natts,
+ TupleDesc desc, HeapTuple *rows,
+ int numrows);
static void compute_index_stats(Relation onerel, double totalrows,
AnlIndexData *indexdata, int nindexes,
HeapTuple *rows, int numrows,
MemoryContext col_context);
static VacAttrStats *examine_attribute(Relation onerel, int attnum,
Node *index_expr);
-static int acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows);
-static int compare_rows(const void *a, const void *b);
-static int acquire_inherited_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows);
+static void acquire_sample_rows(Relation onerel, int elevel,
+ AnalyzeSampleContext *context);
+static void acquire_inherited_sample_rows(Relation onerel, int elevel,
+ AnalyzeSampleContext *context);
static void update_attstats(Oid relid, bool inh,
int natts, VacAttrStats **vacattrstats);
static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
@@ -312,6 +312,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ AnalyzeSampleContext *sample_context;
if (inh)
ereport(elevel,
@@ -496,6 +497,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (targrows < minrows)
targrows = minrows;
+ /* create context for acquiring sample rows */
+ sample_context = CreateAnalyzeSampleContext(onerel, va_cols, targrows,
+ vac_strategy);
+
/*
* Acquire the sample rows
*/
@@ -504,13 +509,13 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
inh ? PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS_INH :
PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS);
if (inh)
- numrows = acquire_inherited_sample_rows(onerel, elevel,
- rows, targrows,
- &totalrows, &totaldeadrows);
+ acquire_inherited_sample_rows(onerel, elevel, sample_context);
else
- numrows = (*acquirefunc) (onerel, elevel,
- rows, targrows,
- &totalrows, &totaldeadrows);
+ (*acquirefunc) (onerel, elevel, sample_context);
+
+ /* Get the sample statistics */
+ AnalyzeGetSampleStats(sample_context, &numrows, &totalrows, &totaldeadrows);
+ rows = AnalyzeGetSampleRows(sample_context, ANALYZE_SAMPLE_DATA, 0);
/*
* Compute the statistics. Temporary results during the calculations for
@@ -560,6 +565,19 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
MemoryContextResetAndDeleteChildren(col_context);
}
+ /* compute disksize ratio stats if any */
+ if (AnalyzeSampleIsValid(sample_context, ANALYZE_SAMPLE_DISKSIZE))
+ {
+ TupleTableSlot *slot =
+ AnalyzeGetSampleSlot(sample_context, onerel, ANALYZE_SAMPLE_DISKSIZE);
+ HeapTuple *rows =
+ AnalyzeGetSampleRows(sample_context, ANALYZE_SAMPLE_DISKSIZE, 0);
+
+ compute_disk_stats(vacattrstats, attr_cnt,
+ slot->tts_tupleDescriptor,
+ rows, numrows);
+ }
+
if (hasindex)
compute_index_stats(onerel, totalrows,
indexdata, nindexes,
@@ -693,6 +711,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
pg_rusage_show(&ru0))));
}
+ DestroyAnalyzeSampleContext(sample_context);
+
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -705,6 +725,41 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
anl_context = NULL;
}
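+/*
+ * Compute each attribute's share of the on-disk footprint (stadiskfrac)
+ * from the disk-size sample rows collected by the table AM.
+ */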
+static void
+compute_disk_stats(VacAttrStats **stats, int natts,
+ TupleDesc desc, HeapTuple *rows,
+ int numrows)
+{
+ int i, j;
+ float8 attr_size = 0;
+ float8 total = 0;
+ bool isNull;
+
+ for (i = 0; i < numrows; i++)
+ {
+ HeapTuple tup = rows[i];
+
+ for (j = 0; j < natts; j++)
+ {
+ VacAttrStats *vac = stats[j];
+ Datum dat = heap_getattr(tup, j + 1, desc, &isNull);
+
+ if (!isNull)
+ {
+ attr_size = DatumGetFloat8(dat);
+ vac->disksize += attr_size;
+ total += attr_size;
+ }
+ }
+ }
+
+ for (j = 0; j < natts; j++)
+ {
+ VacAttrStats *vac = stats[j];
+ vac->stadiskfrac = vac->disksize / total;
+ }
+}
+
/*
* Compute statistics about indexes of a relation
*/
@@ -1021,28 +1076,28 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
* block. The previous sampling method put too much credence in the row
* density near the start of the table.
*/
-static int
+static void
acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows)
+ AnalyzeSampleContext *context)
{
int numrows = 0; /* # rows now in reservoir */
+ int targrows = context->targrows;
double samplerows = 0; /* total # rows collected */
- double liverows = 0; /* # live rows seen */
- double deadrows = 0; /* # dead rows seen */
double rowstoskip = -1; /* -1 means not set yet */
+ double totalrows = 0;
+ double totaldeadrows = 0;
BlockNumber totalblocks;
TransactionId OldestXmin;
BlockSamplerData bs;
ReservoirStateData rstate;
- TupleTableSlot *slot;
- TableScanDesc scan;
BlockNumber nblocks;
BlockNumber blksdone = 0;
Assert(targrows > 0);
- totalblocks = RelationGetNumberOfBlocks(onerel);
+ table_scan_analyze_beginscan(onerel, context);
+
+ totalblocks = context->totalblocks;
/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
OldestXmin = GetOldestNonRemovableTransactionId(onerel);
@@ -1057,9 +1112,6 @@ acquire_sample_rows(Relation onerel, int elevel,
/* Prepare for sampling rows */
reservoir_init_selection_state(&rstate, targrows);
- scan = table_beginscan_analyze(onerel);
- slot = table_slot_create(onerel, NULL);
-
/* Outer loop over blocks to sample */
while (BlockSampler_HasMore(&bs))
{
@@ -1067,10 +1119,10 @@ acquire_sample_rows(Relation onerel, int elevel,
vacuum_delay_point();
- if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
+ if (!table_scan_analyze_next_block(targblock, context))
continue;
- while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+ while (table_scan_analyze_next_tuple(OldestXmin, context))
{
/*
* The first targrows sample rows are simply copied into the
@@ -1085,7 +1137,11 @@ acquire_sample_rows(Relation onerel, int elevel,
* we're done.
*/
if (numrows < targrows)
- rows[numrows++] = ExecCopySlotHeapTuple(slot);
+ {
+ table_scan_analyze_sample_tuple(numrows, false, context);
+
+ numrows++;
+ }
else
{
/*
@@ -1105,8 +1161,8 @@ acquire_sample_rows(Relation onerel, int elevel,
int k = (int) (targrows * sampler_random_fract(rstate.randstate));
Assert(k >= 0 && k < targrows);
- heap_freetuple(rows[k]);
- rows[k] = ExecCopySlotHeapTuple(slot);
+
+ table_scan_analyze_sample_tuple(k, true, context);
}
rowstoskip -= 1;
@@ -1119,19 +1175,7 @@ acquire_sample_rows(Relation onerel, int elevel,
++blksdone);
}
- ExecDropSingleTupleTableSlot(slot);
- table_endscan(scan);
-
- /*
- * If we didn't find as many tuples as we wanted then we're done. No sort
- * is needed, since they're already in order.
- *
- * Otherwise we need to sort the collected tuples by position
- * (itempointer). It's not worth worrying about corner cases where the
- * tuples are already sorted.
- */
- if (numrows == targrows)
- qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
+ table_scan_analyze_endscan(context);
/*
* Estimate total numbers of live and dead rows in relation, extrapolating
@@ -1142,13 +1186,13 @@ acquire_sample_rows(Relation onerel, int elevel,
*/
if (bs.m > 0)
{
- *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
- *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+ totalrows = floor((context->liverows / bs.m) * totalblocks + 0.5);
+ totaldeadrows = floor((context->deadrows / bs.m) * totalblocks + 0.5);
}
else
{
- *totalrows = 0.0;
- *totaldeadrows = 0.0;
+ totalrows = 0.0;
+ totaldeadrows = 0.0;
}
/*
@@ -1160,34 +1204,13 @@ acquire_sample_rows(Relation onerel, int elevel,
"%d rows in sample, %.0f estimated total rows",
RelationGetRelationName(onerel),
bs.m, totalblocks,
- liverows, deadrows,
- numrows, *totalrows)));
-
- return numrows;
-}
+ context->liverows,
+ context->deadrows,
+ numrows, totalrows)));
-/*
- * qsort comparator for sorting rows[] array
- */
-static int
-compare_rows(const void *a, const void *b)
-{
- HeapTuple ha = *(const HeapTuple *) a;
- HeapTuple hb = *(const HeapTuple *) b;
- BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
- OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
- BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
- OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
-
- if (ba < bb)
- return -1;
- if (ba > bb)
- return 1;
- if (oa < ob)
- return -1;
- if (oa > ob)
- return 1;
- return 0;
+ context->totalrows += totalrows;
+ context->totaldeadrows += totaldeadrows;
+ context->totalsampledrows += numrows;
}
@@ -1199,18 +1222,16 @@ compare_rows(const void *a, const void *b)
* We fail and return zero if there are no inheritance children, or if all
* children are foreign tables that don't support ANALYZE.
*/
-static int
+static void
acquire_inherited_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows)
+ AnalyzeSampleContext *context)
{
List *tableOIDs;
Relation *rels;
AcquireSampleRowsFunc *acquirefuncs;
double *relblocks;
double totalblocks;
- int numrows,
- nrels,
+ int nrels,
i;
ListCell *lc;
bool has_child;
@@ -1238,7 +1259,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
(errmsg("skipping analyze of \"%s.%s\" inheritance tree --- this inheritance tree contains no child tables",
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel))));
- return 0;
+ return;
}
/*
@@ -1336,7 +1357,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
(errmsg("skipping analyze of \"%s.%s\" inheritance tree --- this inheritance tree contains no analyzable child tables",
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel))));
- return 0;
+ return;
}
/*
@@ -1347,65 +1368,25 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
*/
pgstat_progress_update_param(PROGRESS_ANALYZE_CHILD_TABLES_TOTAL,
nrels);
- numrows = 0;
- *totalrows = 0;
- *totaldeadrows = 0;
for (i = 0; i < nrels; i++)
{
Relation childrel = rels[i];
AcquireSampleRowsFunc acquirefunc = acquirefuncs[i];
double childblocks = relblocks[i];
- pgstat_progress_update_param(PROGRESS_ANALYZE_CURRENT_CHILD_TABLE_RELID,
- RelationGetRelid(childrel));
-
if (childblocks > 0)
{
int childtargrows;
- childtargrows = (int) rint(targrows * childblocks / totalblocks);
+ childtargrows = (int) rint(context->totaltargrows * childblocks / totalblocks);
/* Make sure we don't overrun due to roundoff error */
- childtargrows = Min(childtargrows, targrows - numrows);
+ childtargrows = Min(childtargrows, context->totaltargrows - context->totalsampledrows);
if (childtargrows > 0)
{
- int childrows;
- double trows,
- tdrows;
+ InitAnalyzeSampleContextForChild(context, childrel, childtargrows);
/* Fetch a random sample of the child's rows */
- childrows = (*acquirefunc) (childrel, elevel,
- rows + numrows, childtargrows,
- &trows, &tdrows);
-
- /* We may need to convert from child's rowtype to parent's */
- if (childrows > 0 &&
- !equalTupleDescs(RelationGetDescr(childrel),
- RelationGetDescr(onerel)))
- {
- TupleConversionMap *map;
-
- map = convert_tuples_by_name(RelationGetDescr(childrel),
- RelationGetDescr(onerel));
- if (map != NULL)
- {
- int j;
-
- for (j = 0; j < childrows; j++)
- {
- HeapTuple newtup;
-
- newtup = execute_attr_map_tuple(rows[numrows + j], map);
- heap_freetuple(rows[numrows + j]);
- rows[numrows + j] = newtup;
- }
- free_conversion_map(map);
- }
- }
-
- /* And add to counts */
- numrows += childrows;
- *totalrows += trows;
- *totaldeadrows += tdrows;
+ (*acquirefunc) (childrel, elevel, context);
}
}
@@ -1417,8 +1398,6 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
pgstat_progress_update_param(PROGRESS_ANALYZE_CHILD_TABLES_DONE,
i + 1);
}
-
- return numrows;
}
@@ -1484,6 +1463,7 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
values[Anum_pg_statistic_staattnum - 1] = Int16GetDatum(stats->attr->attnum);
values[Anum_pg_statistic_stainherit - 1] = BoolGetDatum(inh);
values[Anum_pg_statistic_stanullfrac - 1] = Float4GetDatum(stats->stanullfrac);
+ values[Anum_pg_statistic_stadiskfrac - 1] = Float4GetDatum(stats->stadiskfrac);
values[Anum_pg_statistic_stawidth - 1] = Int32GetDatum(stats->stawidth);
values[Anum_pg_statistic_stadistinct - 1] = Float4GetDatum(stats->stadistinct);
i = Anum_pg_statistic_stakind1 - 1;
@@ -1516,7 +1496,7 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
/* XXX knows more than it should about type float4: */
arry = construct_array(numdatums, nnum,
FLOAT4OID,
- sizeof(float4), true, TYPALIGN_INT);
+ sizeof(float4), true, 'i');
values[i++] = PointerGetDatum(arry); /* stanumbersN */
}
else
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 115860a9d4..847c79d9f5 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2163,9 +2163,26 @@ CopyTo(CopyState cstate)
{
TupleTableSlot *slot;
TableScanDesc scandesc;
+ Bitmapset *proj = NULL;
- scandesc = table_beginscan(cstate->rel, GetActiveSnapshot(), 0, NULL);
slot = table_slot_create(cstate->rel, NULL);
+ if (table_scans_leverage_column_projection(cstate->rel))
+ {
+ foreach(cur, cstate->attnumlist)
+ {
+ int attnum = lfirst_int(cur);
+ Assert(attnum <= slot->tts_tupleDescriptor->natts);
+ proj = bms_add_member(proj, attnum);
+ }
+
+ scandesc = table_beginscan_with_column_projection(cstate->rel,
+ GetActiveSnapshot(),
+ 0, NULL, proj);
+ }
+ else
+ {
+ scandesc = table_beginscan(cstate->rel, GetActiveSnapshot(), 0, NULL);
+ }
processed = 0;
while (table_scan_getnextslot(scandesc, ForwardScanDirection, slot))
@@ -2182,6 +2199,8 @@ CopyTo(CopyState cstate)
ExecDropSingleTupleTableSlot(slot);
table_endscan(scandesc);
+ if (proj)
+ pfree(proj);
}
else
{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 092ac1646d..200e18e43d 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2468,6 +2468,10 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
if (newtuple != trigtuple)
heap_freetuple(newtuple);
}
+
+ /* Make sure the new slot is not dependent on the original tuple */
+ ExecMaterializeSlot(slot);
+
if (should_free)
heap_freetuple(trigtuple);
@@ -2753,6 +2757,10 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
newtuple = NULL;
}
}
+
+ /* Make sure the new slot is not dependent on the original tuple */
+ ExecMaterializeSlot(newslot);
+
if (should_free_trig)
heap_freetuple(trigtuple);
@@ -2959,7 +2967,8 @@ GetTupleForTrigger(EState *estate,
estate->es_output_cid,
lockmode, LockWaitBlock,
lockflags,
- &tmfd);
+ &tmfd,
+ bms_make_singleton(0));
switch (test)
{
@@ -3034,7 +3043,7 @@ GetTupleForTrigger(EState *estate,
* suffices.
*/
if (!table_tuple_fetch_row_version(relation, tid, SnapshotAny,
- oldslot))
+ oldslot, bms_make_singleton(0)))
elog(ERROR, "failed to fetch tuple for trigger");
}
@@ -3896,7 +3905,8 @@ AfterTriggerExecute(EState *estate,
if (!table_tuple_fetch_row_version(rel, &(event->ate_ctid1),
SnapshotAny,
- LocTriggerData.tg_trigslot))
+ LocTriggerData.tg_trigslot,
+ bms_make_singleton(0)))
elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
LocTriggerData.tg_trigtuple =
ExecFetchSlotHeapTuple(LocTriggerData.tg_trigslot, false, &should_free_trig);
@@ -3915,7 +3925,8 @@ AfterTriggerExecute(EState *estate,
if (!table_tuple_fetch_row_version(rel, &(event->ate_ctid2),
SnapshotAny,
- LocTriggerData.tg_newslot))
+ LocTriggerData.tg_newslot,
+ bms_make_singleton(0)))
elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
LocTriggerData.tg_newtuple =
ExecFetchSlotHeapTuple(LocTriggerData.tg_newslot, false, &should_free_new);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 7179f589f9..6d1a7c4538 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2551,7 +2551,8 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
/* ordinary table, fetch the tuple */
if (!table_tuple_fetch_row_version(erm->relation,
(ItemPointer) DatumGetPointer(datum),
- SnapshotAny, slot))
+ SnapshotAny, slot,
+ bms_make_singleton(0)))
elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
return true;
}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 86594bd056..8f67c442d5 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -908,6 +908,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
ExecInitQual((List *) clause, &mtstate->ps);
}
}
+
+ PopulateNeededColumnsForOnConflictUpdate(leaf_part_rri);
}
}
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 01d26881e7..964fa4ee3d 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -180,7 +180,8 @@ retry:
lockmode,
LockWaitBlock,
0 /* don't follow updates */ ,
- &tmfd);
+ &tmfd,
+ bms_make_singleton(0));
PopActiveSnapshot();
@@ -357,7 +358,8 @@ retry:
lockmode,
LockWaitBlock,
0 /* don't follow updates */ ,
- &tmfd);
+ &tmfd,
+ bms_make_singleton(0));
PopActiveSnapshot();
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 642805d90c..091254af35 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -20,7 +20,9 @@
#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
#include "utils/memutils.h"
+#include "utils/rel.h"
@@ -340,3 +342,109 @@ ExecScanReScan(ScanState *node)
}
}
}
+
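+/*
+ * Context used by neededColumnContextWalker to collect the set of columns
+ * referenced by an expression tree. 'n' is the number of columns in the
+ * relation, used only for bounds checking.
+ */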
+typedef struct neededColumnContext
+{
+ Bitmapset **mask;
+ int n;
+} neededColumnContext;
+
+static bool
+neededColumnContextWalker(Node *node, neededColumnContext *c)
+{
+ if (node == NULL || contains_whole_row_col(*c->mask))
+ return false;
+
+ if (IsA(node, Var))
+ {
+ Var *var = (Var *)node;
+
+ if (var->varattno > 0)
+ {
+ Assert(var->varattno <= c->n);
+ *(c->mask) = bms_add_member(*(c->mask), var->varattno);
+ }
+ else if (var->varattno == 0)
+ {
+ bms_free(*(c->mask));
+ *(c->mask) = bms_make_singleton(0);
+ }
+
+ return false;
+ }
+ return expression_tree_walker(node, neededColumnContextWalker, (void * )c);
+}
+
+/*
+ * n specifies the number of allowed entries in mask: we use
+ * it for bounds-checking in the walker above.
+ */
+void
+PopulateNeededColumnsForNode(Node *expr, int n, Bitmapset **scanCols)
+{
+ neededColumnContext c;
+
+ c.mask = scanCols;
+ c.n = n;
+
+ neededColumnContextWalker(expr, &c);
+}
+
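+/*
+ * Collect the columns needed by a scan node: those referenced by its
+ * targetlist and quals, plus any index or bitmap qual expressions.
+ */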
+Bitmapset *
+PopulateNeededColumnsForScan(ScanState *scanstate, int ncol)
+{
+ Bitmapset *result = NULL;
+ Plan *plan = scanstate->ps.plan;
+
+ PopulateNeededColumnsForNode((Node *) plan->targetlist, ncol, &result);
+ PopulateNeededColumnsForNode((Node *) plan->qual, ncol, &result);
+
+ if (IsA(plan, IndexScan))
+ {
+ PopulateNeededColumnsForNode((Node *) ((IndexScan *) plan)->indexqualorig, ncol, &result);
+ PopulateNeededColumnsForNode((Node *) ((IndexScan *) plan)->indexorderbyorig, ncol, &result);
+ }
+ else if (IsA(plan, BitmapHeapScan))
+ PopulateNeededColumnsForNode((Node *) ((BitmapHeapScan *) plan)->bitmapqualorig, ncol, &result);
+
+ return result;
+}
+
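+/*
+ * Collect the columns referenced by the quals of an EvalPlanQual plan.
+ */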
+Bitmapset *
+PopulateNeededColumnsForEPQ(EPQState *epqstate, int ncol)
+{
+ Bitmapset *epqCols = NULL;
+ Assert(epqstate && epqstate->plan);
+ PopulateNeededColumnsForNode((Node *) epqstate->plan->qual,
+ ncol,
+ &epqCols);
+ return epqCols;
+}
+
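+/*
+ * Collect the columns needed to evaluate the ON CONFLICT DO UPDATE SET
+ * projection, its WHERE clause, and any WITH CHECK OPTION expressions,
+ * and remember them in the ResultRelInfo for use when locking the
+ * conflicting tuple.
+ */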
+void
+PopulateNeededColumnsForOnConflictUpdate(ResultRelInfo *resultRelInfo)
+{
+ ExprState *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
+ ProjectionInfo *oc_ProjInfo = resultRelInfo->ri_onConflict->oc_ProjInfo;
+ Relation relation = resultRelInfo->ri_RelationDesc;
+ Bitmapset *proj_cols = NULL;
+ ListCell *lc;
+
+ if (onConflictSetWhere && onConflictSetWhere->expr)
+ PopulateNeededColumnsForNode((Node *) onConflictSetWhere->expr,
+ RelationGetDescr(relation)->natts,
+ &proj_cols);
+
+ if (oc_ProjInfo)
+ PopulateNeededColumnsForNode((Node *) oc_ProjInfo->pi_state.expr,
+ RelationGetDescr(relation)->natts,
+ &proj_cols);
+
+ foreach(lc, resultRelInfo->ri_WithCheckOptionExprs)
+ {
+ ExprState *wcoExpr = (ExprState *) lfirst(lc);
+ PopulateNeededColumnsForNode((Node *) wcoExpr->expr,
+ RelationGetDescr(relation)->natts,
+ &proj_cols);
+ }
+ resultRelInfo->ri_onConflict->proj_cols = proj_cols;
+}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index d0a96a38e0..69a00b7510 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -115,6 +115,16 @@ IndexNext(IndexScanState *node)
node->iss_NumScanKeys,
node->iss_NumOrderByKeys);
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ Bitmapset *proj = NULL;
+ Scan *planNode = (Scan *)node->ss.ps.plan;
+ int rti = planNode->scanrelid;
+ RangeTblEntry *rte = list_nth(estate->es_plannedstmt->rtable, rti - 1);
+ proj = rte->scanCols;
+ table_index_fetch_set_column_projection(scandesc->xs_heapfetch, proj);
+ }
+
node->iss_ScanDesc = scandesc;
/*
@@ -901,6 +911,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
{
IndexScanState *indexstate;
Relation currentRelation;
+ const TupleTableSlotOps *table_slot_ops;
LOCKMODE lockmode;
/*
@@ -927,11 +938,19 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ss_currentScanDesc = NULL; /* no heap scan here */
/*
- * get the scan type from the relation descriptor.
+ * Initialize the scan slot.
+ *
+ * With the reorder queue, we will sometimes use the reorderqueue's slot,
+ * which uses heap ops, and sometimes the table AM's slot directly. We
+ * have to set scanopsfixed to false, unless the table AM also uses heap
+ * ops.
*/
+ table_slot_ops = table_slot_callbacks(currentRelation);
ExecInitScanTupleSlot(estate, &indexstate->ss,
RelationGetDescr(currentRelation),
- table_slot_callbacks(currentRelation));
+ table_slot_ops);
+ if (node->indexorderby && table_slot_ops != &TTSOpsHeapTuple)
+ indexstate->ss.ps.scanopsfixed = false;
/*
* Initialize result type and projection.
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 554c2a5a2c..35963641e5 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -80,6 +80,7 @@ lnext:
int lockflags = 0;
TM_Result test;
TupleTableSlot *markSlot;
+ Bitmapset *epqCols = NULL;
/* clear any leftover test tuple for this rel */
markSlot = EvalPlanQualSlot(&node->lr_epqstate, erm->relation, erm->rti);
@@ -179,11 +180,15 @@ lnext:
if (!IsolationUsesXactSnapshot())
lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
+ epqCols = PopulateNeededColumnsForEPQ(&node->lr_epqstate,
+ RelationGetDescr(erm->relation)->natts);
+
test = table_tuple_lock(erm->relation, &tid, estate->es_snapshot,
markSlot, estate->es_output_cid,
lockmode, erm->waitPolicy,
lockflags,
- &tmfd);
+ &tmfd,
+ epqCols);
switch (test)
{
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 29e07b7228..380a1ed7de 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -233,7 +233,7 @@ ExecCheckTIDVisible(EState *estate,
if (!IsolationUsesXactSnapshot())
return;
- if (!table_tuple_fetch_row_version(rel, tid, SnapshotAny, tempSlot))
+ if (!table_tuple_fetch_row_version(rel, tid, SnapshotAny, tempSlot, NULL))
elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
ExecCheckTupleVisible(estate, rel, tempSlot);
ExecClearTuple(tempSlot);
@@ -861,6 +861,7 @@ ldelete:;
{
TupleTableSlot *inputslot;
TupleTableSlot *epqslot;
+ Bitmapset *epqCols = NULL;
if (IsolationUsesXactSnapshot())
ereport(ERROR,
@@ -875,12 +876,15 @@ ldelete:;
inputslot = EvalPlanQualSlot(epqstate, resultRelationDesc,
resultRelInfo->ri_RangeTableIndex);
+ epqCols = PopulateNeededColumnsForEPQ(epqstate,
+ RelationGetDescr(resultRelationDesc)->natts);
+
result = table_tuple_lock(resultRelationDesc, tupleid,
estate->es_snapshot,
inputslot, estate->es_output_cid,
LockTupleExclusive, LockWaitBlock,
TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
- &tmfd);
+ &tmfd, epqCols);
switch (result)
{
@@ -1033,8 +1037,23 @@ ldelete:;
}
else
{
+ RangeTblEntry *resultrte = exec_rt_fetch(resultRelInfo->ri_RangeTableIndex, estate);
+ Bitmapset *project_cols = resultrte->returningCols;
+ /*
+ * XXX returningCols should never be empty if we have a RETURNING
+ * clause. Right now, if we have a view, we fail to populate the
+ * returningCols of its base table's RTE.
+ * If we encounter such a situation now, for correctness, ensure
+ * that we fetch all the columns.
+ */
+ if (bms_is_empty(resultrte->returningCols))
+ {
+ bms_free(resultrte->returningCols);
+ project_cols = bms_make_singleton(0);
+ }
if (!table_tuple_fetch_row_version(resultRelationDesc, tupleid,
- SnapshotAny, slot))
+ SnapshotAny, slot,
+ project_cols))
elog(ERROR, "failed to fetch deleted tuple for DELETE RETURNING");
}
}
@@ -1427,6 +1446,7 @@ lreplace:;
{
TupleTableSlot *inputslot;
TupleTableSlot *epqslot;
+ Bitmapset *epqCols = NULL;
if (IsolationUsesXactSnapshot())
ereport(ERROR,
@@ -1440,12 +1460,14 @@ lreplace:;
inputslot = EvalPlanQualSlot(epqstate, resultRelationDesc,
resultRelInfo->ri_RangeTableIndex);
+ epqCols = PopulateNeededColumnsForEPQ(epqstate,
+ RelationGetDescr(resultRelationDesc)->natts);
result = table_tuple_lock(resultRelationDesc, tupleid,
estate->es_snapshot,
inputslot, estate->es_output_cid,
lockmode, LockWaitBlock,
TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
- &tmfd);
+ &tmfd, epqCols);
switch (result)
{
@@ -1574,6 +1596,8 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
Relation relation = resultRelInfo->ri_RelationDesc;
ExprState *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
+ ProjectionInfo *oc_ProjInfo = resultRelInfo->ri_onConflict->oc_ProjInfo;
+ Bitmapset *proj_cols = resultRelInfo->ri_onConflict->proj_cols;
TM_FailureData tmfd;
LockTupleMode lockmode;
TM_Result test;
@@ -1594,7 +1618,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
estate->es_snapshot,
existing, estate->es_output_cid,
lockmode, LockWaitBlock, 0,
- &tmfd);
+ &tmfd, proj_cols);
switch (test)
{
case TM_Ok:
@@ -1742,7 +1766,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
}
/* Project the new tuple version */
- ExecProject(resultRelInfo->ri_onConflict->oc_ProjInfo);
+ ExecProject(oc_ProjInfo);
/*
* Note that it is possible that the target tuple has been modified in
@@ -2503,6 +2527,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
&mtstate->ps);
resultRelInfo->ri_onConflict->oc_WhereClause = qualexpr;
}
+
+ PopulateNeededColumnsForOnConflictUpdate(resultRelInfo);
}
/*
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 1a7c1e919f..7394df2b1c 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -31,6 +31,7 @@
#include "access/tableam.h"
#include "executor/execdebug.h"
#include "executor/nodeSeqscan.h"
+#include "nodes/nodeFuncs.h"
#include "utils/rel.h"
static TupleTableSlot *SeqNext(SeqScanState *node);
@@ -68,9 +69,22 @@ SeqNext(SeqScanState *node)
* We reach here if the scan is not parallel, or if we're serially
* executing a scan that was planned to be parallel.
*/
- scandesc = table_beginscan(node->ss.ss_currentRelation,
- estate->es_snapshot,
- 0, NULL);
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ Scan *planNode = (Scan *)node->ss.ps.plan;
+ int rti = planNode->scanrelid;
+ RangeTblEntry *rte = list_nth(estate->es_plannedstmt->rtable, rti - 1);
+ scandesc = table_beginscan_with_column_projection(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL,
+ rte->scanCols);
+ }
+ else
+ {
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ }
node->ss.ss_currentScanDesc = scandesc;
}
@@ -270,14 +284,22 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
{
EState *estate = node->ss.ps.state;
ParallelTableScanDesc pscan;
+ Bitmapset *proj = NULL;
pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
+
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ proj = PopulateNeededColumnsForScan(&node->ss,
+ node->ss.ss_currentRelation->rd_att->natts);
+ }
+
table_parallelscan_initialize(node->ss.ss_currentRelation,
pscan,
estate->es_snapshot);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
- table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+ table_beginscan_parallel(node->ss.ss_currentRelation, pscan, proj);
}
/* ----------------------------------------------------------------
@@ -307,8 +329,19 @@ ExecSeqScanInitializeWorker(SeqScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelTableScanDesc pscan;
+ Bitmapset *proj = NULL;
+
+ /*
+ * FIXME: this is duplicate work with ExecSeqScanInitializeDSM. In the
+ * future, the plan will carry the projection list, and this overhead will
+ * not exist.
+ */
+ if (table_scans_leverage_column_projection(node->ss.ss_currentRelation))
+ {
+ proj = PopulateNeededColumnsForScan(&node->ss,
+ node->ss.ss_currentRelation->rd_att->natts);
+ }
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->ss.ss_currentScanDesc =
- table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+ table_beginscan_parallel(node->ss.ss_currentRelation, pscan, proj);
}
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 8049fdc64e..ab01119f17 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -366,6 +366,7 @@ TidNext(TidScanState *node)
while (node->tss_TidPtr >= 0 && node->tss_TidPtr < numTids)
{
ItemPointerData tid = tidList[node->tss_TidPtr];
+ Bitmapset *project_cols = NULL;
/*
* For WHERE CURRENT OF, the tuple retrieved from the cursor might
@@ -375,7 +376,15 @@ TidNext(TidScanState *node)
if (node->tss_isCurrentOf)
table_tuple_get_latest_tid(scan, &tid);
- if (table_tuple_fetch_row_version(heapRelation, &tid, snapshot, slot))
+ /*
+ * TODO: Remove this hack!! This should be done once at the start of the tid scan.
+ * Ideally we should probably set the list of projection cols in the
+ * generic scan desc, perhaps in TableScanDesc.
+ */
+ project_cols = PopulateNeededColumnsForScan((ScanState *) node,
+ RelationGetDescr(heapRelation)->natts);
+
+ if (table_tuple_fetch_row_version(heapRelation, &tid, snapshot, slot, project_cols))
return slot;
/* Bad TID or failed snapshot qual; try next */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 3031c52991..b146bfadd9 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2436,8 +2436,10 @@ _copyRangeTblEntry(const RangeTblEntry *from)
COPY_SCALAR_FIELD(checkAsUser);
COPY_BITMAPSET_FIELD(selectedCols);
COPY_BITMAPSET_FIELD(insertedCols);
+ COPY_BITMAPSET_FIELD(returningCols);
COPY_BITMAPSET_FIELD(updatedCols);
COPY_BITMAPSET_FIELD(extraUpdatedCols);
+ COPY_BITMAPSET_FIELD(scanCols);
COPY_NODE_FIELD(securityQuals);
return newnode;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 9aa853748d..a0a1bd8848 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2693,8 +2693,10 @@ _equalRangeTblEntry(const RangeTblEntry *a, const RangeTblEntry *b)
COMPARE_SCALAR_FIELD(checkAsUser);
COMPARE_BITMAPSET_FIELD(selectedCols);
COMPARE_BITMAPSET_FIELD(insertedCols);
+ COMPARE_BITMAPSET_FIELD(returningCols);
COMPARE_BITMAPSET_FIELD(updatedCols);
COMPARE_BITMAPSET_FIELD(extraUpdatedCols);
+ COMPARE_BITMAPSET_FIELD(scanCols);
COMPARE_NODE_FIELD(securityQuals);
return true;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 4504b1503b..7516a4d255 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -3151,8 +3151,10 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
WRITE_OID_FIELD(checkAsUser);
WRITE_BITMAPSET_FIELD(selectedCols);
WRITE_BITMAPSET_FIELD(insertedCols);
+ WRITE_BITMAPSET_FIELD(returningCols);
WRITE_BITMAPSET_FIELD(updatedCols);
WRITE_BITMAPSET_FIELD(extraUpdatedCols);
+ WRITE_BITMAPSET_FIELD(scanCols);
WRITE_NODE_FIELD(securityQuals);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index ab7b535caa..0471878e71 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1460,8 +1460,10 @@ _readRangeTblEntry(void)
READ_OID_FIELD(checkAsUser);
READ_BITMAPSET_FIELD(selectedCols);
READ_BITMAPSET_FIELD(insertedCols);
+ READ_BITMAPSET_FIELD(returningCols);
READ_BITMAPSET_FIELD(updatedCols);
READ_BITMAPSET_FIELD(extraUpdatedCols);
+ READ_BITMAPSET_FIELD(scanCols);
READ_NODE_FIELD(securityQuals);
READ_DONE();
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 84a69b064a..4e73804d66 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -23,10 +23,12 @@
#include "catalog/pg_class.h"
#include "catalog/pg_operator.h"
#include "catalog/pg_proc.h"
+#include "catalog/pg_statistic.h"
#include "foreign/fdwapi.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
+#include "utils/rel.h"
#ifdef OPTIMIZER_DEBUG
#include "nodes/print.h"
#endif
@@ -47,6 +49,7 @@
#include "partitioning/partbounds.h"
#include "partitioning/partprune.h"
#include "rewrite/rewriteManip.h"
+#include "utils/syscache.h"
#include "utils/lsyscache.h"
@@ -79,7 +82,11 @@ static void set_rel_size(PlannerInfo *root, RelOptInfo *rel,
static void set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
Index rti, RangeTblEntry *rte);
static void set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel,
- RangeTblEntry *rte);
+ Index rti, RangeTblEntry *rte);
+static void set_plain_rel_page_estimates(PlannerInfo *root,
+ RelOptInfo *rel,
+ Index rti,
+ RangeTblEntry *rte);
static void create_plain_partial_paths(PlannerInfo *root, RelOptInfo *rel);
static void set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
RangeTblEntry *rte);
@@ -146,7 +153,7 @@ static void subquery_push_qual(Query *subquery,
static void recurse_push_qual(Node *setOp, Query *topquery,
RangeTblEntry *rte, Index rti, Node *qual);
static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
-
+static void extract_scan_columns(PlannerInfo *root);
/*
* make_one_rel
@@ -189,6 +196,8 @@ make_one_rel(PlannerInfo *root, List *joinlist)
*/
set_base_rel_sizes(root);
+ extract_scan_columns(root);
+
/*
* We should now have size estimates for every actual table involved in
* the query, and we also know which if any have been deleted from the
@@ -239,6 +248,86 @@ make_one_rel(PlannerInfo *root, List *joinlist)
return rel;
}
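+
+/*
+ * extract_scan_columns
+ *	  Fill RangeTblEntry->scanCols for each base relation, based on the
+ *	  columns referenced by the rel's target list and restriction clauses.
+ *	  A whole-row reference is recorded as a singleton bitmap containing
+ *	  attnum 0, meaning "fetch all columns".
+ */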
+static void
+extract_scan_columns(PlannerInfo *root)
+{
+ for (int i = 1; i < root->simple_rel_array_size; i++)
+ {
+ ListCell *lc;
+ RangeTblEntry *rte = root->simple_rte_array[i];
+ RelOptInfo *rel = root->simple_rel_array[i];
+ if (rte == NULL)
+ continue;
+ if (rel == NULL)
+ continue;
+ if (IS_DUMMY_REL(rel))
+ continue;
+ rte->scanCols = NULL;
+ foreach(lc, rel->reltarget->exprs)
+ {
+ Node *node;
+ List *vars;
+ ListCell *lc1;
+ node = lfirst(lc);
+ /*
+ * TODO: suggest a default for vars_only to make maintenance less burdensome
+ */
+ vars = pull_var_clause(node,
+ PVC_RECURSE_AGGREGATES |
+ PVC_RECURSE_WINDOWFUNCS |
+ PVC_RECURSE_PLACEHOLDERS);
+ foreach(lc1, vars)
+ {
+ Var *var = lfirst(lc1);
+ if (var->varno == i)
+ {
+ if (var->varattno > 0)
+ rte->scanCols = bms_add_member(rte->scanCols, var->varattno);
+ else if (var->varattno == 0)
+ {
+ /*
+ * If there is a whole-row var, we have to fetch the whole row.
+ */
+ bms_free(rte->scanCols);
+ rte->scanCols = bms_make_singleton(0);
+ goto outer;
+ }
+ }
+ }
+ }
+ foreach(lc, rel->baserestrictinfo)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+ List *vars = pull_var_clause((Node *)rinfo->clause,
+ PVC_RECURSE_AGGREGATES |
+ PVC_RECURSE_WINDOWFUNCS |
+ PVC_RECURSE_PLACEHOLDERS);
+ ListCell *lc1;
+ if (contains_whole_row_col(rte->scanCols))
+ break;
+ foreach(lc1, vars)
+ {
+ Var *var = lfirst(lc1);
+ if (var->varno == i)
+ {
+ if (var->varattno > 0)
+ rte->scanCols = bms_add_member(rte->scanCols, var->varattno);
+ else if (var->varattno == 0)
+ {
+ /*
+ * If there is a whole-row var, we have to fetch the whole row.
+ */
+ bms_free(rte->scanCols);
+ rte->scanCols = bms_make_singleton(0);
+ break;
+ }
+ }
+ }
+ }
+ outer:;
+ }
+}
+
/*
* set_base_rel_consider_startup
* Set the consider_[param_]startup flags for each base-relation entry.
@@ -414,7 +503,7 @@ set_rel_size(PlannerInfo *root, RelOptInfo *rel,
else
{
/* Plain relation */
- set_plain_rel_size(root, rel, rte);
+ set_plain_rel_size(root, rel, rti, rte);
}
break;
case RTE_SUBQUERY:
@@ -576,7 +665,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
* Set size estimates for a plain relation (no subquery, no inheritance)
*/
static void
-set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
+set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte)
{
/*
* Test any partial indexes of rel for applicability. We must do this
@@ -586,6 +675,81 @@ set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
/* Mark rel with estimated output rows, width, etc */
set_baserel_size_estimates(root, rel);
+
+ /* Estimate the pages based on the selected columns */
+ set_plain_rel_page_estimates(root, rel, rti, rte);
+}
+
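+/*
+ * set_plain_rel_page_estimates
+ *	  Scale rel->pages by the on-disk fraction (pg_statistic.stadiskfrac) of
+ *	  the columns actually referenced by the query, so that a column store
+ *	  is costed only for the pages it will actually read.
+ */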
+static void
+set_plain_rel_page_estimates(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte)
+{
+ Var *var;
+ List *vars;
+ double pages;
+ ListCell *lc;
+ ListCell *lc1;
+ Bitmapset *cols = NULL;
+ HeapTuple tp;
+ AttrNumber attno;
+ Selectivity sel = 0;
+
+ Assert(rel->rtekind == RTE_RELATION);
+
+ foreach(lc, rel->reltarget->exprs)
+ {
+ Node *node;
+ node = lfirst(lc);
+ vars = pull_var_clause(node,
+ PVC_RECURSE_AGGREGATES |
+ PVC_RECURSE_WINDOWFUNCS |
+ PVC_RECURSE_PLACEHOLDERS);
+ foreach(lc1, vars)
+ {
+ var = lfirst(lc1);
+ if (var->varno == rti && var->varattno >= 0)
+ cols = bms_add_member(cols, var->varattno);
+ }
+ }
+
+ foreach(lc, rel->baserestrictinfo)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+ vars = pull_var_clause((Node *)rinfo->clause,
+ PVC_RECURSE_AGGREGATES |
+ PVC_RECURSE_WINDOWFUNCS |
+ PVC_RECURSE_PLACEHOLDERS);
+ foreach(lc1, vars)
+ {
+ var = lfirst(lc1);
+ if (var->varno == rti && var->varattno >= 0)
+ cols = bms_add_member(cols, var->varattno);
+ }
+ }
+
+ attno = -1;
+ while ((attno = bms_next_member(cols, attno)) >= 0)
+ {
+ tp = SearchSysCache3(STATRELATTINH,
+ ObjectIdGetDatum(rte->relid),
+ Int16GetDatum(attno),
+ BoolGetDatum(rte->inh));
+
+ if (HeapTupleIsValid(tp))
+ {
+ sel += ((Form_pg_statistic) GETSTRUCT(tp))->stadiskfrac;
+ ReleaseSysCache(tp);
+ }
+ }
+
+ if (sel > 0)
+ {
+ pages = rel->pages * sel;
+
+ if (pages <= 1.0)
+ rel->pages = 1;
+ else
+ rel->pages = rint(pages);
+ }
}
/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 40abe6f9f6..6d8a014fa0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -838,6 +838,9 @@ use_physical_tlist(PlannerInfo *root, Path *path, int flags)
rel->rtekind != RTE_CTE)
return false;
+ if (rel->rtekind == RTE_RELATION && rel->leverage_column_projection)
+ return false;
+
/*
* Can't do it with inheritance cases either (mainly because Append
* doesn't project; this test may be unnecessary now that
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 986d7a52e3..7f34f6bef1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1481,6 +1481,9 @@ inheritance_planner(PlannerInfo *root)
RelOptInfo *sub_final_rel;
Path *subpath;
+ ListCell *listCell;
+ int rti;
+
/*
* expand_inherited_rtentry() always processes a parent before any of
* that parent's children, so the parent query for this relation
@@ -1685,6 +1688,22 @@ inheritance_planner(PlannerInfo *root)
/* Build list of modified subroots, too */
subroots = lappend(subroots, subroot);
+ rti = 0;
+ foreach(listCell, subroot->parse->rtable)
+ {
+ RangeTblEntry *subroot_rte = lfirst(listCell);
+ RangeTblEntry *finalroot_rte = list_nth(final_rtable, rti);
+ if (finalroot_rte != subroot_rte)
+ {
+ finalroot_rte->scanCols = bms_union(finalroot_rte->scanCols, subroot_rte->scanCols);
+ if(contains_whole_row_col(finalroot_rte->scanCols))
+ {
+ bms_free(finalroot_rte->scanCols);
+ finalroot_rte->scanCols = bms_make_singleton(0);
+ }
+ }
+ rti++;
+ }
/* Build list of target-relation RT indexes */
resultRelations = lappend_int(resultRelations, appinfo->child_relid);
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index d56d8c6509..81220902b4 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -190,8 +190,19 @@ preprocess_targetlist(PlannerInfo *root)
* to make these Vars available for the RETURNING calculation. Vars that
* belong to the result rel don't need to be added, because they will be
* made to refer to the actual heap tuple.
+ *
+ * XXX: Avoid adding cols from the returningList to avoid overestimation
+ * of scanCols from RelOptInfo->reltarget exprs. This is done to avoid
+ * additional cols from the RETURNING clause making their way into scanCols
+ * for queries such as:
+ * delete from base_tbl using other_tbl t where base_tbl.col1 = t.col1 returning *;
+ * where base_tbl is the root table of an inheritance hierarchy
+ * TODO: Delete the result_relation guard below if and when
+ * inheritance_planner() is refactored to not fake a round of planning
+ * pretending we have a SELECT query (which causes result_relation to be 0
+ * in the first place)
*/
- if (parse->returningList && list_length(parse->rtable) > 1)
+ if (result_relation && parse->returningList && list_length(parse->rtable) > 1)
{
List *vars;
ListCell *l;
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 3132fd35a5..101d3cb42f 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -50,7 +50,9 @@ static Bitmapset *translate_col_privs(const Bitmapset *parent_privs,
static void expand_appendrel_subquery(PlannerInfo *root, RelOptInfo *rel,
RangeTblEntry *rte, Index rti);
-
+static Bitmapset *
+translate_parent_cols(Bitmapset *parent_cols, List *translated_vars,
+ Relation parent_rel);
/*
* expand_inherited_rtentry
* Expand a rangetable entry that has the "inh" bit set.
@@ -518,6 +520,13 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
childrte->alias = childrte->eref = makeAlias(parentrte->eref->aliasname,
child_colnames);
+ if (childOID != parentOID)
+ childrte->returningCols =
+ translate_parent_cols(parentrte->returningCols,
+ appinfo->translated_vars, parentrel);
+ else
+ childrte->returningCols = bms_copy(parentrte->returningCols);
+
/*
* Translate the column permissions bitmaps to the child's attnums (we
* have to build the translated_vars list before we can do this). But if
@@ -587,6 +596,32 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
}
}
+/*
+ * We need to translate the list of ordinal attnos from a parent table's
+ * RangeTblEntry to the ordinal attribute numbers for the child's entry.
+ */
+
+static Bitmapset *
+translate_parent_cols(Bitmapset *parent_cols, List *translated_vars,
+ Relation parent_rel)
+{
+ int col = -1;
+ Bitmapset *result = NULL;
+ /*
+ * Enumerate the set of parent columns for translation if there is a whole
+ * row var
+ */
+ if(contains_whole_row_col(parent_cols))
+ parent_cols = get_ordinal_attnos(parent_rel);
+ while ((col = bms_next_member(parent_cols, col)) >= 0)
+ {
+ Var *var = (Var *) list_nth(translated_vars, col - 1);
+ if (var)
+ result = bms_add_member(result, var->varattno);
+ }
+ return result;
+}
+
/*
* translate_col_privs
* Translate a bitmapset representing per-column privileges from the
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 52c01eb86b..652d52cc43 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -126,6 +126,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
*/
relation = table_open(relationObjectId, NoLock);
+ if (relation->rd_tableam)
+ rel->leverage_column_projection = relation->rd_tableam->scans_leverage_column_projection;
/* Temporary and unlogged relations are inaccessible during recovery. */
if (!RelationNeedsWAL(relation) && RecoveryInProgress())
ereport(ERROR,
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 575e22ce0d..c0fb34df80 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -67,7 +67,7 @@ static Node *transformSetOperationTree(ParseState *pstate, SelectStmt *stmt,
static void determineRecursiveColTypes(ParseState *pstate,
Node *larg, List *nrtargetlist);
static Query *transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt);
-static List *transformReturningList(ParseState *pstate, List *returningList);
+static void transformReturningList(ParseState *pstate, Query *qry, List *returningList);
static List *transformUpdateTargetList(ParseState *pstate,
List *targetList);
static Query *transformDeclareCursorStmt(ParseState *pstate,
@@ -439,7 +439,7 @@ transformDeleteStmt(ParseState *pstate, DeleteStmt *stmt)
qual = transformWhereClause(pstate, stmt->whereClause,
EXPR_KIND_WHERE, "WHERE");
- qry->returningList = transformReturningList(pstate, stmt->returningList);
+ transformReturningList(pstate, qry, stmt->returningList);
/* done building the range table and jointree */
qry->rtable = pstate->p_rtable;
@@ -855,8 +855,7 @@ transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
pstate->p_namespace = NIL;
addNSItemToQuery(pstate, pstate->p_target_nsitem,
false, true, true);
- qry->returningList = transformReturningList(pstate,
- stmt->returningList);
+ transformReturningList(pstate, qry, stmt->returningList);
}
/* done building the range table and jointree */
@@ -2252,7 +2251,7 @@ transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt)
qual = transformWhereClause(pstate, stmt->whereClause,
EXPR_KIND_WHERE, "WHERE");
- qry->returningList = transformReturningList(pstate, stmt->returningList);
+ transformReturningList(pstate, qry, stmt->returningList);
/*
* Now we are done with SELECT-like processing, and can get on with
@@ -2347,14 +2346,16 @@ transformUpdateTargetList(ParseState *pstate, List *origTlist)
* transformReturningList -
* handle a RETURNING clause in INSERT/UPDATE/DELETE
*/
-static List *
-transformReturningList(ParseState *pstate, List *returningList)
+static void
+transformReturningList(ParseState *pstate, Query *qry, List *returningList)
{
List *rlist;
int save_next_resno;
+ List *vars;
+ ListCell *l;
if (returningList == NIL)
- return NIL; /* nothing to do */
+ return;
/*
* We need to assign resnos starting at one in the RETURNING list. Save
@@ -2367,6 +2368,27 @@ transformReturningList(ParseState *pstate, List *returningList)
/* transform RETURNING identically to a SELECT targetlist */
rlist = transformTargetList(pstate, returningList, EXPR_KIND_RETURNING);
+ vars = pull_var_clause((Node *) rlist,
+ PVC_RECURSE_AGGREGATES |
+ PVC_RECURSE_WINDOWFUNCS |
+ PVC_INCLUDE_PLACEHOLDERS);
+ foreach (l, vars)
+ {
+ Var *var = (Var *) lfirst(l);
+ RangeTblEntry *rte = (RangeTblEntry *) list_nth(pstate->p_rtable, var->varno - 1);
+ if (var->varattno > 0)
+ rte->returningCols = bms_add_member(rte->returningCols, var->varattno);
+ else if (var->varattno == 0)
+ {
+ /*
+ * If there is a whole-row var, we have to fetch the whole row.
+ */
+ bms_free(rte->returningCols);
+ rte->returningCols = bms_make_singleton(0);
+ break;
+ }
+ }
+
/*
* Complain if the nonempty tlist expanded to nothing (which is possible
* if it contains only a star-expansion of a zero-column table). If we
@@ -2390,7 +2412,7 @@ transformReturningList(ParseState *pstate, List *returningList)
/* restore state */
pstate->p_next_resno = save_next_resno;
- return rlist;
+ qry->returningList = rlist;
}
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index a56bd86181..ef8890d957 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -1451,7 +1451,9 @@ addRangeTableEntry(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->returningCols = NULL;
rte->extraUpdatedCols = NULL;
+ rte->scanCols = NULL;
/*
* Add completed RTE to pstate's range table list, so that we know its
@@ -1538,7 +1540,9 @@ addRangeTableEntryForRelation(ParseState *pstate,
rte->checkAsUser = InvalidOid; /* not set-uid by default, either */
rte->selectedCols = NULL;
rte->insertedCols = NULL;
+ rte->returningCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -1635,8 +1639,10 @@ addRangeTableEntryForSubquery(ParseState *pstate,
rte->checkAsUser = InvalidOid;
rte->selectedCols = NULL;
rte->insertedCols = NULL;
+ rte->returningCols = NULL;
rte->updatedCols = NULL;
rte->extraUpdatedCols = NULL;
+ rte->scanCols = NULL;
/*
* Add completed RTE to pstate's range table list, so that we know its
@@ -1941,8 +1947,10 @@ addRangeTableEntryForFunction(ParseState *pstate,
rte->checkAsUser = InvalidOid;
rte->selectedCols = NULL;
rte->insertedCols = NULL;
+ rte->returningCols = NULL;
rte->updatedCols = NULL;
rte->extraUpdatedCols = NULL;
+ rte->scanCols = NULL;
/*
* Add completed RTE to pstate's range table list, so that we know its
@@ -2012,8 +2020,10 @@ addRangeTableEntryForTableFunc(ParseState *pstate,
rte->checkAsUser = InvalidOid;
rte->selectedCols = NULL;
rte->insertedCols = NULL;
+ rte->returningCols = NULL;
rte->updatedCols = NULL;
rte->extraUpdatedCols = NULL;
+ rte->scanCols = NULL;
/*
* Add completed RTE to pstate's range table list, so that we know its
@@ -2099,8 +2109,10 @@ addRangeTableEntryForValues(ParseState *pstate,
rte->checkAsUser = InvalidOid;
rte->selectedCols = NULL;
rte->insertedCols = NULL;
+ rte->returningCols = NULL;
rte->updatedCols = NULL;
rte->extraUpdatedCols = NULL;
+ rte->scanCols = NULL;
/*
* Add completed RTE to pstate's range table list, so that we know its
@@ -2188,8 +2200,10 @@ addRangeTableEntryForJoin(ParseState *pstate,
rte->checkAsUser = InvalidOid;
rte->selectedCols = NULL;
rte->insertedCols = NULL;
+ rte->returningCols = NULL;
rte->updatedCols = NULL;
rte->extraUpdatedCols = NULL;
+ rte->scanCols = NULL;
/*
* Add completed RTE to pstate's range table list, so that we know its
@@ -2307,8 +2321,10 @@ addRangeTableEntryForCTE(ParseState *pstate,
rte->checkAsUser = InvalidOid;
rte->selectedCols = NULL;
rte->insertedCols = NULL;
+ rte->returningCols = NULL;
rte->updatedCols = NULL;
rte->extraUpdatedCols = NULL;
+ rte->scanCols = NULL;
/*
* Add completed RTE to pstate's range table list, so that we know its
@@ -2422,6 +2438,8 @@ addRangeTableEntryForENR(ParseState *pstate,
rte->requiredPerms = 0;
rte->checkAsUser = InvalidOid;
rte->selectedCols = NULL;
+ rte->returningCols = NULL;
+ rte->scanCols = NULL;
/*
* Add completed RTE to pstate's range table list, so that we know its
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index ac0c495972..71e86245bd 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -3181,6 +3181,7 @@ check_default_partition_contents(Relation parent, Relation default_rel,
TableScanDesc scan;
MemoryContext oldCxt;
TupleTableSlot *tupslot;
+ Bitmapset *proj = NULL;
/* Lock already taken above. */
if (part_relid != RelationGetRelid(default_rel))
@@ -3245,7 +3246,15 @@ check_default_partition_contents(Relation parent, Relation default_rel,
econtext = GetPerTupleExprContext(estate);
snapshot = RegisterSnapshot(GetLatestSnapshot());
tupslot = table_slot_create(part_rel, &estate->es_tupleTable);
- scan = table_beginscan(part_rel, snapshot, 0, NULL);
+ if (table_scans_leverage_column_projection(part_rel))
+ {
+ PopulateNeededColumnsForNode((Node*)partqualstate->expr, tupslot->tts_tupleDescriptor->natts, &proj);
+ scan = table_beginscan_with_column_projection(part_rel, snapshot, 0, NULL, proj);
+ }
+ else
+ {
+ scan = table_beginscan(part_rel, snapshot, 0, NULL);
+ }
/*
* Switch to per-tuple memory context and reset it for each tuple
@@ -3276,6 +3285,9 @@ check_default_partition_contents(Relation parent, Relation default_rel,
if (RelationGetRelid(default_rel) != RelationGetRelid(part_rel))
table_close(part_rel, NoLock); /* keep the lock until commit */
+
+ if (proj)
+ pfree(proj);
}
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee99b8..b35e0e6e7a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -167,6 +167,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
case RM_COMMIT_TS_ID:
case RM_REPLORIGIN_ID:
case RM_GENERIC_ID:
+ case RM_ZEDSTORE_ID:
/* just deal with xid, and done */
ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
buf.origptr);
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 04684912de..aae2737628 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1372,6 +1372,7 @@ apply_handle_update_internal(ResultRelInfo *relinfo,
found = FindReplTupleInLocalRel(estate, localrel,
&relmapentry->remoterel,
remoteslot, &localslot);
+
ExecClearTuple(remoteslot);
/*
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index 41dd670572..685fe87055 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -1675,8 +1675,10 @@ ApplyRetrieveRule(Query *parsetree,
rte->checkAsUser = InvalidOid;
rte->selectedCols = NULL;
rte->insertedCols = NULL;
+ rte->returningCols = NULL;
rte->updatedCols = NULL;
rte->extraUpdatedCols = NULL;
+ rte->scanCols = NULL;
/*
* For the most part, Vars referencing the view should remain as
@@ -1776,15 +1778,19 @@ ApplyRetrieveRule(Query *parsetree,
subrte->checkAsUser = rte->checkAsUser;
subrte->selectedCols = rte->selectedCols;
subrte->insertedCols = rte->insertedCols;
+ subrte->returningCols = rte->returningCols;
subrte->updatedCols = rte->updatedCols;
subrte->extraUpdatedCols = rte->extraUpdatedCols;
+ subrte->scanCols = rte->scanCols;
rte->requiredPerms = 0; /* no permission check on subquery itself */
rte->checkAsUser = InvalidOid;
rte->selectedCols = NULL;
rte->insertedCols = NULL;
+ rte->returningCols = NULL;
rte->updatedCols = NULL;
rte->extraUpdatedCols = NULL;
+ rte->scanCols = NULL;
return parsetree;
}
@@ -3118,6 +3124,7 @@ rewriteTargetView(Query *parsetree, Relation view)
* base_rte instead of copying it.
*/
new_rte = base_rte;
+ new_rte->returningCols = bms_copy(view_rte->returningCols);
new_rte->rellockmode = RowExclusiveLock;
parsetree->rtable = lappend(parsetree->rtable, new_rte);
@@ -3470,6 +3477,7 @@ rewriteTargetView(Query *parsetree, Relation view)
}
}
+ new_rte->returningCols = bms_copy(view_rte->returningCols);
table_close(base_rel, NoLock);
return parsetree;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc..2b773d1ccb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3769,6 +3769,28 @@ LockBuffer(Buffer buffer, int mode)
elog(ERROR, "unrecognized buffer lock mode: %d", mode);
}
+/*
+ * Acquire the content_lock for the buffer in the given mode, but only if
+ * we don't have to wait.
+ */
+bool
+ConditionalLockBufferInMode(Buffer buffer, int mode)
+{
+ BufferDesc *buf;
+
+ Assert(BufferIsValid(buffer));
+ if (BufferIsLocal(buffer))
+ return true; /* act as though we got it */
+
+ buf = GetBufferDescriptor(buffer - 1);
+
+ if (mode == BUFFER_LOCK_SHARE)
+ return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
+ else if (mode == BUFFER_LOCK_EXCLUSIVE)
+ return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
+ else
+ elog(ERROR, "unrecognized buffer lock mode: %d", mode);
+}
+
/*
* Acquire the content_lock for the buffer, but only if we don't have to wait.
*
diff --git a/src/backend/utils/adt/tid.c b/src/backend/utils/adt/tid.c
index 509a0fdffc..2f25ec2a4d 100644
--- a/src/backend/utils/adt/tid.c
+++ b/src/backend/utils/adt/tid.c
@@ -29,6 +29,7 @@
#include "libpq/pqformat.h"
#include "miscadmin.h"
#include "parser/parsetree.h"
+#include "storage/itemptr.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -36,12 +37,6 @@
#include "utils/snapmgr.h"
#include "utils/varlena.h"
-
-#define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
-#define ItemPointerGetDatum(X) PointerGetDatum(X)
-#define PG_GETARG_ITEMPOINTER(n) DatumGetItemPointer(PG_GETARG_DATUM(n))
-#define PG_RETURN_ITEMPOINTER(x) return ItemPointerGetDatum(x)
-
#define LDELIM '('
#define RDELIM ')'
#define DELIM ','
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca4b1..c0dc97b307 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -22,6 +22,7 @@
#include "access/spgxlog.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/zedstore_wal.h"
#include "catalog/storage_xlog.h"
#include "commands/dbcommands_xlog.h"
#include "commands/sequence.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 6c15df7e70..38ed65f9e3 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_ZEDSTORE_ID, "Zedstore", zedstore_redo, zedstore_desc, zedstore_identify, NULL, NULL, zedstore_mask)
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 387eb34a61..55733be214 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -38,6 +38,66 @@ struct SampleScanState;
struct TBMIterateResult;
struct VacuumParams;
struct ValidateIndexState;
+struct TupleConversionMap;
+
+typedef enum AnalyzeSampleType
+{
+ ANALYZE_SAMPLE_DATA = 0, /* real data per column */
+ ANALYZE_SAMPLE_DISKSIZE, /* physical size per column */
+ MAX_ANALYZE_SAMPLE /* must be last */
+} AnalyzeSampleType;
+
+typedef struct AnalyzeSampleContext
+{
+ /* Filled when context is created */
+ int totaltargrows;
+ List *anl_cols;
+ Relation parent;
+ BufferAccessStrategy bstrategy;
+
+ /* Filled by table AM analyze routines */
+ BlockNumber totalblocks;
+ TableScanDesc scan;
+
+ /*
+ * Acquiring sample rows from an inherited table invokes a separate
+ * sampling iteration for each child relation, so the fields below hold
+ * the statistics for the current iteration.
+ */
+ int targrows; /* target number of sample rows */
+ double liverows;
+ double deadrows;
+ bool ordered; /* are sample rows ordered physically */
+
+ /*
+ * Statistics accumulated over all sampling iterations.
+ */
+ int totalsampledrows; /* total number of sample rows stored */
+ double totalrows;
+ double totaldeadrows;
+
+ /*
+ * If the child relation has a different rowtype than the parent, the
+ * sample tuple needs to be converted to the parent's rowtype.
+ */
+ struct TupleConversionMap *tup_convert_map;
+
+ /*
+ * Used by table AM analyze routines to hold the current tuple for each
+ * type of sample row; the tuple is copied into sample_rows[] if it is
+ * randomly selected.
+ */
+ TupleTableSlot* sample_slots[MAX_ANALYZE_SAMPLE];
+
+ /*
+ * stores the final sample rows which will be
+ * used to compute statistics.
+ */
+ HeapTuple* sample_rows[MAX_ANALYZE_SAMPLE];
+} AnalyzeSampleContext;
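+
+/*
+ * Rough sketch of how the analyze callbacks below are meant to be driven
+ * (illustrative only; see the sample-acquisition logic in commands/analyze.c
+ * for the real caller):
+ *
+ *     AnalyzeSampleContext *ctx =
+ *         CreateAnalyzeSampleContext(onerel, NIL, targrows, bstrategy);
+ *
+ *     table_scan_analyze_beginscan(onerel, ctx);
+ *     for each block chosen by the block sampler:
+ *         if (table_scan_analyze_next_block(blockno, ctx))
+ *             while (table_scan_analyze_next_tuple(OldestXmin, ctx))
+ *                 table_scan_analyze_sample_tuple(pos, replace, ctx);
+ *     table_scan_analyze_endscan(ctx);
+ *     DestroyAnalyzeSampleContext(ctx);
+ *
+ * where 'pos' and 'replace' say which slot of the sample to fill and whether
+ * it replaces a previously stored row.
+ */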
/*
* Bitmask values for the flags argument to the scan_begin callback.
@@ -164,6 +224,7 @@ typedef struct TableAmRoutine
{
/* this must be set to T_TableAmRoutine */
NodeTag type;
+ bool scans_leverage_column_projection;
/* ------------------------------------------------------------------------
@@ -204,6 +265,30 @@ typedef struct TableAmRoutine
ParallelTableScanDesc pscan,
uint32 flags);
+ /*
+ * Variant of scan_begin() with a column projection bitmap that lists the
+ * ordinal attribute numbers to be fetched during the scan.
+ *
+ * If project_columns is an empty bitmap, none of the data columns are to be
+ * fetched.
+ *
+ * If project_columns is a singleton bitmap with a whole-row reference (0),
+ * all of the data columns are to be fetched.
+ *
+ * Please note: project_cols only deals with non system columns (attnum >= 0)
+ *
+ * Please note: Due to the limitations of the slot_get***() APIs, the
+ * scan_getnextslot() tableAM call must return a TupleTableSlot that is densely
+ * populated (missing cols indicated with isnull = true up to the largest
+ * attno in the projection list)
+ */
+ TableScanDesc (*scan_begin_with_column_projection)(Relation relation,
+ Snapshot snapshot,
+ int nkeys, struct ScanKeyData *key,
+ ParallelTableScanDesc parallel_scan,
+ uint32 flags,
+ Bitmapset *project_columns);
+
/*
* Release resources and deallocate scan. If TableScanDesc.temp_snap,
* TableScanDesc.rs_snapshot needs to be unregistered.
@@ -279,6 +364,26 @@ typedef struct TableAmRoutine
*/
void (*index_fetch_end) (struct IndexFetchTableData *data);
+ /*
+ * Set up a column projection list that can be used by index_fetch_tuple()
+ * to fetch a subset of columns for a tuple.
+ *
+ * If project_columns is an empty bitmap, none of the data columns are to be
+ * fetched.
+ *
+ * If project_columns is a singleton bitmap with a whole-row reference (0),
+ * all of the data columns are to be fetched.
+ *
+ * Please note: project_columns only deals with non system columns (attnum >= 0)
+ *
+ * Please note: Due to the limitations of the slot_get***() APIs,
+ * index_fetch_tuple() must return a TupleTableSlot that is densely
+ * populated (missing cols indicated with isnull = true up to the largest
+ * attno in the projection list)
+ */
+ void (*index_fetch_set_column_projection) (struct IndexFetchTableData *data,
+ Bitmapset *project_columns);
+
/*
* Fetch tuple at `tid` into `slot`, after doing a visibility test
* according to `snapshot`. If a tuple was found and passed the visibility
@@ -315,11 +420,27 @@ typedef struct TableAmRoutine
* Fetch tuple at `tid` into `slot`, after doing a visibility test
* according to `snapshot`. If a tuple was found and passed the visibility
* test, returns true, false otherwise.
+ *
+ * project_cols is a set of columns to be fetched for the given row.
+ *
+ * If project_cols is an empty bitmap, none of the data columns are to be
+ * fetched.
+ *
+ * If project_cols is a singleton bitmap with a whole-row reference (0),
+ * all of the data columns are to be fetched.
+ *
+ * Please note: project_cols only deals with non system columns (attnum >= 0)
+ *
+ * Please note: Due to the limitations of the slot_get***() APIs,
+ * tuple_fetch_row_version() must return a TupleTableSlot that is densely
+ * populated (missing cols indicated with isnull = true up to the largest
+ * attno in the projection list)
*/
bool (*tuple_fetch_row_version) (Relation rel,
ItemPointer tid,
Snapshot snapshot,
- TupleTableSlot *slot);
+ TupleTableSlot *slot,
+ Bitmapset *project_cols);
/*
* Is tid valid for a scan of this relation.
@@ -407,7 +528,8 @@ typedef struct TableAmRoutine
LockTupleMode mode,
LockWaitPolicy wait_policy,
uint8 flags,
- TM_FailureData *tmfd);
+ TM_FailureData *tmfd,
+ Bitmapset *project_cols);
/*
* Perform operations necessary to complete insertions made via
@@ -518,9 +640,10 @@ typedef struct TableAmRoutine
* clear what a good interface for non block based AMs would be, so there
* isn't one yet.
*/
- bool (*scan_analyze_next_block) (TableScanDesc scan,
- BlockNumber blockno,
- BufferAccessStrategy bstrategy);
+ void (*scan_analyze_beginscan) (Relation onerel, AnalyzeSampleContext *context);
+
+ bool (*scan_analyze_next_block) (BlockNumber blockno,
+ AnalyzeSampleContext *context);
/*
* See table_scan_analyze_next_tuple().
@@ -530,11 +653,13 @@ typedef struct TableAmRoutine
* influence autovacuum scheduling (see comment for relation_vacuum
* callback).
*/
- bool (*scan_analyze_next_tuple) (TableScanDesc scan,
- TransactionId OldestXmin,
- double *liverows,
- double *deadrows,
- TupleTableSlot *slot);
+ bool (*scan_analyze_next_tuple) (TransactionId OldestXmin,
+ AnalyzeSampleContext *context);
+
+ void (*scan_analyze_sample_tuple) (int pos, bool replace,
+ AnalyzeSampleContext *context);
+
+ void (*scan_analyze_endscan) (AnalyzeSampleContext *context);
/* see table_index_build_range_scan for reference about parameters */
double (*index_build_range_scan) (Relation table_rel,
@@ -761,6 +886,12 @@ table_beginscan(Relation rel, Snapshot snapshot,
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
+static inline bool
+table_scans_leverage_column_projection(Relation relation)
+{
+ return relation->rd_tableam->scans_leverage_column_projection;
+}
+
/*
* Like table_beginscan(), but for scanning catalog. It'll automatically use a
* snapshot appropriate for scanning catalog relations.
@@ -790,6 +921,19 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}
+static inline TableScanDesc
+table_beginscan_with_column_projection(Relation relation, Snapshot snapshot,
+ int nkeys, struct ScanKeyData *key,
+ Bitmapset *project_column)
+{
+ uint32 flags = SO_TYPE_SEQSCAN |
+ SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
+
+ Assert(relation->rd_tableam->scans_leverage_column_projection);
+ return relation->rd_tableam->scan_begin_with_column_projection(
+ relation, snapshot, nkeys, key, NULL, flags, project_column);
+}
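+
+/*
+ * For example (an illustrative sketch, not used elsewhere): a scan that only
+ * needs attributes 2 and 5 of a relation could be started as
+ *
+ *     Bitmapset  *proj = NULL;
+ *
+ *     proj = bms_add_member(proj, 2);
+ *     proj = bms_add_member(proj, 5);
+ *     scan = table_beginscan_with_column_projection(rel, snapshot, 0, NULL, proj);
+ *
+ * whereas bms_make_singleton(0) requests the whole row, and a NULL (empty)
+ * bitmap requests no data columns at all.
+ */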
+
/*
* table_beginscan_bm is an alternative entry point for setting up a
* TableScanDesc for a bitmap heap scan. Although that scan technology is
@@ -946,7 +1090,8 @@ extern void table_parallelscan_initialize(Relation rel,
* Caller must hold a suitable lock on the relation.
*/
extern TableScanDesc table_beginscan_parallel(Relation rel,
- ParallelTableScanDesc pscan);
+ ParallelTableScanDesc pscan,
+ Bitmapset *proj);
/*
* Restart a parallel scan. Call this in the leader process. Caller is
@@ -996,6 +1141,13 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
scan->rel->rd_tableam->index_fetch_end(scan);
}
+static inline void
+table_index_fetch_set_column_projection(struct IndexFetchTableData *scan,
+ Bitmapset *project_column)
+{
+ scan->rel->rd_tableam->index_fetch_set_column_projection(scan, project_column);
+}
+
/*
* Fetches, as part of an index scan, tuple at `tid` into `slot`, after doing
* a visibility test according to `snapshot`. If a tuple was found and passed
@@ -1071,7 +1223,8 @@ static inline bool
table_tuple_fetch_row_version(Relation rel,
ItemPointer tid,
Snapshot snapshot,
- TupleTableSlot *slot)
+ TupleTableSlot *slot,
+ Bitmapset *project_cols)
{
/*
* We don't expect direct calls to table_tuple_fetch_row_version with
@@ -1081,7 +1234,7 @@ table_tuple_fetch_row_version(Relation rel,
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
- return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
+ return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot, project_cols);
}
/*
@@ -1335,6 +1488,20 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
* also lock descendant tuples if lock modes don't conflict.
* If TUPLE_LOCK_FLAG_FIND_LAST_VERSION, follow the update chain and lock
* latest version.
+ * project_cols: It is a set of columns to be fetched for the tuple being locked.
+ *
+ * If project_cols is an empty bitmap, none of the data columns are to be
+ * fetched.
+ *
+ * If project_cols is a singleton bitmap with a whole-row reference (0),
+ * all of the data columns are to be fetched.
+ *
+ * Please note: project_cols only deals with non system columns (attnum >= 0)
+ *
+ * Please note: Due to the limitations of the slot_get***() APIs,
+ * tuple_lock() must return a TupleTableSlot that is densely
+ * populated (missing cols indicated with isnull = true up to the largest
+ * attno in the projection list)
*
* Output parameters:
* *slot: contains the target tuple
@@ -1356,11 +1523,11 @@ static inline TM_Result
table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
LockWaitPolicy wait_policy, uint8 flags,
- TM_FailureData *tmfd)
+ TM_FailureData *tmfd, Bitmapset *project_cols)
{
return rel->rd_tableam->tuple_lock(rel, tid, snapshot, slot,
cid, mode, wait_policy,
- flags, tmfd);
+ flags, tmfd, project_cols);
}
/*
@@ -1485,6 +1652,12 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
}
+static inline void
+table_scan_analyze_beginscan(Relation rel, struct AnalyzeSampleContext *context)
+{
+ rel->rd_tableam->scan_analyze_beginscan(rel, context);
+}
+
/*
* Prepare to analyze block `blockno` of `scan`. The scan needs to have been
* started with table_beginscan_analyze(). Note that this routine might
@@ -1494,11 +1667,10 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
* Returns false if block is unsuitable for sampling, true otherwise.
*/
static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
- BufferAccessStrategy bstrategy)
+table_scan_analyze_next_block(BlockNumber blockno,
+ struct AnalyzeSampleContext *context)
{
- return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
- bstrategy);
+ return context->scan->rs_rd->rd_tableam->scan_analyze_next_block(blockno, context);
}
/*
@@ -1512,13 +1684,21 @@ table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
* tuples.
*/
static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
- double *liverows, double *deadrows,
- TupleTableSlot *slot)
+table_scan_analyze_next_tuple(TransactionId OldestXmin, AnalyzeSampleContext *context)
{
- return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
- liverows, deadrows,
- slot);
+ return context->scan->rs_rd->rd_tableam->scan_analyze_next_tuple(OldestXmin, context);
+}
+
+static inline void
+table_scan_analyze_sample_tuple(Index sample, bool replace, AnalyzeSampleContext *context)
+{
+ context->scan->rs_rd->rd_tableam->scan_analyze_sample_tuple(sample, replace, context);
+}
+
+static inline void
+table_scan_analyze_endscan(AnalyzeSampleContext *context)
+{
+ context->scan->rs_rd->rd_tableam->scan_analyze_endscan(context);
}
/*
@@ -1869,6 +2049,32 @@ extern void table_block_relation_estimate_size(Relation rel,
Size overhead_bytes_per_tuple,
Size usable_bytes_per_page);
+/* ----------------------------------------------------------------------------
+ * Helper functions to implement analyze scan.
+ * ----------------------------------------------------------------------------
+ */
+extern AnalyzeSampleContext *
+CreateAnalyzeSampleContext(Relation onerel, List *cols, int targrows,
+ BufferAccessStrategy strategy);
+extern void DestroyAnalyzeSampleContext(AnalyzeSampleContext *context);
+extern TupleTableSlot * AnalyzeGetSampleSlot(AnalyzeSampleContext *context,
+ Relation onerel, AnalyzeSampleType type);
+extern void AnalyzeRecordSampleRow(AnalyzeSampleContext *context,
+ TupleTableSlot *sample_slot,
+ HeapTuple sample_tuple,
+ AnalyzeSampleType type, int pos,
+ bool replace, bool withtid);
+extern void InitAnalyzeSampleContextForChild(AnalyzeSampleContext *context,
+ Relation child,
+ int childtargrows);
+extern void AnalyzeGetSampleStats(AnalyzeSampleContext *context,
+ int *totalsampledrows,
+ double *totalrows,
+ double *totaldeadrows);
+extern HeapTuple *
+AnalyzeGetSampleRows(AnalyzeSampleContext *context, AnalyzeSampleType type, int offset);
+extern bool AnalyzeSampleIsValid(AnalyzeSampleContext *context, AnalyzeSampleType type);
+
/* ----------------------------------------------------------------------------
* Functions in tableamapi.c
* ----------------------------------------------------------------------------
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 2f0c8bf589..519728bc35 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -218,7 +218,7 @@ typedef struct XLogRecordDataHeaderLong
* need a handful of block references, but there are a few exceptions that
* need more.
*/
-#define XLR_MAX_BLOCK_ID 32
+#define XLR_MAX_BLOCK_ID 199
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG 254
diff --git a/src/include/access/zedstore_compression.h b/src/include/access/zedstore_compression.h
new file mode 100644
index 0000000000..8b1ab8586f
--- /dev/null
+++ b/src/include/access/zedstore_compression.h
@@ -0,0 +1,16 @@
+/*
+ * zedstore_compression.h
+ * internal declarations for ZedStore compression
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_compression.h
+ */
+#ifndef ZEDSTORE_COMPRESSION_H
+#define ZEDSTORE_COMPRESSION_H
+
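+/*
+ * zs_compress_destSize() compresses as much of 'src' as fits into
+ * 'targetDstSize' bytes of 'dst', updating *srcSizePtr to the number of
+ * source bytes actually consumed, and returns the size of the compressed
+ * data; zs_decompress() reverses the process. (This summary is inferred
+ * from the signatures; see the attstream compression notes in
+ * zedstore_internal.h.)
+ */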
+extern int zs_compress_destSize(const char *src, char *dst, int *srcSizePtr, int targetDstSize);
+extern void zs_decompress(const char *src, char *dst, int compressedSize, int uncompressedSize);
+
+#endif /* ZEDSTORE_COMPRESSION_H */
diff --git a/src/include/access/zedstore_internal.h b/src/include/access/zedstore_internal.h
new file mode 100644
index 0000000000..e8dccdc632
--- /dev/null
+++ b/src/include/access/zedstore_internal.h
@@ -0,0 +1,1083 @@
+/*
+ * zedstore_internal.h
+ * internal declarations for ZedStore tables
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_internal.h
+ */
+#ifndef ZEDSTORE_INTERNAL_H
+#define ZEDSTORE_INTERNAL_H
+
+#include "access/tableam.h"
+#include "access/zedstore_compression.h"
+#include "access/zedstore_tid.h"
+#include "access/zedstore_undolog.h"
+#include "lib/integerset.h"
+#include "storage/bufmgr.h"
+#include "storage/smgr.h"
+#include "utils/datum.h"
+
+struct zs_pending_undo_op;
+
+#define ZS_META_ATTRIBUTE_NUM 0
+#define ZS_INVALID_ATTRIBUTE_NUM (-1)
+
+#define INVALID_SPECULATIVE_TOKEN 0
+
+/*
+ * attstream_buffer is an in-memory representation of an attribute stream. It is used
+ * by the operations that construct and manipulate attribute streams.
+ */
+typedef struct
+{
+ /*
+ * Enlargeable buffer. The chunks are stored in 'data', between the
+ * 'cursor' and 'len' positions. So if cursor > 0, there is some unused
+ * space before the chunks, and if len < maxlen, there is unused space
+ * after the chunks.
+ */
+ char *data; /* contains raw chunks */
+ int len;
+ int maxlen;
+ int cursor; /* beginning of remaining chunks */
+
+ /*
+ * First and last TID (inclusive) stored in the chunks.
+ */
+ zstid firsttid;
+ zstid lasttid;
+
+ /*
+ * meta-data of the attribute, so that we don't need to pass these along
+ * as separate arguments everywhere.
+ */
+ int16 attlen;
+ bool attbyval;
+} attstream_buffer;
+
+/*
+ * attstream_decoder is used to unpack an attstream into tids/datums/isnulls.
+ */
+typedef struct
+{
+ /* memory context holding the buffer */
+ MemoryContext cxt;
+
+ /* this is for holding decoded element data in the arrays, reset between decoder_attstream_cont calls */
+ MemoryContext tmpcxt;
+
+ /*
+ * meta-data of the attribute, so that we don't need to pass these along
+ * as separate arguments everywhere.
+ */
+ int16 attlen;
+ bool attbyval;
+
+ /* buffer and its allocated size */
+ char *chunks_buf;
+ int chunks_buf_size;
+
+ /* attstream compression ratio */
+ float8 compression_ratio;
+
+ /* information about the current attstream in the buffer */
+ int chunks_len;
+ zstid firsttid;
+ zstid lasttid;
+
+ /* next position within the attstream */
+ int pos;
+ zstid prevtid;
+
+ /*
+ * currently decoded batch of elements
+ */
+/* must be >= the max number of items in one codeword (that is, >= 60) */
+#define DECODER_MAX_ELEMS 90
+ zstid tids[DECODER_MAX_ELEMS];
+ Datum datums[DECODER_MAX_ELEMS];
+ bool isnulls[DECODER_MAX_ELEMS];
+ int num_elements;
+ float8 avg_elements_size; /* avg physical size of elements */
+} attstream_decoder;
+
+/*
+ * A ZedStore table contains different kinds of pages, all in the same file.
+ *
+ * Block 0 is always a metapage. It contains the block numbers of the other
+ * data structures stored within the file, like the per-attribute B-trees,
+ * and the UNDO log. In addition, if there are overly large datums in the
+ * table, they are chopped into separate "toast" pages.
+ */
+#define ZS_META_PAGE_ID 0xF083
+#define ZS_BTREE_PAGE_ID 0xF084
+#define ZS_UNDO_PAGE_ID 0xF085
+#define ZS_TOAST_PAGE_ID 0xF086
+#define ZS_FREE_PAGE_ID 0xF087
+
+/* flags for zedstore b-tree pages */
+#define ZSBT_ROOT 0x0001
+
+typedef struct ZSBtreePageOpaque
+{
+ AttrNumber zs_attno;
+ uint16 zs_level; /* 0 = leaf */
+ BlockNumber zs_next;
+ zstid zs_lokey; /* inclusive */
+ zstid zs_hikey; /* exclusive */
+ uint16 zs_flags;
+
+ uint16 padding1;
+ uint16 padding2;
+
+ uint16 zs_page_id; /* always ZS_BTREE_PAGE_ID */
+} ZSBtreePageOpaque;
+
+#define ZSBtreePageGetOpaque(page) ((ZSBtreePageOpaque *) PageGetSpecialPointer(page))
+
+/*
+ * Internal B-tree page layout.
+ *
+ * The "contents" of the page is an array of ZSBtreeInternalPageItem. The number
+ * of items can be deduced from pd_lower.
+ */
+typedef struct ZSBtreeInternalPageItem
+{
+ zstid tid;
+ BlockNumber childblk;
+} ZSBtreeInternalPageItem;
+
+static inline ZSBtreeInternalPageItem *
+ZSBtreeInternalPageGetItems(Page page)
+{
+ ZSBtreeInternalPageItem *items;
+
+ items = (ZSBtreeInternalPageItem *) PageGetContents(page);
+
+ return items;
+}
+static inline int
+ZSBtreeInternalPageGetNumItems(Page page)
+{
+ ZSBtreeInternalPageItem *begin;
+ ZSBtreeInternalPageItem *end;
+
+ begin = (ZSBtreeInternalPageItem *) PageGetContents(page);
+ end = (ZSBtreeInternalPageItem *) ((char *) page + ((PageHeader) page)->pd_lower);
+
+ return end - begin;
+}
+
+static inline bool
+ZSBtreeInternalPageIsFull(Page page)
+{
+ PageHeader phdr = (PageHeader) page;
+
+ return phdr->pd_upper - phdr->pd_lower < sizeof(ZSBtreeInternalPageItem);
+}
+
+/*
+ * Attribute B-tree leaf page layout
+ *
+ * Leaf pages in the attribute trees don't follow the normal page layout
+ * with line pointers and items. They use the standard page header,
+ * with pd_lower and pd_upper, but the data stored in the lower and upper
+ * parts are different from the normal usage.
+ *
+ * The upper and lower parts of the page contain one "attribute stream"
+ * each. An attribute stream contains attribute data for a range of rows.
+ * Logically, it contains a list of TIDs, and their Datums and isnull
+ * flags. The ranges of TIDs stored in the streams never overlap, but
+ * there can be gaps, if rows have been deleted or updated.
+ *
+ * Physically, the stream consists of "chunks", where one chunk contains
+ * the TIDs of 1-60 datums, packed in a compact form, and their datums.
+ * Finally, the whole stream can be compressed. See comments in
+ * zedstore_attstream.c for a more detailed description of the chunk
+ * format.
+ *
+ * By convention, the attribute stream stored in the upper part of the
+ * page, between pd_upper and pd_special, is compressed, and the lower
+ * stream, stored between the page header and pd_lower, is uncompressed:
+ *
+ * +--------------------+
+ * | PageHeaderData |
+ * +--------------------+
+ * | lower attstream |
+ * | (uncompressed) ... |
+ * | .................. |
+ * | .................. |
+ * +--------------------+ <-pd_lower
+ * | |
+ * | (free space) |
+ * | |
+ * +--------------------+ <-pd_upper
+ * | upper attstream |
+ * | (compressed) .... |
+ * | .................. |
+ * | .................. |
+ * | .................. |
+ * +--------------------+ <-pd_special
+ * | ZSBtreePageOpaque |
+ * +--------------------+
+ *
+ * The point of having two streams is to allow fast appending of
+ * data to a page, without having to decompress and recompress
+ * the whole page. When new data is inserted, it is added to
+ * the uncompressed stream, if it fits. When a page becomes full,
+ * the uncompressed stream is merged with the compressed stream,
+ * replacing both with one larger compressed stream.
+ *
+ * The names "lower" and "upper" refer to the physical location of
+ * the stream on the page. The data in the lower attstream
+ * have higher-numbered TIDs than the data in the upper attstream.
+ * No overlap is allowed. This works well with the usual usage
+ * pattern that new data is added to the end (i.e. with increasing
+ * sequence of TIDs), and old data is archived in compressed form
+ * when a page fills up.
+ */
+
+/*
+ * ZSAttStream represents one attribute stream, stored in the lower
+ * or upper part of an attribute leaf page. It is also used to
+ * pass around data in memory, in which case a stream can be
+ * arbitrarily long.
+ *
+ *
+ * Attstreams are compressed by feeding the stream to the compressor, until
+ * all the space available on the page is used. However, the compressor doesn't know
+ * about chunk boundaries within the stream, so it may stop the compression
+ * in the middle of a chunk. As an artifact of that, a compressed stream
+ * often contains an incomplete chunk at the end. That space goes wasted, and
+ * is ignored. 't_decompressed_size' is the total size of all complete chunks
+ * in a compressed stream, while 't_decompressed_bufsize' includes the wasted
+ * bytes at the end.
+ *
+ * XXX: We could avoid the waste by using a compressor that knows about the
+ * chunk boundaries. Or we could compress twice, first to get the size that
+ * fits, and second time to compress just what fits. But that would be twice
+ * as slow. In practice, the wasted space doesn't matter much. We try to
+ * keep each chunk relatively small, to minimize the waste. And because we
+ * know the next chunk wouldn't fit on the page anyway, there isn't much else
+ * we could do with the wasted space.
+ */
+typedef struct
+{
+ uint32 t_size; /* physical size of the stream. */
+ uint32 t_flags;
+ uint32 t_decompressed_size; /* payload size, excludes waste */
+ uint32 t_decompressed_bufsize; /* payload size, includes waste */
+ zstid t_lasttid; /* last TID stored in this stream */
+
+ char t_payload[FLEXIBLE_ARRAY_MEMBER];
+} ZSAttStream;
+
+#define SizeOfZSAttStreamHeader offsetof(ZSAttStream, t_payload)
+
+#define ATTSTREAM_COMPRESSED 1
+
+
+/*
+ * TID B-tree leaf page layout
+ *
+ * Leaf pages are packed with ZSTidArrayItems. Each ZSTidArrayItem represents
+ * a range of tuples, starting at 't_firsttid', up to 't_endtid' - 1. For each
+ * tuple, we store its TID and the UNDO pointer. The TIDs and UNDO pointers are specially
+ * encoded, so that they take less space.
+ *
+ * Item format:
+ *
+ * We make use of some assumptions / observations on the TIDs and UNDO pointers
+ * to pack them tightly:
+ *
+ * - TIDs are kept in ascending order, and the gap between two TIDs
+ * is usually very small. On a newly loaded table, all TIDs are
+ * consecutive.
+ *
+ * - It's common for the UNDO pointer to be old so that the tuple is
+ * visible to everyone. In that case we don't need to keep the exact value.
+ *
+ * - Nearby TIDs are likely to have only a few distinct UNDO pointer values.
+ *
+ *
+ * Each item looks like this:
+ *
+ * Header | 1-16 TID codewords | 0-2 UNDO pointers | UNDO "slotwords"
+ *
+ * The fixed-size header contains the start and end of the TID range that
+ * this item represents, and information on how many UNDO slots and codewords
+ * follow in the variable-size part.
+ *
+ * After the fixed-size header comes the list of TIDs. They are encoded in
+ * Simple-8b codewords. Simple-8b is an encoding scheme to pack multiple
+ * integers in 64-bit codewords. A single codeword can pack e.g. three 20-bit
+ * integers, or 20 3-bit integers, or a number of different combinations.
+ * Therefore, small integers pack more tightly than larger integers. We encode
+ * the difference between each TID, so in the common case that there are few
+ * gaps between the TIDs, we only need a few bits per tuple. The first encoded
+ * integer is always 0, because the first TID is stored explicitly in
+ * t_firsttid. (TODO: storing the first constant 0 is obviously a waste of
+ * space. Also, since there cannot be duplicates, we could store "delta - 1",
+ * which would allow a tighter representation in some cases.)
+ *
+ * After the TID codewords come the so-called "UNDO slots". They represent all the
+ * distinct UNDO pointers in the group of TIDs that this item covers.
+ * Logically, there are 4 slots. Slots 0 and 1 are special, representing
+ * all-visible "old" TIDs, and "dead" TIDs. They are not stored in the item
+ * itself, to save space, but logically, they can be thought to be part of
+ * every item. They are included in 't_num_undo_slots', so the number of UNDO
+ * pointers physically stored on an item is actually 't_num_undo_slots - 2'.
+ *
+ * With the 4 UNDO slots, we can represent an UNDO pointer using a 2-bit
+ * slot number. If you update a tuple with a new UNDO pointer, and all four
+ * slots are already in use, the item needs to be split. Hopefully that doesn't
+ * happen too often (see assumptions above).
+ *
+ * After the UNDO slots come "UNDO slotwords". The slotwords contain the slot
+ * number of each tuple in the item. The slot numbers are packed in 64 bit
+ * integers, with 2 bits for each tuple.
+ *
+ * Representing UNDO pointers as distinct slots also has the advantage that
+ * when we're scanning the TID array, we can check the few UNDO pointers in
+ * the slots against the current snapshot, and remember the visibility of
+ * each slot, instead of checking every UNDO pointer separately. That
+ * considerably speeds up visibility checks when reading. That's one
+ * advantage of this special encoding scheme, compared to e.g. using a
+ * general-purpose compression algorithm on an array of TIDs and UNDO pointers.
+ *
+ * The physical size of an item depends on how many tuples it covers, the
+ * number of codewords needed to encode the TIDs, and how many distinct UNDO
+ * pointers they have.
+ */
+typedef struct
+{
+ uint16 t_size;
+ uint16 t_num_tids;
+ uint16 t_num_codewords;
+ uint16 t_num_undo_slots;
+
+ zstid t_firsttid;
+ zstid t_endtid;
+
+ /* Followed by UNDO slots, and then followed by codewords */
+ uint64 t_payload[FLEXIBLE_ARRAY_MEMBER];
+
+} ZSTidArrayItem;
+
+/*
+ * We use 2 bits for the UNDO slot number for every tuple. We can therefore
+ * fit 32 slot numbers in each 64-bit "slotword".
+ */
+#define ZSBT_ITEM_UNDO_SLOT_BITS 2
+#define ZSBT_MAX_ITEM_UNDO_SLOTS (1 << (ZSBT_ITEM_UNDO_SLOT_BITS))
+#define ZSBT_ITEM_UNDO_SLOT_MASK (ZSBT_MAX_ITEM_UNDO_SLOTS - 1)
+#define ZSBT_SLOTNOS_PER_WORD (64 / ZSBT_ITEM_UNDO_SLOT_BITS)
+
+/*
+ * To keep the item size and time needed to work with them reasonable,
+ * limit the size of an item to max 16 codewords and 128 TIDs.
+ */
+#define ZSBT_MAX_ITEM_CODEWORDS 16
+#define ZSBT_MAX_ITEM_TIDS 128
+
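+/*
+ * Slot numbers 0 and 1 are the implicit "all-visible" and "dead" slots
+ * described above; they are not physically stored in the item. Slots 2 and
+ * above hold the UNDO pointers that are stored in the item itself.
+ */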
+#define ZSBT_OLD_UNDO_SLOT 0
+#define ZSBT_DEAD_UNDO_SLOT 1
+#define ZSBT_FIRST_NORMAL_UNDO_SLOT 2
+
+/* Number of UNDO slotwords needed for a given number of tuples */
+#define ZSBT_NUM_SLOTWORDS(num_tids) ((num_tids + ZSBT_SLOTNOS_PER_WORD - 1) / ZSBT_SLOTNOS_PER_WORD)
+
+static inline size_t
+SizeOfZSTidArrayItem(int num_tids, int num_undo_slots, int num_codewords)
+{
+ Size sz;
+
+ sz = offsetof(ZSTidArrayItem, t_payload);
+ sz += num_codewords * sizeof(uint64);
+ sz += (num_undo_slots - ZSBT_FIRST_NORMAL_UNDO_SLOT) * sizeof(ZSUndoRecPtr);
+ sz += ZSBT_NUM_SLOTWORDS(num_tids) * sizeof(uint64);
+
+ return sz;
+}
+
+/*
+ * Get pointers to the TID codewords, UNDO slots, and slotwords from an item.
+ *
+ * Note: this is also used to get the pointers when constructing a new item, so
+ * don't assert here that the data is valid!
+ */
+static inline void
+ZSTidArrayItemDecode(ZSTidArrayItem *item, uint64 **codewords,
+ ZSUndoRecPtr **slots, uint64 **slotwords)
+{
+ char *p = (char *) item->t_payload;
+
+ *codewords = (uint64 *) p;
+ p += item->t_num_codewords * sizeof(uint64);
+ *slots = (ZSUndoRecPtr *) p;
+ p += (item->t_num_undo_slots - ZSBT_FIRST_NORMAL_UNDO_SLOT) * sizeof(ZSUndoRecPtr);
+ *slotwords = (uint64 *) p;
+}
+
+/*
+ * Toast page layout.
+ *
+ * When an overly large datum is stored, it is divided into chunks, and each
+ * chunk is stored on a dedicated toast page. The toast pages of a datum form
+ * a list; each page has a next/prev pointer.
+ */
+/*
+ * Maximum size of an individual untoasted Datum stored in ZedStore. Datums
+ * larger than this need to be toasted.
+ *
+ * A datum needs to fit on a B-tree page, with page and item headers.
+ *
+ * XXX: 500 accounts for all the headers. Need to compute this correctly...
+ */
+#define MaxZedStoreDatumSize (BLCKSZ - 500)
+
+typedef struct ZSToastPageOpaque
+{
+ AttrNumber zs_attno;
+
+ /* these are only set on the first page. */
+ zstid zs_tid;
+ uint32 zs_total_size;
+ uint32 zs_decompressed_size;
+ bool zs_is_compressed;
+
+ uint32 zs_slice_offset;
+ BlockNumber zs_prev;
+ BlockNumber zs_next;
+ uint16 zs_flags;
+ uint16 padding1; /* padding, to put zs_page_id last */
+ uint16 padding2; /* padding, to put zs_page_id last */
+ uint16 zs_page_id;
+} ZSToastPageOpaque;
+
+/*
+ * "Toast pointer" of a datum that's stored in zedstore toast pages.
+ *
+ * This looks somewhat like a normal TOAST pointer, but we mustn't let these
+ * escape out of zedstore code, because the rest of the system doesn't know
+ * how to deal with them.
+ *
+ * This must look like varattrib_1b_e!
+ */
+typedef struct varatt_zs_toastptr
+{
+ /* varattrib_1b_e */
+ uint8 va_header;
+ uint8 va_tag; /* VARTAG_ZEDSTORE in zedstore toast datums */
+
+ /* first block */
+ BlockNumber zst_block;
+} varatt_zs_toastptr;
+
+/*
+ * va_tag value. This should be distinguishable from the values in
+ * vartag_external
+ */
+#define VARTAG_ZEDSTORE 10
+
+/*
+ * Versions of datumGetSize and datumCopy that know about ZedStore-toasted
+ * datums.
+ */
+static inline Size
+zs_datumGetSize(Datum value, bool typByVal, int typLen)
+{
+ if (typLen > 0)
+ return typLen;
+ else if (typLen == -1)
+ {
+ if (VARATT_IS_EXTERNAL(value) && VARTAG_EXTERNAL(value) == VARTAG_ZEDSTORE)
+ return sizeof(varatt_zs_toastptr);
+ else
+ return VARSIZE_ANY(value);
+ }
+ else
+ return datumGetSize(value, typByVal, typLen);
+}
+
+static inline Datum
+zs_datumCopy(Datum value, bool typByVal, int typLen)
+{
+ if (typLen < 0 && VARATT_IS_EXTERNAL(value) && VARTAG_EXTERNAL(value) == VARTAG_ZEDSTORE)
+ {
+ char *result = palloc(sizeof(varatt_zs_toastptr));
+
+ memcpy(result, DatumGetPointer(value), sizeof(varatt_zs_toastptr));
+
+ return PointerGetDatum(result);
+ }
+ else
+ return datumCopy(value, typByVal, typLen);
+}
+
+/*
+ * Block 0 on every ZedStore table is a metapage.
+ *
+ * It contains a directory of b-tree roots for each attribute, and lots more.
+ */
+#define ZS_META_BLK 0
+
+/*
+ * The metapage stores one of these for each attribute.
+ */
+typedef struct ZSRootDirItem
+{
+ BlockNumber root;
+ BlockNumber fpm_head;
+} ZSRootDirItem;
+
+typedef struct ZSMetaPage
+{
+ int nattributes;
+ ZSRootDirItem tree_root_dir[FLEXIBLE_ARRAY_MEMBER]; /* one for each attribute */
+} ZSMetaPage;
+
+/*
+ * it's not clear what we should store in the "opaque" special area, and what
+ * as page contents, on a metapage. But have at least the page_id field here,
+ * so that tools like pg_filedump can recognize it as a zedstore metapage.
+ */
+typedef struct ZSMetaPageOpaque
+{
+ /*
+ * Head and tail page of the UNDO log.
+ *
+ * 'zs_undo_tail' is the newest page, where new UNDO records will be inserted,
+ * and 'zs_undo_head' is the oldest page. 'zs_undo_tail_first_counter' is the
+ * UNDO counter value of the first record on the tail page (or if the tail
+ * page is empty, the counter value the first record on the tail page will
+ * have, when it's inserted.) If there is no UNDO log at all,
+ * 'zs_undo_tail_first_counter' is the new counter value to use. It's actually
+ * redundant, except when there is no UNDO log at all, but it's a nice
+ * cross-check at other times.
+ */
+ BlockNumber zs_undo_head;
+ BlockNumber zs_undo_tail;
+ uint64 zs_undo_tail_first_counter;
+
+ /*
+ * Oldest UNDO record that is still needed. Anything older than this can
+ * be discarded, and considered as visible to everyone.
+ */
+ ZSUndoRecPtr zs_undo_oldestptr;
+
+ BlockNumber zs_fpm_head; /* head of the Free Page Map list for UNDO pages */
+ uint16 zs_flags;
+ uint16 zs_page_id;
+} ZSMetaPageOpaque;
+
+/*
+ * Codes populated by zs_SatisfiesNonVacuumable. Only the minimum set of
+ * values needed is defined here; the heap equivalent has more states.
+ */
+typedef enum
+{
+ ZSNV_NONE,
+ ZSNV_RECENTLY_DEAD /* tuple is dead, but not deletable yet */
+} ZSNV_Result;
+
+typedef struct ZSUndoSlotVisibility
+{
+ TransactionId xmin;
+ TransactionId xmax;
+ CommandId cmin;
+ uint32 speculativeToken;
+ ZSNV_Result nonvacuumable_status;
+ ZSUndoRecPtr undoptr;
+} ZSUndoSlotVisibility;
+
+static inline void
+InvalidateUndoVisibility(ZSUndoSlotVisibility *visi_info)
+{
+ visi_info->xmin = InvalidTransactionId;
+ visi_info->xmax = InvalidTransactionId;
+ visi_info->cmin = InvalidCommandId;
+ visi_info->speculativeToken = INVALID_SPECULATIVE_TOKEN;
+ visi_info->nonvacuumable_status = ZSNV_NONE;
+}
+
+typedef struct ZSTidItemIterator
+{
+ int tids_allocated_size;
+ zstid *tids;
+ uint8 *tid_undoslotnos;
+ int num_tids;
+ MemoryContext context;
+ ZSUndoSlotVisibility visi_infos[ZSBT_MAX_ITEM_UNDO_SLOTS];
+} ZSTidItemIterator;
+
+/*
+ * Holds the state of an in-progress scan on a zedstore Tid tree.
+ */
+typedef struct ZSTidTreeScan
+{
+ Relation rel;
+
+ /*
+ * memory context that should be used for any allocations that go with the scan,
+ * like the decompression buffers. This isn't a dedicated context; you must still
+ * free everything to avoid leaking! We need this because the getnext function
+ * might be called in a short-lived memory context that is reset between calls.
+ */
+ MemoryContext context;
+
+ bool active;
+ Buffer lastbuf;
+ OffsetNumber lastoff;
+ Snapshot snapshot;
+
+ /*
+ * starttid and endtid define a range of TIDs to scan. currtid is the previous
+ * TID that was returned from the scan. They determine what zsbt_tid_scan_next()
+ * will return.
+ */
+ zstid starttid;
+ zstid endtid;
+ zstid currtid;
+
+ /* in the "real" UNDO-log, this would probably be a global variable */
+ ZSUndoRecPtr recent_oldest_undo;
+
+ /* should this scan do predicate locking? Or check for conflicts? */
+ bool serializable;
+ bool acquire_predicate_tuple_locks;
+
+ /*
+ * These fields are used when the scan is processing an array item.
+ */
+ ZSTidItemIterator array_iter;
+ int array_curr_idx;
+} ZSTidTreeScan;
+
+/*
+ * This is a convenience function to get the index, i.e. the slot number, into
+ * the undo and visibility arrays. Note that it effectively reads the entry at
+ * "next_idx - 1", i.e. it is meant to be called after returning from a TID
+ * scan function, once next_idx has been incremented.
+ */
+static inline uint8
+ZSTidScanCurUndoSlotNo(ZSTidTreeScan *scan)
+{
+ Assert(scan->array_curr_idx >= 0 && scan->array_curr_idx < scan->array_iter.num_tids);
+ Assert(scan->array_iter.tid_undoslotnos != NULL);
+ return (scan->array_iter.tid_undoslotnos[scan->array_curr_idx]);
+}
+
+/*
+ * Holds the state of an in-progress scan on a zedstore attribute tree.
+ */
+typedef struct ZSAttrTreeScan
+{
+ Relation rel;
+ AttrNumber attno;
+ Form_pg_attribute attdesc;
+
+ /*
+ * memory context that should be used for any allocations that go with the scan,
+ * like the decompression buffers. This isn't a dedicated context; you must still
+ * free everything to avoid leaking! We need this because the getnext function
+ * might be called in a short-lived memory context that is reset between calls.
+ */
+ MemoryContext context;
+
+ bool active;
+ Buffer lastbuf;
+ OffsetNumber lastoff;
+
+ /*
+ * These fields are used when the scan is processing an array tuple.
+ * They are filled in by zsbt_attr_scan_fetch_array().
+ */
+ attstream_decoder decoder;
+
+ /* last index into attr_decoder arrays */
+ int decoder_last_idx;
+
+} ZSAttrTreeScan;
+
+/*
+ * We keep this cached copy of the information in the metapage in
+ * backend-private memory, in RelationData->rd_amcache.
+ *
+ * The cache contains the block numbers of the roots of all the tree
+ * structures, for quick searches, as well as the rightmost leaf page, for
+ * quick insertions to the end.
+ *
+ * Use zsmeta_get_cache() to get the cached struct.
+ *
+ * This is used together with smgr_targblock. smgr_targblock tracks the
+ * physical size of the relation file. This struct is only considered valid
+ * when smgr_targblock is valid. So in effect, we invalidate this whenever
+ * a smgr invalidation happens. Logically, the lifetime of this is the same
+ * as smgr_targblocks/smgr_fsm_nblocks/smgr_vm_nblocks, but there's no way
+ * to attach an AM-specific struct directly to SmgrRelation.
+ */
+typedef struct ZSMetaCacheData
+{
+ int cache_nattributes;
+
+ /* For each attribute */
+ struct {
+ BlockNumber root; /* root of the b-tree */
+ BlockNumber rightmost; /* rightmost leaf page */
+ zstid rightmost_lokey; /* lokey of rightmost leaf */
+ } cache_attrs[FLEXIBLE_ARRAY_MEMBER];
+
+} ZSMetaCacheData;
+
+extern ZSMetaCacheData *zsmeta_populate_cache(Relation rel);
+
+static inline ZSMetaCacheData *
+zsmeta_get_cache(Relation rel)
+{
+ if (rel->rd_amcache == NULL || RelationGetTargetBlock(rel) == InvalidBlockNumber)
+ zsmeta_populate_cache(rel);
+ return (ZSMetaCacheData *) rel->rd_amcache;
+}
+
+/*
+ * Blow away the cached ZSMetaCacheData struct. Next call to zsmeta_get_cache()
+ * will reload it from the metapage.
+ */
+static inline void
+zsmeta_invalidate_cache(Relation rel)
+{
+ if (rel->rd_amcache != NULL)
+ {
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }
+}
+
+/*
+ * zs_split_stack is used during page split, or page merge, to keep track
+ * of all the modified pages. The page split (or merge) routines don't
+ * modify pages directly, but they construct a list of 'zs_split_stack'
+ * entries. Each entry holds a buffer, and a temporary in-memory copy of
+ * a page that should be written to the buffer, once everything is completed.
+ * All the buffers are exclusively-locked.
+ */
+typedef struct zs_split_stack zs_split_stack;
+
+struct zs_split_stack
+{
+ zs_split_stack *next;
+
+ Buffer buf;
+ Page page; /* temp in-memory copy of page */
+ bool recycle; /* should the page be added to the FPM? */
+ bool special_only; /* if set, only the "special" area was changed (the
+ * rest of the page won't need to be WAL-logged) */
+};
+
+/* prototypes for functions in zedstore_tidpage.c */
+extern void zsbt_tid_begin_scan(Relation rel, zstid starttid, zstid endtid,
+ Snapshot snapshot, ZSTidTreeScan *scan);
+extern void zsbt_tid_reset_scan(ZSTidTreeScan *scan, zstid starttid, zstid endtid, zstid currtid);
+extern void zsbt_tid_end_scan(ZSTidTreeScan *scan);
+extern bool zsbt_tid_scan_next_array(ZSTidTreeScan *scan, zstid nexttid, ScanDirection direction);
+
+/*
+ * Return the next TID in the scan.
+ *
+ * The next TID means the first TID > scan->currtid. Each call moves
+ * scan->currtid to the last returned TID. You can call zsbt_tid_reset_scan()
+ * to change the position; scan->starttid and scan->endtid define the
+ * boundaries of the search.
+ */
+static inline zstid
+zsbt_tid_scan_next(ZSTidTreeScan *scan, ScanDirection direction)
+{
+ zstid nexttid;
+ int idx;
+
+ Assert(scan->active);
+
+ if (direction == ForwardScanDirection)
+ nexttid = scan->currtid + 1;
+ else if (direction == BackwardScanDirection)
+ nexttid = scan->currtid - 1;
+ else
+ nexttid = scan->currtid;
+
+ if (scan->array_iter.num_tids == 0 ||
+ nexttid < scan->array_iter.tids[0] ||
+ nexttid > scan->array_iter.tids[scan->array_iter.num_tids - 1])
+ {
+ scan->array_curr_idx = -1;
+ if (!zsbt_tid_scan_next_array(scan, nexttid, direction))
+ {
+ scan->currtid = nexttid;
+ return InvalidZSTid;
+ }
+ }
+
+ /*
+ * Optimize for the common case that we're scanning forward from the previous
+ * TID.
+ */
+ if (scan->array_curr_idx >= 0 && scan->array_iter.tids[scan->array_curr_idx] < nexttid)
+ idx = scan->array_curr_idx + 1;
+ else
+ idx = 0;
+
+ for (; idx < scan->array_iter.num_tids; idx++)
+ {
+ zstid this_tid = scan->array_iter.tids[idx];
+
+ if (this_tid >= scan->endtid)
+ {
+ scan->currtid = nexttid;
+ return InvalidZSTid;
+ }
+
+ if (this_tid >= nexttid)
+ {
+ /*
+ * Callers using SnapshotDirty need some extra visibility information.
+ */
+ if (scan->snapshot->snapshot_type == SNAPSHOT_DIRTY)
+ {
+ int slotno = scan->array_iter.tid_undoslotnos[idx];
+ ZSUndoSlotVisibility *visi_info = &scan->array_iter.visi_infos[slotno];
+
+ if (visi_info->xmin != FrozenTransactionId)
+ scan->snapshot->xmin = visi_info->xmin;
+ scan->snapshot->xmax = visi_info->xmax;
+ scan->snapshot->speculativeToken = visi_info->speculativeToken;
+ }
+
+ /* on next call, continue the scan at the next TID */
+ scan->currtid = this_tid;
+ scan->array_curr_idx = idx;
+ return this_tid;
+ }
+ }
+
+ /*
+ * unreachable, because zsbt_tid_scan_next_array() should never return an array
+ * that doesn't contain a matching TID.
+ */
+ Assert(false);
+ return InvalidZSTid;
+}
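+
+/*
+ * A minimal usage sketch (assuming a forward scan over the whole table;
+ * visibility handling and cleanup details omitted):
+ *
+ *   ZSTidTreeScan scan;
+ *   zstid      tid;
+ *
+ *   zsbt_tid_begin_scan(rel, MinZSTid, MaxPlusOneZSTid, snapshot, &scan);
+ *   while ((tid = zsbt_tid_scan_next(&scan, ForwardScanDirection)) != InvalidZSTid)
+ *   {
+ *       ... process 'tid' ...
+ *   }
+ *   zsbt_tid_end_scan(&scan);
+ */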
+
+
+extern zstid zsbt_tid_multi_insert(Relation rel, int ntuples,
+ TransactionId xid, CommandId cid,
+ uint32 speculative_token, ZSUndoRecPtr prevundoptr);
+extern TM_Result zsbt_tid_delete(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ Snapshot snapshot, Snapshot crosscheck, bool wait,
+ TM_FailureData *hufd, bool changingPart, bool *this_xact_has_lock);
+extern TM_Result zsbt_tid_update(Relation rel, zstid otid,
+ TransactionId xid,
+ CommandId cid, bool key_update, Snapshot snapshot, Snapshot crosscheck,
+ bool wait, TM_FailureData *hufd, zstid *newtid_p, bool *this_xact_has_lock);
+extern void zsbt_tid_clear_speculative_token(Relation rel, zstid tid, uint32 spectoken, bool forcomplete);
+extern void zsbt_tid_mark_dead(Relation rel, zstid tid, ZSUndoRecPtr recent_oldest_undo);
+extern IntegerSet *zsbt_collect_dead_tids(Relation rel, zstid starttid, zstid *endtid, uint64 *num_live_tuples);
+extern void zsbt_tid_remove(Relation rel, IntegerSet *tids);
+extern TM_Result zsbt_tid_lock(Relation rel, zstid tid,
+ TransactionId xid, CommandId cid,
+ LockTupleMode lockmode, bool follow_updates,
+ Snapshot snapshot, TM_FailureData *hufd,
+ zstid *next_tid, bool *this_xact_has_lock,
+ ZSUndoSlotVisibility *visi_info);
+extern void zsbt_tid_undo_deletion(Relation rel, zstid tid, ZSUndoRecPtr undoptr, ZSUndoRecPtr recent_oldest_undo);
+extern zstid zsbt_get_first_tid(Relation rel);
+extern zstid zsbt_get_last_tid(Relation rel);
+extern void zsbt_find_latest_tid(Relation rel, zstid *tid, Snapshot snapshot);
+
+/* prototypes for functions in zedstore_tiditem.c */
+extern List *zsbt_tid_item_create_for_range(zstid tid, int nelements, ZSUndoRecPtr undo_ptr);
+extern List *zsbt_tid_item_add_tids(ZSTidArrayItem *orig, zstid firsttid, int nelements,
+ ZSUndoRecPtr undo_ptr, bool *modified_orig);
+extern void zsbt_tid_item_unpack(ZSTidArrayItem *item, ZSTidItemIterator *iter);
+extern List *zsbt_tid_item_change_undoptr(ZSTidArrayItem *orig, zstid target_tid, ZSUndoRecPtr undoptr, ZSUndoRecPtr recent_oldest_undo);
+extern List *zsbt_tid_item_remove_tids(ZSTidArrayItem *orig, zstid *nexttid, IntegerSet *remove_tids,
+ ZSUndoRecPtr recent_oldest_undo);
+
+
+/* prototypes for functions in zedstore_attpage.c */
+extern void zsbt_attr_begin_scan(Relation rel, TupleDesc tdesc, AttrNumber attno,
+ ZSAttrTreeScan *scan);
+extern void zsbt_attr_end_scan(ZSAttrTreeScan *scan);
+extern bool zsbt_attr_scan_fetch_array(ZSAttrTreeScan *scan, zstid tid);
+
+extern void zsbt_attr_add(Relation rel, AttrNumber attno, attstream_buffer *newstream);
+extern void zsbt_attstream_change_redo(XLogReaderState *record);
+
+/* prototypes for functions in zedstore_attstream.c */
+extern void create_attstream(attstream_buffer *buffer, bool attbyval, int16 attlen,
+ int nelems, zstid *tids, Datum *datums, bool *isnulls);
+extern void init_attstream_buffer(attstream_buffer *buf, bool attbyval, int16 attlen);
+extern void init_attstream_buffer_from_stream(attstream_buffer *buf, bool attbyval, int16 attlen,
+ ZSAttStream *attstream, MemoryContext memcontext);
+extern int append_attstream(attstream_buffer *buffer, bool all, int nelems,
+ zstid *tids, Datum *datums, bool *isnulls);
+extern void vacuum_attstream(Relation rel, AttrNumber attno, attstream_buffer *buffer,
+ ZSAttStream *attstream,
+ zstid *tids_to_remove, int num_tids_to_remove);
+
+extern void merge_attstream(Form_pg_attribute attr, attstream_buffer *buffer, ZSAttStream *attstream2);
+extern void merge_attstream_buffer(Form_pg_attribute attr, attstream_buffer *buffer, attstream_buffer *buffer2);
+
+extern bool append_attstream_inplace(Form_pg_attribute att, ZSAttStream *oldstream, int freespace, attstream_buffer *newstream);
+
+extern int find_chunk_for_offset(attstream_buffer *attbuf, int offset, zstid *lasttid);
+extern int find_chunk_containing_tid(attstream_buffer *attbuf, zstid tid, zstid *lasttid);
+extern void trim_attstream_upto_offset(attstream_buffer *buf, int chunk_pos, zstid prev_lasttid);
+extern void split_attstream_buffer(attstream_buffer *oldattbuf, attstream_buffer *newattbuf, zstid splittid);
+
+extern void print_attstream(int attlen, char *chunk, int len);
+
+extern void init_attstream_decoder(attstream_decoder *decoder, bool attbyval, int16 attlen);
+extern void destroy_attstream_decoder(attstream_decoder *decoder);
+extern void decode_attstream_begin(attstream_decoder *decoder, ZSAttStream *attstream);
+extern bool decode_attstream_cont(attstream_decoder *decoder);
+extern bool get_attstream_chunk_cont(attstream_decoder *decoder, zstid *prevtid, zstid *firsttid, zstid *lasttid, bytea **chunk);
+
+/* prototypes for functions in zedstore_tuplebuffer.c */
+extern zstid zsbt_tuplebuffer_allocate_tids(Relation rel, TransactionId xid, CommandId cid, int ntids);
+extern void zsbt_tuplebuffer_flush(Relation rel);
+extern void zsbt_tuplebuffer_spool_tuple(Relation rel, zstid tid, Datum *datums, bool *isnulls);
+extern void zsbt_tuplebuffer_spool_slots(Relation rel, zstid *tids, TupleTableSlot **slots, int ntuples);
+
+extern void AtEOXact_zedstream_tuplebuffers(bool isCommit);
+
+
+/* prototypes for functions in zedstore_btree.c */
+extern zs_split_stack *zsbt_newroot(Relation rel, AttrNumber attno, int level, List *downlinks);
+extern zs_split_stack *zsbt_insert_downlinks(Relation rel, AttrNumber attno,
+ zstid leftlokey, BlockNumber leftblkno, int level,
+ List *downlinks);
+extern void zsbt_attr_remove(Relation rel, AttrNumber attno, IntegerSet *tids);
+extern zs_split_stack *zsbt_unlink_page(Relation rel, AttrNumber attno, Buffer buf, int level);
+extern zs_split_stack *zs_new_split_stack_entry(Buffer buf, Page page);
+extern void zs_apply_split_changes(Relation rel, zs_split_stack *stack, struct zs_pending_undo_op *undo_op, AttrNumber attrNumber);
+extern Buffer zsbt_descend(Relation rel, AttrNumber attno, zstid key, int level, bool readonly);
+extern Buffer zsbt_find_and_lock_leaf_containing_tid(Relation rel, AttrNumber attno,
+ Buffer buf, zstid nexttid, int lockmode);
+extern bool zsbt_page_is_expected(Relation rel, AttrNumber attno, zstid key, int level, Buffer buf);
+extern void zsbt_wal_log_leaf_items(Relation rel, AttrNumber attno, Buffer buf, OffsetNumber off, bool replace, List *items, struct zs_pending_undo_op *undo_op);
+extern void zsbt_wal_log_rewrite_pages(Relation rel, AttrNumber attno, List *buffers, struct zs_pending_undo_op *undo_op);
+
+/*
+ * Return the value of the row identified by 'tid' in a scan.
+ *
+ * 'tid' must be greater than any previously returned item.
+ *
+ * Returns true if a matching item is found, false otherwise. After
+ * a false return, it's OK to call this again with another greater TID.
+ */
+static inline bool
+zsbt_attr_fetch(ZSAttrTreeScan *scan, Datum *datum, bool *isnull, zstid tid)
+{
+ int idx;
+
+ /*
+ * Fetch the next item from the scan. The item we're looking for might
+ * already be in scan->array_*.
+ */
+ if (scan->decoder.num_elements == 0 ||
+ tid < scan->decoder.tids[0] ||
+ tid > scan->decoder.tids[scan->decoder.num_elements - 1])
+ {
+ if (!zsbt_attr_scan_fetch_array(scan, tid))
+ return false;
+ scan->decoder_last_idx = -1;
+ }
+ Assert(scan->decoder.num_elements > 0 &&
+ tid >= scan->decoder.tids[0] &&
+ tid <= scan->decoder.tids[scan->decoder.num_elements - 1]);
+
+ /*
+ * Optimize for the common case that we're scanning forward from the previous
+ * TID.
+ */
+ if (scan->decoder_last_idx != -1 && scan->decoder.tids[scan->decoder_last_idx] < tid)
+ idx = scan->decoder_last_idx + 1;
+ else
+ idx = 0;
+
+ for (; idx < scan->decoder.num_elements; idx++)
+ {
+ zstid this_tid = scan->decoder.tids[idx];
+
+ if (this_tid == tid)
+ {
+ *isnull = scan->decoder.isnulls[idx];
+ *datum = scan->decoder.datums[idx];
+ scan->decoder_last_idx = idx;
+ return true;
+ }
+ if (this_tid > tid)
+ return false;
+ }
+
+ return false;
+}
+
+extern PGDLLIMPORT const TupleTableSlotOps TTSOpsZedstore;
+
+/* prototypes for functions in zedstore_meta.c */
+extern void zsmeta_initmetapage(Relation rel);
+extern void zsmeta_initmetapage_redo(XLogReaderState *record);
+extern BlockNumber zsmeta_get_root_for_attribute(Relation rel, AttrNumber attno, bool for_update);
+extern void zsmeta_add_root_for_new_attributes(Relation rel, Page page);
+
+/* prototypes for functions in zedstore_visibility.c */
+extern TM_Result zs_SatisfiesUpdate(Relation rel, Snapshot snapshot,
+ ZSUndoRecPtr recent_oldest_undo,
+ zstid item_tid,
+ LockTupleMode mode,
+ bool *undo_record_needed, bool *this_xact_has_lock,
+ TM_FailureData *tmfd, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info);
+extern bool zs_SatisfiesVisibility(ZSTidTreeScan *scan,
+ TransactionId *obsoleting_xid, zstid *next_tid,
+ ZSUndoSlotVisibility *visi_info);
+
+/* prototypes for functions in zedstore_toast.c */
+extern Datum zedstore_toast_datum(Relation rel, AttrNumber attno, Datum value, zstid tid);
+extern Datum zedstore_toast_flatten(Relation rel, AttrNumber attno, zstid tid, Datum toasted);
+extern void zedstore_toast_delete(Relation rel, Form_pg_attribute attr, zstid tid, BlockNumber blkno);
+
+/* prototypes for functions in zedstore_freepagemap.c */
+extern Buffer zspage_getnewbuf(Relation rel, AttrNumber attrNumber);
+extern void zspage_mark_page_deleted(Page page, BlockNumber next_free_blk);
+extern void zspage_delete_page(Relation rel, Buffer buf, Buffer metabuf, AttrNumber attrNumber);
+
+typedef struct ZedstoreTupleTableSlot
+{
+ TupleTableSlot base;
+
+ char *data; /* data for materialized slots */
+
+ /*
+ * Extra visibility information. The tuple's xmin and cmin can be extracted
+ * from here, used e.g. for triggers (XXX is that true?). There's also
+ * a flag to indicate if a tuple is vacuumable or not, which can be useful
+ * if you're scanning with SnapshotAny. That's currently used in index
+ * build.
+ */
+ ZSUndoSlotVisibility *visi_info;
+
+ /*
+ * Normally, when a tuple is retrieved from a table, 'visi_info' points to
+ * TID tree scan's data structures. But sometimes it's useful to keep the
+ * information together with the slot, e.g. whe a slot is copied, so that
+ * it doesn't depend on any data outside the slot. In that case, you can
+ * fill in 'visi_info_buf', and set visi_info = &visi_info_buf.
+ */
+ ZSUndoSlotVisibility visi_info_buf;
+} ZedstoreTupleTableSlot;
+
+#endif /* ZEDSTORE_INTERNAL_H */
diff --git a/src/include/access/zedstore_simple8b.h b/src/include/access/zedstore_simple8b.h
new file mode 100644
index 0000000000..ebaaa368c9
--- /dev/null
+++ b/src/include/access/zedstore_simple8b.h
@@ -0,0 +1,21 @@
+/*
+ * zedstore_simple8b.h
+ * XXX
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_simple8b.h
+ */
+#ifndef ZEDSTORE_SIMPLE8B_H
+#define ZEDSTORE_SIMPLE8B_H
+
+extern uint64 simple8b_encode(const uint64 *ints, int num_ints, int *num_encoded);
+extern uint64 simple8b_encode_consecutive(const uint64 firstint, const uint64 secondint, int num_ints,
+ int *num_encoded);
+extern int simple8b_decode(uint64 codeword, uint64 *decoded);
+
+extern void simple8b_decode_words(uint64 *codewords, int num_codewords,
+ uint64 *dst, int num_integers);
+
+#endif /* ZEDSTORE_SIMPLE8B_H */
diff --git a/src/include/access/zedstore_tid.h b/src/include/access/zedstore_tid.h
new file mode 100644
index 0000000000..4f0b0b5993
--- /dev/null
+++ b/src/include/access/zedstore_tid.h
@@ -0,0 +1,84 @@
+/*
+ * zedstore_tid.h
+ * Conversions between ItemPointers and uint64.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_tid.h
+ */
+#ifndef ZEDSTORE_TID_H
+#define ZEDSTORE_TID_H
+
+#include "storage/itemptr.h"
+
+/*
+ * Throughout ZedStore, we pass around TIDs as uint64's, rather than ItemPointers,
+ * for speed.
+ */
+typedef uint64 zstid;
+
+#define InvalidZSTid 0
+#define MinZSTid 1 /* blk 0, off 1 */
+#define MaxZSTid ((uint64) MaxBlockNumber << 16 | 0xffff)
+/* note: if this is converted to ItemPointer, it is invalid */
+#define MaxPlusOneZSTid (MaxZSTid + 1)
+
+#define MaxZSTidOffsetNumber 129
+
+#define PG_GETARG_ZSTID(n) DatumGetZSTid(PG_GETARG_DATUM(n))
+#define PG_RETURN_ZSTID(x) return ZSTidGetDatum(x)
+
+/* fmgr interface macros */
+#ifdef USE_FLOAT8_BYVAL
+#define ZSTidGetDatum(X) Int64GetDatum(X)
+#define DatumGetZSTid(X) ((zstid) (X))
+#else
+#define ZSTidGetDatum(X) PointerGetDatum(X)
+#define DatumGetZSTid(X) (* ((zstid*) DatumGetPointer(X)))
+#endif
+
+static inline zstid
+ZSTidFromBlkOff(BlockNumber blk, OffsetNumber off)
+{
+ Assert(off != 0);
+
+ return (uint64) blk * (MaxZSTidOffsetNumber - 1) + off;
+}
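+
+/*
+ * For example, with MaxZSTidOffsetNumber = 129 there are 128 offsets per
+ * block: (blk 0, off 1) maps to TID 1, (blk 0, off 128) to TID 128, and
+ * (blk 1, off 1) to TID 129.
+ */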
+
+static inline zstid
+ZSTidFromItemPointer(ItemPointerData iptr)
+{
+ Assert(ItemPointerIsValid(&iptr));
+ return ZSTidFromBlkOff(ItemPointerGetBlockNumber(&iptr),
+ ItemPointerGetOffsetNumber(&iptr));
+}
+
+static inline ItemPointerData
+ItemPointerFromZSTid(zstid tid)
+{
+ ItemPointerData iptr;
+ BlockNumber blk;
+ OffsetNumber off;
+
+ blk = (tid - 1) / (MaxZSTidOffsetNumber - 1);
+ off = (tid - 1) % (MaxZSTidOffsetNumber - 1) + 1;
+
+ ItemPointerSet(&iptr, blk, off);
+ Assert(ItemPointerIsValid(&iptr));
+ return iptr;
+}
+
+static inline BlockNumber
+ZSTidGetBlockNumber(zstid tid)
+{
+ return (BlockNumber) ((tid - 1) / (MaxZSTidOffsetNumber - 1));
+}
+
+static inline OffsetNumber
+ZSTidGetOffsetNumber(zstid tid)
+{
+ return (OffsetNumber) ((tid - 1) % (MaxZSTidOffsetNumber - 1) + 1);
+}
+
+#endif /* ZEDSTORE_TID_H */
diff --git a/src/include/access/zedstore_undolog.h b/src/include/access/zedstore_undolog.h
new file mode 100644
index 0000000000..da5cc67327
--- /dev/null
+++ b/src/include/access/zedstore_undolog.h
@@ -0,0 +1,108 @@
+/*
+ * zedstore_undolog.h
+ * internal declarations for ZedStore undo logging
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_undolog.h
+ */
+#ifndef ZEDSTORE_UNDOLOG_H
+#define ZEDSTORE_UNDOLOG_H
+
+#include "storage/buf.h"
+#include "storage/off.h"
+#include "utils/relcache.h"
+
+/*
+ * We don't support splitting UNDO records across pages, so the max size of
+ * an undo record is limited by block size.
+ */
+#define MaxUndoRecordSize (BLCKSZ - SizeOfPageHeaderData - sizeof(ZSUndoPageOpaque))
+
+/*
+ * An UNDO-pointer.
+ *
+ * In the "real" UNDO-logging work from EDB, an UndoRecPtr is only 64 bits.
+ * But we make life easier for ourselves by encoding more information in it.
+ *
+ * 'counter' is a number that's incremented every time a new undo record is
+ * created. It can be used to determine if an undo pointer is too old to be
+ * of interest to anyone.
+ *
+ * 'blkno' and 'offset' are the physical location of the UNDO record. They
+ * can be used to easily fetch a given record.
+ */
+typedef struct
+{
+ uint64 counter;
+ BlockNumber blkno;
+ int32 offset; /* int16 would suffice, but avoid padding */
+} ZSUndoRecPtr;
+
+/* TODO: assert that blkno and offset match, too, if counter matches */
+#define ZSUndoRecPtrEquals(a, b) ((a).counter == (b).counter)
+
+typedef struct
+{
+ BlockNumber next;
+ ZSUndoRecPtr first_undorecptr; /* note: this is set even if the page is empty! */
+ ZSUndoRecPtr last_undorecptr;
+ uint16 padding0; /* padding, to put zs_page_id last */
+ uint16 padding1; /* padding, to put zs_page_id last */
+ uint16 padding2; /* padding, to put zs_page_id last */
+ uint16 zs_page_id; /* ZS_UNDO_PAGE_ID */
+} ZSUndoPageOpaque;
+
+/*
+ * "invalid" undo pointer. The value is chosen so that an invalid pointer
+ * is less than any real UNDO pointer value. Therefore, a tuple with an
+ * invalid UNDO pointer is considered visible to everyone.
+ */
+static const ZSUndoRecPtr InvalidUndoPtr = {
+ .counter = 0,
+ .blkno = InvalidBlockNumber,
+ .offset = 0
+};
+
+/*
+ * A special value used on TID items, to mean that a tuple is not visible to
+ * anyone
+ */
+static const ZSUndoRecPtr DeadUndoPtr = {
+ .counter = 1,
+ .blkno = InvalidBlockNumber,
+ .offset = 0
+};
+
+static inline bool
+IsZSUndoRecPtrValid(ZSUndoRecPtr *uptr)
+{
+ return uptr->counter != 0;
+}
+
+/*
+ * zs_undo_reservation represents a piece of UNDO log that has been reserved for
+ * inserting a new UNDO record, but the UNDO record hasn't been written yet.
+ */
+typedef struct
+{
+ Buffer undobuf;
+ ZSUndoRecPtr undorecptr;
+ size_t length;
+
+ char *ptr;
+} zs_undo_reservation;
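+
+/*
+ * Intended flow, as a sketch: zsundo_insert_reserve() reserves 'size' bytes
+ * and fills in the reservation; the caller copies the record contents to
+ * reservation->ptr, and then calls zsundo_insert_finish() to complete the
+ * insertion.
+ */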
+
+/* prototypes for functions in zedstore_undolog.c */
+extern void zsundo_insert_reserve(Relation rel, size_t size, zs_undo_reservation *reservation_p);
+extern void zsundo_insert_finish(zs_undo_reservation *reservation);
+
+extern char *zsundo_fetch(Relation rel, ZSUndoRecPtr undoptr, Buffer *buf_p, int lockmode, bool missing_ok);
+
+extern void zsundo_discard(Relation rel, ZSUndoRecPtr oldest_undorecptr);
+
+extern void zsundo_newpage_redo(XLogReaderState *record);
+extern void zsundo_discard_redo(XLogReaderState *record);
+
+#endif /* ZEDSTORE_UNDOLOG_H */
diff --git a/src/include/access/zedstore_undorec.h b/src/include/access/zedstore_undorec.h
new file mode 100644
index 0000000000..07f307d483
--- /dev/null
+++ b/src/include/access/zedstore_undorec.h
@@ -0,0 +1,196 @@
+/*
+ * zedstore_undorec.h
+ * Declarations for different kinds of UNDO records in Zedstore.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_undorec.h
+ */
+#ifndef ZEDSTORE_UNDOREC_H
+#define ZEDSTORE_UNDOREC_H
+
+#include "access/zedstore_tid.h"
+#include "nodes/lockoptions.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+#include "utils/relcache.h"
+
+#define ZSUNDO_TYPE_INSERT 1
+#define ZSUNDO_TYPE_DELETE 2
+#define ZSUNDO_TYPE_UPDATE 3
+#define ZSUNDO_TYPE_TUPLE_LOCK 4
+
+struct ZSUndoRec
+{
+ int16 size; /* size of this record, including header */
+ uint8 type; /* ZSUNDO_TYPE_* */
+ ZSUndoRecPtr undorecptr;
+ TransactionId xid;
+ CommandId cid;
+
+ /*
+ * UNDO-record of the inserter. This is needed if a row is inserted, and
+ * deleted, and there are still some active snapshots that don't consider
+ * even the insertion as visible.
+ *
+ * This is also used in Insert records, if the record represents the
+ * new tuple version of an UPDATE, rather than an INSERT. It's needed to
+ * dig into possible KEY SHARE locks held on the row, which didn't prevent
+ * the tuple from being updated.
+ */
+ ZSUndoRecPtr prevundorec;
+};
+typedef struct ZSUndoRec ZSUndoRec;
+
+/*
+ * Type-specific record formats.
+ *
+ * We store similar info as zheap for INSERT/UPDATE/DELETE. See zheap README.
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+ zstid firsttid;
+ zstid endtid; /* exclusive */
+ uint32 speculative_token; /* Only used for INSERT records */
+
+} ZSUndoRec_Insert;
+
+#define ZSUNDO_NUM_TIDS_PER_DELETE 10
+
+typedef struct
+{
+ ZSUndoRec rec;
+
+ bool changedPart; /* tuple was moved to a different partition by UPDATE */
+
+ /*
+ * One deletion record can represent deleting up to
+ * ZSUNDO_NUM_TIDS_PER_DELETE tuples. The 'rec.tid' field is unused.
+ */
+ uint16 num_tids;
+ zstid tids[ZSUNDO_NUM_TIDS_PER_DELETE];
+
+ /*
+ * TODO: It might be good to move the deleted tuple to the undo-log, so
+ * that the space can immediately be reused. But currently, we don't do
+ * that. Or even better, move the old tuple to the undo-log lazily, if
+ * the space is needed for a new insertion, before the old tuple becomes
+ * recyclable.
+ */
+} ZSUndoRec_Delete;
+
+/*
+ * This is used for an UPDATE, to mark the old tuple version as updated.
+ * It's the same as a deletion, except this stores the TID of the new tuple
+ * version, so it can be followed in READ COMMITTED mode.
+ *
+ * The ZSUndoRec_Insert record is used for the insertion of the new tuple
+ * version.
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+
+ zstid oldtid;
+ zstid newtid;
+
+ bool key_update; /* were key columns updated?
+ * (for conflicting with FOR KEY SHARE) */
+
+} ZSUndoRec_Update;
+
+/*
+ * This is used when a tuple is locked e.g. with SELECT FOR UPDATE.
+ * The tuple isn't really changed in any way, but the undo record gives
+ * a place to store the XID of the locking transaction.
+ *
+ * In case of a FOR SHARE lock, there can be multiple lockers. Each locker
+ * will create a new undo record with its own XID that points to the previous
+ * record. So the records will form a chain, leading finally to the insertion
+ * record (or beyond the UNDO horizon, meaning the tuple's insertion is visible
+ * to everyone)
+ */
+typedef struct
+{
+ ZSUndoRec rec;
+ zstid tid;
+
+ /*
+ * XXX: Is it OK to store this on disk? The enum values could change. Then
+ * again, no one should care about old locks that were acquired before
+ * last restart. Except with two-phase commit prepared transactions.
+ */
+ LockTupleMode lockmode;
+} ZSUndoRec_TupleLock;
+
+/*
+ * These are used in WAL records, to represent insertion or modification
+ * of an UNDO record.
+ *
+ * We use this same record for all UNDO operations. It's a bit wasteful;
+ * if an existing UNDO record is modified, we wouldn't need to overwrite
+ * the whole record. Also, no need to WAL-log the command ids, because
+ * they don't matter after crash/replay.
+ */
+typedef struct
+{
+ ZSUndoRecPtr undoptr;
+ uint16 length;
+ bool is_update;
+} zs_wal_undo_op;
+
+#define SizeOfZSWalUndoOp (offsetof(zs_wal_undo_op, is_update) + sizeof(bool))
+
+/*
+ * zs_pending_undo_op encapsulates the insertion or modification of an UNDO
+ * record. The zsundo_create_* functions don't insert UNDO records directly,
+ * because the callers are not in a critical section yet, and may still need
+ * to abort. For example, to insert a new TID into the TID tree, we first
+ * construct the UNDO record for the insertion, and then lock the correct
+ * TID tree page to insert to. But if e.g. we need to split the TID page,
+ * we might still have to error out.
+ */
+struct zs_pending_undo_op
+{
+ zs_undo_reservation reservation;
+ bool is_update;
+
+ /*
+ * Payload to include in the WAL record. All the data here is redundant with
+ * the other fields in this struct, but we keep this copy here, so
+ * that we can register it as data in the WAL record.
+ */
+ zs_wal_undo_op waldata;
+
+ char payload[FLEXIBLE_ARRAY_MEMBER];
+};
+typedef struct zs_pending_undo_op zs_pending_undo_op;
+
+/* prototypes for functions in zedstore_undorec.c */
+extern struct ZSUndoRec *zsundo_fetch_record(Relation rel, ZSUndoRecPtr undorecptr);
+
+extern zs_pending_undo_op *zsundo_create_for_delete(Relation rel, TransactionId xid, CommandId cid, zstid tid,
+ bool changedPart, ZSUndoRecPtr prev_undo_ptr);
+extern zs_pending_undo_op *zsundo_create_for_insert(Relation rel, TransactionId xid, CommandId cid,
+ zstid tid, int nitems,
+ uint32 speculative_token, ZSUndoRecPtr prev_undo_ptr);
+extern zs_pending_undo_op *zsundo_create_for_update(Relation rel, TransactionId xid, CommandId cid,
+ zstid oldtid, zstid newtid, ZSUndoRecPtr prev_undo_ptr,
+ bool key_update);
+extern zs_pending_undo_op *zsundo_create_for_tuple_lock(Relation rel, TransactionId xid, CommandId cid,
+ zstid tid, LockTupleMode lockmode,
+ ZSUndoRecPtr prev_undo_ptr);
+extern void zsundo_finish_pending_op(zs_pending_undo_op *pendingop, char *payload);
+extern void zsundo_clear_speculative_token(Relation rel, ZSUndoRecPtr undoptr);
+
+extern void XLogRegisterUndoOp(uint8 block_id, zs_pending_undo_op *undo_op);
+extern Buffer XLogRedoUndoOp(XLogReaderState *record, uint8 block_id);
+
+struct VacuumParams;
+extern void zsundo_vacuum(Relation rel, struct VacuumParams *params, BufferAccessStrategy bstrategy,
+ TransactionId OldestXmin);
+extern ZSUndoRecPtr zsundo_get_oldest_undo_ptr(Relation rel, bool attempt_trim);
+
+#endif /* ZEDSTORE_UNDOREC_H */
diff --git a/src/include/access/zedstore_wal.h b/src/include/access/zedstore_wal.h
new file mode 100644
index 0000000000..8b496f3d13
--- /dev/null
+++ b/src/include/access/zedstore_wal.h
@@ -0,0 +1,220 @@
+/*
+ * zedstore_wal.h
+ * internal declarations for ZedStore wal logging
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstore_wal.h
+ */
+#ifndef ZEDSTORE_WAL_H
+#define ZEDSTORE_WAL_H
+
+#include "access/attnum.h"
+#include "access/xlogreader.h"
+#include "access/zedstore_tid.h"
+#include "access/zedstore_undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/off.h"
+
+#define WAL_ZEDSTORE_INIT_METAPAGE 0x00
+#define WAL_ZEDSTORE_UNDO_NEWPAGE 0x10
+#define WAL_ZEDSTORE_UNDO_DISCARD 0x20
+#define WAL_ZEDSTORE_BTREE_NEW_ROOT 0x30
+#define WAL_ZEDSTORE_BTREE_REWRITE_PAGES 0x40
+#define WAL_ZEDSTORE_TIDLEAF_ADD_ITEMS 0x50
+#define WAL_ZEDSTORE_TIDLEAF_REPLACE_ITEM 0x60
+#define WAL_ZEDSTORE_ATTSTREAM_CHANGE 0x70
+#define WAL_ZEDSTORE_TOAST_NEWPAGE 0x80
+#define WAL_ZEDSTORE_FPM_DELETE_PAGE 0x90
+#define WAL_ZEDSTORE_FPM_REUSE_PAGE 0xA0
+
+/* in zedstore_wal.c */
+extern void zedstore_redo(XLogReaderState *record);
+extern void zedstore_mask(char *pagedata, BlockNumber blkno);
+
+/* in zedstoredesc.c */
+extern void zedstore_desc(StringInfo buf, XLogReaderState *record);
+extern const char *zedstore_identify(uint8 info);
+
+/*
+ * WAL record for initializing zedstore metapage (WAL_ZEDSTORE_INIT_METAPAGE)
+ *
+ * These records always use a full-page image, so this data is really just
+ * for debugging purposes.
+ */
+typedef struct wal_zedstore_init_metapage
+{
+ int32 natts; /* number of attributes. */
+} wal_zedstore_init_metapage;
+
+#define SizeOfZSWalInitMetapage (offsetof(wal_zedstore_init_metapage, natts) + sizeof(int32))
+
+/*
+ * WAL record for extending the UNDO log with one page.
+ */
+typedef struct wal_zedstore_undo_newpage
+{
+ uint64 first_counter;
+} wal_zedstore_undo_newpage;
+
+#define SizeOfZSWalUndoNewPage (offsetof(wal_zedstore_undo_newpage, first_counter) + sizeof(uint64))
+
+/*
+ * WAL record for updating the oldest undo pointer on the metapage, after
+ * discarding an old portion of the UNDO log.
+ *
+ * blkref #0 is the metapage.
+ *
+ * If an old UNDO page was discarded away, advancing zs_undo_head, that page
+ * is stored as blkref #1. The new block number to store in zs_undo_head is
+ * stored as the data of blkref #0.
+ */
+typedef struct wal_zedstore_undo_discard
+{
+ ZSUndoRecPtr oldest_undorecptr;
+
+ /*
+ * Next oldest remaining block in the UNDO chain. This is not the same as
+ * oldest_undorecptr.block, if we are discarding multiple UNDO blocks. We will
+ * update oldest_undorecptr in the first iteration already, so that visibility
+ * checks can use the latest value immediately. But we can't hold a potentially
+ * unlimited number of pages locked while we mark them as deleted, so they are
+ * deleted one by one, and each deletion is WAL-logged separately.
+ */
+ BlockNumber oldest_undopage;
+} wal_zedstore_undo_discard;
+
+#define SizeOfZSWalUndoDiscard (offsetof(wal_zedstore_undo_discard, oldest_undopage) + sizeof(BlockNumber))
+
+/*
+ * WAL record for creating a new, empty, root page for an attribute.
+ */
+typedef struct wal_zedstore_btree_new_root
+{
+ AttrNumber attno; /* 0 means TID tree */
+} wal_zedstore_btree_new_root;
+
+#define SizeOfZSWalBtreeNewRoot (offsetof(wal_zedstore_btree_new_root, attno) + sizeof(AttrNumber))
+
+/*
+ * WAL record for replacing/adding items to the TID tree.
+ */
+typedef struct wal_zedstore_tidleaf_items
+{
+ int16 nitems;
+ OffsetNumber off;
+
+ /* the items follow */
+} wal_zedstore_tidleaf_items;
+
+#define SizeOfZSWalTidLeafItems (offsetof(wal_zedstore_tidleaf_items, off) + sizeof(OffsetNumber))
+
+/*
+ * WAL record for page splits, and other more complicated operations where
+ * we just rewrite whole pages.
+ *
+ * block #0 is UNDO buffer, if any.
+ * The rest are the b-tree pages (numpages).
+ */
+typedef struct wal_zedstore_btree_rewrite_pages
+{
+ int numpages;
+
+ /* one of these per page. */
+ struct
+ {
+ bool recycle;
+ bool special_only;
+ } pageinfo[FLEXIBLE_ARRAY_MEMBER];
+} wal_zedstore_btree_rewrite_pages;
+
+#define SizeOfZSWalBtreeRewritePages(numpages) (offsetof(wal_zedstore_btree_rewrite_pages, pageinfo[numpages]))
+
+/*
+ * WAL record for a change to attribute leaf page.
+ *
+ * Modifies an attribute stream stored on an attribute leaf page.
+ * If 'is_upper' is set, the change applies to the upper stream,
+ * between pd_upper and pd_special, otherwise it applies to the
+ * lower stream between page header and pd_lower.
+ *
+ * new_attstream_size is the new size of the attstream. At replay,
+ * pd_lower or pd_upper is adjusted to match the new size. If the
+ * size of the upper stream changes, any existing data in the upper
+ * area on the page is conceptually moved to the beginning of the upper
+ * area, before the replacement data in the record is applied.
+ *
+ * The block data 0 contains new data, which overwrites the data
+ * between begin_offset and end_offset. Not all data in the stream
+ * needs to be overwritten, that is, begin_offset and end_offset
+ * don't need to cover the whole stream. That allows efficiently
+ * appending data to an uncompressed stream. (It's also pretty
+ * effective for the compressed stream: if a stream is
+ * decompressed, some changes are made, and the stream is
+ * recompressed, the part before the change will usually re-compress
+ * to the same bytes.)
+ */
+typedef struct wal_zedstore_attstream_change
+{
+ bool is_upper;
+
+ /*
+ * These fields correspond to the fields in ZSAttStream. But
+ * we use smaller fields to save on WAL volume. (ZSAttStream
+ * uses larger fields for the size, so that the same struct
+ * can be used for longer streams than fit on disk, when passed
+ * around in memory.)
+ */
+ uint16 new_attstream_size;
+ uint16 new_decompressed_size;
+ uint16 new_decompressed_bufsize;
+ zstid new_lasttid;
+
+ uint16 begin_offset;
+ uint16 end_offset;
+} wal_zedstore_attstream_change;
+
+#define SizeOfZSWalAttstreamChange (offsetof(wal_zedstore_attstream_change, end_offset) + sizeof(uint16))
+
+/*
+ * WAL record for zedstore toasting. When a large datum spans multiple pages,
+ * we write one of these for every page. The chain will appear valid between
+ * every operation, except that the total size won't match the total size of
+ * all the pages until the last page is written.
+ *
+ * blkref 0: the new page being added
+ * blkref 1: the previous page in the chain
+ */
+typedef struct wal_zedstore_toast_newpage
+{
+ zstid tid;
+ AttrNumber attno;
+ int32 total_size;
+ int32 offset;
+} wal_zedstore_toast_newpage;
+
+#define SizeOfZSWalToastNewPage (offsetof(wal_zedstore_toast_newpage, offset) + sizeof(int32))
+
+typedef struct wal_zedstore_fpm_delete_page
+{
+ BlockNumber next_free_blkno;
+} wal_zedstore_fpm_delete_page;
+
+#define SizeOfZSWalFpmDeletePage (offsetof(wal_zedstore_fpm_delete_page, next_free_blkno) + sizeof(BlockNumber))
+
+typedef struct wal_zedstore_fpm_reuse_page
+{
+ BlockNumber next_free_blkno;
+} wal_zedstore_fpm_reuse_page;
+
+#define SizeOfZSWalFpmReusePage (offsetof(wal_zedstore_fpm_reuse_page, next_free_blkno) + sizeof(BlockNumber))
+
+extern void zsbt_tidleaf_items_redo(XLogReaderState *record, bool replace);
+extern void zsmeta_new_btree_root_redo(XLogReaderState *record);
+extern void zsbt_rewrite_pages_redo(XLogReaderState *record);
+extern void zstoast_newpage_redo(XLogReaderState *record);
+extern void zspage_delete_page_redo(XLogReaderState *record);
+extern void zspage_reuse_page_redo(XLogReaderState *record);
+
+#endif /* ZEDSTORE_WAL_H */
diff --git a/src/include/access/zedstoream.h b/src/include/access/zedstoream.h
new file mode 100644
index 0000000000..c707ad86d6
--- /dev/null
+++ b/src/include/access/zedstoream.h
@@ -0,0 +1,17 @@
+/*
+ * zedstoream.h
+ * public declarations for ZedStore
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/zedstoream.h
+ */
+#ifndef ZEDSTOREAM_H
+#define ZEDSTOREAM_H
+
+extern void AtEOXact_zedstore_tuplebuffers(bool isCommit);
+extern void AtSubStart_zedstore_tuplebuffers(void);
+extern void AtEOSubXact_zedstore_tuplebuffers(bool isCommit);
+
+#endif /* ZEDSTOREAM_H */
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 0f051277a6..5abe8751b0 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
{ oid => '3580', oid_symbol => 'BRIN_AM_OID',
descr => 'block range index (BRIN) access method',
amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '6668', oid_symbol => 'ZEDSTORE_TABLE_AM_OID',
+ descr => 'zedstore table access method',
+ amname => 'zedstore', amhandler => 'zedstore_tableam_handler', amtype => 't' },
]
diff --git a/src/include/catalog/pg_cast.dat b/src/include/catalog/pg_cast.dat
index 5a58f50fbb..32ac298662 100644
--- a/src/include/catalog/pg_cast.dat
+++ b/src/include/catalog/pg_cast.dat
@@ -530,4 +530,15 @@
{ castsource => 'jsonb', casttarget => 'float8', castfunc => 'float8(jsonb)',
castcontext => 'e', castmethod => 'f' },
+# zedstore
+{ castsource => 'tid', casttarget => 'zstid', castfunc => 'zstid(tid)',
+ castcontext => 'e', castmethod => 'f' },
+{ castsource => 'zstid', casttarget => 'int8', castfunc => 'int8(zstid)',
+ castcontext => 'e', castmethod => 'f' },
+{ castsource => 'int2', casttarget => 'zstid', castfunc => 'zstid(int2)',
+ castcontext => 'i', castmethod => 'f' },
+{ castsource => 'int4', casttarget => 'zstid', castfunc => 'zstid(int4)',
+ castcontext => 'i', castmethod => 'f' },
+{ castsource => 'int8', casttarget => 'zstid', castfunc => 'zstid(int8)',
+ castcontext => 'i', castmethod => 'f' },
]
diff --git a/src/include/catalog/pg_operator.dat b/src/include/catalog/pg_operator.dat
index ede7bb96ab..9dbe9b22a6 100644
--- a/src/include/catalog/pg_operator.dat
+++ b/src/include/catalog/pg_operator.dat
@@ -3269,4 +3269,30 @@
oprresult => 'bool', oprcode => 'jsonb_path_match_opr(jsonb,jsonpath)',
oprrest => 'matchingsel', oprjoin => 'matchingjoinsel' },
+# zedstore
+{ oid => '7060', descr => 'equal',
+ oprname => '=', oprcanmerge => 'f', oprcanhash => 'f', oprleft => 'zstid',
+ oprright => 'zstid', oprresult => 'bool', oprcom => '=(zstid,zstid)',
+ oprnegate => '<>(zstid,zstid)', oprcode => 'zstideq', oprrest => 'eqsel',
+ oprjoin => 'eqjoinsel' },
+{ oid => '7061', descr => 'not equal',
+ oprname => '<>', oprleft => 'zstid', oprright => 'zstid', oprresult => 'bool',
+ oprcom => '<>(zstid,zstid)', oprnegate => '=(zstid,zstid)', oprcode => 'zstidne',
+ oprrest => 'neqsel', oprjoin => 'neqjoinsel' },
+{ oid => '7062', descr => 'less than',
+ oprname => '<', oprleft => 'zstid', oprright => 'zstid', oprresult => 'bool',
+ oprcom => '>(zstid,zstid)', oprnegate => '>=(zstid,zstid)', oprcode => 'zstidlt',
+ oprrest => 'scalarltsel', oprjoin => 'scalarltjoinsel' },
+{ oid => '7063', descr => 'greater than',
+ oprname => '>', oprleft => 'zstid', oprright => 'zstid', oprresult => 'bool',
+ oprcom => '<(zstid,zstid)', oprnegate => '<=(zstid,zstid)', oprcode => 'zstidgt',
+ oprrest => 'scalargtsel', oprjoin => 'scalargtjoinsel' },
+{ oid => '7064', descr => 'less than or equal',
+ oprname => '<=', oprleft => 'zstid', oprright => 'zstid', oprresult => 'bool',
+ oprcom => '>=(zstid,zstid)', oprnegate => '>(zstid,zstid)', oprcode => 'zstidle',
+ oprrest => 'scalarlesel', oprjoin => 'scalarlejoinsel' },
+{ oid => '7065', descr => 'greater than or equal',
+ oprname => '>=', oprleft => 'zstid', oprright => 'zstid', oprresult => 'bool',
+ oprcom => '<=(zstid,zstid)', oprnegate => '<(zstid,zstid)', oprcode => 'zstidge',
+ oprrest => 'scalargesel', oprjoin => 'scalargejoinsel' },
]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c01da4bf01..7ecd644dcc 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -905,6 +905,10 @@
proname => 'heap_tableam_handler', provolatile => 'v',
prorettype => 'table_am_handler', proargtypes => 'internal',
prosrc => 'heap_tableam_handler' },
+{ oid => '6669', descr => 'column-oriented table access method handler',
+ proname => 'zedstore_tableam_handler', provolatile => 'v',
+ prorettype => 'table_am_handler', proargtypes => 'internal',
+ prosrc => 'zedstore_tableam_handler' },
# Index access method handlers
{ oid => '330', descr => 'btree index access method handler',
@@ -11005,4 +11009,98 @@
proname => 'is_normalized', prorettype => 'bool', proargtypes => 'text text',
prosrc => 'unicode_is_normalized' },
+# zedstore inspection functions
+{ oid => '7000', descr => 'get zedstore page type',
+ proname => 'pg_zs_page_type', prorettype => 'text',
+ proargtypes => 'regclass int8', prosrc => 'pg_zs_page_type' },
+{ oid => '7001', descr => 'show stats about active zedstore undo pages',
+ proname => 'pg_zs_undo_pages', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int4,int4,int8,int8}',
+ proargmodes => '{i,o,o,o,o,o}',
+ proargnames => '{relid,blkno,nrecords,freespace,firstrecptr,lastrecptr}',
+ prosrc => 'pg_zs_undo_pages' },
+{ oid => '7002', descr => 'show stats about zedstore btree pages',
+ proname => 'pg_zs_btree_pages', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int8,int4,int4,int8,int8,int4,int4,int4,int4,int4}',
+ proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{relid,blkno,nextblk,attno,level,lokey,hikey,nitems,ncompressed,totalsz,uncompressedsz,freespace}',
+ prosrc => 'pg_zs_btree_pages' },
+{ oid => '7003', descr => 'show stats about zedstore toast pages',
+ proname => 'pg_zs_toast_pages', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int8,int8,int8,int8,int8,int4,bool}',
+ proargmodes => '{i,o,o,o,o,o,o,o,o}',
+ proargnames => '{relid,blkno,tid,total_size,slice_offset,prev,next,decompressed_size,is_compressed}',
+ prosrc => 'pg_zs_toast_pages' },
+{ oid => '7004', descr => 'show stats about the zedstore meta page',
+ proname => 'pg_zs_meta_page', prorettype => 'record',
+ proargtypes => 'regclass',
+ proallargtypes => '{regclass,int8,int8,int8,int8,int8,int8,int4,int8,int4}',
+ proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{relid,blkno,undo_head,undo_tail,undo_tail_first_counter,undo_oldestpointer_counter,undo_oldestpointer_blkno,undo_oldestpointer_offset,fpm_head,flags}',
+ prosrc => 'pg_zs_meta_page' },
+{ oid => '7005', descr => 'dump individual chunks from a zedstore btree page',
+ proname => 'pg_zs_dump_attstreams', prorows => '1000', proretset => 't',
+ prorettype => 'record', proargtypes => 'regclass int8',
+ proallargtypes => '{regclass,int8,int2,int4,bool,bool,bool,int2,int4,int4,zstid,zstid,zstid,bytea,int4}',
+ proargmodes => '{i,i,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{relid,blkno,attno,chunkno,upperstream,compressed,attbyval,attlen,chunk_start,chunk_len,prevtid,firsttid,lasttid,chunk,itemcount}',
+ prosrc => 'pg_zs_dump_attstreams' },
+{ oid => '7006', descr => 'split zedstore chunks into value arrays',
+ proname => 'pg_zs_decode_chunk', prorettype => 'record',
+ proargtypes => 'bool int2 zstid zstid bytea',
+ proallargtypes => '{bool,int2,zstid,zstid,bytea,int4,_zstid,_bytea,_bool}',
+ proargmodes => '{i,i,i,i,i,o,o,o,o}',
+ proargnames => '{attbyval,attlen,prevtid,lasttid,chunk,num_elements,tids,datums,isnulls}',
+ prosrc => 'pg_zs_decode_chunk' },
+{ oid => '7007', descr => 'calculate the number of adjacent blocks per attribute',
+ proname => 'pg_zs_calculate_adjacent_block', prorettype => 'record',
+ proargtypes => 'regclass', prorows => '1000', proretset => 't',
+ proallargtypes => '{regclass,int4,int4,int4}',
+ proargmodes => '{i,o,o,o}',
+ proargnames => '{relid,attnum,num_runs,total_runs}',
+ prosrc => 'pg_zs_calculate_adjacent_block' },
+
+# zedstore
+{ oid => '7020', descr => 'input zstid',
+ proname => 'zstidin', prorettype => 'zstid', proargtypes => 'cstring',
+ prosrc => 'zstidin' },
+{ oid => '7021', descr => 'output zstid',
+ proname => 'zstidout', prorettype => 'cstring', proargtypes => 'zstid',
+ prosrc => 'zstidout' },
+{ oid => '7022',
+ proname => 'zstideq', proleakproof => 't', prorettype => 'bool',
+ proargtypes => 'zstid zstid', prosrc => 'zstideq' },
+{ oid => '7023',
+ proname => 'zstidle', proleakproof => 't', prorettype => 'bool',
+ proargtypes => 'zstid zstid', prosrc => 'zstidle' },
+{ oid => '7024',
+ proname => 'zstidge', proleakproof => 't', prorettype => 'bool',
+ proargtypes => 'zstid zstid', prosrc => 'zstidge' },
+{ oid => '7025',
+ proname => 'zstidgt', proleakproof => 't', prorettype => 'bool',
+ proargtypes => 'zstid zstid', prosrc => 'zstidgt' },
+{ oid => '7026',
+ proname => 'zstidlt', proleakproof => 't', prorettype => 'bool',
+ proargtypes => 'zstid zstid', prosrc => 'zstidlt' },
+{ oid => '7027',
+ proname => 'zstidne', proleakproof => 't', prorettype => 'bool',
+ proargtypes => 'zstid zstid', prosrc => 'zstidne' },
+{ oid => '7028', descr => 'convert tid to zstid',
+ proname => 'zstid', provolatile => 's', prorettype => 'zstid',
+ proargtypes => 'tid', prosrc => 'tidtozstid' },
+{ oid => '7029', descr => 'convert int2 to zstid',
+ proname => 'zstid', prorettype => 'zstid', proargtypes => 'int2',
+ prosrc => 'int2tozstid' },
+{ oid => '7030', descr => 'convert int4 to zstid',
+ proname => 'zstid', prorettype => 'zstid', proargtypes => 'int4',
+ prosrc => 'int4tozstid' },
+{ oid => '7031', descr => 'convert int8 to zstid',
+ proname => 'zstid', prorettype => 'zstid', proargtypes => 'int8',
+ prosrc => 'int8tozstid' },
+{ oid => '7032', descr => 'convert zstid to int8',
+ proname => 'int8', prorettype => 'int8', proargtypes => 'zstid',
+ prosrc => 'zstidtoint8' },
]
diff --git a/src/include/catalog/pg_statistic.h b/src/include/catalog/pg_statistic.h
index 40a7260165..a069546917 100644
--- a/src/include/catalog/pg_statistic.h
+++ b/src/include/catalog/pg_statistic.h
@@ -36,6 +36,9 @@ CATALOG(pg_statistic,2619,StatisticRelationId)
/* the fraction of the column's entries that are NULL: */
float4 stanullfrac;
+ /* this column's fraction of the total on-disk size of all columns: */
+ float4 stadiskfrac;
+
/*
* stawidth is the average width in bytes of non-null entries. For
* fixed-width datatypes this is of course the same as the typlen, but for
diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat
index 21a467a7a7..6bc55d7e1e 100644
--- a/src/include/catalog/pg_type.dat
+++ b/src/include/catalog/pg_type.dat
@@ -620,4 +620,10 @@
typoutput => 'anycompatiblerange_out', typreceive => '-', typsend => '-',
typalign => 'd', typstorage => 'x' },
+
+{ oid => '7050', array_type_oid => '7051',
+ descr => 'zedstore logical tuple id',
+ typname => 'zstid', typlen => '8', typbyval => 'FLOAT8PASSBYVAL',
+ typcategory => 'N', typinput => 'zstidin', typoutput => 'zstidout',
+ typreceive => '-', typsend => '-', typalign => 'd' },
]
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index a4cd721400..2cd609ff28 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -151,6 +151,12 @@ typedef struct VacAttrStats
int numvalues[STATISTIC_NUM_SLOTS];
Datum *stavalues[STATISTIC_NUM_SLOTS];
+ /*
+ * These fields are to be filled in by compute_disk_stats
+ */
+ float4 stadiskfrac; /* fraction of the physical size */
+ float8 disksize; /* value of the physical size */
+
/*
* These fields describe the stavalues[n] element types. They will be
* initialized to match attrtypid, but a custom typanalyze function might
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 0c48d2a519..c205755215 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -615,5 +615,11 @@ extern void CheckCmdReplicaIdentity(Relation rel, CmdType cmd);
extern void CheckSubscriptionRelkind(char relkind, const char *nspname,
const char *relname);
+extern void PopulateNeededColumnsForNode(Node *expr, int n, Bitmapset **scanCols);
+extern Bitmapset *PopulateNeededColumnsForScan(ScanState *scanstate, int ncol);
+extern Bitmapset *PopulateNeededColumnsForEPQ(EPQState *epqstate, int ncol);
+extern void PopulateNeededColumnsForOnConflictUpdate(ResultRelInfo *resultRelInfo);
#endif /* EXECUTOR_H */
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index ce37e342cd..71402ee286 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -330,7 +330,7 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
*/
#define DatumGetByteaP(X) ((bytea *) PG_DETOAST_DATUM(X))
#define DatumGetTextP(X) ((text *) PG_DETOAST_DATUM(X))
-#define DatumGetBpCharP(X) ((BpChar *) PG_DETOAST_DATUM(X))
+#define DatumGetBpharP(X) ((BpChar *) PG_DETOAST_DATUM(X))
#define DatumGetVarCharP(X) ((VarChar *) PG_DETOAST_DATUM(X))
#define PG_GETARG_BYTEA_P(n) DatumGetByteaP(PG_GETARG_DATUM(n))
#define PG_GETARG_TEXT_P(n) DatumGetTextP(PG_GETARG_DATUM(n))
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..ee8c2afc6c 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -18,6 +18,7 @@
/* To avoid including explain.h here, reference ExplainState thus: */
struct ExplainState;
+struct AnalyzeSampleContext;
/*
@@ -139,10 +140,8 @@ typedef void (*ExplainForeignModify_function) (ModifyTableState *mtstate,
typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
struct ExplainState *es);
-typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows,
- double *totaldeadrows);
+typedef void (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+ struct AnalyzeSampleContext *context);
typedef bool (*AnalyzeForeignTable_function) (Relation relation,
AcquireSampleRowsFunc *func,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6c0a7d68d6..7cb27cd105 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -385,6 +385,7 @@ typedef struct OnConflictSetState
TupleTableSlot *oc_ProjSlot; /* CONFLICT ... SET ... projection target */
ProjectionInfo *oc_ProjInfo; /* for ON CONFLICT DO UPDATE SET */
ExprState *oc_WhereClause; /* state for the WHERE clause */
+ Bitmapset *proj_cols; /* cols to be scanned during the operation */
} OnConflictSetState;
/*
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 7ef9b0eac0..63bd81f045 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1129,6 +1129,20 @@ typedef struct RangeTblEntry
Bitmapset *updatedCols; /* columns needing UPDATE permission */
Bitmapset *extraUpdatedCols; /* generated columns being updated */
List *securityQuals; /* security barrier quals to apply, if any */
+
+ /*
+ * scanCols: Columns to be retrieved during a physical scan.
+ * returningCols: Columns to be retrieved to satisfy the RETURNING clause.
+ *
+ * Please note: These bitmaps only deal with non-system columns (attnum >= 0)
+ *
+ * These bitmaps have some special values:
+ * - A singleton bitmap with the element 0 indicates that all non-system
+ * columns must be fetched.
+ * - An empty bitmap indicates that no non-system column must be fetched.
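+ * - Otherwise the members are plain attribute numbers; e.g. a bitmap
+ * containing just 2 and 4 means only those two columns need to be
+ * fetched.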
+ */
+ Bitmapset *scanCols; /* columns to be fetched during a physical scan */
+ Bitmapset *returningCols; /* columns in the RETURNING clause */
} RangeTblEntry;
/*
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 8f62d61702..96695d41d1 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -705,6 +705,7 @@ typedef struct RelOptInfo
PlannerInfo *subroot; /* if subquery */
List *subplan_params; /* if subquery */
int rel_parallel_workers; /* wanted number of parallel workers */
+ bool leverage_column_projection;
/* Information about foreign tables and foreign joins */
Oid serverid; /* identifies server for the table or join */
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index fb270df678..acca2a6074 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -319,6 +319,9 @@
/* Define to 1 if you have the `ldap_r' library (-lldap_r). */
#undef HAVE_LIBLDAP_R
+/* Define to 1 if you have the `lz4' library (-llz4). */
+#undef HAVE_LIBLZ4
+
/* Define to 1 if you have the `m' library (-lm). */
#undef HAVE_LIBM
@@ -358,6 +361,9 @@
/* Define to 1 if `long long int' works and is 64 bits. */
#undef HAVE_LONG_LONG_INT_64
+/* Define to 1 if you have the <lz4.h> header file. */
+#undef HAVE_LZ4_H
+
/* Define to 1 if you have the <mbarrier.h> header file. */
#undef HAVE_MBARRIER_H
@@ -881,6 +887,9 @@
/* Define to 1 to build with LLVM based JIT support. (--with-llvm) */
#undef USE_LLVM
+/* Define to 1 to build with LZ4 support. (--with-lz4) */
+#undef USE_LZ4
+
/* Define to select named POSIX semaphores. */
#undef USE_NAMED_POSIX_SEMAPHORES
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..f63fdd768f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -226,6 +226,7 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferInMode(Buffer buffer, int mode);
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 944f6fe6bd..77e57a3fca 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -195,6 +195,12 @@ typedef ItemPointerData *ItemPointer;
#define ItemPointerSetMovedPartitions(pointer) \
ItemPointerSet((pointer), MovedPartitionsBlockNumber, MovedPartitionsOffsetNumber)
+/* fmgr interface macros */
+#define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X) PointerGetDatum(X)
+#define PG_GETARG_ITEMPOINTER(n) DatumGetItemPointer(PG_GETARG_DATUM(n))
+#define PG_RETURN_ITEMPOINTER(x) return ItemPointerGetDatum(x)
+
/* ----------------
* externs
* ----------------
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index c5ffea40f2..ac1661ce0d 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -306,8 +306,16 @@ typedef struct StdRdOptions
int parallel_workers; /* max number of parallel workers */
bool vacuum_index_cleanup; /* enables index vacuuming and cleanup */
bool vacuum_truncate; /* enables vacuum to truncate a relation */
+ /*
+ * request zedstore_rel_extension_factor #blocks from storage manager
+ * whenever we need to extend the relation by one block for attribute/tid
+ * tree pages.
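+ * For example, a factor of 16 means each such extension asks the
+ * storage manager for 16 blocks at once.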
+ */
+ int zedstore_rel_extension_factor;
} StdRdOptions;
+#define ZEDSTORE_DEFAULT_REL_EXTENSION_FACTOR 1
+
#define HEAP_MIN_FILLFACTOR 10
#define HEAP_DEFAULT_FILLFACTOR 100
@@ -638,6 +646,20 @@ typedef struct ViewOptions
RelationNeedsWAL(relation) && \
!IsCatalogRelation(relation))
+static inline bool
+contains_whole_row_col(Bitmapset *cols)
+{
+ return bms_is_member(0, cols);
+}
+
+static inline Bitmapset *
+get_ordinal_attnos(Relation rel)
+{
+ Bitmapset *attnos = NULL;
+ attnos = bms_add_range(attnos, 1, RelationGetDescr(rel)->natts);
+ return attnos;
+}
+
/* routines in utils/cache/relcache.c */
extern void RelationIncrementReferenceCount(Relation rel);
extern void RelationDecrementReferenceCount(Relation rel);
diff --git a/src/test/isolation/expected/multiple-row-versions_1.out b/src/test/isolation/expected/multiple-row-versions_1.out
new file mode 100644
index 0000000000..f4f140cab1
--- /dev/null
+++ b/src/test/isolation/expected/multiple-row-versions_1.out
@@ -0,0 +1,25 @@
+Parsed test spec with 4 sessions
+
+starting permutation: rx1 wx2 c2 wx3 ry3 wy4 rz4 c4 c3 wz1 c1
+step rx1: SELECT * FROM t WHERE id = 1000000;
+id txt
+
+1000000
+step wx2: UPDATE t SET txt = 'b' WHERE id = 1000000;
+step c2: COMMIT;
+step wx3: UPDATE t SET txt = 'c' WHERE id = 1000000;
+step ry3: SELECT * FROM t WHERE id = 500000;
+id txt
+
+500000
+step wy4: UPDATE t SET txt = 'd' WHERE id = 500000;
+step rz4: SELECT * FROM t WHERE id = 1;
+id txt
+
+1
+step c4: COMMIT;
+step c3: COMMIT;
+ERROR: could not serialize access due to read/write dependencies among transactions
+step wz1: UPDATE t SET txt = 'a' WHERE id = 1;
+ERROR: could not serialize access due to read/write dependencies among transactions
+step c1: COMMIT;
diff --git a/src/test/isolation/expected/read-only-anomaly-2_1.out b/src/test/isolation/expected/read-only-anomaly-2_1.out
new file mode 100644
index 0000000000..2e36552031
--- /dev/null
+++ b/src/test/isolation/expected/read-only-anomaly-2_1.out
@@ -0,0 +1,45 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s2wx s2c s3c
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+ERROR: could not serialize access due to read/write dependencies among transactions
+step s2c: COMMIT;
+step s3c: COMMIT;
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s3r s3c s2wx
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s3r: SELECT id, balance FROM bank_account WHERE id IN ('X', 'Y') ORDER BY id;
+id balance
+
+X 0
+Y 20
+step s3c: COMMIT;
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+ERROR: could not serialize access due to read/write dependencies among transactions
diff --git a/src/test/isolation/expected/read-only-anomaly-3_1.out b/src/test/isolation/expected/read-only-anomaly-3_1.out
new file mode 100644
index 0000000000..d9a5a8e49d
--- /dev/null
+++ b/src/test/isolation/expected/read-only-anomaly-3_1.out
@@ -0,0 +1,27 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s3r s2wx s2c s3c
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s3r: SELECT id, balance FROM bank_account WHERE id IN ('X', 'Y') ORDER BY id; <waiting ...>
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+step s3r: <... completed>
+id balance
+
+X 0
+Y 20
+error in steps s2wx s3r: ERROR: could not serialize access due to read/write dependencies among transactions
+step s2c: COMMIT;
+step s3c: COMMIT;
diff --git a/src/test/isolation/expected/serializable-parallel_1.out b/src/test/isolation/expected/serializable-parallel_1.out
new file mode 100644
index 0000000000..2e36552031
--- /dev/null
+++ b/src/test/isolation/expected/serializable-parallel_1.out
@@ -0,0 +1,45 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s2wx s2c s3c
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+ERROR: could not serialize access due to read/write dependencies among transactions
+step s2c: COMMIT;
+step s3c: COMMIT;
+
+starting permutation: s2rx s2ry s1ry s1wy s1c s3r s3c s2wx
+step s2rx: SELECT balance FROM bank_account WHERE id = 'X';
+balance
+
+0
+step s2ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1ry: SELECT balance FROM bank_account WHERE id = 'Y';
+balance
+
+0
+step s1wy: UPDATE bank_account SET balance = 20 WHERE id = 'Y';
+step s1c: COMMIT;
+step s3r: SELECT id, balance FROM bank_account WHERE id IN ('X', 'Y') ORDER BY id;
+id balance
+
+X 0
+Y 20
+step s3c: COMMIT;
+step s2wx: UPDATE bank_account SET balance = -11 WHERE id = 'X';
+ERROR: could not serialize access due to read/write dependencies among transactions
diff --git a/src/test/isolation/expected/vacuum-reltuples_1.out b/src/test/isolation/expected/vacuum-reltuples_1.out
new file mode 100644
index 0000000000..151b4a3f63
--- /dev/null
+++ b/src/test/isolation/expected/vacuum-reltuples_1.out
@@ -0,0 +1,59 @@
+Parsed test spec with 2 sessions
+
+starting permutation: modify vac stats
+step modify:
+ insert into smalltbl select max(id)+1 from smalltbl;
+
+step vac:
+ vacuum smalltbl;
+
+step stats:
+ select relpages, reltuples from pg_class
+ where oid='smalltbl'::regclass;
+
+relpages reltuples
+
+4 21
+
+starting permutation: modify open fetch1 vac close stats
+step modify:
+ insert into smalltbl select max(id)+1 from smalltbl;
+
+step open:
+ begin;
+ declare c1 cursor for select 1 as dummy from smalltbl;
+
+step fetch1:
+ fetch next from c1;
+
+dummy
+
+1
+step vac:
+ vacuum smalltbl;
+
+step close:
+ commit;
+
+step stats:
+ select relpages, reltuples from pg_class
+ where oid='smalltbl'::regclass;
+
+relpages reltuples
+
+4 21
+
+starting permutation: modify vac stats
+step modify:
+ insert into smalltbl select max(id)+1 from smalltbl;
+
+step vac:
+ vacuum smalltbl;
+
+step stats:
+ select relpages, reltuples from pg_class
+ where oid='smalltbl'::regclass;
+
+relpages reltuples
+
+4 21
diff --git a/src/test/isolation/specs/read-only-anomaly-2.spec b/src/test/isolation/specs/read-only-anomaly-2.spec
index 9812f49ee4..2b17fcb521 100644
--- a/src/test/isolation/specs/read-only-anomaly-2.spec
+++ b/src/test/isolation/specs/read-only-anomaly-2.spec
@@ -18,13 +18,15 @@ teardown
}
session "s1"
-setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; }
+setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ SET enable_seqscan=off; }
step "s1ry" { SELECT balance FROM bank_account WHERE id = 'Y'; }
step "s1wy" { UPDATE bank_account SET balance = 20 WHERE id = 'Y'; }
step "s1c" { COMMIT; }
session "s2"
-setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; }
+setup { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ SET enable_seqscan=off; }
step "s2rx" { SELECT balance FROM bank_account WHERE id = 'X'; }
step "s2ry" { SELECT balance FROM bank_account WHERE id = 'Y'; }
step "s2wx" { UPDATE bank_account SET balance = -11 WHERE id = 'X'; }
diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index c830627b00..add3ee8cb7 100644
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -49,12 +49,18 @@ $(top_builddir)/src/port/pg_config_paths.h: | submake-libpgport
install: all installdirs
$(INSTALL_PROGRAM) pg_regress$(X) '$(DESTDIR)$(pgxsdir)/$(subdir)/pg_regress$(X)'
+ $(INSTALL_PROGRAM) '$(srcdir)/gpdiff.pl' '$(DESTDIR)$(pgxsdir)/$(subdir)/gpdiff.pl'
+ $(INSTALL_PROGRAM) '$(srcdir)/atmsort.pl' '$(DESTDIR)$(pgxsdir)/$(subdir)/atmsort.pl'
+ $(INSTALL_PROGRAM) '$(srcdir)/atmsort.pm' '$(DESTDIR)$(pgxsdir)/$(subdir)/atmsort.pm'
installdirs:
$(MKDIR_P) '$(DESTDIR)$(pgxsdir)/$(subdir)'
uninstall:
rm -f '$(DESTDIR)$(pgxsdir)/$(subdir)/pg_regress$(X)'
+ rm -f '$(DESTDIR)$(pgxsdir)/$(subdir)/gpdiff.pl'
+ rm -f '$(DESTDIR)$(pgxsdir)/$(subdir)/atmsort.pl'
+ rm -f '$(DESTDIR)$(pgxsdir)/$(subdir)/atmsort.pm'
# Build dynamically-loaded object file for CREATE FUNCTION ... LANGUAGE C.
diff --git a/src/test/regress/atmsort.pl b/src/test/regress/atmsort.pl
new file mode 100755
index 0000000000..142b998bce
--- /dev/null
+++ b/src/test/regress/atmsort.pl
@@ -0,0 +1,346 @@
+#!/usr/bin/env perl
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# Pod::Usage is loaded lazily when needed, if the --help or other such option
+# is actually used. Loading the module takes some time, which adds up when
+# running hundreds of regression tests, and gpdiff.pl calls this script twice
+# for every test. See lazy_pod2usage().
+#use Pod::Usage;
+
+use Getopt::Long;
+#use Data::Dumper; # only used by commented-out debug statements.
+use strict;
+use warnings;
+
+use File::Spec;
+
+use FindBin;
+use lib "$FindBin::Bin";
+use atmsort;
+
+=head1 NAME
+
+B<atmsort.pl> - [A] [T]est [M]echanism Sort: sort the contents of SQL log files to aid diff comparison
+
+=head1 SYNOPSIS
+
+B<atmsort.pl> [options] logfile [logfile...]
+
+Options:
+
+ -help brief help message
+ -man full documentation
+ -ignore_plans ignore explain plan content in query output
+ -init <file> load initialization file
+
+=head1 OPTIONS
+
+=over 8
+
+=item B<-help>
+
+ Prints a brief help message and exits.
+
+=item B<-man>
+
+ Prints the manual page and exits.
+
+=item B<-ignore_plans>
+
+Specify this option to ignore any explain plan diffs between the
+input files. This will completely ignore any plan content in
+the input files thus masking differences in plans between the input files.
+
+For example, for the following plan:
+explain select i from foo where i > 10;
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..2.72 rows=45 width=4)
+ -> Seq Scan on foo (cost=0.00..1.55 rows=45 width=4)
+ Filter: i > 10
+ Settings: optimizer=on
+(4 rows)
+
+atmsort.pl -ignore_plans will reduce this to:
+
+explain select i from foo where i > 10;
+QUERY PLAN
+___________
+REGRESS_IGNORE:{
+REGRESS_IGNORE: 'child' => [
+REGRESS_IGNORE: {
+REGRESS_IGNORE: 'id' => 2,
+REGRESS_IGNORE: 'parent' => 1,
+REGRESS_IGNORE: 'short' => 'Seq Scan on foo'
+REGRESS_IGNORE: }
+REGRESS_IGNORE: ],
+REGRESS_IGNORE: 'id' => 1,
+REGRESS_IGNORE: 'short' => 'Gather Motion'
+REGRESS_IGNORE:}
+REGRESS_IGNORE:(4 rows)
+
+
+=item B<-init> <file>
+
+Specify an initialization file containing a series of directives
+(mainly for match_subs) that get applied to the input files. To
+specify multiple initialization files, use multiple init arguments,
+eg:
+
+ -init file1 -init file2
+
+
+=back
+
+=head1 DESCRIPTION
+
+atmsort reads sql log files from STDIN and sorts the query output for
+all SELECT statements that do *not* have an ORDER BY, writing the
+result to STDOUT. This change to the log facilitates diff comparison,
+since unORDERed query output does not have a guaranteed order. Note
+that for diff to work correctly, statements that do use ORDER BY must
+have a fully-specified order.
+
+The log content must look something like:
+
+ SELECT a, b, c, d
+ from foo
+ ORDER BY 1,2,3,4;
+ a | b | c | d
+ ------------+-----------------+-----------+---------------
+ 1 | 1 | 1 | 1
+ 1 | 1 | 1 | 2
+ 3 | 2 | 2 | 5
+ (3 rows)
+
+The log file must contain SELECT statements, followed by the query
+output in the standard PostgreSQL format, ie a set of named columns, a
+separator line constructed of dashes and plus signs, and the rows,
+followed by an "(N rows)" row count. The SELECT statement must be
+unambiguous, eg no embedded SQL keywords like INSERT, UPDATE, or
+DELETE, and it must be terminated with a semicolon. Normally, the
+query output is sorted, but if the statement contains an ORDER BY
+clause the query output for that query is not sorted.
+
+=head2 EXPLAIN PLAN
+
+atmsort can also use explain.pl to process EXPLAIN and EXPLAIN ANALYZE
+output in a configuration-independent way. It strips out all timing,
+segment, and slice information, reducing the plan to a simple nested
+perl structure. For example, for the following plan:
+
+explain analyze select * from customer;
+
+ QUERY PLAN
+------------------------------------------------------------------------
+ Gather Motion 2:1 (slice1) (cost=0.00..698.88 rows=25088 width=550)
+ Rows out: 150000 rows at destination with 0.230 ms to first row,
+ 386 ms to end, start offset by 8.254 ms.
+ -> Seq Scan on customer (cost=0.00..698.88 rows=25088 width=550)
+ Rows out: Avg 75000.0 rows x 2 workers. Max 75001 rows (seg0)
+ with 0.056 ms to first row, 26 ms to end, start offset by 7.332 ms.
+ Slice statistics:
+ (slice0) Executor memory: 186K bytes.
+ (slice1) Executor memory: 130K bytes avg x 2 workers,
+ 130K bytes max (seg0).
+ Total runtime: 413.401 ms
+(8 rows)
+
+atmsort reduces the plan to:
+
+ QUERY PLAN
+------------------------------------------------------------------------
+{
+ 'child' => [
+ {
+ 'id' => 2,
+ 'parent' => 1,
+ 'short' => 'Seq Scan on customer'
+ }
+ ],
+ 'id' => 1,
+ 'short' => 'Gather Motion'
+ }
+(8 rows)
+
+
+=head2 Advanced Usage
+
+atmsort supports several "commands" that allow finer-grained control
+over the comparison process for SELECT queries. These commands are
+specified in comments in the following form:
+
+ --
+ -- order 1
+ --
+ SELECT a, b, c, d
+ from foo
+ ORDER BY 1;
+
+or
+
+ SELECT a, b, c, d
+ from foo
+ ORDER BY 1; -- order 1
+
+The supported commands are:
+
+=over 13
+
+=item -- order column number[, column number...]
+
+ The order directive is used to compare
+ "partially-ordered" query
+ output. The specified columns are assumed
+ to be ordered, and the remaining columns are
+ sorted to allow for deterministic comparison.
+
+=item -- order none
+
+ The order none directive can be used to specify that the SELECT's
+ output is not ordered. This can be necessary if the default
+ heuristic that checks if there is an ORDER BY in the query gets
+ fooled, e.g by an ORDER BY in a subquery that doesn't force the
+ overall result to be ordered.
+
+=item -- ignore
+
+The ignore directive prefixes the SELECT output with REGRESS_IGNORE. The
+diff command can use the -I flag to ignore lines with this prefix.
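+
+For instance, a comparison could then be run as (illustrative file names):
+
+  diff -I REGRESS_IGNORE expected/foo.out results/foo.out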
+
+=item -- mvd colnum[, colnum...] -> colnum[, colnum...] [; <additional specs>]
+
+mvd is designed to support Multi-Value Dependencies for OLAP queries.
+The syntax "col1,col2->col3,col4" indicates that the col1 and col2
+values determine the col3, col4 result order.
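+
+For example (hypothetical table and columns):
+
+  SELECT city, state, zip FROM addr; -- mvd 3->1,2
+
+declares that the zip value (column 3) determines the order of the city
+and state values (columns 1 and 2), so atmsort groups and sorts those
+columns according to the specification before comparing.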
+
+=item -- start_ignore
+
+Ignore all results until the next "end_ignore" directive. The
+start_ignore directive prefixes all subsequent output with REGRESS_IGNORE,
+and all other formatting directives are ignored as well. The diff
+command can use the -I flag to ignore lines with this prefix.
+
+=item -- end_ignore
+
+ Ends the ignored region that started with "start_ignore"
+
+=item -- start_matchsubs
+
+Starts a list of match/substitution expressions, where the match and
+substitution are specified as perl "m" and "s" operators for a single
+line of input. atmsort will compile the expressions and use them to
+process the current input file. The format is:
+
+ -- start_matchsubs
+ --
+ -- # first, a match expression
+ -- m/match this/
+ -- # next, a substitute expression
+ -- s/match this/substitute this/
+ --
+ -- # and can have more matchsubs after this...
+ --
+ -- end_matchsubs
+
+ Blank lines are ignored, and comments may be used if they are
+ prefixed with "#", the perl comment character, eg:
+
+ -- # this is a comment
+
+ Multiple match and substitute pairs may be specified. See "man
+ perlre" for more information on perl regular expressions.
+
+=item -- end_matchsubs
+
+ Ends the match/substitution region that started with "start_matchsubs"
+
+=item -- start_matchignore
+
+Similar to matchsubs, starts a list of match/ignore expressions as a
+set of perl match operators. Each line that matches one of the
+specified expressions is elided from the atmsort output. Note that
+there isn't an "ignore" expression -- just a list of individual match
+operators.
+
+=item -- end_matchignore
+
+ Ends the match/ignore region that started with "start_matchignore"
+
+=item -- force_explain
+
+Normally, atmsort can detect that a SQL query is being EXPLAINed, and
+the explain processing will happen automatically. However, if the
+query is complex, you may need to tag it with a comment to force the
+explain. Using this command for non-EXPLAIN statements is
+inadvisable.
+
+=back
+
+Note that you can combine the directives for a single query, but each
+directive must be on a separate line. Multiple mvd specifications
+must be on a single mvd line, separated by semicolons. Note that
+start_ignore overrides all directives until the next end_ignore.
+
+=head1 CAVEATS/LIMITATIONS
+
+atmsort cannot handle "unsorted" SELECT queries where the output has
+strings with embedded newlines or pipe ("|") characters due to
+limitations with the parser in the "tablelizer" function. Queries
+with these characteristics must have an ORDER BY clause to avoid
+potential erroneous comparison.
+
+=cut
+
+# Calls pod2usage, but loads the module first.
+sub lazy_pod2usage
+{
+ require Pod::Usage;
+ Pod::Usage::pod2usage(@_);
+}
+
+my $glob_id = "";
+
+my $glob_init;
+
+my $glob_orderwarn;
+my $glob_verbose;
+my $glob_fqo;
+
+my $man = 0;
+my $help = 0;
+my $ignore_plans;
+my @init_file;
+my $verbose;
+my $orderwarn;
+
+GetOptions(
+ 'help|?' => \$help, man => \$man,
+ 'gpd_ignore_plans|gp_ignore_plans|ignore_plans' => \$ignore_plans,
+ 'gpd_init|gp_init|init:s' => \@init_file,
+ 'order_warn|orderwarn' => \$orderwarn,
+ 'verbose' => \$verbose
+ )
+ or lazy_pod2usage(2);
+
+lazy_pod2usage(-msg => $glob_id, -exitstatus => 1) if $help;
+lazy_pod2usage(-msg => $glob_id, -exitstatus => 0, -verbose => 2) if $man;
+
+push @{$glob_init}, @init_file;
+
+my %args;
+
+$args{IGNORE_PLANS} = $ignore_plans if (defined ($ignore_plans));
+@{$args{INIT_FILES}} = @init_file if (scalar(@init_file));
+$args{ORDER_WARN} = $orderwarn if (defined ($orderwarn));
+$args{VERBOSE} = $verbose if (defined ($verbose));
+
+atmsort::atmsort_init(%args);
+
+atmsort::run_fhs(*STDIN, *STDOUT);
+
+exit();
diff --git a/src/test/regress/atmsort.pm b/src/test/regress/atmsort.pm
new file mode 100644
index 0000000000..bbee018255
--- /dev/null
+++ b/src/test/regress/atmsort.pm
@@ -0,0 +1,1371 @@
+#
+# This is the workhorse of atmsort.pl, extracted into a module so that it
+# can be called more efficiently from other perl programs.
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# Public interface:
+#
+# atmsort_init(args in a hash)
+#
+# followed by:
+#
+# run_fhs(input file handle, output file handle)
+# or
+# run(input filename, output filename)
+#
+package atmsort;
+
+#use Data::Dumper; # only used by commented-out debug statements.
+use strict;
+use warnings;
+use File::Temp qw/ tempfile /;
+
+# optional set of prefixes to identify sql statements, query output,
+# and sorted lines (for testing purposes)
+#my $apref = 'a: ';
+#my $bpref = 'b: ';
+#my $cpref = 'c: ';
+#my $dpref = 'S: ';
+my $apref = '';
+my $bpref = '';
+my $cpref = '';
+my $dpref = '';
+
+my $glob_ignore_plans;
+my @glob_init;
+
+my $glob_orderwarn;
+my $glob_verbose;
+my $glob_fqo;
+
+my $atmsort_outfh;
+
+sub atmsort_init
+{
+ my %args = (
+ # defaults
+ IGNORE_PLANS => 0,
+ INIT_FILES => [],
+ ORDER_WARN => 0,
+ VERBOSE => 0,
+
+ # override the defaults from argument list
+ @_
+ );
+
+ $glob_ignore_plans = 0;
+ @glob_init = ();
+
+ $glob_orderwarn = 0;
+ $glob_verbose = 0;
+ $glob_fqo = {count => 0};
+
+ my $ignore_plans;
+ my @init_file;
+ my $verbose;
+ my $orderwarn;
+
+ $glob_ignore_plans = $args{IGNORE_PLANS};
+
+ @glob_init = @{$args{INIT_FILES}};
+
+ $glob_orderwarn = $args{ORDER_WARN};
+ $glob_verbose = $args{VERBOSE};
+
+ _process_init_files();
+}
+
+sub _process_init_files
+{
+ # allow multiple init files
+ if (@glob_init)
+ {
+ my $devnullfh;
+ my $init_file_fh;
+
+ open $devnullfh, "> /dev/null" or die "can't open /dev/null: $!";
+
+ for my $init_file (@glob_init)
+ {
+ die "no such file: $init_file"
+ unless (-e $init_file);
+
+ # Perform initialization from this init_file by passing it
+ # to bigloop. Open the file, and pass that as the input file
+ # handle, and redirect output to /dev/null.
+ open $init_file_fh, "< $init_file" or die "could not open $init_file: $!";
+
+ atmsort_bigloop($init_file_fh, $devnullfh);
+
+ close $init_file_fh;
+ }
+
+ close $devnullfh;
+ }
+}
+
+my $glob_match_then_sub_fnlist;
+
+sub _build_match_subs
+{
+ my ($here_matchsubs, $whomatch) = @_;
+
+ my $stat = [1];
+
+ # filter out the comments and blank lines
+ $here_matchsubs =~ s/^\s*(?:#.*)?(?:[\r\n]|\x0D\x0A)//gm;
+
+ # split up the document into separate lines
+ my @foo = split(/\n/, $here_matchsubs);
+
+ my $ii = 0;
+
+ my $matchsubs_arr = [];
+ my $msa;
+
+ # build an array of arrays of match/subs pairs
+ while ($ii < scalar(@foo))
+ {
+ my $lin = $foo[$ii];
+
+ if (defined($msa))
+ {
+ push @{$msa}, $lin;
+
+ push @{$matchsubs_arr}, $msa;
+
+ undef $msa;
+ }
+ else
+ {
+ $msa = [$lin];
+ }
+ $ii++;
+ next;
+ } # end while
+
+# print Data::Dumper->Dump($matchsubs_arr);
+
+ my $bigdef;
+
+ my $fn1;
+
+ # build a lambda function for each expression, and load it into an
+ # array
+ my $mscount = 1;
+
+ for my $defi (@{$matchsubs_arr})
+ {
+ unless (2 == scalar(@{$defi}))
+ {
+ my $err1 = "bad definition: " . Data::Dumper->Dump([$defi]);
+ $stat->[0] = 1;
+ $stat->[1] = $err1;
+ return $stat;
+ }
+
+ $bigdef = '$fn1 = sub { my $ini = shift; '. "\n";
+ $bigdef .= 'if ($ini =~ ' . $defi->[0];
+ $bigdef .= ') { ' . "\n";
+# $bigdef .= 'print "match\n";' . "\n";
+ $bigdef .= '$ini =~ ' . $defi->[1];
+ $bigdef .= '; }' . "\n";
+ $bigdef .= 'return $ini; }' . "\n";
+
+# print $bigdef;
+
+ if (eval $bigdef)
+ {
+ my $cmt = $whomatch . " matchsubs \#" . $mscount;
+ $mscount++;
+
+ # store the function pointer and the text of the function
+ # definition
+ push @{$glob_match_then_sub_fnlist},
+ [$fn1, $bigdef, $cmt, $defi->[0], $defi->[1]];
+
+ if ($glob_verbose && defined $atmsort_outfh)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: Defined $cmt\t$defi->[0]\t$defi->[1]\n"
+ }
+ }
+ else
+ {
+ my $err1 = "bad eval: $bigdef";
+ $stat->[0] = 1;
+ $stat->[1] = $err1;
+ return $stat;
+ }
+
+ }
+
+# print Data::Dumper->Dump($glob_match_then_sub_fnlist);
+
+ return $stat;
+
+} # end _build_match_subs
+
+sub match_then_subs
+{
+ my $ini = shift;
+
+ for my $ff (@{$glob_match_then_sub_fnlist})
+ {
+ # get the function and execute it
+ my $fn1 = $ff->[0];
+ if (!$glob_verbose)
+ {
+ $ini = &$fn1($ini);
+ }
+ else
+ {
+ my $subs = &$fn1($ini);
+ unless ($subs eq $ini)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: was: $ini";
+ print $atmsort_outfh "REGRESS_IGNORE: matched $ff->[-3]\t$ff->[-2]\t$ff->[-1]\n"
+ }
+
+ $ini = &$fn1($ini);
+ }
+
+ }
+ return $ini;
+}
+
+my $glob_match_then_ignore_fnlist;
+
+sub _build_match_ignores
+{
+ my ($here_matchignores, $whomatch) = @_;
+
+ my $stat = [1];
+
+ # filter out the comments and blank lines
+ $here_matchignores =~ s/^\s*(?:#.*)?(?:[\r\n]|\x0D\x0A)//gm;
+
+ # split up the document into separate lines
+ my @matchignores_arr = split(/\n/, $here_matchignores);
+
+ my $bigdef;
+
+ my $fn1;
+
+ # build a lambda function for each expression, and load it into an
+ # array
+ my $mscount = 1;
+
+ for my $defi (@matchignores_arr)
+ {
+ $bigdef = '$fn1 = sub { my $ini = shift; '. "\n";
+ $bigdef .= 'return ($ini =~ ' . $defi;
+ $bigdef .= ') ; } ' . "\n";
+# print $bigdef;
+
+ if (eval $bigdef)
+ {
+ my $cmt = $whomatch . " matchignores \#" . $mscount;
+ $mscount++;
+
+ # store the function pointer and the text of the function
+ # definition
+ push @{$glob_match_then_ignore_fnlist},
+ [$fn1, $bigdef, $cmt, $defi, "(ignore)"];
+ if ($glob_verbose && defined $atmsort_outfh)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: Defined $cmt\t$defi\n"
+ }
+
+ }
+ else
+ {
+ my $err1 = "bad eval: $bigdef";
+ $stat->[0] = 1;
+ $stat->[1] = $err1;
+ return $stat;
+ }
+
+ }
+
+# print Data::Dumper->Dump($glob_match_then_ignore_fnlist);
+
+ return $stat;
+
+} # end _build_match_ignores
+
+# if the input matches, return 1 (ignore), else return 0 (keep)
+sub match_then_ignore
+{
+ my $ini = shift;
+
+ for my $ff (@{$glob_match_then_ignore_fnlist})
+ {
+ # get the function and execute it
+ my $fn1 = $ff->[0];
+
+ if (&$fn1($ini))
+ {
+ if ($glob_verbose)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: matched $ff->[-3]\t$ff->[-2]\t$ff->[-1]\n"
+ }
+ return 1; # matched
+ }
+ }
+ return 0; # no match
+}
+
+# convert a postgresql psql formatted table into an array of hashes
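+# For example, a (hypothetical) two-column result
+#     a | b
+#    ---+---
+#     1 | 2
+# becomes [ { 1 => '1', 2 => '2' } ]: one hash per row, keyed by 1-based
+# column position, with leading whitespace stripped from each cell.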
+sub tablelizer
+{
+ my ($ini, $got_line1) = @_;
+
+ # first, split into separate lines, the find all the column headings
+
+ my @lines = split(/\n/, $ini);
+
+ return undef
+ unless (scalar(@lines));
+
+ # if the first line is supplied, then it has the column headers,
+ # so don't try to find them (or the ---+---- separator) in
+ # "lines"
+ my $line1 = $got_line1;
+ $line1 = shift @lines
+ unless (defined($got_line1));
+
+ # look for <space>|<space>
+ my @colheads = split(/\s+\|\s+/, $line1);
+
+ # fixup first, last column head (remove leading,trailing spaces)
+
+ $colheads[0] =~ s/^(\s+|\s+$)//;
+ $colheads[-1] =~ s/^(\s+|\s+$)//;
+
+ return undef
+ unless (scalar(@lines));
+
+ shift @lines # skip dashed separator (unless it was skipped already)
+ unless (defined($got_line1));
+
+ my @rows;
+
+ for my $lin (@lines)
+ {
+ my @cols = split(/\|/, $lin, scalar(@colheads));
+ last
+ unless (scalar(@cols) == scalar(@colheads));
+
+ my $rowh = {};
+
+ for my $colhdcnt (0..(scalar(@colheads)-1))
+ {
+ my $rawcol = shift @cols;
+
+ $rawcol =~ s/^(\s+|\s+$)//;
+
+ my $colhd = $colheads[$colhdcnt];
+ $rowh->{($colhdcnt+1)} = $rawcol;
+ }
+ push @rows, $rowh;
+ }
+
+ return \@rows;
+}
+# reformat the EXPLAIN output according to the directive hash
+sub format_explain
+{
+ my ($outarr, $directive) = @_;
+ my $prefix = "";
+ my $xopt = "perl"; # normal case
+
+ $directive = {} unless (defined($directive));
+
+ # Ignore plan content if its between start_ignore and end_ignore blocks
+ # or if -ignore_plans is specified.
+ $prefix = "REGRESS_IGNORE:"
+ if (exists($directive->{ignore})) || ($glob_ignore_plans);
+
+ my @tmp_lines;
+
+ if (scalar(@{$outarr}))
+ {
+ @tmp_lines = (
+ "QUERY PLAN\n",
+ ("-" x 71) . "\n",
+ @{$outarr},
+ "(111 rows)\n"
+ );
+ }
+
+ # Apply prefix to each line, if requested.
+ if (defined($prefix) && length($prefix))
+ {
+ foreach my $line (@tmp_lines)
+ {
+ $line = $prefix . $line;
+ }
+ }
+
+ # Put back newlines and print
+ foreach my $line (@tmp_lines)
+ {
+ print $atmsort_outfh $line;
+ }
+
+ return \@tmp_lines;
+}
+
+# reformat the query output according to the directive hash
+sub format_query_output
+{
+ my ($fqostate, $has_order, $outarr, $directive) = @_;
+ my $prefix = "";
+
+ $directive = {} unless (defined($directive));
+
+ $fqostate->{count} += 1;
+
+ if ($glob_verbose)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: start fqo $fqostate->{count}\n";
+ }
+
+ # EXPLAIN
+ #
+ # EXPLAIN (COSTS OFF) output is *not* processed. The output with COSTS OFF
+ # shouldn't contain anything that varies across runs, and shouldn't need
+ # sanitizing.
+ #
+ # However when -ignore_plans is specified we also need to process
+ # EXPLAIN (COSTS OFF) to ignore the segments information.
+ if (exists($directive->{explain})
+ && ($glob_ignore_plans
+ || $directive->{explain} ne 'costs_off')
+ && (!exists($directive->{explain_processing})
+ || ($directive->{explain_processing} =~ m/on/)))
+ {
+ format_explain($outarr, $directive);
+ if ($glob_verbose)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: end fqo $fqostate->{count}\n";
+ }
+ return;
+ }
+
+ $prefix = "REGRESS_IGNORE:"
+ if (exists($directive->{ignore}));
+
+ if (exists($directive->{sortlines}))
+ {
+ my $firstline = $directive->{firstline};
+ my $ordercols = $directive->{order};
+ my $mvdlist = $directive->{mvd};
+
+ # lines already have newline terminator, so just rejoin them.
+ my $lines = join ("", @{$outarr});
+
+ my $ah1 = tablelizer($lines, $firstline);
+
+ unless (defined($ah1) && scalar(@{$ah1}))
+ {
+# print "No tablelizer hash for $lines, $firstline\n";
+# print STDERR "No tablelizer hash for $lines, $firstline\n";
+
+ if ($glob_verbose)
+ {
+ print $atmsort_outfh "REGRESS_IGNORE: end fqo $fqostate->{count}\n";
+ }
+
+ return;
+ }
+
+ my @allcols = sort (keys(%{$ah1->[0]}));
+
+ my @presortcols;
+ if (defined($ordercols) && length($ordercols))
+ {
+# $ordercols =~ s/^.*order\s*//;
+ $ordercols =~ s/\n//gm;
+ $ordercols =~ s/\s//gm;
+
+ @presortcols = split(/\s*\,\s*/, $ordercols);
+ }
+
+ my @mvdcols;
+ my @mvd_deps;
+ my @mvd_nodeps;
+ my @mvdspec;
+ if (defined($mvdlist) && length($mvdlist))
+ {
+ $mvdlist =~ s/\n//gm;
+ $mvdlist =~ s/\s//gm;
+
+ # find all the mvd specifications (separated by semicolons)
+ my @allspecs = split(/\;/, $mvdlist);
+
+# print "allspecs:", Data::Dumper->Dump(\@allspecs);
+
+ for my $item (@allspecs)
+ {
+ my $realspec;
+ # split the specification list, separating the
+ # specification columns on the left hand side (LHS)
+ # from the "dependent" columns on the right hand side (RHS)
+ my @colset = split(/\-\>/, $item, 2);
+ unless (scalar(@colset) == 2)
+ {
+ print $atmsort_outfh "invalid colset for $item\n";
+ print STDERR "invalid colset for $item\n";
+ next;
+ }
+ # specification columns (LHS)
+ my @scols = split(/\,/, $colset[0]);
+ unless (scalar(@scols))
+ {
+ print $atmsort_outfh "invalid dependency specification: $colset[0]\n";
+ print STDERR
+ "invalid dependency specification: $colset[0]\n";
+ next;
+ }
+ # dependent columns (RHS)
+ my @dcols = split(/\,/, $colset[1]);
+ unless (scalar(@dcols))
+ {
+ print $atmsort_outfh "invalid specified dependency: $colset[1]\n";
+ print STDERR "invalid specified dependency: $colset[1]\n";
+ next;
+ }
+ $realspec = {};
+ my $scol2 = [];
+ my $dcol2 = [];
+ my $sdcol = [];
+ $realspec->{spec} = $item;
+ push @{$scol2}, @scols;
+ push @{$dcol2}, @dcols;
+ push @{$sdcol}, @scols, @dcols;
+ $realspec->{scol} = $scol2;
+ $realspec->{dcol} = $dcol2;
+ $realspec->{allcol} = $sdcol;
+
+ push @mvdcols, @scols, @dcols;
+ # find all the dependent columns
+ push @mvd_deps, @dcols;
+ push @mvdspec, $realspec;
+ }
+
+ # find all the mvd cols which are *not* dependent. Need
+ # to handle the case of self-dependency, eg "mvd 1->1", so
+ # must build set of all columns, then strip out the
+ # "dependent" cols. So this is the set of all LHS columns
+ # which are never on the RHS.
+ my %get_nodeps;
+
+ for my $col (@mvdcols)
+ {
+ $get_nodeps{$col} = 1;
+ }
+
+ # remove dependent cols
+ for my $col (@mvd_deps)
+ {
+ if (exists($get_nodeps{$col}))
+ {
+ delete $get_nodeps{$col};
+ }
+ }
+ # now sorted and unique, with no dependents
+ @mvd_nodeps = sort (keys(%get_nodeps));
+# print "mvdspec:", Data::Dumper->Dump(\@mvdspec);
+# print "mvd no deps:", Data::Dumper->Dump(\@mvd_nodeps);
+ }
+
+ my %unsorth = map { $_ => 1 } @allcols;
+
+ # clear sorted column list if just "order 0"
+ if ((1 == scalar(@presortcols))
+ && ($presortcols[0] eq "0"))
+ {
+ @presortcols = ();
+ }
+
+
+ for my $col (@presortcols)
+ {
+ if (exists($unsorth{$col}))
+ {
+ delete $unsorth{$col};
+ }
+ }
+ for my $col (@mvdcols)
+ {
+ if (exists($unsorth{$col}))
+ {
+ delete $unsorth{$col};
+ }
+ }
+ my @unsortcols = sort(keys(%unsorth));
+
+# print Data::Dumper->Dump([$ah1]);
+
+ if (scalar(@presortcols))
+ {
+ my $hd1 = "sorted columns " . join(", ", @presortcols);
+
+ print $hd1, "\n", "-"x(length($hd1)), "\n";
+
+ for my $h_row (@{$ah1})
+ {
+ my @collist;
+
+ @collist = ();
+
+# print "hrow:",Data::Dumper->Dump([$h_row]), "\n";
+
+ for my $col (@presortcols)
+ {
+# print "col: ($col)\n";
+ if (exists($h_row->{$col}))
+ {
+ push @collist, $h_row->{$col};
+ }
+ else
+ {
+ my $maxcol = scalar(@allcols);
+ my $errstr =
+ "specified ORDER column out of range: $col vs $maxcol\n";
+ print $atmsort_outfh $errstr;
+ print STDERR $errstr;
+ last;
+ }
+ }
+ print $atmsort_outfh $prefix, join(' | ', @collist), "\n";
+ }
+ }
+
+ if (scalar(@mvdspec))
+ {
+ my @outi;
+
+ my $hd1 = "multivalue dependency specifications";
+
+ print $hd1, "\n", "-"x(length($hd1)), "\n";
+
+ for my $mspec (@mvdspec)
+ {
+ $hd1 = $mspec->{spec};
+ print $hd1, "\n", "-"x(length($hd1)), "\n";
+
+ for my $h_row (@{$ah1})
+ {
+ my @collist;
+
+ @collist = ();
+
+# print "hrow:",Data::Dumper->Dump([$h_row]), "\n";
+
+ for my $col (@{$mspec->{allcol}})
+ {
+# print "col: ($col)\n";
+ if (exists($h_row->{$col}))
+ {
+ push @collist, $h_row->{$col};
+ }
+ else
+ {
+ my $maxcol = scalar(@allcols);
+ my $errstr =
+ "specified MVD column out of range: $col vs $maxcol\n";
+ print $errstr;
+ print STDERR $errstr;
+ last;
+ }
+
+ }
+ push @outi, join(' | ', @collist);
+ }
+ my @ggg= sort @outi;
+
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $prefix, $line, "\n";
+ }
+ @outi = ();
+ }
+ }
+ my $hd2 = "unsorted columns " . join(", ", @unsortcols);
+
+ # the "unsorted" comparison must include all columns which are
+ # not sorted or part of an mvd specification, plus the sorted
+ # columns, plus the non-dependent mvd columns which aren't
+ # already in the list
+ if ((scalar(@presortcols))
+ || scalar(@mvd_nodeps))
+ {
+ if (scalar(@presortcols))
+ {
+ if (scalar(@mvd_deps))
+ {
+ my %get_presort;
+
+ for my $col (@presortcols)
+ {
+ $get_presort{$col} = 1;
+ }
+ # remove "dependent" (RHS) columns
+ for my $col (@mvd_deps)
+ {
+ if (exists($get_presort{$col}))
+ {
+ delete $get_presort{$col};
+ }
+ }
+ # now sorted and unique, minus all mvd dependent cols
+ @presortcols = sort (keys(%get_presort));
+
+ }
+
+ if (scalar(@presortcols))
+ {
+ $hd2 .= " ( " . join(", ", @presortcols) . ")";
+ # have to compare all columns as unsorted
+ push @unsortcols, @presortcols;
+ }
+ }
+ if (scalar(@mvd_nodeps))
+ {
+ my %get_nodeps;
+
+ for my $col (@mvd_nodeps)
+ {
+ $get_nodeps{$col} = 1;
+ }
+ # remove "nodeps" which are already in the output list
+ for my $col (@unsortcols)
+ {
+ if (exists($get_nodeps{$col}))
+ {
+ delete $get_nodeps{$col};
+ }
+ }
+ # now sorted and unique, minus all unsorted/sorted cols
+ @mvd_nodeps = sort (keys(%get_nodeps));
+ if (scalar(@mvd_nodeps))
+ {
+ $hd2 .= " (( " . join(", ", @mvd_nodeps) . "))";
+ # have to compare all columns as unsorted
+ push @unsortcols, @mvd_nodeps;
+ }
+
+ }
+
+ }
+
+ print $hd2, "\n", "-"x(length($hd2)), "\n";
+
+ my @finalunsort;
+
+ if (scalar(@unsortcols))
+ {
+ for my $h_row (@{$ah1})
+ {
+ my @collist;
+
+ @collist = ();
+
+ for my $col (@unsortcols)
+ {
+ if (exists($h_row->{$col}))
+ {
+ push @collist, $h_row->{$col};
+ }
+ else
+ {
+ my $maxcol = scalar(@allcols);
+ my $errstr =
+ "specified UNSORT column out of range: $col vs $maxcol\n";
+ print $errstr;
+ print STDERR $errstr;
+ last;
+ }
+
+ }
+ push @finalunsort, join(' | ', @collist);
+ }
+ my @ggg= sort @finalunsort;
+
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $prefix, $line, "\n";
+ }
+ }
+
+ if ($glob_verbose)
+ {
+ print "REGRESS_IGNORE: end fqo $fqostate->{count}\n";
+ }
+
+ return;
+ } # end order
+
+
+ if ($has_order)
+ {
+ my @ggg= @{$outarr};
+
+ if ($glob_orderwarn)
+ {
+ # If no ordering cols specified (no directive), and SELECT has
+ # ORDER BY, see if number of order by cols matches all cols in
+ # selected lists. Treat the order by cols as a comma separated
+ # list and count them. Works ok for simple ORDER BY clauses
+ if (defined($directive->{sql_statement}))
+ {
+ my @ocols = ($directive->{sql_statement} =~ m/select.*order\s+by\s+(.*)\;/ism);
+
+ if (scalar(@ocols))
+ {
+ my $fl2 = $directive->{firstline};
+ # lines already have newline terminator, so just rejoin them.
+ my $line2 = join ("", @{$outarr});
+
+ my $ah2 = tablelizer($line2, $fl2);
+ if (defined($ah2) && scalar(@{$ah2}))
+ {
+ my $allcol_count = scalar(keys(%{$ah2->[0]}));
+
+ # In order to count the number of ORDER BY columns we
+ # can transliterate over comma and increment by one to
+ # account for the last column not having a trailing
+ # comma. This is faster than splitting over the comma
+ # since we don't need to allocate the returned array.
+ my $ocol_count = ($ocols[0] =~ tr/,//) + 1;
+
+ if ($ocol_count < $allcol_count)
+ {
+ print "REGRESS_IGNORE: ORDER_WARNING: OUTPUT ",
+ $allcol_count, " columns, but ORDER BY on ",
+ $ocol_count, " \n";
+ }
+ }
+ }
+ }
+ } # end if $glob_orderwarn
+
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $dpref, $prefix, $line;
+ }
+ }
+ else
+ {
+ my @ggg= sort @{$outarr};
+
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $bpref, $prefix, $line;
+ }
+ }
+
+ if ($glob_verbose)
+ {
+ print "REGRESS_IGNORE: end fqo $fqostate->{count}\n";
+ }
+}
+
+
+# The caller passes in already-opened input and output file handles.
+sub atmsort_bigloop
+{
+ my $infh = shift;
+ $atmsort_outfh = shift;
+
+ my $sql_statement = "";
+ my @outarr;
+
+ my $getrows = 0;
+ my $getstatement = 0;
+ my $has_order = 0;
+ my $copy_to_stdout_result = 0;
+ my $describe_mode = 0;
+ my $directive = {};
+ my $big_ignore = 0;
+ my %define_match_expression;
+
+ print $atmsort_outfh "REGRESS_IGNORE: formatted by atmsort.pm\n";
+
+ L_bigwhile:
+ while (<$infh>) # big while
+ {
+ reprocess_row:
+ my $ini = $_;
+
+ # look for match/substitution or match/ignore expressions
+ if (%define_match_expression)
+ {
+ if ($ini =~ m/\-\-\s*end\_match(subs|ignore)\s*$/)
+ {
+ if ($define_match_expression{"type"} ne $1)
+ {
+ die "Non-matching operation end_match" . $1 . ", " .
+ "expected end_match" . $define_match_expression{"type"};
+ }
+ }
+ else
+ {
+ $define_match_expression{"expr"} .= $ini;
+ goto L_push_outarr;
+ }
+
+ my @foo = split(/\n/, $define_match_expression{"expr"}, 2);
+
+ unless (2 == scalar(@foo))
+ {
+ $ini .= "REGRESS_IGNORE: bad match definition\n";
+ undef %define_match_expression;
+ goto L_push_outarr;
+ }
+
+ my $stat;
+
+ my $doc1 = $foo[1];
+
+ # strip off leading comment characters
+ $doc1 =~ s/^\s*\-\-//gm;
+
+ if ($define_match_expression{"type"} eq 'subs')
+ {
+ $stat = _build_match_subs($doc1, "USER");
+ }
+ else
+ {
+ $stat = _build_match_ignores($doc1, "USER");
+ }
+
+ if (scalar(@{$stat}) > 1)
+ {
+ my $outi = $stat->[1];
+
+ # print a message showing the error
+ $outi =~ s/^(.*)/REGRESS_IGNORE: ($1)/gm;
+ $ini .= $outi;
+ }
+ else
+ {
+ $ini .= "REGRESS_IGNORE: defined new match expression\n";
+ }
+
+ undef %define_match_expression;
+ goto L_push_outarr;
+ } # end defined match expression
+
+ if ($big_ignore > 0)
+ {
+ if ($ini =~ m/\-\-\s*end\_ignore\s*$/)
+ {
+ $big_ignore--;
+ }
+ print $atmsort_outfh "REGRESS_IGNORE:", $ini;
+ next;
+ }
+
+ if ($getrows) # getting rows from SELECT output
+ {
+ # The end of "result set" for a COPY TO STDOUT is a bit tricky
+ # to find. There is no explicit marker for it. We look for a
+ # line that looks like a SQL comment or a new query, or an ERROR.
+ # This is not bullet-proof, but works for the current tests.
+ if ($copy_to_stdout_result &&
+ ($ini =~ m/(?:\-\-|ERROR|copy|create|drop|select|insert|update)/i))
+ {
+ my @ggg = sort @outarr;
+ for my $line (@ggg)
+ {
+ print $atmsort_outfh $bpref, $line;
+ }
+
+ @outarr = ();
+ $getrows = 0;
+ $has_order = 0;
+ $copy_to_stdout_result = 0;
+
+ # Process the row again, in case it begins another
+ # COPY TO STDOUT statement, or another query.
+ goto reprocess_row;
+ }
+
+ my $end_of_table = 0;
+
+ if ($describe_mode)
+ {
+ # \d tables don't always end with a row count, and there may be
+ # more than one of them per command. So we allow any of the
+ # following to end the table:
+ # - a blank line
+ # - a row that doesn't have the same number of column separators
+ # as the header line
+ # - a row count (checked below)
+ if ($ini =~ m/^$/)
+ {
+ $end_of_table = 1;
+ }
+ elsif (exists($directive->{firstline}))
+ {
+ # Count the number of column separators in the table header
+ # and our current line.
+ my $headerSeparators = ($directive->{firstline} =~ tr/\|//);
+ my $lineSeparators = ($ini =~ tr/\|//);
+
+ if ($headerSeparators != $lineSeparators)
+ {
+ $end_of_table = 1;
+ }
+ }
+
+ # Don't reset describe_mode at the end of the table; there may
+ # be more tables still to go.
+ }
+
+ # regex example: (5 rows)
+ if ($ini =~ m/^\s*\(\d+\s+row(?:s)*\)\s*$/)
+ {
+ # Always ignore the rowcount for explain plan output as the
+ # skeleton plans might be the same even if the row counts
+ # differ because of session level GUCs.
+ if (exists($directive->{explain}))
+ {
+ $ini = 'REGRESS_IGNORE:' . $ini;
+ }
+
+ $end_of_table = 1;
+ }
+
+ if ($end_of_table)
+ {
+ format_query_output($glob_fqo,
+ $has_order, \@outarr, $directive);
+
+ $directive = {};
+ @outarr = ();
+ $getrows = 0;
+ $has_order = 0;
+ }
+ }
+ else # finding SQL statement or start of SELECT output
+ {
+ # To avoid hunting for gpdiff commands which are contained inside
+ # comments first establish if the line contains a comment with any
+ # trailing characters at all.
+ my $has_comment = ((m/\s*\-\-.+$/) ? 1 : 0);
+
+ if ($has_comment && $ini =~ m/\-\-\s*start\_match(subs|ignore)\s*$/)
+ {
+ $define_match_expression{"type"} = $1;
+ $define_match_expression{"expr"} = $ini;
+ goto L_push_outarr;
+ }
+ if ($has_comment && ($ini =~ m/\-\-\s*start\_ignore\s*$/))
+ {
+ $big_ignore += 1;
+
+ for my $line (@outarr)
+ {
+ print $atmsort_outfh $apref, $line;
+ }
+ @outarr = ();
+
+ print $atmsort_outfh 'REGRESS_IGNORE:', $ini;
+ next;
+ }
+
+ # EXPLAIN (COSTS OFF) ...
+ if ($ini =~ m/explain\s*\(.*costs\s+off.*\)/i)
+ {
+ $directive->{explain} = "costs_off";
+ }
+ # Note: \d is for the psql "describe"
+ elsif ($ini =~ m/(?:insert|update|delete|select|^\s*\\d|copy|execute)/i)
+ {
+ $copy_to_stdout_result = 0;
+ $has_order = 0;
+ $sql_statement = "";
+
+ if ($ini =~ m/explain.*(?:insert|update|delete|select|execute)/i)
+ {
+ $directive->{explain} = 'normal';
+ }
+
+ # Should we apply more heuristics to try to find the end of \d
+ # output?
+ $describe_mode = ($ini =~ m/^\s*\\d/);
+ }
+
+ # Catching multiple commands and capturing the parens matches
+ # makes it possible to check just the first character since
+ # each command has a unique first character. This allows us to
+ # use fewer regular expression matches in this hot section.
+ if ($has_comment &&
+ $ini =~ m/\-\-\s*((force_explain)\s*(operator)?\s*$|(ignore)\s*$|(order)\s+(\d+|none).*$|(mvd)\s+\d+.*$|(explain_processing_(on|off))\s+.*$)/)
+ {
+ my $full_command = $1;
+ my $cmd = substr($full_command, 0, 1);
+ if ($cmd eq 'i')
+ {
+ $directive->{ignore} = 'ignore';
+ }
+ elsif ($cmd eq 'o')
+ {
+ my $olist = $ini;
+ $olist =~ s/^.*\-\-\s*order//;
+ if ($olist =~ /none/)
+ {
+ $directive->{order_none} = 1;
+ }
+ else
+ {
+ $directive->{order} = $olist;
+ }
+ }
+ elsif ($cmd eq 'f')
+ {
+ if (defined($3))
+ {
+ $directive->{explain} = 'operator';
+ }
+ else
+ {
+ $directive->{explain} = 'normal';
+ }
+ }
+ elsif ($cmd eq 'e')
+ {
+ $full_command =~ m/(on|off)$/;
+ $directive->{explain_processing} = $1;
+ }
+ else
+ {
+ my $olist = $ini;
+ $olist =~ s/^.*\-\-\s*mvd//;
+ $directive->{mvd} = $olist;
+ }
+ }
+
+ if ($ini =~ m/select/i)
+ {
+ $getstatement = 1;
+ $sql_statement .= $ini;
+ }
+ if (index($ini, ';') != -1)
+ {
+ $getstatement = 0;
+ }
+
+ # prune notices with segment info if they are duplicates
+ if ($ini =~ m/^\s*(?:NOTICE|ERROR|HINT|DETAIL|WARNING)\:.*\(seg.*pid.*\)/)
+ {
+ $ini =~ s/\s+(?:\W)?(?:\W)?\(seg.*pid.*\)//;
+
+ my $outsize = scalar(@outarr);
+
+ my $lastguy = -1;
+
+ L_checkfor:
+ for my $jj (1..$outsize)
+ {
+ my $checkstr = $outarr[$lastguy];
+
+ #remove trailing spaces for comparison
+ $checkstr =~ s/\s+$//;
+
+ my $skinny = $ini;
+ $skinny =~ s/\s+$//;
+
+ # stop when no more notices
+ last L_checkfor
+ if ($checkstr !~ m/^\s*(?:NOTICE|ERROR|HINT|DETAIL|WARNING)\:/);
+
+ # discard this line if matches a previous notice
+ if ($skinny eq $checkstr)
+ {
+ if (0) # debug code
+ {
+ $ini = "DUP: " . $ini;
+ last L_checkfor;
+ }
+ next L_bigwhile;
+ }
+ $lastguy--;
+ } # end for
+
+ } # end if pruning notices
+
+ # MPP-1492 allow:
+ # copy (select ...) to stdout
+ # \copy (select ...) to stdout
+ # and special case these guys:
+ # copy test1 to stdout
+ # \copy test1 to stdout
+ my $matches_copy_to_stdout = 0;
+ if ($ini =~ m/^(?:\\)?copy\s+(?:(?:\(select.*\))|\S+)\s+to stdout.*$/i)
+ {
+ $matches_copy_to_stdout = 1;
+ }
+
+ # Try to detect the beginning of result set, as printed by psql
+ #
+ # Examples:
+ #
+ # hdr
+ # ----------
+ #
+ # a | b
+ # ---+---
+ #
+ # The previous line should be the header. It should have a space at the
+ # beginning and end. This line should consist of dashes and plus signs,
+ # with at least three dashes for each column.
+ #
+ if (($matches_copy_to_stdout && $ini !~ m/order by/i) ||
+ (scalar(@outarr) > 1 && $outarr[-1] =~ m/^\s+.*\s$/ &&
+ $ini =~ m/^(?:(?:\-\-)(?:\-)+(?:\+(?:\-)+)*)$/))
+ # special case for copy select
+ { # sort this region
+
+ $directive->{firstline} = $outarr[-1];
+
+ if (exists($directive->{order}) ||
+ exists($directive->{mvd}))
+ {
+ $directive->{sortlines} = $outarr[-1];
+ }
+
+ # special case for copy select
+ if ($matches_copy_to_stdout)
+ {
+ $copy_to_stdout_result = 1;
+ $sql_statement = "";
+ }
+ # special case for explain
+ if (exists($directive->{explain}) &&
+ ($ini =~ m/^\s*(?:(?:\-\-)(?:\-)+(?:\+(?:\-)+)*)+\s*$/) &&
+ (scalar(@outarr) && $outarr[-1] =~ m/QUERY PLAN/))
+ {
+ # ENGINF-88: fixup explain headers
+ $outarr[-1] = "QUERY PLAN\n";
+ $ini = ("_" x length($outarr[-1])) . "\n";
+ }
+
+ $getstatement = 0;
+
+ for my $line (@outarr)
+ {
+ print $atmsort_outfh $apref, $line;
+ }
+ @outarr = ();
+
+ print $atmsort_outfh $apref, $ini;
+
+ # If there is an ORDER BY in the query, then the results must
+ # be in the order that we have memorized in the expected
+ # output. Otherwise, the order of the rows is not
+ # well-defined, so we sort them before comparing, to mask out
+ # any differences in the order.
+ #
+ # This isn't foolproof, and will get fooled by ORDER BYs in
+ # subqueries, for example. But it catches the common cases.
+ if (defined($directive->{explain}))
+ {
+ $has_order = 1; # Do not reorder EXPLAIN output
+ }
+ elsif (defined($sql_statement)
+ && length($sql_statement)
+ && !defined($directive->{order_none})
+ # multiline match
+ && ($sql_statement =~ m/select.*order.*by/is))
+ {
+ # There was an ORDER BY. But if it was part of an
+ # "agg() OVER (ORDER BY ...)" or "WITHIN GROUP (ORDER BY
+ # ...)" construct, ignore it, because those constructs
+ # don't mean that the final result has to be in order.
+ my $t = $sql_statement;
+ $t =~ s/over\s*\(order\s+by.*\)/xx/isg;
+ $t =~ s/over\s*\((partition\s+by.*)?order\s+by.*\)/xx/isg;
+ $t =~ s/window\s+\w+\s+as\s+\((partition\s+by.*)?order\s+by.*\)/xx/isg;
+ $t =~ s/within\s+group\s*\((order\s+by.*)\)/xx/isg;
+
+ if ($t =~ m/order\s+by/is)
+ {
+ $has_order = 1; # so do *not* sort output
+ }
+ else
+ {
+ $has_order = 0; # need to sort query output
+ }
+ }
+ else
+ {
+ $has_order = 0; # need to sort query output
+ }
+ $directive->{sql_statement} = $sql_statement;
+ $sql_statement = '';
+
+ $getrows = 1;
+ next;
+ } # end sort this region
+ } # end finding SQL
+
+ # if MATCH then SUBSTITUTE
+ # see HERE document for definitions
+ $ini = match_then_subs($ini);
+
+ # if MATCH then IGNORE
+ # see HERE document for definitions
+ if ( match_then_ignore($ini))
+ {
+ next; # ignore matching lines
+ }
+
+L_push_outarr:
+
+ push @outarr, $ini;
+
+ } # end big while
+
+ for my $line (@outarr)
+ {
+ print $atmsort_outfh $cpref, $line;
+ }
+} # end bigloop
+
+
+# The argument is the input filename. The output filename is returned as it
+# is generated in this function to avoid races around the temporary filename
+# creation.
+sub run
+{
+ my $infname = shift;
+
+ open my $infh, '<', $infname or die "could not open $infname: $!";
+ my ($outfh, $outfname) = tempfile();
+
+ run_fhs($infh, $outfh);
+
+ close $infh;
+ close $outfh;
+ return $outfname;
+}
+
+# The arguments are input and output file handles
+sub run_fhs
+{
+ my $infh = shift;
+ my $outfh = shift;
+
+
+ # loop over input file.
+ atmsort_bigloop($infh, $outfh);
+}
+
+1;
diff --git a/src/test/regress/expected/.gitignore b/src/test/regress/expected/.gitignore
index 93c56c85a0..0eb6984372 100644
--- a/src/test/regress/expected/.gitignore
+++ b/src/test/regress/expected/.gitignore
@@ -5,5 +5,6 @@
/largeobject.out
/largeobject_1.out
/misc.out
+/misc_1.out
/security_label.out
/tablespace.out
diff --git a/src/test/regress/expected/alter_table_1.out b/src/test/regress/expected/alter_table_1.out
new file mode 100644
index 0000000000..eb16f38cf4
--- /dev/null
+++ b/src/test/regress/expected/alter_table_1.out
@@ -0,0 +1,4425 @@
+--
+-- ALTER_TABLE
+--
+-- Clean up in case a prior regression run failed
+SET client_min_messages TO 'warning';
+DROP ROLE IF EXISTS regress_alter_table_user1;
+RESET client_min_messages;
+CREATE USER regress_alter_table_user1;
+--
+-- add attribute
+--
+CREATE TABLE attmp (initial int4);
+COMMENT ON TABLE attmp_wrong IS 'table comment';
+ERROR: relation "attmp_wrong" does not exist
+COMMENT ON TABLE attmp IS 'table comment';
+COMMENT ON TABLE attmp IS NULL;
+ALTER TABLE attmp ADD COLUMN xmin integer; -- fails
+ERROR: column name "xmin" conflicts with a system column name
+ALTER TABLE attmp ADD COLUMN a int4 default 3;
+ALTER TABLE attmp ADD COLUMN b name;
+ALTER TABLE attmp ADD COLUMN c text;
+ALTER TABLE attmp ADD COLUMN d float8;
+ALTER TABLE attmp ADD COLUMN e float4;
+ALTER TABLE attmp ADD COLUMN f int2;
+ALTER TABLE attmp ADD COLUMN g polygon;
+ALTER TABLE attmp ADD COLUMN i char;
+ALTER TABLE attmp ADD COLUMN k int4;
+ALTER TABLE attmp ADD COLUMN l tid;
+ALTER TABLE attmp ADD COLUMN m xid;
+ALTER TABLE attmp ADD COLUMN n oidvector;
+--ALTER TABLE attmp ADD COLUMN o lock;
+ALTER TABLE attmp ADD COLUMN p boolean;
+ALTER TABLE attmp ADD COLUMN q point;
+ALTER TABLE attmp ADD COLUMN r lseg;
+ALTER TABLE attmp ADD COLUMN s path;
+ALTER TABLE attmp ADD COLUMN t box;
+ALTER TABLE attmp ADD COLUMN v timestamp;
+ALTER TABLE attmp ADD COLUMN w interval;
+ALTER TABLE attmp ADD COLUMN x float8[];
+ALTER TABLE attmp ADD COLUMN y float4[];
+ALTER TABLE attmp ADD COLUMN z int2[];
+INSERT INTO attmp (a, b, c, d, e, f, g, i, k, l, m, n, p, q, r, s, t,
+ v, w, x, y, z)
+ VALUES (4, 'name', 'text', 4.1, 4.1, 2, '(4.1,4.1,3.1,3.1)',
+ 'c',
+ 314159, '(1,1)', '512',
+ '1 2 3 4 5 6 7 8', true, '(1.1,1.1)', '(4.1,4.1,3.1,3.1)',
+ '(0,2,4.1,4.1,3.1,3.1)', '(4.1,4.1,3.1,3.1)',
+ 'epoch', '01:00:10', '{1.0,2.0,3.0,4.0}', '{1.0,2.0,3.0,4.0}', '{1,2,3,4}');
+SELECT * FROM attmp;
+ initial | a | b | c | d | e | f | g | i | k | l | m | n | p | q | r | s | t | v | w | x | y | z
+---------+---+------+------+-----+-----+---+-----------------------+---+--------+-------+-----+-----------------+---+-----------+-----------------------+-----------------------------+---------------------+--------------------------+------------------+-----------+-----------+-----------
+ | 4 | name | text | 4.1 | 4.1 | 2 | ((4.1,4.1),(3.1,3.1)) | c | 314159 | (1,1) | 512 | 1 2 3 4 5 6 7 8 | t | (1.1,1.1) | [(4.1,4.1),(3.1,3.1)] | ((0,2),(4.1,4.1),(3.1,3.1)) | (4.1,4.1),(3.1,3.1) | Thu Jan 01 00:00:00 1970 | @ 1 hour 10 secs | {1,2,3,4} | {1,2,3,4} | {1,2,3,4}
+(1 row)
+
+DROP TABLE attmp;
+-- the wolf bug - schema mods caused inconsistent row descriptors
+CREATE TABLE attmp (
+ initial int4
+);
+ALTER TABLE attmp ADD COLUMN a int4;
+ALTER TABLE attmp ADD COLUMN b name;
+ALTER TABLE attmp ADD COLUMN c text;
+ALTER TABLE attmp ADD COLUMN d float8;
+ALTER TABLE attmp ADD COLUMN e float4;
+ALTER TABLE attmp ADD COLUMN f int2;
+ALTER TABLE attmp ADD COLUMN g polygon;
+ALTER TABLE attmp ADD COLUMN i char;
+ALTER TABLE attmp ADD COLUMN k int4;
+ALTER TABLE attmp ADD COLUMN l tid;
+ALTER TABLE attmp ADD COLUMN m xid;
+ALTER TABLE attmp ADD COLUMN n oidvector;
+--ALTER TABLE attmp ADD COLUMN o lock;
+ALTER TABLE attmp ADD COLUMN p boolean;
+ALTER TABLE attmp ADD COLUMN q point;
+ALTER TABLE attmp ADD COLUMN r lseg;
+ALTER TABLE attmp ADD COLUMN s path;
+ALTER TABLE attmp ADD COLUMN t box;
+ALTER TABLE attmp ADD COLUMN v timestamp;
+ALTER TABLE attmp ADD COLUMN w interval;
+ALTER TABLE attmp ADD COLUMN x float8[];
+ALTER TABLE attmp ADD COLUMN y float4[];
+ALTER TABLE attmp ADD COLUMN z int2[];
+INSERT INTO attmp (a, b, c, d, e, f, g, i, k, l, m, n, p, q, r, s, t,
+ v, w, x, y, z)
+ VALUES (4, 'name', 'text', 4.1, 4.1, 2, '(4.1,4.1,3.1,3.1)',
+ 'c',
+ 314159, '(1,1)', '512',
+ '1 2 3 4 5 6 7 8', true, '(1.1,1.1)', '(4.1,4.1,3.1,3.1)',
+ '(0,2,4.1,4.1,3.1,3.1)', '(4.1,4.1,3.1,3.1)',
+ 'epoch', '01:00:10', '{1.0,2.0,3.0,4.0}', '{1.0,2.0,3.0,4.0}', '{1,2,3,4}');
+SELECT * FROM attmp;
+ initial | a | b | c | d | e | f | g | i | k | l | m | n | p | q | r | s | t | v | w | x | y | z
+---------+---+------+------+-----+-----+---+-----------------------+---+--------+-------+-----+-----------------+---+-----------+-----------------------+-----------------------------+---------------------+--------------------------+------------------+-----------+-----------+-----------
+ | 4 | name | text | 4.1 | 4.1 | 2 | ((4.1,4.1),(3.1,3.1)) | c | 314159 | (1,1) | 512 | 1 2 3 4 5 6 7 8 | t | (1.1,1.1) | [(4.1,4.1),(3.1,3.1)] | ((0,2),(4.1,4.1),(3.1,3.1)) | (4.1,4.1),(3.1,3.1) | Thu Jan 01 00:00:00 1970 | @ 1 hour 10 secs | {1,2,3,4} | {1,2,3,4} | {1,2,3,4}
+(1 row)
+
+CREATE INDEX attmp_idx ON attmp (a, (d + e), b);
+ALTER INDEX attmp_idx ALTER COLUMN 0 SET STATISTICS 1000;
+ERROR: column number must be in range from 1 to 32767
+LINE 1: ALTER INDEX attmp_idx ALTER COLUMN 0 SET STATISTICS 1000;
+ ^
+ALTER INDEX attmp_idx ALTER COLUMN 1 SET STATISTICS 1000;
+ERROR: cannot alter statistics on non-expression column "a" of index "attmp_idx"
+HINT: Alter statistics on table column instead.
+ALTER INDEX attmp_idx ALTER COLUMN 2 SET STATISTICS 1000;
+\d+ attmp_idx
+ Index "public.attmp_idx"
+ Column | Type | Key? | Definition | Storage | Stats target
+--------+------------------+------+------------+---------+--------------
+ a | integer | yes | a | plain |
+ expr | double precision | yes | (d + e) | plain | 1000
+ b | cstring | yes | b | plain |
+btree, for table "public.attmp"
+
+ALTER INDEX attmp_idx ALTER COLUMN 3 SET STATISTICS 1000;
+ERROR: cannot alter statistics on non-expression column "b" of index "attmp_idx"
+HINT: Alter statistics on table column instead.
+ALTER INDEX attmp_idx ALTER COLUMN 4 SET STATISTICS 1000;
+ERROR: column number 4 of relation "attmp_idx" does not exist
+ALTER INDEX attmp_idx ALTER COLUMN 2 SET STATISTICS -1;
+DROP TABLE attmp;
+--
+-- rename - check on both non-temp and temp tables
+--
+CREATE TABLE attmp (regtable int);
+CREATE TEMP TABLE attmp (attmptable int);
+ALTER TABLE attmp RENAME TO attmp_new;
+SELECT * FROM attmp;
+ regtable
+----------
+(0 rows)
+
+SELECT * FROM attmp_new;
+ attmptable
+------------
+(0 rows)
+
+ALTER TABLE attmp RENAME TO attmp_new2;
+SELECT * FROM attmp; -- should fail
+ERROR: relation "attmp" does not exist
+LINE 1: SELECT * FROM attmp;
+ ^
+SELECT * FROM attmp_new;
+ attmptable
+------------
+(0 rows)
+
+SELECT * FROM attmp_new2;
+ regtable
+----------
+(0 rows)
+
+DROP TABLE attmp_new;
+DROP TABLE attmp_new2;
+-- check rename of partitioned tables and indexes also
+CREATE TABLE part_attmp (a int primary key) partition by range (a);
+CREATE TABLE part_attmp1 PARTITION OF part_attmp FOR VALUES FROM (0) TO (100);
+ALTER INDEX part_attmp_pkey RENAME TO part_attmp_index;
+ALTER INDEX part_attmp1_pkey RENAME TO part_attmp1_index;
+ALTER TABLE part_attmp RENAME TO part_at2tmp;
+ALTER TABLE part_attmp1 RENAME TO part_at2tmp1;
+SET ROLE regress_alter_table_user1;
+ALTER INDEX part_attmp_index RENAME TO fail;
+ERROR: must be owner of index part_attmp_index
+ALTER INDEX part_attmp1_index RENAME TO fail;
+ERROR: must be owner of index part_attmp1_index
+ALTER TABLE part_at2tmp RENAME TO fail;
+ERROR: must be owner of table part_at2tmp
+ALTER TABLE part_at2tmp1 RENAME TO fail;
+ERROR: must be owner of table part_at2tmp1
+RESET ROLE;
+DROP TABLE part_at2tmp;
+--
+-- check renaming to a table's array type's autogenerated name
+-- (the array type's name should get out of the way)
+--
+CREATE TABLE attmp_array (id int);
+CREATE TABLE attmp_array2 (id int);
+SELECT typname FROM pg_type WHERE oid = 'attmp_array[]'::regtype;
+ typname
+--------------
+ _attmp_array
+(1 row)
+
+SELECT typname FROM pg_type WHERE oid = 'attmp_array2[]'::regtype;
+ typname
+---------------
+ _attmp_array2
+(1 row)
+
+ALTER TABLE attmp_array2 RENAME TO _attmp_array;
+SELECT typname FROM pg_type WHERE oid = 'attmp_array[]'::regtype;
+ typname
+---------------
+ __attmp_array
+(1 row)
+
+SELECT typname FROM pg_type WHERE oid = '_attmp_array[]'::regtype;
+ typname
+----------------
+ ___attmp_array
+(1 row)
+
+DROP TABLE _attmp_array;
+DROP TABLE attmp_array;
+-- renaming to table's own array type's name is an interesting corner case
+CREATE TABLE attmp_array (id int);
+SELECT typname FROM pg_type WHERE oid = 'attmp_array[]'::regtype;
+ typname
+--------------
+ _attmp_array
+(1 row)
+
+ALTER TABLE attmp_array RENAME TO _attmp_array;
+SELECT typname FROM pg_type WHERE oid = '_attmp_array[]'::regtype;
+ typname
+---------------
+ __attmp_array
+(1 row)
+
+DROP TABLE _attmp_array;
+-- ALTER TABLE ... RENAME on non-table relations
+-- renaming indexes (FIXME: this should probably test the index's functionality)
+ALTER INDEX IF EXISTS __onek_unique1 RENAME TO attmp_onek_unique1;
+NOTICE: relation "__onek_unique1" does not exist, skipping
+ALTER INDEX IF EXISTS __attmp_onek_unique1 RENAME TO onek_unique1;
+NOTICE: relation "__attmp_onek_unique1" does not exist, skipping
+ALTER INDEX onek_unique1 RENAME TO attmp_onek_unique1;
+ALTER INDEX attmp_onek_unique1 RENAME TO onek_unique1;
+SET ROLE regress_alter_table_user1;
+ALTER INDEX onek_unique1 RENAME TO fail; -- permission denied
+ERROR: must be owner of index onek_unique1
+RESET ROLE;
+-- renaming views
+CREATE VIEW attmp_view (unique1) AS SELECT unique1 FROM tenk1;
+ALTER TABLE attmp_view RENAME TO attmp_view_new;
+SET ROLE regress_alter_table_user1;
+ALTER VIEW attmp_view_new RENAME TO fail; -- permission denied
+ERROR: must be owner of view attmp_view_new
+RESET ROLE;
+-- hack to ensure we get an indexscan here
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+-- 5 values, sorted
+SELECT unique1 FROM tenk1 WHERE unique1 < 5;
+ unique1
+---------
+ 0
+ 1
+ 2
+ 3
+ 4
+(5 rows)
+
+reset enable_seqscan;
+reset enable_bitmapscan;
+DROP VIEW attmp_view_new;
+-- toast-like relation name
+alter table stud_emp rename to pg_toast_stud_emp;
+alter table pg_toast_stud_emp rename to stud_emp;
+-- renaming index should rename constraint as well
+ALTER TABLE onek ADD CONSTRAINT onek_unique1_constraint UNIQUE (unique1);
+ALTER INDEX onek_unique1_constraint RENAME TO onek_unique1_constraint_foo;
+ALTER TABLE onek DROP CONSTRAINT onek_unique1_constraint_foo;
+-- renaming constraint
+ALTER TABLE onek ADD CONSTRAINT onek_check_constraint CHECK (unique1 >= 0);
+ALTER TABLE onek RENAME CONSTRAINT onek_check_constraint TO onek_check_constraint_foo;
+ALTER TABLE onek DROP CONSTRAINT onek_check_constraint_foo;
+-- renaming constraint should rename index as well
+ALTER TABLE onek ADD CONSTRAINT onek_unique1_constraint UNIQUE (unique1);
+DROP INDEX onek_unique1_constraint; -- to see whether it's there
+ERROR: cannot drop index onek_unique1_constraint because constraint onek_unique1_constraint on table onek requires it
+HINT: You can drop constraint onek_unique1_constraint on table onek instead.
+ALTER TABLE onek RENAME CONSTRAINT onek_unique1_constraint TO onek_unique1_constraint_foo;
+DROP INDEX onek_unique1_constraint_foo; -- to see whether it's there
+ERROR: cannot drop index onek_unique1_constraint_foo because constraint onek_unique1_constraint_foo on table onek requires it
+HINT: You can drop constraint onek_unique1_constraint_foo on table onek instead.
+ALTER TABLE onek DROP CONSTRAINT onek_unique1_constraint_foo;
+-- renaming constraints vs. inheritance
+CREATE TABLE constraint_rename_test (a int CONSTRAINT con1 CHECK (a > 0), b int, c int);
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+Check constraints:
+ "con1" CHECK (a > 0)
+
+CREATE TABLE constraint_rename_test2 (a int CONSTRAINT con1 CHECK (a > 0), d int) INHERITS (constraint_rename_test);
+NOTICE: merging column "a" with inherited definition
+NOTICE: merging constraint "con1" with inherited definition
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+ALTER TABLE constraint_rename_test2 RENAME CONSTRAINT con1 TO con1foo; -- fail
+ERROR: cannot rename inherited constraint "con1"
+ALTER TABLE ONLY constraint_rename_test RENAME CONSTRAINT con1 TO con1foo; -- fail
+ERROR: inherited constraint "con1" must be renamed in child tables too
+ALTER TABLE constraint_rename_test RENAME CONSTRAINT con1 TO con1foo; -- ok
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+ALTER TABLE constraint_rename_test ADD CONSTRAINT con2 CHECK (b > 0) NO INHERIT;
+ALTER TABLE ONLY constraint_rename_test RENAME CONSTRAINT con2 TO con2foo; -- ok
+ALTER TABLE constraint_rename_test RENAME CONSTRAINT con2foo TO con2bar; -- ok
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+ "con2bar" CHECK (b > 0) NO INHERIT
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+ALTER TABLE constraint_rename_test ADD CONSTRAINT con3 PRIMARY KEY (a);
+ALTER TABLE constraint_rename_test RENAME CONSTRAINT con3 TO con3foo; -- ok
+\d constraint_rename_test
+ Table "public.constraint_rename_test"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | not null |
+ b | integer | | |
+ c | integer | | |
+Indexes:
+ "con3foo" PRIMARY KEY, btree (a)
+Check constraints:
+ "con1foo" CHECK (a > 0)
+ "con2bar" CHECK (b > 0) NO INHERIT
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d constraint_rename_test2
+ Table "public.constraint_rename_test2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | not null |
+ b | integer | | |
+ c | integer | | |
+ d | integer | | |
+Check constraints:
+ "con1foo" CHECK (a > 0)
+Inherits: constraint_rename_test
+
+DROP TABLE constraint_rename_test2;
+DROP TABLE constraint_rename_test;
+ALTER TABLE IF EXISTS constraint_not_exist RENAME CONSTRAINT con3 TO con3foo; -- ok
+NOTICE: relation "constraint_not_exist" does not exist, skipping
+ALTER TABLE IF EXISTS constraint_rename_test ADD CONSTRAINT con4 UNIQUE (a);
+NOTICE: relation "constraint_rename_test" does not exist, skipping
+-- renaming constraints with cache reset of target relation
+CREATE TABLE constraint_rename_cache (a int,
+ CONSTRAINT chk_a CHECK (a > 0),
+ PRIMARY KEY (a));
+ALTER TABLE constraint_rename_cache
+ RENAME CONSTRAINT chk_a TO chk_a_new;
+ALTER TABLE constraint_rename_cache
+ RENAME CONSTRAINT constraint_rename_cache_pkey TO constraint_rename_pkey_new;
+CREATE TABLE like_constraint_rename_cache
+ (LIKE constraint_rename_cache INCLUDING ALL);
+\d like_constraint_rename_cache
+ Table "public.like_constraint_rename_cache"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | not null |
+Indexes:
+ "like_constraint_rename_cache_pkey" PRIMARY KEY, btree (a)
+Check constraints:
+ "chk_a_new" CHECK (a > 0)
+
+DROP TABLE constraint_rename_cache;
+DROP TABLE like_constraint_rename_cache;
+-- FOREIGN KEY CONSTRAINT adding TEST
+CREATE TABLE attmp2 (a int primary key);
+CREATE TABLE attmp3 (a int, b int);
+CREATE TABLE attmp4 (a int, b int, unique(a,b));
+CREATE TABLE attmp5 (a int, b int);
+-- Insert rows into attmp2 (pktable)
+INSERT INTO attmp2 values (1);
+INSERT INTO attmp2 values (2);
+INSERT INTO attmp2 values (3);
+INSERT INTO attmp2 values (4);
+-- Insert rows into attmp3
+INSERT INTO attmp3 values (1,10);
+INSERT INTO attmp3 values (1,20);
+INSERT INTO attmp3 values (5,50);
+-- Try (and fail) to add constraint due to invalid source columns
+ALTER TABLE attmp3 add constraint attmpconstr foreign key(c) references attmp2 match full;
+ERROR: column "c" referenced in foreign key constraint does not exist
+-- Try (and fail) to add constraint due to invalid destination columns explicitly given
+ALTER TABLE attmp3 add constraint attmpconstr foreign key(a) references attmp2(b) match full;
+ERROR: column "b" referenced in foreign key constraint does not exist
+-- Try (and fail) to add constraint due to invalid data
+ALTER TABLE attmp3 add constraint attmpconstr foreign key (a) references attmp2 match full;
+ERROR: insert or update on table "attmp3" violates foreign key constraint "attmpconstr"
+DETAIL: Key (a)=(5) is not present in table "attmp2".
+-- Delete failing row
+DELETE FROM attmp3 where a=5;
+-- Try (and succeed)
+ALTER TABLE attmp3 add constraint attmpconstr foreign key (a) references attmp2 match full;
+ALTER TABLE attmp3 drop constraint attmpconstr;
+INSERT INTO attmp3 values (5,50);
+-- Try NOT VALID and then VALIDATE CONSTRAINT, but fails. Delete failure then re-validate
+ALTER TABLE attmp3 add constraint attmpconstr foreign key (a) references attmp2 match full NOT VALID;
+ALTER TABLE attmp3 validate constraint attmpconstr;
+ERROR: insert or update on table "attmp3" violates foreign key constraint "attmpconstr"
+DETAIL: Key (a)=(5) is not present in table "attmp2".
+-- Delete failing row
+DELETE FROM attmp3 where a=5;
+-- Try (and succeed) and repeat to show it works on already valid constraint
+ALTER TABLE attmp3 validate constraint attmpconstr;
+ALTER TABLE attmp3 validate constraint attmpconstr;
+-- Try a non-verified CHECK constraint
+ALTER TABLE attmp3 ADD CONSTRAINT b_greater_than_ten CHECK (b > 10); -- fail
+ERROR: check constraint "b_greater_than_ten" of relation "attmp3" is violated by some row
+ALTER TABLE attmp3 ADD CONSTRAINT b_greater_than_ten CHECK (b > 10) NOT VALID; -- succeeds
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- fails
+ERROR: check constraint "b_greater_than_ten" of relation "attmp3" is violated by some row
+DELETE FROM attmp3 WHERE NOT b > 10;
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- succeeds
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- succeeds
+-- Test inherited NOT VALID CHECK constraints
+select * from attmp3;
+ a | b
+---+----
+ 1 | 20
+(1 row)
+
+CREATE TABLE attmp6 () INHERITS (attmp3);
+CREATE TABLE attmp7 () INHERITS (attmp3);
+INSERT INTO attmp6 VALUES (6, 30), (7, 16);
+ALTER TABLE attmp3 ADD CONSTRAINT b_le_20 CHECK (b <= 20) NOT VALID;
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_le_20; -- fails
+ERROR: check constraint "b_le_20" of relation "attmp6" is violated by some row
+DELETE FROM attmp6 WHERE b > 20;
+ALTER TABLE attmp3 VALIDATE CONSTRAINT b_le_20; -- succeeds
+-- An already validated constraint must not be revalidated
+CREATE FUNCTION boo(int) RETURNS int IMMUTABLE STRICT LANGUAGE plpgsql AS $$ BEGIN RAISE NOTICE 'boo: %', $1; RETURN $1; END; $$;
+INSERT INTO attmp7 VALUES (8, 18);
+ALTER TABLE attmp7 ADD CONSTRAINT identity CHECK (b = boo(b));
+NOTICE: boo: 18
+ALTER TABLE attmp3 ADD CONSTRAINT IDENTITY check (b = boo(b)) NOT VALID;
+NOTICE: merging constraint "identity" with inherited definition
+ALTER TABLE attmp3 VALIDATE CONSTRAINT identity;
+NOTICE: boo: 20
+NOTICE: boo: 16
+-- A NO INHERIT constraint should not be looked for in children during VALIDATE CONSTRAINT
+create table parent_noinh_convalid (a int);
+create table child_noinh_convalid () inherits (parent_noinh_convalid);
+insert into parent_noinh_convalid values (1);
+insert into child_noinh_convalid values (1);
+alter table parent_noinh_convalid add constraint check_a_is_2 check (a = 2) no inherit not valid;
+-- fail, because of the row in parent
+alter table parent_noinh_convalid validate constraint check_a_is_2;
+ERROR: check constraint "check_a_is_2" of relation "parent_noinh_convalid" is violated by some row
+delete from only parent_noinh_convalid;
+-- ok (parent itself contains no violating rows)
+alter table parent_noinh_convalid validate constraint check_a_is_2;
+select convalidated from pg_constraint where conrelid = 'parent_noinh_convalid'::regclass and conname = 'check_a_is_2';
+ convalidated
+--------------
+ t
+(1 row)
+
+-- cleanup
+drop table parent_noinh_convalid, child_noinh_convalid;
+-- Try (and fail) to create constraint from attmp5(a) to attmp4(a) - unique constraint on
+-- attmp4 is a,b
+ALTER TABLE attmp5 add constraint attmpconstr foreign key(a) references attmp4(a) match full;
+ERROR: there is no unique constraint matching given keys for referenced table "attmp4"
+DROP TABLE attmp7;
+DROP TABLE attmp6;
+DROP TABLE attmp5;
+DROP TABLE attmp4;
+DROP TABLE attmp3;
+DROP TABLE attmp2;
+-- NOT VALID with plan invalidation -- ensure we don't use a constraint for
+-- exclusion until validated
+set constraint_exclusion TO 'partition';
+create table nv_parent (d date, check (false) no inherit not valid);
+-- not valid constraint added at creation time should automatically become valid
+\d nv_parent
+ Table "public.nv_parent"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ d | date | | |
+Check constraints:
+ "nv_parent_check" CHECK (false) NO INHERIT
+
+create table nv_child_2010 () inherits (nv_parent);
+create table nv_child_2011 () inherits (nv_parent);
+alter table nv_child_2010 add check (d between '2010-01-01'::date and '2010-12-31'::date) not valid;
+alter table nv_child_2011 add check (d between '2011-01-01'::date and '2011-12-31'::date) not valid;
+explain (costs off) select * from nv_parent where d between '2011-08-01' and '2011-08-31';
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent nv_parent_1
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2010 nv_parent_2
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2011 nv_parent_3
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+(7 rows)
+
+create table nv_child_2009 (check (d between '2009-01-01'::date and '2009-12-31'::date)) inherits (nv_parent);
+explain (costs off) select * from nv_parent where d between '2011-08-01'::date and '2011-08-31'::date;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent nv_parent_1
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2010 nv_parent_2
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+ -> Seq Scan on nv_child_2011 nv_parent_3
+ Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
+(7 rows)
+
+explain (costs off) select * from nv_parent where d between '2009-08-01'::date and '2009-08-31'::date;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent nv_parent_1
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2010 nv_parent_2
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2011 nv_parent_3
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2009 nv_parent_4
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+(9 rows)
+
+-- after validation, the constraint should be used
+alter table nv_child_2011 VALIDATE CONSTRAINT nv_child_2011_d_check;
+explain (costs off) select * from nv_parent where d between '2009-08-01'::date and '2009-08-31'::date;
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Append
+ -> Seq Scan on nv_parent nv_parent_1
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2010 nv_parent_2
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+ -> Seq Scan on nv_child_2009 nv_parent_3
+ Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
+(7 rows)
+
+-- add an inherited NOT VALID constraint
+alter table nv_parent add check (d between '2001-01-01'::date and '2099-12-31'::date) not valid;
+\d nv_child_2009
+ Table "public.nv_child_2009"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ d | date | | |
+Check constraints:
+ "nv_child_2009_d_check" CHECK (d >= '01-01-2009'::date AND d <= '12-31-2009'::date)
+ "nv_parent_d_check" CHECK (d >= '01-01-2001'::date AND d <= '12-31-2099'::date) NOT VALID
+Inherits: nv_parent
+
+-- we leave nv_parent and children around to help test pg_dump logic
+-- Foreign key adding test with mixed types
+-- Note: these tables are TEMP to avoid name conflicts when this test
+-- is run in parallel with foreign_key.sql.
+CREATE TEMP TABLE PKTABLE (ptest1 int PRIMARY KEY);
+INSERT INTO PKTABLE VALUES(42);
+CREATE TEMP TABLE FKTABLE (ftest1 inet);
+-- This next should fail, because int=inet does not exist
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+ERROR: foreign key constraint "fktable_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: inet and integer.
+-- This should also fail for the same reason, but here we
+-- give the column name
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable(ptest1);
+ERROR: foreign key constraint "fktable_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: inet and integer.
+DROP TABLE FKTABLE;
+-- This should succeed, even though they are different types,
+-- because int=int8 exists and is a member of the integer opfamily
+CREATE TEMP TABLE FKTABLE (ftest1 int8);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+-- Check it actually works
+INSERT INTO FKTABLE VALUES(42); -- should succeed
+INSERT INTO FKTABLE VALUES(43); -- should fail
+ERROR: insert or update on table "fktable" violates foreign key constraint "fktable_ftest1_fkey"
+DETAIL: Key (ftest1)=(43) is not present in table "pktable".
+DROP TABLE FKTABLE;
+-- This should fail, because we'd have to cast numeric to int which is
+-- not an implicit coercion (or use numeric=numeric, but that's not part
+-- of the integer opfamily)
+CREATE TEMP TABLE FKTABLE (ftest1 numeric);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+ERROR: foreign key constraint "fktable_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: numeric and integer.
+DROP TABLE FKTABLE;
+DROP TABLE PKTABLE;
+-- On the other hand, this should work because int implicitly promotes to
+-- numeric, and we allow promotion on the FK side
+CREATE TEMP TABLE PKTABLE (ptest1 numeric PRIMARY KEY);
+INSERT INTO PKTABLE VALUES(42);
+CREATE TEMP TABLE FKTABLE (ftest1 int);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
+-- Check it actually works
+INSERT INTO FKTABLE VALUES(42); -- should succeed
+INSERT INTO FKTABLE VALUES(43); -- should fail
+ERROR: insert or update on table "fktable" violates foreign key constraint "fktable_ftest1_fkey"
+DETAIL: Key (ftest1)=(43) is not present in table "pktable".
+DROP TABLE FKTABLE;
+DROP TABLE PKTABLE;
+CREATE TEMP TABLE PKTABLE (ptest1 int, ptest2 inet,
+ PRIMARY KEY(ptest1, ptest2));
+-- This should fail, because we just chose really odd types
+CREATE TEMP TABLE FKTABLE (ftest1 cidr, ftest2 timestamp);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2) references pktable;
+ERROR: foreign key constraint "fktable_ftest1_ftest2_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: cidr and integer.
+DROP TABLE FKTABLE;
+-- Again, so should this...
+CREATE TEMP TABLE FKTABLE (ftest1 cidr, ftest2 timestamp);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2)
+ references pktable(ptest1, ptest2);
+ERROR: foreign key constraint "fktable_ftest1_ftest2_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest1" are of incompatible types: cidr and integer.
+DROP TABLE FKTABLE;
+-- This fails because we mixed up the column ordering
+CREATE TEMP TABLE FKTABLE (ftest1 int, ftest2 inet);
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2)
+ references pktable(ptest2, ptest1);
+ERROR: foreign key constraint "fktable_ftest1_ftest2_fkey" cannot be implemented
+DETAIL: Key columns "ftest1" and "ptest2" are of incompatible types: integer and inet.
+-- As does this...
+ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest2, ftest1)
+ references pktable(ptest1, ptest2);
+ERROR: foreign key constraint "fktable_ftest2_ftest1_fkey" cannot be implemented
+DETAIL: Key columns "ftest2" and "ptest1" are of incompatible types: inet and integer.
+DROP TABLE FKTABLE;
+DROP TABLE PKTABLE;
+-- Test that ALTER CONSTRAINT updates trigger deferrability properly
+CREATE TEMP TABLE PKTABLE (ptest1 int primary key);
+CREATE TEMP TABLE FKTABLE (ftest1 int);
+ALTER TABLE FKTABLE ADD CONSTRAINT fknd FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdd FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE INITIALLY DEFERRED;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdi FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE INITIALLY IMMEDIATE;
+ALTER TABLE FKTABLE ADD CONSTRAINT fknd2 FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE INITIALLY DEFERRED;
+ALTER TABLE FKTABLE ALTER CONSTRAINT fknd2 NOT DEFERRABLE;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdd2 FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE;
+ALTER TABLE FKTABLE ALTER CONSTRAINT fkdd2 DEFERRABLE INITIALLY DEFERRED;
+ALTER TABLE FKTABLE ADD CONSTRAINT fkdi2 FOREIGN KEY(ftest1) REFERENCES pktable
+ ON DELETE CASCADE ON UPDATE NO ACTION NOT DEFERRABLE;
+ALTER TABLE FKTABLE ALTER CONSTRAINT fkdi2 DEFERRABLE INITIALLY IMMEDIATE;
+SELECT conname, tgfoid::regproc, tgtype, tgdeferrable, tginitdeferred
+FROM pg_trigger JOIN pg_constraint con ON con.oid = tgconstraint
+WHERE tgrelid = 'pktable'::regclass
+ORDER BY 1,2,3;
+ conname | tgfoid | tgtype | tgdeferrable | tginitdeferred
+---------+------------------------+--------+--------------+----------------
+ fkdd | "RI_FKey_cascade_del" | 9 | f | f
+ fkdd | "RI_FKey_noaction_upd" | 17 | t | t
+ fkdd2 | "RI_FKey_cascade_del" | 9 | f | f
+ fkdd2 | "RI_FKey_noaction_upd" | 17 | t | t
+ fkdi | "RI_FKey_cascade_del" | 9 | f | f
+ fkdi | "RI_FKey_noaction_upd" | 17 | t | f
+ fkdi2 | "RI_FKey_cascade_del" | 9 | f | f
+ fkdi2 | "RI_FKey_noaction_upd" | 17 | t | f
+ fknd | "RI_FKey_cascade_del" | 9 | f | f
+ fknd | "RI_FKey_noaction_upd" | 17 | f | f
+ fknd2 | "RI_FKey_cascade_del" | 9 | f | f
+ fknd2 | "RI_FKey_noaction_upd" | 17 | f | f
+(12 rows)
+
+SELECT conname, tgfoid::regproc, tgtype, tgdeferrable, tginitdeferred
+FROM pg_trigger JOIN pg_constraint con ON con.oid = tgconstraint
+WHERE tgrelid = 'fktable'::regclass
+ORDER BY 1,2,3;
+ conname | tgfoid | tgtype | tgdeferrable | tginitdeferred
+---------+---------------------+--------+--------------+----------------
+ fkdd | "RI_FKey_check_ins" | 5 | t | t
+ fkdd | "RI_FKey_check_upd" | 17 | t | t
+ fkdd2 | "RI_FKey_check_ins" | 5 | t | t
+ fkdd2 | "RI_FKey_check_upd" | 17 | t | t
+ fkdi | "RI_FKey_check_ins" | 5 | t | f
+ fkdi | "RI_FKey_check_upd" | 17 | t | f
+ fkdi2 | "RI_FKey_check_ins" | 5 | t | f
+ fkdi2 | "RI_FKey_check_upd" | 17 | t | f
+ fknd | "RI_FKey_check_ins" | 5 | f | f
+ fknd | "RI_FKey_check_upd" | 17 | f | f
+ fknd2 | "RI_FKey_check_ins" | 5 | f | f
+ fknd2 | "RI_FKey_check_upd" | 17 | f | f
+(12 rows)
+
+-- temp tables should go away by themselves, need not drop them.
+-- test check constraint adding
+create table atacc1 ( test int );
+-- add a check constraint
+alter table atacc1 add constraint atacc_test1 check (test>3);
+-- should fail
+insert into atacc1 (test) values (2);
+ERROR: new row for relation "atacc1" violates check constraint "atacc_test1"
+DETAIL: Failing row contains (2).
+-- should succeed
+insert into atacc1 (test) values (4);
+drop table atacc1;
+-- let's do one where the check fails when added
+create table atacc1 ( test int );
+-- insert a soon to be failing row
+insert into atacc1 (test) values (2);
+-- add a check constraint (fails)
+alter table atacc1 add constraint atacc_test1 check (test>3);
+ERROR: check constraint "atacc_test1" of relation "atacc1" is violated by some row
+insert into atacc1 (test) values (4);
+drop table atacc1;
+-- let's do one where the check fails because the column doesn't exist
+create table atacc1 ( test int );
+-- add a check constraint (fails)
+alter table atacc1 add constraint atacc_test1 check (test1>3);
+ERROR: column "test1" does not exist
+HINT: Perhaps you meant to reference the column "atacc1.test".
+drop table atacc1;
+-- something a little more complicated
+create table atacc1 ( test int, test2 int, test3 int);
+-- add a check constraint (fails)
+alter table atacc1 add constraint atacc_test1 check (test+test2<test3*4);
+-- should fail
+insert into atacc1 (test,test2,test3) values (4,4,2);
+ERROR: new row for relation "atacc1" violates check constraint "atacc_test1"
+DETAIL: Failing row contains (4, 4, 2).
+-- should succeed
+insert into atacc1 (test,test2,test3) values (4,4,5);
+drop table atacc1;
+-- lets do some naming tests
+create table atacc1 (test int check (test>3), test2 int);
+alter table atacc1 add check (test2>test);
+-- should fail for $2
+insert into atacc1 (test2, test) values (3, 4);
+ERROR: new row for relation "atacc1" violates check constraint "atacc1_check"
+DETAIL: Failing row contains (4, 3).
+drop table atacc1;
+-- inheritance related tests
+create table atacc1 (test int);
+create table atacc2 (test2 int);
+create table atacc3 (test3 int) inherits (atacc1, atacc2);
+alter table atacc2 add constraint foo check (test2>0);
+-- fail and then succeed on atacc2
+insert into atacc2 (test2) values (-3);
+ERROR: new row for relation "atacc2" violates check constraint "foo"
+DETAIL: Failing row contains (-3).
+insert into atacc2 (test2) values (3);
+-- fail and then succeed on atacc3
+insert into atacc3 (test2) values (-3);
+ERROR: new row for relation "atacc3" violates check constraint "foo"
+DETAIL: Failing row contains (null, -3, null).
+insert into atacc3 (test2) values (3);
+drop table atacc3;
+drop table atacc2;
+drop table atacc1;
+-- same things with one created with INHERIT
+create table atacc1 (test int);
+create table atacc2 (test2 int);
+create table atacc3 (test3 int) inherits (atacc1, atacc2);
+alter table atacc3 no inherit atacc2;
+-- fail
+alter table atacc3 no inherit atacc2;
+ERROR: relation "atacc2" is not a parent of relation "atacc3"
+-- make sure it really isn't a child
+insert into atacc3 (test2) values (3);
+select test2 from atacc2;
+ test2
+-------
+(0 rows)
+
+-- fail due to missing constraint
+alter table atacc2 add constraint foo check (test2>0);
+alter table atacc3 inherit atacc2;
+ERROR: child table is missing constraint "foo"
+-- fail due to missing column
+alter table atacc3 rename test2 to testx;
+alter table atacc3 inherit atacc2;
+ERROR: child table is missing column "test2"
+-- fail due to mismatched data type
+alter table atacc3 add test2 bool;
+alter table atacc3 inherit atacc2;
+ERROR: child table "atacc3" has different type for column "test2"
+alter table atacc3 drop test2;
+-- succeed
+alter table atacc3 add test2 int;
+update atacc3 set test2 = 4 where test2 is null;
+alter table atacc3 add constraint foo check (test2>0);
+alter table atacc3 inherit atacc2;
+-- fail due to duplicates and circular inheritance
+alter table atacc3 inherit atacc2;
+ERROR: relation "atacc2" would be inherited from more than once
+alter table atacc2 inherit atacc3;
+ERROR: circular inheritance not allowed
+DETAIL: "atacc3" is already a child of "atacc2".
+alter table atacc2 inherit atacc2;
+ERROR: circular inheritance not allowed
+DETAIL: "atacc2" is already a child of "atacc2".
+-- test that we really are a child now (should see 4 not 3 and cascade should go through)
+select test2 from atacc2;
+ test2
+-------
+ 4
+(1 row)
+
+drop table atacc2 cascade;
+NOTICE: drop cascades to table atacc3
+drop table atacc1;
+-- adding only to a parent is allowed as of 9.2
+create table atacc1 (test int);
+create table atacc2 (test2 int) inherits (atacc1);
+-- ok:
+alter table atacc1 add constraint foo check (test>0) no inherit;
+-- check constraint is not there on child
+insert into atacc2 (test) values (-3);
+-- check constraint is there on parent
+insert into atacc1 (test) values (-3);
+ERROR: new row for relation "atacc1" violates check constraint "foo"
+DETAIL: Failing row contains (-3).
+insert into atacc1 (test) values (3);
+-- fail, violating row:
+alter table atacc2 add constraint foo check (test>0) no inherit;
+ERROR: check constraint "foo" of relation "atacc2" is violated by some row
+drop table atacc2;
+drop table atacc1;
+-- test unique constraint adding
+create table atacc1 ( test int ) ;
+-- add a unique constraint
+alter table atacc1 add constraint atacc_test1 unique (test);
+-- insert first value
+insert into atacc1 (test) values (2);
+-- should fail
+insert into atacc1 (test) values (2);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test)=(2) already exists.
+-- should succeed
+insert into atacc1 (test) values (4);
+-- try to create duplicates via alter table using - should fail
+alter table atacc1 alter column test type integer using 0;
+ERROR: could not create unique index "atacc_test1"
+DETAIL: Key (test)=(0) is duplicated.
+drop table atacc1;
+-- let's do one where the unique constraint fails when added
+create table atacc1 ( test int );
+-- insert soon to be failing rows
+insert into atacc1 (test) values (2);
+insert into atacc1 (test) values (2);
+-- add a unique constraint (fails)
+alter table atacc1 add constraint atacc_test1 unique (test);
+ERROR: could not create unique index "atacc_test1"
+DETAIL: Key (test)=(2) is duplicated.
+insert into atacc1 (test) values (3);
+drop table atacc1;
+-- let's do one where the unique constraint fails
+-- because the column doesn't exist
+create table atacc1 ( test int );
+-- add a unique constraint (fails)
+alter table atacc1 add constraint atacc_test1 unique (test1);
+ERROR: column "test1" named in key does not exist
+drop table atacc1;
+-- something a little more complicated
+create table atacc1 ( test int, test2 int);
+-- add a unique constraint
+alter table atacc1 add constraint atacc_test1 unique (test, test2);
+-- insert initial value
+insert into atacc1 (test,test2) values (4,4);
+-- should fail
+insert into atacc1 (test,test2) values (4,4);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test, test2)=(4, 4) already exists.
+-- should all succeed
+insert into atacc1 (test,test2) values (4,5);
+insert into atacc1 (test,test2) values (5,4);
+insert into atacc1 (test,test2) values (5,5);
+drop table atacc1;
+-- lets do some naming tests
+create table atacc1 (test int, test2 int, unique(test));
+alter table atacc1 add unique (test2);
+-- should fail for @@ second one @@
+insert into atacc1 (test2, test) values (3, 3);
+insert into atacc1 (test2, test) values (2, 3);
+ERROR: duplicate key value violates unique constraint "atacc1_test_key"
+DETAIL: Key (test)=(3) already exists.
+drop table atacc1;
+-- test primary key constraint adding
+create table atacc1 ( id serial, test int) ;
+-- add a primary key constraint
+alter table atacc1 add constraint atacc_test1 primary key (test);
+-- insert first value
+insert into atacc1 (test) values (2);
+-- should fail
+insert into atacc1 (test) values (2);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test)=(2) already exists.
+-- should succeed
+insert into atacc1 (test) values (4);
+-- inserting NULL should fail
+insert into atacc1 (test) values(NULL);
+ERROR: null value in column "test" of relation "atacc1" violates not-null constraint
+DETAIL: Failing row contains (4, null).
+-- try adding a second primary key (should fail)
+alter table atacc1 add constraint atacc_oid1 primary key(id);
+ERROR: multiple primary keys for table "atacc1" are not allowed
+-- drop first primary key constraint
+alter table atacc1 drop constraint atacc_test1 restrict;
+-- try adding a primary key on oid (should succeed)
+alter table atacc1 add constraint atacc_oid1 primary key(id);
+drop table atacc1;
+-- let's do one where the primary key constraint fails when added
+create table atacc1 ( test int );
+-- insert soon to be failing rows
+insert into atacc1 (test) values (2);
+insert into atacc1 (test) values (2);
+-- add a primary key (fails)
+alter table atacc1 add constraint atacc_test1 primary key (test);
+ERROR: could not create unique index "atacc_test1"
+DETAIL: Key (test)=(2) is duplicated.
+insert into atacc1 (test) values (3);
+drop table atacc1;
+-- let's do another one where the primary key constraint fails when added
+create table atacc1 ( test int );
+-- insert soon to be failing row
+insert into atacc1 (test) values (NULL);
+-- add a primary key (fails)
+alter table atacc1 add constraint atacc_test1 primary key (test);
+ERROR: column "test" of relation "atacc1" contains null values
+insert into atacc1 (test) values (3);
+drop table atacc1;
+-- let's do one where the primary key constraint fails
+-- because the column doesn't exist
+create table atacc1 ( test int );
+-- add a primary key constraint (fails)
+alter table atacc1 add constraint atacc_test1 primary key (test1);
+ERROR: column "test1" of relation "atacc1" does not exist
+drop table atacc1;
+-- adding a new column as primary key to a non-empty table.
+-- should fail unless the column has a non-null default value.
+create table atacc1 ( test int );
+insert into atacc1 (test) values (0);
+-- add a primary key column without a default (fails).
+alter table atacc1 add column test2 int primary key;
+ERROR: column "test2" of relation "atacc1" contains null values
+-- now add a primary key column with a default (succeeds).
+alter table atacc1 add column test2 int default 0 primary key;
+drop table atacc1;
+-- this combination used to have order-of-execution problems (bug #15580)
+create table atacc1 (a int);
+insert into atacc1 values(1);
+alter table atacc1
+ add column b float8 not null default random(),
+ add primary key(a);
+drop table atacc1;
+-- additionally, we've seen issues with foreign key validation not being
+-- properly delayed until after a table rewrite. Check that works ok.
+create table atacc1 (a int primary key);
+alter table atacc1 add constraint atacc1_fkey foreign key (a) references atacc1 (a) not valid;
+alter table atacc1 validate constraint atacc1_fkey, alter a type bigint;
+drop table atacc1;
+-- we've also seen issues with check constraints being validated at the wrong
+-- time when there's a pending table rewrite.
+create table atacc1 (a bigint, b int);
+insert into atacc1 values(1,1);
+alter table atacc1 add constraint atacc1_chk check(b = 1) not valid;
+alter table atacc1 validate constraint atacc1_chk, alter a type int;
+drop table atacc1;
+-- same as above, but ensure the constraint violation is detected
+create table atacc1 (a bigint, b int);
+insert into atacc1 values(1,2);
+alter table atacc1 add constraint atacc1_chk check(b = 1) not valid;
+alter table atacc1 validate constraint atacc1_chk, alter a type int;
+ERROR: check constraint "atacc1_chk" of relation "atacc1" is violated by some row
+drop table atacc1;
+-- something a little more complicated
+create table atacc1 ( test int, test2 int);
+-- add a primary key constraint
+alter table atacc1 add constraint atacc_test1 primary key (test, test2);
+-- try adding a second primary key - should fail
+alter table atacc1 add constraint atacc_test2 primary key (test);
+ERROR: multiple primary keys for table "atacc1" are not allowed
+-- insert initial value
+insert into atacc1 (test,test2) values (4,4);
+-- should fail
+insert into atacc1 (test,test2) values (4,4);
+ERROR: duplicate key value violates unique constraint "atacc_test1"
+DETAIL: Key (test, test2)=(4, 4) already exists.
+insert into atacc1 (test,test2) values (NULL,3);
+ERROR: null value in column "test" of relation "atacc1" violates not-null constraint
+DETAIL: Failing row contains (null, 3).
+insert into atacc1 (test,test2) values (3, NULL);
+ERROR: null value in column "test2" of relation "atacc1" violates not-null constraint
+DETAIL: Failing row contains (3, null).
+insert into atacc1 (test,test2) values (NULL,NULL);
+ERROR: null value in column "test" of relation "atacc1" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+-- should all succeed
+insert into atacc1 (test,test2) values (4,5);
+insert into atacc1 (test,test2) values (5,4);
+insert into atacc1 (test,test2) values (5,5);
+drop table atacc1;
+-- lets do some naming tests
+create table atacc1 (test int, test2 int, primary key(test));
+-- only first should succeed
+insert into atacc1 (test2, test) values (3, 3);
+insert into atacc1 (test2, test) values (2, 3);
+ERROR: duplicate key value violates unique constraint "atacc1_pkey"
+DETAIL: Key (test)=(3) already exists.
+insert into atacc1 (test2, test) values (1, NULL);
+ERROR: null value in column "test" of relation "atacc1" violates not-null constraint
+DETAIL: Failing row contains (null, 1).
+drop table atacc1;
+-- alter table / alter column [set/drop] not null tests
+-- try altering system catalogs, should fail
+alter table pg_class alter column relname drop not null;
+ERROR: permission denied: "pg_class" is a system catalog
+alter table pg_class alter relname set not null;
+ERROR: permission denied: "pg_class" is a system catalog
+-- try altering non-existent table, should fail
+alter table non_existent alter column bar set not null;
+ERROR: relation "non_existent" does not exist
+alter table non_existent alter column bar drop not null;
+ERROR: relation "non_existent" does not exist
+-- test setting columns to null and not null and vice versa
+-- test checking for null values and primary key
+create table atacc1 (test int not null);
+alter table atacc1 add constraint "atacc1_pkey" primary key (test);
+alter table atacc1 alter column test drop not null;
+ERROR: column "test" is in a primary key
+alter table atacc1 drop constraint "atacc1_pkey";
+alter table atacc1 alter column test drop not null;
+insert into atacc1 values (null);
+alter table atacc1 alter test set not null;
+ERROR: column "test" of relation "atacc1" contains null values
+delete from atacc1;
+alter table atacc1 alter test set not null;
+-- try altering a non-existent column, should fail
+alter table atacc1 alter bar set not null;
+ERROR: column "bar" of relation "atacc1" does not exist
+alter table atacc1 alter bar drop not null;
+ERROR: column "bar" of relation "atacc1" does not exist
+-- try creating a view and altering that, should fail
+create view myview as select * from atacc1;
+alter table myview alter column test drop not null;
+ERROR: "myview" is not a table or foreign table
+alter table myview alter column test set not null;
+ERROR: "myview" is not a table or foreign table
+drop view myview;
+drop table atacc1;
+-- set not null verified by constraints
+create table atacc1 (test_a int, test_b int);
+insert into atacc1 values (null, 1);
+-- constraint not cover all values, should fail
+alter table atacc1 add constraint atacc1_constr_or check(test_a is not null or test_b < 10);
+alter table atacc1 alter test_a set not null;
+ERROR: column "test_a" of relation "atacc1" contains null values
+alter table atacc1 drop constraint atacc1_constr_or;
+-- not valid constraint, should fail
+alter table atacc1 add constraint atacc1_constr_invalid check(test_a is not null) not valid;
+alter table atacc1 alter test_a set not null;
+ERROR: column "test_a" of relation "atacc1" contains null values
+alter table atacc1 drop constraint atacc1_constr_invalid;
+-- with valid constraint
+update atacc1 set test_a = 1;
+alter table atacc1 add constraint atacc1_constr_a_valid check(test_a is not null);
+alter table atacc1 alter test_a set not null;
+delete from atacc1;
+insert into atacc1 values (2, null);
+alter table atacc1 alter test_a drop not null;
+-- test multiple set not null at same time
+-- test_a checked by atacc1_constr_a_valid, test_b should fail by table scan
+alter table atacc1 alter test_a set not null, alter test_b set not null;
+ERROR: column "test_b" of relation "atacc1" contains null values
+-- commands order has no importance
+alter table atacc1 alter test_b set not null, alter test_a set not null;
+ERROR: column "test_b" of relation "atacc1" contains null values
+-- valid one by table scan, one by check constraints
+update atacc1 set test_b = 1;
+alter table atacc1 alter test_b set not null, alter test_a set not null;
+alter table atacc1 alter test_a drop not null, alter test_b drop not null;
+-- both column has check constraints
+alter table atacc1 add constraint atacc1_constr_b_valid check(test_b is not null);
+alter table atacc1 alter test_b set not null, alter test_a set not null;
+drop table atacc1;
+-- test inheritance
+create table parent (a int);
+create table child (b varchar(255)) inherits (parent);
+alter table parent alter a set not null;
+insert into parent values (NULL);
+ERROR: null value in column "a" of relation "parent" violates not-null constraint
+DETAIL: Failing row contains (null).
+insert into child (a, b) values (NULL, 'foo');
+ERROR: null value in column "a" of relation "child" violates not-null constraint
+DETAIL: Failing row contains (null, foo).
+alter table parent alter a drop not null;
+insert into parent values (NULL);
+insert into child (a, b) values (NULL, 'foo');
+alter table only parent alter a set not null;
+ERROR: column "a" of relation "parent" contains null values
+alter table child alter a set not null;
+ERROR: column "a" of relation "child" contains null values
+delete from parent;
+alter table only parent alter a set not null;
+insert into parent values (NULL);
+ERROR: null value in column "a" of relation "parent" violates not-null constraint
+DETAIL: Failing row contains (null).
+alter table child alter a set not null;
+insert into child (a, b) values (NULL, 'foo');
+ERROR: null value in column "a" of relation "child" violates not-null constraint
+DETAIL: Failing row contains (null, foo).
+delete from child;
+alter table child alter a set not null;
+insert into child (a, b) values (NULL, 'foo');
+ERROR: null value in column "a" of relation "child" violates not-null constraint
+DETAIL: Failing row contains (null, foo).
+drop table child;
+drop table parent;
+-- test setting and removing default values
+create table def_test (
+ c1 int4 default 5,
+ c2 text default 'initial_default'
+);
+insert into def_test default values;
+alter table def_test alter column c1 drop default;
+insert into def_test default values;
+alter table def_test alter column c2 drop default;
+insert into def_test default values;
+alter table def_test alter column c1 set default 10;
+alter table def_test alter column c2 set default 'new_default';
+insert into def_test default values;
+select * from def_test;
+ c1 | c2
+----+-----------------
+ 5 | initial_default
+ | initial_default
+ |
+ 10 | new_default
+(4 rows)
+
+-- set defaults to an incorrect type: this should fail
+alter table def_test alter column c1 set default 'wrong_datatype';
+ERROR: invalid input syntax for type integer: "wrong_datatype"
+alter table def_test alter column c2 set default 20;
+-- set defaults on a non-existent column: this should fail
+alter table def_test alter column c3 set default 30;
+ERROR: column "c3" of relation "def_test" does not exist
+-- set defaults on views: we need to create a view, add a rule
+-- to allow insertions into it, and then alter the view to add
+-- a default
+create view def_view_test as select * from def_test;
+create rule def_view_test_ins as
+ on insert to def_view_test
+ do instead insert into def_test select new.*;
+insert into def_view_test default values;
+alter table def_view_test alter column c1 set default 45;
+insert into def_view_test default values;
+alter table def_view_test alter column c2 set default 'view_default';
+insert into def_view_test default values;
+select * from def_view_test;
+ c1 | c2
+----+-----------------
+ 5 | initial_default
+ | initial_default
+ |
+ 10 | new_default
+ |
+ 45 |
+ 45 | view_default
+(7 rows)
+
+drop rule def_view_test_ins on def_view_test;
+drop view def_view_test;
+drop table def_test;
+-- alter table / drop column tests
+-- try altering system catalogs, should fail
+alter table pg_class drop column relname;
+ERROR: permission denied: "pg_class" is a system catalog
+-- try altering non-existent table, should fail
+alter table nosuchtable drop column bar;
+ERROR: relation "nosuchtable" does not exist
+-- test dropping columns
+create table atacc1 (a int4 not null, b int4, c int4 not null, d int4);
+insert into atacc1 values (1, 2, 3, 4);
+alter table atacc1 drop a;
+alter table atacc1 drop a;
+ERROR: column "a" of relation "atacc1" does not exist
+-- SELECTs
+select * from atacc1;
+ b | c | d
+---+---+---
+ 2 | 3 | 4
+(1 row)
+
+select * from atacc1 order by a;
+ERROR: column "a" does not exist
+LINE 1: select * from atacc1 order by a;
+ ^
+select * from atacc1 order by "........pg.dropped.1........";
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select * from atacc1 order by "........pg.dropped.1........"...
+ ^
+select * from atacc1 group by a;
+ERROR: column "a" does not exist
+LINE 1: select * from atacc1 group by a;
+ ^
+select * from atacc1 group by "........pg.dropped.1........";
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select * from atacc1 group by "........pg.dropped.1........"...
+ ^
+select atacc1.* from atacc1;
+ b | c | d
+---+---+---
+ 2 | 3 | 4
+(1 row)
+
+select a from atacc1;
+ERROR: column "a" does not exist
+LINE 1: select a from atacc1;
+ ^
+select atacc1.a from atacc1;
+ERROR: column atacc1.a does not exist
+LINE 1: select atacc1.a from atacc1;
+ ^
+select b,c,d from atacc1;
+ b | c | d
+---+---+---
+ 2 | 3 | 4
+(1 row)
+
+select a,b,c,d from atacc1;
+ERROR: column "a" does not exist
+LINE 1: select a,b,c,d from atacc1;
+ ^
+select * from atacc1 where a = 1;
+ERROR: column "a" does not exist
+LINE 1: select * from atacc1 where a = 1;
+ ^
+select "........pg.dropped.1........" from atacc1;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select "........pg.dropped.1........" from atacc1;
+ ^
+select atacc1."........pg.dropped.1........" from atacc1;
+ERROR: column atacc1.........pg.dropped.1........ does not exist
+LINE 1: select atacc1."........pg.dropped.1........" from atacc1;
+ ^
+select "........pg.dropped.1........",b,c,d from atacc1;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select "........pg.dropped.1........",b,c,d from atacc1;
+ ^
+select * from atacc1 where "........pg.dropped.1........" = 1;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: select * from atacc1 where "........pg.dropped.1........" = ...
+ ^
+-- UPDATEs
+update atacc1 set a = 3;
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: update atacc1 set a = 3;
+ ^
+update atacc1 set b = 2 where a = 3;
+ERROR: column "a" does not exist
+LINE 1: update atacc1 set b = 2 where a = 3;
+ ^
+update atacc1 set "........pg.dropped.1........" = 3;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: update atacc1 set "........pg.dropped.1........" = 3;
+ ^
+update atacc1 set b = 2 where "........pg.dropped.1........" = 3;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: update atacc1 set b = 2 where "........pg.dropped.1........"...
+ ^
+-- INSERTs
+insert into atacc1 values (10, 11, 12, 13);
+ERROR: INSERT has more expressions than target columns
+LINE 1: insert into atacc1 values (10, 11, 12, 13);
+ ^
+insert into atacc1 values (default, 11, 12, 13);
+ERROR: INSERT has more expressions than target columns
+LINE 1: insert into atacc1 values (default, 11, 12, 13);
+ ^
+insert into atacc1 values (11, 12, 13);
+insert into atacc1 (a) values (10);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a) values (10);
+ ^
+insert into atacc1 (a) values (default);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a) values (default);
+ ^
+insert into atacc1 (a,b,c,d) values (10,11,12,13);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a,b,c,d) values (10,11,12,13);
+ ^
+insert into atacc1 (a,b,c,d) values (default,11,12,13);
+ERROR: column "a" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 (a,b,c,d) values (default,11,12,13);
+ ^
+insert into atacc1 (b,c,d) values (11,12,13);
+insert into atacc1 ("........pg.dropped.1........") values (10);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........") values (...
+ ^
+insert into atacc1 ("........pg.dropped.1........") values (default);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........") values (...
+ ^
+insert into atacc1 ("........pg.dropped.1........",b,c,d) values (10,11,12,13);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........",b,c,d) va...
+ ^
+insert into atacc1 ("........pg.dropped.1........",b,c,d) values (default,11,12,13);
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+LINE 1: insert into atacc1 ("........pg.dropped.1........",b,c,d) va...
+ ^
+-- DELETEs
+delete from atacc1 where a = 3;
+ERROR: column "a" does not exist
+LINE 1: delete from atacc1 where a = 3;
+ ^
+delete from atacc1 where "........pg.dropped.1........" = 3;
+ERROR: column "........pg.dropped.1........" does not exist
+LINE 1: delete from atacc1 where "........pg.dropped.1........" = 3;
+ ^
+delete from atacc1;
+-- try dropping a non-existent column, should fail
+alter table atacc1 drop bar;
+ERROR: column "bar" of relation "atacc1" does not exist
+-- try removing an oid column, should succeed (as it's nonexistent)
+alter table atacc1 SET WITHOUT OIDS;
+-- try adding an oid column, should fail (not supported)
+alter table atacc1 SET WITH OIDS;
+ERROR: syntax error at or near "WITH"
+LINE 1: alter table atacc1 SET WITH OIDS;
+ ^
+-- try dropping the xmin column, should fail
+alter table atacc1 drop xmin;
+ERROR: cannot drop system column "xmin"
+-- try creating a view and altering that, should fail
+create view myview as select * from atacc1;
+select * from myview;
+ b | c | d
+---+---+---
+(0 rows)
+
+alter table myview drop d;
+ERROR: "myview" is not a table, composite type, or foreign table
+drop view myview;
+-- test some commands to make sure they fail on the dropped column
+analyze atacc1(a);
+ERROR: column "a" of relation "atacc1" does not exist
+analyze atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+vacuum analyze atacc1(a);
+ERROR: column "a" of relation "atacc1" does not exist
+vacuum analyze atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+comment on column atacc1.a is 'testing';
+ERROR: column "a" of relation "atacc1" does not exist
+comment on column atacc1."........pg.dropped.1........" is 'testing';
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set storage plain;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set storage plain;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set statistics 0;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set statistics 0;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set default 3;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set default 3;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a drop default;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" drop default;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a set not null;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" set not null;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 alter a drop not null;
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 alter "........pg.dropped.1........" drop not null;
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 rename a to x;
+ERROR: column "a" does not exist
+alter table atacc1 rename "........pg.dropped.1........" to x;
+ERROR: column "........pg.dropped.1........" does not exist
+alter table atacc1 add primary key(a);
+ERROR: column "a" of relation "atacc1" does not exist
+alter table atacc1 add primary key("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" of relation "atacc1" does not exist
+alter table atacc1 add unique(a);
+ERROR: column "a" named in key does not exist
+alter table atacc1 add unique("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" named in key does not exist
+alter table atacc1 add check (a > 3);
+ERROR: column "a" does not exist
+alter table atacc1 add check ("........pg.dropped.1........" > 3);
+ERROR: column "........pg.dropped.1........" does not exist
+create table atacc2 (id int4 unique);
+alter table atacc1 add foreign key (a) references atacc2(id);
+ERROR: column "a" referenced in foreign key constraint does not exist
+alter table atacc1 add foreign key ("........pg.dropped.1........") references atacc2(id);
+ERROR: column "........pg.dropped.1........" referenced in foreign key constraint does not exist
+alter table atacc2 add foreign key (id) references atacc1(a);
+ERROR: column "a" referenced in foreign key constraint does not exist
+alter table atacc2 add foreign key (id) references atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" referenced in foreign key constraint does not exist
+drop table atacc2;
+create index "testing_idx" on atacc1(a);
+ERROR: column "a" does not exist
+create index "testing_idx" on atacc1("........pg.dropped.1........");
+ERROR: column "........pg.dropped.1........" does not exist
+-- test create as and select into
+insert into atacc1 values (21, 22, 23);
+create table attest1 as select * from atacc1;
+select * from attest1;
+ b | c | d
+----+----+----
+ 21 | 22 | 23
+(1 row)
+
+drop table attest1;
+select * into attest2 from atacc1;
+select * from attest2;
+ b | c | d
+----+----+----
+ 21 | 22 | 23
+(1 row)
+
+drop table attest2;
+-- try dropping all columns
+alter table atacc1 drop c;
+alter table atacc1 drop d;
+alter table atacc1 drop b;
+select * from atacc1;
+--
+(1 row)
+
+drop table atacc1;
+-- test constraint error reporting in presence of dropped columns
+create table atacc1 (id serial primary key, value int check (value < 10));
+insert into atacc1(value) values (100);
+ERROR: new row for relation "atacc1" violates check constraint "atacc1_value_check"
+DETAIL: Failing row contains (1, 100).
+alter table atacc1 drop column value;
+alter table atacc1 add column value int check (value < 10);
+insert into atacc1(value) values (100);
+ERROR: new row for relation "atacc1" violates check constraint "atacc1_value_check"
+DETAIL: Failing row contains (2, 100).
+insert into atacc1(id, value) values (null, 0);
+ERROR: null value in column "id" of relation "atacc1" violates not-null constraint
+DETAIL: Failing row contains (null, 0).
+drop table atacc1;
+-- test inheritance
+create table parent (a int, b int, c int);
+insert into parent values (1, 2, 3);
+alter table parent drop a;
+create table child (d varchar(255)) inherits (parent);
+insert into child values (12, 13, 'testing');
+select * from parent;
+ b | c
+----+----
+ 2 | 3
+ 12 | 13
+(2 rows)
+
+select * from child;
+ b | c | d
+----+----+---------
+ 12 | 13 | testing
+(1 row)
+
+alter table parent drop c;
+select * from parent;
+ b
+----
+ 2
+ 12
+(2 rows)
+
+select * from child;
+ b | d
+----+---------
+ 12 | testing
+(1 row)
+
+drop table child;
+drop table parent;
+-- check error cases for inheritance column merging
+create table parent (a float8, b numeric(10,4), c text collate "C");
+create table child (a float4) inherits (parent); -- fail
+NOTICE: merging column "a" with inherited definition
+ERROR: column "a" has a type conflict
+DETAIL: double precision versus real
+create table child (b decimal(10,7)) inherits (parent); -- fail
+NOTICE: moving and merging column "b" with inherited definition
+DETAIL: User-specified column moved to the position of the inherited column.
+ERROR: column "b" has a type conflict
+DETAIL: numeric(10,4) versus numeric(10,7)
+create table child (c text collate "POSIX") inherits (parent); -- fail
+NOTICE: moving and merging column "c" with inherited definition
+DETAIL: User-specified column moved to the position of the inherited column.
+ERROR: column "c" has a collation conflict
+DETAIL: "C" versus "POSIX"
+create table child (a double precision, b decimal(10,4)) inherits (parent);
+NOTICE: merging column "a" with inherited definition
+NOTICE: merging column "b" with inherited definition
+drop table child;
+drop table parent;
+-- test copy in/out
+create table attest (a int4, b int4, c int4);
+insert into attest values (1,2,3);
+alter table attest drop a;
+copy attest to stdout;
+2 3
+copy attest(a) to stdout;
+ERROR: column "a" of relation "attest" does not exist
+copy attest("........pg.dropped.1........") to stdout;
+ERROR: column "........pg.dropped.1........" of relation "attest" does not exist
+copy attest from stdin;
+ERROR: extra data after last expected column
+CONTEXT: COPY attest, line 1: "10 11 12"
+select * from attest;
+ b | c
+---+---
+ 2 | 3
+(1 row)
+
+copy attest from stdin;
+select * from attest;
+ b | c
+----+----
+ 2 | 3
+ 21 | 22
+(2 rows)
+
+copy attest(a) from stdin;
+ERROR: column "a" of relation "attest" does not exist
+copy attest("........pg.dropped.1........") from stdin;
+ERROR: column "........pg.dropped.1........" of relation "attest" does not exist
+copy attest(b,c) from stdin;
+select * from attest;
+ b | c
+----+----
+ 2 | 3
+ 21 | 22
+ 31 | 32
+(3 rows)
+
+drop table attest;
+-- test inheritance
+create table dropColumn (a int, b int, e int);
+create table dropColumnChild (c int) inherits (dropColumn);
+create table dropColumnAnother (d int) inherits (dropColumnChild);
+-- these two should fail
+alter table dropColumnchild drop column a;
+ERROR: cannot drop inherited column "a"
+alter table only dropColumnChild drop column b;
+ERROR: cannot drop inherited column "b"
+-- these three should work
+alter table only dropColumn drop column e;
+alter table dropColumnChild drop column c;
+alter table dropColumn drop column a;
+create table renameColumn (a int);
+create table renameColumnChild (b int) inherits (renameColumn);
+create table renameColumnAnother (c int) inherits (renameColumnChild);
+-- these three should fail
+alter table renameColumnChild rename column a to d;
+ERROR: cannot rename inherited column "a"
+alter table only renameColumnChild rename column a to d;
+ERROR: inherited column "a" must be renamed in child tables too
+alter table only renameColumn rename column a to d;
+ERROR: inherited column "a" must be renamed in child tables too
+-- these should work
+alter table renameColumn rename column a to d;
+alter table renameColumnChild rename column b to a;
+-- these should work
+alter table if exists doesnt_exist_tab rename column a to d;
+NOTICE: relation "doesnt_exist_tab" does not exist, skipping
+alter table if exists doesnt_exist_tab rename column b to a;
+NOTICE: relation "doesnt_exist_tab" does not exist, skipping
+-- this should work
+alter table renameColumn add column w int;
+-- this should fail
+alter table only renameColumn add column x int;
+ERROR: column must be added to child tables too
+-- Test corner cases in dropping of inherited columns
+create table p1 (f1 int, f2 int);
+create table c1 (f1 int not null) inherits(p1);
+NOTICE: merging column "f1" with inherited definition
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+-- should work
+alter table p1 drop column f1;
+-- c1.f1 is still there, but no longer inherited
+select f1 from c1;
+ f1
+----
+(0 rows)
+
+alter table c1 drop column f1;
+select f1 from c1;
+ERROR: column "f1" does not exist
+LINE 1: select f1 from c1;
+ ^
+HINT: Perhaps you meant to reference the column "c1.f2".
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1 (f1 int, f2 int);
+create table c1 () inherits(p1);
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+alter table p1 drop column f1;
+-- c1.f1 is dropped now, since there is no local definition for it
+select f1 from c1;
+ERROR: column "f1" does not exist
+LINE 1: select f1 from c1;
+ ^
+HINT: Perhaps you meant to reference the column "c1.f2".
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1 (f1 int, f2 int);
+create table c1 () inherits(p1);
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+alter table only p1 drop column f1;
+-- c1.f1 is NOT dropped, but must now be considered non-inherited
+alter table c1 drop column f1;
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1 (f1 int, f2 int);
+create table c1 (f1 int not null) inherits(p1);
+NOTICE: merging column "f1" with inherited definition
+-- should be rejected since c1.f1 is inherited
+alter table c1 drop column f1;
+ERROR: cannot drop inherited column "f1"
+alter table only p1 drop column f1;
+-- c1.f1 is still there, but no longer inherited
+alter table c1 drop column f1;
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+create table p1(id int, name text);
+create table p2(id2 int, name text, height int);
+create table c1(age int) inherits(p1,p2);
+NOTICE: merging multiple inherited definitions of column "name"
+create table gc1() inherits (c1);
+select relname, attname, attinhcount, attislocal
+from pg_class join pg_attribute on (pg_class.oid = pg_attribute.attrelid)
+where relname in ('p1','p2','c1','gc1') and attnum > 0 and not attisdropped
+order by relname, attnum;
+ relname | attname | attinhcount | attislocal
+---------+---------+-------------+------------
+ c1 | id | 1 | f
+ c1 | name | 2 | f
+ c1 | id2 | 1 | f
+ c1 | height | 1 | f
+ c1 | age | 0 | t
+ gc1 | id | 1 | f
+ gc1 | name | 1 | f
+ gc1 | id2 | 1 | f
+ gc1 | height | 1 | f
+ gc1 | age | 1 | f
+ p1 | id | 0 | t
+ p1 | name | 0 | t
+ p2 | id2 | 0 | t
+ p2 | name | 0 | t
+ p2 | height | 0 | t
+(15 rows)
+
+-- should work
+alter table only p1 drop column name;
+-- should work. Now c1.name is local and inhcount is 0.
+alter table p2 drop column name;
+-- should be rejected since it's inherited
+alter table gc1 drop column name;
+ERROR: cannot drop inherited column "name"
+-- should work, and drop gc1.name along with it
+alter table c1 drop column name;
+-- should fail: column does not exist
+alter table gc1 drop column name;
+ERROR: column "name" of relation "gc1" does not exist
+-- should work and drop the attribute in all tables
+alter table p2 drop column height;
+-- IF EXISTS test
+create table dropColumnExists ();
+alter table dropColumnExists drop column non_existing; --fail
+ERROR: column "non_existing" of relation "dropcolumnexists" does not exist
+alter table dropColumnExists drop column if exists non_existing; --succeed
+NOTICE: column "non_existing" of relation "dropcolumnexists" does not exist, skipping
+select relname, attname, attinhcount, attislocal
+from pg_class join pg_attribute on (pg_class.oid = pg_attribute.attrelid)
+where relname in ('p1','p2','c1','gc1') and attnum > 0 and not attisdropped
+order by relname, attnum;
+ relname | attname | attinhcount | attislocal
+---------+---------+-------------+------------
+ c1 | id | 1 | f
+ c1 | id2 | 1 | f
+ c1 | age | 0 | t
+ gc1 | id | 1 | f
+ gc1 | id2 | 1 | f
+ gc1 | age | 1 | f
+ p1 | id | 0 | t
+ p2 | id2 | 0 | t
+(8 rows)
+
+drop table p1, p2 cascade;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to table c1
+drop cascades to table gc1
+-- test attinhcount tracking with merged columns
+create table depth0();
+create table depth1(c text) inherits (depth0);
+create table depth2() inherits (depth1);
+alter table depth0 add c text;
+NOTICE: merging definition of column "c" for child "depth1"
+select attrelid::regclass, attname, attinhcount, attislocal
+from pg_attribute
+where attnum > 0 and attrelid::regclass in ('depth0', 'depth1', 'depth2')
+order by attrelid::regclass::text, attnum;
+ attrelid | attname | attinhcount | attislocal
+----------+---------+-------------+------------
+ depth0 | c | 0 | t
+ depth1 | c | 1 | t
+ depth2 | c | 1 | f
+(3 rows)
+
+-- test renumbering of child-table columns in inherited operations
+create table p1 (f1 int);
+create table c1 (f2 text, f3 int) inherits (p1);
+alter table p1 add column a1 int check (a1 > 0);
+alter table p1 add column f2 text;
+NOTICE: merging definition of column "f2" for child "c1"
+insert into p1 values (1,2,'abc');
+insert into c1 values(11,'xyz',33,0); -- should fail
+ERROR: new row for relation "c1" violates check constraint "p1_a1_check"
+DETAIL: Failing row contains (11, xyz, 33, 0).
+insert into c1 values(11,'xyz',33,22);
+select * from p1;
+ f1 | a1 | f2
+----+----+-----
+ 1 | 2 | abc
+ 11 | 22 | xyz
+(2 rows)
+
+update p1 set a1 = a1 + 1, f2 = upper(f2);
+select * from p1;
+ f1 | a1 | f2
+----+----+-----
+ 1 | 3 | ABC
+ 11 | 23 | XYZ
+(2 rows)
+
+drop table p1 cascade;
+NOTICE: drop cascades to table c1
+-- test that operations with a dropped column do not try to reference
+-- its datatype
+create domain mytype as text;
+create temp table foo (f1 text, f2 mytype, f3 text);
+insert into foo values('bb','cc','dd');
+select * from foo;
+ f1 | f2 | f3
+----+----+----
+ bb | cc | dd
+(1 row)
+
+drop domain mytype cascade;
+NOTICE: drop cascades to column f2 of table foo
+select * from foo;
+ f1 | f3
+----+----
+ bb | dd
+(1 row)
+
+insert into foo values('qq','rr');
+select * from foo;
+ f1 | f3
+----+----
+ bb | dd
+ qq | rr
+(2 rows)
+
+update foo set f3 = 'zz';
+select * from foo;
+ f1 | f3
+----+----
+ bb | zz
+ qq | zz
+(2 rows)
+
+select f3,max(f1) from foo group by f3;
+ f3 | max
+----+-----
+ zz | qq
+(1 row)
+
+-- Simple tests for alter table column type
+alter table foo alter f1 TYPE integer; -- fails
+ERROR: column "f1" cannot be cast automatically to type integer
+HINT: You might need to specify "USING f1::integer".
+alter table foo alter f1 TYPE varchar(10);
+create table anothertab (atcol1 serial8, atcol2 boolean,
+ constraint anothertab_chk check (atcol1 <= 3));
+insert into anothertab (atcol1, atcol2) values (default, true);
+insert into anothertab (atcol1, atcol2) values (default, false);
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------
+ 1 | t
+ 2 | f
+(2 rows)
+
+alter table anothertab alter column atcol1 type boolean; -- fails
+ERROR: column "atcol1" cannot be cast automatically to type boolean
+HINT: You might need to specify "USING atcol1::boolean".
+alter table anothertab alter column atcol1 type boolean using atcol1::int; -- fails
+ERROR: result of USING clause for column "atcol1" cannot be cast automatically to type boolean
+HINT: You might need to add an explicit cast.
+alter table anothertab alter column atcol1 type integer;
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------
+ 1 | t
+ 2 | f
+(2 rows)
+
+insert into anothertab (atcol1, atcol2) values (45, null); -- fails
+ERROR: new row for relation "anothertab" violates check constraint "anothertab_chk"
+DETAIL: Failing row contains (45, null).
+insert into anothertab (atcol1, atcol2) values (default, null);
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------
+ 1 | t
+ 2 | f
+ 3 |
+(3 rows)
+
+alter table anothertab alter column atcol2 type text
+ using case when atcol2 is true then 'IT WAS TRUE'
+ when atcol2 is false then 'IT WAS FALSE'
+ else 'IT WAS NULL!' end;
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------------
+ 1 | IT WAS TRUE
+ 2 | IT WAS FALSE
+ 3 | IT WAS NULL!
+(3 rows)
+
+alter table anothertab alter column atcol1 type boolean
+ using case when atcol1 % 2 = 0 then true else false end; -- fails
+ERROR: default for column "atcol1" cannot be cast automatically to type boolean
+alter table anothertab alter column atcol1 drop default;
+alter table anothertab alter column atcol1 type boolean
+ using case when atcol1 % 2 = 0 then true else false end; -- fails
+ERROR: operator does not exist: boolean <= integer
+HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
+alter table anothertab drop constraint anothertab_chk;
+alter table anothertab drop constraint anothertab_chk; -- fails
+ERROR: constraint "anothertab_chk" of relation "anothertab" does not exist
+alter table anothertab drop constraint IF EXISTS anothertab_chk; -- succeeds
+NOTICE: constraint "anothertab_chk" of relation "anothertab" does not exist, skipping
+alter table anothertab alter column atcol1 type boolean
+ using case when atcol1 % 2 = 0 then true else false end;
+select * from anothertab;
+ atcol1 | atcol2
+--------+--------------
+ f | IT WAS TRUE
+ t | IT WAS FALSE
+ f | IT WAS NULL!
+(3 rows)
+
+drop table anothertab;
+-- Test index handling in alter table column type (cf. bugs #15835, #15865)
+create table anothertab(f1 int primary key, f2 int unique,
+ f3 int, f4 int, f5 int);
+alter table anothertab
+ add exclude using btree (f3 with =);
+alter table anothertab
+ add exclude using btree (f4 with =) where (f4 is not null);
+alter table anothertab
+ add exclude using btree (f4 with =) where (f5 > 0);
+alter table anothertab
+ add unique(f1,f4);
+create index on anothertab(f2,f3);
+create unique index on anothertab(f4);
+\d anothertab
+ Table "public.anothertab"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ f1 | integer | | not null |
+ f2 | integer | | |
+ f3 | integer | | |
+ f4 | integer | | |
+ f5 | integer | | |
+Indexes:
+ "anothertab_pkey" PRIMARY KEY, btree (f1)
+ "anothertab_f1_f4_key" UNIQUE CONSTRAINT, btree (f1, f4)
+ "anothertab_f2_f3_idx" btree (f2, f3)
+ "anothertab_f2_key" UNIQUE CONSTRAINT, btree (f2)
+ "anothertab_f3_excl" EXCLUDE USING btree (f3 WITH =)
+ "anothertab_f4_excl" EXCLUDE USING btree (f4 WITH =) WHERE (f4 IS NOT NULL)
+ "anothertab_f4_excl1" EXCLUDE USING btree (f4 WITH =) WHERE (f5 > 0)
+ "anothertab_f4_idx" UNIQUE, btree (f4)
+
+alter table anothertab alter column f1 type bigint;
+alter table anothertab
+ alter column f2 type bigint,
+ alter column f3 type bigint,
+ alter column f4 type bigint;
+alter table anothertab alter column f5 type bigint;
+\d anothertab
+ Table "public.anothertab"
+ Column | Type | Collation | Nullable | Default
+--------+--------+-----------+----------+---------
+ f1 | bigint | | not null |
+ f2 | bigint | | |
+ f3 | bigint | | |
+ f4 | bigint | | |
+ f5 | bigint | | |
+Indexes:
+ "anothertab_pkey" PRIMARY KEY, btree (f1)
+ "anothertab_f1_f4_key" UNIQUE CONSTRAINT, btree (f1, f4)
+ "anothertab_f2_f3_idx" btree (f2, f3)
+ "anothertab_f2_key" UNIQUE CONSTRAINT, btree (f2)
+ "anothertab_f3_excl" EXCLUDE USING btree (f3 WITH =)
+ "anothertab_f4_excl" EXCLUDE USING btree (f4 WITH =) WHERE (f4 IS NOT NULL)
+ "anothertab_f4_excl1" EXCLUDE USING btree (f4 WITH =) WHERE (f5 > 0)
+ "anothertab_f4_idx" UNIQUE, btree (f4)
+
+drop table anothertab;
+-- test that USING expressions are parsed before column alter type / drop steps
+create table another (f1 int, f2 text, f3 text);
+insert into another values(1, 'one', 'uno');
+insert into another values(2, 'two', 'due');
+insert into another values(3, 'three', 'tre');
+select * from another;
+ f1 | f2 | f3
+----+-------+-----
+ 1 | one | uno
+ 2 | two | due
+ 3 | three | tre
+(3 rows)
+
+alter table another
+ alter f1 type text using f2 || ' and ' || f3 || ' more',
+ alter f2 type bigint using f1 * 10,
+ drop column f3;
+select * from another;
+ f1 | f2
+--------------------+----
+ one and uno more | 10
+ two and due more | 20
+ three and tre more | 30
+(3 rows)
+
+drop table another;
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
+-- table's row type
+create table tab1 (a int, b text);
+create table tab2 (x int, y tab1);
+alter table tab1 alter column b type varchar; -- fails
+ERROR: cannot alter table "tab1" because column "tab2.y" uses its row type
+-- Alter column type that's part of a partitioned index
+create table at_partitioned (a int, b text) partition by range (a);
+create table at_part_1 partition of at_partitioned for values from (0) to (1000);
+insert into at_partitioned values (512, '0.123');
+create table at_part_2 (b text, a int);
+insert into at_part_2 values ('1.234', 1024);
+create index on at_partitioned (b);
+create index on at_partitioned (a);
+\d at_part_1
+ Table "public.at_part_1"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+Partition of: at_partitioned FOR VALUES FROM (0) TO (1000)
+Indexes:
+ "at_part_1_a_idx" btree (a)
+ "at_part_1_b_idx" btree (b)
+
+\d at_part_2
+ Table "public.at_part_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ b | text | | |
+ a | integer | | |
+
+alter table at_partitioned attach partition at_part_2 for values from (1000) to (2000);
+\d at_part_2
+ Table "public.at_part_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ b | text | | |
+ a | integer | | |
+Partition of: at_partitioned FOR VALUES FROM (1000) TO (2000)
+Indexes:
+ "at_part_2_a_idx" btree (a)
+ "at_part_2_b_idx" btree (b)
+
+alter table at_partitioned alter column b type numeric using b::numeric;
+\d at_part_1
+ Table "public.at_part_1"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | numeric | | |
+Partition of: at_partitioned FOR VALUES FROM (0) TO (1000)
+Indexes:
+ "at_part_1_a_idx" btree (a)
+ "at_part_1_b_idx" btree (b)
+
+\d at_part_2
+ Table "public.at_part_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ b | numeric | | |
+ a | integer | | |
+Partition of: at_partitioned FOR VALUES FROM (1000) TO (2000)
+Indexes:
+ "at_part_2_a_idx" btree (a)
+ "at_part_2_b_idx" btree (b)
+
+drop table at_partitioned;
+-- Alter column type when no table rewrite is required
+-- Also check that comments are preserved
+create table at_partitioned(id int, name varchar(64), unique (id, name))
+ partition by hash(id);
+comment on constraint at_partitioned_id_name_key on at_partitioned is 'parent constraint';
+comment on index at_partitioned_id_name_key is 'parent index';
+create table at_partitioned_0 partition of at_partitioned
+ for values with (modulus 2, remainder 0);
+comment on constraint at_partitioned_0_id_name_key on at_partitioned_0 is 'child 0 constraint';
+comment on index at_partitioned_0_id_name_key is 'child 0 index';
+create table at_partitioned_1 partition of at_partitioned
+ for values with (modulus 2, remainder 1);
+comment on constraint at_partitioned_1_id_name_key on at_partitioned_1 is 'child 1 constraint';
+comment on index at_partitioned_1_id_name_key is 'child 1 index';
+insert into at_partitioned values(1, 'foo');
+insert into at_partitioned values(3, 'bar');
+create temp table old_oids as
+ select relname, oid as oldoid, relfilenode as oldfilenode
+ from pg_class where relname like 'at_partitioned%';
+select relname,
+ c.oid = oldoid as orig_oid,
+ case relfilenode
+ when 0 then 'none'
+ when c.oid then 'own'
+ when oldfilenode then 'orig'
+ else 'OTHER'
+ end as storage,
+ obj_description(c.oid, 'pg_class') as desc
+ from pg_class c left join old_oids using (relname)
+ where relname like 'at_partitioned%'
+ order by relname;
+ relname | orig_oid | storage | desc
+------------------------------+----------+---------+---------------
+ at_partitioned | t | none |
+ at_partitioned_0 | t | own |
+ at_partitioned_0_id_name_key | t | own | child 0 index
+ at_partitioned_1 | t | own |
+ at_partitioned_1_id_name_key | t | own | child 1 index
+ at_partitioned_id_name_key | t | none | parent index
+(6 rows)
+
+select conname, obj_description(oid, 'pg_constraint') as desc
+ from pg_constraint where conname like 'at_partitioned%'
+ order by conname;
+ conname | desc
+------------------------------+--------------------
+ at_partitioned_0_id_name_key | child 0 constraint
+ at_partitioned_1_id_name_key | child 1 constraint
+ at_partitioned_id_name_key | parent constraint
+(3 rows)
+
+alter table at_partitioned alter column name type varchar(127);
+-- Note: these tests currently show the wrong behavior for comments :-(
+select relname,
+ c.oid = oldoid as orig_oid,
+ case relfilenode
+ when 0 then 'none'
+ when c.oid then 'own'
+ when oldfilenode then 'orig'
+ else 'OTHER'
+ end as storage,
+ obj_description(c.oid, 'pg_class') as desc
+ from pg_class c left join old_oids using (relname)
+ where relname like 'at_partitioned%'
+ order by relname;
+ relname | orig_oid | storage | desc
+------------------------------+----------+---------+--------------
+ at_partitioned | t | none |
+ at_partitioned_0 | t | own |
+ at_partitioned_0_id_name_key | f | own | parent index
+ at_partitioned_1 | t | own |
+ at_partitioned_1_id_name_key | f | own | parent index
+ at_partitioned_id_name_key | f | none | parent index
+(6 rows)
+
+select conname, obj_description(oid, 'pg_constraint') as desc
+ from pg_constraint where conname like 'at_partitioned%'
+ order by conname;
+ conname | desc
+------------------------------+-------------------
+ at_partitioned_0_id_name_key |
+ at_partitioned_1_id_name_key |
+ at_partitioned_id_name_key | parent constraint
+(3 rows)
+
+-- Don't remove this DROP, it exposes bug #15672
+drop table at_partitioned;
+-- disallow recursive containment of row types
+create temp table recur1 (f1 int);
+alter table recur1 add column f2 recur1; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+alter table recur1 add column f2 recur1[]; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+create domain array_of_recur1 as recur1[];
+alter table recur1 add column f2 array_of_recur1; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+create temp table recur2 (f1 int, f2 recur1);
+alter table recur1 add column f2 recur2; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+alter table recur1 add column f2 int;
+alter table recur1 alter column f2 type recur2; -- fails
+ERROR: composite type recur1 cannot be made a member of itself
+-- SET STORAGE may need to add a TOAST table
+create table test_storage (a text);
+alter table test_storage alter a set storage plain;
+alter table test_storage add b int default 0; -- rewrite table to remove its TOAST table
+alter table test_storage alter a set storage extended; -- re-add TOAST table
+select reltoastrelid <> 0 as has_toast_table
+from pg_class
+where oid = 'test_storage'::regclass;
+ has_toast_table
+-----------------
+ f
+(1 row)
+
+-- test that SET STORAGE propagates to index correctly
+create index test_storage_idx on test_storage (b, a);
+alter table test_storage alter column a set storage external;
+\d+ test_storage
+ Table "public.test_storage"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+--------+---------+-----------+----------+---------+----------+--------------+-------------
+ a | text | | | | external | |
+ b | integer | | | 0 | plain | |
+Indexes:
+ "test_storage_idx" btree (b, a)
+
+\d+ test_storage_idx
+ Index "public.test_storage_idx"
+ Column | Type | Key? | Definition | Storage | Stats target
+--------+---------+------+------------+----------+--------------
+ b | integer | yes | b | plain |
+ a | text | yes | a | external |
+btree, for table "public.test_storage"
+
+-- ALTER COLUMN TYPE with a check constraint and a child table (bug #13779)
+CREATE TABLE test_inh_check (a float check (a > 10.2), b float);
+CREATE TABLE test_inh_check_child() INHERITS(test_inh_check);
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | double precision | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | double precision | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(2 rows)
+
+ALTER TABLE test_inh_check ALTER COLUMN a TYPE numeric;
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(2 rows)
+
+-- also try noinherit, local, and local+inherited cases
+ALTER TABLE test_inh_check ADD CONSTRAINT bnoinherit CHECK (b > 100) NO INHERIT;
+ALTER TABLE test_inh_check_child ADD CONSTRAINT blocal CHECK (b < 1000);
+ALTER TABLE test_inh_check_child ADD CONSTRAINT bmerged CHECK (b > 1);
+ALTER TABLE test_inh_check ADD CONSTRAINT bmerged CHECK (b > 1);
+NOTICE: merging constraint "bmerged" with inherited definition
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "bmerged" CHECK (b > 1::double precision)
+ "bnoinherit" CHECK (b > 100::double precision) NO INHERIT
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+------------------+-----------+----------+---------
+ a | numeric | | |
+ b | double precision | | |
+Check constraints:
+ "blocal" CHECK (b < 1000::double precision)
+ "bmerged" CHECK (b > 1::double precision)
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | bmerged | 0 | t | f
+ test_inh_check | bnoinherit | 0 | t | t
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | blocal | 0 | t | f
+ test_inh_check_child | bmerged | 1 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(6 rows)
+
+ALTER TABLE test_inh_check ALTER COLUMN b TYPE numeric;
+NOTICE: merging constraint "bmerged" with inherited definition
+\d test_inh_check
+ Table "public.test_inh_check"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | numeric | | |
+ b | numeric | | |
+Check constraints:
+ "bmerged" CHECK (b::double precision > 1::double precision)
+ "bnoinherit" CHECK (b::double precision > 100::double precision) NO INHERIT
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Number of child tables: 1 (Use \d+ to list them.)
+
+\d test_inh_check_child
+ Table "public.test_inh_check_child"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | numeric | | |
+ b | numeric | | |
+Check constraints:
+ "blocal" CHECK (b::double precision < 1000::double precision)
+ "bmerged" CHECK (b::double precision > 1::double precision)
+ "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
+Inherits: test_inh_check
+
+select relname, conname, coninhcount, conislocal, connoinherit
+ from pg_constraint c, pg_class r
+ where relname like 'test_inh_check%' and c.conrelid = r.oid
+ order by 1, 2;
+ relname | conname | coninhcount | conislocal | connoinherit
+----------------------+------------------------+-------------+------------+--------------
+ test_inh_check | bmerged | 0 | t | f
+ test_inh_check | bnoinherit | 0 | t | t
+ test_inh_check | test_inh_check_a_check | 0 | t | f
+ test_inh_check_child | blocal | 0 | t | f
+ test_inh_check_child | bmerged | 1 | t | f
+ test_inh_check_child | test_inh_check_a_check | 1 | f | f
+(6 rows)
+
+-- ALTER COLUMN TYPE with different schema in children
+-- Bug at https://postgr.es/m/20170102225618.GA10071@telsasoft.com
+CREATE TABLE test_type_diff (f1 int);
+CREATE TABLE test_type_diff_c (extra smallint) INHERITS (test_type_diff);
+ALTER TABLE test_type_diff ADD COLUMN f2 int;
+INSERT INTO test_type_diff_c VALUES (1, 2, 3);
+ALTER TABLE test_type_diff ALTER COLUMN f2 TYPE bigint USING f2::bigint;
+CREATE TABLE test_type_diff2 (int_two int2, int_four int4, int_eight int8);
+CREATE TABLE test_type_diff2_c1 (int_four int4, int_eight int8, int_two int2);
+CREATE TABLE test_type_diff2_c2 (int_eight int8, int_two int2, int_four int4);
+CREATE TABLE test_type_diff2_c3 (int_two int2, int_four int4, int_eight int8);
+ALTER TABLE test_type_diff2_c1 INHERIT test_type_diff2;
+ALTER TABLE test_type_diff2_c2 INHERIT test_type_diff2;
+ALTER TABLE test_type_diff2_c3 INHERIT test_type_diff2;
+INSERT INTO test_type_diff2_c1 VALUES (1, 2, 3);
+INSERT INTO test_type_diff2_c2 VALUES (4, 5, 6);
+INSERT INTO test_type_diff2_c3 VALUES (7, 8, 9);
+ALTER TABLE test_type_diff2 ALTER COLUMN int_four TYPE int8 USING int_four::int8;
+-- whole-row references are disallowed
+ALTER TABLE test_type_diff2 ALTER COLUMN int_four TYPE int4 USING (pg_column_size(test_type_diff2));
+ERROR: cannot convert whole-row table reference
+DETAIL: USING expression contains a whole-row table reference.
+-- check for rollback of ANALYZE corrupting table property flags (bug #11638)
+CREATE TABLE check_fk_presence_1 (id int PRIMARY KEY, t text);
+CREATE TABLE check_fk_presence_2 (id int REFERENCES check_fk_presence_1, t text);
+BEGIN;
+ALTER TABLE check_fk_presence_2 DROP CONSTRAINT check_fk_presence_2_id_fkey;
+ANALYZE check_fk_presence_2;
+ROLLBACK;
+\d check_fk_presence_2
+ Table "public.check_fk_presence_2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ id | integer | | |
+ t | text | | |
+Foreign-key constraints:
+ "check_fk_presence_2_id_fkey" FOREIGN KEY (id) REFERENCES check_fk_presence_1(id)
+
+DROP TABLE check_fk_presence_1, check_fk_presence_2;
+-- check column addition within a view (bug #14876)
+create table at_base_table(id int, stuff text);
+insert into at_base_table values (23, 'skidoo');
+create view at_view_1 as select * from at_base_table bt;
+create view at_view_2 as select *, to_json(v1) as j from at_view_1 v1;
+\d+ at_view_1
+ View "public.at_view_1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+View definition:
+ SELECT bt.id,
+ bt.stuff
+ FROM at_base_table bt;
+
+\d+ at_view_2
+ View "public.at_view_2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+ j | json | | | | extended |
+View definition:
+ SELECT v1.id,
+ v1.stuff,
+ to_json(v1.*) AS j
+ FROM at_view_1 v1;
+
+explain (verbose, costs off) select * from at_view_2;
+ QUERY PLAN
+----------------------------------------------------------
+ Seq Scan on public.at_base_table bt
+ Output: bt.id, bt.stuff, to_json(ROW(bt.id, bt.stuff))
+(2 rows)
+
+select * from at_view_2;
+ id | stuff | j
+----+--------+----------------------------
+ 23 | skidoo | {"id":23,"stuff":"skidoo"}
+(1 row)
+
+create or replace view at_view_1 as select *, 2+2 as more from at_base_table bt;
+\d+ at_view_1
+ View "public.at_view_1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+ more | integer | | | | plain |
+View definition:
+ SELECT bt.id,
+ bt.stuff,
+ 2 + 2 AS more
+ FROM at_base_table bt;
+
+\d+ at_view_2
+ View "public.at_view_2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+----------+-------------
+ id | integer | | | | plain |
+ stuff | text | | | | extended |
+ j | json | | | | extended |
+View definition:
+ SELECT v1.id,
+ v1.stuff,
+ to_json(v1.*) AS j
+ FROM at_view_1 v1;
+
+explain (verbose, costs off) select * from at_view_2;
+ QUERY PLAN
+----------------------------------------------------------------
+ Seq Scan on public.at_base_table bt
+ Output: bt.id, bt.stuff, to_json(ROW(bt.id, bt.stuff, NULL))
+(2 rows)
+
+select * from at_view_2;
+ id | stuff | j
+----+--------+----------------------------------------
+ 23 | skidoo | {"id":23,"stuff":"skidoo","more":null}
+(1 row)
+
+drop view at_view_2;
+drop view at_view_1;
+drop table at_base_table;
+-- check adding a column not itself requiring a rewrite, together with
+-- a column requiring a default (bug #16038)
+-- ensure that rewrites aren't silently optimized away, removing the
+-- value of the test
+CREATE FUNCTION check_ddl_rewrite(p_tablename regclass, p_ddl text)
+RETURNS boolean
+LANGUAGE plpgsql AS $$
+DECLARE
+ v_relfilenode oid;
+BEGIN
+ v_relfilenode := relfilenode FROM pg_class WHERE oid = p_tablename;
+
+ EXECUTE p_ddl;
+
+ RETURN v_relfilenode <> (SELECT relfilenode FROM pg_class WHERE oid = p_tablename);
+END;
+$$;
+CREATE TABLE rewrite_test(col text);
+INSERT INTO rewrite_test VALUES ('something');
+INSERT INTO rewrite_test VALUES (NULL);
+-- empty[12] don't need rewrite, but notempty[12]_rewrite will force one
+SELECT check_ddl_rewrite('rewrite_test', $$
+ ALTER TABLE rewrite_test
+ ADD COLUMN empty1 text,
+ ADD COLUMN notempty1_rewrite serial;
+$$);
+ check_ddl_rewrite
+-------------------
+ t
+(1 row)
+
+SELECT check_ddl_rewrite('rewrite_test', $$
+ ALTER TABLE rewrite_test
+ ADD COLUMN notempty2_rewrite serial,
+ ADD COLUMN empty2 text;
+$$);
+ check_ddl_rewrite
+-------------------
+ t
+(1 row)
+
+-- also check that fast defaults cause no problem, first without rewrite
+SELECT check_ddl_rewrite('rewrite_test', $$
+ ALTER TABLE rewrite_test
+ ADD COLUMN empty3 text,
+ ADD COLUMN notempty3_norewrite int default 42;
+$$);
+ check_ddl_rewrite
+-------------------
+ f
+(1 row)
+
+SELECT check_ddl_rewrite('rewrite_test', $$
+ ALTER TABLE rewrite_test
+ ADD COLUMN notempty4_norewrite int default 42,
+ ADD COLUMN empty4 text;
+$$);
+ check_ddl_rewrite
+-------------------
+ f
+(1 row)
+
+-- then with rewrite
+SELECT check_ddl_rewrite('rewrite_test', $$
+ ALTER TABLE rewrite_test
+ ADD COLUMN empty5 text,
+ ADD COLUMN notempty5_norewrite int default 42,
+ ADD COLUMN notempty5_rewrite serial;
+$$);
+ check_ddl_rewrite
+-------------------
+ t
+(1 row)
+
+SELECT check_ddl_rewrite('rewrite_test', $$
+ ALTER TABLE rewrite_test
+ ADD COLUMN notempty6_rewrite serial,
+ ADD COLUMN empty6 text,
+ ADD COLUMN notempty6_norewrite int default 42;
+$$);
+ check_ddl_rewrite
+-------------------
+ t
+(1 row)
+
+-- cleanup
+DROP FUNCTION check_ddl_rewrite(regclass, text);
+DROP TABLE rewrite_test;
+--
+-- lock levels
+--
+drop type lockmodes;
+ERROR: type "lockmodes" does not exist
+create type lockmodes as enum (
+ 'SIReadLock'
+,'AccessShareLock'
+,'RowShareLock'
+,'RowExclusiveLock'
+,'ShareUpdateExclusiveLock'
+,'ShareLock'
+,'ShareRowExclusiveLock'
+,'ExclusiveLock'
+,'AccessExclusiveLock'
+);
+drop view my_locks;
+ERROR: view "my_locks" does not exist
+create or replace view my_locks as
+select case when c.relname like 'pg_toast%' then 'pg_toast' else c.relname end, max(mode::lockmodes) as max_lockmode
+from pg_locks l join pg_class c on l.relation = c.oid
+where virtualtransaction = (
+ select virtualtransaction
+ from pg_locks
+ where transactionid = pg_current_xact_id()::xid)
+and locktype = 'relation'
+and relnamespace != (select oid from pg_namespace where nspname = 'pg_catalog')
+and c.relname != 'my_locks'
+group by c.relname;
+create table alterlock (f1 int primary key, f2 text);
+insert into alterlock values (1, 'foo');
+create table alterlock2 (f3 int primary key, f1 int);
+insert into alterlock2 values (1, 1);
+begin; alter table alterlock alter column f2 set statistics 150;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+rollback;
+begin; alter table alterlock cluster on alterlock_pkey;
+select * from my_locks order by 1;
+ relname | max_lockmode
+----------------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+ alterlock_pkey | ShareUpdateExclusiveLock
+(2 rows)
+
+commit;
+begin; alter table alterlock set without cluster;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock set (fillfactor = 100);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock reset (fillfactor);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock set (toast.autovacuum_enabled = off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock set (autovacuum_enabled = off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock alter column f2 set (n_distinct = 1);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+rollback;
+-- test that mixing options with different lock levels works as expected
+begin; alter table alterlock set (autovacuum_enabled = off, fillfactor = 80);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+--------------------------
+ alterlock | ShareUpdateExclusiveLock
+(1 row)
+
+commit;
+begin; alter table alterlock alter column f2 set storage extended;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+---------------------
+ alterlock | AccessExclusiveLock
+(1 row)
+
+rollback;
+begin; alter table alterlock alter column f2 set default 'x';
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+---------------------
+ alterlock | AccessExclusiveLock
+(1 row)
+
+rollback;
+begin;
+create trigger ttdummy
+ before delete or update on alterlock
+ for each row
+ execute procedure
+ ttdummy (1, 1);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------+-----------------------
+ alterlock | ShareRowExclusiveLock
+(1 row)
+
+rollback;
+begin;
+select * from my_locks order by 1;
+ relname | max_lockmode
+---------+--------------
+(0 rows)
+
+alter table alterlock2 add foreign key (f1) references alterlock (f1);
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------------+-----------------------
+ alterlock | ShareRowExclusiveLock
+ alterlock2 | ShareRowExclusiveLock
+ alterlock2_pkey | AccessShareLock
+ alterlock_pkey | AccessShareLock
+(4 rows)
+
+rollback;
+begin;
+alter table alterlock2
+add constraint alterlock2nv foreign key (f1) references alterlock (f1) NOT VALID;
+select * from my_locks order by 1;
+ relname | max_lockmode
+------------+-----------------------
+ alterlock | ShareRowExclusiveLock
+ alterlock2 | ShareRowExclusiveLock
+(2 rows)
+
+commit;
+begin;
+alter table alterlock2 validate constraint alterlock2nv;
+select * from my_locks order by 1;
+ relname | max_lockmode
+-----------------+--------------------------
+ alterlock | RowShareLock
+ alterlock2 | ShareUpdateExclusiveLock
+ alterlock2_pkey | AccessShareLock
+ alterlock_pkey | AccessShareLock
+(4 rows)
+
+rollback;
+create or replace view my_locks as
+select case when c.relname like 'pg_toast%' then 'pg_toast' else c.relname end, max(mode::lockmodes) as max_lockmode
+from pg_locks l join pg_class c on l.relation = c.oid
+where virtualtransaction = (
+ select virtualtransaction
+ from pg_locks
+ where transactionid = pg_current_xact_id()::xid)
+and locktype = 'relation'
+and relnamespace != (select oid from pg_namespace where nspname = 'pg_catalog')
+and c.relname = 'my_locks'
+group by c.relname;
+-- raise exception
+alter table my_locks set (autovacuum_enabled = false);
+ERROR: unrecognized parameter "autovacuum_enabled"
+alter view my_locks set (autovacuum_enabled = false);
+ERROR: unrecognized parameter "autovacuum_enabled"
+alter table my_locks reset (autovacuum_enabled);
+alter view my_locks reset (autovacuum_enabled);
+begin;
+alter view my_locks set (security_barrier=off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+----------+---------------------
+ my_locks | AccessExclusiveLock
+(1 row)
+
+alter view my_locks reset (security_barrier);
+rollback;
+-- this test intentionally applies the ALTER TABLE command against a view, but
+-- uses a view option so we expect this to succeed. This form of SQL is
+-- accepted for historical reasons, as shown in the docs for ALTER VIEW
+begin;
+alter table my_locks set (security_barrier=off);
+select * from my_locks order by 1;
+ relname | max_lockmode
+----------+---------------------
+ my_locks | AccessExclusiveLock
+(1 row)
+
+alter table my_locks reset (security_barrier);
+rollback;
+-- cleanup
+drop table alterlock2;
+drop table alterlock;
+drop view my_locks;
+drop type lockmodes;
+--
+-- alter function
+--
+create function test_strict(text) returns text as
+ 'select coalesce($1, ''got passed a null'');'
+ language sql returns null on null input;
+select test_strict(NULL);
+ test_strict
+-------------
+
+(1 row)
+
+alter function test_strict(text) called on null input;
+select test_strict(NULL);
+ test_strict
+-------------------
+ got passed a null
+(1 row)
+
+create function non_strict(text) returns text as
+ 'select coalesce($1, ''got passed a null'');'
+ language sql called on null input;
+select non_strict(NULL);
+ non_strict
+-------------------
+ got passed a null
+(1 row)
+
+alter function non_strict(text) returns null on null input;
+select non_strict(NULL);
+ non_strict
+------------
+
+(1 row)
+
+--
+-- alter object set schema
+--
+create schema alter1;
+create schema alter2;
+create table alter1.t1(f1 serial primary key, f2 int check (f2 > 0));
+create view alter1.v1 as select * from alter1.t1;
+create function alter1.plus1(int) returns int as 'select $1+1' language sql;
+create domain alter1.posint integer check (value > 0);
+create type alter1.ctype as (f1 int, f2 text);
+create function alter1.same(alter1.ctype, alter1.ctype) returns boolean language sql
+as 'select $1.f1 is not distinct from $2.f1 and $1.f2 is not distinct from $2.f2';
+create operator alter1.=(procedure = alter1.same, leftarg = alter1.ctype, rightarg = alter1.ctype);
+create operator class alter1.ctype_hash_ops default for type alter1.ctype using hash as
+ operator 1 alter1.=(alter1.ctype, alter1.ctype);
+create conversion alter1.latin1_to_utf8 for 'latin1' to 'utf8' from iso8859_1_to_utf8;
+create text search parser alter1.prs(start = prsd_start, gettoken = prsd_nexttoken, end = prsd_end, lextypes = prsd_lextype);
+create text search configuration alter1.cfg(parser = alter1.prs);
+create text search template alter1.tmpl(init = dsimple_init, lexize = dsimple_lexize);
+create text search dictionary alter1.dict(template = alter1.tmpl);
+insert into alter1.t1(f2) values(11);
+insert into alter1.t1(f2) values(12);
+alter table alter1.t1 set schema alter1; -- no-op, same schema
+alter table alter1.t1 set schema alter2;
+alter table alter1.v1 set schema alter2;
+alter function alter1.plus1(int) set schema alter2;
+alter domain alter1.posint set schema alter2;
+alter operator class alter1.ctype_hash_ops using hash set schema alter2;
+alter operator family alter1.ctype_hash_ops using hash set schema alter2;
+alter operator alter1.=(alter1.ctype, alter1.ctype) set schema alter2;
+alter function alter1.same(alter1.ctype, alter1.ctype) set schema alter2;
+alter type alter1.ctype set schema alter1; -- no-op, same schema
+alter type alter1.ctype set schema alter2;
+alter conversion alter1.latin1_to_utf8 set schema alter2;
+alter text search parser alter1.prs set schema alter2;
+alter text search configuration alter1.cfg set schema alter2;
+alter text search template alter1.tmpl set schema alter2;
+alter text search dictionary alter1.dict set schema alter2;
+-- this should succeed because nothing is left in alter1
+drop schema alter1;
+insert into alter2.t1(f2) values(13);
+insert into alter2.t1(f2) values(14);
+select * from alter2.t1;
+ f1 | f2
+----+----
+ 1 | 11
+ 2 | 12
+ 3 | 13
+ 4 | 14
+(4 rows)
+
+select * from alter2.v1;
+ f1 | f2
+----+----
+ 1 | 11
+ 2 | 12
+ 3 | 13
+ 4 | 14
+(4 rows)
+
+select alter2.plus1(41);
+ plus1
+-------
+ 42
+(1 row)
+
+-- clean up
+drop schema alter2 cascade;
+NOTICE: drop cascades to 13 other objects
+DETAIL: drop cascades to table alter2.t1
+drop cascades to view alter2.v1
+drop cascades to function alter2.plus1(integer)
+drop cascades to type alter2.posint
+drop cascades to type alter2.ctype
+drop cascades to function alter2.same(alter2.ctype,alter2.ctype)
+drop cascades to operator alter2.=(alter2.ctype,alter2.ctype)
+drop cascades to operator family alter2.ctype_hash_ops for access method hash
+drop cascades to conversion alter2.latin1_to_utf8
+drop cascades to text search parser alter2.prs
+drop cascades to text search configuration alter2.cfg
+drop cascades to text search template alter2.tmpl
+drop cascades to text search dictionary alter2.dict
+--
+-- composite types
+--
+CREATE TYPE test_type AS (a int);
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+
+ALTER TYPE nosuchtype ADD ATTRIBUTE b text; -- fails
+ERROR: relation "nosuchtype" does not exist
+ALTER TYPE test_type ADD ATTRIBUTE b text;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+
+ALTER TYPE test_type ADD ATTRIBUTE b text; -- fails
+ERROR: column "b" of relation "test_type" already exists
+ALTER TYPE test_type ALTER ATTRIBUTE b SET DATA TYPE varchar;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+-------------------+-----------+----------+---------
+ a | integer | | |
+ b | character varying | | |
+
+ALTER TYPE test_type ALTER ATTRIBUTE b SET DATA TYPE integer;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+
+ALTER TYPE test_type DROP ATTRIBUTE b;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+
+ALTER TYPE test_type DROP ATTRIBUTE c; -- fails
+ERROR: column "c" of relation "test_type" does not exist
+ALTER TYPE test_type DROP ATTRIBUTE IF EXISTS c;
+NOTICE: column "c" of relation "test_type" does not exist, skipping
+ALTER TYPE test_type DROP ATTRIBUTE a, ADD ATTRIBUTE d boolean;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ d | boolean | | |
+
+ALTER TYPE test_type RENAME ATTRIBUTE a TO aa;
+ERROR: column "a" does not exist
+ALTER TYPE test_type RENAME ATTRIBUTE d TO dd;
+\d test_type
+ Composite type "public.test_type"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ dd | boolean | | |
+
+DROP TYPE test_type;
+CREATE TYPE test_type1 AS (a int, b text);
+CREATE TABLE test_tbl1 (x int, y test_type1);
+ALTER TYPE test_type1 ALTER ATTRIBUTE b TYPE varchar; -- fails
+ERROR: cannot alter type "test_type1" because column "test_tbl1.y" uses it
+CREATE TYPE test_type2 AS (a int, b text);
+CREATE TABLE test_tbl2 OF test_type2;
+CREATE TABLE test_tbl2_subclass () INHERITS (test_tbl2);
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 ADD ATTRIBUTE c text; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 ADD ATTRIBUTE c text CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | text | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 ALTER ATTRIBUTE b TYPE varchar; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 ALTER ATTRIBUTE b TYPE varchar CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+-------------------+-----------+----------+---------
+ a | integer | | |
+ b | character varying | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+-------------------+-----------+----------+---------
+ a | integer | | |
+ b | character varying | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 DROP ATTRIBUTE b; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 DROP ATTRIBUTE b CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+ALTER TYPE test_type2 RENAME ATTRIBUTE a TO aa; -- fails
+ERROR: cannot alter type "test_type2" because it is the type of a typed table
+HINT: Use ALTER ... CASCADE to alter the typed tables too.
+ALTER TYPE test_type2 RENAME ATTRIBUTE a TO aa CASCADE;
+\d test_type2
+ Composite type "public.test_type2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ aa | integer | | |
+ c | text | | |
+
+\d test_tbl2
+ Table "public.test_tbl2"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ aa | integer | | |
+ c | text | | |
+Number of child tables: 1 (Use \d+ to list them.)
+Typed table of type: test_type2
+
+\d test_tbl2_subclass
+ Table "public.test_tbl2_subclass"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ aa | integer | | |
+ c | text | | |
+Inherits: test_tbl2
+
+DROP TABLE test_tbl2_subclass;
+CREATE TYPE test_typex AS (a int, b text);
+CREATE TABLE test_tblx (x int, y test_typex check ((y).a > 0));
+ALTER TYPE test_typex DROP ATTRIBUTE a; -- fails
+ERROR: cannot drop column a of composite type test_typex because other objects depend on it
+DETAIL: constraint test_tblx_y_check on table test_tblx depends on column a of composite type test_typex
+HINT: Use DROP ... CASCADE to drop the dependent objects too.
+ALTER TYPE test_typex DROP ATTRIBUTE a CASCADE;
+NOTICE: drop cascades to constraint test_tblx_y_check on table test_tblx
+\d test_tblx
+ Table "public.test_tblx"
+ Column | Type | Collation | Nullable | Default
+--------+------------+-----------+----------+---------
+ x | integer | | |
+ y | test_typex | | |
+
+DROP TABLE test_tblx;
+DROP TYPE test_typex;
+-- This test isn't that interesting on its own, but the purpose is to leave
+-- behind a table to test pg_upgrade with. The table has a composite type
+-- column in it, and the composite type has a dropped attribute.
+CREATE TYPE test_type3 AS (a int);
+CREATE TABLE test_tbl3 (c) AS SELECT '(1)'::test_type3;
+ALTER TYPE test_type3 DROP ATTRIBUTE a, ADD ATTRIBUTE b int;
+CREATE TYPE test_type_empty AS ();
+DROP TYPE test_type_empty;
+--
+-- typed tables: OF / NOT OF
+--
+CREATE TYPE tt_t0 AS (z inet, x int, y numeric(8,2));
+ALTER TYPE tt_t0 DROP ATTRIBUTE z;
+CREATE TABLE tt0 (x int NOT NULL, y numeric(8,2)); -- OK
+CREATE TABLE tt1 (x int, y bigint); -- wrong base type
+CREATE TABLE tt2 (x int, y numeric(9,2)); -- wrong typmod
+CREATE TABLE tt3 (y numeric(8,2), x int); -- wrong column order
+CREATE TABLE tt4 (x int); -- too few columns
+CREATE TABLE tt5 (x int, y numeric(8,2), z int); -- too many columns
+CREATE TABLE tt6 () INHERITS (tt0); -- can't have a parent
+CREATE TABLE tt7 (x int, q text, y numeric(8,2));
+ALTER TABLE tt7 DROP q; -- OK
+ALTER TABLE tt0 OF tt_t0;
+ALTER TABLE tt1 OF tt_t0;
+ERROR: table "tt1" has different type for column "y"
+ALTER TABLE tt2 OF tt_t0;
+ERROR: table "tt2" has different type for column "y"
+ALTER TABLE tt3 OF tt_t0;
+ERROR: table has column "y" where type requires "x"
+ALTER TABLE tt4 OF tt_t0;
+ERROR: table is missing column "y"
+ALTER TABLE tt5 OF tt_t0;
+ERROR: table has extra column "z"
+ALTER TABLE tt6 OF tt_t0;
+ERROR: typed tables cannot inherit
+ALTER TABLE tt7 OF tt_t0;
+CREATE TYPE tt_t1 AS (x int, y numeric(8,2));
+ALTER TABLE tt7 OF tt_t1; -- reassign an already-typed table
+ALTER TABLE tt7 NOT OF;
+\d tt7
+ Table "public.tt7"
+ Column | Type | Collation | Nullable | Default
+--------+--------------+-----------+----------+---------
+ x | integer | | |
+ y | numeric(8,2) | | |
+
+-- make sure we can drop a constraint on the parent but it remains on the child
+CREATE TABLE test_drop_constr_parent (c text CHECK (c IS NOT NULL));
+CREATE TABLE test_drop_constr_child () INHERITS (test_drop_constr_parent);
+ALTER TABLE ONLY test_drop_constr_parent DROP CONSTRAINT "test_drop_constr_parent_c_check";
+-- should fail
+INSERT INTO test_drop_constr_child (c) VALUES (NULL);
+ERROR: new row for relation "test_drop_constr_child" violates check constraint "test_drop_constr_parent_c_check"
+DETAIL: Failing row contains (null).
+DROP TABLE test_drop_constr_parent CASCADE;
+NOTICE: drop cascades to table test_drop_constr_child
+--
+-- IF EXISTS test
+--
+ALTER TABLE IF EXISTS tt8 ADD COLUMN f int;
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 ADD CONSTRAINT xxx PRIMARY KEY(f);
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 ADD CHECK (f BETWEEN 0 AND 10);
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 ALTER COLUMN f SET DEFAULT 0;
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 RENAME COLUMN f TO f1;
+NOTICE: relation "tt8" does not exist, skipping
+ALTER TABLE IF EXISTS tt8 SET SCHEMA alter2;
+NOTICE: relation "tt8" does not exist, skipping
+CREATE TABLE tt8(a int);
+CREATE SCHEMA alter2;
+ALTER TABLE IF EXISTS tt8 ADD COLUMN f int;
+ALTER TABLE IF EXISTS tt8 ADD CONSTRAINT xxx PRIMARY KEY(f);
+ALTER TABLE IF EXISTS tt8 ADD CHECK (f BETWEEN 0 AND 10);
+ALTER TABLE IF EXISTS tt8 ALTER COLUMN f SET DEFAULT 0;
+ALTER TABLE IF EXISTS tt8 RENAME COLUMN f TO f1;
+ALTER TABLE IF EXISTS tt8 SET SCHEMA alter2;
+\d alter2.tt8
+ Table "alter2.tt8"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ f1 | integer | | not null | 0
+Indexes:
+ "xxx" PRIMARY KEY, btree (f1)
+Check constraints:
+ "tt8_f_check" CHECK (f1 >= 0 AND f1 <= 10)
+
+DROP TABLE alter2.tt8;
+DROP SCHEMA alter2;
+--
+-- Check conflicts between index and CHECK constraint names
+--
+CREATE TABLE tt9(c integer);
+ALTER TABLE tt9 ADD CHECK(c > 1);
+ALTER TABLE tt9 ADD CHECK(c > 2); -- picks nonconflicting name
+ALTER TABLE tt9 ADD CONSTRAINT foo CHECK(c > 3);
+ALTER TABLE tt9 ADD CONSTRAINT foo CHECK(c > 4); -- fail, dup name
+ERROR: constraint "foo" for relation "tt9" already exists
+ALTER TABLE tt9 ADD UNIQUE(c);
+ALTER TABLE tt9 ADD UNIQUE(c); -- picks nonconflicting name
+ALTER TABLE tt9 ADD CONSTRAINT tt9_c_key UNIQUE(c); -- fail, dup name
+ERROR: relation "tt9_c_key" already exists
+ALTER TABLE tt9 ADD CONSTRAINT foo UNIQUE(c); -- fail, dup name
+ERROR: constraint "foo" for relation "tt9" already exists
+ALTER TABLE tt9 ADD CONSTRAINT tt9_c_key CHECK(c > 5); -- fail, dup name
+ERROR: constraint "tt9_c_key" for relation "tt9" already exists
+ALTER TABLE tt9 ADD CONSTRAINT tt9_c_key2 CHECK(c > 6);
+ALTER TABLE tt9 ADD UNIQUE(c); -- picks nonconflicting name
+\d tt9
+ Table "public.tt9"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c | integer | | |
+Indexes:
+ "tt9_c_key" UNIQUE CONSTRAINT, btree (c)
+ "tt9_c_key1" UNIQUE CONSTRAINT, btree (c)
+ "tt9_c_key3" UNIQUE CONSTRAINT, btree (c)
+Check constraints:
+ "foo" CHECK (c > 3)
+ "tt9_c_check" CHECK (c > 1)
+ "tt9_c_check1" CHECK (c > 2)
+ "tt9_c_key2" CHECK (c > 6)
+
+DROP TABLE tt9;
+-- Check that comments on constraints and indexes are not lost at ALTER TABLE.
+CREATE TABLE comment_test (
+ id int,
+ positive_col int CHECK (positive_col > 0),
+ indexed_col int,
+ CONSTRAINT comment_test_pk PRIMARY KEY (id));
+CREATE INDEX comment_test_index ON comment_test(indexed_col);
+COMMENT ON COLUMN comment_test.id IS 'Column ''id'' on comment_test';
+COMMENT ON INDEX comment_test_index IS 'Simple index on comment_test';
+COMMENT ON CONSTRAINT comment_test_positive_col_check ON comment_test IS 'CHECK constraint on comment_test.positive_col';
+COMMENT ON CONSTRAINT comment_test_pk ON comment_test IS 'PRIMARY KEY constraint of comment_test';
+COMMENT ON INDEX comment_test_pk IS 'Index backing the PRIMARY KEY of comment_test';
+SELECT col_description('comment_test'::regclass, 1) as comment;
+ comment
+-----------------------------
+ Column 'id' on comment_test
+(1 row)
+
+SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ index | comment
+--------------------+-----------------------------------------------
+ comment_test_index | Simple index on comment_test
+ comment_test_pk | Index backing the PRIMARY KEY of comment_test
+(2 rows)
+
+SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ constraint | comment
+---------------------------------+-----------------------------------------------
+ comment_test_pk | PRIMARY KEY constraint of comment_test
+ comment_test_positive_col_check | CHECK constraint on comment_test.positive_col
+(2 rows)
+
+-- Change the datatype of all the columns. ALTER TABLE is optimized to not
+-- rebuild an index if the new data type is binary compatible with the old
+-- one. Check this by doing a dummy ALTER TABLE that doesn't change the
+-- datatype first, to test that no-op codepath, and another one that does.
+ALTER TABLE comment_test ALTER COLUMN indexed_col SET DATA TYPE int;
+ALTER TABLE comment_test ALTER COLUMN indexed_col SET DATA TYPE text;
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE int;
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE text;
+ALTER TABLE comment_test ALTER COLUMN positive_col SET DATA TYPE int;
+ALTER TABLE comment_test ALTER COLUMN positive_col SET DATA TYPE bigint;
+-- Check that the comments are intact.
+SELECT col_description('comment_test'::regclass, 1) as comment;
+ comment
+-----------------------------
+ Column 'id' on comment_test
+(1 row)
+
+SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ index | comment
+--------------------+-----------------------------------------------
+ comment_test_index | Simple index on comment_test
+ comment_test_pk | Index backing the PRIMARY KEY of comment_test
+(2 rows)
+
+SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test'::regclass ORDER BY 1, 2;
+ constraint | comment
+---------------------------------+-----------------------------------------------
+ comment_test_pk | PRIMARY KEY constraint of comment_test
+ comment_test_positive_col_check | CHECK constraint on comment_test.positive_col
+(2 rows)
+
+-- Check compatibility for foreign keys and comments. This is done
+-- separately as rebuilding the column type of the parent leads
+-- to an error and would reduce the test scope.
+CREATE TABLE comment_test_child (
+ id text CONSTRAINT comment_test_child_fk REFERENCES comment_test);
+CREATE INDEX comment_test_child_fk ON comment_test_child(id);
+COMMENT ON COLUMN comment_test_child.id IS 'Column ''id'' on comment_test_child';
+COMMENT ON INDEX comment_test_child_fk IS 'Index backing the FOREIGN KEY of comment_test_child';
+COMMENT ON CONSTRAINT comment_test_child_fk ON comment_test_child IS 'FOREIGN KEY constraint of comment_test_child';
+-- Change column type of parent
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE text;
+ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE int USING id::integer;
+ERROR: foreign key constraint "comment_test_child_fk" cannot be implemented
+DETAIL: Key columns "id" and "id" are of incompatible types: text and integer.
+-- Comments should be intact
+SELECT col_description('comment_test_child'::regclass, 1) as comment;
+ comment
+-----------------------------------
+ Column 'id' on comment_test_child
+(1 row)
+
+SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test_child'::regclass ORDER BY 1, 2;
+ index | comment
+-----------------------+-----------------------------------------------------
+ comment_test_child_fk | Index backing the FOREIGN KEY of comment_test_child
+(1 row)
+
+SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test_child'::regclass ORDER BY 1, 2;
+ constraint | comment
+-----------------------+----------------------------------------------
+ comment_test_child_fk | FOREIGN KEY constraint of comment_test_child
+(1 row)
+
+-- Check that we map relation oids to filenodes and back correctly. Only
+-- display bad mappings so the test output doesn't change all the time. A
+-- filenode function call can return NULL for a relation dropped concurrently
+-- with the call's surrounding query, so ignore a NULL mapped_oid for
+-- relations that no longer exist after all calls finish.
+CREATE TEMP TABLE filenode_mapping AS
+SELECT
+ oid, mapped_oid, reltablespace, relfilenode, relname
+FROM pg_class,
+ pg_filenode_relation(reltablespace, pg_relation_filenode(oid)) AS mapped_oid
+WHERE relkind IN ('r', 'i', 'S', 't', 'm') AND mapped_oid IS DISTINCT FROM oid;
+SELECT m.* FROM filenode_mapping m LEFT JOIN pg_class c ON c.oid = m.oid
+WHERE c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL;
+ oid | mapped_oid | reltablespace | relfilenode | relname
+-----+------------+---------------+-------------+---------
+(0 rows)
+
+-- Checks on creating and manipulation of user defined relations in
+-- pg_catalog.
+SHOW allow_system_table_mods;
+ allow_system_table_mods
+-------------------------
+ off
+(1 row)
+
+-- disallowed because of search_path issues with pg_dump
+CREATE TABLE pg_catalog.new_system_table();
+ERROR: permission denied to create "pg_catalog.new_system_table"
+DETAIL: System catalog modifications are currently disallowed.
+-- instead create in public first, move to catalog
+CREATE TABLE new_system_table(id serial primary key, othercol text);
+ALTER TABLE new_system_table SET SCHEMA pg_catalog;
+ALTER TABLE new_system_table SET SCHEMA public;
+ALTER TABLE new_system_table SET SCHEMA pg_catalog;
+-- will be ignored -- already there:
+ALTER TABLE new_system_table SET SCHEMA pg_catalog;
+ALTER TABLE new_system_table RENAME TO old_system_table;
+CREATE INDEX old_system_table__othercol ON old_system_table (othercol);
+INSERT INTO old_system_table(othercol) VALUES ('somedata'), ('otherdata');
+UPDATE old_system_table SET id = -id;
+DELETE FROM old_system_table WHERE othercol = 'somedata';
+TRUNCATE old_system_table;
+ALTER TABLE old_system_table DROP CONSTRAINT new_system_table_pkey;
+ALTER TABLE old_system_table DROP COLUMN othercol;
+DROP TABLE old_system_table;
+-- set logged
+CREATE UNLOGGED TABLE unlogged1(f1 SERIAL PRIMARY KEY, f2 TEXT);
+-- check relpersistence of an unlogged table
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^unlogged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+------------------+---------+----------------
+ unlogged1 | r | u
+ unlogged1_f1_seq | S | p
+ unlogged1_pkey | i | u
+(3 rows)
+
+CREATE UNLOGGED TABLE unlogged2(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES unlogged1); -- foreign key
+CREATE UNLOGGED TABLE unlogged3(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES unlogged3); -- self-referencing foreign key
+ALTER TABLE unlogged3 SET LOGGED; -- skip self-referencing foreign key
+ALTER TABLE unlogged2 SET LOGGED; -- fails because a foreign key to an unlogged table exists
+ERROR: could not change table "unlogged2" to logged because it references unlogged table "unlogged1"
+ALTER TABLE unlogged1 SET LOGGED;
+-- check relpersistence of an unlogged table after changing to permanent
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^unlogged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^unlogged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+------------------+---------+----------------
+ unlogged1 | r | p
+ unlogged1_f1_seq | S | p
+ unlogged1_pkey | i | p
+(3 rows)
+
+ALTER TABLE unlogged1 SET LOGGED; -- silently do nothing
+DROP TABLE unlogged3;
+DROP TABLE unlogged2;
+DROP TABLE unlogged1;
+-- set unlogged
+CREATE TABLE logged1(f1 SERIAL PRIMARY KEY, f2 TEXT);
+-- check relpersistence of a permanent table
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^logged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^logged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^logged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+----------------+---------+----------------
+ logged1 | r | p
+ logged1_f1_seq | S | p
+ logged1_pkey | i | p
+(3 rows)
+
+CREATE TABLE logged2(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES logged1); -- foreign key
+CREATE TABLE logged3(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES logged3); -- self-referencing foreign key
+ALTER TABLE logged1 SET UNLOGGED; -- fails because a foreign key from a permanent table exists
+ERROR: could not change table "logged1" to unlogged because it references logged table "logged2"
+ALTER TABLE logged3 SET UNLOGGED; -- skip self-referencing foreign key
+ALTER TABLE logged2 SET UNLOGGED;
+ALTER TABLE logged1 SET UNLOGGED;
+-- check relpersistence of a permanent table after changing to unlogged
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^logged1'
+UNION ALL
+SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^logged1'
+UNION ALL
+SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^logged1'
+ORDER BY relname;
+ relname | relkind | relpersistence
+----------------+---------+----------------
+ logged1 | r | u
+ logged1_f1_seq | S | p
+ logged1_pkey | i | u
+(3 rows)
+
+ALTER TABLE logged1 SET UNLOGGED; -- silently do nothing
+DROP TABLE logged3;
+DROP TABLE logged2;
+DROP TABLE logged1;
+-- test ADD COLUMN IF NOT EXISTS
+CREATE TABLE test_add_column(c1 integer);
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN c2 integer;
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN c2 integer; -- fail because c2 already exists
+ERROR: column "c2" of relation "test_add_column" already exists
+ALTER TABLE ONLY test_add_column
+ ADD COLUMN c2 integer; -- fail because c2 already exists
+ERROR: column "c2" of relation "test_add_column" already exists
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer; -- skipping because c2 already exists
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+ALTER TABLE ONLY test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer; -- skipping because c2 already exists
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN c2 integer, -- fail because c2 already exists
+ ADD COLUMN c3 integer primary key;
+ERROR: column "c2" of relation "test_add_column" already exists
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
+ ADD COLUMN c3 integer primary key;
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | not null |
+Indexes:
+ "test_add_column_pkey" PRIMARY KEY, btree (c3)
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
+ ADD COLUMN IF NOT EXISTS c3 integer primary key; -- skipping because c3 already exists
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+NOTICE: column "c3" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | not null |
+Indexes:
+ "test_add_column_pkey" PRIMARY KEY, btree (c3)
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
+ ADD COLUMN IF NOT EXISTS c3 integer, -- skipping because c3 already exists
+ ADD COLUMN c4 integer REFERENCES test_add_column;
+NOTICE: column "c2" of relation "test_add_column" already exists, skipping
+NOTICE: column "c3" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | not null |
+ c4 | integer | | |
+Indexes:
+ "test_add_column_pkey" PRIMARY KEY, btree (c3)
+Foreign-key constraints:
+ "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+Referenced by:
+ TABLE "test_add_column" CONSTRAINT "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c4 integer REFERENCES test_add_column;
+NOTICE: column "c4" of relation "test_add_column" already exists, skipping
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | not null |
+ c4 | integer | | |
+Indexes:
+ "test_add_column_pkey" PRIMARY KEY, btree (c3)
+Foreign-key constraints:
+ "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+Referenced by:
+ TABLE "test_add_column" CONSTRAINT "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c5 SERIAL CHECK (c5 > 8);
+\d test_add_column
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------------------------------------------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | not null |
+ c4 | integer | | |
+ c5 | integer | | not null | nextval('test_add_column_c5_seq'::regclass)
+Indexes:
+ "test_add_column_pkey" PRIMARY KEY, btree (c3)
+Check constraints:
+ "test_add_column_c5_check" CHECK (c5 > 8)
+Foreign-key constraints:
+ "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+Referenced by:
+ TABLE "test_add_column" CONSTRAINT "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+
+ALTER TABLE test_add_column
+ ADD COLUMN IF NOT EXISTS c5 SERIAL CHECK (c5 > 10);
+NOTICE: column "c5" of relation "test_add_column" already exists, skipping
+\d test_add_column*
+ Table "public.test_add_column"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------------------------------------------
+ c1 | integer | | |
+ c2 | integer | | |
+ c3 | integer | | not null |
+ c4 | integer | | |
+ c5 | integer | | not null | nextval('test_add_column_c5_seq'::regclass)
+Indexes:
+ "test_add_column_pkey" PRIMARY KEY, btree (c3)
+Check constraints:
+ "test_add_column_c5_check" CHECK (c5 > 8)
+Foreign-key constraints:
+ "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+Referenced by:
+ TABLE "test_add_column" CONSTRAINT "test_add_column_c4_fkey" FOREIGN KEY (c4) REFERENCES test_add_column(c3)
+
+ Sequence "public.test_add_column_c5_seq"
+ Type | Start | Minimum | Maximum | Increment | Cycles? | Cache
+---------+-------+---------+------------+-----------+---------+-------
+ integer | 1 | 1 | 2147483647 | 1 | no | 1
+Owned by: public.test_add_column.c5
+
+ Index "public.test_add_column_pkey"
+ Column | Type | Key? | Definition
+--------+---------+------+------------
+ c3 | integer | yes | c3
+primary key, btree, for table "public.test_add_column"
+
+DROP TABLE test_add_column;
+\d test_add_column*
+-- assorted cases with multiple ALTER TABLE steps
+CREATE TABLE ataddindex(f1 INT);
+INSERT INTO ataddindex VALUES (42), (43);
+CREATE UNIQUE INDEX ataddindexi0 ON ataddindex(f1);
+ALTER TABLE ataddindex
+ ADD PRIMARY KEY USING INDEX ataddindexi0,
+ ALTER f1 TYPE BIGINT;
+\d ataddindex
+ Table "public.ataddindex"
+ Column | Type | Collation | Nullable | Default
+--------+--------+-----------+----------+---------
+ f1 | bigint | | not null |
+Indexes:
+ "ataddindexi0" PRIMARY KEY, btree (f1)
+
+DROP TABLE ataddindex;
+CREATE TABLE ataddindex(f1 VARCHAR(10));
+INSERT INTO ataddindex(f1) VALUES ('foo'), ('a');
+ALTER TABLE ataddindex
+ ALTER f1 SET DATA TYPE TEXT,
+ ADD EXCLUDE ((f1 LIKE 'a') WITH =);
+\d ataddindex
+ Table "public.ataddindex"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+Indexes:
+ "ataddindex_expr_excl" EXCLUDE USING btree ((f1 ~~ 'a'::text) WITH =)
+
+DROP TABLE ataddindex;
+CREATE TABLE ataddindex(id int, ref_id int);
+ALTER TABLE ataddindex
+ ADD PRIMARY KEY (id),
+ ADD FOREIGN KEY (ref_id) REFERENCES ataddindex;
+\d ataddindex
+ Table "public.ataddindex"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ id | integer | | not null |
+ ref_id | integer | | |
+Indexes:
+ "ataddindex_pkey" PRIMARY KEY, btree (id)
+Foreign-key constraints:
+ "ataddindex_ref_id_fkey" FOREIGN KEY (ref_id) REFERENCES ataddindex(id)
+Referenced by:
+ TABLE "ataddindex" CONSTRAINT "ataddindex_ref_id_fkey" FOREIGN KEY (ref_id) REFERENCES ataddindex(id)
+
+DROP TABLE ataddindex;
+CREATE TABLE ataddindex(id int, ref_id int);
+ALTER TABLE ataddindex
+ ADD UNIQUE (id),
+ ADD FOREIGN KEY (ref_id) REFERENCES ataddindex (id);
+\d ataddindex
+ Table "public.ataddindex"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ id | integer | | |
+ ref_id | integer | | |
+Indexes:
+ "ataddindex_id_key" UNIQUE CONSTRAINT, btree (id)
+Foreign-key constraints:
+ "ataddindex_ref_id_fkey" FOREIGN KEY (ref_id) REFERENCES ataddindex(id)
+Referenced by:
+ TABLE "ataddindex" CONSTRAINT "ataddindex_ref_id_fkey" FOREIGN KEY (ref_id) REFERENCES ataddindex(id)
+
+DROP TABLE ataddindex;
+-- unsupported constraint types for partitioned tables
+CREATE TABLE partitioned (
+ a int,
+ b int
+) PARTITION BY RANGE (a, (a+b+1));
+ALTER TABLE partitioned ADD EXCLUDE USING gist (a WITH &&);
+ERROR: exclusion constraints are not supported on partitioned tables
+LINE 1: ALTER TABLE partitioned ADD EXCLUDE USING gist (a WITH &&);
+ ^
+-- cannot drop column that is part of the partition key
+ALTER TABLE partitioned DROP COLUMN a;
+ERROR: cannot drop column "a" because it is part of the partition key of relation "partitioned"
+ALTER TABLE partitioned ALTER COLUMN a TYPE char(5);
+ERROR: cannot alter column "a" because it is part of the partition key of relation "partitioned"
+ALTER TABLE partitioned DROP COLUMN b;
+ERROR: cannot drop column "b" because it is part of the partition key of relation "partitioned"
+ALTER TABLE partitioned ALTER COLUMN b TYPE char(5);
+ERROR: cannot alter column "b" because it is part of the partition key of relation "partitioned"
+-- partitioned table cannot participate in regular inheritance
+CREATE TABLE nonpartitioned (
+ a int,
+ b int
+);
+ALTER TABLE partitioned INHERIT nonpartitioned;
+ERROR: cannot change inheritance of partitioned table
+ALTER TABLE nonpartitioned INHERIT partitioned;
+ERROR: cannot inherit from partitioned table "partitioned"
+-- cannot add NO INHERIT constraint to partitioned tables
+ALTER TABLE partitioned ADD CONSTRAINT chk_a CHECK (a > 0) NO INHERIT;
+ERROR: cannot add NO INHERIT constraint to partitioned table "partitioned"
+DROP TABLE partitioned, nonpartitioned;
+--
+-- ATTACH PARTITION
+--
+-- check that target table is partitioned
+CREATE TABLE unparted (
+ a int
+);
+CREATE TABLE fail_part (like unparted);
+ALTER TABLE unparted ATTACH PARTITION fail_part FOR VALUES IN ('a');
+ERROR: table "unparted" is not partitioned
+DROP TABLE unparted, fail_part;
+-- check that partition bound is compatible
+CREATE TABLE list_parted (
+ a int NOT NULL,
+ b char(2) COLLATE "C",
+ CONSTRAINT check_a CHECK (a > 0)
+) PARTITION BY LIST (a);
+CREATE TABLE fail_part (LIKE list_parted);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES FROM (1) TO (10);
+ERROR: invalid bound specification for a list partition
+LINE 1: ...list_parted ATTACH PARTITION fail_part FOR VALUES FROM (1) T...
+ ^
+DROP TABLE fail_part;
+-- check that the table being attached exists
+ALTER TABLE list_parted ATTACH PARTITION nonexistent FOR VALUES IN (1);
+ERROR: relation "nonexistent" does not exist
+-- check ownership of the source table
+CREATE ROLE regress_test_me;
+CREATE ROLE regress_test_not_me;
+CREATE TABLE not_owned_by_me (LIKE list_parted);
+ALTER TABLE not_owned_by_me OWNER TO regress_test_not_me;
+SET SESSION AUTHORIZATION regress_test_me;
+CREATE TABLE owned_by_me (
+ a int
+) PARTITION BY LIST (a);
+ALTER TABLE owned_by_me ATTACH PARTITION not_owned_by_me FOR VALUES IN (1);
+ERROR: must be owner of table not_owned_by_me
+RESET SESSION AUTHORIZATION;
+DROP TABLE owned_by_me, not_owned_by_me;
+DROP ROLE regress_test_not_me;
+DROP ROLE regress_test_me;
+-- check that the table being attached is not part of regular inheritance
+CREATE TABLE parent (LIKE list_parted);
+CREATE TABLE child () INHERITS (parent);
+ALTER TABLE list_parted ATTACH PARTITION child FOR VALUES IN (1);
+ERROR: cannot attach inheritance child as partition
+ALTER TABLE list_parted ATTACH PARTITION parent FOR VALUES IN (1);
+ERROR: cannot attach inheritance parent as partition
+DROP TABLE parent CASCADE;
+NOTICE: drop cascades to table child
+-- check any TEMP-ness
+CREATE TEMP TABLE temp_parted (a int) PARTITION BY LIST (a);
+CREATE TABLE perm_part (a int);
+ALTER TABLE temp_parted ATTACH PARTITION perm_part FOR VALUES IN (1);
+ERROR: cannot attach a permanent relation as partition of temporary relation "temp_parted"
+DROP TABLE temp_parted, perm_part;
+-- check that the table being attached is not a typed table
+CREATE TYPE mytype AS (a int);
+CREATE TABLE fail_part OF mytype;
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: cannot attach a typed table as partition
+DROP TYPE mytype CASCADE;
+NOTICE: drop cascades to table fail_part
+-- check that the table being attached has only columns present in the parent
+CREATE TABLE fail_part (like list_parted, c int);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: table "fail_part" contains column "c" not found in parent "list_parted"
+DETAIL: The new partition may contain only the columns present in parent.
+DROP TABLE fail_part;
+-- check that the table being attached has every column of the parent
+CREATE TABLE fail_part (a int NOT NULL);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table is missing column "b"
+DROP TABLE fail_part;
+-- check that columns match in type, collation and NOT NULL status
+CREATE TABLE fail_part (
+ b char(3),
+ a int NOT NULL
+);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table "fail_part" has different type for column "b"
+ALTER TABLE fail_part ALTER b TYPE char (2) COLLATE "POSIX";
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table "fail_part" has different collation for column "b"
+DROP TABLE fail_part;
+-- check that the table being attached has all constraints of the parent
+CREATE TABLE fail_part (
+ b char(2) COLLATE "C",
+ a int NOT NULL
+);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table is missing constraint "check_a"
+-- check that the constraint matches in definition with parent's constraint
+ALTER TABLE fail_part ADD CONSTRAINT check_a CHECK (a >= 0);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: child table "fail_part" has different definition for check constraint "check_a"
+DROP TABLE fail_part;
+-- check the attributes and constraints after partition is attached
+CREATE TABLE part_1 (
+ a int NOT NULL,
+ b char(2) COLLATE "C",
+ CONSTRAINT check_a CHECK (a > 0)
+);
+ALTER TABLE list_parted ATTACH PARTITION part_1 FOR VALUES IN (1);
+-- attislocal and conislocal are always false for merged attributes and constraints respectively.
+SELECT attislocal, attinhcount FROM pg_attribute WHERE attrelid = 'part_1'::regclass AND attnum > 0;
+ attislocal | attinhcount
+------------+-------------
+ f | 1
+ f | 1
+(2 rows)
+
+SELECT conislocal, coninhcount FROM pg_constraint WHERE conrelid = 'part_1'::regclass AND conname = 'check_a';
+ conislocal | coninhcount
+------------+-------------
+ f | 1
+(1 row)
+
+-- check that the new partition won't overlap with an existing partition
+CREATE TABLE fail_part (LIKE part_1 INCLUDING CONSTRAINTS);
+ALTER TABLE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ERROR: partition "fail_part" would overlap partition "part_1"
+LINE 1: ...LE list_parted ATTACH PARTITION fail_part FOR VALUES IN (1);
+ ^
+DROP TABLE fail_part;
+-- check that an existing table can be attached as a default partition
+CREATE TABLE def_part (LIKE list_parted INCLUDING CONSTRAINTS);
+ALTER TABLE list_parted ATTACH PARTITION def_part DEFAULT;
+-- check attaching default partition fails if a default partition already
+-- exists
+CREATE TABLE fail_def_part (LIKE part_1 INCLUDING CONSTRAINTS);
+ALTER TABLE list_parted ATTACH PARTITION fail_def_part DEFAULT;
+ERROR: partition "fail_def_part" conflicts with existing default partition "def_part"
+LINE 1: ...ER TABLE list_parted ATTACH PARTITION fail_def_part DEFAULT;
+ ^
+-- check validation when attaching list partitions
+CREATE TABLE list_parted2 (
+ a int,
+ b char
+) PARTITION BY LIST (a);
+-- check that violating rows are correctly reported
+CREATE TABLE part_2 (LIKE list_parted2);
+INSERT INTO part_2 VALUES (3, 'a');
+ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
+ERROR: partition constraint of relation "part_2" is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM part_2;
+ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
+-- check that a partition cannot be attached if the default partition has rows for its values
+CREATE TABLE list_parted2_def PARTITION OF list_parted2 DEFAULT;
+INSERT INTO list_parted2_def VALUES (11, 'z');
+CREATE TABLE part_3 (LIKE list_parted2);
+ALTER TABLE list_parted2 ATTACH PARTITION part_3 FOR VALUES IN (11);
+ERROR: updated partition constraint for default partition "list_parted2_def" would be violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM list_parted2_def WHERE a = 11;
+ALTER TABLE list_parted2 ATTACH PARTITION part_3 FOR VALUES IN (11);
+-- adding constraints that describe the desired partition constraint
+-- (or more restrictive) will help skip the validation scan
+CREATE TABLE part_3_4 (
+ LIKE list_parted2,
+ CONSTRAINT check_a CHECK (a IN (3))
+);
+-- however, if a list partition does not accept nulls, there should be
+-- an explicit NOT NULL constraint on the partition key column for the
+-- validation scan to be skipped;
+ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
+-- adding a NOT NULL constraint will cause the scan to be skipped
+ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
+ALTER TABLE part_3_4 ALTER a SET NOT NULL;
+ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
+-- check if default partition scan skipped
+ALTER TABLE list_parted2_def ADD CONSTRAINT check_a CHECK (a IN (5, 6));
+CREATE TABLE part_55_66 PARTITION OF list_parted2 FOR VALUES IN (55, 66);
+-- check validation when attaching range partitions
+CREATE TABLE range_parted (
+ a int,
+ b int
+) PARTITION BY RANGE (a, b);
+-- check that violating rows are correctly reported
+CREATE TABLE part1 (
+ a int NOT NULL CHECK (a = 1),
+ b int NOT NULL CHECK (b >= 1 AND b <= 10)
+);
+INSERT INTO part1 VALUES (1, 10);
+-- Remember the TO bound is exclusive
+ALTER TABLE range_parted ATTACH PARTITION part1 FOR VALUES FROM (1, 1) TO (1, 10);
+ERROR: partition constraint of relation "part1" is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM part1;
+ALTER TABLE range_parted ATTACH PARTITION part1 FOR VALUES FROM (1, 1) TO (1, 10);
+-- adding constraints that describe the desired partition constraint
+-- (or more restrictive) will help skip the validation scan
+CREATE TABLE part2 (
+ a int NOT NULL CHECK (a = 1),
+ b int NOT NULL CHECK (b >= 10 AND b < 18)
+);
+ALTER TABLE range_parted ATTACH PARTITION part2 FOR VALUES FROM (1, 10) TO (1, 20);
+-- Create default partition
+CREATE TABLE partr_def1 PARTITION OF range_parted DEFAULT;
+-- Only one default partition is allowed, hence the following should give an error
+CREATE TABLE partr_def2 (LIKE part1 INCLUDING CONSTRAINTS);
+ALTER TABLE range_parted ATTACH PARTITION partr_def2 DEFAULT;
+ERROR: partition "partr_def2" conflicts with existing default partition "partr_def1"
+LINE 1: ...LTER TABLE range_parted ATTACH PARTITION partr_def2 DEFAULT;
+ ^
+-- Overlapping partitions cannot be attached, hence the following should give an error
+INSERT INTO partr_def1 VALUES (2, 10);
+CREATE TABLE part3 (LIKE range_parted);
+ALTER TABLE range_parted ATTACH partition part3 FOR VALUES FROM (2, 10) TO (2, 20);
+ERROR: updated partition constraint for default partition "partr_def1" would be violated by some row
+-- Attaching partitions should be successful when there are no overlapping rows
+ALTER TABLE range_parted ATTACH partition part3 FOR VALUES FROM (3, 10) TO (3, 20);
+-- check that leaf partitions are scanned when attaching a partitioned
+-- table
+CREATE TABLE part_5 (
+ LIKE list_parted2
+) PARTITION BY LIST (b);
+-- check that violating rows are correctly reported
+CREATE TABLE part_5_a PARTITION OF part_5 FOR VALUES IN ('a');
+INSERT INTO part_5_a (a, b) VALUES (6, 'a');
+ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
+ERROR: partition constraint of relation "part_5_a" is violated by some row
+-- delete the faulting row and also add a constraint to skip the scan
+DELETE FROM part_5_a WHERE a NOT IN (3);
+ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 5);
+ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
+ALTER TABLE list_parted2 DETACH PARTITION part_5;
+ALTER TABLE part_5 DROP CONSTRAINT check_a;
+-- scan should again be skipped, even though NOT NULL is now a column property
+ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IN (5)), ALTER a SET NOT NULL;
+ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
+-- Check the case where the attnos of the partitioning columns in the table
+-- being attached differ from the parent's. It should not affect the
+-- constraint-checking logic that allows the scan to be skipped.
+CREATE TABLE part_6 (
+ c int,
+ LIKE list_parted2,
+ CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 6)
+);
+ALTER TABLE part_6 DROP c;
+ALTER TABLE list_parted2 ATTACH PARTITION part_6 FOR VALUES IN (6);
+-- Similar to above, but the table being attached is a partitioned table
+-- whose partition has still different attnos for the root partitioning
+-- columns.
+CREATE TABLE part_7 (
+ LIKE list_parted2,
+ CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 7)
+) PARTITION BY LIST (b);
+CREATE TABLE part_7_a_null (
+ c int,
+ d int,
+ e int,
+ LIKE list_parted2, -- 'a' will have attnum = 4
+ CONSTRAINT check_b CHECK (b IS NULL OR b = 'a'),
+ CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 7)
+);
+ALTER TABLE part_7_a_null DROP c, DROP d, DROP e;
+ALTER TABLE part_7 ATTACH PARTITION part_7_a_null FOR VALUES IN ('a', null);
+ALTER TABLE list_parted2 ATTACH PARTITION part_7 FOR VALUES IN (7);
+-- Same example, but check this time that the constraint correctly detects
+-- violating rows
+ALTER TABLE list_parted2 DETACH PARTITION part_7;
+ALTER TABLE part_7 DROP CONSTRAINT check_a; -- thus, the scan won't be skipped
+INSERT INTO part_7 (a, b) VALUES (8, null), (9, 'a');
+SELECT tableoid::regclass, a, b FROM part_7 order by a;
+ tableoid | a | b
+---------------+---+---
+ part_7_a_null | 8 |
+ part_7_a_null | 9 | a
+(2 rows)
+
+ALTER TABLE list_parted2 ATTACH PARTITION part_7 FOR VALUES IN (7);
+ERROR: partition constraint of relation "part_7_a_null" is violated by some row
+-- check that leaf partitions of default partition are scanned when
+-- attaching a partitioned table.
+ALTER TABLE part_5 DROP CONSTRAINT check_a;
+CREATE TABLE part5_def PARTITION OF part_5 DEFAULT PARTITION BY LIST(a);
+CREATE TABLE part5_def_p1 PARTITION OF part5_def FOR VALUES IN (5);
+INSERT INTO part5_def_p1 VALUES (5, 'y');
+CREATE TABLE part5_p1 (LIKE part_5);
+ALTER TABLE part_5 ATTACH PARTITION part5_p1 FOR VALUES IN ('y');
+ERROR: updated partition constraint for default partition "part5_def_p1" would be violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM part5_def_p1 WHERE b = 'y';
+ALTER TABLE part_5 ATTACH PARTITION part5_p1 FOR VALUES IN ('y');
+-- check that the table being attached is not already a partition
+ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
+ERROR: "part_2" is already a partition
+-- check that circular inheritance is not allowed
+ALTER TABLE part_5 ATTACH PARTITION list_parted2 FOR VALUES IN ('b');
+ERROR: circular inheritance not allowed
+DETAIL: "part_5" is already a child of "list_parted2".
+ALTER TABLE list_parted2 ATTACH PARTITION list_parted2 FOR VALUES IN (0);
+ERROR: circular inheritance not allowed
+DETAIL: "list_parted2" is already a child of "list_parted2".
+-- If a partitioned table being created or an existing table being attached
+-- as a partition does not have a constraint that would allow validation scan
+-- to be skipped, but an individual partition does, then the partition's
+-- validation scan is skipped.
+CREATE TABLE quuux (a int, b text) PARTITION BY LIST (a);
+CREATE TABLE quuux_default PARTITION OF quuux DEFAULT PARTITION BY LIST (b);
+CREATE TABLE quuux_default1 PARTITION OF quuux_default (
+ CONSTRAINT check_1 CHECK (a IS NOT NULL AND a = 1)
+) FOR VALUES IN ('b');
+CREATE TABLE quuux1 (a int, b text);
+ALTER TABLE quuux ATTACH PARTITION quuux1 FOR VALUES IN (1); -- validate!
+CREATE TABLE quuux2 (a int, b text);
+ALTER TABLE quuux ATTACH PARTITION quuux2 FOR VALUES IN (2); -- skip validation
+DROP TABLE quuux1, quuux2;
+-- should validate for quuux1, but not for quuux2
+CREATE TABLE quuux1 PARTITION OF quuux FOR VALUES IN (1);
+CREATE TABLE quuux2 PARTITION OF quuux FOR VALUES IN (2);
+DROP TABLE quuux;
+-- check validation when attaching hash partitions
+-- Use hand-rolled hash functions and operator class to get predictable results
+-- on different machines. part_test_int4_ops is defined in insert.sql.
+-- check that the new partition won't overlap with an existing partition
+CREATE TABLE hash_parted (
+ a int,
+ b int
+) PARTITION BY HASH (a part_test_int4_ops);
+CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH (MODULUS 4, REMAINDER 0);
+CREATE TABLE fail_part (LIKE hpart_1);
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 4);
+ERROR: partition "fail_part" would overlap partition "hpart_1"
+LINE 1: ...hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODU...
+ ^
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 0);
+ERROR: partition "fail_part" would overlap partition "hpart_1"
+LINE 1: ...hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODU...
+ ^
+DROP TABLE fail_part;
+-- check validation when attaching hash partitions
+-- check that violating rows are correctly reported
+CREATE TABLE hpart_2 (LIKE hash_parted);
+INSERT INTO hpart_2 VALUES (3, 0);
+ALTER TABLE hash_parted ATTACH PARTITION hpart_2 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
+ERROR: partition constraint of relation "hpart_2" is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM hpart_2;
+ALTER TABLE hash_parted ATTACH PARTITION hpart_2 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
+-- check that leaf partitions are scanned when attaching a partitioned
+-- table
+CREATE TABLE hpart_5 (
+ LIKE hash_parted
+) PARTITION BY LIST (b);
+-- check that violating rows are correctly reported
+CREATE TABLE hpart_5_a PARTITION OF hpart_5 FOR VALUES IN ('1', '2', '3');
+INSERT INTO hpart_5_a (a, b) VALUES (7, 1);
+ALTER TABLE hash_parted ATTACH PARTITION hpart_5 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
+ERROR: partition constraint of relation "hpart_5_a" is violated by some row
+-- should be ok after deleting the bad row
+DELETE FROM hpart_5_a;
+ALTER TABLE hash_parted ATTACH PARTITION hpart_5 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
+-- check that the table being attached has a valid modulus and remainder value
+CREATE TABLE fail_part(LIKE hash_parted);
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 0, REMAINDER 1);
+ERROR: modulus for hash partition must be a positive integer
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 8, REMAINDER 8);
+ERROR: remainder for hash partition must be less than modulus
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (MODULUS 3, REMAINDER 2);
+ERROR: every hash partition modulus must be a factor of the next larger modulus
+DROP TABLE fail_part;
+--
+-- DETACH PARTITION
+--
+-- check that the table is partitioned at all
+CREATE TABLE regular_table (a int);
+ALTER TABLE regular_table DETACH PARTITION any_name;
+ERROR: table "regular_table" is not partitioned
+DROP TABLE regular_table;
+-- check that the partition being detached exists at all
+ALTER TABLE list_parted2 DETACH PARTITION part_4;
+ERROR: relation "part_4" does not exist
+ALTER TABLE hash_parted DETACH PARTITION hpart_4;
+ERROR: relation "hpart_4" does not exist
+-- check that the partition being detached is actually a partition of the parent
+CREATE TABLE not_a_part (a int);
+ALTER TABLE list_parted2 DETACH PARTITION not_a_part;
+ERROR: relation "not_a_part" is not a partition of relation "list_parted2"
+ALTER TABLE list_parted2 DETACH PARTITION part_1;
+ERROR: relation "part_1" is not a partition of relation "list_parted2"
+ALTER TABLE hash_parted DETACH PARTITION not_a_part;
+ERROR: relation "not_a_part" is not a partition of relation "hash_parted"
+DROP TABLE not_a_part;
+-- check that, after being detached, attinhcount/coninhcount is dropped to 0 and
+-- attislocal/conislocal is set to true
+ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
+SELECT attinhcount, attislocal FROM pg_attribute WHERE attrelid = 'part_3_4'::regclass AND attnum > 0;
+ attinhcount | attislocal
+-------------+------------
+ 0 | t
+ 0 | t
+(2 rows)
+
+SELECT coninhcount, conislocal FROM pg_constraint WHERE conrelid = 'part_3_4'::regclass AND conname = 'check_a';
+ coninhcount | conislocal
+-------------+------------
+ 0 | t
+(1 row)
+
+DROP TABLE part_3_4;
+-- check that a detached partition is not dropped on dropping a partitioned table
+CREATE TABLE range_parted2 (
+ a int
+) PARTITION BY RANGE(a);
+CREATE TABLE part_rp PARTITION OF range_parted2 FOR VALUES FROM (0) to (100);
+ALTER TABLE range_parted2 DETACH PARTITION part_rp;
+DROP TABLE range_parted2;
+SELECT * from part_rp;
+ a
+---
+(0 rows)
+
+DROP TABLE part_rp;
+-- Check ALTER TABLE commands for partitioned tables and partitions
+-- cannot add/drop column to/from *only* the parent
+ALTER TABLE ONLY list_parted2 ADD COLUMN c int;
+ERROR: column must be added to child tables too
+ALTER TABLE ONLY list_parted2 DROP COLUMN b;
+ERROR: cannot drop column from only the partitioned table when partitions exist
+HINT: Do not specify the ONLY keyword.
+-- cannot add a column to partition or drop an inherited one
+ALTER TABLE part_2 ADD COLUMN c text;
+ERROR: cannot add column to a partition
+ALTER TABLE part_2 DROP COLUMN b;
+ERROR: cannot drop inherited column "b"
+-- Nor rename, alter type
+ALTER TABLE part_2 RENAME COLUMN b to c;
+ERROR: cannot rename inherited column "b"
+ALTER TABLE part_2 ALTER COLUMN b TYPE text;
+ERROR: cannot alter inherited column "b"
+-- cannot add/drop NOT NULL or check constraints to *only* the parent, when
+-- partitions exist
+ALTER TABLE ONLY list_parted2 ALTER b SET NOT NULL;
+ERROR: constraint must be added to child tables too
+DETAIL: Column "b" of relation "part_2" is not already NOT NULL.
+HINT: Do not specify the ONLY keyword.
+ALTER TABLE ONLY list_parted2 ADD CONSTRAINT check_b CHECK (b <> 'zz');
+ERROR: constraint must be added to child tables too
+ALTER TABLE list_parted2 ALTER b SET NOT NULL;
+ALTER TABLE ONLY list_parted2 ALTER b DROP NOT NULL;
+ERROR: cannot remove constraint from only the partitioned table when partitions exist
+HINT: Do not specify the ONLY keyword.
+ALTER TABLE list_parted2 ADD CONSTRAINT check_b CHECK (b <> 'zz');
+ALTER TABLE ONLY list_parted2 DROP CONSTRAINT check_b;
+ERROR: cannot remove constraint from only the partitioned table when partitions exist
+HINT: Do not specify the ONLY keyword.
+-- It's alright though, if no partitions are yet created
+CREATE TABLE parted_no_parts (a int) PARTITION BY LIST (a);
+ALTER TABLE ONLY parted_no_parts ALTER a SET NOT NULL;
+ALTER TABLE ONLY parted_no_parts ADD CONSTRAINT check_a CHECK (a > 0);
+ALTER TABLE ONLY parted_no_parts ALTER a DROP NOT NULL;
+ALTER TABLE ONLY parted_no_parts DROP CONSTRAINT check_a;
+DROP TABLE parted_no_parts;
+-- cannot drop inherited NOT NULL or check constraints from partition
+ALTER TABLE list_parted2 ALTER b SET NOT NULL, ADD CONSTRAINT check_a2 CHECK (a > 0);
+ALTER TABLE part_2 ALTER b DROP NOT NULL;
+ERROR: column "b" is marked NOT NULL in parent table
+ALTER TABLE part_2 DROP CONSTRAINT check_a2;
+ERROR: cannot drop inherited constraint "check_a2" of relation "part_2"
+-- Doesn't make sense to add NO INHERIT constraints on partitioned tables
+ALTER TABLE list_parted2 add constraint check_b2 check (b <> 'zz') NO INHERIT;
+ERROR: cannot add NO INHERIT constraint to partitioned table "list_parted2"
+-- check that a partition cannot participate in regular inheritance
+CREATE TABLE inh_test () INHERITS (part_2);
+ERROR: cannot inherit from partition "part_2"
+CREATE TABLE inh_test (LIKE part_2);
+ALTER TABLE inh_test INHERIT part_2;
+ERROR: cannot inherit from a partition
+ALTER TABLE part_2 INHERIT inh_test;
+ERROR: cannot change inheritance of a partition
+-- cannot drop or alter type of partition key columns of lower level
+-- partitioned tables; for example, part_5, which is list_parted2's
+-- partition, is partitioned on b;
+ALTER TABLE list_parted2 DROP COLUMN b;
+ERROR: cannot drop column "b" because it is part of the partition key of relation "part_5"
+ALTER TABLE list_parted2 ALTER COLUMN b TYPE text;
+ERROR: cannot alter column "b" because it is part of the partition key of relation "part_5"
+-- dropping non-partition key columns should be allowed on the parent table.
+ALTER TABLE list_parted DROP COLUMN b;
+SELECT * FROM list_parted;
+ a
+---
+(0 rows)
+
+-- cleanup
+DROP TABLE list_parted, list_parted2, range_parted;
+DROP TABLE fail_def_part;
+DROP TABLE hash_parted;
+-- more tests for certain multi-level partitioning scenarios
+create table p (a int, b int) partition by range (a, b);
+create table p1 (b int, a int not null) partition by range (b);
+create table p11 (like p1);
+alter table p11 drop a;
+alter table p11 add a int;
+alter table p11 drop a;
+alter table p11 add a int not null;
+-- attnum for key attribute 'a' is different in p, p1, and p11
+select attrelid::regclass, attname, attnum
+from pg_attribute
+where attname = 'a'
+ and (attrelid = 'p'::regclass
+ or attrelid = 'p1'::regclass
+ or attrelid = 'p11'::regclass)
+order by attrelid::regclass::text;
+ attrelid | attname | attnum
+----------+---------+--------
+ p | a | 1
+ p1 | a | 2
+ p11 | a | 4
+(3 rows)
+
+alter table p1 attach partition p11 for values from (2) to (5);
+insert into p1 (a, b) values (2, 3);
+-- check that partition validation scan correctly detects violating rows
+alter table p attach partition p1 for values from (1, 2) to (1, 10);
+ERROR: partition constraint of relation "p11" is violated by some row
+-- cleanup
+drop table p;
+drop table p1;
+-- validate constraint on partitioned tables should only scan leaf partitions
+create table parted_validate_test (a int) partition by list (a);
+create table parted_validate_test_1 partition of parted_validate_test for values in (0, 1);
+alter table parted_validate_test add constraint parted_validate_test_chka check (a > 0) not valid;
+alter table parted_validate_test validate constraint parted_validate_test_chka;
+drop table parted_validate_test;
+-- test alter column options
+CREATE TABLE attmp(i integer);
+INSERT INTO attmp VALUES (1);
+ALTER TABLE attmp ALTER COLUMN i SET (n_distinct = 1, n_distinct_inherited = 2);
+ALTER TABLE attmp ALTER COLUMN i RESET (n_distinct_inherited);
+ANALYZE attmp;
+DROP TABLE attmp;
+DROP USER regress_alter_table_user1;
+-- check that violating rows are correctly reported when attaching as the
+-- default partition
+create table defpart_attach_test (a int) partition by list (a);
+create table defpart_attach_test1 partition of defpart_attach_test for values in (1);
+create table defpart_attach_test_d (b int, a int);
+alter table defpart_attach_test_d drop b;
+insert into defpart_attach_test_d values (1), (2);
+-- error because its constraint as the default partition would be violated
+-- by the row containing 1
+alter table defpart_attach_test attach partition defpart_attach_test_d default;
+ERROR: partition constraint of relation "defpart_attach_test_d" is violated by some row
+delete from defpart_attach_test_d where a = 1;
+alter table defpart_attach_test_d add check (a > 1);
+-- should be attached successfully and without needing to be scanned
+alter table defpart_attach_test attach partition defpart_attach_test_d default;
+-- check that attaching a partition correctly reports any rows in the default
+-- partition that should not be there for the new partition to be attached
+-- successfully
+create table defpart_attach_test_2 (like defpart_attach_test_d);
+alter table defpart_attach_test attach partition defpart_attach_test_2 for values in (2);
+ERROR: updated partition constraint for default partition "defpart_attach_test_d" would be violated by some row
+drop table defpart_attach_test;
+-- check combinations of temporary and permanent relations when attaching
+-- partitions.
+create table perm_part_parent (a int) partition by list (a);
+create temp table temp_part_parent (a int) partition by list (a);
+create table perm_part_child (a int);
+create temp table temp_part_child (a int);
+alter table temp_part_parent attach partition perm_part_child default; -- error
+ERROR: cannot attach a permanent relation as partition of temporary relation "temp_part_parent"
+alter table perm_part_parent attach partition temp_part_child default; -- error
+ERROR: cannot attach a temporary relation as partition of permanent relation "perm_part_parent"
+alter table temp_part_parent attach partition temp_part_child default; -- ok
+drop table perm_part_parent cascade;
+drop table temp_part_parent cascade;
+-- check that attaching partitions to a table while it is being used is
+-- prevented
+create table tab_part_attach (a int) partition by list (a);
+create or replace function func_part_attach() returns trigger
+ language plpgsql as $$
+ begin
+ execute 'create table tab_part_attach_1 (a int)';
+ execute 'alter table tab_part_attach attach partition tab_part_attach_1 for values in (1)';
+ return null;
+ end $$;
+create trigger trig_part_attach before insert on tab_part_attach
+ for each statement execute procedure func_part_attach();
+insert into tab_part_attach values (1);
+ERROR: cannot ALTER TABLE "tab_part_attach" because it is being used by active queries in this session
+CONTEXT: SQL statement "alter table tab_part_attach attach partition tab_part_attach_1 for values in (1)"
+PL/pgSQL function func_part_attach() line 4 at EXECUTE
+drop table tab_part_attach;
+drop function func_part_attach();
+-- test case where the partitioning operator is a SQL function whose
+-- evaluation results in the table's relcache being rebuilt partway through
+-- the execution of an ATTACH PARTITION command
+create function at_test_sql_partop (int4, int4) returns int language sql
+as $$ select case when $1 = $2 then 0 when $1 > $2 then 1 else -1 end; $$;
+create operator class at_test_sql_partop for type int4 using btree as
+ operator 1 < (int4, int4), operator 2 <= (int4, int4),
+ operator 3 = (int4, int4), operator 4 >= (int4, int4),
+ operator 5 > (int4, int4), function 1 at_test_sql_partop(int4, int4);
+create table at_test_sql_partop (a int) partition by range (a at_test_sql_partop);
+create table at_test_sql_partop_1 (a int);
+alter table at_test_sql_partop attach partition at_test_sql_partop_1 for values from (0) to (10);
+drop table at_test_sql_partop;
+drop operator class at_test_sql_partop using btree;
+drop function at_test_sql_partop;
+/* Test case for bug #16242 */
+-- We create a parent and child where the child has missing
+-- non-null attribute values, and arrange to pass them through
+-- tuple conversion from the child to the parent tupdesc
+create table bar1 (a integer, b integer not null default 1)
+ partition by range (a);
+create table bar2 (a integer);
+insert into bar2 values (1);
+alter table bar2 add column b integer not null default 1;
+-- (at this point bar2 contains tuple with natts=1)
+alter table bar1 attach partition bar2 default;
+-- this works:
+select * from bar1;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+-- this exercises tuple conversion:
+create function xtrig()
+ returns trigger language plpgsql
+as $$
+ declare
+ r record;
+ begin
+ for r in select * from old loop
+ raise info 'a=%, b=%', r.a, r.b;
+ end loop;
+ return NULL;
+ end;
+$$;
+create trigger xtrig
+ after update on bar1
+ referencing old table as old
+ for each statement execute procedure xtrig();
+update bar1 set a = a + 1;
+INFO: a=1, b=1
+/* End test case for bug #16242 */
+-- Test that ALTER TABLE rewrite preserves a clustered index
+-- for normal indexes and indexes on constraints.
+create table alttype_cluster (a int);
+alter table alttype_cluster add primary key (a);
+create index alttype_cluster_ind on alttype_cluster (a);
+alter table alttype_cluster cluster on alttype_cluster_ind;
+-- Normal index remains clustered.
+select indexrelid::regclass, indisclustered from pg_index
+ where indrelid = 'alttype_cluster'::regclass
+ order by indexrelid::regclass::text;
+ indexrelid | indisclustered
+----------------------+----------------
+ alttype_cluster_ind | t
+ alttype_cluster_pkey | f
+(2 rows)
+
+alter table alttype_cluster alter a type bigint;
+select indexrelid::regclass, indisclustered from pg_index
+ where indrelid = 'alttype_cluster'::regclass
+ order by indexrelid::regclass::text;
+ indexrelid | indisclustered
+----------------------+----------------
+ alttype_cluster_ind | t
+ alttype_cluster_pkey | f
+(2 rows)
+
+-- Constraint index remains clustered.
+alter table alttype_cluster cluster on alttype_cluster_pkey;
+select indexrelid::regclass, indisclustered from pg_index
+ where indrelid = 'alttype_cluster'::regclass
+ order by indexrelid::regclass::text;
+ indexrelid | indisclustered
+----------------------+----------------
+ alttype_cluster_ind | f
+ alttype_cluster_pkey | t
+(2 rows)
+
+alter table alttype_cluster alter a type int;
+select indexrelid::regclass, indisclustered from pg_index
+ where indrelid = 'alttype_cluster'::regclass
+ order by indexrelid::regclass::text;
+ indexrelid | indisclustered
+----------------------+----------------
+ alttype_cluster_ind | f
+ alttype_cluster_pkey | t
+(2 rows)
+
+drop table alttype_cluster;
diff --git a/src/test/regress/expected/box.out b/src/test/regress/expected/box.out
index 4d0f169214..f5c0ec106d 100644
--- a/src/test/regress/expected/box.out
+++ b/src/test/regress/expected/box.out
@@ -253,6 +253,7 @@ INSERT INTO box_temp
('(-infinity,0)(0,infinity)'),
('(-infinity,-infinity)(infinity,infinity)');
SET enable_seqscan = false;
+SET enable_bitmapscan = false;
SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
f1
----------------------------
@@ -476,6 +477,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM box_temp WHERE f1 ~= '(20,20),(40,40)';
(2 rows)
RESET enable_seqscan;
+RESET enable_bitmapscan;
DROP INDEX box_spgist;
--
-- Test the SP-GiST index on the larger volume of data
diff --git a/src/test/regress/expected/brin_1.out b/src/test/regress/expected/brin_1.out
new file mode 100644
index 0000000000..8c85680368
--- /dev/null
+++ b/src/test/regress/expected/brin_1.out
@@ -0,0 +1,528 @@
+CREATE TABLE brintest (byteacol bytea,
+ charcol "char",
+ namecol name,
+ int8col bigint,
+ int2col smallint,
+ int4col integer,
+ textcol text,
+ oidcol oid,
+ tidcol tid,
+ float4col real,
+ float8col double precision,
+ macaddrcol macaddr,
+ inetcol inet,
+ cidrcol cidr,
+ bpcharcol character,
+ datecol date,
+ timecol time without time zone,
+ timestampcol timestamp without time zone,
+ timestamptzcol timestamp with time zone,
+ intervalcol interval,
+ timetzcol time with time zone,
+ bitcol bit(10),
+ varbitcol bit varying(16),
+ numericcol numeric,
+ uuidcol uuid,
+ int4rangecol int4range,
+ lsncol pg_lsn,
+ boxcol box
+) WITH (fillfactor=10);
+INSERT INTO brintest SELECT
+ repeat(stringu1, 8)::bytea,
+ substr(stringu1, 1, 1)::"char",
+ stringu1::name, 142857 * tenthous,
+ thousand,
+ twothousand,
+ repeat(stringu1, 8),
+ unique1::oid,
+ format('(%s,%s)', tenthous, twenty)::tid,
+ (four + 1.0)/(hundred+1),
+ odd::float8 / (tenthous + 1),
+ format('%s:00:%s:00:%s:00', to_hex(odd), to_hex(even), to_hex(hundred))::macaddr,
+ inet '10.2.3.4/24' + tenthous,
+ cidr '10.2.3/24' + tenthous,
+ substr(stringu1, 1, 1)::bpchar,
+ date '1995-08-15' + tenthous,
+ time '01:20:30' + thousand * interval '18.5 second',
+ timestamp '1942-07-23 03:05:09' + tenthous * interval '36.38 hours',
+ timestamptz '1972-10-10 03:00' + thousand * interval '1 hour',
+ justify_days(justify_hours(tenthous * interval '12 minutes')),
+ timetz '01:30:20+02' + hundred * interval '15 seconds',
+ thousand::bit(10),
+ tenthous::bit(16)::varbit,
+ tenthous::numeric(36,30) * fivethous * even / (hundred + 1),
+ format('%s%s-%s-%s-%s-%s%s%s', to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'))::uuid,
+ int4range(thousand, twothousand),
+ format('%s/%s%s', odd, even, tenthous)::pg_lsn,
+ box(point(odd, even), point(thousand, twothousand))
+FROM tenk1 ORDER BY unique2 LIMIT 100;
+-- throw in some NULL's and different values
+INSERT INTO brintest (inetcol, cidrcol, int4rangecol) SELECT
+ inet 'fe80::6e40:8ff:fea9:8c46' + tenthous,
+ cidr 'fe80::6e40:8ff:fea9:8c46' + tenthous,
+ 'empty'::int4range
+FROM tenk1 ORDER BY thousand, tenthous LIMIT 25;
+CREATE INDEX brinidx ON brintest USING brin (
+ byteacol,
+ charcol,
+ namecol,
+ int8col,
+ int2col,
+ int4col,
+ textcol,
+ oidcol,
+ tidcol,
+ float4col,
+ float8col,
+ macaddrcol,
+ inetcol inet_inclusion_ops,
+ inetcol inet_minmax_ops,
+ cidrcol inet_inclusion_ops,
+ cidrcol inet_minmax_ops,
+ bpcharcol,
+ datecol,
+ timecol,
+ timestampcol,
+ timestamptzcol,
+ intervalcol,
+ timetzcol,
+ bitcol,
+ varbitcol,
+ numericcol,
+ uuidcol,
+ int4rangecol,
+ lsncol,
+ boxcol
+) with (pages_per_range = 1);
+CREATE TABLE brinopers (colname name, typ text,
+ op text[], value text[], matches int[],
+ check (cardinality(op) = cardinality(value)),
+ check (cardinality(op) = cardinality(matches)));
+INSERT INTO brinopers VALUES
+ ('byteacol', 'bytea',
+ '{>, >=, =, <=, <}',
+ '{AAAAAA, AAAAAA, BNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAA, ZZZZZZ, ZZZZZZ}',
+ '{100, 100, 1, 100, 100}'),
+ ('charcol', '"char"',
+ '{>, >=, =, <=, <}',
+ '{A, A, M, Z, Z}',
+ '{97, 100, 6, 100, 98}'),
+ ('namecol', 'name',
+ '{>, >=, =, <=, <}',
+ '{AAAAAA, AAAAAA, MAAAAA, ZZAAAA, ZZAAAA}',
+ '{100, 100, 2, 100, 100}'),
+ ('int2col', 'int2',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 999, 999}',
+ '{100, 100, 1, 100, 100}'),
+ ('int2col', 'int4',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 999, 1999}',
+ '{100, 100, 1, 100, 100}'),
+ ('int2col', 'int8',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 999, 1428427143}',
+ '{100, 100, 1, 100, 100}'),
+ ('int4col', 'int2',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 1999, 1999}',
+ '{100, 100, 1, 100, 100}'),
+ ('int4col', 'int4',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 1999, 1999}',
+ '{100, 100, 1, 100, 100}'),
+ ('int4col', 'int8',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 800, 1999, 1428427143}',
+ '{100, 100, 1, 100, 100}'),
+ ('int8col', 'int2',
+ '{>, >=}',
+ '{0, 0}',
+ '{100, 100}'),
+ ('int8col', 'int4',
+ '{>, >=}',
+ '{0, 0}',
+ '{100, 100}'),
+ ('int8col', 'int8',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 1257141600, 1428427143, 1428427143}',
+ '{100, 100, 1, 100, 100}'),
+ ('textcol', 'text',
+ '{>, >=, =, <=, <}',
+ '{ABABAB, ABABAB, BNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAABNAAAA, ZZAAAA, ZZAAAA}',
+ '{100, 100, 1, 100, 100}'),
+ ('oidcol', 'oid',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 8800, 9999, 9999}',
+ '{100, 100, 1, 100, 100}'),
+ ('tidcol', 'tid',
+ '{>, >=, =, <=, <}',
+ '{"(0,0)", "(0,0)", "(8800,0)", "(9999,19)", "(9999,19)"}',
+ '{100, 100, 1, 100, 100}'),
+ ('float4col', 'float4',
+ '{>, >=, =, <=, <}',
+ '{0.0103093, 0.0103093, 1, 1, 1}',
+ '{100, 100, 4, 100, 96}'),
+ ('float4col', 'float8',
+ '{>, >=, =, <=, <}',
+ '{0.0103093, 0.0103093, 1, 1, 1}',
+ '{100, 100, 4, 100, 96}'),
+ ('float8col', 'float4',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 0, 1.98, 1.98}',
+ '{99, 100, 1, 100, 100}'),
+ ('float8col', 'float8',
+ '{>, >=, =, <=, <}',
+ '{0, 0, 0, 1.98, 1.98}',
+ '{99, 100, 1, 100, 100}'),
+ ('macaddrcol', 'macaddr',
+ '{>, >=, =, <=, <}',
+ '{00:00:01:00:00:00, 00:00:01:00:00:00, 2c:00:2d:00:16:00, ff:fe:00:00:00:00, ff:fe:00:00:00:00}',
+ '{99, 100, 2, 100, 100}'),
+ ('inetcol', 'inet',
+ '{&&, =, <, <=, >, >=, >>=, >>, <<=, <<}',
+ '{10/8, 10.2.14.231/24, 255.255.255.255, 255.255.255.255, 0.0.0.0, 0.0.0.0, 10.2.14.231/24, 10.2.14.231/25, 10.2.14.231/8, 0/0}',
+ '{100, 1, 100, 100, 125, 125, 2, 2, 100, 100}'),
+ ('inetcol', 'inet',
+ '{&&, >>=, <<=, =}',
+ '{fe80::6e40:8ff:fea9:a673/32, fe80::6e40:8ff:fea9:8c46, fe80::6e40:8ff:fea9:a673/32, fe80::6e40:8ff:fea9:8c46}',
+ '{25, 1, 25, 1}'),
+ ('inetcol', 'cidr',
+ '{&&, <, <=, >, >=, >>=, >>, <<=, <<}',
+ '{10/8, 255.255.255.255, 255.255.255.255, 0.0.0.0, 0.0.0.0, 10.2.14/24, 10.2.14/25, 10/8, 0/0}',
+ '{100, 100, 100, 125, 125, 2, 2, 100, 100}'),
+ ('inetcol', 'cidr',
+ '{&&, >>=, <<=, =}',
+ '{fe80::/32, fe80::6e40:8ff:fea9:8c46, fe80::/32, fe80::6e40:8ff:fea9:8c46}',
+ '{25, 1, 25, 1}'),
+ ('cidrcol', 'inet',
+ '{&&, =, <, <=, >, >=, >>=, >>, <<=, <<}',
+ '{10/8, 10.2.14/24, 255.255.255.255, 255.255.255.255, 0.0.0.0, 0.0.0.0, 10.2.14.231/24, 10.2.14.231/25, 10.2.14.231/8, 0/0}',
+ '{100, 2, 100, 100, 125, 125, 2, 2, 100, 100}'),
+ ('cidrcol', 'inet',
+ '{&&, >>=, <<=, =}',
+ '{fe80::6e40:8ff:fea9:a673/32, fe80::6e40:8ff:fea9:8c46, fe80::6e40:8ff:fea9:a673/32, fe80::6e40:8ff:fea9:8c46}',
+ '{25, 1, 25, 1}'),
+ ('cidrcol', 'cidr',
+ '{&&, =, <, <=, >, >=, >>=, >>, <<=, <<}',
+ '{10/8, 10.2.14/24, 255.255.255.255, 255.255.255.255, 0.0.0.0, 0.0.0.0, 10.2.14/24, 10.2.14/25, 10/8, 0/0}',
+ '{100, 2, 100, 100, 125, 125, 2, 2, 100, 100}'),
+ ('cidrcol', 'cidr',
+ '{&&, >>=, <<=, =}',
+ '{fe80::/32, fe80::6e40:8ff:fea9:8c46, fe80::/32, fe80::6e40:8ff:fea9:8c46}',
+ '{25, 1, 25, 1}'),
+ ('bpcharcol', 'bpchar',
+ '{>, >=, =, <=, <}',
+ '{A, A, W, Z, Z}',
+ '{97, 100, 6, 100, 98}'),
+ ('datecol', 'date',
+ '{>, >=, =, <=, <}',
+ '{1995-08-15, 1995-08-15, 2009-12-01, 2022-12-30, 2022-12-30}',
+ '{100, 100, 1, 100, 100}'),
+ ('timecol', 'time',
+ '{>, >=, =, <=, <}',
+ '{01:20:30, 01:20:30, 02:28:57, 06:28:31.5, 06:28:31.5}',
+ '{100, 100, 1, 100, 100}'),
+ ('timestampcol', 'timestamp',
+ '{>, >=, =, <=, <}',
+ '{1942-07-23 03:05:09, 1942-07-23 03:05:09, 1964-03-24 19:26:45, 1984-01-20 22:42:21, 1984-01-20 22:42:21}',
+ '{100, 100, 1, 100, 100}'),
+ ('timestampcol', 'timestamptz',
+ '{>, >=, =, <=, <}',
+ '{1942-07-23 03:05:09, 1942-07-23 03:05:09, 1964-03-24 19:26:45, 1984-01-20 22:42:21, 1984-01-20 22:42:21}',
+ '{100, 100, 1, 100, 100}'),
+ ('timestamptzcol', 'timestamptz',
+ '{>, >=, =, <=, <}',
+ '{1972-10-10 03:00:00-04, 1972-10-10 03:00:00-04, 1972-10-19 09:00:00-07, 1972-11-20 19:00:00-03, 1972-11-20 19:00:00-03}',
+ '{100, 100, 1, 100, 100}'),
+ ('intervalcol', 'interval',
+ '{>, >=, =, <=, <}',
+ '{00:00:00, 00:00:00, 1 mons 13 days 12:24, 2 mons 23 days 07:48:00, 1 year}',
+ '{100, 100, 1, 100, 100}'),
+ ('timetzcol', 'timetz',
+ '{>, >=, =, <=, <}',
+ '{01:30:20+02, 01:30:20+02, 01:35:50+02, 23:55:05+02, 23:55:05+02}',
+ '{99, 100, 2, 100, 100}'),
+ ('bitcol', 'bit(10)',
+ '{>, >=, =, <=, <}',
+ '{0000000010, 0000000010, 0011011110, 1111111000, 1111111000}',
+ '{100, 100, 1, 100, 100}'),
+ ('varbitcol', 'varbit(16)',
+ '{>, >=, =, <=, <}',
+ '{0000000000000100, 0000000000000100, 0001010001100110, 1111111111111000, 1111111111111000}',
+ '{100, 100, 1, 100, 100}'),
+ ('numericcol', 'numeric',
+ '{>, >=, =, <=, <}',
+ '{0.00, 0.01, 2268164.347826086956521739130434782609, 99470151.9, 99470151.9}',
+ '{100, 100, 1, 100, 100}'),
+ ('uuidcol', 'uuid',
+ '{>, >=, =, <=, <}',
+ '{00040004-0004-0004-0004-000400040004, 00040004-0004-0004-0004-000400040004, 52225222-5222-5222-5222-522252225222, 99989998-9998-9998-9998-999899989998, 99989998-9998-9998-9998-999899989998}',
+ '{100, 100, 1, 100, 100}'),
+ ('int4rangecol', 'int4range',
+ '{<<, &<, &&, &>, >>, @>, <@, =, <, <=, >, >=}',
+ '{"[10000,)","[10000,)","(,]","[3,4)","[36,44)","(1500,1501]","[3,4)","[222,1222)","[36,44)","[43,1043)","[367,4466)","[519,)"}',
+ '{53, 53, 53, 53, 50, 22, 72, 1, 74, 75, 34, 21}'),
+ ('int4rangecol', 'int4range',
+ '{@>, <@, =, <=, >, >=}',
+ '{empty, empty, empty, empty, empty, empty}',
+ '{125, 72, 72, 72, 53, 125}'),
+ ('int4rangecol', 'int4',
+ '{@>}',
+ '{1500}',
+ '{22}'),
+ ('lsncol', 'pg_lsn',
+ '{>, >=, =, <=, <, IS, IS NOT}',
+ '{0/1200, 0/1200, 44/455222, 198/1999799, 198/1999799, NULL, NULL}',
+ '{100, 100, 1, 100, 100, 25, 100}'),
+ ('boxcol', 'point',
+ '{@>}',
+ '{"(500,43)"}',
+ '{11}'),
+ ('boxcol', 'box',
+ '{<<, &<, &&, &>, >>, <<|, &<|, |&>, |>>, @>, <@, ~=}',
+ '{"((1000,2000),(3000,4000))","((1,2),(3000,4000))","((1,2),(3000,4000))","((1,2),(3000,4000))","((1,2),(3,4))","((1000,2000),(3000,4000))","((1,2000),(3,4000))","((1000,2),(3000,4))","((1,2),(3,4))","((1,2),(300,400))","((1,2),(3000,4000))","((222,1222),(44,45))"}',
+ '{100, 100, 100, 99, 96, 100, 100, 99, 96, 1, 99, 1}');
+DO $x$
+DECLARE
+ r record;
+ r2 record;
+ cond text;
+ idx_ctids tid[];
+ ss_ctids tid[];
+ count int;
+ plan_ok bool;
+ plan_line text;
+BEGIN
+ FOR r IN SELECT colname, oper, typ, value[ordinality], matches[ordinality] FROM brinopers, unnest(op) WITH ORDINALITY AS oper LOOP
+
+ -- prepare the condition
+ IF r.value IS NULL THEN
+ cond := format('%I %s %L', r.colname, r.oper, r.value);
+ ELSE
+ cond := format('%I %s %L::%s', r.colname, r.oper, r.value, r.typ);
+ END IF;
+
+ -- run the query using the brin index
+ SET enable_seqscan = 0;
+ SET enable_bitmapscan = 1;
+
+ plan_ok := false;
+ FOR plan_line IN EXECUTE format($y$EXPLAIN SELECT array_agg(ctid) FROM brintest WHERE %s $y$, cond) LOOP
+ IF plan_line LIKE '%Bitmap Heap Scan on brintest%' THEN
+ plan_ok := true;
+ END IF;
+ END LOOP;
+ IF NOT plan_ok THEN
+ RAISE WARNING 'did not get bitmap indexscan plan for %', r;
+ END IF;
+
+ EXECUTE format($y$SELECT array_agg(ctid) FROM brintest WHERE %s $y$, cond)
+ INTO idx_ctids;
+
+ -- run the query using a seqscan
+ SET enable_seqscan = 1;
+ SET enable_bitmapscan = 0;
+
+ plan_ok := false;
+ FOR plan_line IN EXECUTE format($y$EXPLAIN SELECT array_agg(ctid) FROM brintest WHERE %s $y$, cond) LOOP
+ IF plan_line LIKE '%Seq Scan on brintest%' THEN
+ plan_ok := true;
+ END IF;
+ END LOOP;
+ IF NOT plan_ok THEN
+ RAISE WARNING 'did not get seqscan plan for %', r;
+ END IF;
+
+ EXECUTE format($y$SELECT array_agg(ctid) FROM brintest WHERE %s $y$, cond)
+ INTO ss_ctids;
+
+ -- make sure both return the same results
+ count := array_length(idx_ctids, 1);
+
+ IF NOT (count = array_length(ss_ctids, 1) AND
+ idx_ctids @> ss_ctids AND
+ idx_ctids <@ ss_ctids) THEN
+ -- report the results of each scan to make the differences obvious
+ RAISE WARNING 'something not right in %: count %', r, count;
+ SET enable_seqscan = 1;
+ SET enable_bitmapscan = 0;
+ FOR r2 IN EXECUTE 'SELECT ' || r.colname || ' FROM brintest WHERE ' || cond LOOP
+ RAISE NOTICE 'seqscan: %', r2;
+ END LOOP;
+
+ SET enable_seqscan = 0;
+ SET enable_bitmapscan = 1;
+ FOR r2 IN EXECUTE 'SELECT ' || r.colname || ' FROM brintest WHERE ' || cond LOOP
+ RAISE NOTICE 'bitmapscan: %', r2;
+ END LOOP;
+ END IF;
+
+ -- make sure we found expected number of matches
+ IF count != r.matches THEN RAISE WARNING 'unexpected number of results % for %', count, r; END IF;
+ END LOOP;
+END;
+$x$;
+RESET enable_seqscan;
+RESET enable_bitmapscan;
+INSERT INTO brintest SELECT
+ repeat(stringu1, 42)::bytea,
+ substr(stringu1, 1, 1)::"char",
+ stringu1::name, 142857 * tenthous,
+ thousand,
+ twothousand,
+ repeat(stringu1, 42),
+ unique1::oid,
+ format('(%s,%s)', tenthous, twenty)::tid,
+ (four + 1.0)/(hundred+1),
+ odd::float8 / (tenthous + 1),
+ format('%s:00:%s:00:%s:00', to_hex(odd), to_hex(even), to_hex(hundred))::macaddr,
+ inet '10.2.3.4' + tenthous,
+ cidr '10.2.3/24' + tenthous,
+ substr(stringu1, 1, 1)::bpchar,
+ date '1995-08-15' + tenthous,
+ time '01:20:30' + thousand * interval '18.5 second',
+ timestamp '1942-07-23 03:05:09' + tenthous * interval '36.38 hours',
+ timestamptz '1972-10-10 03:00' + thousand * interval '1 hour',
+ justify_days(justify_hours(tenthous * interval '12 minutes')),
+ timetz '01:30:20' + hundred * interval '15 seconds',
+ thousand::bit(10),
+ tenthous::bit(16)::varbit,
+ tenthous::numeric(36,30) * fivethous * even / (hundred + 1),
+ format('%s%s-%s-%s-%s-%s%s%s', to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'), to_char(tenthous, 'FM0000'))::uuid,
+ int4range(thousand, twothousand),
+ format('%s/%s%s', odd, even, tenthous)::pg_lsn,
+ box(point(odd, even), point(thousand, twothousand))
+FROM tenk1 ORDER BY unique2 LIMIT 5 OFFSET 5;
+SELECT brin_desummarize_range('brinidx', 0);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+VACUUM brintest; -- force a summarization cycle in brinidx
+UPDATE brintest SET int8col = int8col * int4col;
+UPDATE brintest SET textcol = '' WHERE textcol IS NOT NULL;
+-- Tests for brin_summarize_new_values
+SELECT brin_summarize_new_values('brintest'); -- error, not an index
+ERROR: "brintest" is not an index
+SELECT brin_summarize_new_values('tenk1_unique1'); -- error, not a BRIN index
+ERROR: "tenk1_unique1" is not a BRIN index
+SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
+ brin_summarize_new_values
+---------------------------
+ 9
+(1 row)
+
+-- Tests for brin_desummarize_range
+SELECT brin_desummarize_range('brinidx', -1); -- error, invalid range
+ERROR: block number out of range: -1
+SELECT brin_desummarize_range('brinidx', 0);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+SELECT brin_desummarize_range('brinidx', 0);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+SELECT brin_desummarize_range('brinidx', 100000000);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+-- Test brin_summarize_range
+CREATE TABLE brin_summarize (
+ value int
+) WITH (fillfactor=10, autovacuum_enabled=false);
+CREATE INDEX brin_summarize_idx ON brin_summarize USING brin (value) WITH (pages_per_range=2);
+-- Fill a few pages
+DO $$
+DECLARE curtid tid;
+BEGIN
+ LOOP
+ INSERT INTO brin_summarize VALUES (1) RETURNING ctid INTO curtid;
+ EXIT WHEN curtid > tid '(2, 0)';
+ END LOOP;
+END;
+$$;
+-- summarize one range
+SELECT brin_summarize_range('brin_summarize_idx', 0);
+ brin_summarize_range
+----------------------
+ 0
+(1 row)
+
+-- nothing: already summarized
+SELECT brin_summarize_range('brin_summarize_idx', 1);
+ brin_summarize_range
+----------------------
+ 0
+(1 row)
+
+-- summarize one range
+SELECT brin_summarize_range('brin_summarize_idx', 2);
+ brin_summarize_range
+----------------------
+ 1
+(1 row)
+
+-- nothing: page doesn't exist in table
+SELECT brin_summarize_range('brin_summarize_idx', 4294967295);
+ brin_summarize_range
+----------------------
+ 2
+(1 row)
+
+-- invalid block number values
+SELECT brin_summarize_range('brin_summarize_idx', -1);
+ERROR: block number out of range: -1
+SELECT brin_summarize_range('brin_summarize_idx', 4294967296);
+ERROR: block number out of range: 4294967296
+-- test value merging in add_value
+CREATE TABLE brintest_2 (n numrange);
+CREATE INDEX brinidx_2 ON brintest_2 USING brin (n);
+INSERT INTO brintest_2 VALUES ('empty');
+INSERT INTO brintest_2 VALUES (numrange(0, 2^1000::numeric));
+INSERT INTO brintest_2 VALUES ('(-1, 0)');
+SELECT brin_desummarize_range('brinidx', 0);
+ brin_desummarize_range
+------------------------
+
+(1 row)
+
+SELECT brin_summarize_range('brinidx', 0);
+ brin_summarize_range
+----------------------
+ 1
+(1 row)
+
+DROP TABLE brintest_2;
+-- test brin cost estimates behave sanely based on correlation of values
+CREATE TABLE brin_test (a INT, b INT);
+INSERT INTO brin_test SELECT x/100,x%100 FROM generate_series(1,10000) x(x);
+CREATE INDEX brin_test_a_idx ON brin_test USING brin (a) WITH (pages_per_range = 2);
+CREATE INDEX brin_test_b_idx ON brin_test USING brin (b) WITH (pages_per_range = 2);
+VACUUM ANALYZE brin_test;
+-- Ensure brin index is used when columns are perfectly correlated
+EXPLAIN (COSTS OFF) SELECT * FROM brin_test WHERE a = 1;
+ QUERY PLAN
+--------------------------------------------
+ Bitmap Heap Scan on brin_test
+ Recheck Cond: (a = 1)
+ -> Bitmap Index Scan on brin_test_a_idx
+ Index Cond: (a = 1)
+(4 rows)
+
+-- Ensure brin index is not used when values are not correlated
+EXPLAIN (COSTS OFF) SELECT * FROM brin_test WHERE b = 1;
+ QUERY PLAN
+-----------------------
+ Seq Scan on brin_test
+ Filter: (b = 1)
+(2 rows)
+
diff --git a/src/test/regress/expected/cluster_1.out b/src/test/regress/expected/cluster_1.out
new file mode 100644
index 0000000000..7167f05199
--- /dev/null
+++ b/src/test/regress/expected/cluster_1.out
@@ -0,0 +1,582 @@
+--
+-- CLUSTER
+--
+CREATE TABLE clstr_tst_s (rf_a SERIAL PRIMARY KEY,
+ b INT);
+CREATE TABLE clstr_tst (a SERIAL PRIMARY KEY,
+ b INT,
+ c TEXT,
+ d TEXT,
+ CONSTRAINT clstr_tst_con FOREIGN KEY (b) REFERENCES clstr_tst_s);
+CREATE INDEX clstr_tst_b ON clstr_tst (b);
+CREATE INDEX clstr_tst_c ON clstr_tst (c);
+CREATE INDEX clstr_tst_c_b ON clstr_tst (c,b);
+CREATE INDEX clstr_tst_b_c ON clstr_tst (b,c);
+INSERT INTO clstr_tst_s (b) VALUES (0);
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+INSERT INTO clstr_tst_s (b) SELECT b FROM clstr_tst_s;
+CREATE TABLE clstr_tst_inh () INHERITS (clstr_tst);
+INSERT INTO clstr_tst (b, c) VALUES (11, 'once');
+INSERT INTO clstr_tst (b, c) VALUES (10, 'diez');
+INSERT INTO clstr_tst (b, c) VALUES (31, 'treinta y uno');
+INSERT INTO clstr_tst (b, c) VALUES (22, 'veintidos');
+INSERT INTO clstr_tst (b, c) VALUES (3, 'tres');
+INSERT INTO clstr_tst (b, c) VALUES (20, 'veinte');
+INSERT INTO clstr_tst (b, c) VALUES (23, 'veintitres');
+INSERT INTO clstr_tst (b, c) VALUES (21, 'veintiuno');
+INSERT INTO clstr_tst (b, c) VALUES (4, 'cuatro');
+INSERT INTO clstr_tst (b, c) VALUES (14, 'catorce');
+INSERT INTO clstr_tst (b, c) VALUES (2, 'dos');
+INSERT INTO clstr_tst (b, c) VALUES (18, 'dieciocho');
+INSERT INTO clstr_tst (b, c) VALUES (27, 'veintisiete');
+INSERT INTO clstr_tst (b, c) VALUES (25, 'veinticinco');
+INSERT INTO clstr_tst (b, c) VALUES (13, 'trece');
+INSERT INTO clstr_tst (b, c) VALUES (28, 'veintiocho');
+INSERT INTO clstr_tst (b, c) VALUES (32, 'treinta y dos');
+INSERT INTO clstr_tst (b, c) VALUES (5, 'cinco');
+INSERT INTO clstr_tst (b, c) VALUES (29, 'veintinueve');
+INSERT INTO clstr_tst (b, c) VALUES (1, 'uno');
+INSERT INTO clstr_tst (b, c) VALUES (24, 'veinticuatro');
+INSERT INTO clstr_tst (b, c) VALUES (30, 'treinta');
+INSERT INTO clstr_tst (b, c) VALUES (12, 'doce');
+INSERT INTO clstr_tst (b, c) VALUES (17, 'diecisiete');
+INSERT INTO clstr_tst (b, c) VALUES (9, 'nueve');
+INSERT INTO clstr_tst (b, c) VALUES (19, 'diecinueve');
+INSERT INTO clstr_tst (b, c) VALUES (26, 'veintiseis');
+INSERT INTO clstr_tst (b, c) VALUES (15, 'quince');
+INSERT INTO clstr_tst (b, c) VALUES (7, 'siete');
+INSERT INTO clstr_tst (b, c) VALUES (16, 'dieciseis');
+INSERT INTO clstr_tst (b, c) VALUES (8, 'ocho');
+-- This entry is needed to test that TOASTED values are copied correctly.
+INSERT INTO clstr_tst (b, c, d) VALUES (6, 'seis', repeat('xyzzy', 100000));
+CLUSTER clstr_tst_c ON clstr_tst;
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+(32 rows)
+
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY a;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 1 | 11 | once | |
+ 2 | 10 | diez | |
+ 3 | 31 | treinta y uno | |
+ 4 | 22 | veintidos | |
+ 5 | 3 | tres | |
+ 6 | 20 | veinte | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 9 | 4 | cuatro | |
+ 10 | 14 | catorce | |
+ 11 | 2 | dos | |
+ 12 | 18 | dieciocho | |
+ 13 | 27 | veintisiete | |
+ 14 | 25 | veinticinco | |
+ 15 | 13 | trece | |
+ 16 | 28 | veintiocho | |
+ 17 | 32 | treinta y dos | |
+ 18 | 5 | cinco | |
+ 19 | 29 | veintinueve | |
+ 20 | 1 | uno | |
+ 21 | 24 | veinticuatro | |
+ 22 | 30 | treinta | |
+ 23 | 12 | doce | |
+ 24 | 17 | diecisiete | |
+ 25 | 9 | nueve | |
+ 26 | 19 | diecinueve | |
+ 27 | 26 | veintiseis | |
+ 28 | 15 | quince | |
+ 29 | 7 | siete | |
+ 30 | 16 | dieciseis | |
+ 31 | 8 | ocho | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+(32 rows)
+
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY b;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 20 | 1 | uno | |
+ 11 | 2 | dos | |
+ 5 | 3 | tres | |
+ 9 | 4 | cuatro | |
+ 18 | 5 | cinco | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 31 | 8 | ocho | |
+ 25 | 9 | nueve | |
+ 2 | 10 | diez | |
+ 1 | 11 | once | |
+ 23 | 12 | doce | |
+ 15 | 13 | trece | |
+ 10 | 14 | catorce | |
+ 28 | 15 | quince | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 12 | 18 | dieciocho | |
+ 26 | 19 | diecinueve | |
+ 6 | 20 | veinte | |
+ 8 | 21 | veintiuno | |
+ 4 | 22 | veintidos | |
+ 7 | 23 | veintitres | |
+ 21 | 24 | veinticuatro | |
+ 14 | 25 | veinticinco | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 16 | 28 | veintiocho | |
+ 19 | 29 | veintinueve | |
+ 22 | 30 | treinta | |
+ 3 | 31 | treinta y uno | |
+ 17 | 32 | treinta y dos | |
+(32 rows)
+
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst ORDER BY c;
+ a | b | c | substring | length
+----+----+---------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+(32 rows)
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+----------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+(33 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------
+ clstr_tst_con
+ clstr_tst_pkey
+(2 rows)
+
+SELECT relname, relkind,
+ EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
+FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
+ relname | relkind | hastoast
+----------------------+---------+----------
+ clstr_tst | r | f
+ clstr_tst_a_seq | S | f
+ clstr_tst_b | i | f
+ clstr_tst_b_c | i | f
+ clstr_tst_c | i | f
+ clstr_tst_c_b | i | f
+ clstr_tst_inh | r | f
+ clstr_tst_pkey | i | f
+ clstr_tst_s | r | f
+ clstr_tst_s_pkey | i | f
+ clstr_tst_s_rf_a_seq | S | f
+(11 rows)
+
+-- Verify that indisclustered is correctly set
+SELECT pg_class.relname FROM pg_index, pg_class, pg_class AS pg_class_2
+WHERE pg_class.oid=indexrelid
+ AND indrelid=pg_class_2.oid
+ AND pg_class_2.relname = 'clstr_tst'
+ AND indisclustered;
+ relname
+-------------
+ clstr_tst_c
+(1 row)
+
+-- Try changing indisclustered
+ALTER TABLE clstr_tst CLUSTER ON clstr_tst_b_c;
+SELECT pg_class.relname FROM pg_index, pg_class, pg_class AS pg_class_2
+WHERE pg_class.oid=indexrelid
+ AND indrelid=pg_class_2.oid
+ AND pg_class_2.relname = 'clstr_tst'
+ AND indisclustered;
+ relname
+---------------
+ clstr_tst_b_c
+(1 row)
+
+-- Try turning off all clustering
+ALTER TABLE clstr_tst SET WITHOUT CLUSTER;
+SELECT pg_class.relname FROM pg_index, pg_class, pg_class AS pg_class_2
+WHERE pg_class.oid=indexrelid
+ AND indrelid=pg_class_2.oid
+ AND pg_class_2.relname = 'clstr_tst'
+ AND indisclustered;
+ relname
+---------
+(0 rows)
+
+-- Verify that clustering all tables does in fact cluster the right ones
+CREATE USER regress_clstr_user;
+CREATE TABLE clstr_1 (a INT PRIMARY KEY);
+CREATE TABLE clstr_2 (a INT PRIMARY KEY);
+CREATE TABLE clstr_3 (a INT PRIMARY KEY);
+ALTER TABLE clstr_1 OWNER TO regress_clstr_user;
+ALTER TABLE clstr_3 OWNER TO regress_clstr_user;
+GRANT SELECT ON clstr_2 TO regress_clstr_user;
+INSERT INTO clstr_1 VALUES (2);
+INSERT INTO clstr_1 VALUES (1);
+INSERT INTO clstr_2 VALUES (2);
+INSERT INTO clstr_2 VALUES (1);
+INSERT INTO clstr_3 VALUES (2);
+INSERT INTO clstr_3 VALUES (1);
+-- "CLUSTER <tablename>" on a table that hasn't been clustered
+CLUSTER clstr_2;
+ERROR: there is no previously clustered index for table "clstr_2"
+CLUSTER clstr_1_pkey ON clstr_1;
+CLUSTER clstr_2 USING clstr_2_pkey;
+SELECT * FROM clstr_1 UNION ALL
+ SELECT * FROM clstr_2 UNION ALL
+ SELECT * FROM clstr_3;
+ a
+---
+ 1
+ 2
+ 1
+ 2
+ 2
+ 1
+(6 rows)
+
+-- revert to the original state
+DELETE FROM clstr_1;
+DELETE FROM clstr_2;
+DELETE FROM clstr_3;
+INSERT INTO clstr_1 VALUES (2);
+INSERT INTO clstr_1 VALUES (1);
+INSERT INTO clstr_2 VALUES (2);
+INSERT INTO clstr_2 VALUES (1);
+INSERT INTO clstr_3 VALUES (2);
+INSERT INTO clstr_3 VALUES (1);
+-- this user can only cluster clstr_1 and clstr_3, but the latter
+-- has not been clustered
+SET SESSION AUTHORIZATION regress_clstr_user;
+CLUSTER;
+SELECT * FROM clstr_1 UNION ALL
+ SELECT * FROM clstr_2 UNION ALL
+ SELECT * FROM clstr_3;
+ a
+---
+ 1
+ 2
+ 2
+ 1
+ 2
+ 1
+(6 rows)
+
+-- cluster a single table using the indisclustered bit previously set
+DELETE FROM clstr_1;
+INSERT INTO clstr_1 VALUES (2);
+INSERT INTO clstr_1 VALUES (1);
+CLUSTER clstr_1;
+SELECT * FROM clstr_1;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+-- Test MVCC-safety of cluster. There isn't much we can do to verify the
+-- results with a single backend...
+CREATE TABLE clustertest (key int PRIMARY KEY);
+INSERT INTO clustertest VALUES (10);
+INSERT INTO clustertest VALUES (20);
+INSERT INTO clustertest VALUES (30);
+INSERT INTO clustertest VALUES (40);
+INSERT INTO clustertest VALUES (50);
+-- Use a transaction so that updates are not committed when CLUSTER sees 'em
+BEGIN;
+-- Test update where the old row version is found first in the scan
+UPDATE clustertest SET key = 100 WHERE key = 10;
+-- Test update where the new row version is found first in the scan
+UPDATE clustertest SET key = 35 WHERE key = 40;
+-- Test longer update chain
+UPDATE clustertest SET key = 60 WHERE key = 50;
+UPDATE clustertest SET key = 70 WHERE key = 60;
+UPDATE clustertest SET key = 80 WHERE key = 70;
+SELECT * FROM clustertest;
+ key
+-----
+ 20
+ 30
+ 100
+ 35
+ 80
+(5 rows)
+
+CLUSTER clustertest_pkey ON clustertest;
+SELECT * FROM clustertest;
+ key
+-----
+ 20
+ 30
+ 35
+ 80
+ 100
+(5 rows)
+
+COMMIT;
+SELECT * FROM clustertest;
+ key
+-----
+ 20
+ 30
+ 35
+ 80
+ 100
+(5 rows)
+
+-- check that temp tables can be clustered
+create temp table clstr_temp (col1 int primary key, col2 text);
+insert into clstr_temp values (2, 'two'), (1, 'one');
+cluster clstr_temp using clstr_temp_pkey;
+select * from clstr_temp;
+ col1 | col2
+------+------
+ 1 | one
+ 2 | two
+(2 rows)
+
+drop table clstr_temp;
+RESET SESSION AUTHORIZATION;
+-- Check that partitioned tables cannot be clustered
+CREATE TABLE clstrpart (a int) PARTITION BY RANGE (a);
+CREATE INDEX clstrpart_idx ON clstrpart (a);
+ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+ERROR: cannot mark index clustered in partitioned table
+CLUSTER clstrpart USING clstrpart_idx;
+ERROR: cannot cluster a partitioned table
+DROP TABLE clstrpart;
+-- Test CLUSTER with external tuplesorting
+create table clstr_4 as select * from tenk1;
+create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
+-- ensure we don't use the index in CLUSTER nor the checking SELECTs
+set enable_indexscan = off;
+-- Use external sort:
+set maintenance_work_mem = '1MB';
+cluster clstr_4 using cluster_sort;
+select * from
+(select hundred, lag(hundred) over () as lhundred,
+ thousand, lag(thousand) over () as lthousand,
+ tenthous, lag(tenthous) over () as ltenthous from clstr_4) ss
+where row(hundred, thousand, tenthous) <= row(lhundred, lthousand, ltenthous);
+ hundred | lhundred | thousand | lthousand | tenthous | ltenthous
+---------+----------+----------+-----------+----------+-----------
+(0 rows)
+
+reset enable_indexscan;
+reset maintenance_work_mem;
+-- test CLUSTER on expression index
+CREATE TABLE clstr_expression(id serial primary key, a int, b text COLLATE "C");
+INSERT INTO clstr_expression(a, b) SELECT g.i % 42, 'prefix'||g.i FROM generate_series(1, 133) g(i);
+CREATE INDEX clstr_expression_minus_a ON clstr_expression ((-a), b);
+CREATE INDEX clstr_expression_upper_b ON clstr_expression ((upper(b)));
+-- verify indexes work before cluster
+BEGIN;
+SET LOCAL enable_seqscan = false;
+EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
+ QUERY PLAN
+---------------------------------------------------------------
+ Index Scan using clstr_expression_upper_b on clstr_expression
+ Index Cond: (upper(b) = 'PREFIX3'::text)
+(2 rows)
+
+SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
+ id | a | b
+----+---+---------
+ 3 | 3 | prefix3
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
+ QUERY PLAN
+---------------------------------------------------------------
+ Index Scan using clstr_expression_minus_a on clstr_expression
+ Index Cond: ((- a) = '-3'::integer)
+(2 rows)
+
+SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
+ id | a | b
+-----+---+-----------
+ 129 | 3 | prefix129
+ 3 | 3 | prefix3
+ 45 | 3 | prefix45
+ 87 | 3 | prefix87
+(4 rows)
+
+COMMIT;
+-- and after clustering on clstr_expression_minus_a
+CLUSTER clstr_expression USING clstr_expression_minus_a;
+BEGIN;
+SET LOCAL enable_seqscan = false;
+EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
+ QUERY PLAN
+---------------------------------------------------------------
+ Index Scan using clstr_expression_upper_b on clstr_expression
+ Index Cond: (upper(b) = 'PREFIX3'::text)
+(2 rows)
+
+SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
+ id | a | b
+----+---+---------
+ 3 | 3 | prefix3
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
+ QUERY PLAN
+---------------------------------------------------------------
+ Index Scan using clstr_expression_minus_a on clstr_expression
+ Index Cond: ((- a) = '-3'::integer)
+(2 rows)
+
+SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
+ id | a | b
+-----+---+-----------
+ 129 | 3 | prefix129
+ 3 | 3 | prefix3
+ 45 | 3 | prefix45
+ 87 | 3 | prefix87
+(4 rows)
+
+COMMIT;
+-- and after clustering on clstr_expression_upper_b
+CLUSTER clstr_expression USING clstr_expression_upper_b;
+BEGIN;
+SET LOCAL enable_seqscan = false;
+EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
+ QUERY PLAN
+---------------------------------------------------------------
+ Index Scan using clstr_expression_upper_b on clstr_expression
+ Index Cond: (upper(b) = 'PREFIX3'::text)
+(2 rows)
+
+SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
+ id | a | b
+----+---+---------
+ 3 | 3 | prefix3
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
+ QUERY PLAN
+---------------------------------------------------------------
+ Index Scan using clstr_expression_minus_a on clstr_expression
+ Index Cond: ((- a) = '-3'::integer)
+(2 rows)
+
+SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
+ id | a | b
+-----+---+-----------
+ 129 | 3 | prefix129
+ 3 | 3 | prefix3
+ 45 | 3 | prefix45
+ 87 | 3 | prefix87
+(4 rows)
+
+COMMIT;
+-- clean up
+DROP TABLE clustertest;
+DROP TABLE clstr_1;
+DROP TABLE clstr_2;
+DROP TABLE clstr_3;
+DROP TABLE clstr_4;
+DROP TABLE clstr_expression;
+DROP USER regress_clstr_user;
diff --git a/src/test/regress/expected/combocid_1.out b/src/test/regress/expected/combocid_1.out
new file mode 100644
index 0000000000..533a2f2fe4
--- /dev/null
+++ b/src/test/regress/expected/combocid_1.out
@@ -0,0 +1,169 @@
+--
+-- Tests for some likely failure cases with combo cmin/cmax mechanism
+--
+CREATE TEMP TABLE combocidtest (foobar int);
+BEGIN;
+-- a few dummy ops to push up the CommandId counter
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest VALUES (1);
+INSERT INTO combocidtest VALUES (2);
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+(2 rows)
+
+SAVEPOINT s1;
+UPDATE combocidtest SET foobar = foobar + 10;
+-- here we should see only updated tuples
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,3) | 12 | 11
+ (0,4) | 12 | 12
+(2 rows)
+
+ROLLBACK TO s1;
+-- now we should see old tuples, but with combo CIDs starting at 0
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+(2 rows)
+
+COMMIT;
+-- combo data is not there anymore, but should still see tuples
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+(2 rows)
+
+-- Test combo cids with portals
+BEGIN;
+INSERT INTO combocidtest VALUES (333);
+DECLARE c CURSOR FOR SELECT ctid,cmin,* FROM combocidtest;
+DELETE FROM combocidtest;
+FETCH ALL FROM c;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,5) | 0 | 333
+(3 rows)
+
+ROLLBACK;
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+(2 rows)
+
+-- check behavior with locked tuples
+BEGIN;
+-- a few dummy ops to push up the CommandId counter
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest SELECT 1 LIMIT 0;
+INSERT INTO combocidtest VALUES (444);
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+SAVEPOINT s1;
+-- this doesn't affect cmin
+SELECT ctid,cmin,* FROM combocidtest FOR UPDATE;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+-- but this does
+UPDATE combocidtest SET foobar = foobar + 10;
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,7) | 12 | 11
+ (0,8) | 12 | 12
+ (0,9) | 12 | 454
+(3 rows)
+
+ROLLBACK TO s1;
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+COMMIT;
+SELECT ctid,cmin,* FROM combocidtest;
+ ctid | cmin | foobar
+-------+------+--------
+ (0,1) | 10 | 1
+ (0,2) | 11 | 2
+ (0,6) | 10 | 444
+(3 rows)
+
+-- test for bug reported in
+-- CABRT9RC81YUf1=jsmWopcKJEro=VoeG2ou6sPwyOUTx_qteRsg@mail.gmail.com
+CREATE TABLE IF NOT EXISTS testcase(
+ id int PRIMARY KEY,
+ balance numeric
+);
+INSERT INTO testcase VALUES (1, 0);
+BEGIN;
+SELECT * FROM testcase WHERE testcase.id = 1 FOR UPDATE;
+ id | balance
+----+---------
+ 1 | 0
+(1 row)
+
+UPDATE testcase SET balance = balance + 400 WHERE id=1;
+SAVEPOINT subxact;
+UPDATE testcase SET balance = balance - 100 WHERE id=1;
+ROLLBACK TO SAVEPOINT subxact;
+-- should return one tuple
+SELECT * FROM testcase WHERE id = 1 FOR UPDATE;
+ id | balance
+----+---------
+ 1 | 400
+(1 row)
+
+ROLLBACK;
+DROP TABLE testcase;
diff --git a/src/test/regress/expected/create_am.out b/src/test/regress/expected/create_am.out
index 0dfb26c301..3ea01b08cd 100644
--- a/src/test/regress/expected/create_am.out
+++ b/src/test/regress/expected/create_am.out
@@ -129,11 +129,12 @@ ERROR: function int4in(internal) does not exist
CREATE ACCESS METHOD bogus TYPE TABLE HANDLER bthandler;
ERROR: function bthandler must return type table_am_handler
SELECT amname, amhandler, amtype FROM pg_am where amtype = 't' ORDER BY 1, 2;
- amname | amhandler | amtype
---------+----------------------+--------
- heap | heap_tableam_handler | t
- heap2 | heap_tableam_handler | t
-(2 rows)
+ amname | amhandler | amtype
+----------+--------------------------+--------
+ heap | heap_tableam_handler | t
+ heap2 | heap_tableam_handler | t
+ zedstore | zedstore_tableam_handler | t
+(3 rows)
-- First create tables employing the new AM using USING
-- plain CREATE TABLE
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 93a8736a3f..62c71569ec 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1872,7 +1872,9 @@ SELECT count(*) FROM dupindexcols
--
-- Check ordering of =ANY indexqual results (bug in 9.2.0)
--
-vacuum tenk1; -- ensure we get consistent plans here
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = OFF;
explain (costs off)
SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)
@@ -1913,6 +1915,7 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+RESET enable_indexscan;
SET enable_indexonlyscan = OFF;
explain (costs off)
SELECT thousand, tenthous FROM tenk1
@@ -1935,6 +1938,8 @@ ORDER BY thousand;
1 | 1001
(2 rows)
+RESET enable_seqscan;
+RESET enable_bitmapscan;
RESET enable_indexonlyscan;
--
-- Check elimination of constant-NULL subexpressions
diff --git a/src/test/regress/expected/create_index_1.out b/src/test/regress/expected/create_index_1.out
new file mode 100644
index 0000000000..6f2b20f918
--- /dev/null
+++ b/src/test/regress/expected/create_index_1.out
@@ -0,0 +1,2772 @@
+--
+-- CREATE_INDEX
+-- Create ancillary data structures (i.e. indices)
+--
+--
+-- BTREE
+--
+CREATE INDEX onek_unique1 ON onek USING btree(unique1 int4_ops);
+CREATE INDEX IF NOT EXISTS onek_unique1 ON onek USING btree(unique1 int4_ops);
+NOTICE: relation "onek_unique1" already exists, skipping
+CREATE INDEX IF NOT EXISTS ON onek USING btree(unique1 int4_ops);
+ERROR: syntax error at or near "ON"
+LINE 1: CREATE INDEX IF NOT EXISTS ON onek USING btree(unique1 int4_...
+ ^
+CREATE INDEX onek_unique2 ON onek USING btree(unique2 int4_ops);
+CREATE INDEX onek_hundred ON onek USING btree(hundred int4_ops);
+CREATE INDEX onek_stringu1 ON onek USING btree(stringu1 name_ops);
+CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
+CREATE INDEX tenk1_unique2 ON tenk1 USING btree(unique2 int4_ops);
+CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
+CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
+CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
+CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
+CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
+CREATE INDEX rix ON road USING btree (name text_ops);
+CREATE INDEX iix ON ihighway USING btree (name text_ops);
+CREATE INDEX six ON shighway USING btree (name text_ops);
+-- test comments
+COMMENT ON INDEX six_wrong IS 'bad index';
+ERROR: relation "six_wrong" does not exist
+COMMENT ON INDEX six IS 'good index';
+COMMENT ON INDEX six IS NULL;
+--
+-- BTREE ascending/descending cases
+--
+-- we load int4/text from pure descending data (each key is a new
+-- low key) and name/f8 from pure ascending data (each key is a new
+-- high key). we had a bug where new low keys would sometimes be
+-- "lost".
+--
+CREATE INDEX bt_i4_index ON bt_i4_heap USING btree (seqno int4_ops);
+CREATE INDEX bt_name_index ON bt_name_heap USING btree (seqno name_ops);
+CREATE INDEX bt_txt_index ON bt_txt_heap USING btree (seqno text_ops);
+CREATE INDEX bt_f8_index ON bt_f8_heap USING btree (seqno float8_ops);
+--
+-- BTREE partial indices
+--
+CREATE INDEX onek2_u1_prtl ON onek2 USING btree(unique1 int4_ops)
+ where unique1 < 20 or unique1 > 980;
+CREATE INDEX onek2_u2_prtl ON onek2 USING btree(unique2 int4_ops)
+ where stringu1 < 'B';
+CREATE INDEX onek2_stu1_prtl ON onek2 USING btree(stringu1 name_ops)
+ where onek2.stringu1 >= 'J' and onek2.stringu1 < 'K';
+--
+-- GiST (rtree-equivalent opclasses only)
+--
+CREATE INDEX grect2ind ON fast_emp4000 USING gist (home_base);
+CREATE INDEX gpolygonind ON polygon_tbl USING gist (f1);
+CREATE INDEX gcircleind ON circle_tbl USING gist (f1);
+INSERT INTO POINT_TBL(f1) VALUES (NULL);
+CREATE INDEX gpointind ON point_tbl USING gist (f1);
+CREATE TEMP TABLE gpolygon_tbl AS
+ SELECT polygon(home_base) AS f1 FROM slow_emp4000;
+INSERT INTO gpolygon_tbl VALUES ( '(1000,0,0,1000)' );
+INSERT INTO gpolygon_tbl VALUES ( '(0,1000,1000,1000)' );
+CREATE TEMP TABLE gcircle_tbl AS
+ SELECT circle(home_base) AS f1 FROM slow_emp4000;
+CREATE INDEX ggpolygonind ON gpolygon_tbl USING gist (f1);
+CREATE INDEX ggcircleind ON gcircle_tbl USING gist (f1);
+--
+-- Test GiST indexes
+--
+-- get non-indexed results for comparison purposes
+SET enable_seqscan = ON;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = OFF;
+SELECT * FROM fast_emp4000
+ WHERE home_base <@ '(200,200),(2000,1000)'::box
+ ORDER BY (home_base[0])[0];
+ home_base
+-----------------------
+ (337,455),(240,359)
+ (1444,403),(1346,344)
+(2 rows)
+
+SELECT count(*) FROM fast_emp4000 WHERE home_base && '(1000,1000,0,0)'::box;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM fast_emp4000 WHERE home_base IS NULL;
+ count
+-------
+ 278
+(1 row)
+
+SELECT * FROM polygon_tbl WHERE f1 @> '((1,1),(2,2),(2,1))'::polygon
+ ORDER BY (poly_center(f1))[0];
+ f1
+---------------------
+ ((2,0),(2,4),(0,0))
+(1 row)
+
+SELECT * FROM circle_tbl WHERE f1 && circle(point(1,-2), 1)
+ ORDER BY area(f1);
+ f1
+---------------
+ <(1,2),3>
+ <(1,3),5>
+ <(1,2),100>
+ <(100,1),115>
+(4 rows)
+
+SELECT count(*) FROM gpolygon_tbl WHERE f1 && '(1000,1000,0,0)'::polygon;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM gcircle_tbl WHERE f1 && '<(500,500),500>'::circle;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ box '(0,0,100,100)';
+ count
+-------
+ 3
+(1 row)
+
+SELECT count(*) FROM point_tbl WHERE box '(0,0,100,100)' @> f1;
+ count
+-------
+ 3
+(1 row)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ polygon '(0,0),(0,100),(100,100),(50,50),(100,0),(0,0)';
+ count
+-------
+ 5
+(1 row)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ circle '<(50,50),50>';
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 << '(0.0, 0.0)';
+ count
+-------
+ 3
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
+ count
+-------
+ 3
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
+ count
+-------
+ 4
+(1 row)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 ~= '(-5, -12)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
+ f1
+-------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+ (-5,-12)
+ (5.1,34.5)
+ (1e+300,Infinity)
+ (NaN,NaN)
+
+(10 rows)
+
+SELECT * FROM point_tbl WHERE f1 IS NULL;
+ f1
+----
+
+(1 row)
+
+SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
+ f1
+-------------------
+ (1e-300,-1e-300)
+ (0,0)
+ (-3,4)
+ (-10,0)
+ (10,10)
+ (-5,-12)
+ (5.1,34.5)
+ (1e+300,Infinity)
+ (NaN,NaN)
+(9 rows)
+
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ f1
+------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+(5 rows)
+
+SELECT * FROM gpolygon_tbl ORDER BY f1 <-> '(0,0)'::point LIMIT 10;
+ f1
+-------------------------------------------------
+ ((240,359),(240,455),(337,455),(337,359))
+ ((662,163),(662,187),(759,187),(759,163))
+ ((1000,0),(0,1000))
+ ((0,1000),(1000,1000))
+ ((1346,344),(1346,403),(1444,403),(1444,344))
+ ((278,1409),(278,1457),(369,1457),(369,1409))
+ ((907,1156),(907,1201),(948,1201),(948,1156))
+ ((1517,971),(1517,1043),(1594,1043),(1594,971))
+ ((175,1820),(175,1850),(259,1850),(259,1820))
+ ((2424,81),(2424,160),(2424,160),(2424,81))
+(10 rows)
+
+SELECT circle_center(f1), round(radius(f1)) as radius FROM gcircle_tbl ORDER BY f1 <-> '(200,300)'::point LIMIT 10;
+ circle_center | radius
+----------------+--------
+ (288.5,407) | 68
+ (710.5,175) | 50
+ (323.5,1433) | 51
+ (927.5,1178.5) | 30
+ (1395,373.5) | 57
+ (1555.5,1007) | 53
+ (217,1835) | 45
+ (489,2421.5) | 22
+ (2424,120.5) | 40
+ (751.5,2655) | 20
+(10 rows)
+
+-- Now check the results from plain indexscan
+SET enable_seqscan = OFF;
+SET enable_indexscan = ON;
+SET enable_bitmapscan = OFF;
+EXPLAIN (COSTS OFF)
+SELECT * FROM fast_emp4000
+ WHERE home_base <@ '(200,200),(2000,1000)'::box
+ ORDER BY (home_base[0])[0];
+ QUERY PLAN
+-----------------------------------------------------------------
+ Sort
+ Sort Key: ((home_base[0])[0])
+ -> Index Only Scan using grect2ind on fast_emp4000
+ Index Cond: (home_base <@ '(2000,1000),(200,200)'::box)
+(4 rows)
+
+SELECT * FROM fast_emp4000
+ WHERE home_base <@ '(200,200),(2000,1000)'::box
+ ORDER BY (home_base[0])[0];
+ home_base
+-----------------------
+ (337,455),(240,359)
+ (1444,403),(1346,344)
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM fast_emp4000 WHERE home_base && '(1000,1000,0,0)'::box;
+ QUERY PLAN
+-------------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using grect2ind on fast_emp4000
+ Index Cond: (home_base && '(1000,1000),(0,0)'::box)
+(3 rows)
+
+SELECT count(*) FROM fast_emp4000 WHERE home_base && '(1000,1000,0,0)'::box;
+ count
+-------
+ 2
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM fast_emp4000 WHERE home_base IS NULL;
+ QUERY PLAN
+-------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using grect2ind on fast_emp4000
+ Index Cond: (home_base IS NULL)
+(3 rows)
+
+SELECT count(*) FROM fast_emp4000 WHERE home_base IS NULL;
+ count
+-------
+ 278
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM polygon_tbl WHERE f1 @> '((1,1),(2,2),(2,1))'::polygon
+ ORDER BY (poly_center(f1))[0];
+ QUERY PLAN
+------------------------------------------------------------
+ Sort
+ Sort Key: ((poly_center(f1))[0])
+ -> Index Scan using gpolygonind on polygon_tbl
+ Index Cond: (f1 @> '((1,1),(2,2),(2,1))'::polygon)
+(4 rows)
+
+SELECT * FROM polygon_tbl WHERE f1 @> '((1,1),(2,2),(2,1))'::polygon
+ ORDER BY (poly_center(f1))[0];
+ f1
+---------------------
+ ((2,0),(2,4),(0,0))
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM circle_tbl WHERE f1 && circle(point(1,-2), 1)
+ ORDER BY area(f1);
+ QUERY PLAN
+--------------------------------------------------
+ Sort
+ Sort Key: (area(f1))
+ -> Index Scan using gcircleind on circle_tbl
+ Index Cond: (f1 && '<(1,-2),1>'::circle)
+(4 rows)
+
+SELECT * FROM circle_tbl WHERE f1 && circle(point(1,-2), 1)
+ ORDER BY area(f1);
+ f1
+---------------
+ <(1,2),3>
+ <(1,3),5>
+ <(1,2),100>
+ <(100,1),115>
+(4 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM gpolygon_tbl WHERE f1 && '(1000,1000,0,0)'::polygon;
+ QUERY PLAN
+------------------------------------------------------------
+ Aggregate
+ -> Index Scan using ggpolygonind on gpolygon_tbl
+ Index Cond: (f1 && '((1000,1000),(0,0))'::polygon)
+(3 rows)
+
+SELECT count(*) FROM gpolygon_tbl WHERE f1 && '(1000,1000,0,0)'::polygon;
+ count
+-------
+ 2
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM gcircle_tbl WHERE f1 && '<(500,500),500>'::circle;
+ QUERY PLAN
+-------------------------------------------------------
+ Aggregate
+ -> Index Scan using ggcircleind on gcircle_tbl
+ Index Cond: (f1 && '<(500,500),500>'::circle)
+(3 rows)
+
+SELECT count(*) FROM gcircle_tbl WHERE f1 && '<(500,500),500>'::circle;
+ count
+-------
+ 2
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl WHERE f1 <@ box '(0,0,100,100)';
+ QUERY PLAN
+----------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '(100,100),(0,0)'::box)
+(3 rows)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ box '(0,0,100,100)';
+ count
+-------
+ 3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl WHERE box '(0,0,100,100)' @> f1;
+ QUERY PLAN
+----------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '(100,100),(0,0)'::box)
+(3 rows)
+
+SELECT count(*) FROM point_tbl WHERE box '(0,0,100,100)' @> f1;
+ count
+-------
+ 3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl WHERE f1 <@ polygon '(0,0),(0,100),(100,100),(50,50),(100,0),(0,0)';
+ QUERY PLAN
+----------------------------------------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '((0,0),(0,100),(100,100),(50,50),(100,0),(0,0))'::polygon)
+(3 rows)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ polygon '(0,0),(0,100),(100,100),(50,50),(100,0),(0,0)';
+ count
+-------
+ 4
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl WHERE f1 <@ circle '<(50,50),50>';
+ QUERY PLAN
+----------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '<(50,50),50>'::circle)
+(3 rows)
+
+SELECT count(*) FROM point_tbl WHERE f1 <@ circle '<(50,50),50>';
+ count
+-------
+ 1
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 << '(0.0, 0.0)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 << '(0,0)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 << '(0.0, 0.0)';
+ count
+-------
+ 3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 >> '(0,0)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 >> '(0.0, 0.0)';
+ count
+-------
+ 3
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 <^ '(0,0)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 <^ '(0.0, 0.0)';
+ count
+-------
+ 1
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 >^ '(0,0)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 >^ '(0.0, 0.0)';
+ count
+-------
+ 4
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM point_tbl p WHERE p.f1 ~= '(-5, -12)';
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Index Only Scan using gpointind on point_tbl p
+ Index Cond: (f1 ~= '(-5,-12)'::point)
+(3 rows)
+
+SELECT count(*) FROM point_tbl p WHERE p.f1 ~= '(-5, -12)';
+ count
+-------
+ 1
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
+ QUERY PLAN
+----------------------------------------------
+ Index Only Scan using gpointind on point_tbl
+ Order By: (f1 <-> '(0,1)'::point)
+(2 rows)
+
+SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
+ f1
+-------------------
+ (1e-300,-1e-300)
+ (0,0)
+ (-3,4)
+ (-10,0)
+ (10,10)
+ (-5,-12)
+ (5.1,34.5)
+ (1e+300,Infinity)
+ (NaN,NaN)
+
+(10 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl WHERE f1 IS NULL;
+ QUERY PLAN
+----------------------------------------------
+ Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 IS NULL)
+(2 rows)
+
+SELECT * FROM point_tbl WHERE f1 IS NULL;
+ f1
+----
+
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
+ QUERY PLAN
+----------------------------------------------
+ Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 IS NOT NULL)
+ Order By: (f1 <-> '(0,1)'::point)
+(3 rows)
+
+SELECT * FROM point_tbl WHERE f1 IS NOT NULL ORDER BY f1 <-> '0,1';
+ f1
+-------------------
+ (1e-300,-1e-300)
+ (0,0)
+ (-3,4)
+ (-10,0)
+ (10,10)
+ (-5,-12)
+ (5.1,34.5)
+ (1e+300,Infinity)
+ (NaN,NaN)
+(9 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ QUERY PLAN
+------------------------------------------------
+ Index Only Scan using gpointind on point_tbl
+ Index Cond: (f1 <@ '(10,10),(-10,-10)'::box)
+ Order By: (f1 <-> '(0,1)'::point)
+(3 rows)
+
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ f1
+------------------
+ (1e-300,-1e-300)
+ (0,0)
+ (-3,4)
+ (-10,0)
+ (10,10)
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM gpolygon_tbl ORDER BY f1 <-> '(0,0)'::point LIMIT 10;
+ QUERY PLAN
+-----------------------------------------------------
+ Limit
+ -> Index Scan using ggpolygonind on gpolygon_tbl
+ Order By: (f1 <-> '(0,0)'::point)
+(3 rows)
+
+SELECT * FROM gpolygon_tbl ORDER BY f1 <-> '(0,0)'::point LIMIT 10;
+ f1
+-------------------------------------------------
+ ((240,359),(240,455),(337,455),(337,359))
+ ((662,163),(662,187),(759,187),(759,163))
+ ((1000,0),(0,1000))
+ ((0,1000),(1000,1000))
+ ((1346,344),(1346,403),(1444,403),(1444,344))
+ ((278,1409),(278,1457),(369,1457),(369,1409))
+ ((907,1156),(907,1201),(948,1201),(948,1156))
+ ((1517,971),(1517,1043),(1594,1043),(1594,971))
+ ((175,1820),(175,1850),(259,1850),(259,1820))
+ ((2424,81),(2424,160),(2424,160),(2424,81))
+(10 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT circle_center(f1), round(radius(f1)) as radius FROM gcircle_tbl ORDER BY f1 <-> '(200,300)'::point LIMIT 10;
+ QUERY PLAN
+---------------------------------------------------
+ Limit
+ -> Index Scan using ggcircleind on gcircle_tbl
+ Order By: (f1 <-> '(200,300)'::point)
+(3 rows)
+
+SELECT circle_center(f1), round(radius(f1)) as radius FROM gcircle_tbl ORDER BY f1 <-> '(200,300)'::point LIMIT 10;
+ circle_center | radius
+----------------+--------
+ (288.5,407) | 68
+ (710.5,175) | 50
+ (323.5,1433) | 51
+ (927.5,1178.5) | 30
+ (1395,373.5) | 57
+ (1555.5,1007) | 53
+ (217,1835) | 45
+ (489,2421.5) | 22
+ (2424,120.5) | 40
+ (751.5,2655) | 20
+(10 rows)
+
+-- Now check the results from bitmap indexscan
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = ON;
+EXPLAIN (COSTS OFF)
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ QUERY PLAN
+------------------------------------------------------------
+ Sort
+ Sort Key: ((f1 <-> '(0,1)'::point))
+ -> Bitmap Heap Scan on point_tbl
+ Recheck Cond: (f1 <@ '(10,10),(-10,-10)'::box)
+ -> Bitmap Index Scan on gpointind
+ Index Cond: (f1 <@ '(10,10),(-10,-10)'::box)
+(6 rows)
+
+SELECT * FROM point_tbl WHERE f1 <@ '(-10,-10),(10,10)':: box ORDER BY f1 <-> '0,1';
+ f1
+------------------
+ (0,0)
+ (1e-300,-1e-300)
+ (-3,4)
+ (-10,0)
+ (10,10)
+(5 rows)
+
+RESET enable_seqscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+--
+-- GIN over int[] and text[]
+--
+-- Note: GIN currently supports only bitmap scans, not plain indexscans
+--
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = ON;
+CREATE INDEX intarrayidx ON array_index_op_test USING gin (i);
+explain (costs off)
+SELECT * FROM array_index_op_test WHERE i @> '{32}' ORDER BY seqno;
+ QUERY PLAN
+----------------------------------------------------
+ Sort
+ Sort Key: seqno
+ -> Bitmap Heap Scan on array_index_op_test
+ Recheck Cond: (i @> '{32}'::integer[])
+ -> Bitmap Index Scan on intarrayidx
+ Index Cond: (i @> '{32}'::integer[])
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE i @> '{32}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{32}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE i @> '{17}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+(8 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{17}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+(8 rows)
+
+SELECT * FROM array_index_op_test WHERE i @> '{32,17}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+(3 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{32,17}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(11 rows)
+
+SELECT * FROM array_index_op_test WHERE i <@ '{38,34,32,89}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------+----------------------------------------------------------------------------------------------------------------------------
+ 40 | {34} | {AAAAAAAAAAAAAA10611,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAA31334,AAAAA70466,AAAAAAAA81587,AAAAAAA74623}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 101 | {} | {}
+(4 rows)
+
+SELECT * FROM array_index_op_test WHERE i = '{47,77}' ORDER BY seqno;
+ seqno | i | t
+-------+---------+-----------------------------------------------------------------------------------------------------------------
+ 95 | {47,77} | {AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA74076,AAAAAAAAAA18107,AAAAA40681,AAAAAAAAAAAAAAA35875,AAAAA60038,AAAAAAA56483}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE i = '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE i @> '{}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 1 | {92,75,71,52,64,83} | {AAAAAAAA44066,AAAAAA1059,AAAAAAAAAAA176,AAAAAAA48038}
+ 2 | {3,6} | {AAAAAA98232,AAAAAAAA79710,AAAAAAAAAAAAAAAAA69675,AAAAAAAAAAAAAAAA55798,AAAAAAAAA12793}
+ 3 | {37,64,95,43,3,41,13,30,11,43} | {AAAAAAAAAA48845,AAAAA75968,AAAAA95309,AAA54451,AAAAAAAAAA22292,AAAAAAA99836,A96617,AA17009,AAAAAAAAAAAAAA95246}
+ 4 | {71,39,99,55,33,75,45} | {AAAAAAAAA53663,AAAAAAAAAAAAAAA67062,AAAAAAAAAA64777,AAA99043,AAAAAAAAAAAAAAAAAAA91804,39557}
+ 5 | {50,42,77,50,4} | {AAAAAAAAAAAAAAAAA26540,AAAAAAA79710,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA176,AAAAA95309,AAAAAAAAAAA46154,AAAAAA66777,AAAAAAAAA27249,AAAAAAAAAA64777,AAAAAAAAAAAAAAAAAAA70104}
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 7 | {12,51,88,64,8} | {AAAAAAAAAAAAAAAAAA12591,AAAAAAAAAAAAAAAAA50407,AAAAAAAAAAAA67946}
+ 8 | {60,84} | {AAAAAAA81898,AAAAAA1059,AAAAAAAAAAAA81511,AAAAA961,AAAAAAAAAAAAAAAA31334,AAAAA64741,AA6416,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAAAA50407}
+ 9 | {56,52,35,27,80,44,81,22} | {AAAAAAAAAAAAAAA73034,AAAAAAAAAAAAA7929,AAAAAAA66161,AA88409,39557,A27153,AAAAAAAA9523,AAAAAAAAAAA99000}
+ 10 | {71,5,45} | {AAAAAAAAAAA21658,AAAAAAAAAAAA21089,AAA54451,AAAAAAAAAAAAAAAAAA54141,AAAAAAAAAAAAAA28620,AAAAAAAAAAA21658,AAAAAAAAAAA74076,AAAAAAAAA27249}
+ 11 | {41,86,74,48,22,74,47,50} | {AAAAAAAA9523,AAAAAAAAAAAA37562,AAAAAAAAAAAAAAAA14047,AAAAAAAAAAA46154,AAAA41702,AAAAAAAAAAAAAAAAA764,AAAAA62737,39557}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 13 | {3,52,34,23} | {AAAAAA98232,AAAA49534,AAAAAAAAAAA21658}
+ 14 | {78,57,19} | {AAAA8857,AAAAAAAAAAAAAAA73034,AAAAAAAA81587,AAAAAAAAAAAAAAA68526,AAAAA75968,AAAAAAAAAAAAAA65909,AAAAAAAAA10012,AAAAAAAAAAAAAA65909}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 16 | {14,63,85,11} | {AAAAAA66777}
+ 17 | {7,10,81,85} | {AAAAAA43678,AAAAAAA12144,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAAAAA15356}
+ 18 | {1} | {AAAAAAAAAAA33576,AAAAA95309,64261,AAA59323,AAAAAAAAAAAAAA95246,55847,AAAAAAAAAAAA67946,AAAAAAAAAAAAAAAAAA64374}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 20 | {72,89,70,51,54,37,8,49,79} | {AAAAAA58494}
+ 21 | {2,8,65,10,5,79,43} | {AAAAAAAAAAAAAAAAA88852,AAAAAAAAAAAAAAAAAAA91804,AAAAA64669,AAAAAAAAAAAAAAAA1443,AAAAAAAAAAAAAAAA23657,AAAAA12179,AAAAAAAAAAAAAAAAA88852,AAAAAAAAAAAAAAAA31334,AAAAAAAAAAAAAAAA41303,AAAAAAAAAAAAAAAAAAA85420}
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 23 | {40,90,5,38,72,40,30,10,43,55} | {A6053,AAAAAAAAAAA6119,AA44673,AAAAAAAAAAAAAAAAA764,AA17009,AAAAA17383,AAAAA70514,AAAAA33250,AAAAA95309,AAAAAAAAAAAA37562}
+ 24 | {94,61,99,35,48} | {AAAAAAAAAAA50956,AAAAAAAAAAA15165,AAAA85070,AAAAAAAAAAAAAAA36627,AAAAA961,AAAAAAAAAA55219}
+ 25 | {31,1,10,11,27,79,38} | {AAAAAAAAAAAAAAAAAA59334,45449}
+ 26 | {71,10,9,69,75} | {47735,AAAAAAA21462,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAAAAA91804,AAAAAAAAA72121,AAAAAAAAAAAAAAAAAAA1205,AAAAA41597,AAAA8857,AAAAAAAAAAAAAAAAAAA15356,AA17009}
+ 27 | {94} | {AA6416,A6053,AAAAAAA21462,AAAAAAA57334,AAAAAAAAAAAAAAAAAA12591,AA88409,AAAAAAAAAAAAA70254}
+ 28 | {14,33,6,34,14} | {AAAAAAAAAAAAAAA13198,AAAAAAAA69452,AAAAAAAAAAA82945,AAAAAAA12144,AAAAAAAAA72121,AAAAAAAAAA18601}
+ 29 | {39,21} | {AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAAAAA38885,AAAA85070,AAAAAAAAAAAAAAAAAAA70104,AAAAA66674,AAAAAAAAAAAAA62007,AAAAAAAA69452,AAAAAAA1242,AAAAAAAAAAAAAAAA1729,AAAA35194}
+ 30 | {26,81,47,91,34} | {AAAAAAAAAAAAAAAAAAA70104,AAAAAAA80240}
+ 31 | {80,24,18,21,54} | {AAAAAAAAAAAAAAA13198,AAAAAAAAAAAAAAAAAAA70415,A27153,AAAAAAAAA53663,AAAAAAAAAAAAAAAAA50407,A68938}
+ 32 | {58,79,82,80,67,75,98,10,41} | {AAAAAAAAAAAAAAAAAA61286,AAA54451,AAAAAAAAAAAAAAAAAAA87527,A96617,51533}
+ 33 | {74,73} | {A85417,AAAAAAA56483,AAAAA17383,AAAAAAAAAAAAA62159,AAAAAAAAAAAA52814,AAAAAAAAAAAAA85723,AAAAAAAAAAAAAAAAAA55796}
+ 34 | {70,45} | {AAAAAAAAAAAAAAAAAA71621,AAAAAAAAAAAAAA28620,AAAAAAAAAA55219,AAAAAAAA23648,AAAAAAAAAA22292,AAAAAAA1242}
+ 35 | {23,40} | {AAAAAAAAAAAA52814,AAAA48949,AAAAAAAAA34727,AAAA8857,AAAAAAAAAAAAAAAAAAA62179,AAAAAAAAAAAAAAA68526,AAAAAAA99836,AAAAAAAA50094,AAAA91194,AAAAAAAAAAAAA73084}
+ 36 | {79,82,14,52,30,5,79} | {AAAAAAAAA53663,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA89194,AA88409,AAAAAAAAAAAAAAA81326,AAAAAAAAAAAAAAAAA63050,AAAAAAAAAAAAAAAA33598}
+ 37 | {53,11,81,39,3,78,58,64,74} | {AAAAAAAAAAAAAAAAAAA17075,AAAAAAA66161,AAAAAAAA23648,AAAAAAAAAAAAAA10611}
+ 38 | {59,5,4,95,28} | {AAAAAAAAAAA82945,A96617,47735,AAAAA12179,AAAAA64669,AAAAAA99807,AA74433,AAAAAAAAAAAAAAAAA59387}
+ 39 | {82,43,99,16,74} | {AAAAAAAAAAAAAAA67062,AAAAAAA57334,AAAAAAAAAAAAAA65909,A27153,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAA64777,AAAAAAAAAAAA81511,AAAAAAAAAAAAAA65909,AAAAAAAAAAAAAA28620}
+ 40 | {34} | {AAAAAAAAAAAAAA10611,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAA31334,AAAAA70466,AAAAAAAA81587,AAAAAAA74623}
+ 41 | {19,26,63,12,93,73,27,94} | {AAAAAAA79710,AAAAAAAAAA55219,AAAA41702,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA71621,AAAAAAAAAAAAAAAAA63050,AAAAAAA99836,AAAAAAAAAAAAAA8666}
+ 42 | {15,76,82,75,8,91} | {AAAAAAAAAAA176,AAAAAA38063,45449,AAAAAA54032,AAAAAAA81898,AA6416,AAAAAAAAAAAAAAAAAAA62179,45449,AAAAA60038,AAAAAAAA81587}
+ 43 | {39,87,91,97,79,28} | {AAAAAAAAAAA74076,A96617,AAAAAAAAAAAAAAAAAAA89194,AAAAAAAAAAAAAAAAAA55796,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAA67946}
+ 44 | {40,58,68,29,54} | {AAAAAAA81898,AAAAAA66777,AAAAAA98232}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 46 | {53,24} | {AAAAAAAAAAA53908,AAAAAA54032,AAAAA17383,AAAA48949,AAAAAAAAAA18601,AAAAA64669,45449,AAAAAAAAAAA98051,AAAAAAAAAAAAAAAAAA71621}
+ 47 | {98,23,64,12,75,61} | {AAA59323,AAAAA95309,AAAAAAAAAAAAAAAA31334,AAAAAAAAA27249,AAAAA17383,AAAAAAAAAAAA37562,AAAAAA1059,A84822,55847,AAAAA70466}
+ 48 | {76,14} | {AAAAAAAAAAAAA59671,AAAAAAAAAAAAAAAAAAA91804,AAAAAA66777,AAAAAAAAAAAAAAAAAAA89194,AAAAAAAAAAAAAAA36627,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAA73084,AAAAAAA79710,AAAAAAAAAAAAAAA40402,AAAAAAAAAAAAAAAAAAA65037}
+ 49 | {56,5,54,37,49} | {AA21643,AAAAAAAAAAA92631,AAAAAAAA81587}
+ 50 | {20,12,37,64,93} | {AAAAAAAAAA5483,AAAAAAAAAAAAAAAAAAA1205,AA6416,AAAAAAAAAAAAAAAAA63050,AAAAAAAAAAAAAAAAAA47955}
+ 51 | {47} | {AAAAAAAAAAAAAA96505,AAAAAAAAAAAAAAAAAA36842,AAAAA95309,AAAAAAAA81587,AA6416,AAAA91194,AAAAAA58494,AAAAAA1059,AAAAAAAA69452}
+ 52 | {89,0} | {AAAAAAAAAAAAAAAAAA47955,AAAAAAA48038,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAAAAA73084,AAAAA70466,AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA46154,AA66862}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 54 | {70,47} | {AAAAAAAAAAAAAAAAAA54141,AAAAA40681,AAAAAAA48038,AAAAAAAAAAAAAAAA29150,AAAAA41597,AAAAAAAAAAAAAAAAAA59334,AA15322}
+ 55 | {47,79,47,64,72,25,71,24,93} | {AAAAAAAAAAAAAAAAAA55796,AAAAA62737}
+ 56 | {33,7,60,54,93,90,77,85,39} | {AAAAAAAAAAAAAAAAAA32918,AA42406}
+ 57 | {23,45,10,42,36,21,9,96} | {AAAAAAAAAAAAAAAAAAA70415}
+ 58 | {92} | {AAAAAAAAAAAAAAAA98414,AAAAAAAA23648,AAAAAAAAAAAAAAAAAA55796,AA25381,AAAAAAAAAAA6119}
+ 59 | {9,69,46,77} | {39557,AAAAAAA89932,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAAAAAAAAA26540,AAA20874,AA6416,AAAAAAAAAAAAAAAAAA47955}
+ 60 | {62,2,59,38,89} | {AAAAAAA89932,AAAAAAAAAAAAAAAAAAA15356,AA99927,AA17009,AAAAAAAAAAAAAAA35875}
+ 61 | {72,2,44,95,54,54,13} | {AAAAAAAAAAAAAAAAAAA91804}
+ 62 | {83,72,29,73} | {AAAAAAAAAAAAA15097,AAAA8857,AAAAAAAAAAAA35809,AAAAAAAAAAAA52814,AAAAAAAAAAAAAAAAAAA38885,AAAAAAAAAAAAAAAAAA24183,AAAAAA43678,A96617}
+ 63 | {11,4,61,87} | {AAAAAAAAA27249,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAA13198,AAA20874,39557,51533,AAAAAAAAAAA53908,AAAAAAAAAAAAAA96505,AAAAAAAA78938}
+ 64 | {26,19,34,24,81,78} | {A96617,AAAAAAAAAAAAAAAAAAA70104,A68938,AAAAAAAAAAA53908,AAAAAAAAAAAAAAA453,AA17009,AAAAAAA80240}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 66 | {31,23,70,52,4,33,48,25} | {AAAAAAAAAAAAAAAAA69675,AAAAAAAA50094,AAAAAAAAAAA92631,AAAA35194,39557,AAAAAAA99836}
+ 67 | {31,94,7,10} | {AAAAAA38063,A96617,AAAA35194,AAAAAAAAAAAA67946}
+ 68 | {90,43,38} | {AA75092,AAAAAAAAAAAAAAAAA69675,AAAAAAAAAAA92631,AAAAAAAAA10012,AAAAAAAAAAAAA7929,AA21643}
+ 69 | {67,35,99,85,72,86,44} | {AAAAAAAAAAAAAAAAAAA1205,AAAAAAAA50094,AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAAAAAAA47955}
+ 70 | {56,70,83} | {AAAA41702,AAAAAAAAAAA82945,AA21643,AAAAAAAAAAA99000,A27153,AA25381,AAAAAAAAAAAAAA96505,AAAAAAA1242}
+ 71 | {74,26} | {AAAAAAAAAAA50956,AA74433,AAAAAAA21462,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAA36627,AAAAAAAAAAAAA70254,AAAAAAAAAA43419,39557}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 73 | {88,25,96,78,65,15,29,19} | {AAA54451,AAAAAAAAA27249,AAAAAAA9228,AAAAAAAAAAAAAAA67062,AAAAAAAAAAAAAAAAAAA70415,AAAAA17383,AAAAAAAAAAAAAAAA33598}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 75 | {12,96,83,24,71,89,55} | {AAAA48949,AAAAAAAA29716,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAAA67946,AAAAAAAAAAAAAAAA29150,AAA28075,AAAAAAAAAAAAAAAAA43052}
+ 76 | {92,55,10,7} | {AAAAAAAAAAAAAAA67062}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 78 | {55,89,44,84,34} | {AAAAAAAAAAA6119,AAAAAAAAAAAAAA8666,AA99927,AA42406,AAAAAAA81898,AAAAAAA9228,AAAAAAAAAAA92631,AA21643,AAAAAAAAAAAAAA28620}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 80 | {74,89,44,80,0} | {AAAA35194,AAAAAAAA79710,AAA20874,AAAAAAAAAAAAAAAAAAA70104,AAAAAAAAAAAAA73084,AAAAAAA57334,AAAAAAA9228,AAAAAAAAAAAAA62007}
+ 81 | {63,77,54,48,61,53,97} | {AAAAAAAAAAAAAAA81326,AAAAAAAAAA22292,AA25381,AAAAAAAAAAA74076,AAAAAAA81898,AAAAAAAAA72121}
+ 82 | {34,60,4,79,78,16,86,89,42,50} | {AAAAA40681,AAAAAAAAAAAAAAAAAA12591,AAAAAAA80240,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA70104}
+ 83 | {14,10} | {AAAAAAAAAA22292,AAAAAAAAAAAAA70254,AAAAAAAAAAA6119}
+ 84 | {11,83,35,13,96,94} | {AAAAA95309,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAAAAA24183}
+ 85 | {39,60} | {AAAAAAAAAAAAAAAA55798,AAAAAAAAAA22292,AAAAAAA66161,AAAAAAA21462,AAAAAAAAAAAAAAAAAA12591,55847,AAAAAA98232,AAAAAAAAAAA46154}
+ 86 | {33,81,72,74,45,36,82} | {AAAAAAAA81587,AAAAAAAAAAAAAA96505,45449,AAAA80176}
+ 87 | {57,27,50,12,97,68} | {AAAAAAAAAAAAAAAAA26540,AAAAAAAAA10012,AAAAAAAAAAAA35809,AAAAAAAAAAAAAAAA29150,AAAAAAAAAAA82945,AAAAAA66777,31228,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAA96505}
+ 88 | {41,90,77,24,6,24} | {AAAA35194,AAAA35194,AAAAAAA80240,AAAAAAAAAAA46154,AAAAAA58494,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA59334,AAAAAAAAAAAAAAAAAAA91804,AA74433}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 90 | {88,75} | {AAAAA60038,AAAAAAAA23648,AAAAAAAAAAA99000,AAAA41702,AAAAAAAAAAAAA22860,AAAAAAAAAAAAAAA68526}
+ 91 | {78} | {AAAAAAAAAAAAA62007,AAA99043}
+ 92 | {85,63,49,45} | {AAAAAAA89932,AAAAAAAAAAAAA22860,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAAA21089}
+ 93 | {11} | {AAAAAAAAAAA176,AAAAAAAAAAAAAA8666,AAAAAAAAAAAAAAA453,AAAAAAAAAAAAA85723,A68938,AAAAAAAAAAAAA9821,AAAAAAA48038,AAAAAAAAAAAAAAAAA59387,AA99927,AAAAA17383}
+ 94 | {98,9,85,62,88,91,60,61,38,86} | {AAAAAAAA81587,AAAAA17383,AAAAAAAA81587}
+ 95 | {47,77} | {AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA74076,AAAAAAAAAA18107,AAAAA40681,AAAAAAAAAAAAAAA35875,AAAAA60038,AAAAAAA56483}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+ 97 | {54,2,86,65} | {47735,AAAAAAA99836,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAA29150,AAAAAAA80240,AAAAAAAAAAAAAAAA98414,AAAAAAA56483,AAAAAAAAAAAAAAAA29150,AAAAAAA39692,AA21643}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 99 | {37,86} | {AAAAAAAAAAAAAAAAAA32918,AAAAA70514,AAAAAAAAA10012,AAAAAAAAAAAAAAAAA59387,AAAAAAAAAA64777,AAAAAAAAAAAAAAAAAAA15356}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+ 101 | {} | {}
+ 102 | {NULL} | {NULL}
+(102 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{}' ORDER BY seqno;
+ seqno | i | t
+-------+---+---
+(0 rows)
+
+SELECT * FROM array_index_op_test WHERE i <@ '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+SELECT * FROM array_op_test WHERE i = '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+--------+--------
+ 102 | {NULL} | {NULL}
+(1 row)
+
+SELECT * FROM array_op_test WHERE i @> '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+---+---
+(0 rows)
+
+SELECT * FROM array_op_test WHERE i && '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+---+---
+(0 rows)
+
+SELECT * FROM array_op_test WHERE i <@ '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+CREATE INDEX textarrayidx ON array_index_op_test USING gin (t);
+explain (costs off)
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAAA72908}' ORDER BY seqno;
+ QUERY PLAN
+------------------------------------------------------------
+ Sort
+ Sort Key: seqno
+ -> Bitmap Heap Scan on array_index_op_test
+ Recheck Cond: (t @> '{AAAAAAAA72908}'::text[])
+ -> Bitmap Index Scan on textarrayidx
+ Index Cond: (t @> '{AAAAAAAA72908}'::text[])
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAAA72908}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+(4 rows)
+
+SELECT * FROM array_index_op_test WHERE t && '{AAAAAAAA72908}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+(4 rows)
+
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAAAAA646}' ORDER BY seqno;
+ seqno | i | t
+-------+------------------+--------------------------------------------------------------------
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+(3 rows)
+
+SELECT * FROM array_index_op_test WHERE t && '{AAAAAAAAAA646}' ORDER BY seqno;
+ seqno | i | t
+-------+------------------+--------------------------------------------------------------------
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+(3 rows)
+
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAAA72908,AAAAAAAAAA646}' ORDER BY seqno;
+ seqno | i | t
+-------+------+--------------------------------------------------------------------
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE t && '{AAAAAAAA72908,AAAAAAAAAA646}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE t <@ '{AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}' ORDER BY seqno;
+ seqno | i | t
+-------+--------------------+-----------------------------------------------------------------------------------------------------------
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 101 | {} | {}
+(3 rows)
+
+SELECT * FROM array_index_op_test WHERE t = '{AAAAAAAAAA646,A87088}' ORDER BY seqno;
+ seqno | i | t
+-------+------------+------------------------
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE t = '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE t @> '{}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 1 | {92,75,71,52,64,83} | {AAAAAAAA44066,AAAAAA1059,AAAAAAAAAAA176,AAAAAAA48038}
+ 2 | {3,6} | {AAAAAA98232,AAAAAAAA79710,AAAAAAAAAAAAAAAAA69675,AAAAAAAAAAAAAAAA55798,AAAAAAAAA12793}
+ 3 | {37,64,95,43,3,41,13,30,11,43} | {AAAAAAAAAA48845,AAAAA75968,AAAAA95309,AAA54451,AAAAAAAAAA22292,AAAAAAA99836,A96617,AA17009,AAAAAAAAAAAAAA95246}
+ 4 | {71,39,99,55,33,75,45} | {AAAAAAAAA53663,AAAAAAAAAAAAAAA67062,AAAAAAAAAA64777,AAA99043,AAAAAAAAAAAAAAAAAAA91804,39557}
+ 5 | {50,42,77,50,4} | {AAAAAAAAAAAAAAAAA26540,AAAAAAA79710,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA176,AAAAA95309,AAAAAAAAAAA46154,AAAAAA66777,AAAAAAAAA27249,AAAAAAAAAA64777,AAAAAAAAAAAAAAAAAAA70104}
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 7 | {12,51,88,64,8} | {AAAAAAAAAAAAAAAAAA12591,AAAAAAAAAAAAAAAAA50407,AAAAAAAAAAAA67946}
+ 8 | {60,84} | {AAAAAAA81898,AAAAAA1059,AAAAAAAAAAAA81511,AAAAA961,AAAAAAAAAAAAAAAA31334,AAAAA64741,AA6416,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAAAA50407}
+ 9 | {56,52,35,27,80,44,81,22} | {AAAAAAAAAAAAAAA73034,AAAAAAAAAAAAA7929,AAAAAAA66161,AA88409,39557,A27153,AAAAAAAA9523,AAAAAAAAAAA99000}
+ 10 | {71,5,45} | {AAAAAAAAAAA21658,AAAAAAAAAAAA21089,AAA54451,AAAAAAAAAAAAAAAAAA54141,AAAAAAAAAAAAAA28620,AAAAAAAAAAA21658,AAAAAAAAAAA74076,AAAAAAAAA27249}
+ 11 | {41,86,74,48,22,74,47,50} | {AAAAAAAA9523,AAAAAAAAAAAA37562,AAAAAAAAAAAAAAAA14047,AAAAAAAAAAA46154,AAAA41702,AAAAAAAAAAAAAAAAA764,AAAAA62737,39557}
+ 12 | {17,99,18,52,91,72,0,43,96,23} | {AAAAA33250,AAAAAAAAAAAAAAAAAAA85420,AAAAAAAAAAA33576}
+ 13 | {3,52,34,23} | {AAAAAA98232,AAAA49534,AAAAAAAAAAA21658}
+ 14 | {78,57,19} | {AAAA8857,AAAAAAAAAAAAAAA73034,AAAAAAAA81587,AAAAAAAAAAAAAAA68526,AAAAA75968,AAAAAAAAAAAAAA65909,AAAAAAAAA10012,AAAAAAAAAAAAAA65909}
+ 15 | {17,14,16,63,67} | {AA6416,AAAAAAAAAA646,AAAAA95309}
+ 16 | {14,63,85,11} | {AAAAAA66777}
+ 17 | {7,10,81,85} | {AAAAAA43678,AAAAAAA12144,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAAAAA15356}
+ 18 | {1} | {AAAAAAAAAAA33576,AAAAA95309,64261,AAA59323,AAAAAAAAAAAAAA95246,55847,AAAAAAAAAAAA67946,AAAAAAAAAAAAAAAAAA64374}
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 20 | {72,89,70,51,54,37,8,49,79} | {AAAAAA58494}
+ 21 | {2,8,65,10,5,79,43} | {AAAAAAAAAAAAAAAAA88852,AAAAAAAAAAAAAAAAAAA91804,AAAAA64669,AAAAAAAAAAAAAAAA1443,AAAAAAAAAAAAAAAA23657,AAAAA12179,AAAAAAAAAAAAAAAAA88852,AAAAAAAAAAAAAAAA31334,AAAAAAAAAAAAAAAA41303,AAAAAAAAAAAAAAAAAAA85420}
+ 22 | {11,6,56,62,53,30} | {AAAAAAAA72908}
+ 23 | {40,90,5,38,72,40,30,10,43,55} | {A6053,AAAAAAAAAAA6119,AA44673,AAAAAAAAAAAAAAAAA764,AA17009,AAAAA17383,AAAAA70514,AAAAA33250,AAAAA95309,AAAAAAAAAAAA37562}
+ 24 | {94,61,99,35,48} | {AAAAAAAAAAA50956,AAAAAAAAAAA15165,AAAA85070,AAAAAAAAAAAAAAA36627,AAAAA961,AAAAAAAAAA55219}
+ 25 | {31,1,10,11,27,79,38} | {AAAAAAAAAAAAAAAAAA59334,45449}
+ 26 | {71,10,9,69,75} | {47735,AAAAAAA21462,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAAAAA91804,AAAAAAAAA72121,AAAAAAAAAAAAAAAAAAA1205,AAAAA41597,AAAA8857,AAAAAAAAAAAAAAAAAAA15356,AA17009}
+ 27 | {94} | {AA6416,A6053,AAAAAAA21462,AAAAAAA57334,AAAAAAAAAAAAAAAAAA12591,AA88409,AAAAAAAAAAAAA70254}
+ 28 | {14,33,6,34,14} | {AAAAAAAAAAAAAAA13198,AAAAAAAA69452,AAAAAAAAAAA82945,AAAAAAA12144,AAAAAAAAA72121,AAAAAAAAAA18601}
+ 29 | {39,21} | {AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAAAAA38885,AAAA85070,AAAAAAAAAAAAAAAAAAA70104,AAAAA66674,AAAAAAAAAAAAA62007,AAAAAAAA69452,AAAAAAA1242,AAAAAAAAAAAAAAAA1729,AAAA35194}
+ 30 | {26,81,47,91,34} | {AAAAAAAAAAAAAAAAAAA70104,AAAAAAA80240}
+ 31 | {80,24,18,21,54} | {AAAAAAAAAAAAAAA13198,AAAAAAAAAAAAAAAAAAA70415,A27153,AAAAAAAAA53663,AAAAAAAAAAAAAAAAA50407,A68938}
+ 32 | {58,79,82,80,67,75,98,10,41} | {AAAAAAAAAAAAAAAAAA61286,AAA54451,AAAAAAAAAAAAAAAAAAA87527,A96617,51533}
+ 33 | {74,73} | {A85417,AAAAAAA56483,AAAAA17383,AAAAAAAAAAAAA62159,AAAAAAAAAAAA52814,AAAAAAAAAAAAA85723,AAAAAAAAAAAAAAAAAA55796}
+ 34 | {70,45} | {AAAAAAAAAAAAAAAAAA71621,AAAAAAAAAAAAAA28620,AAAAAAAAAA55219,AAAAAAAA23648,AAAAAAAAAA22292,AAAAAAA1242}
+ 35 | {23,40} | {AAAAAAAAAAAA52814,AAAA48949,AAAAAAAAA34727,AAAA8857,AAAAAAAAAAAAAAAAAAA62179,AAAAAAAAAAAAAAA68526,AAAAAAA99836,AAAAAAAA50094,AAAA91194,AAAAAAAAAAAAA73084}
+ 36 | {79,82,14,52,30,5,79} | {AAAAAAAAA53663,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA89194,AA88409,AAAAAAAAAAAAAAA81326,AAAAAAAAAAAAAAAAA63050,AAAAAAAAAAAAAAAA33598}
+ 37 | {53,11,81,39,3,78,58,64,74} | {AAAAAAAAAAAAAAAAAAA17075,AAAAAAA66161,AAAAAAAA23648,AAAAAAAAAAAAAA10611}
+ 38 | {59,5,4,95,28} | {AAAAAAAAAAA82945,A96617,47735,AAAAA12179,AAAAA64669,AAAAAA99807,AA74433,AAAAAAAAAAAAAAAAA59387}
+ 39 | {82,43,99,16,74} | {AAAAAAAAAAAAAAA67062,AAAAAAA57334,AAAAAAAAAAAAAA65909,A27153,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAA64777,AAAAAAAAAAAA81511,AAAAAAAAAAAAAA65909,AAAAAAAAAAAAAA28620}
+ 40 | {34} | {AAAAAAAAAAAAAA10611,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAA50956,AAAAAAAAAAAAAAAA31334,AAAAA70466,AAAAAAAA81587,AAAAAAA74623}
+ 41 | {19,26,63,12,93,73,27,94} | {AAAAAAA79710,AAAAAAAAAA55219,AAAA41702,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA71621,AAAAAAAAAAAAAAAAA63050,AAAAAAA99836,AAAAAAAAAAAAAA8666}
+ 42 | {15,76,82,75,8,91} | {AAAAAAAAAAA176,AAAAAA38063,45449,AAAAAA54032,AAAAAAA81898,AA6416,AAAAAAAAAAAAAAAAAAA62179,45449,AAAAA60038,AAAAAAAA81587}
+ 43 | {39,87,91,97,79,28} | {AAAAAAAAAAA74076,A96617,AAAAAAAAAAAAAAAAAAA89194,AAAAAAAAAAAAAAAAAA55796,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAA67946}
+ 44 | {40,58,68,29,54} | {AAAAAAA81898,AAAAAA66777,AAAAAA98232}
+ 45 | {99,45} | {AAAAAAAA72908,AAAAAAAAAAAAAAAAAAA17075,AA88409,AAAAAAAAAAAAAAAAAA36842,AAAAAAA48038,AAAAAAAAAAAAAA10611}
+ 46 | {53,24} | {AAAAAAAAAAA53908,AAAAAA54032,AAAAA17383,AAAA48949,AAAAAAAAAA18601,AAAAA64669,45449,AAAAAAAAAAA98051,AAAAAAAAAAAAAAAAAA71621}
+ 47 | {98,23,64,12,75,61} | {AAA59323,AAAAA95309,AAAAAAAAAAAAAAAA31334,AAAAAAAAA27249,AAAAA17383,AAAAAAAAAAAA37562,AAAAAA1059,A84822,55847,AAAAA70466}
+ 48 | {76,14} | {AAAAAAAAAAAAA59671,AAAAAAAAAAAAAAAAAAA91804,AAAAAA66777,AAAAAAAAAAAAAAAAAAA89194,AAAAAAAAAAAAAAA36627,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAA73084,AAAAAAA79710,AAAAAAAAAAAAAAA40402,AAAAAAAAAAAAAAAAAAA65037}
+ 49 | {56,5,54,37,49} | {AA21643,AAAAAAAAAAA92631,AAAAAAAA81587}
+ 50 | {20,12,37,64,93} | {AAAAAAAAAA5483,AAAAAAAAAAAAAAAAAAA1205,AA6416,AAAAAAAAAAAAAAAAA63050,AAAAAAAAAAAAAAAAAA47955}
+ 51 | {47} | {AAAAAAAAAAAAAA96505,AAAAAAAAAAAAAAAAAA36842,AAAAA95309,AAAAAAAA81587,AA6416,AAAA91194,AAAAAA58494,AAAAAA1059,AAAAAAAA69452}
+ 52 | {89,0} | {AAAAAAAAAAAAAAAAAA47955,AAAAAAA48038,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAAAAA73084,AAAAA70466,AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA46154,AA66862}
+ 53 | {38,17} | {AAAAAAAAAAA21658}
+ 54 | {70,47} | {AAAAAAAAAAAAAAAAAA54141,AAAAA40681,AAAAAAA48038,AAAAAAAAAAAAAAAA29150,AAAAA41597,AAAAAAAAAAAAAAAAAA59334,AA15322}
+ 55 | {47,79,47,64,72,25,71,24,93} | {AAAAAAAAAAAAAAAAAA55796,AAAAA62737}
+ 56 | {33,7,60,54,93,90,77,85,39} | {AAAAAAAAAAAAAAAAAA32918,AA42406}
+ 57 | {23,45,10,42,36,21,9,96} | {AAAAAAAAAAAAAAAAAAA70415}
+ 58 | {92} | {AAAAAAAAAAAAAAAA98414,AAAAAAAA23648,AAAAAAAAAAAAAAAAAA55796,AA25381,AAAAAAAAAAA6119}
+ 59 | {9,69,46,77} | {39557,AAAAAAA89932,AAAAAAAAAAAAAAAAA43052,AAAAAAAAAAAAAAAAA26540,AAA20874,AA6416,AAAAAAAAAAAAAAAAAA47955}
+ 60 | {62,2,59,38,89} | {AAAAAAA89932,AAAAAAAAAAAAAAAAAAA15356,AA99927,AA17009,AAAAAAAAAAAAAAA35875}
+ 61 | {72,2,44,95,54,54,13} | {AAAAAAAAAAAAAAAAAAA91804}
+ 62 | {83,72,29,73} | {AAAAAAAAAAAAA15097,AAAA8857,AAAAAAAAAAAA35809,AAAAAAAAAAAA52814,AAAAAAAAAAAAAAAAAAA38885,AAAAAAAAAAAAAAAAAA24183,AAAAAA43678,A96617}
+ 63 | {11,4,61,87} | {AAAAAAAAA27249,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAA13198,AAA20874,39557,51533,AAAAAAAAAAA53908,AAAAAAAAAAAAAA96505,AAAAAAAA78938}
+ 64 | {26,19,34,24,81,78} | {A96617,AAAAAAAAAAAAAAAAAAA70104,A68938,AAAAAAAAAAA53908,AAAAAAAAAAAAAAA453,AA17009,AAAAAAA80240}
+ 65 | {61,5,76,59,17} | {AAAAAA99807,AAAAA64741,AAAAAAAAAAA53908,AA21643,AAAAAAAAA10012}
+ 66 | {31,23,70,52,4,33,48,25} | {AAAAAAAAAAAAAAAAA69675,AAAAAAAA50094,AAAAAAAAAAA92631,AAAA35194,39557,AAAAAAA99836}
+ 67 | {31,94,7,10} | {AAAAAA38063,A96617,AAAA35194,AAAAAAAAAAAA67946}
+ 68 | {90,43,38} | {AA75092,AAAAAAAAAAAAAAAAA69675,AAAAAAAAAAA92631,AAAAAAAAA10012,AAAAAAAAAAAAA7929,AA21643}
+ 69 | {67,35,99,85,72,86,44} | {AAAAAAAAAAAAAAAAAAA1205,AAAAAAAA50094,AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAAAAAAA47955}
+ 70 | {56,70,83} | {AAAA41702,AAAAAAAAAAA82945,AA21643,AAAAAAAAAAA99000,A27153,AA25381,AAAAAAAAAAAAAA96505,AAAAAAA1242}
+ 71 | {74,26} | {AAAAAAAAAAA50956,AA74433,AAAAAAA21462,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAA36627,AAAAAAAAAAAAA70254,AAAAAAAAAA43419,39557}
+ 72 | {22,1,16,78,20,91,83} | {47735,AAAAAAA56483,AAAAAAAAAAAAA93788,AA42406,AAAAAAAAAAAAA73084,AAAAAAAA72908,AAAAAAAAAAAAAAAAAA61286,AAAAA66674,AAAAAAAAAAAAAAAAA50407}
+ 73 | {88,25,96,78,65,15,29,19} | {AAA54451,AAAAAAAAA27249,AAAAAAA9228,AAAAAAAAAAAAAAA67062,AAAAAAAAAAAAAAAAAAA70415,AAAAA17383,AAAAAAAAAAAAAAAA33598}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 75 | {12,96,83,24,71,89,55} | {AAAA48949,AAAAAAAA29716,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAAA67946,AAAAAAAAAAAAAAAA29150,AAA28075,AAAAAAAAAAAAAAAAA43052}
+ 76 | {92,55,10,7} | {AAAAAAAAAAAAAAA67062}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 78 | {55,89,44,84,34} | {AAAAAAAAAAA6119,AAAAAAAAAAAAAA8666,AA99927,AA42406,AAAAAAA81898,AAAAAAA9228,AAAAAAAAAAA92631,AA21643,AAAAAAAAAAAAAA28620}
+ 79 | {45} | {AAAAAAAAAA646,AAAAAAAAAAAAAAAAAAA70415,AAAAAA43678,AAAAAAAA72908}
+ 80 | {74,89,44,80,0} | {AAAA35194,AAAAAAAA79710,AAA20874,AAAAAAAAAAAAAAAAAAA70104,AAAAAAAAAAAAA73084,AAAAAAA57334,AAAAAAA9228,AAAAAAAAAAAAA62007}
+ 81 | {63,77,54,48,61,53,97} | {AAAAAAAAAAAAAAA81326,AAAAAAAAAA22292,AA25381,AAAAAAAAAAA74076,AAAAAAA81898,AAAAAAAAA72121}
+ 82 | {34,60,4,79,78,16,86,89,42,50} | {AAAAA40681,AAAAAAAAAAAAAAAAAA12591,AAAAAAA80240,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA70104}
+ 83 | {14,10} | {AAAAAAAAAA22292,AAAAAAAAAAAAA70254,AAAAAAAAAAA6119}
+ 84 | {11,83,35,13,96,94} | {AAAAA95309,AAAAAAAAAAAAAAAAAA32918,AAAAAAAAAAAAAAAAAA24183}
+ 85 | {39,60} | {AAAAAAAAAAAAAAAA55798,AAAAAAAAAA22292,AAAAAAA66161,AAAAAAA21462,AAAAAAAAAAAAAAAAAA12591,55847,AAAAAA98232,AAAAAAAAAAA46154}
+ 86 | {33,81,72,74,45,36,82} | {AAAAAAAA81587,AAAAAAAAAAAAAA96505,45449,AAAA80176}
+ 87 | {57,27,50,12,97,68} | {AAAAAAAAAAAAAAAAA26540,AAAAAAAAA10012,AAAAAAAAAAAA35809,AAAAAAAAAAAAAAAA29150,AAAAAAAAAAA82945,AAAAAA66777,31228,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAA96505}
+ 88 | {41,90,77,24,6,24} | {AAAA35194,AAAA35194,AAAAAAA80240,AAAAAAAAAAA46154,AAAAAA58494,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA59334,AAAAAAAAAAAAAAAAAAA91804,AA74433}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 90 | {88,75} | {AAAAA60038,AAAAAAAA23648,AAAAAAAAAAA99000,AAAA41702,AAAAAAAAAAAAA22860,AAAAAAAAAAAAAAA68526}
+ 91 | {78} | {AAAAAAAAAAAAA62007,AAA99043}
+ 92 | {85,63,49,45} | {AAAAAAA89932,AAAAAAAAAAAAA22860,AAAAAAAAAAAAAAAAAAA1205,AAAAAAAAAAAA21089}
+ 93 | {11} | {AAAAAAAAAAA176,AAAAAAAAAAAAAA8666,AAAAAAAAAAAAAAA453,AAAAAAAAAAAAA85723,A68938,AAAAAAAAAAAAA9821,AAAAAAA48038,AAAAAAAAAAAAAAAAA59387,AA99927,AAAAA17383}
+ 94 | {98,9,85,62,88,91,60,61,38,86} | {AAAAAAAA81587,AAAAA17383,AAAAAAAA81587}
+ 95 | {47,77} | {AAAAAAAAAAAAAAAAA764,AAAAAAAAAAA74076,AAAAAAAAAA18107,AAAAA40681,AAAAAAAAAAAAAAA35875,AAAAA60038,AAAAAAA56483}
+ 96 | {23,97,43} | {AAAAAAAAAA646,A87088}
+ 97 | {54,2,86,65} | {47735,AAAAAAA99836,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAA29150,AAAAAAA80240,AAAAAAAAAAAAAAAA98414,AAAAAAA56483,AAAAAAAAAAAAAAAA29150,AAAAAAA39692,AA21643}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 99 | {37,86} | {AAAAAAAAAAAAAAAAAA32918,AAAAA70514,AAAAAAAAA10012,AAAAAAAAAAAAAAAAA59387,AAAAAAAAAA64777,AAAAAAAAAAAAAAAAAAA15356}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+ 101 | {} | {}
+ 102 | {NULL} | {NULL}
+(102 rows)
+
+SELECT * FROM array_index_op_test WHERE t && '{}' ORDER BY seqno;
+ seqno | i | t
+-------+---+---
+(0 rows)
+
+SELECT * FROM array_index_op_test WHERE t <@ '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+-- And try it with a multicolumn GIN index
+DROP INDEX intarrayidx, textarrayidx;
+CREATE INDEX botharrayidx ON array_index_op_test USING gin (i, t);
+SELECT * FROM array_index_op_test WHERE i @> '{32}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE i && '{32}' ORDER BY seqno;
+ seqno | i | t
+-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
+ 6 | {39,35,5,94,17,92,60,32} | {AAAAAAAAAAAAAAA35875,AAAAAAAAAAAAAAAA23657}
+ 74 | {32} | {AAAAAAAAAAAAAAAA1729,AAAAAAAAAAAAA22860,AAAAAA99807,AAAAA17383,AAAAAAAAAAAAAAA67062,AAAAAAAAAAA15165,AAAAAAAAAAA50956}
+ 77 | {97,15,32,17,55,59,18,37,50,39} | {AAAAAAAAAAAA67946,AAAAAA54032,AAAAAAAA81587,55847,AAAAAAAAAAAAAA28620,AAAAAAAAAAAAAAAAA43052,AAAAAA75463,AAAA49534,AAAAAAAA44066}
+ 89 | {40,32,17,6,30,88} | {AA44673,AAAAAAAAAAA6119,AAAAAAAAAAAAAAAA23657,AAAAAAAAAAAAAAAAAA47955,AAAAAAAAAAAAAAAA33598,AAAAAAAAAAA33576,AA44673}
+ 98 | {38,34,32,89} | {AAAAAAAAAAAAAAAAAA71621,AAAA8857,AAAAAAAAAAAAAAAAAAA65037,AAAAAAAAAAAAAAAA31334,AAAAAAAAAA48845}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(6 rows)
+
+SELECT * FROM array_index_op_test WHERE t @> '{AAAAAAA80240}' ORDER BY seqno;
+ seqno | i | t
+-------+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 30 | {26,81,47,91,34} | {AAAAAAAAAAAAAAAAAAA70104,AAAAAAA80240}
+ 64 | {26,19,34,24,81,78} | {A96617,AAAAAAAAAAAAAAAAAAA70104,A68938,AAAAAAAAAAA53908,AAAAAAAAAAAAAAA453,AA17009,AAAAAAA80240}
+ 82 | {34,60,4,79,78,16,86,89,42,50} | {AAAAA40681,AAAAAAAAAAAAAAAAAA12591,AAAAAAA80240,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA70104}
+ 88 | {41,90,77,24,6,24} | {AAAA35194,AAAA35194,AAAAAAA80240,AAAAAAAAAAA46154,AAAAAA58494,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA59334,AAAAAAAAAAAAAAAAAAA91804,AA74433}
+ 97 | {54,2,86,65} | {47735,AAAAAAA99836,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAA29150,AAAAAAA80240,AAAAAAAAAAAAAAAA98414,AAAAAAA56483,AAAAAAAAAAAAAAAA29150,AAAAAAA39692,AA21643}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(7 rows)
+
+SELECT * FROM array_index_op_test WHERE t && '{AAAAAAA80240}' ORDER BY seqno;
+ seqno | i | t
+-------+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------
+ 19 | {52,82,17,74,23,46,69,51,75} | {AAAAAAAAAAAAA73084,AAAAA75968,AAAAAAAAAAAAAAAA14047,AAAAAAA80240,AAAAAAAAAAAAAAAAAAA1205,A68938}
+ 30 | {26,81,47,91,34} | {AAAAAAAAAAAAAAAAAAA70104,AAAAAAA80240}
+ 64 | {26,19,34,24,81,78} | {A96617,AAAAAAAAAAAAAAAAAAA70104,A68938,AAAAAAAAAAA53908,AAAAAAAAAAAAAAA453,AA17009,AAAAAAA80240}
+ 82 | {34,60,4,79,78,16,86,89,42,50} | {AAAAA40681,AAAAAAAAAAAAAAAAAA12591,AAAAAAA80240,AAAAAAAAAAAAAAAA55798,AAAAAAAAAAAAAAAAAAA70104}
+ 88 | {41,90,77,24,6,24} | {AAAA35194,AAAA35194,AAAAAAA80240,AAAAAAAAAAA46154,AAAAAA58494,AAAAAAAAAAAAAAAAAAA17075,AAAAAAAAAAAAAAAAAA59334,AAAAAAAAAAAAAAAAAAA91804,AA74433}
+ 97 | {54,2,86,65} | {47735,AAAAAAA99836,AAAAAAAAAAAAAAAAA6897,AAAAAAAAAAAAAAAA29150,AAAAAAA80240,AAAAAAAAAAAAAAAA98414,AAAAAAA56483,AAAAAAAAAAAAAAAA29150,AAAAAAA39692,AA21643}
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(7 rows)
+
+SELECT * FROM array_index_op_test WHERE i @> '{32}' AND t && '{AAAAAAA80240}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------------+------------------------------------------------------------------------------
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE i && '{32}' AND t @> '{AAAAAAA80240}' ORDER BY seqno;
+ seqno | i | t
+-------+-----------------------------+------------------------------------------------------------------------------
+ 100 | {85,32,57,39,49,84,32,3,30} | {AAAAAAA80240,AAAAAAAAAAAAAAAA1729,AAAAA60038,AAAAAAAAAAA92631,AAAAAAAA9523}
+(1 row)
+
+SELECT * FROM array_index_op_test WHERE t = '{}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+SELECT * FROM array_op_test WHERE i = '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+--------+--------
+ 102 | {NULL} | {NULL}
+(1 row)
+
+SELECT * FROM array_op_test WHERE i <@ '{NULL}' ORDER BY seqno;
+ seqno | i | t
+-------+----+----
+ 101 | {} | {}
+(1 row)
+
+RESET enable_seqscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+--
+-- Try a GIN index with a lot of items with same key. (GIN creates a posting
+-- tree when there are enough duplicates)
+--
+CREATE TABLE array_gin_test (a int[]);
+INSERT INTO array_gin_test SELECT ARRAY[1, g%5, g] FROM generate_series(1, 10000) g;
+CREATE INDEX array_gin_test_idx ON array_gin_test USING gin (a);
+SELECT COUNT(*) FROM array_gin_test WHERE a @> '{2}';
+ count
+-------
+ 2000
+(1 row)
+
+DROP TABLE array_gin_test;
+--
+-- Test GIN index's reloptions
+--
+CREATE INDEX gin_relopts_test ON array_index_op_test USING gin (i)
+ WITH (FASTUPDATE=on, GIN_PENDING_LIST_LIMIT=128);
+\d+ gin_relopts_test
+ Index "public.gin_relopts_test"
+ Column | Type | Key? | Definition | Storage | Stats target
+--------+---------+------+------------+---------+--------------
+ i | integer | yes | i | plain |
+gin, for table "public.array_index_op_test"
+Options: fastupdate=on, gin_pending_list_limit=128
+
+--
+-- HASH
+--
+CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
+CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
+CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
+CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops) WITH (fillfactor=60);
+CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
+CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
+DROP TABLE unlogged_hash_table;
+-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
+-- Test hash index build tuplesorting. Force hash tuplesort using low
+-- maintenance_work_mem setting and fillfactor:
+SET maintenance_work_mem = '1MB';
+CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
+ QUERY PLAN
+-------------------------------------------------------
+ Aggregate
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (stringu1 = 'TVAAAA'::name)
+ -> Bitmap Index Scan on hash_tuplesort_idx
+ Index Cond: (stringu1 = 'TVAAAA'::name)
+(5 rows)
+
+SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
+ count
+-------
+ 14
+(1 row)
+
+DROP INDEX hash_tuplesort_idx;
+RESET maintenance_work_mem;
+--
+-- Test functional index
+--
+CREATE TABLE func_index_heap (f1 text, f2 text);
+CREATE UNIQUE INDEX func_index_index on func_index_heap (textcat(f1,f2));
+INSERT INTO func_index_heap VALUES('ABC','DEF');
+INSERT INTO func_index_heap VALUES('AB','CDEFG');
+INSERT INTO func_index_heap VALUES('QWE','RTY');
+-- this should fail because of unique index:
+INSERT INTO func_index_heap VALUES('ABCD', 'EF');
+ERROR: duplicate key value violates unique constraint "func_index_index"
+DETAIL: Key (textcat(f1, f2))=(ABCDEF) already exists.
+-- but this shouldn't:
+INSERT INTO func_index_heap VALUES('QWERTY');
+-- while we're here, see that the metadata looks sane
+\d func_index_heap
+ Table "public.func_index_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "func_index_index" UNIQUE, btree (textcat(f1, f2))
+
+\d func_index_index
+ Index "public.func_index_index"
+ Column | Type | Key? | Definition
+---------+------+------+-----------------
+ textcat | text | yes | textcat(f1, f2)
+unique, btree, for table "public.func_index_heap"
+
+--
+-- Same test, expressional index
+--
+DROP TABLE func_index_heap;
+CREATE TABLE func_index_heap (f1 text, f2 text);
+CREATE UNIQUE INDEX func_index_index on func_index_heap ((f1 || f2) text_ops);
+INSERT INTO func_index_heap VALUES('ABC','DEF');
+INSERT INTO func_index_heap VALUES('AB','CDEFG');
+INSERT INTO func_index_heap VALUES('QWE','RTY');
+-- this should fail because of unique index:
+INSERT INTO func_index_heap VALUES('ABCD', 'EF');
+ERROR: duplicate key value violates unique constraint "func_index_index"
+DETAIL: Key ((f1 || f2))=(ABCDEF) already exists.
+-- but this shouldn't:
+INSERT INTO func_index_heap VALUES('QWERTY');
+-- while we're here, see that the metadata looks sane
+\d func_index_heap
+ Table "public.func_index_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "func_index_index" UNIQUE, btree ((f1 || f2))
+
+\d func_index_index
+ Index "public.func_index_index"
+ Column | Type | Key? | Definition
+--------+------+------+------------
+ expr | text | yes | (f1 || f2)
+unique, btree, for table "public.func_index_heap"
+
+-- this should fail because of unsafe column type (anonymous record)
+create index on func_index_heap ((f1 || f2), (row(f1, f2)));
+ERROR: column "row" has pseudo-type record
+--
+-- Test unique index with included columns
+--
+CREATE TABLE covering_index_heap (f1 int, f2 int, f3 text);
+CREATE UNIQUE INDEX covering_index_index on covering_index_heap (f1,f2) INCLUDE(f3);
+INSERT INTO covering_index_heap VALUES(1,1,'AAA');
+INSERT INTO covering_index_heap VALUES(1,2,'AAA');
+-- this should fail because of unique index on f1,f2:
+INSERT INTO covering_index_heap VALUES(1,2,'BBB');
+ERROR: duplicate key value violates unique constraint "covering_index_index"
+DETAIL: Key (f1, f2)=(1, 2) already exists.
+-- and this shouldn't:
+INSERT INTO covering_index_heap VALUES(1,4,'AAA');
+-- Try to build index on table that already contains data
+CREATE UNIQUE INDEX covering_pkey on covering_index_heap (f1,f2) INCLUDE(f3);
+-- Try to use existing covering index as primary key
+ALTER TABLE covering_index_heap ADD CONSTRAINT covering_pkey PRIMARY KEY USING INDEX
+covering_pkey;
+DROP TABLE covering_index_heap;
+--
+-- Also try building functional, expressional, and partial indexes on
+-- tables that already contain data.
+--
+create unique index hash_f8_index_1 on hash_f8_heap(abs(random));
+create unique index hash_f8_index_2 on hash_f8_heap((seqno + 1), random);
+create unique index hash_f8_index_3 on hash_f8_heap(random) where seqno > 1000;
+--
+-- Try some concurrent index builds
+--
+-- Unfortunately this only tests about half the code paths because there are
+-- no concurrent updates happening to the table at the same time.
+CREATE TABLE concur_heap (f1 text, f2 text);
+-- empty table
+CREATE INDEX CONCURRENTLY concur_index1 ON concur_heap(f2,f1);
+CREATE INDEX CONCURRENTLY IF NOT EXISTS concur_index1 ON concur_heap(f2,f1);
+NOTICE: relation "concur_index1" already exists, skipping
+INSERT INTO concur_heap VALUES ('a','b');
+INSERT INTO concur_heap VALUES ('b','b');
+-- unique index
+CREATE UNIQUE INDEX CONCURRENTLY concur_index2 ON concur_heap(f1);
+CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
+NOTICE: relation "concur_index2" already exists, skipping
+-- check if constraint is set up properly to be enforced
+INSERT INTO concur_heap VALUES ('b','x');
+ERROR: duplicate key value violates unique constraint "concur_index2"
+DETAIL: Key (f1)=(b) already exists.
+-- check if constraint is enforced properly at build time
+CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+ERROR: could not create unique index "concur_index3"
+DETAIL: Key (f2)=(b) is duplicated.
+-- test that expression indexes and partial indexes work concurrently
+CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
+CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
+-- here we also check that you can default the index name
+CREATE INDEX CONCURRENTLY on concur_heap((f2||f1));
+-- You can't do a concurrent index build in a transaction
+BEGIN;
+CREATE INDEX CONCURRENTLY concur_index7 ON concur_heap(f1);
+ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block
+COMMIT;
+-- But you can do a regular index build in a transaction
+BEGIN;
+CREATE INDEX std_index on concur_heap(f2);
+COMMIT;
+-- Failed builds are left invalid by VACUUM FULL, fixed by REINDEX
+VACUUM FULL concur_heap;
+REINDEX TABLE concur_heap;
+ERROR: could not create unique index "concur_index3"
+DETAIL: Key (f2)=(b) is duplicated.
+DELETE FROM concur_heap WHERE f1 = 'b';
+VACUUM FULL concur_heap;
+\d concur_heap
+ Table "public.concur_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "concur_heap_expr_idx" btree ((f2 || f1))
+ "concur_index1" btree (f2, f1)
+ "concur_index2" UNIQUE, btree (f1)
+ "concur_index3" UNIQUE, btree (f2) INVALID
+ "concur_index4" btree (f2) WHERE f1 = 'a'::text
+ "concur_index5" btree (f2) WHERE f1 = 'x'::text
+ "std_index" btree (f2)
+
+REINDEX TABLE concur_heap;
+\d concur_heap
+ Table "public.concur_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "concur_heap_expr_idx" btree ((f2 || f1))
+ "concur_index1" btree (f2, f1)
+ "concur_index2" UNIQUE, btree (f1)
+ "concur_index3" UNIQUE, btree (f2)
+ "concur_index4" btree (f2) WHERE f1 = 'a'::text
+ "concur_index5" btree (f2) WHERE f1 = 'x'::text
+ "std_index" btree (f2)
+
+-- Temporary tables with concurrent builds and on-commit actions
+-- CONCURRENTLY used with CREATE INDEX and DROP INDEX is ignored.
+-- PRESERVE ROWS, the default.
+CREATE TEMP TABLE concur_temp (f1 int, f2 text)
+ ON COMMIT PRESERVE ROWS;
+INSERT INTO concur_temp VALUES (1, 'foo'), (2, 'bar');
+CREATE INDEX CONCURRENTLY concur_temp_ind ON concur_temp(f1);
+DROP INDEX CONCURRENTLY concur_temp_ind;
+DROP TABLE concur_temp;
+-- ON COMMIT DROP
+BEGIN;
+CREATE TEMP TABLE concur_temp (f1 int, f2 text)
+ ON COMMIT DROP;
+INSERT INTO concur_temp VALUES (1, 'foo'), (2, 'bar');
+-- Fails when running in a transaction.
+CREATE INDEX CONCURRENTLY concur_temp_ind ON concur_temp(f1);
+ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block
+COMMIT;
+-- ON COMMIT DELETE ROWS
+CREATE TEMP TABLE concur_temp (f1 int, f2 text)
+ ON COMMIT DELETE ROWS;
+INSERT INTO concur_temp VALUES (1, 'foo'), (2, 'bar');
+CREATE INDEX CONCURRENTLY concur_temp_ind ON concur_temp(f1);
+DROP INDEX CONCURRENTLY concur_temp_ind;
+DROP TABLE concur_temp;
+--
+-- Try some concurrent index drops
+--
+DROP INDEX CONCURRENTLY "concur_index2"; -- works
+DROP INDEX CONCURRENTLY IF EXISTS "concur_index2"; -- notice
+NOTICE: index "concur_index2" does not exist, skipping
+-- failures
+DROP INDEX CONCURRENTLY "concur_index2", "concur_index3";
+ERROR: DROP INDEX CONCURRENTLY does not support dropping multiple objects
+BEGIN;
+DROP INDEX CONCURRENTLY "concur_index5";
+ERROR: DROP INDEX CONCURRENTLY cannot run inside a transaction block
+ROLLBACK;
+-- successes
+DROP INDEX CONCURRENTLY IF EXISTS "concur_index3";
+DROP INDEX CONCURRENTLY "concur_index4";
+DROP INDEX CONCURRENTLY "concur_index5";
+DROP INDEX CONCURRENTLY "concur_index1";
+DROP INDEX CONCURRENTLY "concur_heap_expr_idx";
+\d concur_heap
+ Table "public.concur_heap"
+ Column | Type | Collation | Nullable | Default
+--------+------+-----------+----------+---------
+ f1 | text | | |
+ f2 | text | | |
+Indexes:
+ "std_index" btree (f2)
+
+DROP TABLE concur_heap;
+--
+-- Test ADD CONSTRAINT USING INDEX
+--
+CREATE TABLE cwi_test( a int , b varchar(10), c char);
+-- add some data so that all tests have something to work with.
+INSERT INTO cwi_test VALUES(1, 2), (3, 4), (5, 6);
+CREATE UNIQUE INDEX cwi_uniq_idx ON cwi_test(a , b);
+ALTER TABLE cwi_test ADD primary key USING INDEX cwi_uniq_idx;
+\d cwi_test
+ Table "public.cwi_test"
+ Column | Type | Collation | Nullable | Default
+--------+-----------------------+-----------+----------+---------
+ a | integer | | not null |
+ b | character varying(10) | | not null |
+ c | character(1) | | |
+Indexes:
+ "cwi_uniq_idx" PRIMARY KEY, btree (a, b)
+
+\d cwi_uniq_idx
+ Index "public.cwi_uniq_idx"
+ Column | Type | Key? | Definition
+--------+-----------------------+------+------------
+ a | integer | yes | a
+ b | character varying(10) | yes | b
+primary key, btree, for table "public.cwi_test"
+
+CREATE UNIQUE INDEX cwi_uniq2_idx ON cwi_test(b , a);
+ALTER TABLE cwi_test DROP CONSTRAINT cwi_uniq_idx,
+ ADD CONSTRAINT cwi_replaced_pkey PRIMARY KEY
+ USING INDEX cwi_uniq2_idx;
+NOTICE: ALTER TABLE / ADD CONSTRAINT USING INDEX will rename index "cwi_uniq2_idx" to "cwi_replaced_pkey"
+\d cwi_test
+ Table "public.cwi_test"
+ Column | Type | Collation | Nullable | Default
+--------+-----------------------+-----------+----------+---------
+ a | integer | | not null |
+ b | character varying(10) | | not null |
+ c | character(1) | | |
+Indexes:
+ "cwi_replaced_pkey" PRIMARY KEY, btree (b, a)
+
+\d cwi_replaced_pkey
+ Index "public.cwi_replaced_pkey"
+ Column | Type | Key? | Definition
+--------+-----------------------+------+------------
+ b | character varying(10) | yes | b
+ a | integer | yes | a
+primary key, btree, for table "public.cwi_test"
+
+DROP INDEX cwi_replaced_pkey; -- Should fail; a constraint depends on it
+ERROR: cannot drop index cwi_replaced_pkey because constraint cwi_replaced_pkey on table cwi_test requires it
+HINT: You can drop constraint cwi_replaced_pkey on table cwi_test instead.
+-- Check that non-default index options are rejected
+CREATE UNIQUE INDEX cwi_uniq3_idx ON cwi_test(a desc);
+ALTER TABLE cwi_test ADD UNIQUE USING INDEX cwi_uniq3_idx; -- fail
+ERROR: index "cwi_uniq3_idx" column number 1 does not have default sorting behavior
+LINE 1: ALTER TABLE cwi_test ADD UNIQUE USING INDEX cwi_uniq3_idx;
+ ^
+DETAIL: Cannot create a primary key or unique constraint using such an index.
+CREATE UNIQUE INDEX cwi_uniq4_idx ON cwi_test(b collate "POSIX");
+ALTER TABLE cwi_test ADD UNIQUE USING INDEX cwi_uniq4_idx; -- fail
+ERROR: index "cwi_uniq4_idx" column number 1 does not have default sorting behavior
+LINE 1: ALTER TABLE cwi_test ADD UNIQUE USING INDEX cwi_uniq4_idx;
+ ^
+DETAIL: Cannot create a primary key or unique constraint using such an index.
+DROP TABLE cwi_test;
+-- ADD CONSTRAINT USING INDEX is forbidden on partitioned tables
+CREATE TABLE cwi_test(a int) PARTITION BY hash (a);
+create unique index on cwi_test (a);
+alter table cwi_test add primary key using index cwi_test_a_idx ;
+ERROR: ALTER TABLE / ADD CONSTRAINT USING INDEX is not supported on partitioned tables
+DROP TABLE cwi_test;
+--
+-- Check handling of indexes on system columns
+--
+CREATE TABLE syscol_table (a INT);
+-- System columns cannot be indexed
+CREATE INDEX ON syscolcol_table (ctid);
+ERROR: relation "syscolcol_table" does not exist
+-- nor used in expressions
+CREATE INDEX ON syscol_table ((ctid >= '(1000,0)'));
+ERROR: index creation on system columns is not supported
+-- nor used in predicates
+CREATE INDEX ON syscol_table (a) WHERE ctid >= '(1000,0)';
+ERROR: index creation on system columns is not supported
+DROP TABLE syscol_table;
+--
+-- Tests for IS NULL/IS NOT NULL with b-tree indexes
+--
+SELECT unique1, unique2 INTO onek_with_null FROM onek;
+INSERT INTO onek_with_null (unique1,unique2) VALUES (NULL, -1), (NULL, NULL);
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2,unique1);
+SET enable_seqscan = OFF;
+SET enable_indexscan = ON;
+SET enable_bitmapscan = ON;
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NOT NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL AND unique1 > 500;
+ count
+-------
+ 499
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique1 > 500;
+ count
+-------
+ 0
+(1 row)
+
+DROP INDEX onek_nulltest;
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2 desc,unique1);
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NOT NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL AND unique1 > 500;
+ count
+-------
+ 499
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique1 > 500;
+ count
+-------
+ 0
+(1 row)
+
+DROP INDEX onek_nulltest;
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2 desc nulls last,unique1);
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NOT NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL AND unique1 > 500;
+ count
+-------
+ 499
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique1 > 500;
+ count
+-------
+ 0
+(1 row)
+
+DROP INDEX onek_nulltest;
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2 nulls first,unique1);
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL;
+ count
+-------
+ 2
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NOT NULL;
+ count
+-------
+ 1
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NOT NULL AND unique1 > 500;
+ count
+-------
+ 499
+(1 row)
+
+SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique1 > 500;
+ count
+-------
+ 0
+(1 row)
+
+DROP INDEX onek_nulltest;
+-- Check initial-positioning logic too
+CREATE UNIQUE INDEX onek_nulltest ON onek_with_null (unique2);
+SET enable_seqscan = OFF;
+SET enable_indexscan = ON;
+SET enable_bitmapscan = OFF;
+SELECT unique1, unique2 FROM onek_with_null
+ ORDER BY unique2 LIMIT 2;
+ unique1 | unique2
+---------+---------
+ | -1
+ 147 | 0
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null WHERE unique2 >= -1
+ ORDER BY unique2 LIMIT 2;
+ unique1 | unique2
+---------+---------
+ | -1
+ 147 | 0
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null WHERE unique2 >= 0
+ ORDER BY unique2 LIMIT 2;
+ unique1 | unique2
+---------+---------
+ 147 | 0
+ 931 | 1
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null
+ ORDER BY unique2 DESC LIMIT 2;
+ unique1 | unique2
+---------+---------
+ |
+ 278 | 999
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null WHERE unique2 >= -1
+ ORDER BY unique2 DESC LIMIT 2;
+ unique1 | unique2
+---------+---------
+ 278 | 999
+ 0 | 998
+(2 rows)
+
+SELECT unique1, unique2 FROM onek_with_null WHERE unique2 < 999
+ ORDER BY unique2 DESC LIMIT 2;
+ unique1 | unique2
+---------+---------
+ 0 | 998
+ 744 | 997
+(2 rows)
+
+RESET enable_seqscan;
+RESET enable_indexscan;
+RESET enable_bitmapscan;
+DROP TABLE onek_with_null;
+--
+-- Check bitmap index path planning
+--
+EXPLAIN (COSTS OFF)
+SELECT * FROM tenk1
+ WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1
+ Recheck Cond: (((thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3)) OR ((thousand = 42) AND (tenthous = 42)))
+ -> BitmapOr
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: ((thousand = 42) AND (tenthous = 1))
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: ((thousand = 42) AND (tenthous = 3))
+ -> Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: ((thousand = 42) AND (tenthous = 42))
+(9 rows)
+
+SELECT * FROM tenk1
+ WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+ 42 | 5530 | 0 | 2 | 2 | 2 | 42 | 42 | 42 | 42 | 42 | 84 | 85 | QBAAAA | SEIAAA | OOOOxx
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM tenk1
+ WHERE hundred = 42 AND (thousand = 42 OR thousand = 99);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Bitmap Heap Scan on tenk1
+ Recheck Cond: (hundred = 42)
+ Filter: ((thousand = 42) OR (thousand = 99))
+ -> Bitmap Index Scan on tenk1_hundred
+ Index Cond: (hundred = 42)
+(6 rows)
+
+SELECT count(*) FROM tenk1
+ WHERE hundred = 42 AND (thousand = 42 OR thousand = 99);
+ count
+-------
+ 10
+(1 row)
+
+--
+-- Check behavior with duplicate index column contents
+--
+CREATE TABLE dupindexcols AS
+ SELECT unique1 as id, stringu2::text as f1 FROM tenk1;
+CREATE INDEX dupindexcols_i ON dupindexcols (f1, id, f1 text_pattern_ops);
+ANALYZE dupindexcols;
+EXPLAIN (COSTS OFF)
+ SELECT count(*) FROM dupindexcols
+ WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
+ Aggregate
+ -> Bitmap Heap Scan on dupindexcols
+ Recheck Cond: ((f1 >= 'WA'::text) AND (f1 <= 'ZZZ'::text) AND (id < 1000) AND (f1 ~<~ 'YX'::text))
+ -> Bitmap Index Scan on dupindexcols_i
+ Index Cond: ((f1 >= 'WA'::text) AND (f1 <= 'ZZZ'::text) AND (id < 1000) AND (f1 ~<~ 'YX'::text))
+(5 rows)
+
+SELECT count(*) FROM dupindexcols
+ WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
+ count
+-------
+ 97
+(1 row)
+
+--
+-- Check ordering of =ANY indexqual results (bug in 9.2.0)
+--
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = OFF;
+explain (costs off)
+SELECT unique1 FROM tenk1
+WHERE unique1 IN (1,42,7)
+ORDER BY unique1;
+ QUERY PLAN
+-------------------------------------------------------
+ Index Only Scan using tenk1_unique1 on tenk1
+ Index Cond: (unique1 = ANY ('{1,42,7}'::integer[]))
+(2 rows)
+
+SELECT unique1 FROM tenk1
+WHERE unique1 IN (1,42,7)
+ORDER BY unique1;
+ unique1
+---------
+ 1
+ 7
+ 42
+(3 rows)
+
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand;
+ QUERY PLAN
+-------------------------------------------------------
+ Index Only Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: (thousand < 2)
+ Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
+(3 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand;
+ thousand | tenthous
+----------+----------
+ 0 | 3000
+ 1 | 1001
+(2 rows)
+
+RESET enable_indexscan;
+SET enable_indexonlyscan = OFF;
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Sort Key: thousand
+ -> Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(4 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand;
+ thousand | tenthous
+----------+----------
+ 0 | 3000
+ 1 | 1001
+(2 rows)
+
+RESET enable_seqscan;
+RESET enable_bitmapscan;
+RESET enable_indexonlyscan;
+--
+-- Check elimination of constant-NULL subexpressions
+--
+explain (costs off)
+ select * from tenk1 where (thousand, tenthous) in ((1,1001), (null,null));
+ QUERY PLAN
+------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: ((thousand = 1) AND (tenthous = 1001))
+(2 rows)
+
+--
+-- Check matching of boolean index columns to WHERE conditions and sort keys
+--
+create temp table boolindex (b bool, i int, unique(b, i), junk float);
+explain (costs off)
+ select * from boolindex order by b, i limit 10;
+ QUERY PLAN
+-------------------------------------------------------
+ Limit
+ -> Index Scan using boolindex_b_i_key on boolindex
+(2 rows)
+
+explain (costs off)
+ select * from boolindex where b order by i limit 10;
+ QUERY PLAN
+-------------------------------------------------------
+ Limit
+ -> Index Scan using boolindex_b_i_key on boolindex
+ Index Cond: (b = true)
+(3 rows)
+
+explain (costs off)
+ select * from boolindex where b = true order by i desc limit 10;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ -> Index Scan Backward using boolindex_b_i_key on boolindex
+ Index Cond: (b = true)
+(3 rows)
+
+explain (costs off)
+ select * from boolindex where not b order by i limit 10;
+ QUERY PLAN
+-------------------------------------------------------
+ Limit
+ -> Index Scan using boolindex_b_i_key on boolindex
+ Index Cond: (b = false)
+(3 rows)
+
+explain (costs off)
+ select * from boolindex where b is true order by i desc limit 10;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ -> Index Scan Backward using boolindex_b_i_key on boolindex
+ Index Cond: (b = true)
+(3 rows)
+
+explain (costs off)
+ select * from boolindex where b is false order by i desc limit 10;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ -> Index Scan Backward using boolindex_b_i_key on boolindex
+ Index Cond: (b = false)
+(3 rows)
+
+--
+-- REINDEX (VERBOSE)
+--
+CREATE TABLE reindex_verbose(id integer primary key);
+\set VERBOSITY terse \\ -- suppress machine-dependent details
+REINDEX (VERBOSE) TABLE reindex_verbose;
+INFO: index "reindex_verbose_pkey" was reindexed
+\set VERBOSITY default
+DROP TABLE reindex_verbose;
+--
+-- REINDEX CONCURRENTLY
+--
+CREATE TABLE concur_reindex_tab (c1 int);
+-- REINDEX
+REINDEX TABLE concur_reindex_tab; -- notice
+NOTICE: table "concur_reindex_tab" has no indexes to reindex
+REINDEX TABLE CONCURRENTLY concur_reindex_tab; -- notice
+NOTICE: table "concur_reindex_tab" has no indexes that can be reindexed concurrently
+ALTER TABLE concur_reindex_tab ADD COLUMN c2 text; -- add toast index
+-- Normal index with integer column
+CREATE UNIQUE INDEX concur_reindex_ind1 ON concur_reindex_tab(c1);
+-- Normal index with text column
+CREATE INDEX concur_reindex_ind2 ON concur_reindex_tab(c2);
+-- UNIQUE index with expression
+CREATE UNIQUE INDEX concur_reindex_ind3 ON concur_reindex_tab(abs(c1));
+-- Duplicate column names
+CREATE INDEX concur_reindex_ind4 ON concur_reindex_tab(c1, c1, c2);
+-- Create table for check on foreign key dependence switch with indexes swapped
+ALTER TABLE concur_reindex_tab ADD PRIMARY KEY USING INDEX concur_reindex_ind1;
+CREATE TABLE concur_reindex_tab2 (c1 int REFERENCES concur_reindex_tab);
+INSERT INTO concur_reindex_tab VALUES (1, 'a');
+INSERT INTO concur_reindex_tab VALUES (2, 'a');
+-- Reindex concurrently of exclusion constraint currently not supported
+CREATE TABLE concur_reindex_tab3 (c1 int, c2 int4range, EXCLUDE USING gist (c2 WITH &&));
+INSERT INTO concur_reindex_tab3 VALUES (3, '[1,2]');
+REINDEX INDEX CONCURRENTLY concur_reindex_tab3_c2_excl; -- error
+ERROR: concurrent index creation for exclusion constraints is not supported
+REINDEX TABLE CONCURRENTLY concur_reindex_tab3; -- succeeds with warning
+WARNING: cannot reindex exclusion constraint index "public.concur_reindex_tab3_c2_excl" concurrently, skipping
+NOTICE: table "concur_reindex_tab3" has no indexes that can be reindexed concurrently
+INSERT INTO concur_reindex_tab3 VALUES (4, '[2,4]');
+ERROR: conflicting key value violates exclusion constraint "concur_reindex_tab3_c2_excl"
+DETAIL: Key (c2)=([2,5)) conflicts with existing key (c2)=([1,3)).
+-- Check materialized views
+CREATE MATERIALIZED VIEW concur_reindex_matview AS SELECT * FROM concur_reindex_tab;
+-- Dependency lookup before and after the follow-up REINDEX commands.
+-- These should remain consistent.
+SELECT pg_describe_object(classid, objid, objsubid) as obj,
+ pg_describe_object(refclassid,refobjid,refobjsubid) as objref,
+ deptype
+FROM pg_depend
+WHERE classid = 'pg_class'::regclass AND
+ objid in ('concur_reindex_tab'::regclass,
+ 'concur_reindex_ind1'::regclass,
+ 'concur_reindex_ind2'::regclass,
+ 'concur_reindex_ind3'::regclass,
+ 'concur_reindex_ind4'::regclass,
+ 'concur_reindex_matview'::regclass)
+ ORDER BY 1, 2;
+ obj | objref | deptype
+------------------------------------------+------------------------------------------------------------+---------
+ index concur_reindex_ind1 | constraint concur_reindex_ind1 on table concur_reindex_tab | i
+ index concur_reindex_ind2 | collation "default" | n
+ index concur_reindex_ind2 | column c2 of table concur_reindex_tab | a
+ index concur_reindex_ind3 | column c1 of table concur_reindex_tab | a
+ index concur_reindex_ind3 | table concur_reindex_tab | a
+ index concur_reindex_ind4 | collation "default" | n
+ index concur_reindex_ind4 | column c1 of table concur_reindex_tab | a
+ index concur_reindex_ind4 | column c2 of table concur_reindex_tab | a
+ materialized view concur_reindex_matview | schema public | n
+ table concur_reindex_tab | schema public | n
+(10 rows)
+
+REINDEX INDEX CONCURRENTLY concur_reindex_ind1;
+REINDEX TABLE CONCURRENTLY concur_reindex_tab;
+REINDEX TABLE CONCURRENTLY concur_reindex_matview;
+NOTICE: table "concur_reindex_matview" has no indexes that can be reindexed concurrently
+SELECT pg_describe_object(classid, objid, objsubid) as obj,
+ pg_describe_object(refclassid,refobjid,refobjsubid) as objref,
+ deptype
+FROM pg_depend
+WHERE classid = 'pg_class'::regclass AND
+ objid in ('concur_reindex_tab'::regclass,
+ 'concur_reindex_ind1'::regclass,
+ 'concur_reindex_ind2'::regclass,
+ 'concur_reindex_ind3'::regclass,
+ 'concur_reindex_ind4'::regclass,
+ 'concur_reindex_matview'::regclass)
+ ORDER BY 1, 2;
+ obj | objref | deptype
+------------------------------------------+------------------------------------------------------------+---------
+ index concur_reindex_ind1 | constraint concur_reindex_ind1 on table concur_reindex_tab | i
+ index concur_reindex_ind2 | collation "default" | n
+ index concur_reindex_ind2 | column c2 of table concur_reindex_tab | a
+ index concur_reindex_ind3 | column c1 of table concur_reindex_tab | a
+ index concur_reindex_ind3 | table concur_reindex_tab | a
+ index concur_reindex_ind4 | collation "default" | n
+ index concur_reindex_ind4 | column c1 of table concur_reindex_tab | a
+ index concur_reindex_ind4 | column c2 of table concur_reindex_tab | a
+ materialized view concur_reindex_matview | schema public | n
+ table concur_reindex_tab | schema public | n
+(10 rows)
+
+-- Check that comments are preserved
+CREATE TABLE testcomment (i int);
+CREATE INDEX testcomment_idx1 ON testcomment (i);
+COMMENT ON INDEX testcomment_idx1 IS 'test comment';
+SELECT obj_description('testcomment_idx1'::regclass, 'pg_class');
+ obj_description
+-----------------
+ test comment
+(1 row)
+
+REINDEX TABLE testcomment;
+SELECT obj_description('testcomment_idx1'::regclass, 'pg_class');
+ obj_description
+-----------------
+ test comment
+(1 row)
+
+REINDEX TABLE CONCURRENTLY testcomment ;
+SELECT obj_description('testcomment_idx1'::regclass, 'pg_class');
+ obj_description
+-----------------
+ test comment
+(1 row)
+
+DROP TABLE testcomment;
+-- Check that indisclustered updates are preserved
+CREATE TABLE concur_clustered(i int);
+CREATE INDEX concur_clustered_i_idx ON concur_clustered(i);
+ALTER TABLE concur_clustered CLUSTER ON concur_clustered_i_idx;
+REINDEX TABLE CONCURRENTLY concur_clustered;
+SELECT indexrelid::regclass, indisclustered FROM pg_index
+ WHERE indrelid = 'concur_clustered'::regclass;
+ indexrelid | indisclustered
+------------------------+----------------
+ concur_clustered_i_idx | t
+(1 row)
+
+DROP TABLE concur_clustered;
+-- Check that indisreplident updates are preserved.
+CREATE TABLE concur_replident(i int NOT NULL);
+CREATE UNIQUE INDEX concur_replident_i_idx ON concur_replident(i);
+ALTER TABLE concur_replident REPLICA IDENTITY
+ USING INDEX concur_replident_i_idx;
+SELECT indexrelid::regclass, indisreplident FROM pg_index
+ WHERE indrelid = 'concur_replident'::regclass;
+ indexrelid | indisreplident
+------------------------+----------------
+ concur_replident_i_idx | t
+(1 row)
+
+REINDEX TABLE CONCURRENTLY concur_replident;
+SELECT indexrelid::regclass, indisreplident FROM pg_index
+ WHERE indrelid = 'concur_replident'::regclass;
+ indexrelid | indisreplident
+------------------------+----------------
+ concur_replident_i_idx | t
+(1 row)
+
+DROP TABLE concur_replident;
+-- Partitions
+-- Create some partitioned tables
+CREATE TABLE concur_reindex_part (c1 int, c2 int) PARTITION BY RANGE (c1);
+CREATE TABLE concur_reindex_part_0 PARTITION OF concur_reindex_part
+ FOR VALUES FROM (0) TO (10) PARTITION BY list (c2);
+CREATE TABLE concur_reindex_part_0_1 PARTITION OF concur_reindex_part_0
+ FOR VALUES IN (1);
+CREATE TABLE concur_reindex_part_0_2 PARTITION OF concur_reindex_part_0
+ FOR VALUES IN (2);
+-- This partitioned table will have no partitions.
+CREATE TABLE concur_reindex_part_10 PARTITION OF concur_reindex_part
+ FOR VALUES FROM (10) TO (20) PARTITION BY list (c2);
+-- Create some partitioned indexes
+CREATE INDEX concur_reindex_part_index ON ONLY concur_reindex_part (c1);
+CREATE INDEX concur_reindex_part_index_0 ON ONLY concur_reindex_part_0 (c1);
+ALTER INDEX concur_reindex_part_index ATTACH PARTITION concur_reindex_part_index_0;
+-- This partitioned index will have no partitions.
+CREATE INDEX concur_reindex_part_index_10 ON ONLY concur_reindex_part_10 (c1);
+ALTER INDEX concur_reindex_part_index ATTACH PARTITION concur_reindex_part_index_10;
+CREATE INDEX concur_reindex_part_index_0_1 ON ONLY concur_reindex_part_0_1 (c1);
+ALTER INDEX concur_reindex_part_index_0 ATTACH PARTITION concur_reindex_part_index_0_1;
+CREATE INDEX concur_reindex_part_index_0_2 ON ONLY concur_reindex_part_0_2 (c1);
+ALTER INDEX concur_reindex_part_index_0 ATTACH PARTITION concur_reindex_part_index_0_2;
+SELECT relid, parentrelid, level FROM pg_partition_tree('concur_reindex_part_index')
+ ORDER BY relid, level;
+ relid | parentrelid | level
+-------------------------------+-----------------------------+-------
+ concur_reindex_part_index | | 0
+ concur_reindex_part_index_0 | concur_reindex_part_index | 1
+ concur_reindex_part_index_10 | concur_reindex_part_index | 1
+ concur_reindex_part_index_0_1 | concur_reindex_part_index_0 | 2
+ concur_reindex_part_index_0_2 | concur_reindex_part_index_0 | 2
+(5 rows)
+
+SELECT relid, parentrelid, level FROM pg_partition_tree('concur_reindex_part_index')
+ ORDER BY relid, level;
+ relid | parentrelid | level
+-------------------------------+-----------------------------+-------
+ concur_reindex_part_index | | 0
+ concur_reindex_part_index_0 | concur_reindex_part_index | 1
+ concur_reindex_part_index_10 | concur_reindex_part_index | 1
+ concur_reindex_part_index_0_1 | concur_reindex_part_index_0 | 2
+ concur_reindex_part_index_0_2 | concur_reindex_part_index_0 | 2
+(5 rows)
+
+-- REINDEX should preserve dependencies of partition tree.
+SELECT pg_describe_object(classid, objid, objsubid) as obj,
+ pg_describe_object(refclassid,refobjid,refobjsubid) as objref,
+ deptype
+FROM pg_depend
+WHERE classid = 'pg_class'::regclass AND
+ objid in ('concur_reindex_part'::regclass,
+ 'concur_reindex_part_0'::regclass,
+ 'concur_reindex_part_0_1'::regclass,
+ 'concur_reindex_part_0_2'::regclass,
+ 'concur_reindex_part_index'::regclass,
+ 'concur_reindex_part_index_0'::regclass,
+ 'concur_reindex_part_index_0_1'::regclass,
+ 'concur_reindex_part_index_0_2'::regclass)
+ ORDER BY 1, 2;
+ obj | objref | deptype
+------------------------------------------+--------------------------------------------+---------
+ column c1 of table concur_reindex_part | table concur_reindex_part | i
+ column c2 of table concur_reindex_part_0 | table concur_reindex_part_0 | i
+ index concur_reindex_part_index | column c1 of table concur_reindex_part | a
+ index concur_reindex_part_index_0 | column c1 of table concur_reindex_part_0 | a
+ index concur_reindex_part_index_0 | index concur_reindex_part_index | P
+ index concur_reindex_part_index_0 | table concur_reindex_part_0 | S
+ index concur_reindex_part_index_0_1 | column c1 of table concur_reindex_part_0_1 | a
+ index concur_reindex_part_index_0_1 | index concur_reindex_part_index_0 | P
+ index concur_reindex_part_index_0_1 | table concur_reindex_part_0_1 | S
+ index concur_reindex_part_index_0_2 | column c1 of table concur_reindex_part_0_2 | a
+ index concur_reindex_part_index_0_2 | index concur_reindex_part_index_0 | P
+ index concur_reindex_part_index_0_2 | table concur_reindex_part_0_2 | S
+ table concur_reindex_part | schema public | n
+ table concur_reindex_part_0 | schema public | n
+ table concur_reindex_part_0 | table concur_reindex_part | a
+ table concur_reindex_part_0_1 | schema public | n
+ table concur_reindex_part_0_1 | table concur_reindex_part_0 | a
+ table concur_reindex_part_0_2 | schema public | n
+ table concur_reindex_part_0_2 | table concur_reindex_part_0 | a
+(19 rows)
+
+REINDEX INDEX CONCURRENTLY concur_reindex_part_index_0_1;
+REINDEX INDEX CONCURRENTLY concur_reindex_part_index_0_2;
+SELECT relid, parentrelid, level FROM pg_partition_tree('concur_reindex_part_index')
+ ORDER BY relid, level;
+ relid | parentrelid | level
+-------------------------------+-----------------------------+-------
+ concur_reindex_part_index | | 0
+ concur_reindex_part_index_0 | concur_reindex_part_index | 1
+ concur_reindex_part_index_10 | concur_reindex_part_index | 1
+ concur_reindex_part_index_0_1 | concur_reindex_part_index_0 | 2
+ concur_reindex_part_index_0_2 | concur_reindex_part_index_0 | 2
+(5 rows)
+
+REINDEX TABLE CONCURRENTLY concur_reindex_part_0_1;
+REINDEX TABLE CONCURRENTLY concur_reindex_part_0_2;
+SELECT pg_describe_object(classid, objid, objsubid) as obj,
+ pg_describe_object(refclassid,refobjid,refobjsubid) as objref,
+ deptype
+FROM pg_depend
+WHERE classid = 'pg_class'::regclass AND
+ objid in ('concur_reindex_part'::regclass,
+ 'concur_reindex_part_0'::regclass,
+ 'concur_reindex_part_0_1'::regclass,
+ 'concur_reindex_part_0_2'::regclass,
+ 'concur_reindex_part_index'::regclass,
+ 'concur_reindex_part_index_0'::regclass,
+ 'concur_reindex_part_index_0_1'::regclass,
+ 'concur_reindex_part_index_0_2'::regclass)
+ ORDER BY 1, 2;
+ obj | objref | deptype
+------------------------------------------+--------------------------------------------+---------
+ column c1 of table concur_reindex_part | table concur_reindex_part | i
+ column c2 of table concur_reindex_part_0 | table concur_reindex_part_0 | i
+ index concur_reindex_part_index | column c1 of table concur_reindex_part | a
+ index concur_reindex_part_index_0 | column c1 of table concur_reindex_part_0 | a
+ index concur_reindex_part_index_0 | index concur_reindex_part_index | P
+ index concur_reindex_part_index_0 | table concur_reindex_part_0 | S
+ index concur_reindex_part_index_0_1 | column c1 of table concur_reindex_part_0_1 | a
+ index concur_reindex_part_index_0_1 | index concur_reindex_part_index_0 | P
+ index concur_reindex_part_index_0_1 | table concur_reindex_part_0_1 | S
+ index concur_reindex_part_index_0_2 | column c1 of table concur_reindex_part_0_2 | a
+ index concur_reindex_part_index_0_2 | index concur_reindex_part_index_0 | P
+ index concur_reindex_part_index_0_2 | table concur_reindex_part_0_2 | S
+ table concur_reindex_part | schema public | n
+ table concur_reindex_part_0 | schema public | n
+ table concur_reindex_part_0 | table concur_reindex_part | a
+ table concur_reindex_part_0_1 | schema public | n
+ table concur_reindex_part_0_1 | table concur_reindex_part_0 | a
+ table concur_reindex_part_0_2 | schema public | n
+ table concur_reindex_part_0_2 | table concur_reindex_part_0 | a
+(19 rows)
+
+SELECT relid, parentrelid, level FROM pg_partition_tree('concur_reindex_part_index')
+ ORDER BY relid, level;
+ relid | parentrelid | level
+-------------------------------+-----------------------------+-------
+ concur_reindex_part_index | | 0
+ concur_reindex_part_index_0 | concur_reindex_part_index | 1
+ concur_reindex_part_index_10 | concur_reindex_part_index | 1
+ concur_reindex_part_index_0_1 | concur_reindex_part_index_0 | 2
+ concur_reindex_part_index_0_2 | concur_reindex_part_index_0 | 2
+(5 rows)
+
+-- REINDEX for partitioned indexes
+-- REINDEX TABLE fails for partitioned indexes
+-- Top-most parent index
+REINDEX TABLE concur_reindex_part_index; -- error
+ERROR: "concur_reindex_part_index" is not a table or materialized view
+REINDEX TABLE CONCURRENTLY concur_reindex_part_index; -- error
+ERROR: "concur_reindex_part_index" is not a table or materialized view
+-- Partitioned index with no leaves
+REINDEX TABLE concur_reindex_part_index_10; -- error
+ERROR: "concur_reindex_part_index_10" is not a table or materialized view
+REINDEX TABLE CONCURRENTLY concur_reindex_part_index_10; -- error
+ERROR: "concur_reindex_part_index_10" is not a table or materialized view
+-- Cannot run in a transaction block
+BEGIN;
+REINDEX INDEX concur_reindex_part_index;
+ERROR: REINDEX INDEX cannot run inside a transaction block
+CONTEXT: while reindexing partitioned index "public.concur_reindex_part_index"
+ROLLBACK;
+-- Helper functions to track changes of relfilenodes in a partition tree.
+-- Create a table tracking the relfilenode state.
+CREATE OR REPLACE FUNCTION create_relfilenode_part(relname text, indname text)
+ RETURNS VOID AS
+ $func$
+ BEGIN
+ EXECUTE format('
+ CREATE TABLE %I AS
+ SELECT oid, relname, relfilenode, relkind, reltoastrelid
+ FROM pg_class
+ WHERE oid IN
+ (SELECT relid FROM pg_partition_tree(''%I''));',
+ relname, indname);
+ END
+ $func$ LANGUAGE plpgsql;
+CREATE OR REPLACE FUNCTION compare_relfilenode_part(tabname text)
+ RETURNS TABLE (relname name, relkind "char", state text) AS
+ $func$
+ BEGIN
+ RETURN QUERY EXECUTE
+ format(
+ 'SELECT b.relname,
+ b.relkind,
+ CASE WHEN a.relfilenode = b.relfilenode THEN ''relfilenode is unchanged''
+ ELSE ''relfilenode has changed'' END
+ -- Do not join with OID here as CONCURRENTLY changes it.
+ FROM %I b JOIN pg_class a ON b.relname = a.relname
+ ORDER BY 1;', tabname);
+ END
+ $func$ LANGUAGE plpgsql;
+-- Check that expected relfilenodes are changed, non-concurrent case.
+SELECT create_relfilenode_part('reindex_index_status', 'concur_reindex_part_index');
+ create_relfilenode_part
+-------------------------
+
+(1 row)
+
+REINDEX INDEX concur_reindex_part_index;
+SELECT * FROM compare_relfilenode_part('reindex_index_status');
+ relname | relkind | state
+-------------------------------+---------+--------------------------
+ concur_reindex_part_index | I | relfilenode is unchanged
+ concur_reindex_part_index_0 | I | relfilenode is unchanged
+ concur_reindex_part_index_0_1 | i | relfilenode has changed
+ concur_reindex_part_index_0_2 | i | relfilenode has changed
+ concur_reindex_part_index_10 | I | relfilenode is unchanged
+(5 rows)
+
+DROP TABLE reindex_index_status;
+-- concurrent case.
+SELECT create_relfilenode_part('reindex_index_status', 'concur_reindex_part_index');
+ create_relfilenode_part
+-------------------------
+
+(1 row)
+
+REINDEX INDEX CONCURRENTLY concur_reindex_part_index;
+SELECT * FROM compare_relfilenode_part('reindex_index_status');
+ relname | relkind | state
+-------------------------------+---------+--------------------------
+ concur_reindex_part_index | I | relfilenode is unchanged
+ concur_reindex_part_index_0 | I | relfilenode is unchanged
+ concur_reindex_part_index_0_1 | i | relfilenode has changed
+ concur_reindex_part_index_0_2 | i | relfilenode has changed
+ concur_reindex_part_index_10 | I | relfilenode is unchanged
+(5 rows)
+
+DROP TABLE reindex_index_status;
+-- REINDEX for partitioned tables
+-- REINDEX INDEX fails for partitioned tables
+-- Top-most parent
+REINDEX INDEX concur_reindex_part; -- error
+ERROR: "concur_reindex_part" is not an index
+REINDEX INDEX CONCURRENTLY concur_reindex_part; -- error
+ERROR: "concur_reindex_part" is not an index
+-- Partitioned with no leaves
+REINDEX INDEX concur_reindex_part_10; -- error
+ERROR: "concur_reindex_part_10" is not an index
+REINDEX INDEX CONCURRENTLY concur_reindex_part_10; -- error
+ERROR: "concur_reindex_part_10" is not an index
+-- Cannot run in a transaction block
+BEGIN;
+REINDEX TABLE concur_reindex_part;
+ERROR: REINDEX TABLE cannot run inside a transaction block
+CONTEXT: while reindexing partitioned table "public.concur_reindex_part"
+ROLLBACK;
+-- Check that expected relfilenodes are changed, non-concurrent case.
+-- Note that the partition tree changes of the *indexes* need to be checked.
+SELECT create_relfilenode_part('reindex_index_status', 'concur_reindex_part_index');
+ create_relfilenode_part
+-------------------------
+
+(1 row)
+
+REINDEX TABLE concur_reindex_part;
+SELECT * FROM compare_relfilenode_part('reindex_index_status');
+ relname | relkind | state
+-------------------------------+---------+--------------------------
+ concur_reindex_part_index | I | relfilenode is unchanged
+ concur_reindex_part_index_0 | I | relfilenode is unchanged
+ concur_reindex_part_index_0_1 | i | relfilenode has changed
+ concur_reindex_part_index_0_2 | i | relfilenode has changed
+ concur_reindex_part_index_10 | I | relfilenode is unchanged
+(5 rows)
+
+DROP TABLE reindex_index_status;
+-- concurrent case.
+SELECT create_relfilenode_part('reindex_index_status', 'concur_reindex_part_index');
+ create_relfilenode_part
+-------------------------
+
+(1 row)
+
+REINDEX TABLE CONCURRENTLY concur_reindex_part;
+SELECT * FROM compare_relfilenode_part('reindex_index_status');
+ relname | relkind | state
+-------------------------------+---------+--------------------------
+ concur_reindex_part_index | I | relfilenode is unchanged
+ concur_reindex_part_index_0 | I | relfilenode is unchanged
+ concur_reindex_part_index_0_1 | i | relfilenode has changed
+ concur_reindex_part_index_0_2 | i | relfilenode has changed
+ concur_reindex_part_index_10 | I | relfilenode is unchanged
+(5 rows)
+
+DROP TABLE reindex_index_status;
+DROP FUNCTION create_relfilenode_part;
+DROP FUNCTION compare_relfilenode_part;
+-- Cleanup of partition tree used for REINDEX test.
+DROP TABLE concur_reindex_part;
+-- Check errors
+-- Cannot run inside a transaction block
+BEGIN;
+REINDEX TABLE CONCURRENTLY concur_reindex_tab;
+ERROR: REINDEX CONCURRENTLY cannot run inside a transaction block
+COMMIT;
+REINDEX TABLE CONCURRENTLY pg_class; -- no catalog relation
+ERROR: cannot reindex system catalogs concurrently
+REINDEX INDEX CONCURRENTLY pg_class_oid_index; -- no catalog index
+ERROR: cannot reindex system catalogs concurrently
+-- These are the toast table and index of pg_authid.
+REINDEX TABLE CONCURRENTLY pg_toast.pg_toast_1260; -- no catalog toast table
+ERROR: cannot reindex system catalogs concurrently
+REINDEX INDEX CONCURRENTLY pg_toast.pg_toast_1260_index; -- no catalog toast index
+ERROR: cannot reindex system catalogs concurrently
+REINDEX SYSTEM CONCURRENTLY postgres; -- not allowed for SYSTEM
+ERROR: cannot reindex system catalogs concurrently
+-- Warns about catalog relations
+REINDEX SCHEMA CONCURRENTLY pg_catalog;
+WARNING: cannot reindex system catalogs concurrently, skipping all
+-- Check the relation status, there should not be invalid indexes
+\d concur_reindex_tab
+ Table "public.concur_reindex_tab"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | not null |
+ c2 | text | | |
+Indexes:
+ "concur_reindex_ind1" PRIMARY KEY, btree (c1)
+ "concur_reindex_ind2" btree (c2)
+ "concur_reindex_ind3" UNIQUE, btree (abs(c1))
+ "concur_reindex_ind4" btree (c1, c1, c2)
+Referenced by:
+ TABLE "concur_reindex_tab2" CONSTRAINT "concur_reindex_tab2_c1_fkey" FOREIGN KEY (c1) REFERENCES concur_reindex_tab(c1)
+
+DROP MATERIALIZED VIEW concur_reindex_matview;
+DROP TABLE concur_reindex_tab, concur_reindex_tab2, concur_reindex_tab3;
+-- Check handling of invalid indexes
+CREATE TABLE concur_reindex_tab4 (c1 int);
+INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
+-- This trick creates an invalid index.
+CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+ERROR: could not create unique index "concur_reindex_ind5"
+DETAIL: Key (c1)=(1) is duplicated.
+-- Reindexing concurrently this index fails with the same failure.
+-- The extra index created is itself invalid, and can be dropped.
+REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
+ERROR: could not create unique index "concur_reindex_ind5_ccnew"
+DETAIL: Key (c1)=(1) is duplicated.
+\d concur_reindex_tab4
+ Table "public.concur_reindex_tab4"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+ "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
+
+DROP INDEX concur_reindex_ind5_ccnew;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- The invalid index is not processed when running REINDEX TABLE.
+REINDEX TABLE CONCURRENTLY concur_reindex_tab4;
+WARNING: cannot reindex invalid index "public.concur_reindex_ind5" concurrently, skipping
+NOTICE: table "concur_reindex_tab4" has no indexes that can be reindexed concurrently
+\d concur_reindex_tab4
+ Table "public.concur_reindex_tab4"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+
+-- But it is fixed with REINDEX INDEX.
+REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
+\d concur_reindex_tab4
+ Table "public.concur_reindex_tab4"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "concur_reindex_ind5" UNIQUE, btree (c1)
+
+DROP TABLE concur_reindex_tab4;
+-- Check handling of indexes with expressions and predicates. The
+-- definitions of the rebuilt indexes should match the original
+-- definitions.
+CREATE TABLE concur_exprs_tab (c1 int , c2 boolean);
+INSERT INTO concur_exprs_tab (c1, c2) VALUES (1369652450, FALSE),
+ (414515746, TRUE),
+ (897778963, FALSE);
+CREATE UNIQUE INDEX concur_exprs_index_expr
+ ON concur_exprs_tab ((c1::text COLLATE "C"));
+CREATE UNIQUE INDEX concur_exprs_index_pred ON concur_exprs_tab (c1)
+ WHERE (c1::text > 500000000::text COLLATE "C");
+CREATE UNIQUE INDEX concur_exprs_index_pred_2
+ ON concur_exprs_tab ((1 / c1))
+ WHERE ('-H') >= (c2::TEXT) COLLATE "C";
+ANALYZE concur_exprs_tab;
+SELECT starelid::regclass, count(*) FROM pg_statistic WHERE starelid IN (
+ 'concur_exprs_index_expr'::regclass,
+ 'concur_exprs_index_pred'::regclass,
+ 'concur_exprs_index_pred_2'::regclass)
+ GROUP BY starelid ORDER BY starelid::regclass::text;
+ starelid | count
+-------------------------+-------
+ concur_exprs_index_expr | 1
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_expr'::regclass);
+ pg_get_indexdef
+---------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_expr ON public.concur_exprs_tab USING btree (((c1)::text) COLLATE "C")
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred'::regclass);
+ pg_get_indexdef
+----------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred ON public.concur_exprs_tab USING btree (c1) WHERE ((c1)::text > ((500000000)::text COLLATE "C"))
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred_2'::regclass);
+ pg_get_indexdef
+--------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred_2 ON public.concur_exprs_tab USING btree (((1 / c1))) WHERE ('-H'::text >= ((c2)::text COLLATE "C"))
+(1 row)
+
+REINDEX TABLE CONCURRENTLY concur_exprs_tab;
+SELECT pg_get_indexdef('concur_exprs_index_expr'::regclass);
+ pg_get_indexdef
+---------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_expr ON public.concur_exprs_tab USING btree (((c1)::text) COLLATE "C")
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred'::regclass);
+ pg_get_indexdef
+----------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred ON public.concur_exprs_tab USING btree (c1) WHERE ((c1)::text > ((500000000)::text COLLATE "C"))
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred_2'::regclass);
+ pg_get_indexdef
+--------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred_2 ON public.concur_exprs_tab USING btree (((1 / c1))) WHERE ('-H'::text >= ((c2)::text COLLATE "C"))
+(1 row)
+
+-- ALTER TABLE recreates the indexes, which should keep their collations.
+ALTER TABLE concur_exprs_tab ALTER c2 TYPE TEXT;
+SELECT pg_get_indexdef('concur_exprs_index_expr'::regclass);
+ pg_get_indexdef
+---------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_expr ON public.concur_exprs_tab USING btree (((c1)::text) COLLATE "C")
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred'::regclass);
+ pg_get_indexdef
+----------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred ON public.concur_exprs_tab USING btree (c1) WHERE ((c1)::text > ((500000000)::text COLLATE "C"))
+(1 row)
+
+SELECT pg_get_indexdef('concur_exprs_index_pred_2'::regclass);
+ pg_get_indexdef
+------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE UNIQUE INDEX concur_exprs_index_pred_2 ON public.concur_exprs_tab USING btree (((1 / c1))) WHERE ('-H'::text >= (c2 COLLATE "C"))
+(1 row)
+
+-- Statistics should remain intact.
+SELECT starelid::regclass, count(*) FROM pg_statistic WHERE starelid IN (
+ 'concur_exprs_index_expr'::regclass,
+ 'concur_exprs_index_pred'::regclass,
+ 'concur_exprs_index_pred_2'::regclass)
+ GROUP BY starelid ORDER BY starelid::regclass::text;
+ starelid | count
+-------------------------+-------
+ concur_exprs_index_expr | 1
+(1 row)
+
+DROP TABLE concur_exprs_tab;
+-- Temporary tables and on-commit actions, where CONCURRENTLY is ignored.
+-- ON COMMIT PRESERVE ROWS, the default.
+CREATE TEMP TABLE concur_temp_tab_1 (c1 int, c2 text)
+ ON COMMIT PRESERVE ROWS;
+INSERT INTO concur_temp_tab_1 VALUES (1, 'foo'), (2, 'bar');
+CREATE INDEX concur_temp_ind_1 ON concur_temp_tab_1(c2);
+REINDEX TABLE CONCURRENTLY concur_temp_tab_1;
+REINDEX INDEX CONCURRENTLY concur_temp_ind_1;
+-- Still fails in transaction blocks
+BEGIN;
+REINDEX INDEX CONCURRENTLY concur_temp_ind_1;
+ERROR: REINDEX CONCURRENTLY cannot run inside a transaction block
+COMMIT;
+-- ON COMMIT DELETE ROWS
+CREATE TEMP TABLE concur_temp_tab_2 (c1 int, c2 text)
+ ON COMMIT DELETE ROWS;
+CREATE INDEX concur_temp_ind_2 ON concur_temp_tab_2(c2);
+REINDEX TABLE CONCURRENTLY concur_temp_tab_2;
+REINDEX INDEX CONCURRENTLY concur_temp_ind_2;
+-- ON COMMIT DROP
+BEGIN;
+CREATE TEMP TABLE concur_temp_tab_3 (c1 int, c2 text)
+ ON COMMIT PRESERVE ROWS;
+INSERT INTO concur_temp_tab_3 VALUES (1, 'foo'), (2, 'bar');
+CREATE INDEX concur_temp_ind_3 ON concur_temp_tab_3(c2);
+-- Fails when running in a transaction
+REINDEX INDEX CONCURRENTLY concur_temp_ind_3;
+ERROR: REINDEX CONCURRENTLY cannot run inside a transaction block
+COMMIT;
+-- REINDEX SCHEMA processes all temporary relations
+CREATE TABLE reindex_temp_before AS
+SELECT oid, relname, relfilenode, relkind, reltoastrelid
+ FROM pg_class
+ WHERE relname IN ('concur_temp_ind_1', 'concur_temp_ind_2');
+SELECT pg_my_temp_schema()::regnamespace as temp_schema_name \gset
+REINDEX SCHEMA CONCURRENTLY :temp_schema_name;
+SELECT b.relname,
+ b.relkind,
+ CASE WHEN a.relfilenode = b.relfilenode THEN 'relfilenode is unchanged'
+ ELSE 'relfilenode has changed' END
+ FROM reindex_temp_before b JOIN pg_class a ON b.oid = a.oid
+ ORDER BY 1;
+ relname | relkind | case
+-------------------+---------+-------------------------
+ concur_temp_ind_1 | i | relfilenode has changed
+ concur_temp_ind_2 | i | relfilenode has changed
+(2 rows)
+
+DROP TABLE concur_temp_tab_1, concur_temp_tab_2, reindex_temp_before;
+--
+-- REINDEX SCHEMA
+--
+REINDEX SCHEMA schema_to_reindex; -- failure, schema does not exist
+ERROR: schema "schema_to_reindex" does not exist
+CREATE SCHEMA schema_to_reindex;
+SET search_path = 'schema_to_reindex';
+CREATE TABLE table1(col1 SERIAL PRIMARY KEY);
+INSERT INTO table1 SELECT generate_series(1,400);
+CREATE TABLE table2(col1 SERIAL PRIMARY KEY, col2 TEXT NOT NULL);
+INSERT INTO table2 SELECT generate_series(1,400), 'abc';
+CREATE INDEX ON table2(col2);
+CREATE MATERIALIZED VIEW matview AS SELECT col1 FROM table2;
+CREATE INDEX ON matview(col1);
+CREATE VIEW view AS SELECT col2 FROM table2;
+CREATE TABLE reindex_before AS
+SELECT oid, relname, relfilenode, relkind, reltoastrelid
+ FROM pg_class
+ where relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'schema_to_reindex');
+INSERT INTO reindex_before
+SELECT oid, 'pg_toast_TABLE', relfilenode, relkind, reltoastrelid
+FROM pg_class WHERE oid IN
+ (SELECT reltoastrelid FROM reindex_before WHERE reltoastrelid > 0);
+INSERT INTO reindex_before
+SELECT oid, 'pg_toast_TABLE_index', relfilenode, relkind, reltoastrelid
+FROM pg_class where oid in
+ (select indexrelid from pg_index where indrelid in
+ (select reltoastrelid from reindex_before where reltoastrelid > 0));
+REINDEX SCHEMA schema_to_reindex;
+CREATE TABLE reindex_after AS SELECT oid, relname, relfilenode, relkind
+ FROM pg_class
+ where relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'schema_to_reindex');
+SELECT b.relname,
+ b.relkind,
+ CASE WHEN a.relfilenode = b.relfilenode THEN 'relfilenode is unchanged'
+ ELSE 'relfilenode has changed' END
+ FROM reindex_before b JOIN pg_class a ON b.oid = a.oid
+ ORDER BY 1;
+ relname | relkind | case
+------------------+---------+--------------------------
+ matview | m | relfilenode is unchanged
+ matview_col1_idx | i | relfilenode has changed
+ table1 | r | relfilenode is unchanged
+ table1_col1_seq | S | relfilenode is unchanged
+ table1_pkey | i | relfilenode has changed
+ table2 | r | relfilenode is unchanged
+ table2_col1_seq | S | relfilenode is unchanged
+ table2_col2_idx | i | relfilenode has changed
+ table2_pkey | i | relfilenode has changed
+ view | v | relfilenode is unchanged
+(10 rows)
+
+REINDEX SCHEMA schema_to_reindex;
+BEGIN;
+REINDEX SCHEMA schema_to_reindex; -- failure, cannot run in a transaction
+ERROR: REINDEX SCHEMA cannot run inside a transaction block
+END;
+-- concurrently
+REINDEX SCHEMA CONCURRENTLY schema_to_reindex;
+-- Failure for unauthorized user
+CREATE ROLE regress_reindexuser NOLOGIN;
+SET SESSION ROLE regress_reindexuser;
+REINDEX SCHEMA schema_to_reindex;
+ERROR: must be owner of schema schema_to_reindex
+-- Permission failures with toast tables and indexes (pg_authid here)
+RESET ROLE;
+GRANT USAGE ON SCHEMA pg_toast TO regress_reindexuser;
+SET SESSION ROLE regress_reindexuser;
+REINDEX TABLE pg_toast.pg_toast_1260;
+ERROR: must be owner of table pg_toast_1260
+REINDEX INDEX pg_toast.pg_toast_1260_index;
+ERROR: must be owner of index pg_toast_1260_index
+-- Clean up
+RESET ROLE;
+REVOKE USAGE ON SCHEMA pg_toast FROM regress_reindexuser;
+DROP ROLE regress_reindexuser;
+DROP SCHEMA schema_to_reindex CASCADE;
+NOTICE: drop cascades to 6 other objects
+DETAIL: drop cascades to table table1
+drop cascades to table table2
+drop cascades to materialized view matview
+drop cascades to view view
+drop cascades to table reindex_before
+drop cascades to table reindex_after
diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out
index ed8c01b8de..06eba312f4 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -275,16 +275,16 @@ CREATE TABLE IF NOT EXISTS as_select1 AS EXECUTE select1;
NOTICE: relation "as_select1" already exists, skipping
DROP TABLE as_select1;
DEALLOCATE select1;
+-- FIXME: enable this test when we introduce meta-page overflow for zedstore
-- create an extra wide table to test for issues related to that
-- (temporarily hide query, to avoid the long CREATE TABLE stmt)
-\set ECHO none
-INSERT INTO extra_wide_table(firstc, lastc) VALUES('first col', 'last col');
-SELECT firstc, lastc FROM extra_wide_table;
- firstc | lastc
------------+----------
- first col | last col
-(1 row)
-
+-- \set ECHO none
+-- SELECT 'CREATE TABLE extra_wide_table(firstc text, '|| array_to_string(array_agg('c'||i||' bool'),',')||', lastc text);'
+-- FROM generate_series(1, 1100) g(i)
+-- \gexec
+-- \set ECHO all
+-- INSERT INTO extra_wide_table(firstc, lastc) VALUES('first col', 'last col');
+-- SELECT firstc, lastc FROM extra_wide_table;
-- check that tables with oids cannot be created anymore
CREATE TABLE withoid() WITH OIDS;
ERROR: syntax error at or near "OIDS"
diff --git a/src/test/regress/expected/fsm_1.out b/src/test/regress/expected/fsm_1.out
new file mode 100644
index 0000000000..9b5f9be13a
--- /dev/null
+++ b/src/test/regress/expected/fsm_1.out
@@ -0,0 +1,73 @@
+--
+-- Free Space Map test
+--
+SELECT current_setting('block_size')::integer AS blocksize,
+current_setting('block_size')::integer / 8 AS strsize
+\gset
+CREATE TABLE fsm_check_size (num int, str text);
+-- Fill 3 blocks with one record each
+ALTER TABLE fsm_check_size SET (fillfactor=15);
+INSERT INTO fsm_check_size SELECT i, rpad('', :strsize, 'a')
+FROM generate_series(1,3) i;
+-- There should be no FSM
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'main') / :blocksize AS heap_nblocks,
+pg_relation_size('fsm_check_size', 'fsm') / :blocksize AS fsm_nblocks;
+ heap_nblocks | fsm_nblocks
+--------------+-------------
+ 5 | 0
+(1 row)
+
+-- The following operations are for testing the functionality of the local
+-- in-memory map. In particular, we want to be able to insert into some
+-- other block than the one at the end of the heap, without using a FSM.
+-- Fill most of the last block
+ALTER TABLE fsm_check_size SET (fillfactor=100);
+INSERT INTO fsm_check_size SELECT i, rpad('', :strsize, 'a')
+FROM generate_series(101,105) i;
+-- Make sure records can go into any block but the last one
+ALTER TABLE fsm_check_size SET (fillfactor=30);
+-- Insert large record and make sure it does not cause the relation to extend
+INSERT INTO fsm_check_size VALUES (111, rpad('', :strsize, 'a'));
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'main') / :blocksize AS heap_nblocks,
+pg_relation_size('fsm_check_size', 'fsm') / :blocksize AS fsm_nblocks;
+ heap_nblocks | fsm_nblocks
+--------------+-------------
+ 5 | 0
+(1 row)
+
+-- Extend table with enough blocks to exceed the FSM threshold
+DO $$
+DECLARE curtid tid;
+num int;
+BEGIN
+num = 11;
+ LOOP
+ INSERT INTO fsm_check_size VALUES (num, 'b') RETURNING ctid INTO curtid;
+ EXIT WHEN curtid >= tid '(4, 0)';
+ num = num + 1;
+ END LOOP;
+END;
+$$;
+VACUUM fsm_check_size;
+SELECT pg_relation_size('fsm_check_size', 'fsm') / :blocksize AS fsm_nblocks;
+ fsm_nblocks
+-------------
+ 0
+(1 row)
+
+-- Add long random string to extend TOAST table to 1 block
+INSERT INTO fsm_check_size
+VALUES(0, (SELECT string_agg(md5(chr(i)), '')
+ FROM generate_series(1, :blocksize / 100) i));
+VACUUM fsm_check_size;
+SELECT pg_relation_size(reltoastrelid, 'main') / :blocksize AS toast_nblocks,
+pg_relation_size(reltoastrelid, 'fsm') / :blocksize AS toast_fsm_nblocks
+FROM pg_class WHERE relname = 'fsm_check_size';
+ toast_nblocks | toast_fsm_nblocks
+---------------+-------------------
+ |
+(1 row)
+
+DROP TABLE fsm_check_size;
diff --git a/src/test/regress/expected/join_hash_1.out b/src/test/regress/expected/join_hash_1.out
new file mode 100644
index 0000000000..b745ef8358
--- /dev/null
+++ b/src/test/regress/expected/join_hash_1.out
@@ -0,0 +1,1015 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage stays within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increase just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(9 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious outer hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where no record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious outer hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+rollback;
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: ((hjtest_1.id = (SubPlan 1)) AND ((SubPlan 2) = (SubPlan 3)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan 4) < 50)
+ SubPlan 4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ -> Hash
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan 5) < 55)
+ SubPlan 5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan 1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan 3
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan 2
+ -> Result
+ Output: (hjtest_1.b * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: (((SubPlan 1) = hjtest_1.id) AND ((SubPlan 3) = (SubPlan 2)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan 5) < 55)
+ SubPlan 5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ -> Hash
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan 4) < 50)
+ SubPlan 4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan 2
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan 1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan 3
+ -> Result
+ Output: (hjtest_2.c * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+ROLLBACK;
diff --git a/src/test/regress/expected/limit_1.out b/src/test/regress/expected/limit_1.out
new file mode 100644
index 0000000000..55020ae124
--- /dev/null
+++ b/src/test/regress/expected/limit_1.out
@@ -0,0 +1,505 @@
+--
+-- LIMIT
+-- Check the LIMIT/OFFSET feature of SELECT
+--
+SELECT ''::text AS two, unique1, unique2, stringu1
+ FROM onek WHERE unique1 > 50
+ ORDER BY unique1 LIMIT 2;
+ two | unique1 | unique2 | stringu1
+-----+---------+---------+----------
+ | 51 | 76 | ZBAAAA
+ | 52 | 985 | ACAAAA
+(2 rows)
+
+SELECT ''::text AS five, unique1, unique2, stringu1
+ FROM onek WHERE unique1 > 60
+ ORDER BY unique1 LIMIT 5;
+ five | unique1 | unique2 | stringu1
+------+---------+---------+----------
+ | 61 | 560 | JCAAAA
+ | 62 | 633 | KCAAAA
+ | 63 | 296 | LCAAAA
+ | 64 | 479 | MCAAAA
+ | 65 | 64 | NCAAAA
+(5 rows)
+
+SELECT ''::text AS two, unique1, unique2, stringu1
+ FROM onek WHERE unique1 > 60 AND unique1 < 63
+ ORDER BY unique1 LIMIT 5;
+ two | unique1 | unique2 | stringu1
+-----+---------+---------+----------
+ | 61 | 560 | JCAAAA
+ | 62 | 633 | KCAAAA
+(2 rows)
+
+SELECT ''::text AS three, unique1, unique2, stringu1
+ FROM onek WHERE unique1 > 100
+ ORDER BY unique1 LIMIT 3 OFFSET 20;
+ three | unique1 | unique2 | stringu1
+-------+---------+---------+----------
+ | 121 | 700 | REAAAA
+ | 122 | 519 | SEAAAA
+ | 123 | 777 | TEAAAA
+(3 rows)
+
+SELECT ''::text AS zero, unique1, unique2, stringu1
+ FROM onek WHERE unique1 < 50
+ ORDER BY unique1 DESC LIMIT 8 OFFSET 99;
+ zero | unique1 | unique2 | stringu1
+------+---------+---------+----------
+(0 rows)
+
+SELECT ''::text AS eleven, unique1, unique2, stringu1
+ FROM onek WHERE unique1 < 50
+ ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
+ eleven | unique1 | unique2 | stringu1
+--------+---------+---------+----------
+ | 10 | 520 | KAAAAA
+ | 9 | 49 | JAAAAA
+ | 8 | 653 | IAAAAA
+ | 7 | 647 | HAAAAA
+ | 6 | 978 | GAAAAA
+ | 5 | 541 | FAAAAA
+ | 4 | 833 | EAAAAA
+ | 3 | 431 | DAAAAA
+ | 2 | 326 | CAAAAA
+ | 1 | 214 | BAAAAA
+ | 0 | 998 | AAAAAA
+(11 rows)
+
+SELECT ''::text AS ten, unique1, unique2, stringu1
+ FROM onek
+ ORDER BY unique1 OFFSET 990;
+ ten | unique1 | unique2 | stringu1
+-----+---------+---------+----------
+ | 990 | 369 | CMAAAA
+ | 991 | 426 | DMAAAA
+ | 992 | 363 | EMAAAA
+ | 993 | 661 | FMAAAA
+ | 994 | 695 | GMAAAA
+ | 995 | 144 | HMAAAA
+ | 996 | 258 | IMAAAA
+ | 997 | 21 | JMAAAA
+ | 998 | 549 | KMAAAA
+ | 999 | 152 | LMAAAA
+(10 rows)
+
+SELECT ''::text AS five, unique1, unique2, stringu1
+ FROM onek
+ ORDER BY unique1 OFFSET 990 LIMIT 5;
+ five | unique1 | unique2 | stringu1
+------+---------+---------+----------
+ | 990 | 369 | CMAAAA
+ | 991 | 426 | DMAAAA
+ | 992 | 363 | EMAAAA
+ | 993 | 661 | FMAAAA
+ | 994 | 695 | GMAAAA
+(5 rows)
+
+SELECT ''::text AS five, unique1, unique2, stringu1
+ FROM onek
+ ORDER BY unique1 LIMIT 5 OFFSET 900;
+ five | unique1 | unique2 | stringu1
+------+---------+---------+----------
+ | 900 | 913 | QIAAAA
+ | 901 | 931 | RIAAAA
+ | 902 | 702 | SIAAAA
+ | 903 | 641 | TIAAAA
+ | 904 | 793 | UIAAAA
+(5 rows)
+
+-- Test null limit and offset. The planner would discard a simple null
+-- constant, so to ensure the executor is exercised, do this:
+select * from int8_tbl limit (case when random() < 0.5 then null::bigint end);
+ q1 | q2
+------------------+-------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(5 rows)
+
+select * from int8_tbl offset (case when random() < 0.5 then null::bigint end);
+ q1 | q2
+------------------+-------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(5 rows)
+
+-- Test assorted cases involving backwards fetch from a LIMIT plan node
+begin;
+declare c1 cursor for select * from int8_tbl limit 10;
+fetch all in c1;
+ q1 | q2
+------------------+-------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(5 rows)
+
+fetch 1 in c1;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c1;
+ q1 | q2
+------------------+-------------------
+ 4567890123456789 | -4567890123456789
+(1 row)
+
+fetch backward all in c1;
+ q1 | q2
+------------------+------------------
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | 123
+ 123 | 4567890123456789
+ 123 | 456
+(4 rows)
+
+fetch backward 1 in c1;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch all in c1;
+ q1 | q2
+------------------+-------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(5 rows)
+
+declare c2 cursor for select * from int8_tbl limit 3;
+fetch all in c2;
+ q1 | q2
+------------------+------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+(3 rows)
+
+fetch 1 in c2;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c2;
+ q1 | q2
+------------------+-----
+ 4567890123456789 | 123
+(1 row)
+
+fetch backward all in c2;
+ q1 | q2
+-----+------------------
+ 123 | 4567890123456789
+ 123 | 456
+(2 rows)
+
+fetch backward 1 in c2;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch all in c2;
+ q1 | q2
+------------------+------------------
+ 123 | 456
+ 123 | 4567890123456789
+ 4567890123456789 | 123
+(3 rows)
+
+declare c3 cursor for select * from int8_tbl offset 3;
+fetch all in c3;
+ q1 | q2
+------------------+-------------------
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(2 rows)
+
+fetch 1 in c3;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c3;
+ q1 | q2
+------------------+-------------------
+ 4567890123456789 | -4567890123456789
+(1 row)
+
+fetch backward all in c3;
+ q1 | q2
+------------------+------------------
+ 4567890123456789 | 4567890123456789
+(1 row)
+
+fetch backward 1 in c3;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch all in c3;
+ q1 | q2
+------------------+-------------------
+ 4567890123456789 | 4567890123456789
+ 4567890123456789 | -4567890123456789
+(2 rows)
+
+declare c4 cursor for select * from int8_tbl offset 10;
+fetch all in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch 1 in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward all in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch backward 1 in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+fetch all in c4;
+ q1 | q2
+----+----
+(0 rows)
+
+rollback;
+-- Stress test for variable LIMIT in conjunction with bounded-heap sorting
+SELECT
+ (SELECT n
+ FROM (VALUES (1)) AS x,
+ (SELECT n FROM generate_series(1,10) AS n
+ ORDER BY n LIMIT 1 OFFSET s-1) AS y) AS z
+ FROM generate_series(1,10) AS s;
+ z
+----
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+ 10
+(10 rows)
+
+--
+-- Test behavior of volatile and set-returning functions in conjunction
+-- with ORDER BY and LIMIT.
+--
+create temp sequence testseq;
+explain (verbose, costs off)
+select unique1, unique2, nextval('testseq')
+ from tenk1 order by unique2 limit 10;
+ QUERY PLAN
+----------------------------------------------------------------
+ Limit
+ Output: unique1, unique2, (nextval('testseq'::regclass))
+ -> Index Scan using tenk1_unique2 on public.tenk1
+ Output: unique1, unique2, nextval('testseq'::regclass)
+(4 rows)
+
+select unique1, unique2, nextval('testseq')
+ from tenk1 order by unique2 limit 10;
+ unique1 | unique2 | nextval
+---------+---------+---------
+ 8800 | 0 | 1
+ 1891 | 1 | 2
+ 3420 | 2 | 3
+ 9850 | 3 | 4
+ 7164 | 4 | 5
+ 8009 | 5 | 6
+ 5057 | 6 | 7
+ 6701 | 7 | 8
+ 4321 | 8 | 9
+ 3043 | 9 | 10
+(10 rows)
+
+select currval('testseq');
+ currval
+---------
+ 10
+(1 row)
+
+explain (verbose, costs off)
+select unique1, unique2, nextval('testseq')
+ from tenk1 order by tenthous limit 10;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Limit
+ Output: unique1, unique2, (nextval('testseq'::regclass)), tenthous
+ -> Result
+ Output: unique1, unique2, nextval('testseq'::regclass), tenthous
+ -> Sort
+ Output: unique1, unique2, tenthous
+ Sort Key: tenk1.tenthous
+ -> Seq Scan on public.tenk1
+ Output: unique1, unique2, tenthous
+(9 rows)
+
+select unique1, unique2, nextval('testseq')
+ from tenk1 order by tenthous limit 10;
+ unique1 | unique2 | nextval
+---------+---------+---------
+ 0 | 9998 | 11
+ 1 | 2838 | 12
+ 2 | 2716 | 13
+ 3 | 5679 | 14
+ 4 | 1621 | 15
+ 5 | 5557 | 16
+ 6 | 2855 | 17
+ 7 | 8518 | 18
+ 8 | 5435 | 19
+ 9 | 4463 | 20
+(10 rows)
+
+select currval('testseq');
+ currval
+---------
+ 20
+(1 row)
+
+explain (verbose, costs off)
+select unique1, unique2, generate_series(1,10)
+ from tenk1 order by unique2 limit 7;
+ QUERY PLAN
+------------------------------------------------------------
+ Limit
+ Output: unique1, unique2, (generate_series(1, 10))
+ -> ProjectSet
+ Output: unique1, unique2, generate_series(1, 10)
+ -> Index Scan using tenk1_unique2 on public.tenk1
+ Output: unique1, unique2
+(6 rows)
+
+select unique1, unique2, generate_series(1,10)
+ from tenk1 order by unique2 limit 7;
+ unique1 | unique2 | generate_series
+---------+---------+-----------------
+ 8800 | 0 | 1
+ 8800 | 0 | 2
+ 8800 | 0 | 3
+ 8800 | 0 | 4
+ 8800 | 0 | 5
+ 8800 | 0 | 6
+ 8800 | 0 | 7
+(7 rows)
+
+explain (verbose, costs off)
+select unique1, unique2, generate_series(1,10)
+ from tenk1 order by tenthous limit 7;
+ QUERY PLAN
+--------------------------------------------------------------------
+ Limit
+ Output: unique1, unique2, (generate_series(1, 10)), tenthous
+ -> ProjectSet
+ Output: unique1, unique2, generate_series(1, 10), tenthous
+ -> Sort
+ Output: unique1, unique2, tenthous
+ Sort Key: tenk1.tenthous
+ -> Seq Scan on public.tenk1
+ Output: unique1, unique2, tenthous
+(9 rows)
+
+select unique1, unique2, generate_series(1,10)
+ from tenk1 order by tenthous limit 7;
+ unique1 | unique2 | generate_series
+---------+---------+-----------------
+ 0 | 9998 | 1
+ 0 | 9998 | 2
+ 0 | 9998 | 3
+ 0 | 9998 | 4
+ 0 | 9998 | 5
+ 0 | 9998 | 6
+ 0 | 9998 | 7
+(7 rows)
+
+-- use of random() is to keep the planner from folding the expressions together
+explain (verbose, costs off)
+select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------
+ ProjectSet
+ Output: generate_series(0, 2), generate_series(((random() * '0.1'::double precision))::integer, 2)
+ -> Result
+(3 rows)
+
+select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2;
+ s1 | s2
+----+----
+ 0 | 0
+ 1 | 1
+ 2 | 2
+(3 rows)
+
+explain (verbose, costs off)
+select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2
+order by s2 desc;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------
+ Sort
+ Output: (generate_series(0, 2)), (generate_series(((random() * '0.1'::double precision))::integer, 2))
+ Sort Key: (generate_series(((random() * '0.1'::double precision))::integer, 2)) DESC
+ -> ProjectSet
+ Output: generate_series(0, 2), generate_series(((random() * '0.1'::double precision))::integer, 2)
+ -> Result
+(6 rows)
+
+select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2
+order by s2 desc;
+ s1 | s2
+----+----
+ 2 | 2
+ 1 | 1
+ 0 | 0
+(3 rows)
+
+-- test for failure to set all aggregates' aggtranstype
+explain (verbose, costs off)
+select sum(tenthous) as s1, sum(tenthous) + random()*0 as s2
+ from tenk1 group by thousand order by thousand limit 3;
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------
+ Limit
+ Output: (sum(tenthous)), (((sum(tenthous))::double precision + (random() * '0'::double precision))), thousand
+ -> GroupAggregate
+ Output: sum(tenthous), ((sum(tenthous))::double precision + (random() * '0'::double precision)), thousand
+ Group Key: tenk1.thousand
+ -> Index Only Scan using tenk1_thous_tenthous on public.tenk1
+ Output: thousand, tenthous
+(7 rows)
+
+select sum(tenthous) as s1, sum(tenthous) + random()*0 as s2
+ from tenk1 group by thousand order by thousand limit 3;
+ s1 | s2
+-------+-------
+ 45000 | 45000
+ 45010 | 45010
+ 45020 | 45020
+(3 rows)
+
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 7ed29b4961..4354387828 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -869,6 +869,12 @@ xid8ge(xid8,xid8)
xid8eq(xid8,xid8)
xid8ne(xid8,xid8)
xid8cmp(xid8,xid8)
+zstideq(zstid,zstid)
+zstidle(zstid,zstid)
+zstidge(zstid,zstid)
+zstidgt(zstid,zstid)
+zstidlt(zstid,zstid)
+zstidne(zstid,zstid)
-- restore normal output mode
\a\t
-- List of functions used by libpq's fe-lobj.c
diff --git a/src/test/regress/expected/psql_1.out b/src/test/regress/expected/psql_1.out
new file mode 100644
index 0000000000..ab1eed5b43
--- /dev/null
+++ b/src/test/regress/expected/psql_1.out
@@ -0,0 +1,5078 @@
+--
+-- Tests for psql features that aren't closely connected to any
+-- specific server features
+--
+-- \set
+-- fail: invalid name
+\set invalid/name foo
+invalid variable name: "invalid/name"
+-- fail: invalid value for special variable
+\set AUTOCOMMIT foo
+unrecognized value "foo" for "AUTOCOMMIT": Boolean expected
+\set FETCH_COUNT foo
+invalid value "foo" for "FETCH_COUNT": integer expected
+-- check handling of built-in boolean variable
+\echo :ON_ERROR_ROLLBACK
+off
+\set ON_ERROR_ROLLBACK
+\echo :ON_ERROR_ROLLBACK
+on
+\set ON_ERROR_ROLLBACK foo
+unrecognized value "foo" for "ON_ERROR_ROLLBACK"
+Available values are: on, off, interactive.
+\echo :ON_ERROR_ROLLBACK
+on
+\set ON_ERROR_ROLLBACK on
+\echo :ON_ERROR_ROLLBACK
+on
+\unset ON_ERROR_ROLLBACK
+\echo :ON_ERROR_ROLLBACK
+off
+-- \g and \gx
+SELECT 1 as one, 2 as two \g
+ one | two
+-----+-----
+ 1 | 2
+(1 row)
+
+\gx
+-[ RECORD 1 ]
+one | 1
+two | 2
+
+SELECT 3 as three, 4 as four \gx
+-[ RECORD 1 ]
+three | 3
+four | 4
+
+\g
+ three | four
+-------+------
+ 3 | 4
+(1 row)
+
+-- \gx should work in FETCH_COUNT mode too
+\set FETCH_COUNT 1
+SELECT 1 as one, 2 as two \g
+ one | two
+-----+-----
+ 1 | 2
+(1 row)
+
+\gx
+-[ RECORD 1 ]
+one | 1
+two | 2
+
+SELECT 3 as three, 4 as four \gx
+-[ RECORD 1 ]
+three | 3
+four | 4
+
+\g
+ three | four
+-------+------
+ 3 | 4
+(1 row)
+
+\unset FETCH_COUNT
+-- \g/\gx with pset options
+SELECT 1 as one, 2 as two \g (format=csv csv_fieldsep='\t')
+one two
+1 2
+\g
+ one | two
+-----+-----
+ 1 | 2
+(1 row)
+
+SELECT 1 as one, 2 as two \gx (title='foo bar')
+foo bar
+-[ RECORD 1 ]
+one | 1
+two | 2
+
+\g
+ one | two
+-----+-----
+ 1 | 2
+(1 row)
+
+-- \gset
+select 10 as test01, 20 as test02, 'Hello' as test03 \gset pref01_
+\echo :pref01_test01 :pref01_test02 :pref01_test03
+10 20 Hello
+-- should fail: bad variable name
+select 10 as "bad name"
+\gset
+invalid variable name: "bad name"
+-- multiple backslash commands in one line
+select 1 as x, 2 as y \gset pref01_ \\ \echo :pref01_x
+1
+select 3 as x, 4 as y \gset pref01_ \echo :pref01_x \echo :pref01_y
+3
+4
+select 5 as x, 6 as y \gset pref01_ \\ \g \echo :pref01_x :pref01_y
+ x | y
+---+---
+ 5 | 6
+(1 row)
+
+5 6
+select 7 as x, 8 as y \g \gset pref01_ \echo :pref01_x :pref01_y
+ x | y
+---+---
+ 7 | 8
+(1 row)
+
+7 8
+-- NULL should unset the variable
+\set var2 xyz
+select 1 as var1, NULL as var2, 3 as var3 \gset
+\echo :var1 :var2 :var3
+1 :var2 3
+-- \gset requires just one tuple
+select 10 as test01, 20 as test02 from generate_series(1,3) \gset
+more than one row returned for \gset
+select 10 as test01, 20 as test02 from generate_series(1,0) \gset
+no rows returned for \gset
+-- \gset should work in FETCH_COUNT mode too
+\set FETCH_COUNT 1
+select 1 as x, 2 as y \gset pref01_ \\ \echo :pref01_x
+1
+select 3 as x, 4 as y \gset pref01_ \echo :pref01_x \echo :pref01_y
+3
+4
+select 10 as test01, 20 as test02 from generate_series(1,3) \gset
+more than one row returned for \gset
+select 10 as test01, 20 as test02 from generate_series(1,0) \gset
+no rows returned for \gset
+\unset FETCH_COUNT
+-- \gdesc
+SELECT
+ NULL AS zero,
+ 1 AS one,
+ 2.0 AS two,
+ 'three' AS three,
+ $1 AS four,
+ sin($2) as five,
+ 'foo'::varchar(4) as six,
+ CURRENT_DATE AS now
+\gdesc
+ Column | Type
+--------+----------------------
+ zero | text
+ one | integer
+ two | numeric
+ three | text
+ four | text
+ five | double precision
+ six | character varying(4)
+ now | date
+(8 rows)
+
+-- should work with tuple-returning utilities, such as EXECUTE
+PREPARE test AS SELECT 1 AS first, 2 AS second;
+EXECUTE test \gdesc
+ Column | Type
+--------+---------
+ first | integer
+ second | integer
+(2 rows)
+
+EXPLAIN EXECUTE test \gdesc
+ Column | Type
+------------+------
+ QUERY PLAN | text
+(1 row)
+
+-- should fail cleanly - syntax error
+SELECT 1 + \gdesc
+ERROR: syntax error at end of input
+LINE 1: SELECT 1 +
+ ^
+-- check behavior with empty results
+SELECT \gdesc
+The command has no result, or the result has no columns.
+CREATE TABLE bububu(a int) \gdesc
+The command has no result, or the result has no columns.
+-- subject command should not have executed
+TABLE bububu; -- fail
+ERROR: relation "bububu" does not exist
+LINE 1: TABLE bububu;
+ ^
+-- query buffer should remain unchanged
+SELECT 1 AS x, 'Hello', 2 AS y, true AS "dirty\name"
+\gdesc
+ Column | Type
+------------+---------
+ x | integer
+ ?column? | text
+ y | integer
+ dirty\name | boolean
+(4 rows)
+
+\g
+ x | ?column? | y | dirty\name
+---+----------+---+------------
+ 1 | Hello | 2 | t
+(1 row)
+
+-- all on one line
+SELECT 3 AS x, 'Hello', 4 AS y, true AS "dirty\name" \gdesc \g
+ Column | Type
+------------+---------
+ x | integer
+ ?column? | text
+ y | integer
+ dirty\name | boolean
+(4 rows)
+
+ x | ?column? | y | dirty\name
+---+----------+---+------------
+ 3 | Hello | 4 | t
+(1 row)
+
+-- \gexec
+create temporary table gexec_test(a int, b text, c date, d float);
+select format('create index on gexec_test(%I)', attname)
+from pg_attribute
+where attrelid = 'gexec_test'::regclass and attnum > 0
+order by attnum
+\gexec
+create index on gexec_test(a)
+create index on gexec_test(b)
+create index on gexec_test(c)
+create index on gexec_test(d)
+-- \gexec should work in FETCH_COUNT mode too
+-- (though the fetch limit applies to the executed queries not the meta query)
+\set FETCH_COUNT 1
+select 'select 1 as ones', 'select x.y, x.y*2 as double from generate_series(1,4) as x(y)'
+union all
+select 'drop table gexec_test', NULL
+union all
+select 'drop table gexec_test', 'select ''2000-01-01''::date as party_over'
+\gexec
+select 1 as ones
+ ones
+------
+ 1
+(1 row)
+
+select x.y, x.y*2 as double from generate_series(1,4) as x(y)
+ y | double
+---+--------
+ 1 | 2
+ 2 | 4
+ 3 | 6
+ 4 | 8
+(4 rows)
+
+drop table gexec_test
+drop table gexec_test
+ERROR: table "gexec_test" does not exist
+select '2000-01-01'::date as party_over
+ party_over
+------------
+ 01-01-2000
+(1 row)
+
+\unset FETCH_COUNT
+-- show all pset options
+\pset
+border 1
+columns 0
+csv_fieldsep ','
+expanded off
+fieldsep '|'
+fieldsep_zero off
+footer on
+format aligned
+linestyle ascii
+null ''
+numericlocale off
+pager 1
+pager_min_lines 0
+recordsep '\n'
+recordsep_zero off
+tableattr
+title
+tuples_only off
+unicode_border_linestyle single
+unicode_column_linestyle single
+unicode_header_linestyle single
+-- test multi-line headers, wrapping, and newline indicators
+-- in aligned, unaligned, and wrapped formats
+prepare q as select array_to_string(array_agg(repeat('x',2*n)),E'\n') as "ab
+
+c", array_to_string(array_agg(repeat('y',20-2*n)),E'\n') as "a
+bc" from generate_series(1,10) as n(n) group by n>1 order by n>1;
+\pset linestyle ascii
+\pset expanded off
+\pset columns 40
+\pset border 0
+\pset format unaligned
+execute q;
+ab
+
+c|a
+bc
+xx|yyyyyyyyyyyyyyyyyy
+xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+(2 rows)
+\pset format aligned
+execute q;
+ ab + a +
+ + bc
+ c
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx +yyyyyyyyyyyyyyyy +
+xxxxxx +yyyyyyyyyyyyyy +
+xxxxxxxx +yyyyyyyyyyyy +
+xxxxxxxxxx +yyyyyyyyyy +
+xxxxxxxxxxxx +yyyyyyyy +
+xxxxxxxxxxxxxx +yyyyyy +
+xxxxxxxxxxxxxxxx +yyyy +
+xxxxxxxxxxxxxxxxxx +yy +
+xxxxxxxxxxxxxxxxxxxx
+(2 rows)
+
+\pset format wrapped
+execute q;
+ ab + a +
+ + bc
+ c
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx +yyyyyyyyyyyyyyyy +
+xxxxxx +yyyyyyyyyyyyyy +
+xxxxxxxx +yyyyyyyyyyyy +
+xxxxxxxxxx +yyyyyyyyyy +
+xxxxxxxxxxxx +yyyyyyyy +
+xxxxxxxxxxxxxx +yyyyyy +
+xxxxxxxxxxxxxxxx +yyyy +
+xxxxxxxxxxxxxxxxxx +yy +
+xxxxxxxxxxxxxxxxxxxx
+(2 rows)
+
+\pset border 1
+\pset format unaligned
+execute q;
+ab
+
+c|a
+bc
+xx|yyyyyyyyyyyyyyyyyy
+xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+(2 rows)
+\pset format aligned
+execute q;
+ ab +| a +
+ +| bc
+ c |
+----------------------+--------------------
+ xx | yyyyyyyyyyyyyyyyyy
+ xxxx +| yyyyyyyyyyyyyyyy +
+ xxxxxx +| yyyyyyyyyyyyyy +
+ xxxxxxxx +| yyyyyyyyyyyy +
+ xxxxxxxxxx +| yyyyyyyyyy +
+ xxxxxxxxxxxx +| yyyyyyyy +
+ xxxxxxxxxxxxxx +| yyyyyy +
+ xxxxxxxxxxxxxxxx +| yyyy +
+ xxxxxxxxxxxxxxxxxx +| yy +
+ xxxxxxxxxxxxxxxxxxxx |
+(2 rows)
+
+\pset format wrapped
+execute q;
+ ab +| a +
+ +| bc
+ c |
+-------------------+--------------------
+ xx | yyyyyyyyyyyyyyyyyy
+ xxxx +| yyyyyyyyyyyyyyyy +
+ xxxxxx +| yyyyyyyyyyyyyy +
+ xxxxxxxx +| yyyyyyyyyyyy +
+ xxxxxxxxxx +| yyyyyyyyyy +
+ xxxxxxxxxxxx +| yyyyyyyy +
+ xxxxxxxxxxxxxx +| yyyyyy +
+ xxxxxxxxxxxxxxxx +| yyyy +
+ xxxxxxxxxxxxxxxxx.| yy +
+.x +|
+ xxxxxxxxxxxxxxxxx.|
+.xxx |
+(2 rows)
+
+\pset border 2
+\pset format unaligned
+execute q;
+ab
+
+c|a
+bc
+xx|yyyyyyyyyyyyyyyyyy
+xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+(2 rows)
+\pset format aligned
+execute q;
++----------------------+--------------------+
+| ab +| a +|
+| +| bc |
+| c | |
++----------------------+--------------------+
+| xx | yyyyyyyyyyyyyyyyyy |
+| xxxx +| yyyyyyyyyyyyyyyy +|
+| xxxxxx +| yyyyyyyyyyyyyy +|
+| xxxxxxxx +| yyyyyyyyyyyy +|
+| xxxxxxxxxx +| yyyyyyyyyy +|
+| xxxxxxxxxxxx +| yyyyyyyy +|
+| xxxxxxxxxxxxxx +| yyyyyy +|
+| xxxxxxxxxxxxxxxx +| yyyy +|
+| xxxxxxxxxxxxxxxxxx +| yy +|
+| xxxxxxxxxxxxxxxxxxxx | |
++----------------------+--------------------+
+(2 rows)
+
+\pset format wrapped
+execute q;
++-----------------+--------------------+
+| ab +| a +|
+| +| bc |
+| c | |
++-----------------+--------------------+
+| xx | yyyyyyyyyyyyyyyyyy |
+| xxxx +| yyyyyyyyyyyyyyyy +|
+| xxxxxx +| yyyyyyyyyyyyyy +|
+| xxxxxxxx +| yyyyyyyyyyyy +|
+| xxxxxxxxxx +| yyyyyyyyyy +|
+| xxxxxxxxxxxx +| yyyyyyyy +|
+| xxxxxxxxxxxxxx +| yyyyyy +|
+| xxxxxxxxxxxxxxx.| yyyy +|
+|.x +| yy +|
+| xxxxxxxxxxxxxxx.| |
+|.xxx +| |
+| xxxxxxxxxxxxxxx.| |
+|.xxxxx | |
++-----------------+--------------------+
+(2 rows)
+
+\pset expanded on
+\pset columns 20
+\pset border 0
+\pset format unaligned
+execute q;
+ab
+
+c|xx
+a
+bc|yyyyyyyyyyyyyyyyyy
+
+ab
+
+c|xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx
+a
+bc|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+\pset format aligned
+execute q;
+* Record 1
+ab+ xx
+ +
+c
+a + yyyyyyyyyyyyyyyyyy
+bc
+* Record 2
+ab+ xxxx +
+ + xxxxxx +
+c xxxxxxxx +
+ xxxxxxxxxx +
+ xxxxxxxxxxxx +
+ xxxxxxxxxxxxxx +
+ xxxxxxxxxxxxxxxx +
+ xxxxxxxxxxxxxxxxxx +
+ xxxxxxxxxxxxxxxxxxxx
+a + yyyyyyyyyyyyyyyy +
+bc yyyyyyyyyyyyyy +
+ yyyyyyyyyyyy +
+ yyyyyyyyyy +
+ yyyyyyyy +
+ yyyyyy +
+ yyyy +
+ yy +
+
+
+\pset format wrapped
+execute q;
+* Record 1
+ab+ xx
+ +
+c
+a + yyyyyyyyyyyyyyy.
+bc .yyy
+* Record 2
+ab+ xxxx +
+ + xxxxxx +
+c xxxxxxxx +
+ xxxxxxxxxx +
+ xxxxxxxxxxxx +
+ xxxxxxxxxxxxxx +
+ xxxxxxxxxxxxxxx.
+ .x +
+ xxxxxxxxxxxxxxx.
+ .xxx +
+ xxxxxxxxxxxxxxx.
+ .xxxxx
+a + yyyyyyyyyyyyyyy.
+bc .y +
+ yyyyyyyyyyyyyy +
+ yyyyyyyyyyyy +
+ yyyyyyyyyy +
+ yyyyyyyy +
+ yyyyyy +
+ yyyy +
+ yy +
+
+
+\pset border 1
+\pset format unaligned
+execute q;
+ab
+
+c|xx
+a
+bc|yyyyyyyyyyyyyyyyyy
+
+ab
+
+c|xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx
+a
+bc|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+\pset format aligned
+execute q;
+-[ RECORD 1 ]------------
+ab+| xx
+ +|
+c |
+a +| yyyyyyyyyyyyyyyyyy
+bc |
+-[ RECORD 2 ]------------
+ab+| xxxx +
+ +| xxxxxx +
+c | xxxxxxxx +
+ | xxxxxxxxxx +
+ | xxxxxxxxxxxx +
+ | xxxxxxxxxxxxxx +
+ | xxxxxxxxxxxxxxxx +
+ | xxxxxxxxxxxxxxxxxx +
+ | xxxxxxxxxxxxxxxxxxxx
+a +| yyyyyyyyyyyyyyyy +
+bc | yyyyyyyyyyyyyy +
+ | yyyyyyyyyyyy +
+ | yyyyyyyyyy +
+ | yyyyyyyy +
+ | yyyyyy +
+ | yyyy +
+ | yy +
+ |
+
+\pset format wrapped
+execute q;
+-[ RECORD 1 ]------
+ab+| xx
+ +|
+c |
+a +| yyyyyyyyyyyyyy.
+bc |.yyyy
+-[ RECORD 2 ]------
+ab+| xxxx +
+ +| xxxxxx +
+c | xxxxxxxx +
+ | xxxxxxxxxx +
+ | xxxxxxxxxxxx +
+ | xxxxxxxxxxxxxx+
+ | xxxxxxxxxxxxxx.
+ |.xx +
+ | xxxxxxxxxxxxxx.
+ |.xxxx +
+ | xxxxxxxxxxxxxx.
+ |.xxxxxx
+a +| yyyyyyyyyyyyyy.
+bc |.yy +
+ | yyyyyyyyyyyyyy+
+ | yyyyyyyyyyyy +
+ | yyyyyyyyyy +
+ | yyyyyyyy +
+ | yyyyyy +
+ | yyyy +
+ | yy +
+ |
+
+\pset border 2
+\pset format unaligned
+execute q;
+ab
+
+c|xx
+a
+bc|yyyyyyyyyyyyyyyyyy
+
+ab
+
+c|xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx
+a
+bc|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+\pset format aligned
+execute q;
++-[ RECORD 1 ]--------------+
+| ab+| xx |
+| +| |
+| c | |
+| a +| yyyyyyyyyyyyyyyyyy |
+| bc | |
++-[ RECORD 2 ]--------------+
+| ab+| xxxx +|
+| +| xxxxxx +|
+| c | xxxxxxxx +|
+| | xxxxxxxxxx +|
+| | xxxxxxxxxxxx +|
+| | xxxxxxxxxxxxxx +|
+| | xxxxxxxxxxxxxxxx +|
+| | xxxxxxxxxxxxxxxxxx +|
+| | xxxxxxxxxxxxxxxxxxxx |
+| a +| yyyyyyyyyyyyyyyy +|
+| bc | yyyyyyyyyyyyyy +|
+| | yyyyyyyyyyyy +|
+| | yyyyyyyyyy +|
+| | yyyyyyyy +|
+| | yyyyyy +|
+| | yyyy +|
+| | yy +|
+| | |
++----+----------------------+
+
+\pset format wrapped
+execute q;
++-[ RECORD 1 ]-----+
+| ab+| xx |
+| +| |
+| c | |
+| a +| yyyyyyyyyyy.|
+| bc |.yyyyyyy |
++-[ RECORD 2 ]-----+
+| ab+| xxxx +|
+| +| xxxxxx +|
+| c | xxxxxxxx +|
+| | xxxxxxxxxx +|
+| | xxxxxxxxxxx.|
+| |.x +|
+| | xxxxxxxxxxx.|
+| |.xxx +|
+| | xxxxxxxxxxx.|
+| |.xxxxx +|
+| | xxxxxxxxxxx.|
+| |.xxxxxxx +|
+| | xxxxxxxxxxx.|
+| |.xxxxxxxxx |
+| a +| yyyyyyyyyyy.|
+| bc |.yyyyy +|
+| | yyyyyyyyyyy.|
+| |.yyy +|
+| | yyyyyyyyyyy.|
+| |.y +|
+| | yyyyyyyyyy +|
+| | yyyyyyyy +|
+| | yyyyyy +|
+| | yyyy +|
+| | yy +|
+| | |
++----+-------------+
+
+\pset linestyle old-ascii
+\pset expanded off
+\pset columns 40
+\pset border 0
+\pset format unaligned
+execute q;
+ab
+
+c|a
+bc
+xx|yyyyyyyyyyyyyyyyyy
+xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+(2 rows)
+\pset format aligned
+execute q;
+ ab a
+ + bc
+ c +
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx yyyyyyyyyyyyyyyy
+xxxxxx yyyyyyyyyyyyyy
+xxxxxxxx yyyyyyyyyyyy
+xxxxxxxxxx yyyyyyyyyy
+xxxxxxxxxxxx yyyyyyyy
+xxxxxxxxxxxxxx yyyyyy
+xxxxxxxxxxxxxxxx yyyy
+xxxxxxxxxxxxxxxxxx yy
+xxxxxxxxxxxxxxxxxxxx
+(2 rows)
+
+\pset format wrapped
+execute q;
+ ab a
+ + bc
+ c +
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx yyyyyyyyyyyyyyyy
+xxxxxx yyyyyyyyyyyyyy
+xxxxxxxx yyyyyyyyyyyy
+xxxxxxxxxx yyyyyyyyyy
+xxxxxxxxxxxx yyyyyyyy
+xxxxxxxxxxxxxx yyyyyy
+xxxxxxxxxxxxxxxx yyyy
+xxxxxxxxxxxxxxxxxx yy
+xxxxxxxxxxxxxxxxxxxx
+(2 rows)
+
+\pset border 1
+\pset format unaligned
+execute q;
+ab
+
+c|a
+bc
+xx|yyyyyyyyyyyyyyyyyy
+xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+(2 rows)
+\pset format aligned
+execute q;
+ ab | a
++ |+ bc
++ c |+
+----------------------+--------------------
+ xx | yyyyyyyyyyyyyyyyyy
+ xxxx | yyyyyyyyyyyyyyyy
+ xxxxxx : yyyyyyyyyyyyyy
+ xxxxxxxx : yyyyyyyyyyyy
+ xxxxxxxxxx : yyyyyyyyyy
+ xxxxxxxxxxxx : yyyyyyyy
+ xxxxxxxxxxxxxx : yyyyyy
+ xxxxxxxxxxxxxxxx : yyyy
+ xxxxxxxxxxxxxxxxxx : yy
+ xxxxxxxxxxxxxxxxxxxx :
+(2 rows)
+
+\pset format wrapped
+execute q;
+ ab | a
++ |+ bc
++ c |+
+-------------------+--------------------
+ xx | yyyyyyyyyyyyyyyyyy
+ xxxx | yyyyyyyyyyyyyyyy
+ xxxxxx : yyyyyyyyyyyyyy
+ xxxxxxxx : yyyyyyyyyyyy
+ xxxxxxxxxx : yyyyyyyyyy
+ xxxxxxxxxxxx : yyyyyyyy
+ xxxxxxxxxxxxxx : yyyyyy
+ xxxxxxxxxxxxxxxx : yyyy
+ xxxxxxxxxxxxxxxxx : yy
+ x :
+ xxxxxxxxxxxxxxxxx
+ xxx
+(2 rows)
+
+\pset border 2
+\pset format unaligned
+execute q;
+ab
+
+c|a
+bc
+xx|yyyyyyyyyyyyyyyyyy
+xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+(2 rows)
+\pset format aligned
+execute q;
++----------------------+--------------------+
+| ab | a |
+|+ |+ bc |
+|+ c |+ |
++----------------------+--------------------+
+| xx | yyyyyyyyyyyyyyyyyy |
+| xxxx | yyyyyyyyyyyyyyyy |
+| xxxxxx : yyyyyyyyyyyyyy |
+| xxxxxxxx : yyyyyyyyyyyy |
+| xxxxxxxxxx : yyyyyyyyyy |
+| xxxxxxxxxxxx : yyyyyyyy |
+| xxxxxxxxxxxxxx : yyyyyy |
+| xxxxxxxxxxxxxxxx : yyyy |
+| xxxxxxxxxxxxxxxxxx : yy |
+| xxxxxxxxxxxxxxxxxxxx : |
++----------------------+--------------------+
+(2 rows)
+
+\pset format wrapped
+execute q;
++-----------------+--------------------+
+| ab | a |
+|+ |+ bc |
+|+ c |+ |
++-----------------+--------------------+
+| xx | yyyyyyyyyyyyyyyyyy |
+| xxxx | yyyyyyyyyyyyyyyy |
+| xxxxxx : yyyyyyyyyyyyyy |
+| xxxxxxxx : yyyyyyyyyyyy |
+| xxxxxxxxxx : yyyyyyyyyy |
+| xxxxxxxxxxxx : yyyyyyyy |
+| xxxxxxxxxxxxxx : yyyyyy |
+| xxxxxxxxxxxxxxx : yyyy |
+| x : yy |
+| xxxxxxxxxxxxxxx : |
+| xxx |
+| xxxxxxxxxxxxxxx |
+| xxxxx |
++-----------------+--------------------+
+(2 rows)
+
+\pset expanded on
+\pset columns 20
+\pset border 0
+\pset format unaligned
+execute q;
+ab
+
+c|xx
+a
+bc|yyyyyyyyyyyyyyyyyy
+
+ab
+
+c|xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx
+a
+bc|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+\pset format aligned
+execute q;
+* Record 1
+ ab xx
++
++c
+ a yyyyyyyyyyyyyyyyyy
++bc
+* Record 2
+ ab xxxx
++ xxxxxx
++c xxxxxxxx
+ xxxxxxxxxx
+ xxxxxxxxxxxx
+ xxxxxxxxxxxxxx
+ xxxxxxxxxxxxxxxx
+ xxxxxxxxxxxxxxxxxx
+ xxxxxxxxxxxxxxxxxxxx
+ a yyyyyyyyyyyyyyyy
++bc yyyyyyyyyyyyyy
+ yyyyyyyyyyyy
+ yyyyyyyyyy
+ yyyyyyyy
+ yyyyyy
+ yyyy
+ yy
+
+
+\pset format wrapped
+execute q;
+* Record 1
+ ab xx
++
++c
+ a yyyyyyyyyyyyyyyy
++bc yy
+* Record 2
+ ab xxxx
++ xxxxxx
++c xxxxxxxx
+ xxxxxxxxxx
+ xxxxxxxxxxxx
+ xxxxxxxxxxxxxx
+ xxxxxxxxxxxxxxxx
+ xxxxxxxxxxxxxxxx
+ xx
+ xxxxxxxxxxxxxxxx
+ xxxx
+ a yyyyyyyyyyyyyyyy
++bc yyyyyyyyyyyyyy
+ yyyyyyyyyyyy
+ yyyyyyyyyy
+ yyyyyyyy
+ yyyyyy
+ yyyy
+ yy
+
+
+\pset border 1
+\pset format unaligned
+execute q;
+ab
+
+c|xx
+a
+bc|yyyyyyyyyyyyyyyyyy
+
+ab
+
+c|xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx
+a
+bc|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+\pset format aligned
+execute q;
+-[ RECORD 1 ]-------------
+ ab | xx
++ ;
++c ;
+ a | yyyyyyyyyyyyyyyyyy
++bc ;
+-[ RECORD 2 ]-------------
+ ab | xxxx
++ : xxxxxx
++c : xxxxxxxx
+ : xxxxxxxxxx
+ : xxxxxxxxxxxx
+ : xxxxxxxxxxxxxx
+ : xxxxxxxxxxxxxxxx
+ : xxxxxxxxxxxxxxxxxx
+ : xxxxxxxxxxxxxxxxxxxx
+ a | yyyyyyyyyyyyyyyy
++bc : yyyyyyyyyyyyyy
+ : yyyyyyyyyyyy
+ : yyyyyyyyyy
+ : yyyyyyyy
+ : yyyyyy
+ : yyyy
+ : yy
+ :
+
+\pset format wrapped
+execute q;
+-[ RECORD 1 ]-------
+ ab | xx
++ ;
++c ;
+ a | yyyyyyyyyyyyyy
++bc ; yyyy
+-[ RECORD 2 ]-------
+ ab | xxxx
++ : xxxxxx
++c : xxxxxxxx
+ : xxxxxxxxxx
+ : xxxxxxxxxxxx
+ : xxxxxxxxxxxxxx
+ : xxxxxxxxxxxxxx
+ ; xx
+ : xxxxxxxxxxxxxx
+ ; xxxx
+ : xxxxxxxxxxxxxx
+ ; xxxxxx
+ a | yyyyyyyyyyyyyy
++bc ; yy
+ : yyyyyyyyyyyyyy
+ : yyyyyyyyyyyy
+ : yyyyyyyyyy
+ : yyyyyyyy
+ : yyyyyy
+ : yyyy
+ : yy
+ :
+
+\pset border 2
+\pset format unaligned
+execute q;
+ab
+
+c|xx
+a
+bc|yyyyyyyyyyyyyyyyyy
+
+ab
+
+c|xxxx
+xxxxxx
+xxxxxxxx
+xxxxxxxxxx
+xxxxxxxxxxxx
+xxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxx
+xxxxxxxxxxxxxxxxxxxx
+a
+bc|yyyyyyyyyyyyyyyy
+yyyyyyyyyyyyyy
+yyyyyyyyyyyy
+yyyyyyyyyy
+yyyyyyyy
+yyyyyy
+yyyy
+yy
+
+\pset format aligned
+execute q;
++-[ RECORD 1 ]--------------+
+| ab | xx |
+|+ ; |
+|+c ; |
+| a | yyyyyyyyyyyyyyyyyy |
+|+bc ; |
++-[ RECORD 2 ]--------------+
+| ab | xxxx |
+|+ : xxxxxx |
+|+c : xxxxxxxx |
+| : xxxxxxxxxx |
+| : xxxxxxxxxxxx |
+| : xxxxxxxxxxxxxx |
+| : xxxxxxxxxxxxxxxx |
+| : xxxxxxxxxxxxxxxxxx |
+| : xxxxxxxxxxxxxxxxxxxx |
+| a | yyyyyyyyyyyyyyyy |
+|+bc : yyyyyyyyyyyyyy |
+| : yyyyyyyyyyyy |
+| : yyyyyyyyyy |
+| : yyyyyyyy |
+| : yyyyyy |
+| : yyyy |
+| : yy |
+| : |
++----+----------------------+
+
+\pset format wrapped
+execute q;
++-[ RECORD 1 ]-----+
+| ab | xx |
+|+ ; |
+|+c ; |
+| a | yyyyyyyyyyy |
+|+bc ; yyyyyyy |
++-[ RECORD 2 ]-----+
+| ab | xxxx |
+|+ : xxxxxx |
+|+c : xxxxxxxx |
+| : xxxxxxxxxx |
+| : xxxxxxxxxxx |
+| ; x |
+| : xxxxxxxxxxx |
+| ; xxx |
+| : xxxxxxxxxxx |
+| ; xxxxx |
+| : xxxxxxxxxxx |
+| ; xxxxxxx |
+| : xxxxxxxxxxx |
+| ; xxxxxxxxx |
+| a | yyyyyyyyyyy |
+|+bc ; yyyyy |
+| : yyyyyyyyyyy |
+| ; yyy |
+| : yyyyyyyyyyy |
+| ; y |
+| : yyyyyyyyyy |
+| : yyyyyyyy |
+| : yyyyyy |
+| : yyyy |
+| : yy |
+| : |
++----+-------------+
+
+deallocate q;
+-- test single-line header and data
+prepare q as select repeat('x',2*n) as "0123456789abcdef", repeat('y',20-2*n) as "0123456789" from generate_series(1,10) as n;
+\pset linestyle ascii
+\pset expanded off
+\pset columns 40
+\pset border 0
+\pset format unaligned
+execute q;
+0123456789abcdef|0123456789
+xx|yyyyyyyyyyyyyyyyyy
+xxxx|yyyyyyyyyyyyyyyy
+xxxxxx|yyyyyyyyyyyyyy
+xxxxxxxx|yyyyyyyyyyyy
+xxxxxxxxxx|yyyyyyyyyy
+xxxxxxxxxxxx|yyyyyyyy
+xxxxxxxxxxxxxx|yyyyyy
+xxxxxxxxxxxxxxxx|yyyy
+xxxxxxxxxxxxxxxxxx|yy
+xxxxxxxxxxxxxxxxxxxx|
+(10 rows)
+\pset format aligned
+execute q;
+ 0123456789abcdef 0123456789
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx yyyyyyyyyyyyyyyy
+xxxxxx yyyyyyyyyyyyyy
+xxxxxxxx yyyyyyyyyyyy
+xxxxxxxxxx yyyyyyyyyy
+xxxxxxxxxxxx yyyyyyyy
+xxxxxxxxxxxxxx yyyyyy
+xxxxxxxxxxxxxxxx yyyy
+xxxxxxxxxxxxxxxxxx yy
+xxxxxxxxxxxxxxxxxxxx
+(10 rows)
+
+\pset format wrapped
+execute q;
+ 0123456789abcdef 0123456789
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx yyyyyyyyyyyyyyyy
+xxxxxx yyyyyyyyyyyyyy
+xxxxxxxx yyyyyyyyyyyy
+xxxxxxxxxx yyyyyyyyyy
+xxxxxxxxxxxx yyyyyyyy
+xxxxxxxxxxxxxx yyyyyy
+xxxxxxxxxxxxxxxx yyyy
+xxxxxxxxxxxxxxxxxx yy
+xxxxxxxxxxxxxxxxxxxx
+(10 rows)
+
+\pset border 1
+\pset format unaligned
+execute q;
+0123456789abcdef|0123456789
+xx|yyyyyyyyyyyyyyyyyy
+xxxx|yyyyyyyyyyyyyyyy
+xxxxxx|yyyyyyyyyyyyyy
+xxxxxxxx|yyyyyyyyyyyy
+xxxxxxxxxx|yyyyyyyyyy
+xxxxxxxxxxxx|yyyyyyyy
+xxxxxxxxxxxxxx|yyyyyy
+xxxxxxxxxxxxxxxx|yyyy
+xxxxxxxxxxxxxxxxxx|yy
+xxxxxxxxxxxxxxxxxxxx|
+(10 rows)
+\pset format aligned
+execute q;
+ 0123456789abcdef | 0123456789
+----------------------+--------------------
+ xx | yyyyyyyyyyyyyyyyyy
+ xxxx | yyyyyyyyyyyyyyyy
+ xxxxxx | yyyyyyyyyyyyyy
+ xxxxxxxx | yyyyyyyyyyyy
+ xxxxxxxxxx | yyyyyyyyyy
+ xxxxxxxxxxxx | yyyyyyyy
+ xxxxxxxxxxxxxx | yyyyyy
+ xxxxxxxxxxxxxxxx | yyyy
+ xxxxxxxxxxxxxxxxxx | yy
+ xxxxxxxxxxxxxxxxxxxx |
+(10 rows)
+
+\pset format wrapped
+execute q;
+ 0123456789abcdef | 0123456789
+---------------------+------------------
+ xx | yyyyyyyyyyyyyyyy.
+ |.yy
+ xxxx | yyyyyyyyyyyyyyyy
+ xxxxxx | yyyyyyyyyyyyyy
+ xxxxxxxx | yyyyyyyyyyyy
+ xxxxxxxxxx | yyyyyyyyyy
+ xxxxxxxxxxxx | yyyyyyyy
+ xxxxxxxxxxxxxx | yyyyyy
+ xxxxxxxxxxxxxxxx | yyyy
+ xxxxxxxxxxxxxxxxxx | yy
+ xxxxxxxxxxxxxxxxxxx.|
+.x |
+(10 rows)
+
+\pset border 2
+\pset format unaligned
+execute q;
+0123456789abcdef|0123456789
+xx|yyyyyyyyyyyyyyyyyy
+xxxx|yyyyyyyyyyyyyyyy
+xxxxxx|yyyyyyyyyyyyyy
+xxxxxxxx|yyyyyyyyyyyy
+xxxxxxxxxx|yyyyyyyyyy
+xxxxxxxxxxxx|yyyyyyyy
+xxxxxxxxxxxxxx|yyyyyy
+xxxxxxxxxxxxxxxx|yyyy
+xxxxxxxxxxxxxxxxxx|yy
+xxxxxxxxxxxxxxxxxxxx|
+(10 rows)
+\pset format aligned
+execute q;
++----------------------+--------------------+
+| 0123456789abcdef | 0123456789 |
++----------------------+--------------------+
+| xx | yyyyyyyyyyyyyyyyyy |
+| xxxx | yyyyyyyyyyyyyyyy |
+| xxxxxx | yyyyyyyyyyyyyy |
+| xxxxxxxx | yyyyyyyyyyyy |
+| xxxxxxxxxx | yyyyyyyyyy |
+| xxxxxxxxxxxx | yyyyyyyy |
+| xxxxxxxxxxxxxx | yyyyyy |
+| xxxxxxxxxxxxxxxx | yyyy |
+| xxxxxxxxxxxxxxxxxx | yy |
+| xxxxxxxxxxxxxxxxxxxx | |
++----------------------+--------------------+
+(10 rows)
+
+\pset format wrapped
+execute q;
++--------------------+-----------------+
+| 0123456789abcdef | 0123456789 |
++--------------------+-----------------+
+| xx | yyyyyyyyyyyyyyy.|
+| |.yyy |
+| xxxx | yyyyyyyyyyyyyyy.|
+| |.y |
+| xxxxxx | yyyyyyyyyyyyyy |
+| xxxxxxxx | yyyyyyyyyyyy |
+| xxxxxxxxxx | yyyyyyyyyy |
+| xxxxxxxxxxxx | yyyyyyyy |
+| xxxxxxxxxxxxxx | yyyyyy |
+| xxxxxxxxxxxxxxxx | yyyy |
+| xxxxxxxxxxxxxxxxxx | yy |
+| xxxxxxxxxxxxxxxxxx.| |
+|.xx | |
++--------------------+-----------------+
+(10 rows)
+
+\pset expanded on
+\pset columns 30
+\pset border 0
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
+* Record 1
+0123456789abcdef xx
+0123456789 yyyyyyyyyyyyyyyyyy
+* Record 2
+0123456789abcdef xxxx
+0123456789 yyyyyyyyyyyyyyyy
+* Record 3
+0123456789abcdef xxxxxx
+0123456789 yyyyyyyyyyyyyy
+* Record 4
+0123456789abcdef xxxxxxxx
+0123456789 yyyyyyyyyyyy
+* Record 5
+0123456789abcdef xxxxxxxxxx
+0123456789 yyyyyyyyyy
+* Record 6
+0123456789abcdef xxxxxxxxxxxx
+0123456789 yyyyyyyy
+* Record 7
+0123456789abcdef xxxxxxxxxxxxxx
+0123456789 yyyyyy
+* Record 8
+0123456789abcdef xxxxxxxxxxxxxxxx
+0123456789 yyyy
+* Record 9
+0123456789abcdef xxxxxxxxxxxxxxxxxx
+0123456789 yy
+* Record 10
+0123456789abcdef xxxxxxxxxxxxxxxxxxxx
+0123456789
+
+\pset format wrapped
+execute q;
+* Record 1
+0123456789abcdef xx
+0123456789 yyyyyyyyyyyy.
+ .yyyyyy
+* Record 2
+0123456789abcdef xxxx
+0123456789 yyyyyyyyyyyy.
+ .yyyy
+* Record 3
+0123456789abcdef xxxxxx
+0123456789 yyyyyyyyyyyy.
+ .yy
+* Record 4
+0123456789abcdef xxxxxxxx
+0123456789 yyyyyyyyyyyy
+* Record 5
+0123456789abcdef xxxxxxxxxx
+0123456789 yyyyyyyyyy
+* Record 6
+0123456789abcdef xxxxxxxxxxxx
+0123456789 yyyyyyyy
+* Record 7
+0123456789abcdef xxxxxxxxxxxx.
+ .xx
+0123456789 yyyyyy
+* Record 8
+0123456789abcdef xxxxxxxxxxxx.
+ .xxxx
+0123456789 yyyy
+* Record 9
+0123456789abcdef xxxxxxxxxxxx.
+ .xxxxxx
+0123456789 yy
+* Record 10
+0123456789abcdef xxxxxxxxxxxx.
+ .xxxxxxxx
+0123456789
+
+\pset border 1
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
+-[ RECORD 1 ]----+---------------------
+0123456789abcdef | xx
+0123456789 | yyyyyyyyyyyyyyyyyy
+-[ RECORD 2 ]----+---------------------
+0123456789abcdef | xxxx
+0123456789 | yyyyyyyyyyyyyyyy
+-[ RECORD 3 ]----+---------------------
+0123456789abcdef | xxxxxx
+0123456789 | yyyyyyyyyyyyyy
+-[ RECORD 4 ]----+---------------------
+0123456789abcdef | xxxxxxxx
+0123456789 | yyyyyyyyyyyy
+-[ RECORD 5 ]----+---------------------
+0123456789abcdef | xxxxxxxxxx
+0123456789 | yyyyyyyyyy
+-[ RECORD 6 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxx
+0123456789 | yyyyyyyy
+-[ RECORD 7 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxx
+0123456789 | yyyyyy
+-[ RECORD 8 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxx
+0123456789 | yyyy
+-[ RECORD 9 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxx
+0123456789 | yy
+-[ RECORD 10 ]---+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxxxx
+0123456789 |
+
+\pset format wrapped
+execute q;
+-[ RECORD 1 ]----+-----------
+0123456789abcdef | xx
+0123456789 | yyyyyyyyyy.
+ |.yyyyyyyy
+-[ RECORD 2 ]----+-----------
+0123456789abcdef | xxxx
+0123456789 | yyyyyyyyyy.
+ |.yyyyyy
+-[ RECORD 3 ]----+-----------
+0123456789abcdef | xxxxxx
+0123456789 | yyyyyyyyyy.
+ |.yyyy
+-[ RECORD 4 ]----+-----------
+0123456789abcdef | xxxxxxxx
+0123456789 | yyyyyyyyyy.
+ |.yy
+-[ RECORD 5 ]----+-----------
+0123456789abcdef | xxxxxxxxxx
+0123456789 | yyyyyyyyyy
+-[ RECORD 6 ]----+-----------
+0123456789abcdef | xxxxxxxxxx.
+ |.xx
+0123456789 | yyyyyyyy
+-[ RECORD 7 ]----+-----------
+0123456789abcdef | xxxxxxxxxx.
+ |.xxxx
+0123456789 | yyyyyy
+-[ RECORD 8 ]----+-----------
+0123456789abcdef | xxxxxxxxxx.
+ |.xxxxxx
+0123456789 | yyyy
+-[ RECORD 9 ]----+-----------
+0123456789abcdef | xxxxxxxxxx.
+ |.xxxxxxxx
+0123456789 | yy
+-[ RECORD 10 ]---+-----------
+0123456789abcdef | xxxxxxxxxx.
+ |.xxxxxxxxxx
+0123456789 |
+
+\pset border 2
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
++-[ RECORD 1 ]-----+----------------------+
+| 0123456789abcdef | xx |
+| 0123456789 | yyyyyyyyyyyyyyyyyy |
++-[ RECORD 2 ]-----+----------------------+
+| 0123456789abcdef | xxxx |
+| 0123456789 | yyyyyyyyyyyyyyyy |
++-[ RECORD 3 ]-----+----------------------+
+| 0123456789abcdef | xxxxxx |
+| 0123456789 | yyyyyyyyyyyyyy |
++-[ RECORD 4 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxx |
+| 0123456789 | yyyyyyyyyyyy |
++-[ RECORD 5 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxx |
+| 0123456789 | yyyyyyyyyy |
++-[ RECORD 6 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxx |
+| 0123456789 | yyyyyyyy |
++-[ RECORD 7 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxx |
+| 0123456789 | yyyyyy |
++-[ RECORD 8 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxx |
+| 0123456789 | yyyy |
++-[ RECORD 9 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxxx |
+| 0123456789 | yy |
++-[ RECORD 10 ]----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxxxxx |
+| 0123456789 | |
++------------------+----------------------+
+
+\pset format wrapped
+execute q;
++-[ RECORD 1 ]-----+---------+
+| 0123456789abcdef | xx |
+| 0123456789 | yyyyyyy.|
+| |.yyyyyyy.|
+| |.yyyy |
++-[ RECORD 2 ]-----+---------+
+| 0123456789abcdef | xxxx |
+| 0123456789 | yyyyyyy.|
+| |.yyyyyyy.|
+| |.yy |
++-[ RECORD 3 ]-----+---------+
+| 0123456789abcdef | xxxxxx |
+| 0123456789 | yyyyyyy.|
+| |.yyyyyyy |
++-[ RECORD 4 ]-----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.x |
+| 0123456789 | yyyyyyy.|
+| |.yyyyy |
++-[ RECORD 5 ]-----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.xxx |
+| 0123456789 | yyyyyyy.|
+| |.yyy |
++-[ RECORD 6 ]-----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.xxxxx |
+| 0123456789 | yyyyyyy.|
+| |.y |
++-[ RECORD 7 ]-----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.xxxxxxx |
+| 0123456789 | yyyyyy |
++-[ RECORD 8 ]-----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.xxxxxxx.|
+| |.xx |
+| 0123456789 | yyyy |
++-[ RECORD 9 ]-----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.xxxxxxx.|
+| |.xxxx |
+| 0123456789 | yy |
++-[ RECORD 10 ]----+---------+
+| 0123456789abcdef | xxxxxxx.|
+| |.xxxxxxx.|
+| |.xxxxxx |
+| 0123456789 | |
++------------------+---------+
+
+\pset expanded on
+\pset columns 20
+\pset border 0
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
+* Record 1
+0123456789abcdef xx
+0123456789 yyyyyyyyyyyyyyyyyy
+* Record 2
+0123456789abcdef xxxx
+0123456789 yyyyyyyyyyyyyyyy
+* Record 3
+0123456789abcdef xxxxxx
+0123456789 yyyyyyyyyyyyyy
+* Record 4
+0123456789abcdef xxxxxxxx
+0123456789 yyyyyyyyyyyy
+* Record 5
+0123456789abcdef xxxxxxxxxx
+0123456789 yyyyyyyyyy
+* Record 6
+0123456789abcdef xxxxxxxxxxxx
+0123456789 yyyyyyyy
+* Record 7
+0123456789abcdef xxxxxxxxxxxxxx
+0123456789 yyyyyy
+* Record 8
+0123456789abcdef xxxxxxxxxxxxxxxx
+0123456789 yyyy
+* Record 9
+0123456789abcdef xxxxxxxxxxxxxxxxxx
+0123456789 yy
+* Record 10
+0123456789abcdef xxxxxxxxxxxxxxxxxxxx
+0123456789
+
+\pset format wrapped
+execute q;
+* Record 1
+0123456789abcdef xx
+0123456789 yyy.
+ .yyy.
+ .yyy.
+ .yyy.
+ .yyy.
+ .yyy
+* Record 2
+0123456789abcdef xxx.
+ .x
+0123456789 yyy.
+ .yyy.
+ .yyy.
+ .yyy.
+ .yyy.
+ .y
+* Record 3
+0123456789abcdef xxx.
+ .xxx
+0123456789 yyy.
+ .yyy.
+ .yyy.
+ .yyy.
+ .yy
+* Record 4
+0123456789abcdef xxx.
+ .xxx.
+ .xx
+0123456789 yyy.
+ .yyy.
+ .yyy.
+ .yyy
+* Record 5
+0123456789abcdef xxx.
+ .xxx.
+ .xxx.
+ .x
+0123456789 yyy.
+ .yyy.
+ .yyy.
+ .y
+* Record 6
+0123456789abcdef xxx.
+ .xxx.
+ .xxx.
+ .xxx
+0123456789 yyy.
+ .yyy.
+ .yy
+* Record 7
+0123456789abcdef xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xx
+0123456789 yyy.
+ .yyy
+* Record 8
+0123456789abcdef xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .x
+0123456789 yyy.
+ .y
+* Record 9
+0123456789abcdef xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xxx
+0123456789 yy
+* Record 10
+0123456789abcdef xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xxx.
+ .xx
+0123456789
+
+\pset border 1
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
+-[ RECORD 1 ]----+---------------------
+0123456789abcdef | xx
+0123456789 | yyyyyyyyyyyyyyyyyy
+-[ RECORD 2 ]----+---------------------
+0123456789abcdef | xxxx
+0123456789 | yyyyyyyyyyyyyyyy
+-[ RECORD 3 ]----+---------------------
+0123456789abcdef | xxxxxx
+0123456789 | yyyyyyyyyyyyyy
+-[ RECORD 4 ]----+---------------------
+0123456789abcdef | xxxxxxxx
+0123456789 | yyyyyyyyyyyy
+-[ RECORD 5 ]----+---------------------
+0123456789abcdef | xxxxxxxxxx
+0123456789 | yyyyyyyyyy
+-[ RECORD 6 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxx
+0123456789 | yyyyyyyy
+-[ RECORD 7 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxx
+0123456789 | yyyyyy
+-[ RECORD 8 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxx
+0123456789 | yyyy
+-[ RECORD 9 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxx
+0123456789 | yy
+-[ RECORD 10 ]---+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxxxx
+0123456789 |
+
+\pset format wrapped
+execute q;
+-[ RECORD 1 ]----+----
+0123456789abcdef | xx
+0123456789 | yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy
+-[ RECORD 2 ]----+----
+0123456789abcdef | xxx.
+ |.x
+0123456789 | yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy.
+ |.y
+-[ RECORD 3 ]----+----
+0123456789abcdef | xxx.
+ |.xxx
+0123456789 | yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy.
+ |.yy
+-[ RECORD 4 ]----+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xx
+0123456789 | yyy.
+ |.yyy.
+ |.yyy.
+ |.yyy
+-[ RECORD 5 ]----+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xxx.
+ |.x
+0123456789 | yyy.
+ |.yyy.
+ |.yyy.
+ |.y
+-[ RECORD 6 ]----+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx
+0123456789 | yyy.
+ |.yyy.
+ |.yy
+-[ RECORD 7 ]----+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xx
+0123456789 | yyy.
+ |.yyy
+-[ RECORD 8 ]----+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.x
+0123456789 | yyy.
+ |.y
+-[ RECORD 9 ]----+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx
+0123456789 | yy
+-[ RECORD 10 ]---+----
+0123456789abcdef | xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xxx.
+ |.xx
+0123456789 |
+
+\pset border 2
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
++-[ RECORD 1 ]-----+----------------------+
+| 0123456789abcdef | xx |
+| 0123456789 | yyyyyyyyyyyyyyyyyy |
++-[ RECORD 2 ]-----+----------------------+
+| 0123456789abcdef | xxxx |
+| 0123456789 | yyyyyyyyyyyyyyyy |
++-[ RECORD 3 ]-----+----------------------+
+| 0123456789abcdef | xxxxxx |
+| 0123456789 | yyyyyyyyyyyyyy |
++-[ RECORD 4 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxx |
+| 0123456789 | yyyyyyyyyyyy |
++-[ RECORD 5 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxx |
+| 0123456789 | yyyyyyyyyy |
++-[ RECORD 6 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxx |
+| 0123456789 | yyyyyyyy |
++-[ RECORD 7 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxx |
+| 0123456789 | yyyyyy |
++-[ RECORD 8 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxx |
+| 0123456789 | yyyy |
++-[ RECORD 9 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxxx |
+| 0123456789 | yy |
++-[ RECORD 10 ]----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxxxxx |
+| 0123456789 | |
++------------------+----------------------+
+
+\pset format wrapped
+execute q;
++-[ RECORD 1 ]-----+-----+
+| 0123456789abcdef | xx |
+| 0123456789 | yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy |
++-[ RECORD 2 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.x |
+| 0123456789 | yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.y |
++-[ RECORD 3 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx |
+| 0123456789 | yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yy |
++-[ RECORD 4 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xx |
+| 0123456789 | yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.yyy |
++-[ RECORD 5 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.x |
+| 0123456789 | yyy.|
+| |.yyy.|
+| |.yyy.|
+| |.y |
++-[ RECORD 6 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx |
+| 0123456789 | yyy.|
+| |.yyy.|
+| |.yy |
++-[ RECORD 7 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xx |
+| 0123456789 | yyy.|
+| |.yyy |
++-[ RECORD 8 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.x |
+| 0123456789 | yyy.|
+| |.y |
++-[ RECORD 9 ]-----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx |
+| 0123456789 | yy |
++-[ RECORD 10 ]----+-----+
+| 0123456789abcdef | xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xxx.|
+| |.xx |
+| 0123456789 | |
++------------------+-----+
+
+\pset linestyle old-ascii
+\pset expanded off
+\pset columns 40
+\pset border 0
+\pset format unaligned
+execute q;
+0123456789abcdef|0123456789
+xx|yyyyyyyyyyyyyyyyyy
+xxxx|yyyyyyyyyyyyyyyy
+xxxxxx|yyyyyyyyyyyyyy
+xxxxxxxx|yyyyyyyyyyyy
+xxxxxxxxxx|yyyyyyyyyy
+xxxxxxxxxxxx|yyyyyyyy
+xxxxxxxxxxxxxx|yyyyyy
+xxxxxxxxxxxxxxxx|yyyy
+xxxxxxxxxxxxxxxxxx|yy
+xxxxxxxxxxxxxxxxxxxx|
+(10 rows)
+\pset format aligned
+execute q;
+ 0123456789abcdef 0123456789
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx yyyyyyyyyyyyyyyy
+xxxxxx yyyyyyyyyyyyyy
+xxxxxxxx yyyyyyyyyyyy
+xxxxxxxxxx yyyyyyyyyy
+xxxxxxxxxxxx yyyyyyyy
+xxxxxxxxxxxxxx yyyyyy
+xxxxxxxxxxxxxxxx yyyy
+xxxxxxxxxxxxxxxxxx yy
+xxxxxxxxxxxxxxxxxxxx
+(10 rows)
+
+\pset format wrapped
+execute q;
+ 0123456789abcdef 0123456789
+-------------------- ------------------
+xx yyyyyyyyyyyyyyyyyy
+xxxx yyyyyyyyyyyyyyyy
+xxxxxx yyyyyyyyyyyyyy
+xxxxxxxx yyyyyyyyyyyy
+xxxxxxxxxx yyyyyyyyyy
+xxxxxxxxxxxx yyyyyyyy
+xxxxxxxxxxxxxx yyyyyy
+xxxxxxxxxxxxxxxx yyyy
+xxxxxxxxxxxxxxxxxx yy
+xxxxxxxxxxxxxxxxxxxx
+(10 rows)
+
+\pset border 1
+\pset format unaligned
+execute q;
+0123456789abcdef|0123456789
+xx|yyyyyyyyyyyyyyyyyy
+xxxx|yyyyyyyyyyyyyyyy
+xxxxxx|yyyyyyyyyyyyyy
+xxxxxxxx|yyyyyyyyyyyy
+xxxxxxxxxx|yyyyyyyyyy
+xxxxxxxxxxxx|yyyyyyyy
+xxxxxxxxxxxxxx|yyyyyy
+xxxxxxxxxxxxxxxx|yyyy
+xxxxxxxxxxxxxxxxxx|yy
+xxxxxxxxxxxxxxxxxxxx|
+(10 rows)
+\pset format aligned
+execute q;
+ 0123456789abcdef | 0123456789
+----------------------+--------------------
+ xx | yyyyyyyyyyyyyyyyyy
+ xxxx | yyyyyyyyyyyyyyyy
+ xxxxxx | yyyyyyyyyyyyyy
+ xxxxxxxx | yyyyyyyyyyyy
+ xxxxxxxxxx | yyyyyyyyyy
+ xxxxxxxxxxxx | yyyyyyyy
+ xxxxxxxxxxxxxx | yyyyyy
+ xxxxxxxxxxxxxxxx | yyyy
+ xxxxxxxxxxxxxxxxxx | yy
+ xxxxxxxxxxxxxxxxxxxx |
+(10 rows)
+
+\pset format wrapped
+execute q;
+ 0123456789abcdef | 0123456789
+---------------------+------------------
+ xx | yyyyyyyyyyyyyyyy
+ ; yy
+ xxxx | yyyyyyyyyyyyyyyy
+ xxxxxx | yyyyyyyyyyyyyy
+ xxxxxxxx | yyyyyyyyyyyy
+ xxxxxxxxxx | yyyyyyyyyy
+ xxxxxxxxxxxx | yyyyyyyy
+ xxxxxxxxxxxxxx | yyyyyy
+ xxxxxxxxxxxxxxxx | yyyy
+ xxxxxxxxxxxxxxxxxx | yy
+ xxxxxxxxxxxxxxxxxxx |
+ x
+(10 rows)
+
+\pset border 2
+\pset format unaligned
+execute q;
+0123456789abcdef|0123456789
+xx|yyyyyyyyyyyyyyyyyy
+xxxx|yyyyyyyyyyyyyyyy
+xxxxxx|yyyyyyyyyyyyyy
+xxxxxxxx|yyyyyyyyyyyy
+xxxxxxxxxx|yyyyyyyyyy
+xxxxxxxxxxxx|yyyyyyyy
+xxxxxxxxxxxxxx|yyyyyy
+xxxxxxxxxxxxxxxx|yyyy
+xxxxxxxxxxxxxxxxxx|yy
+xxxxxxxxxxxxxxxxxxxx|
+(10 rows)
+\pset format aligned
+execute q;
++----------------------+--------------------+
+| 0123456789abcdef | 0123456789 |
++----------------------+--------------------+
+| xx | yyyyyyyyyyyyyyyyyy |
+| xxxx | yyyyyyyyyyyyyyyy |
+| xxxxxx | yyyyyyyyyyyyyy |
+| xxxxxxxx | yyyyyyyyyyyy |
+| xxxxxxxxxx | yyyyyyyyyy |
+| xxxxxxxxxxxx | yyyyyyyy |
+| xxxxxxxxxxxxxx | yyyyyy |
+| xxxxxxxxxxxxxxxx | yyyy |
+| xxxxxxxxxxxxxxxxxx | yy |
+| xxxxxxxxxxxxxxxxxxxx | |
++----------------------+--------------------+
+(10 rows)
+
+\pset format wrapped
+execute q;
++--------------------+-----------------+
+| 0123456789abcdef | 0123456789 |
++--------------------+-----------------+
+| xx | yyyyyyyyyyyyyyy |
+| ; yyy |
+| xxxx | yyyyyyyyyyyyyyy |
+| ; y |
+| xxxxxx | yyyyyyyyyyyyyy |
+| xxxxxxxx | yyyyyyyyyyyy |
+| xxxxxxxxxx | yyyyyyyyyy |
+| xxxxxxxxxxxx | yyyyyyyy |
+| xxxxxxxxxxxxxx | yyyyyy |
+| xxxxxxxxxxxxxxxx | yyyy |
+| xxxxxxxxxxxxxxxxxx | yy |
+| xxxxxxxxxxxxxxxxxx | |
+| xx |
++--------------------+-----------------+
+(10 rows)
+
+\pset expanded on
+\pset border 0
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
+* Record 1
+0123456789abcdef xx
+0123456789 yyyyyyyyyyyyyyyyyy
+* Record 2
+0123456789abcdef xxxx
+0123456789 yyyyyyyyyyyyyyyy
+* Record 3
+0123456789abcdef xxxxxx
+0123456789 yyyyyyyyyyyyyy
+* Record 4
+0123456789abcdef xxxxxxxx
+0123456789 yyyyyyyyyyyy
+* Record 5
+0123456789abcdef xxxxxxxxxx
+0123456789 yyyyyyyyyy
+* Record 6
+0123456789abcdef xxxxxxxxxxxx
+0123456789 yyyyyyyy
+* Record 7
+0123456789abcdef xxxxxxxxxxxxxx
+0123456789 yyyyyy
+* Record 8
+0123456789abcdef xxxxxxxxxxxxxxxx
+0123456789 yyyy
+* Record 9
+0123456789abcdef xxxxxxxxxxxxxxxxxx
+0123456789 yy
+* Record 10
+0123456789abcdef xxxxxxxxxxxxxxxxxxxx
+0123456789
+
+\pset format wrapped
+execute q;
+* Record 1
+0123456789abcdef xx
+0123456789 yyyyyyyyyyyyyyyyyy
+* Record 2
+0123456789abcdef xxxx
+0123456789 yyyyyyyyyyyyyyyy
+* Record 3
+0123456789abcdef xxxxxx
+0123456789 yyyyyyyyyyyyyy
+* Record 4
+0123456789abcdef xxxxxxxx
+0123456789 yyyyyyyyyyyy
+* Record 5
+0123456789abcdef xxxxxxxxxx
+0123456789 yyyyyyyyyy
+* Record 6
+0123456789abcdef xxxxxxxxxxxx
+0123456789 yyyyyyyy
+* Record 7
+0123456789abcdef xxxxxxxxxxxxxx
+0123456789 yyyyyy
+* Record 8
+0123456789abcdef xxxxxxxxxxxxxxxx
+0123456789 yyyy
+* Record 9
+0123456789abcdef xxxxxxxxxxxxxxxxxx
+0123456789 yy
+* Record 10
+0123456789abcdef xxxxxxxxxxxxxxxxxxxx
+0123456789
+
+\pset border 1
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
+-[ RECORD 1 ]----+---------------------
+0123456789abcdef | xx
+0123456789 | yyyyyyyyyyyyyyyyyy
+-[ RECORD 2 ]----+---------------------
+0123456789abcdef | xxxx
+0123456789 | yyyyyyyyyyyyyyyy
+-[ RECORD 3 ]----+---------------------
+0123456789abcdef | xxxxxx
+0123456789 | yyyyyyyyyyyyyy
+-[ RECORD 4 ]----+---------------------
+0123456789abcdef | xxxxxxxx
+0123456789 | yyyyyyyyyyyy
+-[ RECORD 5 ]----+---------------------
+0123456789abcdef | xxxxxxxxxx
+0123456789 | yyyyyyyyyy
+-[ RECORD 6 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxx
+0123456789 | yyyyyyyy
+-[ RECORD 7 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxx
+0123456789 | yyyyyy
+-[ RECORD 8 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxx
+0123456789 | yyyy
+-[ RECORD 9 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxx
+0123456789 | yy
+-[ RECORD 10 ]---+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxxxx
+0123456789 |
+
+\pset format wrapped
+execute q;
+-[ RECORD 1 ]----+---------------------
+0123456789abcdef | xx
+0123456789 | yyyyyyyyyyyyyyyyyy
+-[ RECORD 2 ]----+---------------------
+0123456789abcdef | xxxx
+0123456789 | yyyyyyyyyyyyyyyy
+-[ RECORD 3 ]----+---------------------
+0123456789abcdef | xxxxxx
+0123456789 | yyyyyyyyyyyyyy
+-[ RECORD 4 ]----+---------------------
+0123456789abcdef | xxxxxxxx
+0123456789 | yyyyyyyyyyyy
+-[ RECORD 5 ]----+---------------------
+0123456789abcdef | xxxxxxxxxx
+0123456789 | yyyyyyyyyy
+-[ RECORD 6 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxx
+0123456789 | yyyyyyyy
+-[ RECORD 7 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxx
+0123456789 | yyyyyy
+-[ RECORD 8 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxx
+0123456789 | yyyy
+-[ RECORD 9 ]----+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxx
+0123456789 | yy
+-[ RECORD 10 ]---+---------------------
+0123456789abcdef | xxxxxxxxxxxxxxxxxxxx
+0123456789 |
+
+\pset border 2
+\pset format unaligned
+execute q;
+0123456789abcdef|xx
+0123456789|yyyyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxx
+0123456789|yyyyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxx
+0123456789|yyyyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxx
+0123456789|yyyyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxx
+0123456789|yyyyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxx
+0123456789|yyyyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxx
+0123456789|yyyyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxx
+0123456789|yyyy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxx
+0123456789|yy
+
+0123456789abcdef|xxxxxxxxxxxxxxxxxxxx
+0123456789|
+\pset format aligned
+execute q;
++-[ RECORD 1 ]-----+----------------------+
+| 0123456789abcdef | xx |
+| 0123456789 | yyyyyyyyyyyyyyyyyy |
++-[ RECORD 2 ]-----+----------------------+
+| 0123456789abcdef | xxxx |
+| 0123456789 | yyyyyyyyyyyyyyyy |
++-[ RECORD 3 ]-----+----------------------+
+| 0123456789abcdef | xxxxxx |
+| 0123456789 | yyyyyyyyyyyyyy |
++-[ RECORD 4 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxx |
+| 0123456789 | yyyyyyyyyyyy |
++-[ RECORD 5 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxx |
+| 0123456789 | yyyyyyyyyy |
++-[ RECORD 6 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxx |
+| 0123456789 | yyyyyyyy |
++-[ RECORD 7 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxx |
+| 0123456789 | yyyyyy |
++-[ RECORD 8 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxx |
+| 0123456789 | yyyy |
++-[ RECORD 9 ]-----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxxx |
+| 0123456789 | yy |
++-[ RECORD 10 ]----+----------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxxxxx |
+| 0123456789 | |
++------------------+----------------------+
+
+\pset format wrapped
+execute q;
++-[ RECORD 1 ]-----+-------------------+
+| 0123456789abcdef | xx |
+| 0123456789 | yyyyyyyyyyyyyyyyy |
+| ; y |
++-[ RECORD 2 ]-----+-------------------+
+| 0123456789abcdef | xxxx |
+| 0123456789 | yyyyyyyyyyyyyyyy |
++-[ RECORD 3 ]-----+-------------------+
+| 0123456789abcdef | xxxxxx |
+| 0123456789 | yyyyyyyyyyyyyy |
++-[ RECORD 4 ]-----+-------------------+
+| 0123456789abcdef | xxxxxxxx |
+| 0123456789 | yyyyyyyyyyyy |
++-[ RECORD 5 ]-----+-------------------+
+| 0123456789abcdef | xxxxxxxxxx |
+| 0123456789 | yyyyyyyyyy |
++-[ RECORD 6 ]-----+-------------------+
+| 0123456789abcdef | xxxxxxxxxxxx |
+| 0123456789 | yyyyyyyy |
++-[ RECORD 7 ]-----+-------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxx |
+| 0123456789 | yyyyyy |
++-[ RECORD 8 ]-----+-------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxx |
+| 0123456789 | yyyy |
++-[ RECORD 9 ]-----+-------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxx |
+| ; x |
+| 0123456789 | yy |
++-[ RECORD 10 ]----+-------------------+
+| 0123456789abcdef | xxxxxxxxxxxxxxxxx |
+| ; xxx |
+| 0123456789 | |
++------------------+-------------------+
+
+deallocate q;
+\pset linestyle ascii
+\pset border 1
+-- support table for output-format tests (useful to create a footer)
+create table psql_serial_tab (id serial);
+-- test header/footer/tuples_only behavior in aligned/unaligned/wrapped cases
+\pset format aligned
+\pset expanded off
+\d psql_serial_tab_id_seq
+ Sequence "public.psql_serial_tab_id_seq"
+ Type | Start | Minimum | Maximum | Increment | Cycles? | Cache
+---------+-------+---------+------------+-----------+---------+-------
+ integer | 1 | 1 | 2147483647 | 1 | no | 1
+Owned by: public.psql_serial_tab.id
+
+\pset tuples_only true
+\df exp
+ pg_catalog | exp | double precision | double precision | func
+ pg_catalog | exp | numeric | numeric | func
+
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+Sequence "public.psql_serial_tab_id_seq"
+-[ RECORD 1 ]---------
+Type | integer
+Start | 1
+Minimum | 1
+Maximum | 2147483647
+Increment | 1
+Cycles? | no
+Cache | 1
+
+Owned by: public.psql_serial_tab.id
+
+\pset tuples_only true
+\df exp
+Schema | pg_catalog
+Name | exp
+Result data type | double precision
+Argument data types | double precision
+Type | func
+--------------------+-----------------
+Schema | pg_catalog
+Name | exp
+Result data type | numeric
+Argument data types | numeric
+Type | func
+
+\pset tuples_only false
+-- empty table is a special case for this format
+select 1 where false;
+(0 rows)
+
+\pset format unaligned
+\pset expanded off
+\d psql_serial_tab_id_seq
+Sequence "public.psql_serial_tab_id_seq"
+Type|Start|Minimum|Maximum|Increment|Cycles?|Cache
+integer|1|1|2147483647|1|no|1
+Owned by: public.psql_serial_tab.id
+\pset tuples_only true
+\df exp
+pg_catalog|exp|double precision|double precision|func
+pg_catalog|exp|numeric|numeric|func
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+Sequence "public.psql_serial_tab_id_seq"
+
+Type|integer
+Start|1
+Minimum|1
+Maximum|2147483647
+Increment|1
+Cycles?|no
+Cache|1
+
+Owned by: public.psql_serial_tab.id
+\pset tuples_only true
+\df exp
+Schema|pg_catalog
+Name|exp
+Result data type|double precision
+Argument data types|double precision
+Type|func
+
+Schema|pg_catalog
+Name|exp
+Result data type|numeric
+Argument data types|numeric
+Type|func
+\pset tuples_only false
+\pset format wrapped
+\pset expanded off
+\d psql_serial_tab_id_seq
+ Sequence "public.psql_serial_tab_id_seq"
+ Type | Start | Minimum | Maximum | Increment | Cycles? | Cache
+---------+-------+---------+------------+-----------+---------+-------
+ integer | 1 | 1 | 2147483647 | 1 | no | 1
+Owned by: public.psql_serial_tab.id
+
+\pset tuples_only true
+\df exp
+ pg_catalog | exp | double precision | double precision | func
+ pg_catalog | exp | numeric | numeric | func
+
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+Sequence "public.psql_serial_tab_id_seq"
+-[ RECORD 1 ]---------
+Type | integer
+Start | 1
+Minimum | 1
+Maximum | 2147483647
+Increment | 1
+Cycles? | no
+Cache | 1
+
+Owned by: public.psql_serial_tab.id
+
+\pset tuples_only true
+\df exp
+Schema | pg_catalog
+Name | exp
+Result data type | double precision
+Argument data types | double precision
+Type | func
+--------------------+-----------------
+Schema | pg_catalog
+Name | exp
+Result data type | numeric
+Argument data types | numeric
+Type | func
+
+\pset tuples_only false
+-- check conditional am display
+\pset expanded off
+CREATE SCHEMA tableam_display;
+CREATE ROLE regress_display_role;
+ALTER SCHEMA tableam_display OWNER TO regress_display_role;
+SET search_path TO tableam_display;
+CREATE ACCESS METHOD heap_psql TYPE TABLE HANDLER heap_tableam_handler;
+SET ROLE TO regress_display_role;
+-- Use only relations with a physical size of zero.
+CREATE TABLE tbl_heap_psql(f1 int, f2 char(100)) using heap_psql;
+CREATE TABLE tbl_heap(f1 int, f2 char(100)) using heap;
+CREATE VIEW view_heap_psql AS SELECT f1 from tbl_heap_psql;
+CREATE MATERIALIZED VIEW mat_view_heap_psql USING heap_psql AS SELECT f1 from tbl_heap_psql;
+\d+ tbl_heap_psql
+ Table "tableam_display.tbl_heap_psql"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+--------+----------------+-----------+----------+---------+----------+--------------+-------------
+ f1 | integer | | | | plain | |
+ f2 | character(100) | | | | extended | |
+
+\d+ tbl_heap
+ Table "tableam_display.tbl_heap"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+--------+----------------+-----------+----------+---------+----------+--------------+-------------
+ f1 | integer | | | | plain | |
+ f2 | character(100) | | | | extended | |
+
+\set HIDE_TABLEAM off
+\d+ tbl_heap_psql
+ Table "tableam_display.tbl_heap_psql"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+--------+----------------+-----------+----------+---------+----------+--------------+-------------
+ f1 | integer | | | | plain | |
+ f2 | character(100) | | | | extended | |
+Access method: heap_psql
+
+\d+ tbl_heap
+ Table "tableam_display.tbl_heap"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+--------+----------------+-----------+----------+---------+----------+--------------+-------------
+ f1 | integer | | | | plain | |
+ f2 | character(100) | | | | extended | |
+Access method: heap
+
+-- AM is displayed for tables, indexes and materialized views.
+\d+
+ List of relations
+ Schema | Name | Type | Owner | Persistence | Access Method | Size | Description
+-----------------+--------------------+-------------------+----------------------+-------------+---------------+---------+-------------
+ tableam_display | mat_view_heap_psql | materialized view | regress_display_role | permanent | heap_psql | 0 bytes |
+ tableam_display | tbl_heap | table | regress_display_role | permanent | heap | 0 bytes |
+ tableam_display | tbl_heap_psql | table | regress_display_role | permanent | heap_psql | 0 bytes |
+ tableam_display | view_heap_psql | view | regress_display_role | permanent | | 0 bytes |
+(4 rows)
+
+\dt+
+ List of relations
+ Schema | Name | Type | Owner | Persistence | Access Method | Size | Description
+-----------------+---------------+-------+----------------------+-------------+---------------+---------+-------------
+ tableam_display | tbl_heap | table | regress_display_role | permanent | heap | 0 bytes |
+ tableam_display | tbl_heap_psql | table | regress_display_role | permanent | heap_psql | 0 bytes |
+(2 rows)
+
+\dm+
+ List of relations
+ Schema | Name | Type | Owner | Persistence | Access Method | Size | Description
+-----------------+--------------------+-------------------+----------------------+-------------+---------------+---------+-------------
+ tableam_display | mat_view_heap_psql | materialized view | regress_display_role | permanent | heap_psql | 0 bytes |
+(1 row)
+
+-- But not for views and sequences.
+\dv+
+ List of relations
+ Schema | Name | Type | Owner | Persistence | Size | Description
+-----------------+----------------+------+----------------------+-------------+---------+-------------
+ tableam_display | view_heap_psql | view | regress_display_role | permanent | 0 bytes |
+(1 row)
+
+\set HIDE_TABLEAM on
+\d+
+ List of relations
+ Schema | Name | Type | Owner | Persistence | Size | Description
+-----------------+--------------------+-------------------+----------------------+-------------+---------+-------------
+ tableam_display | mat_view_heap_psql | materialized view | regress_display_role | permanent | 0 bytes |
+ tableam_display | tbl_heap | table | regress_display_role | permanent | 0 bytes |
+ tableam_display | tbl_heap_psql | table | regress_display_role | permanent | 0 bytes |
+ tableam_display | view_heap_psql | view | regress_display_role | permanent | 0 bytes |
+(4 rows)
+
+RESET ROLE;
+RESET search_path;
+DROP SCHEMA tableam_display CASCADE;
+NOTICE: drop cascades to 4 other objects
+DETAIL: drop cascades to table tableam_display.tbl_heap_psql
+drop cascades to table tableam_display.tbl_heap
+drop cascades to view tableam_display.view_heap_psql
+drop cascades to materialized view tableam_display.mat_view_heap_psql
+DROP ACCESS METHOD heap_psql;
+DROP ROLE regress_display_role;
+-- test numericlocale (as best we can without control of psql's locale)
+\pset format aligned
+\pset expanded off
+\pset numericlocale true
+select n, -n as m, n * 111 as x, '1e90'::float8 as f
+from generate_series(0,3) n;
+ n | m | x | f
+---+----+-----+-------
+ 0 | 0 | 0 | 1e+90
+ 1 | -1 | 111 | 1e+90
+ 2 | -2 | 222 | 1e+90
+ 3 | -3 | 333 | 1e+90
+(4 rows)
+
+\pset numericlocale false
+-- test asciidoc output format
+\pset format asciidoc
+\pset border 1
+\pset expanded off
+\d psql_serial_tab_id_seq
+
+.Sequence "public.psql_serial_tab_id_seq"
+[options="header",cols="<l,>l,>l,>l,>l,<l,>l",frame="none"]
+|====
+^l|Type ^l|Start ^l|Minimum ^l|Maximum ^l|Increment ^l|Cycles? ^l|Cache
+|integer |1 |1 |2147483647 |1 |no |1
+|====
+
+....
+Owned by: public.psql_serial_tab.id
+....
+\pset tuples_only true
+\df exp
+
+[cols="<l,<l,<l,<l,<l",frame="none"]
+|====
+|pg_catalog |exp |double precision |double precision |func
+|pg_catalog |exp |numeric |numeric |func
+|====
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+
+.Sequence "public.psql_serial_tab_id_seq"
+[cols="h,l",frame="none"]
+|====
+2+^|Record 1
+<l|Type <l|integer
+<l|Start >l|1
+<l|Minimum >l|1
+<l|Maximum >l|2147483647
+<l|Increment >l|1
+<l|Cycles? <l|no
+<l|Cache >l|1
+|====
+
+....
+Owned by: public.psql_serial_tab.id
+....
+\pset tuples_only true
+\df exp
+
+[cols="h,l",frame="none"]
+|====
+2+|
+<l|Schema <l|pg_catalog
+<l|Name <l|exp
+<l|Result data type <l|double precision
+<l|Argument data types <l|double precision
+<l|Type <l|func
+2+|
+<l|Schema <l|pg_catalog
+<l|Name <l|exp
+<l|Result data type <l|numeric
+<l|Argument data types <l|numeric
+<l|Type <l|func
+|====
+\pset tuples_only false
+prepare q as
+ select 'some|text' as "a|title", ' ' as "empty ", n as int
+ from generate_series(1,2) as n;
+\pset expanded off
+\pset border 0
+execute q;
+
+[options="header",cols="<l,<l,>l",frame="none",grid="none"]
+|====
+^l|a\|title ^l|empty ^l|int
+|some\|text | |1
+|some\|text | |2
+|====
+
+....
+(2 rows)
+....
+\pset border 1
+execute q;
+
+[options="header",cols="<l,<l,>l",frame="none"]
+|====
+^l|a\|title ^l|empty ^l|int
+|some\|text | |1
+|some\|text | |2
+|====
+
+....
+(2 rows)
+....
+\pset border 2
+execute q;
+
+[options="header",cols="<l,<l,>l",frame="all",grid="all"]
+|====
+^l|a\|title ^l|empty ^l|int
+|some\|text | |1
+|some\|text | |2
+|====
+
+....
+(2 rows)
+....
+\pset expanded on
+\pset border 0
+execute q;
+
+[cols="h,l",frame="none",grid="none"]
+|====
+2+^|Record 1
+<l|a\|title <l|some\|text
+<l|empty <l|
+<l|int >l|1
+2+^|Record 2
+<l|a\|title <l|some\|text
+<l|empty <l|
+<l|int >l|2
+|====
+\pset border 1
+execute q;
+
+[cols="h,l",frame="none"]
+|====
+2+^|Record 1
+<l|a\|title <l|some\|text
+<l|empty <l|
+<l|int >l|1
+2+^|Record 2
+<l|a\|title <l|some\|text
+<l|empty <l|
+<l|int >l|2
+|====
+\pset border 2
+execute q;
+
+[cols="h,l",frame="all",grid="all"]
+|====
+2+^|Record 1
+<l|a\|title <l|some\|text
+<l|empty <l|
+<l|int >l|1
+2+^|Record 2
+<l|a\|title <l|some\|text
+<l|empty <l|
+<l|int >l|2
+|====
+deallocate q;
+-- test csv output format
+\pset format csv
+\pset border 1
+\pset expanded off
+\d psql_serial_tab_id_seq
+Type,Start,Minimum,Maximum,Increment,Cycles?,Cache
+integer,1,1,2147483647,1,no,1
+\pset tuples_only true
+\df exp
+pg_catalog,exp,double precision,double precision,func
+pg_catalog,exp,numeric,numeric,func
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+Type,integer
+Start,1
+Minimum,1
+Maximum,2147483647
+Increment,1
+Cycles?,no
+Cache,1
+\pset tuples_only true
+\df exp
+Schema,pg_catalog
+Name,exp
+Result data type,double precision
+Argument data types,double precision
+Type,func
+Schema,pg_catalog
+Name,exp
+Result data type,numeric
+Argument data types,numeric
+Type,func
+\pset tuples_only false
+prepare q as
+ select 'some"text' as "a""title", E' <foo>\n<bar>' as "junk",
+ ' ' as "empty", n as int
+ from generate_series(1,2) as n;
+\pset expanded off
+execute q;
+"a""title",junk,empty,int
+"some""text"," <foo>
+<bar>", ,1
+"some""text"," <foo>
+<bar>", ,2
+\pset expanded on
+execute q;
+"a""title","some""text"
+junk," <foo>
+<bar>"
+empty,
+int,1
+"a""title","some""text"
+junk," <foo>
+<bar>"
+empty,
+int,2
+deallocate q;
+-- special cases
+\pset expanded off
+select 'comma,comma' as comma, 'semi;semi' as semi;
+comma,semi
+"comma,comma",semi;semi
+\pset csv_fieldsep ';'
+select 'comma,comma' as comma, 'semi;semi' as semi;
+comma;semi
+comma,comma;"semi;semi"
+select '\.' as data;
+data
+"\."
+\pset csv_fieldsep '.'
+select '\' as d1, '' as d2;
+"d1"."d2"
+"\".""
+-- illegal csv separators
+\pset csv_fieldsep ''
+\pset: csv_fieldsep must be a single one-byte character
+\pset csv_fieldsep '\0'
+\pset: csv_fieldsep must be a single one-byte character
+\pset csv_fieldsep '\n'
+\pset: csv_fieldsep cannot be a double quote, a newline, or a carriage return
+\pset csv_fieldsep '\r'
+\pset: csv_fieldsep cannot be a double quote, a newline, or a carriage return
+\pset csv_fieldsep '"'
+\pset: csv_fieldsep cannot be a double quote, a newline, or a carriage return
+\pset csv_fieldsep ',,'
+\pset: csv_fieldsep must be a single one-byte character
+\pset csv_fieldsep ','
+-- test html output format
+\pset format html
+\pset border 1
+\pset expanded off
+\d psql_serial_tab_id_seq
+<table border="1">
+ <caption>Sequence "public.psql_serial_tab_id_seq"</caption>
+ <tr>
+ <th align="center">Type</th>
+ <th align="center">Start</th>
+ <th align="center">Minimum</th>
+ <th align="center">Maximum</th>
+ <th align="center">Increment</th>
+ <th align="center">Cycles?</th>
+ <th align="center">Cache</th>
+ </tr>
+ <tr valign="top">
+ <td align="left">integer</td>
+ <td align="right">1</td>
+ <td align="right">1</td>
+ <td align="right">2147483647</td>
+ <td align="right">1</td>
+ <td align="left">no</td>
+ <td align="right">1</td>
+ </tr>
+</table>
+<p>Owned by: public.psql_serial_tab.id<br />
+</p>
+\pset tuples_only true
+\df exp
+<table border="1">
+ <tr valign="top">
+ <td align="left">pg_catalog</td>
+ <td align="left">exp</td>
+ <td align="left">double precision</td>
+ <td align="left">double precision</td>
+ <td align="left">func</td>
+ </tr>
+ <tr valign="top">
+ <td align="left">pg_catalog</td>
+ <td align="left">exp</td>
+ <td align="left">numeric</td>
+ <td align="left">numeric</td>
+ <td align="left">func</td>
+ </tr>
+</table>
+
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+<table border="1">
+ <caption>Sequence "public.psql_serial_tab_id_seq"</caption>
+
+ <tr><td colspan="2" align="center">Record 1</td></tr>
+ <tr valign="top">
+ <th>Type</th>
+ <td align="left">integer</td>
+ </tr>
+ <tr valign="top">
+ <th>Start</th>
+ <td align="right">1</td>
+ </tr>
+ <tr valign="top">
+ <th>Minimum</th>
+ <td align="right">1</td>
+ </tr>
+ <tr valign="top">
+ <th>Maximum</th>
+ <td align="right">2147483647</td>
+ </tr>
+ <tr valign="top">
+ <th>Increment</th>
+ <td align="right">1</td>
+ </tr>
+ <tr valign="top">
+ <th>Cycles?</th>
+ <td align="left">no</td>
+ </tr>
+ <tr valign="top">
+ <th>Cache</th>
+ <td align="right">1</td>
+ </tr>
+</table>
+<p>Owned by: public.psql_serial_tab.id<br />
+</p>
+\pset tuples_only true
+\df exp
+<table border="1">
+
+ <tr><td colspan="2"> </td></tr>
+ <tr valign="top">
+ <th>Schema</th>
+ <td align="left">pg_catalog</td>
+ </tr>
+ <tr valign="top">
+ <th>Name</th>
+ <td align="left">exp</td>
+ </tr>
+ <tr valign="top">
+ <th>Result data type</th>
+ <td align="left">double precision</td>
+ </tr>
+ <tr valign="top">
+ <th>Argument data types</th>
+ <td align="left">double precision</td>
+ </tr>
+ <tr valign="top">
+ <th>Type</th>
+ <td align="left">func</td>
+ </tr>
+
+ <tr><td colspan="2"> </td></tr>
+ <tr valign="top">
+ <th>Schema</th>
+ <td align="left">pg_catalog</td>
+ </tr>
+ <tr valign="top">
+ <th>Name</th>
+ <td align="left">exp</td>
+ </tr>
+ <tr valign="top">
+ <th>Result data type</th>
+ <td align="left">numeric</td>
+ </tr>
+ <tr valign="top">
+ <th>Argument data types</th>
+ <td align="left">numeric</td>
+ </tr>
+ <tr valign="top">
+ <th>Type</th>
+ <td align="left">func</td>
+ </tr>
+</table>
+
+\pset tuples_only false
+prepare q as
+ select 'some"text' as "a&title", E' <foo>\n<bar>' as "junk",
+ ' ' as "empty", n as int
+ from generate_series(1,2) as n;
+\pset expanded off
+\pset border 0
+execute q;
+<table border="0">
+ <tr>
+ <th align="center">a&title</th>
+ <th align="center">junk</th>
+ <th align="center">empty</th>
+ <th align="center">int</th>
+ </tr>
+ <tr valign="top">
+ <td align="left">some"text</td>
+ <td align="left"> <foo><br />
+<bar></td>
+ <td align="left"> </td>
+ <td align="right">1</td>
+ </tr>
+ <tr valign="top">
+ <td align="left">some"text</td>
+ <td align="left"> <foo><br />
+<bar></td>
+ <td align="left"> </td>
+ <td align="right">2</td>
+ </tr>
+</table>
+<p>(2 rows)<br />
+</p>
+\pset border 1
+execute q;
+<table border="1">
+ <tr>
+ <th align="center">a&title</th>
+ <th align="center">junk</th>
+ <th align="center">empty</th>
+ <th align="center">int</th>
+ </tr>
+ <tr valign="top">
+ <td align="left">some"text</td>
+ <td align="left"> <foo><br />
+<bar></td>
+ <td align="left"> </td>
+ <td align="right">1</td>
+ </tr>
+ <tr valign="top">
+ <td align="left">some"text</td>
+ <td align="left"> <foo><br />
+<bar></td>
+ <td align="left"> </td>
+ <td align="right">2</td>
+ </tr>
+</table>
+<p>(2 rows)<br />
+</p>
+\pset tableattr foobar
+execute q;
+<table border="1" foobar>
+ <tr>
+ <th align="center">a&title</th>
+ <th align="center">junk</th>
+ <th align="center">empty</th>
+ <th align="center">int</th>
+ </tr>
+ <tr valign="top">
+ <td align="left">some"text</td>
+ <td align="left"> <foo><br />
+<bar></td>
+ <td align="left"> </td>
+ <td align="right">1</td>
+ </tr>
+ <tr valign="top">
+ <td align="left">some"text</td>
+ <td align="left"> <foo><br />
+<bar></td>
+ <td align="left"> </td>
+ <td align="right">2</td>
+ </tr>
+</table>
+<p>(2 rows)<br />
+</p>
+\pset tableattr
+\pset expanded on
+\pset border 0
+execute q;
+<table border="0">
+
+ <tr><td colspan="2" align="center">Record 1</td></tr>
+ <tr valign="top">
+ <th>a&title</th>
+ <td align="left">some"text</td>
+ </tr>
+ <tr valign="top">
+ <th>junk</th>
+ <td align="left"> <foo><br />
+<bar></td>
+ </tr>
+ <tr valign="top">
+ <th>empty</th>
+ <td align="left"> </td>
+ </tr>
+ <tr valign="top">
+ <th>int</th>
+ <td align="right">1</td>
+ </tr>
+
+ <tr><td colspan="2" align="center">Record 2</td></tr>
+ <tr valign="top">
+ <th>a&title</th>
+ <td align="left">some"text</td>
+ </tr>
+ <tr valign="top">
+ <th>junk</th>
+ <td align="left"> <foo><br />
+<bar></td>
+ </tr>
+ <tr valign="top">
+ <th>empty</th>
+ <td align="left"> </td>
+ </tr>
+ <tr valign="top">
+ <th>int</th>
+ <td align="right">2</td>
+ </tr>
+</table>
+
+\pset border 1
+execute q;
+<table border="1">
+
+ <tr><td colspan="2" align="center">Record 1</td></tr>
+ <tr valign="top">
+ <th>a&title</th>
+ <td align="left">some"text</td>
+ </tr>
+ <tr valign="top">
+ <th>junk</th>
+ <td align="left"> <foo><br />
+<bar></td>
+ </tr>
+ <tr valign="top">
+ <th>empty</th>
+ <td align="left"> </td>
+ </tr>
+ <tr valign="top">
+ <th>int</th>
+ <td align="right">1</td>
+ </tr>
+
+ <tr><td colspan="2" align="center">Record 2</td></tr>
+ <tr valign="top">
+ <th>a&title</th>
+ <td align="left">some"text</td>
+ </tr>
+ <tr valign="top">
+ <th>junk</th>
+ <td align="left"> <foo><br />
+<bar></td>
+ </tr>
+ <tr valign="top">
+ <th>empty</th>
+ <td align="left"> </td>
+ </tr>
+ <tr valign="top">
+ <th>int</th>
+ <td align="right">2</td>
+ </tr>
+</table>
+
+\pset tableattr foobar
+execute q;
+<table border="1" foobar>
+
+ <tr><td colspan="2" align="center">Record 1</td></tr>
+ <tr valign="top">
+ <th>a&title</th>
+ <td align="left">some"text</td>
+ </tr>
+ <tr valign="top">
+ <th>junk</th>
+ <td align="left"> <foo><br />
+<bar></td>
+ </tr>
+ <tr valign="top">
+ <th>empty</th>
+ <td align="left"> </td>
+ </tr>
+ <tr valign="top">
+ <th>int</th>
+ <td align="right">1</td>
+ </tr>
+
+ <tr><td colspan="2" align="center">Record 2</td></tr>
+ <tr valign="top">
+ <th>a&title</th>
+ <td align="left">some"text</td>
+ </tr>
+ <tr valign="top">
+ <th>junk</th>
+ <td align="left"> <foo><br />
+<bar></td>
+ </tr>
+ <tr valign="top">
+ <th>empty</th>
+ <td align="left"> </td>
+ </tr>
+ <tr valign="top">
+ <th>int</th>
+ <td align="right">2</td>
+ </tr>
+</table>
+
+\pset tableattr
+deallocate q;
+-- test latex output format
+\pset format latex
+\pset border 1
+\pset expanded off
+\d psql_serial_tab_id_seq
+\begin{center}
+Sequence "public.psql\_serial\_tab\_id\_seq"
+\end{center}
+
+\begin{tabular}{l | r | r | r | r | l | r}
+\textit{Type} & \textit{Start} & \textit{Minimum} & \textit{Maximum} & \textit{Increment} & \textit{Cycles?} & \textit{Cache} \\
+\hline
+integer & 1 & 1 & 2147483647 & 1 & no & 1 \\
+\end{tabular}
+
+\noindent Owned by: public.psql\_serial\_tab.id \\
+
+\pset tuples_only true
+\df exp
+\begin{tabular}{l | l | l | l | l}
+pg\_catalog & exp & double precision & double precision & func \\
+pg\_catalog & exp & numeric & numeric & func \\
+\end{tabular}
+
+\noindent
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+\begin{center}
+Sequence "public.psql\_serial\_tab\_id\_seq"
+\end{center}
+
+\begin{tabular}{c|l}
+\multicolumn{2}{c}{\textit{Record 1}} \\
+\hline
+Type & integer \\
+Start & 1 \\
+Minimum & 1 \\
+Maximum & 2147483647 \\
+Increment & 1 \\
+Cycles? & no \\
+Cache & 1 \\
+\end{tabular}
+
+\noindent Owned by: public.psql\_serial\_tab.id \\
+
+\pset tuples_only true
+\df exp
+\begin{tabular}{c|l}
+\hline
+Schema & pg\_catalog \\
+Name & exp \\
+Result data type & double precision \\
+Argument data types & double precision \\
+Type & func \\
+\hline
+Schema & pg\_catalog \\
+Name & exp \\
+Result data type & numeric \\
+Argument data types & numeric \\
+Type & func \\
+\end{tabular}
+
+\noindent
+\pset tuples_only false
+prepare q as
+ select 'some\more_text' as "a$title", E' #<foo>%&^~|\n{bar}' as "junk",
+ ' ' as "empty", n as int
+ from generate_series(1,2) as n;
+\pset expanded off
+\pset border 0
+execute q;
+\begin{tabular}{lllr}
+\textit{a\$title} & \textit{junk} & \textit{empty} & \textit{int} \\
+\hline
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 1 \\
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 2 \\
+\end{tabular}
+
+\noindent (2 rows) \\
+
+\pset border 1
+execute q;
+\begin{tabular}{l | l | l | r}
+\textit{a\$title} & \textit{junk} & \textit{empty} & \textit{int} \\
+\hline
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 1 \\
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 2 \\
+\end{tabular}
+
+\noindent (2 rows) \\
+
+\pset border 2
+execute q;
+\begin{tabular}{| l | l | l | r |}
+\hline
+\textit{a\$title} & \textit{junk} & \textit{empty} & \textit{int} \\
+\hline
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 1 \\
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 2 \\
+\hline
+\end{tabular}
+
+\noindent (2 rows) \\
+
+\pset border 3
+execute q;
+\begin{tabular}{| l | l | l | r |}
+\hline
+\textit{a\$title} & \textit{junk} & \textit{empty} & \textit{int} \\
+\hline
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 1 \\
+\hline
+some\textbackslash{}more\_text & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} & & 2 \\
+\hline
+\end{tabular}
+
+\noindent (2 rows) \\
+
+\pset expanded on
+\pset border 0
+execute q;
+\begin{tabular}{cl}
+\multicolumn{2}{c}{\textit{Record 1}} \\
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\multicolumn{2}{c}{\textit{Record 2}} \\
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\end{tabular}
+
+\noindent
+\pset border 1
+execute q;
+\begin{tabular}{c|l}
+\multicolumn{2}{c}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\multicolumn{2}{c}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\end{tabular}
+
+\noindent
+\pset border 2
+execute q;
+\begin{tabular}{|c|l|}
+\hline
+\multicolumn{2}{|c|}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\hline
+\multicolumn{2}{|c|}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\hline
+\end{tabular}
+
+\noindent
+\pset border 3
+execute q;
+\begin{tabular}{|c|l|}
+\hline
+\multicolumn{2}{|c|}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\hline
+\multicolumn{2}{|c|}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\hline
+\end{tabular}
+
+\noindent
+deallocate q;
+-- test latex-longtable output format
+\pset format latex-longtable
+\pset border 1
+\pset expanded off
+\d psql_serial_tab_id_seq
+\begin{longtable}{l | r | r | r | r | l | r}
+\small\textbf{\textit{Type}} & \small\textbf{\textit{Start}} & \small\textbf{\textit{Minimum}} & \small\textbf{\textit{Maximum}} & \small\textbf{\textit{Increment}} & \small\textbf{\textit{Cycles?}} & \small\textbf{\textit{Cache}} \\
+\midrule
+\endfirsthead
+\small\textbf{\textit{Type}} & \small\textbf{\textit{Start}} & \small\textbf{\textit{Minimum}} & \small\textbf{\textit{Maximum}} & \small\textbf{\textit{Increment}} & \small\textbf{\textit{Cycles?}} & \small\textbf{\textit{Cache}} \\
+\midrule
+\endhead
+\caption[Sequence "public.psql\_serial\_tab\_id\_seq" (Continued)]{Sequence "public.psql\_serial\_tab\_id\_seq"}
+\endfoot
+\caption[Sequence "public.psql\_serial\_tab\_id\_seq"]{Sequence "public.psql\_serial\_tab\_id\_seq"}
+\endlastfoot
+\raggedright{integer}
+&
+\raggedright{1}
+&
+\raggedright{1}
+&
+\raggedright{2147483647}
+&
+\raggedright{1}
+&
+\raggedright{no}
+&
+\raggedright{1} \tabularnewline
+\end{longtable}
+\pset tuples_only true
+\df exp
+\begin{longtable}{l | l | l | l | l}
+\raggedright{pg\_catalog}
+&
+\raggedright{exp}
+&
+\raggedright{double precision}
+&
+\raggedright{double precision}
+&
+\raggedright{func} \tabularnewline
+\raggedright{pg\_catalog}
+&
+\raggedright{exp}
+&
+\raggedright{numeric}
+&
+\raggedright{numeric}
+&
+\raggedright{func} \tabularnewline
+\end{longtable}
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+\begin{center}
+Sequence "public.psql\_serial\_tab\_id\_seq"
+\end{center}
+
+\begin{tabular}{c|l}
+\multicolumn{2}{c}{\textit{Record 1}} \\
+\hline
+Type & integer \\
+Start & 1 \\
+Minimum & 1 \\
+Maximum & 2147483647 \\
+Increment & 1 \\
+Cycles? & no \\
+Cache & 1 \\
+\end{tabular}
+
+\noindent Owned by: public.psql\_serial\_tab.id \\
+
+\pset tuples_only true
+\df exp
+\begin{tabular}{c|l}
+\hline
+Schema & pg\_catalog \\
+Name & exp \\
+Result data type & double precision \\
+Argument data types & double precision \\
+Type & func \\
+\hline
+Schema & pg\_catalog \\
+Name & exp \\
+Result data type & numeric \\
+Argument data types & numeric \\
+Type & func \\
+\end{tabular}
+
+\noindent
+\pset tuples_only false
+prepare q as
+ select 'some\more_text' as "a$title", E' #<foo>%&^~|\n{bar}' as "junk",
+ ' ' as "empty", n as int
+ from generate_series(1,2) as n;
+\pset expanded off
+\pset border 0
+execute q;
+\begin{longtable}{lllr}
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endfirsthead
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endhead
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{1} \tabularnewline
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{2} \tabularnewline
+\end{longtable}
+\pset border 1
+execute q;
+\begin{longtable}{l | l | l | r}
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endfirsthead
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endhead
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{1} \tabularnewline
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{2} \tabularnewline
+\end{longtable}
+\pset border 2
+execute q;
+\begin{longtable}{| l | l | l | r |}
+\toprule
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endfirsthead
+\toprule
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endhead
+\bottomrule
+\endfoot
+\bottomrule
+\endlastfoot
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{1} \tabularnewline
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{2} \tabularnewline
+\end{longtable}
+\pset border 3
+execute q;
+\begin{longtable}{| l | l | l | r |}
+\toprule
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endfirsthead
+\toprule
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\endhead
+\bottomrule
+\endfoot
+\bottomrule
+\endlastfoot
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{1} \tabularnewline
+ \hline
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{2} \tabularnewline
+ \hline
+\end{longtable}
+\pset tableattr lr
+execute q;
+\begin{longtable}{| p{lr\textwidth} | p{lr\textwidth} | p{lr\textwidth} | r |}
+\toprule
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\midrule
+\endfirsthead
+\toprule
+\small\textbf{\textit{a\$title}} & \small\textbf{\textit{junk}} & \small\textbf{\textit{empty}} & \small\textbf{\textit{int}} \\
+\endhead
+\bottomrule
+\endfoot
+\bottomrule
+\endlastfoot
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{1} \tabularnewline
+ \hline
+\raggedright{some\textbackslash{}more\_text}
+&
+\raggedright{ \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\}}
+&
+\raggedright{ }
+&
+\raggedright{2} \tabularnewline
+ \hline
+\end{longtable}
+\pset tableattr
+\pset expanded on
+\pset border 0
+execute q;
+\begin{tabular}{cl}
+\multicolumn{2}{c}{\textit{Record 1}} \\
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\multicolumn{2}{c}{\textit{Record 2}} \\
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\end{tabular}
+
+\noindent
+\pset border 1
+execute q;
+\begin{tabular}{c|l}
+\multicolumn{2}{c}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\multicolumn{2}{c}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\end{tabular}
+
+\noindent
+\pset border 2
+execute q;
+\begin{tabular}{|c|l|}
+\hline
+\multicolumn{2}{|c|}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\hline
+\multicolumn{2}{|c|}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\hline
+\end{tabular}
+
+\noindent
+\pset border 3
+execute q;
+\begin{tabular}{|c|l|}
+\hline
+\multicolumn{2}{|c|}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\hline
+\multicolumn{2}{|c|}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\hline
+\end{tabular}
+
+\noindent
+\pset tableattr lr
+execute q;
+\begin{tabular}{|c|l|}
+\hline
+\multicolumn{2}{|c|}{\textit{Record 1}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 1 \\
+\hline
+\multicolumn{2}{|c|}{\textit{Record 2}} \\
+\hline
+a\$title & some\textbackslash{}more\_text \\
+junk & \#\textless{}foo\textgreater{}\%\&\^{}\~{}\textbar{}\\\{bar\} \\
+empty & \\
+int & 2 \\
+\hline
+\end{tabular}
+
+\noindent
+\pset tableattr
+deallocate q;
+-- test troff-ms output format
+\pset format troff-ms
+\pset border 1
+\pset expanded off
+\d psql_serial_tab_id_seq
+.LP
+.DS C
+Sequence "public.psql_serial_tab_id_seq"
+.DE
+.LP
+.TS
+center;
+l | r | r | r | r | l | r.
+\fIType\fP \fIStart\fP \fIMinimum\fP \fIMaximum\fP \fIIncrement\fP \fICycles?\fP \fICache\fP
+_
+integer 1 1 2147483647 1 no 1
+.TE
+.DS L
+Owned by: public.psql_serial_tab.id
+.DE
+\pset tuples_only true
+\df exp
+.LP
+.TS
+center;
+l | l | l | l | l.
+pg_catalog exp double precision double precision func
+pg_catalog exp numeric numeric func
+.TE
+.DS L
+.DE
+\pset tuples_only false
+\pset expanded on
+\d psql_serial_tab_id_seq
+.LP
+.DS C
+Sequence "public.psql_serial_tab_id_seq"
+.DE
+.LP
+.TS
+center;
+c s.
+\fIRecord 1\fP
+_
+.T&
+c | l.
+Type integer
+Start 1
+Minimum 1
+Maximum 2147483647
+Increment 1
+Cycles? no
+Cache 1
+.TE
+.DS L
+Owned by: public.psql_serial_tab.id
+.DE
+\pset tuples_only true
+\df exp
+.LP
+.TS
+center;
+c l;
+_
+Schema pg_catalog
+Name exp
+Result data type double precision
+Argument data types double precision
+Type func
+_
+Schema pg_catalog
+Name exp
+Result data type numeric
+Argument data types numeric
+Type func
+.TE
+.DS L
+.DE
+\pset tuples_only false
+prepare q as
+ select 'some\text' as "a\title", E' <foo>\n<bar>' as "junk",
+ ' ' as "empty", n as int
+ from generate_series(1,2) as n;
+\pset expanded off
+\pset border 0
+execute q;
+.LP
+.TS
+center;
+lllr.
+\fIa\(rstitle\fP \fIjunk\fP \fIempty\fP \fIint\fP
+_
+some\(rstext <foo>
+<bar> 1
+some\(rstext <foo>
+<bar> 2
+.TE
+.DS L
+(2 rows)
+.DE
+\pset border 1
+execute q;
+.LP
+.TS
+center;
+l | l | l | r.
+\fIa\(rstitle\fP \fIjunk\fP \fIempty\fP \fIint\fP
+_
+some\(rstext <foo>
+<bar> 1
+some\(rstext <foo>
+<bar> 2
+.TE
+.DS L
+(2 rows)
+.DE
+\pset border 2
+execute q;
+.LP
+.TS
+center box;
+l | l | l | r.
+\fIa\(rstitle\fP \fIjunk\fP \fIempty\fP \fIint\fP
+_
+some\(rstext <foo>
+<bar> 1
+some\(rstext <foo>
+<bar> 2
+.TE
+.DS L
+(2 rows)
+.DE
+\pset expanded on
+\pset border 0
+execute q;
+.LP
+.TS
+center;
+c s.
+\fIRecord 1\fP
+.T&
+c l.
+a\(rstitle some\(rstext
+junk <foo>
+<bar>
+empty
+int 1
+.T&
+c s.
+\fIRecord 2\fP
+.T&
+c l.
+a\(rstitle some\(rstext
+junk <foo>
+<bar>
+empty
+int 2
+.TE
+.DS L
+.DE
+\pset border 1
+execute q;
+.LP
+.TS
+center;
+c s.
+\fIRecord 1\fP
+_
+.T&
+c | l.
+a\(rstitle some\(rstext
+junk <foo>
+<bar>
+empty
+int 1
+.T&
+c s.
+\fIRecord 2\fP
+_
+.T&
+c | l.
+a\(rstitle some\(rstext
+junk <foo>
+<bar>
+empty
+int 2
+.TE
+.DS L
+.DE
+\pset border 2
+execute q;
+.LP
+.TS
+center box;
+c s.
+\fIRecord 1\fP
+_
+.T&
+c l.
+a\(rstitle some\(rstext
+junk <foo>
+<bar>
+empty
+int 1
+_
+.T&
+c s.
+\fIRecord 2\fP
+_
+.T&
+c l.
+a\(rstitle some\(rstext
+junk <foo>
+<bar>
+empty
+int 2
+.TE
+.DS L
+.DE
+deallocate q;
+-- check ambiguous format requests
+\pset format a
+\pset: ambiguous abbreviation "a" matches both "aligned" and "asciidoc"
+\pset format l
+-- clean up after output format tests
+drop table psql_serial_tab;
+\pset format aligned
+\pset expanded off
+\pset border 1
+-- \echo and allied features
+\echo this is a test
+this is a test
+\echo -n without newline
+without newline\echo with -n newline
+with -n newline
+\echo '-n' with newline
+-n with newline
+\set foo bar
+\echo foo = :foo
+foo = bar
+\qecho this is a test
+this is a test
+\qecho foo = :foo
+foo = bar
+\warn this is a test
+this is a test
+\warn foo = :foo
+foo = bar
+-- tests for \if ... \endif
+\if true
+ select 'okay';
+ ?column?
+----------
+ okay
+(1 row)
+
+ select 'still okay';
+ ?column?
+------------
+ still okay
+(1 row)
+
+\else
+ not okay;
+ still not okay
+\endif
+-- at this point query buffer should still have last valid line
+\g
+ ?column?
+------------
+ still okay
+(1 row)
+
+-- \if should work okay on part of a query
+select
+ \if true
+ 42
+ \else
+ (bogus
+ \endif
+ forty_two;
+ forty_two
+-----------
+ 42
+(1 row)
+
+select \if false \\ (bogus \else \\ 42 \endif \\ forty_two;
+ forty_two
+-----------
+ 42
+(1 row)
+
+-- test a large nested if using a variety of true-equivalents
+\if true
+ \if 1
+ \if yes
+ \if on
+ \echo 'all true'
+all true
+ \else
+ \echo 'should not print #1-1'
+ \endif
+ \else
+ \echo 'should not print #1-2'
+ \endif
+ \else
+ \echo 'should not print #1-3'
+ \endif
+\else
+ \echo 'should not print #1-4'
+\endif
+-- test a variety of false-equivalents in an if/elif/else structure
+\if false
+ \echo 'should not print #2-1'
+\elif 0
+ \echo 'should not print #2-2'
+\elif no
+ \echo 'should not print #2-3'
+\elif off
+ \echo 'should not print #2-4'
+\else
+ \echo 'all false'
+all false
+\endif
+-- test true-false elif after initial true branch
+\if true
+ \echo 'should print #2-5'
+should print #2-5
+\elif true
+ \echo 'should not print #2-6'
+\elif false
+ \echo 'should not print #2-7'
+\else
+ \echo 'should not print #2-8'
+\endif
+-- test simple true-then-else
+\if true
+ \echo 'first thing true'
+first thing true
+\else
+ \echo 'should not print #3-1'
+\endif
+-- test simple false-true-else
+\if false
+ \echo 'should not print #4-1'
+\elif true
+ \echo 'second thing true'
+second thing true
+\else
+ \echo 'should not print #5-1'
+\endif
+-- invalid boolean expressions are false
+\if invalid boolean expression
+unrecognized value "invalid boolean expression" for "\if expression": Boolean expected
+ \echo 'will not print #6-1'
+\else
+ \echo 'will print anyway #6-2'
+will print anyway #6-2
+\endif
+-- test un-matched endif
+\endif
+\endif: no matching \if
+-- test un-matched else
+\else
+\else: no matching \if
+-- test un-matched elif
+\elif
+\elif: no matching \if
+-- test double-else error
+\if true
+\else
+\else
+\else: cannot occur after \else
+\endif
+-- test elif out-of-order
+\if false
+\else
+\elif
+\elif: cannot occur after \else
+\endif
+-- test if-endif matching in a false branch
+\if false
+ \if false
+ \echo 'should not print #7-1'
+ \else
+ \echo 'should not print #7-2'
+ \endif
+ \echo 'should not print #7-3'
+\else
+ \echo 'should print #7-4'
+should print #7-4
+\endif
+-- show that vars and backticks are not expanded when ignoring extra args
+\set foo bar
+\echo :foo :'foo' :"foo"
+bar 'bar' "bar"
+\pset fieldsep | `nosuchcommand` :foo :'foo' :"foo"
+\pset: extra argument "nosuchcommand" ignored
+\pset: extra argument ":foo" ignored
+\pset: extra argument ":'foo'" ignored
+\pset: extra argument ":"foo"" ignored
+-- show that vars and backticks are not expanded and commands are ignored
+-- when in a false if-branch
+\set try_to_quit '\\q'
+\if false
+ :try_to_quit
+ \echo `nosuchcommand` :foo :'foo' :"foo"
+ \pset fieldsep | `nosuchcommand` :foo :'foo' :"foo"
+ \a
+ \C arg1
+ \c arg1 arg2 arg3 arg4
+ \cd arg1
+ \conninfo
+ \copy arg1 arg2 arg3 arg4 arg5 arg6
+ \copyright
+ SELECT 1 as one, 2, 3 \crosstabview
+ \dt arg1
+ \e arg1 arg2
+ \ef whole_line
+ \ev whole_line
+ \echo arg1 arg2 arg3 arg4 arg5
+ \echo arg1
+ \encoding arg1
+ \errverbose
+ \f arg1
+ \g arg1
+ \gx arg1
+ \gexec
+ SELECT 1 AS one \gset
+ \h
+ \?
+ \html
+ \i arg1
+ \ir arg1
+ \l arg1
+ \lo arg1 arg2
+invalid command \lo
+ \lo_list
+ \o arg1
+ \p
+ \password arg1
+ \prompt arg1 arg2
+ \pset arg1 arg2
+ \q
+ \reset
+ \s arg1
+ \set arg1 arg2 arg3 arg4 arg5 arg6 arg7
+ \setenv arg1 arg2
+ \sf whole_line
+ \sv whole_line
+ \t arg1
+ \T arg1
+ \timing arg1
+ \unset arg1
+ \w arg1
+ \watch arg1
+ \x arg1
+ -- \else here is eaten as part of OT_FILEPIPE argument
+ \w |/no/such/file \else
+ -- \endif here is eaten as part of whole-line argument
+ \! whole_line \endif
+ \z
+\else
+ \echo 'should print #8-1'
+should print #8-1
+\endif
+-- :{?...} defined variable test
+\set i 1
+\if :{?i}
+ \echo '#9-1 ok, variable i is defined'
+#9-1 ok, variable i is defined
+\else
+ \echo 'should not print #9-2'
+\endif
+\if :{?no_such_variable}
+ \echo 'should not print #10-1'
+\else
+ \echo '#10-2 ok, variable no_such_variable is not defined'
+#10-2 ok, variable no_such_variable is not defined
+\endif
+SELECT :{?i} AS i_is_defined;
+ i_is_defined
+--------------
+ t
+(1 row)
+
+SELECT NOT :{?no_such_var} AS no_such_var_is_not_defined;
+ no_such_var_is_not_defined
+----------------------------
+ t
+(1 row)
+
+-- SHOW_CONTEXT
+\set SHOW_CONTEXT never
+do $$
+begin
+ raise notice 'foo';
+ raise exception 'bar';
+end $$;
+NOTICE: foo
+ERROR: bar
+\set SHOW_CONTEXT errors
+do $$
+begin
+ raise notice 'foo';
+ raise exception 'bar';
+end $$;
+NOTICE: foo
+ERROR: bar
+CONTEXT: PL/pgSQL function inline_code_block line 4 at RAISE
+\set SHOW_CONTEXT always
+do $$
+begin
+ raise notice 'foo';
+ raise exception 'bar';
+end $$;
+NOTICE: foo
+CONTEXT: PL/pgSQL function inline_code_block line 3 at RAISE
+ERROR: bar
+CONTEXT: PL/pgSQL function inline_code_block line 4 at RAISE
+-- test printing and clearing the query buffer
+SELECT 1;
+ ?column?
+----------
+ 1
+(1 row)
+
+\p
+SELECT 1;
+SELECT 2 \r
+\p
+SELECT 1;
+SELECT 3 \p
+SELECT 3
+UNION SELECT 4 \p
+SELECT 3
+UNION SELECT 4
+UNION SELECT 5
+ORDER BY 1;
+ ?column?
+----------
+ 3
+ 4
+ 5
+(3 rows)
+
+\r
+\p
+SELECT 3
+UNION SELECT 4
+UNION SELECT 5
+ORDER BY 1;
+-- tests for special result variables
+-- working query, 2 rows selected
+SELECT 1 AS stuff UNION SELECT 2;
+ stuff
+-------
+ 1
+ 2
+(2 rows)
+
+\echo 'error:' :ERROR
+error: false
+\echo 'error code:' :SQLSTATE
+error code: 00000
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 2
+-- syntax error
+SELECT 1 UNION;
+ERROR: syntax error at or near ";"
+LINE 1: SELECT 1 UNION;
+ ^
+\echo 'error:' :ERROR
+error: true
+\echo 'error code:' :SQLSTATE
+error code: 42601
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 0
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: syntax error at or near ";"
+\echo 'last error code:' :LAST_ERROR_SQLSTATE
+last error code: 42601
+-- empty query
+;
+\echo 'error:' :ERROR
+error: false
+\echo 'error code:' :SQLSTATE
+error code: 00000
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 0
+-- must have kept previous values
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: syntax error at or near ";"
+\echo 'last error code:' :LAST_ERROR_SQLSTATE
+last error code: 42601
+-- other query error
+DROP TABLE this_table_does_not_exist;
+ERROR: table "this_table_does_not_exist" does not exist
+\echo 'error:' :ERROR
+error: true
+\echo 'error code:' :SQLSTATE
+error code: 42P01
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 0
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: table "this_table_does_not_exist" does not exist
+\echo 'last error code:' :LAST_ERROR_SQLSTATE
+last error code: 42P01
+-- nondefault verbosity error settings (except verbose, which is too unstable)
+\set VERBOSITY terse
+SELECT 1 UNION;
+ERROR: syntax error at or near ";" at character 15
+\echo 'error:' :ERROR
+error: true
+\echo 'error code:' :SQLSTATE
+error code: 42601
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: syntax error at or near ";"
+\set VERBOSITY sqlstate
+SELECT 1/0;
+ERROR: 22012
+\echo 'error:' :ERROR
+error: true
+\echo 'error code:' :SQLSTATE
+error code: 22012
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: division by zero
+\set VERBOSITY default
+-- working \gdesc
+SELECT 3 AS three, 4 AS four \gdesc
+ Column | Type
+--------+---------
+ three | integer
+ four | integer
+(2 rows)
+
+\echo 'error:' :ERROR
+error: false
+\echo 'error code:' :SQLSTATE
+error code: 00000
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 2
+-- \gdesc with an error
+SELECT 4 AS \gdesc
+ERROR: syntax error at end of input
+LINE 1: SELECT 4 AS
+ ^
+\echo 'error:' :ERROR
+error: true
+\echo 'error code:' :SQLSTATE
+error code: 42601
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 0
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: syntax error at end of input
+\echo 'last error code:' :LAST_ERROR_SQLSTATE
+last error code: 42601
+-- check row count for a cursor-fetched query
+\set FETCH_COUNT 10
+select unique2 from tenk1 order by unique2 limit 19;
+ unique2
+---------
+ 0
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+ 10
+ 11
+ 12
+ 13
+ 14
+ 15
+ 16
+ 17
+ 18
+(19 rows)
+
+\echo 'error:' :ERROR
+error: false
+\echo 'error code:' :SQLSTATE
+error code: 00000
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 19
+-- cursor-fetched query with an error after the first group
+select 1/(15-unique2) from tenk1 order by unique2 limit 19;
+ ?column?
+----------
+ 0
+ 0
+ 0
+ 0
+ 0
+ 0
+ 0
+ 0
+ 0
+ 0
+ERROR: division by zero
+\echo 'error:' :ERROR
+error: true
+\echo 'error code:' :SQLSTATE
+error code: 22012
+\echo 'number of rows:' :ROW_COUNT
+number of rows: 0
+\echo 'last error message:' :LAST_ERROR_MESSAGE
+last error message: division by zero
+\echo 'last error code:' :LAST_ERROR_SQLSTATE
+last error code: 22012
+\unset FETCH_COUNT
+create schema testpart;
+create role regress_partitioning_role;
+alter schema testpart owner to regress_partitioning_role;
+set role to regress_partitioning_role;
+-- run test inside own schema and hide other partitions
+set search_path to testpart;
+create table testtable_apple(logdate date);
+create table testtable_orange(logdate date);
+create index testtable_apple_index on testtable_apple(logdate);
+create index testtable_orange_index on testtable_orange(logdate);
+create table testpart_apple(logdate date) partition by range(logdate);
+create table testpart_orange(logdate date) partition by range(logdate);
+create index testpart_apple_index on testpart_apple(logdate);
+create index testpart_orange_index on testpart_orange(logdate);
+-- only partition related object should be displayed
+\dP test*apple*
+ List of partitioned relations
+ Schema | Name | Owner | Type | Parent name | Table
+----------+----------------------+---------------------------+-------------------+-------------+----------------
+ testpart | testpart_apple | regress_partitioning_role | partitioned table | |
+ testpart | testpart_apple_index | regress_partitioning_role | partitioned index | | testpart_apple
+(2 rows)
+
+\dPt test*apple*
+ List of partitioned tables
+ Schema | Name | Owner | Parent name
+----------+----------------+---------------------------+-------------
+ testpart | testpart_apple | regress_partitioning_role |
+(1 row)
+
+\dPi test*apple*
+ List of partitioned indexes
+ Schema | Name | Owner | Parent name | Table
+----------+----------------------+---------------------------+-------------+----------------
+ testpart | testpart_apple_index | regress_partitioning_role | | testpart_apple
+(1 row)
+
+drop table testtable_apple;
+drop table testtable_orange;
+drop table testpart_apple;
+drop table testpart_orange;
+create table parent_tab (id int) partition by range (id);
+create index parent_index on parent_tab (id);
+create table child_0_10 partition of parent_tab
+ for values from (0) to (10);
+create table child_10_20 partition of parent_tab
+ for values from (10) to (20);
+create table child_20_30 partition of parent_tab
+ for values from (20) to (30);
+insert into parent_tab values (generate_series(0,29));
+create table child_30_40 partition of parent_tab
+for values from (30) to (40)
+ partition by range(id);
+create table child_30_35 partition of child_30_40
+ for values from (30) to (35);
+create table child_35_40 partition of child_30_40
+ for values from (35) to (40);
+insert into parent_tab values (generate_series(30,39));
+\dPt
+ List of partitioned tables
+ Schema | Name | Owner
+----------+------------+---------------------------
+ testpart | parent_tab | regress_partitioning_role
+(1 row)
+
+\dPi
+ List of partitioned indexes
+ Schema | Name | Owner | Table
+----------+--------------+---------------------------+------------
+ testpart | parent_index | regress_partitioning_role | parent_tab
+(1 row)
+
+\dP testpart.*
+ List of partitioned relations
+ Schema | Name | Owner | Type | Parent name | Table
+----------+--------------------+---------------------------+-------------------+--------------+-------------
+ testpart | parent_tab | regress_partitioning_role | partitioned table | |
+ testpart | child_30_40 | regress_partitioning_role | partitioned table | parent_tab |
+ testpart | parent_index | regress_partitioning_role | partitioned index | | parent_tab
+ testpart | child_30_40_id_idx | regress_partitioning_role | partitioned index | parent_index | child_30_40
+(4 rows)
+
+\dP
+ List of partitioned relations
+ Schema | Name | Owner | Type | Table
+----------+--------------+---------------------------+-------------------+------------
+ testpart | parent_tab | regress_partitioning_role | partitioned table |
+ testpart | parent_index | regress_partitioning_role | partitioned index | parent_tab
+(2 rows)
+
+\dPtn
+ List of partitioned tables
+ Schema | Name | Owner | Parent name
+----------+-------------+---------------------------+-------------
+ testpart | parent_tab | regress_partitioning_role |
+ testpart | child_30_40 | regress_partitioning_role | parent_tab
+(2 rows)
+
+\dPin
+ List of partitioned indexes
+ Schema | Name | Owner | Parent name | Table
+----------+--------------------+---------------------------+--------------+-------------
+ testpart | parent_index | regress_partitioning_role | | parent_tab
+ testpart | child_30_40_id_idx | regress_partitioning_role | parent_index | child_30_40
+(2 rows)
+
+\dPn
+ List of partitioned relations
+ Schema | Name | Owner | Type | Parent name | Table
+----------+--------------------+---------------------------+-------------------+--------------+-------------
+ testpart | parent_tab | regress_partitioning_role | partitioned table | |
+ testpart | child_30_40 | regress_partitioning_role | partitioned table | parent_tab |
+ testpart | parent_index | regress_partitioning_role | partitioned index | | parent_tab
+ testpart | child_30_40_id_idx | regress_partitioning_role | partitioned index | parent_index | child_30_40
+(4 rows)
+
+\dPn testpart.*
+ List of partitioned relations
+ Schema | Name | Owner | Type | Parent name | Table
+----------+--------------------+---------------------------+-------------------+--------------+-------------
+ testpart | parent_tab | regress_partitioning_role | partitioned table | |
+ testpart | child_30_40 | regress_partitioning_role | partitioned table | parent_tab |
+ testpart | parent_index | regress_partitioning_role | partitioned index | | parent_tab
+ testpart | child_30_40_id_idx | regress_partitioning_role | partitioned index | parent_index | child_30_40
+(4 rows)
+
+drop table parent_tab cascade;
+drop schema testpart;
+set search_path to default;
+set role to default;
+drop role regress_partitioning_role;
+-- \d on toast table (use pg_statistic's toast table, which has a known name)
+\d pg_toast.pg_toast_2619
+TOAST table "pg_toast.pg_toast_2619"
+ Column | Type
+------------+---------
+ chunk_id | oid
+ chunk_seq | integer
+ chunk_data | bytea
+Owning table: "pg_catalog.pg_statistic"
+Indexes:
+ "pg_toast_2619_index" PRIMARY KEY, btree (chunk_id, chunk_seq)
+
+-- check printing info about access methods
+\dA
+List of access methods
+ Name | Type
+----------+-------
+ brin | Index
+ btree | Index
+ gin | Index
+ gist | Index
+ hash | Index
+ heap | Table
+ heap2 | Table
+ spgist | Index
+ zedstore | Table
+(9 rows)
+
+\dA *
+List of access methods
+ Name | Type
+----------+-------
+ brin | Index
+ btree | Index
+ gin | Index
+ gist | Index
+ hash | Index
+ heap | Table
+ heap2 | Table
+ spgist | Index
+ zedstore | Table
+(9 rows)
+
+\dA h*
+List of access methods
+ Name | Type
+-------+-------
+ hash | Index
+ heap | Table
+ heap2 | Table
+(3 rows)
+
+\dA foo
+List of access methods
+ Name | Type
+------+------
+(0 rows)
+
+\dA foo bar
+List of access methods
+ Name | Type
+------+------
+(0 rows)
+
+\dA: extra argument "bar" ignored
+\dA+
+ List of access methods
+ Name | Type | Handler | Description
+----------+-------+--------------------------+----------------------------------------
+ brin | Index | brinhandler | block range index (BRIN) access method
+ btree | Index | bthandler | b-tree index access method
+ gin | Index | ginhandler | GIN index access method
+ gist | Index | gisthandler | GiST index access method
+ hash | Index | hashhandler | hash index access method
+ heap | Table | heap_tableam_handler | heap table access method
+ heap2 | Table | heap_tableam_handler |
+ spgist | Index | spghandler | SP-GiST index access method
+ zedstore | Table | zedstore_tableam_handler | zedstore table access method
+(9 rows)
+
+\dA+ *
+ List of access methods
+ Name | Type | Handler | Description
+----------+-------+--------------------------+----------------------------------------
+ brin | Index | brinhandler | block range index (BRIN) access method
+ btree | Index | bthandler | b-tree index access method
+ gin | Index | ginhandler | GIN index access method
+ gist | Index | gisthandler | GiST index access method
+ hash | Index | hashhandler | hash index access method
+ heap | Table | heap_tableam_handler | heap table access method
+ heap2 | Table | heap_tableam_handler |
+ spgist | Index | spghandler | SP-GiST index access method
+ zedstore | Table | zedstore_tableam_handler | zedstore table access method
+(9 rows)
+
+\dA+ h*
+ List of access methods
+ Name | Type | Handler | Description
+-------+-------+----------------------+--------------------------
+ hash | Index | hashhandler | hash index access method
+ heap | Table | heap_tableam_handler | heap table access method
+ heap2 | Table | heap_tableam_handler |
+(3 rows)
+
+\dA+ foo
+ List of access methods
+ Name | Type | Handler | Description
+------+------+---------+-------------
+(0 rows)
+
+\dAc brin pg*.oid*
+ List of operator classes
+ AM | Input type | Storage type | Operator class | Default?
+------+------------+--------------+----------------+----------
+ brin | oid | | oid_minmax_ops | yes
+(1 row)
+
+\dAf spgist
+ List of operator families
+ AM | Operator family | Applicable types
+--------+-----------------+------------------
+ spgist | box_ops | box
+ spgist | kd_point_ops | point
+ spgist | network_ops | inet
+ spgist | poly_ops | polygon
+ spgist | quad_point_ops | point
+ spgist | range_ops | anyrange
+ spgist | text_ops | text
+(7 rows)
+
+\dAf btree int4
+ List of operator families
+ AM | Operator family | Applicable types
+-------+-----------------+---------------------------
+ btree | integer_ops | smallint, integer, bigint
+(1 row)
+
+\dAo+ btree float_ops
+ List of operators of operator families
+ AM | Operator family | Operator | Strategy | Purpose | Sort opfamily
+-------+-----------------+---------------------------------------+----------+---------+---------------
+ btree | float_ops | <(double precision,double precision) | 1 | search |
+ btree | float_ops | <=(double precision,double precision) | 2 | search |
+ btree | float_ops | =(double precision,double precision) | 3 | search |
+ btree | float_ops | >=(double precision,double precision) | 4 | search |
+ btree | float_ops | >(double precision,double precision) | 5 | search |
+ btree | float_ops | <(real,real) | 1 | search |
+ btree | float_ops | <=(real,real) | 2 | search |
+ btree | float_ops | =(real,real) | 3 | search |
+ btree | float_ops | >=(real,real) | 4 | search |
+ btree | float_ops | >(real,real) | 5 | search |
+ btree | float_ops | <(double precision,real) | 1 | search |
+ btree | float_ops | <=(double precision,real) | 2 | search |
+ btree | float_ops | =(double precision,real) | 3 | search |
+ btree | float_ops | >=(double precision,real) | 4 | search |
+ btree | float_ops | >(double precision,real) | 5 | search |
+ btree | float_ops | <(real,double precision) | 1 | search |
+ btree | float_ops | <=(real,double precision) | 2 | search |
+ btree | float_ops | =(real,double precision) | 3 | search |
+ btree | float_ops | >=(real,double precision) | 4 | search |
+ btree | float_ops | >(real,double precision) | 5 | search |
+(20 rows)
+
+\dAo * pg_catalog.jsonb_path_ops
+ List of operators of operator families
+ AM | Operator family | Operator | Strategy | Purpose
+-----+-----------------+--------------------+----------+---------
+ gin | jsonb_path_ops | @>(jsonb,jsonb) | 7 | search
+ gin | jsonb_path_ops | @?(jsonb,jsonpath) | 15 | search
+ gin | jsonb_path_ops | @@(jsonb,jsonpath) | 16 | search
+(3 rows)
+
+\dAp+ btree float_ops
+ List of support functions of operator families
+ AM | Operator family | Registered left type | Registered right type | Number | Function
+-------+-----------------+----------------------+-----------------------+--------+------------------------------------------------------------------------------
+ btree | float_ops | double precision | double precision | 1 | btfloat8cmp(double precision,double precision)
+ btree | float_ops | double precision | double precision | 2 | btfloat8sortsupport(internal)
+ btree | float_ops | double precision | double precision | 3 | in_range(double precision,double precision,double precision,boolean,boolean)
+ btree | float_ops | real | real | 1 | btfloat4cmp(real,real)
+ btree | float_ops | real | real | 2 | btfloat4sortsupport(internal)
+ btree | float_ops | double precision | real | 1 | btfloat84cmp(double precision,real)
+ btree | float_ops | real | double precision | 1 | btfloat48cmp(real,double precision)
+ btree | float_ops | real | double precision | 3 | in_range(real,real,double precision,boolean,boolean)
+(8 rows)
+
+\dAp * pg_catalog.uuid_ops
+ List of support functions of operator families
+ AM | Operator family | Registered left type | Registered right type | Number | Function
+-------+-----------------+----------------------+-----------------------+--------+--------------------
+ btree | uuid_ops | uuid | uuid | 1 | uuid_cmp
+ btree | uuid_ops | uuid | uuid | 2 | uuid_sortsupport
+ btree | uuid_ops | uuid | uuid | 4 | btequalimage
+ hash | uuid_ops | uuid | uuid | 1 | uuid_hash
+ hash | uuid_ops | uuid | uuid | 2 | uuid_hash_extended
+(5 rows)
+
diff --git a/src/test/regress/expected/psql_crosstab_1.out b/src/test/regress/expected/psql_crosstab_1.out
new file mode 100644
index 0000000000..d8ab4c35f3
--- /dev/null
+++ b/src/test/regress/expected/psql_crosstab_1.out
@@ -0,0 +1,216 @@
+--
+-- \crosstabview
+--
+CREATE TABLE ctv_data (v, h, c, i, d) AS
+VALUES
+ ('v1','h2','foo', 3, '2015-04-01'::date),
+ ('v2','h1','bar', 3, '2015-01-02'),
+ ('v1','h0','baz', NULL, '2015-07-12'),
+ ('v0','h4','qux', 4, '2015-07-15'),
+ ('v0','h4','dbl', -3, '2014-12-15'),
+ ('v0',NULL,'qux', 5, '2014-07-15'),
+ ('v1','h2','quux',7, '2015-04-04');
+-- make plans more stable
+ANALYZE ctv_data;
+-- running \crosstabview after query uses query in buffer
+SELECT v, EXTRACT(year FROM d), count(*)
+ FROM ctv_data
+ GROUP BY 1, 2
+ ORDER BY 1, 2;
+ v | date_part | count
+----+-----------+-------
+ v0 | 2014 | 2
+ v0 | 2015 | 1
+ v1 | 2015 | 3
+ v2 | 2015 | 1
+(4 rows)
+
+-- basic usage with 3 columns
+ \crosstabview
+ v | 2014 | 2015
+----+------+------
+ v0 | 2 | 1
+ v1 | | 3
+ v2 | | 1
+(3 rows)
+
+-- ordered months in horizontal header, quoted column name
+SELECT v, to_char(d, 'Mon') AS "month name", EXTRACT(month FROM d) AS num,
+ count(*) FROM ctv_data GROUP BY 1,2,3 ORDER BY 1
+ \crosstabview v "month name" 4 num
+ v | Jan | Apr | Jul | Dec
+----+-----+-----+-----+-----
+ v0 | | | 2 | 1
+ v1 | | 2 | 1 |
+ v2 | 1 | | |
+(3 rows)
+
+-- ordered months in vertical header, ordered years in horizontal header
+SELECT EXTRACT(year FROM d) AS year, to_char(d,'Mon') AS """month"" name",
+ EXTRACT(month FROM d) AS month,
+ format('sum=%s avg=%s', sum(i), avg(i)::numeric(2,1))
+ FROM ctv_data
+ GROUP BY EXTRACT(year FROM d), to_char(d,'Mon'), EXTRACT(month FROM d)
+ORDER BY month
+\crosstabview """month"" name" year format year
+ "month" name | 2014 | 2015
+--------------+-----------------+----------------
+ Jan | | sum=3 avg=3.0
+ Apr | | sum=10 avg=5.0
+ Jul | sum=5 avg=5.0 | sum=4 avg=4.0
+ Dec | sum=-3 avg=-3.0 |
+(4 rows)
+
+-- combine contents vertically into the same cell (V/H duplicates)
+SELECT v, h, string_agg(c, E'\n') FROM ctv_data GROUP BY v, h ORDER BY 1,2,3
+ \crosstabview 1 2 3
+ v | h4 | | h0 | h2 | h1
+----+-----+-----+-----+------+-----
+ v0 | qux+| qux | | |
+ | dbl | | | |
+ v1 | | | baz | foo +|
+ | | | | quux |
+ v2 | | | | | bar
+(3 rows)
+
+-- horizontal ASC order from window function
+SELECT v,h, string_agg(c, E'\n') AS c, row_number() OVER(ORDER BY h) AS r
+FROM ctv_data GROUP BY v, h ORDER BY 1,3,2
+ \crosstabview v h c r
+ v | h0 | h1 | h2 | h4 |
+----+-----+-----+------+-----+-----
+ v0 | | | | qux+| qux
+ | | | | dbl |
+ v1 | baz | | foo +| |
+ | | | quux | |
+ v2 | | bar | | |
+(3 rows)
+
+-- horizontal DESC order from window function
+SELECT v, h, string_agg(c, E'\n') AS c, row_number() OVER(ORDER BY h DESC) AS r
+FROM ctv_data GROUP BY v, h ORDER BY 1,3,2
+ \crosstabview v h c r
+ v | | h4 | h2 | h1 | h0
+----+-----+-----+------+-----+-----
+ v0 | qux | qux+| | |
+ | | dbl | | |
+ v1 | | | foo +| | baz
+ | | | quux | |
+ v2 | | | | bar |
+(3 rows)
+
+-- horizontal ASC order from window function, NULLs pushed rightmost
+SELECT v,h, string_agg(c, E'\n') AS c, row_number() OVER(ORDER BY h NULLS LAST) AS r
+FROM ctv_data GROUP BY v, h ORDER BY 1,3,2
+ \crosstabview v h c r
+ v | h0 | h1 | h2 | h4 |
+----+-----+-----+------+-----+-----
+ v0 | | | | qux+| qux
+ | | | | dbl |
+ v1 | baz | | foo +| |
+ | | | quux | |
+ v2 | | bar | | |
+(3 rows)
+
+-- only null, no column name, 2 columns: error
+SELECT null,null \crosstabview
+\crosstabview: query must return at least three columns
+-- only null, no column name, 3 columns: works
+SELECT null,null,null \crosstabview
+ ?column? |
+----------+--
+ |
+(1 row)
+
+-- null display
+\pset null '#null#'
+SELECT v,h, string_agg(i::text, E'\n') AS i FROM ctv_data
+GROUP BY v, h ORDER BY h,v
+ \crosstabview v h i
+ v | h0 | h1 | h2 | h4 | #null#
+----+--------+----+----+----+--------
+ v1 | #null# | | 3 +| |
+ | | | 7 | |
+ v2 | | 3 | | |
+ v0 | | | | 4 +| 5
+ | | | | -3 |
+(3 rows)
+
+\pset null ''
+-- refer to columns by position
+SELECT v,h,string_agg(i::text, E'\n'), string_agg(c, E'\n')
+FROM ctv_data GROUP BY v, h ORDER BY h,v
+ \crosstabview 2 1 4
+ h | v1 | v2 | v0
+----+------+-----+-----
+ h0 | baz | |
+ h1 | | bar |
+ h2 | quux+| |
+ | foo | |
+ h4 | | | qux+
+ | | | dbl
+ | | | qux
+(5 rows)
+
+-- refer to columns by positions and names mixed
+SELECT v,h, string_agg(i::text, E'\n') AS i, string_agg(c, E'\n') AS c
+FROM ctv_data GROUP BY v, h ORDER BY h,v
+ \crosstabview 1 "h" 4
+ v | h0 | h1 | h2 | h4 |
+----+-----+-----+------+-----+-----
+ v1 | baz | | quux+| |
+ | | | foo | |
+ v2 | | bar | | |
+ v0 | | | | qux+| qux
+ | | | | dbl |
+(3 rows)
+
+-- refer to columns by quoted names, check downcasing of unquoted name
+SELECT 1 as "22", 2 as b, 3 as "Foo"
+ \crosstabview "22" B "Foo"
+ 22 | 2
+----+---
+ 1 | 3
+(1 row)
+
+-- error: bad column name
+SELECT v,h,c,i FROM ctv_data
+ \crosstabview v h j
+\crosstabview: column name not found: "j"
+-- error: need to quote name
+SELECT 1 as "22", 2 as b, 3 as "Foo"
+ \crosstabview 1 2 Foo
+\crosstabview: column name not found: "foo"
+-- error: need to not quote name
+SELECT 1 as "22", 2 as b, 3 as "Foo"
+ \crosstabview 1 "B" "Foo"
+\crosstabview: column name not found: "B"
+-- error: bad column number
+SELECT v,h,i,c FROM ctv_data
+ \crosstabview 2 1 5
+\crosstabview: column number 5 is out of range 1..4
+-- error: same H and V columns
+SELECT v,h,i,c FROM ctv_data
+ \crosstabview 2 h 4
+\crosstabview: vertical and horizontal headers must be different columns
+-- error: too many columns
+SELECT a,a,1 FROM generate_series(1,3000) AS a
+ \crosstabview
+\crosstabview: maximum number of columns (1600) exceeded
+-- error: only one column
+SELECT 1 \crosstabview
+\crosstabview: query must return at least three columns
+DROP TABLE ctv_data;
+-- check error reporting (bug #14476)
+CREATE TABLE ctv_data (x int, y int, v text);
+INSERT INTO ctv_data SELECT 1, x, '*' || x FROM generate_series(1,10) x;
+SELECT * FROM ctv_data \crosstabview
+ x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
+---+----+----+----+----+----+----+----+----+----+-----
+ 1 | *1 | *2 | *3 | *4 | *5 | *6 | *7 | *8 | *9 | *10
+(1 row)
+
+INSERT INTO ctv_data VALUES (1, 10, '*'); -- duplicate data to cause error
+SELECT * FROM ctv_data \crosstabview
+\crosstabview: query result contains multiple data values for row "1", column "10"
+DROP TABLE ctv_data;
diff --git a/src/test/regress/expected/rangefuncs_1.out b/src/test/regress/expected/rangefuncs_1.out
new file mode 100644
index 0000000000..78b177ceb0
--- /dev/null
+++ b/src/test/regress/expected/rangefuncs_1.out
@@ -0,0 +1,2100 @@
+CREATE TABLE rngfunc2(rngfuncid int, f2 int);
+INSERT INTO rngfunc2 VALUES(1, 11);
+INSERT INTO rngfunc2 VALUES(2, 22);
+INSERT INTO rngfunc2 VALUES(1, 111);
+CREATE FUNCTION rngfunct(int) returns setof rngfunc2 as 'SELECT * FROM rngfunc2 WHERE rngfuncid = $1 ORDER BY f2;' LANGUAGE SQL;
+-- function with ORDINALITY
+select * from rngfunct(1) with ordinality as z(a,b,ord);
+ a | b | ord
+---+-----+-----
+ 1 | 11 | 1
+ 1 | 111 | 2
+(2 rows)
+
+select * from rngfunct(1) with ordinality as z(a,b,ord) where b > 100; -- ordinal 2, not 1
+ a | b | ord
+---+-----+-----
+ 1 | 111 | 2
+(1 row)
+
+-- ordinality vs. column names and types
+select a,b,ord from rngfunct(1) with ordinality as z(a,b,ord);
+ a | b | ord
+---+-----+-----
+ 1 | 11 | 1
+ 1 | 111 | 2
+(2 rows)
+
+select a,ord from unnest(array['a','b']) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ a | 1
+ b | 2
+(2 rows)
+
+select * from unnest(array['a','b']) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ a | 1
+ b | 2
+(2 rows)
+
+select a,ord from unnest(array[1.0::float8]) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ 1 | 1
+(1 row)
+
+select * from unnest(array[1.0::float8]) with ordinality as z(a,ord);
+ a | ord
+---+-----
+ 1 | 1
+(1 row)
+
+select row_to_json(s.*) from generate_series(11,14) with ordinality s;
+ row_to_json
+-------------------------
+ {"s":11,"ordinality":1}
+ {"s":12,"ordinality":2}
+ {"s":13,"ordinality":3}
+ {"s":14,"ordinality":4}
+(4 rows)
+
+-- ordinality vs. views
+create temporary view vw_ord as select * from (values (1)) v(n) join rngfunct(1) with ordinality as z(a,b,ord) on (n=ord);
+select * from vw_ord;
+ n | a | b | ord
+---+---+----+-----
+ 1 | 1 | 11 | 1
+(1 row)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+-------------------------------------------------------------------------
+ SELECT v.n, +
+ z.a, +
+ z.b, +
+ z.ord +
+ FROM (( VALUES (1)) v(n) +
+ JOIN rngfunct(1) WITH ORDINALITY z(a, b, ord) ON ((v.n = z.ord)));
+(1 row)
+
+drop view vw_ord;
+-- multiple functions
+select * from rows from(rngfunct(1),rngfunct(2)) with ordinality as z(a,b,c,d,ord);
+ a | b | c | d | ord
+---+-----+---+----+-----
+ 1 | 11 | 2 | 22 | 1
+ 1 | 111 | | | 2
+(2 rows)
+
+create temporary view vw_ord as select * from (values (1)) v(n) join rows from(rngfunct(1),rngfunct(2)) with ordinality as z(a,b,c,d,ord) on (n=ord);
+select * from vw_ord;
+ n | a | b | c | d | ord
+---+---+----+---+----+-----
+ 1 | 1 | 11 | 2 | 22 | 1
+(1 row)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+-------------------------------------------------------------------------------------------------------
+ SELECT v.n, +
+ z.a, +
+ z.b, +
+ z.c, +
+ z.d, +
+ z.ord +
+ FROM (( VALUES (1)) v(n) +
+ JOIN ROWS FROM(rngfunct(1), rngfunct(2)) WITH ORDINALITY z(a, b, c, d, ord) ON ((v.n = z.ord)));
+(1 row)
+
+drop view vw_ord;
+-- expansions of unnest()
+select * from unnest(array[10,20],array['foo','bar'],array[1.0]);
+ unnest | unnest | unnest
+--------+--------+--------
+ 10 | foo | 1.0
+ 20 | bar |
+(2 rows)
+
+select * from unnest(array[10,20],array['foo','bar'],array[1.0]) with ordinality as z(a,b,c,ord);
+ a | b | c | ord
+----+-----+-----+-----
+ 10 | foo | 1.0 | 1
+ 20 | bar | | 2
+(2 rows)
+
+select * from rows from(unnest(array[10,20],array['foo','bar'],array[1.0])) with ordinality as z(a,b,c,ord);
+ a | b | c | ord
+----+-----+-----+-----
+ 10 | foo | 1.0 | 1
+ 20 | bar | | 2
+(2 rows)
+
+select * from rows from(unnest(array[10,20],array['foo','bar']), generate_series(101,102)) with ordinality as z(a,b,c,ord);
+ a | b | c | ord
+----+-----+-----+-----
+ 10 | foo | 101 | 1
+ 20 | bar | 102 | 2
+(2 rows)
+
+create temporary view vw_ord as select * from unnest(array[10,20],array['foo','bar'],array[1.0]) as z(a,b,c);
+select * from vw_ord;
+ a | b | c
+----+-----+-----
+ 10 | foo | 1.0
+ 20 | bar |
+(2 rows)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+----------------------------------------------------------------------------------------
+ SELECT z.a, +
+ z.b, +
+ z.c +
+ FROM UNNEST(ARRAY[10, 20], ARRAY['foo'::text, 'bar'::text], ARRAY[1.0]) z(a, b, c);
+(1 row)
+
+drop view vw_ord;
+create temporary view vw_ord as select * from rows from(unnest(array[10,20],array['foo','bar'],array[1.0])) as z(a,b,c);
+select * from vw_ord;
+ a | b | c
+----+-----+-----
+ 10 | foo | 1.0
+ 20 | bar |
+(2 rows)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+----------------------------------------------------------------------------------------
+ SELECT z.a, +
+ z.b, +
+ z.c +
+ FROM UNNEST(ARRAY[10, 20], ARRAY['foo'::text, 'bar'::text], ARRAY[1.0]) z(a, b, c);
+(1 row)
+
+drop view vw_ord;
+create temporary view vw_ord as select * from rows from(unnest(array[10,20],array['foo','bar']), generate_series(1,2)) as z(a,b,c);
+select * from vw_ord;
+ a | b | c
+----+-----+---
+ 10 | foo | 1
+ 20 | bar | 2
+(2 rows)
+
+select definition from pg_views where viewname='vw_ord';
+ definition
+----------------------------------------------------------------------------------------------------------------------
+ SELECT z.a, +
+ z.b, +
+ z.c +
+ FROM ROWS FROM(unnest(ARRAY[10, 20]), unnest(ARRAY['foo'::text, 'bar'::text]), generate_series(1, 2)) z(a, b, c);
+(1 row)
+
+drop view vw_ord;
+-- ordinality and multiple functions vs. rewind and reverse scan
+begin;
+declare rf_cur scroll cursor for select * from rows from(generate_series(1,5),generate_series(1,2)) with ordinality as g(i,j,o);
+fetch all from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+ 2 | 2 | 2
+ 3 | | 3
+ 4 | | 4
+ 5 | | 5
+(5 rows)
+
+fetch backward all from rf_cur;
+ i | j | o
+---+---+---
+ 5 | | 5
+ 4 | | 4
+ 3 | | 3
+ 2 | 2 | 2
+ 1 | 1 | 1
+(5 rows)
+
+fetch all from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+ 2 | 2 | 2
+ 3 | | 3
+ 4 | | 4
+ 5 | | 5
+(5 rows)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+(0 rows)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+(0 rows)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 5 | | 5
+(1 row)
+
+fetch absolute 1 from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+(1 row)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+ 2 | 2 | 2
+(1 row)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+ 3 | | 3
+(1 row)
+
+fetch next from rf_cur;
+ i | j | o
+---+---+---
+ 4 | | 4
+(1 row)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 3 | | 3
+(1 row)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 2 | 2 | 2
+(1 row)
+
+fetch prior from rf_cur;
+ i | j | o
+---+---+---
+ 1 | 1 | 1
+(1 row)
+
+commit;
+-- function with implicit LATERAL
+select * from rngfunc2, rngfunct(rngfunc2.rngfuncid) z where rngfunc2.f2 = z.f2;
+ rngfuncid | f2 | rngfuncid | f2
+-----------+-----+-----------+-----
+ 1 | 11 | 1 | 11
+ 2 | 22 | 2 | 22
+ 1 | 111 | 1 | 111
+(3 rows)
+
+-- function with implicit LATERAL and explicit ORDINALITY
+select * from rngfunc2, rngfunct(rngfunc2.rngfuncid) with ordinality as z(rngfuncid,f2,ord) where rngfunc2.f2 = z.f2;
+ rngfuncid | f2 | rngfuncid | f2 | ord
+-----------+-----+-----------+-----+-----
+ 1 | 11 | 1 | 11 | 1
+ 2 | 22 | 2 | 22 | 1
+ 1 | 111 | 1 | 111 | 2
+(3 rows)
+
+-- function in subselect
+select * from rngfunc2 where f2 in (select f2 from rngfunct(rngfunc2.rngfuncid) z where z.rngfuncid = rngfunc2.rngfuncid) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+ 2 | 22
+(3 rows)
+
+-- function in subselect
+select * from rngfunc2 where f2 in (select f2 from rngfunct(1) z where z.rngfuncid = rngfunc2.rngfuncid) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+(2 rows)
+
+-- function in subselect
+select * from rngfunc2 where f2 in (select f2 from rngfunct(rngfunc2.rngfuncid) z where z.rngfuncid = 1) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+(2 rows)
+
+-- nested functions
+select rngfunct.rngfuncid, rngfunct.f2 from rngfunct(sin(pi()/2)::int) ORDER BY 1,2;
+ rngfuncid | f2
+-----------+-----
+ 1 | 11
+ 1 | 111
+(2 rows)
+
+CREATE TABLE rngfunc (rngfuncid int, rngfuncsubid int, rngfuncname text, primary key(rngfuncid,rngfuncsubid));
+INSERT INTO rngfunc VALUES(1,1,'Joe');
+INSERT INTO rngfunc VALUES(1,2,'Ed');
+INSERT INTO rngfunc VALUES(2,1,'Mary');
+-- sql, proretset = f, prorettype = b
+CREATE FUNCTION getrngfunc1(int) RETURNS int AS 'SELECT $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc1(1) AS t1;
+ t1
+----
+ 1
+(1 row)
+
+SELECT * FROM getrngfunc1(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc1(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc1
+-------------
+ 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc1(1) WITH ORDINALITY as t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = b
+CREATE FUNCTION getrngfunc2(int) RETURNS setof int AS 'SELECT rngfuncid FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc2(1) AS t1;
+ t1
+----
+ 1
+ 1
+(2 rows)
+
+SELECT * FROM getrngfunc2(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+---+---
+ 1 | 1
+ 1 | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc2(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc2
+-------------
+ 1
+ 1
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc2(1) WITH ORDINALITY AS t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+---+---
+ 1 | 1
+ 1 | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = b
+CREATE FUNCTION getrngfunc3(int) RETURNS setof text AS 'SELECT rngfuncname FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc3(1) AS t1;
+ t1
+-----
+ Joe
+ Ed
+(2 rows)
+
+SELECT * FROM getrngfunc3(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+-----+---
+ Joe | 1
+ Ed | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc3(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc3
+-------------
+ Joe
+ Ed
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc3(1) WITH ORDINALITY AS t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+-----+---
+ Joe | 1
+ Ed | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = f, prorettype = c
+CREATE FUNCTION getrngfunc4(int) RETURNS rngfunc AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc4(1) AS t1;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+SELECT * FROM getrngfunc4(1) WITH ORDINALITY AS t1(a,b,c,o);
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc4(1);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc4(1) WITH ORDINALITY AS t1(a,b,c,o);
+SELECT * FROM vw_getrngfunc;
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = c
+CREATE FUNCTION getrngfunc5(int) RETURNS setof rngfunc AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc5(1) AS t1;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+SELECT * FROM getrngfunc5(1) WITH ORDINALITY AS t1(a,b,c,o);
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc5(1);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc5(1) WITH ORDINALITY AS t1(a,b,c,o);
+SELECT * FROM vw_getrngfunc;
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = f, prorettype = record
+CREATE FUNCTION getrngfunc6(int) RETURNS RECORD AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc6(1) AS t1(rngfuncid int, rngfuncsubid int, rngfuncname text);
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+SELECT * FROM ROWS FROM( getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) ) WITH ORDINALITY;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc6(1) AS
+(rngfuncid int, rngfuncsubid int, rngfuncname text);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS
+ SELECT * FROM ROWS FROM( getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) )
+ WITH ORDINALITY;
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- sql, proretset = t, prorettype = record
+CREATE FUNCTION getrngfunc7(int) RETURNS setof record AS 'SELECT * FROM rngfunc WHERE rngfuncid = $1;' LANGUAGE SQL;
+SELECT * FROM getrngfunc7(1) AS t1(rngfuncid int, rngfuncsubid int, rngfuncname text);
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+SELECT * FROM ROWS FROM( getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) ) WITH ORDINALITY;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc7(1) AS
+(rngfuncid int, rngfuncsubid int, rngfuncname text);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+ 1 | 2 | Ed
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS
+ SELECT * FROM ROWS FROM( getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text) )
+ WITH ORDINALITY;
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname | ordinality
+-----------+--------------+-------------+------------
+ 1 | 1 | Joe | 1
+ 1 | 2 | Ed | 2
+(2 rows)
+
+DROP VIEW vw_getrngfunc;
+-- plpgsql, proretset = f, prorettype = b
+CREATE FUNCTION getrngfunc8(int) RETURNS int AS 'DECLARE rngfuncint int; BEGIN SELECT rngfuncid into rngfuncint FROM rngfunc WHERE rngfuncid = $1; RETURN rngfuncint; END;' LANGUAGE plpgsql;
+SELECT * FROM getrngfunc8(1) AS t1;
+ t1
+----
+ 1
+(1 row)
+
+SELECT * FROM getrngfunc8(1) WITH ORDINALITY AS t1(v,o);
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc8(1);
+SELECT * FROM vw_getrngfunc;
+ getrngfunc8
+-------------
+ 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc8(1) WITH ORDINALITY AS t1(v,o);
+SELECT * FROM vw_getrngfunc;
+ v | o
+---+---
+ 1 | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- plpgsql, proretset = f, prorettype = c
+CREATE FUNCTION getrngfunc9(int) RETURNS rngfunc AS 'DECLARE rngfunctup rngfunc%ROWTYPE; BEGIN SELECT * into rngfunctup FROM rngfunc WHERE rngfuncid = $1; RETURN rngfunctup; END;' LANGUAGE plpgsql;
+SELECT * FROM getrngfunc9(1) AS t1;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+SELECT * FROM getrngfunc9(1) WITH ORDINALITY AS t1(a,b,c,o);
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc9(1);
+SELECT * FROM vw_getrngfunc;
+ rngfuncid | rngfuncsubid | rngfuncname
+-----------+--------------+-------------
+ 1 | 1 | Joe
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+CREATE VIEW vw_getrngfunc AS SELECT * FROM getrngfunc9(1) WITH ORDINALITY AS t1(a,b,c,o);
+SELECT * FROM vw_getrngfunc;
+ a | b | c | o
+---+---+-----+---
+ 1 | 1 | Joe | 1
+(1 row)
+
+DROP VIEW vw_getrngfunc;
+-- mix 'n match kinds, to exercise expandRTE and related logic
+select * from rows from(getrngfunc1(1),getrngfunc2(1),getrngfunc3(1),getrngfunc4(1),getrngfunc5(1),
+ getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc8(1),getrngfunc9(1))
+ with ordinality as t1(a,b,c,d,e,f,g,h,i,j,k,l,m,o,p,q,r,s,t,u);
+ a | b | c | d | e | f | g | h | i | j | k | l | m | o | p | q | r | s | t | u
+---+---+-----+---+---+-----+---+---+-----+---+---+-----+---+---+-----+---+---+---+-----+---
+ 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | 1 | Joe | 1
+ | 1 | Ed | | | | 1 | 2 | Ed | | | | 1 | 2 | Ed | | | | | 2
+(2 rows)
+
+select * from rows from(getrngfunc9(1),getrngfunc8(1),
+ getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc6(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc5(1),getrngfunc4(1),getrngfunc3(1),getrngfunc2(1),getrngfunc1(1))
+ with ordinality as t1(a,b,c,d,e,f,g,h,i,j,k,l,m,o,p,q,r,s,t,u);
+ a | b | c | d | e | f | g | h | i | j | k | l | m | o | p | q | r | s | t | u
+---+---+-----+---+---+---+-----+---+---+-----+---+---+-----+---+---+-----+-----+---+---+---
+ 1 | 1 | Joe | 1 | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | Joe | 1 | 1 | 1
+ | | | | 1 | 2 | Ed | | | | 1 | 2 | Ed | | | | Ed | 1 | | 2
+(2 rows)
+
+create temporary view vw_rngfunc as
+ select * from rows from(getrngfunc9(1),
+ getrngfunc7(1) AS (rngfuncid int, rngfuncsubid int, rngfuncname text),
+ getrngfunc1(1))
+ with ordinality as t1(a,b,c,d,e,f,g,n);
+select * from vw_rngfunc;
+ a | b | c | d | e | f | g | n
+---+---+-----+---+---+-----+---+---
+ 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1
+ | | | 1 | 2 | Ed | | 2
+(2 rows)
+
+select pg_get_viewdef('vw_rngfunc');
+ pg_get_viewdef
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ SELECT t1.a, +
+ t1.b, +
+ t1.c, +
+ t1.d, +
+ t1.e, +
+ t1.f, +
+ t1.g, +
+ t1.n +
+ FROM ROWS FROM(getrngfunc9(1), getrngfunc7(1) AS (rngfuncid integer, rngfuncsubid integer, rngfuncname text), getrngfunc1(1)) WITH ORDINALITY t1(a, b, c, d, e, f, g, n);
+(1 row)
+
+drop view vw_rngfunc;
+DROP FUNCTION getrngfunc1(int);
+DROP FUNCTION getrngfunc2(int);
+DROP FUNCTION getrngfunc3(int);
+DROP FUNCTION getrngfunc4(int);
+DROP FUNCTION getrngfunc5(int);
+DROP FUNCTION getrngfunc6(int);
+DROP FUNCTION getrngfunc7(int);
+DROP FUNCTION getrngfunc8(int);
+DROP FUNCTION getrngfunc9(int);
+DROP FUNCTION rngfunct(int);
+DROP TABLE rngfunc2;
+DROP TABLE rngfunc;
+-- Rescan tests --
+CREATE TEMPORARY SEQUENCE rngfunc_rescan_seq1;
+CREATE TEMPORARY SEQUENCE rngfunc_rescan_seq2;
+CREATE TYPE rngfunc_rescan_t AS (i integer, s bigint);
+CREATE FUNCTION rngfunc_sql(int,int) RETURNS setof rngfunc_rescan_t AS 'SELECT i, nextval(''rngfunc_rescan_seq1'') FROM generate_series($1,$2) i;' LANGUAGE SQL;
+-- plpgsql functions use materialize mode
+CREATE FUNCTION rngfunc_mat(int,int) RETURNS setof rngfunc_rescan_t AS 'begin for i in $1..$2 loop return next (i, nextval(''rngfunc_rescan_seq2'')); end loop; end;' LANGUAGE plpgsql;
+--invokes ExecReScanFunctionScan - all these cases should materialize the function only once
+-- LEFT JOIN on a condition that the planner can't prove to be true is used to ensure the function
+-- is on the inner path of a nestloop join
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_sql(11,13) ON (r+i)<100;
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 11 | 1
+ 2 | 12 | 2
+ 2 | 13 | 3
+ 3 | 11 | 1
+ 3 | 12 | 2
+ 3 | 13 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_sql(11,13) WITH ORDINALITY AS f(i,s,o) ON (r+i)<100;
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 11 | 1 | 1
+ 2 | 12 | 2 | 2
+ 2 | 13 | 3 | 3
+ 3 | 11 | 1 | 1
+ 3 | 12 | 2 | 2
+ 3 | 13 | 3 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_mat(11,13) ON (r+i)<100;
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 11 | 1
+ 2 | 12 | 2
+ 2 | 13 | 3
+ 3 | 11 | 1
+ 3 | 12 | 2
+ 3 | 13 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN rngfunc_mat(11,13) WITH ORDINALITY AS f(i,s,o) ON (r+i)<100;
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 11 | 1 | 1
+ 2 | 12 | 2 | 2
+ 2 | 13 | 3 | 3
+ 3 | 11 | 1 | 1
+ 3 | 12 | 2 | 2
+ 3 | 13 | 3 | 3
+(9 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN ROWS FROM( rngfunc_sql(11,13), rngfunc_mat(11,13) ) WITH ORDINALITY AS f(i1,s1,i2,s2,o) ON (r+i1+i2)<100;
+ r | i1 | s1 | i2 | s2 | o
+---+----+----+----+----+---
+ 1 | 11 | 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 12 | 2 | 2
+ 1 | 13 | 3 | 13 | 3 | 3
+ 2 | 11 | 1 | 11 | 1 | 1
+ 2 | 12 | 2 | 12 | 2 | 2
+ 2 | 13 | 3 | 13 | 3 | 3
+ 3 | 11 | 1 | 11 | 1 | 1
+ 3 | 12 | 2 | 12 | 2 | 2
+ 3 | 13 | 3 | 13 | 3 | 3
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN generate_series(11,13) f(i) ON (r+i)<100;
+ r | i
+---+----
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 2 | 11
+ 2 | 12
+ 2 | 13
+ 3 | 11
+ 3 | 12
+ 3 | 13
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN generate_series(11,13) WITH ORDINALITY AS f(i,o) ON (r+i)<100;
+ r | i | o
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 11 | 1
+ 2 | 12 | 2
+ 2 | 13 | 3
+ 3 | 11 | 1
+ 3 | 12 | 2
+ 3 | 13 | 3
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN unnest(array[10,20,30]) f(i) ON (r+i)<100;
+ r | i
+---+----
+ 1 | 10
+ 1 | 20
+ 1 | 30
+ 2 | 10
+ 2 | 20
+ 2 | 30
+ 3 | 10
+ 3 | 20
+ 3 | 30
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN unnest(array[10,20,30]) WITH ORDINALITY AS f(i,o) ON (r+i)<100;
+ r | i | o
+---+----+---
+ 1 | 10 | 1
+ 1 | 20 | 2
+ 1 | 30 | 3
+ 2 | 10 | 1
+ 2 | 20 | 2
+ 2 | 30 | 3
+ 3 | 10 | 1
+ 3 | 20 | 2
+ 3 | 30 | 3
+(9 rows)
+
+--invokes ExecReScanFunctionScan with chgParam != NULL (using implied LATERAL)
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(10+r,13);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 12 | 4
+ 2 | 13 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(10+r,13) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 12 | 4 | 1
+ 2 | 13 | 5 | 2
+ 3 | 13 | 6 | 1
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(11,10+r);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 2 | 11 | 2
+ 2 | 12 | 3
+ 3 | 11 | 4
+ 3 | 12 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_sql(11,10+r) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 2 | 11 | 2 | 1
+ 2 | 12 | 3 | 2
+ 3 | 11 | 4 | 1
+ 3 | 12 | 5 | 2
+ 3 | 13 | 6 | 3
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_sql(r1,r2);
+ r1 | r2 | i | s
+----+----+----+----
+ 11 | 12 | 11 | 1
+ 11 | 12 | 12 | 2
+ 13 | 15 | 13 | 3
+ 13 | 15 | 14 | 4
+ 13 | 15 | 15 | 5
+ 16 | 20 | 16 | 6
+ 16 | 20 | 17 | 7
+ 16 | 20 | 18 | 8
+ 16 | 20 | 19 | 9
+ 16 | 20 | 20 | 10
+(10 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_sql(r1,r2) WITH ORDINALITY AS f(i,s,o);
+ r1 | r2 | i | s | o
+----+----+----+----+---
+ 11 | 12 | 11 | 1 | 1
+ 11 | 12 | 12 | 2 | 2
+ 13 | 15 | 13 | 3 | 1
+ 13 | 15 | 14 | 4 | 2
+ 13 | 15 | 15 | 5 | 3
+ 16 | 20 | 16 | 6 | 1
+ 16 | 20 | 17 | 7 | 2
+ 16 | 20 | 18 | 8 | 3
+ 16 | 20 | 19 | 9 | 4
+ 16 | 20 | 20 | 10 | 5
+(10 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(10+r,13);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 2 | 12 | 4
+ 2 | 13 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(10+r,13) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 1 | 12 | 2 | 2
+ 1 | 13 | 3 | 3
+ 2 | 12 | 4 | 1
+ 2 | 13 | 5 | 2
+ 3 | 13 | 6 | 1
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(11,10+r);
+ r | i | s
+---+----+---
+ 1 | 11 | 1
+ 2 | 11 | 2
+ 2 | 12 | 3
+ 3 | 11 | 4
+ 3 | 12 | 5
+ 3 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), rngfunc_mat(11,10+r) WITH ORDINALITY AS f(i,s,o);
+ r | i | s | o
+---+----+---+---
+ 1 | 11 | 1 | 1
+ 2 | 11 | 2 | 1
+ 2 | 12 | 3 | 2
+ 3 | 11 | 4 | 1
+ 3 | 12 | 5 | 2
+ 3 | 13 | 6 | 3
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_mat(r1,r2);
+ r1 | r2 | i | s
+----+----+----+----
+ 11 | 12 | 11 | 1
+ 11 | 12 | 12 | 2
+ 13 | 15 | 13 | 3
+ 13 | 15 | 14 | 4
+ 13 | 15 | 15 | 5
+ 16 | 20 | 16 | 6
+ 16 | 20 | 17 | 7
+ 16 | 20 | 18 | 8
+ 16 | 20 | 19 | 9
+ 16 | 20 | 20 | 10
+(10 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), rngfunc_mat(r1,r2) WITH ORDINALITY AS f(i,s,o);
+ r1 | r2 | i | s | o
+----+----+----+----+---
+ 11 | 12 | 11 | 1 | 1
+ 11 | 12 | 12 | 2 | 2
+ 13 | 15 | 13 | 3 | 1
+ 13 | 15 | 14 | 4 | 2
+ 13 | 15 | 15 | 5 | 3
+ 16 | 20 | 16 | 6 | 1
+ 16 | 20 | 17 | 7 | 2
+ 16 | 20 | 18 | 8 | 3
+ 16 | 20 | 19 | 9 | 4
+ 16 | 20 | 20 | 10 | 5
+(10 rows)
+
+-- selective rescan of multiple functions:
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( rngfunc_sql(11,11), rngfunc_mat(10+r,13) );
+ r | i | s | i | s
+---+----+---+----+---
+ 1 | 11 | 1 | 11 | 1
+ 1 | | | 12 | 2
+ 1 | | | 13 | 3
+ 2 | 11 | 1 | 12 | 4
+ 2 | | | 13 | 5
+ 3 | 11 | 1 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( rngfunc_sql(10+r,13), rngfunc_mat(11,11) );
+ r | i | s | i | s
+---+----+---+----+---
+ 1 | 11 | 1 | 11 | 1
+ 1 | 12 | 2 | |
+ 1 | 13 | 3 | |
+ 2 | 12 | 4 | 11 | 1
+ 2 | 13 | 5 | |
+ 3 | 13 | 6 | 11 | 1
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( rngfunc_sql(10+r,13), rngfunc_mat(10+r,13) );
+ r | i | s | i | s
+---+----+---+----+---
+ 1 | 11 | 1 | 11 | 1
+ 1 | 12 | 2 | 12 | 2
+ 1 | 13 | 3 | 13 | 3
+ 2 | 12 | 4 | 12 | 4
+ 2 | 13 | 5 | 13 | 5
+ 3 | 13 | 6 | 13 | 6
+(6 rows)
+
+SELECT setval('rngfunc_rescan_seq1',1,false),setval('rngfunc_rescan_seq2',1,false);
+ setval | setval
+--------+--------
+ 1 | 1
+(1 row)
+
+SELECT * FROM generate_series(1,2) r1, generate_series(r1,3) r2, ROWS FROM( rngfunc_sql(10+r1,13), rngfunc_mat(10+r2,13) );
+ r1 | r2 | i | s | i | s
+----+----+----+----+----+---
+ 1 | 1 | 11 | 1 | 11 | 1
+ 1 | 1 | 12 | 2 | 12 | 2
+ 1 | 1 | 13 | 3 | 13 | 3
+ 1 | 2 | 11 | 4 | 12 | 4
+ 1 | 2 | 12 | 5 | 13 | 5
+ 1 | 2 | 13 | 6 | |
+ 1 | 3 | 11 | 7 | 13 | 6
+ 1 | 3 | 12 | 8 | |
+ 1 | 3 | 13 | 9 | |
+ 2 | 2 | 12 | 10 | 12 | 7
+ 2 | 2 | 13 | 11 | 13 | 8
+ 2 | 3 | 12 | 12 | 13 | 9
+ 2 | 3 | 13 | 13 | |
+(13 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), generate_series(10+r,20-r) f(i);
+ r | i
+---+----
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 12
+ 2 | 13
+ 2 | 14
+ 2 | 15
+ 2 | 16
+ 2 | 17
+ 2 | 18
+ 3 | 13
+ 3 | 14
+ 3 | 15
+ 3 | 16
+ 3 | 17
+(21 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), generate_series(10+r,20-r) WITH ORDINALITY AS f(i,o);
+ r | i | o
+---+----+---
+ 1 | 11 | 1
+ 1 | 12 | 2
+ 1 | 13 | 3
+ 1 | 14 | 4
+ 1 | 15 | 5
+ 1 | 16 | 6
+ 1 | 17 | 7
+ 1 | 18 | 8
+ 1 | 19 | 9
+ 2 | 12 | 1
+ 2 | 13 | 2
+ 2 | 14 | 3
+ 2 | 15 | 4
+ 2 | 16 | 5
+ 2 | 17 | 6
+ 2 | 18 | 7
+ 3 | 13 | 1
+ 3 | 14 | 2
+ 3 | 15 | 3
+ 3 | 16 | 4
+ 3 | 17 | 5
+(21 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), unnest(array[r*10,r*20,r*30]) f(i);
+ r | i
+---+----
+ 1 | 10
+ 1 | 20
+ 1 | 30
+ 2 | 20
+ 2 | 40
+ 2 | 60
+ 3 | 30
+ 3 | 60
+ 3 | 90
+(9 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v(r), unnest(array[r*10,r*20,r*30]) WITH ORDINALITY AS f(i,o);
+ r | i | o
+---+----+---
+ 1 | 10 | 1
+ 1 | 20 | 2
+ 1 | 30 | 3
+ 2 | 20 | 1
+ 2 | 40 | 2
+ 2 | 60 | 3
+ 3 | 30 | 1
+ 3 | 60 | 2
+ 3 | 90 | 3
+(9 rows)
+
+-- deep nesting
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(21,23) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+----
+ 1 | 1 | 10 | 21
+ 1 | 1 | 10 | 22
+ 1 | 1 | 10 | 23
+ 1 | 1 | 20 | 21
+ 1 | 1 | 20 | 22
+ 1 | 1 | 20 | 23
+ 1 | 1 | 30 | 21
+ 1 | 1 | 30 | 22
+ 1 | 1 | 30 | 23
+ 2 | 2 | 10 | 21
+ 2 | 2 | 10 | 22
+ 2 | 2 | 10 | 23
+ 2 | 2 | 20 | 21
+ 2 | 2 | 20 | 22
+ 2 | 2 | 20 | 23
+ 2 | 2 | 30 | 21
+ 2 | 2 | 30 | 22
+ 2 | 2 | 30 | 23
+ 3 | 3 | 10 | 21
+ 3 | 3 | 10 | 22
+ 3 | 3 | 10 | 23
+ 3 | 3 | 20 | 21
+ 3 | 3 | 20 | 22
+ 3 | 3 | 20 | 23
+ 3 | 3 | 30 | 21
+ 3 | 3 | 30 | 22
+ 3 | 3 | 30 | 23
+(27 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(20+r1,23) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+----
+ 1 | 1 | 10 | 21
+ 1 | 1 | 10 | 22
+ 1 | 1 | 10 | 23
+ 1 | 1 | 20 | 21
+ 1 | 1 | 20 | 22
+ 1 | 1 | 20 | 23
+ 1 | 1 | 30 | 21
+ 1 | 1 | 30 | 22
+ 1 | 1 | 30 | 23
+ 2 | 2 | 10 | 22
+ 2 | 2 | 10 | 23
+ 2 | 2 | 20 | 22
+ 2 | 2 | 20 | 23
+ 2 | 2 | 30 | 22
+ 2 | 2 | 30 | 23
+ 3 | 3 | 10 | 23
+ 3 | 3 | 20 | 23
+ 3 | 3 | 30 | 23
+(18 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(r2,r2+3) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+----
+ 1 | 1 | 10 | 10
+ 1 | 1 | 10 | 11
+ 1 | 1 | 10 | 12
+ 1 | 1 | 10 | 13
+ 1 | 1 | 20 | 20
+ 1 | 1 | 20 | 21
+ 1 | 1 | 20 | 22
+ 1 | 1 | 20 | 23
+ 1 | 1 | 30 | 30
+ 1 | 1 | 30 | 31
+ 1 | 1 | 30 | 32
+ 1 | 1 | 30 | 33
+ 2 | 2 | 10 | 10
+ 2 | 2 | 10 | 11
+ 2 | 2 | 10 | 12
+ 2 | 2 | 10 | 13
+ 2 | 2 | 20 | 20
+ 2 | 2 | 20 | 21
+ 2 | 2 | 20 | 22
+ 2 | 2 | 20 | 23
+ 2 | 2 | 30 | 30
+ 2 | 2 | 30 | 31
+ 2 | 2 | 30 | 32
+ 2 | 2 | 30 | 33
+ 3 | 3 | 10 | 10
+ 3 | 3 | 10 | 11
+ 3 | 3 | 10 | 12
+ 3 | 3 | 10 | 13
+ 3 | 3 | 20 | 20
+ 3 | 3 | 20 | 21
+ 3 | 3 | 20 | 22
+ 3 | 3 | 20 | 23
+ 3 | 3 | 30 | 30
+ 3 | 3 | 30 | 31
+ 3 | 3 | 30 | 32
+ 3 | 3 | 30 | 33
+(36 rows)
+
+SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
+ LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
+ LEFT JOIN generate_series(r1,2+r2/5) f(i) ON ((r2+i)<100) OFFSET 0) s1;
+ r1 | r1 | r2 | i
+----+----+----+---
+ 1 | 1 | 10 | 1
+ 1 | 1 | 10 | 2
+ 1 | 1 | 10 | 3
+ 1 | 1 | 10 | 4
+ 1 | 1 | 20 | 1
+ 1 | 1 | 20 | 2
+ 1 | 1 | 20 | 3
+ 1 | 1 | 20 | 4
+ 1 | 1 | 20 | 5
+ 1 | 1 | 20 | 6
+ 1 | 1 | 30 | 1
+ 1 | 1 | 30 | 2
+ 1 | 1 | 30 | 3
+ 1 | 1 | 30 | 4
+ 1 | 1 | 30 | 5
+ 1 | 1 | 30 | 6
+ 1 | 1 | 30 | 7
+ 1 | 1 | 30 | 8
+ 2 | 2 | 10 | 2
+ 2 | 2 | 10 | 3
+ 2 | 2 | 10 | 4
+ 2 | 2 | 20 | 2
+ 2 | 2 | 20 | 3
+ 2 | 2 | 20 | 4
+ 2 | 2 | 20 | 5
+ 2 | 2 | 20 | 6
+ 2 | 2 | 30 | 2
+ 2 | 2 | 30 | 3
+ 2 | 2 | 30 | 4
+ 2 | 2 | 30 | 5
+ 2 | 2 | 30 | 6
+ 2 | 2 | 30 | 7
+ 2 | 2 | 30 | 8
+ 3 | 3 | 10 | 3
+ 3 | 3 | 10 | 4
+ 3 | 3 | 20 | 3
+ 3 | 3 | 20 | 4
+ 3 | 3 | 20 | 5
+ 3 | 3 | 20 | 6
+ 3 | 3 | 30 | 3
+ 3 | 3 | 30 | 4
+ 3 | 3 | 30 | 5
+ 3 | 3 | 30 | 6
+ 3 | 3 | 30 | 7
+ 3 | 3 | 30 | 8
+(45 rows)
+
+-- check handling of FULL JOIN with multiple lateral references (bug #15741)
+SELECT *
+FROM (VALUES (1),(2)) v1(r1)
+ LEFT JOIN LATERAL (
+ SELECT *
+ FROM generate_series(1, v1.r1) AS gs1
+ LEFT JOIN LATERAL (
+ SELECT *
+ FROM generate_series(1, gs1) AS gs2
+ LEFT JOIN generate_series(1, gs2) AS gs3 ON TRUE
+ ) AS ss1 ON TRUE
+ FULL JOIN generate_series(1, v1.r1) AS gs4 ON FALSE
+ ) AS ss0 ON TRUE;
+ r1 | gs1 | gs2 | gs3 | gs4
+----+-----+-----+-----+-----
+ 1 | | | | 1
+ 1 | 1 | 1 | 1 |
+ 2 | | | | 1
+ 2 | | | | 2
+ 2 | 1 | 1 | 1 |
+ 2 | 2 | 1 | 1 |
+ 2 | 2 | 2 | 1 |
+ 2 | 2 | 2 | 2 |
+(8 rows)
+
+DROP FUNCTION rngfunc_sql(int,int);
+DROP FUNCTION rngfunc_mat(int,int);
+DROP SEQUENCE rngfunc_rescan_seq1;
+DROP SEQUENCE rngfunc_rescan_seq2;
+--
+-- Test cases involving OUT parameters
+--
+CREATE FUNCTION rngfunc(in f1 int, out f2 int)
+AS 'select $1+1' LANGUAGE sql;
+SELECT rngfunc(42);
+ rngfunc
+---------
+ 43
+(1 row)
+
+SELECT * FROM rngfunc(42);
+ f2
+----
+ 43
+(1 row)
+
+SELECT * FROM rngfunc(42) AS p(x);
+ x
+----
+ 43
+(1 row)
+
+-- explicit spec of return type is OK
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int) RETURNS int
+AS 'select $1+1' LANGUAGE sql;
+-- error, wrong result type
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int) RETURNS float
+AS 'select $1+1' LANGUAGE sql;
+ERROR: function result type must be integer because of OUT parameters
+-- with multiple OUT params you must get a RECORD result
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int, out f3 text) RETURNS int
+AS 'select $1+1' LANGUAGE sql;
+ERROR: function result type must be record because of OUT parameters
+CREATE OR REPLACE FUNCTION rngfunc(in f1 int, out f2 int, out f3 text)
+RETURNS record
+AS 'select $1+1' LANGUAGE sql;
+ERROR: cannot change return type of existing function
+HINT: Use DROP FUNCTION rngfunc(integer) first.
+CREATE OR REPLACE FUNCTION rngfuncr(in f1 int, out f2 int, out text)
+AS $$select $1-1, $1::text || 'z'$$ LANGUAGE sql;
+SELECT f1, rngfuncr(f1) FROM int4_tbl;
+ f1 | rngfuncr
+-------------+----------------------------
+ 0 | (-1,0z)
+ 123456 | (123455,123456z)
+ -123456 | (-123457,-123456z)
+ 2147483647 | (2147483646,2147483647z)
+ -2147483647 | (-2147483648,-2147483647z)
+(5 rows)
+
+SELECT * FROM rngfuncr(42);
+ f2 | column2
+----+---------
+ 41 | 42z
+(1 row)
+
+SELECT * FROM rngfuncr(42) AS p(a,b);
+ a | b
+----+-----
+ 41 | 42z
+(1 row)
+
+CREATE OR REPLACE FUNCTION rngfuncb(in f1 int, inout f2 int, out text)
+AS $$select $2-1, $1::text || 'z'$$ LANGUAGE sql;
+SELECT f1, rngfuncb(f1, f1/2) FROM int4_tbl;
+ f1 | rngfuncb
+-------------+----------------------------
+ 0 | (-1,0z)
+ 123456 | (61727,123456z)
+ -123456 | (-61729,-123456z)
+ 2147483647 | (1073741822,2147483647z)
+ -2147483647 | (-1073741824,-2147483647z)
+(5 rows)
+
+SELECT * FROM rngfuncb(42, 99);
+ f2 | column2
+----+---------
+ 98 | 42z
+(1 row)
+
+SELECT * FROM rngfuncb(42, 99) AS p(a,b);
+ a | b
+----+-----
+ 98 | 42z
+(1 row)
+
+-- Can reference function with or without OUT params for DROP, etc
+DROP FUNCTION rngfunc(int);
+DROP FUNCTION rngfuncr(in f2 int, out f1 int, out text);
+DROP FUNCTION rngfuncb(in f1 int, inout f2 int);
+--
+-- For my next trick, polymorphic OUT parameters
+--
+CREATE FUNCTION dup (f1 anyelement, f2 out anyelement, f3 out anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+SELECT dup(22);
+ dup
+----------------
+ (22,"{22,22}")
+(1 row)
+
+SELECT dup('xyz'); -- fails
+ERROR: could not determine polymorphic type because input has type unknown
+SELECT dup('xyz'::text);
+ dup
+-------------------
+ (xyz,"{xyz,xyz}")
+(1 row)
+
+SELECT * FROM dup('xyz'::text);
+ f2 | f3
+-----+-----------
+ xyz | {xyz,xyz}
+(1 row)
+
+-- fails, as we are attempting to rename first argument
+CREATE OR REPLACE FUNCTION dup (inout f2 anyelement, out f3 anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+ERROR: cannot change name of input parameter "f1"
+HINT: Use DROP FUNCTION dup(anyelement) first.
+DROP FUNCTION dup(anyelement);
+-- equivalent behavior, though different name exposed for input arg
+CREATE OR REPLACE FUNCTION dup (inout f2 anyelement, out f3 anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+SELECT dup(22);
+ dup
+----------------
+ (22,"{22,22}")
+(1 row)
+
+DROP FUNCTION dup(anyelement);
+-- fails, no way to deduce outputs
+CREATE FUNCTION bad (f1 int, out f2 anyelement, out f3 anyarray)
+AS 'select $1, array[$1,$1]' LANGUAGE sql;
+ERROR: cannot determine result data type
+DETAIL: A function returning a polymorphic type must have at least one polymorphic argument.
+--
+-- table functions
+--
+CREATE OR REPLACE FUNCTION rngfunc()
+RETURNS TABLE(a int)
+AS $$ SELECT a FROM generate_series(1,5) a(a) $$ LANGUAGE sql;
+SELECT * FROM rngfunc();
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+DROP FUNCTION rngfunc();
+CREATE OR REPLACE FUNCTION rngfunc(int)
+RETURNS TABLE(a int, b int)
+AS $$ SELECT a, b
+ FROM generate_series(1,$1) a(a),
+ generate_series(1,$1) b(b) $$ LANGUAGE sql;
+SELECT * FROM rngfunc(3);
+ a | b
+---+---
+ 1 | 1
+ 1 | 2
+ 1 | 3
+ 2 | 1
+ 2 | 2
+ 2 | 3
+ 3 | 1
+ 3 | 2
+ 3 | 3
+(9 rows)
+
+DROP FUNCTION rngfunc(int);
+-- case that causes change of typmod knowledge during inlining
+CREATE OR REPLACE FUNCTION rngfunc()
+RETURNS TABLE(a varchar(5))
+AS $$ SELECT 'hello'::varchar(5) $$ LANGUAGE sql STABLE;
+SELECT * FROM rngfunc() GROUP BY 1;
+ a
+-------
+ hello
+(1 row)
+
+DROP FUNCTION rngfunc();
+--
+-- some tests on SQL functions with RETURNING
+--
+create temp table tt(f1 serial, data text);
+create function insert_tt(text) returns int as
+$$ insert into tt(data) values($1) returning f1 $$
+language sql;
+select insert_tt('foo');
+ insert_tt
+-----------
+ 1
+(1 row)
+
+select insert_tt('bar');
+ insert_tt
+-----------
+ 2
+(1 row)
+
+select * from tt;
+ f1 | data
+----+------
+ 1 | foo
+ 2 | bar
+(2 rows)
+
+-- insert will execute to completion even if function needs just 1 row
+create or replace function insert_tt(text) returns int as
+$$ insert into tt(data) values($1),($1||$1) returning f1 $$
+language sql;
+select insert_tt('fool');
+ insert_tt
+-----------
+ 3
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+(4 rows)
+
+-- setof does what's expected
+create or replace function insert_tt2(text,text) returns setof int as
+$$ insert into tt(data) values($1),($2) returning f1 $$
+language sql;
+select insert_tt2('foolish','barrish');
+ insert_tt2
+------------
+ 5
+ 6
+(2 rows)
+
+select * from insert_tt2('baz','quux');
+ insert_tt2
+------------
+ 7
+ 8
+(2 rows)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+(8 rows)
+
+-- limit doesn't prevent execution to completion
+select insert_tt2('foolish','barrish') limit 1;
+ insert_tt2
+------------
+ 9
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+ 9 | foolish
+ 10 | barrish
+(10 rows)
+
+-- triggers will fire, too
+create function noticetrigger() returns trigger as $$
+begin
+ raise notice 'noticetrigger % %', new.f1, new.data;
+ return null;
+end $$ language plpgsql;
+create trigger tnoticetrigger after insert on tt for each row
+execute procedure noticetrigger();
+select insert_tt2('foolme','barme') limit 1;
+NOTICE: noticetrigger 11 foolme
+NOTICE: noticetrigger 12 barme
+ insert_tt2
+------------
+ 11
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+ 9 | foolish
+ 10 | barrish
+ 11 | foolme
+ 12 | barme
+(12 rows)
+
+-- and rules work
+create temp table tt_log(f1 int, data text);
+create rule insert_tt_rule as on insert to tt do also
+ insert into tt_log values(new.*);
+select insert_tt2('foollog','barlog') limit 1;
+NOTICE: noticetrigger 13 foollog
+NOTICE: noticetrigger 14 barlog
+ insert_tt2
+------------
+ 13
+(1 row)
+
+select * from tt;
+ f1 | data
+----+----------
+ 1 | foo
+ 2 | bar
+ 3 | fool
+ 4 | foolfool
+ 5 | foolish
+ 6 | barrish
+ 7 | baz
+ 8 | quux
+ 9 | foolish
+ 10 | barrish
+ 11 | foolme
+ 12 | barme
+ 13 | foollog
+ 14 | barlog
+(14 rows)
+
+-- note that nextval() gets executed a second time in the rule expansion,
+-- which is expected.
+select * from tt_log;
+ f1 | data
+----+---------
+ 15 | foollog
+ 16 | barlog
+(2 rows)
+
+-- test case for a whole-row-variable bug
+create function rngfunc1(n integer, out a text, out b text)
+ returns setof record
+ language sql
+ as $$ select 'foo ' || i, 'bar ' || i from generate_series(1,$1) i $$;
+set work_mem='64kB';
+select t.a, t, t.a from rngfunc1(10000) t limit 1;
+ a | t | a
+-------+-------------------+-------
+ foo 1 | ("foo 1","bar 1") | foo 1
+(1 row)
+
+reset work_mem;
+select t.a, t, t.a from rngfunc1(10000) t limit 1;
+ a | t | a
+-------+-------------------+-------
+ foo 1 | ("foo 1","bar 1") | foo 1
+(1 row)
+
+drop function rngfunc1(n integer);
+-- test use of SQL functions returning record
+-- this is supported in some cases where the query doesn't specify
+-- the actual record type ...
+create function array_to_set(anyarray) returns setof record as $$
+ select i AS "index", $1[i] AS "value" from generate_subscripts($1, 1) i
+$$ language sql strict immutable;
+select array_to_set(array['one', 'two']);
+ array_to_set
+--------------
+ (1,one)
+ (2,two)
+(2 rows)
+
+select * from array_to_set(array['one', 'two']) as t(f1 int,f2 text);
+ f1 | f2
+----+-----
+ 1 | one
+ 2 | two
+(2 rows)
+
+select * from array_to_set(array['one', 'two']); -- fail
+ERROR: a column definition list is required for functions returning "record"
+LINE 1: select * from array_to_set(array['one', 'two']);
+ ^
+create temp table rngfunc(f1 int8, f2 int8);
+create function testrngfunc() returns record as $$
+ insert into rngfunc values (1,2) returning *;
+$$ language sql;
+select testrngfunc();
+ testrngfunc
+-------------
+ (1,2)
+(1 row)
+
+select * from testrngfunc() as t(f1 int8,f2 int8);
+ f1 | f2
+----+----
+ 1 | 2
+(1 row)
+
+select * from testrngfunc(); -- fail
+ERROR: a column definition list is required for functions returning "record"
+LINE 1: select * from testrngfunc();
+ ^
+drop function testrngfunc();
+create function testrngfunc() returns setof record as $$
+ insert into rngfunc values (1,2), (3,4) returning *;
+$$ language sql;
+select testrngfunc();
+ testrngfunc
+-------------
+ (1,2)
+ (3,4)
+(2 rows)
+
+select * from testrngfunc() as t(f1 int8,f2 int8);
+ f1 | f2
+----+----
+ 1 | 2
+ 3 | 4
+(2 rows)
+
+select * from testrngfunc(); -- fail
+ERROR: a column definition list is required for functions returning "record"
+LINE 1: select * from testrngfunc();
+ ^
+drop function testrngfunc();
+--
+-- Check some cases involving added/dropped columns in a rowtype result
+--
+create temp table users (userid text, seq int, email text, todrop bool, moredrop int, enabled bool);
+insert into users values ('id',1,'email',true,11,true);
+insert into users values ('id2',2,'email2',true,12,true);
+alter table users drop column todrop;
+create or replace function get_first_user() returns users as
+$$ SELECT * FROM users ORDER BY userid LIMIT 1; $$
+language sql stable;
+SELECT get_first_user();
+ get_first_user
+-------------------
+ (id,1,email,11,t)
+(1 row)
+
+SELECT * FROM get_first_user();
+ userid | seq | email | moredrop | enabled
+--------+-----+-------+----------+---------
+ id | 1 | email | 11 | t
+(1 row)
+
+create or replace function get_users() returns setof users as
+$$ SELECT * FROM users ORDER BY userid; $$
+language sql stable;
+SELECT get_users();
+ get_users
+---------------------
+ (id,1,email,11,t)
+ (id2,2,email2,12,t)
+(2 rows)
+
+SELECT * FROM get_users();
+ userid | seq | email | moredrop | enabled
+--------+-----+--------+----------+---------
+ id | 1 | email | 11 | t
+ id2 | 2 | email2 | 12 | t
+(2 rows)
+
+SELECT * FROM get_users() WITH ORDINALITY; -- make sure ordinality copes
+ userid | seq | email | moredrop | enabled | ordinality
+--------+-----+--------+----------+---------+------------
+ id | 1 | email | 11 | t | 1
+ id2 | 2 | email2 | 12 | t | 2
+(2 rows)
+
+-- multiple functions vs. dropped columns
+SELECT * FROM ROWS FROM(generate_series(10,11), get_users()) WITH ORDINALITY;
+ generate_series | userid | seq | email | moredrop | enabled | ordinality
+-----------------+--------+-----+--------+----------+---------+------------
+ 10 | id | 1 | email | 11 | t | 1
+ 11 | id2 | 2 | email2 | 12 | t | 2
+(2 rows)
+
+SELECT * FROM ROWS FROM(get_users(), generate_series(10,11)) WITH ORDINALITY;
+ userid | seq | email | moredrop | enabled | generate_series | ordinality
+--------+-----+--------+----------+---------+-----------------+------------
+ id | 1 | email | 11 | t | 10 | 1
+ id2 | 2 | email2 | 12 | t | 11 | 2
+(2 rows)
+
+-- check that we can cope with post-parsing changes in rowtypes
+create temp view usersview as
+SELECT * FROM ROWS FROM(get_users(), generate_series(10,11)) WITH ORDINALITY;
+select * from usersview;
+ userid | seq | email | moredrop | enabled | generate_series | ordinality
+--------+-----+--------+----------+---------+-----------------+------------
+ id | 1 | email | 11 | t | 10 | 1
+ id2 | 2 | email2 | 12 | t | 11 | 2
+(2 rows)
+
+alter table users add column junk text;
+select * from usersview;
+ userid | seq | email | moredrop | enabled | generate_series | ordinality
+--------+-----+--------+----------+---------+-----------------+------------
+ id | 1 | email | 11 | t | 10 | 1
+ id2 | 2 | email2 | 12 | t | 11 | 2
+(2 rows)
+
+begin;
+alter table users drop column moredrop;
+select * from usersview; -- expect clean failure
+ERROR: attribute 5 of type record has been dropped
+rollback;
+alter table users alter column seq type numeric;
+select * from usersview; -- expect clean failure
+ERROR: attribute 2 of type record has wrong type
+DETAIL: Table has type numeric, but query expects integer.
+drop view usersview;
+drop function get_first_user();
+drop function get_users();
+drop table users;
+-- this won't get inlined because of type coercion, but it shouldn't fail
+create or replace function rngfuncbar() returns setof text as
+$$ select 'foo'::varchar union all select 'bar'::varchar ; $$
+language sql stable;
+select rngfuncbar();
+ rngfuncbar
+------------
+ foo
+ bar
+(2 rows)
+
+select * from rngfuncbar();
+ rngfuncbar
+------------
+ foo
+ bar
+(2 rows)
+
+drop function rngfuncbar();
+-- check handling of a SQL function with multiple OUT params (bug #5777)
+create or replace function rngfuncbar(out integer, out numeric) as
+$$ select (1, 2.1) $$ language sql;
+select * from rngfuncbar();
+ column1 | column2
+---------+---------
+ 1 | 2.1
+(1 row)
+
+create or replace function rngfuncbar(out integer, out numeric) as
+$$ select (1, 2) $$ language sql;
+select * from rngfuncbar(); -- fail
+ERROR: function return row and query-specified return row do not match
+DETAIL: Returned type integer at ordinal position 2, but query expects numeric.
+create or replace function rngfuncbar(out integer, out numeric) as
+$$ select (1, 2.1, 3) $$ language sql;
+select * from rngfuncbar(); -- fail
+ERROR: function return row and query-specified return row do not match
+DETAIL: Returned row contains 3 attributes, but query expects 2.
+drop function rngfuncbar();
+-- check whole-row-Var handling in nested lateral functions (bug #11703)
+create function extractq2(t int8_tbl) returns int8 as $$
+ select t.q2
+$$ language sql immutable;
+explain (verbose, costs off)
+select x from int8_tbl, extractq2(int8_tbl) f(x);
+ QUERY PLAN
+------------------------------------
+ Nested Loop
+ Output: f.x
+ -> Seq Scan on public.int8_tbl
+ Output: int8_tbl.q2
+ -> Function Scan on f
+ Output: f.x
+ Function Call: int8_tbl.q2
+(7 rows)
+
+select x from int8_tbl, extractq2(int8_tbl) f(x);
+ x
+-------------------
+ 456
+ 4567890123456789
+ 123
+ 4567890123456789
+ -4567890123456789
+(5 rows)
+
+create function extractq2_2(t int8_tbl) returns table(ret1 int8) as $$
+ select extractq2(t) offset 0
+$$ language sql immutable;
+explain (verbose, costs off)
+select x from int8_tbl, extractq2_2(int8_tbl) f(x);
+ QUERY PLAN
+-----------------------------------
+ Nested Loop
+ Output: ((int8_tbl.*).q2)
+ -> Seq Scan on public.int8_tbl
+ Output: int8_tbl.*
+ -> Result
+ Output: (int8_tbl.*).q2
+(6 rows)
+
+select x from int8_tbl, extractq2_2(int8_tbl) f(x);
+ x
+-------------------
+ 456
+ 4567890123456789
+ 123
+ 4567890123456789
+ -4567890123456789
+(5 rows)
+
+-- without the "offset 0", this function gets optimized quite differently
+create function extractq2_2_opt(t int8_tbl) returns table(ret1 int8) as $$
+ select extractq2(t)
+$$ language sql immutable;
+explain (verbose, costs off)
+select x from int8_tbl, extractq2_2_opt(int8_tbl) f(x);
+ QUERY PLAN
+-----------------------------
+ Seq Scan on public.int8_tbl
+ Output: int8_tbl.q2
+(2 rows)
+
+select x from int8_tbl, extractq2_2_opt(int8_tbl) f(x);
+ x
+-------------------
+ 456
+ 4567890123456789
+ 123
+ 4567890123456789
+ -4567890123456789
+(5 rows)
+
+-- check handling of nulls in SRF results (bug #7808)
+create type rngfunc2 as (a integer, b text);
+select *, row_to_json(u) from unnest(array[(1,'foo')::rngfunc2, null::rngfunc2]) u;
+ a | b | row_to_json
+---+-----+---------------------
+ 1 | foo | {"a":1,"b":"foo"}
+ | | {"a":null,"b":null}
+(2 rows)
+
+select *, row_to_json(u) from unnest(array[null::rngfunc2, null::rngfunc2]) u;
+ a | b | row_to_json
+---+---+---------------------
+ | | {"a":null,"b":null}
+ | | {"a":null,"b":null}
+(2 rows)
+
+select *, row_to_json(u) from unnest(array[null::rngfunc2, (1,'foo')::rngfunc2, null::rngfunc2]) u;
+ a | b | row_to_json
+---+-----+---------------------
+ | | {"a":null,"b":null}
+ 1 | foo | {"a":1,"b":"foo"}
+ | | {"a":null,"b":null}
+(3 rows)
+
+select *, row_to_json(u) from unnest(array[]::rngfunc2[]) u;
+ a | b | row_to_json
+---+---+-------------
+(0 rows)
+
+drop type rngfunc2;
diff --git a/src/test/regress/expected/reloptions_1.out b/src/test/regress/expected/reloptions_1.out
new file mode 100644
index 0000000000..d766b195f8
--- /dev/null
+++ b/src/test/regress/expected/reloptions_1.out
@@ -0,0 +1,219 @@
+-- Simple create
+CREATE TABLE reloptions_test(i INT) WITH (FiLLFaCToR=30,
+ autovacuum_enabled = false, autovacuum_analyze_scale_factor = 0.2);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+------------------------------------------------------------------------------
+ {fillfactor=30,autovacuum_enabled=false,autovacuum_analyze_scale_factor=0.2}
+(1 row)
+
+-- Fail min/max values check
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=2);
+ERROR: value 2 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=110);
+ERROR: value 110 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor = -10.0);
+ERROR: value -10.0 out of bounds for option "autovacuum_analyze_scale_factor"
+DETAIL: Valid values are between "0.000000" and "100.000000".
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor = 110.0);
+ERROR: value 110.0 out of bounds for option "autovacuum_analyze_scale_factor"
+DETAIL: Valid values are between "0.000000" and "100.000000".
+-- Fail when option and namespace do not exist
+CREATE TABLE reloptions_test2(i INT) WITH (not_existing_option=2);
+ERROR: unrecognized parameter "not_existing_option"
+CREATE TABLE reloptions_test2(i INT) WITH (not_existing_namespace.fillfactor=2);
+ERROR: unrecognized parameter namespace "not_existing_namespace"
+-- Fail while setting improper values
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=-30.1);
+ERROR: value -30.1 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor='string');
+ERROR: invalid value for integer option "fillfactor": string
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=true);
+ERROR: invalid value for integer option "fillfactor": true
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_enabled=12);
+ERROR: invalid value for boolean option "autovacuum_enabled": 12
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_enabled=30.5);
+ERROR: invalid value for boolean option "autovacuum_enabled": 30.5
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_enabled='string');
+ERROR: invalid value for boolean option "autovacuum_enabled": string
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor='string');
+ERROR: invalid value for floating point option "autovacuum_analyze_scale_factor": string
+CREATE TABLE reloptions_test2(i INT) WITH (autovacuum_analyze_scale_factor=true);
+ERROR: invalid value for floating point option "autovacuum_analyze_scale_factor": true
+-- Fail if option is specified twice
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor=30, fillfactor=40);
+ERROR: parameter "fillfactor" specified more than once
+-- Specifying name only for a non-Boolean option should fail
+CREATE TABLE reloptions_test2(i INT) WITH (fillfactor);
+ERROR: invalid value for integer option "fillfactor": true
+-- Simple ALTER TABLE
+ALTER TABLE reloptions_test SET (fillfactor=31,
+ autovacuum_analyze_scale_factor = 0.3);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+------------------------------------------------------------------------------
+ {autovacuum_enabled=false,fillfactor=31,autovacuum_analyze_scale_factor=0.3}
+(1 row)
+
+-- Set boolean option to true without specifying value
+ALTER TABLE reloptions_test SET (autovacuum_enabled, fillfactor=32);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+-----------------------------------------------------------------------------
+ {autovacuum_analyze_scale_factor=0.3,autovacuum_enabled=true,fillfactor=32}
+(1 row)
+
+-- Check that RESET works well
+ALTER TABLE reloptions_test RESET (fillfactor);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+---------------------------------------------------------------
+ {autovacuum_analyze_scale_factor=0.3,autovacuum_enabled=true}
+(1 row)
+
+-- Resetting all values causes the column to become null
+ALTER TABLE reloptions_test RESET (autovacuum_enabled,
+ autovacuum_analyze_scale_factor);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass AND
+ reloptions IS NULL;
+ reloptions
+------------
+
+(1 row)
+
+-- RESET fails if a value is specified
+ALTER TABLE reloptions_test RESET (fillfactor=12);
+ERROR: RESET must not include values for parameters
+-- Test vacuum_truncate option
+DROP TABLE reloptions_test;
+CREATE TABLE reloptions_test(i INT NOT NULL, j text)
+ WITH (vacuum_truncate=false,
+ toast.vacuum_truncate=false,
+ autovacuum_enabled=false);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+--------------------------------------------------
+ {vacuum_truncate=false,autovacuum_enabled=false}
+(1 row)
+
+INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
+ERROR: null value in column "i" of relation "reloptions_test" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+VACUUM reloptions_test;
+SELECT pg_relation_size('reloptions_test') > 0;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT reloptions FROM pg_class WHERE oid =
+ (SELECT reltoastrelid FROM pg_class
+ WHERE oid = 'reloptions_test'::regclass);
+ reloptions
+------------
+(0 rows)
+
+ALTER TABLE reloptions_test RESET (vacuum_truncate);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+----------------------------
+ {autovacuum_enabled=false}
+(1 row)
+
+INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
+ERROR: null value in column "i" of relation "reloptions_test" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+VACUUM reloptions_test;
+SELECT pg_relation_size('reloptions_test') = 0;
+ ?column?
+----------
+ f
+(1 row)
+
+-- Test toast.* options
+DROP TABLE reloptions_test;
+CREATE TABLE reloptions_test (s VARCHAR)
+ WITH (toast.autovacuum_vacuum_cost_delay = 23);
+SELECT reltoastrelid as toast_oid
+ FROM pg_class WHERE oid = 'reloptions_test'::regclass \gset
+SELECT reloptions FROM pg_class WHERE oid = :toast_oid;
+ reloptions
+------------
+(0 rows)
+
+ALTER TABLE reloptions_test SET (toast.autovacuum_vacuum_cost_delay = 24);
+SELECT reloptions FROM pg_class WHERE oid = :toast_oid;
+ reloptions
+------------
+(0 rows)
+
+ALTER TABLE reloptions_test RESET (toast.autovacuum_vacuum_cost_delay);
+SELECT reloptions FROM pg_class WHERE oid = :toast_oid;
+ reloptions
+------------
+(0 rows)
+
+-- Fail on non-existent options in toast namespace
+CREATE TABLE reloptions_test2 (i int) WITH (toast.not_existing_option = 42);
+ERROR: unrecognized parameter "not_existing_option"
+-- Mix TOAST & heap
+DROP TABLE reloptions_test;
+CREATE TABLE reloptions_test (s VARCHAR) WITH
+ (toast.autovacuum_vacuum_cost_delay = 23,
+ autovacuum_vacuum_cost_delay = 24, fillfactor = 40);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
+ reloptions
+-------------------------------------------------
+ {autovacuum_vacuum_cost_delay=24,fillfactor=40}
+(1 row)
+
+SELECT reloptions FROM pg_class WHERE oid = (
+ SELECT reltoastrelid FROM pg_class WHERE oid = 'reloptions_test'::regclass);
+ reloptions
+------------
+(0 rows)
+
+--
+-- CREATE INDEX, ALTER INDEX for btrees
+--
+CREATE INDEX reloptions_test_idx ON reloptions_test (s) WITH (fillfactor=30);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test_idx'::regclass;
+ reloptions
+-----------------
+ {fillfactor=30}
+(1 row)
+
+-- Fail when option and namespace do not exist
+CREATE INDEX reloptions_test_idx ON reloptions_test (s)
+ WITH (not_existing_option=2);
+ERROR: unrecognized parameter "not_existing_option"
+CREATE INDEX reloptions_test_idx ON reloptions_test (s)
+ WITH (not_existing_ns.fillfactor=2);
+ERROR: unrecognized parameter namespace "not_existing_ns"
+-- Check allowed ranges
+CREATE INDEX reloptions_test_idx2 ON reloptions_test (s) WITH (fillfactor=1);
+ERROR: value 1 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+CREATE INDEX reloptions_test_idx2 ON reloptions_test (s) WITH (fillfactor=130);
+ERROR: value 130 out of bounds for option "fillfactor"
+DETAIL: Valid values are between "10" and "100".
+-- Check ALTER
+ALTER INDEX reloptions_test_idx SET (fillfactor=40);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test_idx'::regclass;
+ reloptions
+-----------------
+ {fillfactor=40}
+(1 row)
+
+-- Check ALTER on empty reloption list
+CREATE INDEX reloptions_test_idx3 ON reloptions_test (s);
+ALTER INDEX reloptions_test_idx3 SET (fillfactor=40);
+SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test_idx3'::regclass;
+ reloptions
+-----------------
+ {fillfactor=40}
+(1 row)
+
diff --git a/src/test/regress/expected/rowsecurity_1.out b/src/test/regress/expected/rowsecurity_1.out
new file mode 100644
index 0000000000..79b1f73e08
--- /dev/null
+++ b/src/test/regress/expected/rowsecurity_1.out
@@ -0,0 +1,4052 @@
+--
+-- Test of Row-level security feature
+--
+-- Clean up in case a prior regression run failed
+-- Suppress NOTICE messages when users/groups don't exist
+SET client_min_messages TO 'warning';
+DROP USER IF EXISTS regress_rls_alice;
+DROP USER IF EXISTS regress_rls_bob;
+DROP USER IF EXISTS regress_rls_carol;
+DROP USER IF EXISTS regress_rls_dave;
+DROP USER IF EXISTS regress_rls_exempt_user;
+DROP ROLE IF EXISTS regress_rls_group1;
+DROP ROLE IF EXISTS regress_rls_group2;
+DROP SCHEMA IF EXISTS regress_rls_schema CASCADE;
+RESET client_min_messages;
+-- initial setup
+CREATE USER regress_rls_alice NOLOGIN;
+CREATE USER regress_rls_bob NOLOGIN;
+CREATE USER regress_rls_carol NOLOGIN;
+CREATE USER regress_rls_dave NOLOGIN;
+CREATE USER regress_rls_exempt_user BYPASSRLS NOLOGIN;
+CREATE ROLE regress_rls_group1 NOLOGIN;
+CREATE ROLE regress_rls_group2 NOLOGIN;
+GRANT regress_rls_group1 TO regress_rls_bob;
+GRANT regress_rls_group2 TO regress_rls_carol;
+CREATE SCHEMA regress_rls_schema;
+GRANT ALL ON SCHEMA regress_rls_schema to public;
+SET search_path = regress_rls_schema;
+-- setup of malicious function
+CREATE OR REPLACE FUNCTION f_leak(text) RETURNS bool
+ COST 0.0000001 LANGUAGE plpgsql
+ AS 'BEGIN RAISE NOTICE ''f_leak => %'', $1; RETURN true; END';
+GRANT EXECUTE ON FUNCTION f_leak(text) TO public;
+-- BASIC Row-Level Security Scenario
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE uaccount (
+ pguser name primary key,
+ seclv int
+);
+GRANT SELECT ON uaccount TO public;
+INSERT INTO uaccount VALUES
+ ('regress_rls_alice', 99),
+ ('regress_rls_bob', 1),
+ ('regress_rls_carol', 2),
+ ('regress_rls_dave', 3);
+CREATE TABLE category (
+ cid int primary key,
+ cname text
+);
+GRANT ALL ON category TO public;
+INSERT INTO category VALUES
+ (11, 'novel'),
+ (22, 'science fiction'),
+ (33, 'technology'),
+ (44, 'manga');
+CREATE TABLE document (
+ did int primary key,
+ cid int references category(cid),
+ dlevel int not null,
+ dauthor name,
+ dtitle text
+);
+GRANT ALL ON document TO public;
+INSERT INTO document VALUES
+ ( 1, 11, 1, 'regress_rls_bob', 'my first novel'),
+ ( 2, 11, 2, 'regress_rls_bob', 'my second novel'),
+ ( 3, 22, 2, 'regress_rls_bob', 'my science fiction'),
+ ( 4, 44, 1, 'regress_rls_bob', 'my first manga'),
+ ( 5, 44, 2, 'regress_rls_bob', 'my second manga'),
+ ( 6, 22, 1, 'regress_rls_carol', 'great science fiction'),
+ ( 7, 33, 2, 'regress_rls_carol', 'great technology book'),
+ ( 8, 44, 1, 'regress_rls_carol', 'great manga'),
+ ( 9, 22, 1, 'regress_rls_dave', 'awesome science fiction'),
+ (10, 33, 2, 'regress_rls_dave', 'awesome technology book');
+ALTER TABLE document ENABLE ROW LEVEL SECURITY;
+-- user's security level must be higher than or equal to document's
+CREATE POLICY p1 ON document AS PERMISSIVE
+ USING (dlevel <= (SELECT seclv FROM uaccount WHERE pguser = current_user));
+-- try to create a policy of bogus type
+CREATE POLICY p1 ON document AS UGLY
+ USING (dlevel <= (SELECT seclv FROM uaccount WHERE pguser = current_user));
+ERROR: unrecognized row security option "ugly"
+LINE 1: CREATE POLICY p1 ON document AS UGLY
+ ^
+HINT: Only PERMISSIVE or RESTRICTIVE policies are supported currently.
+-- but Dave isn't allowed to do anything at cid 50 or above
+-- this is to make sure that we sort the policies by name first
+-- when applying WITH CHECK; a later INSERT by Dave should fail due
+-- to p1r first
+CREATE POLICY p2r ON document AS RESTRICTIVE TO regress_rls_dave
+ USING (cid <> 44 AND cid < 50);
+-- and Dave isn't allowed to see manga documents
+CREATE POLICY p1r ON document AS RESTRICTIVE TO regress_rls_dave
+ USING (cid <> 44);
+\dp
+ Access privileges
+ Schema | Name | Type | Access privileges | Column privileges | Policies
+--------------------+----------+-------+---------------------------------------------+-------------------+--------------------------------------------
+ regress_rls_schema | category | table | regress_rls_alice=arwdDxt/regress_rls_alice+| |
+ | | | =arwdDxt/regress_rls_alice | |
+ regress_rls_schema | document | table | regress_rls_alice=arwdDxt/regress_rls_alice+| | p1: +
+ | | | =arwdDxt/regress_rls_alice | | (u): (dlevel <= ( SELECT uaccount.seclv +
+ | | | | | FROM uaccount +
+ | | | | | WHERE (uaccount.pguser = CURRENT_USER)))+
+ | | | | | p2r (RESTRICTIVE): +
+ | | | | | (u): ((cid <> 44) AND (cid < 50)) +
+ | | | | | to: regress_rls_dave +
+ | | | | | p1r (RESTRICTIVE): +
+ | | | | | (u): (cid <> 44) +
+ | | | | | to: regress_rls_dave
+ regress_rls_schema | uaccount | table | regress_rls_alice=arwdDxt/regress_rls_alice+| |
+ | | | =r/regress_rls_alice | |
+(3 rows)
+
+\d document
+ Table "regress_rls_schema.document"
+ Column | Type | Collation | Nullable | Default
+---------+---------+-----------+----------+---------
+ did | integer | | not null |
+ cid | integer | | |
+ dlevel | integer | | not null |
+ dauthor | name | | |
+ dtitle | text | | |
+Indexes:
+ "document_pkey" PRIMARY KEY, btree (did)
+Foreign-key constraints:
+ "document_cid_fkey" FOREIGN KEY (cid) REFERENCES category(cid)
+Policies:
+ POLICY "p1"
+ USING ((dlevel <= ( SELECT uaccount.seclv
+ FROM uaccount
+ WHERE (uaccount.pguser = CURRENT_USER))))
+ POLICY "p1r" AS RESTRICTIVE
+ TO regress_rls_dave
+ USING ((cid <> 44))
+ POLICY "p2r" AS RESTRICTIVE
+ TO regress_rls_dave
+ USING (((cid <> 44) AND (cid < 50)))
+
+SELECT * FROM pg_policies WHERE schemaname = 'regress_rls_schema' AND tablename = 'document' ORDER BY policyname;
+ schemaname | tablename | policyname | permissive | roles | cmd | qual | with_check
+--------------------+-----------+------------+-------------+--------------------+-----+--------------------------------------------+------------
+ regress_rls_schema | document | p1 | PERMISSIVE | {public} | ALL | (dlevel <= ( SELECT uaccount.seclv +|
+ | | | | | | FROM uaccount +|
+ | | | | | | WHERE (uaccount.pguser = CURRENT_USER))) |
+ regress_rls_schema | document | p1r | RESTRICTIVE | {regress_rls_dave} | ALL | (cid <> 44) |
+ regress_rls_schema | document | p2r | RESTRICTIVE | {regress_rls_dave} | ALL | ((cid <> 44) AND (cid < 50)) |
+(3 rows)
+
+-- viewpoint from regress_rls_bob
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+(5 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-------------------+-------------------------+-----------------
+ 11 | 1 | 1 | regress_rls_bob | my first novel | novel
+ 44 | 4 | 1 | regress_rls_bob | my first manga | manga
+ 22 | 6 | 1 | regress_rls_carol | great science fiction | science fiction
+ 44 | 8 | 1 | regress_rls_carol | great manga | manga
+ 22 | 9 | 1 | regress_rls_dave | awesome science fiction | science fiction
+(5 rows)
+
+-- try a sampled version
+SELECT * FROM document TABLESAMPLE BERNOULLI(50) REPEATABLE(0)
+ WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+-- viewpoint from regress_rls_carol
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => awesome technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+(10 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => awesome technology book
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-------------------+-------------------------+-----------------
+ 11 | 1 | 1 | regress_rls_bob | my first novel | novel
+ 11 | 2 | 2 | regress_rls_bob | my second novel | novel
+ 22 | 3 | 2 | regress_rls_bob | my science fiction | science fiction
+ 44 | 4 | 1 | regress_rls_bob | my first manga | manga
+ 44 | 5 | 2 | regress_rls_bob | my second manga | manga
+ 22 | 6 | 1 | regress_rls_carol | great science fiction | science fiction
+ 33 | 7 | 2 | regress_rls_carol | great technology book | technology
+ 44 | 8 | 1 | regress_rls_carol | great manga | manga
+ 22 | 9 | 1 | regress_rls_dave | awesome science fiction | science fiction
+ 33 | 10 | 2 | regress_rls_dave | awesome technology book | technology
+(10 rows)
+
+-- try a sampled version
+SELECT * FROM document TABLESAMPLE BERNOULLI(50) REPEATABLE(0)
+ WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great manga
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document WHERE f_leak(dtitle);
+ QUERY PLAN
+----------------------------------------------------
+ Seq Scan on document
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle);
+ QUERY PLAN
+-----------------------------------------------------------
+ Hash Join
+ Hash Cond: (category.cid = document.cid)
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on category
+ -> Hash
+ -> Seq Scan on document
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+(9 rows)
+
+-- viewpoint from regress_rls_dave
+SET SESSION AUTHORIZATION regress_rls_dave;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => awesome technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+(7 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => awesome technology book
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-------------------+-------------------------+-----------------
+ 11 | 1 | 1 | regress_rls_bob | my first novel | novel
+ 11 | 2 | 2 | regress_rls_bob | my second novel | novel
+ 22 | 3 | 2 | regress_rls_bob | my science fiction | science fiction
+ 22 | 6 | 1 | regress_rls_carol | great science fiction | science fiction
+ 33 | 7 | 2 | regress_rls_carol | great technology book | technology
+ 22 | 9 | 1 | regress_rls_dave | awesome science fiction | science fiction
+ 33 | 10 | 2 | regress_rls_dave | awesome technology book | technology
+(7 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document WHERE f_leak(dtitle);
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
+ Seq Scan on document
+ Filter: ((cid <> 44) AND (cid <> 44) AND (cid < 50) AND (dlevel <= $0) AND f_leak(dtitle))
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle);
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------
+ Hash Join
+ Hash Cond: (category.cid = document.cid)
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on category
+ -> Hash
+ -> Seq Scan on document
+ Filter: ((cid <> 44) AND (cid <> 44) AND (cid < 50) AND (dlevel <= $0) AND f_leak(dtitle))
+(9 rows)
+
+-- 44 would technically fail for both p2r and p1r, but we should get an error
+-- back from p1r for this because it sorts first
+INSERT INTO document VALUES (100, 44, 1, 'regress_rls_dave', 'testing sorting of policies'); -- fail
+ERROR: new row violates row-level security policy "p1r" for table "document"
+-- Just to see a p2r error
+INSERT INTO document VALUES (100, 55, 1, 'regress_rls_dave', 'testing sorting of policies'); -- fail
+ERROR: new row violates row-level security policy "p2r" for table "document"
+-- only owner can change policies
+ALTER POLICY p1 ON document USING (true); --fail
+ERROR: must be owner of table document
+DROP POLICY p1 ON document; --fail
+ERROR: must be owner of relation document
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY p1 ON document USING (dauthor = current_user);
+-- viewpoint from regress_rls_bob again
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+--------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+(5 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER by did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my science fiction
+NOTICE: f_leak => my first manga
+NOTICE: f_leak => my second manga
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-----------------+--------------------+-----------------
+ 11 | 1 | 1 | regress_rls_bob | my first novel | novel
+ 11 | 2 | 2 | regress_rls_bob | my second novel | novel
+ 22 | 3 | 2 | regress_rls_bob | my science fiction | science fiction
+ 44 | 4 | 1 | regress_rls_bob | my first manga | manga
+ 44 | 5 | 2 | regress_rls_bob | my second manga | manga
+(5 rows)
+
+-- viewpoint from regress_rls_carol again
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => great manga
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+(3 rows)
+
+SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle) ORDER by did;
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => great manga
+ cid | did | dlevel | dauthor | dtitle | cname
+-----+-----+--------+-------------------+-----------------------+-----------------
+ 22 | 6 | 1 | regress_rls_carol | great science fiction | science fiction
+ 33 | 7 | 2 | regress_rls_carol | great technology book | technology
+ 44 | 8 | 1 | regress_rls_carol | great manga | manga
+(3 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document WHERE f_leak(dtitle);
+ QUERY PLAN
+---------------------------------------------------------
+ Seq Scan on document
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM document NATURAL JOIN category WHERE f_leak(dtitle);
+ QUERY PLAN
+---------------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on document
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+ -> Index Scan using category_pkey on category
+ Index Cond: (cid = document.cid)
+(5 rows)
+
+-- interaction of FK/PK constraints
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE POLICY p2 ON category
+ USING (CASE WHEN current_user = 'regress_rls_bob' THEN cid IN (11, 33)
+ WHEN current_user = 'regress_rls_carol' THEN cid IN (22, 44)
+ ELSE false END);
+ALTER TABLE category ENABLE ROW LEVEL SECURITY;
+-- cannot delete PK referenced by invisible FK
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM document d FULL OUTER JOIN category c on d.cid = c.cid ORDER BY d.did, c.cid;
+ did | cid | dlevel | dauthor | dtitle | cid | cname
+-----+-----+--------+-----------------+--------------------+-----+------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel | 11 | novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel | 11 | novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction | |
+ 4 | 44 | 1 | regress_rls_bob | my first manga | |
+ 5 | 44 | 2 | regress_rls_bob | my second manga | |
+ | | | | | 33 | technology
+(6 rows)
+
+DELETE FROM category WHERE cid = 33; -- fails with FK violation
+ERROR: update or delete on table "category" violates foreign key constraint "document_cid_fkey" on table "document"
+DETAIL: Key is still referenced from table "document".
+-- can insert FK referencing invisible PK
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM document d FULL OUTER JOIN category c on d.cid = c.cid ORDER BY d.did, c.cid;
+ did | cid | dlevel | dauthor | dtitle | cid | cname
+-----+-----+--------+-------------------+-----------------------+-----+-----------------
+ 6 | 22 | 1 | regress_rls_carol | great science fiction | 22 | science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book | |
+ 8 | 44 | 1 | regress_rls_carol | great manga | 44 | manga
+(3 rows)
+
+INSERT INTO document VALUES (11, 33, 1, current_user, 'hoge');
+-- UNIQUE or PRIMARY KEY constraint violation DOES reveal presence of row
+SET SESSION AUTHORIZATION regress_rls_bob;
+INSERT INTO document VALUES (8, 44, 1, 'regress_rls_bob', 'my third manga'); -- Must fail with unique violation, revealing presence of did we can't see
+ERROR: duplicate key value violates unique constraint "document_pkey"
+SELECT * FROM document WHERE did = 8; -- and confirm we can't see it
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+---------+--------
+(0 rows)
+
+-- RLS policies are checked before constraints
+INSERT INTO document VALUES (8, 44, 1, 'regress_rls_carol', 'my third manga'); -- Should fail with RLS check violation, not duplicate key violation
+ERROR: new row violates row-level security policy for table "document"
+UPDATE document SET did = 8, dauthor = 'regress_rls_carol' WHERE did = 5; -- Should fail with RLS check violation, not duplicate key violation
+ERROR: new row violates row-level security policy for table "document"
+-- database superuser does bypass RLS policy when enabled
+RESET SESSION AUTHORIZATION;
+SET row_security TO ON;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+-- database superuser does bypass RLS policy when disabled
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+-- database non-superuser with bypass privilege can bypass RLS policy when disabled
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+-- RLS policy does not apply to table owner when RLS enabled.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+-- RLS policy does not apply to table owner when RLS disabled.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO OFF;
+SELECT * FROM document;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 22 | 2 | regress_rls_bob | my science fiction
+ 4 | 44 | 1 | regress_rls_bob | my first manga
+ 5 | 44 | 2 | regress_rls_bob | my second manga
+ 6 | 22 | 1 | regress_rls_carol | great science fiction
+ 7 | 33 | 2 | regress_rls_carol | great technology book
+ 8 | 44 | 1 | regress_rls_carol | great manga
+ 9 | 22 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 33 | 2 | regress_rls_dave | awesome technology book
+ 11 | 33 | 1 | regress_rls_carol | hoge
+(11 rows)
+
+SELECT * FROM category;
+ cid | cname
+-----+-----------------
+ 11 | novel
+ 22 | science fiction
+ 33 | technology
+ 44 | manga
+(4 rows)
+
+--
+-- Table inheritance and RLS policy
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+CREATE TABLE t1 (id int not null primary key, a int, junk1 text, b text);
+ALTER TABLE t1 DROP COLUMN junk1; -- just a distraction
+GRANT ALL ON t1 TO public;
+COPY t1 FROM stdin WITH ;
+CREATE TABLE t2 (c float) INHERITS (t1);
+GRANT ALL ON t2 TO public;
+COPY t2 FROM stdin;
+CREATE TABLE t3 (id int not null primary key, c text, b text, a int);
+ALTER TABLE t3 INHERIT t1;
+GRANT ALL ON t3 TO public;
+COPY t3(id, a,b,c) FROM stdin;
+CREATE POLICY p1 ON t1 FOR ALL TO PUBLIC USING (a % 2 = 0); -- must be an even number
+CREATE POLICY p2 ON t2 FOR ALL TO PUBLIC USING (a % 2 = 1); -- must be an odd number
+ALTER TABLE t1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE t2 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM t1;
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 104 | 4 | dad
+ 202 | 2 | bcd
+ 204 | 4 | def
+ 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+-------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t2 t1_1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t3 t1_2
+ Filter: ((a % 2) = 0)
+(7 rows)
+
+SELECT * FROM t1 WHERE f_leak(b);
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => yyy
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 104 | 4 | dad
+ 202 | 2 | bcd
+ 204 | 4 | def
+ 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(7 rows)
+
+-- reference to system column
+SELECT tableoid::regclass, * FROM t1;
+ tableoid | id | a | b
+----------+-----+---+-----
+ t1 | 102 | 2 | bbb
+ t1 | 104 | 4 | dad
+ t2 | 202 | 2 | bcd
+ t2 | 204 | 4 | def
+ t3 | 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM t1;
+ QUERY PLAN
+-------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t2 t1_1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t3 t1_2
+ Filter: ((a % 2) = 0)
+(7 rows)
+
+-- reference to whole-row variable
+SELECT *, t1 FROM t1;
+ id | a | b | t1
+-----+---+-----+-------------
+ 102 | 2 | bbb | (102,2,bbb)
+ 104 | 4 | dad | (104,4,dad)
+ 202 | 2 | bcd | (202,2,bcd)
+ 204 | 4 | def | (204,4,def)
+ 302 | 2 | yyy | (302,2,yyy)
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT *, t1 FROM t1;
+ QUERY PLAN
+-------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t2 t1_1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t3 t1_2
+ Filter: ((a % 2) = 0)
+(7 rows)
+
+-- for share/update lock
+SELECT * FROM t1 FOR SHARE;
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 104 | 4 | dad
+ 202 | 2 | bcd
+ 204 | 4 | def
+ 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 FOR SHARE;
+ QUERY PLAN
+-------------------------------------
+ LockRows
+ -> Append
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t2 t1_1
+ Filter: ((a % 2) = 0)
+ -> Seq Scan on t3 t1_2
+ Filter: ((a % 2) = 0)
+(8 rows)
+
+SELECT * FROM t1 WHERE f_leak(b) FOR SHARE;
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => yyy
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 104 | 4 | dad
+ 202 | 2 | bcd
+ 204 | 4 | def
+ 302 | 2 | yyy
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b) FOR SHARE;
+ QUERY PLAN
+-----------------------------------------------------
+ LockRows
+ -> Append
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(8 rows)
+
+-- union all query
+SELECT a, b, tableoid::regclass FROM t2 UNION ALL SELECT a, b, tableoid::regclass FROM t3;
+ a | b | tableoid
+---+-----+----------
+ 1 | abc | t2
+ 3 | cde | t2
+ 1 | xxx | t3
+ 2 | yyy | t3
+ 3 | zzz | t3
+(5 rows)
+
+EXPLAIN (COSTS OFF) SELECT a, b, tableoid::regclass FROM t2 UNION ALL SELECT a, b, tableoid::regclass FROM t3;
+ QUERY PLAN
+-------------------------------
+ Append
+ -> Seq Scan on t2
+ Filter: ((a % 2) = 1)
+ -> Seq Scan on t3
+(4 rows)
+
+-- superuser is allowed to bypass RLS checks
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+SELECT * FROM t1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => cde
+NOTICE: f_leak => def
+NOTICE: f_leak => xxx
+NOTICE: f_leak => yyy
+NOTICE: f_leak => zzz
+ id | a | b
+-----+---+-----
+ 101 | 1 | aba
+ 102 | 2 | bbb
+ 103 | 3 | ccc
+ 104 | 4 | dad
+ 201 | 1 | abc
+ 202 | 2 | bcd
+ 203 | 3 | cde
+ 204 | 4 | def
+ 301 | 1 | xxx
+ 302 | 2 | yyy
+ 303 | 3 | zzz
+(11 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: f_leak(b)
+ -> Seq Scan on t2 t1_1
+ Filter: f_leak(b)
+ -> Seq Scan on t3 t1_2
+ Filter: f_leak(b)
+(7 rows)
+
+-- non-superuser with bypass privilege can bypass RLS policy when disabled
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+SELECT * FROM t1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => cde
+NOTICE: f_leak => def
+NOTICE: f_leak => xxx
+NOTICE: f_leak => yyy
+NOTICE: f_leak => zzz
+ id | a | b
+-----+---+-----
+ 101 | 1 | aba
+ 102 | 2 | bbb
+ 103 | 3 | ccc
+ 104 | 4 | dad
+ 201 | 1 | abc
+ 202 | 2 | bcd
+ 203 | 3 | cde
+ 204 | 4 | def
+ 301 | 1 | xxx
+ 302 | 2 | yyy
+ 303 | 3 | zzz
+(11 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: f_leak(b)
+ -> Seq Scan on t2 t1_1
+ Filter: f_leak(b)
+ -> Seq Scan on t3 t1_2
+ Filter: f_leak(b)
+(7 rows)
+
+--
+-- Partitioned Tables
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE part_document (
+ did int,
+ cid int,
+ dlevel int not null,
+ dauthor name,
+ dtitle text
+) PARTITION BY RANGE (cid);
+GRANT ALL ON part_document TO public;
+-- Create partitions for document categories
+CREATE TABLE part_document_fiction PARTITION OF part_document FOR VALUES FROM (11) to (12);
+CREATE TABLE part_document_satire PARTITION OF part_document FOR VALUES FROM (55) to (56);
+CREATE TABLE part_document_nonfiction PARTITION OF part_document FOR VALUES FROM (99) to (100);
+GRANT ALL ON part_document_fiction TO public;
+GRANT ALL ON part_document_satire TO public;
+GRANT ALL ON part_document_nonfiction TO public;
+INSERT INTO part_document VALUES
+ ( 1, 11, 1, 'regress_rls_bob', 'my first novel'),
+ ( 2, 11, 2, 'regress_rls_bob', 'my second novel'),
+ ( 3, 99, 2, 'regress_rls_bob', 'my science textbook'),
+ ( 4, 55, 1, 'regress_rls_bob', 'my first satire'),
+ ( 5, 99, 2, 'regress_rls_bob', 'my history book'),
+ ( 6, 11, 1, 'regress_rls_carol', 'great science fiction'),
+ ( 7, 99, 2, 'regress_rls_carol', 'great technology book'),
+ ( 8, 55, 2, 'regress_rls_carol', 'great satire'),
+ ( 9, 11, 1, 'regress_rls_dave', 'awesome science fiction'),
+ (10, 99, 2, 'regress_rls_dave', 'awesome technology book');
+ALTER TABLE part_document ENABLE ROW LEVEL SECURITY;
+-- Create policy on parent
+-- user's security level must be higher than or equal to document's
+CREATE POLICY pp1 ON part_document AS PERMISSIVE
+ USING (dlevel <= (SELECT seclv FROM uaccount WHERE pguser = current_user));
+-- Dave is only allowed to see cid < 55
+CREATE POLICY pp1r ON part_document AS RESTRICTIVE TO regress_rls_dave
+ USING (cid < 55);
+\d+ part_document
+ Partitioned table "regress_rls_schema.part_document"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+---------+---------+-----------+----------+---------+----------+--------------+-------------
+ did | integer | | | | plain | |
+ cid | integer | | | | plain | |
+ dlevel | integer | | not null | | plain | |
+ dauthor | name | | | | plain | |
+ dtitle | text | | | | extended | |
+Partition key: RANGE (cid)
+Policies:
+ POLICY "pp1"
+ USING ((dlevel <= ( SELECT uaccount.seclv
+ FROM uaccount
+ WHERE (uaccount.pguser = CURRENT_USER))))
+ POLICY "pp1r" AS RESTRICTIVE
+ TO regress_rls_dave
+ USING ((cid < 55))
+Partitions: part_document_fiction FOR VALUES FROM (11) TO (12),
+ part_document_nonfiction FOR VALUES FROM (99) TO (100),
+ part_document_satire FOR VALUES FROM (55) TO (56)
+
+SELECT * FROM pg_policies WHERE schemaname = 'regress_rls_schema' AND tablename like '%part_document%' ORDER BY policyname;
+ schemaname | tablename | policyname | permissive | roles | cmd | qual | with_check
+--------------------+---------------+------------+-------------+--------------------+-----+--------------------------------------------+------------
+ regress_rls_schema | part_document | pp1 | PERMISSIVE | {public} | ALL | (dlevel <= ( SELECT uaccount.seclv +|
+ | | | | | | FROM uaccount +|
+ | | | | | | WHERE (uaccount.pguser = CURRENT_USER))) |
+ regress_rls_schema | part_document | pp1r | RESTRICTIVE | {regress_rls_dave} | ALL | (cid < 55) |
+(2 rows)
+
+-- viewpoint from regress_rls_bob
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => my first satire
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on part_document_fiction part_document
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_satire part_document_1
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_nonfiction part_document_2
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+(10 rows)
+
+-- viewpoint from regress_rls_carol
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => my first satire
+NOTICE: f_leak => great satire
+NOTICE: f_leak => my science textbook
+NOTICE: f_leak => my history book
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => awesome technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+(10 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on part_document_fiction part_document
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_satire part_document_1
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_nonfiction part_document_2
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+(10 rows)
+
+-- viewpoint from regress_rls_dave
+SET SESSION AUTHORIZATION regress_rls_dave;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+--------------------------------------------------------------
+ Seq Scan on part_document_fiction part_document
+ Filter: ((cid < 55) AND (dlevel <= $0) AND f_leak(dtitle))
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+(5 rows)
+
+-- pp1 ERROR
+INSERT INTO part_document VALUES (100, 11, 5, 'regress_rls_dave', 'testing pp1'); -- fail
+ERROR: new row violates row-level security policy for table "part_document"
+-- pp1r ERROR
+INSERT INTO part_document VALUES (100, 99, 1, 'regress_rls_dave', 'testing pp1r'); -- fail
+ERROR: new row violates row-level security policy "pp1r" for table "part_document"
+-- Show that RLS policy does not apply for direct inserts to children
+-- This should fail with RLS POLICY pp1r violation.
+INSERT INTO part_document VALUES (100, 55, 1, 'regress_rls_dave', 'testing RLS with partitions'); -- fail
+ERROR: new row violates row-level security policy "pp1r" for table "part_document"
+-- But this should succeed.
+INSERT INTO part_document_satire VALUES (100, 55, 1, 'regress_rls_dave', 'testing RLS with partitions'); -- success
+-- We still cannot see the row using the parent
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+-- But we can if we look directly
+SELECT * FROM part_document_satire WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first satire
+NOTICE: f_leak => great satire
+NOTICE: f_leak => testing RLS with partitions
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(3 rows)
+
+-- Turn on RLS and create policy on child to show RLS is checked before constraints
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER TABLE part_document_satire ENABLE ROW LEVEL SECURITY;
+CREATE POLICY pp3 ON part_document_satire AS RESTRICTIVE
+ USING (cid < 55);
+-- This should fail with RLS violation now.
+SET SESSION AUTHORIZATION regress_rls_dave;
+INSERT INTO part_document_satire VALUES (101, 55, 1, 'regress_rls_dave', 'testing RLS with partitions'); -- fail
+ERROR: new row violates row-level security policy for table "part_document_satire"
+-- And now we cannot see directly into the partition either, due to RLS
+SELECT * FROM part_document_satire WHERE f_leak(dtitle) ORDER BY did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+---------+--------
+(0 rows)
+
+-- The parent looks the same as before
+-- viewpoint from regress_rls_dave
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+--------------------------------------------------------------
+ Seq Scan on part_document_fiction part_document
+ Filter: ((cid < 55) AND (dlevel <= $0) AND f_leak(dtitle))
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+(5 rows)
+
+-- viewpoint from regress_rls_carol
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => awesome science fiction
+NOTICE: f_leak => my first satire
+NOTICE: f_leak => great satire
+NOTICE: f_leak => testing RLS with partitions
+NOTICE: f_leak => my science textbook
+NOTICE: f_leak => my history book
+NOTICE: f_leak => great technology book
+NOTICE: f_leak => awesome technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(11 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ InitPlan 1 (returns $0)
+ -> Index Scan using uaccount_pkey on uaccount
+ Index Cond: (pguser = CURRENT_USER)
+ -> Seq Scan on part_document_fiction part_document
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_satire part_document_1
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+ -> Seq Scan on part_document_nonfiction part_document_2
+ Filter: ((dlevel <= $0) AND f_leak(dtitle))
+(10 rows)
+
+-- only owner can change policies
+ALTER POLICY pp1 ON part_document USING (true); --fail
+ERROR: must be owner of table part_document
+DROP POLICY pp1 ON part_document; --fail
+ERROR: must be owner of relation part_document
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY pp1 ON part_document USING (dauthor = current_user);
+-- viewpoint from regress_rls_bob again
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => my first novel
+NOTICE: f_leak => my second novel
+NOTICE: f_leak => my first satire
+NOTICE: f_leak => my science textbook
+NOTICE: f_leak => my history book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+---------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+(5 rows)
+
+-- viewpoint from regress_rls_carol again
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM part_document WHERE f_leak(dtitle) ORDER BY did;
+NOTICE: f_leak => great science fiction
+NOTICE: f_leak => great satire
+NOTICE: f_leak => great technology book
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+(3 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM part_document WHERE f_leak(dtitle);
+ QUERY PLAN
+---------------------------------------------------------------
+ Append
+ -> Seq Scan on part_document_fiction part_document
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+ -> Seq Scan on part_document_satire part_document_1
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+ -> Seq Scan on part_document_nonfiction part_document_2
+ Filter: ((dauthor = CURRENT_USER) AND f_leak(dtitle))
+(7 rows)
+
+-- database superuser does bypass RLS policy when enabled
+RESET SESSION AUTHORIZATION;
+SET row_security TO ON;
+SELECT * FROM part_document ORDER BY did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(11 rows)
+
+SELECT * FROM part_document_satire ORDER by did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(3 rows)
+
+-- database non-superuser with bypass privilege can bypass RLS policy when disabled
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+SELECT * FROM part_document ORDER BY did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(11 rows)
+
+SELECT * FROM part_document_satire ORDER by did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(3 rows)
+
+-- RLS policy does not apply to the table owner, even when row_security is enabled.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+SELECT * FROM part_document ORDER by did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 1 | 11 | 1 | regress_rls_bob | my first novel
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+ 3 | 99 | 2 | regress_rls_bob | my science textbook
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 5 | 99 | 2 | regress_rls_bob | my history book
+ 6 | 11 | 1 | regress_rls_carol | great science fiction
+ 7 | 99 | 2 | regress_rls_carol | great technology book
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 9 | 11 | 1 | regress_rls_dave | awesome science fiction
+ 10 | 99 | 2 | regress_rls_dave | awesome technology book
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(11 rows)
+
+SELECT * FROM part_document_satire ORDER by did;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-------------------+-----------------------------
+ 4 | 55 | 1 | regress_rls_bob | my first satire
+ 8 | 55 | 2 | regress_rls_carol | great satire
+ 100 | 55 | 1 | regress_rls_dave | testing RLS with partitions
+(3 rows)
+
+-- When row_security is disabled, other users get an ERROR.
+SET SESSION AUTHORIZATION regress_rls_dave;
+SET row_security TO OFF;
+SELECT * FROM part_document ORDER by did;
+ERROR: query would be affected by row-level security policy for table "part_document"
+SELECT * FROM part_document_satire ORDER by did;
+ERROR: query would be affected by row-level security policy for table "part_document_satire"
+-- Check behavior with a policy that uses a SubPlan not an InitPlan.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+CREATE POLICY pp3 ON part_document AS RESTRICTIVE
+ USING ((SELECT dlevel <= seclv FROM uaccount WHERE pguser = current_user));
+SET SESSION AUTHORIZATION regress_rls_carol;
+INSERT INTO part_document VALUES (100, 11, 5, 'regress_rls_carol', 'testing pp3'); -- fail
+ERROR: new row violates row-level security policy "pp3" for table "part_document"
+----- Dependencies -----
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security TO ON;
+CREATE TABLE dependee (x integer, y integer);
+CREATE TABLE dependent (x integer, y integer);
+CREATE POLICY d1 ON dependent FOR ALL
+ TO PUBLIC
+ USING (x = (SELECT d.x FROM dependee d WHERE d.y = y));
+DROP TABLE dependee; -- Should fail without CASCADE due to dependency on row security qual?
+ERROR: cannot drop table dependee because other objects depend on it
+DETAIL: policy d1 on table dependent depends on table dependee
+HINT: Use DROP ... CASCADE to drop the dependent objects too.
+DROP TABLE dependee CASCADE;
+NOTICE: drop cascades to policy d1 on table dependent
+EXPLAIN (COSTS OFF) SELECT * FROM dependent; -- After drop, should be unqualified
+ QUERY PLAN
+-----------------------
+ Seq Scan on dependent
+(1 row)
+
+----- RECURSION ----
+--
+-- Simple recursion
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE rec1 (x integer, y integer);
+CREATE POLICY r1 ON rec1 USING (x = (SELECT r.x FROM rec1 r WHERE y = r.y));
+ALTER TABLE rec1 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rec1; -- fail, direct recursion
+ERROR: infinite recursion detected in policy for relation "rec1"
+--
+-- Mutual recursion
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE rec2 (a integer, b integer);
+ALTER POLICY r1 ON rec1 USING (x = (SELECT a FROM rec2 WHERE b = y));
+CREATE POLICY r2 ON rec2 USING (a = (SELECT x FROM rec1 WHERE y = b));
+ALTER TABLE rec2 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rec1; -- fail, mutual recursion
+ERROR: infinite recursion detected in policy for relation "rec1"
+--
+-- Mutual recursion via views
+--
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW rec1v AS SELECT * FROM rec1;
+CREATE VIEW rec2v AS SELECT * FROM rec2;
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY r1 ON rec1 USING (x = (SELECT a FROM rec2v WHERE b = y));
+ALTER POLICY r2 ON rec2 USING (a = (SELECT x FROM rec1v WHERE y = b));
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rec1; -- fail, mutual recursion via views
+ERROR: infinite recursion detected in policy for relation "rec1"
+--
+-- Mutual recursion via .s.b views
+--
+SET SESSION AUTHORIZATION regress_rls_bob;
+DROP VIEW rec1v, rec2v CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to policy r1 on table rec1
+drop cascades to policy r2 on table rec2
+CREATE VIEW rec1v WITH (security_barrier) AS SELECT * FROM rec1;
+CREATE VIEW rec2v WITH (security_barrier) AS SELECT * FROM rec2;
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE POLICY r1 ON rec1 USING (x = (SELECT a FROM rec2v WHERE b = y));
+CREATE POLICY r2 ON rec2 USING (a = (SELECT x FROM rec1v WHERE y = b));
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rec1; -- fail, mutual recursion via s.b. views
+ERROR: infinite recursion detected in policy for relation "rec1"
+--
+-- recursive RLS and VIEWs in policy
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE s1 (a int, b text);
+INSERT INTO s1 (SELECT x, md5(x::text) FROM generate_series(-10,10) x);
+CREATE TABLE s2 (x int, y text);
+INSERT INTO s2 (SELECT x, md5(x::text) FROM generate_series(-6,6) x);
+GRANT SELECT ON s1, s2 TO regress_rls_bob;
+CREATE POLICY p1 ON s1 USING (a in (select x from s2 where y like '%2f%'));
+CREATE POLICY p2 ON s2 USING (x in (select a from s1 where b like '%22%'));
+CREATE POLICY p3 ON s1 FOR INSERT WITH CHECK (a = (SELECT a FROM s1));
+ALTER TABLE s1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE s2 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW v2 AS SELECT * FROM s2 WHERE y like '%af%';
+SELECT * FROM s1 WHERE f_leak(b); -- fail (infinite recursion)
+ERROR: infinite recursion detected in policy for relation "s1"
+INSERT INTO s1 VALUES (1, 'foo'); -- fail (infinite recursion)
+ERROR: infinite recursion detected in policy for relation "s1"
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP POLICY p3 on s1;
+ALTER POLICY p2 ON s2 USING (x % 2 = 0);
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM s1 WHERE f_leak(b); -- OK
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+ a | b
+---+----------------------------------
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM only s1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------
+ Seq Scan on s1
+ Filter: ((hashed SubPlan 1) AND f_leak(b))
+ SubPlan 1
+ -> Seq Scan on s2
+ Filter: (((x % 2) = 0) AND (y ~~ '%2f%'::text))
+(5 rows)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY p1 ON s1 USING (a in (select x from v2)); -- using VIEW in RLS policy
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM s1 WHERE f_leak(b); -- OK
+NOTICE: f_leak => 0267aaf632e87a63288a08331f22c7c3
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+ a | b
+----+----------------------------------
+ -4 | 0267aaf632e87a63288a08331f22c7c3
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM s1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------
+ Seq Scan on s1
+ Filter: ((hashed SubPlan 1) AND f_leak(b))
+ SubPlan 1
+ -> Seq Scan on s2
+ Filter: (((x % 2) = 0) AND (y ~~ '%af%'::text))
+(5 rows)
+
+SELECT (SELECT x FROM s1 LIMIT 1) xx, * FROM s2 WHERE y like '%28%';
+ xx | x | y
+----+----+----------------------------------
+ -6 | -6 | 596a3d04481816330f07e4f97510c28f
+ -4 | -4 | 0267aaf632e87a63288a08331f22c7c3
+ 2 | 2 | c81e728d9d4c2f636f067f89cc14862c
+(3 rows)
+
+EXPLAIN (COSTS OFF) SELECT (SELECT x FROM s1 LIMIT 1) xx, * FROM s2 WHERE y like '%28%';
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Seq Scan on s2
+ Filter: (((x % 2) = 0) AND (y ~~ '%28%'::text))
+ SubPlan 2
+ -> Limit
+ -> Seq Scan on s1
+ Filter: (hashed SubPlan 1)
+ SubPlan 1
+ -> Seq Scan on s2 s2_1
+ Filter: (((x % 2) = 0) AND (y ~~ '%af%'::text))
+(9 rows)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+ALTER POLICY p2 ON s2 USING (x in (select a from s1 where b like '%d2%'));
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM s1 WHERE f_leak(b); -- fail (infinite recursion via view)
+ERROR: infinite recursion detected in policy for relation "s1"
+-- prepared statement with regress_rls_alice privilege
+PREPARE p1(int) AS SELECT * FROM t1 WHERE a <= $1;
+EXECUTE p1(2);
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 202 | 2 | bcd
+ 302 | 2 | yyy
+(3 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE p1(2);
+ QUERY PLAN
+----------------------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a <= 2) AND ((a % 2) = 0))
+ -> Seq Scan on t2 t1_1
+ Filter: ((a <= 2) AND ((a % 2) = 0))
+ -> Seq Scan on t3 t1_2
+ Filter: ((a <= 2) AND ((a % 2) = 0))
+(7 rows)
+
+-- superuser is allowed to bypass RLS checks
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+SELECT * FROM t1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => cde
+NOTICE: f_leak => def
+NOTICE: f_leak => xxx
+NOTICE: f_leak => yyy
+NOTICE: f_leak => zzz
+ id | a | b
+-----+---+-----
+ 101 | 1 | aba
+ 102 | 2 | bbb
+ 103 | 3 | ccc
+ 104 | 4 | dad
+ 201 | 1 | abc
+ 202 | 2 | bcd
+ 203 | 3 | cde
+ 204 | 4 | def
+ 301 | 1 | xxx
+ 302 | 2 | yyy
+ 303 | 3 | zzz
+(11 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: f_leak(b)
+ -> Seq Scan on t2 t1_1
+ Filter: f_leak(b)
+ -> Seq Scan on t3 t1_2
+ Filter: f_leak(b)
+(7 rows)
+
+-- plan cache should be invalidated
+EXECUTE p1(2);
+ id | a | b
+-----+---+-----
+ 101 | 1 | aba
+ 102 | 2 | bbb
+ 201 | 1 | abc
+ 202 | 2 | bcd
+ 301 | 1 | xxx
+ 302 | 2 | yyy
+(6 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE p1(2);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: (a <= 2)
+ -> Seq Scan on t2 t1_1
+ Filter: (a <= 2)
+ -> Seq Scan on t3 t1_2
+ Filter: (a <= 2)
+(7 rows)
+
+PREPARE p2(int) AS SELECT * FROM t1 WHERE a = $1;
+EXECUTE p2(2);
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 202 | 2 | bcd
+ 302 | 2 | yyy
+(3 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE p2(2);
+ QUERY PLAN
+---------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: (a = 2)
+ -> Seq Scan on t2 t1_1
+ Filter: (a = 2)
+ -> Seq Scan on t3 t1_2
+ Filter: (a = 2)
+(7 rows)
+
+-- also check the case where privileges switch away from superuser
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+EXECUTE p2(2);
+ id | a | b
+-----+---+-----
+ 102 | 2 | bbb
+ 202 | 2 | bcd
+ 302 | 2 | yyy
+(3 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE p2(2);
+ QUERY PLAN
+---------------------------------------------
+ Append
+ -> Seq Scan on t1
+ Filter: ((a = 2) AND ((a % 2) = 0))
+ -> Seq Scan on t2 t1_1
+ Filter: ((a = 2) AND ((a % 2) = 0))
+ -> Seq Scan on t3 t1_2
+ Filter: ((a = 2) AND ((a % 2) = 0))
+(7 rows)
+
+--
+-- UPDATE / DELETE and Row-level security
+--
+SET SESSION AUTHORIZATION regress_rls_bob;
+EXPLAIN (COSTS OFF) UPDATE t1 SET b = b || b WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Update on t1
+ Update on t1
+ Update on t2 t1_1
+ Update on t3 t1_2
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(10 rows)
+
+UPDATE t1 SET b = b || b WHERE f_leak(b);
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => yyy
+EXPLAIN (COSTS OFF) UPDATE only t1 SET b = b || '_updt' WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Update on t1
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(3 rows)
+
+UPDATE only t1 SET b = b || '_updt' WHERE f_leak(b);
+NOTICE: f_leak => bbbbbb
+NOTICE: f_leak => daddad
+-- returning clause with system column
+UPDATE only t1 SET b = b WHERE f_leak(b) RETURNING tableoid::regclass, *, t1;
+NOTICE: f_leak => bbbbbb_updt
+NOTICE: f_leak => daddad_updt
+ tableoid | id | a | b | t1
+----------+-----+---+-------------+---------------------
+ t1 | 102 | 2 | bbbbbb_updt | (102,2,bbbbbb_updt)
+ t1 | 104 | 4 | daddad_updt | (104,4,daddad_updt)
+(2 rows)
+
+UPDATE t1 SET b = b WHERE f_leak(b) RETURNING *;
+NOTICE: f_leak => bbbbbb_updt
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => bcdbcd
+NOTICE: f_leak => defdef
+NOTICE: f_leak => yyyyyy
+ id | a | b
+-----+---+-------------
+ 102 | 2 | bbbbbb_updt
+ 104 | 4 | daddad_updt
+ 202 | 2 | bcdbcd
+ 204 | 4 | defdef
+ 302 | 2 | yyyyyy
+(5 rows)
+
+UPDATE t1 SET b = b WHERE f_leak(b) RETURNING tableoid::regclass, *, t1;
+NOTICE: f_leak => bbbbbb_updt
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => bcdbcd
+NOTICE: f_leak => defdef
+NOTICE: f_leak => yyyyyy
+ tableoid | id | a | b | t1
+----------+-----+---+-------------+---------------------
+ t1 | 102 | 2 | bbbbbb_updt | (102,2,bbbbbb_updt)
+ t1 | 104 | 4 | daddad_updt | (104,4,daddad_updt)
+ t2 | 202 | 2 | bcdbcd | (202,2,bcdbcd)
+ t2 | 204 | 4 | defdef | (204,4,defdef)
+ t3 | 302 | 2 | yyyyyy | (302,2,yyyyyy)
+(5 rows)
+
+-- updates with from clause
+EXPLAIN (COSTS OFF) UPDATE t2 SET b=t2.b FROM t3
+WHERE t2.a = 3 and t3.a = 2 AND f_leak(t2.b) AND f_leak(t3.b);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Update on t2
+ -> Nested Loop
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Seq Scan on t3
+ Filter: ((a = 2) AND f_leak(b))
+(6 rows)
+
+UPDATE t2 SET b=t2.b FROM t3
+WHERE t2.a = 3 and t3.a = 2 AND f_leak(t2.b) AND f_leak(t3.b);
+NOTICE: f_leak => cde
+NOTICE: f_leak => yyyyyy
+EXPLAIN (COSTS OFF) UPDATE t1 SET b=t1.b FROM t2
+WHERE t1.a = 3 and t2.a = 3 AND f_leak(t1.b) AND f_leak(t2.b);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Update on t1
+ Update on t1
+ Update on t2 t1_1
+ Update on t3 t1_2
+ -> Nested Loop
+ -> Seq Scan on t1
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Nested Loop
+ -> Seq Scan on t2 t1_1
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Nested Loop
+ -> Seq Scan on t3 t1_2
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+(19 rows)
+
+UPDATE t1 SET b=t1.b FROM t2
+WHERE t1.a = 3 and t2.a = 3 AND f_leak(t1.b) AND f_leak(t2.b);
+EXPLAIN (COSTS OFF) UPDATE t2 SET b=t2.b FROM t1
+WHERE t1.a = 3 and t2.a = 3 AND f_leak(t1.b) AND f_leak(t2.b);
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Update on t2
+ -> Nested Loop
+ -> Seq Scan on t2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Append
+ -> Seq Scan on t1
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_1
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2
+ Filter: ((a = 3) AND ((a % 2) = 0) AND f_leak(b))
+(11 rows)
+
+UPDATE t2 SET b=t2.b FROM t1
+WHERE t1.a = 3 and t2.a = 3 AND f_leak(t1.b) AND f_leak(t2.b);
+NOTICE: f_leak => cde
+-- updates with from clause self join
+EXPLAIN (COSTS OFF) UPDATE t2 t2_1 SET b = t2_2.b FROM t2 t2_2
+WHERE t2_1.a = 3 AND t2_2.a = t2_1.a AND t2_2.b = t2_1.b
+AND f_leak(t2_1.b) AND f_leak(t2_2.b) RETURNING *, t2_1, t2_2;
+ QUERY PLAN
+-----------------------------------------------------------------
+ Update on t2 t2_1
+ -> Nested Loop
+ Join Filter: (t2_1.b = t2_2.b)
+ -> Seq Scan on t2 t2_1
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+ -> Seq Scan on t2 t2_2
+ Filter: ((a = 3) AND ((a % 2) = 1) AND f_leak(b))
+(7 rows)
+
+UPDATE t2 t2_1 SET b = t2_2.b FROM t2 t2_2
+WHERE t2_1.a = 3 AND t2_2.a = t2_1.a AND t2_2.b = t2_1.b
+AND f_leak(t2_1.b) AND f_leak(t2_2.b) RETURNING *, t2_1, t2_2;
+NOTICE: f_leak => cde
+NOTICE: f_leak => cde
+ id | a | b | c | id | a | b | c | t2_1 | t2_2
+-----+---+-----+-----+-----+---+-----+-----+-----------------+-----------------
+ 203 | 3 | cde | 3.3 | 203 | 3 | cde | 3.3 | (203,3,cde,3.3) | (203,3,cde,3.3)
+(1 row)
+
+EXPLAIN (COSTS OFF) UPDATE t1 t1_1 SET b = t1_2.b FROM t1 t1_2
+WHERE t1_1.a = 4 AND t1_2.a = t1_1.a AND t1_2.b = t1_1.b
+AND f_leak(t1_1.b) AND f_leak(t1_2.b) RETURNING *, t1_1, t1_2;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Update on t1 t1_1
+ Update on t1 t1_1
+ Update on t2 t1_1_1
+ Update on t3 t1_1_2
+ -> Nested Loop
+ Join Filter: (t1_1.b = t1_2.b)
+ -> Seq Scan on t1 t1_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Append
+ -> Seq Scan on t1 t1_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_2_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Nested Loop
+ Join Filter: (t1_1_1.b = t1_2.b)
+ -> Seq Scan on t2 t1_1_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Append
+ -> Seq Scan on t1 t1_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_2_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Nested Loop
+ Join Filter: (t1_1_2.b = t1_2.b)
+ -> Seq Scan on t3 t1_1_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Append
+ -> Seq Scan on t1 t1_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_2_1
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2_2
+ Filter: ((a = 4) AND ((a % 2) = 0) AND f_leak(b))
+(37 rows)
+
+UPDATE t1 t1_1 SET b = t1_2.b FROM t1 t1_2
+WHERE t1_1.a = 4 AND t1_2.a = t1_1.a AND t1_2.b = t1_1.b
+AND f_leak(t1_1.b) AND f_leak(t1_2.b) RETURNING *, t1_1, t1_2;
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => defdef
+NOTICE: f_leak => defdef
+NOTICE: f_leak => daddad_updt
+NOTICE: f_leak => defdef
+ id | a | b | id | a | b | t1_1 | t1_2
+-----+---+-------------+-----+---+-------------+---------------------+---------------------
+ 104 | 4 | daddad_updt | 104 | 4 | daddad_updt | (104,4,daddad_updt) | (104,4,daddad_updt)
+ 204 | 4 | defdef | 204 | 4 | defdef | (204,4,defdef) | (204,4,defdef)
+(2 rows)
+
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+SELECT * FROM t1 ORDER BY a,b;
+ id | a | b
+-----+---+-------------
+ 101 | 1 | aba
+ 201 | 1 | abc
+ 301 | 1 | xxx
+ 102 | 2 | bbbbbb_updt
+ 202 | 2 | bcdbcd
+ 302 | 2 | yyyyyy
+ 103 | 3 | ccc
+ 203 | 3 | cde
+ 303 | 3 | zzz
+ 104 | 4 | daddad_updt
+ 204 | 4 | defdef
+(11 rows)
+
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+EXPLAIN (COSTS OFF) DELETE FROM only t1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Delete on t1
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(3 rows)
+
+EXPLAIN (COSTS OFF) DELETE FROM t1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------
+ Delete on t1
+ Delete on t1
+ Delete on t2 t1_1
+ Delete on t3 t1_2
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t2 t1_1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> Seq Scan on t3 t1_2
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(10 rows)
+
+DELETE FROM only t1 WHERE f_leak(b) RETURNING tableoid::regclass, *, t1;
+NOTICE: f_leak => bbbbbb_updt
+NOTICE: f_leak => daddad_updt
+ tableoid | id | a | b | t1
+----------+-----+---+-------------+---------------------
+ t1 | 102 | 2 | bbbbbb_updt | (102,2,bbbbbb_updt)
+ t1 | 104 | 4 | daddad_updt | (104,4,daddad_updt)
+(2 rows)
+
+DELETE FROM t1 WHERE f_leak(b) RETURNING tableoid::regclass, *, t1;
+NOTICE: f_leak => bcdbcd
+NOTICE: f_leak => defdef
+NOTICE: f_leak => yyyyyy
+ tableoid | id | a | b | t1
+----------+-----+---+--------+----------------
+ t2 | 202 | 2 | bcdbcd | (202,2,bcdbcd)
+ t2 | 204 | 4 | defdef | (204,4,defdef)
+ t3 | 302 | 2 | yyyyyy | (302,2,yyyyyy)
+(3 rows)
+
+--
+-- S.b. view on top of Row-level security
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE b1 (a int, b text);
+INSERT INTO b1 (SELECT x, md5(x::text) FROM generate_series(-10,10) x);
+CREATE POLICY p1 ON b1 USING (a % 2 = 0);
+ALTER TABLE b1 ENABLE ROW LEVEL SECURITY;
+GRANT ALL ON b1 TO regress_rls_bob;
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW bv1 WITH (security_barrier) AS SELECT * FROM b1 WHERE a > 0 WITH CHECK OPTION;
+GRANT ALL ON bv1 TO regress_rls_carol;
+SET SESSION AUTHORIZATION regress_rls_carol;
+EXPLAIN (COSTS OFF) SELECT * FROM bv1 WHERE f_leak(b);
+ QUERY PLAN
+---------------------------------------------
+ Subquery Scan on bv1
+ Filter: f_leak(bv1.b)
+ -> Seq Scan on b1
+ Filter: ((a > 0) AND ((a % 2) = 0))
+(4 rows)
+
+SELECT * FROM bv1 WHERE f_leak(b);
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+NOTICE: f_leak => c9f0f895fb98ab9159f51fd0297e236d
+NOTICE: f_leak => d3d9446802a44259755d38e6d163e820
+ a | b
+----+----------------------------------
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+(5 rows)
+
+INSERT INTO bv1 VALUES (-1, 'xxx'); -- should fail view WCO
+ERROR: new row violates row-level security policy for table "b1"
+INSERT INTO bv1 VALUES (11, 'xxx'); -- should fail RLS check
+ERROR: new row violates row-level security policy for table "b1"
+INSERT INTO bv1 VALUES (12, 'xxx'); -- ok
+EXPLAIN (COSTS OFF) UPDATE bv1 SET b = 'yyy' WHERE a = 4 AND f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Update on b1
+ -> Seq Scan on b1
+ Filter: ((a > 0) AND (a = 4) AND ((a % 2) = 0) AND f_leak(b))
+(3 rows)
+
+UPDATE bv1 SET b = 'yyy' WHERE a = 4 AND f_leak(b);
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+EXPLAIN (COSTS OFF) DELETE FROM bv1 WHERE a = 6 AND f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Delete on b1
+ -> Seq Scan on b1
+ Filter: ((a > 0) AND (a = 6) AND ((a % 2) = 0) AND f_leak(b))
+(3 rows)
+
+DELETE FROM bv1 WHERE a = 6 AND f_leak(b);
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM b1;
+ a | b
+-----+----------------------------------
+ -10 | 1b0fd9efa5279c4203b7c70233f86dbf
+ -9 | 252e691406782824eec43d7eadc3d256
+ -8 | a8d2ec85eaf98407310b72eb73dda247
+ -7 | 74687a12d3915d3c4d83f1af7b3683d5
+ -6 | 596a3d04481816330f07e4f97510c28f
+ -5 | 47c1b025fa18ea96c33fbb6718688c0f
+ -4 | 0267aaf632e87a63288a08331f22c7c3
+ -3 | b3149ecea4628efd23d2f86e5a723472
+ -2 | 5d7b9adcbe1c629ec722529dd12e5129
+ -1 | 6bb61e3b7bce0931da574d19d1d82c88
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 1 | c4ca4238a0b923820dcc509a6f75849b
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 5 | e4da3b7fbbce2345d7772b0674a318d5
+ 7 | 8f14e45fceea167a5a36dedd4bea2543
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | xxx
+ 4 | yyy
+(21 rows)
+
+--
+-- INSERT ... ON CONFLICT DO UPDATE and Row-level security
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP POLICY p1 ON document;
+DROP POLICY p1r ON document;
+CREATE POLICY p1 ON document FOR SELECT USING (true);
+CREATE POLICY p2 ON document FOR INSERT WITH CHECK (dauthor = current_user);
+CREATE POLICY p3 ON document FOR UPDATE
+ USING (cid = (SELECT cid from category WHERE cname = 'novel'))
+ WITH CHECK (dauthor = current_user);
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Exists...
+SELECT * FROM document WHERE did = 2;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+-----------------
+ 2 | 11 | 2 | regress_rls_bob | my second novel
+(1 row)
+
+-- ...so violates actual WITH CHECK OPTION within UPDATE (not INSERT, since
+-- alternative UPDATE path happens to be taken):
+INSERT INTO document VALUES (2, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_carol', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, dauthor = EXCLUDED.dauthor;
+ERROR: new row violates row-level security policy for table "document"
+-- Violates USING qual for UPDATE policy p3.
+--
+-- UPDATE path is taken, but UPDATE fails purely because *existing* row to be
+-- updated is not a "novel"/cid 11 (row is not leaked, even though we have
+-- SELECT privileges sufficient to see the row in this instance):
+INSERT INTO document VALUES (33, 22, 1, 'regress_rls_bob', 'okay science fiction'); -- preparation for next statement
+INSERT INTO document VALUES (33, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'Some novel, replaces sci-fi') -- takes UPDATE path
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle;
+ERROR: new row violates row-level security policy (USING expression) for table "document"
+-- Fine (we UPDATE, since INSERT WCOs and UPDATE security barrier quals + WCOs
+-- not violated):
+INSERT INTO document VALUES (2, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle RETURNING *;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+----------------
+ 2 | 11 | 2 | regress_rls_bob | my first novel
+(1 row)
+
+-- Fine (we INSERT, so "cid = 33" ("technology") isn't evaluated):
+INSERT INTO document VALUES (78, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'some technology novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, cid = 33 RETURNING *;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+-----------------------
+ 78 | 11 | 1 | regress_rls_bob | some technology novel
+(1 row)
+
+-- Fine (same query, but we UPDATE, so "cid = 33" ("technology") is not the
+-- case in respect of the *existing* tuple):
+INSERT INTO document VALUES (78, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'some technology novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, cid = 33 RETURNING *;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+-----------------------
+ 78 | 33 | 1 | regress_rls_bob | some technology novel
+(1 row)
+
+-- Same query a third time, but now fails due to existing tuple finally not
+-- passing quals:
+INSERT INTO document VALUES (78, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'some technology novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, cid = 33 RETURNING *;
+ERROR: new row violates row-level security policy (USING expression) for table "document"
+-- Don't fail just because INSERT doesn't satisfy WITH CHECK option that
+-- originated as a barrier/USING() qual from the UPDATE. Note that the UPDATE
+-- path *isn't* taken, and so UPDATE-related policy does not apply:
+INSERT INTO document VALUES (79, (SELECT cid from category WHERE cname = 'technology'), 1, 'regress_rls_bob', 'technology book, can only insert')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle RETURNING *;
+ did | cid | dlevel | dauthor | dtitle
+-----+-----+--------+-----------------+----------------------------------
+ 79 | 33 | 1 | regress_rls_bob | technology book, can only insert
+(1 row)
+
+-- But this time, the same statement fails, because the UPDATE path is taken,
+-- and updating the row just inserted falls afoul of security barrier qual
+-- (enforced as WCO) -- what we might have updated target tuple to is
+-- irrelevant, in fact.
+INSERT INTO document VALUES (79, (SELECT cid from category WHERE cname = 'technology'), 1, 'regress_rls_bob', 'technology book, can only insert')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle RETURNING *;
+ERROR: new row violates row-level security policy (USING expression) for table "document"
+-- Test default USING qual enforced as WCO
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP POLICY p1 ON document;
+DROP POLICY p2 ON document;
+DROP POLICY p3 ON document;
+CREATE POLICY p3_with_default ON document FOR UPDATE
+ USING (cid = (SELECT cid from category WHERE cname = 'novel'));
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- WCO-style enforcement of USING quals against the existing/target tuple does
+-- not excuse the implementation from also enforcing the qual against the
+-- final tuple appended to the relation (since in the absence of an explicit
+-- WCO, the USING qual is also interpreted as an UPDATE/ALL WCO in general).
+--
+-- UPDATE path is taken here (fails due to existing tuple). Note that this is
+-- not reported as a "USING expression", because it's an RLS UPDATE check that originated as
+-- a USING qual for the purposes of RLS in general, as opposed to an explicit
+-- USING qual that is ordinarily a security barrier. We leave it up to the
+-- UPDATE to make this fail:
+INSERT INTO document VALUES (79, (SELECT cid from category WHERE cname = 'technology'), 1, 'regress_rls_bob', 'technology book, can only insert')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle RETURNING *;
+ERROR: new row violates row-level security policy for table "document"
+-- UPDATE path is taken here. Existing tuple passes, since its cid
+-- corresponds to "novel", but default USING qual is enforced against
+-- post-UPDATE tuple too (as always when updating with a policy that lacks an
+-- explicit WCO), and so this fails:
+INSERT INTO document VALUES (2, (SELECT cid from category WHERE cname = 'technology'), 1, 'regress_rls_bob', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET cid = EXCLUDED.cid, dtitle = EXCLUDED.dtitle RETURNING *;
+ERROR: new row violates row-level security policy for table "document"
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP POLICY p3_with_default ON document;
+--
+-- Test ALL policies with ON CONFLICT DO UPDATE (much the same as existing UPDATE
+-- tests)
+--
+CREATE POLICY p3_with_all ON document FOR ALL
+ USING (cid = (SELECT cid from category WHERE cname = 'novel'))
+ WITH CHECK (dauthor = current_user);
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Fails, since ALL WCO is enforced in insert path:
+INSERT INTO document VALUES (80, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_carol', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle, cid = 33;
+ERROR: new row violates row-level security policy for table "document"
+-- Fails, since ALL policy USING qual is enforced (existing, target tuple is in
+-- violation, since it has the "manga" cid):
+INSERT INTO document VALUES (4, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dtitle = EXCLUDED.dtitle;
+ERROR: new row violates row-level security policy (USING expression) for table "document"
+-- Fails, since ALL WCO are enforced:
+INSERT INTO document VALUES (1, (SELECT cid from category WHERE cname = 'novel'), 1, 'regress_rls_bob', 'my first novel')
+ ON CONFLICT (did) DO UPDATE SET dauthor = 'regress_rls_carol';
+ERROR: new row violates row-level security policy for table "document"
+--
+-- ROLE/GROUP
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE z1 (a int, b text);
+CREATE TABLE z2 (a int, b text);
+GRANT SELECT ON z1,z2 TO regress_rls_group1, regress_rls_group2,
+ regress_rls_bob, regress_rls_carol;
+INSERT INTO z1 VALUES
+ (1, 'aba'),
+ (2, 'bbb'),
+ (3, 'ccc'),
+ (4, 'dad');
+CREATE POLICY p1 ON z1 TO regress_rls_group1 USING (a % 2 = 0);
+CREATE POLICY p2 ON z1 TO regress_rls_group2 USING (a % 2 = 1);
+ALTER TABLE z1 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM z1 WHERE f_leak(b);
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM z1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+PREPARE plancache_test AS SELECT * FROM z1 WHERE f_leak(b);
+EXPLAIN (COSTS OFF) EXECUTE plancache_test;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+PREPARE plancache_test2 AS WITH q AS MATERIALIZED (SELECT * FROM z1 WHERE f_leak(b)) SELECT * FROM q,z2;
+EXPLAIN (COSTS OFF) EXECUTE plancache_test2;
+ QUERY PLAN
+-------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z2
+(7 rows)
+
+PREPARE plancache_test3 AS WITH q AS MATERIALIZED (SELECT * FROM z2) SELECT * FROM q,z1 WHERE f_leak(z1.b);
+EXPLAIN (COSTS OFF) EXECUTE plancache_test3;
+ QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z2
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(7 rows)
+
+SET ROLE regress_rls_group1;
+SELECT * FROM z1 WHERE f_leak(b);
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM z1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test2;
+ QUERY PLAN
+-------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z2
+(7 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test3;
+ QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z2
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(7 rows)
+
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM z1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => ccc
+ a | b
+---+-----
+ 1 | aba
+ 3 | ccc
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM z1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test2;
+ QUERY PLAN
+-------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z2
+(7 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test3;
+ QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z2
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(7 rows)
+
+SET ROLE regress_rls_group2;
+SELECT * FROM z1 WHERE f_leak(b);
+NOTICE: f_leak => aba
+NOTICE: f_leak => ccc
+ a | b
+---+-----
+ 1 | aba
+ 3 | ccc
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM z1 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(2 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test2;
+ QUERY PLAN
+-------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z2
+(7 rows)
+
+EXPLAIN (COSTS OFF) EXECUTE plancache_test3;
+ QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+ CTE q
+ -> Seq Scan on z2
+ -> CTE Scan on q
+ -> Materialize
+ -> Seq Scan on z1
+ Filter: (((a % 2) = 1) AND f_leak(b))
+(7 rows)
+
+--
+-- Views should follow policy for view owner.
+--
+-- View and Table owner are the same.
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE VIEW rls_view AS SELECT * FROM z1 WHERE f_leak(b);
+GRANT SELECT ON rls_view TO regress_rls_bob;
+-- Query as role that is not owner of view or table. Should return all records.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rls_view;
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 1 | aba
+ 2 | bbb
+ 3 | ccc
+ 4 | dad
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+---------------------
+ Seq Scan on z1
+ Filter: f_leak(b)
+(2 rows)
+
+-- Query as view/table owner. Should return all records.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM rls_view;
+NOTICE: f_leak => aba
+NOTICE: f_leak => bbb
+NOTICE: f_leak => ccc
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 1 | aba
+ 2 | bbb
+ 3 | ccc
+ 4 | dad
+(4 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+---------------------
+ Seq Scan on z1
+ Filter: f_leak(b)
+(2 rows)
+
+DROP VIEW rls_view;
+-- View and Table owners are different.
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW rls_view AS SELECT * FROM z1 WHERE f_leak(b);
+GRANT SELECT ON rls_view TO regress_rls_alice;
+-- Query as role that is not owner of view but is owner of table.
+-- Should return records based on view owner policies.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM rls_view;
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+-- Query as role that is not owner of table but is owner of view.
+-- Should return records based on view owner policies.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM rls_view;
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+-- Query as role that is not the owner of the table or view without permissions.
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM rls_view; --fail - permission denied.
+ERROR: permission denied for view rls_view
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view; --fail - permission denied.
+ERROR: permission denied for view rls_view
+-- Query as role that is not the owner of the table or view with permissions.
+SET SESSION AUTHORIZATION regress_rls_bob;
+GRANT SELECT ON rls_view TO regress_rls_carol;
+SELECT * FROM rls_view;
+NOTICE: f_leak => bbb
+NOTICE: f_leak => dad
+ a | b
+---+-----
+ 2 | bbb
+ 4 | dad
+(2 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM rls_view;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on z1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_rls_bob;
+DROP VIEW rls_view;
+--
+-- Command specific
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE x1 (a int, b text, c text);
+GRANT ALL ON x1 TO PUBLIC;
+INSERT INTO x1 VALUES
+ (1, 'abc', 'regress_rls_bob'),
+ (2, 'bcd', 'regress_rls_bob'),
+ (3, 'cde', 'regress_rls_carol'),
+ (4, 'def', 'regress_rls_carol'),
+ (5, 'efg', 'regress_rls_bob'),
+ (6, 'fgh', 'regress_rls_bob'),
+ (7, 'fgh', 'regress_rls_carol'),
+ (8, 'fgh', 'regress_rls_carol');
+CREATE POLICY p0 ON x1 FOR ALL USING (c = current_user);
+CREATE POLICY p1 ON x1 FOR SELECT USING (a % 2 = 0);
+CREATE POLICY p2 ON x1 FOR INSERT WITH CHECK (a % 2 = 1);
+CREATE POLICY p3 ON x1 FOR UPDATE USING (a % 2 = 0);
+CREATE POLICY p4 ON x1 FOR DELETE USING (a < 8);
+ALTER TABLE x1 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM x1 WHERE f_leak(b) ORDER BY a ASC;
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => efg
+NOTICE: f_leak => fgh
+NOTICE: f_leak => fgh
+ a | b | c
+---+-----+-------------------
+ 1 | abc | regress_rls_bob
+ 2 | bcd | regress_rls_bob
+ 4 | def | regress_rls_carol
+ 5 | efg | regress_rls_bob
+ 6 | fgh | regress_rls_bob
+ 8 | fgh | regress_rls_carol
+(6 rows)
+
+UPDATE x1 SET b = b || '_updt' WHERE f_leak(b) RETURNING *;
+NOTICE: f_leak => abc
+NOTICE: f_leak => bcd
+NOTICE: f_leak => def
+NOTICE: f_leak => efg
+NOTICE: f_leak => fgh
+NOTICE: f_leak => fgh
+ a | b | c
+---+----------+-------------------
+ 1 | abc_updt | regress_rls_bob
+ 2 | bcd_updt | regress_rls_bob
+ 4 | def_updt | regress_rls_carol
+ 5 | efg_updt | regress_rls_bob
+ 6 | fgh_updt | regress_rls_bob
+ 8 | fgh_updt | regress_rls_carol
+(6 rows)
+
+SET SESSION AUTHORIZATION regress_rls_carol;
+SELECT * FROM x1 WHERE f_leak(b) ORDER BY a ASC;
+NOTICE: f_leak => cde
+NOTICE: f_leak => fgh
+NOTICE: f_leak => bcd_updt
+NOTICE: f_leak => def_updt
+NOTICE: f_leak => fgh_updt
+NOTICE: f_leak => fgh_updt
+ a | b | c
+---+----------+-------------------
+ 2 | bcd_updt | regress_rls_bob
+ 3 | cde | regress_rls_carol
+ 4 | def_updt | regress_rls_carol
+ 6 | fgh_updt | regress_rls_bob
+ 7 | fgh | regress_rls_carol
+ 8 | fgh_updt | regress_rls_carol
+(6 rows)
+
+UPDATE x1 SET b = b || '_updt' WHERE f_leak(b) RETURNING *;
+NOTICE: f_leak => cde
+NOTICE: f_leak => fgh
+NOTICE: f_leak => bcd_updt
+NOTICE: f_leak => def_updt
+NOTICE: f_leak => fgh_updt
+NOTICE: f_leak => fgh_updt
+ a | b | c
+---+---------------+-------------------
+ 3 | cde_updt | regress_rls_carol
+ 7 | fgh_updt | regress_rls_carol
+ 2 | bcd_updt_updt | regress_rls_bob
+ 4 | def_updt_updt | regress_rls_carol
+ 6 | fgh_updt_updt | regress_rls_bob
+ 8 | fgh_updt_updt | regress_rls_carol
+(6 rows)
+
+DELETE FROM x1 WHERE f_leak(b) RETURNING *;
+NOTICE: f_leak => cde_updt
+NOTICE: f_leak => fgh_updt
+NOTICE: f_leak => bcd_updt_updt
+NOTICE: f_leak => def_updt_updt
+NOTICE: f_leak => fgh_updt_updt
+NOTICE: f_leak => fgh_updt_updt
+ a | b | c
+---+---------------+-------------------
+ 3 | cde_updt | regress_rls_carol
+ 7 | fgh_updt | regress_rls_carol
+ 2 | bcd_updt_updt | regress_rls_bob
+ 4 | def_updt_updt | regress_rls_carol
+ 6 | fgh_updt_updt | regress_rls_bob
+ 8 | fgh_updt_updt | regress_rls_carol
+(6 rows)
+
+--
+-- Duplicate Policy Names
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE y1 (a int, b text);
+CREATE TABLE y2 (a int, b text);
+GRANT ALL ON y1, y2 TO regress_rls_bob;
+CREATE POLICY p1 ON y1 FOR ALL USING (a % 2 = 0);
+CREATE POLICY p2 ON y1 FOR SELECT USING (a > 2);
+CREATE POLICY p1 ON y1 FOR SELECT USING (a % 2 = 1); --fail
+ERROR: policy "p1" for table "y1" already exists
+CREATE POLICY p1 ON y2 FOR ALL USING (a % 2 = 0); --OK
+ALTER TABLE y1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE y2 ENABLE ROW LEVEL SECURITY;
+--
+-- Expression structure with SBV
+--
+-- Create view as table owner. RLS should NOT be applied.
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE VIEW rls_sbv WITH (security_barrier) AS
+ SELECT * FROM y1 WHERE f_leak(b);
+EXPLAIN (COSTS OFF) SELECT * FROM rls_sbv WHERE (a = 1);
+ QUERY PLAN
+-----------------------------------
+ Seq Scan on y1
+ Filter: (f_leak(b) AND (a = 1))
+(2 rows)
+
+DROP VIEW rls_sbv;
+-- Create view as role that does not own table. RLS should be applied.
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE VIEW rls_sbv WITH (security_barrier) AS
+ SELECT * FROM y1 WHERE f_leak(b);
+EXPLAIN (COSTS OFF) SELECT * FROM rls_sbv WHERE (a = 1);
+ QUERY PLAN
+------------------------------------------------------------------
+ Seq Scan on y1
+ Filter: ((a = 1) AND ((a > 2) OR ((a % 2) = 0)) AND f_leak(b))
+(2 rows)
+
+DROP VIEW rls_sbv;
+--
+-- Expression structure
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+INSERT INTO y2 (SELECT x, md5(x::text) FROM generate_series(0,20) x);
+CREATE POLICY p2 ON y2 USING (a % 3 = 0);
+CREATE POLICY p3 ON y2 USING (a % 4 = 0);
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM y2 WHERE f_leak(b);
+NOTICE: f_leak => cfcd208495d565ef66e7dff9f98764da
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => eccbc87e4b5ce2fe28308fd9f2a7baf3
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+NOTICE: f_leak => c9f0f895fb98ab9159f51fd0297e236d
+NOTICE: f_leak => 45c48cce2e2d7fbdea1afc51c7c6ad26
+NOTICE: f_leak => d3d9446802a44259755d38e6d163e820
+NOTICE: f_leak => c20ad4d76fe97759aa27a0c99bff6710
+NOTICE: f_leak => aab3238922bcc25a6f606eb525ffdc56
+NOTICE: f_leak => 9bf31c7ff062936a96d3c8bd1f8f2ff3
+NOTICE: f_leak => c74d97b01eae257e44aa9d5bade97baf
+NOTICE: f_leak => 6f4922f45568161a8cdf4ad2299f6d23
+NOTICE: f_leak => 98f13708210194c475687be6106a3b84
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 15 | 9bf31c7ff062936a96d3c8bd1f8f2ff3
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+(14 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM y2 WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on y2
+ Filter: ((((a % 4) = 0) OR ((a % 3) = 0) OR ((a % 2) = 0)) AND f_leak(b))
+(2 rows)
+
+--
+-- Qual push-down of leaky functions, when not referring to table
+--
+SELECT * FROM y2 WHERE f_leak('abc');
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+NOTICE: f_leak => abc
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 15 | 9bf31c7ff062936a96d3c8bd1f8f2ff3
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+(14 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM y2 WHERE f_leak('abc');
+ QUERY PLAN
+---------------------------------------------------------------------------------------
+ Seq Scan on y2
+ Filter: (f_leak('abc'::text) AND (((a % 4) = 0) OR ((a % 3) = 0) OR ((a % 2) = 0)))
+(2 rows)
+
+CREATE TABLE test_qual_pushdown (
+ abc text
+);
+INSERT INTO test_qual_pushdown VALUES ('abc'),('def');
+SELECT * FROM y2 JOIN test_qual_pushdown ON (b = abc) WHERE f_leak(abc);
+NOTICE: f_leak => abc
+NOTICE: f_leak => def
+ a | b | abc
+---+---+-----
+(0 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM y2 JOIN test_qual_pushdown ON (b = abc) WHERE f_leak(abc);
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Hash Join
+ Hash Cond: (test_qual_pushdown.abc = y2.b)
+ -> Seq Scan on test_qual_pushdown
+ Filter: f_leak(abc)
+ -> Hash
+ -> Seq Scan on y2
+ Filter: (((a % 4) = 0) OR ((a % 3) = 0) OR ((a % 2) = 0))
+(7 rows)
+
+SELECT * FROM y2 JOIN test_qual_pushdown ON (b = abc) WHERE f_leak(b);
+NOTICE: f_leak => cfcd208495d565ef66e7dff9f98764da
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => eccbc87e4b5ce2fe28308fd9f2a7baf3
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+NOTICE: f_leak => c9f0f895fb98ab9159f51fd0297e236d
+NOTICE: f_leak => 45c48cce2e2d7fbdea1afc51c7c6ad26
+NOTICE: f_leak => d3d9446802a44259755d38e6d163e820
+NOTICE: f_leak => c20ad4d76fe97759aa27a0c99bff6710
+NOTICE: f_leak => aab3238922bcc25a6f606eb525ffdc56
+NOTICE: f_leak => 9bf31c7ff062936a96d3c8bd1f8f2ff3
+NOTICE: f_leak => c74d97b01eae257e44aa9d5bade97baf
+NOTICE: f_leak => 6f4922f45568161a8cdf4ad2299f6d23
+NOTICE: f_leak => 98f13708210194c475687be6106a3b84
+ a | b | abc
+---+---+-----
+(0 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM y2 JOIN test_qual_pushdown ON (b = abc) WHERE f_leak(b);
+ QUERY PLAN
+-----------------------------------------------------------------------------------------
+ Hash Join
+ Hash Cond: (test_qual_pushdown.abc = y2.b)
+ -> Seq Scan on test_qual_pushdown
+ -> Hash
+ -> Seq Scan on y2
+ Filter: ((((a % 4) = 0) OR ((a % 3) = 0) OR ((a % 2) = 0)) AND f_leak(b))
+(6 rows)
+
+DROP TABLE test_qual_pushdown;
+--
+-- Plancache invalidate on user change.
+--
+RESET SESSION AUTHORIZATION;
+DROP TABLE t1 CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to table t2
+drop cascades to table t3
+CREATE TABLE t1 (a integer);
+GRANT SELECT ON t1 TO regress_rls_bob, regress_rls_carol;
+CREATE POLICY p1 ON t1 TO regress_rls_bob USING ((a % 2) = 0);
+CREATE POLICY p2 ON t1 TO regress_rls_carol USING ((a % 4) = 0);
+ALTER TABLE t1 ENABLE ROW LEVEL SECURITY;
+-- Prepare as regress_rls_bob
+SET ROLE regress_rls_bob;
+PREPARE role_inval AS SELECT * FROM t1;
+-- Check plan
+EXPLAIN (COSTS OFF) EXECUTE role_inval;
+ QUERY PLAN
+-------------------------
+ Seq Scan on t1
+ Filter: ((a % 2) = 0)
+(2 rows)
+
+-- Change to regress_rls_carol
+SET ROLE regress_rls_carol;
+-- Check plan - should be different
+EXPLAIN (COSTS OFF) EXECUTE role_inval;
+ QUERY PLAN
+-------------------------
+ Seq Scan on t1
+ Filter: ((a % 4) = 0)
+(2 rows)
+
+-- Change back to regress_rls_bob
+SET ROLE regress_rls_bob;
+-- Check plan - should be back to original
+EXPLAIN (COSTS OFF) EXECUTE role_inval;
+ QUERY PLAN
+-------------------------
+ Seq Scan on t1
+ Filter: ((a % 2) = 0)
+(2 rows)
+
+--
+-- CTE and RLS
+--
+RESET SESSION AUTHORIZATION;
+DROP TABLE t1 CASCADE;
+CREATE TABLE t1 (a integer, b text);
+CREATE POLICY p1 ON t1 USING (a % 2 = 0);
+ALTER TABLE t1 ENABLE ROW LEVEL SECURITY;
+GRANT ALL ON t1 TO regress_rls_bob;
+INSERT INTO t1 (SELECT x, md5(x::text) FROM generate_series(0,20) x);
+SET SESSION AUTHORIZATION regress_rls_bob;
+WITH cte1 AS MATERIALIZED (SELECT * FROM t1 WHERE f_leak(b)) SELECT * FROM cte1;
+NOTICE: f_leak => cfcd208495d565ef66e7dff9f98764da
+NOTICE: f_leak => c81e728d9d4c2f636f067f89cc14862c
+NOTICE: f_leak => a87ff679a2f3e71d9181a67b7542122c
+NOTICE: f_leak => 1679091c5a880faf6fb5e6087eb1b2dc
+NOTICE: f_leak => c9f0f895fb98ab9159f51fd0297e236d
+NOTICE: f_leak => d3d9446802a44259755d38e6d163e820
+NOTICE: f_leak => c20ad4d76fe97759aa27a0c99bff6710
+NOTICE: f_leak => aab3238922bcc25a6f606eb525ffdc56
+NOTICE: f_leak => c74d97b01eae257e44aa9d5bade97baf
+NOTICE: f_leak => 6f4922f45568161a8cdf4ad2299f6d23
+NOTICE: f_leak => 98f13708210194c475687be6106a3b84
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+(11 rows)
+
+EXPLAIN (COSTS OFF)
+WITH cte1 AS MATERIALIZED (SELECT * FROM t1 WHERE f_leak(b)) SELECT * FROM cte1;
+ QUERY PLAN
+-------------------------------------------------
+ CTE Scan on cte1
+ CTE cte1
+ -> Seq Scan on t1
+ Filter: (((a % 2) = 0) AND f_leak(b))
+(4 rows)
+
+WITH cte1 AS (UPDATE t1 SET a = a + 1 RETURNING *) SELECT * FROM cte1; --fail
+ERROR: new row violates row-level security policy for table "t1"
+WITH cte1 AS (UPDATE t1 SET a = a RETURNING *) SELECT * FROM cte1; --ok
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+(11 rows)
+
+WITH cte1 AS (INSERT INTO t1 VALUES (21, 'Fail') RETURNING *) SELECT * FROM cte1; --fail
+ERROR: new row violates row-level security policy for table "t1"
+WITH cte1 AS (INSERT INTO t1 VALUES (20, 'Success') RETURNING *) SELECT * FROM cte1; --ok
+ a | b
+----+---------
+ 20 | Success
+(1 row)
+
+--
+-- Rename Policy
+--
+RESET SESSION AUTHORIZATION;
+ALTER POLICY p1 ON t1 RENAME TO p1; --fail
+ERROR: policy "p1" for table "t1" already exists
+SELECT polname, relname
+ FROM pg_policy pol
+ JOIN pg_class pc ON (pc.oid = pol.polrelid)
+ WHERE relname = 't1';
+ polname | relname
+---------+---------
+ p1 | t1
+(1 row)
+
+ALTER POLICY p1 ON t1 RENAME TO p2; --ok
+SELECT polname, relname
+ FROM pg_policy pol
+ JOIN pg_class pc ON (pc.oid = pol.polrelid)
+ WHERE relname = 't1';
+ polname | relname
+---------+---------
+ p2 | t1
+(1 row)
+
+--
+-- Check INSERT SELECT
+--
+SET SESSION AUTHORIZATION regress_rls_bob;
+CREATE TABLE t2 (a integer, b text);
+INSERT INTO t2 (SELECT * FROM t1);
+EXPLAIN (COSTS OFF) INSERT INTO t2 (SELECT * FROM t1);
+ QUERY PLAN
+-------------------------------
+ Insert on t2
+ -> Seq Scan on t1
+ Filter: ((a % 2) = 0)
+(3 rows)
+
+SELECT * FROM t2;
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(12 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t2;
+ QUERY PLAN
+----------------
+ Seq Scan on t2
+(1 row)
+
+CREATE TABLE t3 AS SELECT * FROM t1;
+SELECT * FROM t3;
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(12 rows)
+
+SELECT * INTO t4 FROM t1;
+SELECT * FROM t4;
+ a | b
+----+----------------------------------
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(12 rows)
+
+--
+-- RLS with JOIN
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE blog (id integer, author text, post text);
+CREATE TABLE comment (blog_id integer, message text);
+GRANT ALL ON blog, comment TO regress_rls_bob;
+CREATE POLICY blog_1 ON blog USING (id % 2 = 0);
+ALTER TABLE blog ENABLE ROW LEVEL SECURITY;
+INSERT INTO blog VALUES
+ (1, 'alice', 'blog #1'),
+ (2, 'bob', 'blog #1'),
+ (3, 'alice', 'blog #2'),
+ (4, 'alice', 'blog #3'),
+ (5, 'john', 'blog #1');
+INSERT INTO comment VALUES
+ (1, 'cool blog'),
+ (1, 'fun blog'),
+ (3, 'crazy blog'),
+ (5, 'what?'),
+ (4, 'insane!'),
+ (2, 'who did it?');
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Check RLS JOIN with Non-RLS.
+SELECT id, author, message FROM blog JOIN comment ON id = blog_id;
+ id | author | message
+----+--------+-------------
+ 4 | alice | insane!
+ 2 | bob | who did it?
+(2 rows)
+
+-- Check Non-RLS JOIN with RLS.
+SELECT id, author, message FROM comment JOIN blog ON id = blog_id;
+ id | author | message
+----+--------+-------------
+ 4 | alice | insane!
+ 2 | bob | who did it?
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE POLICY comment_1 ON comment USING (blog_id < 4);
+ALTER TABLE comment ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Check RLS JOIN RLS
+SELECT id, author, message FROM blog JOIN comment ON id = blog_id;
+ id | author | message
+----+--------+-------------
+ 2 | bob | who did it?
+(1 row)
+
+SELECT id, author, message FROM comment JOIN blog ON id = blog_id;
+ id | author | message
+----+--------+-------------
+ 2 | bob | who did it?
+(1 row)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP TABLE blog, comment;
+--
+-- Default Deny Policy
+--
+RESET SESSION AUTHORIZATION;
+DROP POLICY p2 ON t1;
+ALTER TABLE t1 OWNER TO regress_rls_alice;
+-- Check that default deny does not apply to superuser.
+RESET SESSION AUTHORIZATION;
+SELECT * FROM t1;
+ a | b
+----+----------------------------------
+ 1 | c4ca4238a0b923820dcc509a6f75849b
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 5 | e4da3b7fbbce2345d7772b0674a318d5
+ 7 | 8f14e45fceea167a5a36dedd4bea2543
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 11 | 6512bd43d9caa6e02c990b0a82652dca
+ 13 | c51ce410c124a10e0db5e4b97fc2af39
+ 15 | 9bf31c7ff062936a96d3c8bd1f8f2ff3
+ 17 | 70efdf2ec9b086079795c442636b55fb
+ 19 | 1f0e3dad99908345f7439f8ffabdffc4
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(22 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+----------------
+ Seq Scan on t1
+(1 row)
+
+-- Check that default deny does not apply to table owner.
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM t1;
+ a | b
+----+----------------------------------
+ 1 | c4ca4238a0b923820dcc509a6f75849b
+ 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
+ 5 | e4da3b7fbbce2345d7772b0674a318d5
+ 7 | 8f14e45fceea167a5a36dedd4bea2543
+ 9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
+ 11 | 6512bd43d9caa6e02c990b0a82652dca
+ 13 | c51ce410c124a10e0db5e4b97fc2af39
+ 15 | 9bf31c7ff062936a96d3c8bd1f8f2ff3
+ 17 | 70efdf2ec9b086079795c442636b55fb
+ 19 | 1f0e3dad99908345f7439f8ffabdffc4
+ 0 | cfcd208495d565ef66e7dff9f98764da
+ 2 | c81e728d9d4c2f636f067f89cc14862c
+ 4 | a87ff679a2f3e71d9181a67b7542122c
+ 6 | 1679091c5a880faf6fb5e6087eb1b2dc
+ 8 | c9f0f895fb98ab9159f51fd0297e236d
+ 10 | d3d9446802a44259755d38e6d163e820
+ 12 | c20ad4d76fe97759aa27a0c99bff6710
+ 14 | aab3238922bcc25a6f606eb525ffdc56
+ 16 | c74d97b01eae257e44aa9d5bade97baf
+ 18 | 6f4922f45568161a8cdf4ad2299f6d23
+ 20 | 98f13708210194c475687be6106a3b84
+ 20 | Success
+(22 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+----------------
+ Seq Scan on t1
+(1 row)
+
+-- Check that default deny applies to non-owner/non-superuser when RLS on.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO ON;
+SELECT * FROM t1;
+ a | b
+---+---
+(0 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+--------------------------
+ Result
+ One-Time Filter: false
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM t1;
+ a | b
+---+---
+(0 rows)
+
+EXPLAIN (COSTS OFF) SELECT * FROM t1;
+ QUERY PLAN
+--------------------------
+ Result
+ One-Time Filter: false
+(2 rows)
+
+--
+-- COPY TO/FROM
+--
+RESET SESSION AUTHORIZATION;
+DROP TABLE copy_t CASCADE;
+ERROR: table "copy_t" does not exist
+CREATE TABLE copy_t (a integer, b text);
+CREATE POLICY p1 ON copy_t USING (a % 2 = 0);
+ALTER TABLE copy_t ENABLE ROW LEVEL SECURITY;
+GRANT ALL ON copy_t TO regress_rls_bob, regress_rls_exempt_user;
+INSERT INTO copy_t (SELECT x, md5(x::text) FROM generate_series(0,10) x);
+-- Check COPY TO as Superuser/owner.
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ',';
+0,cfcd208495d565ef66e7dff9f98764da
+1,c4ca4238a0b923820dcc509a6f75849b
+2,c81e728d9d4c2f636f067f89cc14862c
+3,eccbc87e4b5ce2fe28308fd9f2a7baf3
+4,a87ff679a2f3e71d9181a67b7542122c
+5,e4da3b7fbbce2345d7772b0674a318d5
+6,1679091c5a880faf6fb5e6087eb1b2dc
+7,8f14e45fceea167a5a36dedd4bea2543
+8,c9f0f895fb98ab9159f51fd0297e236d
+9,45c48cce2e2d7fbdea1afc51c7c6ad26
+10,d3d9446802a44259755d38e6d163e820
+SET row_security TO ON;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ',';
+0,cfcd208495d565ef66e7dff9f98764da
+1,c4ca4238a0b923820dcc509a6f75849b
+2,c81e728d9d4c2f636f067f89cc14862c
+3,eccbc87e4b5ce2fe28308fd9f2a7baf3
+4,a87ff679a2f3e71d9181a67b7542122c
+5,e4da3b7fbbce2345d7772b0674a318d5
+6,1679091c5a880faf6fb5e6087eb1b2dc
+7,8f14e45fceea167a5a36dedd4bea2543
+8,c9f0f895fb98ab9159f51fd0297e236d
+9,45c48cce2e2d7fbdea1afc51c7c6ad26
+10,d3d9446802a44259755d38e6d163e820
+-- Check COPY TO as user with permissions.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO OFF;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --fail - would be affected by RLS
+ERROR: query would be affected by row-level security policy for table "copy_t"
+SET row_security TO ON;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --ok
+0,cfcd208495d565ef66e7dff9f98764da
+2,c81e728d9d4c2f636f067f89cc14862c
+4,a87ff679a2f3e71d9181a67b7542122c
+6,1679091c5a880faf6fb5e6087eb1b2dc
+8,c9f0f895fb98ab9159f51fd0297e236d
+10,d3d9446802a44259755d38e6d163e820
+-- Check COPY TO as user with permissions and BYPASSRLS
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --ok
+0,cfcd208495d565ef66e7dff9f98764da
+1,c4ca4238a0b923820dcc509a6f75849b
+2,c81e728d9d4c2f636f067f89cc14862c
+3,eccbc87e4b5ce2fe28308fd9f2a7baf3
+4,a87ff679a2f3e71d9181a67b7542122c
+5,e4da3b7fbbce2345d7772b0674a318d5
+6,1679091c5a880faf6fb5e6087eb1b2dc
+7,8f14e45fceea167a5a36dedd4bea2543
+8,c9f0f895fb98ab9159f51fd0297e236d
+9,45c48cce2e2d7fbdea1afc51c7c6ad26
+10,d3d9446802a44259755d38e6d163e820
+SET row_security TO ON;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --ok
+0,cfcd208495d565ef66e7dff9f98764da
+1,c4ca4238a0b923820dcc509a6f75849b
+2,c81e728d9d4c2f636f067f89cc14862c
+3,eccbc87e4b5ce2fe28308fd9f2a7baf3
+4,a87ff679a2f3e71d9181a67b7542122c
+5,e4da3b7fbbce2345d7772b0674a318d5
+6,1679091c5a880faf6fb5e6087eb1b2dc
+7,8f14e45fceea167a5a36dedd4bea2543
+8,c9f0f895fb98ab9159f51fd0297e236d
+9,45c48cce2e2d7fbdea1afc51c7c6ad26
+10,d3d9446802a44259755d38e6d163e820
+-- Check COPY TO as user without permissions. SET row_security TO OFF;
+SET SESSION AUTHORIZATION regress_rls_carol;
+SET row_security TO OFF;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --fail - would be affected by RLS
+ERROR: query would be affected by row-level security policy for table "copy_t"
+SET row_security TO ON;
+COPY (SELECT * FROM copy_t ORDER BY a ASC) TO STDOUT WITH DELIMITER ','; --fail - permission denied
+ERROR: permission denied for table copy_t
+-- Check COPY relation TO; keep it just one row to avoid reordering issues
+RESET SESSION AUTHORIZATION;
+SET row_security TO ON;
+CREATE TABLE copy_rel_to (a integer, b text);
+CREATE POLICY p1 ON copy_rel_to USING (a % 2 = 0);
+ALTER TABLE copy_rel_to ENABLE ROW LEVEL SECURITY;
+GRANT ALL ON copy_rel_to TO regress_rls_bob, regress_rls_exempt_user;
+INSERT INTO copy_rel_to VALUES (1, md5('1'));
+-- Check COPY TO as Superuser/owner.
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ',';
+1,c4ca4238a0b923820dcc509a6f75849b
+SET row_security TO ON;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ',';
+1,c4ca4238a0b923820dcc509a6f75849b
+-- Check COPY TO as user with permissions.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO OFF;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --fail - would be affected by RLS
+ERROR: query would be affected by row-level security policy for table "copy_rel_to"
+SET row_security TO ON;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --ok
+-- Check COPY TO as user with permissions and BYPASSRLS
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO OFF;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --ok
+1,c4ca4238a0b923820dcc509a6f75849b
+SET row_security TO ON;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --ok
+1,c4ca4238a0b923820dcc509a6f75849b
+-- Check COPY TO as user without permissions. SET row_security TO OFF;
+SET SESSION AUTHORIZATION regress_rls_carol;
+SET row_security TO OFF;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --fail - permission denied
+ERROR: permission denied for table copy_rel_to
+SET row_security TO ON;
+COPY copy_rel_to TO STDOUT WITH DELIMITER ','; --fail - permission denied
+ERROR: permission denied for table copy_rel_to
+-- Check COPY FROM as Superuser/owner.
+RESET SESSION AUTHORIZATION;
+SET row_security TO OFF;
+COPY copy_t FROM STDIN; --ok
+SET row_security TO ON;
+COPY copy_t FROM STDIN; --ok
+-- Check COPY FROM as user with permissions.
+SET SESSION AUTHORIZATION regress_rls_bob;
+SET row_security TO OFF;
+COPY copy_t FROM STDIN; --fail - would be affected by RLS.
+ERROR: query would be affected by row-level security policy for table "copy_t"
+SET row_security TO ON;
+COPY copy_t FROM STDIN; --fail - COPY FROM not supported by RLS.
+ERROR: COPY FROM not supported with row-level security
+HINT: Use INSERT statements instead.
+-- Check COPY FROM as user with permissions and BYPASSRLS
+SET SESSION AUTHORIZATION regress_rls_exempt_user;
+SET row_security TO ON;
+COPY copy_t FROM STDIN; --ok
+-- Check COPY FROM as user without permissions.
+SET SESSION AUTHORIZATION regress_rls_carol;
+SET row_security TO OFF;
+COPY copy_t FROM STDIN; --fail - permission denied.
+ERROR: permission denied for table copy_t
+SET row_security TO ON;
+COPY copy_t FROM STDIN; --fail - permission denied.
+ERROR: permission denied for table copy_t
+RESET SESSION AUTHORIZATION;
+DROP TABLE copy_t;
+DROP TABLE copy_rel_to CASCADE;
+-- Check WHERE CURRENT OF
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE current_check (currentid int, payload text, rlsuser text);
+GRANT ALL ON current_check TO PUBLIC;
+INSERT INTO current_check VALUES
+ (1, 'abc', 'regress_rls_bob'),
+ (2, 'bcd', 'regress_rls_bob'),
+ (3, 'cde', 'regress_rls_bob'),
+ (4, 'def', 'regress_rls_bob');
+CREATE POLICY p1 ON current_check FOR SELECT USING (currentid % 2 = 0);
+CREATE POLICY p2 ON current_check FOR DELETE USING (currentid = 4 AND rlsuser = current_user);
+CREATE POLICY p3 ON current_check FOR UPDATE USING (currentid = 4) WITH CHECK (rlsuser = current_user);
+ALTER TABLE current_check ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Can SELECT even rows
+SELECT * FROM current_check;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+ 4 | def | regress_rls_bob
+(2 rows)
+
+-- Cannot UPDATE row 2
+UPDATE current_check SET payload = payload || '_new' WHERE currentid = 2 RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+---------
+(0 rows)
+
+BEGIN;
+DECLARE current_check_cursor SCROLL CURSOR FOR SELECT * FROM current_check;
+-- Returns rows that can be seen according to SELECT policy, like plain SELECT
+-- above (even rows)
+FETCH ABSOLUTE 1 FROM current_check_cursor;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+(1 row)
+
+-- Still cannot UPDATE row 2 through cursor
+UPDATE current_check SET payload = payload || '_new' WHERE CURRENT OF current_check_cursor RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+---------
+(0 rows)
+
+-- Can update row 4 through cursor, which is the next visible row
+FETCH RELATIVE 1 FROM current_check_cursor;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 4 | def | regress_rls_bob
+(1 row)
+
+UPDATE current_check SET payload = payload || '_new' WHERE CURRENT OF current_check_cursor RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 4 | def_new | regress_rls_bob
+(1 row)
+
+SELECT * FROM current_check;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+ 4 | def_new | regress_rls_bob
+(2 rows)
+
+-- Plan should be a subquery TID scan
+EXPLAIN (COSTS OFF) UPDATE current_check SET payload = payload WHERE CURRENT OF current_check_cursor;
+ QUERY PLAN
+-------------------------------------------------------------
+ Update on current_check
+ -> Tid Scan on current_check
+ TID Cond: CURRENT OF current_check_cursor
+ Filter: ((currentid = 4) AND ((currentid % 2) = 0))
+(4 rows)
+
+-- Similarly can only delete row 4
+FETCH ABSOLUTE 1 FROM current_check_cursor;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+(1 row)
+
+DELETE FROM current_check WHERE CURRENT OF current_check_cursor RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+---------
+(0 rows)
+
+FETCH RELATIVE 1 FROM current_check_cursor;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 4 | def | regress_rls_bob
+(1 row)
+
+DELETE FROM current_check WHERE CURRENT OF current_check_cursor RETURNING *;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 4 | def_new | regress_rls_bob
+(1 row)
+
+SELECT * FROM current_check;
+ currentid | payload | rlsuser
+-----------+---------+-----------------
+ 2 | bcd | regress_rls_bob
+(1 row)
+
+COMMIT;
+--
+-- check pg_stats view filtering
+--
+SET row_security TO ON;
+SET SESSION AUTHORIZATION regress_rls_alice;
+ANALYZE current_check;
+-- Stats visible
+SELECT row_security_active('current_check');
+ row_security_active
+---------------------
+ f
+(1 row)
+
+SELECT attname, most_common_vals FROM pg_stats
+ WHERE tablename = 'current_check'
+ ORDER BY 1;
+ attname | most_common_vals
+-----------+-------------------
+ currentid | {4}
+ payload |
+ rlsuser | {regress_rls_bob}
+(3 rows)
+
+SET SESSION AUTHORIZATION regress_rls_bob;
+-- Stats not visible
+SELECT row_security_active('current_check');
+ row_security_active
+---------------------
+ t
+(1 row)
+
+SELECT attname, most_common_vals FROM pg_stats
+ WHERE tablename = 'current_check'
+ ORDER BY 1;
+ attname | most_common_vals
+---------+------------------
+(0 rows)
+
+--
+-- Collation support
+--
+BEGIN;
+CREATE TABLE coll_t (c) AS VALUES ('bar'::text);
+CREATE POLICY coll_p ON coll_t USING (c < ('foo'::text COLLATE "C"));
+ALTER TABLE coll_t ENABLE ROW LEVEL SECURITY;
+GRANT SELECT ON coll_t TO regress_rls_alice;
+SELECT (string_to_array(polqual, ':'))[7] AS inputcollid FROM pg_policy WHERE polrelid = 'coll_t'::regclass;
+ inputcollid
+------------------
+ inputcollid 950
+(1 row)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM coll_t;
+ c
+-----
+ bar
+(1 row)
+
+ROLLBACK;
+--
+-- Shared Object Dependencies
+--
+RESET SESSION AUTHORIZATION;
+BEGIN;
+CREATE ROLE regress_rls_eve;
+CREATE ROLE regress_rls_frank;
+CREATE TABLE tbl1 (c) AS VALUES ('bar'::text);
+GRANT SELECT ON TABLE tbl1 TO regress_rls_eve;
+CREATE POLICY P ON tbl1 TO regress_rls_eve, regress_rls_frank USING (true);
+SELECT refclassid::regclass, deptype
+ FROM pg_depend
+ WHERE classid = 'pg_policy'::regclass
+ AND refobjid = 'tbl1'::regclass;
+ refclassid | deptype
+------------+---------
+ pg_class | a
+(1 row)
+
+SELECT refclassid::regclass, deptype
+ FROM pg_shdepend
+ WHERE classid = 'pg_policy'::regclass
+ AND refobjid IN ('regress_rls_eve'::regrole, 'regress_rls_frank'::regrole);
+ refclassid | deptype
+------------+---------
+ pg_authid | r
+ pg_authid | r
+(2 rows)
+
+SAVEPOINT q;
+DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
+ERROR: role "regress_rls_eve" cannot be dropped because some objects depend on it
+DETAIL: privileges for table tbl1
+target of policy p on table tbl1
+ROLLBACK TO q;
+ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
+SAVEPOINT q;
+DROP ROLE regress_rls_eve; --fails due to dependency on GRANT SELECT
+ERROR: role "regress_rls_eve" cannot be dropped because some objects depend on it
+DETAIL: privileges for table tbl1
+ROLLBACK TO q;
+REVOKE ALL ON TABLE tbl1 FROM regress_rls_eve;
+SAVEPOINT q;
+DROP ROLE regress_rls_eve; --succeeds
+ROLLBACK TO q;
+SAVEPOINT q;
+DROP ROLE regress_rls_frank; --fails due to dependency on POLICY p
+ERROR: role "regress_rls_frank" cannot be dropped because some objects depend on it
+DETAIL: target of policy p on table tbl1
+ROLLBACK TO q;
+DROP POLICY p ON tbl1;
+SAVEPOINT q;
+DROP ROLE regress_rls_frank; -- succeeds
+ROLLBACK TO q;
+ROLLBACK; -- cleanup
+--
+-- Converting table to view
+--
+BEGIN;
+CREATE TABLE t (c int);
+CREATE POLICY p ON t USING (c % 2 = 1);
+ALTER TABLE t ENABLE ROW LEVEL SECURITY;
+SAVEPOINT q;
+CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD
+ SELECT * FROM generate_series(1,5) t0(c); -- fails due to row level security enabled
+ERROR: could not convert table "t" to a view because it has row security enabled
+ROLLBACK TO q;
+ALTER TABLE t DISABLE ROW LEVEL SECURITY;
+SAVEPOINT q;
+CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD
+ SELECT * FROM generate_series(1,5) t0(c); -- fails due to policy p on t
+ERROR: could not convert table "t" to a view because it has row security policies
+ROLLBACK TO q;
+DROP POLICY p ON t;
+CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD
+ SELECT * FROM generate_series(1,5) t0(c); -- succeeds
+ROLLBACK;
+--
+-- Policy expression handling
+--
+BEGIN;
+CREATE TABLE t (c) AS VALUES ('bar'::text);
+CREATE POLICY p ON t USING (max(c)); -- fails: aggregate functions are not allowed in policy expressions
+ERROR: aggregate functions are not allowed in policy expressions
+ROLLBACK;
+--
+-- Non-target relations are only subject to SELECT policies
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE r1 (a int);
+CREATE TABLE r2 (a int);
+INSERT INTO r1 VALUES (10), (20);
+INSERT INTO r2 VALUES (10), (20);
+GRANT ALL ON r1, r2 TO regress_rls_bob;
+CREATE POLICY p1 ON r1 USING (true);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON r2 FOR SELECT USING (true);
+CREATE POLICY p2 ON r2 FOR INSERT WITH CHECK (false);
+CREATE POLICY p3 ON r2 FOR UPDATE USING (false);
+CREATE POLICY p4 ON r2 FOR DELETE USING (false);
+ALTER TABLE r2 ENABLE ROW LEVEL SECURITY;
+SET SESSION AUTHORIZATION regress_rls_bob;
+SELECT * FROM r1;
+ a
+----
+ 10
+ 20
+(2 rows)
+
+SELECT * FROM r2;
+ a
+----
+ 10
+ 20
+(2 rows)
+
+-- r2 is read-only
+INSERT INTO r2 VALUES (2); -- Not allowed
+ERROR: new row violates row-level security policy for table "r2"
+UPDATE r2 SET a = 2 RETURNING *; -- Updates nothing
+ a
+---
+(0 rows)
+
+DELETE FROM r2 RETURNING *; -- Deletes nothing
+ a
+---
+(0 rows)
+
+-- r2 can be used as a non-target relation in DML
+INSERT INTO r1 SELECT a + 1 FROM r2 RETURNING *; -- OK
+ a
+----
+ 11
+ 21
+(2 rows)
+
+UPDATE r1 SET a = r2.a + 2 FROM r2 WHERE r1.a = r2.a RETURNING *; -- OK
+ a | a
+----+----
+ 12 | 10
+ 22 | 20
+(2 rows)
+
+DELETE FROM r1 USING r2 WHERE r1.a = r2.a + 2 RETURNING *; -- OK
+ a | a
+----+----
+ 12 | 10
+ 22 | 20
+(2 rows)
+
+SELECT * FROM r1;
+ a
+----
+ 11
+ 21
+(2 rows)
+
+SELECT * FROM r2;
+ a
+----
+ 10
+ 20
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+DROP TABLE r1;
+DROP TABLE r2;
+--
+-- FORCE ROW LEVEL SECURITY applies RLS to owners too
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security = on;
+CREATE TABLE r1 (a int);
+INSERT INTO r1 VALUES (10), (20);
+CREATE POLICY p1 ON r1 USING (false);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- No error, but no rows
+TABLE r1;
+ a
+---
+(0 rows)
+
+-- RLS error
+INSERT INTO r1 VALUES (1);
+ERROR: new row violates row-level security policy for table "r1"
+-- No error (unable to see any rows to update)
+UPDATE r1 SET a = 1;
+TABLE r1;
+ a
+---
+(0 rows)
+
+-- No error (unable to see any rows to delete)
+DELETE FROM r1;
+TABLE r1;
+ a
+---
+(0 rows)
+
+SET row_security = off;
+-- these all fail, would be affected by RLS
+TABLE r1;
+ERROR: query would be affected by row-level security policy for table "r1"
+HINT: To disable the policy for the table's owner, use ALTER TABLE NO FORCE ROW LEVEL SECURITY.
+UPDATE r1 SET a = 1;
+ERROR: query would be affected by row-level security policy for table "r1"
+HINT: To disable the policy for the table's owner, use ALTER TABLE NO FORCE ROW LEVEL SECURITY.
+DELETE FROM r1;
+ERROR: query would be affected by row-level security policy for table "r1"
+HINT: To disable the policy for the table's owner, use ALTER TABLE NO FORCE ROW LEVEL SECURITY.
+DROP TABLE r1;
+--
+-- FORCE ROW LEVEL SECURITY does not break RI
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security = on;
+CREATE TABLE r1 (a int PRIMARY KEY);
+CREATE TABLE r2 (a int REFERENCES r1);
+INSERT INTO r1 VALUES (10), (20);
+INSERT INTO r2 VALUES (10), (20);
+-- Create policies on r2 which prevent the
+-- owner from seeing any rows, but RI should
+-- still see them.
+CREATE POLICY p1 ON r2 USING (false);
+ALTER TABLE r2 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r2 FORCE ROW LEVEL SECURITY;
+-- Errors due to rows in r2
+DELETE FROM r1;
+ERROR: update or delete on table "r1" violates foreign key constraint "r2_a_fkey" on table "r2"
+DETAIL: Key (a)=(10) is still referenced from table "r2".
+-- Reset r2 to no-RLS
+DROP POLICY p1 ON r2;
+ALTER TABLE r2 NO FORCE ROW LEVEL SECURITY;
+ALTER TABLE r2 DISABLE ROW LEVEL SECURITY;
+-- clean out r2 for INSERT test below
+DELETE FROM r2;
+-- Change r1 to not allow rows to be seen
+CREATE POLICY p1 ON r1 USING (false);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- No rows seen
+TABLE r1;
+ a
+---
+(0 rows)
+
+-- No error, RI still sees that row exists in r1
+INSERT INTO r2 VALUES (10);
+DROP TABLE r2;
+DROP TABLE r1;
+-- Ensure cascaded DELETE works
+CREATE TABLE r1 (a int PRIMARY KEY);
+CREATE TABLE r2 (a int REFERENCES r1 ON DELETE CASCADE);
+INSERT INTO r1 VALUES (10), (20);
+INSERT INTO r2 VALUES (10), (20);
+-- Create policies on r2 which prevent the
+-- owner from seeing any rows, but RI should
+-- still see them.
+CREATE POLICY p1 ON r2 USING (false);
+ALTER TABLE r2 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r2 FORCE ROW LEVEL SECURITY;
+-- Deletes all records from both
+DELETE FROM r1;
+-- Remove FORCE from r2
+ALTER TABLE r2 NO FORCE ROW LEVEL SECURITY;
+-- As owner, we now bypass RLS
+-- verify no rows in r2 now
+TABLE r2;
+ a
+---
+(0 rows)
+
+DROP TABLE r2;
+DROP TABLE r1;
+-- Ensure cascaded UPDATE works
+CREATE TABLE r1 (a int PRIMARY KEY);
+CREATE TABLE r2 (a int REFERENCES r1 ON UPDATE CASCADE);
+INSERT INTO r1 VALUES (10), (20);
+INSERT INTO r2 VALUES (10), (20);
+-- Create policies on r2 which prevent the
+-- owner from seeing any rows, but RI should
+-- still see them.
+CREATE POLICY p1 ON r2 USING (false);
+ALTER TABLE r2 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r2 FORCE ROW LEVEL SECURITY;
+-- Updates records in both
+UPDATE r1 SET a = a+5;
+-- Remove FORCE from r2
+ALTER TABLE r2 NO FORCE ROW LEVEL SECURITY;
+-- As owner, we now bypass RLS
+-- verify records in r2 updated
+TABLE r2;
+ a
+----
+ 15
+ 25
+(2 rows)
+
+DROP TABLE r2;
+DROP TABLE r1;
+--
+-- Test INSERT+RETURNING applies SELECT policies as
+-- WithCheckOptions (meaning an error is thrown)
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security = on;
+CREATE TABLE r1 (a int);
+CREATE POLICY p1 ON r1 FOR SELECT USING (false);
+CREATE POLICY p2 ON r1 FOR INSERT WITH CHECK (true);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- Works fine
+INSERT INTO r1 VALUES (10), (20);
+-- No error, but no rows
+TABLE r1;
+ a
+---
+(0 rows)
+
+SET row_security = off;
+-- fail, would be affected by RLS
+TABLE r1;
+ERROR: query would be affected by row-level security policy for table "r1"
+HINT: To disable the policy for the table's owner, use ALTER TABLE NO FORCE ROW LEVEL SECURITY.
+SET row_security = on;
+-- Error
+INSERT INTO r1 VALUES (10), (20) RETURNING *;
+ERROR: new row violates row-level security policy for table "r1"
+DROP TABLE r1;
+--
+-- Test UPDATE+RETURNING applies SELECT policies as
+-- WithCheckOptions (meaning an error is thrown)
+--
+SET SESSION AUTHORIZATION regress_rls_alice;
+SET row_security = on;
+CREATE TABLE r1 (a int PRIMARY KEY);
+CREATE POLICY p1 ON r1 FOR SELECT USING (a < 20);
+CREATE POLICY p2 ON r1 FOR UPDATE USING (a < 20) WITH CHECK (true);
+CREATE POLICY p3 ON r1 FOR INSERT WITH CHECK (true);
+INSERT INTO r1 VALUES (10);
+ALTER TABLE r1 ENABLE ROW LEVEL SECURITY;
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- Works fine
+UPDATE r1 SET a = 30;
+-- Show updated rows
+ALTER TABLE r1 NO FORCE ROW LEVEL SECURITY;
+TABLE r1;
+ a
+----
+ 30
+(1 row)
+
+-- reset value in r1 for test with RETURNING
+UPDATE r1 SET a = 10;
+-- Verify row reset
+TABLE r1;
+ a
+----
+ 10
+(1 row)
+
+ALTER TABLE r1 FORCE ROW LEVEL SECURITY;
+-- Error
+UPDATE r1 SET a = 30 RETURNING *;
+ERROR: new row violates row-level security policy for table "r1"
+-- UPDATE path of INSERT ... ON CONFLICT DO UPDATE should also error out
+INSERT INTO r1 VALUES (10)
+ ON CONFLICT (a) DO UPDATE SET a = 30 RETURNING *;
+ERROR: new row violates row-level security policy for table "r1"
+-- Should still error out without RETURNING (use of arbiter always requires
+-- SELECT permissions)
+INSERT INTO r1 VALUES (10)
+ ON CONFLICT (a) DO UPDATE SET a = 30;
+ERROR: new row violates row-level security policy for table "r1"
+INSERT INTO r1 VALUES (10)
+ ON CONFLICT ON CONSTRAINT r1_pkey DO UPDATE SET a = 30;
+ERROR: new row violates row-level security policy for table "r1"
+DROP TABLE r1;
+-- Check dependency handling
+RESET SESSION AUTHORIZATION;
+CREATE TABLE dep1 (c1 int);
+CREATE TABLE dep2 (c1 int);
+CREATE POLICY dep_p1 ON dep1 TO regress_rls_bob USING (c1 > (select max(dep2.c1) from dep2));
+ALTER POLICY dep_p1 ON dep1 TO regress_rls_bob,regress_rls_carol;
+-- Should return one
+SELECT count(*) = 1 FROM pg_depend
+ WHERE objid = (SELECT oid FROM pg_policy WHERE polname = 'dep_p1')
+ AND refobjid = (SELECT oid FROM pg_class WHERE relname = 'dep2');
+ ?column?
+----------
+ t
+(1 row)
+
+ALTER POLICY dep_p1 ON dep1 USING (true);
+-- Should return one
+SELECT count(*) = 1 FROM pg_shdepend
+ WHERE objid = (SELECT oid FROM pg_policy WHERE polname = 'dep_p1')
+ AND refobjid = (SELECT oid FROM pg_authid WHERE rolname = 'regress_rls_bob');
+ ?column?
+----------
+ t
+(1 row)
+
+-- Should return one
+SELECT count(*) = 1 FROM pg_shdepend
+ WHERE objid = (SELECT oid FROM pg_policy WHERE polname = 'dep_p1')
+ AND refobjid = (SELECT oid FROM pg_authid WHERE rolname = 'regress_rls_carol');
+ ?column?
+----------
+ t
+(1 row)
+
+-- Should return zero
+SELECT count(*) = 0 FROM pg_depend
+ WHERE objid = (SELECT oid FROM pg_policy WHERE polname = 'dep_p1')
+ AND refobjid = (SELECT oid FROM pg_class WHERE relname = 'dep2');
+ ?column?
+----------
+ t
+(1 row)
+
+-- DROP OWNED BY testing
+RESET SESSION AUTHORIZATION;
+CREATE ROLE regress_rls_dob_role1;
+CREATE ROLE regress_rls_dob_role2;
+CREATE TABLE dob_t1 (c1 int);
+CREATE TABLE dob_t2 (c1 int) PARTITION BY RANGE (c1);
+CREATE POLICY p1 ON dob_t1 TO regress_rls_dob_role1 USING (true);
+DROP OWNED BY regress_rls_dob_role1;
+DROP POLICY p1 ON dob_t1; -- should fail, already gone
+ERROR: policy "p1" for table "dob_t1" does not exist
+CREATE POLICY p1 ON dob_t1 TO regress_rls_dob_role1,regress_rls_dob_role2 USING (true);
+DROP OWNED BY regress_rls_dob_role1;
+DROP POLICY p1 ON dob_t1; -- should succeed
+CREATE POLICY p1 ON dob_t2 TO regress_rls_dob_role1,regress_rls_dob_role2 USING (true);
+DROP OWNED BY regress_rls_dob_role1;
+DROP POLICY p1 ON dob_t2; -- should succeed
+DROP USER regress_rls_dob_role1;
+DROP USER regress_rls_dob_role2;
+-- Bug #15708: view + table with RLS should check policies as view owner
+CREATE TABLE ref_tbl (a int);
+INSERT INTO ref_tbl VALUES (1);
+CREATE TABLE rls_tbl (a int);
+INSERT INTO rls_tbl VALUES (10);
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON rls_tbl USING (EXISTS (SELECT 1 FROM ref_tbl));
+GRANT SELECT ON ref_tbl TO regress_rls_bob;
+GRANT SELECT ON rls_tbl TO regress_rls_bob;
+CREATE VIEW rls_view AS SELECT * FROM rls_tbl;
+ALTER VIEW rls_view OWNER TO regress_rls_bob;
+GRANT SELECT ON rls_view TO regress_rls_alice;
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM ref_tbl; -- Permission denied
+ERROR: permission denied for table ref_tbl
+SELECT * FROM rls_tbl; -- Permission denied
+ERROR: permission denied for table rls_tbl
+SELECT * FROM rls_view; -- OK
+ a
+----
+ 10
+(1 row)
+
+RESET SESSION AUTHORIZATION;
+DROP VIEW rls_view;
+DROP TABLE rls_tbl;
+DROP TABLE ref_tbl;
+-- Leaky operator test
+CREATE TABLE rls_tbl (a int);
+INSERT INTO rls_tbl SELECT x/10 FROM generate_series(1, 100) x;
+ANALYZE rls_tbl;
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+GRANT SELECT ON rls_tbl TO regress_rls_alice;
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE FUNCTION op_leak(int, int) RETURNS bool
+ AS 'BEGIN RAISE NOTICE ''op_leak => %, %'', $1, $2; RETURN $1 < $2; END'
+ LANGUAGE plpgsql;
+CREATE OPERATOR <<< (procedure = op_leak, leftarg = int, rightarg = int,
+ restrict = scalarltsel);
+SELECT * FROM rls_tbl WHERE a <<< 1000;
+ a
+---
+(0 rows)
+
+DROP OPERATOR <<< (int, int);
+DROP FUNCTION op_leak(int, int);
+RESET SESSION AUTHORIZATION;
+DROP TABLE rls_tbl;
+-- Bug #16006: whole-row Vars in a policy don't play nice with sub-selects
+SET SESSION AUTHORIZATION regress_rls_alice;
+CREATE TABLE rls_tbl (a int, b int, c int);
+CREATE POLICY p1 ON rls_tbl USING (rls_tbl >= ROW(1,1,1));
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+ALTER TABLE rls_tbl FORCE ROW LEVEL SECURITY;
+INSERT INTO rls_tbl SELECT 10, 20, 30;
+EXPLAIN (VERBOSE, COSTS OFF)
+INSERT INTO rls_tbl
+ SELECT * FROM (SELECT b, c FROM rls_tbl ORDER BY a) ss;
+ QUERY PLAN
+--------------------------------------------------------------------
+ Insert on regress_rls_schema.rls_tbl
+ -> Subquery Scan on ss
+ Output: ss.b, ss.c, NULL::integer
+ -> Sort
+ Output: rls_tbl_1.b, rls_tbl_1.c, rls_tbl_1.a
+ Sort Key: rls_tbl_1.a
+ -> Seq Scan on regress_rls_schema.rls_tbl rls_tbl_1
+ Output: rls_tbl_1.b, rls_tbl_1.c, rls_tbl_1.a
+ Filter: (rls_tbl_1.* >= '(1,1,1)'::record)
+(9 rows)
+
+INSERT INTO rls_tbl
+ SELECT * FROM (SELECT b, c FROM rls_tbl ORDER BY a) ss;
+SELECT * FROM rls_tbl;
+ a | b | c
+----+----+----
+ 10 | 20 | 30
+ 20 | 30 |
+(2 rows)
+
+DROP TABLE rls_tbl;
+RESET SESSION AUTHORIZATION;
+--
+-- Clean up objects
+--
+RESET SESSION AUTHORIZATION;
+DROP SCHEMA regress_rls_schema CASCADE;
+NOTICE: drop cascades to 29 other objects
+DETAIL: drop cascades to function f_leak(text)
+drop cascades to table uaccount
+drop cascades to table category
+drop cascades to table document
+drop cascades to table part_document
+drop cascades to table dependent
+drop cascades to table rec1
+drop cascades to table rec2
+drop cascades to view rec1v
+drop cascades to view rec2v
+drop cascades to table s1
+drop cascades to table s2
+drop cascades to view v2
+drop cascades to table b1
+drop cascades to view bv1
+drop cascades to table z1
+drop cascades to table z2
+drop cascades to table x1
+drop cascades to table y1
+drop cascades to table y2
+drop cascades to table t1
+drop cascades to table t2
+drop cascades to table t3
+drop cascades to table t4
+drop cascades to table current_check
+drop cascades to table dep1
+drop cascades to table dep2
+drop cascades to table dob_t1
+drop cascades to table dob_t2
+DROP USER regress_rls_alice;
+DROP USER regress_rls_bob;
+DROP USER regress_rls_carol;
+DROP USER regress_rls_dave;
+DROP USER regress_rls_exempt_user;
+DROP ROLE regress_rls_group1;
+DROP ROLE regress_rls_group2;
+-- Arrange to have a few policies left over, for testing
+-- pg_dump/pg_restore
+CREATE SCHEMA regress_rls_schema;
+CREATE TABLE rls_tbl (c1 int);
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON rls_tbl USING (c1 > 5);
+CREATE POLICY p2 ON rls_tbl FOR SELECT USING (c1 <= 3);
+CREATE POLICY p3 ON rls_tbl FOR UPDATE USING (c1 <= 3) WITH CHECK (c1 > 5);
+CREATE POLICY p4 ON rls_tbl FOR DELETE USING (c1 <= 3);
+CREATE TABLE rls_tbl_force (c1 int);
+ALTER TABLE rls_tbl_force ENABLE ROW LEVEL SECURITY;
+ALTER TABLE rls_tbl_force FORCE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON rls_tbl_force USING (c1 = 5) WITH CHECK (c1 < 5);
+CREATE POLICY p2 ON rls_tbl_force FOR SELECT USING (c1 = 8);
+CREATE POLICY p3 ON rls_tbl_force FOR UPDATE USING (c1 = 8) WITH CHECK (c1 >= 5);
+CREATE POLICY p4 ON rls_tbl_force FOR DELETE USING (c1 = 8);
diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out
index 192445878d..088d35fdc9 100644
--- a/src/test/regress/expected/sanity_check.out
+++ b/src/test/regress/expected/sanity_check.out
@@ -43,7 +43,6 @@ dupindexcols|t
e_star|f
emp|f
equipment_r|f
-extra_wide_table|f
f_star|f
fast_emp4000|t
float4_tbl|f
diff --git a/src/test/regress/expected/select_parallel_1.out b/src/test/regress/expected/select_parallel_1.out
new file mode 100644
index 0000000000..c63aa334c0
--- /dev/null
+++ b/src/test/regress/expected/select_parallel_1.out
@@ -0,0 +1,1150 @@
+--
+-- PARALLEL
+--
+create function sp_parallel_restricted(int) returns int as
+ $$begin return $1; end$$ language plpgsql parallel restricted;
+-- Serializable isolation would disable parallel query, so explicitly use an
+-- arbitrary other level.
+begin isolation level repeatable read;
+-- encourage use of parallel plans
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+-- Parallel Append with partial-subplans
+explain (costs off)
+ select round(avg(aa)), sum(aa) from a_star;
+ QUERY PLAN
+-----------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 3
+ -> Partial Aggregate
+ -> Parallel Append
+ -> Parallel Seq Scan on d_star
+ -> Parallel Seq Scan on f_star
+ -> Parallel Seq Scan on e_star
+ -> Parallel Seq Scan on b_star
+ -> Parallel Seq Scan on c_star
+ -> Parallel Seq Scan on a_star
+(11 rows)
+
+select round(avg(aa)), sum(aa) from a_star a1;
+ round | sum
+-------+-----
+ 14 | 355
+(1 row)
+
+-- Parallel Append with both partial and non-partial subplans
+alter table c_star set (parallel_workers = 0);
+alter table d_star set (parallel_workers = 0);
+explain (costs off)
+ select round(avg(aa)), sum(aa) from a_star;
+ QUERY PLAN
+-----------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 3
+ -> Partial Aggregate
+ -> Parallel Append
+ -> Seq Scan on d_star
+ -> Seq Scan on c_star
+ -> Parallel Seq Scan on f_star
+ -> Parallel Seq Scan on e_star
+ -> Parallel Seq Scan on b_star
+ -> Parallel Seq Scan on a_star
+(11 rows)
+
+select round(avg(aa)), sum(aa) from a_star a2;
+ round | sum
+-------+-----
+ 14 | 355
+(1 row)
+
+-- Parallel Append with only non-partial subplans
+alter table a_star set (parallel_workers = 0);
+alter table b_star set (parallel_workers = 0);
+alter table e_star set (parallel_workers = 0);
+alter table f_star set (parallel_workers = 0);
+explain (costs off)
+ select round(avg(aa)), sum(aa) from a_star;
+ QUERY PLAN
+--------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 3
+ -> Partial Aggregate
+ -> Parallel Append
+ -> Seq Scan on d_star
+ -> Seq Scan on f_star
+ -> Seq Scan on e_star
+ -> Seq Scan on b_star
+ -> Seq Scan on c_star
+ -> Seq Scan on a_star
+(11 rows)
+
+select round(avg(aa)), sum(aa) from a_star a3;
+ round | sum
+-------+-----
+ 14 | 355
+(1 row)
+
+-- Temporary hack to investigate whether extra vacuum/analyze is happening
+select relname, relpages, reltuples
+from pg_class
+where relname like '__star' order by relname;
+ relname | relpages | reltuples
+---------+----------+-----------
+ a_star | 5 | 3
+ b_star | 6 | 4
+ c_star | 6 | 4
+ d_star | 8 | 16
+ e_star | 7 | 7
+ f_star | 8 | 16
+(6 rows)
+
+-- Disable Parallel Append
+alter table a_star reset (parallel_workers);
+alter table b_star reset (parallel_workers);
+alter table c_star reset (parallel_workers);
+alter table d_star reset (parallel_workers);
+alter table e_star reset (parallel_workers);
+alter table f_star reset (parallel_workers);
+set enable_parallel_append to off;
+explain (costs off)
+ select round(avg(aa)), sum(aa) from a_star;
+ QUERY PLAN
+-----------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Append
+ -> Parallel Seq Scan on a_star
+ -> Parallel Seq Scan on b_star
+ -> Parallel Seq Scan on c_star
+ -> Parallel Seq Scan on d_star
+ -> Parallel Seq Scan on e_star
+ -> Parallel Seq Scan on f_star
+(11 rows)
+
+select round(avg(aa)), sum(aa) from a_star a4;
+ round | sum
+-------+-----
+ 14 | 355
+(1 row)
+
+reset enable_parallel_append;
+-- Parallel Append that runs serially
+create function sp_test_func() returns setof text as
+$$ select 'foo'::varchar union all select 'bar'::varchar $$
+language sql stable;
+select sp_test_func() order by 1;
+ sp_test_func
+--------------
+ bar
+ foo
+(2 rows)
+
+-- Parallel Append is not to be used when the subpath depends on the outer param
+create table part_pa_test(a int, b int) partition by range(a);
+create table part_pa_test_p1 partition of part_pa_test for values from (minvalue) to (0);
+create table part_pa_test_p2 partition of part_pa_test for values from (0) to (maxvalue);
+explain (costs off)
+ select (select max((select pa1.b from part_pa_test pa1 where pa1.a = pa2.a)))
+ from part_pa_test pa2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 3
+ -> Parallel Append
+ -> Parallel Seq Scan on part_pa_test_p1 pa2
+ -> Parallel Seq Scan on part_pa_test_p2 pa2_1
+ SubPlan 2
+ -> Result
+ SubPlan 1
+ -> Append
+ -> Seq Scan on part_pa_test_p1 pa1
+ Filter: (a = pa2.a)
+ -> Seq Scan on part_pa_test_p2 pa1_1
+ Filter: (a = pa2.a)
+(14 rows)
+
+drop table part_pa_test;
+-- test with leader participation disabled
+set parallel_leader_participation = off;
+explain (costs off)
+ select count(*) from tenk1 where stringu1 = 'GRAAAA';
+ QUERY PLAN
+---------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Seq Scan on tenk1
+ Filter: (stringu1 = 'GRAAAA'::name)
+(6 rows)
+
+select count(*) from tenk1 where stringu1 = 'GRAAAA';
+ count
+-------
+ 15
+(1 row)
+
+-- test with leader participation disabled, but no workers available (so
+-- the leader will have to run the plan despite the setting)
+set max_parallel_workers = 0;
+explain (costs off)
+ select count(*) from tenk1 where stringu1 = 'GRAAAA';
+ QUERY PLAN
+---------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Seq Scan on tenk1
+ Filter: (stringu1 = 'GRAAAA'::name)
+(6 rows)
+
+select count(*) from tenk1 where stringu1 = 'GRAAAA';
+ count
+-------
+ 15
+(1 row)
+
+reset max_parallel_workers;
+reset parallel_leader_participation;
+-- test that parallel_restricted function doesn't run in worker
+alter table tenk1 set (parallel_workers = 4);
+explain (verbose, costs off)
+select sp_parallel_restricted(unique1) from tenk1
+ where stringu1 = 'GRAAAA' order by 1;
+ QUERY PLAN
+---------------------------------------------------------
+ Sort
+ Output: (sp_parallel_restricted(unique1))
+ Sort Key: (sp_parallel_restricted(tenk1.unique1))
+ -> Gather
+ Output: sp_parallel_restricted(unique1)
+ Workers Planned: 4
+ -> Parallel Seq Scan on public.tenk1
+ Output: unique1
+ Filter: (tenk1.stringu1 = 'GRAAAA'::name)
+(9 rows)
+
+-- test parallel plan when group by expression is in target list.
+explain (costs off)
+ select length(stringu1) from tenk1 group by length(stringu1);
+ QUERY PLAN
+---------------------------------------------------
+ Finalize HashAggregate
+ Group Key: (length((stringu1)::text))
+ -> Gather
+ Workers Planned: 4
+ -> Partial HashAggregate
+ Group Key: length((stringu1)::text)
+ -> Parallel Seq Scan on tenk1
+(7 rows)
+
+select length(stringu1) from tenk1 group by length(stringu1);
+ length
+--------
+ 6
+(1 row)
+
+explain (costs off)
+ select stringu1, count(*) from tenk1 group by stringu1 order by stringu1;
+ QUERY PLAN
+----------------------------------------------------
+ Sort
+ Sort Key: stringu1
+ -> Finalize HashAggregate
+ Group Key: stringu1
+ -> Gather
+ Workers Planned: 4
+ -> Partial HashAggregate
+ Group Key: stringu1
+ -> Parallel Seq Scan on tenk1
+(9 rows)
+
+-- test that parallel plan for aggregates is not selected when
+-- target list contains parallel restricted clause.
+explain (costs off)
+ select sum(sp_parallel_restricted(unique1)) from tenk1
+ group by(sp_parallel_restricted(unique1));
+ QUERY PLAN
+-------------------------------------------------------------------
+ HashAggregate
+ Group Key: sp_parallel_restricted(unique1)
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Index Only Scan using tenk1_unique1 on tenk1
+(5 rows)
+
+-- test prepared statement
+prepare tenk1_count(integer) As select count((unique1)) from tenk1 where hundred > $1;
+explain (costs off) execute tenk1_count(1);
+ QUERY PLAN
+----------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Seq Scan on tenk1
+ Filter: (hundred > 1)
+(6 rows)
+
+execute tenk1_count(1);
+ count
+-------
+ 9800
+(1 row)
+
+deallocate tenk1_count;
+-- test parallel plans for queries containing un-correlated subplans.
+alter table tenk2 set (parallel_workers = 0);
+explain (costs off)
+ select count(*) from tenk1 where (two, four) not in
+ (select hundred, thousand from tenk2 where thousand > 100);
+ QUERY PLAN
+------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Seq Scan on tenk1
+ Filter: (NOT (hashed SubPlan 1))
+ SubPlan 1
+ -> Seq Scan on tenk2
+ Filter: (thousand > 100)
+(9 rows)
+
+select count(*) from tenk1 where (two, four) not in
+ (select hundred, thousand from tenk2 where thousand > 100);
+ count
+-------
+ 10000
+(1 row)
+
+-- this is not parallel-safe due to use of random() within SubLink's testexpr:
+explain (costs off)
+ select * from tenk1 where (unique1 + random())::integer not in
+ (select ten from tenk2);
+ QUERY PLAN
+------------------------------------
+ Seq Scan on tenk1
+ Filter: (NOT (hashed SubPlan 1))
+ SubPlan 1
+ -> Seq Scan on tenk2
+(4 rows)
+
+alter table tenk2 reset (parallel_workers);
+-- test parallel plan for a query containing initplan.
+set enable_indexscan = off;
+set enable_indexonlyscan = off;
+set enable_bitmapscan = off;
+alter table tenk2 set (parallel_workers = 2);
+explain (costs off)
+ select count(*) from tenk1
+ where tenk1.unique1 = (Select max(tenk2.unique1) from tenk2);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ InitPlan 1 (returns $2)
+ -> Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Seq Scan on tenk2
+ -> Gather
+ Workers Planned: 4
+ Params Evaluated: $2
+ -> Parallel Seq Scan on tenk1
+ Filter: (unique1 = $2)
+(12 rows)
+
+select count(*) from tenk1
+ where tenk1.unique1 = (Select max(tenk2.unique1) from tenk2);
+ count
+-------
+ 1
+(1 row)
+
+reset enable_indexscan;
+reset enable_indexonlyscan;
+reset enable_bitmapscan;
+alter table tenk2 reset (parallel_workers);
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+explain (costs off)
+ select count((unique1)) from tenk1 where hundred > 1;
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Index Scan using tenk1_hundred on tenk1
+ Index Cond: (hundred > 1)
+(6 rows)
+
+select count((unique1)) from tenk1 where hundred > 1;
+ count
+-------
+ 9800
+(1 row)
+
+-- test parallel index-only scans.
+explain (costs off)
+ select count(*) from tenk1 where thousand > 95;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Index Only Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: (thousand > 95)
+(6 rows)
+
+select count(*) from tenk1 where thousand > 95;
+ count
+-------
+ 9040
+(1 row)
+
+-- test rescan cases too
+set enable_material = false;
+explain (costs off)
+select * from
+ (select count(unique1) from tenk1 where hundred > 10) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Nested Loop Left Join
+ -> Values Scan on "*VALUES*"
+ -> Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Index Scan using tenk1_hundred on tenk1
+ Index Cond: (hundred > 10)
+(8 rows)
+
+select * from
+ (select count(unique1) from tenk1 where hundred > 10) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ count | x
+-------+---
+ 8900 | 1
+ 8900 | 2
+ 8900 | 3
+(3 rows)
+
+explain (costs off)
+select * from
+ (select count(*) from tenk1 where thousand > 99) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Nested Loop Left Join
+ -> Values Scan on "*VALUES*"
+ -> Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Index Only Scan using tenk1_thous_tenthous on tenk1
+ Index Cond: (thousand > 99)
+(8 rows)
+
+select * from
+ (select count(*) from tenk1 where thousand > 99) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ count | x
+-------+---
+ 9000 | 1
+ 9000 | 2
+ 9000 | 3
+(3 rows)
+
+reset enable_material;
+reset enable_seqscan;
+reset enable_bitmapscan;
+-- test parallel bitmap heap scan.
+set enable_seqscan to off;
+set enable_indexscan to off;
+set enable_hashjoin to off;
+set enable_mergejoin to off;
+set enable_material to off;
+-- test prefetching, if the platform allows it
+DO $$
+BEGIN
+ SET effective_io_concurrency = 50;
+EXCEPTION WHEN invalid_parameter_value THEN
+END $$;
+set work_mem='64kB'; --set small work mem to force lossy pages
+explain (costs off)
+ select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
+ QUERY PLAN
+------------------------------------------------------------
+ Aggregate
+ -> Nested Loop
+ -> Seq Scan on tenk2
+ Filter: (thousand = 0)
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Bitmap Heap Scan on tenk1
+ Recheck Cond: (hundred > 1)
+ -> Bitmap Index Scan on tenk1_hundred
+ Index Cond: (hundred > 1)
+(10 rows)
+
+select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
+ count
+-------
+ 98000
+(1 row)
+
+create table bmscantest (a int, t text);
+insert into bmscantest select r, 'fooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo' FROM generate_series(1,100000) r;
+create index i_bmtest ON bmscantest(a);
+select count(*) from bmscantest where a>1;
+ count
+-------
+ 99999
+(1 row)
+
+-- test accumulation of stats for parallel nodes
+reset enable_seqscan;
+alter table tenk2 set (parallel_workers = 0);
+explain (analyze, timing off, summary off, costs off)
+ select count(*) from tenk1, tenk2 where tenk1.hundred > 1
+ and tenk2.thousand=0;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Aggregate (actual rows=1 loops=1)
+ -> Nested Loop (actual rows=98000 loops=1)
+ -> Seq Scan on tenk2 (actual rows=10 loops=1)
+ Filter: (thousand = 0)
+ Rows Removed by Filter: 9990
+ -> Gather (actual rows=9800 loops=10)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
+ Filter: (hundred > 1)
+ Rows Removed by Filter: 40
+(11 rows)
+
+alter table tenk2 reset (parallel_workers);
+reset work_mem;
+create function explain_parallel_sort_stats() returns setof text
+language plpgsql as
+$$
+declare ln text;
+begin
+ for ln in
+ explain (analyze, timing off, summary off, costs off)
+ select * from
+ (select ten from tenk1 where ten < 100 order by ten) ss
+ right join (values (1),(2),(3)) v(x) on true
+ loop
+ ln := regexp_replace(ln, 'Memory: \S*', 'Memory: xxx');
+ return next ln;
+ end loop;
+end;
+$$;
+select * from explain_parallel_sort_stats();
+ explain_parallel_sort_stats
+--------------------------------------------------------------------------
+ Nested Loop Left Join (actual rows=30000 loops=1)
+ -> Values Scan on "*VALUES*" (actual rows=3 loops=1)
+ -> Gather Merge (actual rows=10000 loops=3)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Sort (actual rows=2000 loops=15)
+ Sort Key: tenk1.ten
+ Sort Method: quicksort Memory: xxx
+ Worker 0: Sort Method: quicksort Memory: xxx
+ Worker 1: Sort Method: quicksort Memory: xxx
+ Worker 2: Sort Method: quicksort Memory: xxx
+ Worker 3: Sort Method: quicksort Memory: xxx
+ -> Parallel Seq Scan on tenk1 (actual rows=2000 loops=15)
+ Filter: (ten < 100)
+(14 rows)
+
+reset enable_indexscan;
+reset enable_hashjoin;
+reset enable_mergejoin;
+reset enable_material;
+reset effective_io_concurrency;
+drop table bmscantest;
+drop function explain_parallel_sort_stats();
+-- test parallel merge join path.
+set enable_hashjoin to off;
+set enable_nestloop to off;
+explain (costs off)
+ select count(*) from tenk1, tenk2 where tenk1.unique1 = tenk2.unique1;
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Merge Join
+ Merge Cond: (tenk1.unique1 = tenk2.unique1)
+ -> Parallel Index Only Scan using tenk1_unique1 on tenk1
+ -> Index Only Scan using tenk2_unique1 on tenk2
+(8 rows)
+
+select count(*) from tenk1, tenk2 where tenk1.unique1 = tenk2.unique1;
+ count
+-------
+ 10000
+(1 row)
+
+reset enable_hashjoin;
+reset enable_nestloop;
+-- test gather merge
+set enable_hashagg = false;
+explain (costs off)
+ select count(*) from tenk1 group by twenty;
+ QUERY PLAN
+----------------------------------------------------
+ Finalize GroupAggregate
+ Group Key: twenty
+ -> Gather Merge
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Group Key: twenty
+ -> Sort
+ Sort Key: twenty
+ -> Parallel Seq Scan on tenk1
+(9 rows)
+
+select count(*) from tenk1 group by twenty;
+ count
+-------
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+(20 rows)
+
+--test expressions in targetlist are pushed down for gather merge
+create function sp_simple_func(var1 integer) returns integer
+as $$
+begin
+ return var1 + 10;
+end;
+$$ language plpgsql PARALLEL SAFE;
+explain (costs off, verbose)
+ select ten, sp_simple_func(ten) from tenk1 where ten < 100 order by ten;
+ QUERY PLAN
+-----------------------------------------------------
+ Gather Merge
+ Output: ten, (sp_simple_func(ten))
+ Workers Planned: 4
+ -> Result
+ Output: ten, sp_simple_func(ten)
+ -> Sort
+ Output: ten
+ Sort Key: tenk1.ten
+ -> Parallel Seq Scan on public.tenk1
+ Output: ten
+ Filter: (tenk1.ten < 100)
+(11 rows)
+
+drop function sp_simple_func(integer);
+-- test handling of SRFs in targetlist (bug in 10.0)
+explain (costs off)
+ select count(*), generate_series(1,2) from tenk1 group by twenty;
+ QUERY PLAN
+----------------------------------------------------------
+ ProjectSet
+ -> Finalize GroupAggregate
+ Group Key: twenty
+ -> Gather Merge
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Group Key: twenty
+ -> Sort
+ Sort Key: twenty
+ -> Parallel Seq Scan on tenk1
+(10 rows)
+
+select count(*), generate_series(1,2) from tenk1 group by twenty;
+ count | generate_series
+-------+-----------------
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+ 500 | 1
+ 500 | 2
+(40 rows)
+
+-- test gather merge with parallel leader participation disabled
+set parallel_leader_participation = off;
+explain (costs off)
+ select count(*) from tenk1 group by twenty;
+ QUERY PLAN
+----------------------------------------------------
+ Finalize GroupAggregate
+ Group Key: twenty
+ -> Gather Merge
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Group Key: twenty
+ -> Sort
+ Sort Key: twenty
+ -> Parallel Seq Scan on tenk1
+(9 rows)
+
+select count(*) from tenk1 group by twenty;
+ count
+-------
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+ 500
+(20 rows)
+
+reset parallel_leader_participation;
+--test rescan behavior of gather merge
+set enable_material = false;
+explain (costs off)
+select * from
+ (select string4, count(unique2)
+ from tenk1 group by string4 order by string4) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ QUERY PLAN
+----------------------------------------------------------
+ Nested Loop Left Join
+ -> Values Scan on "*VALUES*"
+ -> Finalize GroupAggregate
+ Group Key: tenk1.string4
+ -> Gather Merge
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Group Key: tenk1.string4
+ -> Sort
+ Sort Key: tenk1.string4
+ -> Parallel Seq Scan on tenk1
+(11 rows)
+
+select * from
+ (select string4, count(unique2)
+ from tenk1 group by string4 order by string4) ss
+ right join (values (1),(2),(3)) v(x) on true;
+ string4 | count | x
+---------+-------+---
+ AAAAxx | 2500 | 1
+ HHHHxx | 2500 | 1
+ OOOOxx | 2500 | 1
+ VVVVxx | 2500 | 1
+ AAAAxx | 2500 | 2
+ HHHHxx | 2500 | 2
+ OOOOxx | 2500 | 2
+ VVVVxx | 2500 | 2
+ AAAAxx | 2500 | 3
+ HHHHxx | 2500 | 3
+ OOOOxx | 2500 | 3
+ VVVVxx | 2500 | 3
+(12 rows)
+
+reset enable_material;
+reset enable_hashagg;
+-- check parallelized int8 aggregate (bug #14897)
+explain (costs off)
+select avg(unique1::int8) from tenk1;
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Index Only Scan using tenk1_unique1 on tenk1
+(5 rows)
+
+select avg(unique1::int8) from tenk1;
+ avg
+-----------------------
+ 4999.5000000000000000
+(1 row)
+
+-- gather merge test with a LIMIT
+explain (costs off)
+ select fivethous from tenk1 order by fivethous limit 4;
+ QUERY PLAN
+----------------------------------------------
+ Limit
+ -> Gather Merge
+ Workers Planned: 4
+ -> Sort
+ Sort Key: fivethous
+ -> Parallel Seq Scan on tenk1
+(6 rows)
+
+select fivethous from tenk1 order by fivethous limit 4;
+ fivethous
+-----------
+ 0
+ 0
+ 1
+ 1
+(4 rows)
+
+-- gather merge test with 0 worker
+set max_parallel_workers = 0;
+explain (costs off)
+ select string4 from tenk1 order by string4 limit 5;
+ QUERY PLAN
+----------------------------------------------
+ Limit
+ -> Gather Merge
+ Workers Planned: 4
+ -> Sort
+ Sort Key: string4
+ -> Parallel Seq Scan on tenk1
+(6 rows)
+
+select string4 from tenk1 order by string4 limit 5;
+ string4
+---------
+ AAAAxx
+ AAAAxx
+ AAAAxx
+ AAAAxx
+ AAAAxx
+(5 rows)
+
+-- gather merge test with 0 workers, with parallel leader
+-- participation disabled (the leader will have to run the plan
+-- despite the setting)
+set parallel_leader_participation = off;
+explain (costs off)
+ select string4 from tenk1 order by string4 limit 5;
+ QUERY PLAN
+----------------------------------------------
+ Limit
+ -> Gather Merge
+ Workers Planned: 4
+ -> Sort
+ Sort Key: string4
+ -> Parallel Seq Scan on tenk1
+(6 rows)
+
+select string4 from tenk1 order by string4 limit 5;
+ string4
+---------
+ AAAAxx
+ AAAAxx
+ AAAAxx
+ AAAAxx
+ AAAAxx
+(5 rows)
+
+reset parallel_leader_participation;
+reset max_parallel_workers;
+SAVEPOINT settings;
+SET LOCAL force_parallel_mode = 1;
+explain (costs off)
+ select stringu1::int2 from tenk1 where unique1 = 1;
+ QUERY PLAN
+-----------------------------------------------
+ Gather
+ Workers Planned: 1
+ Single Copy: true
+ -> Index Scan using tenk1_unique1 on tenk1
+ Index Cond: (unique1 = 1)
+(5 rows)
+
+ROLLBACK TO SAVEPOINT settings;
+-- exercise record typmod remapping between backends
+CREATE FUNCTION make_record(n int)
+ RETURNS RECORD LANGUAGE plpgsql PARALLEL SAFE AS
+$$
+BEGIN
+ RETURN CASE n
+ WHEN 1 THEN ROW(1)
+ WHEN 2 THEN ROW(1, 2)
+ WHEN 3 THEN ROW(1, 2, 3)
+ WHEN 4 THEN ROW(1, 2, 3, 4)
+ ELSE ROW(1, 2, 3, 4, 5)
+ END;
+END;
+$$;
+SAVEPOINT settings;
+SET LOCAL force_parallel_mode = 1;
+SELECT make_record(x) FROM (SELECT generate_series(1, 5) x) ss ORDER BY x;
+ make_record
+-------------
+ (1)
+ (1,2)
+ (1,2,3)
+ (1,2,3,4)
+ (1,2,3,4,5)
+(5 rows)
+
+ROLLBACK TO SAVEPOINT settings;
+DROP function make_record(n int);
+-- test the sanity of parallel query after the active role is dropped.
+drop role if exists regress_parallel_worker;
+NOTICE: role "regress_parallel_worker" does not exist, skipping
+create role regress_parallel_worker;
+set role regress_parallel_worker;
+reset session authorization;
+drop role regress_parallel_worker;
+set force_parallel_mode = 1;
+select count(*) from tenk1;
+ count
+-------
+ 10000
+(1 row)
+
+reset force_parallel_mode;
+reset role;
+-- Window function calculation can't be pushed to workers.
+explain (costs off, verbose)
+ select count(*) from tenk1 a where (unique1, two) in
+ (select unique1, row_number() over() from tenk1 b);
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
+ Aggregate
+ Output: count(*)
+ -> Hash Semi Join
+ Hash Cond: ((a.unique1 = b.unique1) AND (a.two = (row_number() OVER (?))))
+ -> Gather
+ Output: a.unique1, a.two
+ Workers Planned: 4
+ -> Parallel Seq Scan on public.tenk1 a
+ Output: a.unique1, a.two
+ -> Hash
+ Output: b.unique1, (row_number() OVER (?))
+ -> WindowAgg
+ Output: b.unique1, row_number() OVER (?)
+ -> Gather
+ Output: b.unique1
+ Workers Planned: 4
+ -> Parallel Index Only Scan using tenk1_unique1 on public.tenk1 b
+ Output: b.unique1
+(18 rows)
+
+-- LIMIT/OFFSET within sub-selects can't be pushed to workers.
+explain (costs off)
+ select * from tenk1 a where two in
+ (select two from tenk1 b where stringu1 like '%AAAA' limit 3);
+ QUERY PLAN
+---------------------------------------------------------------
+ Hash Semi Join
+ Hash Cond: (a.two = b.two)
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Seq Scan on tenk1 a
+ -> Hash
+ -> Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Seq Scan on tenk1 b
+ Filter: (stringu1 ~~ '%AAAA'::text)
+(11 rows)
+
+-- to increase the parallel query test coverage
+SAVEPOINT settings;
+SET LOCAL force_parallel_mode = 1;
+EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM tenk1;
+ QUERY PLAN
+-------------------------------------------------------------
+ Gather (actual rows=10000 loops=1)
+ Workers Planned: 4
+ Workers Launched: 4
+ -> Parallel Seq Scan on tenk1 (actual rows=2000 loops=5)
+(4 rows)
+
+ROLLBACK TO SAVEPOINT settings;
+-- provoke error in worker
+SAVEPOINT settings;
+SET LOCAL force_parallel_mode = 1;
+select stringu1::int2 from tenk1 where unique1 = 1;
+ERROR: invalid input syntax for type smallint: "BAAAAA"
+CONTEXT: parallel worker
+ROLLBACK TO SAVEPOINT settings;
+-- test interaction with set-returning functions
+SAVEPOINT settings;
+-- multiple subqueries under a single Gather node
+-- must set parallel_setup_cost > 0 to discourage multiple Gather nodes
+SET LOCAL parallel_setup_cost = 10;
+EXPLAIN (COSTS OFF)
+SELECT unique1 FROM tenk1 WHERE fivethous = tenthous + 1
+UNION ALL
+SELECT unique1 FROM tenk1 WHERE fivethous = tenthous + 1;
+ QUERY PLAN
+----------------------------------------------------
+ Gather
+ Workers Planned: 4
+ -> Parallel Append
+ -> Parallel Seq Scan on tenk1
+ Filter: (fivethous = (tenthous + 1))
+ -> Parallel Seq Scan on tenk1 tenk1_1
+ Filter: (fivethous = (tenthous + 1))
+(7 rows)
+
+ROLLBACK TO SAVEPOINT settings;
+-- can't use multiple subqueries under a single Gather node due to initPlans
+EXPLAIN (COSTS OFF)
+SELECT unique1 FROM tenk1 WHERE fivethous =
+ (SELECT unique1 FROM tenk1 WHERE fivethous = 1 LIMIT 1)
+UNION ALL
+SELECT unique1 FROM tenk1 WHERE fivethous =
+ (SELECT unique2 FROM tenk1 WHERE fivethous = 1 LIMIT 1)
+ORDER BY 1;
+ QUERY PLAN
+--------------------------------------------------------------------
+ Sort
+ Sort Key: tenk1.unique1
+ -> Append
+ -> Gather
+ Workers Planned: 4
+ Params Evaluated: $1
+ InitPlan 1 (returns $1)
+ -> Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Seq Scan on tenk1 tenk1_2
+ Filter: (fivethous = 1)
+ -> Parallel Seq Scan on tenk1
+ Filter: (fivethous = $1)
+ -> Gather
+ Workers Planned: 4
+ Params Evaluated: $3
+ InitPlan 2 (returns $3)
+ -> Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Seq Scan on tenk1 tenk1_3
+ Filter: (fivethous = 1)
+ -> Parallel Seq Scan on tenk1 tenk1_1
+ Filter: (fivethous = $3)
+(25 rows)
+
+-- test interaction with SRFs
+SELECT * FROM information_schema.foreign_data_wrapper_options
+ORDER BY 1, 2, 3;
+ foreign_data_wrapper_catalog | foreign_data_wrapper_name | option_name | option_value
+------------------------------+---------------------------+-------------+--------------
+(0 rows)
+
+-- test passing expanded-value representations to workers
+CREATE FUNCTION make_some_array(int,int) returns int[] as
+$$declare x int[];
+ begin
+ x[1] := $1;
+ x[2] := $2;
+ return x;
+ end$$ language plpgsql parallel safe;
+CREATE TABLE fooarr(f1 text, f2 int[], f3 text);
+INSERT INTO fooarr VALUES('1', ARRAY[1,2], 'one');
+PREPARE pstmt(text, int[]) AS SELECT * FROM fooarr WHERE f1 = $1 AND f2 = $2;
+EXPLAIN (COSTS OFF) EXECUTE pstmt('1', make_some_array(1,2));
+ QUERY PLAN
+------------------------------------------------------------------
+ Gather
+ Workers Planned: 3
+ -> Parallel Seq Scan on fooarr
+ Filter: ((f1 = '1'::text) AND (f2 = '{1,2}'::integer[]))
+(4 rows)
+
+EXECUTE pstmt('1', make_some_array(1,2));
+ f1 | f2 | f3
+----+-------+-----
+ 1 | {1,2} | one
+(1 row)
+
+DEALLOCATE pstmt;
+-- test interaction between subquery and partial_paths
+CREATE VIEW tenk1_vw_sec WITH (security_barrier) AS SELECT * FROM tenk1;
+EXPLAIN (COSTS OFF)
+SELECT 1 FROM tenk1_vw_sec
+ WHERE (SELECT sum(f1) FROM int4_tbl WHERE f1 < unique1) < 100;
+ QUERY PLAN
+-------------------------------------------------------------------
+ Subquery Scan on tenk1_vw_sec
+ Filter: ((SubPlan 1) < 100)
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Index Only Scan using tenk1_unique1 on tenk1
+ SubPlan 1
+ -> Aggregate
+ -> Seq Scan on int4_tbl
+ Filter: (f1 < tenk1_vw_sec.unique1)
+(9 rows)
+
+rollback;
diff --git a/src/test/regress/expected/strings_1.out b/src/test/regress/expected/strings_1.out
new file mode 100644
index 0000000000..c1fe70a012
--- /dev/null
+++ b/src/test/regress/expected/strings_1.out
@@ -0,0 +1,2056 @@
+--
+-- STRINGS
+-- Test various data entry syntaxes.
+--
+-- SQL string continuation syntax
+-- E021-03 character string literals
+SELECT 'first line'
+' - next line'
+ ' - third line'
+ AS "Three lines to one";
+ Three lines to one
+-------------------------------------
+ first line - next line - third line
+(1 row)
+
+-- illegal string continuation syntax
+SELECT 'first line'
+' - next line' /* this comment is not allowed here */
+' - third line'
+ AS "Illegal comment within continuation";
+ERROR: syntax error at or near "' - third line'"
+LINE 3: ' - third line'
+ ^
+-- Unicode escapes
+SET standard_conforming_strings TO on;
+SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
+ data
+------
+ data
+(1 row)
+
+SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061" UESCAPE '*';
+ dat\+000061
+-------------
+ dat\+000061
+(1 row)
+
+SELECT U&'a\\b' AS "a\b";
+ a\b
+-----
+ a\b
+(1 row)
+
+SELECT U&' \' UESCAPE '!' AS "tricky";
+ tricky
+--------
+ \
+(1 row)
+
+SELECT 'tricky' AS U&"\" UESCAPE '!';
+ \
+--------
+ tricky
+(1 row)
+
+SELECT U&'wrong: \061';
+ERROR: invalid Unicode escape
+LINE 1: SELECT U&'wrong: \061';
+ ^
+HINT: Unicode escapes must be \XXXX or \+XXXXXX.
+SELECT U&'wrong: \+0061';
+ERROR: invalid Unicode escape
+LINE 1: SELECT U&'wrong: \+0061';
+ ^
+HINT: Unicode escapes must be \XXXX or \+XXXXXX.
+SELECT U&'wrong: +0061' UESCAPE +;
+ERROR: UESCAPE must be followed by a simple string literal at or near "+"
+LINE 1: SELECT U&'wrong: +0061' UESCAPE +;
+ ^
+SELECT U&'wrong: +0061' UESCAPE '+';
+ERROR: invalid Unicode escape character at or near "'+'"
+LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
+ ^
+SELECT U&'wrong: \db99';
+ERROR: invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \db99';
+ ^
+SELECT U&'wrong: \db99xy';
+ERROR: invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \db99xy';
+ ^
+SELECT U&'wrong: \db99\\';
+ERROR: invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \db99\\';
+ ^
+SELECT U&'wrong: \db99\0061';
+ERROR: invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \db99\0061';
+ ^
+SELECT U&'wrong: \+00db99\+000061';
+ERROR: invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \+00db99\+000061';
+ ^
+SELECT U&'wrong: \+2FFFFF';
+ERROR: invalid Unicode escape value
+LINE 1: SELECT U&'wrong: \+2FFFFF';
+ ^
+-- while we're here, check the same cases in E-style literals
+SELECT E'd\u0061t\U00000061' AS "data";
+ data
+------
+ data
+(1 row)
+
+SELECT E'a\\b' AS "a\b";
+ a\b
+-----
+ a\b
+(1 row)
+
+SELECT E'wrong: \u061';
+ERROR: invalid Unicode escape
+LINE 1: SELECT E'wrong: \u061';
+ ^
+HINT: Unicode escapes must be \uXXXX or \UXXXXXXXX.
+SELECT E'wrong: \U0061';
+ERROR: invalid Unicode escape
+LINE 1: SELECT E'wrong: \U0061';
+ ^
+HINT: Unicode escapes must be \uXXXX or \UXXXXXXXX.
+SELECT E'wrong: \udb99';
+ERROR: invalid Unicode surrogate pair at or near "'"
+LINE 1: SELECT E'wrong: \udb99';
+ ^
+SELECT E'wrong: \udb99xy';
+ERROR: invalid Unicode surrogate pair at or near "x"
+LINE 1: SELECT E'wrong: \udb99xy';
+ ^
+SELECT E'wrong: \udb99\\';
+ERROR: invalid Unicode surrogate pair at or near "\"
+LINE 1: SELECT E'wrong: \udb99\\';
+ ^
+SELECT E'wrong: \udb99\u0061';
+ERROR: invalid Unicode surrogate pair at or near "\u0061"
+LINE 1: SELECT E'wrong: \udb99\u0061';
+ ^
+SELECT E'wrong: \U0000db99\U00000061';
+ERROR: invalid Unicode surrogate pair at or near "\U00000061"
+LINE 1: SELECT E'wrong: \U0000db99\U00000061';
+ ^
+SELECT E'wrong: \U002FFFFF';
+ERROR: invalid Unicode escape value at or near "\U002FFFFF"
+LINE 1: SELECT E'wrong: \U002FFFFF';
+ ^
+SET standard_conforming_strings TO off;
+SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061" UESCAPE '*';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061...
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&' \' UESCAPE '!' AS "tricky";
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&' \' UESCAPE '!' AS "tricky";
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT 'tricky' AS U&"\" UESCAPE '!';
+ \
+--------
+ tricky
+(1 row)
+
+SELECT U&'wrong: \061';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'wrong: \061';
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&'wrong: \+0061';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'wrong: \+0061';
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+SELECT U&'wrong: +0061' UESCAPE '+';
+ERROR: unsafe use of string constant with Unicode escapes
+LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
+ ^
+DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
+RESET standard_conforming_strings;
+-- bytea
+SET bytea_output TO hex;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\x De Ad Be Ef '::bytea;
+ bytea
+------------
+ \xdeadbeef
+(1 row)
+
+SELECT E'\\xDeAdBeE'::bytea;
+ERROR: invalid hexadecimal data: odd number of digits
+LINE 1: SELECT E'\\xDeAdBeE'::bytea;
+ ^
+SELECT E'\\xDeAdBeEx'::bytea;
+ERROR: invalid hexadecimal digit: "x"
+LINE 1: SELECT E'\\xDeAdBeEx'::bytea;
+ ^
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------
+ \xde00beef
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+--------------------
+ \x4465416442654566
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465006442654566
+(1 row)
+
+SELECT E'De\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+--------------------
+ \x4465536442654566
+(1 row)
+
+SELECT E'De\\678dBeEf'::bytea;
+ERROR: invalid input syntax for type bytea
+LINE 1: SELECT E'De\\678dBeEf'::bytea;
+ ^
+SET bytea_output TO escape;
+SELECT E'\\xDeAdBeEf'::bytea;
+ bytea
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\x De Ad Be Ef '::bytea;
+ bytea
+------------------
+ \336\255\276\357
+(1 row)
+
+SELECT E'\\xDe00BeEf'::bytea;
+ bytea
+------------------
+ \336\000\276\357
+(1 row)
+
+SELECT E'DeAdBeEf'::bytea;
+ bytea
+----------
+ DeAdBeEf
+(1 row)
+
+SELECT E'De\\000dBeEf'::bytea;
+ bytea
+-------------
+ De\000dBeEf
+(1 row)
+
+SELECT E'De\\123dBeEf'::bytea;
+ bytea
+----------
+ DeSdBeEf
+(1 row)
+
+--
+-- test conversions between various string types
+-- E021-10 implicit casting among the character data types
+--
+SELECT CAST(f1 AS text) AS "text(char)" FROM CHAR_TBL;
+ text(char)
+------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(f1 AS text) AS "text(varchar)" FROM VARCHAR_TBL;
+ text(varchar)
+---------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(name 'namefield' AS text) AS "text(name)";
+ text(name)
+------------
+ namefield
+(1 row)
+
+-- since this is an explicit cast, it should truncate w/o error:
+SELECT CAST(f1 AS char(10)) AS "char(text)" FROM TEXT_TBL;
+ char(text)
+------------
+ doh!
+ hi de ho n
+(2 rows)
+
+-- note: implicit-cast case is tested in char.sql
+SELECT CAST(f1 AS char(20)) AS "char(text)" FROM TEXT_TBL;
+ char(text)
+----------------------
+ doh!
+ hi de ho neighbor
+(2 rows)
+
+SELECT CAST(f1 AS char(10)) AS "char(varchar)" FROM VARCHAR_TBL;
+ char(varchar)
+---------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(name 'namefield' AS char(10)) AS "char(name)";
+ char(name)
+------------
+ namefield
+(1 row)
+
+SELECT CAST(f1 AS varchar) AS "varchar(text)" FROM TEXT_TBL;
+ varchar(text)
+-------------------
+ doh!
+ hi de ho neighbor
+(2 rows)
+
+SELECT CAST(f1 AS varchar) AS "varchar(char)" FROM CHAR_TBL;
+ varchar(char)
+---------------
+ a
+ ab
+ abcd
+ abcd
+(4 rows)
+
+SELECT CAST(name 'namefield' AS varchar) AS "varchar(name)";
+ varchar(name)
+---------------
+ namefield
+(1 row)
+
+--
+-- test SQL string functions
+-- E### and T### are feature reference numbers from SQL99
+--
+-- E021-09 trim function
+SELECT TRIM(BOTH FROM ' bunch o blanks ') = 'bunch o blanks' AS "bunch o blanks";
+ bunch o blanks
+----------------
+ t
+(1 row)
+
+SELECT TRIM(LEADING FROM ' bunch o blanks ') = 'bunch o blanks ' AS "bunch o blanks ";
+ bunch o blanks
+------------------
+ t
+(1 row)
+
+SELECT TRIM(TRAILING FROM ' bunch o blanks ') = ' bunch o blanks' AS " bunch o blanks";
+ bunch o blanks
+------------------
+ t
+(1 row)
+
+SELECT TRIM(BOTH 'x' FROM 'xxxxxsome Xsxxxxx') = 'some Xs' AS "some Xs";
+ some Xs
+---------
+ t
+(1 row)
+
+-- E021-06 substring expression
+SELECT SUBSTRING('1234567890' FROM 3) = '34567890' AS "34567890";
+ 34567890
+----------
+ t
+(1 row)
+
+SELECT SUBSTRING('1234567890' FROM 4 FOR 3) = '456' AS "456";
+ 456
+-----
+ t
+(1 row)
+
+-- T581 regular expression substring (with SQL's bizarre regexp syntax)
+SELECT SUBSTRING('abcdefg' SIMILAR 'a#"(b_d)#"%' ESCAPE '#') AS "bcd";
+ bcd
+-----
+ bcd
+(1 row)
+
+-- obsolete SQL99 syntax
+SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
+ bcd
+-----
+ bcd
+(1 row)
+
+-- No match should return NULL
+SELECT SUBSTRING('abcdefg' SIMILAR '#"(b_d)#"%' ESCAPE '#') IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+-- Null inputs should return NULL
+SELECT SUBSTRING('abcdefg' SIMILAR '%' ESCAPE NULL) IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+SELECT SUBSTRING(NULL SIMILAR '%' ESCAPE '#') IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+SELECT SUBSTRING('abcdefg' SIMILAR NULL ESCAPE '#') IS NULL AS "True";
+ True
+------
+ t
+(1 row)
+
+-- The first and last parts should act non-greedy
+SELECT SUBSTRING('abcdefg' SIMILAR 'a#"%#"g' ESCAPE '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' SIMILAR 'a*#"%#"g*' ESCAPE '#') AS "abcdefg";
+ abcdefg
+---------
+ abcdefg
+(1 row)
+
+-- Vertical bar in any part affects only that part
+SELECT SUBSTRING('abcdefg' SIMILAR 'a|b#"%#"g' ESCAPE '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' SIMILAR 'a#"%#"x|g' ESCAPE '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+SELECT SUBSTRING('abcdefg' SIMILAR 'a#"%|ab#"g' ESCAPE '#') AS "bcdef";
+ bcdef
+-------
+ bcdef
+(1 row)
+
+-- Can't have more than two part separators
+SELECT SUBSTRING('abcdefg' SIMILAR 'a*#"%#"g*#"x' ESCAPE '#') AS "error";
+ERROR: SQL regular expression may not contain more than two escape-double-quote separators
+CONTEXT: SQL function "substring" statement 1
+-- Postgres extension: with 0 or 1 separator, assume parts 1 and 3 are empty
+SELECT SUBSTRING('abcdefg' SIMILAR 'a#"%g' ESCAPE '#') AS "bcdefg";
+ bcdefg
+--------
+ bcdefg
+(1 row)
+
+SELECT SUBSTRING('abcdefg' SIMILAR 'a%g' ESCAPE '#') AS "abcdefg";
+ abcdefg
+---------
+ abcdefg
+(1 row)
+
+-- substring() with just two arguments is not allowed by SQL spec;
+-- we accept it, but we interpret the pattern as a POSIX regexp not SQL
+SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde";
+ cde
+-----
+ cde
+(1 row)
+
+-- With a parenthesized subexpression, return only what matches the subexpr
+SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde";
+ cde
+-----
+ cde
+(1 row)
+
+-- Check behavior of SIMILAR TO, which uses largely the same regexp variant
+SELECT 'abcdefg' SIMILAR TO '_bcd%' AS true;
+ true
+------
+ t
+(1 row)
+
+SELECT 'abcdefg' SIMILAR TO 'bcd%' AS false;
+ false
+-------
+ f
+(1 row)
+
+SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '#' AS false;
+ false
+-------
+ f
+(1 row)
+
+SELECT 'abcd%' SIMILAR TO '_bcd#%' ESCAPE '#' AS true;
+ true
+------
+ t
+(1 row)
+
+-- Postgres uses '\' as the default escape character, which is not per spec
+SELECT 'abcdefg' SIMILAR TO '_bcd\%' AS false;
+ false
+-------
+ f
+(1 row)
+
+-- and an empty string to mean "no escape", which is also not per spec
+SELECT 'abcd\efg' SIMILAR TO '_bcd\%' ESCAPE '' AS true;
+ true
+------
+ t
+(1 row)
+
+-- these behaviors are per spec, though:
+SELECT 'abcdefg' SIMILAR TO '_bcd%' ESCAPE NULL AS null;
+ null
+------
+
+(1 row)
+
+SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '##' AS error;
+ERROR: invalid escape string
+HINT: Escape string must be empty or one character.
+-- Test back reference in regexp_replace
+SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3');
+ regexp_replace
+----------------
+ (111) 222-3333
+(1 row)
+
+SELECT regexp_replace('AAA BBB CCC ', E'\\s+', ' ', 'g');
+ regexp_replace
+----------------
+ AAA BBB CCC
+(1 row)
+
+SELECT regexp_replace('AAA', '^|$', 'Z', 'g');
+ regexp_replace
+----------------
+ ZAAAZ
+(1 row)
+
+SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'gi');
+ regexp_replace
+----------------
+ Z Z
+(1 row)
+
+-- invalid regexp option
+SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'z');
+ERROR: invalid regular expression option: "z"
+-- set so we can tell NULL from empty string
+\pset null '\\N'
+-- return all matches from regexp
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
+ regexp_matches
+----------------
+ {bar,beque}
+(1 row)
+
+-- test case insensitive
+SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
+ regexp_matches
+----------------
+ {bAR,bEqUE}
+(1 row)
+
+-- global option - more than one match
+SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+ regexp_matches
+----------------
+ {bar,beque}
+ {bazil,barf}
+(2 rows)
+
+-- empty capture group (matched empty string)
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
+ regexp_matches
+----------------
+ {bar,"",beque}
+(1 row)
+
+-- no match
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(.+)(beque)$re$);
+ regexp_matches
+----------------
+(0 rows)
+
+-- optional capture group did not match, null entry in array
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(.+)?(beque)$re$);
+ regexp_matches
+------------------
+ {bar,NULL,beque}
+(1 row)
+
+-- no capture groups
+SELECT regexp_matches('foobarbequebaz', $re$barbeque$re$);
+ regexp_matches
+----------------
+ {barbeque}
+(1 row)
+
+-- start/end-of-line matches are of zero length
+SELECT regexp_matches('foo' || chr(10) || 'bar' || chr(10) || 'bequq' || chr(10) || 'baz', '^', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {""}
+ {""}
+ {""}
+(4 rows)
+
+SELECT regexp_matches('foo' || chr(10) || 'bar' || chr(10) || 'bequq' || chr(10) || 'baz', '$', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {""}
+ {""}
+ {""}
+(4 rows)
+
+SELECT regexp_matches('1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4' || chr(10), '^.?', 'mg');
+ regexp_matches
+----------------
+ {1}
+ {2}
+ {3}
+ {4}
+ {""}
+(5 rows)
+
+SELECT regexp_matches(chr(10) || '1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4' || chr(10), '.?$', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {1}
+ {""}
+ {2}
+ {""}
+ {3}
+ {""}
+ {4}
+ {""}
+ {""}
+(10 rows)
+
+SELECT regexp_matches(chr(10) || '1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4', '.?$', 'mg');
+ regexp_matches
+----------------
+ {""}
+ {1}
+ {""}
+ {2}
+ {""}
+ {3}
+ {""}
+ {4}
+ {""}
+(9 rows)
+
+-- give me errors
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$, 'gz');
+ERROR: invalid regular expression option: "z"
+SELECT regexp_matches('foobarbequebaz', $re$(barbeque$re$);
+ERROR: invalid regular expression: parentheses () not balanced
+SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque){2,1}$re$);
+ERROR: invalid regular expression: invalid repetition count(s)
+-- split string on regexp
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', $re$\s+$re$) AS foo;
+ foo | length
+-------+--------
+ the | 3
+ quick | 5
+ brown | 5
+ fox | 3
+ jumps | 5
+ over | 4
+ the | 3
+ lazy | 4
+ dog | 3
+(9 rows)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', $re$\s+$re$);
+ regexp_split_to_array
+-----------------------------------------------
+ {the,quick,brown,fox,jumps,over,the,lazy,dog}
+(1 row)
+
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', $re$\s*$re$) AS foo;
+ foo | length
+-----+--------
+ t | 1
+ h | 1
+ e | 1
+ q | 1
+ u | 1
+ i | 1
+ c | 1
+ k | 1
+ b | 1
+ r | 1
+ o | 1
+ w | 1
+ n | 1
+ f | 1
+ o | 1
+ x | 1
+ j | 1
+ u | 1
+ m | 1
+ p | 1
+ s | 1
+ o | 1
+ v | 1
+ e | 1
+ r | 1
+ t | 1
+ h | 1
+ e | 1
+ l | 1
+ a | 1
+ z | 1
+ y | 1
+ d | 1
+ o | 1
+ g | 1
+(35 rows)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', $re$\s*$re$);
+ regexp_split_to_array
+-------------------------------------------------------------------------
+ {t,h,e,q,u,i,c,k,b,r,o,w,n,f,o,x,j,u,m,p,s,o,v,e,r,t,h,e,l,a,z,y,d,o,g}
+(1 row)
+
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', '') AS foo;
+ foo | length
+-----+--------
+ t | 1
+ h | 1
+ e | 1
+ | 1
+ q | 1
+ u | 1
+ i | 1
+ c | 1
+ k | 1
+ | 1
+ b | 1
+ r | 1
+ o | 1
+ w | 1
+ n | 1
+ | 1
+ f | 1
+ o | 1
+ x | 1
+ | 1
+ j | 1
+ u | 1
+ m | 1
+ p | 1
+ s | 1
+ | 1
+ o | 1
+ v | 1
+ e | 1
+ r | 1
+ | 1
+ t | 1
+ h | 1
+ e | 1
+ | 1
+ l | 1
+ a | 1
+ z | 1
+ y | 1
+ | 1
+ d | 1
+ o | 1
+ g | 1
+(43 rows)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', '');
+ regexp_split_to_array
+---------------------------------------------------------------------------------------------------------
+ {t,h,e," ",q,u,i,c,k," ",b,r,o,w,n," ",f,o,x," ",j,u,m,p,s," ",o,v,e,r," ",t,h,e," ",l,a,z,y," ",d,o,g}
+(1 row)
+
+-- case insensitive
+SELECT foo, length(foo) FROM regexp_split_to_table('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'i') AS foo;
+ foo | length
+---------------------------+--------
+ th | 2
+ QUick bROWn FOx jUMPs ov | 25
+ r Th | 4
+ lazy dOG | 9
+(4 rows)
+
+SELECT regexp_split_to_array('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'i');
+ regexp_split_to_array
+-----------------------------------------------------
+ {th," QUick bROWn FOx jUMPs ov","r Th"," lazy dOG"}
+(1 row)
+
+-- no match of pattern
+SELECT foo, length(foo) FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', 'nomatch') AS foo;
+ foo | length
+---------------------------------------------+--------
+ the quick brown fox jumps over the lazy dog | 43
+(1 row)
+
+SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', 'nomatch');
+ regexp_split_to_array
+-------------------------------------------------
+ {"the quick brown fox jumps over the lazy dog"}
+(1 row)
+
+-- some corner cases
+SELECT regexp_split_to_array('123456','1');
+ regexp_split_to_array
+-----------------------
+ {"",23456}
+(1 row)
+
+SELECT regexp_split_to_array('123456','6');
+ regexp_split_to_array
+-----------------------
+ {12345,""}
+(1 row)
+
+SELECT regexp_split_to_array('123456','.');
+ regexp_split_to_array
+------------------------
+ {"","","","","","",""}
+(1 row)
+
+SELECT regexp_split_to_array('123456','');
+ regexp_split_to_array
+-----------------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT regexp_split_to_array('123456','(?:)');
+ regexp_split_to_array
+-----------------------
+ {1,2,3,4,5,6}
+(1 row)
+
+SELECT regexp_split_to_array('1','');
+ regexp_split_to_array
+-----------------------
+ {1}
+(1 row)
+
+-- errors
+SELECT foo, length(foo) FROM regexp_split_to_table('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'zippy') AS foo;
+ERROR: invalid regular expression option: "z"
+SELECT regexp_split_to_array('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'iz');
+ERROR: invalid regular expression option: "z"
+-- global option meaningless for regexp_split
+SELECT foo, length(foo) FROM regexp_split_to_table('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'g') AS foo;
+ERROR: regexp_split_to_table() does not support the "global" option
+SELECT regexp_split_to_array('thE QUick bROWn FOx jUMPs ovEr The lazy dOG', 'e', 'g');
+ERROR: regexp_split_to_array() does not support the "global" option
+-- change NULL-display back
+\pset null ''
+-- E021-11 position expression
+SELECT POSITION('4' IN '1234567890') = '4' AS "4";
+ 4
+---
+ t
+(1 row)
+
+SELECT POSITION('5' IN '1234567890') = '5' AS "5";
+ 5
+---
+ t
+(1 row)
+
+-- T312 character overlay function
+SELECT OVERLAY('abcdef' PLACING '45' FROM 4) AS "abc45f";
+ abc45f
+--------
+ abc45f
+(1 row)
+
+SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5) AS "yabadaba";
+ yabadaba
+----------
+ yabadaba
+(1 row)
+
+SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0) AS "yabadabadoo";
+ yabadabadoo
+-------------
+ yabadabadoo
+(1 row)
+
+SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4) AS "bubba";
+ bubba
+-------
+ bubba
+(1 row)
+
+--
+-- test LIKE
+-- Be sure to form every test as a LIKE/NOT LIKE pair.
+--
+-- simplest examples
+-- E061-04 like predicate
+SELECT 'hawkeye' LIKE 'h%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'h%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' LIKE 'H%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'H%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' LIKE 'indio%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'indio%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' LIKE 'h%eye' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'h%eye' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE '_ndio' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE '_ndio' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE 'in__o' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE 'in__o' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE 'in_o' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' NOT LIKE 'in_o' AS "true";
+ true
+------
+ t
+(1 row)
+
+-- unused escape character
+SELECT 'hawkeye' LIKE 'h%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT LIKE 'h%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE 'ind_o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE 'ind_o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+-- escape character
+-- E061-05 like predicate with escape clause
+SELECT 'h%' LIKE 'h#%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%' NOT LIKE 'h#%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'h%wkeye' LIKE 'h#%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'h%wkeye' NOT LIKE 'h#%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%wkeye' LIKE 'h#%%' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%wkeye' NOT LIKE 'h#%%' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'h%awkeye' LIKE 'h#%a%k%e' ESCAPE '#' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'h%awkeye' NOT LIKE 'h#%a%k%e' ESCAPE '#' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'indio' LIKE '_ndio' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'indio' NOT LIKE '_ndio' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'i_dio' LIKE 'i$_d_o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'i_dio' NOT LIKE 'i$_d_o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'i_dio' LIKE 'i$_nd_o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'i_dio' NOT LIKE 'i$_nd_o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'i_dio' LIKE 'i$_d%o' ESCAPE '$' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'i_dio' NOT LIKE 'i$_d%o' ESCAPE '$' AS "false";
+ false
+-------
+ f
+(1 row)
+
+-- escape character same as pattern character
+SELECT 'maca' LIKE 'm%aca' ESCAPE '%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'maca' NOT LIKE 'm%aca' ESCAPE '%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'ma%a' LIKE 'm%a%%a' ESCAPE '%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'ma%a' NOT LIKE 'm%a%%a' ESCAPE '%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'bear' LIKE 'b_ear' ESCAPE '_' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'bear' NOT LIKE 'b_ear' ESCAPE '_' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'be_r' LIKE 'b_e__r' ESCAPE '_' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'be_r' NOT LIKE 'b_e__r' ESCAPE '_' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'be_r' LIKE '__e__r' ESCAPE '_' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'be_r' NOT LIKE '__e__r' ESCAPE '_' AS "true";
+ true
+------
+ t
+(1 row)
+
+--
+-- test ILIKE (case-insensitive LIKE)
+-- Be sure to form every test as an ILIKE/NOT ILIKE pair.
+--
+SELECT 'hawkeye' ILIKE 'h%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT ILIKE 'h%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' ILIKE 'H%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT ILIKE 'H%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'hawkeye' ILIKE 'H%Eye' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'hawkeye' NOT ILIKE 'H%Eye' AS "false";
+ false
+-------
+ f
+(1 row)
+
+SELECT 'Hawkeye' ILIKE 'h%' AS "true";
+ true
+------
+ t
+(1 row)
+
+SELECT 'Hawkeye' NOT ILIKE 'h%' AS "false";
+ false
+-------
+ f
+(1 row)
+
+--
+-- test %/_ combination cases, cf bugs #4821 and #5478
+--
+SELECT 'foo' LIKE '_%' as t, 'f' LIKE '_%' as t, '' LIKE '_%' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'foo' LIKE '%_' as t, 'f' LIKE '%_' as t, '' LIKE '%_' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'foo' LIKE '__%' as t, 'foo' LIKE '___%' as t, 'foo' LIKE '____%' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'foo' LIKE '%__' as t, 'foo' LIKE '%___' as t, 'foo' LIKE '%____' as f;
+ t | t | f
+---+---+---
+ t | t | f
+(1 row)
+
+SELECT 'jack' LIKE '%____%' AS t;
+ t
+---
+ t
+(1 row)
+
+--
+-- basic tests of LIKE with indexes
+--
+CREATE TABLE texttest (a text PRIMARY KEY, b int);
+SELECT * FROM texttest WHERE a LIKE '%1%';
+ a | b
+---+---
+(0 rows)
+
+CREATE TABLE byteatest (a bytea PRIMARY KEY, b int);
+SELECT * FROM byteatest WHERE a LIKE '%1%';
+ a | b
+---+---
+(0 rows)
+
+DROP TABLE texttest, byteatest;
+--
+-- test implicit type conversion
+--
+-- E021-07 character concatenation
+SELECT 'unknown' || ' and unknown' AS "Concat unknown types";
+ Concat unknown types
+----------------------
+ unknown and unknown
+(1 row)
+
+SELECT text 'text' || ' and unknown' AS "Concat text to unknown type";
+ Concat text to unknown type
+-----------------------------
+ text and unknown
+(1 row)
+
+SELECT char(20) 'characters' || ' and text' AS "Concat char to unknown type";
+ Concat char to unknown type
+-----------------------------
+ characters and text
+(1 row)
+
+SELECT text 'text' || char(20) ' and characters' AS "Concat text to char";
+ Concat text to char
+---------------------
+ text and characters
+(1 row)
+
+SELECT text 'text' || varchar ' and varchar' AS "Concat text to varchar";
+ Concat text to varchar
+------------------------
+ text and varchar
+(1 row)
+
+--
+-- test substr with toasted text values
+--
+CREATE TABLE toasttest(f1 text);
+insert into toasttest values(repeat('1234567890',10000));
+insert into toasttest values(repeat('1234567890',10000));
+--
+-- Ensure that some values are uncompressed, to test the faster substring
+-- operation used in that case
+--
+alter table toasttest alter column f1 set storage external;
+insert into toasttest values(repeat('1234567890',10000));
+insert into toasttest values(repeat('1234567890',10000));
+-- If the starting position is zero or less, then return from the start of the string
+-- adjusting the length to be consistent with the "negative start" per SQL.
+SELECT substr(f1, -1, 5) from toasttest;
+ substr
+--------
+ 123
+ 123
+ 123
+ 123
+(4 rows)
+
+-- If the length is less than zero, an ERROR is thrown.
+SELECT substr(f1, 5, -1) from toasttest;
+ERROR: negative substring length not allowed
+-- If no third argument (length) is provided, the length to the end of the
+-- string is assumed.
+SELECT substr(f1, 99995) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+-- If start plus length is > string length, the result is truncated to
+-- string length
+SELECT substr(f1, 99995, 10) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+TRUNCATE TABLE toasttest;
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+-- expect >0 blocks
+SELECT pg_relation_size(reltoastrelid) = 0 AS is_empty
+ FROM pg_class where relname = 'toasttest';
+ is_empty
+----------
+
+(1 row)
+
+TRUNCATE TABLE toasttest;
+ALTER TABLE toasttest set (toast_tuple_target = 4080);
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+INSERT INTO toasttest values (repeat('1234567890',300));
+-- expect 0 blocks
+SELECT pg_relation_size(reltoastrelid) = 0 AS is_empty
+ FROM pg_class where relname = 'toasttest';
+ is_empty
+----------
+
+(1 row)
+
+DROP TABLE toasttest;
+--
+-- test substr with toasted bytea values
+--
+CREATE TABLE toasttest(f1 bytea);
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+--
+-- Ensure that some values are uncompressed, to test the faster substring
+-- operation used in that case
+--
+alter table toasttest alter column f1 set storage external;
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+insert into toasttest values(decode(repeat('1234567890',10000),'escape'));
+-- If the starting position is zero or less, then return from the start of the string
+-- adjusting the length to be consistent with the "negative start" per SQL.
+SELECT substr(f1, -1, 5) from toasttest;
+ substr
+--------
+ 123
+ 123
+ 123
+ 123
+(4 rows)
+
+-- If the length is less than zero, an ERROR is thrown.
+SELECT substr(f1, 5, -1) from toasttest;
+ERROR: negative substring length not allowed
+-- If no third argument (length) is provided, the length to the end of the
+-- string is assumed.
+SELECT substr(f1, 99995) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+-- If start plus length is > string length, the result is truncated to
+-- string length
+SELECT substr(f1, 99995, 10) from toasttest;
+ substr
+--------
+ 567890
+ 567890
+ 567890
+ 567890
+(4 rows)
+
+DROP TABLE toasttest;
+-- test internally compressing datums
+-- this tests compressing a datum to a very small size which exercises a
+-- corner case in packed-varlena handling: even though small, the compressed
+-- datum must be given a 4-byte header because there are no bits to indicate
+-- compression in a 1-byte header
+CREATE TABLE toasttest (c char(4096));
+INSERT INTO toasttest VALUES('x');
+SELECT length(c), c::text FROM toasttest;
+ length | c
+--------+---
+ 1 | x
+(1 row)
+
+SELECT c FROM toasttest;
+ c
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ x
+(1 row)
+
+DROP TABLE toasttest;
+--
+-- test length
+--
+SELECT length('abcdef') AS "length_6";
+ length_6
+----------
+ 6
+(1 row)
+
+--
+-- test strpos
+--
+SELECT strpos('abcdef', 'cd') AS "pos_3";
+ pos_3
+-------
+ 3
+(1 row)
+
+SELECT strpos('abcdef', 'xy') AS "pos_0";
+ pos_0
+-------
+ 0
+(1 row)
+
+SELECT strpos('abcdef', '') AS "pos_1";
+ pos_1
+-------
+ 1
+(1 row)
+
+SELECT strpos('', 'xy') AS "pos_0";
+ pos_0
+-------
+ 0
+(1 row)
+
+SELECT strpos('', '') AS "pos_1";
+ pos_1
+-------
+ 1
+(1 row)
+
+--
+-- test replace
+--
+SELECT replace('abcdef', 'de', '45') AS "abc45f";
+ abc45f
+--------
+ abc45f
+(1 row)
+
+SELECT replace('yabadabadoo', 'ba', '123') AS "ya123da123doo";
+ ya123da123doo
+---------------
+ ya123da123doo
+(1 row)
+
+SELECT replace('yabadoo', 'bad', '') AS "yaoo";
+ yaoo
+------
+ yaoo
+(1 row)
+
+--
+-- test split_part
+--
+select split_part('joeuser@mydatabase','@',0) AS "an error";
+ERROR: field position must be greater than zero
+select split_part('joeuser@mydatabase','@',1) AS "joeuser";
+ joeuser
+---------
+ joeuser
+(1 row)
+
+select split_part('joeuser@mydatabase','@',2) AS "mydatabase";
+ mydatabase
+------------
+ mydatabase
+(1 row)
+
+select split_part('joeuser@mydatabase','@',3) AS "empty string";
+ empty string
+--------------
+
+(1 row)
+
+select split_part('@joeuser@mydatabase@','@',2) AS "joeuser";
+ joeuser
+---------
+ joeuser
+(1 row)
+
+--
+-- test to_hex
+--
+select to_hex(256*256*256 - 1) AS "ffffff";
+ ffffff
+--------
+ ffffff
+(1 row)
+
+select to_hex(256::bigint*256::bigint*256::bigint*256::bigint - 1) AS "ffffffff";
+ ffffffff
+----------
+ ffffffff
+(1 row)
+
+--
+-- MD5 test suite - from IETF RFC 1321
+-- (see: ftp://ftp.rfc-editor.org/in-notes/rfc1321.txt)
+--
+select md5('') = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('a') = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abc') = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('message digest') = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abcdefghijklmnopqrstuvwxyz') = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789') = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890') = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5(''::bytea) = 'd41d8cd98f00b204e9800998ecf8427e' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('a'::bytea) = '0cc175b9c0f1b6a831c399e269772661' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abc'::bytea) = '900150983cd24fb0d6963f7d28e17f72' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('message digest'::bytea) = 'f96b697d7cb7938d525a2f31aaf161d0' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('abcdefghijklmnopqrstuvwxyz'::bytea) = 'c3fcd3d76192e4007dfb496cca67e13b' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'::bytea) = 'd174ab98d277d9f5a5611c2c9f419d9f' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+select md5('12345678901234567890123456789012345678901234567890123456789012345678901234567890'::bytea) = '57edf4a22be3c955ac49da2e2107b67a' AS "TRUE";
+ TRUE
+------
+ t
+(1 row)
+
+--
+-- SHA-2
+--
+SET bytea_output TO hex;
+SELECT sha224('');
+ sha224
+------------------------------------------------------------
+ \xd14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac5b3e42f
+(1 row)
+
+SELECT sha224('The quick brown fox jumps over the lazy dog.');
+ sha224
+------------------------------------------------------------
+ \x619cba8e8e05826e9b8c519c0a5c68f4fb653e8a3d8aa04bb2c8cd4c
+(1 row)
+
+SELECT sha256('');
+ sha256
+--------------------------------------------------------------------
+ \xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
+(1 row)
+
+SELECT sha256('The quick brown fox jumps over the lazy dog.');
+ sha256
+--------------------------------------------------------------------
+ \xef537f25c895bfa782526529a9b63d97aa631564d5d789c2b765448c8635fb6c
+(1 row)
+
+SELECT sha384('');
+ sha384
+----------------------------------------------------------------------------------------------------
+ \x38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b
+(1 row)
+
+SELECT sha384('The quick brown fox jumps over the lazy dog.');
+ sha384
+----------------------------------------------------------------------------------------------------
+ \xed892481d8272ca6df370bf706e4d7bc1b5739fa2177aae6c50e946678718fc67a7af2819a021c2fc34e91bdb63409d7
+(1 row)
+
+SELECT sha512('');
+ sha512
+------------------------------------------------------------------------------------------------------------------------------------
+ \xcf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e
+(1 row)
+
+SELECT sha512('The quick brown fox jumps over the lazy dog.');
+ sha512
+------------------------------------------------------------------------------------------------------------------------------------
+ \x91ea1245f20d46ae9a037a989f54f1f790f0a47607eeb8a14d12890cea77a1bbc6c7ed9cf205e67b7f2b8fd4c7dfd3a7a8617e45f3c463d481c7e586c39ac1ed
+(1 row)
+
+--
+-- encode/decode
+--
+SELECT encode('\x1234567890abcdef00', 'hex');
+ encode
+--------------------
+ 1234567890abcdef00
+(1 row)
+
+SELECT decode('1234567890abcdef00', 'hex');
+ decode
+----------------------
+ \x1234567890abcdef00
+(1 row)
+
+SELECT encode(('\x' || repeat('1234567890abcdef0001', 7))::bytea, 'base64');
+ encode
+------------------------------------------------------------------------------
+ EjRWeJCrze8AARI0VniQq83vAAESNFZ4kKvN7wABEjRWeJCrze8AARI0VniQq83vAAESNFZ4kKvN+
+ 7wABEjRWeJCrze8AAQ==
+(1 row)
+
+SELECT decode(encode(('\x' || repeat('1234567890abcdef0001', 7))::bytea,
+ 'base64'), 'base64');
+ decode
+------------------------------------------------------------------------------------------------------------------------------------------------
+ \x1234567890abcdef00011234567890abcdef00011234567890abcdef00011234567890abcdef00011234567890abcdef00011234567890abcdef00011234567890abcdef0001
+(1 row)
+
+SELECT encode('\x1234567890abcdef00', 'escape');
+ encode
+-----------------------------
+ \x124Vx\220\253\315\357\000
+(1 row)
+
+SELECT decode(encode('\x1234567890abcdef00', 'escape'), 'escape');
+ decode
+----------------------
+ \x1234567890abcdef00
+(1 row)
+
+--
+-- get_bit/set_bit etc
+--
+SELECT get_bit('\x1234567890abcdef00'::bytea, 43);
+ get_bit
+---------
+ 1
+(1 row)
+
+SELECT get_bit('\x1234567890abcdef00'::bytea, 99); -- error
+ERROR: index 99 out of valid range, 0..71
+SELECT set_bit('\x1234567890abcdef00'::bytea, 43, 0);
+ set_bit
+----------------------
+ \x1234567890a3cdef00
+(1 row)
+
+SELECT set_bit('\x1234567890abcdef00'::bytea, 99, 0); -- error
+ERROR: index 99 out of valid range, 0..71
+SELECT get_byte('\x1234567890abcdef00'::bytea, 3);
+ get_byte
+----------
+ 120
+(1 row)
+
+SELECT get_byte('\x1234567890abcdef00'::bytea, 99); -- error
+ERROR: index 99 out of valid range, 0..8
+SELECT set_byte('\x1234567890abcdef00'::bytea, 7, 11);
+ set_byte
+----------------------
+ \x1234567890abcd0b00
+(1 row)
+
+SELECT set_byte('\x1234567890abcdef00'::bytea, 99, 11); -- error
+ERROR: index 99 out of valid range, 0..8
+--
+-- test behavior of escape_string_warning and standard_conforming_strings options
+--
+set escape_string_warning = off;
+set standard_conforming_strings = off;
+show escape_string_warning;
+ escape_string_warning
+-----------------------
+ off
+(1 row)
+
+show standard_conforming_strings;
+ standard_conforming_strings
+-----------------------------
+ off
+(1 row)
+
+set escape_string_warning = on;
+set standard_conforming_strings = on;
+show escape_string_warning;
+ escape_string_warning
+-----------------------
+ on
+(1 row)
+
+show standard_conforming_strings;
+ standard_conforming_strings
+-----------------------------
+ on
+(1 row)
+
+select 'a\bcd' as f1, 'a\b''cd' as f2, 'a\b''''cd' as f3, 'abcd\' as f4, 'ab\''cd' as f5, '\\' as f6;
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+set standard_conforming_strings = off;
+select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd' as f5, '\\\\' as f6;
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3,...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3,...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3,...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: ...bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' ...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: ...'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd'...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+WARNING: nonstandard use of \\ in a string literal
+LINE 1: ...'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd' as f5, '\\\\' as ...
+ ^
+HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+set escape_string_warning = off;
+set standard_conforming_strings = on;
+select 'a\bcd' as f1, 'a\b''cd' as f2, 'a\b''''cd' as f3, 'abcd\' as f4, 'ab\''cd' as f5, '\\' as f6;
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+set standard_conforming_strings = off;
+select 'a\\bcd' as f1, 'a\\b\'cd' as f2, 'a\\b\'''cd' as f3, 'abcd\\' as f4, 'ab\\\'cd' as f5, '\\\\' as f6;
+ f1 | f2 | f3 | f4 | f5 | f6
+-------+--------+---------+-------+--------+----
+ a\bcd | a\b'cd | a\b''cd | abcd\ | ab\'cd | \\
+(1 row)
+
+--
+-- Additional string functions
+--
+SET bytea_output TO escape;
+SELECT initcap('hi THOMAS');
+ initcap
+-----------
+ Hi Thomas
+(1 row)
+
+SELECT lpad('hi', 5, 'xy');
+ lpad
+-------
+ xyxhi
+(1 row)
+
+SELECT lpad('hi', 5);
+ lpad
+-------
+ hi
+(1 row)
+
+SELECT lpad('hi', -5, 'xy');
+ lpad
+------
+
+(1 row)
+
+SELECT lpad('hello', 2);
+ lpad
+------
+ he
+(1 row)
+
+SELECT lpad('hi', 5, '');
+ lpad
+------
+ hi
+(1 row)
+
+SELECT rpad('hi', 5, 'xy');
+ rpad
+-------
+ hixyx
+(1 row)
+
+SELECT rpad('hi', 5);
+ rpad
+-------
+ hi
+(1 row)
+
+SELECT rpad('hi', -5, 'xy');
+ rpad
+------
+
+(1 row)
+
+SELECT rpad('hello', 2);
+ rpad
+------
+ he
+(1 row)
+
+SELECT rpad('hi', 5, '');
+ rpad
+------
+ hi
+(1 row)
+
+SELECT ltrim('zzzytrim', 'xyz');
+ ltrim
+-------
+ trim
+(1 row)
+
+SELECT translate('', '14', 'ax');
+ translate
+-----------
+
+(1 row)
+
+SELECT translate('12345', '14', 'ax');
+ translate
+-----------
+ a23x5
+(1 row)
+
+SELECT ascii('x');
+ ascii
+-------
+ 120
+(1 row)
+
+SELECT ascii('');
+ ascii
+-------
+ 0
+(1 row)
+
+SELECT chr(65);
+ chr
+-----
+ A
+(1 row)
+
+SELECT chr(0);
+ERROR: null character not permitted
+SELECT repeat('Pg', 4);
+ repeat
+----------
+ PgPgPgPg
+(1 row)
+
+SELECT repeat('Pg', -4);
+ repeat
+--------
+
+(1 row)
+
+SELECT trim(E'\\000'::bytea from E'\\000Tom\\000'::bytea);
+ btrim
+-------
+ Tom
+(1 row)
+
+SELECT btrim(E'\\000trim\\000'::bytea, E'\\000'::bytea);
+ btrim
+-------
+ trim
+(1 row)
+
+SELECT btrim(''::bytea, E'\\000'::bytea);
+ btrim
+-------
+
+(1 row)
+
+SELECT btrim(E'\\000trim\\000'::bytea, ''::bytea);
+ btrim
+--------------
+ \000trim\000
+(1 row)
+
+SELECT encode(overlay(E'Th\\000omas'::bytea placing E'Th\\001omas'::bytea from 2),'escape');
+ encode
+-------------
+ TTh\x01omas
+(1 row)
+
+SELECT encode(overlay(E'Th\\000omas'::bytea placing E'\\002\\003'::bytea from 8),'escape');
+ encode
+--------------------
+ Th\000omas\x02\x03
+(1 row)
+
+SELECT encode(overlay(E'Th\\000omas'::bytea placing E'\\002\\003'::bytea from 5 for 3),'escape');
+ encode
+-----------------
+ Th\000o\x02\x03
+(1 row)
+
diff --git a/src/test/regress/expected/tablesample_1.out b/src/test/regress/expected/tablesample_1.out
new file mode 100644
index 0000000000..8f0c7a2669
--- /dev/null
+++ b/src/test/regress/expected/tablesample_1.out
@@ -0,0 +1,300 @@
+CREATE TABLE test_tablesample (id int, name text) WITH (fillfactor=10);
+-- use fillfactor so we don't have to load too much data to get multiple pages
+INSERT INTO test_tablesample
+ SELECT i, repeat(i::text, 200) FROM generate_series(0, 9) s(i);
+SELECT t.id FROM test_tablesample AS t TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (100.0/11) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (50) REPEATABLE (0);
+ id
+----
+ 3
+ 4
+ 5
+ 7
+ 8
+(5 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (5.5) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+-- 100% should give repeatable count results (ie, all rows) in any case
+SELECT count(*) FROM test_tablesample TABLESAMPLE SYSTEM (100);
+ count
+-------
+ 10
+(1 row)
+
+SELECT count(*) FROM test_tablesample TABLESAMPLE SYSTEM (100) REPEATABLE (1+2);
+ count
+-------
+ 10
+(1 row)
+
+SELECT count(*) FROM test_tablesample TABLESAMPLE SYSTEM (100) REPEATABLE (0.4);
+ count
+-------
+ 10
+(1 row)
+
+CREATE VIEW test_tablesample_v1 AS
+ SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (10*2) REPEATABLE (2);
+CREATE VIEW test_tablesample_v2 AS
+ SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (99);
+\d+ test_tablesample_v1
+ View "public.test_tablesample_v1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ id | integer | | | | plain |
+View definition:
+ SELECT test_tablesample.id
+ FROM test_tablesample TABLESAMPLE system ((10 * 2)) REPEATABLE (2);
+
+\d+ test_tablesample_v2
+ View "public.test_tablesample_v2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ id | integer | | | | plain |
+View definition:
+ SELECT test_tablesample.id
+ FROM test_tablesample TABLESAMPLE system (99);
+
+-- check a sampled query doesn't affect cursor in progress
+BEGIN;
+DECLARE tablesample_cur CURSOR FOR
+ SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+FETCH FIRST FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH FIRST FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+FETCH NEXT FROM tablesample_cur;
+ id
+----
+(0 rows)
+
+CLOSE tablesample_cur;
+END;
+EXPLAIN (COSTS OFF)
+ SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (2);
+ QUERY PLAN
+--------------------------------------------------------------------
+ Sample Scan on test_tablesample
+ Sampling: system ('50'::real) REPEATABLE ('2'::double precision)
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+ SELECT * FROM test_tablesample_v1;
+ QUERY PLAN
+--------------------------------------------------------------------
+ Sample Scan on test_tablesample
+ Sampling: system ('20'::real) REPEATABLE ('2'::double precision)
+(2 rows)
+
+-- check inheritance behavior
+explain (costs off)
+ select count(*) from person tablesample bernoulli (100);
+ QUERY PLAN
+-------------------------------------------------
+ Aggregate
+ -> Append
+ -> Sample Scan on person
+ Sampling: bernoulli ('100'::real)
+ -> Sample Scan on emp
+ Sampling: bernoulli ('100'::real)
+ -> Sample Scan on student
+ Sampling: bernoulli ('100'::real)
+ -> Sample Scan on stud_emp
+ Sampling: bernoulli ('100'::real)
+(10 rows)
+
+select count(*) from person tablesample bernoulli (100);
+ count
+-------
+ 58
+(1 row)
+
+select count(*) from person;
+ count
+-------
+ 58
+(1 row)
+
+-- check that collations get assigned within the tablesample arguments
+SELECT count(*) FROM test_tablesample TABLESAMPLE bernoulli (('1'::text < '0'::text)::int);
+ count
+-------
+ 0
+(1 row)
+
+-- check behavior during rescans, as well as correct handling of min/max pct
+select * from
+ (values (0),(100)) v(pct),
+ lateral (select count(*) from tenk1 tablesample bernoulli (pct)) ss;
+ pct | count
+-----+-------
+ 0 | 0
+ 100 | 10000
+(2 rows)
+
+select * from
+ (values (0),(100)) v(pct),
+ lateral (select count(*) from tenk1 tablesample system (pct)) ss;
+ pct | count
+-----+-------
+ 0 | 0
+ 100 | 10000
+(2 rows)
+
+explain (costs off)
+select pct, count(unique1) from
+ (values (0),(100)) v(pct),
+ lateral (select * from tenk1 tablesample bernoulli (pct)) ss
+ group by pct;
+ QUERY PLAN
+--------------------------------------------------------
+ HashAggregate
+ Group Key: "*VALUES*".column1
+ -> Nested Loop
+ -> Values Scan on "*VALUES*"
+ -> Sample Scan on tenk1
+ Sampling: bernoulli ("*VALUES*".column1)
+(6 rows)
+
+select pct, count(unique1) from
+ (values (0),(100)) v(pct),
+ lateral (select * from tenk1 tablesample bernoulli (pct)) ss
+ group by pct;
+ pct | count
+-----+-------
+ 100 | 10000
+(1 row)
+
+select pct, count(unique1) from
+ (values (0),(100)) v(pct),
+ lateral (select * from tenk1 tablesample system (pct)) ss
+ group by pct;
+ pct | count
+-----+-------
+ 100 | 10000
+(1 row)
+
+-- errors
+SELECT id FROM test_tablesample TABLESAMPLE FOOBAR (1);
+ERROR: tablesample method foobar does not exist
+LINE 1: SELECT id FROM test_tablesample TABLESAMPLE FOOBAR (1);
+ ^
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (NULL);
+ERROR: TABLESAMPLE parameter cannot be null
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) REPEATABLE (NULL);
+ERROR: TABLESAMPLE REPEATABLE parameter cannot be null
+SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (-1);
+ERROR: sample percentage must be between 0 and 100
+SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (200);
+ERROR: sample percentage must be between 0 and 100
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (-1);
+ERROR: sample percentage must be between 0 and 100
+SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (200);
+ERROR: sample percentage must be between 0 and 100
+SELECT id FROM test_tablesample_v1 TABLESAMPLE BERNOULLI (1);
+ERROR: TABLESAMPLE clause can only be applied to tables and materialized views
+LINE 1: SELECT id FROM test_tablesample_v1 TABLESAMPLE BERNOULLI (1)...
+ ^
+INSERT INTO test_tablesample_v1 VALUES(1);
+ERROR: cannot insert into view "test_tablesample_v1"
+DETAIL: Views containing TABLESAMPLE are not automatically updatable.
+HINT: To enable inserting into the view, provide an INSTEAD OF INSERT trigger or an unconditional ON INSERT DO INSTEAD rule.
+WITH query_select AS (SELECT * FROM test_tablesample)
+SELECT * FROM query_select TABLESAMPLE BERNOULLI (5.5) REPEATABLE (1);
+ERROR: TABLESAMPLE clause can only be applied to tables and materialized views
+LINE 2: SELECT * FROM query_select TABLESAMPLE BERNOULLI (5.5) REPEA...
+ ^
+SELECT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPLE BERNOULLI (5);
+ERROR: syntax error at or near "TABLESAMPLE"
+LINE 1: ...CT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPL...
+ ^
+-- check partitioned tables support tablesample
+create table parted_sample (a int) partition by list (a);
+create table parted_sample_1 partition of parted_sample for values in (1);
+create table parted_sample_2 partition of parted_sample for values in (2);
+explain (costs off)
+ select * from parted_sample tablesample bernoulli (100);
+ QUERY PLAN
+-------------------------------------------
+ Append
+ -> Sample Scan on parted_sample_1
+ Sampling: bernoulli ('100'::real)
+ -> Sample Scan on parted_sample_2
+ Sampling: bernoulli ('100'::real)
+(5 rows)
+
+drop table parted_sample, parted_sample_1, parted_sample_2;
diff --git a/src/test/regress/expected/transactions_1.out b/src/test/regress/expected/transactions_1.out
new file mode 100644
index 0000000000..672b918303
--- /dev/null
+++ b/src/test/regress/expected/transactions_1.out
@@ -0,0 +1,1022 @@
+--
+-- TRANSACTIONS
+--
+BEGIN;
+SELECT *
+ INTO TABLE xacttest
+ FROM aggtest;
+INSERT INTO xacttest (a, b) VALUES (777, 777.777);
+END;
+-- should retrieve one value--
+SELECT a FROM xacttest WHERE a > 100;
+ a
+-----
+ 777
+(1 row)
+
+BEGIN;
+CREATE TABLE disappear (a int4);
+DELETE FROM aggtest;
+-- should be empty
+SELECT * FROM aggtest;
+ a | b
+---+---
+(0 rows)
+
+ABORT;
+-- should not exist
+SELECT oid FROM pg_class WHERE relname = 'disappear';
+ oid
+-----
+(0 rows)
+
+-- should have members again
+SELECT * FROM aggtest;
+ a | b
+-----+---------
+ 56 | 7.8
+ 100 | 99.097
+ 0 | 0.09561
+ 42 | 324.78
+(4 rows)
+
+-- Read-only tests
+CREATE TABLE writetest (a int);
+CREATE TEMPORARY TABLE temptest (a int);
+BEGIN;
+SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+SET TRANSACTION READ WRITE; --fail
+ERROR: transaction read-write mode must be set before any query
+COMMIT;
+BEGIN;
+SET TRANSACTION READ ONLY; -- ok
+SET TRANSACTION READ WRITE; -- ok
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+SAVEPOINT x;
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+SET TRANSACTION READ ONLY; -- ok
+SET TRANSACTION READ WRITE; --fail
+ERROR: cannot set transaction read-write mode inside a read-only transaction
+COMMIT;
+BEGIN;
+SET TRANSACTION READ WRITE; -- ok
+SAVEPOINT x;
+SET TRANSACTION READ WRITE; -- ok
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+SET TRANSACTION READ ONLY; -- ok
+SET TRANSACTION READ WRITE; --fail
+ERROR: cannot set transaction read-write mode inside a read-only transaction
+COMMIT;
+BEGIN;
+SET TRANSACTION READ WRITE; -- ok
+SAVEPOINT x;
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+ROLLBACK TO SAVEPOINT x;
+SHOW transaction_read_only; -- off
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SAVEPOINT y;
+SET TRANSACTION READ ONLY; -- ok
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+RELEASE SAVEPOINT y;
+SHOW transaction_read_only; -- off
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+COMMIT;
+SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY;
+DROP TABLE writetest; -- fail
+ERROR: cannot execute DROP TABLE in a read-only transaction
+INSERT INTO writetest VALUES (1); -- fail
+ERROR: cannot execute INSERT in a read-only transaction
+SELECT * FROM writetest; -- ok
+ a
+---
+(0 rows)
+
+DELETE FROM temptest; -- ok
+UPDATE temptest SET a = 0 FROM writetest WHERE temptest.a = 1 AND writetest.a = temptest.a; -- ok
+PREPARE test AS UPDATE writetest SET a = 0; -- ok
+EXECUTE test; -- fail
+ERROR: cannot execute UPDATE in a read-only transaction
+SELECT * FROM writetest, temptest; -- ok
+ a | a
+---+---
+(0 rows)
+
+CREATE TABLE test AS SELECT * FROM writetest; -- fail
+ERROR: cannot execute CREATE TABLE AS in a read-only transaction
+START TRANSACTION READ WRITE;
+DROP TABLE writetest; -- ok
+COMMIT;
+-- Subtransactions, basic tests
+-- create & drop tables
+SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE;
+CREATE TABLE trans_foobar (a int);
+BEGIN;
+ CREATE TABLE trans_foo (a int);
+ SAVEPOINT one;
+ DROP TABLE trans_foo;
+ CREATE TABLE trans_bar (a int);
+ ROLLBACK TO SAVEPOINT one;
+ RELEASE SAVEPOINT one;
+ SAVEPOINT two;
+ CREATE TABLE trans_baz (a int);
+ RELEASE SAVEPOINT two;
+ drop TABLE trans_foobar;
+ CREATE TABLE trans_barbaz (a int);
+COMMIT;
+-- should exist: trans_barbaz, trans_baz, trans_foo
+SELECT * FROM trans_foo; -- should be empty
+ a
+---
+(0 rows)
+
+SELECT * FROM trans_bar; -- shouldn't exist
+ERROR: relation "trans_bar" does not exist
+LINE 1: SELECT * FROM trans_bar;
+ ^
+SELECT * FROM trans_barbaz; -- should be empty
+ a
+---
+(0 rows)
+
+SELECT * FROM trans_baz; -- should be empty
+ a
+---
+(0 rows)
+
+-- inserts
+BEGIN;
+ INSERT INTO trans_foo VALUES (1);
+ SAVEPOINT one;
+ INSERT into trans_bar VALUES (1);
+ERROR: relation "trans_bar" does not exist
+LINE 1: INSERT into trans_bar VALUES (1);
+ ^
+ ROLLBACK TO one;
+ RELEASE SAVEPOINT one;
+ SAVEPOINT two;
+ INSERT into trans_barbaz VALUES (1);
+ RELEASE two;
+ SAVEPOINT three;
+ SAVEPOINT four;
+ INSERT INTO trans_foo VALUES (2);
+ RELEASE SAVEPOINT four;
+ ROLLBACK TO SAVEPOINT three;
+ RELEASE SAVEPOINT three;
+ INSERT INTO trans_foo VALUES (3);
+COMMIT;
+SELECT * FROM trans_foo; -- should have 1 and 3
+ a
+---
+ 1
+ 3
+(2 rows)
+
+SELECT * FROM trans_barbaz; -- should have 1
+ a
+---
+ 1
+(1 row)
+
+-- test whole-tree commit
+BEGIN;
+ SAVEPOINT one;
+ SELECT trans_foo;
+ERROR: column "trans_foo" does not exist
+LINE 1: SELECT trans_foo;
+ ^
+ ROLLBACK TO SAVEPOINT one;
+ RELEASE SAVEPOINT one;
+ SAVEPOINT two;
+ CREATE TABLE savepoints (a int);
+ SAVEPOINT three;
+ INSERT INTO savepoints VALUES (1);
+ SAVEPOINT four;
+ INSERT INTO savepoints VALUES (2);
+ SAVEPOINT five;
+ INSERT INTO savepoints VALUES (3);
+ ROLLBACK TO SAVEPOINT five;
+COMMIT;
+COMMIT; -- should not be in a transaction block
+WARNING: there is no transaction in progress
+SELECT * FROM savepoints;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+-- test whole-tree rollback
+BEGIN;
+ SAVEPOINT one;
+ DELETE FROM savepoints WHERE a=1;
+ RELEASE SAVEPOINT one;
+ SAVEPOINT two;
+ DELETE FROM savepoints WHERE a=1;
+ SAVEPOINT three;
+ DELETE FROM savepoints WHERE a=2;
+ROLLBACK;
+COMMIT; -- should not be in a transaction block
+WARNING: there is no transaction in progress
+SELECT * FROM savepoints;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+-- test whole-tree commit on an aborted subtransaction
+BEGIN;
+ INSERT INTO savepoints VALUES (4);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (5);
+ SELECT trans_foo;
+ERROR: column "trans_foo" does not exist
+LINE 1: SELECT trans_foo;
+ ^
+COMMIT;
+SELECT * FROM savepoints;
+ a
+---
+ 1
+ 2
+(2 rows)
+
+BEGIN;
+ INSERT INTO savepoints VALUES (6);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (7);
+ RELEASE SAVEPOINT one;
+ INSERT INTO savepoints VALUES (8);
+COMMIT;
+-- rows 6 and 8 should have been created by the same xact
+SELECT a.xmin = b.xmin FROM savepoints a, savepoints b WHERE a.a=6 AND b.a=8;
+ ?column?
+----------
+ t
+(1 row)
+
+-- rows 6 and 7 should have been created by different xacts
+SELECT a.xmin = b.xmin FROM savepoints a, savepoints b WHERE a.a=6 AND b.a=7;
+ ?column?
+----------
+ t
+(1 row)
+
+BEGIN;
+ INSERT INTO savepoints VALUES (9);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (10);
+ ROLLBACK TO SAVEPOINT one;
+ INSERT INTO savepoints VALUES (11);
+COMMIT;
+SELECT a FROM savepoints WHERE a in (9, 10, 11);
+ a
+----
+ 9
+ 11
+(2 rows)
+
+-- rows 9 and 11 should have been created by different xacts
+SELECT a.xmin = b.xmin FROM savepoints a, savepoints b WHERE a.a=9 AND b.a=11;
+ ?column?
+----------
+ t
+(1 row)
+
+BEGIN;
+ INSERT INTO savepoints VALUES (12);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (13);
+ SAVEPOINT two;
+ INSERT INTO savepoints VALUES (14);
+ ROLLBACK TO SAVEPOINT one;
+ INSERT INTO savepoints VALUES (15);
+ SAVEPOINT two;
+ INSERT INTO savepoints VALUES (16);
+ SAVEPOINT three;
+ INSERT INTO savepoints VALUES (17);
+COMMIT;
+SELECT a FROM savepoints WHERE a BETWEEN 12 AND 17;
+ a
+----
+ 12
+ 15
+ 16
+ 17
+(4 rows)
+
+BEGIN;
+ INSERT INTO savepoints VALUES (18);
+ SAVEPOINT one;
+ INSERT INTO savepoints VALUES (19);
+ SAVEPOINT two;
+ INSERT INTO savepoints VALUES (20);
+ ROLLBACK TO SAVEPOINT one;
+ INSERT INTO savepoints VALUES (21);
+ ROLLBACK TO SAVEPOINT one;
+ INSERT INTO savepoints VALUES (22);
+COMMIT;
+SELECT a FROM savepoints WHERE a BETWEEN 18 AND 22;
+ a
+----
+ 18
+ 22
+(2 rows)
+
+DROP TABLE savepoints;
+-- only in a transaction block:
+SAVEPOINT one;
+ERROR: SAVEPOINT can only be used in transaction blocks
+ROLLBACK TO SAVEPOINT one;
+ERROR: ROLLBACK TO SAVEPOINT can only be used in transaction blocks
+RELEASE SAVEPOINT one;
+ERROR: RELEASE SAVEPOINT can only be used in transaction blocks
+-- Only "rollback to" allowed in aborted state
+BEGIN;
+ SAVEPOINT one;
+ SELECT 0/0;
+ERROR: division by zero
+ SAVEPOINT two; -- ignored till the end of ...
+ERROR: current transaction is aborted, commands ignored until end of transaction block
+ RELEASE SAVEPOINT one; -- ignored till the end of ...
+ERROR: current transaction is aborted, commands ignored until end of transaction block
+ ROLLBACK TO SAVEPOINT one;
+ SELECT 1;
+ ?column?
+----------
+ 1
+(1 row)
+
+COMMIT;
+SELECT 1; -- this should work
+ ?column?
+----------
+ 1
+(1 row)
+
+-- check non-transactional behavior of cursors
+BEGIN;
+ DECLARE c CURSOR FOR SELECT unique2 FROM tenk1 ORDER BY unique2;
+ SAVEPOINT one;
+ FETCH 10 FROM c;
+ unique2
+---------
+ 0
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+(10 rows)
+
+ ROLLBACK TO SAVEPOINT one;
+ FETCH 10 FROM c;
+ unique2
+---------
+ 10
+ 11
+ 12
+ 13
+ 14
+ 15
+ 16
+ 17
+ 18
+ 19
+(10 rows)
+
+ RELEASE SAVEPOINT one;
+ FETCH 10 FROM c;
+ unique2
+---------
+ 20
+ 21
+ 22
+ 23
+ 24
+ 25
+ 26
+ 27
+ 28
+ 29
+(10 rows)
+
+ CLOSE c;
+ DECLARE c CURSOR FOR SELECT unique2/0 FROM tenk1 ORDER BY unique2;
+ SAVEPOINT two;
+ FETCH 10 FROM c;
+ERROR: division by zero
+ ROLLBACK TO SAVEPOINT two;
+ -- c is now dead to the world ...
+ FETCH 10 FROM c;
+ERROR: portal "c" cannot be run
+ ROLLBACK TO SAVEPOINT two;
+ RELEASE SAVEPOINT two;
+ FETCH 10 FROM c;
+ERROR: portal "c" cannot be run
+COMMIT;
+--
+-- Check that "stable" functions are really stable. They should not be
+-- able to see the partial results of the calling query. (Ideally we would
+-- also check that they don't see commits of concurrent transactions, but
+-- that's a mite hard to do within the limitations of pg_regress.)
+--
+select * from xacttest;
+ a | b
+-----+---------
+ 56 | 7.8
+ 100 | 99.097
+ 0 | 0.09561
+ 42 | 324.78
+ 777 | 777.777
+(5 rows)
+
+create or replace function max_xacttest() returns smallint language sql as
+'select max(a) from xacttest' stable;
+begin;
+update xacttest set a = max_xacttest() + 10 where a > 0;
+select * from xacttest;
+ a | b
+-----+---------
+ 0 | 0.09561
+ 787 | 7.8
+ 787 | 99.097
+ 787 | 324.78
+ 787 | 777.777
+(5 rows)
+
+rollback;
+-- But a volatile function can see the partial results of the calling query
+create or replace function max_xacttest() returns smallint language sql as
+'select max(a) from xacttest' volatile;
+begin;
+update xacttest set a = max_xacttest() + 10 where a > 0;
+select * from xacttest;
+ a | b
+-----+---------
+ 0 | 0.09561
+ 787 | 7.8
+ 797 | 99.097
+ 807 | 324.78
+ 817 | 777.777
+(5 rows)
+
+rollback;
+-- Now the same test with plpgsql (since it depends on SPI which is different)
+create or replace function max_xacttest() returns smallint language plpgsql as
+'begin return max(a) from xacttest; end' stable;
+begin;
+update xacttest set a = max_xacttest() + 10 where a > 0;
+select * from xacttest;
+ a | b
+-----+---------
+ 0 | 0.09561
+ 787 | 7.8
+ 787 | 99.097
+ 787 | 324.78
+ 787 | 777.777
+(5 rows)
+
+rollback;
+create or replace function max_xacttest() returns smallint language plpgsql as
+'begin return max(a) from xacttest; end' volatile;
+begin;
+update xacttest set a = max_xacttest() + 10 where a > 0;
+select * from xacttest;
+ a | b
+-----+---------
+ 0 | 0.09561
+ 787 | 7.8
+ 797 | 99.097
+ 807 | 324.78
+ 817 | 777.777
+(5 rows)
+
+rollback;
+-- test case for problems with dropping an open relation during abort
+BEGIN;
+ savepoint x;
+ CREATE TABLE koju (a INT UNIQUE);
+ INSERT INTO koju VALUES (1);
+ INSERT INTO koju VALUES (1);
+ERROR: duplicate key value violates unique constraint "koju_a_key"
+DETAIL: Key (a)=(1) already exists.
+ rollback to x;
+ CREATE TABLE koju (a INT UNIQUE);
+ INSERT INTO koju VALUES (1);
+ INSERT INTO koju VALUES (1);
+ERROR: duplicate key value violates unique constraint "koju_a_key"
+DETAIL: Key (a)=(1) already exists.
+ROLLBACK;
+DROP TABLE trans_foo;
+DROP TABLE trans_baz;
+DROP TABLE trans_barbaz;
+-- test case for problems with revalidating an open relation during abort
+create function inverse(int) returns float8 as
+$$
+begin
+ analyze revalidate_bug;
+ return 1::float8/$1;
+exception
+ when division_by_zero then return 0;
+end$$ language plpgsql volatile;
+create table revalidate_bug (c float8 unique);
+insert into revalidate_bug values (1);
+insert into revalidate_bug values (inverse(0));
+drop table revalidate_bug;
+drop function inverse(int);
+-- verify that cursors created during an aborted subtransaction are
+-- closed, but that we do not rollback the effect of any FETCHs
+-- performed in the aborted subtransaction
+begin;
+savepoint x;
+create table abc (a int);
+insert into abc values (5);
+insert into abc values (10);
+declare foo cursor for select * from abc;
+fetch from foo;
+ a
+---
+ 5
+(1 row)
+
+rollback to x;
+-- should fail
+fetch from foo;
+ERROR: cursor "foo" does not exist
+commit;
+begin;
+create table abc (a int);
+insert into abc values (5);
+insert into abc values (10);
+insert into abc values (15);
+declare foo cursor for select * from abc;
+fetch from foo;
+ a
+---
+ 5
+(1 row)
+
+savepoint x;
+fetch from foo;
+ a
+----
+ 10
+(1 row)
+
+rollback to x;
+fetch from foo;
+ a
+----
+ 15
+(1 row)
+
+abort;
+-- Test for proper cleanup after a failure in a cursor portal
+-- that was created in an outer subtransaction
+CREATE FUNCTION invert(x float8) RETURNS float8 LANGUAGE plpgsql AS
+$$ begin return 1/x; end $$;
+CREATE FUNCTION create_temp_tab() RETURNS text
+LANGUAGE plpgsql AS $$
+BEGIN
+ CREATE TEMP TABLE new_table (f1 float8);
+ -- case of interest is that we fail while holding an open
+ -- relcache reference to new_table
+ INSERT INTO new_table SELECT invert(0.0);
+ RETURN 'foo';
+END $$;
+BEGIN;
+DECLARE ok CURSOR FOR SELECT * FROM int8_tbl;
+DECLARE ctt CURSOR FOR SELECT create_temp_tab();
+FETCH ok;
+ q1 | q2
+-----+-----
+ 123 | 456
+(1 row)
+
+SAVEPOINT s1;
+FETCH ok; -- should work
+ q1 | q2
+-----+------------------
+ 123 | 4567890123456789
+(1 row)
+
+FETCH ctt; -- error occurs here
+ERROR: division by zero
+CONTEXT: PL/pgSQL function invert(double precision) line 1 at RETURN
+SQL statement "INSERT INTO new_table SELECT invert(0.0)"
+PL/pgSQL function create_temp_tab() line 6 at SQL statement
+ROLLBACK TO s1;
+FETCH ok; -- should work
+ q1 | q2
+------------------+-----
+ 4567890123456789 | 123
+(1 row)
+
+FETCH ctt; -- must be rejected
+ERROR: portal "ctt" cannot be run
+COMMIT;
+DROP FUNCTION create_temp_tab();
+DROP FUNCTION invert(x float8);
+-- Tests for AND CHAIN
+CREATE TABLE abc (a int);
+-- set nondefault value so we have something to override below
+SET default_transaction_read_only = on;
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ WRITE, DEFERRABLE;
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+INSERT INTO abc VALUES (1);
+INSERT INTO abc VALUES (2);
+COMMIT AND CHAIN; -- TBLOCK_END
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+INSERT INTO abc VALUES ('error');
+ERROR: invalid input syntax for type integer: "error"
+LINE 1: INSERT INTO abc VALUES ('error');
+ ^
+INSERT INTO abc VALUES (3); -- check it's really aborted
+ERROR: current transaction is aborted, commands ignored until end of transaction block
+COMMIT AND CHAIN; -- TBLOCK_ABORT_END
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+INSERT INTO abc VALUES (4);
+COMMIT;
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ WRITE, DEFERRABLE;
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+SAVEPOINT x;
+INSERT INTO abc VALUES ('error');
+ERROR: invalid input syntax for type integer: "error"
+LINE 1: INSERT INTO abc VALUES ('error');
+ ^
+COMMIT AND CHAIN; -- TBLOCK_ABORT_PENDING
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ on
+(1 row)
+
+INSERT INTO abc VALUES (5);
+COMMIT;
+-- different mix of options just for fun
+START TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ WRITE, NOT DEFERRABLE;
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ serializable
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ off
+(1 row)
+
+INSERT INTO abc VALUES (6);
+ROLLBACK AND CHAIN; -- TBLOCK_ABORT_PENDING
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ serializable
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ off
+(1 row)
+
+INSERT INTO abc VALUES ('error');
+ERROR: invalid input syntax for type integer: "error"
+LINE 1: INSERT INTO abc VALUES ('error');
+ ^
+ROLLBACK AND CHAIN; -- TBLOCK_ABORT_END
+SHOW transaction_isolation;
+ transaction_isolation
+-----------------------
+ serializable
+(1 row)
+
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SHOW transaction_deferrable;
+ transaction_deferrable
+------------------------
+ off
+(1 row)
+
+ROLLBACK;
+-- not allowed outside a transaction block
+COMMIT AND CHAIN; -- error
+ERROR: COMMIT AND CHAIN can only be used in transaction blocks
+ROLLBACK AND CHAIN; -- error
+ERROR: ROLLBACK AND CHAIN can only be used in transaction blocks
+SELECT * FROM abc ORDER BY 1;
+ a
+---
+ 1
+ 2
+ 4
+ 5
+(4 rows)
+
+RESET default_transaction_read_only;
+DROP TABLE abc;
+-- Test assorted behaviors around the implicit transaction block created
+-- when multiple SQL commands are sent in a single Query message. These
+-- tests rely on the fact that psql will not break SQL commands apart at a
+-- backslash-quoted semicolon, but will send them as one Query.
+create temp table i_table (f1 int);
+-- psql will show only the last result in a multi-statement Query
+SELECT 1\; SELECT 2\; SELECT 3;
+ ?column?
+----------
+ 3
+(1 row)
+
+-- this implicitly commits:
+insert into i_table values(1)\; select * from i_table;
+ f1
+----
+ 1
+(1 row)
+
+-- 1/0 error will cause rolling back the whole implicit transaction
+insert into i_table values(2)\; select * from i_table\; select 1/0;
+ERROR: division by zero
+select * from i_table;
+ f1
+----
+ 1
+(1 row)
+
+rollback; -- we are not in a transaction at this point
+WARNING: there is no transaction in progress
+-- can use regular begin/commit/rollback within a single Query
+begin\; insert into i_table values(3)\; commit;
+rollback; -- we are not in a transaction at this point
+WARNING: there is no transaction in progress
+begin\; insert into i_table values(4)\; rollback;
+rollback; -- we are not in a transaction at this point
+WARNING: there is no transaction in progress
+-- begin converts implicit transaction into a regular one that
+-- can extend past the end of the Query
+select 1\; begin\; insert into i_table values(5);
+commit;
+select 1\; begin\; insert into i_table values(6);
+rollback;
+-- commit in implicit-transaction state commits but issues a warning.
+insert into i_table values(7)\; commit\; insert into i_table values(8)\; select 1/0;
+WARNING: there is no transaction in progress
+ERROR: division by zero
+-- similarly, rollback aborts but issues a warning.
+insert into i_table values(9)\; rollback\; select 2;
+WARNING: there is no transaction in progress
+ ?column?
+----------
+ 2
+(1 row)
+
+select * from i_table;
+ f1
+----
+ 1
+ 3
+ 5
+ 7
+(4 rows)
+
+rollback; -- we are not in a transaction at this point
+WARNING: there is no transaction in progress
+-- implicit transaction block is still a transaction block, for e.g. VACUUM
+SELECT 1\; VACUUM;
+ERROR: VACUUM cannot run inside a transaction block
+SELECT 1\; COMMIT\; VACUUM;
+WARNING: there is no transaction in progress
+ERROR: VACUUM cannot run inside a transaction block
+-- we disallow savepoint-related commands in implicit-transaction state
+SELECT 1\; SAVEPOINT sp;
+ERROR: SAVEPOINT can only be used in transaction blocks
+SELECT 1\; COMMIT\; SAVEPOINT sp;
+WARNING: there is no transaction in progress
+ERROR: SAVEPOINT can only be used in transaction blocks
+ROLLBACK TO SAVEPOINT sp\; SELECT 2;
+ERROR: ROLLBACK TO SAVEPOINT can only be used in transaction blocks
+SELECT 2\; RELEASE SAVEPOINT sp\; SELECT 3;
+ERROR: RELEASE SAVEPOINT can only be used in transaction blocks
+-- but this is OK, because the BEGIN converts it to a regular xact
+SELECT 1\; BEGIN\; SAVEPOINT sp\; ROLLBACK TO SAVEPOINT sp\; COMMIT;
+-- Tests for AND CHAIN in implicit transaction blocks
+SET TRANSACTION READ ONLY\; COMMIT AND CHAIN; -- error
+ERROR: COMMIT AND CHAIN can only be used in transaction blocks
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+SET TRANSACTION READ ONLY\; ROLLBACK AND CHAIN; -- error
+ERROR: ROLLBACK AND CHAIN can only be used in transaction blocks
+SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ off
+(1 row)
+
+CREATE TABLE abc (a int);
+-- COMMIT/ROLLBACK + COMMIT/ROLLBACK AND CHAIN
+INSERT INTO abc VALUES (7)\; COMMIT\; INSERT INTO abc VALUES (8)\; COMMIT AND CHAIN; -- 7 commit, 8 error
+WARNING: there is no transaction in progress
+ERROR: COMMIT AND CHAIN can only be used in transaction blocks
+INSERT INTO abc VALUES (9)\; ROLLBACK\; INSERT INTO abc VALUES (10)\; ROLLBACK AND CHAIN; -- 9 rollback, 10 error
+WARNING: there is no transaction in progress
+ERROR: ROLLBACK AND CHAIN can only be used in transaction blocks
+-- COMMIT/ROLLBACK AND CHAIN + COMMIT/ROLLBACK
+INSERT INTO abc VALUES (11)\; COMMIT AND CHAIN\; INSERT INTO abc VALUES (12)\; COMMIT; -- 11 error, 12 not reached
+ERROR: COMMIT AND CHAIN can only be used in transaction blocks
+INSERT INTO abc VALUES (13)\; ROLLBACK AND CHAIN\; INSERT INTO abc VALUES (14)\; ROLLBACK; -- 13 error, 14 not reached
+ERROR: ROLLBACK AND CHAIN can only be used in transaction blocks
+-- START TRANSACTION + COMMIT/ROLLBACK AND CHAIN
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ\; INSERT INTO abc VALUES (15)\; COMMIT AND CHAIN; -- 15 ok
+SHOW transaction_isolation; -- transaction is active at this point
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+COMMIT;
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ\; INSERT INTO abc VALUES (16)\; ROLLBACK AND CHAIN; -- 16 ok
+SHOW transaction_isolation; -- transaction is active at this point
+ transaction_isolation
+-----------------------
+ repeatable read
+(1 row)
+
+ROLLBACK;
+-- START TRANSACTION + COMMIT/ROLLBACK + COMMIT/ROLLBACK AND CHAIN
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ\; INSERT INTO abc VALUES (17)\; COMMIT\; INSERT INTO abc VALUES (18)\; COMMIT AND CHAIN; -- 17 commit, 18 error
+ERROR: COMMIT AND CHAIN can only be used in transaction blocks
+SHOW transaction_isolation; -- out of transaction block
+ transaction_isolation
+-----------------------
+ read committed
+(1 row)
+
+START TRANSACTION ISOLATION LEVEL REPEATABLE READ\; INSERT INTO abc VALUES (19)\; ROLLBACK\; INSERT INTO abc VALUES (20)\; ROLLBACK AND CHAIN; -- 19 rollback, 20 error
+ERROR: ROLLBACK AND CHAIN can only be used in transaction blocks
+SHOW transaction_isolation; -- out of transaction block
+ transaction_isolation
+-----------------------
+ read committed
+(1 row)
+
+SELECT * FROM abc ORDER BY 1;
+ a
+----
+ 7
+ 15
+ 17
+(3 rows)
+
+DROP TABLE abc;
+-- Test for successful cleanup of an aborted transaction at session exit.
+-- THIS MUST BE THE LAST TEST IN THIS FILE.
+begin;
+select 1/0;
+ERROR: division by zero
+rollback to X;
+ERROR: savepoint "x" does not exist
+-- DO NOT ADD ANYTHING HERE.
diff --git a/src/test/regress/expected/triggers_1.out b/src/test/regress/expected/triggers_1.out
new file mode 100644
index 0000000000..5a8f055aed
--- /dev/null
+++ b/src/test/regress/expected/triggers_1.out
@@ -0,0 +1,2850 @@
+--
+-- TRIGGERS
+--
+create table pkeys (pkey1 int4 not null, pkey2 text not null);
+create table fkeys (fkey1 int4, fkey2 text, fkey3 int);
+create table fkeys2 (fkey21 int4, fkey22 text, pkey23 int not null);
+create index fkeys_i on fkeys (fkey1, fkey2);
+create index fkeys2_i on fkeys2 (fkey21, fkey22);
+create index fkeys2p_i on fkeys2 (pkey23);
+insert into pkeys values (10, '1');
+insert into pkeys values (20, '2');
+insert into pkeys values (30, '3');
+insert into pkeys values (40, '4');
+insert into pkeys values (50, '5');
+insert into pkeys values (60, '6');
+create unique index pkeys_i on pkeys (pkey1, pkey2);
+--
+-- For fkeys:
+-- (fkey1, fkey2) --> pkeys (pkey1, pkey2)
+-- (fkey3) --> fkeys2 (pkey23)
+--
+create trigger check_fkeys_pkey_exist
+ before insert or update on fkeys
+ for each row
+ execute function
+ check_primary_key ('fkey1', 'fkey2', 'pkeys', 'pkey1', 'pkey2');
+create trigger check_fkeys_pkey2_exist
+ before insert or update on fkeys
+ for each row
+ execute function check_primary_key ('fkey3', 'fkeys2', 'pkey23');
+--
+-- For fkeys2:
+-- (fkey21, fkey22) --> pkeys (pkey1, pkey2)
+--
+create trigger check_fkeys2_pkey_exist
+ before insert or update on fkeys2
+ for each row
+ execute procedure
+ check_primary_key ('fkey21', 'fkey22', 'pkeys', 'pkey1', 'pkey2');
+-- Test comments
+COMMENT ON TRIGGER check_fkeys2_pkey_bad ON fkeys2 IS 'wrong';
+ERROR: trigger "check_fkeys2_pkey_bad" for table "fkeys2" does not exist
+COMMENT ON TRIGGER check_fkeys2_pkey_exist ON fkeys2 IS 'right';
+COMMENT ON TRIGGER check_fkeys2_pkey_exist ON fkeys2 IS NULL;
+--
+-- For pkeys:
+-- ON DELETE/UPDATE (pkey1, pkey2) CASCADE:
+-- fkeys (fkey1, fkey2) and fkeys2 (fkey21, fkey22)
+--
+create trigger check_pkeys_fkey_cascade
+ before delete or update on pkeys
+ for each row
+ execute procedure
+ check_foreign_key (2, 'cascade', 'pkey1', 'pkey2',
+ 'fkeys', 'fkey1', 'fkey2', 'fkeys2', 'fkey21', 'fkey22');
+--
+-- For fkeys2:
+-- ON DELETE/UPDATE (pkey23) RESTRICT:
+-- fkeys (fkey3)
+--
+create trigger check_fkeys2_fkey_restrict
+ before delete or update on fkeys2
+ for each row
+ execute procedure check_foreign_key (1, 'restrict', 'pkey23', 'fkeys', 'fkey3');
+insert into fkeys2 values (10, '1', 1);
+insert into fkeys2 values (30, '3', 2);
+insert into fkeys2 values (40, '4', 5);
+insert into fkeys2 values (50, '5', 3);
+-- no key in pkeys
+insert into fkeys2 values (70, '5', 3);
+ERROR: tuple references non-existent key
+DETAIL: Trigger "check_fkeys2_pkey_exist" found tuple referencing non-existent key in "pkeys".
+insert into fkeys values (10, '1', 2);
+insert into fkeys values (30, '3', 3);
+insert into fkeys values (40, '4', 2);
+insert into fkeys values (50, '5', 2);
+-- no key in pkeys
+insert into fkeys values (70, '5', 1);
+ERROR: tuple references non-existent key
+DETAIL: Trigger "check_fkeys_pkey_exist" found tuple referencing non-existent key in "pkeys".
+-- no key in fkeys2
+insert into fkeys values (60, '6', 4);
+ERROR: tuple references non-existent key
+DETAIL: Trigger "check_fkeys_pkey2_exist" found tuple referencing non-existent key in "fkeys2".
+delete from pkeys where pkey1 = 30 and pkey2 = '3';
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
+ERROR: "check_fkeys2_fkey_restrict": tuple is referenced in "fkeys"
+CONTEXT: SQL statement "delete from fkeys2 where fkey21 = $1 and fkey22 = $2 "
+delete from pkeys where pkey1 = 40 and pkey2 = '4';
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys2 are deleted
+update pkeys set pkey1 = 7, pkey2 = '70' where pkey1 = 50 and pkey2 = '5';
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
+ERROR: "check_fkeys2_fkey_restrict": tuple is referenced in "fkeys"
+CONTEXT: SQL statement "delete from fkeys2 where fkey21 = $1 and fkey22 = $2 "
+update pkeys set pkey1 = 7, pkey2 = '70' where pkey1 = 10 and pkey2 = '1';
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys are deleted
+NOTICE: check_pkeys_fkey_cascade: 1 tuple(s) of fkeys2 are deleted
+SELECT trigger_name, event_manipulation, event_object_schema, event_object_table,
+ action_order, action_condition, action_orientation, action_timing,
+ action_reference_old_table, action_reference_new_table
+ FROM information_schema.triggers
+ WHERE event_object_table in ('pkeys', 'fkeys', 'fkeys2')
+ ORDER BY trigger_name COLLATE "C", 2;
+ trigger_name | event_manipulation | event_object_schema | event_object_table | action_order | action_condition | action_orientation | action_timing | action_reference_old_table | action_reference_new_table
+----------------------------+--------------------+---------------------+--------------------+--------------+------------------+--------------------+---------------+----------------------------+----------------------------
+ check_fkeys2_fkey_restrict | DELETE | public | fkeys2 | 1 | | ROW | BEFORE | |
+ check_fkeys2_fkey_restrict | UPDATE | public | fkeys2 | 1 | | ROW | BEFORE | |
+ check_fkeys2_pkey_exist | INSERT | public | fkeys2 | 1 | | ROW | BEFORE | |
+ check_fkeys2_pkey_exist | UPDATE | public | fkeys2 | 2 | | ROW | BEFORE | |
+ check_fkeys_pkey2_exist | INSERT | public | fkeys | 1 | | ROW | BEFORE | |
+ check_fkeys_pkey2_exist | UPDATE | public | fkeys | 1 | | ROW | BEFORE | |
+ check_fkeys_pkey_exist | INSERT | public | fkeys | 2 | | ROW | BEFORE | |
+ check_fkeys_pkey_exist | UPDATE | public | fkeys | 2 | | ROW | BEFORE | |
+ check_pkeys_fkey_cascade | DELETE | public | pkeys | 1 | | ROW | BEFORE | |
+ check_pkeys_fkey_cascade | UPDATE | public | pkeys | 1 | | ROW | BEFORE | |
+(10 rows)
+
+DROP TABLE pkeys;
+DROP TABLE fkeys;
+DROP TABLE fkeys2;
+-- Check behavior when trigger returns unmodified trigtuple
+create table trigtest (f1 int, f2 text);
+create trigger trigger_return_old
+ before insert or delete or update on trigtest
+ for each row execute procedure trigger_return_old();
+insert into trigtest values(1, 'foo');
+select * from trigtest;
+ f1 | f2
+----+-----
+ 1 | foo
+(1 row)
+
+update trigtest set f2 = f2 || 'bar';
+select * from trigtest;
+ f1 | f2
+----+-----
+ 1 | foo
+(1 row)
+
+delete from trigtest;
+select * from trigtest;
+ f1 | f2
+----+----
+(0 rows)
+
+drop table trigtest;
+create sequence ttdummy_seq increment 10 start 0 minvalue 0;
+create table tttest (
+ price_id int4,
+ price_val int4,
+ price_on int4,
+ price_off int4 default 999999
+);
+create trigger ttdummy
+ before delete or update on tttest
+ for each row
+ execute procedure
+ ttdummy (price_on, price_off);
+create trigger ttserial
+ before insert or update on tttest
+ for each row
+ execute procedure
+ autoinc (price_on, ttdummy_seq);
+insert into tttest values (1, 1, null);
+insert into tttest values (2, 2, null);
+insert into tttest values (3, 3, 0);
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 999999
+ 3 | 3 | 30 | 999999
+(3 rows)
+
+delete from tttest where price_id = 2;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 3 | 3 | 30 | 999999
+ 2 | 2 | 20 | 40
+(3 rows)
+
+-- what do we see ?
+-- get current prices
+select * from tttest where price_off = 999999;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 3 | 3 | 30 | 999999
+(2 rows)
+
+-- change price for price_id == 3
+update tttest set price_val = 30 where price_id = 3;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 40
+ 3 | 30 | 50 | 999999
+ 3 | 3 | 30 | 50
+(4 rows)
+
+-- now we want to change pric_id in ALL tuples
+-- this gets us not what we need
+update tttest set price_id = 5 where price_id = 3;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 40
+ 3 | 3 | 30 | 50
+ 5 | 30 | 60 | 999999
+ 3 | 30 | 50 | 60
+(5 rows)
+
+-- restore data as before last update:
+select set_ttdummy(0);
+ set_ttdummy
+-------------
+ 1
+(1 row)
+
+delete from tttest where price_id = 5;
+update tttest set price_off = 999999 where price_val = 30;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 40
+ 3 | 3 | 30 | 50
+ 3 | 30 | 50 | 999999
+(4 rows)
+
+-- and try change price_id now!
+update tttest set price_id = 5 where price_id = 3;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 1 | 1 | 10 | 999999
+ 2 | 2 | 20 | 40
+ 5 | 3 | 30 | 50
+ 5 | 30 | 50 | 999999
+(4 rows)
+
+-- isn't it what we need ?
+select set_ttdummy(1);
+ set_ttdummy
+-------------
+ 0
+(1 row)
+
+-- we want to correct some "date"
+update tttest set price_on = -1 where price_id = 1;
+ERROR: ttdummy (tttest): you cannot change price_on and/or price_off columns (use set_ttdummy)
+-- but this doesn't work
+-- try in this way
+select set_ttdummy(0);
+ set_ttdummy
+-------------
+ 1
+(1 row)
+
+update tttest set price_on = -1 where price_id = 1;
+select * from tttest;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 2 | 2 | 20 | 40
+ 5 | 3 | 30 | 50
+ 5 | 30 | 50 | 999999
+ 1 | 1 | -1 | 999999
+(4 rows)
+
+-- isn't it what we need ?
+-- get price for price_id == 5 as it was @ "date" 35
+select * from tttest where price_on <= 35 and price_off > 35 and price_id = 5;
+ price_id | price_val | price_on | price_off
+----------+-----------+----------+-----------
+ 5 | 3 | 30 | 50
+(1 row)
+
+drop table tttest;
+drop sequence ttdummy_seq;
+--
+-- tests for per-statement triggers
+--
+CREATE TABLE log_table (tstamp timestamp default timeofday()::timestamp);
+CREATE TABLE main_table (a int unique, b int);
+COPY main_table (a,b) FROM stdin;
+CREATE FUNCTION trigger_func() RETURNS trigger LANGUAGE plpgsql AS '
+BEGIN
+ RAISE NOTICE ''trigger_func(%) called: action = %, when = %, level = %'', TG_ARGV[0], TG_OP, TG_WHEN, TG_LEVEL;
+ RETURN NULL;
+END;';
+CREATE TRIGGER before_ins_stmt_trig BEFORE INSERT ON main_table
+FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func('before_ins_stmt');
+CREATE TRIGGER after_ins_stmt_trig AFTER INSERT ON main_table
+FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func('after_ins_stmt');
+--
+-- if neither 'FOR EACH ROW' nor 'FOR EACH STATEMENT' was specified,
+-- CREATE TRIGGER should default to 'FOR EACH STATEMENT'
+--
+CREATE TRIGGER after_upd_stmt_trig AFTER UPDATE ON main_table
+EXECUTE PROCEDURE trigger_func('after_upd_stmt');
+-- Both insert and update statement level triggers (before and after) should
+-- fire. Doesn't fire UPDATE before trigger, but only because one isn't
+-- defined.
+INSERT INTO main_table (a, b) VALUES (5, 10) ON CONFLICT (a)
+ DO UPDATE SET b = EXCLUDED.b;
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+CREATE TRIGGER after_upd_row_trig AFTER UPDATE ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('after_upd_row');
+INSERT INTO main_table DEFAULT VALUES;
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+UPDATE main_table SET a = a + 1 WHERE b < 30;
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+-- UPDATE that effects zero rows should still call per-statement trigger
+UPDATE main_table SET a = a + 2 WHERE b > 100;
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+-- constraint now unneeded
+ALTER TABLE main_table DROP CONSTRAINT main_table_a_key;
+-- COPY should fire per-row and per-statement INSERT triggers
+COPY main_table (a, b) FROM stdin;
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+SELECT * FROM main_table ORDER BY a, b;
+ a | b
+----+----
+ 6 | 10
+ 21 | 20
+ 30 | 40
+ 31 | 10
+ 50 | 35
+ 50 | 60
+ 81 | 15
+ |
+(8 rows)
+
+--
+-- test triggers with WHEN clause
+--
+CREATE TRIGGER modified_a BEFORE UPDATE OF a ON main_table
+FOR EACH ROW WHEN (OLD.a <> NEW.a) EXECUTE PROCEDURE trigger_func('modified_a');
+CREATE TRIGGER modified_any BEFORE UPDATE OF a ON main_table
+FOR EACH ROW WHEN (OLD.* IS DISTINCT FROM NEW.*) EXECUTE PROCEDURE trigger_func('modified_any');
+CREATE TRIGGER insert_a AFTER INSERT ON main_table
+FOR EACH ROW WHEN (NEW.a = 123) EXECUTE PROCEDURE trigger_func('insert_a');
+CREATE TRIGGER delete_a AFTER DELETE ON main_table
+FOR EACH ROW WHEN (OLD.a = 123) EXECUTE PROCEDURE trigger_func('delete_a');
+CREATE TRIGGER insert_when BEFORE INSERT ON main_table
+FOR EACH STATEMENT WHEN (true) EXECUTE PROCEDURE trigger_func('insert_when');
+CREATE TRIGGER delete_when AFTER DELETE ON main_table
+FOR EACH STATEMENT WHEN (true) EXECUTE PROCEDURE trigger_func('delete_when');
+SELECT trigger_name, event_manipulation, event_object_schema, event_object_table,
+ action_order, action_condition, action_orientation, action_timing,
+ action_reference_old_table, action_reference_new_table
+ FROM information_schema.triggers
+ WHERE event_object_table IN ('main_table')
+ ORDER BY trigger_name COLLATE "C", 2;
+ trigger_name | event_manipulation | event_object_schema | event_object_table | action_order | action_condition | action_orientation | action_timing | action_reference_old_table | action_reference_new_table
+----------------------+--------------------+---------------------+--------------------+--------------+--------------------------------+--------------------+---------------+----------------------------+----------------------------
+ after_ins_stmt_trig | INSERT | public | main_table | 1 | | STATEMENT | AFTER | |
+ after_upd_row_trig | UPDATE | public | main_table | 1 | | ROW | AFTER | |
+ after_upd_stmt_trig | UPDATE | public | main_table | 1 | | STATEMENT | AFTER | |
+ before_ins_stmt_trig | INSERT | public | main_table | 1 | | STATEMENT | BEFORE | |
+ delete_a | DELETE | public | main_table | 1 | (old.a = 123) | ROW | AFTER | |
+ delete_when | DELETE | public | main_table | 1 | true | STATEMENT | AFTER | |
+ insert_a | INSERT | public | main_table | 1 | (new.a = 123) | ROW | AFTER | |
+ insert_when | INSERT | public | main_table | 2 | true | STATEMENT | BEFORE | |
+ modified_a | UPDATE | public | main_table | 1 | (old.a <> new.a) | ROW | BEFORE | |
+ modified_any | UPDATE | public | main_table | 2 | (old.* IS DISTINCT FROM new.*) | ROW | BEFORE | |
+(10 rows)
+
+INSERT INTO main_table (a) VALUES (123), (456);
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(insert_when) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(insert_a) called: action = INSERT, when = AFTER, level = ROW
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+COPY main_table FROM stdin;
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(insert_when) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(insert_a) called: action = INSERT, when = AFTER, level = ROW
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+DELETE FROM main_table WHERE a IN (123, 456);
+NOTICE: trigger_func(delete_a) called: action = DELETE, when = AFTER, level = ROW
+NOTICE: trigger_func(delete_a) called: action = DELETE, when = AFTER, level = ROW
+NOTICE: trigger_func(delete_when) called: action = DELETE, when = AFTER, level = STATEMENT
+UPDATE main_table SET a = 50, b = 60;
+NOTICE: trigger_func(modified_any) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_any) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(modified_a) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(after_upd_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+SELECT * FROM main_table ORDER BY a, b;
+ a | b
+----+----
+ 6 | 10
+ 21 | 20
+ 30 | 40
+ 31 | 10
+ 50 | 35
+ 50 | 60
+ 81 | 15
+ |
+(8 rows)
+
+SELECT pg_get_triggerdef(oid, true) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_a';
+ pg_get_triggerdef
+-------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE TRIGGER modified_a BEFORE UPDATE OF a ON main_table FOR EACH ROW WHEN (old.a <> new.a) EXECUTE FUNCTION trigger_func('modified_a')
+(1 row)
+
+SELECT pg_get_triggerdef(oid, false) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_a';
+ pg_get_triggerdef
+----------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE TRIGGER modified_a BEFORE UPDATE OF a ON public.main_table FOR EACH ROW WHEN ((old.a <> new.a)) EXECUTE FUNCTION trigger_func('modified_a')
+(1 row)
+
+SELECT pg_get_triggerdef(oid, true) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_any';
+ pg_get_triggerdef
+-------------------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE TRIGGER modified_any BEFORE UPDATE OF a ON main_table FOR EACH ROW WHEN (old.* IS DISTINCT FROM new.*) EXECUTE FUNCTION trigger_func('modified_any')
+(1 row)
+
+-- Test RENAME TRIGGER
+ALTER TRIGGER modified_a ON main_table RENAME TO modified_modified_a;
+SELECT count(*) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_a';
+ count
+-------
+ 0
+(1 row)
+
+SELECT count(*) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'modified_modified_a';
+ count
+-------
+ 1
+(1 row)
+
+DROP TRIGGER modified_modified_a ON main_table;
+DROP TRIGGER modified_any ON main_table;
+DROP TRIGGER insert_a ON main_table;
+DROP TRIGGER delete_a ON main_table;
+DROP TRIGGER insert_when ON main_table;
+DROP TRIGGER delete_when ON main_table;
+-- Test WHEN condition accessing system columns.
+create table table_with_oids(a int);
+insert into table_with_oids values (1);
+create trigger oid_unchanged_trig after update on table_with_oids
+ for each row
+ when (new.tableoid = old.tableoid AND new.tableoid <> 0)
+ execute procedure trigger_func('after_upd_oid_unchanged');
+update table_with_oids set a = a + 1;
+NOTICE: trigger_func(after_upd_oid_unchanged) called: action = UPDATE, when = AFTER, level = ROW
+drop table table_with_oids;
+-- Test column-level triggers
+DROP TRIGGER after_upd_row_trig ON main_table;
+CREATE TRIGGER before_upd_a_row_trig BEFORE UPDATE OF a ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_upd_a_row');
+CREATE TRIGGER after_upd_b_row_trig AFTER UPDATE OF b ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('after_upd_b_row');
+CREATE TRIGGER after_upd_a_b_row_trig AFTER UPDATE OF a, b ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('after_upd_a_b_row');
+CREATE TRIGGER before_upd_a_stmt_trig BEFORE UPDATE OF a ON main_table
+FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func('before_upd_a_stmt');
+CREATE TRIGGER after_upd_b_stmt_trig AFTER UPDATE OF b ON main_table
+FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func('after_upd_b_stmt');
+SELECT pg_get_triggerdef(oid) FROM pg_trigger WHERE tgrelid = 'main_table'::regclass AND tgname = 'after_upd_a_b_row_trig';
+ pg_get_triggerdef
+-------------------------------------------------------------------------------------------------------------------------------------------------
+ CREATE TRIGGER after_upd_a_b_row_trig AFTER UPDATE OF a, b ON public.main_table FOR EACH ROW EXECUTE FUNCTION trigger_func('after_upd_a_b_row')
+(1 row)
+
+UPDATE main_table SET a = 50;
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+UPDATE main_table SET b = 10;
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+--
+-- Test case for bug with BEFORE trigger followed by AFTER trigger with WHEN
+--
+CREATE TABLE some_t (some_col boolean NOT NULL);
+CREATE FUNCTION dummy_update_func() RETURNS trigger AS $$
+BEGIN
+ RAISE NOTICE 'dummy_update_func(%) called: action = %, old = %, new = %',
+ TG_ARGV[0], TG_OP, OLD, NEW;
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER some_trig_before BEFORE UPDATE ON some_t FOR EACH ROW
+ EXECUTE PROCEDURE dummy_update_func('before');
+CREATE TRIGGER some_trig_aftera AFTER UPDATE ON some_t FOR EACH ROW
+ WHEN (NOT OLD.some_col AND NEW.some_col)
+ EXECUTE PROCEDURE dummy_update_func('aftera');
+CREATE TRIGGER some_trig_afterb AFTER UPDATE ON some_t FOR EACH ROW
+ WHEN (NOT NEW.some_col)
+ EXECUTE PROCEDURE dummy_update_func('afterb');
+INSERT INTO some_t VALUES (TRUE);
+UPDATE some_t SET some_col = TRUE;
+NOTICE: dummy_update_func(before) called: action = UPDATE, old = (t), new = (t)
+UPDATE some_t SET some_col = FALSE;
+NOTICE: dummy_update_func(before) called: action = UPDATE, old = (t), new = (f)
+NOTICE: dummy_update_func(afterb) called: action = UPDATE, old = (t), new = (f)
+UPDATE some_t SET some_col = TRUE;
+NOTICE: dummy_update_func(before) called: action = UPDATE, old = (f), new = (t)
+NOTICE: dummy_update_func(aftera) called: action = UPDATE, old = (f), new = (t)
+DROP TABLE some_t;
+-- bogus cases
+CREATE TRIGGER error_upd_and_col BEFORE UPDATE OR UPDATE OF a ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('error_upd_and_col');
+ERROR: duplicate trigger events specified at or near "ON"
+LINE 1: ...ER error_upd_and_col BEFORE UPDATE OR UPDATE OF a ON main_ta...
+ ^
+CREATE TRIGGER error_upd_a_a BEFORE UPDATE OF a, a ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('error_upd_a_a');
+ERROR: column "a" specified more than once
+CREATE TRIGGER error_ins_a BEFORE INSERT OF a ON main_table
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('error_ins_a');
+ERROR: syntax error at or near "OF"
+LINE 1: CREATE TRIGGER error_ins_a BEFORE INSERT OF a ON main_table
+ ^
+CREATE TRIGGER error_ins_when BEFORE INSERT OR UPDATE ON main_table
+FOR EACH ROW WHEN (OLD.a <> NEW.a)
+EXECUTE PROCEDURE trigger_func('error_ins_old');
+ERROR: INSERT trigger's WHEN condition cannot reference OLD values
+LINE 2: FOR EACH ROW WHEN (OLD.a <> NEW.a)
+ ^
+CREATE TRIGGER error_del_when BEFORE DELETE OR UPDATE ON main_table
+FOR EACH ROW WHEN (OLD.a <> NEW.a)
+EXECUTE PROCEDURE trigger_func('error_del_new');
+ERROR: DELETE trigger's WHEN condition cannot reference NEW values
+LINE 2: FOR EACH ROW WHEN (OLD.a <> NEW.a)
+ ^
+CREATE TRIGGER error_del_when BEFORE INSERT OR UPDATE ON main_table
+FOR EACH ROW WHEN (NEW.tableoid <> 0)
+EXECUTE PROCEDURE trigger_func('error_when_sys_column');
+ERROR: BEFORE trigger's WHEN condition cannot reference NEW system columns
+LINE 2: FOR EACH ROW WHEN (NEW.tableoid <> 0)
+ ^
+CREATE TRIGGER error_stmt_when BEFORE UPDATE OF a ON main_table
+FOR EACH STATEMENT WHEN (OLD.* IS DISTINCT FROM NEW.*)
+EXECUTE PROCEDURE trigger_func('error_stmt_when');
+ERROR: statement trigger's WHEN condition cannot reference column values
+LINE 2: FOR EACH STATEMENT WHEN (OLD.* IS DISTINCT FROM NEW.*)
+ ^
+-- check dependency restrictions
+ALTER TABLE main_table DROP COLUMN b;
+ERROR: cannot drop column b of table main_table because other objects depend on it
+DETAIL: trigger after_upd_b_row_trig on table main_table depends on column b of table main_table
+trigger after_upd_a_b_row_trig on table main_table depends on column b of table main_table
+trigger after_upd_b_stmt_trig on table main_table depends on column b of table main_table
+HINT: Use DROP ... CASCADE to drop the dependent objects too.
+-- this should succeed, but we'll roll it back to keep the triggers around
+begin;
+DROP TRIGGER after_upd_a_b_row_trig ON main_table;
+DROP TRIGGER after_upd_b_row_trig ON main_table;
+DROP TRIGGER after_upd_b_stmt_trig ON main_table;
+ALTER TABLE main_table DROP COLUMN b;
+rollback;
+-- Test enable/disable triggers
+create table trigtest (i serial primary key);
+-- test that disabling RI triggers works
+create table trigtest2 (i int references trigtest(i) on delete cascade);
+create function trigtest() returns trigger as $$
+begin
+ raise notice '% % % %', TG_RELNAME, TG_OP, TG_WHEN, TG_LEVEL;
+ return new;
+end;$$ language plpgsql;
+create trigger trigtest_b_row_tg before insert or update or delete on trigtest
+for each row execute procedure trigtest();
+create trigger trigtest_a_row_tg after insert or update or delete on trigtest
+for each row execute procedure trigtest();
+create trigger trigtest_b_stmt_tg before insert or update or delete on trigtest
+for each statement execute procedure trigtest();
+create trigger trigtest_a_stmt_tg after insert or update or delete on trigtest
+for each statement execute procedure trigtest();
+insert into trigtest default values;
+NOTICE: trigtest INSERT BEFORE STATEMENT
+NOTICE: trigtest INSERT BEFORE ROW
+NOTICE: trigtest INSERT AFTER ROW
+NOTICE: trigtest INSERT AFTER STATEMENT
+alter table trigtest disable trigger trigtest_b_row_tg;
+insert into trigtest default values;
+NOTICE: trigtest INSERT BEFORE STATEMENT
+NOTICE: trigtest INSERT AFTER ROW
+NOTICE: trigtest INSERT AFTER STATEMENT
+alter table trigtest disable trigger user;
+insert into trigtest default values;
+alter table trigtest enable trigger trigtest_a_stmt_tg;
+insert into trigtest default values;
+NOTICE: trigtest INSERT AFTER STATEMENT
+set session_replication_role = replica;
+insert into trigtest default values; -- does not trigger
+alter table trigtest enable always trigger trigtest_a_stmt_tg;
+insert into trigtest default values; -- now it does
+NOTICE: trigtest INSERT AFTER STATEMENT
+reset session_replication_role;
+insert into trigtest2 values(1);
+insert into trigtest2 values(2);
+delete from trigtest where i=2;
+NOTICE: trigtest DELETE AFTER STATEMENT
+select * from trigtest2;
+ i
+---
+ 1
+(1 row)
+
+alter table trigtest disable trigger all;
+delete from trigtest where i=1;
+select * from trigtest2;
+ i
+---
+ 1
+(1 row)
+
+-- ensure we still insert, even when all triggers are disabled
+insert into trigtest default values;
+select * from trigtest;
+ i
+---
+ 3
+ 4
+ 5
+ 6
+ 7
+(5 rows)
+
+drop table trigtest2;
+drop table trigtest;
+-- dump trigger data
+CREATE TABLE trigger_test (
+ i int,
+ v varchar
+);
+CREATE OR REPLACE FUNCTION trigger_data() RETURNS trigger
+LANGUAGE plpgsql AS $$
+
+declare
+
+ argstr text;
+ relid text;
+
+begin
+
+ relid := TG_relid::regclass;
+
+ -- plpgsql can't discover its trigger data in a hash like perl and python
+ -- can, or by a sort of reflection like tcl can,
+ -- so we have to hard code the names.
+ raise NOTICE 'TG_NAME: %', TG_name;
+ raise NOTICE 'TG_WHEN: %', TG_when;
+ raise NOTICE 'TG_LEVEL: %', TG_level;
+ raise NOTICE 'TG_OP: %', TG_op;
+ raise NOTICE 'TG_RELID::regclass: %', relid;
+ raise NOTICE 'TG_RELNAME: %', TG_relname;
+ raise NOTICE 'TG_TABLE_NAME: %', TG_table_name;
+ raise NOTICE 'TG_TABLE_SCHEMA: %', TG_table_schema;
+ raise NOTICE 'TG_NARGS: %', TG_nargs;
+
+ argstr := '[';
+ for i in 0 .. TG_nargs - 1 loop
+ if i > 0 then
+ argstr := argstr || ', ';
+ end if;
+ argstr := argstr || TG_argv[i];
+ end loop;
+ argstr := argstr || ']';
+ raise NOTICE 'TG_ARGV: %', argstr;
+
+ if TG_OP != 'INSERT' then
+ raise NOTICE 'OLD: %', OLD;
+ end if;
+
+ if TG_OP != 'DELETE' then
+ raise NOTICE 'NEW: %', NEW;
+ end if;
+
+ if TG_OP = 'DELETE' then
+ return OLD;
+ else
+ return NEW;
+ end if;
+
+end;
+$$;
+CREATE TRIGGER show_trigger_data_trig
+BEFORE INSERT OR UPDATE OR DELETE ON trigger_test
+FOR EACH ROW EXECUTE PROCEDURE trigger_data(23,'skidoo');
+insert into trigger_test values(1,'insert');
+NOTICE: TG_NAME: show_trigger_data_trig
+NOTICE: TG_WHEN: BEFORE
+NOTICE: TG_LEVEL: ROW
+NOTICE: TG_OP: INSERT
+NOTICE: TG_RELID::regclass: trigger_test
+NOTICE: TG_RELNAME: trigger_test
+NOTICE: TG_TABLE_NAME: trigger_test
+NOTICE: TG_TABLE_SCHEMA: public
+NOTICE: TG_NARGS: 2
+NOTICE: TG_ARGV: [23, skidoo]
+NOTICE: NEW: (1,insert)
+update trigger_test set v = 'update' where i = 1;
+NOTICE: TG_NAME: show_trigger_data_trig
+NOTICE: TG_WHEN: BEFORE
+NOTICE: TG_LEVEL: ROW
+NOTICE: TG_OP: UPDATE
+NOTICE: TG_RELID::regclass: trigger_test
+NOTICE: TG_RELNAME: trigger_test
+NOTICE: TG_TABLE_NAME: trigger_test
+NOTICE: TG_TABLE_SCHEMA: public
+NOTICE: TG_NARGS: 2
+NOTICE: TG_ARGV: [23, skidoo]
+NOTICE: OLD: (1,insert)
+NOTICE: NEW: (1,update)
+delete from trigger_test;
+NOTICE: TG_NAME: show_trigger_data_trig
+NOTICE: TG_WHEN: BEFORE
+NOTICE: TG_LEVEL: ROW
+NOTICE: TG_OP: DELETE
+NOTICE: TG_RELID::regclass: trigger_test
+NOTICE: TG_RELNAME: trigger_test
+NOTICE: TG_TABLE_NAME: trigger_test
+NOTICE: TG_TABLE_SCHEMA: public
+NOTICE: TG_NARGS: 2
+NOTICE: TG_ARGV: [23, skidoo]
+NOTICE: OLD: (1,update)
+DROP TRIGGER show_trigger_data_trig on trigger_test;
+DROP FUNCTION trigger_data();
+DROP TABLE trigger_test;
+--
+-- Test use of row comparisons on OLD/NEW
+--
+CREATE TABLE trigger_test (f1 int, f2 text, f3 text);
+-- this is the obvious (and wrong...) way to compare rows
+CREATE FUNCTION mytrigger() RETURNS trigger LANGUAGE plpgsql as $$
+begin
+ if row(old.*) = row(new.*) then
+ raise notice 'row % not changed', new.f1;
+ else
+ raise notice 'row % changed', new.f1;
+ end if;
+ return new;
+end$$;
+CREATE TRIGGER t
+BEFORE UPDATE ON trigger_test
+FOR EACH ROW EXECUTE PROCEDURE mytrigger();
+INSERT INTO trigger_test VALUES(1, 'foo', 'bar');
+INSERT INTO trigger_test VALUES(2, 'baz', 'quux');
+UPDATE trigger_test SET f3 = 'bar';
+NOTICE: row 1 not changed
+NOTICE: row 2 changed
+UPDATE trigger_test SET f3 = NULL;
+NOTICE: row 1 changed
+NOTICE: row 2 changed
+-- this demonstrates that the above isn't really working as desired:
+UPDATE trigger_test SET f3 = NULL;
+NOTICE: row 1 changed
+NOTICE: row 2 changed
+-- the right way when considering nulls is
+CREATE OR REPLACE FUNCTION mytrigger() RETURNS trigger LANGUAGE plpgsql as $$
+begin
+ if row(old.*) is distinct from row(new.*) then
+ raise notice 'row % changed', new.f1;
+ else
+ raise notice 'row % not changed', new.f1;
+ end if;
+ return new;
+end$$;
+UPDATE trigger_test SET f3 = 'bar';
+NOTICE: row 1 changed
+NOTICE: row 2 changed
+UPDATE trigger_test SET f3 = NULL;
+NOTICE: row 1 changed
+NOTICE: row 2 changed
+UPDATE trigger_test SET f3 = NULL;
+NOTICE: row 1 not changed
+NOTICE: row 2 not changed
+DROP TABLE trigger_test;
+DROP FUNCTION mytrigger();
+-- Test snapshot management in serializable transactions involving triggers
+-- per bug report in 6bc73d4c0910042358k3d1adff3qa36f8df75198ecea@mail.gmail.com
+CREATE FUNCTION serializable_update_trig() RETURNS trigger LANGUAGE plpgsql AS
+$$
+declare
+ rec record;
+begin
+ new.description = 'updated in trigger';
+ return new;
+end;
+$$;
+CREATE TABLE serializable_update_tab (
+ id int,
+ filler text,
+ description text
+);
+CREATE TRIGGER serializable_update_trig BEFORE UPDATE ON serializable_update_tab
+ FOR EACH ROW EXECUTE PROCEDURE serializable_update_trig();
+INSERT INTO serializable_update_tab SELECT a, repeat('xyzxz', 100), 'new'
+ FROM generate_series(1, 50) a;
+BEGIN;
+SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+UPDATE serializable_update_tab SET description = 'no no', id = 1 WHERE id = 1;
+COMMIT;
+SELECT description FROM serializable_update_tab WHERE id = 1;
+ description
+--------------------
+ updated in trigger
+(1 row)
+
+DROP TABLE serializable_update_tab;
+-- minimal update trigger
+CREATE TABLE min_updates_test (
+ f1 text,
+ f2 int,
+ f3 int);
+INSERT INTO min_updates_test VALUES ('a',1,2),('b','2',null);
+CREATE TRIGGER z_min_update
+BEFORE UPDATE ON min_updates_test
+FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();
+\set QUIET false
+UPDATE min_updates_test SET f1 = f1;
+UPDATE 0
+UPDATE min_updates_test SET f2 = f2 + 1;
+UPDATE 2
+UPDATE min_updates_test SET f3 = 2 WHERE f3 is null;
+UPDATE 1
+\set QUIET true
+SELECT * FROM min_updates_test;
+ f1 | f2 | f3
+----+----+----
+ a | 2 | 2
+ b | 3 | 2
+(2 rows)
+
+DROP TABLE min_updates_test;
+--
+-- Test triggers on views
+--
+CREATE VIEW main_view AS SELECT a, b FROM main_table;
+-- VIEW trigger function
+CREATE OR REPLACE FUNCTION view_trigger() RETURNS trigger
+LANGUAGE plpgsql AS $$
+declare
+ argstr text := '';
+begin
+ for i in 0 .. TG_nargs - 1 loop
+ if i > 0 then
+ argstr := argstr || ', ';
+ end if;
+ argstr := argstr || TG_argv[i];
+ end loop;
+
+ raise notice '% % % % (%)', TG_RELNAME, TG_WHEN, TG_OP, TG_LEVEL, argstr;
+
+ if TG_LEVEL = 'ROW' then
+ if TG_OP = 'INSERT' then
+ raise NOTICE 'NEW: %', NEW;
+ INSERT INTO main_table VALUES (NEW.a, NEW.b);
+ RETURN NEW;
+ end if;
+
+ if TG_OP = 'UPDATE' then
+ raise NOTICE 'OLD: %, NEW: %', OLD, NEW;
+ UPDATE main_table SET a = NEW.a, b = NEW.b WHERE a = OLD.a AND b = OLD.b;
+ if NOT FOUND then RETURN NULL; end if;
+ RETURN NEW;
+ end if;
+
+ if TG_OP = 'DELETE' then
+ raise NOTICE 'OLD: %', OLD;
+ DELETE FROM main_table WHERE a = OLD.a AND b = OLD.b;
+ if NOT FOUND then RETURN NULL; end if;
+ RETURN OLD;
+ end if;
+ end if;
+
+ RETURN NULL;
+end;
+$$;
+-- Before row triggers aren't allowed on views
+CREATE TRIGGER invalid_trig BEFORE INSERT ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_ins_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+CREATE TRIGGER invalid_trig BEFORE UPDATE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_upd_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+CREATE TRIGGER invalid_trig BEFORE DELETE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_del_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+-- After row triggers aren't allowed on views
+CREATE TRIGGER invalid_trig AFTER INSERT ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_ins_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+CREATE TRIGGER invalid_trig AFTER UPDATE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_upd_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+CREATE TRIGGER invalid_trig AFTER DELETE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE trigger_func('before_del_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have row-level BEFORE or AFTER triggers.
+-- Truncate triggers aren't allowed on views
+CREATE TRIGGER invalid_trig BEFORE TRUNCATE ON main_view
+EXECUTE PROCEDURE trigger_func('before_tru_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have TRUNCATE triggers.
+CREATE TRIGGER invalid_trig AFTER TRUNCATE ON main_view
+EXECUTE PROCEDURE trigger_func('before_tru_row');
+ERROR: "main_view" is a view
+DETAIL: Views cannot have TRUNCATE triggers.
+-- INSTEAD OF triggers aren't allowed on tables
+CREATE TRIGGER invalid_trig INSTEAD OF INSERT ON main_table
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_ins');
+ERROR: "main_table" is a table
+DETAIL: Tables cannot have INSTEAD OF triggers.
+CREATE TRIGGER invalid_trig INSTEAD OF UPDATE ON main_table
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_upd');
+ERROR: "main_table" is a table
+DETAIL: Tables cannot have INSTEAD OF triggers.
+CREATE TRIGGER invalid_trig INSTEAD OF DELETE ON main_table
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_del');
+ERROR: "main_table" is a table
+DETAIL: Tables cannot have INSTEAD OF triggers.
+-- Don't support WHEN clauses with INSTEAD OF triggers
+CREATE TRIGGER invalid_trig INSTEAD OF UPDATE ON main_view
+FOR EACH ROW WHEN (OLD.a <> NEW.a) EXECUTE PROCEDURE view_trigger('instead_of_upd');
+ERROR: INSTEAD OF triggers cannot have WHEN conditions
+-- Don't support column-level INSTEAD OF triggers
+CREATE TRIGGER invalid_trig INSTEAD OF UPDATE OF a ON main_view
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_upd');
+ERROR: INSTEAD OF triggers cannot have column lists
+-- Don't support statement-level INSTEAD OF triggers
+CREATE TRIGGER invalid_trig INSTEAD OF UPDATE ON main_view
+EXECUTE PROCEDURE view_trigger('instead_of_upd');
+ERROR: INSTEAD OF triggers must be FOR EACH ROW
+-- Valid INSTEAD OF triggers
+CREATE TRIGGER instead_of_insert_trig INSTEAD OF INSERT ON main_view
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_ins');
+CREATE TRIGGER instead_of_update_trig INSTEAD OF UPDATE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_upd');
+CREATE TRIGGER instead_of_delete_trig INSTEAD OF DELETE ON main_view
+FOR EACH ROW EXECUTE PROCEDURE view_trigger('instead_of_del');
+-- Valid BEFORE statement VIEW triggers
+CREATE TRIGGER before_ins_stmt_trig BEFORE INSERT ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('before_view_ins_stmt');
+CREATE TRIGGER before_upd_stmt_trig BEFORE UPDATE ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('before_view_upd_stmt');
+CREATE TRIGGER before_del_stmt_trig BEFORE DELETE ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('before_view_del_stmt');
+-- Valid AFTER statement VIEW triggers
+CREATE TRIGGER after_ins_stmt_trig AFTER INSERT ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('after_view_ins_stmt');
+CREATE TRIGGER after_upd_stmt_trig AFTER UPDATE ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('after_view_upd_stmt');
+CREATE TRIGGER after_del_stmt_trig AFTER DELETE ON main_view
+FOR EACH STATEMENT EXECUTE PROCEDURE view_trigger('after_view_del_stmt');
+\set QUIET false
+-- Insert into view using trigger
+INSERT INTO main_view VALUES (20, 30);
+NOTICE: main_view BEFORE INSERT STATEMENT (before_view_ins_stmt)
+NOTICE: main_view INSTEAD OF INSERT ROW (instead_of_ins)
+NOTICE: NEW: (20,30)
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER INSERT STATEMENT (after_view_ins_stmt)
+INSERT 0 1
+INSERT INTO main_view VALUES (21, 31) RETURNING a, b;
+NOTICE: main_view BEFORE INSERT STATEMENT (before_view_ins_stmt)
+NOTICE: main_view INSTEAD OF INSERT ROW (instead_of_ins)
+NOTICE: NEW: (21,31)
+NOTICE: trigger_func(before_ins_stmt) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_ins_stmt) called: action = INSERT, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER INSERT STATEMENT (after_view_ins_stmt)
+ a | b
+----+----
+ 21 | 31
+(1 row)
+
+INSERT 0 1
+-- Table trigger will prevent updates
+UPDATE main_view SET b = 31 WHERE a = 20;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view INSTEAD OF UPDATE ROW (instead_of_upd)
+NOTICE: OLD: (20,30), NEW: (20,31)
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+UPDATE 0
+UPDATE main_view SET b = 32 WHERE a = 21 AND b = 31 RETURNING a, b;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view INSTEAD OF UPDATE ROW (instead_of_upd)
+NOTICE: OLD: (21,31), NEW: (21,32)
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(before_upd_a_row) called: action = UPDATE, when = BEFORE, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+ a | b
+---+---
+(0 rows)
+
+UPDATE 0
+-- Remove table trigger to allow updates
+DROP TRIGGER before_upd_a_row_trig ON main_table;
+DROP TRIGGER
+UPDATE main_view SET b = 31 WHERE a = 20;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view INSTEAD OF UPDATE ROW (instead_of_upd)
+NOTICE: OLD: (20,30), NEW: (20,31)
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+UPDATE 1
+UPDATE main_view SET b = 32 WHERE a = 21 AND b = 31 RETURNING a, b;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view INSTEAD OF UPDATE ROW (instead_of_upd)
+NOTICE: OLD: (21,31), NEW: (21,32)
+NOTICE: trigger_func(before_upd_a_stmt) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger_func(after_upd_a_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_row) called: action = UPDATE, when = AFTER, level = ROW
+NOTICE: trigger_func(after_upd_b_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: trigger_func(after_upd_stmt) called: action = UPDATE, when = AFTER, level = STATEMENT
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+ a | b
+----+----
+ 21 | 32
+(1 row)
+
+UPDATE 1
+-- Before and after stmt triggers should fire even when no rows are affected
+UPDATE main_view SET b = 0 WHERE false;
+NOTICE: main_view BEFORE UPDATE STATEMENT (before_view_upd_stmt)
+NOTICE: main_view AFTER UPDATE STATEMENT (after_view_upd_stmt)
+UPDATE 0
+-- Delete from view using trigger
+DELETE FROM main_view WHERE a IN (20,21);
+NOTICE: main_view BEFORE DELETE STATEMENT (before_view_del_stmt)
+NOTICE: main_view INSTEAD OF DELETE ROW (instead_of_del)
+NOTICE: OLD: (21,10)
+NOTICE: main_view INSTEAD OF DELETE ROW (instead_of_del)
+NOTICE: OLD: (20,31)
+NOTICE: main_view INSTEAD OF DELETE ROW (instead_of_del)
+NOTICE: OLD: (21,32)
+NOTICE: main_view AFTER DELETE STATEMENT (after_view_del_stmt)
+DELETE 3
+DELETE FROM main_view WHERE a = 31 RETURNING a, b;
+NOTICE: main_view BEFORE DELETE STATEMENT (before_view_del_stmt)
+NOTICE: main_view INSTEAD OF DELETE ROW (instead_of_del)
+NOTICE: OLD: (31,10)
+NOTICE: main_view AFTER DELETE STATEMENT (after_view_del_stmt)
+ a | b
+----+----
+ 31 | 10
+(1 row)
+
+DELETE 1
+\set QUIET true
+-- Describe view should list triggers
+\d main_view
+ View "public.main_view"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ a | integer | | |
+ b | integer | | |
+Triggers:
+ after_del_stmt_trig AFTER DELETE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_del_stmt')
+ after_ins_stmt_trig AFTER INSERT ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_ins_stmt')
+ after_upd_stmt_trig AFTER UPDATE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_upd_stmt')
+ before_del_stmt_trig BEFORE DELETE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_del_stmt')
+ before_ins_stmt_trig BEFORE INSERT ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_ins_stmt')
+ before_upd_stmt_trig BEFORE UPDATE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_upd_stmt')
+ instead_of_delete_trig INSTEAD OF DELETE ON main_view FOR EACH ROW EXECUTE FUNCTION view_trigger('instead_of_del')
+ instead_of_insert_trig INSTEAD OF INSERT ON main_view FOR EACH ROW EXECUTE FUNCTION view_trigger('instead_of_ins')
+ instead_of_update_trig INSTEAD OF UPDATE ON main_view FOR EACH ROW EXECUTE FUNCTION view_trigger('instead_of_upd')
+
+-- Test dropping view triggers
+DROP TRIGGER instead_of_insert_trig ON main_view;
+DROP TRIGGER instead_of_delete_trig ON main_view;
+\d+ main_view
+ View "public.main_view"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+ b | integer | | | | plain |
+View definition:
+ SELECT main_table.a,
+ main_table.b
+ FROM main_table;
+Triggers:
+ after_del_stmt_trig AFTER DELETE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_del_stmt')
+ after_ins_stmt_trig AFTER INSERT ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_ins_stmt')
+ after_upd_stmt_trig AFTER UPDATE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('after_view_upd_stmt')
+ before_del_stmt_trig BEFORE DELETE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_del_stmt')
+ before_ins_stmt_trig BEFORE INSERT ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_ins_stmt')
+ before_upd_stmt_trig BEFORE UPDATE ON main_view FOR EACH STATEMENT EXECUTE FUNCTION view_trigger('before_view_upd_stmt')
+ instead_of_update_trig INSTEAD OF UPDATE ON main_view FOR EACH ROW EXECUTE FUNCTION view_trigger('instead_of_upd')
+
+DROP VIEW main_view;
+--
+-- Test triggers on a join view
+--
+CREATE TABLE country_table (
+ country_id serial primary key,
+ country_name text unique not null,
+ continent text not null
+);
+INSERT INTO country_table (country_name, continent)
+ VALUES ('Japan', 'Asia'),
+ ('UK', 'Europe'),
+ ('USA', 'North America')
+ RETURNING *;
+ country_id | country_name | continent
+------------+--------------+---------------
+ 1 | Japan | Asia
+ 2 | UK | Europe
+ 3 | USA | North America
+(3 rows)
+
+CREATE TABLE city_table (
+ city_id serial primary key,
+ city_name text not null,
+ population bigint,
+ country_id int references country_table
+);
+CREATE VIEW city_view AS
+ SELECT city_id, city_name, population, country_name, continent
+ FROM city_table ci
+ LEFT JOIN country_table co ON co.country_id = ci.country_id;
+CREATE FUNCTION city_insert() RETURNS trigger LANGUAGE plpgsql AS $$
+declare
+ ctry_id int;
+begin
+ if NEW.country_name IS NOT NULL then
+ SELECT country_id, continent INTO ctry_id, NEW.continent
+ FROM country_table WHERE country_name = NEW.country_name;
+ if NOT FOUND then
+ raise exception 'No such country: "%"', NEW.country_name;
+ end if;
+ else
+ NEW.continent := NULL;
+ end if;
+
+ if NEW.city_id IS NOT NULL then
+ INSERT INTO city_table
+ VALUES(NEW.city_id, NEW.city_name, NEW.population, ctry_id);
+ else
+ INSERT INTO city_table(city_name, population, country_id)
+ VALUES(NEW.city_name, NEW.population, ctry_id)
+ RETURNING city_id INTO NEW.city_id;
+ end if;
+
+ RETURN NEW;
+end;
+$$;
+CREATE TRIGGER city_insert_trig INSTEAD OF INSERT ON city_view
+FOR EACH ROW EXECUTE PROCEDURE city_insert();
+CREATE FUNCTION city_delete() RETURNS trigger LANGUAGE plpgsql AS $$
+begin
+ DELETE FROM city_table WHERE city_id = OLD.city_id;
+ if NOT FOUND then RETURN NULL; end if;
+ RETURN OLD;
+end;
+$$;
+CREATE TRIGGER city_delete_trig INSTEAD OF DELETE ON city_view
+FOR EACH ROW EXECUTE PROCEDURE city_delete();
+CREATE FUNCTION city_update() RETURNS trigger LANGUAGE plpgsql AS $$
+declare
+ ctry_id int;
+begin
+ if NEW.country_name IS DISTINCT FROM OLD.country_name then
+ SELECT country_id, continent INTO ctry_id, NEW.continent
+ FROM country_table WHERE country_name = NEW.country_name;
+ if NOT FOUND then
+ raise exception 'No such country: "%"', NEW.country_name;
+ end if;
+
+ UPDATE city_table SET city_name = NEW.city_name,
+ population = NEW.population,
+ country_id = ctry_id
+ WHERE city_id = OLD.city_id;
+ else
+ UPDATE city_table SET city_name = NEW.city_name,
+ population = NEW.population
+ WHERE city_id = OLD.city_id;
+ NEW.continent := OLD.continent;
+ end if;
+
+ if NOT FOUND then RETURN NULL; end if;
+ RETURN NEW;
+end;
+$$;
+CREATE TRIGGER city_update_trig INSTEAD OF UPDATE ON city_view
+FOR EACH ROW EXECUTE PROCEDURE city_update();
+\set QUIET false
+-- INSERT .. RETURNING
+INSERT INTO city_view(city_name) VALUES('Tokyo') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 1 | Tokyo | | |
+(1 row)
+
+INSERT 0 1
+INSERT INTO city_view(city_name, population) VALUES('London', 7556900) RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 2 | London | 7556900 | |
+(1 row)
+
+INSERT 0 1
+INSERT INTO city_view(city_name, country_name) VALUES('Washington DC', 'USA') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+---------------+------------+--------------+---------------
+ 3 | Washington DC | | USA | North America
+(1 row)
+
+INSERT 0 1
+INSERT INTO city_view(city_id, city_name) VALUES(123456, 'New York') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 123456 | New York | | |
+(1 row)
+
+INSERT 0 1
+INSERT INTO city_view VALUES(234567, 'Birmingham', 1016800, 'UK', 'EU') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+------------+------------+--------------+-----------
+ 234567 | Birmingham | 1016800 | UK | Europe
+(1 row)
+
+INSERT 0 1
+-- UPDATE .. RETURNING
+UPDATE city_view SET country_name = 'Japon' WHERE city_name = 'Tokyo'; -- error
+ERROR: No such country: "Japon"
+CONTEXT: PL/pgSQL function city_update() line 9 at RAISE
+UPDATE city_view SET country_name = 'Japan' WHERE city_name = 'Takyo'; -- no match
+UPDATE 0
+UPDATE city_view SET country_name = 'Japan' WHERE city_name = 'Tokyo' RETURNING *; -- OK
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 1 | Tokyo | | Japan | Asia
+(1 row)
+
+UPDATE 1
+UPDATE city_view SET population = 13010279 WHERE city_name = 'Tokyo' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 1 | Tokyo | 13010279 | Japan | Asia
+(1 row)
+
+UPDATE 1
+UPDATE city_view SET country_name = 'UK' WHERE city_name = 'New York' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 123456 | New York | | UK | Europe
+(1 row)
+
+UPDATE 1
+UPDATE city_view SET country_name = 'USA', population = 8391881 WHERE city_name = 'New York' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+---------------
+ 123456 | New York | 8391881 | USA | North America
+(1 row)
+
+UPDATE 1
+UPDATE city_view SET continent = 'EU' WHERE continent = 'Europe' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+------------+------------+--------------+-----------
+ 234567 | Birmingham | 1016800 | UK | Europe
+(1 row)
+
+UPDATE 1
+UPDATE city_view v1 SET country_name = v2.country_name FROM city_view v2
+ WHERE v2.city_name = 'Birmingham' AND v1.city_name = 'London' RETURNING *;
+ city_id | city_name | population | country_name | continent | city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------+---------+------------+------------+--------------+-----------
+ 2 | London | 7556900 | UK | Europe | 234567 | Birmingham | 1016800 | UK | Europe
+(1 row)
+
+UPDATE 1
+-- DELETE .. RETURNING
+DELETE FROM city_view WHERE city_name = 'Birmingham' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+------------+------------+--------------+-----------
+ 234567 | Birmingham | 1016800 | UK | Europe
+(1 row)
+
+DELETE 1
+\set QUIET true
+-- read-only view with WHERE clause
+CREATE VIEW european_city_view AS
+ SELECT * FROM city_view WHERE continent = 'Europe';
+SELECT count(*) FROM european_city_view;
+ count
+-------
+ 1
+(1 row)
+
+CREATE FUNCTION no_op_trig_fn() RETURNS trigger LANGUAGE plpgsql
+AS 'begin RETURN NULL; end';
+CREATE TRIGGER no_op_trig INSTEAD OF INSERT OR UPDATE OR DELETE
+ON european_city_view FOR EACH ROW EXECUTE PROCEDURE no_op_trig_fn();
+\set QUIET false
+INSERT INTO european_city_view VALUES (0, 'x', 10000, 'y', 'z');
+INSERT 0 0
+UPDATE european_city_view SET population = 10000;
+UPDATE 0
+DELETE FROM european_city_view;
+DELETE 0
+\set QUIET true
+-- rules bypassing no-op triggers
+CREATE RULE european_city_insert_rule AS ON INSERT TO european_city_view
+DO INSTEAD INSERT INTO city_view
+VALUES (NEW.city_id, NEW.city_name, NEW.population, NEW.country_name, NEW.continent)
+RETURNING *;
+CREATE RULE european_city_update_rule AS ON UPDATE TO european_city_view
+DO INSTEAD UPDATE city_view SET
+ city_name = NEW.city_name,
+ population = NEW.population,
+ country_name = NEW.country_name
+WHERE city_id = OLD.city_id
+RETURNING NEW.*;
+CREATE RULE european_city_delete_rule AS ON DELETE TO european_city_view
+DO INSTEAD DELETE FROM city_view WHERE city_id = OLD.city_id RETURNING *;
+\set QUIET false
+-- INSERT not limited by view's WHERE clause, but UPDATE AND DELETE are
+INSERT INTO european_city_view(city_name, country_name)
+ VALUES ('Cambridge', 'USA') RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+---------------
+ 4 | Cambridge | | USA | North America
+(1 row)
+
+INSERT 0 1
+UPDATE european_city_view SET country_name = 'UK'
+ WHERE city_name = 'Cambridge';
+UPDATE 0
+DELETE FROM european_city_view WHERE city_name = 'Cambridge';
+DELETE 0
+-- UPDATE and DELETE via rule and trigger
+UPDATE city_view SET country_name = 'UK'
+ WHERE city_name = 'Cambridge' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 4 | Cambridge | | UK | Europe
+(1 row)
+
+UPDATE 1
+UPDATE european_city_view SET population = 122800
+ WHERE city_name = 'Cambridge' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 4 | Cambridge | 122800 | UK | Europe
+(1 row)
+
+UPDATE 1
+DELETE FROM european_city_view WHERE city_name = 'Cambridge' RETURNING *;
+ city_id | city_name | population | country_name | continent
+---------+-----------+------------+--------------+-----------
+ 4 | Cambridge | 122800 | UK | Europe
+(1 row)
+
+DELETE 1
+-- join UPDATE test
+UPDATE city_view v SET population = 599657
+ FROM city_table ci, country_table co
+ WHERE ci.city_name = 'Washington DC' and co.country_name = 'USA'
+ AND v.city_id = ci.city_id AND v.country_name = co.country_name
+ RETURNING co.country_id, v.country_name,
+ v.city_id, v.city_name, v.population;
+ country_id | country_name | city_id | city_name | population
+------------+--------------+---------+---------------+------------
+ 3 | USA | 3 | Washington DC | 599657
+(1 row)
+
+UPDATE 1
+\set QUIET true
+SELECT * FROM city_view;
+ city_id | city_name | population | country_name | continent
+---------+---------------+------------+--------------+---------------
+ 1 | Tokyo | 13010279 | Japan | Asia
+ 123456 | New York | 8391881 | USA | North America
+ 2 | London | 7556900 | UK | Europe
+ 3 | Washington DC | 599657 | USA | North America
+(4 rows)
+
+DROP TABLE city_table CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view city_view
+drop cascades to view european_city_view
+DROP TABLE country_table;
+-- Test pg_trigger_depth()
+create table depth_a (id int not null primary key);
+create table depth_b (id int not null primary key);
+create table depth_c (id int not null primary key);
+create function depth_a_tf() returns trigger
+ language plpgsql as $$
+begin
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ insert into depth_b values (new.id);
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ return new;
+end;
+$$;
+create trigger depth_a_tr before insert on depth_a
+ for each row execute procedure depth_a_tf();
+create function depth_b_tf() returns trigger
+ language plpgsql as $$
+begin
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ begin
+ execute 'insert into depth_c values (' || new.id::text || ')';
+ exception
+ when sqlstate 'U9999' then
+ raise notice 'SQLSTATE = U9999: depth = %', pg_trigger_depth();
+ end;
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ if new.id = 1 then
+ execute 'insert into depth_c values (' || new.id::text || ')';
+ end if;
+ return new;
+end;
+$$;
+create trigger depth_b_tr before insert on depth_b
+ for each row execute procedure depth_b_tf();
+create function depth_c_tf() returns trigger
+ language plpgsql as $$
+begin
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ if new.id = 1 then
+ raise exception sqlstate 'U9999';
+ end if;
+ raise notice '%: depth = %', tg_name, pg_trigger_depth();
+ return new;
+end;
+$$;
+create trigger depth_c_tr before insert on depth_c
+ for each row execute procedure depth_c_tf();
+select pg_trigger_depth();
+ pg_trigger_depth
+------------------
+ 0
+(1 row)
+
+insert into depth_a values (1);
+NOTICE: depth_a_tr: depth = 1
+NOTICE: depth_b_tr: depth = 2
+NOTICE: depth_c_tr: depth = 3
+NOTICE: SQLSTATE = U9999: depth = 2
+NOTICE: depth_b_tr: depth = 2
+NOTICE: depth_c_tr: depth = 3
+ERROR: U9999
+CONTEXT: PL/pgSQL function depth_c_tf() line 5 at RAISE
+SQL statement "insert into depth_c values (1)"
+PL/pgSQL function depth_b_tf() line 12 at EXECUTE
+SQL statement "insert into depth_b values (new.id)"
+PL/pgSQL function depth_a_tf() line 4 at SQL statement
+select pg_trigger_depth();
+ pg_trigger_depth
+------------------
+ 0
+(1 row)
+
+insert into depth_a values (2);
+NOTICE: depth_a_tr: depth = 1
+NOTICE: depth_b_tr: depth = 2
+NOTICE: depth_c_tr: depth = 3
+NOTICE: depth_c_tr: depth = 3
+NOTICE: depth_b_tr: depth = 2
+NOTICE: depth_a_tr: depth = 1
+select pg_trigger_depth();
+ pg_trigger_depth
+------------------
+ 0
+(1 row)
+
+drop table depth_a, depth_b, depth_c;
+drop function depth_a_tf();
+drop function depth_b_tf();
+drop function depth_c_tf();
+--
+-- Test updates to rows during firing of BEFORE ROW triggers.
+-- As of 9.2, such cases should be rejected (see bug #6123).
+--
+create temp table parent (
+ aid int not null primary key,
+ val1 text,
+ val2 text,
+ val3 text,
+ val4 text,
+ bcnt int not null default 0);
+create temp table child (
+ bid int not null primary key,
+ aid int not null,
+ val1 text);
+create function parent_upd_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if old.val1 <> new.val1 then
+ new.val2 = new.val1;
+ delete from child where child.aid = new.aid and child.val1 = new.val1;
+ end if;
+ return new;
+end;
+$$;
+create trigger parent_upd_trig before update on parent
+ for each row execute procedure parent_upd_func();
+create function parent_del_func()
+ returns trigger language plpgsql as
+$$
+begin
+ delete from child where aid = old.aid;
+ return old;
+end;
+$$;
+create trigger parent_del_trig before delete on parent
+ for each row execute procedure parent_del_func();
+create function child_ins_func()
+ returns trigger language plpgsql as
+$$
+begin
+ update parent set bcnt = bcnt + 1 where aid = new.aid;
+ return new;
+end;
+$$;
+create trigger child_ins_trig after insert on child
+ for each row execute procedure child_ins_func();
+create function child_del_func()
+ returns trigger language plpgsql as
+$$
+begin
+ update parent set bcnt = bcnt - 1 where aid = old.aid;
+ return old;
+end;
+$$;
+create trigger child_del_trig after delete on child
+ for each row execute procedure child_del_func();
+insert into parent values (1, 'a', 'a', 'a', 'a', 0);
+insert into child values (10, 1, 'b');
+select * from parent; select * from child;
+ aid | val1 | val2 | val3 | val4 | bcnt
+-----+------+------+------+------+------
+ 1 | a | a | a | a | 1
+(1 row)
+
+ bid | aid | val1
+-----+-----+------
+ 10 | 1 | b
+(1 row)
+
+update parent set val1 = 'b' where aid = 1; -- should fail
+ERROR: tuple to be updated was already modified by an operation triggered by the current command
+HINT: Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.
+select * from parent; select * from child;
+ aid | val1 | val2 | val3 | val4 | bcnt
+-----+------+------+------+------+------
+ 1 | a | a | a | a | 1
+(1 row)
+
+ bid | aid | val1
+-----+-----+------
+ 10 | 1 | b
+(1 row)
+
+delete from parent where aid = 1; -- should fail
+ERROR: tuple to be deleted was already modified by an operation triggered by the current command
+HINT: Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.
+select * from parent; select * from child;
+ aid | val1 | val2 | val3 | val4 | bcnt
+-----+------+------+------+------+------
+ 1 | a | a | a | a | 1
+(1 row)
+
+ bid | aid | val1
+-----+-----+------
+ 10 | 1 | b
+(1 row)
+
+-- replace the trigger function with one that restarts the deletion after
+-- having modified a child
+create or replace function parent_del_func()
+ returns trigger language plpgsql as
+$$
+begin
+ delete from child where aid = old.aid;
+ if found then
+ delete from parent where aid = old.aid;
+ return null; -- cancel outer deletion
+ end if;
+ return old;
+end;
+$$;
+delete from parent where aid = 1;
+select * from parent; select * from child;
+ aid | val1 | val2 | val3 | val4 | bcnt
+-----+------+------+------+------+------
+(0 rows)
+
+ bid | aid | val1
+-----+-----+------
+(0 rows)
+
+drop table parent, child;
+drop function parent_upd_func();
+drop function parent_del_func();
+drop function child_ins_func();
+drop function child_del_func();
+-- similar case, but with a self-referencing FK so that parent and child
+-- rows can be affected by a single operation
+create temp table self_ref_trigger (
+ id int primary key,
+ parent int references self_ref_trigger,
+ data text,
+ nchildren int not null default 0
+);
+create function self_ref_trigger_ins_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if new.parent is not null then
+ update self_ref_trigger set nchildren = nchildren + 1
+ where id = new.parent;
+ end if;
+ return new;
+end;
+$$;
+create trigger self_ref_trigger_ins_trig before insert on self_ref_trigger
+ for each row execute procedure self_ref_trigger_ins_func();
+create function self_ref_trigger_del_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if old.parent is not null then
+ update self_ref_trigger set nchildren = nchildren - 1
+ where id = old.parent;
+ end if;
+ return old;
+end;
+$$;
+create trigger self_ref_trigger_del_trig before delete on self_ref_trigger
+ for each row execute procedure self_ref_trigger_del_func();
+insert into self_ref_trigger values (1, null, 'root');
+insert into self_ref_trigger values (2, 1, 'root child A');
+insert into self_ref_trigger values (3, 1, 'root child B');
+insert into self_ref_trigger values (4, 2, 'grandchild 1');
+insert into self_ref_trigger values (5, 3, 'grandchild 2');
+update self_ref_trigger set data = 'root!' where id = 1;
+select * from self_ref_trigger;
+ id | parent | data | nchildren
+----+--------+--------------+-----------
+ 2 | 1 | root child A | 1
+ 4 | 2 | grandchild 1 | 0
+ 3 | 1 | root child B | 1
+ 5 | 3 | grandchild 2 | 0
+ 1 | | root! | 2
+(5 rows)
+
+delete from self_ref_trigger;
+ERROR: tuple to be deleted was already modified by an operation triggered by the current command
+HINT: Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.
+select * from self_ref_trigger;
+ id | parent | data | nchildren
+----+--------+--------------+-----------
+ 2 | 1 | root child A | 1
+ 4 | 2 | grandchild 1 | 0
+ 3 | 1 | root child B | 1
+ 5 | 3 | grandchild 2 | 0
+ 1 | | root! | 2
+(5 rows)
+
+drop table self_ref_trigger;
+drop function self_ref_trigger_ins_func();
+drop function self_ref_trigger_del_func();
+--
+-- Check that statement triggers work correctly even with all children excluded
+--
+create table stmt_trig_on_empty_upd (a int);
+create table stmt_trig_on_empty_upd1 () inherits (stmt_trig_on_empty_upd);
+create function update_stmt_notice() returns trigger as $$
+begin
+ raise notice 'updating %', TG_TABLE_NAME;
+ return null;
+end;
+$$ language plpgsql;
+create trigger before_stmt_trigger
+ before update on stmt_trig_on_empty_upd
+ execute procedure update_stmt_notice();
+create trigger before_stmt_trigger
+ before update on stmt_trig_on_empty_upd1
+ execute procedure update_stmt_notice();
+-- inherited no-op update
+update stmt_trig_on_empty_upd set a = a where false returning a+1 as aa;
+NOTICE: updating stmt_trig_on_empty_upd
+ aa
+----
+(0 rows)
+
+-- simple no-op update
+update stmt_trig_on_empty_upd1 set a = a where false returning a+1 as aa;
+NOTICE: updating stmt_trig_on_empty_upd1
+ aa
+----
+(0 rows)
+
+drop table stmt_trig_on_empty_upd cascade;
+NOTICE: drop cascades to table stmt_trig_on_empty_upd1
+drop function update_stmt_notice();
+--
+-- Check that index creation (or DDL in general) is prohibited in a trigger
+--
+create table trigger_ddl_table (
+ col1 integer,
+ col2 integer
+);
+create function trigger_ddl_func() returns trigger as $$
+begin
+ alter table trigger_ddl_table add primary key (col1);
+ return new;
+end$$ language plpgsql;
+create trigger trigger_ddl_func before insert on trigger_ddl_table for each row
+ execute procedure trigger_ddl_func();
+insert into trigger_ddl_table values (1, 42); -- fail
+ERROR: cannot ALTER TABLE "trigger_ddl_table" because it is being used by active queries in this session
+CONTEXT: SQL statement "alter table trigger_ddl_table add primary key (col1)"
+PL/pgSQL function trigger_ddl_func() line 3 at SQL statement
+create or replace function trigger_ddl_func() returns trigger as $$
+begin
+ create index on trigger_ddl_table (col2);
+ return new;
+end$$ language plpgsql;
+insert into trigger_ddl_table values (1, 42); -- fail
+ERROR: cannot CREATE INDEX "trigger_ddl_table" because it is being used by active queries in this session
+CONTEXT: SQL statement "create index on trigger_ddl_table (col2)"
+PL/pgSQL function trigger_ddl_func() line 3 at SQL statement
+drop table trigger_ddl_table;
+drop function trigger_ddl_func();
+--
+-- Verify behavior of before and after triggers with INSERT...ON CONFLICT
+-- DO UPDATE
+--
+create table upsert (key int4 primary key, color text);
+create function upsert_before_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if (TG_OP = 'UPDATE') then
+ raise warning 'before update (old): %', old.*::text;
+ raise warning 'before update (new): %', new.*::text;
+ elsif (TG_OP = 'INSERT') then
+ raise warning 'before insert (new): %', new.*::text;
+ if new.key % 2 = 0 then
+ new.key := new.key + 1;
+ new.color := new.color || ' trig modified';
+ raise warning 'before insert (new, modified): %', new.*::text;
+ end if;
+ end if;
+ return new;
+end;
+$$;
+create trigger upsert_before_trig before insert or update on upsert
+ for each row execute procedure upsert_before_func();
+create function upsert_after_func()
+ returns trigger language plpgsql as
+$$
+begin
+ if (TG_OP = 'UPDATE') then
+ raise warning 'after update (old): %', old.*::text;
+ raise warning 'after update (new): %', new.*::text;
+ elsif (TG_OP = 'INSERT') then
+ raise warning 'after insert (new): %', new.*::text;
+ end if;
+ return null;
+end;
+$$;
+create trigger upsert_after_trig after insert or update on upsert
+ for each row execute procedure upsert_after_func();
+insert into upsert values(1, 'black') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (1,black)
+WARNING: after insert (new): (1,black)
+insert into upsert values(2, 'red') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (2,red)
+WARNING: before insert (new, modified): (3,"red trig modified")
+WARNING: after insert (new): (3,"red trig modified")
+insert into upsert values(3, 'orange') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (3,orange)
+WARNING: before update (old): (3,"red trig modified")
+WARNING: before update (new): (3,"updated red trig modified")
+WARNING: after update (old): (3,"red trig modified")
+WARNING: after update (new): (3,"updated red trig modified")
+insert into upsert values(4, 'green') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (4,green)
+WARNING: before insert (new, modified): (5,"green trig modified")
+WARNING: after insert (new): (5,"green trig modified")
+insert into upsert values(5, 'purple') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (5,purple)
+WARNING: before update (old): (5,"green trig modified")
+WARNING: before update (new): (5,"updated green trig modified")
+WARNING: after update (old): (5,"green trig modified")
+WARNING: after update (new): (5,"updated green trig modified")
+insert into upsert values(6, 'white') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (6,white)
+WARNING: before insert (new, modified): (7,"white trig modified")
+WARNING: after insert (new): (7,"white trig modified")
+insert into upsert values(7, 'pink') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (7,pink)
+WARNING: before update (old): (7,"white trig modified")
+WARNING: before update (new): (7,"updated white trig modified")
+WARNING: after update (old): (7,"white trig modified")
+WARNING: after update (new): (7,"updated white trig modified")
+insert into upsert values(8, 'yellow') on conflict (key) do update set color = 'updated ' || upsert.color;
+WARNING: before insert (new): (8,yellow)
+WARNING: before insert (new, modified): (9,"yellow trig modified")
+WARNING: after insert (new): (9,"yellow trig modified")
+select * from upsert;
+ key | color
+-----+-----------------------------
+ 1 | black
+ 3 | updated red trig modified
+ 5 | updated green trig modified
+ 7 | updated white trig modified
+ 9 | yellow trig modified
+(5 rows)
+
+drop table upsert;
+drop function upsert_before_func();
+drop function upsert_after_func();
+--
+-- Verify that triggers with transition tables are not allowed on
+-- views
+--
+create table my_table (i int);
+create view my_view as select * from my_table;
+create function my_trigger_function() returns trigger as $$ begin end; $$ language plpgsql;
+create trigger my_trigger after update on my_view referencing old table as old_table
+ for each statement execute procedure my_trigger_function();
+ERROR: "my_view" is a view
+DETAIL: Triggers on views cannot have transition tables.
+drop function my_trigger_function();
+drop view my_view;
+drop table my_table;
+--
+-- Verify cases that are unsupported with partitioned tables
+--
+create table parted_trig (a int) partition by list (a);
+create function trigger_nothing() returns trigger
+ language plpgsql as $$ begin end; $$;
+create trigger failed before insert or update or delete on parted_trig
+ for each row execute procedure trigger_nothing();
+ERROR: "parted_trig" is a partitioned table
+DETAIL: Partitioned tables cannot have BEFORE / FOR EACH ROW triggers.
+create trigger failed instead of update on parted_trig
+ for each row execute procedure trigger_nothing();
+ERROR: "parted_trig" is a table
+DETAIL: Tables cannot have INSTEAD OF triggers.
+create trigger failed after update on parted_trig
+ referencing old table as old_table
+ for each row execute procedure trigger_nothing();
+ERROR: "parted_trig" is a partitioned table
+DETAIL: Triggers on partitioned tables cannot have transition tables.
+drop table parted_trig;
+--
+-- Verify trigger creation for partitioned tables, and drop behavior
+--
+create table trigpart (a int, b int) partition by range (a);
+create table trigpart1 partition of trigpart for values from (0) to (1000);
+create trigger trg1 after insert on trigpart for each row execute procedure trigger_nothing();
+create table trigpart2 partition of trigpart for values from (1000) to (2000);
+create table trigpart3 (like trigpart);
+alter table trigpart attach partition trigpart3 for values from (2000) to (3000);
+select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
+ where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
+ tgrelid | tgname | tgfoid
+-----------+--------+-----------------
+ trigpart | trg1 | trigger_nothing
+ trigpart1 | trg1 | trigger_nothing
+ trigpart2 | trg1 | trigger_nothing
+ trigpart3 | trg1 | trigger_nothing
+(4 rows)
+
+drop trigger trg1 on trigpart1; -- fail
+ERROR: cannot drop trigger trg1 on table trigpart1 because trigger trg1 on table trigpart requires it
+HINT: You can drop trigger trg1 on table trigpart instead.
+drop trigger trg1 on trigpart2; -- fail
+ERROR: cannot drop trigger trg1 on table trigpart2 because trigger trg1 on table trigpart requires it
+HINT: You can drop trigger trg1 on table trigpart instead.
+drop trigger trg1 on trigpart3; -- fail
+ERROR: cannot drop trigger trg1 on table trigpart3 because trigger trg1 on table trigpart requires it
+HINT: You can drop trigger trg1 on table trigpart instead.
+drop table trigpart2; -- ok, trigger should be gone in that partition
+select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
+ where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
+ tgrelid | tgname | tgfoid
+-----------+--------+-----------------
+ trigpart | trg1 | trigger_nothing
+ trigpart1 | trg1 | trigger_nothing
+ trigpart3 | trg1 | trigger_nothing
+(3 rows)
+
+drop trigger trg1 on trigpart; -- ok, all gone
+select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
+ where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
+ tgrelid | tgname | tgfoid
+---------+--------+--------
+(0 rows)
+
+drop table trigpart;
+drop function trigger_nothing();
+--
+-- Verify that triggers are fired for partitioned tables
+--
+create table parted_stmt_trig (a int) partition by list (a);
+create table parted_stmt_trig1 partition of parted_stmt_trig for values in (1);
+create table parted_stmt_trig2 partition of parted_stmt_trig for values in (2);
+create table parted2_stmt_trig (a int) partition by list (a);
+create table parted2_stmt_trig1 partition of parted2_stmt_trig for values in (1);
+create table parted2_stmt_trig2 partition of parted2_stmt_trig for values in (2);
+create or replace function trigger_notice() returns trigger as $$
+ begin
+ raise notice 'trigger % on % % % for %', TG_NAME, TG_TABLE_NAME, TG_WHEN, TG_OP, TG_LEVEL;
+ if TG_LEVEL = 'ROW' then
+ return NEW;
+ end if;
+ return null;
+ end;
+ $$ language plpgsql;
+-- insert/update/delete statement-level triggers on the parent
+create trigger trig_ins_before before insert on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_ins_after after insert on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_upd_before before update on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_upd_after after update on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_del_before before delete on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_del_after after delete on parted_stmt_trig
+ for each statement execute procedure trigger_notice();
+-- insert/update/delete row-level triggers on the parent
+create trigger trig_ins_after_parent after insert on parted_stmt_trig
+ for each row execute procedure trigger_notice();
+create trigger trig_upd_after_parent after update on parted_stmt_trig
+ for each row execute procedure trigger_notice();
+create trigger trig_del_after_parent after delete on parted_stmt_trig
+ for each row execute procedure trigger_notice();
+-- insert/update/delete row-level triggers on the first partition
+create trigger trig_ins_before_child before insert on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_ins_after_child after insert on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_upd_before_child before update on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_upd_after_child after update on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_del_before_child before delete on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+create trigger trig_del_after_child after delete on parted_stmt_trig1
+ for each row execute procedure trigger_notice();
+-- insert/update/delete statement-level triggers on the parent
+create trigger trig_ins_before_3 before insert on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_ins_after_3 after insert on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_upd_before_3 before update on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_upd_after_3 after update on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_del_before_3 before delete on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+create trigger trig_del_after_3 after delete on parted2_stmt_trig
+ for each statement execute procedure trigger_notice();
+with ins (a) as (
+ insert into parted2_stmt_trig values (1), (2) returning a
+) insert into parted_stmt_trig select a from ins returning tableoid::regclass, a;
+NOTICE: trigger trig_ins_before on parted_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_3 on parted2_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig2 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_3 on parted2_stmt_trig AFTER INSERT for STATEMENT
+NOTICE: trigger trig_ins_after on parted_stmt_trig AFTER INSERT for STATEMENT
+ tableoid | a
+-------------------+---
+ parted_stmt_trig1 | 1
+ parted_stmt_trig2 | 2
+(2 rows)
+
+with upd as (
+ update parted2_stmt_trig set a = a
+) update parted_stmt_trig set a = a;
+NOTICE: trigger trig_upd_before on parted_stmt_trig BEFORE UPDATE for STATEMENT
+NOTICE: trigger trig_upd_before_child on parted_stmt_trig1 BEFORE UPDATE for ROW
+NOTICE: trigger trig_upd_before_3 on parted2_stmt_trig BEFORE UPDATE for STATEMENT
+NOTICE: trigger trig_upd_after_child on parted_stmt_trig1 AFTER UPDATE for ROW
+NOTICE: trigger trig_upd_after_parent on parted_stmt_trig1 AFTER UPDATE for ROW
+NOTICE: trigger trig_upd_after_parent on parted_stmt_trig2 AFTER UPDATE for ROW
+NOTICE: trigger trig_upd_after on parted_stmt_trig AFTER UPDATE for STATEMENT
+NOTICE: trigger trig_upd_after_3 on parted2_stmt_trig AFTER UPDATE for STATEMENT
+delete from parted_stmt_trig;
+NOTICE: trigger trig_del_before on parted_stmt_trig BEFORE DELETE for STATEMENT
+NOTICE: trigger trig_del_before_child on parted_stmt_trig1 BEFORE DELETE for ROW
+NOTICE: trigger trig_del_after_parent on parted_stmt_trig2 AFTER DELETE for ROW
+NOTICE: trigger trig_del_after on parted_stmt_trig AFTER DELETE for STATEMENT
+-- insert via copy on the parent
+copy parted_stmt_trig(a) from stdin;
+NOTICE: trigger trig_ins_before on parted_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig2 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after on parted_stmt_trig AFTER INSERT for STATEMENT
+-- insert via copy on the first partition
+copy parted_stmt_trig1(a) from stdin;
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig1 AFTER INSERT for ROW
+-- Disabling a trigger in the parent table should disable children triggers too
+alter table parted_stmt_trig disable trigger trig_ins_after_parent;
+insert into parted_stmt_trig values (1);
+NOTICE: trigger trig_ins_before on parted_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after on parted_stmt_trig AFTER INSERT for STATEMENT
+alter table parted_stmt_trig enable trigger trig_ins_after_parent;
+insert into parted_stmt_trig values (1);
+NOTICE: trigger trig_ins_before on parted_stmt_trig BEFORE INSERT for STATEMENT
+NOTICE: trigger trig_ins_before_child on parted_stmt_trig1 BEFORE INSERT for ROW
+NOTICE: trigger trig_ins_after_child on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after_parent on parted_stmt_trig1 AFTER INSERT for ROW
+NOTICE: trigger trig_ins_after on parted_stmt_trig AFTER INSERT for STATEMENT
+drop table parted_stmt_trig, parted2_stmt_trig;
+-- Verify that triggers fire in alphabetical order
+create table parted_trig (a int) partition by range (a);
+create table parted_trig_1 partition of parted_trig for values from (0) to (1000)
+ partition by range (a);
+create table parted_trig_1_1 partition of parted_trig_1 for values from (0) to (100);
+create table parted_trig_2 partition of parted_trig for values from (1000) to (2000);
+create trigger zzz after insert on parted_trig for each row execute procedure trigger_notice();
+create trigger mmm after insert on parted_trig_1_1 for each row execute procedure trigger_notice();
+create trigger aaa after insert on parted_trig_1 for each row execute procedure trigger_notice();
+create trigger bbb after insert on parted_trig for each row execute procedure trigger_notice();
+create trigger qqq after insert on parted_trig_1_1 for each row execute procedure trigger_notice();
+insert into parted_trig values (50), (1500);
+NOTICE: trigger aaa on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger bbb on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger mmm on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger qqq on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger zzz on parted_trig_1_1 AFTER INSERT for ROW
+NOTICE: trigger bbb on parted_trig_2 AFTER INSERT for ROW
+NOTICE: trigger zzz on parted_trig_2 AFTER INSERT for ROW
+drop table parted_trig;
+-- Verify propagation of trigger arguments to partitions
+create table parted_trig (a int) partition by list (a);
+create table parted_trig1 partition of parted_trig for values in (1);
+create or replace function trigger_notice() returns trigger as $$
+ declare
+ arg1 text = TG_ARGV[0];
+ arg2 integer = TG_ARGV[1];
+ begin
+ raise notice 'trigger % on % % % for % args % %',
+ TG_NAME, TG_TABLE_NAME, TG_WHEN, TG_OP, TG_LEVEL, arg1, arg2;
+ return null;
+ end;
+ $$ language plpgsql;
+create trigger aaa after insert on parted_trig
+ for each row execute procedure trigger_notice('quirky', 1);
+-- Verify propagation of trigger arguments to partitions attached after creating trigger
+create table parted_trig2 partition of parted_trig for values in (2);
+create table parted_trig3 (like parted_trig);
+alter table parted_trig attach partition parted_trig3 for values in (3);
+insert into parted_trig values (1), (2), (3);
+NOTICE: trigger aaa on parted_trig1 AFTER INSERT for ROW args quirky 1
+NOTICE: trigger aaa on parted_trig2 AFTER INSERT for ROW args quirky 1
+NOTICE: trigger aaa on parted_trig3 AFTER INSERT for ROW args quirky 1
+drop table parted_trig;
+-- test irregular partitions (i.e., different column definitions),
+-- including that the WHEN clause works
+create function bark(text) returns bool language plpgsql immutable
+ as $$ begin raise notice '% <- woof!', $1; return true; end; $$;
+create or replace function trigger_notice_ab() returns trigger as $$
+ begin
+ raise notice 'trigger % on % % % for %: (a,b)=(%,%)',
+ TG_NAME, TG_TABLE_NAME, TG_WHEN, TG_OP, TG_LEVEL,
+ NEW.a, NEW.b;
+ if TG_LEVEL = 'ROW' then
+ return NEW;
+ end if;
+ return null;
+ end;
+ $$ language plpgsql;
+create table parted_irreg_ancestor (fd text, b text, fd2 int, fd3 int, a int)
+ partition by range (b);
+alter table parted_irreg_ancestor drop column fd,
+ drop column fd2, drop column fd3;
+create table parted_irreg (fd int, a int, fd2 int, b text)
+ partition by range (b);
+alter table parted_irreg drop column fd, drop column fd2;
+alter table parted_irreg_ancestor attach partition parted_irreg
+ for values from ('aaaa') to ('zzzz');
+create table parted1_irreg (b text, fd int, a int);
+alter table parted1_irreg drop column fd;
+alter table parted_irreg attach partition parted1_irreg
+ for values from ('aaaa') to ('bbbb');
+create trigger parted_trig after insert on parted_irreg
+ for each row execute procedure trigger_notice_ab();
+create trigger parted_trig_odd after insert on parted_irreg for each row
+ when (bark(new.b) AND new.a % 2 = 1) execute procedure trigger_notice_ab();
+-- we should hear barking for every insert, but parted_trig_odd only emits
+-- noise for odd values of a. parted_trig does it for all inserts.
+insert into parted_irreg values (1, 'aardvark'), (2, 'aanimals');
+NOTICE: aardvark <- woof!
+NOTICE: aanimals <- woof!
+NOTICE: trigger parted_trig on parted1_irreg AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig_odd on parted1_irreg AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig on parted1_irreg AFTER INSERT for ROW: (a,b)=(2,aanimals)
+insert into parted1_irreg values ('aardwolf', 2);
+NOTICE: aardwolf <- woof!
+NOTICE: trigger parted_trig on parted1_irreg AFTER INSERT for ROW: (a,b)=(2,aardwolf)
+insert into parted_irreg_ancestor values ('aasvogel', 3);
+NOTICE: aasvogel <- woof!
+NOTICE: trigger parted_trig on parted1_irreg AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+NOTICE: trigger parted_trig_odd on parted1_irreg AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+drop table parted_irreg_ancestor;
+--
+-- Constraint triggers and partitioned tables
+create table parted_constr_ancestor (a int, b text)
+ partition by range (b);
+create table parted_constr (a int, b text)
+ partition by range (b);
+alter table parted_constr_ancestor attach partition parted_constr
+ for values from ('aaaa') to ('zzzz');
+create table parted1_constr (a int, b text);
+alter table parted_constr attach partition parted1_constr
+ for values from ('aaaa') to ('bbbb');
+create constraint trigger parted_trig after insert on parted_constr_ancestor
+ deferrable
+ for each row execute procedure trigger_notice_ab();
+create constraint trigger parted_trig_two after insert on parted_constr
+ deferrable initially deferred
+ for each row when (bark(new.b) AND new.a % 2 = 1)
+ execute procedure trigger_notice_ab();
+-- The immediate constraint is fired immediately; the WHEN clause of the
+-- deferred constraint is also called immediately. The deferred constraint
+-- is fired at commit time.
+begin;
+insert into parted_constr values (1, 'aardvark');
+NOTICE: aardvark <- woof!
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(1,aardvark)
+insert into parted1_constr values (2, 'aardwolf');
+NOTICE: aardwolf <- woof!
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(2,aardwolf)
+insert into parted_constr_ancestor values (3, 'aasvogel');
+NOTICE: aasvogel <- woof!
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+commit;
+NOTICE: trigger parted_trig_two on parted1_constr AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig_two on parted1_constr AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+-- The WHEN clause is immediate, and both constraint triggers are fired at
+-- commit time.
+begin;
+set constraints parted_trig deferred;
+insert into parted_constr values (1, 'aardvark');
+NOTICE: aardvark <- woof!
+insert into parted1_constr values (2, 'aardwolf'), (3, 'aasvogel');
+NOTICE: aardwolf <- woof!
+NOTICE: aasvogel <- woof!
+commit;
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig_two on parted1_constr AFTER INSERT for ROW: (a,b)=(1,aardvark)
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(2,aardwolf)
+NOTICE: trigger parted_trig on parted1_constr AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+NOTICE: trigger parted_trig_two on parted1_constr AFTER INSERT for ROW: (a,b)=(3,aasvogel)
+drop table parted_constr_ancestor;
+drop function bark(text);
+-- Test that the WHEN clause is set properly to partitions
+create table parted_trigger (a int, b text) partition by range (a);
+create table parted_trigger_1 partition of parted_trigger for values from (0) to (1000);
+create table parted_trigger_2 (drp int, a int, b text);
+alter table parted_trigger_2 drop column drp;
+alter table parted_trigger attach partition parted_trigger_2 for values from (1000) to (2000);
+create trigger parted_trigger after update on parted_trigger
+ for each row when (new.a % 2 = 1 and length(old.b) >= 2) execute procedure trigger_notice_ab();
+create table parted_trigger_3 (b text, a int) partition by range (length(b));
+create table parted_trigger_3_1 partition of parted_trigger_3 for values from (1) to (3);
+create table parted_trigger_3_2 partition of parted_trigger_3 for values from (3) to (5);
+alter table parted_trigger attach partition parted_trigger_3 for values from (2000) to (3000);
+insert into parted_trigger values
+ (0, 'a'), (1, 'bbb'), (2, 'bcd'), (3, 'c'),
+ (1000, 'c'), (1001, 'ddd'), (1002, 'efg'), (1003, 'f'),
+ (2000, 'e'), (2001, 'fff'), (2002, 'ghi'), (2003, 'h');
+update parted_trigger set a = a + 2; -- notice for odd 'a' values, long 'b' values
+NOTICE: trigger parted_trigger on parted_trigger_1 AFTER UPDATE for ROW: (a,b)=(3,bbb)
+NOTICE: trigger parted_trigger on parted_trigger_2 AFTER UPDATE for ROW: (a,b)=(1003,ddd)
+NOTICE: trigger parted_trigger on parted_trigger_3_2 AFTER UPDATE for ROW: (a,b)=(2003,fff)
+drop table parted_trigger;
+-- try a constraint trigger, also
+create table parted_referenced (a int);
+create table unparted_trigger (a int, b text); -- for comparison purposes
+create table parted_trigger (a int, b text) partition by range (a);
+create table parted_trigger_1 partition of parted_trigger for values from (0) to (1000);
+create table parted_trigger_2 (drp int, a int, b text);
+alter table parted_trigger_2 drop column drp;
+alter table parted_trigger attach partition parted_trigger_2 for values from (1000) to (2000);
+create constraint trigger parted_trigger after update on parted_trigger
+ from parted_referenced
+ for each row execute procedure trigger_notice_ab();
+create constraint trigger parted_trigger after update on unparted_trigger
+ from parted_referenced
+ for each row execute procedure trigger_notice_ab();
+create table parted_trigger_3 (b text, a int) partition by range (length(b));
+create table parted_trigger_3_1 partition of parted_trigger_3 for values from (1) to (3);
+create table parted_trigger_3_2 partition of parted_trigger_3 for values from (3) to (5);
+alter table parted_trigger attach partition parted_trigger_3 for values from (2000) to (3000);
+select tgname, conname, t.tgrelid::regclass, t.tgconstrrelid::regclass,
+ c.conrelid::regclass, c.confrelid::regclass
+ from pg_trigger t join pg_constraint c on (t.tgconstraint = c.oid)
+ where tgname = 'parted_trigger'
+ order by t.tgrelid::regclass::text;
+ tgname | conname | tgrelid | tgconstrrelid | conrelid | confrelid
+----------------+----------------+--------------------+-------------------+--------------------+-----------
+ parted_trigger | parted_trigger | parted_trigger | parted_referenced | parted_trigger | -
+ parted_trigger | parted_trigger | parted_trigger_1 | parted_referenced | parted_trigger_1 | -
+ parted_trigger | parted_trigger | parted_trigger_2 | parted_referenced | parted_trigger_2 | -
+ parted_trigger | parted_trigger | parted_trigger_3 | parted_referenced | parted_trigger_3 | -
+ parted_trigger | parted_trigger | parted_trigger_3_1 | parted_referenced | parted_trigger_3_1 | -
+ parted_trigger | parted_trigger | parted_trigger_3_2 | parted_referenced | parted_trigger_3_2 | -
+ parted_trigger | parted_trigger | unparted_trigger | parted_referenced | unparted_trigger | -
+(7 rows)
+
+drop table parted_referenced, parted_trigger, unparted_trigger;
+-- verify that the "AFTER UPDATE OF columns" event is propagated correctly
+create table parted_trigger (a int, b text) partition by range (a);
+create table parted_trigger_1 partition of parted_trigger for values from (0) to (1000);
+create table parted_trigger_2 (drp int, a int, b text);
+alter table parted_trigger_2 drop column drp;
+alter table parted_trigger attach partition parted_trigger_2 for values from (1000) to (2000);
+create trigger parted_trigger after update of b on parted_trigger
+ for each row execute procedure trigger_notice_ab();
+create table parted_trigger_3 (b text, a int) partition by range (length(b));
+create table parted_trigger_3_1 partition of parted_trigger_3 for values from (1) to (4);
+create table parted_trigger_3_2 partition of parted_trigger_3 for values from (4) to (8);
+alter table parted_trigger attach partition parted_trigger_3 for values from (2000) to (3000);
+insert into parted_trigger values (0, 'a'), (1000, 'c'), (2000, 'e'), (2001, 'eeee');
+update parted_trigger set a = a + 2; -- no notices here
+update parted_trigger set b = b || 'b'; -- all triggers should fire
+NOTICE: trigger parted_trigger on parted_trigger_1 AFTER UPDATE for ROW: (a,b)=(2,ab)
+NOTICE: trigger parted_trigger on parted_trigger_2 AFTER UPDATE for ROW: (a,b)=(1002,cb)
+NOTICE: trigger parted_trigger on parted_trigger_3_1 AFTER UPDATE for ROW: (a,b)=(2002,eb)
+NOTICE: trigger parted_trigger on parted_trigger_3_2 AFTER UPDATE for ROW: (a,b)=(2003,eeeeb)
+drop table parted_trigger;
+drop function trigger_notice_ab();
+-- Make sure we don't end up with unnecessary copies of triggers, when
+-- cloning them.
+create table trg_clone (a int) partition by range (a);
+create table trg_clone1 partition of trg_clone for values from (0) to (1000);
+alter table trg_clone add constraint uniq unique (a) deferrable;
+create table trg_clone2 partition of trg_clone for values from (1000) to (2000);
+create table trg_clone3 partition of trg_clone for values from (2000) to (3000)
+ partition by range (a);
+create table trg_clone_3_3 partition of trg_clone3 for values from (2000) to (2100);
+select tgrelid::regclass, count(*) from pg_trigger
+ where tgrelid::regclass in ('trg_clone', 'trg_clone1', 'trg_clone2',
+ 'trg_clone3', 'trg_clone_3_3')
+ group by tgrelid::regclass order by tgrelid::regclass;
+ tgrelid | count
+---------------+-------
+ trg_clone | 1
+ trg_clone1 | 1
+ trg_clone2 | 1
+ trg_clone3 | 1
+ trg_clone_3_3 | 1
+(5 rows)
+
+drop table trg_clone;
+--
+-- Test the interaction between transition tables and both kinds of
+-- inheritance. We'll dump the contents of the transition tables in a
+-- format that shows the attribute order, so that we can distinguish
+-- tuple formats (though not dropped attributes).
+--
+create or replace function dump_insert() returns trigger language plpgsql as
+$$
+ begin
+ raise notice 'trigger = %, new table = %',
+ TG_NAME,
+ (select string_agg(new_table::text, ', ' order by a) from new_table);
+ return null;
+ end;
+$$;
+create or replace function dump_update() returns trigger language plpgsql as
+$$
+ begin
+ raise notice 'trigger = %, old table = %, new table = %',
+ TG_NAME,
+ (select string_agg(old_table::text, ', ' order by a) from old_table),
+ (select string_agg(new_table::text, ', ' order by a) from new_table);
+ return null;
+ end;
+$$;
+create or replace function dump_delete() returns trigger language plpgsql as
+$$
+ begin
+ raise notice 'trigger = %, old table = %',
+ TG_NAME,
+ (select string_agg(old_table::text, ', ' order by a) from old_table);
+ return null;
+ end;
+$$;
+--
+-- Verify behavior of statement triggers on partition hierarchy with
+-- transition tables. Tuples should appear to each trigger in the
+-- format of the relation the trigger is attached to.
+--
+-- set up a partition hierarchy with some different TupleDescriptors
+create table parent (a text, b int) partition by list (a);
+-- a child matching parent
+create table child1 partition of parent for values in ('AAA');
+-- a child with a dropped column
+create table child2 (x int, a text, b int);
+alter table child2 drop column x;
+alter table parent attach partition child2 for values in ('BBB');
+-- a child with a different column order
+create table child3 (b int, a text);
+alter table parent attach partition child3 for values in ('CCC');
+create trigger parent_insert_trig
+ after insert on parent referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger parent_update_trig
+ after update on parent referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger parent_delete_trig
+ after delete on parent referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child1_insert_trig
+ after insert on child1 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child1_update_trig
+ after update on child1 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child1_delete_trig
+ after delete on child1 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child2_insert_trig
+ after insert on child2 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child2_update_trig
+ after update on child2 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child2_delete_trig
+ after delete on child2 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child3_insert_trig
+ after insert on child3 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child3_update_trig
+ after update on child3 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child3_delete_trig
+ after delete on child3 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+SELECT trigger_name, event_manipulation, event_object_schema, event_object_table,
+ action_order, action_condition, action_orientation, action_timing,
+ action_reference_old_table, action_reference_new_table
+ FROM information_schema.triggers
+ WHERE event_object_table IN ('parent', 'child1', 'child2', 'child3')
+ ORDER BY trigger_name COLLATE "C", 2;
+ trigger_name | event_manipulation | event_object_schema | event_object_table | action_order | action_condition | action_orientation | action_timing | action_reference_old_table | action_reference_new_table
+--------------------+--------------------+---------------------+--------------------+--------------+------------------+--------------------+---------------+----------------------------+----------------------------
+ child1_delete_trig | DELETE | public | child1 | 1 | | STATEMENT | AFTER | old_table |
+ child1_insert_trig | INSERT | public | child1 | 1 | | STATEMENT | AFTER | | new_table
+ child1_update_trig | UPDATE | public | child1 | 1 | | STATEMENT | AFTER | old_table | new_table
+ child2_delete_trig | DELETE | public | child2 | 1 | | STATEMENT | AFTER | old_table |
+ child2_insert_trig | INSERT | public | child2 | 1 | | STATEMENT | AFTER | | new_table
+ child2_update_trig | UPDATE | public | child2 | 1 | | STATEMENT | AFTER | old_table | new_table
+ child3_delete_trig | DELETE | public | child3 | 1 | | STATEMENT | AFTER | old_table |
+ child3_insert_trig | INSERT | public | child3 | 1 | | STATEMENT | AFTER | | new_table
+ child3_update_trig | UPDATE | public | child3 | 1 | | STATEMENT | AFTER | old_table | new_table
+ parent_delete_trig | DELETE | public | parent | 1 | | STATEMENT | AFTER | old_table |
+ parent_insert_trig | INSERT | public | parent | 1 | | STATEMENT | AFTER | | new_table
+ parent_update_trig | UPDATE | public | parent | 1 | | STATEMENT | AFTER | old_table | new_table
+(12 rows)
+
+-- insert directly into children sees respective child-format tuples
+insert into child1 values ('AAA', 42);
+NOTICE: trigger = child1_insert_trig, new table = (AAA,42)
+insert into child2 values ('BBB', 42);
+NOTICE: trigger = child2_insert_trig, new table = (BBB,42)
+insert into child3 values (42, 'CCC');
+NOTICE: trigger = child3_insert_trig, new table = (42,CCC)
+-- update via parent sees parent-format tuples
+update parent set b = b + 1;
+NOTICE: trigger = parent_update_trig, old table = (AAA,42), (BBB,42), (CCC,42), new table = (AAA,43), (BBB,43), (CCC,43)
+-- delete via parent sees parent-format tuples
+delete from parent;
+NOTICE: trigger = parent_delete_trig, old table = (AAA,43), (BBB,43), (CCC,43)
+-- insert into parent sees parent-format tuples
+insert into parent values ('AAA', 42);
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42)
+insert into parent values ('BBB', 42);
+NOTICE: trigger = parent_insert_trig, new table = (BBB,42)
+insert into parent values ('CCC', 42);
+NOTICE: trigger = parent_insert_trig, new table = (CCC,42)
+-- delete from children sees respective child-format tuples
+delete from child1;
+NOTICE: trigger = child1_delete_trig, old table = (AAA,42)
+delete from child2;
+NOTICE: trigger = child2_delete_trig, old table = (BBB,42)
+delete from child3;
+NOTICE: trigger = child3_delete_trig, old table = (42,CCC)
+-- copy into parent sees parent-format tuples
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,42)
+-- DML affecting parent sees tuples collected from children even if
+-- there is no transition table trigger on the children
+drop trigger child1_insert_trig on child1;
+drop trigger child1_update_trig on child1;
+drop trigger child1_delete_trig on child1;
+drop trigger child2_insert_trig on child2;
+drop trigger child2_update_trig on child2;
+drop trigger child2_delete_trig on child2;
+drop trigger child3_insert_trig on child3;
+drop trigger child3_update_trig on child3;
+drop trigger child3_delete_trig on child3;
+delete from parent;
+NOTICE: trigger = parent_delete_trig, old table = (AAA,42), (BBB,42), (CCC,42)
+-- copy into parent sees tuples collected from children even if there
+-- is no transition-table trigger on the children
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,42)
+-- insert into parent with a before trigger on a child tuple before
+-- insertion, and we capture the newly modified row in parent format
+create or replace function intercept_insert() returns trigger language plpgsql as
+$$
+ begin
+ new.b = new.b + 1000;
+ return new;
+ end;
+$$;
+create trigger intercept_insert_child3
+ before insert on child3
+ for each row execute procedure intercept_insert();
+-- insert, parent trigger sees post-modification parent-format tuple
+insert into parent values ('AAA', 42), ('BBB', 42), ('CCC', 66);
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,1066)
+-- copy, parent trigger sees post-modification parent-format tuple
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,1234)
+drop table child1, child2, child3, parent;
+drop function intercept_insert();
+--
+-- Verify prohibition of row triggers with transition triggers on
+-- partitions
+--
+create table parent (a text, b int) partition by list (a);
+create table child partition of parent for values in ('AAA');
+-- adding row trigger with transition table fails
+create trigger child_row_trig
+ after insert on child referencing new table as new_table
+ for each row execute procedure dump_insert();
+ERROR: ROW triggers with transition tables are not supported on partitions
+-- detaching it first works
+alter table parent detach partition child;
+create trigger child_row_trig
+ after insert on child referencing new table as new_table
+ for each row execute procedure dump_insert();
+-- but now we're not allowed to reattach it
+alter table parent attach partition child for values in ('AAA');
+ERROR: trigger "child_row_trig" prevents table "child" from becoming a partition
+DETAIL: ROW triggers with transition tables are not supported on partitions
+-- drop the trigger, and now we're allowed to attach it again
+drop trigger child_row_trig on child;
+alter table parent attach partition child for values in ('AAA');
+drop table child, parent;
+--
+-- Verify behavior of statement triggers on (non-partition)
+-- inheritance hierarchy with transition tables; similar to the
+-- partition case, except there is no rerouting on insertion and child
+-- tables can have extra columns
+--
+-- set up inheritance hierarchy with different TupleDescriptors
+create table parent (a text, b int);
+-- a child matching parent
+create table child1 () inherits (parent);
+-- a child with a different column order
+create table child2 (b int, a text);
+alter table child2 inherit parent;
+-- a child with an extra column
+create table child3 (c text) inherits (parent);
+create trigger parent_insert_trig
+ after insert on parent referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger parent_update_trig
+ after update on parent referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger parent_delete_trig
+ after delete on parent referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child1_insert_trig
+ after insert on child1 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child1_update_trig
+ after update on child1 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child1_delete_trig
+ after delete on child1 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child2_insert_trig
+ after insert on child2 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child2_update_trig
+ after update on child2 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child2_delete_trig
+ after delete on child2 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+create trigger child3_insert_trig
+ after insert on child3 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger child3_update_trig
+ after update on child3 referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger child3_delete_trig
+ after delete on child3 referencing old table as old_table
+ for each statement execute procedure dump_delete();
+-- insert directly into children sees respective child-format tuples
+insert into child1 values ('AAA', 42);
+NOTICE: trigger = child1_insert_trig, new table = (AAA,42)
+insert into child2 values (42, 'BBB');
+NOTICE: trigger = child2_insert_trig, new table = (42,BBB)
+insert into child3 values ('CCC', 42, 'foo');
+NOTICE: trigger = child3_insert_trig, new table = (CCC,42,foo)
+-- update via parent sees parent-format tuples
+update parent set b = b + 1;
+NOTICE: trigger = parent_update_trig, old table = (AAA,42), (BBB,42), (CCC,42), new table = (AAA,43), (BBB,43), (CCC,43)
+-- delete via parent sees parent-format tuples
+delete from parent;
+NOTICE: trigger = parent_delete_trig, old table = (AAA,43), (BBB,43), (CCC,43)
+-- reinsert values into children for next test...
+insert into child1 values ('AAA', 42);
+NOTICE: trigger = child1_insert_trig, new table = (AAA,42)
+insert into child2 values (42, 'BBB');
+NOTICE: trigger = child2_insert_trig, new table = (42,BBB)
+insert into child3 values ('CCC', 42, 'foo');
+NOTICE: trigger = child3_insert_trig, new table = (CCC,42,foo)
+-- delete from children sees respective child-format tuples
+delete from child1;
+NOTICE: trigger = child1_delete_trig, old table = (AAA,42)
+delete from child2;
+NOTICE: trigger = child2_delete_trig, old table = (42,BBB)
+delete from child3;
+NOTICE: trigger = child3_delete_trig, old table = (CCC,42,foo)
+-- copy into parent sees parent-format tuples (no rerouting, so these
+-- are really inserted into the parent)
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (AAA,42), (BBB,42), (CCC,42)
+-- same behavior for copy if there is an index (interesting because rows are
+-- captured by a different code path in copy.c if there are indexes)
+create index on parent(b);
+copy parent (a, b) from stdin;
+NOTICE: trigger = parent_insert_trig, new table = (DDD,42)
+-- DML affecting parent sees tuples collected from children even if
+-- there is no transition table trigger on the children
+drop trigger child1_insert_trig on child1;
+drop trigger child1_update_trig on child1;
+drop trigger child1_delete_trig on child1;
+drop trigger child2_insert_trig on child2;
+drop trigger child2_update_trig on child2;
+drop trigger child2_delete_trig on child2;
+drop trigger child3_insert_trig on child3;
+drop trigger child3_update_trig on child3;
+drop trigger child3_delete_trig on child3;
+delete from parent;
+NOTICE: trigger = parent_delete_trig, old table = (AAA,42), (BBB,42), (CCC,42), (DDD,42)
+drop table child1, child2, child3, parent;
+--
+-- Verify prohibition of row triggers with transition triggers on
+-- inheritance children
+--
+create table parent (a text, b int);
+create table child () inherits (parent);
+-- adding row trigger with transition table fails
+create trigger child_row_trig
+ after insert on child referencing new table as new_table
+ for each row execute procedure dump_insert();
+ERROR: ROW triggers with transition tables are not supported on inheritance children
+-- disinheriting it first works
+alter table child no inherit parent;
+create trigger child_row_trig
+ after insert on child referencing new table as new_table
+ for each row execute procedure dump_insert();
+-- but now we're not allowed to make it inherit anymore
+alter table child inherit parent;
+ERROR: trigger "child_row_trig" prevents table "child" from becoming an inheritance child
+DETAIL: ROW triggers with transition tables are not supported in inheritance hierarchies.
+-- drop the trigger, and now we're allowed to make it inherit again
+drop trigger child_row_trig on child;
+alter table child inherit parent;
+drop table child, parent;
+--
+-- Verify behavior of queries with wCTEs, where multiple transition
+-- tuplestores can be active at the same time because there are
+-- multiple DML statements that might fire triggers with transition
+-- tables
+--
+create table table1 (a int);
+create table table2 (a text);
+create trigger table1_trig
+ after insert on table1 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger table2_trig
+ after insert on table2 referencing new table as new_table
+ for each statement execute procedure dump_insert();
+with wcte as (insert into table1 values (42))
+ insert into table2 values ('hello world');
+NOTICE: trigger = table2_trig, new table = ("hello world")
+NOTICE: trigger = table1_trig, new table = (42)
+with wcte as (insert into table1 values (43))
+ insert into table1 values (44);
+NOTICE: trigger = table1_trig, new table = (43), (44)
+select * from table1;
+ a
+----
+ 42
+ 44
+ 43
+(3 rows)
+
+select * from table2;
+ a
+-------------
+ hello world
+(1 row)
+
+drop table table1;
+drop table table2;
+--
+-- Verify behavior of INSERT ... ON CONFLICT DO UPDATE ... with
+-- transition tables.
+--
+create table my_table (a int primary key, b text);
+create trigger my_table_insert_trig
+ after insert on my_table referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger my_table_update_trig
+ after update on my_table referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+-- inserts only
+insert into my_table values (1, 'AAA'), (2, 'BBB')
+ on conflict (a) do
+ update set b = my_table.b || ':' || excluded.b;
+NOTICE: trigger = my_table_update_trig, old table = <NULL>, new table = <NULL>
+NOTICE: trigger = my_table_insert_trig, new table = (1,AAA), (2,BBB)
+-- mixture of inserts and updates
+insert into my_table values (1, 'AAA'), (2, 'BBB'), (3, 'CCC'), (4, 'DDD')
+ on conflict (a) do
+ update set b = my_table.b || ':' || excluded.b;
+NOTICE: trigger = my_table_update_trig, old table = (1,AAA), (2,BBB), new table = (1,AAA:AAA), (2,BBB:BBB)
+NOTICE: trigger = my_table_insert_trig, new table = (3,CCC), (4,DDD)
+-- updates only
+insert into my_table values (3, 'CCC'), (4, 'DDD')
+ on conflict (a) do
+ update set b = my_table.b || ':' || excluded.b;
+NOTICE: trigger = my_table_update_trig, old table = (3,CCC), (4,DDD), new table = (3,CCC:CCC), (4,DDD:DDD)
+NOTICE: trigger = my_table_insert_trig, new table = <NULL>
+--
+-- now using a partitioned table
+--
+create table iocdu_tt_parted (a int primary key, b text) partition by list (a);
+create table iocdu_tt_parted1 partition of iocdu_tt_parted for values in (1);
+create table iocdu_tt_parted2 partition of iocdu_tt_parted for values in (2);
+create table iocdu_tt_parted3 partition of iocdu_tt_parted for values in (3);
+create table iocdu_tt_parted4 partition of iocdu_tt_parted for values in (4);
+create trigger iocdu_tt_parted_insert_trig
+ after insert on iocdu_tt_parted referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger iocdu_tt_parted_update_trig
+ after update on iocdu_tt_parted referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+-- inserts only
+insert into iocdu_tt_parted values (1, 'AAA'), (2, 'BBB')
+ on conflict (a) do
+ update set b = iocdu_tt_parted.b || ':' || excluded.b;
+NOTICE: trigger = iocdu_tt_parted_update_trig, old table = <NULL>, new table = <NULL>
+NOTICE: trigger = iocdu_tt_parted_insert_trig, new table = (1,AAA), (2,BBB)
+-- mixture of inserts and updates
+insert into iocdu_tt_parted values (1, 'AAA'), (2, 'BBB'), (3, 'CCC'), (4, 'DDD')
+ on conflict (a) do
+ update set b = iocdu_tt_parted.b || ':' || excluded.b;
+NOTICE: trigger = iocdu_tt_parted_update_trig, old table = (1,AAA), (2,BBB), new table = (1,AAA:AAA), (2,BBB:BBB)
+NOTICE: trigger = iocdu_tt_parted_insert_trig, new table = (3,CCC), (4,DDD)
+-- updates only
+insert into iocdu_tt_parted values (3, 'CCC'), (4, 'DDD')
+ on conflict (a) do
+ update set b = iocdu_tt_parted.b || ':' || excluded.b;
+NOTICE: trigger = iocdu_tt_parted_update_trig, old table = (3,CCC), (4,DDD), new table = (3,CCC:CCC), (4,DDD:DDD)
+NOTICE: trigger = iocdu_tt_parted_insert_trig, new table = <NULL>
+drop table iocdu_tt_parted;
+--
+-- Verify that you can't create a trigger with transition tables for
+-- more than one event.
+--
+create trigger my_table_multievent_trig
+ after insert or update on my_table referencing new table as new_table
+ for each statement execute procedure dump_insert();
+ERROR: transition tables cannot be specified for triggers with more than one event
+--
+-- Verify that you can't create a trigger with transition tables with
+-- a column list.
+--
+create trigger my_table_col_update_trig
+ after update of b on my_table referencing new table as new_table
+ for each statement execute procedure dump_insert();
+ERROR: transition tables cannot be specified for triggers with column lists
+drop table my_table;
+--
+-- Test firing of triggers with transition tables by foreign key cascades
+--
+create table refd_table (a int primary key, b text);
+create table trig_table (a int, b text,
+ foreign key (a) references refd_table on update cascade on delete cascade
+);
+create trigger trig_table_before_trig
+ before insert or update or delete on trig_table
+ for each statement execute procedure trigger_func('trig_table');
+create trigger trig_table_insert_trig
+ after insert on trig_table referencing new table as new_table
+ for each statement execute procedure dump_insert();
+create trigger trig_table_update_trig
+ after update on trig_table referencing old table as old_table new table as new_table
+ for each statement execute procedure dump_update();
+create trigger trig_table_delete_trig
+ after delete on trig_table referencing old table as old_table
+ for each statement execute procedure dump_delete();
+insert into refd_table values
+ (1, 'one'),
+ (2, 'two'),
+ (3, 'three');
+insert into trig_table values
+ (1, 'one a'),
+ (1, 'one b'),
+ (2, 'two a'),
+ (2, 'two b'),
+ (3, 'three a'),
+ (3, 'three b');
+NOTICE: trigger_func(trig_table) called: action = INSERT, when = BEFORE, level = STATEMENT
+NOTICE: trigger = trig_table_insert_trig, new table = (1,"one a"), (1,"one b"), (2,"two a"), (2,"two b"), (3,"three a"), (3,"three b")
+update refd_table set a = 11 where b = 'one';
+NOTICE: trigger_func(trig_table) called: action = UPDATE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = trig_table_update_trig, old table = (1,"one a"), (1,"one b"), new table = (11,"one a"), (11,"one b")
+select * from trig_table;
+ a | b
+----+---------
+ 2 | two a
+ 2 | two b
+ 3 | three a
+ 3 | three b
+ 11 | one a
+ 11 | one b
+(6 rows)
+
+delete from refd_table where length(b) = 3;
+NOTICE: trigger_func(trig_table) called: action = DELETE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = trig_table_delete_trig, old table = (2,"two a"), (2,"two b"), (11,"one a"), (11,"one b")
+select * from trig_table;
+ a | b
+---+---------
+ 3 | three a
+ 3 | three b
+(2 rows)
+
+drop table refd_table, trig_table;
+--
+-- self-referential FKs are even more fun
+--
+create table self_ref (a int primary key,
+ b int references self_ref(a) on delete cascade);
+create trigger self_ref_before_trig
+ before delete on self_ref
+ for each statement execute procedure trigger_func('self_ref');
+create trigger self_ref_r_trig
+ after delete on self_ref referencing old table as old_table
+ for each row execute procedure dump_delete();
+create trigger self_ref_s_trig
+ after delete on self_ref referencing old table as old_table
+ for each statement execute procedure dump_delete();
+insert into self_ref values (1, null), (2, 1), (3, 2);
+delete from self_ref where a = 1;
+NOTICE: trigger_func(self_ref) called: action = DELETE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = self_ref_r_trig, old table = (1,), (2,1)
+NOTICE: trigger_func(self_ref) called: action = DELETE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = self_ref_r_trig, old table = (1,), (2,1)
+NOTICE: trigger = self_ref_s_trig, old table = (1,), (2,1)
+NOTICE: trigger = self_ref_r_trig, old table = (3,2)
+NOTICE: trigger = self_ref_s_trig, old table = (3,2)
+-- without AR trigger, cascaded deletes all end up in one transition table
+drop trigger self_ref_r_trig on self_ref;
+insert into self_ref values (1, null), (2, 1), (3, 2), (4, 3);
+delete from self_ref where a = 1;
+NOTICE: trigger_func(self_ref) called: action = DELETE, when = BEFORE, level = STATEMENT
+NOTICE: trigger = self_ref_s_trig, old table = (1,), (2,1), (3,2), (4,3)
+drop table self_ref;
+-- cleanup
+drop function dump_insert();
+drop function dump_update();
+drop function dump_delete();
diff --git a/src/test/regress/expected/tsrf_1.out b/src/test/regress/expected/tsrf_1.out
new file mode 100644
index 0000000000..a0f7d80c69
--- /dev/null
+++ b/src/test/regress/expected/tsrf_1.out
@@ -0,0 +1,712 @@
+--
+-- tsrf - targetlist set returning function tests
+--
+-- simple srf
+SELECT generate_series(1, 3);
+ generate_series
+-----------------
+ 1
+ 2
+ 3
+(3 rows)
+
+-- parallel iteration
+SELECT generate_series(1, 3), generate_series(3,5);
+ generate_series | generate_series
+-----------------+-----------------
+ 1 | 3
+ 2 | 4
+ 3 | 5
+(3 rows)
+
+-- parallel iteration, different number of rows
+SELECT generate_series(1, 2), generate_series(1,4);
+ generate_series | generate_series
+-----------------+-----------------
+ 1 | 1
+ 2 | 2
+ | 3
+ | 4
+(4 rows)
+
+-- srf, with SRF argument
+SELECT generate_series(1, generate_series(1, 3));
+ generate_series
+-----------------
+ 1
+ 1
+ 2
+ 1
+ 2
+ 3
+(6 rows)
+
+-- but we've traditionally rejected the same in FROM
+SELECT * FROM generate_series(1, generate_series(1, 3));
+ERROR: set-returning functions must appear at top level of FROM
+LINE 1: SELECT * FROM generate_series(1, generate_series(1, 3));
+ ^
+-- srf, with two SRF arguments
+SELECT generate_series(generate_series(1,3), generate_series(2, 4));
+ generate_series
+-----------------
+ 1
+ 2
+ 2
+ 3
+ 3
+ 4
+(6 rows)
+
+-- check proper nesting of SRFs in different expressions
+explain (verbose, costs off)
+SELECT generate_series(1, generate_series(1, 3)), generate_series(2, 4);
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ ProjectSet
+ Output: generate_series(1, (generate_series(1, 3))), (generate_series(2, 4))
+ -> ProjectSet
+ Output: generate_series(1, 3), generate_series(2, 4)
+ -> Result
+(5 rows)
+
+SELECT generate_series(1, generate_series(1, 3)), generate_series(2, 4);
+ generate_series | generate_series
+-----------------+-----------------
+ 1 | 2
+ 1 | 3
+ 2 | 3
+ 1 | 4
+ 2 | 4
+ 3 | 4
+(6 rows)
+
+CREATE TABLE few(id int, dataa text, datab text);
+INSERT INTO few VALUES(1, 'a', 'foo'),(2, 'a', 'bar'),(3, 'b', 'bar');
+-- SRF with a provably-dummy relation
+explain (verbose, costs off)
+SELECT unnest(ARRAY[1, 2]) FROM few WHERE false;
+ QUERY PLAN
+--------------------------------------
+ ProjectSet
+ Output: unnest('{1,2}'::integer[])
+ -> Result
+ One-Time Filter: false
+(4 rows)
+
+SELECT unnest(ARRAY[1, 2]) FROM few WHERE false;
+ unnest
+--------
+(0 rows)
+
+-- SRF shouldn't prevent upper query from recognizing lower as dummy
+explain (verbose, costs off)
+SELECT * FROM few f1,
+ (SELECT unnest(ARRAY[1,2]) FROM few f2 WHERE false OFFSET 0) ss;
+ QUERY PLAN
+------------------------------------------------
+ Result
+ Output: f1.id, f1.dataa, f1.datab, ss.unnest
+ One-Time Filter: false
+(3 rows)
+
+SELECT * FROM few f1,
+ (SELECT unnest(ARRAY[1,2]) FROM few f2 WHERE false OFFSET 0) ss;
+ id | dataa | datab | unnest
+----+-------+-------+--------
+(0 rows)
+
+-- SRF output order of sorting is maintained, if SRF is not referenced
+SELECT few.id, generate_series(1,3) g FROM few ORDER BY id DESC;
+ id | g
+----+---
+ 3 | 1
+ 3 | 2
+ 3 | 3
+ 2 | 1
+ 2 | 2
+ 2 | 3
+ 1 | 1
+ 1 | 2
+ 1 | 3
+(9 rows)
+
+-- but SRFs can be referenced in sort
+SELECT few.id, generate_series(1,3) g FROM few ORDER BY id, g DESC;
+ id | g
+----+---
+ 1 | 3
+ 1 | 2
+ 1 | 1
+ 2 | 3
+ 2 | 2
+ 2 | 1
+ 3 | 3
+ 3 | 2
+ 3 | 1
+(9 rows)
+
+SELECT few.id, generate_series(1,3) g FROM few ORDER BY id, generate_series(1,3) DESC;
+ id | g
+----+---
+ 1 | 3
+ 1 | 2
+ 1 | 1
+ 2 | 3
+ 2 | 2
+ 2 | 1
+ 3 | 3
+ 3 | 2
+ 3 | 1
+(9 rows)
+
+-- it's weird to have ORDER BYs that increase the number of results
+SELECT few.id FROM few ORDER BY id, generate_series(1,3) DESC;
+ id
+----
+ 1
+ 1
+ 1
+ 2
+ 2
+ 2
+ 3
+ 3
+ 3
+(9 rows)
+
+-- SRFs are computed after aggregation
+SET enable_hashagg TO 0; -- stable output order
+SELECT few.dataa, count(*), min(id), max(id), unnest('{1,1,3}'::int[]) FROM few WHERE few.id = 1 GROUP BY few.dataa;
+ dataa | count | min | max | unnest
+-------+-------+-----+-----+--------
+ a | 1 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 3
+(3 rows)
+
+-- unless referenced in GROUP BY clause
+SELECT few.dataa, count(*), min(id), max(id), unnest('{1,1,3}'::int[]) FROM few WHERE few.id = 1 GROUP BY few.dataa, unnest('{1,1,3}'::int[]);
+ dataa | count | min | max | unnest
+-------+-------+-----+-----+--------
+ a | 2 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 3
+(2 rows)
+
+SELECT few.dataa, count(*), min(id), max(id), unnest('{1,1,3}'::int[]) FROM few WHERE few.id = 1 GROUP BY few.dataa, 5;
+ dataa | count | min | max | unnest
+-------+-------+-----+-----+--------
+ a | 2 | 1 | 1 | 1
+ a | 1 | 1 | 1 | 3
+(2 rows)
+
+RESET enable_hashagg;
+-- check HAVING works when GROUP BY does [not] reference SRF output
+SELECT dataa, generate_series(1,1), count(*) FROM few GROUP BY 1 HAVING count(*) > 1;
+ dataa | generate_series | count
+-------+-----------------+-------
+ a | 1 | 2
+(1 row)
+
+SELECT dataa, generate_series(1,1), count(*) FROM few GROUP BY 1, 2 HAVING count(*) > 1;
+ dataa | generate_series | count
+-------+-----------------+-------
+ a | 1 | 2
+(1 row)
+
+-- it's weird to have GROUP BYs that increase the number of results
+SELECT few.dataa, count(*) FROM few WHERE dataa = 'a' GROUP BY few.dataa ORDER BY 2;
+ dataa | count
+-------+-------
+ a | 2
+(1 row)
+
+SELECT few.dataa, count(*) FROM few WHERE dataa = 'a' GROUP BY few.dataa, unnest('{1,1,3}'::int[]) ORDER BY 2;
+ dataa | count
+-------+-------
+ a | 2
+ a | 4
+(2 rows)
+
+-- SRFs are not allowed if they'd need to be conditionally executed
+SELECT q1, case when q1 > 0 then generate_series(1,3) else 0 end FROM int8_tbl;
+ERROR: set-returning functions are not allowed in CASE
+LINE 1: SELECT q1, case when q1 > 0 then generate_series(1,3) else 0...
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+SELECT q1, coalesce(generate_series(1,3), 0) FROM int8_tbl;
+ERROR: set-returning functions are not allowed in COALESCE
+LINE 1: SELECT q1, coalesce(generate_series(1,3), 0) FROM int8_tbl;
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+-- SRFs are not allowed in aggregate arguments
+SELECT min(generate_series(1, 3)) FROM few;
+ERROR: aggregate function calls cannot contain set-returning function calls
+LINE 1: SELECT min(generate_series(1, 3)) FROM few;
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+-- ... unless they're within a sub-select
+SELECT sum((3 = ANY(SELECT generate_series(1,4)))::int);
+ sum
+-----
+ 1
+(1 row)
+
+SELECT sum((3 = ANY(SELECT lag(x) over(order by x)
+ FROM generate_series(1,4) x))::int);
+ sum
+-----
+ 1
+(1 row)
+
+-- SRFs are not allowed in window function arguments, either
+SELECT min(generate_series(1, 3)) OVER() FROM few;
+ERROR: window function calls cannot contain set-returning function calls
+LINE 1: SELECT min(generate_series(1, 3)) OVER() FROM few;
+ ^
+HINT: You might be able to move the set-returning function into a LATERAL FROM item.
+-- SRFs are normally computed after window functions
+SELECT id,lag(id) OVER(), count(*) OVER(), generate_series(1,3) FROM few;
+ id | lag | count | generate_series
+----+-----+-------+-----------------
+ 1 | | 3 | 1
+ 1 | | 3 | 2
+ 1 | | 3 | 3
+ 2 | 1 | 3 | 1
+ 2 | 1 | 3 | 2
+ 2 | 1 | 3 | 3
+ 3 | 2 | 3 | 1
+ 3 | 2 | 3 | 2
+ 3 | 2 | 3 | 3
+(9 rows)
+
+-- unless referencing SRFs
+SELECT SUM(count(*)) OVER(PARTITION BY generate_series(1,3) ORDER BY generate_series(1,3)), generate_series(1,3) g FROM few GROUP BY g;
+ sum | g
+-----+---
+ 3 | 1
+ 3 | 2
+ 3 | 3
+(3 rows)
+
+-- sorting + grouping
+SELECT few.dataa, count(*), min(id), max(id), generate_series(1,3) FROM few GROUP BY few.dataa ORDER BY 5, 1;
+ dataa | count | min | max | generate_series
+-------+-------+-----+-----+-----------------
+ a | 2 | 1 | 2 | 1
+ b | 1 | 3 | 3 | 1
+ a | 2 | 1 | 2 | 2
+ b | 1 | 3 | 3 | 2
+ a | 2 | 1 | 2 | 3
+ b | 1 | 3 | 3 | 3
+(6 rows)
+
+-- grouping sets are a bit special, they produce NULLs in columns not actually NULL
+set enable_hashagg = false;
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab);
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | | 1 | 2
+ a | | 2 | 2
+ b | bar | 1 | 1
+ b | bar | 2 | 1
+ b | | 1 | 1
+ b | | 2 | 1
+ | | 1 | 3
+ | | 2 | 3
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | foo | 1 | 1
+ | foo | 2 | 1
+(16 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab) ORDER BY dataa;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | | 1 | 2
+ a | | 2 | 2
+ b | bar | 1 | 1
+ b | bar | 2 | 1
+ b | | 1 | 1
+ b | | 2 | 1
+ | | 1 | 3
+ | | 2 | 3
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | foo | 1 | 1
+ | foo | 2 | 1
+(16 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab) ORDER BY g;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | foo | 1 | 1
+ a | | 1 | 2
+ b | bar | 1 | 1
+ b | | 1 | 1
+ | | 1 | 3
+ | bar | 1 | 2
+ | foo | 1 | 1
+ | foo | 2 | 1
+ a | bar | 2 | 1
+ b | | 2 | 1
+ a | foo | 2 | 1
+ | bar | 2 | 2
+ a | | 2 | 2
+ | | 2 | 3
+ b | bar | 2 | 1
+(16 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab, g);
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | bar | | 2
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | foo | | 2
+ a | | | 4
+ b | bar | 1 | 1
+ b | bar | 2 | 1
+ b | bar | | 2
+ b | | | 2
+ | | | 6
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | bar | | 4
+ | foo | 1 | 1
+ | foo | 2 | 1
+ | foo | | 2
+ a | | 1 | 2
+ b | | 1 | 1
+ | | 1 | 3
+ a | | 2 | 2
+ b | | 2 | 1
+ | | 2 | 3
+(24 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab, g) ORDER BY dataa;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | foo | | 2
+ a | | | 4
+ a | | 2 | 2
+ a | bar | 1 | 1
+ a | bar | 2 | 1
+ a | bar | | 2
+ a | foo | 1 | 1
+ a | foo | 2 | 1
+ a | | 1 | 2
+ b | bar | 1 | 1
+ b | | | 2
+ b | | 1 | 1
+ b | bar | 2 | 1
+ b | bar | | 2
+ b | | 2 | 1
+ | | 2 | 3
+ | | | 6
+ | bar | 1 | 2
+ | bar | 2 | 2
+ | bar | | 4
+ | foo | 1 | 1
+ | foo | 2 | 1
+ | foo | | 2
+ | | 1 | 3
+(24 rows)
+
+SELECT dataa, datab b, generate_series(1,2) g, count(*) FROM few GROUP BY CUBE(dataa, datab, g) ORDER BY g;
+ dataa | b | g | count
+-------+-----+---+-------
+ a | bar | 1 | 1
+ a | foo | 1 | 1
+ b | bar | 1 | 1
+ | bar | 1 | 2
+ | foo | 1 | 1
+ a | | 1 | 2
+ b | | 1 | 1
+ | | 1 | 3
+ a | | 2 | 2
+ b | | 2 | 1
+ | bar | 2 | 2
+ | | 2 | 3
+ | foo | 2 | 1
+ a | bar | 2 | 1
+ a | foo | 2 | 1
+ b | bar | 2 | 1
+ a | | | 4
+ b | bar | | 2
+ b | | | 2
+ | | | 6
+ a | foo | | 2
+ a | bar | | 2
+ | bar | | 4
+ | foo | | 2
+(24 rows)
+
+reset enable_hashagg;
+-- case with degenerate ORDER BY
+explain (verbose, costs off)
+select 'foo' as f, generate_series(1,2) as g from few order by 1;
+ QUERY PLAN
+------------------------------------------------
+ ProjectSet
+ Output: ('foo'::text), generate_series(1, 2)
+ -> Seq Scan on public.few
+ Output: 'foo'::text
+(4 rows)
+
+select 'foo' as f, generate_series(1,2) as g from few order by 1;
+ f | g
+-----+---
+ foo | 1
+ foo | 2
+ foo | 1
+ foo | 2
+ foo | 1
+ foo | 2
+(6 rows)
+
+-- data modification
+CREATE TABLE fewmore AS SELECT generate_series(1,3) AS data;
+INSERT INTO fewmore VALUES(generate_series(4,5));
+SELECT * FROM fewmore;
+ data
+------
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+-- SRFs are not allowed in UPDATE (they once were, but it was nonsense)
+UPDATE fewmore SET data = generate_series(4,9);
+ERROR: set-returning functions are not allowed in UPDATE
+LINE 1: UPDATE fewmore SET data = generate_series(4,9);
+ ^
+-- SRFs are not allowed in RETURNING
+INSERT INTO fewmore VALUES(1) RETURNING generate_series(1,3);
+ERROR: set-returning functions are not allowed in RETURNING
+LINE 1: INSERT INTO fewmore VALUES(1) RETURNING generate_series(1,3)...
+ ^
+-- nor standalone VALUES (but surely this is a bug?)
+VALUES(1, generate_series(1,2));
+ERROR: set-returning functions are not allowed in VALUES
+LINE 1: VALUES(1, generate_series(1,2));
+ ^
+-- We allow tSRFs that are not at top level
+SELECT int4mul(generate_series(1,2), 10);
+ int4mul
+---------
+ 10
+ 20
+(2 rows)
+
+SELECT generate_series(1,3) IS DISTINCT FROM 2;
+ ?column?
+----------
+ t
+ f
+ t
+(3 rows)
+
+-- but SRFs in function RTEs must be at top level (annoying restriction)
+SELECT * FROM int4mul(generate_series(1,2), 10);
+ERROR: set-returning functions must appear at top level of FROM
+LINE 1: SELECT * FROM int4mul(generate_series(1,2), 10);
+ ^
+-- DISTINCT ON is evaluated before tSRF evaluation if SRF is not
+-- referenced either in ORDER BY or in the DISTINCT ON list. The ORDER
+-- BY reference can be implicitly generated, if there's no other ORDER BY.
+-- implicit reference (via implicit ORDER) to all columns
+SELECT DISTINCT ON (a) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b);
+ a | b | g
+---+---+---
+ 1 | 1 | 1
+ 3 | 2 | 1
+ 5 | 3 | 1
+(3 rows)
+
+-- unreferenced in DISTINCT ON or ORDER BY
+SELECT DISTINCT ON (a) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b)
+ORDER BY a, b DESC;
+ a | b | g
+---+---+---
+ 1 | 4 | 1
+ 1 | 4 | 2
+ 1 | 4 | 3
+ 3 | 2 | 1
+ 3 | 2 | 2
+ 3 | 2 | 3
+ 5 | 3 | 1
+ 5 | 3 | 2
+ 5 | 3 | 3
+(9 rows)
+
+-- referenced in ORDER BY
+SELECT DISTINCT ON (a) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b)
+ORDER BY a, b DESC, g DESC;
+ a | b | g
+---+---+---
+ 1 | 4 | 3
+ 3 | 2 | 3
+ 5 | 3 | 3
+(3 rows)
+
+-- referenced in ORDER BY and DISTINCT ON
+SELECT DISTINCT ON (a, b, g) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b)
+ORDER BY a, b DESC, g DESC;
+ a | b | g
+---+---+---
+ 1 | 4 | 3
+ 1 | 4 | 2
+ 1 | 4 | 1
+ 1 | 1 | 3
+ 1 | 1 | 2
+ 1 | 1 | 1
+ 3 | 2 | 3
+ 3 | 2 | 2
+ 3 | 2 | 1
+ 3 | 1 | 3
+ 3 | 1 | 2
+ 3 | 1 | 1
+ 5 | 3 | 3
+ 5 | 3 | 2
+ 5 | 3 | 1
+ 5 | 1 | 3
+ 5 | 1 | 2
+ 5 | 1 | 1
+(18 rows)
+
+-- only SRF mentioned in DISTINCT ON
+SELECT DISTINCT ON (g) a, b, generate_series(1,3) g
+FROM (VALUES (3, 2), (3,1), (1,1), (1,4), (5,3), (5,1)) AS t(a, b);
+ a | b | g
+---+---+---
+ 3 | 2 | 1
+ 5 | 1 | 2
+ 3 | 1 | 3
+(3 rows)
+
+-- LIMIT / OFFSET is evaluated after SRF evaluation
+SELECT a, generate_series(1,2) FROM (VALUES(1),(2),(3)) r(a) LIMIT 2 OFFSET 2;
+ a | generate_series
+---+-----------------
+ 2 | 1
+ 2 | 2
+(2 rows)
+
+-- SRFs are not allowed in LIMIT.
+SELECT 1 LIMIT generate_series(1,3);
+ERROR: set-returning functions are not allowed in LIMIT
+LINE 1: SELECT 1 LIMIT generate_series(1,3);
+ ^
+-- tSRF in correlated subquery, referencing table outside
+SELECT (SELECT generate_series(1,3) LIMIT 1 OFFSET few.id) FROM few;
+ generate_series
+-----------------
+ 2
+ 3
+
+(3 rows)
+
+-- tSRF in correlated subquery, referencing SRF outside
+SELECT (SELECT generate_series(1,3) LIMIT 1 OFFSET g.i) FROM generate_series(0,3) g(i);
+ generate_series
+-----------------
+ 1
+ 2
+ 3
+
+(4 rows)
+
+-- Operators can return sets too
+CREATE OPERATOR |@| (PROCEDURE = unnest, RIGHTARG = ANYARRAY);
+SELECT |@|ARRAY[1,2,3];
+ ?column?
+----------
+ 1
+ 2
+ 3
+(3 rows)
+
+-- Some fun cases involving duplicate SRF calls
+explain (verbose, costs off)
+select generate_series(1,3) as x, generate_series(1,3) + 1 as xp1;
+ QUERY PLAN
+------------------------------------------------------------------
+ Result
+ Output: (generate_series(1, 3)), ((generate_series(1, 3)) + 1)
+ -> ProjectSet
+ Output: generate_series(1, 3)
+ -> Result
+(5 rows)
+
+select generate_series(1,3) as x, generate_series(1,3) + 1 as xp1;
+ x | xp1
+---+-----
+ 1 | 2
+ 2 | 3
+ 3 | 4
+(3 rows)
+
+explain (verbose, costs off)
+select generate_series(1,3)+1 order by generate_series(1,3);
+ QUERY PLAN
+------------------------------------------------------------------------
+ Sort
+ Output: (((generate_series(1, 3)) + 1)), (generate_series(1, 3))
+ Sort Key: (generate_series(1, 3))
+ -> Result
+ Output: ((generate_series(1, 3)) + 1), (generate_series(1, 3))
+ -> ProjectSet
+ Output: generate_series(1, 3)
+ -> Result
+(8 rows)
+
+select generate_series(1,3)+1 order by generate_series(1,3);
+ ?column?
+----------
+ 2
+ 3
+ 4
+(3 rows)
+
+-- Check that SRFs of same nesting level run in lockstep
+explain (verbose, costs off)
+select generate_series(1,3) as x, generate_series(3,6) + 1 as y;
+ QUERY PLAN
+------------------------------------------------------------------
+ Result
+ Output: (generate_series(1, 3)), ((generate_series(3, 6)) + 1)
+ -> ProjectSet
+ Output: generate_series(1, 3), generate_series(3, 6)
+ -> Result
+(5 rows)
+
+select generate_series(1,3) as x, generate_series(3,6) + 1 as y;
+ x | y
+---+---
+ 1 | 4
+ 2 | 5
+ 3 | 6
+ | 7
+(4 rows)
+
+-- Clean up
+DROP TABLE few;
+DROP TABLE fewmore;
diff --git a/src/test/regress/expected/updatable_views_1.out b/src/test/regress/expected/updatable_views_1.out
new file mode 100644
index 0000000000..27db0cecd1
--- /dev/null
+++ b/src/test/regress/expected/updatable_views_1.out
@@ -0,0 +1,3018 @@
+--
+-- UPDATABLE VIEWS
+--
+-- avoid bit-exact output here because operations may not be bit-exact.
+SET extra_float_digits = 0;
+-- check that non-updatable views and columns are rejected with useful error
+-- messages
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW ro_view1 AS SELECT DISTINCT a, b FROM base_tbl; -- DISTINCT not supported
+CREATE VIEW ro_view2 AS SELECT a, b FROM base_tbl GROUP BY a, b; -- GROUP BY not supported
+CREATE VIEW ro_view3 AS SELECT 1 FROM base_tbl HAVING max(a) > 0; -- HAVING not supported
+CREATE VIEW ro_view4 AS SELECT count(*) FROM base_tbl; -- Aggregate functions not supported
+CREATE VIEW ro_view5 AS SELECT a, rank() OVER() FROM base_tbl; -- Window functions not supported
+CREATE VIEW ro_view6 AS SELECT a, b FROM base_tbl UNION SELECT -a, b FROM base_tbl; -- Set ops not supported
+CREATE VIEW ro_view7 AS WITH t AS (SELECT a, b FROM base_tbl) SELECT * FROM t; -- WITH not supported
+CREATE VIEW ro_view8 AS SELECT a, b FROM base_tbl ORDER BY a OFFSET 1; -- OFFSET not supported
+CREATE VIEW ro_view9 AS SELECT a, b FROM base_tbl ORDER BY a LIMIT 1; -- LIMIT not supported
+CREATE VIEW ro_view10 AS SELECT 1 AS a; -- No base relations
+CREATE VIEW ro_view11 AS SELECT b1.a, b2.b FROM base_tbl b1, base_tbl b2; -- Multiple base relations
+CREATE VIEW ro_view12 AS SELECT * FROM generate_series(1, 10) AS g(a); -- SRF in rangetable
+CREATE VIEW ro_view13 AS SELECT a, b FROM (SELECT * FROM base_tbl) AS t; -- Subselect in rangetable
+CREATE VIEW rw_view14 AS SELECT ctid, a, b FROM base_tbl; -- System columns may be part of an updatable view
+CREATE VIEW rw_view15 AS SELECT a, upper(b) FROM base_tbl; -- Expression/function may be part of an updatable view
+CREATE VIEW rw_view16 AS SELECT a, b, a AS aa FROM base_tbl; -- Repeated column may be part of an updatable view
+CREATE VIEW ro_view17 AS SELECT * FROM ro_view1; -- Base relation not updatable
+CREATE VIEW ro_view18 AS SELECT * FROM (VALUES(1)) AS tmp(a); -- VALUES in rangetable
+CREATE SEQUENCE uv_seq;
+CREATE VIEW ro_view19 AS SELECT * FROM uv_seq; -- View based on a sequence
+CREATE VIEW ro_view20 AS SELECT a, b, generate_series(1, a) g FROM base_tbl; -- SRF in targetlist not supported
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ ro_view1 | NO
+ ro_view10 | NO
+ ro_view11 | NO
+ ro_view12 | NO
+ ro_view13 | NO
+ ro_view17 | NO
+ ro_view18 | NO
+ ro_view19 | NO
+ ro_view2 | NO
+ ro_view20 | NO
+ ro_view3 | NO
+ ro_view4 | NO
+ ro_view5 | NO
+ ro_view6 | NO
+ ro_view7 | NO
+ ro_view8 | NO
+ ro_view9 | NO
+ rw_view14 | YES
+ rw_view15 | YES
+ rw_view16 | YES
+(20 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ ro_view1 | NO | NO
+ ro_view10 | NO | NO
+ ro_view11 | NO | NO
+ ro_view12 | NO | NO
+ ro_view13 | NO | NO
+ ro_view17 | NO | NO
+ ro_view18 | NO | NO
+ ro_view19 | NO | NO
+ ro_view2 | NO | NO
+ ro_view20 | NO | NO
+ ro_view3 | NO | NO
+ ro_view4 | NO | NO
+ ro_view5 | NO | NO
+ ro_view6 | NO | NO
+ ro_view7 | NO | NO
+ ro_view8 | NO | NO
+ ro_view9 | NO | NO
+ rw_view14 | YES | YES
+ rw_view15 | YES | YES
+ rw_view16 | YES | YES
+(20 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ ro_view1 | a | NO
+ ro_view1 | b | NO
+ ro_view10 | a | NO
+ ro_view11 | a | NO
+ ro_view11 | b | NO
+ ro_view12 | a | NO
+ ro_view13 | a | NO
+ ro_view13 | b | NO
+ ro_view17 | a | NO
+ ro_view17 | b | NO
+ ro_view18 | a | NO
+ ro_view19 | last_value | NO
+ ro_view19 | log_cnt | NO
+ ro_view19 | is_called | NO
+ ro_view2 | a | NO
+ ro_view2 | b | NO
+ ro_view20 | a | NO
+ ro_view20 | b | NO
+ ro_view20 | g | NO
+ ro_view3 | ?column? | NO
+ ro_view4 | count | NO
+ ro_view5 | a | NO
+ ro_view5 | rank | NO
+ ro_view6 | a | NO
+ ro_view6 | b | NO
+ ro_view7 | a | NO
+ ro_view7 | b | NO
+ ro_view8 | a | NO
+ ro_view8 | b | NO
+ ro_view9 | a | NO
+ ro_view9 | b | NO
+ rw_view14 | ctid | NO
+ rw_view14 | a | YES
+ rw_view14 | b | YES
+ rw_view15 | a | YES
+ rw_view15 | upper | NO
+ rw_view16 | a | YES
+ rw_view16 | b | YES
+ rw_view16 | aa | YES
+(39 rows)
+
+-- Read-only views
+DELETE FROM ro_view1;
+ERROR: cannot delete from view "ro_view1"
+DETAIL: Views containing DISTINCT are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view2;
+ERROR: cannot delete from view "ro_view2"
+DETAIL: Views containing GROUP BY are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view3;
+ERROR: cannot delete from view "ro_view3"
+DETAIL: Views containing HAVING are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view4;
+ERROR: cannot delete from view "ro_view4"
+DETAIL: Views that return aggregate functions are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view5;
+ERROR: cannot delete from view "ro_view5"
+DETAIL: Views that return window functions are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+DELETE FROM ro_view6;
+ERROR: cannot delete from view "ro_view6"
+DETAIL: Views containing UNION, INTERSECT, or EXCEPT are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+UPDATE ro_view7 SET a=a+1;
+ERROR: cannot update view "ro_view7"
+DETAIL: Views containing WITH are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view8 SET a=a+1;
+ERROR: cannot update view "ro_view8"
+DETAIL: Views containing LIMIT or OFFSET are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view9 SET a=a+1;
+ERROR: cannot update view "ro_view9"
+DETAIL: Views containing LIMIT or OFFSET are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view10 SET a=a+1;
+ERROR: cannot update view "ro_view10"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view11 SET a=a+1;
+ERROR: cannot update view "ro_view11"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view12 SET a=a+1;
+ERROR: cannot update view "ro_view12"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+INSERT INTO ro_view13 VALUES (3, 'Row 3');
+ERROR: cannot insert into view "ro_view13"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable inserting into the view, provide an INSTEAD OF INSERT trigger or an unconditional ON INSERT DO INSTEAD rule.
+-- Partially updatable view
+INSERT INTO rw_view14 VALUES (null, 3, 'Row 3'); -- should fail
+ERROR: cannot insert into column "ctid" of view "rw_view14"
+DETAIL: View columns that refer to system columns are not updatable.
+INSERT INTO rw_view14 (a, b) VALUES (3, 'Row 3'); -- should be OK
+UPDATE rw_view14 SET ctid=null WHERE a=3; -- should fail
+ERROR: cannot update column "ctid" of view "rw_view14"
+DETAIL: View columns that refer to system columns are not updatable.
+UPDATE rw_view14 SET b='ROW 3' WHERE a=3; -- should be OK
+SELECT * FROM base_tbl;
+ a | b
+----+--------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 2 | Row 2
+ 3 | ROW 3
+(6 rows)
+
+DELETE FROM rw_view14 WHERE a=3; -- should be OK
+-- Partially updatable view
+INSERT INTO rw_view15 VALUES (3, 'ROW 3'); -- should fail
+ERROR: cannot insert into column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view15 (a) VALUES (3); -- should be OK
+INSERT INTO rw_view15 (a) VALUES (3) ON CONFLICT DO NOTHING; -- succeeds
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+INSERT INTO rw_view15 (a) VALUES (3) ON CONFLICT (a) DO NOTHING; -- succeeds
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+INSERT INTO rw_view15 (a) VALUES (3) ON CONFLICT (a) DO UPDATE set a = excluded.a; -- succeeds
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+INSERT INTO rw_view15 (a) VALUES (3) ON CONFLICT (a) DO UPDATE set upper = 'blarg'; -- fails
+ERROR: cannot insert into column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+SELECT * FROM rw_view15;
+ a | upper
+----+-------------
+ -2 | ROW -2
+ -1 | ROW -1
+ 0 | ROW 0
+ 1 | ROW 1
+ 2 | ROW 2
+ 3 | UNSPECIFIED
+(6 rows)
+
+ALTER VIEW rw_view15 ALTER COLUMN upper SET DEFAULT 'NOT SET';
+INSERT INTO rw_view15 (a) VALUES (4); -- should fail
+ERROR: cannot insert into column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view15 SET upper='ROW 3' WHERE a=3; -- should fail
+ERROR: cannot update column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view15 SET upper=DEFAULT WHERE a=3; -- should fail
+ERROR: cannot update column "upper" of view "rw_view15"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view15 SET a=4 WHERE a=3; -- should be OK
+SELECT * FROM base_tbl;
+ a | b
+----+-------------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 2 | Row 2
+ 4 | Unspecified
+(6 rows)
+
+DELETE FROM rw_view15 WHERE a=4; -- should be OK
+-- Partially updatable view
+INSERT INTO rw_view16 VALUES (3, 'Row 3', 3); -- should fail
+ERROR: multiple assignments to same column "a"
+INSERT INTO rw_view16 (a, b) VALUES (3, 'Row 3'); -- should be OK
+UPDATE rw_view16 SET a=3, aa=-3 WHERE a=3; -- should fail
+ERROR: multiple assignments to same column "a"
+UPDATE rw_view16 SET aa=-3 WHERE a=3; -- should be OK
+SELECT * FROM base_tbl;
+ a | b
+----+--------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 2 | Row 2
+ -3 | Row 3
+(6 rows)
+
+DELETE FROM rw_view16 WHERE a=-3; -- should be OK
+-- Read-only views
+INSERT INTO ro_view17 VALUES (3, 'ROW 3');
+ERROR: cannot insert into view "ro_view1"
+DETAIL: Views containing DISTINCT are not automatically updatable.
+HINT: To enable inserting into the view, provide an INSTEAD OF INSERT trigger or an unconditional ON INSERT DO INSTEAD rule.
+DELETE FROM ro_view18;
+ERROR: cannot delete from view "ro_view18"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable deleting from the view, provide an INSTEAD OF DELETE trigger or an unconditional ON DELETE DO INSTEAD rule.
+UPDATE ro_view19 SET last_value=1000;
+ERROR: cannot update view "ro_view19"
+DETAIL: Views that do not select from a single table or view are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+UPDATE ro_view20 SET b=upper(b);
+ERROR: cannot update view "ro_view20"
+DETAIL: Views that return set-returning functions are not automatically updatable.
+HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 16 other objects
+DETAIL: drop cascades to view ro_view1
+drop cascades to view ro_view17
+drop cascades to view ro_view2
+drop cascades to view ro_view3
+drop cascades to view ro_view4
+drop cascades to view ro_view5
+drop cascades to view ro_view6
+drop cascades to view ro_view7
+drop cascades to view ro_view8
+drop cascades to view ro_view9
+drop cascades to view ro_view11
+drop cascades to view ro_view13
+drop cascades to view rw_view14
+drop cascades to view rw_view15
+drop cascades to view rw_view16
+drop cascades to view ro_view20
+DROP VIEW ro_view10, ro_view12, ro_view18;
+DROP SEQUENCE uv_seq CASCADE;
+NOTICE: drop cascades to view ro_view19
+-- simple updatable view
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a>0;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name = 'rw_view1';
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+(1 row)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name = 'rw_view1';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | YES | YES
+(1 row)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name = 'rw_view1'
+ ORDER BY ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | YES
+ rw_view1 | b | YES
+(2 rows)
+
+INSERT INTO rw_view1 VALUES (3, 'Row 3');
+INSERT INTO rw_view1 (a) VALUES (4);
+UPDATE rw_view1 SET a=5 WHERE a=4;
+DELETE FROM rw_view1 WHERE b='Row 2';
+SELECT * FROM base_tbl;
+ a | b
+----+-------------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 3 | Row 3
+ 5 | Unspecified
+(6 rows)
+
+EXPLAIN (costs off) UPDATE rw_view1 SET a=6 WHERE a=5;
+ QUERY PLAN
+--------------------------------------------------
+ Update on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: ((a > 0) AND (a = 5))
+(3 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view1 WHERE a=5;
+ QUERY PLAN
+--------------------------------------------------
+ Delete on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: ((a > 0) AND (a = 5))
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- view on top of view
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT b AS bb, a AS aa FROM base_tbl WHERE a>0;
+CREATE VIEW rw_view2 AS SELECT aa AS aaa, bb AS bbb FROM rw_view1 WHERE aa<10;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name = 'rw_view2';
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view2 | YES
+(1 row)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name = 'rw_view2';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view2 | YES | YES
+(1 row)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name = 'rw_view2'
+ ORDER BY ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view2 | aaa | YES
+ rw_view2 | bbb | YES
+(2 rows)
+
+INSERT INTO rw_view2 VALUES (3, 'Row 3');
+INSERT INTO rw_view2 (aaa) VALUES (4);
+SELECT * FROM rw_view2;
+ aaa | bbb
+-----+-------------
+ 1 | Row 1
+ 2 | Row 2
+ 3 | Row 3
+ 4 | Unspecified
+(4 rows)
+
+UPDATE rw_view2 SET bbb='Row 4' WHERE aaa=4;
+DELETE FROM rw_view2 WHERE aaa=2;
+SELECT * FROM rw_view2;
+ aaa | bbb
+-----+-------
+ 1 | Row 1
+ 3 | Row 3
+ 4 | Row 4
+(3 rows)
+
+EXPLAIN (costs off) UPDATE rw_view2 SET aaa=5 WHERE aaa=4;
+ QUERY PLAN
+--------------------------------------------------------
+ Update on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: ((a < 10) AND (a > 0) AND (a = 4))
+(3 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view2 WHERE aaa=4;
+ QUERY PLAN
+--------------------------------------------------------
+ Delete on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: ((a < 10) AND (a > 0) AND (a = 4))
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- view on top of view with rules
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a>0 OFFSET 0; -- not updatable without rules/triggers
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a<10;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | NO | NO
+ rw_view2 | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
+ DO INSTEAD INSERT INTO base_tbl VALUES (NEW.a, NEW.b) RETURNING *;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+ rw_view2 | YES
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | NO | YES
+ rw_view2 | NO | YES
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE RULE rw_view1_upd_rule AS ON UPDATE TO rw_view1
+ DO INSTEAD UPDATE base_tbl SET b=NEW.b WHERE a=OLD.a RETURNING NEW.*;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+ rw_view2 | YES
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | NO | YES
+ rw_view2 | NO | YES
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE RULE rw_view1_del_rule AS ON DELETE TO rw_view1
+ DO INSTEAD DELETE FROM base_tbl WHERE a=OLD.a RETURNING OLD.*;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+ rw_view2 | YES
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | YES | YES
+ rw_view2 | YES | YES
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | YES
+ rw_view1 | b | YES
+ rw_view2 | a | YES
+ rw_view2 | b | YES
+(4 rows)
+
+INSERT INTO rw_view2 VALUES (3, 'Row 3') RETURNING *;
+ a | b
+---+-------
+ 3 | Row 3
+(1 row)
+
+UPDATE rw_view2 SET b='Row three' WHERE a=3 RETURNING *;
+ a | b
+---+-----------
+ 3 | Row three
+(1 row)
+
+SELECT * FROM rw_view2;
+ a | b
+---+-----------
+ 1 | Row 1
+ 2 | Row 2
+ 3 | Row three
+(3 rows)
+
+DELETE FROM rw_view2 WHERE a=3 RETURNING *;
+ a | b
+---+-----------
+ 3 | Row three
+(1 row)
+
+SELECT * FROM rw_view2;
+ a | b
+---+-------
+ 1 | Row 1
+ 2 | Row 2
+(2 rows)
+
+EXPLAIN (costs off) UPDATE rw_view2 SET a=3 WHERE a=2;
+ QUERY PLAN
+----------------------------------------------------------------
+ Update on base_tbl
+ -> Nested Loop
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (a = 2)
+ -> Subquery Scan on rw_view1
+ Filter: ((rw_view1.a < 10) AND (rw_view1.a = 2))
+ -> Bitmap Heap Scan on base_tbl base_tbl_1
+ Recheck Cond: (a > 0)
+ -> Bitmap Index Scan on base_tbl_pkey
+ Index Cond: (a > 0)
+(10 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view2 WHERE a=2;
+ QUERY PLAN
+----------------------------------------------------------------
+ Delete on base_tbl
+ -> Nested Loop
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (a = 2)
+ -> Subquery Scan on rw_view1
+ Filter: ((rw_view1.a < 10) AND (rw_view1.a = 2))
+ -> Bitmap Heap Scan on base_tbl base_tbl_1
+ Recheck Cond: (a > 0)
+ -> Bitmap Index Scan on base_tbl_pkey
+ Index Cond: (a > 0)
+(10 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- view on top of view with triggers
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a>0 OFFSET 0; -- not updatable without rules/triggers
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a<10;
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into,
+ is_trigger_updatable, is_trigger_deletable,
+ is_trigger_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ rw_view1 | NO | NO | NO | NO | NO
+ rw_view2 | NO | NO | NO | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE FUNCTION rw_view1_trig_fn()
+RETURNS trigger AS
+$$
+BEGIN
+ IF TG_OP = 'INSERT' THEN
+ INSERT INTO base_tbl VALUES (NEW.a, NEW.b);
+ RETURN NEW;
+ ELSIF TG_OP = 'UPDATE' THEN
+ UPDATE base_tbl SET b=NEW.b WHERE a=OLD.a;
+ RETURN NEW;
+ ELSIF TG_OP = 'DELETE' THEN
+ DELETE FROM base_tbl WHERE a=OLD.a;
+ RETURN OLD;
+ END IF;
+END;
+$$
+LANGUAGE plpgsql;
+CREATE TRIGGER rw_view1_ins_trig INSTEAD OF INSERT ON rw_view1
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into,
+ is_trigger_updatable, is_trigger_deletable,
+ is_trigger_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ rw_view1 | NO | NO | NO | NO | YES
+ rw_view2 | NO | NO | NO | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE TRIGGER rw_view1_upd_trig INSTEAD OF UPDATE ON rw_view1
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into,
+ is_trigger_updatable, is_trigger_deletable,
+ is_trigger_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ rw_view1 | NO | NO | YES | NO | YES
+ rw_view2 | NO | NO | NO | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+CREATE TRIGGER rw_view1_del_trig INSTEAD OF DELETE ON rw_view1
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | NO
+ rw_view2 | NO
+(2 rows)
+
+SELECT table_name, is_updatable, is_insertable_into,
+ is_trigger_updatable, is_trigger_deletable,
+ is_trigger_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ rw_view1 | NO | NO | YES | YES | YES
+ rw_view2 | NO | NO | NO | NO | NO
+(2 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE 'rw_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | a | NO
+ rw_view1 | b | NO
+ rw_view2 | a | NO
+ rw_view2 | b | NO
+(4 rows)
+
+INSERT INTO rw_view2 VALUES (3, 'Row 3') RETURNING *;
+ a | b
+---+-------
+ 3 | Row 3
+(1 row)
+
+UPDATE rw_view2 SET b='Row three' WHERE a=3 RETURNING *;
+ a | b
+---+-----------
+ 3 | Row three
+(1 row)
+
+SELECT * FROM rw_view2;
+ a | b
+---+-----------
+ 1 | Row 1
+ 2 | Row 2
+ 3 | Row three
+(3 rows)
+
+DELETE FROM rw_view2 WHERE a=3 RETURNING *;
+ a | b
+---+-----------
+ 3 | Row three
+(1 row)
+
+SELECT * FROM rw_view2;
+ a | b
+---+-------
+ 1 | Row 1
+ 2 | Row 2
+(2 rows)
+
+EXPLAIN (costs off) UPDATE rw_view2 SET a=3 WHERE a=2;
+ QUERY PLAN
+----------------------------------------------------------
+ Update on rw_view1 rw_view1_1
+ -> Subquery Scan on rw_view1
+ Filter: ((rw_view1.a < 10) AND (rw_view1.a = 2))
+ -> Bitmap Heap Scan on base_tbl
+ Recheck Cond: (a > 0)
+ -> Bitmap Index Scan on base_tbl_pkey
+ Index Cond: (a > 0)
+(7 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view2 WHERE a=2;
+ QUERY PLAN
+----------------------------------------------------------
+ Delete on rw_view1 rw_view1_1
+ -> Subquery Scan on rw_view1
+ Filter: ((rw_view1.a < 10) AND (rw_view1.a = 2))
+ -> Bitmap Heap Scan on base_tbl
+ Recheck Cond: (a > 0)
+ -> Bitmap Index Scan on base_tbl_pkey
+ Index Cond: (a > 0)
+(7 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+DROP FUNCTION rw_view1_trig_fn();
+-- update using whole row from view
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
+CREATE VIEW rw_view1 AS SELECT b AS bb, a AS aa FROM base_tbl;
+CREATE FUNCTION rw_view1_aa(x rw_view1)
+ RETURNS int AS $$ SELECT x.aa $$ LANGUAGE sql;
+UPDATE rw_view1 v SET bb='Updated row 2' WHERE rw_view1_aa(v)=2
+ RETURNING rw_view1_aa(v), v.bb;
+ rw_view1_aa | bb
+-------------+---------------
+ 2 | Updated row 2
+(1 row)
+
+SELECT * FROM base_tbl;
+ a | b
+----+---------------
+ -2 | Row -2
+ -1 | Row -1
+ 0 | Row 0
+ 1 | Row 1
+ 2 | Updated row 2
+(5 rows)
+
+EXPLAIN (costs off)
+UPDATE rw_view1 v SET bb='Updated row 2' WHERE rw_view1_aa(v)=2
+ RETURNING rw_view1_aa(v), v.bb;
+ QUERY PLAN
+--------------------------------------------------
+ Update on base_tbl
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (a = 2)
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to function rw_view1_aa(rw_view1)
+-- permissions checks
+CREATE USER regress_view_user1;
+CREATE USER regress_view_user2;
+SET SESSION AUTHORIZATION regress_view_user1;
+CREATE TABLE base_tbl(a int, b text, c float);
+INSERT INTO base_tbl VALUES (1, 'Row 1', 1.0);
+CREATE VIEW rw_view1 AS SELECT b AS bb, c AS cc, a AS aa FROM base_tbl;
+INSERT INTO rw_view1 VALUES ('Row 2', 2.0, 2);
+GRANT SELECT ON base_tbl TO regress_view_user2;
+GRANT SELECT ON rw_view1 TO regress_view_user2;
+GRANT UPDATE (a,c) ON base_tbl TO regress_view_user2;
+GRANT UPDATE (bb,cc) ON rw_view1 TO regress_view_user2;
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user2;
+CREATE VIEW rw_view2 AS SELECT b AS bb, c AS cc, a AS aa FROM base_tbl;
+SELECT * FROM base_tbl; -- ok
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+ 2 | Row 2 | 2
+(2 rows)
+
+SELECT * FROM rw_view1; -- ok
+ bb | cc | aa
+-------+----+----
+ Row 1 | 1 | 1
+ Row 2 | 2 | 2
+(2 rows)
+
+SELECT * FROM rw_view2; -- ok
+ bb | cc | aa
+-------+----+----
+ Row 1 | 1 | 1
+ Row 2 | 2 | 2
+(2 rows)
+
+INSERT INTO base_tbl VALUES (3, 'Row 3', 3.0); -- not allowed
+ERROR: permission denied for table base_tbl
+INSERT INTO rw_view1 VALUES ('Row 3', 3.0, 3); -- not allowed
+ERROR: permission denied for view rw_view1
+INSERT INTO rw_view2 VALUES ('Row 3', 3.0, 3); -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE base_tbl SET a=a, c=c; -- ok
+UPDATE base_tbl SET b=b; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view1 SET bb=bb, cc=cc; -- ok
+UPDATE rw_view1 SET aa=aa; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET aa=aa, cc=cc; -- ok
+UPDATE rw_view2 SET bb=bb; -- not allowed
+ERROR: permission denied for table base_tbl
+DELETE FROM base_tbl; -- not allowed
+ERROR: permission denied for table base_tbl
+DELETE FROM rw_view1; -- not allowed
+ERROR: permission denied for view rw_view1
+DELETE FROM rw_view2; -- not allowed
+ERROR: permission denied for table base_tbl
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user1;
+GRANT INSERT, DELETE ON base_tbl TO regress_view_user2;
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user2;
+INSERT INTO base_tbl VALUES (3, 'Row 3', 3.0); -- ok
+INSERT INTO rw_view1 VALUES ('Row 4', 4.0, 4); -- not allowed
+ERROR: permission denied for view rw_view1
+INSERT INTO rw_view2 VALUES ('Row 4', 4.0, 4); -- ok
+DELETE FROM base_tbl WHERE a=1; -- ok
+DELETE FROM rw_view1 WHERE aa=2; -- not allowed
+ERROR: permission denied for view rw_view1
+DELETE FROM rw_view2 WHERE aa=2; -- ok
+SELECT * FROM base_tbl;
+ a | b | c
+---+-------+---
+ 3 | Row 3 | 3
+ 4 | Row 4 | 4
+(2 rows)
+
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user1;
+REVOKE INSERT, DELETE ON base_tbl FROM regress_view_user2;
+GRANT INSERT, DELETE ON rw_view1 TO regress_view_user2;
+RESET SESSION AUTHORIZATION;
+SET SESSION AUTHORIZATION regress_view_user2;
+INSERT INTO base_tbl VALUES (5, 'Row 5', 5.0); -- not allowed
+ERROR: permission denied for table base_tbl
+INSERT INTO rw_view1 VALUES ('Row 5', 5.0, 5); -- ok
+INSERT INTO rw_view2 VALUES ('Row 6', 6.0, 6); -- not allowed
+ERROR: permission denied for table base_tbl
+DELETE FROM base_tbl WHERE a=3; -- not allowed
+ERROR: permission denied for table base_tbl
+DELETE FROM rw_view1 WHERE aa=3; -- ok
+DELETE FROM rw_view2 WHERE aa=4; -- not allowed
+ERROR: permission denied for table base_tbl
+SELECT * FROM base_tbl;
+ a | b | c
+---+-------+---
+ 4 | Row 4 | 4
+ 5 | Row 5 | 5
+(2 rows)
+
+RESET SESSION AUTHORIZATION;
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- nested-view permissions
+CREATE TABLE base_tbl(a int, b text, c float);
+INSERT INTO base_tbl VALUES (1, 'Row 1', 1.0);
+SET SESSION AUTHORIZATION regress_view_user1;
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl;
+SELECT * FROM rw_view1; -- not allowed
+ERROR: permission denied for table base_tbl
+SELECT * FROM rw_view1 FOR UPDATE; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view1 SET b = 'foo' WHERE a = 1; -- not allowed
+ERROR: permission denied for table base_tbl
+SET SESSION AUTHORIZATION regress_view_user2;
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1;
+SELECT * FROM rw_view2; -- not allowed
+ERROR: permission denied for view rw_view1
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for view rw_view1
+RESET SESSION AUTHORIZATION;
+GRANT SELECT ON base_tbl TO regress_view_user1;
+SET SESSION AUTHORIZATION regress_view_user1;
+SELECT * FROM rw_view1;
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+(1 row)
+
+SELECT * FROM rw_view1 FOR UPDATE; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view1 SET b = 'foo' WHERE a = 1; -- not allowed
+ERROR: permission denied for table base_tbl
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2; -- not allowed
+ERROR: permission denied for view rw_view1
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for view rw_view1
+SET SESSION AUTHORIZATION regress_view_user1;
+GRANT SELECT ON rw_view1 TO regress_view_user2;
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2;
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+(1 row)
+
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for view rw_view1
+RESET SESSION AUTHORIZATION;
+GRANT UPDATE ON base_tbl TO regress_view_user1;
+SET SESSION AUTHORIZATION regress_view_user1;
+SELECT * FROM rw_view1;
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+(1 row)
+
+SELECT * FROM rw_view1 FOR UPDATE;
+ a | b | c
+---+-------+---
+ 1 | Row 1 | 1
+(1 row)
+
+UPDATE rw_view1 SET b = 'foo' WHERE a = 1;
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2;
+ a | b | c
+---+-----+---
+ 1 | foo | 1
+(1 row)
+
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for view rw_view1
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for view rw_view1
+SET SESSION AUTHORIZATION regress_view_user1;
+GRANT UPDATE ON rw_view1 TO regress_view_user2;
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2;
+ a | b | c
+---+-----+---
+ 1 | foo | 1
+(1 row)
+
+SELECT * FROM rw_view2 FOR UPDATE;
+ a | b | c
+---+-----+---
+ 1 | foo | 1
+(1 row)
+
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1;
+RESET SESSION AUTHORIZATION;
+REVOKE UPDATE ON base_tbl FROM regress_view_user1;
+SET SESSION AUTHORIZATION regress_view_user1;
+SELECT * FROM rw_view1;
+ a | b | c
+---+-----+---
+ 1 | bar | 1
+(1 row)
+
+SELECT * FROM rw_view1 FOR UPDATE; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view1 SET b = 'foo' WHERE a = 1; -- not allowed
+ERROR: permission denied for table base_tbl
+SET SESSION AUTHORIZATION regress_view_user2;
+SELECT * FROM rw_view2;
+ a | b | c
+---+-----+---
+ 1 | bar | 1
+(1 row)
+
+SELECT * FROM rw_view2 FOR UPDATE; -- not allowed
+ERROR: permission denied for table base_tbl
+UPDATE rw_view2 SET b = 'bar' WHERE a = 1; -- not allowed
+ERROR: permission denied for table base_tbl
+RESET SESSION AUTHORIZATION;
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+DROP USER regress_view_user1;
+DROP USER regress_view_user2;
+-- column defaults
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified', c serial);
+INSERT INTO base_tbl VALUES (1, 'Row 1');
+INSERT INTO base_tbl VALUES (2, 'Row 2');
+INSERT INTO base_tbl VALUES (3);
+CREATE VIEW rw_view1 AS SELECT a AS aa, b AS bb FROM base_tbl;
+ALTER VIEW rw_view1 ALTER COLUMN bb SET DEFAULT 'View default';
+INSERT INTO rw_view1 VALUES (4, 'Row 4');
+INSERT INTO rw_view1 (aa) VALUES (5);
+SELECT * FROM base_tbl;
+ a | b | c
+---+--------------+---
+ 1 | Row 1 | 1
+ 2 | Row 2 | 2
+ 3 | Unspecified | 3
+ 4 | Row 4 | 4
+ 5 | View default | 5
+(5 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- Table having triggers
+CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
+INSERT INTO base_tbl VALUES (1, 'Row 1');
+INSERT INTO base_tbl VALUES (2, 'Row 2');
+CREATE FUNCTION rw_view1_trig_fn()
+RETURNS trigger AS
+$$
+BEGIN
+ IF TG_OP = 'INSERT' THEN
+ UPDATE base_tbl SET b=NEW.b WHERE a=1;
+ RETURN NULL;
+ END IF;
+ RETURN NULL;
+END;
+$$
+LANGUAGE plpgsql;
+CREATE TRIGGER rw_view1_ins_trig AFTER INSERT ON base_tbl
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+CREATE VIEW rw_view1 AS SELECT a AS aa, b AS bb FROM base_tbl;
+INSERT INTO rw_view1 VALUES (3, 'Row 3');
+select * from base_tbl;
+ a | b
+---+-------
+ 2 | Row 2
+ 3 | Row 3
+ 1 | Row 3
+(3 rows)
+
+DROP VIEW rw_view1;
+DROP TRIGGER rw_view1_ins_trig on base_tbl;
+DROP FUNCTION rw_view1_trig_fn();
+DROP TABLE base_tbl;
+-- view with ORDER BY
+CREATE TABLE base_tbl (a int, b int);
+INSERT INTO base_tbl VALUES (1,2), (4,5), (3,-3);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl ORDER BY a+b;
+SELECT * FROM rw_view1;
+ a | b
+---+----
+ 3 | -3
+ 1 | 2
+ 4 | 5
+(3 rows)
+
+INSERT INTO rw_view1 VALUES (7,-8);
+SELECT * FROM rw_view1;
+ a | b
+---+----
+ 7 | -8
+ 3 | -3
+ 1 | 2
+ 4 | 5
+(4 rows)
+
+EXPLAIN (verbose, costs off) UPDATE rw_view1 SET b = b + 1 RETURNING *;
+ QUERY PLAN
+-------------------------------------------------------------
+ Update on public.base_tbl
+ Output: base_tbl.a, base_tbl.b
+ -> Seq Scan on public.base_tbl
+ Output: base_tbl.a, (base_tbl.b + 1), base_tbl.ctid
+(4 rows)
+
+UPDATE rw_view1 SET b = b + 1 RETURNING *;
+ a | b
+---+----
+ 1 | 3
+ 4 | 6
+ 3 | -2
+ 7 | -7
+(4 rows)
+
+SELECT * FROM rw_view1;
+ a | b
+---+----
+ 7 | -7
+ 3 | -2
+ 1 | 3
+ 4 | 6
+(4 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- multiple array-column updates
+CREATE TABLE base_tbl (a int, arr int[]);
+INSERT INTO base_tbl VALUES (1,ARRAY[2]), (3,ARRAY[4]);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl;
+UPDATE rw_view1 SET arr[1] = 42, arr[2] = 77 WHERE a = 3;
+SELECT * FROM rw_view1;
+ a | arr
+---+---------
+ 1 | {2}
+ 3 | {42,77}
+(2 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- views with updatable and non-updatable columns
+CREATE TABLE base_tbl(a float);
+INSERT INTO base_tbl SELECT i/10.0 FROM generate_series(1,10) g(i);
+CREATE VIEW rw_view1 AS
+ SELECT ctid, sin(a) s, a, cos(a) c
+ FROM base_tbl
+ WHERE a != 0
+ ORDER BY abs(a);
+INSERT INTO rw_view1 VALUES (null, null, 1.1, null); -- should fail
+ERROR: cannot insert into column "ctid" of view "rw_view1"
+DETAIL: View columns that refer to system columns are not updatable.
+INSERT INTO rw_view1 (s, c, a) VALUES (null, null, 1.1); -- should fail
+ERROR: cannot insert into column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view1 (a) VALUES (1.1) RETURNING a, s, c; -- OK
+ a | s | c
+-----+-------------------+-------------------
+ 1.1 | 0.891207360061435 | 0.453596121425577
+(1 row)
+
+UPDATE rw_view1 SET s = s WHERE a = 1.1; -- should fail
+ERROR: cannot update column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view1 SET a = 1.05 WHERE a = 1.1 RETURNING s; -- OK
+ s
+-------------------
+ 0.867423225594017
+(1 row)
+
+DELETE FROM rw_view1 WHERE a = 1.05; -- OK
+CREATE VIEW rw_view2 AS
+ SELECT s, c, s/c t, a base_a, ctid
+ FROM rw_view1;
+INSERT INTO rw_view2 VALUES (null, null, null, 1.1, null); -- should fail
+ERROR: cannot insert into column "t" of view "rw_view2"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view2(s, c, base_a) VALUES (null, null, 1.1); -- should fail
+ERROR: cannot insert into column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view2(base_a) VALUES (1.1) RETURNING t; -- OK
+ t
+------------------
+ 1.96475965724865
+(1 row)
+
+UPDATE rw_view2 SET s = s WHERE base_a = 1.1; -- should fail
+ERROR: cannot update column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view2 SET t = t WHERE base_a = 1.1; -- should fail
+ERROR: cannot update column "t" of view "rw_view2"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view2 SET base_a = 1.05 WHERE base_a = 1.1; -- OK
+DELETE FROM rw_view2 WHERE base_a = 1.05 RETURNING base_a, s, c, t; -- OK
+ base_a | s | c | t
+--------+-------------------+-------------------+------------------
+ 1.05 | 0.867423225594017 | 0.497571047891727 | 1.74331530998317
+(1 row)
+
+CREATE VIEW rw_view3 AS
+ SELECT s, c, s/c t, ctid
+ FROM rw_view1;
+INSERT INTO rw_view3 VALUES (null, null, null, null); -- should fail
+ERROR: cannot insert into column "t" of view "rw_view3"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+INSERT INTO rw_view3(s) VALUES (null); -- should fail
+ERROR: cannot insert into column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+UPDATE rw_view3 SET s = s; -- should fail
+ERROR: cannot update column "s" of view "rw_view1"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+DELETE FROM rw_view3 WHERE s = sin(0.1); -- should be OK
+SELECT * FROM base_tbl ORDER BY a;
+ a
+-----
+ 0.2
+ 0.3
+ 0.4
+ 0.5
+ 0.6
+ 0.7
+ 0.8
+ 0.9
+ 1
+(9 rows)
+
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name;
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+ rw_view2 | YES
+ rw_view3 | NO
+(3 rows)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name;
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | YES | YES
+ rw_view2 | YES | YES
+ rw_view3 | NO | NO
+(3 rows)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name LIKE E'r_\\_view%'
+ ORDER BY table_name, ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | ctid | NO
+ rw_view1 | s | NO
+ rw_view1 | a | YES
+ rw_view1 | c | NO
+ rw_view2 | s | NO
+ rw_view2 | c | NO
+ rw_view2 | t | NO
+ rw_view2 | base_a | YES
+ rw_view2 | ctid | NO
+ rw_view3 | s | NO
+ rw_view3 | c | NO
+ rw_view3 | t | NO
+ rw_view3 | ctid | NO
+(13 rows)
+
+SELECT events & 4 != 0 AS upd,
+ events & 8 != 0 AS ins,
+ events & 16 != 0 AS del
+ FROM pg_catalog.pg_relation_is_updatable('rw_view3'::regclass, false) t(events);
+ upd | ins | del
+-----+-----+-----
+ f | f | t
+(1 row)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+drop cascades to view rw_view3
+-- inheritance tests
+CREATE TABLE base_tbl_parent (a int);
+CREATE TABLE base_tbl_child (CHECK (a > 0)) INHERITS (base_tbl_parent);
+INSERT INTO base_tbl_parent SELECT * FROM generate_series(-8, -1);
+INSERT INTO base_tbl_child SELECT * FROM generate_series(1, 8);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl_parent;
+CREATE VIEW rw_view2 AS SELECT * FROM ONLY base_tbl_parent;
+SELECT * FROM rw_view1 ORDER BY a;
+ a
+----
+ -8
+ -7
+ -6
+ -5
+ -4
+ -3
+ -2
+ -1
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+(16 rows)
+
+SELECT * FROM ONLY rw_view1 ORDER BY a;
+ a
+----
+ -8
+ -7
+ -6
+ -5
+ -4
+ -3
+ -2
+ -1
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+(16 rows)
+
+SELECT * FROM rw_view2 ORDER BY a;
+ a
+----
+ -8
+ -7
+ -6
+ -5
+ -4
+ -3
+ -2
+ -1
+(8 rows)
+
+INSERT INTO rw_view1 VALUES (-100), (100);
+INSERT INTO rw_view2 VALUES (-200), (200);
+UPDATE rw_view1 SET a = a*10 WHERE a IN (-1, 1); -- Should produce -10 and 10
+UPDATE ONLY rw_view1 SET a = a*10 WHERE a IN (-2, 2); -- Should produce -20 and 20
+UPDATE rw_view2 SET a = a*10 WHERE a IN (-3, 3); -- Should produce -30 only
+UPDATE ONLY rw_view2 SET a = a*10 WHERE a IN (-4, 4); -- Should produce -40 only
+DELETE FROM rw_view1 WHERE a IN (-5, 5); -- Should delete -5 and 5
+DELETE FROM ONLY rw_view1 WHERE a IN (-6, 6); -- Should delete -6 and 6
+DELETE FROM rw_view2 WHERE a IN (-7, 7); -- Should delete -7 only
+DELETE FROM ONLY rw_view2 WHERE a IN (-8, 8); -- Should delete -8 only
+SELECT * FROM ONLY base_tbl_parent ORDER BY a;
+ a
+------
+ -200
+ -100
+ -40
+ -30
+ -20
+ -10
+ 100
+ 200
+(8 rows)
+
+SELECT * FROM base_tbl_child ORDER BY a;
+ a
+----
+ 3
+ 4
+ 7
+ 8
+ 10
+ 20
+(6 rows)
+
+CREATE TABLE other_tbl_parent (id int);
+CREATE TABLE other_tbl_child () INHERITS (other_tbl_parent);
+INSERT INTO other_tbl_parent VALUES (7),(200);
+INSERT INTO other_tbl_child VALUES (8),(100);
+EXPLAIN (costs off)
+UPDATE rw_view1 SET a = a + 1000 FROM other_tbl_parent WHERE a = id;
+ QUERY PLAN
+---------------------------------------------------------------
+ Update on base_tbl_parent
+ Update on base_tbl_parent
+ Update on base_tbl_child
+ -> Merge Join
+ Merge Cond: (base_tbl_parent.a = other_tbl_parent.id)
+ -> Sort
+ Sort Key: base_tbl_parent.a
+ -> Seq Scan on base_tbl_parent
+ -> Sort
+ Sort Key: other_tbl_parent.id
+ -> Append
+ -> Seq Scan on other_tbl_parent
+ -> Seq Scan on other_tbl_child
+ -> Merge Join
+ Merge Cond: (base_tbl_child.a = other_tbl_parent.id)
+ -> Sort
+ Sort Key: base_tbl_child.a
+ -> Seq Scan on base_tbl_child
+ -> Sort
+ Sort Key: other_tbl_parent.id
+ -> Append
+ -> Seq Scan on other_tbl_parent
+ -> Seq Scan on other_tbl_child
+(23 rows)
+
+UPDATE rw_view1 SET a = a + 1000 FROM other_tbl_parent WHERE a = id;
+SELECT * FROM ONLY base_tbl_parent ORDER BY a;
+ a
+------
+ -200
+ -100
+ -40
+ -30
+ -20
+ -10
+ 1100
+ 1200
+(8 rows)
+
+SELECT * FROM base_tbl_child ORDER BY a;
+ a
+------
+ 3
+ 4
+ 10
+ 20
+ 1007
+ 1008
+(6 rows)
+
+DROP TABLE base_tbl_parent, base_tbl_child CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+DROP TABLE other_tbl_parent CASCADE;
+NOTICE: drop cascades to table other_tbl_child
+-- simple WITH CHECK OPTION
+CREATE TABLE base_tbl (a int, b int DEFAULT 10);
+INSERT INTO base_tbl VALUES (1,2), (2,3), (1,-1);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a < b
+ WITH LOCAL CHECK OPTION;
+\d+ rw_view1
+ View "public.rw_view1"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+ b | integer | | | | plain |
+View definition:
+ SELECT base_tbl.a,
+ base_tbl.b
+ FROM base_tbl
+ WHERE base_tbl.a < base_tbl.b;
+Options: check_option=local
+
+SELECT * FROM information_schema.views WHERE table_name = 'rw_view1';
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+------------------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view1 | SELECT base_tbl.a, +| LOCAL | YES | YES | NO | NO | NO
+ | | | base_tbl.b +| | | | | |
+ | | | FROM base_tbl +| | | | | |
+ | | | WHERE (base_tbl.a < base_tbl.b); | | | | | |
+(1 row)
+
+INSERT INTO rw_view1 VALUES(3,4); -- ok
+INSERT INTO rw_view1 VALUES(4,3); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (4, 3).
+INSERT INTO rw_view1 VALUES(5,null); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (5, null).
+UPDATE rw_view1 SET b = 5 WHERE a = 3; -- ok
+UPDATE rw_view1 SET b = -5 WHERE a = 3; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (3, -5).
+INSERT INTO rw_view1(a) VALUES (9); -- ok
+INSERT INTO rw_view1(a) VALUES (10); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (10, 10).
+SELECT * FROM base_tbl;
+ a | b
+---+----
+ 1 | 2
+ 2 | 3
+ 1 | -1
+ 3 | 5
+ 9 | 10
+(5 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- WITH LOCAL/CASCADED CHECK OPTION
+CREATE TABLE base_tbl (a int);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a > 0;
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a < 10
+ WITH CHECK OPTION; -- implicitly cascaded
+\d+ rw_view2
+ View "public.rw_view2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+View definition:
+ SELECT rw_view1.a
+ FROM rw_view1
+ WHERE rw_view1.a < 10;
+Options: check_option=cascaded
+
+SELECT * FROM information_schema.views WHERE table_name = 'rw_view2';
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+----------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view2 | SELECT rw_view1.a +| CASCADED | YES | YES | NO | NO | NO
+ | | | FROM rw_view1 +| | | | | |
+ | | | WHERE (rw_view1.a < 10); | | | | | |
+(1 row)
+
+INSERT INTO rw_view2 VALUES (-5); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (-5).
+INSERT INTO rw_view2 VALUES (5); -- ok
+INSERT INTO rw_view2 VALUES (15); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (15).
+SELECT * FROM base_tbl;
+ a
+---
+ 5
+(1 row)
+
+UPDATE rw_view2 SET a = a - 10; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (-5).
+UPDATE rw_view2 SET a = a + 10; -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (15).
+CREATE OR REPLACE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a < 10
+ WITH LOCAL CHECK OPTION;
+\d+ rw_view2
+ View "public.rw_view2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+View definition:
+ SELECT rw_view1.a
+ FROM rw_view1
+ WHERE rw_view1.a < 10;
+Options: check_option=local
+
+SELECT * FROM information_schema.views WHERE table_name = 'rw_view2';
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+----------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view2 | SELECT rw_view1.a +| LOCAL | YES | YES | NO | NO | NO
+ | | | FROM rw_view1 +| | | | | |
+ | | | WHERE (rw_view1.a < 10); | | | | | |
+(1 row)
+
+INSERT INTO rw_view2 VALUES (-10); -- ok, but not in view
+INSERT INTO rw_view2 VALUES (20); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (20).
+SELECT * FROM base_tbl;
+ a
+-----
+ 5
+ -10
+(2 rows)
+
+ALTER VIEW rw_view1 SET (check_option=here); -- invalid
+ERROR: invalid value for enum option "check_option": here
+DETAIL: Valid values are "local" and "cascaded".
+ALTER VIEW rw_view1 SET (check_option=local);
+INSERT INTO rw_view2 VALUES (-20); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (-20).
+INSERT INTO rw_view2 VALUES (30); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (30).
+ALTER VIEW rw_view2 RESET (check_option);
+\d+ rw_view2
+ View "public.rw_view2"
+ Column | Type | Collation | Nullable | Default | Storage | Description
+--------+---------+-----------+----------+---------+---------+-------------
+ a | integer | | | | plain |
+View definition:
+ SELECT rw_view1.a
+ FROM rw_view1
+ WHERE rw_view1.a < 10;
+
+SELECT * FROM information_schema.views WHERE table_name = 'rw_view2';
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+----------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view2 | SELECT rw_view1.a +| NONE | YES | YES | NO | NO | NO
+ | | | FROM rw_view1 +| | | | | |
+ | | | WHERE (rw_view1.a < 10); | | | | | |
+(1 row)
+
+INSERT INTO rw_view2 VALUES (30); -- ok, but not in view
+SELECT * FROM base_tbl;
+ a
+-----
+ 5
+ -10
+ 30
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- WITH CHECK OPTION with no local view qual
+CREATE TABLE base_tbl (a int);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WITH CHECK OPTION;
+CREATE VIEW rw_view2 AS SELECT * FROM rw_view1 WHERE a > 0;
+CREATE VIEW rw_view3 AS SELECT * FROM rw_view2 WITH CHECK OPTION;
+SELECT * FROM information_schema.views WHERE table_name LIKE E'rw\\_view_' ORDER BY table_name;
+ table_catalog | table_schema | table_name | view_definition | check_option | is_updatable | is_insertable_into | is_trigger_updatable | is_trigger_deletable | is_trigger_insertable_into
+---------------+--------------+------------+---------------------------+--------------+--------------+--------------------+----------------------+----------------------+----------------------------
+ regression | public | rw_view1 | SELECT base_tbl.a +| CASCADED | YES | YES | NO | NO | NO
+ | | | FROM base_tbl; | | | | | |
+ regression | public | rw_view2 | SELECT rw_view1.a +| NONE | YES | YES | NO | NO | NO
+ | | | FROM rw_view1 +| | | | | |
+ | | | WHERE (rw_view1.a > 0); | | | | | |
+ regression | public | rw_view3 | SELECT rw_view2.a +| CASCADED | YES | YES | NO | NO | NO
+ | | | FROM rw_view2; | | | | | |
+(3 rows)
+
+INSERT INTO rw_view1 VALUES (-1); -- ok
+INSERT INTO rw_view1 VALUES (1); -- ok
+INSERT INTO rw_view2 VALUES (-2); -- ok, but not in view
+INSERT INTO rw_view2 VALUES (2); -- ok
+INSERT INTO rw_view3 VALUES (-3); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (-3).
+INSERT INTO rw_view3 VALUES (3); -- ok
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 3 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+drop cascades to view rw_view3
+-- WITH CHECK OPTION with scalar array ops
+CREATE TABLE base_tbl (a int, b int[]);
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a = ANY (b)
+ WITH CHECK OPTION;
+INSERT INTO rw_view1 VALUES (1, ARRAY[1,2,3]); -- ok
+INSERT INTO rw_view1 VALUES (10, ARRAY[4,5]); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (10, {4,5}).
+UPDATE rw_view1 SET b[2] = -b[2] WHERE a = 1; -- ok
+UPDATE rw_view1 SET b[1] = -b[1] WHERE a = 1; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (1, {-1,-2,3}).
+PREPARE ins(int, int[]) AS INSERT INTO rw_view1 VALUES($1, $2);
+EXECUTE ins(2, ARRAY[1,2,3]); -- ok
+EXECUTE ins(10, ARRAY[4,5]); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (10, {4,5}).
+DEALLOCATE PREPARE ins;
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- WITH CHECK OPTION with subquery
+CREATE TABLE base_tbl (a int);
+CREATE TABLE ref_tbl (a int PRIMARY KEY);
+INSERT INTO ref_tbl SELECT * FROM generate_series(1,10);
+CREATE VIEW rw_view1 AS
+ SELECT * FROM base_tbl b
+ WHERE EXISTS(SELECT 1 FROM ref_tbl r WHERE r.a = b.a)
+ WITH CHECK OPTION;
+INSERT INTO rw_view1 VALUES (5); -- ok
+INSERT INTO rw_view1 VALUES (15); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (15).
+UPDATE rw_view1 SET a = a + 5; -- ok
+UPDATE rw_view1 SET a = a + 5; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (15).
+EXPLAIN (costs off) INSERT INTO rw_view1 VALUES (5);
+ QUERY PLAN
+---------------------------------------------------------
+ Insert on base_tbl b
+ -> Result
+ SubPlan 1
+ -> Index Only Scan using ref_tbl_pkey on ref_tbl r
+ Index Cond: (a = b.a)
+ SubPlan 2
+ -> Seq Scan on ref_tbl r_1
+(7 rows)
+
+EXPLAIN (costs off) UPDATE rw_view1 SET a = a + 5;
+ QUERY PLAN
+-----------------------------------------------------------
+ Update on base_tbl b
+ -> Hash Join
+ Hash Cond: (b.a = r.a)
+ -> Seq Scan on base_tbl b
+ -> Hash
+ -> Seq Scan on ref_tbl r
+ SubPlan 1
+ -> Index Only Scan using ref_tbl_pkey on ref_tbl r_1
+ Index Cond: (a = b.a)
+ SubPlan 2
+ -> Seq Scan on ref_tbl r_2
+(11 rows)
+
+DROP TABLE base_tbl, ref_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- WITH CHECK OPTION with BEFORE trigger on base table
+CREATE TABLE base_tbl (a int, b int);
+CREATE FUNCTION base_tbl_trig_fn()
+RETURNS trigger AS
+$$
+BEGIN
+ NEW.b := 10;
+ RETURN NEW;
+END;
+$$
+LANGUAGE plpgsql;
+CREATE TRIGGER base_tbl_trig BEFORE INSERT OR UPDATE ON base_tbl
+ FOR EACH ROW EXECUTE PROCEDURE base_tbl_trig_fn();
+CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a < b WITH CHECK OPTION;
+INSERT INTO rw_view1 VALUES (5,0); -- ok
+INSERT INTO rw_view1 VALUES (15, 20); -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (15, 10).
+UPDATE rw_view1 SET a = 20, b = 30; -- should fail
+ERROR: new row violates check option for view "rw_view1"
+DETAIL: Failing row contains (20, 10).
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+DROP FUNCTION base_tbl_trig_fn();
+-- WITH LOCAL CHECK OPTION with INSTEAD OF trigger on base view
+CREATE TABLE base_tbl (a int, b int);
+CREATE VIEW rw_view1 AS SELECT a FROM base_tbl WHERE a < b;
+CREATE FUNCTION rw_view1_trig_fn()
+RETURNS trigger AS
+$$
+BEGIN
+ IF TG_OP = 'INSERT' THEN
+ INSERT INTO base_tbl VALUES (NEW.a, 10);
+ RETURN NEW;
+ ELSIF TG_OP = 'UPDATE' THEN
+ UPDATE base_tbl SET a=NEW.a WHERE a=OLD.a;
+ RETURN NEW;
+ ELSIF TG_OP = 'DELETE' THEN
+ DELETE FROM base_tbl WHERE a=OLD.a;
+ RETURN OLD;
+ END IF;
+END;
+$$
+LANGUAGE plpgsql;
+CREATE TRIGGER rw_view1_trig
+ INSTEAD OF INSERT OR UPDATE OR DELETE ON rw_view1
+ FOR EACH ROW EXECUTE PROCEDURE rw_view1_trig_fn();
+CREATE VIEW rw_view2 AS
+ SELECT * FROM rw_view1 WHERE a > 0 WITH LOCAL CHECK OPTION;
+INSERT INTO rw_view2 VALUES (-5); -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (-5).
+INSERT INTO rw_view2 VALUES (5); -- ok
+INSERT INTO rw_view2 VALUES (50); -- ok, but not in view
+UPDATE rw_view2 SET a = a - 10; -- should fail
+ERROR: new row violates check option for view "rw_view2"
+DETAIL: Failing row contains (-5).
+SELECT * FROM base_tbl;
+ a | b
+----+----
+ 5 | 10
+ 50 | 10
+(2 rows)
+
+-- Check option won't cascade down to base view with INSTEAD OF triggers
+ALTER VIEW rw_view2 SET (check_option=cascaded);
+INSERT INTO rw_view2 VALUES (100); -- ok, but not in view (doesn't fail rw_view1's check)
+UPDATE rw_view2 SET a = 200 WHERE a = 5; -- ok, but not in view (doesn't fail rw_view1's check)
+SELECT * FROM base_tbl;
+ a | b
+-----+----
+ 50 | 10
+ 100 | 10
+ 200 | 10
+(3 rows)
+
+-- Neither local nor cascaded check options work with INSTEAD rules
+DROP TRIGGER rw_view1_trig ON rw_view1;
+CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
+ DO INSTEAD INSERT INTO base_tbl VALUES (NEW.a, 10);
+CREATE RULE rw_view1_upd_rule AS ON UPDATE TO rw_view1
+ DO INSTEAD UPDATE base_tbl SET a=NEW.a WHERE a=OLD.a;
+INSERT INTO rw_view2 VALUES (-10); -- ok, but not in view (doesn't fail rw_view2's check)
+INSERT INTO rw_view2 VALUES (5); -- ok
+INSERT INTO rw_view2 VALUES (20); -- ok, but not in view (doesn't fail rw_view1's check)
+UPDATE rw_view2 SET a = 30 WHERE a = 5; -- ok, but not in view (doesn't fail rw_view1's check)
+INSERT INTO rw_view2 VALUES (5); -- ok
+UPDATE rw_view2 SET a = -5 WHERE a = 5; -- ok, but not in view (doesn't fail rw_view2's check)
+SELECT * FROM base_tbl;
+ a | b
+-----+----
+ 50 | 10
+ 100 | 10
+ 200 | 10
+ -10 | 10
+ 20 | 10
+ 30 | 10
+ -5 | 10
+(7 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+DROP FUNCTION rw_view1_trig_fn();
+CREATE TABLE base_tbl (a int);
+CREATE VIEW rw_view1 AS SELECT a,10 AS b FROM base_tbl;
+CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
+ DO INSTEAD INSERT INTO base_tbl VALUES (NEW.a);
+CREATE VIEW rw_view2 AS
+ SELECT * FROM rw_view1 WHERE a > b WITH LOCAL CHECK OPTION;
+INSERT INTO rw_view2 VALUES (2,3); -- ok, but not in view (doesn't fail rw_view2's check)
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- security barrier view
+CREATE TABLE base_tbl (person text, visibility text);
+INSERT INTO base_tbl VALUES ('Tom', 'public'),
+ ('Dick', 'private'),
+ ('Harry', 'public');
+CREATE VIEW rw_view1 AS
+ SELECT person FROM base_tbl WHERE visibility = 'public';
+CREATE FUNCTION snoop(anyelement)
+RETURNS boolean AS
+$$
+BEGIN
+ RAISE NOTICE 'snooped value: %', $1;
+ RETURN true;
+END;
+$$
+LANGUAGE plpgsql COST 0.000001;
+CREATE OR REPLACE FUNCTION leakproof(anyelement)
+RETURNS boolean AS
+$$
+BEGIN
+ RETURN true;
+END;
+$$
+LANGUAGE plpgsql STRICT IMMUTABLE LEAKPROOF;
+SELECT * FROM rw_view1 WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Dick
+NOTICE: snooped value: Harry
+ person
+--------
+ Tom
+ Harry
+(2 rows)
+
+UPDATE rw_view1 SET person=person WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Dick
+NOTICE: snooped value: Harry
+DELETE FROM rw_view1 WHERE NOT snoop(person);
+NOTICE: snooped value: Dick
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+ALTER VIEW rw_view1 SET (security_barrier = true);
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name = 'rw_view1';
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view1 | YES
+(1 row)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name = 'rw_view1';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view1 | YES | YES
+(1 row)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name = 'rw_view1'
+ ORDER BY ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view1 | person | YES
+(1 row)
+
+SELECT * FROM rw_view1 WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+ person
+--------
+ Tom
+ Harry
+(2 rows)
+
+UPDATE rw_view1 SET person=person WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+DELETE FROM rw_view1 WHERE NOT snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+EXPLAIN (costs off) SELECT * FROM rw_view1 WHERE snoop(person);
+ QUERY PLAN
+-----------------------------------------------
+ Subquery Scan on rw_view1
+ Filter: snoop(rw_view1.person)
+ -> Seq Scan on base_tbl
+ Filter: (visibility = 'public'::text)
+(4 rows)
+
+EXPLAIN (costs off) UPDATE rw_view1 SET person=person WHERE snoop(person);
+ QUERY PLAN
+-------------------------------------------------------------------
+ Update on base_tbl
+ -> Seq Scan on base_tbl
+ Filter: ((visibility = 'public'::text) AND snoop(person))
+(3 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view1 WHERE NOT snoop(person);
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Delete on base_tbl
+ -> Seq Scan on base_tbl
+ Filter: ((visibility = 'public'::text) AND (NOT snoop(person)))
+(3 rows)
+
+-- security barrier view on top of security barrier view
+CREATE VIEW rw_view2 WITH (security_barrier = true) AS
+ SELECT * FROM rw_view1 WHERE snoop(person);
+SELECT table_name, is_insertable_into
+ FROM information_schema.tables
+ WHERE table_name = 'rw_view2';
+ table_name | is_insertable_into
+------------+--------------------
+ rw_view2 | YES
+(1 row)
+
+SELECT table_name, is_updatable, is_insertable_into
+ FROM information_schema.views
+ WHERE table_name = 'rw_view2';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ rw_view2 | YES | YES
+(1 row)
+
+SELECT table_name, column_name, is_updatable
+ FROM information_schema.columns
+ WHERE table_name = 'rw_view2'
+ ORDER BY ordinal_position;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ rw_view2 | person | YES
+(1 row)
+
+SELECT * FROM rw_view2 WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+NOTICE: snooped value: Harry
+ person
+--------
+ Tom
+ Harry
+(2 rows)
+
+UPDATE rw_view2 SET person=person WHERE snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+NOTICE: snooped value: Harry
+DELETE FROM rw_view2 WHERE NOT snoop(person);
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Tom
+NOTICE: snooped value: Harry
+NOTICE: snooped value: Harry
+EXPLAIN (costs off) SELECT * FROM rw_view2 WHERE snoop(person);
+ QUERY PLAN
+-----------------------------------------------------
+ Subquery Scan on rw_view2
+ Filter: snoop(rw_view2.person)
+ -> Subquery Scan on rw_view1
+ Filter: snoop(rw_view1.person)
+ -> Seq Scan on base_tbl
+ Filter: (visibility = 'public'::text)
+(6 rows)
+
+EXPLAIN (costs off) UPDATE rw_view2 SET person=person WHERE snoop(person);
+ QUERY PLAN
+-------------------------------------------------------------------------------------
+ Update on base_tbl
+ -> Seq Scan on base_tbl
+ Filter: ((visibility = 'public'::text) AND snoop(person) AND snoop(person))
+(3 rows)
+
+EXPLAIN (costs off) DELETE FROM rw_view2 WHERE NOT snoop(person);
+ QUERY PLAN
+-------------------------------------------------------------------------------------------
+ Delete on base_tbl
+ -> Seq Scan on base_tbl
+ Filter: ((visibility = 'public'::text) AND snoop(person) AND (NOT snoop(person)))
+(3 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to 2 other objects
+DETAIL: drop cascades to view rw_view1
+drop cascades to view rw_view2
+-- security barrier view on top of table with rules
+CREATE TABLE base_tbl(id int PRIMARY KEY, data text, deleted boolean);
+INSERT INTO base_tbl VALUES (1, 'Row 1', false), (2, 'Row 2', true);
+CREATE RULE base_tbl_ins_rule AS ON INSERT TO base_tbl
+ WHERE EXISTS (SELECT 1 FROM base_tbl t WHERE t.id = new.id)
+ DO INSTEAD
+ UPDATE base_tbl SET data = new.data, deleted = false WHERE id = new.id;
+CREATE RULE base_tbl_del_rule AS ON DELETE TO base_tbl
+ DO INSTEAD
+ UPDATE base_tbl SET deleted = true WHERE id = old.id;
+CREATE VIEW rw_view1 WITH (security_barrier=true) AS
+ SELECT id, data FROM base_tbl WHERE NOT deleted;
+SELECT * FROM rw_view1;
+ id | data
+----+-------
+ 1 | Row 1
+(1 row)
+
+EXPLAIN (costs off) DELETE FROM rw_view1 WHERE id = 1 AND snoop(data);
+ QUERY PLAN
+-------------------------------------------------------------------
+ Update on base_tbl base_tbl_1
+ -> Nested Loop
+ -> Index Scan using base_tbl_pkey on base_tbl base_tbl_1
+ Index Cond: (id = 1)
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (id = 1)
+ Filter: ((NOT deleted) AND snoop(data))
+(7 rows)
+
+DELETE FROM rw_view1 WHERE id = 1 AND snoop(data);
+NOTICE: snooped value: Row 1
+EXPLAIN (costs off) INSERT INTO rw_view1 VALUES (2, 'New row 2');
+ QUERY PLAN
+-----------------------------------------------------------
+ Insert on base_tbl
+ InitPlan 1 (returns $0)
+ -> Index Only Scan using base_tbl_pkey on base_tbl t
+ Index Cond: (id = 2)
+ -> Result
+ One-Time Filter: ($0 IS NOT TRUE)
+
+ Update on base_tbl
+ InitPlan 1 (returns $0)
+ -> Index Only Scan using base_tbl_pkey on base_tbl t
+ Index Cond: (id = 2)
+ -> Result
+ One-Time Filter: $0
+ -> Index Scan using base_tbl_pkey on base_tbl
+ Index Cond: (id = 2)
+(15 rows)
+
+INSERT INTO rw_view1 VALUES (2, 'New row 2');
+SELECT * FROM base_tbl;
+ id | data | deleted
+----+-----------+---------
+ 1 | Row 1 | t
+ 2 | New row 2 | f
+(2 rows)
+
+DROP TABLE base_tbl CASCADE;
+NOTICE: drop cascades to view rw_view1
+-- security barrier view based on inheritance set
+CREATE TABLE t1 (a int, b float, c text);
+CREATE INDEX t1_a_idx ON t1(a);
+INSERT INTO t1
+SELECT i,i,'t1' FROM generate_series(1,10) g(i);
+ANALYZE t1;
+CREATE TABLE t11 (d text) INHERITS (t1);
+CREATE INDEX t11_a_idx ON t11(a);
+INSERT INTO t11
+SELECT i,i,'t11','t11d' FROM generate_series(1,10) g(i);
+ANALYZE t11;
+CREATE TABLE t12 (e int[]) INHERITS (t1);
+CREATE INDEX t12_a_idx ON t12(a);
+INSERT INTO t12
+SELECT i,i,'t12','{1,2}'::int[] FROM generate_series(1,10) g(i);
+ANALYZE t12;
+CREATE TABLE t111 () INHERITS (t11, t12);
+NOTICE: merging multiple inherited definitions of column "a"
+NOTICE: merging multiple inherited definitions of column "b"
+NOTICE: merging multiple inherited definitions of column "c"
+CREATE INDEX t111_a_idx ON t111(a);
+INSERT INTO t111
+SELECT i,i,'t111','t111d','{1,1,1}'::int[] FROM generate_series(1,10) g(i);
+ANALYZE t111;
+CREATE VIEW v1 WITH (security_barrier=true) AS
+SELECT *, (SELECT d FROM t11 WHERE t11.a = t1.a LIMIT 1) AS d
+FROM t1
+WHERE a > 5 AND EXISTS(SELECT 1 FROM t12 WHERE t12.a = t1.a);
+SELECT * FROM v1 WHERE a=3; -- should not see anything
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM v1 WHERE a=8;
+ a | b | c | d
+---+---+------+------
+ 8 | 8 | t1 | t11d
+ 8 | 8 | t11 | t11d
+ 8 | 8 | t12 | t11d
+ 8 | 8 | t111 | t11d
+(4 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+UPDATE v1 SET a=100 WHERE snoop(a) AND leakproof(a) AND a < 7 AND a != 6;
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------------------------
+ Update on public.t1
+ Update on public.t1
+ Update on public.t11
+ Update on public.t12
+ Update on public.t111
+ -> Index Scan using t1_a_idx on public.t1
+ Output: 100, t1.b, t1.c, t1.ctid
+ Index Cond: ((t1.a > 5) AND (t1.a < 7))
+ Filter: ((t1.a <> 6) AND (alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t1.a) AND leakproof(t1.a))
+ SubPlan 1
+ -> Append
+ -> Seq Scan on public.t12 t12_1
+ Filter: (t12_1.a = t1.a)
+ -> Seq Scan on public.t111 t111_1
+ Filter: (t111_1.a = t1.a)
+ SubPlan 2
+ -> Append
+ -> Seq Scan on public.t12 t12_2
+ Output: t12_2.a
+ -> Seq Scan on public.t111 t111_2
+ Output: t111_2.a
+ -> Index Scan using t11_a_idx on public.t11
+ Output: 100, t11.b, t11.c, t11.d, t11.ctid
+ Index Cond: ((t11.a > 5) AND (t11.a < 7))
+ Filter: ((t11.a <> 6) AND (alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t11.a) AND leakproof(t11.a))
+ -> Index Scan using t12_a_idx on public.t12
+ Output: 100, t12.b, t12.c, t12.e, t12.ctid
+ Index Cond: ((t12.a > 5) AND (t12.a < 7))
+ Filter: ((t12.a <> 6) AND (alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t12.a) AND leakproof(t12.a))
+ -> Index Scan using t111_a_idx on public.t111
+ Output: 100, t111.b, t111.c, t111.d, t111.e, t111.ctid
+ Index Cond: ((t111.a > 5) AND (t111.a < 7))
+ Filter: ((t111.a <> 6) AND (alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t111.a) AND leakproof(t111.a))
+(33 rows)
+
+UPDATE v1 SET a=100 WHERE snoop(a) AND leakproof(a) AND a < 7 AND a != 6;
+SELECT * FROM v1 WHERE a=100; -- Nothing should have been changed to 100
+ a | b | c | d
+---+---+---+---
+(0 rows)
+
+SELECT * FROM t1 WHERE a=100; -- Nothing should have been changed to 100
+ a | b | c
+---+---+---
+(0 rows)
+
+EXPLAIN (VERBOSE, COSTS OFF)
+UPDATE v1 SET a=a+1 WHERE snoop(a) AND leakproof(a) AND a = 8;
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
+ Update on public.t1
+ Update on public.t1
+ Update on public.t11
+ Update on public.t12
+ Update on public.t111
+ -> Index Scan using t1_a_idx on public.t1
+ Output: (t1.a + 1), t1.b, t1.c, t1.ctid
+ Index Cond: ((t1.a > 5) AND (t1.a = 8))
+ Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t1.a) AND leakproof(t1.a))
+ SubPlan 1
+ -> Append
+ -> Seq Scan on public.t12 t12_1
+ Filter: (t12_1.a = t1.a)
+ -> Seq Scan on public.t111 t111_1
+ Filter: (t111_1.a = t1.a)
+ SubPlan 2
+ -> Append
+ -> Seq Scan on public.t12 t12_2
+ Output: t12_2.a
+ -> Seq Scan on public.t111 t111_2
+ Output: t111_2.a
+ -> Index Scan using t11_a_idx on public.t11
+ Output: (t11.a + 1), t11.b, t11.c, t11.d, t11.ctid
+ Index Cond: ((t11.a > 5) AND (t11.a = 8))
+ Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t11.a) AND leakproof(t11.a))
+ -> Index Scan using t12_a_idx on public.t12
+ Output: (t12.a + 1), t12.b, t12.c, t12.e, t12.ctid
+ Index Cond: ((t12.a > 5) AND (t12.a = 8))
+ Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t12.a) AND leakproof(t12.a))
+ -> Index Scan using t111_a_idx on public.t111
+ Output: (t111.a + 1), t111.b, t111.c, t111.d, t111.e, t111.ctid
+ Index Cond: ((t111.a > 5) AND (t111.a = 8))
+ Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) AND snoop(t111.a) AND leakproof(t111.a))
+(33 rows)
+
+UPDATE v1 SET a=a+1 WHERE snoop(a) AND leakproof(a) AND a = 8;
+NOTICE: snooped value: 8
+NOTICE: snooped value: 8
+NOTICE: snooped value: 8
+NOTICE: snooped value: 8
+SELECT * FROM v1 WHERE b=8;
+ a | b | c | d
+---+---+------+------
+ 9 | 8 | t1 | t11d
+ 9 | 8 | t11 | t11d
+ 9 | 8 | t12 | t11d
+ 9 | 8 | t111 | t11d
+(4 rows)
+
+DELETE FROM v1 WHERE snoop(a) AND leakproof(a); -- should not delete everything, just where a>5
+NOTICE: snooped value: 6
+NOTICE: snooped value: 7
+NOTICE: snooped value: 9
+NOTICE: snooped value: 10
+NOTICE: snooped value: 9
+NOTICE: snooped value: 6
+NOTICE: snooped value: 7
+NOTICE: snooped value: 9
+NOTICE: snooped value: 9
+NOTICE: snooped value: 10
+NOTICE: snooped value: 6
+NOTICE: snooped value: 7
+NOTICE: snooped value: 9
+NOTICE: snooped value: 9
+NOTICE: snooped value: 10
+NOTICE: snooped value: 6
+NOTICE: snooped value: 7
+NOTICE: snooped value: 9
+NOTICE: snooped value: 9
+NOTICE: snooped value: 10
+TABLE t1; -- verify all a<=5 are intact
+ a | b | c
+---+---+------
+ 1 | 1 | t1
+ 2 | 2 | t1
+ 3 | 3 | t1
+ 4 | 4 | t1
+ 5 | 5 | t1
+ 1 | 1 | t11
+ 2 | 2 | t11
+ 3 | 3 | t11
+ 4 | 4 | t11
+ 5 | 5 | t11
+ 1 | 1 | t12
+ 2 | 2 | t12
+ 3 | 3 | t12
+ 4 | 4 | t12
+ 5 | 5 | t12
+ 1 | 1 | t111
+ 2 | 2 | t111
+ 3 | 3 | t111
+ 4 | 4 | t111
+ 5 | 5 | t111
+(20 rows)
+
+DROP TABLE t1, t11, t12, t111 CASCADE;
+NOTICE: drop cascades to view v1
+DROP FUNCTION snoop(anyelement);
+DROP FUNCTION leakproof(anyelement);
+CREATE TABLE tx1 (a integer);
+CREATE TABLE tx2 (b integer);
+CREATE TABLE tx3 (c integer);
+CREATE VIEW vx1 AS SELECT a FROM tx1 WHERE EXISTS(SELECT 1 FROM tx2 JOIN tx3 ON b=c);
+INSERT INTO vx1 values (1);
+SELECT * FROM tx1;
+ a
+---
+ 1
+(1 row)
+
+SELECT * FROM vx1;
+ a
+---
+(0 rows)
+
+DROP VIEW vx1;
+DROP TABLE tx1;
+DROP TABLE tx2;
+DROP TABLE tx3;
+CREATE TABLE tx1 (a integer);
+CREATE TABLE tx2 (b integer);
+CREATE TABLE tx3 (c integer);
+CREATE VIEW vx1 AS SELECT a FROM tx1 WHERE EXISTS(SELECT 1 FROM tx2 JOIN tx3 ON b=c);
+INSERT INTO vx1 VALUES (1);
+INSERT INTO vx1 VALUES (1);
+SELECT * FROM tx1;
+ a
+---
+ 1
+ 1
+(2 rows)
+
+SELECT * FROM vx1;
+ a
+---
+(0 rows)
+
+DROP VIEW vx1;
+DROP TABLE tx1;
+DROP TABLE tx2;
+DROP TABLE tx3;
+CREATE TABLE tx1 (a integer, b integer);
+CREATE TABLE tx2 (b integer, c integer);
+CREATE TABLE tx3 (c integer, d integer);
+ALTER TABLE tx1 DROP COLUMN b;
+ALTER TABLE tx2 DROP COLUMN c;
+ALTER TABLE tx3 DROP COLUMN d;
+CREATE VIEW vx1 AS SELECT a FROM tx1 WHERE EXISTS(SELECT 1 FROM tx2 JOIN tx3 ON b=c);
+INSERT INTO vx1 VALUES (1);
+INSERT INTO vx1 VALUES (1);
+SELECT * FROM tx1;
+ a
+---
+ 1
+ 1
+(2 rows)
+
+SELECT * FROM vx1;
+ a
+---
+(0 rows)
+
+DROP VIEW vx1;
+DROP TABLE tx1;
+DROP TABLE tx2;
+DROP TABLE tx3;
+--
+-- Test handling of vars from correlated subqueries in quals from outer
+-- security barrier views, per bug #13988
+--
+CREATE TABLE t1 (a int, b text, c int);
+INSERT INTO t1 VALUES (1, 'one', 10);
+CREATE TABLE t2 (cc int);
+INSERT INTO t2 VALUES (10), (20);
+CREATE VIEW v1 WITH (security_barrier = true) AS
+ SELECT * FROM t1 WHERE (a > 0)
+ WITH CHECK OPTION;
+CREATE VIEW v2 WITH (security_barrier = true) AS
+ SELECT * FROM v1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.cc = v1.c)
+ WITH CHECK OPTION;
+INSERT INTO v2 VALUES (2, 'two', 20); -- ok
+INSERT INTO v2 VALUES (-2, 'minus two', 20); -- not allowed
+ERROR: new row violates check option for view "v1"
+DETAIL: Failing row contains (-2, minus two, 20).
+INSERT INTO v2 VALUES (3, 'three', 30); -- not allowed
+ERROR: new row violates check option for view "v2"
+DETAIL: Failing row contains (3, three, 30).
+UPDATE v2 SET b = 'ONE' WHERE a = 1; -- ok
+UPDATE v2 SET a = -1 WHERE a = 1; -- not allowed
+ERROR: new row violates check option for view "v1"
+DETAIL: Failing row contains (-1, ONE, 10).
+UPDATE v2 SET c = 30 WHERE a = 1; -- not allowed
+ERROR: new row violates check option for view "v2"
+DETAIL: Failing row contains (1, ONE, 30).
+DELETE FROM v2 WHERE a = 2; -- ok
+SELECT * FROM v2;
+ a | b | c
+---+-----+----
+ 1 | ONE | 10
+(1 row)
+
+DROP VIEW v2;
+DROP VIEW v1;
+DROP TABLE t2;
+DROP TABLE t1;
+--
+-- Test CREATE OR REPLACE VIEW turning a non-updatable view into an
+-- auto-updatable view and adding check options in a single step
+--
+CREATE TABLE t1 (a int, b text);
+CREATE VIEW v1 AS SELECT null::int AS a;
+CREATE OR REPLACE VIEW v1 AS SELECT * FROM t1 WHERE a > 0 WITH CHECK OPTION;
+INSERT INTO v1 VALUES (1, 'ok'); -- ok
+INSERT INTO v1 VALUES (-1, 'invalid'); -- should fail
+ERROR: new row violates check option for view "v1"
+DETAIL: Failing row contains (-1, invalid).
+DROP VIEW v1;
+DROP TABLE t1;
+-- check that an auto-updatable view on a partitioned table works correctly
+create table uv_pt (a int, b int, v varchar) partition by range (a, b);
+create table uv_pt1 (b int not null, v varchar, a int not null) partition by range (b);
+create table uv_pt11 (like uv_pt1);
+alter table uv_pt11 drop a;
+alter table uv_pt11 add a int;
+alter table uv_pt11 drop a;
+alter table uv_pt11 add a int not null;
+alter table uv_pt1 attach partition uv_pt11 for values from (2) to (5);
+alter table uv_pt attach partition uv_pt1 for values from (1, 2) to (1, 10);
+create view uv_ptv as select * from uv_pt;
+select events & 4 != 0 AS upd,
+ events & 8 != 0 AS ins,
+ events & 16 != 0 AS del
+ from pg_catalog.pg_relation_is_updatable('uv_pt'::regclass, false) t(events);
+ upd | ins | del
+-----+-----+-----
+ t | t | t
+(1 row)
+
+select pg_catalog.pg_column_is_updatable('uv_pt'::regclass, 1::smallint, false);
+ pg_column_is_updatable
+------------------------
+ t
+(1 row)
+
+select pg_catalog.pg_column_is_updatable('uv_pt'::regclass, 2::smallint, false);
+ pg_column_is_updatable
+------------------------
+ t
+(1 row)
+
+select table_name, is_updatable, is_insertable_into
+ from information_schema.views where table_name = 'uv_ptv';
+ table_name | is_updatable | is_insertable_into
+------------+--------------+--------------------
+ uv_ptv | YES | YES
+(1 row)
+
+select table_name, column_name, is_updatable
+ from information_schema.columns where table_name = 'uv_ptv' order by column_name;
+ table_name | column_name | is_updatable
+------------+-------------+--------------
+ uv_ptv | a | YES
+ uv_ptv | b | YES
+ uv_ptv | v | YES
+(3 rows)
+
+insert into uv_ptv values (1, 2);
+select tableoid::regclass, * from uv_pt;
+ tableoid | a | b | v
+----------+---+---+---
+ uv_pt11 | 1 | 2 |
+(1 row)
+
+create view uv_ptv_wco as select * from uv_pt where a = 0 with check option;
+insert into uv_ptv_wco values (1, 2);
+ERROR: new row violates check option for view "uv_ptv_wco"
+DETAIL: Failing row contains (1, 2, null).
+drop view uv_ptv, uv_ptv_wco;
+drop table uv_pt, uv_pt1, uv_pt11;
+-- check that wholerow vars appearing in WITH CHECK OPTION constraint expressions
+-- work fine with partitioned tables
+create table wcowrtest (a int) partition by list (a);
+create table wcowrtest1 partition of wcowrtest for values in (1);
+create view wcowrtest_v as select * from wcowrtest where wcowrtest = '(2)'::wcowrtest with check option;
+insert into wcowrtest_v values (1);
+ERROR: new row violates check option for view "wcowrtest_v"
+DETAIL: Failing row contains (1).
+alter table wcowrtest add b text;
+create table wcowrtest2 (b text, c int, a int);
+alter table wcowrtest2 drop c;
+alter table wcowrtest attach partition wcowrtest2 for values in (2);
+create table sometable (a int, b text);
+insert into sometable values (1, 'a'), (2, 'b');
+create view wcowrtest_v2 as
+ select *
+ from wcowrtest r
+ where r in (select s from sometable s where r.a = s.a)
+with check option;
+-- WITH CHECK qual will be processed with wcowrtest2's
+-- rowtype after tuple-routing
+insert into wcowrtest_v2 values (2, 'no such row in sometable');
+ERROR: new row violates check option for view "wcowrtest_v2"
+DETAIL: Failing row contains (2, no such row in sometable).
+drop view wcowrtest_v, wcowrtest_v2;
+drop table wcowrtest, sometable;
+-- Check INSERT .. ON CONFLICT DO UPDATE works correctly when the view's
+-- columns are named and ordered differently than the underlying table's.
+create table uv_iocu_tab (a text unique, b float);
+insert into uv_iocu_tab values ('xyxyxy', 0);
+create view uv_iocu_view as
+ select b, b+1 as c, a, '2.0'::text as two from uv_iocu_tab;
+insert into uv_iocu_view (a, b) values ('xyxyxy', 1)
+ on conflict (a) do update set b = uv_iocu_view.b;
+select * from uv_iocu_tab;
+ a | b
+--------+---
+ xyxyxy | 0
+(1 row)
+
+insert into uv_iocu_view (a, b) values ('xyxyxy', 1)
+ on conflict (a) do update set b = excluded.b;
+select * from uv_iocu_tab;
+ a | b
+--------+---
+ xyxyxy | 1
+(1 row)
+
+-- OK to access view columns that are not present in underlying base
+-- relation in the ON CONFLICT portion of the query
+insert into uv_iocu_view (a, b) values ('xyxyxy', 3)
+ on conflict (a) do update set b = cast(excluded.two as float);
+select * from uv_iocu_tab;
+ a | b
+--------+---
+ xyxyxy | 2
+(1 row)
+
+explain (costs off)
+insert into uv_iocu_view (a, b) values ('xyxyxy', 3)
+ on conflict (a) do update set b = excluded.b where excluded.c > 0;
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Insert on uv_iocu_tab
+ Conflict Resolution: UPDATE
+ Conflict Arbiter Indexes: uv_iocu_tab_a_key
+ Conflict Filter: ((excluded.b + '1'::double precision) > '0'::double precision)
+ -> Result
+(5 rows)
+
+insert into uv_iocu_view (a, b) values ('xyxyxy', 3)
+ on conflict (a) do update set b = excluded.b where excluded.c > 0;
+select * from uv_iocu_tab;
+ a | b
+--------+---
+ xyxyxy | 3
+(1 row)
+
+drop view uv_iocu_view;
+drop table uv_iocu_tab;
+-- Test whole-row references to the view
+create table uv_iocu_tab (a int unique, b text);
+create view uv_iocu_view as
+ select b as bb, a as aa, uv_iocu_tab::text as cc from uv_iocu_tab;
+insert into uv_iocu_view (aa,bb) values (1,'x');
+explain (costs off)
+insert into uv_iocu_view (aa,bb) values (1,'y')
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*
+ where excluded.aa > 0
+ and excluded.bb != ''
+ and excluded.cc is not null;
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
+ Insert on uv_iocu_tab
+ Conflict Resolution: UPDATE
+ Conflict Arbiter Indexes: uv_iocu_tab_a_key
+ Conflict Filter: ((excluded.a > 0) AND (excluded.b <> ''::text) AND ((excluded.*)::text IS NOT NULL))
+ -> Result
+(5 rows)
+
+insert into uv_iocu_view (aa,bb) values (1,'y')
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*
+ where excluded.aa > 0
+ and excluded.bb != ''
+ and excluded.cc is not null;
+select * from uv_iocu_view;
+ bb | aa | cc
+-------------------------+----+---------------------------------
+ Rejected: (y,1,"(1,y)") | 1 | (1,"Rejected: (y,1,""(1,y)"")")
+(1 row)
+
+-- Test omitting a column of the base relation
+delete from uv_iocu_view;
+insert into uv_iocu_view (aa,bb) values (1,'x');
+insert into uv_iocu_view (aa) values (1)
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*;
+select * from uv_iocu_view;
+ bb | aa | cc
+-----------------------+----+-------------------------------
+ Rejected: (,1,"(1,)") | 1 | (1,"Rejected: (,1,""(1,)"")")
+(1 row)
+
+alter table uv_iocu_tab alter column b set default 'table default';
+insert into uv_iocu_view (aa) values (1)
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*;
+select * from uv_iocu_view;
+ bb | aa | cc
+-------------------------------------------------------+----+---------------------------------------------------------------------
+ Rejected: ("table default",1,"(1,""table default"")") | 1 | (1,"Rejected: (""table default"",1,""(1,""""table default"""")"")")
+(1 row)
+
+alter view uv_iocu_view alter column bb set default 'view default';
+insert into uv_iocu_view (aa) values (1)
+ on conflict (aa) do update set bb = 'Rejected: '||excluded.*;
+select * from uv_iocu_view;
+ bb | aa | cc
+-----------------------------------------------------+----+-------------------------------------------------------------------
+ Rejected: ("view default",1,"(1,""view default"")") | 1 | (1,"Rejected: (""view default"",1,""(1,""""view default"""")"")")
+(1 row)
+
+-- Should fail to update non-updatable columns
+insert into uv_iocu_view (aa) values (1)
+ on conflict (aa) do update set cc = 'XXX';
+ERROR: cannot insert into column "cc" of view "uv_iocu_view"
+DETAIL: View columns that are not columns of their base relation are not updatable.
+drop view uv_iocu_view;
+drop table uv_iocu_tab;
+-- ON CONFLICT DO UPDATE permissions checks
+create user regress_view_user1;
+create user regress_view_user2;
+set session authorization regress_view_user1;
+create table base_tbl(a int unique, b text, c float);
+insert into base_tbl values (1,'xxx',1.0);
+create view rw_view1 as select b as bb, c as cc, a as aa from base_tbl;
+grant select (aa,bb) on rw_view1 to regress_view_user2;
+grant insert on rw_view1 to regress_view_user2;
+grant update (bb) on rw_view1 to regress_view_user2;
+set session authorization regress_view_user2;
+insert into rw_view1 values ('yyy',2.0,1)
+ on conflict (aa) do update set bb = excluded.cc; -- Not allowed
+ERROR: permission denied for view rw_view1
+insert into rw_view1 values ('yyy',2.0,1)
+ on conflict (aa) do update set bb = rw_view1.cc; -- Not allowed
+ERROR: permission denied for view rw_view1
+insert into rw_view1 values ('yyy',2.0,1)
+ on conflict (aa) do update set bb = excluded.bb; -- OK
+insert into rw_view1 values ('zzz',2.0,1)
+ on conflict (aa) do update set bb = rw_view1.bb||'xxx'; -- OK
+insert into rw_view1 values ('zzz',2.0,1)
+ on conflict (aa) do update set cc = 3.0; -- Not allowed
+ERROR: permission denied for view rw_view1
+reset session authorization;
+select * from base_tbl;
+ a | b | c
+---+--------+---
+ 1 | yyyxxx | 1
+(1 row)
+
+set session authorization regress_view_user1;
+grant select (a,b) on base_tbl to regress_view_user2;
+grant insert (a,b) on base_tbl to regress_view_user2;
+grant update (a,b) on base_tbl to regress_view_user2;
+set session authorization regress_view_user2;
+create view rw_view2 as select b as bb, c as cc, a as aa from base_tbl;
+insert into rw_view2 (aa,bb) values (1,'xxx')
+ on conflict (aa) do update set bb = excluded.bb; -- Not allowed
+ERROR: permission denied for table base_tbl
+create view rw_view3 as select b as bb, a as aa from base_tbl;
+insert into rw_view3 (aa,bb) values (1,'xxx')
+ on conflict (aa) do update set bb = excluded.bb; -- OK
+reset session authorization;
+select * from base_tbl;
+ a | b | c
+---+-----+---
+ 1 | xxx | 1
+(1 row)
+
+set session authorization regress_view_user2;
+create view rw_view4 as select aa, bb, cc FROM rw_view1;
+insert into rw_view4 (aa,bb) values (1,'yyy')
+ on conflict (aa) do update set bb = excluded.bb; -- Not allowed
+ERROR: permission denied for view rw_view1
+create view rw_view5 as select aa, bb FROM rw_view1;
+insert into rw_view5 (aa,bb) values (1,'yyy')
+ on conflict (aa) do update set bb = excluded.bb; -- OK
+reset session authorization;
+select * from base_tbl;
+ a | b | c
+---+-----+---
+ 1 | yyy | 1
+(1 row)
+
+drop view rw_view5;
+drop view rw_view4;
+drop view rw_view3;
+drop view rw_view2;
+drop view rw_view1;
+drop table base_tbl;
+drop user regress_view_user1;
+drop user regress_view_user2;
+-- Test single- and multi-row inserts with table and view defaults.
+-- Table defaults should be used, unless overridden by view defaults.
+create table base_tab_def (a int, b text default 'Table default',
+ c text default 'Table default', d text, e text);
+create view base_tab_def_view as select * from base_tab_def;
+alter view base_tab_def_view alter b set default 'View default';
+alter view base_tab_def_view alter d set default 'View default';
+insert into base_tab_def values (1);
+insert into base_tab_def values (2), (3);
+insert into base_tab_def values (4, default, default, default, default);
+insert into base_tab_def values (5, default, default, default, default),
+ (6, default, default, default, default);
+insert into base_tab_def_view values (11);
+insert into base_tab_def_view values (12), (13);
+insert into base_tab_def_view values (14, default, default, default, default);
+insert into base_tab_def_view values (15, default, default, default, default),
+ (16, default, default, default, default);
+insert into base_tab_def_view values (17), (default);
+select * from base_tab_def order by a;
+ a | b | c | d | e
+----+---------------+---------------+--------------+---
+ 1 | Table default | Table default | |
+ 2 | Table default | Table default | |
+ 3 | Table default | Table default | |
+ 4 | Table default | Table default | |
+ 5 | Table default | Table default | |
+ 6 | Table default | Table default | |
+ 11 | View default | Table default | View default |
+ 12 | View default | Table default | View default |
+ 13 | View default | Table default | View default |
+ 14 | View default | Table default | View default |
+ 15 | View default | Table default | View default |
+ 16 | View default | Table default | View default |
+ 17 | View default | Table default | View default |
+ | View default | Table default | View default |
+(14 rows)
+
+-- Adding an INSTEAD OF trigger should cause NULLs to be inserted instead of
+-- table defaults, where there are no view defaults.
+create function base_tab_def_view_instrig_func() returns trigger
+as
+$$
+begin
+ insert into base_tab_def values (new.a, new.b, new.c, new.d, new.e);
+ return new;
+end;
+$$
+language plpgsql;
+create trigger base_tab_def_view_instrig instead of insert on base_tab_def_view
+ for each row execute function base_tab_def_view_instrig_func();
+truncate base_tab_def;
+insert into base_tab_def values (1);
+insert into base_tab_def values (2), (3);
+insert into base_tab_def values (4, default, default, default, default);
+insert into base_tab_def values (5, default, default, default, default),
+ (6, default, default, default, default);
+insert into base_tab_def_view values (11);
+insert into base_tab_def_view values (12), (13);
+insert into base_tab_def_view values (14, default, default, default, default);
+insert into base_tab_def_view values (15, default, default, default, default),
+ (16, default, default, default, default);
+insert into base_tab_def_view values (17), (default);
+select * from base_tab_def order by a;
+ a | b | c | d | e
+----+---------------+---------------+--------------+---
+ 1 | Table default | Table default | |
+ 2 | Table default | Table default | |
+ 3 | Table default | Table default | |
+ 4 | Table default | Table default | |
+ 5 | Table default | Table default | |
+ 6 | Table default | Table default | |
+ 11 | View default | | View default |
+ 12 | View default | | View default |
+ 13 | View default | | View default |
+ 14 | View default | | View default |
+ 15 | View default | | View default |
+ 16 | View default | | View default |
+ 17 | View default | | View default |
+ | View default | | View default |
+(14 rows)
+
+-- Using an unconditional DO INSTEAD rule should also cause NULLs to be
+-- inserted where there are no view defaults.
+drop trigger base_tab_def_view_instrig on base_tab_def_view;
+drop function base_tab_def_view_instrig_func;
+create rule base_tab_def_view_ins_rule as on insert to base_tab_def_view
+ do instead insert into base_tab_def values (new.a, new.b, new.c, new.d, new.e);
+truncate base_tab_def;
+insert into base_tab_def values (1);
+insert into base_tab_def values (2), (3);
+insert into base_tab_def values (4, default, default, default, default);
+insert into base_tab_def values (5, default, default, default, default),
+ (6, default, default, default, default);
+insert into base_tab_def_view values (11);
+insert into base_tab_def_view values (12), (13);
+insert into base_tab_def_view values (14, default, default, default, default);
+insert into base_tab_def_view values (15, default, default, default, default),
+ (16, default, default, default, default);
+insert into base_tab_def_view values (17), (default);
+select * from base_tab_def order by a;
+ a | b | c | d | e
+----+---------------+---------------+--------------+---
+ 1 | Table default | Table default | |
+ 2 | Table default | Table default | |
+ 3 | Table default | Table default | |
+ 4 | Table default | Table default | |
+ 5 | Table default | Table default | |
+ 6 | Table default | Table default | |
+ 11 | View default | | View default |
+ 12 | View default | | View default |
+ 13 | View default | | View default |
+ 14 | View default | | View default |
+ 15 | View default | | View default |
+ 16 | View default | | View default |
+ 17 | View default | | View default |
+ | View default | | View default |
+(14 rows)
+
+-- A DO ALSO rule should cause each row to be inserted twice. The first
+-- insert should behave the same as an auto-updatable view (using table
+-- defaults, unless overridden by view defaults). The second insert should
+-- behave the same as a rule-updatable view (inserting NULLs where there are
+-- no view defaults).
+drop rule base_tab_def_view_ins_rule on base_tab_def_view;
+create rule base_tab_def_view_ins_rule as on insert to base_tab_def_view
+ do also insert into base_tab_def values (new.a, new.b, new.c, new.d, new.e);
+truncate base_tab_def;
+insert into base_tab_def values (1);
+insert into base_tab_def values (2), (3);
+insert into base_tab_def values (4, default, default, default, default);
+insert into base_tab_def values (5, default, default, default, default),
+ (6, default, default, default, default);
+insert into base_tab_def_view values (11);
+insert into base_tab_def_view values (12), (13);
+insert into base_tab_def_view values (14, default, default, default, default);
+insert into base_tab_def_view values (15, default, default, default, default),
+ (16, default, default, default, default);
+insert into base_tab_def_view values (17), (default);
+select * from base_tab_def order by a, c NULLS LAST;
+ a | b | c | d | e
+----+---------------+---------------+--------------+---
+ 1 | Table default | Table default | |
+ 2 | Table default | Table default | |
+ 3 | Table default | Table default | |
+ 4 | Table default | Table default | |
+ 5 | Table default | Table default | |
+ 6 | Table default | Table default | |
+ 11 | View default | Table default | View default |
+ 11 | View default | | View default |
+ 12 | View default | Table default | View default |
+ 12 | View default | | View default |
+ 13 | View default | Table default | View default |
+ 13 | View default | | View default |
+ 14 | View default | Table default | View default |
+ 14 | View default | | View default |
+ 15 | View default | Table default | View default |
+ 15 | View default | | View default |
+ 16 | View default | Table default | View default |
+ 16 | View default | | View default |
+ 17 | View default | Table default | View default |
+ 17 | View default | | View default |
+ | View default | Table default | View default |
+ | View default | | View default |
+(22 rows)
+
+drop view base_tab_def_view;
+drop table base_tab_def;
+-- Test defaults with array assignments
+create table base_tab (a serial, b int[], c text, d text default 'Table default');
+create view base_tab_view as select c, a, b from base_tab;
+alter view base_tab_view alter column c set default 'View default';
+insert into base_tab_view (b[1], b[2], c, b[5], b[4], a, b[3])
+values (1, 2, default, 5, 4, default, 3), (10, 11, 'C value', 14, 13, 100, 12);
+select * from base_tab order by a;
+ a | b | c | d
+-----+------------------+--------------+---------------
+ 1 | {1,2,3,4,5} | View default | Table default
+ 100 | {10,11,12,13,14} | C value | Table default
+(2 rows)
+
+drop view base_tab_view;
+drop table base_tab;
diff --git a/src/test/regress/expected/update_1.out b/src/test/regress/expected/update_1.out
new file mode 100644
index 0000000000..5847d8c3a7
--- /dev/null
+++ b/src/test/regress/expected/update_1.out
@@ -0,0 +1,935 @@
+--
+-- UPDATE syntax tests
+--
+CREATE TABLE update_test (
+ a INT DEFAULT 10,
+ b INT,
+ c TEXT
+);
+CREATE TABLE upsert_test (
+ a INT PRIMARY KEY,
+ b TEXT
+);
+INSERT INTO update_test VALUES (5, 10, 'foo');
+INSERT INTO update_test(b, a) VALUES (15, 10);
+SELECT * FROM update_test;
+ a | b | c
+----+----+-----
+ 5 | 10 | foo
+ 10 | 15 |
+(2 rows)
+
+UPDATE update_test SET a = DEFAULT, b = DEFAULT;
+SELECT * FROM update_test;
+ a | b | c
+----+---+-----
+ 10 | | foo
+ 10 | |
+(2 rows)
+
+-- aliases for the UPDATE target table
+UPDATE update_test AS t SET b = 10 WHERE t.a = 10;
+SELECT * FROM update_test;
+ a | b | c
+----+----+-----
+ 10 | 10 | foo
+ 10 | 10 |
+(2 rows)
+
+UPDATE update_test t SET b = t.b + 10 WHERE t.a = 10;
+SELECT * FROM update_test;
+ a | b | c
+----+----+-----
+ 10 | 20 | foo
+ 10 | 20 |
+(2 rows)
+
+--
+-- Test VALUES in FROM
+--
+UPDATE update_test SET a=v.i FROM (VALUES(100, 20)) AS v(i, j)
+ WHERE update_test.b = v.j;
+SELECT * FROM update_test;
+ a | b | c
+-----+----+-----
+ 100 | 20 | foo
+ 100 | 20 |
+(2 rows)
+
+-- fail, wrong data type:
+UPDATE update_test SET a = v.* FROM (VALUES(100, 20)) AS v(i, j)
+ WHERE update_test.b = v.j;
+ERROR: column "a" is of type integer but expression is of type record
+LINE 1: UPDATE update_test SET a = v.* FROM (VALUES(100, 20)) AS v(i...
+ ^
+HINT: You will need to rewrite or cast the expression.
+--
+-- Test multiple-set-clause syntax
+--
+INSERT INTO update_test SELECT a,b+1,c FROM update_test;
+SELECT * FROM update_test;
+ a | b | c
+-----+----+-----
+ 100 | 20 | foo
+ 100 | 20 |
+ 100 | 21 | foo
+ 100 | 21 |
+(4 rows)
+
+UPDATE update_test SET (c,b,a) = ('bugle', b+11, DEFAULT) WHERE c = 'foo';
+SELECT * FROM update_test;
+ a | b | c
+-----+----+-------
+ 100 | 20 |
+ 100 | 21 |
+ 10 | 31 | bugle
+ 10 | 32 | bugle
+(4 rows)
+
+UPDATE update_test SET (c,b) = ('car', a+b), a = a + 1 WHERE a = 10;
+SELECT * FROM update_test;
+ a | b | c
+-----+----+-----
+ 100 | 20 |
+ 100 | 21 |
+ 11 | 41 | car
+ 11 | 42 | car
+(4 rows)
+
+-- fail, multi assignment to same column:
+UPDATE update_test SET (c,b) = ('car', a+b), b = a + 1 WHERE a = 10;
+ERROR: multiple assignments to same column "b"
+-- uncorrelated sub-select:
+UPDATE update_test
+ SET (b,a) = (select a,b from update_test where b = 41 and c = 'car')
+ WHERE a = 100 AND b = 20;
+SELECT * FROM update_test;
+ a | b | c
+-----+----+-----
+ 100 | 21 |
+ 11 | 41 | car
+ 11 | 42 | car
+ 41 | 11 |
+(4 rows)
+
+-- correlated sub-select:
+UPDATE update_test o
+ SET (b,a) = (select a+1,b from update_test i
+ where i.a=o.a and i.b=o.b and i.c is not distinct from o.c);
+SELECT * FROM update_test;
+ a | b | c
+----+-----+-----
+ 21 | 101 |
+ 41 | 12 | car
+ 42 | 12 | car
+ 11 | 42 |
+(4 rows)
+
+-- fail, multiple rows supplied:
+UPDATE update_test SET (b,a) = (select a+1,b from update_test);
+ERROR: more than one row returned by a subquery used as an expression
+-- set to null if no rows supplied:
+UPDATE update_test SET (b,a) = (select a+1,b from update_test where a = 1000)
+ WHERE a = 11;
+SELECT * FROM update_test;
+ a | b | c
+----+-----+-----
+ 21 | 101 |
+ 41 | 12 | car
+ 42 | 12 | car
+ | |
+(4 rows)
+
+-- *-expansion should work in this context:
+UPDATE update_test SET (a,b) = ROW(v.*) FROM (VALUES(21, 100)) AS v(i, j)
+ WHERE update_test.a = v.i;
+-- you might expect this to work, but syntactically it's not a RowExpr:
+UPDATE update_test SET (a,b) = (v.*) FROM (VALUES(21, 101)) AS v(i, j)
+ WHERE update_test.a = v.i;
+ERROR: source for a multiple-column UPDATE item must be a sub-SELECT or ROW() expression
+LINE 1: UPDATE update_test SET (a,b) = (v.*) FROM (VALUES(21, 101)) ...
+ ^
+-- if an alias for the target table is specified, don't allow references
+-- to the original table name
+UPDATE update_test AS t SET b = update_test.b + 10 WHERE t.a = 10;
+ERROR: invalid reference to FROM-clause entry for table "update_test"
+LINE 1: UPDATE update_test AS t SET b = update_test.b + 10 WHERE t.a...
+ ^
+HINT: Perhaps you meant to reference the table alias "t".
+-- Make sure that we can update to a TOASTed value.
+UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
+SELECT a, b, char_length(c) FROM update_test;
+ a | b | char_length
+----+-----+-------------
+ | |
+ 21 | 100 |
+ 41 | 12 | 10000
+ 42 | 12 | 10000
+(4 rows)
+
+-- Check multi-assignment with a Result node to handle a one-time filter.
+EXPLAIN (VERBOSE, COSTS OFF)
+UPDATE update_test t
+ SET (a, b) = (SELECT b, a FROM update_test s WHERE s.a = t.a)
+ WHERE CURRENT_USER = SESSION_USER;
+ QUERY PLAN
+------------------------------------------------------------------
+ Update on public.update_test t
+ -> Result
+ Output: $1, $2, t.c, (SubPlan 1 (returns $1,$2)), t.ctid
+ One-Time Filter: (CURRENT_USER = SESSION_USER)
+ -> Seq Scan on public.update_test t
+ Output: t.c, t.a, t.ctid
+ SubPlan 1 (returns $1,$2)
+ -> Seq Scan on public.update_test s
+ Output: s.b, s.a
+ Filter: (s.a = t.a)
+(10 rows)
+
+UPDATE update_test t
+ SET (a, b) = (SELECT b, a FROM update_test s WHERE s.a = t.a)
+ WHERE CURRENT_USER = SESSION_USER;
+SELECT a, b, char_length(c) FROM update_test;
+ a | b | char_length
+-----+----+-------------
+ | |
+ 100 | 21 |
+ 12 | 41 | 10000
+ 12 | 42 | 10000
+(4 rows)
+
+-- Test ON CONFLICT DO UPDATE
+INSERT INTO upsert_test VALUES(1, 'Boo');
+-- uncorrelated sub-select:
+WITH aaa AS (SELECT 1 AS a, 'Foo' AS b) INSERT INTO upsert_test
+ VALUES (1, 'Bar') ON CONFLICT(a)
+ DO UPDATE SET (b, a) = (SELECT b, a FROM aaa) RETURNING *;
+ a | b
+---+-----
+ 1 | Foo
+(1 row)
+
+-- correlated sub-select:
+INSERT INTO upsert_test VALUES (1, 'Baz') ON CONFLICT(a)
+ DO UPDATE SET (b, a) = (SELECT b || ', Correlated', a from upsert_test i WHERE i.a = upsert_test.a)
+ RETURNING *;
+ a | b
+---+-----------------
+ 1 | Foo, Correlated
+(1 row)
+
+-- correlated sub-select (EXCLUDED.* alias):
+INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
+ DO UPDATE SET (b, a) = (SELECT b || ', Excluded', a from upsert_test i WHERE i.a = excluded.a)
+ RETURNING *;
+ a | b
+---+---------------------------
+ 1 | Foo, Correlated, Excluded
+(1 row)
+
+-- ON CONFLICT using system attributes in RETURNING, testing both the
+-- inserting and updating paths. See bug report at:
+-- https://www.postgresql.org/message-id/73436355-6432-49B1-92ED-1FE4F7E7E100%40finefun.com.au
+INSERT INTO upsert_test VALUES (2, 'Beeble') ON CONFLICT(a)
+ DO UPDATE SET (b, a) = (SELECT b || ', Excluded', a from upsert_test i WHERE i.a = excluded.a)
+ RETURNING tableoid::regclass, xmin = pg_current_xact_id()::xid AS xmin_correct, xmax = 0 AS xmax_correct;
+ERROR: zedstore tuple table slot does not have system attributes (except xmin and cmin)
+-- currently xmax is set after a conflict - that's probably not good,
+-- but it seems worthwhile to have to be explicit if that changes.
+INSERT INTO upsert_test VALUES (2, 'Brox') ON CONFLICT(a)
+ DO UPDATE SET (b, a) = (SELECT b || ', Excluded', a from upsert_test i WHERE i.a = excluded.a)
+ RETURNING tableoid::regclass, xmin = pg_current_xact_id()::xid AS xmin_correct, xmax = pg_current_xact_id()::xid AS xmax_correct;
+ERROR: zedstore tuple table slot does not have system attributes (except xmin and cmin)
+DROP TABLE update_test;
+DROP TABLE upsert_test;
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- When a partitioned table receives an UPDATE to the partitioned key and the
+-- new values no longer meet the partition's bound, the row must be moved to
+-- the correct partition for the new partition key (if one exists). We must
+-- also ensure that updatable views on partitioned tables properly enforce any
+-- WITH CHECK OPTION that is defined. The situation with triggers in this case
+-- also requires thorough testing as partition key updates causing row
+-- movement convert UPDATEs into DELETE+INSERT.
+CREATE TABLE range_parted (
+ a text,
+ b bigint,
+ c numeric,
+ d int,
+ e varchar
+) PARTITION BY RANGE (a, b);
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+CREATE TABLE part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+ALTER TABLE range_parted ATTACH PARTITION part_b_20_b_30 FOR VALUES FROM ('b', 20) TO ('b', 30);
+CREATE TABLE part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY RANGE (c);
+CREATE TABLE part_b_1_b_10 PARTITION OF range_parted FOR VALUES FROM ('b', 1) TO ('b', 10);
+ALTER TABLE range_parted ATTACH PARTITION part_b_10_b_20 FOR VALUES FROM ('b', 10) TO ('b', 20);
+CREATE TABLE part_a_10_a_20 PARTITION OF range_parted FOR VALUES FROM ('a', 10) TO ('a', 20);
+CREATE TABLE part_a_1_a_10 PARTITION OF range_parted FOR VALUES FROM ('a', 1) TO ('a', 10);
+-- Check that partition-key UPDATE works sanely on a partitioned table that
+-- does not have any child partitions.
+UPDATE part_b_10_b_20 set b = b - 6;
+-- Create some more partitions following the above pattern of descending bound
+-- order, but let's make the situation a bit more complex by having the
+-- attribute numbers of the columns vary from their parent partition.
+CREATE TABLE part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY range (abs(d));
+ALTER TABLE part_c_100_200 DROP COLUMN e, DROP COLUMN c, DROP COLUMN a;
+ALTER TABLE part_c_100_200 ADD COLUMN c numeric, ADD COLUMN e varchar, ADD COLUMN a text;
+ALTER TABLE part_c_100_200 DROP COLUMN b;
+ALTER TABLE part_c_100_200 ADD COLUMN b bigint;
+CREATE TABLE part_d_1_15 PARTITION OF part_c_100_200 FOR VALUES FROM (1) TO (15);
+CREATE TABLE part_d_15_20 PARTITION OF part_c_100_200 FOR VALUES FROM (15) TO (20);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_100_200 FOR VALUES FROM (100) TO (200);
+CREATE TABLE part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_1_100 FOR VALUES FROM (1) TO (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted VALUES (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted ORDER BY 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 12 | 96 | 1 |
+ part_c_1_100 | b | 13 | 97 | 2 |
+ part_d_15_20 | b | 15 | 105 | 16 |
+ part_d_15_20 | b | 17 | 105 | 19 |
+(6 rows)
+
+-- The order of subplans should be in bound order
+EXPLAIN (costs off) UPDATE range_parted set c = c - 50 WHERE c > 97;
+ QUERY PLAN
+-------------------------------------------------
+ Update on range_parted
+ Update on part_a_1_a_10 range_parted_1
+ Update on part_a_10_a_20 range_parted_2
+ Update on part_b_1_b_10 range_parted_3
+ Update on part_c_1_100 range_parted_4
+ Update on part_d_1_15 range_parted_5
+ Update on part_d_15_20 range_parted_6
+ Update on part_b_20_b_30 range_parted_7
+ -> Seq Scan on part_a_1_a_10 range_parted_1
+ Filter: (c > '97'::numeric)
+ -> Seq Scan on part_a_10_a_20 range_parted_2
+ Filter: (c > '97'::numeric)
+ -> Seq Scan on part_b_1_b_10 range_parted_3
+ Filter: (c > '97'::numeric)
+ -> Seq Scan on part_c_1_100 range_parted_4
+ Filter: (c > '97'::numeric)
+ -> Seq Scan on part_d_1_15 range_parted_5
+ Filter: (c > '97'::numeric)
+ -> Seq Scan on part_d_15_20 range_parted_6
+ Filter: (c > '97'::numeric)
+ -> Seq Scan on part_b_20_b_30 range_parted_7
+ Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_c_100_200 set c = c - 20, d = c WHERE c = 105;
+ERROR: new row for relation "part_c_100_200" violates partition constraint
+DETAIL: Failing row contains (105, 85, null, b, 15).
+-- fail, no partition key update, so no attempt to move tuple,
+-- but "a = 'a'" violates partition constraint enforced by root partition)
+UPDATE part_b_10_b_20 set a = 'a';
+ERROR: new row for relation "part_c_1_100" violates partition constraint
+DETAIL: Failing row contains (null, 1, 96, 12, a).
+-- ok, partition key update, no constraint violation
+UPDATE range_parted set d = d - 10 WHERE d > 10;
+-- ok, no partition key update, no constraint violation
+UPDATE range_parted set e = d;
+-- No row found
+UPDATE part_c_1_100 set c = c + 20 WHERE c = 98;
+-- ok, row movement
+UPDATE part_b_10_b_20 set c = c + 20 returning c, b, a;
+ c | b | a
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10 | a | 1 | 1 | 1 | 1
+ part_d_1_15 | b | 12 | 116 | 1 | 1
+ part_d_1_15 | b | 13 | 117 | 2 | 2
+ part_d_1_15 | b | 15 | 125 | 6 | 6
+ part_d_1_15 | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_b_10_b_20 set b = b - 6 WHERE c > 116 returning *;
+ERROR: new row for relation "part_d_1_15" violates partition constraint
+DETAIL: Failing row contains (2, 117, 2, b, 7).
+-- ok, row movement, with subset of rows moved into different partition.
+UPDATE range_parted set b = b - 6 WHERE c > 116 returning a, b + c;
+ a | ?column?
+---+----------
+ a | 204
+ b | 124
+ b | 134
+ b | 136
+(4 rows)
+
+:show_data;
+ partname | a | b | c | d | e
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a | 1 | 1 | 1 | 1
+ part_a_1_a_10 | a | 4 | 200 | 1 | 1
+ part_b_1_b_10 | b | 7 | 117 | 2 | 2
+ part_b_1_b_10 | b | 9 | 125 | 6 | 6
+ part_d_1_15 | b | 11 | 125 | 9 | 9
+ part_d_1_15 | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- Common table needed for multiple test scenarios.
+CREATE TABLE mintab(c1 int);
+INSERT into mintab VALUES (120);
+-- update partition key using updatable view.
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 FROM mintab) WITH CHECK OPTION;
+-- ok
+UPDATE upview set c = 199 WHERE b = 4;
+-- fail, check option violation
+UPDATE upview set c = 120 WHERE b = 4;
+ERROR: new row violates check option for view "upview"
+DETAIL: Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+UPDATE upview set a = 'b', b = 15, c = 120 WHERE b = 4;
+ERROR: new row violates check option for view "upview"
+DETAIL: Failing row contains (b, 15, 120, 1, 1).
+-- ok, row movement, check option passes
+UPDATE upview set a = 'b', b = 15 WHERE b = 4;
+:show_data;
+ partname | a | b | c | d | e
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a | 1 | 1 | 1 | 1
+ part_b_1_b_10 | b | 7 | 117 | 2 | 2
+ part_b_1_b_10 | b | 9 | 125 | 6 | 6
+ part_d_1_15 | b | 11 | 125 | 9 | 9
+ part_d_1_15 | b | 12 | 116 | 1 | 1
+ part_d_1_15 | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+DROP VIEW upview;
+-- RETURNING having whole-row vars.
+:init_range_parted;
+UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning (range_parted), *;
+ range_parted | a | b | c | d | e
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 |
+ (b,17,95,19,) | b | 17 | 95 | 19 |
+(2 rows)
+
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 12 | 96 | 1 |
+ part_c_1_100 | b | 13 | 97 | 2 |
+ part_c_1_100 | b | 15 | 95 | 16 |
+ part_c_1_100 | b | 17 | 95 | 19 |
+(6 rows)
+
+-- Transition tables with update row movement
+:init_range_parted;
+CREATE FUNCTION trans_updatetrigfunc() RETURNS trigger LANGUAGE plpgsql AS
+$$
+ begin
+ raise notice 'trigger = %, old table = %, new table = %',
+ TG_NAME,
+ (select string_agg(old_table::text, ', ' ORDER BY a) FROM old_table),
+ (select string_agg(new_table::text, ', ' ORDER BY a) FROM new_table);
+ return null;
+ end;
+$$;
+CREATE TRIGGER trans_updatetrig
+ AFTER UPDATE ON range_parted REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
+ FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end ) WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE: trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 13 | 98 | 2 |
+ part_d_15_20 | b | 15 | 106 | 16 |
+ part_d_15_20 | b | 17 | 106 | 19 |
+ part_d_1_15 | b | 12 | 110 | 1 |
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+CREATE TRIGGER trans_deletetrig
+ AFTER DELETE ON range_parted REFERENCING OLD TABLE AS old_table
+ FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+CREATE TRIGGER trans_inserttrig
+ AFTER INSERT ON range_parted REFERENCING NEW TABLE AS new_table
+ FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE: trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_d_15_20 | b | 15 | 155 | 16 |
+ part_d_15_20 | b | 17 | 155 | 19 |
+ part_d_1_15 | b | 12 | 146 | 1 |
+ part_d_1_15 | b | 13 | 147 | 2 |
+(6 rows)
+
+DROP TRIGGER trans_deletetrig ON range_parted;
+DROP TRIGGER trans_inserttrig ON range_parted;
+-- Don't drop trans_updatetrig yet. It is required below.
+-- Test with transition tuple conversion happening for rows moved into the
+-- new partition. This requires a trigger that references transition table
+-- (we already have trans_updatetrig). For inserted rows, the conversion
+-- is not usually needed, because the original tuple is already compatible with
+-- the desired transition tuple format. But conversion happens when there is a
+-- BR trigger because the trigger can change the inserted row. So install a
+-- BR triggers on those child partitions where the rows will be moved.
+CREATE FUNCTION func_parted_mod_b() RETURNS trigger AS $$
+BEGIN
+ NEW.b = NEW.b + 1;
+ return NEW;
+END $$ language plpgsql;
+CREATE TRIGGER trig_c1_100 BEFORE UPDATE OR INSERT ON part_c_1_100
+ FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d1_15 BEFORE UPDATE OR INSERT ON part_d_1_15
+ FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d15_20 BEFORE UPDATE OR INSERT ON part_d_15_20
+ FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+:init_range_parted;
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end) WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE: trigger = trans_updatetrig, old table = (b,13,96,1,), (b,14,97,2,), (b,16,105,16,), (b,18,105,19,), new table = (b,15,110,1,), (b,15,98,2,), (b,17,106,16,), (b,19,106,19,)
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 15 | 98 | 2 |
+ part_d_15_20 | b | 17 | 106 | 16 |
+ part_d_15_20 | b | 19 | 106 | 19 |
+ part_d_1_15 | b | 15 | 110 | 1 |
+(6 rows)
+
+:init_range_parted;
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE: trigger = trans_updatetrig, old table = (b,13,96,1,), (b,14,97,2,), (b,16,105,16,), (b,18,105,19,), new table = (b,15,146,1,), (b,16,147,2,), (b,17,155,16,), (b,19,155,19,)
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_d_15_20 | b | 17 | 155 | 16 |
+ part_d_15_20 | b | 19 | 155 | 19 |
+ part_d_1_15 | b | 15 | 146 | 1 |
+ part_d_1_15 | b | 16 | 147 | 2 |
+(6 rows)
+
+-- Case where per-partition tuple conversion map array is allocated, but the
+-- map is not required for the particular tuple that is routed, thanks to
+-- matching table attributes of the partition and the target table.
+:init_range_parted;
+UPDATE range_parted set b = 15 WHERE b = 1;
+NOTICE: trigger = trans_updatetrig, old table = (a,1,1,1,), new table = (a,15,1,1,)
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_10_a_20 | a | 15 | 1 | 1 |
+ part_c_1_100 | b | 13 | 96 | 1 |
+ part_c_1_100 | b | 14 | 97 | 2 |
+ part_d_15_20 | b | 16 | 105 | 16 |
+ part_d_15_20 | b | 18 | 105 | 19 |
+(6 rows)
+
+DROP TRIGGER trans_updatetrig ON range_parted;
+DROP TRIGGER trig_c1_100 ON part_c_1_100;
+DROP TRIGGER trig_d1_15 ON part_d_1_15;
+DROP TRIGGER trig_d15_20 ON part_d_15_20;
+DROP FUNCTION func_parted_mod_b();
+-- RLS policies with update-row-movement
+-----------------------------------------
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+CREATE USER regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+CREATE POLICY seeall ON range_parted AS PERMISSIVE FOR SELECT USING (true);
+CREATE POLICY policy_range_parted ON range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+ERROR: new row violates row-level security policy for table "range_parted"
+RESET SESSION AUTHORIZATION;
+-- Create a trigger on part_d_1_15
+CREATE FUNCTION func_d_1_15() RETURNS trigger AS $$
+BEGIN
+ NEW.c = NEW.c + 1; -- Make even numbers odd, or vice versa
+ return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_d_1_15 BEFORE INSERT ON part_d_1_15
+ FOR EACH ROW EXECUTE PROCEDURE func_d_1_15();
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15. Even though the UPDATE is setting 'c' to an odd number, the
+-- trigger at the destination partition again makes it an even number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error. Even though the UPDATE is setting
+-- 'c' to an even number, the trigger at the destination partition again makes
+-- it an odd number.
+UPDATE range_parted set a = 'b', c = 150 WHERE a = 'a' and c = 200;
+ERROR: new row violates row-level security policy for table "range_parted"
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP TRIGGER trig_d_1_15 ON part_d_1_15;
+DROP FUNCTION func_d_1_15();
+-- Policy expression contains SubPlan
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_subplan on range_parted
+ AS RESTRICTIVE for UPDATE USING (true)
+ WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab));
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, mintab has row with c1 = 120
+UPDATE range_parted set a = 'b', c = 122 WHERE a = 'a' and c = 200;
+ERROR: new row violates row-level security policy "policy_range_parted_subplan" for table "range_parted"
+-- ok
+UPDATE range_parted set a = 'b', c = 120 WHERE a = 'a' and c = 200;
+-- RLS policy expression contains whole row.
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+ WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- ok, should pass the RLS check
+UPDATE range_parted set a = 'b', c = 112 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, the whole row RLS check should fail
+UPDATE range_parted set a = 'b', c = 116 WHERE a = 'a' and c = 200;
+ERROR: new row violates row-level security policy "policy_range_parted_wholerow" for table "range_parted"
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP POLICY policy_range_parted ON range_parted;
+DROP POLICY policy_range_parted_subplan ON range_parted;
+DROP POLICY policy_range_parted_wholerow ON range_parted;
+REVOKE ALL ON range_parted, mintab FROM regress_range_parted_user;
+DROP USER regress_range_parted_user;
+DROP TABLE mintab;
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+CREATE FUNCTION trigfunc() returns trigger language plpgsql as
+$$
+ begin
+ raise notice 'trigger = % fired on table % during %',
+ TG_NAME, TG_TABLE_NAME, TG_OP;
+ return null;
+ end;
+$$;
+-- Triggers on root partition
+CREATE TRIGGER parent_delete_trig
+ AFTER DELETE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_update_trig
+ AFTER UPDATE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_insert_trig
+ AFTER INSERT ON range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+CREATE TRIGGER c1_delete_trig
+ AFTER DELETE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_update_trig
+ AFTER UPDATE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_insert_trig
+ AFTER INSERT ON part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+CREATE TRIGGER d1_delete_trig
+ AFTER DELETE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_update_trig
+ AFTER UPDATE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_insert_trig
+ AFTER INSERT ON part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+CREATE TRIGGER d15_delete_trig
+ AFTER DELETE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_update_trig
+ AFTER UPDATE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_insert_trig
+ AFTER INSERT ON part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or
+-- insert statement triggers should be fired.
+UPDATE range_parted set c = c - 50 WHERE c > 97;
+NOTICE: trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 12 | 96 | 1 |
+ part_c_1_100 | b | 13 | 97 | 2 |
+ part_c_1_100 | b | 15 | 55 | 16 |
+ part_c_1_100 | b | 17 | 55 | 19 |
+(6 rows)
+
+DROP TRIGGER parent_delete_trig ON range_parted;
+DROP TRIGGER parent_update_trig ON range_parted;
+DROP TRIGGER parent_insert_trig ON range_parted;
+DROP TRIGGER c1_delete_trig ON part_c_1_100;
+DROP TRIGGER c1_update_trig ON part_c_1_100;
+DROP TRIGGER c1_insert_trig ON part_c_1_100;
+DROP TRIGGER d1_delete_trig ON part_d_1_15;
+DROP TRIGGER d1_update_trig ON part_d_1_15;
+DROP TRIGGER d1_insert_trig ON part_d_1_15;
+DROP TRIGGER d15_delete_trig ON part_d_15_20;
+DROP TRIGGER d15_update_trig ON part_d_15_20;
+DROP TRIGGER d15_insert_trig ON part_d_15_20;
+-- Creating default partition for range
+:init_range_parted;
+create table part_def partition of range_parted default;
+\d+ part_def
+ Table "public.part_def"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a | text | | | | extended | |
+ b | bigint | | | | plain | |
+ c | numeric | | | | main | |
+ d | integer | | | | plain | |
+ e | character varying | | | | extended | |
+Partition of: range_parted DEFAULT
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
+
+insert into range_parted values ('c', 9);
+-- ok
+update part_def set a = 'd' where a = 'c';
+-- fail
+update part_def set a = 'a' where a = 'd';
+ERROR: new row for relation "part_def" violates partition constraint
+DETAIL: Failing row contains (a, 9, null, null, null).
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 12 | 96 | 1 |
+ part_c_1_100 | b | 13 | 97 | 2 |
+ part_d_15_20 | b | 15 | 105 | 16 |
+ part_d_15_20 | b | 17 | 105 | 19 |
+ part_def | d | 9 | | |
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- fail, default partition is not under part_a_10_a_20;
+UPDATE part_a_10_a_20 set a = 'ad' WHERE a = 'a';
+ERROR: new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL: Failing row contains (ad, 10, 200, 1, null).
+-- ok
+UPDATE range_parted set a = 'ad' WHERE a = 'a';
+UPDATE range_parted set a = 'bd' WHERE a = 'b';
+:show_data;
+ partname | a | b | c | d | e
+----------+----+----+-----+----+---
+ part_def | ad | 1 | 1 | 1 |
+ part_def | ad | 10 | 200 | 1 |
+ part_def | bd | 12 | 96 | 1 |
+ part_def | bd | 13 | 97 | 2 |
+ part_def | bd | 15 | 105 | 16 |
+ part_def | bd | 17 | 105 | 19 |
+ part_def | d | 9 | | |
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- ok
+UPDATE range_parted set a = 'a' WHERE a = 'ad';
+UPDATE range_parted set a = 'b' WHERE a = 'bd';
+:show_data;
+ partname | a | b | c | d | e
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 | 1 |
+ part_a_1_a_10 | a | 1 | 1 | 1 |
+ part_c_1_100 | b | 12 | 96 | 1 |
+ part_c_1_100 | b | 13 | 97 | 2 |
+ part_d_15_20 | b | 15 | 105 | 16 |
+ part_d_15_20 | b | 17 | 105 | 19 |
+ part_def | d | 9 | | |
+(7 rows)
+
+-- Cleanup: range_parted no longer needed.
+DROP TABLE range_parted;
+CREATE TABLE list_parted (
+ a text,
+ b int
+) PARTITION BY list (a);
+CREATE TABLE list_part1 PARTITION OF list_parted for VALUES in ('a', 'b');
+CREATE TABLE list_default PARTITION OF list_parted default;
+INSERT into list_part1 VALUES ('a', 1);
+INSERT into list_default VALUES ('d', 10);
+-- fail
+UPDATE list_default set a = 'a' WHERE a = 'd';
+ERROR: new row for relation "list_default" violates partition constraint
+DETAIL: Failing row contains (a, 10).
+-- ok
+UPDATE list_default set a = 'x' WHERE a = 'd';
+DROP TABLE list_parted;
+--------------
+-- Some more update-partition-key test scenarios below. This time use list
+-- partitions.
+--------------
+-- Setup for list partitions
+CREATE TABLE list_parted (a numeric, b int, c int8) PARTITION BY list (a);
+CREATE TABLE sub_parted PARTITION OF list_parted for VALUES in (1) PARTITION BY list (b);
+CREATE TABLE sub_part1(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part1 for VALUES in (1);
+CREATE TABLE sub_part2(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part2 for VALUES in (2);
+CREATE TABLE list_part1(a numeric, b int, c int8);
+ALTER TABLE list_parted ATTACH PARTITION list_part1 for VALUES in (2,3);
+INSERT into list_parted VALUES (2,5,50);
+INSERT into list_parted VALUES (3,6,60);
+INSERT into sub_parted VALUES (1,1,60);
+INSERT into sub_parted VALUES (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+UPDATE sub_parted set a = 2 WHERE c = 10;
+ERROR: new row for relation "sub_part2" violates partition constraint
+DETAIL: Failing row contains (2, 10, 2).
+-- Test update-partition-key, where the unpruned partitions do not have their
+-- partition keys updated.
+SELECT tableoid::regclass::text, * FROM list_parted WHERE a = 2 ORDER BY 1;
+ tableoid | a | b | c
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+UPDATE list_parted set b = c + a WHERE a = 2;
+SELECT tableoid::regclass::text, * FROM list_parted WHERE a = 2 ORDER BY 1;
+ tableoid | a | b | c
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-- Test the case where BR UPDATE triggers change the partition key.
+CREATE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+ NEW.b = 2; -- This is changing partition key column.
+ return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER parted_mod_b before update on sub_part1
+ for each row execute procedure func_parted_mod_b();
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+ tableoid | a | b | c
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 | 6 | 60
+ sub_part1 | 1 | 1 | 60
+ sub_part2 | 1 | 2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+UPDATE list_parted set c = 70 WHERE b = 1;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+ tableoid | a | b | c
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 | 6 | 60
+ sub_part2 | 1 | 2 | 10
+ sub_part2 | 1 | 2 | 70
+(4 rows)
+
+DROP TRIGGER parted_mod_b ON sub_part1;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+CREATE OR REPLACE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+ raise notice 'Trigger: Got OLD row %, but returning NULL', OLD;
+ return NULL;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_skip_delete before delete on sub_part2
+ for each row execute procedure func_parted_mod_b();
+UPDATE list_parted set b = 1 WHERE c = 70;
+NOTICE: Trigger: Got OLD row (2,70,1), but returning NULL
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+ tableoid | a | b | c
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 | 6 | 60
+ sub_part2 | 1 | 2 | 10
+ sub_part2 | 1 | 2 | 70
+(4 rows)
+
+-- Drop the trigger. Now the row should be moved.
+DROP TRIGGER trig_skip_delete ON sub_part2;
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+ tableoid | a | b | c
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 | 6 | 60
+ sub_part1 | 1 | 1 | 70
+ sub_part2 | 1 | 2 | 10
+(4 rows)
+
+DROP FUNCTION func_parted_mod_b();
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only
+-- once. There should not be any rows inserted.
+CREATE TABLE non_parted (id int);
+INSERT into non_parted VALUES (1), (1), (1), (2), (2), (2), (3), (3), (3);
+UPDATE list_parted t1 set a = 2 FROM non_parted t2 WHERE t1.a = t2.id and a = 1;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+ tableoid | a | b | c
+------------+---+----+----
+ list_part1 | 2 | 1 | 70
+ list_part1 | 2 | 2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 | 6 | 60
+(4 rows)
+
+DROP TABLE non_parted;
+-- Cleanup: list_parted no longer needed.
+DROP TABLE list_parted;
+-- create custom operator class and hash function, for the same reason
+-- explained in alter_table.sql
+create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
+$$ begin return (a + seed); end; $$ language 'plpgsql' immutable;
+create operator class custom_opclass for type int4 using hash as
+operator 1 = , function 2 dummy_hashint4(int4, int8);
+create table hash_parted (
+ a int,
+ b int
+) partition by hash (a custom_opclass, b custom_opclass);
+create table hpart1 partition of hash_parted for values with (modulus 2, remainder 1);
+create table hpart2 partition of hash_parted for values with (modulus 4, remainder 2);
+create table hpart3 partition of hash_parted for values with (modulus 8, remainder 0);
+create table hpart4 partition of hash_parted for values with (modulus 8, remainder 4);
+insert into hpart1 values (1, 1);
+insert into hpart2 values (2, 5);
+insert into hpart4 values (3, 4);
+-- fail
+update hpart1 set a = 3, b=4 where a = 1;
+ERROR: new row for relation "hpart1" violates partition constraint
+DETAIL: Failing row contains (3, 4).
+-- ok, row movement
+update hash_parted set b = b - 1 where b = 1;
+-- ok
+update hash_parted set b = b + 8 where b = 1;
+-- cleanup
+drop table hash_parted;
+drop operator class custom_opclass using hash;
+drop function dummy_hashint4(a int4, seed int8);
diff --git a/src/test/regress/expected/vacuum_1.out b/src/test/regress/expected/vacuum_1.out
new file mode 100644
index 0000000000..baea1b0532
--- /dev/null
+++ b/src/test/regress/expected/vacuum_1.out
@@ -0,0 +1,384 @@
+--
+-- VACUUM
+--
+CREATE TABLE vactst (i INT);
+INSERT INTO vactst VALUES (1);
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst VALUES (0);
+SELECT count(*) FROM vactst;
+ count
+-------
+ 2049
+(1 row)
+
+DELETE FROM vactst WHERE i != 0;
+SELECT * FROM vactst;
+ i
+---
+ 0
+(1 row)
+
+VACUUM FULL vactst;
+UPDATE vactst SET i = i + 1;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst SELECT * FROM vactst;
+INSERT INTO vactst VALUES (0);
+SELECT count(*) FROM vactst;
+ count
+-------
+ 2049
+(1 row)
+
+DELETE FROM vactst WHERE i != 0;
+VACUUM (FULL) vactst;
+DELETE FROM vactst;
+SELECT * FROM vactst;
+ i
+---
+(0 rows)
+
+VACUUM (FULL, FREEZE) vactst;
+VACUUM (ANALYZE, FULL) vactst;
+CREATE TABLE vaccluster (i INT PRIMARY KEY);
+ALTER TABLE vaccluster CLUSTER ON vaccluster_pkey;
+CLUSTER vaccluster;
+CREATE FUNCTION do_analyze() RETURNS VOID VOLATILE LANGUAGE SQL
+ AS 'ANALYZE pg_am';
+CREATE FUNCTION wrap_do_analyze(c INT) RETURNS INT IMMUTABLE LANGUAGE SQL
+ AS 'SELECT $1 FROM do_analyze()';
+CREATE INDEX ON vaccluster(wrap_do_analyze(i));
+INSERT INTO vaccluster VALUES (1), (2);
+ANALYZE vaccluster;
+ERROR: ANALYZE cannot be executed from VACUUM or ANALYZE
+CONTEXT: SQL function "do_analyze" statement 1
+SQL function "wrap_do_analyze" statement 1
+-- Test ANALYZE in transaction, where the transaction surrounding
+-- analyze performed modifications. This tests for the bug at
+-- https://postgr.es/m/c7988239-d42c-ddc4-41db-171b23b35e4f%40ssinger.info
+-- (which hopefully is unlikely to be reintroduced), but also seems
+-- independently worthwhile to cover.
+INSERT INTO vactst SELECT generate_series(1, 300);
+DELETE FROM vactst WHERE i % 7 = 0; -- delete a few rows outside
+BEGIN;
+INSERT INTO vactst SELECT generate_series(301, 400);
+DELETE FROM vactst WHERE i % 5 <> 0; -- delete a few rows inside
+ANALYZE vactst;
+COMMIT;
+VACUUM FULL pg_am;
+VACUUM FULL pg_class;
+VACUUM FULL pg_database;
+VACUUM FULL vaccluster;
+ERROR: ANALYZE cannot be executed from VACUUM or ANALYZE
+CONTEXT: SQL function "do_analyze" statement 1
+SQL function "wrap_do_analyze" statement 1
+VACUUM FULL vactst;
+VACUUM (DISABLE_PAGE_SKIPPING) vaccluster;
+-- PARALLEL option
+CREATE TABLE pvactst (i INT, a INT[], p POINT) with (autovacuum_enabled = off);
+INSERT INTO pvactst SELECT i, array[1,2,3], point(i, i+1) FROM generate_series(1,1000) i;
+CREATE INDEX btree_pvactst ON pvactst USING btree (i);
+CREATE INDEX hash_pvactst ON pvactst USING hash (i);
+CREATE INDEX brin_pvactst ON pvactst USING brin (i);
+CREATE INDEX gin_pvactst ON pvactst USING gin (a);
+CREATE INDEX gist_pvactst ON pvactst USING gist (p);
+CREATE INDEX spgist_pvactst ON pvactst USING spgist (p);
+-- VACUUM invokes parallel index cleanup
+SET min_parallel_index_scan_size to 0;
+VACUUM (PARALLEL 2) pvactst;
+-- VACUUM invokes parallel bulk-deletion
+UPDATE pvactst SET i = i WHERE i < 1000;
+VACUUM (PARALLEL 2) pvactst;
+UPDATE pvactst SET i = i WHERE i < 1000;
+VACUUM (PARALLEL 0) pvactst; -- disable parallel vacuum
+VACUUM (PARALLEL -1) pvactst; -- error
+ERROR: parallel vacuum degree must be between 0 and 1024
+LINE 1: VACUUM (PARALLEL -1) pvactst;
+ ^
+VACUUM (PARALLEL 2, INDEX_CLEANUP FALSE) pvactst;
+VACUUM (PARALLEL 2, FULL TRUE) pvactst; -- error, cannot use both PARALLEL and FULL
+ERROR: VACUUM FULL cannot be performed in parallel
+VACUUM (PARALLEL) pvactst; -- error, cannot use PARALLEL option without parallel degree
+ERROR: parallel option requires a value between 0 and 1024
+LINE 1: VACUUM (PARALLEL) pvactst;
+ ^
+-- Test different combinations of parallel and full options for temporary tables
+CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);
+CREATE INDEX tmp_idx1 ON tmp (a);
+VACUUM (PARALLEL 1, FULL FALSE) tmp; -- parallel vacuum disabled for temp tables
+VACUUM (PARALLEL 0, FULL TRUE) tmp; -- can specify parallel disabled (even though that's implied by FULL)
+RESET min_parallel_index_scan_size;
+DROP TABLE pvactst;
+-- INDEX_CLEANUP option
+CREATE TABLE no_index_cleanup (i INT PRIMARY KEY, t TEXT);
+-- Use uncompressed data stored in toast.
+CREATE INDEX no_index_cleanup_idx ON no_index_cleanup(t);
+ALTER TABLE no_index_cleanup ALTER COLUMN t SET STORAGE EXTERNAL;
+INSERT INTO no_index_cleanup(i, t) VALUES (generate_series(1,30),
+ repeat('1234567890',269));
+-- index cleanup option is ignored if VACUUM FULL
+VACUUM (INDEX_CLEANUP TRUE, FULL TRUE) no_index_cleanup;
+VACUUM (FULL TRUE) no_index_cleanup;
+-- Toast inherits the value from its parent table.
+ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = false);
+DELETE FROM no_index_cleanup WHERE i < 15;
+-- Nothing is cleaned up.
+VACUUM no_index_cleanup;
+-- Both parent relation and toast are cleaned up.
+ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = true);
+VACUUM no_index_cleanup;
+-- Parameter is set for both the parent table and its toast relation.
+INSERT INTO no_index_cleanup(i, t) VALUES (generate_series(31,60),
+ repeat('1234567890',269));
+DELETE FROM no_index_cleanup WHERE i < 45;
+-- Only toast index is cleaned up.
+ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = false,
+ toast.vacuum_index_cleanup = true);
+VACUUM no_index_cleanup;
+-- Only parent is cleaned up.
+ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = true,
+ toast.vacuum_index_cleanup = false);
+VACUUM no_index_cleanup;
+-- Test some extra relations.
+VACUUM (INDEX_CLEANUP FALSE) vaccluster;
+VACUUM (INDEX_CLEANUP FALSE) vactst; -- index cleanup option is ignored if no indexes
+VACUUM (INDEX_CLEANUP FALSE, FREEZE TRUE) vaccluster;
+-- TRUNCATE option
+CREATE TABLE vac_truncate_test(i INT NOT NULL, j text)
+ WITH (vacuum_truncate=true, autovacuum_enabled=false);
+INSERT INTO vac_truncate_test VALUES (1, NULL), (NULL, NULL);
+ERROR: null value in column "i" of relation "vac_truncate_test" violates not-null constraint
+DETAIL: Failing row contains (null, null).
+VACUUM (TRUNCATE FALSE) vac_truncate_test;
+SELECT pg_relation_size('vac_truncate_test') > 0;
+ ?column?
+----------
+ t
+(1 row)
+
+VACUUM vac_truncate_test;
+SELECT pg_relation_size('vac_truncate_test') = 0;
+ ?column?
+----------
+ f
+(1 row)
+
+VACUUM (TRUNCATE FALSE, FULL TRUE) vac_truncate_test;
+DROP TABLE vac_truncate_test;
+-- partitioned table
+CREATE TABLE vacparted (a int, b char) PARTITION BY LIST (a);
+CREATE TABLE vacparted1 PARTITION OF vacparted FOR VALUES IN (1);
+INSERT INTO vacparted VALUES (1, 'a');
+UPDATE vacparted SET b = 'b';
+VACUUM (ANALYZE) vacparted;
+VACUUM (FULL) vacparted;
+VACUUM (FREEZE) vacparted;
+-- check behavior with duplicate column mentions
+VACUUM ANALYZE vacparted(a,b,a);
+ERROR: column "a" of relation "vacparted" appears more than once
+ANALYZE vacparted(a,b,b);
+ERROR: column "b" of relation "vacparted" appears more than once
+-- multiple tables specified
+VACUUM vaccluster, vactst;
+VACUUM vacparted, does_not_exist;
+ERROR: relation "does_not_exist" does not exist
+VACUUM (FREEZE) vacparted, vaccluster, vactst;
+VACUUM (FREEZE) does_not_exist, vaccluster;
+ERROR: relation "does_not_exist" does not exist
+VACUUM ANALYZE vactst, vacparted (a);
+VACUUM ANALYZE vactst (does_not_exist), vacparted (b);
+ERROR: column "does_not_exist" of relation "vactst" does not exist
+VACUUM FULL vacparted, vactst;
+VACUUM FULL vactst, vacparted (a, b), vaccluster (i);
+ERROR: ANALYZE option must be specified when a column list is provided
+ANALYZE vactst, vacparted;
+ANALYZE vacparted (b), vactst;
+ANALYZE vactst, does_not_exist, vacparted;
+ERROR: relation "does_not_exist" does not exist
+ANALYZE vactst (i), vacparted (does_not_exist);
+ERROR: column "does_not_exist" of relation "vacparted" does not exist
+ANALYZE vactst, vactst;
+BEGIN; -- ANALYZE behaves differently inside a transaction block
+ANALYZE vactst, vactst;
+COMMIT;
+-- parenthesized syntax for ANALYZE
+ANALYZE (VERBOSE) does_not_exist;
+ERROR: relation "does_not_exist" does not exist
+ANALYZE (nonexistent-arg) does_not_exist;
+ERROR: syntax error at or near "arg"
+LINE 1: ANALYZE (nonexistent-arg) does_not_exist;
+ ^
+ANALYZE (nonexistentarg) does_not_exit;
+ERROR: unrecognized ANALYZE option "nonexistentarg"
+LINE 1: ANALYZE (nonexistentarg) does_not_exit;
+ ^
+-- ensure argument order independence, and that SKIP_LOCKED on non-existing
+-- relation still errors out. Suppress WARNING messages caused by concurrent
+-- autovacuums.
+SET client_min_messages TO 'ERROR';
+ANALYZE (SKIP_LOCKED, VERBOSE) does_not_exist;
+ERROR: relation "does_not_exist" does not exist
+ANALYZE (VERBOSE, SKIP_LOCKED) does_not_exist;
+ERROR: relation "does_not_exist" does not exist
+-- SKIP_LOCKED option
+VACUUM (SKIP_LOCKED) vactst;
+VACUUM (SKIP_LOCKED, FULL) vactst;
+ANALYZE (SKIP_LOCKED) vactst;
+RESET client_min_messages;
+-- ensure VACUUM and ANALYZE don't have a problem with serializable
+SET default_transaction_isolation = serializable;
+VACUUM vactst;
+ANALYZE vactst;
+RESET default_transaction_isolation;
+BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ANALYZE vactst;
+COMMIT;
+DROP TABLE vaccluster;
+DROP TABLE vactst;
+DROP TABLE vacparted;
+DROP TABLE no_index_cleanup;
+-- relation ownership, WARNING logs generated as all are skipped.
+CREATE TABLE vacowned (a int);
+CREATE TABLE vacowned_parted (a int) PARTITION BY LIST (a);
+CREATE TABLE vacowned_part1 PARTITION OF vacowned_parted FOR VALUES IN (1);
+CREATE TABLE vacowned_part2 PARTITION OF vacowned_parted FOR VALUES IN (2);
+CREATE ROLE regress_vacuum;
+SET ROLE regress_vacuum;
+-- Simple table
+VACUUM vacowned;
+WARNING: skipping "vacowned" --- only table or database owner can vacuum it
+ANALYZE vacowned;
+WARNING: skipping "vacowned" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned;
+WARNING: skipping "vacowned" --- only table or database owner can vacuum it
+-- Catalog
+VACUUM pg_catalog.pg_class;
+WARNING: skipping "pg_class" --- only superuser or database owner can vacuum it
+ANALYZE pg_catalog.pg_class;
+WARNING: skipping "pg_class" --- only superuser or database owner can analyze it
+VACUUM (ANALYZE) pg_catalog.pg_class;
+WARNING: skipping "pg_class" --- only superuser or database owner can vacuum it
+-- Shared catalog
+VACUUM pg_catalog.pg_authid;
+WARNING: skipping "pg_authid" --- only superuser can vacuum it
+ANALYZE pg_catalog.pg_authid;
+WARNING: skipping "pg_authid" --- only superuser can analyze it
+VACUUM (ANALYZE) pg_catalog.pg_authid;
+WARNING: skipping "pg_authid" --- only superuser can vacuum it
+-- Partitioned table and its partitions, nothing owned by other user.
+-- Relations are not listed in a single command to test ownership
+-- independently.
+VACUUM vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+VACUUM vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+ANALYZE vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can analyze it
+WARNING: skipping "vacowned_part1" --- only table or database owner can analyze it
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+ANALYZE vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can analyze it
+ANALYZE vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+RESET ROLE;
+-- Partitioned table and one partition owned by other user.
+ALTER TABLE vacowned_parted OWNER TO regress_vacuum;
+ALTER TABLE vacowned_part1 OWNER TO regress_vacuum;
+SET ROLE regress_vacuum;
+VACUUM vacowned_parted;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM vacowned_part1;
+VACUUM vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+ANALYZE vacowned_parted;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+ANALYZE vacowned_part1;
+ANALYZE vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned_parted;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part1;
+VACUUM (ANALYZE) vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+RESET ROLE;
+-- Only one partition owned by other user.
+ALTER TABLE vacowned_parted OWNER TO CURRENT_USER;
+SET ROLE regress_vacuum;
+VACUUM vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM vacowned_part1;
+VACUUM vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+ANALYZE vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can analyze it
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+ANALYZE vacowned_part1;
+ANALYZE vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned_parted;
+WARNING: skipping "vacowned_parted" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part1;
+VACUUM (ANALYZE) vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+RESET ROLE;
+-- Only partitioned table owned by other user.
+ALTER TABLE vacowned_parted OWNER TO regress_vacuum;
+ALTER TABLE vacowned_part1 OWNER TO CURRENT_USER;
+SET ROLE regress_vacuum;
+VACUUM vacowned_parted;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+VACUUM vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+ANALYZE vacowned_parted;
+WARNING: skipping "vacowned_part1" --- only table or database owner can analyze it
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+ANALYZE vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can analyze it
+ANALYZE vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can analyze it
+VACUUM (ANALYZE) vacowned_parted;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part1;
+WARNING: skipping "vacowned_part1" --- only table or database owner can vacuum it
+VACUUM (ANALYZE) vacowned_part2;
+WARNING: skipping "vacowned_part2" --- only table or database owner can vacuum it
+RESET ROLE;
+DROP TABLE vacowned;
+DROP TABLE vacowned_parted;
+DROP ROLE regress_vacuum;
diff --git a/src/test/regress/expected/zedstore.out b/src/test/regress/expected/zedstore.out
new file mode 100644
index 0000000000..9ab09d8514
--- /dev/null
+++ b/src/test/regress/expected/zedstore.out
@@ -0,0 +1,666 @@
+-- simple tests to iteratively build the zedstore
+-- create and drop works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+drop table t_zedstore;
+-- insert and select works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+insert into t_zedstore select i,i+1,i+2 from generate_series(1, 10)i;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 4 | 5 | 6
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+(10 rows)
+
+-- selecting only a few columns works
+select c1, c3 from t_zedstore;
+ c1 | c3
+----+----
+ 1 | 3
+ 2 | 4
+ 3 | 5
+ 4 | 6
+ 5 | 7
+ 6 | 8
+ 7 | 9
+ 8 | 10
+ 9 | 11
+ 10 | 12
+(10 rows)
+
+-- a query with only a few output columns and a WHERE clause works
+select c3 from t_zedstore where c2 > 5;
+ c3
+----
+ 7
+ 8
+ 9
+ 10
+ 11
+ 12
+(6 rows)
+
+-- where clause with invalid ctid works
+select * from t_zedstore where ctid = '(0,0)';
+ c1 | c2 | c3
+----+----+----
+(0 rows)
+
+-- Test that abort works
+begin;
+insert into t_zedstore select i,i+1,i+2 from generate_series(21, 25)i;
+abort;
+insert into t_zedstore select i,i+1,i+2 from generate_series(31, 35)i;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 4 | 5 | 6
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(15 rows)
+
+--
+-- Test indexing
+--
+create index on t_zedstore (c1);
+set enable_seqscan=off;
+set enable_indexscan=on;
+set enable_bitmapscan=off;
+-- index scan
+select * from t_zedstore where c1 = 5;
+ c1 | c2 | c3
+----+----+----
+ 5 | 6 | 7
+(1 row)
+
+-- index-only scan
+select c1 from t_zedstore where c1 = 5;
+ c1
+----
+ 5
+(1 row)
+
+-- bitmap scan
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+select c1, c2 from t_zedstore where c1 between 5 and 10;
+ c1 | c2
+----+----
+ 5 | 6
+ 6 | 7
+ 7 | 8
+ 8 | 9
+ 9 | 10
+ 10 | 11
+(6 rows)
+
+--
+-- Test DELETE and UPDATE
+--
+delete from t_zedstore where c2 = 5;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 1 | 2 | 3
+ 2 | 3 | 4
+ 3 | 4 | 5
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(14 rows)
+
+delete from t_zedstore where c2 < 5;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 8 | 9 | 10
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+(11 rows)
+
+update t_zedstore set c2 = 100 where c1 = 8;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+-----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+ 8 | 100 | 10
+(11 rows)
+
+--
+-- Test page deletion, by deleting a bigger range of values
+--
+insert into t_zedstore select i,i+1,i+2 from generate_series(10000, 15000)i;
+delete from t_zedstore where c1 >= 10000;
+--
+-- Test VACUUM
+--
+vacuum t_zedstore;
+select * from t_zedstore;
+ c1 | c2 | c3
+----+-----+----
+ 5 | 6 | 7
+ 6 | 7 | 8
+ 7 | 8 | 9
+ 9 | 10 | 11
+ 10 | 11 | 12
+ 31 | 32 | 33
+ 32 | 33 | 34
+ 33 | 34 | 35
+ 34 | 35 | 36
+ 35 | 36 | 37
+ 8 | 100 | 10
+(11 rows)
+
+--
+-- Test in-line toasting
+--
+create table t_zedtoast(c1 int, t text) USING zedstore;
+insert into t_zedtoast select i, repeat('x', 10000) from generate_series(1, 10) i;
+select c1, length(t) from t_zedtoast;
+ c1 | length
+----+--------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+ 6 | 10000
+ 7 | 10000
+ 8 | 10000
+ 9 | 10000
+ 10 | 10000
+(10 rows)
+
+-- TODO: this test won't actually work when we start using the UNDO framework
+-- because we will not have control over when the undo record will be removed
+delete from t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+ has_toast_pages
+-----------------
+ f
+(1 row)
+
+vacuum t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+ has_toast_pages
+-----------------
+ f
+(1 row)
+
+--
+-- Test out-of-line toasting
+--
+insert into t_zedtoast select i, repeat('x', 1000000) from generate_series(1, 10) i;
+select c1, length(t) from t_zedtoast;
+ c1 | length
+----+---------
+ 1 | 1000000
+ 2 | 1000000
+ 3 | 1000000
+ 4 | 1000000
+ 5 | 1000000
+ 6 | 1000000
+ 7 | 1000000
+ 8 | 1000000
+ 9 | 1000000
+ 10 | 1000000
+(10 rows)
+
+-- TODO: this test won't actually work when we start using the UNDO framework
+-- because we will not have control over when the undo record will be removed
+delete from t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+ has_toast_pages
+-----------------
+ t
+(1 row)
+
+vacuum t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+ has_toast_pages
+-----------------
+ f
+(1 row)
+
+--
+-- Test NULL values
+--
+create table t_zednullvalues(c1 int, c2 int) USING zedstore;
+insert into t_zednullvalues values(1, NULL), (NULL, 2);
+select * from t_zednullvalues;
+ c1 | c2
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+select c2 from t_zednullvalues;
+ c2
+----
+
+ 2
+(2 rows)
+
+update t_zednullvalues set c1 = 1, c2 = NULL;
+select * from t_zednullvalues;
+ c1 | c2
+----+----
+ 1 |
+ 1 |
+(2 rows)
+
+--
+-- Test COPY
+--
+create table t_zedcopy(a serial, b int, c text not null default 'stuff', d text,e text) USING zedstore;
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+COPY t_zedcopy (b, d) from stdin;
+COPY t_zedcopy (b, d) from stdin;
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 1 | 1 | stuff | test_1 |
+ 2 | 2 | stuff | test_2 |
+ 3 | 3 | stuff | test_3 |
+ 4 | 4 | stuff | test_4 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(12 rows)
+
+COPY t_zedcopy (a, d, e) to stdout;
+9999 NN \N
+10000 41 51
+1 test_1 \N
+2 test_2 \N
+3 test_3 \N
+4 test_4 \N
+5 test_5 \N
+10001 42 52
+10002 43 53
+10003 44 54
+10004 45 55
+10005 46 56
+--
+-- Also test delete and update on the table that was populated with COPY.
+-- This exercises splitting the array item. (A table not populated with
+-- COPY only contains single items, at the moment.)
+--
+delete from t_zedcopy where b = 4;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 1 | 1 | stuff | test_1 |
+ 2 | 2 | stuff | test_2 |
+ 3 | 3 | stuff | test_3 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(11 rows)
+
+delete from t_zedcopy where b < 3;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 3 | 3 | stuff | test_3 |
+ 5 | 5 | stuff | test_5 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+(9 rows)
+
+update t_zedcopy set b = 100 where b = 5;
+select * from t_zedcopy;
+ a | b | c | d | e
+-------+-----+-------+--------+----
+ 9999 | | \N | NN |
+ 10000 | 21 | 31 | 41 | 51
+ 3 | 3 | stuff | test_3 |
+ 10001 | 22 | 32 | 42 | 52
+ 10002 | 23 | 33 | 43 | 53
+ 10003 | 24 | 34 | 44 | 54
+ 10004 | 25 | 35 | 45 | 55
+ 10005 | 26 | 36 | 46 | 56
+ 5 | 100 | stuff | test_5 |
+(9 rows)
+
+-- Test rolling back COPY
+begin;
+COPY t_zedcopy (b, d) from stdin;
+rollback;
+select count(*) from t_zedcopy where b >= 20000;
+ count
+-------
+ 0
+(1 row)
+
+--
+-- Test zero column table
+--
+create table t_zwithzerocols() using zedstore;
+insert into t_zwithzerocols select t.* from t_zwithzerocols t right join generate_series(1,1) on true;
+select count(*) from t_zwithzerocols;
+ count
+-------
+ 1
+(1 row)
+
+-- Test for alter table add column
+create table t_zaddcol(a int) using zedstore;
+insert into t_zaddcol select * from generate_series(1, 3);
+-- rewrite case
+alter table t_zaddcol add column b int generated always as (a + 1) stored;
+select * from t_zaddcol;
+ a | b
+---+---
+ 1 | 2
+ 2 | 3
+ 3 | 4
+(3 rows)
+
+-- test alter table add column with no default
+create table t_zaddcol_simple(a int) using zedstore;
+insert into t_zaddcol_simple values (1);
+alter table t_zaddcol_simple add b int;
+select * from t_zaddcol_simple;
+ a | b
+---+---
+ 1 |
+(1 row)
+
+insert into t_zaddcol_simple values(2,3);
+select * from t_zaddcol_simple;
+ a | b
+---+---
+ 1 |
+ 2 | 3
+(2 rows)
+
+-- fixed length default value stored in catalog
+alter table t_zaddcol add column c int default 3;
+select * from t_zaddcol;
+ a | b | c
+---+---+---
+ 1 | 2 | 3
+ 2 | 3 | 3
+ 3 | 4 | 3
+(3 rows)
+
+-- variable length default value stored in catalog
+alter table t_zaddcol add column d text default 'abcdefgh';
+select d from t_zaddcol;
+ d
+----------
+ abcdefgh
+ abcdefgh
+ abcdefgh
+(3 rows)
+
+-- insert after add column
+insert into t_zaddcol values (2);
+select * from t_zaddcol;
+ a | b | c | d
+---+---+---+----------
+ 1 | 2 | 3 | abcdefgh
+ 2 | 3 | 3 | abcdefgh
+ 3 | 4 | 3 | abcdefgh
+ 2 | 3 | 3 | abcdefgh
+(4 rows)
+
+insert into t_zaddcol (a, c, d) values (3,5, 'test_insert');
+select b,c,d from t_zaddcol;
+ b | c | d
+---+---+-------------
+ 2 | 3 | abcdefgh
+ 3 | 3 | abcdefgh
+ 4 | 3 | abcdefgh
+ 3 | 3 | abcdefgh
+ 4 | 5 | test_insert
+(5 rows)
+
+--
+-- Test TABLESAMPLE
+--
+-- The regular tablesample.sql test doesn't directly work for zedstore, as
+-- it uses fillfactor to create a specific block layout for
+-- heap. Hence, the output differs between a heap and a zedstore table while
+-- sampling. We need many tuples here to get multiple logical
+-- blocks, as there is no way to force TIDs to spread / jump for zedstore.
+--
+CREATE TABLE t_ztablesample (id int, name text) using zedstore;
+INSERT INTO t_ztablesample
+ SELECT i, repeat(i::text, 2) FROM generate_series(0, 299) s(i);
+-- let's delete half the rows (even-numbered ids) to limit the output
+DELETE FROM t_ztablesample WHERE id%2 = 0;
+-- should return ALL visible tuples from SOME blocks
+SELECT ctid,t.id FROM t_ztablesample AS t TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+ ctid | id
+---------+-----
+ (1,2) | 129
+ (1,4) | 131
+ (1,6) | 133
+ (1,8) | 135
+ (1,10) | 137
+ (1,12) | 139
+ (1,14) | 141
+ (1,16) | 143
+ (1,18) | 145
+ (1,20) | 147
+ (1,22) | 149
+ (1,24) | 151
+ (1,26) | 153
+ (1,28) | 155
+ (1,30) | 157
+ (1,32) | 159
+ (1,34) | 161
+ (1,36) | 163
+ (1,38) | 165
+ (1,40) | 167
+ (1,42) | 169
+ (1,44) | 171
+ (1,46) | 173
+ (1,48) | 175
+ (1,50) | 177
+ (1,52) | 179
+ (1,54) | 181
+ (1,56) | 183
+ (1,58) | 185
+ (1,60) | 187
+ (1,62) | 189
+ (1,64) | 191
+ (1,66) | 193
+ (1,68) | 195
+ (1,70) | 197
+ (1,72) | 199
+ (1,74) | 201
+ (1,76) | 203
+ (1,78) | 205
+ (1,80) | 207
+ (1,82) | 209
+ (1,84) | 211
+ (1,86) | 213
+ (1,88) | 215
+ (1,90) | 217
+ (1,92) | 219
+ (1,94) | 221
+ (1,96) | 223
+ (1,98) | 225
+ (1,100) | 227
+ (1,102) | 229
+ (1,104) | 231
+ (1,106) | 233
+ (1,108) | 235
+ (1,110) | 237
+ (1,112) | 239
+ (1,114) | 241
+ (1,116) | 243
+ (1,118) | 245
+ (1,120) | 247
+ (1,122) | 249
+ (1,124) | 251
+ (1,126) | 253
+ (1,128) | 255
+ (2,2) | 257
+ (2,4) | 259
+ (2,6) | 261
+ (2,8) | 263
+ (2,10) | 265
+ (2,12) | 267
+ (2,14) | 269
+ (2,16) | 271
+ (2,18) | 273
+ (2,20) | 275
+ (2,22) | 277
+ (2,24) | 279
+ (2,26) | 281
+ (2,28) | 283
+ (2,30) | 285
+ (2,32) | 287
+ (2,34) | 289
+ (2,36) | 291
+ (2,38) | 293
+ (2,40) | 295
+ (2,42) | 297
+ (2,44) | 299
+(86 rows)
+
+-- should return SOME visible tuples but from ALL the blocks
+SELECT ctid,id FROM t_ztablesample TABLESAMPLE BERNOULLI (50) REPEATABLE (0);
+ ctid | id
+---------+-----
+ (0,4) | 3
+ (0,6) | 5
+ (0,8) | 7
+ (0,20) | 19
+ (0,30) | 29
+ (0,42) | 41
+ (0,44) | 43
+ (0,48) | 47
+ (0,52) | 51
+ (0,54) | 53
+ (0,56) | 55
+ (0,62) | 61
+ (0,64) | 63
+ (0,66) | 65
+ (0,76) | 75
+ (0,80) | 79
+ (0,82) | 81
+ (0,84) | 83
+ (0,88) | 87
+ (0,90) | 89
+ (0,92) | 91
+ (0,98) | 97
+ (0,106) | 105
+ (0,108) | 107
+ (0,122) | 121
+ (0,126) | 125
+ (1,2) | 129
+ (1,4) | 131
+ (1,6) | 133
+ (1,8) | 135
+ (1,10) | 137
+ (1,12) | 139
+ (1,20) | 147
+ (1,24) | 151
+ (1,26) | 153
+ (1,28) | 155
+ (1,30) | 157
+ (1,32) | 159
+ (1,34) | 161
+ (1,40) | 167
+ (1,44) | 171
+ (1,46) | 173
+ (1,58) | 185
+ (1,66) | 193
+ (1,68) | 195
+ (1,70) | 197
+ (1,78) | 205
+ (1,80) | 207
+ (1,88) | 215
+ (1,92) | 219
+ (1,96) | 223
+ (1,100) | 227
+ (1,102) | 229
+ (1,106) | 233
+ (1,112) | 239
+ (1,116) | 243
+ (1,120) | 247
+ (1,122) | 249
+ (1,126) | 253
+ (2,2) | 257
+ (2,6) | 261
+ (2,8) | 263
+ (2,10) | 265
+ (2,12) | 267
+ (2,16) | 271
+ (2,18) | 273
+ (2,24) | 279
+ (2,26) | 281
+ (2,28) | 283
+ (2,30) | 285
+ (2,34) | 289
+ (2,36) | 291
+ (2,42) | 297
+ (2,44) | 299
+(74 rows)
+
diff --git a/src/test/regress/expected/zstidscan.out b/src/test/regress/expected/zstidscan.out
new file mode 100644
index 0000000000..d7f3c76a24
--- /dev/null
+++ b/src/test/regress/expected/zstidscan.out
@@ -0,0 +1,213 @@
+select '1'::zstid;
+ zstid
+-------
+ 1
+(1 row)
+
+select '-1'::zstid;
+ERROR: value "-1" is out of range for type zstid
+LINE 1: select '-1'::zstid;
+ ^
+select -'1'::zstid;
+ERROR: operator does not exist: - zstid
+LINE 1: select -'1'::zstid;
+ ^
+HINT: No operator matches the given name and argument type. You might need to add an explicit type cast.
+-- int2 conversion
+select 1::int2::zstid;
+ zstid
+-------
+ 1
+(1 row)
+
+select (-1)::int2::zstid;
+ERROR: value -1 is out of range for type zstid
+select -1::int2::zstid;
+ERROR: operator does not exist: - zstid
+LINE 1: select -1::int2::zstid;
+ ^
+HINT: No operator matches the given name and argument type. You might need to add an explicit type cast.
+-- int4 conversion
+select 1::zstid;
+ zstid
+-------
+ 1
+(1 row)
+
+select (-1)::zstid;
+ERROR: value -1 is out of range for type zstid
+select -1::zstid;
+ERROR: operator does not exist: - zstid
+LINE 1: select -1::zstid;
+ ^
+HINT: No operator matches the given name and argument type. You might need to add an explicit type cast.
+-- int8 conversion
+select 1::int8::zstid;
+ zstid
+-------
+ 1
+(1 row)
+
+select 1000000000000000::zstid; -- bigger than MaxZSTid
+ERROR: value 1000000000000000 is out of range for type zstid
+select (-1)::int8::zstid;
+ERROR: value -1 is out of range for type zstid
+select -1::int8::zstid;
+ERROR: operator does not exist: - zstid
+LINE 1: select -1::int8::zstid;
+ ^
+HINT: No operator matches the given name and argument type. You might need to add an explicit type cast.
+create table if not exists zstidscan(a int) using zedstore;
+insert into zstidscan values (1), (2), (3);
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan;
+ QUERY PLAN
+-----------------------
+ Seq Scan on zstidscan
+(1 row)
+
+select ctid, ctid::zstid as zstid, a from zstidscan;
+ ctid | zstid | a
+-------+-------+---
+ (0,1) | 1 | 1
+ (0,2) | 2 | 2
+ (0,3) | 3 | 3
+(3 rows)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid = 2;
+ QUERY PLAN
+----------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid = '2'::zstid)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid = 2;
+ ctid | zstid | a
+-------+-------+---
+ (0,2) | 2 | 2
+(1 row)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid >= 2;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid >= '2'::zstid)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid >= 2;
+ ctid | zstid | a
+-------+-------+---
+ (0,2) | 2 | 2
+ (0,3) | 3 | 3
+(2 rows)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <= 2;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid <= '2'::zstid)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <= 2;
+ ctid | zstid | a
+-------+-------+---
+ (0,1) | 1 | 1
+ (0,2) | 2 | 2
+(2 rows)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid < 2;
+ QUERY PLAN
+----------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid < '2'::zstid)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid < 2;
+ ctid | zstid | a
+-------+-------+---
+ (0,1) | 1 | 1
+(1 row)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid > 2;
+ QUERY PLAN
+----------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid > '2'::zstid)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid > 2;
+ ctid | zstid | a
+-------+-------+---
+ (0,3) | 3 | 3
+(1 row)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <> 2;
+ QUERY PLAN
+-----------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid <> '2'::zstid)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <> 2;
+ ctid | zstid | a
+-------+-------+---
+ (0,1) | 1 | 1
+ (0,3) | 3 | 3
+(2 rows)
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid in (2,3);
+ QUERY PLAN
+----------------------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((ctid)::zstid = ANY ('{2,3}'::zstid[]))
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid in (2,3);
+ ctid | zstid | a
+-------+-------+---
+ (0,2) | 2 | 2
+ (0,3) | 3 | 3
+(2 rows)
+
+-- TODO: casting to int2 or int4 might be useful
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int2 % 3 = 0;
+ERROR: cannot cast type zstid to smallint
+LINE 1: ...zstid as zstid, a from zstidscan where ctid::zstid::int2 % 3...
+ ^
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int4 % 3 = 0;
+ERROR: cannot cast type zstid to integer
+LINE 1: ...zstid as zstid, a from zstidscan where ctid::zstid::int4 % 3...
+ ^
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int8 % 3 = 0;
+ QUERY PLAN
+---------------------------------------------------------
+ Seq Scan on zstidscan
+ Filter: ((((ctid)::zstid)::bigint % '3'::bigint) = 0)
+(2 rows)
+
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int8 % 3 = 0;
+ ctid | zstid | a
+-------+-------+---
+ (0,3) | 3 | 3
+(1 row)
+
+-- TODO: Add necessary functions to do these useful aggregates on zstid types
+select max(ctid::zstid) from zstidscan;
+ERROR: function max(zstid) does not exist
+LINE 1: select max(ctid::zstid) from zstidscan;
+ ^
+HINT: No function matches the given name and argument types. You might need to add explicit type casts.
+select min(ctid::zstid) from zstidscan;
+ERROR: function min(zstid) does not exist
+LINE 1: select min(ctid::zstid) from zstidscan;
+ ^
+HINT: No function matches the given name and argument types. You might need to add explicit type casts.
+drop table zstidscan;
diff --git a/src/test/regress/gpdiff.pl b/src/test/regress/gpdiff.pl
new file mode 100755
index 0000000000..02ec0272c6
--- /dev/null
+++ b/src/test/regress/gpdiff.pl
@@ -0,0 +1,260 @@
+#!/usr/bin/env perl
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# Pod::Usage is loaded lazily when needed, if the --help or other such option
+# is actually used. Loading the module takes some time, which adds up when
+# running hundreds of regression tests, and pg_regress invokes this script
+# for every test. See lazy_pod2usage().
+#use Pod::Usage;
+
+use strict;
+use warnings;
+use File::Spec;
+use Getopt::Long qw(GetOptions);
+Getopt::Long::Configure qw(pass_through);
+
+# Load atmsort module from the same dir as this script
+use FindBin;
+use lib "$FindBin::Bin";
+use atmsort;
+
+=head1 NAME
+
+B<gpdiff.pl>
+
+=head1 SYNOPSIS
+
+B<gpdiff.pl> [options] logfile [logfile...]
+
+Options:
+
+Normally, gpdiff takes the standard "diff" options and passes them
+directly to the diff program. Try `diff --help' for more information
+on the standard options. The following options are specific to gpdiff:
+
+ -help brief help message
+ -man full documentation
+ -version print gpdiff version and underlying diff version
+ -verbose print verbose info
+ -gpd_ignore_plans ignore explain plan content in input files
+ -gpd_init <file> load initialization file
+
+=head1 OPTIONS
+
+=over 8
+
+=item B<-help>
+
+ Prints a brief help message and exits.
+
+=item B<-man>
+
+ Prints the manual page and exits.
+
+=item B<-version>
+
+ Prints the gpdiff version and the underlying diff version.
+
+=item B<-verbose>
+
+ Prints verbose information.
+
+=item B<-gpd_ignore_plans>
+
+Specify this option to ignore any explain plan diffs between the
+input files. This will completely ignore any plan content in
+the input files, thus masking differences in plans between them.
+
+=item B<-gpd_init> <file>
+
+Specify an initialization file containing a series of directives
+(mainly for match_subs) that get applied to the input files. To
+specify multiple initialization files, use multiple gpd_init arguments, e.g.:
+
+ -gpd_init file1 -gpd_init file2
+
+=back
+
+=head1 DESCRIPTION
+
+gpdiff compares files using diff after processing them with
+atmsort.pm. This comparison is designed to handle the cases where
+query output order or plans may differ. Type "atmsort.pl
+--man" for more details. gpdiff is invoked by pg_regress as part of
+"make installcheck-world". In this case the diff options are
+something like:
+
+ "-w -I NOTICE: -I HINT: -I CONTEXT: -I REGRESS_IGNORE:".
+
+Like diff, gpdiff can compare two files, a file and directory, a
+directory and file, and two directories. However, when gpdiff compares
+two directories, it only returns the exit status of the diff
+comparison of the final two files.
+
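+For example, to compare a single result file against its expected output
+by hand (hypothetical paths shown), one could run:
+
+ gpdiff.pl -w -I NOTICE: expected/zedstore.out results/zedstore.out
+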
+=head1 BUGS
+
+While the exit status is set correctly for most cases,
+STDERR messages from diff are not displayed.
+
+=cut
+
+# Calls pod2usage, but loads the module first.
+sub lazy_pod2usage
+{
+ require Pod::Usage;
+ Pod::Usage::pod2usage(@_);
+}
+
+my %glob_atmsort_args;
+
+my $glob_ignore_plans;
+my $glob_init_file = [];
+
+sub gpdiff_files
+{
+ my ($f1, $f2, $d2d) = @_;
+ my @tmpfils;
+ my $newf1;
+ my $newf2;
+
+ atmsort::atmsort_init(%glob_atmsort_args);
+ $newf1 = atmsort::run($f1);
+ $newf2 = atmsort::run($f2);
+
+ my $args = join(' ', @ARGV, $newf1, $newf2);
+
+# print "args: $args\n";
+
+ my $outi =`diff $args`;
+
+ my $stat = $? >> 8; # diff status
+
+ # prefix the diff output with the files names for a "directory to
+ # directory" diff
+ if (defined($d2d) && length($outi))
+ {
+ $outi = "diff $f1 $f2\n" . $outi;
+ }
+
+ # replace temp file name references with actual file names
+ $outi =~ s/\Q$newf1\E/\Q$f1\E/gm;
+ $outi =~ s/\Q$newf2\E/\Q$f2\E/gm;
+
+ print $outi;
+
+#my $stat = WEXITVALUE($?); # diff status
+
+ unlink $newf1;
+ unlink $newf2;
+
+ return ($stat);
+}
+
+sub filefunc
+{
+ my ($f1, $f2, $d2d) = @_;
+
+ if ((-f $f1) && (-f $f2))
+ {
+ return (gpdiff_files($f1, $f2, $d2d));
+ }
+
+ # if f1 is a directory, do the filefunc of every file in that directory
+ if ((-d $f1) && (-d $f2))
+ {
+ my $dir = $f1;
+ my ($dir_h, $stat);
+
+ if (opendir($dir_h, $dir))
+ {
+ my $fnam;
+ while ($fnam = readdir($dir_h))
+ {
+ # ignore ., ..
+ next if ($fnam eq '.' || $fnam eq '..');
+
+ my $absname = File::Spec->rel2abs(
+ File::Spec->catfile($dir, $fnam));
+
+ # specify that is a directory comparison
+ $d2d = {} unless (defined($d2d));
+ $d2d->{dir} = 1;
+ $stat = filefunc($absname, $f2, $d2d);
+ }
+ closedir $dir_h;
+ }
+ return $stat;
+ }
+
+ # if f2 is a directory, find the corresponding file in that directory
+ if ((-f $f1) && (-d $f2))
+ {
+ my $stat;
+ my @foo = File::Spec->splitpath($f1);
+
+ return 0 unless (scalar(@foo));
+ my $basenam = $foo[-1];
+
+ my $fnam = File::Spec->rel2abs(File::Spec->catfile( $f2, $basenam));
+
+ $stat = filefunc($f1, $fnam, $d2d);
+
+ return $stat;
+ }
+
+ # find f2 in dir f1
+ if ((-f $f2) && (-d $f1))
+ {
+ my $stat;
+ my @foo = File::Spec->splitpath($f2);
+
+ return 0 unless (scalar(@foo));
+ my $basenam = $foo[-1];
+
+ my $fnam = File::Spec->rel2abs( File::Spec->catfile( $f1, $basenam));
+
+ $stat = filefunc($fnam, $f2, $d2d);
+
+ return $stat;
+ }
+
+ return 0;
+}
+
+sub print_version
+{
+ print "(PostgreSQL)";
+ exit(0);
+}
+
+if (1)
+{
+ my $pmsg = "";
+
+ GetOptions(
+ "man" => sub { lazy_pod2usage(-msg => $pmsg, -exitstatus => 0, -verbose => 2) },
+ "help" => sub { lazy_pod2usage(-msg => $pmsg, -exitstatus => 1) },
+ "version|v" => \&print_version ,
+ "verbose|Verbose" => \$glob_atmsort_args{VERBOSE},
+ "gpd_ignore_plans|gp_ignore_plans" => \$glob_atmsort_args{IGNORE_PLANS},
+ "gpd_init|gp_init_file=s" => \@{$glob_atmsort_args{INIT_FILES}}
+ );
+
+ lazy_pod2usage(-msg => $pmsg, -exitstatus => 1) unless (scalar(@ARGV) >= 2);
+
+ my $f2 = pop @ARGV;
+ my $f1 = pop @ARGV;
+
+ for my $fname ($f1, $f2)
+ {
+ unless (-e $fname)
+ {
+ print STDERR "gpdiff: $fname: No such file or directory\n";
+ }
+ }
+ exit(2) unless ((-e $f1) && (-e $f2));
+
+ exit(filefunc($f1, $f2));
+}
diff --git a/src/test/regress/output/misc_1.source b/src/test/regress/output/misc_1.source
new file mode 100644
index 0000000000..c29c54c414
--- /dev/null
+++ b/src/test/regress/output/misc_1.source
@@ -0,0 +1,692 @@
+--
+-- MISC
+--
+--
+-- BTREE
+--
+UPDATE onek
+ SET unique1 = onek.unique1 + 1;
+UPDATE onek
+ SET unique1 = onek.unique1 - 1;
+--
+-- BTREE partial
+--
+-- UPDATE onek2
+-- SET unique1 = onek2.unique1 + 1;
+--UPDATE onek2
+-- SET unique1 = onek2.unique1 - 1;
+--
+-- BTREE shutting out non-functional updates
+--
+-- the following two tests seem to take a long time on some
+-- systems. This non-func update stuff needs to be examined
+-- more closely. - jolly (2/22/96)
+--
+UPDATE tmp
+ SET stringu1 = reverse_name(onek.stringu1)
+ FROM onek
+ WHERE onek.stringu1 = 'JBAAAA' and
+ onek.stringu1 = tmp.stringu1;
+UPDATE tmp
+ SET stringu1 = reverse_name(onek2.stringu1)
+ FROM onek2
+ WHERE onek2.stringu1 = 'JCAAAA' and
+ onek2.stringu1 = tmp.stringu1;
+DROP TABLE tmp;
+--UPDATE person*
+-- SET age = age + 1;
+--UPDATE person*
+-- SET age = age + 3
+-- WHERE name = 'linda';
+--
+-- copy
+--
+COPY onek TO '@abs_builddir@/results/onek.data';
+DELETE FROM onek;
+COPY onek FROM '@abs_builddir@/results/onek.data';
+SELECT unique1 FROM onek WHERE unique1 < 2 ORDER BY unique1;
+ unique1
+---------
+ 0
+ 1
+(2 rows)
+
+DELETE FROM onek2;
+COPY onek2 FROM '@abs_builddir@/results/onek.data';
+SELECT unique1 FROM onek2 WHERE unique1 < 2 ORDER BY unique1;
+ unique1
+---------
+ 0
+ 1
+(2 rows)
+
+COPY BINARY stud_emp TO '@abs_builddir@/results/stud_emp.data';
+DELETE FROM stud_emp;
+COPY BINARY stud_emp FROM '@abs_builddir@/results/stud_emp.data';
+SELECT * FROM stud_emp;
+ name | age | location | salary | manager | gpa | percent
+-------+-----+------------+--------+---------+-----+---------
+ jeff | 23 | (8,7.7) | 600 | sharon | 3.5 |
+ cim | 30 | (10.5,4.7) | 400 | | 3.4 |
+ linda | 19 | (0.9,6.1) | 100 | | 2.9 |
+(3 rows)
+
+-- COPY aggtest FROM stdin;
+-- 56 7.8
+-- 100 99.097
+-- 0 0.09561
+-- 42 324.78
+-- .
+-- COPY aggtest TO stdout;
+--
+-- inheritance stress test
+--
+SELECT * FROM a_star*;
+ class | a
+-------+----
+ a | 1
+ a | 2
+ a |
+ b | 3
+ b | 4
+ b |
+ b |
+ c | 5
+ c | 6
+ c |
+ c |
+ d | 7
+ d | 8
+ d | 9
+ d | 10
+ d |
+ d | 11
+ d | 12
+ d | 13
+ d |
+ d |
+ d |
+ d | 14
+ d |
+ d |
+ d |
+ d |
+ e | 15
+ e | 16
+ e | 17
+ e |
+ e | 18
+ e |
+ e |
+ f | 19
+ f | 20
+ f | 21
+ f | 22
+ f |
+ f | 24
+ f | 25
+ f | 26
+ f |
+ f |
+ f |
+ f | 27
+ f |
+ f |
+ f |
+ f |
+(50 rows)
+
+SELECT *
+ FROM b_star* x
+ WHERE x.b = text 'bumble' or x.a < 3;
+ class | a | b
+-------+---+--------
+ b | | bumble
+(1 row)
+
+SELECT class, a
+ FROM c_star* x
+ WHERE x.c ~ text 'hi';
+ class | a
+-------+----
+ c | 5
+ c |
+ d | 7
+ d | 8
+ d | 10
+ d |
+ d | 12
+ d |
+ d |
+ d |
+ e | 15
+ e | 16
+ e |
+ e |
+ f | 19
+ f | 20
+ f | 21
+ f |
+ f | 24
+ f |
+ f |
+ f |
+(22 rows)
+
+SELECT class, b, c
+ FROM d_star* x
+ WHERE x.a < 100;
+ class | b | c
+-------+---------+------------
+ d | grumble | hi sunita
+ d | stumble | hi koko
+ d | rumble |
+ d | | hi kristin
+ d | fumble |
+ d | | hi avi
+ d | |
+ d | |
+(8 rows)
+
+SELECT class, c FROM e_star* x WHERE x.c NOTNULL;
+ class | c
+-------+-------------
+ e | hi carol
+ e | hi bob
+ e | hi michelle
+ e | hi elisa
+ f | hi claire
+ f | hi mike
+ f | hi marcel
+ f | hi keith
+ f | hi marc
+ f | hi allison
+ f | hi jeff
+ f | hi carl
+(12 rows)
+
+SELECT * FROM f_star* x WHERE x.c ISNULL;
+ class | a | c | e | f
+-------+----+---+-----+-------------------------------------------
+ f | 22 | | -7 | ((111,555),(222,666),(333,777),(444,888))
+ f | 25 | | -9 |
+ f | 26 | | | ((11111,33333),(22222,44444))
+ f | | | -11 | ((1111111,3333333),(2222222,4444444))
+ f | 27 | | |
+ f | | | -12 |
+ f | | | | ((11111111,33333333),(22222222,44444444))
+ f | | | |
+(8 rows)
+
+-- grouping and aggregation on inherited sets have been busted in the past...
+SELECT sum(a) FROM a_star*;
+ sum
+-----
+ 355
+(1 row)
+
+SELECT class, sum(a) FROM a_star* GROUP BY class ORDER BY class;
+ class | sum
+-------+-----
+ a | 3
+ b | 7
+ c | 11
+ d | 84
+ e | 66
+ f | 184
+(6 rows)
+
+ALTER TABLE f_star RENAME COLUMN f TO ff;
+ALTER TABLE e_star* RENAME COLUMN e TO ee;
+ALTER TABLE d_star* RENAME COLUMN d TO dd;
+ALTER TABLE c_star* RENAME COLUMN c TO cc;
+ALTER TABLE b_star* RENAME COLUMN b TO bb;
+ALTER TABLE a_star* RENAME COLUMN a TO aa;
+SELECT class, aa
+ FROM a_star* x
+ WHERE aa ISNULL;
+ class | aa
+-------+----
+ a |
+ b |
+ b |
+ c |
+ c |
+ d |
+ d |
+ d |
+ d |
+ d |
+ d |
+ d |
+ d |
+ e |
+ e |
+ e |
+ f |
+ f |
+ f |
+ f |
+ f |
+ f |
+ f |
+ f |
+(24 rows)
+
+-- As of Postgres 7.1, ALTER implicitly recurses,
+-- so this should be same as ALTER a_star*
+ALTER TABLE a_star RENAME COLUMN aa TO foo;
+SELECT class, foo
+ FROM a_star* x
+ WHERE x.foo >= 2;
+ class | foo
+-------+-----
+ a | 2
+ b | 3
+ b | 4
+ c | 5
+ c | 6
+ d | 7
+ d | 8
+ d | 9
+ d | 10
+ d | 11
+ d | 12
+ d | 13
+ d | 14
+ e | 15
+ e | 16
+ e | 17
+ e | 18
+ f | 19
+ f | 20
+ f | 21
+ f | 22
+ f | 24
+ f | 25
+ f | 26
+ f | 27
+(25 rows)
+
+ALTER TABLE a_star RENAME COLUMN foo TO aa;
+SELECT *
+ from a_star*
+ WHERE aa < 1000;
+ class | aa
+-------+----
+ a | 1
+ a | 2
+ b | 3
+ b | 4
+ c | 5
+ c | 6
+ d | 7
+ d | 8
+ d | 9
+ d | 10
+ d | 11
+ d | 12
+ d | 13
+ d | 14
+ e | 15
+ e | 16
+ e | 17
+ e | 18
+ f | 19
+ f | 20
+ f | 21
+ f | 22
+ f | 24
+ f | 25
+ f | 26
+ f | 27
+(26 rows)
+
+ALTER TABLE f_star ADD COLUMN f int4;
+UPDATE f_star SET f = 10;
+ALTER TABLE e_star* ADD COLUMN e int4;
+--UPDATE e_star* SET e = 42;
+SELECT * FROM e_star*;
+ class | aa | cc | ee | e
+-------+----+-------------+-----+---
+ e | 15 | hi carol | -1 |
+ e | 16 | hi bob | |
+ e | 17 | | -2 |
+ e | | hi michelle | -3 |
+ e | 18 | | |
+ e | | hi elisa | |
+ e | | | -4 |
+ f | 19 | hi claire | -5 |
+ f | 20 | hi mike | -6 |
+ f | 21 | hi marcel | |
+ f | 22 | | -7 |
+ f | | hi keith | -8 |
+ f | 24 | hi marc | |
+ f | 25 | | -9 |
+ f | 26 | | |
+ f | | hi allison | -10 |
+ f | | hi jeff | |
+ f | | | -11 |
+ f | 27 | | |
+ f | | hi carl | |
+ f | | | -12 |
+ f | | | |
+ f | | | |
+(23 rows)
+
+ALTER TABLE a_star* ADD COLUMN a text;
+NOTICE: merging definition of column "a" for child "d_star"
+-- That ALTER TABLE should have added TOAST tables.
+SELECT relname, reltoastrelid <> 0 AS has_toast_table
+ FROM pg_class
+ WHERE oid::regclass IN ('a_star', 'c_star')
+ ORDER BY 1;
+ relname | has_toast_table
+---------+-----------------
+ a_star | f
+ c_star | f
+(2 rows)
+
+--UPDATE b_star*
+-- SET a = text 'gazpacho'
+-- WHERE aa > 4;
+SELECT class, aa, a FROM a_star*;
+ class | aa | a
+-------+----+---
+ a | 1 |
+ a | 2 |
+ a | |
+ b | 3 |
+ b | 4 |
+ b | |
+ b | |
+ c | 5 |
+ c | 6 |
+ c | |
+ c | |
+ d | 7 |
+ d | 8 |
+ d | 9 |
+ d | 10 |
+ d | |
+ d | 11 |
+ d | 12 |
+ d | 13 |
+ d | |
+ d | |
+ d | |
+ d | 14 |
+ d | |
+ d | |
+ d | |
+ d | |
+ e | 15 |
+ e | 16 |
+ e | 17 |
+ e | |
+ e | 18 |
+ e | |
+ e | |
+ f | 19 |
+ f | 20 |
+ f | 21 |
+ f | 22 |
+ f | |
+ f | 24 |
+ f | 25 |
+ f | 26 |
+ f | |
+ f | |
+ f | |
+ f | 27 |
+ f | |
+ f | |
+ f | |
+ f | |
+(50 rows)
+
+--
+-- versions
+--
+--
+-- postquel functions
+--
+--
+-- mike does post_hacking,
+-- joe and sally play basketball, and
+-- everyone else does nothing.
+--
+SELECT p.name, name(p.hobbies) FROM ONLY person p;
+ name | name
+-------+-------------
+ mike | posthacking
+ joe | basketball
+ sally | basketball
+(3 rows)
+
+--
+-- as above, but jeff also does post_hacking.
+--
+SELECT p.name, name(p.hobbies) FROM person* p;
+ name | name
+-------+-------------
+ mike | posthacking
+ joe | basketball
+ sally | basketball
+ jeff | posthacking
+(4 rows)
+
+--
+-- the next two queries demonstrate how functions generate bogus duplicates.
+-- this is a "feature" ..
+--
+SELECT DISTINCT hobbies_r.name, name(hobbies_r.equipment) FROM hobbies_r
+ ORDER BY 1,2;
+ name | name
+-------------+---------------
+ basketball | hightops
+ posthacking | advil
+ posthacking | peet's coffee
+ skywalking | guts
+(4 rows)
+
+SELECT hobbies_r.name, (hobbies_r.equipment).name FROM hobbies_r;
+ name | name
+-------------+---------------
+ posthacking | advil
+ posthacking | peet's coffee
+ posthacking | advil
+ posthacking | peet's coffee
+ basketball | hightops
+ basketball | hightops
+ skywalking | guts
+(7 rows)
+
+--
+-- mike needs advil and peet's coffee,
+-- joe and sally need hightops, and
+-- everyone else is fine.
+--
+SELECT p.name, name(p.hobbies), name(equipment(p.hobbies)) FROM ONLY person p;
+ name | name | name
+-------+-------------+---------------
+ mike | posthacking | advil
+ mike | posthacking | peet's coffee
+ joe | basketball | hightops
+ sally | basketball | hightops
+(4 rows)
+
+--
+-- as above, but jeff needs advil and peet's coffee as well.
+--
+SELECT p.name, name(p.hobbies), name(equipment(p.hobbies)) FROM person* p;
+ name | name | name
+-------+-------------+---------------
+ mike | posthacking | advil
+ mike | posthacking | peet's coffee
+ joe | basketball | hightops
+ sally | basketball | hightops
+ jeff | posthacking | advil
+ jeff | posthacking | peet's coffee
+(6 rows)
+
+--
+-- just like the last two, but make sure that the target list fixup and
+-- unflattening is being done correctly.
+--
+SELECT name(equipment(p.hobbies)), p.name, name(p.hobbies) FROM ONLY person p;
+ name | name | name
+---------------+-------+-------------
+ advil | mike | posthacking
+ peet's coffee | mike | posthacking
+ hightops | joe | basketball
+ hightops | sally | basketball
+(4 rows)
+
+SELECT (p.hobbies).equipment.name, p.name, name(p.hobbies) FROM person* p;
+ name | name | name
+---------------+-------+-------------
+ advil | mike | posthacking
+ peet's coffee | mike | posthacking
+ hightops | joe | basketball
+ hightops | sally | basketball
+ advil | jeff | posthacking
+ peet's coffee | jeff | posthacking
+(6 rows)
+
+SELECT (p.hobbies).equipment.name, name(p.hobbies), p.name FROM ONLY person p;
+ name | name | name
+---------------+-------------+-------
+ advil | posthacking | mike
+ peet's coffee | posthacking | mike
+ hightops | basketball | joe
+ hightops | basketball | sally
+(4 rows)
+
+SELECT name(equipment(p.hobbies)), name(p.hobbies), p.name FROM person* p;
+ name | name | name
+---------------+-------------+-------
+ advil | posthacking | mike
+ peet's coffee | posthacking | mike
+ hightops | basketball | joe
+ hightops | basketball | sally
+ advil | posthacking | jeff
+ peet's coffee | posthacking | jeff
+(6 rows)
+
+SELECT name(equipment(hobby_construct(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_1a(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_1b(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_1c(hobby_construct_named(text 'skywalking', text 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_2a(text 'skywalking'));
+ name
+------
+ guts
+(1 row)
+
+SELECT name(equipment_named_ambiguous_2b(text 'skywalking'));
+ name
+---------------
+ advil
+ peet's coffee
+ hightops
+ guts
+(4 rows)
+
+SELECT hobbies_by_name('basketball');
+ hobbies_by_name
+-----------------
+ joe
+(1 row)
+
+SELECT name, overpaid(emp.*) FROM emp;
+ name | overpaid
+--------+----------
+ sharon | t
+ sam | t
+ bill | t
+ jeff | f
+ cim | f
+ linda | f
+(6 rows)
+
+--
+-- Try a few cases with SQL-spec row constructor expressions
+--
+SELECT * FROM equipment(ROW('skywalking', 'mer'));
+ name | hobby
+------+------------
+ guts | skywalking
+(1 row)
+
+SELECT name(equipment(ROW('skywalking', 'mer')));
+ name
+------
+ guts
+(1 row)
+
+SELECT *, name(equipment(h.*)) FROM hobbies_r h;
+ name | person | name
+-------------+--------+---------------
+ posthacking | mike | advil
+ posthacking | mike | peet's coffee
+ posthacking | jeff | advil
+ posthacking | jeff | peet's coffee
+ basketball | joe | hightops
+ basketball | sally | hightops
+ skywalking | | guts
+(7 rows)
+
+SELECT *, (equipment(CAST((h.*) AS hobbies_r))).name FROM hobbies_r h;
+ name | person | name
+-------------+--------+---------------
+ posthacking | mike | advil
+ posthacking | mike | peet's coffee
+ posthacking | jeff | advil
+ posthacking | jeff | peet's coffee
+ basketball | joe | hightops
+ basketball | sally | hightops
+ skywalking | | guts
+(7 rows)
+
+--
+-- functional joins
+--
+--
+-- instance rules
+--
+--
+-- rewrite rules
+--
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index ae89ed7f0b..e2f46deec3 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,14 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
# ----------
# Another group of parallel tests
# ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tid tidscan collate.icu.utf8 incremental_sort
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tid tidscan zstidscan collate.icu.utf8 incremental_sort
+
+# ----------
+# zedstore does a vacuum followed by checking the recycled pages; other active
+# transactions may affect the results, so it should not run in parallel with
+# other tests
+# ----------
+test: zedstore
# rules cannot run concurrently with any test that creates
# a view or rule in the public schema
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 23d7d0beb2..25e0ec67f6 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -58,14 +58,16 @@ char *host_platform = HOST_TUPLE;
static char *shellprog = SHELLPROG;
#endif
+static char gpdiffprog[MAXPGPATH] = "diff";
+
/*
* On Windows we use -w in diff switches to avoid problems with inconsistent
* newline representation. The actual result files will generally have
* Windows-style newlines, but the comparison files might or might not.
*/
#ifndef WIN32
-const char *basic_diff_opts = "";
-const char *pretty_diff_opts = "-U3";
+const char *basic_diff_opts = "-I REGRESS_IGNORE:";
+const char *pretty_diff_opts = "-I REGRESS_IGNORE: -U3";
#else
const char *basic_diff_opts = "-w";
const char *pretty_diff_opts = "-w -U3";
@@ -94,6 +96,8 @@ static bool port_specified_by_user = false;
static char *dlpath = PKGLIBDIR;
static char *user = NULL;
static _stringlist *extraroles = NULL;
+static char *initfile = NULL;
+static bool ignore_plans_tuple_order_diff = false;
static char *config_auth_datadir = NULL;
/* internal variables */
@@ -124,6 +128,9 @@ static void header(const char *fmt,...) pg_attribute_printf(1, 2);
static void status(const char *fmt,...) pg_attribute_printf(1, 2);
static void psql_command(const char *database, const char *query,...) pg_attribute_printf(2, 3);
+static int
+run_diff(const char *cmd, const char *filename);
+
/*
* allow core files if possible.
*/
@@ -1387,11 +1394,14 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
char diff[MAXPGPATH];
char cmd[MAXPGPATH * 3];
char best_expect_file[MAXPGPATH];
+ char diff_opts[MAXPGPATH];
+ char m_pretty_diff_opts[MAXPGPATH];
FILE *difffile;
int best_line_count;
int i;
int l;
const char *platform_expectfile;
+ const char *ignore_plans_opts;
/*
* We can pass either the resultsfile or the expectfile, they should have
@@ -1412,13 +1422,36 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
strcpy(++p, platform_expectfile);
}
+ if (ignore_plans_tuple_order_diff)
+ ignore_plans_opts = " -gpd_ignore_plans";
+ else
+ ignore_plans_opts = "";
+
/* Name to use for temporary diff file */
snprintf(diff, sizeof(diff), "%s.diff", resultsfile);
+ /* Add init file arguments if provided via commandline */
+ if (initfile)
+ {
+ snprintf(diff_opts, sizeof(diff_opts),
+ "%s%s --gpd_init %s", basic_diff_opts, ignore_plans_opts, initfile);
+
+ snprintf(m_pretty_diff_opts, sizeof(m_pretty_diff_opts),
+ "%s%s --gpd_init %s", pretty_diff_opts, ignore_plans_opts, initfile);
+ }
+ else
+ {
+ snprintf(diff_opts, sizeof(diff_opts),
+ "%s%s", basic_diff_opts, ignore_plans_opts);
+
+ snprintf(m_pretty_diff_opts, sizeof(m_pretty_diff_opts),
+ "%s%s", pretty_diff_opts, ignore_plans_opts);
+ }
+
/* OK, run the diff */
snprintf(cmd, sizeof(cmd),
- "diff %s \"%s\" \"%s\" > \"%s\"",
- basic_diff_opts, expectfile, resultsfile, diff);
+ "%s %s \"%s\" \"%s\" > \"%s\"",
+ gpdiffprog, diff_opts, expectfile, resultsfile, diff);
/* Is the diff file empty? */
if (run_diff(cmd, diff) == 0)
@@ -1450,8 +1483,8 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
}
snprintf(cmd, sizeof(cmd),
- "diff %s \"%s\" \"%s\" > \"%s\"",
- basic_diff_opts, alt_expectfile, resultsfile, diff);
+ "%s %s \"%s\" \"%s\" > \"%s\"",
+ gpdiffprog, diff_opts, alt_expectfile, resultsfile, diff);
if (run_diff(cmd, diff) == 0)
{
@@ -1478,8 +1511,8 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
if (platform_expectfile)
{
snprintf(cmd, sizeof(cmd),
- "diff %s \"%s\" \"%s\" > \"%s\"",
- basic_diff_opts, default_expectfile, resultsfile, diff);
+ "%s %s \"%s\" \"%s\" > \"%s\"",
+ gpdiffprog, diff_opts, default_expectfile, resultsfile, diff);
if (run_diff(cmd, diff) == 0)
{
@@ -1512,10 +1545,9 @@ results_differ(const char *testname, const char *resultsfile, const char *defaul
fclose(difffile);
}
- /* Run diff */
snprintf(cmd, sizeof(cmd),
- "diff %s \"%s\" \"%s\" >> \"%s\"",
- pretty_diff_opts, best_expect_file, resultsfile, difffilename);
+ "%s %s \"%s\" \"%s\" >> \"%s\"",
+ gpdiffprog, m_pretty_diff_opts, best_expect_file, resultsfile, difffilename);
run_diff(cmd, difffilename);
unlink(diff);
@@ -1939,6 +1971,27 @@ run_single_test(const char *test, test_function tfunc)
status_end();
}
+/*
+ * Find the gpdiff.pl binary.
+ */
+static void
+find_helper_programs(const char *argv0)
+{
+ if (find_other_exec(argv0, "gpdiff.pl", "(PostgreSQL)", gpdiffprog) != 0)
+ {
+ char full_path[MAXPGPATH];
+
+ if (find_my_exec(argv0, full_path) < 0)
+ strlcpy(full_path, progname, sizeof(full_path));
+
+ fprintf(stderr,
+ _("The program \"gpdiff.pl\" is needed by %s "
+ "but was not found in the same directory as \"%s\".\n"),
+ progname, full_path);
+ exit(1);
+ }
+}
+
/*
* Create the summary-output files (making them empty if already existing)
*/
@@ -2120,6 +2173,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
{"load-extension", required_argument, NULL, 22},
{"config-auth", required_argument, NULL, 24},
{"max-concurrent-tests", required_argument, NULL, 25},
+ {"ignore-plans-and-tuple-order-diff", no_argument, NULL, 26},
+ {"ignore-tuple-order-diff", no_argument, NULL, 27},
{NULL, 0, NULL, 0}
};
@@ -2130,6 +2185,7 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
int option_index;
char buf[MAXPGPATH * 4];
char buf2[MAXPGPATH * 4];
+ bool ignore_tuple_order_diff = false;
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -2249,6 +2305,14 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
case 25:
max_concurrent_tests = atoi(optarg);
break;
+ case 26:
+ /* ignore plans means also ignore tuple order differences */
+ ignore_plans_tuple_order_diff = true;
+ ignore_tuple_order_diff = true;
+ break;
+ case 27:
+ ignore_tuple_order_diff = true;
+ break;
default:
/* getopt_long already emitted a complaint */
fprintf(stderr, _("\nTry \"%s -h\" for more information.\n"),
@@ -2293,6 +2357,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
/*
* Initialization
*/
+ if (ignore_tuple_order_diff || ignore_plans_tuple_order_diff)
+ find_helper_programs(argv[0]);
open_result_files();
initialize_environment();
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 525bdc804f..5b0490cd48 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -138,7 +138,9 @@ test: sysviews
test: tsrf
test: tid
test: tidscan
+test: zstidscan
test: collate.icu.utf8
+test: zedstore
test: rules
test: psql
test: psql_crosstab
diff --git a/src/test/regress/sql/box.sql b/src/test/regress/sql/box.sql
index cd3e00261f..1d219bd8ca 100644
--- a/src/test/regress/sql/box.sql
+++ b/src/test/regress/sql/box.sql
@@ -148,6 +148,7 @@ INSERT INTO box_temp
('(-infinity,-infinity)(infinity,infinity)');
SET enable_seqscan = false;
+SET enable_bitmapscan = false;
SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
EXPLAIN (COSTS OFF) SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
@@ -186,6 +187,7 @@ SELECT * FROM box_temp WHERE f1 ~= '(20,20),(40,40)';
EXPLAIN (COSTS OFF) SELECT * FROM box_temp WHERE f1 ~= '(20,20),(40,40)';
RESET enable_seqscan;
+RESET enable_bitmapscan;
DROP INDEX box_spgist;
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index b27643cad6..6b235d840d 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -722,7 +722,9 @@ SELECT count(*) FROM dupindexcols
-- Check ordering of =ANY indexqual results (bug in 9.2.0)
--
-vacuum tenk1; -- ensure we get consistent plans here
+SET enable_seqscan = OFF;
+SET enable_indexscan = OFF;
+SET enable_bitmapscan = OFF;
explain (costs off)
SELECT unique1 FROM tenk1
@@ -742,6 +744,7 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+RESET enable_indexscan;
SET enable_indexonlyscan = OFF;
explain (costs off)
@@ -753,6 +756,8 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
+RESET enable_seqscan;
+RESET enable_bitmapscan;
RESET enable_indexonlyscan;
--
diff --git a/src/test/regress/sql/create_table.sql b/src/test/regress/sql/create_table.sql
index d257679ba6..702c97462f 100644
--- a/src/test/regress/sql/create_table.sql
+++ b/src/test/regress/sql/create_table.sql
@@ -285,15 +285,16 @@ CREATE TABLE IF NOT EXISTS as_select1 AS EXECUTE select1;
DROP TABLE as_select1;
DEALLOCATE select1;
+-- FIXME: enable this test when we introduce meta-page overflow for zedstore
-- create an extra wide table to test for issues related to that
-- (temporarily hide query, to avoid the long CREATE TABLE stmt)
-\set ECHO none
-SELECT 'CREATE TABLE extra_wide_table(firstc text, '|| array_to_string(array_agg('c'||i||' bool'),',')||', lastc text);'
-FROM generate_series(1, 1100) g(i)
-\gexec
-\set ECHO all
-INSERT INTO extra_wide_table(firstc, lastc) VALUES('first col', 'last col');
-SELECT firstc, lastc FROM extra_wide_table;
+-- \set ECHO none
+-- SELECT 'CREATE TABLE extra_wide_table(firstc text, '|| array_to_string(array_agg('c'||i||' bool'),',')||', lastc text);'
+-- FROM generate_series(1, 1100) g(i)
+-- \gexec
+-- \set ECHO all
+-- INSERT INTO extra_wide_table(firstc, lastc) VALUES('first col', 'last col');
+-- SELECT firstc, lastc FROM extra_wide_table;
-- check that tables with oids cannot be created anymore
CREATE TABLE withoid() WITH OIDS;
diff --git a/src/test/regress/sql/zedstore.sql b/src/test/regress/sql/zedstore.sql
new file mode 100644
index 0000000000..48dfe019ec
--- /dev/null
+++ b/src/test/regress/sql/zedstore.sql
@@ -0,0 +1,211 @@
+-- simple tests to iteratively build the zedstore
+-- create and drop works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+drop table t_zedstore;
+-- insert and select works
+create table t_zedstore(c1 int, c2 int, c3 int) USING zedstore;
+insert into t_zedstore select i,i+1,i+2 from generate_series(1, 10)i;
+select * from t_zedstore;
+-- selecting only few columns work
+select c1, c3 from t_zedstore;
+-- only few columns in output and where clause work
+select c3 from t_zedstore where c2 > 5;
+-- where clause with invalid ctid works
+select * from t_zedstore where ctid = '(0,0)';
+
+-- Test abort works
+begin;
+insert into t_zedstore select i,i+1,i+2 from generate_series(21, 25)i;
+abort;
+insert into t_zedstore select i,i+1,i+2 from generate_series(31, 35)i;
+select * from t_zedstore;
+
+--
+-- Test indexing
+--
+create index on t_zedstore (c1);
+set enable_seqscan=off;
+set enable_indexscan=on;
+set enable_bitmapscan=off;
+
+-- index scan
+select * from t_zedstore where c1 = 5;
+
+-- index-only scan
+select c1 from t_zedstore where c1 = 5;
+
+-- bitmap scan
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+select c1, c2 from t_zedstore where c1 between 5 and 10;
+
+--
+-- Test DELETE and UPDATE
+--
+delete from t_zedstore where c2 = 5;
+select * from t_zedstore;
+delete from t_zedstore where c2 < 5;
+select * from t_zedstore;
+
+update t_zedstore set c2 = 100 where c1 = 8;
+select * from t_zedstore;
+
+--
+-- Test page deletion, by deleting a bigger range of values
+--
+insert into t_zedstore select i,i+1,i+2 from generate_series(10000, 15000)i;
+delete from t_zedstore where c1 >= 10000;
+
+--
+-- Test VACUUM
+--
+vacuum t_zedstore;
+select * from t_zedstore;
+
+--
+-- Test in-line toasting
+--
+create table t_zedtoast(c1 int, t text) USING zedstore;
+insert into t_zedtoast select i, repeat('x', 10000) from generate_series(1, 10) i;
+
+select c1, length(t) from t_zedtoast;
+
+-- TODO: this test won't actually work when we start using the UNDO framework
+-- because we will not have control over when the undo record will be removed
+delete from t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+vacuum t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+
+--
+-- Test out-of-line toasting
+--
+insert into t_zedtoast select i, repeat('x', 1000000) from generate_series(1, 10) i;
+
+select c1, length(t) from t_zedtoast;
+
+-- TODO: this test won't actually work when we start using the UNDO framework
+-- because we will not have control over when the undo record will be removed
+delete from t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+vacuum t_zedtoast;
+select count(*) > 0 as has_toast_pages from pg_zs_toast_pages('t_zedtoast');
+
+--
+-- Test NULL values
+--
+create table t_zednullvalues(c1 int, c2 int) USING zedstore;
+insert into t_zednullvalues values(1, NULL), (NULL, 2);
+select * from t_zednullvalues;
+select c2 from t_zednullvalues;
+update t_zednullvalues set c1 = 1, c2 = NULL;
+select * from t_zednullvalues;
+
+--
+-- Test COPY
+--
+create table t_zedcopy(a serial, b int, c text not null default 'stuff', d text,e text) USING zedstore;
+
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+9999 \N \\N \NN \N
+10000 21 31 41 51
+\.
+
+COPY t_zedcopy (b, d) from stdin;
+1 test_1
+\.
+
+COPY t_zedcopy (b, d) from stdin;
+2 test_2
+3 test_3
+4 test_4
+5 test_5
+\.
+
+COPY t_zedcopy (a, b, c, d, e) from stdin;
+10001 22 32 42 52
+10002 23 33 43 53
+10003 24 34 44 54
+10004 25 35 45 55
+10005 26 36 46 56
+\.
+
+select * from t_zedcopy;
+COPY t_zedcopy (a, d, e) to stdout;
+
+--
+-- Also test delete and update on the table that was populated with COPY.
+-- This exercises splitting the array item. (A table not populated with
+-- COPY only contains single items, at the moment.)
+--
+
+delete from t_zedcopy where b = 4;
+select * from t_zedcopy;
+delete from t_zedcopy where b < 3;
+select * from t_zedcopy;
+
+update t_zedcopy set b = 100 where b = 5;
+select * from t_zedcopy;
+
+
+-- Test rolling back COPY
+begin;
+COPY t_zedcopy (b, d) from stdin;
+20001 test_1
+20002 test_2
+20003 test_3
+20004 test_4
+\.
+rollback;
+select count(*) from t_zedcopy where b >= 20000;
+
+--
+-- Test zero column table
+--
+create table t_zwithzerocols() using zedstore;
+insert into t_zwithzerocols select t.* from t_zwithzerocols t right join generate_series(1,1) on true;
+select count(*) from t_zwithzerocols;
+
+-- Test for alter table add column
+create table t_zaddcol(a int) using zedstore;
+insert into t_zaddcol select * from generate_series(1, 3);
+-- rewrite case
+alter table t_zaddcol add column b int generated always as (a + 1) stored;
+select * from t_zaddcol;
+-- test alter table add column with no default
+create table t_zaddcol_simple(a int) using zedstore;
+insert into t_zaddcol_simple values (1);
+alter table t_zaddcol_simple add b int;
+select * from t_zaddcol_simple;
+insert into t_zaddcol_simple values(2,3);
+select * from t_zaddcol_simple;
+-- fixed length default value stored in catalog
+alter table t_zaddcol add column c int default 3;
+select * from t_zaddcol;
+-- variable length default value stored in catalog
+alter table t_zaddcol add column d text default 'abcdefgh';
+select d from t_zaddcol;
+-- insert after add column
+insert into t_zaddcol values (2);
+select * from t_zaddcol;
+insert into t_zaddcol (a, c, d) values (3,5, 'test_insert');
+select b,c,d from t_zaddcol;
+
+--
+-- Test TABLESAMPLE
+--
+-- The regular tablesample.sql test doesn't directly work for zedstore, as
+-- it uses fillfactor to create a specific block layout for
+-- heap. Hence, the output differs between heap and zedstore tables while
+-- sampling. We need to use many tuples here to get multiple logical
+-- blocks, as there is no way to force TIDs to spread / jump for zedstore.
+--
+CREATE TABLE t_ztablesample (id int, name text) using zedstore;
+INSERT INTO t_ztablesample
+ SELECT i, repeat(i::text, 2) FROM generate_series(0, 299) s(i);
+-- let's delete half the rows (even-numbered ids) to limit the output
+DELETE FROM t_ztablesample WHERE id%2 = 0;
+-- should return ALL visible tuples from SOME blocks
+SELECT ctid,t.id FROM t_ztablesample AS t TABLESAMPLE SYSTEM (50) REPEATABLE (0);
+-- should return SOME visible tuples but from ALL the blocks
+SELECT ctid,id FROM t_ztablesample TABLESAMPLE BERNOULLI (50) REPEATABLE (0);
diff --git a/src/test/regress/sql/zstidscan.sql b/src/test/regress/sql/zstidscan.sql
new file mode 100644
index 0000000000..df4a3e77b9
--- /dev/null
+++ b/src/test/regress/sql/zstidscan.sql
@@ -0,0 +1,69 @@
+select '1'::zstid;
+select '-1'::zstid;
+select -'1'::zstid;
+
+-- int2 conversion
+select 1::int2::zstid;
+select (-1)::int2::zstid;
+select -1::int2::zstid;
+
+-- int4 conversion
+select 1::zstid;
+select (-1)::zstid;
+select -1::zstid;
+
+-- int8 conversion
+select 1::int8::zstid;
+select 1000000000000000::zstid; -- bigger than MaxZSTid
+select (-1)::int8::zstid;
+select -1::int8::zstid;
+
+create table if not exists zstidscan(a int) using zedstore;
+
+insert into zstidscan values (1), (2), (3);
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan;
+select ctid, ctid::zstid as zstid, a from zstidscan;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid = 2;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid = 2;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid >= 2;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid >= 2;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <= 2;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <= 2;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid < 2;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid < 2;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid > 2;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid > 2;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <> 2;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid <> 2;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid in (2,3);
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid in (2,3);
+
+-- TODO: casting to int2 or int4 might be useful
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int2 % 3 = 0;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int4 % 3 = 0;
+
+explain (costs off)
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int8 % 3 = 0;
+select ctid, ctid::zstid as zstid, a from zstidscan where ctid::zstid::int8 % 3 = 0;
+
+-- TODO: Add necessary functions to do these useful aggregates on zstid types
+select max(ctid::zstid) from zstidscan;
+select min(ctid::zstid) from zstidscan;
+
+drop table zstidscan;
diff --git a/src/test/storageperf/driver.sql b/src/test/storageperf/driver.sql
new file mode 100644
index 0000000000..3a197ee6fe
--- /dev/null
+++ b/src/test/storageperf/driver.sql
@@ -0,0 +1,45 @@
+--
+-- Main script to run all the tests and print the results.
+--
+--
+
+-- First run the tests using heap.
+DROP SCHEMA IF EXISTS storagetest_heap CASCADE;
+CREATE SCHEMA storagetest_heap;
+SET search_path='storagetest_heap';
+
+CREATE TABLE results (testname text, size numeric, walsize numeric, time numeric) USING heap;
+
+SET default_table_access_method=heap;
+checkpoint;
+\i tests.sql
+
+
+-- Repeat with zedstore
+
+DROP SCHEMA IF EXISTS storagetest_zedstore CASCADE;
+CREATE SCHEMA storagetest_zedstore;
+SET search_path='storagetest_zedstore';
+
+CREATE TABLE results (testname text, size numeric, walsize numeric, time numeric) USING heap;
+
+SET default_table_access_method=zedstore;
+checkpoint;
+\i tests.sql
+
+
+SET search_path='public';
+
+SELECT COALESCE(h.testname, zs.testname) as testname,
+ h.time as "heap time",
+ h.size as "heap size",
+ h.walsize as "heap wal",
+
+ zs.time as "ZS time",
+ zs.size as "ZS size",
+ zs.walsize as "ZS wal",
+ round(zs.time / h.time, 2) as "time ratio",
+ round(zs.size / h.size, 2) as "size ratio",
+ case when zs.walsize > 0 and h.walsize > 0 then round(zs.walsize / h.walsize, 2) else null end as "wal ratio"
+FROM storagetest_heap.results h
+FULL OUTER JOIN storagetest_zedstore.results zs ON (h.testname = zs.testname);
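+
+-- A minimal sketch of how this suite might be invoked (an assumption, not
+-- spelled out in the patch): run it from this directory, so that the
+-- "\i tests.sql" include resolves, and against the "postgres" database,
+-- since the pgbench-based scripts hard-code that database name:
+--
+--     psql -X -d postgres -f driver.sql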
diff --git a/src/test/storageperf/sql/dml.sql b/src/test/storageperf/sql/dml.sql
new file mode 100644
index 0000000000..e65699b2fa
--- /dev/null
+++ b/src/test/storageperf/sql/dml.sql
@@ -0,0 +1,110 @@
+DROP TABLE IF EXISTS public.wide_table;
+CREATE TABLE public.wide_table(i int, j int, k int, l int, m int, n int, o int);
+INSERT INTO public.wide_table SELECT g, g+1, g+2, g+3, g+4, g+5, g+6 FROM generate_series(1, 100000) g;
+
+-- DELETE
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+\! echo "BEGIN; DELETE FROM public.wide_table WHERE i >= 1; ROLLBACK;" > /tmp/dml-pgbench-script.sql
+
+\! pgbench -n -c 1 -t40 -f /tmp/dml-pgbench-script.sql postgres
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+INSERT INTO results (testname, size, walsize, time)
+VALUES ('dml, pgbench, DELETE',
+ pg_total_relation_size('public.wide_table'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+-- DELETE RETURNING
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+\! echo "BEGIN; DELETE FROM public.wide_table WHERE i >= 1 RETURNING j; ROLLBACK;" > /tmp/dml-pgbench-script.sql
+
+\! pgbench -n -c 1 -t40 -f /tmp/dml-pgbench-script.sql postgres
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+INSERT INTO results (testname, size, walsize, time)
+VALUES ('dml, pgbench, DELETE RETURNING',
+ pg_total_relation_size('public.wide_table'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+-- UPSERT
+CREATE UNIQUE INDEX wide_table_i ON public.wide_table(i);
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+\! echo "BEGIN; INSERT INTO public.wide_table SELECT g FROM generate_series(1, 100000) g ON CONFLICT (i) DO UPDATE SET j=-1, k=-1, l=-1, m=-1, n=-1, o=-1 WHERE public.wide_table.k>=1; ROLLBACK;" > /tmp/dml-pgbench-script.sql
+
+\! pgbench -n -c 1 -t5 -f /tmp/dml-pgbench-script.sql postgres
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+INSERT INTO results (testname, size, walsize, time)
+VALUES ('dml, pgbench, UPSERT',
+ pg_total_relation_size('public.wide_table'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+DROP INDEX public.wide_table_i;
+
+-- ON CONFLICT DO NOTHING
+CREATE UNIQUE INDEX wide_table_i ON public.wide_table(i);
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+\! echo "INSERT INTO public.wide_table SELECT g FROM generate_series(1, 100000) g ON CONFLICT (i) DO NOTHING;" > /tmp/dml-pgbench-script.sql
+
+\! pgbench -n -c 1 -t5 -f /tmp/dml-pgbench-script.sql postgres
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+INSERT INTO results (testname, size, walsize, time)
+VALUES ('dml, pgbench, ON CONFLICT DO NOTHING',
+ pg_total_relation_size('public.wide_table'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+DROP INDEX public.wide_table_i;
+
+-- Tid scans
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+\! echo "SET enable_seqscan TO off;" > /tmp/dml-pgbench-script.sql
+
+\! echo "EXPLAIN ANALYZE SELECT i FROM wide_table WHERE ctid IN (SELECT ctid from wide_table);" >> /tmp/dml-pgbench-script.sql
+
+\! echo "RESET enable_seqscan;" >> /tmp/dml-pgbench-script.sql
+
+\! pgbench -n -c 1 -t5 -f /tmp/dml-pgbench-script.sql postgres
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+INSERT INTO results (testname, size, walsize, time)
+VALUES ('dml, pgbench, TidScan',
+ pg_total_relation_size('public.wide_table'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+-- Row level locking
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+\! echo "SELECT i FROM wide_table WHERE j >= 100 FOR UPDATE;" > /tmp/dml-pgbench-script.sql
+
+\! pgbench -n -c 1 -t10 -f /tmp/dml-pgbench-script.sql postgres
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+INSERT INTO results (testname, size, walsize, time)
+VALUES ('dml, pgbench, Row level locking',
+ pg_total_relation_size('public.wide_table'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
diff --git a/src/test/storageperf/sql/inlinecompress.sql b/src/test/storageperf/sql/inlinecompress.sql
new file mode 100644
index 0000000000..9e89117d10
--- /dev/null
+++ b/src/test/storageperf/sql/inlinecompress.sql
@@ -0,0 +1,133 @@
+-- Tests with a single-column table of long, compressible text values.
+
+CREATE /* UNLOGGED */ TABLE inlinecompress (i text);
+
+-- Populate the table with a bunch of INSERT ... SELECT statements.
+-- Measure how long it takes, and the resulting table size.
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+INSERT INTO inlinecompress SELECT repeat('x', 5000) FROM generate_series(1, 10000);
+INSERT INTO inlinecompress SELECT repeat('x', 5000) FROM generate_series(1, 10000);
+INSERT INTO inlinecompress SELECT repeat('x', 5000) FROM generate_series(1, 10000);
+INSERT INTO inlinecompress SELECT repeat('x', 5000) FROM generate_series(1, 10000);
+INSERT INTO inlinecompress SELECT repeat('x', 5000) FROM generate_series(1, 10000);
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('inlinecompress, insert-select',
+ pg_total_relation_size('inlinecompress'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+COPY inlinecompress TO '/tmp/inlinecompress.data'; -- dump the data, for COPY test below.
+
+--
+-- Truncate and populate it again with the same data, but this time using COPY.
+--
+TRUNCATE inlinecompress;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+COPY inlinecompress FROM '/tmp/inlinecompress.data';
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('inlinecompress, COPY',
+ pg_total_relation_size('inlinecompress'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- SELECT
+--
+
+VACUUM FREEZE inlinecompress;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+SELECT LENGTH(i) AS len FROM inlinecompress GROUP BY len;
+SELECT LENGTH(i) AS len FROM inlinecompress GROUP BY len;
+SELECT LENGTH(i) AS len FROM inlinecompress GROUP BY len;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('inlinecompress, SELECT, seqscan',
+ pg_total_relation_size('inlinecompress'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- Bitmap scans
+--
+
+CREATE INDEX ON inlinecompress (i);
+
+set enable_seqscan=off;
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+SELECT LENGTH(i) AS len FROM inlinecompress where LENGTH(i) < 100000 GROUP BY len;
+SELECT LENGTH(i) AS len FROM inlinecompress where LENGTH(i) < 100000 GROUP BY len;
+SELECT LENGTH(i) AS len FROM inlinecompress where LENGTH(i) < 100000 GROUP BY len;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('inlinecompress, SELECT, bitmap scan',
+ pg_total_relation_size('inlinecompress'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+reset enable_seqscan;
+reset enable_indexscan;
+reset enable_bitmapscan;
+
+
+--
+-- Delete half of the rows
+--
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+DELETE FROM inlinecompress WHERE ctid::zstid::int8 % 2 = 0;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('inlinecompress, deleted half',
+ pg_total_relation_size('inlinecompress'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- And vacuum the deleted rows away
+--
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+VACUUM inlinecompress;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('inlinecompress, vacuumed',
+ pg_total_relation_size('inlinecompress'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
diff --git a/src/test/storageperf/sql/lockperf.sql b/src/test/storageperf/sql/lockperf.sql
new file mode 100644
index 0000000000..880ff9dfd8
--- /dev/null
+++ b/src/test/storageperf/sql/lockperf.sql
@@ -0,0 +1,49 @@
+drop view if exists public.redirector;
+drop table if exists twocol;
+
+CREATE TABLE twocol (i int4, val int4);
+
+CREATE VIEW public.redirector AS SELECT * FROM twocol;
+
+INSERT INTO twocol SELECT g, 0 FROM generate_series(1, 10) g;
+
+COPY twocol TO '/tmp/twocol.data'; -- dump the data, for COPY test below.
+TRUNCATE twocol;
+COPY twocol FROM '/tmp/twocol.data';
+
+
+-- FOR SHARE
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+\! echo "\set i random(1, 10)" > /tmp/lockperf-pgbench-script.sql
+\! echo "SELECT i FROM redirector WHERE i = :i FOR SHARE" >> /tmp/lockperf-pgbench-script.sql
+
+\! pgbench -n -t1000 -c 20 -j 4 -f /tmp/lockperf-pgbench-script.sql postgres
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('lockperf, pgbench, FOR SHARE',
+ pg_total_relation_size('twocol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+-- UPDATE
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+\! echo "\set i random(1, 10)" > /tmp/lockperf-pgbench-script.sql
+\! echo "UPDATE redirector SET val = val + 1 WHERE i = :i;" >> /tmp/lockperf-pgbench-script.sql
+
+\! pgbench -n -t1000 -c 20 -j 4 -f /tmp/lockperf-pgbench-script.sql postgres
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('lockperf, pgbench, UPDATE',
+ pg_total_relation_size('twocol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
diff --git a/src/test/storageperf/sql/nullcol.sql b/src/test/storageperf/sql/nullcol.sql
new file mode 100644
index 0000000000..bb974c0491
--- /dev/null
+++ b/src/test/storageperf/sql/nullcol.sql
@@ -0,0 +1,45 @@
+-- Tests with a narrow, single-column table, with some nulls.
+
+CREATE TABLE nullcol (i int4);
+
+-- Populate the table with a bunch of INSERT ... SELECT statements.
+-- Measure how long it takes, and the resulting table size.
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+INSERT INTO nullcol SELECT CASE WHEN g % 2 = 0 THEN NULL ELSE g END FROM generate_series(1, 100000) g ;
+INSERT INTO nullcol SELECT NULL FROM generate_series(1, 100000) g;
+INSERT INTO nullcol SELECT CASE WHEN g % 2 = 0 THEN NULL ELSE g END FROM generate_series(1, 100000) g ;
+INSERT INTO nullcol SELECT g FROM generate_series(1, 100000) g;
+INSERT INTO nullcol SELECT CASE WHEN g % 2 = 0 THEN NULL ELSE g END FROM generate_series(1, 100000) g ;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('nullcol, insert-select',
+ pg_total_relation_size('nullcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+COPY nullcol TO '/tmp/nullcol.data'; -- dump the data, for COPY test below.
+
+--
+-- Truncate and populate it again with the same data, but this time using COPY.
+--
+TRUNCATE nullcol;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+COPY nullcol FROM '/tmp/nullcol.data';
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('nullcol, COPY',
+ pg_total_relation_size('nullcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
diff --git a/src/test/storageperf/sql/onecol.sql b/src/test/storageperf/sql/onecol.sql
new file mode 100644
index 0000000000..f5c4c819ef
--- /dev/null
+++ b/src/test/storageperf/sql/onecol.sql
@@ -0,0 +1,158 @@
+-- Tests with a narrow, single-column table.
+
+CREATE /* UNLOGGED */ TABLE onecol (i int4);
+
+-- Populate the table with a bunch of INSERT ... SELECT statements.
+-- Measure how long it takes, and the resulting table size.
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+INSERT INTO onecol SELECT generate_series(1, 100000);
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('onecol, insert-select',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+COPY onecol TO '/tmp/onecol.data'; -- dump the data, for COPY test below.
+
+--
+-- Truncate and populate it again with the same data, but this time using COPY.
+--
+TRUNCATE onecol;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+COPY onecol FROM '/tmp/onecol.data';
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('onecol, COPY',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- SELECT
+--
+
+VACUUM FREEZE onecol;
+
+-- Test with seq scan parallelism enabled.
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+explain SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('onecol, SELECT, seqscan',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+-- Test with seq scan parallelism disabled.
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+SET max_parallel_workers_per_gather to 0;
+
+explain SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+SELECT SUM(i) FROM onecol;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+VALUES ('onecol, SELECT, seqscan, parallel seqscan disabled',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+RESET max_parallel_workers_per_gather;
+--
+-- Bitmap scans
+--
+
+CREATE INDEX ON onecol (i);
+
+set enable_seqscan=off;
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+SELECT SUM(i) FROM onecol where i < 400000;
+SELECT SUM(i) FROM onecol where i < 400000;
+SELECT SUM(i) FROM onecol where i < 400000;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('onecol, SELECT, bitmap scan',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+reset enable_seqscan;
+reset enable_indexscan;
+reset enable_bitmapscan;
+
+
+--
+-- Delete half of the rows
+--
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+DELETE FROM onecol WHERE i%2 = 0;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('onecol, deleted half',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- And vacuum the deleted rows away
+--
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+VACUUM onecol;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('onecol, vacuumed',
+ pg_total_relation_size('onecol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
diff --git a/src/test/storageperf/sql/toast.sql b/src/test/storageperf/sql/toast.sql
new file mode 100644
index 0000000000..16b03c91da
--- /dev/null
+++ b/src/test/storageperf/sql/toast.sql
@@ -0,0 +1,133 @@
+-- Tests with a single text column whose values are large enough to be TOASTed.
+
+CREATE /* UNLOGGED */ TABLE toastcol (i text);
+
+-- Populate the table with a bunch of INSERT ... SELECT statements.
+-- Measure how long it takes, and the resulting table size.
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+INSERT INTO toastcol SELECT repeat('x', 1000000) FROM generate_series(1, 100);
+INSERT INTO toastcol SELECT repeat('x', 1000000) FROM generate_series(1, 100);
+INSERT INTO toastcol SELECT repeat('x', 1000000) FROM generate_series(1, 100);
+INSERT INTO toastcol SELECT repeat('x', 1000000) FROM generate_series(1, 100);
+INSERT INTO toastcol SELECT repeat('x', 1000000) FROM generate_series(1, 100);
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('toastcol, insert-select',
+ pg_total_relation_size('toastcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+COPY toastcol TO '/tmp/toastcol.data'; -- dump the data, for COPY test below.
+
+--
+-- Truncate and populate it again with the same data, but this time using COPY.
+--
+TRUNCATE toastcol;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+COPY toastcol FROM '/tmp/toastcol.data';
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('toastcol, COPY',
+ pg_total_relation_size('toastcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- SELECT
+--
+
+VACUUM FREEZE toastcol;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+SELECT LENGTH(i) AS len FROM toastcol GROUP BY len;
+SELECT LENGTH(i) AS len FROM toastcol GROUP BY len;
+SELECT LENGTH(i) AS len FROM toastcol GROUP BY len;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('toastcol, SELECT, seqscan',
+ pg_total_relation_size('toastcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- Bitmap scans
+--
+
+CREATE INDEX ON toastcol (i);
+
+set enable_seqscan=off;
+set enable_indexscan=off;
+set enable_bitmapscan=on;
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+SELECT LENGTH(i) AS len FROM toastcol where LENGTH(i) < 100000 GROUP BY len;
+SELECT LENGTH(i) AS len FROM toastcol where LENGTH(i) < 100000 GROUP BY len;
+SELECT LENGTH(i) AS len FROM toastcol where LENGTH(i) < 100000 GROUP BY len;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('toastcol, SELECT, bitmap scan',
+ pg_total_relation_size('toastcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+reset enable_seqscan;
+reset enable_indexscan;
+reset enable_bitmapscan;
+
+
+--
+-- Delete half of the rows
+--
+
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+DELETE FROM toastcol WHERE ctid::zstid::int8 % 2 = 0;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('toastcol, deleted half',
+ pg_total_relation_size('toastcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
+
+--
+-- And vacuum the deleted rows away
+--
+select pg_current_wal_insert_lsn() as wal_before, extract(epoch from now()) as time_before
+\gset
+
+VACUUM toastcol;
+
+select pg_current_wal_insert_lsn() as wal_after, extract(epoch from now()) as time_after
+\gset
+
+
+INSERT INTO results (testname, size, walsize, time)
+ VALUES ('toastcol, vacuumed',
+ pg_total_relation_size('toastcol'),
+ :'wal_after'::pg_lsn - :'wal_before',
+ :time_after - :time_before);
diff --git a/src/test/storageperf/tests.sql b/src/test/storageperf/tests.sql
new file mode 100644
index 0000000000..053d4c1bd8
--- /dev/null
+++ b/src/test/storageperf/tests.sql
@@ -0,0 +1,8 @@
+-- Test "schedule". List all the tests you want to run here.
+
+\i sql/onecol.sql
+\i sql/nullcol.sql
+\i sql/lockperf.sql
+\i sql/inlinecompress.sql
+\i sql/toast.sql
+\i sql/dml.sql
Hi,
Thanks for the updated patch. It's quite a massive amount of code - I
don't think we've had many 2MB patches in the past, so this is by no means
a full review.
1) the psql_1.out is missing a bit of expected output (due to 098fb0079)
2) I'm getting crashes in intarray contrib, due to hitting this error in
lwlock.c (backtrace attached):
/* Ensure we will have room to remember the lock */
if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
elog(ERROR, "too many LWLocks taken");
I haven't investigated this too much, but it's regular build with
asserts and TAP tests, so it should be simple to reproduce using "make
check-world" I guess.
3) I did a very simple benchmark, loading TPC-H data (about 75GB),
followed by pg_dump, and the duration (in seconds) looks like this:
         master   zedstore/pglz   zedstore/lz4
-------------------------------------------------
copy       1855           68092           2131
dump        751             905            811
And the size of the lineitem table (as shown by \d+) is:
master: 64GB
zedstore/pglz: 51GB
zedstore/lz4: 20GB
It's mostly expected lz4 beats pglz in performance and compression
ratio, but this seems a bit too extreme I guess. Per past benchmarks
(e.g. [1] and [2]) the difference in compression/decompression time
should be maybe 1-2x or something like that, not 35x like here.
[1]: /messages/by-id/20130621000900.GA12425@alap2.anarazel.de
[2]: /messages/by-id/20130605150144.GD28067@alap2.anarazel.de
Furthermore, the pglz compression is not consuming the most CPU, at
least that's what perf says:
24.82% postgres [.] encode_chunk_varlen
20.49% postgres [.] decode_chunk
13.01% postgres [.] merge_attstream_guts.isra.0
12.68% libc-2.32.so [.] __memmove_avx_unaligned_erms
8.72% postgres [.] encode_chunk_fixed
6.16% postgres [.] pglz_compress
4.36% postgres [.] decode_attstream_cont
2.27% postgres [.] 0x00000000000baff0
1.84% postgres [.] AllocSetAlloc
0.79% postgres [.] append_attstream
0.70% postgres [.] palloc
So I wonder if this is a sign of a deeper issue - maybe the lower
compression ratio (for pglz) triggers some sort of feedback loop in
zedstore, or something like that? Not sure, but this seems strange.
4) I looked at some of the code, like merge_attstream etc. and I wonder
if this might be related to some of the FIXME comments. For example this
bit in merge_attstream seems interesting:
* FIXME: we don't actually pay attention to the compression anymore.
* We never repack.
* FIXME: this is backwords, the normal fast path is if (firsttid1 >
lasttid2)
But I suppose that should affect both pglz and lz4, and I'm not sure how
up to date those comments actually are.
BTW the comments in general need updating and tidying up, to make
reviews easier. For example the merge_attstream comment references
attstream1 and attstream2, but those are not the current parameters of
the function.
5) IMHO there should be a #define specifying the maximum number of items
per chunk (60). Currently there are literal constants used in various
places, sometimes 60, sometimes 59 etc. which makes it harder to
understand the code. FWIW 60 seems a bit low, but maybe it's OK.
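For illustration, I mean something along these lines (the names here are
invented, not what the patch actually uses):
/* sketch only: one place to define the simple-8b chunk capacity */
#define ZS_SIMPLE8B_MAX_ITEMS	60	/* 64-bit codeword: 4-bit selector + 60 payload bits */
#define ZS_SIMPLE8B_LAST_ITEM	(ZS_SIMPLE8B_MAX_ITEMS - 1)	/* the "59" call sites */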
6) I do think ZSAttStream should track which compression is used by the
stream, for two main reasons. Firstly, there's another patch to support
"custom compression" methods, which (also) allows multiple compression
methods per column. It'd be a bit strange to support that for varlena
columns in heap table, and not here, I guess. Secondly, I think one of
the interesting columnstore features down the road will be execution on
compressed data, which however requires compression method designed for
that purpose, and it's often datatype-specific (delta encoding, ...).
I don't think we need to go as far as supporting "custom" compression
methods here, but I think we should allow different built-in compression
methods for different attstreams.
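To make that concrete, roughly what I have in mind is something like this
(a sketch only, field names invented - not the actual ZSAttStream layout):
typedef enum ZSCompressionMethod
{
	ZS_COMPRESSION_NONE = 0,
	ZS_COMPRESSION_PGLZ,
	ZS_COMPRESSION_LZ4
	/* later: room for datatype-specific encodings (delta, ...) */
} ZSCompressionMethod;

typedef struct ZSAttStreamSketch
{
	uint32		t_size;			/* total size of the stream, as today */
	uint16		t_compression;	/* ZSCompressionMethod of the payload */
	uint16		t_flags;		/* reserved for future use */
	char		t_payload[FLEXIBLE_ARRAY_MEMBER];	/* chunk data */
} ZSAttStreamSketch;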
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Nov 12, 2020, at 2:40 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,
Thanks for the updated patch. It's quite a massive amount of code - I
don't think we've had many 2MB patches in the past, so this is by no means
a full review.
Thanks for taking a look! You're not kidding about the patch size.
FYI, the tableam changes made recently have been extracted into their
own patch, which is up at [1].
1) the psql_1.out is missing a bit of expected output (due to 098fb0079)
Yeah, this patch was rebased as of efc5dcfd8a.
2) I'm getting crashes in intarray contrib, due to hitting this error in
lwlock.c (backtrace attached):
/* Ensure we will have room to remember the lock */
if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
elog(ERROR, "too many LWLocks taken");I haven't investigates this too much, but it's regular build with
asserts and TAP tests, so it should be simple to reproduce using "make
check-world" I guess.
I've only seen this intermittently in installcheck, and I'm not able to
reproduce with the intarray tests on my machine (macOS). Definitely
something we need to look into. What OS are you testing on?
It's mostly expected lz4 beats pglz in performance and compression
ratio, but this seems a bit too extreme I guess. Per past benchmarks
(e.g. [1] and [2]) the difference in compression/decompression time
should be maybe 1-2x or something like that, not 35x like here.
Yeah, something seems off about that. We'll take a look.
BTW the comments in general need updating and tidying up, to make
reviews easier. For example the merge_attstream comment references
attstream1 and attstream2, but those are not the current parameters of
the function.
Agreed.
5) IMHO there should be a #define specifying the maximum number of items
per chunk (60). Currently there are literal constants used in various
places, sometimes 60, sometimes 59 etc. which makes it harder to
understand the code. FWIW 60 seems a bit low, but maybe it's OK.
Yeah, that seems like a good idea.
I think the value 60 comes from the use of simple-8b encoding -- see the
comment at the top of zedstore_attstream.c.
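(For reference, each simple-8b codeword is 64 bits: 4 bits of selector and
60 bits of payload, and the densest mode packs sixty 1-bit values into one
codeword - that's where the 60-item ceiling comes from.)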
6) I do think ZSAttStream should track which compression is used by the
stream, for two main reasons. Firstly, there's another patch to support
"custom compression" methods, which (also) allows multiple compression
methods per column. It'd be a bit strange to support that for varlena
columns in heap table, and not here, I guess. Secondly, I think one of
the interesting columnstore features down the road will be execution on
compressed data, which however requires compression method designed for
that purpose, and it's often datatype-specific (delta encoding, ...).
I don't think we need to go as far as supporting "custom" compression
methods here, but I think we should allow different built-in compression
methods for different attstreams.
Interesting. We'll need to read/grok that ML thread.
Thanks again for the review!
--Jacob
[1]: /messages/by-id/CAE-ML+9RmTNzKCNTZPQf8O3b-UjHWGFbSoXpQa3Wvuc8YBbEQw@mail.gmail.com
On 11/13/20 8:07 PM, Jacob Champion wrote:
On Nov 12, 2020, at 2:40 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,
Thanks for the updated patch. It's quite a massive amount of code - I
don't think we've had many 2MB patches in the past, so this is by no means
a full review.
Thanks for taking a look! You're not kidding about the patch size.
FYI, the tableam changes made recently have been extracted into their
own patch, which is up at [1].
1) the psql_1.out is missing a bit of expected output (due to 098fb0079)
Yeah, this patch was rebased as of efc5dcfd8a.
2) I'm getting crashes in intarray contrib, due to hitting this error in
lwlock.c (backtrace attached):
/* Ensure we will have room to remember the lock */
if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
elog(ERROR, "too many LWLocks taken");I haven't investigates this too much, but it's regular build with
asserts and TAP tests, so it should be simple to reproduce using "make
check-world" I guess.I've only seen this intermittently in installcheck, and I'm not able to
reproduce with the intarray tests on my machine (macOS). Definitely
something we need to look into. What OS are you testing on?
Fedora 32, nothing special. I'm not sure if I ran the tests with pglz or
lz4, maybe there's some dependence on that, but it does fail for me
quite reliably with this:
./configure --enable-debug --enable-cassert --enable-tap-tests
--with-lz4 && make -s clean && make -s -j4 && make check-world
It's mostly expected lz4 beats pglz in performance and compression
ratio, but this seems a bit too extreme I guess. Per past benchmarks
(e.g. [1] and [2]) the difference in compression/decompression time
should be maybe 1-2x or something like that, not 35x like here.
Yeah, something seems off about that. We'll take a look.
BTW the comments in general need updating and tidying up, to make
reviews easier. For example the merge_attstream comment references
attstream1 and attstream2, but those are not the current parameters of
the function.
Agreed.
5) IMHO there should be a #define specifying the maximum number of items
per chunk (60). Currently there are literal constants used in various
places, sometimes 60, sometimes 59 etc. which makes it harder to
understand the code. FWIW 60 seems a bit low, but maybe it's OK.
Yeah, that seems like a good idea.
I think the value 60 comes from the use of simple-8b encoding -- see the
comment at the top of zedstore_attstream.c.
Yeah, I understand where it comes from. I'm just saying that when you
see 59 hardcoded, it may not be obvious where it came from, and
something like ITEMS_PER_CHUNK would be better.
I wonder how complicated it would be to allow larger chunks, e.g. by
using one bit to say "there's another 64-bit codeword". Not sure if it's
worth the extra complexity, though - it's just that 60 feels a bit low.
6) I do think ZSAttStream should track which compression is used by the
stream, for two main reasons. Firstly, there's another patch to support
"custom compression" methods, which (also) allows multiple compression
methods per column. It'd be a bit strange to support that for varlena
columns in heap table, and not here, I guess. Secondly, I think one of
the interesting columnstore features down the road will be execution on
compressed data, which however requires compression method designed for
that purpose, and it's often datatype-specific (delta encoding, ...).
I don't think we need to go as far as supporting "custom" compression
methods here, but I think we should allow different built-in compression
methods for different attstreams.
Interesting. We'll need to read/grok that ML thread.
That thread is a bit long; not sure it's worth reading as a whole unless
you want to work on that feature. The gist is that to seamlessly support
multiple compression algorithms we need to store an ID of the algorithm
somewhere. For TOAST that's not too difficult, we can do that in the
TOAST pointer - the main challenge is in doing it in a
backwards-compatible way. For zedstore we can actually design it from
the start.
I wonder if we should track version of the format somewhere, to allow
future improvements. So that if/when we decide to change something in
the future, we don't have to scavenge bits etc. Or perhaps just a
"uint32 flags" field, unused/reserved for future use.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 12, 2020 at 4:40 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
master zedstore/pglz zedstore/lz4
-------------------------------------------------
copy 1855 68092 2131
dump 751 905 811And the size of the lineitem table (as shown by \d+) is:
master: 64GB
zedstore/pglz: 51GB
zedstore/lz4: 20GBIt's mostly expected lz4 beats pglz in performance and compression
ratio, but this seems a bit too extreme I guess. Per past benchmarks
(e.g. [1] and [2]) the difference in compression/decompression time
should be maybe 1-2x or something like that, not 35x like here.
I can't speak to the ratio, but in basic backup/restore scenarios pglz
is absolutely killing me; Performance is just awful; we are cpubound
in backups throughout the department. Installations defaulting to
plgz will make this feature show very poorly.
merlin
On 11/16/20 1:59 PM, Merlin Moncure wrote:
On Thu, Nov 12, 2020 at 4:40 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:master zedstore/pglz zedstore/lz4
-------------------------------------------------
copy 1855 68092 2131
dump 751 905 811And the size of the lineitem table (as shown by \d+) is:
master: 64GB
zedstore/pglz: 51GB
zedstore/lz4: 20GBIt's mostly expected lz4 beats pglz in performance and compression
ratio, but this seems a bit too extreme I guess. Per past benchmarks
(e.g. [1] and [2]) the difference in compression/decompression time
should be maybe 1-2x or something like that, not 35x like here.I can't speak to the ratio, but in basic backup/restore scenarios pglz
is absolutely killing me; Performance is just awful; we are cpubound
in backups throughout the department. Installations defaulting to
plgz will make this feature show very poorly.
Maybe. I'm not disputing that pglz is considerably slower than lz4, but
judging by previous benchmarks I'd expect the compression to be slower
maybe by a factor of ~2x. So the 30x difference is suspicious. Similarly
for the compression ratio - lz4 is great, but it seems strange it's 1/2
the size of pglz. Which is why I'm speculating that something else is
going on.
As for the "plgz will make this feature show very poorly" I think that
depends. I think we may end up with pglz doing pretty well (compared to
heap), but lz4 will probably outperform that. OTOH for various use cases
it may be more efficient to use something else with worse compression
ratio, but allowing execution on compressed data, etc.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 16, 2020 at 10:07 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 11/16/20 1:59 PM, Merlin Moncure wrote:
On Thu, Nov 12, 2020 at 4:40 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:master zedstore/pglz zedstore/lz4
-------------------------------------------------
copy 1855 68092 2131
dump 751 905 811And the size of the lineitem table (as shown by \d+) is:
master: 64GB
zedstore/pglz: 51GB
zedstore/lz4: 20GBIt's mostly expected lz4 beats pglz in performance and compression
ratio, but this seems a bit too extreme I guess. Per past benchmarks
(e.g. [1] and [2]) the difference in compression/decompression time
should be maybe 1-2x or something like that, not 35x like here.I can't speak to the ratio, but in basic backup/restore scenarios pglz
is absolutely killing me; Performance is just awful; we are cpubound
in backups throughout the department. Installations defaulting to
plgz will make this feature show very poorly.Maybe. I'm not disputing that pglz is considerably slower than lz4, but
judging by previous benchmarks I'd expect the compression to be slower
maybe by a factor of ~2x. So the 30x difference is suspicious. Similarly
for the compression ratio - lz4 is great, but it seems strange it's 1/2
the size of pglz. Which is why I'm speculating that something else is
going on.As for the "plgz will make this feature show very poorly" I think that
depends. I think we may end up with pglz doing pretty well (compared to
heap), but lz4 will probably outperform that. OTOH for various use cases
it may be more efficient to use something else with worse compression
ratio, but allowing execution on compressed data, etc.
Hm, you might be right. Doing some number crunching, I'm getting
about 23 MB/sec compression on a 600 GB backup image on a pretty typical
AWS server. That's obviously not great, but your numbers are much
worse than that, so maybe something else might be going on.
I think we may end up with pglz doing pretty well (compared to heap)
I *don't* think so, or at least I'm skeptical as long as insertion
times are part of the overall performance measurement. Naturally,
with column stores, insertion times are often very peripheral to the
overall performance picture, but for cases that aren't, I suspect the
results are not going to be pleasant, and I'd advise planning accordingly.
As an aside, I am very interested in this work. I may be able to support
testing in an enterprise environment; lmk if interested -- thank you
merlin
On Nov 13, 2020, at 2:00 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Fedora 32, nothing special. I'm not sure if I ran the tests with pglz or
lz4, maybe there's some dependence on that, but it does fail for me
quite reliably with this:./configure --enable-debug --enable-cassert --enable-tap-tests
--with-lz4 && make -s clean && make -s -j4 && make check-world
I'm not sure what I messed up the first time, but I am able to reproduce
reliably now, with and without lz4. It looks like we have a workaround
in place that significantly increases the number of simultaneous locks
acquired during indexing:
#define XLR_MAX_BLOCK_ID 199
So that's in need of resolution. I'd expect gin and gist to be pretty
flaky until we fix that.
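(For context, if I'm reading it right: stock XLR_MAX_BLOCK_ID is 32, and
MAX_SIMUL_LWLOCKS in lwlock.c is 200, so with the constant bumped to 199 a
single WAL record can register close to 200 buffers, each with its content
lock held while the record is assembled - which by itself gets us right up
against the 200-lock limit.)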
--Jacob
On Wed, 18 Nov 2020 at 00:31, Jacob Champion <pchampion@vmware.com> wrote:
So that's in need of resolution. I'd expect gin and gist to be pretty
flaky until we fix that.
Jacob and Soumyadeep,
Thanks for submitting this. I think a fix is still outstanding (is that
right?), and the patch fails to apply on HEAD in two places.
Please can you submit the next version?
Do you mind if we add this for review to the Jan CF?
It is a lot of code and I think there is significant difficulty for
the community to accept that as-is, even though it looks to be a very
high quality submission. So I would like to suggest a strategy for
commit: we accept Zedstore as "Beta" or "Experimental" in PG14,
perhaps with a WARNING/Caution similar to the one that used to be
given by Postgres in earlier versions when you created a Hash index.
We keep Zedstore in "Beta" mode until a later release, PG15 or later
when we can declare Zedstore fully safe. That approach allows us to
get this into the repo asap, and then be fixed and improved
incrementally from here.
e.g.
"NOTICE: Caution: Zedstore is an experimental feature in PostgreSQL14
intended for robustness and performance testing only. Your data and/or
query accuracy may be at risk if you rely on this."
--
Simon Riggs http://www.EnterpriseDB.com/
On Dec 31, 2020, at 9:22 AM, Simon Riggs <simon@2ndquadrant.com<mailto:simon@2ndquadrant.com>> wrote:
On Wed, 18 Nov 2020 at 00:31, Jacob Champion <pchampion@vmware.com<mailto:pchampion@vmware.com>> wrote:
So that's in need of resolution. I'd expect gin and gist to be pretty
flaky until we fix that.
Jacob and Soumyadeep,
Thanks for submitting this. I think a fix is still outstanding? and
the patch fails to apply on HEAD in two places.
Please can you submit the next version?
Do you mind if we add this for review to the Jan CF?
It is a lot of code and I think there is significant difficulty for
the community to accept that as-is, even though it looks to be a very
high quality submission. So I would like to suggest a strategy for
commit: we accept Zedstore as "Beta" or "Experimental" in PG14,
perhaps with a WARNING/Caution similar to the one that used to be
given by Postgres in earlier versions when you created a Hash index.
We keep Zedstore in "Beta" mode until a later release, PG15 or later
when we can declare Zedstore fully safe. That approach allows us to
get this into the repo asap, and then be fixed and improved
incrementally from here.
The goal for Zedstore is to get a column store into Postgres, but not necessarily Zedstore itself (though that would be nice). When designing Zedstore, success for us would be:
- significantly more performant on OLAP type queries,
- performant enough to not be terrible with OLTP type queries
- must support compression
- cannot be append-only; this was the case initially with the Greenplum column store and it was a mistake. Customers want to update and delete
- it needs to be feature-complete compared to heap, except where that doesn't make sense
Our initial goal is to get the TableAM and executor molded into a state where the above is possible for anyone wanting a column store implementation.
Given the goal of addressing API/executor issues generically first, we have been trying to peel off and work on the parts that are not tightly linked to Zedstore. Specifically, I don't think it would be OK to merge Zedstore into core when it might affect the performance of heap relations.
Instead of focusing on the larger, more difficult to review Zedstore patch, we are trying to peel off the touch points where Zedstore and the current server interact. Note this isn't intended to be an exhaustive list, rather a list of the most immediate issues. Some of these issues are critical for Zedstore to work (e.g. column projection), while others are more about ensuring the various layers in the code are clean, so that folks leveraging the table AM don't need to write their own bits from whole cloth but can instead leverage appropriately generic primitives (e.g. dbsize or pageinspect).
As such, an incomplete list of things currently on our radar:
1) Column projection -- we have a patch [1] that demonstrates what we would like to do. There are several TODOs in the email that can/will be addressed if the general method is acceptable.
2) DBSize -- Georgios has a patch [2] that begins to make DBSize less heap-specific.
3) Reloptions -- Jeff Davis has a patch [3] that begins to make these more flexible; having spoken with him, we think additional work needs to be done here.
4) PageInspect -- needs to be less heap-specific, but no work has been done here that I'm aware of.
5) BitmapHeapScan -- currently scans both the index and the relation; there are code comments to address this and we need to look into what a fix would mean.
6) Bulk insertion -- Justin Pryzby has a patch [4] that we are following along with.
7) ANALYZE -- Denis has a patch [5] which starts to address this.
Ideally we can peel out anything that is useful to any column store. Once those have been discussed and committed the general code should be in better shape as well.
— Rob
[1]: /messages/by-id/CAE-ML+9RmTNzKCNTZPQf8O3b-UjHWGFbSoXpQa3Wvuc8YBbEQw@mail.gmail.com
[2]: /messages/by-id/svffVJPtfDYEIISNS-3FQs64CauSul3RjF7idXOfy4H40YBVwB3TMumHb6WoAElJpHOsN-j8fjxYohEt4VxcsJ0Qd9gizwzsY3rjgtjj440=@pm.me
[3]: /messages/by-id/429fb58fa3218221bb17c7bf9e70e1aa6cfc6b5d.camel@j-davis.com
[4]: /messages/by-id/20200508072545.GA9701@telsasoft.com
[5]: /messages/by-id/C7CFE16B-F192-4124-BEBB-7864285E0FF7@arenadata.io
Greetings.
Thanks for the project. I see the code on GitHub has not been updated for
a long time; is it still under active development?
Thanks
--
Best Regards
Andy Fan (https://www.aliyun.com/)