Dynamic shared memory areas
Hi hackers,
I would like to propose a new subsystem called Dynamic Shared [Memory]
Areas, or "DSA". It provides an object called a "dsa_area" which can
be used by multiple backends to share data. Under the covers, a
dsa_area is made up of some number of DSM segments, but it appears to
client code as a single shared memory heap with a simple allocate/free
interface. Because the memory is mapped at different addresses in
different backends, it introduces a kind of sharable relative pointer
and an operation to convert it to a backend-local pointer.
After you have created or attached to a dsa_area, you can use it much
like MemoryContextAlloc/pfree, except for the extra hoop to jump
through to get the local address:
dsa_pointer p;
char *mem;
p = dsa_allocate(area, 42);
mem = (char *) dsa_get_address(area, p);
if (mem != NULL)
{
snprintf(mem, 42, "Hello world");
dsa_free(area, p);
}
Exposing the dsa_pointer in this way allows client code to build data
structures with internal dsa_pointers that will be usable in all
backends that attach to the dsa_area.
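To make that concrete, here is a minimal sketch of the idea, not code from the patch. It builds a linked list inside a plain byte buffer using offsets in place of pointers. All names in it (rel_ptr, toy_area, toy_alloc, toy_get_address, list_push, list_sum) are hypothetical stand-ins, there is only one "segment", nothing is ever freed, and there is no locking. Treating offset 0 as the invalid pointer is also an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for dsa_pointer: a byte offset from the area base. */
typedef uint64_t rel_ptr;
#define REL_PTR_INVALID ((rel_ptr) 0)

#define TOY_AREA_SIZE 1024

/* Hypothetical stand-in for a dsa_area: one buffer plus a bump allocator. */
typedef struct
{
	size_t		used;			/* next free offset within mem.bytes */
	union
	{
		uint64_t	align;		/* forces 8-byte alignment of the buffer */
		char		bytes[TOY_AREA_SIZE];
	}			mem;
} toy_area;

typedef struct
{
	int			value;
	rel_ptr		next;			/* relative, so meaningful in every mapping */
} list_node;

static void
toy_init(toy_area *area)
{
	memset(area, 0, sizeof(*area));
	area->used = 8;				/* reserve offset 0 as the invalid pointer */
}

/* Allocate-only; returns REL_PTR_INVALID when the buffer is exhausted. */
static rel_ptr
toy_alloc(toy_area *area, size_t size)
{
	rel_ptr		result;

	size = (size + 7) & ~(size_t) 7;	/* keep offsets 8-byte aligned */
	if (size > TOY_AREA_SIZE - area->used)
		return REL_PTR_INVALID;
	result = area->used;
	area->used += size;
	return result;
}

/* Convert a relative pointer into an address valid in this mapping only. */
static void *
toy_get_address(toy_area *area, rel_ptr p)
{
	assert(p < TOY_AREA_SIZE);
	return p == REL_PTR_INVALID ? NULL : area->mem.bytes + p;
}

/* Push a value onto the front of the list and return the new head. */
static rel_ptr
list_push(toy_area *area, rel_ptr head, int value)
{
	rel_ptr		p = toy_alloc(area, sizeof(list_node));
	list_node  *node = toy_get_address(area, p);

	if (node == NULL)
		return head;			/* out of space: list is unchanged */
	node->value = value;
	node->next = head;
	return p;
}

/* Sum the list by following relative pointers through the given mapping. */
static int
list_sum(toy_area *area, rel_ptr head)
{
	int			sum = 0;
	list_node  *node;

	for (node = toy_get_address(area, head); node != NULL;
		 node = toy_get_address(area, node->next))
		sum += node->value;
	return sum;
}
```

If you memcpy a populated toy_area to a second toy_area at a different address, which roughly imitates another backend's mapping, list_sum gives the same answer through either copy with the same head value. A list linked with raw pointers would still point into the first copy.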
DSA areas have many potential uses, including shared workspaces for
various kinds of parallel query execution, longer term storage for
in-memory database objects, caches and so forth. In some cases it may
be useful to use a dsa_area directly, but there could be a library of
useful data structures that know how to use DSA memory. More on all
of those topics, with patches, soon.
SOME CONTEXT
Currently, Postgres provides three classes of memory:
1. Backend-local memory, managed with palloc/pfree, and MemoryContext
providing a hierarchy of memory heaps tied to various scopes.
Underneath that, there is of course the C runtime's heap and
allocator.
2. Traditional non-extensible shared memory mapped into every backend
at the same address. This works on Unix because child processes
inherit the memory map of the postmaster. In EXEC_BACKEND builds
(including Windows) it works because you can ask for memory to be
mapped at a specific address and it'll succeed if ASLR is turned off
and the backend hasn't been running very long and the address range
happens to be still free. This memory is currently managed with an
allocate-only allocator. There is a small library of data structures
that know how to use (but never free) this memory.
3. DSM memory, our abstraction for shared memory segments created on
demand in non-postmaster backends. This memory is mapped at different
addresses in different backends. Currently its main use is to provide
a chunk of memory for parallel query. To manage the space inside a
DSM segment, shm_toc ('table-of-contents') can be used as a kind of
allocate-only space manager which allows backends to find the
backend-local address of objects within the segment using integer
keys.
This proposal adds a fourth class, building on the third. Compared
with the existing memory classes:
* It provides a fully general allocate/free facility, as currently
available only in (1), though does not have (1)'s directly
dereferenceable pointers.
* It grows automatically and can in theory grow as big as virtual
memory allows, like (1), though it also provides a way to cap total
size so that allocations fail beyond some size.
* It provides something like the throw-it-all-away-at-once clean-up
facility of (1), since DSA areas can be destroyed, are reference
counted, and can optionally be tracked by the resource manager
mechanism (riding on DSM's coat tails).
* It provides the data sharing between backends of (2) and (3), though
doesn't have (2)'s directly dereferenceable pointers.
* Through proposals that will follow this one, it will provide for
basic data structures that build on top of it such as hash tables,
like (2), except that these ones will be able to grow as required and
give memory back.
* Unlike (1) and (2), client code has to deal with incompatible memory
maps. This involves calling dsa_get_address(area, relative_pointer)
which amounts to a few instructions to perform a base address lookup
and pointer arithmetic.
Using processes instead of threads gives Postgres certain advantages,
but requires us to deal with shared memory instead of just using
something like (1) for all our memory needs, as a hypothetical
multi-threaded Postgres fork would presumably do. This proposal is a
step towards making our shared memory facilities more powerful and
general.
IMPLEMENTATION AND HISTORY
Back in 2014, Robert Haas proposed sb_alloc[1]. It had two layers:
* a 'free page manager' which cuts a piece of memory into 4KB pages
and embeds a btree into the empty pages to track contiguous runs of
pages, so that you can get and put free page ranges
* an allocator which manages a set of backend-private memory regions,
each of which has a free page manager; large allocations are handled
directly with pages from the free page manager in an existing region,
or new regions created as required with malloc; allocations <= 8KB are
handled with pools (called "heaps" in that patch) of various object
sizes ("size classes") that live in 64KB superblocks, which in turn
come from the free page manager
DSA uses Robert's free page manager unchanged, except for some
debugging by me. It uses the same general approach and much of the
code for the higher level allocator, but I have reworked it
substantially to replace the MemoryContext interface, put it in DSM
segments, introduce the multi-segment relative pointer scheme, and add
concurrency support.
Compared to some well known malloc implementations which this code
takes general inspiration from, the main differences are the shared
memory nature, the lack of per-core pools (an avenue for future
research that would increase concurrent performance at the cost of
increased fragmentation), and the lower level page manager.
Some other systems go directly to the OS (mmap, sbrk) for superblocks
and large objects. The equivalent for us would be to throw away the
lower layer and simply create a DSM segment for large allocations and
64KB superblocks, but there are implementation and portability reasons
not to want to create very large numbers of DSM segments.
Compared to palloc/pfree, DSA aims to waste less space. It has more
finely grained size classes (8, 16, 24, 32, 40, 48, ... see
dsa_size_classes), and it uses a page map costing 8 bytes per 4KB page
to keep track of how to free memory, instead of putting bookkeeping
information in front of every object.
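As a rough worked example of that arithmetic, the sketch below rounds a request up to a size class and computes the page map cost for a segment. The table values are the regular size classes from dsa_size_classes in the attached patch. The function names and TOY_ constants are hypothetical, and the sketch ignores span metadata and the other per-segment headers, so the real overhead will be a little higher.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096
#define TOY_SUPERBLOCK_SIZE (16 * TOY_PAGE_SIZE)	/* 64KB superblocks */
#define TOY_PAGEMAP_ENTRY_SIZE 8	/* bytes of page map per 4KB page */

/* Regular size classes, copied from dsa_size_classes in the patch. */
static const unsigned short toy_size_classes[] = {
	8, 16, 24, 32, 40, 48, 56, 64,
	80, 96, 112, 128,
	160, 192, 224, 256,
	320, 384, 448, 512,
	640, 768, 896, 1024,
	1280, 1560, 1816, 2048,
	2616, 3120, 3640, 4096,
	5456, 6552, 7280, 8192
};
#define TOY_NUM_CLASSES \
	(sizeof(toy_size_classes) / sizeof(toy_size_classes[0]))

/*
 * Return the smallest size class that fits 'size', or 0 if the request is
 * too big for any pool and would have to be served with whole pages.
 */
static size_t
toy_round_to_size_class(size_t size)
{
	size_t		i;

	for (i = 0; i < TOY_NUM_CLASSES; ++i)
	{
		if (size <= toy_size_classes[i])
			return toy_size_classes[i];
	}
	return 0;
}

/* How many objects of one size class fit into one 64KB superblock. */
static size_t
toy_objects_per_superblock(size_t object_size)
{
	assert(object_size > 0);
	return TOY_SUPERBLOCK_SIZE / object_size;
}

/* Bytes of page map needed to describe a segment of the given size. */
static size_t
toy_pagemap_bytes(size_t segment_size)
{
	return segment_size / TOY_PAGE_SIZE * TOY_PAGEMAP_ENTRY_SIZE;
}
```

With these definitions, the 42 byte request from the example at the top of this mail lands in the 48 byte class, a superblock holds 1365 such objects, and the page map for a 1MB segment costs 2048 bytes in total, whatever the number of objects.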
Some other notes in no particular order:
* It's admittedly slightly confusing that the patch currently contains
two separate relative pointer concepts: relptr is used by Robert's
freespace.c code and provides for sort-of-type-checked offsets relative
to a single base, and dsa_pointer is used by dsa.c to provide
multi-segment relative pointers that encode a segment index in the
higher bits.
* The lock tranche arguments to dsa_create_dynamic are clunky, but I
don't have a better idea currently since you can't allocate and free
tranche IDs so I don't see how dsa.c can own that problem.
* The "dynamic" part of dsa_create_dynamic's name reflects a desire to
have an alternative "fixed" version where you can provide it with an
already existing piece of memory to manage, such as a pre-existing DSM
segment, but that has not been implemented.
* It's desirable to allow atomic ops on dsa_pointer; I believe Andres
Freund plans to make that happen for 64 bit values on 32 bit systems,
but if that turns out to be problematic I would want to make
dsa_pointer 32 bits on 32 bit systems.
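For anyone who wants to see the multi-segment encoding spelled out, here is a small sketch modelled on the DSA_MAKE_POINTER, DSA_EXTRACT_SEGMENT_NUMBER and DSA_EXTRACT_OFFSET macros in the attached patch, using the same 40 bit offset width. The toy_ names, the four-slot toy_mapping struct and the asserts are hypothetical simplifications. The real code maps segments on demand rather than assuming they are already mapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_pointer;

/* Low 40 bits hold the offset, so a segment can be at most 1TB. */
#define TOY_OFFSET_WIDTH 40
#define TOY_OFFSET_BITMASK (((toy_pointer) 1 << TOY_OFFSET_WIDTH) - 1)
#define TOY_MAX_SEGMENTS 4

/* Per-backend table of where each segment happens to be mapped. */
typedef struct
{
	char	   *base[TOY_MAX_SEGMENTS];
} toy_mapping;

/* Pack a segment number and an offset into one shareable value. */
static toy_pointer
toy_make_pointer(uint64_t segment, uint64_t offset)
{
	assert(offset <= TOY_OFFSET_BITMASK);
	return ((toy_pointer) segment << TOY_OFFSET_WIDTH) | offset;
}

/* The segment number lives in the bits above the offset. */
static uint64_t
toy_segment_number(toy_pointer dp)
{
	return dp >> TOY_OFFSET_WIDTH;
}

static uint64_t
toy_offset(toy_pointer dp)
{
	return dp & TOY_OFFSET_BITMASK;
}

/* Base address lookup plus pointer arithmetic, per backend. */
static void *
toy_pointer_to_address(const toy_mapping *map, toy_pointer dp)
{
	uint64_t	segment = toy_segment_number(dp);

	assert(segment < TOY_MAX_SEGMENTS);
	assert(map->base[segment] != NULL);
	return map->base[segment] + toy_offset(dp);
}
```

Two toy_mapping structs with different base addresses play the part of two backends: the same toy_pointer value resolves to a different local address in each, but to the same position within the same segment.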
PATCH
First, please apply dsm-unpin-segment-v2.patch[2], and then
dsm-handle-invalid.patch (attached, and also proposed), and finally
dsa-v1.patch. I have also attached test-dsa.patch, a small module
which exercises the allocator and shows some client code.
Thanks to my colleagues Robert Haas for the sb_alloc code that morphed
into this patch, and John Gorman and Amit Khandekar for feedback and
testing.
I'd be most grateful for any feedback. Thanks for reading!
[1]: /messages/by-id/CA+TgmobkeWptGwiNa+SGFWsTLzTzD-CeLz0KcE-y6LFgoUus4A@mail.gmail.com
[2]: /messages/by-id/CAEepm=29DZeWf44-4fzciAQ14iY5vCVZ6RUJ-KR2yzs3hPzrkw@mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
dsm-handle-invalid.patch
commit feab191ea1be8a75e3bd430a3476f26582d3673c
Author: Robert Haas <robert.haas@enterprisedb.com>
Date: Wed May 25 09:38:48 2016 -0400
Reserve zero as an invalid DSM handle.
Previously, the handle for the control segment could not be zero, but
some other DSM segment could potentially have a handle value of zero.
However, that means that if you want to store a dsm_handle that might
or might not be valid, you need a separate boolean to keep track of
whether you've got a legal value there. That's annoying, so change
things so that no DSM segment can ever have a handle of 0 - or as we
call it here, DSM_HANDLE_INVALID.
Thomas Munro, reviewed by me.
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index fae0b00..0dd1ed4 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -181,7 +181,7 @@ dsm_postmaster_startup(PGShmemHeader *shim)
Assert(dsm_control_address == NULL);
Assert(dsm_control_mapped_size == 0);
dsm_control_handle = random();
- if (dsm_control_handle == 0)
+ if (dsm_control_handle == DSM_HANDLE_INVALID)
continue;
if (dsm_impl_op(DSM_OP_CREATE, dsm_control_handle, segsize,
&dsm_control_impl_private, &dsm_control_address,
@@ -475,6 +475,8 @@ dsm_create(Size size, int flags)
{
Assert(seg->mapped_address == NULL && seg->mapped_size == 0);
seg->handle = random();
+ if (seg->handle == DSM_HANDLE_INVALID) /* Reserve sentinel */
+ continue;
if (dsm_impl_op(DSM_OP_CREATE, seg->handle, size, &seg->impl_private,
&seg->mapped_address, &seg->mapped_size, ERROR))
break;
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 8be7c9a..bc91be6 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -19,6 +19,9 @@ typedef struct dsm_segment dsm_segment;
#define DSM_CREATE_NULL_IF_MAXSEGMENTS 0x0001
+/* A sentinel value for an invalid DSM handle. */
+#define DSM_HANDLE_INVALID 0
+
/* Startup and shutdown functions. */
struct PGShmemHeader; /* avoid including pg_shmem.h */
extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
dsa-v1.patch
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index 8a55392..e99ebd2 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -8,7 +8,7 @@ subdir = src/backend/storage/ipc
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
+OBJS = dsa.o dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
procsignal.o shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o \
sinvaladt.o standby.o
diff --git a/src/backend/storage/ipc/dsa.c b/src/backend/storage/ipc/dsa.c
new file mode 100644
index 0000000..a12f67e
--- /dev/null
+++ b/src/backend/storage/ipc/dsa.c
@@ -0,0 +1,1934 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.c
+ * Dynamic shared memory areas.
+ *
+ * This module provides dynamic shared memory areas which are built on top of
+ * DSM segments. While dsm.c allows segments of shared memory to be
+ * created and shared between backends, it isn't designed to deal with small
+ * objects. A DSA area is a shared memory heap backed by one or more DSM
+ * segments, in which memory is managed with dsa_allocate() and dsa_free().
+ * Unlike the regular system heap, it deals in pseudo-pointers which must be
+ * converted to backend-local pointers before they are dereferenced. These
+ * pseudo-pointers can however be shared with other backends, and can be used
+ * to construct shared data structures.
+ *
+ * Each DSA area manages one or more DSM segments, adding new segments as
+ * required and detaching them when they are no longer needed. Each segment
+ * contains a number of 4KB pages, a free page manager for tracking
+ * consecutive runs of free pages, and a page map for tracking the source of
+ * objects allocated on each page. Allocation requests above 8KB are handled
+ * by choosing a segment and finding consecutive free pages in its free page
+ * manager. Allocation requests for smaller sizes are handled using pools of
+ * objects of a selection of sizes. Each pool consists of a number of 16 page
+ * (64KB) superblocks allocated in the same way as large objects. Allocation
+ * of large objects and new superblocks is serialized by a single LWLock, but
+ * allocation of small objects from pre-existing superblocks uses one LWLock
+ * per pool. Currently there is one pool, and therefore one lock, per size
+ * class. Per-core pools to increase concurrency and strategies for reducing
+ * the resulting fragmentation are areas for future research. Each superblock
+ * is managed with a 'span', which tracks the superblock's freelist. Free
+ * requests are handled by looking in the page map to find which span an
+ * address was allocated from, so that small objects can be returned to the
+ * appropriate free list, and large object pages can be returned directly to
+ * the free page manager. When allocating, simple heuristics for selecting
+ * segments and superblocks try to encourage occupied memory to be
+ * concentrated, increasing the likelihood that whole superblocks can become
+ * empty and be returned to the free page manager, and whole segments can
+ * become empty and be returned to the operating system.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/ipc/dsa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/barrier.h"
+#include "storage/dsa.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "utils/freepage.h"
+#include "utils/memutils.h"
+
+/* Macros for access to locks. */
+#define DSA_AREA_LOCK(area) (&area->control->lock)
+#define DSA_SCLASS_LOCK(area, sclass) (&area->control->pools[sclass].lock)
+
+/* The maximum number of DSM segments that an area can own. */
+#define DSA_MAX_SEGMENTS 1024
+
+/*
+ * The size of the initial DSM segment that backs a dsa_area. Subsequently
+ * created segments will be larger; we double the total storage space each
+ * time. Larger segments may be created if necessary to satisfy large
+ * requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE (1 * 1024 * 1024)
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment. At 40 bits the size of a
+ * segment and therefore the maximum you can allocate at once is 1TB.
+ */
+#define DSA_OFFSET_WIDTH 40
+
+/* The bitmask for extracting the offset from a dsa_pointer. */
+#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((uint64) 1 << DSA_OFFSET_WIDTH)
+
+/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
+#define DSA_PAGES_PER_SUPERBLOCK 16
+
+/*
+ * A magic number used as a sanity check for following DSM segments belonging
+ * to a DSA area (this number will be XORed with the area handle and
+ * the segment index).
+ */
+#define DSA_SEGMENT_HEADER_MAGIC 0x0ce26608
+
+/* Build a dsa_pointer given a segment number and offset. */
+#define DSA_MAKE_POINTER(segment_number, offset) \
+ (((dsa_pointer) (segment_number) << DSA_OFFSET_WIDTH) | (offset))
+
+/* Extract the segment number from a dsa_pointer. */
+#define DSA_EXTRACT_SEGMENT_NUMBER(dp) ((dp) >> DSA_OFFSET_WIDTH)
+
+/* Extract the offset from a dsa_pointer. */
+#define DSA_EXTRACT_OFFSET(dp) ((dp) & DSA_OFFSET_BITMASK)
+
+/* The type used for segment indexes (zero based). */
+typedef Size dsa_segment_index;
+
+/* Sentinel value for dsa_segment_index indicating 'none' or 'end'. */
+#define DSA_SEGMENT_INDEX_NONE (~(dsa_segment_index)0)
+
+/*
+ * How many bins of segments do we have? The bins are used to categorize
+ * segments by their largest contiguous run of free pages.
+ */
+#define DSA_NUM_SEGMENT_BINS 16
+
+/*
+ * What is the lowest bin that holds segments that *might* have n contiguous
+ * free pages? There is no point in looking in segments in lower bins; they
+ * definitely can't service a request for n free pages.
+ */
+#define contiguous_pages_to_segment_bin(n) Min(fls(n), DSA_NUM_SEGMENT_BINS - 1)
+
+/*
+ * The header for an individual segment. This lives at the start of each DSM
+ * segment owned by a DSA area including the first segment (where it appears
+ * as part of the dsa_area_control struct).
+ */
+typedef struct
+{
+ /* Sanity check magic value. */
+ uint32 magic;
+ /* Total number of pages in this segment (excluding metadata area). */
+ Size usable_pages;
+ /* Total size of this segment in bytes. */
+ Size size;
+
+ /*
+ * Index of the segment that precedes this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the first one.
+ */
+ dsa_segment_index prev;
+
+ /*
+ * Index of the segment that follows this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the last one.
+ */
+ dsa_segment_index next;
+ /* The index of the bin that contains this segment. */
+ Size bin;
+
+ /*
+ * A flag raised to indicate that this segment is being returned to the
+ * operating system and has been unpinned.
+ */
+ bool freed;
+} dsa_segment_header;
+
+/*
+ * Metadata for one superblock.
+ *
+ * For most blocks, span objects are stored out-of-line; that is, the span
+ * object is not stored within the block itself. But, as an exception, for a
+ * "span of spans", the span object is stored "inline". The allocation is
+ * always exactly one page, and the dsa_area_span object is located at
+ * the beginning of that page. The size class is DSA_SCLASS_BLOCK_OF_SPANS,
+ * and the remaining fields are used just as they would be in an ordinary
+ * block. We can't allocate spans out of ordinary superblocks because
+ * creating an ordinary superblock requires us to be able to allocate a span
+ * *first*. Doing it this way avoids that circularity.
+ */
+typedef struct
+{
+ dsa_pointer pool; /* Containing pool. */
+ dsa_pointer prevspan; /* Previous span. */
+ dsa_pointer nextspan; /* Next span. */
+ dsa_pointer start; /* Starting address. */
+ Size npages; /* Length of span in pages. */
+ uint16 size_class; /* Size class. */
+ uint16 ninitialized; /* Maximum number of objects ever allocated. */
+ uint16 nallocatable; /* Number of objects currently allocatable. */
+ uint16 firstfree; /* First object on free list. */
+ uint16 nmax; /* Maximum number of objects ever possible. */
+ uint16 fclass; /* Current fullness class. */
+} dsa_area_span;
+
+/*
+ * Given a pointer to an object in a span, access the index of the next free
+ * object in the same span (ie in the span's freelist) as an L-value.
+ */
+#define NextFreeObjectIndex(object) (* (uint16 *) (object))
+
+/*
+ * Small allocations are handled by dividing a single block of memory into
+ * many small objects of equal size. The possible allocation sizes are
+ * defined by the following array. Larger size classes are spaced more widely
+ * than smaller size classes. We fudge the spacing for size classes >1kB to
+ * avoid space wastage: based on the knowledge that we plan to allocate 64kB
+ * blocks, we bump the maximum object size up to the largest multiple of
+ * 8 bytes that still lets us fit the same number of objects into one block.
+ *
+ * NB: Because of this fudging, if we were ever to use differently-sized blocks
+ * for small allocations, these size classes would need to be reworked to be
+ * optimal for the new size.
+ *
+ * NB: The optimal spacing for size classes, as well as the size of the blocks
+ * out of which small objects are allocated, is not a question that has one
+ * right answer. Some allocators (such as tcmalloc) use more closely-spaced
+ * size classes than we do here, while others (like aset.c) use more
+ * widely-spaced classes. Spacing the classes more closely avoids wasting
+ * memory within individual chunks, but also means a larger number of
+ * potentially-unfilled blocks.
+ */
+static const uint16 dsa_size_classes[] = {
+ sizeof(dsa_area_span), 0, /* special size classes */
+ 8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
+ 80, 96, 112, 128, /* 4 classes separated by 16 bytes */
+ 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
+ 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
+ 640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
+ 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
+ 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
+ 5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
+};
+#define DSA_NUM_SIZE_CLASSES lengthof(dsa_size_classes)
+
+/* Special size classes. */
+#define DSA_SCLASS_BLOCK_OF_SPANS 0
+#define DSA_SCLASS_SPAN_LARGE 1
+
+/*
+ * The following lookup table is used to map the size of small objects
+ * (less than 1kB) onto the corresponding size class. To use this table,
+ * round the size of the object up to the next multiple of 8 bytes, and then
+ * index into this array.
+ */
+static char dsa_size_class_map[] = {
+ 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 11, 12, 12, 13, 13,
+ 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17,
+ 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19,
+ 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21,
+ 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
+ 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23,
+ 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+ 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25
+};
+#define DSA_SIZE_CLASS_MAP_QUANTUM 8
+
+/*
+ * Superblocks are binned by how full they are. Generally, each fullness
+ * class corresponds to one quartile, but the block being used for
+ * allocations is always at the head of the list for fullness class 1,
+ * regardless of how full it really is.
+ *
+ * For large objects, we just stick all of the allocations in fullness class
+ * 0. Since we can just return the space directly to the free page manager,
+ * we don't really need them on a list at all, except that if someone wants
+ * to bulk release everything allocated using this area, we
+ * have no other way of finding them.
+ */
+#define DSA_FULLNESS_CLASSES 4
+
+/*
+ * Maximum length of a DSA name.
+ */
+#define DSA_MAXLEN 64
+
+/*
+ * A dsa_area_pool represents a set of objects of a given size class.
+ *
+ * Perhaps there should be multiple pools for the same size class for
+ * contention avoidance, but for now there is just one!
+ */
+typedef struct
+{
+ /* A lock protecting access to this pool. */
+ LWLock lock;
+ /* A set of linked lists of spans, arranged by fullness. */
+ dsa_pointer spans[DSA_FULLNESS_CLASSES];
+ /* Should we pad this out to a cacheline boundary? */
+} dsa_area_pool;
+
+/*
+ * The control block for an area. This lives in shared memory, at the start of
+ * the first DSM segment controlled by this area.
+ */
+typedef struct
+{
+ /* The segment header for the first segment. */
+ dsa_segment_header segment_header;
+ /* The handle for this area. */
+ dsa_handle handle;
+ /* The handles of the segments owned by this area. */
+ dsm_handle segment_handles[DSA_MAX_SEGMENTS];
+ /* Lists of segments, binned by maximum contiguous run of free pages. */
+ dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
+ /* The object pools for each size class. */
+ dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* The total size of all active segments. */
+ Size total_segment_size;
+ /* The maximum total size of backing storage we are allowed. */
+ Size max_total_segment_size;
+ /* The reference count for this area. */
+ int refcnt;
+ /* A flag indicating that this area has been pinned. */
+ bool pinned;
+ /* The number of times that segments have been freed. */
+ Size freed_segment_counter;
+ /* The LWLock tranche ID. */
+ int lwlock_tranche_id;
+ char lwlock_tranche_name[DSA_MAXLEN];
+ /* The general lock (protects everything except object pools). */
+ LWLock lock;
+} dsa_area_control;
+
+/* Given a pointer to a pool, find a dsa_pointer. */
+#define DsaAreaPoolToDsaPointer(area, p) \
+ DSA_MAKE_POINTER(0, (char *) p - (char *) area->control)
+
+/*
+ * A dsa_segment_map is stored within the backend-private memory of each
+ * individual backend. It holds the base address of the segment within that
+ * backend, plus the addresses of key objects within the segment. Those
+ * could instead be derived from the base address but it's handy to have them
+ * around.
+ */
+typedef struct
+{
+ dsm_segment *segment; /* DSM segment */
+ char *mapped_address; /* Address at which segment is mapped */
+ Size size; /* Size of the segment */
+ dsa_segment_header *header; /* Header (same as mapped_address) */
+ FreePageManager *fpm; /* Free page manager within segment. */
+ dsa_pointer *pagemap; /* Page map within segment. */
+} dsa_segment_map;
+
+/*
+ * Per-backend state for a storage area. Backends obtain one of these by
+ * creating an area or attaching to an existing one using a handle. Each
+ * process that needs to use an area uses its own object to track where the
+ * segments are mapped.
+ */
+struct dsa_area
+{
+ /* Pointer to the control object in shared memory. */
+ dsa_area_control *control;
+
+ /* The lock tranche for this process. */
+ LWLockTranche lwlock_tranche;
+
+ /* Has the mapping been pinned? */
+ bool mapping_pinned;
+
+ /*
+ * This backend's array of segment maps, ordered by segment index
+ * corresponding to control->segment_handles. Some of the area's segments
+ * may not be mapped in this backend yet, and some slots may have been
+ * freed and need to be detached; these operations happen on demand.
+ */
+ dsa_segment_map segment_maps[DSA_MAX_SEGMENTS];
+
+ /* The last observed freed_segment_counter. */
+ Size freed_segment_counter;
+};
+
+#define DSA_SPAN_NOTHING_FREE ((uint16) -1)
+#define DSA_SUPERBLOCK_SIZE (DSA_PAGES_PER_SUPERBLOCK * FPM_PAGE_SIZE)
+
+/* Given a pointer to a segment_map, obtain a segment index number. */
+#define get_segment_index(area, segment_map_ptr) \
+ (segment_map_ptr - &area->segment_maps[0])
+
+static void init_span(dsa_area *area, dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class);
+static bool transfer_first_span(dsa_area *area, dsa_area_pool *pool,
+ int fromclass, int toclass);
+static inline dsa_pointer alloc_object(dsa_area *area, int size_class);
+static void dsa_on_dsm_segment_detach(dsm_segment *, Datum arg);
+static bool ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class);
+static dsa_segment_map *get_segment_by_index(dsa_area *area,
+ dsa_segment_index index);
+static void destroy_superblock(dsa_area *area, dsa_pointer span_pointer);
+static void unlink_span(dsa_area *area, dsa_area_span *span);
+static void add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer, int fclass);
+static void unlink_segment(dsa_area *area, dsa_segment_map *segment_map);
+static dsa_segment_map *get_best_segment(dsa_area *area, Size npages);
+static dsa_segment_map *make_new_segment(dsa_area *area, Size requested_pages);
+
+/*
+ * Create a new shared area with dynamic size. DSM segments will be allocated
+ * as required to extend the available space.
+ *
+ * We can't allocate a LWLock tranche_id within this function, because tranche
+ * IDs are a scarce resource; there are only 64k available, using low numbers
+ * when possible matters, and we have no provision for recycling them. So,
+ * we require the caller to provide one. The caller must also provide the
+ * tranche name, so that we can distinguish LWLocks belonging to different
+ * DSAs.
+ */
+dsa_area *
+dsa_create_dynamic(int tranche_id, const char *tranche_name)
+{
+ dsm_segment *segment;
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+ Size usable_pages;
+ Size total_pages;
+ Size metadata_bytes;
+ Size total_size;
+ int i;
+
+ total_size = DSA_INITIAL_SEGMENT_SIZE;
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages =
+ (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+
+ /*
+ * Create the DSM segment that will hold the shared control object and the
+ * first segment of usable space, and set it up. All segments backing
+ * this area are pinned, so that DSA can explicitly control their lifetime
+ * (otherwise a newly created segment belonging to this area might be
+ * freed when the only backend that happens to have it mapped in ends,
+ * corrupting the area).
+ */
+ segment = dsm_create(total_size, 0);
+ dsm_pin_segment(segment);
+
+ /*
+ * Initialize the dsa_area_control object located at the start of the
+ * segment.
+ */
+ control = dsm_segment_address(segment);
+ control->segment_header.magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ dsm_segment_handle(segment) ^ 0;
+ control->segment_header.next = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.usable_pages = usable_pages;
+ control->segment_header.freed = false;
+ control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+ control->handle = dsm_segment_handle(segment);
+ control->max_total_segment_size = SIZE_MAX;
+ control->total_segment_size = DSA_INITIAL_SEGMENT_SIZE;
+ memset(&control->segment_handles[0], 0,
+ sizeof(dsm_handle) * DSA_MAX_SEGMENTS);
+ control->segment_handles[0] = dsm_segment_handle(segment);
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ control->segment_bins[i] = DSA_SEGMENT_INDEX_NONE;
+ control->refcnt = 1;
+ control->freed_segment_counter = 0;
+ control->lwlock_tranche_id = tranche_id;
+ strlcpy(control->lwlock_tranche_name, tranche_name, DSA_MAXLEN);
+
+ /*
+ * Create the dsa_area object that this backend will use to access the
+ * area. Other backends will need to obtain their own dsa_area object by
+ * attaching.
+ */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(area->segment_maps, 0, sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+ LWLockInitialize(&control->lock, control->lwlock_tranche_id);
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ LWLockInitialize(DSA_SCLASS_LOCK(area, i),
+ control->lwlock_tranche_id);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = total_size;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Put this segment into the appropriate bin. */
+ control->segment_bins[contiguous_pages_to_segment_bin(usable_pages)] = 0;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_segment_detach, PointerGetDatum(NULL));
+
+ return area;
+}
+
+/*
+ * Obtain a handle that can be passed to other processes so that they can
+ * attach to the given area.
+ */
+dsa_handle
+dsa_get_handle(dsa_area *area)
+{
+ return area->control->handle;
+}
+
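+/*
+ * Usage sketch (illustrative only, not part of the API): 'shared' stands
+ * for a hypothetical struct in memory that both processes can already see.
+ *
+ *    creating backend:   shared->area_handle = dsa_get_handle(area);
+ *    attaching backend:  area = dsa_attach_dynamic(shared->area_handle);
+ */
+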
+/*
+ * Attach to an area given a handle generated (possibly in another
+ * process) by dsa_get_handle.
+ */
+dsa_area *
+dsa_attach_dynamic(dsa_handle handle)
+{
+ dsm_segment *segment;
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+
+ /*
+ * An area handle is really a DSM segment handle for the first segment, so
+ * we go ahead and attach to that.
+ */
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to area handle %u", handle);
+ control = dsm_segment_address(segment);
+ Assert(control->handle == handle);
+ Assert(control->segment_handles[0] == handle);
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ handle ^ 0));
+
+ /* Build the backend-local area object. */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ area->freed_segment_counter = control->freed_segment_counter;
+ memset(&area->segment_maps[0], 0,
+ sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = dsm_segment_map_length(segment);
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Bump the reference count. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ ++control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_segment_detach, PointerGetDatum(area));
+
+ return area;
+}
+
+/*
+ * Keep a DSA area attached until end of session or explicit detach.
+ *
+ * By default, areas are owned by the current resource owner, which means they
+ * are detached automatically when that scope ends.
+ */
+void
+dsa_pin_mapping(dsa_area *area)
+{
+ int i;
+
+ Assert(!area->mapping_pinned);
+ area->mapping_pinned = true;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_pin_mapping(area->segment_maps[i].segment);
+}
+
+/*
+ * Allocate memory in this storage area. The return value is a dsa_pointer
+ * that can be passed to other processes, and converted to a local pointer
+ * with dsa_get_address. If no memory is available, returns
+ * InvalidDsaPointer.
+ */
+dsa_pointer
+dsa_allocate(dsa_area *area, Size size)
+{
+ uint16 size_class;
+ dsa_pointer start_pointer;
+ dsa_segment_map *segment_map;
+
+ Assert(size > 0);
+
+ /*
+ * If bigger than the largest size class, just grab a run of pages from
+ * the free page manager, instead of allocating an object from a pool.
+ * There will still be a span, but it's a special class of span that
+ * manages this whole allocation and simply gives all pages back to the
+ * free page manager when dsa_free is called.
+ */
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ {
+ Size npages = fpm_size_to_pages(size);
+ Size first_page;
+ dsa_pointer span_pointer;
+ dsa_area_pool *pool = &area->control->pools[DSA_SCLASS_SPAN_LARGE];
+
+ /* Obtain a span object. */
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return InvalidDsaPointer;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+
+ /* Find a segment from which to allocate. */
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ /* Can't make any more segments: game over. */
+ LWLockRelease(DSA_AREA_LOCK(area));
+ dsa_free(area, span_pointer);
+ return InvalidDsaPointer;
+ }
+
+ /*
+ * Ask the free page manager for a run of pages. This should always
+ * succeed, since both get_best_segment and make_new_segment should
+ * only return a non-NULL pointer if the segment actually contains enough
+ * contiguous free space. If it does fail, something in our backend
+ * private state is out of whack, so use FATAL to kill the process.
+ */
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ elog(FATAL,
+ "dsa couldn't find run of pages: fpm_largest out of sync");
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ start_pointer = DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /* Initialize span and pagemap. */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ init_span(area, span_pointer, pool, start_pointer, npages,
+ DSA_SCLASS_SPAN_LARGE);
+ segment_map->pagemap[first_page] = span_pointer;
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+
+ return start_pointer;
+ }
+
+ /* Map allocation to a size class. */
+ if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+ Assert(size <= dsa_size_classes[size_class]);
+ Assert(size_class == 0 || size > dsa_size_classes[size_class - 1]);
+
+ /*
+ * Attempt to allocate an object from the appropriate pool. This might
+ * return InvalidDsaPointer if there's no space available.
+ */
+ return alloc_object(area, size_class);
+}
+
+/*
+ * Free memory obtained with dsa_allocate.
+ */
+void
+dsa_free(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_map *segment_map;
+ int pageno;
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ char *superblock;
+ char *object;
+ Size size;
+ int size_class;
+
+ /* Locate the object, span and pool. */
+ segment_map = get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(dp));
+ pageno = DSA_EXTRACT_OFFSET(dp) / FPM_PAGE_SIZE;
+ span_pointer = segment_map->pagemap[pageno];
+ span = dsa_get_address(area, span_pointer);
+ superblock = dsa_get_address(area, span->start);
+ object = dsa_get_address(area, dp);
+ size_class = span->size_class;
+ size = dsa_size_classes[size_class];
+
+ /*
+ * Special case for large objects that live in a special span: we return
+ * those pages directly to the free page manager and free the span.
+ */
+ if (span->size_class == DSA_SCLASS_SPAN_LARGE)
+ {
+ /* Give pages back to free page manager. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Unlink span. */
+ /* TODO: Does it even need to be linked in in the first place? */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ unlink_span(area, span);
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+ /* Free the span object so it can be reused. */
+ dsa_free(area, span_pointer);
+ return;
+ }
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /* Put the object on the span's freelist. */
+ Assert(object >= superblock);
+ Assert(object < superblock + DSA_SUPERBLOCK_SIZE);
+ Assert((object - superblock) % size == 0);
+ NextFreeObjectIndex(object) = span->firstfree;
+ span->firstfree = (object - superblock) / size;
+ ++span->nallocatable;
+
+ /*
+ * See if the span needs to be moved to a different fullness class, or be
+ * freed so its pages can be given back to the segment.
+ */
+ if (span->nallocatable == 1 && span->fclass == DSA_FULLNESS_CLASSES - 1)
+ {
+ /*
+ * The block was completely full and is located in the
+ * highest-numbered fullness class, which is never scanned for free
+ * chunks. We must move it to the next-lower fullness class.
+ */
+ unlink_span(area, span);
+ add_span_to_fullness_class(area, span, span_pointer,
+ DSA_FULLNESS_CLASSES - 2);
+
+ /*
+ * If this is the only span, and there is no active span, then we
+ * should probably move this span to fullness class 1. (Otherwise
+ * if you allocate exactly all the objects in the only span, it moves
+ * to class 3, then you free them all, it moves to 2, and then is
+ * given back, leaving no active span).
+ */
+ }
+ else if (span->nallocatable == span->nmax &&
+ (span->fclass != 1 || span->prevspan != InvalidDsaPointer))
+ {
+ /*
+ * This entire block is free, and it's not the active block for this
+ * size class. Return the memory to the free page manager. We don't
+ * do this for the active block to prevent thrashing: if we
+ * repeatedly allocate and free the only chunk in the active block, it
+ * will be very inefficient if we deallocate and reallocate the block
+ * every time.
+ */
+ destroy_superblock(area, span_pointer);
+ }
+
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+}
+
+/*
+ * Obtain a backend-local address for a dsa_pointer. 'dp' must have been
+ * allocated by the given area (possibly in another process). This may cause
+ * a segment to be mapped into the current process.
+ */
+void *
+dsa_get_address(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_index index;
+ Size offset;
+ Size freed_segment_counter;
+
+ /* Convert InvalidDsaPointer to NULL. */
+ if (!DsaPointerIsValid(dp))
+ return NULL;
+
+ index = DSA_EXTRACT_SEGMENT_NUMBER(dp);
+ offset = DSA_EXTRACT_OFFSET(dp);
+
+ Assert(index < DSA_MAX_SEGMENTS);
+
+ /* Check if we need to cause this segment to be mapped in. */
+ if (area->segment_maps[index].mapped_address == NULL)
+ {
+ /* Call for effect (we don't need the result). */
+ get_segment_by_index(area, index);
+ }
+
+ /*
+ * Take this opportunity to check if we need to detach from any segments
+ * that have been freed. This is an unsynchronized read of the value in
+ * shared memory, but all that matters is that we eventually observe a
+ * change when that number moves.
+ */
+ freed_segment_counter = area->control->freed_segment_counter;
+ if (area->freed_segment_counter != freed_segment_counter)
+ {
+ int i;
+
+ /* Check all currently mapped segments to find what's been freed. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ if (area->segment_maps[i].header != NULL &&
+ area->segment_maps[i].header->freed)
+ {
+ dsm_detach(area->segment_maps[i].segment);
+ area->segment_maps[i].segment = NULL;
+ area->segment_maps[i].header = NULL;
+ area->segment_maps[i].mapped_address = NULL;
+ }
+ }
+ area->freed_segment_counter = freed_segment_counter;
+ }
+
+ return area->segment_maps[index].mapped_address + offset;
+}
+
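+/*
+ * Usage sketch (illustrative only; 'node' is a hypothetical client struct).
+ * Shared data structures should store dsa_pointer links rather than raw
+ * addresses, and convert them with dsa_get_address on each access:
+ *
+ *    typedef struct node { int value; dsa_pointer next; } node;
+ *
+ *    dsa_pointer p = dsa_allocate(area, sizeof(node));
+ *    if (DsaPointerIsValid(p))
+ *    {
+ *        node *n = dsa_get_address(area, p);
+ *        n->value = 42;
+ *        n->next = InvalidDsaPointer;
+ *    }
+ */
+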
+/*
+ * Pin this area, so that it will continue to exist even if all backends
+ * detach from it. In that case, the area can still be reattached to if a
+ * handle has been recorded somewhere.
+ */
+void
+dsa_pin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ if (area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_pin: area already pinned");
+ }
+ area->control->pinned = true;
+ ++area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Undo the effects of dsa_pin, so that the given area can be freed when no
+ * backends are attached to it. May be called only if dsa_pin has been
+ * called.
+ */
+void
+dsa_unpin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ Assert(area->control->refcnt > 1);
+ if (!area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_unpin: area not pinned");
+ }
+ area->control->pinned = false;
+ --area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Set the total size limit for this area. This limit is checked whenever new
+ * segments need to be allocated from the operating system. If the new size
+ * limit is already exceeded, this has no immediate effect.
+ *
+ * Note that the total virtual memory usage may be temporarily larger than
+ * this limit when segments have been freed, but not yet detached by all
+ * backends that have attached to them.
+ */
+void
+dsa_set_size_limit(dsa_area *area, Size limit)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ area->control->max_total_segment_size = limit;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Aggressively free all spare memory in the hope of returning DSM segments to
+ * the operating system.
+ */
+void
+dsa_trim(dsa_area *area)
+{
+ int size_class;
+
+ /*
+ * Trim in reverse pool order so we get to the spans-of-spans last, just
+ * in case any become entirely free while processing all the other pools.
+ */
+ for (size_class = DSA_NUM_SIZE_CLASSES - 1; size_class >= 0; --size_class)
+ {
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_pointer span_pointer;
+
+ if (size_class == DSA_SCLASS_SPAN_LARGE)
+ /* Large object frees give back segments aggressively already. */
+ continue;
+
+ /*
+ * Search fullness class 1 only. That is where we expect to find
+ * an entirely empty superblock (entirely empty superblocks in other
+ * fullness classes are returned to the free page map by dsa_free).
+ */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+ span_pointer = pool->spans[1];
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ dsa_pointer next = span->nextspan;
+
+ if (span->nallocatable == span->nmax)
+ destroy_superblock(area, span_pointer);
+
+ span_pointer = next;
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+ }
+}
+
+/*
+ * Print out debugging information about the internal state of the shared
+ * memory area.
+ */
+void
+dsa_dump(dsa_area *area)
+{
+ Size i,
+ j;
+
+ /*
+ * Note: This gives an inconsistent snapshot as it acquires and releases
+ * individual locks as it goes...
+ */
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ fprintf(stderr, "dsa_area handle %x:\n", area->control->handle);
+ fprintf(stderr, " max_total_segment_size: %zu\n",
+ area->control->max_total_segment_size);
+ fprintf(stderr, " total_segment_size: %zu\n",
+ area->control->total_segment_size);
+ fprintf(stderr, " refcnt: %d\n", area->control->refcnt);
+ fprintf(stderr, " pinned: %c\n", area->control->pinned ? 't' : 'f');
+ fprintf(stderr, " segment bins:\n");
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ {
+ if (area->control->segment_bins[i] != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_index segment_index;
+
+ fprintf(stderr,
+ " segment bin %zu (at least %d contiguous pages free):\n",
+ i, i == 0 ? 0 : 1 << (i - 1));
+ segment_index = area->control->segment_bins[i];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, segment_index);
+
+ fprintf(stderr,
+ " segment index %zu, usable_pages = %zu, "
+ "contiguous_pages = %zu, mapped at %p\n",
+ segment_index,
+ segment_map->header->usable_pages,
+ fpm_largest(segment_map->fpm),
+ segment_map->mapped_address);
+ segment_index = segment_map->header->next;
+ }
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ fprintf(stderr, " pools:\n");
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ {
+ bool found = false;
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, i), LW_EXCLUSIVE);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ if (DsaPointerIsValid(area->control->pools[i].spans[j]))
+ found = true;
+ if (found)
+ {
+ if (i == DSA_SCLASS_BLOCK_OF_SPANS)
+ fprintf(stderr, " pool for blocks of span objects:\n");
+ else if (i == DSA_SCLASS_SPAN_LARGE)
+ fprintf(stderr, " pool for large object spans:\n");
+ else
+ fprintf(stderr,
+ " pool for size class %zu (object size %hu bytes):\n",
+ i, dsa_size_classes[i]);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ {
+ if (!DsaPointerIsValid(area->control->pools[i].spans[j]))
+ fprintf(stderr, " fullness class %zu is empty\n", j);
+ else
+ {
+ dsa_pointer span_pointer = area->control->pools[i].spans[j];
+
+ fprintf(stderr, " fullness class %zu:\n", j);
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span;
+
+ span = dsa_get_address(area, span_pointer);
+ fprintf(stderr,
+ " span descriptor at %016lx, "
+ "superblock at %016lx, pages = %zu, "
+ "objects free = %hu/%hu\n",
+ span_pointer, span->start, span->npages,
+ span->nallocatable, span->nmax);
+ span_pointer = span->nextspan;
+ }
+ }
+ }
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, i));
+ }
+}
+
+/*
+ * A callback function for when the control segment for a dsa_area is
+ * detached.
+ */
+static void
+dsa_on_dsm_segment_detach(dsm_segment *segment, Datum arg)
+{
+ bool destroy = false;
+ dsa_area_control *control =
+ (dsa_area_control *) dsm_segment_address(segment);
+
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ control->handle ^ 0));
+
+ /* Decrement the reference count for the DSA area. */
+ LWLockAcquire(&control->lock, LW_EXCLUSIVE);
+ if (--control->refcnt == 0)
+ destroy = true;
+ LWLockRelease(&control->lock);
+
+ /*
+ * If we are the last to detach from the area, then we must unpin all
+ * segments so they can be returned to the OS.
+ */
+ if (destroy)
+ {
+ int i;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ dsm_handle handle;
+
+ handle = control->segment_handles[i];
+ if (handle != DSM_HANDLE_INVALID)
+ dsm_unpin_segment(handle);
+ }
+ }
+}
+
+/*
+ * Add a new span to fullness class 1 of the indicated pool.
+ */
+static void
+init_span(dsa_area *area,
+ dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ Size obsize = dsa_size_classes[size_class];
+
+ /*
+ * The per-pool lock must be held because we manipulate the span list for
+ * this pool.
+ */
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /* Push this span onto the front of the span list for fullness class 1. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ {
+ dsa_area_span *head = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+
+ head->prevspan = span_pointer;
+ }
+ span->pool = DsaAreaPoolToDsaPointer(area, pool);
+ span->nextspan = pool->spans[1];
+ span->prevspan = InvalidDsaPointer;
+ pool->spans[1] = span_pointer;
+
+ span->start = start;
+ span->npages = npages;
+ span->size_class = size_class;
+ span->ninitialized = 0;
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * A block-of-spans contains its own descriptor, so mark one object as
+ * initialized and reduce the count of allocatable objects by one.
+ * Doing this here has the side effect of also reducing nmax by one,
+ * which is important to make sure we free this object at the correct
+ * time.
+ */
+ span->ninitialized = 1;
+ span->nallocatable = FPM_PAGE_SIZE / obsize - 1;
+ }
+ else if (size_class != DSA_SCLASS_SPAN_LARGE)
+ span->nallocatable = DSA_SUPERBLOCK_SIZE / obsize;
+ span->firstfree = DSA_SPAN_NOTHING_FREE;
+ span->nmax = span->nallocatable;
+ span->fclass = 1;
+}
+
+/*
+ * Transfer the first span in one fullness class to the head of another
+ * fullness class.
+ */
+static bool
+transfer_first_span(dsa_area *area,
+ dsa_area_pool *pool, int fromclass, int toclass)
+{
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+
+ /* Can't do it if source list is empty. */
+ span_pointer = pool->spans[fromclass];
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+
+ /* Remove span from head of source list. */
+ span = dsa_get_address(area, span_pointer);
+ pool->spans[fromclass] = span->nextspan;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+
+ /* Add span to head of target list. */
+ span->nextspan = pool->spans[toclass];
+ pool->spans[toclass] = span_pointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = toclass;
+
+ return true;
+}
+
+/*
+ * Allocate one object of the requested size class from the given area.
+ */
+static inline dsa_pointer
+alloc_object(dsa_area *area, int size_class)
+{
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_area_span *span;
+ dsa_pointer block;
+ dsa_pointer result;
+ char *object;
+ Size size;
+
+ /*
+ * Even though ensure_active_superblock can in turn call alloc_object if
+ * it needs to allocate a new span, that's always from a different pool,
+ * and the order of lock acquisition is always the same, so it's OK that
+ * we hold this lock for the duration of this function.
+ */
+ Assert(!LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /*
+ * If there's no active superblock, we must successfully obtain one or
+ * fail the request.
+ */
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ !ensure_active_superblock(area, pool, size_class))
+ {
+ result = InvalidDsaPointer;
+ }
+ else
+ {
+ /*
+ * There should be a block in fullness class 1 at this point, and it
+ * should never be completely full. Thus we can either pop an object
+ * from the free list or, failing that, initialize a new object.
+ */
+ Assert(DsaPointerIsValid(pool->spans[1]));
+ span = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+ Assert(span->nallocatable > 0);
+ block = span->start;
+ Assert(size_class < DSA_NUM_SIZE_CLASSES);
+ size = dsa_size_classes[size_class];
+ if (span->firstfree != DSA_SPAN_NOTHING_FREE)
+ {
+ result = block + span->firstfree * size;
+ object = dsa_get_address(area, result);
+ span->firstfree = NextFreeObjectIndex(object);
+ }
+ else
+ {
+ result = block + span->ninitialized * size;
+ ++span->ninitialized;
+ }
+ --span->nallocatable;
+
+ /* If it's now full, move it to the highest-numbered fullness class. */
+ if (span->nallocatable == 0)
+ transfer_first_span(area, pool, 1, DSA_FULLNESS_CLASSES - 1);
+ }
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+
+ return result;
+}
+
+/*
+ * Ensure an active (i.e. fullness class 1) superblock, unless all existing
+ * superblocks are completely full and no more can be allocated.
+ *
+ * Fullness classes K of 0..N are loosely intended to represent blocks whose
+ * utilization percentage is at least K/N, but we only enforce this rigorously
+ * for the highest-numbered fullness class, which always contains exactly
+ * those blocks that are completely full. It's otherwise acceptable for a
+ * block to be in a higher-numbered fullness class than the one to which it
+ * logically belongs. In addition, the active block, which is always the
+ * first block in fullness class 1, is permitted to have a higher allocation
+ * percentage than would normally be allowable for that fullness class; we
+ * don't move it until it's completely full, and then it goes to the
+ * highest-numbered fullness class.
+ *
+ * It might seem odd that the active block is the head of fullness class 1
+ * rather than fullness class 0, but experience with other allocators has
+ * shown that it's usually better to allocate from a block that's moderately
+ * full rather than one that's nearly empty. Insofar as is reasonably
+ * possible, we want to avoid performing new allocations in a block that would
+ * otherwise become empty soon.
+ */
+static bool
+ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class)
+{
+ dsa_pointer span_pointer;
+ dsa_pointer start_pointer;
+ Size obsize = dsa_size_classes[size_class];
+ Size nmax;
+ int fclass;
+ Size npages = 1;
+ Size first_page;
+ Size i;
+ dsa_segment_map *segment_map;
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /*
+ * Compute the number of objects that will fit in a block of this size
+ * class. Span-of-spans blocks are just a single page, and the first
+ * object isn't available for use because it describes the block-of-spans
+ * itself.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ nmax = FPM_PAGE_SIZE / obsize - 1;
+ else
+ nmax = DSA_SUPERBLOCK_SIZE / obsize;
+
+ /*
+ * If fullness class 1 is empty, try to find a span to put in it by
+ * scanning higher-numbered fullness classes (excluding the last one,
+ * whose blocks are certain to all be completely full).
+ */
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ {
+ span_pointer = pool->spans[fclass];
+
+ while (DsaPointerIsValid(span_pointer))
+ {
+ int tfclass;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+ dsa_area_span *prevspan;
+ dsa_pointer next_span_pointer;
+
+ span = (dsa_area_span *)
+ dsa_get_address(area, span_pointer);
+ next_span_pointer = span->nextspan;
+
+ /* Figure out what fullness class should contain this span. */
+ tfclass = (nmax - span->nallocatable)
+ * (DSA_FULLNESS_CLASSES - 1) / nmax;
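+
+ /*
+ * Worked example (assuming DSA_FULLNESS_CLASSES is 4): with nmax =
+ * 100, a span with 60 free objects gives (100 - 60) * 3 / 100 = 1,
+ * one with 10 free objects gives 2, and a completely full span
+ * gives 3.
+ */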
+
+ /* Look up next span. */
+ if (DsaPointerIsValid(span->nextspan))
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ else
+ nextspan = NULL;
+
+ /*
+ * If utilization has dropped enough that this now belongs in some
+ * other fullness class, move it there.
+ */
+ if (tfclass < fclass)
+ {
+ /* Remove from the current fullness class list. */
+ if (pool->spans[fclass] == span_pointer)
+ {
+ /* It was the head; remove it. */
+ Assert(!DsaPointerIsValid(span->prevspan));
+ pool->spans[fclass] = span->nextspan;
+ if (nextspan != NULL)
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+ else
+ {
+ /* It was not the head. */
+ Assert(DsaPointerIsValid(span->prevspan));
+ prevspan = (dsa_area_span *)
+ dsa_get_address(area, span->prevspan);
+ prevspan->nextspan = span->nextspan;
+ }
+ if (nextspan != NULL)
+ nextspan->prevspan = span->prevspan;
+
+ /* Push onto the head of the new fullness class list. */
+ span->nextspan = pool->spans[tfclass];
+ pool->spans[tfclass] = span_pointer;
+ span->prevspan = InvalidDsaPointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = tfclass;
+ }
+
+ /* Advance to next span on list. */
+ span_pointer = next_span_pointer;
+ }
+
+ /* Stop now if we found a suitable block. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ return true;
+ }
+
+ /*
+ * If there are no blocks that properly belong in fullness class 1, pick
+ * one from some other fullness class and move it there anyway, so that we
+ * have an allocation target. Our last choice is to transfer a block
+ * that's almost empty (and might become completely empty soon if left
+ * alone), but even that is better than failing, which is what we must do
+ * if there are no blocks at all with freespace.
+ */
+ Assert(!DsaPointerIsValid(pool->spans[1]));
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ if (transfer_first_span(area, pool, fclass, 1))
+ return true;
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ transfer_first_span(area, pool, 0, 1))
+ return true;
+
+ /*
+ * We failed to find an existing span with free objects, so we need to
+ * allocate a new superblock and construct a new span to manage it.
+ *
+ * First, get a dsa_area_span object to describe the new superblock
+ * ... unless this allocation is for a dsa_area_span object, in which case
+ * that's surely not going to work. We handle that case by storing the
+ * span describing a block-of-spans inline.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+ npages = DSA_PAGES_PER_SUPERBLOCK;
+ }
+
+ /* Find or create a segment and allocate the superblock. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
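+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Don't leak the span object we allocated above. */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;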
+ }
+ }
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* Compute the start of the superblock. */
+ start_pointer =
+ DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /*
+ * If this is a block-of-spans, carve the descriptor right out of the
+ * allocated space.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * The span descriptor is the first object in the block, so its
+ * dsa_pointer is simply the block's start pointer.
+ */
+ span_pointer = start_pointer;
+ }
+
+ /* Initialize span and pagemap. */
+ init_span(area, span_pointer, pool, start_pointer, npages, size_class);
+ for (i = 0; i < npages; ++i)
+ segment_map->pagemap[first_page + i] = span_pointer;
+
+ return true;
+}
+
+/*
+ * Return the segment map corresponding to a given segment index, mapping the
+ * segment in if necessary.
+ */
+static dsa_segment_map *
+get_segment_by_index(dsa_area *area, dsa_segment_index index)
+{
+ if (area->segment_maps[index].mapped_address == NULL) /* unlikely */
+ {
+ dsm_handle handle;
+ dsm_segment *segment;
+ dsa_segment_map *segment_map;
+
+ handle = area->control->segment_handles[index];
+
+ /* This slot has been freed. */
+ if (handle == DSM_HANDLE_INVALID)
+ return NULL;
+
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to segment");
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+ segment_map = &area->segment_maps[index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header =
+ (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = dsm_segment_map_length(segment);
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ Assert(segment_map->header->magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index));
+ }
+
+ return &area->segment_maps[index];
+}
+
+/*
+ * Return a superblock to the free page manager. If the underlying segment
+ * has become entirely free, then return it to the operating system.
+ *
+ * The appropriate pool lock must be held.
+ */
+static void
+destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ int size_class = span->size_class;
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
+
+ /* Remove it from its fullness class list. */
+ unlink_span(area, span);
+
+ /*
+ * Note: This is the only time we acquire the area lock while we already
+ * hold a per-pool lock. We never hold the area lock and then take a pool
+ * lock, or we could deadlock.
+ */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ /* Check if the segment is now entirely free. */
+ if (fpm_largest(segment_map->fpm) == segment_map->header->usable_pages)
+ {
+ dsa_segment_index index = get_segment_index(area, segment_map);
+
+ /* If it's not the segment with extra control data, free it. */
+ if (index != 0)
+ {
+ /*
+ * Give it back to the OS, and allow other backends to detect that
+ * they need to detach.
+ */
+ unlink_segment(area, segment_map);
+ segment_map->header->freed = true;
+ Assert(area->control->total_segment_size >=
+ segment_map->header->size);
+ area->control->total_segment_size -=
+ segment_map->header->size;
+ dsm_unpin_segment(dsm_segment_handle(segment_map->segment));
+ dsm_detach(segment_map->segment);
+ area->control->segment_handles[index] = DSM_HANDLE_INVALID;
+ ++area->control->freed_segment_counter;
+ segment_map->segment = NULL;
+ segment_map->header = NULL;
+ segment_map->mapped_address = NULL;
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /*
+ * Span-of-spans blocks store the span which describes them within the
+ * block itself, so freeing the storage implicitly frees the descriptor
+ * also. If this is a block of any other type, we need to separately free
+ * the span object also. This recursive call to dsa_free will acquire the
+ * span pool's lock. We can't deadlock because the acquisition order is
+ * always some other pool and then the span pool.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+}
+
+static void
+unlink_span(dsa_area *area, dsa_area_span *span)
+{
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ dsa_area_span *next = dsa_get_address(area, span->nextspan);
+
+ next->prevspan = span->prevspan;
+ }
+ if (DsaPointerIsValid(span->prevspan))
+ {
+ dsa_area_span *prev = dsa_get_address(area, span->prevspan);
+
+ prev->nextspan = span->nextspan;
+ }
+ else
+ {
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ pool->spans[span->fclass] = span->nextspan;
+ }
+}
+
+static void
+add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer,
+ int fclass)
+{
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ if (DsaPointerIsValid(pool->spans[fclass]))
+ {
+ dsa_area_span *head = dsa_get_address(area,
+ pool->spans[fclass]);
+
+ head->prevspan = span_pointer;
+ }
+ span->prevspan = InvalidDsaPointer;
+ span->nextspan = pool->spans[fclass];
+ pool->spans[fclass] = span_pointer;
+ span->fclass = fclass;
+}
+
+/*
+ * Detach from an area that was either created or attached to by this process.
+ */
+void
+dsa_detach(dsa_area *area)
+{
+ int i;
+
+ /* Detach from all segments. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_detach(area->segment_maps[i].segment);
+
+ /* Free the backend-local area object. */
+ pfree(area);
+}
+
+/*
+ * Unlink a segment from the bin that contains it.
+ */
+static void
+unlink_segment(dsa_area *area, dsa_segment_map *segment_map)
+{
+ if (segment_map->header->prev != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *prev;
+
+ prev = get_segment_by_index(area, segment_map->header->prev);
+ prev->header->next = segment_map->header->next;
+ }
+ else
+ {
+ Assert(area->control->segment_bins[segment_map->header->bin] ==
+ get_segment_index(area, segment_map));
+ area->control->segment_bins[segment_map->header->bin] =
+ segment_map->header->next;
+ }
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area, segment_map->header->next);
+ next->header->prev = segment_map->header->prev;
+ }
+}
+
+/*
+ * Find a segment that could satisfy a request for 'npages' of contiguous
+ * memory, or return NULL if none can be found. This may involve attaching to
+ * segments that weren't previously attached so that we can query their free
+ * pages map.
+ */
+static dsa_segment_map *
+get_best_segment(dsa_area *area, Size npages)
+{
+ Size bin;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /*
+ * Start searching from the first bin that *might* have enough contiguous
+ * pages.
+ */
+ for (bin = contiguous_pages_to_segment_bin(npages);
+ bin < DSA_NUM_SEGMENT_BINS;
+ ++bin)
+ {
+ /*
+ * The minimum contiguous size that any segment in this bin should
+ * have. We'll re-bin if we see segments with fewer.
+ */
+ Size threshold = (Size) 1 << (bin - 1);
+ dsa_segment_index segment_index;
+
+ /* Search this bin for a segment with enough contiguous space. */
+ segment_index = area->control->segment_bins[bin];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+ dsa_segment_index next_segment_index;
+ Size contiguous_pages;
+
+ segment_map = get_segment_by_index(area, segment_index);
+ next_segment_index = segment_map->header->next;
+ contiguous_pages = fpm_largest(segment_map->fpm);
+
+ /* Not enough for the request, still enough for this bin. */
+ if (contiguous_pages >= threshold && contiguous_pages < npages)
+ {
+ segment_index = next_segment_index;
+ continue;
+ }
+
+ /* Re-bin it if it's no longer in the appropriate bin. */
+ if (contiguous_pages < threshold)
+ {
+ Size new_bin;
+
+ new_bin = contiguous_pages_to_segment_bin(contiguous_pages);
+
+ /* Remove it from its current bin. */
+ unlink_segment(area, segment_map);
+
+ /* Push it onto the front of its new bin. */
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[new_bin];
+ segment_map->header->bin = new_bin;
+ area->control->segment_bins[new_bin] = segment_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area,
+ segment_map->header->next);
+ Assert(next->header->bin == new_bin);
+ next->header->prev = segment_index;
+ }
+
+ /*
+ * But fall through to see if it's enough to satisfy this
+ * request anyway....
+ */
+ }
+
+ /* Check if we are done. */
+ if (contiguous_pages >= npages)
+ return segment_map;
+
+ /* Continue searching the same bin. */
+ segment_index = next_segment_index;
+ }
+ }
+
+ /* Not found. */
+ return NULL;
+}
+
+/*
+ * Create a new segment that can handle at least requested_pages. Returns
+ * NULL if the requested total size limit or maximum allowed number of
+ * segments would be exceeded.
+ */
+static dsa_segment_map *
+make_new_segment(dsa_area *area, Size requested_pages)
+{
+ dsa_segment_index new_index;
+ Size metadata_bytes;
+ Size total_size;
+ Size total_pages;
+ Size usable_pages;
+ dsa_segment_map *segment_map;
+ dsm_segment *segment;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /* Find a segment slot that is not in use (linearly for now). */
+ for (new_index = 1; new_index < DSA_MAX_SEGMENTS; ++new_index)
+ {
+ if (area->control->segment_handles[new_index] == DSM_HANDLE_INVALID)
+ break;
+ }
+ if (new_index == DSA_MAX_SEGMENTS)
+ return NULL;
+
+ /*
+ * If the total size limit is already exceeded, then we exit early and
+ * avoid arithmetic wraparound in the unsigned expressions below.
+ */
+ if (area->control->total_segment_size >=
+ area->control->max_total_segment_size)
+ return NULL;
+
+ /*
+ * The size should be at least as big as requested, and at least big
+ * enough to follow a geometric series that approximately doubles the
+ * total storage each time we create a new segment. We use geometric
+ * growth because the underlying DSM system isn't designed for large
+ * numbers of segments (otherwise we might even consider just using one
+ * DSM segment for each large allocation and for each superblock, and then
+ * we wouldn't need to use FreePageManager).
+ *
+ * We decide on a total segment size first, so that we produce tidy
+ * power-of-two sized segments. This is a good property to have if we
+ * move to huge pages in the future. Then we work back to the number of
+ * pages we can fit.
+ */
+ total_size = DSA_INITIAL_SEGMENT_SIZE * ((Size) 1 << new_index);
+ total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size,
+ area->control->max_total_segment_size -
+ area->control->total_segment_size);
+
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ sizeof(dsa_pointer) * total_pages;
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ if (total_size <= metadata_bytes)
+ return NULL;
+ usable_pages = (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ Assert(metadata_bytes + usable_pages * FPM_PAGE_SIZE <= total_size);
+
+ /* See if that is enough... */
+ if (requested_pages > usable_pages)
+ {
+ /*
+ * We'll make an odd-sized segment, working forward from the requested
+ * number of pages.
+ */
+ usable_pages = requested_pages;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ usable_pages * sizeof(dsa_pointer);
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ total_size = metadata_bytes + usable_pages * FPM_PAGE_SIZE;
+
+ /* Is that too large for dsa_pointer's addressing scheme? */
+ if (total_size > DSA_MAX_SEGMENT_SIZE)
+ return NULL;
+
+ /* Would that exceed the limit? */
+ if (total_size > area->control->max_total_segment_size -
+ area->control->total_segment_size)
+ return NULL;
+ }
+
+ /* Create the segment. */
+ segment = dsm_create(total_size, 0);
+ if (segment == NULL)
+ return NULL;
+ dsm_pin_segment(segment);
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+
+ /* Store the handle in shared memory to be found by index. */
+ area->control->segment_handles[new_index] =
+ dsm_segment_handle(segment);
+
+ area->control->total_segment_size += total_size;
+ Assert(area->control->total_segment_size <=
+ area->control->max_total_segment_size);
+
+ /* Build a segment map for this segment in this backend. */
+ segment_map = &area->segment_maps[new_index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = total_size;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Set up the segment header and put it in the appropriate bin. */
+ segment_map->header->magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ new_index;
+ segment_map->header->usable_pages = usable_pages;
+ segment_map->header->size = total_size;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[segment_map->header->bin];
+ segment_map->header->freed = false;
+ area->control->segment_bins[segment_map->header->bin] = new_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next =
+ get_segment_by_index(area, segment_map->header->next);
+
+ Assert(next->header->bin == segment_map->header->bin);
+ next->header->prev = new_index;
+ }
+
+ return segment_map;
+}
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index b2403e1..20973af 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/mmgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = aset.o mcxt.o portalmem.o
+OBJS = aset.o freepage.o mcxt.o portalmem.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
new file mode 100644
index 0000000..fd1f2ec
--- /dev/null
+++ b/src/backend/utils/mmgr/freepage.c
@@ -0,0 +1,1812 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.c
+ * Management of free memory pages.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/freepage.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+
+#include "utils/freepage.h"
+#include "utils/relptr.h"
+
+
+/* Magic numbers to identify various page types */
+#define FREE_PAGE_SPAN_LEADER_MAGIC 0xea4020f0
+#define FREE_PAGE_LEAF_MAGIC 0x98eae728
+#define FREE_PAGE_INTERNAL_MAGIC 0x19aa32c9
+
+/* Doubly linked list of spans of free pages; stored in first page of span. */
+struct FreePageSpanLeader
+{
+ int magic; /* always FREE_PAGE_SPAN_LEADER_MAGIC */
+ Size npages; /* number of pages in span */
+ RelptrFreePageSpanLeader prev;
+ RelptrFreePageSpanLeader next;
+};
+
+/* Common header for btree leaf and internal pages. */
+typedef struct FreePageBtreeHeader
+{
+ int magic; /* FREE_PAGE_LEAF_MAGIC or
+ * FREE_PAGE_INTERNAL_MAGIC */
+ Size nused; /* number of items used */
+ RelptrFreePageBtree parent; /* uplink */
+} FreePageBtreeHeader;
+
+/* Internal key; points to next level of btree. */
+typedef struct FreePageBtreeInternalKey
+{
+ Size first_page; /* low bound for keys on child page */
+ RelptrFreePageBtree child; /* downlink */
+} FreePageBtreeInternalKey;
+
+/* Leaf key; no payload data. */
+typedef struct FreePageBtreeLeafKey
+{
+ Size first_page; /* first page in span */
+ Size npages; /* number of pages in span */
+} FreePageBtreeLeafKey;
+
+/* Work out how many keys will fit on a page. */
+#define FPM_ITEMS_PER_INTERNAL_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeInternalKey))
+#define FPM_ITEMS_PER_LEAF_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeLeafKey))
+
+/* A btree page of either sort */
+struct FreePageBtree
+{
+ FreePageBtreeHeader hdr;
+ union
+ {
+ FreePageBtreeInternalKey internal_key[FPM_ITEMS_PER_INTERNAL_PAGE];
+ FreePageBtreeLeafKey leaf_key[FPM_ITEMS_PER_LEAF_PAGE];
+ } u;
+};
+
+/* Results of a btree search */
+typedef struct FreePageBtreeSearchResult
+{
+ FreePageBtree *page;
+ Size index;
+ bool found;
+ unsigned split_pages;
+} FreePageBtreeSearchResult;
+
+/* Helper functions */
+static void FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm,
+ FreePageBtree *btp);
+static Size FreePageBtreeCleanup(FreePageManager *fpm);
+static FreePageBtree *FreePageBtreeFindLeftSibling(char *base,
+ FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeFindRightSibling(char *base,
+ FreePageBtree *btp);
+static Size FreePageBtreeFirstKey(FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeGetRecycled(FreePageManager *fpm);
+static void FreePageBtreeInsertInternal(char *base, FreePageBtree *btp,
+ Size index, Size first_page, FreePageBtree *child);
+static void FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index,
+ Size first_page, Size npages);
+static void FreePageBtreeRecycle(FreePageManager *fpm, Size pageno);
+static void FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp,
+ Size index);
+static void FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp);
+static void FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result);
+static Size FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page);
+static Size FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page);
+static FreePageBtree *FreePageBtreeSplitPage(FreePageManager *fpm,
+ FreePageBtree *btp);
+static void FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp);
+static void FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf);
+static void FreePageManagerDumpSpans(FreePageManager *fpm,
+ FreePageSpanLeader *span, Size expected_pages,
+ StringInfo buf);
+static bool FreePageManagerGetInternal(FreePageManager *fpm, Size npages,
+ Size *first_page);
+static Size FreePageManagerPutInternal(FreePageManager *fpm, Size first_page,
+ Size npages, bool soft);
+static void FreePagePopSpanLeader(FreePageManager *fpm, Size pageno);
+static void FreePagePushSpanLeader(FreePageManager *fpm, Size first_page,
+ Size npages);
+static void FreePageManagerUpdateLargest(FreePageManager *fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+static Size sum_free_pages(FreePageManager *fpm);
+#endif
+
+/*
+ * Initialize a new, empty free page manager.
+ *
+ * 'fpm' should reference caller-provided memory large enough to contain a
+ * FreePageManager. We'll initialize it here.
+ *
+ * 'base' is the address to which all pointers are relative. When managing
+ * a dynamic shared memory segment, it should normally be the base of the
+ * segment. When managing backend-private memory, it can be either NULL or,
+ * if managing a single contiguous extent of memory, the start of that extent.
+ */
+void
+FreePageManagerInitialize(FreePageManager *fpm, char *base)
+{
+ Size f;
+
+ relptr_store(base, fpm->self, fpm);
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ relptr_store(base, fpm->btree_recycle, (FreePageSpanLeader *) NULL);
+ fpm->btree_depth = 0;
+ fpm->btree_recycle_count = 0;
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->contiguous_pages = 0;
+ fpm->contiguous_pages_dirty = true;
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages = 0;
+#endif
+
+ for (f = 0; f < FPM_NUM_FREELISTS; f++)
+ relptr_store(base, fpm->freelist[f], (FreePageSpanLeader *) NULL);
+}
+
+/*
+ * Allocate a run of pages of the given length from the free page manager.
+ * The return value indicates whether we were able to satisfy the request;
+ * if true, the first page of the allocation is stored in *first_page.
+ */
+bool
+FreePageManagerGet(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ bool result;
+
+ result = FreePageManagerGetInternal(fpm, npages, first_page);
+
+ /*
+ * It's a bit counterintuitive, but allocating pages can actually create
+ * opportunities for cleanup that create larger ranges. We might pull a
+ * key out of the btree that enables the item at the head of the btree
+ * recycle list to be inserted; and then if there are more items behind it
+ * one of those might cause two currently-separated ranges to merge,
+ * creating a single range of contiguous pages larger than any that
+ * existed previously. It might be worth trying to improve the cleanup
+ * algorithm to avoid such corner cases, but for now we just notice the
+ * condition and do the appropriate reporting.
+ */
+ FreePageBtreeCleanup(fpm);
+
+ /*
+ * TODO: We could take Max(fpm->contiguous_pages, result of
+ * FreePageBtreeCleanup) and give it to FreePageManagerUpdateLargest as a
+ * starting point for its search, potentially avoiding a bunch of work,
+ * since there is no way the largest contiguous run is bigger than that.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ if (result)
+ {
+ Assert(fpm->free_pages >= npages);
+ fpm->free_pages -= npages;
+ }
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+ return result;
+}
+
+#ifdef FPM_EXTRA_ASSERTS
+static void
+sum_free_pages_recurse(FreePageManager *fpm, FreePageBtree *btp, Size *sum)
+{
+ char *base = fpm_segment_base(fpm);
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ||
+ btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ ++*sum;
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ Size index;
+
+
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ sum_free_pages_recurse(fpm, child, sum);
+ }
+ }
+}
+static Size
+sum_free_pages(FreePageManager *fpm)
+{
+ FreePageSpanLeader *recycle;
+ char *base = fpm_segment_base(fpm);
+ Size sum = 0;
+ int list;
+
+ /* Count the spans by scanning the freelists. */
+ for (list = 0; list < FPM_NUM_FREELISTS; ++list)
+ {
+
+ if (!relptr_is_null(fpm->freelist[list]))
+ {
+ FreePageSpanLeader *candidate =
+ relptr_access(base, fpm->freelist[list]);
+
+ do
+ {
+ sum += candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ }
+
+ /* Count the pages used by the btree itself (internal and leaf). */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ sum_free_pages_recurse(fpm, root, &sum);
+ }
+
+ /* Count the recycle list. */
+ for (recycle = relptr_access(base, fpm->btree_recycle);
+ recycle != NULL;
+ recycle = relptr_access(base, recycle->next))
+ {
+ Assert(recycle->npages == 1);
+ ++sum;
+ }
+
+ return sum;
+}
+#endif
+
+/*
+ * Recompute the size of the largest run of pages that the user could
+ * successfully get, if it has been marked dirty.
+ */
+static void
+FreePageManagerUpdateLargest(FreePageManager *fpm)
+{
+ char *base;
+ Size largest;
+
+ if (!fpm->contiguous_pages_dirty)
+ return;
+
+ base = fpm_segment_base(fpm);
+ largest = 0;
+ if (!relptr_is_null(fpm->freelist[FPM_NUM_FREELISTS - 1]))
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[FPM_NUM_FREELISTS - 1]);
+ do
+ {
+ if (candidate->npages > largest)
+ largest = candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ else
+ {
+ Size f = FPM_NUM_FREELISTS - 1;
+
+ do
+ {
+ --f;
+ if (!relptr_is_null(fpm->freelist[f]))
+ {
+ largest = f + 1;
+ break;
+ }
+ } while (f > 0);
+ }
+
+ fpm->contiguous_pages = largest;
+ fpm->contiguous_pages_dirty = false;
+}
+
+/*
+ * Transfer a run of pages to the free page manager.
+ */
+void
+FreePageManagerPut(FreePageManager *fpm, Size first_page, Size npages)
+{
+ Size contiguous_pages;
+
+ Assert(npages > 0);
+
+ /* Record the new pages. */
+ contiguous_pages =
+ FreePageManagerPutInternal(fpm, first_page, npages, false);
+
+ /*
+ * If the new range we inserted into the page manager was contiguous with
+ * an existing range, it may have opened up cleanup opportunities.
+ */
+ if (contiguous_pages > npages)
+ {
+ Size cleanup_contiguous_pages;
+
+ cleanup_contiguous_pages = FreePageBtreeCleanup(fpm);
+ if (cleanup_contiguous_pages > contiguous_pages)
+ contiguous_pages = cleanup_contiguous_pages;
+ }
+
+ /*
+ * TODO: Figure out how to avoid setting this every time. It may not be as
+ * simple as it looks.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages += npages;
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+}
+
+/*
+ * Produce a debugging dump of the state of a free page manager.
+ */
+char *
+FreePageManagerDump(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ StringInfoData buf;
+ FreePageSpanLeader *recycle;
+ bool dumped_any_freelist = false;
+ Size f;
+
+ /* Initialize output buffer. */
+ initStringInfo(&buf);
+
+ /* Dump general stuff. */
+ appendStringInfo(&buf, "metadata: self %zu max contiguous pages = %zu\n",
+ fpm->self.relptr_off, fpm->contiguous_pages);
+
+ /* Dump btree. */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root;
+
+ appendStringInfo(&buf, "btree depth %u:\n", fpm->btree_depth);
+ root = relptr_access(base, fpm->btree_root);
+ FreePageManagerDumpBtree(fpm, root, NULL, 0, &buf);
+ }
+ else if (fpm->singleton_npages > 0)
+ {
+ appendStringInfo(&buf, "singleton: %zu(%zu)\n",
+ fpm->singleton_first_page, fpm->singleton_npages);
+ }
+
+ /* Dump btree recycle list. */
+ recycle = relptr_access(base, fpm->btree_recycle);
+ if (recycle != NULL)
+ {
+ appendStringInfo(&buf, "btree recycle:");
+ FreePageManagerDumpSpans(fpm, recycle, 1, &buf);
+ }
+
+ /* Dump free lists. */
+ for (f = 0; f < FPM_NUM_FREELISTS; ++f)
+ {
+ FreePageSpanLeader *span;
+
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+ if (!dumped_any_freelist)
+ {
+ appendStringInfo(&buf, "freelists:\n");
+ dumped_any_freelist = true;
+ }
+ appendStringInfo(&buf, " %zu:", f + 1);
+ span = relptr_access(base, fpm->freelist[f]);
+ FreePageManagerDumpSpans(fpm, span, f + 1, &buf);
+ }
+
+ /* And return result to caller. */
+ return buf.data;
+}
+
+
+/*
+ * The first_page value stored at index zero in any non-root page must match
+ * the first_page value stored in its parent at the index which points to that
+ * page. So when the value stored at index zero in a btree page changes, we've
+ * got to walk up the tree adjusting ancestor keys until we reach an ancestor
+ * where that key isn't index zero. This function should be called after
+ * updating the first key on the target page; it will propagate the change
+ * upward as far as needed.
+ *
+ * We assume here that the first key on the page has not changed enough to
+ * require changes in the ordering of keys on its ancestor pages. Thus,
+ * if we search the parent page for the first key greater than or equal to
+ * the first key on the current page, the downlink to this page will be either
+ * the exact index returned by the search (if the first key decreased)
+ * or one less (if the first key increased).
+ */
+static void
+FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ Size first_page;
+ FreePageBtree *parent;
+ FreePageBtree *child;
+
+ /* This might be either a leaf or an internal page. */
+ Assert(btp->hdr.nused > 0);
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ first_page = btp->u.leaf_key[0].first_page;
+ }
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ first_page = btp->u.internal_key[0].first_page;
+ }
+ child = btp;
+
+ /* Loop until we find an ancestor that does not require adjustment. */
+ for (;;)
+ {
+ Size s;
+
+ parent = relptr_access(base, child->hdr.parent);
+ if (parent == NULL)
+ break;
+ s = FreePageBtreeSearchInternal(parent, first_page);
+
+ /* Key is either at index s or index s-1; figure out which. */
+ if (s >= parent->hdr.nused)
+ {
+ Assert(s == parent->hdr.nused);
+ --s;
+ }
+ else
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ if (check != child)
+ {
+ Assert(s > 0);
+ --s;
+ }
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* Debugging double-check. */
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ Assert(s < parent->hdr.nused);
+ Assert(child == check);
+ }
+#endif
+
+ /* Update the parent key. */
+ parent->u.internal_key[s].first_page = first_page;
+
+ /*
+ * If this is the first key in the parent, go up another level; else
+ * done.
+ */
+ if (s > 0)
+ break;
+ child = parent;
+ }
+}
+
+/*
+ * Attempt to reclaim space from the free-page btree. The return value is
+ * the largest range of contiguous pages created by the cleanup operation.
+ */
+static Size
+FreePageBtreeCleanup(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ Size max_contiguous_pages = 0;
+
+ /* Attempt to shrink the depth of the btree. */
+ while (!relptr_is_null(fpm->btree_root))
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ /* If the root contains only one key, reduce depth by one. */
+ if (root->hdr.nused == 1)
+ {
+ /* Shrink depth of tree by one. */
+ Assert(fpm->btree_depth > 0);
+ --fpm->btree_depth;
+ if (root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ /* If root is a leaf, convert its only entry to singleton range. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages;
+ }
+ else
+ {
+ FreePageBtree *newroot;
+
+ /* If root is an internal page, make its only child the root. */
+ Assert(root->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ relptr_copy(fpm->btree_root, root->u.internal_key[0].child);
+ newroot = relptr_access(base, fpm->btree_root);
+ relptr_store(base, newroot->hdr.parent, (FreePageBtree *) NULL);
+ }
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, root));
+ }
+ else if (root->hdr.nused == 2 &&
+ root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Size end_of_first;
+ Size start_of_second;
+
+ end_of_first = root->u.leaf_key[0].first_page +
+ root->u.leaf_key[0].npages;
+ start_of_second = root->u.leaf_key[1].first_page;
+
+ if (end_of_first + 1 == start_of_second)
+ {
+ Size root_page = fpm_pointer_to_page(base, root);
+
+ if (end_of_first == root_page)
+ {
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[0].first_page);
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[1].first_page);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages +
+ root->u.leaf_key[1].npages + 1;
+ fpm->btree_depth = 0;
+ relptr_store(base, fpm->btree_root,
+ (FreePageBtree *) NULL);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ Assert(max_contiguous_pages == 0);
+ max_contiguous_pages = fpm->singleton_npages;
+ }
+ }
+
+ /* Whether it worked or not, it's time to stop. */
+ break;
+ }
+ else
+ {
+ /* Nothing more to do. Stop. */
+ break;
+ }
+ }
+
+ /*
+ * Attempt to free recycled btree pages. We skip this if releasing the
+ * recycled page would require a btree page split, because the page we're
+ * trying to recycle would be consumed by the split, which would be
+ * counterproductive.
+ *
+ * We also currently only ever attempt to recycle the first page on the
+ * list; that could be made more aggressive, but it's not clear that the
+ * complexity would be worthwhile.
+ */
+ while (fpm->btree_recycle_count > 0)
+ {
+ FreePageBtree *btp;
+ Size first_page;
+ Size contiguous_pages;
+
+ btp = FreePageBtreeGetRecycled(fpm);
+ first_page = fpm_pointer_to_page(base, btp);
+ contiguous_pages = FreePageManagerPutInternal(fpm, first_page, 1, true);
+ if (contiguous_pages == 0)
+ {
+ FreePageBtreeRecycle(fpm, first_page);
+ break;
+ }
+ else
+ {
+ if (contiguous_pages > max_contiguous_pages)
+ max_contiguous_pages = contiguous_pages;
+ }
+ }
+
+ return max_contiguous_pages;
+}
+
+/*
+ * Consider consolidating the given page with its left or right sibling,
+ * if it's fairly empty.
+ */
+static void
+FreePageBtreeConsolidate(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *np;
+ Size max;
+
+ /*
+ * We only try to consolidate pages that are less than a third full. We
+ * could be more aggressive about this, but that might risk performing
+ * consolidation only to end up splitting again shortly thereafter. Since
+ * the btree should be very small compared to the space under management,
+ * our goal isn't so much to ensure that it always occupies the absolutely
+ * smallest possible number of pages as to reclaim pages before things get
+ * too egregiously out of hand.
+ */
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ max = FPM_ITEMS_PER_LEAF_PAGE;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ max = FPM_ITEMS_PER_INTERNAL_PAGE;
+ }
+ if (btp->hdr.nused >= max / 3)
+ return;
+
+ /*
+ * If we can fit our right sibling's keys onto this page, consolidate.
+ */
+ np = FreePageBtreeFindRightSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&btp->u.leaf_key[btp->hdr.nused], &np->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ }
+ else
+ {
+ memcpy(&btp->u.internal_key[btp->hdr.nused], &np->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, btp);
+ }
+ FreePageBtreeRemovePage(fpm, np);
+ return;
+ }
+
+ /*
+ * If we can fit our keys onto our left sibling's page, consolidate. In
+ * this case, we move our keys onto the other page rather than vice
+ * versa, to avoid having to adjust ancestor keys.
+ */
+ np = FreePageBtreeFindLeftSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&np->u.leaf_key[np->hdr.nused], &btp->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ }
+ else
+ {
+ memcpy(&np->u.internal_key[np->hdr.nused], &btp->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, np);
+ }
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+}
+
+/*
+ * Find the passed page's left sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately precedes ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindLeftSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move left. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the leftmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index > 0)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index - 1].child);
+ break;
+ }
+ Assert(index == 0);
+ ++levels;
+ }
+
+ /* Descend right, to the rightmost page at the original level. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[p->hdr.nused - 1].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Find the passed page's right sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately follows ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindRightSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move right. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the rightmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index < p->hdr.nused - 1)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index + 1].child);
+ break;
+ }
+ Assert(index == p->hdr.nused - 1);
+ ++levels;
+ }
+
+ /* Descend left. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[0].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Get the first key on a btree page.
+ */
+static Size
+FreePageBtreeFirstKey(FreePageBtree *btp)
+{
+ Assert(btp->hdr.nused > 0);
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ return btp->u.leaf_key[0].first_page;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ return btp->u.internal_key[0].first_page;
+ }
+}
+
+/*
+ * Get a page from the btree recycle list for use as a btree page.
+ */
+static FreePageBtree *
+FreePageBtreeGetRecycled(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *newhead;
+
+ Assert(victim != NULL);
+ newhead = relptr_access(base, victim->next);
+ if (newhead != NULL)
+ relptr_copy(newhead->prev, victim->prev);
+ relptr_store(base, fpm->btree_recycle, newhead);
+ Assert(fpm_pointer_is_page_aligned(base, victim));
+ fpm->btree_recycle_count--;
+ return (FreePageBtree *) victim;
+}
+
+/*
+ * Insert an item into an internal page.
+ */
+static void
+FreePageBtreeInsertInternal(char *base, FreePageBtree *btp, Size index,
+ Size first_page, FreePageBtree *child)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.internal_key[index + 1], &btp->u.internal_key[index],
+ sizeof(FreePageBtreeInternalKey) * (btp->hdr.nused - index));
+ btp->u.internal_key[index].first_page = first_page;
+ relptr_store(base, btp->u.internal_key[index].child, child);
+ ++btp->hdr.nused;
+}
+
+/*
+ * Insert an item into a leaf page.
+ */
+static void
+FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index, Size first_page,
+ Size npages)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.leaf_key[index + 1], &btp->u.leaf_key[index],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+ btp->u.leaf_key[index].first_page = first_page;
+ btp->u.leaf_key[index].npages = npages;
+ ++btp->hdr.nused;
+}
+
+/*
+ * Put a page on the btree recycle list.
+ */
+static void
+FreePageBtreeRecycle(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *head = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = 1;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->btree_recycle, span);
+ fpm->btree_recycle_count++;
+}
+
+/*
+ * Remove an item from the btree at the given position on the given page.
+ */
+static void
+FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp, Size index)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(index < btp->hdr.nused);
+
+ /* When last item is removed, extirpate entire page from btree. */
+ if (btp->hdr.nused == 1)
+ {
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+
+ /* Physically remove the key from the page. */
+ --btp->hdr.nused;
+ if (index < btp->hdr.nused)
+ memmove(&btp->u.leaf_key[index], &btp->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+
+ /* If we just removed the first key, adjust ancestor keys. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, btp);
+
+ /* Consider whether to consolidate this page with a sibling. */
+ FreePageBtreeConsolidate(fpm, btp);
+}
+
+/*
+ * Remove a page from the btree. Caller is responsible for having relocated
+ * any keys from this page that are still wanted. The page is placed on the
+ * recycled list.
+ */
+static void
+FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *parent;
+ Size index;
+ Size first_page;
+
+ for (;;)
+ {
+ /* Find parent page. */
+ parent = relptr_access(base, btp->hdr.parent);
+ if (parent == NULL)
+ {
+ /* We are removing the root page. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->btree_depth = 0;
+ Assert(fpm->singleton_first_page == 0);
+ Assert(fpm->singleton_npages == 0);
+ return;
+ }
+
+ /*
+ * If the parent contains only one item, we need to remove it as well.
+ */
+ if (parent->hdr.nused > 1)
+ break;
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+ btp = parent;
+ }
+
+ /* Find and remove the downlink. */
+ first_page = FreePageBtreeFirstKey(btp);
+ if (parent->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ index = FreePageBtreeSearchLeaf(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.leaf_key[index],
+ &parent->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ else
+ {
+ index = FreePageBtreeSearchInternal(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.internal_key[index],
+ &parent->u.internal_key[index + 1],
+ sizeof(FreePageBtreeInternalKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ parent->hdr.nused--;
+ Assert(parent->hdr.nused > 0);
+
+ /* Recycle the page. */
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+
+ /* Adjust ancestor keys if needed. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+
+ /* Consider whether to consolidate the parent with a sibling. */
+ FreePageBtreeConsolidate(fpm, parent);
+}
+
+/*
+ * Search the btree for an entry for the given first page and initialize
+ * *result with the results of the search. result->page and result->index
+ * indicate either the position of an exact match or the position at which
+ * the new key should be inserted. result->found is true for an exact match,
+ * otherwise false. result->split_pages will contain the number of additional
+ * btree pages that will be needed when performing a split to insert a key.
+ * Except as described above, the contents of fields in the result object are
+ * undefined on return.
+ */
+static void
+FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *btp = relptr_access(base, fpm->btree_root);
+ Size index;
+
+ result->split_pages = 1;
+
+ /* If the btree is empty, there's nothing to find. */
+ if (btp == NULL)
+ {
+ result->page = NULL;
+ result->found = false;
+ return;
+ }
+
+ /* Descend until we hit a leaf. */
+ while (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ FreePageBtree *child;
+ bool found_exact;
+
+ index = FreePageBtreeSearchInternal(btp, first_page);
+ found_exact = index < btp->hdr.nused &&
+ btp->u.internal_key[index].first_page == first_page;
+
+ /*
+ * If we found an exact match we descend directly. Otherwise, we
+ * descend into the child to the left if possible so that we can find
+ * the insertion point at that child's high end.
+ */
+ if (!found_exact && index > 0)
+ --index;
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_INTERNAL_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Descend to appropriate child page. */
+ Assert(index < btp->hdr.nused);
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ Assert(relptr_access(base, child->hdr.parent) == btp);
+ btp = child;
+ }
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_LEAF_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_LEAF_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Search leaf page. */
+ index = FreePageBtreeSearchLeaf(btp, first_page);
+
+ /* Assemble results. */
+ result->page = btp;
+ result->index = index;
+ result->found = index < btp->hdr.nused &&
+ first_page == btp->u.leaf_key[index].first_page;
+}
+
+/*
+ * Search an internal page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page if none.
+ */
+static Size
+FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_INTERNAL_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.internal_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Search a leaf page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page if none.
+ */
+static Size
+FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_LEAF_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.leaf_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Allocate a new btree page and move half the keys from the provided page
+ * to the new page. Caller is responsible for making sure that there's a
+ * page available from fpm->btree_recycle. Returns a pointer to the new page,
+ * to which caller must add a downlink.
+ */
+static FreePageBtree *
+FreePageBtreeSplitPage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ FreePageBtree *newsibling;
+
+ newsibling = FreePageBtreeGetRecycled(fpm);
+ newsibling->hdr.magic = btp->hdr.magic;
+ newsibling->hdr.nused = btp->hdr.nused / 2;
+ relptr_copy(newsibling->hdr.parent, btp->hdr.parent);
+ btp->hdr.nused -= newsibling->hdr.nused;
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ memcpy(&newsibling->u.leaf_key,
+ &btp->u.leaf_key[btp->hdr.nused],
+ sizeof(FreePageBtreeLeafKey) * newsibling->hdr.nused);
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ memcpy(&newsibling->u.internal_key,
+ &btp->u.internal_key[btp->hdr.nused],
+ sizeof(FreePageBtreeInternalKey) * newsibling->hdr.nused);
+ FreePageBtreeUpdateParentPointers(fpm_segment_base(fpm), newsibling);
+ }
+
+ return newsibling;
+}
+
+/*
+ * When internal pages are split or merged, the parent pointers of their
+ * children must be updated.
+ */
+static void
+FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp)
+{
+ Size i;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ for (i = 0; i < btp->hdr.nused; ++i)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[i].child);
+ relptr_store(base, child->hdr.parent, btp);
+ }
+}
+
+/*
+ * Debugging dump of btree data.
+ */
+static void
+FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+ Size pageno = fpm_pointer_to_page(base, btp);
+ Size index;
+ FreePageBtree *check_parent;
+
+ check_stack_depth();
+ check_parent = relptr_access(base, btp->hdr.parent);
+ appendStringInfo(buf, " %zu@%d %c", pageno, level,
+ btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ? 'i' : 'l');
+ if (parent != check_parent)
+ appendStringInfo(buf, " [actual parent %zu, expected %zu]",
+ fpm_pointer_to_page(base, check_parent),
+ fpm_pointer_to_page(base, parent));
+ appendStringInfoChar(buf, ':');
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ appendStringInfo(buf, " %zu->%zu",
+ btp->u.internal_key[index].first_page,
+ btp->u.internal_key[index].child.relptr_off / FPM_PAGE_SIZE);
+ else
+ appendStringInfo(buf, " %zu(%zu)",
+ btp->u.leaf_key[index].first_page,
+ btp->u.leaf_key[index].npages);
+ }
+ appendStringInfoChar(buf, '\n');
+
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ FreePageManagerDumpBtree(fpm, child, btp, level + 1, buf);
+ }
+ }
+}
+
+/*
+ * Debugging dump of free-span data.
+ */
+static void
+FreePageManagerDumpSpans(FreePageManager *fpm, FreePageSpanLeader *span,
+ Size expected_pages, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+
+ while (span != NULL)
+ {
+ if (span->npages != expected_pages)
+ appendStringInfo(buf, " %zu(%zu)", fpm_pointer_to_page(base, span),
+ span->npages);
+ else
+ appendStringInfo(buf, " %zu", fpm_pointer_to_page(base, span));
+ span = relptr_access(base, span->next);
+ }
+
+ appendStringInfoChar(buf, '\n');
+}
+
+/*
+ * This function allocates a run of pages of the given length from the free
+ * page manager.
+ */
+static bool
+FreePageManagerGetInternal(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = NULL;
+ FreePageSpanLeader *prev;
+ FreePageSpanLeader *next;
+ FreePageBtreeSearchResult result;
+ Size victim_page = 0; /* placate compiler */
+ Size f;
+
+ /*
+ * Search for a free span.
+ *
+ * Right now, we use a simple best-fit policy here, but it's possible for
+ * this to result in memory fragmentation if we're repeatedly asked to
+ * allocate chunks just a little smaller than what we have available.
+ * Hopefully, this is unlikely, because we expect most requests to be
+ * single pages or superblock-sized chunks -- but no policy can be optimal
+ * under all circumstances unless it has knowledge of future allocation
+ * patterns.
+ */
+ for (f = Min(npages, FPM_NUM_FREELISTS) - 1; f < FPM_NUM_FREELISTS; ++f)
+ {
+ /* Skip empty freelists. */
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+
+ /*
+ * All of the freelists except the last one contain only items of a
+ * single size, so we just take the first one. But the final free
+ * list contains everything too big for any of the other lists, so we
+ * need to search the list.
+ */
+ if (f < FPM_NUM_FREELISTS - 1)
+ victim = relptr_access(base, fpm->freelist[f]);
+ else
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[f]);
+ do
+ {
+ if (candidate->npages >= npages && (victim == NULL ||
+ victim->npages > candidate->npages))
+ {
+ victim = candidate;
+ if (victim->npages == npages)
+ break;
+ }
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ break;
+ }
+
+ /* If we didn't find an allocatable span, return failure. */
+ if (victim == NULL)
+ return false;
+
+ /* Remove span from free list. */
+ Assert(victim->magic == FREE_PAGE_SPAN_LEADER_MAGIC);
+ prev = relptr_access(base, victim->prev);
+ next = relptr_access(base, victim->next);
+ if (prev != NULL)
+ relptr_copy(prev->next, victim->next);
+ else
+ relptr_copy(fpm->freelist[f], victim->next);
+ if (next != NULL)
+ relptr_copy(next->prev, victim->prev);
+ victim_page = fpm_pointer_to_page(base, victim);
+
+ /*
+ * If we haven't initialized the btree yet, the victim must be the single
+ * span stored within the FreePageManager itself. Otherwise, we need to
+ * update the btree.
+ */
+ if (relptr_is_null(fpm->btree_root))
+ {
+ Assert(victim_page == fpm->singleton_first_page);
+ Assert(victim->npages == fpm->singleton_npages);
+ Assert(victim->npages >= npages);
+ fpm->singleton_first_page += npages;
+ fpm->singleton_npages -= npages;
+ if (fpm->singleton_npages > 0)
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ }
+ else
+ {
+ /*
+ * If the span we found is exactly the right size, remove it from the
+ * btree completely. Otherwise, adjust the btree entry to reflect the
+ * still-unallocated portion of the span, and put that portion on the
+ * appropriate free list.
+ */
+ FreePageBtreeSearch(fpm, victim_page, &result);
+ Assert(result.found);
+ if (victim->npages == npages)
+ FreePageBtreeRemove(fpm, result.page, result.index);
+ else
+ {
+ FreePageBtreeLeafKey *key;
+
+ /* Adjust btree to reflect remaining pages. */
+ Assert(victim->npages > npages);
+ key = &result.page->u.leaf_key[result.index];
+ Assert(key->npages == victim->npages);
+ key->first_page += npages;
+ key->npages -= npages;
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put the unallocated pages back on the appropriate free list. */
+ FreePagePushSpanLeader(fpm, victim_page + npages,
+ victim->npages - npages);
+ }
+ }
+
+ /* Return results to caller. */
+ *first_page = fpm_pointer_to_page(base, victim);
+ return true;
+}
+
+/*
+ * Put a range of pages into the btree and freelists, consolidating it with
+ * existing free spans just before and/or after it. If 'soft' is true,
+ * only perform the insertion if it can be done without allocating new btree
+ * pages; if false, do it always. Returns 0 if the soft flag caused the
+ * insertion to be skipped, or otherwise the size of the contiguous span
+ * created by the insertion. This may be larger than npages if we're able
+ * to consolidate with an adjacent range. If the btree has to allocate pages
+ * for internal purposes, fpm->contiguous_pages_dirty is set to true, because
+ * that might invalidate the current largest run, requiring it to be
+ * recomputed.
+ */
+static Size
+FreePageManagerPutInternal(FreePageManager *fpm, Size first_page, Size npages,
+ bool soft)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtreeSearchResult result;
+ FreePageBtreeLeafKey *prevkey = NULL;
+ FreePageBtreeLeafKey *nextkey = NULL;
+ FreePageBtree *np;
+ Size nindex;
+
+ Assert(npages > 0);
+
+ /* We can store a single free span without initializing the btree. */
+ if (fpm->btree_depth == 0)
+ {
+ if (fpm->singleton_npages == 0)
+ {
+ /* Don't have a span yet; store this one. */
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return fpm->singleton_npages;
+ }
+ else if (fpm->singleton_first_page + fpm->singleton_npages ==
+ first_page)
+ {
+ /* New span immediately follows sole existing span. */
+ fpm->singleton_npages += npages;
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else if (first_page + npages == fpm->singleton_first_page)
+ {
+ /* New span immediately precedes sole existing span. */
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages += npages;
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else
+ {
+ /* Not contiguous; we need to initialize the btree. */
+ Size root_page;
+ FreePageBtree *root;
+
+ if (!relptr_is_null(fpm->btree_recycle))
+ root = FreePageBtreeGetRecycled(fpm);
+ else if (FreePageManagerGetInternal(fpm, 1, &root_page))
+ root = (FreePageBtree *) fpm_page_to_pointer(base, root_page);
+ else
+ {
+ /* We'd better be able to get a page from the existing range. */
+ elog(FATAL, "free page manager btree is corrupt");
+ }
+
+ /* Create the btree and move the preexisting range into it. */
+ root->hdr.magic = FREE_PAGE_LEAF_MAGIC;
+ root->hdr.nused = 1;
+ relptr_store(base, root->hdr.parent, (FreePageBtree *) NULL);
+ root->u.leaf_key[0].first_page = fpm->singleton_first_page;
+ root->u.leaf_key[0].npages = fpm->singleton_npages;
+ relptr_store(base, fpm->btree_root, root);
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->btree_depth = 1;
+
+ /*
+ * Corner case: it may be that the btree root took the very last
+ * free page. In that case, the sole btree entry covers a zero
+ * page run, which is invalid. Overwrite it with the entry we're
+ * trying to insert and get out.
+ */
+ if (root->u.leaf_key[0].npages == 0)
+ {
+ root->u.leaf_key[0].first_page = first_page;
+ root->u.leaf_key[0].npages = npages;
+ return npages;
+ }
+
+ /* Fall through to insert the new key. */
+ }
+ }
+
+ /* Search the btree. */
+ FreePageBtreeSearch(fpm, first_page, &result);
+ Assert(!result.found);
+ if (result.index > 0)
+ prevkey = &result.page->u.leaf_key[result.index - 1];
+ if (result.index < result.page->hdr.nused)
+ {
+ np = result.page;
+ nindex = result.index;
+ nextkey = &result.page->u.leaf_key[result.index];
+ }
+ else
+ {
+ np = FreePageBtreeFindRightSibling(base, result.page);
+ nindex = 0;
+ if (np != NULL)
+ nextkey = &np->u.leaf_key[0];
+ }
+
+ /* Consolidate with the previous entry if possible. */
+ if (prevkey != NULL && prevkey->first_page + prevkey->npages >= first_page)
+ {
+ bool remove_next = false;
+ Size result;
+
+ Assert(prevkey->first_page + prevkey->npages == first_page);
+ prevkey->npages = (first_page - prevkey->first_page) + npages;
+
+ /* Check whether we can *also* consolidate with the following entry. */
+ if (nextkey != NULL &&
+ prevkey->first_page + prevkey->npages >= nextkey->first_page)
+ {
+ Assert(prevkey->first_page + prevkey->npages ==
+ nextkey->first_page);
+ prevkey->npages = (nextkey->first_page - prevkey->first_page)
+ + nextkey->npages;
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ remove_next = true;
+ }
+
+ /* Put the span on the correct freelist and save size. */
+ FreePagePopSpanLeader(fpm, prevkey->first_page);
+ FreePagePushSpanLeader(fpm, prevkey->first_page, prevkey->npages);
+ result = prevkey->npages;
+
+ /*
+ * If we consolidated with both the preceding and following entries,
+ * we must remove the following entry. We do this last, because
+ * removing an element from the btree may invalidate pointers we hold
+ * into the current data structure.
+ *
+ * NB: The btree is technically in an invalid state at this point
+ * because we've already updated prevkey to cover the same key space
+ * as nextkey. FreePageBtreeRemove() shouldn't notice that, though.
+ */
+ if (remove_next)
+ FreePageBtreeRemove(fpm, np, nindex);
+
+ return result;
+ }
+
+ /* Consolidate with the next entry if possible. */
+ if (nextkey != NULL && first_page + npages >= nextkey->first_page)
+ {
+ Size newpages;
+
+ /* Compute new size for span. */
+ Assert(first_page + npages == nextkey->first_page);
+ newpages = (nextkey->first_page - first_page) + nextkey->npages;
+
+ /* Put span on correct free list. */
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ FreePagePushSpanLeader(fpm, first_page, newpages);
+
+ /* Update key in place. */
+ nextkey->first_page = first_page;
+ nextkey->npages = newpages;
+
+ /* If reducing first key on page, ancestors might need adjustment. */
+ if (nindex == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, np);
+
+ return nextkey->npages;
+ }
+
+ /* Split leaf page and as many of its ancestors as necessary. */
+ if (result.split_pages > 0)
+ {
+ /*
+ * NB: We could consider various coping strategies here to avoid a
+ * split; most obviously, if np != result.page, we could target that
+ * page instead. More complicated shuffling strategies could be
+ * possible as well; basically, unless every single leaf page is 100%
+ * full, we can jam this key in there if we try hard enough. It's
+ * unlikely that trying that hard is worthwhile, but it's possible we
+ * might need to make more than no effort. For now, we just do the
+ * easy thing, which is nothing.
+ */
+
+ /* If this is a soft insert, it's time to give up. */
+ if (soft)
+ return 0;
+
+ /*
+ * Past this point we might allocate btree pages, which could
+ * potentially shorten any existing run which might happen to be the
+ * current longest. So fpm->contiguous_pages needs to be recomputed.
+ */
+ fpm->contiguous_pages_dirty = true;
+
+ /* Check whether we need to allocate more btree pages to split. */
+ if (result.split_pages > fpm->btree_recycle_count)
+ {
+ Size pages_needed;
+ Size recycle_page;
+ Size i;
+
+ /*
+ * Allocate the required number of pages and split each one in
+ * turn. This should never fail, because if we've got enough
+ * spans of free pages kicking around that we need additional
+ * storage space just to remember them all, then we should
+ * certainly have enough to expand the btree, which should only
+ * ever use a tiny number of pages compared to the number under
+ * management. If it does, something's badly screwed up.
+ */
+ pages_needed = result.split_pages - fpm->btree_recycle_count;
+ for (i = 0; i < pages_needed; ++i)
+ {
+ if (!FreePageManagerGetInternal(fpm, 1, &recycle_page))
+ elog(FATAL, "free page manager btree is corrupt");
+ FreePageBtreeRecycle(fpm, recycle_page);
+ }
+
+ /*
+ * The act of allocating pages to recycle may have invalidated the
+ * results of our previous btree search, so repeat it. (We could
+ * recheck whether any of our split-avoidance strategies that were
+ * not viable before now are, but it hardly seems worthwhile, so
+ * we don't bother. Consolidation can't be possible now if it
+ * wasn't previously.)
+ */
+ FreePageBtreeSearch(fpm, first_page, &result);
+
+ /*
+ * The act of allocating pages for use in constructing our btree
+ * should never cause any page to become more full, so the new
+ * split depth should be no greater than the old one, and perhaps
+ * less if we fortuitously allocated a chunk that freed up a slot
+ * on the page we need to update.
+ */
+ Assert(result.split_pages <= fpm->btree_recycle_count);
+ }
+
+ /* If we still need to perform a split, do it. */
+ if (result.split_pages > 0)
+ {
+ FreePageBtree *split_target = result.page;
+ FreePageBtree *child = NULL;
+ Size key = first_page;
+
+ for (;;)
+ {
+ FreePageBtree *newsibling;
+ FreePageBtree *parent;
+
+ /* Identify parent page, which must receive downlink. */
+ parent = relptr_access(base, split_target->hdr.parent);
+
+ /* Split the page - downlink not added yet. */
+ newsibling = FreePageBtreeSplitPage(fpm, split_target);
+
+ /*
+ * At this point in the loop, we're always carrying a pending
+ * insertion. On the first pass, it's the actual key we're
+ * trying to insert; on subsequent passes, it's the downlink
+ * that needs to be added as a result of the split performed
+ * during the previous loop iteration. Since we've just split
+ * the page, there's definitely room on one of the two
+ * resulting pages.
+ */
+ if (child == NULL)
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into = key < newsibling->u.leaf_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchLeaf(insert_into, key);
+ FreePageBtreeInsertLeaf(insert_into, index, key, npages);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+ else
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into =
+ key < newsibling->u.internal_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchInternal(insert_into, key);
+ FreePageBtreeInsertInternal(base, insert_into, index,
+ key, child);
+ relptr_store(base, child->hdr.parent, insert_into);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+
+ /* If the page we just split has no parent, create a new root. */
+ if (parent == NULL)
+ {
+ FreePageBtree *newroot;
+
+ newroot = FreePageBtreeGetRecycled(fpm);
+ newroot->hdr.magic = FREE_PAGE_INTERNAL_MAGIC;
+ newroot->hdr.nused = 2;
+ relptr_store(base, newroot->hdr.parent,
+ (FreePageBtree *) NULL);
+ newroot->u.internal_key[0].first_page =
+ FreePageBtreeFirstKey(split_target);
+ relptr_store(base, newroot->u.internal_key[0].child,
+ split_target);
+ relptr_store(base, split_target->hdr.parent, newroot);
+ newroot->u.internal_key[1].first_page =
+ FreePageBtreeFirstKey(newsibling);
+ relptr_store(base, newroot->u.internal_key[1].child,
+ newsibling);
+ relptr_store(base, newsibling->hdr.parent, newroot);
+ relptr_store(base, fpm->btree_root, newroot);
+ fpm->btree_depth++;
+
+ break;
+ }
+
+ /* If the parent page isn't full, insert the downlink. */
+ key = FreePageBtreeFirstKey(newsibling);
+ if (parent->hdr.nused < FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Size index;
+
+ index = FreePageBtreeSearchInternal(parent, key);
+ FreePageBtreeInsertInternal(base, parent, index,
+ key, newsibling);
+ relptr_store(base, newsibling->hdr.parent, parent);
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+ break;
+ }
+
+ /* The parent also needs to be split, so loop around. */
+ child = newsibling;
+ split_target = parent;
+ }
+
+ /*
+ * The loop above did the insert, so just need to update the free
+ * list, and we're done.
+ */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+ }
+ }
+
+ /* Physically add the key to the page. */
+ Assert(result.page->hdr.nused < FPM_ITEMS_PER_LEAF_PAGE);
+ FreePageBtreeInsertLeaf(result.page, result.index, first_page, npages);
+
+ /* If new first key on page, ancestors might need adjustment. */
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put it on the free list. */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+}
+
+/*
+ * Remove a FreePageSpanLeader from the linked-list that contains it, either
+ * because we're changing the size of the span, or because we're allocating it.
+ */
+static void
+FreePagePopSpanLeader(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *span;
+ FreePageSpanLeader *next;
+ FreePageSpanLeader *prev;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+
+ next = relptr_access(base, span->next);
+ prev = relptr_access(base, span->prev);
+ if (next != NULL)
+ relptr_copy(next->prev, span->prev);
+ if (prev != NULL)
+ relptr_copy(prev->next, span->next);
+ else
+ {
+ Size f = Min(span->npages, FPM_NUM_FREELISTS) - 1;
+
+ Assert(fpm->freelist[f].relptr_off == pageno * FPM_PAGE_SIZE);
+ relptr_copy(fpm->freelist[f], span->next);
+ }
+}
+
+/*
+ * Initialize a new FreePageSpanLeader and put it on the appropriate free list.
+ */
+static void
+FreePagePushSpanLeader(FreePageManager *fpm, Size first_page, Size npages)
+{
+ char *base = fpm_segment_base(fpm);
+ Size f = Min(npages, FPM_NUM_FREELISTS) - 1;
+ FreePageSpanLeader *head = relptr_access(base, fpm->freelist[f]);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, first_page);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = npages;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->freelist[f], span);
+}
diff --git a/src/include/storage/dsa.h b/src/include/storage/dsa.h
new file mode 100644
index 0000000..c8420d4
--- /dev/null
+++ b/src/include/storage/dsa.h
@@ -0,0 +1,66 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.h
+ * Dynamic shared memory areas.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/storage/dsa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSA_H
+#define DSA_H
+
+#include "postgres.h"
+
+#include "storage/dsm.h"
+
+/* The opaque type used for an area. */
+struct dsa_area;
+typedef struct dsa_area dsa_area;
+
+/*
+ * The type of 'relative pointers' to memory allocated by a dynamic shared
+ * area. dsa_pointer values can be shared with other processes, but must be
+ * converted to backend-local pointers before they can be dereferenced. See
+ * dsa_get_address.
+ */
+typedef uint64 dsa_pointer;
+
+/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
+#define InvalidDsaPointer ((dsa_pointer) 0)
+
+/* Check if a dsa_pointer value is valid. */
+#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
+
+/*
+ * The type used for dsa_area handles. dsa_handle values can be shared with
+ * other processes, so that those processes can attach to the same area.
+ * This provides a way to share allocated storage with other processes.
+ *
+ * The handle for a dsa_area is currently implemented as the dsm_handle
+ * for the first DSM segment backing this dynamic storage area, but client
+ * code shouldn't assume that is true.
+ */
+typedef dsm_handle dsa_handle;
+
+extern void dsa_startup(void);
+
+extern dsa_area *dsa_create_dynamic(int tranche_id, const char *tranche_name);
+extern dsa_area *dsa_attach_dynamic(dsa_handle handle);
+extern void dsa_pin_mapping(dsa_area *area);
+extern void dsa_detach(dsa_area *area);
+extern void dsa_pin(dsa_area *area);
+extern void dsa_unpin(dsa_area *area);
+extern void dsa_set_size_limit(dsa_area *area, Size limit);
+extern dsa_handle dsa_get_handle(dsa_area *area);
+extern dsa_pointer dsa_allocate(dsa_area *area, Size size);
+extern void dsa_free(dsa_area *area, dsa_pointer dp);
+extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern void dsa_trim(dsa_area *area);
+extern void dsa_dump(dsa_area *area);
+
+#endif /* DSA_H */
diff --git a/src/include/utils/freepage.h b/src/include/utils/freepage.h
new file mode 100644
index 0000000..e509ca2
--- /dev/null
+++ b/src/include/utils/freepage.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.h
+ * Management of page-organized free memory.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/freepage.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef FREEPAGE_H
+#define FREEPAGE_H
+
+#include "storage/lwlock.h"
+#include "utils/relptr.h"
+
+/* Forward declarations. */
+typedef struct FreePageSpanLeader FreePageSpanLeader;
+typedef struct FreePageBtree FreePageBtree;
+typedef struct FreePageManager FreePageManager;
+
+/*
+ * PostgreSQL normally uses 8kB pages for most things, but many common
+ * architecture/operating system pairings use a 4kB page size for memory
+ * allocation, so we do that here also. We assume that a large allocation
+ * is likely to begin on a page boundary; if not, we'll discard bytes from
+ * the beginning and end of the object and use only the middle portion that
+ * is properly aligned. This works, but is not ideal, so it's best to keep
+ * this conservatively small. There don't seem to be any common architectures
+ * where the page size is less than 4kB, so this should be good enough; also,
+ * making it smaller would increase the space consumed by the address space
+ * map, which also uses this page size.
+ */
+#define FPM_PAGE_SIZE 4096
+
+/*
+ * Each freelist except for the last contains only spans of one particular
+ * size. Everything larger goes on the last one. In some sense this seems
+ * like a waste since most allocations are in a few common sizes, but it
+ * means that small allocations can simply pop the head of the relevant list
+ * without needing to worry about whether the object we find there is of
+ * precisely the correct size (because we know it must be).
+ */
+#define FPM_NUM_FREELISTS 129
+
+/* Define relative pointer types. */
+relptr_declare(FreePageBtree, RelptrFreePageBtree);
+relptr_declare(FreePageManager, RelptrFreePageManager);
+relptr_declare(FreePageSpanLeader, RelptrFreePageSpanLeader);
+
+/* Everything we need in order to manage free pages (see freepage.c) */
+struct FreePageManager
+{
+ RelptrFreePageManager self;
+ RelptrFreePageBtree btree_root;
+ RelptrFreePageSpanLeader btree_recycle;
+ unsigned btree_depth;
+ unsigned btree_recycle_count;
+ Size singleton_first_page;
+ Size singleton_npages;
+ Size contiguous_pages;
+ bool contiguous_pages_dirty;
+ RelptrFreePageSpanLeader freelist[FPM_NUM_FREELISTS];
+#ifdef FPM_EXTRA_ASSERTS
+ /* For debugging only, pages put minus pages gotten. */
+ Size free_pages;
+#endif
+};
+
+/* Macros to convert between page numbers (expressed as Size) and pointers. */
+#define fpm_page_to_pointer(base, page) \
+ (AssertVariableIsOfTypeMacro(page, Size), \
+ (base) + FPM_PAGE_SIZE * (page))
+#define fpm_pointer_to_page(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) / FPM_PAGE_SIZE)
+
+/* Macro to convert an allocation size to a number of pages. */
+#define fpm_size_to_pages(sz) \
+ (((sz) + FPM_PAGE_SIZE - 1) / FPM_PAGE_SIZE)
+
+/* Macros to check alignment of absolute and relative pointers. */
+#define fpm_pointer_is_page_aligned(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) % FPM_PAGE_SIZE == 0)
+#define fpm_relptr_is_page_aligned(base, relptr) \
+ ((relptr).relptr_off % FPM_PAGE_SIZE == 0)
+
+/* Macro to find base address of the segment containing a FreePageManager. */
+#define fpm_segment_base(fpm) \
+ (((char *) (fpm)) - (fpm)->self.relptr_off)
+
+/* Macro to access a FreePageManager's largest consecutive run of pages. */
+#define fpm_largest(fpm) \
+ ((fpm)->contiguous_pages)
+
+/* Functions to manipulate the free page map. */
+extern void FreePageManagerInitialize(FreePageManager *fpm, char *base);
+extern bool FreePageManagerGet(FreePageManager *fpm, Size npages,
+ Size *first_page);
+extern void FreePageManagerPut(FreePageManager *fpm, Size first_page,
+ Size npages);
+extern char *FreePageManagerDump(FreePageManager *fpm);
+
+#endif /* FREEPAGE_H */
diff --git a/src/include/utils/relptr.h b/src/include/utils/relptr.h
new file mode 100644
index 0000000..a97dc96
--- /dev/null
+++ b/src/include/utils/relptr.h
@@ -0,0 +1,70 @@
+/*-------------------------------------------------------------------------
+ *
+ * relptr.h
+ * This file contains basic declarations for relative pointers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/relptr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef RELPTR_H
+#define RELPTR_H
+
+/*
+ * Relative pointers are intended to be used when storing an address that may
+ * be relative either to the base of the process's address space or some
+ * dynamic shared memory segment mapped therein.
+ *
+ * The idea here is that you declare a relative pointer as relptr(type)
+ * and then use relptr_access to dereference it and relptr_store to change
+ * it. The use of a union here is a hack, because what's stored in the
+ * relptr is always a Size, never an actual pointer. But including a pointer
+ * in the union allows us to use stupid macro tricks to provide some measure
+ * of type-safety.
+ */
+#define relptr(type) union { type *relptr_type; Size relptr_off; }
+
+#define relptr_declare(type, name) \
+ typedef union { type *relptr_type; Size relptr_off; } name;
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (__typeof__((rp).relptr_type)) ((rp).relptr_off == 0 ? NULL : \
+ (base + (rp).relptr_off)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (void *) ((rp).relptr_off == 0 ? NULL : (base + (rp).relptr_off)))
+#endif
+
+#define relptr_is_null(rp) \
+ ((rp).relptr_off == 0)
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ AssertVariableIsOfTypeMacro(val, __typeof__((rp).relptr_type)), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#endif
+
+#define relptr_copy(rp1, rp2) \
+ ((rp1).relptr_off = (rp2).relptr_off)
+
+#endif /* RELPTR_H */
test-dsa.patch (application/octet-stream)
diff --git a/src/test/modules/test_dsa/Makefile b/src/test/modules/test_dsa/Makefile
new file mode 100644
index 0000000..e5299a9
--- /dev/null
+++ b/src/test/modules/test_dsa/Makefile
@@ -0,0 +1,18 @@
+# src/test/modules/test_dsa/Makefile
+
+MODULES = test_dsa
+
+EXTENSION = test_dsa
+DATA = test_dsa--1.0.sql
+PGFILEDESC = "test_dsa -- tests for DSA areas"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_dsa
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_dsa/test_dsa--1.0.sql b/src/test/modules/test_dsa/test_dsa--1.0.sql
new file mode 100644
index 0000000..cc435b3
--- /dev/null
+++ b/src/test/modules/test_dsa/test_dsa--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_dsa/test_dsa--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_dsa" to load this file. \quit
+
+CREATE FUNCTION test_dsa_random(loops int, num_allocs int, min_alloc int, max_alloc int, mode text)
+RETURNS VOID
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE TYPE test_dsa_row AS (pid int, allocations bigint, elapsed interval);
+
+CREATE FUNCTION test_dsa_random_parallel(loops int, num_allocs int, min_alloc int, max_alloc int, mode text, workers int)
+RETURNS SETOF test_dsa_row
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
diff --git a/src/test/modules/test_dsa/test_dsa.c b/src/test/modules/test_dsa/test_dsa.c
new file mode 100644
index 0000000..de89189
--- /dev/null
+++ b/src/test/modules/test_dsa/test_dsa.c
@@ -0,0 +1,358 @@
+/* -------------------------------------------------------------------------
+ *
+ * test_dsa.c
+ * Simple exercises for dsa.c.
+ *
+ * Copyright (C) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_dsa/test_dsa.c
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/dsa.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/resowner.h"
+#include "utils/timestamp.h"
+
+#include <stdlib.h>
+#include <unistd.h>
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_dsa_random);
+PG_FUNCTION_INFO_V1(test_dsa_random_parallel);
+
+/* Which order to free objects in, within each loop. */
+typedef enum
+{
+ /* Free in random order. */
+ MODE_RANDOM,
+ /* Free in the same order we allocated (FIFO). */
+ MODE_FORWARDS,
+ /* Free in reverse order of allocation (LIFO). */
+ MODE_BACKWARDS
+} test_mode;
+
+/* Per-worker results. */
+typedef struct
+{
+ pid_t pid;
+ int64 count;
+ int64 elapsed_time_us;
+} test_result;
+
+/* Parameters for a test run, passed to workers. */
+typedef struct
+{
+ int loops;
+ int num_allocs;
+ int min_alloc;
+ int max_alloc;
+ test_mode mode;
+ test_result results[1]; /* indexed by worker number */
+} test_parameters;
+
+/* The startup message given to each worker. */
+typedef struct
+{
+ /* How to connect to the shmem area. */
+ dsa_handle area_handle;
+ /* Where to find the parameters. */
+ dsa_pointer parameters;
+ /* What index this worker should write results to. */
+ Size output_index;
+} test_hello;
+
+static test_mode
+parse_test_mode(text *mode)
+{
+ test_mode result = MODE_RANDOM;
+ char *cstr = text_to_cstring(mode);
+
+ if (strcmp(cstr, "random") == 0)
+ result = MODE_RANDOM;
+ else if (strcmp(cstr, "forwards") == 0)
+ result = MODE_FORWARDS;
+ else if (strcmp(cstr, "backwards") == 0)
+ result = MODE_BACKWARDS;
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unknown mode")));
+ return result;
+}
+
+static void
+check_parameters(const test_parameters *parameters)
+{
+ if (parameters->min_alloc < 1 || parameters->min_alloc > parameters->max_alloc)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("min_alloc must be >= 1, and min_alloc must be <= max_alloc")));
+}
+
+static int
+my_tranche_id(void)
+{
+ static int tranche_id = 0;
+
+ if (tranche_id == 0)
+ tranche_id = LWLockNewTrancheId();
+
+ return tranche_id;
+}
+
+static void
+do_random_test(dsa_area *area, Size output_index, test_parameters *parameters)
+{
+ dsa_pointer *objects;
+ int min_alloc;
+ int extra_alloc;
+ int32 i;
+ int32 loop;
+ int64 start_time = GetCurrentIntegerTimestamp();
+ int64 total_allocations = 0;
+
+ /*
+ * Make tests reproducible (on the same computer at least) by using the
+ * same random sequence every time.
+ */
+ srand(42);
+
+ min_alloc = parameters->min_alloc;
+ extra_alloc = parameters->max_alloc - parameters->min_alloc;
+
+ objects = palloc(sizeof(dsa_pointer) * parameters->num_allocs);
+ Assert(objects != NULL);
+ for (loop = 0; loop < parameters->loops; ++loop)
+ {
+ int num_actually_allocated = 0;
+
+ for (i = 0; i < parameters->num_allocs; ++i)
+ {
+ Size size;
+ void *memory;
+
+ /* Adjust size randomly if needed. */
+ size = min_alloc;
+ if (extra_alloc > 0)
+ size += rand() % extra_alloc;
+
+ /* Allocate! */
+ objects[i] = dsa_allocate(area, size);
+ if (!DsaPointerIsValid(objects[i]))
+ {
+ elog(WARNING, "dsa: loop %d: out of memory after allocating %d objects", loop, i);
+ break;
+ }
+ ++num_actually_allocated;
+ /* Pay the cost of accessing that memory */
+ memory = dsa_get_address(area, objects[i]);
+ memset(memory, 42, size);
+ }
+ if (parameters->mode == MODE_RANDOM)
+ {
+ for (i = 0; i < num_actually_allocated; ++i)
+ {
+ Size x = rand() % num_actually_allocated;
+ Size y = rand() % num_actually_allocated;
+ dsa_pointer temp = objects[x];
+
+ objects[x] = objects[y];
+ objects[y] = temp;
+ }
+ }
+ if (parameters->mode == MODE_BACKWARDS)
+ {
+ for (i = num_actually_allocated - 1; i >= 0; --i)
+ dsa_free(area, objects[i]);
+ }
+ else
+ {
+ for (i = 0; i < num_actually_allocated; ++i)
+ dsa_free(area, objects[i]);
+ }
+ total_allocations += num_actually_allocated;
+ }
+ pfree(objects);
+
+ parameters->results[output_index].elapsed_time_us =
+ GetCurrentIntegerTimestamp() - start_time;
+ parameters->results[output_index].pid = getpid();
+ parameters->results[output_index].count = total_allocations;
+}
+
+/* Non-parallel version: just do it. */
+Datum
+test_dsa_random(PG_FUNCTION_ARGS)
+{
+ test_parameters parameters;
+ dsa_area *area;
+
+ parameters.loops = PG_GETARG_INT32(0);
+ parameters.num_allocs = PG_GETARG_INT32(1);
+ parameters.min_alloc = PG_GETARG_INT32(2);
+ parameters.max_alloc = PG_GETARG_INT32(3);
+ parameters.mode = parse_test_mode(PG_GETARG_TEXT_PP(4));
+ check_parameters(&parameters);
+
+ area = dsa_create_dynamic(my_tranche_id(), "test_dsa");
+ do_random_test(area, 0, &parameters);
+ dsa_dump(area);
+ dsa_detach(area);
+
+ PG_RETURN_NULL();
+}
+
+Datum test_dsa_random_worker_main(Datum arg);
+
+Datum
+test_dsa_random_worker_main(Datum arg)
+{
+ test_hello hello;
+ dsa_area *area;
+ test_parameters *parameters;
+
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "test_dsa toplevel");
+
+ /* Receive hello message and attach to shmem area. */
+ memcpy(&hello, MyBgworkerEntry->bgw_extra, sizeof(hello));
+ area = dsa_attach_dynamic(hello.area_handle);
+ Assert(area != NULL);
+ parameters = dsa_get_address(area, hello.parameters);
+ Assert(parameters != NULL);
+
+ do_random_test(area, hello.output_index, parameters);
+
+ dsa_detach(area);
+
+ return (Datum) 0;
+}
+
+/* Parallel version: fork a bunch of background workers to do it. */
+Datum
+test_dsa_random_parallel(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ test_hello hello;
+ test_parameters *parameters;
+ dsa_area *area;
+ int workers;
+ int i;
+ BackgroundWorkerHandle **handles;
+
+ /* tuplestore boilerplate stuff... */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+
+ /* Prepare to work! */
+ workers = PG_GETARG_INT32(5);
+ handles = palloc(sizeof(BackgroundWorkerHandle *) * workers);
+
+ /* Set up the shared memory area. */
+ area = dsa_create_dynamic(my_tranche_id(), "test_dsa");
+
+ /* Tell the workers how to attach to it. */
+ hello.area_handle = dsa_get_handle(area);
+
+ /* Allocate space for the parameters object. */
+ hello.parameters = dsa_allocate(area, sizeof(test_parameters) +
+ sizeof(test_result) * workers);
+ Assert(DsaPointerIsValid(hello.parameters));
+
+ /* Set up and check the parameters object. */
+ parameters = dsa_get_address(area, hello.parameters);
+ parameters->loops = PG_GETARG_INT32(0);
+ parameters->num_allocs = PG_GETARG_INT32(1);
+ parameters->min_alloc = PG_GETARG_INT32(2);
+ parameters->max_alloc = PG_GETARG_INT32(3);
+ parameters->mode = parse_test_mode(PG_GETARG_TEXT_PP(4));
+ check_parameters(parameters);
+
+ /* Start the workers. */
+ for (i = 0; i < workers; ++i)
+ {
+ BackgroundWorker bgw;
+
+ snprintf(bgw.bgw_name, sizeof(bgw.bgw_name), "worker%d", i);
+ bgw.bgw_flags = BGWORKER_SHMEM_ACCESS;
+ bgw.bgw_start_time = BgWorkerStart_PostmasterStart;
+ bgw.bgw_restart_time = BGW_NEVER_RESTART;
+ bgw.bgw_main = NULL;
+ snprintf(bgw.bgw_library_name, sizeof(bgw.bgw_library_name),
+ "test_dsa");
+ snprintf(bgw.bgw_function_name, sizeof(bgw.bgw_function_name),
+ "test_dsa_random_worker_main");
+ Assert(sizeof(hello) <= BGW_EXTRALEN);
+ /* Each worker will write its output to a different slot. */
+ hello.output_index = i;
+ memcpy(bgw.bgw_extra, &hello, sizeof(hello));
+ bgw.bgw_notify_pid = MyProcPid;
+
+ if (!RegisterDynamicBackgroundWorker(&bgw, &handles[i]))
+ elog(ERROR, "Can't start worker");
+ }
+
+ /* Wait for the workers to complete. */
+ for (i = 0; i < workers; ++i)
+ /* erm, should really check for BGWH_STOPPED */
+ WaitForBackgroundWorkerShutdown(handles[i]);
+
+ /* Generate result tuples. */
+ for (i = 0; i < workers; ++i)
+ {
+ Datum values[3];
+ bool nulls[] = { false, false, false };
+ Interval *interval = palloc(sizeof(Interval));
+
+ interval->month = 0;
+ interval->day = 0;
+ interval->time = parameters->results[i].elapsed_time_us
+#ifndef HAVE_INT64_TIMESTAMP
+ / 1000000.0
+#endif
+ ;
+
+ values[0] = Int32GetDatum(parameters->results[i].pid);
+ values[1] = Int64GetDatum(parameters->results[i].count);
+ values[2] = PointerGetDatum(interval);
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ pfree(handles);
+ dsa_detach(area);
+
+ return (Datum) 0;
+}
diff --git a/src/test/modules/test_dsa/test_dsa.control b/src/test/modules/test_dsa/test_dsa.control
new file mode 100644
index 0000000..2655c3f
--- /dev/null
+++ b/src/test/modules/test_dsa/test_dsa.control
@@ -0,0 +1,5 @@
+# test_dsa extension
+comment = 'Tests for DSA'
+default_version = '1.0'
+module_pathname = '$libdir/test_dsa'
+relocatable = true
On Fri, Aug 19, 2016 at 7:07 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I would like to propose a new subsystem called Dynamic Shared [Memory]
> Areas, or "DSA". It provides an object called a "dsa_area" which can
> be used by multiple backends to share data. Under the covers, a
> dsa_area is made up of some number of DSM segments, but it appears to
> client code as a single shared memory heap with a simple allocate/free
> interface. Because the memory is mapped at different addresses in
> different backends, it introduces a kind of sharable relative pointer
> and an operation to convert it to a backend-local pointer.[...]
>
> [...] It's desirable to allow atomic ops on
> dsa_pointer; I believe Andres Freund plans to make that happen for 64
> bit values on 32 bit systems, but if that turns out to be problematic
> I would want to make dsa_pointer 32 bits on 32 bit systems.

Here's a new version that does that. It provides the type
dsa_pointer_atomic and associated operations, using
PG_HAVE_ATOMIC_U64_SUPPORT to decide which size to use. The choice of
size is overridable at compile time with USE_SMALL_DSA_POINTER.
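
The pointer width matters because a dsa_pointer packs a segment number
into its high bits and an offset into its low bits (see DSA_OFFSET_WIDTH
in dsa.c). A minimal standalone sketch of that packing follows. The
offset widths and the MAKE/EXTRACT macros are copied from the patch; the
#if that selects the width is only an assumption about how dsa.h might
do it, the fixed-width typedefs stand in for the real dsa_pointer, and
sketch_round_trip() is a hypothetical helper that is not part of the
patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * ASSUMPTION: this selection logic is a guess at what the header does.
 * Outside the PostgreSQL tree PG_HAVE_ATOMIC_U64_SUPPORT is not defined,
 * so a standalone build of this sketch uses the small pointer.
 */
#if defined(USE_SMALL_DSA_POINTER) || !defined(PG_HAVE_ATOMIC_U64_SUPPORT)
#define SIZEOF_DSA_POINTER 4
typedef uint32_t dsa_pointer;
#else
#define SIZEOF_DSA_POINTER 8
typedef uint64_t dsa_pointer;
#endif

/* Offset widths as given in dsa.c. */
#if SIZEOF_DSA_POINTER == 4
#define DSA_OFFSET_WIDTH 27		/* 32 segments of size up to 128MB */
#else
#define DSA_OFFSET_WIDTH 40		/* Plenty of segments up to 1TB */
#endif

/* Packing macros as given in dsa.c. */
#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
#define DSA_MAKE_POINTER(segment_number, offset) \
	(((dsa_pointer) (segment_number) << DSA_OFFSET_WIDTH) | (offset))
#define DSA_EXTRACT_SEGMENT_NUMBER(dp) ((dp) >> DSA_OFFSET_WIDTH)
#define DSA_EXTRACT_OFFSET(dp) ((dp) & DSA_OFFSET_BITMASK)

/*
 * Hypothetical helper: pack (segment, offset) into a dsa_pointer and
 * check that both parts can be recovered.  Returns 1 on success.
 */
static int
sketch_round_trip(unsigned segment, dsa_pointer offset)
{
	dsa_pointer dp;

	/* Both parts must fit in their share of the bits. */
	assert(segment < (1u << (SIZEOF_DSA_POINTER * 8 - DSA_OFFSET_WIDTH)));
	assert(offset <= DSA_OFFSET_BITMASK);

	dp = DSA_MAKE_POINTER(segment, offset);
	return DSA_EXTRACT_SEGMENT_NUMBER(dp) == segment &&
		DSA_EXTRACT_OFFSET(dp) == offset;
}
```

With the small pointer only 5 bits remain for the segment number, which
is why the number of segments created at each size matters more in that
configuration.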
The other change is that segment sizes now grow more slowly. V1 would
create 1MB, 2MB, 4MB, ... segments (geometric growth being necessary
because we can't have large numbers of segments, but we want to support
large total sizes). V2 creates segments of size 1MB, 1MB, 1MB, 1MB,
2MB, 2MB, 2MB, 2MB, 4MB, 4MB, 4MB, 4MB, ... with the number of segments
at each size set by the compile time constant
DSA_NUM_SEGMENTS_AT_EACH_SIZE (currently 4). I'm not sure how to select
a good number for this yet and the best answer may depend on whether
you're using small pointers.
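
The growth schedule described above can be sketched as a small
standalone function. This is only an illustration under stated
assumptions: the two constants are the values from dsa.c, but
sketch_segment_size() and sketch_total_size() are hypothetical helpers,
and the real code may additionally cap the size at the maximum segment
size or create a larger segment to satisfy a single big request.

```c
#include <assert.h>
#include <stddef.h>

/* Values taken from the #defines in dsa.c. */
#define DSA_INITIAL_SEGMENT_SIZE ((size_t) 1 * 1024 * 1024)
#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 4

/*
 * Hypothetical helper: nominal size in bytes of the segment with the
 * given zero-based index, doubling after every
 * DSA_NUM_SEGMENTS_AT_EACH_SIZE segments.
 */
static size_t
sketch_segment_size(size_t segment_index)
{
	size_t		doublings = segment_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE;

	/* Guard against shifting the 2^20 initial size out of size_t. */
	assert(doublings < sizeof(size_t) * 8 - 20);

	return DSA_INITIAL_SEGMENT_SIZE << doublings;
}

/* Hypothetical helper: total bytes covered by the first nsegments. */
static size_t
sketch_total_size(size_t nsegments)
{
	size_t		total = 0;
	size_t		i;

	for (i = 0; i < nsegments; ++i)
		total += sketch_segment_size(i);

	return total;
}
```

Under this schedule the first 12 segments cover 4 * (1 + 2 + 4) = 28MB,
whereas V1's 1MB, 2MB, 4MB, ... doubling would reach 4095MB with the
same number of segments. That is the tradeoff behind the choice of
constant: slower growth wastes less space in the largest segment but
uses up segment slots sooner.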
This version is rebased against master as of today and doesn't depend
on any other patches.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
dsa-v2.patch (application/octet-stream)
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index 8a55392..e99ebd2 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -8,7 +8,7 @@ subdir = src/backend/storage/ipc
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
+OBJS = dsa.o dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
procsignal.o shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o \
sinvaladt.o standby.o
diff --git a/src/backend/storage/ipc/dsa.c b/src/backend/storage/ipc/dsa.c
new file mode 100644
index 0000000..1ce9b8a
--- /dev/null
+++ b/src/backend/storage/ipc/dsa.c
@@ -0,0 +1,1950 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.c
+ * Dynamic shared memory areas.
+ *
+ * This module provides dynamic shared memory areas which are built on top of
+ * DSM segments. While dsm.c allows segments of shared memory to be created
+ * and shared between backends, it isn't designed to deal with small objects.
+ * A DSA area is a shared memory heap backed by one or more DSM segments, in
+ * which memory is managed using dsa_allocate() and dsa_free().
+ * Unlike the regular system heap, it deals in pseudo-pointers which must be
+ * converted to backend-local pointers before they are dereferenced. These
+ * pseudo-pointers can however be shared with other backends, and can be used
+ * to construct shared data structures.
+ *
+ * Each DSA area manages one or more DSM segments, adding new segments as
+ * required and detaching them when they are no longer needed. Each segment
+ * contains a number of 4KB pages, a free page manager for tracking
+ * consecutive runs of free pages, and a page map for tracking the source of
+ * objects allocated on each page. Allocation requests above 8KB are handled
+ * by choosing a segment and finding consecutive free pages in its free page
+ * manager. Allocation requests for smaller sizes are handled using pools of
+ * objects of a selection of sizes. Each pool consists of a number of 16 page
+ * (64KB) superblocks allocated in the same way as large objects. Allocation
+ * of large objects and new superblocks is serialized by a single LWLock, but
+ * allocation of small objects from pre-existing superblocks uses one LWLock
+ * per pool. Currently there is one pool, and therefore one lock, per size
+ * class. Per-core pools to increase concurrency and strategies for reducing
+ * the resulting fragmentation are areas for future research. Each superblock
+ * is managed with a 'span', which tracks the superblock's freelist. Free
+ * requests are handled by looking in the page map to find which span an
+ * address was allocated from, so that small objects can be returned to the
+ * appropriate free list, and large object pages can be returned directly to
+ * the free page map. When allocating, simple heuristics for selecting
+ * segments and superblocks try to encourage occupied memory to be
+ * concentrated, increasing the likelihood that whole superblocks can become
+ * empty and be returned to the free page manager, and whole segments can
+ * become empty and be returned to the operating system.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/ipc/dsa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/barrier.h"
+#include "storage/dsa.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "utils/freepage.h"
+#include "utils/memutils.h"
+
+/*
+ * The size of the initial DSM segment that backs a dsa_area. After creating
+ * some number of segments of this size we'll double the size, and so on.
+ * Larger segments may be created if necessary to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE (1 * 1024 * 1024)
+
+/*
+ * How many segments to create before we double the segment size. If this is
+ * low, then there is likely to be a lot of wasted space in the largest
+ * segment. If it is high, then we risk running out of segment slots (see
+ * dsm.c's limits on total number of segments), or limiting the total size
+ * an area can manage when using small pointers.
+ */
+#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 4
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40 /* Plenty of segments up to 1TB */
+#endif
+
+/*
+ * The maximum number of DSM segments that an area can own, determined by
+ * the number of bits remaining.
+ */
+#define DSA_MAX_SEGMENTS (1 << ((SIZEOF_DSA_POINTER * 8) - DSA_OFFSET_WIDTH))
+
+/* The bitmask for extracting the offset from a dsa_pointer. */
+#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
+/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
+#define DSA_PAGES_PER_SUPERBLOCK 16
+
+/*
+ * A magic number used as a sanity check for following DSM segments belonging
+ * to a DSA area (this number will be XORed with the area handle and
+ * the segment index).
+ */
+#define DSA_SEGMENT_HEADER_MAGIC 0x0ce26608
+
+/* Build a dsa_pointer given a segment number and offset. */
+#define DSA_MAKE_POINTER(segment_number, offset) \
+ (((dsa_pointer) (segment_number) << DSA_OFFSET_WIDTH) | (offset))
+
+/* Extract the segment number from a dsa_pointer. */
+#define DSA_EXTRACT_SEGMENT_NUMBER(dp) ((dp) >> DSA_OFFSET_WIDTH)
+
+/* Extract the offset from a dsa_pointer. */
+#define DSA_EXTRACT_OFFSET(dp) ((dp) & DSA_OFFSET_BITMASK)
+
+/* The type used for segment indexes (zero based). */
+typedef Size dsa_segment_index;
+
+/* Sentinel value for dsa_segment_index indicating 'none' or 'end'. */
+#define DSA_SEGMENT_INDEX_NONE (~(dsa_segment_index)0)
+
+/*
+ * How many bins of segments do we have? The bins are used to categorize
+ * segments by their largest contiguous run of free pages.
+ */
+#define DSA_NUM_SEGMENT_BINS 16
+
+/*
+ * What is the lowest bin that holds segments that *might* have n contiguous
+ * free pages? There is no point in looking in segments in lower bins; they
+ * definitely can't service a request for n free pages.
+ */
+#define contiguous_pages_to_segment_bin(n) Min(fls(n), DSA_NUM_SEGMENT_BINS - 1)
+
+/* Macros for access to locks. */
+#define DSA_AREA_LOCK(area) (&area->control->lock)
+#define DSA_SCLASS_LOCK(area, sclass) (&area->control->pools[sclass].lock)
+
+/*
+ * The header for an individual segment. This lives at the start of each DSM
+ * segment owned by a DSA area including the first segment (where it appears
+ * as part of the dsa_area_control struct).
+ */
+typedef struct
+{
+ /* Sanity check magic value. */
+ uint32 magic;
+ /* Total number of pages in this segment (excluding metadata area). */
+ Size usable_pages;
+ /* Total size of this segment in bytes. */
+ Size size;
+
+ /*
+ * Index of the segment that precedes this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the first one.
+ */
+ dsa_segment_index prev;
+
+ /*
+ * Index of the segment that follows this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the last one.
+ */
+ dsa_segment_index next;
+ /* The index of the bin that contains this segment. */
+ Size bin;
+
+ /*
+ * A flag raised to indicate that this segment is being returned to the
+ * operating system and has been unpinned.
+ */
+ bool freed;
+} dsa_segment_header;
+
+/*
+ * Metadata for one superblock.
+ *
+ * For most blocks, span objects are stored out-of-line; that is, the span
+ * object is not stored within the block itself. But, as an exception, for a
+ * "span of spans", the span object is stored "inline". The allocation is
+ * always exactly one page, and the dsa_area_span object is located at
+ * the beginning of that page. The size class is DSA_SCLASS_BLOCK_OF_SPANS,
+ * and the remaining fields are used just as they would be in an ordinary
+ * block. We can't allocate spans out of ordinary superblocks because
+ * creating an ordinary superblock requires us to be able to allocate a span
+ * *first*. Doing it this way avoids that circularity.
+ */
+typedef struct
+{
+ dsa_pointer pool; /* Containing pool. */
+ dsa_pointer prevspan; /* Previous span. */
+ dsa_pointer nextspan; /* Next span. */
+ dsa_pointer start; /* Starting address. */
+ Size npages; /* Length of span in pages. */
+ uint16 size_class; /* Size class. */
+ uint16 ninitialized; /* Maximum number of objects ever allocated. */
+ uint16 nallocatable; /* Number of objects currently allocatable. */
+ uint16 firstfree; /* First object on free list. */
+ uint16 nmax; /* Maximum number of objects ever possible. */
+ uint16 fclass; /* Current fullness class. */
+} dsa_area_span;
+
+/*
+ * Given a pointer to an object in a span, access the index of the next free
+ * object in the same span (i.e. in the span's freelist) as an L-value.
+ */
+#define NextFreeObjectIndex(object) (* (uint16 *) (object))
+
+/*
+ * Small allocations are handled by dividing a single block of memory into
+ * many small objects of equal size. The possible allocation sizes are
+ * defined by the following array. Larger size classes are spaced more widely
+ * than smaller size classes. We fudge the spacing for size classes >1kB to
+ * avoid space wastage: based on the knowledge that we plan to allocate 64kB
+ * blocks, we bump the maximum object size up to the largest multiple of
+ * 8 bytes that still lets us fit the same number of objects into one block.
+ *
+ * NB: Because of this fudging, if we were ever to use differently-sized blocks
+ * for small allocations, these size classes would need to be reworked to be
+ * optimal for the new size.
+ *
+ * NB: The optimal spacing for size classes, as well as the size of the blocks
+ * out of which small objects are allocated, is not a question that has one
+ * right answer. Some allocators (such as tcmalloc) use more closely-spaced
+ * size classes than we do here, while others (like aset.c) use more
+ * widely-spaced classes. Spacing the classes more closely avoids wasting
+ * memory within individual chunks, but also means a larger number of
+ * potentially-unfilled blocks.
+ */
+static const uint16 dsa_size_classes[] = {
+ sizeof(dsa_area_span), 0, /* special size classes */
+ 8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
+ 80, 96, 112, 128, /* 4 classes separated by 16 bytes */
+ 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
+ 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
+ 640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
+ 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
+ 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
+ 5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
+};
+#define DSA_NUM_SIZE_CLASSES lengthof(dsa_size_classes)
+
+/* Special size classes. */
+#define DSA_SCLASS_BLOCK_OF_SPANS 0
+#define DSA_SCLASS_SPAN_LARGE 1
+
+/*
+ * The following lookup table is used to map the size of small objects
+ * (less than 1kB) onto the corresponding size class. To use this table,
+ * round the size of the object up to the next multiple of 8 bytes, divide
+ * by 8 and subtract 1, and then index into this array.
+ */
+static const uint8 dsa_size_class_map[] = {
+ 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 11, 12, 12, 13, 13,
+ 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17,
+ 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19,
+ 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21,
+ 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
+ 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23,
+ 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+ 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25
+};
+#define DSA_SIZE_CLASS_MAP_QUANTUM 8
+
+/*
+ * Superblocks are binned by how full they are. Generally, each fullness
+ * class corresponds to one quartile, but the block being used for
+ * allocations is always at the head of the list for fullness class 1,
+ * regardless of how full it really is.
+ *
+ * For large objects, we just stick all of the allocations in fullness class
+ * 0. Since we can just return the space directly to the free page manager,
+ * we don't really need them on a list at all, except that if someone wants
+ * to bulk release everything allocated using this dsa_area, we
+ * have no other way of finding them.
+ */
+#define DSA_FULLNESS_CLASSES 4
+
+/*
+ * Maximum length of a DSA name.
+ */
+#define DSA_MAXLEN 64
+
+/*
+ * A dsa_area_pool represents a set of objects of a given size class.
+ *
+ * Perhaps there should be multiple pools for the same size class for
+ * contention avoidance, but for now there is just one!
+ */
+typedef struct
+{
+ /* A lock protecting access to this pool. */
+ LWLock lock;
+ /* A set of linked lists of spans, arranged by fullness. */
+ dsa_pointer spans[DSA_FULLNESS_CLASSES];
+ /* Should we pad this out to a cacheline boundary? */
+} dsa_area_pool;
+
+/*
+ * The control block for an area. This lives in shared memory, at the start of
+ * the first DSM segment controlled by this area.
+ */
+typedef struct
+{
+ /* The segment header for the first segment. */
+ dsa_segment_header segment_header;
+ /* The handle for this area. */
+ dsa_handle handle;
+ /* The handles of the segments owned by this area. */
+ dsm_handle segment_handles[DSA_MAX_SEGMENTS];
+ /* Lists of segments, binned by maximum contiguous run of free pages. */
+ dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
+ /* The object pools for each size class. */
+ dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* The total size of all active segments. */
+ Size total_segment_size;
+ /* The maximum total size of backing storage we are allowed. */
+ Size max_total_segment_size;
+ /* The reference count for this area. */
+ int refcnt;
+ /* A flag indicating that this area has been pinned. */
+ bool pinned;
+ /* The number of times that segments have been freed. */
+ Size freed_segment_counter;
+ /* The LWLock tranche ID. */
+ int lwlock_tranche_id;
+ char lwlock_tranche_name[DSA_MAXLEN];
+ /* The general lock (protects everything except object pools). */
+ LWLock lock;
+} dsa_area_control;
+
+/* Given a pointer to a pool, find a dsa_pointer. */
+#define DsaAreaPoolToDsaPointer(area, p) \
+ DSA_MAKE_POINTER(0, (char *) (p) - (char *) (area)->control)
+
+/*
+ * A dsa_segment_map is stored within the backend-private memory of each
+ * individual backend. It holds the base address of the segment within that
+ * backend, plus the addresses of key objects within the segment. Those
+ * could instead be derived from the base address but it's handy to have them
+ * around.
+ */
+typedef struct
+{
+ dsm_segment *segment; /* DSM segment */
+ char *mapped_address; /* Address at which segment is mapped */
+ Size size; /* Size of the segment */
+ dsa_segment_header *header; /* Header (same as mapped_address) */
+ FreePageManager *fpm; /* Free page manager within segment. */
+ dsa_pointer *pagemap; /* Page map within segment. */
+} dsa_segment_map;
+
+/*
+ * Per-backend state for a storage area. Backends obtain one of these by
+ * creating an area or attaching to an existing one using a handle. Each
+ * process that needs to use an area uses its own object to track where the
+ * segments are mapped.
+ */
+struct dsa_area
+{
+ /* Pointer to the control object in shared memory. */
+ dsa_area_control *control;
+
+ /* The lock tranche for this process. */
+ LWLockTranche lwlock_tranche;
+
+ /* Has the mapping been pinned? */
+ bool mapping_pinned;
+
+ /*
+ * This backend's array of segment maps, ordered by segment index
+ * corresponding to control->segment_handles. Some of the area's segments
+ * may not be mapped into this backend yet, and some slots may have been
+ * freed and need to be detached; these operations happen on demand.
+ */
+ dsa_segment_map segment_maps[DSA_MAX_SEGMENTS];
+
+ /* The last observed freed_segment_counter. */
+ Size freed_segment_counter;
+};
+
+#define DSA_SPAN_NOTHING_FREE ((uint16) -1)
+#define DSA_SUPERBLOCK_SIZE (DSA_PAGES_PER_SUPERBLOCK * FPM_PAGE_SIZE)
+
+/* Given a pointer to a segment_map, obtain a segment index number. */
+#define get_segment_index(area, segment_map_ptr) \
+ ((segment_map_ptr) - &(area)->segment_maps[0])
+
+static void init_span(dsa_area *area, dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class);
+static bool transfer_first_span(dsa_area *area, dsa_area_pool *pool,
+ int fromclass, int toclass);
+static inline dsa_pointer alloc_object(dsa_area *area, int size_class);
+static void dsa_on_dsm_segment_detach(dsm_segment *, Datum arg);
+static bool ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class);
+static dsa_segment_map *get_segment_by_index(dsa_area *area,
+ dsa_segment_index index);
+static void destroy_superblock(dsa_area *area, dsa_pointer span_pointer);
+static void unlink_span(dsa_area *area, dsa_area_span *span);
+static void add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer, int fclass);
+static void unlink_segment(dsa_area *area, dsa_segment_map *segment_map);
+static dsa_segment_map *get_best_segment(dsa_area *area, Size npages);
+static dsa_segment_map *make_new_segment(dsa_area *area, Size requested_pages);
+
+/*
+ * Create a new shared area with dynamic size. DSM segments will be allocated
+ * as required to extend the available space.
+ *
+ * We can't allocate a LWLock tranche_id within this function, because tranche
+ * IDs are a scarce resource; there are only 64k available, using low numbers
+ * when possible matters, and we have no provision for recycling them. So,
+ * we require the caller to provide one. The caller must also provide the
+ * tranche name, so that we can distinguish LWLocks belonging to different
+ * DSAs.
+ */
+dsa_area *
+dsa_create_dynamic(int tranche_id, const char *tranche_name)
+{
+ dsm_segment *segment;
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+ Size usable_pages;
+ Size total_pages;
+ Size metadata_bytes;
+ Size total_size;
+ int i;
+
+ total_size = DSA_INITIAL_SEGMENT_SIZE;
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages =
+ (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+
+ /*
+ * Create the DSM segment that will hold the shared control object and the
+ * first segment of usable space, and set it up. All segments backing
+ * this area are pinned, so that DSA can explicitly control their lifetime
+ * (otherwise a newly created segment belonging to this area might be
+ * freed when the only backend that happens to have it mapped in ends,
+ * corrupting the area).
+ */
+ segment = dsm_create(total_size, 0);
+ dsm_pin_segment(segment);
+
+ /*
+ * Initialize the dsa_area_control object located at the start of the
+ * segment.
+ */
+ control = dsm_segment_address(segment);
+ control->segment_header.magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ dsm_segment_handle(segment) ^ 0;
+ control->segment_header.next = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.usable_pages = usable_pages;
+ control->segment_header.freed = false;
+ control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+ control->handle = dsm_segment_handle(segment);
+ control->max_total_segment_size = SIZE_MAX;
+ control->total_segment_size = DSA_INITIAL_SEGMENT_SIZE;
+ memset(&control->segment_handles[0], 0,
+ sizeof(dsm_handle) * DSA_MAX_SEGMENTS);
+ control->segment_handles[0] = dsm_segment_handle(segment);
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ control->segment_bins[i] = DSA_SEGMENT_INDEX_NONE;
+ control->refcnt = 1;
+ control->freed_segment_counter = 0;
+ control->lwlock_tranche_id = tranche_id;
+ strlcpy(control->lwlock_tranche_name, tranche_name, DSA_MAXLEN);
+
+ /*
+ * Create the dsa_area object that this backend will use to access the
+ * area. Other backends will need to obtain their own dsa_area object by
+ * attaching.
+ */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(area->segment_maps, 0, sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+ LWLockInitialize(&control->lock, control->lwlock_tranche_id);
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ LWLockInitialize(DSA_SCLASS_LOCK(area, i),
+ control->lwlock_tranche_id);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = total_size;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Put this segment into the appropriate bin. */
+ control->segment_bins[contiguous_pages_to_segment_bin(usable_pages)] = 0;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_segment_detach, PointerGetDatum(NULL));
+
+ return area;
+}
+
+/*
+ * Obtain a handle that can be passed to other processes so that they can
+ * attach to the given area.
+ */
+dsa_handle
+dsa_get_handle(dsa_area *area)
+{
+ return area->control->handle;
+}
+
+/*
+ * Attach to an area given a handle generated (possibly in another
+ * process) by dsa_get_handle.
+ */
+dsa_area *
+dsa_attach_dynamic(dsa_handle handle)
+{
+ dsm_segment *segment;
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+
+ /*
+ * An area handle is really a DSM segment handle for the first segment, so
+ * we go ahead and attach to that.
+ */
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to area handle %u", handle);
+ control = dsm_segment_address(segment);
+ Assert(control->handle == handle);
+ Assert(control->segment_handles[0] == handle);
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ handle ^ 0));
+
+ /* Build the backend-local area object. */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(&area->segment_maps[0], 0,
+ sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = dsm_segment_map_length(segment);
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Bump the reference count. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ ++control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_segment_detach, PointerGetDatum(area));
+
+ return area;
+}
+
+/*
+ * Keep a DSA area attached until end of session or explicit detach.
+ *
+ * By default, areas are owned by the current resource owner, which means they
+ * are detached automatically when that scope ends.
+ */
+void
+dsa_pin_mapping(dsa_area *area)
+{
+ int i;
+
+ Assert(!area->mapping_pinned);
+ area->mapping_pinned = true;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_pin_mapping(area->segment_maps[i].segment);
+}
+
+/*
+ * Allocate memory in this storage area. The return value is a dsa_pointer
+ * that can be passed to other processes, and converted to a local pointer
+ * with dsa_get_address. If no memory is available, returns
+ * InvalidDsaPointer.
+ */
+dsa_pointer
+dsa_allocate(dsa_area *area, Size size)
+{
+ uint16 size_class;
+ dsa_pointer start_pointer;
+ dsa_segment_map *segment_map;
+
+ Assert(size > 0);
+
+ /*
+ * If bigger than the largest size class, just grab a run of pages from
+ * the free page manager, instead of allocating an object from a pool.
+ * There will still be a span, but it's a special class of span that
+ * manages this whole allocation and simply gives all pages back to the
+ * free page manager when dsa_free is called.
+ */
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ {
+ Size npages = fpm_size_to_pages(size);
+ Size first_page;
+ dsa_pointer span_pointer;
+ dsa_area_pool *pool = &area->control->pools[DSA_SCLASS_SPAN_LARGE];
+
+ /* Obtain a span object. */
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return InvalidDsaPointer;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+
+ /* Find a segment from which to allocate. */
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ /* Can't make any more segments: game over. */
+ LWLockRelease(DSA_AREA_LOCK(area));
+ dsa_free(area, span_pointer);
+ return InvalidDsaPointer;
+ }
+
+ /*
+ * Ask the free page manager for a run of pages. This should always
+ * succeed, since both get_best_segment and make_new_segment should
+ * only return a non-NULL pointer if the segment actually contains enough
+ * contiguous free space. If it does fail, something in our backend
+ * private state is out of whack, so use FATAL to kill the process.
+ */
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ elog(FATAL,
+ "dsa couldn't find run of pages: fpm_largest out of sync");
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ start_pointer = DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /* Initialize span and pagemap. */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ init_span(area, span_pointer, pool, start_pointer, npages,
+ DSA_SCLASS_SPAN_LARGE);
+ segment_map->pagemap[first_page] = span_pointer;
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+
+ return start_pointer;
+ }
+
+ /* Map allocation to a size class. */
+ if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+ Assert(size <= dsa_size_classes[size_class]);
+ Assert(size_class == 0 || size > dsa_size_classes[size_class - 1]);
+
+ /*
+ * Attempt to allocate an object from the appropriate pool. This might
+ * return InvalidDsaPointer if there's no space available.
+ */
+ return alloc_object(area, size_class);
+}
+
+/*
+ * Free memory obtained with dsa_allocate.
+ */
+void
+dsa_free(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_map *segment_map;
+ int pageno;
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ char *superblock;
+ char *object;
+ Size size;
+ int size_class;
+
+ /* Locate the object, span and pool. */
+ segment_map = get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(dp));
+ pageno = DSA_EXTRACT_OFFSET(dp) / FPM_PAGE_SIZE;
+ span_pointer = segment_map->pagemap[pageno];
+ span = dsa_get_address(area, span_pointer);
+ superblock = dsa_get_address(area, span->start);
+ object = dsa_get_address(area, dp);
+ size_class = span->size_class;
+ size = dsa_size_classes[size_class];
+
+ /*
+ * Special case for large objects that live in a special span: we return
+ * those pages directly to the free page manager and free the span.
+ */
+ if (span->size_class == DSA_SCLASS_SPAN_LARGE)
+ {
+ /* Give pages back to free page manager. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Unlink span. */
+ /* TODO: Does it even need to be linked in in the first place? */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ unlink_span(area, span);
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+ /* Free the span object so it can be reused. */
+ dsa_free(area, span_pointer);
+ return;
+ }
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /* Put the object on the span's freelist. */
+ Assert(object >= superblock);
+ Assert(object < superblock + DSA_SUPERBLOCK_SIZE);
+ Assert((object - superblock) % size == 0);
+ NextFreeObjectIndex(object) = span->firstfree;
+ span->firstfree = (object - superblock) / size;
+ ++span->nallocatable;
+
+ /*
+ * See if the span needs to be moved to a different fullness class, or be
+ * freed so its pages can be given back to the segment.
+ */
+ if (span->nallocatable == 1 && span->fclass == DSA_FULLNESS_CLASSES - 1)
+ {
+ /*
+ * The block was completely full and is located in the
+ * highest-numbered fullness class, which is never scanned for free
+ * chunks. We must move it to the next-lower fullness class.
+ */
+ unlink_span(area, span);
+ add_span_to_fullness_class(area, span, span_pointer,
+ DSA_FULLNESS_CLASSES - 2);
+
+ /*
+ * If this is the only span, and there is no active span, then we
+ * should probably move this span to fullness class 1. (Otherwise
+ * if you allocate exactly all the objects in the only span, it moves
+ * to class 3, then you free them all, it moves to 2, and then is
+ * given back, leaving no active span).
+ */
+ }
+ else if (span->nallocatable == span->nmax &&
+ (span->fclass != 1 || span->prevspan != InvalidDsaPointer))
+ {
+ /*
+ * This entire block is free, and it's not the active block for this
+ * size class. Return the memory to the free page manager. We don't
+ * do this for the active block to prevent thrashing: if we
+ * repeatedly allocate and free the only chunk in the active block, it
+ * will be very inefficient if we deallocate and reallocate the block
+ * every time.
+ */
+ destroy_superblock(area, span_pointer);
+ }
+
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+}
+
+/*
+ * Obtain a backend-local address for a dsa_pointer. 'dp' must have been
+ * allocated by the given area (possibly in another process). This may cause
+ * a segment to be mapped into the current process.
+ */
+void *
+dsa_get_address(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_index index;
+ Size offset;
+ Size freed_segment_counter;
+
+ /* Convert InvalidDsaPointer to NULL. */
+ if (!DsaPointerIsValid(dp))
+ return NULL;
+
+ index = DSA_EXTRACT_SEGMENT_NUMBER(dp);
+ offset = DSA_EXTRACT_OFFSET(dp);
+
+ Assert(index < DSA_MAX_SEGMENTS);
+
+ /* Check if we need to cause this segment to be mapped in. */
+ if (area->segment_maps[index].mapped_address == NULL)
+ {
+ /* Call for effect (we don't need the result). */
+ get_segment_by_index(area, index);
+ }
+
+ /*
+ * Take this opportunity to check if we need to detach from any segments
+ * that have been freed. This is an unsynchronized read of the value in
+ * shared memory, but all that matters is that we eventually observe a
+ * change when that number moves.
+ */
+ freed_segment_counter = area->control->freed_segment_counter;
+ if (area->freed_segment_counter != freed_segment_counter)
+ {
+ int i;
+
+ /* Check all currently mapped segments to find what's been freed. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ if (area->segment_maps[i].header != NULL &&
+ area->segment_maps[i].header->freed)
+ {
+ dsm_detach(area->segment_maps[i].segment);
+ area->segment_maps[i].segment = NULL;
+ area->segment_maps[i].header = NULL;
+ area->segment_maps[i].mapped_address = NULL;
+ }
+ }
+ area->freed_segment_counter = freed_segment_counter;
+ }
+
+ return area->segment_maps[index].mapped_address + offset;
+}
+
+/*
+ * Pin this area, so that it will continue to exist even if all backends
+ * detach from it. In that case, the area can still be reattached to if a
+ * handle has been recorded somewhere.
+ */
+void
+dsa_pin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ if (area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_pin: area already pinned");
+ }
+ area->control->pinned = true;
+ ++area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Undo the effects of dsa_pin, so that the given area can be freed when no
+ * backends are attached to it. May be called only if dsa_pin has been
+ * called.
+ */
+void
+dsa_unpin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ Assert(area->control->refcnt > 1);
+ if (!area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_unpin: area not pinned");
+ }
+ area->control->pinned = false;
+ --area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Set the total size limit for this area. This limit is checked whenever new
+ * segments need to be allocated from the operating system. If the new size
+ * limit is already exceeded, this has no immediate effect.
+ *
+ * Note that the total virtual memory usage may be temporarily larger than
+ * this limit when segments have been freed, but not yet detached by all
+ * backends that have attached to them.
+ */
+void
+dsa_set_size_limit(dsa_area *area, Size limit)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ area->control->max_total_segment_size = limit;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Aggressively free all spare memory in the hope of returning DSM segments to
+ * the operating system.
+ */
+void
+dsa_trim(dsa_area *area)
+{
+ int size_class;
+
+ /*
+ * Trim in reverse pool order so we get to the spans-of-spans last, just
+ * in case any become entirely free while processing all the other pools.
+ */
+ for (size_class = DSA_NUM_SIZE_CLASSES - 1; size_class >= 0; --size_class)
+ {
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_pointer span_pointer;
+
+ if (size_class == DSA_SCLASS_SPAN_LARGE)
+ /* Large object frees give back segments aggressively already. */
+ continue;
+
+ /*
+ * Search fullness class 1 only. That is where we expect to find
+ * an entirely empty superblock (entirely empty superblocks in other
+ * fullness classes are returned to the free page map by dsa_free).
+ */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+ span_pointer = pool->spans[1];
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ dsa_pointer next = span->nextspan;
+
+ if (span->nallocatable == span->nmax)
+ destroy_superblock(area, span_pointer);
+
+ span_pointer = next;
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+ }
+}
+
+/*
+ * Print out debugging information about the internal state of the shared
+ * memory area.
+ */
+void
+dsa_dump(dsa_area *area)
+{
+ Size i,
+ j;
+
+ /*
+ * Note: This gives an inconsistent snapshot as it acquires and releases
+ * individual locks as it goes...
+ */
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ fprintf(stderr, "dsa_area handle %x:\n", area->control->handle);
+ fprintf(stderr, " max_total_segment_size: %zu\n",
+ area->control->max_total_segment_size);
+ fprintf(stderr, " total_segment_size: %zu\n",
+ area->control->total_segment_size);
+ fprintf(stderr, " refcnt: %d\n", area->control->refcnt);
+ fprintf(stderr, " pinned: %c\n", area->control->pinned ? 't' : 'f');
+ fprintf(stderr, " segment bins:\n");
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ {
+ if (area->control->segment_bins[i] != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_index segment_index;
+
+ fprintf(stderr,
+ " segment bin %zu (at least %d contiguous pages free):\n",
+ i, i == 0 ? 0 : 1 << (i - 1));
+ segment_index = area->control->segment_bins[i];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, segment_index);
+
+ fprintf(stderr,
+ " segment index %zu, usable_pages = %zu, "
+ "contiguous_pages = %zu, mapped at %p\n",
+ segment_index,
+ segment_map->header->usable_pages,
+ fpm_largest(segment_map->fpm),
+ segment_map->mapped_address);
+ segment_index = segment_map->header->next;
+ }
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ fprintf(stderr, " pools:\n");
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ {
+ bool found = false;
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, i), LW_EXCLUSIVE);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ if (DsaPointerIsValid(area->control->pools[i].spans[j]))
+ found = true;
+ if (found)
+ {
+ if (i == DSA_SCLASS_BLOCK_OF_SPANS)
+ fprintf(stderr, " pool for blocks of span objects:\n");
+ else if (i == DSA_SCLASS_SPAN_LARGE)
+ fprintf(stderr, " pool for large object spans:\n");
+ else
+ fprintf(stderr,
+ " pool for size class %zu (object size %hu bytes):\n",
+ i, dsa_size_classes[i]);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ {
+ if (!DsaPointerIsValid(area->control->pools[i].spans[j]))
+ fprintf(stderr, " fullness class %zu is empty\n", j);
+ else
+ {
+ dsa_pointer span_pointer = area->control->pools[i].spans[j];
+
+ fprintf(stderr, " fullness class %zu:\n", j);
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span;
+
+ span = dsa_get_address(area, span_pointer);
+ fprintf(stderr,
+ " span descriptor at %016lx, "
+ "superblock at %016lx, pages = %zu, "
+ "objects free = %hu/%hu\n",
+ span_pointer, span->start, span->npages,
+ span->nallocatable, span->nmax);
+ span_pointer = span->nextspan;
+ }
+ }
+ }
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, i));
+ }
+}
+
+/*
+ * A callback function for when the control segment for a dsa_area is
+ * detached.
+ */
+static void
+dsa_on_dsm_segment_detach(dsm_segment *segment, Datum arg)
+{
+ bool destroy = false;
+ dsa_area_control *control =
+ (dsa_area_control *) dsm_segment_address(segment);
+
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ control->handle ^ 0));
+
+ /* Decrement the reference count for the DSA area. */
+ LWLockAcquire(&control->lock, LW_EXCLUSIVE);
+ if (--control->refcnt == 0)
+ destroy = true;
+ LWLockRelease(&control->lock);
+
+ /*
+ * If we are the last to detach from the area, then we must unpin all
+ * segments so they can be returned to the OS.
+ */
+ if (destroy)
+ {
+ int i;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ dsm_handle handle;
+
+ handle = control->segment_handles[i];
+ if (handle != DSM_HANDLE_INVALID)
+ dsm_unpin_segment(handle);
+ }
+ }
+}
+
+/*
+ * Add a new span to fullness class 1 of the indicated pool.
+ */
+static void
+init_span(dsa_area *area,
+ dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ Size obsize = dsa_size_classes[size_class];
+
+ /*
+ * The per-pool lock must be held because we manipulate the span list for
+ * this pool.
+ */
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /* Push this span onto the front of the span list for fullness class 1. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ {
+ dsa_area_span *head = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+
+ head->prevspan = span_pointer;
+ }
+ span->pool = DsaAreaPoolToDsaPointer(area, pool);
+ span->nextspan = pool->spans[1];
+ span->prevspan = InvalidDsaPointer;
+ pool->spans[1] = span_pointer;
+
+ span->start = start;
+ span->npages = npages;
+ span->size_class = size_class;
+ span->ninitialized = 0;
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * A block-of-spans contains its own descriptor, so mark one object as
+ * initialized and reduce the count of allocatable objects by one.
+ * Doing this here has the side effect of also reducing nmax by one,
+ * which is important to make sure we free this object at the correct
+ * time.
+ */
+ span->ninitialized = 1;
+ span->nallocatable = FPM_PAGE_SIZE / obsize - 1;
+ }
+ else if (size_class != DSA_SCLASS_SPAN_LARGE)
+ span->nallocatable = DSA_SUPERBLOCK_SIZE / obsize;
+ span->firstfree = DSA_SPAN_NOTHING_FREE;
+ span->nmax = span->nallocatable;
+ span->fclass = 1;
+}
+
+/*
+ * Transfer the first span in one fullness class to the head of another
+ * fullness class.
+ */
+static bool
+transfer_first_span(dsa_area *area,
+ dsa_area_pool *pool, int fromclass, int toclass)
+{
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+
+ /* Can't do it if source list is empty. */
+ span_pointer = pool->spans[fromclass];
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+
+ /* Remove span from head of source list. */
+ span = dsa_get_address(area, span_pointer);
+ pool->spans[fromclass] = span->nextspan;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+
+ /* Add span to head of target list. */
+ span->nextspan = pool->spans[toclass];
+ pool->spans[toclass] = span_pointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = toclass;
+
+ return true;
+}
+
+/*
+ * Allocate one object of the requested size class from the given area.
+ */
+static inline dsa_pointer
+alloc_object(dsa_area *area, int size_class)
+{
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_area_span *span;
+ dsa_pointer block;
+ dsa_pointer result;
+ char *object;
+ Size size;
+
+ /*
+ * Even though ensure_active_superblock can in turn call alloc_object if
+ * it needs to allocate a new span, that's always from a different pool,
+ * and the order of lock acquisition is always the same, so it's OK that
+ * we hold this lock for the duration of this function.
+ */
+ Assert(!LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /*
+ * If there's no active superblock, we must successfully obtain one or
+ * fail the request.
+ */
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ !ensure_active_superblock(area, pool, size_class))
+ {
+ result = InvalidDsaPointer;
+ }
+ else
+ {
+ /*
+ * There should be a block in fullness class 1 at this point, and it
+ * should never be completely full. Thus we can either pop an object
+ * from the free list or, failing that, initialize a new object.
+ */
+ Assert(DsaPointerIsValid(pool->spans[1]));
+ span = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+ Assert(span->nallocatable > 0);
+ block = span->start;
+ Assert(size_class < DSA_NUM_SIZE_CLASSES);
+ size = dsa_size_classes[size_class];
+ if (span->firstfree != DSA_SPAN_NOTHING_FREE)
+ {
+ result = block + span->firstfree * size;
+ object = dsa_get_address(area, result);
+ span->firstfree = NextFreeObjectIndex(object);
+ }
+ else
+ {
+ result = block + span->ninitialized * size;
+ ++span->ninitialized;
+ }
+ --span->nallocatable;
+
+ /* If it's now full, move it to the highest-numbered fullness class. */
+ if (span->nallocatable == 0)
+ transfer_first_span(area, pool, 1, DSA_FULLNESS_CLASSES - 1);
+ }
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+
+ return result;
+}
+
+/*
+ * Ensure an active (i.e. fullness class 1) superblock, unless all existing
+ * superblocks are completely full and no more can be allocated.
+ *
+ * Fullness classes K of 0..N are loosely intended to represent blocks whose
+ * utilization percentage is at least K/N, but we only enforce this rigorously
+ * for the highest-numbered fullness class, which always contains exactly
+ * those blocks that are completely full. It's otherwise acceptable for a
+ * block to be in a higher-numbered fullness class than the one to which it
+ * logically belongs. In addition, the active block, which is always the
+ * first block in fullness class 1, is permitted to have a higher allocation
+ * percentage than would normally be allowable for that fullness class; we
+ * don't move it until it's completely full, and then it goes to the
+ * highest-numbered fullness class.
+ *
+ * It might seem odd that the active block is the head of fullness class 1
+ * rather than fullness class 0, but experience with other allocators has
+ * shown that it's usually better to allocate from a block that's moderately
+ * full rather than one that's nearly empty. Insofar as is reasonably
+ * possible, we want to avoid performing new allocations in a block that would
+ * otherwise become empty soon.
+ */
+static bool
+ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class)
+{
+ dsa_pointer span_pointer;
+ dsa_pointer start_pointer;
+ Size obsize = dsa_size_classes[size_class];
+ Size nmax;
+ int fclass;
+ Size npages = 1;
+ Size first_page;
+ Size i;
+ dsa_segment_map *segment_map;
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /*
+ * Compute the number of objects that will fit in a block of this size
+ * class. A block-of-spans is just a single page, and its first object
+ * isn't available for use because it describes the block-of-spans
+ * itself.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ nmax = FPM_PAGE_SIZE / obsize - 1;
+ else
+ nmax = DSA_SUPERBLOCK_SIZE / obsize;
+
+ /*
+ * If fullness class 1 is empty, try to find a span to put in it by
+ * scanning higher-numbered fullness classes (excluding the last one,
+ * whose blocks are certain to all be completely full).
+ */
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ {
+ span_pointer = pool->spans[fclass];
+
+ while (DsaPointerIsValid(span_pointer))
+ {
+ int tfclass;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+ dsa_area_span *prevspan;
+ dsa_pointer next_span_pointer;
+
+ span = (dsa_area_span *)
+ dsa_get_address(area, span_pointer);
+ next_span_pointer = span->nextspan;
+
+ /* Figure out what fullness class should contain this span. */
+ tfclass = (nmax - span->nallocatable)
+ * (DSA_FULLNESS_CLASSES - 1) / nmax;
+
+ /* Look up next span. */
+ if (DsaPointerIsValid(span->nextspan))
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ else
+ nextspan = NULL;
+
+ /*
+ * If utilization has dropped enough that this now belongs in some
+ * other fullness class, move it there.
+ */
+ if (tfclass < fclass)
+ {
+ /* Remove from the current fullness class list. */
+ if (pool->spans[fclass] == span_pointer)
+ {
+ /* It was the head; remove it. */
+ Assert(!DsaPointerIsValid(span->prevspan));
+ pool->spans[fclass] = span->nextspan;
+ if (nextspan != NULL)
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+ else
+ {
+ /* It was not the head. */
+ Assert(DsaPointerIsValid(span->prevspan));
+ prevspan = (dsa_area_span *)
+ dsa_get_address(area, span->prevspan);
+ prevspan->nextspan = span->nextspan;
+ }
+ if (nextspan != NULL)
+ nextspan->prevspan = span->prevspan;
+
+ /* Push onto the head of the new fullness class list. */
+ span->nextspan = pool->spans[tfclass];
+ pool->spans[tfclass] = span_pointer;
+ span->prevspan = InvalidDsaPointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = tfclass;
+ }
+
+ /* Advance to next span on list. */
+ span_pointer = next_span_pointer;
+ }
+
+ /* Stop now if we found a suitable block. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ return true;
+ }
+
+ /*
+ * If there are no blocks that properly belong in fullness class 1, pick
+ * one from some other fullness class and move it there anyway, so that we
+ * have an allocation target. Our last choice is to transfer a block
+ * that's almost empty (and might become completely empty soon if left
+ * alone), but even that is better than failing, which is what we must do
+ * if there are no blocks at all with free space.
+ */
+ Assert(!DsaPointerIsValid(pool->spans[1]));
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ if (transfer_first_span(area, pool, fclass, 1))
+ return true;
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ transfer_first_span(area, pool, 0, 1))
+ return true;
+
+ /*
+ * We failed to find an existing span with free objects, so we need to
+ * allocate a new superblock and construct a new span to manage it.
+ *
+ * First, get a dsa_area_span object to describe the new superblock
+ * ... unless this allocation is for a dsa_area_span object, in which case
+ * that's surely not going to work. We handle that case by storing the
+ * span describing a block-of-spans inline.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+ npages = DSA_PAGES_PER_SUPERBLOCK;
+ }
+
+ /* Find or create a segment and allocate the superblock. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ return false;
+ }
+ }
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* Compute the start of the superblock. */
+ start_pointer =
+ DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /*
+ * If this is a block-of-spans, carve the descriptor right out of the
+ * allocated space.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * The span descriptor is the first object in the block itself, so the
+ * span pointer is simply the start of the newly allocated page.
+ */
+ span_pointer = start_pointer;
+ }
+
+ /* Initialize span and pagemap. */
+ init_span(area, span_pointer, pool, start_pointer, npages, size_class);
+ for (i = 0; i < npages; ++i)
+ segment_map->pagemap[first_page + i] = span_pointer;
+
+ return true;
+}
+
+/*
+ * Return the segment map corresponding to a given segment index, mapping the
+ * segment in if necessary.
+ */
+static dsa_segment_map *
+get_segment_by_index(dsa_area *area, dsa_segment_index index)
+{
+ if (area->segment_maps[index].mapped_address == NULL) /* unlikely */
+ {
+ dsm_handle handle;
+ dsm_segment *segment;
+ dsa_segment_map *segment_map;
+
+ handle = area->control->segment_handles[index];
+
+ /* This slot has been freed. */
+ if (handle == DSM_HANDLE_INVALID)
+ return NULL;
+
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to segment");
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+ segment_map = &area->segment_maps[index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header =
+ (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = dsm_segment_map_length(segment);
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ Assert(segment_map->header->magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index));
+ }
+
+ return &area->segment_maps[index];
+}
+
+/*
+ * Return a superblock to the free page manager. If the underlying segment
+ * has become entirely free, then return it to the operating system.
+ *
+ * The appropriate pool lock must be held.
+ */
+static void
+destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ int size_class = span->size_class;
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
+
+ /* Remove it from its fullness class list. */
+ unlink_span(area, span);
+
+ /*
+ * Note: Here we acquire the area lock while we already hold a per-pool
+ * lock. We never hold the area lock and then take a pool lock, or we
+ * could deadlock.
+ */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ /* Check if the segment is now entirely free. */
+ if (fpm_largest(segment_map->fpm) == segment_map->header->usable_pages)
+ {
+ dsa_segment_index index = get_segment_index(area, segment_map);
+
+ /* If it's not the segment with extra control data, free it. */
+ if (index != 0)
+ {
+ /*
+ * Give it back to the OS, and allow other backends to detect that
+ * they need to detach.
+ */
+ unlink_segment(area, segment_map);
+ segment_map->header->freed = true;
+ Assert(area->control->total_segment_size >=
+ segment_map->header->size);
+ area->control->total_segment_size -=
+ segment_map->header->size;
+ dsm_unpin_segment(dsm_segment_handle(segment_map->segment));
+ dsm_detach(segment_map->segment);
+ area->control->segment_handles[index] = DSM_HANDLE_INVALID;
+ ++area->control->freed_segment_counter;
+ segment_map->segment = NULL;
+ segment_map->header = NULL;
+ segment_map->mapped_address = NULL;
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /*
+ * A block-of-spans stores the span which describes it within the block
+ * itself, so freeing the storage implicitly frees the descriptor too.
+ * If this is a block of any other type, we need to separately free
+ * the span object also. This recursive call to dsa_free will acquire the
+ * span pool's lock. We can't deadlock because the acquisition order is
+ * always some other pool and then the span pool.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+}
+
+static void
+unlink_span(dsa_area *area, dsa_area_span *span)
+{
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ dsa_area_span *next = dsa_get_address(area, span->nextspan);
+
+ next->prevspan = span->prevspan;
+ }
+ if (DsaPointerIsValid(span->prevspan))
+ {
+ dsa_area_span *prev = dsa_get_address(area, span->prevspan);
+
+ prev->nextspan = span->nextspan;
+ }
+ else
+ {
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ pool->spans[span->fclass] = span->nextspan;
+ }
+}
+
+static void
+add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer,
+ int fclass)
+{
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ if (DsaPointerIsValid(pool->spans[fclass]))
+ {
+ dsa_area_span *head = dsa_get_address(area,
+ pool->spans[fclass]);
+
+ head->prevspan = span_pointer;
+ }
+ span->prevspan = InvalidDsaPointer;
+ span->nextspan = pool->spans[fclass];
+ pool->spans[fclass] = span_pointer;
+ span->fclass = fclass;
+}
+
+/*
+ * Detach from an area that was either created or attached to by this process.
+ */
+void
+dsa_detach(dsa_area *area)
+{
+ int i;
+
+ /* Detach from all segments. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_detach(area->segment_maps[i].segment);
+
+ /* Free the backend-local area object. */
+ pfree(area);
+}
+
+/*
+ * Unlink a segment from the bin that contains it.
+ */
+static void
+unlink_segment(dsa_area *area, dsa_segment_map *segment_map)
+{
+ if (segment_map->header->prev != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *prev;
+
+ prev = get_segment_by_index(area, segment_map->header->prev);
+ prev->header->next = segment_map->header->next;
+ }
+ else
+ {
+ Assert(area->control->segment_bins[segment_map->header->bin] ==
+ get_segment_index(area, segment_map));
+ area->control->segment_bins[segment_map->header->bin] =
+ segment_map->header->next;
+ }
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area, segment_map->header->next);
+ next->header->prev = segment_map->header->prev;
+ }
+}
+
+/*
+ * Find a segment that could satisfy a request for 'npages' of contiguous
+ * memory, or return NULL if none can be found. This may involve attaching to
+ * segments that weren't previously attached so that we can query their free
+ * pages map.
+ */
+static dsa_segment_map *
+get_best_segment(dsa_area *area, Size npages)
+{
+ Size bin;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /*
+ * Start searching from the first bin that *might* have enough contiguous
+ * pages.
+ */
+ for (bin = contiguous_pages_to_segment_bin(npages);
+ bin < DSA_NUM_SEGMENT_BINS;
+ ++bin)
+ {
+ /*
+ * The minimum contiguous size that any segment in this bin should
+ * have. We'll re-bin if we see segments with fewer.
+ */
+ Size threshold = (Size) 1 << (bin - 1);
+ dsa_segment_index segment_index;
+
+ /* Search this bin for a segment with enough contiguous space. */
+ segment_index = area->control->segment_bins[bin];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+ dsa_segment_index next_segment_index;
+ Size contiguous_pages;
+
+ segment_map = get_segment_by_index(area, segment_index);
+ next_segment_index = segment_map->header->next;
+ contiguous_pages = fpm_largest(segment_map->fpm);
+
+ /* Not enough for the request, still enough for this bin. */
+ if (contiguous_pages >= threshold && contiguous_pages < npages)
+ {
+ segment_index = next_segment_index;
+ continue;
+ }
+
+ /* Re-bin it if it's no longer in the appropriate bin. */
+ if (contiguous_pages < threshold)
+ {
+ Size new_bin;
+
+ new_bin = contiguous_pages_to_segment_bin(contiguous_pages);
+
+ /* Remove it from its current bin. */
+ unlink_segment(area, segment_map);
+
+ /* Push it onto the front of its new bin. */
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[new_bin];
+ segment_map->header->bin = new_bin;
+ area->control->segment_bins[new_bin] = segment_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area,
+ segment_map->header->next);
+ Assert(next->header->bin == new_bin);
+ next->header->prev = segment_index;
+ }
+
+ /*
+ * But fall through to see if it's enough to satisfy this
+ * request anyway....
+ */
+ }
+
+ /* Check if we are done. */
+ if (contiguous_pages >= npages)
+ return segment_map;
+
+ /* Continue searching the same bin. */
+ segment_index = next_segment_index;
+ }
+ }
+
+ /* Not found. */
+ return NULL;
+}
+
+/*
+ * Create a new segment that can handle at least requested_pages. Returns
+ * NULL if the requested total size limit or maximum allowed number of
+ * segments would be exceeded.
+ */
+static dsa_segment_map *
+make_new_segment(dsa_area *area, Size requested_pages)
+{
+ dsa_segment_index new_index;
+ Size metadata_bytes;
+ Size total_size;
+ Size total_pages;
+ Size usable_pages;
+ dsa_segment_map *segment_map;
+ dsm_segment *segment;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /* Find a segment slot that is not in use (linearly for now). */
+ for (new_index = 1; new_index < DSA_MAX_SEGMENTS; ++new_index)
+ {
+ if (area->control->segment_handles[new_index] == DSM_HANDLE_INVALID)
+ break;
+ }
+ if (new_index == DSA_MAX_SEGMENTS)
+ return NULL;
+
+ /*
+ * If the total size limit is already exceeded, then we exit early and
+ * avoid arithmetic wraparound in the unsigned expressions below.
+ */
+ if (area->control->total_segment_size >=
+ area->control->max_total_segment_size)
+ return NULL;
+
+ /*
+ * The size should be at least as big as requested, and at least big
+ * enough to follow a geometric series that approximately doubles the
+ * total storage each time we create a new segment. We use geometric
+ * growth because the underlying DSM system isn't designed for large
+ * numbers of segments (otherwise we might even consider just using one
+ * DSM segment for each large allocation and for each superblock, and then
+ * we wouldn't need to use FreePageManager).
+ *
+ * We decide on a total segment size first, so that we produce tidy
+ * power-of-two sized segments. This is a good property to have if we
+ * move to huge pages in the future. Then we work back to the number of
+ * pages we can fit.
+ */
+ total_size = DSA_INITIAL_SEGMENT_SIZE *
+ ((Size) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
+ total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size,
+ area->control->max_total_segment_size -
+ area->control->total_segment_size);
+
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ sizeof(dsa_pointer) * total_pages;
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ if (total_size <= metadata_bytes)
+ return NULL;
+ usable_pages = (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ Assert(metadata_bytes + usable_pages * FPM_PAGE_SIZE <= total_size);
+
+ /* See if that is enough... */
+ if (requested_pages > usable_pages)
+ {
+ /*
+ * We'll make an odd-sized segment, working forward from the requested
+ * number of pages.
+ */
+ usable_pages = requested_pages;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ usable_pages * sizeof(dsa_pointer);
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ total_size = metadata_bytes + usable_pages * FPM_PAGE_SIZE;
+
+ /* Is that too large for dsa_pointer's addressing scheme? */
+ if (total_size > DSA_MAX_SEGMENT_SIZE)
+ return NULL;
+
+ /* Would that exceed the limit? */
+ if (total_size > area->control->max_total_segment_size -
+ area->control->total_segment_size)
+ return NULL;
+ }
+
+ /* Create the segment. */
+ segment = dsm_create(total_size, 0);
+ if (segment == NULL)
+ return NULL;
+ dsm_pin_segment(segment);
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+
+ /* Store the handle in shared memory to be found by index. */
+ area->control->segment_handles[new_index] =
+ dsm_segment_handle(segment);
+
+ area->control->total_segment_size += total_size;
+ Assert(area->control->total_segment_size <=
+ area->control->max_total_segment_size);
+
+ /* Build a segment map for this segment in this backend. */
+ segment_map = &area->segment_maps[new_index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = total_size;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Set up the segment header and put it in the appropriate bin. */
+ segment_map->header->magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ new_index;
+ segment_map->header->usable_pages = usable_pages;
+ segment_map->header->size = total_size;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[segment_map->header->bin];
+ segment_map->header->freed = false;
+ area->control->segment_bins[segment_map->header->bin] = new_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next =
+ get_segment_by_index(area, segment_map->header->next);
+
+ Assert(next->header->bin == segment_map->header->bin);
+ next->header->prev = new_index;
+ }
+
+ return segment_map;
+}
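For reviewers following the sizing logic in make_new_segment above, here is a minimal standalone sketch of how the preferred segment size grows with the slot index. The SKETCH_* constants and function names are hypothetical stand-ins chosen for illustration; the real DSA_INITIAL_SEGMENT_SIZE, DSA_NUM_SEGMENTS_AT_EACH_SIZE and DSA_MAX_SEGMENT_SIZE are defined outside this excerpt and may well have different values. The clamp on the shift count is an addition of the sketch, not something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for illustration only; not taken from the patch. */
#define SKETCH_INITIAL_SEGMENT_SIZE  ((size_t) 1024 * 1024)
#define SKETCH_SEGMENTS_AT_EACH_SIZE 4
#define SKETCH_MAX_SEGMENT_SIZE      ((size_t) 1 << 30)
/* Largest shift that keeps the product at or below the cap (1MB << 10). */
#define SKETCH_MAX_SHIFT             10

static size_t
sketch_min(size_t a, size_t b)
{
    return a < b ? a : b;
}

/*
 * Preferred total size for the segment in slot 'index', given how many
 * bytes of the area's size limit are still unused.  This follows the same
 * three steps as make_new_segment: geometric size, per-segment cap, and
 * remaining budget.
 */
static size_t
sketch_segment_size(size_t index, size_t remaining)
{
    size_t shift;
    size_t size;

    /* Slot 0 is the control segment; new segments start at slot 1. */
    assert(index >= 1);

    /* Sketch-only guard so the multiplication below cannot overflow. */
    shift = index / SKETCH_SEGMENTS_AT_EACH_SIZE;
    if (shift > SKETCH_MAX_SHIFT)
        shift = SKETCH_MAX_SHIFT;

    size = SKETCH_INITIAL_SEGMENT_SIZE * ((size_t) 1 << shift);
    size = sketch_min(size, SKETCH_MAX_SEGMENT_SIZE);
    return sketch_min(size, remaining);
}
```

With these assumed values, slots 1 to 3 would get 1MB segments, slots 4 to 7 would get 2MB segments, and so on, until the per-segment cap or the area's remaining budget takes over.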
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index d806664..8c6abe3 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -182,7 +182,7 @@ dsm_postmaster_startup(PGShmemHeader *shim)
Assert(dsm_control_address == NULL);
Assert(dsm_control_mapped_size == 0);
dsm_control_handle = random();
- if (dsm_control_handle == 0)
+ if (dsm_control_handle == DSM_HANDLE_INVALID)
continue;
if (dsm_impl_op(DSM_OP_CREATE, dsm_control_handle, segsize,
&dsm_control_impl_private, &dsm_control_address,
@@ -476,6 +476,8 @@ dsm_create(Size size, int flags)
{
Assert(seg->mapped_address == NULL && seg->mapped_size == 0);
seg->handle = random();
+ if (seg->handle == DSM_HANDLE_INVALID) /* Reserve sentinel */
+ continue;
if (dsm_impl_op(DSM_OP_CREATE, seg->handle, size, &seg->impl_private,
&seg->mapped_address, &seg->mapped_size, ERROR))
break;
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index b2403e1..20973af 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/mmgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = aset.o mcxt.o portalmem.o
+OBJS = aset.o freepage.o mcxt.o portalmem.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
new file mode 100644
index 0000000..fd1f2ec
--- /dev/null
+++ b/src/backend/utils/mmgr/freepage.c
@@ -0,0 +1,1812 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.c
+ * Management of free memory pages.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/freepage.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+
+#include "utils/freepage.h"
+#include "utils/relptr.h"
+
+
+/* Magic numbers to identify various page types */
+#define FREE_PAGE_SPAN_LEADER_MAGIC 0xea4020f0
+#define FREE_PAGE_LEAF_MAGIC 0x98eae728
+#define FREE_PAGE_INTERNAL_MAGIC 0x19aa32c9
+
+/* Doubly linked list of spans of free pages; stored in first page of span. */
+struct FreePageSpanLeader
+{
+ int magic; /* always FREE_PAGE_SPAN_LEADER_MAGIC */
+ Size npages; /* number of pages in span */
+ RelptrFreePageSpanLeader prev;
+ RelptrFreePageSpanLeader next;
+};
+
+/* Common header for btree leaf and internal pages. */
+typedef struct FreePageBtreeHeader
+{
+ int magic; /* FREE_PAGE_LEAF_MAGIC or
+ * FREE_PAGE_INTERNAL_MAGIC */
+ Size nused; /* number of items used */
+ RelptrFreePageBtree parent; /* uplink */
+} FreePageBtreeHeader;
+
+/* Internal key; points to next level of btree. */
+typedef struct FreePageBtreeInternalKey
+{
+ Size first_page; /* low bound for keys on child page */
+ RelptrFreePageBtree child; /* downlink */
+} FreePageBtreeInternalKey;
+
+/* Leaf key; no payload data. */
+typedef struct FreePageBtreeLeafKey
+{
+ Size first_page; /* first page in span */
+ Size npages; /* number of pages in span */
+} FreePageBtreeLeafKey;
+
+/* Work out how many keys will fit on a page. */
+#define FPM_ITEMS_PER_INTERNAL_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeInternalKey))
+#define FPM_ITEMS_PER_LEAF_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeLeafKey))
+
+/* A btree page of either sort */
+struct FreePageBtree
+{
+ FreePageBtreeHeader hdr;
+ union
+ {
+ FreePageBtreeInternalKey internal_key[FPM_ITEMS_PER_INTERNAL_PAGE];
+ FreePageBtreeLeafKey leaf_key[FPM_ITEMS_PER_LEAF_PAGE];
+ } u;
+};
+
+/* Results of a btree search */
+typedef struct FreePageBtreeSearchResult
+{
+ FreePageBtree *page;
+ Size index;
+ bool found;
+ unsigned split_pages;
+} FreePageBtreeSearchResult;
+
+/* Helper functions */
+static void FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm,
+ FreePageBtree *btp);
+static Size FreePageBtreeCleanup(FreePageManager *fpm);
+static FreePageBtree *FreePageBtreeFindLeftSibling(char *base,
+ FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeFindRightSibling(char *base,
+ FreePageBtree *btp);
+static Size FreePageBtreeFirstKey(FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeGetRecycled(FreePageManager *fpm);
+static void FreePageBtreeInsertInternal(char *base, FreePageBtree *btp,
+ Size index, Size first_page, FreePageBtree *child);
+static void FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index,
+ Size first_page, Size npages);
+static void FreePageBtreeRecycle(FreePageManager *fpm, Size pageno);
+static void FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp,
+ Size index);
+static void FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp);
+static void FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result);
+static Size FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page);
+static Size FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page);
+static FreePageBtree *FreePageBtreeSplitPage(FreePageManager *fpm,
+ FreePageBtree *btp);
+static void FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp);
+static void FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf);
+static void FreePageManagerDumpSpans(FreePageManager *fpm,
+ FreePageSpanLeader *span, Size expected_pages,
+ StringInfo buf);
+static bool FreePageManagerGetInternal(FreePageManager *fpm, Size npages,
+ Size *first_page);
+static Size FreePageManagerPutInternal(FreePageManager *fpm, Size first_page,
+ Size npages, bool soft);
+static void FreePagePopSpanLeader(FreePageManager *fpm, Size pageno);
+static void FreePagePushSpanLeader(FreePageManager *fpm, Size first_page,
+ Size npages);
+static void FreePageManagerUpdateLargest(FreePageManager *fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+static Size sum_free_pages(FreePageManager *fpm);
+#endif
+
+/*
+ * Initialize a new, empty free page manager.
+ *
+ * 'fpm' should reference caller-provided memory large enough to contain a
+ * FreePageManager. We'll initialize it here.
+ *
+ * 'base' is the address to which all pointers are relative. When managing
+ * a dynamic shared memory segment, it should normally be the base of the
+ * segment. When managing backend-private memory, it can be either NULL or,
+ * if managing a single contiguous extent of memory, the start of that extent.
+ */
+void
+FreePageManagerInitialize(FreePageManager *fpm, char *base)
+{
+ Size f;
+
+ relptr_store(base, fpm->self, fpm);
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ relptr_store(base, fpm->btree_recycle, (FreePageSpanLeader *) NULL);
+ fpm->btree_depth = 0;
+ fpm->btree_recycle_count = 0;
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->contiguous_pages = 0;
+ fpm->contiguous_pages_dirty = true;
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages = 0;
+#endif
+
+ for (f = 0; f < FPM_NUM_FREELISTS; f++)
+ relptr_store(base, fpm->freelist[f], (FreePageSpanLeader *) NULL);
+}
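
For readers who have not met relptr before: the calls above store offsets from 'base' rather than raw addresses, which is what keeps the structure usable when the segment is mapped at different addresses in different backends. A minimal standalone sketch of that idea follows. The demo_relptr type and its two helpers are hypothetical stand-ins, not the patch's relptr macros, and treating offset 0 as NULL is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a relative pointer: an offset from a base. */
typedef struct
{
	size_t		off;			/* assumption: 0 represents NULL */
} demo_relptr;

/* Record 'ptr' as an offset from 'base'. */
static void
demo_relptr_store(char *base, demo_relptr *rp, void *ptr)
{
	assert(ptr == NULL || (char *) ptr > base);
	rp->off = (ptr == NULL) ? 0 : (size_t) ((char *) ptr - base);
}

/* Convert the stored offset back into a pointer valid for this mapping. */
static void *
demo_relptr_access(char *base, demo_relptr rp)
{
	return (rp.off == 0) ? NULL : (void *) (base + rp.off);
}
```

The same stored offset resolves correctly against whichever base address the caller's mapping uses, at the cost of not being able to refer to the byte at 'base' itself under the offset-0-is-NULL assumption.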
+
+/*
+ * Allocate a run of pages of the given length from the free page manager.
+ * The return value indicates whether we were able to satisfy the request;
+ * if true, the first page of the allocation is stored in *first_page.
+ */
+bool
+FreePageManagerGet(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ bool result;
+
+ result = FreePageManagerGetInternal(fpm, npages, first_page);
+
+ /*
+ * It's a bit counterintuitive, but allocating pages can actually create
+ * opportunities for cleanup that create larger ranges. We might pull a
+ * key out of the btree that enables the item at the head of the btree
+ * recycle list to be inserted; and then if there are more items behind it
+ * one of those might cause two currently-separated ranges to merge,
+ * creating a single range of contiguous pages larger than any that
+ * existed previously. It might be worth trying to improve the cleanup
+ * algorithm to avoid such corner cases, but for now we just notice the
+ * condition and do the appropriate reporting.
+ */
+ FreePageBtreeCleanup(fpm);
+
+ /*
+ * TODO: We could take Max(fpm->contiguous_pages, result of
+ * FreePageBtreeCleanup) and give it to FreePageManagerUpdateLargest as a
+ * starting point for its search, potentially avoiding a bunch of work,
+ * since there is no way the largest contiguous run is bigger than that.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ if (result)
+ {
+ Assert(fpm->free_pages >= npages);
+ fpm->free_pages -= npages;
+ }
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+ return result;
+}
+
+#ifdef FPM_EXTRA_ASSERTS
+static void
+sum_free_pages_recurse(FreePageManager *fpm, FreePageBtree *btp, Size *sum)
+{
+ char *base = fpm_segment_base(fpm);
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ||
+ btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ ++*sum;
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ Size index;
+
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ sum_free_pages_recurse(fpm, child, sum);
+ }
+ }
+}
+
+static Size
+sum_free_pages(FreePageManager *fpm)
+{
+ FreePageSpanLeader *recycle;
+ char *base = fpm_segment_base(fpm);
+ Size sum = 0;
+ int list;
+
+ /* Count the spans by scanning the freelists. */
+ for (list = 0; list < FPM_NUM_FREELISTS; ++list)
+ {
+ if (!relptr_is_null(fpm->freelist[list]))
+ {
+ FreePageSpanLeader *candidate =
+ relptr_access(base, fpm->freelist[list]);
+
+ do
+ {
+ sum += candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ }
+
+ /* Count the pages used by the btree itself (internal and leaf). */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ sum_free_pages_recurse(fpm, root, &sum);
+ }
+
+ /* Count the recycle list. */
+ for (recycle = relptr_access(base, fpm->btree_recycle);
+ recycle != NULL;
+ recycle = relptr_access(base, recycle->next))
+ {
+ Assert(recycle->npages == 1);
+ ++sum;
+ }
+
+ return sum;
+}
+#endif
+
+/*
+ * Recompute the size of the largest run of pages that the user could
+ * successfully get, if it has been marked dirty.
+ */
+static void
+FreePageManagerUpdateLargest(FreePageManager *fpm)
+{
+ char *base;
+ Size largest;
+
+ if (!fpm->contiguous_pages_dirty)
+ return;
+
+ base = fpm_segment_base(fpm);
+ largest = 0;
+ if (!relptr_is_null(fpm->freelist[FPM_NUM_FREELISTS - 1]))
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[FPM_NUM_FREELISTS - 1]);
+ do
+ {
+ if (candidate->npages > largest)
+ largest = candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ else
+ {
+ Size f = FPM_NUM_FREELISTS - 1;
+
+ do
+ {
+ --f;
+ if (!relptr_is_null(fpm->freelist[f]))
+ {
+ largest = f + 1;
+ break;
+ }
+ } while (f > 0);
+ }
+
+ fpm->contiguous_pages = largest;
+ fpm->contiguous_pages_dirty = false;
+}
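
To spell out the freelist layout this function relies on: a span of n pages lives on list Min(n, FPM_NUM_FREELISTS) - 1, so every list except the last holds spans of exactly one size and only the last needs to be walked. The sketch below restates that with plain arrays. DEMO_NUM_FREELISTS, the counts array and last_list_max are hypothetical simplifications for illustration; the patch walks relptr-linked span leaders instead.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUM_FREELISTS 8	/* hypothetical value, for illustration only */

/* Which freelist holds a span of 'npages' pages? */
static size_t
demo_freelist_index(size_t npages)
{
	assert(npages > 0);
	return (npages < DEMO_NUM_FREELISTS ? npages : DEMO_NUM_FREELISTS) - 1;
}

/*
 * counts[f] is the number of spans on list f; last_list_max is the size of
 * the biggest span on the final list (ignored when that list is empty).
 */
static size_t
demo_largest(const size_t *counts, size_t last_list_max)
{
	size_t		f;

	/* The final list holds mixed sizes, so its maximum must be tracked. */
	if (counts[DEMO_NUM_FREELISTS - 1] > 0)
		return last_list_max;

	/* Otherwise the highest non-empty list f holds spans of f + 1 pages. */
	for (f = DEMO_NUM_FREELISTS - 1; f-- > 0;)
	{
		if (counts[f] > 0)
			return f + 1;
	}
	return 0;
}
```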
+
+/*
+ * Transfer a run of pages to the free page manager.
+ */
+void
+FreePageManagerPut(FreePageManager *fpm, Size first_page, Size npages)
+{
+ Size contiguous_pages;
+
+ Assert(npages > 0);
+
+ /* Record the new pages. */
+ contiguous_pages =
+ FreePageManagerPutInternal(fpm, first_page, npages, false);
+
+ /*
+ * If the new range we inserted into the page manager was contiguous with
+ * an existing range, it may have opened up cleanup opportunities.
+ */
+ if (contiguous_pages > npages)
+ {
+ Size cleanup_contiguous_pages;
+
+ cleanup_contiguous_pages = FreePageBtreeCleanup(fpm);
+ if (cleanup_contiguous_pages > contiguous_pages)
+ contiguous_pages = cleanup_contiguous_pages;
+ }
+
+ /*
+ * TODO: Figure out how to avoid setting this every time. It may not be as
+ * simple as it looks.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages += npages;
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+}
+
+/*
+ * Produce a debugging dump of the state of a free page manager.
+ */
+char *
+FreePageManagerDump(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ StringInfoData buf;
+ FreePageSpanLeader *recycle;
+ bool dumped_any_freelist = false;
+ Size f;
+
+ /* Initialize output buffer. */
+ initStringInfo(&buf);
+
+ /* Dump general stuff. */
+ appendStringInfo(&buf, "metadata: self %zu max contiguous pages = %zu\n",
+ fpm->self.relptr_off, fpm->contiguous_pages);
+
+ /* Dump btree. */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root;
+
+ appendStringInfo(&buf, "btree depth %u:\n", fpm->btree_depth);
+ root = relptr_access(base, fpm->btree_root);
+ FreePageManagerDumpBtree(fpm, root, NULL, 0, &buf);
+ }
+ else if (fpm->singleton_npages > 0)
+ {
+ appendStringInfo(&buf, "singleton: %zu(%zu)\n",
+ fpm->singleton_first_page, fpm->singleton_npages);
+ }
+
+ /* Dump btree recycle list. */
+ recycle = relptr_access(base, fpm->btree_recycle);
+ if (recycle != NULL)
+ {
+ appendStringInfo(&buf, "btree recycle:");
+ FreePageManagerDumpSpans(fpm, recycle, 1, &buf);
+ }
+
+ /* Dump free lists. */
+ for (f = 0; f < FPM_NUM_FREELISTS; ++f)
+ {
+ FreePageSpanLeader *span;
+
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+ if (!dumped_any_freelist)
+ {
+ appendStringInfo(&buf, "freelists:\n");
+ dumped_any_freelist = true;
+ }
+ appendStringInfo(&buf, " %zu:", f + 1);
+ span = relptr_access(base, fpm->freelist[f]);
+ FreePageManagerDumpSpans(fpm, span, f + 1, &buf);
+ }
+
+ /* And return result to caller. */
+ return buf.data;
+}
+
+
+/*
+ * The first_page value stored at index zero in any non-root page must match
+ * the first_page value stored in its parent at the index which points to that
+ * page. So when the value stored at index zero in a btree page changes, we've
+ * got to walk up the tree adjusting ancestor keys until we reach an ancestor
+ * where that key isn't index zero. This function should be called after
+ * updating the first key on the target page; it will propagate the change
+ * upward as far as needed.
+ *
+ * We assume here that the first key on the page has not changed enough to
+ * require changes in the ordering of keys on its ancestor pages. Thus,
+ * if we search the parent page for the first key greater than or equal to
+ * the first key on the current page, the downlink to this page will be either
+ * the exact index returned by the search (if the first key decreased)
+ * or one less (if the first key increased).
+ */
+static void
+FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ Size first_page;
+ FreePageBtree *parent;
+ FreePageBtree *child;
+
+ /* This might be either a leaf or an internal page. */
+ Assert(btp->hdr.nused > 0);
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ first_page = btp->u.leaf_key[0].first_page;
+ }
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ first_page = btp->u.internal_key[0].first_page;
+ }
+ child = btp;
+
+ /* Loop until we find an ancestor that does not require adjustment. */
+ for (;;)
+ {
+ Size s;
+
+ parent = relptr_access(base, child->hdr.parent);
+ if (parent == NULL)
+ break;
+ s = FreePageBtreeSearchInternal(parent, first_page);
+
+ /* Key is either at index s or index s-1; figure out which. */
+ if (s >= parent->hdr.nused)
+ {
+ Assert(s == parent->hdr.nused);
+ --s;
+ }
+ else
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ if (check != child)
+ {
+ Assert(s > 0);
+ --s;
+ }
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* Debugging double-check. */
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ Assert(s < parent->hdr.nused);
+ Assert(child == check);
+ }
+#endif
+
+ /* Update the parent key. */
+ parent->u.internal_key[s].first_page = first_page;
+
+ /*
+ * If this is the first key in the parent, go up another level; else
+ * done.
+ */
+ if (s > 0)
+ break;
+ child = parent;
+ }
+}
+
+/*
+ * Attempt to reclaim space from the free-page btree. The return value is
+ * the largest range of contiguous pages created by the cleanup operation.
+ */
+static Size
+FreePageBtreeCleanup(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ Size max_contiguous_pages = 0;
+
+ /* Attempt to shrink the depth of the btree. */
+ while (!relptr_is_null(fpm->btree_root))
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ /* If the root contains only one key, reduce depth by one. */
+ if (root->hdr.nused == 1)
+ {
+ /* Shrink depth of tree by one. */
+ Assert(fpm->btree_depth > 0);
+ --fpm->btree_depth;
+ if (root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ /* If root is a leaf, convert only entry to singleton range. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages;
+ }
+ else
+ {
+ FreePageBtree *newroot;
+
+ /* If root is an internal page, make only child the root. */
+ Assert(root->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ relptr_copy(fpm->btree_root, root->u.internal_key[0].child);
+ newroot = relptr_access(base, fpm->btree_root);
+ relptr_store(base, newroot->hdr.parent, (FreePageBtree *) NULL);
+ }
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, root));
+ }
+ else if (root->hdr.nused == 2 &&
+ root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Size end_of_first;
+ Size start_of_second;
+
+ end_of_first = root->u.leaf_key[0].first_page +
+ root->u.leaf_key[0].npages;
+ start_of_second = root->u.leaf_key[1].first_page;
+
+ if (end_of_first + 1 == start_of_second)
+ {
+ Size root_page = fpm_pointer_to_page(base, root);
+
+ if (end_of_first == root_page)
+ {
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[0].first_page);
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[1].first_page);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages +
+ root->u.leaf_key[1].npages + 1;
+ fpm->btree_depth = 0;
+ relptr_store(base, fpm->btree_root,
+ (FreePageBtree *) NULL);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ Assert(max_contiguous_pages == 0);
+ max_contiguous_pages = fpm->singleton_npages;
+ }
+ }
+
+ /* Whether it worked or not, it's time to stop. */
+ break;
+ }
+ else
+ {
+ /* Nothing more to do. Stop. */
+ break;
+ }
+ }
+
+ /*
+ * Attempt to free recycled btree pages. We skip this if releasing the
+ * recycled page would require a btree page split, because the page we're
+ * trying to recycle would be consumed by the split, which would be
+ * counterproductive.
+ *
+ * We also currently only ever attempt to recycle the first page on the
+ * list; that could be made more aggressive, but it's not clear that the
+ * complexity would be worthwhile.
+ */
+ while (fpm->btree_recycle_count > 0)
+ {
+ FreePageBtree *btp;
+ Size first_page;
+ Size contiguous_pages;
+
+ btp = FreePageBtreeGetRecycled(fpm);
+ first_page = fpm_pointer_to_page(base, btp);
+ contiguous_pages = FreePageManagerPutInternal(fpm, first_page, 1, true);
+ if (contiguous_pages == 0)
+ {
+ FreePageBtreeRecycle(fpm, first_page);
+ break;
+ }
+ else
+ {
+ if (contiguous_pages > max_contiguous_pages)
+ max_contiguous_pages = contiguous_pages;
+ }
+ }
+
+ return max_contiguous_pages;
+}
+
+/*
+ * Consider consolidating the given page with its left or right sibling,
+ * if it's fairly empty.
+ */
+static void
+FreePageBtreeConsolidate(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *np;
+ Size max;
+
+ /*
+ * We only try to consolidate pages that are less than a third full. We
+ * could be more aggressive about this, but that might risk performing
+ * consolidation only to end up splitting again shortly thereafter. Since
+ * the btree should be very small compared to the space under management,
+ * our goal isn't so much to ensure that it always occupies the absolutely
+ * smallest possible number of pages as to reclaim pages before things get
+ * too egregiously out of hand.
+ */
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ max = FPM_ITEMS_PER_LEAF_PAGE;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ max = FPM_ITEMS_PER_INTERNAL_PAGE;
+ }
+ if (btp->hdr.nused >= max / 3)
+ return;
+
+ /*
+ * If we can fit our right sibling's keys onto this page, consolidate.
+ */
+ np = FreePageBtreeFindRightSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&btp->u.leaf_key[btp->hdr.nused], &np->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ }
+ else
+ {
+ memcpy(&btp->u.internal_key[btp->hdr.nused], &np->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, btp);
+ }
+ FreePageBtreeRemovePage(fpm, np);
+ return;
+ }
+
+ /*
+ * If we can fit our keys onto our left sibling's page, consolidate. In
+ * this case, we move our keys onto the other page rather than vice
+ * versa, to avoid having to adjust ancestor keys.
+ */
+ np = FreePageBtreeFindLeftSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&np->u.leaf_key[np->hdr.nused], &btp->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ }
+ else
+ {
+ memcpy(&np->u.internal_key[np->hdr.nused], &btp->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, np);
+ }
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+}
+
+/*
+ * Find the passed page's left sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately precedes ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindLeftSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move left. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the leftmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index > 0)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index - 1].child);
+ break;
+ }
+ Assert(index == 0);
+ ++levels;
+ }
+
+ /* Descend right. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[p->hdr.nused - 1].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Find the passed page's right sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately follows ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindRightSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move right. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the rightmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index < p->hdr.nused - 1)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index + 1].child);
+ break;
+ }
+ Assert(index == p->hdr.nused - 1);
+ ++levels;
+ }
+
+ /* Descend left. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[0].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Get the first key on a btree page.
+ */
+static Size
+FreePageBtreeFirstKey(FreePageBtree *btp)
+{
+ Assert(btp->hdr.nused > 0);
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ return btp->u.leaf_key[0].first_page;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ return btp->u.internal_key[0].first_page;
+ }
+}
+
+/*
+ * Get a page from the btree recycle list for use as a btree page.
+ */
+static FreePageBtree *
+FreePageBtreeGetRecycled(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *newhead;
+
+ Assert(victim != NULL);
+ newhead = relptr_access(base, victim->next);
+ if (newhead != NULL)
+ relptr_copy(newhead->prev, victim->prev);
+ relptr_store(base, fpm->btree_recycle, newhead);
+ Assert(fpm_pointer_is_page_aligned(base, victim));
+ fpm->btree_recycle_count--;
+ return (FreePageBtree *) victim;
+}
+
+/*
+ * Insert an item into an internal page.
+ */
+static void
+FreePageBtreeInsertInternal(char *base, FreePageBtree *btp, Size index,
+ Size first_page, FreePageBtree *child)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.internal_key[index + 1], &btp->u.internal_key[index],
+ sizeof(FreePageBtreeInternalKey) * (btp->hdr.nused - index));
+ btp->u.internal_key[index].first_page = first_page;
+ relptr_store(base, btp->u.internal_key[index].child, child);
+ ++btp->hdr.nused;
+}
+
+/*
+ * Insert an item into a leaf page.
+ */
+static void
+FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index, Size first_page,
+ Size npages)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.leaf_key[index + 1], &btp->u.leaf_key[index],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+ btp->u.leaf_key[index].first_page = first_page;
+ btp->u.leaf_key[index].npages = npages;
+ ++btp->hdr.nused;
+}
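
The insert is the usual "open a gap with memmove, then fill it" pattern on a sorted array. A standalone sketch, using a hypothetical demo_leaf_key and an explicit capacity argument in place of the page layout; note that the sketch asserts strictly-less-than on the key count, which is a choice made here and not something the patch states:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a leaf key: a span's first page and length. */
typedef struct
{
	size_t		first_page;
	size_t		npages;
} demo_leaf_key;

/* Insert a key at 'index', shifting later keys one slot to the right. */
static void
demo_insert_leaf(demo_leaf_key *keys, size_t *nused, size_t capacity,
				 size_t index, size_t first_page, size_t npages)
{
	assert(*nused < capacity);	/* sketch's assumption: room for one more */
	assert(index <= *nused);

	/* Regions overlap, so memmove (not memcpy) is required here. */
	memmove(&keys[index + 1], &keys[index],
			sizeof(demo_leaf_key) * (*nused - index));
	keys[index].first_page = first_page;
	keys[index].npages = npages;
	++*nused;
}
```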
+
+/*
+ * Put a page on the btree recycle list.
+ */
+static void
+FreePageBtreeRecycle(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *head = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = 1;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->btree_recycle, span);
+ fpm->btree_recycle_count++;
+}
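
Taken together with FreePageBtreeGetRecycled, the recycle list behaves as a LIFO stack kept as a doubly linked list with a separate count. The sketch below shows the same push and pop using ordinary pointers; demo_span and demo_recycle_list are hypothetical types, and the patch uses relptrs and span leaders stored in the freed pages themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; the patch stores this in the freed page itself. */
typedef struct demo_span
{
	struct demo_span *prev;
	struct demo_span *next;
} demo_span;

typedef struct
{
	demo_span  *head;
	size_t		count;
} demo_recycle_list;

/* Push a span onto the head of the list. */
static void
demo_recycle_push(demo_recycle_list *list, demo_span *span)
{
	span->next = list->head;
	span->prev = NULL;
	if (list->head != NULL)
		list->head->prev = span;
	list->head = span;
	list->count++;
}

/* Pop the most recently pushed span; the list must not be empty. */
static demo_span *
demo_recycle_pop(demo_recycle_list *list)
{
	demo_span  *victim = list->head;

	assert(victim != NULL);
	if (victim->next != NULL)
		victim->next->prev = NULL;
	list->head = victim->next;
	list->count--;
	return victim;
}
```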
+
+/*
+ * Remove an item from the btree at the given position on the given page.
+ */
+static void
+FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp, Size index)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(index < btp->hdr.nused);
+
+ /* When last item is removed, extirpate entire page from btree. */
+ if (btp->hdr.nused == 1)
+ {
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+
+ /* Physically remove the key from the page. */
+ --btp->hdr.nused;
+ if (index < btp->hdr.nused)
+ memmove(&btp->u.leaf_key[index], &btp->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+
+ /* If we just removed the first key, adjust ancestor keys. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, btp);
+
+ /* Consider whether to consolidate this page with a sibling. */
+ FreePageBtreeConsolidate(fpm, btp);
+}
+
+/*
+ * Remove a page from the btree. Caller is responsible for having relocated
+ * any keys from this page that are still wanted. The page is placed on the
+ * recycled list.
+ */
+static void
+FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *parent;
+ Size index;
+ Size first_page;
+
+ for (;;)
+ {
+ /* Find parent page. */
+ parent = relptr_access(base, btp->hdr.parent);
+ if (parent == NULL)
+ {
+ /* We are removing the root page. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->btree_depth = 0;
+ Assert(fpm->singleton_first_page == 0);
+ Assert(fpm->singleton_npages == 0);
+ return;
+ }
+
+ /*
+ * If the parent contains only one item, we need to remove it as well.
+ */
+ if (parent->hdr.nused > 1)
+ break;
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+ btp = parent;
+ }
+
+ /* Find and remove the downlink. */
+ first_page = FreePageBtreeFirstKey(btp);
+ if (parent->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ index = FreePageBtreeSearchLeaf(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.leaf_key[index],
+ &parent->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ else
+ {
+ index = FreePageBtreeSearchInternal(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.internal_key[index],
+ &parent->u.internal_key[index + 1],
+ sizeof(FreePageBtreeInternalKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ parent->hdr.nused--;
+ Assert(parent->hdr.nused > 0);
+
+ /* Recycle the page. */
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+
+ /* Adjust ancestor keys if needed. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+
+ /* Consider whether to consolidate the parent with a sibling. */
+ FreePageBtreeConsolidate(fpm, parent);
+}
+
+/*
+ * Search the btree for an entry for the given first page and initialize
+ * *result with the results of the search. result->page and result->index
+ * indicate either the position of an exact match or the position at which
+ * the new key should be inserted. result->found is true for an exact match,
+ * otherwise false. result->split_pages will contain the number of additional
+ * btree pages that will be needed when performing a split to insert a key.
+ * Except as described above, the contents of fields in the result object are
+ * undefined on return.
+ */
+static void
+FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *btp = relptr_access(base, fpm->btree_root);
+ Size index;
+
+ result->split_pages = 1;
+
+ /* If the btree is empty, there's nothing to find. */
+ if (btp == NULL)
+ {
+ result->page = NULL;
+ result->found = false;
+ return;
+ }
+
+ /* Descend until we hit a leaf. */
+ while (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ FreePageBtree *child;
+ bool found_exact;
+
+ index = FreePageBtreeSearchInternal(btp, first_page);
+ found_exact = index < btp->hdr.nused &&
+ btp->u.internal_key[index].first_page == first_page;
+
+ /*
+ * If we found an exact match we descend directly. Otherwise, we
+ * descend into the child to the left if possible so that we can find
+ * the insertion point at that child's high end.
+ */
+ if (!found_exact && index > 0)
+ --index;
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_INTERNAL_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Descend to appropriate child page. */
+ Assert(index < btp->hdr.nused);
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ Assert(relptr_access(base, child->hdr.parent) == btp);
+ btp = child;
+ }
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_LEAF_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_LEAF_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Search leaf page. */
+ index = FreePageBtreeSearchLeaf(btp, first_page);
+
+ /* Assemble results. */
+ result->page = btp;
+ result->index = index;
+ result->found = index < btp->hdr.nused &&
+ first_page == btp->u.leaf_key[index].first_page;
+}
+
+/*
+ * Search an internal page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if none.
+ */
+static Size
+FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_INTERNAL_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.internal_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Search a leaf page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if none.
+ */
+static Size
+FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_LEAF_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.leaf_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
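
Both search functions are the same lower-bound binary search, differing only in which key array they read. A standalone sketch over a plain array of page numbers may make the return convention easier to check; demo_search is a hypothetical name and the flat size_t array is a simplification of the key structs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the first key >= first_page, or nused if every key
 * is smaller.  'keys' must be sorted in ascending order.
 */
static size_t
demo_search(const size_t *keys, size_t nused, size_t first_page)
{
	size_t		low = 0;
	size_t		high = nused;

	while (low < high)
	{
		size_t		mid = (low + high) / 2;
		size_t		val = keys[mid];

		if (first_page == val)
			return mid;			/* exact match */
		else if (first_page < val)
			high = mid;			/* answer is at mid or to its left */
		else
			low = mid + 1;		/* answer is strictly to the right */
	}

	return low;
}
```

With keys {10, 20, 30}, a search for 20 returns 1, a search for 15 also returns 1 (the insertion point), and a search for 35 returns 3, which is the key count and not a valid index.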
+
+/*
+ * Allocate a new btree page and move half the keys from the provided page
+ * to the new page. Caller is responsible for making sure that there's a
+ * page available from fpm->btree_recycle. Returns a pointer to the new page,
+ * to which caller must add a downlink.
+ */
+static FreePageBtree *
+FreePageBtreeSplitPage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ FreePageBtree *newsibling;
+
+ newsibling = FreePageBtreeGetRecycled(fpm);
+ newsibling->hdr.magic = btp->hdr.magic;
+ newsibling->hdr.nused = btp->hdr.nused / 2;
+ relptr_copy(newsibling->hdr.parent, btp->hdr.parent);
+ btp->hdr.nused -= newsibling->hdr.nused;
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ memcpy(&newsibling->u.leaf_key,
+ &btp->u.leaf_key[btp->hdr.nused],
+ sizeof(FreePageBtreeLeafKey) * newsibling->hdr.nused);
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ memcpy(&newsibling->u.internal_key,
+ &btp->u.internal_key[btp->hdr.nused],
+ sizeof(FreePageBtreeInternalKey) * newsibling->hdr.nused);
+ FreePageBtreeUpdateParentPointers(fpm_segment_base(fpm), newsibling);
+ }
+
+ return newsibling;
+}
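
The split arithmetic is worth a second look for odd key counts: the new sibling receives nused / 2 keys (rounded down) from the high end, and the original page keeps the rest. A standalone sketch with a flat array of keys; demo_split is a hypothetical helper, and it omits the parent-pointer fixups that the internal-page case needs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Move the upper half of 'keys' into 'sibling'.  On return *nused is the
 * number of keys left on the original page; the return value is the number
 * of keys now on the sibling.  'sibling' must have room for *nused / 2 keys.
 */
static size_t
demo_split(size_t *keys, size_t *nused, size_t *sibling)
{
	size_t		moved = *nused / 2;

	*nused -= moved;

	/* Source and destination are distinct pages, so memcpy is fine. */
	memcpy(sibling, &keys[*nused], sizeof(size_t) * moved);
	return moved;
}
```

For five keys {1, 2, 3, 4, 5}, the original page keeps {1, 2, 3} and the sibling receives {4, 5}.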
+
+/*
+ * When internal pages are split or merged, the parent pointers of their
+ * children must be updated.
+ */
+static void
+FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp)
+{
+ Size i;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ for (i = 0; i < btp->hdr.nused; ++i)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[i].child);
+ relptr_store(base, child->hdr.parent, btp);
+ }
+}
+
+/*
+ * Debugging dump of btree data.
+ */
+static void
+FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+ Size pageno = fpm_pointer_to_page(base, btp);
+ Size index;
+ FreePageBtree *check_parent;
+
+ check_stack_depth();
+ check_parent = relptr_access(base, btp->hdr.parent);
+ appendStringInfo(buf, " %zu@%d %c", pageno, level,
+ btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ? 'i' : 'l');
+ if (parent != check_parent)
+ appendStringInfo(buf, " [actual parent %zu, expected %zu]",
+ fpm_pointer_to_page(base, check_parent),
+ fpm_pointer_to_page(base, parent));
+ appendStringInfoChar(buf, ':');
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ appendStringInfo(buf, " %zu->%zu",
+ btp->u.internal_key[index].first_page,
+ btp->u.internal_key[index].child.relptr_off / FPM_PAGE_SIZE);
+ else
+ appendStringInfo(buf, " %zu(%zu)",
+ btp->u.leaf_key[index].first_page,
+ btp->u.leaf_key[index].npages);
+ }
+ appendStringInfo(buf, "\n");
+
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ FreePageManagerDumpBtree(fpm, child, btp, level + 1, buf);
+ }
+ }
+}
+
+/*
+ * Debugging dump of free-span data.
+ */
+static void
+FreePageManagerDumpSpans(FreePageManager *fpm, FreePageSpanLeader *span,
+ Size expected_pages, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+
+ while (span != NULL)
+ {
+ if (span->npages != expected_pages)
+ appendStringInfo(buf, " %zu(%zu)", fpm_pointer_to_page(base, span),
+ span->npages);
+ else
+ appendStringInfo(buf, " %zu", fpm_pointer_to_page(base, span));
+ span = relptr_access(base, span->next);
+ }
+
+ appendStringInfo(buf, "\n");
+}
+
+/*
+ * This function allocates a run of pages of the given length from the free
+ * page manager.
+ */
+static bool
+FreePageManagerGetInternal(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = NULL;
+ FreePageSpanLeader *prev;
+ FreePageSpanLeader *next;
+ FreePageBtreeSearchResult result;
+ Size victim_page = 0; /* placate compiler */
+ Size f;
+
+ /*
+ * Search for a free span.
+ *
+ * Right now, we use a simple best-fit policy here, but it's possible for
+ * this to result in memory fragmentation if we're repeatedly asked to
+ * allocate chunks just a little smaller than what we have available.
+ * Hopefully, this is unlikely, because we expect most requests to be
+ * single pages or superblock-sized chunks -- but no policy can be optimal
+ * under all circumstances unless it has knowledge of future allocation
+ * patterns.
+ */
+ for (f = Min(npages, FPM_NUM_FREELISTS) - 1; f < FPM_NUM_FREELISTS; ++f)
+ {
+ /* Skip empty freelists. */
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+
+ /*
+ * All of the freelists except the last one contain only items of a
+ * single size, so we just take the first one. But the final free
+ * list contains everything too big for any of the other lists, so we
+ * need to search the list.
+ */
+ if (f < FPM_NUM_FREELISTS - 1)
+ victim = relptr_access(base, fpm->freelist[f]);
+ else
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[f]);
+ do
+ {
+ if (candidate->npages >= npages && (victim == NULL ||
+ victim->npages > candidate->npages))
+ {
+ victim = candidate;
+ if (victim->npages == npages)
+ break;
+ }
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ break;
+ }
+
+ /* If we didn't find an allocatable span, return failure. */
+ if (victim == NULL)
+ return false;
+
+ /* Remove span from free list. */
+ Assert(victim->magic == FREE_PAGE_SPAN_LEADER_MAGIC);
+ prev = relptr_access(base, victim->prev);
+ next = relptr_access(base, victim->next);
+ if (prev != NULL)
+ relptr_copy(prev->next, victim->next);
+ else
+ relptr_copy(fpm->freelist[f], victim->next);
+ if (next != NULL)
+ relptr_copy(next->prev, victim->prev);
+ victim_page = fpm_pointer_to_page(base, victim);
+
+ /*
+ * If we haven't initialized the btree yet, the victim must be the single
+ * span stored within the FreePageManager itself. Otherwise, we need to
+ * update the btree.
+ */
+ if (relptr_is_null(fpm->btree_root))
+ {
+ Assert(victim_page == fpm->singleton_first_page);
+ Assert(victim->npages == fpm->singleton_npages);
+ Assert(victim->npages >= npages);
+ fpm->singleton_first_page += npages;
+ fpm->singleton_npages -= npages;
+ if (fpm->singleton_npages > 0)
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ }
+ else
+ {
+ /*
+ * If the span we found is exactly the right size, remove it from the
+ * btree completely. Otherwise, adjust the btree entry to reflect the
+ * still-unallocated portion of the span, and put that portion on the
+ * appropriate free list.
+ */
+ FreePageBtreeSearch(fpm, victim_page, &result);
+ Assert(result.found);
+ if (victim->npages == npages)
+ FreePageBtreeRemove(fpm, result.page, result.index);
+ else
+ {
+ FreePageBtreeLeafKey *key;
+
+ /* Adjust btree to reflect remaining pages. */
+ Assert(victim->npages > npages);
+ key = &result.page->u.leaf_key[result.index];
+ Assert(key->npages == victim->npages);
+ key->first_page += npages;
+ key->npages -= npages;
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put the unallocated pages back on the appropriate free list. */
+ FreePagePushSpanLeader(fpm, victim_page + npages,
+ victim->npages - npages);
+ }
+ }
+
+ /* Return results to caller. */
+ *first_page = fpm_pointer_to_page(base, victim);
+ return true;
+}
+
+/*
+ * Put a range of pages into the btree and freelists, consolidating it with
+ * existing free spans just before and/or after it. If 'soft' is true,
+ * only perform the insertion if it can be done without allocating new btree
+ * pages; if false, do it always. Returns 0 if the soft flag caused the
+ * insertion to be skipped, or otherwise the size of the contiguous span
+ * created by the insertion. This may be larger than npages if we're able
+ * to consolidate with an adjacent range. fpm->contiguous_pages_dirty is set
+ * if the btree may allocate pages for internal purposes, which might
+ * invalidate the current largest run, requiring it to be recomputed.
+ */
+static Size
+FreePageManagerPutInternal(FreePageManager *fpm, Size first_page, Size npages,
+ bool soft)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtreeSearchResult result;
+ FreePageBtreeLeafKey *prevkey = NULL;
+ FreePageBtreeLeafKey *nextkey = NULL;
+ FreePageBtree *np;
+ Size nindex;
+
+ Assert(npages > 0);
+
+ /* We can store a single free span without initializing the btree. */
+ if (fpm->btree_depth == 0)
+ {
+ if (fpm->singleton_npages == 0)
+ {
+ /* Don't have a span yet; store this one. */
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return fpm->singleton_npages;
+ }
+ else if (fpm->singleton_first_page + fpm->singleton_npages ==
+ first_page)
+ {
+ /* New span immediately follows sole existing span. */
+ fpm->singleton_npages += npages;
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else if (first_page + npages == fpm->singleton_first_page)
+ {
+ /* New span immediately precedes sole existing span. */
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages += npages;
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else
+ {
+ /* Not contiguous; we need to initialize the btree. */
+ Size root_page;
+ FreePageBtree *root;
+
+ if (!relptr_is_null(fpm->btree_recycle))
+ root = FreePageBtreeGetRecycled(fpm);
+ else if (FreePageManagerGetInternal(fpm, 1, &root_page))
+ root = (FreePageBtree *) fpm_page_to_pointer(base, root_page);
+ else
+ {
+ /* We'd better be able to get a page from the existing range. */
+ elog(FATAL, "free page manager btree is corrupt");
+ }
+
+ /* Create the btree and move the preexisting range into it. */
+ root->hdr.magic = FREE_PAGE_LEAF_MAGIC;
+ root->hdr.nused = 1;
+ relptr_store(base, root->hdr.parent, (FreePageBtree *) NULL);
+ root->u.leaf_key[0].first_page = fpm->singleton_first_page;
+ root->u.leaf_key[0].npages = fpm->singleton_npages;
+ relptr_store(base, fpm->btree_root, root);
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->btree_depth = 1;
+
+ /*
+ * Corner case: it may be that the btree root took the very last
+ * free page. In that case, the sole btree entry covers a zero
+ * page run, which is invalid. Overwrite it with the entry we're
+ * trying to insert and get out.
+ */
+ if (root->u.leaf_key[0].npages == 0)
+ {
+ root->u.leaf_key[0].first_page = first_page;
+ root->u.leaf_key[0].npages = npages;
+ return npages;
+ }
+
+ /* Fall through to insert the new key. */
+ }
+ }
+
+ /* Search the btree. */
+ FreePageBtreeSearch(fpm, first_page, &result);
+ Assert(!result.found);
+ if (result.index > 0)
+ prevkey = &result.page->u.leaf_key[result.index - 1];
+ if (result.index < result.page->hdr.nused)
+ {
+ np = result.page;
+ nindex = result.index;
+ nextkey = &result.page->u.leaf_key[result.index];
+ }
+ else
+ {
+ np = FreePageBtreeFindRightSibling(base, result.page);
+ nindex = 0;
+ if (np != NULL)
+ nextkey = &np->u.leaf_key[0];
+ }
+
+ /* Consolidate with the previous entry if possible. */
+ if (prevkey != NULL && prevkey->first_page + prevkey->npages >= first_page)
+ {
+ bool remove_next = false;
+ Size result;
+
+ Assert(prevkey->first_page + prevkey->npages == first_page);
+ prevkey->npages = (first_page - prevkey->first_page) + npages;
+
+ /* Check whether we can *also* consolidate with the following entry. */
+ if (nextkey != NULL &&
+ prevkey->first_page + prevkey->npages >= nextkey->first_page)
+ {
+ Assert(prevkey->first_page + prevkey->npages ==
+ nextkey->first_page);
+ prevkey->npages = (nextkey->first_page - prevkey->first_page)
+ + nextkey->npages;
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ remove_next = true;
+ }
+
+ /* Put the span on the correct freelist and save size. */
+ FreePagePopSpanLeader(fpm, prevkey->first_page);
+ FreePagePushSpanLeader(fpm, prevkey->first_page, prevkey->npages);
+ result = prevkey->npages;
+
+ /*
+ * If we consolidated with both the preceding and following entries,
+ * we must remove the following entry. We do this last, because
+ * removing an element from the btree may invalidate pointers we hold
+ * into the current data structure.
+ *
+ * NB: The btree is technically in an invalid state at this point
+ * because we've already updated prevkey to cover the same key space
+ * as nextkey. FreePageBtreeRemove() shouldn't notice that, though.
+ */
+ if (remove_next)
+ FreePageBtreeRemove(fpm, np, nindex);
+
+ return result;
+ }
+
+ /* Consolidate with the next entry if possible. */
+ if (nextkey != NULL && first_page + npages >= nextkey->first_page)
+ {
+ Size newpages;
+
+ /* Compute new size for span. */
+ Assert(first_page + npages == nextkey->first_page);
+ newpages = (nextkey->first_page - first_page) + nextkey->npages;
+
+ /* Put span on correct free list. */
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ FreePagePushSpanLeader(fpm, first_page, newpages);
+
+ /* Update key in place. */
+ nextkey->first_page = first_page;
+ nextkey->npages = newpages;
+
+ /* If reducing first key on page, ancestors might need adjustment. */
+ if (nindex == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, np);
+
+ return nextkey->npages;
+ }
+
+ /* Split leaf page and as many of its ancestors as necessary. */
+ if (result.split_pages > 0)
+ {
+ /*
+ * NB: We could consider various coping strategies here to avoid a
+ * split; most obviously, if np != result.page, we could target that
+ * page instead. More complicated shuffling strategies could be
+ * possible as well; basically, unless every single leaf page is 100%
+ * full, we can jam this key in there if we try hard enough. It's
+ * unlikely that trying that hard is worthwhile, but it's possible we
+ * might need to make more than no effort. For now, we just do the
+ * easy thing, which is nothing.
+ */
+
+ /* If this is a soft insert, it's time to give up. */
+ if (soft)
+ return 0;
+
+ /*
+ * Past this point we might allocate btree pages, which could
+ * potentially shorten any existing run which might happen to be the
+ * current longest. So fpm->contiguous_pages needs to be recomputed.
+ */
+ fpm->contiguous_pages_dirty = true;
+
+ /* Check whether we need to allocate more btree pages to split. */
+ if (result.split_pages > fpm->btree_recycle_count)
+ {
+ Size pages_needed;
+ Size recycle_page;
+ Size i;
+
+ /*
+ * Allocate the required number of pages and split each one in
+ * turn. This should never fail, because if we've got enough
+ * spans of free pages kicking around that we need additional
+ * storage space just to remember them all, then we should
+ * certainly have enough to expand the btree, which should only
+ * ever use a tiny number of pages compared to the number under
+ * management. If it does, something's badly screwed up.
+ */
+ pages_needed = result.split_pages - fpm->btree_recycle_count;
+ for (i = 0; i < pages_needed; ++i)
+ {
+ if (!FreePageManagerGetInternal(fpm, 1, &recycle_page))
+ elog(FATAL, "free page manager btree is corrupt");
+ FreePageBtreeRecycle(fpm, recycle_page);
+ }
+
+ /*
+ * The act of allocating pages to recycle may have invalidated the
+ * results of our previous btree search, so repeat it. (We could
+ * recheck whether any of our split-avoidance strategies that were
+ * not viable before now are, but it hardly seems worthwhile, so
+ * we don't bother. Consolidation can't be possible now if it
+ * wasn't previously.)
+ */
+ FreePageBtreeSearch(fpm, first_page, &result);
+
+ /*
+ * The act of allocating pages for use in constructing our btree
+ * should never cause any page to become more full, so the new
+ * split depth should be no greater than the old one, and perhaps
+ * less if we fortuitously allocated a chunk that freed up a slot
+ * on the page we need to update.
+ */
+ Assert(result.split_pages <= fpm->btree_recycle_count);
+ }
+
+ /* If we still need to perform a split, do it. */
+ if (result.split_pages > 0)
+ {
+ FreePageBtree *split_target = result.page;
+ FreePageBtree *child = NULL;
+ Size key = first_page;
+
+ for (;;)
+ {
+ FreePageBtree *newsibling;
+ FreePageBtree *parent;
+
+ /* Identify parent page, which must receive downlink. */
+ parent = relptr_access(base, split_target->hdr.parent);
+
+ /* Split the page - downlink not added yet. */
+ newsibling = FreePageBtreeSplitPage(fpm, split_target);
+
+ /*
+ * At this point in the loop, we're always carrying a pending
+ * insertion. On the first pass, it's the actual key we're
+ * trying to insert; on subsequent passes, it's the downlink
+ * that needs to be added as a result of the split performed
+ * during the previous loop iteration. Since we've just split
+ * the page, there's definitely room on one of the two
+ * resulting pages.
+ */
+ if (child == NULL)
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into = key < newsibling->u.leaf_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchLeaf(insert_into, key);
+ FreePageBtreeInsertLeaf(insert_into, index, key, npages);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+ else
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into =
+ key < newsibling->u.internal_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchInternal(insert_into, key);
+ FreePageBtreeInsertInternal(base, insert_into, index,
+ key, child);
+ relptr_store(base, child->hdr.parent, insert_into);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+
+ /* If the page we just split has no parent, split the root. */
+ if (parent == NULL)
+ {
+ FreePageBtree *newroot;
+
+ newroot = FreePageBtreeGetRecycled(fpm);
+ newroot->hdr.magic = FREE_PAGE_INTERNAL_MAGIC;
+ newroot->hdr.nused = 2;
+ relptr_store(base, newroot->hdr.parent,
+ (FreePageBtree *) NULL);
+ newroot->u.internal_key[0].first_page =
+ FreePageBtreeFirstKey(split_target);
+ relptr_store(base, newroot->u.internal_key[0].child,
+ split_target);
+ relptr_store(base, split_target->hdr.parent, newroot);
+ newroot->u.internal_key[1].first_page =
+ FreePageBtreeFirstKey(newsibling);
+ relptr_store(base, newroot->u.internal_key[1].child,
+ newsibling);
+ relptr_store(base, newsibling->hdr.parent, newroot);
+ relptr_store(base, fpm->btree_root, newroot);
+ fpm->btree_depth++;
+
+ break;
+ }
+
+ /* If the parent page isn't full, insert the downlink. */
+ key = newsibling->u.internal_key[0].first_page;
+ if (parent->hdr.nused < FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Size index;
+
+ index = FreePageBtreeSearchInternal(parent, key);
+ FreePageBtreeInsertInternal(base, parent, index,
+ key, newsibling);
+ relptr_store(base, newsibling->hdr.parent, parent);
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+ break;
+ }
+
+ /* The parent also needs to be split, so loop around. */
+ child = newsibling;
+ split_target = parent;
+ }
+
+ /*
+ * The loop above did the insert, so just need to update the free
+ * list, and we're done.
+ */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+ }
+ }
+
+ /* Physically add the key to the page. */
+ Assert(result.page->hdr.nused < FPM_ITEMS_PER_LEAF_PAGE);
+ FreePageBtreeInsertLeaf(result.page, result.index, first_page, npages);
+
+ /* If new first key on page, ancestors might need adjustment. */
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put it on the free list. */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+}
+
+/*
+ * Remove a FreePageSpanLeader from the linked-list that contains it, either
+ * because we're changing the size of the span, or because we're allocating it.
+ */
+static void
+FreePagePopSpanLeader(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *span;
+ FreePageSpanLeader *next;
+ FreePageSpanLeader *prev;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+
+ next = relptr_access(base, span->next);
+ prev = relptr_access(base, span->prev);
+ if (next != NULL)
+ relptr_copy(next->prev, span->prev);
+ if (prev != NULL)
+ relptr_copy(prev->next, span->next);
+ else
+ {
+ Size f = Min(span->npages, FPM_NUM_FREELISTS) - 1;
+
+ Assert(fpm->freelist[f].relptr_off == pageno * FPM_PAGE_SIZE);
+ relptr_copy(fpm->freelist[f], span->next);
+ }
+}
+
+/*
+ * Initialize a new FreePageSpanLeader and put it on the appropriate free list.
+ */
+static void
+FreePagePushSpanLeader(FreePageManager *fpm, Size first_page, Size npages)
+{
+ char *base = fpm_segment_base(fpm);
+ Size f = Min(npages, FPM_NUM_FREELISTS) - 1;
+ FreePageSpanLeader *head = relptr_access(base, fpm->freelist[f]);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, first_page);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = npages;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->freelist[f], span);
+}
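
As a reading aid for FreePageManagerGetInternal and the span-leader helpers above, here is a minimal standalone sketch of the freelist policy: one list per exact span size, plus a final catch-all list that is searched for the best fit. It uses ordinary pointers rather than relptrs, and the names Span, freelist_index and best_fit are hypothetical, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_FREELISTS 129		/* mirrors FPM_NUM_FREELISTS in the patch */

/* Hypothetical stand-in for a free span: its length and a list link. */
typedef struct Span
{
	size_t		npages;
	struct Span *next;
} Span;

/*
 * Freelist slot for a span of npages pages.  Lists 0..127 hold spans of
 * exactly 1..128 pages; list 128 holds everything larger.
 */
static size_t
freelist_index(size_t npages)
{
	size_t		capped;

	assert(npages > 0);
	capped = npages < NUM_FREELISTS ? npages : NUM_FREELISTS;
	return capped - 1;
}

/*
 * Best fit over the catch-all list: the smallest span with at least
 * npages pages, or NULL if no span is large enough.
 */
static Span *
best_fit(Span *head, size_t npages)
{
	Span	   *victim = NULL;
	Span	   *candidate;

	for (candidate = head; candidate != NULL; candidate = candidate->next)
	{
		if (candidate->npages >= npages &&
			(victim == NULL || victim->npages > candidate->npages))
		{
			victim = candidate;
			if (victim->npages == npages)
				break;			/* exact match, cannot do better */
		}
	}
	return victim;
}
```

The exact-size lists never need this search, which is why the patch can simply take the head of any list other than the last one.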
diff --git a/src/include/storage/dsa.h b/src/include/storage/dsa.h
new file mode 100644
index 0000000..1d18f16
--- /dev/null
+++ b/src/include/storage/dsa.h
@@ -0,0 +1,100 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.h
+ * Dynamic shared memory areas.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/storage/dsa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSA_H
+#define DSA_H
+
+#include "postgres.h"
+
+#include "port/atomics.h"
+#include "storage/dsm.h"
+
+/* The opaque type used for an area. */
+struct dsa_area;
+typedef struct dsa_area dsa_area;
+
+/*
+ * If this system doesn't support atomic operations on 64 bit values then
+ * we fall back to 32 bit dsa_pointer. For testing purposes,
+ * USE_SMALL_DSA_POINTER can be defined to force the use of 32 bit
+ * dsa_pointer even on systems that support 64 bit atomics.
+ */
+#ifndef PG_HAVE_ATOMIC_U64_SUPPORT
+#define SIZEOF_DSA_POINTER 4
+#else
+#ifdef USE_SMALL_DSA_POINTER
+#define SIZEOF_DSA_POINTER 4
+#else
+#define SIZEOF_DSA_POINTER 8
+#endif
+#endif
+
+/*
+ * The type of 'relative pointers' to memory allocated by a dynamic shared
+ * area. dsa_pointer values can be shared with other processes, but must be
+ * converted to backend-local pointers before they can be dereferenced. See
+ * dsa_get_address. Also, an atomic version and appropriately sized atomic
+ * operations.
+ */
+#if SIZEOF_DSA_POINTER == 4
+typedef uint32 dsa_pointer;
+typedef pg_atomic_uint32 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u32
+#define dsa_pointer_atomic_read pg_atomic_read_u32
+#define dsa_pointer_atomic_write pg_atomic_write_u32
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u32
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u32
+#else
+typedef uint64 dsa_pointer;
+typedef pg_atomic_uint64 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u64
+#define dsa_pointer_atomic_read pg_atomic_read_u64
+#define dsa_pointer_atomic_write pg_atomic_write_u64
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u64
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u64
+#endif
+
+/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
+#define InvalidDsaPointer ((dsa_pointer) 0)
+
+/* Check if a dsa_pointer value is valid. */
+#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
+
+/*
+ * The type used for dsa_area handles. dsa_handle values can be shared with
+ * other processes, so that they can attach to them. This provides a way to
+ * share allocated storage with other processes.
+ *
+ * The handle for a dsa_area is currently implemented as the dsm_handle
+ * for the first DSM segment backing this dynamic storage area, but client
+ * code shouldn't assume that is true.
+ */
+typedef dsm_handle dsa_handle;
+
+extern void dsa_startup(void);
+
+extern dsa_area *dsa_create_dynamic(int tranche_id, const char *tranche_name);
+extern dsa_area *dsa_attach_dynamic(dsa_handle handle);
+extern void dsa_pin_mapping(dsa_area *area);
+extern void dsa_detach(dsa_area *area);
+extern void dsa_pin(dsa_area *area);
+extern void dsa_unpin(dsa_area *area);
+extern void dsa_set_size_limit(dsa_area *area, Size limit);
+extern dsa_handle dsa_get_handle(dsa_area *area);
+extern dsa_pointer dsa_allocate(dsa_area *area, Size size);
+extern void dsa_free(dsa_area *area, dsa_pointer dp);
+extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern void dsa_trim(dsa_area *area);
+extern void dsa_dump(dsa_area *area);
+
+#endif /* DSA_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 8be7c9a..bc91be6 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -19,6 +19,9 @@ typedef struct dsm_segment dsm_segment;
#define DSM_CREATE_NULL_IF_MAXSEGMENTS 0x0001
+/* A sentinel value for an invalid DSM handle. */
+#define DSM_HANDLE_INVALID 0
+
/* Startup and shutdown functions. */
struct PGShmemHeader; /* avoid including pg_shmem.h */
extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
diff --git a/src/include/utils/freepage.h b/src/include/utils/freepage.h
new file mode 100644
index 0000000..e509ca2
--- /dev/null
+++ b/src/include/utils/freepage.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.h
+ * Management of page-organized free memory.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/freepage.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef FREEPAGE_H
+#define FREEPAGE_H
+
+#include "storage/lwlock.h"
+#include "utils/relptr.h"
+
+/* Forward declarations. */
+typedef struct FreePageSpanLeader FreePageSpanLeader;
+typedef struct FreePageBtree FreePageBtree;
+typedef struct FreePageManager FreePageManager;
+
+/*
+ * PostgreSQL normally uses 8kB pages for most things, but many common
+ * architecture/operating system pairings use a 4kB page size for memory
+ * allocation, so we do that here also. We assume that a large allocation
+ * is likely to begin on a page boundary; if not, we'll discard bytes from
+ * the beginning and end of the object and use only the middle portion that
+ * is properly aligned. This works, but is not ideal, so it's best to keep
+ * this conservatively small. There don't seem to be any common architectures
+ * where the page size is less than 4kB, so this should be good enough; also,
+ * making it smaller would increase the space consumed by the address space
+ * map, which also uses this page size.
+ */
+#define FPM_PAGE_SIZE 4096
+
+/*
+ * Each freelist except for the last contains only spans of one particular
+ * size. Everything larger goes on the last one. In some sense this seems
+ * like a waste since most allocations are in a few common sizes, but it
+ * means that small allocations can simply pop the head of the relevant list
+ * without needing to worry about whether the object we find there is of
+ * precisely the correct size (because we know it must be).
+ */
+#define FPM_NUM_FREELISTS 129
+
+/* Define relative pointer types. */
+relptr_declare(FreePageBtree, RelptrFreePageBtree);
+relptr_declare(FreePageManager, RelptrFreePageManager);
+relptr_declare(FreePageSpanLeader, RelptrFreePageSpanLeader);
+
+/* Everything we need in order to manage free pages (see freepage.c) */
+struct FreePageManager
+{
+ RelptrFreePageManager self;
+ RelptrFreePageBtree btree_root;
+ RelptrFreePageSpanLeader btree_recycle;
+ unsigned btree_depth;
+ unsigned btree_recycle_count;
+ Size singleton_first_page;
+ Size singleton_npages;
+ Size contiguous_pages;
+ bool contiguous_pages_dirty;
+ RelptrFreePageSpanLeader freelist[FPM_NUM_FREELISTS];
+#ifdef FPM_EXTRA_ASSERTS
+ /* For debugging only, pages put minus pages gotten. */
+ Size free_pages;
+#endif
+};
+
+/* Macros to convert between page numbers (expressed as Size) and pointers. */
+#define fpm_page_to_pointer(base, page) \
+ (AssertVariableIsOfTypeMacro(page, Size), \
+ (base) + FPM_PAGE_SIZE * (page))
+#define fpm_pointer_to_page(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) / FPM_PAGE_SIZE)
+
+/* Macro to convert an allocation size to a number of pages. */
+#define fpm_size_to_pages(sz) \
+ (((sz) + FPM_PAGE_SIZE - 1) / FPM_PAGE_SIZE)
+
+/* Macros to check alignment of absolute and relative pointers. */
+#define fpm_pointer_is_page_aligned(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) % FPM_PAGE_SIZE == 0)
+#define fpm_relptr_is_page_aligned(base, relptr) \
+ ((relptr).relptr_off % FPM_PAGE_SIZE == 0)
+
+/* Macro to find base address of the segment containing a FreePageManager. */
+#define fpm_segment_base(fpm) \
+ (((char *) (fpm)) - (fpm)->self.relptr_off)
+
+/* Macro to access a FreePageManager's largest consecutive run of pages. */
+#define fpm_largest(fpm) \
+ ((fpm)->contiguous_pages)
+
+/* Functions to manipulate the free page map. */
+extern void FreePageManagerInitialize(FreePageManager *fpm, char *base);
+extern bool FreePageManagerGet(FreePageManager *fpm, Size npages,
+ Size *first_page);
+extern void FreePageManagerPut(FreePageManager *fpm, Size first_page,
+ Size npages);
+extern char *FreePageManagerDump(FreePageManager *fpm);
+
+#endif /* FREEPAGE_H */
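
The page arithmetic in the macros above can be tried outside PostgreSQL with a small sketch like the following. The functions are hypothetical plain-C equivalents of fpm_page_to_pointer, fpm_pointer_to_page and fpm_size_to_pages, written as functions rather than macros; SKETCH_PAGE_SIZE stands in for FPM_PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096	/* stands in for FPM_PAGE_SIZE */

/* Address of the first byte of the given page within a segment. */
static char *
page_to_pointer(char *base, size_t page)
{
	return base + SKETCH_PAGE_SIZE * page;
}

/* Page number containing ptr; any offset within the page is discarded. */
static size_t
pointer_to_page(char *base, void *ptr)
{
	assert((char *) ptr >= base);
	return (size_t) ((char *) ptr - base) / SKETCH_PAGE_SIZE;
}

/* Number of whole pages needed to hold sz bytes (rounds up). */
static size_t
size_to_pages(size_t sz)
{
	return (sz + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

Note that the pointer-to-page conversion truncates, so it only round-trips for page-aligned pointers, which is presumably why the header also provides the alignment-check macros.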
diff --git a/src/include/utils/relptr.h b/src/include/utils/relptr.h
new file mode 100644
index 0000000..a97dc96
--- /dev/null
+++ b/src/include/utils/relptr.h
@@ -0,0 +1,70 @@
+/*-------------------------------------------------------------------------
+ *
+ * relptr.h
+ * This file contains basic declarations for relative pointers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/relptr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef RELPTR_H
+#define RELPTR_H
+
+/*
+ * Relative pointers are intended to be used when storing an address that may
+ * be relative either to the base of the process's address space or some
+ * dynamic shared memory segment mapped therein.
+ *
+ * The idea here is that you declare a relative pointer as relptr(type)
+ * and then use relptr_access to dereference it and relptr_store to change
+ * it. The use of a union here is a hack, because what's stored in the
+ * relptr is always a Size, never an actual pointer. But including a pointer
+ * in the union allows us to use stupid macro tricks to provide some measure
+ * of type-safety.
+ */
+#define relptr(type) union { type *relptr_type; Size relptr_off; }
+
+#define relptr_declare(type, name) \
+ typedef union { type *relptr_type; Size relptr_off; } name;
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (__typeof__((rp).relptr_type)) ((rp).relptr_off == 0 ? NULL : \
+ (base + (rp).relptr_off)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (void *) ((rp).relptr_off == 0 ? NULL : (base + (rp).relptr_off)))
+#endif
+
+#define relptr_is_null(rp) \
+ ((rp).relptr_off == 0)
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ AssertVariableIsOfTypeMacro(val, __typeof__((rp).relptr_type)), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#endif
+
+#define relptr_copy(rp1, rp2) \
+ ((rp1).relptr_off = (rp2).relptr_off)
+
+#endif /* RELPTR_H */
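
To illustrate why relptr.h stores offsets rather than addresses, here is a minimal standalone sketch. It is not the patch's macro implementation: rel_store, rel_access, Node and relptr_demo are hypothetical names, and two malloc'd buffers stand in for one segment mapped at two different addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A list node whose link is an offset from the mapping base; 0 means NULL. */
typedef struct Node
{
	int			value;
	size_t		next_off;
} Node;

/* Convert a local pointer into an offset relative to base. */
static size_t
rel_store(char *base, void *ptr)
{
	return ptr == NULL ? 0 : (size_t) ((char *) ptr - base);
}

/* Convert an offset back into a pointer valid for this mapping. */
static void *
rel_access(char *base, size_t off)
{
	return off == 0 ? NULL : (void *) (base + off);
}

/* Build a two-node list in one buffer, copy it, and follow it in the copy. */
static int
relptr_demo(void)
{
	char	   *first = malloc(256);
	char	   *second = malloc(256);
	Node	   *a;
	Node	   *b;
	Node	   *b_remapped;
	int			result = -1;

	if (first != NULL && second != NULL)
	{
		a = (Node *) (first + 64);
		b = (Node *) (first + 128);
		a->value = 1;
		b->value = 2;
		a->next_off = rel_store(first, b);
		b->next_off = rel_store(first, NULL);

		/* Same bytes, different base address. */
		memcpy(second, first, 256);

		a = (Node *) (second + 64);
		b_remapped = rel_access(second, a->next_off);
		assert(b_remapped == (Node *) (second + 128));
		assert(rel_access(second, b_remapped->next_off) == NULL);
		result = b_remapped->value;
	}
	free(first);
	free(second);
	return result;
}
```

One consequence of using offset 0 as the null value, as the macros above do, is that an object located exactly at the base address cannot be referenced through a relative pointer.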
On Wed, Oct 5, 2016 at 3:00 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Here's a new version that does that.
While testing this patch I found an issue:
+ total_size = DSA_INITIAL_SEGMENT_SIZE;
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages =
+ (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ segment = dsm_create(total_size, 0);
+ dsm_pin_segment(segment);
The problem is that sizeof(dsa_area_control) is bigger than
DSA_INITIAL_SEGMENT_SIZE, but we allocate a segment of only
DSA_INITIAL_SEGMENT_SIZE bytes.
(gdb) p sizeof(dsa_area_control)
$8 = 67111000
(gdb) p DSA_INITIAL_SEGMENT_SIZE
$9 = 1048576
In dsa-v1 the problem did not exist because DSA_MAX_SEGMENTS was 1024,
but in dsa-v2 I think it is calculated wrongly.
(gdb) p DSA_MAX_SEGMENTS
$10 = 16777216
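
The layout arithmetic from the quoted fragment can be reproduced in isolation with a sketch like the one below. Several things here are assumptions rather than facts from the patch: the function name initial_segment_layout, an 8-byte maximum alignment, the 1096-byte FreePageManager size used in the usage note, and the explicit fits-in-segment check, which the quoted fragment does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096
/* Assumes 8-byte maximum alignment; the real MAXALIGN is platform-specific. */
#define SKETCH_MAXALIGN(x) (((x) + 7) & ~(size_t) 7)

/*
 * Returns true and sets *usable_pages if the metadata fits in the segment.
 * Returns false otherwise; without such a check, the unsigned subtraction
 * below would wrap around to a huge value.
 */
static bool
initial_segment_layout(size_t total_size, size_t control_size,
					   size_t fpm_size, size_t pointer_size,
					   size_t *usable_pages)
{
	size_t		total_pages = total_size / SKETCH_PAGE_SIZE;
	size_t		metadata_bytes;

	assert(usable_pages != NULL);
	metadata_bytes = SKETCH_MAXALIGN(control_size) +
		SKETCH_MAXALIGN(fpm_size) +
		total_pages * pointer_size;

	/* Add padding up to next page boundary, as in the quoted code. */
	if (metadata_bytes % SKETCH_PAGE_SIZE != 0)
		metadata_bytes += SKETCH_PAGE_SIZE - (metadata_bytes % SKETCH_PAGE_SIZE);

	if (metadata_bytes > total_size)
		return false;

	*usable_pages = (total_size - metadata_bytes) / SKETCH_PAGE_SIZE;
	return true;
}
```

With the gdb figures above (total_size 1048576, control_size 67111000), the control structure alone is roughly 64 times the size of the segment, so the function reports failure whatever the other sizes are. With a hypothetical 8192-byte control structure, a 1096-byte FreePageManager and 8-byte pointers, the metadata pads out to 3 pages and 253 of the 256 pages remain usable.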
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 5, 2016 at 10:04 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Oct 5, 2016 at 3:00 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Here's a new version that does that.
While testing this patch I found some issue,
+ total_size = DSA_INITIAL_SEGMENT_SIZE;
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages =
+ (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ segment = dsm_create(total_size, 0);
+ dsm_pin_segment(segment);
Actually problem is that size of dsa_area_control is bigger than
DSA_INITIAL_SEGMENT_SIZE.
but we are allocating segment of DSA_INITIAL_SEGMENT_SIZE size.
(gdb) p sizeof(dsa_area_control)
$8 = 67111000
(gdb) p DSA_INITIAL_SEGMENT_SIZE
$9 = 1048576
In dsa-v1 problem was not exist because DSA_MAX_SEGMENTS was 1024,
but in dsa-v2 I think it's calculated wrongly.
(gdb) p DSA_MAX_SEGMENTS
$10 = 16777216
Oops, right, thanks. A last minute change to that macro definition
that I stupidly tested only in USE_SMALL_DSA_POINTER mode. Here is a
fix for that, capping DSA_MAX_SEGMENTS as before.
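
For illustration only: the reported value 16777216 is 2^24, which is what results if the segment count is derived as 2^(pointer bits - offset bits) with a 64-bit dsa_pointer and 40 offset bits. That derivation is an assumption, since the macro definition is not quoted in this exchange. Under that assumption, a capped computation might be sketched as follows; max_segments and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of segments addressable when a pointer of pointer_bits reserves
 * offset_bits for the offset within a segment, limited to cap.
 */
static size_t
max_segments(unsigned pointer_bits, unsigned offset_bits, size_t cap)
{
	size_t		uncapped;

	assert(offset_bits < pointer_bits);
	assert(pointer_bits - offset_bits < sizeof(size_t) * 8);

	uncapped = (size_t) 1 << (pointer_bits - offset_bits);
	return uncapped < cap ? uncapped : cap;
}
```

With 64 pointer bits, 40 offset bits and a cap of 1024, this yields 1024, the dsa-v1 value mentioned above; without an effective cap it yields 16777216. Any per-segment array embedded in the control structure shrinks by the same factor, which would explain why the structure fits in the initial segment again.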
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
dsa-v3.patch (application/octet-stream)
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index 8a55392..e99ebd2 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -8,7 +8,7 @@ subdir = src/backend/storage/ipc
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
+OBJS = dsa.o dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
procsignal.o shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o \
sinvaladt.o standby.o
diff --git a/src/backend/storage/ipc/dsa.c b/src/backend/storage/ipc/dsa.c
new file mode 100644
index 0000000..bd1d69e
--- /dev/null
+++ b/src/backend/storage/ipc/dsa.c
@@ -0,0 +1,1951 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.c
+ * Dynamic shared memory areas.
+ *
+ * This module provides dynamic shared memory areas which are built on top of
+ * DSM segments. While dsm.c allows segments of shared memory to be created
+ * and shared between backends, it isn't designed to deal with small objects.
+ * A DSA area is a shared memory heap backed by one or more DSM segments; it
+ * hands out memory via dsa_allocate() and takes it back via dsa_free().
+ * Unlike the regular system heap, it deals in pseudo-pointers which must be
+ * converted to backend-local pointers before they are dereferenced. These
+ * pseudo-pointers can however be shared with other backends, and can be used
+ * to construct shared data structures.
+ *
+ * Each DSA area manages one or more DSM segments, adding new segments as
+ * required and detaching them when they are no longer needed. Each segment
+ * contains a number of 4KB pages, a free page manager for tracking
+ * consecutive runs of free pages, and a page map for tracking the source of
+ * objects allocated on each page. Allocation requests above 8KB are handled
+ * by choosing a segment and finding consecutive free pages in its free page
+ * manager. Allocation requests for smaller sizes are handled using pools of
+ * objects of a selection of sizes. Each pool consists of a number of 16 page
+ * (64KB) superblocks allocated in the same way as large objects. Allocation
+ * of large objects and new superblocks is serialized by a single LWLock, but
+ * allocation of small objects from pre-existing superblocks uses one LWLock
+ * per pool. Currently there is one pool, and therefore one lock, per size
+ * class. Per-core pools to increase concurrency and strategies for reducing
+ * the resulting fragmentation are areas for future research. Each superblock
+ * is managed with a 'span', which tracks the superblock's freelist. Free
+ * requests are handled by looking in the page map to find which span an
+ * address was allocated from, so that small objects can be returned to the
+ * appropriate free list, and large object pages can be returned directly to
+ * the free page map. When allocating, simple heuristics for selecting
+ * segments and superblocks try to encourage occupied memory to be
+ * concentrated, increasing the likelihood that whole superblocks can become
+ * empty and be returned to the free page manager, and whole segments can
+ * become empty and be returned to the operating system.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/ipc/dsa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/barrier.h"
+#include "storage/dsa.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "utils/freepage.h"
+#include "utils/memutils.h"
+
+/*
+ * The size of the initial DSM segment that backs a dsa_area. After creating
+ * some number of segments of this size we'll double the size, and so on.
+ * Larger segments may be created if necessary to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE (1 * 1024 * 1024)
+
+/*
+ * How many segments to create before we double the segment size. If this is
+ * low, then there is likely to be a lot of wasted space in the largest
+ * segment. If it is high, then we risk running out of segment slots (see
+ * dsm.c's limits on total number of segments), or limiting the total size
+ * an area can manage when using small pointers.
+ */
+#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 4
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
+#endif
+
+/*
+ * The maximum number of DSM segments that an area can own, determined by
+ * the number of bits remaining (but capped at 1024).
+ */
+#define DSA_MAX_SEGMENTS \
+ Min(1024, (1 << ((SIZEOF_DSA_POINTER * 8) - DSA_OFFSET_WIDTH)))
+
+/* The bitmask for extracting the offset from a dsa_pointer. */
+#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
+/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
+#define DSA_PAGES_PER_SUPERBLOCK 16
+
+/*
+ * A magic number used as a sanity check for following DSM segments belonging
+ * to a DSA area (this number will be XORed with the area handle and
+ * the segment index).
+ */
+#define DSA_SEGMENT_HEADER_MAGIC 0x0ce26608
+
+/* Build a dsa_pointer given a segment number and offset. */
+#define DSA_MAKE_POINTER(segment_number, offset) \
+ (((dsa_pointer) (segment_number) << DSA_OFFSET_WIDTH) | (offset))
+
+/* Extract the segment number from a dsa_pointer. */
+#define DSA_EXTRACT_SEGMENT_NUMBER(dp) ((dp) >> DSA_OFFSET_WIDTH)
+
+/* Extract the offset from a dsa_pointer. */
+#define DSA_EXTRACT_OFFSET(dp) ((dp) & DSA_OFFSET_BITMASK)
+
+/* The type used for segment indexes (zero based). */
+typedef Size dsa_segment_index;
+
+/* Sentinel value for dsa_segment_index indicating 'none' or 'end'. */
+#define DSA_SEGMENT_INDEX_NONE (~(dsa_segment_index)0)
+
+/*
+ * How many bins of segments do we have? The bins are used to categorize
+ * segments by their largest contiguous run of free pages.
+ */
+#define DSA_NUM_SEGMENT_BINS 16
+
+/*
+ * What is the lowest bin that holds segments that *might* have n contiguous
+ * free pages? There is no point in looking in segments in lower bins; they
+ * definitely can't service a request for n free pages.
+ */
+#define contiguous_pages_to_segment_bin(n) Min(fls(n), DSA_NUM_SEGMENT_BINS - 1)
+
+/* Macros for access to locks. */
+#define DSA_AREA_LOCK(area) (&area->control->lock)
+#define DSA_SCLASS_LOCK(area, sclass) (&area->control->pools[sclass].lock)
+
+/*
+ * The header for an individual segment. This lives at the start of each DSM
+ * segment owned by a DSA area including the first segment (where it appears
+ * as part of the dsa_area_control struct).
+ */
+typedef struct
+{
+ /* Sanity check magic value. */
+ uint32 magic;
+ /* Total number of pages in this segment (excluding metadata area). */
+ Size usable_pages;
+ /* Total size of this segment in bytes. */
+ Size size;
+
+ /*
+ * Index of the segment that precedes this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the first one.
+ */
+ dsa_segment_index prev;
+
+ /*
+ * Index of the segment that follows this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the last one.
+ */
+ dsa_segment_index next;
+ /* The index of the bin that contains this segment. */
+ Size bin;
+
+ /*
+ * A flag raised to indicate that this segment is being returned to the
+ * operating system and has been unpinned.
+ */
+ bool freed;
+} dsa_segment_header;
+
+/*
+ * Metadata for one superblock.
+ *
+ * For most blocks, span objects are stored out-of-line; that is, the span
+ * object is not stored within the block itself. But, as an exception, for a
+ * "span of spans", the span object is stored "inline". The allocation is
+ * always exactly one page, and the dsa_area_span object is located at
+ * the beginning of that page. The size class is DSA_SCLASS_BLOCK_OF_SPANS,
+ * and the remaining fields are used just as they would be in an ordinary
+ * block. We can't allocate spans out of ordinary superblocks because
+ * creating an ordinary superblock requires us to be able to allocate a span
+ * *first*. Doing it this way avoids that circularity.
+ */
+typedef struct
+{
+ dsa_pointer pool; /* Containing pool. */
+ dsa_pointer prevspan; /* Previous span. */
+ dsa_pointer nextspan; /* Next span. */
+ dsa_pointer start; /* Starting address. */
+ Size npages; /* Length of span in pages. */
+ uint16 size_class; /* Size class. */
+ uint16 ninitialized; /* Maximum number of objects ever allocated. */
+ uint16 nallocatable; /* Number of objects currently allocatable. */
+ uint16 firstfree; /* First object on free list. */
+ uint16 nmax; /* Maximum number of objects ever possible. */
+ uint16 fclass; /* Current fullness class. */
+} dsa_area_span;
+
+/*
+ * Given a pointer to an object in a span, access the index of the next free
+ * object in the same span (ie in the span's freelist) as an L-value.
+ */
+#define NextFreeObjectIndex(object) (* (uint16 *) (object))
+
+/*
+ * Small allocations are handled by dividing a single block of memory into
+ * many small objects of equal size. The possible allocation sizes are
+ * defined by the following array. Larger size classes are spaced more widely
+ * than smaller size classes. We fudge the spacing for size classes >1kB to
+ * avoid space wastage: based on the knowledge that we plan to allocate 64kB
+ * blocks, we bump the maximum object size up to the largest multiple of
+ * 8 bytes that still lets us fit the same number of objects into one block.
+ *
+ * NB: Because of this fudging, if we were ever to use differently-sized blocks
+ * for small allocations, these size classes would need to be reworked to be
+ * optimal for the new size.
+ *
+ * NB: The optimal spacing for size classes, as well as the size of the blocks
+ * out of which small objects are allocated, is not a question that has one
+ * right answer. Some allocators (such as tcmalloc) use more closely-spaced
+ * size classes than we do here, while others (like aset.c) use more
+ * widely-spaced classes. Spacing the classes more closely avoids wasting
+ * memory within individual chunks, but also means a larger number of
+ * potentially-unfilled blocks.
+ */
+static const uint16 dsa_size_classes[] = {
+ sizeof(dsa_area_span), 0, /* special size classes */
+ 8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
+ 80, 96, 112, 128, /* 4 classes separated by 16 bytes */
+ 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
+ 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
+ 640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
+ 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
+ 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
+ 5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
+};
+#define DSA_NUM_SIZE_CLASSES lengthof(dsa_size_classes)
+
+/* Special size classes. */
+#define DSA_SCLASS_BLOCK_OF_SPANS 0
+#define DSA_SCLASS_SPAN_LARGE 1
+
+/*
+ * The following lookup table is used to map the size of small objects
+ * (less than 1kB) onto the corresponding size class. To use this table,
+ * round the size of the object up to the next multiple of 8 bytes, and then
+ * index into this array.
+ */
+static char dsa_size_class_map[] = {
+ 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 11, 12, 12, 13, 13,
+ 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17,
+ 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19,
+ 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21,
+ 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
+ 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23,
+ 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+ 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25
+};
+#define DSA_SIZE_CLASS_MAP_QUANTUM 8
+
+/*
+ * Superblocks are binned by how full they are. Generally, each fullness
+ * class corresponds to one quartile, but the block being used for
+ * allocations is always at the head of the list for fullness class 1,
+ * regardless of how full it really is.
+ *
+ * For large objects, we just stick all of the allocations in fullness class
+ * 0. Since we can just return the space directly to the free page manager,
+ * we don't really need them on a list at all, except that if someone wants
+ * to bulk release everything allocated using this BlockAreaContext, we
+ * have no other way of finding them.
+ */
+#define DSA_FULLNESS_CLASSES 4
+
+/*
+ * Maximum length of a DSA name.
+ */
+#define DSA_MAXLEN 64
+
+/*
+ * A dsa_area_pool represents a set of objects of a given size class.
+ *
+ * Perhaps there should be multiple pools for the same size class for
+ * contention avoidance, but for now there is just one!
+ */
+typedef struct
+{
+ /* A lock protecting access to this pool. */
+ LWLock lock;
+ /* A set of linked lists of spans, arranged by fullness. */
+ dsa_pointer spans[DSA_FULLNESS_CLASSES];
+ /* Should we pad this out to a cacheline boundary? */
+} dsa_area_pool;
+
+/*
+ * The control block for an area. This lives in shared memory, at the start of
+ * the first DSM segment controlled by this area.
+ */
+typedef struct
+{
+ /* The segment header for the first segment. */
+ dsa_segment_header segment_header;
+ /* The handle for this area. */
+ dsa_handle handle;
+ /* The handles of the segments owned by this area. */
+ dsm_handle segment_handles[DSA_MAX_SEGMENTS];
+ /* Lists of segments, binned by maximum contiguous run of free pages. */
+ dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
+ /* The object pools for each size class. */
+ dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* The total size of all active segments. */
+ Size total_segment_size;
+ /* The maximum total size of backing storage we are allowed. */
+ Size max_total_segment_size;
+ /* The reference count for this area. */
+ int refcnt;
+ /* A flag indicating that this area has been pinned. */
+ bool pinned;
+ /* The number of times that segments have been freed. */
+ Size freed_segment_counter;
+ /* The LWLock tranche ID. */
+ int lwlock_tranche_id;
+ char lwlock_tranche_name[DSA_MAXLEN];
+ /* The general lock (protects everything except object pools). */
+ LWLock lock;
+} dsa_area_control;
+
+/* Given a pointer to a pool, find a dsa_pointer. */
+#define DsaAreaPoolToDsaPointer(area, p) \
+ DSA_MAKE_POINTER(0, (char *) p - (char *) area->control)
+
+/*
+ * A dsa_segment_map is stored within the backend-private memory of each
+ * individual backend. It holds the base address of the segment within that
+ * backend, plus the addresses of key objects within the segment. Those
+ * could instead be derived from the base address but it's handy to have them
+ * around.
+ */
+typedef struct
+{
+ dsm_segment *segment; /* DSM segment */
+ char *mapped_address; /* Address at which segment is mapped */
+ Size size; /* Size of the segment */
+ dsa_segment_header *header; /* Header (same as mapped_address) */
+ FreePageManager *fpm; /* Free page manager within segment. */
+ dsa_pointer *pagemap; /* Page map within segment. */
+} dsa_segment_map;
+
+/*
+ * Per-backend state for a storage area. Backends obtain one of these by
+ * creating an area or attaching to an existing one using a handle. Each
+ * process that needs to use an area uses its own object to track where the
+ * segments are mapped.
+ */
+struct dsa_area
+{
+ /* Pointer to the control object in shared memory. */
+ dsa_area_control *control;
+
+ /* The lock tranche for this process. */
+ LWLockTranche lwlock_tranche;
+
+ /* Has the mapping been pinned? */
+ bool mapping_pinned;
+
+ /*
+ * This backend's array of segment maps, ordered by segment index
+ * corresponding to control->segment_handles. Some of the area's segments
+ * may not be mapped into this backend yet, and some slots may have been
+ * freed and need to be detached; these operations happen on demand.
+ */
+ dsa_segment_map segment_maps[DSA_MAX_SEGMENTS];
+
+ /* The last observed freed_segment_counter. */
+ Size freed_segment_counter;
+};
+
+#define DSA_SPAN_NOTHING_FREE ((uint16) -1)
+#define DSA_SUPERBLOCK_SIZE (DSA_PAGES_PER_SUPERBLOCK * FPM_PAGE_SIZE)
+
+/* Given a pointer to a segment_map, obtain a segment index number. */
+#define get_segment_index(area, segment_map_ptr) \
+ (segment_map_ptr - &area->segment_maps[0])
+
+static void init_span(dsa_area *area, dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class);
+static bool transfer_first_span(dsa_area *area, dsa_area_pool *pool,
+ int fromclass, int toclass);
+static inline dsa_pointer alloc_object(dsa_area *area, int size_class);
+static void dsa_on_dsm_segment_detach(dsm_segment *, Datum arg);
+static bool ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class);
+static dsa_segment_map *get_segment_by_index(dsa_area *area,
+ dsa_segment_index index);
+static void destroy_superblock(dsa_area *area, dsa_pointer span_pointer);
+static void unlink_span(dsa_area *area, dsa_area_span *span);
+static void add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer, int fclass);
+static void unlink_segment(dsa_area *area, dsa_segment_map *segment_map);
+static dsa_segment_map *get_best_segment(dsa_area *area, Size npages);
+static dsa_segment_map *make_new_segment(dsa_area *area, Size requested_pages);
+
+/*
+ * Create a new shared area with dynamic size. DSM segments will be allocated
+ * as required to extend the available space.
+ *
+ * We can't allocate a LWLock tranche_id within this function, because tranche
+ * IDs are a scarce resource; there are only 64k available, using low numbers
+ * when possible matters, and we have no provision for recycling them. So,
+ * we require the caller to provide one. The caller must also provide the
+ * tranche name, so that we can distinguish LWLocks belonging to different
+ * DSAs.
+ */
+dsa_area *
+dsa_create_dynamic(int tranche_id, const char *tranche_name)
+{
+ dsm_segment *segment;
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+ Size usable_pages;
+ Size total_pages;
+ Size metadata_bytes;
+ Size total_size;
+ int i;
+
+ total_size = DSA_INITIAL_SEGMENT_SIZE;
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages =
+ (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+
+ /*
+ * Create the DSM segment that will hold the shared control object and the
+ * first segment of usable space, and set it up. All segments backing
+ * this area are pinned, so that DSA can explicitly control their lifetime
+ * (otherwise a newly created segment belonging to this area might be
+ * freed when the only backend that happens to have it mapped in ends,
+ * corrupting the area).
+ */
+ segment = dsm_create(total_size, 0);
+ dsm_pin_segment(segment);
+
+ /*
+ * Initialize the dsa_area_control object located at the start of the
+ * segment.
+ */
+ control = dsm_segment_address(segment);
+ control->segment_header.magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ dsm_segment_handle(segment) ^ 0;
+ control->segment_header.next = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.usable_pages = usable_pages;
+ control->segment_header.freed = false;
+ control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+ control->handle = dsm_segment_handle(segment);
+ control->max_total_segment_size = SIZE_MAX;
+ control->total_segment_size = DSA_INITIAL_SEGMENT_SIZE;
+ memset(&control->segment_handles[0], 0,
+ sizeof(dsm_handle) * DSA_MAX_SEGMENTS);
+ control->segment_handles[0] = dsm_segment_handle(segment);
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ control->segment_bins[i] = DSA_SEGMENT_INDEX_NONE;
+ control->refcnt = 1;
+ control->freed_segment_counter = 0;
+ control->lwlock_tranche_id = tranche_id;
+ strlcpy(control->lwlock_tranche_name, tranche_name, DSA_MAXLEN);
+
+ /*
+ * Create the dsa_area object that this backend will use to access the
+ * area. Other backends will need to obtain their own dsa_area object by
+ * attaching.
+ */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(area->segment_maps, 0, sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+ LWLockInitialize(&control->lock, control->lwlock_tranche_id);
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ LWLockInitialize(DSA_SCLASS_LOCK(area, i),
+ control->lwlock_tranche_id);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = total_size;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Put this segment into the appropriate bin. */
+ control->segment_bins[contiguous_pages_to_segment_bin(usable_pages)] = 0;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_segment_detach, PointerGetDatum(NULL));
+
+ return area;
+}
+
+/*
+ * Obtain a handle that can be passed to other processes so that they can
+ * attach to the given area.
+ */
+dsa_handle
+dsa_get_handle(dsa_area *area)
+{
+ return area->control->handle;
+}
+
+/*
+ * Attach to an area given a handle generated (possibly in another
+ * process) by dsa_get_handle.
+ */
+dsa_area *
+dsa_attach_dynamic(dsa_handle handle)
+{
+ dsm_segment *segment;
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+
+ /*
+ * An area handle is really a DSM segment handle for the first segment, so
+ * we go ahead and attach to that.
+ */
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to area handle %u", handle);
+ control = dsm_segment_address(segment);
+ Assert(control->handle == handle);
+ Assert(control->segment_handles[0] == handle);
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ handle ^ 0));
+
+ /* Build the backend-local area object. */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(&area->segment_maps[0], 0,
+ sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = dsm_segment_map_length(segment);
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Bump the reference count. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ ++control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_segment_detach, PointerGetDatum(area));
+
+ return area;
+}
+
+/*
+ * Keep a DSA area attached until end of session or explicit detach.
+ *
+ * By default, areas are owned by the current resource owner, which means they
+ * are detached automatically when that scope ends.
+ */
+void
+dsa_pin_mapping(dsa_area *area)
+{
+ int i;
+
+ Assert(!area->mapping_pinned);
+ area->mapping_pinned = true;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_pin_mapping(area->segment_maps[i].segment);
+}
+
+/*
+ * Allocate memory in this storage area. The return value is a dsa_pointer
+ * that can be passed to other processes, and converted to a local pointer
+ * with dsa_get_address. If no memory is available, returns
+ * InvalidDsaPointer.
+ */
+dsa_pointer
+dsa_allocate(dsa_area *area, Size size)
+{
+ uint16 size_class;
+ dsa_pointer start_pointer;
+ dsa_segment_map *segment_map;
+
+ Assert(size > 0);
+
+ /*
+ * If bigger than the largest size class, just grab a run of pages from
+ * the free page manager, instead of allocating an object from a pool.
+ * There will still be a span, but it's a special class of span that
+ * manages this whole allocation and simply gives all pages back to the
+ * free page manager when dsa_free is called.
+ */
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ {
+ Size npages = fpm_size_to_pages(size);
+ Size first_page;
+ dsa_pointer span_pointer;
+ dsa_area_pool *pool = &area->control->pools[DSA_SCLASS_SPAN_LARGE];
+
+ /* Obtain a span object. */
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return InvalidDsaPointer;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+
+ /* Find a segment from which to allocate. */
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ /* Can't make any more segments: game over. */
+ LWLockRelease(DSA_AREA_LOCK(area));
+ dsa_free(area, span_pointer);
+ return InvalidDsaPointer;
+ }
+
+ /*
+ * Ask the free page manager for a run of pages. This should always
+ * succeed, since both get_best_segment and make_new_segment should
+ * only return a non-NULL pointer if the segment actually contains enough
+ * contiguous free space. If it does fail, something in our backend
+ * private state is out of whack, so use FATAL to kill the process.
+ */
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ elog(FATAL,
+ "dsa couldn't find run of pages: fpm_largest out of sync");
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ start_pointer = DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /* Initialize span and pagemap. */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ init_span(area, span_pointer, pool, start_pointer, npages,
+ DSA_SCLASS_SPAN_LARGE);
+ segment_map->pagemap[first_page] = span_pointer;
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+
+ return start_pointer;
+ }
+
+ /* Map allocation to a size class. */
+ if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+ Assert(size <= dsa_size_classes[size_class]);
+ Assert(size_class == 0 || size > dsa_size_classes[size_class - 1]);
+
+ /*
+ * Attempt to allocate an object from the appropriate pool. This might
+ * return InvalidDsaPointer if there's no space available.
+ */
+ return alloc_object(area, size_class);
+}
+
+/*
+ * Free memory obtained with dsa_allocate.
+ */
+void
+dsa_free(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_map *segment_map;
+ int pageno;
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ char *superblock;
+ char *object;
+ Size size;
+ int size_class;
+
+ /* Locate the object, span and pool. */
+ segment_map = get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(dp));
+ pageno = DSA_EXTRACT_OFFSET(dp) / FPM_PAGE_SIZE;
+ span_pointer = segment_map->pagemap[pageno];
+ span = dsa_get_address(area, span_pointer);
+ superblock = dsa_get_address(area, span->start);
+ object = dsa_get_address(area, dp);
+ size_class = span->size_class;
+ size = dsa_size_classes[size_class];
+
+ /*
+ * Special case for large objects that live in a special span: we return
+ * those pages directly to the free page manager and free the span.
+ */
+ if (span->size_class == DSA_SCLASS_SPAN_LARGE)
+ {
+ /* Give pages back to free page manager. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Unlink span. */
+ /* TODO: Does it even need to be linked in in the first place? */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ unlink_span(area, span);
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+ /* Free the span object so it can be reused. */
+ dsa_free(area, span_pointer);
+ return;
+ }
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /* Put the object on the span's freelist. */
+ Assert(object >= superblock);
+ Assert(object < superblock + DSA_SUPERBLOCK_SIZE);
+ Assert((object - superblock) % size == 0);
+ NextFreeObjectIndex(object) = span->firstfree;
+ span->firstfree = (object - superblock) / size;
+ ++span->nallocatable;
+
+ /*
+ * See if the span needs to be moved to a different fullness class, or be
+ * freed so its pages can be given back to the segment.
+ */
+ if (span->nallocatable == 1 && span->fclass == DSA_FULLNESS_CLASSES - 1)
+ {
+ /*
+ * The block was completely full and is located in the
+ * highest-numbered fullness class, which is never scanned for free
+ * chunks. We must move it to the next-lower fullness class.
+ */
+ unlink_span(area, span);
+ add_span_to_fullness_class(area, span, span_pointer,
+ DSA_FULLNESS_CLASSES - 2);
+
+ /*
+ * If this is the only span, and there is no active span, then maybe
+ * we should move this span to fullness class 1. (Otherwise
+ * if you allocate exactly all the objects in the only span, it moves
+ * to class 3, then you free them all, it moves to 2, and then is
+ * given back, leaving no active span).
+ */
+ }
+ else if (span->nallocatable == span->nmax &&
+ (span->fclass != 1 || span->prevspan != InvalidDsaPointer))
+ {
+ /*
+ * This entire block is free, and it's not the active block for this
+ * size class. Return the memory to the free page manager. We don't
+ * do this for the active block to prevent hysteresis: if we
+ * repeatedly allocate and free the only chunk in the active block, it
+ * will be very inefficient if we deallocate and reallocate the block
+ * every time.
+ */
+ destroy_superblock(area, span_pointer);
+ }
+
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+}
+
+/*
+ * Obtain a backend-local address for a dsa_pointer. 'dp' must have been
+ * allocated by the given area (possibly in another process). This may cause
+ * a segment to be mapped into the current process.
+ */
+void *
+dsa_get_address(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_index index;
+ Size offset;
+ Size freed_segment_counter;
+
+ /* Convert InvalidDsaPointer to NULL. */
+ if (!DsaPointerIsValid(dp))
+ return NULL;
+
+ index = DSA_EXTRACT_SEGMENT_NUMBER(dp);
+ offset = DSA_EXTRACT_OFFSET(dp);
+
+ Assert(index < DSA_MAX_SEGMENTS);
+
+ /* Check if we need to cause this segment to be mapped in. */
+ if (area->segment_maps[index].mapped_address == NULL)
+ {
+ /* Call for effect (we don't need the result). */
+ get_segment_by_index(area, index);
+ }
+
+ /*
+ * Take this opportunity to check if we need to detach from any segments
+ * that have been freed. This is an unsynchronized read of the value in
+ * shared memory, but all that matters is that we eventually observe a
+ * change when that number moves.
+ */
+ freed_segment_counter = area->control->freed_segment_counter;
+ if (area->freed_segment_counter != freed_segment_counter)
+ {
+ int i;
+
+ /* Check all currently mapped segments to find what's been freed. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ if (area->segment_maps[i].header != NULL &&
+ area->segment_maps[i].header->freed)
+ {
+ dsm_detach(area->segment_maps[i].segment);
+ area->segment_maps[i].segment = NULL;
+ area->segment_maps[i].header = NULL;
+ area->segment_maps[i].mapped_address = NULL;
+ }
+ }
+ area->freed_segment_counter = freed_segment_counter;
+ }
+
+ return area->segment_maps[index].mapped_address + offset;
+}
+
+/*
+ * Pin this area, so that it will continue to exist even if all backends
+ * detach from it. In that case, the area can still be reattached to if a
+ * handle has been recorded somewhere.
+ */
+void
+dsa_pin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ if (area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_pin: area already pinned");
+ }
+ area->control->pinned = true;
+ ++area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Undo the effects of dsa_pin, so that the given area can be freed when no
+ * backends are attached to it. May be called only if dsa_pin has been
+ * called.
+ */
+void
+dsa_unpin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ Assert(area->control->refcnt > 1);
+ if (!area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_unpin: area not pinned");
+ }
+ area->control->pinned = false;
+ --area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Set the total size limit for this area. This limit is checked whenever new
+ * segments need to be allocated from the operating system. If the new size
+ * limit is already exceeded, this has no immediate effect.
+ *
+ * Note that the total virtual memory usage may be temporarily larger than
+ * this limit when segments have been freed, but not yet detached by all
+ * backends that have attached to them.
+ */
+void
+dsa_set_size_limit(dsa_area *area, Size limit)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ area->control->max_total_segment_size = limit;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Aggressively free all spare memory in the hope of returning DSM segments to
+ * the operating system.
+ */
+void
+dsa_trim(dsa_area *area)
+{
+ int size_class;
+
+ /*
+ * Trim in reverse pool order so we get to the blocks-of-spans pool last,
+ * just in case any become entirely free while processing the other pools.
+ */
+ for (size_class = DSA_NUM_SIZE_CLASSES - 1; size_class >= 0; --size_class)
+ {
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_pointer span_pointer;
+
+ if (size_class == DSA_SCLASS_SPAN_LARGE)
+ /* Large object frees give back segments aggressively already. */
+ continue;
+
+ /*
+ * Search fullness class 1 only. That is where we expect to find
+ * an entirely empty superblock (entirely empty superblocks in other
+ * fullness classes are returned to the free page map by dsa_free).
+ */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+ span_pointer = pool->spans[1];
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ dsa_pointer next = span->nextspan;
+
+ if (span->nallocatable == span->nmax)
+ destroy_superblock(area, span_pointer);
+
+ span_pointer = next;
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+ }
+}
+
+/*
+ * Print out debugging information about the internal state of the shared
+ * memory area.
+ */
+void
+dsa_dump(dsa_area *area)
+{
+ Size i,
+ j;
+
+ /*
+ * Note: This gives an inconsistent snapshot as it acquires and releases
+ * individual locks as it goes...
+ */
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ fprintf(stderr, "dsa_area handle %x:\n", area->control->handle);
+ fprintf(stderr, " max_total_segment_size: %zu\n",
+ area->control->max_total_segment_size);
+ fprintf(stderr, " total_segment_size: %zu\n",
+ area->control->total_segment_size);
+ fprintf(stderr, " refcnt: %d\n", area->control->refcnt);
+ fprintf(stderr, " pinned: %c\n", area->control->pinned ? 't' : 'f');
+ fprintf(stderr, " segment bins:\n");
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ {
+ if (area->control->segment_bins[i] != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_index segment_index;
+
+ fprintf(stderr,
+ " segment bin %zu (at least %d contiguous pages free):\n",
+ i, i == 0 ? 0 : 1 << (i - 1));
+ segment_index = area->control->segment_bins[i];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, segment_index);
+
+ fprintf(stderr,
+ " segment index %zu, usable_pages = %zu, "
+ "contiguous_pages = %zu, mapped at %p\n",
+ segment_index,
+ segment_map->header->usable_pages,
+ fpm_largest(segment_map->fpm),
+ segment_map->mapped_address);
+ segment_index = segment_map->header->next;
+ }
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ fprintf(stderr, " pools:\n");
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ {
+ bool found = false;
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, i), LW_EXCLUSIVE);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ if (DsaPointerIsValid(area->control->pools[i].spans[j]))
+ found = true;
+ if (found)
+ {
+ if (i == DSA_SCLASS_BLOCK_OF_SPANS)
+ fprintf(stderr, " pool for blocks of span objects:\n");
+ else if (i == DSA_SCLASS_SPAN_LARGE)
+ fprintf(stderr, " pool for large object spans:\n");
+ else
+ fprintf(stderr,
+ " pool for size class %zu (object size %hu bytes):\n",
+ i, dsa_size_classes[i]);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ {
+ if (!DsaPointerIsValid(area->control->pools[i].spans[j]))
+ fprintf(stderr, " fullness class %zu is empty\n", j);
+ else
+ {
+ dsa_pointer span_pointer = area->control->pools[i].spans[j];
+
+ fprintf(stderr, " fullness class %zu:\n", j);
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span;
+
+ span = dsa_get_address(area, span_pointer);
+ fprintf(stderr,
+ " span descriptor at %016llx, "
+ "superblock at %016llx, pages = %zu, "
+ "objects free = %hu/%hu\n",
+ (unsigned long long) span_pointer,
+ (unsigned long long) span->start, span->npages,
+ span->nallocatable, span->nmax);
+ span_pointer = span->nextspan;
+ }
+ }
+ }
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, i));
+ }
+}
+
+/*
+ * A callback function for when the control segment for a dsa_area is
+ * detached.
+ */
+static void
+dsa_on_dsm_segment_detach(dsm_segment *segment, Datum arg)
+{
+ bool destroy = false;
+ dsa_area_control *control =
+ (dsa_area_control *) dsm_segment_address(segment);
+
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ control->handle ^ 0));
+
+ /* Decrement the reference count for the DSA area. */
+ LWLockAcquire(&control->lock, LW_EXCLUSIVE);
+ if (--control->refcnt == 0)
+ destroy = true;
+ LWLockRelease(&control->lock);
+
+ /*
+ * If we are the last to detach from the area, then we must unpin all
+ * segments so they can be returned to the OS.
+ */
+ if (destroy)
+ {
+ int i;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ dsm_handle handle;
+
+ handle = control->segment_handles[i];
+ if (handle != DSM_HANDLE_INVALID)
+ dsm_unpin_segment(handle);
+ }
+ }
+}
+
+/*
+ * Add a new span to fullness class 1 of the indicated pool.
+ */
+static void
+init_span(dsa_area *area,
+ dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ Size obsize = dsa_size_classes[size_class];
+
+ /*
+ * The per-pool lock must be held because we manipulate the span list for
+ * this pool.
+ */
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /* Push this span onto the front of the span list for fullness class 1. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ {
+ dsa_area_span *head = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+
+ head->prevspan = span_pointer;
+ }
+ span->pool = DsaAreaPoolToDsaPointer(area, pool);
+ span->nextspan = pool->spans[1];
+ span->prevspan = InvalidDsaPointer;
+ pool->spans[1] = span_pointer;
+
+ span->start = start;
+ span->npages = npages;
+ span->size_class = size_class;
+ span->ninitialized = 0;
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * A block-of-spans contains its own descriptor, so mark one object as
+ * initialized and reduce the count of allocatable objects by one.
+ * Doing this here has the side effect of also reducing nmax by one,
+ * which is important to make sure we free this object at the correct
+ * time.
+ */
+ span->ninitialized = 1;
+ span->nallocatable = FPM_PAGE_SIZE / obsize - 1;
+ }
+ else if (size_class != DSA_SCLASS_SPAN_LARGE)
+ span->nallocatable = DSA_SUPERBLOCK_SIZE / obsize;
+ span->firstfree = DSA_SPAN_NOTHING_FREE;
+ span->nmax = span->nallocatable;
+ span->fclass = 1;
+}
+
+/*
+ * Transfer the first span in one fullness class to the head of another
+ * fullness class.
+ */
+static bool
+transfer_first_span(dsa_area *area,
+ dsa_area_pool *pool, int fromclass, int toclass)
+{
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+
+ /* Can't do it if source list is empty. */
+ span_pointer = pool->spans[fromclass];
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+
+ /* Remove span from head of source list. */
+ span = dsa_get_address(area, span_pointer);
+ pool->spans[fromclass] = span->nextspan;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+
+ /* Add span to head of target list. */
+ span->nextspan = pool->spans[toclass];
+ pool->spans[toclass] = span_pointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = toclass;
+
+ return true;
+}
+
+/*
+ * Allocate one object of the requested size class from the given area.
+ */
+static inline dsa_pointer
+alloc_object(dsa_area *area, int size_class)
+{
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_area_span *span;
+ dsa_pointer block;
+ dsa_pointer result;
+ char *object;
+ Size size;
+
+ /*
+ * Even though ensure_active_superblock can in turn call alloc_object if
+ * it needs to allocate a new span, that's always from a different pool,
+ * and the order of lock acquisition is always the same, so it's OK that
+ * we hold this lock for the duration of this function.
+ */
+ Assert(!LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /*
+ * If there's no active superblock, we must successfully obtain one or
+ * fail the request.
+ */
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ !ensure_active_superblock(area, pool, size_class))
+ {
+ result = InvalidDsaPointer;
+ }
+ else
+ {
+ /*
+ * There should be a block in fullness class 1 at this point, and it
+ * should never be completely full. Thus we can either pop an object
+ * from the free list or, failing that, initialize a new object.
+ */
+ Assert(DsaPointerIsValid(pool->spans[1]));
+ span = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+ Assert(span->nallocatable > 0);
+ block = span->start;
+ Assert(size_class < DSA_NUM_SIZE_CLASSES);
+ size = dsa_size_classes[size_class];
+ if (span->firstfree != DSA_SPAN_NOTHING_FREE)
+ {
+ result = block + span->firstfree * size;
+ object = dsa_get_address(area, result);
+ span->firstfree = NextFreeObjectIndex(object);
+ }
+ else
+ {
+ result = block + span->ninitialized * size;
+ ++span->ninitialized;
+ }
+ --span->nallocatable;
+
+ /* If it's now full, move it to the highest-numbered fullness class. */
+ if (span->nallocatable == 0)
+ transfer_first_span(area, pool, 1, DSA_FULLNESS_CLASSES - 1);
+ }
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+
+ return result;
+}
+
+/*
+ * Ensure an active (i.e. fullness class 1) superblock, unless all existing
+ * superblocks are completely full and no more can be allocated.
+ *
+ * Fullness classes K of 0..N are loosely intended to represent blocks whose
+ * utilization percentage is at least K/N, but we only enforce this rigorously
+ * for the highest-numbered fullness class, which always contains exactly
+ * those blocks that are completely full. It's otherwise acceptable for a
+ * block to be in a higher-numbered fullness class than the one to which it
+ * logically belongs. In addition, the active block, which is always the
+ * first block in fullness class 1, is permitted to have a higher allocation
+ * percentage than would normally be allowable for that fullness class; we
+ * don't move it until it's completely full, and then it goes to the
+ * highest-numbered fullness class.
+ *
+ * It might seem odd that the active block is the head of fullness class 1
+ * rather than fullness class 0, but experience with other allocators has
+ * shown that it's usually better to allocate from a block that's moderately
+ * full rather than one that's nearly empty. Insofar as is reasonably
+ * possible, we want to avoid performing new allocations in a block that would
+ * otherwise become empty soon.
+ */
+static bool
+ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class)
+{
+ dsa_pointer span_pointer;
+ dsa_pointer start_pointer;
+ Size obsize = dsa_size_classes[size_class];
+ Size nmax;
+ int fclass;
+ Size npages = 1;
+ Size first_page;
+ Size i;
+ dsa_segment_map *segment_map;
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /*
+ * Compute the number of objects that will fit in a block of this size
+ * class. A block-of-spans is just a single page, and its first object
+ * isn't available for use because it holds the span describing the
+ * block-of-spans itself.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ nmax = FPM_PAGE_SIZE / obsize - 1;
+ else
+ nmax = DSA_SUPERBLOCK_SIZE / obsize;
+
+ /*
+ * If fullness class 1 is empty, try to find a span to put in it by
+ * scanning higher-numbered fullness classes (excluding the last one,
+ * whose blocks are certain to all be completely full).
+ */
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ {
+ span_pointer = pool->spans[fclass];
+
+ while (DsaPointerIsValid(span_pointer))
+ {
+ int tfclass;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+ dsa_area_span *prevspan;
+ dsa_pointer next_span_pointer;
+
+ span = (dsa_area_span *)
+ dsa_get_address(area, span_pointer);
+ next_span_pointer = span->nextspan;
+
+ /* Figure out what fullness class should contain this span. */
+ tfclass = (nmax - span->nallocatable)
+ * (DSA_FULLNESS_CLASSES - 1) / nmax;
+
+ /* Look up next span. */
+ if (DsaPointerIsValid(span->nextspan))
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ else
+ nextspan = NULL;
+
+ /*
+ * If utilization has dropped enough that this now belongs in some
+ * other fullness class, move it there.
+ */
+ if (tfclass < fclass)
+ {
+ /* Remove from the current fullness class list. */
+ if (pool->spans[fclass] == span_pointer)
+ {
+ /* It was the head; remove it. */
+ Assert(!DsaPointerIsValid(span->prevspan));
+ pool->spans[fclass] = span->nextspan;
+ if (nextspan != NULL)
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+ else
+ {
+ /* It was not the head. */
+ Assert(DsaPointerIsValid(span->prevspan));
+ prevspan = (dsa_area_span *)
+ dsa_get_address(area, span->prevspan);
+ prevspan->nextspan = span->nextspan;
+ }
+ if (nextspan != NULL)
+ nextspan->prevspan = span->prevspan;
+
+ /* Push onto the head of the new fullness class list. */
+ span->nextspan = pool->spans[tfclass];
+ pool->spans[tfclass] = span_pointer;
+ span->prevspan = InvalidDsaPointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = tfclass;
+ }
+
+ /* Advance to next span on list. */
+ span_pointer = next_span_pointer;
+ }
+
+ /* Stop now if we found a suitable block. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ return true;
+ }
+
+ /*
+ * If there are no blocks that properly belong in fullness class 1, pick
+ * one from some other fullness class and move it there anyway, so that we
+ * have an allocation target. Our last choice is to transfer a block
+ * that's almost empty (and might become completely empty soon if left
+ * alone), but even that is better than failing, which is what we must do
+ * if there are no blocks at all with free space.
+ */
+ Assert(!DsaPointerIsValid(pool->spans[1]));
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ if (transfer_first_span(area, pool, fclass, 1))
+ return true;
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ transfer_first_span(area, pool, 0, 1))
+ return true;
+
+ /*
+ * We failed to find an existing span with free objects, so we need to
+ * allocate a new superblock and construct a new span to manage it.
+ *
+ * First, get a dsa_area_span object to describe the new superblock
+ * ... unless this allocation is for a dsa_area_span object, in which case
+ * that's surely not going to work. We handle that case by storing the
+ * span describing a block-of-spans inline.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+ npages = DSA_PAGES_PER_SUPERBLOCK;
+ }
+
+ /* Find or create a segment and allocate the superblock. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ segment_map = make_new_segment(area, npages);
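+ if (segment_map == NULL)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Don't leak the span object allocated above. */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }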
+ }
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* Compute the start of the superblock. */
+ start_pointer =
+ DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /*
+ * If this is a block-of-spans, carve the descriptor right out of the
+ * allocated space.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * The span descriptor occupies the first object of the block itself,
+ * so the span pointer is simply the block's starting dsa_pointer.
+ */
+ span_pointer = start_pointer;
+ }
+
+ /* Initialize span and pagemap. */
+ init_span(area, span_pointer, pool, start_pointer, npages, size_class);
+ for (i = 0; i < npages; ++i)
+ segment_map->pagemap[first_page + i] = span_pointer;
+
+ return true;
+}
+
+/*
+ * Return the segment map corresponding to a given segment index, mapping the
+ * segment in if necessary.
+ */
+static dsa_segment_map *
+get_segment_by_index(dsa_area *area, dsa_segment_index index)
+{
+ if (area->segment_maps[index].mapped_address == NULL) /* unlikely */
+ {
+ dsm_handle handle;
+ dsm_segment *segment;
+ dsa_segment_map *segment_map;
+
+ handle = area->control->segment_handles[index];
+
+ /* This slot has been freed. */
+ if (handle == DSM_HANDLE_INVALID)
+ return NULL;
+
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to segment");
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+ segment_map = &area->segment_maps[index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header =
+ (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = dsm_segment_map_length(segment);
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ Assert(segment_map->header->magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index));
+ }
+
+ return &area->segment_maps[index];
+}
+
+/*
+ * Return a superblock to the free page manager. If the underlying segment
+ * has become entirely free, then return it to the operating system.
+ *
+ * The appropriate pool lock must be held.
+ */
+static void
+destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ int size_class = span->size_class;
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
+
+ /* Remove it from its fullness class list. */
+ unlink_span(area, span);
+
+ /*
+ * Note: This is the only time we acquire the area lock while we already
+ * hold a per-pool lock. We never hold the area lock and then take a pool
+ * lock, or we could deadlock.
+ */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ /* Check if the segment is now entirely free. */
+ if (fpm_largest(segment_map->fpm) == segment_map->header->usable_pages)
+ {
+ dsa_segment_index index = get_segment_index(area, segment_map);
+
+ /* If it's not the segment with extra control data, free it. */
+ if (index != 0)
+ {
+ /*
+ * Give it back to the OS, and allow other backends to detect that
+ * they need to detach.
+ */
+ unlink_segment(area, segment_map);
+ segment_map->header->freed = true;
+ Assert(area->control->total_segment_size >=
+ segment_map->header->size);
+ area->control->total_segment_size -=
+ segment_map->header->size;
+ dsm_unpin_segment(dsm_segment_handle(segment_map->segment));
+ dsm_detach(segment_map->segment);
+ area->control->segment_handles[index] = DSM_HANDLE_INVALID;
+ ++area->control->freed_segment_counter;
+ segment_map->segment = NULL;
+ segment_map->header = NULL;
+ segment_map->mapped_address = NULL;
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /*
+ * A block-of-spans stores the span that describes it within the block
+ * itself, so freeing the storage implicitly frees the descriptor too.
+ * For a block of any other type, we must free the span object
+ * separately. This recursive call to dsa_free will acquire the span
+ * pool's lock. We can't deadlock because the acquisition order is
+ * always some other pool and then the span pool.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+}
+
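+/*
+ * Remove a span from the fullness class list that currently contains it.
+ * The caller must hold the lock for the span's pool.
+ */
+static void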
+unlink_span(dsa_area *area, dsa_area_span *span)
+{
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ dsa_area_span *next = dsa_get_address(area, span->nextspan);
+
+ next->prevspan = span->prevspan;
+ }
+ if (DsaPointerIsValid(span->prevspan))
+ {
+ dsa_area_span *prev = dsa_get_address(area, span->prevspan);
+
+ prev->nextspan = span->nextspan;
+ }
+ else
+ {
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ pool->spans[span->fclass] = span->nextspan;
+ }
+}
+
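+/*
+ * Push a span onto the head of the given fullness class list of its pool.
+ * The caller must hold the lock for the span's pool.
+ */
+static void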
+add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer,
+ int fclass)
+{
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ if (DsaPointerIsValid(pool->spans[fclass]))
+ {
+ dsa_area_span *head = dsa_get_address(area,
+ pool->spans[fclass]);
+
+ head->prevspan = span_pointer;
+ }
+ span->prevspan = InvalidDsaPointer;
+ span->nextspan = pool->spans[fclass];
+ pool->spans[fclass] = span_pointer;
+ span->fclass = fclass;
+}
+
+/*
+ * Detach from an area that was either created or attached to by this process.
+ */
+void
+dsa_detach(dsa_area *area)
+{
+ int i;
+
+ /* Detach from all segments. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_detach(area->segment_maps[i].segment);
+
+ /* Free the backend-local area object. */
+ pfree(area);
+}
+
+/*
+ * Unlink a segment from the bin that contains it.
+ */
+static void
+unlink_segment(dsa_area *area, dsa_segment_map *segment_map)
+{
+ if (segment_map->header->prev != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *prev;
+
+ prev = get_segment_by_index(area, segment_map->header->prev);
+ prev->header->next = segment_map->header->next;
+ }
+ else
+ {
+ Assert(area->control->segment_bins[segment_map->header->bin] ==
+ get_segment_index(area, segment_map));
+ area->control->segment_bins[segment_map->header->bin] =
+ segment_map->header->next;
+ }
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area, segment_map->header->next);
+ next->header->prev = segment_map->header->prev;
+ }
+}
+
+/*
+ * Find a segment that could satisfy a request for 'npages' of contiguous
+ * memory, or return NULL if none can be found. This may involve attaching to
+ * segments that weren't previously attached so that we can query their free
+ * pages map.
+ */
+static dsa_segment_map *
+get_best_segment(dsa_area *area, Size npages)
+{
+ Size bin;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /*
+ * Start searching from the first bin that *might* have enough contiguous
+ * pages.
+ */
+ for (bin = contiguous_pages_to_segment_bin(npages);
+ bin < DSA_NUM_SEGMENT_BINS;
+ ++bin)
+ {
+ /*
+ * The minimum contiguous size that any segment in this bin should
+ * have. We'll re-bin if we see segments with fewer.
+ */
+ Size threshold = 1 << (bin - 1);
+ dsa_segment_index segment_index;
+
+ /* Search this bin for a segment with enough contiguous space. */
+ segment_index = area->control->segment_bins[bin];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+ dsa_segment_index next_segment_index;
+ Size contiguous_pages;
+
+ segment_map = get_segment_by_index(area, segment_index);
+ next_segment_index = segment_map->header->next;
+ contiguous_pages = fpm_largest(segment_map->fpm);
+
+ /* Not enough for the request, still enough for this bin. */
+ if (contiguous_pages >= threshold && contiguous_pages < npages)
+ {
+ segment_index = next_segment_index;
+ continue;
+ }
+
+ /* Re-bin it if it's no longer in the appropriate bin. */
+ if (contiguous_pages < threshold)
+ {
+ Size new_bin;
+
+ new_bin = contiguous_pages_to_segment_bin(contiguous_pages);
+
+ /* Remove it from its current bin. */
+ unlink_segment(area, segment_map);
+
+ /* Push it onto the front of its new bin. */
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[new_bin];
+ segment_map->header->bin = new_bin;
+ area->control->segment_bins[new_bin] = segment_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area,
+ segment_map->header->next);
+ Assert(next->header->bin == new_bin);
+ next->header->prev = segment_index;
+ }
+
+ /*
+ * But fall through to see if it's enough to satisfy this
+ * request anyway....
+ */
+ }
+
+ /* Check if we are done. */
+ if (contiguous_pages >= npages)
+ return segment_map;
+
+ /* Continue searching the same bin. */
+ segment_index = next_segment_index;
+ }
+ }
+
+ /* Not found. */
+ return NULL;
+}
+
+/*
+ * Create a new segment that can handle at least requested_pages. Returns
+ * NULL if the requested total size limit or maximum allowed number of
+ * segments would be exceeded.
+ */
+static dsa_segment_map *
+make_new_segment(dsa_area *area, Size requested_pages)
+{
+ dsa_segment_index new_index;
+ Size metadata_bytes;
+ Size total_size;
+ Size total_pages;
+ Size usable_pages;
+ dsa_segment_map *segment_map;
+ dsm_segment *segment;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /* Find a segment slot that is not in use (linearly for now). */
+ for (new_index = 1; new_index < DSA_MAX_SEGMENTS; ++new_index)
+ {
+ if (area->control->segment_handles[new_index] == DSM_HANDLE_INVALID)
+ break;
+ }
+ if (new_index == DSA_MAX_SEGMENTS)
+ return NULL;
+
+ /*
+ * If the total size limit is already exceeded, then we exit early and
+ * avoid arithmetic wraparound in the unsigned expressions below.
+ */
+ if (area->control->total_segment_size >=
+ area->control->max_total_segment_size)
+ return NULL;
+
+ /*
+ * The size should be at least as big as requested, and at least big
+ * enough to follow a geometric series that approximately doubles the
+ * total storage each time we create a new segment. We use geometric
+ * growth because the underlying DSM system isn't designed for large
+ * numbers of segments (otherwise we might even consider just using one
+ * DSM segment for each large allocation and for each superblock, and then
+ * we wouldn't need to use FreePageManager).
+ *
+ * We decide on a total segment size first, so that we produce tidy
+ * power-of-two sized segments. This is a good property to have if we
+ * move to huge pages in the future. Then we work back to the number of
+ * pages we can fit.
+ */
+ total_size = DSA_INITIAL_SEGMENT_SIZE *
+ ((Size) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
+ total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size,
+ area->control->max_total_segment_size -
+ area->control->total_segment_size);
+
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ sizeof(dsa_pointer) * total_pages;
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ if (total_size <= metadata_bytes)
+ return NULL;
+ usable_pages = (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ Assert(metadata_bytes + usable_pages * FPM_PAGE_SIZE <= total_size);
+
+ /* See if that is enough... */
+ if (requested_pages > usable_pages)
+ {
+ /*
+ * We'll make an odd-sized segment, working forward from the requested
+ * number of pages.
+ */
+ usable_pages = requested_pages;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ usable_pages * sizeof(dsa_pointer);
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ total_size = metadata_bytes + usable_pages * FPM_PAGE_SIZE;
+
+ /* Is that too large for dsa_pointer's addressing scheme? */
+ if (total_size > DSA_MAX_SEGMENT_SIZE)
+ return NULL;
+
+ /* Would that exceed the limit? */
+ if (total_size > area->control->max_total_segment_size -
+ area->control->total_segment_size)
+ return NULL;
+ }
+
+ /* Create the segment. */
+ segment = dsm_create(total_size, 0);
+ if (segment == NULL)
+ return NULL;
+ dsm_pin_segment(segment);
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+
+ /* Store the handle in shared memory to be found by index. */
+ area->control->segment_handles[new_index] =
+ dsm_segment_handle(segment);
+
+ area->control->total_segment_size += total_size;
+ Assert(area->control->total_segment_size <=
+ area->control->max_total_segment_size);
+
+ /* Build a segment map for this segment in this backend. */
+ segment_map = &area->segment_maps[new_index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = total_size;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Set up the segment header and put it in the appropriate bin. */
+ segment_map->header->magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ new_index;
+ segment_map->header->usable_pages = usable_pages;
+ segment_map->header->size = total_size;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[segment_map->header->bin];
+ segment_map->header->freed = false;
+ area->control->segment_bins[segment_map->header->bin] = new_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next =
+ get_segment_by_index(area, segment_map->header->next);
+
+ Assert(next->header->bin == segment_map->header->bin);
+ next->header->prev = new_index;
+ }
+
+ return segment_map;
+}
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index d806664..8c6abe3 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -182,7 +182,7 @@ dsm_postmaster_startup(PGShmemHeader *shim)
Assert(dsm_control_address == NULL);
Assert(dsm_control_mapped_size == 0);
dsm_control_handle = random();
- if (dsm_control_handle == 0)
+ if (dsm_control_handle == DSM_HANDLE_INVALID)
continue;
if (dsm_impl_op(DSM_OP_CREATE, dsm_control_handle, segsize,
&dsm_control_impl_private, &dsm_control_address,
@@ -476,6 +476,8 @@ dsm_create(Size size, int flags)
{
Assert(seg->mapped_address == NULL && seg->mapped_size == 0);
seg->handle = random();
+ if (seg->handle == DSM_HANDLE_INVALID) /* Reserve sentinel */
+ continue;
if (dsm_impl_op(DSM_OP_CREATE, seg->handle, size, &seg->impl_private,
&seg->mapped_address, &seg->mapped_size, ERROR))
break;
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index b2403e1..20973af 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/mmgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = aset.o mcxt.o portalmem.o
+OBJS = aset.o freepage.o mcxt.o portalmem.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
new file mode 100644
index 0000000..fd1f2ec
--- /dev/null
+++ b/src/backend/utils/mmgr/freepage.c
@@ -0,0 +1,1812 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.c
+ * Management of free memory pages.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/freepage.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+
+#include "utils/freepage.h"
+#include "utils/relptr.h"
+
+
+/* Magic numbers to identify various page types */
+#define FREE_PAGE_SPAN_LEADER_MAGIC 0xea4020f0
+#define FREE_PAGE_LEAF_MAGIC 0x98eae728
+#define FREE_PAGE_INTERNAL_MAGIC 0x19aa32c9
+
+/* Doubly linked list of spans of free pages; stored in first page of span. */
+struct FreePageSpanLeader
+{
+ int magic; /* always FREE_PAGE_SPAN_LEADER_MAGIC */
+ Size npages; /* number of pages in span */
+ RelptrFreePageSpanLeader prev;
+ RelptrFreePageSpanLeader next;
+};
+
+/* Common header for btree leaf and internal pages. */
+typedef struct FreePageBtreeHeader
+{
+ int magic; /* FREE_PAGE_LEAF_MAGIC or
+ * FREE_PAGE_INTERNAL_MAGIC */
+ Size nused; /* number of items used */
+ RelptrFreePageBtree parent; /* uplink */
+} FreePageBtreeHeader;
+
+/* Internal key; points to next level of btree. */
+typedef struct FreePageBtreeInternalKey
+{
+ Size first_page; /* low bound for keys on child page */
+ RelptrFreePageBtree child; /* downlink */
+} FreePageBtreeInternalKey;
+
+/* Leaf key; no payload data. */
+typedef struct FreePageBtreeLeafKey
+{
+ Size first_page; /* first page in span */
+ Size npages; /* number of pages in span */
+} FreePageBtreeLeafKey;
+
+/* Work out how many keys will fit on a page. */
+#define FPM_ITEMS_PER_INTERNAL_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeInternalKey))
+#define FPM_ITEMS_PER_LEAF_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeLeafKey))
+
+/* A btree page of either sort */
+struct FreePageBtree
+{
+ FreePageBtreeHeader hdr;
+ union
+ {
+ FreePageBtreeInternalKey internal_key[FPM_ITEMS_PER_INTERNAL_PAGE];
+ FreePageBtreeLeafKey leaf_key[FPM_ITEMS_PER_LEAF_PAGE];
+ } u;
+};
+
+/* Results of a btree search */
+typedef struct FreePageBtreeSearchResult
+{
+ FreePageBtree *page;
+ Size index;
+ bool found;
+ unsigned split_pages;
+} FreePageBtreeSearchResult;
+
+/* Helper functions */
+static void FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm,
+ FreePageBtree *btp);
+static Size FreePageBtreeCleanup(FreePageManager *fpm);
+static FreePageBtree *FreePageBtreeFindLeftSibling(char *base,
+ FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeFindRightSibling(char *base,
+ FreePageBtree *btp);
+static Size FreePageBtreeFirstKey(FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeGetRecycled(FreePageManager *fpm);
+static void FreePageBtreeInsertInternal(char *base, FreePageBtree *btp,
+ Size index, Size first_page, FreePageBtree *child);
+static void FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index,
+ Size first_page, Size npages);
+static void FreePageBtreeRecycle(FreePageManager *fpm, Size pageno);
+static void FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp,
+ Size index);
+static void FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp);
+static void FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result);
+static Size FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page);
+static Size FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page);
+static FreePageBtree *FreePageBtreeSplitPage(FreePageManager *fpm,
+ FreePageBtree *btp);
+static void FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp);
+static void FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf);
+static void FreePageManagerDumpSpans(FreePageManager *fpm,
+ FreePageSpanLeader *span, Size expected_pages,
+ StringInfo buf);
+static bool FreePageManagerGetInternal(FreePageManager *fpm, Size npages,
+ Size *first_page);
+static Size FreePageManagerPutInternal(FreePageManager *fpm, Size first_page,
+ Size npages, bool soft);
+static void FreePagePopSpanLeader(FreePageManager *fpm, Size pageno);
+static void FreePagePushSpanLeader(FreePageManager *fpm, Size first_page,
+ Size npages);
+static void FreePageManagerUpdateLargest(FreePageManager *fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+static Size sum_free_pages(FreePageManager *fpm);
+#endif
+
+/*
+ * Initialize a new, empty free page manager.
+ *
+ * 'fpm' should reference caller-provided memory large enough to contain a
+ * FreePageManager. We'll initialize it here.
+ *
+ * 'base' is the address to which all pointers are relative. When managing
+ * a dynamic shared memory segment, it should normally be the base of the
+ * segment. When managing backend-private memory, it can be either NULL or,
+ * if managing a single contiguous extent of memory, the start of that extent.
+ */
+void
+FreePageManagerInitialize(FreePageManager *fpm, char *base)
+{
+ Size f;
+
+ relptr_store(base, fpm->self, fpm);
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ relptr_store(base, fpm->btree_recycle, (FreePageSpanLeader *) NULL);
+ fpm->btree_depth = 0;
+ fpm->btree_recycle_count = 0;
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->contiguous_pages = 0;
+ fpm->contiguous_pages_dirty = true;
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages = 0;
+#endif
+
+ for (f = 0; f < FPM_NUM_FREELISTS; f++)
+ relptr_store(base, fpm->freelist[f], (FreePageSpanLeader *) NULL);
+}
+
+/*
+ * Allocate a run of pages of the given length from the free page manager.
+ * The return value indicates whether we were able to satisfy the request;
+ * if true, the first page of the allocation is stored in *first_page.
+ */
+bool
+FreePageManagerGet(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ bool result;
+
+ result = FreePageManagerGetInternal(fpm, npages, first_page);
+
+ /*
+ * It's a bit counterintuitive, but allocating pages can actually create
+ * opportunities for cleanup that create larger ranges. We might pull a
+ * key out of the btree that enables the item at the head of the btree
+ * recycle list to be inserted; and then if there are more items behind it
+ * one of those might cause two currently-separated ranges to merge,
+ * creating a single range of contiguous pages larger than any that
+ * existed previously. It might be worth trying to improve the cleanup
+ * algorithm to avoid such corner cases, but for now we just notice the
+ * condition and do the appropriate reporting.
+ */
+ FreePageBtreeCleanup(fpm);
+
+ /*
+ * TODO: We could take Max(fpm->contiguous_pages, result of
+ * FreePageBtreeCleanup) and give it to FreePageManagerUpdateLargest as a
+ * starting point for its search, potentially avoiding a bunch of work,
+ * since there is no way the largest contiguous run is bigger than that.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ if (result)
+ {
+ Assert(fpm->free_pages >= npages);
+ fpm->free_pages -= npages;
+ }
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+ return result;
+}
+
+#ifdef FPM_EXTRA_ASSERTS
+static void
+sum_free_pages_recurse(FreePageManager *fpm, FreePageBtree *btp, Size *sum)
+{
+ char *base = fpm_segment_base(fpm);
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ||
+ btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ ++*sum;
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ Size index;
+
+
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ sum_free_pages_recurse(fpm, child, sum);
+ }
+ }
+}
+static Size
+sum_free_pages(FreePageManager *fpm)
+{
+ FreePageSpanLeader *recycle;
+ char *base = fpm_segment_base(fpm);
+ Size sum = 0;
+ int list;
+
+ /* Count the spans by scanning the freelists. */
+ for (list = 0; list < FPM_NUM_FREELISTS; ++list)
+ {
+
+ if (!relptr_is_null(fpm->freelist[list]))
+ {
+ FreePageSpanLeader *candidate =
+ relptr_access(base, fpm->freelist[list]);
+
+ do
+ {
+ sum += candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ }
+
+ /* Count the pages occupied by the btree itself (internal and leaf). */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ sum_free_pages_recurse(fpm, root, &sum);
+ }
+
+ /* Count the recycle list. */
+ for (recycle = relptr_access(base, fpm->btree_recycle);
+ recycle != NULL;
+ recycle = relptr_access(base, recycle->next))
+ {
+ Assert(recycle->npages == 1);
+ ++sum;
+ }
+
+ return sum;
+}
+#endif
+
+/*
+ * Recompute the size of the largest run of pages that the user could
+ * successfully get, if it has been marked dirty.
+ */
+static void
+FreePageManagerUpdateLargest(FreePageManager *fpm)
+{
+ char *base;
+ Size largest;
+
+ if (!fpm->contiguous_pages_dirty)
+ return;
+
+ base = fpm_segment_base(fpm);
+ largest = 0;
+ if (!relptr_is_null(fpm->freelist[FPM_NUM_FREELISTS - 1]))
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[FPM_NUM_FREELISTS - 1]);
+ do
+ {
+ if (candidate->npages > largest)
+ largest = candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ else
+ {
+ Size f = FPM_NUM_FREELISTS - 1;
+
+ do
+ {
+ --f;
+ if (!relptr_is_null(fpm->freelist[f]))
+ {
+ largest = f + 1;
+ break;
+ }
+ } while (f > 0);
+ }
+
+ fpm->contiguous_pages = largest;
+ fpm->contiguous_pages_dirty = false;
+}
+
+/*
+ * Transfer a run of pages to the free page manager.
+ */
+void
+FreePageManagerPut(FreePageManager *fpm, Size first_page, Size npages)
+{
+ Size contiguous_pages;
+
+ Assert(npages > 0);
+
+ /* Record the new pages. */
+ contiguous_pages =
+ FreePageManagerPutInternal(fpm, first_page, npages, false);
+
+ /*
+ * If the new range we inserted into the page manager was contiguous with
+ * an existing range, it may have opened up cleanup opportunities.
+ */
+ if (contiguous_pages > npages)
+ {
+ Size cleanup_contiguous_pages;
+
+ cleanup_contiguous_pages = FreePageBtreeCleanup(fpm);
+ if (cleanup_contiguous_pages > contiguous_pages)
+ contiguous_pages = cleanup_contiguous_pages;
+ }
+
+ /*
+ * TODO: Figure out how to avoid setting this every time. It may not be as
+ * simple as it looks.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages += npages;
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+}
+
+/*
+ * Produce a debugging dump of the state of a free page manager.
+ */
+char *
+FreePageManagerDump(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ StringInfoData buf;
+ FreePageSpanLeader *recycle;
+ bool dumped_any_freelist = false;
+ Size f;
+
+ /* Initialize output buffer. */
+ initStringInfo(&buf);
+
+ /* Dump general stuff. */
+ appendStringInfo(&buf, "metadata: self %zu max contiguous pages = %zu\n",
+ fpm->self.relptr_off, fpm->contiguous_pages);
+
+ /* Dump btree. */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root;
+
+ appendStringInfo(&buf, "btree depth %u:\n", fpm->btree_depth);
+ root = relptr_access(base, fpm->btree_root);
+ FreePageManagerDumpBtree(fpm, root, NULL, 0, &buf);
+ }
+ else if (fpm->singleton_npages > 0)
+ {
+ appendStringInfo(&buf, "singleton: %zu(%zu)\n",
+ fpm->singleton_first_page, fpm->singleton_npages);
+ }
+
+ /* Dump btree recycle list. */
+ recycle = relptr_access(base, fpm->btree_recycle);
+ if (recycle != NULL)
+ {
+ appendStringInfo(&buf, "btree recycle:");
+ FreePageManagerDumpSpans(fpm, recycle, 1, &buf);
+ }
+
+ /* Dump free lists. */
+ for (f = 0; f < FPM_NUM_FREELISTS; ++f)
+ {
+ FreePageSpanLeader *span;
+
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+ if (!dumped_any_freelist)
+ {
+ appendStringInfo(&buf, "freelists:\n");
+ dumped_any_freelist = true;
+ }
+ appendStringInfo(&buf, " %zu:", f + 1);
+ span = relptr_access(base, fpm->freelist[f]);
+ FreePageManagerDumpSpans(fpm, span, f + 1, &buf);
+ }
+
+ /* And return result to caller. */
+ return buf.data;
+}
+
+
+/*
+ * The first_page value stored at index zero in any non-root page must match
+ * the first_page value stored in its parent at the index which points to that
+ * page. So when the value stored at index zero in a btree page changes, we've
+ * got to walk up the tree adjusting ancestor keys until we reach an ancestor
+ * where that key isn't index zero. This function should be called after
+ * updating the first key on the target page; it will propagate the change
+ * upward as far as needed.
+ *
+ * We assume here that the first key on the page has not changed enough to
+ * require changes in the ordering of keys on its ancestor pages. Thus,
+ * if we search the parent page for the first key greater than or equal to
+ * the first key on the current page, the downlink to this page will be either
+ * the exact index returned by the search (if the first key decreased)
+ * or one less (if the first key increased).
+ */
+static void
+FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ Size first_page;
+ FreePageBtree *parent;
+ FreePageBtree *child;
+
+ /* This might be either a leaf or an internal page. */
+ Assert(btp->hdr.nused > 0);
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ first_page = btp->u.leaf_key[0].first_page;
+ }
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ first_page = btp->u.internal_key[0].first_page;
+ }
+ child = btp;
+
+ /* Loop until we find an ancestor that does not require adjustment. */
+ for (;;)
+ {
+ Size s;
+
+ parent = relptr_access(base, child->hdr.parent);
+ if (parent == NULL)
+ break;
+ s = FreePageBtreeSearchInternal(parent, first_page);
+
+ /* Key is either at index s or index s-1; figure out which. */
+ if (s >= parent->hdr.nused)
+ {
+ Assert(s == parent->hdr.nused);
+ --s;
+ }
+ else
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ if (check != child)
+ {
+ Assert(s > 0);
+ --s;
+ }
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* Debugging double-check. */
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ Assert(s < parent->hdr.nused);
+ Assert(child == check);
+ }
+#endif
+
+ /* Update the parent key. */
+ parent->u.internal_key[s].first_page = first_page;
+
+ /*
+ * If this is the first key in the parent, go up another level; else
+ * done.
+ */
+ if (s > 0)
+ break;
+ child = parent;
+ }
+}
+
+/*
+ * Attempt to reclaim space from the free-page btree. The return value is
+ * the largest range of contiguous pages created by the cleanup operation.
+ */
+static Size
+FreePageBtreeCleanup(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ Size max_contiguous_pages = 0;
+
+ /* Attempt to shrink the depth of the btree. */
+ while (!relptr_is_null(fpm->btree_root))
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ /* If the root contains only one key, reduce depth by one. */
+ if (root->hdr.nused == 1)
+ {
+ /* Shrink depth of tree by one. */
+ Assert(fpm->btree_depth > 0);
+ --fpm->btree_depth;
+ if (root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ /* If root is a leaf, convert only entry to singleton range. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages;
+ }
+ else
+ {
+ FreePageBtree *newroot;
+
+ /* If root is an internal page, make only child the root. */
+ Assert(root->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ relptr_copy(fpm->btree_root, root->u.internal_key[0].child);
+ newroot = relptr_access(base, fpm->btree_root);
+ relptr_store(base, newroot->hdr.parent, (FreePageBtree *) NULL);
+ }
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, root));
+ }
+ else if (root->hdr.nused == 2 &&
+ root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Size end_of_first;
+ Size start_of_second;
+
+ end_of_first = root->u.leaf_key[0].first_page +
+ root->u.leaf_key[0].npages;
+ start_of_second = root->u.leaf_key[1].first_page;
+
+ if (end_of_first + 1 == start_of_second)
+ {
+ Size root_page = fpm_pointer_to_page(base, root);
+
+ if (end_of_first == root_page)
+ {
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[0].first_page);
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[1].first_page);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages +
+ root->u.leaf_key[1].npages + 1;
+ fpm->btree_depth = 0;
+ relptr_store(base, fpm->btree_root,
+ (FreePageBtree *) NULL);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ Assert(max_contiguous_pages == 0);
+ max_contiguous_pages = fpm->singleton_npages;
+ }
+ }
+
+ /* Whether it worked or not, it's time to stop. */
+ break;
+ }
+ else
+ {
+ /* Nothing more to do. Stop. */
+ break;
+ }
+ }
+
+ /*
+ * Attempt to free recycled btree pages. We skip this if releasing the
+ * recycled page would require a btree page split, because the page we're
+ * trying to recycle would be consumed by the split, which would be
+ * counterproductive.
+ *
+ * We also currently only ever attempt to recycle the first page on the
+ * list; that could be made more aggressive, but it's not clear that the
+ * complexity would be worthwhile.
+ */
+ while (fpm->btree_recycle_count > 0)
+ {
+ FreePageBtree *btp;
+ Size first_page;
+ Size contiguous_pages;
+
+ btp = FreePageBtreeGetRecycled(fpm);
+ first_page = fpm_pointer_to_page(base, btp);
+ contiguous_pages = FreePageManagerPutInternal(fpm, first_page, 1, true);
+ if (contiguous_pages == 0)
+ {
+ FreePageBtreeRecycle(fpm, first_page);
+ break;
+ }
+ else
+ {
+ if (contiguous_pages > max_contiguous_pages)
+ max_contiguous_pages = contiguous_pages;
+ }
+ }
+
+ return max_contiguous_pages;
+}
+
+/*
+ * Consider consolidating the given page with its left or right sibling,
+ * if it's fairly empty.
+ */
+static void
+FreePageBtreeConsolidate(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *np;
+ Size max;
+
+ /*
+ * We only try to consolidate pages that are less than a third full. We
+ * could be more aggressive about this, but that might risk performing
+ * consolidation only to end up splitting again shortly thereafter. Since
+ * the btree should be very small compared to the space under management,
+ * our goal isn't so much to ensure that it always occupies the absolutely
+ * smallest possible number of pages as to reclaim pages before things get
+ * too egregiously out of hand.
+ */
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ max = FPM_ITEMS_PER_LEAF_PAGE;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ max = FPM_ITEMS_PER_INTERNAL_PAGE;
+ }
+ if (btp->hdr.nused >= max / 3)
+ return;
+
+ /*
+ * If we can fit our right sibling's keys onto this page, consolidate.
+ */
+ np = FreePageBtreeFindRightSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&btp->u.leaf_key[btp->hdr.nused], &np->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ }
+ else
+ {
+ memcpy(&btp->u.internal_key[btp->hdr.nused], &np->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, btp);
+ }
+ FreePageBtreeRemovePage(fpm, np);
+ return;
+ }
+
+ /*
+ * If we can fit our keys onto our left sibling's page, consolidate. In
+ * this case, we move our keys onto the other page rather than vice
+ * versa, to avoid having to adjust ancestor keys.
+ */
+ np = FreePageBtreeFindLeftSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&np->u.leaf_key[np->hdr.nused], &btp->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ }
+ else
+ {
+ memcpy(&np->u.internal_key[np->hdr.nused], &btp->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, np);
+ }
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+}
+
+/*
+ * Find the passed page's left sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately precedes ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindLeftSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move left. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the leftmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index > 0)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index - 1].child);
+ break;
+ }
+ Assert(index == 0);
+ ++levels;
+ }
+
+ /* Descend along the rightmost downlinks. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[p->hdr.nused - 1].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Find the passed page's right sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately follows ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindRightSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move right. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the rightmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index < p->hdr.nused - 1)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index + 1].child);
+ break;
+ }
+ Assert(index == p->hdr.nused - 1);
+ ++levels;
+ }
+
+ /* Descend left. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[0].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Get the first key on a btree page.
+ */
+static Size
+FreePageBtreeFirstKey(FreePageBtree *btp)
+{
+ Assert(btp->hdr.nused > 0);
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ return btp->u.leaf_key[0].first_page;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ return btp->u.internal_key[0].first_page;
+ }
+}
+
+/*
+ * Get a page from the btree recycle list for use as a btree page.
+ */
+static FreePageBtree *
+FreePageBtreeGetRecycled(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *newhead;
+
+ Assert(victim != NULL);
+ newhead = relptr_access(base, victim->next);
+ if (newhead != NULL)
+ relptr_copy(newhead->prev, victim->prev);
+ relptr_store(base, fpm->btree_recycle, newhead);
+ Assert(fpm_pointer_is_page_aligned(base, victim));
+ fpm->btree_recycle_count--;
+ return (FreePageBtree *) victim;
+}
+
+/*
+ * Insert an item into an internal page.
+ */
+static void
+FreePageBtreeInsertInternal(char *base, FreePageBtree *btp, Size index,
+ Size first_page, FreePageBtree *child)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.internal_key[index + 1], &btp->u.internal_key[index],
+ sizeof(FreePageBtreeInternalKey) * (btp->hdr.nused - index));
+ btp->u.internal_key[index].first_page = first_page;
+ relptr_store(base, btp->u.internal_key[index].child, child);
+ ++btp->hdr.nused;
+}
+
+/*
+ * Insert an item into a leaf page.
+ */
+static void
+FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index, Size first_page,
+ Size npages)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.leaf_key[index + 1], &btp->u.leaf_key[index],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+ btp->u.leaf_key[index].first_page = first_page;
+ btp->u.leaf_key[index].npages = npages;
+ ++btp->hdr.nused;
+}
+
+/*
+ * Put a page on the btree recycle list.
+ */
+static void
+FreePageBtreeRecycle(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *head = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = 1;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->btree_recycle, span);
+ fpm->btree_recycle_count++;
+}
+
+/*
+ * Remove an item from the btree at the given position on the given page.
+ */
+static void
+FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp, Size index)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(index < btp->hdr.nused);
+
+ /* When last item is removed, extirpate entire page from btree. */
+ if (btp->hdr.nused == 1)
+ {
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+
+ /* Physically remove the key from the page. */
+ --btp->hdr.nused;
+ if (index < btp->hdr.nused)
+ memmove(&btp->u.leaf_key[index], &btp->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+
+ /* If we just removed the first key, adjust ancestor keys. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, btp);
+
+ /* Consider whether to consolidate this page with a sibling. */
+ FreePageBtreeConsolidate(fpm, btp);
+}
+
+/*
+ * Remove a page from the btree. Caller is responsible for having relocated
+ * any keys from this page that are still wanted. The page is placed on the
+ * recycle list.
+ */
+static void
+FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *parent;
+ Size index;
+ Size first_page;
+
+ for (;;)
+ {
+ /* Find parent page. */
+ parent = relptr_access(base, btp->hdr.parent);
+ if (parent == NULL)
+ {
+ /* We are removing the root page. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->btree_depth = 0;
+ Assert(fpm->singleton_first_page == 0);
+ Assert(fpm->singleton_npages == 0);
+ return;
+ }
+
+ /*
+ * If the parent contains only one item, we need to remove it as well.
+ */
+ if (parent->hdr.nused > 1)
+ break;
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+ btp = parent;
+ }
+
+ /* Find and remove the downlink. */
+ first_page = FreePageBtreeFirstKey(btp);
+ if (parent->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ index = FreePageBtreeSearchLeaf(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.leaf_key[index],
+ &parent->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ else
+ {
+ index = FreePageBtreeSearchInternal(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.internal_key[index],
+ &parent->u.internal_key[index + 1],
+ sizeof(FreePageBtreeInternalKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ parent->hdr.nused--;
+ Assert(parent->hdr.nused > 0);
+
+ /* Recycle the page. */
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+
+ /* Adjust ancestor keys if needed. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+
+ /* Consider whether to consolidate the parent with a sibling. */
+ FreePageBtreeConsolidate(fpm, parent);
+}
+
+/*
+ * Search the btree for an entry for the given first page and initialize
+ * *result with the results of the search. result->page and result->index
+ * indicate either the position of an exact match or the position at which
+ * the new key should be inserted. result->found is true for an exact match,
+ * otherwise false. result->split_pages will contain the number of additional
+ * btree pages that will be needed when performing a split to insert a key.
+ * Except as described above, the contents of fields in the result object are
+ * undefined on return.
+ */
+static void
+FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *btp = relptr_access(base, fpm->btree_root);
+ Size index;
+
+ result->split_pages = 1;
+
+ /* If the btree is empty, there's nothing to find. */
+ if (btp == NULL)
+ {
+ result->page = NULL;
+ result->found = false;
+ return;
+ }
+
+ /* Descend until we hit a leaf. */
+ while (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ FreePageBtree *child;
+ bool found_exact;
+
+ index = FreePageBtreeSearchInternal(btp, first_page);
+ found_exact = index < btp->hdr.nused &&
+ btp->u.internal_key[index].first_page == first_page;
+
+ /*
+ * If we found an exact match we descend directly. Otherwise, we
+ * descend into the child to the left if possible so that we can find
+ * the insertion point at that child's high end.
+ */
+ if (!found_exact && index > 0)
+ --index;
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_INTERNAL_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Descend to appropriate child page. */
+ Assert(index < btp->hdr.nused);
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ Assert(relptr_access(base, child->hdr.parent) == btp);
+ btp = child;
+ }
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_LEAF_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_LEAF_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Search leaf page. */
+ index = FreePageBtreeSearchLeaf(btp, first_page);
+
+ /* Assemble results. */
+ result->page = btp;
+ result->index = index;
+ result->found = index < btp->hdr.nused &&
+ first_page == btp->u.leaf_key[index].first_page;
+}
+
+/*
+ * Search an internal page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page if none.
+ */
+static Size
+FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_INTERNAL_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.internal_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Search a leaf page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page if none.
+ */
+static Size
+FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_LEAF_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.leaf_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Allocate a new btree page and move half the keys from the provided page
+ * to the new page. Caller is responsible for making sure that there's a
+ * page available from fpm->btree_recycle. Returns a pointer to the new page,
+ * to which caller must add a downlink.
+ */
+static FreePageBtree *
+FreePageBtreeSplitPage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ FreePageBtree *newsibling;
+
+ newsibling = FreePageBtreeGetRecycled(fpm);
+ newsibling->hdr.magic = btp->hdr.magic;
+ newsibling->hdr.nused = btp->hdr.nused / 2;
+ relptr_copy(newsibling->hdr.parent, btp->hdr.parent);
+ btp->hdr.nused -= newsibling->hdr.nused;
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ memcpy(&newsibling->u.leaf_key,
+ &btp->u.leaf_key[btp->hdr.nused],
+ sizeof(FreePageBtreeLeafKey) * newsibling->hdr.nused);
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ memcpy(&newsibling->u.internal_key,
+ &btp->u.internal_key[btp->hdr.nused],
+ sizeof(FreePageBtreeInternalKey) * newsibling->hdr.nused);
+ FreePageBtreeUpdateParentPointers(fpm_segment_base(fpm), newsibling);
+ }
+
+ return newsibling;
+}
+
+/*
+ * When internal pages are split or merged, the parent pointers of their
+ * children must be updated.
+ */
+static void
+FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp)
+{
+ Size i;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ for (i = 0; i < btp->hdr.nused; ++i)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[i].child);
+ relptr_store(base, child->hdr.parent, btp);
+ }
+}
+
+/*
+ * Debugging dump of btree data.
+ */
+static void
+FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+ Size pageno = fpm_pointer_to_page(base, btp);
+ Size index;
+ FreePageBtree *check_parent;
+
+ check_stack_depth();
+ check_parent = relptr_access(base, btp->hdr.parent);
+ appendStringInfo(buf, " %zu@%d %c", pageno, level,
+ btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ? 'i' : 'l');
+ if (parent != check_parent)
+ appendStringInfo(buf, " [actual parent %zu, expected %zu]",
+ fpm_pointer_to_page(base, check_parent),
+ fpm_pointer_to_page(base, parent));
+ appendStringInfoChar(buf, ':');
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ appendStringInfo(buf, " %zu->%zu",
+ btp->u.internal_key[index].first_page,
+ btp->u.internal_key[index].child.relptr_off / FPM_PAGE_SIZE);
+ else
+ appendStringInfo(buf, " %zu(%zu)",
+ btp->u.leaf_key[index].first_page,
+ btp->u.leaf_key[index].npages);
+ }
+ appendStringInfo(buf, "\n");
+
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ FreePageManagerDumpBtree(fpm, child, btp, level + 1, buf);
+ }
+ }
+}
+
+/*
+ * Debugging dump of free-span data.
+ */
+static void
+FreePageManagerDumpSpans(FreePageManager *fpm, FreePageSpanLeader *span,
+ Size expected_pages, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+
+ while (span != NULL)
+ {
+ if (span->npages != expected_pages)
+ appendStringInfo(buf, " %zu(%zu)", fpm_pointer_to_page(base, span),
+ span->npages);
+ else
+ appendStringInfo(buf, " %zu", fpm_pointer_to_page(base, span));
+ span = relptr_access(base, span->next);
+ }
+
+ appendStringInfo(buf, "\n");
+}
+
+/*
+ * This function allocates a run of pages of the given length from the free
+ * page manager.
+ */
+static bool
+FreePageManagerGetInternal(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = NULL;
+ FreePageSpanLeader *prev;
+ FreePageSpanLeader *next;
+ FreePageBtreeSearchResult result;
+ Size victim_page = 0; /* placate compiler */
+ Size f;
+
+ /*
+ * Search for a free span.
+ *
+ * Right now, we use a simple best-fit policy here, but it's possible for
+ * this to result in memory fragmentation if we're repeatedly asked to
+ * allocate chunks just a little smaller than what we have available.
+ * Hopefully, this is unlikely, because we expect most requests to be
+ * single pages or superblock-sized chunks -- but no policy can be optimal
+ * under all circumstances unless it has knowledge of future allocation
+ * patterns.
+ */
+ for (f = Min(npages, FPM_NUM_FREELISTS) - 1; f < FPM_NUM_FREELISTS; ++f)
+ {
+ /* Skip empty freelists. */
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+
+ /*
+ * All of the freelists except the last one contain only items of a
+ * single size, so we just take the first one. But the final free
+ * list contains everything too big for any of the other lists, so we
+ * need to search the list.
+ */
+ if (f < FPM_NUM_FREELISTS - 1)
+ victim = relptr_access(base, fpm->freelist[f]);
+ else
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[f]);
+ do
+ {
+ if (candidate->npages >= npages && (victim == NULL ||
+ victim->npages > candidate->npages))
+ {
+ victim = candidate;
+ if (victim->npages == npages)
+ break;
+ }
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ break;
+ }
+
+ /* If we didn't find an allocatable span, return failure. */
+ if (victim == NULL)
+ return false;
+
+ /* Remove span from free list. */
+ Assert(victim->magic == FREE_PAGE_SPAN_LEADER_MAGIC);
+ prev = relptr_access(base, victim->prev);
+ next = relptr_access(base, victim->next);
+ if (prev != NULL)
+ relptr_copy(prev->next, victim->next);
+ else
+ relptr_copy(fpm->freelist[f], victim->next);
+ if (next != NULL)
+ relptr_copy(next->prev, victim->prev);
+ victim_page = fpm_pointer_to_page(base, victim);
+
+ /*
+ * If we haven't initialized the btree yet, the victim must be the single
+ * span stored within the FreePageManager itself. Otherwise, we need to
+ * update the btree.
+ */
+ if (relptr_is_null(fpm->btree_root))
+ {
+ Assert(victim_page == fpm->singleton_first_page);
+ Assert(victim->npages == fpm->singleton_npages);
+ Assert(victim->npages >= npages);
+ fpm->singleton_first_page += npages;
+ fpm->singleton_npages -= npages;
+ if (fpm->singleton_npages > 0)
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ }
+ else
+ {
+ /*
+ * If the span we found is exactly the right size, remove it from the
+ * btree completely. Otherwise, adjust the btree entry to reflect the
+ * still-unallocated portion of the span, and put that portion on the
+ * appropriate free list.
+ */
+ FreePageBtreeSearch(fpm, victim_page, &result);
+ Assert(result.found);
+ if (victim->npages == npages)
+ FreePageBtreeRemove(fpm, result.page, result.index);
+ else
+ {
+ FreePageBtreeLeafKey *key;
+
+ /* Adjust btree to reflect remaining pages. */
+ Assert(victim->npages > npages);
+ key = &result.page->u.leaf_key[result.index];
+ Assert(key->npages == victim->npages);
+ key->first_page += npages;
+ key->npages -= npages;
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put the unallocated pages back on the appropriate free list. */
+ FreePagePushSpanLeader(fpm, victim_page + npages,
+ victim->npages - npages);
+ }
+ }
+
+ /* Return results to caller. */
+ *first_page = fpm_pointer_to_page(base, victim);
+ return true;
+}
+
+/*
+ * Put a range of pages into the btree and freelists, consolidating it with
+ * existing free spans just before and/or after it. If 'soft' is true,
+ * only perform the insertion if it can be done without allocating new btree
+ * pages; if false, do it always. Returns 0 if the soft flag caused the
+ * insertion to be skipped, or otherwise the size of the contiguous span
+ * created by the insertion. This may be larger than npages if we're able
+ * to consolidate with an adjacent range. fpm->contiguous_pages_dirty is set
+ * to true if the btree might allocate pages for internal purposes, which might
+ * invalidate the current largest run, requiring it to be recomputed.
+ */
+static Size
+FreePageManagerPutInternal(FreePageManager *fpm, Size first_page, Size npages,
+ bool soft)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtreeSearchResult result;
+ FreePageBtreeLeafKey *prevkey = NULL;
+ FreePageBtreeLeafKey *nextkey = NULL;
+ FreePageBtree *np;
+ Size nindex;
+
+ Assert(npages > 0);
+
+ /* We can store a single free span without initializing the btree. */
+ if (fpm->btree_depth == 0)
+ {
+ if (fpm->singleton_npages == 0)
+ {
+ /* Don't have a span yet; store this one. */
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return fpm->singleton_npages;
+ }
+ else if (fpm->singleton_first_page + fpm->singleton_npages ==
+ first_page)
+ {
+ /* New span immediately follows sole existing span. */
+ fpm->singleton_npages += npages;
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else if (first_page + npages == fpm->singleton_first_page)
+ {
+ /* New span immediately precedes sole existing span. */
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages += npages;
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else
+ {
+ /* Not contiguous; we need to initialize the btree. */
+ Size root_page;
+ FreePageBtree *root;
+
+ if (!relptr_is_null(fpm->btree_recycle))
+ root = FreePageBtreeGetRecycled(fpm);
+ else if (FreePageManagerGetInternal(fpm, 1, &root_page))
+ root = (FreePageBtree *) fpm_page_to_pointer(base, root_page);
+ else
+ {
+ /* We'd better be able to get a page from the existing range. */
+ elog(FATAL, "free page manager btree is corrupt");
+ }
+
+ /* Create the btree and move the preexisting range into it. */
+ root->hdr.magic = FREE_PAGE_LEAF_MAGIC;
+ root->hdr.nused = 1;
+ relptr_store(base, root->hdr.parent, (FreePageBtree *) NULL);
+ root->u.leaf_key[0].first_page = fpm->singleton_first_page;
+ root->u.leaf_key[0].npages = fpm->singleton_npages;
+ relptr_store(base, fpm->btree_root, root);
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->btree_depth = 1;
+
+ /*
+ * Corner case: it may be that the btree root took the very last
+ * free page. In that case, the sole btree entry covers a zero
+ * page run, which is invalid. Overwrite it with the entry we're
+ * trying to insert and get out.
+ */
+ if (root->u.leaf_key[0].npages == 0)
+ {
+ root->u.leaf_key[0].first_page = first_page;
+ root->u.leaf_key[0].npages = npages;
+ return npages;
+ }
+
+ /* Fall through to insert the new key. */
+ }
+ }
+
+ /* Search the btree. */
+ FreePageBtreeSearch(fpm, first_page, &result);
+ Assert(!result.found);
+ if (result.index > 0)
+ prevkey = &result.page->u.leaf_key[result.index - 1];
+ if (result.index < result.page->hdr.nused)
+ {
+ np = result.page;
+ nindex = result.index;
+ nextkey = &result.page->u.leaf_key[result.index];
+ }
+ else
+ {
+ np = FreePageBtreeFindRightSibling(base, result.page);
+ nindex = 0;
+ if (np != NULL)
+ nextkey = &np->u.leaf_key[0];
+ }
+
+ /* Consolidate with the previous entry if possible. */
+ if (prevkey != NULL && prevkey->first_page + prevkey->npages >= first_page)
+ {
+ bool remove_next = false;
+ Size result;
+
+ Assert(prevkey->first_page + prevkey->npages == first_page);
+ prevkey->npages = (first_page - prevkey->first_page) + npages;
+
+ /* Check whether we can *also* consolidate with the following entry. */
+ if (nextkey != NULL &&
+ prevkey->first_page + prevkey->npages >= nextkey->first_page)
+ {
+ Assert(prevkey->first_page + prevkey->npages ==
+ nextkey->first_page);
+ prevkey->npages = (nextkey->first_page - prevkey->first_page)
+ + nextkey->npages;
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ remove_next = true;
+ }
+
+ /* Put the span on the correct freelist and save size. */
+ FreePagePopSpanLeader(fpm, prevkey->first_page);
+ FreePagePushSpanLeader(fpm, prevkey->first_page, prevkey->npages);
+ result = prevkey->npages;
+
+ /*
+ * If we consolidated with both the preceding and following entries,
+ * we must remove the following entry. We do this last, because
+ * removing an element from the btree may invalidate pointers we hold
+ * into the current data structure.
+ *
+ * NB: The btree is technically in an invalid state at this point
+ * because we've already updated prevkey to cover the same key space
+ * as nextkey. FreePageBtreeRemove() shouldn't notice that, though.
+ */
+ if (remove_next)
+ FreePageBtreeRemove(fpm, np, nindex);
+
+ return result;
+ }
+
+ /* Consolidate with the next entry if possible. */
+ if (nextkey != NULL && first_page + npages >= nextkey->first_page)
+ {
+ Size newpages;
+
+ /* Compute new size for span. */
+ Assert(first_page + npages == nextkey->first_page);
+ newpages = (nextkey->first_page - first_page) + nextkey->npages;
+
+ /* Put span on correct free list. */
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ FreePagePushSpanLeader(fpm, first_page, newpages);
+
+ /* Update key in place. */
+ nextkey->first_page = first_page;
+ nextkey->npages = newpages;
+
+ /* If reducing first key on page, ancestors might need adjustment. */
+ if (nindex == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, np);
+
+ return nextkey->npages;
+ }
+
+ /* Split leaf page and as many of its ancestors as necessary. */
+ if (result.split_pages > 0)
+ {
+ /*
+ * NB: We could consider various coping strategies here to avoid a
+ * split; most obviously, if np != result.page, we could target that
+ * page instead. More complicated shuffling strategies could be
+ * possible as well; basically, unless every single leaf page is 100%
+ * full, we can jam this key in there if we try hard enough. It's
+ * unlikely that trying that hard is worthwhile, but it's possible we
+ * might need to make more than no effort. For now, we just do the
+ * easy thing, which is nothing.
+ */
+
+ /* If this is a soft insert, it's time to give up. */
+ if (soft)
+ return 0;
+
+ /*
+ * Past this point we might allocate btree pages, which could
+ * potentially shorten any existing run which might happen to be the
+ * current longest. So fpm->contiguous_pages needs to be recomputed.
+ */
+ fpm->contiguous_pages_dirty = true;
+
+ /* Check whether we need to allocate more btree pages to split. */
+ if (result.split_pages > fpm->btree_recycle_count)
+ {
+ Size pages_needed;
+ Size recycle_page;
+ Size i;
+
+ /*
+ * Allocate the required number of pages and split each one in
+ * turn. This should never fail, because if we've got enough
+ * spans of free pages kicking around that we need additional
+ * storage space just to remember them all, then we should
+ * certainly have enough to expand the btree, which should only
+ * ever use a tiny number of pages compared to the number under
+ * management. If it does fail, something's badly screwed up.
+ */
+ pages_needed = result.split_pages - fpm->btree_recycle_count;
+ for (i = 0; i < pages_needed; ++i)
+ {
+ if (!FreePageManagerGetInternal(fpm, 1, &recycle_page))
+ elog(FATAL, "free page manager btree is corrupt");
+ FreePageBtreeRecycle(fpm, recycle_page);
+ }
+
+ /*
+ * The act of allocating pages to recycle may have invalidated the
+ * results of our previous btree search, so repeat it. (We could
+ * recheck whether any of our split-avoidance strategies that were
+ * not viable before now are, but it hardly seems worthwhile, so
+ * we don't bother. Consolidation can't be possible now if it
+ * wasn't previously.)
+ */
+ FreePageBtreeSearch(fpm, first_page, &result);
+
+ /*
+ * The act of allocating pages for use in constructing our btree
+ * should never cause any page to become more full, so the new
+ * split depth should be no greater than the old one, and perhaps
+ * less if we fortuitously allocated a chunk that freed up a slot
+ * on the page we need to update.
+ */
+ Assert(result.split_pages <= fpm->btree_recycle_count);
+ }
+
+ /* If we still need to perform a split, do it. */
+ if (result.split_pages > 0)
+ {
+ FreePageBtree *split_target = result.page;
+ FreePageBtree *child = NULL;
+ Size key = first_page;
+
+ for (;;)
+ {
+ FreePageBtree *newsibling;
+ FreePageBtree *parent;
+
+ /* Identify parent page, which must receive downlink. */
+ parent = relptr_access(base, split_target->hdr.parent);
+
+ /* Split the page - downlink not added yet. */
+ newsibling = FreePageBtreeSplitPage(fpm, split_target);
+
+ /*
+ * At this point in the loop, we're always carrying a pending
+ * insertion. On the first pass, it's the actual key we're
+ * trying to insert; on subsequent passes, it's the downlink
+ * that needs to be added as a result of the split performed
+ * during the previous loop iteration. Since we've just split
+ * the page, there's definitely room on one of the two
+ * resulting pages.
+ */
+ if (child == NULL)
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into = key < newsibling->u.leaf_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchLeaf(insert_into, key);
+ FreePageBtreeInsertLeaf(insert_into, index, key, npages);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+ else
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into =
+ key < newsibling->u.internal_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchInternal(insert_into, key);
+ FreePageBtreeInsertInternal(base, insert_into, index,
+ key, child);
+ relptr_store(base, child->hdr.parent, insert_into);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+
+ /* If the page we just split has no parent, split the root. */
+ if (parent == NULL)
+ {
+ FreePageBtree *newroot;
+
+ newroot = FreePageBtreeGetRecycled(fpm);
+ newroot->hdr.magic = FREE_PAGE_INTERNAL_MAGIC;
+ newroot->hdr.nused = 2;
+ relptr_store(base, newroot->hdr.parent,
+ (FreePageBtree *) NULL);
+ newroot->u.internal_key[0].first_page =
+ FreePageBtreeFirstKey(split_target);
+ relptr_store(base, newroot->u.internal_key[0].child,
+ split_target);
+ relptr_store(base, split_target->hdr.parent, newroot);
+ newroot->u.internal_key[1].first_page =
+ FreePageBtreeFirstKey(newsibling);
+ relptr_store(base, newroot->u.internal_key[1].child,
+ newsibling);
+ relptr_store(base, newsibling->hdr.parent, newroot);
+ relptr_store(base, fpm->btree_root, newroot);
+ fpm->btree_depth++;
+
+ break;
+ }
+
+ /* If the parent page isn't full, insert the downlink. */
+ key = newsibling->u.internal_key[0].first_page;
+ if (parent->hdr.nused < FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Size index;
+
+ index = FreePageBtreeSearchInternal(parent, key);
+ FreePageBtreeInsertInternal(base, parent, index,
+ key, newsibling);
+ relptr_store(base, newsibling->hdr.parent, parent);
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+ break;
+ }
+
+ /* The parent also needs to be split, so loop around. */
+ child = newsibling;
+ split_target = parent;
+ }
+
+ /*
+ * The loop above did the insert, so just need to update the free
+ * list, and we're done.
+ */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+ }
+ }
+
+ /* Physically add the key to the page. */
+ Assert(result.page->hdr.nused < FPM_ITEMS_PER_LEAF_PAGE);
+ FreePageBtreeInsertLeaf(result.page, result.index, first_page, npages);
+
+ /* If new first key on page, ancestors might need adjustment. */
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put it on the free list. */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+}
+
+/*
+ * Remove a FreePageSpanLeader from the linked-list that contains it, either
+ * because we're changing the size of the span, or because we're allocating it.
+ */
+static void
+FreePagePopSpanLeader(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *span;
+ FreePageSpanLeader *next;
+ FreePageSpanLeader *prev;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+
+ next = relptr_access(base, span->next);
+ prev = relptr_access(base, span->prev);
+ if (next != NULL)
+ relptr_copy(next->prev, span->prev);
+ if (prev != NULL)
+ relptr_copy(prev->next, span->next);
+ else
+ {
+ Size f = Min(span->npages, FPM_NUM_FREELISTS) - 1;
+
+ Assert(fpm->freelist[f].relptr_off == pageno * FPM_PAGE_SIZE);
+ relptr_copy(fpm->freelist[f], span->next);
+ }
+}
+
+/*
+ * Initialize a new FreePageSpanLeader and put it on the appropriate free list.
+ */
+static void
+FreePagePushSpanLeader(FreePageManager *fpm, Size first_page, Size npages)
+{
+ char *base = fpm_segment_base(fpm);
+ Size f = Min(npages, FPM_NUM_FREELISTS) - 1;
+ FreePageSpanLeader *head = relptr_access(base, fpm->freelist[f]);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, first_page);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = npages;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->freelist[f], span);
+}
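For reviewers: FreePageBtreeSearchInternal and FreePageBtreeSearchLeaf above are both lower-bound binary searches over keys sorted by first_page. A minimal standalone sketch of that search follows, assuming a plain sorted array in place of a btree page. The name lower_bound_page and the array representation are hypothetical and not part of the patch. Unlike the patch, the sketch has no early return on an exact match, which gives the same answer when keys are unique.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a btree page: a sorted array of first-page
 * numbers.  Returns the index of the first key >= target, or nkeys if
 * every key on the "page" is smaller than target.
 */
size_t
lower_bound_page(const size_t *keys, size_t nkeys, size_t target)
{
    size_t low = 0;
    size_t high = nkeys;

    assert(keys != NULL || nkeys == 0);

    while (low < high)
    {
        /* Midpoint computed this way cannot overflow. */
        size_t mid = low + (high - low) / 2;

        if (keys[mid] < target)
            low = mid + 1;      /* answer lies strictly to the right */
        else
            high = mid;         /* keys[mid] is a candidate answer */
    }

    return low;
}
```

A caller detects an exact match by checking that the returned index is less than nkeys and that the key at that index equals the target, which is how FreePageBtreeSearch computes result->found.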
diff --git a/src/include/storage/dsa.h b/src/include/storage/dsa.h
new file mode 100644
index 0000000..1d18f16
--- /dev/null
+++ b/src/include/storage/dsa.h
@@ -0,0 +1,100 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.h
+ * Dynamic shared memory areas.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/storage/dsa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSA_H
+#define DSA_H
+
+#include "postgres.h"
+
+#include "port/atomics.h"
+#include "storage/dsm.h"
+
+/* The opaque type used for an area. */
+struct dsa_area;
+typedef struct dsa_area dsa_area;
+
+/*
+ * If this system doesn't support atomic operations on 64 bit values then
+ * we fall back to 32 bit dsa_pointer. For testing purposes,
+ * USE_SMALL_DSA_POINTER can be defined to force the use of 32 bit
+ * dsa_pointer even on systems that support 64 bit atomics.
+ */
+#ifndef PG_HAVE_ATOMIC_U64_SUPPORT
+#define SIZEOF_DSA_POINTER 4
+#else
+#ifdef USE_SMALL_DSA_POINTER
+#define SIZEOF_DSA_POINTER 4
+#else
+#define SIZEOF_DSA_POINTER 8
+#endif
+#endif
+
+/*
+ * The type of 'relative pointers' to memory allocated by a dynamic shared
+ * area. dsa_pointer values can be shared with other processes, but must be
+ * converted to backend-local pointers before they can be dereferenced. See
+ * dsa_get_address. Also, an atomic version and appropriately sized atomic
+ * operations.
+ */
+#if SIZEOF_DSA_POINTER == 4
+typedef uint32 dsa_pointer;
+typedef pg_atomic_uint32 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u32
+#define dsa_pointer_atomic_read pg_atomic_read_u32
+#define dsa_pointer_atomic_write pg_atomic_write_u32
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u32
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u32
+#else
+typedef uint64 dsa_pointer;
+typedef pg_atomic_uint64 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u64
+#define dsa_pointer_atomic_read pg_atomic_read_u64
+#define dsa_pointer_atomic_write pg_atomic_write_u64
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u64
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u64
+#endif
+
+/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
+#define InvalidDsaPointer ((dsa_pointer) 0)
+
+/* Check if a dsa_pointer value is valid. */
+#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
+
+/*
+ * The type used for dsa_area handles. dsa_handle values can be shared with
+ * other processes, so that they can attach to the same area. This provides a
+ * way to share allocated storage with other processes.
+ *
+ * The handle for a dsa_area is currently implemented as the dsm_handle
+ * for the first DSM segment backing this dynamic storage area, but client
+ * code shouldn't assume that is true.
+ */
+typedef dsm_handle dsa_handle;
+
+extern void dsa_startup(void);
+
+extern dsa_area *dsa_create_dynamic(int tranche_id, const char *tranche_name);
+extern dsa_area *dsa_attach_dynamic(dsa_handle handle);
+extern void dsa_pin_mapping(dsa_area *area);
+extern void dsa_detach(dsa_area *area);
+extern void dsa_pin(dsa_area *area);
+extern void dsa_unpin(dsa_area *area);
+extern void dsa_set_size_limit(dsa_area *area, Size limit);
+extern dsa_handle dsa_get_handle(dsa_area *area);
+extern dsa_pointer dsa_allocate(dsa_area *area, Size size);
+extern void dsa_free(dsa_area *area, dsa_pointer dp);
+extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern void dsa_trim(dsa_area *area);
+extern void dsa_dump(dsa_area *area);
+
+#endif /* DSA_H */
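To illustrate why dsa_pointer values must be converted with dsa_get_address before use, here is a toy, single-segment model of a relative pointer. It is only a sketch under simplifying assumptions: toy_pointer, toy_get_address and toy_make_pointer are hypothetical names, and the real dsa_pointer may encode more than a plain byte offset. The sketch shows only that an offset stays meaningful when the same bytes are mapped at a different address, and that zero can serve as the invalid sentinel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t toy_pointer;           /* hypothetical stand-in for dsa_pointer */
#define TOY_INVALID ((toy_pointer) 0)   /* mirrors InvalidDsaPointer */

/* Convert a relative pointer into an address valid for this mapping. */
void *
toy_get_address(char *base, toy_pointer p)
{
    return p == TOY_INVALID ? NULL : (void *) (base + p);
}

/* Convert a local address back into a relative pointer. */
toy_pointer
toy_make_pointer(char *base, void *addr)
{
    if (addr == NULL)
        return TOY_INVALID;
    return (toy_pointer) ((char *) addr - base);
}
```

In this toy model offset zero is reserved, so nothing useful can live at the very start of the mapping.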
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 8be7c9a..bc91be6 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -19,6 +19,9 @@ typedef struct dsm_segment dsm_segment;
#define DSM_CREATE_NULL_IF_MAXSEGMENTS 0x0001
+/* A sentinel value for an invalid DSM handle. */
+#define DSM_HANDLE_INVALID 0
+
/* Startup and shutdown functions. */
struct PGShmemHeader; /* avoid including pg_shmem.h */
extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
diff --git a/src/include/utils/freepage.h b/src/include/utils/freepage.h
new file mode 100644
index 0000000..e509ca2
--- /dev/null
+++ b/src/include/utils/freepage.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.h
+ * Management of page-organized free memory.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/freepage.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef FREEPAGE_H
+#define FREEPAGE_H
+
+#include "storage/lwlock.h"
+#include "utils/relptr.h"
+
+/* Forward declarations. */
+typedef struct FreePageSpanLeader FreePageSpanLeader;
+typedef struct FreePageBtree FreePageBtree;
+typedef struct FreePageManager FreePageManager;
+
+/*
+ * PostgreSQL normally uses 8kB pages for most things, but many common
+ * architecture/operating system pairings use a 4kB page size for memory
+ * allocation, so we do that here also. We assume that a large allocation
+ * is likely to begin on a page boundary; if not, we'll discard bytes from
+ * the beginning and end of the object and use only the middle portion that
+ * is properly aligned. This works, but is not ideal, so it's best to keep
+ * this conservatively small. There don't seem to be any common architectures
+ * where the page size is less than 4kB, so this should be good enough; also,
+ * making it smaller would increase the space consumed by the address space
+ * map, which also uses this page size.
+ */
+#define FPM_PAGE_SIZE 4096
+
+/*
+ * Each freelist except for the last contains only spans of one particular
+ * size. Everything larger goes on the last one. In some sense this seems
+ * like a waste since most allocations are in a few common sizes, but it
+ * means that small allocations can simply pop the head of the relevant list
+ * without needing to worry about whether the object we find there is of
+ * precisely the correct size (because we know it must be).
+ */
+#define FPM_NUM_FREELISTS 129
+
+/* Define relative pointer types. */
+relptr_declare(FreePageBtree, RelptrFreePageBtree);
+relptr_declare(FreePageManager, RelptrFreePageManager);
+relptr_declare(FreePageSpanLeader, RelptrFreePageSpanLeader);
+
+/* Everything we need in order to manage free pages (see freepage.c) */
+struct FreePageManager
+{
+ RelptrFreePageManager self;
+ RelptrFreePageBtree btree_root;
+ RelptrFreePageSpanLeader btree_recycle;
+ unsigned btree_depth;
+ unsigned btree_recycle_count;
+ Size singleton_first_page;
+ Size singleton_npages;
+ Size contiguous_pages;
+ bool contiguous_pages_dirty;
+ RelptrFreePageSpanLeader freelist[FPM_NUM_FREELISTS];
+#ifdef FPM_EXTRA_ASSERTS
+ /* For debugging only, pages put minus pages gotten. */
+ Size free_pages;
+#endif
+};
+
+/* Macros to convert between page numbers (expressed as Size) and pointers. */
+#define fpm_page_to_pointer(base, page) \
+ (AssertVariableIsOfTypeMacro(page, Size), \
+ (base) + FPM_PAGE_SIZE * (page))
+#define fpm_pointer_to_page(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) / FPM_PAGE_SIZE)
+
+/* Macro to convert an allocation size to a number of pages. */
+#define fpm_size_to_pages(sz) \
+ (((sz) + FPM_PAGE_SIZE - 1) / FPM_PAGE_SIZE)
+
+/* Macros to check alignment of absolute and relative pointers. */
+#define fpm_pointer_is_page_aligned(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) % FPM_PAGE_SIZE == 0)
+#define fpm_relptr_is_page_aligned(base, relptr) \
+ ((relptr).relptr_off % FPM_PAGE_SIZE == 0)
+
+/* Macro to find base address of the segment containing a FreePageManager. */
+#define fpm_segment_base(fpm) \
+ (((char *) fpm) - fpm->self.relptr_off)
+
+/* Macro to access a FreePageManager's largest consecutive run of pages. */
+#define fpm_largest(fpm) \
+ (fpm->contiguous_pages)
+
+/* Functions to manipulate the free page map. */
+extern void FreePageManagerInitialize(FreePageManager *fpm, char *base);
+extern bool FreePageManagerGet(FreePageManager *fpm, Size npages,
+ Size *first_page);
+extern void FreePageManagerPut(FreePageManager *fpm, Size first_page,
+ Size npages);
+extern char *FreePageManagerDump(FreePageManager *fpm);
+
+#endif /* FREEPAGE_H */
diff --git a/src/include/utils/relptr.h b/src/include/utils/relptr.h
new file mode 100644
index 0000000..a97dc96
--- /dev/null
+++ b/src/include/utils/relptr.h
@@ -0,0 +1,70 @@
+/*-------------------------------------------------------------------------
+ *
+ * relptr.h
+ * This file contains basic declarations for relative pointers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/relptr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef RELPTR_H
+#define RELPTR_H
+
+/*
+ * Relative pointers are intended to be used when storing an address that may
+ * be relative either to the base of the process's address space or some
+ * dynamic shared memory segment mapped therein.
+ *
+ * The idea here is that you declare a relative pointer as relptr(type)
+ * and then use relptr_access to dereference it and relptr_store to change
+ * it. The use of a union here is a hack, because what's stored in the
+ * relptr is always a Size, never an actual pointer. But including a pointer
+ * in the union allows us to use stupid macro tricks to provide some measure
+ * of type-safety.
+ */
+#define relptr(type) union { type *relptr_type; Size relptr_off; }
+
+#define relptr_declare(type, name) \
+ typedef union { type *relptr_type; Size relptr_off; } name;
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (__typeof__((rp).relptr_type)) ((rp).relptr_off == 0 ? NULL : \
+ (base + (rp).relptr_off)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (void *) ((rp).relptr_off == 0 ? NULL : (base + (rp).relptr_off)))
+#endif
+
+#define relptr_is_null(rp) \
+ ((rp).relptr_off == 0)
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ AssertVariableIsOfTypeMacro(val, __typeof__((rp).relptr_type)), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#endif
+
+#define relptr_copy(rp1, rp2) \
+ ((rp1).relptr_off = (rp2).relptr_off)
+
+#endif /* RELPTR_H */
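
As a rough illustration of the relative-pointer idea in relptr.h above, here is a minimal standalone sketch. It is not part of the patch: the toy_* names are hypothetical, and it uses a plain struct rather than the union-and-macro trick, so it gives up the type checking that relptr_access and relptr_store try to provide. What it does keep is the convention that an offset of 0 stands for NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for relptr(type): an offset from a base; 0 means NULL. */
typedef struct
{
    size_t      off;
} toy_relptr;

/* Record 'val' as an offset from 'base' (compare relptr_store). */
static void
toy_relptr_store(char *base, toy_relptr *rp, void *val)
{
    rp->off = (val == NULL) ? 0 : (size_t) ((char *) val - base);
}

/* Turn the stored offset back into a local address (compare relptr_access). */
static void *
toy_relptr_access(char *base, toy_relptr rp)
{
    return (rp.off == 0) ? NULL : (void *) (base + rp.off);
}

/*
 * Store a pointer relative to one mapping, then resolve it against a second
 * copy of the same bytes at a different address, as a second backend would.
 * Returns 1 if the resolved pointer lands on the same object.
 */
static int
toy_relptr_roundtrip(void)
{
    char        mapping_a[64];
    char        mapping_b[64];
    toy_relptr  rp;
    char       *p;

    memset(mapping_a, 0, sizeof(mapping_a));
    mapping_a[16] = 'x';
    toy_relptr_store(mapping_a, &rp, &mapping_a[16]);

    memcpy(mapping_b, mapping_a, sizeof(mapping_a));
    p = toy_relptr_access(mapping_b, rp);

    assert(p != NULL);
    return p == &mapping_b[16] && *p == 'x';
}
```

One consequence of reserving offset 0 for NULL is that an object placed exactly at the base address cannot be referenced this way.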
On Wed, Oct 5, 2016 at 11:28 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
[dsa-v3.patch]
Here is a new version which just adds CLOBBER_FREED_MEMORY support to dsa_free.
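
For anyone unfamiliar with the idea, clobbering means overwriting an object with a recognizable byte pattern at free time, so that reads through a stale pointer stand out in a debugger. The following is only a standalone sketch of the technique, not the code in the patch: the toy_* names are made up for illustration, and the 0x7F fill byte is an assumption borrowed from what the backend-local allocators do under CLOBBER_FREED_MEMORY.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: same fill byte as the backend-local allocators use. */
#define TOY_CLOBBER_BYTE 0x7F

/* Hypothetical helper: overwrite a freed object so stale reads are obvious. */
static void
toy_clobber_freed(void *object, size_t size)
{
    assert(object != NULL);
    memset(object, TOY_CLOBBER_BYTE, size);
}

/* Returns 1 if every byte of 'buf' carries the clobber pattern, else 0. */
static int
toy_is_clobbered(const void *buf, size_t size)
{
    const unsigned char *p = (const unsigned char *) buf;
    size_t      i;

    for (i = 0; i < size; i++)
    {
        if (p[i] != TOY_CLOBBER_BYTE)
            return 0;
    }
    return 1;
}
```

In a real allocator the clobber has to happen before the object is linked back onto a freelist, since the freelist link is stored inside the freed object itself.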
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
dsa-v4.patch (application/octet-stream)
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index 8a55392..e99ebd2 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -8,7 +8,7 @@ subdir = src/backend/storage/ipc
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
+OBJS = dsa.o dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
procsignal.o shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o \
sinvaladt.o standby.o
diff --git a/src/backend/storage/ipc/dsa.c b/src/backend/storage/ipc/dsa.c
new file mode 100644
index 0000000..3052ece
--- /dev/null
+++ b/src/backend/storage/ipc/dsa.c
@@ -0,0 +1,1960 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.c
+ * Dynamic shared memory areas.
+ *
+ * This module provides dynamic shared memory areas which are built on top of
+ * DSM segments. While dsm.c allows segments of shared memory to be created
+ * and shared between backends, it isn't designed to deal with small objects.
+ * A DSA area is a shared memory heap backed by one or more DSM segments, from
+ * which memory is allocated with dsa_allocate() and freed with dsa_free().
+ * Unlike the regular system heap, it deals in pseudo-pointers which must be
+ * converted to backend-local pointers before they are dereferenced. These
+ * pseudo-pointers can however be shared with other backends, and can be used
+ * to construct shared data structures.
+ *
+ * Each DSA area manages one or more DSM segments, adding new segments as
+ * required and detaching them when they are no longer needed. Each segment
+ * contains a number of 4KB pages, a free page manager for tracking
+ * consecutive runs of free pages, and a page map for tracking the source of
+ * objects allocated on each page. Allocation requests above 8KB are handled
+ * by choosing a segment and finding consecutive free pages in its free page
+ * manager. Allocation requests for smaller sizes are handled using pools of
+ * objects of a selection of sizes. Each pool consists of a number of 16 page
+ * (64KB) superblocks allocated in the same way as large objects. Allocation
+ * of large objects and new superblocks is serialized by a single LWLock, but
+ * allocation of small objects from pre-existing superblocks uses one LWLock
+ * per pool. Currently there is one pool, and therefore one lock, per size
+ * class. Per-core pools to increase concurrency and strategies for reducing
+ * the resulting fragmentation are areas for future research. Each superblock
+ * is managed with a 'span', which tracks the superblock's freelist. Free
+ * requests are handled by looking in the page map to find which span an
+ * address was allocated from, so that small objects can be returned to the
+ * appropriate free list, and large object pages can be returned directly to
+ * the free page map. When allocating, simple heuristics for selecting
+ * segments and superblocks try to encourage occupied memory to be
+ * concentrated, increasing the likelihood that whole superblocks can become
+ * empty and be returned to the free page manager, and whole segments can
+ * become empty and be returned to the operating system.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/ipc/dsa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/barrier.h"
+#include "storage/dsa.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "utils/freepage.h"
+#include "utils/memutils.h"
+
+/*
+ * The size of the initial DSM segment that backs a dsa_area. After creating
+ * some number of segments of this size we'll double the size, and so on.
+ * Larger segments may be created if necessary to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE (1 * 1024 * 1024)
+
+/*
+ * How many segments to create before we double the segment size. If this is
+ * low, then there is likely to be a lot of wasted space in the largest
+ * segment. If it is high, then we risk running out of segment slots (see
+ * dsm.c's limits on total number of segments), or limiting the total size
+ * an area can manage when using small pointers.
+ */
+#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 4
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
+#endif
+
+/*
+ * The maximum number of DSM segments that an area can own, determined by
+ * the number of bits remaining (but capped at 1024).
+ */
+#define DSA_MAX_SEGMENTS \
+ Min(1024, (1 << ((SIZEOF_DSA_POINTER * 8) - DSA_OFFSET_WIDTH)))
+
+/* The bitmask for extracting the offset from a dsa_pointer. */
+#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
+/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
+#define DSA_PAGES_PER_SUPERBLOCK 16
+
+/*
+ * A magic number used as a sanity check for following DSM segments belonging
+ * to a DSA area (this number will be XORed with the area handle and
+ * the segment index).
+ */
+#define DSA_SEGMENT_HEADER_MAGIC 0x0ce26608
+
+/* Build a dsa_pointer given a segment number and offset. */
+#define DSA_MAKE_POINTER(segment_number, offset) \
+ (((dsa_pointer) (segment_number) << DSA_OFFSET_WIDTH) | (offset))
+
+/* Extract the segment number from a dsa_pointer. */
+#define DSA_EXTRACT_SEGMENT_NUMBER(dp) ((dp) >> DSA_OFFSET_WIDTH)
+
+/* Extract the offset from a dsa_pointer. */
+#define DSA_EXTRACT_OFFSET(dp) ((dp) & DSA_OFFSET_BITMASK)
+
+/* The type used for segment indexes (zero based). */
+typedef Size dsa_segment_index;
+
+/* Sentinel value for dsa_segment_index indicating 'none' or 'end'. */
+#define DSA_SEGMENT_INDEX_NONE (~(dsa_segment_index)0)
+
+/*
+ * How many bins of segments do we have? The bins are used to categorize
+ * segments by their largest contiguous run of free pages.
+ */
+#define DSA_NUM_SEGMENT_BINS 16
+
+/*
+ * What is the lowest bin that holds segments that *might* have n contiguous
+ * free pages? There is no point in looking in segments in lower bins; they
+ * definitely can't service a request for n free pages.
+ */
+#define contiguous_pages_to_segment_bin(n) Min(fls(n), DSA_NUM_SEGMENT_BINS - 1)
+
+/* Macros for access to locks. */
+#define DSA_AREA_LOCK(area) (&area->control->lock)
+#define DSA_SCLASS_LOCK(area, sclass) (&area->control->pools[sclass].lock)
+
+/*
+ * The header for an individual segment. This lives at the start of each DSM
+ * segment owned by a DSA area including the first segment (where it appears
+ * as part of the dsa_area_control struct).
+ */
+typedef struct
+{
+ /* Sanity check magic value. */
+ uint32 magic;
+ /* Total number of pages in this segment (excluding metadata area). */
+ Size usable_pages;
+ /* Total size of this segment in bytes. */
+ Size size;
+
+ /*
+ * Index of the segment that precedes this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the first one.
+ */
+ dsa_segment_index prev;
+
+ /*
+ * Index of the segment that follows this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the last one.
+ */
+ dsa_segment_index next;
+ /* The index of the bin that contains this segment. */
+ Size bin;
+
+ /*
+ * A flag raised to indicate that this segment is being returned to the
+ * operating system and has been unpinned.
+ */
+ bool freed;
+} dsa_segment_header;
+
+/*
+ * Metadata for one superblock.
+ *
+ * For most blocks, span objects are stored out-of-line; that is, the span
+ * object is not stored within the block itself. But, as an exception, for a
+ * "span of spans", the span object is stored "inline". The allocation is
+ * always exactly one page, and the dsa_area_span object is located at
+ * the beginning of that page. The size class is DSA_SCLASS_BLOCK_OF_SPANS,
+ * and the remaining fields are used just as they would be in an ordinary
+ * block. We can't allocate spans out of ordinary superblocks because
+ * creating an ordinary superblock requires us to be able to allocate a span
+ * *first*. Doing it this way avoids that circularity.
+ */
+typedef struct
+{
+ dsa_pointer pool; /* Containing pool. */
+ dsa_pointer prevspan; /* Previous span. */
+ dsa_pointer nextspan; /* Next span. */
+ dsa_pointer start; /* Starting address. */
+ Size npages; /* Length of span in pages. */
+ uint16 size_class; /* Size class. */
+ uint16 ninitialized; /* Maximum number of objects ever allocated. */
+ uint16 nallocatable; /* Number of objects currently allocatable. */
+ uint16 firstfree; /* First object on free list. */
+ uint16 nmax; /* Maximum number of objects ever possible. */
+ uint16 fclass; /* Current fullness class. */
+} dsa_area_span;
+
+/*
+ * Given a pointer to an object in a span, access the index of the next free
+ * object in the same span (i.e. in the span's freelist) as an L-value.
+ */
+#define NextFreeObjectIndex(object) (* (uint16 *) (object))
+
+/*
+ * Small allocations are handled by dividing a single block of memory into
+ * many small objects of equal size. The possible allocation sizes are
+ * defined by the following array. Larger size classes are spaced more widely
+ * than smaller size classes. We fudge the spacing for size classes >1kB to
+ * avoid space wastage: based on the knowledge that we plan to allocate 64kB
+ * blocks, we bump the maximum object size up to the largest multiple of
+ * 8 bytes that still lets us fit the same number of objects into one block.
+ *
+ * NB: Because of this fudging, if we were ever to use differently-sized blocks
+ * for small allocations, these size classes would need to be reworked to be
+ * optimal for the new size.
+ *
+ * NB: The optimal spacing for size classes, as well as the size of the blocks
+ * out of which small objects are allocated, is not a question that has one
+ * right answer. Some allocators (such as tcmalloc) use more closely-spaced
+ * size classes than we do here, while others (like aset.c) use more
+ * widely-spaced classes. Spacing the classes more closely avoids wasting
+ * memory within individual chunks, but also means a larger number of
+ * potentially-unfilled blocks.
+ */
+static const uint16 dsa_size_classes[] = {
+ sizeof(dsa_area_span), 0, /* special size classes */
+ 8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
+ 80, 96, 112, 128, /* 4 classes separated by 16 bytes */
+ 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
+ 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
+ 640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
+ 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
+ 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
+ 5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
+};
+#define DSA_NUM_SIZE_CLASSES lengthof(dsa_size_classes)
+
+/* Special size classes. */
+#define DSA_SCLASS_BLOCK_OF_SPANS 0
+#define DSA_SCLASS_SPAN_LARGE 1
+
+/*
+ * The following lookup table is used to map the size of small objects
+ * (less than 1kB) onto the corresponding size class. To use this table,
+ * round the size of the object up to the next multiple of 8 bytes, and then
+ * index into this array.
+ */
+static char dsa_size_class_map[] = {
+ 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 11, 12, 12, 13, 13,
+ 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17,
+ 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19,
+ 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21,
+ 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
+ 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23,
+ 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+ 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25
+};
+#define DSA_SIZE_CLASS_MAP_QUANTUM 8
+
+/*
+ * Superblocks are binned by how full they are. Generally, each fullness
+ * class corresponds to one quartile, but the block being used for
+ * allocations is always at the head of the list for fullness class 1,
+ * regardless of how full it really is.
+ *
+ * For large objects, we just stick all of the allocations in fullness class
+ * 0. Since we can just return the space directly to the free page manager,
+ * we don't really need them on a list at all, except that if someone wants
+ * to bulk release everything allocated using this dsa_area, we
+ * have no other way of finding them.
+ */
+#define DSA_FULLNESS_CLASSES 4
+
+/*
+ * Maximum length of a DSA name.
+ */
+#define DSA_MAXLEN 64
+
+/*
+ * A dsa_area_pool represents a set of objects of a given size class.
+ *
+ * Perhaps there should be multiple pools for the same size class for
+ * contention avoidance, but for now there is just one!
+ */
+typedef struct
+{
+ /* A lock protecting access to this pool. */
+ LWLock lock;
+ /* A set of linked lists of spans, arranged by fullness. */
+ dsa_pointer spans[DSA_FULLNESS_CLASSES];
+ /* Should we pad this out to a cacheline boundary? */
+} dsa_area_pool;
+
+/*
+ * The control block for an area. This lives in shared memory, at the start of
+ * the first DSM segment controlled by this area.
+ */
+typedef struct
+{
+ /* The segment header for the first segment. */
+ dsa_segment_header segment_header;
+ /* The handle for this area. */
+ dsa_handle handle;
+ /* The handles of the segments owned by this area. */
+ dsm_handle segment_handles[DSA_MAX_SEGMENTS];
+ /* Lists of segments, binned by maximum contiguous run of free pages. */
+ dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
+ /* The object pools for each size class. */
+ dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* The total size of all active segments. */
+ Size total_segment_size;
+ /* The maximum total size of backing storage we are allowed. */
+ Size max_total_segment_size;
+ /* The reference count for this area. */
+ int refcnt;
+ /* A flag indicating that this area has been pinned. */
+ bool pinned;
+ /* The number of times that segments have been freed. */
+ Size freed_segment_counter;
+ /* The LWLock tranche ID. */
+ int lwlock_tranche_id;
+ char lwlock_tranche_name[DSA_MAXLEN];
+ /* The general lock (protects everything except object pools). */
+ LWLock lock;
+} dsa_area_control;
+
+/* Given a pointer to a pool, find a dsa_pointer. */
+#define DsaAreaPoolToDsaPointer(area, p) \
+ DSA_MAKE_POINTER(0, (char *) p - (char *) area->control)
+
+/*
+ * A dsa_segment_map is stored within the backend-private memory of each
+ * individual backend. It holds the base address of the segment within that
+ * backend, plus the addresses of key objects within the segment. Those
+ * could instead be derived from the base address but it's handy to have them
+ * around.
+ */
+typedef struct
+{
+ dsm_segment *segment; /* DSM segment */
+ char *mapped_address; /* Address at which segment is mapped */
+ Size size; /* Size of the segment */
+ dsa_segment_header *header; /* Header (same as mapped_address) */
+ FreePageManager *fpm; /* Free page manager within segment. */
+ dsa_pointer *pagemap; /* Page map within segment. */
+} dsa_segment_map;
+
+/*
+ * Per-backend state for a storage area. Backends obtain one of these by
+ * creating an area or attaching to an existing one using a handle. Each
+ * process that needs to use an area uses its own object to track where the
+ * segments are mapped.
+ */
+struct dsa_area
+{
+ /* Pointer to the control object in shared memory. */
+ dsa_area_control *control;
+
+ /* The lock tranche for this process. */
+ LWLockTranche lwlock_tranche;
+
+ /* Has the mapping been pinned? */
+ bool mapping_pinned;
+
+ /*
+ * This backend's array of segment maps, ordered by segment index
+ * corresponding to control->segment_handles. Some of the area's segments
+ * may not be mapped in this backend yet, and some slots may have been
+ * freed and need to be detached; these operations happen on demand.
+ */
+ dsa_segment_map segment_maps[DSA_MAX_SEGMENTS];
+
+ /* The last observed freed_segment_counter. */
+ Size freed_segment_counter;
+};
+
+#define DSA_SPAN_NOTHING_FREE ((uint16) -1)
+#define DSA_SUPERBLOCK_SIZE (DSA_PAGES_PER_SUPERBLOCK * FPM_PAGE_SIZE)
+
+/* Given a pointer to a segment_map, obtain a segment index number. */
+#define get_segment_index(area, segment_map_ptr) \
+ (segment_map_ptr - &area->segment_maps[0])
+
+static void init_span(dsa_area *area, dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class);
+static bool transfer_first_span(dsa_area *area, dsa_area_pool *pool,
+ int fromclass, int toclass);
+static inline dsa_pointer alloc_object(dsa_area *area, int size_class);
+static void dsa_on_dsm_segment_detach(dsm_segment *, Datum arg);
+static bool ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class);
+static dsa_segment_map *get_segment_by_index(dsa_area *area,
+ dsa_segment_index index);
+static void destroy_superblock(dsa_area *area, dsa_pointer span_pointer);
+static void unlink_span(dsa_area *area, dsa_area_span *span);
+static void add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer, int fclass);
+static void unlink_segment(dsa_area *area, dsa_segment_map *segment_map);
+static dsa_segment_map *get_best_segment(dsa_area *area, Size npages);
+static dsa_segment_map *make_new_segment(dsa_area *area, Size requested_pages);
+
+/*
+ * Create a new shared area with dynamic size. DSM segments will be allocated
+ * as required to extend the available space.
+ *
+ * We can't allocate a LWLock tranche_id within this function, because tranche
+ * IDs are a scarce resource; there are only 64k available, using low numbers
+ * when possible matters, and we have no provision for recycling them. So,
+ * we require the caller to provide one. The caller must also provide the
+ * tranche name, so that we can distinguish LWLocks belonging to different
+ * DSAs.
+ */
+dsa_area *
+dsa_create_dynamic(int tranche_id, const char *tranche_name)
+{
+ dsm_segment *segment;
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+ Size usable_pages;
+ Size total_pages;
+ Size metadata_bytes;
+ Size total_size;
+ int i;
+
+ total_size = DSA_INITIAL_SEGMENT_SIZE;
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages =
+ (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+
+ /*
+ * Create the DSM segment that will hold the shared control object and the
+ * first segment of usable space, and set it up. All segments backing
+ * this area are pinned, so that DSA can explicitly control their lifetime
+ * (otherwise a newly created segment belonging to this area might be
+ * freed when the only backend that happens to have it mapped in ends,
+ * corrupting the area).
+ */
+ segment = dsm_create(total_size, 0);
+ dsm_pin_segment(segment);
+
+ /*
+ * Initialize the dsa_area_control object located at the start of the
+ * segment.
+ */
+ control = dsm_segment_address(segment);
+ control->segment_header.magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ dsm_segment_handle(segment) ^ 0;
+ control->segment_header.next = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.usable_pages = usable_pages;
+ control->segment_header.freed = false;
+ control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+ control->handle = dsm_segment_handle(segment);
+ control->max_total_segment_size = SIZE_MAX;
+ control->total_segment_size = DSA_INITIAL_SEGMENT_SIZE;
+ memset(&control->segment_handles[0], 0,
+ sizeof(dsm_handle) * DSA_MAX_SEGMENTS);
+ control->segment_handles[0] = dsm_segment_handle(segment);
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ control->segment_bins[i] = DSA_SEGMENT_INDEX_NONE;
+ control->refcnt = 1;
+ control->freed_segment_counter = 0;
+ control->lwlock_tranche_id = tranche_id;
+ strlcpy(control->lwlock_tranche_name, tranche_name, DSA_MAXLEN);
+
+ /*
+ * Create the dsa_area object that this backend will use to access the
+ * area. Other backends will need to obtain their own dsa_area object by
+ * attaching.
+ */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(area->segment_maps, 0, sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+ LWLockInitialize(&control->lock, control->lwlock_tranche_id);
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ LWLockInitialize(DSA_SCLASS_LOCK(area, i),
+ control->lwlock_tranche_id);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = total_size;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Put this segment into the appropriate bin. */
+ control->segment_bins[contiguous_pages_to_segment_bin(usable_pages)] = 0;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_segment_detach, PointerGetDatum(NULL));
+
+ return area;
+}
+
+/*
+ * Obtain a handle that can be passed to other processes so that they can
+ * attach to the given area.
+ */
+dsa_handle
+dsa_get_handle(dsa_area *area)
+{
+ return area->control->handle;
+}
+
+/*
+ * Attach to an area given a handle generated (possibly in another
+ * process) by dsa_get_handle.
+ */
+dsa_area *
+dsa_attach_dynamic(dsa_handle handle)
+{
+ dsm_segment *segment;
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+
+ /*
+ * An area handle is really a DSM segment handle for the first segment, so
+ * we go ahead and attach to that.
+ */
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to area handle %u", handle);
+ control = dsm_segment_address(segment);
+ Assert(control->handle == handle);
+ Assert(control->segment_handles[0] == handle);
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ handle ^ 0));
+
+ /* Build the backend-local area object. */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(&area->segment_maps[0], 0,
+ sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = dsm_segment_map_length(segment);
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Bump the reference count. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ ++control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_segment_detach, PointerGetDatum(area));
+
+ return area;
+}
+
+/*
+ * Keep a DSA area attached until end of session or explicit detach.
+ *
+ * By default, areas are owned by the current resource owner, which means they
+ * are detached automatically when that scope ends.
+ */
+void
+dsa_pin_mapping(dsa_area *area)
+{
+ int i;
+
+ Assert(!area->mapping_pinned);
+ area->mapping_pinned = true;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_pin_mapping(area->segment_maps[i].segment);
+}
+
+/*
+ * Allocate memory in this storage area. The return value is a dsa_pointer
+ * that can be passed to other processes, and converted to a local pointer
+ * with dsa_get_address. If no memory is available, returns
+ * InvalidDsaPointer.
+ */
+dsa_pointer
+dsa_allocate(dsa_area *area, Size size)
+{
+ uint16 size_class;
+ dsa_pointer start_pointer;
+ dsa_segment_map *segment_map;
+
+ Assert(size > 0);
+
+ /*
+ * If bigger than the largest size class, just grab a run of pages from
+ * the free page manager, instead of allocating an object from a pool.
+ * There will still be a span, but it's a special class of span that
+ * manages this whole allocation and simply gives all pages back to the
+ * free page manager when dsa_free is called.
+ */
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ {
+ Size npages = fpm_size_to_pages(size);
+ Size first_page;
+ dsa_pointer span_pointer;
+ dsa_area_pool *pool = &area->control->pools[DSA_SCLASS_SPAN_LARGE];
+
+ /* Obtain a span object. */
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return InvalidDsaPointer;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+
+ /* Find a segment from which to allocate. */
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ /* Can't make any more segments: game over. */
+ LWLockRelease(DSA_AREA_LOCK(area));
+ dsa_free(area, span_pointer);
+ return InvalidDsaPointer;
+ }
+
+ /*
+ * Ask the free page manager for a run of pages. This should always
+ * succeed, since both get_best_segment and make_new_segment should
+ * only return a non-NULL pointer if the segment actually contains enough
+ * contiguous free space. If it does fail, something in our backend
+ * private state is out of whack, so use FATAL to kill the process.
+ */
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ elog(FATAL,
+ "dsa couldn't find run of pages: fpm_largest out of sync");
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ start_pointer = DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /* Initialize span and pagemap. */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ init_span(area, span_pointer, pool, start_pointer, npages,
+ DSA_SCLASS_SPAN_LARGE);
+ segment_map->pagemap[first_page] = span_pointer;
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+
+ return start_pointer;
+ }
+
+ /* Map allocation to a size class. */
+ if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+ Assert(size <= dsa_size_classes[size_class]);
+ Assert(size_class == 0 || size > dsa_size_classes[size_class - 1]);
+
+ /*
+ * Attempt to allocate an object from the appropriate pool. This might
+ * return InvalidDsaPointer if there's no space available.
+ */
+ return alloc_object(area, size_class);
+}
+
+/*
+ * Free memory obtained with dsa_allocate.
+ */
+void
+dsa_free(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_map *segment_map;
+ int pageno;
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ char *superblock;
+ char *object;
+ Size size;
+ int size_class;
+
+ /* Locate the object, span and pool. */
+ segment_map = get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(dp));
+ pageno = DSA_EXTRACT_OFFSET(dp) / FPM_PAGE_SIZE;
+ span_pointer = segment_map->pagemap[pageno];
+ span = dsa_get_address(area, span_pointer);
+ superblock = dsa_get_address(area, span->start);
+ object = dsa_get_address(area, dp);
+ size_class = span->size_class;
+ size = dsa_size_classes[size_class];
+
+ /*
+ * Special case for large objects that live in a special span: we return
+ * those pages directly to the free page manager and free the span.
+ */
+ if (span->size_class == DSA_SCLASS_SPAN_LARGE)
+ {
+
+#ifdef CLOBBER_FREED_MEMORY
+ memset(object, 0x7f, span->npages * FPM_PAGE_SIZE);
+#endif
+
+ /* Give pages back to free page manager. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Unlink span. */
+ /* TODO: Does it even need to be linked in in the first place? */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ unlink_span(area, span);
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+ /* Free the span object so it can be reused. */
+ dsa_free(area, span_pointer);
+ return;
+ }
+
+#ifdef CLOBBER_FREED_MEMORY
+ memset(object, 0x7f, size);
+#endif
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /* Put the object on the span's freelist. */
+ Assert(object >= superblock);
+ Assert(object < superblock + DSA_SUPERBLOCK_SIZE);
+ Assert((object - superblock) % size == 0);
+ NextFreeObjectIndex(object) = span->firstfree;
+ span->firstfree = (object - superblock) / size;
+ ++span->nallocatable;
+
+ /*
+ * See if the span needs to be moved to a different fullness class, or be
+ * freed so its pages can be given back to the segment.
+ */
+ if (span->nallocatable == 1 && span->fclass == DSA_FULLNESS_CLASSES - 1)
+ {
+ /*
+ * The block was completely full and is located in the
+ * highest-numbered fullness class, which is never scanned for free
+ * chunks. We must move it to the next-lower fullness class.
+ */
+ unlink_span(area, span);
+ add_span_to_fullness_class(area, span, span_pointer,
+ DSA_FULLNESS_CLASSES - 2);
+
+ /*
+ * If this is the only span, and there is no active span, then we
+ * should probably move this span to fullness class 1. (Otherwise
+ * if you allocate exactly all the objects in the only span, it moves
+ * to class 3, then you free them all, it moves to 2, and then is
+ * given back, leaving no active span).
+ */
+ }
+ else if (span->nallocatable == span->nmax &&
+ (span->fclass != 1 || span->prevspan != InvalidDsaPointer))
+ {
+ /*
+ * This entire block is free, and it's not the active block for this
+ * size class. Return the memory to the free page manager. We don't
+ * do this for the active block to avoid thrashing: if we repeatedly
+ * allocate and free the only chunk in the active block, it would be
+ * very inefficient to deallocate and reallocate the block every
+ * time.
+ */
+ destroy_superblock(area, span_pointer);
+ }
+
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+}
+
+/*
+ * Obtain a backend-local address for a dsa_pointer. 'dp' must have been
+ * allocated by the given area (possibly in another process). This may cause
+ * a segment to be mapped into the current process.
+ */
+void *
+dsa_get_address(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_index index;
+ Size offset;
+ Size freed_segment_counter;
+
+ /* Convert InvalidDsaPointer to NULL. */
+ if (!DsaPointerIsValid(dp))
+ return NULL;
+
+ index = DSA_EXTRACT_SEGMENT_NUMBER(dp);
+ offset = DSA_EXTRACT_OFFSET(dp);
+
+ Assert(index < DSA_MAX_SEGMENTS);
+
+ /* Check if we need to cause this segment to be mapped in. */
+ if (area->segment_maps[index].mapped_address == NULL)
+ {
+ /* Call for effect (we don't need the result). */
+ get_segment_by_index(area, index);
+ }
+
+ /*
+ * Take this opportunity to check if we need to detach from any segments
+ * that have been freed. This is an unsynchronized read of the value in
+ * shared memory, but all that matters is that we eventually observe a
+ * change when that number moves.
+ */
+ freed_segment_counter = area->control->freed_segment_counter;
+ if (area->freed_segment_counter != freed_segment_counter)
+ {
+ int i;
+
+ /* Check all currently mapped segments to find what's been freed. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ if (area->segment_maps[i].header != NULL &&
+ area->segment_maps[i].header->freed)
+ {
+ dsm_detach(area->segment_maps[i].segment);
+ area->segment_maps[i].segment = NULL;
+ area->segment_maps[i].header = NULL;
+ area->segment_maps[i].mapped_address = NULL;
+ }
+ }
+ area->freed_segment_counter = freed_segment_counter;
+ }
+
+ return area->segment_maps[index].mapped_address + offset;
+}
+
+/*
+ * Pin this area, so that it will continue to exist even if all backends
+ * detach from it. In that case, the area can still be reattached to if a
+ * handle has been recorded somewhere.
+ */
+void
+dsa_pin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ if (area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_pin: area already pinned");
+ }
+ area->control->pinned = true;
+ ++area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Undo the effects of dsa_pin, so that the given area can be freed when no
+ * backends are attached to it. May be called only if dsa_pin has been
+ * called.
+ */
+void
+dsa_unpin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ Assert(area->control->refcnt > 1);
+ if (!area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_unpin: area not pinned");
+ }
+ area->control->pinned = false;
+ --area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Set the total size limit for this area. This limit is checked whenever new
+ * segments need to be allocated from the operating system. If the new size
+ * limit is already exceeded, this has no immediate effect.
+ *
+ * Note that the total virtual memory usage may be temporarily larger than
+ * this limit when segments have been freed, but not yet detached by all
+ * backends that have attached to them.
+ */
+void
+dsa_set_size_limit(dsa_area *area, Size limit)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ area->control->max_total_segment_size = limit;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Aggressively free all spare memory in the hope of returning DSM segments to
+ * the operating system.
+ */
+void
+dsa_trim(dsa_area *area)
+{
+ int size_class;
+
+ /*
+ * Trim in reverse pool order so we get to the spans-of-spans last, just
+ * in case any become entirely free while processing all the other pools.
+ */
+ for (size_class = DSA_NUM_SIZE_CLASSES - 1; size_class >= 0; --size_class)
+ {
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_pointer span_pointer;
+
+ if (size_class == DSA_SCLASS_SPAN_LARGE)
+ /* Large object frees give back segments aggressively already. */
+ continue;
+
+ /*
+ * Search fullness class 1 only. That is where we expect to find
+ * an entirely empty superblock (entirely empty superblocks in other
+ * fullness classes are returned to the free page manager by dsa_free).
+ */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+ span_pointer = pool->spans[1];
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ dsa_pointer next = span->nextspan;
+
+ if (span->nallocatable == span->nmax)
+ destroy_superblock(area, span_pointer);
+
+ span_pointer = next;
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+ }
+}
+
+/*
+ * Print out debugging information about the internal state of the shared
+ * memory area.
+ */
+void
+dsa_dump(dsa_area *area)
+{
+ Size i,
+ j;
+
+ /*
+ * Note: This gives an inconsistent snapshot as it acquires and releases
+ * individual locks as it goes...
+ */
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ fprintf(stderr, "dsa_area handle %x:\n", area->control->handle);
+ fprintf(stderr, " max_total_segment_size: %zu\n",
+ area->control->max_total_segment_size);
+ fprintf(stderr, " total_segment_size: %zu\n",
+ area->control->total_segment_size);
+ fprintf(stderr, " refcnt: %d\n", area->control->refcnt);
+ fprintf(stderr, " pinned: %c\n", area->control->pinned ? 't' : 'f');
+ fprintf(stderr, " segment bins:\n");
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ {
+ if (area->control->segment_bins[i] != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_index segment_index;
+
+ fprintf(stderr,
+ " segment bin %zu (at least %d contiguous pages free):\n",
+ i, i == 0 ? 0 : 1 << (i - 1));
+ segment_index = area->control->segment_bins[i];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, segment_index);
+
+ fprintf(stderr,
+ " segment index %zu, usable_pages = %zu, "
+ "contiguous_pages = %zu, mapped at %p\n",
+ segment_index,
+ segment_map->header->usable_pages,
+ fpm_largest(segment_map->fpm),
+ segment_map->mapped_address);
+ segment_index = segment_map->header->next;
+ }
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ fprintf(stderr, " pools:\n");
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ {
+ bool found = false;
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, i), LW_EXCLUSIVE);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ if (DsaPointerIsValid(area->control->pools[i].spans[j]))
+ found = true;
+ if (found)
+ {
+ if (i == DSA_SCLASS_BLOCK_OF_SPANS)
+ fprintf(stderr, " pool for blocks of span objects:\n");
+ else if (i == DSA_SCLASS_SPAN_LARGE)
+ fprintf(stderr, " pool for large object spans:\n");
+ else
+ fprintf(stderr,
+ " pool for size class %zu (object size %hu bytes):\n",
+ i, dsa_size_classes[i]);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ {
+ if (!DsaPointerIsValid(area->control->pools[i].spans[j]))
+ fprintf(stderr, " fullness class %zu is empty\n", j);
+ else
+ {
+ dsa_pointer span_pointer = area->control->pools[i].spans[j];
+
+ fprintf(stderr, " fullness class %zu:\n", j);
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span;
+
+ span = dsa_get_address(area, span_pointer);
+ fprintf(stderr,
+ " span descriptor at %016lx, "
+ "superblock at %016lx, pages = %zu, "
+ "objects free = %hu/%hu\n",
+ span_pointer, span->start, span->npages,
+ span->nallocatable, span->nmax);
+ span_pointer = span->nextspan;
+ }
+ }
+ }
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, i));
+ }
+}
+
+/*
+ * A callback function for when the control segment for a dsa_area is
+ * detached.
+ */
+static void
+dsa_on_dsm_segment_detach(dsm_segment *segment, Datum arg)
+{
+ bool destroy = false;
+ dsa_area_control *control =
+ (dsa_area_control *) dsm_segment_address(segment);
+
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ control->handle ^ 0));
+
+ /* Decrement the reference count for the DSA area. */
+ LWLockAcquire(&control->lock, LW_EXCLUSIVE);
+ if (--control->refcnt == 0)
+ destroy = true;
+ LWLockRelease(&control->lock);
+
+ /*
+ * If we are the last to detach from the area, then we must unpin all
+ * segments so they can be returned to the OS.
+ */
+ if (destroy)
+ {
+ int i;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ dsm_handle handle;
+
+ handle = control->segment_handles[i];
+ if (handle != DSM_HANDLE_INVALID)
+ dsm_unpin_segment(handle);
+ }
+ }
+}
+
+/*
+ * Add a new span to fullness class 1 of the indicated pool.
+ */
+static void
+init_span(dsa_area *area,
+ dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ Size obsize = dsa_size_classes[size_class];
+
+ /*
+ * The per-pool lock must be held because we manipulate the span list for
+ * this pool.
+ */
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /* Push this span onto the front of the span list for fullness class 1. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ {
+ dsa_area_span *head = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+
+ head->prevspan = span_pointer;
+ }
+ span->pool = DsaAreaPoolToDsaPointer(area, pool);
+ span->nextspan = pool->spans[1];
+ span->prevspan = InvalidDsaPointer;
+ pool->spans[1] = span_pointer;
+
+ span->start = start;
+ span->npages = npages;
+ span->size_class = size_class;
+ span->ninitialized = 0;
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * A block-of-spans contains its own descriptor, so mark one object as
+ * initialized and reduce the count of allocatable objects by one.
+ * Doing this here has the side effect of also reducing nmax by one,
+ * which is important to make sure we free this object at the correct
+ * time.
+ */
+ span->ninitialized = 1;
+ span->nallocatable = FPM_PAGE_SIZE / obsize - 1;
+ }
+ else if (size_class != DSA_SCLASS_SPAN_LARGE)
+ span->nallocatable = DSA_SUPERBLOCK_SIZE / obsize;
+ span->firstfree = DSA_SPAN_NOTHING_FREE;
+ span->nmax = span->nallocatable;
+ span->fclass = 1;
+}
+
+/*
+ * Transfer the first span in one fullness class to the head of another
+ * fullness class.
+ */
+static bool
+transfer_first_span(dsa_area *area,
+ dsa_area_pool *pool, int fromclass, int toclass)
+{
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+
+ /* Can't do it if source list is empty. */
+ span_pointer = pool->spans[fromclass];
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+
+ /* Remove span from head of source list. */
+ span = dsa_get_address(area, span_pointer);
+ pool->spans[fromclass] = span->nextspan;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+
+ /* Add span to head of target list. */
+ span->nextspan = pool->spans[toclass];
+ pool->spans[toclass] = span_pointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = toclass;
+
+ return true;
+}
+
+/*
+ * Allocate one object of the requested size class from the given area.
+ */
+static inline dsa_pointer
+alloc_object(dsa_area *area, int size_class)
+{
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_area_span *span;
+ dsa_pointer block;
+ dsa_pointer result;
+ char *object;
+ Size size;
+
+ /*
+ * Even though ensure_active_superblock can in turn call alloc_object if
+ * it needs to allocate a new span, that's always from a different pool,
+ * and the order of lock acquisition is always the same, so it's OK that
+ * we hold this lock for the duration of this function.
+ */
+ Assert(!LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /*
+ * If there's no active superblock, we must successfully obtain one or
+ * fail the request.
+ */
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ !ensure_active_superblock(area, pool, size_class))
+ {
+ result = InvalidDsaPointer;
+ }
+ else
+ {
+ /*
+ * There should be a block in fullness class 1 at this point, and it
+ * should never be completely full. Thus we can either pop an object
+ * from the free list or, failing that, initialize a new object.
+ */
+ Assert(DsaPointerIsValid(pool->spans[1]));
+ span = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+ Assert(span->nallocatable > 0);
+ block = span->start;
+ Assert(size_class < DSA_NUM_SIZE_CLASSES);
+ size = dsa_size_classes[size_class];
+ if (span->firstfree != DSA_SPAN_NOTHING_FREE)
+ {
+ result = block + span->firstfree * size;
+ object = dsa_get_address(area, result);
+ span->firstfree = NextFreeObjectIndex(object);
+ }
+ else
+ {
+ result = block + span->ninitialized * size;
+ ++span->ninitialized;
+ }
+ --span->nallocatable;
+
+ /* If it's now full, move it to the highest-numbered fullness class. */
+ if (span->nallocatable == 0)
+ transfer_first_span(area, pool, 1, DSA_FULLNESS_CLASSES - 1);
+ }
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+
+ return result;
+}
+
+/*
+ * Ensure an active (i.e. fullness class 1) superblock, unless all existing
+ * superblocks are completely full and no more can be allocated.
+ *
+ * Fullness classes K of 0..N are loosely intended to represent blocks whose
+ * utilization percentage is at least K/N, but we only enforce this rigorously
+ * for the highest-numbered fullness class, which always contains exactly
+ * those blocks that are completely full. It's otherwise acceptable for a
+ * block to be in a higher-numbered fullness class than the one to which it
+ * logically belongs. In addition, the active block, which is always the
+ * first block in fullness class 1, is permitted to have a higher allocation
+ * percentage than would normally be allowable for that fullness class; we
+ * don't move it until it's completely full, and then it goes to the
+ * highest-numbered fullness class.
+ *
+ * It might seem odd that the active block is the head of fullness class 1
+ * rather than fullness class 0, but experience with other allocators has
+ * shown that it's usually better to allocate from a block that's moderately
+ * full rather than one that's nearly empty. Insofar as is reasonably
+ * possible, we want to avoid performing new allocations in a block that would
+ * otherwise become empty soon.
+ */
+static bool
+ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class)
+{
+ dsa_pointer span_pointer;
+ dsa_pointer start_pointer;
+ Size obsize = dsa_size_classes[size_class];
+ Size nmax;
+ int fclass;
+ Size npages = 1;
+ Size first_page;
+ Size i;
+ dsa_segment_map *segment_map;
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /*
+ * Compute the number of objects that will fit in a block of this size
+ * class. A block-of-spans is just a single page, and its first object
+ * isn't available for use because it describes the block-of-spans
+ * itself.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ nmax = FPM_PAGE_SIZE / obsize - 1;
+ else
+ nmax = DSA_SUPERBLOCK_SIZE / obsize;
+
+ /*
+ * If fullness class 1 is empty, try to find a span to put in it by
+ * scanning higher-numbered fullness classes (excluding the last one,
+ * whose blocks are certain to all be completely full).
+ */
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ {
+ span_pointer = pool->spans[fclass];
+
+ while (DsaPointerIsValid(span_pointer))
+ {
+ int tfclass;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+ dsa_area_span *prevspan;
+ dsa_pointer next_span_pointer;
+
+ span = (dsa_area_span *)
+ dsa_get_address(area, span_pointer);
+ next_span_pointer = span->nextspan;
+
+ /* Figure out what fullness class should contain this span. */
+ tfclass = (nmax - span->nallocatable)
+ * (DSA_FULLNESS_CLASSES - 1) / nmax;
+
+ /* Look up next span. */
+ if (DsaPointerIsValid(span->nextspan))
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ else
+ nextspan = NULL;
+
+ /*
+ * If utilization has dropped enough that this now belongs in some
+ * other fullness class, move it there.
+ */
+ if (tfclass < fclass)
+ {
+ /* Remove from the current fullness class list. */
+ if (pool->spans[fclass] == span_pointer)
+ {
+ /* It was the head; remove it. */
+ Assert(!DsaPointerIsValid(span->prevspan));
+ pool->spans[fclass] = span->nextspan;
+ if (nextspan != NULL)
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+ else
+ {
+ /* It was not the head. */
+ Assert(DsaPointerIsValid(span->prevspan));
+ prevspan = (dsa_area_span *)
+ dsa_get_address(area, span->prevspan);
+ prevspan->nextspan = span->nextspan;
+ }
+ if (nextspan != NULL)
+ nextspan->prevspan = span->prevspan;
+
+ /* Push onto the head of the new fullness class list. */
+ span->nextspan = pool->spans[tfclass];
+ pool->spans[tfclass] = span_pointer;
+ span->prevspan = InvalidDsaPointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = tfclass;
+ }
+
+ /* Advance to next span on list. */
+ span_pointer = next_span_pointer;
+ }
+
+ /* Stop now if we found a suitable block. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ return true;
+ }
+
+ /*
+ * If there are no blocks that properly belong in fullness class 1, pick
+ * one from some other fullness class and move it there anyway, so that we
+ * have an allocation target. Our last choice is to transfer a block
+ * that's almost empty (and might become completely empty soon if left
+ * alone), but even that is better than failing, which is what we must do
+ * if there are no blocks at all with freespace.
+ */
+ Assert(!DsaPointerIsValid(pool->spans[1]));
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ if (transfer_first_span(area, pool, fclass, 1))
+ return true;
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ transfer_first_span(area, pool, 0, 1))
+ return true;
+
+ /*
+ * We failed to find an existing span with free objects, so we need to
+ * allocate a new superblock and construct a new span to manage it.
+ *
+ * First, get a dsa_area_span object to describe the new superblock
+ * ... unless this allocation is for a dsa_area_span object, in which case
+ * that's surely not going to work. We handle that case by storing the
+ * span describing a block-of-spans inline.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+ npages = DSA_PAGES_PER_SUPERBLOCK;
+ }
+
+ /* Find or create a segment and allocate the superblock. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ segment_map = make_new_segment(area, npages);
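+ if (segment_map == NULL)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Don't leak the span object allocated above. */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }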
+ }
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* Compute the start of the superblock. */
+ start_pointer =
+ DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /*
+ * If this is a block-of-spans, carve the descriptor right out of the
+ * allocated space.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * The span descriptor is stored at the start of the block itself, so
+ * the span pointer is simply the dsa_pointer to the start of the block.
+ */
+ span_pointer = start_pointer;
+ }
+
+ /* Initialize span and pagemap. */
+ init_span(area, span_pointer, pool, start_pointer, npages, size_class);
+ for (i = 0; i < npages; ++i)
+ segment_map->pagemap[first_page + i] = span_pointer;
+
+ return true;
+}
+
+/*
+ * Return the segment map corresponding to a given segment index, mapping the
+ * segment in if necessary.
+ */
+static dsa_segment_map *
+get_segment_by_index(dsa_area *area, dsa_segment_index index)
+{
+ if (area->segment_maps[index].mapped_address == NULL) /* unlikely */
+ {
+ dsm_handle handle;
+ dsm_segment *segment;
+ dsa_segment_map *segment_map;
+
+ handle = area->control->segment_handles[index];
+
+ /* This slot has been freed. */
+ if (handle == DSM_HANDLE_INVALID)
+ return NULL;
+
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to segment");
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+ segment_map = &area->segment_maps[index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header =
+ (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = dsm_segment_map_length(segment);
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ Assert(segment_map->header->magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index));
+ }
+
+ return &area->segment_maps[index];
+}
+
+/*
+ * Return a superblock to the free page manager. If the underlying segment
+ * has become entirely free, then return it to the operating system.
+ *
+ * The appropriate pool lock must be held.
+ */
+static void
+destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ int size_class = span->size_class;
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
+
+ /* Remove it from its fullness class list. */
+ unlink_span(area, span);
+
+ /*
+ * Note: Here we acquire the area lock while we already hold a per-pool
+ * lock. We never hold the area lock and then take a pool lock, or we
+ * could deadlock.
+ */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ /* Check if the segment is now entirely free. */
+ if (fpm_largest(segment_map->fpm) == segment_map->header->usable_pages)
+ {
+ dsa_segment_index index = get_segment_index(area, segment_map);
+
+ /* If it's not the segment with extra control data, free it. */
+ if (index != 0)
+ {
+ /*
+ * Give it back to the OS, and allow other backends to detect that
+ * they need to detach.
+ */
+ unlink_segment(area, segment_map);
+ segment_map->header->freed = true;
+ Assert(area->control->total_segment_size >=
+ segment_map->header->size);
+ area->control->total_segment_size -=
+ segment_map->header->size;
+ dsm_unpin_segment(dsm_segment_handle(segment_map->segment));
+ dsm_detach(segment_map->segment);
+ area->control->segment_handles[index] = DSM_HANDLE_INVALID;
+ ++area->control->freed_segment_counter;
+ segment_map->segment = NULL;
+ segment_map->header = NULL;
+ segment_map->mapped_address = NULL;
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /*
+ * Span-of-spans blocks store the span which describes them within the
+ * block itself, so freeing the storage implicitly frees the descriptor
+ * also. If this is a block of any other type, we need to separately free
+ * the span object also. This recursive call to dsa_free will acquire the
+ * span pool's lock. We can't deadlock because the acquisition order is
+ * always some other pool and then the span pool.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+}
+
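+/*
+ * Remove a span from the fullness class list that currently contains it.
+ * The lock for the span's pool must be held.
+ */
+static void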
+unlink_span(dsa_area *area, dsa_area_span *span)
+{
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ dsa_area_span *next = dsa_get_address(area, span->nextspan);
+
+ next->prevspan = span->prevspan;
+ }
+ if (DsaPointerIsValid(span->prevspan))
+ {
+ dsa_area_span *prev = dsa_get_address(area, span->prevspan);
+
+ prev->nextspan = span->nextspan;
+ }
+ else
+ {
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ pool->spans[span->fclass] = span->nextspan;
+ }
+}
+
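+/*
+ * Push a span onto the front of the given fullness class list of its pool.
+ * The lock for the span's pool must be held.
+ */
+static void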
+add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer,
+ int fclass)
+{
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ if (DsaPointerIsValid(pool->spans[fclass]))
+ {
+ dsa_area_span *head = dsa_get_address(area,
+ pool->spans[fclass]);
+
+ head->prevspan = span_pointer;
+ }
+ span->prevspan = InvalidDsaPointer;
+ span->nextspan = pool->spans[fclass];
+ pool->spans[fclass] = span_pointer;
+ span->fclass = fclass;
+}
+
+/*
+ * Detach from an area that was either created or attached to by this process.
+ */
+void
+dsa_detach(dsa_area *area)
+{
+ int i;
+
+ /* Detach from all segments. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_detach(area->segment_maps[i].segment);
+
+ /* Free the backend-local area object. */
+ pfree(area);
+}
+
+/*
+ * Unlink a segment from the bin that contains it.
+ */
+static void
+unlink_segment(dsa_area *area, dsa_segment_map *segment_map)
+{
+ if (segment_map->header->prev != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *prev;
+
+ prev = get_segment_by_index(area, segment_map->header->prev);
+ prev->header->next = segment_map->header->next;
+ }
+ else
+ {
+ Assert(area->control->segment_bins[segment_map->header->bin] ==
+ get_segment_index(area, segment_map));
+ area->control->segment_bins[segment_map->header->bin] =
+ segment_map->header->next;
+ }
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area, segment_map->header->next);
+ next->header->prev = segment_map->header->prev;
+ }
+}
+
+/*
+ * Find a segment that could satisfy a request for 'npages' of contiguous
+ * memory, or return NULL if none can be found. This may involve attaching to
+ * segments that weren't previously attached so that we can query their free
+ * pages map.
+ */
+static dsa_segment_map *
+get_best_segment(dsa_area *area, Size npages)
+{
+ Size bin;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /*
+ * Start searching from the first bin that *might* have enough contiguous
+ * pages.
+ */
+ for (bin = contiguous_pages_to_segment_bin(npages);
+ bin < DSA_NUM_SEGMENT_BINS;
+ ++bin)
+ {
+ /*
+ * The minimum contiguous size that any segment in this bin should
+ * have. We'll re-bin if we see segments with fewer.
+ */
+ Size threshold = 1 << (bin - 1);
+ dsa_segment_index segment_index;
+
+ /* Search this bin for a segment with enough contiguous space. */
+ segment_index = area->control->segment_bins[bin];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+ dsa_segment_index next_segment_index;
+ Size contiguous_pages;
+
+ segment_map = get_segment_by_index(area, segment_index);
+ next_segment_index = segment_map->header->next;
+ contiguous_pages = fpm_largest(segment_map->fpm);
+
+ /* Not enough for the request, still enough for this bin. */
+ if (contiguous_pages >= threshold && contiguous_pages < npages)
+ {
+ segment_index = next_segment_index;
+ continue;
+ }
+
+ /* Re-bin it if it's no longer in the appropriate bin. */
+ if (contiguous_pages < threshold)
+ {
+ Size new_bin;
+
+ new_bin = contiguous_pages_to_segment_bin(contiguous_pages);
+
+ /* Remove it from its current bin. */
+ unlink_segment(area, segment_map);
+
+ /* Push it onto the front of its new bin. */
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[new_bin];
+ segment_map->header->bin = new_bin;
+ area->control->segment_bins[new_bin] = segment_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area,
+ segment_map->header->next);
+ Assert(next->header->bin == new_bin);
+ next->header->prev = segment_index;
+ }
+
+ /*
+ * But fall through to see if it's enough to satisfy this
+ * request anyway....
+ */
+ }
+
+ /* Check if we are done. */
+ if (contiguous_pages >= npages)
+ return segment_map;
+
+ /* Continue searching the same bin. */
+ segment_index = next_segment_index;
+ }
+ }
+
+ /* Not found. */
+ return NULL;
+}
+
+/*
+ * Create a new segment that can handle at least requested_pages. Returns
+ * NULL if the requested total size limit or maximum allowed number of
+ * segments would be exceeded.
+ */
+static dsa_segment_map *
+make_new_segment(dsa_area *area, Size requested_pages)
+{
+ dsa_segment_index new_index;
+ Size metadata_bytes;
+ Size total_size;
+ Size total_pages;
+ Size usable_pages;
+ dsa_segment_map *segment_map;
+ dsm_segment *segment;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /* Find a segment slot that is not in use (linearly for now). */
+ for (new_index = 1; new_index < DSA_MAX_SEGMENTS; ++new_index)
+ {
+ if (area->control->segment_handles[new_index] == DSM_HANDLE_INVALID)
+ break;
+ }
+ if (new_index == DSA_MAX_SEGMENTS)
+ return NULL;
+
+ /*
+ * If the total size limit is already exceeded, then we exit early and
+ * avoid arithmetic wraparound in the unsigned expressions below.
+ */
+ if (area->control->total_segment_size >=
+ area->control->max_total_segment_size)
+ return NULL;
+
+ /*
+ * The size should be at least as big as requested, and at least big
+ * enough to follow a geometric series that approximately doubles the
+ * total storage each time we create a new segment. We use geometric
+ * growth because the underlying DSM system isn't designed for large
+ * numbers of segments (otherwise we might even consider just using one
+ * DSM segment for each large allocation and for each superblock, and then
+ * we wouldn't need to use FreePageManager).
+ *
+ * We decide on a total segment size first, so that we produce tidy
+ * power-of-two sized segments. This is a good property to have if we
+ * move to huge pages in the future. Then we work back to the number of
+ * pages we can fit.
+ */
+ total_size = DSA_INITIAL_SEGMENT_SIZE *
+ ((Size) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
+ total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size,
+ area->control->max_total_segment_size -
+ area->control->total_segment_size);
+
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ sizeof(dsa_pointer) * total_pages;
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ if (total_size <= metadata_bytes)
+ return NULL;
+ usable_pages = (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ Assert(metadata_bytes + usable_pages * FPM_PAGE_SIZE <= total_size);
+
+ /* See if that is enough... */
+ if (requested_pages > usable_pages)
+ {
+ /*
+ * We'll make an odd-sized segment, working forward from the requested
+ * number of pages.
+ */
+ usable_pages = requested_pages;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ usable_pages * sizeof(dsa_pointer);
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ total_size = metadata_bytes + usable_pages * FPM_PAGE_SIZE;
+
+ /* Is that too large for dsa_pointer's addressing scheme? */
+ if (total_size > DSA_MAX_SEGMENT_SIZE)
+ return NULL;
+
+ /* Would that exceed the limit? */
+ if (total_size > area->control->max_total_segment_size -
+ area->control->total_segment_size)
+ return NULL;
+ }
+
+ /* Create the segment. */
+ segment = dsm_create(total_size, 0);
+ if (segment == NULL)
+ return NULL;
+ dsm_pin_segment(segment);
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+
+ /* Store the handle in shared memory to be found by index. */
+ area->control->segment_handles[new_index] =
+ dsm_segment_handle(segment);
+
+ area->control->total_segment_size += total_size;
+ Assert(area->control->total_segment_size <=
+ area->control->max_total_segment_size);
+
+ /* Build a segment map for this segment in this backend. */
+ segment_map = &area->segment_maps[new_index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = total_size;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Set up the segment header and put it in the appropriate bin. */
+ segment_map->header->magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ new_index;
+ segment_map->header->usable_pages = usable_pages;
+ segment_map->header->size = total_size;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[segment_map->header->bin];
+ segment_map->header->freed = false;
+ area->control->segment_bins[segment_map->header->bin] = new_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next =
+ get_segment_by_index(area, segment_map->header->next);
+
+ Assert(next->header->bin == segment_map->header->bin);
+ next->header->prev = new_index;
+ }
+
+ return segment_map;
+}
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index d806664..8c6abe3 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -182,7 +182,7 @@ dsm_postmaster_startup(PGShmemHeader *shim)
Assert(dsm_control_address == NULL);
Assert(dsm_control_mapped_size == 0);
dsm_control_handle = random();
- if (dsm_control_handle == 0)
+ if (dsm_control_handle == DSM_HANDLE_INVALID)
continue;
if (dsm_impl_op(DSM_OP_CREATE, dsm_control_handle, segsize,
&dsm_control_impl_private, &dsm_control_address,
@@ -476,6 +476,8 @@ dsm_create(Size size, int flags)
{
Assert(seg->mapped_address == NULL && seg->mapped_size == 0);
seg->handle = random();
+ if (seg->handle == DSM_HANDLE_INVALID) /* Reserve sentinel */
+ continue;
if (dsm_impl_op(DSM_OP_CREATE, seg->handle, size, &seg->impl_private,
&seg->mapped_address, &seg->mapped_size, ERROR))
break;
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index b2403e1..20973af 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/mmgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = aset.o mcxt.o portalmem.o
+OBJS = aset.o freepage.o mcxt.o portalmem.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
new file mode 100644
index 0000000..fd1f2ec
--- /dev/null
+++ b/src/backend/utils/mmgr/freepage.c
@@ -0,0 +1,1812 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.c
+ * Management of free memory pages.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/freepage.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+
+#include "utils/freepage.h"
+#include "utils/relptr.h"
+
+
+/* Magic numbers to identify various page types */
+#define FREE_PAGE_SPAN_LEADER_MAGIC 0xea4020f0
+#define FREE_PAGE_LEAF_MAGIC 0x98eae728
+#define FREE_PAGE_INTERNAL_MAGIC 0x19aa32c9
+
+/* Doubly linked list of spans of free pages; stored in first page of span. */
+struct FreePageSpanLeader
+{
+ int magic; /* always FREE_PAGE_SPAN_LEADER_MAGIC */
+ Size npages; /* number of pages in span */
+ RelptrFreePageSpanLeader prev;
+ RelptrFreePageSpanLeader next;
+};
+
+/* Common header for btree leaf and internal pages. */
+typedef struct FreePageBtreeHeader
+{
+ int magic; /* FREE_PAGE_LEAF_MAGIC or
+ * FREE_PAGE_INTERNAL_MAGIC */
+ Size nused; /* number of items used */
+ RelptrFreePageBtree parent; /* uplink */
+} FreePageBtreeHeader;
+
+/* Internal key; points to next level of btree. */
+typedef struct FreePageBtreeInternalKey
+{
+ Size first_page; /* low bound for keys on child page */
+ RelptrFreePageBtree child; /* downlink */
+} FreePageBtreeInternalKey;
+
+/* Leaf key; no payload data. */
+typedef struct FreePageBtreeLeafKey
+{
+ Size first_page; /* first page in span */
+ Size npages; /* number of pages in span */
+} FreePageBtreeLeafKey;
+
+/* Work out how many keys will fit on a page. */
+#define FPM_ITEMS_PER_INTERNAL_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeInternalKey))
+#define FPM_ITEMS_PER_LEAF_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeLeafKey))
+
+/* A btree page of either sort */
+struct FreePageBtree
+{
+ FreePageBtreeHeader hdr;
+ union
+ {
+ FreePageBtreeInternalKey internal_key[FPM_ITEMS_PER_INTERNAL_PAGE];
+ FreePageBtreeLeafKey leaf_key[FPM_ITEMS_PER_LEAF_PAGE];
+ } u;
+};
+
+/* Results of a btree search */
+typedef struct FreePageBtreeSearchResult
+{
+ FreePageBtree *page;
+ Size index;
+ bool found;
+ unsigned split_pages;
+} FreePageBtreeSearchResult;
+
+/* Helper functions */
+static void FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm,
+ FreePageBtree *btp);
+static Size FreePageBtreeCleanup(FreePageManager *fpm);
+static FreePageBtree *FreePageBtreeFindLeftSibling(char *base,
+ FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeFindRightSibling(char *base,
+ FreePageBtree *btp);
+static Size FreePageBtreeFirstKey(FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeGetRecycled(FreePageManager *fpm);
+static void FreePageBtreeInsertInternal(char *base, FreePageBtree *btp,
+ Size index, Size first_page, FreePageBtree *child);
+static void FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index,
+ Size first_page, Size npages);
+static void FreePageBtreeRecycle(FreePageManager *fpm, Size pageno);
+static void FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp,
+ Size index);
+static void FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp);
+static void FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result);
+static Size FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page);
+static Size FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page);
+static FreePageBtree *FreePageBtreeSplitPage(FreePageManager *fpm,
+ FreePageBtree *btp);
+static void FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp);
+static void FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf);
+static void FreePageManagerDumpSpans(FreePageManager *fpm,
+ FreePageSpanLeader *span, Size expected_pages,
+ StringInfo buf);
+static bool FreePageManagerGetInternal(FreePageManager *fpm, Size npages,
+ Size *first_page);
+static Size FreePageManagerPutInternal(FreePageManager *fpm, Size first_page,
+ Size npages, bool soft);
+static void FreePagePopSpanLeader(FreePageManager *fpm, Size pageno);
+static void FreePagePushSpanLeader(FreePageManager *fpm, Size first_page,
+ Size npages);
+static void FreePageManagerUpdateLargest(FreePageManager *fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+static Size sum_free_pages(FreePageManager *fpm);
+#endif
+
+/*
+ * Initialize a new, empty free page manager.
+ *
+ * 'fpm' should reference caller-provided memory large enough to contain a
+ * FreePageManager. We'll initialize it here.
+ *
+ * 'base' is the address to which all pointers are relative. When managing
+ * a dynamic shared memory segment, it should normally be the base of the
+ * segment. When managing backend-private memory, it can be either NULL or,
+ * if managing a single contiguous extent of memory, the start of that extent.
+ */
+void
+FreePageManagerInitialize(FreePageManager *fpm, char *base)
+{
+ Size f;
+
+ relptr_store(base, fpm->self, fpm);
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ relptr_store(base, fpm->btree_recycle, (FreePageSpanLeader *) NULL);
+ fpm->btree_depth = 0;
+ fpm->btree_recycle_count = 0;
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->contiguous_pages = 0;
+ fpm->contiguous_pages_dirty = true;
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages = 0;
+#endif
+
+ for (f = 0; f < FPM_NUM_FREELISTS; f++)
+ relptr_store(base, fpm->freelist[f], (FreePageSpanLeader *) NULL);
+}
+
+/*
+ * Allocate a run of pages of the given length from the free page manager.
+ * The return value indicates whether we were able to satisfy the request;
+ * if true, the first page of the allocation is stored in *first_page.
+ */
+bool
+FreePageManagerGet(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ bool result;
+
+ result = FreePageManagerGetInternal(fpm, npages, first_page);
+
+ /*
+ * It's a bit counterintuitive, but allocating pages can actually create
+ * opportunities for cleanup that create larger ranges. We might pull a
+ * key out of the btree that enables the item at the head of the btree
+ * recycle list to be inserted; and then if there are more items behind it
+ * one of those might cause two currently-separated ranges to merge,
+ * creating a single range of contiguous pages larger than any that
+ * existed previously. It might be worth trying to improve the cleanup
+ * algorithm to avoid such corner cases, but for now we just notice the
+ * condition and do the appropriate reporting.
+ */
+ FreePageBtreeCleanup(fpm);
+
+ /*
+ * TODO: We could take Max(fpm->contiguous_pages, result of
+ * FreePageBtreeCleanup) and give it to FreePageManagerUpdateLargest as a
+ * starting point for its search, potentially avoiding a bunch of work,
+ * since there is no way the largest contiguous run is bigger than that.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ if (result)
+ {
+ Assert(fpm->free_pages >= npages);
+ fpm->free_pages -= npages;
+ }
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+ return result;
+}
+
+#ifdef FPM_EXTRA_ASSERTS
+static void
+sum_free_pages_recurse(FreePageManager *fpm, FreePageBtree *btp, Size *sum)
+{
+ char *base = fpm_segment_base(fpm);
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ||
+ btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ ++*sum;
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ Size index;
+
+
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ sum_free_pages_recurse(fpm, child, sum);
+ }
+ }
+}
+static Size
+sum_free_pages(FreePageManager *fpm)
+{
+ FreePageSpanLeader *recycle;
+ char *base = fpm_segment_base(fpm);
+ Size sum = 0;
+ int list;
+
+ /* Count the spans by scanning the freelists. */
+ for (list = 0; list < FPM_NUM_FREELISTS; ++list)
+ {
+
+ if (!relptr_is_null(fpm->freelist[list]))
+ {
+ FreePageSpanLeader *candidate =
+ relptr_access(base, fpm->freelist[list]);
+
+ do
+ {
+ sum += candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ }
+
+ /* Count the pages used by the btree itself (internal and leaf). */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ sum_free_pages_recurse(fpm, root, &sum);
+ }
+
+ /* Count the recycle list. */
+ for (recycle = relptr_access(base, fpm->btree_recycle);
+ recycle != NULL;
+ recycle = relptr_access(base, recycle->next))
+ {
+ Assert(recycle->npages == 1);
+ ++sum;
+ }
+
+ return sum;
+}
+#endif
+
+/*
+ * Recompute the size of the largest run of pages that the user could
+ * successfully get, if it has been marked dirty.
+ */
+static void
+FreePageManagerUpdateLargest(FreePageManager *fpm)
+{
+ char *base;
+ Size largest;
+
+ if (!fpm->contiguous_pages_dirty)
+ return;
+
+ base = fpm_segment_base(fpm);
+ largest = 0;
+ if (!relptr_is_null(fpm->freelist[FPM_NUM_FREELISTS - 1]))
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[FPM_NUM_FREELISTS - 1]);
+ do
+ {
+ if (candidate->npages > largest)
+ largest = candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ else
+ {
+ Size f = FPM_NUM_FREELISTS - 1;
+
+ do
+ {
+ --f;
+ if (!relptr_is_null(fpm->freelist[f]))
+ {
+ largest = f + 1;
+ break;
+ }
+ } while (f > 0);
+ }
+
+ fpm->contiguous_pages = largest;
+ fpm->contiguous_pages_dirty = false;
+}
+
+/*
+ * Transfer a run of pages to the free page manager.
+ */
+void
+FreePageManagerPut(FreePageManager *fpm, Size first_page, Size npages)
+{
+ Size contiguous_pages;
+
+ Assert(npages > 0);
+
+ /* Record the new pages. */
+ contiguous_pages =
+ FreePageManagerPutInternal(fpm, first_page, npages, false);
+
+ /*
+ * If the new range we inserted into the page manager was contiguous with
+ * an existing range, it may have opened up cleanup opportunities.
+ */
+ if (contiguous_pages > npages)
+ {
+ Size cleanup_contiguous_pages;
+
+ cleanup_contiguous_pages = FreePageBtreeCleanup(fpm);
+ if (cleanup_contiguous_pages > contiguous_pages)
+ contiguous_pages = cleanup_contiguous_pages;
+ }
+
+ /*
+ * TODO: Figure out how to avoid setting this every time. It may not be as
+ * simple as it looks.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages += npages;
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+}
+
+/*
+ * Produce a debugging dump of the state of a free page manager.
+ */
+char *
+FreePageManagerDump(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ StringInfoData buf;
+ FreePageSpanLeader *recycle;
+ bool dumped_any_freelist = false;
+ Size f;
+
+ /* Initialize output buffer. */
+ initStringInfo(&buf);
+
+ /* Dump general stuff. */
+ appendStringInfo(&buf, "metadata: self %zu max contiguous pages = %zu\n",
+ fpm->self.relptr_off, fpm->contiguous_pages);
+
+ /* Dump btree. */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root;
+
+ appendStringInfo(&buf, "btree depth %u:\n", fpm->btree_depth);
+ root = relptr_access(base, fpm->btree_root);
+ FreePageManagerDumpBtree(fpm, root, NULL, 0, &buf);
+ }
+ else if (fpm->singleton_npages > 0)
+ {
+ appendStringInfo(&buf, "singleton: %zu(%zu)\n",
+ fpm->singleton_first_page, fpm->singleton_npages);
+ }
+
+ /* Dump btree recycle list. */
+ recycle = relptr_access(base, fpm->btree_recycle);
+ if (recycle != NULL)
+ {
+ appendStringInfo(&buf, "btree recycle:");
+ FreePageManagerDumpSpans(fpm, recycle, 1, &buf);
+ }
+
+ /* Dump free lists. */
+ for (f = 0; f < FPM_NUM_FREELISTS; ++f)
+ {
+ FreePageSpanLeader *span;
+
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+ if (!dumped_any_freelist)
+ {
+ appendStringInfo(&buf, "freelists:\n");
+ dumped_any_freelist = true;
+ }
+ appendStringInfo(&buf, " %zu:", f + 1);
+ span = relptr_access(base, fpm->freelist[f]);
+ FreePageManagerDumpSpans(fpm, span, f + 1, &buf);
+ }
+
+ /* And return result to caller. */
+ return buf.data;
+}
+
+
+/*
+ * The first_page value stored at index zero in any non-root page must match
+ * the first_page value stored in its parent at the index which points to that
+ * page. So when the value stored at index zero in a btree page changes, we've
+ * got to walk up the tree adjusting ancestor keys until we reach an ancestor
+ * where that key isn't index zero. This function should be called after
+ * updating the first key on the target page; it will propagate the change
+ * upward as far as needed.
+ *
+ * We assume here that the first key on the page has not changed enough to
+ * require changes in the ordering of keys on its ancestor pages. Thus,
+ * if we search the parent page for the first key greater than or equal to
+ * the first key on the current page, the downlink to this page will be either
+ * the exact index returned by the search (if the first key decreased)
+ * or one less (if the first key increased).
+ */
+static void
+FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ Size first_page;
+ FreePageBtree *parent;
+ FreePageBtree *child;
+
+ /* This might be either a leaf or an internal page. */
+ Assert(btp->hdr.nused > 0);
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ first_page = btp->u.leaf_key[0].first_page;
+ }
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ first_page = btp->u.internal_key[0].first_page;
+ }
+ child = btp;
+
+ /* Loop until we find an ancestor that does not require adjustment. */
+ for (;;)
+ {
+ Size s;
+
+ parent = relptr_access(base, child->hdr.parent);
+ if (parent == NULL)
+ break;
+ s = FreePageBtreeSearchInternal(parent, first_page);
+
+ /* Key is either at index s or index s-1; figure out which. */
+ if (s >= parent->hdr.nused)
+ {
+ Assert(s == parent->hdr.nused);
+ --s;
+ }
+ else
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ if (check != child)
+ {
+ Assert(s > 0);
+ --s;
+ }
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* Debugging double-check. */
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ Assert(s < parent->hdr.nused);
+ Assert(child == check);
+ }
+#endif
+
+ /* Update the parent key. */
+ parent->u.internal_key[s].first_page = first_page;
+
+ /*
+ * If this is the first key in the parent, go up another level; else
+ * done.
+ */
+ if (s > 0)
+ break;
+ child = parent;
+ }
+}
+
+/*
+ * Attempt to reclaim space from the free-page btree. The return value is
+ * the largest range of contiguous pages created by the cleanup operation.
+ */
+static Size
+FreePageBtreeCleanup(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ Size max_contiguous_pages = 0;
+
+ /* Attempt to shrink the depth of the btree. */
+ while (!relptr_is_null(fpm->btree_root))
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ /* If the root contains only one key, reduce depth by one. */
+ if (root->hdr.nused == 1)
+ {
+ /* Shrink depth of tree by one. */
+ Assert(fpm->btree_depth > 0);
+ --fpm->btree_depth;
+ if (root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ /* If root is a leaf, convert only entry to singleton range. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages;
+ }
+ else
+ {
+ FreePageBtree *newroot;
+
+ /* If root is an internal page, make only child the root. */
+ Assert(root->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ relptr_copy(fpm->btree_root, root->u.internal_key[0].child);
+ newroot = relptr_access(base, fpm->btree_root);
+ relptr_store(base, newroot->hdr.parent, (FreePageBtree *) NULL);
+ }
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, root));
+ }
+ else if (root->hdr.nused == 2 &&
+ root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Size end_of_first;
+ Size start_of_second;
+
+ end_of_first = root->u.leaf_key[0].first_page +
+ root->u.leaf_key[0].npages;
+ start_of_second = root->u.leaf_key[1].first_page;
+
+ if (end_of_first + 1 == start_of_second)
+ {
+ Size root_page = fpm_pointer_to_page(base, root);
+
+ if (end_of_first == root_page)
+ {
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[0].first_page);
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[1].first_page);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages +
+ root->u.leaf_key[1].npages + 1;
+ fpm->btree_depth = 0;
+ relptr_store(base, fpm->btree_root,
+ (FreePageBtree *) NULL);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ Assert(max_contiguous_pages == 0);
+ max_contiguous_pages = fpm->singleton_npages;
+ }
+ }
+
+ /* Whether it worked or not, it's time to stop. */
+ break;
+ }
+ else
+ {
+ /* Nothing more to do. Stop. */
+ break;
+ }
+ }
+
+ /*
+ * Attempt to free recycled btree pages. We skip this if releasing the
+ * recycled page would require a btree page split, because the page we're
+ * trying to recycle would be consumed by the split, which would be
+ * counterproductive.
+ *
+ * We also currently only ever attempt to recycle the first page on the
+ * list; that could be made more aggressive, but it's not clear that the
+ * complexity would be worthwhile.
+ */
+ while (fpm->btree_recycle_count > 0)
+ {
+ FreePageBtree *btp;
+ Size first_page;
+ Size contiguous_pages;
+
+ btp = FreePageBtreeGetRecycled(fpm);
+ first_page = fpm_pointer_to_page(base, btp);
+ contiguous_pages = FreePageManagerPutInternal(fpm, first_page, 1, true);
+ if (contiguous_pages == 0)
+ {
+ FreePageBtreeRecycle(fpm, first_page);
+ break;
+ }
+ else
+ {
+ if (contiguous_pages > max_contiguous_pages)
+ max_contiguous_pages = contiguous_pages;
+ }
+ }
+
+ return max_contiguous_pages;
+}
+
+/*
+ * Consider consolidating the given page with its left or right sibling,
+ * if it's fairly empty.
+ */
+static void
+FreePageBtreeConsolidate(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *np;
+ Size max;
+
+ /*
+ * We only try to consolidate pages that are less than a third full. We
+ * could be more aggressive about this, but that might risk performing
+ * consolidation only to end up splitting again shortly thereafter. Since
+ * the btree should be very small compared to the space under management,
+ * our goal isn't so much to ensure that it always occupies the absolutely
+ * smallest possible number of pages as to reclaim pages before things get
+ * too egregiously out of hand.
+ */
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ max = FPM_ITEMS_PER_LEAF_PAGE;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ max = FPM_ITEMS_PER_INTERNAL_PAGE;
+ }
+ if (btp->hdr.nused >= max / 3)
+ return;
+
+ /*
+ * If we can fit our right sibling's keys onto this page, consolidate.
+ */
+ np = FreePageBtreeFindRightSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&btp->u.leaf_key[btp->hdr.nused], &np->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ }
+ else
+ {
+ memcpy(&btp->u.internal_key[btp->hdr.nused], &np->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, btp);
+ }
+ FreePageBtreeRemovePage(fpm, np);
+ return;
+ }
+
+ /*
+ * If we can fit our keys onto our left sibling's page, consolidate. In
+ * this case, we move our keys onto the other page rather than vice
+ * versa, to avoid having to adjust ancestor keys.
+ */
+ np = FreePageBtreeFindLeftSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&np->u.leaf_key[np->hdr.nused], &btp->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ }
+ else
+ {
+ memcpy(&np->u.internal_key[np->hdr.nused], &btp->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, np);
+ }
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+}
+
+/*
+ * Find the passed page's left sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately precedes ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindLeftSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move left. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the leftmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index > 0)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index - 1].child);
+ break;
+ }
+ Assert(index == 0);
+ ++levels;
+ }
+
+ /* Descend right. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[p->hdr.nused - 1].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Find the passed page's right sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately follows ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindRightSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move right. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the rightmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index < p->hdr.nused - 1)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index + 1].child);
+ break;
+ }
+ Assert(index == p->hdr.nused - 1);
+ ++levels;
+ }
+
+ /* Descend left. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[0].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Get the first key on a btree page.
+ */
+static Size
+FreePageBtreeFirstKey(FreePageBtree *btp)
+{
+ Assert(btp->hdr.nused > 0);
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ return btp->u.leaf_key[0].first_page;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ return btp->u.internal_key[0].first_page;
+ }
+}
+
+/*
+ * Get a page from the btree recycle list for use as a btree page.
+ */
+static FreePageBtree *
+FreePageBtreeGetRecycled(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *newhead;
+
+ Assert(victim != NULL);
+ newhead = relptr_access(base, victim->next);
+ if (newhead != NULL)
+ relptr_copy(newhead->prev, victim->prev);
+ relptr_store(base, fpm->btree_recycle, newhead);
+ Assert(fpm_pointer_is_page_aligned(base, victim));
+ fpm->btree_recycle_count--;
+ return (FreePageBtree *) victim;
+}
+
+/*
+ * Insert an item into an internal page.
+ */
+static void
+FreePageBtreeInsertInternal(char *base, FreePageBtree *btp, Size index,
+ Size first_page, FreePageBtree *child)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.internal_key[index + 1], &btp->u.internal_key[index],
+ sizeof(FreePageBtreeInternalKey) * (btp->hdr.nused - index));
+ btp->u.internal_key[index].first_page = first_page;
+ relptr_store(base, btp->u.internal_key[index].child, child);
+ ++btp->hdr.nused;
+}
+
+/*
+ * Insert an item into a leaf page.
+ */
+static void
+FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index, Size first_page,
+ Size npages)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.leaf_key[index + 1], &btp->u.leaf_key[index],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+ btp->u.leaf_key[index].first_page = first_page;
+ btp->u.leaf_key[index].npages = npages;
+ ++btp->hdr.nused;
+}
+
+/*
+ * Put a page on the btree recycle list.
+ */
+static void
+FreePageBtreeRecycle(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *head = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = 1;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->btree_recycle, span);
+ fpm->btree_recycle_count++;
+}
+
+/*
+ * Remove an item from the btree at the given position on the given page.
+ */
+static void
+FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp, Size index)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(index < btp->hdr.nused);
+
+ /* When last item is removed, extirpate entire page from btree. */
+ if (btp->hdr.nused == 1)
+ {
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+
+ /* Physically remove the key from the page. */
+ --btp->hdr.nused;
+ if (index < btp->hdr.nused)
+ memmove(&btp->u.leaf_key[index], &btp->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+
+ /* If we just removed the first key, adjust ancestor keys. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, btp);
+
+ /* Consider whether to consolidate this page with a sibling. */
+ FreePageBtreeConsolidate(fpm, btp);
+}
+
+/*
+ * Remove a page from the btree. Caller is responsible for having relocated
+ * any keys from this page that are still wanted. The page is placed on the
+ * recycled list.
+ */
+static void
+FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *parent;
+ Size index;
+ Size first_page;
+
+ for (;;)
+ {
+ /* Find parent page. */
+ parent = relptr_access(base, btp->hdr.parent);
+ if (parent == NULL)
+ {
+ /* We are removing the root page. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->btree_depth = 0;
+ Assert(fpm->singleton_first_page == 0);
+ Assert(fpm->singleton_npages == 0);
+ return;
+ }
+
+ /*
+ * If the parent contains only one item, we need to remove it as well.
+ */
+ if (parent->hdr.nused > 1)
+ break;
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+ btp = parent;
+ }
+
+ /* Find and remove the downlink. */
+ first_page = FreePageBtreeFirstKey(btp);
+ if (parent->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ index = FreePageBtreeSearchLeaf(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.leaf_key[index],
+ &parent->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ else
+ {
+ index = FreePageBtreeSearchInternal(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.internal_key[index],
+ &parent->u.internal_key[index + 1],
+ sizeof(FreePageBtreeInternalKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ parent->hdr.nused--;
+ Assert(parent->hdr.nused > 0);
+
+ /* Recycle the page. */
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+
+ /* Adjust ancestor keys if needed. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+
+ /* Consider whether to consolidate the parent with a sibling. */
+ FreePageBtreeConsolidate(fpm, parent);
+}
+
+/*
+ * Search the btree for an entry for the given first page and initialize
+ * *result with the results of the search. result->page and result->index
+ * indicate either the position of an exact match or the position at which
+ * the new key should be inserted. result->found is true for an exact match,
+ * otherwise false. result->split_pages will contain the number of additional
+ * btree pages that will be needed when performing a split to insert a key.
+ * Except as described above, the contents of fields in the result object are
+ * undefined on return.
+ */
+static void
+FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *btp = relptr_access(base, fpm->btree_root);
+ Size index;
+
+ result->split_pages = 1;
+
+ /* If the btree is empty, there's nothing to find. */
+ if (btp == NULL)
+ {
+ result->page = NULL;
+ result->found = false;
+ return;
+ }
+
+ /* Descend until we hit a leaf. */
+ while (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ FreePageBtree *child;
+ bool found_exact;
+
+ index = FreePageBtreeSearchInternal(btp, first_page);
+ found_exact = index < btp->hdr.nused &&
+ btp->u.internal_key[index].first_page == first_page;
+
+ /*
+ * If we found an exact match we descend directly. Otherwise, we
+ * descend into the child to the left if possible so that we can find
+ * the insertion point at that child's high end.
+ */
+ if (!found_exact && index > 0)
+ --index;
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_INTERNAL_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Descend to appropriate child page. */
+ Assert(index < btp->hdr.nused);
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ Assert(relptr_access(base, child->hdr.parent) == btp);
+ btp = child;
+ }
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_LEAF_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_LEAF_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Search leaf page. */
+ index = FreePageBtreeSearchLeaf(btp, first_page);
+
+ /* Assemble results. */
+ result->page = btp;
+ result->index = index;
+ result->found = index < btp->hdr.nused &&
+ first_page == btp->u.leaf_key[index].first_page;
+}
+
+/*
+ * Search an internal page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if there is none.
+ */
+static Size
+FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_INTERNAL_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.internal_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Search a leaf page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if there is none.
+ */
+static Size
+FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_LEAF_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.leaf_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Allocate a new btree page and move half the keys from the provided page
+ * to the new page. Caller is responsible for making sure that there's a
+ * page available from fpm->btree_recycle. Returns a pointer to the new page,
+ * to which caller must add a downlink.
+ */
+static FreePageBtree *
+FreePageBtreeSplitPage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ FreePageBtree *newsibling;
+
+ newsibling = FreePageBtreeGetRecycled(fpm);
+ newsibling->hdr.magic = btp->hdr.magic;
+ newsibling->hdr.nused = btp->hdr.nused / 2;
+ relptr_copy(newsibling->hdr.parent, btp->hdr.parent);
+ btp->hdr.nused -= newsibling->hdr.nused;
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ memcpy(&newsibling->u.leaf_key,
+ &btp->u.leaf_key[btp->hdr.nused],
+ sizeof(FreePageBtreeLeafKey) * newsibling->hdr.nused);
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ memcpy(&newsibling->u.internal_key,
+ &btp->u.internal_key[btp->hdr.nused],
+ sizeof(FreePageBtreeInternalKey) * newsibling->hdr.nused);
+ FreePageBtreeUpdateParentPointers(fpm_segment_base(fpm), newsibling);
+ }
+
+ return newsibling;
+}
+
+/*
+ * When internal pages are split or merged, the parent pointers of their
+ * children must be updated.
+ */
+static void
+FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp)
+{
+ Size i;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ for (i = 0; i < btp->hdr.nused; ++i)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[i].child);
+ relptr_store(base, child->hdr.parent, btp);
+ }
+}
+
+/*
+ * Debugging dump of btree data.
+ */
+static void
+FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+ Size pageno = fpm_pointer_to_page(base, btp);
+ Size index;
+ FreePageBtree *check_parent;
+
+ check_stack_depth();
+ check_parent = relptr_access(base, btp->hdr.parent);
+ appendStringInfo(buf, " %zu@%d %c", pageno, level,
+ btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ? 'i' : 'l');
+ if (parent != check_parent)
+ appendStringInfo(buf, " [actual parent %zu, expected %zu]",
+ fpm_pointer_to_page(base, check_parent),
+ fpm_pointer_to_page(base, parent));
+ appendStringInfoChar(buf, ':');
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ appendStringInfo(buf, " %zu->%zu",
+ btp->u.internal_key[index].first_page,
+ btp->u.internal_key[index].child.relptr_off / FPM_PAGE_SIZE);
+ else
+ appendStringInfo(buf, " %zu(%zu)",
+ btp->u.leaf_key[index].first_page,
+ btp->u.leaf_key[index].npages);
+ }
+ appendStringInfo(buf, "\n");
+
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ FreePageManagerDumpBtree(fpm, child, btp, level + 1, buf);
+ }
+ }
+}
+
+/*
+ * Debugging dump of free-span data.
+ */
+static void
+FreePageManagerDumpSpans(FreePageManager *fpm, FreePageSpanLeader *span,
+ Size expected_pages, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+
+ while (span != NULL)
+ {
+ if (span->npages != expected_pages)
+ appendStringInfo(buf, " %zu(%zu)", fpm_pointer_to_page(base, span),
+ span->npages);
+ else
+ appendStringInfo(buf, " %zu", fpm_pointer_to_page(base, span));
+ span = relptr_access(base, span->next);
+ }
+
+ appendStringInfo(buf, "\n");
+}
+
+/*
+ * This function allocates a run of pages of the given length from the free
+ * page manager.
+ */
+static bool
+FreePageManagerGetInternal(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = NULL;
+ FreePageSpanLeader *prev;
+ FreePageSpanLeader *next;
+ FreePageBtreeSearchResult result;
+ Size victim_page = 0; /* placate compiler */
+ Size f;
+
+ /*
+ * Search for a free span.
+ *
+ * Right now, we use a simple best-fit policy here, but it's possible for
+ * this to result in memory fragmentation if we're repeatedly asked to
+ * allocate chunks just a little smaller than what we have available.
+ * Hopefully, this is unlikely, because we expect most requests to be
+ * single pages or superblock-sized chunks -- but no policy can be optimal
+ * under all circumstances unless it has knowledge of future allocation
+ * patterns.
+ */
+ for (f = Min(npages, FPM_NUM_FREELISTS) - 1; f < FPM_NUM_FREELISTS; ++f)
+ {
+ /* Skip empty freelists. */
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+
+ /*
+ * All of the freelists except the last one contain only items of a
+ * single size, so we just take the first one. But the final free
+ * list contains everything too big for any of the other lists, so we
+ * need to search the list.
+ */
+ if (f < FPM_NUM_FREELISTS - 1)
+ victim = relptr_access(base, fpm->freelist[f]);
+ else
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[f]);
+ do
+ {
+ if (candidate->npages >= npages && (victim == NULL ||
+ victim->npages > candidate->npages))
+ {
+ victim = candidate;
+ if (victim->npages == npages)
+ break;
+ }
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ break;
+ }
+
+ /* If we didn't find an allocatable span, return failure. */
+ if (victim == NULL)
+ return false;
+
+ /* Remove span from free list. */
+ Assert(victim->magic == FREE_PAGE_SPAN_LEADER_MAGIC);
+ prev = relptr_access(base, victim->prev);
+ next = relptr_access(base, victim->next);
+ if (prev != NULL)
+ relptr_copy(prev->next, victim->next);
+ else
+ relptr_copy(fpm->freelist[f], victim->next);
+ if (next != NULL)
+ relptr_copy(next->prev, victim->prev);
+ victim_page = fpm_pointer_to_page(base, victim);
+
+ /*
+ * If we haven't initialized the btree yet, the victim must be the single
+ * span stored within the FreePageManager itself. Otherwise, we need to
+ * update the btree.
+ */
+ if (relptr_is_null(fpm->btree_root))
+ {
+ Assert(victim_page == fpm->singleton_first_page);
+ Assert(victim->npages == fpm->singleton_npages);
+ Assert(victim->npages >= npages);
+ fpm->singleton_first_page += npages;
+ fpm->singleton_npages -= npages;
+ if (fpm->singleton_npages > 0)
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ }
+ else
+ {
+ /*
+ * If the span we found is exactly the right size, remove it from the
+ * btree completely. Otherwise, adjust the btree entry to reflect the
+ * still-unallocated portion of the span, and put that portion on the
+ * appropriate free list.
+ */
+ FreePageBtreeSearch(fpm, victim_page, &result);
+ Assert(result.found);
+ if (victim->npages == npages)
+ FreePageBtreeRemove(fpm, result.page, result.index);
+ else
+ {
+ FreePageBtreeLeafKey *key;
+
+ /* Adjust btree to reflect remaining pages. */
+ Assert(victim->npages > npages);
+ key = &result.page->u.leaf_key[result.index];
+ Assert(key->npages == victim->npages);
+ key->first_page += npages;
+ key->npages -= npages;
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put the unallocated pages back on the appropriate free list. */
+ FreePagePushSpanLeader(fpm, victim_page + npages,
+ victim->npages - npages);
+ }
+ }
+
+ /* Return results to caller. */
+ *first_page = fpm_pointer_to_page(base, victim);
+ return true;
+}
+
+/*
+ * Put a range of pages into the btree and freelists, consolidating it with
+ * existing free spans just before and/or after it. If 'soft' is true,
+ * only perform the insertion if it can be done without allocating new btree
+ * pages; if false, do it always. Returns 0 if the soft flag caused the
+ * insertion to be skipped, or otherwise the size of the contiguous span
+ * created by the insertion. This may be larger than npages if we're able
+ * to consolidate with an adjacent range. fpm->contiguous_pages_dirty is set
+ * to true if the btree may allocate pages for internal purposes, which might
+ * invalidate the current largest run, requiring it to be recomputed.
+ */
+static Size
+FreePageManagerPutInternal(FreePageManager *fpm, Size first_page, Size npages,
+ bool soft)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtreeSearchResult result;
+ FreePageBtreeLeafKey *prevkey = NULL;
+ FreePageBtreeLeafKey *nextkey = NULL;
+ FreePageBtree *np;
+ Size nindex;
+
+ Assert(npages > 0);
+
+ /* We can store a single free span without initializing the btree. */
+ if (fpm->btree_depth == 0)
+ {
+ if (fpm->singleton_npages == 0)
+ {
+ /* Don't have a span yet; store this one. */
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return fpm->singleton_npages;
+ }
+ else if (fpm->singleton_first_page + fpm->singleton_npages ==
+ first_page)
+ {
+ /* New span immediately follows sole existing span. */
+ fpm->singleton_npages += npages;
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else if (first_page + npages == fpm->singleton_first_page)
+ {
+ /* New span immediately precedes sole existing span. */
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages += npages;
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else
+ {
+ /* Not contiguous; we need to initialize the btree. */
+ Size root_page;
+ FreePageBtree *root;
+
+ if (!relptr_is_null(fpm->btree_recycle))
+ root = FreePageBtreeGetRecycled(fpm);
+ else if (FreePageManagerGetInternal(fpm, 1, &root_page))
+ root = (FreePageBtree *) fpm_page_to_pointer(base, root_page);
+ else
+ {
+ /* We'd better be able to get a page from the existing range. */
+ elog(FATAL, "free page manager btree is corrupt");
+ }
+
+ /* Create the btree and move the preexisting range into it. */
+ root->hdr.magic = FREE_PAGE_LEAF_MAGIC;
+ root->hdr.nused = 1;
+ relptr_store(base, root->hdr.parent, (FreePageBtree *) NULL);
+ root->u.leaf_key[0].first_page = fpm->singleton_first_page;
+ root->u.leaf_key[0].npages = fpm->singleton_npages;
+ relptr_store(base, fpm->btree_root, root);
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->btree_depth = 1;
+
+ /*
+ * Corner case: it may be that the btree root took the very last
+ * free page. In that case, the sole btree entry covers a zero
+ * page run, which is invalid. Overwrite it with the entry we're
+ * trying to insert and get out.
+ */
+ if (root->u.leaf_key[0].npages == 0)
+ {
+ root->u.leaf_key[0].first_page = first_page;
+ root->u.leaf_key[0].npages = npages;
+ return npages;
+ }
+
+ /* Fall through to insert the new key. */
+ }
+ }
+
+ /* Search the btree. */
+ FreePageBtreeSearch(fpm, first_page, &result);
+ Assert(!result.found);
+ if (result.index > 0)
+ prevkey = &result.page->u.leaf_key[result.index - 1];
+ if (result.index < result.page->hdr.nused)
+ {
+ np = result.page;
+ nindex = result.index;
+ nextkey = &result.page->u.leaf_key[result.index];
+ }
+ else
+ {
+ np = FreePageBtreeFindRightSibling(base, result.page);
+ nindex = 0;
+ if (np != NULL)
+ nextkey = &np->u.leaf_key[0];
+ }
+
+ /* Consolidate with the previous entry if possible. */
+ if (prevkey != NULL && prevkey->first_page + prevkey->npages >= first_page)
+ {
+ bool remove_next = false;
+ Size result;
+
+ Assert(prevkey->first_page + prevkey->npages == first_page);
+ prevkey->npages = (first_page - prevkey->first_page) + npages;
+
+ /* Check whether we can *also* consolidate with the following entry. */
+ if (nextkey != NULL &&
+ prevkey->first_page + prevkey->npages >= nextkey->first_page)
+ {
+ Assert(prevkey->first_page + prevkey->npages ==
+ nextkey->first_page);
+ prevkey->npages = (nextkey->first_page - prevkey->first_page)
+ + nextkey->npages;
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ remove_next = true;
+ }
+
+ /* Put the span on the correct freelist and save size. */
+ FreePagePopSpanLeader(fpm, prevkey->first_page);
+ FreePagePushSpanLeader(fpm, prevkey->first_page, prevkey->npages);
+ result = prevkey->npages;
+
+ /*
+ * If we consolidated with both the preceding and following entries,
+ * we must remove the following entry. We do this last, because
+ * removing an element from the btree may invalidate pointers we hold
+ * into the current data structure.
+ *
+ * NB: The btree is technically in an invalid state at this point
+ * because we've already updated prevkey to cover the same key space
+ * as nextkey. FreePageBtreeRemove() shouldn't notice that, though.
+ */
+ if (remove_next)
+ FreePageBtreeRemove(fpm, np, nindex);
+
+ return result;
+ }
+
+ /* Consolidate with the next entry if possible. */
+ if (nextkey != NULL && first_page + npages >= nextkey->first_page)
+ {
+ Size newpages;
+
+ /* Compute new size for span. */
+ Assert(first_page + npages == nextkey->first_page);
+ newpages = (nextkey->first_page - first_page) + nextkey->npages;
+
+ /* Put span on correct free list. */
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ FreePagePushSpanLeader(fpm, first_page, newpages);
+
+ /* Update key in place. */
+ nextkey->first_page = first_page;
+ nextkey->npages = newpages;
+
+ /* If reducing first key on page, ancestors might need adjustment. */
+ if (nindex == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, np);
+
+ return nextkey->npages;
+ }
+
+ /* Split leaf page and as many of its ancestors as necessary. */
+ if (result.split_pages > 0)
+ {
+ /*
+ * NB: We could consider various coping strategies here to avoid a
+ * split; most obviously, if np != result.page, we could target that
+ * page instead. More complicated shuffling strategies could be
+ * possible as well; basically, unless every single leaf page is 100%
+ * full, we can jam this key in there if we try hard enough. It's
+ * unlikely that trying that hard is worthwhile, but it's possible we
+ * might need to make more than no effort. For now, we just do the
+ * easy thing, which is nothing.
+ */
+
+ /* If this is a soft insert, it's time to give up. */
+ if (soft)
+ return 0;
+
+ /*
+ * Past this point we might allocate btree pages, which could
+ * potentially shorten any existing run which might happen to be the
+ * current longest. So fpm->contiguous_pages needs to be recomputed.
+ */
+ fpm->contiguous_pages_dirty = true;
+
+ /* Check whether we need to allocate more btree pages to split. */
+ if (result.split_pages > fpm->btree_recycle_count)
+ {
+ Size pages_needed;
+ Size recycle_page;
+ Size i;
+
+ /*
+ * Allocate the required number of pages and split each one in
+ * turn. This should never fail, because if we've got enough
+ * spans of free pages kicking around that we need additional
+ * storage space just to remember them all, then we should
+ * certainly have enough to expand the btree, which should only
+ * ever use a tiny number of pages compared to the number under
+ * management. If it does, something's badly screwed up.
+ */
+ pages_needed = result.split_pages - fpm->btree_recycle_count;
+ for (i = 0; i < pages_needed; ++i)
+ {
+ if (!FreePageManagerGetInternal(fpm, 1, &recycle_page))
+ elog(FATAL, "free page manager btree is corrupt");
+ FreePageBtreeRecycle(fpm, recycle_page);
+ }
+
+ /*
+ * The act of allocating pages to recycle may have invalidated the
+ * results of our previous btree search, so repeat it. (We could
+ * recheck whether any of our split-avoidance strategies that were
+ * not viable before now are, but it hardly seems worthwhile, so
+ * we don't bother. Consolidation can't be possible now if it
+ * wasn't previously.)
+ */
+ FreePageBtreeSearch(fpm, first_page, &result);
+
+ /*
+ * The act of allocating pages for use in constructing our btree
+ * should never cause any page to become more full, so the new
+ * split depth should be no greater than the old one, and perhaps
+ * less if we fortuitously allocated a chunk that freed up a slot
+ * on the page we need to update.
+ */
+ Assert(result.split_pages <= fpm->btree_recycle_count);
+ }
+
+ /* If we still need to perform a split, do it. */
+ if (result.split_pages > 0)
+ {
+ FreePageBtree *split_target = result.page;
+ FreePageBtree *child = NULL;
+ Size key = first_page;
+
+ for (;;)
+ {
+ FreePageBtree *newsibling;
+ FreePageBtree *parent;
+
+ /* Identify parent page, which must receive downlink. */
+ parent = relptr_access(base, split_target->hdr.parent);
+
+ /* Split the page - downlink not added yet. */
+ newsibling = FreePageBtreeSplitPage(fpm, split_target);
+
+ /*
+ * At this point in the loop, we're always carrying a pending
+ * insertion. On the first pass, it's the actual key we're
+ * trying to insert; on subsequent passes, it's the downlink
+ * that needs to be added as a result of the split performed
+ * during the previous loop iteration. Since we've just split
+ * the page, there's definitely room on one of the two
+ * resulting pages.
+ */
+ if (child == NULL)
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into = key < newsibling->u.leaf_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchLeaf(insert_into, key);
+ FreePageBtreeInsertLeaf(insert_into, index, key, npages);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+ else
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into =
+ key < newsibling->u.internal_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchInternal(insert_into, key);
+ FreePageBtreeInsertInternal(base, insert_into, index,
+ key, child);
+ relptr_store(base, child->hdr.parent, insert_into);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+
+ /* If the page we just split has no parent, split the root. */
+ if (parent == NULL)
+ {
+ FreePageBtree *newroot;
+
+ newroot = FreePageBtreeGetRecycled(fpm);
+ newroot->hdr.magic = FREE_PAGE_INTERNAL_MAGIC;
+ newroot->hdr.nused = 2;
+ relptr_store(base, newroot->hdr.parent,
+ (FreePageBtree *) NULL);
+ newroot->u.internal_key[0].first_page =
+ FreePageBtreeFirstKey(split_target);
+ relptr_store(base, newroot->u.internal_key[0].child,
+ split_target);
+ relptr_store(base, split_target->hdr.parent, newroot);
+ newroot->u.internal_key[1].first_page =
+ FreePageBtreeFirstKey(newsibling);
+ relptr_store(base, newroot->u.internal_key[1].child,
+ newsibling);
+ relptr_store(base, newsibling->hdr.parent, newroot);
+ relptr_store(base, fpm->btree_root, newroot);
+ fpm->btree_depth++;
+
+ break;
+ }
+
+ /* If the parent page isn't full, insert the downlink. */
+ key = newsibling->u.internal_key[0].first_page;
+ if (parent->hdr.nused < FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Size index;
+
+ index = FreePageBtreeSearchInternal(parent, key);
+ FreePageBtreeInsertInternal(base, parent, index,
+ key, newsibling);
+ relptr_store(base, newsibling->hdr.parent, parent);
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+ break;
+ }
+
+ /* The parent also needs to be split, so loop around. */
+ child = newsibling;
+ split_target = parent;
+ }
+
+ /*
+ * The loop above did the insert, so just need to update the free
+ * list, and we're done.
+ */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+ }
+ }
+
+ /* Physically add the key to the page. */
+ Assert(result.page->hdr.nused < FPM_ITEMS_PER_LEAF_PAGE);
+ FreePageBtreeInsertLeaf(result.page, result.index, first_page, npages);
+
+ /* If new first key on page, ancestors might need adjustment. */
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put it on the free list. */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+}
+
+/*
+ * Remove a FreePageSpanLeader from the linked-list that contains it, either
+ * because we're changing the size of the span, or because we're allocating it.
+ */
+static void
+FreePagePopSpanLeader(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *span;
+ FreePageSpanLeader *next;
+ FreePageSpanLeader *prev;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+
+ next = relptr_access(base, span->next);
+ prev = relptr_access(base, span->prev);
+ if (next != NULL)
+ relptr_copy(next->prev, span->prev);
+ if (prev != NULL)
+ relptr_copy(prev->next, span->next);
+ else
+ {
+ Size f = Min(span->npages, FPM_NUM_FREELISTS) - 1;
+
+ Assert(fpm->freelist[f].relptr_off == pageno * FPM_PAGE_SIZE);
+ relptr_copy(fpm->freelist[f], span->next);
+ }
+}
+
+/*
+ * Initialize a new FreePageSpanLeader and put it on the appropriate free list.
+ */
+static void
+FreePagePushSpanLeader(FreePageManager *fpm, Size first_page, Size npages)
+{
+ char *base = fpm_segment_base(fpm);
+ Size f = Min(npages, FPM_NUM_FREELISTS) - 1;
+ FreePageSpanLeader *head = relptr_access(base, fpm->freelist[f]);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, first_page);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = npages;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->freelist[f], span);
+}
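Note for reviewers (not part of the patch): the leaf/internal searches and the freelist bucket arithmetic above can be tried in isolation. The following is only a minimal standalone sketch. The names lower_bound_page, freelist_index and NUM_FREELISTS are hypothetical stand-ins, and the value 4 is an arbitrary assumption chosen for illustration, not the value of FPM_NUM_FREELISTS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for FPM_NUM_FREELISTS; the value is an assumption. */
#define NUM_FREELISTS 4

/*
 * Same shape as FreePageBtreeSearchLeaf: return the index of the first key
 * greater than or equal to target, or nkeys if there is no such key.
 */
static size_t
lower_bound_page(const size_t *keys, size_t nkeys, size_t target)
{
	size_t		low = 0;
	size_t		high = nkeys;

	while (low < high)
	{
		size_t		mid = (low + high) / 2;

		if (target == keys[mid])
			return mid;
		else if (target < keys[mid])
			high = mid;
		else
			low = mid + 1;
	}
	return low;
}

/*
 * Mirrors Min(npages, FPM_NUM_FREELISTS) - 1: spans of 1..NUM_FREELISTS-1
 * pages each get their own list, and everything larger shares the last one.
 */
static size_t
freelist_index(size_t npages)
{
	assert(npages > 0);
	return (npages < NUM_FREELISTS ? npages : NUM_FREELISTS) - 1;
}
```

With keys {3, 10, 25}, a search for 11 lands on index 2 (the insertion point before 25), and a search for 99 returns 3, which is why callers compare the result against hdr.nused before dereferencing the key.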
diff --git a/src/include/storage/dsa.h b/src/include/storage/dsa.h
new file mode 100644
index 0000000..1d18f16
--- /dev/null
+++ b/src/include/storage/dsa.h
@@ -0,0 +1,100 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.h
+ * Dynamic shared memory areas.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/storage/dsa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSA_H
+#define DSA_H
+
+#include "postgres.h"
+
+#include "port/atomics.h"
+#include "storage/dsm.h"
+
+/* The opaque type used for an area. */
+struct dsa_area;
+typedef struct dsa_area dsa_area;
+
+/*
+ * If this system doesn't support atomic operations on 64 bit values then
+ * we fall back to 32 bit dsa_pointer. For testing purposes,
+ * USE_SMALL_DSA_POINTER can be defined to force the use of 32 bit
+ * dsa_pointer even on systems that support 64 bit atomics.
+ */
+#ifndef PG_HAVE_ATOMIC_U64_SUPPORT
+#define SIZEOF_DSA_POINTER 4
+#else
+#ifdef USE_SMALL_DSA_POINTER
+#define SIZEOF_DSA_POINTER 4
+#else
+#define SIZEOF_DSA_POINTER 8
+#endif
+#endif
+
+/*
+ * The type of 'relative pointers' to memory allocated by a dynamic shared
+ * area. dsa_pointer values can be shared with other processes, but must be
+ * converted to backend-local pointers before they can be dereferenced. See
+ * dsa_get_address. Also, an atomic version and appropriately sized atomic
+ * operations.
+ */
+#if SIZEOF_DSA_POINTER == 4
+typedef uint32 dsa_pointer;
+typedef pg_atomic_uint32 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u32
+#define dsa_pointer_atomic_read pg_atomic_read_u32
+#define dsa_pointer_atomic_write pg_atomic_write_u32
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u32
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u32
+#else
+typedef uint64 dsa_pointer;
+typedef pg_atomic_uint64 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u64
+#define dsa_pointer_atomic_read pg_atomic_read_u64
+#define dsa_pointer_atomic_write pg_atomic_write_u64
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u64
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u64
+#endif
+
+/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
+#define InvalidDsaPointer ((dsa_pointer) 0)
+
+/* Check if a dsa_pointer value is valid. */
+#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
+
+/*
+ * The type used for dsa_area handles. A dsa_handle value can be shared with
+ * other processes so that they can attach to the same area. This provides a
+ * way to share allocated storage with other processes.
+ *
+ * The handle for a dsa_area is currently implemented as the dsm_handle
+ * for the first DSM segment backing this dynamic storage area, but client
+ * code shouldn't assume that is true.
+ */
+typedef dsm_handle dsa_handle;
+
+extern void dsa_startup(void);
+
+extern dsa_area *dsa_create_dynamic(int tranche_id, const char *tranche_name);
+extern dsa_area *dsa_attach_dynamic(dsa_handle handle);
+extern void dsa_pin_mapping(dsa_area *area);
+extern void dsa_detach(dsa_area *area);
+extern void dsa_pin(dsa_area *area);
+extern void dsa_unpin(dsa_area *area);
+extern void dsa_set_size_limit(dsa_area *area, Size limit);
+extern dsa_handle dsa_get_handle(dsa_area *area);
+extern dsa_pointer dsa_allocate(dsa_area *area, Size size);
+extern void dsa_free(dsa_area *area, dsa_pointer dp);
+extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern void dsa_trim(dsa_area *area);
+extern void dsa_dump(dsa_area *area);
+
+#endif /* DSA_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 8be7c9a..bc91be6 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -19,6 +19,9 @@ typedef struct dsm_segment dsm_segment;
#define DSM_CREATE_NULL_IF_MAXSEGMENTS 0x0001
+/* A sentinel value for an invalid DSM handle. */
+#define DSM_HANDLE_INVALID 0
+
/* Startup and shutdown functions. */
struct PGShmemHeader; /* avoid including pg_shmem.h */
extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
diff --git a/src/include/utils/freepage.h b/src/include/utils/freepage.h
new file mode 100644
index 0000000..e509ca2
--- /dev/null
+++ b/src/include/utils/freepage.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.h
+ * Management of page-organized free memory.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/freepage.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef FREEPAGE_H
+#define FREEPAGE_H
+
+#include "storage/lwlock.h"
+#include "utils/relptr.h"
+
+/* Forward declarations. */
+typedef struct FreePageSpanLeader FreePageSpanLeader;
+typedef struct FreePageBtree FreePageBtree;
+typedef struct FreePageManager FreePageManager;
+
+/*
+ * PostgreSQL normally uses 8kB pages for most things, but many common
+ * architecture/operating system pairings use a 4kB page size for memory
+ * allocation, so we do that here also. We assume that a large allocation
+ * is likely to begin on a page boundary; if not, we'll discard bytes from
+ * the beginning and end of the object and use only the middle portion that
+ * is properly aligned. This works, but is not ideal, so it's best to keep
+ * this conservatively small. There don't seem to be any common architectures
+ * where the page size is less than 4kB, so this should be good enough; also,
+ * making it smaller would increase the space consumed by the address space
+ * map, which also uses this page size.
+ */
+#define FPM_PAGE_SIZE 4096
+
+/*
+ * Each freelist except for the last contains only spans of one particular
+ * size. Everything larger goes on the last one. In some sense this seems
+ * like a waste since most allocations are in a few common sizes, but it
+ * means that small allocations can simply pop the head of the relevant list
+ * without needing to worry about whether the object we find there is of
+ * precisely the correct size (because we know it must be).
+ */
+#define FPM_NUM_FREELISTS 129
+
+/* Define relative pointer types. */
+relptr_declare(FreePageBtree, RelptrFreePageBtree);
+relptr_declare(FreePageManager, RelptrFreePageManager);
+relptr_declare(FreePageSpanLeader, RelptrFreePageSpanLeader);
+
+/* Everything we need in order to manage free pages (see freepage.c) */
+struct FreePageManager
+{
+ RelptrFreePageManager self;
+ RelptrFreePageBtree btree_root;
+ RelptrFreePageSpanLeader btree_recycle;
+ unsigned btree_depth;
+ unsigned btree_recycle_count;
+ Size singleton_first_page;
+ Size singleton_npages;
+ Size contiguous_pages;
+ bool contiguous_pages_dirty;
+ RelptrFreePageSpanLeader freelist[FPM_NUM_FREELISTS];
+#ifdef FPM_EXTRA_ASSERTS
+ /* For debugging only, pages put minus pages gotten. */
+ Size free_pages;
+#endif
+};
+
+/* Macros to convert between page numbers (expressed as Size) and pointers. */
+#define fpm_page_to_pointer(base, page) \
+ (AssertVariableIsOfTypeMacro(page, Size), \
+ (base) + FPM_PAGE_SIZE * (page))
+#define fpm_pointer_to_page(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) / FPM_PAGE_SIZE)
+
+/* Macro to convert an allocation size to a number of pages. */
+#define fpm_size_to_pages(sz) \
+ (((sz) + FPM_PAGE_SIZE - 1) / FPM_PAGE_SIZE)
+
+/* Macros to check alignment of absolute and relative pointers. */
+#define fpm_pointer_is_page_aligned(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) % FPM_PAGE_SIZE == 0)
+#define fpm_relptr_is_page_aligned(base, relptr) \
+ ((relptr).relptr_off % FPM_PAGE_SIZE == 0)
+
+/* Macro to find base address of the segment containing a FreePageManager. */
+#define fpm_segment_base(fpm) \
+ (((char *) fpm) - fpm->self.relptr_off)
+
+/* Macro to access a FreePageManager's largest consecutive run of pages. */
+#define fpm_largest(fpm) \
+ (fpm->contiguous_pages)
+
+/* Functions to manipulate the free page map. */
+extern void FreePageManagerInitialize(FreePageManager *fpm, char *base);
+extern bool FreePageManagerGet(FreePageManager *fpm, Size npages,
+ Size *first_page);
+extern void FreePageManagerPut(FreePageManager *fpm, Size first_page,
+ Size npages);
+extern char *FreePageManagerDump(FreePageManager *fpm);
+
+#endif /* FREEPAGE_H */
diff --git a/src/include/utils/relptr.h b/src/include/utils/relptr.h
new file mode 100644
index 0000000..a97dc96
--- /dev/null
+++ b/src/include/utils/relptr.h
@@ -0,0 +1,70 @@
+/*-------------------------------------------------------------------------
+ *
+ * relptr.h
+ * This file contains basic declarations for relative pointers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/relptr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef RELPTR_H
+#define RELPTR_H
+
+/*
+ * Relative pointers are intended to be used when storing an address that may
+ * be relative either to the base of the process's address space or some
+ * dynamic shared memory segment mapped therein.
+ *
+ * The idea here is that you declare a relative pointer as relptr(type)
+ * and then use relptr_access to dereference it and relptr_store to change
+ * it. The use of a union here is a hack, because what's stored in the
+ * relptr is always a Size, never an actual pointer. But including a pointer
+ * in the union allows us to use stupid macro tricks to provide some measure
+ * of type-safety.
+ */
+#define relptr(type) union { type *relptr_type; Size relptr_off; }
+
+#define relptr_declare(type, name) \
+ typedef union { type *relptr_type; Size relptr_off; } name;
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (__typeof__((rp).relptr_type)) ((rp).relptr_off == 0 ? NULL : \
+ (base + (rp).relptr_off)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (void *) ((rp).relptr_off == 0 ? NULL : (base + (rp).relptr_off)))
+#endif
+
+#define relptr_is_null(rp) \
+ ((rp).relptr_off == 0)
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ AssertVariableIsOfTypeMacro(val, __typeof__((rp).relptr_type)), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#endif
+
+#define relptr_copy(rp1, rp2) \
+ ((rp1).relptr_off = (rp2).relptr_off)
+
+#endif /* RELPTR_H */
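For readers unfamiliar with the relative-pointer idea described in the relptr.h comment, here is a minimal standalone sketch. It does not use the relptr macros from the patch; toy_relptr, toy_relptr_store, toy_relptr_access and toy_relptr_demo are hypothetical names invented for illustration, and the two local buffers only stand in for one segment mapped at two different addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a relptr: an offset from a base, 0 meaning NULL. */
typedef struct
{
    size_t off;
} toy_relptr;

/* Store 'val' as an offset from 'base' (or 0 for NULL). */
static void
toy_relptr_store(char *base, toy_relptr *rp, void *val)
{
    rp->off = (val == NULL) ? 0 : (size_t) ((char *) val - base);
}

/* Convert the stored offset back into a pointer valid for this 'base'. */
static void *
toy_relptr_access(char *base, toy_relptr rp)
{
    return rp.off == 0 ? NULL : (void *) (base + rp.off);
}

/*
 * Simulate two mappings of one segment and resolve the same relative
 * pointer against each of them.  Returns 1 if the second mapping sees
 * the object that was stored through the first.
 */
static int
toy_relptr_demo(void)
{
    char        mapping_a[64] = {0};
    char        mapping_b[64];
    toy_relptr  rp;
    toy_relptr  null_rp;

    strcpy(mapping_a + 16, "hello");
    toy_relptr_store(mapping_a, &rp, mapping_a + 16);
    toy_relptr_store(mapping_a, &null_rp, NULL);

    /* The "other backend" sees the same bytes at a different address. */
    memcpy(mapping_b, mapping_a, sizeof(mapping_a));

    assert(toy_relptr_access(mapping_b, null_rp) == NULL);
    return strcmp((char *) toy_relptr_access(mapping_b, rp), "hello") == 0;
}
```

Because offset 0 is reserved to mean NULL, as in the relptr_access and relptr_store macros above, an object located exactly at the base address cannot be referenced this way.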
On Tue, Nov 1, 2016 at 5:06 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Oct 5, 2016 at 11:28 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> [dsa-v3.patch]
>
> Here is a new version which just adds CLOBBER_FREED_MEMORY support to dsa_free.

Here is a new version that fixes a bug I discovered in freepage.c today.
Details: When dsa_free decides to give a whole superblock back to the
free page manager for a segment with FreePageManagerPut, and there was
already exactly one span of exactly one free page in that segment, and
the span being 'put' is not adjacent to that existing free page, then
the singleton format must be converted to a btree with the existing
page as root and the newly put span as the sole leaf. But in that
special case we forgot to add the newly put span to the appropriate
free list. Not only did we lose track of it, but a future call to
FreePageManagerPut might try to merge it with another adjacent span,
which will try to manipulate the freelist that it expects the span to
be in and blow up. The fix is just to add a call to
FreePagePushSpanLeader in this corner case before the early return.
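
To make the corner case concrete, here is a toy model of the bookkeeping involved. It is only a sketch and not the freepage.c code: toy_fpm, toy_span, toy_put and the toy_freelist_* helpers are hypothetical names, the btree and free lists are replaced by small arrays, and only the singleton and singleton-to-btree paths are modeled. How the existing page is treated during conversion is likewise an assumption of the toy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_SPANS 8

/* Hypothetical toy span record; not the real FreePageSpanLeader layout. */
typedef struct
{
    size_t first_page;
    size_t npages;
} toy_span;

/* Hypothetical toy manager; models only the bookkeeping relevant here. */
typedef struct
{
    bool     has_btree;                 /* false while in singleton format */
    size_t   btree_root_page;           /* page consumed to hold the root */
    toy_span singleton;                 /* in use when npages > 0 */
    toy_span btree[TOY_MAX_SPANS];      /* stand-in for the btree contents */
    size_t   nbtree;
    toy_span freelist[TOY_MAX_SPANS];   /* stand-in for the free lists */
    size_t   nfreelist;
} toy_fpm;

static void
toy_freelist_push(toy_fpm *fpm, toy_span span)
{
    assert(fpm->nfreelist < TOY_MAX_SPANS);
    fpm->freelist[fpm->nfreelist++] = span;
}

static bool
toy_freelist_contains(const toy_fpm *fpm, size_t first_page)
{
    for (size_t i = 0; i < fpm->nfreelist; i++)
        if (fpm->freelist[i].first_page == first_page)
            return true;
    return false;
}

static void
toy_freelist_remove(toy_fpm *fpm, size_t first_page)
{
    for (size_t i = 0; i < fpm->nfreelist; i++)
    {
        if (fpm->freelist[i].first_page == first_page)
        {
            fpm->freelist[i] = fpm->freelist[--fpm->nfreelist];
            return;
        }
    }
}

/* Handles only the cases discussed above; returns false for the rest. */
static bool
toy_put(toy_fpm *fpm, size_t first_page, size_t npages)
{
    toy_span span = {first_page, npages};
    bool     adjacent;

    if (fpm->has_btree)
        return false;           /* general btree insertion is not modeled */

    if (fpm->singleton.npages == 0)
    {
        /* No free span yet: keep it in singleton format. */
        fpm->singleton = span;
        toy_freelist_push(fpm, span);
        return true;
    }

    adjacent = first_page + npages == fpm->singleton.first_page ||
        fpm->singleton.first_page + fpm->singleton.npages == first_page;
    if (fpm->singleton.npages != 1 || adjacent)
        return false;           /* merging and larger singletons not modeled */

    /* Corner case: the lone free page becomes the btree root ... */
    toy_freelist_remove(fpm, fpm->singleton.first_page);
    fpm->btree_root_page = fpm->singleton.first_page;
    fpm->singleton.npages = 0;
    fpm->has_btree = true;
    fpm->btree[fpm->nbtree++] = span;

    /* ... and the new span must also go on a free list (the missing step). */
    toy_freelist_push(fpm, span);
    return true;
}
```

In this model, putting a one-page span at page 10 and then a 16-page span at page 20 leaves page 10 as the root and the second span findable through toy_freelist_contains. Dropping the final toy_freelist_push reproduces the shape of the bug: the span is in the index but on no free list.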
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
dsa-v5.patch (application/octet-stream)
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index 8a55392..e99ebd2 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -8,7 +8,7 @@ subdir = src/backend/storage/ipc
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
+OBJS = dsa.o dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
procsignal.o shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o \
sinvaladt.o standby.o
diff --git a/src/backend/storage/ipc/dsa.c b/src/backend/storage/ipc/dsa.c
new file mode 100644
index 0000000..472b8ad
--- /dev/null
+++ b/src/backend/storage/ipc/dsa.c
@@ -0,0 +1,1960 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.c
+ * Dynamic shared memory areas.
+ *
+ * This module provides dynamic shared memory areas which are built on top of
+ * DSM segments. While dsm.c allows segments of shared memory to be created
+ * and shared between backends, it isn't designed to deal with small objects.
+ * A DSA area is a shared memory heap backed by one or more DSM segments, from
+ * which memory is allocated with dsa_allocate() and released with dsa_free().
+ * Unlike the regular system heap, it deals in pseudo-pointers which must be
+ * converted to backend-local pointers before they are dereferenced. These
+ * pseudo-pointers can however be shared with other backends, and can be used
+ * to construct shared data structures.
+ *
+ * Each DSA area manages one or more DSM segments, adding new segments as
+ * required and detaching them when they are no longer needed. Each segment
+ * contains a number of 4KB pages, a free page manager for tracking
+ * consecutive runs of free pages, and a page map for tracking the source of
+ * objects allocated on each page. Allocation requests above 8KB are handled
+ * by choosing a segment and finding consecutive free pages in its free page
+ * manager. Allocation requests for smaller sizes are handled using pools of
+ * objects of a selection of sizes. Each pool consists of a number of 16 page
+ * (64KB) superblocks allocated in the same way as large objects. Allocation
+ * of large objects and new superblocks is serialized by a single LWLock, but
+ * allocation of small objects from pre-existing superblocks uses one LWLock
+ * per pool. Currently there is one pool, and therefore one lock, per size
+ * class. Per-core pools to increase concurrency and strategies for reducing
+ * the resulting fragmentation are areas for future research. Each superblock
+ * is managed with a 'span', which tracks the superblock's freelist. Free
+ * requests are handled by looking in the page map to find which span an
+ * address was allocated from, so that small objects can be returned to the
+ * appropriate free list, and large object pages can be returned directly to
+ * the free page map. When allocating, simple heuristics for selecting
+ * segments and superblocks try to encourage occupied memory to be
+ * concentrated, increasing the likelihood that whole superblocks can become
+ * empty and be returned to the free page manager, and whole segments can
+ * become empty and be returned to the operating system.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/ipc/dsa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/barrier.h"
+#include "storage/dsa.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "utils/freepage.h"
+#include "utils/memutils.h"
+
+/*
+ * The size of the initial DSM segment that backs a dsa_area. After creating
+ * some number of segments of this size we'll double the size, and so on.
+ * Larger segments may be created if necessary to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE (1 * 1024 * 1024)
+
+/*
+ * How many segments to create before we double the segment size. If this is
+ * low, then there is likely to be a lot of wasted space in the largest
+ * segment. If it is high, then we risk running out of segment slots (see
+ * dsm.c's limits on total number of segments), or limiting the total size
+ * an area can manage when using small pointers.
+ */
+#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 4
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
+#endif
+
+/*
+ * The maximum number of DSM segments that an area can own, determined by
+ * the number of bits remaining (but capped at 1024).
+ */
+#define DSA_MAX_SEGMENTS \
+ Min(1024, (1 << ((SIZEOF_DSA_POINTER * 8) - DSA_OFFSET_WIDTH)))
+
+/* The bitmask for extracting the offset from a dsa_pointer. */
+#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
+/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
+#define DSA_PAGES_PER_SUPERBLOCK 16
+
+/*
+ * A magic number used as a sanity check for following DSM segments belonging
+ * to a DSA area (this number will be XORed with the area handle and
+ * the segment index).
+ */
+#define DSA_SEGMENT_HEADER_MAGIC 0x0ce26608
+
+/* Build a dsa_pointer given a segment number and offset. */
+#define DSA_MAKE_POINTER(segment_number, offset) \
+ (((dsa_pointer) (segment_number) << DSA_OFFSET_WIDTH) | (offset))
+
+/* Extract the segment number from a dsa_pointer. */
+#define DSA_EXTRACT_SEGMENT_NUMBER(dp) ((dp) >> DSA_OFFSET_WIDTH)
+
+/* Extract the offset from a dsa_pointer. */
+#define DSA_EXTRACT_OFFSET(dp) ((dp) & DSA_OFFSET_BITMASK)
+
+/* The type used for segment indexes (zero based). */
+typedef Size dsa_segment_index;
+
+/* Sentinel value for dsa_segment_index indicating 'none' or 'end'. */
+#define DSA_SEGMENT_INDEX_NONE (~(dsa_segment_index)0)
+
+/*
+ * How many bins of segments do we have? The bins are used to categorize
+ * segments by their largest contiguous run of free pages.
+ */
+#define DSA_NUM_SEGMENT_BINS 16
+
+/*
+ * What is the lowest bin that holds segments that *might* have n contiguous
+ * free pages? There is no point in looking in segments in lower bins; they
+ * definitely can't service a request for n free pages.
+ */
+#define contiguous_pages_to_segment_bin(n) Min(fls(n), DSA_NUM_SEGMENT_BINS - 1)
+
+/* Macros for access to locks. */
+#define DSA_AREA_LOCK(area) (&area->control->lock)
+#define DSA_SCLASS_LOCK(area, sclass) (&area->control->pools[sclass].lock)
+
+/*
+ * The header for an individual segment. This lives at the start of each DSM
+ * segment owned by a DSA area including the first segment (where it appears
+ * as part of the dsa_area_control struct).
+ */
+typedef struct
+{
+ /* Sanity check magic value. */
+ uint32 magic;
+ /* Total number of pages in this segment (excluding metadata area). */
+ Size usable_pages;
+ /* Total size of this segment in bytes. */
+ Size size;
+
+ /*
+ * Index of the segment that precedes this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the first one.
+ */
+ dsa_segment_index prev;
+
+ /*
+ * Index of the segment that follows this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the last one.
+ */
+ dsa_segment_index next;
+ /* The index of the bin that contains this segment. */
+ Size bin;
+
+ /*
+ * A flag raised to indicate that this segment is being returned to the
+ * operating system and has been unpinned.
+ */
+ bool freed;
+} dsa_segment_header;
+
+/*
+ * Metadata for one superblock.
+ *
+ * For most blocks, span objects are stored out-of-line; that is, the span
+ * object is not stored within the block itself. But, as an exception, for a
+ * "span of spans", the span object is stored "inline". The allocation is
+ * always exactly one page, and the dsa_area_span object is located at
+ * the beginning of that page. The size class is DSA_SCLASS_BLOCK_OF_SPANS,
+ * and the remaining fields are used just as they would be in an ordinary
+ * block. We can't allocate spans out of ordinary superblocks because
+ * creating an ordinary superblock requires us to be able to allocate a span
+ * *first*. Doing it this way avoids that circularity.
+ */
+typedef struct
+{
+ dsa_pointer pool; /* Containing pool. */
+ dsa_pointer prevspan; /* Previous span. */
+ dsa_pointer nextspan; /* Next span. */
+ dsa_pointer start; /* Starting address. */
+ Size npages; /* Length of span in pages. */
+ uint16 size_class; /* Size class. */
+ uint16 ninitialized; /* Maximum number of objects ever allocated. */
+ uint16 nallocatable; /* Number of objects currently allocatable. */
+ uint16 firstfree; /* First object on free list. */
+ uint16 nmax; /* Maximum number of objects ever possible. */
+ uint16 fclass; /* Current fullness class. */
+} dsa_area_span;
+
+/*
+ * Given a pointer to an object in a span, access the index of the next free
+ * object in the same span (ie in the span's freelist) as an L-value.
+ */
+#define NextFreeObjectIndex(object) (* (uint16 *) (object))
+
+/*
+ * Small allocations are handled by dividing a single block of memory into
+ * many small objects of equal size. The possible allocation sizes are
+ * defined by the following array. Larger size classes are spaced more widely
+ * than smaller size classes. We fudge the spacing for size classes >1kB to
+ * avoid space wastage: based on the knowledge that we plan to allocate 64kB
+ * blocks, we bump the maximum object size up to the largest multiple of
+ * 8 bytes that still lets us fit the same number of objects into one block.
+ *
+ * NB: Because of this fudging, if we were ever to use differently-sized blocks
+ * for small allocations, these size classes would need to be reworked to be
+ * optimal for the new size.
+ *
+ * NB: The optimal spacing for size classes, as well as the size of the blocks
+ * out of which small objects are allocated, is not a question that has one
+ * right answer. Some allocators (such as tcmalloc) use more closely-spaced
+ * size classes than we do here, while others (like aset.c) use more
+ * widely-spaced classes. Spacing the classes more closely avoids wasting
+ * memory within individual chunks, but also means a larger number of
+ * potentially-unfilled blocks.
+ */
+static const uint16 dsa_size_classes[] = {
+ sizeof(dsa_area_span), 0, /* special size classes */
+ 8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
+ 80, 96, 112, 128, /* 4 classes separated by 16 bytes */
+ 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
+ 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
+ 640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
+ 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
+ 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
+ 5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
+};
+#define DSA_NUM_SIZE_CLASSES lengthof(dsa_size_classes)
+
+/* Special size classes. */
+#define DSA_SCLASS_BLOCK_OF_SPANS 0
+#define DSA_SCLASS_SPAN_LARGE 1
+
+/*
+ * The following lookup table is used to map the size of small objects
+ * (less than 1kB) onto the corresponding size class. To use this table,
+ * round the size of the object up to the next multiple of 8 bytes, and then
+ * index into this array.
+ */
+static char dsa_size_class_map[] = {
+ 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 11, 12, 12, 13, 13,
+ 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17,
+ 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19,
+ 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21,
+ 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
+ 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23,
+ 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+ 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25
+};
+#define DSA_SIZE_CLASS_MAP_QUANTUM 8
+
+/*
+ * Superblocks are binned by how full they are. Generally, each fullness
+ * class corresponds to one quartile, but the block being used for
+ * allocations is always at the head of the list for fullness class 1,
+ * regardless of how full it really is.
+ *
+ * For large objects, we just stick all of the allocations in fullness class
+ * 0. Since we can just return the space directly to the free page manager,
+ * we don't really need them on a list at all, except that if someone wants
+ * to bulk release everything allocated using this BlockAreaContext, we
+ * have no other way of finding them.
+ */
+#define DSA_FULLNESS_CLASSES 4
+
+/*
+ * Maximum length of a DSA name.
+ */
+#define DSA_MAXLEN 64
+
+/*
+ * A dsa_area_pool represents a set of objects of a given size class.
+ *
+ * Perhaps there should be multiple pools for the same size class for
+ * contention avoidance, but for now there is just one!
+ */
+typedef struct
+{
+ /* A lock protecting access to this pool. */
+ LWLock lock;
+ /* A set of linked lists of spans, arranged by fullness. */
+ dsa_pointer spans[DSA_FULLNESS_CLASSES];
+ /* Should we pad this out to a cacheline boundary? */
+} dsa_area_pool;
+
+/*
+ * The control block for an area. This lives in shared memory, at the start of
+ * the first DSM segment controlled by this area.
+ */
+typedef struct
+{
+ /* The segment header for the first segment. */
+ dsa_segment_header segment_header;
+ /* The handle for this area. */
+ dsa_handle handle;
+ /* The handles of the segments owned by this area. */
+ dsm_handle segment_handles[DSA_MAX_SEGMENTS];
+ /* Lists of segments, binned by maximum contiguous run of free pages. */
+ dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
+ /* The object pools for each size class. */
+ dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* The total size of all active segments. */
+ Size total_segment_size;
+ /* The maximum total size of backing storage we are allowed. */
+ Size max_total_segment_size;
+ /* The reference count for this area. */
+ int refcnt;
+ /* A flag indicating that this area has been pinned. */
+ bool pinned;
+ /* The number of times that segments have been freed. */
+ Size freed_segment_counter;
+ /* The LWLock tranche ID. */
+ int lwlock_tranche_id;
+ char lwlock_tranche_name[DSA_MAXLEN];
+ /* The general lock (protects everything except object pools). */
+ LWLock lock;
+} dsa_area_control;
+
+/* Given a pointer to a pool, find a dsa_pointer. */
+#define DsaAreaPoolToDsaPointer(area, p) \
+ DSA_MAKE_POINTER(0, (char *) p - (char *) area->control)
+
+/*
+ * A dsa_segment_map is stored within the backend-private memory of each
+ * individual backend. It holds the base address of the segment within that
+ * backend, plus the addresses of key objects within the segment. Those
+ * could instead be derived from the base address but it's handy to have them
+ * around.
+ */
+typedef struct
+{
+ dsm_segment *segment; /* DSM segment */
+ char *mapped_address; /* Address at which segment is mapped */
+ Size size; /* Size of the segment */
+ dsa_segment_header *header; /* Header (same as mapped_address) */
+ FreePageManager *fpm; /* Free page manager within segment. */
+ dsa_pointer *pagemap; /* Page map within segment. */
+} dsa_segment_map;
+
+/*
+ * Per-backend state for a storage area. Backends obtain one of these by
+ * creating an area or attaching to an existing one using a handle. Each
+ * process that needs to use an area uses its own object to track where the
+ * segments are mapped.
+ */
+struct dsa_area
+{
+ /* Pointer to the control object in shared memory. */
+ dsa_area_control *control;
+
+ /* The lock tranche for this process. */
+ LWLockTranche lwlock_tranche;
+
+ /* Has the mapping been pinned? */
+ bool mapping_pinned;
+
+ /*
+ * This backend's array of segment maps, ordered by segment index
+ * corresponding to control->segment_handles. Some of the area's segments
+ * may not be mapped into this backend yet, and some slots may have been
+ * freed and need to be detached; these operations happen on demand.
+ */
+ dsa_segment_map segment_maps[DSA_MAX_SEGMENTS];
+
+ /* The last observed freed_segment_counter. */
+ Size freed_segment_counter;
+};
+
+#define DSA_SPAN_NOTHING_FREE ((uint16) -1)
+#define DSA_SUPERBLOCK_SIZE (DSA_PAGES_PER_SUPERBLOCK * FPM_PAGE_SIZE)
+
+/* Given a pointer to a segment_map, obtain a segment index number. */
+#define get_segment_index(area, segment_map_ptr) \
+ (segment_map_ptr - &area->segment_maps[0])
+
+static void init_span(dsa_area *area, dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class);
+static bool transfer_first_span(dsa_area *area, dsa_area_pool *pool,
+ int fromclass, int toclass);
+static inline dsa_pointer alloc_object(dsa_area *area, int size_class);
+static void dsa_on_dsm_segment_detach(dsm_segment *, Datum arg);
+static bool ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class);
+static dsa_segment_map *get_segment_by_index(dsa_area *area,
+ dsa_segment_index index);
+static void destroy_superblock(dsa_area *area, dsa_pointer span_pointer);
+static void unlink_span(dsa_area *area, dsa_area_span *span);
+static void add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer, int fclass);
+static void unlink_segment(dsa_area *area, dsa_segment_map *segment_map);
+static dsa_segment_map *get_best_segment(dsa_area *area, Size npages);
+static dsa_segment_map *make_new_segment(dsa_area *area, Size requested_pages);
+
+/*
+ * Create a new shared area with dynamic size. DSM segments will be allocated
+ * as required to extend the available space.
+ *
+ * We can't allocate a LWLock tranche_id within this function, because tranche
+ * IDs are a scarce resource; there are only 64k available, using low numbers
+ * when possible matters, and we have no provision for recycling them. So,
+ * we require the caller to provide one. The caller must also provide the
+ * tranche name, so that we can distinguish LWLocks belonging to different
+ * DSAs.
+ */
+dsa_area *
+dsa_create_dynamic(int tranche_id, const char *tranche_name)
+{
+ dsm_segment *segment;
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+ Size usable_pages;
+ Size total_pages;
+ Size metadata_bytes;
+ Size total_size;
+ int i;
+
+ total_size = DSA_INITIAL_SEGMENT_SIZE;
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages =
+ (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+
+ /*
+ * Create the DSM segment that will hold the shared control object and the
+ * first segment of usable space, and set it up. All segments backing
+ * this area are pinned, so that DSA can explicitly control their lifetime
+ * (otherwise a newly created segment belonging to this area might be
+ * freed when the only backend that happens to have it mapped in ends,
+ * corrupting the area).
+ */
+ segment = dsm_create(total_size, 0);
+ dsm_pin_segment(segment);
+
+ /*
+ * Initialize the dsa_area_control object located at the start of the
+ * segment.
+ */
+ control = dsm_segment_address(segment);
+ control->segment_header.magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ dsm_segment_handle(segment) ^ 0;
+ control->segment_header.next = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.usable_pages = usable_pages;
+ control->segment_header.freed = false;
+ control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+ control->handle = dsm_segment_handle(segment);
+ control->max_total_segment_size = SIZE_MAX;
+ control->total_segment_size = DSA_INITIAL_SEGMENT_SIZE;
+ memset(&control->segment_handles[0], 0,
+ sizeof(dsm_handle) * DSA_MAX_SEGMENTS);
+ control->segment_handles[0] = dsm_segment_handle(segment);
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ control->segment_bins[i] = DSA_SEGMENT_INDEX_NONE;
+ control->refcnt = 1;
+ control->freed_segment_counter = 0;
+ control->lwlock_tranche_id = tranche_id;
+ strlcpy(control->lwlock_tranche_name, tranche_name, DSA_MAXLEN);
+
+ /*
+ * Create the dsa_area object that this backend will use to access the
+ * area. Other backends will need to obtain their own dsa_area object by
+ * attaching.
+ */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(area->segment_maps, 0, sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+ LWLockInitialize(&control->lock, control->lwlock_tranche_id);
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ LWLockInitialize(DSA_SCLASS_LOCK(area, i),
+ control->lwlock_tranche_id);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = total_size;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Put this segment into the appropriate bin. */
+ control->segment_bins[contiguous_pages_to_segment_bin(usable_pages)] = 0;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_segment_detach, PointerGetDatum(NULL));
+
+ return area;
+}
+
+/*
+ * Obtain a handle that can be passed to other processes so that they can
+ * attach to the given area.
+ */
+dsa_handle
+dsa_get_handle(dsa_area *area)
+{
+ return area->control->handle;
+}
+
+/*
+ * Attach to an area given a handle generated (possibly in another
+ * process) by dsa_get_handle.
+ */
+dsa_area *
+dsa_attach_dynamic(dsa_handle handle)
+{
+ dsm_segment *segment;
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+
+ /*
+ * An area handle is really a DSM segment handle for the first segment, so
+ * we go ahead and attach to that.
+ */
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to area handle %u", handle);
+ control = dsm_segment_address(segment);
+ Assert(control->handle == handle);
+ Assert(control->segment_handles[0] == handle);
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ handle ^ 0));
+
+ /* Build the backend-local area object. */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ area->freed_segment_counter = control->freed_segment_counter;
+ memset(&area->segment_maps[0], 0,
+ sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = dsm_segment_map_length(segment);
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Bump the reference count. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ ++control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_segment_detach, PointerGetDatum(area));
+
+ return area;
+}
+
+/*
+ * Keep a DSA area attached until end of session or explicit detach.
+ *
+ * By default, areas are owned by the current resource owner, which means they
+ * are detached automatically when that scope ends.
+ */
+void
+dsa_pin_mapping(dsa_area *area)
+{
+ int i;
+
+ Assert(!area->mapping_pinned);
+ area->mapping_pinned = true;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_pin_mapping(area->segment_maps[i].segment);
+}
+
+/*
+ * Allocate memory in this storage area. The return value is a dsa_pointer
+ * that can be passed to other processes, and converted to a local pointer
+ * with dsa_get_address. If no memory is available, returns
+ * InvalidDsaPointer.
+ */
+dsa_pointer
+dsa_allocate(dsa_area *area, Size size)
+{
+ uint16 size_class;
+ dsa_pointer start_pointer;
+ dsa_segment_map *segment_map;
+
+ Assert(size > 0);
+
+ /*
+ * If bigger than the largest size class, just grab a run of pages from
+ * the free page manager, instead of allocating an object from a pool.
+ * There will still be a span, but it's a special class of span that
+ * manages this whole allocation and simply gives all pages back to the
+ * free page manager when dsa_free is called.
+ */
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ {
+ Size npages = fpm_size_to_pages(size);
+ Size first_page;
+ dsa_pointer span_pointer;
+ dsa_area_pool *pool = &area->control->pools[DSA_SCLASS_SPAN_LARGE];
+
+ /* Obtain a span object. */
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return InvalidDsaPointer;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+
+ /* Find a segment from which to allocate. */
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ /* Can't make any more segments: game over. */
+ LWLockRelease(DSA_AREA_LOCK(area));
+ dsa_free(area, span_pointer);
+ return InvalidDsaPointer;
+ }
+
+ /*
+ * Ask the free page manager for a run of pages. This should always
+ * succeed, since both get_best_segment and make_new_segment should
+ * only return a segment that actually contains enough contiguous
+ * free space. If it does fail, something in our backend-private
+ * state is out of whack, so use FATAL to kill the process.
+ */
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ elog(FATAL,
+ "dsa couldn't find run of pages: fpm_largest out of sync");
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ start_pointer = DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /* Initialize span and pagemap. */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ init_span(area, span_pointer, pool, start_pointer, npages,
+ DSA_SCLASS_SPAN_LARGE);
+ segment_map->pagemap[first_page] = span_pointer;
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+
+ return start_pointer;
+ }
+
+ /* Map allocation to a size class. */
+ if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+ Assert(size <= dsa_size_classes[size_class]);
+ Assert(size_class == 0 || size > dsa_size_classes[size_class - 1]);
+
+ /*
+ * Attempt to allocate an object from the appropriate pool. This might
+ * return InvalidDsaPointer if there's no space available.
+ */
+ return alloc_object(area, size_class);
+}
+
+/*
+ * Free memory obtained with dsa_allocate.
+ */
+void
+dsa_free(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_map *segment_map;
+ int pageno;
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ char *superblock;
+ char *object;
+ Size size;
+ int size_class;
+
+ /* Locate the object, span and pool. */
+ segment_map = get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(dp));
+ pageno = DSA_EXTRACT_OFFSET(dp) / FPM_PAGE_SIZE;
+ span_pointer = segment_map->pagemap[pageno];
+ span = dsa_get_address(area, span_pointer);
+ superblock = dsa_get_address(area, span->start);
+ object = dsa_get_address(area, dp);
+ size_class = span->size_class;
+ size = dsa_size_classes[size_class];
+
+ /*
+ * Special case for large objects that live in a special span: we return
+ * those pages directly to the free page manager and free the span.
+ */
+ if (span->size_class == DSA_SCLASS_SPAN_LARGE)
+ {
+
+#ifdef CLOBBER_FREED_MEMORY
+ memset(object, 0x7f, span->npages * FPM_PAGE_SIZE);
+#endif
+
+ /* Give pages back to free page manager. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Unlink span. */
+ /* TODO: Does it even need to be linked in in the first place? */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ unlink_span(area, span);
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+ /* Free the span object so it can be reused. */
+ dsa_free(area, span_pointer);
+ return;
+ }
+
+#ifdef CLOBBER_FREED_MEMORY
+ memset(object, 0x7f, size);
+#endif
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /* Put the object on the span's freelist. */
+ Assert(object >= superblock);
+ Assert(object < superblock + DSA_SUPERBLOCK_SIZE);
+ Assert((object - superblock) % size == 0);
+ NextFreeObjectIndex(object) = span->firstfree;
+ span->firstfree = (object - superblock) / size;
+ ++span->nallocatable;
+
+ /*
+ * See if the span needs to be moved to a different fullness class, or be
+ * freed so its pages can be given back to the segment.
+ */
+ if (span->nallocatable == 1 && span->fclass == DSA_FULLNESS_CLASSES - 1)
+ {
+ /*
+ * The block was completely full and is located in the
+ * highest-numbered fullness class, which is never scanned for free
+ * chunks. We must move it to the next-lower fullness class.
+ */
+ unlink_span(area, span);
+ add_span_to_fullness_class(area, span, span_pointer,
+ DSA_FULLNESS_CLASSES - 2);
+
+ /*
+ * If this is the only span, and there is no active span, then we
+ * should probably move this span to fullness class 1. (Otherwise
+ * if you allocate exactly all the objects in the only span, it moves
+ * to class 3, then you free them all, it moves to 2, and then is
+ * given back, leaving no active span).
+ */
+ }
+ else if (span->nallocatable == span->nmax &&
+ (span->fclass != 1 || span->prevspan != InvalidDsaPointer))
+ {
+ /*
+ * This entire block is free, and it's not the active block for this
+ * size class. Return the memory to the free page manager. We don't
+ * do this for the active block to prevent thrashing: if we
+ * repeatedly allocate and free the only chunk in the active block, it
+ * will be very inefficient if we deallocate and reallocate the block
+ * every time.
+ */
+ destroy_superblock(area, span_pointer);
+ }
+
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+}
+
+/*
+ * Obtain a backend-local address for a dsa_pointer. 'dp' must have been
+ * allocated by the given area (possibly in another process). This may cause
+ * a segment to be mapped into the current process.
+ */
+void *
+dsa_get_address(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_index index;
+ Size offset;
+ Size freed_segment_counter;
+
+ /* Convert InvalidDsaPointer to NULL. */
+ if (!DsaPointerIsValid(dp))
+ return NULL;
+
+ index = DSA_EXTRACT_SEGMENT_NUMBER(dp);
+ offset = DSA_EXTRACT_OFFSET(dp);
+
+ Assert(index < DSA_MAX_SEGMENTS);
+
+ /* Check if we need to cause this segment to be mapped in. */
+ if (area->segment_maps[index].mapped_address == NULL)
+ {
+ /* Call for effect (we don't need the result). */
+ get_segment_by_index(area, index);
+ }
+
+ /*
+ * Take this opportunity to check if we need to detach from any segments
+ * that have been freed. This is an unsynchronized read of the value in
+ * shared memory, but all that matters is that we eventually observe a
+ * change when that number moves.
+ */
+ freed_segment_counter = area->control->freed_segment_counter;
+ if (area->freed_segment_counter != freed_segment_counter)
+ {
+ int i;
+
+ /* Check all currently mapped segments to find what's been freed. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ if (area->segment_maps[i].header != NULL &&
+ area->segment_maps[i].header->freed)
+ {
+ dsm_detach(area->segment_maps[i].segment);
+ area->segment_maps[i].segment = NULL;
+ area->segment_maps[i].header = NULL;
+ area->segment_maps[i].mapped_address = NULL;
+ }
+ }
+ area->freed_segment_counter = freed_segment_counter;
+ }
+
+ return area->segment_maps[index].mapped_address + offset;
+}
+
+/*
+ * Pin this area, so that it will continue to exist even if all backends
+ * detach from it. In that case, the area can still be reattached to if a
+ * handle has been recorded somewhere.
+ */
+void
+dsa_pin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ if (area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_pin: area already pinned");
+ }
+ area->control->pinned = true;
+ ++area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Undo the effects of dsa_pin, so that the given area can be freed when no
+ * backends are attached to it. May be called only if dsa_pin has been
+ * called.
+ */
+void
+dsa_unpin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ Assert(area->control->refcnt > 1);
+ if (!area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_unpin: area not pinned");
+ }
+ area->control->pinned = false;
+ --area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Set the total size limit for this area. This limit is checked whenever new
+ * segments need to be allocated from the operating system. If the new size
+ * limit is already exceeded, this has no immediate effect.
+ *
+ * Note that the total virtual memory usage may be temporarily larger than
+ * this limit when segments have been freed, but not yet detached by all
+ * backends that have attached to them.
+ */
+void
+dsa_set_size_limit(dsa_area *area, Size limit)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ area->control->max_total_segment_size = limit;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Aggressively free all spare memory in the hope of returning DSM segments to
+ * the operating system.
+ */
+void
+dsa_trim(dsa_area *area)
+{
+ int size_class;
+
+ /*
+ * Trim in reverse pool order so we get to the spans-of-spans last, just
+ * in case any become entirely free while processing all the other pools.
+ */
+ for (size_class = DSA_NUM_SIZE_CLASSES - 1; size_class >= 0; --size_class)
+ {
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_pointer span_pointer;
+
+ if (size_class == DSA_SCLASS_SPAN_LARGE)
+ /* Large object frees give back segments aggressively already. */
+ continue;
+
+ /*
+ * Search the fullness class 1 only. That is where we expect to find
+ * an entirely empty superblock (entirely empty superblocks in other
+ * fullness classes are returned to the free page map by dsa_free).
+ */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+ span_pointer = pool->spans[1];
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ dsa_pointer next = span->nextspan;
+
+ if (span->nallocatable == span->nmax)
+ destroy_superblock(area, span_pointer);
+
+ span_pointer = next;
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+ }
+}
+
+/*
+ * Print out debugging information about the internal state of the shared
+ * memory area.
+ */
+void
+dsa_dump(dsa_area *area)
+{
+ Size i,
+ j;
+
+ /*
+ * Note: This gives an inconsistent snapshot as it acquires and releases
+ * individual locks as it goes...
+ */
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ fprintf(stderr, "dsa_area handle %x:\n", area->control->handle);
+ fprintf(stderr, " max_total_segment_size: %zu\n",
+ area->control->max_total_segment_size);
+ fprintf(stderr, " total_segment_size: %zu\n",
+ area->control->total_segment_size);
+ fprintf(stderr, " refcnt: %d\n", area->control->refcnt);
+ fprintf(stderr, " pinned: %c\n", area->control->pinned ? 't' : 'f');
+ fprintf(stderr, " segment bins:\n");
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ {
+ if (area->control->segment_bins[i] != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_index segment_index;
+
+ fprintf(stderr,
+ " segment bin %zu (at least %d contiguous pages free):\n",
+ i, i == 0 ? 0 : 1 << (i - 1));
+ segment_index = area->control->segment_bins[i];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, segment_index);
+
+ fprintf(stderr,
+ " segment index %zu, usable_pages = %zu, "
+ "contiguous_pages = %zu, mapped at %p\n",
+ segment_index,
+ segment_map->header->usable_pages,
+ fpm_largest(segment_map->fpm),
+ segment_map->mapped_address);
+ segment_index = segment_map->header->next;
+ }
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ fprintf(stderr, " pools:\n");
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ {
+ bool found = false;
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, i), LW_EXCLUSIVE);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ if (DsaPointerIsValid(area->control->pools[i].spans[j]))
+ found = true;
+ if (found)
+ {
+ if (i == DSA_SCLASS_BLOCK_OF_SPANS)
+ fprintf(stderr, " pool for blocks of span objects:\n");
+ else if (i == DSA_SCLASS_SPAN_LARGE)
+ fprintf(stderr, " pool for large object spans:\n");
+ else
+ fprintf(stderr,
+ " pool for size class %zu (object size %hu bytes):\n",
+ i, dsa_size_classes[i]);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ {
+ if (!DsaPointerIsValid(area->control->pools[i].spans[j]))
+ fprintf(stderr, " fullness class %zu is empty\n", j);
+ else
+ {
+ dsa_pointer span_pointer = area->control->pools[i].spans[j];
+
+ fprintf(stderr, " fullness class %zu:\n", j);
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span;
+
+ span = dsa_get_address(area, span_pointer);
+ fprintf(stderr,
+ " span descriptor at %016lx, "
+ "superblock at %016lx, pages = %zu, "
+ "objects free = %hu/%hu\n",
+ span_pointer, span->start, span->npages,
+ span->nallocatable, span->nmax);
+ span_pointer = span->nextspan;
+ }
+ }
+ }
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, i));
+ }
+}
+
+/*
+ * A callback function for when the control segment for a dsa_area is
+ * detached.
+ */
+static void
+dsa_on_dsm_segment_detach(dsm_segment *segment, Datum arg)
+{
+ bool destroy = false;
+ dsa_area_control *control =
+ (dsa_area_control *) dsm_segment_address(segment);
+
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ control->handle ^ 0));
+
+ /* Decrement the reference count for the DSA area. */
+ LWLockAcquire(&control->lock, LW_EXCLUSIVE);
+ if (--control->refcnt == 0)
+ destroy = true;
+ LWLockRelease(&control->lock);
+
+ /*
+ * If we are the last to detach from the area, then we must unpin all
+ * segments so they can be returned to the OS.
+ */
+ if (destroy)
+ {
+ int i;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ dsm_handle handle;
+
+ handle = control->segment_handles[i];
+ if (handle != DSM_HANDLE_INVALID)
+ dsm_unpin_segment(handle);
+ }
+ }
+}
+
+/*
+ * Add a new span to fullness class 1 of the indicated pool.
+ */
+static void
+init_span(dsa_area *area,
+ dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ Size obsize = dsa_size_classes[size_class];
+
+ /*
+ * The per-pool lock must be held because we manipulate the span list for
+ * this pool.
+ */
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /* Push this span onto the front of the span list for fullness class 1. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ {
+ dsa_area_span *head = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+
+ head->prevspan = span_pointer;
+ }
+ span->pool = DsaAreaPoolToDsaPointer(area, pool);
+ span->nextspan = pool->spans[1];
+ span->prevspan = InvalidDsaPointer;
+ pool->spans[1] = span_pointer;
+
+ span->start = start;
+ span->npages = npages;
+ span->size_class = size_class;
+ span->ninitialized = 0;
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * A block-of-spans contains its own descriptor, so mark one object as
+ * initialized and reduce the count of allocatable objects by one.
+ * Doing this here has the side effect of also reducing nmax by one,
+ * which is important to make sure we free this object at the correct
+ * time.
+ */
+ span->ninitialized = 1;
+ span->nallocatable = FPM_PAGE_SIZE / obsize - 1;
+ }
+ else if (size_class != DSA_SCLASS_SPAN_LARGE)
+ span->nallocatable = DSA_SUPERBLOCK_SIZE / obsize;
+ span->firstfree = DSA_SPAN_NOTHING_FREE;
+ span->nmax = span->nallocatable;
+ span->fclass = 1;
+}
+
+/*
+ * Transfer the first span in one fullness class to the head of another
+ * fullness class.
+ */
+static bool
+transfer_first_span(dsa_area *area,
+ dsa_area_pool *pool, int fromclass, int toclass)
+{
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+
+ /* Can't do it if source list is empty. */
+ span_pointer = pool->spans[fromclass];
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+
+ /* Remove span from head of source list. */
+ span = dsa_get_address(area, span_pointer);
+ pool->spans[fromclass] = span->nextspan;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+
+ /* Add span to head of target list. */
+ span->nextspan = pool->spans[toclass];
+ pool->spans[toclass] = span_pointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = toclass;
+
+ return true;
+}
+
+/*
+ * Allocate one object of the requested size class from the given area.
+ */
+static inline dsa_pointer
+alloc_object(dsa_area *area, int size_class)
+{
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_area_span *span;
+ dsa_pointer block;
+ dsa_pointer result;
+ char *object;
+ Size size;
+
+ /*
+ * Even though ensure_active_superblock can in turn call alloc_object if
+ * it needs to allocate a new span, that's always from a different pool,
+ * and the order of lock acquisition is always the same, so it's OK that
+ * we hold this lock for the duration of this function.
+ */
+ Assert(!LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /*
+ * If there's no active superblock, we must successfully obtain one or
+ * fail the request.
+ */
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ !ensure_active_superblock(area, pool, size_class))
+ {
+ result = InvalidDsaPointer;
+ }
+ else
+ {
+ /*
+ * There should be a block in fullness class 1 at this point, and it
+ * should never be completely full. Thus we can either pop an object
+ * from the free list or, failing that, initialize a new object.
+ */
+ Assert(DsaPointerIsValid(pool->spans[1]));
+ span = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+ Assert(span->nallocatable > 0);
+ block = span->start;
+ Assert(size_class < DSA_NUM_SIZE_CLASSES);
+ size = dsa_size_classes[size_class];
+ if (span->firstfree != DSA_SPAN_NOTHING_FREE)
+ {
+ result = block + span->firstfree * size;
+ object = dsa_get_address(area, result);
+ span->firstfree = NextFreeObjectIndex(object);
+ }
+ else
+ {
+ result = block + span->ninitialized * size;
+ ++span->ninitialized;
+ }
+ --span->nallocatable;
+
+ /* If it's now full, move it to the highest-numbered fullness class. */
+ if (span->nallocatable == 0)
+ transfer_first_span(area, pool, 1, DSA_FULLNESS_CLASSES - 1);
+ }
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+
+ return result;
+}
+
+/*
+ * Ensure an active (i.e. fullness class 1) superblock, unless all existing
+ * superblocks are completely full and no more can be allocated.
+ *
+ * Fullness classes K of 0..N are loosely intended to represent blocks whose
+ * utilization percentage is at least K/N, but we only enforce this rigorously
+ * for the highest-numbered fullness class, which always contains exactly
+ * those blocks that are completely full. It's otherwise acceptable for a
+ * block to be in a higher-numbered fullness class than the one to which it
+ * logically belongs. In addition, the active block, which is always the
+ * first block in fullness class 1, is permitted to have a higher allocation
+ * percentage than would normally be allowable for that fullness class; we
+ * don't move it until it's completely full, and then it goes to the
+ * highest-numbered fullness class.
+ *
+ * It might seem odd that the active block is the head of fullness class 1
+ * rather than fullness class 0, but experience with other allocators has
+ * shown that it's usually better to allocate from a block that's moderately
+ * full rather than one that's nearly empty. Insofar as is reasonably
+ * possible, we want to avoid performing new allocations in a block that would
+ * otherwise become empty soon.
+ */
+static bool
+ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class)
+{
+ dsa_pointer span_pointer;
+ dsa_pointer start_pointer;
+ Size obsize = dsa_size_classes[size_class];
+ Size nmax;
+ int fclass;
+ Size npages = 1;
+ Size first_page;
+ Size i;
+ dsa_segment_map *segment_map;
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /*
+ * Compute the number of objects that will fit in a block of this size
+ * class. Span-of-spans blocks are just a single page, and the first
+ * object isn't available for use because it describes the block-of-spans
+ * itself.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ nmax = FPM_PAGE_SIZE / obsize - 1;
+ else
+ nmax = DSA_SUPERBLOCK_SIZE / obsize;
+
+ /*
+ * If fullness class 1 is empty, try to find a span to put in it by
+ * scanning higher-numbered fullness classes (excluding the last one,
+ * whose blocks are certain to all be completely full).
+ */
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ {
+ span_pointer = pool->spans[fclass];
+
+ while (DsaPointerIsValid(span_pointer))
+ {
+ int tfclass;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+ dsa_area_span *prevspan;
+ dsa_pointer next_span_pointer;
+
+ span = (dsa_area_span *)
+ dsa_get_address(area, span_pointer);
+ next_span_pointer = span->nextspan;
+
+ /* Figure out what fullness class should contain this span. */
+ tfclass = (nmax - span->nallocatable)
+ * (DSA_FULLNESS_CLASSES - 1) / nmax;
+
+ /* Look up next span. */
+ if (DsaPointerIsValid(span->nextspan))
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ else
+ nextspan = NULL;
+
+ /*
+ * If utilization has dropped enough that this now belongs in some
+ * other fullness class, move it there.
+ */
+ if (tfclass < fclass)
+ {
+ /* Remove from the current fullness class list. */
+ if (pool->spans[fclass] == span_pointer)
+ {
+ /* It was the head; remove it. */
+ Assert(!DsaPointerIsValid(span->prevspan));
+ pool->spans[fclass] = span->nextspan;
+ if (nextspan != NULL)
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+ else
+ {
+ /* It was not the head. */
+ Assert(DsaPointerIsValid(span->prevspan));
+ prevspan = (dsa_area_span *)
+ dsa_get_address(area, span->prevspan);
+ prevspan->nextspan = span->nextspan;
+ }
+ if (nextspan != NULL)
+ nextspan->prevspan = span->prevspan;
+
+ /* Push onto the head of the new fullness class list. */
+ span->nextspan = pool->spans[tfclass];
+ pool->spans[tfclass] = span_pointer;
+ span->prevspan = InvalidDsaPointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = tfclass;
+ }
+
+ /* Advance to next span on list. */
+ span_pointer = next_span_pointer;
+ }
+
+ /* Stop now if we found a suitable block. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ return true;
+ }
+
+ /*
+ * If there are no blocks that properly belong in fullness class 1, pick
+ * one from some other fullness class and move it there anyway, so that we
+ * have an allocation target. Our last choice is to transfer a block
+ * that's almost empty (and might become completely empty soon if left
+ * alone), but even that is better than failing, which is what we must do
+ * if there are no blocks at all with freespace.
+ */
+ Assert(!DsaPointerIsValid(pool->spans[1]));
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ if (transfer_first_span(area, pool, fclass, 1))
+ return true;
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ transfer_first_span(area, pool, 0, 1))
+ return true;
+
+ /*
+ * We failed to find an existing span with free objects, so we need to
+ * allocate a new superblock and construct a new span to manage it.
+ *
+ * First, get a dsa_area_span object to describe the new superblock
+ * ... unless this allocation is for a dsa_area_span object, in which case
+ * that's surely not going to work. We handle that case by storing the
+ * span describing a block-of-spans inline.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+ npages = DSA_PAGES_PER_SUPERBLOCK;
+ }
+
+ /* Find or create a segment and allocate the superblock. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }
+ }
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* Compute the start of the superblock. */
+ start_pointer =
+ DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /*
+ * If this is a block-of-spans, carve the descriptor right out of the
+ * allocated space.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * A block-of-spans stores its own span descriptor as its first
+ * object, so the descriptor's dsa_pointer is the start of the block.
+ */
+ span_pointer = start_pointer;
+ }
+
+ /* Initialize span and pagemap. */
+ init_span(area, span_pointer, pool, start_pointer, npages, size_class);
+ for (i = 0; i < npages; ++i)
+ segment_map->pagemap[first_page + i] = span_pointer;
+
+ return true;
+}
+
+/*
+ * Return the segment map corresponding to a given segment index, mapping the
+ * segment in if necessary.
+ */
+static dsa_segment_map *
+get_segment_by_index(dsa_area *area, dsa_segment_index index)
+{
+ if (area->segment_maps[index].mapped_address == NULL) /* unlikely */
+ {
+ dsm_handle handle;
+ dsm_segment *segment;
+ dsa_segment_map *segment_map;
+
+ handle = area->control->segment_handles[index];
+
+ /* This slot has been freed. */
+ if (handle == DSM_HANDLE_INVALID)
+ return NULL;
+
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to segment");
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+ segment_map = &area->segment_maps[index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header =
+ (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = dsm_segment_map_length(segment);
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ Assert(segment_map->header->magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index));
+ }
+
+ return &area->segment_maps[index];
+}
+
+/*
+ * Return a superblock to the free page manager. If the underlying segment
+ * has become entirely free, then return it to the operating system.
+ *
+ * The appropriate pool lock must be held.
+ */
+static void
+destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ int size_class = span->size_class;
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
+
+ /* Remove it from its fullness class list. */
+ unlink_span(area, span);
+
+ /*
+ * Note: Here we acquire the area lock while we already hold a per-pool
+ * lock. We never hold the area lock and then take a pool lock, or we
+ * could deadlock.
+ */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ /* Check if the segment is now entirely free. */
+ if (fpm_largest(segment_map->fpm) == segment_map->header->usable_pages)
+ {
+ dsa_segment_index index = get_segment_index(area, segment_map);
+
+ /* If it's not the segment with extra control data, free it. */
+ if (index != 0)
+ {
+ /*
+ * Give it back to the OS, and allow other backends to detect that
+ * they need to detach.
+ */
+ unlink_segment(area, segment_map);
+ segment_map->header->freed = true;
+ Assert(area->control->total_segment_size >=
+ segment_map->header->size);
+ area->control->total_segment_size -=
+ segment_map->header->size;
+ dsm_unpin_segment(dsm_segment_handle(segment_map->segment));
+ dsm_detach(segment_map->segment);
+ area->control->segment_handles[index] = DSM_HANDLE_INVALID;
+ ++area->control->freed_segment_counter;
+ segment_map->segment = NULL;
+ segment_map->header = NULL;
+ segment_map->mapped_address = NULL;
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /*
+ * Span-of-spans blocks store the span which describes them within the
+ * block itself, so freeing the storage implicitly frees the descriptor
+ * also. If this is a block of any other type, we need to separately free
+ * the span object also. This recursive call to dsa_free will acquire the
+ * span pool's lock. We can't deadlock because the acquisition order is
+ * always some other pool and then the span pool.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+}
+
+static void
+unlink_span(dsa_area *area, dsa_area_span *span)
+{
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ dsa_area_span *next = dsa_get_address(area, span->nextspan);
+
+ next->prevspan = span->prevspan;
+ }
+ if (DsaPointerIsValid(span->prevspan))
+ {
+ dsa_area_span *prev = dsa_get_address(area, span->prevspan);
+
+ prev->nextspan = span->nextspan;
+ }
+ else
+ {
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ pool->spans[span->fclass] = span->nextspan;
+ }
+}
+
+static void
+add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer,
+ int fclass)
+{
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ if (DsaPointerIsValid(pool->spans[fclass]))
+ {
+ dsa_area_span *head = dsa_get_address(area,
+ pool->spans[fclass]);
+
+ head->prevspan = span_pointer;
+ }
+ span->prevspan = InvalidDsaPointer;
+ span->nextspan = pool->spans[fclass];
+ pool->spans[fclass] = span_pointer;
+ span->fclass = fclass;
+}
+
+/*
+ * Detach from an area that was either created or attached to by this process.
+ */
+void
+dsa_detach(dsa_area *area)
+{
+ int i;
+
+ /* Detach from all segments. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_detach(area->segment_maps[i].segment);
+
+ /* Free the backend-local area object. */
+ pfree(area);
+}
+
+/*
+ * Unlink a segment from the bin that contains it.
+ */
+static void
+unlink_segment(dsa_area *area, dsa_segment_map *segment_map)
+{
+ if (segment_map->header->prev != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *prev;
+
+ prev = get_segment_by_index(area, segment_map->header->prev);
+ prev->header->next = segment_map->header->next;
+ }
+ else
+ {
+ Assert(area->control->segment_bins[segment_map->header->bin] ==
+ get_segment_index(area, segment_map));
+ area->control->segment_bins[segment_map->header->bin] =
+ segment_map->header->next;
+ }
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area, segment_map->header->next);
+ next->header->prev = segment_map->header->prev;
+ }
+}
+
+/*
+ * Find a segment that could satisfy a request for 'npages' of contiguous
+ * memory, or return NULL if none can be found. This may involve attaching to
+ * segments that weren't previously attached so that we can query their free
+ * pages map.
+ */
+static dsa_segment_map *
+get_best_segment(dsa_area *area, Size npages)
+{
+ Size bin;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /*
+ * Start searching from the first bin that *might* have enough contiguous
+ * pages.
+ */
+ for (bin = contiguous_pages_to_segment_bin(npages);
+ bin < DSA_NUM_SEGMENT_BINS;
+ ++bin)
+ {
+ /*
+ * The minimum contiguous size that any segment in this bin should
+ * have. We'll re-bin if we see segments with fewer.
+ */
+ Size threshold = (Size) 1 << (bin - 1);
+ dsa_segment_index segment_index;
+
+ /* Search this bin for a segment with enough contiguous space. */
+ segment_index = area->control->segment_bins[bin];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+ dsa_segment_index next_segment_index;
+ Size contiguous_pages;
+
+ segment_map = get_segment_by_index(area, segment_index);
+ next_segment_index = segment_map->header->next;
+ contiguous_pages = fpm_largest(segment_map->fpm);
+
+ /* Not enough for the request, still enough for this bin. */
+ if (contiguous_pages >= threshold && contiguous_pages < npages)
+ {
+ segment_index = next_segment_index;
+ continue;
+ }
+
+ /* Re-bin it if it's no longer in the appropriate bin. */
+ if (contiguous_pages < threshold)
+ {
+ Size new_bin;
+
+ new_bin = contiguous_pages_to_segment_bin(contiguous_pages);
+
+ /* Remove it from its current bin. */
+ unlink_segment(area, segment_map);
+
+ /* Push it onto the front of its new bin. */
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[new_bin];
+ segment_map->header->bin = new_bin;
+ area->control->segment_bins[new_bin] = segment_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area,
+ segment_map->header->next);
+ Assert(next->header->bin == new_bin);
+ next->header->prev = segment_index;
+ }
+
+ /*
+ * But fall through to see if it's enough to satisfy this
+ * request anyway....
+ */
+ }
+
+ /* Check if we are done. */
+ if (contiguous_pages >= npages)
+ return segment_map;
+
+ /* Continue searching the same bin. */
+ segment_index = next_segment_index;
+ }
+ }
+
+ /* Not found. */
+ return NULL;
+}
+
+/*
+ * Create a new segment that can handle at least requested_pages. Returns
+ * NULL if the requested total size limit or maximum allowed number of
+ * segments would be exceeded.
+ */
+static dsa_segment_map *
+make_new_segment(dsa_area *area, Size requested_pages)
+{
+ dsa_segment_index new_index;
+ Size metadata_bytes;
+ Size total_size;
+ Size total_pages;
+ Size usable_pages;
+ dsa_segment_map *segment_map;
+ dsm_segment *segment;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /* Find a segment slot that is not in use (linearly for now). */
+ for (new_index = 1; new_index < DSA_MAX_SEGMENTS; ++new_index)
+ {
+ if (area->control->segment_handles[new_index] == DSM_HANDLE_INVALID)
+ break;
+ }
+ if (new_index == DSA_MAX_SEGMENTS)
+ return NULL;
+
+ /*
+ * If the total size limit has already been reached, then we exit early and
+ * avoid arithmetic wraparound in the unsigned expressions below.
+ */
+ if (area->control->total_segment_size >=
+ area->control->max_total_segment_size)
+ return NULL;
+
+ /*
+ * The size should be at least as big as requested, and at least big
+ * enough to follow a geometric series that approximately doubles the
+ * total storage each time we create a new segment. We use geometric
+ * growth because the underlying DSM system isn't designed for large
+ * numbers of segments (otherwise we might even consider just using one
+ * DSM segment for each large allocation and for each superblock, and then
+ * we wouldn't need to use FreePageManager).
+ *
+ * We decide on a total segment size first, so that we produce tidy
+ * power-of-two sized segments. This is a good property to have if we
+ * move to huge pages in the future. Then we work back to the number of
+ * pages we can fit.
+ */
+ total_size = DSA_INITIAL_SEGMENT_SIZE *
+ ((Size) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
+ total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size,
+ area->control->max_total_segment_size -
+ area->control->total_segment_size);
+
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ sizeof(dsa_pointer) * total_pages;
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ if (total_size <= metadata_bytes)
+ return NULL;
+ usable_pages = (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ Assert(metadata_bytes + usable_pages * FPM_PAGE_SIZE <= total_size);
+
+ /* See if that is enough... */
+ if (requested_pages > usable_pages)
+ {
+ /*
+ * We'll make an odd-sized segment, working forward from the requested
+ * number of pages.
+ */
+ usable_pages = requested_pages;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ usable_pages * sizeof(dsa_pointer);
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ total_size = metadata_bytes + usable_pages * FPM_PAGE_SIZE;
+
+ /* Is that too large for dsa_pointer's addressing scheme? */
+ if (total_size > DSA_MAX_SEGMENT_SIZE)
+ return NULL;
+
+ /* Would that exceed the limit? */
+ if (total_size > area->control->max_total_segment_size -
+ area->control->total_segment_size)
+ return NULL;
+ }
+
+ /* Create the segment. */
+ segment = dsm_create(total_size, 0);
+ if (segment == NULL)
+ return NULL;
+ dsm_pin_segment(segment);
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+
+ /* Store the handle in shared memory to be found by index. */
+ area->control->segment_handles[new_index] =
+ dsm_segment_handle(segment);
+
+ area->control->total_segment_size += total_size;
+ Assert(area->control->total_segment_size <=
+ area->control->max_total_segment_size);
+
+ /* Build a segment map for this segment in this backend. */
+ segment_map = &area->segment_maps[new_index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = total_size;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Set up the segment header and put it in the appropriate bin. */
+ segment_map->header->magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ new_index;
+ segment_map->header->usable_pages = usable_pages;
+ segment_map->header->size = total_size;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[segment_map->header->bin];
+ segment_map->header->freed = false;
+ area->control->segment_bins[segment_map->header->bin] = new_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next =
+ get_segment_by_index(area, segment_map->header->next);
+
+ Assert(next->header->bin == segment_map->header->bin);
+ next->header->prev = new_index;
+ }
+
+ return segment_map;
+}
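
A note on the sizing arithmetic in make_new_segment above, as a minimal
standalone sketch rather than the patch's own code. Every constant and
sketch_* name here is an assumption made for illustration: the real values
of DSA_INITIAL_SEGMENT_SIZE, DSA_NUM_SEGMENTS_AT_EACH_SIZE,
DSA_MAX_SEGMENT_SIZE and FPM_PAGE_SIZE are defined elsewhere in the patch,
and the fixed metadata size stands in for the MAXALIGN'd segment header plus
FreePageManager. It shows the preferred size doubling every few slots, and
the page map plus headers being rounded up to a page boundary before the
usable pages are counted.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only; not taken from the patch. */
#define SKETCH_PAGE_SIZE             ((size_t) 4096)
#define SKETCH_INITIAL_SEGMENT_SIZE  ((size_t) 1024 * 1024)
#define SKETCH_SEGMENTS_AT_EACH_SIZE ((size_t) 4)
#define SKETCH_MAX_SEGMENT_SIZE      ((size_t) 1 << 30)
#define SKETCH_FIXED_METADATA        ((size_t) 256) /* header + FreePageManager stand-in */
#define SKETCH_POINTER_SIZE          ((size_t) 8)   /* sizeof(dsa_pointer) stand-in */

/* Round a byte count up to the next page boundary. */
static size_t
sketch_round_up_to_page(size_t bytes)
{
    size_t rem = bytes % SKETCH_PAGE_SIZE;

    return rem == 0 ? bytes : bytes + (SKETCH_PAGE_SIZE - rem);
}

/*
 * Preferred size for the segment in slot 'index', before the area-wide
 * total size limit is applied.  Doubling in a loop keeps the result capped
 * at the maximum without overflowing size_t.
 */
static size_t
sketch_segment_size(size_t index)
{
    size_t doublings = index / SKETCH_SEGMENTS_AT_EACH_SIZE;
    size_t size = SKETCH_INITIAL_SEGMENT_SIZE;

    while (doublings-- > 0 && size < SKETCH_MAX_SEGMENT_SIZE)
        size *= 2;
    return size;
}

/*
 * Number of pages left for allocations once the metadata (fixed headers
 * plus one page map entry per page) has been padded to a page boundary.
 * Returns 0 if the metadata alone would fill the segment.
 */
static size_t
sketch_usable_pages(size_t total_size)
{
    size_t total_pages = total_size / SKETCH_PAGE_SIZE;
    size_t metadata_bytes;

    metadata_bytes = sketch_round_up_to_page(SKETCH_FIXED_METADATA +
                                             SKETCH_POINTER_SIZE * total_pages);
    if (total_size <= metadata_bytes)
        return 0;
    assert(metadata_bytes % SKETCH_PAGE_SIZE == 0);
    return (total_size - metadata_bytes) / SKETCH_PAGE_SIZE;
}
```

Under these assumed numbers, a 1MB segment spends one page on metadata and
leaves 255 usable pages. A request larger than that takes the "odd-sized
segment" path in the patch, which works forward from the requested page
count instead.
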
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index d806664..8c6abe3 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -182,7 +182,7 @@ dsm_postmaster_startup(PGShmemHeader *shim)
Assert(dsm_control_address == NULL);
Assert(dsm_control_mapped_size == 0);
dsm_control_handle = random();
- if (dsm_control_handle == 0)
+ if (dsm_control_handle == DSM_HANDLE_INVALID)
continue;
if (dsm_impl_op(DSM_OP_CREATE, dsm_control_handle, segsize,
&dsm_control_impl_private, &dsm_control_address,
@@ -476,6 +476,8 @@ dsm_create(Size size, int flags)
{
Assert(seg->mapped_address == NULL && seg->mapped_size == 0);
seg->handle = random();
+ if (seg->handle == DSM_HANDLE_INVALID) /* Reserve sentinel */
+ continue;
if (dsm_impl_op(DSM_OP_CREATE, seg->handle, size, &seg->impl_private,
&seg->mapped_address, &seg->mapped_size, ERROR))
break;
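
The two dsm.c hunks above reserve DSM_HANDLE_INVALID as a sentinel by
retrying whenever random() happens to produce it. A minimal sketch of that
retry pattern follows, outside of PostgreSQL. The sentinel value of 0 is an
assumption inferred from the "== 0" test being replaced, and the sketch_*
names and the injectable random source are hypothetical, added only so the
behaviour can be exercised deterministically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sentinel value; the patch uses DSM_HANDLE_INVALID. */
#define SKETCH_HANDLE_INVALID ((uint32_t) 0)

/* Hypothetical stand-in for random(), injectable for testing. */
typedef uint32_t (*sketch_handle_source) (void);

/*
 * Draw candidate handles until one differs from the reserved sentinel, so
 * that the sentinel can safely mean "no segment here" in shared tables.
 */
static uint32_t
sketch_pick_handle(sketch_handle_source next_random)
{
    uint32_t handle;

    assert(next_random != NULL);
    do
    {
        handle = next_random();
    } while (handle == SKETCH_HANDLE_INVALID);

    return handle;
}

/* Deterministic source that yields 0, 0, and then 7 on every later call. */
static uint32_t
sketch_fake_random(void)
{
    static const uint32_t values[] = {0, 0, 7};
    static unsigned pos = 0;
    uint32_t value = values[pos];

    if (pos < 2)
        pos++;
    return value;
}
```

Reserving the value is what allows make_new_segment and destroy_superblock
to use DSM_HANDLE_INVALID in segment_handles[] to mark a free slot.
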
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index b2403e1..20973af 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/mmgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = aset.o mcxt.o portalmem.o
+OBJS = aset.o freepage.o mcxt.o portalmem.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
new file mode 100644
index 0000000..8f8dca5
--- /dev/null
+++ b/src/backend/utils/mmgr/freepage.c
@@ -0,0 +1,1813 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.c
+ * Management of free memory pages.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/freepage.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+
+#include "utils/freepage.h"
+#include "utils/relptr.h"
+
+
+/* Magic numbers to identify various page types */
+#define FREE_PAGE_SPAN_LEADER_MAGIC 0xea4020f0
+#define FREE_PAGE_LEAF_MAGIC 0x98eae728
+#define FREE_PAGE_INTERNAL_MAGIC 0x19aa32c9
+
+/* Doubly linked list of spans of free pages; stored in first page of span. */
+struct FreePageSpanLeader
+{
+ int magic; /* always FREE_PAGE_SPAN_LEADER_MAGIC */
+ Size npages; /* number of pages in span */
+ RelptrFreePageSpanLeader prev;
+ RelptrFreePageSpanLeader next;
+};
+
+/* Common header for btree leaf and internal pages. */
+typedef struct FreePageBtreeHeader
+{
+ int magic; /* FREE_PAGE_LEAF_MAGIC or
+ * FREE_PAGE_INTERNAL_MAGIC */
+ Size nused; /* number of items used */
+ RelptrFreePageBtree parent; /* uplink */
+} FreePageBtreeHeader;
+
+/* Internal key; points to next level of btree. */
+typedef struct FreePageBtreeInternalKey
+{
+ Size first_page; /* low bound for keys on child page */
+ RelptrFreePageBtree child; /* downlink */
+} FreePageBtreeInternalKey;
+
+/* Leaf key; no payload data. */
+typedef struct FreePageBtreeLeafKey
+{
+ Size first_page; /* first page in span */
+ Size npages; /* number of pages in span */
+} FreePageBtreeLeafKey;
+
+/* Work out how many keys will fit on a page. */
+#define FPM_ITEMS_PER_INTERNAL_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeInternalKey))
+#define FPM_ITEMS_PER_LEAF_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeLeafKey))
+
+/* A btree page of either sort */
+struct FreePageBtree
+{
+ FreePageBtreeHeader hdr;
+ union
+ {
+ FreePageBtreeInternalKey internal_key[FPM_ITEMS_PER_INTERNAL_PAGE];
+ FreePageBtreeLeafKey leaf_key[FPM_ITEMS_PER_LEAF_PAGE];
+ } u;
+};
+
+/* Results of a btree search */
+typedef struct FreePageBtreeSearchResult
+{
+ FreePageBtree *page;
+ Size index;
+ bool found;
+ unsigned split_pages;
+} FreePageBtreeSearchResult;
+
+/* Helper functions */
+static void FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm,
+ FreePageBtree *btp);
+static Size FreePageBtreeCleanup(FreePageManager *fpm);
+static FreePageBtree *FreePageBtreeFindLeftSibling(char *base,
+ FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeFindRightSibling(char *base,
+ FreePageBtree *btp);
+static Size FreePageBtreeFirstKey(FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeGetRecycled(FreePageManager *fpm);
+static void FreePageBtreeInsertInternal(char *base, FreePageBtree *btp,
+ Size index, Size first_page, FreePageBtree *child);
+static void FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index,
+ Size first_page, Size npages);
+static void FreePageBtreeRecycle(FreePageManager *fpm, Size pageno);
+static void FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp,
+ Size index);
+static void FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp);
+static void FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result);
+static Size FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page);
+static Size FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page);
+static FreePageBtree *FreePageBtreeSplitPage(FreePageManager *fpm,
+ FreePageBtree *btp);
+static void FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp);
+static void FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf);
+static void FreePageManagerDumpSpans(FreePageManager *fpm,
+ FreePageSpanLeader *span, Size expected_pages,
+ StringInfo buf);
+static bool FreePageManagerGetInternal(FreePageManager *fpm, Size npages,
+ Size *first_page);
+static Size FreePageManagerPutInternal(FreePageManager *fpm, Size first_page,
+ Size npages, bool soft);
+static void FreePagePopSpanLeader(FreePageManager *fpm, Size pageno);
+static void FreePagePushSpanLeader(FreePageManager *fpm, Size first_page,
+ Size npages);
+static void FreePageManagerUpdateLargest(FreePageManager *fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+static Size sum_free_pages(FreePageManager *fpm);
+#endif
+
+/*
+ * Initialize a new, empty free page manager.
+ *
+ * 'fpm' should reference caller-provided memory large enough to contain a
+ * FreePageManager. We'll initialize it here.
+ *
+ * 'base' is the address to which all pointers are relative. When managing
+ * a dynamic shared memory segment, it should normally be the base of the
+ * segment. When managing backend-private memory, it can be either NULL or,
+ * if managing a single contiguous extent of memory, the start of that extent.
+ */
+void
+FreePageManagerInitialize(FreePageManager *fpm, char *base)
+{
+ Size f;
+
+ relptr_store(base, fpm->self, fpm);
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ relptr_store(base, fpm->btree_recycle, (FreePageSpanLeader *) NULL);
+ fpm->btree_depth = 0;
+ fpm->btree_recycle_count = 0;
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->contiguous_pages = 0;
+ fpm->contiguous_pages_dirty = true;
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages = 0;
+#endif
+
+ for (f = 0; f < FPM_NUM_FREELISTS; f++)
+ relptr_store(base, fpm->freelist[f], (FreePageSpanLeader *) NULL);
+}
+
+/*
+ * Allocate a run of pages of the given length from the free page manager.
+ * The return value indicates whether we were able to satisfy the request;
+ * if true, the first page of the allocation is stored in *first_page.
+ */
+bool
+FreePageManagerGet(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ bool result;
+
+ result = FreePageManagerGetInternal(fpm, npages, first_page);
+
+ /*
+ * It's a bit counterintuitive, but allocating pages can actually create
+ * opportunities for cleanup that create larger ranges. We might pull a
+ * key out of the btree that enables the item at the head of the btree
+ * recycle list to be inserted; and then if there are more items behind it
+ * one of those might cause two currently-separated ranges to merge,
+ * creating a single range of contiguous pages larger than any that
+ * existed previously. It might be worth trying to improve the cleanup
+ * algorithm to avoid such corner cases, but for now we just notice the
+ * condition and do the appropriate reporting.
+ */
+ FreePageBtreeCleanup(fpm);
+
+ /*
+ * TODO: We could take Max(fpm->contiguous_pages, result of
+ * FreePageBtreeCleanup) and give it to FreePageManagerUpdateLargest as a
+ * starting point for its search, potentially avoiding a bunch of work,
+ * since there is no way the largest contiguous run is bigger than that.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ if (result)
+ {
+ Assert(fpm->free_pages >= npages);
+ fpm->free_pages -= npages;
+ }
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+ return result;
+}
+
+#ifdef FPM_EXTRA_ASSERTS
+static void
+sum_free_pages_recurse(FreePageManager *fpm, FreePageBtree *btp, Size *sum)
+{
+ char *base = fpm_segment_base(fpm);
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ||
+ btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ ++*sum;
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ Size index;
+
+
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ sum_free_pages_recurse(fpm, child, sum);
+ }
+ }
+}
+static Size
+sum_free_pages(FreePageManager *fpm)
+{
+ FreePageSpanLeader *recycle;
+ char *base = fpm_segment_base(fpm);
+ Size sum = 0;
+ int list;
+
+ /* Count the spans by scanning the freelists. */
+ for (list = 0; list < FPM_NUM_FREELISTS; ++list)
+ {
+
+ if (!relptr_is_null(fpm->freelist[list]))
+ {
+ FreePageSpanLeader *candidate =
+ relptr_access(base, fpm->freelist[list]);
+
+ do
+ {
+ sum += candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ }
+
+ /* Count the pages used by the btree itself. */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ sum_free_pages_recurse(fpm, root, &sum);
+ }
+
+ /* Count the recycle list. */
+ for (recycle = relptr_access(base, fpm->btree_recycle);
+ recycle != NULL;
+ recycle = relptr_access(base, recycle->next))
+ {
+ Assert(recycle->npages == 1);
+ ++sum;
+ }
+
+ return sum;
+}
+#endif
+
+/*
+ * Recompute the size of the largest run of pages that the user could
+ * successfully get, if it has been marked dirty.
+ */
+static void
+FreePageManagerUpdateLargest(FreePageManager *fpm)
+{
+ char *base;
+ Size largest;
+
+ if (!fpm->contiguous_pages_dirty)
+ return;
+
+ base = fpm_segment_base(fpm);
+ largest = 0;
+ if (!relptr_is_null(fpm->freelist[FPM_NUM_FREELISTS - 1]))
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[FPM_NUM_FREELISTS - 1]);
+ do
+ {
+ if (candidate->npages > largest)
+ largest = candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ else
+ {
+ Size f = FPM_NUM_FREELISTS - 1;
+
+ do
+ {
+ --f;
+ if (!relptr_is_null(fpm->freelist[f]))
+ {
+ largest = f + 1;
+ break;
+ }
+ } while (f > 0);
+ }
+
+ fpm->contiguous_pages = largest;
+ fpm->contiguous_pages_dirty = false;
+}
+
+/*
+ * Transfer a run of pages to the free page manager.
+ */
+void
+FreePageManagerPut(FreePageManager *fpm, Size first_page, Size npages)
+{
+ Size contiguous_pages;
+
+ Assert(npages > 0);
+
+ /* Record the new pages. */
+ contiguous_pages =
+ FreePageManagerPutInternal(fpm, first_page, npages, false);
+
+ /*
+ * If the new range we inserted into the page manager was contiguous with
+ * an existing range, it may have opened up cleanup opportunities.
+ */
+ if (contiguous_pages > npages)
+ {
+ Size cleanup_contiguous_pages;
+
+ cleanup_contiguous_pages = FreePageBtreeCleanup(fpm);
+ if (cleanup_contiguous_pages > contiguous_pages)
+ contiguous_pages = cleanup_contiguous_pages;
+ }
+
+ /*
+ * TODO: Figure out how to avoid setting this every time. It may not be as
+ * simple as it looks.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages += npages;
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+}
+
+/*
+ * Produce a debugging dump of the state of a free page manager.
+ */
+char *
+FreePageManagerDump(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ StringInfoData buf;
+ FreePageSpanLeader *recycle;
+ bool dumped_any_freelist = false;
+ Size f;
+
+ /* Initialize output buffer. */
+ initStringInfo(&buf);
+
+ /* Dump general stuff. */
+ appendStringInfo(&buf, "metadata: self %zu max contiguous pages = %zu\n",
+ fpm->self.relptr_off, fpm->contiguous_pages);
+
+ /* Dump btree. */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root;
+
+ appendStringInfo(&buf, "btree depth %u:\n", fpm->btree_depth);
+ root = relptr_access(base, fpm->btree_root);
+ FreePageManagerDumpBtree(fpm, root, NULL, 0, &buf);
+ }
+ else if (fpm->singleton_npages > 0)
+ {
+ appendStringInfo(&buf, "singleton: %zu(%zu)\n",
+ fpm->singleton_first_page, fpm->singleton_npages);
+ }
+
+ /* Dump btree recycle list. */
+ recycle = relptr_access(base, fpm->btree_recycle);
+ if (recycle != NULL)
+ {
+ appendStringInfo(&buf, "btree recycle:");
+ FreePageManagerDumpSpans(fpm, recycle, 1, &buf);
+ }
+
+ /* Dump free lists. */
+ for (f = 0; f < FPM_NUM_FREELISTS; ++f)
+ {
+ FreePageSpanLeader *span;
+
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+ if (!dumped_any_freelist)
+ {
+ appendStringInfo(&buf, "freelists:\n");
+ dumped_any_freelist = true;
+ }
+ appendStringInfo(&buf, " %zu:", f + 1);
+ span = relptr_access(base, fpm->freelist[f]);
+ FreePageManagerDumpSpans(fpm, span, f + 1, &buf);
+ }
+
+ /* And return result to caller. */
+ return buf.data;
+}
+
+
+/*
+ * The first_page value stored at index zero in any non-root page must match
+ * the first_page value stored in its parent at the index which points to that
+ * page. So when the value stored at index zero in a btree page changes, we've
+ * got to walk up the tree adjusting ancestor keys until we reach an ancestor
+ * where that key isn't index zero. This function should be called after
+ * updating the first key on the target page; it will propagate the change
+ * upward as far as needed.
+ *
+ * We assume here that the first key on the page has not changed enough to
+ * require changes in the ordering of keys on its ancestor pages. Thus,
+ * if we search the parent page for the first key greater than or equal to
+ * the first key on the current page, the downlink to this page will be either
+ * the exact index returned by the search (if the first key decreased)
+ * or one less (if the first key increased).
+ */
+static void
+FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ Size first_page;
+ FreePageBtree *parent;
+ FreePageBtree *child;
+
+ /* This might be either a leaf or an internal page. */
+ Assert(btp->hdr.nused > 0);
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ first_page = btp->u.leaf_key[0].first_page;
+ }
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ first_page = btp->u.internal_key[0].first_page;
+ }
+ child = btp;
+
+ /* Loop until we find an ancestor that does not require adjustment. */
+ for (;;)
+ {
+ Size s;
+
+ parent = relptr_access(base, child->hdr.parent);
+ if (parent == NULL)
+ break;
+ s = FreePageBtreeSearchInternal(parent, first_page);
+
+ /* Key is either at index s or index s-1; figure out which. */
+ if (s >= parent->hdr.nused)
+ {
+ Assert(s == parent->hdr.nused);
+ --s;
+ }
+ else
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ if (check != child)
+ {
+ Assert(s > 0);
+ --s;
+ }
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* Debugging double-check. */
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ Assert(s < parent->hdr.nused);
+ Assert(child == check);
+ }
+#endif
+
+ /* Update the parent key. */
+ parent->u.internal_key[s].first_page = first_page;
+
+ /*
+ * If this is the first key in the parent, go up another level; else
+ * done.
+ */
+ if (s > 0)
+ break;
+ child = parent;
+ }
+}
+
+/*
+ * Attempt to reclaim space from the free-page btree. The return value is
+ * the largest range of contiguous pages created by the cleanup operation.
+ */
+static Size
+FreePageBtreeCleanup(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ Size max_contiguous_pages = 0;
+
+ /* Attempt to shrink the depth of the btree. */
+ while (!relptr_is_null(fpm->btree_root))
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ /* If the root contains only one key, reduce depth by one. */
+ if (root->hdr.nused == 1)
+ {
+ /* Shrink depth of tree by one. */
+ Assert(fpm->btree_depth > 0);
+ --fpm->btree_depth;
+ if (root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ /* If root is a leaf, convert only entry to singleton range. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages;
+ }
+ else
+ {
+ FreePageBtree *newroot;
+
+ /* If root is an internal page, make only child the root. */
+ Assert(root->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ relptr_copy(fpm->btree_root, root->u.internal_key[0].child);
+ newroot = relptr_access(base, fpm->btree_root);
+ relptr_store(base, newroot->hdr.parent, (FreePageBtree *) NULL);
+ }
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, root));
+ }
+ else if (root->hdr.nused == 2 &&
+ root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Size end_of_first;
+ Size start_of_second;
+
+ end_of_first = root->u.leaf_key[0].first_page +
+ root->u.leaf_key[0].npages;
+ start_of_second = root->u.leaf_key[1].first_page;
+
+ if (end_of_first + 1 == start_of_second)
+ {
+ Size root_page = fpm_pointer_to_page(base, root);
+
+ if (end_of_first == root_page)
+ {
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[0].first_page);
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[1].first_page);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages +
+ root->u.leaf_key[1].npages + 1;
+ fpm->btree_depth = 0;
+ relptr_store(base, fpm->btree_root,
+ (FreePageBtree *) NULL);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ Assert(max_contiguous_pages == 0);
+ max_contiguous_pages = fpm->singleton_npages;
+ }
+ }
+
+ /* Whether it worked or not, it's time to stop. */
+ break;
+ }
+ else
+ {
+ /* Nothing more to do. Stop. */
+ break;
+ }
+ }
+
+ /*
+ * Attempt to free recycled btree pages. We skip this if releasing the
+ * recycled page would require a btree page split, because the page we're
+ * trying to recycle would be consumed by the split, which would be
+ * counterproductive.
+ *
+ * We also currently only ever attempt to recycle the first page on the
+ * list; that could be made more aggressive, but it's not clear that the
+ * complexity would be worthwhile.
+ */
+ while (fpm->btree_recycle_count > 0)
+ {
+ FreePageBtree *btp;
+ Size first_page;
+ Size contiguous_pages;
+
+ btp = FreePageBtreeGetRecycled(fpm);
+ first_page = fpm_pointer_to_page(base, btp);
+ contiguous_pages = FreePageManagerPutInternal(fpm, first_page, 1, true);
+ if (contiguous_pages == 0)
+ {
+ FreePageBtreeRecycle(fpm, first_page);
+ break;
+ }
+ else
+ {
+ if (contiguous_pages > max_contiguous_pages)
+ max_contiguous_pages = contiguous_pages;
+ }
+ }
+
+ return max_contiguous_pages;
+}
+
+/*
+ * Consider consolidating the given page with its left or right sibling,
+ * if it's fairly empty.
+ */
+static void
+FreePageBtreeConsolidate(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *np;
+ Size max;
+
+ /*
+ * We only try to consolidate pages that are less than a third full. We
+ * could be more aggressive about this, but that might risk performing
+ * consolidation only to end up splitting again shortly thereafter. Since
+ * the btree should be very small compared to the space under management,
+ * our goal isn't so much to ensure that it always occupies the absolutely
+ * smallest possible number of pages as to reclaim pages before things get
+ * too egregiously out of hand.
+ */
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ max = FPM_ITEMS_PER_LEAF_PAGE;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ max = FPM_ITEMS_PER_INTERNAL_PAGE;
+ }
+ if (btp->hdr.nused >= max / 3)
+ return;
+
+ /*
+ * If we can fit our right sibling's keys onto this page, consolidate.
+ */
+ np = FreePageBtreeFindRightSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&btp->u.leaf_key[btp->hdr.nused], &np->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ }
+ else
+ {
+ memcpy(&btp->u.internal_key[btp->hdr.nused], &np->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, btp);
+ }
+ FreePageBtreeRemovePage(fpm, np);
+ return;
+ }
+
+ /*
+ * If we can fit our keys onto our left sibling's page, consolidate. In
+ * this case, we move our keys onto the other page rather than vice
+ * versa, to avoid having to adjust ancestor keys.
+ */
+ np = FreePageBtreeFindLeftSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&np->u.leaf_key[np->hdr.nused], &btp->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ }
+ else
+ {
+ memcpy(&np->u.internal_key[np->hdr.nused], &btp->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, np);
+ }
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+}
+
+/*
+ * Find the passed page's left sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately precedes ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindLeftSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move left. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the leftmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index > 0)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index - 1].child);
+ break;
+ }
+ Assert(index == 0);
+ ++levels;
+ }
+
+ /* Descend right. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[p->hdr.nused - 1].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Find the passed page's right sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately follows ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindRightSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move right. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the rightmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index < p->hdr.nused - 1)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index + 1].child);
+ break;
+ }
+ Assert(index == p->hdr.nused - 1);
+ ++levels;
+ }
+
+ /* Descend left. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[0].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Get the first key on a btree page.
+ */
+static Size
+FreePageBtreeFirstKey(FreePageBtree *btp)
+{
+ Assert(btp->hdr.nused > 0);
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ return btp->u.leaf_key[0].first_page;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ return btp->u.internal_key[0].first_page;
+ }
+}
+
+/*
+ * Get a page from the btree recycle list for use as a btree page.
+ */
+static FreePageBtree *
+FreePageBtreeGetRecycled(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *newhead;
+
+ Assert(victim != NULL);
+ newhead = relptr_access(base, victim->next);
+ if (newhead != NULL)
+ relptr_copy(newhead->prev, victim->prev);
+ relptr_store(base, fpm->btree_recycle, newhead);
+ Assert(fpm_pointer_is_page_aligned(base, victim));
+ fpm->btree_recycle_count--;
+ return (FreePageBtree *) victim;
+}
+
+/*
+ * Insert an item into an internal page.
+ */
+static void
+FreePageBtreeInsertInternal(char *base, FreePageBtree *btp, Size index,
+ Size first_page, FreePageBtree *child)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.internal_key[index + 1], &btp->u.internal_key[index],
+ sizeof(FreePageBtreeInternalKey) * (btp->hdr.nused - index));
+ btp->u.internal_key[index].first_page = first_page;
+ relptr_store(base, btp->u.internal_key[index].child, child);
+ ++btp->hdr.nused;
+}
+
+/*
+ * Insert an item into a leaf page.
+ */
+static void
+FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index, Size first_page,
+ Size npages)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.leaf_key[index + 1], &btp->u.leaf_key[index],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+ btp->u.leaf_key[index].first_page = first_page;
+ btp->u.leaf_key[index].npages = npages;
+ ++btp->hdr.nused;
+}
+
+/*
+ * Put a page on the btree recycle list.
+ */
+static void
+FreePageBtreeRecycle(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *head = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = 1;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->btree_recycle, span);
+ fpm->btree_recycle_count++;
+}
+
+/*
+ * Remove an item from the btree at the given position on the given page.
+ */
+static void
+FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp, Size index)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(index < btp->hdr.nused);
+
+ /* When last item is removed, extirpate entire page from btree. */
+ if (btp->hdr.nused == 1)
+ {
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+
+ /* Physically remove the key from the page. */
+ --btp->hdr.nused;
+ if (index < btp->hdr.nused)
+ memmove(&btp->u.leaf_key[index], &btp->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+
+ /* If we just removed the first key, adjust ancestor keys. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, btp);
+
+ /* Consider whether to consolidate this page with a sibling. */
+ FreePageBtreeConsolidate(fpm, btp);
+}
+
+/*
+ * Remove a page from the btree. Caller is responsible for having relocated
+ * any keys from this page that are still wanted. The page is placed on the
+ * recycled list.
+ */
+static void
+FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *parent;
+ Size index;
+ Size first_page;
+
+ for (;;)
+ {
+ /* Find parent page. */
+ parent = relptr_access(base, btp->hdr.parent);
+ if (parent == NULL)
+ {
+ /* We are removing the root page. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->btree_depth = 0;
+ Assert(fpm->singleton_first_page == 0);
+ Assert(fpm->singleton_npages == 0);
+ return;
+ }
+
+ /*
+ * If the parent contains only one item, we need to remove it as well.
+ */
+ if (parent->hdr.nused > 1)
+ break;
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+ btp = parent;
+ }
+
+ /* Find and remove the downlink. */
+ first_page = FreePageBtreeFirstKey(btp);
+ if (parent->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ index = FreePageBtreeSearchLeaf(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.leaf_key[index],
+ &parent->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ else
+ {
+ index = FreePageBtreeSearchInternal(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.internal_key[index],
+ &parent->u.internal_key[index + 1],
+ sizeof(FreePageBtreeInternalKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ parent->hdr.nused--;
+ Assert(parent->hdr.nused > 0);
+
+ /* Recycle the page. */
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+
+ /* Adjust ancestor keys if needed. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+
+ /* Consider whether to consolidate the parent with a sibling. */
+ FreePageBtreeConsolidate(fpm, parent);
+}
+
+/*
+ * Search the btree for an entry for the given first page and initialize
+ * *result with the results of the search. result->page and result->index
+ * indicate either the position of an exact match or the position at which
+ * the new key should be inserted. result->found is true for an exact match,
+ * otherwise false. result->split_pages will contain the number of additional
+ * btree pages that will be needed when performing a split to insert a key.
+ * Except as described above, the contents of fields in the result object are
+ * undefined on return.
+ */
+static void
+FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *btp = relptr_access(base, fpm->btree_root);
+ Size index;
+
+ result->split_pages = 1;
+
+ /* If the btree is empty, there's nothing to find. */
+ if (btp == NULL)
+ {
+ result->page = NULL;
+ result->found = false;
+ return;
+ }
+
+ /* Descend until we hit a leaf. */
+ while (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ FreePageBtree *child;
+ bool found_exact;
+
+ index = FreePageBtreeSearchInternal(btp, first_page);
+ found_exact = index < btp->hdr.nused &&
+ btp->u.internal_key[index].first_page == first_page;
+
+ /*
+ * If we found an exact match we descend directly. Otherwise, we
+ * descend into the child to the left if possible so that we can find
+ * the insertion point at that child's high end.
+ */
+ if (!found_exact && index > 0)
+ --index;
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_INTERNAL_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Descend to appropriate child page. */
+ Assert(index < btp->hdr.nused);
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ Assert(relptr_access(base, child->hdr.parent) == btp);
+ btp = child;
+ }
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_LEAF_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_LEAF_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Search leaf page. */
+ index = FreePageBtreeSearchLeaf(btp, first_page);
+
+ /* Assemble results. */
+ result->page = btp;
+ result->index = index;
+ result->found = index < btp->hdr.nused &&
+ first_page == btp->u.leaf_key[index].first_page;
+}
+
+/*
+ * Search an internal page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if none.
+ */
+static Size
+FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_INTERNAL_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.internal_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Search a leaf page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if none.
+ */
+static Size
+FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_LEAF_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.leaf_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Allocate a new btree page and move half the keys from the provided page
+ * to the new page. Caller is responsible for making sure that there's a
+ * page available from fpm->btree_recycle. Returns a pointer to the new page,
+ * to which caller must add a downlink.
+ */
+static FreePageBtree *
+FreePageBtreeSplitPage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ FreePageBtree *newsibling;
+
+ newsibling = FreePageBtreeGetRecycled(fpm);
+ newsibling->hdr.magic = btp->hdr.magic;
+ newsibling->hdr.nused = btp->hdr.nused / 2;
+ relptr_copy(newsibling->hdr.parent, btp->hdr.parent);
+ btp->hdr.nused -= newsibling->hdr.nused;
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ memcpy(&newsibling->u.leaf_key,
+ &btp->u.leaf_key[btp->hdr.nused],
+ sizeof(FreePageBtreeLeafKey) * newsibling->hdr.nused);
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ memcpy(&newsibling->u.internal_key,
+ &btp->u.internal_key[btp->hdr.nused],
+ sizeof(FreePageBtreeInternalKey) * newsibling->hdr.nused);
+ FreePageBtreeUpdateParentPointers(fpm_segment_base(fpm), newsibling);
+ }
+
+ return newsibling;
+}
+
+/*
+ * When internal pages are split or merged, the parent pointers of their
+ * children must be updated.
+ */
+static void
+FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp)
+{
+ Size i;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ for (i = 0; i < btp->hdr.nused; ++i)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[i].child);
+ relptr_store(base, child->hdr.parent, btp);
+ }
+}
+
+/*
+ * Debugging dump of btree data.
+ */
+static void
+FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+ Size pageno = fpm_pointer_to_page(base, btp);
+ Size index;
+ FreePageBtree *check_parent;
+
+ check_stack_depth();
+ check_parent = relptr_access(base, btp->hdr.parent);
+ appendStringInfo(buf, " %zu@%d %c", pageno, level,
+ btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ? 'i' : 'l');
+ if (parent != check_parent)
+ appendStringInfo(buf, " [actual parent %zu, expected %zu]",
+ fpm_pointer_to_page(base, check_parent),
+ fpm_pointer_to_page(base, parent));
+ appendStringInfoChar(buf, ':');
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ appendStringInfo(buf, " %zu->%zu",
+ btp->u.internal_key[index].first_page,
+ btp->u.internal_key[index].child.relptr_off / FPM_PAGE_SIZE);
+ else
+ appendStringInfo(buf, " %zu(%zu)",
+ btp->u.leaf_key[index].first_page,
+ btp->u.leaf_key[index].npages);
+ }
+ appendStringInfo(buf, "\n");
+
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ FreePageManagerDumpBtree(fpm, child, btp, level + 1, buf);
+ }
+ }
+}
+
+/*
+ * Debugging dump of free-span data.
+ */
+static void
+FreePageManagerDumpSpans(FreePageManager *fpm, FreePageSpanLeader *span,
+ Size expected_pages, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+
+ while (span != NULL)
+ {
+ if (span->npages != expected_pages)
+ appendStringInfo(buf, " %zu(%zu)", fpm_pointer_to_page(base, span),
+ span->npages);
+ else
+ appendStringInfo(buf, " %zu", fpm_pointer_to_page(base, span));
+ span = relptr_access(base, span->next);
+ }
+
+ appendStringInfo(buf, "\n");
+}
+
+/*
+ * This function allocates a run of pages of the given length from the free
+ * page manager.
+ */
+static bool
+FreePageManagerGetInternal(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = NULL;
+ FreePageSpanLeader *prev;
+ FreePageSpanLeader *next;
+ FreePageBtreeSearchResult result;
+ Size victim_page = 0; /* placate compiler */
+ Size f;
+
+ /*
+ * Search for a free span.
+ *
+ * Right now, we use a simple best-fit policy here, but it's possible for
+ * this to result in memory fragmentation if we're repeatedly asked to
+ * allocate chunks just a little smaller than what we have available.
+ * Hopefully, this is unlikely, because we expect most requests to be
+ * single pages or superblock-sized chunks -- but no policy can be optimal
+ * under all circumstances unless it has knowledge of future allocation
+ * patterns.
+ */
+ for (f = Min(npages, FPM_NUM_FREELISTS) - 1; f < FPM_NUM_FREELISTS; ++f)
+ {
+ /* Skip empty freelists. */
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+
+ /*
+ * All of the freelists except the last one contain only items of a
+ * single size, so we just take the first one. But the final free
+ * list contains everything too big for any of the other lists, so we
+ * need to search the list.
+ */
+ if (f < FPM_NUM_FREELISTS - 1)
+ victim = relptr_access(base, fpm->freelist[f]);
+ else
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[f]);
+ do
+ {
+ if (candidate->npages >= npages && (victim == NULL ||
+ victim->npages > candidate->npages))
+ {
+ victim = candidate;
+ if (victim->npages == npages)
+ break;
+ }
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ break;
+ }
+
+ /* If we didn't find an allocatable span, return failure. */
+ if (victim == NULL)
+ return false;
+
+ /* Remove span from free list. */
+ Assert(victim->magic == FREE_PAGE_SPAN_LEADER_MAGIC);
+ prev = relptr_access(base, victim->prev);
+ next = relptr_access(base, victim->next);
+ if (prev != NULL)
+ relptr_copy(prev->next, victim->next);
+ else
+ relptr_copy(fpm->freelist[f], victim->next);
+ if (next != NULL)
+ relptr_copy(next->prev, victim->prev);
+ victim_page = fpm_pointer_to_page(base, victim);
+
+ /*
+ * If we haven't initialized the btree yet, the victim must be the single
+ * span stored within the FreePageManager itself. Otherwise, we need to
+ * update the btree.
+ */
+ if (relptr_is_null(fpm->btree_root))
+ {
+ Assert(victim_page == fpm->singleton_first_page);
+ Assert(victim->npages == fpm->singleton_npages);
+ Assert(victim->npages >= npages);
+ fpm->singleton_first_page += npages;
+ fpm->singleton_npages -= npages;
+ if (fpm->singleton_npages > 0)
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ }
+ else
+ {
+ /*
+ * If the span we found is exactly the right size, remove it from the
+ * btree completely. Otherwise, adjust the btree entry to reflect the
+ * still-unallocated portion of the span, and put that portion on the
+ * appropriate free list.
+ */
+ FreePageBtreeSearch(fpm, victim_page, &result);
+ Assert(result.found);
+ if (victim->npages == npages)
+ FreePageBtreeRemove(fpm, result.page, result.index);
+ else
+ {
+ FreePageBtreeLeafKey *key;
+
+ /* Adjust btree to reflect remaining pages. */
+ Assert(victim->npages > npages);
+ key = &result.page->u.leaf_key[result.index];
+ Assert(key->npages == victim->npages);
+ key->first_page += npages;
+ key->npages -= npages;
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put the unallocated pages back on the appropriate free list. */
+ FreePagePushSpanLeader(fpm, victim_page + npages,
+ victim->npages - npages);
+ }
+ }
+
+ /* Return results to caller. */
+ *first_page = fpm_pointer_to_page(base, victim);
+ return true;
+}
+
+/*
+ * Put a range of pages into the btree and freelists, consolidating it with
+ * existing free spans just before and/or after it. If 'soft' is true,
+ * only perform the insertion if it can be done without allocating new btree
+ * pages; if false, do it always. Returns 0 if the soft flag caused the
+ * insertion to be skipped, or otherwise the size of the contiguous span
+ * created by the insertion. This may be larger than npages if we're able
+ * to consolidate with an adjacent range. fpm->contiguous_pages_dirty is set
+ * to true if the btree might allocate pages for internal purposes, which might
+ * invalidate the current largest run, requiring it to be recomputed.
+ */
+static Size
+FreePageManagerPutInternal(FreePageManager *fpm, Size first_page, Size npages,
+ bool soft)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtreeSearchResult result;
+ FreePageBtreeLeafKey *prevkey = NULL;
+ FreePageBtreeLeafKey *nextkey = NULL;
+ FreePageBtree *np;
+ Size nindex;
+
+ Assert(npages > 0);
+
+ /* We can store a single free span without initializing the btree. */
+ if (fpm->btree_depth == 0)
+ {
+ if (fpm->singleton_npages == 0)
+ {
+ /* Don't have a span yet; store this one. */
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return fpm->singleton_npages;
+ }
+ else if (fpm->singleton_first_page + fpm->singleton_npages ==
+ first_page)
+ {
+ /* New span immediately follows sole existing span. */
+ fpm->singleton_npages += npages;
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else if (first_page + npages == fpm->singleton_first_page)
+ {
+ /* New span immediately precedes sole existing span. */
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages += npages;
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else
+ {
+ /* Not contiguous; we need to initialize the btree. */
+ Size root_page;
+ FreePageBtree *root;
+
+ if (!relptr_is_null(fpm->btree_recycle))
+ root = FreePageBtreeGetRecycled(fpm);
+ else if (FreePageManagerGetInternal(fpm, 1, &root_page))
+ root = (FreePageBtree *) fpm_page_to_pointer(base, root_page);
+ else
+ {
+ /* We'd better be able to get a page from the existing range. */
+ elog(FATAL, "free page manager btree is corrupt");
+ }
+
+ /* Create the btree and move the preexisting range into it. */
+ root->hdr.magic = FREE_PAGE_LEAF_MAGIC;
+ root->hdr.nused = 1;
+ relptr_store(base, root->hdr.parent, (FreePageBtree *) NULL);
+ root->u.leaf_key[0].first_page = fpm->singleton_first_page;
+ root->u.leaf_key[0].npages = fpm->singleton_npages;
+ relptr_store(base, fpm->btree_root, root);
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->btree_depth = 1;
+
+ /*
+ * Corner case: it may be that the btree root took the very last
+ * free page. In that case, the sole btree entry covers a zero
+ * page run, which is invalid. Overwrite it with the entry we're
+ * trying to insert and get out.
+ */
+ if (root->u.leaf_key[0].npages == 0)
+ {
+ root->u.leaf_key[0].first_page = first_page;
+ root->u.leaf_key[0].npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return npages;
+ }
+
+ /* Fall through to insert the new key. */
+ }
+ }
+
+ /* Search the btree. */
+ FreePageBtreeSearch(fpm, first_page, &result);
+ Assert(!result.found);
+ if (result.index > 0)
+ prevkey = &result.page->u.leaf_key[result.index - 1];
+ if (result.index < result.page->hdr.nused)
+ {
+ np = result.page;
+ nindex = result.index;
+ nextkey = &result.page->u.leaf_key[result.index];
+ }
+ else
+ {
+ np = FreePageBtreeFindRightSibling(base, result.page);
+ nindex = 0;
+ if (np != NULL)
+ nextkey = &np->u.leaf_key[0];
+ }
+
+ /* Consolidate with the previous entry if possible. */
+ if (prevkey != NULL && prevkey->first_page + prevkey->npages >= first_page)
+ {
+ bool remove_next = false;
+ Size result;
+
+ Assert(prevkey->first_page + prevkey->npages == first_page);
+ prevkey->npages = (first_page - prevkey->first_page) + npages;
+
+ /* Check whether we can *also* consolidate with the following entry. */
+ if (nextkey != NULL &&
+ prevkey->first_page + prevkey->npages >= nextkey->first_page)
+ {
+ Assert(prevkey->first_page + prevkey->npages ==
+ nextkey->first_page);
+ prevkey->npages = (nextkey->first_page - prevkey->first_page)
+ + nextkey->npages;
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ remove_next = true;
+ }
+
+ /* Put the span on the correct freelist and save size. */
+ FreePagePopSpanLeader(fpm, prevkey->first_page);
+ FreePagePushSpanLeader(fpm, prevkey->first_page, prevkey->npages);
+ result = prevkey->npages;
+
+ /*
+ * If we consolidated with both the preceding and following entries,
+ * we must remove the following entry. We do this last, because
+ * removing an element from the btree may invalidate pointers we hold
+ * into the current data structure.
+ *
+ * NB: The btree is technically in an invalid state at this point
+ * because we've already updated prevkey to cover the same key space
+ * as nextkey. FreePageBtreeRemove() shouldn't notice that, though.
+ */
+ if (remove_next)
+ FreePageBtreeRemove(fpm, np, nindex);
+
+ return result;
+ }
+
+ /* Consolidate with the next entry if possible. */
+ if (nextkey != NULL && first_page + npages >= nextkey->first_page)
+ {
+ Size newpages;
+
+ /* Compute new size for span. */
+ Assert(first_page + npages == nextkey->first_page);
+ newpages = (nextkey->first_page - first_page) + nextkey->npages;
+
+ /* Put span on correct free list. */
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ FreePagePushSpanLeader(fpm, first_page, newpages);
+
+ /* Update key in place. */
+ nextkey->first_page = first_page;
+ nextkey->npages = newpages;
+
+ /* If reducing first key on page, ancestors might need adjustment. */
+ if (nindex == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, np);
+
+ return nextkey->npages;
+ }
+
+ /* Split leaf page and as many of its ancestors as necessary. */
+ if (result.split_pages > 0)
+ {
+ /*
+ * NB: We could consider various coping strategies here to avoid a
+ * split; most obviously, if np != result.page, we could target that
+ * page instead. More complicated shuffling strategies could be
+ * possible as well; basically, unless every single leaf page is 100%
+ * full, we can jam this key in there if we try hard enough. It's
+ * unlikely that trying that hard is worthwhile, but it's possible we
+ * might need to make more than no effort. For now, we just do the
+ * easy thing, which is nothing.
+ */
+
+ /* If this is a soft insert, it's time to give up. */
+ if (soft)
+ return 0;
+
+ /*
+ * Past this point we might allocate btree pages, which could
+ * potentially shorten any existing run which might happen to be the
+ * current longest. So fpm->contiguous_pages needs to be recomputed.
+ */
+ fpm->contiguous_pages_dirty = true;
+
+ /* Check whether we need to allocate more btree pages to split. */
+ if (result.split_pages > fpm->btree_recycle_count)
+ {
+ Size pages_needed;
+ Size recycle_page;
+ Size i;
+
+ /*
+ * Allocate the required number of pages and split each one in
+ * turn. This should never fail, because if we've got enough
+ * spans of free pages kicking around that we need additional
+ * storage space just to remember them all, then we should
+ * certainly have enough to expand the btree, which should only
+ * ever use a tiny number of pages compared to the number under
+ * management. If it does, something's badly screwed up.
+ */
+ pages_needed = result.split_pages - fpm->btree_recycle_count;
+ for (i = 0; i < pages_needed; ++i)
+ {
+ if (!FreePageManagerGetInternal(fpm, 1, &recycle_page))
+ elog(FATAL, "free page manager btree is corrupt");
+ FreePageBtreeRecycle(fpm, recycle_page);
+ }
+
+ /*
+ * The act of allocating pages to recycle may have invalidated the
+ * results of our previous btree search, so repeat it. (We could
+ * recheck whether any of our split-avoidance strategies that were
+ * not viable before now are, but it hardly seems worthwhile, so
+ * we don't bother. Consolidation can't be possible now if it
+ * wasn't previously.)
+ */
+ FreePageBtreeSearch(fpm, first_page, &result);
+
+ /*
+ * The act of allocating pages for use in constructing our btree
+ * should never cause any page to become more full, so the new
+ * split depth should be no greater than the old one, and perhaps
+ * less if we fortuitously allocated a chunk that freed up a slot
+ * on the page we need to update.
+ */
+ Assert(result.split_pages <= fpm->btree_recycle_count);
+ }
+
+ /* If we still need to perform a split, do it. */
+ if (result.split_pages > 0)
+ {
+ FreePageBtree *split_target = result.page;
+ FreePageBtree *child = NULL;
+ Size key = first_page;
+
+ for (;;)
+ {
+ FreePageBtree *newsibling;
+ FreePageBtree *parent;
+
+ /* Identify parent page, which must receive downlink. */
+ parent = relptr_access(base, split_target->hdr.parent);
+
+ /* Split the page - downlink not added yet. */
+ newsibling = FreePageBtreeSplitPage(fpm, split_target);
+
+ /*
+ * At this point in the loop, we're always carrying a pending
+ * insertion. On the first pass, it's the actual key we're
+ * trying to insert; on subsequent passes, it's the downlink
+ * that needs to be added as a result of the split performed
+ * during the previous loop iteration. Since we've just split
+ * the page, there's definitely room on one of the two
+ * resulting pages.
+ */
+ if (child == NULL)
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into = key < newsibling->u.leaf_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchLeaf(insert_into, key);
+ FreePageBtreeInsertLeaf(insert_into, index, key, npages);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+ else
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into =
+ key < newsibling->u.internal_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchInternal(insert_into, key);
+ FreePageBtreeInsertInternal(base, insert_into, index,
+ key, child);
+ relptr_store(base, child->hdr.parent, insert_into);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+
+ /* If the page we just split has no parent, split the root. */
+ if (parent == NULL)
+ {
+ FreePageBtree *newroot;
+
+ newroot = FreePageBtreeGetRecycled(fpm);
+ newroot->hdr.magic = FREE_PAGE_INTERNAL_MAGIC;
+ newroot->hdr.nused = 2;
+ relptr_store(base, newroot->hdr.parent,
+ (FreePageBtree *) NULL);
+ newroot->u.internal_key[0].first_page =
+ FreePageBtreeFirstKey(split_target);
+ relptr_store(base, newroot->u.internal_key[0].child,
+ split_target);
+ relptr_store(base, split_target->hdr.parent, newroot);
+ newroot->u.internal_key[1].first_page =
+ FreePageBtreeFirstKey(newsibling);
+ relptr_store(base, newroot->u.internal_key[1].child,
+ newsibling);
+ relptr_store(base, newsibling->hdr.parent, newroot);
+ relptr_store(base, fpm->btree_root, newroot);
+ fpm->btree_depth++;
+
+ break;
+ }
+
+ /* If the parent page isn't full, insert the downlink. */
+ key = newsibling->u.internal_key[0].first_page;
+ if (parent->hdr.nused < FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Size index;
+
+ index = FreePageBtreeSearchInternal(parent, key);
+ FreePageBtreeInsertInternal(base, parent, index,
+ key, newsibling);
+ relptr_store(base, newsibling->hdr.parent, parent);
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+ break;
+ }
+
+ /* The parent also needs to be split, so loop around. */
+ child = newsibling;
+ split_target = parent;
+ }
+
+ /*
+ * The loop above did the insert, so we just need to update the free
+ * list, and we're done.
+ */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+ }
+ }
+
+ /* Physically add the key to the page. */
+ Assert(result.page->hdr.nused < FPM_ITEMS_PER_LEAF_PAGE);
+ FreePageBtreeInsertLeaf(result.page, result.index, first_page, npages);
+
+ /* If new first key on page, ancestors might need adjustment. */
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put it on the free list. */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+}
+
+/*
+ * Remove a FreePageSpanLeader from the linked-list that contains it, either
+ * because we're changing the size of the span, or because we're allocating it.
+ */
+static void
+FreePagePopSpanLeader(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *span;
+ FreePageSpanLeader *next;
+ FreePageSpanLeader *prev;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+
+ next = relptr_access(base, span->next);
+ prev = relptr_access(base, span->prev);
+ if (next != NULL)
+ relptr_copy(next->prev, span->prev);
+ if (prev != NULL)
+ relptr_copy(prev->next, span->next);
+ else
+ {
+ Size f = Min(span->npages, FPM_NUM_FREELISTS) - 1;
+
+ Assert(fpm->freelist[f].relptr_off == pageno * FPM_PAGE_SIZE);
+ relptr_copy(fpm->freelist[f], span->next);
+ }
+}
+
+/*
+ * Initialize a new FreePageSpanLeader and put it on the appropriate free list.
+ */
+static void
+FreePagePushSpanLeader(FreePageManager *fpm, Size first_page, Size npages)
+{
+ char *base = fpm_segment_base(fpm);
+ Size f = Min(npages, FPM_NUM_FREELISTS) - 1;
+ FreePageSpanLeader *head = relptr_access(base, fpm->freelist[f]);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, first_page);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = npages;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->freelist[f], span);
+}
diff --git a/src/include/storage/dsa.h b/src/include/storage/dsa.h
new file mode 100644
index 0000000..1d18f16
--- /dev/null
+++ b/src/include/storage/dsa.h
@@ -0,0 +1,100 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.h
+ * Dynamic shared memory areas.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/storage/dsa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSA_H
+#define DSA_H
+
+#include "postgres.h"
+
+#include "port/atomics.h"
+#include "storage/dsm.h"
+
+/* The opaque type used for an area. */
+struct dsa_area;
+typedef struct dsa_area dsa_area;
+
+/*
+ * If this system doesn't support atomic operations on 64 bit values then
+ * we fall back to 32 bit dsa_pointer. For testing purposes,
+ * USE_SMALL_DSA_POINTER can be defined to force the use of 32 bit
+ * dsa_pointer even on systems that support 64 bit atomics.
+ */
+#ifndef PG_HAVE_ATOMIC_U64_SUPPORT
+#define SIZEOF_DSA_POINTER 4
+#else
+#ifdef USE_SMALL_DSA_POINTER
+#define SIZEOF_DSA_POINTER 4
+#else
+#define SIZEOF_DSA_POINTER 8
+#endif
+#endif
+
+/*
+ * The type of 'relative pointers' to memory allocated by a dynamic shared
+ * area. dsa_pointer values can be shared with other processes, but must be
+ * converted to backend-local pointers before they can be dereferenced. See
+ * dsa_get_address. An atomic variant of the type and appropriately sized
+ * atomic operations are also defined.
+ */
+#if SIZEOF_DSA_POINTER == 4
+typedef uint32 dsa_pointer;
+typedef pg_atomic_uint32 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u32
+#define dsa_pointer_atomic_read pg_atomic_read_u32
+#define dsa_pointer_atomic_write pg_atomic_write_u32
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u32
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u32
+#else
+typedef uint64 dsa_pointer;
+typedef pg_atomic_uint64 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u64
+#define dsa_pointer_atomic_read pg_atomic_read_u64
+#define dsa_pointer_atomic_write pg_atomic_write_u64
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u64
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u64
+#endif
+
+/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
+#define InvalidDsaPointer ((dsa_pointer) 0)
+
+/* Check if a dsa_pointer value is valid. */
+#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
+
+/*
+ * The type used for dsa_area handles. dsa_handle values can be shared with
+ * other processes, so that they can attach to the area. This provides a way to
+ * share allocated storage with other processes.
+ *
+ * The handle for a dsa_area is currently implemented as the dsm_handle
+ * for the first DSM segment backing this dynamic storage area, but client
+ * code shouldn't assume that is true.
+ */
+typedef dsm_handle dsa_handle;
+
+extern void dsa_startup(void);
+
+extern dsa_area *dsa_create_dynamic(int tranche_id, const char *tranche_name);
+extern dsa_area *dsa_attach_dynamic(dsa_handle handle);
+extern void dsa_pin_mapping(dsa_area *area);
+extern void dsa_detach(dsa_area *area);
+extern void dsa_pin(dsa_area *area);
+extern void dsa_unpin(dsa_area *area);
+extern void dsa_set_size_limit(dsa_area *area, Size limit);
+extern dsa_handle dsa_get_handle(dsa_area *area);
+extern dsa_pointer dsa_allocate(dsa_area *area, Size size);
+extern void dsa_free(dsa_area *area, dsa_pointer dp);
+extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern void dsa_trim(dsa_area *area);
+extern void dsa_dump(dsa_area *area);
+
+#endif /* DSA_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 8be7c9a..bc91be6 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -19,6 +19,9 @@ typedef struct dsm_segment dsm_segment;
#define DSM_CREATE_NULL_IF_MAXSEGMENTS 0x0001
+/* A sentinel value for an invalid DSM handle. */
+#define DSM_HANDLE_INVALID 0
+
/* Startup and shutdown functions. */
struct PGShmemHeader; /* avoid including pg_shmem.h */
extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
diff --git a/src/include/utils/freepage.h b/src/include/utils/freepage.h
new file mode 100644
index 0000000..e509ca2
--- /dev/null
+++ b/src/include/utils/freepage.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.h
+ * Management of page-organized free memory.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/freepage.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef FREEPAGE_H
+#define FREEPAGE_H
+
+#include "storage/lwlock.h"
+#include "utils/relptr.h"
+
+/* Forward declarations. */
+typedef struct FreePageSpanLeader FreePageSpanLeader;
+typedef struct FreePageBtree FreePageBtree;
+typedef struct FreePageManager FreePageManager;
+
+/*
+ * PostgreSQL normally uses 8kB pages for most things, but many common
+ * architecture/operating system pairings use a 4kB page size for memory
+ * allocation, so we do that here also. We assume that a large allocation
+ * is likely to begin on a page boundary; if not, we'll discard bytes from
+ * the beginning and end of the object and use only the middle portion that
+ * is properly aligned. This works, but is not ideal, so it's best to keep
+ * this conservatively small. There don't seem to be any common architectures
+ * where the page size is less than 4kB, so this should be good enough; also,
+ * making it smaller would increase the space consumed by the address space
+ * map, which also uses this page size.
+ */
+#define FPM_PAGE_SIZE 4096
+
+/*
+ * Each freelist except for the last contains only spans of one particular
+ * size. Everything larger goes on the last one. In some sense this seems
+ * like a waste since most allocations are in a few common sizes, but it
+ * means that small allocations can simply pop the head of the relevant list
+ * without needing to worry about whether the object we find there is of
+ * precisely the correct size (because we know it must be).
+ */
+#define FPM_NUM_FREELISTS 129
+
+/* Define relative pointer types. */
+relptr_declare(FreePageBtree, RelptrFreePageBtree);
+relptr_declare(FreePageManager, RelptrFreePageManager);
+relptr_declare(FreePageSpanLeader, RelptrFreePageSpanLeader);
+
+/* Everything we need in order to manage free pages (see freepage.c) */
+struct FreePageManager
+{
+ RelptrFreePageManager self;
+ RelptrFreePageBtree btree_root;
+ RelptrFreePageSpanLeader btree_recycle;
+ unsigned btree_depth;
+ unsigned btree_recycle_count;
+ Size singleton_first_page;
+ Size singleton_npages;
+ Size contiguous_pages;
+ bool contiguous_pages_dirty;
+ RelptrFreePageSpanLeader freelist[FPM_NUM_FREELISTS];
+#ifdef FPM_EXTRA_ASSERTS
+ /* For debugging only, pages put minus pages gotten. */
+ Size free_pages;
+#endif
+};
+
+/* Macros to convert between page numbers (expressed as Size) and pointers. */
+#define fpm_page_to_pointer(base, page) \
+ (AssertVariableIsOfTypeMacro(page, Size), \
+ (base) + FPM_PAGE_SIZE * (page))
+#define fpm_pointer_to_page(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) / FPM_PAGE_SIZE)
+
+/* Macro to convert an allocation size to a number of pages. */
+#define fpm_size_to_pages(sz) \
+ (((sz) + FPM_PAGE_SIZE - 1) / FPM_PAGE_SIZE)
+
+/* Macros to check alignment of absolute and relative pointers. */
+#define fpm_pointer_is_page_aligned(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) % FPM_PAGE_SIZE == 0)
+#define fpm_relptr_is_page_aligned(base, relptr) \
+ ((relptr).relptr_off % FPM_PAGE_SIZE == 0)
+
+/* Macro to find base address of the segment containing a FreePageManager. */
+#define fpm_segment_base(fpm) \
+ (((char *) fpm) - fpm->self.relptr_off)
+
+/* Macro to access a FreePageManager's largest consecutive run of pages. */
+#define fpm_largest(fpm) \
+ (fpm->contiguous_pages)
+
+/* Functions to manipulate the free page map. */
+extern void FreePageManagerInitialize(FreePageManager *fpm, char *base);
+extern bool FreePageManagerGet(FreePageManager *fpm, Size npages,
+ Size *first_page);
+extern void FreePageManagerPut(FreePageManager *fpm, Size first_page,
+ Size npages);
+extern char *FreePageManagerDump(FreePageManager *fpm);
+
+#endif /* FREEPAGE_H */
diff --git a/src/include/utils/relptr.h b/src/include/utils/relptr.h
new file mode 100644
index 0000000..a97dc96
--- /dev/null
+++ b/src/include/utils/relptr.h
@@ -0,0 +1,70 @@
+/*-------------------------------------------------------------------------
+ *
+ * relptr.h
+ * This file contains basic declarations for relative pointers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/relptr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef RELPTR_H
+#define RELPTR_H
+
+/*
+ * Relative pointers are intended to be used when storing an address that may
+ * be relative either to the base of the process's address space or some
+ * dynamic shared memory segment mapped therein.
+ *
+ * The idea here is that you declare a relative pointer as relptr(type)
+ * and then use relptr_access to dereference it and relptr_store to change
+ * it. The use of a union here is a hack, because what's stored in the
+ * relptr is always a Size, never an actual pointer. But including a pointer
+ * in the union allows us to use stupid macro tricks to provide some measure
+ * of type-safety.
+ */
+#define relptr(type) union { type *relptr_type; Size relptr_off; }
+
+#define relptr_declare(type, name) \
+ typedef union { type *relptr_type; Size relptr_off; } name;
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (__typeof__((rp).relptr_type)) ((rp).relptr_off == 0 ? NULL : \
+ (base + (rp).relptr_off)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (void *) ((rp).relptr_off == 0 ? NULL : (base + (rp).relptr_off)))
+#endif
+
+#define relptr_is_null(rp) \
+ ((rp).relptr_off == 0)
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ AssertVariableIsOfTypeMacro(val, __typeof__((rp).relptr_type)), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#endif
+
+#define relptr_copy(rp1, rp2) \
+ ((rp1).relptr_off = (rp2).relptr_off)
+
+#endif /* RELPTR_H */
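
As a reading aid for relptr.h above (not part of the patch): a minimal standalone sketch of the relative-pointer idea, assuming nothing from the PostgreSQL tree. The names demo_relptr, demo_relptr_store, demo_relptr_access and demo_relptr_roundtrip are hypothetical, and two malloc'd buffers stand in for the same segment mapped at two different addresses. It only illustrates why an offset from the segment base, rather than a raw address, stays meaningful in every process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a relptr: an offset from the segment base. */
typedef struct
{
	size_t		off;			/* 0 is reserved to mean NULL */
} demo_relptr;

/* Store a pointer as an offset relative to base (NULL becomes 0). */
static void
demo_relptr_store(char *base, demo_relptr *rp, void *val)
{
	rp->off = (val == NULL) ? 0 : (size_t) ((char *) val - base);
}

/* Convert an offset back to a pointer that is valid for this mapping. */
static void *
demo_relptr_access(char *base, demo_relptr rp)
{
	return rp.off == 0 ? NULL : (void *) (base + rp.off);
}

/*
 * Store a relative pointer through one "mapping" and resolve it through
 * another.  Returns 1 if the offset resolves to the same bytes and NULL
 * survives the round trip, 0 otherwise.
 */
static int
demo_relptr_roundtrip(void)
{
	char	   *map_a = malloc(256);
	char	   *map_b = malloc(256);
	demo_relptr rp;
	demo_relptr null_rp;
	int			ok;

	if (map_a == NULL || map_b == NULL)
	{
		free(map_a);
		free(map_b);
		return 0;
	}

	memset(map_a, 0, 256);
	strcpy(map_a + 64, "hello");
	demo_relptr_store(map_a, &rp, map_a + 64);
	demo_relptr_store(map_a, &null_rp, NULL);

	/* Simulate the same segment contents mapped at a different address. */
	memcpy(map_b, map_a, 256);

	ok = rp.off == 64 &&
		strcmp((char *) demo_relptr_access(map_b, rp), "hello") == 0 &&
		demo_relptr_access(map_b, null_rp) == NULL;

	free(map_a);
	free(map_b);
	return ok;
}
```

As in the real macros, offset 0 doubles as NULL, so an object placed exactly at the segment base cannot be referenced this way.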
On Thu, Nov 10, 2016 at 6:37 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here is a new version that fixes a bug I discovered in freepage.c today.
And here is a new version rebased on top of commit
b40b4dd9e10ea701c8d47ccba9407fc32ed384e5.
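
For anyone reading along, here is a minimal standalone sketch (not part of the patch) of how a dsa_pointer packs a segment number and an offset, mirroring DSA_MAKE_POINTER, DSA_EXTRACT_SEGMENT_NUMBER and DSA_EXTRACT_OFFSET in dsa.c. It assumes the 8-byte pointer case with a 40-bit offset; the demo_* names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match DSA_OFFSET_WIDTH when SIZEOF_DSA_POINTER == 8. */
#define DEMO_OFFSET_WIDTH 40
#define DEMO_OFFSET_MASK ((((uint64_t) 1) << DEMO_OFFSET_WIDTH) - 1)

/* Hypothetical stand-in for the 64-bit dsa_pointer. */
typedef uint64_t demo_dsa_pointer;

/* Pack a segment number into the high bits and an offset into the low bits. */
static demo_dsa_pointer
demo_make_pointer(uint64_t segment_number, uint64_t offset)
{
	return (segment_number << DEMO_OFFSET_WIDTH) | offset;
}

/* Recover the segment number from the high bits. */
static uint64_t
demo_segment_number(demo_dsa_pointer dp)
{
	return dp >> DEMO_OFFSET_WIDTH;
}

/* Recover the offset within the segment from the low bits. */
static uint64_t
demo_offset(demo_dsa_pointer dp)
{
	return dp & DEMO_OFFSET_MASK;
}
```

Resolving such a value to a local address then amounts to looking up where this backend mapped segment demo_segment_number(dp) and adding demo_offset(dp), which is roughly what dsa_get_address does. Segment 0 with offset 0 encodes as 0, the same value as InvalidDsaPointer.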
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
dsa-v6.patch (application/octet-stream)
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index 8a55392..e99ebd2 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -8,7 +8,7 @@ subdir = src/backend/storage/ipc
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
+OBJS = dsa.o dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
procsignal.o shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o \
sinvaladt.o standby.o
diff --git a/src/backend/storage/ipc/dsa.c b/src/backend/storage/ipc/dsa.c
new file mode 100644
index 0000000..472b8ad
--- /dev/null
+++ b/src/backend/storage/ipc/dsa.c
@@ -0,0 +1,1960 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.c
+ * Dynamic shared memory areas.
+ *
+ * This module provides dynamic shared memory areas which are built on top of
+ * DSM segments. While dsm.c allows segments of shared memory to be created
+ * and shared between backends, it isn't designed to deal with small objects.
+ * A DSA area is a shared memory heap backed by one or more DSM segments, with
+ * memory allocated by dsa_allocate() and released by dsa_free().
+ * Unlike the regular system heap, it deals in pseudo-pointers which must be
+ * converted to backend-local pointers before they are dereferenced. These
+ * pseudo-pointers can however be shared with other backends, and can be used
+ * to construct shared data structures.
+ *
+ * Each DSA area manages one or more DSM segments, adding new segments as
+ * required and detaching them when they are no longer needed. Each segment
+ * contains a number of 4KB pages, a free page manager for tracking
+ * consecutive runs of free pages, and a page map for tracking the source of
+ * objects allocated on each page. Allocation requests above 8KB are handled
+ * by choosing a segment and finding consecutive free pages in its free page
+ * manager. Allocation requests for smaller sizes are handled using pools of
+ * objects of a selection of sizes. Each pool consists of a number of 16 page
+ * (64KB) superblocks allocated in the same way as large objects. Allocation
+ * of large objects and new superblocks is serialized by a single LWLock, but
+ * allocation of small objects from pre-existing superblocks uses one LWLock
+ * per pool. Currently there is one pool, and therefore one lock, per size
+ * class. Per-core pools to increase concurrency and strategies for reducing
+ * the resulting fragmentation are areas for future research. Each superblock
+ * is managed with a 'span', which tracks the superblock's freelist. Free
+ * requests are handled by looking in the page map to find which span an
+ * address was allocated from, so that small objects can be returned to the
+ * appropriate free list, and large object pages can be returned directly to
+ * the free page map. When allocating, simple heuristics for selecting
+ * segments and superblocks try to encourage occupied memory to be
+ * concentrated, increasing the likelihood that whole superblocks can become
+ * empty and be returned to the free page manager, and whole segments can
+ * become empty and be returned to the operating system.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/ipc/dsa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/barrier.h"
+#include "storage/dsa.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "utils/freepage.h"
+#include "utils/memutils.h"
+
+/*
+ * The size of the initial DSM segment that backs a dsa_area. After creating
+ * some number of segments of this size we'll double the size, and so on.
+ * Larger segments may be created if necessary to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE (1 * 1024 * 1024)
+
+/*
+ * How many segments to create before we double the segment size. If this is
+ * low, then there is likely to be a lot of wasted space in the largest
+ * segment. If it is high, then we risk running out of segment slots (see
+ * dsm.c's limits on total number of segments), or limiting the total size
+ * an area can manage when using small pointers.
+ */
+#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 4
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
+#endif
+
+/*
+ * The maximum number of DSM segments that an area can own, determined by
+ * the number of bits remaining (but capped at 1024).
+ */
+#define DSA_MAX_SEGMENTS \
+ Min(1024, (1 << ((SIZEOF_DSA_POINTER * 8) - DSA_OFFSET_WIDTH)))
+
+/* The bitmask for extracting the offset from a dsa_pointer. */
+#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
+/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
+#define DSA_PAGES_PER_SUPERBLOCK 16
+
+/*
+ * A magic number used as a sanity check for following DSM segments belonging
+ * to a DSA area (this number will be XORed with the area handle and
+ * the segment index).
+ */
+#define DSA_SEGMENT_HEADER_MAGIC 0x0ce26608
+
+/* Build a dsa_pointer given a segment number and offset. */
+#define DSA_MAKE_POINTER(segment_number, offset) \
+ (((dsa_pointer) (segment_number) << DSA_OFFSET_WIDTH) | (offset))
+
+/* Extract the segment number from a dsa_pointer. */
+#define DSA_EXTRACT_SEGMENT_NUMBER(dp) ((dp) >> DSA_OFFSET_WIDTH)
+
+/* Extract the offset from a dsa_pointer. */
+#define DSA_EXTRACT_OFFSET(dp) ((dp) & DSA_OFFSET_BITMASK)
+
+/* The type used for segment indexes (zero based). */
+typedef Size dsa_segment_index;
+
+/* Sentinel value for dsa_segment_index indicating 'none' or 'end'. */
+#define DSA_SEGMENT_INDEX_NONE (~(dsa_segment_index)0)
+
+/*
+ * How many bins of segments do we have? The bins are used to categorize
+ * segments by their largest contiguous run of free pages.
+ */
+#define DSA_NUM_SEGMENT_BINS 16
+
+/*
+ * What is the lowest bin that holds segments that *might* have n contiguous
+ * free pages? There is no point in looking in segments in lower bins; they
+ * definitely can't service a request for n free pages.
+ */
+#define contiguous_pages_to_segment_bin(n) Min(fls(n), DSA_NUM_SEGMENT_BINS - 1)
+
+/* Macros for access to locks. */
+#define DSA_AREA_LOCK(area) (&area->control->lock)
+#define DSA_SCLASS_LOCK(area, sclass) (&area->control->pools[sclass].lock)
+
+/*
+ * The header for an individual segment. This lives at the start of each DSM
+ * segment owned by a DSA area including the first segment (where it appears
+ * as part of the dsa_area_control struct).
+ */
+typedef struct
+{
+ /* Sanity check magic value. */
+ uint32 magic;
+ /* Total number of pages in this segment (excluding metadata area). */
+ Size usable_pages;
+ /* Total size of this segment in bytes. */
+ Size size;
+
+ /*
+ * Index of the segment that precedes this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the first one.
+ */
+ dsa_segment_index prev;
+
+ /*
+ * Index of the segment that follows this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the last one.
+ */
+ dsa_segment_index next;
+ /* The index of the bin that contains this segment. */
+ Size bin;
+
+ /*
+ * A flag raised to indicate that this segment is being returned to the
+ * operating system and has been unpinned.
+ */
+ bool freed;
+} dsa_segment_header;
+
+/*
+ * Metadata for one superblock.
+ *
+ * For most blocks, span objects are stored out-of-line; that is, the span
+ * object is not stored within the block itself. But, as an exception, for a
+ * "span of spans", the span object is stored "inline". The allocation is
+ * always exactly one page, and the dsa_area_span object is located at
+ * the beginning of that page. The size class is DSA_SCLASS_BLOCK_OF_SPANS,
+ * and the remaining fields are used just as they would be in an ordinary
+ * block. We can't allocate spans out of ordinary superblocks because
+ * creating an ordinary superblock requires us to be able to allocate a span
+ * *first*. Doing it this way avoids that circularity.
+ */
+typedef struct
+{
+ dsa_pointer pool; /* Containing pool. */
+ dsa_pointer prevspan; /* Previous span. */
+ dsa_pointer nextspan; /* Next span. */
+ dsa_pointer start; /* Starting address. */
+ Size npages; /* Length of span in pages. */
+ uint16 size_class; /* Size class. */
+ uint16 ninitialized; /* Maximum number of objects ever allocated. */
+ uint16 nallocatable; /* Number of objects currently allocatable. */
+ uint16 firstfree; /* First object on free list. */
+ uint16 nmax; /* Maximum number of objects ever possible. */
+ uint16 fclass; /* Current fullness class. */
+} dsa_area_span;
+
+/*
+ * Given a pointer to an object in a span, access the index of the next free
+ * object in the same span (ie in the span's freelist) as an L-value.
+ */
+#define NextFreeObjectIndex(object) (* (uint16 *) (object))
+
+/*
+ * Small allocations are handled by dividing a single block of memory into
+ * many small objects of equal size. The possible allocation sizes are
+ * defined by the following array. Larger size classes are spaced more widely
+ * than smaller size classes. We fudge the spacing for size classes >1kB to
+ * avoid space wastage: based on the knowledge that we plan to allocate 64kB
+ * blocks, we bump the maximum object size up to the largest multiple of
+ * 8 bytes that still lets us fit the same number of objects into one block.
+ *
+ * NB: Because of this fudging, if we were ever to use differently-sized blocks
+ * for small allocations, these size classes would need to be reworked to be
+ * optimal for the new size.
+ *
+ * NB: The optimal spacing for size classes, as well as the size of the blocks
+ * out of which small objects are allocated, is not a question that has one
+ * right answer. Some allocators (such as tcmalloc) use more closely-spaced
+ * size classes than we do here, while others (like aset.c) use more
+ * widely-spaced classes. Spacing the classes more closely avoids wasting
+ * memory within individual chunks, but also means a larger number of
+ * potentially-unfilled blocks.
+ */
+static const uint16 dsa_size_classes[] = {
+ sizeof(dsa_area_span), 0, /* special size classes */
+ 8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
+ 80, 96, 112, 128, /* 4 classes separated by 16 bytes */
+ 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
+ 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
+ 640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
+ 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
+ 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
+ 5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
+};
+#define DSA_NUM_SIZE_CLASSES lengthof(dsa_size_classes)
+
+/* Special size classes. */
+#define DSA_SCLASS_BLOCK_OF_SPANS 0
+#define DSA_SCLASS_SPAN_LARGE 1
+
+/*
+ * The following lookup table is used to map the size of small objects
+ * (less than 1kB) onto the corresponding size class. To use this table,
+ * round the size of the object up to the next multiple of 8 bytes, and then
+ * index into this array.
+ */
+static char dsa_size_class_map[] = {
+ 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 11, 12, 12, 13, 13,
+ 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17,
+ 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19,
+ 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21,
+ 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
+ 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23,
+ 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+ 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25
+};
+#define DSA_SIZE_CLASS_MAP_QUANTUM 8
+
+/*
+ * Superblocks are binned by how full they are. Generally, each fullness
+ * class corresponds to one quartile, but the block being used for
+ * allocations is always at the head of the list for fullness class 1,
+ * regardless of how full it really is.
+ *
+ * For large objects, we just stick all of the allocations in fullness class
+ * 0. Since we can just return the space directly to the free page manager,
+ * we don't really need them on a list at all, except that if someone wants
+ * to bulk release everything allocated using this dsa_area, we
+ * have no other way of finding them.
+ */
+#define DSA_FULLNESS_CLASSES 4
+
+/*
+ * Maximum length of a DSA name.
+ */
+#define DSA_MAXLEN 64
+
+/*
+ * A dsa_area_pool represents a set of objects of a given size class.
+ *
+ * Perhaps there should be multiple pools for the same size class for
+ * contention avoidance, but for now there is just one!
+ */
+typedef struct
+{
+ /* A lock protecting access to this pool. */
+ LWLock lock;
+ /* A set of linked lists of spans, arranged by fullness. */
+ dsa_pointer spans[DSA_FULLNESS_CLASSES];
+ /* Should we pad this out to a cacheline boundary? */
+} dsa_area_pool;
+
+/*
+ * The control block for an area. This lives in shared memory, at the start of
+ * the first DSM segment controlled by this area.
+ */
+typedef struct
+{
+ /* The segment header for the first segment. */
+ dsa_segment_header segment_header;
+ /* The handle for this area. */
+ dsa_handle handle;
+ /* The handles of the segments owned by this area. */
+ dsm_handle segment_handles[DSA_MAX_SEGMENTS];
+ /* Lists of segments, binned by maximum contiguous run of free pages. */
+ dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
+ /* The object pools for each size class. */
+ dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* The total size of all active segments. */
+ Size total_segment_size;
+ /* The maximum total size of backing storage we are allowed. */
+ Size max_total_segment_size;
+ /* The reference count for this area. */
+ int refcnt;
+ /* A flag indicating that this area has been pinned. */
+ bool pinned;
+ /* The number of times that segments have been freed. */
+ Size freed_segment_counter;
+ /* The LWLock tranche ID. */
+ int lwlock_tranche_id;
+ char lwlock_tranche_name[DSA_MAXLEN];
+ /* The general lock (protects everything except object pools). */
+ LWLock lock;
+} dsa_area_control;
+
+/* Given a pointer to a pool, find a dsa_pointer. */
+#define DsaAreaPoolToDsaPointer(area, p) \
+ DSA_MAKE_POINTER(0, (char *) p - (char *) area->control)
+
+/*
+ * A dsa_segment_map is stored within the backend-private memory of each
+ * individual backend. It holds the base address of the segment within that
+ * backend, plus the addresses of key objects within the segment. Those
+ * could instead be derived from the base address but it's handy to have them
+ * around.
+ */
+typedef struct
+{
+ dsm_segment *segment; /* DSM segment */
+ char *mapped_address; /* Address at which segment is mapped */
+ Size size; /* Size of the segment */
+ dsa_segment_header *header; /* Header (same as mapped_address) */
+ FreePageManager *fpm; /* Free page manager within segment. */
+ dsa_pointer *pagemap; /* Page map within segment. */
+} dsa_segment_map;
+
+/*
+ * Per-backend state for a storage area. Backends obtain one of these by
+ * creating an area or attaching to an existing one using a handle. Each
+ * process that needs to use an area uses its own object to track where the
+ * segments are mapped.
+ */
+struct dsa_area
+{
+ /* Pointer to the control object in shared memory. */
+ dsa_area_control *control;
+
+ /* The lock tranche for this process. */
+ LWLockTranche lwlock_tranche;
+
+ /* Has the mapping been pinned? */
+ bool mapping_pinned;
+
+ /*
+ * This backend's array of segment maps, ordered by segment index
+ * corresponding to control->segment_handles. Some of the area's segments
+ * may not be mapped into this backend yet, and some slots may have been
+ * freed and need to be detached; these operations happen on demand.
+ */
+ dsa_segment_map segment_maps[DSA_MAX_SEGMENTS];
+
+ /* The last observed freed_segment_counter. */
+ Size freed_segment_counter;
+};
+
+#define DSA_SPAN_NOTHING_FREE ((uint16) -1)
+#define DSA_SUPERBLOCK_SIZE (DSA_PAGES_PER_SUPERBLOCK * FPM_PAGE_SIZE)
+
+/* Given a pointer to a segment_map, obtain a segment index number. */
+#define get_segment_index(area, segment_map_ptr) \
+ (segment_map_ptr - &area->segment_maps[0])
+
+static void init_span(dsa_area *area, dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class);
+static bool transfer_first_span(dsa_area *area, dsa_area_pool *pool,
+ int fromclass, int toclass);
+static inline dsa_pointer alloc_object(dsa_area *area, int size_class);
+static void dsa_on_dsm_segment_detach(dsm_segment *, Datum arg);
+static bool ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class);
+static dsa_segment_map *get_segment_by_index(dsa_area *area,
+ dsa_segment_index index);
+static void destroy_superblock(dsa_area *area, dsa_pointer span_pointer);
+static void unlink_span(dsa_area *area, dsa_area_span *span);
+static void add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer, int fclass);
+static void unlink_segment(dsa_area *area, dsa_segment_map *segment_map);
+static dsa_segment_map *get_best_segment(dsa_area *area, Size npages);
+static dsa_segment_map *make_new_segment(dsa_area *area, Size requested_pages);
+
+/*
+ * Create a new shared area with dynamic size. DSM segments will be allocated
+ * as required to extend the available space.
+ *
+ * We can't allocate a LWLock tranche_id within this function, because tranche
+ * IDs are a scarce resource; there are only 64k available, using low numbers
+ * when possible matters, and we have no provision for recycling them. So,
+ * we require the caller to provide one. The caller must also provide the
+ * tranche name, so that we can distinguish LWLocks belonging to different
+ * DSAs.
+ */
+dsa_area *
+dsa_create_dynamic(int tranche_id, const char *tranche_name)
+{
+ dsm_segment *segment;
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+ Size usable_pages;
+ Size total_pages;
+ Size metadata_bytes;
+ Size total_size;
+ int i;
+
+ total_size = DSA_INITIAL_SEGMENT_SIZE;
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages =
+ (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+
+ /*
+ * Create the DSM segment that will hold the shared control object and the
+ * first segment of usable space, and set it up. All segments backing
+ * this area are pinned, so that DSA can explicitly control their lifetime
+ * (otherwise a newly created segment belonging to this area might be
+ * freed when the only backend that happens to have it mapped in ends,
+ * corrupting the area).
+ */
+ segment = dsm_create(total_size, 0);
+ dsm_pin_segment(segment);
+
+ /*
+ * Initialize the dsa_area_control object located at the start of the
+ * segment.
+ */
+ control = dsm_segment_address(segment);
+ control->segment_header.magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ dsm_segment_handle(segment) ^ 0;
+ control->segment_header.next = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.usable_pages = usable_pages;
+ control->segment_header.freed = false;
+ control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+ control->handle = dsm_segment_handle(segment);
+ control->max_total_segment_size = SIZE_MAX;
+ control->total_segment_size = DSA_INITIAL_SEGMENT_SIZE;
+ memset(&control->segment_handles[0], 0,
+ sizeof(dsm_handle) * DSA_MAX_SEGMENTS);
+ control->segment_handles[0] = dsm_segment_handle(segment);
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ control->segment_bins[i] = DSA_SEGMENT_INDEX_NONE;
+ control->refcnt = 1;
+ control->freed_segment_counter = 0;
+ control->lwlock_tranche_id = tranche_id;
+ strlcpy(control->lwlock_tranche_name, tranche_name, DSA_MAXLEN);
+
+ /*
+ * Create the dsa_area object that this backend will use to access the
+ * area. Other backends will need to obtain their own dsa_area object by
+ * attaching.
+ */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(area->segment_maps, 0, sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+ LWLockInitialize(&control->lock, control->lwlock_tranche_id);
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ LWLockInitialize(DSA_SCLASS_LOCK(area, i),
+ control->lwlock_tranche_id);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = total_size;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Put this segment into the appropriate bin. */
+ control->segment_bins[contiguous_pages_to_segment_bin(usable_pages)] = 0;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_segment_detach, PointerGetDatum(NULL));
+
+ return area;
+}
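
The sizing arithmetic at the top of dsa_create_dynamic can be checked in isolation. The following is only a sketch: the 4096-byte page size, the 8-byte alignment and every sketch_* name are assumptions made for illustration, not values taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; the patch defines its own constants. */
#define SKETCH_PAGE_SIZE ((size_t) 4096)
#define SKETCH_ALIGN8(x) (((size_t) (x) + 7) & ~(size_t) 7)

/*
 * Metadata is the control object, the free page manager and one pagemap
 * entry per page, rounded up to a whole number of pages.
 */
static size_t
sketch_metadata_bytes(size_t total_size, size_t control_size,
                      size_t fpm_size, size_t pagemap_entry_size)
{
    size_t total_pages = total_size / SKETCH_PAGE_SIZE;
    size_t bytes = SKETCH_ALIGN8(control_size) +
        SKETCH_ALIGN8(fpm_size) +
        total_pages * pagemap_entry_size;

    /* Add padding up to the next page boundary. */
    if (bytes % SKETCH_PAGE_SIZE != 0)
        bytes += SKETCH_PAGE_SIZE - (bytes % SKETCH_PAGE_SIZE);
    return bytes;
}

/* Whatever is left after the metadata is handed to the free page manager. */
static size_t
sketch_usable_pages(size_t total_size, size_t metadata_bytes)
{
    assert(total_size >= metadata_bytes);
    return (total_size - metadata_bytes) / SKETCH_PAGE_SIZE;
}
```

With a 1 MB segment and made-up struct sizes of 5000 and 100 bytes, the metadata rounds up to two pages, leaving 254 of the 256 pages usable.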
+
+/*
+ * Obtain a handle that can be passed to other processes so that they can
+ * attach to the given area.
+ */
+dsa_handle
+dsa_get_handle(dsa_area *area)
+{
+ return area->control->handle;
+}
+
+/*
+ * Attach to an area given a handle generated (possibly in another
+ * process) by dsa_get_handle.
+ */
+dsa_area *
+dsa_attach_dynamic(dsa_handle handle)
+{
+ dsm_segment *segment;
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+
+ /*
+ * An area handle is really a DSM segment handle for the first segment, so
+ * we go ahead and attach to that.
+ */
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to area handle %u", handle);
+ control = dsm_segment_address(segment);
+ Assert(control->handle == handle);
+ Assert(control->segment_handles[0] == handle);
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ handle ^ 0));
+
+ /* Build the backend-local area object. */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(&area->segment_maps[0], 0,
+ sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = dsm_segment_map_length(segment);
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Bump the reference count. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ ++control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_segment_detach, PointerGetDatum(area));
+
+ return area;
+}
+
+/*
+ * Keep a DSA area attached until end of session or explicit detach.
+ *
+ * By default, areas are owned by the current resource owner, which means they
+ * are detached automatically when that scope ends.
+ */
+void
+dsa_pin_mapping(dsa_area *area)
+{
+ int i;
+
+ Assert(!area->mapping_pinned);
+ area->mapping_pinned = true;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_pin_mapping(area->segment_maps[i].segment);
+}
+
+/*
+ * Allocate memory in this storage area. The return value is a dsa_pointer
+ * that can be passed to other processes, and converted to a local pointer
+ * with dsa_get_address. If no memory is available, returns
+ * InvalidDsaPointer.
+ */
+dsa_pointer
+dsa_allocate(dsa_area *area, Size size)
+{
+ uint16 size_class;
+ dsa_pointer start_pointer;
+ dsa_segment_map *segment_map;
+
+ Assert(size > 0);
+
+ /*
+ * If bigger than the largest size class, just grab a run of pages from
+ * the free page manager, instead of allocating an object from a pool.
+ * There will still be a span, but it's a special class of span that
+ * manages this whole allocation and simply gives all pages back to the
+ * free page manager when dsa_free is called.
+ */
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ {
+ Size npages = fpm_size_to_pages(size);
+ Size first_page;
+ dsa_pointer span_pointer;
+ dsa_area_pool *pool = &area->control->pools[DSA_SCLASS_SPAN_LARGE];
+
+ /* Obtain a span object. */
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return InvalidDsaPointer;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+
+ /* Find a segment from which to allocate. */
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ /* Can't make any more segments: game over. */
+ LWLockRelease(DSA_AREA_LOCK(area));
+ dsa_free(area, span_pointer);
+ return InvalidDsaPointer;
+ }
+
+ /*
+ * Ask the free page manager for a run of pages. This should always
+ * succeed, since both get_best_segment and make_new_segment should
+ * only return a non-NULL pointer if the segment actually contains enough
+ * contiguous free space. If it does fail, something in our backend
+ * private state is out of whack, so use FATAL to kill the process.
+ */
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ elog(FATAL,
+ "dsa couldn't find run of pages: fpm_largest out of sync");
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ start_pointer = DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /* Initialize span and pagemap. */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ init_span(area, span_pointer, pool, start_pointer, npages,
+ DSA_SCLASS_SPAN_LARGE);
+ segment_map->pagemap[first_page] = span_pointer;
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+
+ return start_pointer;
+ }
+
+ /* Map allocation to a size class. */
+ if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+ Assert(size <= dsa_size_classes[size_class]);
+ Assert(size_class == 0 || size > dsa_size_classes[size_class - 1]);
+
+ /*
+ * Attempt to allocate an object from the appropriate pool. This might
+ * return InvalidDsaPointer if there's no space available.
+ */
+ return alloc_object(area, size_class);
+}
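
The binary chop used for larger requests can be tried on its own. This is a minimal sketch, assuming a short made-up table in place of dsa_size_classes; it also starts the search at index 0, whereas dsa_allocate starts at the last entry of its lookup table. All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table, much shorter than the real dsa_size_classes. */
static const uint16_t sketch_size_classes[] = {
    8, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024
};

#define SKETCH_NUM_CLASSES \
    (sizeof(sketch_size_classes) / sizeof(sketch_size_classes[0]))

/* Return the index of the smallest class that can hold 'size' bytes. */
static int
sketch_size_class_for(size_t size)
{
    size_t min = 0;
    size_t max = SKETCH_NUM_CLASSES - 1;

    /* Larger requests would take the large-object path instead. */
    assert(size > 0 && size <= sketch_size_classes[max]);

    while (min < max)
    {
        size_t mid = (min + max) / 2;

        if (sketch_size_classes[mid] < size)
            min = mid + 1;      /* class too small, look to the right */
        else
            max = mid;          /* class fits, but a smaller one might too */
    }
    return (int) min;
}
```

The loop ends with min == max at the first class that is not smaller than the request, which is the same postcondition the two Asserts at the end of dsa_allocate check.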
+
+/*
+ * Free memory obtained with dsa_allocate.
+ */
+void
+dsa_free(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_map *segment_map;
+ int pageno;
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ char *superblock;
+ char *object;
+ Size size;
+ int size_class;
+
+ /* Locate the object, span and pool. */
+ segment_map = get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(dp));
+ pageno = DSA_EXTRACT_OFFSET(dp) / FPM_PAGE_SIZE;
+ span_pointer = segment_map->pagemap[pageno];
+ span = dsa_get_address(area, span_pointer);
+ superblock = dsa_get_address(area, span->start);
+ object = dsa_get_address(area, dp);
+ size_class = span->size_class;
+ size = dsa_size_classes[size_class];
+
+ /*
+ * Special case for large objects that live in a special span: we return
+ * those pages directly to the free page manager and free the span.
+ */
+ if (span->size_class == DSA_SCLASS_SPAN_LARGE)
+ {
+
+#ifdef CLOBBER_FREED_MEMORY
+ memset(object, 0x7f, span->npages * FPM_PAGE_SIZE);
+#endif
+
+ /* Give pages back to free page manager. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Unlink span. */
+ /* TODO: Does it even need to be linked in in the first place? */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ unlink_span(area, span);
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+ /* Free the span object so it can be reused. */
+ dsa_free(area, span_pointer);
+ return;
+ }
+
+#ifdef CLOBBER_FREED_MEMORY
+ memset(object, 0x7f, size);
+#endif
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /* Put the object on the span's freelist. */
+ Assert(object >= superblock);
+ Assert(object < superblock + DSA_SUPERBLOCK_SIZE);
+ Assert((object - superblock) % size == 0);
+ NextFreeObjectIndex(object) = span->firstfree;
+ span->firstfree = (object - superblock) / size;
+ ++span->nallocatable;
+
+ /*
+ * See if the span needs to be moved to a different fullness class, or be
+ * freed so its pages can be given back to the segment.
+ */
+ if (span->nallocatable == 1 && span->fclass == DSA_FULLNESS_CLASSES - 1)
+ {
+ /*
+ * The block was completely full and is located in the
+ * highest-numbered fullness class, which is never scanned for free
+ * chunks. We must move it to the next-lower fullness class.
+ */
+ unlink_span(area, span);
+ add_span_to_fullness_class(area, span, span_pointer,
+ DSA_FULLNESS_CLASSES - 2);
+
+ /*
+ * If this is the only span, and there is no active span, then
+ * we should probably move this span to fullness class 1. (Otherwise
+ * if you allocate exactly all the objects in the only span, it moves
+ * to class 3, then you free them all, it moves to 2, and then is
+ * given back, leaving no active span).
+ */
+ }
+ else if (span->nallocatable == span->nmax &&
+ (span->fclass != 1 || span->prevspan != InvalidDsaPointer))
+ {
+ /*
+ * This entire block is free, and it's not the active block for this
+ * size class. Return the memory to the free page manager. We don't
+ * do this for the active block to avoid thrashing: if we
+ * repeatedly allocate and free the only chunk in the active block, it
+ * will be very inefficient if we deallocate and reallocate the block
+ * every time.
+ */
+ destroy_superblock(area, span_pointer);
+ }
+
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+}
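
The per-span free list that dsa_free pushes onto, and alloc_object pops from, can be modelled without any shared memory. This is a sketch under assumptions: NextFreeObjectIndex is not shown in this part of the patch, so the code assumes it stores a 16-bit index in the first bytes of a free object. The fixed object size, the object count and all sketch_* names are made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NOTHING_FREE ((uint16_t) -1)
#define SKETCH_OBJ_SIZE 16
#define SKETCH_NOBJECTS 4

typedef struct
{
    unsigned char block[SKETCH_OBJ_SIZE * SKETCH_NOBJECTS];
    uint16_t    firstfree;      /* head of free list, or NOTHING_FREE */
    uint16_t    ninitialized;   /* objects handed out at least once */
    uint16_t    nallocatable;   /* objects currently available */
} sketch_span;

static void
sketch_init(sketch_span *span)
{
    span->firstfree = SKETCH_NOTHING_FREE;
    span->ninitialized = 0;
    span->nallocatable = SKETCH_NOBJECTS;
}

/* Return an object index, or -1 if the span is completely full. */
static int
sketch_alloc(sketch_span *span)
{
    uint16_t    idx;

    if (span->nallocatable == 0)
        return -1;
    if (span->firstfree != SKETCH_NOTHING_FREE)
    {
        /* Pop the free list: the next index lives inside the object. */
        idx = span->firstfree;
        memcpy(&span->firstfree, span->block + idx * SKETCH_OBJ_SIZE,
               sizeof(uint16_t));
    }
    else
        idx = span->ninitialized++;     /* carve out a never-used object */
    --span->nallocatable;
    return idx;
}

/* Push an object back onto the free list. */
static void
sketch_free(sketch_span *span, int idx)
{
    assert(idx >= 0 && idx < span->ninitialized);
    memcpy(span->block + idx * SKETCH_OBJ_SIZE, &span->firstfree,
           sizeof(uint16_t));
    span->firstfree = (uint16_t) idx;
    ++span->nallocatable;
}
```

Freed objects are reused in last-in, first-out order, and objects that have never been used are only touched once the free list is empty.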
+
+/*
+ * Obtain a backend-local address for a dsa_pointer. 'dp' must have been
+ * allocated by the given area (possibly in another process). This may cause
+ * a segment to be mapped into the current process.
+ */
+void *
+dsa_get_address(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_index index;
+ Size offset;
+ Size freed_segment_counter;
+
+ /* Convert InvalidDsaPointer to NULL. */
+ if (!DsaPointerIsValid(dp))
+ return NULL;
+
+ index = DSA_EXTRACT_SEGMENT_NUMBER(dp);
+ offset = DSA_EXTRACT_OFFSET(dp);
+
+ Assert(index < DSA_MAX_SEGMENTS);
+
+ /* Check if we need to cause this segment to be mapped in. */
+ if (area->segment_maps[index].mapped_address == NULL)
+ {
+ /* Call for effect (we don't need the result). */
+ get_segment_by_index(area, index);
+ }
+
+ /*
+ * Take this opportunity to check if we need to detach from any segments
+ * that have been freed. This is an unsynchronized read of the value in
+ * shared memory, but all that matters is that we eventually observe a
+ * change when that number moves.
+ */
+ freed_segment_counter = area->control->freed_segment_counter;
+ if (area->freed_segment_counter != freed_segment_counter)
+ {
+ int i;
+
+ /* Check all currently mapped segments to find what's been freed. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ if (area->segment_maps[i].header != NULL &&
+ area->segment_maps[i].header->freed)
+ {
+ dsm_detach(area->segment_maps[i].segment);
+ area->segment_maps[i].segment = NULL;
+ area->segment_maps[i].header = NULL;
+ area->segment_maps[i].mapped_address = NULL;
+ }
+ }
+ area->freed_segment_counter = freed_segment_counter;
+ }
+
+ return area->segment_maps[index].mapped_address + offset;
+}
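
The pointer translation in dsa_get_address depends on a dsa_pointer carrying a segment index and an offset. The macros that pack and unpack it are defined outside this part of the patch, so the layout below (64-bit pointer, low 40 bits for the offset) is an assumption, as are all sketch_* names. The on-demand mapping and freed-segment handling are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sketch_pointer;

#define SKETCH_OFFSET_WIDTH 40      /* assumed split, for illustration */
#define SKETCH_OFFSET_MASK \
    ((((sketch_pointer) 1) << SKETCH_OFFSET_WIDTH) - 1)
#define SKETCH_MAX_SEGMENTS 4

/* Pack a segment index and a byte offset into one relative pointer. */
static sketch_pointer
sketch_make_pointer(uint64_t segment, uint64_t offset)
{
    assert(offset <= SKETCH_OFFSET_MASK);
    return (segment << SKETCH_OFFSET_WIDTH) | offset;
}

static uint64_t
sketch_segment(sketch_pointer p)
{
    return p >> SKETCH_OFFSET_WIDTH;
}

static uint64_t
sketch_offset(sketch_pointer p)
{
    return p & SKETCH_OFFSET_MASK;
}

/*
 * Translate using this process's table of segment base addresses.  A NULL
 * base stands for "not mapped here"; the real code maps it on demand.
 */
static void *
sketch_get_address(char *const bases[], sketch_pointer p)
{
    uint64_t    segment = sketch_segment(p);

    assert(segment < SKETCH_MAX_SEGMENTS);
    if (bases[segment] == NULL)
        return NULL;
    return bases[segment] + sketch_offset(p);
}
```

The same sketch_pointer value resolves to different local addresses depending on which process's base table is used, which is why only the relative form can be stored in shared data structures.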
+
+/*
+ * Pin this area, so that it will continue to exist even if all backends
+ * detach from it. In that case, the area can still be reattached to if a
+ * handle has been recorded somewhere.
+ */
+void
+dsa_pin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ if (area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_pin: area already pinned");
+ }
+ area->control->pinned = true;
+ ++area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Undo the effects of dsa_pin, so that the given area can be freed when no
+ * backends are attached to it. May be called only if dsa_pin has been
+ * called.
+ */
+void
+dsa_unpin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ Assert(area->control->refcnt > 1);
+ if (!area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_unpin: area not pinned");
+ }
+ area->control->pinned = false;
+ --area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Set the total size limit for this area. This limit is checked whenever new
+ * segments need to be allocated from the operating system. If the new size
+ * limit is already exceeded, this has no immediate effect.
+ *
+ * Note that the total virtual memory usage may be temporarily larger than
+ * this limit when segments have been freed, but not yet detached by all
+ * backends that have attached to them.
+ */
+void
+dsa_set_size_limit(dsa_area *area, Size limit)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ area->control->max_total_segment_size = limit;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Aggressively free all spare memory in the hope of returning DSM segments to
+ * the operating system.
+ */
+void
+dsa_trim(dsa_area *area)
+{
+ int size_class;
+
+ /*
+ * Trim in reverse pool order so we get to the blocks-of-spans pool last, just
+ * in case any become entirely free while processing all the other pools.
+ */
+ for (size_class = DSA_NUM_SIZE_CLASSES - 1; size_class >= 0; --size_class)
+ {
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_pointer span_pointer;
+
+ if (size_class == DSA_SCLASS_SPAN_LARGE)
+ /* Large object frees give back segments aggressively already. */
+ continue;
+
+ /*
+ * Search fullness class 1 only. That is where we expect to find
+ * an entirely empty superblock (entirely empty superblocks in other
+ * fullness classes are returned to the free page map by dsa_free).
+ */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+ span_pointer = pool->spans[1];
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ dsa_pointer next = span->nextspan;
+
+ if (span->nallocatable == span->nmax)
+ destroy_superblock(area, span_pointer);
+
+ span_pointer = next;
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+ }
+}
+
+/*
+ * Print out debugging information about the internal state of the shared
+ * memory area.
+ */
+void
+dsa_dump(dsa_area *area)
+{
+ Size i,
+ j;
+
+ /*
+ * Note: This gives an inconsistent snapshot as it acquires and releases
+ * individual locks as it goes...
+ */
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ fprintf(stderr, "dsa_area handle %x:\n", area->control->handle);
+ fprintf(stderr, " max_total_segment_size: %zu\n",
+ area->control->max_total_segment_size);
+ fprintf(stderr, " total_segment_size: %zu\n",
+ area->control->total_segment_size);
+ fprintf(stderr, " refcnt: %d\n", area->control->refcnt);
+ fprintf(stderr, " pinned: %c\n", area->control->pinned ? 't' : 'f');
+ fprintf(stderr, " segment bins:\n");
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ {
+ if (area->control->segment_bins[i] != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_index segment_index;
+
+ fprintf(stderr,
+ " segment bin %zu (at least %d contiguous pages free):\n",
+ i, i == 0 ? 0 : 1 << (i - 1));
+ segment_index = area->control->segment_bins[i];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, segment_index);
+
+ fprintf(stderr,
+ " segment index %zu, usable_pages = %zu, "
+ "contiguous_pages = %zu, mapped at %p\n",
+ segment_index,
+ segment_map->header->usable_pages,
+ fpm_largest(segment_map->fpm),
+ segment_map->mapped_address);
+ segment_index = segment_map->header->next;
+ }
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ fprintf(stderr, " pools:\n");
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ {
+ bool found = false;
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, i), LW_EXCLUSIVE);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ if (DsaPointerIsValid(area->control->pools[i].spans[j]))
+ found = true;
+ if (found)
+ {
+ if (i == DSA_SCLASS_BLOCK_OF_SPANS)
+ fprintf(stderr, " pool for blocks of span objects:\n");
+ else if (i == DSA_SCLASS_SPAN_LARGE)
+ fprintf(stderr, " pool for large object spans:\n");
+ else
+ fprintf(stderr,
+ " pool for size class %zu (object size %hu bytes):\n",
+ i, dsa_size_classes[i]);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ {
+ if (!DsaPointerIsValid(area->control->pools[i].spans[j]))
+ fprintf(stderr, " fullness class %zu is empty\n", j);
+ else
+ {
+ dsa_pointer span_pointer = area->control->pools[i].spans[j];
+
+ fprintf(stderr, " fullness class %zu:\n", j);
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span;
+
+ span = dsa_get_address(area, span_pointer);
+ fprintf(stderr,
+ " span descriptor at %016lx, "
+ "superblock at %016lx, pages = %zu, "
+ "objects free = %hu/%hu\n",
+ span_pointer, span->start, span->npages,
+ span->nallocatable, span->nmax);
+ span_pointer = span->nextspan;
+ }
+ }
+ }
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, i));
+ }
+}
+
+/*
+ * A callback function for when the control segment for a dsa_area is
+ * detached.
+ */
+static void
+dsa_on_dsm_segment_detach(dsm_segment *segment, Datum arg)
+{
+ bool destroy = false;
+ dsa_area_control *control =
+ (dsa_area_control *) dsm_segment_address(segment);
+
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ control->handle ^ 0));
+
+ /* Decrement the reference count for the DSA area. */
+ LWLockAcquire(&control->lock, LW_EXCLUSIVE);
+ if (--control->refcnt == 0)
+ destroy = true;
+ LWLockRelease(&control->lock);
+
+ /*
+ * If we are the last to detach from the area, then we must unpin all
+ * segments so they can be returned to the OS.
+ */
+ if (destroy)
+ {
+ int i;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ dsm_handle handle;
+
+ handle = control->segment_handles[i];
+ if (handle != DSM_HANDLE_INVALID)
+ dsm_unpin_segment(handle);
+ }
+ }
+}
+
+/*
+ * Add a new span to fullness class 1 of the indicated pool.
+ */
+static void
+init_span(dsa_area *area,
+ dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ Size obsize = dsa_size_classes[size_class];
+
+ /*
+ * The per-pool lock must be held because we manipulate the span list for
+ * this pool.
+ */
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /* Push this span onto the front of the span list for fullness class 1. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ {
+ dsa_area_span *head = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+
+ head->prevspan = span_pointer;
+ }
+ span->pool = DsaAreaPoolToDsaPointer(area, pool);
+ span->nextspan = pool->spans[1];
+ span->prevspan = InvalidDsaPointer;
+ pool->spans[1] = span_pointer;
+
+ span->start = start;
+ span->npages = npages;
+ span->size_class = size_class;
+ span->ninitialized = 0;
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * A block-of-spans contains its own descriptor, so mark one object as
+ * initialized and reduce the count of allocatable objects by one.
+ * Doing this here has the side effect of also reducing nmax by one,
+ * which is important to make sure we free this object at the correct
+ * time.
+ */
+ span->ninitialized = 1;
+ span->nallocatable = FPM_PAGE_SIZE / obsize - 1;
+ }
+ else if (size_class != DSA_SCLASS_SPAN_LARGE)
+ span->nallocatable = DSA_SUPERBLOCK_SIZE / obsize;
+ span->firstfree = DSA_SPAN_NOTHING_FREE;
+ span->nmax = span->nallocatable;
+ span->fclass = 1;
+}
+
+/*
+ * Transfer the first span in one fullness class to the head of another
+ * fullness class.
+ */
+static bool
+transfer_first_span(dsa_area *area,
+ dsa_area_pool *pool, int fromclass, int toclass)
+{
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+
+ /* Can't do it if source list is empty. */
+ span_pointer = pool->spans[fromclass];
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+
+ /* Remove span from head of source list. */
+ span = dsa_get_address(area, span_pointer);
+ pool->spans[fromclass] = span->nextspan;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+
+ /* Add span to head of target list. */
+ span->nextspan = pool->spans[toclass];
+ pool->spans[toclass] = span_pointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = toclass;
+
+ return true;
+}
+
+/*
+ * Allocate one object of the requested size class from the given area.
+ */
+static inline dsa_pointer
+alloc_object(dsa_area *area, int size_class)
+{
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_area_span *span;
+ dsa_pointer block;
+ dsa_pointer result;
+ char *object;
+ Size size;
+
+ /*
+ * Even though ensure_active_superblock can in turn call alloc_object if
+ * it needs to allocate a new span, that's always from a different pool,
+ * and the order of lock acquisition is always the same, so it's OK that
+ * we hold this lock for the duration of this function.
+ */
+ Assert(!LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /*
+ * If there's no active superblock, we must successfully obtain one or
+ * fail the request.
+ */
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ !ensure_active_superblock(area, pool, size_class))
+ {
+ result = InvalidDsaPointer;
+ }
+ else
+ {
+ /*
+ * There should be a block in fullness class 1 at this point, and it
+ * should never be completely full. Thus we can either pop an object
+ * from the free list or, failing that, initialize a new object.
+ */
+ Assert(DsaPointerIsValid(pool->spans[1]));
+ span = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+ Assert(span->nallocatable > 0);
+ block = span->start;
+ Assert(size_class < DSA_NUM_SIZE_CLASSES);
+ size = dsa_size_classes[size_class];
+ if (span->firstfree != DSA_SPAN_NOTHING_FREE)
+ {
+ result = block + span->firstfree * size;
+ object = dsa_get_address(area, result);
+ span->firstfree = NextFreeObjectIndex(object);
+ }
+ else
+ {
+ result = block + span->ninitialized * size;
+ ++span->ninitialized;
+ }
+ --span->nallocatable;
+
+ /* If it's now full, move it to the highest-numbered fullness class. */
+ if (span->nallocatable == 0)
+ transfer_first_span(area, pool, 1, DSA_FULLNESS_CLASSES - 1);
+ }
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+
+ return result;
+}
+
+/*
+ * Ensure an active (i.e. fullness class 1) superblock, unless all existing
+ * superblocks are completely full and no more can be allocated.
+ *
+ * Fullness classes K of 0..N are loosely intended to represent blocks whose
+ * utilization percentage is at least K/N, but we only enforce this rigorously
+ * for the highest-numbered fullness class, which always contains exactly
+ * those blocks that are completely full. It's otherwise acceptable for a
+ * block to be in a higher-numbered fullness class than the one to which it
+ * logically belongs. In addition, the active block, which is always the
+ * first block in fullness class 1, is permitted to have a higher allocation
+ * percentage than would normally be allowable for that fullness class; we
+ * don't move it until it's completely full, and then it goes to the
+ * highest-numbered fullness class.
+ *
+ * It might seem odd that the active block is the head of fullness class 1
+ * rather than fullness class 0, but experience with other allocators has
+ * shown that it's usually better to allocate from a block that's moderately
+ * full rather than one that's nearly empty. Insofar as is reasonably
+ * possible, we want to avoid performing new allocations in a block that would
+ * otherwise become empty soon.
+ */
+static bool
+ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class)
+{
+ dsa_pointer span_pointer;
+ dsa_pointer start_pointer;
+ Size obsize = dsa_size_classes[size_class];
+ Size nmax;
+ int fclass;
+ Size npages = 1;
+ Size first_page;
+ Size i;
+ dsa_segment_map *segment_map;
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /*
+ * Compute the number of objects that will fit in a block of this size
+ * class. Blocks of spans are just a single page, and the first
+ * object isn't available for use because it describes the block-of-spans
+ * itself.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ nmax = FPM_PAGE_SIZE / obsize - 1;
+ else
+ nmax = DSA_SUPERBLOCK_SIZE / obsize;
+
+ /*
+ * If fullness class 1 is empty, try to find a span to put in it by
+ * scanning higher-numbered fullness classes (excluding the last one,
+ * whose blocks are certain to all be completely full).
+ */
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ {
+ span_pointer = pool->spans[fclass];
+
+ while (DsaPointerIsValid(span_pointer))
+ {
+ int tfclass;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+ dsa_area_span *prevspan;
+ dsa_pointer next_span_pointer;
+
+ span = (dsa_area_span *)
+ dsa_get_address(area, span_pointer);
+ next_span_pointer = span->nextspan;
+
+ /* Figure out what fullness class should contain this span. */
+ tfclass = (nmax - span->nallocatable)
+ * (DSA_FULLNESS_CLASSES - 1) / nmax;
+
+ /* Look up next span. */
+ if (DsaPointerIsValid(span->nextspan))
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ else
+ nextspan = NULL;
+
+ /*
+ * If utilization has dropped enough that this now belongs in some
+ * other fullness class, move it there.
+ */
+ if (tfclass < fclass)
+ {
+ /* Remove from the current fullness class list. */
+ if (pool->spans[fclass] == span_pointer)
+ {
+ /* It was the head; remove it. */
+ Assert(!DsaPointerIsValid(span->prevspan));
+ pool->spans[fclass] = span->nextspan;
+ if (nextspan != NULL)
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+ else
+ {
+ /* It was not the head. */
+ Assert(DsaPointerIsValid(span->prevspan));
+ prevspan = (dsa_area_span *)
+ dsa_get_address(area, span->prevspan);
+ prevspan->nextspan = span->nextspan;
+ }
+ if (nextspan != NULL)
+ nextspan->prevspan = span->prevspan;
+
+ /* Push onto the head of the new fullness class list. */
+ span->nextspan = pool->spans[tfclass];
+ pool->spans[tfclass] = span_pointer;
+ span->prevspan = InvalidDsaPointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = tfclass;
+ }
+
+ /* Advance to next span on list. */
+ span_pointer = next_span_pointer;
+ }
+
+ /* Stop now if we found a suitable block. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ return true;
+ }
+
+ /*
+ * If there are no blocks that properly belong in fullness class 1, pick
+ * one from some other fullness class and move it there anyway, so that we
+ * have an allocation target. Our last choice is to transfer a block
+ * that's almost empty (and might become completely empty soon if left
+ * alone), but even that is better than failing, which is what we must do
+ * if there are no blocks at all with freespace.
+ */
+ Assert(!DsaPointerIsValid(pool->spans[1]));
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ if (transfer_first_span(area, pool, fclass, 1))
+ return true;
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ transfer_first_span(area, pool, 0, 1))
+ return true;
+
+ /*
+ * We failed to find an existing span with free objects, so we need to
+ * allocate a new superblock and construct a new span to manage it.
+ *
+ * First, get a dsa_area_span object to describe the new superblock
+ * ... unless this allocation is for a dsa_area_span object, in which case
+ * that's surely not going to work. We handle that case by storing the
+ * span describing a block-of-spans inline.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+ npages = DSA_PAGES_PER_SUPERBLOCK;
+ }
+
+ /* Find or create a segment and allocate the superblock. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ return false;
+ }
+ }
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* Compute the start of the superblock. */
+ start_pointer =
+ DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /*
+ * If this is a block-of-spans, carve the descriptor right out of the
+ * allocated space.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * We have a pointer into the segment. We need to build a dsa_pointer
+ * from the segment index and offset into the segment.
+ */
+ span_pointer = start_pointer;
+ }
+
+ /* Initialize span and pagemap. */
+ init_span(area, span_pointer, pool, start_pointer, npages, size_class);
+ for (i = 0; i < npages; ++i)
+ segment_map->pagemap[first_page + i] = span_pointer;
+
+ return true;
+}
+
+/*
+ * Return the segment map corresponding to a given segment index, mapping the
+ * segment in if necessary.
+ */
+static dsa_segment_map *
+get_segment_by_index(dsa_area *area, dsa_segment_index index)
+{
+ if (area->segment_maps[index].mapped_address == NULL) /* unlikely */
+ {
+ dsm_handle handle;
+ dsm_segment *segment;
+ dsa_segment_map *segment_map;
+
+ handle = area->control->segment_handles[index];
+
+ /* This slot has been freed. */
+ if (handle == DSM_HANDLE_INVALID)
+ return NULL;
+
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to segment");
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+ segment_map = &area->segment_maps[index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header =
+ (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = dsm_segment_map_length(segment);
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ Assert(segment_map->header->magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index));
+ }
+
+ return &area->segment_maps[index];
+}
+
+/*
+ * Return a superblock to the free page manager. If the underlying segment
+ * has become entirely free, then return it to the operating system.
+ *
+ * The appropriate pool lock must be held.
+ */
+static void
+destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ int size_class = span->size_class;
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
+
+ /* Remove it from its fullness class list. */
+ unlink_span(area, span);
+
+ /*
+ * Note: This is the only time we acquire the area lock while we already
+ * hold a per-pool lock. We never hold the area lock and then take a pool
+ * lock, or we could deadlock.
+ */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ /* Check if the segment is now entirely free. */
+ if (fpm_largest(segment_map->fpm) == segment_map->header->usable_pages)
+ {
+ dsa_segment_index index = get_segment_index(area, segment_map);
+
+ /* If it's not the segment with extra control data, free it. */
+ if (index != 0)
+ {
+ /*
+ * Give it back to the OS, and allow other backends to detect that
+ * they need to detach.
+ */
+ unlink_segment(area, segment_map);
+ segment_map->header->freed = true;
+ Assert(area->control->total_segment_size >=
+ segment_map->header->size);
+ area->control->total_segment_size -=
+ segment_map->header->size;
+ dsm_unpin_segment(dsm_segment_handle(segment_map->segment));
+ dsm_detach(segment_map->segment);
+ area->control->segment_handles[index] = DSM_HANDLE_INVALID;
+ ++area->control->freed_segment_counter;
+ segment_map->segment = NULL;
+ segment_map->header = NULL;
+ segment_map->mapped_address = NULL;
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /*
+ * Block-of-spans superblocks store the span which describes them within the
+ * block itself, so freeing the storage implicitly frees the descriptor
+ * also. If this is a block of any other type, we need to separately free
+ * the span object also. This recursive call to dsa_free will acquire the
+ * span pool's lock. We can't deadlock because the acquisition order is
+ * always some other pool and then the span pool.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+}
+
+static void
+unlink_span(dsa_area *area, dsa_area_span *span)
+{
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ dsa_area_span *next = dsa_get_address(area, span->nextspan);
+
+ next->prevspan = span->prevspan;
+ }
+ if (DsaPointerIsValid(span->prevspan))
+ {
+ dsa_area_span *prev = dsa_get_address(area, span->prevspan);
+
+ prev->nextspan = span->nextspan;
+ }
+ else
+ {
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ pool->spans[span->fclass] = span->nextspan;
+ }
+}
+
+static void
+add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer,
+ int fclass)
+{
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ if (DsaPointerIsValid(pool->spans[fclass]))
+ {
+ dsa_area_span *head = dsa_get_address(area,
+ pool->spans[fclass]);
+
+ head->prevspan = span_pointer;
+ }
+ span->prevspan = InvalidDsaPointer;
+ span->nextspan = pool->spans[fclass];
+ pool->spans[fclass] = span_pointer;
+ span->fclass = fclass;
+}
+
+/*
+ * Detach from an area that was either created or attached to by this process.
+ */
+void
+dsa_detach(dsa_area *area)
+{
+ int i;
+
+ /* Detach from all segments. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_detach(area->segment_maps[i].segment);
+
+ /* Free the backend-local area object. */
+ pfree(area);
+}
+
+/*
+ * Unlink a segment from the bin that contains it.
+ */
+static void
+unlink_segment(dsa_area *area, dsa_segment_map *segment_map)
+{
+ if (segment_map->header->prev != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *prev;
+
+ prev = get_segment_by_index(area, segment_map->header->prev);
+ prev->header->next = segment_map->header->next;
+ }
+ else
+ {
+ Assert(area->control->segment_bins[segment_map->header->bin] ==
+ get_segment_index(area, segment_map));
+ area->control->segment_bins[segment_map->header->bin] =
+ segment_map->header->next;
+ }
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area, segment_map->header->next);
+ next->header->prev = segment_map->header->prev;
+ }
+}
+
+/*
+ * Find a segment that could satisfy a request for 'npages' of contiguous
+ * memory, or return NULL if none can be found. This may involve attaching to
+ * segments that weren't previously attached so that we can query their free
+ * pages map.
+ */
+static dsa_segment_map *
+get_best_segment(dsa_area *area, Size npages)
+{
+ Size bin;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /*
+ * Start searching from the first bin that *might* have enough contiguous
+ * pages.
+ */
+ for (bin = contiguous_pages_to_segment_bin(npages);
+ bin < DSA_NUM_SEGMENT_BINS;
+ ++bin)
+ {
+ /*
+ * The minimum contiguous size that any segment in this bin should
+ * have. We'll re-bin if we see segments with fewer.
+ */
+ Size threshold = (Size) 1 << (bin - 1);
+ dsa_segment_index segment_index;
+
+ /* Search this bin for a segment with enough contiguous space. */
+ segment_index = area->control->segment_bins[bin];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+ dsa_segment_index next_segment_index;
+ Size contiguous_pages;
+
+ segment_map = get_segment_by_index(area, segment_index);
+ next_segment_index = segment_map->header->next;
+ contiguous_pages = fpm_largest(segment_map->fpm);
+
+ /* Not enough for the request, still enough for this bin. */
+ if (contiguous_pages >= threshold && contiguous_pages < npages)
+ {
+ segment_index = next_segment_index;
+ continue;
+ }
+
+ /* Re-bin it if it's no longer in the appropriate bin. */
+ if (contiguous_pages < threshold)
+ {
+ Size new_bin;
+
+ new_bin = contiguous_pages_to_segment_bin(contiguous_pages);
+
+ /* Remove it from its current bin. */
+ unlink_segment(area, segment_map);
+
+ /* Push it onto the front of its new bin. */
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[new_bin];
+ segment_map->header->bin = new_bin;
+ area->control->segment_bins[new_bin] = segment_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area,
+ segment_map->header->next);
+ Assert(next->header->bin == new_bin);
+ next->header->prev = segment_index;
+ }
+
+ /*
+ * But fall through to see if it's enough to satisfy this
+ * request anyway....
+ */
+ }
+
+ /* Check if we are done. */
+ if (contiguous_pages >= npages)
+ return segment_map;
+
+ /* Continue searching the same bin. */
+ segment_index = next_segment_index;
+ }
+ }
+
+ /* Not found. */
+ return NULL;
+}
+
+/*
+ * Create a new segment that can handle at least requested_pages. Returns
+ * NULL if the requested total size limit or maximum allowed number of
+ * segments would be exceeded.
+ */
+static dsa_segment_map *
+make_new_segment(dsa_area *area, Size requested_pages)
+{
+ dsa_segment_index new_index;
+ Size metadata_bytes;
+ Size total_size;
+ Size total_pages;
+ Size usable_pages;
+ dsa_segment_map *segment_map;
+ dsm_segment *segment;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /* Find a segment slot that is not in use (linearly for now). */
+ for (new_index = 1; new_index < DSA_MAX_SEGMENTS; ++new_index)
+ {
+ if (area->control->segment_handles[new_index] == DSM_HANDLE_INVALID)
+ break;
+ }
+ if (new_index == DSA_MAX_SEGMENTS)
+ return NULL;
+
+ /*
+ * If the total size limit is already exceeded, then we exit early and
+ * avoid arithmetic wraparound in the unsigned expressions below.
+ */
+ if (area->control->total_segment_size >=
+ area->control->max_total_segment_size)
+ return NULL;
+
+ /*
+ * The size should be at least as big as requested, and at least big
+ * enough to follow a geometric series that approximately doubles the
+ * total storage each time we create a new segment. We use geometric
+ * growth because the underlying DSM system isn't designed for large
+ * numbers of segments (otherwise we might even consider just using one
+ * DSM segment for each large allocation and for each superblock, and then
+ * we wouldn't need to use FreePageManager).
+ *
+ * We decide on a total segment size first, so that we produce tidy
+ * power-of-two sized segments. This is a good property to have if we
+ * move to huge pages in the future. Then we work back to the number of
+ * pages we can fit.
+ */
+ total_size = DSA_INITIAL_SEGMENT_SIZE *
+ ((Size) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
+ total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size,
+ area->control->max_total_segment_size -
+ area->control->total_segment_size);
+
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ sizeof(dsa_pointer) * total_pages;
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ if (total_size <= metadata_bytes)
+ return NULL;
+ usable_pages = (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ Assert(metadata_bytes + usable_pages * FPM_PAGE_SIZE <= total_size);
+
+ /* See if that is enough... */
+ if (requested_pages > usable_pages)
+ {
+ /*
+ * We'll make an odd-sized segment, working forward from the requested
+ * number of pages.
+ */
+ usable_pages = requested_pages;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ usable_pages * sizeof(dsa_pointer);
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ total_size = metadata_bytes + usable_pages * FPM_PAGE_SIZE;
+
+ /* Is that too large for dsa_pointer's addressing scheme? */
+ if (total_size > DSA_MAX_SEGMENT_SIZE)
+ return NULL;
+
+ /* Would that exceed the limit? */
+ if (total_size > area->control->max_total_segment_size -
+ area->control->total_segment_size)
+ return NULL;
+ }
+
+ /* Create the segment. */
+ segment = dsm_create(total_size, 0);
+ if (segment == NULL)
+ return NULL;
+ dsm_pin_segment(segment);
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+
+ /* Store the handle in shared memory to be found by index. */
+ area->control->segment_handles[new_index] =
+ dsm_segment_handle(segment);
+
+ area->control->total_segment_size += total_size;
+ Assert(area->control->total_segment_size <=
+ area->control->max_total_segment_size);
+
+ /* Build a segment map for this segment in this backend. */
+ segment_map = &area->segment_maps[new_index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->size = total_size;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Set up the segment header and put it in the appropriate bin. */
+ segment_map->header->magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ new_index;
+ segment_map->header->usable_pages = usable_pages;
+ segment_map->header->size = total_size;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[segment_map->header->bin];
+ segment_map->header->freed = false;
+ area->control->segment_bins[segment_map->header->bin] = new_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next =
+ get_segment_by_index(area, segment_map->header->next);
+
+ Assert(next->header->bin == segment_map->header->bin);
+ next->header->prev = new_index;
+ }
+
+ return segment_map;
+}
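
The sizing arithmetic in make_new_segment() above can be tried out in
isolation. What follows is only a rough sketch of that calculation, not
code from the patch: every SKETCH_* constant is an assumed stand-in
chosen for illustration (the real values come from DSA_INITIAL_SEGMENT_SIZE,
DSA_NUM_SEGMENTS_AT_EACH_SIZE, FPM_PAGE_SIZE and the sizes of the actual
structs), and the helper names are hypothetical. The caps against
DSA_MAX_SEGMENT_SIZE and max_total_segment_size are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed illustrative values; the patch defines the real ones elsewhere. */
#define SKETCH_PAGE_SIZE             ((size_t) 4096)
#define SKETCH_INITIAL_SEGMENT_SIZE  ((size_t) 1024 * 1024)
#define SKETCH_SEGMENTS_AT_EACH_SIZE 4
#define SKETCH_HEADER_BYTES          ((size_t) 64)   /* stand-in for MAXALIGN(sizeof(dsa_segment_header)) */
#define SKETCH_FPM_BYTES             ((size_t) 256)  /* stand-in for MAXALIGN(sizeof(FreePageManager)) */
#define SKETCH_PAGEMAP_ENTRY_BYTES   ((size_t) 8)    /* stand-in for sizeof(dsa_pointer) */

/* Pad a byte count up to the next page boundary. */
static size_t
sketch_round_up_to_page(size_t nbytes)
{
	size_t		rem = nbytes % SKETCH_PAGE_SIZE;

	return rem == 0 ? nbytes : nbytes + (SKETCH_PAGE_SIZE - rem);
}

/*
 * Total size chosen for the segment in a given slot: the size doubles
 * after every SKETCH_SEGMENTS_AT_EACH_SIZE slots.
 */
static size_t
sketch_segment_total_size(size_t slot_index)
{
	size_t		doublings = slot_index / SKETCH_SEGMENTS_AT_EACH_SIZE;

	/* Guard against shifting the initial size out of range. */
	assert(doublings < sizeof(size_t) * 8 - 21);
	return SKETCH_INITIAL_SEGMENT_SIZE << doublings;
}

/*
 * Work back from a total size to the number of pages left for allocation
 * once the header, free page manager and page map have been carved out.
 * Returns 0 if the metadata alone would fill the segment.
 */
static size_t
sketch_segment_usable_pages(size_t total_size)
{
	size_t		total_pages = total_size / SKETCH_PAGE_SIZE;
	size_t		metadata_bytes;

	metadata_bytes = sketch_round_up_to_page(SKETCH_HEADER_BYTES +
											 SKETCH_FPM_BYTES +
											 SKETCH_PAGEMAP_ENTRY_BYTES * total_pages);
	if (total_size <= metadata_bytes)
		return 0;
	return (total_size - metadata_bytes) / SKETCH_PAGE_SIZE;
}
```

With these assumed constants, slots 1 to 3 get 1MB segments and slot 4 is
the first to get 2MB. The page map is sized for total_pages rather than
usable_pages, so it is slightly over-provisioned, which matches the
"decide on a total size first, then work back" order described in the
comment above.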
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index b2403e1..20973af 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/mmgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = aset.o mcxt.o portalmem.o
+OBJS = aset.o freepage.o mcxt.o portalmem.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
new file mode 100644
index 0000000..8f8dca5
--- /dev/null
+++ b/src/backend/utils/mmgr/freepage.c
@@ -0,0 +1,1813 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.c
+ * Management of free memory pages.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/freepage.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+
+#include "utils/freepage.h"
+#include "utils/relptr.h"
+
+
+/* Magic numbers to identify various page types */
+#define FREE_PAGE_SPAN_LEADER_MAGIC 0xea4020f0
+#define FREE_PAGE_LEAF_MAGIC 0x98eae728
+#define FREE_PAGE_INTERNAL_MAGIC 0x19aa32c9
+
+/* Doubly linked list of spans of free pages; stored in first page of span. */
+struct FreePageSpanLeader
+{
+ int magic; /* always FREE_PAGE_SPAN_LEADER_MAGIC */
+ Size npages; /* number of pages in span */
+ RelptrFreePageSpanLeader prev;
+ RelptrFreePageSpanLeader next;
+};
+
+/* Common header for btree leaf and internal pages. */
+typedef struct FreePageBtreeHeader
+{
+ int magic; /* FREE_PAGE_LEAF_MAGIC or
+ * FREE_PAGE_INTERNAL_MAGIC */
+ Size nused; /* number of items used */
+ RelptrFreePageBtree parent; /* uplink */
+} FreePageBtreeHeader;
+
+/* Internal key; points to next level of btree. */
+typedef struct FreePageBtreeInternalKey
+{
+ Size first_page; /* low bound for keys on child page */
+ RelptrFreePageBtree child; /* downlink */
+} FreePageBtreeInternalKey;
+
+/* Leaf key; no payload data. */
+typedef struct FreePageBtreeLeafKey
+{
+ Size first_page; /* first page in span */
+ Size npages; /* number of pages in span */
+} FreePageBtreeLeafKey;
+
+/* Work out how many keys will fit on a page. */
+#define FPM_ITEMS_PER_INTERNAL_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeInternalKey))
+#define FPM_ITEMS_PER_LEAF_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeLeafKey))
+
+/* A btree page of either sort */
+struct FreePageBtree
+{
+ FreePageBtreeHeader hdr;
+ union
+ {
+ FreePageBtreeInternalKey internal_key[FPM_ITEMS_PER_INTERNAL_PAGE];
+ FreePageBtreeLeafKey leaf_key[FPM_ITEMS_PER_LEAF_PAGE];
+ } u;
+};
+
+/* Results of a btree search */
+typedef struct FreePageBtreeSearchResult
+{
+ FreePageBtree *page;
+ Size index;
+ bool found;
+ unsigned split_pages;
+} FreePageBtreeSearchResult;
+
+/* Helper functions */
+static void FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm,
+ FreePageBtree *btp);
+static Size FreePageBtreeCleanup(FreePageManager *fpm);
+static FreePageBtree *FreePageBtreeFindLeftSibling(char *base,
+ FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeFindRightSibling(char *base,
+ FreePageBtree *btp);
+static Size FreePageBtreeFirstKey(FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeGetRecycled(FreePageManager *fpm);
+static void FreePageBtreeInsertInternal(char *base, FreePageBtree *btp,
+ Size index, Size first_page, FreePageBtree *child);
+static void FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index,
+ Size first_page, Size npages);
+static void FreePageBtreeRecycle(FreePageManager *fpm, Size pageno);
+static void FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp,
+ Size index);
+static void FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp);
+static void FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result);
+static Size FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page);
+static Size FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page);
+static FreePageBtree *FreePageBtreeSplitPage(FreePageManager *fpm,
+ FreePageBtree *btp);
+static void FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp);
+static void FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf);
+static void FreePageManagerDumpSpans(FreePageManager *fpm,
+ FreePageSpanLeader *span, Size expected_pages,
+ StringInfo buf);
+static bool FreePageManagerGetInternal(FreePageManager *fpm, Size npages,
+ Size *first_page);
+static Size FreePageManagerPutInternal(FreePageManager *fpm, Size first_page,
+ Size npages, bool soft);
+static void FreePagePopSpanLeader(FreePageManager *fpm, Size pageno);
+static void FreePagePushSpanLeader(FreePageManager *fpm, Size first_page,
+ Size npages);
+static void FreePageManagerUpdateLargest(FreePageManager *fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+static Size sum_free_pages(FreePageManager *fpm);
+#endif
+
+/*
+ * Initialize a new, empty free page manager.
+ *
+ * 'fpm' should reference caller-provided memory large enough to contain a
+ * FreePageManager. We'll initialize it here.
+ *
+ * 'base' is the address to which all pointers are relative. When managing
+ * a dynamic shared memory segment, it should normally be the base of the
+ * segment. When managing backend-private memory, it can be either NULL or,
+ * if managing a single contiguous extent of memory, the start of that extent.
+ */
+void
+FreePageManagerInitialize(FreePageManager *fpm, char *base)
+{
+ Size f;
+
+ relptr_store(base, fpm->self, fpm);
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ relptr_store(base, fpm->btree_recycle, (FreePageSpanLeader *) NULL);
+ fpm->btree_depth = 0;
+ fpm->btree_recycle_count = 0;
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->contiguous_pages = 0;
+ fpm->contiguous_pages_dirty = true;
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages = 0;
+#endif
+
+ for (f = 0; f < FPM_NUM_FREELISTS; f++)
+ relptr_store(base, fpm->freelist[f], (FreePageSpanLeader *) NULL);
+}
+
+/*
+ * Allocate a run of pages of the given length from the free page manager.
+ * The return value indicates whether we were able to satisfy the request;
+ * if true, the first page of the allocation is stored in *first_page.
+ */
+bool
+FreePageManagerGet(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ bool result;
+
+ result = FreePageManagerGetInternal(fpm, npages, first_page);
+
+ /*
+ * It's a bit counterintuitive, but allocating pages can actually create
+ * opportunities for cleanup that create larger ranges. We might pull a
+ * key out of the btree that enables the item at the head of the btree
+ * recycle list to be inserted; and then if there are more items behind it
+ * one of those might cause two currently-separated ranges to merge,
+ * creating a single range of contiguous pages larger than any that
+ * existed previously. It might be worth trying to improve the cleanup
+ * algorithm to avoid such corner cases, but for now we just notice the
+ * condition and do the appropriate reporting.
+ */
+ FreePageBtreeCleanup(fpm);
+
+ /*
+ * TODO: We could take Max(fpm->contiguous_pages, result of
+ * FreePageBtreeCleanup) and give it to FreePageManagerUpdateLargest as a
+ * starting point for its search, potentially avoiding a bunch of work,
+ * since there is no way the largest contiguous run is bigger than that.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ if (result)
+ {
+ Assert(fpm->free_pages >= npages);
+ fpm->free_pages -= npages;
+ }
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+ return result;
+}
+
+#ifdef FPM_EXTRA_ASSERTS
+static void
+sum_free_pages_recurse(FreePageManager *fpm, FreePageBtree *btp, Size *sum)
+{
+ char *base = fpm_segment_base(fpm);
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ||
+ btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ ++*sum;
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ Size index;
+
+
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ sum_free_pages_recurse(fpm, child, sum);
+ }
+ }
+}
+static Size
+sum_free_pages(FreePageManager *fpm)
+{
+ FreePageSpanLeader *recycle;
+ char *base = fpm_segment_base(fpm);
+ Size sum = 0;
+ int list;
+
+ /* Count the spans by scanning the freelists. */
+ for (list = 0; list < FPM_NUM_FREELISTS; ++list)
+ {
+
+ if (!relptr_is_null(fpm->freelist[list]))
+ {
+ FreePageSpanLeader *candidate =
+ relptr_access(base, fpm->freelist[list]);
+
+ do
+ {
+ sum += candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ }
+
+ /* Count the pages used by the btree itself (internal and leaf). */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ sum_free_pages_recurse(fpm, root, &sum);
+ }
+
+ /* Count the recycle list. */
+ for (recycle = relptr_access(base, fpm->btree_recycle);
+ recycle != NULL;
+ recycle = relptr_access(base, recycle->next))
+ {
+ Assert(recycle->npages == 1);
+ ++sum;
+ }
+
+ return sum;
+}
+#endif
+
+/*
+ * Recompute the size of the largest run of pages that the user could
+ * successfully get, if it has been marked dirty.
+ */
+static void
+FreePageManagerUpdateLargest(FreePageManager *fpm)
+{
+ char *base;
+ Size largest;
+
+ if (!fpm->contiguous_pages_dirty)
+ return;
+
+ base = fpm_segment_base(fpm);
+ largest = 0;
+ if (!relptr_is_null(fpm->freelist[FPM_NUM_FREELISTS - 1]))
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[FPM_NUM_FREELISTS - 1]);
+ do
+ {
+ if (candidate->npages > largest)
+ largest = candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ else
+ {
+ Size f = FPM_NUM_FREELISTS - 1;
+
+ do
+ {
+ --f;
+ if (!relptr_is_null(fpm->freelist[f]))
+ {
+ largest = f + 1;
+ break;
+ }
+ } while (f > 0);
+ }
+
+ fpm->contiguous_pages = largest;
+ fpm->contiguous_pages_dirty = false;
+}
+
+/*
+ * Transfer a run of pages to the free page manager.
+ */
+void
+FreePageManagerPut(FreePageManager *fpm, Size first_page, Size npages)
+{
+ Size contiguous_pages;
+
+ Assert(npages > 0);
+
+ /* Record the new pages. */
+ contiguous_pages =
+ FreePageManagerPutInternal(fpm, first_page, npages, false);
+
+ /*
+ * If the new range we inserted into the page manager was contiguous with
+ * an existing range, it may have opened up cleanup opportunities.
+ */
+ if (contiguous_pages > npages)
+ {
+ Size cleanup_contiguous_pages;
+
+ cleanup_contiguous_pages = FreePageBtreeCleanup(fpm);
+ if (cleanup_contiguous_pages > contiguous_pages)
+ contiguous_pages = cleanup_contiguous_pages;
+ }
+
+ /*
+ * TODO: Figure out how to avoid setting this every time. It may not be as
+ * simple as it looks.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages += npages;
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+}
+
+/*
+ * Produce a debugging dump of the state of a free page manager.
+ */
+char *
+FreePageManagerDump(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ StringInfoData buf;
+ FreePageSpanLeader *recycle;
+ bool dumped_any_freelist = false;
+ Size f;
+
+ /* Initialize output buffer. */
+ initStringInfo(&buf);
+
+ /* Dump general stuff. */
+ appendStringInfo(&buf, "metadata: self %zu max contiguous pages = %zu\n",
+ fpm->self.relptr_off, fpm->contiguous_pages);
+
+ /* Dump btree. */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root;
+
+ appendStringInfo(&buf, "btree depth %u:\n", fpm->btree_depth);
+ root = relptr_access(base, fpm->btree_root);
+ FreePageManagerDumpBtree(fpm, root, NULL, 0, &buf);
+ }
+ else if (fpm->singleton_npages > 0)
+ {
+ appendStringInfo(&buf, "singleton: %zu(%zu)\n",
+ fpm->singleton_first_page, fpm->singleton_npages);
+ }
+
+ /* Dump btree recycle list. */
+ recycle = relptr_access(base, fpm->btree_recycle);
+ if (recycle != NULL)
+ {
+ appendStringInfo(&buf, "btree recycle:");
+ FreePageManagerDumpSpans(fpm, recycle, 1, &buf);
+ }
+
+ /* Dump free lists. */
+ for (f = 0; f < FPM_NUM_FREELISTS; ++f)
+ {
+ FreePageSpanLeader *span;
+
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+ if (!dumped_any_freelist)
+ {
+ appendStringInfo(&buf, "freelists:\n");
+ dumped_any_freelist = true;
+ }
+ appendStringInfo(&buf, " %zu:", f + 1);
+ span = relptr_access(base, fpm->freelist[f]);
+ FreePageManagerDumpSpans(fpm, span, f + 1, &buf);
+ }
+
+ /* And return result to caller. */
+ return buf.data;
+}
+
+
+/*
+ * The first_page value stored at index zero in any non-root page must match
+ * the first_page value stored in its parent at the index which points to that
+ * page. So when the value stored at index zero in a btree page changes, we've
+ * got to walk up the tree adjusting ancestor keys until we reach an ancestor
+ * where that key isn't index zero. This function should be called after
+ * updating the first key on the target page; it will propagate the change
+ * upward as far as needed.
+ *
+ * We assume here that the first key on the page has not changed enough to
+ * require changes in the ordering of keys on its ancestor pages. Thus,
+ * if we search the parent page for the first key greater than or equal to
+ * the first key on the current page, the downlink to this page will be either
+ * the exact index returned by the search (if the first key decreased)
+ * or one less (if the first key increased).
+ */
+static void
+FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ Size first_page;
+ FreePageBtree *parent;
+ FreePageBtree *child;
+
+ /* This might be either a leaf or an internal page. */
+ Assert(btp->hdr.nused > 0);
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ first_page = btp->u.leaf_key[0].first_page;
+ }
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ first_page = btp->u.internal_key[0].first_page;
+ }
+ child = btp;
+
+ /* Loop until we find an ancestor that does not require adjustment. */
+ for (;;)
+ {
+ Size s;
+
+ parent = relptr_access(base, child->hdr.parent);
+ if (parent == NULL)
+ break;
+ s = FreePageBtreeSearchInternal(parent, first_page);
+
+ /* Key is either at index s or index s-1; figure out which. */
+ if (s >= parent->hdr.nused)
+ {
+ Assert(s == parent->hdr.nused);
+ --s;
+ }
+ else
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ if (check != child)
+ {
+ Assert(s > 0);
+ --s;
+ }
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* Debugging double-check. */
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ Assert(s < parent->hdr.nused);
+ Assert(child == check);
+ }
+#endif
+
+ /* Update the parent key. */
+ parent->u.internal_key[s].first_page = first_page;
+
+ /*
+ * If this is the first key in the parent, go up another level; else
+ * done.
+ */
+ if (s > 0)
+ break;
+ child = parent;
+ }
+}
+
+/*
+ * Attempt to reclaim space from the free-page btree. The return value is
+ * the largest range of contiguous pages created by the cleanup operation.
+ */
+static Size
+FreePageBtreeCleanup(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ Size max_contiguous_pages = 0;
+
+ /* Attempt to shrink the depth of the btree. */
+ while (!relptr_is_null(fpm->btree_root))
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ /* If the root contains only one key, reduce depth by one. */
+ if (root->hdr.nused == 1)
+ {
+ /* Shrink depth of tree by one. */
+ Assert(fpm->btree_depth > 0);
+ --fpm->btree_depth;
+ if (root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ /* If root is a leaf, convert only entry to singleton range. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages;
+ }
+ else
+ {
+ FreePageBtree *newroot;
+
+ /* If root is an internal page, make only child the root. */
+ Assert(root->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ relptr_copy(fpm->btree_root, root->u.internal_key[0].child);
+ newroot = relptr_access(base, fpm->btree_root);
+ relptr_store(base, newroot->hdr.parent, (FreePageBtree *) NULL);
+ }
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, root));
+ }
+ else if (root->hdr.nused == 2 &&
+ root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Size end_of_first;
+ Size start_of_second;
+
+ end_of_first = root->u.leaf_key[0].first_page +
+ root->u.leaf_key[0].npages;
+ start_of_second = root->u.leaf_key[1].first_page;
+
+ if (end_of_first + 1 == start_of_second)
+ {
+ Size root_page = fpm_pointer_to_page(base, root);
+
+ if (end_of_first == root_page)
+ {
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[0].first_page);
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[1].first_page);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages +
+ root->u.leaf_key[1].npages + 1;
+ fpm->btree_depth = 0;
+ relptr_store(base, fpm->btree_root,
+ (FreePageBtree *) NULL);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ Assert(max_contiguous_pages == 0);
+ max_contiguous_pages = fpm->singleton_npages;
+ }
+ }
+
+ /* Whether it worked or not, it's time to stop. */
+ break;
+ }
+ else
+ {
+ /* Nothing more to do. Stop. */
+ break;
+ }
+ }
+
+ /*
+ * Attempt to free recycled btree pages. We skip this if releasing the
+ * recycled page would require a btree page split, because the page we're
+ * trying to recycle would be consumed by the split, which would be
+ * counterproductive.
+ *
+ * We also currently only ever attempt to recycle the first page on the
+ * list; that could be made more aggressive, but it's not clear that the
+ * complexity would be worthwhile.
+ */
+ while (fpm->btree_recycle_count > 0)
+ {
+ FreePageBtree *btp;
+ Size first_page;
+ Size contiguous_pages;
+
+ btp = FreePageBtreeGetRecycled(fpm);
+ first_page = fpm_pointer_to_page(base, btp);
+ contiguous_pages = FreePageManagerPutInternal(fpm, first_page, 1, true);
+ if (contiguous_pages == 0)
+ {
+ FreePageBtreeRecycle(fpm, first_page);
+ break;
+ }
+ else
+ {
+ if (contiguous_pages > max_contiguous_pages)
+ max_contiguous_pages = contiguous_pages;
+ }
+ }
+
+ return max_contiguous_pages;
+}
+
+/*
+ * Consider consolidating the given page with its left or right sibling,
+ * if it's fairly empty.
+ */
+static void
+FreePageBtreeConsolidate(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *np;
+ Size max;
+
+ /*
+ * We only try to consolidate pages that are less than a third full. We
+ * could be more aggressive about this, but that might risk performing
+ * consolidation only to end up splitting again shortly thereafter. Since
+ * the btree should be very small compared to the space under management,
+ * our goal isn't so much to ensure that it always occupies the absolutely
+ * smallest possible number of pages as to reclaim pages before things get
+ * too egregiously out of hand.
+ */
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ max = FPM_ITEMS_PER_LEAF_PAGE;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ max = FPM_ITEMS_PER_INTERNAL_PAGE;
+ }
+ if (btp->hdr.nused >= max / 3)
+ return;
+
+ /*
+ * If we can fit our right sibling's keys onto this page, consolidate.
+ */
+ np = FreePageBtreeFindRightSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&btp->u.leaf_key[btp->hdr.nused], &np->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ }
+ else
+ {
+ memcpy(&btp->u.internal_key[btp->hdr.nused], &np->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, btp);
+ }
+ FreePageBtreeRemovePage(fpm, np);
+ return;
+ }
+
+ /*
+ * If we can fit our keys onto our left sibling's page, consolidate. In
+ * this case, we move our keys onto the other page rather than vice
+ * versa, to avoid having to adjust ancestor keys.
+ */
+ np = FreePageBtreeFindLeftSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&np->u.leaf_key[np->hdr.nused], &btp->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ }
+ else
+ {
+ memcpy(&np->u.internal_key[np->hdr.nused], &btp->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, np);
+ }
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+}
+
+/*
+ * Find the passed page's left sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately precedes ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindLeftSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move left. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the leftmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index > 0)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index - 1].child);
+ break;
+ }
+ Assert(index == 0);
+ ++levels;
+ }
+
+ /* Descend right. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[p->hdr.nused - 1].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Find the passed page's right sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately follows ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindRightSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move right. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the rightmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index < p->hdr.nused - 1)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index + 1].child);
+ break;
+ }
+ Assert(index == p->hdr.nused - 1);
+ ++levels;
+ }
+
+ /* Descend left. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[0].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Get the first key on a btree page.
+ */
+static Size
+FreePageBtreeFirstKey(FreePageBtree *btp)
+{
+ Assert(btp->hdr.nused > 0);
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ return btp->u.leaf_key[0].first_page;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ return btp->u.internal_key[0].first_page;
+ }
+}
+
+/*
+ * Get a page from the btree recycle list for use as a btree page.
+ */
+static FreePageBtree *
+FreePageBtreeGetRecycled(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *newhead;
+
+ Assert(victim != NULL);
+ newhead = relptr_access(base, victim->next);
+ if (newhead != NULL)
+ relptr_copy(newhead->prev, victim->prev);
+ relptr_store(base, fpm->btree_recycle, newhead);
+ Assert(fpm_pointer_is_page_aligned(base, victim));
+ fpm->btree_recycle_count--;
+ return (FreePageBtree *) victim;
+}
+
+/*
+ * Insert an item into an internal page.
+ */
+static void
+FreePageBtreeInsertInternal(char *base, FreePageBtree *btp, Size index,
+ Size first_page, FreePageBtree *child)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.internal_key[index + 1], &btp->u.internal_key[index],
+ sizeof(FreePageBtreeInternalKey) * (btp->hdr.nused - index));
+ btp->u.internal_key[index].first_page = first_page;
+ relptr_store(base, btp->u.internal_key[index].child, child);
+ ++btp->hdr.nused;
+}
+
+/*
+ * Insert an item into a leaf page.
+ */
+static void
+FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index, Size first_page,
+ Size npages)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.leaf_key[index + 1], &btp->u.leaf_key[index],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+ btp->u.leaf_key[index].first_page = first_page;
+ btp->u.leaf_key[index].npages = npages;
+ ++btp->hdr.nused;
+}
+
+/*
+ * Put a page on the btree recycle list.
+ */
+static void
+FreePageBtreeRecycle(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *head = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = 1;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->btree_recycle, span);
+ fpm->btree_recycle_count++;
+}
+
+/*
+ * Remove an item from the btree at the given position on the given page.
+ */
+static void
+FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp, Size index)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(index < btp->hdr.nused);
+
+ /* When last item is removed, extirpate entire page from btree. */
+ if (btp->hdr.nused == 1)
+ {
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+
+ /* Physically remove the key from the page. */
+ --btp->hdr.nused;
+ if (index < btp->hdr.nused)
+ memmove(&btp->u.leaf_key[index], &btp->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+
+ /* If we just removed the first key, adjust ancestor keys. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, btp);
+
+ /* Consider whether to consolidate this page with a sibling. */
+ FreePageBtreeConsolidate(fpm, btp);
+}
+
+/*
+ * Remove a page from the btree. Caller is responsible for having relocated
+ * any keys from this page that are still wanted. The page is placed on the
+ * recycled list.
+ */
+static void
+FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *parent;
+ Size index;
+ Size first_page;
+
+ for (;;)
+ {
+ /* Find parent page. */
+ parent = relptr_access(base, btp->hdr.parent);
+ if (parent == NULL)
+ {
+ /* We are removing the root page. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->btree_depth = 0;
+ Assert(fpm->singleton_first_page == 0);
+ Assert(fpm->singleton_npages == 0);
+ return;
+ }
+
+ /*
+ * If the parent contains only one item, we need to remove it as well.
+ */
+ if (parent->hdr.nused > 1)
+ break;
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+ btp = parent;
+ }
+
+ /* Find and remove the downlink. */
+ first_page = FreePageBtreeFirstKey(btp);
+ if (parent->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ index = FreePageBtreeSearchLeaf(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.leaf_key[index],
+ &parent->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ else
+ {
+ index = FreePageBtreeSearchInternal(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.internal_key[index],
+ &parent->u.internal_key[index + 1],
+ sizeof(FreePageBtreeInternalKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ parent->hdr.nused--;
+ Assert(parent->hdr.nused > 0);
+
+ /* Recycle the page. */
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+
+ /* Adjust ancestor keys if needed. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+
+ /* Consider whether to consolidate the parent with a sibling. */
+ FreePageBtreeConsolidate(fpm, parent);
+}
+
+/*
+ * Search the btree for an entry for the given first page and initialize
+ * *result with the results of the search. result->page and result->index
+ * indicate either the position of an exact match or the position at which
+ * the new key should be inserted. result->found is true for an exact match,
+ * otherwise false. result->split_pages will contain the number of additional
+ * btree pages that will be needed when performing a split to insert a key.
+ * Except as described above, the contents of fields in the result object are
+ * undefined on return.
+ */
+static void
+FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *btp = relptr_access(base, fpm->btree_root);
+ Size index;
+
+ result->split_pages = 1;
+
+ /* If the btree is empty, there's nothing to find. */
+ if (btp == NULL)
+ {
+ result->page = NULL;
+ result->found = false;
+ return;
+ }
+
+ /* Descend until we hit a leaf. */
+ while (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ FreePageBtree *child;
+ bool found_exact;
+
+ index = FreePageBtreeSearchInternal(btp, first_page);
+ found_exact = index < btp->hdr.nused &&
+ btp->u.internal_key[index].first_page == first_page;
+
+ /*
+ * If we found an exact match we descend directly. Otherwise, we
+ * descend into the child to the left if possible so that we can find
+ * the insertion point at that child's high end.
+ */
+ if (!found_exact && index > 0)
+ --index;
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_INTERNAL_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Descend to appropriate child page. */
+ Assert(index < btp->hdr.nused);
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ Assert(relptr_access(base, child->hdr.parent) == btp);
+ btp = child;
+ }
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_LEAF_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_LEAF_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Search leaf page. */
+ index = FreePageBtreeSearchLeaf(btp, first_page);
+
+ /* Assemble results. */
+ result->page = btp;
+ result->index = index;
+ result->found = index < btp->hdr.nused &&
+ first_page == btp->u.leaf_key[index].first_page;
+}
+
+/*
+ * Search an internal page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if none.
+ */
+static Size
+FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_INTERNAL_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.internal_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Search a leaf page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if none.
+ */
+static Size
+FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_LEAF_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.leaf_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Allocate a new btree page and move half the keys from the provided page
+ * to the new page. Caller is responsible for making sure that there's a
+ * page available from fpm->btree_recycle. Returns a pointer to the new page,
+ * to which caller must add a downlink.
+ */
+static FreePageBtree *
+FreePageBtreeSplitPage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ FreePageBtree *newsibling;
+
+ newsibling = FreePageBtreeGetRecycled(fpm);
+ newsibling->hdr.magic = btp->hdr.magic;
+ newsibling->hdr.nused = btp->hdr.nused / 2;
+ relptr_copy(newsibling->hdr.parent, btp->hdr.parent);
+ btp->hdr.nused -= newsibling->hdr.nused;
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ memcpy(&newsibling->u.leaf_key,
+ &btp->u.leaf_key[btp->hdr.nused],
+ sizeof(FreePageBtreeLeafKey) * newsibling->hdr.nused);
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ memcpy(&newsibling->u.internal_key,
+ &btp->u.internal_key[btp->hdr.nused],
+ sizeof(FreePageBtreeInternalKey) * newsibling->hdr.nused);
+ FreePageBtreeUpdateParentPointers(fpm_segment_base(fpm), newsibling);
+ }
+
+ return newsibling;
+}
+
+/*
+ * When internal pages are split or merged, the parent pointers of their
+ * children must be updated.
+ */
+static void
+FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp)
+{
+ Size i;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ for (i = 0; i < btp->hdr.nused; ++i)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[i].child);
+ relptr_store(base, child->hdr.parent, btp);
+ }
+}
+
+/*
+ * Debugging dump of btree data.
+ */
+static void
+FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+ Size pageno = fpm_pointer_to_page(base, btp);
+ Size index;
+ FreePageBtree *check_parent;
+
+ check_stack_depth();
+ check_parent = relptr_access(base, btp->hdr.parent);
+ appendStringInfo(buf, " %zu@%d %c", pageno, level,
+ btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ? 'i' : 'l');
+ if (parent != check_parent)
+ appendStringInfo(buf, " [actual parent %zu, expected %zu]",
+ fpm_pointer_to_page(base, check_parent),
+ fpm_pointer_to_page(base, parent));
+ appendStringInfoChar(buf, ':');
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ appendStringInfo(buf, " %zu->%zu",
+ btp->u.internal_key[index].first_page,
+ btp->u.internal_key[index].child.relptr_off / FPM_PAGE_SIZE);
+ else
+ appendStringInfo(buf, " %zu(%zu)",
+ btp->u.leaf_key[index].first_page,
+ btp->u.leaf_key[index].npages);
+ }
+ appendStringInfoChar(buf, '\n');
+
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ FreePageManagerDumpBtree(fpm, child, btp, level + 1, buf);
+ }
+ }
+}
+
+/*
+ * Debugging dump of free-span data.
+ */
+static void
+FreePageManagerDumpSpans(FreePageManager *fpm, FreePageSpanLeader *span,
+ Size expected_pages, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+
+ while (span != NULL)
+ {
+ if (span->npages != expected_pages)
+ appendStringInfo(buf, " %zu(%zu)", fpm_pointer_to_page(base, span),
+ span->npages);
+ else
+ appendStringInfo(buf, " %zu", fpm_pointer_to_page(base, span));
+ span = relptr_access(base, span->next);
+ }
+
+ appendStringInfoChar(buf, '\n');
+}
+
+/*
+ * This function allocates a run of pages of the given length from the free
+ * page manager.
+ */
+static bool
+FreePageManagerGetInternal(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = NULL;
+ FreePageSpanLeader *prev;
+ FreePageSpanLeader *next;
+ FreePageBtreeSearchResult result;
+ Size victim_page = 0; /* placate compiler */
+ Size f;
+
+ /*
+ * Search for a free span.
+ *
+ * Right now, we use a simple best-fit policy here, but it's possible for
+ * this to result in memory fragmentation if we're repeatedly asked to
+ * allocate chunks just a little smaller than what we have available.
+ * Hopefully, this is unlikely, because we expect most requests to be
+ * single pages or superblock-sized chunks -- but no policy can be optimal
+ * under all circumstances unless it has knowledge of future allocation
+ * patterns.
+ */
+ for (f = Min(npages, FPM_NUM_FREELISTS) - 1; f < FPM_NUM_FREELISTS; ++f)
+ {
+ /* Skip empty freelists. */
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+
+ /*
+ * All of the freelists except the last one contain only items of a
+ * single size, so we just take the first one. But the final free
+ * list contains everything too big for any of the other lists, so we
+ * need to search the list.
+ */
+ if (f < FPM_NUM_FREELISTS - 1)
+ victim = relptr_access(base, fpm->freelist[f]);
+ else
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[f]);
+ do
+ {
+ if (candidate->npages >= npages && (victim == NULL ||
+ victim->npages > candidate->npages))
+ {
+ victim = candidate;
+ if (victim->npages == npages)
+ break;
+ }
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ break;
+ }
+
+ /* If we didn't find an allocatable span, return failure. */
+ if (victim == NULL)
+ return false;
+
+ /* Remove span from free list. */
+ Assert(victim->magic == FREE_PAGE_SPAN_LEADER_MAGIC);
+ prev = relptr_access(base, victim->prev);
+ next = relptr_access(base, victim->next);
+ if (prev != NULL)
+ relptr_copy(prev->next, victim->next);
+ else
+ relptr_copy(fpm->freelist[f], victim->next);
+ if (next != NULL)
+ relptr_copy(next->prev, victim->prev);
+ victim_page = fpm_pointer_to_page(base, victim);
+
+ /*
+ * If we haven't initialized the btree yet, the victim must be the single
+ * span stored within the FreePageManager itself. Otherwise, we need to
+ * update the btree.
+ */
+ if (relptr_is_null(fpm->btree_root))
+ {
+ Assert(victim_page == fpm->singleton_first_page);
+ Assert(victim->npages == fpm->singleton_npages);
+ Assert(victim->npages >= npages);
+ fpm->singleton_first_page += npages;
+ fpm->singleton_npages -= npages;
+ if (fpm->singleton_npages > 0)
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ }
+ else
+ {
+ /*
+ * If the span we found is exactly the right size, remove it from the
+ * btree completely. Otherwise, adjust the btree entry to reflect the
+ * still-unallocated portion of the span, and put that portion on the
+ * appropriate free list.
+ */
+ FreePageBtreeSearch(fpm, victim_page, &result);
+ Assert(result.found);
+ if (victim->npages == npages)
+ FreePageBtreeRemove(fpm, result.page, result.index);
+ else
+ {
+ FreePageBtreeLeafKey *key;
+
+ /* Adjust btree to reflect remaining pages. */
+ Assert(victim->npages > npages);
+ key = &result.page->u.leaf_key[result.index];
+ Assert(key->npages == victim->npages);
+ key->first_page += npages;
+ key->npages -= npages;
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put the unallocated pages back on the appropriate free list. */
+ FreePagePushSpanLeader(fpm, victim_page + npages,
+ victim->npages - npages);
+ }
+ }
+
+ /* Return results to caller. */
+ *first_page = fpm_pointer_to_page(base, victim);
+ return true;
+}
+
+/*
+ * Put a range of pages into the btree and freelists, consolidating it with
+ * existing free spans just before and/or after it. If 'soft' is true,
+ * only perform the insertion if it can be done without allocating new btree
+ * pages; if false, do it always. Returns 0 if the soft flag caused the
+ * insertion to be skipped, or otherwise the size of the contiguous span
+ * created by the insertion. This may be larger than npages if we're able
+ * to consolidate with an adjacent range. fpm->contiguous_pages_dirty is set
+ * to true if the btree may have allocated pages for internal purposes, which
+ * might invalidate the current largest run, requiring it to be recomputed.
+ */
+static Size
+FreePageManagerPutInternal(FreePageManager *fpm, Size first_page, Size npages,
+ bool soft)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtreeSearchResult result;
+ FreePageBtreeLeafKey *prevkey = NULL;
+ FreePageBtreeLeafKey *nextkey = NULL;
+ FreePageBtree *np;
+ Size nindex;
+
+ Assert(npages > 0);
+
+ /* We can store a single free span without initializing the btree. */
+ if (fpm->btree_depth == 0)
+ {
+ if (fpm->singleton_npages == 0)
+ {
+ /* Don't have a span yet; store this one. */
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return fpm->singleton_npages;
+ }
+ else if (fpm->singleton_first_page + fpm->singleton_npages ==
+ first_page)
+ {
+ /* New span immediately follows sole existing span. */
+ fpm->singleton_npages += npages;
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else if (first_page + npages == fpm->singleton_first_page)
+ {
+ /* New span immediately precedes sole existing span. */
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages += npages;
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else
+ {
+ /* Not contiguous; we need to initialize the btree. */
+ Size root_page;
+ FreePageBtree *root;
+
+ if (!relptr_is_null(fpm->btree_recycle))
+ root = FreePageBtreeGetRecycled(fpm);
+ else if (FreePageManagerGetInternal(fpm, 1, &root_page))
+ root = (FreePageBtree *) fpm_page_to_pointer(base, root_page);
+ else
+ {
+ /* We'd better be able to get a page from the existing range. */
+ elog(FATAL, "free page manager btree is corrupt");
+ }
+
+ /* Create the btree and move the preexisting range into it. */
+ root->hdr.magic = FREE_PAGE_LEAF_MAGIC;
+ root->hdr.nused = 1;
+ relptr_store(base, root->hdr.parent, (FreePageBtree *) NULL);
+ root->u.leaf_key[0].first_page = fpm->singleton_first_page;
+ root->u.leaf_key[0].npages = fpm->singleton_npages;
+ relptr_store(base, fpm->btree_root, root);
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->btree_depth = 1;
+
+ /*
+ * Corner case: it may be that the btree root took the very last
+ * free page. In that case, the sole btree entry covers a zero-page
+ * run, which is invalid. Overwrite it with the entry we're
+ * trying to insert and get out.
+ */
+ if (root->u.leaf_key[0].npages == 0)
+ {
+ root->u.leaf_key[0].first_page = first_page;
+ root->u.leaf_key[0].npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return npages;
+ }
+
+ /* Fall through to insert the new key. */
+ }
+ }
+
+ /* Search the btree. */
+ FreePageBtreeSearch(fpm, first_page, &result);
+ Assert(!result.found);
+ if (result.index > 0)
+ prevkey = &result.page->u.leaf_key[result.index - 1];
+ if (result.index < result.page->hdr.nused)
+ {
+ np = result.page;
+ nindex = result.index;
+ nextkey = &result.page->u.leaf_key[result.index];
+ }
+ else
+ {
+ np = FreePageBtreeFindRightSibling(base, result.page);
+ nindex = 0;
+ if (np != NULL)
+ nextkey = &np->u.leaf_key[0];
+ }
+
+ /* Consolidate with the previous entry if possible. */
+ if (prevkey != NULL && prevkey->first_page + prevkey->npages >= first_page)
+ {
+ bool remove_next = false;
+ Size result;
+
+ Assert(prevkey->first_page + prevkey->npages == first_page);
+ prevkey->npages = (first_page - prevkey->first_page) + npages;
+
+ /* Check whether we can *also* consolidate with the following entry. */
+ if (nextkey != NULL &&
+ prevkey->first_page + prevkey->npages >= nextkey->first_page)
+ {
+ Assert(prevkey->first_page + prevkey->npages ==
+ nextkey->first_page);
+ prevkey->npages = (nextkey->first_page - prevkey->first_page)
+ + nextkey->npages;
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ remove_next = true;
+ }
+
+ /* Put the span on the correct freelist and save size. */
+ FreePagePopSpanLeader(fpm, prevkey->first_page);
+ FreePagePushSpanLeader(fpm, prevkey->first_page, prevkey->npages);
+ result = prevkey->npages;
+
+ /*
+ * If we consolidated with both the preceding and following entries,
+ * we must remove the following entry. We do this last, because
+ * removing an element from the btree may invalidate pointers we hold
+ * into the current data structure.
+ *
+ * NB: The btree is technically in an invalid state at this point
+ * because we've already updated prevkey to cover the same key space
+ * as nextkey. FreePageBtreeRemove() shouldn't notice that, though.
+ */
+ if (remove_next)
+ FreePageBtreeRemove(fpm, np, nindex);
+
+ return result;
+ }
+
+ /* Consolidate with the next entry if possible. */
+ if (nextkey != NULL && first_page + npages >= nextkey->first_page)
+ {
+ Size newpages;
+
+ /* Compute new size for span. */
+ Assert(first_page + npages == nextkey->first_page);
+ newpages = (nextkey->first_page - first_page) + nextkey->npages;
+
+ /* Put span on correct free list. */
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ FreePagePushSpanLeader(fpm, first_page, newpages);
+
+ /* Update key in place. */
+ nextkey->first_page = first_page;
+ nextkey->npages = newpages;
+
+ /* If reducing first key on page, ancestors might need adjustment. */
+ if (nindex == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, np);
+
+ return nextkey->npages;
+ }
+
+ /* Split leaf page and as many of its ancestors as necessary. */
+ if (result.split_pages > 0)
+ {
+ /*
+ * NB: We could consider various coping strategies here to avoid a
+ * split; most obviously, if np != result.page, we could target that
+ * page instead. More complicated shuffling strategies could be
+ * possible as well; basically, unless every single leaf page is 100%
+ * full, we can jam this key in there if we try hard enough. It's
+ * unlikely that trying that hard is worthwhile, but it's possible we
+ * might need to make more than no effort. For now, we just do the
+ * easy thing, which is nothing.
+ */
+
+ /* If this is a soft insert, it's time to give up. */
+ if (soft)
+ return 0;
+
+ /*
+ * Past this point we might allocate btree pages, which could
+ * potentially shorten any existing run which might happen to be the
+ * current longest. So fpm->contiguous_pages needs to be recomputed.
+ */
+ fpm->contiguous_pages_dirty = true;
+
+ /* Check whether we need to allocate more btree pages to split. */
+ if (result.split_pages > fpm->btree_recycle_count)
+ {
+ Size pages_needed;
+ Size recycle_page;
+ Size i;
+
+ /*
+ * Allocate the required number of pages and split each one in
+ * turn. This should never fail, because if we've got enough
+ * spans of free pages kicking around that we need additional
+ * storage space just to remember them all, then we should
+ * certainly have enough to expand the btree, which should only
+ * ever use a tiny number of pages compared to the number under
+ * management. If it does, something's badly screwed up.
+ */
+ pages_needed = result.split_pages - fpm->btree_recycle_count;
+ for (i = 0; i < pages_needed; ++i)
+ {
+ if (!FreePageManagerGetInternal(fpm, 1, &recycle_page))
+ elog(FATAL, "free page manager btree is corrupt");
+ FreePageBtreeRecycle(fpm, recycle_page);
+ }
+
+ /*
+ * The act of allocating pages to recycle may have invalidated the
+ * results of our previous btree research, so repeat it. (We could
+ * recheck whether any of our split-avoidance strategies that were
+ * not viable before now are, but it hardly seems worthwhile, so
+ * we don't bother. Consolidation can't be possible now if it
+ * wasn't previously.)
+ */
+ FreePageBtreeSearch(fpm, first_page, &result);
+
+ /*
+ * The act of allocating pages for use in constructing our btree
+ * should never cause any page to become more full, so the new
+ * split depth should be no greater than the old one, and perhaps
+ * less if we fortuitously allocated a chunk that freed up a slot
+ * on the page we need to update.
+ */
+ Assert(result.split_pages <= fpm->btree_recycle_count);
+ }
+
+ /* If we still need to perform a split, do it. */
+ if (result.split_pages > 0)
+ {
+ FreePageBtree *split_target = result.page;
+ FreePageBtree *child = NULL;
+ Size key = first_page;
+
+ for (;;)
+ {
+ FreePageBtree *newsibling;
+ FreePageBtree *parent;
+
+ /* Identify parent page, which must receive downlink. */
+ parent = relptr_access(base, split_target->hdr.parent);
+
+ /* Split the page - downlink not added yet. */
+ newsibling = FreePageBtreeSplitPage(fpm, split_target);
+
+ /*
+ * At this point in the loop, we're always carrying a pending
+ * insertion. On the first pass, it's the actual key we're
+ * trying to insert; on subsequent passes, it's the downlink
+ * that needs to be added as a result of the split performed
+ * during the previous loop iteration. Since we've just split
+ * the page, there's definitely room on one of the two
+ * resulting pages.
+ */
+ if (child == NULL)
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into = key < newsibling->u.leaf_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchLeaf(insert_into, key);
+ FreePageBtreeInsertLeaf(insert_into, index, key, npages);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+ else
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into =
+ key < newsibling->u.internal_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchInternal(insert_into, key);
+ FreePageBtreeInsertInternal(base, insert_into, index,
+ key, child);
+ relptr_store(base, child->hdr.parent, insert_into);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+
+ /* If the page we just split has no parent, split the root. */
+ if (parent == NULL)
+ {
+ FreePageBtree *newroot;
+
+ newroot = FreePageBtreeGetRecycled(fpm);
+ newroot->hdr.magic = FREE_PAGE_INTERNAL_MAGIC;
+ newroot->hdr.nused = 2;
+ relptr_store(base, newroot->hdr.parent,
+ (FreePageBtree *) NULL);
+ newroot->u.internal_key[0].first_page =
+ FreePageBtreeFirstKey(split_target);
+ relptr_store(base, newroot->u.internal_key[0].child,
+ split_target);
+ relptr_store(base, split_target->hdr.parent, newroot);
+ newroot->u.internal_key[1].first_page =
+ FreePageBtreeFirstKey(newsibling);
+ relptr_store(base, newroot->u.internal_key[1].child,
+ newsibling);
+ relptr_store(base, newsibling->hdr.parent, newroot);
+ relptr_store(base, fpm->btree_root, newroot);
+ fpm->btree_depth++;
+
+ break;
+ }
+
+ /* If the parent page isn't full, insert the downlink. */
+ key = newsibling->u.internal_key[0].first_page;
+ if (parent->hdr.nused < FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Size index;
+
+ index = FreePageBtreeSearchInternal(parent, key);
+ FreePageBtreeInsertInternal(base, parent, index,
+ key, newsibling);
+ relptr_store(base, newsibling->hdr.parent, parent);
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+ break;
+ }
+
+ /* The parent also needs to be split, so loop around. */
+ child = newsibling;
+ split_target = parent;
+ }
+
+ /*
+ * The loop above did the insert, so just need to update the free
+ * list, and we're done.
+ */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+ }
+ }
+
+ /* Physically add the key to the page. */
+ Assert(result.page->hdr.nused < FPM_ITEMS_PER_LEAF_PAGE);
+ FreePageBtreeInsertLeaf(result.page, result.index, first_page, npages);
+
+ /* If new first key on page, ancestors might need adjustment. */
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put it on the free list. */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+}
+
+/*
+ * Remove a FreePageSpanLeader from the linked-list that contains it, either
+ * because we're changing the size of the span, or because we're allocating it.
+ */
+static void
+FreePagePopSpanLeader(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *span;
+ FreePageSpanLeader *next;
+ FreePageSpanLeader *prev;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+
+ next = relptr_access(base, span->next);
+ prev = relptr_access(base, span->prev);
+ if (next != NULL)
+ relptr_copy(next->prev, span->prev);
+ if (prev != NULL)
+ relptr_copy(prev->next, span->next);
+ else
+ {
+ Size f = Min(span->npages, FPM_NUM_FREELISTS) - 1;
+
+ Assert(fpm->freelist[f].relptr_off == pageno * FPM_PAGE_SIZE);
+ relptr_copy(fpm->freelist[f], span->next);
+ }
+}
+
+/*
+ * Initialize a new FreePageSpanLeader and put it on the appropriate free list.
+ */
+static void
+FreePagePushSpanLeader(FreePageManager *fpm, Size first_page, Size npages)
+{
+ char *base = fpm_segment_base(fpm);
+ Size f = Min(npages, FPM_NUM_FREELISTS) - 1;
+ FreePageSpanLeader *head = relptr_access(base, fpm->freelist[f]);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, first_page);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = npages;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->freelist[f], span);
+}
diff --git a/src/include/storage/dsa.h b/src/include/storage/dsa.h
new file mode 100644
index 0000000..1d18f16
--- /dev/null
+++ b/src/include/storage/dsa.h
@@ -0,0 +1,100 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.h
+ * Dynamic shared memory areas.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/storage/dsa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSA_H
+#define DSA_H
+
+#include "postgres.h"
+
+#include "port/atomics.h"
+#include "storage/dsm.h"
+
+/* The opaque type used for an area. */
+struct dsa_area;
+typedef struct dsa_area dsa_area;
+
+/*
+ * If this system doesn't support atomic operations on 64 bit values then
+ * we fall back to 32 bit dsa_pointer. For testing purposes,
+ * USE_SMALL_DSA_POINTER can be defined to force the use of 32 bit
+ * dsa_pointer even on systems that support 64 bit atomics.
+ */
+#ifndef PG_HAVE_ATOMIC_U64_SUPPORT
+#define SIZEOF_DSA_POINTER 4
+#else
+#ifdef USE_SMALL_DSA_POINTER
+#define SIZEOF_DSA_POINTER 4
+#else
+#define SIZEOF_DSA_POINTER 8
+#endif
+#endif
+
+/*
+ * The type of 'relative pointers' to memory allocated by a dynamic shared
+ * area. dsa_pointer values can be shared with other processes, but must be
+ * converted to backend-local pointers before they can be dereferenced. See
+ * dsa_get_address. Also, an atomic version and appropriately sized atomic
+ * operations.
+ */
+#if SIZEOF_DSA_POINTER == 4
+typedef uint32 dsa_pointer;
+typedef pg_atomic_uint32 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u32
+#define dsa_pointer_atomic_read pg_atomic_read_u32
+#define dsa_pointer_atomic_write pg_atomic_write_u32
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u32
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u32
+#else
+typedef uint64 dsa_pointer;
+typedef pg_atomic_uint64 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u64
+#define dsa_pointer_atomic_read pg_atomic_read_u64
+#define dsa_pointer_atomic_write pg_atomic_write_u64
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u64
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u64
+#endif
+
+/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
+#define InvalidDsaPointer ((dsa_pointer) 0)
+
+/* Check if a dsa_pointer value is valid. */
+#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
+
+/*
+ * The type used for dsa_area handles. dsa_handle values can be shared with
+ * other processes, so that they can attach to them. This provides a way to
+ * share allocated storage with other processes.
+ *
+ * The handle for a dsa_area is currently implemented as the dsm_handle
+ * for the first DSM segment backing this dynamic storage area, but client
+ * code shouldn't assume that is true.
+ */
+typedef dsm_handle dsa_handle;
+
+extern void dsa_startup(void);
+
+extern dsa_area *dsa_create_dynamic(int tranche_id, const char *tranche_name);
+extern dsa_area *dsa_attach_dynamic(dsa_handle handle);
+extern void dsa_pin_mapping(dsa_area *area);
+extern void dsa_detach(dsa_area *area);
+extern void dsa_pin(dsa_area *area);
+extern void dsa_unpin(dsa_area *area);
+extern void dsa_set_size_limit(dsa_area *area, Size limit);
+extern dsa_handle dsa_get_handle(dsa_area *area);
+extern dsa_pointer dsa_allocate(dsa_area *area, Size size);
+extern void dsa_free(dsa_area *area, dsa_pointer dp);
+extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern void dsa_trim(dsa_area *area);
+extern void dsa_dump(dsa_area *area);
+
+#endif /* DSA_H */
diff --git a/src/include/utils/freepage.h b/src/include/utils/freepage.h
new file mode 100644
index 0000000..e509ca2
--- /dev/null
+++ b/src/include/utils/freepage.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.h
+ * Management of page-organized free memory.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/freepage.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef FREEPAGE_H
+#define FREEPAGE_H
+
+#include "storage/lwlock.h"
+#include "utils/relptr.h"
+
+/* Forward declarations. */
+typedef struct FreePageSpanLeader FreePageSpanLeader;
+typedef struct FreePageBtree FreePageBtree;
+typedef struct FreePageManager FreePageManager;
+
+/*
+ * PostgreSQL normally uses 8kB pages for most things, but many common
+ * architecture/operating system pairings use a 4kB page size for memory
+ * allocation, so we do that here also. We assume that a large allocation
+ * is likely to begin on a page boundary; if not, we'll discard bytes from
+ * the beginning and end of the object and use only the middle portion that
+ * is properly aligned. This works, but is not ideal, so it's best to keep
+ * this conservatively small. There don't seem to be any common architectures
+ * where the page size is less than 4kB, so this should be good enough; also,
+ * making it smaller would increase the space consumed by the address space
+ * map, which also uses this page size.
+ */
+#define FPM_PAGE_SIZE 4096
+
+/*
+ * Each freelist except for the last contains only spans of one particular
+ * size. Everything larger goes on the last one. In some sense this seems
+ * like a waste since most allocations are in a few common sizes, but it
+ * means that small allocations can simply pop the head of the relevant list
+ * without needing to worry about whether the object we find there is of
+ * precisely the correct size (because we know it must be).
+ */
+#define FPM_NUM_FREELISTS 129
+
+/* Define relative pointer types. */
+relptr_declare(FreePageBtree, RelptrFreePageBtree);
+relptr_declare(FreePageManager, RelptrFreePageManager);
+relptr_declare(FreePageSpanLeader, RelptrFreePageSpanLeader);
+
+/* Everything we need in order to manage free pages (see freepage.c) */
+struct FreePageManager
+{
+ RelptrFreePageManager self;
+ RelptrFreePageBtree btree_root;
+ RelptrFreePageSpanLeader btree_recycle;
+ unsigned btree_depth;
+ unsigned btree_recycle_count;
+ Size singleton_first_page;
+ Size singleton_npages;
+ Size contiguous_pages;
+ bool contiguous_pages_dirty;
+ RelptrFreePageSpanLeader freelist[FPM_NUM_FREELISTS];
+#ifdef FPM_EXTRA_ASSERTS
+ /* For debugging only, pages put minus pages gotten. */
+ Size free_pages;
+#endif
+};
+
+/* Macros to convert between page numbers (expressed as Size) and pointers. */
+#define fpm_page_to_pointer(base, page) \
+ (AssertVariableIsOfTypeMacro(page, Size), \
+ (base) + FPM_PAGE_SIZE * (page))
+#define fpm_pointer_to_page(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) / FPM_PAGE_SIZE)
+
+/* Macro to convert an allocation size to a number of pages. */
+#define fpm_size_to_pages(sz) \
+ (((sz) + FPM_PAGE_SIZE - 1) / FPM_PAGE_SIZE)
+
+/* Macros to check alignment of absolute and relative pointers. */
+#define fpm_pointer_is_page_aligned(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) % FPM_PAGE_SIZE == 0)
+#define fpm_relptr_is_page_aligned(base, relptr) \
+ ((relptr).relptr_off % FPM_PAGE_SIZE == 0)
+
+/* Macro to find base address of the segment containing a FreePageManager. */
+#define fpm_segment_base(fpm) \
+ (((char *) fpm) - fpm->self.relptr_off)
+
+/* Macro to access a FreePageManager's largest consecutive run of pages. */
+#define fpm_largest(fpm) \
+ (fpm->contiguous_pages)
+
+/* Functions to manipulate the free page map. */
+extern void FreePageManagerInitialize(FreePageManager *fpm, char *base);
+extern bool FreePageManagerGet(FreePageManager *fpm, Size npages,
+ Size *first_page);
+extern void FreePageManagerPut(FreePageManager *fpm, Size first_page,
+ Size npages);
+extern char *FreePageManagerDump(FreePageManager *fpm);
+
+#endif /* FREEPAGE_H */
diff --git a/src/include/utils/relptr.h b/src/include/utils/relptr.h
new file mode 100644
index 0000000..a97dc96
--- /dev/null
+++ b/src/include/utils/relptr.h
@@ -0,0 +1,70 @@
+/*-------------------------------------------------------------------------
+ *
+ * relptr.h
+ * This file contains basic declarations for relative pointers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/relptr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef RELPTR_H
+#define RELPTR_H
+
+/*
+ * Relative pointers are intended to be used when storing an address that may
+ * be relative either to the base of the process's address space or some
+ * dynamic shared memory segment mapped therein.
+ *
+ * The idea here is that you declare a relative pointer as relptr(type)
+ * and then use relptr_access to dereference it and relptr_store to change
+ * it. The use of a union here is a hack, because what's stored in the
+ * relptr is always a Size, never an actual pointer. But including a pointer
+ * in the union allows us to use stupid macro tricks to provide some measure
+ * of type-safety.
+ */
+#define relptr(type) union { type *relptr_type; Size relptr_off; }
+
+#define relptr_declare(type, name) \
+ typedef union { type *relptr_type; Size relptr_off; } name;
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (__typeof__((rp).relptr_type)) ((rp).relptr_off == 0 ? NULL : \
+ (base + (rp).relptr_off)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (void *) ((rp).relptr_off == 0 ? NULL : (base + (rp).relptr_off)))
+#endif
+
+#define relptr_is_null(rp) \
+ ((rp).relptr_off == 0)
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ AssertVariableIsOfTypeMacro(val, __typeof__((rp).relptr_type)), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#endif
+
+#define relptr_copy(rp1, rp2) \
+ ((rp1).relptr_off = (rp2).relptr_off)
+
+#endif /* RELPTR_H */
On Thu, Nov 10, 2016 at 12:37 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Tue, Nov 1, 2016 at 5:06 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Wed, Oct 5, 2016 at 11:28 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
[dsa-v3.patch]
Here is a new version which just adds CLOBBER_FREED_MEMORY support to dsa_free.
Here is a new version that fixes a bug I discovered in freepage.c today.
Details: When dsa_free decides to give a whole superblock back
to the free page manager for a segment with FreePageManagerPut, and
there was already exactly one span of exactly one free page in that
segment, and the span being 'put' is not adjacent to that existing
free page, then the singleton format must be converted to a btree with
the existing page as root and the newly put span as the sole leaf.
But in that special case we forgot to add the newly put span to the
appropriate free list. Not only did we lose track of it, but a future
call to FreePageManagerPut might try to merge it with another adjacent
span, which will try to manipulate the freelist that it expects it to
be in and blow up. The fix is just to add a call to
FreePagePushSpanLeader in this corner case before the early return.
Since a lot of the design of this patch is mine - from my earlier work
on sb_alloc - I don't expect to have a ton of objections to it. And
I'd like to get it committed, because other parallelism work depends
on it (bitmap heap scan and parallel hash join in particular), and
because it's got other uses as well. However, I don't want to be
perceived as slamming my code (or that of my colleagues) into the tree
without due opportunity for other people to provide feedback, so if
anyone has questions, comments, concerns, or review to offer, please
do.
I think we should develop versions of this that (1) allocate from the
main shared memory segment and (2) allocate from backend-private
memory. Per my previous benchmarking results, allocating from
backend-private memory would be a substantial win for tuplesort.c
because this allocator is substantially more memory-efficient for
large memory contexts than aset.c, and Tomas Vondra tested it out and
found that it is also faster for logical decoding than the approach he
proposed. Perhaps that's not an argument for holding up his proposed
patches for that problem, but I think it IS a good argument for
pressing forward with a backend-private version of this allocator.
I'm not saying that should be part of the initial commit of this code,
but I think it's a good direction to pursue.
One question that we need to resolve is where the code should live in
the source tree. When I wrote the original patches upon which this
work was based, I think that I put all the code in
src/backend/utils/mmgr, since it's all memory-management code. In
this patch, Thomas left the free page manager code there, but put the
allocator itself in src/backend/storage/ipc. There's a certain logic
to that because dynamic shared areas (dsa.c) sit right next to dynamic
shared memory (dsm.c) but it feels a bit schizophrenic to have half of
the code in storage/ipc and the other half in utils/mmgr. I guess my
view is that utils/mmgr is a better fit, because I think that this is
basically memory management code that happens to use shared memory,
rather than basically IPC that happens to be an allocator. If we
decide that this stuff goes in storage/ipc then that's basically
saying that everything that uses dynamic shared memory is going to end
up in that directory, which seems like a mess. The fact that the
free-page manager, at least, could be used for allocations not based
on DSM strengthens that argument in my view. Other opinions?
The #ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P hack in relptr.h, for
which I believe I'm responsible, is ugly. There is probably a
compiler out there that has __typeof__ but not
__builtin_types_compatible_p, and we could cater to that by adding a
separate configure test for __typeof__. A little browsing of the
documentation at https://gcc.gnu.org/onlinedocs/ seems to suggest
that __builtin_types_compatible_p didn't exist before GCC 3.1, but
__typeof__ seems to be there even in 2.95.3. That's not very
interesting, honestly, because 3.1 came out in 2002, but there might
be non-GCC compilers that implement __typeof__ but not
__builtin_types_compatible_p. I am inclined not to worry about this
unless somebody feels otherwise, but it's not beautiful.
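To make the trade-off concrete, here is a minimal standalone sketch of what separate feature tests could look like. It is not the patch's relptr.h: the sketch_* names and the two SKETCH_HAVE_* macros are hypothetical, and the macros are guessed from the compiler identity here rather than set by real configure tests. With __typeof__ alone, dereferencing still yields a typed pointer; the compile-time check on store needs both builtins and silently degrades to nothing otherwise.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical feature macros standing in for two separate configure tests.
 * Here they are simply guessed from the compiler identity.
 */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_HAVE_TYPEOF 1
#define SKETCH_HAVE_TYPES_COMPATIBLE_P 1
#endif

/* The stored value is always an offset; the pointer member only carries the type. */
#define sketch_relptr(type) union { type *relptr_type; size_t relptr_off; }

/* Dereference: typed result if __typeof__ exists, plain void * otherwise. */
#ifdef SKETCH_HAVE_TYPEOF
#define sketch_relptr_access(base, rp) \
	((__typeof__((rp).relptr_type)) \
	 ((rp).relptr_off == 0 ? NULL : (void *) ((base) + (rp).relptr_off)))
#else
#define sketch_relptr_access(base, rp) \
	((void *) ((rp).relptr_off == 0 ? NULL : (void *) ((base) + (rp).relptr_off)))
#endif

/* Compile-time type check on store: needs both builtins, else it is a no-op. */
#if defined(SKETCH_HAVE_TYPEOF) && defined(SKETCH_HAVE_TYPES_COMPATIBLE_P)
#define sketch_type_check(val, rp) \
	((void) sizeof(char[__builtin_types_compatible_p(__typeof__(val), \
		__typeof__((rp).relptr_type)) ? 1 : -1]))
#else
#define sketch_type_check(val, rp) ((void) 0)
#endif

/* Store: offset 0 is reserved to mean NULL, as in the patch. */
#define sketch_relptr_store(base, rp, val) \
	(sketch_type_check(val, rp), \
	 (rp).relptr_off = ((val) == NULL ? 0 : (size_t) ((char *) (val) - (base))))

typedef struct SketchNode
{
	int			value;
	sketch_relptr(struct SketchNode) next;
} SketchNode;

/*
 * Build a two-node list in one buffer, copy the bytes to a second buffer
 * (standing in for the same segment mapped at another address), and walk
 * the copy.  Returns the sum of the values, or -1 if malloc fails.
 */
static int
sketch_relptr_demo(void)
{
	char	   *a = malloc(256);
	char	   *b = malloc(256);
	SketchNode *n1;
	SketchNode *n2;
	SketchNode *n;
	int			sum = 0;

	if (a == NULL || b == NULL)
	{
		free(a);
		free(b);
		return -1;
	}
	memset(a, 0, 256);

	/* Place nodes at non-zero offsets, because offset 0 means NULL. */
	n1 = (SketchNode *) (a + 64);
	n2 = (SketchNode *) (a + 128);
	n1->value = 40;
	n2->value = 2;
	sketch_relptr_store(a, n1->next, n2);
	sketch_relptr_store(a, n2->next, (SketchNode *) NULL);

	/* The links survive the move because they are offsets, not addresses. */
	memcpy(b, a, 256);
	for (n = (SketchNode *) (b + 64); n != NULL;
		 n = sketch_relptr_access(b, n->next))
		sum += n->value;

	free(a);
	free(b);
	return sum;
}
```

Passing a pointer of the wrong type to sketch_relptr_store fails to compile only when both builtins are available, which is the same coverage the current single #ifdef gives; the only thing the split buys is typed dereferencing on a compiler that has __typeof__ alone.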
I wonder if it wouldn't be a good idea to allow the dsa_area_control
to be stored wherever instead of insisting that it's got to be located
inside the first DSM segment backing the pool itself. For example,
you could make dsa_create_dynamic() take a third argument which is a
pointer to enough space for a dsa_area_control, and it would
initialize it in place. Then you could make dsa_attach_dynamic() take
a pointer to that same structure instead of taking a dsa_handle.
Actually, I think dsa_handle goes away: it becomes the caller's
responsibility to figure out the correct pointer address to pass in
the second process. The advantage of this design change is that you
could stuff the dsa_area_control into the existing parallel DSM and
only create additional DSMs if anything is actually allocated. What
would be even cooler is to allow a little bit of space inside the
parallel DSM that gets used first, and then, when that overflows, we
start creating new DSMs. Say, 64kB. Am I sounding greedy yet? It
just seems like a good idea not to needlessly multiply the number of
DSMs.
+ /* Unlink span. */
+ /* TODO: Does it even need to be linked in in the first place? */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ unlink_span(area, span);
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
In answer to the TODO, I think this isn't strictly necessary, but it
seems like a good idea to do it anyway for debuggability. If we
didn't do this, the space occupied by a large object wouldn't be
"known" in any way other than by having disappeared from the free page
map, whereas this way it's linked into the DSA's list of allocated
chunks like anything else, so for example dsa_dump() can print it. I
recommend removing this TODO.
+ /*
+ * TODO: We could take Max(fpm->contiguous_pages, result of
+ * FreePageBtreeCleanup) and give it to FreePageManagerUpdatLargest as a
+ * starting point for its search, potentially avoiding a bunch of work,
+ * since there is no way the largest contiguous run is bigger than that.
+ */
Typo: Updat.
+ /*
+ * TODO: Figure out how to avoid setting this every time. It may not be as
+ * simple as it looks.
+ */
Something isn't right with this function, because it takes the trouble
to calculate a value for contiguous_pages that it then doesn't use for
anything. I think the original idea here was that if we calculated a
value for contiguous_pages that was less than fpm->contiguous_pages,
there was no need to dirty it. If we can't get away with that for
some reason, then there's no point in calculating the value in the
first place.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 16, 2016 at 2:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
... my
view is that utils/mmgr is a better fit, ...
OK, changed.
The #ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P hack in relptr.h, for
which I believe I'm responsible, is ugly. There is probably a
compiler out there that has __typeof__ but not
__builtin_types_compatible_p, and we could cater to that by adding a
separate configure test for __typeof__. A little browsing of the
documentation at https://gcc.gnu.org/onlinedocs/ seems to suggest
that __builtin_types_compatible_p didn't exist before GCC 3.1, but
__typeof__ seems to be there even in 2.95.3. That's not very
interesting, honestly, because 3.1 came out in 2002, but there might
be non-GCC compilers that implement __typeof__ but not
__builtin_types_compatible_p. I am inclined not to worry about this
unless somebody feels otherwise, but it's not beautiful.
+1
I wonder if it wouldn't be a good idea to allow the dsa_area_control
to be stored wherever instead of insisting that it's got to be located
inside the first DSM segment backing the pool itself. For example,
you could make dsa_create_dynamic() take a third argument which is a
pointer to enough space for a dsa_area_control, and it would
initialize it in place. Then you could make dsa_attach_dynamic() take
a pointer to that same structure instead of taking a dsa_handle.
Actually, I think dsa_handle goes away: it becomes the caller's
responsibility to figure out the correct pointer address to pass in
the second process. The advantage of this design change is that you
could stuff the dsa_area_control into the existing parallel DSM and
only create additional DSMs if anything is actually allocated. What
would be even cooler is to allow a little bit of space inside the
parallel DSM that gets used first, and then, when that overflows, we
start creating new DSMs. Say, 64kB. Am I sounding greedy yet? It
just seems like a good idea not to needlessly multiply the number of
DSMs.
Alternatively we could stop using DSM directly for parallel query and
just use a DSA area for all the shmem needs of a parallel query
execution as I mentioned elsewhere[1]. That would involve changing a
bunch of stuff including the FDW interface, so that's probably a bad
idea at this point. So I tried this in-place idea out today. See the
attached version which provides:
dsa_area *dsa_create(...);
dsa_area *dsa_attach(dsa_handle handle);
Those replace the functions that previously had _dynamic in the name.
Then I have new variants:
dsa_area *dsa_create_in_place(void *place, size_t size, ...);
dsa_area *dsa_attach_in_place(void *place);
Those let you create an area in existing memory (in a DSM segment or
traditional inherited shmem). The in-place versions will still create
DSM segments on demand as required, though I suppose if you wanted to
prevent that you could with dsa_set_size_limit(area, size). One
complication is that of course the automatic detach feature doesn't
work if you're in some random piece of memory. I have exposed
dsa_on_dsm_detach, so that there is a way to hook it up to the detach
hook for a pre-existing DSM segment, but that's the caller's
responsibility. This is important because although the first 'segment'
is created in place, if other segments have been created we still have
to manage those; it gets tricky if you are the last attached process
for the area, but do not have a particular segment mapped in currently
because you've never accessed it; that works with a regular dsa_create
area, because everyone has the control segment mapped in so we use
that one's dsm_on_detach hook and from there we can do the cleanup we
need to do, but in this new case there is no such thing. You can see
an example of manual detach hook installation in
dsa-area-for-executor-v2.patch which I'll now go and post over in that
other thread.
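As a rough illustration of the in-place idea only, here is a toy sketch in which the control data lives at the start of caller-provided memory and a second participant attaches by pointer rather than by handle. Everything named toy_* is hypothetical and is not the interface in the attached patch. The toy is a bump allocator that simply fails when the place is full, which is roughly the behaviour you would get from an in-place area whose size limit equals the size of the place; the real code would create a DSM segment at that point instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAGIC 0x44534131u
#define TOY_ALIGN(n) (((n) + 7) & ~(size_t) 7)

/* Hypothetical control data stored at the start of the caller's memory. */
typedef struct toy_area_control
{
	uint32_t	magic;			/* lets an attacher sanity-check the place */
	size_t		total_size;		/* bytes in the place, header included */
	size_t		used;			/* bytes consumed so far, header included */
} toy_area_control;

/* Initialize the control data in place; NULL if the place is too small. */
static toy_area_control *
toy_create_in_place(void *place, size_t size)
{
	toy_area_control *control = place;

	if (size < TOY_ALIGN(sizeof(toy_area_control)))
		return NULL;
	control->magic = TOY_MAGIC;
	control->total_size = size;
	control->used = TOY_ALIGN(sizeof(toy_area_control));
	return control;
}

/* Attach by address: the caller must know where the place is mapped. */
static toy_area_control *
toy_attach_in_place(void *place)
{
	toy_area_control *control = place;

	return control->magic == TOY_MAGIC ? control : NULL;
}

/* Returns an offset from the start of the place, or 0 on failure. */
static size_t
toy_allocate(toy_area_control *control, size_t size)
{
	size_t		need;
	size_t		offset;

	if (size > control->total_size)
		return 0;
	need = TOY_ALIGN(size);
	if (need > control->total_size - control->used)
		return 0;				/* a real area would add a segment here */
	offset = control->used;
	control->used += need;
	return offset;
}

/* Convert an offset into a local address for this mapping of the place. */
static void *
toy_get_address(void *place, size_t offset)
{
	return offset == 0 ? NULL : (char *) place + offset;
}

/* Returns 1 if the toy behaves as described above, 0 otherwise. */
static int
toy_in_place_demo(void)
{
	void	   *place = malloc(256);
	toy_area_control *creator;
	toy_area_control *attacher;
	size_t		p;
	char	   *mem;
	int			ok = 0;

	if (place == NULL)
		return 0;
	creator = toy_create_in_place(place, 256);
	attacher = creator != NULL ? toy_attach_in_place(place) : NULL;
	if (attacher != NULL)
	{
		p = toy_allocate(creator, 100);
		mem = toy_get_address(place, p);
		if (mem != NULL)
		{
			memcpy(mem, "hello", 6);
			/* The attacher sees the same bytes and the same bookkeeping. */
			ok = strcmp(toy_get_address(place, p), "hello") == 0 &&
				toy_allocate(attacher, 200) == 0;
		}
	}
	free(place);
	return ok;
}
```

The toy deliberately has no detach logic, which is exactly the part described above as the caller's responsibility when the control data does not live in a segment of its own.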
+ /* Unlink span. */
+ /* TODO: Does it even need to be linked in in the first place? */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ unlink_span(area, span);
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
In answer to the TODO, I think this isn't strictly necessary, but it
seems like a good idea to do it anyway for debuggability. If we
didn't do this, the space occupied by a large object wouldn't be
"known" in any way other than by having disappeared from the free page
map, whereas this way it's linked into the DSA's list of allocated
chunks like anything else, so for example dsa_dump() can print it. I
recommend removing this TODO.
Removed.
+ /*
+ * TODO: We could take Max(fpm->contiguous_pages, result of
+ * FreePageBtreeCleanup) and give it to FreePageManagerUpdatLargest as a
+ * starting point for its search, potentially avoiding a bunch of work,
+ * since there is no way the largest contiguous run is bigger than that.
+ */
Typo: Updat.
Fixed.
+ /*
+ * TODO: Figure out how to avoid setting this every time. It may not be as
+ * simple as it looks.
+ */
Something isn't right with this function, because it takes the trouble
to calculate a value for contiguous_pages that it then doesn't use for
anything. I think the original idea here was that if we calculated a
value for contiguous_pages that was less than fpm->contiguous_pages,
there was no need to dirty it. If we can't get away with that for
some reason, then there's no point in calculating the value in the
first place.
Yeah. Will come back on this point.
The attached patch is just for discussion only... I need to resolve
that contiguous_pages question and do some more testing.
[1]: /messages/by-id/CAEepm=0HmRefi1+xDJ99Gj5APHr8Qr05KZtAxrMj8b+ay3o6sA@mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
dsa-v6.patch (application/octet-stream)
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index b2403e1..1842bae 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/mmgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = aset.o mcxt.o portalmem.o
+OBJS = aset.o dsa.o freepage.o mcxt.o portalmem.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
new file mode 100644
index 0000000..639eb80
--- /dev/null
+++ b/src/backend/utils/mmgr/dsa.c
@@ -0,0 +1,2066 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.c
+ * Dynamic shared memory areas.
+ *
+ * This module provides dynamic shared memory areas which are built on top of
+ * DSM segments. While dsm.c allows segments of shared memory to be
+ * created and shared between backends, it isn't designed to deal with small
+ * objects. A DSA area is a shared memory heap backed by one or more DSM
+ * segments, with memory allocated by dsa_allocate() and freed by dsa_free().
+ * Unlike the regular system heap, it deals in pseudo-pointers which must be
+ * converted to backend-local pointers before they are dereferenced. These
+ * pseudo-pointers can however be shared with other backends, and can be used
+ * to construct shared data structures.
+ *
+ * Each DSA area manages one or more DSM segments, adding new segments as
+ * required and detaching them when they are no longer needed. Each segment
+ * contains a number of 4KB pages, a free page manager for tracking
+ * consecutive runs of free pages, and a page map for tracking the source of
+ * objects allocated on each page. Allocation requests above 8KB are handled
+ * by choosing a segment and finding consecutive free pages in its free page
+ * manager. Allocation requests for smaller sizes are handled using pools of
+ * objects of a selection of sizes. Each pool consists of a number of 16 page
+ * (64KB) superblocks allocated in the same way as large objects. Allocation
+ * of large objects and new superblocks is serialized by a single LWLock, but
+ * allocation of small objects from pre-existing superblocks uses one LWLock
+ * per pool. Currently there is one pool, and therefore one lock, per size
+ * class. Per-core pools to increase concurrency and strategies for reducing
+ * the resulting fragmentation are areas for future research. Each superblock
+ * is managed with a 'span', which tracks the superblock's freelist. Free
+ * requests are handled by looking in the page map to find which span an
+ * address was allocated from, so that small objects can be returned to the
+ * appropriate free list, and large object pages can be returned directly to
+ * the free page manager. When allocating, simple heuristics for selecting
+ * segments and superblocks try to encourage occupied memory to be
+ * concentrated, increasing the likelihood that whole superblocks can become
+ * empty and be returned to the free page manager, and whole segments can
+ * become empty and be returned to the operating system.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/dsa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "utils/dsa.h"
+#include "utils/freepage.h"
+#include "utils/memutils.h"
+
+/*
+ * The minimum size of the space that can be provided to dsa_create_in_place,
+ * when using user-supplied memory.
+ */
+#define DSA_MINIMUM_IN_PLACE_SIZE ((Size) (64 * 1024))
+
+/*
+ * The size of the initial DSM segment that backs a dsa_area created by
+ * dsa_create. After creating some number of segments of this size we'll
+ * double this size, and so on. Larger segments may be created if necessary
+ * to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE ((Size) (1 * 1024 * 1024))
+
+/*
+ * How many segments to create before we double the segment size. If this is
+ * low, then there is likely to be a lot of wasted space in the largest
+ * segment. If it is high, then we risk running out of segment slots (see
+ * dsm.c's limits on total number of segments), or limiting the total size
+ * an area can manage when using small pointers.
+ */
+#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 4
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
+#endif
+
+/*
+ * The maximum number of DSM segments that an area can own, determined by
+ * the number of bits remaining (but capped at 1024).
+ */
+#define DSA_MAX_SEGMENTS \
+ Min(1024, (1 << ((SIZEOF_DSA_POINTER * 8) - DSA_OFFSET_WIDTH)))
+
+/* The bitmask for extracting the offset from a dsa_pointer. */
+#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
+/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
+#define DSA_PAGES_PER_SUPERBLOCK 16
+
+/*
+ * A magic number used as a sanity check for following DSM segments belonging
+ * to a DSA area (this number will be XORed with the area handle and
+ * the segment index).
+ */
+#define DSA_SEGMENT_HEADER_MAGIC 0x0ce26608
+
+/* Build a dsa_pointer given a segment number and offset. */
+#define DSA_MAKE_POINTER(segment_number, offset) \
+ (((dsa_pointer) (segment_number) << DSA_OFFSET_WIDTH) | (offset))
+
+/* Extract the segment number from a dsa_pointer. */
+#define DSA_EXTRACT_SEGMENT_NUMBER(dp) ((dp) >> DSA_OFFSET_WIDTH)
+
+/* Extract the offset from a dsa_pointer. */
+#define DSA_EXTRACT_OFFSET(dp) ((dp) & DSA_OFFSET_BITMASK)
+
+/* The type used for segment indexes (zero based). */
+typedef Size dsa_segment_index;
+
+/* Sentinel value for dsa_segment_index indicating 'none' or 'end'. */
+#define DSA_SEGMENT_INDEX_NONE (~(dsa_segment_index)0)
+
+/*
+ * How many bins of segments do we have? The bins are used to categorize
+ * segments by their largest contiguous run of free pages.
+ */
+#define DSA_NUM_SEGMENT_BINS 16
+
+/*
+ * What is the lowest bin that holds segments that *might* have n contiguous
+ * free pages? There is no point in looking in segments in lower bins; they
+ * definitely can't service a request for n free pages.
+ */
+#define contiguous_pages_to_segment_bin(n) Min(fls(n), DSA_NUM_SEGMENT_BINS - 1)
+
+/* Macros for access to locks. */
+#define DSA_AREA_LOCK(area) (&area->control->lock)
+#define DSA_SCLASS_LOCK(area, sclass) (&area->control->pools[sclass].lock)
+
+/*
+ * The header for an individual segment. This lives at the start of each DSM
+ * segment owned by a DSA area including the first segment (where it appears
+ * as part of the dsa_area_control struct).
+ */
+typedef struct
+{
+ /* Sanity check magic value. */
+ uint32 magic;
+ /* Total number of pages in this segment (excluding metadata area). */
+ Size usable_pages;
+ /* Total size of this segment in bytes. */
+ Size size;
+
+ /*
+ * Index of the segment that precedes this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the first one.
+ */
+ dsa_segment_index prev;
+
+ /*
+ * Index of the segment that follows this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the last one.
+ */
+ dsa_segment_index next;
+ /* The index of the bin that contains this segment. */
+ Size bin;
+
+ /*
+ * A flag raised to indicate that this segment is being returned to the
+ * operating system and has been unpinned.
+ */
+ bool freed;
+} dsa_segment_header;
+
+/*
+ * Metadata for one superblock.
+ *
+ * For most blocks, span objects are stored out-of-line; that is, the span
+ * object is not stored within the block itself. But, as an exception, for a
+ * "span of spans", the span object is stored "inline". The allocation is
+ * always exactly one page, and the dsa_area_span object is located at
+ * the beginning of that page. The size class is DSA_SCLASS_BLOCK_OF_SPANS,
+ * and the remaining fields are used just as they would be in an ordinary
+ * block. We can't allocate spans out of ordinary superblocks because
+ * creating an ordinary superblock requires us to be able to allocate a span
+ * *first*. Doing it this way avoids that circularity.
+ */
+typedef struct
+{
+ dsa_pointer pool; /* Containing pool. */
+ dsa_pointer prevspan; /* Previous span. */
+ dsa_pointer nextspan; /* Next span. */
+ dsa_pointer start; /* Starting address. */
+ Size npages; /* Length of span in pages. */
+ uint16 size_class; /* Size class. */
+ uint16 ninitialized; /* Maximum number of objects ever allocated. */
+ uint16 nallocatable; /* Number of objects currently allocatable. */
+ uint16 firstfree; /* First object on free list. */
+ uint16 nmax; /* Maximum number of objects ever possible. */
+ uint16 fclass; /* Current fullness class. */
+} dsa_area_span;
+
+/*
+ * Given a pointer to an object in a span, access the index of the next free
+ * object in the same span (ie in the span's freelist) as an L-value.
+ */
+#define NextFreeObjectIndex(object) (* (uint16 *) (object))
+
+/*
+ * Small allocations are handled by dividing a single block of memory into
+ * many small objects of equal size. The possible allocation sizes are
+ * defined by the following array. Larger size classes are spaced more widely
+ * than smaller size classes. We fudge the spacing for size classes >1kB to
+ * avoid space wastage: based on the knowledge that we plan to allocate 64kB
+ * blocks, we bump the maximum object size up to the largest multiple of
+ * 8 bytes that still lets us fit the same number of objects into one block.
+ *
+ * NB: Because of this fudging, if we were ever to use differently-sized blocks
+ * for small allocations, these size classes would need to be reworked to be
+ * optimal for the new size.
+ *
+ * NB: The optimal spacing for size classes, as well as the size of the blocks
+ * out of which small objects are allocated, is not a question that has one
+ * right answer. Some allocators (such as tcmalloc) use more closely-spaced
+ * size classes than we do here, while others (like aset.c) use more
+ * widely-spaced classes. Spacing the classes more closely avoids wasting
+ * memory within individual chunks, but also means a larger number of
+ * potentially-unfilled blocks.
+ */
+static const uint16 dsa_size_classes[] = {
+ sizeof(dsa_area_span), 0, /* special size classes */
+ 8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
+ 80, 96, 112, 128, /* 4 classes separated by 16 bytes */
+ 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
+ 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
+ 640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
+ 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
+ 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
+ 5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
+};
+#define DSA_NUM_SIZE_CLASSES lengthof(dsa_size_classes)
+
+/* Special size classes. */
+#define DSA_SCLASS_BLOCK_OF_SPANS 0
+#define DSA_SCLASS_SPAN_LARGE 1
+
+/*
+ * The following lookup table is used to map the size of small objects
+ * (less than 1kB) onto the corresponding size class. To use this table,
+ * round the size of the object up to the next multiple of 8 bytes, divide
+ * by 8 and subtract 1 to get the index into this array.
+ */
+static char dsa_size_class_map[] = {
+ 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 11, 12, 12, 13, 13,
+ 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17,
+ 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19,
+ 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21,
+ 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
+ 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23,
+ 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+ 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25
+};
+#define DSA_SIZE_CLASS_MAP_QUANTUM 8
+
+/*
+ * Superblocks are binned by how full they are. Generally, each fullness
+ * class corresponds to one quartile, but the block being used for
+ * allocations is always at the head of the list for fullness class 1,
+ * regardless of how full it really is.
+ *
+ * For large objects, we just stick all of the allocations in fullness class
+ * 0. Since we can just return the space directly to the free page manager,
+ * we don't really need them on a list at all, except that if someone wants
+ * to bulk release everything allocated using this dsa_area, we
+ * have no other way of finding them.
+ */
+#define DSA_FULLNESS_CLASSES 4
+
+/*
+ * Maximum length of a DSA name.
+ */
+#define DSA_MAXLEN 64
+
+/*
+ * A dsa_area_pool represents a set of objects of a given size class.
+ *
+ * Perhaps there should be multiple pools for the same size class for
+ * contention avoidance, but for now there is just one!
+ */
+typedef struct
+{
+ /* A lock protecting access to this pool. */
+ LWLock lock;
+ /* A set of linked lists of spans, arranged by fullness. */
+ dsa_pointer spans[DSA_FULLNESS_CLASSES];
+ /* Should we pad this out to a cacheline boundary? */
+} dsa_area_pool;
+
+/*
+ * The control block for an area. This lives in shared memory, at the start of
+ * the first DSM segment controlled by this area.
+ */
+typedef struct
+{
+ /* The segment header for the first segment. */
+ dsa_segment_header segment_header;
+ /* The handle for this area. */
+ dsa_handle handle;
+ /* The handles of the segments owned by this area. */
+ dsm_handle segment_handles[DSA_MAX_SEGMENTS];
+ /* Lists of segments, binned by maximum contiguous run of free pages. */
+ dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
+ /* The object pools for each size class. */
+ dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* The total size of all active segments. */
+ Size total_segment_size;
+ /* The maximum total size of backing storage we are allowed. */
+ Size max_total_segment_size;
+ /* The reference count for this area. */
+ int refcnt;
+ /* A flag indicating that this area has been pinned. */
+ bool pinned;
+ /* The number of times that segments have been freed. */
+ Size freed_segment_counter;
+ /* The LWLock tranche ID. */
+ int lwlock_tranche_id;
+ char lwlock_tranche_name[DSA_MAXLEN];
+ /* The general lock (protects everything except object pools). */
+ LWLock lock;
+} dsa_area_control;
+
+/* Given a pointer to a pool, find a dsa_pointer. */
+#define DsaAreaPoolToDsaPointer(area, p) \
+ DSA_MAKE_POINTER(0, (char *) p - (char *) area->control)
+
+/*
+ * A dsa_segment_map is stored within the backend-private memory of each
+ * individual backend. It holds the base address of the segment within that
+ * backend, plus the addresses of key objects within the segment. Those
+ * could instead be derived from the base address but it's handy to have them
+ * around.
+ */
+typedef struct
+{
+ dsm_segment *segment; /* DSM segment */
+ char *mapped_address; /* Address at which segment is mapped */
+ dsa_segment_header *header; /* Header (same as mapped_address) */
+ FreePageManager *fpm; /* Free page manager within segment. */
+ dsa_pointer *pagemap; /* Page map within segment. */
+} dsa_segment_map;
+
+/*
+ * Per-backend state for a storage area. Backends obtain one of these by
+ * creating an area or attaching to an existing one using a handle. Each
+ * process that needs to use an area uses its own object to track where the
+ * segments are mapped.
+ */
+struct dsa_area
+{
+ /* Pointer to the control object in shared memory. */
+ dsa_area_control *control;
+
+ /* The lock tranche for this process. */
+ LWLockTranche lwlock_tranche;
+
+ /* Has the mapping been pinned? */
+ bool mapping_pinned;
+
+ /*
+ * This backend's array of segment maps, ordered by segment index
+ * corresponding to control->segment_handles. Some of the area's segments
+ * may not be mapped in this backend yet, and some slots may have been
+ * freed and need to be detached; these operations happen on demand.
+ */
+ dsa_segment_map segment_maps[DSA_MAX_SEGMENTS];
+
+ /* The last observed freed_segment_counter. */
+ Size freed_segment_counter;
+};
+
+#define DSA_SPAN_NOTHING_FREE ((uint16) -1)
+#define DSA_SUPERBLOCK_SIZE (DSA_PAGES_PER_SUPERBLOCK * FPM_PAGE_SIZE)
+
+/* Given a pointer to a segment_map, obtain a segment index number. */
+#define get_segment_index(area, segment_map_ptr) \
+ (segment_map_ptr - &area->segment_maps[0])
+
+static void init_span(dsa_area *area, dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class);
+static bool transfer_first_span(dsa_area *area, dsa_area_pool *pool,
+ int fromclass, int toclass);
+static inline dsa_pointer alloc_object(dsa_area *area, int size_class);
+static bool ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class);
+static dsa_segment_map *get_segment_by_index(dsa_area *area,
+ dsa_segment_index index);
+static void destroy_superblock(dsa_area *area, dsa_pointer span_pointer);
+static void unlink_span(dsa_area *area, dsa_area_span *span);
+static void add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer, int fclass);
+static void unlink_segment(dsa_area *area, dsa_segment_map *segment_map);
+static dsa_segment_map *get_best_segment(dsa_area *area, Size npages);
+static dsa_segment_map *make_new_segment(dsa_area *area, Size requested_pages);
+static dsa_area *create_internal(void *place, size_t size,
+ int tranche_id, const char *tranche_name,
+ dsm_handle control_handle,
+ dsm_segment *control_segment);
+static dsa_area *attach_internal(void *place, dsm_segment *segment,
+ dsa_handle handle);
+
+
+/*
+ * Create a new shared area in a new DSM segment. Further DSM segments will
+ * be allocated as required to extend the available space.
+ *
+ * We can't allocate a LWLock tranche_id within this function, because tranche
+ * IDs are a scarce resource; there are only 64k available, using low numbers
+ * when possible matters, and we have no provision for recycling them. So,
+ * we require the caller to provide one. The caller must also provide the
+ * tranche name, so that we can distinguish LWLocks belonging to different
+ * DSAs.
+ */
+dsa_area *
+dsa_create(int tranche_id, const char *tranche_name)
+{
+ dsm_segment *segment;
+ dsa_area *area;
+
+ /*
+ * Create the DSM segment that will hold the shared control object and the
+ * first segment of usable space.
+ */
+ segment = dsm_create(DSA_INITIAL_SEGMENT_SIZE, 0);
+
+ /*
+ * All segments backing this area are pinned, so that DSA can explicitly
+ * control their lifetime (otherwise a newly created segment belonging to
+ * this area might be freed when the only backend that happens to have it
+ * mapped in ends, corrupting the area).
+ */
+ dsm_pin_segment(segment);
+
+ /* Create a new DSA area with the control object in this segment. */
+ area = create_internal(dsm_segment_address(segment),
+ DSA_INITIAL_SEGMENT_SIZE,
+ tranche_id, tranche_name,
+ dsm_segment_handle(segment), segment);
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_detach, PointerGetDatum(NULL));
+
+ return area;
+}
+
+/*
+ * Create a new shared area in an existing shared memory space, which may be
+ * either DSM or Postmaster-initialized memory. DSM segments will be
+ * allocated as required to extend the available space, though that can be
+ * prevented with dsa_set_size_limit(area, size) using the same size provided
+ * to dsa_create_in_place.
+ *
+ * If the place is inside a DSM segment, then the DSA area can be automatically
+ * detached when the DSM segment is detached by registering a callback like
+ * so:
+ *
+ * on_dsm_detach(<segment>, dsa_on_dsm_detach, PointerGetDatum(place));
+ *
+ * Failure to detach, either explicitly with dsa_detach or via the above
+ * callback, could result in extra DSM segments associated with this area
+ * being leaked because they remain pinned.
+ *
+ * See dsa_create() for a note about the other arguments.
+ */
+dsa_area *
+dsa_create_in_place(void *place, size_t size,
+ int tranche_id, const char *tranche_name)
+{
+ return create_internal(place, size, tranche_id, tranche_name,
+ DSM_HANDLE_INVALID, NULL);
+}
+
+/*
+ * Obtain a handle that can be passed to other processes so that they can
+ * attach to the given area. Cannot be called for areas created with
+ * dsa_create_in_place.
+ */
+dsa_handle
+dsa_get_handle(dsa_area *area)
+{
+ Assert(area->control->handle != DSM_HANDLE_INVALID);
+ return area->control->handle;
+}
+
+/*
+ * Attach to an area that was created with dsa_create_in_place. The caller
+ * must somehow know the address that was used when the area was created,
+ * though it may be mapped at a different virtual address in this process.
+ *
+ * See dsa_create_in_place for a note about registering for automatic detach if
+ * this area is in a DSM segment.
+ *
+ */
+dsa_area *
+dsa_attach_in_place(void *place)
+{
+ return attach_internal(place, NULL, DSM_HANDLE_INVALID);
+}
+
+/*
+ * Attach to an area given a handle generated (possibly in another process) by
+ * dsa_get_handle. The area must have been created with dsa_create (not
+ * dsa_create_in_place).
+ */
+dsa_area *
+dsa_attach(dsa_handle handle)
+{
+ dsm_segment *segment;
+ dsa_area *area;
+
+ /*
+ * An area handle is really a DSM segment handle for the first segment, so
+ * we go ahead and attach to that.
+ */
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to area handle %u", handle);
+
+ /* Now we can attach "in place". */
+ area = dsa_attach_in_place(dsm_segment_address(segment));
+
+ /* We need to know when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_detach,
+ PointerGetDatum(dsm_segment_address(segment)));
+
+ return area;
+}
+
+/*
+ * Keep a DSA area attached until end of session or explicit detach.
+ *
+ * By default, areas are owned by the current resource owner, which means they
+ * are detached automatically when that scope ends.
+ */
+void
+dsa_pin_mapping(dsa_area *area)
+{
+ int i;
+
+ Assert(!area->mapping_pinned);
+ area->mapping_pinned = true;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_pin_mapping(area->segment_maps[i].segment);
+}
+
+/*
+ * Allocate memory in this storage area. The return value is a dsa_pointer
+ * that can be passed to other processes, and converted to a local pointer
+ * with dsa_get_address. If no memory is available, returns
+ * InvalidDsaPointer.
+ */
+dsa_pointer
+dsa_allocate(dsa_area *area, Size size)
+{
+ uint16 size_class;
+ dsa_pointer start_pointer;
+ dsa_segment_map *segment_map;
+
+ Assert(size > 0);
+
+ /*
+ * If bigger than the largest size class, just grab a run of pages from
+ * the free page manager, instead of allocating an object from a pool.
+ * There will still be a span, but it's a special class of span that
+ * manages this whole allocation and simply gives all pages back to the
+ * free page manager when dsa_free is called.
+ */
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ {
+ Size npages = fpm_size_to_pages(size);
+ Size first_page;
+ dsa_pointer span_pointer;
+ dsa_area_pool *pool = &area->control->pools[DSA_SCLASS_SPAN_LARGE];
+
+ /* Obtain a span object. */
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return InvalidDsaPointer;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+
+ /* Find a segment from which to allocate. */
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ /* Can't make any more segments: game over. */
+ LWLockRelease(DSA_AREA_LOCK(area));
+ dsa_free(area, span_pointer);
+ return InvalidDsaPointer;
+ }
+
+ /*
+ * Ask the free page manager for a run of pages. This should always
+ * succeed, since both get_best_segment and make_new_segment should
+ * only return a non-NULL pointer if it actually contains enough
+ * contiguous freespace. If it does fail, something in our backend
+ * private state is out of whack, so use FATAL to kill the process.
+ */
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ elog(FATAL,
+ "dsa couldn't find run of pages: fpm_largest out of sync");
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ start_pointer = DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /* Initialize span and pagemap. */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ init_span(area, span_pointer, pool, start_pointer, npages,
+ DSA_SCLASS_SPAN_LARGE);
+ segment_map->pagemap[first_page] = span_pointer;
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+
+ return start_pointer;
+ }
+
+ /* Map allocation to a size class. */
+ if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+ Assert(size <= dsa_size_classes[size_class]);
+ Assert(size_class == 0 || size > dsa_size_classes[size_class - 1]);
+
+ /*
+ * Attempt to allocate an object from the appropriate pool. This might
+ * return InvalidDsaPointer if there's no space available.
+ */
+ return alloc_object(area, size_class);
+}
+
+/*
+ * Free memory obtained with dsa_allocate.
+ */
+void
+dsa_free(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_map *segment_map;
+ int pageno;
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ char *superblock;
+ char *object;
+ Size size;
+ int size_class;
+
+ /* Locate the object, span and pool. */
+ segment_map = get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(dp));
+ pageno = DSA_EXTRACT_OFFSET(dp) / FPM_PAGE_SIZE;
+ span_pointer = segment_map->pagemap[pageno];
+ span = dsa_get_address(area, span_pointer);
+ superblock = dsa_get_address(area, span->start);
+ object = dsa_get_address(area, dp);
+ size_class = span->size_class;
+ size = dsa_size_classes[size_class];
+
+ /*
+ * Special case for large objects that live in a special span: we return
+ * those pages directly to the free page manager and free the span.
+ */
+ if (span->size_class == DSA_SCLASS_SPAN_LARGE)
+ {
+
+#ifdef CLOBBER_FREED_MEMORY
+ memset(object, 0x7f, span->npages * FPM_PAGE_SIZE);
+#endif
+
+ /* Give pages back to free page manager. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Unlink span. */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ unlink_span(area, span);
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+ /* Free the span object so it can be reused. */
+ dsa_free(area, span_pointer);
+ return;
+ }
+
+#ifdef CLOBBER_FREED_MEMORY
+ memset(object, 0x7f, size);
+#endif
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /* Put the object on the span's freelist. */
+ Assert(object >= superblock);
+ Assert(object < superblock + DSA_SUPERBLOCK_SIZE);
+ Assert((object - superblock) % size == 0);
+ NextFreeObjectIndex(object) = span->firstfree;
+ span->firstfree = (object - superblock) / size;
+ ++span->nallocatable;
+
+ /*
+ * See if the span needs to be moved to a different fullness class, or be
+ * freed so its pages can be given back to the segment.
+ */
+ if (span->nallocatable == 1 && span->fclass == DSA_FULLNESS_CLASSES - 1)
+ {
+ /*
+ * The block was completely full and is located in the
+ * highest-numbered fullness class, which is never scanned for free
+ * chunks. We must move it to the next-lower fullness class.
+ */
+ unlink_span(area, span);
+ add_span_to_fullness_class(area, span, span_pointer,
+ DSA_FULLNESS_CLASSES - 2);
+
+ /*
+ * If this is the only span, and there is no active span, then we
+ * should probably move this span to fullness class 1. (Otherwise
+ * if you allocate exactly all the objects in the only span, it moves
+ * to class 3, then you free them all, it moves to 2, and then is
+ * given back, leaving no active span).
+ */
+ }
+ else if (span->nallocatable == span->nmax &&
+ (span->fclass != 1 || span->prevspan != InvalidDsaPointer))
+ {
+ /*
+ * This entire block is free, and it's not the active block for this
+ * size class. Return the memory to the free page manager. We don't
+ * do this for the active block to avoid thrashing: if we
+ * repeatedly allocate and free the only chunk in the active block, it
+ * will be very inefficient if we deallocate and reallocate the block
+ * every time.
+ */
+ destroy_superblock(area, span_pointer);
+ }
+
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+}
+
+/*
+ * Obtain a backend-local address for a dsa_pointer. 'dp' must have been
+ * allocated by the given area (possibly in another process). This may cause
+ * a segment to be mapped into the current process.
+ */
+void *
+dsa_get_address(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_index index;
+ Size offset;
+ Size freed_segment_counter;
+
+ /* Convert InvalidDsaPointer to NULL. */
+ if (!DsaPointerIsValid(dp))
+ return NULL;
+
+ index = DSA_EXTRACT_SEGMENT_NUMBER(dp);
+ offset = DSA_EXTRACT_OFFSET(dp);
+
+ Assert(index < DSA_MAX_SEGMENTS);
+
+ /* Check if we need to cause this segment to be mapped in. */
+ if (area->segment_maps[index].mapped_address == NULL)
+ {
+ /* Call for effect (we don't need the result). */
+ get_segment_by_index(area, index);
+ }
+
+ /*
+ * Take this opportunity to check if we need to detach from any segments
+ * that have been freed. This is an unsynchronized read of the value in
+ * shared memory, but all that matters is that we eventually observe a
+ * change when that number moves.
+ */
+ freed_segment_counter = area->control->freed_segment_counter;
+ if (area->freed_segment_counter != freed_segment_counter)
+ {
+ int i;
+
+ /* Check all currently mapped segments to find what's been freed. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ if (area->segment_maps[i].header != NULL &&
+ area->segment_maps[i].header->freed)
+ {
+ dsm_detach(area->segment_maps[i].segment);
+ area->segment_maps[i].segment = NULL;
+ area->segment_maps[i].header = NULL;
+ area->segment_maps[i].mapped_address = NULL;
+ }
+ }
+ area->freed_segment_counter = freed_segment_counter;
+ }
+
+ return area->segment_maps[index].mapped_address + offset;
+}
+
+/*
+ * Pin this area, so that it will continue to exist even if all backends
+ * detach from it. In that case, the area can still be reattached to if a
+ * handle has been recorded somewhere.
+ */
+void
+dsa_pin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ if (area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_pin: area already pinned");
+ }
+ area->control->pinned = true;
+ ++area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Undo the effects of dsa_pin, so that the given area can be freed when no
+ * backends are attached to it. May be called only if dsa_pin has been
+ * called.
+ */
+void
+dsa_unpin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ Assert(area->control->refcnt > 1);
+ if (!area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_unpin: area not pinned");
+ }
+ area->control->pinned = false;
+ --area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Set the total size limit for this area. This limit is checked whenever new
+ * segments need to be allocated from the operating system. If the new size
+ * limit is already exceeded, this has no immediate effect.
+ *
+ * Note that the total virtual memory usage may be temporarily larger than
+ * this limit when segments have been freed, but not yet detached by all
+ * backends that have attached to them.
+ */
+void
+dsa_set_size_limit(dsa_area *area, Size limit)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ area->control->max_total_segment_size = limit;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Aggressively free all spare memory in the hope of returning DSM segments to
+ * the operating system.
+ */
+void
+dsa_trim(dsa_area *area)
+{
+ int size_class;
+
+ /*
+ * Trim in reverse pool order so we get to the spans-of-spans last, just
+ * in case any become entirely free while processing all the other pools.
+ */
+ for (size_class = DSA_NUM_SIZE_CLASSES - 1; size_class >= 0; --size_class)
+ {
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_pointer span_pointer;
+
+ if (size_class == DSA_SCLASS_SPAN_LARGE)
+ /* Large object frees give back segments aggressively already. */
+ continue;
+
+ /*
+ * Search fullness class 1 only. That is where we expect to find
+ * an entirely empty superblock (entirely empty superblocks in other
+ * fullness classes are returned to the free page map by dsa_free).
+ */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+ span_pointer = pool->spans[1];
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ dsa_pointer next = span->nextspan;
+
+ if (span->nallocatable == span->nmax)
+ destroy_superblock(area, span_pointer);
+
+ span_pointer = next;
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+ }
+}
+
+/*
+ * Print out debugging information about the internal state of the shared
+ * memory area.
+ */
+void
+dsa_dump(dsa_area *area)
+{
+ Size i,
+ j;
+
+ /*
+ * Note: This gives an inconsistent snapshot as it acquires and releases
+ * individual locks as it goes...
+ */
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ fprintf(stderr, "dsa_area handle %x:\n", area->control->handle);
+ fprintf(stderr, " max_total_segment_size: %zu\n",
+ area->control->max_total_segment_size);
+ fprintf(stderr, " total_segment_size: %zu\n",
+ area->control->total_segment_size);
+ fprintf(stderr, " refcnt: %d\n", area->control->refcnt);
+ fprintf(stderr, " pinned: %c\n", area->control->pinned ? 't' : 'f');
+ fprintf(stderr, " segment bins:\n");
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ {
+ if (area->control->segment_bins[i] != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_index segment_index;
+
+ if (i == 0)
+ fprintf(stderr,
+ " segment bin %zu (no contiguous free pages):\n", i);
+ else
+ fprintf(stderr,
+ " segment bin %zu (at least %d contiguous pages free):\n",
+ i, 1 << (i - 1));
+ segment_index = area->control->segment_bins[i];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, segment_index);
+
+ fprintf(stderr,
+ " segment index %zu, usable_pages = %zu, "
+ "contiguous_pages = %zu, mapped at %p\n",
+ segment_index,
+ segment_map->header->usable_pages,
+ fpm_largest(segment_map->fpm),
+ segment_map->mapped_address);
+ segment_index = segment_map->header->next;
+ }
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ fprintf(stderr, " pools:\n");
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ {
+ bool found = false;
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, i), LW_EXCLUSIVE);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ if (DsaPointerIsValid(area->control->pools[i].spans[j]))
+ found = true;
+ if (found)
+ {
+ if (i == DSA_SCLASS_BLOCK_OF_SPANS)
+ fprintf(stderr, " pool for blocks of span objects:\n");
+ else if (i == DSA_SCLASS_SPAN_LARGE)
+ fprintf(stderr, " pool for large object spans:\n");
+ else
+ fprintf(stderr,
+ " pool for size class %zu (object size %hu bytes):\n",
+ i, dsa_size_classes[i]);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ {
+ if (!DsaPointerIsValid(area->control->pools[i].spans[j]))
+ fprintf(stderr, " fullness class %zu is empty\n", j);
+ else
+ {
+ dsa_pointer span_pointer = area->control->pools[i].spans[j];
+
+ fprintf(stderr, " fullness class %zu:\n", j);
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span;
+
+ span = dsa_get_address(area, span_pointer);
+ fprintf(stderr,
+ " span descriptor at %016llx, "
+ "superblock at %016llx, pages = %zu, "
+ "objects free = %hu/%hu\n",
+ (unsigned long long) span_pointer,
+ (unsigned long long) span->start, span->npages,
+ span->nallocatable, span->nmax);
+ span_pointer = span->nextspan;
+ }
+ }
+ }
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, i));
+ }
+}
+
+/*
+ * A callback function for when the control segment for a dsa_area is
+ * detached. If you use dsa_create_in_place(...) to create an area in an
+ * existing DSM segment then it may be useful to have the DSA area detached
+ * automatically when the containing DSM segment is detached. This happens
+ * automatically for areas created with dsa_create(...).
+ */
+void
+dsa_on_dsm_detach(dsm_segment *segment, Datum arg)
+{
+ bool destroy = false;
+ dsa_area_control *control =
+ (dsa_area_control *) DatumGetPointer(arg);
+
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ control->handle ^ 0));
+
+ /* Decrement the reference count for the DSA area. */
+ LWLockAcquire(&control->lock, LW_EXCLUSIVE);
+ if (--control->refcnt == 0)
+ destroy = true;
+ LWLockRelease(&control->lock);
+
+ /*
+ * If we are the last to detach from the area, then we must unpin all
+ * segments so they can be returned to the OS.
+ */
+ if (destroy)
+ {
+ int i;
+
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ {
+ dsm_handle handle;
+
+ handle = control->segment_handles[i];
+ if (handle != DSM_HANDLE_INVALID)
+ dsm_unpin_segment(handle);
+ }
+ }
+}
+
+/*
+ * Workhorse function for dsa_create and dsa_create_in_place.
+ */
+static dsa_area *
+create_internal(void *place, size_t size,
+ int tranche_id, const char *tranche_name,
+ dsm_handle control_handle,
+ dsm_segment *control_segment)
+{
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+ Size usable_pages;
+ Size total_pages;
+ Size metadata_bytes;
+ int i;
+
+ /* Sanity check on the space we have to work in. */
+ if (size < DSA_MINIMUM_IN_PLACE_SIZE)
+ elog(ERROR, "dsa_area space must be at least %zu, but %zu provided",
+ DSA_MINIMUM_IN_PLACE_SIZE, size);
+ /*
+ * That minimum size limit had better be big enough for the smallest
+ * amount of metadata space we could need to hold.
+ */
+ StaticAssertStmt(DSA_MINIMUM_IN_PLACE_SIZE >=
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ (DSA_MINIMUM_IN_PLACE_SIZE / FPM_PAGE_SIZE) *
+ sizeof(dsa_pointer),
+ "DSA_MINIMUM_IN_PLACE_SIZE is too small");
+ /* Now figure out how much space is usable. */
+ total_pages = size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ Assert(metadata_bytes <= size);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages = (size - metadata_bytes) / FPM_PAGE_SIZE;
+
+ /*
+ * Initialize the dsa_area_control object located at the start of the
+ * space.
+ */
+ control = (dsa_area_control *) place;
+ control->segment_header.magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ control_handle ^ 0;
+ control->segment_header.next = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.usable_pages = usable_pages;
+ control->segment_header.freed = false;
+ control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+ control->handle = control_handle;
+ control->max_total_segment_size = SIZE_MAX;
+ control->total_segment_size = size;
+ memset(&control->segment_handles[0], 0,
+ sizeof(dsm_handle) * DSA_MAX_SEGMENTS);
+ control->segment_handles[0] = control_handle;
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ control->segment_bins[i] = DSA_SEGMENT_INDEX_NONE;
+ control->refcnt = 1;
+ control->pinned = false;
+ control->freed_segment_counter = 0;
+ control->lwlock_tranche_id = tranche_id;
+ strlcpy(control->lwlock_tranche_name, tranche_name, DSA_MAXLEN);
+
+ /*
+ * Create the dsa_area object that this backend will use to access the
+ * area. Other backends will need to obtain their own dsa_area object by
+ * attaching.
+ */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(area->segment_maps, 0, sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+ LWLockInitialize(&control->lock, control->lwlock_tranche_id);
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ LWLockInitialize(DSA_SCLASS_LOCK(area, i),
+ control->lwlock_tranche_id);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = control_segment;
+ segment_map->mapped_address = place;
+ segment_map->header = (dsa_segment_header *) place;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Put this segment into the appropriate bin. */
+ control->segment_bins[contiguous_pages_to_segment_bin(usable_pages)] = 0;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+
+ return area;
+}
+
+/*
+ * Workhorse function for dsa_attach and dsa_attach_in_place.
+ */
+static dsa_area *
+attach_internal(void *place, dsm_segment *segment, dsa_handle handle)
+{
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+
+ control = (dsa_area_control *) place;
+ Assert(control->handle == handle);
+ Assert(control->segment_handles[0] == handle);
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ handle ^ 0));
+
+ /* Build the backend-local area object. */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(&area->segment_maps[0], 0,
+ sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment; /* NULL for in-place */
+ segment_map->mapped_address = place;
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Bump the reference count. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ ++control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return area;
+}
+
+/*
+ * Add a new span to fullness class 1 of the indicated pool.
+ */
+static void
+init_span(dsa_area *area,
+ dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ Size obsize = dsa_size_classes[size_class];
+
+ /*
+ * The per-pool lock must be held because we manipulate the span list for
+ * this pool.
+ */
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /* Push this span onto the front of the span list for fullness class 1. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ {
+ dsa_area_span *head = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+
+ head->prevspan = span_pointer;
+ }
+ span->pool = DsaAreaPoolToDsaPointer(area, pool);
+ span->nextspan = pool->spans[1];
+ span->prevspan = InvalidDsaPointer;
+ pool->spans[1] = span_pointer;
+
+ span->start = start;
+ span->npages = npages;
+ span->size_class = size_class;
+ span->ninitialized = 0;
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * A block-of-spans contains its own descriptor, so mark one object as
+ * initialized and reduce the count of allocatable objects by one.
+ * Doing this here has the side effect of also reducing nmax by one,
+ * which is important to make sure we free this object at the correct
+ * time.
+ */
+ span->ninitialized = 1;
+ span->nallocatable = FPM_PAGE_SIZE / obsize - 1;
+ }
+ else if (size_class != DSA_SCLASS_SPAN_LARGE)
+ span->nallocatable = DSA_SUPERBLOCK_SIZE / obsize;
+ span->firstfree = DSA_SPAN_NOTHING_FREE;
+ span->nmax = span->nallocatable;
+ span->fclass = 1;
+}
+
+/*
+ * Transfer the first span in one fullness class to the head of another
+ * fullness class.
+ */
+static bool
+transfer_first_span(dsa_area *area,
+ dsa_area_pool *pool, int fromclass, int toclass)
+{
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+
+ /* Can't do it if source list is empty. */
+ span_pointer = pool->spans[fromclass];
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+
+ /* Remove span from head of source list. */
+ span = dsa_get_address(area, span_pointer);
+ pool->spans[fromclass] = span->nextspan;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+
+ /* Add span to head of target list. */
+ span->nextspan = pool->spans[toclass];
+ pool->spans[toclass] = span_pointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = toclass;
+
+ return true;
+}
+
+/*
+ * Allocate one object of the requested size class from the given area.
+ */
+static inline dsa_pointer
+alloc_object(dsa_area *area, int size_class)
+{
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_area_span *span;
+ dsa_pointer block;
+ dsa_pointer result;
+ char *object;
+ Size size;
+
+ /*
+ * Even though ensure_active_superblock can in turn call alloc_object if
+ * it needs to allocate a new span, that's always from a different pool,
+ * and the order of lock acquisition is always the same, so it's OK that
+ * we hold this lock for the duration of this function.
+ */
+ Assert(!LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /*
+ * If there's no active superblock, we must successfully obtain one or
+ * fail the request.
+ */
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ !ensure_active_superblock(area, pool, size_class))
+ {
+ result = InvalidDsaPointer;
+ }
+ else
+ {
+ /*
+ * There should be a block in fullness class 1 at this point, and it
+ * should never be completely full. Thus we can either pop an object
+ * from the free list or, failing that, initialize a new object.
+ */
+ Assert(DsaPointerIsValid(pool->spans[1]));
+ span = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+ Assert(span->nallocatable > 0);
+ block = span->start;
+ Assert(size_class < DSA_NUM_SIZE_CLASSES);
+ size = dsa_size_classes[size_class];
+ if (span->firstfree != DSA_SPAN_NOTHING_FREE)
+ {
+ result = block + span->firstfree * size;
+ object = dsa_get_address(area, result);
+ span->firstfree = NextFreeObjectIndex(object);
+ }
+ else
+ {
+ result = block + span->ninitialized * size;
+ ++span->ninitialized;
+ }
+ --span->nallocatable;
+
+ /* If it's now full, move it to the highest-numbered fullness class. */
+ if (span->nallocatable == 0)
+ transfer_first_span(area, pool, 1, DSA_FULLNESS_CLASSES - 1);
+ }
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+
+ return result;
+}
+
+/*
+ * Ensure an active (i.e. fullness class 1) superblock, unless all existing
+ * superblocks are completely full and no more can be allocated.
+ *
+ * Fullness classes K of 0..N are loosely intended to represent blocks whose
+ * utilization percentage is at least K/N, but we only enforce this rigorously
+ * for the highest-numbered fullness class, which always contains exactly
+ * those blocks that are completely full. It's otherwise acceptable for a
+ * block to be in a higher-numbered fullness class than the one to which it
+ * logically belongs. In addition, the active block, which is always the
+ * first block in fullness class 1, is permitted to have a higher allocation
+ * percentage than would normally be allowable for that fullness class; we
+ * don't move it until it's completely full, and then it goes to the
+ * highest-numbered fullness class.
+ *
+ * It might seem odd that the active block is the head of fullness class 1
+ * rather than fullness class 0, but experience with other allocators has
+ * shown that it's usually better to allocate from a block that's moderately
+ * full rather than one that's nearly empty. Insofar as is reasonably
+ * possible, we want to avoid performing new allocations in a block that would
+ * otherwise become empty soon.
+ */
+static bool
+ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class)
+{
+ dsa_pointer span_pointer;
+ dsa_pointer start_pointer;
+ Size obsize = dsa_size_classes[size_class];
+ Size nmax;
+ int fclass;
+ Size npages = 1;
+ Size first_page;
+ Size i;
+ dsa_segment_map *segment_map;
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /*
+ * Compute the number of objects that will fit in a block of this size
+ * class. Span-of-spans blocks are just a single page, and the first
+ * object isn't available for use because it describes the block-of-spans
+ * itself.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ nmax = FPM_PAGE_SIZE / obsize - 1;
+ else
+ nmax = DSA_SUPERBLOCK_SIZE / obsize;
+
+ /*
+ * If fullness class 1 is empty, try to find a span to put in it by
+ * scanning higher-numbered fullness classes (excluding the last one,
+ * whose blocks are certain to all be completely full).
+ */
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ {
+ span_pointer = pool->spans[fclass];
+
+ while (DsaPointerIsValid(span_pointer))
+ {
+ int tfclass;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+ dsa_area_span *prevspan;
+ dsa_pointer next_span_pointer;
+
+ span = (dsa_area_span *)
+ dsa_get_address(area, span_pointer);
+ next_span_pointer = span->nextspan;
+
+ /* Figure out what fullness class should contain this span. */
+ tfclass = (nmax - span->nallocatable)
+ * (DSA_FULLNESS_CLASSES - 1) / nmax;
+
+ /* Look up next span. */
+ if (DsaPointerIsValid(span->nextspan))
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ else
+ nextspan = NULL;
+
+ /*
+ * If utilization has dropped enough that this now belongs in some
+ * other fullness class, move it there.
+ */
+ if (tfclass < fclass)
+ {
+ /* Remove from the current fullness class list. */
+ if (pool->spans[fclass] == span_pointer)
+ {
+ /* It was the head; remove it. */
+ Assert(!DsaPointerIsValid(span->prevspan));
+ pool->spans[fclass] = span->nextspan;
+ if (nextspan != NULL)
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+ else
+ {
+ /* It was not the head. */
+ Assert(DsaPointerIsValid(span->prevspan));
+ prevspan = (dsa_area_span *)
+ dsa_get_address(area, span->prevspan);
+ prevspan->nextspan = span->nextspan;
+ }
+ if (nextspan != NULL)
+ nextspan->prevspan = span->prevspan;
+
+ /* Push onto the head of the new fullness class list. */
+ span->nextspan = pool->spans[tfclass];
+ pool->spans[tfclass] = span_pointer;
+ span->prevspan = InvalidDsaPointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = tfclass;
+ }
+
+ /* Advance to next span on list. */
+ span_pointer = next_span_pointer;
+ }
+
+ /* Stop now if we found a suitable block. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ return true;
+ }
+
+ /*
+ * If there are no blocks that properly belong in fullness class 1, pick
+ * one from some other fullness class and move it there anyway, so that we
+ * have an allocation target. Our last choice is to transfer a block
+ * that's almost empty (and might become completely empty soon if left
+ * alone), but even that is better than failing, which is what we must do
+ * if there are no blocks at all with free space.
+ */
+ Assert(!DsaPointerIsValid(pool->spans[1]));
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ if (transfer_first_span(area, pool, fclass, 1))
+ return true;
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ transfer_first_span(area, pool, 0, 1))
+ return true;
+
+ /*
+ * We failed to find an existing span with free objects, so we need to
+ * allocate a new superblock and construct a new span to manage it.
+ *
+ * First, get a dsa_area_span object to describe the new superblock
+ * ... unless this allocation is for a dsa_area_span object, in which case
+ * that's surely not going to work. We handle that case by storing the
+ * span describing a block-of-spans inline.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+ npages = DSA_PAGES_PER_SUPERBLOCK;
+ }
+
+ /* Find or create a segment and allocate the superblock. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
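+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Don't leak the span object we allocated above. */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;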
+ }
+ }
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* Compute the start of the superblock. */
+ start_pointer =
+ DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /*
+ * If this is a block-of-spans, carve the descriptor right out of the
+ * allocated space.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * The descriptor is stored at the very start of the block, so the
+ * span's dsa_pointer is simply the start of the superblock.
+ */
+ span_pointer = start_pointer;
+ }
+
+ /* Initialize span and pagemap. */
+ init_span(area, span_pointer, pool, start_pointer, npages, size_class);
+ for (i = 0; i < npages; ++i)
+ segment_map->pagemap[first_page + i] = span_pointer;
+
+ return true;
+}
+
+/*
+ * Return the segment map corresponding to a given segment index, mapping the
+ * segment in if necessary.
+ */
+static dsa_segment_map *
+get_segment_by_index(dsa_area *area, dsa_segment_index index)
+{
+ if (area->segment_maps[index].mapped_address == NULL) /* unlikely */
+ {
+ dsm_handle handle;
+ dsm_segment *segment;
+ dsa_segment_map *segment_map;
+
+ handle = area->control->segment_handles[index];
+
+ /* This slot has been freed. */
+ if (handle == DSM_HANDLE_INVALID)
+ return NULL;
+
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to segment");
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+ segment_map = &area->segment_maps[index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header =
+ (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ Assert(segment_map->header->magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index));
+ }
+
+ return &area->segment_maps[index];
+}
+
+/*
+ * Return a superblock to the free page manager. If the underlying segment
+ * has become entirely free, then return it to the operating system.
+ *
+ * The appropriate pool lock must be held.
+ */
+static void
+destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ int size_class = span->size_class;
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
+
+ /* Remove it from its fullness class list. */
+ unlink_span(area, span);
+
+ /*
+ * Note: This is the only time we acquire the area lock while we already
+ * hold a per-pool lock. We never hold the area lock and then take a pool
+ * lock, or we could deadlock.
+ */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ /* Check if the segment is now entirely free. */
+ if (fpm_largest(segment_map->fpm) == segment_map->header->usable_pages)
+ {
+ dsa_segment_index index = get_segment_index(area, segment_map);
+
+ /* If it's not the segment with extra control data, free it. */
+ if (index != 0)
+ {
+ /*
+ * Give it back to the OS, and allow other backends to detect that
+ * they need to detach.
+ */
+ unlink_segment(area, segment_map);
+ segment_map->header->freed = true;
+ Assert(area->control->total_segment_size >=
+ segment_map->header->size);
+ area->control->total_segment_size -=
+ segment_map->header->size;
+ dsm_unpin_segment(dsm_segment_handle(segment_map->segment));
+ dsm_detach(segment_map->segment);
+ area->control->segment_handles[index] = DSM_HANDLE_INVALID;
+ ++area->control->freed_segment_counter;
+ segment_map->segment = NULL;
+ segment_map->header = NULL;
+ segment_map->mapped_address = NULL;
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /*
+ * Span-of-spans blocks store the span which describes them within the
+ * block itself, so freeing the storage implicitly frees the descriptor
+ * also. If this is a block of any other type, we need to separately free
+ * the span object also. This recursive call to dsa_free will acquire the
+ * span pool's lock. We can't deadlock because the acquisition order is
+ * always some other pool and then the span pool.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+}
+
+/*
+ * Remove a span from the fullness class list that currently contains it.
+ * The caller must hold the lock for the span's pool.
+ */
+static void
+unlink_span(dsa_area *area, dsa_area_span *span)
+{
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ dsa_area_span *next = dsa_get_address(area, span->nextspan);
+
+ next->prevspan = span->prevspan;
+ }
+ if (DsaPointerIsValid(span->prevspan))
+ {
+ dsa_area_span *prev = dsa_get_address(area, span->prevspan);
+
+ prev->nextspan = span->nextspan;
+ }
+ else
+ {
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ pool->spans[span->fclass] = span->nextspan;
+ }
+}
+
+/*
+ * Push a span onto the head of the list for the given fullness class of its
+ * pool, and record the new fullness class in the span.
+ */
+static void
+add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer,
+ int fclass)
+{
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ if (DsaPointerIsValid(pool->spans[fclass]))
+ {
+ dsa_area_span *head = dsa_get_address(area,
+ pool->spans[fclass]);
+
+ head->prevspan = span_pointer;
+ }
+ span->prevspan = InvalidDsaPointer;
+ span->nextspan = pool->spans[fclass];
+ pool->spans[fclass] = span_pointer;
+ span->fclass = fclass;
+}
+
+/*
+ * Detach from an area that was either created or attached to by this process.
+ */
+void
+dsa_detach(dsa_area *area)
+{
+ int i;
+
+ /* Detach from all segments. */
+ for (i = 0; i < DSA_MAX_SEGMENTS; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_detach(area->segment_maps[i].segment);
+
+ /* Free the backend-local area object. */
+ pfree(area);
+}
+
+/*
+ * Unlink a segment from the bin that contains it.
+ */
+static void
+unlink_segment(dsa_area *area, dsa_segment_map *segment_map)
+{
+ if (segment_map->header->prev != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *prev;
+
+ prev = get_segment_by_index(area, segment_map->header->prev);
+ prev->header->next = segment_map->header->next;
+ }
+ else
+ {
+ Assert(area->control->segment_bins[segment_map->header->bin] ==
+ get_segment_index(area, segment_map));
+ area->control->segment_bins[segment_map->header->bin] =
+ segment_map->header->next;
+ }
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area, segment_map->header->next);
+ next->header->prev = segment_map->header->prev;
+ }
+}
+
+/*
+ * Find a segment that could satisfy a request for 'npages' of contiguous
+ * memory, or return NULL if none can be found. This may involve attaching to
+ * segments that weren't previously attached so that we can query their free
+ * pages map.
+ */
+static dsa_segment_map *
+get_best_segment(dsa_area *area, Size npages)
+{
+ Size bin;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /*
+ * Start searching from the first bin that *might* have enough contiguous
+ * pages.
+ */
+ for (bin = contiguous_pages_to_segment_bin(npages);
+ bin < DSA_NUM_SEGMENT_BINS;
+ ++bin)
+ {
+ /*
+ * The minimum contiguous size that any segment in this bin should
+ * have. We'll re-bin if we see segments with fewer.
+ */
+ Size threshold = (Size) 1 << (bin - 1);
+ dsa_segment_index segment_index;
+
+ /* Search this bin for a segment with enough contiguous space. */
+ segment_index = area->control->segment_bins[bin];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+ dsa_segment_index next_segment_index;
+ Size contiguous_pages;
+
+ segment_map = get_segment_by_index(area, segment_index);
+ next_segment_index = segment_map->header->next;
+ contiguous_pages = fpm_largest(segment_map->fpm);
+
+ /* Not enough for the request, still enough for this bin. */
+ if (contiguous_pages >= threshold && contiguous_pages < npages)
+ {
+ segment_index = next_segment_index;
+ continue;
+ }
+
+ /* Re-bin it if it's no longer in the appropriate bin. */
+ if (contiguous_pages < threshold)
+ {
+ Size new_bin;
+
+ new_bin = contiguous_pages_to_segment_bin(contiguous_pages);
+
+ /* Remove it from its current bin. */
+ unlink_segment(area, segment_map);
+
+ /* Push it onto the front of its new bin. */
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[new_bin];
+ segment_map->header->bin = new_bin;
+ area->control->segment_bins[new_bin] = segment_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area,
+ segment_map->header->next);
+ Assert(next->header->bin == new_bin);
+ next->header->prev = segment_index;
+ }
+
+ /*
+ * But fall through to see if it's enough to satisfy this
+ * request anyway....
+ */
+ }
+
+ /* Check if we are done. */
+ if (contiguous_pages >= npages)
+ return segment_map;
+
+ /* Continue searching the same bin. */
+ segment_index = next_segment_index;
+ }
+ }
+
+ /* Not found. */
+ return NULL;
+}
+
+/*
+ * Create a new segment that can handle at least requested_pages. Returns
+ * NULL if the requested total size limit or maximum allowed number of
+ * segments would be exceeded.
+ */
+static dsa_segment_map *
+make_new_segment(dsa_area *area, Size requested_pages)
+{
+ dsa_segment_index new_index;
+ Size metadata_bytes;
+ Size total_size;
+ Size total_pages;
+ Size usable_pages;
+ dsa_segment_map *segment_map;
+ dsm_segment *segment;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /* Find a segment slot that is not in use (linearly for now). */
+ for (new_index = 1; new_index < DSA_MAX_SEGMENTS; ++new_index)
+ {
+ if (area->control->segment_handles[new_index] == DSM_HANDLE_INVALID)
+ break;
+ }
+ if (new_index == DSA_MAX_SEGMENTS)
+ return NULL;
+
+ /*
+ * If the total size limit is already exceeded, then we exit early and
+ * avoid arithmetic wraparound in the unsigned expressions below.
+ */
+ if (area->control->total_segment_size >=
+ area->control->max_total_segment_size)
+ return NULL;
+
+ /*
+ * The size should be at least as big as requested, and at least big
+ * enough to follow a geometric series that approximately doubles the
+ * total storage each time we create a new segment. We use geometric
+ * growth because the underlying DSM system isn't designed for large
+ * numbers of segments (otherwise we might even consider just using one
+ * DSM segment for each large allocation and for each superblock, and then
+ * we wouldn't need to use FreePageManager).
+ *
+ * We decide on a total segment size first, so that we produce tidy
+ * power-of-two sized segments. This is a good property to have if we
+ * move to huge pages in the future. Then we work back to the number of
+ * pages we can fit.
+ */
+ total_size = DSA_INITIAL_SEGMENT_SIZE *
+ ((Size) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
+ total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size,
+ area->control->max_total_segment_size -
+ area->control->total_segment_size);
+
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ sizeof(dsa_pointer) * total_pages;
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ if (total_size <= metadata_bytes)
+ return NULL;
+ usable_pages = (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ Assert(metadata_bytes + usable_pages * FPM_PAGE_SIZE <= total_size);
+
+ /* See if that is enough... */
+ if (requested_pages > usable_pages)
+ {
+ /*
+ * We'll make an odd-sized segment, working forward from the requested
+ * number of pages.
+ */
+ usable_pages = requested_pages;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ usable_pages * sizeof(dsa_pointer);
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ total_size = metadata_bytes + usable_pages * FPM_PAGE_SIZE;
+
+ /* Is that too large for dsa_pointer's addressing scheme? */
+ if (total_size > DSA_MAX_SEGMENT_SIZE)
+ return NULL;
+
+ /* Would that exceed the limit? */
+ if (total_size > area->control->max_total_segment_size -
+ area->control->total_segment_size)
+ return NULL;
+ }
+
+ /* Create the segment. */
+ segment = dsm_create(total_size, 0);
+ if (segment == NULL)
+ return NULL;
+ dsm_pin_segment(segment);
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+
+ /* Store the handle in shared memory to be found by index. */
+ area->control->segment_handles[new_index] =
+ dsm_segment_handle(segment);
+
+ area->control->total_segment_size += total_size;
+ Assert(area->control->total_segment_size <=
+ area->control->max_total_segment_size);
+
+ /* Build a segment map for this segment in this backend. */
+ segment_map = &area->segment_maps[new_index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Set up the segment header and put it in the appropriate bin. */
+ segment_map->header->magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ new_index;
+ segment_map->header->usable_pages = usable_pages;
+ segment_map->header->size = total_size;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[segment_map->header->bin];
+ segment_map->header->freed = false;
+ area->control->segment_bins[segment_map->header->bin] = new_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next =
+ get_segment_by_index(area, segment_map->header->next);
+
+ Assert(next->header->bin == segment_map->header->bin);
+ next->header->prev = new_index;
+ }
+
+ return segment_map;
+}
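
For anyone following the sizing arithmetic in make_new_segment() above, here
is a minimal standalone sketch of the same two-step decision: first try a
tidy geometric size and work back to the usable pages, then fall back to an
odd-sized segment built forward from the request. It is not code from the
patch. Every sketch_* name and every SKETCH_* constant is a hypothetical
stand-in chosen only for illustration (the real values come from dsa.c and
freepage.h), and the fixed header size stands in for the two MAXALIGN'd
structs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placeholder values, NOT the constants used by the patch. */
#define SKETCH_PAGE_SIZE             ((size_t) 4096)
#define SKETCH_INITIAL_SEGMENT_SIZE  ((size_t) 1 << 20)
#define SKETCH_SEGMENTS_AT_EACH_SIZE 4
#define SKETCH_FIXED_HEADER_BYTES    ((size_t) 256)	/* header + page manager */
#define SKETCH_PAGEMAP_ENTRY_BYTES   ((size_t) 8)	/* one entry per page */

/* Round a byte count up to the next page boundary. */
static size_t
sketch_round_up_to_page(size_t bytes)
{
	size_t		rem = bytes % SKETCH_PAGE_SIZE;

	return rem == 0 ? bytes : bytes + (SKETCH_PAGE_SIZE - rem);
}

/* Metadata size for a page map with the given number of entries. */
static size_t
sketch_metadata_bytes(size_t pagemap_entries)
{
	return sketch_round_up_to_page(SKETCH_FIXED_HEADER_BYTES +
						SKETCH_PAGEMAP_ENTRY_BYTES * pagemap_entries);
}

/*
 * Decide the total size of a new segment for slot new_index.  Returns the
 * total size in bytes and stores the usable page count in *usable_pages,
 * or returns 0 if the request cannot be satisfied within the limits.
 * The caller is assumed to pass a small new_index (no shift overflow) and
 * a nonzero remaining_budget.
 */
size_t
sketch_segment_size(size_t new_index, size_t requested_pages,
					size_t max_segment_size, size_t remaining_budget,
					size_t *usable_pages)
{
	size_t		total_size;
	size_t		metadata_bytes;

	assert(usable_pages != NULL);

	/* Step 1: geometric size, clamped to both limits. */
	total_size = SKETCH_INITIAL_SEGMENT_SIZE <<
		(new_index / SKETCH_SEGMENTS_AT_EACH_SIZE);
	if (total_size > max_segment_size)
		total_size = max_segment_size;
	if (total_size > remaining_budget)
		total_size = remaining_budget;

	/* Work back from the total size to the pages left after metadata. */
	metadata_bytes = sketch_metadata_bytes(total_size / SKETCH_PAGE_SIZE);
	if (total_size <= metadata_bytes)
		return 0;
	*usable_pages = (total_size - metadata_bytes) / SKETCH_PAGE_SIZE;

	/* Step 2: if that is too small, build an odd-sized segment instead. */
	if (requested_pages > *usable_pages)
	{
		*usable_pages = requested_pages;
		metadata_bytes = sketch_metadata_bytes(requested_pages);
		total_size = metadata_bytes + requested_pages * SKETCH_PAGE_SIZE;
		if (total_size > max_segment_size || total_size > remaining_budget)
			return 0;
	}
	return total_size;
}
```

With these placeholder constants, a one-page request for slot 0 yields a
1 MB segment with one metadata page, while a 1000-page request no longer
fits the geometric size and takes the odd-sized path. The point of the
sketch is only that the page map is sized from the total page count in the
first path and from the requested page count in the second.
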
diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
new file mode 100644
index 0000000..ed42326
--- /dev/null
+++ b/src/backend/utils/mmgr/freepage.c
@@ -0,0 +1,1813 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.c
+ * Management of free memory pages.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/freepage.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+
+#include "utils/freepage.h"
+#include "utils/relptr.h"
+
+
+/* Magic numbers to identify various page types */
+#define FREE_PAGE_SPAN_LEADER_MAGIC 0xea4020f0
+#define FREE_PAGE_LEAF_MAGIC 0x98eae728
+#define FREE_PAGE_INTERNAL_MAGIC 0x19aa32c9
+
+/* Doubly linked list of spans of free pages; stored in first page of span. */
+struct FreePageSpanLeader
+{
+ int magic; /* always FREE_PAGE_SPAN_LEADER_MAGIC */
+ Size npages; /* number of pages in span */
+ RelptrFreePageSpanLeader prev;
+ RelptrFreePageSpanLeader next;
+};
+
+/* Common header for btree leaf and internal pages. */
+typedef struct FreePageBtreeHeader
+{
+ int magic; /* FREE_PAGE_LEAF_MAGIC or
+ * FREE_PAGE_INTERNAL_MAGIC */
+ Size nused; /* number of items used */
+ RelptrFreePageBtree parent; /* uplink */
+} FreePageBtreeHeader;
+
+/* Internal key; points to next level of btree. */
+typedef struct FreePageBtreeInternalKey
+{
+ Size first_page; /* low bound for keys on child page */
+ RelptrFreePageBtree child; /* downlink */
+} FreePageBtreeInternalKey;
+
+/* Leaf key; no payload data. */
+typedef struct FreePageBtreeLeafKey
+{
+ Size first_page; /* first page in span */
+ Size npages; /* number of pages in span */
+} FreePageBtreeLeafKey;
+
+/* Work out how many keys will fit on a page. */
+#define FPM_ITEMS_PER_INTERNAL_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeInternalKey))
+#define FPM_ITEMS_PER_LEAF_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeLeafKey))
+
+/* A btree page of either sort */
+struct FreePageBtree
+{
+ FreePageBtreeHeader hdr;
+ union
+ {
+ FreePageBtreeInternalKey internal_key[FPM_ITEMS_PER_INTERNAL_PAGE];
+ FreePageBtreeLeafKey leaf_key[FPM_ITEMS_PER_LEAF_PAGE];
+ } u;
+};
+
+/* Results of a btree search */
+typedef struct FreePageBtreeSearchResult
+{
+ FreePageBtree *page;
+ Size index;
+ bool found;
+ unsigned split_pages;
+} FreePageBtreeSearchResult;
+
+/* Helper functions */
+static void FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm,
+ FreePageBtree *btp);
+static Size FreePageBtreeCleanup(FreePageManager *fpm);
+static FreePageBtree *FreePageBtreeFindLeftSibling(char *base,
+ FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeFindRightSibling(char *base,
+ FreePageBtree *btp);
+static Size FreePageBtreeFirstKey(FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeGetRecycled(FreePageManager *fpm);
+static void FreePageBtreeInsertInternal(char *base, FreePageBtree *btp,
+ Size index, Size first_page, FreePageBtree *child);
+static void FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index,
+ Size first_page, Size npages);
+static void FreePageBtreeRecycle(FreePageManager *fpm, Size pageno);
+static void FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp,
+ Size index);
+static void FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp);
+static void FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result);
+static Size FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page);
+static Size FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page);
+static FreePageBtree *FreePageBtreeSplitPage(FreePageManager *fpm,
+ FreePageBtree *btp);
+static void FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp);
+static void FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf);
+static void FreePageManagerDumpSpans(FreePageManager *fpm,
+ FreePageSpanLeader *span, Size expected_pages,
+ StringInfo buf);
+static bool FreePageManagerGetInternal(FreePageManager *fpm, Size npages,
+ Size *first_page);
+static Size FreePageManagerPutInternal(FreePageManager *fpm, Size first_page,
+ Size npages, bool soft);
+static void FreePagePopSpanLeader(FreePageManager *fpm, Size pageno);
+static void FreePagePushSpanLeader(FreePageManager *fpm, Size first_page,
+ Size npages);
+static void FreePageManagerUpdateLargest(FreePageManager *fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+static Size sum_free_pages(FreePageManager *fpm);
+#endif
+
+/*
+ * Initialize a new, empty free page manager.
+ *
+ * 'fpm' should reference caller-provided memory large enough to contain a
+ * FreePageManager. We'll initialize it here.
+ *
+ * 'base' is the address to which all pointers are relative. When managing
+ * a dynamic shared memory segment, it should normally be the base of the
+ * segment. When managing backend-private memory, it can be either NULL or,
+ * if managing a single contiguous extent of memory, the start of that extent.
+ */
+void
+FreePageManagerInitialize(FreePageManager *fpm, char *base)
+{
+ Size f;
+
+ relptr_store(base, fpm->self, fpm);
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ relptr_store(base, fpm->btree_recycle, (FreePageSpanLeader *) NULL);
+ fpm->btree_depth = 0;
+ fpm->btree_recycle_count = 0;
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->contiguous_pages = 0;
+ fpm->contiguous_pages_dirty = true;
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages = 0;
+#endif
+
+ for (f = 0; f < FPM_NUM_FREELISTS; f++)
+ relptr_store(base, fpm->freelist[f], (FreePageSpanLeader *) NULL);
+}
+
+/*
+ * Allocate a run of pages of the given length from the free page manager.
+ * The return value indicates whether we were able to satisfy the request;
+ * if true, the first page of the allocation is stored in *first_page.
+ */
+bool
+FreePageManagerGet(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ bool result;
+
+ result = FreePageManagerGetInternal(fpm, npages, first_page);
+
+ /*
+ * It's a bit counterintuitive, but allocating pages can actually create
+ * opportunities for cleanup that create larger ranges. We might pull a
+ * key out of the btree that enables the item at the head of the btree
+ * recycle list to be inserted; and then if there are more items behind it
+ * one of those might cause two currently-separated ranges to merge,
+ * creating a single range of contiguous pages larger than any that
+ * existed previously. It might be worth trying to improve the cleanup
+ * algorithm to avoid such corner cases, but for now we just notice the
+ * condition and do the appropriate reporting.
+ */
+ FreePageBtreeCleanup(fpm);
+
+ /*
+ * TODO: We could take Max(fpm->contiguous_pages, result of
+ * FreePageBtreeCleanup) and give it to FreePageManagerUpdateLargest as a
+ * starting point for its search, potentially avoiding a bunch of work,
+ * since there is no way the largest contiguous run is bigger than that.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ if (result)
+ {
+ Assert(fpm->free_pages >= npages);
+ fpm->free_pages -= npages;
+ }
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+ return result;
+}
+
+#ifdef FPM_EXTRA_ASSERTS
+static void
+sum_free_pages_recurse(FreePageManager *fpm, FreePageBtree *btp, Size *sum)
+{
+ char *base = fpm_segment_base(fpm);
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ||
+ btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ ++*sum;
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ Size index;
+
+
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ sum_free_pages_recurse(fpm, child, sum);
+ }
+ }
+}
+static Size
+sum_free_pages(FreePageManager *fpm)
+{
+ FreePageSpanLeader *recycle;
+ char *base = fpm_segment_base(fpm);
+ Size sum = 0;
+ int list;
+
+ /* Count the spans by scanning the freelists. */
+ for (list = 0; list < FPM_NUM_FREELISTS; ++list)
+ {
+
+ if (!relptr_is_null(fpm->freelist[list]))
+ {
+ FreePageSpanLeader *candidate =
+ relptr_access(base, fpm->freelist[list]);
+
+ do
+ {
+ sum += candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ }
+
+ /* Count the pages used by the btree itself (internal and leaf). */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ sum_free_pages_recurse(fpm, root, &sum);
+ }
+
+ /* Count the recycle list. */
+ for (recycle = relptr_access(base, fpm->btree_recycle);
+ recycle != NULL;
+ recycle = relptr_access(base, recycle->next))
+ {
+ Assert(recycle->npages == 1);
+ ++sum;
+ }
+
+ return sum;
+}
+#endif
+
+/*
+ * Recompute the size of the largest run of pages that the user could
+ * successfully get, if it has been marked dirty.
+ */
+static void
+FreePageManagerUpdateLargest(FreePageManager *fpm)
+{
+ char *base;
+ Size largest;
+
+ if (!fpm->contiguous_pages_dirty)
+ return;
+
+ base = fpm_segment_base(fpm);
+ largest = 0;
+ if (!relptr_is_null(fpm->freelist[FPM_NUM_FREELISTS - 1]))
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[FPM_NUM_FREELISTS - 1]);
+ do
+ {
+ if (candidate->npages > largest)
+ largest = candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ else
+ {
+ Size f = FPM_NUM_FREELISTS - 1;
+
+ do
+ {
+ --f;
+ if (!relptr_is_null(fpm->freelist[f]))
+ {
+ largest = f + 1;
+ break;
+ }
+ } while (f > 0);
+ }
+
+ fpm->contiguous_pages = largest;
+ fpm->contiguous_pages_dirty = false;
+}
+
+/*
+ * Transfer a run of pages to the free page manager.
+ */
+void
+FreePageManagerPut(FreePageManager *fpm, Size first_page, Size npages)
+{
+ Size contiguous_pages;
+
+ Assert(npages > 0);
+
+ /* Record the new pages. */
+ contiguous_pages =
+ FreePageManagerPutInternal(fpm, first_page, npages, false);
+
+ /*
+ * If the new range we inserted into the page manager was contiguous with
+ * an existing range, it may have opened up cleanup opportunities.
+ */
+ if (contiguous_pages > npages)
+ {
+ Size cleanup_contiguous_pages;
+
+ cleanup_contiguous_pages = FreePageBtreeCleanup(fpm);
+ if (cleanup_contiguous_pages > contiguous_pages)
+ contiguous_pages = cleanup_contiguous_pages;
+ }
+
+ /*
+ * TODO: Figure out how to avoid setting this every time. It may not be as
+ * simple as it looks.
+ */
+ fpm->contiguous_pages_dirty = true;
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages += npages;
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+#endif
+}
+
+/*
+ * Produce a debugging dump of the state of a free page manager.
+ */
+char *
+FreePageManagerDump(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ StringInfoData buf;
+ FreePageSpanLeader *recycle;
+ bool dumped_any_freelist = false;
+ Size f;
+
+ /* Initialize output buffer. */
+ initStringInfo(&buf);
+
+ /* Dump general stuff. */
+ appendStringInfo(&buf, "metadata: self %zu max contiguous pages = %zu\n",
+ fpm->self.relptr_off, fpm->contiguous_pages);
+
+ /* Dump btree. */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root;
+
+ appendStringInfo(&buf, "btree depth %u:\n", fpm->btree_depth);
+ root = relptr_access(base, fpm->btree_root);
+ FreePageManagerDumpBtree(fpm, root, NULL, 0, &buf);
+ }
+ else if (fpm->singleton_npages > 0)
+ {
+ appendStringInfo(&buf, "singleton: %zu(%zu)\n",
+ fpm->singleton_first_page, fpm->singleton_npages);
+ }
+
+ /* Dump btree recycle list. */
+ recycle = relptr_access(base, fpm->btree_recycle);
+ if (recycle != NULL)
+ {
+ appendStringInfo(&buf, "btree recycle:");
+ FreePageManagerDumpSpans(fpm, recycle, 1, &buf);
+ }
+
+ /* Dump free lists. */
+ for (f = 0; f < FPM_NUM_FREELISTS; ++f)
+ {
+ FreePageSpanLeader *span;
+
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+ if (!dumped_any_freelist)
+ {
+ appendStringInfo(&buf, "freelists:\n");
+ dumped_any_freelist = true;
+ }
+ appendStringInfo(&buf, " %zu:", f + 1);
+ span = relptr_access(base, fpm->freelist[f]);
+ FreePageManagerDumpSpans(fpm, span, f + 1, &buf);
+ }
+
+ /* And return result to caller. */
+ return buf.data;
+}
+
+
+/*
+ * The first_page value stored at index zero in any non-root page must match
+ * the first_page value stored in its parent at the index which points to that
+ * page. So when the value stored at index zero in a btree page changes, we've
+ * got to walk up the tree adjusting ancestor keys until we reach an ancestor
+ * where that key isn't index zero. This function should be called after
+ * updating the first key on the target page; it will propagate the change
+ * upward as far as needed.
+ *
+ * We assume here that the first key on the page has not changed enough to
+ * require changes in the ordering of keys on its ancestor pages. Thus,
+ * if we search the parent page for the first key greater than or equal to
+ * the first key on the current page, the downlink to this page will be either
+ * the exact index returned by the search (if the first key decreased)
+ * or one less (if the first key increased).
+ */
+static void
+FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ Size first_page;
+ FreePageBtree *parent;
+ FreePageBtree *child;
+
+ /* This might be either a leaf or an internal page. */
+ Assert(btp->hdr.nused > 0);
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ first_page = btp->u.leaf_key[0].first_page;
+ }
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ first_page = btp->u.internal_key[0].first_page;
+ }
+ child = btp;
+
+ /* Loop until we find an ancestor that does not require adjustment. */
+ for (;;)
+ {
+ Size s;
+
+ parent = relptr_access(base, child->hdr.parent);
+ if (parent == NULL)
+ break;
+ s = FreePageBtreeSearchInternal(parent, first_page);
+
+ /* Key is either at index s or index s-1; figure out which. */
+ if (s >= parent->hdr.nused)
+ {
+ Assert(s == parent->hdr.nused);
+ --s;
+ }
+ else
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ if (check != child)
+ {
+ Assert(s > 0);
+ --s;
+ }
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* Debugging double-check. */
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ Assert(s < parent->hdr.nused);
+ Assert(child == check);
+ }
+#endif
+
+ /* Update the parent key. */
+ parent->u.internal_key[s].first_page = first_page;
+
+ /*
+ * If this is the first key in the parent, go up another level; else
+ * done.
+ */
+ if (s > 0)
+ break;
+ child = parent;
+ }
+}
+
+/*
+ * Attempt to reclaim space from the free-page btree. The return value is
+ * the largest range of contiguous pages created by the cleanup operation.
+ */
+static Size
+FreePageBtreeCleanup(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ Size max_contiguous_pages = 0;
+
+ /* Attempt to shrink the depth of the btree. */
+ while (!relptr_is_null(fpm->btree_root))
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ /* If the root contains only one key, reduce depth by one. */
+ if (root->hdr.nused == 1)
+ {
+ /* Shrink depth of tree by one. */
+ Assert(fpm->btree_depth > 0);
+ --fpm->btree_depth;
+ if (root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ /* If root is a leaf, convert only entry to singleton range. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages;
+ }
+ else
+ {
+ FreePageBtree *newroot;
+
+ /* If root is an internal page, make only child the root. */
+ Assert(root->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ relptr_copy(fpm->btree_root, root->u.internal_key[0].child);
+ newroot = relptr_access(base, fpm->btree_root);
+ relptr_store(base, newroot->hdr.parent, (FreePageBtree *) NULL);
+ }
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, root));
+ }
+ else if (root->hdr.nused == 2 &&
+ root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Size end_of_first;
+ Size start_of_second;
+
+ end_of_first = root->u.leaf_key[0].first_page +
+ root->u.leaf_key[0].npages;
+ start_of_second = root->u.leaf_key[1].first_page;
+
+ if (end_of_first + 1 == start_of_second)
+ {
+ Size root_page = fpm_pointer_to_page(base, root);
+
+ if (end_of_first == root_page)
+ {
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[0].first_page);
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[1].first_page);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages +
+ root->u.leaf_key[1].npages + 1;
+ fpm->btree_depth = 0;
+ relptr_store(base, fpm->btree_root,
+ (FreePageBtree *) NULL);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ Assert(max_contiguous_pages == 0);
+ max_contiguous_pages = fpm->singleton_npages;
+ }
+ }
+
+ /* Whether it worked or not, it's time to stop. */
+ break;
+ }
+ else
+ {
+ /* Nothing more to do. Stop. */
+ break;
+ }
+ }
+
+ /*
+ * Attempt to free recycled btree pages. We skip this if releasing the
+ * recycled page would require a btree page split, because the page we're
+ * trying to recycle would be consumed by the split, which would be
+ * counterproductive.
+ *
+ * We also currently only ever attempt to recycle the first page on the
+ * list; that could be made more aggressive, but it's not clear that the
+ * complexity would be worthwhile.
+ */
+ while (fpm->btree_recycle_count > 0)
+ {
+ FreePageBtree *btp;
+ Size first_page;
+ Size contiguous_pages;
+
+ btp = FreePageBtreeGetRecycled(fpm);
+ first_page = fpm_pointer_to_page(base, btp);
+ contiguous_pages = FreePageManagerPutInternal(fpm, first_page, 1, true);
+ if (contiguous_pages == 0)
+ {
+ FreePageBtreeRecycle(fpm, first_page);
+ break;
+ }
+ else
+ {
+ if (contiguous_pages > max_contiguous_pages)
+ max_contiguous_pages = contiguous_pages;
+ }
+ }
+
+ return max_contiguous_pages;
+}
+
+/*
+ * Consider consolidating the given page with its left or right sibling,
+ * if it's fairly empty.
+ */
+static void
+FreePageBtreeConsolidate(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *np;
+ Size max;
+
+ /*
+ * We only try to consolidate pages that are less than a third full. We
+ * could be more aggressive about this, but that might risk performing
+ * consolidation only to end up splitting again shortly thereafter. Since
+ * the btree should be very small compared to the space under management,
+ * our goal isn't so much to ensure that it always occupies the absolutely
+ * smallest possible number of pages as to reclaim pages before things get
+ * too egregiously out of hand.
+ */
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ max = FPM_ITEMS_PER_LEAF_PAGE;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ max = FPM_ITEMS_PER_INTERNAL_PAGE;
+ }
+ if (btp->hdr.nused >= max / 3)
+ return;
+
+ /*
+ * If we can fit our right sibling's keys onto this page, consolidate.
+ */
+ np = FreePageBtreeFindRightSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&btp->u.leaf_key[btp->hdr.nused], &np->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ }
+ else
+ {
+ memcpy(&btp->u.internal_key[btp->hdr.nused], &np->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, btp);
+ }
+ FreePageBtreeRemovePage(fpm, np);
+ return;
+ }
+
+ /*
+ * If we can fit our keys onto our left sibling's page, consolidate. In
+ * this case, we move our keys onto the other page rather than vice
+ * versa, to avoid having to adjust ancestor keys.
+ */
+ np = FreePageBtreeFindLeftSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&np->u.leaf_key[np->hdr.nused], &btp->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ }
+ else
+ {
+ memcpy(&np->u.internal_key[np->hdr.nused], &btp->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, np);
+ }
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+}
+
+/*
+ * Find the passed page's left sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately precedes ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindLeftSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move left. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the leftmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index > 0)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index - 1].child);
+ break;
+ }
+ Assert(index == 0);
+ ++levels;
+ }
+
+ /* Descend right. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[p->hdr.nused - 1].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Find the passed page's right sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately follows ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindRightSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move right. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the rightmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index < p->hdr.nused - 1)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index + 1].child);
+ break;
+ }
+ Assert(index == p->hdr.nused - 1);
+ ++levels;
+ }
+
+ /* Descend left. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[0].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Get the first key on a btree page.
+ */
+static Size
+FreePageBtreeFirstKey(FreePageBtree *btp)
+{
+ Assert(btp->hdr.nused > 0);
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ return btp->u.leaf_key[0].first_page;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ return btp->u.internal_key[0].first_page;
+ }
+}
+
+/*
+ * Get a page from the btree recycle list for use as a btree page.
+ */
+static FreePageBtree *
+FreePageBtreeGetRecycled(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *newhead;
+
+ Assert(victim != NULL);
+ newhead = relptr_access(base, victim->next);
+ if (newhead != NULL)
+ relptr_copy(newhead->prev, victim->prev);
+ relptr_store(base, fpm->btree_recycle, newhead);
+ Assert(fpm_pointer_is_page_aligned(base, victim));
+ fpm->btree_recycle_count--;
+ return (FreePageBtree *) victim;
+}
+
+/*
+ * Insert an item into an internal page.
+ */
+static void
+FreePageBtreeInsertInternal(char *base, FreePageBtree *btp, Size index,
+ Size first_page, FreePageBtree *child)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.internal_key[index + 1], &btp->u.internal_key[index],
+ sizeof(FreePageBtreeInternalKey) * (btp->hdr.nused - index));
+ btp->u.internal_key[index].first_page = first_page;
+ relptr_store(base, btp->u.internal_key[index].child, child);
+ ++btp->hdr.nused;
+}
+
+/*
+ * Insert an item into a leaf page.
+ */
+static void
+FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index, Size first_page,
+ Size npages)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.leaf_key[index + 1], &btp->u.leaf_key[index],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+ btp->u.leaf_key[index].first_page = first_page;
+ btp->u.leaf_key[index].npages = npages;
+ ++btp->hdr.nused;
+}
+
+/*
+ * Put a page on the btree recycle list.
+ */
+static void
+FreePageBtreeRecycle(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *head = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = 1;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->btree_recycle, span);
+ fpm->btree_recycle_count++;
+}
+
+/*
+ * Remove an item from the btree at the given position on the given page.
+ */
+static void
+FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp, Size index)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(index < btp->hdr.nused);
+
+ /* When last item is removed, extirpate entire page from btree. */
+ if (btp->hdr.nused == 1)
+ {
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+
+ /* Physically remove the key from the page. */
+ --btp->hdr.nused;
+ if (index < btp->hdr.nused)
+ memmove(&btp->u.leaf_key[index], &btp->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+
+ /* If we just removed the first key, adjust ancestor keys. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, btp);
+
+ /* Consider whether to consolidate this page with a sibling. */
+ FreePageBtreeConsolidate(fpm, btp);
+}
+
+/*
+ * Remove a page from the btree. Caller is responsible for having relocated
+ * any keys from this page that are still wanted. The page is placed on the
+ * recycle list.
+ */
+static void
+FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *parent;
+ Size index;
+ Size first_page;
+
+ for (;;)
+ {
+ /* Find parent page. */
+ parent = relptr_access(base, btp->hdr.parent);
+ if (parent == NULL)
+ {
+ /* We are removing the root page. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->btree_depth = 0;
+ Assert(fpm->singleton_first_page == 0);
+ Assert(fpm->singleton_npages == 0);
+ return;
+ }
+
+ /*
+ * If the parent contains only one item, we need to remove it as well.
+ */
+ if (parent->hdr.nused > 1)
+ break;
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+ btp = parent;
+ }
+
+ /* Find and remove the downlink. */
+ first_page = FreePageBtreeFirstKey(btp);
+ if (parent->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ index = FreePageBtreeSearchLeaf(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.leaf_key[index],
+ &parent->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ else
+ {
+ index = FreePageBtreeSearchInternal(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.internal_key[index],
+ &parent->u.internal_key[index + 1],
+ sizeof(FreePageBtreeInternalKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ parent->hdr.nused--;
+ Assert(parent->hdr.nused > 0);
+
+ /* Recycle the page. */
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+
+ /* Adjust ancestor keys if needed. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+
+ /* Consider whether to consolidate the parent with a sibling. */
+ FreePageBtreeConsolidate(fpm, parent);
+}
+
+/*
+ * Search the btree for an entry for the given first page and initialize
+ * *result with the results of the search. result->page and result->index
+ * indicate either the position of an exact match or the position at which
+ * the new key should be inserted. result->found is true for an exact match,
+ * otherwise false. result->split_pages will contain the number of additional
+ * btree pages that will be needed when performing a split to insert a key.
+ * Except as described above, the contents of fields in the result object are
+ * undefined on return.
+ */
+static void
+FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *btp = relptr_access(base, fpm->btree_root);
+ Size index;
+
+ result->split_pages = 1;
+
+ /* If the btree is empty, there's nothing to find. */
+ if (btp == NULL)
+ {
+ result->page = NULL;
+ result->found = false;
+ return;
+ }
+
+ /* Descend until we hit a leaf. */
+ while (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ FreePageBtree *child;
+ bool found_exact;
+
+ index = FreePageBtreeSearchInternal(btp, first_page);
+ found_exact = index < btp->hdr.nused &&
+ btp->u.internal_key[index].first_page == first_page;
+
+ /*
+ * If we found an exact match we descend directly. Otherwise, we
+ * descend into the child to the left if possible so that we can find
+ * the insertion point at that child's high end.
+ */
+ if (!found_exact && index > 0)
+ --index;
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_INTERNAL_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Descend to appropriate child page. */
+ Assert(index < btp->hdr.nused);
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ Assert(relptr_access(base, child->hdr.parent) == btp);
+ btp = child;
+ }
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_LEAF_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_LEAF_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Search leaf page. */
+ index = FreePageBtreeSearchLeaf(btp, first_page);
+
+ /* Assemble results. */
+ result->page = btp;
+ result->index = index;
+ result->found = index < btp->hdr.nused &&
+ first_page == btp->u.leaf_key[index].first_page;
+}
+
+/*
+ * Search an internal page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if there is none.
+ */
+static Size
+FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_INTERNAL_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.internal_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Search a leaf page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if there is none.
+ */
+static Size
+FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_LEAF_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.leaf_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Allocate a new btree page and move half the keys from the provided page
+ * to the new page. Caller is responsible for making sure that there's a
+ * page available from fpm->btree_recycle. Returns a pointer to the new page,
+ * to which caller must add a downlink.
+ */
+static FreePageBtree *
+FreePageBtreeSplitPage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ FreePageBtree *newsibling;
+
+ newsibling = FreePageBtreeGetRecycled(fpm);
+ newsibling->hdr.magic = btp->hdr.magic;
+ newsibling->hdr.nused = btp->hdr.nused / 2;
+ relptr_copy(newsibling->hdr.parent, btp->hdr.parent);
+ btp->hdr.nused -= newsibling->hdr.nused;
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ memcpy(&newsibling->u.leaf_key,
+ &btp->u.leaf_key[btp->hdr.nused],
+ sizeof(FreePageBtreeLeafKey) * newsibling->hdr.nused);
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ memcpy(&newsibling->u.internal_key,
+ &btp->u.internal_key[btp->hdr.nused],
+ sizeof(FreePageBtreeInternalKey) * newsibling->hdr.nused);
+ FreePageBtreeUpdateParentPointers(fpm_segment_base(fpm), newsibling);
+ }
+
+ return newsibling;
+}
+
+/*
+ * When internal pages are split or merged, the parent pointers of their
+ * children must be updated.
+ */
+static void
+FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp)
+{
+ Size i;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ for (i = 0; i < btp->hdr.nused; ++i)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[i].child);
+ relptr_store(base, child->hdr.parent, btp);
+ }
+}
+
+/*
+ * Debugging dump of btree data.
+ */
+static void
+FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+ Size pageno = fpm_pointer_to_page(base, btp);
+ Size index;
+ FreePageBtree *check_parent;
+
+ check_stack_depth();
+ check_parent = relptr_access(base, btp->hdr.parent);
+ appendStringInfo(buf, " %zu@%d %c", pageno, level,
+ btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ? 'i' : 'l');
+ if (parent != check_parent)
+ appendStringInfo(buf, " [actual parent %zu, expected %zu]",
+ fpm_pointer_to_page(base, check_parent),
+ fpm_pointer_to_page(base, parent));
+ appendStringInfoChar(buf, ':');
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ appendStringInfo(buf, " %zu->%zu",
+ btp->u.internal_key[index].first_page,
+ btp->u.internal_key[index].child.relptr_off / FPM_PAGE_SIZE);
+ else
+ appendStringInfo(buf, " %zu(%zu)",
+ btp->u.leaf_key[index].first_page,
+ btp->u.leaf_key[index].npages);
+ }
+ appendStringInfo(buf, "\n");
+
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ FreePageManagerDumpBtree(fpm, child, btp, level + 1, buf);
+ }
+ }
+}
+
+/*
+ * Debugging dump of free-span data.
+ */
+static void
+FreePageManagerDumpSpans(FreePageManager *fpm, FreePageSpanLeader *span,
+ Size expected_pages, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+
+ while (span != NULL)
+ {
+ if (span->npages != expected_pages)
+ appendStringInfo(buf, " %zu(%zu)", fpm_pointer_to_page(base, span),
+ span->npages);
+ else
+ appendStringInfo(buf, " %zu", fpm_pointer_to_page(base, span));
+ span = relptr_access(base, span->next);
+ }
+
+ appendStringInfo(buf, "\n");
+}
+
+/*
+ * This function allocates a run of pages of the given length from the free
+ * page manager.
+ */
+static bool
+FreePageManagerGetInternal(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = NULL;
+ FreePageSpanLeader *prev;
+ FreePageSpanLeader *next;
+ FreePageBtreeSearchResult result;
+ Size victim_page = 0; /* placate compiler */
+ Size f;
+
+ /*
+ * Search for a free span.
+ *
+ * Right now, we use a simple best-fit policy here, but it's possible for
+ * this to result in memory fragmentation if we're repeatedly asked to
+ * allocate chunks just a little smaller than what we have available.
+ * Hopefully, this is unlikely, because we expect most requests to be
+ * single pages or superblock-sized chunks -- but no policy can be optimal
+ * under all circumstances unless it has knowledge of future allocation
+ * patterns.
+ */
+ for (f = Min(npages, FPM_NUM_FREELISTS) - 1; f < FPM_NUM_FREELISTS; ++f)
+ {
+ /* Skip empty freelists. */
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+
+ /*
+ * All of the freelists except the last one contain only items of a
+ * single size, so we just take the first one. But the final free
+ * list contains everything too big for any of the other lists, so we
+ * need to search the list.
+ */
+ if (f < FPM_NUM_FREELISTS - 1)
+ victim = relptr_access(base, fpm->freelist[f]);
+ else
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[f]);
+ do
+ {
+ if (candidate->npages >= npages && (victim == NULL ||
+ victim->npages > candidate->npages))
+ {
+ victim = candidate;
+ if (victim->npages == npages)
+ break;
+ }
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ break;
+ }
+
+ /* If we didn't find an allocatable span, return failure. */
+ if (victim == NULL)
+ return false;
+
+ /* Remove span from free list. */
+ Assert(victim->magic == FREE_PAGE_SPAN_LEADER_MAGIC);
+ prev = relptr_access(base, victim->prev);
+ next = relptr_access(base, victim->next);
+ if (prev != NULL)
+ relptr_copy(prev->next, victim->next);
+ else
+ relptr_copy(fpm->freelist[f], victim->next);
+ if (next != NULL)
+ relptr_copy(next->prev, victim->prev);
+ victim_page = fpm_pointer_to_page(base, victim);
+
+ /*
+ * If we haven't initialized the btree yet, the victim must be the single
+ * span stored within the FreePageManager itself. Otherwise, we need to
+ * update the btree.
+ */
+ if (relptr_is_null(fpm->btree_root))
+ {
+ Assert(victim_page == fpm->singleton_first_page);
+ Assert(victim->npages == fpm->singleton_npages);
+ Assert(victim->npages >= npages);
+ fpm->singleton_first_page += npages;
+ fpm->singleton_npages -= npages;
+ if (fpm->singleton_npages > 0)
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ }
+ else
+ {
+ /*
+ * If the span we found is exactly the right size, remove it from the
+ * btree completely. Otherwise, adjust the btree entry to reflect the
+ * still-unallocated portion of the span, and put that portion on the
+ * appropriate free list.
+ */
+ FreePageBtreeSearch(fpm, victim_page, &result);
+ Assert(result.found);
+ if (victim->npages == npages)
+ FreePageBtreeRemove(fpm, result.page, result.index);
+ else
+ {
+ FreePageBtreeLeafKey *key;
+
+ /* Adjust btree to reflect remaining pages. */
+ Assert(victim->npages > npages);
+ key = &result.page->u.leaf_key[result.index];
+ Assert(key->npages == victim->npages);
+ key->first_page += npages;
+ key->npages -= npages;
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put the unallocated pages back on the appropriate free list. */
+ FreePagePushSpanLeader(fpm, victim_page + npages,
+ victim->npages - npages);
+ }
+ }
+
+ /* Return results to caller. */
+ *first_page = fpm_pointer_to_page(base, victim);
+ return true;
+}
+
+/*
+ * Put a range of pages into the btree and freelists, consolidating it with
+ * existing free spans just before and/or after it. If 'soft' is true,
+ * only perform the insertion if it can be done without allocating new btree
+ * pages; if false, do it always. Returns 0 if the soft flag caused the
+ * insertion to be skipped, or otherwise the size of the contiguous span
+ * created by the insertion. This may be larger than npages if we're able
+ * to consolidate with an adjacent range. fpm->contiguous_pages_dirty is set
+ * to true if the btree might allocate pages for internal purposes, which
+ * could invalidate the current largest run and require it to be recomputed.
+ */
+static Size
+FreePageManagerPutInternal(FreePageManager *fpm, Size first_page, Size npages,
+ bool soft)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtreeSearchResult result;
+ FreePageBtreeLeafKey *prevkey = NULL;
+ FreePageBtreeLeafKey *nextkey = NULL;
+ FreePageBtree *np;
+ Size nindex;
+
+ Assert(npages > 0);
+
+ /* We can store a single free span without initializing the btree. */
+ if (fpm->btree_depth == 0)
+ {
+ if (fpm->singleton_npages == 0)
+ {
+ /* Don't have a span yet; store this one. */
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return fpm->singleton_npages;
+ }
+ else if (fpm->singleton_first_page + fpm->singleton_npages ==
+ first_page)
+ {
+ /* New span immediately follows sole existing span. */
+ fpm->singleton_npages += npages;
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else if (first_page + npages == fpm->singleton_first_page)
+ {
+ /* New span immediately precedes sole existing span. */
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages += npages;
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else
+ {
+ /* Not contiguous; we need to initialize the btree. */
+ Size root_page;
+ FreePageBtree *root;
+
+ if (!relptr_is_null(fpm->btree_recycle))
+ root = FreePageBtreeGetRecycled(fpm);
+ else if (FreePageManagerGetInternal(fpm, 1, &root_page))
+ root = (FreePageBtree *) fpm_page_to_pointer(base, root_page);
+ else
+ {
+ /* We'd better be able to get a page from the existing range. */
+ elog(FATAL, "free page manager btree is corrupt");
+ }
+
+ /* Create the btree and move the preexisting range into it. */
+ root->hdr.magic = FREE_PAGE_LEAF_MAGIC;
+ root->hdr.nused = 1;
+ relptr_store(base, root->hdr.parent, (FreePageBtree *) NULL);
+ root->u.leaf_key[0].first_page = fpm->singleton_first_page;
+ root->u.leaf_key[0].npages = fpm->singleton_npages;
+ relptr_store(base, fpm->btree_root, root);
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->btree_depth = 1;
+
+ /*
+ * Corner case: it may be that the btree root took the very last
+ * free page. In that case, the sole btree entry covers a zero
+ * page run, which is invalid. Overwrite it with the entry we're
+ * trying to insert and get out.
+ */
+ if (root->u.leaf_key[0].npages == 0)
+ {
+ root->u.leaf_key[0].first_page = first_page;
+ root->u.leaf_key[0].npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return npages;
+ }
+
+ /* Fall through to insert the new key. */
+ }
+ }
+
+ /* Search the btree. */
+ FreePageBtreeSearch(fpm, first_page, &result);
+ Assert(!result.found);
+ if (result.index > 0)
+ prevkey = &result.page->u.leaf_key[result.index - 1];
+ if (result.index < result.page->hdr.nused)
+ {
+ np = result.page;
+ nindex = result.index;
+ nextkey = &result.page->u.leaf_key[result.index];
+ }
+ else
+ {
+ np = FreePageBtreeFindRightSibling(base, result.page);
+ nindex = 0;
+ if (np != NULL)
+ nextkey = &np->u.leaf_key[0];
+ }
+
+ /* Consolidate with the previous entry if possible. */
+ if (prevkey != NULL && prevkey->first_page + prevkey->npages >= first_page)
+ {
+ bool remove_next = false;
+ Size result;
+
+ Assert(prevkey->first_page + prevkey->npages == first_page);
+ prevkey->npages = (first_page - prevkey->first_page) + npages;
+
+ /* Check whether we can *also* consolidate with the following entry. */
+ if (nextkey != NULL &&
+ prevkey->first_page + prevkey->npages >= nextkey->first_page)
+ {
+ Assert(prevkey->first_page + prevkey->npages ==
+ nextkey->first_page);
+ prevkey->npages = (nextkey->first_page - prevkey->first_page)
+ + nextkey->npages;
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ remove_next = true;
+ }
+
+ /* Put the span on the correct freelist and save size. */
+ FreePagePopSpanLeader(fpm, prevkey->first_page);
+ FreePagePushSpanLeader(fpm, prevkey->first_page, prevkey->npages);
+ result = prevkey->npages;
+
+ /*
+ * If we consolidated with both the preceding and following entries,
+ * we must remove the following entry. We do this last, because
+ * removing an element from the btree may invalidate pointers we hold
+ * into the current data structure.
+ *
+ * NB: The btree is technically in an invalid state at this point
+ * because we've already updated prevkey to cover the same key space
+ * as nextkey. FreePageBtreeRemove() shouldn't notice that, though.
+ */
+ if (remove_next)
+ FreePageBtreeRemove(fpm, np, nindex);
+
+ return result;
+ }
+
+ /* Consolidate with the next entry if possible. */
+ if (nextkey != NULL && first_page + npages >= nextkey->first_page)
+ {
+ Size newpages;
+
+ /* Compute new size for span. */
+ Assert(first_page + npages == nextkey->first_page);
+ newpages = (nextkey->first_page - first_page) + nextkey->npages;
+
+ /* Put span on correct free list. */
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ FreePagePushSpanLeader(fpm, first_page, newpages);
+
+ /* Update key in place. */
+ nextkey->first_page = first_page;
+ nextkey->npages = newpages;
+
+ /* If reducing first key on page, ancestors might need adjustment. */
+ if (nindex == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, np);
+
+ return nextkey->npages;
+ }
+
+ /* Split leaf page and as many of its ancestors as necessary. */
+ if (result.split_pages > 0)
+ {
+ /*
+ * NB: We could consider various coping strategies here to avoid a
+ * split; most obviously, if np != result.page, we could target that
+ * page instead. More complicated shuffling strategies could be
+ * possible as well; basically, unless every single leaf page is 100%
+ * full, we can jam this key in there if we try hard enough. It's
+ * unlikely that trying that hard is worthwhile, but it's possible we
+ * might need to make more than no effort. For now, we just do the
+ * easy thing, which is nothing.
+ */
+
+ /* If this is a soft insert, it's time to give up. */
+ if (soft)
+ return 0;
+
+ /*
+ * Past this point we might allocate btree pages, which could
+ * potentially shorten any existing run which might happen to be the
+ * current longest. So fpm->contiguous_pages needs to be recomputed.
+ */
+ fpm->contiguous_pages_dirty = true;
+
+ /* Check whether we need to allocate more btree pages to split. */
+ if (result.split_pages > fpm->btree_recycle_count)
+ {
+ Size pages_needed;
+ Size recycle_page;
+ Size i;
+
+ /*
+ * Allocate the required number of pages and split each one in
+ * turn. This should never fail, because if we've got enough
+ * spans of free pages kicking around that we need additional
+ * storage space just to remember them all, then we should
+ * certainly have enough to expand the btree, which should only
+ * ever use a tiny number of pages compared to the number under
+ * management. If it does, something's badly screwed up.
+ */
+ pages_needed = result.split_pages - fpm->btree_recycle_count;
+ for (i = 0; i < pages_needed; ++i)
+ {
+ if (!FreePageManagerGetInternal(fpm, 1, &recycle_page))
+ elog(FATAL, "free page manager btree is corrupt");
+ FreePageBtreeRecycle(fpm, recycle_page);
+ }
+
+ /*
+ * The act of allocating pages to recycle may have invalidated the
+ * results of our previous btree search, so repeat it. (We could
+ * recheck whether any of our split-avoidance strategies that were
+ * not viable before now are, but it hardly seems worthwhile, so
+ * we don't bother. Consolidation can't be possible now if it
+ * wasn't previously.)
+ */
+ FreePageBtreeSearch(fpm, first_page, &result);
+
+ /*
+ * The act of allocating pages for use in constructing our btree
+ * should never cause any page to become more full, so the new
+ * split depth should be no greater than the old one, and perhaps
+ * less if we fortuitously allocated a chunk that freed up a slot
+ * on the page we need to update.
+ */
+ Assert(result.split_pages <= fpm->btree_recycle_count);
+ }
+
+ /* If we still need to perform a split, do it. */
+ if (result.split_pages > 0)
+ {
+ FreePageBtree *split_target = result.page;
+ FreePageBtree *child = NULL;
+ Size key = first_page;
+
+ for (;;)
+ {
+ FreePageBtree *newsibling;
+ FreePageBtree *parent;
+
+ /* Identify parent page, which must receive downlink. */
+ parent = relptr_access(base, split_target->hdr.parent);
+
+ /* Split the page - downlink not added yet. */
+ newsibling = FreePageBtreeSplitPage(fpm, split_target);
+
+ /*
+ * At this point in the loop, we're always carrying a pending
+ * insertion. On the first pass, it's the actual key we're
+ * trying to insert; on subsequent passes, it's the downlink
+ * that needs to be added as a result of the split performed
+ * during the previous loop iteration. Since we've just split
+ * the page, there's definitely room on one of the two
+ * resulting pages.
+ */
+ if (child == NULL)
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into = key < newsibling->u.leaf_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchLeaf(insert_into, key);
+ FreePageBtreeInsertLeaf(insert_into, index, key, npages);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+ else
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into =
+ key < newsibling->u.internal_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchInternal(insert_into, key);
+ FreePageBtreeInsertInternal(base, insert_into, index,
+ key, child);
+ relptr_store(base, child->hdr.parent, insert_into);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+
+ /* If the page we just split has no parent, split the root. */
+ if (parent == NULL)
+ {
+ FreePageBtree *newroot;
+
+ newroot = FreePageBtreeGetRecycled(fpm);
+ newroot->hdr.magic = FREE_PAGE_INTERNAL_MAGIC;
+ newroot->hdr.nused = 2;
+ relptr_store(base, newroot->hdr.parent,
+ (FreePageBtree *) NULL);
+ newroot->u.internal_key[0].first_page =
+ FreePageBtreeFirstKey(split_target);
+ relptr_store(base, newroot->u.internal_key[0].child,
+ split_target);
+ relptr_store(base, split_target->hdr.parent, newroot);
+ newroot->u.internal_key[1].first_page =
+ FreePageBtreeFirstKey(newsibling);
+ relptr_store(base, newroot->u.internal_key[1].child,
+ newsibling);
+ relptr_store(base, newsibling->hdr.parent, newroot);
+ relptr_store(base, fpm->btree_root, newroot);
+ fpm->btree_depth++;
+
+ break;
+ }
+
+ /* If the parent page isn't full, insert the downlink. */
+ key = newsibling->u.internal_key[0].first_page;
+ if (parent->hdr.nused < FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Size index;
+
+ index = FreePageBtreeSearchInternal(parent, key);
+ FreePageBtreeInsertInternal(base, parent, index,
+ key, newsibling);
+ relptr_store(base, newsibling->hdr.parent, parent);
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+ break;
+ }
+
+ /* The parent also needs to be split, so loop around. */
+ child = newsibling;
+ split_target = parent;
+ }
+
+ /*
+ * The loop above did the insert, so just need to update the free
+ * list, and we're done.
+ */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+ }
+ }
+
+ /* Physically add the key to the page. */
+ Assert(result.page->hdr.nused < FPM_ITEMS_PER_LEAF_PAGE);
+ FreePageBtreeInsertLeaf(result.page, result.index, first_page, npages);
+
+ /* If new first key on page, ancestors might need adjustment. */
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put it on the free list. */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+}
+
+/*
+ * Remove a FreePageSpanLeader from the linked-list that contains it, either
+ * because we're changing the size of the span, or because we're allocating it.
+ */
+static void
+FreePagePopSpanLeader(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *span;
+ FreePageSpanLeader *next;
+ FreePageSpanLeader *prev;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+
+ next = relptr_access(base, span->next);
+ prev = relptr_access(base, span->prev);
+ if (next != NULL)
+ relptr_copy(next->prev, span->prev);
+ if (prev != NULL)
+ relptr_copy(prev->next, span->next);
+ else
+ {
+ Size f = Min(span->npages, FPM_NUM_FREELISTS) - 1;
+
+ Assert(fpm->freelist[f].relptr_off == pageno * FPM_PAGE_SIZE);
+ relptr_copy(fpm->freelist[f], span->next);
+ }
+}
+
+/*
+ * Initialize a new FreePageSpanLeader and put it on the appropriate free list.
+ */
+static void
+FreePagePushSpanLeader(FreePageManager *fpm, Size first_page, Size npages)
+{
+ char *base = fpm_segment_base(fpm);
+ Size f = Min(npages, FPM_NUM_FREELISTS) - 1;
+ FreePageSpanLeader *head = relptr_access(base, fpm->freelist[f]);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, first_page);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = npages;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->freelist[f], span);
+}
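
For readers skimming the patch, here is a small standalone sketch of the freelist selection policy used by FreePageManagerGetInternal above: spans of 1 to 128 pages each get a list of their own, everything larger shares the final list, and only that final list needs a best-fit scan. This is not part of the patch. The sketch_ names and the flat array of span sizes are assumptions made for illustration; the real code walks FreePageSpanLeader lists through relative pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for FPM_NUM_FREELISTS in freepage.h. */
#define SKETCH_NUM_FREELISTS 129

/*
 * Map a span length in pages to a freelist index.  Lists 0..127 hold spans
 * of exactly 1..128 pages; the last list holds every larger span.
 */
static size_t
sketch_freelist_index(size_t npages)
{
    assert(npages > 0);
    return (npages < SKETCH_NUM_FREELISTS ? npages : SKETCH_NUM_FREELISTS) - 1;
}

/*
 * Best-fit scan over the spans on the final list, modelled here as a plain
 * array of span sizes.  Returns the index of the smallest span that can
 * satisfy the request, or -1 if none is big enough.  An exact match ends
 * the scan early, as in the patch.
 */
static int
sketch_best_fit(const size_t *span_npages, int nspans, size_t npages)
{
    int victim = -1;
    int i;

    for (i = 0; i < nspans; ++i)
    {
        if (span_npages[i] >= npages &&
            (victim < 0 || span_npages[victim] > span_npages[i]))
        {
            victim = i;
            if (span_npages[victim] == npages)
                break;
        }
    }
    return victim;
}
```

With span sizes {500, 200, 300, 150} on the final list, a request for 180 pages picks the 200-page span, and a request for 600 pages fails.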
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
new file mode 100644
index 0000000..194fe81
--- /dev/null
+++ b/src/include/utils/dsa.h
@@ -0,0 +1,104 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.h
+ * Dynamic shared memory areas.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/utils/dsa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSA_H
+#define DSA_H
+
+#include "postgres.h"
+
+#include "port/atomics.h"
+#include "storage/dsm.h"
+
+/* The opaque type used for an area. */
+struct dsa_area;
+typedef struct dsa_area dsa_area;
+
+/*
+ * If this system doesn't support atomic operations on 64 bit values then
+ * we fall back to 32 bit dsa_pointer. For testing purposes,
+ * USE_SMALL_DSA_POINTER can be defined to force the use of 32 bit
+ * dsa_pointer even on systems that support 64 bit atomics.
+ */
+#ifndef PG_HAVE_ATOMIC_U64_SUPPORT
+#define SIZEOF_DSA_POINTER 4
+#else
+#ifdef USE_SMALL_DSA_POINTER
+#define SIZEOF_DSA_POINTER 4
+#else
+#define SIZEOF_DSA_POINTER 8
+#endif
+#endif
+
+/*
+ * The type of 'relative pointers' to memory allocated by a dynamic shared
+ * area. dsa_pointer values can be shared with other processes, but must be
+ * converted to backend-local pointers before they can be dereferenced. See
+ * dsa_get_address. Also, an atomic version and appropriately sized atomic
+ * operations.
+ */
+#if SIZEOF_DSA_POINTER == 4
+typedef uint32 dsa_pointer;
+typedef pg_atomic_uint32 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u32
+#define dsa_pointer_atomic_read pg_atomic_read_u32
+#define dsa_pointer_atomic_write pg_atomic_write_u32
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u32
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u32
+#else
+typedef uint64 dsa_pointer;
+typedef pg_atomic_uint64 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u64
+#define dsa_pointer_atomic_read pg_atomic_read_u64
+#define dsa_pointer_atomic_write pg_atomic_write_u64
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u64
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u64
+#endif
+
+/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
+#define InvalidDsaPointer ((dsa_pointer) 0)
+
+/* Check if a dsa_pointer value is valid. */
+#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
+
+/*
+ * The type used for dsa_area handles. dsa_handle values can be shared with
+ * other processes, so that those processes can attach to the area. This
+ * provides a way to share allocated storage with other processes.
+ *
+ * The handle for a dsa_area is currently implemented as the dsm_handle
+ * for the first DSM segment backing this dynamic storage area, but client
+ * code shouldn't assume that is true.
+ */
+typedef dsm_handle dsa_handle;
+
+extern void dsa_startup(void);
+
+extern dsa_area *dsa_create(int tranche_id, const char *tranche_name);
+extern dsa_area *dsa_create_in_place(void *place, Size size,
+ int tranche_id, const char *tranche_name);
+extern dsa_area *dsa_attach(dsa_handle handle);
+extern dsa_area *dsa_attach_in_place(void *place);
+extern void dsa_on_dsm_detach(dsm_segment *, Datum arg);
+extern void dsa_pin_mapping(dsa_area *area);
+extern void dsa_detach(dsa_area *area);
+extern void dsa_pin(dsa_area *area);
+extern void dsa_unpin(dsa_area *area);
+extern void dsa_set_size_limit(dsa_area *area, Size limit);
+extern dsa_handle dsa_get_handle(dsa_area *area);
+extern dsa_pointer dsa_allocate(dsa_area *area, Size size);
+extern void dsa_free(dsa_area *area, dsa_pointer dp);
+extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern void dsa_trim(dsa_area *area);
+extern void dsa_dump(dsa_area *area);
+
+#endif /* DSA_H */
diff --git a/src/include/utils/freepage.h b/src/include/utils/freepage.h
new file mode 100644
index 0000000..e509ca2
--- /dev/null
+++ b/src/include/utils/freepage.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.h
+ * Management of page-organized free memory.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/freepage.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef FREEPAGE_H
+#define FREEPAGE_H
+
+#include "storage/lwlock.h"
+#include "utils/relptr.h"
+
+/* Forward declarations. */
+typedef struct FreePageSpanLeader FreePageSpanLeader;
+typedef struct FreePageBtree FreePageBtree;
+typedef struct FreePageManager FreePageManager;
+
+/*
+ * PostgreSQL normally uses 8kB pages for most things, but many common
+ * architecture/operating system pairings use a 4kB page size for memory
+ * allocation, so we do that here also. We assume that a large allocation
+ * is likely to begin on a page boundary; if not, we'll discard bytes from
+ * the beginning and end of the object and use only the middle portion that
+ * is properly aligned. This works, but is not ideal, so it's best to keep
+ * this conservatively small. There don't seem to be any common architectures
+ * where the page size is less than 4kB, so this should be good enough; also,
+ * making it smaller would increase the space consumed by the address space
+ * map, which also uses this page size.
+ */
+#define FPM_PAGE_SIZE 4096
+
+/*
+ * Each freelist except for the last contains only spans of one particular
+ * size. Everything larger goes on the last one. In some sense this seems
+ * like a waste since most allocations are in a few common sizes, but it
+ * means that small allocations can simply pop the head of the relevant list
+ * without needing to worry about whether the object we find there is of
+ * precisely the correct size (because we know it must be).
+ */
+#define FPM_NUM_FREELISTS 129
+
+/* Define relative pointer types. */
+relptr_declare(FreePageBtree, RelptrFreePageBtree);
+relptr_declare(FreePageManager, RelptrFreePageManager);
+relptr_declare(FreePageSpanLeader, RelptrFreePageSpanLeader);
+
+/* Everything we need in order to manage free pages (see freepage.c) */
+struct FreePageManager
+{
+ RelptrFreePageManager self;
+ RelptrFreePageBtree btree_root;
+ RelptrFreePageSpanLeader btree_recycle;
+ unsigned btree_depth;
+ unsigned btree_recycle_count;
+ Size singleton_first_page;
+ Size singleton_npages;
+ Size contiguous_pages;
+ bool contiguous_pages_dirty;
+ RelptrFreePageSpanLeader freelist[FPM_NUM_FREELISTS];
+#ifdef FPM_EXTRA_ASSERTS
+ /* For debugging only, pages put minus pages gotten. */
+ Size free_pages;
+#endif
+};
+
+/* Macros to convert between page numbers (expressed as Size) and pointers. */
+#define fpm_page_to_pointer(base, page) \
+ (AssertVariableIsOfTypeMacro(page, Size), \
+ (base) + FPM_PAGE_SIZE * (page))
+#define fpm_pointer_to_page(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) / FPM_PAGE_SIZE)
+
+/* Macro to convert an allocation size to a number of pages. */
+#define fpm_size_to_pages(sz) \
+ (((sz) + FPM_PAGE_SIZE - 1) / FPM_PAGE_SIZE)
+
+/* Macros to check alignment of absolute and relative pointers. */
+#define fpm_pointer_is_page_aligned(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) % FPM_PAGE_SIZE == 0)
+#define fpm_relptr_is_page_aligned(base, relptr) \
+ ((relptr).relptr_off % FPM_PAGE_SIZE == 0)
+
+/* Macro to find base address of the segment containing a FreePageManager. */
+#define fpm_segment_base(fpm) \
+ (((char *) (fpm)) - (fpm)->self.relptr_off)
+
+/* Macro to access a FreePageManager's largest consecutive run of pages. */
+#define fpm_largest(fpm) \
+ (fpm->contiguous_pages)
+
+/* Functions to manipulate the free page map. */
+extern void FreePageManagerInitialize(FreePageManager *fpm, char *base);
+extern bool FreePageManagerGet(FreePageManager *fpm, Size npages,
+ Size *first_page);
+extern void FreePageManagerPut(FreePageManager *fpm, Size first_page,
+ Size npages);
+extern char *FreePageManagerDump(FreePageManager *fpm);
+
+#endif /* FREEPAGE_H */
diff --git a/src/include/utils/relptr.h b/src/include/utils/relptr.h
new file mode 100644
index 0000000..a97dc96
--- /dev/null
+++ b/src/include/utils/relptr.h
@@ -0,0 +1,70 @@
+/*-------------------------------------------------------------------------
+ *
+ * relptr.h
+ * This file contains basic declarations for relative pointers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/relptr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef RELPTR_H
+#define RELPTR_H
+
+/*
+ * Relative pointers are intended to be used when storing an address that may
+ * be relative either to the base of the process's address space or some
+ * dynamic shared memory segment mapped therein.
+ *
+ * The idea here is that you declare a relative pointer as relptr(type)
+ * and then use relptr_access to dereference it and relptr_store to change
+ * it. The use of a union here is a hack, because what's stored in the
+ * relptr is always a Size, never an actual pointer. But including a pointer
+ * in the union allows us to use stupid macro tricks to provide some measure
+ * of type-safety.
+ */
+#define relptr(type) union { type *relptr_type; Size relptr_off; }
+
+#define relptr_declare(type, name) \
+ typedef union { type *relptr_type; Size relptr_off; } name;
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (__typeof__((rp).relptr_type)) ((rp).relptr_off == 0 ? NULL : \
+ (base + (rp).relptr_off)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (void *) ((rp).relptr_off == 0 ? NULL : (base + (rp).relptr_off)))
+#endif
+
+#define relptr_is_null(rp) \
+ ((rp).relptr_off == 0)
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ AssertVariableIsOfTypeMacro(val, __typeof__((rp).relptr_type)), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#endif
+
+#define relptr_copy(rp1, rp2) \
+ ((rp1).relptr_off = (rp2).relptr_off)
+
+#endif /* RELPTR_H */
On Thu, Nov 24, 2016 at 1:07 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
The attached patch is just for discussion only... I need to resolve
that contiguous_pages question and do some more testing.
As Dilip discovered, there was a problem with resource cleanup for DSA
areas created inside pre-existing DSM segments, which I've now sorted
out in the attached version. I also updated the copyright messages,
introduced a couple of the new 'unlikely' macros in the address
decoding path, and introduced high_segment_index to avoid scanning
bigger segment arrays than is necessary sometimes.
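For anyone following along without the whole patch in front of them, here is a rough standalone sketch of what the address decoding path does. It assumes the 64-bit layout (40 offset bits) from dsa.c; everything prefixed sketch_ is made up for illustration and is not in the patch. The real code maps missing segments on demand, whereas this sketch simply returns NULL for an unmapped slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-bit pointers, i.e. the SIZEOF_DSA_POINTER == 8 case. */
typedef uint64_t sketch_dsa_pointer;

#define SKETCH_OFFSET_WIDTH 40
#define SKETCH_OFFSET_BITMASK \
	(((sketch_dsa_pointer) 1 << SKETCH_OFFSET_WIDTH) - 1)

/* Branch hint in the spirit of the 'unlikely' macro; plain test elsewhere. */
#if defined(__GNUC__)
#define sketch_unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define sketch_unlikely(x) ((x) != 0)
#endif

/* Pack a segment number and a byte offset into one shareable value. */
static sketch_dsa_pointer
sketch_make_pointer(uint64_t segment, uint64_t offset)
{
	return ((sketch_dsa_pointer) segment << SKETCH_OFFSET_WIDTH) | offset;
}

static uint64_t
sketch_segment(sketch_dsa_pointer dp)
{
	return dp >> SKETCH_OFFSET_WIDTH;
}

static uint64_t
sketch_offset(sketch_dsa_pointer dp)
{
	return dp & SKETCH_OFFSET_BITMASK;
}

/*
 * Convert to a backend-local address using a hypothetical per-backend table
 * of segment base addresses.  Returns NULL if the segment is out of range
 * or not mapped in this backend.
 */
static char *
sketch_get_address(char *const *bases, size_t nbases, sketch_dsa_pointer dp)
{
	uint64_t	seg = sketch_segment(dp);

	if (sketch_unlikely(seg >= nbases || bases[seg] == NULL))
		return NULL;
	return bases[seg] + sketch_offset(dp);
}
```

The point of the split is that only the table of base addresses is backend-local; the packed value itself means the same thing in every backend.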
As for contiguous_pages_dirty, I see what was missing from earlier
attempts at more subtle invalidation: we had failed to set the flag in
cases where FreePageManagerGetInternal was called during a
FreePageManagerPut operation. What do you think about the logic in
this patch... do you see any ways for contiguous_pages to get out of
date? There is a new assertion that contiguous_pages matches the
state of the freelists at the end of FreePageManagerGet and
FreePageManagerPut, enabled if you define FPM_EXTRA_ASSERTS, and this
passes my random allocation pattern testing.
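To make the invariant concrete, here is a toy model of the idea, not the patch's implementation: a flat array of per-page flags stands in for the btree and freelists, and all toy_ names are invented for illustration. Every operation that changes the set of free pages sets the dirty flag, and the end-of-operation check recomputes the largest run from scratch and compares it with the cached value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 64

typedef struct
{
	bool		free_page[TOY_NPAGES];	/* stand-in for btree + freelists */
	size_t		contiguous_pages;		/* cached largest free run */
	bool		contiguous_pages_dirty; /* cache needs recomputing? */
} toy_fpm;

/* Recompute the largest run of free pages the slow way. */
static size_t
toy_scan_largest(const toy_fpm *fpm)
{
	size_t		best = 0;
	size_t		run = 0;

	for (size_t i = 0; i < TOY_NPAGES; i++)
	{
		run = fpm->free_page[i] ? run + 1 : 0;
		if (run > best)
			best = run;
	}
	return best;
}

/* Refresh the cache if dirty, then check it against the real state. */
static void
toy_update_contiguous_pages(toy_fpm *fpm)
{
	if (fpm->contiguous_pages_dirty)
	{
		fpm->contiguous_pages = toy_scan_largest(fpm);
		fpm->contiguous_pages_dirty = false;
	}
	assert(fpm->contiguous_pages == toy_scan_largest(fpm));
}

static void
toy_init(toy_fpm *fpm)
{
	for (size_t i = 0; i < TOY_NPAGES; i++)
		fpm->free_page[i] = false;
	fpm->contiguous_pages = 0;
	fpm->contiguous_pages_dirty = false;
}

/* Give a run of pages back; this may merge with neighbours, so mark dirty. */
static void
toy_put(toy_fpm *fpm, size_t first_page, size_t npages)
{
	assert(first_page + npages <= TOY_NPAGES);
	for (size_t i = 0; i < npages; i++)
		fpm->free_page[first_page + i] = true;
	fpm->contiguous_pages_dirty = true;
	toy_update_contiguous_pages(fpm);
}

/* Take the first run of npages free pages, if the cache says one exists. */
static bool
toy_get(toy_fpm *fpm, size_t npages, size_t *first_page)
{
	size_t		run = 0;

	if (npages == 0 || npages > fpm->contiguous_pages)
		return false;
	for (size_t i = 0; i < TOY_NPAGES; i++)
	{
		run = fpm->free_page[i] ? run + 1 : 0;
		if (run == npages)
		{
			size_t		start = i + 1 - npages;

			for (size_t j = 0; j < npages; j++)
				fpm->free_page[start + j] = false;
			*first_page = start;
			fpm->contiguous_pages_dirty = true;
			toy_update_contiguous_pages(fpm);
			return true;
		}
	}
	return false;
}
```

In this toy the check always runs; in the patch the equivalent assertion is only compiled in under FPM_EXTRA_ASSERTS, since the full rescan is too expensive for normal builds.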
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
dsa-v7.patch (application/octet-stream)
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index b2403e1..1842bae 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/mmgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = aset.o mcxt.o portalmem.o
+OBJS = aset.o dsa.o freepage.o mcxt.o portalmem.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
new file mode 100644
index 0000000..1565277
--- /dev/null
+++ b/src/backend/utils/mmgr/dsa.c
@@ -0,0 +1,2111 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.c
+ * Dynamic shared memory areas.
+ *
+ * This module provides dynamic shared memory areas which are built on top of
+ * DSM segments. While dsm.c allows segments of shared memory to be created
+ * and shared between backends, it isn't designed to deal with small objects.
+ * A DSA area is a shared memory heap backed by one or more DSM segments,
+ * from which memory is allocated and freed with dsa_allocate() and dsa_free().
+ * Unlike the regular system heap, it deals in pseudo-pointers which must be
+ * converted to backend-local pointers before they are dereferenced. These
+ * pseudo-pointers can however be shared with other backends, and can be used
+ * to construct shared data structures.
+ *
+ * Each DSA area manages one or more DSM segments, adding new segments as
+ * required and detaching them when they are no longer needed. Each segment
+ * contains a number of 4KB pages, a free page manager for tracking
+ * consecutive runs of free pages, and a page map for tracking the source of
+ * objects allocated on each page. Allocation requests above 8KB are handled
+ * by choosing a segment and finding consecutive free pages in its free page
+ * manager. Allocation requests for smaller sizes are handled using pools of
+ * objects of a selection of sizes. Each pool consists of a number of 16 page
+ * (64KB) superblocks allocated in the same way as large objects. Allocation
+ * of large objects and new superblocks is serialized by a single LWLock, but
+ * allocation of small objects from pre-existing superblocks uses one LWLock
+ * per pool. Currently there is one pool, and therefore one lock, per size
+ * class. Per-core pools to increase concurrency and strategies for reducing
+ * the resulting fragmentation are areas for future research. Each superblock
+ * is managed with a 'span', which tracks the superblock's freelist. Free
+ * requests are handled by looking in the page map to find which span an
+ * address was allocated from, so that small objects can be returned to the
+ * appropriate free list, and large object pages can be returned directly to
+ * the free page map. When allocating, simple heuristics for selecting
+ * segments and superblocks try to encourage occupied memory to be
+ * concentrated, increasing the likelihood that whole superblocks can become
+ * empty and be returned to the free page manager, and whole segments can
+ * become empty and be returned to the operating system.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/dsa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "utils/dsa.h"
+#include "utils/freepage.h"
+#include "utils/memutils.h"
+
+/*
+ * The minimum size of the space that can be provided to dsa_create_in_place,
+ * when using user-supplied memory.
+ */
+#define DSA_MINIMUM_IN_PLACE_SIZE ((Size) (64 * 1024))
+
+/*
+ * The size of the initial DSM segment that backs a dsa_area created by
+ * dsa_create. After creating some number of segments of this size we'll
+ * double this size, and so on. Larger segments may be created if necessary
+ * to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE ((Size) (1 * 1024 * 1024))
+
+/*
+ * How many segments to create before we double the segment size. If this is
+ * low, then there is likely to be a lot of wasted space in the largest
+ * segment. If it is high, then we risk running out of segment slots (see
+ * dsm.c's limits on total number of segments), or limiting the total size
+ * an area can manage when using small pointers.
+ */
+#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 4
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
+#endif
+
+/*
+ * The maximum number of DSM segments that an area can own, determined by
+ * the number of bits remaining (but capped at 1024).
+ */
+#define DSA_MAX_SEGMENTS \
+ Min(1024, (1 << ((SIZEOF_DSA_POINTER * 8) - DSA_OFFSET_WIDTH)))
+
+/* The bitmask for extracting the offset from a dsa_pointer. */
+#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
+/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
+#define DSA_PAGES_PER_SUPERBLOCK 16
+
+/*
+ * A magic number used as a sanity check for following DSM segments belonging
+ * to a DSA area (this number will be XORed with the area handle and
+ * the segment index).
+ */
+#define DSA_SEGMENT_HEADER_MAGIC 0x0ce26608
+
+/* Build a dsa_pointer given a segment number and offset. */
+#define DSA_MAKE_POINTER(segment_number, offset) \
+ (((dsa_pointer) (segment_number) << DSA_OFFSET_WIDTH) | (offset))
+
+/* Extract the segment number from a dsa_pointer. */
+#define DSA_EXTRACT_SEGMENT_NUMBER(dp) ((dp) >> DSA_OFFSET_WIDTH)
+
+/* Extract the offset from a dsa_pointer. */
+#define DSA_EXTRACT_OFFSET(dp) ((dp) & DSA_OFFSET_BITMASK)
+
+/* The type used for segment indexes (zero based). */
+typedef Size dsa_segment_index;
+
+/* Sentinel value for dsa_segment_index indicating 'none' or 'end'. */
+#define DSA_SEGMENT_INDEX_NONE (~(dsa_segment_index)0)
+
+/*
+ * How many bins of segments do we have? The bins are used to categorize
+ * segments by their largest contiguous run of free pages.
+ */
+#define DSA_NUM_SEGMENT_BINS 16
+
+/*
+ * What is the lowest bin that holds segments that *might* have n contiguous
+ * free pages? There is no point in looking in segments in lower bins; they
+ * definitely can't service a request for n free pages.
+ */
+#define contiguous_pages_to_segment_bin(n) Min(fls(n), DSA_NUM_SEGMENT_BINS - 1)
+
+/* Macros for access to locks. */
+#define DSA_AREA_LOCK(area) (&area->control->lock)
+#define DSA_SCLASS_LOCK(area, sclass) (&area->control->pools[sclass].lock)
+
+/*
+ * The header for an individual segment. This lives at the start of each DSM
+ * segment owned by a DSA area including the first segment (where it appears
+ * as part of the dsa_area_control struct).
+ */
+typedef struct
+{
+ /* Sanity check magic value. */
+ uint32 magic;
+ /* Total number of pages in this segment (excluding metadata area). */
+ Size usable_pages;
+ /* Total size of this segment in bytes. */
+ Size size;
+
+ /*
+ * Index of the segment that precedes this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the first one.
+ */
+ dsa_segment_index prev;
+
+ /*
+ * Index of the segment that follows this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the last one.
+ */
+ dsa_segment_index next;
+ /* The index of the bin that contains this segment. */
+ Size bin;
+
+ /*
+ * A flag raised to indicate that this segment is being returned to the
+ * operating system and has been unpinned.
+ */
+ bool freed;
+} dsa_segment_header;
+
+/*
+ * Metadata for one superblock.
+ *
+ * For most blocks, span objects are stored out-of-line; that is, the span
+ * object is not stored within the block itself. But, as an exception, for a
+ * "span of spans", the span object is stored "inline". The allocation is
+ * always exactly one page, and the dsa_area_span object is located at
+ * the beginning of that page. The size class is DSA_SCLASS_BLOCK_OF_SPANS,
+ * and the remaining fields are used just as they would be in an ordinary
+ * block. We can't allocate spans out of ordinary superblocks because
+ * creating an ordinary superblock requires us to be able to allocate a span
+ * *first*. Doing it this way avoids that circularity.
+ */
+typedef struct
+{
+ dsa_pointer pool; /* Containing pool. */
+ dsa_pointer prevspan; /* Previous span. */
+ dsa_pointer nextspan; /* Next span. */
+ dsa_pointer start; /* Starting address. */
+ Size npages; /* Length of span in pages. */
+ uint16 size_class; /* Size class. */
+ uint16 ninitialized; /* Maximum number of objects ever allocated. */
+ uint16 nallocatable; /* Number of objects currently allocatable. */
+ uint16 firstfree; /* First object on free list. */
+ uint16 nmax; /* Maximum number of objects ever possible. */
+ uint16 fclass; /* Current fullness class. */
+} dsa_area_span;
+
+/*
+ * Given a pointer to an object in a span, access the index of the next free
+ * object in the same span (ie in the span's freelist) as an L-value.
+ */
+#define NextFreeObjectIndex(object) (* (uint16 *) (object))
+
+/*
+ * Small allocations are handled by dividing a single block of memory into
+ * many small objects of equal size. The possible allocation sizes are
+ * defined by the following array. Larger size classes are spaced more widely
+ * than smaller size classes. We fudge the spacing for size classes >1kB to
+ * avoid space wastage: based on the knowledge that we plan to allocate 64kB
+ * blocks, we bump the maximum object size up to the largest multiple of
+ * 8 bytes that still lets us fit the same number of objects into one block.
+ *
+ * NB: Because of this fudging, if we were ever to use differently-sized blocks
+ * for small allocations, these size classes would need to be reworked to be
+ * optimal for the new size.
+ *
+ * NB: The optimal spacing for size classes, as well as the size of the blocks
+ * out of which small objects are allocated, is not a question that has one
+ * right answer. Some allocators (such as tcmalloc) use more closely-spaced
+ * size classes than we do here, while others (like aset.c) use more
+ * widely-spaced classes. Spacing the classes more closely avoids wasting
+ * memory within individual chunks, but also means a larger number of
+ * potentially-unfilled blocks.
+ */
+static const uint16 dsa_size_classes[] = {
+ sizeof(dsa_area_span), 0, /* special size classes */
+ 8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
+ 80, 96, 112, 128, /* 4 classes separated by 16 bytes */
+ 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
+ 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
+ 640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
+ 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
+ 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
+ 5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
+};
+#define DSA_NUM_SIZE_CLASSES lengthof(dsa_size_classes)
+
+/* Special size classes. */
+#define DSA_SCLASS_BLOCK_OF_SPANS 0
+#define DSA_SCLASS_SPAN_LARGE 1
+
+/*
+ * The following lookup table is used to map the size of small objects
+ * (less than 1kB) onto the corresponding size class. To use this table,
+ * round the size of the object up to the next multiple of 8 bytes, and then
+ * index into this array.
+ */
+static char dsa_size_class_map[] = {
+ 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 11, 12, 12, 13, 13,
+ 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17,
+ 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19,
+ 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21,
+ 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
+ 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23,
+ 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+ 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25
+};
+#define DSA_SIZE_CLASS_MAP_QUANTUM 8
+
+/*
+ * Superblocks are binned by how full they are. Generally, each fullness
+ * class corresponds to one quartile, but the block being used for
+ * allocations is always at the head of the list for fullness class 1,
+ * regardless of how full it really is.
+ *
+ * For large objects, we just stick all of the allocations in fullness class
+ * 0. Since we can just return the space directly to the free page manager,
+ * we don't really need them on a list at all, except that if someone wants
+ * to bulk release everything allocated using this dsa_area, we
+ * have no other way of finding them.
+ */
+#define DSA_FULLNESS_CLASSES 4
+
+/*
+ * Maximum length of a DSA name.
+ */
+#define DSA_MAXLEN 64
+
+/*
+ * A dsa_area_pool represents a set of objects of a given size class.
+ *
+ * Perhaps there should be multiple pools for the same size class for
+ * contention avoidance, but for now there is just one!
+ */
+typedef struct
+{
+ /* A lock protecting access to this pool. */
+ LWLock lock;
+ /* A set of linked lists of spans, arranged by fullness. */
+ dsa_pointer spans[DSA_FULLNESS_CLASSES];
+ /* Should we pad this out to a cacheline boundary? */
+} dsa_area_pool;
+
+/*
+ * The control block for an area. This lives in shared memory, at the start of
+ * the first DSM segment controlled by this area.
+ */
+typedef struct
+{
+ /* The segment header for the first segment. */
+ dsa_segment_header segment_header;
+ /* The handle for this area. */
+ dsa_handle handle;
+ /* The handles of the segments owned by this area. */
+ dsm_handle segment_handles[DSA_MAX_SEGMENTS];
+ /* Lists of segments, binned by maximum contiguous run of free pages. */
+ dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
+ /* The object pools for each size class. */
+ dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* The total size of all active segments. */
+ Size total_segment_size;
+ /* The maximum total size of backing storage we are allowed. */
+ Size max_total_segment_size;
+ /* Highest used segment index in the history of this area. */
+ dsa_segment_index high_segment_index;
+ /* The reference count for this area. */
+ int refcnt;
+ /* A flag indicating that this area has been pinned. */
+ bool pinned;
+ /* The number of times that segments have been freed. */
+ Size freed_segment_counter;
+ /* The LWLock tranche ID. */
+ int lwlock_tranche_id;
+ char lwlock_tranche_name[DSA_MAXLEN];
+ /* The general lock (protects everything except object pools). */
+ LWLock lock;
+} dsa_area_control;
+
+/* Given a pointer to a pool, find a dsa_pointer. */
+#define DsaAreaPoolToDsaPointer(area, p) \
+ DSA_MAKE_POINTER(0, (char *) p - (char *) area->control)
+
+/*
+ * A dsa_segment_map is stored within the backend-private memory of each
+ * individual backend. It holds the base address of the segment within that
+ * backend, plus the addresses of key objects within the segment. Those
+ * could instead be derived from the base address but it's handy to have them
+ * around.
+ */
+typedef struct
+{
+ dsm_segment *segment; /* DSM segment */
+ char *mapped_address; /* Address at which segment is mapped */
+ dsa_segment_header *header; /* Header (same as mapped_address) */
+ FreePageManager *fpm; /* Free page manager within segment. */
+ dsa_pointer *pagemap; /* Page map within segment. */
+} dsa_segment_map;
+
+/*
+ * Per-backend state for a storage area. Backends obtain one of these by
+ * creating an area or attaching to an existing one using a handle. Each
+ * process that needs to use an area uses its own object to track where the
+ * segments are mapped.
+ */
+struct dsa_area
+{
+ /* Pointer to the control object in shared memory. */
+ dsa_area_control *control;
+
+ /* The lock tranche for this process. */
+ LWLockTranche lwlock_tranche;
+
+ /* Has the mapping been pinned? */
+ bool mapping_pinned;
+
+ /*
+ * This backend's array of segment maps, ordered by segment index
+ * corresponding to control->segment_handles. Some of the area's segments
+ * may not be mapped in this backend yet, and some slots may have been
+ * freed and need to be detached; these operations happen on demand.
+ */
+ dsa_segment_map segment_maps[DSA_MAX_SEGMENTS];
+
+ /* The highest segment index this backend has ever mapped. */
+ dsa_segment_index high_segment_index;
+
+ /* The last observed freed_segment_counter. */
+ Size freed_segment_counter;
+};
+
+#define DSA_SPAN_NOTHING_FREE ((uint16) -1)
+#define DSA_SUPERBLOCK_SIZE (DSA_PAGES_PER_SUPERBLOCK * FPM_PAGE_SIZE)
+
+/* Given a pointer to a segment_map, obtain a segment index number. */
+#define get_segment_index(area, segment_map_ptr) \
+ (segment_map_ptr - &area->segment_maps[0])
+
+static void init_span(dsa_area *area, dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class);
+static bool transfer_first_span(dsa_area *area, dsa_area_pool *pool,
+ int fromclass, int toclass);
+static inline dsa_pointer alloc_object(dsa_area *area, int size_class);
+static bool ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class);
+static dsa_segment_map *get_segment_by_index(dsa_area *area,
+ dsa_segment_index index);
+static void destroy_superblock(dsa_area *area, dsa_pointer span_pointer);
+static void unlink_span(dsa_area *area, dsa_area_span *span);
+static void add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer, int fclass);
+static void unlink_segment(dsa_area *area, dsa_segment_map *segment_map);
+static dsa_segment_map *get_best_segment(dsa_area *area, Size npages);
+static dsa_segment_map *make_new_segment(dsa_area *area, Size requested_pages);
+static dsa_area *create_internal(void *place, size_t size,
+ int tranche_id, const char *tranche_name,
+ dsm_handle control_handle,
+ dsm_segment *control_segment);
+static dsa_area *attach_internal(void *place, dsm_segment *segment,
+ dsa_handle handle);
+static void decrement_reference_count(dsa_area_control *control);
+
+/*
+ * Create a new shared area in a new DSM segment. Further DSM segments will
+ * be allocated as required to extend the available space.
+ *
+ * We can't allocate a LWLock tranche_id within this function, because tranche
+ * IDs are a scarce resource; there are only 64k available, using low numbers
+ * when possible matters, and we have no provision for recycling them. So,
+ * we require the caller to provide one. The caller must also provide the
+ * tranche name, so that we can distinguish LWLocks belonging to different
+ * DSAs.
+ */
+dsa_area *
+dsa_create(int tranche_id, const char *tranche_name)
+{
+ dsm_segment *segment;
+ dsa_area *area;
+
+ /*
+ * Create the DSM segment that will hold the shared control object and the
+ * first segment of usable space.
+ */
+ segment = dsm_create(DSA_INITIAL_SEGMENT_SIZE, 0);
+
+ /*
+ * All segments backing this area are pinned, so that DSA can explicitly
+ * control their lifetime (otherwise a newly created segment belonging to
+ * this area might be freed when the only backend that happens to have it
+ * mapped in ends, corrupting the area).
+ */
+ dsm_pin_segment(segment);
+
+ /* Create a new DSA area with the control object in this segment. */
+ area = create_internal(dsm_segment_address(segment),
+ DSA_INITIAL_SEGMENT_SIZE,
+ tranche_id, tranche_name,
+ dsm_segment_handle(segment), segment);
+
+ /* Clean up when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_detach_release_in_place,
+ PointerGetDatum(dsm_segment_address(segment)));
+
+ return area;
+}
+
+/*
+ * Create a new shared area in an existing shared memory space, which may be
+ * either DSM or Postmaster-initialized memory. DSM segments will be
+ * allocated as required to extend the available space, though that can be
+ * prevented with dsa_set_size_limit(area, size) using the same size provided
+ * to dsa_create_in_place.
+ *
+ * Areas created in-place must eventually be 'released'. This can be done
+ * explicitly with dsa_release_in_place, or via a DSM detach callback if the
+ * area happens to be in an existing DSM segment. See
+ * dsa_on_dsm_detach_release_in_place.
+ *
+ * See dsa_create() for a note about the other arguments.
+ */
+dsa_area *
+dsa_create_in_place(void *place, size_t size,
+ int tranche_id, const char *tranche_name)
+{
+ return create_internal(place, size, tranche_id, tranche_name,
+ DSM_HANDLE_INVALID, NULL);
+}
+
+/*
+ * Obtain a handle that can be passed to other processes so that they can
+ * attach to the given area. Cannot be called for areas created with
+ * dsa_create_in_place.
+ */
+dsa_handle
+dsa_get_handle(dsa_area *area)
+{
+ Assert(area->control->handle != DSM_HANDLE_INVALID);
+ return area->control->handle;
+}
+
+/*
+ * Attach to an area given a handle generated (possibly in another process) by
+ * dsa_get_handle. The area must have been created with dsa_create (not
+ * dsa_create_in_place).
+ */
+dsa_area *
+dsa_attach(dsa_handle handle)
+{
+ dsm_segment *segment;
+ dsa_area *area;
+
+ /*
+ * An area handle is really a DSM segment handle for the first segment, so
+ * we go ahead and attach to that.
+ */
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to area handle %u", handle);
+
+ area = attach_internal(dsm_segment_address(segment), segment, handle);
+
+ /* Clean up when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_detach_release_in_place,
+ PointerGetDatum(dsm_segment_address(segment)));
+
+ return area;
+}
+
+/*
+ * Attach to an area that was created with dsa_create_in_place. The caller
+ * must somehow know the address that was used when the area was created,
+ * though it may be mapped at a different virtual address in this process.
+ *
+ * See dsa_create_in_place for a note about releasing in-place areas.
+ */
+dsa_area *
+dsa_attach_in_place(void *place)
+{
+ return attach_internal(place, NULL, DSM_HANDLE_INVALID);
+}
+
+/*
+ * Release a DSA area that was produced by dsa_create_in_place or
+ * dsa_attach_in_place. The 'segment' argument is ignored but provides an
+ * interface suitable for on_dsm_detach, for the convenience of users who want
+ * to create a DSA area inside an existing DSM segment and have it
+ * automatically released when the containing DSM segment is detached.
+ * 'place' should be the address of the place where the area was created.
+ */
+void
+dsa_on_dsm_detach_release_in_place(dsm_segment *segment, Datum place)
+{
+ dsa_release_in_place(DatumGetPointer(place));
+}
+
+/*
+ * Release a DSA area that was produced by dsa_create_in_place or
+ * dsa_attach_in_place. The 'code' argument is ignored but provides an
+ * interface suitable for on_shmem_exit or before_shmem_exit, for the
+ * convenience of users who want to create a DSA area inside shared memory
+ * other than a DSM segment and have it automatically released at backend exit.
+ * 'place' should be the address of the place where the area was created.
+ */
+void
+dsa_on_shmem_exit_release_in_place(int code, Datum place)
+{
+ dsa_release_in_place(DatumGetPointer(place));
+}
+
+/*
+ * Release a DSA area that was produced by dsa_create_in_place or
+ * dsa_attach_in_place. It is preferable to use one of the 'dsa_on_XXX'
+ * callbacks so that this is managed automatically, because failure to release
+ * an area created in-place leaks its segments permanently.
+ */
+void
+dsa_release_in_place(void *place)
+{
+ decrement_reference_count((dsa_area_control *) place);
+}
+
+/*
+ * Keep a DSA area attached until end of session or explicit detach.
+ *
+ * By default, areas are owned by the current resource owner, which means they
+ * are detached automatically when that scope ends.
+ */
+void
+dsa_pin_mapping(dsa_area *area)
+{
+ int i;
+
+ Assert(!area->mapping_pinned);
+ area->mapping_pinned = true;
+
+ for (i = 0; i <= area->high_segment_index; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_pin_mapping(area->segment_maps[i].segment);
+}
+
+/*
+ * Allocate memory in this storage area. The return value is a dsa_pointer
+ * that can be passed to other processes, and converted to a local pointer
+ * with dsa_get_address. If no memory is available, returns
+ * InvalidDsaPointer.
+ */
+dsa_pointer
+dsa_allocate(dsa_area *area, Size size)
+{
+ uint16 size_class;
+ dsa_pointer start_pointer;
+ dsa_segment_map *segment_map;
+
+ Assert(size > 0);
+
+ /*
+ * If bigger than the largest size class, just grab a run of pages from
+ * the free page manager, instead of allocating an object from a pool.
+ * There will still be a span, but it's a special class of span that
+ * manages this whole allocation and simply gives all pages back to the
+ * free page manager when dsa_free is called.
+ */
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ {
+ Size npages = fpm_size_to_pages(size);
+ Size first_page;
+ dsa_pointer span_pointer;
+ dsa_area_pool *pool = &area->control->pools[DSA_SCLASS_SPAN_LARGE];
+
+ /* Obtain a span object. */
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return InvalidDsaPointer;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+
+ /* Find a segment from which to allocate. */
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ /* Can't make any more segments: game over. */
+ LWLockRelease(DSA_AREA_LOCK(area));
+ dsa_free(area, span_pointer);
+ return InvalidDsaPointer;
+ }
+
+ /*
+ * Ask the free page manager for a run of pages. This should always
+ * succeed, since both get_best_segment and make_new_segment should
+ * only return a non-NULL pointer if the segment actually contains enough
+ * contiguous free space. If it does fail, something in our backend
+ * private state is out of whack, so use FATAL to kill the process.
+ */
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ elog(FATAL,
+ "dsa couldn't find run of pages: fpm_largest out of sync");
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ start_pointer = DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /* Initialize span and pagemap. */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ init_span(area, span_pointer, pool, start_pointer, npages,
+ DSA_SCLASS_SPAN_LARGE);
+ segment_map->pagemap[first_page] = span_pointer;
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+
+ return start_pointer;
+ }
+
+ /* Map allocation to a size class. */
+ if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+ Assert(size <= dsa_size_classes[size_class]);
+ Assert(size_class == 0 || size > dsa_size_classes[size_class - 1]);
+
+ /*
+ * Attempt to allocate an object from the appropriate pool. This might
+ * return InvalidDsaPointer if there's no space available.
+ */
+ return alloc_object(area, size_class);
+}
+
+/*
+ * Free memory obtained with dsa_allocate.
+ */
+void
+dsa_free(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_map *segment_map;
+ int pageno;
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ char *superblock;
+ char *object;
+ Size size;
+ int size_class;
+
+ /* Locate the object, span and pool. */
+ segment_map = get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(dp));
+ pageno = DSA_EXTRACT_OFFSET(dp) / FPM_PAGE_SIZE;
+ span_pointer = segment_map->pagemap[pageno];
+ span = dsa_get_address(area, span_pointer);
+ superblock = dsa_get_address(area, span->start);
+ object = dsa_get_address(area, dp);
+ size_class = span->size_class;
+ size = dsa_size_classes[size_class];
+
+ /*
+ * Special case for large objects that live in a special span: we return
+ * those pages directly to the free page manager and free the span.
+ */
+ if (span->size_class == DSA_SCLASS_SPAN_LARGE)
+ {
+
+#ifdef CLOBBER_FREED_MEMORY
+ memset(object, 0x7f, span->npages * FPM_PAGE_SIZE);
+#endif
+
+ /* Give pages back to free page manager. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Unlink span. */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ unlink_span(area, span);
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+ /* Free the span object so it can be reused. */
+ dsa_free(area, span_pointer);
+ return;
+ }
+
+#ifdef CLOBBER_FREED_MEMORY
+ memset(object, 0x7f, size);
+#endif
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /* Put the object on the span's freelist. */
+ Assert(object >= superblock);
+ Assert(object < superblock + DSA_SUPERBLOCK_SIZE);
+ Assert((object - superblock) % size == 0);
+ NextFreeObjectIndex(object) = span->firstfree;
+ span->firstfree = (object - superblock) / size;
+ ++span->nallocatable;
+
+ /*
+ * See if the span needs to be moved to a different fullness class, or be
+ * freed so its pages can be given back to the segment.
+ */
+ if (span->nallocatable == 1 && span->fclass == DSA_FULLNESS_CLASSES - 1)
+ {
+ /*
+ * The block was completely full and is located in the
+ * highest-numbered fullness class, which is never scanned for free
+ * chunks. We must move it to the next-lower fullness class.
+ */
+ unlink_span(area, span);
+ add_span_to_fullness_class(area, span, span_pointer,
+ DSA_FULLNESS_CLASSES - 2);
+
+ /*
+ * If this is the only span, and there is no active span, then we
+ * should probably move this span to fullness class 1. (Otherwise
+ * if you allocate exactly all the objects in the only span, it moves
+ * to class 3, then you free them all, it moves to 2, and then is
+ * given back, leaving no active span).
+ */
+ }
+ else if (span->nallocatable == span->nmax &&
+ (span->fclass != 1 || span->prevspan != InvalidDsaPointer))
+ {
+ /*
+ * This entire block is free, and it's not the active block for this
+ * size class. Return the memory to the free page manager. We don't
+ * do this for the active block to avoid thrashing: if we
+ * repeatedly allocate and free the only chunk in the active block, it
+ * will be very inefficient if we deallocate and reallocate the block
+ * every time.
+ */
+ destroy_superblock(area, span_pointer);
+ }
+
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+}
+
+/*
+ * Obtain a backend-local address for a dsa_pointer. 'dp' must have been
+ * allocated by the given area (possibly in another process). This may cause
+ * a segment to be mapped into the current process.
+ */
+void *
+dsa_get_address(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_index index;
+ Size offset;
+ Size freed_segment_counter;
+
+ /* Convert InvalidDsaPointer to NULL. */
+ if (!DsaPointerIsValid(dp))
+ return NULL;
+
+ index = DSA_EXTRACT_SEGMENT_NUMBER(dp);
+ offset = DSA_EXTRACT_OFFSET(dp);
+
+ Assert(index < DSA_MAX_SEGMENTS);
+
+ /* Check if we need to cause this segment to be mapped in. */
+ if (unlikely(area->segment_maps[index].mapped_address == NULL))
+ {
+ /* Call for effect (we don't need the result). */
+ get_segment_by_index(area, index);
+ }
+
+ /*
+ * Take this opportunity to check if we need to detach from any segments
+ * that have been freed. This is an unsynchronized read of the value in
+ * shared memory, but all that matters is that we eventually observe a
+ * change when that number moves.
+ */
+ freed_segment_counter = area->control->freed_segment_counter;
+ if (unlikely(area->freed_segment_counter != freed_segment_counter))
+ {
+ int i;
+
+ /* Check all currently mapped segments to find what's been freed. */
+ for (i = 0; i <= area->high_segment_index; ++i)
+ {
+ if (area->segment_maps[i].header != NULL &&
+ area->segment_maps[i].header->freed)
+ {
+ dsm_detach(area->segment_maps[i].segment);
+ area->segment_maps[i].segment = NULL;
+ area->segment_maps[i].header = NULL;
+ area->segment_maps[i].mapped_address = NULL;
+ }
+ }
+ area->freed_segment_counter = freed_segment_counter;
+ }
+
+ return area->segment_maps[index].mapped_address + offset;
+}
+
+/*
+ * Pin this area, so that it will continue to exist even if all backends
+ * detach from it. In that case, the area can still be reattached to if a
+ * handle has been recorded somewhere.
+ */
+void
+dsa_pin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ if (area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_pin: area already pinned");
+ }
+ area->control->pinned = true;
+ ++area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Undo the effects of dsa_pin, so that the given area can be freed when no
+ * backends are attached to it. May be called only if dsa_pin has been
+ * called.
+ */
+void
+dsa_unpin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ Assert(area->control->refcnt > 1);
+ if (!area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_unpin: area not pinned");
+ }
+ area->control->pinned = false;
+ --area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Set the total size limit for this area. This limit is checked whenever new
+ * segments need to be allocated from the operating system. If the new size
+ * limit is already exceeded, this has no immediate effect.
+ *
+ * Note that the total virtual memory usage may be temporarily larger than
+ * this limit when segments have been freed, but not yet detached by all
+ * backends that have attached to them.
+ */
+void
+dsa_set_size_limit(dsa_area *area, Size limit)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ area->control->max_total_segment_size = limit;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Aggressively free all spare memory in the hope of returning DSM segments to
+ * the operating system.
+ */
+void
+dsa_trim(dsa_area *area)
+{
+ int size_class;
+
+ /*
+ * Trim in reverse pool order so we get to the spans-of-spans last, just
+ * in case any become entirely free while processing all the other pools.
+ */
+ for (size_class = DSA_NUM_SIZE_CLASSES - 1; size_class >= 0; --size_class)
+ {
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_pointer span_pointer;
+
+ if (size_class == DSA_SCLASS_SPAN_LARGE)
+ /* Large object frees give back segments aggressively already. */
+ continue;
+
+ /*
+ * Search fullness class 1 only. That is where we expect to find
+ * an entirely empty superblock (entirely empty superblocks in other
+ * fullness classes are returned to the free page manager by dsa_free).
+ */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+ span_pointer = pool->spans[1];
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ dsa_pointer next = span->nextspan;
+
+ if (span->nallocatable == span->nmax)
+ destroy_superblock(area, span_pointer);
+
+ span_pointer = next;
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+ }
+}
+
+/*
+ * Print out debugging information about the internal state of the shared
+ * memory area.
+ */
+void
+dsa_dump(dsa_area *area)
+{
+ Size i,
+ j;
+
+ /*
+ * Note: This gives an inconsistent snapshot as it acquires and releases
+ * individual locks as it goes...
+ */
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ fprintf(stderr, "dsa_area handle %x:\n", area->control->handle);
+ fprintf(stderr, " max_total_segment_size: %zu\n",
+ area->control->max_total_segment_size);
+ fprintf(stderr, " total_segment_size: %zu\n",
+ area->control->total_segment_size);
+ fprintf(stderr, " refcnt: %d\n", area->control->refcnt);
+ fprintf(stderr, " pinned: %c\n", area->control->pinned ? 't' : 'f');
+ fprintf(stderr, " segment bins:\n");
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ {
+ if (area->control->segment_bins[i] != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_index segment_index;
+
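+ /* Bin 0 has no "i - 1" to shift by, so report it separately. */
+ if (i == 0)
+ fprintf(stderr,
+ " segment bin %zu (no contiguous free pages):\n", i);
+ else
+ fprintf(stderr,
+ " segment bin %zu (at least %d contiguous pages free):\n",
+ i, 1 << (i - 1));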
+ segment_index = area->control->segment_bins[i];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, segment_index);
+
+ fprintf(stderr,
+ " segment index %zu, usable_pages = %zu, "
+ "contiguous_pages = %zu, mapped at %p\n",
+ segment_index,
+ segment_map->header->usable_pages,
+ fpm_largest(segment_map->fpm),
+ segment_map->mapped_address);
+ segment_index = segment_map->header->next;
+ }
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ fprintf(stderr, " pools:\n");
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ {
+ bool found = false;
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, i), LW_EXCLUSIVE);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ if (DsaPointerIsValid(area->control->pools[i].spans[j]))
+ found = true;
+ if (found)
+ {
+ if (i == DSA_SCLASS_BLOCK_OF_SPANS)
+ fprintf(stderr, " pool for blocks of span objects:\n");
+ else if (i == DSA_SCLASS_SPAN_LARGE)
+ fprintf(stderr, " pool for large object spans:\n");
+ else
+ fprintf(stderr,
+ " pool for size class %zu (object size %hu bytes):\n",
+ i, dsa_size_classes[i]);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ {
+ if (!DsaPointerIsValid(area->control->pools[i].spans[j]))
+ fprintf(stderr, " fullness class %zu is empty\n", j);
+ else
+ {
+ dsa_pointer span_pointer = area->control->pools[i].spans[j];
+
+ fprintf(stderr, " fullness class %zu:\n", j);
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span;
+
+ span = dsa_get_address(area, span_pointer);
+ fprintf(stderr,
+ " span descriptor at %016lx, "
+ "superblock at %016lx, pages = %zu, "
+ "objects free = %hu/%hu\n",
+ span_pointer, span->start, span->npages,
+ span->nallocatable, span->nmax);
+ span_pointer = span->nextspan;
+ }
+ }
+ }
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, i));
+ }
+}
+
+
+/*
+ * Decrement the area's reference count, and unpin all segments (even those
+ * not mapped into this process's address space) if the count reaches zero.
+ */
+static void
+decrement_reference_count(dsa_area_control *control)
+{
+ int i;
+
+ LWLockAcquire(&control->lock, LW_EXCLUSIVE);
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ control->handle ^ 0));
+ Assert(control->refcnt > 0);
+ if (--control->refcnt == 0)
+ {
+ for (i = 0; i <= control->high_segment_index; ++i)
+ {
+ dsm_handle handle;
+
+ handle = control->segment_handles[i];
+ if (handle != DSM_HANDLE_INVALID)
+ dsm_unpin_segment(handle);
+ }
+ }
+ LWLockRelease(&control->lock);
+}
+
+/*
+ * Workhorse function for dsa_create and dsa_create_in_place.
+ */
+static dsa_area *
+create_internal(void *place, size_t size,
+ int tranche_id, const char *tranche_name,
+ dsm_handle control_handle,
+ dsm_segment *control_segment)
+{
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+ Size usable_pages;
+ Size total_pages;
+ Size metadata_bytes;
+ int i;
+
+ /* Sanity check on the space we have to work in. */
+ if (size < DSA_MINIMUM_IN_PLACE_SIZE)
+ elog(ERROR, "dsa_area space must be at least %zu, but %zu provided",
+ DSA_MINIMUM_IN_PLACE_SIZE, size);
+ /*
+ * That minimum size limit had better be big enough for the smallest
+ * amount of metadata space we could need to hold.
+ */
+ StaticAssertStmt(DSA_MINIMUM_IN_PLACE_SIZE >=
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ (DSA_MINIMUM_IN_PLACE_SIZE / FPM_PAGE_SIZE) *
+ sizeof(dsa_pointer),
+ "DSA_MINIMUM_IN_PLACE_SIZE is too small");
+ /* Now figure out how much space is usable. */
+ total_pages = size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ Assert(metadata_bytes <= size);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages = (size - metadata_bytes) / FPM_PAGE_SIZE;
+
+ /*
+ * Initialize the dsa_area_control object located at the start of the
+ * space.
+ */
+ control = (dsa_area_control *) place;
+ control->segment_header.magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ control_handle ^ 0;
+ control->segment_header.next = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.usable_pages = usable_pages;
+ control->segment_header.freed = false;
+ control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+ control->handle = control_handle;
+ control->max_total_segment_size = SIZE_MAX;
+ control->total_segment_size = size;
+ memset(&control->segment_handles[0], 0,
+ sizeof(dsm_handle) * DSA_MAX_SEGMENTS);
+ control->segment_handles[0] = control_handle;
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ control->segment_bins[i] = DSA_SEGMENT_INDEX_NONE;
+ control->high_segment_index = 0;
+ control->refcnt = 1;
+ control->pinned = false;
+ control->freed_segment_counter = 0;
+ control->lwlock_tranche_id = tranche_id;
+ strlcpy(control->lwlock_tranche_name, tranche_name, DSA_MAXLEN);
+
+ /*
+ * Create the dsa_area object that this backend will use to access the
+ * area. Other backends will need to obtain their own dsa_area object by
+ * attaching.
+ */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(area->segment_maps, 0, sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->high_segment_index = 0;
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+ LWLockInitialize(&control->lock, control->lwlock_tranche_id);
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ LWLockInitialize(DSA_SCLASS_LOCK(area, i),
+ control->lwlock_tranche_id);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = control_segment;
+ segment_map->mapped_address = place;
+ segment_map->header = (dsa_segment_header *) place;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Put this segment into the appropriate bin. */
+ control->segment_bins[contiguous_pages_to_segment_bin(usable_pages)] = 0;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+
+ return area;
+}
+
+/*
+ * Workhorse function for dsa_attach and dsa_attach_in_place.
+ */
+static dsa_area *
+attach_internal(void *place, dsm_segment *segment, dsa_handle handle)
+{
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+
+ control = (dsa_area_control *) place;
+ Assert(control->handle == handle);
+ Assert(control->segment_handles[0] == handle);
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ handle ^ 0));
+
+ /* Build the backend-local area object. */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(&area->segment_maps[0], 0,
+ sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->high_segment_index = 0;
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment; /* NULL for in-place */
+ segment_map->mapped_address = place;
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Bump the reference count. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ ++control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return area;
+}
+
+/*
+ * Add a new span to fullness class 1 of the indicated pool.
+ */
+static void
+init_span(dsa_area *area,
+ dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ Size obsize = dsa_size_classes[size_class];
+
+ /*
+ * The per-pool lock must be held because we manipulate the span list for
+ * this pool.
+ */
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /* Push this span onto the front of the span list for fullness class 1. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ {
+ dsa_area_span *head = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+
+ head->prevspan = span_pointer;
+ }
+ span->pool = DsaAreaPoolToDsaPointer(area, pool);
+ span->nextspan = pool->spans[1];
+ span->prevspan = InvalidDsaPointer;
+ pool->spans[1] = span_pointer;
+
+ span->start = start;
+ span->npages = npages;
+ span->size_class = size_class;
+ span->ninitialized = 0;
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * A block-of-spans contains its own descriptor, so mark one object as
+ * initialized and reduce the count of allocatable objects by one.
+ * Doing this here has the side effect of also reducing nmax by one,
+ * which is important to make sure we free this object at the correct
+ * time.
+ */
+ span->ninitialized = 1;
+ span->nallocatable = FPM_PAGE_SIZE / obsize - 1;
+ }
+ else if (size_class != DSA_SCLASS_SPAN_LARGE)
+ span->nallocatable = DSA_SUPERBLOCK_SIZE / obsize;
+ span->firstfree = DSA_SPAN_NOTHING_FREE;
+ span->nmax = span->nallocatable;
+ span->fclass = 1;
+}
+
+/*
+ * Transfer the first span in one fullness class to the head of another
+ * fullness class.
+ */
+static bool
+transfer_first_span(dsa_area *area,
+ dsa_area_pool *pool, int fromclass, int toclass)
+{
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+
+ /* Can't do it if source list is empty. */
+ span_pointer = pool->spans[fromclass];
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+
+ /* Remove span from head of source list. */
+ span = dsa_get_address(area, span_pointer);
+ pool->spans[fromclass] = span->nextspan;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+
+ /* Add span to head of target list. */
+ span->nextspan = pool->spans[toclass];
+ pool->spans[toclass] = span_pointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = toclass;
+
+ return true;
+}
+
+/*
+ * Allocate one object of the requested size class from the given area.
+ */
+static inline dsa_pointer
+alloc_object(dsa_area *area, int size_class)
+{
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_area_span *span;
+ dsa_pointer block;
+ dsa_pointer result;
+ char *object;
+ Size size;
+
+ /*
+ * Even though ensure_active_superblock can in turn call alloc_object if
+ * it needs to allocate a new span, that's always from a different pool,
+ * and the order of lock acquisition is always the same, so it's OK that
+ * we hold this lock for the duration of this function.
+ */
+ Assert(!LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /*
+ * If there's no active superblock, we must successfully obtain one or
+ * fail the request.
+ */
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ !ensure_active_superblock(area, pool, size_class))
+ {
+ result = InvalidDsaPointer;
+ }
+ else
+ {
+ /*
+ * There should be a block in fullness class 1 at this point, and it
+ * should never be completely full. Thus we can either pop an object
+ * from the free list or, failing that, initialize a new object.
+ */
+ Assert(DsaPointerIsValid(pool->spans[1]));
+ span = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+ Assert(span->nallocatable > 0);
+ block = span->start;
+ Assert(size_class < DSA_NUM_SIZE_CLASSES);
+ size = dsa_size_classes[size_class];
+ if (span->firstfree != DSA_SPAN_NOTHING_FREE)
+ {
+ result = block + span->firstfree * size;
+ object = dsa_get_address(area, result);
+ span->firstfree = NextFreeObjectIndex(object);
+ }
+ else
+ {
+ result = block + span->ninitialized * size;
+ ++span->ninitialized;
+ }
+ --span->nallocatable;
+
+ /* If it's now full, move it to the highest-numbered fullness class. */
+ if (span->nallocatable == 0)
+ transfer_first_span(area, pool, 1, DSA_FULLNESS_CLASSES - 1);
+ }
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+
+ return result;
+}
+
+/*
+ * Ensure an active (i.e. fullness class 1) superblock, unless all existing
+ * superblocks are completely full and no more can be allocated.
+ *
+ * Fullness classes K of 0..N are loosely intended to represent blocks whose
+ * utilization percentage is at least K/N, but we only enforce this rigorously
+ * for the highest-numbered fullness class, which always contains exactly
+ * those blocks that are completely full. It's otherwise acceptable for a
+ * block to be in a higher-numbered fullness class than the one to which it
+ * logically belongs. In addition, the active block, which is always the
+ * first block in fullness class 1, is permitted to have a higher allocation
+ * percentage than would normally be allowable for that fullness class; we
+ * don't move it until it's completely full, and then it goes to the
+ * highest-numbered fullness class.
+ *
+ * It might seem odd that the active block is the head of fullness class 1
+ * rather than fullness class 0, but experience with other allocators has
+ * shown that it's usually better to allocate from a block that's moderately
+ * full rather than one that's nearly empty. Insofar as is reasonably
+ * possible, we want to avoid performing new allocations in a block that would
+ * otherwise become empty soon.
+ */
+static bool
+ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class)
+{
+ dsa_pointer span_pointer;
+ dsa_pointer start_pointer;
+ Size obsize = dsa_size_classes[size_class];
+ Size nmax;
+ int fclass;
+ Size npages = 1;
+ Size first_page;
+ Size i;
+ dsa_segment_map *segment_map;
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /*
+ * Compute the number of objects that will fit in a block of this size
+ * class. Span-of-spans blocks are just a single page, and the first
+ * object isn't available for use because it describes the block-of-spans
+ * itself.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ nmax = FPM_PAGE_SIZE / obsize - 1;
+ else
+ nmax = DSA_SUPERBLOCK_SIZE / obsize;
+
+ /*
+ * If fullness class 1 is empty, try to find a span to put in it by
+ * scanning higher-numbered fullness classes (excluding the last one,
+ * whose blocks are certain to all be completely full).
+ */
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ {
+ span_pointer = pool->spans[fclass];
+
+ while (DsaPointerIsValid(span_pointer))
+ {
+ int tfclass;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+ dsa_area_span *prevspan;
+ dsa_pointer next_span_pointer;
+
+ span = (dsa_area_span *)
+ dsa_get_address(area, span_pointer);
+ next_span_pointer = span->nextspan;
+
+ /* Figure out what fullness class should contain this span. */
+ tfclass = (nmax - span->nallocatable)
+ * (DSA_FULLNESS_CLASSES - 1) / nmax;
+
+ /* Look up next span. */
+ if (DsaPointerIsValid(span->nextspan))
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ else
+ nextspan = NULL;
+
+ /*
+ * If utilization has dropped enough that this now belongs in some
+ * other fullness class, move it there.
+ */
+ if (tfclass < fclass)
+ {
+ /* Remove from the current fullness class list. */
+ if (pool->spans[fclass] == span_pointer)
+ {
+ /* It was the head; remove it. */
+ Assert(!DsaPointerIsValid(span->prevspan));
+ pool->spans[fclass] = span->nextspan;
+ if (nextspan != NULL)
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+ else
+ {
+ /* It was not the head. */
+ Assert(DsaPointerIsValid(span->prevspan));
+ prevspan = (dsa_area_span *)
+ dsa_get_address(area, span->prevspan);
+ prevspan->nextspan = span->nextspan;
+ }
+ if (nextspan != NULL)
+ nextspan->prevspan = span->prevspan;
+
+ /* Push onto the head of the new fullness class list. */
+ span->nextspan = pool->spans[tfclass];
+ pool->spans[tfclass] = span_pointer;
+ span->prevspan = InvalidDsaPointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = tfclass;
+ }
+
+ /* Advance to next span on list. */
+ span_pointer = next_span_pointer;
+ }
+
+ /* Stop now if we found a suitable block. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ return true;
+ }
+
+ /*
+ * If there are no blocks that properly belong in fullness class 1, pick
+ * one from some other fullness class and move it there anyway, so that we
+ * have an allocation target. Our last choice is to transfer a block
+ * that's almost empty (and might become completely empty soon if left
+ * alone), but even that is better than failing, which is what we must do
+ * if there are no blocks at all with freespace.
+ */
+ Assert(!DsaPointerIsValid(pool->spans[1]));
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ if (transfer_first_span(area, pool, fclass, 1))
+ return true;
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ transfer_first_span(area, pool, 0, 1))
+ return true;
+
+ /*
+ * We failed to find an existing span with free objects, so we need to
+ * allocate a new superblock and construct a new span to manage it.
+ *
+ * First, get a dsa_area_span object to describe the new superblock
+ * ... unless this allocation is for a dsa_area_span object, in which case
+ * that's surely not going to work. We handle that case by storing the
+ * span describing a block-of-spans inline.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+ npages = DSA_PAGES_PER_SUPERBLOCK;
+ }
+
+ /* Find or create a segment and allocate the superblock. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ segment_map = make_new_segment(area, npages);
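+ if (segment_map == NULL)
+ {
+ /* Don't leak the span object we allocated above. */
+ LWLockRelease(DSA_AREA_LOCK(area));
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }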
+ }
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* Compute the start of the superblock. */
+ start_pointer =
+ DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /*
+ * If this is a block-of-spans, carve the descriptor right out of the
+ * allocated space.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * The span descriptor lives at the very start of the block, so the
+ * dsa_pointer to the block also serves as the pointer to its span.
+ */
+ span_pointer = start_pointer;
+ }
+
+ /* Initialize span and pagemap. */
+ init_span(area, span_pointer, pool, start_pointer, npages, size_class);
+ for (i = 0; i < npages; ++i)
+ segment_map->pagemap[first_page + i] = span_pointer;
+
+ return true;
+}
+
+/*
+ * Return the segment map corresponding to a given segment index, mapping the
+ * segment in if necessary.
+ */
+static dsa_segment_map *
+get_segment_by_index(dsa_area *area, dsa_segment_index index)
+{
+ if (unlikely(area->segment_maps[index].mapped_address == NULL))
+ {
+ dsm_handle handle;
+ dsm_segment *segment;
+ dsa_segment_map *segment_map;
+
+ handle = area->control->segment_handles[index];
+
+ /* This slot has been freed. */
+ if (handle == DSM_HANDLE_INVALID)
+ return NULL;
+
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to segment");
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+ segment_map = &area->segment_maps[index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header =
+ (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Remember the highest index this backend has ever mapped. */
+ if (area->high_segment_index < index)
+ area->high_segment_index = index;
+
+ Assert(segment_map->header->magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index));
+ }
+
+ return &area->segment_maps[index];
+}
+
+/*
+ * Return a superblock to the free page manager. If the underlying segment
+ * has become entirely free, then return it to the operating system.
+ *
+ * The appropriate pool lock must be held.
+ */
+static void
+destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ int size_class = span->size_class;
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
+
+ /* Remove it from its fullness class list. */
+ unlink_span(area, span);
+
+ /*
+ * Note: This is the only time we acquire the area lock while we already
+ * hold a per-pool lock. We never hold the area lock and then take a pool
+ * lock, or we could deadlock.
+ */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ /* Check if the segment is now entirely free. */
+ if (fpm_largest(segment_map->fpm) == segment_map->header->usable_pages)
+ {
+ dsa_segment_index index = get_segment_index(area, segment_map);
+
+ /* If it's not the segment with extra control data, free it. */
+ if (index != 0)
+ {
+ /*
+ * Give it back to the OS, and allow other backends to detect that
+ * they need to detach.
+ */
+ unlink_segment(area, segment_map);
+ segment_map->header->freed = true;
+ Assert(area->control->total_segment_size >=
+ segment_map->header->size);
+ area->control->total_segment_size -=
+ segment_map->header->size;
+ dsm_unpin_segment(dsm_segment_handle(segment_map->segment));
+ dsm_detach(segment_map->segment);
+ area->control->segment_handles[index] = DSM_HANDLE_INVALID;
+ ++area->control->freed_segment_counter;
+ segment_map->segment = NULL;
+ segment_map->header = NULL;
+ segment_map->mapped_address = NULL;
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /*
+ * Span-of-spans blocks store the span which describes them within the
+ * block itself, so freeing the storage implicitly frees the descriptor
+ * also. If this is a block of any other type, we need to separately free
+ * the span object also. This recursive call to dsa_free will acquire the
+ * span pool's lock. We can't deadlock because the acquisition order is
+ * always some other pool and then the span pool.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+}
+
+static void
+unlink_span(dsa_area *area, dsa_area_span *span)
+{
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ dsa_area_span *next = dsa_get_address(area, span->nextspan);
+
+ next->prevspan = span->prevspan;
+ }
+ if (DsaPointerIsValid(span->prevspan))
+ {
+ dsa_area_span *prev = dsa_get_address(area, span->prevspan);
+
+ prev->nextspan = span->nextspan;
+ }
+ else
+ {
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ pool->spans[span->fclass] = span->nextspan;
+ }
+}
+
+static void
+add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer,
+ int fclass)
+{
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ if (DsaPointerIsValid(pool->spans[fclass]))
+ {
+ dsa_area_span *head = dsa_get_address(area,
+ pool->spans[fclass]);
+
+ head->prevspan = span_pointer;
+ }
+ span->prevspan = InvalidDsaPointer;
+ span->nextspan = pool->spans[fclass];
+ pool->spans[fclass] = span_pointer;
+ span->fclass = fclass;
+}
+
+/*
+ * Detach from an area that was either created or attached to by this process.
+ */
+void
+dsa_detach(dsa_area *area)
+{
+ int i;
+
+ /* Detach from all segments. */
+ for (i = 0; i <= area->high_segment_index; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_detach(area->segment_maps[i].segment);
+
+ /*
+ * Note that 'detaching' (= detaching from DSM segments) doesn't include
+ * 'releasing' (= adjusting the reference count). It would be nice to
+ * combine these operations, but client code might never get around to
+ * calling dsa_detach because of an error path, and a detach hook on any
+ * particular segment is too late to detach other segments in the area
+ * without risking a 'leak' warning in the non-error path.
+ */
+
+ /* Free the backend-local area object. */
+ pfree(area);
+}
+
+/*
+ * Unlink a segment from the bin that contains it.
+ */
+static void
+unlink_segment(dsa_area *area, dsa_segment_map *segment_map)
+{
+ if (segment_map->header->prev != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *prev;
+
+ prev = get_segment_by_index(area, segment_map->header->prev);
+ prev->header->next = segment_map->header->next;
+ }
+ else
+ {
+ Assert(area->control->segment_bins[segment_map->header->bin] ==
+ get_segment_index(area, segment_map));
+ area->control->segment_bins[segment_map->header->bin] =
+ segment_map->header->next;
+ }
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area, segment_map->header->next);
+ next->header->prev = segment_map->header->prev;
+ }
+}
+
+/*
+ * Find a segment that could satisfy a request for 'npages' of contiguous
+ * memory, or return NULL if none can be found. This may involve attaching to
+ * segments that weren't previously attached so that we can query their free
+ * pages map.
+ */
+static dsa_segment_map *
+get_best_segment(dsa_area *area, Size npages)
+{
+ Size bin;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /*
+ * Start searching from the first bin that *might* have enough contiguous
+ * pages.
+ */
+ for (bin = contiguous_pages_to_segment_bin(npages);
+ bin < DSA_NUM_SEGMENT_BINS;
+ ++bin)
+ {
+ /*
+ * The minimum contiguous size that any segment in this bin should
+ * have. We'll re-bin if we see segments with fewer.
+ */
+ Size threshold = (Size) 1 << (bin - 1);
+ dsa_segment_index segment_index;
+
+ /* Search this bin for a segment with enough contiguous space. */
+ segment_index = area->control->segment_bins[bin];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+ dsa_segment_index next_segment_index;
+ Size contiguous_pages;
+
+ segment_map = get_segment_by_index(area, segment_index);
+ next_segment_index = segment_map->header->next;
+ contiguous_pages = fpm_largest(segment_map->fpm);
+
+ /* Not enough for the request, still enough for this bin. */
+ if (contiguous_pages >= threshold && contiguous_pages < npages)
+ {
+ segment_index = next_segment_index;
+ continue;
+ }
+
+ /* Re-bin it if it's no longer in the appropriate bin. */
+ if (contiguous_pages < threshold)
+ {
+ Size new_bin;
+
+ new_bin = contiguous_pages_to_segment_bin(contiguous_pages);
+
+ /* Remove it from its current bin. */
+ unlink_segment(area, segment_map);
+
+ /* Push it onto the front of its new bin. */
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[new_bin];
+ segment_map->header->bin = new_bin;
+ area->control->segment_bins[new_bin] = segment_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area,
+ segment_map->header->next);
+ Assert(next->header->bin == new_bin);
+ next->header->prev = segment_index;
+ }
+
+ /*
+ * But fall through to see if it's enough to satisfy this
+ * request anyway....
+ */
+ }
+
+ /* Check if we are done. */
+ if (contiguous_pages >= npages)
+ return segment_map;
+
+ /* Continue searching the same bin. */
+ segment_index = next_segment_index;
+ }
+ }
+
+ /* Not found. */
+ return NULL;
+}
+
+/*
+ * Create a new segment that can handle at least requested_pages. Returns
+ * NULL if the requested total size limit or maximum allowed number of
+ * segments would be exceeded.
+ */
+static dsa_segment_map *
+make_new_segment(dsa_area *area, Size requested_pages)
+{
+ dsa_segment_index new_index;
+ Size metadata_bytes;
+ Size total_size;
+ Size total_pages;
+ Size usable_pages;
+ dsa_segment_map *segment_map;
+ dsm_segment *segment;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /* Find a segment slot that is not in use (linearly for now). */
+ for (new_index = 1; new_index < DSA_MAX_SEGMENTS; ++new_index)
+ {
+ if (area->control->segment_handles[new_index] == DSM_HANDLE_INVALID)
+ break;
+ }
+ if (new_index == DSA_MAX_SEGMENTS)
+ return NULL;
+
+ /*
+ * If the total size limit is already exceeded, then we exit early and
+ * avoid arithmetic wraparound in the unsigned expressions below.
+ */
+ if (area->control->total_segment_size >=
+ area->control->max_total_segment_size)
+ return NULL;
+
+ /*
+ * The size should be at least as big as requested, and at least big
+ * enough to follow a geometric series that approximately doubles the
+ * total storage each time we create a new segment. We use geometric
+ * growth because the underlying DSM system isn't designed for large
+ * numbers of segments (otherwise we might even consider just using one
+ * DSM segment for each large allocation and for each superblock, and then
+ * we wouldn't need to use FreePageManager).
+ *
+ * We decide on a total segment size first, so that we produce tidy
+ * power-of-two sized segments. This is a good property to have if we
+ * move to huge pages in the future. Then we work back to the number of
+ * pages we can fit.
+ */
+ total_size = DSA_INITIAL_SEGMENT_SIZE *
+ ((Size) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
+ total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size,
+ area->control->max_total_segment_size -
+ area->control->total_segment_size);
+
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ sizeof(dsa_pointer) * total_pages;
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ if (total_size <= metadata_bytes)
+ return NULL;
+ usable_pages = (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ Assert(metadata_bytes + usable_pages * FPM_PAGE_SIZE <= total_size);
+
+ /* See if that is enough... */
+ if (requested_pages > usable_pages)
+ {
+ /*
+ * We'll make an odd-sized segment, working forward from the requested
+ * number of pages.
+ */
+ usable_pages = requested_pages;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ usable_pages * sizeof(dsa_pointer);
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ total_size = metadata_bytes + usable_pages * FPM_PAGE_SIZE;
+
+ /* Is that too large for dsa_pointer's addressing scheme? */
+ if (total_size > DSA_MAX_SEGMENT_SIZE)
+ return NULL;
+
+ /* Would that exceed the limit? */
+ if (total_size > area->control->max_total_segment_size -
+ area->control->total_segment_size)
+ return NULL;
+ }
+
+ /* Create the segment. */
+ segment = dsm_create(total_size, 0);
+ if (segment == NULL)
+ return NULL;
+ dsm_pin_segment(segment);
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+
+ /* Store the handle in shared memory to be found by index. */
+ area->control->segment_handles[new_index] =
+ dsm_segment_handle(segment);
+ /* Track the highest segment index in the history of the area. */
+ if (area->control->high_segment_index < new_index)
+ area->control->high_segment_index = new_index;
+ /* Track the highest segment index this backend has ever mapped. */
+ if (area->high_segment_index < new_index)
+ area->high_segment_index = new_index;
+ /* Track total size of all segments. */
+ area->control->total_segment_size += total_size;
+ Assert(area->control->total_segment_size <=
+ area->control->max_total_segment_size);
+
+ /* Build a segment map for this segment in this backend. */
+ segment_map = &area->segment_maps[new_index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Set up the segment header and put it in the appropriate bin. */
+ segment_map->header->magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ new_index;
+ segment_map->header->usable_pages = usable_pages;
+ segment_map->header->size = total_size;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[segment_map->header->bin];
+ segment_map->header->freed = false;
+ area->control->segment_bins[segment_map->header->bin] = new_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next =
+ get_segment_by_index(area, segment_map->header->next);
+
+ Assert(next->header->bin == segment_map->header->bin);
+ next->header->prev = new_index;
+ }
+
+ return segment_map;
+}
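
The sizing arithmetic in make_new_segment is easier to follow in isolation, so here is a minimal standalone sketch of it. Everything prefixed SKETCH_ or sketch_ is hypothetical and not part of the patch: the constant values are illustrative stand-ins for DSA_INITIAL_SEGMENT_SIZE, DSA_NUM_SEGMENTS_AT_EACH_SIZE, DSA_MAX_SEGMENT_SIZE and FPM_PAGE_SIZE, and the fixed metadata size stands in for the two MAXALIGNed headers. The doubling loop replaces the shift only so that the sketch cannot overflow for large indexes.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only; the real constants are defined by the patch. */
#define SKETCH_PAGE_SIZE             ((size_t) 4096)
#define SKETCH_INITIAL_SEGMENT_SIZE  ((size_t) 1024 * 1024)
#define SKETCH_SEGMENTS_AT_EACH_SIZE 4
#define SKETCH_MAX_SEGMENT_SIZE      ((size_t) 1 << 30)
#define SKETCH_FIXED_METADATA        ((size_t) 256) /* stand-in for both headers */
#define SKETCH_PAGEMAP_ENTRY_SIZE    ((size_t) 8)   /* stand-in for sizeof(dsa_pointer) */

typedef struct sketch_layout
{
    size_t total_size;     /* bytes in the whole segment */
    size_t metadata_bytes; /* headers plus pagemap, padded to a page */
    size_t usable_pages;   /* pages handed to the free page manager */
} sketch_layout;

/* Pad a byte count up to the next page boundary. */
static size_t
sketch_round_up_to_page(size_t bytes)
{
    size_t rem = bytes % SKETCH_PAGE_SIZE;

    return rem == 0 ? bytes : bytes + (SKETCH_PAGE_SIZE - rem);
}

/*
 * Work out the layout of segment number 'index'.  'remaining_budget' plays
 * the role of max_total_segment_size - total_segment_size.  Returns 1 and
 * fills *out on success, 0 if no acceptable layout exists.
 */
static int
sketch_segment_layout(size_t index, size_t requested_pages,
                      size_t remaining_budget, sketch_layout *out)
{
    size_t total_size = SKETCH_INITIAL_SEGMENT_SIZE;
    size_t metadata_bytes;
    size_t usable_pages;
    size_t step;

    /* Geometric growth: double once per group of segments, up to the cap. */
    for (step = index / SKETCH_SEGMENTS_AT_EACH_SIZE;
         step > 0 && total_size < SKETCH_MAX_SEGMENT_SIZE;
         --step)
        total_size *= 2;
    if (total_size > remaining_budget)
        total_size = remaining_budget;

    /* Work back from the tidy total size to the pages that fit. */
    metadata_bytes = sketch_round_up_to_page(SKETCH_FIXED_METADATA +
        SKETCH_PAGEMAP_ENTRY_SIZE * (total_size / SKETCH_PAGE_SIZE));
    if (total_size <= metadata_bytes)
        return 0;
    usable_pages = (total_size - metadata_bytes) / SKETCH_PAGE_SIZE;

    if (requested_pages > usable_pages)
    {
        /* Odd-sized segment: work forward from the requested pages. */
        usable_pages = requested_pages;
        metadata_bytes = sketch_round_up_to_page(SKETCH_FIXED_METADATA +
            SKETCH_PAGEMAP_ENTRY_SIZE * usable_pages);
        total_size = metadata_bytes + usable_pages * SKETCH_PAGE_SIZE;
        if (total_size > SKETCH_MAX_SEGMENT_SIZE ||
            total_size > remaining_budget)
            return 0;
    }

    assert(metadata_bytes + usable_pages * SKETCH_PAGE_SIZE <= total_size);
    out->total_size = total_size;
    out->metadata_bytes = metadata_bytes;
    out->usable_pages = usable_pages;
    return 1;
}
```

Under these assumed constants, a small request at index 1 gets a tidy 1MB segment with one page of metadata, while a request for 1000 pages gets an odd-sized segment just large enough for the pages plus their pagemap.
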
diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
new file mode 100644
index 0000000..cf27e1b
--- /dev/null
+++ b/src/backend/utils/mmgr/freepage.c
@@ -0,0 +1,1846 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.c
+ * Management of free memory pages.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/freepage.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+
+#include "utils/freepage.h"
+#include "utils/relptr.h"
+
+
+/* Magic numbers to identify various page types */
+#define FREE_PAGE_SPAN_LEADER_MAGIC 0xea4020f0
+#define FREE_PAGE_LEAF_MAGIC 0x98eae728
+#define FREE_PAGE_INTERNAL_MAGIC 0x19aa32c9
+
+/* Doubly linked list of spans of free pages; stored in first page of span. */
+struct FreePageSpanLeader
+{
+ int magic; /* always FREE_PAGE_SPAN_LEADER_MAGIC */
+ Size npages; /* number of pages in span */
+ RelptrFreePageSpanLeader prev;
+ RelptrFreePageSpanLeader next;
+};
+
+/* Common header for btree leaf and internal pages. */
+typedef struct FreePageBtreeHeader
+{
+ int magic; /* FREE_PAGE_LEAF_MAGIC or
+ * FREE_PAGE_INTERNAL_MAGIC */
+ Size nused; /* number of items used */
+ RelptrFreePageBtree parent; /* uplink */
+} FreePageBtreeHeader;
+
+/* Internal key; points to next level of btree. */
+typedef struct FreePageBtreeInternalKey
+{
+ Size first_page; /* low bound for keys on child page */
+ RelptrFreePageBtree child; /* downlink */
+} FreePageBtreeInternalKey;
+
+/* Leaf key; no payload data. */
+typedef struct FreePageBtreeLeafKey
+{
+ Size first_page; /* first page in span */
+ Size npages; /* number of pages in span */
+} FreePageBtreeLeafKey;
+
+/* Work out how many keys will fit on a page. */
+#define FPM_ITEMS_PER_INTERNAL_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeInternalKey))
+#define FPM_ITEMS_PER_LEAF_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeLeafKey))
+
+/* A btree page of either sort */
+struct FreePageBtree
+{
+ FreePageBtreeHeader hdr;
+ union
+ {
+ FreePageBtreeInternalKey internal_key[FPM_ITEMS_PER_INTERNAL_PAGE];
+ FreePageBtreeLeafKey leaf_key[FPM_ITEMS_PER_LEAF_PAGE];
+ } u;
+};
+
+/* Results of a btree search */
+typedef struct FreePageBtreeSearchResult
+{
+ FreePageBtree *page;
+ Size index;
+ bool found;
+ unsigned split_pages;
+} FreePageBtreeSearchResult;
+
+/* Helper functions */
+static void FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm,
+ FreePageBtree *btp);
+static Size FreePageBtreeCleanup(FreePageManager *fpm);
+static FreePageBtree *FreePageBtreeFindLeftSibling(char *base,
+ FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeFindRightSibling(char *base,
+ FreePageBtree *btp);
+static Size FreePageBtreeFirstKey(FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeGetRecycled(FreePageManager *fpm);
+static void FreePageBtreeInsertInternal(char *base, FreePageBtree *btp,
+ Size index, Size first_page, FreePageBtree *child);
+static void FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index,
+ Size first_page, Size npages);
+static void FreePageBtreeRecycle(FreePageManager *fpm, Size pageno);
+static void FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp,
+ Size index);
+static void FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp);
+static void FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result);
+static Size FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page);
+static Size FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page);
+static FreePageBtree *FreePageBtreeSplitPage(FreePageManager *fpm,
+ FreePageBtree *btp);
+static void FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp);
+static void FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf);
+static void FreePageManagerDumpSpans(FreePageManager *fpm,
+ FreePageSpanLeader *span, Size expected_pages,
+ StringInfo buf);
+static bool FreePageManagerGetInternal(FreePageManager *fpm, Size npages,
+ Size *first_page);
+static Size FreePageManagerPutInternal(FreePageManager *fpm, Size first_page,
+ Size npages, bool soft);
+static void FreePagePopSpanLeader(FreePageManager *fpm, Size pageno);
+static void FreePagePushSpanLeader(FreePageManager *fpm, Size first_page,
+ Size npages);
+static Size FreePageManagerLargestContiguous(FreePageManager *fpm);
+static void FreePageManagerUpdateLargest(FreePageManager *fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+static Size sum_free_pages(FreePageManager *fpm);
+#endif
+
+/*
+ * Initialize a new, empty free page manager.
+ *
+ * 'fpm' should reference caller-provided memory large enough to contain a
+ * FreePageManager. We'll initialize it here.
+ *
+ * 'base' is the address to which all pointers are relative. When managing
+ * a dynamic shared memory segment, it should normally be the base of the
+ * segment. When managing backend-private memory, it can be either NULL or,
+ * if managing a single contiguous extent of memory, the start of that extent.
+ */
+void
+FreePageManagerInitialize(FreePageManager *fpm, char *base)
+{
+ Size f;
+
+ relptr_store(base, fpm->self, fpm);
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ relptr_store(base, fpm->btree_recycle, (FreePageSpanLeader *) NULL);
+ fpm->btree_depth = 0;
+ fpm->btree_recycle_count = 0;
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->contiguous_pages = 0;
+ fpm->contiguous_pages_dirty = true;
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages = 0;
+#endif
+
+ for (f = 0; f < FPM_NUM_FREELISTS; f++)
+ relptr_store(base, fpm->freelist[f], (FreePageSpanLeader *) NULL);
+}
+
+/*
+ * Allocate a run of pages of the given length from the free page manager.
+ * The return value indicates whether we were able to satisfy the request;
+ * if true, the first page of the allocation is stored in *first_page.
+ */
+bool
+FreePageManagerGet(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ bool result;
+ Size contiguous_pages;
+
+ result = FreePageManagerGetInternal(fpm, npages, first_page);
+
+ /*
+ * It's a bit counterintuitive, but allocating pages can actually create
+ * opportunities for cleanup that create larger ranges. We might pull a
+ * key out of the btree that enables the item at the head of the btree
+ * recycle list to be inserted; and then if there are more items behind it
+ * one of those might cause two currently-separated ranges to merge,
+ * creating a single range of contiguous pages larger than any that
+ * existed previously. It might be worth trying to improve the cleanup
+ * algorithm to avoid such corner cases, but for now we just notice the
+ * condition and do the appropriate reporting.
+ */
+ contiguous_pages = FreePageBtreeCleanup(fpm);
+ if (fpm->contiguous_pages < contiguous_pages)
+ fpm->contiguous_pages = contiguous_pages;
+
+ /*
+ * FreePageManagerGetInternal may have set contiguous_pages_dirty.
+ * Recompute contiguous_pages if so.
+ */
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ if (result)
+ {
+ Assert(fpm->free_pages >= npages);
+ fpm->free_pages -= npages;
+ }
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+ Assert(fpm->contiguous_pages == FreePageManagerLargestContiguous(fpm));
+#endif
+ return result;
+}
+
+#ifdef FPM_EXTRA_ASSERTS
+static void
+sum_free_pages_recurse(FreePageManager *fpm, FreePageBtree *btp, Size *sum)
+{
+ char *base = fpm_segment_base(fpm);
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ||
+ btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ ++*sum;
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ Size index;
+
+
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ sum_free_pages_recurse(fpm, child, sum);
+ }
+ }
+}
+static Size
+sum_free_pages(FreePageManager *fpm)
+{
+ FreePageSpanLeader *recycle;
+ char *base = fpm_segment_base(fpm);
+ Size sum = 0;
+ int list;
+
+ /* Count the spans by scanning the freelists. */
+ for (list = 0; list < FPM_NUM_FREELISTS; ++list)
+ {
+
+ if (!relptr_is_null(fpm->freelist[list]))
+ {
+ FreePageSpanLeader *candidate =
+ relptr_access(base, fpm->freelist[list]);
+
+ do
+ {
+ sum += candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ }
+
+ /* Count the pages occupied by the btree itself (internal and leaf). */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ sum_free_pages_recurse(fpm, root, &sum);
+ }
+
+ /* Count the recycle list. */
+ for (recycle = relptr_access(base, fpm->btree_recycle);
+ recycle != NULL;
+ recycle = relptr_access(base, recycle->next))
+ {
+ Assert(recycle->npages == 1);
+ ++sum;
+ }
+
+ return sum;
+}
+#endif
+
+/*
+ * Compute the size of the largest run of pages that the user could
+ * successfully get.
+ */
+static Size
+FreePageManagerLargestContiguous(FreePageManager *fpm)
+{
+ char *base;
+ Size largest;
+
+ base = fpm_segment_base(fpm);
+ largest = 0;
+ if (!relptr_is_null(fpm->freelist[FPM_NUM_FREELISTS - 1]))
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[FPM_NUM_FREELISTS - 1]);
+ do
+ {
+ if (candidate->npages > largest)
+ largest = candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ else
+ {
+ Size f = FPM_NUM_FREELISTS - 1;
+
+ do
+ {
+ --f;
+ if (!relptr_is_null(fpm->freelist[f]))
+ {
+ largest = f + 1;
+ break;
+ }
+ } while (f > 0);
+ }
+
+ return largest;
+}
+
+/*
+ * Recompute the size of the largest run of pages that the user could
+ * successfully get, if it has been marked dirty.
+ */
+static void
+FreePageManagerUpdateLargest(FreePageManager *fpm)
+{
+ if (!fpm->contiguous_pages_dirty)
+ return;
+
+ fpm->contiguous_pages = FreePageManagerLargestContiguous(fpm);
+ fpm->contiguous_pages_dirty = false;
+}
+
+/*
+ * Transfer a run of pages to the free page manager.
+ */
+void
+FreePageManagerPut(FreePageManager *fpm, Size first_page, Size npages)
+{
+ Size contiguous_pages;
+
+ Assert(npages > 0);
+
+ /* Record the new pages. */
+ contiguous_pages =
+ FreePageManagerPutInternal(fpm, first_page, npages, false);
+
+ /*
+ * If the new range we inserted into the page manager was contiguous with
+ * an existing range, it may have opened up cleanup opportunities.
+ */
+ if (contiguous_pages > npages)
+ {
+ Size cleanup_contiguous_pages;
+
+ cleanup_contiguous_pages = FreePageBtreeCleanup(fpm);
+ if (cleanup_contiguous_pages > contiguous_pages)
+ contiguous_pages = cleanup_contiguous_pages;
+ }
+
+ /* See if we now have a new largest chunk. */
+ if (fpm->contiguous_pages < contiguous_pages)
+ fpm->contiguous_pages = contiguous_pages;
+
+ /*
+ * The earlier call to FreePageManagerPutInternal may have set
+ * contiguous_pages_dirty if it needed to allocate internal pages, so
+ * recompute contiguous_pages if necessary.
+ */
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages += npages;
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+ Assert(fpm->contiguous_pages == FreePageManagerLargestContiguous(fpm));
+#endif
+}
+
+/*
+ * Produce a debugging dump of the state of a free page manager.
+ */
+char *
+FreePageManagerDump(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ StringInfoData buf;
+ FreePageSpanLeader *recycle;
+ bool dumped_any_freelist = false;
+ Size f;
+
+ /* Initialize output buffer. */
+ initStringInfo(&buf);
+
+ /* Dump general stuff. */
+ appendStringInfo(&buf, "metadata: self %zu max contiguous pages = %zu\n",
+ fpm->self.relptr_off, fpm->contiguous_pages);
+
+ /* Dump btree. */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root;
+
+ appendStringInfo(&buf, "btree depth %u:\n", fpm->btree_depth);
+ root = relptr_access(base, fpm->btree_root);
+ FreePageManagerDumpBtree(fpm, root, NULL, 0, &buf);
+ }
+ else if (fpm->singleton_npages > 0)
+ {
+ appendStringInfo(&buf, "singleton: %zu(%zu)\n",
+ fpm->singleton_first_page, fpm->singleton_npages);
+ }
+
+ /* Dump btree recycle list. */
+ recycle = relptr_access(base, fpm->btree_recycle);
+ if (recycle != NULL)
+ {
+ appendStringInfo(&buf, "btree recycle:");
+ FreePageManagerDumpSpans(fpm, recycle, 1, &buf);
+ }
+
+ /* Dump free lists. */
+ for (f = 0; f < FPM_NUM_FREELISTS; ++f)
+ {
+ FreePageSpanLeader *span;
+
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+ if (!dumped_any_freelist)
+ {
+ appendStringInfo(&buf, "freelists:\n");
+ dumped_any_freelist = true;
+ }
+ appendStringInfo(&buf, " %zu:", f + 1);
+ span = relptr_access(base, fpm->freelist[f]);
+ FreePageManagerDumpSpans(fpm, span, f + 1, &buf);
+ }
+
+ /* And return result to caller. */
+ return buf.data;
+}
+
+
+/*
+ * The first_page value stored at index zero in any non-root page must match
+ * the first_page value stored in its parent at the index which points to that
+ * page. So when the value stored at index zero in a btree page changes, we've
+ * got to walk up the tree adjusting ancestor keys until we reach an ancestor
+ * where that key isn't index zero. This function should be called after
+ * updating the first key on the target page; it will propagate the change
+ * upward as far as needed.
+ *
+ * We assume here that the first key on the page has not changed enough to
+ * require changes in the ordering of keys on its ancestor pages. Thus,
+ * if we search the parent page for the first key greater than or equal to
+ * the first key on the current page, the downlink to this page will be either
+ * the exact index returned by the search (if the first key decreased)
+ * or one less (if the first key increased).
+ */
+static void
+FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ Size first_page;
+ FreePageBtree *parent;
+ FreePageBtree *child;
+
+ /* This might be either a leaf or an internal page. */
+ Assert(btp->hdr.nused > 0);
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ first_page = btp->u.leaf_key[0].first_page;
+ }
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ first_page = btp->u.internal_key[0].first_page;
+ }
+ child = btp;
+
+ /* Loop until we find an ancestor that does not require adjustment. */
+ for (;;)
+ {
+ Size s;
+
+ parent = relptr_access(base, child->hdr.parent);
+ if (parent == NULL)
+ break;
+ s = FreePageBtreeSearchInternal(parent, first_page);
+
+ /* Key is either at index s or index s-1; figure out which. */
+ if (s >= parent->hdr.nused)
+ {
+ Assert(s == parent->hdr.nused);
+ --s;
+ }
+ else
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ if (check != child)
+ {
+ Assert(s > 0);
+ --s;
+ }
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* Debugging double-check. */
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ Assert(s < parent->hdr.nused);
+ Assert(child == check);
+ }
+#endif
+
+ /* Update the parent key. */
+ parent->u.internal_key[s].first_page = first_page;
+
+ /*
+ * If this is the first key in the parent, go up another level; else
+ * done.
+ */
+ if (s > 0)
+ break;
+ child = parent;
+ }
+}
+
+/*
+ * Attempt to reclaim space from the free-page btree. The return value is
+ * the largest range of contiguous pages created by the cleanup operation.
+ */
+static Size
+FreePageBtreeCleanup(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ Size max_contiguous_pages = 0;
+
+ /* Attempt to shrink the depth of the btree. */
+ while (!relptr_is_null(fpm->btree_root))
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ /* If the root contains only one key, reduce depth by one. */
+ if (root->hdr.nused == 1)
+ {
+ /* Shrink depth of tree by one. */
+ Assert(fpm->btree_depth > 0);
+ --fpm->btree_depth;
+ if (root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ /* If root is a leaf, convert only entry to singleton range. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages;
+ }
+ else
+ {
+ FreePageBtree *newroot;
+
+ /* If root is an internal page, make only child the root. */
+ Assert(root->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ relptr_copy(fpm->btree_root, root->u.internal_key[0].child);
+ newroot = relptr_access(base, fpm->btree_root);
+ relptr_store(base, newroot->hdr.parent, (FreePageBtree *) NULL);
+ }
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, root));
+ }
+ else if (root->hdr.nused == 2 &&
+ root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Size end_of_first;
+ Size start_of_second;
+
+ end_of_first = root->u.leaf_key[0].first_page +
+ root->u.leaf_key[0].npages;
+ start_of_second = root->u.leaf_key[1].first_page;
+
+ if (end_of_first + 1 == start_of_second)
+ {
+ Size root_page = fpm_pointer_to_page(base, root);
+
+ if (end_of_first == root_page)
+ {
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[0].first_page);
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[1].first_page);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages +
+ root->u.leaf_key[1].npages + 1;
+ fpm->btree_depth = 0;
+ relptr_store(base, fpm->btree_root,
+ (FreePageBtree *) NULL);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ Assert(max_contiguous_pages == 0);
+ max_contiguous_pages = fpm->singleton_npages;
+ }
+ }
+
+ /* Whether it worked or not, it's time to stop. */
+ break;
+ }
+ else
+ {
+ /* Nothing more to do. Stop. */
+ break;
+ }
+ }
+
+ /*
+ * Attempt to free recycled btree pages. We skip this if releasing the
+ * recycled page would require a btree page split, because the page we're
+ * trying to recycle would be consumed by the split, which would be
+ * counterproductive.
+ *
+ * We also currently only ever attempt to recycle the first page on the
+ * list; that could be made more aggressive, but it's not clear that the
+ * complexity would be worthwhile.
+ */
+ while (fpm->btree_recycle_count > 0)
+ {
+ FreePageBtree *btp;
+ Size first_page;
+ Size contiguous_pages;
+
+ btp = FreePageBtreeGetRecycled(fpm);
+ first_page = fpm_pointer_to_page(base, btp);
+ contiguous_pages = FreePageManagerPutInternal(fpm, first_page, 1, true);
+ if (contiguous_pages == 0)
+ {
+ FreePageBtreeRecycle(fpm, first_page);
+ break;
+ }
+ else
+ {
+ if (contiguous_pages > max_contiguous_pages)
+ max_contiguous_pages = contiguous_pages;
+ }
+ }
+
+ return max_contiguous_pages;
+}
+
+/*
+ * Consider consolidating the given page with its left or right sibling,
+ * if it's fairly empty.
+ */
+static void
+FreePageBtreeConsolidate(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *np;
+ Size max;
+
+ /*
+ * We only try to consolidate pages that are less than a third full. We
+ * could be more aggressive about this, but that might risk performing
+ * consolidation only to end up splitting again shortly thereafter. Since
+ * the btree should be very small compared to the space under management,
+ * our goal isn't so much to ensure that it always occupies the absolutely
+ * smallest possible number of pages as to reclaim pages before things get
+ * too egregiously out of hand.
+ */
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ max = FPM_ITEMS_PER_LEAF_PAGE;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ max = FPM_ITEMS_PER_INTERNAL_PAGE;
+ }
+ if (btp->hdr.nused >= max / 3)
+ return;
+
+ /*
+ * If we can fit our right sibling's keys onto this page, consolidate.
+ */
+ np = FreePageBtreeFindRightSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&btp->u.leaf_key[btp->hdr.nused], &np->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ }
+ else
+ {
+ memcpy(&btp->u.internal_key[btp->hdr.nused], &np->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, btp);
+ }
+ FreePageBtreeRemovePage(fpm, np);
+ return;
+ }
+
+ /*
+ * If we can fit our keys onto our left sibling's page, consolidate. In
+ * this case, we move our keys onto the other page rather than vice
+ * versa, to avoid having to adjust ancestor keys.
+ */
+ np = FreePageBtreeFindLeftSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&np->u.leaf_key[np->hdr.nused], &btp->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ }
+ else
+ {
+ memcpy(&np->u.internal_key[np->hdr.nused], &btp->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, np);
+ }
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+}
+
+/*
+ * Find the passed page's left sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately precedes ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindLeftSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move left. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the leftmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index > 0)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index - 1].child);
+ break;
+ }
+ Assert(index == 0);
+ ++levels;
+ }
+
+ /* Descend right. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[p->hdr.nused - 1].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Find the passed page's right sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately follows ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindRightSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move right. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the rightmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index < p->hdr.nused - 1)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index + 1].child);
+ break;
+ }
+ Assert(index == p->hdr.nused - 1);
+ ++levels;
+ }
+
+ /* Descend left. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[0].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Get the first key on a btree page.
+ */
+static Size
+FreePageBtreeFirstKey(FreePageBtree *btp)
+{
+ Assert(btp->hdr.nused > 0);
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ return btp->u.leaf_key[0].first_page;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ return btp->u.internal_key[0].first_page;
+ }
+}
+
+/*
+ * Get a page from the btree recycle list for use as a btree page.
+ */
+static FreePageBtree *
+FreePageBtreeGetRecycled(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *newhead;
+
+ Assert(victim != NULL);
+ newhead = relptr_access(base, victim->next);
+ if (newhead != NULL)
+ relptr_copy(newhead->prev, victim->prev);
+ relptr_store(base, fpm->btree_recycle, newhead);
+ Assert(fpm_pointer_is_page_aligned(base, victim));
+ fpm->btree_recycle_count--;
+ return (FreePageBtree *) victim;
+}
+
+/*
+ * Insert an item into an internal page.
+ */
+static void
+FreePageBtreeInsertInternal(char *base, FreePageBtree *btp, Size index,
+ Size first_page, FreePageBtree *child)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.internal_key[index + 1], &btp->u.internal_key[index],
+ sizeof(FreePageBtreeInternalKey) * (btp->hdr.nused - index));
+ btp->u.internal_key[index].first_page = first_page;
+ relptr_store(base, btp->u.internal_key[index].child, child);
+ ++btp->hdr.nused;
+}
+
+/*
+ * Insert an item into a leaf page.
+ */
+static void
+FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index, Size first_page,
+ Size npages)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.leaf_key[index + 1], &btp->u.leaf_key[index],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+ btp->u.leaf_key[index].first_page = first_page;
+ btp->u.leaf_key[index].npages = npages;
+ ++btp->hdr.nused;
+}
+
+/*
+ * Put a page on the btree recycle list.
+ */
+static void
+FreePageBtreeRecycle(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *head = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = 1;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->btree_recycle, span);
+ fpm->btree_recycle_count++;
+}
+
+/*
+ * Remove an item from the btree at the given position on the given page.
+ */
+static void
+FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp, Size index)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(index < btp->hdr.nused);
+
+ /* When last item is removed, extirpate entire page from btree. */
+ if (btp->hdr.nused == 1)
+ {
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+
+ /* Physically remove the key from the page. */
+ --btp->hdr.nused;
+ if (index < btp->hdr.nused)
+ memmove(&btp->u.leaf_key[index], &btp->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+
+ /* If we just removed the first key, adjust ancestor keys. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, btp);
+
+ /* Consider whether to consolidate this page with a sibling. */
+ FreePageBtreeConsolidate(fpm, btp);
+}
+
+/*
+ * Remove a page from the btree. Caller is responsible for having relocated
+ * any keys from this page that are still wanted. The page is placed on the
+ * recycled list.
+ */
+static void
+FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *parent;
+ Size index;
+ Size first_page;
+
+ for (;;)
+ {
+ /* Find parent page. */
+ parent = relptr_access(base, btp->hdr.parent);
+ if (parent == NULL)
+ {
+ /* We are removing the root page. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->btree_depth = 0;
+ Assert(fpm->singleton_first_page == 0);
+ Assert(fpm->singleton_npages == 0);
+ return;
+ }
+
+ /*
+ * If the parent contains only one item, we need to remove it as well.
+ */
+ if (parent->hdr.nused > 1)
+ break;
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+ btp = parent;
+ }
+
+ /* Find and remove the downlink. */
+ first_page = FreePageBtreeFirstKey(btp);
+ if (parent->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ index = FreePageBtreeSearchLeaf(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.leaf_key[index],
+ &parent->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ else
+ {
+ index = FreePageBtreeSearchInternal(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.internal_key[index],
+ &parent->u.internal_key[index + 1],
+ sizeof(FreePageBtreeInternalKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ parent->hdr.nused--;
+ Assert(parent->hdr.nused > 0);
+
+ /* Recycle the page. */
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+
+ /* Adjust ancestor keys if needed. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+
+ /* Consider whether to consolidate the parent with a sibling. */
+ FreePageBtreeConsolidate(fpm, parent);
+}
+
+/*
+ * Search the btree for an entry for the given first page and initialize
+ * *result with the results of the search. result->page and result->index
+ * indicate either the position of an exact match or the position at which
+ * the new key should be inserted. result->found is true for an exact match,
+ * otherwise false. result->split_pages will contain the number of additional
+ * btree pages that will be needed when performing a split to insert a key.
+ * Except as described above, the contents of fields in the result object are
+ * undefined on return.
+ */
+static void
+FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *btp = relptr_access(base, fpm->btree_root);
+ Size index;
+
+ result->split_pages = 1;
+
+ /* If the btree is empty, there's nothing to find. */
+ if (btp == NULL)
+ {
+ result->page = NULL;
+ result->found = false;
+ return;
+ }
+
+ /* Descend until we hit a leaf. */
+ while (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ FreePageBtree *child;
+ bool found_exact;
+
+ index = FreePageBtreeSearchInternal(btp, first_page);
+ found_exact = index < btp->hdr.nused &&
+ btp->u.internal_key[index].first_page == first_page;
+
+ /*
+ * If we found an exact match we descend directly. Otherwise, we
+ * descend into the child to the left if possible so that we can find
+ * the insertion point at that child's high end.
+ */
+ if (!found_exact && index > 0)
+ --index;
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_INTERNAL_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Descend to appropriate child page. */
+ Assert(index < btp->hdr.nused);
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ Assert(relptr_access(base, child->hdr.parent) == btp);
+ btp = child;
+ }
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_LEAF_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_LEAF_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Search leaf page. */
+ index = FreePageBtreeSearchLeaf(btp, first_page);
+
+ /* Assemble results. */
+ result->page = btp;
+ result->index = index;
+ result->found = index < btp->hdr.nused &&
+ first_page == btp->u.leaf_key[index].first_page;
+}
+
+/*
+ * Search an internal page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if none.
+ */
+static Size
+FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_INTERNAL_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.internal_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Search a leaf page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if none.
+ */
+static Size
+FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_LEAF_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.leaf_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Allocate a new btree page and move half the keys from the provided page
+ * to the new page. Caller is responsible for making sure that there's a
+ * page available from fpm->btree_recycle. Returns a pointer to the new page,
+ * to which caller must add a downlink.
+ */
+static FreePageBtree *
+FreePageBtreeSplitPage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ FreePageBtree *newsibling;
+
+ newsibling = FreePageBtreeGetRecycled(fpm);
+ newsibling->hdr.magic = btp->hdr.magic;
+ newsibling->hdr.nused = btp->hdr.nused / 2;
+ relptr_copy(newsibling->hdr.parent, btp->hdr.parent);
+ btp->hdr.nused -= newsibling->hdr.nused;
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ memcpy(&newsibling->u.leaf_key,
+ &btp->u.leaf_key[btp->hdr.nused],
+ sizeof(FreePageBtreeLeafKey) * newsibling->hdr.nused);
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ memcpy(&newsibling->u.internal_key,
+ &btp->u.internal_key[btp->hdr.nused],
+ sizeof(FreePageBtreeInternalKey) * newsibling->hdr.nused);
+ FreePageBtreeUpdateParentPointers(fpm_segment_base(fpm), newsibling);
+ }
+
+ return newsibling;
+}
+
+/*
+ * When internal pages are split or merged, the parent pointers of their
+ * children must be updated.
+ */
+static void
+FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp)
+{
+ Size i;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ for (i = 0; i < btp->hdr.nused; ++i)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[i].child);
+ relptr_store(base, child->hdr.parent, btp);
+ }
+}
+
+/*
+ * Debugging dump of btree data.
+ */
+static void
+FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+ Size pageno = fpm_pointer_to_page(base, btp);
+ Size index;
+ FreePageBtree *check_parent;
+
+ check_stack_depth();
+ check_parent = relptr_access(base, btp->hdr.parent);
+ appendStringInfo(buf, " %zu@%d %c", pageno, level,
+ btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ? 'i' : 'l');
+ if (parent != check_parent)
+ appendStringInfo(buf, " [actual parent %zu, expected %zu]",
+ fpm_pointer_to_page(base, check_parent),
+ fpm_pointer_to_page(base, parent));
+ appendStringInfoChar(buf, ':');
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ appendStringInfo(buf, " %zu->%zu",
+ btp->u.internal_key[index].first_page,
+ btp->u.internal_key[index].child.relptr_off / FPM_PAGE_SIZE);
+ else
+ appendStringInfo(buf, " %zu(%zu)",
+ btp->u.leaf_key[index].first_page,
+ btp->u.leaf_key[index].npages);
+ }
+ appendStringInfo(buf, "\n");
+
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ FreePageManagerDumpBtree(fpm, child, btp, level + 1, buf);
+ }
+ }
+}
+
+/*
+ * Debugging dump of free-span data.
+ */
+static void
+FreePageManagerDumpSpans(FreePageManager *fpm, FreePageSpanLeader *span,
+ Size expected_pages, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+
+ while (span != NULL)
+ {
+ if (span->npages != expected_pages)
+ appendStringInfo(buf, " %zu(%zu)", fpm_pointer_to_page(base, span),
+ span->npages);
+ else
+ appendStringInfo(buf, " %zu", fpm_pointer_to_page(base, span));
+ span = relptr_access(base, span->next);
+ }
+
+ appendStringInfo(buf, "\n");
+}
+
+/*
+ * This function allocates a run of pages of the given length from the free
+ * page manager.
+ */
+static bool
+FreePageManagerGetInternal(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = NULL;
+ FreePageSpanLeader *prev;
+ FreePageSpanLeader *next;
+ FreePageBtreeSearchResult result;
+ Size victim_page = 0; /* placate compiler */
+ Size f;
+
+ /*
+ * Search for a free span.
+ *
+ * Right now, we use a simple best-fit policy here, but it's possible for
+ * this to result in memory fragmentation if we're repeatedly asked to
+ * allocate chunks just a little smaller than what we have available.
+ * Hopefully, this is unlikely, because we expect most requests to be
+ * single pages or superblock-sized chunks -- but no policy can be optimal
+ * under all circumstances unless it has knowledge of future allocation
+ * patterns.
+ */
+ for (f = Min(npages, FPM_NUM_FREELISTS) - 1; f < FPM_NUM_FREELISTS; ++f)
+ {
+ /* Skip empty freelists. */
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+
+ /*
+ * All of the freelists except the last one contain only items of a
+ * single size, so we just take the first one. But the final free
+ * list contains everything too big for any of the other lists, so we
+ * need to search the list.
+ */
+ if (f < FPM_NUM_FREELISTS - 1)
+ victim = relptr_access(base, fpm->freelist[f]);
+ else
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[f]);
+ do
+ {
+ if (candidate->npages >= npages && (victim == NULL ||
+ victim->npages > candidate->npages))
+ {
+ victim = candidate;
+ if (victim->npages == npages)
+ break;
+ }
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ break;
+ }
+
+ /* If we didn't find an allocatable span, return failure. */
+ if (victim == NULL)
+ return false;
+
+ /* Remove span from free list. */
+ Assert(victim->magic == FREE_PAGE_SPAN_LEADER_MAGIC);
+ prev = relptr_access(base, victim->prev);
+ next = relptr_access(base, victim->next);
+ if (prev != NULL)
+ relptr_copy(prev->next, victim->next);
+ else
+ relptr_copy(fpm->freelist[f], victim->next);
+ if (next != NULL)
+ relptr_copy(next->prev, victim->prev);
+ victim_page = fpm_pointer_to_page(base, victim);
+
+ /* Decide whether we might be invalidating contiguous_pages. */
+ if (f == FPM_NUM_FREELISTS - 1 &&
+ victim->npages == fpm->contiguous_pages)
+ {
+ /*
+ * The victim span came from the oversized freelist, and had the same
+ * size as the longest span. There may or may not be another one of
+ * the same size, so contiguous_pages must be recomputed just to be
+ * safe.
+ */
+ fpm->contiguous_pages_dirty = true;
+ }
+ else if (f + 1 == fpm->contiguous_pages &&
+ relptr_is_null(fpm->freelist[f]))
+ {
+ /*
+ * The victim span came from a fixed-size freelist, and it was the
+ * list for spans of the same size as the current longest span, and
+ * the list is now empty after removing the victim. So
+ * contiguous_pages must be recomputed without a doubt.
+ */
+ fpm->contiguous_pages_dirty = true;
+ }
+
+ /*
+ * If we haven't initialized the btree yet, the victim must be the single
+ * span stored within the FreePageManager itself. Otherwise, we need to
+ * update the btree.
+ */
+ if (relptr_is_null(fpm->btree_root))
+ {
+ Assert(victim_page == fpm->singleton_first_page);
+ Assert(victim->npages == fpm->singleton_npages);
+ Assert(victim->npages >= npages);
+ fpm->singleton_first_page += npages;
+ fpm->singleton_npages -= npages;
+ if (fpm->singleton_npages > 0)
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ }
+ else
+ {
+ /*
+ * If the span we found is exactly the right size, remove it from the
+ * btree completely. Otherwise, adjust the btree entry to reflect the
+ * still-unallocated portion of the span, and put that portion on the
+ * appropriate free list.
+ */
+ FreePageBtreeSearch(fpm, victim_page, &result);
+ Assert(result.found);
+ if (victim->npages == npages)
+ FreePageBtreeRemove(fpm, result.page, result.index);
+ else
+ {
+ FreePageBtreeLeafKey *key;
+
+ /* Adjust btree to reflect remaining pages. */
+ Assert(victim->npages > npages);
+ key = &result.page->u.leaf_key[result.index];
+ Assert(key->npages == victim->npages);
+ key->first_page += npages;
+ key->npages -= npages;
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put the unallocated pages back on the appropriate free list. */
+ FreePagePushSpanLeader(fpm, victim_page + npages,
+ victim->npages - npages);
+ }
+ }
+
+ /* Return results to caller. */
+ *first_page = fpm_pointer_to_page(base, victim);
+ return true;
+}
+
+/*
+ * Put a range of pages into the btree and freelists, consolidating it with
+ * existing free spans just before and/or after it. If 'soft' is true,
+ * only perform the insertion if it can be done without allocating new btree
+ * pages; if false, do it always. Returns 0 if the soft flag caused the
+ * insertion to be skipped, or otherwise the size of the contiguous span
+ * created by the insertion. This may be larger than npages if we're able
+ * to consolidate with an adjacent range. A non-soft insertion may allocate
+ * pages for the btree's own use via FreePageManagerGetInternal, which might
+ * invalidate the current largest run, requiring it to be recomputed.
+ */
+static Size
+FreePageManagerPutInternal(FreePageManager *fpm, Size first_page, Size npages,
+ bool soft)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtreeSearchResult result;
+ FreePageBtreeLeafKey *prevkey = NULL;
+ FreePageBtreeLeafKey *nextkey = NULL;
+ FreePageBtree *np;
+ Size nindex;
+
+ Assert(npages > 0);
+
+ /* We can store a single free span without initializing the btree. */
+ if (fpm->btree_depth == 0)
+ {
+ if (fpm->singleton_npages == 0)
+ {
+ /* Don't have a span yet; store this one. */
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return fpm->singleton_npages;
+ }
+ else if (fpm->singleton_first_page + fpm->singleton_npages ==
+ first_page)
+ {
+ /* New span immediately follows sole existing span. */
+ fpm->singleton_npages += npages;
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else if (first_page + npages == fpm->singleton_first_page)
+ {
+ /* New span immediately precedes sole existing span. */
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages += npages;
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else
+ {
+ /* Not contiguous; we need to initialize the btree. */
+ Size root_page;
+ FreePageBtree *root;
+
+ if (!relptr_is_null(fpm->btree_recycle))
+ root = FreePageBtreeGetRecycled(fpm);
+ else if (FreePageManagerGetInternal(fpm, 1, &root_page))
+ root = (FreePageBtree *) fpm_page_to_pointer(base, root_page);
+ else
+ {
+ /* We'd better be able to get a page from the existing range. */
+ elog(FATAL, "free page manager btree is corrupt");
+ }
+
+ /* Create the btree and move the preexisting range into it. */
+ root->hdr.magic = FREE_PAGE_LEAF_MAGIC;
+ root->hdr.nused = 1;
+ relptr_store(base, root->hdr.parent, (FreePageBtree *) NULL);
+ root->u.leaf_key[0].first_page = fpm->singleton_first_page;
+ root->u.leaf_key[0].npages = fpm->singleton_npages;
+ relptr_store(base, fpm->btree_root, root);
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->btree_depth = 1;
+
+ /*
+ * Corner case: it may be that the btree root took the very last
+ * free page. In that case, the sole btree entry covers a zero
+ * page run, which is invalid. Overwrite it with the entry we're
+ * trying to insert and get out.
+ */
+ if (root->u.leaf_key[0].npages == 0)
+ {
+ root->u.leaf_key[0].first_page = first_page;
+ root->u.leaf_key[0].npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return npages;
+ }
+
+ /* Fall through to insert the new key. */
+ }
+ }
+
+ /* Search the btree. */
+ FreePageBtreeSearch(fpm, first_page, &result);
+ Assert(!result.found);
+ if (result.index > 0)
+ prevkey = &result.page->u.leaf_key[result.index - 1];
+ if (result.index < result.page->hdr.nused)
+ {
+ np = result.page;
+ nindex = result.index;
+ nextkey = &result.page->u.leaf_key[result.index];
+ }
+ else
+ {
+ np = FreePageBtreeFindRightSibling(base, result.page);
+ nindex = 0;
+ if (np != NULL)
+ nextkey = &np->u.leaf_key[0];
+ }
+
+ /* Consolidate with the previous entry if possible. */
+ if (prevkey != NULL && prevkey->first_page + prevkey->npages >= first_page)
+ {
+ bool remove_next = false;
+ Size result;
+
+ Assert(prevkey->first_page + prevkey->npages == first_page);
+ prevkey->npages = (first_page - prevkey->first_page) + npages;
+
+ /* Check whether we can *also* consolidate with the following entry. */
+ if (nextkey != NULL &&
+ prevkey->first_page + prevkey->npages >= nextkey->first_page)
+ {
+ Assert(prevkey->first_page + prevkey->npages ==
+ nextkey->first_page);
+ prevkey->npages = (nextkey->first_page - prevkey->first_page)
+ + nextkey->npages;
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ remove_next = true;
+ }
+
+ /* Put the span on the correct freelist and save size. */
+ FreePagePopSpanLeader(fpm, prevkey->first_page);
+ FreePagePushSpanLeader(fpm, prevkey->first_page, prevkey->npages);
+ result = prevkey->npages;
+
+ /*
+ * If we consolidated with both the preceding and following entries,
+ * we must remove the following entry. We do this last, because
+ * removing an element from the btree may invalidate pointers we hold
+ * into the current data structure.
+ *
+ * NB: The btree is technically in an invalid state at this point
+ * because we've already updated prevkey to cover the same key space
+ * as nextkey. FreePageBtreeRemove() shouldn't notice that, though.
+ */
+ if (remove_next)
+ FreePageBtreeRemove(fpm, np, nindex);
+
+ return result;
+ }
+
+ /* Consolidate with the next entry if possible. */
+ if (nextkey != NULL && first_page + npages >= nextkey->first_page)
+ {
+ Size newpages;
+
+ /* Compute new size for span. */
+ Assert(first_page + npages == nextkey->first_page);
+ newpages = (nextkey->first_page - first_page) + nextkey->npages;
+
+ /* Put span on correct free list. */
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ FreePagePushSpanLeader(fpm, first_page, newpages);
+
+ /* Update key in place. */
+ nextkey->first_page = first_page;
+ nextkey->npages = newpages;
+
+ /* If reducing first key on page, ancestors might need adjustment. */
+ if (nindex == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, np);
+
+ return nextkey->npages;
+ }
+
+ /* Split leaf page and as many of its ancestors as necessary. */
+ if (result.split_pages > 0)
+ {
+ /*
+ * NB: We could consider various coping strategies here to avoid a
+ * split; most obviously, if np != result.page, we could target that
+ * page instead. More complicated shuffling strategies could be
+ * possible as well; basically, unless every single leaf page is 100%
+ * full, we can jam this key in there if we try hard enough. It's
+ * unlikely that trying that hard is worthwhile, but it's possible we
+ * might need to make more than no effort. For now, we just do the
+ * easy thing, which is nothing.
+ */
+
+ /* If this is a soft insert, it's time to give up. */
+ if (soft)
+ return 0;
+
+ /* Check whether we need to allocate more btree pages to split. */
+ if (result.split_pages > fpm->btree_recycle_count)
+ {
+ Size pages_needed;
+ Size recycle_page;
+ Size i;
+
+ /*
+ * Allocate the required number of pages, recycling each one in
+ * turn. This should never fail, because if we've got enough
+ * spans of free pages kicking around that we need additional
+ * storage space just to remember them all, then we should
+ * certainly have enough to expand the btree, which should only
+ * ever use a tiny number of pages compared to the number under
+ * management. If it does, something's badly screwed up.
+ */
+ pages_needed = result.split_pages - fpm->btree_recycle_count;
+ for (i = 0; i < pages_needed; ++i)
+ {
+ if (!FreePageManagerGetInternal(fpm, 1, &recycle_page))
+ elog(FATAL, "free page manager btree is corrupt");
+ FreePageBtreeRecycle(fpm, recycle_page);
+ }
+
+ /*
+ * The act of allocating pages to recycle may have invalidated the
+ * results of our previous btree search, so repeat it. (We could
+ * recheck whether any of our split-avoidance strategies that were
+ * not viable before now are, but it hardly seems worthwhile, so
+ * we don't bother. Consolidation can't be possible now if it
+ * wasn't previously.)
+ */
+ FreePageBtreeSearch(fpm, first_page, &result);
+
+ /*
+ * The act of allocating pages for use in constructing our btree
+ * should never cause any page to become more full, so the new
+ * split depth should be no greater than the old one, and perhaps
+ * less if we fortuitously allocated a chunk that freed up a slot
+ * on the page we need to update.
+ */
+ Assert(result.split_pages <= fpm->btree_recycle_count);
+ }
+
+ /* If we still need to perform a split, do it. */
+ if (result.split_pages > 0)
+ {
+ FreePageBtree *split_target = result.page;
+ FreePageBtree *child = NULL;
+ Size key = first_page;
+
+ for (;;)
+ {
+ FreePageBtree *newsibling;
+ FreePageBtree *parent;
+
+ /* Identify parent page, which must receive downlink. */
+ parent = relptr_access(base, split_target->hdr.parent);
+
+ /* Split the page - downlink not added yet. */
+ newsibling = FreePageBtreeSplitPage(fpm, split_target);
+
+ /*
+ * At this point in the loop, we're always carrying a pending
+ * insertion. On the first pass, it's the actual key we're
+ * trying to insert; on subsequent passes, it's the downlink
+ * that needs to be added as a result of the split performed
+ * during the previous loop iteration. Since we've just split
+ * the page, there's definitely room on one of the two
+ * resulting pages.
+ */
+ if (child == NULL)
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into = key < newsibling->u.leaf_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchLeaf(insert_into, key);
+ FreePageBtreeInsertLeaf(insert_into, index, key, npages);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+ else
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into =
+ key < newsibling->u.internal_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchInternal(insert_into, key);
+ FreePageBtreeInsertInternal(base, insert_into, index,
+ key, child);
+ relptr_store(base, child->hdr.parent, insert_into);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+
+ /* If the page we just split has no parent, split the root. */
+ if (parent == NULL)
+ {
+ FreePageBtree *newroot;
+
+ newroot = FreePageBtreeGetRecycled(fpm);
+ newroot->hdr.magic = FREE_PAGE_INTERNAL_MAGIC;
+ newroot->hdr.nused = 2;
+ relptr_store(base, newroot->hdr.parent,
+ (FreePageBtree *) NULL);
+ newroot->u.internal_key[0].first_page =
+ FreePageBtreeFirstKey(split_target);
+ relptr_store(base, newroot->u.internal_key[0].child,
+ split_target);
+ relptr_store(base, split_target->hdr.parent, newroot);
+ newroot->u.internal_key[1].first_page =
+ FreePageBtreeFirstKey(newsibling);
+ relptr_store(base, newroot->u.internal_key[1].child,
+ newsibling);
+ relptr_store(base, newsibling->hdr.parent, newroot);
+ relptr_store(base, fpm->btree_root, newroot);
+ fpm->btree_depth++;
+
+ break;
+ }
+
+ /* If the parent page isn't full, insert the downlink. */
+ key = newsibling->u.internal_key[0].first_page;
+ if (parent->hdr.nused < FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Size index;
+
+ index = FreePageBtreeSearchInternal(parent, key);
+ FreePageBtreeInsertInternal(base, parent, index,
+ key, newsibling);
+ relptr_store(base, newsibling->hdr.parent, parent);
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+ break;
+ }
+
+ /* The parent also needs to be split, so loop around. */
+ child = newsibling;
+ split_target = parent;
+ }
+
+ /*
+ * The loop above did the insert, so just need to update the free
+ * list, and we're done.
+ */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+ }
+ }
+
+ /* Physically add the key to the page. */
+ Assert(result.page->hdr.nused < FPM_ITEMS_PER_LEAF_PAGE);
+ FreePageBtreeInsertLeaf(result.page, result.index, first_page, npages);
+
+ /* If new first key on page, ancestors might need adjustment. */
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put it on the free list. */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+}
+
+/*
+ * Remove a FreePageSpanLeader from the linked-list that contains it, either
+ * because we're changing the size of the span, or because we're allocating it.
+ */
+static void
+FreePagePopSpanLeader(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *span;
+ FreePageSpanLeader *next;
+ FreePageSpanLeader *prev;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+
+ next = relptr_access(base, span->next);
+ prev = relptr_access(base, span->prev);
+ if (next != NULL)
+ relptr_copy(next->prev, span->prev);
+ if (prev != NULL)
+ relptr_copy(prev->next, span->next);
+ else
+ {
+ Size f = Min(span->npages, FPM_NUM_FREELISTS) - 1;
+
+ Assert(fpm->freelist[f].relptr_off == pageno * FPM_PAGE_SIZE);
+ relptr_copy(fpm->freelist[f], span->next);
+ }
+}
+
+/*
+ * Initialize a new FreePageSpanLeader and put it on the appropriate free list.
+ */
+static void
+FreePagePushSpanLeader(FreePageManager *fpm, Size first_page, Size npages)
+{
+ char *base = fpm_segment_base(fpm);
+ Size f = Min(npages, FPM_NUM_FREELISTS) - 1;
+ FreePageSpanLeader *head = relptr_access(base, fpm->freelist[f]);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, first_page);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = npages;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->freelist[f], span);
+}
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
new file mode 100644
index 0000000..b19d417
--- /dev/null
+++ b/src/include/utils/dsa.h
@@ -0,0 +1,105 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.h
+ * Dynamic shared memory areas.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/utils/dsa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSA_H
+#define DSA_H
+
+#include "postgres.h"
+
+#include "port/atomics.h"
+#include "storage/dsm.h"
+
+/* The opaque type used for an area. */
+struct dsa_area;
+typedef struct dsa_area dsa_area;
+
+/*
+ * If this system doesn't support atomic operations on 64 bit values then
+ * we fall back to 32 bit dsa_pointer. For testing purposes,
+ * USE_SMALL_DSA_POINTER can be defined to force the use of 32 bit
+ * dsa_pointer even on systems that support 64 bit atomics.
+ */
+#ifndef PG_HAVE_ATOMIC_U64_SUPPORT
+#define SIZEOF_DSA_POINTER 4
+#else
+#ifdef USE_SMALL_DSA_POINTER
+#define SIZEOF_DSA_POINTER 4
+#else
+#define SIZEOF_DSA_POINTER 8
+#endif
+#endif
+
+/*
+ * The type of 'relative pointers' to memory allocated by a dynamic shared
+ * area. dsa_pointer values can be shared with other processes, but must be
+ * converted to backend-local pointers before they can be dereferenced. See
+ * dsa_get_address. An atomic variant, dsa_pointer_atomic, and appropriately
+ * sized atomic operations are also defined here.
+ */
+#if SIZEOF_DSA_POINTER == 4
+typedef uint32 dsa_pointer;
+typedef pg_atomic_uint32 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u32
+#define dsa_pointer_atomic_read pg_atomic_read_u32
+#define dsa_pointer_atomic_write pg_atomic_write_u32
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u32
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u32
+#else
+typedef uint64 dsa_pointer;
+typedef pg_atomic_uint64 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u64
+#define dsa_pointer_atomic_read pg_atomic_read_u64
+#define dsa_pointer_atomic_write pg_atomic_write_u64
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u64
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u64
+#endif
+
+/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
+#define InvalidDsaPointer ((dsa_pointer) 0)
+
+/* Check if a dsa_pointer value is valid. */
+#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
+
+/*
+ * The type used for dsa_area handles. dsa_handle values can be shared with
+ * other processes, so that they can attach to the area. This provides a way
+ * to share allocated storage with other processes.
+ *
+ * The handle for a dsa_area is currently implemented as the dsm_handle
+ * for the first DSM segment backing this dynamic storage area, but client
+ * code shouldn't assume that is true.
+ */
+typedef dsm_handle dsa_handle;
+
+extern void dsa_startup(void);
+
+extern dsa_area *dsa_create(int tranche_id, const char *tranche_name);
+extern dsa_area *dsa_create_in_place(void *place, Size size,
+ int tranche_id, const char *tranche_name);
+extern dsa_area *dsa_attach(dsa_handle handle);
+extern dsa_area *dsa_attach_in_place(void *place);
+extern void dsa_release_in_place(void *place);
+extern void dsa_on_dsm_detach_release_in_place(dsm_segment *, Datum);
+extern void dsa_on_shmem_exit_release_in_place(int, Datum);
+extern void dsa_pin_mapping(dsa_area *area);
+extern void dsa_detach(dsa_area *area);
+extern void dsa_pin(dsa_area *area);
+extern void dsa_unpin(dsa_area *area);
+extern void dsa_set_size_limit(dsa_area *area, Size limit);
+extern dsa_handle dsa_get_handle(dsa_area *area);
+extern dsa_pointer dsa_allocate(dsa_area *area, Size size);
+extern void dsa_free(dsa_area *area, dsa_pointer dp);
+extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern void dsa_trim(dsa_area *area);
+extern void dsa_dump(dsa_area *area);
+
+#endif /* DSA_H */
diff --git a/src/include/utils/freepage.h b/src/include/utils/freepage.h
new file mode 100644
index 0000000..c0adf4d
--- /dev/null
+++ b/src/include/utils/freepage.h
@@ -0,0 +1,105 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.h
+ * Management of page-organized free memory.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/include/utils/freepage.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef FREEPAGE_H
+#define FREEPAGE_H
+
+#include "storage/lwlock.h"
+#include "utils/relptr.h"
+
+/* Forward declarations. */
+typedef struct FreePageSpanLeader FreePageSpanLeader;
+typedef struct FreePageBtree FreePageBtree;
+typedef struct FreePageManager FreePageManager;
+
+/*
+ * PostgreSQL normally uses 8kB pages for most things, but many common
+ * architecture/operating system pairings use a 4kB page size for memory
+ * allocation, so we do that here also. We assume that a large allocation
+ * is likely to begin on a page boundary; if not, we'll discard bytes from
+ * the beginning and end of the object and use only the middle portion that
+ * is properly aligned. This works, but is not ideal, so it's best to keep
+ * this conservatively small. There don't seem to be any common architectures
+ * where the page size is less than 4kB, so this should be good enough; also,
+ * making it smaller would increase the space consumed by the address space
+ * map, which also uses this page size.
+ */
+#define FPM_PAGE_SIZE 4096
+
+/*
+ * Each freelist except for the last contains only spans of one particular
+ * size. Everything larger goes on the last one. In some sense this seems
+ * like a waste since most allocations are in a few common sizes, but it
+ * means that small allocations can simply pop the head of the relevant list
+ * without needing to worry about whether the object we find there is of
+ * precisely the correct size (because we know it must be).
+ */
+#define FPM_NUM_FREELISTS 129
+
+/* Define relative pointer types. */
+relptr_declare(FreePageBtree, RelptrFreePageBtree);
+relptr_declare(FreePageManager, RelptrFreePageManager);
+relptr_declare(FreePageSpanLeader, RelptrFreePageSpanLeader);
+
+/* Everything we need in order to manage free pages (see freepage.c) */
+struct FreePageManager
+{
+ RelptrFreePageManager self;
+ RelptrFreePageBtree btree_root;
+ RelptrFreePageSpanLeader btree_recycle;
+ unsigned btree_depth;
+ unsigned btree_recycle_count;
+ Size singleton_first_page;
+ Size singleton_npages;
+ Size contiguous_pages;
+ bool contiguous_pages_dirty;
+ RelptrFreePageSpanLeader freelist[FPM_NUM_FREELISTS];
+#ifdef FPM_EXTRA_ASSERTS
+ /* For debugging only, pages put minus pages gotten. */
+ Size free_pages;
+#endif
+};
+
+/* Macros to convert between page numbers (expressed as Size) and pointers. */
+#define fpm_page_to_pointer(base, page) \
+ (AssertVariableIsOfTypeMacro(page, Size), \
+ (base) + FPM_PAGE_SIZE * (page))
+#define fpm_pointer_to_page(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) / FPM_PAGE_SIZE)
+
+/* Macro to convert an allocation size to a number of pages. */
+#define fpm_size_to_pages(sz) \
+ (((sz) + FPM_PAGE_SIZE - 1) / FPM_PAGE_SIZE)
+
+/* Macros to check alignment of absolute and relative pointers. */
+#define fpm_pointer_is_page_aligned(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) % FPM_PAGE_SIZE == 0)
+#define fpm_relptr_is_page_aligned(base, relptr) \
+ ((relptr).relptr_off % FPM_PAGE_SIZE == 0)
+
+/* Macro to find base address of the segment containing a FreePageManager. */
+#define fpm_segment_base(fpm) \
+ (((char *) (fpm)) - (fpm)->self.relptr_off)
+
+/* Macro to access a FreePageManager's largest consecutive run of pages. */
+#define fpm_largest(fpm) \
+ (fpm->contiguous_pages)
+
+/* Functions to manipulate the free page map. */
+extern void FreePageManagerInitialize(FreePageManager *fpm, char *base);
+extern bool FreePageManagerGet(FreePageManager *fpm, Size npages,
+ Size *first_page);
+extern void FreePageManagerPut(FreePageManager *fpm, Size first_page,
+ Size npages);
+extern char *FreePageManagerDump(FreePageManager *fpm);
+
+#endif /* FREEPAGE_H */
diff --git a/src/include/utils/relptr.h b/src/include/utils/relptr.h
new file mode 100644
index 0000000..40139ee
--- /dev/null
+++ b/src/include/utils/relptr.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * relptr.h
+ * This file contains basic declarations for relative pointers.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/include/utils/relptr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef RELPTR_H
+#define RELPTR_H
+
+/*
+ * Relative pointers are intended to be used when storing an address that may
+ * be relative either to the base of the process's address space or some
+ * dynamic shared memory segment mapped therein.
+ *
+ * The idea here is that you declare a relative pointer as relptr(type)
+ * and then use relptr_access to dereference it and relptr_store to change
+ * it. The use of a union here is a hack, because what's stored in the
+ * relptr is always a Size, never an actual pointer. But including a pointer
+ * in the union allows us to use stupid macro tricks to provide some measure
+ * of type-safety.
+ */
+#define relptr(type) union { type *relptr_type; Size relptr_off; }
+
+#define relptr_declare(type, name) \
+ typedef union { type *relptr_type; Size relptr_off; } name;
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (__typeof__((rp).relptr_type)) ((rp).relptr_off == 0 ? NULL : \
+ (base + (rp).relptr_off)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (void *) ((rp).relptr_off == 0 ? NULL : (base + (rp).relptr_off)))
+#endif
+
+#define relptr_is_null(rp) \
+ ((rp).relptr_off == 0)
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ AssertVariableIsOfTypeMacro(val, __typeof__((rp).relptr_type)), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#endif
+
+#define relptr_copy(rp1, rp2) \
+ ((rp1).relptr_off = (rp2).relptr_off)
+
+#endif /* RELPTR_H */
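For readers skimming the patch, the relative-pointer idea in relptr.h can be illustrated with a small standalone sketch. This is not the patch's code: the names rel_off, rel_store and rel_access are hypothetical, and the sketch assumes only what the header itself states, namely that a relative pointer holds an offset from a base address and that offset 0 stands for NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a relptr: an offset from some base address. */
typedef size_t rel_off;

/* Store a pointer as an offset from base; NULL is encoded as offset 0. */
rel_off
rel_store(char *base, void *ptr)
{
	if (ptr == NULL)
		return 0;
	assert((char *) ptr > base);	/* offset 0 is reserved for NULL */
	return (size_t) ((char *) ptr - base);
}

/* Convert an offset back into a pointer valid for this mapping's base. */
void *
rel_access(char *base, rel_off off)
{
	return off == 0 ? NULL : (void *) (base + off);
}
```

Because only the offset is stored, the same value resolves correctly in any process, whatever address the segment happens to be mapped at. One consequence of reserving 0 for NULL is that an object located exactly at the base cannot be referenced this way.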
On Wed, Nov 23, 2016 at 7:07 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Those let you create an area in existing memory (in a DSM segment,
traditional inherited shmem). The in-place versions will still create
DSM segments on demand as required, though I suppose if you wanted to
prevent that you could with dsa_set_size_limit(area, size). One
complication is that of course the automatic detach feature doesn't
work if you're in some random piece of memory. I have exposed
dsa_on_dsm_detach, so that there is a way to hook it up to the detach
hook for a pre-existing DSM segment, but that's the caller's
responsibility.
shm_mq_attach() made the opposite decision about how to solve this
problem, and frankly I think that API is a lot more convenient: if the
first argument to shm_mq_attach() happens to be located inside of a
DSM, you can pass the DSM as the second argument and it registers the
on_dsm_detach() hook for you. If not, you can pass NULL and deal with
it in some other way. But this makes the common case very simple.
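The calling convention described above can be sketched in isolation. Everything here is hypothetical (sketch_segment, sketch_object and the functions are invented for illustration and are not the shm_mq or DSA API); the sketch only shows the pattern of an attach function that takes an optional containing segment and registers its own cleanup hook when one is supplied.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical segment that can run one callback when it is detached. */
typedef struct
{
	void		(*on_detach) (void *arg);
	void	   *arg;
} sketch_segment;

/* Hypothetical object living inside (or outside) such a segment. */
typedef struct
{
	int			attached;
} sketch_object;

/* Cleanup hook: mark the object as no longer attached. */
void
sketch_object_detach(void *arg)
{
	((sketch_object *) arg)->attached = 0;
}

/*
 * Attach to obj.  If seg is non-NULL, register the cleanup hook on it so the
 * caller does not have to; if seg is NULL, cleanup is the caller's problem.
 */
void
sketch_object_attach(sketch_object *obj, sketch_segment *seg)
{
	assert(obj != NULL);
	obj->attached = 1;
	if (seg != NULL)
	{
		seg->on_detach = sketch_object_detach;
		seg->arg = obj;
	}
}

/* Detaching the segment fires whatever hook was registered. */
void
sketch_segment_detach(sketch_segment *seg)
{
	if (seg->on_detach != NULL)
		seg->on_detach(seg->arg);
}
```

The design choice is that the common case (the object sits inside a segment) needs no extra call from the client, while the NULL case keeps the function usable for memory that is not in a segment at all.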
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
More review:
+ * For large objects, we just stick all of the allocations in fullness class
+ * 0. Since we can just return the space directly to the free page manager,
+ * we don't really need them on a list at all, except that if someone wants
+ * to bulk release everything allocated using this BlockAreaContext, we
+ * have no other way of finding them.
This comment is out-of-date.
+ /*
+ * If this is the only span, and there is no active span, then maybe
+ * we should probably move this span to fullness class 1. (Otherwise
+ * if you allocate exactly all the objects in the only span, it moves
+ * to class 3, then you free them all, it moves to 2, and then is
+ * given back, leaving no active span).
+ */
"maybe we should probably" seems to have one more doubt-expressing
word than it needs.
+ if (size_class == DSA_SCLASS_SPAN_LARGE)
+ /* Large object frees give back segments aggressively already. */
+ continue;
We generally use braces in this kind of case.
+ * Search the fullness class 1 only. That is where we expect to find
extra "the"
+ /* Call for effect (we don't need the result). */
+ get_segment_by_index(area, index);
...
+ return area->segment_maps[index].mapped_address + offset;
It isn't guaranteed that area->segment_maps[index].mapped_address will
be non-NULL on return from get_segment_by_index, and then this
function will return a completely bogus pointer to the caller. I
think you should probably elog() instead.
+ elog(ERROR, "dsa: can't attach to area handle %u", handle);
Avoid ":" in elog messages. You don't really need to - and it isn't
project style to - tag these with "dsa:"; that's what \errverbose or
\set VERBOSITY verbose is for. In this particular case, I might just
adopt the formulation from parallel.c:
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("could not map dynamic shared memory segment")));
+ elog(FATAL,
+ "dsa couldn't find run of pages: fpm_largest out of sync");
Here I'd go with "dsa could not find %u free pages".
+ elog(ERROR, "dsa_pin: area already pinned");
"dsa_area already pinned"
+ elog(ERROR, "dsa_unpin: area not pinned");
"dsa_area not pinned"
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to segment");
As above.
+static dsa_segment_map *
+get_segment_by_index(dsa_area *area, dsa_segment_index index)
+{
+ if (unlikely(area->segment_maps[index].mapped_address == NULL))
+ {
+ dsm_handle handle;
+ dsm_segment *segment;
+ dsa_segment_map *segment_map;
+
+ handle = area->control->segment_handles[index];
Don't you need to acquire the lock for this?
+ /* Check all currently mapped segments to find what's been freed. */
+ for (i = 0; i <= area->high_segment_index; ++i)
+ {
+ if (area->segment_maps[i].header != NULL &&
+ area->segment_maps[i].header->freed)
+ {
+ dsm_detach(area->segment_maps[i].segment);
+ area->segment_maps[i].segment = NULL;
+ area->segment_maps[i].header = NULL;
+ area->segment_maps[i].mapped_address = NULL;
+ }
+ }
+ area->freed_segment_counter = freed_segment_counter;
And this?
+/*
+ * Release a DSA area that was produced by dsa_create_in_place or
+ * dsa_attach_in_place. It is preferable to use one of the 'dsa_on_XXX'
+ * callbacks so that this is managed automatically, because failure to release
+ * an area created in-place leaks its segments permanently.
+ */
+void
+dsa_release_in_place(void *place)
+{
+ decrement_reference_count((dsa_area_control *) place);
+}
Since this seems to be the only caller of decrement_reference_count,
you could just put the logic here. The contract for this function is
also a bit unclear from the header comment. I initially thought that
it was your intention that this should be called from every process
that has either created or attached the segment. But that doesn't
seem like it will work, because decrement_reference_count calls
dsm_unpin_segment on every segment, and a segment can only be pinned
once, so everybody else would fail. So maybe the idea is that ANY ONE
process has to call dsa_release_in_place. But then that could lead to
failures in other backends inside get_segment_by_index(), because
segments they don't have mapped might already be gone. OK, third try:
maybe the idea is that the LAST process out has to call
dsa_release_in_place(). But how do the various cooperating processes
know which one that is?
I've also realized another thing that's not so good about this:
superblocks are 64kB, so allocating 64kB of initial space probably
just wastes most of it. I think we want to either allocate just
enough space to hold the control information, or else that much space
plus space for at least a few superblocks. I'm inclined to go the
first way, because it seems a bit overenthusiastic to allocate 256kB
or 512kB just on the off chance we might need it. On the other hand,
including a few bytes in the control segment so that we don't need to
allocate a 1MB segment that we might not need sounds pretty sharp.
Maybe DSA can expose an API that returns the number of bytes that will
be needed for the control structure, and then the caller can arrange
for that space to be available during the Estimate phase...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 30, 2016 at 4:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
More review:
Thanks!
+ * For large objects, we just stick all of the allocations in fullness class
+ * 0. Since we can just return the space directly to the free page manager,
+ * we don't really need them on a list at all, except that if someone wants
+ * to bulk release everything allocated using this BlockAreaContext, we
+ * have no other way of finding them.
This comment is out-of-date.
Removed.
+ /*
+ * If this is the only span, and there is no active span, then maybe
+ * we should probably move this span to fullness class 1. (Otherwise
+ * if you allocate exactly all the objects in the only span, it moves
+ * to class 3, then you free them all, it moves to 2, and then is
+ * given back, leaving no active span).
+ */
"maybe we should probably" seems to have one more doubt-expressing
word than it needs.
Fixed.
+ if (size_class == DSA_SCLASS_SPAN_LARGE)
+ /* Large object frees give back segments aggressively already. */
+ continue;
We generally use braces in this kind of case.
Fixed.
+ * Search the fullness class 1 only. That is where we expect to find
extra "the"
Fixed.
+ /* Call for effect (we don't need the result). */
+ get_segment_by_index(area, index);
...
+ return area->segment_maps[index].mapped_address + offset;
It isn't guaranteed that area->segment_maps[index].mapped_address will
be non-NULL on return from get_segment_by_index, and then this
function will return a completely bogus pointer to the caller. I
think you should probably elog() instead.
Hmm. Right. In fact it's never OK to ask for a segment by index when
that segment is gone since that implies an access-after-free so there
is no reason for NULL to be handled by callers. I have changed
get_segment_by_index to raise an error.
+ elog(ERROR, "dsa: can't attach to area handle %u", handle);
Avoid ":" in elog messages. You don't really need to - and it isn't
project style to - tag these with "dsa:"; that's what \errverbose or
\set VERBOSITY verbose is for. In this particular case, I might just
adopt the formulation from parallel.c:
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("could not map dynamic shared memory segment")));
Fixed.
+ elog(FATAL,
+ "dsa couldn't find run of pages: fpm_largest out of sync");
Here I'd go with "dsa could not find %u free pages".
Fixed.
+ elog(ERROR, "dsa_pin: area already pinned");
"dsa_area already pinned"
Fixed.
+ elog(ERROR, "dsa_unpin: area not pinned");
"dsa_area not pinned"
Fixed.
+ if (segment == NULL)
+ elog(ERROR, "dsa: can't attach to segment");
As above.
Fixed.
+static dsa_segment_map *
+get_segment_by_index(dsa_area *area, dsa_segment_index index)
+{
+ if (unlikely(area->segment_maps[index].mapped_address == NULL))
+ {
+ dsm_handle handle;
+ dsm_segment *segment;
+ dsa_segment_map *segment_map;
+
+ handle = area->control->segment_handles[index];
Don't you need to acquire the lock for this?
No. I've updated the comments to explain, and refactored a bit.
I'll explain in different words here: This is memory, you are a
C programmer, and as with malloc/free, referencing memory that has
been freed invokes undefined behaviour possibly including but not
limited to demons flying out of your nose. When you call
dsa_get_address(some_dsa_pointer) or dsa_free(some_dsa_pointer) you
are asserting that the address points to memory allocated with
dsa_allocate from this area that has not yet been freed. Given that
assertion, area->control->segment_handles[index] (where index is
extracted from the address) must be valid and cannot change under your
feet. control->segment_handles[index] can only change after
everything allocated from that whole segment has been freed; you can
think of it as 'locked' as long as any live object exists in the
segment it corresponds to.
In general I'm trying not to do anything too clever in the first
version of DSA: it uses plain old LWLock for each size-class's pool
and then an area-wide LWLock for segment operations. But in the
particular case of dsa_get_address, I think it's really important for
the viability of DSA for these address translations to be fast in
the likely path, hence my desire to figure out a protocol for lock-free
address translation even though segments come and go.
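To make the fast path concrete, here is a minimal sketch of how such a translation might look. It is not the patch's implementation: the 40-bit offset width, the fixed-size table and all sketch_ names are assumptions chosen for illustration. The only points taken from this thread are that a segment index is extracted from the pointer and that each backend keeps its own table of mapped addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_OFFSET_BITS	40	/* assumed split; not taken from the patch */
#define SKETCH_MAX_SEGMENTS 8	/* assumed small fixed table */

typedef uint64_t sketch_pointer;	/* 0 plays the role of InvalidDsaPointer */

/* Per-process view: where each segment is mapped here, or NULL if unmapped. */
typedef struct
{
	char	   *mapped_address[SKETCH_MAX_SEGMENTS];
} sketch_area;

/* Pack a segment index and an offset into one shareable value. */
sketch_pointer
sketch_make_pointer(unsigned index, uint64_t offset)
{
	assert(index < SKETCH_MAX_SEGMENTS);
	assert(offset < (UINT64_C(1) << SKETCH_OFFSET_BITS));
	return ((uint64_t) index << SKETCH_OFFSET_BITS) | offset;
}

/*
 * Translate without taking any lock: the caller asserts that the pointer
 * refers to live memory, so the segment's table entry cannot change.
 * The sketch returns NULL for an unmapped segment; real code would map the
 * segment on first use or raise an error.
 */
void *
sketch_get_address(sketch_area *area, sketch_pointer p)
{
	unsigned	index;
	uint64_t	offset;

	if (p == 0)
		return NULL;
	index = (unsigned) (p >> SKETCH_OFFSET_BITS);
	offset = p & ((UINT64_C(1) << SKETCH_OFFSET_BITS) - 1);
	assert(index < SKETCH_MAX_SEGMENTS);
	if (area->mapped_address[index] == NULL)
		return NULL;
	return area->mapped_address[index] + offset;
}
```

In the likely path this is a shift, a mask, one table lookup and an addition, which is why avoiding a lock here matters.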
+ /* Check all currently mapped segments to find what's been freed. */
+ for (i = 0; i <= area->high_segment_index; ++i)
+ {
+ if (area->segment_maps[i].header != NULL &&
+ area->segment_maps[i].header->freed)
+ {
+ dsm_detach(area->segment_maps[i].segment);
+ area->segment_maps[i].segment = NULL;
+ area->segment_maps[i].header = NULL;
+ area->segment_maps[i].mapped_address = NULL;
+ }
+ }
+ area->freed_segment_counter = freed_segment_counter;
And this?
Hmm. I had a theory for why that didn't need to be locked, though it
admittedly lacked a necessary barrier -- d'oh. But I'll spare you the
details and just lock it because this is not a hot path and it's much
simpler that way.
I've also refactored that code into a new static function
check_for_freed_segments, because I realised that dsa_free needs the
same treatment as dsa_get_address. The checking for freed segments
was also happening at the wrong time, which I've now straightened out
-- that must happen before you arrive in get_segment_by_index, as
described in the copious new comments. Thoughts?
+/*
+ * Release a DSA area that was produced by dsa_create_in_place or
+ * dsa_attach_in_place. It is preferable to use one of the 'dsa_on_XXX'
+ * callbacks so that this is managed automatically, because failure to release
+ * an area created in-place leaks its segments permanently.
+ */
+void
+dsa_release_in_place(void *place)
+{
+ decrement_reference_count((dsa_area_control *) place);
+}
Since this seems to be the only caller of decrement_reference_count,
you could just put the logic here.
Ok, done.
The contract for this function is
also a bit unclear from the header comment. I initially thought that
it was your intention that this should be called from every process
that has either created or attached the segment.
That is indeed my intention.
But that doesn't
seem like it will work, because decrement_reference_count calls
dsm_unpin_segment on every segment, and a segment can only be pinned
once, so everybody else would fail. So maybe the idea is that ANY ONE
process has to call dsa_release_in_place. But then that could lead to
failures in other backends inside get_segment_by_index(), because
segments they don't have mapped might already be gone. OK, third try:
maybe the idea is that the LAST process out has to call
dsa_release_in_place(). But how do the various cooperating processes
know which one that is?
It decrements the reference count for the area, but only unpins the
segments if the reference count reaches zero:
Assert(control->refcnt > 0);
if (--control->refcnt == 0)
{
/* ... unpin all the segments ... */
}
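The "last one out" contract in the fragment above can be shown as a self-contained sketch. The struct and function names are hypothetical, and locking is deliberately omitted: the sketch is single-threaded, whereas real code would need to protect the counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical control block shared by all attached processes. */
typedef struct
{
	int			refcnt;				/* creator plus every attached process */
	int			pinned_segments;	/* segments kept alive by the area */
} sketch_control;

/* Each process that attaches bumps the count. */
void
sketch_attach(sketch_control *control)
{
	assert(control->refcnt > 0);
	control->refcnt++;
}

/*
 * Every process calls this once when it is done.  Only the call that drops
 * the count to zero unpins the segments; returns true in that case.
 */
bool
sketch_release(sketch_control *control)
{
	assert(control->refcnt > 0);
	if (--control->refcnt == 0)
	{
		control->pinned_segments = 0;	/* stands in for unpinning each one */
		return true;
	}
	return false;
}
```

No process needs to know whether it is the last one: each simply releases, and the counter decides.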
I've also realized another thing that's not so good about this:
superblocks are 64kB, so allocating 64kB of initial space probably
just wastes most of it. I think we want to either allocate just
enough space to hold the control information, or else that much space
plus space for at least a few superblocks. I'm inclined to go the
first way, because it seems a bit overenthusiastic to allocate 256kB
or 512kB just on the off chance we might need it. On the other hand,
including a few bytes in the control segment so that we don't need to
allocate a 1MB segment that we might not need sounds pretty sharp.
Maybe DSA can expose an API that returns the number of bytes that will
be needed for the control structure, and then the caller can arrange
for that space to be available during the Estimate phase...
Yeah, I also thought about that, but didn't try to do better before
because I couldn't see how to make a nice macro for this without
dragging a ton of internal stuff out into the header. I have written
a new function dsa_minimum_size(). The caller can use that number
directly to get a minimal in-place area that will immediately create
an extra DSM segment as soon as you call dsa_allocate. Unfortunately
you can't really add more to that number with predictable results
unless you know some internal details and your future allocation
pattern: to avoid extra segment creation, you'd need to add 4KB for a
block of spans and then 64KB for each size class you plan to allocate,
and of course that might change. But at least it allows us to create
an in-place DSA area for every parallel query cheaply, and then defer
creation of the first DSM segment until the first time someone tries
to allocate, which seems about right to me.
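The sizing arithmetic above can be written out as a small helper. This is only a sketch of the rule of thumb stated in this message (4KB for a block of spans plus 64KB per size class in use); the function name and constants are hypothetical, the minimum is passed in as a plain number rather than obtained from any DSA call, and the figures depend on internal details that may change.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SPAN_BLOCK_BYTES (4 * 1024)	/* one block of spans */
#define SKETCH_SUPERBLOCK_BYTES (64 * 1024) /* one superblock per size class */

/*
 * Estimate the in-place size needed to avoid creating an extra segment.
 * minimum_size is the bare control-structure size; nclasses is how many
 * distinct size classes the caller expects to allocate from.
 */
size_t
sketch_estimate_in_place_size(size_t minimum_size, size_t nclasses)
{
	if (nclasses == 0)
		return minimum_size;	/* minimal area; first allocation adds a segment */
	return minimum_size + SKETCH_SPAN_BLOCK_BYTES +
		nclasses * SKETCH_SUPERBLOCK_BYTES;
}
```

For example, with a 1000-byte minimum and two size classes the estimate is 1000 + 4096 + 2 * 65536 = 136168 bytes.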
And in response to your earlier email:
On Tue, Nov 29, 2016 at 7:48 AM, Robert Haas <robertmhaas@gmail.com> wrote:
shm_mq_attach() made the opposite decision about how to solve this
problem, and frankly I think that API is a lot more convenient: if the
first argument to shm_mq_attach() happens to be located inside of a
DSM, you can pass the DSM as the second argument and it registers the
on_dsm_detach() hook for you. If not, you can pass NULL and deal with
it in some other way. But this makes the common case very simple.
Ok, I've now done the same.
I feel like some more general destructor callback for objects in a
containing object is wanted here, rather than sticking dsm_segment *
into various constructor-like functions, but I haven't thought
seriously about that and I'm not arguing that case now.
Please find attached dsa-v8.patch, and also a small test module for
running random allocate/free exercises and dumping the internal
allocator state.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
test-dsa.patch (application/octet-stream)
diff --git a/src/test/modules/test_dsa/Makefile b/src/test/modules/test_dsa/Makefile
new file mode 100644
index 0000000..e5299a9
--- /dev/null
+++ b/src/test/modules/test_dsa/Makefile
@@ -0,0 +1,18 @@
+# src/test/modules/test_dsa/Makefile
+
+MODULES = test_dsa
+
+EXTENSION = test_dsa
+DATA = test_dsa--1.0.sql
+PGFILEDESC = "test_dsa -- tests for DSA areas"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_dsa
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_dsa/test_dsa--1.0.sql b/src/test/modules/test_dsa/test_dsa--1.0.sql
new file mode 100644
index 0000000..cc435b3
--- /dev/null
+++ b/src/test/modules/test_dsa/test_dsa--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_dsa/test_dsa--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_dsa" to load this file. \quit
+
+CREATE FUNCTION test_dsa_random(loops int, num_allocs int, min_alloc int, max_alloc int, mode text)
+RETURNS VOID
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE TYPE test_dsa_row AS (pid int, allocations bigint, elapsed interval);
+
+CREATE FUNCTION test_dsa_random_parallel(loops int, num_allocs int, min_alloc int, max_alloc int, mode text, workers int)
+RETURNS SETOF test_dsa_row
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
diff --git a/src/test/modules/test_dsa/test_dsa.c b/src/test/modules/test_dsa/test_dsa.c
new file mode 100644
index 0000000..149b393
--- /dev/null
+++ b/src/test/modules/test_dsa/test_dsa.c
@@ -0,0 +1,358 @@
+/* -------------------------------------------------------------------------
+ *
+ * test_dsa.c
+ * Simple exercises for dsa.c.
+ *
+ * Copyright (C) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_dsa/test_dsa.c
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/dsa.h"
+#include "utils/resowner.h"
+#include "utils/timestamp.h"
+
+#include <stdlib.h>
+#include <unistd.h>
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_dsa_random);
+PG_FUNCTION_INFO_V1(test_dsa_random_parallel);
+
+/* Which order to free objects in, within each loop. */
+typedef enum
+{
+ /* Free in random order. */
+ MODE_RANDOM,
+ /* Free in the same order we allocated (FIFO). */
+ MODE_FORWARDS,
+ /* Free in reverse order of allocation (LIFO). */
+ MODE_BACKWARDS
+} test_mode;
+
+/* Per-worker results. */
+typedef struct
+{
+ pid_t pid;
+ int64 count;
+ int64 elapsed_time_us;
+} test_result;
+
+/* Parameters for a test run, passed to workers. */
+typedef struct
+{
+ int loops;
+ int num_allocs;
+ int min_alloc;
+ int max_alloc;
+ test_mode mode;
+ test_result results[1]; /* indexed by worker number */
+} test_parameters;
+
+/* The startup message given to each worker. */
+typedef struct
+{
+ /* How to connect to the shmem area. */
+ dsa_handle area_handle;
+ /* Where to find the parameters. */
+ dsa_pointer parameters;
+ /* What index this worker should write results to. */
+ Size output_index;
+} test_hello;
+
+static test_mode
+parse_test_mode(text *mode)
+{
+ test_mode result = MODE_RANDOM;
+ char *cstr = text_to_cstring(mode);
+
+ if (strcmp(cstr, "random") == 0)
+ result = MODE_RANDOM;
+ else if (strcmp(cstr, "forwards") == 0)
+ result = MODE_FORWARDS;
+ else if (strcmp(cstr, "backwards") == 0)
+ result = MODE_BACKWARDS;
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unknown mode")));
+ return result;
+}
+
+static void
+check_parameters(const test_parameters *parameters)
+{
+ if (parameters->min_alloc < 1 || parameters->min_alloc > parameters->max_alloc)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("min_alloc must be >= 1, and min_alloc must be <= max_alloc")));
+}
+
+static int
+my_tranche_id(void)
+{
+ static int tranche_id = 0;
+
+ if (tranche_id == 0)
+ tranche_id = LWLockNewTrancheId();
+
+ return tranche_id;
+}
+
+static void
+do_random_test(dsa_area *area, Size output_index, test_parameters *parameters)
+{
+ dsa_pointer *objects;
+ int min_alloc;
+ int extra_alloc;
+ int32 i;
+ int32 loop;
+ int64 start_time = GetCurrentIntegerTimestamp();
+ int64 total_allocations = 0;
+
+ /*
+ * Make tests reproducible (on the same computer at least) by using the
+ * same random sequence every time.
+ */
+ srand(42);
+
+ min_alloc = parameters->min_alloc;
+ extra_alloc = parameters->max_alloc - parameters->min_alloc;
+
+ objects = palloc(sizeof(dsa_pointer) * parameters->num_allocs);
+ Assert(objects != NULL);
+ for (loop = 0; loop < parameters->loops; ++loop)
+ {
+ int num_actually_allocated = 0;
+
+ for (i = 0; i < parameters->num_allocs; ++i)
+ {
+ Size size;
+ void *memory;
+
+ /* Adjust size randomly if needed. */
+ size = min_alloc;
+ if (extra_alloc > 0)
+ size += rand() % extra_alloc;
+
+ /* Allocate! */
+ objects[i] = dsa_allocate(area, size);
+ if (!DsaPointerIsValid(objects[i]))
+ {
+ elog(WARNING, "dsa: loop %d: out of memory after allocating %d objects", loop, i + 1);
+ break;
+ }
+ ++num_actually_allocated;
+ /* Pay the cost of accessing that memory */
+ memory = dsa_get_address(area, objects[i]);
+ memset(memory, 42, size);
+ }
+ if (parameters->mode == MODE_RANDOM)
+ {
+ for (i = 0; i < num_actually_allocated; ++i)
+ {
+ Size x = rand() % num_actually_allocated;
+ Size y = rand() % num_actually_allocated;
+ dsa_pointer temp = objects[x];
+
+ objects[x] = objects[y];
+ objects[y] = temp;
+ }
+ }
+ if (parameters->mode == MODE_BACKWARDS)
+ {
+ for (i = num_actually_allocated - 1; i >= 0; --i)
+ dsa_free(area, objects[i]);
+ }
+ else
+ {
+ for (i = 0; i < num_actually_allocated; ++i)
+ dsa_free(area, objects[i]);
+ }
+ total_allocations += num_actually_allocated;
+ }
+ pfree(objects);
+
+ parameters->results[output_index].elapsed_time_us =
+ GetCurrentIntegerTimestamp() - start_time;
+ parameters->results[output_index].pid = getpid();
+ parameters->results[output_index].count = total_allocations;
+}
+
+/* Non-parallel version: just do it. */
+Datum
+test_dsa_random(PG_FUNCTION_ARGS)
+{
+ test_parameters parameters;
+ dsa_area *area;
+
+ parameters.loops = PG_GETARG_INT32(0);
+ parameters.num_allocs = PG_GETARG_INT32(1);
+ parameters.min_alloc = PG_GETARG_INT32(2);
+ parameters.max_alloc = PG_GETARG_INT32(3);
+ parameters.mode = parse_test_mode(PG_GETARG_TEXT_PP(4));
+ check_parameters(&parameters);
+
+ area = dsa_create(my_tranche_id(), "test_dsa");
+ do_random_test(area, 0, &parameters);
+ dsa_dump(area);
+ dsa_detach(area);
+
+ PG_RETURN_NULL();
+}
+
+Datum test_dsa_random_worker_main(Datum arg);
+
+Datum
+test_dsa_random_worker_main(Datum arg)
+{
+ test_hello hello;
+ dsa_area *area;
+ test_parameters *parameters;
+
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "test_dsa toplevel");
+
+ /* Receive hello message and attach to shmem area. */
+ memcpy(&hello, MyBgworkerEntry->bgw_extra, sizeof(hello));
+ area = dsa_attach(hello.area_handle);
+ Assert(area != NULL);
+ parameters = dsa_get_address(area, hello.parameters);
+ Assert(parameters != NULL);
+
+ do_random_test(area, hello.output_index, parameters);
+
+ dsa_detach(area);
+
+ return (Datum) 0;
+}
+
+/* Parallel version: fork a bunch of background workers to do it. */
+Datum
+test_dsa_random_parallel(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ test_hello hello;
+ test_parameters *parameters;
+ dsa_area *area;
+ int workers;
+ int i;
+ BackgroundWorkerHandle **handles;
+
+ /* tuplestore boilerplate stuff... */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+
+ /* Prepare to work! */
+ workers = PG_GETARG_INT32(5);
+ handles = palloc(sizeof(BackgroundWorkerHandle *) * workers);
+
+ /* Set up the shared memory area. */
+ area = dsa_create(my_tranche_id(), "test_dsa");
+
+ /* Tell the workers how to attach to it. */
+ hello.area_handle = dsa_get_handle(area);
+
+ /* Allocate space for the parameters object. */
+ hello.parameters = dsa_allocate(area, sizeof(test_parameters) +
+ sizeof(test_result) * workers);
+ Assert(DsaPointerIsValid(hello.parameters));
+
+ /* Set up and check the parameters object. */
+ parameters = dsa_get_address(area, hello.parameters);
+ parameters->loops = PG_GETARG_INT32(0);
+ parameters->num_allocs = PG_GETARG_INT32(1);
+ parameters->min_alloc = PG_GETARG_INT32(2);
+ parameters->max_alloc = PG_GETARG_INT32(3);
+ parameters->mode = parse_test_mode(PG_GETARG_TEXT_PP(4));
+ check_parameters(parameters);
+
+ /* Start the workers. */
+ for (i = 0; i < workers; ++i)
+ {
+ BackgroundWorker bgw;
+
+ snprintf(bgw.bgw_name, sizeof(bgw.bgw_name), "worker%d", i);
+ bgw.bgw_flags = BGWORKER_SHMEM_ACCESS;
+ bgw.bgw_start_time = BgWorkerStart_PostmasterStart;
+ bgw.bgw_restart_time = BGW_NEVER_RESTART;
+ bgw.bgw_main = NULL;
+ snprintf(bgw.bgw_library_name, sizeof(bgw.bgw_library_name),
+ "test_dsa");
+ snprintf(bgw.bgw_function_name, sizeof(bgw.bgw_function_name),
+ "test_dsa_random_worker_main");
+ Assert(sizeof(hello) <= BGW_EXTRALEN);
+ /* Each worker will write its output to a different slot. */
+ hello.output_index = i;
+ memcpy(bgw.bgw_extra, &hello, sizeof(hello));
+ bgw.bgw_notify_pid = MyProcPid;
+
+ if (!RegisterDynamicBackgroundWorker(&bgw, &handles[i]))
+ elog(ERROR, "Can't start worker");
+ }
+
+ /* Wait for the workers to complete. */
+ for (i = 0; i < workers; ++i)
+ /* erm, should really check for BGWH_STOPPED */
+ WaitForBackgroundWorkerShutdown(handles[i]);
+
+ /* Generate result tuples. */
+ for (i = 0; i < workers; ++i)
+ {
+ Datum values[3];
+ bool nulls[] = { false, false, false };
+ Interval *interval = palloc(sizeof(Interval));
+
+ interval->month = 0;
+ interval->day = 0;
+ interval->time = parameters->results[i].elapsed_time_us
+#ifndef HAVE_INT64_TIMESTAMP
+ / 1000000.0
+#endif
+ ;
+
+ values[0] = Int32GetDatum(parameters->results[i].pid);
+ values[1] = Int64GetDatum(parameters->results[i].count);
+ values[2] = PointerGetDatum(interval);
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ tuplestore_donestoring(tupstore);
+
+ pfree(handles);
+ dsa_detach(area);
+
+ return (Datum) 0;
+}
diff --git a/src/test/modules/test_dsa/test_dsa.control b/src/test/modules/test_dsa/test_dsa.control
new file mode 100644
index 0000000..2655c3f
--- /dev/null
+++ b/src/test/modules/test_dsa/test_dsa.control
@@ -0,0 +1,5 @@
+# test_dsa extension
+comment = 'Tests for DSA'
+default_version = '1.0'
+module_pathname = '$libdir/test_dsa'
+relocatable = true
dsa-v8.patch (application/octet-stream)
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index b2403e1..1842bae 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/mmgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = aset.o mcxt.o portalmem.o
+OBJS = aset.o dsa.o freepage.o mcxt.o portalmem.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
new file mode 100644
index 0000000..e5c780f
--- /dev/null
+++ b/src/backend/utils/mmgr/dsa.c
@@ -0,0 +1,2200 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.c
+ * Dynamic shared memory areas.
+ *
+ * This module provides dynamic shared memory areas which are built on top of
+ * DSM segments. While dsm.c allows segments of shared memory to be
+ * created and shared between backends, it isn't designed to deal with small
+ * objects. A DSA area is a shared memory heap usually backed by one or more
+ * DSM segments which can allocate memory using dsa_allocate() and dsa_free().
+ * Alternatively, it can be created in pre-existing shared memory, including a
+ * DSM segment, and then create extra DSM segments as required. Unlike the
+ * regular system heap, it deals in pseudo-pointers which must be converted to
+ * backend-local pointers before they are dereferenced. These pseudo-pointers
+ * can however be shared with other backends, and can be used to construct
+ * shared data structures.
+ *
+ * Each DSA area manages a set of DSM segments, adding new segments as
+ * required and detaching them when they are no longer needed. Each segment
+ * contains a number of 4KB pages, a free page manager for tracking
+ * consecutive runs of free pages, and a page map for tracking the source of
+ * objects allocated on each page. Allocation requests above 8KB are handled
+ * by choosing a segment and finding consecutive free pages in its free page
+ * manager. Allocation requests for smaller sizes are handled using pools of
+ * objects of a selection of sizes. Each pool consists of a number of 16 page
+ * (64KB) superblocks allocated in the same way as large objects. Allocation
+ * of large objects and new superblocks is serialized by a single LWLock, but
+ * allocation of small objects from pre-existing superblocks uses one LWLock
+ * per pool. Currently there is one pool, and therefore one lock, per size
+ * class. Per-core pools to increase concurrency and strategies for reducing
+ * the resulting fragmentation are areas for future research. Each superblock
+ * is managed with a 'span', which tracks the superblock's freelist. Free
+ * requests are handled by looking in the page map to find which span an
+ * address was allocated from, so that small objects can be returned to the
+ * appropriate free list, and large object pages can be returned directly to
+ * the free page map. When allocating, simple heuristics for selecting
+ * segments and superblocks try to encourage occupied memory to be
+ * concentrated, increasing the likelihood that whole superblocks can become
+ * empty and be returned to the free page manager, and whole segments can
+ * become empty and be returned to the operating system.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/dsa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "port/atomics.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "utils/dsa.h"
+#include "utils/freepage.h"
+#include "utils/memutils.h"
+
+/*
+ * The size of the initial DSM segment that backs a dsa_area created by
+ * dsa_create. After creating some number of segments of this size we'll
+ * double this size, and so on. Larger segments may be created if necessary
+ * to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE ((Size) (1 * 1024 * 1024))
+
+/*
+ * How many segments to create before we double the segment size. If this is
+ * low, then there is likely to be a lot of wasted space in the largest
+ * segment. If it is high, then we risk running out of segment slots (see
+ * dsm.c's limits on total number of segments), or limiting the total size
+ * an area can manage when using small pointers.
+ */
+#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 4
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
+#endif
+
+/*
+ * The maximum number of DSM segments that an area can own, determined by
+ * the number of bits remaining (but capped at 1024).
+ */
+#define DSA_MAX_SEGMENTS \
+ Min(1024, (1 << ((SIZEOF_DSA_POINTER * 8) - DSA_OFFSET_WIDTH)))
+
+/* The bitmask for extracting the offset from a dsa_pointer. */
+#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
+/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
+#define DSA_PAGES_PER_SUPERBLOCK 16
+
+/*
+ * A magic number used as a sanity check for following DSM segments belonging
+ * to a DSA area (this number will be XORed with the area handle and
+ * the segment index).
+ */
+#define DSA_SEGMENT_HEADER_MAGIC 0x0ce26608
+
+/* Build a dsa_pointer given a segment number and offset. */
+#define DSA_MAKE_POINTER(segment_number, offset) \
+ (((dsa_pointer) (segment_number) << DSA_OFFSET_WIDTH) | (offset))
+
+/* Extract the segment number from a dsa_pointer. */
+#define DSA_EXTRACT_SEGMENT_NUMBER(dp) ((dp) >> DSA_OFFSET_WIDTH)
+
+/* Extract the offset from a dsa_pointer. */
+#define DSA_EXTRACT_OFFSET(dp) ((dp) & DSA_OFFSET_BITMASK)
+
+/* The type used for segment indexes (zero based). */
+typedef Size dsa_segment_index;
+
+/* Sentinel value for dsa_segment_index indicating 'none' or 'end'. */
+#define DSA_SEGMENT_INDEX_NONE (~(dsa_segment_index)0)
+
+/*
+ * How many bins of segments do we have? The bins are used to categorize
+ * segments by their largest contiguous run of free pages.
+ */
+#define DSA_NUM_SEGMENT_BINS 16
+
+/*
+ * What is the lowest bin that holds segments that *might* have n contiguous
+ * free pages? There is no point in looking in segments in lower bins; they
+ * definitely can't service a request for n free pages.
+ */
+#define contiguous_pages_to_segment_bin(n) Min(fls(n), DSA_NUM_SEGMENT_BINS - 1)
+
+/* Macros for access to locks. */
+#define DSA_AREA_LOCK(area) (&area->control->lock)
+#define DSA_SCLASS_LOCK(area, sclass) (&area->control->pools[sclass].lock)
+
+/*
+ * The header for an individual segment. This lives at the start of each DSM
+ * segment owned by a DSA area including the first segment (where it appears
+ * as part of the dsa_area_control struct).
+ */
+typedef struct
+{
+ /* Sanity check magic value. */
+ uint32 magic;
+ /* Total number of pages in this segment (excluding metadata area). */
+ Size usable_pages;
+ /* Total size of this segment in bytes. */
+ Size size;
+
+ /*
+ * Index of the segment that precedes this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the first one.
+ */
+ dsa_segment_index prev;
+
+ /*
+ * Index of the segment that follows this one in the same segment bin, or
+ * DSA_SEGMENT_INDEX_NONE if this is the last one.
+ */
+ dsa_segment_index next;
+ /* The index of the bin that contains this segment. */
+ Size bin;
+
+ /*
+ * A flag raised to indicate that this segment is being returned to the
+ * operating system and has been unpinned.
+ */
+ bool freed;
+} dsa_segment_header;
+
+/*
+ * Metadata for one superblock.
+ *
+ * For most blocks, span objects are stored out-of-line; that is, the span
+ * object is not stored within the block itself. But, as an exception, for a
+ * "span of spans", the span object is stored "inline". The allocation is
+ * always exactly one page, and the dsa_area_span object is located at
+ * the beginning of that page. The size class is DSA_SCLASS_BLOCK_OF_SPANS,
+ * and the remaining fields are used just as they would be in an ordinary
+ * block. We can't allocate spans out of ordinary superblocks because
+ * creating an ordinary superblock requires us to be able to allocate a span
+ * *first*. Doing it this way avoids that circularity.
+ */
+typedef struct
+{
+ dsa_pointer pool; /* Containing pool. */
+ dsa_pointer prevspan; /* Previous span. */
+ dsa_pointer nextspan; /* Next span. */
+ dsa_pointer start; /* Starting address. */
+ Size npages; /* Length of span in pages. */
+ uint16 size_class; /* Size class. */
+ uint16 ninitialized; /* Maximum number of objects ever allocated. */
+ uint16 nallocatable; /* Number of objects currently allocatable. */
+ uint16 firstfree; /* First object on free list. */
+ uint16 nmax; /* Maximum number of objects ever possible. */
+ uint16 fclass; /* Current fullness class. */
+} dsa_area_span;
+
+/*
+ * Given a pointer to an object in a span, access the index of the next free
+ * object in the same span (ie in the span's freelist) as an L-value.
+ */
+#define NextFreeObjectIndex(object) (* (uint16 *) (object))
+
+/*
+ * Small allocations are handled by dividing a single block of memory into
+ * many small objects of equal size. The possible allocation sizes are
+ * defined by the following array. Larger size classes are spaced more widely
+ * than smaller size classes. We fudge the spacing for size classes >1kB to
+ * avoid space wastage: based on the knowledge that we plan to allocate 64kB
+ * blocks, we bump the maximum object size up to the largest multiple of
+ * 8 bytes that still lets us fit the same number of objects into one block.
+ *
+ * NB: Because of this fudging, if we were ever to use differently-sized blocks
+ * for small allocations, these size classes would need to be reworked to be
+ * optimal for the new size.
+ *
+ * NB: The optimal spacing for size classes, as well as the size of the blocks
+ * out of which small objects are allocated, is not a question that has one
+ * right answer. Some allocators (such as tcmalloc) use more closely-spaced
+ * size classes than we do here, while others (like aset.c) use more
+ * widely-spaced classes. Spacing the classes more closely avoids wasting
+ * memory within individual chunks, but also means a larger number of
+ * potentially-unfilled blocks.
+ */
+static const uint16 dsa_size_classes[] = {
+ sizeof(dsa_area_span), 0, /* special size classes */
+ 8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
+ 80, 96, 112, 128, /* 4 classes separated by 16 bytes */
+ 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
+ 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
+ 640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
+ 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
+ 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
+ 5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
+};
+#define DSA_NUM_SIZE_CLASSES lengthof(dsa_size_classes)
+
+/* Special size classes. */
+#define DSA_SCLASS_BLOCK_OF_SPANS 0
+#define DSA_SCLASS_SPAN_LARGE 1
+
+/*
+ * The following lookup table is used to map the size of small objects
+ * (less than 1kB) onto the corresponding size class. To use this table,
+ * round the size of the object up to the next multiple of 8 bytes, and then
+ * index into this array.
+ */
+static char dsa_size_class_map[] = {
+ 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 11, 12, 12, 13, 13,
+ 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17,
+ 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19,
+ 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21,
+ 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
+ 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23,
+ 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+ 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25
+};
+#define DSA_SIZE_CLASS_MAP_QUANTUM 8
+
+/*
+ * Superblocks are binned by how full they are. Generally, each fullness
+ * class corresponds to one quartile, but the block being used for
+ * allocations is always at the head of the list for fullness class 1,
+ * regardless of how full it really is.
+ */
+#define DSA_FULLNESS_CLASSES 4
+
+/*
+ * Maximum length of a DSA name.
+ */
+#define DSA_MAXLEN 64
+
+/*
+ * A dsa_area_pool represents a set of objects of a given size class.
+ *
+ * Perhaps there should be multiple pools for the same size class for
+ * contention avoidance, but for now there is just one!
+ */
+typedef struct
+{
+ /* A lock protecting access to this pool. */
+ LWLock lock;
+ /* A set of linked lists of spans, arranged by fullness. */
+ dsa_pointer spans[DSA_FULLNESS_CLASSES];
+ /* Should we pad this out to a cacheline boundary? */
+} dsa_area_pool;
+
+/*
+ * The control block for an area. This lives in shared memory, at the start of
+ * the first DSM segment controlled by this area.
+ */
+typedef struct
+{
+ /* The segment header for the first segment. */
+ dsa_segment_header segment_header;
+ /* The handle for this area. */
+ dsa_handle handle;
+ /* The handles of the segments owned by this area. */
+ dsm_handle segment_handles[DSA_MAX_SEGMENTS];
+ /* Lists of segments, binned by maximum contiguous run of free pages. */
+ dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
+ /* The object pools for each size class. */
+ dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* The total size of all active segments. */
+ Size total_segment_size;
+ /* The maximum total size of backing storage we are allowed. */
+ Size max_total_segment_size;
+ /* Highest used segment index in the history of this area. */
+ dsa_segment_index high_segment_index;
+ /* The reference count for this area. */
+ int refcnt;
+ /* A flag indicating that this area has been pinned. */
+ bool pinned;
+ /* The number of times that segments have been freed. */
+ Size freed_segment_counter;
+ /* The LWLock tranche ID. */
+ int lwlock_tranche_id;
+ char lwlock_tranche_name[DSA_MAXLEN];
+ /* The general lock (protects everything except object pools). */
+ LWLock lock;
+} dsa_area_control;
+
+/* Given a pointer to a pool, find a dsa_pointer. */
+#define DsaAreaPoolToDsaPointer(area, p) \
+ DSA_MAKE_POINTER(0, (char *) p - (char *) area->control)
+
+/*
+ * A dsa_segment_map is stored within the backend-private memory of each
+ * individual backend. It holds the base address of the segment within that
+ * backend, plus the addresses of key objects within the segment. Those
+ * could instead be derived from the base address but it's handy to have them
+ * around.
+ */
+typedef struct
+{
+ dsm_segment *segment; /* DSM segment */
+ char *mapped_address; /* Address at which segment is mapped */
+ dsa_segment_header *header; /* Header (same as mapped_address) */
+ FreePageManager *fpm; /* Free page manager within segment. */
+ dsa_pointer *pagemap; /* Page map within segment. */
+} dsa_segment_map;
+
+/*
+ * Per-backend state for a storage area. Backends obtain one of these by
+ * creating an area or attaching to an existing one using a handle. Each
+ * process that needs to use an area uses its own object to track where the
+ * segments are mapped.
+ */
+struct dsa_area
+{
+ /* Pointer to the control object in shared memory. */
+ dsa_area_control *control;
+
+ /* The lock tranche for this process. */
+ LWLockTranche lwlock_tranche;
+
+ /* Has the mapping been pinned? */
+ bool mapping_pinned;
+
+ /*
+ * This backend's array of segment maps, ordered by segment index
+ * corresponding to control->segment_handles. Some of the area's segments
+ * may not be mapped in this backend yet, and some slots may have been
+ * freed and need to be detached; these operations happen on demand.
+ */
+ dsa_segment_map segment_maps[DSA_MAX_SEGMENTS];
+
+ /* The highest segment index this backend has ever mapped. */
+ dsa_segment_index high_segment_index;
+
+ /* The last observed freed_segment_counter. */
+ Size freed_segment_counter;
+};
+
+#define DSA_SPAN_NOTHING_FREE ((uint16) -1)
+#define DSA_SUPERBLOCK_SIZE (DSA_PAGES_PER_SUPERBLOCK * FPM_PAGE_SIZE)
+
+/* Given a pointer to a segment_map, obtain a segment index number. */
+#define get_segment_index(area, segment_map_ptr) \
+ (segment_map_ptr - &area->segment_maps[0])
+
+static void init_span(dsa_area *area, dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class);
+static bool transfer_first_span(dsa_area *area, dsa_area_pool *pool,
+ int fromclass, int toclass);
+static inline dsa_pointer alloc_object(dsa_area *area, int size_class);
+static bool ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class);
+static dsa_segment_map *get_segment_by_index(dsa_area *area,
+ dsa_segment_index index);
+static void destroy_superblock(dsa_area *area, dsa_pointer span_pointer);
+static void unlink_span(dsa_area *area, dsa_area_span *span);
+static void add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer, int fclass);
+static void unlink_segment(dsa_area *area, dsa_segment_map *segment_map);
+static dsa_segment_map *get_best_segment(dsa_area *area, Size npages);
+static dsa_segment_map *make_new_segment(dsa_area *area, Size requested_pages);
+static dsa_area *create_internal(void *place, size_t size,
+ int tranche_id, const char *tranche_name,
+ dsm_handle control_handle,
+ dsm_segment *control_segment);
+static dsa_area *attach_internal(void *place, dsm_segment *segment,
+ dsa_handle handle);
+static void check_for_freed_segments(dsa_area *area);
+
+/*
+ * Create a new shared area in a new DSM segment. Further DSM segments will
+ * be allocated as required to extend the available space.
+ *
+ * We can't allocate a LWLock tranche_id within this function, because tranche
+ * IDs are a scarce resource; there are only 64k available, using low numbers
+ * when possible matters, and we have no provision for recycling them. So,
+ * we require the caller to provide one. The caller must also provide the
+ * tranche name, so that we can distinguish LWLocks belonging to different
+ * DSAs.
+ */
+dsa_area *
+dsa_create(int tranche_id, const char *tranche_name)
+{
+ dsm_segment *segment;
+ dsa_area *area;
+
+ /*
+ * Create the DSM segment that will hold the shared control object and the
+ * first segment of usable space.
+ */
+ segment = dsm_create(DSA_INITIAL_SEGMENT_SIZE, 0);
+
+ /*
+ * All segments backing this area are pinned, so that DSA can explicitly
+ * control their lifetime (otherwise a newly created segment belonging to
+ * this area might be freed when the only backend that happens to have it
+ * mapped in ends, corrupting the area).
+ */
+ dsm_pin_segment(segment);
+
+ /* Create a new DSA area with the control object in this segment. */
+ area = create_internal(dsm_segment_address(segment),
+ DSA_INITIAL_SEGMENT_SIZE,
+ tranche_id, tranche_name,
+ dsm_segment_handle(segment), segment);
+
+ /* Clean up when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_detach_release_in_place,
+ PointerGetDatum(dsm_segment_address(segment)));
+
+ return area;
+}
+
+/*
+ * Create a new shared area in an existing shared memory space, which may be
+ * either DSM or Postmaster-initialized memory. DSM segments will be
+ * allocated as required to extend the available space, though that can be
+ * prevented with dsa_set_size_limit(area, size) using the same size provided
+ * to dsa_create_in_place.
+ *
+ * Areas created in-place must eventually be released by the backend that
+ * created them and all backends that attach to them. This can be done
+ * explicitly with dsa_release_in_place, or, in the special case that 'place'
+ * happens to be in a pre-existing DSM segment, by passing in a pointer to the
+ * segment so that a detach hook can be registered with the containing DSM
+ * segment.
+ *
+ * See dsa_create() for a note about the tranche arguments.
+ */
+dsa_area *
+dsa_create_in_place(void *place, size_t size,
+ int tranche_id, const char *tranche_name,
+ dsm_segment *segment)
+{
+ dsa_area *area;
+
+ area = create_internal(place, size, tranche_id, tranche_name,
+ DSM_HANDLE_INVALID, NULL);
+
+ /*
+ * Clean up when the control segment detaches, if a containing DSM segment
+ * was provided.
+ */
+ if (segment != NULL)
+ on_dsm_detach(segment, &dsa_on_dsm_detach_release_in_place,
+ PointerGetDatum(place));
+
+ return area;
+}
+
+/*
+ * Obtain a handle that can be passed to other processes so that they can
+ * attach to the given area. Cannot be called for areas created with
+ * dsa_create_in_place.
+ */
+dsa_handle
+dsa_get_handle(dsa_area *area)
+{
+ Assert(area->control->handle != DSM_HANDLE_INVALID);
+ return area->control->handle;
+}
+
+/*
+ * Attach to an area given a handle generated (possibly in another process) by
+ * dsa_get_handle. The area must have been created with dsa_create (not
+ * dsa_create_in_place).
+ */
+dsa_area *
+dsa_attach(dsa_handle handle)
+{
+ dsm_segment *segment;
+ dsa_area *area;
+
+ /*
+ * An area handle is really a DSM segment handle for the first segment, so
+ * we go ahead and attach to that.
+ */
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not attach to dsa_handle")));
+
+ area = attach_internal(dsm_segment_address(segment), segment, handle);
+
+ /* Clean up when the control segment detaches. */
+ on_dsm_detach(segment, &dsa_on_dsm_detach_release_in_place,
+ PointerGetDatum(dsm_segment_address(segment)));
+
+ return area;
+}
+
+/*
+ * Attach to an area that was created with dsa_create_in_place. The caller
+ * must somehow know the location in memory that was used when the area was
+ * created, though it may be mapped at a different virtual address in this
+ * process.
+ *
+ * See dsa_create_in_place for a note about releasing in-place areas, and the
+ * optional 'segment' argument which can be provided to allow automatic
+ * release if the containing memory happens to be a DSM segment.
+ */
+dsa_area *
+dsa_attach_in_place(void *place, dsm_segment *segment)
+{
+ dsa_area *area;
+
+ area = attach_internal(place, NULL, DSM_HANDLE_INVALID);
+
+ /*
+ * Clean up when the control segment detaches, if a containing DSM segment
+ * was provided.
+ */
+ if (segment != NULL)
+ on_dsm_detach(segment, &dsa_on_dsm_detach_release_in_place,
+ PointerGetDatum(place));
+
+ return area;
+}
+
+/*
+ * Release a DSA area that was produced by dsa_create_in_place or
+ * dsa_attach_in_place. The 'segment' argument is ignored but provides an
+ * interface suitable for on_dsm_detach, for the convenience of users who want
+ * to create a DSA area inside an existing DSM segment and have it
+ * automatically released when the containing DSM segment is detached.
+ * 'place' should be the address of the place where the area was created.
+ *
+ * This callback is automatically registered for the DSM segment containing
+ * the control object of in-place areas when a segment is provided to
+ * dsa_create_in_place or dsa_attach_in_place, and also for all areas created
+ * with dsa_create.
+ */
+void
+dsa_on_dsm_detach_release_in_place(dsm_segment *segment, Datum place)
+{
+ dsa_release_in_place(DatumGetPointer(place));
+}
+
+/*
+ * Release a DSA area that was produced by dsa_create_in_place or
+ * dsa_attach_in_place. The 'code' argument is ignored but provides an
+ * interface suitable for on_shmem_exit or before_shmem_exit, for the
+ * convenience of users who want to create a DSA area inside shared memory
+ * other than a DSM segment and have it automatically released at backend exit.
+ * 'place' should be the address of the place where the area was created.
+ */
+void
+dsa_on_shmem_exit_release_in_place(int code, Datum place)
+{
+ dsa_release_in_place(DatumGetPointer(place));
+}
+
+/*
+ * Release a DSA area that was produced by dsa_create_in_place or
+ * dsa_attach_in_place. It is preferable to use one of the 'dsa_on_XXX'
+ * callbacks so that this is managed automatically, because failure to release
+ * an area created in-place leaks its segments permanently.
+ *
+ * This is also called automatically for areas produced by dsa_create or
+ * dsa_attach as an implementation detail.
+ */
+void
+dsa_release_in_place(void *place)
+{
+ dsa_area_control *control = (dsa_area_control *) place;
+ int i;
+
+ LWLockAcquire(&control->lock, LW_EXCLUSIVE);
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ control->handle ^ 0));
+ Assert(control->refcnt > 0);
+ if (--control->refcnt == 0)
+ {
+ for (i = 0; i <= control->high_segment_index; ++i)
+ {
+ dsm_handle handle;
+
+ handle = control->segment_handles[i];
+ if (handle != DSM_HANDLE_INVALID)
+ dsm_unpin_segment(handle);
+ }
+ }
+ LWLockRelease(&control->lock);
+}
+
+/*
+ * Keep a DSA area attached until end of session or explicit detach.
+ *
+ * By default, areas are owned by the current resource owner, which means they
+ * are detached automatically when that scope ends.
+ */
+void
+dsa_pin_mapping(dsa_area *area)
+{
+ int i;
+
+ Assert(!area->mapping_pinned);
+ area->mapping_pinned = true;
+
+ for (i = 0; i <= area->high_segment_index; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_pin_mapping(area->segment_maps[i].segment);
+}
+
+/*
+ * Allocate memory in this storage area. The return value is a dsa_pointer
+ * that can be passed to other processes, and converted to a local pointer
+ * with dsa_get_address. If no memory is available, returns
+ * InvalidDsaPointer.
+ */
+dsa_pointer
+dsa_allocate(dsa_area *area, Size size)
+{
+ uint16 size_class;
+ dsa_pointer start_pointer;
+ dsa_segment_map *segment_map;
+
+ Assert(size > 0);
+
+ /*
+ * If bigger than the largest size class, just grab a run of pages from
+ * the free page manager, instead of allocating an object from a pool.
+ * There will still be a span, but it's a special class of span that
+ * manages this whole allocation and simply gives all pages back to the
+ * free page manager when dsa_free is called.
+ */
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ {
+ Size npages = fpm_size_to_pages(size);
+ Size first_page;
+ dsa_pointer span_pointer;
+ dsa_area_pool *pool = &area->control->pools[DSA_SCLASS_SPAN_LARGE];
+
+ /* Obtain a span object. */
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return InvalidDsaPointer;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+
+ /* Find a segment from which to allocate. */
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ /* Can't make any more segments: game over. */
+ LWLockRelease(DSA_AREA_LOCK(area));
+ dsa_free(area, span_pointer);
+ return InvalidDsaPointer;
+ }
+
+ /*
+ * Ask the free page manager for a run of pages. This should always
+ * succeed, since both get_best_segment and make_new_segment should
+ * only return a non-NULL pointer if the segment actually contains enough
+ * contiguous free space. If it does fail, something in our backend
+ * private state is out of whack, so use FATAL to kill the process.
+ */
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ elog(FATAL,
+ "dsa_allocate could not find %zu free pages", npages);
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ start_pointer = DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /* Initialize span and pagemap. */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ init_span(area, span_pointer, pool, start_pointer, npages,
+ DSA_SCLASS_SPAN_LARGE);
+ segment_map->pagemap[first_page] = span_pointer;
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+
+ return start_pointer;
+ }
+
+ /* Map allocation to a size class. */
+ if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+ Assert(size <= dsa_size_classes[size_class]);
+ Assert(size_class == 0 || size > dsa_size_classes[size_class - 1]);
+
+ /*
+ * Attempt to allocate an object from the appropriate pool. This might
+ * return InvalidDsaPointer if there's no space available.
+ */
+ return alloc_object(area, size_class);
+}
+
+/*
+ * Free memory obtained with dsa_allocate.
+ */
+void
+dsa_free(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_map *segment_map;
+ int pageno;
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ char *superblock;
+ char *object;
+ Size size;
+ int size_class;
+
+ /* Make sure we don't have a stale segment in the slot 'dp' refers to. */
+ check_for_freed_segments(area);
+
+ /* Locate the object, span and pool. */
+ segment_map = get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(dp));
+ pageno = DSA_EXTRACT_OFFSET(dp) / FPM_PAGE_SIZE;
+ span_pointer = segment_map->pagemap[pageno];
+ span = dsa_get_address(area, span_pointer);
+ superblock = dsa_get_address(area, span->start);
+ object = dsa_get_address(area, dp);
+ size_class = span->size_class;
+ size = dsa_size_classes[size_class];
+
+ /*
+ * Special case for large objects that live in a special span: we return
+ * those pages directly to the free page manager and free the span.
+ */
+ if (span->size_class == DSA_SCLASS_SPAN_LARGE)
+ {
+
+#ifdef CLOBBER_FREED_MEMORY
+ memset(object, 0x7f, span->npages * FPM_PAGE_SIZE);
+#endif
+
+ /* Give pages back to free page manager. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ LWLockRelease(DSA_AREA_LOCK(area));
+ /* Unlink span. */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+ LW_EXCLUSIVE);
+ unlink_span(area, span);
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
+ /* Free the span object so it can be reused. */
+ dsa_free(area, span_pointer);
+ return;
+ }
+
+#ifdef CLOBBER_FREED_MEMORY
+ memset(object, 0x7f, size);
+#endif
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /* Put the object on the span's freelist. */
+ Assert(object >= superblock);
+ Assert(object < superblock + DSA_SUPERBLOCK_SIZE);
+ Assert((object - superblock) % size == 0);
+ NextFreeObjectIndex(object) = span->firstfree;
+ span->firstfree = (object - superblock) / size;
+ ++span->nallocatable;
+
+ /*
+ * See if the span needs to be moved to a different fullness class, or be
+ * freed so its pages can be given back to the segment.
+ */
+ if (span->nallocatable == 1 && span->fclass == DSA_FULLNESS_CLASSES - 1)
+ {
+ /*
+ * The block was completely full and is located in the
+ * highest-numbered fullness class, which is never scanned for free
+ * chunks. We must move it to the next-lower fullness class.
+ */
+ unlink_span(area, span);
+ add_span_to_fullness_class(area, span, span_pointer,
+ DSA_FULLNESS_CLASSES - 2);
+
+ /*
+ * If this is the only span, and there is no active span, then we
+ * should probably move this span to fullness class 1. (Otherwise if
+ * you allocate exactly all the objects in the only span, it moves to
+ * class 3, then you free them all, it moves to 2, and then is given
+ * back, leaving no active span).
+ */
+ }
+ else if (span->nallocatable == span->nmax &&
+ (span->fclass != 1 || span->prevspan != InvalidDsaPointer))
+ {
+ /*
+ * This entire block is free, and it's not the active block for this
+ * size class. Return the memory to the free page manager. We don't
+ * do this for the active block to avoid thrashing: if we
+ * repeatedly allocate and free the only chunk in the active block, it
+ * will be very inefficient if we deallocate and reallocate the block
+ * every time.
+ */
+ destroy_superblock(area, span_pointer);
+ }
+
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+}
+
+/*
+ * Obtain a backend-local address for a dsa_pointer. 'dp' must point to
+ * memory allocated by the given area (possibly in another process) that
+ * hasn't yet been freed. This may cause a segment to be mapped into the
+ * current process if required, and may cause freed segments to be unmapped.
+ */
+void *
+dsa_get_address(dsa_area *area, dsa_pointer dp)
+{
+ dsa_segment_index index;
+ Size offset;
+
+ /* Convert InvalidDsaPointer to NULL. */
+ if (!DsaPointerIsValid(dp))
+ return NULL;
+
+ /* Process any requests to detach from freed segments. */
+ check_for_freed_segments(area);
+
+ /* Break the dsa_pointer into its components. */
+ index = DSA_EXTRACT_SEGMENT_NUMBER(dp);
+ offset = DSA_EXTRACT_OFFSET(dp);
+ Assert(index < DSA_MAX_SEGMENTS);
+
+ /* Check if we need to cause this segment to be mapped in. */
+ if (unlikely(area->segment_maps[index].mapped_address == NULL))
+ {
+ /* Call for effect (we don't need the result). */
+ get_segment_by_index(area, index);
+ }
+
+ return area->segment_maps[index].mapped_address + offset;
+}
+
+/*
+ * Pin this area, so that it will continue to exist even if all backends
+ * detach from it. In that case, the area can still be reattached to if a
+ * handle has been recorded somewhere.
+ */
+void
+dsa_pin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ if (area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_area already pinned");
+ }
+ area->control->pinned = true;
+ ++area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Undo the effects of dsa_pin, so that the given area can be freed when no
+ * backends are attached to it. May be called only if dsa_pin has been
+ * called.
+ */
+void
+dsa_unpin(dsa_area *area)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ Assert(area->control->refcnt > 1);
+ if (!area->control->pinned)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ elog(ERROR, "dsa_area not pinned");
+ }
+ area->control->pinned = false;
+ --area->control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Set the total size limit for this area. This limit is checked whenever new
+ * segments need to be allocated from the operating system. If the new size
+ * limit is already exceeded, this has no immediate effect.
+ *
+ * Note that the total virtual memory usage may be temporarily larger than
+ * this limit when segments have been freed, but not yet detached by all
+ * backends that have attached to them.
+ */
+void
+dsa_set_size_limit(dsa_area *area, Size limit)
+{
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ area->control->max_total_segment_size = limit;
+ LWLockRelease(DSA_AREA_LOCK(area));
+}
+
+/*
+ * Aggressively free all spare memory in the hope of returning DSM segments to
+ * the operating system.
+ */
+void
+dsa_trim(dsa_area *area)
+{
+ int size_class;
+
+ /*
+ * Trim in reverse pool order so we get to the spans-of-spans last, just
+ * in case any become entirely free while processing all the other pools.
+ */
+ for (size_class = DSA_NUM_SIZE_CLASSES - 1; size_class >= 0; --size_class)
+ {
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_pointer span_pointer;
+
+ if (size_class == DSA_SCLASS_SPAN_LARGE)
+ {
+ /* Large object frees give back segments aggressively already. */
+ continue;
+ }
+
+ /*
+ * Search fullness class 1 only. That is where we expect to find an
+ * entirely empty superblock (entirely empty superblocks in other
+ * fullness classes are returned to the free page map by dsa_free).
+ */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+ span_pointer = pool->spans[1];
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ dsa_pointer next = span->nextspan;
+
+ if (span->nallocatable == span->nmax)
+ destroy_superblock(area, span_pointer);
+
+ span_pointer = next;
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+ }
+}
+
+/*
+ * Print out debugging information about the internal state of the shared
+ * memory area.
+ */
+void
+dsa_dump(dsa_area *area)
+{
+ Size i,
+ j;
+
+ /*
+ * Note: This gives an inconsistent snapshot as it acquires and releases
+ * individual locks as it goes...
+ */
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ fprintf(stderr, "dsa_area handle %x:\n", area->control->handle);
+ fprintf(stderr, " max_total_segment_size: %zu\n",
+ area->control->max_total_segment_size);
+ fprintf(stderr, " total_segment_size: %zu\n",
+ area->control->total_segment_size);
+ fprintf(stderr, " refcnt: %d\n", area->control->refcnt);
+ fprintf(stderr, " pinned: %c\n", area->control->pinned ? 't' : 'f');
+ fprintf(stderr, " segment bins:\n");
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ {
+ if (area->control->segment_bins[i] != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_index segment_index;
+
+ fprintf(stderr,
+ " segment bin %zu (at least %d contiguous pages free):\n",
+ i, i == 0 ? 0 : 1 << (i - 1));
+ segment_index = area->control->segment_bins[i];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, segment_index);
+
+ fprintf(stderr,
+ " segment index %zu, usable_pages = %zu, "
+ "contiguous_pages = %zu, mapped at %p\n",
+ segment_index,
+ segment_map->header->usable_pages,
+ fpm_largest(segment_map->fpm),
+ segment_map->mapped_address);
+ segment_index = segment_map->header->next;
+ }
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ fprintf(stderr, " pools:\n");
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ {
+ bool found = false;
+
+ LWLockAcquire(DSA_SCLASS_LOCK(area, i), LW_EXCLUSIVE);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ if (DsaPointerIsValid(area->control->pools[i].spans[j]))
+ found = true;
+ if (found)
+ {
+ if (i == DSA_SCLASS_BLOCK_OF_SPANS)
+ fprintf(stderr, " pool for blocks of span objects:\n");
+ else if (i == DSA_SCLASS_SPAN_LARGE)
+ fprintf(stderr, " pool for large object spans:\n");
+ else
+ fprintf(stderr,
+ " pool for size class %zu (object size %hu bytes):\n",
+ i, dsa_size_classes[i]);
+ for (j = 0; j < DSA_FULLNESS_CLASSES; ++j)
+ {
+ if (!DsaPointerIsValid(area->control->pools[i].spans[j]))
+ fprintf(stderr, " fullness class %zu is empty\n", j);
+ else
+ {
+ dsa_pointer span_pointer = area->control->pools[i].spans[j];
+
+ fprintf(stderr, " fullness class %zu:\n", j);
+ while (DsaPointerIsValid(span_pointer))
+ {
+ dsa_area_span *span;
+
+ span = dsa_get_address(area, span_pointer);
+ fprintf(stderr,
+ " span descriptor at %016lx, "
+ "superblock at %016lx, pages = %zu, "
+ "objects free = %hu/%hu\n",
+ span_pointer, span->start, span->npages,
+ span->nallocatable, span->nmax);
+ span_pointer = span->nextspan;
+ }
+ }
+ }
+ }
+ LWLockRelease(DSA_SCLASS_LOCK(area, i));
+ }
+}
+
+/*
+ * Return the smallest size that you can successfully provide to
+ * dsa_create_in_place.
+ */
+Size
+dsa_minimum_size(void)
+{
+ Size size;
+ int pages = 0;
+
+ size = MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager));
+
+ /* Figure out how many pages we need, including the page map... */
+ while (((size + FPM_PAGE_SIZE - 1) / FPM_PAGE_SIZE) > pages)
+ {
+ ++pages;
+ size += sizeof(dsa_pointer);
+ }
+
+ return pages * FPM_PAGE_SIZE;
+}
+
+/*
+ * Workhorse function for dsa_create and dsa_create_in_place.
+ */
+static dsa_area *
+create_internal(void *place, size_t size,
+ int tranche_id, const char *tranche_name,
+ dsm_handle control_handle,
+ dsm_segment *control_segment)
+{
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+ Size usable_pages;
+ Size total_pages;
+ Size metadata_bytes;
+ int i;
+
+ /* Sanity check on the space we have to work in. */
+ if (size < dsa_minimum_size())
+ elog(ERROR, "dsa_area space must be at least %zu, but %zu provided",
+ dsa_minimum_size(), size);
+
+ /* Now figure out how much space is usable. */
+ total_pages = size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ Assert(metadata_bytes <= size);
+ usable_pages = (size - metadata_bytes) / FPM_PAGE_SIZE;
+
+ /*
+ * Initialize the dsa_area_control object located at the start of the
+ * space.
+ */
+ control = (dsa_area_control *) place;
+ control->segment_header.magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ control_handle ^ 0;
+ control->segment_header.next = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
+ control->segment_header.usable_pages = usable_pages;
+ control->segment_header.freed = false;
+ control->segment_header.size = size;
+ control->handle = control_handle;
+ control->max_total_segment_size = SIZE_MAX;
+ control->total_segment_size = size;
+ memset(&control->segment_handles[0], 0,
+ sizeof(dsm_handle) * DSA_MAX_SEGMENTS);
+ control->segment_handles[0] = control_handle;
+ for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
+ control->segment_bins[i] = DSA_SEGMENT_INDEX_NONE;
+ control->high_segment_index = 0;
+ control->refcnt = 1;
+ control->freed_segment_counter = 0;
+ control->lwlock_tranche_id = tranche_id;
+ strlcpy(control->lwlock_tranche_name, tranche_name, DSA_MAXLEN);
+
+ /*
+ * Create the dsa_area object that this backend will use to access the
+ * area. Other backends will need to obtain their own dsa_area object by
+ * attaching.
+ */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(area->segment_maps, 0, sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->high_segment_index = 0;
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+ LWLockInitialize(&control->lock, control->lwlock_tranche_id);
+ for (i = 0; i < DSA_NUM_SIZE_CLASSES; ++i)
+ LWLockInitialize(DSA_SCLASS_LOCK(area, i),
+ control->lwlock_tranche_id);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = control_segment;
+ segment_map->mapped_address = place;
+ segment_map->header = (dsa_segment_header *) place;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ /* There can be 0 usable pages if size is dsa_minimum_size(). */
+ if (usable_pages > 0)
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Put this segment into the appropriate bin. */
+ control->segment_bins[contiguous_pages_to_segment_bin(usable_pages)] = 0;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+
+ return area;
+}
+
+/*
+ * Workhorse function for dsa_attach and dsa_attach_in_place.
+ */
+static dsa_area *
+attach_internal(void *place, dsm_segment *segment, dsa_handle handle)
+{
+ dsa_area_control *control;
+ dsa_area *area;
+ dsa_segment_map *segment_map;
+
+ control = (dsa_area_control *) place;
+ Assert(control->handle == handle);
+ Assert(control->segment_handles[0] == handle);
+ Assert(control->segment_header.magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ handle ^ 0));
+
+ /* Build the backend-local area object. */
+ area = palloc(sizeof(dsa_area));
+ area->control = control;
+ area->mapping_pinned = false;
+ memset(&area->segment_maps[0], 0,
+ sizeof(dsa_segment_map) * DSA_MAX_SEGMENTS);
+ area->high_segment_index = 0;
+ area->lwlock_tranche.array_base = &area->control->pools[0];
+ area->lwlock_tranche.array_stride = sizeof(dsa_area_pool);
+ area->lwlock_tranche.name = control->lwlock_tranche_name;
+ LWLockRegisterTranche(control->lwlock_tranche_id, &area->lwlock_tranche);
+
+ /* Set up the segment map for this process's mapping. */
+ segment_map = &area->segment_maps[0];
+ segment_map->segment = segment; /* NULL for in-place */
+ segment_map->mapped_address = place;
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address + MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Bump the reference count. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ ++control->refcnt;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return area;
+}
+
+/*
+ * Add a new span to fullness class 1 of the indicated pool.
+ */
+static void
+init_span(dsa_area *area,
+ dsa_pointer span_pointer,
+ dsa_area_pool *pool, dsa_pointer start, Size npages,
+ uint16 size_class)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ Size obsize = dsa_size_classes[size_class];
+
+ /*
+ * The per-pool lock must be held because we manipulate the span list for
+ * this pool.
+ */
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /* Push this span onto the front of the span list for fullness class 1. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ {
+ dsa_area_span *head = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+
+ head->prevspan = span_pointer;
+ }
+ span->pool = DsaAreaPoolToDsaPointer(area, pool);
+ span->nextspan = pool->spans[1];
+ span->prevspan = InvalidDsaPointer;
+ pool->spans[1] = span_pointer;
+
+ span->start = start;
+ span->npages = npages;
+ span->size_class = size_class;
+ span->ninitialized = 0;
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * A block-of-spans contains its own descriptor, so mark one object as
+ * initialized and reduce the count of allocatable objects by one.
+ * Doing this here has the side effect of also reducing nmax by one,
+ * which is important to make sure we free this object at the correct
+ * time.
+ */
+ span->ninitialized = 1;
+ span->nallocatable = FPM_PAGE_SIZE / obsize - 1;
+ }
+ else if (size_class != DSA_SCLASS_SPAN_LARGE)
+ span->nallocatable = DSA_SUPERBLOCK_SIZE / obsize;
+ span->firstfree = DSA_SPAN_NOTHING_FREE;
+ span->nmax = span->nallocatable;
+ span->fclass = 1;
+}
+
+/*
+ * Transfer the first span in one fullness class to the head of another
+ * fullness class.
+ */
+static bool
+transfer_first_span(dsa_area *area,
+ dsa_area_pool *pool, int fromclass, int toclass)
+{
+ dsa_pointer span_pointer;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+
+ /* Can't do it if source list is empty. */
+ span_pointer = pool->spans[fromclass];
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+
+ /* Remove span from head of source list. */
+ span = dsa_get_address(area, span_pointer);
+ pool->spans[fromclass] = span->nextspan;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+
+ /* Add span to head of target list. */
+ span->nextspan = pool->spans[toclass];
+ pool->spans[toclass] = span_pointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = toclass;
+
+ return true;
+}
+
+/*
+ * Allocate one object of the requested size class from the given area.
+ */
+static inline dsa_pointer
+alloc_object(dsa_area *area, int size_class)
+{
+ dsa_area_pool *pool = &area->control->pools[size_class];
+ dsa_area_span *span;
+ dsa_pointer block;
+ dsa_pointer result;
+ char *object;
+ Size size;
+
+ /*
+ * Even though ensure_active_superblock can in turn call alloc_object if
+ * it needs to allocate a new span, that's always from a different pool,
+ * and the order of lock acquisition is always the same, so it's OK that
+ * we hold this lock for the duration of this function.
+ */
+ Assert(!LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockAcquire(DSA_SCLASS_LOCK(area, size_class), LW_EXCLUSIVE);
+
+ /*
+ * If there's no active superblock, we must successfully obtain one or
+ * fail the request.
+ */
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ !ensure_active_superblock(area, pool, size_class))
+ {
+ result = InvalidDsaPointer;
+ }
+ else
+ {
+ /*
+ * There should be a block in fullness class 1 at this point, and it
+ * should never be completely full. Thus we can either pop an object
+ * from the free list or, failing that, initialize a new object.
+ */
+ Assert(DsaPointerIsValid(pool->spans[1]));
+ span = (dsa_area_span *)
+ dsa_get_address(area, pool->spans[1]);
+ Assert(span->nallocatable > 0);
+ block = span->start;
+ Assert(size_class < DSA_NUM_SIZE_CLASSES);
+ size = dsa_size_classes[size_class];
+ if (span->firstfree != DSA_SPAN_NOTHING_FREE)
+ {
+ result = block + span->firstfree * size;
+ object = dsa_get_address(area, result);
+ span->firstfree = NextFreeObjectIndex(object);
+ }
+ else
+ {
+ result = block + span->ninitialized * size;
+ ++span->ninitialized;
+ }
+ --span->nallocatable;
+
+ /* If it's now full, move it to the highest-numbered fullness class. */
+ if (span->nallocatable == 0)
+ transfer_first_span(area, pool, 1, DSA_FULLNESS_CLASSES - 1);
+ }
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+ LWLockRelease(DSA_SCLASS_LOCK(area, size_class));
+
+ return result;
+}
+
+/*
+ * Ensure an active (i.e. fullness class 1) superblock, unless all existing
+ * superblocks are completely full and no more can be allocated.
+ *
+ * Fullness classes K of 0..N are loosely intended to represent blocks whose
+ * utilization percentage is at least K/N, but we only enforce this rigorously
+ * for the highest-numbered fullness class, which always contains exactly
+ * those blocks that are completely full. It's otherwise acceptable for a
+ * block to be in a higher-numbered fullness class than the one to which it
+ * logically belongs. In addition, the active block, which is always the
+ * first block in fullness class 1, is permitted to have a higher allocation
+ * percentage than would normally be allowable for that fullness class; we
+ * don't move it until it's completely full, and then it goes to the
+ * highest-numbered fullness class.
+ *
+ * It might seem odd that the active block is the head of fullness class 1
+ * rather than fullness class 0, but experience with other allocators has
+ * shown that it's usually better to allocate from a block that's moderately
+ * full rather than one that's nearly empty. Insofar as is reasonably
+ * possible, we want to avoid performing new allocations in a block that would
+ * otherwise become empty soon.
+ */
+static bool
+ensure_active_superblock(dsa_area *area, dsa_area_pool *pool,
+ int size_class)
+{
+ dsa_pointer span_pointer;
+ dsa_pointer start_pointer;
+ Size obsize = dsa_size_classes[size_class];
+ Size nmax;
+ int fclass;
+ Size npages = 1;
+ Size first_page;
+ Size i;
+ dsa_segment_map *segment_map;
+
+ Assert(LWLockHeldByMe(DSA_SCLASS_LOCK(area, size_class)));
+
+ /*
+ * Compute the number of objects that will fit in a block of this size
+ * class. Span-of-spans blocks are just a single page, and the first
+ * object isn't available for use because it describes the block-of-spans
+ * itself.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ nmax = FPM_PAGE_SIZE / obsize - 1;
+ else
+ nmax = DSA_SUPERBLOCK_SIZE / obsize;
+
+ /*
+ * If fullness class 1 is empty, try to find a span to put in it by
+ * scanning higher-numbered fullness classes (excluding the last one,
+ * whose blocks are certain to all be completely full).
+ */
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ {
+ span_pointer = pool->spans[fclass];
+
+ while (DsaPointerIsValid(span_pointer))
+ {
+ int tfclass;
+ dsa_area_span *span;
+ dsa_area_span *nextspan;
+ dsa_area_span *prevspan;
+ dsa_pointer next_span_pointer;
+
+ span = (dsa_area_span *)
+ dsa_get_address(area, span_pointer);
+ next_span_pointer = span->nextspan;
+
+ /* Figure out what fullness class should contain this span. */
+ tfclass = (nmax - span->nallocatable)
+ * (DSA_FULLNESS_CLASSES - 1) / nmax;
+
+ /* Look up next span. */
+ if (DsaPointerIsValid(span->nextspan))
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ else
+ nextspan = NULL;
+
+ /*
+ * If utilization has dropped enough that this now belongs in some
+ * other fullness class, move it there.
+ */
+ if (tfclass < fclass)
+ {
+ /* Remove from the current fullness class list. */
+ if (pool->spans[fclass] == span_pointer)
+ {
+ /* It was the head; remove it. */
+ Assert(!DsaPointerIsValid(span->prevspan));
+ pool->spans[fclass] = span->nextspan;
+ if (nextspan != NULL)
+ nextspan->prevspan = InvalidDsaPointer;
+ }
+ else
+ {
+ /* It was not the head. */
+ Assert(DsaPointerIsValid(span->prevspan));
+ prevspan = (dsa_area_span *)
+ dsa_get_address(area, span->prevspan);
+ prevspan->nextspan = span->nextspan;
+ }
+ if (nextspan != NULL)
+ nextspan->prevspan = span->prevspan;
+
+ /* Push onto the head of the new fullness class list. */
+ span->nextspan = pool->spans[tfclass];
+ pool->spans[tfclass] = span_pointer;
+ span->prevspan = InvalidDsaPointer;
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ nextspan = (dsa_area_span *)
+ dsa_get_address(area, span->nextspan);
+ nextspan->prevspan = span_pointer;
+ }
+ span->fclass = tfclass;
+ }
+
+ /* Advance to next span on list. */
+ span_pointer = next_span_pointer;
+ }
+
+ /* Stop now if we found a suitable block. */
+ if (DsaPointerIsValid(pool->spans[1]))
+ return true;
+ }
+
+ /*
+ * If there are no blocks that properly belong in fullness class 1, pick
+ * one from some other fullness class and move it there anyway, so that we
+ * have an allocation target. Our last choice is to transfer a block
+ * that's almost empty (and might become completely empty soon if left
+ * alone), but even that is better than failing, which is what we must do
+ * if there are no blocks at all with freespace.
+ */
+ Assert(!DsaPointerIsValid(pool->spans[1]));
+ for (fclass = 2; fclass < DSA_FULLNESS_CLASSES - 1; ++fclass)
+ if (transfer_first_span(area, pool, fclass, 1))
+ return true;
+ if (!DsaPointerIsValid(pool->spans[1]) &&
+ transfer_first_span(area, pool, 0, 1))
+ return true;
+
+ /*
+ * We failed to find an existing span with free objects, so we need to
+ * allocate a new superblock and construct a new span to manage it.
+ *
+ * First, get a dsa_area_span object to describe the new superblock
+ * ... unless this allocation is for a dsa_area_span object, in which case
+ * that's surely not going to work. We handle that case by storing the
+ * span describing a block-of-spans inline.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ span_pointer = alloc_object(area, DSA_SCLASS_BLOCK_OF_SPANS);
+ if (!DsaPointerIsValid(span_pointer))
+ return false;
+ npages = DSA_PAGES_PER_SUPERBLOCK;
+ }
+
+ /* Find or create a segment and allocate the superblock. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ segment_map = get_best_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ segment_map = make_new_segment(area, npages);
+ if (segment_map == NULL)
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ return false;
+ }
+ }
+ if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
+ {
+ LWLockRelease(DSA_AREA_LOCK(area));
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+ return false;
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /* Compute the start of the superblock. */
+ start_pointer =
+ DSA_MAKE_POINTER(get_segment_index(area, segment_map),
+ first_page * FPM_PAGE_SIZE);
+
+ /*
+ * If this is a block-of-spans, carve the descriptor right out of the
+ * allocated space.
+ */
+ if (size_class == DSA_SCLASS_BLOCK_OF_SPANS)
+ {
+ /*
+ * We have a pointer into the segment. We need to build a dsa_pointer
+ * from the segment index and offset into the segment.
+ */
+ span_pointer = start_pointer;
+ }
+
+ /* Initialize span and pagemap. */
+ init_span(area, span_pointer, pool, start_pointer, npages, size_class);
+ for (i = 0; i < npages; ++i)
+ segment_map->pagemap[first_page + i] = span_pointer;
+
+ return true;
+}
+
+/*
+ * Return the segment map corresponding to a given segment index, mapping the
+ * segment in if necessary. For internal segment book-keeping, this is called
+ * with the area lock held. It is also called by dsa_free and dsa_get_address
+ * without any locking, relying on the fact they have a known live segment
+ * index and they always call check_for_freed_segments to ensure that any
+ * freed segment occupying the same slot is detached first.
+ */
+static dsa_segment_map *
+get_segment_by_index(dsa_area *area, dsa_segment_index index)
+{
+ if (unlikely(area->segment_maps[index].mapped_address == NULL))
+ {
+ dsm_handle handle;
+ dsm_segment *segment;
+ dsa_segment_map *segment_map;
+
+ /*
+ * If we are reached by dsa_free or dsa_get_address, there must be at
+ * least one object allocated in the referenced segment. Otherwise,
+ * their caller has a double-free or access-after-free bug, which we
+ * have no hope of detecting. So we know it's safe to access this
+ * array slot without holding a lock; it won't change underneath us.
+ * Furthermore, we know that we can see the latest contents of the
+ * slot, as explained in check_for_freed_segments, which those
+ * functions call before arriving here.
+ */
+ handle = area->control->segment_handles[index];
+
+ /* It's an error to try to access an unused slot. */
+ if (handle == DSM_HANDLE_INVALID)
+ elog(ERROR,
+ "dsa_area could not attach to a segment that has been freed");
+
+ segment = dsm_attach(handle);
+ if (segment == NULL)
+ elog(ERROR, "dsa_area could not attach to segment");
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+ segment_map = &area->segment_maps[index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header =
+ (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Remember the highest index this backend has ever mapped. */
+ if (area->high_segment_index < index)
+ area->high_segment_index = index;
+
+ Assert(segment_map->header->magic ==
+ (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index));
+ }
+
+ return &area->segment_maps[index];
+}
+
+/*
+ * Return a superblock to the free page manager. If the underlying segment
+ * has become entirely free, then return it to the operating system.
+ *
+ * The appropriate pool lock must be held.
+ */
+static void
+destroy_superblock(dsa_area *area, dsa_pointer span_pointer)
+{
+ dsa_area_span *span = dsa_get_address(area, span_pointer);
+ int size_class = span->size_class;
+ dsa_segment_map *segment_map;
+
+ segment_map =
+ get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(span->start));
+
+ /* Remove it from its fullness class list. */
+ unlink_span(area, span);
+
+ /*
+ * Note: Here we acquire the area lock while we already hold a per-pool
+ * lock. We never hold the area lock and then take a pool lock, or we
+ * could deadlock.
+ */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ FreePageManagerPut(segment_map->fpm,
+ DSA_EXTRACT_OFFSET(span->start) / FPM_PAGE_SIZE,
+ span->npages);
+ /* Check if the segment is now entirely free. */
+ if (fpm_largest(segment_map->fpm) == segment_map->header->usable_pages)
+ {
+ dsa_segment_index index = get_segment_index(area, segment_map);
+
+ /* If it's not the segment with extra control data, free it. */
+ if (index != 0)
+ {
+ /*
+ * Give it back to the OS, and allow other backends to detect that
+ * they need to detach.
+ */
+ unlink_segment(area, segment_map);
+ segment_map->header->freed = true;
+ Assert(area->control->total_segment_size >=
+ segment_map->header->size);
+ area->control->total_segment_size -=
+ segment_map->header->size;
+ dsm_unpin_segment(dsm_segment_handle(segment_map->segment));
+ dsm_detach(segment_map->segment);
+ area->control->segment_handles[index] = DSM_HANDLE_INVALID;
+ ++area->control->freed_segment_counter;
+ segment_map->segment = NULL;
+ segment_map->header = NULL;
+ segment_map->mapped_address = NULL;
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ /*
+ * A block-of-spans stores the span which describes it within the
+ * block itself, so freeing the storage implicitly frees the descriptor
+ * also. If this is a block of any other type, we need to separately free
+ * the span object also. This recursive call to dsa_free will acquire the
+ * span pool's lock. We can't deadlock because the acquisition order is
+ * always some other pool and then the span pool.
+ */
+ if (size_class != DSA_SCLASS_BLOCK_OF_SPANS)
+ dsa_free(area, span_pointer);
+}
+
+static void
+unlink_span(dsa_area *area, dsa_area_span *span)
+{
+ if (DsaPointerIsValid(span->nextspan))
+ {
+ dsa_area_span *next = dsa_get_address(area, span->nextspan);
+
+ next->prevspan = span->prevspan;
+ }
+ if (DsaPointerIsValid(span->prevspan))
+ {
+ dsa_area_span *prev = dsa_get_address(area, span->prevspan);
+
+ prev->nextspan = span->nextspan;
+ }
+ else
+ {
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ pool->spans[span->fclass] = span->nextspan;
+ }
+}
+
+static void
+add_span_to_fullness_class(dsa_area *area, dsa_area_span *span,
+ dsa_pointer span_pointer,
+ int fclass)
+{
+ dsa_area_pool *pool = dsa_get_address(area, span->pool);
+
+ if (DsaPointerIsValid(pool->spans[fclass]))
+ {
+ dsa_area_span *head = dsa_get_address(area,
+ pool->spans[fclass]);
+
+ head->prevspan = span_pointer;
+ }
+ span->prevspan = InvalidDsaPointer;
+ span->nextspan = pool->spans[fclass];
+ pool->spans[fclass] = span_pointer;
+ span->fclass = fclass;
+}
+
+/*
+ * Detach from an area that was either created or attached to by this process.
+ */
+void
+dsa_detach(dsa_area *area)
+{
+ int i;
+
+ /* Detach from all segments. */
+ for (i = 0; i <= area->high_segment_index; ++i)
+ if (area->segment_maps[i].segment != NULL)
+ dsm_detach(area->segment_maps[i].segment);
+
+ /*
+ * Note that 'detaching' (= detaching from DSM segments) doesn't include
+ * 'releasing' (= adjusting the reference count). It would be nice to
+ * combine these operations, but client code might never get around to
+ * calling dsa_detach because of an error path, and a detach hook on any
+ * particular segment is too late to detach other segments in the area
+ * without risking a 'leak' warning in the non-error path.
+ */
+
+ /* Free the backend-local area object. */
+ pfree(area);
+}
+
+/*
+ * Unlink a segment from the bin that contains it.
+ */
+static void
+unlink_segment(dsa_area *area, dsa_segment_map *segment_map)
+{
+ if (segment_map->header->prev != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *prev;
+
+ prev = get_segment_by_index(area, segment_map->header->prev);
+ prev->header->next = segment_map->header->next;
+ }
+ else
+ {
+ Assert(area->control->segment_bins[segment_map->header->bin] ==
+ get_segment_index(area, segment_map));
+ area->control->segment_bins[segment_map->header->bin] =
+ segment_map->header->next;
+ }
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area, segment_map->header->next);
+ next->header->prev = segment_map->header->prev;
+ }
+}
+
+/*
+ * Find a segment that could satisfy a request for 'npages' of contiguous
+ * memory, or return NULL if none can be found. This may involve attaching to
+ * segments that weren't previously attached so that we can query their free
+ * pages map.
+ */
+static dsa_segment_map *
+get_best_segment(dsa_area *area, Size npages)
+{
+ Size bin;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /*
+ * Start searching from the first bin that *might* have enough contiguous
+ * pages.
+ */
+ for (bin = contiguous_pages_to_segment_bin(npages);
+ bin < DSA_NUM_SEGMENT_BINS;
+ ++bin)
+ {
+ /*
+ * The minimum contiguous size that any segment in this bin should
+ * have. We'll re-bin if we see segments with fewer.
+ */
+ Size threshold = (Size) 1 << (bin - 1);
+ dsa_segment_index segment_index;
+
+ /* Search this bin for a segment with enough contiguous space. */
+ segment_index = area->control->segment_bins[bin];
+ while (segment_index != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *segment_map;
+ dsa_segment_index next_segment_index;
+ Size contiguous_pages;
+
+ segment_map = get_segment_by_index(area, segment_index);
+ next_segment_index = segment_map->header->next;
+ contiguous_pages = fpm_largest(segment_map->fpm);
+
+ /* Not enough for the request, still enough for this bin. */
+ if (contiguous_pages >= threshold && contiguous_pages < npages)
+ {
+ segment_index = next_segment_index;
+ continue;
+ }
+
+ /* Re-bin it if it's no longer in the appropriate bin. */
+ if (contiguous_pages < threshold)
+ {
+ Size new_bin;
+
+ new_bin = contiguous_pages_to_segment_bin(contiguous_pages);
+
+ /* Remove it from its current bin. */
+ unlink_segment(area, segment_map);
+
+ /* Push it onto the front of its new bin. */
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[new_bin];
+ segment_map->header->bin = new_bin;
+ area->control->segment_bins[new_bin] = segment_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next;
+
+ next = get_segment_by_index(area,
+ segment_map->header->next);
+ Assert(next->header->bin == new_bin);
+ next->header->prev = segment_index;
+ }
+
+ /*
+ * But fall through to see if it's enough to satisfy this
+ * request anyway....
+ */
+ }
+
+ /* Check if we are done. */
+ if (contiguous_pages >= npages)
+ return segment_map;
+
+ /* Continue searching the same bin. */
+ segment_index = next_segment_index;
+ }
+ }
+
+ /* Not found. */
+ return NULL;
+}
+
+/*
+ * Create a new segment that can handle at least requested_pages. Returns
+ * NULL if the requested total size limit or maximum allowed number of
+ * segments would be exceeded.
+ */
+static dsa_segment_map *
+make_new_segment(dsa_area *area, Size requested_pages)
+{
+ dsa_segment_index new_index;
+ Size metadata_bytes;
+ Size total_size;
+ Size total_pages;
+ Size usable_pages;
+ dsa_segment_map *segment_map;
+ dsm_segment *segment;
+
+ Assert(LWLockHeldByMe(DSA_AREA_LOCK(area)));
+
+ /* Find a segment slot that is not in use (linearly for now). */
+ for (new_index = 1; new_index < DSA_MAX_SEGMENTS; ++new_index)
+ {
+ if (area->control->segment_handles[new_index] == DSM_HANDLE_INVALID)
+ break;
+ }
+ if (new_index == DSA_MAX_SEGMENTS)
+ return NULL;
+
+ /*
+ * If the total size limit is already exceeded, then we exit early and
+ * avoid arithmetic wraparound in the unsigned expressions below.
+ */
+ if (area->control->total_segment_size >=
+ area->control->max_total_segment_size)
+ return NULL;
+
+ /*
+ * The size should be at least as big as requested, and at least big
+ * enough to follow a geometric series that approximately doubles the
+ * total storage each time we create a new segment. We use geometric
+ * growth because the underlying DSM system isn't designed for large
+ * numbers of segments (otherwise we might even consider just using one
+ * DSM segment for each large allocation and for each superblock, and then
+ * we wouldn't need to use FreePageManager).
+ *
+ * We decide on a total segment size first, so that we produce tidy
+ * power-of-two sized segments. This is a good property to have if we
+ * move to huge pages in the future. Then we work back to the number of
+ * pages we can fit.
+ */
+ total_size = DSA_INITIAL_SEGMENT_SIZE *
+ ((Size) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
+ total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size,
+ area->control->max_total_segment_size -
+ area->control->total_segment_size);
+
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ sizeof(dsa_pointer) * total_pages;
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ if (total_size <= metadata_bytes)
+ return NULL;
+ usable_pages = (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ Assert(metadata_bytes + usable_pages * FPM_PAGE_SIZE <= total_size);
+
+ /* See if that is enough... */
+ if (requested_pages > usable_pages)
+ {
+ /*
+ * We'll make an odd-sized segment, working forward from the requested
+ * number of pages.
+ */
+ usable_pages = requested_pages;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ usable_pages * sizeof(dsa_pointer);
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ total_size = metadata_bytes + usable_pages * FPM_PAGE_SIZE;
+
+ /* Is that too large for dsa_pointer's addressing scheme? */
+ if (total_size > DSA_MAX_SEGMENT_SIZE)
+ return NULL;
+
+ /* Would that exceed the limit? */
+ if (total_size > area->control->max_total_segment_size -
+ area->control->total_segment_size)
+ return NULL;
+ }
+
+ /* Create the segment. */
+ segment = dsm_create(total_size, 0);
+ if (segment == NULL)
+ return NULL;
+ dsm_pin_segment(segment);
+ if (area->mapping_pinned)
+ dsm_pin_mapping(segment);
+
+ /* Store the handle in shared memory to be found by index. */
+ area->control->segment_handles[new_index] =
+ dsm_segment_handle(segment);
+ /* Track the highest segment index in the history of the area. */
+ if (area->control->high_segment_index < new_index)
+ area->control->high_segment_index = new_index;
+ /* Track the highest segment index this backend has ever mapped. */
+ if (area->high_segment_index < new_index)
+ area->high_segment_index = new_index;
+ /* Track total size of all segments. */
+ area->control->total_segment_size += total_size;
+ Assert(area->control->total_segment_size <=
+ area->control->max_total_segment_size);
+
+ /* Build a segment map for this segment in this backend. */
+ segment_map = &area->segment_maps[new_index];
+ segment_map->segment = segment;
+ segment_map->mapped_address = dsm_segment_address(segment);
+ segment_map->header = (dsa_segment_header *) segment_map->mapped_address;
+ segment_map->fpm = (FreePageManager *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)));
+ segment_map->pagemap = (dsa_pointer *)
+ (segment_map->mapped_address +
+ MAXALIGN(sizeof(dsa_segment_header)) +
+ MAXALIGN(sizeof(FreePageManager)));
+
+ /* Set up the free page map. */
+ FreePageManagerInitialize(segment_map->fpm, segment_map->mapped_address);
+ FreePageManagerPut(segment_map->fpm, metadata_bytes / FPM_PAGE_SIZE,
+ usable_pages);
+
+ /* Set up the segment header and put it in the appropriate bin. */
+ segment_map->header->magic =
+ DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ new_index;
+ segment_map->header->usable_pages = usable_pages;
+ segment_map->header->size = total_size;
+ segment_map->header->bin = contiguous_pages_to_segment_bin(usable_pages);
+ segment_map->header->prev = DSA_SEGMENT_INDEX_NONE;
+ segment_map->header->next =
+ area->control->segment_bins[segment_map->header->bin];
+ segment_map->header->freed = false;
+ area->control->segment_bins[segment_map->header->bin] = new_index;
+ if (segment_map->header->next != DSA_SEGMENT_INDEX_NONE)
+ {
+ dsa_segment_map *next =
+ get_segment_by_index(area, segment_map->header->next);
+
+ Assert(next->header->bin == segment_map->header->bin);
+ next->header->prev = new_index;
+ }
+
+ return segment_map;
+}
+
+/*
+ * Check if any segments have been freed by destroy_superblock, so we can
+ * detach from them in this backend. This function is called by
+ * dsa_get_address and dsa_free to make sure that a dsa_pointer they have
+ * received can be resolved to the correct segment.
+ *
+ * The danger we want to defend against is that there could be an old segment
+ * mapped into a given slot in this backend, and the dsa_pointer they have
+ * might refer to some new segment in the same slot. So those functions must
+ * be sure to process all instructions to detach from a freed segment that had
+ * been generated by the time this process received the dsa_pointer, before
+ * they call get_segment_by_index.
+ */
+static void
+check_for_freed_segments(dsa_area *area)
+{
+ Size freed_segment_counter;
+
+ /*
+ * Any other process that has freed a segment has incremented
+ * freed_segment_counter while holding an LWLock, and that must precede any
+ * backend creating a new segment in the same slot while holding an
+ * LWLock, and that must precede the creation of any dsa_pointer pointing
+ * into the new segment which might reach us here, and the caller must
+ * have sent the dsa_pointer to this process using appropriate memory
+ * synchronization (some kind of locking or atomic primitive or system
+ * call). So all we need to do on the reading side is ask for the load of
+ * freed_segment_counter to follow the caller's load of the dsa_pointer it
+ * has, and we can be sure to detect any segments that had been freed as
+ * of the time that the dsa_pointer reached this process.
+ */
+ pg_read_barrier();
+ freed_segment_counter = area->control->freed_segment_counter;
+ if (unlikely(area->freed_segment_counter != freed_segment_counter))
+ {
+ int i;
+
+ /* Check all currently mapped segments to find what's been freed. */
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_EXCLUSIVE);
+ for (i = 0; i <= area->high_segment_index; ++i)
+ {
+ if (area->segment_maps[i].header != NULL &&
+ area->segment_maps[i].header->freed)
+ {
+ dsm_detach(area->segment_maps[i].segment);
+ area->segment_maps[i].segment = NULL;
+ area->segment_maps[i].header = NULL;
+ area->segment_maps[i].mapped_address = NULL;
+ }
+ }
+ LWLockRelease(DSA_AREA_LOCK(area));
+ area->freed_segment_counter = freed_segment_counter;
+ }
+}
diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
new file mode 100644
index 0000000..cf27e1b
--- /dev/null
+++ b/src/backend/utils/mmgr/freepage.c
@@ -0,0 +1,1846 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.c
+ * Management of free memory pages.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/mmgr/freepage.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+
+#include "utils/freepage.h"
+#include "utils/relptr.h"
+
+
+/* Magic numbers to identify various page types */
+#define FREE_PAGE_SPAN_LEADER_MAGIC 0xea4020f0
+#define FREE_PAGE_LEAF_MAGIC 0x98eae728
+#define FREE_PAGE_INTERNAL_MAGIC 0x19aa32c9
+
+/* Doubly linked list of spans of free pages; stored in first page of span. */
+struct FreePageSpanLeader
+{
+ int magic; /* always FREE_PAGE_SPAN_LEADER_MAGIC */
+ Size npages; /* number of pages in span */
+ RelptrFreePageSpanLeader prev;
+ RelptrFreePageSpanLeader next;
+};
+
+/* Common header for btree leaf and internal pages. */
+typedef struct FreePageBtreeHeader
+{
+ int magic; /* FREE_PAGE_LEAF_MAGIC or
+ * FREE_PAGE_INTERNAL_MAGIC */
+ Size nused; /* number of items used */
+ RelptrFreePageBtree parent; /* uplink */
+} FreePageBtreeHeader;
+
+/* Internal key; points to next level of btree. */
+typedef struct FreePageBtreeInternalKey
+{
+ Size first_page; /* low bound for keys on child page */
+ RelptrFreePageBtree child; /* downlink */
+} FreePageBtreeInternalKey;
+
+/* Leaf key; no payload data. */
+typedef struct FreePageBtreeLeafKey
+{
+ Size first_page; /* first page in span */
+ Size npages; /* number of pages in span */
+} FreePageBtreeLeafKey;
+
+/* Work out how many keys will fit on a page. */
+#define FPM_ITEMS_PER_INTERNAL_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeInternalKey))
+#define FPM_ITEMS_PER_LEAF_PAGE \
+ ((FPM_PAGE_SIZE - sizeof(FreePageBtreeHeader)) / \
+ sizeof(FreePageBtreeLeafKey))
+
+/* A btree page of either sort */
+struct FreePageBtree
+{
+ FreePageBtreeHeader hdr;
+ union
+ {
+ FreePageBtreeInternalKey internal_key[FPM_ITEMS_PER_INTERNAL_PAGE];
+ FreePageBtreeLeafKey leaf_key[FPM_ITEMS_PER_LEAF_PAGE];
+ } u;
+};
+
+/* Results of a btree search */
+typedef struct FreePageBtreeSearchResult
+{
+ FreePageBtree *page;
+ Size index;
+ bool found;
+ unsigned split_pages;
+} FreePageBtreeSearchResult;
+
+/* Helper functions */
+static void FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm,
+ FreePageBtree *btp);
+static Size FreePageBtreeCleanup(FreePageManager *fpm);
+static FreePageBtree *FreePageBtreeFindLeftSibling(char *base,
+ FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeFindRightSibling(char *base,
+ FreePageBtree *btp);
+static Size FreePageBtreeFirstKey(FreePageBtree *btp);
+static FreePageBtree *FreePageBtreeGetRecycled(FreePageManager *fpm);
+static void FreePageBtreeInsertInternal(char *base, FreePageBtree *btp,
+ Size index, Size first_page, FreePageBtree *child);
+static void FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index,
+ Size first_page, Size npages);
+static void FreePageBtreeRecycle(FreePageManager *fpm, Size pageno);
+static void FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp,
+ Size index);
+static void FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp);
+static void FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result);
+static Size FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page);
+static Size FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page);
+static FreePageBtree *FreePageBtreeSplitPage(FreePageManager *fpm,
+ FreePageBtree *btp);
+static void FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp);
+static void FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf);
+static void FreePageManagerDumpSpans(FreePageManager *fpm,
+ FreePageSpanLeader *span, Size expected_pages,
+ StringInfo buf);
+static bool FreePageManagerGetInternal(FreePageManager *fpm, Size npages,
+ Size *first_page);
+static Size FreePageManagerPutInternal(FreePageManager *fpm, Size first_page,
+ Size npages, bool soft);
+static void FreePagePopSpanLeader(FreePageManager *fpm, Size pageno);
+static void FreePagePushSpanLeader(FreePageManager *fpm, Size first_page,
+ Size npages);
+static Size FreePageManagerLargestContiguous(FreePageManager *fpm);
+static void FreePageManagerUpdateLargest(FreePageManager *fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+static Size sum_free_pages(FreePageManager *fpm);
+#endif
+
+/*
+ * Initialize a new, empty free page manager.
+ *
+ * 'fpm' should reference caller-provided memory large enough to contain a
+ * FreePageManager. We'll initialize it here.
+ *
+ * 'base' is the address to which all pointers are relative. When managing
+ * a dynamic shared memory segment, it should normally be the base of the
+ * segment. When managing backend-private memory, it can be either NULL or,
+ * if managing a single contiguous extent of memory, the start of that extent.
+ */
+void
+FreePageManagerInitialize(FreePageManager *fpm, char *base)
+{
+ Size f;
+
+ relptr_store(base, fpm->self, fpm);
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ relptr_store(base, fpm->btree_recycle, (FreePageSpanLeader *) NULL);
+ fpm->btree_depth = 0;
+ fpm->btree_recycle_count = 0;
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->contiguous_pages = 0;
+ fpm->contiguous_pages_dirty = true;
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages = 0;
+#endif
+
+ for (f = 0; f < FPM_NUM_FREELISTS; f++)
+ relptr_store(base, fpm->freelist[f], (FreePageSpanLeader *) NULL);
+}
+
+/*
+ * Allocate a run of pages of the given length from the free page manager.
+ * The return value indicates whether we were able to satisfy the request;
+ * if true, the first page of the allocation is stored in *first_page.
+ */
+bool
+FreePageManagerGet(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ bool result;
+ Size contiguous_pages;
+
+ result = FreePageManagerGetInternal(fpm, npages, first_page);
+
+ /*
+ * It's a bit counterintuitive, but allocating pages can actually create
+ * opportunities for cleanup that create larger ranges. We might pull a
+ * key out of the btree that enables the item at the head of the btree
+ * recycle list to be inserted; and then if there are more items behind it
+ * one of those might cause two currently-separated ranges to merge,
+ * creating a single range of contiguous pages larger than any that
+ * existed previously. It might be worth trying to improve the cleanup
+ * algorithm to avoid such corner cases, but for now we just notice the
+ * condition and do the appropriate reporting.
+ */
+ contiguous_pages = FreePageBtreeCleanup(fpm);
+ if (fpm->contiguous_pages < contiguous_pages)
+ fpm->contiguous_pages = contiguous_pages;
+
+ /*
+ * FreePageManagerGetInternal may have set contiguous_pages_dirty.
+ * Recompute contiguous_pages if so.
+ */
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ if (result)
+ {
+ Assert(fpm->free_pages >= npages);
+ fpm->free_pages -= npages;
+ }
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+ Assert(fpm->contiguous_pages == FreePageManagerLargestContiguous(fpm));
+#endif
+ return result;
+}
+
+#ifdef FPM_EXTRA_ASSERTS
+static void
+sum_free_pages_recurse(FreePageManager *fpm, FreePageBtree *btp, Size *sum)
+{
+ char *base = fpm_segment_base(fpm);
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ||
+ btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ ++*sum;
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ Size index;
+
+
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ sum_free_pages_recurse(fpm, child, sum);
+ }
+ }
+}
+static Size
+sum_free_pages(FreePageManager *fpm)
+{
+ FreePageSpanLeader *recycle;
+ char *base = fpm_segment_base(fpm);
+ Size sum = 0;
+ int list;
+
+ /* Count the spans by scanning the freelists. */
+ for (list = 0; list < FPM_NUM_FREELISTS; ++list)
+ {
+
+ if (!relptr_is_null(fpm->freelist[list]))
+ {
+ FreePageSpanLeader *candidate =
+ relptr_access(base, fpm->freelist[list]);
+
+ do
+ {
+ sum += candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ }
+
+ /* Count btree internal pages. */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ sum_free_pages_recurse(fpm, root, &sum);
+ }
+
+ /* Count the recycle list. */
+ for (recycle = relptr_access(base, fpm->btree_recycle);
+ recycle != NULL;
+ recycle = relptr_access(base, recycle->next))
+ {
+ Assert(recycle->npages == 1);
+ ++sum;
+ }
+
+ return sum;
+}
+#endif
+
+/*
+ * Compute the size of the largest run of pages that the user could
+ * successfully get.
+ */
+static Size
+FreePageManagerLargestContiguous(FreePageManager *fpm)
+{
+ char *base;
+ Size largest;
+
+ base = fpm_segment_base(fpm);
+ largest = 0;
+ if (!relptr_is_null(fpm->freelist[FPM_NUM_FREELISTS - 1]))
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[FPM_NUM_FREELISTS - 1]);
+ do
+ {
+ if (candidate->npages > largest)
+ largest = candidate->npages;
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ else
+ {
+ Size f = FPM_NUM_FREELISTS - 1;
+
+ do
+ {
+ --f;
+ if (!relptr_is_null(fpm->freelist[f]))
+ {
+ largest = f + 1;
+ break;
+ }
+ } while (f > 0);
+ }
+
+ return largest;
+}
+
+/*
+ * Recompute the size of the largest run of pages that the user could
+ * successfully get, if it has been marked dirty.
+ */
+static void
+FreePageManagerUpdateLargest(FreePageManager *fpm)
+{
+ if (!fpm->contiguous_pages_dirty)
+ return;
+
+ fpm->contiguous_pages = FreePageManagerLargestContiguous(fpm);
+ fpm->contiguous_pages_dirty = false;
+}
+
+/*
+ * Transfer a run of pages to the free page manager.
+ */
+void
+FreePageManagerPut(FreePageManager *fpm, Size first_page, Size npages)
+{
+ Size contiguous_pages;
+
+ Assert(npages > 0);
+
+ /* Record the new pages. */
+ contiguous_pages =
+ FreePageManagerPutInternal(fpm, first_page, npages, false);
+
+ /*
+ * If the new range we inserted into the page manager was contiguous with
+ * an existing range, it may have opened up cleanup opportunities.
+ */
+ if (contiguous_pages > npages)
+ {
+ Size cleanup_contiguous_pages;
+
+ cleanup_contiguous_pages = FreePageBtreeCleanup(fpm);
+ if (cleanup_contiguous_pages > contiguous_pages)
+ contiguous_pages = cleanup_contiguous_pages;
+ }
+
+ /* See if we now have a new largest chunk. */
+ if (fpm->contiguous_pages < contiguous_pages)
+ fpm->contiguous_pages = contiguous_pages;
+
+ /*
+ * The earlier call to FreePageManagerPutInternal may have set
+ * contiguous_pages_dirty if it needed to allocate internal pages, so
+ * recompute contiguous_pages if necessary.
+ */
+ FreePageManagerUpdateLargest(fpm);
+
+#ifdef FPM_EXTRA_ASSERTS
+ fpm->free_pages += npages;
+ Assert(fpm->free_pages == sum_free_pages(fpm));
+ Assert(fpm->contiguous_pages == FreePageManagerLargestContiguous(fpm));
+#endif
+}
+
+/*
+ * Produce a debugging dump of the state of a free page manager.
+ */
+char *
+FreePageManagerDump(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ StringInfoData buf;
+ FreePageSpanLeader *recycle;
+ bool dumped_any_freelist = false;
+ Size f;
+
+ /* Initialize output buffer. */
+ initStringInfo(&buf);
+
+ /* Dump general stuff. */
+ appendStringInfo(&buf, "metadata: self %zu max contiguous pages = %zu\n",
+ fpm->self.relptr_off, fpm->contiguous_pages);
+
+ /* Dump btree. */
+ if (fpm->btree_depth > 0)
+ {
+ FreePageBtree *root;
+
+ appendStringInfo(&buf, "btree depth %u:\n", fpm->btree_depth);
+ root = relptr_access(base, fpm->btree_root);
+ FreePageManagerDumpBtree(fpm, root, NULL, 0, &buf);
+ }
+ else if (fpm->singleton_npages > 0)
+ {
+ appendStringInfo(&buf, "singleton: %zu(%zu)\n",
+ fpm->singleton_first_page, fpm->singleton_npages);
+ }
+
+ /* Dump btree recycle list. */
+ recycle = relptr_access(base, fpm->btree_recycle);
+ if (recycle != NULL)
+ {
+ appendStringInfo(&buf, "btree recycle:");
+ FreePageManagerDumpSpans(fpm, recycle, 1, &buf);
+ }
+
+ /* Dump free lists. */
+ for (f = 0; f < FPM_NUM_FREELISTS; ++f)
+ {
+ FreePageSpanLeader *span;
+
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+ if (!dumped_any_freelist)
+ {
+ appendStringInfo(&buf, "freelists:\n");
+ dumped_any_freelist = true;
+ }
+ appendStringInfo(&buf, " %zu:", f + 1);
+ span = relptr_access(base, fpm->freelist[f]);
+ FreePageManagerDumpSpans(fpm, span, f + 1, &buf);
+ }
+
+ /* And return result to caller. */
+ return buf.data;
+}
+
+
+/*
+ * The first_page value stored at index zero in any non-root page must match
+ * the first_page value stored in its parent at the index which points to that
+ * page. So when the value stored at index zero in a btree page changes, we've
+ * got to walk up the tree adjusting ancestor keys until we reach an ancestor
+ * where that key isn't index zero. This function should be called after
+ * updating the first key on the target page; it will propagate the change
+ * upward as far as needed.
+ *
+ * We assume here that the first key on the page has not changed enough to
+ * require changes in the ordering of keys on its ancestor pages. Thus,
+ * if we search the parent page for the first key greater than or equal to
+ * the first key on the current page, the downlink to this page will be either
+ * the exact index returned by the search (if the first key decreased)
+ * or one less (if the first key increased).
+ */
+static void
+FreePageBtreeAdjustAncestorKeys(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ Size first_page;
+ FreePageBtree *parent;
+ FreePageBtree *child;
+
+ /* This might be either a leaf or an internal page. */
+ Assert(btp->hdr.nused > 0);
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ first_page = btp->u.leaf_key[0].first_page;
+ }
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ first_page = btp->u.internal_key[0].first_page;
+ }
+ child = btp;
+
+ /* Loop until we find an ancestor that does not require adjustment. */
+ for (;;)
+ {
+ Size s;
+
+ parent = relptr_access(base, child->hdr.parent);
+ if (parent == NULL)
+ break;
+ s = FreePageBtreeSearchInternal(parent, first_page);
+
+ /* Key is either at index s or index s-1; figure out which. */
+ if (s >= parent->hdr.nused)
+ {
+ Assert(s == parent->hdr.nused);
+ --s;
+ }
+ else
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ if (check != child)
+ {
+ Assert(s > 0);
+ --s;
+ }
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* Debugging double-check. */
+ {
+ FreePageBtree *check;
+
+ check = relptr_access(base, parent->u.internal_key[s].child);
+ Assert(s < parent->hdr.nused);
+ Assert(child == check);
+ }
+#endif
+
+ /* Update the parent key. */
+ parent->u.internal_key[s].first_page = first_page;
+
+ /*
+ * If this is the first key in the parent, go up another level; else
+ * done.
+ */
+ if (s > 0)
+ break;
+ child = parent;
+ }
+}
+
+/*
+ * Attempt to reclaim space from the free-page btree. The return value is
+ * the largest range of contiguous pages created by the cleanup operation.
+ */
+static Size
+FreePageBtreeCleanup(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ Size max_contiguous_pages = 0;
+
+ /* Attempt to shrink the depth of the btree. */
+ while (!relptr_is_null(fpm->btree_root))
+ {
+ FreePageBtree *root = relptr_access(base, fpm->btree_root);
+
+ /* If the root contains only one key, reduce depth by one. */
+ if (root->hdr.nused == 1)
+ {
+ /* Shrink depth of tree by one. */
+ Assert(fpm->btree_depth > 0);
+ --fpm->btree_depth;
+ if (root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ /* If root is a leaf, convert only entry to singleton range. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages;
+ }
+ else
+ {
+ FreePageBtree *newroot;
+
+ /* If root is an internal page, make only child the root. */
+ Assert(root->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ relptr_copy(fpm->btree_root, root->u.internal_key[0].child);
+ newroot = relptr_access(base, fpm->btree_root);
+ relptr_store(base, newroot->hdr.parent, (FreePageBtree *) NULL);
+ }
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, root));
+ }
+ else if (root->hdr.nused == 2 &&
+ root->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ Size end_of_first;
+ Size start_of_second;
+
+ end_of_first = root->u.leaf_key[0].first_page +
+ root->u.leaf_key[0].npages;
+ start_of_second = root->u.leaf_key[1].first_page;
+
+ if (end_of_first + 1 == start_of_second)
+ {
+ Size root_page = fpm_pointer_to_page(base, root);
+
+ if (end_of_first == root_page)
+ {
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[0].first_page);
+ FreePagePopSpanLeader(fpm, root->u.leaf_key[1].first_page);
+ fpm->singleton_first_page = root->u.leaf_key[0].first_page;
+ fpm->singleton_npages = root->u.leaf_key[0].npages +
+ root->u.leaf_key[1].npages + 1;
+ fpm->btree_depth = 0;
+ relptr_store(base, fpm->btree_root,
+ (FreePageBtree *) NULL);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ Assert(max_contiguous_pages == 0);
+ max_contiguous_pages = fpm->singleton_npages;
+ }
+ }
+
+ /* Whether it worked or not, it's time to stop. */
+ break;
+ }
+ else
+ {
+ /* Nothing more to do. Stop. */
+ break;
+ }
+ }
+
+ /*
+ * Attempt to free recycled btree pages. We skip this if releasing the
+ * recycled page would require a btree page split, because the page we're
+ * trying to recycle would be consumed by the split, which would be
+ * counterproductive.
+ *
+ * We also currently only ever attempt to recycle the first page on the
+ * list; that could be made more aggressive, but it's not clear that the
+ * complexity would be worthwhile.
+ */
+ while (fpm->btree_recycle_count > 0)
+ {
+ FreePageBtree *btp;
+ Size first_page;
+ Size contiguous_pages;
+
+ btp = FreePageBtreeGetRecycled(fpm);
+ first_page = fpm_pointer_to_page(base, btp);
+ contiguous_pages = FreePageManagerPutInternal(fpm, first_page, 1, true);
+ if (contiguous_pages == 0)
+ {
+ FreePageBtreeRecycle(fpm, first_page);
+ break;
+ }
+ else
+ {
+ if (contiguous_pages > max_contiguous_pages)
+ max_contiguous_pages = contiguous_pages;
+ }
+ }
+
+ return max_contiguous_pages;
+}
+
+/*
+ * Consider consolidating the given page with its left or right sibling,
+ * if it's fairly empty.
+ */
+static void
+FreePageBtreeConsolidate(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *np;
+ Size max;
+
+ /*
+ * We only try to consolidate pages that are less than a third full. We
+ * could be more aggressive about this, but that might risk performing
+ * consolidation only to end up splitting again shortly thereafter. Since
+ * the btree should be very small compared to the space under management,
+ * our goal isn't so much to ensure that it always occupies the absolutely
+ * smallest possible number of pages as to reclaim pages before things get
+ * too egregiously out of hand.
+ */
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ max = FPM_ITEMS_PER_LEAF_PAGE;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ max = FPM_ITEMS_PER_INTERNAL_PAGE;
+ }
+ if (btp->hdr.nused >= max / 3)
+ return;
+
+ /*
+ * If we can fit our right sibling's keys onto this page, consolidate.
+ */
+ np = FreePageBtreeFindRightSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&btp->u.leaf_key[btp->hdr.nused], &np->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ }
+ else
+ {
+ memcpy(&btp->u.internal_key[btp->hdr.nused], &np->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * np->hdr.nused);
+ btp->hdr.nused += np->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, btp);
+ }
+ FreePageBtreeRemovePage(fpm, np);
+ return;
+ }
+
+ /*
+ * If we can fit our keys onto our left sibling's page, consolidate. In
+ * this case, we move our keys onto the other page rather than vice
+ * versa, to avoid having to adjust ancestor keys.
+ */
+ np = FreePageBtreeFindLeftSibling(base, btp);
+ if (np != NULL && btp->hdr.nused + np->hdr.nused <= max)
+ {
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ memcpy(&np->u.leaf_key[np->hdr.nused], &btp->u.leaf_key[0],
+ sizeof(FreePageBtreeLeafKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ }
+ else
+ {
+ memcpy(&np->u.internal_key[np->hdr.nused], &btp->u.internal_key[0],
+ sizeof(FreePageBtreeInternalKey) * btp->hdr.nused);
+ np->hdr.nused += btp->hdr.nused;
+ FreePageBtreeUpdateParentPointers(base, np);
+ }
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+}
+
+/*
+ * Find the passed page's left sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately precedes ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindLeftSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move left. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the leftmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index > 0)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index - 1].child);
+ break;
+ }
+ Assert(index == 0);
+ ++levels;
+ }
+
+ /* Descend right. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[p->hdr.nused - 1].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Find the passed page's right sibling; that is, the page at the same level
+ * of the tree whose keyspace immediately follows ours.
+ */
+static FreePageBtree *
+FreePageBtreeFindRightSibling(char *base, FreePageBtree *btp)
+{
+ FreePageBtree *p = btp;
+ int levels = 0;
+
+ /* Move up until we can move right. */
+ for (;;)
+ {
+ Size first_page;
+ Size index;
+
+ first_page = FreePageBtreeFirstKey(p);
+ p = relptr_access(base, p->hdr.parent);
+
+ if (p == NULL)
+ return NULL; /* we were passed the rightmost page */
+
+ index = FreePageBtreeSearchInternal(p, first_page);
+ if (index < p->hdr.nused - 1)
+ {
+ Assert(p->u.internal_key[index].first_page == first_page);
+ p = relptr_access(base, p->u.internal_key[index + 1].child);
+ break;
+ }
+ Assert(index == p->hdr.nused - 1);
+ ++levels;
+ }
+
+ /* Descend left. */
+ while (levels > 0)
+ {
+ Assert(p->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ p = relptr_access(base, p->u.internal_key[0].child);
+ --levels;
+ }
+ Assert(p->hdr.magic == btp->hdr.magic);
+
+ return p;
+}
+
+/*
+ * Get the first key on a btree page.
+ */
+static Size
+FreePageBtreeFirstKey(FreePageBtree *btp)
+{
+ Assert(btp->hdr.nused > 0);
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ return btp->u.leaf_key[0].first_page;
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ return btp->u.internal_key[0].first_page;
+ }
+}
+
+/*
+ * Get a page from the btree recycle list for use as a btree page.
+ */
+static FreePageBtree *
+FreePageBtreeGetRecycled(FreePageManager *fpm)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *newhead;
+
+ Assert(victim != NULL);
+ newhead = relptr_access(base, victim->next);
+ if (newhead != NULL)
+ relptr_copy(newhead->prev, victim->prev);
+ relptr_store(base, fpm->btree_recycle, newhead);
+ Assert(fpm_pointer_is_page_aligned(base, victim));
+ fpm->btree_recycle_count--;
+ return (FreePageBtree *) victim;
+}
+
+/*
+ * Insert an item into an internal page.
+ */
+static void
+FreePageBtreeInsertInternal(char *base, FreePageBtree *btp, Size index,
+ Size first_page, FreePageBtree *child)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.internal_key[index + 1], &btp->u.internal_key[index],
+ sizeof(FreePageBtreeInternalKey) * (btp->hdr.nused - index));
+ btp->u.internal_key[index].first_page = first_page;
+ relptr_store(base, btp->u.internal_key[index].child, child);
+ ++btp->hdr.nused;
+}
+
+/*
+ * Insert an item into a leaf page.
+ */
+static void
+FreePageBtreeInsertLeaf(FreePageBtree *btp, Size index, Size first_page,
+ Size npages)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(btp->hdr.nused <= FPM_ITEMS_PER_LEAF_PAGE);
+ Assert(index <= btp->hdr.nused);
+ memmove(&btp->u.leaf_key[index + 1], &btp->u.leaf_key[index],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+ btp->u.leaf_key[index].first_page = first_page;
+ btp->u.leaf_key[index].npages = npages;
+ ++btp->hdr.nused;
+}
+
+/*
+ * Put a page on the btree recycle list.
+ */
+static void
+FreePageBtreeRecycle(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *head = relptr_access(base, fpm->btree_recycle);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = 1;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->btree_recycle, span);
+ fpm->btree_recycle_count++;
+}
+
+/*
+ * Remove an item from the btree at the given position on the given page.
+ */
+static void
+FreePageBtreeRemove(FreePageManager *fpm, FreePageBtree *btp, Size index)
+{
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(index < btp->hdr.nused);
+
+ /* When last item is removed, extirpate entire page from btree. */
+ if (btp->hdr.nused == 1)
+ {
+ FreePageBtreeRemovePage(fpm, btp);
+ return;
+ }
+
+ /* Physically remove the key from the page. */
+ --btp->hdr.nused;
+ if (index < btp->hdr.nused)
+ memmove(&btp->u.leaf_key[index], &btp->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey) * (btp->hdr.nused - index));
+
+ /* If we just removed the first key, adjust ancestor keys. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, btp);
+
+ /* Consider whether to consolidate this page with a sibling. */
+ FreePageBtreeConsolidate(fpm, btp);
+}
+
+/*
+ * Remove a page from the btree. Caller is responsible for having relocated
+ * any keys from this page that are still wanted. The page is placed on the
+ * recycle list.
+ */
+static void
+FreePageBtreeRemovePage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *parent;
+ Size index;
+ Size first_page;
+
+ for (;;)
+ {
+ /* Find parent page. */
+ parent = relptr_access(base, btp->hdr.parent);
+ if (parent == NULL)
+ {
+ /* We are removing the root page. */
+ relptr_store(base, fpm->btree_root, (FreePageBtree *) NULL);
+ fpm->btree_depth = 0;
+ Assert(fpm->singleton_first_page == 0);
+ Assert(fpm->singleton_npages == 0);
+ return;
+ }
+
+ /*
+ * If the parent contains only one item, we need to remove it as well.
+ */
+ if (parent->hdr.nused > 1)
+ break;
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+ btp = parent;
+ }
+
+ /* Find and remove the downlink. */
+ first_page = FreePageBtreeFirstKey(btp);
+ if (parent->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ {
+ index = FreePageBtreeSearchLeaf(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.leaf_key[index],
+ &parent->u.leaf_key[index + 1],
+ sizeof(FreePageBtreeLeafKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ else
+ {
+ index = FreePageBtreeSearchInternal(parent, first_page);
+ Assert(index < parent->hdr.nused);
+ if (index < parent->hdr.nused - 1)
+ memmove(&parent->u.internal_key[index],
+ &parent->u.internal_key[index + 1],
+ sizeof(FreePageBtreeInternalKey)
+ * (parent->hdr.nused - index - 1));
+ }
+ parent->hdr.nused--;
+ Assert(parent->hdr.nused > 0);
+
+ /* Recycle the page. */
+ FreePageBtreeRecycle(fpm, fpm_pointer_to_page(base, btp));
+
+ /* Adjust ancestor keys if needed. */
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+
+ /* Consider whether to consolidate the parent with a sibling. */
+ FreePageBtreeConsolidate(fpm, parent);
+}
+
+/*
+ * Search the btree for an entry for the given first page and initialize
+ * *result with the results of the search. result->page and result->index
+ * indicate either the position of an exact match or the position at which
+ * the new key should be inserted. result->found is true for an exact match,
+ * otherwise false. result->split_pages will contain the number of additional
+ * btree pages that will be needed when performing a split to insert a key.
+ * Except as described above, the contents of fields in the result object are
+ * undefined on return.
+ */
+static void
+FreePageBtreeSearch(FreePageManager *fpm, Size first_page,
+ FreePageBtreeSearchResult *result)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtree *btp = relptr_access(base, fpm->btree_root);
+ Size index;
+
+ result->split_pages = 1;
+
+ /* If the btree is empty, there's nothing to find. */
+ if (btp == NULL)
+ {
+ result->page = NULL;
+ result->found = false;
+ return;
+ }
+
+ /* Descend until we hit a leaf. */
+ while (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ FreePageBtree *child;
+ bool found_exact;
+
+ index = FreePageBtreeSearchInternal(btp, first_page);
+ found_exact = index < btp->hdr.nused &&
+ btp->u.internal_key[index].first_page == first_page;
+
+ /*
+ * If we found an exact match we descend directly. Otherwise, we
+ * descend into the child to the left if possible so that we can find
+ * the insertion point at that child's high end.
+ */
+ if (!found_exact && index > 0)
+ --index;
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_INTERNAL_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Descend to appropriate child page. */
+ Assert(index < btp->hdr.nused);
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ Assert(relptr_access(base, child->hdr.parent) == btp);
+ btp = child;
+ }
+
+ /* Track required split depth for leaf insert. */
+ if (btp->hdr.nused >= FPM_ITEMS_PER_LEAF_PAGE)
+ {
+ Assert(btp->hdr.nused == FPM_ITEMS_PER_LEAF_PAGE);
+ result->split_pages++;
+ }
+ else
+ result->split_pages = 0;
+
+ /* Search leaf page. */
+ index = FreePageBtreeSearchLeaf(btp, first_page);
+
+ /* Assemble results. */
+ result->page = btp;
+ result->index = index;
+ result->found = index < btp->hdr.nused &&
+ first_page == btp->u.leaf_key[index].first_page;
+}
+
+/*
+ * Search an internal page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if none.
+ */
+static Size
+FreePageBtreeSearchInternal(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_INTERNAL_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.internal_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Search a leaf page for the first key greater than or equal to a given
+ * page number. Returns the index of that key, or the number of keys on the
+ * page (one past the last valid index) if none.
+ */
+static Size
+FreePageBtreeSearchLeaf(FreePageBtree *btp, Size first_page)
+{
+ Size low = 0;
+ Size high = btp->hdr.nused;
+
+ Assert(btp->hdr.magic == FREE_PAGE_LEAF_MAGIC);
+ Assert(high > 0 && high <= FPM_ITEMS_PER_LEAF_PAGE);
+
+ while (low < high)
+ {
+ Size mid = (low + high) / 2;
+ Size val = btp->u.leaf_key[mid].first_page;
+
+ if (first_page == val)
+ return mid;
+ else if (first_page < val)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+
+ return low;
+}
+
+/*
+ * Allocate a new btree page and move half the keys from the provided page
+ * to the new page. Caller is responsible for making sure that there's a
+ * page available from fpm->btree_recycle. Returns a pointer to the new page,
+ * to which caller must add a downlink.
+ */
+static FreePageBtree *
+FreePageBtreeSplitPage(FreePageManager *fpm, FreePageBtree *btp)
+{
+ FreePageBtree *newsibling;
+
+ newsibling = FreePageBtreeGetRecycled(fpm);
+ newsibling->hdr.magic = btp->hdr.magic;
+ newsibling->hdr.nused = btp->hdr.nused / 2;
+ relptr_copy(newsibling->hdr.parent, btp->hdr.parent);
+ btp->hdr.nused -= newsibling->hdr.nused;
+
+ if (btp->hdr.magic == FREE_PAGE_LEAF_MAGIC)
+ memcpy(&newsibling->u.leaf_key,
+ &btp->u.leaf_key[btp->hdr.nused],
+ sizeof(FreePageBtreeLeafKey) * newsibling->hdr.nused);
+ else
+ {
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ memcpy(&newsibling->u.internal_key,
+ &btp->u.internal_key[btp->hdr.nused],
+ sizeof(FreePageBtreeInternalKey) * newsibling->hdr.nused);
+ FreePageBtreeUpdateParentPointers(fpm_segment_base(fpm), newsibling);
+ }
+
+ return newsibling;
+}
+
+/*
+ * When internal pages are split or merged, the parent pointers of their
+ * children must be updated.
+ */
+static void
+FreePageBtreeUpdateParentPointers(char *base, FreePageBtree *btp)
+{
+ Size i;
+
+ Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
+ for (i = 0; i < btp->hdr.nused; ++i)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[i].child);
+ relptr_store(base, child->hdr.parent, btp);
+ }
+}
+
+/*
+ * Debugging dump of btree data.
+ */
+static void
+FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
+ FreePageBtree *parent, int level, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+ Size pageno = fpm_pointer_to_page(base, btp);
+ Size index;
+ FreePageBtree *check_parent;
+
+ check_stack_depth();
+ check_parent = relptr_access(base, btp->hdr.parent);
+ appendStringInfo(buf, " %zu@%d %c", pageno, level,
+ btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC ? 'i' : 'l');
+ if (parent != check_parent)
+ appendStringInfo(buf, " [actual parent %zu, expected %zu]",
+ fpm_pointer_to_page(base, check_parent),
+ fpm_pointer_to_page(base, parent));
+ appendStringInfoChar(buf, ':');
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ appendStringInfo(buf, " %zu->%zu",
+ btp->u.internal_key[index].first_page,
+ btp->u.internal_key[index].child.relptr_off / FPM_PAGE_SIZE);
+ else
+ appendStringInfo(buf, " %zu(%zu)",
+ btp->u.leaf_key[index].first_page,
+ btp->u.leaf_key[index].npages);
+ }
+ appendStringInfo(buf, "\n");
+
+ if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
+ {
+ for (index = 0; index < btp->hdr.nused; ++index)
+ {
+ FreePageBtree *child;
+
+ child = relptr_access(base, btp->u.internal_key[index].child);
+ FreePageManagerDumpBtree(fpm, child, btp, level + 1, buf);
+ }
+ }
+}
+
+/*
+ * Debugging dump of free-span data.
+ */
+static void
+FreePageManagerDumpSpans(FreePageManager *fpm, FreePageSpanLeader *span,
+ Size expected_pages, StringInfo buf)
+{
+ char *base = fpm_segment_base(fpm);
+
+ while (span != NULL)
+ {
+ if (span->npages != expected_pages)
+ appendStringInfo(buf, " %zu(%zu)", fpm_pointer_to_page(base, span),
+ span->npages);
+ else
+ appendStringInfo(buf, " %zu", fpm_pointer_to_page(base, span));
+ span = relptr_access(base, span->next);
+ }
+
+ appendStringInfo(buf, "\n");
+}
+
+/*
+ * This function allocates a run of pages of the given length from the free
+ * page manager.
+ */
+static bool
+FreePageManagerGetInternal(FreePageManager *fpm, Size npages, Size *first_page)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *victim = NULL;
+ FreePageSpanLeader *prev;
+ FreePageSpanLeader *next;
+ FreePageBtreeSearchResult result;
+ Size victim_page = 0; /* placate compiler */
+ Size f;
+
+ /*
+ * Search for a free span.
+ *
+ * Right now, we use a simple best-fit policy here, but it's possible for
+ * this to result in memory fragmentation if we're repeatedly asked to
+ * allocate chunks just a little smaller than what we have available.
+ * Hopefully, this is unlikely, because we expect most requests to be
+ * single pages or superblock-sized chunks -- but no policy can be optimal
+ * under all circumstances unless it has knowledge of future allocation
+ * patterns.
+ */
+ for (f = Min(npages, FPM_NUM_FREELISTS) - 1; f < FPM_NUM_FREELISTS; ++f)
+ {
+ /* Skip empty freelists. */
+ if (relptr_is_null(fpm->freelist[f]))
+ continue;
+
+ /*
+ * All of the freelists except the last one contain only items of a
+ * single size, so we just take the first one. But the final free
+ * list contains everything too big for any of the other lists, so we
+ * need to search the list.
+ */
+ if (f < FPM_NUM_FREELISTS - 1)
+ victim = relptr_access(base, fpm->freelist[f]);
+ else
+ {
+ FreePageSpanLeader *candidate;
+
+ candidate = relptr_access(base, fpm->freelist[f]);
+ do
+ {
+ if (candidate->npages >= npages && (victim == NULL ||
+ victim->npages > candidate->npages))
+ {
+ victim = candidate;
+ if (victim->npages == npages)
+ break;
+ }
+ candidate = relptr_access(base, candidate->next);
+ } while (candidate != NULL);
+ }
+ break;
+ }
+
+ /* If we didn't find an allocatable span, return failure. */
+ if (victim == NULL)
+ return false;
+
+ /* Remove span from free list. */
+ Assert(victim->magic == FREE_PAGE_SPAN_LEADER_MAGIC);
+ prev = relptr_access(base, victim->prev);
+ next = relptr_access(base, victim->next);
+ if (prev != NULL)
+ relptr_copy(prev->next, victim->next);
+ else
+ relptr_copy(fpm->freelist[f], victim->next);
+ if (next != NULL)
+ relptr_copy(next->prev, victim->prev);
+ victim_page = fpm_pointer_to_page(base, victim);
+
+ /* Decide whether we might be invalidating contiguous_pages. */
+ if (f == FPM_NUM_FREELISTS - 1 &&
+ victim->npages == fpm->contiguous_pages)
+ {
+ /*
+ * The victim span came from the oversized freelist, and had the same
+ * size as the longest span. There may or may not be another one of
+ * the same size, so contiguous_pages must be recomputed just to be
+ * safe.
+ */
+ fpm->contiguous_pages_dirty = true;
+ }
+ else if (f + 1 == fpm->contiguous_pages &&
+ relptr_is_null(fpm->freelist[f]))
+ {
+ /*
+ * The victim span came from a fixed-size freelist, and it was the
+ * list for spans of the same size as the current longest span, and
+ * the list is now empty after removing the victim. So
+ * contiguous_pages must be recomputed without a doubt.
+ */
+ fpm->contiguous_pages_dirty = true;
+ }
+
+ /*
+ * If we haven't initialized the btree yet, the victim must be the single
+ * span stored within the FreePageManager itself. Otherwise, we need to
+ * update the btree.
+ */
+ if (relptr_is_null(fpm->btree_root))
+ {
+ Assert(victim_page == fpm->singleton_first_page);
+ Assert(victim->npages == fpm->singleton_npages);
+ Assert(victim->npages >= npages);
+ fpm->singleton_first_page += npages;
+ fpm->singleton_npages -= npages;
+ if (fpm->singleton_npages > 0)
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ }
+ else
+ {
+ /*
+ * If the span we found is exactly the right size, remove it from the
+ * btree completely. Otherwise, adjust the btree entry to reflect the
+ * still-unallocated portion of the span, and put that portion on the
+ * appropriate free list.
+ */
+ FreePageBtreeSearch(fpm, victim_page, &result);
+ Assert(result.found);
+ if (victim->npages == npages)
+ FreePageBtreeRemove(fpm, result.page, result.index);
+ else
+ {
+ FreePageBtreeLeafKey *key;
+
+ /* Adjust btree to reflect remaining pages. */
+ Assert(victim->npages > npages);
+ key = &result.page->u.leaf_key[result.index];
+ Assert(key->npages == victim->npages);
+ key->first_page += npages;
+ key->npages -= npages;
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put the unallocated pages back on the appropriate free list. */
+ FreePagePushSpanLeader(fpm, victim_page + npages,
+ victim->npages - npages);
+ }
+ }
+
+ /* Return results to caller. */
+ *first_page = fpm_pointer_to_page(base, victim);
+ return true;
+}
+
+/*
+ * Put a range of pages into the btree and freelists, consolidating it with
+ * existing free spans just before and/or after it. If 'soft' is true,
+ * only perform the insertion if it can be done without allocating new btree
+ * pages; if false, do it always. Returns 0 if the soft flag caused the
+ * insertion to be skipped, or otherwise the size of the contiguous span
+ * created by the insertion. This may be larger than npages if we're able
+ * to consolidate with an adjacent range. fpm->contiguous_pages_dirty is set
+ * if the btree allocated pages for internal purposes, which might invalidate
+ * the current largest run, requiring it to be recomputed.
+ */
+static Size
+FreePageManagerPutInternal(FreePageManager *fpm, Size first_page, Size npages,
+ bool soft)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageBtreeSearchResult result;
+ FreePageBtreeLeafKey *prevkey = NULL;
+ FreePageBtreeLeafKey *nextkey = NULL;
+ FreePageBtree *np;
+ Size nindex;
+
+ Assert(npages > 0);
+
+ /* We can store a single free span without initializing the btree. */
+ if (fpm->btree_depth == 0)
+ {
+ if (fpm->singleton_npages == 0)
+ {
+ /* Don't have a span yet; store this one. */
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return fpm->singleton_npages;
+ }
+ else if (fpm->singleton_first_page + fpm->singleton_npages ==
+ first_page)
+ {
+ /* New span immediately follows sole existing span. */
+ fpm->singleton_npages += npages;
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else if (first_page + npages == fpm->singleton_first_page)
+ {
+ /* New span immediately precedes sole existing span. */
+ FreePagePopSpanLeader(fpm, fpm->singleton_first_page);
+ fpm->singleton_first_page = first_page;
+ fpm->singleton_npages += npages;
+ FreePagePushSpanLeader(fpm, fpm->singleton_first_page,
+ fpm->singleton_npages);
+ return fpm->singleton_npages;
+ }
+ else
+ {
+ /* Not contiguous; we need to initialize the btree. */
+ Size root_page;
+ FreePageBtree *root;
+
+ if (!relptr_is_null(fpm->btree_recycle))
+ root = FreePageBtreeGetRecycled(fpm);
+ else if (FreePageManagerGetInternal(fpm, 1, &root_page))
+ root = (FreePageBtree *) fpm_page_to_pointer(base, root_page);
+ else
+ {
+ /* We'd better be able to get a page from the existing range. */
+ elog(FATAL, "free page manager btree is corrupt");
+ }
+
+ /* Create the btree and move the preexisting range into it. */
+ root->hdr.magic = FREE_PAGE_LEAF_MAGIC;
+ root->hdr.nused = 1;
+ relptr_store(base, root->hdr.parent, (FreePageBtree *) NULL);
+ root->u.leaf_key[0].first_page = fpm->singleton_first_page;
+ root->u.leaf_key[0].npages = fpm->singleton_npages;
+ relptr_store(base, fpm->btree_root, root);
+ fpm->singleton_first_page = 0;
+ fpm->singleton_npages = 0;
+ fpm->btree_depth = 1;
+
+ /*
+ * Corner case: it may be that the btree root took the very last
+ * free page. In that case, the sole btree entry covers a zero
+ * page run, which is invalid. Overwrite it with the entry we're
+ * trying to insert and get out.
+ */
+ if (root->u.leaf_key[0].npages == 0)
+ {
+ root->u.leaf_key[0].first_page = first_page;
+ root->u.leaf_key[0].npages = npages;
+ FreePagePushSpanLeader(fpm, first_page, npages);
+ return npages;
+ }
+
+ /* Fall through to insert the new key. */
+ }
+ }
+
+ /* Search the btree. */
+ FreePageBtreeSearch(fpm, first_page, &result);
+ Assert(!result.found);
+ if (result.index > 0)
+ prevkey = &result.page->u.leaf_key[result.index - 1];
+ if (result.index < result.page->hdr.nused)
+ {
+ np = result.page;
+ nindex = result.index;
+ nextkey = &result.page->u.leaf_key[result.index];
+ }
+ else
+ {
+ np = FreePageBtreeFindRightSibling(base, result.page);
+ nindex = 0;
+ if (np != NULL)
+ nextkey = &np->u.leaf_key[0];
+ }
+
+ /* Consolidate with the previous entry if possible. */
+ if (prevkey != NULL && prevkey->first_page + prevkey->npages >= first_page)
+ {
+ bool remove_next = false;
+ Size result;
+
+ Assert(prevkey->first_page + prevkey->npages == first_page);
+ prevkey->npages = (first_page - prevkey->first_page) + npages;
+
+ /* Check whether we can *also* consolidate with the following entry. */
+ if (nextkey != NULL &&
+ prevkey->first_page + prevkey->npages >= nextkey->first_page)
+ {
+ Assert(prevkey->first_page + prevkey->npages ==
+ nextkey->first_page);
+ prevkey->npages = (nextkey->first_page - prevkey->first_page)
+ + nextkey->npages;
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ remove_next = true;
+ }
+
+ /* Put the span on the correct freelist and save size. */
+ FreePagePopSpanLeader(fpm, prevkey->first_page);
+ FreePagePushSpanLeader(fpm, prevkey->first_page, prevkey->npages);
+ result = prevkey->npages;
+
+ /*
+ * If we consolidated with both the preceding and following entries,
+ * we must remove the following entry. We do this last, because
+ * removing an element from the btree may invalidate pointers we hold
+ * into the current data structure.
+ *
+ * NB: The btree is technically in an invalid state at this point
+ * because we've already updated prevkey to cover the same key space
+ * as nextkey. FreePageBtreeRemove() shouldn't notice that, though.
+ */
+ if (remove_next)
+ FreePageBtreeRemove(fpm, np, nindex);
+
+ return result;
+ }
+
+ /* Consolidate with the next entry if possible. */
+ if (nextkey != NULL && first_page + npages >= nextkey->first_page)
+ {
+ Size newpages;
+
+ /* Compute new size for span. */
+ Assert(first_page + npages == nextkey->first_page);
+ newpages = (nextkey->first_page - first_page) + nextkey->npages;
+
+ /* Put span on correct free list. */
+ FreePagePopSpanLeader(fpm, nextkey->first_page);
+ FreePagePushSpanLeader(fpm, first_page, newpages);
+
+ /* Update key in place. */
+ nextkey->first_page = first_page;
+ nextkey->npages = newpages;
+
+ /* If reducing first key on page, ancestors might need adjustment. */
+ if (nindex == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, np);
+
+ return nextkey->npages;
+ }
+
+ /* Split leaf page and as many of its ancestors as necessary. */
+ if (result.split_pages > 0)
+ {
+ /*
+ * NB: We could consider various coping strategies here to avoid a
+ * split; most obviously, if np != result.page, we could target that
+ * page instead. More complicated shuffling strategies could be
+ * possible as well; basically, unless every single leaf page is 100%
+ * full, we can jam this key in there if we try hard enough. It's
+ * unlikely that trying that hard is worthwhile, but it's possible we
+ * might need to make more than no effort. For now, we just do the
+ * easy thing, which is nothing.
+ */
+
+ /* If this is a soft insert, it's time to give up. */
+ if (soft)
+ return 0;
+
+ /* Check whether we need to allocate more btree pages to split. */
+ if (result.split_pages > fpm->btree_recycle_count)
+ {
+ Size pages_needed;
+ Size recycle_page;
+ Size i;
+
+ /*
+ * Allocate the required number of pages and split each one in
+ * turn. This should never fail, because if we've got enough
+ * spans of free pages kicking around that we need additional
+ * storage space just to remember them all, then we should
+ * certainly have enough to expand the btree, which should only
+ * ever use a tiny number of pages compared to the number under
+ * management. If it does, something's badly screwed up.
+ */
+ pages_needed = result.split_pages - fpm->btree_recycle_count;
+ for (i = 0; i < pages_needed; ++i)
+ {
+ if (!FreePageManagerGetInternal(fpm, 1, &recycle_page))
+ elog(FATAL, "free page manager btree is corrupt");
+ FreePageBtreeRecycle(fpm, recycle_page);
+ }
+
+ /*
+ * The act of allocating pages to recycle may have invalidated the
+ * results of our previous btree search, so repeat it. (We could
+ * recheck whether any of our split-avoidance strategies that were
+ * not viable before now are, but it hardly seems worthwhile, so
+ * we don't bother. Consolidation can't be possible now if it
+ * wasn't previously.)
+ */
+ FreePageBtreeSearch(fpm, first_page, &result);
+
+ /*
+ * The act of allocating pages for use in constructing our btree
+ * should never cause any page to become more full, so the new
+ * split depth should be no greater than the old one, and perhaps
+ * less if we fortuitously allocated a chunk that freed up a slot
+ * on the page we need to update.
+ */
+ Assert(result.split_pages <= fpm->btree_recycle_count);
+ }
+
+ /* If we still need to perform a split, do it. */
+ if (result.split_pages > 0)
+ {
+ FreePageBtree *split_target = result.page;
+ FreePageBtree *child = NULL;
+ Size key = first_page;
+
+ for (;;)
+ {
+ FreePageBtree *newsibling;
+ FreePageBtree *parent;
+
+ /* Identify parent page, which must receive downlink. */
+ parent = relptr_access(base, split_target->hdr.parent);
+
+ /* Split the page - downlink not added yet. */
+ newsibling = FreePageBtreeSplitPage(fpm, split_target);
+
+ /*
+ * At this point in the loop, we're always carrying a pending
+ * insertion. On the first pass, it's the actual key we're
+ * trying to insert; on subsequent passes, it's the downlink
+ * that needs to be added as a result of the split performed
+ * during the previous loop iteration. Since we've just split
+ * the page, there's definitely room on one of the two
+ * resulting pages.
+ */
+ if (child == NULL)
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into = key < newsibling->u.leaf_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchLeaf(insert_into, key);
+ FreePageBtreeInsertLeaf(insert_into, index, key, npages);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+ else
+ {
+ Size index;
+ FreePageBtree *insert_into;
+
+ insert_into =
+ key < newsibling->u.internal_key[0].first_page ?
+ split_target : newsibling;
+ index = FreePageBtreeSearchInternal(insert_into, key);
+ FreePageBtreeInsertInternal(base, insert_into, index,
+ key, child);
+ relptr_store(base, child->hdr.parent, insert_into);
+ if (index == 0 && insert_into == split_target)
+ FreePageBtreeAdjustAncestorKeys(fpm, split_target);
+ }
+
+ /* If the page we just split has no parent, split the root. */
+ if (parent == NULL)
+ {
+ FreePageBtree *newroot;
+
+ newroot = FreePageBtreeGetRecycled(fpm);
+ newroot->hdr.magic = FREE_PAGE_INTERNAL_MAGIC;
+ newroot->hdr.nused = 2;
+ relptr_store(base, newroot->hdr.parent,
+ (FreePageBtree *) NULL);
+ newroot->u.internal_key[0].first_page =
+ FreePageBtreeFirstKey(split_target);
+ relptr_store(base, newroot->u.internal_key[0].child,
+ split_target);
+ relptr_store(base, split_target->hdr.parent, newroot);
+ newroot->u.internal_key[1].first_page =
+ FreePageBtreeFirstKey(newsibling);
+ relptr_store(base, newroot->u.internal_key[1].child,
+ newsibling);
+ relptr_store(base, newsibling->hdr.parent, newroot);
+ relptr_store(base, fpm->btree_root, newroot);
+ fpm->btree_depth++;
+
+ break;
+ }
+
+ /* If the parent page isn't full, insert the downlink. */
+ key = newsibling->u.internal_key[0].first_page;
+ if (parent->hdr.nused < FPM_ITEMS_PER_INTERNAL_PAGE)
+ {
+ Size index;
+
+ index = FreePageBtreeSearchInternal(parent, key);
+ FreePageBtreeInsertInternal(base, parent, index,
+ key, newsibling);
+ relptr_store(base, newsibling->hdr.parent, parent);
+ if (index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, parent);
+ break;
+ }
+
+ /* The parent also needs to be split, so loop around. */
+ child = newsibling;
+ split_target = parent;
+ }
+
+ /*
+ * The loop above did the insert, so just need to update the free
+ * list, and we're done.
+ */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+ }
+ }
+
+ /* Physically add the key to the page. */
+ Assert(result.page->hdr.nused < FPM_ITEMS_PER_LEAF_PAGE);
+ FreePageBtreeInsertLeaf(result.page, result.index, first_page, npages);
+
+ /* If new first key on page, ancestors might need adjustment. */
+ if (result.index == 0)
+ FreePageBtreeAdjustAncestorKeys(fpm, result.page);
+
+ /* Put it on the free list. */
+ FreePagePushSpanLeader(fpm, first_page, npages);
+
+ return npages;
+}
+
+/*
+ * Remove a FreePageSpanLeader from the linked-list that contains it, either
+ * because we're changing the size of the span, or because we're allocating it.
+ */
+static void
+FreePagePopSpanLeader(FreePageManager *fpm, Size pageno)
+{
+ char *base = fpm_segment_base(fpm);
+ FreePageSpanLeader *span;
+ FreePageSpanLeader *next;
+ FreePageSpanLeader *prev;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, pageno);
+
+ next = relptr_access(base, span->next);
+ prev = relptr_access(base, span->prev);
+ if (next != NULL)
+ relptr_copy(next->prev, span->prev);
+ if (prev != NULL)
+ relptr_copy(prev->next, span->next);
+ else
+ {
+ Size f = Min(span->npages, FPM_NUM_FREELISTS) - 1;
+
+ Assert(fpm->freelist[f].relptr_off == pageno * FPM_PAGE_SIZE);
+ relptr_copy(fpm->freelist[f], span->next);
+ }
+}
+
+/*
+ * Initialize a new FreePageSpanLeader and put it on the appropriate free list.
+ */
+static void
+FreePagePushSpanLeader(FreePageManager *fpm, Size first_page, Size npages)
+{
+ char *base = fpm_segment_base(fpm);
+ Size f = Min(npages, FPM_NUM_FREELISTS) - 1;
+ FreePageSpanLeader *head = relptr_access(base, fpm->freelist[f]);
+ FreePageSpanLeader *span;
+
+ span = (FreePageSpanLeader *) fpm_page_to_pointer(base, first_page);
+ span->magic = FREE_PAGE_SPAN_LEADER_MAGIC;
+ span->npages = npages;
+ relptr_store(base, span->next, head);
+ relptr_store(base, span->prev, (FreePageSpanLeader *) NULL);
+ if (head != NULL)
+ relptr_store(base, head->prev, span);
+ relptr_store(base, fpm->freelist[f], span);
+}
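Editorial sketch, not part of the patch: the consolidation logic in FreePageManagerPutInternal can be hard to follow through the btree bookkeeping, so the fragment below shows the same three cases (merge with the previous span, merge with the next span, or insert a new entry) on a plain sorted array. The names Span, SpanList, MAX_SPANS and span_put are hypothetical and chosen only for illustration; the real code keeps spans in a btree plus free lists and handles page splits, none of which is modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SPANS 16			/* hypothetical fixed capacity */

typedef struct
{
	size_t		first_page;
	size_t		npages;
} Span;

typedef struct
{
	Span		spans[MAX_SPANS];	/* kept sorted by first_page */
	size_t		nused;
} SpanList;

/*
 * Add a free range, merging it with adjacent free spans where possible.
 * Returns the size of the contiguous span that results.
 */
static size_t
span_put(SpanList *list, size_t first_page, size_t npages)
{
	size_t		i = 0;
	Span	   *prev = NULL;
	Span	   *next = NULL;

	/* Find the first existing span that starts after the new range. */
	while (i < list->nused && list->spans[i].first_page < first_page)
		i++;
	if (i > 0)
		prev = &list->spans[i - 1];
	if (i < list->nused)
		next = &list->spans[i];

	/* Case 1: the new range immediately follows the previous span. */
	if (prev != NULL && prev->first_page + prev->npages == first_page)
	{
		prev->npages += npages;

		/* It may now also touch the next span; if so, absorb and remove it. */
		if (next != NULL && prev->first_page + prev->npages == next->first_page)
		{
			prev->npages += next->npages;
			memmove(next, next + 1, sizeof(Span) * (list->nused - i - 1));
			list->nused--;
		}
		return prev->npages;
	}

	/* Case 2: the new range immediately precedes the next span. */
	if (next != NULL && first_page + npages == next->first_page)
	{
		next->first_page = first_page;
		next->npages += npages;
		return next->npages;
	}

	/* Case 3: no neighbor is adjacent, so insert a new entry. */
	assert(list->nused < MAX_SPANS);
	memmove(&list->spans[i + 1], &list->spans[i],
			sizeof(Span) * (list->nused - i));
	list->spans[i].first_page = first_page;
	list->spans[i].npages = npages;
	list->nused++;
	return npages;
}
```

As in the patch, the return value can exceed npages when the new range joins one or both neighbors, which is what lets a caller track the largest contiguous run.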
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
new file mode 100644
index 0000000..40a38e3
--- /dev/null
+++ b/src/include/utils/dsa.h
@@ -0,0 +1,107 @@
+/*-------------------------------------------------------------------------
+ *
+ * dsa.h
+ * Dynamic shared memory areas.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/utils/dsa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DSA_H
+#define DSA_H
+
+#include "postgres.h"
+
+#include "port/atomics.h"
+#include "storage/dsm.h"
+
+/* The opaque type used for an area. */
+struct dsa_area;
+typedef struct dsa_area dsa_area;
+
+/*
+ * If this system doesn't support atomic operations on 64 bit values then
+ * we fall back to 32 bit dsa_pointer. For testing purposes,
+ * USE_SMALL_DSA_POINTER can be defined to force the use of 32 bit
+ * dsa_pointer even on systems that support 64 bit atomics.
+ */
+#ifndef PG_HAVE_ATOMIC_U64_SUPPORT
+#define SIZEOF_DSA_POINTER 4
+#else
+#ifdef USE_SMALL_DSA_POINTER
+#define SIZEOF_DSA_POINTER 4
+#else
+#define SIZEOF_DSA_POINTER 8
+#endif
+#endif
+
+/*
+ * The type of 'relative pointers' to memory allocated by a dynamic shared
+ * area. dsa_pointer values can be shared with other processes, but must be
+ * converted to backend-local pointers before they can be dereferenced. See
+ * dsa_get_address. Also, an atomic version and appropriately sized atomic
+ * operations.
+ */
+#if DSA_POINTER_SIZEOF == 4
+typedef uint32 dsa_pointer;
+typedef pg_atomic_uint32 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u32
+#define dsa_pointer_atomic_read pg_atomic_read_u32
+#define dsa_pointer_atomic_write pg_atomic_write_u32
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u32
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u32
+#else
+typedef uint64 dsa_pointer;
+typedef pg_atomic_uint64 dsa_pointer_atomic;
+#define dsa_pointer_atomic_init pg_atomic_init_u64
+#define dsa_pointer_atomic_read pg_atomic_read_u64
+#define dsa_pointer_atomic_write pg_atomic_write_u64
+#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u64
+#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u64
+#endif
+
+/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
+#define InvalidDsaPointer ((dsa_pointer) 0)
+
+/* Check if a dsa_pointer value is valid. */
+#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
+
+/*
+ * The type used for dsa_area handles. dsa_handle values can be shared with
+ * other processes, so that they can attach to them. This provides a way to
+ * share allocated storage with other processes.
+ *
+ * The handle for a dsa_area is currently implemented as the dsm_handle
+ * for the first DSM segment backing this dynamic storage area, but client
+ * code shouldn't assume that is true.
+ */
+typedef dsm_handle dsa_handle;
+
+extern void dsa_startup(void);
+
+extern dsa_area *dsa_create(int tranche_id, const char *tranche_name);
+extern dsa_area *dsa_create_in_place(void *place, Size size,
+ int tranche_id, const char *tranche_name,
+ dsm_segment *segment);
+extern dsa_area *dsa_attach(dsa_handle handle);
+extern dsa_area *dsa_attach_in_place(void *place, dsm_segment *segment);
+extern void dsa_release_in_place(void *place);
+extern void dsa_on_dsm_detach_release_in_place(dsm_segment *, Datum);
+extern void dsa_on_shmem_exit_release_in_place(int, Datum);
+extern void dsa_pin_mapping(dsa_area *area);
+extern void dsa_detach(dsa_area *area);
+extern void dsa_pin(dsa_area *area);
+extern void dsa_unpin(dsa_area *area);
+extern void dsa_set_size_limit(dsa_area *area, Size limit);
+extern Size dsa_minimum_size(void);
+extern dsa_handle dsa_get_handle(dsa_area *area);
+extern dsa_pointer dsa_allocate(dsa_area *area, Size size);
+extern void dsa_free(dsa_area *area, dsa_pointer dp);
+extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern void dsa_trim(dsa_area *area);
+extern void dsa_dump(dsa_area *area);
+
+#endif /* DSA_H */
diff --git a/src/include/utils/freepage.h b/src/include/utils/freepage.h
new file mode 100644
index 0000000..c0adf4d
--- /dev/null
+++ b/src/include/utils/freepage.h
@@ -0,0 +1,105 @@
+/*-------------------------------------------------------------------------
+ *
+ * freepage.h
+ * Management of page-organized free memory.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/include/utils/freepage.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef FREEPAGE_H
+#define FREEPAGE_H
+
+#include "storage/lwlock.h"
+#include "utils/relptr.h"
+
+/* Forward declarations. */
+typedef struct FreePageSpanLeader FreePageSpanLeader;
+typedef struct FreePageBtree FreePageBtree;
+typedef struct FreePageManager FreePageManager;
+
+/*
+ * PostgreSQL normally uses 8kB pages for most things, but many common
+ * architecture/operating system pairings use a 4kB page size for memory
+ * allocation, so we do that here also. We assume that a large allocation
+ * is likely to begin on a page boundary; if not, we'll discard bytes from
+ * the beginning and end of the object and use only the middle portion that
+ * is properly aligned. This works, but is not ideal, so it's best to keep
+ * this conservatively small. There don't seem to be any common architectures
+ * where the page size is less than 4kB, so this should be good enough; also,
+ * making it smaller would increase the space consumed by the address space
+ * map, which also uses this page size.
+ */
+#define FPM_PAGE_SIZE 4096
+
+/*
+ * Each freelist except for the last contains only spans of one particular
+ * size. Everything larger goes on the last one. In some sense this seems
+ * like a waste since most allocations are in a few common sizes, but it
+ * means that small allocations can simply pop the head of the relevant list
+ * without needing to worry about whether the object we find there is of
+ * precisely the correct size (because we know it must be).
+ */
+#define FPM_NUM_FREELISTS 129
+
+/* Define relative pointer types. */
+relptr_declare(FreePageBtree, RelptrFreePageBtree);
+relptr_declare(FreePageManager, RelptrFreePageManager);
+relptr_declare(FreePageSpanLeader, RelptrFreePageSpanLeader);
+
+/* Everything we need in order to manage free pages (see freepage.c) */
+struct FreePageManager
+{
+ RelptrFreePageManager self;
+ RelptrFreePageBtree btree_root;
+ RelptrFreePageSpanLeader btree_recycle;
+ unsigned btree_depth;
+ unsigned btree_recycle_count;
+ Size singleton_first_page;
+ Size singleton_npages;
+ Size contiguous_pages;
+ bool contiguous_pages_dirty;
+ RelptrFreePageSpanLeader freelist[FPM_NUM_FREELISTS];
+#ifdef FPM_EXTRA_ASSERTS
+ /* For debugging only, pages put minus pages gotten. */
+ Size free_pages;
+#endif
+};
+
+/* Macros to convert between page numbers (expressed as Size) and pointers. */
+#define fpm_page_to_pointer(base, page) \
+ (AssertVariableIsOfTypeMacro(page, Size), \
+ (base) + FPM_PAGE_SIZE * (page))
+#define fpm_pointer_to_page(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) / FPM_PAGE_SIZE)
+
+/* Macro to convert an allocation size to a number of pages. */
+#define fpm_size_to_pages(sz) \
+ (((sz) + FPM_PAGE_SIZE - 1) / FPM_PAGE_SIZE)
+
+/* Macros to check alignment of absolute and relative pointers. */
+#define fpm_pointer_is_page_aligned(base, ptr) \
+ (((Size) (((char *) (ptr)) - (base))) % FPM_PAGE_SIZE == 0)
+#define fpm_relptr_is_page_aligned(base, relptr) \
+ ((relptr).relptr_off % FPM_PAGE_SIZE == 0)
+
+/* Macro to find base address of the segment containing a FreePageManager. */
+#define fpm_segment_base(fpm) \
+ (((char *) fpm) - fpm->self.relptr_off)
+
+/* Macro to access a FreePageManager's largest consecutive run of pages. */
+#define fpm_largest(fpm) \
+ (fpm->contiguous_pages)
+
+/* Functions to manipulate the free page map. */
+extern void FreePageManagerInitialize(FreePageManager *fpm, char *base);
+extern bool FreePageManagerGet(FreePageManager *fpm, Size npages,
+ Size *first_page);
+extern void FreePageManagerPut(FreePageManager *fpm, Size first_page,
+ Size npages);
+extern char *FreePageManagerDump(FreePageManager *fpm);
+
+#endif /* FREEPAGE_H */
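Editorial sketch, not part of the patch: the header's page-size and free-list comments boil down to two small calculations, shown here as standalone functions. SKETCH_PAGE_SIZE, SKETCH_NUM_FREELISTS, size_to_pages and freelist_index are hypothetical names; the constants simply mirror FPM_PAGE_SIZE and FPM_NUM_FREELISTS, and the index formula follows the Min(npages, FPM_NUM_FREELISTS) - 1 expression used in freepage.c.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE		4096	/* mirrors FPM_PAGE_SIZE */
#define SKETCH_NUM_FREELISTS	129		/* mirrors FPM_NUM_FREELISTS */

/* Round an allocation size in bytes up to a whole number of pages. */
static size_t
size_to_pages(size_t sz)
{
	return (sz + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/*
 * Pick the free list for a span of npages pages.  Lists 0..127 each hold
 * spans of exactly 1..128 pages; the last list holds everything larger.
 */
static size_t
freelist_index(size_t npages)
{
	assert(npages > 0);
	return (npages < SKETCH_NUM_FREELISTS ? npages : SKETCH_NUM_FREELISTS) - 1;
}
```

Because every list but the last holds one exact size, a request for a small number of pages can take the head of the matching list without checking its length; only the final catch-all list needs a search.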
diff --git a/src/include/utils/relptr.h b/src/include/utils/relptr.h
new file mode 100644
index 0000000..40139ee
--- /dev/null
+++ b/src/include/utils/relptr.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * relptr.h
+ * This file contains basic declarations for relative pointers.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/include/utils/relptr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef RELPTR_H
+#define RELPTR_H
+
+/*
+ * Relative pointers are intended to be used when storing an address that may
+ * be relative either to the base of the processes address space or some
+ * dynamic shared memory segment mapped therein.
+ *
+ * The idea here is that you declare a relative pointer as relptr(type)
+ * and then use relptr_access to dereference it and relptr_store to change
+ * it. The use of a union here is a hack, because what's stored in the
+ * relptr is always a Size, never an actual pointer. But including a pointer
+ * in the union allows us to use stupid macro tricks to provide some measure
+ * of type-safety.
+ */
+#define relptr(type) union { type *relptr_type; Size relptr_off; }
+
+#define relptr_declare(type, name) \
+ typedef union { type *relptr_type; Size relptr_off; } name;
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (__typeof__((rp).relptr_type)) ((rp).relptr_off == 0 ? NULL : \
+ (base + (rp).relptr_off)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_access(base, rp) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (void *) ((rp).relptr_off == 0 ? NULL : (base + (rp).relptr_off)))
+#endif
+
+#define relptr_is_null(rp) \
+ ((rp).relptr_off == 0)
+
+#ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ AssertVariableIsOfTypeMacro(val, __typeof__((rp).relptr_type)), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#else
+/*
+ * If we don't have __builtin_types_compatible_p, assume we might not have
+ * __typeof__ either.
+ */
+#define relptr_store(base, rp, val) \
+ (AssertVariableIsOfTypeMacro(base, char *), \
+ (rp).relptr_off = ((val) == NULL ? 0 : ((char *) (val)) - (base)))
+#endif
+
+#define relptr_copy(rp1, rp2) \
+ ((rp1).relptr_off = (rp2).relptr_off)
+
+#endif /* RELPTR_H */
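Editorial sketch, not part of the patch: the relptr macros store an offset from a segment base rather than an address, with offset 0 standing in for NULL. The fragment below shows that idea with ordinary functions instead of the type-checking macro tricks. RelPtr, relptr_set and relptr_get are hypothetical names used only here; they are not the relptr.h API.

```c
#include <assert.h>
#include <stddef.h>

/* A relative pointer is just an offset from some base; 0 means NULL. */
typedef struct
{
	size_t		off;
} RelPtr;

/* Store 'val' as an offset from 'base'. */
static void
relptr_set(char *base, RelPtr *rp, void *val)
{
	/* Offset 0 is reserved for NULL, so the base itself cannot be stored. */
	assert(val == NULL || (char *) val != base);
	rp->off = (val == NULL) ? 0 : (size_t) ((char *) val - base);
}

/* Convert back to an address valid for whichever mapping 'base' names. */
static void *
relptr_get(char *base, RelPtr rp)
{
	return rp.off == 0 ? NULL : (void *) (base + rp.off);
}
```

The same RelPtr value resolves correctly against any base, which is the property that lets each backend map the segment at a different address. The cost is that an object located exactly at the base cannot be referenced.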
On Thu, Dec 1, 2016 at 10:33 PM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
Please find attached dsa-v8.patch, and also a small test module for
running random allocate/free exercises and dumping the internal
allocator state.
Moved to next CF with "needs review" status.
Regards,
Hari Babu
Fujitsu Australia
On Thu, Dec 1, 2016 at 6:33 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Please find attached dsa-v8.patch, and also a small test module for
running random allocate/free exercises and dumping the internal
allocator state.
OK, I've committed the main patch. As far as test-dsa.patch, can we
tie that into make check-world so that committing it delivers some
buildfarm coverage for this code? Of course, the test settings would
have to be fairly conservative given that some buildfarm machines have
very limited resources, but it still seems worth doing. test_shm_mq
might provide some useful precedent.
Note that you don't need the prototype if you've already used
PG_FUNCTION_INFO_V1.
I'm not sure that using the same random seed every time is a good
idea. Maybe you should provide a way to set the seed as part of
starting the test, or to not do that (pass NULL?) and then elog(LOG,
...) the seed that's chosen. Then if the BF crashes, we can see what
seed was in use for that particular test.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
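Editorial sketch, not from the thread: one way to read the seed suggestion above is a helper that uses a caller-supplied seed when given one, otherwise picks one, and logs it either way so a failing run can be reproduced. start_test_rng is a hypothetical name, and the sketch uses plain srand/fprintf rather than the backend's own random and elog facilities, which the test module would presumably use instead.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/*
 * Seed the generator for a randomized test.  Pass NULL to let the helper
 * choose a seed.  The seed in use is always reported and returned, so a
 * crash log shows which sequence of operations was being run.
 */
static unsigned int
start_test_rng(const unsigned int *requested_seed)
{
	unsigned int seed;

	if (requested_seed != NULL)
		seed = *requested_seed;
	else
		seed = (unsigned int) time(NULL);

	fprintf(stderr, "LOG: random test seed = %u\n", seed);
	srand(seed);
	return seed;
}
```

Rerunning with the logged value passed back in replays the same pseudo-random sequence on the same platform.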
On Fri, Dec 2, 2016 at 1:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 1, 2016 at 6:33 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Please find attached dsa-v8.patch, and also a small test module for
running random allocate/free exercises and dumping the internal
allocator state.
OK, I've committed the main patch.
...but the buildfarm isn't very happy about it.
tern complains:
In file included from dsa.c:58:0:
../../../../src/include/utils/dsa.h:59:1: error: unknown type name
'pg_atomic_uint64'
typedef pg_atomic_uint64 dsa_pointer_atomic;
...but that code is only compiled if #if DSA_POINTER_SIZEOF == 4 fails
to be true. And that should always be true unless
PG_HAVE_ATOMIC_U64_SUPPORT is defined. So apparently tern claims to
PG_HAVE_ATOMIC_U64_SUPPORT but doesn't actually define
pg_atomic_uint64? That doesn't seem right.
The failures on several other BF members appear to be similar.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 2, 2016 at 2:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 2, 2016 at 1:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 1, 2016 at 6:33 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Please find attached dsa-v8.patch, and also a small test module for
running random allocate/free exercises and dumping the internal
allocator state.
OK, I've committed the main patch.
...but the buildfarm isn't very happy about it.
tern complains:
In file included from dsa.c:58:0:
../../../../src/include/utils/dsa.h:59:1: error: unknown type name
'pg_atomic_uint64'
typedef pg_atomic_uint64 dsa_pointer_atomic;
...but that code is only compiled if #if DSA_POINTER_SIZEOF == 4 fails
to be true. And that should always be true unless
PG_HAVE_ATOMIC_U64_SUPPORT is defined. So apparently tern claims to
PG_HAVE_ATOMIC_U64_SUPPORT but doesn't actually define
pg_atomic_uint64? That doesn't seem right.
No, that's not the problem. Just a garden variety thinko in dsa.h.
Will push a fix presently.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
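Editorial sketch, not from the thread: the message above does not say what the thinko was. One reading consistent with the header as posted is that dsa.h defines SIZEOF_DSA_POINTER but tests DSA_POINTER_SIZEOF; treat that as an inference rather than a statement of the committed fix. The fragment below shows the general C preprocessor behavior that makes such a mismatch silent: an identifier that is not defined as a macro evaluates to 0 inside #if, so the comparison fails and the #else branch is chosen. All macro and function names here are hypothetical.

```c
#include <assert.h>

#define SIZEOF_SKETCH_POINTER 4	/* the macro that is actually defined */

/*
 * Misspelled test: SKETCH_POINTER_SIZEOF is never defined, so it evaluates
 * to 0 in the #if and the 64-bit branch is selected without any error.
 */
#if SKETCH_POINTER_SIZEOF == 4
#define MISSPELLED_BRANCH_BITS 32
#else
#define MISSPELLED_BRANCH_BITS 64
#endif

/* Correct test: uses the name that was defined above. */
#if SIZEOF_SKETCH_POINTER == 4
#define CORRECT_BRANCH_BITS 32
#else
#define CORRECT_BRANCH_BITS 64
#endif

/* Report which branch each conditional selected. */
static int
misspelled_width(void)
{
	return MISSPELLED_BRANCH_BITS;
}

static int
correct_width(void)
{
	return CORRECT_BRANCH_BITS;
}
```

On a platform without 64-bit atomics, falling into the 64-bit branch would reference pg_atomic_uint64 where it does not exist, which matches the shape of the error tern reported. GCC and Clang can flag the misspelled test with -Wundef.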
On Sat, Dec 3, 2016 at 9:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 2, 2016 at 2:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 2, 2016 at 1:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 1, 2016 at 6:33 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Please find attached dsa-v8.patch, and also a small test module for
running random allocate/free exercises and dumping the internal
allocator state.
OK, I've committed the main patch.
...but the buildfarm isn't very happy about it.
tern complains:
In file included from dsa.c:58:0:
../../../../src/include/utils/dsa.h:59:1: error: unknown type name
'pg_atomic_uint64'
typedef pg_atomic_uint64 dsa_pointer_atomic;
...but that code is only compiled if #if DSA_POINTER_SIZEOF == 4 fails
to be true. And that should always be true unless
PG_HAVE_ATOMIC_U64_SUPPORT is defined. So apparently tern claims to
PG_HAVE_ATOMIC_U64_SUPPORT but doesn't actually define
pg_atomic_uint64? That doesn't seem right.
No, that's not the problem. Just a garden variety thinko in dsa.h.
Will push a fix presently.
Here's a patch to provide the right format string for dsa_pointer to
printf-like functions, which clears a warning coming from dsa_dump (a
debugging function) on 32 bit systems.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
fix-dsa-pointer-format.patch
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 9095da0..0e49e70 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1099,9 +1099,10 @@ dsa_dump(dsa_area *area)
span = dsa_get_address(area, span_pointer);
fprintf(stderr,
- " span descriptor at %016lx, "
- "superblock at %016lx, pages = %zu, "
- "objects free = %hu/%hu\n",
+ " span descriptor at "
+ DSA_POINTER_FORMAT ", superblock at "
+ DSA_POINTER_FORMAT
+ ", pages = %zu, objects free = %hu/%hu\n",
span_pointer, span->start, span->npages,
span->nallocatable, span->nmax);
span_pointer = span->nextspan;
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 4ef5c24..70c32d1 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -54,6 +54,7 @@ typedef pg_atomic_uint32 dsa_pointer_atomic;
#define dsa_pointer_atomic_write pg_atomic_write_u32
#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u32
#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u32
+#define DSA_POINTER_FORMAT "%08x"
#else
typedef uint64 dsa_pointer;
typedef pg_atomic_uint64 dsa_pointer_atomic;
@@ -62,6 +63,7 @@ typedef pg_atomic_uint64 dsa_pointer_atomic;
#define dsa_pointer_atomic_write pg_atomic_write_u64
#define dsa_pointer_atomic_fetch_add pg_atomic_fetch_add_u64
#define dsa_pointer_atomic_compare_exchange pg_atomic_compare_exchange_u64
+#define DSA_POINTER_FORMAT "%016" INT64_MODIFIER "x"
#endif
/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
On Fri, Dec 2, 2016 at 3:46 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Here's a patch to provide the right format string for dsa_pointer to
printf-like functions, which clears a warning coming from dsa_dump (a
debugging function) on 32 bit systems.
Committed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
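Editorial sketch, not from the thread: the patch above ties the printf format to the pointer typedef so the two cannot drift apart. The fragment below shows the same pattern in standard C, using PRIx64 from <inttypes.h> in place of PostgreSQL's INT64_MODIFIER. sketch_pointer, SKETCH_POINTER_FORMAT and format_pointer are hypothetical names for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* The pointer type and its format string are defined side by side. */
typedef uint64_t sketch_pointer;
#define SKETCH_POINTER_FORMAT "%016" PRIx64

/*
 * Format a pointer value into buf.  Adjacent string literals are
 * concatenated by the compiler, so the macro can be spliced into a longer
 * format string.  Returns what snprintf returns.
 */
static int
format_pointer(char *buf, size_t buflen, sketch_pointer p)
{
	assert(buf != NULL && buflen > 0);
	return snprintf(buf, buflen, "span at " SKETCH_POINTER_FORMAT, p);
}
```

A hard-coded %016lx is only right where long is 64 bits wide; deriving the conversion from the type avoids the mismatch warning on 32-bit systems.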
On Tue, Nov 15, 2016 at 5:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think we should develop versions of this that (1) allocate from the
main shared memory segment and (2) allocate from backend-private
memory. Per my previous benchmarking results, allocating from
backend-private memory would be a substantial win for tuplesort.c
because this allocator is substantially more memory-efficient for
large memory contexts than aset.c, and Tomas Vondra tested it out and
found that it is also faster for logical decoding than the approach he
proposed.
The approach that I'd prefer to take with tuplesort.c is to have a
buffer for caller tuples that is written to sequentially, and
repalloc()'d as needed, much like the memtuples array. It would be
slightly tricky to make this work when memtuples needs to be a heap
(I'm mostly thinking of top-N heapsorts here). That has perhaps
unbeatable efficiency, while also helping cases with significant
physical/logical correlation in their input, which is pretty common.
Creating an index on a serial PK within pg_restore would probably get
notably faster if we went this way.
--
Peter Geoghegan
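The "written to sequentially and repalloc()'d as needed" idea above can be sketched roughly as follows. This is an illustration only, not tuplesort.c code: it uses plain `realloc` instead of `repalloc`, and `SeqBuffer`, `seqbuf_append` and the other names are hypothetical. One assumption worth stating is that entries are referenced by offset rather than by pointer, since growing the buffer may move it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical append-only buffer; names are made up for illustration. */
typedef struct
{
    char   *data;       /* start of the buffer, may move when it grows */
    size_t  used;       /* bytes written so far */
    size_t  capacity;   /* bytes allocated */
} SeqBuffer;

#define SEQBUF_INVALID ((size_t) -1)

static void
seqbuf_init(SeqBuffer *buf)
{
    buf->data = NULL;
    buf->used = 0;
    buf->capacity = 0;
}

/*
 * Append len bytes at the end of the buffer, doubling the allocation when
 * needed.  Returns the offset of the copy, or SEQBUF_INVALID on failure.
 */
static size_t
seqbuf_append(SeqBuffer *buf, const void *src, size_t len)
{
    size_t  offset;

    if (len > buf->capacity - buf->used)
    {
        size_t  newcap = buf->capacity ? buf->capacity : 64;
        char   *newdata;

        while (newcap - buf->used < len)
        {
            if (newcap > SIZE_MAX / 2)
                return SEQBUF_INVALID;      /* would overflow */
            newcap *= 2;
        }
        newdata = realloc(buf->data, newcap);
        if (newdata == NULL)
            return SEQBUF_INVALID;          /* old buffer is still valid */
        buf->data = newdata;
        buf->capacity = newcap;
    }
    offset = buf->used;
    memcpy(buf->data + offset, src, len);
    buf->used += len;
    return offset;
}

/* Convert an offset back to an address; valid only until the next append. */
static char *
seqbuf_get(SeqBuffer *buf, size_t offset)
{
    assert(offset < buf->used);
    return buf->data + offset;
}

static void
seqbuf_free(SeqBuffer *buf)
{
    free(buf->data);
    seqbuf_init(buf);
}
```

Because entries land in arrival order with no per-chunk header, input that is already close to sorted stays physically close to its logical order, which is the correlation benefit described above.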
[ blast-from-the-past department ]
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Dec 1, 2016 at 6:33 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Please find attached dsa-v8.patch, and also a small test module for
running random allocate/free exercises and dumping the internal
allocator state.
OK, I've committed the main patch.
Our shiny new version of Coverity kvetches about
FreePageBtreeInsertInternal:
*** CID 1667414: (OVERRUN)
/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/mmgr/freepage.c: 908 in FreePageBtreeInsertInternal()
902 {
903 Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
904 Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
905 Assert(index <= btp->hdr.nused);
906 memmove(&btp->u.internal_key[index + 1], &btp->u.internal_key[index],
907 sizeof(FreePageBtreeInternalKey) * (btp->hdr.nused - index));
CID 1667414: (OVERRUN)
Overrunning array "btp->u.internal_key" of 254 16-byte elements at element index 254 (byte offset 4079) using index "index" (which evaluates to 254).
908 btp->u.internal_key[index].first_page = first_page;
909 relptr_store(base, btp->u.internal_key[index].child, child);
910 ++btp->hdr.nused;
911 }
I believe the reason is that the second Assert is wrong, and it
should instead be
904 Assert(btp->hdr.nused < FPM_ITEMS_PER_INTERNAL_PAGE);
to assert that there is room for the item we are about to insert.
The same thinko exists in FreePageBtreeInsertLeaf, although
for some reason Coverity isn't whining about that.
Thoughts?
regards, tom lane
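The off-by-one described above can be reproduced in miniature. The sketch below is not freepage.c code: `DemoPage`, `DEMO_CAPACITY` and `demo_insert` are hypothetical names, with `DEMO_CAPACITY` standing in for FPM_ITEMS_PER_INTERNAL_PAGE. It shows the precondition in the corrected form, where the count must be strictly less than the capacity before one more item is inserted.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical capacity; stands in for FPM_ITEMS_PER_INTERNAL_PAGE. */
#define DEMO_CAPACITY 4

typedef struct
{
    int nused;                  /* number of slots currently occupied */
    int items[DEMO_CAPACITY];   /* fixed-size slot array */
} DemoPage;

/* Insert value at position index, shifting later items up by one slot. */
static void
demo_insert(DemoPage *page, int index, int value)
{
    /*
     * There must be room for the item we are about to add.  With "<="
     * here, a full page (nused == DEMO_CAPACITY) would pass the check and
     * index == DEMO_CAPACITY would write one slot past the array.
     */
    assert(page->nused < DEMO_CAPACITY);
    assert(index >= 0 && index <= page->nused);

    memmove(&page->items[index + 1], &page->items[index],
            sizeof(int) * (page->nused - index));
    page->items[index] = value;
    ++page->nused;
}
```

With the strict comparison, the largest index that can pass both assertions is DEMO_CAPACITY - 1, so the store into `items[index]` always stays inside the array.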
On Wed, Oct 22, 2025 at 12:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Our shiny new version of Coverity kvetches about
FreePageBtreeInsertInternal:
*** CID 1667414: (OVERRUN)
/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/mmgr/freepage.c: 908 in FreePageBtreeInsertInternal()
902 {
903 Assert(btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC);
904 Assert(btp->hdr.nused <= FPM_ITEMS_PER_INTERNAL_PAGE);
905 Assert(index <= btp->hdr.nused);
906 memmove(&btp->u.internal_key[index + 1], &btp->u.internal_key[index],
907 sizeof(FreePageBtreeInternalKey) * (btp->hdr.nused - index));
CID 1667414: (OVERRUN)
Overrunning array "btp->u.internal_key" of 254 16-byte elements at element index 254 (byte offset 4079) using index "index" (which evaluates to 254).
908 btp->u.internal_key[index].first_page = first_page;
909 relptr_store(base, btp->u.internal_key[index].child, child);
910 ++btp->hdr.nused;
911 }
I believe the reason is that the second Assert is wrong, and it
should instead be
904 Assert(btp->hdr.nused < FPM_ITEMS_PER_INTERNAL_PAGE);
to assert that there is room for the item we are about to insert.
The same thinko exists in FreePageBtreeInsertLeaf, although
for some reason Coverity isn't whining about that.
Thoughts?
I only just noticed this email. I see you've already fixed the issue.
I agree with your analysis, and thanks for taking care of it.
--
Robert Haas
EDB: http://www.enterprisedb.com