Listen / Notify - what to do when the queue is full

Started by Joachim Wieland about 16 years ago, 121 messages
#1 Joachim Wieland
joe@mcknight.de

We still need to decide what to do with queue full situations in the proposed
listen/notify implementation. I have a new version of the patch to allow for a
variable payload size. However, the whole notification must fit into one page so
the payload needs to be less than 8K.

I have also added the XID, so that we can write to the queue before committing
to clog, which allows for rollback if we encounter write errors (disk full, for
example). The implications of this change in particular make the patch a lot
more complicated.

The queue is slru-based, slru uses int page numbers, so we can use up to
2147483647 (INT_MAX) pages with some small changes in slru.c.
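
To illustrate why slru.c would need small changes, here is a minimal sketch of how a page number maps to a segment file name. It assumes 32 pages per segment and the four-hex-digit "%04X" naming that the attached patch shows for slru.c. The `demo_*` names are hypothetical and not part of PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumption: same value as SLRU_PAGES_PER_SEGMENT in the attached patch. */
#define DEMO_PAGES_PER_SEGMENT 32

/* Segment number that holds the given page. */
int
demo_segment_of(int page)
{
	return page / DEMO_PAGES_PER_SEGMENT;
}

/* Format only the segment part of the file name, using "%04X". */
void
demo_segment_name(int page, char *buf, size_t buflen)
{
	snprintf(buf, buflen, "%04X", (unsigned int) demo_segment_of(page));
}

/* 1 if the segment number still fits into four hex digits, else 0. */
int
demo_fits_four_hex_digits(int page)
{
	return demo_segment_of(page) <= 0xFFFF;
}
```

Under these assumptions, page 32 * 0xFFFF still lands in segment "FFFF", while page 2147483647 needs a seven-digit segment name. Using the full int range of pages would therefore require relaxing the four-digit assumption.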

When do we have a full queue? The idea is that notifications are written
to the queue and read as soon as the notifying transaction commits. Only
a listening backend that is busy won't read the notifications, and so it
won't update its pointer for some time. With the current space we can
accommodate at least 2147483647 notifications, or more depending on the
payload length. That gives us somewhere between 214 GB (100 bytes per
notification) and 17 TB (8000 bytes per notification). So in order to have a
full queue, we need to generate that amount of notifications while one backend
is still busy and is not reading the accumulating notifications. In general,
chances are not too high that anyone will ever have a full notification queue,
but we need to define the behavior anyway...
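
The size estimate above can be checked with a few lines of C. This is only a sketch of the arithmetic: `MAX_NOTIFICATIONS`, `queue_bytes` and `to_decimal_gb` are illustrative names that do not appear in the patch, and the per-notification sizes are the two figures quoted in this mail.

```c
#include <assert.h>
#include <stdint.h>

/* Number of notifications assumed in the mail (INT_MAX for a 32-bit int). */
#define MAX_NOTIFICATIONS UINT64_C(2147483647)

/* Bytes needed to hold `count` notifications of `bytes_each` bytes each. */
uint64_t
queue_bytes(uint64_t count, uint64_t bytes_each)
{
	return count * bytes_each;
}

/* Whole decimal gigabytes (10^9 bytes), rounded down. */
uint64_t
to_decimal_gb(uint64_t bytes)
{
	return bytes / UINT64_C(1000000000);
}
```

With 100 bytes per notification this yields 214 decimal GB. With 8000 bytes it yields 17179 GB, which is roughly 17 TB.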

These are the solutions that I currently see:

1) drop new notifications if the queue is full (silently or with rollback)
2) block until readers catch up (what if the backend that tries to write the
notifications actually is the "lazy" reader that everybody is waiting for to
proceed?)
3) invent a new signal reason and send SIGUSR1 to the "lazy" readers; they
need to interrupt whatever they are doing and copy the notifications into
their own address space (without delivering the notifications since they are
in a transaction at that moment).

For 1) there can be warnings way ahead of when the queue is actually full, like
one when it is 50% full, another one when it is 75% full and so on, and they
could point to the backend that is most behind in reading notifications...
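
The early-warning idea can be sketched as follows. This is a simplified model, not code from the patch: positions are reduced to page numbers in a circular range of `DEMO_MAX_PAGE + 1` pages, and all names (`pages_in_use`, `fill_warning_level`, `most_behind_backend`) are hypothetical.

```c
#include <assert.h>

/* Assumption: a small circular page range 0..DEMO_MAX_PAGE for the demo. */
#define DEMO_MAX_PAGE 1023

/* Pages between tail (oldest unread page) and head (next free page),
 * taking wraparound into account. */
int
pages_in_use(int head, int tail)
{
	if (head >= tail)
		return head - tail;
	return (DEMO_MAX_PAGE + 1) - tail + head;
}

/* Highest warning threshold reached, in percent; 0 if below 50%. */
int
fill_warning_level(int head, int tail)
{
	int		percent = pages_in_use(head, tail) * 100 / (DEMO_MAX_PAGE + 1);

	if (percent >= 75)
		return 75;
	if (percent >= 50)
		return 50;
	return 0;
}

/* Index of the backend whose read position lags furthest behind head,
 * or -1 if there are no backends. */
int
most_behind_backend(int head, const int *backend_page, int nbackends)
{
	int		worst = -1;
	int		worst_lag = -1;

	for (int i = 0; i < nbackends; i++)
	{
		int		lag = pages_in_use(head, backend_page[i]);

		if (lag > worst_lag)
		{
			worst_lag = lag;
			worst = i;
		}
	}
	return worst;
}
```

A writer could call `fill_warning_level` before adding entries and, when a threshold is crossed, name the backend returned by `most_behind_backend` in the warning.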

I think that 2) is the least practical approach. If there is a pile of at least
2,147,483,647 notifications, then a backend hasn't read the notifications
for a long long time... Chances are low that it will read them within the next
few seconds.
In a sense, 2) implies 3) for the special case that the writing backend is
the one that everybody is waiting for to proceed reading notifications:
in the end this backend is waiting for itself.

For 3) the question is if we can just invent a new signal reason
PROCSIG_NOTIFYCOPY_INTERRUPT or similar and upon reception the backend
copies the notification data to its private address space?
Would this function be called by every backend after at most a few seconds
even if it is processing a long running query?

Admittedly, once 3) is in place we can also put a smaller queue into shared
memory and remove the slru thing altogether, but we need to be sure that we can
interrupt the backends at any time since the queue size would be a lot smaller
than 200 GB...

Joachim

#2 Merlin Moncure
mmoncure@gmail.com
In reply to: Joachim Wieland (#1)
Re: Listen / Notify - what to do when the queue is full

On Mon, Nov 16, 2009 at 9:05 AM, Greg Sabino Mullane wrote:

>> We still need to decide what to do with queue full situations in
>> the proposed listen/notify implementation. I have a new version
>> of the patch to allow for a variable payload size. However, the
>> whole notification must fit into one page so the payload needs
>> to be less than 8K.
>
> That sounds fine to me, FWIW.

+1! I think this should satisfy everyone.

>> I have also added the XID, so that we can write to the queue before
>> committing to clog which allows for rollback if we encounter write
>> errors (disk full for example). Especially the implications of this
>> change make the patch a lot more complicated.
>
> Can you elaborate on the use case for this?

Tom specifically asked for it: "The old implementation was acid so the
new one should be too"

>> so it won't update its pointer for some time. With the current space we can
>> accommodate at least 2147483647 notifications or more, depending on the
>> payload length.
>
> That's a whole lot of notifications. I doubt any program out there is using
> anywhere near that number at the moment. In my applications, having a
> few hundred notifications active at one time is "a lot" in my book. :)

>> These are the solutions that I currently see:
>>
>> 1) drop new notifications if the queue is full (silently or with rollback)
>
> I like this one best, but not with silence of course. While it's not the most
> polite thing to do, this is for a super extreme edge case. I'd rather just
> throw an exception if the queue is full rather than start messing with the
> readers. It's a possible denial of service attack too, but so is the current
> implementation in a way - at least I don't think apps would perform very
> optimally with 2147483647 entries in the pg_listener table :)
>
> If you need some real-world use cases involving payloads, let me know, I've
> been waiting for this feature for some time and have it all mapped out.

Me too. Joachim: when I benchmarked the original patch, I was seeing
a few log messages that suggested there might be something going on
inside. In any event, the performance was fantastic.

merlin

merlin

#3 Andrew Chernow
ac@esilo.com
In reply to: Merlin Moncure (#2)
Re: Listen / Notify - what to do when the queue is full

Greg Sabino Mullane wrote:

>> We still need to decide what to do with queue full situations in
>> the proposed listen/notify implementation. I have a new version
>> of the patch to allow for a variable payload size. However, the
>> whole notification must fit into one page so the payload needs
>> to be less than 8K.
>
> That sounds fine to me, FWIW.

Agreed. Thank you for all your work.

>> 1) drop new notifications if the queue is full (silently or with rollback)
>
> I like this one best, but not with silence of course. While it's not the most
> polite thing to do, this is for a super extreme edge case. I'd rather just
> throw an exception if the queue is full rather than start messing with the

+1

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/

#4 Joachim Wieland
joe@mcknight.de
In reply to: Joachim Wieland (#1)
1 attachment(s)
Re: Listen / Notify - what to do when the queue is full

On Sun, Nov 15, 2009 at 7:19 PM, Joachim Wieland <joe@mcknight.de> wrote:

> These are the solutions that I currently see:
>  1) [...]
>  2) [...]
>  3) [...]

4) Allow readers to read uncommitted notifications as well. Instead of
delivering them, the backends just copy them over into their own
address space and deliver them later on...

Going with option 4) allows readers to always read all notifications
in the queue... This also allows a backend to send more notifications
than the queue can hold. So we are only limited by the backends'
memory. Every notification that is sent will eventually be delivered.

The queue can still fill up if one of the backends is busy for a long
long long time... Then the next writer just blocks and waits.
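
The reader side of option 4) can be sketched roughly as below. This is a simplified model under stated assumptions, not the code from the attached patch: transaction status is looked up in a plain array (`demo_status`) that stands in for a clog check, the private list has a fixed capacity, and "delivering" is reduced to a counter. All `Demo*`/`demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum
{
	XACT_IN_PROGRESS = 0,
	XACT_COMMITTED,
	XACT_ABORTED
} DemoXactStatus;

typedef struct
{
	int			xid;
	const char *channel;
} DemoEntry;

#define DEMO_MAX_XID		8
#define DEMO_MAX_PENDING	16

/* Stand-in for a clog lookup, indexed by xid; zero means in progress. */
DemoXactStatus demo_status[DEMO_MAX_XID];

typedef struct
{
	DemoEntry	pending[DEMO_MAX_PENDING];	/* copied out, not yet deliverable */
	int			npending;
	int			ndelivered;
	int			next;		/* this reader's position in the shared queue */
} DemoReader;

/* Deliver, keep for later, or discard one entry depending on its xid. */
static void
demo_handle(DemoReader *r, DemoEntry e)
{
	switch (demo_status[e.xid])
	{
		case XACT_COMMITTED:
			r->ndelivered++;
			break;
		case XACT_IN_PROGRESS:
			assert(r->npending < DEMO_MAX_PENDING);
			r->pending[r->npending++] = e;
			break;
		case XACT_ABORTED:
			break;				/* never deliver rolled-back notifications */
	}
}

/* Retry saved entries, then drain the shared queue up to head. The reader
 * position always advances, so this reader never holds the queue back. */
void
demo_read_all(DemoReader *r, const DemoEntry *queue, int head)
{
	DemoEntry	saved[DEMO_MAX_PENDING];
	int			nsaved = r->npending;

	for (int i = 0; i < nsaved; i++)
		saved[i] = r->pending[i];
	r->npending = 0;
	for (int i = 0; i < nsaved; i++)
		demo_handle(r, saved[i]);

	while (r->next < head)
		demo_handle(r, queue[r->next++]);
}
```

The point of the design is that a reader's queue position no longer depends on whether the notifying transactions have committed; only its private memory grows while they are pending.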

Attached patch implements this behavior as well as a variable payload
size, limited to 8000 characters. The variable payload also offers an
automatic speed control... The smaller your notifications are, the
more efficiently a page can be used and the faster you are. :-)

Once we are fine that this is the way to go, I'll submit a documentation patch.

Joachim

Attachments:

listennotify.2.diff (text/x-diff; charset=US-ASCII)
diff -ur cvs/src/backend/access/transam/slru.c cvs.build/src/backend/access/transam/slru.c
--- cvs/src/backend/access/transam/slru.c	2009-05-10 19:49:47.000000000 +0200
+++ cvs.build/src/backend/access/transam/slru.c	2009-11-18 10:20:54.000000000 +0100
@@ -58,26 +58,6 @@
 #include "storage/shmem.h"
 #include "miscadmin.h"
 
-
-/*
- * Define segment size.  A page is the same BLCKSZ as is used everywhere
- * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
- * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
- * or 64K transactions for SUBTRANS.
- *
- * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
- * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
- * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
- * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
- * take no explicit notice of that fact in this module, except when comparing
- * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
- *
- * Note: this file currently assumes that segment file names will be four
- * hex digits.	This sets a lower bound on the segment size (64K transactions
- * for 32-bit TransactionIds).
- */
-#define SLRU_PAGES_PER_SEGMENT	32
-
 #define SlruFileName(ctl, path, seg) \
 	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
 
diff -ur cvs/src/backend/access/transam/xact.c cvs.build/src/backend/access/transam/xact.c
--- cvs/src/backend/access/transam/xact.c	2009-09-06 08:58:59.000000000 +0200
+++ cvs.build/src/backend/access/transam/xact.c	2009-11-18 10:20:54.000000000 +0100
@@ -1604,8 +1604,8 @@
 	/* close large objects before lower-level cleanup */
 	AtEOXact_LargeObject(true);
 
-	/* NOTIFY commit must come before lower-level cleanup */
-	AtCommit_Notify();
+	/* Insert notifications sent by the NOTIFY command into the queue */
+	AtCommit_NotifyBeforeCommit();
 
 	/* Prevent cancel/die interrupt while cleaning up */
 	HOLD_INTERRUPTS();
@@ -1680,6 +1680,11 @@
 
 	AtEOXact_MultiXact();
 
+	/*
+	 * Clean up Notify buffers and signal listening backends.
+	 */
+	AtCommit_NotifyAfterCommit();
+
 	ResourceOwnerRelease(TopTransactionResourceOwner,
 						 RESOURCE_RELEASE_LOCKS,
 						 true, true);
diff -ur cvs/src/backend/catalog/Makefile cvs.build/src/backend/catalog/Makefile
--- cvs/src/backend/catalog/Makefile	2009-10-31 14:47:46.000000000 +0100
+++ cvs.build/src/backend/catalog/Makefile	2009-11-18 10:20:54.000000000 +0100
@@ -30,7 +30,7 @@
 	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
 	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
 	pg_language.h pg_largeobject.h pg_aggregate.h pg_statistic.h \
-	pg_rewrite.h pg_trigger.h pg_listener.h pg_description.h pg_cast.h \
+	pg_rewrite.h pg_trigger.h pg_description.h pg_cast.h \
 	pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
 	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
 	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
diff -ur cvs/src/backend/commands/async.c cvs.build/src/backend/commands/async.c
--- cvs/src/backend/commands/async.c	2009-09-06 08:59:06.000000000 +0200
+++ cvs.build/src/backend/commands/async.c	2009-11-19 00:44:41.000000000 +0100
@@ -14,31 +14,54 @@
 
 /*-------------------------------------------------------------------------
  * New Async Notification Model:
- * 1. Multiple backends on same machine.  Multiple backends listening on
- *	  one relation.  (Note: "listening on a relation" is not really the
- *	  right way to think about it, since the notify names need not have
- *	  anything to do with the names of relations actually in the database.
- *	  But this terminology is all over the code and docs, and I don't feel
- *	  like trying to replace it.)
- *
- * 2. There is a tuple in relation "pg_listener" for each active LISTEN,
- *	  ie, each relname/listenerPID pair.  The "notification" field of the
- *	  tuple is zero when no NOTIFY is pending for that listener, or the PID
- *	  of the originating backend when a cross-backend NOTIFY is pending.
- *	  (We skip writing to pg_listener when doing a self-NOTIFY, so the
- *	  notification field should never be equal to the listenerPID field.)
- *
- * 3. The NOTIFY statement itself (routine Async_Notify) just adds the target
- *	  relname to a list of outstanding NOTIFY requests.  Actual processing
- *	  happens if and only if we reach transaction commit.  At that time (in
- *	  routine AtCommit_Notify) we scan pg_listener for matching relnames.
- *	  If the listenerPID in a matching tuple is ours, we just send a notify
- *	  message to our own front end.  If it is not ours, and "notification"
- *	  is not already nonzero, we set notification to our own PID and send a
- *	  PROCSIG_NOTIFY_INTERRUPT signal to the receiving process (indicated by
- *	  listenerPID).
- *	  BTW: if the signal operation fails, we presume that the listener backend
- *	  crashed without removing this tuple, and remove the tuple for it.
+ *
+ * 1. Multiple backends on same machine. Multiple backends listening on
+ *	  several channels. (This was previously called a "relation" even though it
+ *	  is just an identifier and has nothing to do with a database relation.)
+ *
+ * 2. There is one central queue in the form of Slru backed file based storage
+ *    (directory pg_notify/), with several pages mapped into shared memory.
+ *
+ *    There is no central storage of which backend listens on which channel,
+ *    every backend has its own list.
+ *
+ *    Every backend that is listening on at least one channel registers by
+ *    entering its Pid into the array of all backends. It then scans all
+ *    incoming notifications and compares the notified channels with its list.
+ *
+ *    In case there is a match it delivers the corresponding notification to
+ *    its frontend.
+ *
+ * 3. The NOTIFY statement (routine Async_Notify) registers the notification
+ *    in a list which will not be processed until transaction end. Every
+ *    notification can additionally send a "payload" which is an extra text
+ *    parameter to convey arbitrary information to the recipient.
+ *
+ *    Duplicate notifications from the same transaction are sent out as one
+ *    notification only. This is done to save work when for example a trigger
+ *    on a 2 million row table fires a notification for each row that has been
+ *    changed. If the application needs to receive every single notification
+ *    that has been sent, it can easily add some unique string into the extra
+ *    payload parameter.
+ *
+ *    Once the transaction commits, AtCommit_NotifyBeforeCommit() performs the
+ *    required changes regarding listeners (Listen/Unlisten) and then adds the
+ *    pending notifications to the head of the queue. The head pointer of the
+ *    queue always points to the next free position and a position is just a
+ *    page number and the offset in that page. This is done before marking the
+ *    transaction as committed in clog. If we run into problems writing the
+ *    notifications, we can still call elog(ERROR, ...) and the transaction
+ *    will roll back.
+ *
+ *    Once we have put all of the notifications into the queue, we return to
+ *    CommitTransaction() which will then commit to clog.
+ *
+ *    We are then called another time (AtCommit_NotifyAfterCommit()) and check
+ *    if we need to signal the backends.
+ *    In SignalBackends() we scan the list of listening backends and send a
+ *    PROCSIG_NOTIFY_INTERRUPT to every backend that has set its Pid (We don't
+ *    know which backend is listening on which channel so we need to send a
+ *    signal to every listening backend).
  *
  * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
  *	  can call inbound-notify processing immediately if this backend is idle
@@ -46,48 +69,48 @@
  *	  block).  Otherwise the handler may only set a flag, which will cause the
  *	  processing to occur just before we next go idle.
  *
- * 5. Inbound-notify processing consists of scanning pg_listener for tuples
- *	  matching our own listenerPID and having nonzero notification fields.
- *	  For each such tuple, we send a message to our frontend and clear the
- *	  notification field.  BTW: this routine has to start/commit its own
- *	  transaction, since by assumption it is only called from outside any
- *	  transaction.
- *
- * Like NOTIFY, LISTEN and UNLISTEN just add the desired action to a list
- * of pending actions.	If we reach transaction commit, the changes are
- * applied to pg_listener just before executing any pending NOTIFYs.  This
- * method is necessary because to avoid race conditions, we must hold lock
- * on pg_listener from when we insert a new listener tuple until we commit.
- * To do that and not create undue hazard of deadlock, we don't want to
- * touch pg_listener until we are otherwise done with the transaction;
- * in particular it'd be uncool to still be taking user-commanded locks
- * while holding the pg_listener lock.
- *
- * Although we grab ExclusiveLock on pg_listener for any operation,
- * the lock is never held very long, so it shouldn't cause too much of
- * a performance problem.  (Previously we used AccessExclusiveLock, but
- * there's no real reason to forbid concurrent reads.)
+ * 5. Inbound-notify processing consists of reading all of the notifications
+ *	  that have arrived since scanning last time. We read every notification
+ *	  until we reach the head pointer's position. Then we check if we were the
+ *	  laziest backend: if our pointer is set to the same position as the global
+ *	  tail pointer is set, then we set it further to the second-laziest
+ *	  backend (We can identify it by inspecting the positions of all other
+ *	  backends' pointers). Whenever we move the tail pointer we also truncate
+ *	  now unused pages (i.e. delete files in pg_notify/ that are no longer
+ *	  used).
+ *	  Note that we really read _any_ available notification in the queue. We
+ *	  also read uncommitted notifications from transactions that could still
+ *	  roll back. We must not deliver the notifications of those transactions
+ *	  but just copy them out of the queue. We save them in the
+ *	  uncommittedNotifications list which we try to deliver every time we
+ *	  check for available notifications.
  *
- * An application that listens on the same relname it notifies will get
+ * An application that listens on the same channel it notifies will get
  * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
  * by comparing be_pid in the NOTIFY message to the application's own backend's
- * PID.  (As of FE/BE protocol 2.0, the backend's PID is provided to the
+ * Pid.  (As of FE/BE protocol 2.0, the backend's Pid is provided to the
  * frontend during startup.)  The above design guarantees that notifies from
- * other backends will never be missed by ignoring self-notifies.  Note,
- * however, that we do *not* guarantee that a separate frontend message will
- * be sent for every outside NOTIFY.  Since there is only room for one
- * originating PID in pg_listener, outside notifies occurring at about the
- * same time may be collapsed into a single message bearing the PID of the
- * first outside backend to perform the NOTIFY.
+ * other backends will never be missed by ignoring self-notifies.
  *-------------------------------------------------------------------------
  */
 
+/* XXX 
+ *
+ * TODO:
+ *  - guc parameter max_notifies_per_txn ??
+ *  - adapt comments
+ *  - test 2PC
+ *  - write error test
+ */
+
 #include "postgres.h"
 
 #include <unistd.h>
 #include <signal.h>
 
 #include "access/heapam.h"
+#include "access/slru.h"
+#include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "catalog/pg_listener.h"
@@ -108,8 +131,8 @@
 
 /*
  * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
- * all actions requested in the current transaction.  As explained above,
- * we don't actually modify pg_listener until we reach transaction commit.
+ * all actions requested in the current transaction. As explained above,
+ * we don't actually send notifications until we reach transaction commit.
  *
  * The list is kept in CurTransactionContext.  In subtransactions, each
  * subtransaction has its own list in its own CurTransactionContext, but
@@ -123,6 +146,12 @@
 	LISTEN_UNLISTEN_ALL
 } ListenActionKind;
 
+typedef enum
+{
+	READ_ALL_TO_UNCOMMITTED,
+	READ_ALL_SEND_COMMITTED
+} QueueProcessType;
+
 typedef struct
 {
 	ListenActionKind action;
@@ -133,8 +162,12 @@
 
 static List *upperPendingActions = NIL; /* list of upper-xact lists */
 
+static List *uncommittedNotifications = NIL;
+
+static bool needSignalBackends = false;
+
 /*
- * State for outbound notifies consists of a list of all relnames NOTIFYed
+ * State for outbound notifies consists of a list of all channels NOTIFYed
  * in the current transaction.	We do not actually perform a NOTIFY until
  * and unless the transaction commits.	pendingNotifies is NIL if no
  * NOTIFYs have been done in the current transaction.
@@ -149,12 +182,123 @@
  * condition name, it will get a self-notify at commit.  This is a bit odd
  * but is consistent with our historical behavior.
  */
-static List *pendingNotifies = NIL;		/* list of C strings */
 
+typedef struct Notification
+{
+	char		   *channel;
+	char		   *payload;
+	TransactionId	xid;
+	union {
+		/* we only need one of both, depending on whether we send a
+ 		 * notification or receive one. */
+		int32		dstPid;
+		int32		srcPid;
+	};
+} Notification;
+
+typedef struct AsyncQueueEntry
+{
+	/*
+	 * this record has the maximal length, but usually we limit it to
+	 * AsyncQueueEntryEmptySize + strlen(payload).
+	 */
+	Size			length;
+	Oid				dboid;
+	TransactionId	xid;
+	int32			srcPid;
+	char			channel[NAMEDATALEN];
+	char			payload[NOTIFY_PAYLOAD_MAX_LENGTH];
+} AsyncQueueEntry;
+#define AsyncQueueEntryEmptySize \
+	 (sizeof(AsyncQueueEntry) - NOTIFY_PAYLOAD_MAX_LENGTH + 1)
+
+#define	InvalidPid (-1)
+#define QUEUE_POS_PAGE(x) ((x).page)
+#define QUEUE_POS_OFFSET(x) ((x).offset)
+#define QUEUE_POS_EQUAL(x,y) \
+	 ((x).page == (y).page ? (x).offset == (y).offset : false)
+#define SET_QUEUE_POS(x,y,z) \
+	do { \
+		(x).page = (y); \
+		(x).offset = (z); \
+	} while (0)
+/* yields the logically earlier of positions x and y, with z = HEAD */
+#define QUEUE_POS_MIN(x,y,z) \
+	(asyncQueuePagePrecedesLogically((x).page, (y).page, (z).page) ? (x) : \
+		 asyncQueuePagePrecedesLogically((y).page, (x).page, (z).page) ? (y) : \
+			 (x).offset < (y).offset ? (x) : \
+			 	(y))
+#define QUEUE_BACKEND_POS(i) asyncQueueControl->backend[(i)].pos
+#define QUEUE_BACKEND_PID(i) asyncQueueControl->backend[(i)].pid
+#define QUEUE_HEAD asyncQueueControl->head
+#define QUEUE_TAIL asyncQueueControl->tail
+
+typedef struct QueuePosition
+{
+	int				page;
+	int				offset;
+} QueuePosition;
+
+typedef struct QueueBackendStatus
+{
+	int32			pid;
+	QueuePosition	pos;
+} QueueBackendStatus;
+
+/*
+ * The AsyncQueueControl structure is protected by the AsyncQueueLock.
+ *
+ * In SHARED mode, backends will only inspect their own entries as well as
+ * head and tail pointers. Consequently we can allow a backend to update its
+ * own record while holding only a shared lock (since no other backend will
+ * inspect it).
+ *
+ * In EXCLUSIVE mode, backends can inspect the entries of other backends and
+ * also change head and tail pointers.
+ *
+ * In order to avoid deadlocks, whenever we need both locks, we always first
+ * get AsyncQueueLock and then AsyncCtlLock.
+ */
+typedef struct AsyncQueueControl
+{
+	QueuePosition		head;		/* head points to the next free location */
+	QueuePosition 		tail;		/* the global tail is equivalent to the
+									   tail of the "slowest" backend */
+	TimestampTz			lastQueueFullWarn;	/* when the queue is full we only
+											   want to log that once in a
+											   while */
+	QueueBackendStatus	backend[1];	/* actually this one has as many entries as
+									 * connections are allowed (MaxBackends) */
+	/* DO NOT ADD FURTHER STRUCT MEMBERS HERE */
+} AsyncQueueControl;
+
+static AsyncQueueControl   *asyncQueueControl;
+static SlruCtlData			AsyncCtlData;
+
+#define AsyncCtl					(&AsyncCtlData)
+#define QUEUE_PAGESIZE				BLCKSZ
+#define QUEUE_FULL_WARN_INTERVAL	5000	/* warn at most once every 5s */
+
+/*
+ * slru.c currently assumes that all filenames are four characters of hex
+ * digits. That means that we can use segments 0000 through FFFF.
+ * Each segment contains SLRU_PAGES_PER_SEGMENT pages which gives us
+ * the pages from 0 to SLRU_PAGES_PER_SEGMENT * 0xFFFF.
+ *
+ * It's of course easy to enhance slru.c but those pages give us so much
+ * space already that it doesn't seem worth the trouble...
+ *
+ * It's a legal test case to define QUEUE_MAX_PAGE to a very small multiple of
+ * SLRU_PAGES_PER_SEGMENT to test queue full behaviour.
+ */
+#define QUEUE_MAX_PAGE			(SLRU_PAGES_PER_SEGMENT * 0xFFFF)
+
+static List *pendingNotifies = NIL;				/* list of Notifications */
 static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
+static List *listenChannels = NIL;	/* list of channels we are listening to */
 
 /*
- * State for inbound notifies consists of two flags: one saying whether
+ * State for inbound notifications consists of two flags: one saying whether
  * the signal handler is currently allowed to call ProcessIncomingNotify
  * directly, and one saying whether the signal has occurred but the handler
  * was not allowed to call ProcessIncomingNotify at the time.
@@ -171,37 +315,148 @@
 
 bool		Trace_notify = false;
 
-
 static void queue_listen(ListenActionKind action, const char *condname);
 static void Async_UnlistenOnExit(int code, Datum arg);
-static void Exec_Listen(Relation lRel, const char *relname);
-static void Exec_Unlisten(Relation lRel, const char *relname);
-static void Exec_UnlistenAll(Relation lRel);
-static void Send_Notify(Relation lRel);
+static bool IsListeningOn(const char *channel);
+static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
+static void Exec_Listen(const char *channel);
+static void Exec_Unlisten(const char *channel);
+static void Exec_UnlistenAll(void);
+static void SignalBackends(void);
+static void Send_Notify(void);
+static bool asyncQueuePagePrecedesPhysically(int p, int q);
+static bool asyncQueuePagePrecedesLogically(int p, int q, int head);
+static bool asyncQueueAdvance(QueuePosition *position, int entryLength);
+static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
+static void asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n);
+static List *asyncQueueAddEntries(List *notifications);
+static bool asyncQueueGetEntriesByPage(QueuePosition *current,
+									   QueuePosition stop,
+									   List **committed,
+									   MemoryContext committedContext,
+									   List **uncommitted,
+									   MemoryContext uncommittedContext);
+static void asyncQueueReadAllNotifications(QueueProcessType type);
+static void asyncQueueAdvanceTail(void);
 static void ProcessIncomingNotify(void);
-static void NotifyMyFrontEnd(char *relname, int32 listenerPID);
-static bool AsyncExistsPendingNotify(const char *relname);
+static void NotifyMyFrontEnd(const char *channel,
+							 const char *payload,
+							 int32 dstPid);
+static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
 static void ClearPendingActionsAndNotifies(void);
 
+/*
+ * We will work on the page range of 0..(SLRU_PAGES_PER_SEGMENT * 0xFFFF).
+ * asyncQueuePagePrecedesPhysically just checks numerically without any magic if
+ * one page precedes another one.
+ *
+ * On the other hand, when asyncQueuePagePrecedesLogically does that check, it
+ * takes the current head page number into account. Now if we have wrapped
+ * around, it can happen that p precedes q, even though p > q (if the head page
+ * is in between the two).
+ */ 
+static bool
+asyncQueuePagePrecedesPhysically(int p, int q)
+{
+	return p < q;
+}
+
+static bool
+asyncQueuePagePrecedesLogically(int p, int q, int head)
+{
+	if (p <= head && q <= head)
+		return p < q;
+	if (p > head && q > head)
+		return p < q;
+	if (p <= head)
+	{
+		Assert(q > head);
+		/* q is older */
+		return false;
+	}
+	else
+	{
+		Assert(p > head && q <= head);
+		/* p is older */
+		return true;
+	}
+}
+
+void
+AsyncShmemInit(void)
+{
+	bool	found;
+	int		slotno;
+	Size	size;
+
+	/*
+	 * Remember that sizeof(AsyncQueueControl) already contains one member of
+	 * QueueBackendStatus, so we only need to add the status space requirement
+	 * for MaxBackends-1 backends.
+	 */
+	size = mul_size(MaxBackends-1, sizeof(QueueBackendStatus));
+	size = add_size(size, sizeof(AsyncQueueControl));
+
+	asyncQueueControl = (AsyncQueueControl *)
+		ShmemInitStruct("Async Queue Control", size, &found);
+
+	if (!asyncQueueControl)
+		elog(ERROR, "out of memory");
+
+	if (!found)
+	{
+		int		i;
+		SET_QUEUE_POS(QUEUE_HEAD, 0, 0);
+		SET_QUEUE_POS(QUEUE_TAIL, QUEUE_MAX_PAGE, 0);
+		for (i = 0; i < MaxBackends; i++)
+		{
+			SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+			QUEUE_BACKEND_PID(i) = InvalidPid;
+		}
+	}
+
+	AsyncCtl->PagePrecedes = asyncQueuePagePrecedesPhysically;
+	SimpleLruInit(AsyncCtl, "Async Ctl", NUM_ASYNC_BUFFERS, 0,
+				  AsyncCtlLock, "pg_notify");
+	AsyncCtl->do_fsync = false;
+	asyncQueueControl->lastQueueFullWarn = GetCurrentTimestamp();
+
+	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+	LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
+	slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
+	AsyncCtl->shared->page_dirty[slotno] = true;
+	SimpleLruWritePage(AsyncCtl, slotno, NULL);
+	LWLockRelease(AsyncCtlLock);
+	LWLockRelease(AsyncQueueLock);
+
+	SlruScanDirectory(AsyncCtl, QUEUE_MAX_PAGE, true);
+}
+
 
 /*
  * Async_Notify
  *
  *		This is executed by the SQL notify command.
  *
- *		Adds the relation to the list of pending notifies.
+ *		Adds the channel to the list of pending notifies.
  *		Actual notification happens during transaction commit.
  *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  */
 void
-Async_Notify(const char *relname)
+Async_Notify(const char *channel, const char *payload)
 {
+
 	if (Trace_notify)
-		elog(DEBUG1, "Async_Notify(%s)", relname);
+		elog(DEBUG1, "Async_Notify(%s)", channel);
+
+	/*
+	 * XXX - do we now need a guc parameter max_notifies_per_txn?
+	 */ 
 
 	/* no point in making duplicate entries in the list ... */
-	if (!AsyncExistsPendingNotify(relname))
+	if (!AsyncExistsPendingNotify(channel, payload))
 	{
+		Notification *n;
 		/*
 		 * The name list needs to live until end of transaction, so store it
 		 * in the transaction context.
@@ -210,12 +465,21 @@
 
 		oldcontext = MemoryContextSwitchTo(CurTransactionContext);
 
+		n = (Notification *) palloc(sizeof(Notification));
+		/* will set the xid later... */
+		n->xid = InvalidTransactionId;
+		n->channel = pstrdup(channel);
+		if (payload)
+			n->payload = pstrdup(payload);
+		else
+			n->payload = "";
+		n->dstPid = InvalidPid;
+
 		/*
-		 * Ordering of the list isn't important.  We choose to put new entries
-		 * on the front, as this might make duplicate-elimination a tad faster
-		 * when the same condition is signaled many times in a row.
+		 * We want to preserve the order so we need to append every
+		 * notification. See comments at AsyncExistsPendingNotify().
 		 */
-		pendingNotifies = lcons(pstrdup(relname), pendingNotifies);
+		pendingNotifies = lappend(pendingNotifies, n);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -259,12 +523,12 @@
  *		This is executed by the SQL listen command.
  */
 void
-Async_Listen(const char *relname)
+Async_Listen(const char *channel)
 {
 	if (Trace_notify)
-		elog(DEBUG1, "Async_Listen(%s,%d)", relname, MyProcPid);
+		elog(DEBUG1, "Async_Listen(%s,%d)", channel, MyProcPid);
 
-	queue_listen(LISTEN_LISTEN, relname);
+	queue_listen(LISTEN_LISTEN, channel);
 }
 
 /*
@@ -273,16 +537,16 @@
  *		This is executed by the SQL unlisten command.
  */
 void
-Async_Unlisten(const char *relname)
+Async_Unlisten(const char *channel)
 {
 	if (Trace_notify)
-		elog(DEBUG1, "Async_Unlisten(%s,%d)", relname, MyProcPid);
+		elog(DEBUG1, "Async_Unlisten(%s,%d)", channel, MyProcPid);
 
 	/* If we couldn't possibly be listening, no need to queue anything */
 	if (pendingActions == NIL && !unlistenExitRegistered)
 		return;
 
-	queue_listen(LISTEN_UNLISTEN, relname);
+	queue_listen(LISTEN_UNLISTEN, channel);
 }
 
 /*
@@ -306,8 +570,6 @@
 /*
  * Async_UnlistenOnExit
  *
- *		Clean up the pg_listener table at backend exit.
- *
  *		This is executed if we have done any LISTENs in this backend.
  *		It might not be necessary anymore, if the user UNLISTENed everything,
  *		but we don't try to detect that case.
@@ -315,17 +577,8 @@
 static void
 Async_UnlistenOnExit(int code, Datum arg)
 {
-	/*
-	 * We need to start/commit a transaction for the unlisten, but if there is
-	 * already an active transaction we had better abort that one first.
-	 * Otherwise we'd end up committing changes that probably ought to be
-	 * discarded.
-	 */
 	AbortOutOfAnyTransaction();
-	/* Now we can do the unlisten */
-	StartTransactionCommand();
-	Async_UnlistenAll();
-	CommitTransactionCommand();
+	Exec_UnlistenAll();
 }
 
 /*
@@ -348,10 +601,15 @@
 	/* We can deal with pending NOTIFY though */
 	foreach(p, pendingNotifies)
 	{
-		const char *relname = (const char *) lfirst(p);
+		AsyncQueueEntry qe;
+		Notification   *n;
+
+		n = (Notification *) lfirst(p);
+
+		asyncQueueNotificationToEntry(n, &qe);
 
 		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
-							   relname, strlen(relname) + 1);
+							   &qe, qe.length);
 	}
 
 	/*
@@ -363,26 +621,24 @@
 }
 
 /*
- * AtCommit_Notify
- *
- *		This is called at transaction commit.
+ * AtCommit_NotifyBeforeCommit
  *
- *		If there are pending LISTEN/UNLISTEN actions, insert or delete
- *		tuples in pg_listener accordingly.
+ *		This is called at transaction commit, before actually committing to
+ *		clog.
  *
- *		If there are outbound notify requests in the pendingNotifies list,
- *		scan pg_listener for matching tuples, and either signal the other
- *		backend or send a message to our own frontend.
+ *		If there are pending LISTEN/UNLISTEN actions, update our
+ *		"listenChannels" list.
  *
- *		NOTE: we are still inside the current transaction, therefore can
- *		piggyback on its committing of changes.
+ *		If there are outbound notify requests in the pendingNotifies list, add
+ *		them to the global queue and signal any backend that is listening.
  */
 void
-AtCommit_Notify(void)
+AtCommit_NotifyBeforeCommit(void)
 {
-	Relation	lRel;
 	ListCell   *p;
 
+	needSignalBackends = false;
+
 	if (pendingActions == NIL && pendingNotifies == NIL)
 		return;					/* no relevant statements in this xact */
 
@@ -397,10 +653,7 @@
 	}
 
 	if (Trace_notify)
-		elog(DEBUG1, "AtCommit_Notify");
-
-	/* Acquire ExclusiveLock on pg_listener */
-	lRel = heap_open(ListenerRelationId, ExclusiveLock);
+		elog(DEBUG1, "AtCommit_NotifyBeforeCommit");
 
 	/* Perform any pending listen/unlisten actions */
 	foreach(p, pendingActions)
@@ -410,99 +663,111 @@
 		switch (actrec->action)
 		{
 			case LISTEN_LISTEN:
-				Exec_Listen(lRel, actrec->condname);
+				Exec_Listen(actrec->condname);
 				break;
 			case LISTEN_UNLISTEN:
-				Exec_Unlisten(lRel, actrec->condname);
+				Exec_Unlisten(actrec->condname);
 				break;
 			case LISTEN_UNLISTEN_ALL:
-				Exec_UnlistenAll(lRel);
+				Exec_UnlistenAll();
 				break;
 		}
-
-		/* We must CCI after each action in case of conflicting actions */
-		CommandCounterIncrement();
 	}
 
-	/* Perform any pending notifies */
-	if (pendingNotifies)
-		Send_Notify(lRel);
-
 	/*
-	 * We do NOT release the lock on pg_listener here; we need to hold it
-	 * until end of transaction (which is about to happen, anyway) to ensure
-	 * that notified backends see our tuple updates when they look. Else they
-	 * might disregard the signal, which would make the application programmer
-	 * very unhappy.  Also, this prevents race conditions when we have just
-	 * inserted a listening tuple.
+	 * Perform any pending notifies.
 	 */
-	heap_close(lRel, NoLock);
+	if (pendingNotifies)
+	{
+		needSignalBackends = true;
+		Send_Notify();
+	}
+}
+
+/*
+ * AtCommit_NotifyAfterCommit
+ *
+ *		This is called at transaction commit, after committing to clog.
+ *
+ *		Notify the listening backends.
+ */
+void
+AtCommit_NotifyAfterCommit(void)
+{
+	if (needSignalBackends)
+		SignalBackends();
 
 	ClearPendingActionsAndNotifies();
 
 	if (Trace_notify)
-		elog(DEBUG1, "AtCommit_Notify: done");
+		elog(DEBUG1, "AtCommit_NotifyAfterCommit: done");
+}
+
+/*
+ * This function is executed for every notification found in the queue in order
+ * to check if the current backend is listening on that channel. Not sure if we
+ * should further optimize this, for example convert to a sorted array and
+ * allow binary search on it...
+ */
+static bool
+IsListeningOn(const char *channel)
+{
+	ListCell   *p;
+
+	foreach(p, listenChannels)
+	{
+		char *lchan = (char *) lfirst(p);
+		if (strcmp(lchan, channel) == 0)
+			/* already listening on this channel */
+			return true;
+	}
+	return false;
 }
 
+
 /*
  * Exec_Listen --- subroutine for AtCommit_Notify
  *
- *		Register the current backend as listening on the specified relation.
+ *		Register the current backend as listening on the specified channel.
  */
 static void
-Exec_Listen(Relation lRel, const char *relname)
+Exec_Listen(const char *channel)
 {
-	HeapScanDesc scan;
-	HeapTuple	tuple;
-	Datum		values[Natts_pg_listener];
-	bool		nulls[Natts_pg_listener];
-	NameData	condname;
-	bool		alreadyListener = false;
+	MemoryContext oldcontext;
 
 	if (Trace_notify)
-		elog(DEBUG1, "Exec_Listen(%s,%d)", relname, MyProcPid);
+		elog(DEBUG1, "Exec_Listen(%s,%d)", channel, MyProcPid);
 
-	/* Detect whether we are already listening on this relname */
-	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
-	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-	{
-		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
-
-		if (listener->listenerpid == MyProcPid &&
-			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
-		{
-			alreadyListener = true;
-			/* No need to scan the rest of the table */
-			break;
-		}
-	}
-	heap_endscan(scan);
-
-	if (alreadyListener)
+	/* Detect whether we are already listening on this channel */
+	if (IsListeningOn(channel))
 		return;
 
 	/*
-	 * OK to insert a new tuple
+	 * OK to insert to the list.
 	 */
-	memset(nulls, false, sizeof(nulls));
-
-	namestrcpy(&condname, relname);
-	values[Anum_pg_listener_relname - 1] = NameGetDatum(&condname);
-	values[Anum_pg_listener_listenerpid - 1] = Int32GetDatum(MyProcPid);
-	values[Anum_pg_listener_notification - 1] = Int32GetDatum(0);		/* no notifies pending */
-
-	tuple = heap_form_tuple(RelationGetDescr(lRel), values, nulls);
-
-	simple_heap_insert(lRel, tuple);
-
-#ifdef NOT_USED					/* currently there are no indexes */
-	CatalogUpdateIndexes(lRel, tuple);
-#endif
+	if (listenChannels == NIL)
+	{
+		/*
+		 * This is our first LISTEN, establish our pointer.
+		 */
+		LWLockAcquire(AsyncQueueLock, LW_SHARED);
+		QUEUE_BACKEND_POS(MyBackendId) = QUEUE_HEAD;
+		QUEUE_BACKEND_PID(MyBackendId) = MyProcPid;
+		LWLockRelease(AsyncQueueLock);
+		/*
+		 * Actually this is only necessary if we are the first listener
+		 * (The tail pointer needs to be identical with the pointer of at
+		 * least one backend).
+		 */
+		asyncQueueAdvanceTail();
+	}
 
-	heap_freetuple(tuple);
+	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+	listenChannels = lappend(listenChannels, pstrdup(channel));
+	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * now that we are listening, make sure we will unlisten before dying.
+	 * Now that we are listening, make sure we will unlisten before dying.
 	 */
 	if (!unlistenExitRegistered)
 	{
@@ -514,38 +779,53 @@
 /*
  * Exec_Unlisten --- subroutine for AtCommit_Notify
  *
- *		Remove the current backend from the list of listening backends
- *		for the specified relation.
+ *		Remove the specified channel from "listenChannels".
  */
 static void
-Exec_Unlisten(Relation lRel, const char *relname)
+Exec_Unlisten(const char *channel)
 {
-	HeapScanDesc scan;
-	HeapTuple	tuple;
+	ListCell   *p;
+	ListCell   *prev = NULL;
 
 	if (Trace_notify)
-		elog(DEBUG1, "Exec_Unlisten(%s,%d)", relname, MyProcPid);
+		elog(DEBUG1, "Exec_Unlisten(%s,%d)", channel, MyProcPid);
 
-	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
-	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	/* Look for the channel in the list of channels we are listening on */
+	foreach(p, listenChannels)
 	{
-		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
-
-		if (listener->listenerpid == MyProcPid &&
-			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
+		char *lchan = (char *) lfirst(p);
+		if (strcmp(lchan, channel) == 0)
 		{
-			/* Found the matching tuple, delete it */
-			simple_heap_delete(lRel, &tuple->t_self);
-
 			/*
-			 * We assume there can be only one match, so no need to scan the
-			 * rest of the table
+			 * Since the list is living in the TopMemoryContext, we free
+			 * the memory. The ListCell is freed by list_delete_cell().
 			 */
-			break;
+			pfree(lchan);
+			listenChannels = list_delete_cell(listenChannels, p, prev);
+			if (listenChannels == NIL)
+			{
+				bool advanceTail = false;
+				/*
+				 * This backend is not listening anymore.
+				 */
+				LWLockAcquire(AsyncQueueLock, LW_SHARED);
+				QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
+
+				/*
+				 * If our position was the tail, advance the tail pointer.
+				 */
+				if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
+					advanceTail = true;
+				LWLockRelease(AsyncQueueLock);
+
+				if (advanceTail)
+					asyncQueueAdvanceTail();
+			}
+			return;
 		}
+		prev = p;
 	}
-	heap_endscan(scan);
-
+
 	/*
 	 * We do not complain about unlistening something not being listened;
 	 * should we?
@@ -555,123 +835,300 @@
 /*
  * Exec_UnlistenAll --- subroutine for AtCommit_Notify
  *
- *		Update pg_listener to unlisten all relations for this backend.
+ *		Unlisten on all channels for this backend.
  */
 static void
-Exec_UnlistenAll(Relation lRel)
+Exec_UnlistenAll(void)
 {
-	HeapScanDesc scan;
-	HeapTuple	lTuple;
-	ScanKeyData key[1];
+	bool advanceTail = false;
 
 	if (Trace_notify)
-		elog(DEBUG1, "Exec_UnlistenAll");
+		elog(DEBUG1, "Exec_UnlistenAll(%d)", MyProcPid);
 
-	/* Find and delete all entries with my listenerPID */
-	ScanKeyInit(&key[0],
-				Anum_pg_listener_listenerpid,
-				BTEqualStrategyNumber, F_INT4EQ,
-				Int32GetDatum(MyProcPid));
-	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
+	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+	QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
 
-	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-		simple_heap_delete(lRel, &lTuple->t_self);
+	/*
+	 * Since the list is living in the TopMemoryContext, we free the memory.
+	 */
+	list_free_deep(listenChannels);
+	listenChannels = NIL;
 
-	heap_endscan(scan);
+	/*
+	 * If our position was the tail, advance the tail pointer.
+	 */
+	if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
+		advanceTail = true;
+	LWLockRelease(AsyncQueueLock);
+
+	if (advanceTail)
+		asyncQueueAdvanceTail();
+}
+
+static bool
+asyncQueueIsFull(void)
+{
+	QueuePosition	lookahead = QUEUE_HEAD;
+	Size remain = QUEUE_PAGESIZE - QUEUE_POS_OFFSET(lookahead) - 1;
+	Size advance = Min(remain, NOTIFY_PAYLOAD_MAX_LENGTH);
+
+	/*
+	 * Check what happens if we wrote a maximally sized entry. Would we go to a
+	 * new page? If not, then our queue can not be full (because we can still
+	 * fill at least the current page with at least one more entry).
+	 */
+	if (!asyncQueueAdvance(&lookahead, advance))
+		return false;
+
+	/*
+	 * The queue is full if with a switch to a new page we reach the page
+	 * of the tail pointer.
+	 */
+	return QUEUE_POS_PAGE(lookahead) == QUEUE_POS_PAGE(QUEUE_TAIL);
 }
 
 /*
- * Send_Notify --- subroutine for AtCommit_Notify
- *
- *		Scan pg_listener for tuples matching our pending notifies, and
- *		either signal the other backend or send a message to our own frontend.
+ * The function advances the position to the next entry. In case we jump to
+ * a new page the function returns true, else false.
  */
+static bool
+asyncQueueAdvance(QueuePosition *position, int entryLength)
+{
+	int		pageno = QUEUE_POS_PAGE(*position);
+	int		offset = QUEUE_POS_OFFSET(*position);
+	bool	pageJump = false;
+
+	/*
+	 * Move to the next writing position: First jump over what we have just
+	 * written or read.
+	 */
+	offset += entryLength;
+	Assert(offset < QUEUE_PAGESIZE);
+
+	/*
+	 * In a second step check if another entry can be written to the page. If
+	 * it can, stay here, we have reached the next position. If not, then we
+	 * need to move on to the next page.
+	 */
+	if (offset + AsyncQueueEntryEmptySize >= QUEUE_PAGESIZE)
+	{
+		pageno++;
+		if (pageno > QUEUE_MAX_PAGE)
+			/* wrap around */
+			pageno = 0;
+		offset = 0;
+		pageJump = true;
+	}
+
+	SET_QUEUE_POS(*position, pageno, offset);
+	return pageJump;
+}
+
+static void
+asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
+{
+	Assert(n->channel);
+	Assert(n->payload);
+	Assert(strlen(n->payload) <= NOTIFY_PAYLOAD_MAX_LENGTH);
+
+	/* The terminator is already included in AsyncQueueEntryEmptySize */
+	qe->length = AsyncQueueEntryEmptySize + strlen(n->payload);
+	qe->srcPid = MyProcPid;
+	qe->dboid = MyDatabaseId;
+	qe->xid = GetCurrentTransactionId();
+	strcpy(qe->channel, n->channel);
+	strcpy(qe->payload, n->payload);
+}
+
 static void
-Send_Notify(Relation lRel)
+asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n)
+{
+	n->channel = pstrdup(qe->channel);
+	n->payload = pstrdup(qe->payload);
+	n->srcPid = qe->srcPid;
+	n->xid = qe->xid;
+}
+
+static List *
+asyncQueueAddEntries(List *notifications)
 {
-	TupleDesc	tdesc = RelationGetDescr(lRel);
-	HeapScanDesc scan;
-	HeapTuple	lTuple,
-				rTuple;
-	Datum		value[Natts_pg_listener];
-	bool		repl[Natts_pg_listener],
-				nulls[Natts_pg_listener];
-
-	/* preset data to update notify column to MyProcPid */
-	memset(nulls, false, sizeof(nulls));
-	memset(repl, false, sizeof(repl));
-	repl[Anum_pg_listener_notification - 1] = true;
-	memset(value, 0, sizeof(value));
-	value[Anum_pg_listener_notification - 1] = Int32GetDatum(MyProcPid);
-
-	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
-
-	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-	{
-		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
-		char	   *relname = NameStr(listener->relname);
-		int32		listenerPID = listener->listenerpid;
+	int				pageno;
+	int				offset;
+	int				slotno;
+	AsyncQueueEntry	qe;
 
-		if (!AsyncExistsPendingNotify(relname))
-			continue;
+	/*
+	 * Note that we are holding exclusive AsyncQueueLock already.
+	 */
+	LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
+	pageno = QUEUE_POS_PAGE(QUEUE_HEAD);
+	slotno = SimpleLruReadPage(AsyncCtl, pageno, true, InvalidTransactionId);
+	AsyncCtl->shared->page_dirty[slotno] = true;
 
-		if (listenerPID == MyProcPid)
+	do
+	{
+		Notification   *n;
+
+		if (asyncQueueIsFull())
 		{
-			/*
-			 * Self-notify: no need to bother with table update. Indeed, we
-			 * *must not* clear the notification field in this path, or we
-			 * could lose an outside notify, which'd be bad for applications
-			 * that ignore self-notify messages.
-			 */
-			if (Trace_notify)
-				elog(DEBUG1, "AtCommit_Notify: notifying self");
+			/* document that we will not enter the if statement further down */
+			Assert(QUEUE_POS_OFFSET(QUEUE_HEAD) != 0);
+			break;
+		}
+
+		n = (Notification *) linitial(notifications);
+
+		asyncQueueNotificationToEntry(n, &qe);
 
-			NotifyMyFrontEnd(relname, listenerPID);
+		offset = QUEUE_POS_OFFSET(QUEUE_HEAD);
+		/*
+		 * Check whether or not the entry still fits on the current page.
+		 */
+		if (offset + qe.length < QUEUE_PAGESIZE)
+		{
+			notifications = list_delete_first(notifications);
 		}
 		else
 		{
-			if (Trace_notify)
-				elog(DEBUG1, "AtCommit_Notify: notifying pid %d",
-					 listenerPID);
-
 			/*
-			 * If someone has already notified this listener, we don't bother
-			 * modifying the table, but we do still send a NOTIFY_INTERRUPT
-			 * signal, just in case that backend missed the earlier signal for
-			 * some reason.  It's OK to send the signal first, because the
-			 * other guy can't read pg_listener until we unlock it.
-			 *
-			 * Note: we don't have the other guy's BackendId available, so
-			 * this will incur a search of the ProcSignal table.  That's
-			 * probably not worth worrying about.
+			 * Write a dummy entry to fill up the page. Actually readers will
+			 * only check dboid and since it won't match any reader's database
+			 * oid, they will ignore this entry and move on.
 			 */
-			if (SendProcSignal(listenerPID, PROCSIG_NOTIFY_INTERRUPT,
-							   InvalidBackendId) < 0)
-			{
-				/*
-				 * Get rid of pg_listener entry if it refers to a PID that no
-				 * longer exists.  Presumably, that backend crashed without
-				 * deleting its pg_listener entries. This code used to only
-				 * delete the entry if errno==ESRCH, but as far as I can see
-				 * we should just do it for any failure (certainly at least
-				 * for EPERM too...)
-				 */
-				simple_heap_delete(lRel, &lTuple->t_self);
-			}
-			else if (listener->notification == 0)
-			{
-				/* Rewrite the tuple with my PID in notification column */
-				rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
-				simple_heap_update(lRel, &lTuple->t_self, rTuple);
-
-#ifdef NOT_USED					/* currently there are no indexes */
-				CatalogUpdateIndexes(lRel, rTuple);
-#endif
-			}
+			qe.length = QUEUE_PAGESIZE - offset - 1;
+			qe.dboid = InvalidOid;
+			qe.channel[0] = '\0';
+			qe.payload[0] = '\0';
+			qe.xid = InvalidTransactionId;
 		}
+		memcpy((char*) AsyncCtl->shared->page_buffer[slotno] + offset,
+			   &qe, qe.length);
+
+	} while (!asyncQueueAdvance(&(QUEUE_HEAD), qe.length)
+			 && notifications != NIL);
+
+	if (QUEUE_POS_OFFSET(QUEUE_HEAD) == 0)
+	{
+		/*
+		 * If the next entry needs to go to a new page, prepare that page
+		 * already.
+		 */
+		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
+		AsyncCtl->shared->page_dirty[slotno] = true;
 	}
+	LWLockRelease(AsyncCtlLock);
+
+	return notifications;
+}
+
+static void
+asyncQueueFullWarning(void)
+{
+	/*
+	 * Caller must hold exclusive AsyncQueueLock.
+	 */
+	TimestampTz		t = GetCurrentTimestamp();
+	QueuePosition	min = QUEUE_HEAD;
+	int32			minPid = InvalidPid;
+	int				i;
+
+	for (i = 0; i < MaxBackends; i++)
+		if (QUEUE_BACKEND_PID(i) != InvalidPid)
+		{
+			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
+			minPid = QUEUE_POS_EQUAL(min, QUEUE_BACKEND_POS(i)) ? QUEUE_BACKEND_PID(i) : minPid;
+		}
 
-	heap_endscan(scan);
+	if (TimestampDifferenceExceeds(asyncQueueControl->lastQueueFullWarn,
+								   t, QUEUE_FULL_WARN_INTERVAL))
+	{
+		ereport(WARNING, (errmsg("pg_notify queue is full. Among the slowest backends: %d", minPid)));
+		asyncQueueControl->lastQueueFullWarn = t;
+	}
+}
+
+/*
+ * Send_Notify --- subroutine for AtCommit_NotifyBeforeCommit
+ *
+ * Add the pending notifications to the queue and signal the listening
+ * backends.
+ *
+ * A full queue is very uncommon and should really not happen, given that we
+ * have so much space available in our slru pages. Nevertheless we need to
+ * deal with this possibility. Note that when we get here we are in the process
+ * of committing our transaction, we have not yet committed to clog but this
+ * would be the next step.
+ */
+static void
+Send_Notify(void)
+{
+	while (pendingNotifies != NIL)
+	{
+		LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+		while (asyncQueueIsFull())
+		{
+			asyncQueueFullWarning();
+			LWLockRelease(AsyncQueueLock);
+
+			SignalBackends();
+
+			asyncQueueReadAllNotifications(READ_ALL_TO_UNCOMMITTED);
+
+			asyncQueueAdvanceTail();
+			pg_usleep(100 * 1000L); /* 100ms */
+			LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+		}
+		Assert(pendingNotifies != NIL);
+		pendingNotifies = asyncQueueAddEntries(pendingNotifies);
+		LWLockRelease(AsyncQueueLock);
+	}
+}
+
+/*
+ * Send signals to all listening backends. It would be easy here to check
+ * for backends that are already up-to-date, i.e.
+ *
+ *   QUEUE_BACKEND_POS(pid) == QUEUE_HEAD
+ *
+ * but we need to signal them anyway. If we didn't, we would not have the
+ * guarantee that they can deliver their notifications from
+ * uncommittedNotifications.
+ *
+ * Since we know the BackendId and the Pid the signalling is quite cheap.
+ */
+static void
+SignalBackends(void)
+{
+	ListCell	   *p1, *p2;
+	int				i;
+	int32			pid;
+	List		   *pids = NIL;
+	List		   *ids = NIL;
+
+	/* Signal everybody who is LISTENing to any channel. */
+	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+	for (i = 0; i < MaxBackends; i++)
+	{
+		pid = QUEUE_BACKEND_PID(i);
+		if (pid != InvalidPid)
+		{
+			pids = lappend_int(pids, pid);
+			ids = lappend_int(ids, i);
+		}
+	}
+	LWLockRelease(AsyncQueueLock);
+
+	forboth(p1, pids, p2, ids)
+	{
+		pid = (int32) lfirst_int(p1);
+		i = lfirst_int(p2);
+		/*
+		 * Should we check for failure? Can it happen that a backend
+		 * has crashed without the postmaster starting over?
+		 */
+		if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, i) < 0)
+			elog(WARNING, "Error signalling backend %d", pid);
+	}
 }
 
 /*
@@ -940,29 +1397,211 @@
 }
 
 /*
+ * This function will ask for a page with ReadOnly access and once we have the
+ * lock, we read the whole content and pass back two lists of notifications
+ * that the calling function will then deliver. The first list will contain all
+ * notifications from transactions that have already committed and the second
+ * one will contain uncommitted notifications.
+ *
+ * We stop if we have either reached the stop position or go to a new page.
+ *
+ * If we have reached the stop position, return true, else false.
+ */
+static bool
+asyncQueueGetEntriesByPage(QueuePosition *current, QueuePosition stop,
+						   List **committed, MemoryContext committedContext,
+						   List **uncommitted, MemoryContext uncommittedContext)
+{
+	int				slotno;
+	AsyncQueueEntry	qe;
+	Notification   *n;
+
+	if (QUEUE_POS_EQUAL(*current, stop))
+		return true;
+
+	slotno = SimpleLruReadPage_ReadOnly(AsyncCtl, current->page,
+										InvalidTransactionId);
+	do {
+		char *readPtr = (char *) (AsyncCtl->shared->page_buffer[slotno]);
+		readPtr += current->offset;
+
+		if (QUEUE_POS_EQUAL(*current, stop))
+			break;
+
+		memcpy(&qe, readPtr, AsyncQueueEntryEmptySize);
+
+		if (qe.dboid == MyDatabaseId && IsListeningOn(qe.channel))
+		{
+			MemoryContext	oldcontext;
+
+			/* read the whole entry only if we are really interested in it */
+			if (qe.length > AsyncQueueEntryEmptySize)
+				memcpy(&qe, readPtr, qe.length);
+
+			if (TransactionIdDidCommit(qe.xid))
+			{
+				oldcontext = MemoryContextSwitchTo(committedContext);
+				n = (Notification *) palloc(sizeof(Notification));
+				asyncQueueEntryToNotification(&qe, n);
+				*committed = lappend(*committed, n);
+				MemoryContextSwitchTo(oldcontext);
+			}
+			else
+			{
+				if (!TransactionIdDidAbort(qe.xid))
+				{
+					oldcontext = MemoryContextSwitchTo(uncommittedContext);
+					n = (Notification *) palloc(sizeof(Notification));
+					asyncQueueEntryToNotification(&qe, n);
+					*uncommitted = lappend(*uncommitted, n);
+					MemoryContextSwitchTo(oldcontext);
+				}
+			}
+		}
+		/*
+		 * The call to asyncQueueAdvance just jumps over what we have
+		 * just read. If there is no more space for the next record on the
+		 * current page, it will also switch to the beginning of the next page.
+		 */
+	} while(!asyncQueueAdvance(current, qe.length));
+
+	LWLockRelease(AsyncCtlLock);
+
+	if (QUEUE_POS_EQUAL(*current, stop))
+		return true;
+
+	return false;
+}
+
+static void
+asyncQueueReadAllNotifications(QueueProcessType type)
+{
+	QueuePosition	pos;
+	QueuePosition	oldpos;
+	QueuePosition	head;
+	List		   *notifications;
+	ListCell	   *lc;
+	Notification   *n;
+	bool			advanceTail = false;
+	bool			reachedStop;
+
+	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+	pos = oldpos = QUEUE_BACKEND_POS(MyBackendId);
+	head = QUEUE_HEAD;
+	LWLockRelease(AsyncQueueLock);
+
+	/* Nothing to do, we have read all notifications already. */
+	if (QUEUE_POS_EQUAL(pos, head))
+		return;
+
+	do
+	{
+		/*
+		 * Our stop position is what we found to be the head's position when
+		 * we entered this function. It might have changed already. But if it
+		 * has, we will receive (or have already received and queued) another
+		 * signal and come here again.
+		 *
+		 * We are not holding AsyncQueueLock here! The queue can only extend
+		 * beyond the head pointer (see above) and we leave our backend's
+		 * pointer where it is so nobody will truncate or rewrite pages under
+		 * us.
+		 */
+		reachedStop = false;
+
+		if (type == READ_ALL_TO_UNCOMMITTED)
+			/*
+			 * If the queue is full, we call this in the writing backend.
+			 * If a backend sends more notifications than the queue can hold
+			 * it also needs to read its own notifications from time to time
+			 * so that it can reuse the space of the queue.
+			 */
+			reachedStop = asyncQueueGetEntriesByPage(&pos, head,
+								   &uncommittedNotifications, TopMemoryContext,
+								   &uncommittedNotifications, TopMemoryContext);
+		else
+		{
+			/*
+			 * This is called from ProcessIncomingNotify()
+			 */
+			Assert(type == READ_ALL_SEND_COMMITTED);
+
+			notifications = NIL;
+			reachedStop = asyncQueueGetEntriesByPage(&pos, head,
+								   &notifications, CurrentMemoryContext,
+								   &uncommittedNotifications, TopMemoryContext);
+
+			foreach(lc, notifications)
+			{
+				n = (Notification *) lfirst(lc);
+				NotifyMyFrontEnd(n->channel, n->payload, n->srcPid);
+			}
+		}
+	} while (!reachedStop);
+
+	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+	QUEUE_BACKEND_POS(MyBackendId) = pos;
+	if (QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL))
+		advanceTail = true;
+	LWLockRelease(AsyncQueueLock);
+
+	if (advanceTail)
+		/* Move forward the tail pointer and try to truncate. */
+		asyncQueueAdvanceTail();
+}
+
+static void
+asyncQueueAdvanceTail(void)
+{
+	QueuePosition	min;
+	int				i;
+	int				tailPage;
+	int				headPage;
+
+	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+	min = QUEUE_HEAD;
+	for (i = 0; i < MaxBackends; i++)
+		if (QUEUE_BACKEND_PID(i) != InvalidPid)
+			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
+
+	tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
+	headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
+	QUEUE_TAIL = min;
+	LWLockRelease(AsyncQueueLock);
+
+	/* This is our wraparound check */
+	if (asyncQueuePagePrecedesLogically(tailPage, QUEUE_POS_PAGE(min), headPage)
+		&& asyncQueuePagePrecedesPhysically(tailPage, headPage))
+	{
+		/*
+		 * SimpleLruTruncate() will ask for AsyncCtlLock but will also
+		 * release the lock again.
+		 *
+		 * Don't even bother grabbing the lock unless the tail has moved by
+		 * more than one segment's worth of pages...
+		 */
+		if (QUEUE_POS_PAGE(min) - tailPage > SLRU_PAGES_PER_SEGMENT)
+			SimpleLruTruncate(AsyncCtl, QUEUE_POS_PAGE(min));
+	}
+}
+
+/*
  * ProcessIncomingNotify
  *
  *		Deal with arriving NOTIFYs from other backends.
  *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
  *		signal handler, or the next time control reaches the outer idle loop.
- *		Scan pg_listener for arriving notifies, report them to my front end,
- *		and clear the notification field in pg_listener until next time.
+ *		Scan the queue for arriving notifications and report them to my front
+ *		end.
  *
- *		NOTE: since we are outside any transaction, we must create our own.
+ *		NOTE: we are outside of any transaction here.
  */
 static void
 ProcessIncomingNotify(void)
 {
-	Relation	lRel;
-	TupleDesc	tdesc;
-	ScanKeyData key[1];
-	HeapScanDesc scan;
-	HeapTuple	lTuple,
-				rTuple;
-	Datum		value[Natts_pg_listener];
-	bool		repl[Natts_pg_listener],
-				nulls[Natts_pg_listener];
-	bool		catchup_enabled;
+	bool			catchup_enabled;
+
+	Assert(GetCurrentTransactionIdIfAny() == InvalidTransactionId);
 
 	/* Must prevent catchup interrupt while I am running */
 	catchup_enabled = DisableCatchupInterrupt();
@@ -974,64 +1613,36 @@
 
 	notifyInterruptOccurred = 0;
 
-	StartTransactionCommand();
-
-	lRel = heap_open(ListenerRelationId, ExclusiveLock);
-	tdesc = RelationGetDescr(lRel);
-
-	/* Scan only entries with my listenerPID */
-	ScanKeyInit(&key[0],
-				Anum_pg_listener_listenerpid,
-				BTEqualStrategyNumber, F_INT4EQ,
-				Int32GetDatum(MyProcPid));
-	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
-
-	/* Prepare data for rewriting 0 into notification field */
-	memset(nulls, false, sizeof(nulls));
-	memset(repl, false, sizeof(repl));
-	repl[Anum_pg_listener_notification - 1] = true;
-	memset(value, 0, sizeof(value));
-	value[Anum_pg_listener_notification - 1] = Int32GetDatum(0);
-
-	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
-	{
-		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
-		char	   *relname = NameStr(listener->relname);
-		int32		sourcePID = listener->notification;
+	/*
+	 * Work on the uncommitted notifications list until we hit the first
+	 * still-running transaction.
+	 */
+	while (uncommittedNotifications != NIL)
+	{
+		ListCell	   *lc;
+		Notification   *n;
 
-		if (sourcePID != 0)
+		n = (Notification *) linitial(uncommittedNotifications);
+		if (TransactionIdDidCommit(n->xid))
 		{
-			/* Notify the frontend */
-
-			if (Trace_notify)
-				elog(DEBUG1, "ProcessIncomingNotify: received %s from %d",
-					 relname, (int) sourcePID);
-
-			NotifyMyFrontEnd(relname, sourcePID);
-
-			/*
-			 * Rewrite the tuple with 0 in notification column.
-			 */
-			rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
-			simple_heap_update(lRel, &lTuple->t_self, rTuple);
-
-#ifdef NOT_USED					/* currently there are no indexes */
-			CatalogUpdateIndexes(lRel, rTuple);
-#endif
+			/* could the client have sent an unlisten already? */
+			if (IsListeningOn(n->channel))
+				NotifyMyFrontEnd(n->channel, n->payload, n->srcPid);
+		}
+		else
+		{
+			if (!TransactionIdDidAbort(n->xid))
+				/* n->xid still running */
+				break;
 		}
+		pfree(n->channel);
+		pfree(n->payload);
+		lc = list_head(uncommittedNotifications);
+		uncommittedNotifications
+			= list_delete_cell(uncommittedNotifications, lc, NULL);
 	}
-	heap_endscan(scan);
-
-	/*
-	 * We do NOT release the lock on pg_listener here; we need to hold it
-	 * until end of transaction (which is about to happen, anyway) to ensure
-	 * that other backends see our tuple updates when they look. Otherwise, a
-	 * transaction started after this one might mistakenly think it doesn't
-	 * need to send this backend a new NOTIFY.
-	 */
-	heap_close(lRel, NoLock);
 
-	CommitTransactionCommand();
+	asyncQueueReadAllNotifications(READ_ALL_SEND_COMMITTED);
 
 	/*
 	 * Must flush the notify messages to ensure frontend gets them promptly.
@@ -1051,20 +1662,17 @@
  * Send NOTIFY message to my front end.
  */
 static void
-NotifyMyFrontEnd(char *relname, int32 listenerPID)
+NotifyMyFrontEnd(const char *channel, const char *payload, int32 srcPid)
 {
 	if (whereToSendOutput == DestRemote)
 	{
 		StringInfoData buf;
 
 		pq_beginmessage(&buf, 'A');
-		pq_sendint(&buf, listenerPID, sizeof(int32));
-		pq_sendstring(&buf, relname);
+		pq_sendint(&buf, srcPid, sizeof(int32));
+		pq_sendstring(&buf, channel);
 		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
-		{
-			/* XXX Add parameter string here later */
-			pq_sendstring(&buf, "");
-		}
+			pq_sendstring(&buf, payload);
 		pq_endmessage(&buf);
 
 		/*
@@ -1074,23 +1682,57 @@
 		 */
 	}
 	else
-		elog(INFO, "NOTIFY for %s", relname);
+		elog(INFO, "NOTIFY for %s", channel);
 }
 
-/* Does pendingNotifies include the given relname? */
+/* Does pendingNotifies include the given channel/payload? */
 static bool
-AsyncExistsPendingNotify(const char *relname)
+AsyncExistsPendingNotify(const char *channel, const char *payload)
 {
 	ListCell   *p;
+	Notification *n;
 
-	foreach(p, pendingNotifies)
-	{
-		const char *prelname = (const char *) lfirst(p);
+	if (pendingNotifies == NIL)
+		return false;
 
-		if (strcmp(prelname, relname) == 0)
+	if (payload == NULL)
+		payload = "";
+
+	/*
+	 * We need to append new elements to the end of the list in order to keep
+	 * the order. However, on the other hand we'd like to check the list
+	 * backwards in order to make duplicate-elimination a tad faster when the
+	 * same condition is signaled many times in a row. So as a compromise we
+	 * check the tail element first which we can access directly. If this
+	 * doesn't match, we check the rest of whole list.
+	 */
+
+	n = (Notification *) llast(pendingNotifies);
+	if (strcmp(n->channel, channel) == 0)
+	{
+		Assert(n->payload != NULL);
+		if (strcmp(n->payload, payload) == 0)
 			return true;
 	}
 
+	/*
+	 * Note the difference to foreach(). We stop if p is the last element
+	 * already. So we don't check the last element, we have checked it already.
+ 	 */
+	for(p = list_head(pendingNotifies);
+		p != list_tail(pendingNotifies);
+		p = lnext(p))
+	{
+		n = (Notification *) lfirst(p);
+
+		if (strcmp(n->channel, channel) == 0)
+		{
+			Assert(n->payload != NULL);
+			if (strcmp(n->payload, payload) == 0)
+				return true;
+		}
+	}
+
 	return false;
 }
 
@@ -1124,5 +1766,11 @@
 	 * there is any significant delay before I commit.	OK for now because we
 	 * disallow COMMIT PREPARED inside a transaction block.)
 	 */
-	Async_Notify((char *) recdata);
+	AsyncQueueEntry		*qe = (AsyncQueueEntry *) recdata;
+
+	Assert(qe->dboid == MyDatabaseId);
+	Assert(qe->length == len);
+
+	Async_Notify(qe->channel, qe->payload);
 }
+
diff -ur cvs/src/backend/nodes/copyfuncs.c cvs.build/src/backend/nodes/copyfuncs.c
--- cvs/src/backend/nodes/copyfuncs.c	2009-11-18 10:19:30.000000000 +0100
+++ cvs.build/src/backend/nodes/copyfuncs.c	2009-11-18 10:20:54.000000000 +0100
@@ -2761,6 +2761,7 @@
 	NotifyStmt *newnode = makeNode(NotifyStmt);
 
 	COPY_STRING_FIELD(conditionname);
+	COPY_STRING_FIELD(payload);
 
 	return newnode;
 }
diff -ur cvs/src/backend/nodes/equalfuncs.c cvs.build/src/backend/nodes/equalfuncs.c
--- cvs/src/backend/nodes/equalfuncs.c	2009-11-18 10:19:30.000000000 +0100
+++ cvs.build/src/backend/nodes/equalfuncs.c	2009-11-18 10:20:54.000000000 +0100
@@ -1321,6 +1321,7 @@
 _equalNotifyStmt(NotifyStmt *a, NotifyStmt *b)
 {
 	COMPARE_STRING_FIELD(conditionname);
+	COMPARE_STRING_FIELD(payload);
 
 	return true;
 }
diff -ur cvs/src/backend/nodes/outfuncs.c cvs.build/src/backend/nodes/outfuncs.c
--- cvs/src/backend/nodes/outfuncs.c	2009-11-18 10:19:30.000000000 +0100
+++ cvs.build/src/backend/nodes/outfuncs.c	2009-11-18 10:20:54.000000000 +0100
@@ -1811,6 +1811,7 @@
 	WRITE_NODE_TYPE("NOTIFY");
 
 	WRITE_STRING_FIELD(conditionname);
+	WRITE_STRING_FIELD(payload);
 }
 
 static void
diff -ur cvs/src/backend/nodes/readfuncs.c cvs.build/src/backend/nodes/readfuncs.c
--- cvs/src/backend/nodes/readfuncs.c	2009-10-31 14:47:48.000000000 +0100
+++ cvs.build/src/backend/nodes/readfuncs.c	2009-11-18 10:20:54.000000000 +0100
@@ -231,6 +231,7 @@
 	READ_LOCALS(NotifyStmt);
 
 	READ_STRING_FIELD(conditionname);
+	READ_STRING_FIELD(payload);
 
 	READ_DONE();
 }
diff -ur cvs/src/backend/parser/gram.y cvs.build/src/backend/parser/gram.y
--- cvs/src/backend/parser/gram.y	2009-11-18 10:19:30.000000000 +0100
+++ cvs.build/src/backend/parser/gram.y	2009-11-18 10:20:54.000000000 +0100
@@ -394,7 +394,7 @@
 %type <boolean> opt_varying opt_timezone
 
 %type <ival>	Iconst SignedIconst
-%type <str>		Sconst comment_text
+%type <str>		Sconst comment_text notify_payload
 %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
 %type <list>	var_list
 %type <str>		ColId ColLabel var_name type_function_name param_name
@@ -5984,10 +5984,16 @@
  *
  *****************************************************************************/
 
-NotifyStmt: NOTIFY ColId
+notify_payload:
+			Sconst								{ $$ = $1; }
+			| /*EMPTY*/							{ $$ = NULL; }
+		;
+
+NotifyStmt: NOTIFY ColId notify_payload
 				{
 					NotifyStmt *n = makeNode(NotifyStmt);
 					n->conditionname = $2;
+					n->payload = $3;
 					$$ = (Node *)n;
 				}
 		;
diff -ur cvs/src/backend/storage/ipc/ipci.c cvs.build/src/backend/storage/ipc/ipci.c
--- cvs/src/backend/storage/ipc/ipci.c	2009-09-06 09:06:21.000000000 +0200
+++ cvs.build/src/backend/storage/ipc/ipci.c	2009-11-18 10:20:54.000000000 +0100
@@ -219,6 +219,7 @@
 	 */
 	BTreeShmemInit();
 	SyncScanShmemInit();
+	AsyncShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff -ur cvs/src/backend/storage/lmgr/lwlock.c cvs.build/src/backend/storage/lmgr/lwlock.c
--- cvs/src/backend/storage/lmgr/lwlock.c	2009-05-10 19:50:21.000000000 +0200
+++ cvs.build/src/backend/storage/lmgr/lwlock.c	2009-11-18 10:22:00.000000000 +0100
@@ -24,6 +24,7 @@
 #include "access/clog.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
+#include "commands/async.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "storage/ipc.h"
@@ -174,6 +175,9 @@
 	/* multixact.c needs two SLRU areas */
 	numLocks += NUM_MXACTOFFSET_BUFFERS + NUM_MXACTMEMBER_BUFFERS;
 
+	/* async.c needs one per page for the AsyncQueue */
+	numLocks += NUM_ASYNC_BUFFERS;
+
 	/*
 	 * Add any requested by loadable modules; for backwards-compatibility
 	 * reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
diff -ur cvs/src/backend/tcop/utility.c cvs.build/src/backend/tcop/utility.c
--- cvs/src/backend/tcop/utility.c	2009-11-18 10:19:31.000000000 +0100
+++ cvs.build/src/backend/tcop/utility.c	2009-11-18 10:20:54.000000000 +0100
@@ -875,8 +875,12 @@
 		case T_NotifyStmt:
 			{
 				NotifyStmt *stmt = (NotifyStmt *) parsetree;
-
-				Async_Notify(stmt->conditionname);
+				if (stmt->payload
+					&& strlen(stmt->payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1)
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("payload string too long")));
+				Async_Notify(stmt->conditionname, stmt->payload);
 			}
 			break;
 
diff -ur cvs/src/bin/initdb/initdb.c cvs.build/src/bin/initdb/initdb.c
--- cvs/src/bin/initdb/initdb.c	2009-11-18 10:19:31.000000000 +0100
+++ cvs.build/src/bin/initdb/initdb.c	2009-11-18 10:20:54.000000000 +0100
@@ -2469,6 +2469,7 @@
 		"pg_xlog",
 		"pg_xlog/archive_status",
 		"pg_clog",
+		"pg_notify",
 		"pg_subtrans",
 		"pg_twophase",
 		"pg_multixact/members",
diff -ur cvs/src/bin/psql/common.c cvs.build/src/bin/psql/common.c
--- cvs/src/bin/psql/common.c	2009-05-10 19:50:30.000000000 +0200
+++ cvs.build/src/bin/psql/common.c	2009-11-18 10:20:54.000000000 +0100
@@ -555,8 +555,8 @@
 
 	while ((notify = PQnotifies(pset.db)))
 	{
-		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" received from server process with PID %d.\n"),
-				notify->relname, notify->be_pid);
+		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" (%s) received from server process with PID %d.\n"),
+				notify->relname, notify->extra, notify->be_pid);
 		fflush(pset.queryFout);
 		PQfreemem(notify);
 	}
diff -ur cvs/src/include/access/slru.h cvs.build/src/include/access/slru.h
--- cvs/src/include/access/slru.h	2009-05-10 19:50:35.000000000 +0200
+++ cvs.build/src/include/access/slru.h	2009-11-18 10:20:54.000000000 +0100
@@ -16,6 +16,25 @@
 #include "access/xlogdefs.h"
 #include "storage/lwlock.h"
 
+/*
+ * Define segment size.  A page is the same BLCKSZ as is used everywhere
+ * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
+ * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
+ * or 64K transactions for SUBTRANS.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
+ * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
+ * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+ * take no explicit notice of that fact in this module, except when comparing
+ * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
+ *
+ * Note: this file currently assumes that segment file names will be four
+ * hex digits.	This sets a lower bound on the segment size (64K transactions
+ * for 32-bit TransactionIds).
+ */
+#define SLRU_PAGES_PER_SEGMENT	32
+
 
 /*
  * Page status codes.  Note that these do not include the "dirty" bit.
diff -ur cvs/src/include/commands/async.h cvs.build/src/include/commands/async.h
--- cvs/src/include/commands/async.h	2009-09-06 09:08:02.000000000 +0200
+++ cvs.build/src/include/commands/async.h	2009-11-18 10:23:41.000000000 +0100
@@ -13,16 +13,30 @@
 #ifndef ASYNC_H
 #define ASYNC_H
 
+/*
+ * How long can a payload string possibly be? Actually it needs to be one
+ * byte less to provide space for the trailing terminating '\0'.
+ */
+#define NOTIFY_PAYLOAD_MAX_LENGTH	8000
+
+/*
+ * How many page slots do we reserve ?
+ */
+#define NUM_ASYNC_BUFFERS			4
+
 extern bool Trace_notify;
 
+extern void AsyncShmemInit(void);
+
 /* notify-related SQL statements */
-extern void Async_Notify(const char *relname);
+extern void Async_Notify(const char *relname, const char *payload);
 extern void Async_Listen(const char *relname);
 extern void Async_Unlisten(const char *relname);
 extern void Async_UnlistenAll(void);
 
 /* perform (or cancel) outbound notify processing at transaction commit */
-extern void AtCommit_Notify(void);
+extern void AtCommit_NotifyBeforeCommit(void);
+extern void AtCommit_NotifyAfterCommit(void);
 extern void AtAbort_Notify(void);
 extern void AtSubStart_Notify(void);
 extern void AtSubCommit_Notify(void);
diff -ur cvs/src/include/nodes/parsenodes.h cvs.build/src/include/nodes/parsenodes.h
--- cvs/src/include/nodes/parsenodes.h	2009-11-18 10:19:31.000000000 +0100
+++ cvs.build/src/include/nodes/parsenodes.h	2009-11-18 10:20:54.000000000 +0100
@@ -2059,6 +2059,7 @@
 {
 	NodeTag		type;
 	char	   *conditionname;	/* condition name to notify */
+	char	   *payload;		/* the payload string to be conveyed */
 } NotifyStmt;
 
 /* ----------------------
diff -ur cvs/src/include/storage/lwlock.h cvs.build/src/include/storage/lwlock.h
--- cvs/src/include/storage/lwlock.h	2009-05-10 19:53:12.000000000 +0200
+++ cvs.build/src/include/storage/lwlock.h	2009-11-18 10:20:54.000000000 +0100
@@ -67,6 +67,8 @@
 	AutovacuumLock,
 	AutovacuumScheduleLock,
 	SyncScanLock,
+	AsyncCtlLock,
+	AsyncQueueLock,
 	/* Individual lock IDs end here */
 	FirstBufMappingLock,
 	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joachim Wieland (#4)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland <joe@mcknight.de> writes:

4) Allow readers to read uncommitted notifications as well.

The question that strikes me here is one of timing --- apparently,
readers will now have to check the queue *without* having received
a signal? That could amount to an unpleasant amount of extra overhead
when the notify system isn't even in use. (Users who don't care about
notify will define "unpleasant amount" as "not zero".)

I haven't read the patch, so maybe you have some cute solution to that,
but if so please explain what.

regards, tom lane

#6Greg Stark
gsstark@mit.edu
In reply to: Andrew Chernow (#3)
Re: Listen / Notify - what to do when the queue is full

On Mon, Nov 16, 2009 at 2:35 PM, Andrew Chernow <ac@esilo.com> wrote:

1) drop new notifications if the queue is full (silently or with
rollback)

I like this one best, but not with silence of course. While it's not the most
polite thing to do, this is for a super extreme edge case. I'd rather just
throw an exception if the queue is full rather than start messing with the

+1

So if you guys are going to insist on turning the notification
mechanism into a queueing mechanism, I think it at least behooves you
to have it degrade gracefully into a notification mechanism and not
become entirely useless by dropping notification messages.

That is, if the queue overflows what you should do is drop the
payloads and condense all the messages for a given class into a single
notification for that class with "unknown payload". That way if a
cache which wants to invalidate specific objects gets a queue overflow
condition then at least it knows it should rescan the original data
and rebuild the cache and not just serve invalid data.

I still think you're on the wrong path entirely and will end up with a
mechanism which serves neither use case very well instead of two
separate mechanisms that are properly designed for the two use cases.

--
greg

#7Joachim Wieland
joe@mcknight.de
In reply to: Tom Lane (#5)
Re: Listen / Notify - what to do when the queue is full

On Thu, Nov 19, 2009 at 1:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Joachim Wieland <joe@mcknight.de> writes:

4) Allow readers to read uncommitted notifications as well.

The question that strikes me here is one of timing --- apparently,
readers will now have to check the queue *without* having received
a signal?  That could amount to an unpleasant amount of extra overhead
when the notify system isn't even in use.  (Users who don't care about
notify will define "unpleasant amount" as "not zero".)

The sequence in CommitTransaction() is as follows:

1) add notifications to queue
2) commit to clog
3) signal backends

Only backends that listen on at least one channel are signalled; if the
notify system isn't in use, then nobody will ever be signalled anyway.

If a backend is reading a transaction id that has not yet committed,
it will not deliver the notification. It knows that eventually it will
receive a signal from that transaction and then it first checks its
list of uncommitted notifications it has already read and then checks
the queue for more pending notifications.

Joachim
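
The reader-side rule described above can be sketched in a few lines of C. This is only an illustration loosely modeled on the loop over uncommittedNotifications in the patch, not the patch's code: the XactState enum, the QueuedNotify struct and process_queue() are hypothetical names, and the transaction state is stored directly in each entry instead of being looked up in clog.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transaction states; a real backend would consult clog. */
typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } XactState;

typedef struct
{
	XactState	state;		/* state of the sending transaction */
	const char *channel;	/* channel name, unused by the sketch itself */
} QueuedNotify;

/*
 * Walk the entries in queue order, starting at 'start'.  Entries from
 * committed transactions are counted as delivered, entries from aborted
 * transactions are skipped, and the walk stops at the first entry whose
 * transaction is still running.  Returns the index of the first entry
 * that has not been processed yet.
 */
static size_t
process_queue(const QueuedNotify *q, size_t n, size_t start, size_t *delivered)
{
	size_t		i;

	*delivered = 0;
	for (i = start; i < n; i++)
	{
		if (q[i].state == XACT_IN_PROGRESS)
			break;				/* wait for the sender's signal */
		if (q[i].state == XACT_COMMITTED)
			(*delivered)++;
	}
	return i;
}
```

Because the walk stops at the first in-progress entry, a later call (triggered by the signal sent in step 3) resumes from the returned index and picks up that entry once its transaction has committed or aborted.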

#8Andrew Chernow
ac@esilo.com
In reply to: Greg Stark (#6)
Re: Listen / Notify - what to do when the queue is full

That is, if the queue overflows what you should do is drop the
payloads and condense all the messages for a given class into a single
notification for that class with "unknown payload". That way if a
cache which wants to invalidate specific objects gets a queue overflow
condition then at least it knows it should rescan the original data
and rebuild the cache and not just serve invalid data.

That's far more complicated than throwing an error and it discards user payload
information. Let the error indicate a rescan is needed.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/

#9Josh Berkus
josh@agliodbs.com
In reply to: Joachim Wieland (#1)
Re: Listen / Notify - what to do when the queue is full

On 11/16/09 3:19 AM, Joachim Wieland wrote:

1) drop new notifications if the queue is full (silently or with rollback)
2) block until readers catch up (what if the backend that tries to write the
notifications actually is the "lazy" reader that everybody is waiting for to
proceed?)
3) invent a new signal reason and send SIGUSR1 to the "lazy" readers, they
need to interrupt whatever they are doing and copy the notifications into
their own address space (without delivering the notifications since they are
in a transaction at that moment).

(4) drop *old* notifications if the queue is full.

Since everyone has made the point that LISTEN is not meant to be a full
queueing system, I have no problem dropping notifications LRU-style. If
we've run out of room, the oldest notifications should go first; we
probably don't care about them anyway.

We should probably also log the fact that we ran out of room, so that
the DBA knows that they have a design issue. For volume reasons, I
don't think we want to log every dropped message.

Alternately, it would be great to have a configuration option which
would allow the DBA to choose any of 3 behaviors via GUC:

drop-oldest (as above)
drop-largest (if we run out of room, drop the largest payloads first to
save space)
error (if we run out of room, error and rollback)

--Josh Berkus

#10Andrew Chernow
ac@esilo.com
In reply to: Josh Berkus (#9)
Re: Listen / Notify - what to do when the queue is full

We should probably also log the fact that we ran out of room, so that
the DBA knows that they have a design issue.

Can't they just bump allowed memory and avoid a redesign?

Alternately, it would be great to have a configuration option which
would allow the DBA to choose any of 3 behaviors via GUC:

drop-oldest (as above)
drop-largest (if we run out of room, drop the largest payloads first to
save space)
error (if we run out of room, error and rollback)

I mentioned this up thread. I completely agree that overflow behavior should be
tunable.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joachim Wieland (#7)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland <joe@mcknight.de> writes:

The sequence in CommitTransaction() is as follows:

1) add notifications to queue
2) commit to clog
3) signal backends

Only backends that listen on at least one channel are signalled; if the
notify system isn't in use, then nobody will ever be signalled anyway.

If a backend is reading a transaction id that has not yet committed,
it will not deliver the notification.

But you were saying that this patch would enable sending more data than
would fit in the queue. How will that happen if the other backends
don't look at the queue until you signal them?

regards, tom lane

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#9)
Re: Listen / Notify - what to do when the queue is full

Josh Berkus <josh@agliodbs.com> writes:

(4) drop *old* notifications if the queue is full.

Since everyone has made the point that LISTEN is not meant to be a full
queueing system, I have no problem dropping notifications LRU-style.

NO, NO, NO, a thousand times no!

That turns NOTIFY into an unreliable signaling system, and if I haven't
made this perfectly clear yet, any such change will be committed over my
dead body.

If we are unable to insert a new message into the queue, the correct
recourse is to fail the transaction that is trying to insert the *new*
message. Not to drop messages from already-committed transactions.
Failing the current transaction still leaves things in a consistent
state, ie, you don't get messages from aborted transactions but that's
okay because they didn't change the database state.

I think Greg has a legitimate concern about whether this redesign
reduces the usefulness of NOTIFY for existing use-cases, though.
Formerly, since pg_listener would effectively coalesce notifies
across multiple sending transactions instead of only one, it was
impossible to "overflow the queue", unless maybe you managed to
bloat pg_listener to the point of being out of disk space, and
even that was pretty hard. There will now be a nonzero chance
of transactions failing at commit because of queue full. If the
chance is large this will be an issue. (Is it sane to wait for
the queue to be drained?)

BTW, did we discuss the issue of 2PC transactions versus notify?
The current behavior of 2PC with notify is pretty cheesy and will
become more so if we make this change --- you aren't really
guaranteed that the notify will happen, even though the prepared
transaction did commit. I think it might be better to disallow
NOTIFY inside a prepared xact.

regards, tom lane
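
The overflow rule argued for above (refuse the new message and let the sending transaction fail, never discard messages that are already queued) can be illustrated with a small bounded queue. This is a sketch under stated assumptions, not code from the patch: NotifyQueue, queue_try_append() and queue_pop() are hypothetical names, the capacity is deliberately tiny, and messages are plain ints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 4		/* tiny on purpose, for illustration */

typedef struct
{
	int			entries[QUEUE_CAPACITY];
	size_t		head;			/* index of the oldest unread entry */
	size_t		count;			/* number of unread entries */
} NotifyQueue;

/*
 * Try to append a message.  On a full queue the call fails and leaves the
 * queue untouched; the caller is expected to abort its own transaction.
 */
static bool
queue_try_append(NotifyQueue *q, int msg)
{
	if (q->count == QUEUE_CAPACITY)
		return false;
	q->entries[(q->head + q->count) % QUEUE_CAPACITY] = msg;
	q->count++;
	return true;
}

/* Consume the oldest entry; returns false if the queue is empty. */
static bool
queue_pop(NotifyQueue *q, int *msg)
{
	if (q->count == 0)
		return false;
	*msg = q->entries[q->head];
	q->head = (q->head + 1) % QUEUE_CAPACITY;
	q->count--;
	return true;
}
```

The point of the design is that a failed append changes nothing: every message that was accepted earlier is still delivered in order, and the only party that notices the overflow is the transaction that tried to add to a full queue.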

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Chernow (#10)
Re: Listen / Notify - what to do when the queue is full

Andrew Chernow <ac@esilo.com> writes:

I mentioned this up thread. I completely agree that overflow behavior should be
tunable.

There is only one correct overflow behavior.

regards, tom lane

#14Andrew Chernow
ac@esilo.com
In reply to: Tom Lane (#13)
Re: Listen / Notify - what to do when the queue is full

Tom Lane wrote:

Andrew Chernow <ac@esilo.com> writes:

I mentioned this up thread. I completely agree that overflow behavior should be
tunable.

There is only one correct overflow behavior.

I count three.

1. wait
2. error
3. skip

#1 and #2 are very similar to a file system. If FS buffers are full on write,
it makes you wait. In non-blocking mode, it throws an EAGAIN error. IMHO those
two behaviors are totally acceptable for handling notify overflow. #3 is pretty
weak but I *think* there are uses for it.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Chernow (#14)
Re: Listen / Notify - what to do when the queue is full

Andrew Chernow <ac@esilo.com> writes:

Tom Lane wrote:

There is only one correct overflow behavior.

I count three.

Waiting till you can insert is reasonable (especially if we have some
behavior that nudges other backends to empty the queue). If by "skip"
you mean losing the notify but still committing, that's incorrect.
There is no room for debate about that.

regards, tom lane

#16Andrew Chernow
ac@esilo.com
In reply to: Tom Lane (#15)
Re: Listen / Notify - what to do when the queue is full

Tom Lane wrote:

Andrew Chernow <ac@esilo.com> writes:

Tom Lane wrote:

There is only one correct overflow behavior.

I count three.

Waiting till you can insert is reasonable (especially if we have some
behavior that nudges other backends to empty the queue). If by "skip"
you mean losing the notify but still committing, that's incorrect.
There is no room for debate about that.

Yeah like I said, skip felt weak.

In regards to waiting, what would happen if other backends couldn't help empty
the queue because they too are clogged? ISTM that any attempt to flush to other
non-disk queues is doomed to possible overflows as well. Then what?
Personally, I would just wait until room became available or the transaction was
canceled. We could get fancy and tack a timeout value onto the wait.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Chernow (#16)
Re: Listen / Notify - what to do when the queue is full

Andrew Chernow <ac@esilo.com> writes:

Personally, I would just wait until room became available or the transaction was
canceled.

Works for me, as long as there's a CHECK_FOR_INTERRUPTS in there to
allow a cancel to happen. The current patch seems to have a lot of
pointless logging and no CHECK_FOR_INTERRUPTS ;-)

regards, tom lane
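
A wait loop of the kind discussed above might look roughly like the following. This is a self-contained sketch, not PostgreSQL code: cancel_requested is a hypothetical stand-in for the backend's pending-cancel state, and the explicit flag test plays the role that CHECK_FOR_INTERRUPTS() would play in the server (where a cancel raises an error instead of returning a status). The two callbacks exist only to exercise the loop.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { WAIT_GOT_SPACE, WAIT_CANCELLED } WaitResult;

/* Hypothetical stand-in for the backend's pending-cancel flag. */
static volatile bool cancel_requested = false;

/*
 * Poll until the callback reports free space in the queue, checking for a
 * cancel request on every round so the wait can always be interrupted.
 */
static WaitResult
wait_for_queue_space(bool (*has_space) (void *), void *arg)
{
	for (;;)
	{
		if (cancel_requested)	/* role of CHECK_FOR_INTERRUPTS() */
			return WAIT_CANCELLED;
		if (has_space(arg))
			return WAIT_GOT_SPACE;
		/* a real backend would sleep or wait for a signal here */
	}
}

/* Demo callback: reports space once the counter has run down to zero. */
static bool
space_after_n_polls(void *arg)
{
	int		   *remaining = (int *) arg;

	if (*remaining == 0)
		return true;
	(*remaining)--;
	return false;
}

/* Demo callback: never has space; a cancel "arrives" on the third poll. */
static bool
never_space_then_cancel(void *arg)
{
	int		   *polls = (int *) arg;

	if (++(*polls) == 3)
		cancel_requested = true;
	return false;
}
```

Without the cancel check the second scenario would spin forever, which is the problem with a wait that has no interrupt check in it.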

#18Andreas 'ads' Scherbaum
adsmail@wars-nicht.de
In reply to: Tom Lane (#12)
Re: Listen / Notify - what to do when the queue is full

On Wed, 18 Nov 2009 22:12:18 -0500 Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

(4) drop *old* notifications if the queue is full.

Since everyone has made the point that LISTEN is not meant to be a full
queueing system, I have no problem dropping notifications LRU-style.

NO, NO, NO, a thousand times no!

That turns NOTIFY into an unreliable signaling system, and if I haven't
made this perfectly clear yet, any such change will be committed over my
dead body.

If we are unable to insert a new message into the queue, the correct
recourse is to fail the transaction that is trying to insert the *new*
message. Not to drop messages from already-committed transactions.
Failing the current transaction still leaves things in a consistent
state, ie, you don't get messages from aborted transactions but that's
okay because they didn't change the database state.

+1

And in addition I don't like the idea of having the sender sit
around until there's room for more messages in the queue, because some
very old backends didn't remove the stuff from it.

So, yes, just failing the current transaction seems reasonable. We are
talking about millions of messages in the queue ...

Bye

--
Andreas 'ads' Scherbaum
German PostgreSQL User Group
European PostgreSQL User Group - Board of Directors
Volunteer Regional Contact, Germany - PostgreSQL Project

#19Joachim Wieland
joe@mcknight.de
In reply to: Tom Lane (#12)
Re: Listen / Notify - what to do when the queue is full

On Thu, Nov 19, 2009 at 4:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

There will now be a nonzero chance
of transactions failing at commit because of queue full.  If the
chance is large this will be an issue.  (Is it sane to wait for
the queue to be drained?)

Exactly. The whole idea of moving the notification system to an slru queue
was to make this nonzero chance a very-close-to-zero nonzero chance.

Currently with pages from 0..0xFFFF we can have something between
160,000,000 (no payload) and 2,000,000 (biggest payload) notifications
in the queue at the same time.
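
The orders of magnitude quoted above can be checked with a little arithmetic. The sketch below rests on assumptions that are not stated in this message: it reads 0..0xFFFF as the range of four-hex-digit segment file names, takes SLRU_PAGES_PER_SEGMENT (32) from the patch's slru.h and the default 8192-byte page, assumes entries never span a page boundary, and uses about 100 bytes for an entry without payload (the figure from the opening message of the thread). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ					8192	/* default PostgreSQL page size */
#define SLRU_PAGES_PER_SEGMENT	32		/* from slru.h in the patch */
#define SEGMENTS_4_HEX_DIGITS	0x10000 /* file names 0000 .. FFFF */

/* Total pages addressable with four-hex-digit segment file names. */
static uint64_t
queue_pages(void)
{
	return (uint64_t) SEGMENTS_4_HEX_DIGITS * SLRU_PAGES_PER_SEGMENT;
}

/*
 * Notifications that fit in the queue if every entry occupies
 * 'entry_bytes' and entries never span a page boundary.
 */
static uint64_t
queue_capacity(uint64_t entry_bytes)
{
	return queue_pages() * (BLCKSZ / entry_bytes);
}
```

Under these assumptions queue_pages() is 2,097,152, queue_capacity(8000) is 2,097,152 (one maximum-size notification per page, matching the "2,000,000" figure) and queue_capacity(100) is 169,869,312, the same order of magnitude as the "160,000,000" figure; the exact number depends on the real entry header size.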

We are free to remove the slru limitation by making slru.c work with 8
character file names. Then you can multiply both limits by 32,000 and
then it should be very-close-to-zero, at least from my point of view...

The actual queue-full behavior is then (or maybe is already now) just
a theoretical aspect that we need to agree on to make the whole
concept sound.

The current patch would just wait until some space becomes available
in the queue and it guarantees that no notification is lost. Furthermore
it guarantees that a transaction can listen on an unlimited number of
channels and that it can send an unlimited number of notifications,
not related to the size of the queue. It can also send that unlimited
number of notifications if it is one of the listeners of those notifications.

The only real limit is now the backend's memory but as long as nobody
proves that he needs unlimited notifications with a limited amount of
memory we just keep it like that.

I will add a CHECK_FOR_INTERRUPTS() and resubmit so that you
can cancel a NOTIFY while the queue is full. Also I've put in an
optimization to only signal those backends in a queue full situation
that are not yet up-to-date (which will probably turn out to be only one
backend - the slowest that is in a long running transaction - after some
time...).

BTW, did we discuss the issue of 2PC transactions versus notify?
The current behavior of 2PC with notify is pretty cheesy and will
become more so if we make this change --- you aren't really
guaranteed that the notify will happen, even though the prepared
transaction did commit.  I think it might be better to disallow
NOTIFY inside a prepared xact.

Yes, I have been thinking about that also. So what should happen
when you prepare a transaction that has sent a NOTIFY before?

Joachim

#20Joachim Wieland
joe@mcknight.de
In reply to: Andreas 'ads' Scherbaum (#18)
Re: Listen / Notify - what to do when the queue is full

On Thu, Nov 19, 2009 at 1:51 PM, Andreas 'ads' Scherbaum
<adsmail@wars-nicht.de> wrote:

And in addition I don't like the idea of having the sender sit
around until there's room for more messages in the queue, because some
very old backends didn't remove the stuff from it.

The only valid reason for a backend not to have processed the
notifications in the queue is that it has been inside a transaction
ever since (and executed LISTEN some time before).

Joachim

#21Greg Sabino Mullane
greg@turnstep.com
In reply to: Tom Lane (#12)
Re: Listen / Notify - what to do when the queue is full


(4) drop *old* notifications if the queue is full.

Since everyone has made the point that LISTEN is not meant to be a full
queueing system, I have no problem dropping notifications LRU-style.

NO, NO, NO, a thousand times no!

+1. Don't even think about going there. </me gives horrified shudder>

...

even that was pretty hard. There will now be a nonzero chance
of transactions failing at commit because of queue full. If the
chance is large this will be an issue. (Is it sane to wait for
the queue to be drained?)

I think this chance will be pretty small - you need a *lot* of
unread notifications before this edge case is reached, so I think
we can be pretty severe in our response, and put the responsibility
for cleanup on the user, rather than having the backend try to
move things around, cleanup the queue selectively, etc.

BTW, did we discuss the issue of 2PC transactions versus notify?
The current behavior of 2PC with notify is pretty cheesy and will
become more so if we make this change --- you aren't really
guaranteed that the notify will happen, even though the prepared
transaction did commit. I think it might be better to disallow
NOTIFY inside a prepared xact.

That's a tough one. On the one hand, simply stating that NOTIFY and 2PC
don't play together in the docs would be a straightforward solution
(and not a bad one, as 2PC is already rare and delicate and should not
be used lightly). But what I really don't like is the idea of a
notify that *may* work or may not - so let's keep it boolean: it either
works 100% of the time with 2PC, or doesn't at all. Should we throw
a warning or error if a client attempts to combine the two?

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation
PGP Key: 0x14964AC8 200911190857
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

#22Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joachim Wieland (#19)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland wrote:

On Thu, Nov 19, 2009 at 4:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, did we discuss the issue of 2PC transactions versus notify?
The current behavior of 2PC with notify is pretty cheesy and will
become more so if we make this change --- you aren't really
guaranteed that the notify will happen, even though the prepared
transaction did commit. I think it might be better to disallow
NOTIFY inside a prepared xact.

That will make anyone currently using 2PC with notify/listen unhappy.

Yes, I have been thinking about that also. So what should happen
when you prepare a transaction that has sent a NOTIFY before?

From the user's point of view, nothing should happen at prepare.

At a quick glance, it doesn't seem hard to support 2PC. Messages should
be put to the queue at prepare, as just before normal commit, but the
backends won't see them until they see that the XID has committed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#23Florian G. Pflug
In reply to: Heikki Linnakangas (#22)
Re: Listen / Notify - what to do when the queue is full

Heikki Linnakangas wrote:

Joachim Wieland wrote:

On Thu, Nov 19, 2009 at 4:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yes, I have been thinking about that also. So what should happen
when you prepare a transaction that has sent a NOTIFY before?

From the user's point of view, nothing should happen at prepare.

At a quick glance, it doesn't seem hard to support 2PC. Messages should
be put to the queue at prepare, as just before normal commit, but the
backends won't see them until they see that the XID has committed.

Yeah, but if the server is restarted after the PREPARE but before the
COMMIT, the notification will be lost, since all notification queue
entries are lost upon restart with the slru design, no?

best regards,
Florian Pflug

#24Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Florian G. Pflug (#23)
Re: Listen / Notify - what to do when the queue is full

Florian G. Pflug wrote:

Heikki Linnakangas wrote:

At a quick glance, it doesn't seem hard to support 2PC. Messages should
be put to the queue at prepare, as just before normal commit, but the
backends won't see them until they see that the XID has committed.

Yeah, but if the server is restarted after the PREPARE but before the
COMMIT, the notification will be lost, since all notification queue
entries are lost upon restart with the slru design, no?

That's why they're stored in the 2PC state file in pg_twophase. See
AtPrepare_Notify().

Hmm, thinking about this a bit more, I don't think the messages should
be sent until commit (ie. 2nd phase). Although the information is safe
in the state file, if anyone starts to LISTEN between the PREPARE
TRANSACTION and COMMIT PREPARED calls, he would miss the notifications.
I'm not sure if it's well-defined what happens if someone starts to
LISTEN while another transaction has already sent a notification, but it
would be rather surprising if such a window existed where it doesn't
exist with non-prepared transactions.

A better approach is to do something similar to what we do now: at
prepare, just store the notifications in the state file like we do
already. In notify_twophase_postcommit(), copy the messages to the
shared queue. Although it's the same approach we have now, it becomes a
lot cleaner with the patch, because we're not piggybacking the messages
on the backend-private queue of the current transaction, but sending the
messages directly on behalf of the prepared transaction being committed.
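
To make the two steps concrete, here is a minimal sketch in C. It is not
code from the patch: StateFile, SharedQueue, prepare_notify() and
commit_prepared_notify() are hypothetical stand-ins for the 2PC state file,
the shared queue and the prepare/postcommit hooks. It only shows that
nothing reaches the queue at prepare time, and that the copy at commit
time can still hit a full queue.

```c
#include <assert.h>
#include <string.h>

#define MAX_MSGS 8
#define MSG_LEN  32

/* Hypothetical stand-in for the notify part of a 2PC state file. */
typedef struct {
    int  nmsgs;
    char msgs[MAX_MSGS][MSG_LEN];
} StateFile;

/* Hypothetical stand-in for the shared notification queue. */
typedef struct {
    int  head;                          /* next free slot */
    char slots[MAX_MSGS][MSG_LEN];
} SharedQueue;

/* PREPARE TRANSACTION: persist the pending messages; the queue is untouched. */
static void prepare_notify(StateFile *sf, const char *const *pending, int n)
{
    assert(n <= MAX_MSGS);
    sf->nmsgs = n;
    for (int i = 0; i < n; i++) {
        strncpy(sf->msgs[i], pending[i], MSG_LEN - 1);
        sf->msgs[i][MSG_LEN - 1] = '\0';
    }
}

/*
 * COMMIT PREPARED: copy the saved messages into the shared queue.
 * Returns 0 on success, -1 if the queue has no room for all of them.
 */
static int commit_prepared_notify(const StateFile *sf, SharedQueue *q)
{
    if (q->head + sf->nmsgs > MAX_MSGS)
        return -1;
    for (int i = 0; i < sf->nmsgs; i++)
        memcpy(q->slots[q->head++], sf->msgs[i], MSG_LEN);
    return 0;
}
```

A caller would run prepare_notify() once at PREPARE TRANSACTION and
commit_prepared_notify() from whichever backend later issues COMMIT
PREPARED; the two need not be the same backend, since only the state
file carries the messages across.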

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#24)
Re: Listen / Notify - what to do when the queue is full

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

A better approach is to do something similar to what we do now: at
prepare, just store the notifications in the state file like we do
already. In notify_twophase_postcommit(), copy the messages to the
shared queue. Although it's the same approach we have now, it becomes a
lot cleaner with the patch, because we're not piggybacking the messages
on the backend-private queue of the current transaction, but sending the
messages directly on behalf of the prepared transaction being committed.

This is still ignoring the complaint: you are creating a clear risk
that COMMIT PREPARED will fail.

I'm not sure that it's really worth it, but one way this could be made
safe would be for PREPARE to "reserve" the required amount of queue
space, such that nobody else could use it during the window from
PREPARE to COMMIT PREPARED.
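
A rough sketch of what such a reservation could look like, in C. The
QueueSpace struct and all function names are invented for illustration
and do not exist in the patch; the point is only that the space check
moves to PREPARE, where failing is still harmless.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical space accounting for the notification queue. */
typedef struct {
    int capacity;   /* total slots in the queue */
    int used;       /* slots holding written notifications */
    int reserved;   /* slots promised to prepared transactions */
} QueueSpace;

/* Slots that ordinary writers may still claim. */
static int queue_free(const QueueSpace *q)
{
    return q->capacity - q->used - q->reserved;
}

/* PREPARE: set aside n slots, or fail now while rollback is still possible. */
static bool reserve_at_prepare(QueueSpace *q, int n)
{
    if (n > queue_free(q))
        return false;
    q->reserved += n;
    return true;
}

/* COMMIT PREPARED: turn the reservation into real entries; no space check. */
static void consume_at_commit_prepared(QueueSpace *q, int n)
{
    assert(n <= q->reserved);
    q->reserved -= n;
    q->used += n;
}

/* ROLLBACK PREPARED: hand the reserved slots back. */
static void release_at_rollback_prepared(QueueSpace *q, int n)
{
    assert(n <= q->reserved);
    q->reserved -= n;
}

/* Ordinary commit: may only use space that is neither used nor reserved. */
static bool write_at_commit(QueueSpace *q, int n)
{
    if (n > queue_free(q))
        return false;
    q->used += n;
    return true;
}
```

The cost is visible in write_at_commit(): reserved slots are lost to
everyone else for as long as the transaction stays prepared.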

On the whole I'd be just as happy to disallow NOTIFY in a 2PC
transaction. We have no evidence that anyone out there is using the
combination, and if they are, they can do the work to make it safe.

regards, tom lane

#26Andreas 'ads' Scherbaum
adsmail@wars-nicht.de
In reply to: Joachim Wieland (#20)
Re: Listen / Notify - what to do when the queue is full

On Thu, 19 Nov 2009 14:23:57 +0100 Joachim Wieland wrote:

On Thu, Nov 19, 2009 at 1:51 PM, Andreas 'ads' Scherbaum
<adsmail@wars-nicht.de> wrote:

And in addition, I don't like the idea of having the sender sit
around until there's room for more messages in the queue just because
some very old backends haven't removed their entries from it.

The only valid reason for a backend not to have processed the
notifications in the queue is that it has been inside a transaction
ever since (and executed LISTEN some time before).

Yes, I know. The same backend is probably causing more trouble
anyway (blocking vacuum, xid wraparound, ...).

--
Andreas 'ads' Scherbaum
German PostgreSQL User Group
European PostgreSQL User Group - Board of Directors
Volunteer Regional Contact, Germany - PostgreSQL Project

#27Florian G. Pflug
fgp@phlo.org
In reply to: Tom Lane (#25)
Re: Listen / Notify - what to do when the queue is full

Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

A better approach is to do something similar to what we do now: at
prepare, just store the notifications in the state file like we do
already. In notify_twophase_postcommit(), copy the messages to the
shared queue. Although it's the same approach we have now, it
becomes a lot cleaner with the patch, because we're not
piggybacking the messages on the backend-private queue of the
current transaction, but sending the messages directly on behalf of
the prepared transaction being committed.

This is still ignoring the complaint: you are creating a clear risk
that COMMIT PREPARED will fail.

I'm not sure that it's really worth it, but one way this could be
made safe would be for PREPARE to "reserve" the required amount of
queue space, such that nobody else could use it during the window
from PREPARE to COMMIT PREPARED.

I'd see no problem with "COMMIT PREPARED" failing, as long as it was
possible to retry the COMMIT PREPARED at a later time. There surely are
other failure cases for COMMIT PREPARED too, like an IO error that
prevents the clog bit from being set, or a server crash half-way through
COMMIT PREPARED.

best regards,
Florian Pflug

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Florian G. Pflug (#27)
Re: Listen / Notify - what to do when the queue is full

"Florian G. Pflug" <fgp@phlo.org> writes:

Tom Lane wrote:

This is still ignoring the complaint: you are creating a clear risk
that COMMIT PREPARED will fail.

I'd see no problem with "COMMIT PREPARED" failing, as long as it was
possible to retry the COMMIT PREPARED at a later time. There surely are
other failure cases for COMMIT PREPARED too, like an IO error that
prevents the clog bit from being set, or a server crash half-way through
COMMIT PREPARED.

Yes, there are failure cases that are outside our control. That's no
excuse for creating one that's within our control.

regards, tom lane

#29Florian G. Pflug
fgp@phlo.org
In reply to: Tom Lane (#28)
Re: Listen / Notify - what to do when the queue is full

Tom Lane wrote:

"Florian G. Pflug" <fgp@phlo.org> writes:

Tom Lane wrote:

This is still ignoring the complaint: you are creating a clear
risk that COMMIT PREPARED will fail.

I'd see no problem with "COMMIT PREPARED" failing, as long as it
was possible to retry the COMMIT PREPARED at a later time. There
surely are other failure cases for COMMIT PREPARED too, like an IO
error that prevents the clog bit from being set, or a server crash
half-way through COMMIT PREPARED.

Yes, there are failure cases that are outside our control. That's no
excuse for creating one that's within our control.

True. On the other hand, people might prefer having to deal with (very
unlikely) COMMIT PREPARED *transient* failures over not being able to
use NOTIFY together with 2PC at all. Especially since any credible
distributed transaction manager has to deal with COMMIT PREPARED
failures anyway.

Just my $0.02, though.

best regards,
Florian Pflug

#30Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#25)
Re: Listen / Notify - what to do when the queue is full

Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

A better approach is to do something similar to what we do now: at
prepare, just store the notifications in the state file like we do
already. In notify_twophase_postcommit(), copy the messages to the
shared queue. Although it's the same approach we have now, it becomes a
lot cleaner with the patch, because we're not piggybacking the messages
on the backend-private queue of the current transaction, but sending the
messages directly on behalf of the prepared transaction being committed.

This is still ignoring the complaint: you are creating a clear risk
that COMMIT PREPARED will fail.

I'm not sure that it's really worth it, but one way this could be made
safe would be for PREPARE to "reserve" the required amount of queue
space, such that nobody else could use it during the window from
PREPARE to COMMIT PREPARED.

Hmm, ignoring 2PC for a moment, I think the patch suffers from a little
race condition:

Session 1: BEGIN;
Session 1: INSERT INTO foo ..;
Session 1: NOTIFY 'foo';
Session 1: COMMIT -- commit begins
Session 1: [commit processing runs AtCommit_NotifyBeforeCommit()]
Session 2: LISTEN 'foo';
Session 2: SELECT * FROM foo;
Session 1: [AtCommit_NotifyAfterCommit() signals listening backends]
Session 2: [waits for notifications]

Because session 2 began listening after session 1 had already sent its
notifications, it missed them. But the SELECT didn't see the INSERT,
because the inserting transaction hadn't fully finished yet.

The window isn't as small as it might seem at first glance, because the
WAL is fsynced between the BeforeCommit and AfterCommit actions.

I think we could fix that by arranging things so that a backend refrains
from advancing its own 'pos' beyond the first notification it has
written itself, until commit is completely finished. I'm not sure, but
that might already be true if we don't receive interrupts between
BeforeCommit and AfterCommit. LISTEN can then simply start reading from
QUEUE_TAIL instead of QUEUE_HEAD, and in the above example session 2
will see the notifications sent by session 1.
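
As a toy model of that idea (plain C, not taken from the patch; the
Queue struct and the helpers enqueue(), min_pos() and count_visible()
are made up for this example), assume the tail is the minimum of all
backends' positions and that the committing backend keeps its own
position pinned at its first notification:

```c
#include <assert.h>
#include <string.h>

#define QUEUE_LEN 16

/* Toy notification queue: one channel name per entry. */
typedef struct {
    const char *channel[QUEUE_LEN];
    int head;   /* next free slot (QUEUE_HEAD) */
    int tail;   /* oldest slot still needed (QUEUE_TAIL) */
} Queue;

/* Append one notification and return the slot it was written to. */
static int enqueue(Queue *q, const char *channel)
{
    assert(q->head < QUEUE_LEN);
    q->channel[q->head] = channel;
    return q->head++;
}

/* The tail is the smallest read position over all registered backends. */
static int min_pos(const int *pos, int n)
{
    assert(n > 0);
    int m = pos[0];
    for (int i = 1; i < n; i++)
        if (pos[i] < m)
            m = pos[i];
    return m;
}

/* Count entries on 'channel' that a listener starting at start_pos reads. */
static int count_visible(const Queue *q, int start_pos, const char *channel)
{
    int n = 0;
    for (int i = start_pos; i < q->head; i++)
        if (strcmp(q->channel[i], channel) == 0)
            n++;
    return n;
}
```

With session 1 pinned at the slot returned by enqueue(), a LISTEN that
starts at q->head reads nothing, while one that starts at q->tail still
finds the entry.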

That will handle 2PC as well. We can send the notifications in
prepare-phase, and any LISTEN that starts after the prepare-phase will
see the notifications because they're still in the queue. There is no
risk of running out of disk space in COMMIT PREPARED, because the
notifications have already been written to disk. However, the
notification queue can't be truncated until the prepared transaction
finishes; does anyone think that's a show-stopper?

On the whole I'd be just as happy to disallow NOTIFY in a 2PC
transaction. We have no evidence that anyone out there is using the
combination, and if they are, they can do the work to make it safe.

Yeah, I doubt we'd hear many complaints in practice.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#31Chris Browne
cbbrowne@acm.org
In reply to: Tom Lane (#12)
Re: Listen / Notify - what to do when the queue is full

greg@turnstep.com ("Greg Sabino Mullane") writes:

BTW, did we discuss the issue of 2PC transactions versus notify?
The current behavior of 2PC with notify is pretty cheesy and will
become more so if we make this change --- you aren't really
guaranteed that the notify will happen, even though the prepared
transaction did commit. I think it might be better to disallow
NOTIFY inside a prepared xact.

That's a tough one. On the one hand, simply stating that NOTIFY and 2PC
don't play together in the docs would be a straightforward solution
(and not a bad one, as 2PC is already rare and delicate and should not
be used lightly). But what I really don't like is the idea of a
notify that *may* work or may not - so let's keep it boolean: it either
works 100% of the time with 2PC, or doesn't at all. Should we throw
a warning or error if a client attempts to combine the two?

+1 from me...

It should either work, or not work, as opposed to something
nondeterministic.

While it's certainly a nice thing for features to be orthogonal, and for
interactions to "just work," I can see making a good case for NOTIFY and
2PC not playing together.
--
select 'cbbrowne' || '@' || 'gmail.com';
http://linuxfinances.info/info/slony.html
Why isn't phonetic spelled the way it sounds?

#32Joachim Wieland
joe@mcknight.de
In reply to: Heikki Linnakangas (#30)
Re: Listen / Notify - what to do when the queue is full

On Thu, Nov 19, 2009 at 6:55 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, ignoring 2PC for a moment, I think the patch suffers from a little
race condition:

Session 1: BEGIN;
Session 1: INSERT INTO foo ..;
Session 1: NOTIFY 'foo';
Session 1: COMMIT -- commit begins
Session 1: [commit processing runs AtCommit_NotifyBeforeCommit()]

----> Session 2 must not read uncommitted notifications selectively

Session 2: LISTEN 'foo';
Session 2: SELECT * FROM foo;
Session 1: [AtCommit_NotifyAfterCommit() signals listening backends]
Session 2: [waits for notifications]

Because session 2 began listening after session 1 had already sent its
notifications, it missed them.

I think you are right. However note that session 1 does not actively
send notifications to anybody, it just puts them into the queue. It's
every backend's own job to process the queue and see which messages
are interesting and which are not. The example you brought up fails if
Session 2 disregards the notifications based on the current set of
channels that it is listening to at this point. If I understand you
correctly what you are suggesting is to not read uncommitted
notifications from the queue in a reading backend or read all
notifications (regardless of which channel it has been sent to), such
that the backend can apply the check ("Am I listening on this
channel?") later on.

I think we could fix that by arranging things so that a backend refrains
from advancing its own 'pos' beyond the first notification it has
written itself, until commit is completely finished.

In the end this is similar to the idea to not read uncommitted
notifications which was what I did at the beginning. However then you
run into a full queue a lot faster. Imagine a queue length of 1000
with 3 transactions writing 400 notifications each... All three might
fail if they run in parallel, even though space would be sufficient
for at least two of them, and if they are executed in a sequence, all
of them could deliver their notifications.
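
The numbers can be checked with a small simulation in C. It assumes a
round-robin interleaving of the three writers and readers that never
consume uncommitted entries; run_parallel() and run_sequential() are
names made up for this example, not functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>

enum { QUEUE_LEN = 1000, NTXN = 3, PER_TXN = 400 };

/*
 * Three transactions write one notification each in turn. Nothing is
 * freed, because no reader may consume uncommitted entries.
 * Returns how many transactions managed to write all PER_TXN entries.
 */
static int run_parallel(void)
{
    int written[NTXN] = { 0 };
    int used = 0;
    bool progress = true;

    while (progress) {
        progress = false;
        for (int t = 0; t < NTXN; t++) {
            if (written[t] < PER_TXN && used < QUEUE_LEN) {
                written[t]++;
                used++;
                progress = true;
            }
        }
    }
    assert(used <= QUEUE_LEN);

    int done = 0;
    for (int t = 0; t < NTXN; t++)
        if (written[t] == PER_TXN)
            done++;
    return done;
}

/*
 * The same transactions one after another. After each commit the
 * readers consume the now-committed entries and the queue drains.
 */
static int run_sequential(void)
{
    int used = 0;
    int done = 0;

    for (int t = 0; t < NTXN; t++) {
        if (used + PER_TXN <= QUEUE_LEN) {
            used += PER_TXN;    /* write and commit */
            used = 0;           /* readers catch up, space is freed */
            done++;
        }
    }
    return done;
}
```

Under this interleaving the writers stall at 334, 333 and 333 entries,
so none of them reaches 400, while the sequential run completes all
three.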

Given your example, what I am proposing now is to stop reading from
the queue once we see a not-yet-committed notification but once the
queue is full, read the uncommitted notifications, effectively copying
them over into the backend's own memory... Once the transaction
commits and sends a signal, we can process, send and discard the
previously copied notifications. In the above example, at some point
one, two or all three backends would see that the queue is full and
each would read the others' uncommitted notifications, copy them
into its own memory, and space would be freed in the queue.
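
In code, the reading rule would look roughly like the following C
sketch. Entry, Reader, read_queue() and deliver_parked() are
illustrative names only, the committed flag stands in for a clog lookup
on the XID, and ordering between parked and later entries is ignored
here.

```c
#include <assert.h>
#include <stdbool.h>

#define LOCAL_MAX 64

/* One queue entry; 'committed' stands in for a clog lookup on xid. */
typedef struct {
    int  xid;
    bool committed;
} Entry;

/* Per-backend reading state. */
typedef struct {
    int   pos;                  /* this backend's read pointer */
    int   ndelivered;           /* notifications handed to the frontend */
    int   nlocal;               /* entries parked in backend-local memory */
    Entry local[LOCAL_MAX];
} Reader;

/*
 * Read queue[pos .. head). Committed entries are delivered. At an
 * uncommitted entry we normally stop; only if the queue is full do we
 * copy it into local memory and move on, so the slot can be reused.
 */
static void read_queue(Reader *r, const Entry *queue, int head, bool queue_full)
{
    while (r->pos < head) {
        const Entry *e = &queue[r->pos];

        if (e->committed) {
            r->ndelivered++;
        } else if (queue_full) {
            assert(r->nlocal < LOCAL_MAX);
            r->local[r->nlocal++] = *e;
        } else {
            break;              /* wait for the committer's signal */
        }
        r->pos++;
    }
}

/* After a commit signal: deliver and discard parked entries of that xid. */
static int deliver_parked(Reader *r, int committed_xid)
{
    int kept = 0;
    int sent = 0;

    for (int i = 0; i < r->nlocal; i++) {
        if (r->local[i].xid == committed_xid)
            sent++;
        else
            r->local[kept++] = r->local[i];
    }
    r->nlocal = kept;
    r->ndelivered += sent;
    return sent;
}
```

A real version would also have to throw away parked entries whose
transaction aborted; that case is left out of the sketch.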

That will handle 2PC as well. We can send the notifications in
prepare-phase, and any LISTEN that starts after the prepare-phase will
see the notifications because they're still in the queue. There is no
risk of running out of disk space in COMMIT PREPARED, because the
notifications have already been written to disk. However, the
notification queue can't be truncated until the prepared transaction
finishes; does anyone think that's a show-stopper?

Note that we don't preserve notifications when the database restarts.
But 2PC can cope with restarts. How would that fit together? Also I am
not sure how you are going to deliver notifications that happen
between the PREPARE TRANSACTION and the COMMIT PREPARED (because you
have only one queue pointer which you are not going to advance...) ?

Joachim

#33Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joachim Wieland (#32)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland wrote:

The example you brought up fails if
Session 2 disregards the notifications based on the current set of
channels that it is listening to at this point.

Right. Session 2 might not be listening at all yet.

If I understand you
correctly what you are suggesting is to not read uncommitted
notifications from the queue in a reading backend or read all
notifications (regardless of which channel it has been sent to), such
that the backend can apply the check ("Am I listening on this
channel?") later on.

Right.

Note that we don't preserve notifications when the database restarts.
But 2PC can cope with restarts. How would that fit together?

The notifications are written to the state file at prepare. They can be
recovered from there and written to the queue again at server start (see
twophase_rmgr.c).

Also I am
not sure how you are going to deliver notifications that happen
between the PREPARE TRANSACTION and the COMMIT PREPARED (because you
have only one queue pointer which you are not going to advance...) ?

Yeah, that's a problem. One uncommitted notification will block all
others too. In theory you have the same problem without 2PC, but it's OK
because you don't expect one COMMIT to take much longer to finish than
others.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#34Joachim Wieland
joe@mcknight.de
In reply to: Heikki Linnakangas (#33)
Re: Listen / Notify - what to do when the queue is full

On Fri, Nov 20, 2009 at 7:51 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Note that we don't preserve notifications when the database restarts.
But 2PC can cope with restarts. How would that fit together?

The notifications are written to the state file at prepare. They can be
recovered from there and written to the queue again at server start (see
twophase_rmgr.c).

Okay, but which of the backends would then leave its pointer at that
place in the queue upon restart?

This is also an issue for the non-restart case, what if you prepare
the transaction in one backend and commit in the other?

Joachim

#35Joachim Wieland
joe@mcknight.de
In reply to: Joachim Wieland (#32)
1 attachment(s)
Re: Listen / Notify - what to do when the queue is full

On Thu, Nov 19, 2009 at 11:04 PM, Joachim Wieland <joe@mcknight.de> wrote:

Given your example, what I am proposing now is to stop reading from
the queue once we see a not-yet-committed notification but once the
queue is full, read the uncommitted notifications, effectively copying
them over into the backend's own memory... Once the transaction
commits and sends a signal, we can process, send and discard the
previously copied notifications. In the above example, at some point
one, two or all three backends would see that the queue is full and
everybody would read the uncommitted notifications of the other one,
copy them into the own memory and space will be freed in the queue.

Attached is the patch that implements the described modifications.

Joachim

Attachments:

listennotify.3.diff (text/x-diff; charset=US-ASCII)
diff -cr cvs/src/backend/access/transam/slru.c cvs.build/src/backend/access/transam/slru.c
*** cvs/src/backend/access/transam/slru.c	2009-05-10 19:49:47.000000000 +0200
--- cvs.build/src/backend/access/transam/slru.c	2009-11-18 10:20:54.000000000 +0100
***************
*** 58,83 ****
  #include "storage/shmem.h"
  #include "miscadmin.h"
  
- 
- /*
-  * Define segment size.  A page is the same BLCKSZ as is used everywhere
-  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
-  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
-  * or 64K transactions for SUBTRANS.
-  *
-  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
-  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
-  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
-  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
-  * take no explicit notice of that fact in this module, except when comparing
-  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
-  *
-  * Note: this file currently assumes that segment file names will be four
-  * hex digits.	This sets a lower bound on the segment size (64K transactions
-  * for 32-bit TransactionIds).
-  */
- #define SLRU_PAGES_PER_SEGMENT	32
- 
  #define SlruFileName(ctl, path, seg) \
  	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
  
--- 58,63 ----
diff -cr cvs/src/backend/access/transam/xact.c cvs.build/src/backend/access/transam/xact.c
*** cvs/src/backend/access/transam/xact.c	2009-09-06 08:58:59.000000000 +0200
--- cvs.build/src/backend/access/transam/xact.c	2009-11-18 10:20:54.000000000 +0100
***************
*** 1604,1611 ****
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* NOTIFY commit must come before lower-level cleanup */
! 	AtCommit_Notify();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
--- 1604,1611 ----
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* Insert notifications sent by the NOTIFY command into the queue */
! 	AtCommit_NotifyBeforeCommit();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
***************
*** 1680,1685 ****
--- 1680,1690 ----
  
  	AtEOXact_MultiXact();
  
+ 	/*
+ 	 * Clean up Notify buffers and signal listening backends.
+ 	 */
+ 	AtCommit_NotifyAfterCommit();
+ 
  	ResourceOwnerRelease(TopTransactionResourceOwner,
  						 RESOURCE_RELEASE_LOCKS,
  						 true, true);
diff -cr cvs/src/backend/catalog/Makefile cvs.build/src/backend/catalog/Makefile
*** cvs/src/backend/catalog/Makefile	2009-10-31 14:47:46.000000000 +0100
--- cvs.build/src/backend/catalog/Makefile	2009-11-18 10:20:54.000000000 +0100
***************
*** 30,36 ****
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject.h pg_aggregate.h pg_statistic.h \
! 	pg_rewrite.h pg_trigger.h pg_listener.h pg_description.h pg_cast.h \
  	pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
--- 30,36 ----
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject.h pg_aggregate.h pg_statistic.h \
! 	pg_rewrite.h pg_trigger.h pg_description.h pg_cast.h \
  	pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
diff -cr cvs/src/backend/commands/async.c cvs.build/src/backend/commands/async.c
*** cvs/src/backend/commands/async.c	2009-09-06 08:59:06.000000000 +0200
--- cvs.build/src/backend/commands/async.c	2009-11-20 10:44:19.000000000 +0100
***************
*** 14,44 ****
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  * 1. Multiple backends on same machine.  Multiple backends listening on
!  *	  one relation.  (Note: "listening on a relation" is not really the
!  *	  right way to think about it, since the notify names need not have
!  *	  anything to do with the names of relations actually in the database.
!  *	  But this terminology is all over the code and docs, and I don't feel
!  *	  like trying to replace it.)
!  *
!  * 2. There is a tuple in relation "pg_listener" for each active LISTEN,
!  *	  ie, each relname/listenerPID pair.  The "notification" field of the
!  *	  tuple is zero when no NOTIFY is pending for that listener, or the PID
!  *	  of the originating backend when a cross-backend NOTIFY is pending.
!  *	  (We skip writing to pg_listener when doing a self-NOTIFY, so the
!  *	  notification field should never be equal to the listenerPID field.)
!  *
!  * 3. The NOTIFY statement itself (routine Async_Notify) just adds the target
!  *	  relname to a list of outstanding NOTIFY requests.  Actual processing
!  *	  happens if and only if we reach transaction commit.  At that time (in
!  *	  routine AtCommit_Notify) we scan pg_listener for matching relnames.
!  *	  If the listenerPID in a matching tuple is ours, we just send a notify
!  *	  message to our own front end.  If it is not ours, and "notification"
!  *	  is not already nonzero, we set notification to our own PID and send a
!  *	  PROCSIG_NOTIFY_INTERRUPT signal to the receiving process (indicated by
!  *	  listenerPID).
!  *	  BTW: if the signal operation fails, we presume that the listener backend
!  *	  crashed without removing this tuple, and remove the tuple for it.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
--- 14,67 ----
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  *
!  * 1. Multiple backends on same machine. Multiple backends listening on
!  *	  several channels. (This was previously called a "relation" even though it
!  *	  is just an identifier and has nothing to do with a database relation.)
!  *
!  * 2. There is one central queue in the form of Slru backed file based storage
!  *    (directory pg_notify/), with several pages mapped into shared memory.
!  *
!  *    There is no central storage of which backend listens on which channel,
!  *    every backend has its own list.
!  *
!  *    Every backend that is listening on at least one channel registers by
!  *    entering its Pid into the array of all backends. It then scans all
!  *    incoming notifications and compares the notified channels with its list.
!  *
!  *    In case there is a match it delivers the corresponding notification to
!  *    its frontend.
!  *
!  * 3. The NOTIFY statement (routine Async_Notify) registers the notification
!  *    in a list which will not be processed until transaction end. Every
!  *    notification can additionally send a "payload" which is an extra text
!  *    parameter to convey arbitrary information to the recipient.
!  *
!  *    Duplicate notifications from the same transaction are sent out as one
!  *    notification only. This is done to save work when for example a trigger
!  *    on a 2 million row table fires a notification for each row that has been
!  *    changed. If the application needs to receive every single notification
!  *    that has been sent, it can easily add some unique string into the extra
!  *    payload parameter.
!  *
!  *    Once the transaction commits, AtCommit_NotifyBeforeCommit() performs the
!  *    required changes regarding listeners (Listen/Unlisten) and then adds the
!  *    pending notifications to the head of the queue. The head pointer of the
!  *    queue always points to the next free position and a position is just a
!  *    page number and the offset in that page. This is done before marking the
!  *    transaction as committed in clog. If we run into problems writing the
!  *    notifications, we can still call elog(ERROR, ...) and the transaction
!  *    will roll back.
!  *
!  *    Once we have put all of the notifications into the queue, we return to
!  *    CommitTransaction() which will then commit to clog.
!  *
!  *    We are then called another time (AtCommit_NotifyAfterCommit()) and check
!  *    if we need to signal the backends.
!  *    In SignalBackends() we scan the list of listening backends and send a
!  *    PROCSIG_NOTIFY_INTERRUPT to every backend that has set its Pid (We don't
!  *    know which backend is listening on which channel so we need to send a
!  *    signal to every listening backend).
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
***************
*** 46,93 ****
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  * 5. Inbound-notify processing consists of scanning pg_listener for tuples
!  *	  matching our own listenerPID and having nonzero notification fields.
!  *	  For each such tuple, we send a message to our frontend and clear the
!  *	  notification field.  BTW: this routine has to start/commit its own
!  *	  transaction, since by assumption it is only called from outside any
!  *	  transaction.
!  *
!  * Like NOTIFY, LISTEN and UNLISTEN just add the desired action to a list
!  * of pending actions.	If we reach transaction commit, the changes are
!  * applied to pg_listener just before executing any pending NOTIFYs.  This
!  * method is necessary because to avoid race conditions, we must hold lock
!  * on pg_listener from when we insert a new listener tuple until we commit.
!  * To do that and not create undue hazard of deadlock, we don't want to
!  * touch pg_listener until we are otherwise done with the transaction;
!  * in particular it'd be uncool to still be taking user-commanded locks
!  * while holding the pg_listener lock.
!  *
!  * Although we grab ExclusiveLock on pg_listener for any operation,
!  * the lock is never held very long, so it shouldn't cause too much of
!  * a performance problem.  (Previously we used AccessExclusiveLock, but
!  * there's no real reason to forbid concurrent reads.)
   *
!  * An application that listens on the same relname it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * PID.  (As of FE/BE protocol 2.0, the backend's PID is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.  Note,
!  * however, that we do *not* guarantee that a separate frontend message will
!  * be sent for every outside NOTIFY.  Since there is only room for one
!  * originating PID in pg_listener, outside notifies occurring at about the
!  * same time may be collapsed into a single message bearing the PID of the
!  * first outside backend to perform the NOTIFY.
   *-------------------------------------------------------------------------
   */
  
  #include "postgres.h"
  
  #include <unistd.h>
  #include <signal.h>
  
  #include "access/heapam.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
  #include "catalog/pg_listener.h"
--- 69,114 ----
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  * 5. Inbound-notify processing consists of reading all of the notifications
!  *	  that have arrived since scanning last time. We read every notification
!  *	  until we reach the head pointer's position. Then we check if we were the
!  *	  laziest backend: if our pointer is set to the same position as the global
!  *	  tail pointer is set, then we set it further to the second-laziest
!  *	  backend (We can identify it by inspecting the positions of all other
!  *	  backends' pointers). Whenever we move the tail pointer we also truncate
!  *	  now unused pages (i.e. delete files in pg_notify/ that are no longer
!  *	  used).
!  *	  Note that we really read _any_ available notification in the queue. We
!  *	  also read uncommitted notifications from transactions that could still
!  *	  roll back. We must not deliver the notifications of those transactions
!  *	  but just copy them out of the queue. We save them in the
!  *	  uncommittedNotifications list which we try to deliver every time we
!  *	  check for available notifications.
   *
!  * An application that listens on the same channel it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * Pid.  (As of FE/BE protocol 2.0, the backend's Pid is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.
   *-------------------------------------------------------------------------
   */
  
+ /* XXX 
+  *
+  * TODO:
+  *  - guc parameter max_notifies_per_txn ??
+  *  - adapt comments
+  */
+ 
  #include "postgres.h"
  
  #include <unistd.h>
  #include <signal.h>
  
  #include "access/heapam.h"
+ #include "access/slru.h"
+ #include "access/transam.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
  #include "catalog/pg_listener.h"
***************
*** 108,115 ****
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction.  As explained above,
!  * we don't actually modify pg_listener until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
--- 129,136 ----
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction. As explained above,
!  * we don't actually send notifications until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
***************
*** 123,128 ****
--- 144,161 ----
  	LISTEN_UNLISTEN_ALL
  } ListenActionKind;
  
+ typedef enum
+ {
+ 	READ_ALL_TO_UNCOMMITTED,
+ 	READ_ONLY_COMMITTED
+ } QueueProcessType;
+ 
+ typedef enum
+ {
+ 	SIGNAL_ALL,
+ 	SIGNAL_SLOW
+ } SignalType;
+ 
  typedef struct
  {
  	ListenActionKind action;
***************
*** 133,140 ****
  
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all relnames NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
--- 166,177 ----
  
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
+ static List *uncommittedNotifications = NIL;
+ 
+ static bool needSignalBackends = false;
+ 
  /*
!  * State for outbound notifies consists of a list of all notifications queued
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
***************
*** 149,160 ****
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
- static List *pendingNotifies = NIL;		/* list of C strings */
  
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
  
  /*
!  * State for inbound notifies consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
--- 186,308 ----
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
  
+ typedef struct Notification
+ {
+ 	char		   *channel;
+ 	char		   *payload;
+ 	TransactionId	xid;
+ 	union {
+ 		/* we only need one of the two, depending on whether we send a
+ 		 * notification or receive one. */
+ 		int32		dstPid;
+ 		int32		srcPid;
+ 	};
+ } Notification;
+ 
+ typedef struct AsyncQueueEntry
+ {
+ 	/*
+ 	 * This struct is declared with the maximal length, but the entries we
+ 	 * write are usually only AsyncQueueEntryEmptySize + strlen(payload) long.
+ 	 */
+ 	Size			length;
+ 	Oid				dboid;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ 	char			channel[NAMEDATALEN];
+ 	char			payload[NOTIFY_PAYLOAD_MAX_LENGTH];
+ } AsyncQueueEntry;
+ #define AsyncQueueEntryEmptySize \
+ 	 (sizeof(AsyncQueueEntry) - NOTIFY_PAYLOAD_MAX_LENGTH + 1)
+ 
+ #define	InvalidPid (-1)
+ #define QUEUE_POS_PAGE(x) ((x).page)
+ #define QUEUE_POS_OFFSET(x) ((x).offset)
+ #define QUEUE_POS_EQUAL(x,y) \
+ 	 ((x).page == (y).page ? (x).offset == (y).offset : false)
+ #define SET_QUEUE_POS(x,y,z) \
+ 	do { \
+ 		(x).page = (y); \
+ 		(x).offset = (z); \
+ 	} while (0)
+ /* yield the logically older of positions x and y, with z = HEAD */
+ #define QUEUE_POS_MIN(x,y,z) \
+ 	(asyncQueuePagePrecedesLogically((x).page, (y).page, (z).page) ? (x) : \
+ 		 asyncQueuePagePrecedesLogically((y).page, (x).page, (z).page) ? (y) : \
+ 			 (x).offset < (y).offset ? (x) : \
+ 				(y))
+ #define QUEUE_BACKEND_POS(i) asyncQueueControl->backend[(i)].pos
+ #define QUEUE_BACKEND_PID(i) asyncQueueControl->backend[(i)].pid
+ #define QUEUE_HEAD asyncQueueControl->head
+ #define QUEUE_TAIL asyncQueueControl->tail
+ 
+ typedef struct QueuePosition
+ {
+ 	int				page;
+ 	int				offset;
+ } QueuePosition;
+ 
+ typedef struct QueueBackendStatus
+ {
+ 	int32			pid;
+ 	QueuePosition	pos;
+ } QueueBackendStatus;
+ 
+ /*
+  * The AsyncQueueControl structure is protected by the AsyncQueueLock.
+  *
+  * In SHARED mode, backends will only inspect their own entries as well as
+  * head and tail pointers. Consequently we can allow a backend to update its
+  * own record while holding only a shared lock (since no other backend will
+  * inspect it).
+  *
+  * In EXCLUSIVE mode, backends can inspect the entries of other backends and
+  * also change head and tail pointers.
+  *
+  * In order to avoid deadlocks, whenever we need both locks, we always first
+  * get AsyncQueueLock and then AsyncCtlLock.
+  */
+ typedef struct AsyncQueueControl
+ {
+ 	QueuePosition		head;		/* head points to the next free location */
+ 	QueuePosition 		tail;		/* the global tail is equivalent to the
+ 									   tail of the "slowest" backend */
+ 	TimestampTz			lastQueueFullWarn;	/* when the queue is full we only
+ 											   want to log that once in a
+ 											   while */
+ 	QueueBackendStatus	backend[1];	/* actually this one has as many entries as
+ 									 * connections are allowed (MaxBackends) */
+ 	/* DO NOT ADD FURTHER STRUCT MEMBERS HERE */
+ } AsyncQueueControl;
+ 
+ static AsyncQueueControl   *asyncQueueControl;
+ static SlruCtlData			AsyncCtlData;
+ 
+ #define AsyncCtl					(&AsyncCtlData)
+ #define QUEUE_PAGESIZE				BLCKSZ
+ #define QUEUE_FULL_WARN_INTERVAL	5000	/* warn at most once every 5s */
+ 
+ /*
+  * slru.c currently assumes that all filenames are four characters of hex
+  * digits. That means that we can use segments 0000 through FFFF.
+  * Each segment contains SLRU_PAGES_PER_SEGMENT pages which gives us
+  * the pages from 0 to SLRU_PAGES_PER_SEGMENT * 0xFFFF.
+  *
+  * It's of course easy to enhance slru.c but those pages give us so much
+  * space already that it doesn't seem worth the trouble...
+  *
+  * It's a legal test case to define QUEUE_MAX_PAGE to a very small multiple of
+  * SLRU_PAGES_PER_SEGMENT to test queue full behaviour.
+  */
+ #define QUEUE_MAX_PAGE			(SLRU_PAGES_PER_SEGMENT * 0xFFFF)
+ 
+ static List *pendingNotifies = NIL;				/* list of Notifications */
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
+ static List *listenChannels = NIL;	/* list of channels we are listening to */
  
  /*
!  * State for inbound notifications consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
***************
*** 171,207 ****
  
  bool		Trace_notify = false;
  
- 
  static void queue_listen(ListenActionKind action, const char *condname);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static void Exec_Listen(Relation lRel, const char *relname);
! static void Exec_Unlisten(Relation lRel, const char *relname);
! static void Exec_UnlistenAll(Relation lRel);
! static void Send_Notify(Relation lRel);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(char *relname, int32 listenerPID);
! static bool AsyncExistsPendingNotify(const char *relname);
  static void ClearPendingActionsAndNotifies(void);
  
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the relation to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", relname);
  
  	/* no point in making duplicate entries in the list ... */
! 	if (!AsyncExistsPendingNotify(relname))
  	{
  		/*
  		 * The name list needs to live until end of transaction, so store it
  		 * in the transaction context.
--- 319,469 ----
  
  bool		Trace_notify = false;
  
  static void queue_listen(ListenActionKind action, const char *condname);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static bool IsListeningOn(const char *channel);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
! static void Exec_Listen(const char *channel);
! static void Exec_Unlisten(const char *channel);
! static void Exec_UnlistenAll(void);
! static void SignalBackends(SignalType type);
! static void Send_Notify(void);
! static bool asyncQueuePagePrecedesPhysically(int p, int q);
! static bool asyncQueuePagePrecedesLogically(int p, int q, int head);
! static bool asyncQueueAdvance(QueuePosition *position, int entryLength);
! static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
! static void asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n);
! static List *asyncQueueAddEntries(List *notifications);
! static bool asyncQueueGetEntriesByPage(QueuePosition *current,
! 									   QueuePosition stop,
! 									   List **committed,
! 									   MemoryContext committedContext,
! 									   List **uncommitted,
! 									   MemoryContext uncommittedContext);
! static void asyncQueueReadAllNotifications(QueueProcessType type);
! static void asyncQueueAdvanceTail(void);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(const char *channel,
! 							 const char *payload,
! 							 int32 dstPid);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
  static void ClearPendingActionsAndNotifies(void);
  
+ /*
+  * We will work on the page range of 0..(SLRU_PAGES_PER_SEGMENT * 0xFFFF).
+  * asyncQueuePagePrecedesPhysically just checks numerically without any magic if
+  * one page precedes another one.
+  *
+  * On the other hand, when asyncQueuePagePrecedesLogically does that check, it
+  * takes the current head page number into account. Now if we have wrapped
+  * around, it can happen that p precedes q, even though p > q (if the head page
+  * is in between the two).
+  */ 
+ static bool
+ asyncQueuePagePrecedesPhysically(int p, int q)
+ {
+ 	return p < q;
+ }
+ 
+ static bool
+ asyncQueuePagePrecedesLogically(int p, int q, int head)
+ {
+ 	if (p <= head && q <= head)
+ 		return p < q;
+ 	if (p > head && q > head)
+ 		return p < q;
+ 	if (p <= head)
+ 	{
+ 		Assert(q > head);
+ 		/* q is older */
+ 		return false;
+ 	}
+ 	else
+ 	{
+ 		Assert(p > head && q <= head);
+ 		/* p is older */
+ 		return true;
+ 	}
+ }
+ 
+ void
+ AsyncShmemInit(void)
+ {
+ 	bool	found;
+ 	int		slotno;
+ 	Size	size;
+ 
+ 	/*
+ 	 * Remember that sizeof(AsyncQueueControl) already contains one member of
+ 	 * QueueBackendStatus, so we only need to add the status space requirement
+ 	 * for MaxBackends-1 backends.
+ 	 */
+ 	size = mul_size(MaxBackends-1, sizeof(QueueBackendStatus));
+ 	size = add_size(size, sizeof(AsyncQueueControl));
+ 
+ 	asyncQueueControl = (AsyncQueueControl *)
+ 		ShmemInitStruct("Async Queue Control", size, &found);
+ 
+ 	if (!asyncQueueControl)
+ 		elog(ERROR, "out of memory");
+ 
+ 	if (!found)
+ 	{
+ 		int		i;
+ 		SET_QUEUE_POS(QUEUE_HEAD, 0, 0);
+ 		SET_QUEUE_POS(QUEUE_TAIL, QUEUE_MAX_PAGE, 0);
+ 		for (i = 0; i < MaxBackends; i++)
+ 		{
+ 			SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ 			QUEUE_BACKEND_PID(i) = InvalidPid;
+ 		}
+ 	}
+ 
+ 	AsyncCtl->PagePrecedes = asyncQueuePagePrecedesPhysically;
+ 	SimpleLruInit(AsyncCtl, "Async Ctl", NUM_ASYNC_BUFFERS, 0,
+ 				  AsyncCtlLock, "pg_notify");
+ 	AsyncCtl->do_fsync = false;
+ 	asyncQueueControl->lastQueueFullWarn = GetCurrentTimestamp();
+ 
+ 	if (!found)
+ 	{
+ 		LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ 		LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
+ 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
+ 		AsyncCtl->shared->page_dirty[slotno] = true;
+ 		SimpleLruWritePage(AsyncCtl, slotno, NULL);
+ 		LWLockRelease(AsyncCtlLock);
+ 		LWLockRelease(AsyncQueueLock);
+ 
+ 		SlruScanDirectory(AsyncCtl, QUEUE_MAX_PAGE, true);
+ 	}
+ }
+ 
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the channel to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *channel, const char *payload)
  {
+ 
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", channel);
! 
! 	/*
! 	 * XXX - do we now need a guc parameter max_notifies_per_txn?
! 	 */ 
  
  	/* no point in making duplicate entries in the list ... */
! 	if (!AsyncExistsPendingNotify(channel, payload))
  	{
+ 		Notification *n;
  		/*
  		 * The name list needs to live until end of transaction, so store it
  		 * in the transaction context.
***************
*** 210,221 ****
  
  		oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  		/*
! 		 * Ordering of the list isn't important.  We choose to put new entries
! 		 * on the front, as this might make duplicate-elimination a tad faster
! 		 * when the same condition is signaled many times in a row.
  		 */
! 		pendingNotifies = lcons(pstrdup(relname), pendingNotifies);
  
  		MemoryContextSwitchTo(oldcontext);
  	}
--- 472,492 ----
  
  		oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
+ 		n = (Notification *) palloc(sizeof(Notification));
+ 		/* will set the xid later... */
+ 		n->xid = InvalidTransactionId;
+ 		n->channel = pstrdup(channel);
+ 		if (payload)
+ 			n->payload = pstrdup(payload);
+ 		else
+ 			n->payload = "";
+ 		n->dstPid = InvalidPid;
+ 
  		/*
! 		 * We want to preserve the order so we need to append every
! 		 * notification. See comments at AsyncExistsPendingNotify().
  		 */
! 		pendingNotifies = lappend(pendingNotifies, n);
  
  		MemoryContextSwitchTo(oldcontext);
  	}
***************
*** 259,270 ****
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", relname, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, relname);
  }
  
  /*
--- 530,541 ----
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", channel, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, channel);
  }
  
  /*
***************
*** 273,288 ****
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", relname, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, relname);
  }
  
  /*
--- 544,559 ----
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", channel, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, channel);
  }
  
  /*
***************
*** 306,313 ****
  /*
   * Async_UnlistenOnExit
   *
-  *		Clean up the pg_listener table at backend exit.
-  *
   *		This is executed if we have done any LISTENs in this backend.
   *		It might not be necessary anymore, if the user UNLISTENed everything,
   *		but we don't try to detect that case.
--- 577,582 ----
***************
*** 315,331 ****
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
- 	/*
- 	 * We need to start/commit a transaction for the unlisten, but if there is
- 	 * already an active transaction we had better abort that one first.
- 	 * Otherwise we'd end up committing changes that probably ought to be
- 	 * discarded.
- 	 */
  	AbortOutOfAnyTransaction();
! 	/* Now we can do the unlisten */
! 	StartTransactionCommand();
! 	Async_UnlistenAll();
! 	CommitTransactionCommand();
  }
  
  /*
--- 584,591 ----
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
  	AbortOutOfAnyTransaction();
! 	Exec_UnlistenAll();
  }
  
  /*
***************
*** 348,357 ****
  	/* We can deal with pending NOTIFY though */
  	foreach(p, pendingNotifies)
  	{
! 		const char *relname = (const char *) lfirst(p);
  
  		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   relname, strlen(relname) + 1);
  	}
  
  	/*
--- 608,622 ----
  	/* We can deal with pending NOTIFY though */
  	foreach(p, pendingNotifies)
  	{
! 		AsyncQueueEntry qe;
! 		Notification   *n;
! 
! 		n = (Notification *) lfirst(p);
! 
! 		asyncQueueNotificationToEntry(n, &qe);
  
  		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   &qe, qe.length);
  	}
  
  	/*
***************
*** 363,388 ****
  }
  
  /*
!  * AtCommit_Notify
!  *
!  *		This is called at transaction commit.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, insert or delete
!  *		tuples in pg_listener accordingly.
   *
!  *		If there are outbound notify requests in the pendingNotifies list,
!  *		scan pg_listener for matching tuples, and either signal the other
!  *		backend or send a message to our own frontend.
   *
!  *		NOTE: we are still inside the current transaction, therefore can
!  *		piggyback on its committing of changes.
   */
  void
! AtCommit_Notify(void)
  {
- 	Relation	lRel;
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
  		return;					/* no relevant statements in this xact */
  
--- 628,651 ----
  }
  
  /*
!  * AtCommit_NotifyBeforeCommit
   *
!  *		This is called at transaction commit, before actually committing to
!  *		clog.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, update our
!  *		"listenChannels" list.
   *
!  *		If there are outbound notify requests in the pendingNotifies list, add
!  *		them to the global queue; listening backends are signaled after commit.
   */
  void
! AtCommit_NotifyBeforeCommit(void)
  {
  	ListCell   *p;
  
+ 	needSignalBackends = false;
+ 
  	if (pendingActions == NIL && pendingNotifies == NIL)
  		return;					/* no relevant statements in this xact */
  
***************
*** 397,406 ****
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify");
! 
! 	/* Acquire ExclusiveLock on pg_listener */
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
--- 660,666 ----
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyBeforeCommit");
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
***************
*** 410,508 ****
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_Listen(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN:
! 				Exec_Unlisten(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAll(lRel);
  				break;
  		}
- 
- 		/* We must CCI after each action in case of conflicting actions */
- 		CommandCounterIncrement();
  	}
  
- 	/* Perform any pending notifies */
- 	if (pendingNotifies)
- 		Send_Notify(lRel);
- 
  	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that notified backends see our tuple updates when they look. Else they
! 	 * might disregard the signal, which would make the application programmer
! 	 * very unhappy.  Also, this prevents race conditions when we have just
! 	 * inserted a listening tuple.
  	 */
! 	heap_close(lRel, NoLock);
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify: done");
  }
  
  /*
   * Exec_Listen --- subroutine for AtCommit_Notify
   *
!  *		Register the current backend as listening on the specified relation.
   */
  static void
! Exec_Listen(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
! 	Datum		values[Natts_pg_listener];
! 	bool		nulls[Natts_pg_listener];
! 	NameData	condname;
! 	bool		alreadyListener = false;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Listen(%s,%d)", relname, MyProcPid);
! 
! 	/* Detect whether we are already listening on this relname */
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
! 
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
! 		{
! 			alreadyListener = true;
! 			/* No need to scan the rest of the table */
! 			break;
! 		}
! 	}
! 	heap_endscan(scan);
  
! 	if (alreadyListener)
  		return;
  
  	/*
! 	 * OK to insert a new tuple
  	 */
! 	memset(nulls, false, sizeof(nulls));
! 
! 	namestrcpy(&condname, relname);
! 	values[Anum_pg_listener_relname - 1] = NameGetDatum(&condname);
! 	values[Anum_pg_listener_listenerpid - 1] = Int32GetDatum(MyProcPid);
! 	values[Anum_pg_listener_notification - 1] = Int32GetDatum(0);		/* no notifies pending */
! 
! 	tuple = heap_form_tuple(RelationGetDescr(lRel), values, nulls);
! 
! 	simple_heap_insert(lRel, tuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 	CatalogUpdateIndexes(lRel, tuple);
! #endif
  
! 	heap_freetuple(tuple);
  
  	/*
! 	 * now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
--- 670,780 ----
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_Listen(actrec->condname);
  				break;
  			case LISTEN_UNLISTEN:
! 				Exec_Unlisten(actrec->condname);
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAll();
  				break;
  		}
  	}
  
  	/*
! 	 * Perform any pending notifies.
  	 */
! 	if (pendingNotifies)
! 	{
! 		needSignalBackends = true;
! 		Send_Notify();
! 	}
! }
! 
! /*
!  * AtCommit_NotifyAfterCommit
!  *
!  *		This is called at transaction commit, after committing to clog.
!  *
!  *		Notify the listening backends.
!  */
! void
! AtCommit_NotifyAfterCommit(void)
! {
! 	if (needSignalBackends)
! 		SignalBackends(SIGNAL_ALL);
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyAfterCommit: done");
! }
! 
! /*
!  * This function is executed for every notification found in the queue in order
!  * to check if the current backend is listening on that channel. Not sure if we
!  * should further optimize this, for example convert to a sorted array and
!  * allow binary search on it...
!  */
! static bool
! IsListeningOn(const char *channel)
! {
! 	ListCell   *p;
! 
! 	foreach(p, listenChannels)
! 	{
! 		char *lchan = (char *) lfirst(p);
! 		if (strcmp(lchan, channel) == 0)
! 			/* already listening on this channel */
! 			return true;
! 	}
! 	return false;
  }
  
+ 
  /*
   * Exec_Listen --- subroutine for AtCommit_Notify
   *
!  *		Register the current backend as listening on the specified channel.
   */
  static void
! Exec_Listen(const char *channel)
  {
! 	MemoryContext oldcontext;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Listen(%s,%d)", channel, MyProcPid);
  
! 	/* Detect whether we are already listening on this channel */
! 	if (IsListeningOn(channel))
  		return;
  
  	/*
! 	 * OK to insert into the list.
  	 */
! 	if (listenChannels == NIL)
! 	{
! 		/*
! 		 * This is our first LISTEN, establish our pointer.
! 		 */
! 		LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 		QUEUE_BACKEND_POS(MyBackendId) = QUEUE_HEAD;
! 		QUEUE_BACKEND_PID(MyBackendId) = MyProcPid;
! 		LWLockRelease(AsyncQueueLock);
! 		/*
! 		 * Actually this is only necessary if we are the first listener
! 		 * (The tail pointer needs to be identical with the pointer of at
! 		 * least one backend).
! 		 */
! 		asyncQueueAdvanceTail();
! 	}
  
! 	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
! 	listenChannels = lappend(listenChannels, pstrdup(channel));
! 	MemoryContextSwitchTo(oldcontext);
  
  	/*
! 	 * Now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
***************
*** 514,551 ****
  /*
   * Exec_Unlisten --- subroutine for AtCommit_Notify
   *
!  *		Remove the current backend from the list of listening backends
!  *		for the specified relation.
   */
  static void
! Exec_Unlisten(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Unlisten(%s,%d)", relname, MyProcPid);
  
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
! 
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
  		{
- 			/* Found the matching tuple, delete it */
- 			simple_heap_delete(lRel, &tuple->t_self);
- 
  			/*
! 			 * We assume there can be only one match, so no need to scan the
! 			 * rest of the table
  			 */
! 			break;
  		}
  	}
! 	heap_endscan(scan);
! 
  	/*
  	 * We do not complain about unlistening something not being listened;
  	 * should we?
--- 786,838 ----
  /*
   * Exec_Unlisten --- subroutine for AtCommit_Notify
   *
!  *		Remove the specified channel from "listenChannels".
   */
  static void
! Exec_Unlisten(const char *channel)
  {
! 	ListCell   *p;
! 	ListCell   *prev = NULL;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Unlisten(%s,%d)", channel, MyProcPid);
  
! 	/* Detect whether we are already listening on this channel */
! 	foreach(p, listenChannels)
  	{
! 		char *lchan = (char *) lfirst(p);
! 		if (strcmp(lchan, channel) == 0)
  		{
  			/*
! 			 * Since the list is living in the TopMemoryContext, we free
! 			 * the memory. The ListCell is freed by list_delete_cell().
  			 */
! 			pfree(lchan);
! 			listenChannels = list_delete_cell(listenChannels, p, prev);
! 			if (listenChannels == NIL)
! 			{
! 				bool advanceTail = false;
! 				/*
! 				 * This backend is not listening anymore.
! 				 */
! 				LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 				QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
! 
! 				/*
! 				 * If we have been the last backend, advance the tail pointer.
! 				 */
! 				if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
! 					advanceTail = true;
! 				LWLockRelease(AsyncQueueLock);
! 
! 				if (advanceTail)
! 					asyncQueueAdvanceTail();
! 			}
! 			return;
  		}
+ 		prev = p;
  	}
! 	
  	/*
  	 * We do not complain about unlistening something not being listened;
  	 * should we?
***************
*** 555,677 ****
  /*
   * Exec_UnlistenAll --- subroutine for AtCommit_Notify
   *
!  *		Update pg_listener to unlisten all relations for this backend.
   */
  static void
! Exec_UnlistenAll(Relation lRel)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple;
! 	ScanKeyData key[1];
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAll");
  
! 	/* Find and delete all entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
  
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 		simple_heap_delete(lRel, &lTuple->t_self);
  
! 	heap_endscan(scan);
  }
  
  /*
!  * Send_Notify --- subroutine for AtCommit_Notify
!  *
!  *		Scan pg_listener for tuples matching our pending notifies, and
!  *		either signal the other backend or send a message to our own frontend.
   */
  static void
! Send_Notify(Relation lRel)
  {
! 	TupleDesc	tdesc = RelationGetDescr(lRel);
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 
! 	/* preset data to update notify column to MyProcPid */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(MyProcPid);
! 
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		listenerPID = listener->listenerpid;
  
! 		if (!AsyncExistsPendingNotify(relname))
! 			continue;
  
! 		if (listenerPID == MyProcPid)
  		{
! 			/*
! 			 * Self-notify: no need to bother with table update. Indeed, we
! 			 * *must not* clear the notification field in this path, or we
! 			 * could lose an outside notify, which'd be bad for applications
! 			 * that ignore self-notify messages.
! 			 */
! 			if (Trace_notify)
! 				elog(DEBUG1, "AtCommit_Notify: notifying self");
  
! 			NotifyMyFrontEnd(relname, listenerPID);
  		}
  		else
  		{
- 			if (Trace_notify)
- 				elog(DEBUG1, "AtCommit_Notify: notifying pid %d",
- 					 listenerPID);
- 
  			/*
! 			 * If someone has already notified this listener, we don't bother
! 			 * modifying the table, but we do still send a NOTIFY_INTERRUPT
! 			 * signal, just in case that backend missed the earlier signal for
! 			 * some reason.  It's OK to send the signal first, because the
! 			 * other guy can't read pg_listener until we unlock it.
! 			 *
! 			 * Note: we don't have the other guy's BackendId available, so
! 			 * this will incur a search of the ProcSignal table.  That's
! 			 * probably not worth worrying about.
  			 */
! 			if (SendProcSignal(listenerPID, PROCSIG_NOTIFY_INTERRUPT,
! 							   InvalidBackendId) < 0)
! 			{
! 				/*
! 				 * Get rid of pg_listener entry if it refers to a PID that no
! 				 * longer exists.  Presumably, that backend crashed without
! 				 * deleting its pg_listener entries. This code used to only
! 				 * delete the entry if errno==ESRCH, but as far as I can see
! 				 * we should just do it for any failure (certainly at least
! 				 * for EPERM too...)
! 				 */
! 				simple_heap_delete(lRel, &lTuple->t_self);
! 			}
! 			else if (listener->notification == 0)
! 			{
! 				/* Rewrite the tuple with my PID in notification column */
! 				rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 				simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 				CatalogUpdateIndexes(lRel, rTuple);
! #endif
! 			}
  		}
  	}
  
! 	heap_endscan(scan);
  }
  
  /*
--- 842,1150 ----
  /*
   * Exec_UnlistenAll --- subroutine for AtCommit_Notify
   *
!  *		Unlisten on all channels for this backend.
   */
  static void
! Exec_UnlistenAll(void)
  {
! 	bool advanceTail = false;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAll(%d)", MyProcPid);
! 
! 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 	QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
  
! 	/*
! 	 * Since the list is living in the TopMemoryContext, we free the memory.
! 	 */
! 	list_free_deep(listenChannels);
! 	listenChannels = NIL;
  
! 	/*
! 	 * If we were the slowest backend, advance the tail pointer.
! 	 */
! 	if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
! 		advanceTail = true;
! 	LWLockRelease(AsyncQueueLock);
  
! 	if (advanceTail)
! 		asyncQueueAdvanceTail();
! }
! 
! static bool
! asyncQueueIsFull(void)
! {
! 	QueuePosition	lookahead = QUEUE_HEAD;
! 	Size remain = QUEUE_PAGESIZE - QUEUE_POS_OFFSET(lookahead) - 1;
! 	Size advance = Min(remain, NOTIFY_PAYLOAD_MAX_LENGTH);
! 
! 	/*
! 	 * Check what happens if we wrote a maximally sized entry. Would we go to a
! 	 * new page? If not, then our queue can not be full (because we can still
! 	 * fill at least the current page with at least one more entry).
! 	 */
! 	if (!asyncQueueAdvance(&lookahead, advance))
! 		return false;
! 
! 	/*
! 	 * The queue is full if with a switch to a new page we reach the page
! 	 * of the tail pointer.
! 	 */
! 	return QUEUE_POS_PAGE(lookahead) == QUEUE_POS_PAGE(QUEUE_TAIL);
  }
  
  /*
!  * The function advances the position to the next entry. In case we jump to
!  * a new page the function returns true, else false.
   */
+ static bool
+ asyncQueueAdvance(QueuePosition *position, int entryLength)
+ {
+ 	int		pageno = QUEUE_POS_PAGE(*position);
+ 	int		offset = QUEUE_POS_OFFSET(*position);
+ 	bool	pageJump = false;
+ 
+ 	/*
+ 	 * Move to the next writing position: First jump over what we have just
+ 	 * written or read.
+ 	 */
+ 	offset += entryLength;
+ 	Assert(offset < QUEUE_PAGESIZE);
+ 
+ 	/*
+ 	 * In a second step check if another entry can be written to the page. If
+ 	 * it can, stay here, we have reached the next position. If not, then we
+ 	 * need to move on to the next page.
+ 	 */
+ 	if (offset + AsyncQueueEntryEmptySize >= QUEUE_PAGESIZE)
+ 	{
+ 		pageno++;
+ 		if (pageno > QUEUE_MAX_PAGE)
+ 			/* wrap around */
+ 			pageno = 0;
+ 		offset = 0;
+ 		pageJump = true;
+ 	}
+ 
+ 	SET_QUEUE_POS(*position, pageno, offset);
+ 	return pageJump;
+ }
+ 
  static void
! asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
  {
! 		Assert(n->channel);
! 		Assert(n->payload);
! 		Assert(strlen(n->payload) <= NOTIFY_PAYLOAD_MAX_LENGTH);
! 
! 		/* The terminator is already included in AsyncQueueEntryEmptySize */
! 		qe->length = AsyncQueueEntryEmptySize + strlen(n->payload);
! 		qe->srcPid = MyProcPid;
! 		qe->dboid = MyDatabaseId;
! 		qe->xid = GetCurrentTransactionId();
! 		strcpy(qe->channel, n->channel);
! 		strcpy(qe->payload, n->payload);
! }
! 
! static void
! asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n)
! {
! 	n->channel = pstrdup(qe->channel);
! 	n->payload = pstrdup(qe->payload);
! 	n->srcPid = qe->srcPid;
! 	n->xid = qe->xid;
! }
! 
! static List *
! asyncQueueAddEntries(List *notifications)
! {
! 	int				pageno;
! 	int				offset;
! 	int				slotno;
! 	AsyncQueueEntry	qe;
! 
! 	/*
! 	 * Note that we are holding exclusive AsyncQueueLock already.
! 	 */
! 	LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
! 	pageno = QUEUE_POS_PAGE(QUEUE_HEAD);
! 	slotno = SimpleLruReadPage(AsyncCtl, pageno, true, InvalidTransactionId);
! 	AsyncCtl->shared->page_dirty[slotno] = true;
  
! 	do
! 	{
! 		Notification   *n;
  
! 		if (asyncQueueIsFull())
  		{
! 			/* document that we will not go into the if command further down */
! 			Assert(QUEUE_POS_OFFSET(QUEUE_HEAD) != 0);
! 			break;
! 		}
  
! 		n = (Notification *) linitial(notifications);
! 
! 		asyncQueueNotificationToEntry(n, &qe);
! 
! 		offset = QUEUE_POS_OFFSET(QUEUE_HEAD);
! 		/*
! 		 * Check whether or not the entry still fits on the current page.
! 		 */
! 		if (offset + qe.length < QUEUE_PAGESIZE)
! 		{
! 			notifications = list_delete_first(notifications);
  		}
  		else
  		{
  			/*
! 			 * Write a dummy entry to fill up the page. Actually readers will
! 			 * only check dboid and since it won't match any reader's database
! 			 * oid, they will ignore this entry and move on.
  			 */
! 			qe.length = QUEUE_PAGESIZE - offset - 1;
! 			qe.dboid = InvalidOid;
! 			qe.channel[0] = '\0';
! 			qe.payload[0] = '\0';
! 			qe.xid = InvalidTransactionId;
  		}
+ 		memcpy((char*) AsyncCtl->shared->page_buffer[slotno] + offset,
+ 			   &qe, qe.length);
+ 
+ 	} while (!asyncQueueAdvance(&(QUEUE_HEAD), qe.length)
+ 			 && notifications != NIL);
+ 
+ 	if (QUEUE_POS_OFFSET(QUEUE_HEAD) == 0)
+ 	{
+ 		/*
+ 		 * If the next entry needs to go to a new page, prepare that page
+ 		 * already.
+ 		 */
+ 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
+ 		AsyncCtl->shared->page_dirty[slotno] = true;
  	}
+ 	LWLockRelease(AsyncCtlLock);
  
! 	return notifications;
! }
! 
! static void
! asyncQueueFullWarning()
! {
! 	/*
! 	 * Caller must hold exclusive AsyncQueueLock.
! 	 */
! 	TimestampTz		t = GetCurrentTimestamp();
! 	QueuePosition	min = QUEUE_HEAD;
! 	int32			minPid = InvalidPid;
! 	int				i;
! 
! 	for (i = 0; i < MaxBackends; i++)
! 		if (QUEUE_BACKEND_PID(i) != InvalidPid)
! 		{
! 			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
! 			if (QUEUE_POS_EQUAL(min, QUEUE_BACKEND_POS(i)))
! 				minPid = QUEUE_BACKEND_PID(i);
! 		}
! 
! 	if (TimestampDifferenceExceeds(asyncQueueControl->lastQueueFullWarn,
! 								   t, QUEUE_FULL_WARN_INTERVAL))
! 	{
! 		ereport(WARNING, (errmsg("pg_notify queue is full. "
! 								 "Among the slowest backends: %d", minPid)));
! 		asyncQueueControl->lastQueueFullWarn = t;
! 	}
! }
! 
! /*
!  * Send_Notify --- subroutine for AtCommit_Notify
!  *
!  * Add the pending notifications to the queue and signal the listening
!  * backends.
!  *
!  * A full queue is very uncommon and should really not happen, given that we
!  * have so much space available in our slru pages. Nevertheless we need to
!  * deal with this possibility. Note that when we get here we are in the process
!  * of committing our transaction, we have not yet committed to clog but this
!  * would be the next step.
!  */
! static void
! Send_Notify()
! {
! 	while (pendingNotifies != NIL)
! 	{
! 		LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 		while (asyncQueueIsFull())
! 		{
! 			asyncQueueFullWarning();
! 			LWLockRelease(AsyncQueueLock);
! 
! 			/* check if our query is cancelled */
! 			CHECK_FOR_INTERRUPTS();
! 
! 			SignalBackends(SIGNAL_SLOW);
! 
! 			asyncQueueReadAllNotifications(READ_ALL_TO_UNCOMMITTED);
! 
! 			asyncQueueAdvanceTail();
! 			pg_usleep(100 * 1000L); /* 100ms */
! 			LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 		}
! 		Assert(pendingNotifies != NIL);
! 		pendingNotifies = asyncQueueAddEntries(pendingNotifies);
! 		LWLockRelease(AsyncQueueLock);
! 	}
! }
! 
! /*
!  * Send signals to all listening backends. It would be easy here to check
!  * for backends that are already up-to-date, i.e.
!  *
!  *   QUEUE_BACKEND_POS(pid) == QUEUE_HEAD
!  *
!  * but in general we need to signal them anyway. If we didn't, we would not
!  * have the guarantee that they can deliver their notifications from
!  * uncommittedNotifications. Only when the queue is full and we signal the
!  * backends to read also uncommitted data, we can use this optimization.
!  *
!  * Since we know the BackendId and the Pid the signalling is quite cheap.
!  */
! static void
! SignalBackends(SignalType type)
! {
! 	ListCell	   *p1, *p2;
! 	int				i;
! 	int32			pid;
! 	List		   *pids = NIL;
! 	List		   *ids = NIL;
! 
! 	/* Signal everybody who is LISTENing to any channel. */
! 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 	for (i = 0; i < MaxBackends; i++)
! 	{
! 		pid = QUEUE_BACKEND_PID(i);
! 		if (pid != InvalidPid)
! 		{
! 			if (type == SIGNAL_SLOW &&
! 					QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(i), QUEUE_HEAD))
! 				continue;
! 			pids = lappend_int(pids, pid);
! 			ids = lappend_int(ids, i);
! 		}
! 	}
! 	LWLockRelease(AsyncQueueLock);
! 	
! 	forboth(p1, pids, p2, ids)
! 	{
! 		pid = (int32) lfirst_int(p1);
! 		i = lfirst_int(p2);
! 		/*
! 		 * Should we check for failure? Can it happen that a backend
! 		 * has crashed without the postmaster starting over?
! 		 */
! 		if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, i) < 0)
! 			elog(WARNING, "Error signalling backend %d", pid);
! 	}
  }
  
  /*
***************
*** 940,968 ****
  }
  
  /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan pg_listener for arriving notifies, report them to my front end,
!  *		and clear the notification field in pg_listener until next time.
   *
!  *		NOTE: since we are outside any transaction, we must create our own.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	Relation	lRel;
! 	TupleDesc	tdesc;
! 	ScanKeyData key[1];
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 	bool		catchup_enabled;
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
--- 1413,1641 ----
  }
  
  /*
+  * This function will ask for a page with ReadOnly access and once we have the
+  * lock, we read the whole content and pass back two lists of notifications
+  * that the calling function will deliver then. The first list will contain all
+  * notifications from transactions that have already committed and the second
+  * one will contain uncommitted notifications.
+  *
+  * We stop if we have either reached the stop position or go to a new page.
+  *
+  * Returns true if we have reached the end (i.e. it does not make sense to
+  * call this function again), else false.
+  */
+ static bool
+ asyncQueueGetEntriesByPage(QueuePosition *current, QueuePosition stop,
+ 						   List **committed, MemoryContext committedContext,
+ 						   List **uncommitted, MemoryContext uncommittedContext)
+ {
+ 	int				slotno;
+ 	AsyncQueueEntry	qe;
+ 	Notification   *n;
+ 	bool			reachedStop = false;
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		return true;
+ 
+ 	slotno = SimpleLruReadPage_ReadOnly(AsyncCtl, current->page,
+ 										InvalidTransactionId);
+ 	do {
+ 		char *readPtr = (char *) (AsyncCtl->shared->page_buffer[slotno]);
+ 		readPtr += current->offset;
+ 
+ 		if (QUEUE_POS_EQUAL(*current, stop))
+ 		{
+ 			reachedStop = true;
+ 			break;
+ 		}
+ 
+ 		memcpy(&qe, readPtr, AsyncQueueEntryEmptySize);
+ 
+ 		if (qe.dboid == MyDatabaseId)
+ 		{
+ 			MemoryContext	oldcontext;
+ 
+ 			if (TransactionIdDidCommit(qe.xid))
+ 			{
+ 				Assert(committed != NULL);
+ 				if (IsListeningOn(qe.channel))
+ 				{
+ 					if (qe.length > AsyncQueueEntryEmptySize)
+ 						memcpy(&qe, readPtr, qe.length);
+ 					oldcontext = MemoryContextSwitchTo(committedContext);
+ 					n = (Notification *) palloc(sizeof(Notification));
+ 					asyncQueueEntryToNotification(&qe, n);
+ 					*committed = lappend(*committed, n);
+ 					MemoryContextSwitchTo(oldcontext);
+ 				}
+ 			}
+ 			else
+ 			{
+ 				if (!TransactionIdDidAbort(qe.xid))
+ 				{
+ 					/*
+ 					 * We have found a transaction that has not committed.
+ 					 * Should we read uncommitted data or not ?
+ 					 */
+ 					if (!uncommitted)
+ 					{
+ 						reachedStop = true;
+ 						break;
+ 					}
+ 					if (qe.length > AsyncQueueEntryEmptySize)
+ 						memcpy(&qe, readPtr, qe.length);
+ 					oldcontext = MemoryContextSwitchTo(uncommittedContext);
+ 					n = (Notification *) palloc(sizeof(Notification));
+ 					asyncQueueEntryToNotification(&qe, n);
+ 					*uncommitted= lappend(*uncommitted, n);
+ 					MemoryContextSwitchTo(oldcontext);
+ 				}
+ 			}
+ 		}
+ 		/*
+ 		 * The call to asyncQueueAdvance just jumps over what we have
+ 		 * just read. If there is no more space for the next record on the
+ 		 * current page, it will also switch to the beginning of the next page.
+ 		 */
+ 	} while(!asyncQueueAdvance(current, qe.length));
+ 
+ 	LWLockRelease(AsyncCtlLock);
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		reachedStop = true;
+ 
+ 	return reachedStop;
+ }
+ 
+ static void
+ asyncQueueReadAllNotifications(QueueProcessType type)
+ {
+ 	QueuePosition	pos;
+ 	QueuePosition	oldpos;
+ 	QueuePosition	head;
+ 	List		   *notifications;
+ 	ListCell	   *lc;
+ 	Notification   *n;
+ 	bool			advanceTail = false;
+ 	bool			reachedStop;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	pos = oldpos = QUEUE_BACKEND_POS(MyBackendId);
+ 	head = QUEUE_HEAD;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	/* Nothing to do, we have read all notifications already. */
+ 	if (QUEUE_POS_EQUAL(pos, head))
+ 		return;
+ 
+ 	do 
+ 	{
+ 		/*
+ 		 * Our stop position is what we found to be the head's position when
+ 		 * we entered this function. It might have changed already. But if it
+ 		 * has, we will receive (or have already received and queued) another
+ 		 * signal and come here again.
+ 		 *
+ 		 * We are not holding AsyncQueueLock here! The queue can only extend
+ 		 * beyond the head pointer (see above) and we leave our backend's
+ 		 * pointer where it is so nobody will truncate or rewrite pages under
+ 		 * us.
+ 		 */
+ 		reachedStop = false;
+ 
+ 		if (type == READ_ALL_TO_UNCOMMITTED)
+ 			/*
+ 			 * If the queue is full, we call this in the writing backend.
+ 			 * If a backend sends more notifications than the queue can hold
+ 			 * it also needs to read its own notifications from time to time
+ 			 * such that it can reuse the space of the queue.
+ 			 */
+ 			reachedStop = asyncQueueGetEntriesByPage(&pos, head,
+ 								   &uncommittedNotifications, TopMemoryContext,
+ 								   &uncommittedNotifications, TopMemoryContext);
+ 		else
+ 		{
+ 			/*
+ 			 * This is called from ProcessIncomingNotify()
+ 			 */
+ 			Assert(type == READ_ONLY_COMMITTED);
+ 
+ 			notifications = NIL;
+ 			reachedStop = asyncQueueGetEntriesByPage(&pos, head,
+ 								   &notifications, CurrentMemoryContext,
+ 								   NULL, CurrentMemoryContext);
+ 
+ 			foreach(lc, notifications)
+ 			{
+ 				n = (Notification *) lfirst(lc);
+ 				NotifyMyFrontEnd(n->channel, n->payload, n->srcPid);
+ 			}
+ 		}
+ 	} while (!reachedStop);
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	QUEUE_BACKEND_POS(MyBackendId) = pos;
+ 	if (QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL))
+ 		advanceTail = true;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (advanceTail)
+ 		/* Move forward the tail pointer and try to truncate. */
+ 		asyncQueueAdvanceTail();
+ }
+ 
+ static void
+ asyncQueueAdvanceTail()
+ {
+ 	QueuePosition	min;
+ 	int				i;
+ 	int				tailPage;
+ 	int				headPage;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ 	min = QUEUE_HEAD;
+ 	for (i = 0; i < MaxBackends; i++)
+ 		if (QUEUE_BACKEND_PID(i) != InvalidPid)
+ 			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
+ 
+ 	tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
+ 	headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
+ 	QUEUE_TAIL = min;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	/* This is our wraparound check */
+ 	if (asyncQueuePagePrecedesLogically(tailPage, QUEUE_POS_PAGE(min), headPage)
+ 		&& asyncQueuePagePrecedesPhysically(tailPage, headPage))
+ 	{
+ 		/*
+ 		 * SimpleLruTruncate() will ask for AsyncCtlLock but will also
+ 		 * release the lock again.
+ 		 *
+ 		 * Don't even bother grabbing the lock if we can only truncate at most
+ 		 * one page...
+ 		 */
+ 		if (QUEUE_POS_PAGE(min) - tailPage > SLRU_PAGES_PER_SEGMENT)
+ 			SimpleLruTruncate(AsyncCtl, QUEUE_POS_PAGE(min));
+ 	}
+ }
+ 
+ /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan the queue for arriving notifications and report them to my front
!  *		end.
   *
!  *		NOTE: we are outside of any transaction here.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	bool			catchup_enabled;
! 
! 	Assert(GetCurrentTransactionIdIfAny() == InvalidTransactionId);
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
***************
*** 974,1037 ****
  
  	notifyInterruptOccurred = 0;
  
! 	StartTransactionCommand();
! 
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
! 	tdesc = RelationGetDescr(lRel);
! 
! 	/* Scan only entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
! 
! 	/* Prepare data for rewriting 0 into notification field */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(0);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		sourcePID = listener->notification;
  
! 		if (sourcePID != 0)
  		{
! 			/* Notify the frontend */
! 
! 			if (Trace_notify)
! 				elog(DEBUG1, "ProcessIncomingNotify: received %s from %d",
! 					 relname, (int) sourcePID);
! 
! 			NotifyMyFrontEnd(relname, sourcePID);
! 
! 			/*
! 			 * Rewrite the tuple with 0 in notification column.
! 			 */
! 			rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 			simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 			CatalogUpdateIndexes(lRel, rTuple);
! #endif
  		}
  	}
- 	heap_endscan(scan);
  
! 	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that other backends see our tuple updates when they look. Otherwise, a
! 	 * transaction started after this one might mistakenly think it doesn't
! 	 * need to send this backend a new NOTIFY.
! 	 */
! 	heap_close(lRel, NoLock);
! 
! 	CommitTransactionCommand();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
--- 1647,1681 ----
  
  	notifyInterruptOccurred = 0;
  
! 	/*
!  	 * Work on the uncommitted notifications list until we hit the first
! 	 * still-running transaction.
! 	 */
! 	while(uncommittedNotifications != NIL)
! 	{
! 		ListCell	   *lc;
! 		Notification   *n;
  
! 		n = (Notification *) linitial(uncommittedNotifications);
! 		if (TransactionIdDidCommit(n->xid))
  		{
! 			if (IsListeningOn(n->channel))
! 				NotifyMyFrontEnd(n->channel, n->payload, n->srcPid);
! 		}
! 		else
! 		{
! 			if (!TransactionIdDidAbort(n->xid))
! 				/* n->xid still running */
! 				break;
  		}
+ 		pfree(n->channel);
+ 		pfree(n->payload);
+ 		lc = list_head(uncommittedNotifications);
+ 		uncommittedNotifications
+ 			= list_delete_cell(uncommittedNotifications, lc, NULL);
  	}
  
! 	asyncQueueReadAllNotifications(READ_ONLY_COMMITTED);
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
***************
*** 1051,1070 ****
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(char *relname, int32 listenerPID)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, listenerPID, sizeof(int32));
! 		pq_sendstring(&buf, relname);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 		{
! 			/* XXX Add parameter string here later */
! 			pq_sendstring(&buf, "");
! 		}
  		pq_endmessage(&buf);
  
  		/*
--- 1695,1711 ----
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(const char *channel, const char *payload, int32 srcPid)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, srcPid, sizeof(int32));
! 		pq_sendstring(&buf, channel);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 			pq_sendstring(&buf, payload);
  		pq_endmessage(&buf);
  
  		/*
***************
*** 1074,1096 ****
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", relname);
  }
  
! /* Does pendingNotifies include the given relname? */
  static bool
! AsyncExistsPendingNotify(const char *relname)
  {
  	ListCell   *p;
  
! 	foreach(p, pendingNotifies)
! 	{
! 		const char *prelname = (const char *) lfirst(p);
  
! 		if (strcmp(prelname, relname) == 0)
  			return true;
  	}
  
  	return false;
  }
  
--- 1715,1771 ----
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", channel);
  }
  
! /* Does pendingNotifies include the given channel/payload? */
  static bool
! AsyncExistsPendingNotify(const char *channel, const char *payload)
  {
  	ListCell   *p;
+ 	Notification *n;
  
! 	if (pendingNotifies == NIL)
! 		return false;
  
! 	if (payload == NULL)
! 		payload = "";
! 
! 	/*
! 	 * We need to append new elements to the end of the list in order to keep
! 	 * the order. However, on the other hand we'd like to check the list
! 	 * backwards in order to make duplicate-elimination a tad faster when the
! 	 * same condition is signaled many times in a row. So as a compromise we
! 	 * check the tail element first which we can access directly. If this
! 	 * doesn't match, we check the rest of the list.
! 	 */
! 
! 	n = (Notification *) llast(pendingNotifies);
! 	if (strcmp(n->channel, channel) == 0)
! 	{
! 		Assert(n->payload != NULL);
! 		if (strcmp(n->payload, payload) == 0)
  			return true;
  	}
  
+ 	/*
+ 	 * Note the difference to foreach(). We stop if p is the last element
+ 	 * already. So we don't check the last element, we have checked it already.
+  	 */
+ 	for(p = list_head(pendingNotifies);
+ 		p != list_tail(pendingNotifies);
+ 		p = lnext(p))
+ 	{
+ 		n = (Notification *) lfirst(p);
+ 
+ 		if (strcmp(n->channel, channel) == 0)
+ 		{
+ 			Assert(n->payload != NULL);
+ 			if (strcmp(n->payload, payload) == 0)
+ 				return true;
+ 		}
+ 	}
+ 
  	return false;
  }
  
***************
*** 1124,1128 ****
  	 * there is any significant delay before I commit.	OK for now because we
  	 * disallow COMMIT PREPARED inside a transaction block.)
  	 */
! 	Async_Notify((char *) recdata);
  }
--- 1799,1809 ----
  	 * there is any significant delay before I commit.	OK for now because we
  	 * disallow COMMIT PREPARED inside a transaction block.)
  	 */
! 	AsyncQueueEntry		*qe = (AsyncQueueEntry *) recdata;
! 
! 	Assert(qe->dboid == MyDatabaseId);
! 	Assert(qe->length == len);
! 
! 	Async_Notify(qe->channel, qe->payload);
  }
+ 
diff -cr cvs/src/backend/nodes/copyfuncs.c cvs.build/src/backend/nodes/copyfuncs.c
*** cvs/src/backend/nodes/copyfuncs.c	2009-11-18 10:19:30.000000000 +0100
--- cvs.build/src/backend/nodes/copyfuncs.c	2009-11-18 10:20:54.000000000 +0100
***************
*** 2761,2766 ****
--- 2761,2767 ----
  	NotifyStmt *newnode = makeNode(NotifyStmt);
  
  	COPY_STRING_FIELD(conditionname);
+ 	COPY_STRING_FIELD(payload);
  
  	return newnode;
  }
diff -cr cvs/src/backend/nodes/equalfuncs.c cvs.build/src/backend/nodes/equalfuncs.c
*** cvs/src/backend/nodes/equalfuncs.c	2009-11-18 10:19:30.000000000 +0100
--- cvs.build/src/backend/nodes/equalfuncs.c	2009-11-18 10:20:54.000000000 +0100
***************
*** 1321,1326 ****
--- 1321,1327 ----
  _equalNotifyStmt(NotifyStmt *a, NotifyStmt *b)
  {
  	COMPARE_STRING_FIELD(conditionname);
+ 	COMPARE_STRING_FIELD(payload);
  
  	return true;
  }
diff -cr cvs/src/backend/nodes/outfuncs.c cvs.build/src/backend/nodes/outfuncs.c
*** cvs/src/backend/nodes/outfuncs.c	2009-11-18 10:19:30.000000000 +0100
--- cvs.build/src/backend/nodes/outfuncs.c	2009-11-18 10:20:54.000000000 +0100
***************
*** 1811,1816 ****
--- 1811,1817 ----
  	WRITE_NODE_TYPE("NOTIFY");
  
  	WRITE_STRING_FIELD(conditionname);
+ 	WRITE_STRING_FIELD(payload);
  }
  
  static void
diff -cr cvs/src/backend/nodes/readfuncs.c cvs.build/src/backend/nodes/readfuncs.c
*** cvs/src/backend/nodes/readfuncs.c	2009-10-31 14:47:48.000000000 +0100
--- cvs.build/src/backend/nodes/readfuncs.c	2009-11-18 10:20:54.000000000 +0100
***************
*** 231,236 ****
--- 231,237 ----
  	READ_LOCALS(NotifyStmt);
  
  	READ_STRING_FIELD(conditionname);
+ 	READ_STRING_FIELD(payload);
  
  	READ_DONE();
  }
diff -cr cvs/src/backend/parser/gram.y cvs.build/src/backend/parser/gram.y
*** cvs/src/backend/parser/gram.y	2009-11-18 10:19:30.000000000 +0100
--- cvs.build/src/backend/parser/gram.y	2009-11-18 10:20:54.000000000 +0100
***************
*** 394,400 ****
  %type <boolean> opt_varying opt_timezone
  
  %type <ival>	Iconst SignedIconst
! %type <str>		Sconst comment_text
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
--- 394,400 ----
  %type <boolean> opt_varying opt_timezone
  
  %type <ival>	Iconst SignedIconst
! %type <str>		Sconst comment_text notify_payload
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
***************
*** 5984,5993 ****
   *
   *****************************************************************************/
  
! NotifyStmt: NOTIFY ColId
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
  					$$ = (Node *)n;
  				}
  		;
--- 5984,5999 ----
   *
   *****************************************************************************/
  
! notify_payload:
! 			Sconst								{ $$ = $1; }
! 			| /*EMPTY*/							{ $$ = NULL; }
! 		;
! 
! NotifyStmt: NOTIFY ColId notify_payload
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
+ 					n->payload = $3;
  					$$ = (Node *)n;
  				}
  		;
diff -cr cvs/src/backend/rewrite/rewriteManip.c cvs.build/src/backend/rewrite/rewriteManip.c
*** cvs/src/backend/rewrite/rewriteManip.c	2009-11-11 01:09:14.000000000 +0100
--- cvs.build/src/backend/rewrite/rewriteManip.c	2009-11-18 10:20:54.000000000 +0100
***************
*** 996,1001 ****
--- 996,1002 ----
  		 * While clearly wrong, this is much more useful than refusing to
  		 * execute the rule at all, and extra NOTIFY events are harmless for
  		 * typical uses of NOTIFY.
+ 		 * XXX
  		 *
  		 * If it isn't a NOTIFY, error out, since unconditional execution of
  		 * other utility stmts is unlikely to be wanted.  (This case is not
diff -cr cvs/src/backend/storage/ipc/ipci.c cvs.build/src/backend/storage/ipc/ipci.c
*** cvs/src/backend/storage/ipc/ipci.c	2009-09-06 09:06:21.000000000 +0200
--- cvs.build/src/backend/storage/ipc/ipci.c	2009-11-18 10:20:54.000000000 +0100
***************
*** 219,224 ****
--- 219,225 ----
  	 */
  	BTreeShmemInit();
  	SyncScanShmemInit();
+ 	AsyncShmemInit();
  
  #ifdef EXEC_BACKEND
  
diff -cr cvs/src/backend/storage/lmgr/lwlock.c cvs.build/src/backend/storage/lmgr/lwlock.c
*** cvs/src/backend/storage/lmgr/lwlock.c	2009-05-10 19:50:21.000000000 +0200
--- cvs.build/src/backend/storage/lmgr/lwlock.c	2009-11-18 10:22:00.000000000 +0100
***************
*** 24,29 ****
--- 24,30 ----
  #include "access/clog.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "storage/ipc.h"
***************
*** 174,179 ****
--- 175,183 ----
  	/* multixact.c needs two SLRU areas */
  	numLocks += NUM_MXACTOFFSET_BUFFERS + NUM_MXACTMEMBER_BUFFERS;
  
+ 	/* async.c needs one per page for the AsyncQueue */
+ 	numLocks += NUM_ASYNC_BUFFERS;
+ 
  	/*
  	 * Add any requested by loadable modules; for backwards-compatibility
  	 * reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
diff -cr cvs/src/backend/tcop/utility.c cvs.build/src/backend/tcop/utility.c
*** cvs/src/backend/tcop/utility.c	2009-11-18 10:19:31.000000000 +0100
--- cvs.build/src/backend/tcop/utility.c	2009-11-18 10:20:54.000000000 +0100
***************
*** 875,882 ****
  		case T_NotifyStmt:
  			{
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
! 
! 				Async_Notify(stmt->conditionname);
  			}
  			break;
  
--- 875,886 ----
  		case T_NotifyStmt:
  			{
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
! 				if (stmt->payload
! 					&& strlen(stmt->payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1)
! 					ereport(ERROR,
! 							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 							 errmsg("payload string too long")));
! 				Async_Notify(stmt->conditionname, stmt->payload);
  			}
  			break;
  
diff -cr cvs/src/bin/initdb/initdb.c cvs.build/src/bin/initdb/initdb.c
*** cvs/src/bin/initdb/initdb.c	2009-11-18 10:19:31.000000000 +0100
--- cvs.build/src/bin/initdb/initdb.c	2009-11-18 10:20:54.000000000 +0100
***************
*** 2469,2474 ****
--- 2469,2475 ----
  		"pg_xlog",
  		"pg_xlog/archive_status",
  		"pg_clog",
+ 		"pg_notify",
  		"pg_subtrans",
  		"pg_twophase",
  		"pg_multixact/members",
diff -cr cvs/src/bin/psql/common.c cvs.build/src/bin/psql/common.c
*** cvs/src/bin/psql/common.c	2009-05-10 19:50:30.000000000 +0200
--- cvs.build/src/bin/psql/common.c	2009-11-18 10:20:54.000000000 +0100
***************
*** 555,562 ****
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" received from server process with PID %d.\n"),
! 				notify->relname, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
--- 555,562 ----
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" (%s) received from server process with PID %d.\n"),
! 				notify->relname, notify->extra, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
diff -cr cvs/src/include/access/slru.h cvs.build/src/include/access/slru.h
*** cvs/src/include/access/slru.h	2009-05-10 19:50:35.000000000 +0200
--- cvs.build/src/include/access/slru.h	2009-11-18 10:20:54.000000000 +0100
***************
*** 16,21 ****
--- 16,40 ----
  #include "access/xlogdefs.h"
  #include "storage/lwlock.h"
  
+ /*
+  * Define segment size.  A page is the same BLCKSZ as is used everywhere
+  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
+  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
+  * or 64K transactions for SUBTRANS.
+  *
+  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
+  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
+  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+  * take no explicit notice of that fact in this module, except when comparing
+  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
+  *
+  * Note: this file currently assumes that segment file names will be four
+  * hex digits.	This sets a lower bound on the segment size (64K transactions
+  * for 32-bit TransactionIds).
+  */
+ #define SLRU_PAGES_PER_SEGMENT	32
+ 
  
  /*
   * Page status codes.  Note that these do not include the "dirty" bit.
diff -cr cvs/src/include/commands/async.h cvs.build/src/include/commands/async.h
*** cvs/src/include/commands/async.h	2009-09-06 09:08:02.000000000 +0200
--- cvs.build/src/include/commands/async.h	2009-11-18 10:23:41.000000000 +0100
***************
*** 13,28 ****
  #ifndef ASYNC_H
  #define ASYNC_H
  
  extern bool Trace_notify;
  
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_Notify(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
--- 13,42 ----
  #ifndef ASYNC_H
  #define ASYNC_H
  
+ /*
+  * How long can a payload string possibly be? Actually it needs to be one
+  * byte less to provide space for the trailing terminating '\0'.
+  */
+ #define NOTIFY_PAYLOAD_MAX_LENGTH	8000
+ 
+ /*
+  * How many page slots do we reserve ?
+  */
+ #define NUM_ASYNC_BUFFERS			4
+ 
  extern bool Trace_notify;
  
+ extern void AsyncShmemInit(void);
+ 
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname, const char *payload);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_NotifyBeforeCommit(void);
! extern void AtCommit_NotifyAfterCommit(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
diff -cr cvs/src/include/nodes/parsenodes.h cvs.build/src/include/nodes/parsenodes.h
*** cvs/src/include/nodes/parsenodes.h	2009-11-18 10:19:31.000000000 +0100
--- cvs.build/src/include/nodes/parsenodes.h	2009-11-18 10:20:54.000000000 +0100
***************
*** 2059,2064 ****
--- 2059,2065 ----
  {
  	NodeTag		type;
  	char	   *conditionname;	/* condition name to notify */
+ 	char	   *payload;		/* the payload string to be conveyed */
  } NotifyStmt;
  
  /* ----------------------
diff -cr cvs/src/include/storage/lwlock.h cvs.build/src/include/storage/lwlock.h
*** cvs/src/include/storage/lwlock.h	2009-05-10 19:53:12.000000000 +0200
--- cvs.build/src/include/storage/lwlock.h	2009-11-18 10:20:54.000000000 +0100
***************
*** 67,72 ****
--- 67,74 ----
  	AutovacuumLock,
  	AutovacuumScheduleLock,
  	SyncScanLock,
+ 	AsyncCtlLock,
+ 	AsyncQueueLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joachim Wieland (#34)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland wrote:

On Fri, Nov 20, 2009 at 7:51 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Note that we don't preserve notifications when the database restarts.
But 2PC can cope with restarts. How would that fit together?

The notifications are written to the state file at prepare. They can be
recovered from there and written to the queue again at server start (see
twophase_rmgr.c).

Okay, but which of the backends would then leave its pointer at that
place in the queue upon restart?

This is also an issue for the non-restart case, what if you prepare
the transaction in one backend and commit in the other?

The dummy procs that represent prepared transactions need to be treated
as backends. Each prepared transaction needs a slot of its own in the
backends array.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#37Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joachim Wieland (#35)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland wrote:

On Thu, Nov 19, 2009 at 11:04 PM, Joachim Wieland <joe@mcknight.de> wrote:

Given your example, what I am proposing now is to stop reading from
the queue once we see a not-yet-committed notification but once the
queue is full, read the uncommitted notifications, effectively copying
them over into the backend's own memory... Once the transaction
commits and sends a signal, we can process, send and discard the
previously copied notifications. In the above example, at some point
one, two or all three backends would see that the queue is full and
everybody would read the uncommitted notifications of the other one,
copy them into their own memory, and space will be freed in the queue.

Attached is the patch that implements the described modifications.

That's still not enough if session 2 that issues the LISTEN wasn't
previously subscribed to any channels. It will still miss the notification.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#38Jeff Davis
pgsql@j-davis.com
In reply to: Joachim Wieland (#35)
Re: Listen / Notify - what to do when the queue is full

On Fri, 2009-11-20 at 10:35 +0100, Joachim Wieland wrote:

On Thu, Nov 19, 2009 at 11:04 PM, Joachim Wieland <joe@mcknight.de> wrote:

Given your example, what I am proposing now is to stop reading from
the queue once we see a not-yet-committed notification but once the
queue is full, read the uncommitted notifications, effectively copying
them over into the backend's own memory... Once the transaction
commits and sends a signal, we can process, send and discard the
previously copied notifications. In the above example, at some point
one, two or all three backends would see that the queue is full and
everybody would read the uncommitted notifications of the other one,
copy them into their own memory, and space will be freed in the queue.

Attached is the patch that implements the described modifications.

This is a first-pass review.

Comments:

* Why don't we read all notifications into backend-local memory at
every opportunity? It looks like sometimes it's only reading the
committed ones, and I don't see the advantage of leaving it in the SLRU.

Code comments:

* I see a compiler warning:
ipci.c: In function ‘CreateSharedMemoryAndSemaphores’:
ipci.c:222: warning: implicit declaration of function ‘AsyncShmemInit’
* sanity_check test fails, needs updating
* guc test fails, needs updating
* no docs

The overall design looks pretty good to me. From my very brief pass over
the code, it looks like:
* When the queue is full, the transaction waits until there is room
before it's committed. This may have an effect on 2PC, but the consensus
seemed to be that we could restrict the combination of 2PC + NOTIFY.
This also introduces the possibility of starvation, I suppose, but that
seems remote.
* When the queue is full, the inserter tries to signal the listening
backends, and tries to make room in the queue.
* Backends read the notifications when signaled, or when inserting (in
case the inserting backend is also the one preventing the queue from
shrinking).
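
To picture what "making room" means here: the async.c header comment in the patch says the global tail follows the slowest ("laziest") backend, and pages behind the tail can be truncated. The following is only a rough sketch of that idea, not code from the patch. QueuePosition mirrors the struct in the patch; pos_before() and slowest_position() are hypothetical names, and queue wraparound is deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the patch's QueuePosition: a page number plus an offset. */
typedef struct QueuePosition
{
	int		page;
	int		offset;
} QueuePosition;

/* Hypothetical helper: true if a comes strictly before b (no wraparound). */
static int
pos_before(QueuePosition a, QueuePosition b)
{
	if (a.page != b.page)
		return a.page < b.page;
	return a.offset < b.offset;
}

/*
 * Hypothetical helper: the new global tail is the position of the slowest
 * listening backend. Everything before it has been read by every listener.
 */
static QueuePosition
slowest_position(const QueuePosition *pos, size_t n)
{
	QueuePosition min;
	size_t		i;

	assert(pos != NULL && n > 0);
	min = pos[0];
	for (i = 1; i < n; i++)
	{
		if (pos_before(pos[i], min))
			min = pos[i];
	}
	return min;
}
```

A single backend that never reads keeps the minimum where it is, which is why the inserter has to signal the listeners before it can free any space.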

I haven't looked at everything yet, but this seems like it's in
reasonable shape from a high level. Joachim, can you clean the patch up,
include docs, and fix the tests? If so, I'll do a full review.

Regards,
Jeff Davis

#39Joachim Wieland
joe@mcknight.de
In reply to: Jeff Davis (#38)
Re: Listen / Notify - what to do when the queue is full

Hi Jeff,

the current patch suffers from what Heikki recently spotted: If one
backend is putting notifications in the queue and meanwhile another
backend executes LISTEN and commits, then this listening backend
committed earlier and is supposed to receive the notifications of the
notifying backend - even though its transaction started later.

I have a new version that deals with this problem but I need to clean
it up a bit. I am planning to post it this week.

On Mon, Nov 30, 2009 at 6:15 AM, Jeff Davis <pgsql@j-davis.com> wrote:

 * Why don't we read all notifications into backend-local memory at
every opportunity? It looks like sometimes it's only reading the
committed ones, and I don't see the advantage of leaving it in the SLRU.

Exactly because of the problem above we cannot do it. Once the
notification is removed from the queue, then no other backend can
execute a LISTEN anymore because there is no way for it to get that
information. Also we'd need to read _all_ notifications, not only the
committed ones because we don't know what our backend will LISTEN to
in the future.

On the other hand, reading uncommitted notifications guarantees that
we can send an unlimited number of notifications (limited by main
memory) and that we don't run into a full queue in this example:

Queue length: 1000
3 notifying backends, 400 notifications to be sent by each backend.

If all of them send their notifications at the same time, we risk that
all three run into a full queue...
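
The numbers in this example can be checked with a small sketch. It assumes that no reader frees any space while the three backends are writing, and the helper names all_fit() and overflow_count() are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: do `backends` writers, each placing `per_backend`
 * entries, fit into a queue of `queue_len` slots if nothing is freed
 * in the meantime?
 */
static bool
all_fit(int queue_len, int backends, int per_backend)
{
	assert(queue_len >= 0 && backends >= 0 && per_backend >= 0);
	return (long) backends * per_backend <= (long) queue_len;
}

/* Hypothetical helper: number of entries that cannot be placed. */
static long
overflow_count(int queue_len, int backends, int per_backend)
{
	long		need = (long) backends * per_backend;

	return need > queue_len ? need - queue_len : 0;
}
```

With a queue length of 1000 and 3 x 400 notifications, 200 entries do not fit, so at least one of the backends hits the full queue unless uncommitted entries can be copied out.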

We could still preserve that behavior, at the cost of allowing LISTEN
to block until the queue is within its limits again.

 * When the queue is full, the inserter tries to signal the listening
backends, and tries to make room in the queue.
 * Backends read the notifications when signaled, or when inserting (in
case the inserting backend is also the one preventing the queue from
shrinking).

Exactly, but it doesn't solve the problem described above. :-(

ISTM that we have two options:

a) allow LISTEN to block if the queue is full - NOTIFY will never fail
(but block as well) and will eventually succeed
b) NOTIFY could fail and make the transaction roll back - LISTEN
always succeeds immediately

Again: This is corner-case behavior and only happens after some
hundreds of gigabytes of notifications have been put into the queue and
have not yet been processed by all listening backends. I like a)
better, but b) is easier to implement...

I haven't looked at everything yet, but this seems like it's in
reasonable shape from a high level. Joachim, can you clean the patch up,
include docs, and fix the tests? If so, I'll do a full review.

As soon as everybody is fine with the approach, I will work on the docs patch.

Joachim

#40Jeff Davis
pgsql@j-davis.com
In reply to: Joachim Wieland (#39)
Re: Listen / Notify - what to do when the queue is full

On Mon, 2009-11-30 at 14:14 +0100, Joachim Wieland wrote:

I have a new version that deals with this problem but I need to clean
it up a bit. I am planning to post it this week.

Are you planning to send a new version soon?

As it is, we're 12 days from the end of this commitfest, so we don't
have much time to hand the patch back and forth before we're out of
time.

Regards,
Jeff Davis

#41Greg Smith
greg@2ndquadrant.com
In reply to: Jeff Davis (#40)
Re: Listen / Notify - what to do when the queue is full

Jeff Davis wrote:

On Mon, 2009-11-30 at 14:14 +0100, Joachim Wieland wrote:

I have a new version that deals with this problem but I need to clean
it up a bit. I am planning to post it this week.

Are you planning to send a new version soon?

As it is, we're 12 days from the end of this commitfest, so we don't
have much time to hand the patch back and forth before we're out of
time.

Joachim has been making good progress here, but given the current code
maturity vs. implementation complexity of this work my guess is that
even if we got an updated version today it's not going to hit commit
quality given the time left. I'm going to mark this one "returned with
feedback", and I do hope that work continues on this patch so it's early
in the queue for the final CommitFest for this version. It seems like
it just needs a bit more time to let the design mature and get the kinks
worked out and it could turn into a useful feature--I know I've wanted
NOTIFY with a payload for years.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com

#42Joachim Wieland
joe@mcknight.de
In reply to: Greg Smith (#41)
1 attachment(s)
Re: Listen / Notify - what to do when the queue is full

Hi,

On Mon, Dec 7, 2009 at 5:38 AM, Greg Smith <greg@2ndquadrant.com> wrote:

I'm going to mark this one "returned with feedback", and I
do hope that work continues on this patch so it's early in the queue for the
final CommitFest for this version.  It seems like it just needs a bit more
time to let the design mature and get the kinks worked out and it could turn
into a useful feature--I know I've wanted NOTIFY with a payload for years.

I am perfectly fine with postponing the patch to the next commitfest. To get
some more feedback and to allow everyone to play with it, I am attaching the
latest version of the patch.

What has changed:

Transactional processing is now hopefully correct:

Examples:

Backend 1:                    Backend 2:

transaction starts
NOTIFY foo;
commit starts
                              transaction starts
                              LISTEN foo;
                              commit starts
                              commit to clog
commit to clog

=> Backend 2 will receive Backend 1's notification.

Backend 1:                    Backend 2:

transaction starts
NOTIFY foo;
commit starts
                              transaction starts
                              UNLISTEN foo;
                              commit starts
                              commit to clog
commit to clog

=> Backend 2 will not receive Backend 1's notification.

This is done by introducing an additional lock, "AsyncCommitOrderLock".
Transactions that execute LISTEN / UNLISTEN grab it in exclusive mode;
transactions that executed NOTIFY only grab it in shared mode. LISTEN/UNLISTEN
transactions then register the XIDs of the NOTIFYing transactions that are
about to commit at the same time, in order to find out later which
notifications are visible and which ones are not.
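
The rule that the two examples illustrate can be written down as a toy model. This is not how the patch implements it (the patch registers XIDs under the lock). It only assumes that every clog commit gets a position in one global order, here called CommitSeq, and that an UNLISTEN, if any, comes after the LISTEN. Both the type and notification_visible() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Position in the global order of clog commits; 0 means "never happened". */
typedef unsigned long CommitSeq;

/*
 * Hypothetical helper: a notification is delivered if the LISTEN committed
 * to clog before the NOTIFY did, and no UNLISTEN committed in between.
 * When the transactions *started* plays no role.
 */
static bool
notification_visible(CommitSeq notify_commit,
					 CommitSeq listen_commit,
					 CommitSeq unlisten_commit)
{
	assert(notify_commit != 0);
	if (listen_commit == 0 || listen_commit > notify_commit)
		return false;			/* not listening yet at NOTIFY commit */
	if (unlisten_commit != 0 && unlisten_commit < notify_commit)
		return false;			/* stopped listening before NOTIFY commit */
	return true;
}
```

In the first example Backend 2's LISTEN commits at position 1 and Backend 1's NOTIFY at position 2, so the notification is visible. In the second example the UNLISTEN commits before the NOTIFY, so it is not.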

If the queue is full, any other transaction that is trying to place a
notification into the queue is rolled back! This is basically a consequence of
the former. Two warnings will show up in the log: one once the queue is more
than 50% full and another once it is more than 75% full. The most likely way
to run into a full queue is probably a backend that is LISTENing and is
idle in transaction.
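
A minimal sketch of how the two thresholds could be evaluated, assuming the queue is a ring of max_page + 1 pages and that fill is measured in whole pages between tail and head. The function names and the 0/1/2 return convention are invented for illustration; the patch may well compute this differently.

```c
#include <assert.h>

/* Hypothetical helper: pages between tail and head on a ring 0..max_page. */
static long
pages_in_use(int head_page, int tail_page, int max_page)
{
	long		n = (long) max_page + 1;

	assert(max_page >= 0);
	assert(head_page >= 0 && head_page <= max_page);
	assert(tail_page >= 0 && tail_page <= max_page);
	return ((long) head_page - tail_page + n) % n;
}

/*
 * Hypothetical helper: 0 = no warning, 1 = more than 50% full,
 * 2 = more than 75% full. Integer arithmetic avoids rounding surprises.
 */
static int
fill_warning_level(int head_page, int tail_page, int max_page)
{
	long		n = (long) max_page + 1;
	long		used = pages_in_use(head_page, tail_page, max_page);

	if (used * 4 > n * 3)
		return 2;
	if (used * 2 > n)
		return 1;
	return 0;
}
```

Note that the head may be numerically smaller than the tail once the queue has wrapped around, which is why the distance is taken modulo the ring size.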

I have added a function pg_listening() which just returns the names of the
channels that a backend is listening to.

I especially invite people who know more about the transactional stuff than I
do to take a close look at what I have done regarding notification visibility.

One open question regarding the payload is whether we need to limit it to
ASCII so that we don't risk conversion issues between different backend
character sets.

The second open issue is what we should do regarding 2PC. These options have
been brought up so far:

1) allow NOTIFY in 2PC but it can happen that the transaction needs to be
rolled back if the queue is full
2) disallow NOTIFY in 2PC altogether
3) put notifications to the queue on PREPARE TRANSACTION and make backends not
advance their pointers further than those notifications but wait for the
2PC transaction to commit. 2PC transactions would never fail but you
effectively stop the notification system until the 2PC transaction commits.

Comments?

Best regards,
Joachim

Attachments:

listennotify.6.diff (text/x-diff; charset=US-ASCII)
diff -cr cvs/src/backend/access/transam/slru.c cvs.build/src/backend/access/transam/slru.c
*** cvs/src/backend/access/transam/slru.c	2009-12-09 11:24:51.000000000 +0100
--- cvs.build/src/backend/access/transam/slru.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 58,83 ****
  #include "storage/shmem.h"
  #include "miscadmin.h"
  
- 
- /*
-  * Define segment size.  A page is the same BLCKSZ as is used everywhere
-  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
-  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
-  * or 64K transactions for SUBTRANS.
-  *
-  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
-  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
-  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
-  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
-  * take no explicit notice of that fact in this module, except when comparing
-  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
-  *
-  * Note: this file currently assumes that segment file names will be four
-  * hex digits.	This sets a lower bound on the segment size (64K transactions
-  * for 32-bit TransactionIds).
-  */
- #define SLRU_PAGES_PER_SEGMENT	32
- 
  #define SlruFileName(ctl, path, seg) \
  	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
  
--- 58,63 ----
diff -cr cvs/src/backend/access/transam/xact.c cvs.build/src/backend/access/transam/xact.c
*** cvs/src/backend/access/transam/xact.c	2009-12-09 11:24:51.000000000 +0100
--- cvs.build/src/backend/access/transam/xact.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 1604,1612 ****
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* NOTIFY commit must come before lower-level cleanup */
! 	AtCommit_Notify();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
  
--- 1604,1613 ----
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* Insert notifications sent by the NOTIFY command into the queue */
! 	AtCommit_NotifyBeforeCommit();
  
+ 	Assert(s->state == TRANS_INPROGRESS);
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
  
***************
*** 1680,1685 ****
--- 1681,1691 ----
  
  	AtEOXact_MultiXact();
  
+ 	/*
+ 	 * Clean up Notify buffers and signal listening backends.
+ 	 */
+ 	AtCommit_NotifyAfterCommit();
+ 
  	ResourceOwnerRelease(TopTransactionResourceOwner,
  						 RESOURCE_RELEASE_LOCKS,
  						 true, true);
diff -cr cvs/src/backend/catalog/Makefile cvs.build/src/backend/catalog/Makefile
*** cvs/src/backend/catalog/Makefile	2009-12-09 11:24:40.000000000 +0100
--- cvs.build/src/backend/catalog/Makefile	2009-12-09 11:26:03.000000000 +0100
***************
*** 30,36 ****
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject.h pg_aggregate.h pg_statistic.h \
! 	pg_rewrite.h pg_trigger.h pg_listener.h pg_description.h pg_cast.h \
  	pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
--- 30,36 ----
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject.h pg_aggregate.h pg_statistic.h \
! 	pg_rewrite.h pg_trigger.h pg_description.h pg_cast.h \
  	pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
diff -cr cvs/src/backend/commands/async.c cvs.build/src/backend/commands/async.c
*** cvs/src/backend/commands/async.c	2009-12-09 11:24:41.000000000 +0100
--- cvs.build/src/backend/commands/async.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 14,44 ****
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  * 1. Multiple backends on same machine.  Multiple backends listening on
!  *	  one relation.  (Note: "listening on a relation" is not really the
!  *	  right way to think about it, since the notify names need not have
!  *	  anything to do with the names of relations actually in the database.
!  *	  But this terminology is all over the code and docs, and I don't feel
!  *	  like trying to replace it.)
!  *
!  * 2. There is a tuple in relation "pg_listener" for each active LISTEN,
!  *	  ie, each relname/listenerPID pair.  The "notification" field of the
!  *	  tuple is zero when no NOTIFY is pending for that listener, or the PID
!  *	  of the originating backend when a cross-backend NOTIFY is pending.
!  *	  (We skip writing to pg_listener when doing a self-NOTIFY, so the
!  *	  notification field should never be equal to the listenerPID field.)
!  *
!  * 3. The NOTIFY statement itself (routine Async_Notify) just adds the target
!  *	  relname to a list of outstanding NOTIFY requests.  Actual processing
!  *	  happens if and only if we reach transaction commit.  At that time (in
!  *	  routine AtCommit_Notify) we scan pg_listener for matching relnames.
!  *	  If the listenerPID in a matching tuple is ours, we just send a notify
!  *	  message to our own front end.  If it is not ours, and "notification"
!  *	  is not already nonzero, we set notification to our own PID and send a
!  *	  PROCSIG_NOTIFY_INTERRUPT signal to the receiving process (indicated by
!  *	  listenerPID).
!  *	  BTW: if the signal operation fails, we presume that the listener backend
!  *	  crashed without removing this tuple, and remove the tuple for it.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
--- 14,79 ----
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  *
!  * 1. Multiple backends on same machine. Multiple backends listening on
!  *	  several channels. (This was previously called a "relation" even though it
!  *	  is just an identifier and has nothing to do with a database relation.)
!  *
!  * 2. There is one central queue in the form of Slru backed file based storage
!  *    (directory pg_notify/), with several pages mapped into shared memory.
!  *
!  *    There is no central storage of which backend listens on which channel,
!  *    every backend has its own list.
!  *
!  *    Every backend that is listening on at least one channel registers by
!  *    entering its Pid into the array of all backends. It then scans all
!  *    incoming notifications and compares the notified channels with its list.
!  *
!  *    In case there is a match it delivers the corresponding notification to
!  *    its frontend.
!  *
!  * 3. The NOTIFY statement (routine Async_Notify) stores the notification
!  *    in a list which will not be processed until transaction end. Every
!  *    notification can additionally send a "payload" which is an extra text
!  *    parameter to convey arbitrary information to the recipient.
!  *
!  *    Duplicate notifications from the same transaction are sent out as one
!  *    notification only. This is done to save work when for example a trigger
!  *    on a 2 million row table fires a notification for each row that has been
!  *    changed. If the application needs to receive every single notification
!  *    that has been sent, it can easily add some unique string into the extra
!  *    payload parameter.
!  *
!  *    Once the transaction commits, AtCommit_NotifyBeforeCommit() performs the
!  *    required changes regarding listeners (Listen/Unlisten) and then adds the
!  *    pending notifications to the head of the queue. The head pointer of the
!  *    queue always points to the next free position and a position is just a
!  *    page number and the offset in that page. This is done before marking the
!  *    transaction as committed in clog. If we run into problems writing the
!  *    notifications, we can still call elog(ERROR, ...) and the transaction
!  *    will roll back.
!  *
!  *    Once we have put all of the notifications into the queue, we return to
!  *    CommitTransaction() which will then commit to clog.
!  *
!  *    In order to ensure transactional processing there is AsyncCommitOrderLock
!  *    that has to be grabbed (shared for NOTIFY, exclusive for LISTEN/UNLISTEN)
!  *    by all such transactions, but only for the time when those transactions
!  *    commit to clog. For example, one issue here is that a transaction sending
!  *    notifications could store them into the list while another transaction
!  *    that started later does a LISTEN on a channel and commits. Then it has to
!  *    see the notifications of the longer running transaction, because it
!  *    committed earlier (even though it started later).
!  *
!  *    After clog commit we are called another time
!  *    (AtCommit_NotifyAfterCommit()). If we have executed either LISTEN or
!  *    UNLISTEN, we register any running notifying transactions and release the
!  *    lock to be able to solve exactly the problem described above. We then
!  *    check if we need to signal the backends. In SignalBackends() we scan the
!  *    list of listening backends and send a PROCSIG_NOTIFY_INTERRUPT to every
!  *    backend that has set its Pid (we don't know which backend is listening on
!  *    which channel so we need to send a signal to every listening backend). We
!  *    can exclude backends that are already up to date.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
***************
*** 46,97 ****
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  * 5. Inbound-notify processing consists of scanning pg_listener for tuples
!  *	  matching our own listenerPID and having nonzero notification fields.
!  *	  For each such tuple, we send a message to our frontend and clear the
!  *	  notification field.  BTW: this routine has to start/commit its own
!  *	  transaction, since by assumption it is only called from outside any
!  *	  transaction.
!  *
!  * Like NOTIFY, LISTEN and UNLISTEN just add the desired action to a list
!  * of pending actions.	If we reach transaction commit, the changes are
!  * applied to pg_listener just before executing any pending NOTIFYs.  This
!  * method is necessary because to avoid race conditions, we must hold lock
!  * on pg_listener from when we insert a new listener tuple until we commit.
!  * To do that and not create undue hazard of deadlock, we don't want to
!  * touch pg_listener until we are otherwise done with the transaction;
!  * in particular it'd be uncool to still be taking user-commanded locks
!  * while holding the pg_listener lock.
!  *
!  * Although we grab ExclusiveLock on pg_listener for any operation,
!  * the lock is never held very long, so it shouldn't cause too much of
!  * a performance problem.  (Previously we used AccessExclusiveLock, but
!  * there's no real reason to forbid concurrent reads.)
   *
!  * An application that listens on the same relname it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * PID.  (As of FE/BE protocol 2.0, the backend's PID is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.  Note,
!  * however, that we do *not* guarantee that a separate frontend message will
!  * be sent for every outside NOTIFY.  Since there is only room for one
!  * originating PID in pg_listener, outside notifies occurring at about the
!  * same time may be collapsed into a single message bearing the PID of the
!  * first outside backend to perform the NOTIFY.
   *-------------------------------------------------------------------------
   */
  
  #include "postgres.h"
  
  #include <unistd.h>
  #include <signal.h>
  
  #include "access/heapam.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
! #include "catalog/pg_listener.h"
  #include "commands/async.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
--- 81,127 ----
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  * 5. Inbound-notify processing consists of reading all of the notifications
!  *	  that have arrived since scanning last time. We read every notification
!  *	  until we reach the head pointer's position. Then we check if we were the
!  *	  laziest backend: if our pointer is set to the same position as the global
!  *	  tail pointer is set, then we set it further to the second-laziest
!  *	  backend (We can identify it by inspecting the positions of all other
!  *	  backends' pointers). Whenever we move the tail pointer we also truncate
!  *	  now unused pages (i.e. delete files in pg_notify/ that are no longer
!  *	  used).
   *
!  * An application that listens on the same channel it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * Pid.  (As of FE/BE protocol 2.0, the backend's Pid is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.
   *-------------------------------------------------------------------------
   */
  
+ /* XXX 
+  *
+  * TODO:
+  *  - guc parameter max_notifies_per_txn ??
+  *  - adapt comments
+  *  - 2PC
+  *  - limit to ASCII?
+  */
+ 
  #include "postgres.h"
  
  #include <unistd.h>
  #include <signal.h>
  
  #include "access/heapam.h"
+ #include "access/slru.h"
+ #include "access/transam.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
! #include "catalog/pg_type.h"
  #include "commands/async.h"
+ #include "funcapi.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
***************
*** 108,115 ****
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction.  As explained above,
!  * we don't actually modify pg_listener until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
--- 138,145 ----
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction. As explained above,
!  * we don't actually send notifications until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
***************
*** 134,140 ****
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all relnames NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
--- 164,170 ----
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all channels NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
***************
*** 149,160 ****
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
- static List *pendingNotifies = NIL;		/* list of C strings */
  
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
  
  /*
!  * State for inbound notifies consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
--- 179,321 ----
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
  
+ typedef struct QueuePosition
+ {
+ 	int				page;
+ 	int				offset;
+ } QueuePosition;
+ 
+ typedef struct Notification
+ {
+ 	char		   *channel;
+ 	char		   *payload;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ 	QueuePosition	position;
+ } Notification;
+ 
+ typedef struct AsyncQueueEntry
+ {
+ 	/*
+ 	 * this record has the maximal length, but usually we limit it to
+ 	 * AsyncQueueEntryEmptySize + strlen(payload).
+ 	 */
+ 	Size			length;
+ 	Oid				dboid;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ 	char			channel[NAMEDATALEN];
+ 	char			payload[NOTIFY_PAYLOAD_MAX_LENGTH];
+ } AsyncQueueEntry;
+ #define AsyncQueueEntryEmptySize \
+ 	 (sizeof(AsyncQueueEntry) - NOTIFY_PAYLOAD_MAX_LENGTH + 1)
+ 
+ #define	InvalidPid				(-1)
+ #define QUEUE_POS_PAGE(x)		((x).page)
+ #define QUEUE_POS_OFFSET(x)		((x).offset)
+ #define QUEUE_POS_EQUAL(x,y) \
+ 	 ((x).page == (y).page ? (x).offset == (y).offset : false)
+ #define SET_QUEUE_POS(x,y,z) \
+ 	do { \
+ 		(x).page = (y); \
+ 		(x).offset = (z); \
+ 	} while (0)
+ /* does page x logically precede page y with z = HEAD ? */
+ #define QUEUE_POS_MIN(x,y,z) \
+ 	asyncQueuePagePrecedesLogically((x).page, (y).page, (z).page) ? (x) : \
+ 		 asyncQueuePagePrecedesLogically((y).page, (x).page, (z).page) ? (y) : \
+ 			 (x).offset < (y).offset ? (x) : \
+ 			 	(y)
+ #define QUEUE_POS_LT(x,y,z) \
+ 		(QUEUE_POS_EQUAL(x,QUEUE_POS_MIN(x,y,z)) && !QUEUE_POS_EQUAL(x,y))
+ #define QUEUE_POS_LE(x,y,z) \
+ 		(QUEUE_POS_EQUAL(x,QUEUE_POS_MIN(x,y,z)))
+ #define QUEUE_BACKEND_POS(i)		asyncQueueControl->backend[(i)].pos
+ #define QUEUE_BACKEND_PID(i)		asyncQueueControl->backend[(i)].pid
+ #define QUEUE_BACKEND_XID(i)		asyncQueueControl->backend[(i)].xid
+ #define QUEUE_HEAD					asyncQueueControl->head
+ #define QUEUE_TAIL					asyncQueueControl->tail
+ 
+ typedef struct QueueBackendStatus
+ {
+ 	int32			pid;
+ 	QueuePosition	pos;
+ 	TransactionId	xid;  /* this is protected by AsyncQueueCommitOrderLock,
+ 							 no lock required to read the own entry,
+ 							 LW_SHARED is sufficient for writing the own entry,
+ 							 LW_EXCLUSIVE to read everybody else's. */
+ } QueueBackendStatus;
+ 
+ /*
+  * The AsyncQueueControl structure is protected by the AsyncQueueLock.
+  *
+  * In SHARED mode, backends will only inspect their own entries as well as
+  * head and tail pointers. Consequently we can allow a backend to update its
+  * own record while holding only a shared lock (since no other backend will
+  * inspect it).
+  *
+  * In EXCLUSIVE mode, backends can inspect the entries of other backends and
+  * also change head and tail pointers.
+  *
+  * In order to avoid deadlocks, whenever we need both locks, we always first
+  * get AsyncQueueLock and then AsyncCtlLock.
+  */
+ typedef struct AsyncQueueControl
+ {
+ 	QueuePosition		head;		/* head points to the next free location */
+ 	QueuePosition 		tail;		/* the global tail is equivalent to the
+ 									   tail of the "slowest" backend */
+ 	TimestampTz			lastQueueFillWarn;	/* when the queue is full we only
+ 											   want to log that once in a
+ 											   while */
+ 	int					listeningBackends;
+ 	QueueBackendStatus	backend[1];	/* actually this one has as many entries as
+ 									 * connections are allowed (MaxBackends) */
+ 	/* DO NOT ADD FURTHER STRUCT MEMBERS HERE */
+ } AsyncQueueControl;
+ 
+ static AsyncQueueControl   *asyncQueueControl;
+ static SlruCtlData			AsyncCtlData;
+ 
+ #define AsyncCtl					(&AsyncCtlData)
+ #define QUEUE_PAGESIZE				BLCKSZ
+ #define QUEUE_FULL_WARN_INTERVAL	5000	/* warn at most once every 5s */
+ 
+ /*
+  * slru.c currently assumes that all filenames are four characters of hex
+  * digits. That means that we can use segments 0000 through FFFF.
+  * Each segment contains SLRU_PAGES_PER_SEGMENT pages which gives us
+  * the pages from 0 to SLRU_PAGES_PER_SEGMENT * 0xFFFF.
+  *
+  * It's of course easy to enhance slru.c but those pages give us so much
+  * space already that it doesn't seem worth the trouble...
+  *
+  * For testing queue-full behaviour it is legal to define QUEUE_MAX_PAGE as a
+  * very small multiple of SLRU_PAGES_PER_SEGMENT.
+  */
+ #define QUEUE_MAX_PAGE			(SLRU_PAGES_PER_SEGMENT * 0xFFFF)
+ 
+ static List *pendingNotifies = NIL;				/* list of Notifications */
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
+ static List *listenChannels = NIL;	/* list of channels we are listening to */
+ 
+ static bool backendSendsNotifications = false;
+ static bool backendExecutesListen = false;
+ static bool backendExecutesUnlisten = false;
+ 
+ typedef struct
+ {
+ 	char			   *channel;
+ 	List			   *xids;
+ 	QueuePosition		limitPos;
+ 	ListenActionKind	kind;
+ } ActionPhaseInOut;
+ 
+ static List *ActionPhaseInOutList = NIL;
  
  /*
!  * State for inbound notifications consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
***************
*** 171,207 ****
  
  bool		Trace_notify = false;
  
- 
  static void queue_listen(ListenActionKind action, const char *condname);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static void Exec_Listen(Relation lRel, const char *relname);
! static void Exec_Unlisten(Relation lRel, const char *relname);
! static void Exec_UnlistenAll(Relation lRel);
! static void Send_Notify(Relation lRel);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(char *relname, int32 listenerPID);
! static bool AsyncExistsPendingNotify(const char *relname);
  static void ClearPendingActionsAndNotifies(void);
  
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the relation to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", relname);
  
  	/* no point in making duplicate entries in the list ... */
! 	if (!AsyncExistsPendingNotify(relname))
  	{
  		/*
  		 * The name list needs to live until end of transaction, so store it
  		 * in the transaction context.
--- 332,484 ----
  
  bool		Trace_notify = false;
  
  static void queue_listen(ListenActionKind action, const char *condname);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static bool IsListeningOn(const char *channel);
! static bool IsInListenChannels(const char *channel);
! static void asyncQueuePhaseInOut(ListenActionKind kind, const char *channel,
! 								 QueuePosition limitPos);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
! static void Exec_Listen(const char *channel);
! static void Exec_Unlisten(const char *channel);
! static void Exec_UnlistenAll(void);
! static void SignalBackends(void);
! static void Send_Notify(void);
! static bool asyncQueuePagePrecedesPhysically(int p, int q);
! static bool asyncQueuePagePrecedesLogically(int p, int q, int head);
! static bool asyncQueueAdvance(QueuePosition *position, int entryLength);
! static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
! static void asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n,
! 										  QueuePosition pos);
! static List *asyncQueueAddEntries(List *notifications);
! static bool asyncQueueGetEntriesByPage(QueuePosition *current,
! 									   QueuePosition stop,
! 									   List **notifications);
! static void asyncQueueReadAllNotifications(void);
! static void asyncQueueAdvanceTail(void);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(const char *channel,
! 							 const char *payload,
! 							 int32 srcPid);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
  static void ClearPendingActionsAndNotifies(void);
  
+ /*
+  * We will work on the page range of 0..(SLRU_PAGES_PER_SEGMENT * 0xFFFF).
+  * asyncQueuePagePrecedesPhysically just checks numerically without any magic if
+  * one page precedes another one.
+  *
+  * On the other hand, when asyncQueuePagePrecedesLogically does that check, it
+  * takes the current head page number into account. If we have wrapped
+  * around, it can happen that p precedes q, even though p > q (if the head page
+  * is in between the two).
+  */ 
+ static bool
+ asyncQueuePagePrecedesPhysically(int p, int q)
+ {
+ 	return p < q;
+ }
+ 
+ static bool
+ asyncQueuePagePrecedesLogically(int p, int q, int head)
+ {
+ 	if (p <= head && q <= head)
+ 		return p < q;
+ 	if (p > head && q > head)
+ 		return p < q;
+ 	if (p <= head)
+ 	{
+ 		Assert(q > head);
+ 		/* q is older */
+ 		return false;
+ 	}
+ 	else
+ 	{
+ 		Assert(p > head && q <= head);
+ 		/* p is older */
+ 		return true;
+ 	}
+ }
+ 
+ void
+ AsyncShmemInit(void)
+ {
+ 	bool	found;
+ 	int		slotno;
+ 	Size	size;
+ 
+ 	/*
+ 	 * Remember that sizeof(AsyncQueueControl) already contains one member of
+ 	 * QueueBackendStatus, so we only need to add the status space requirement
+ 	 * for MaxBackends-1 backends.
+ 	 */
+ 	size = mul_size(MaxBackends-1, sizeof(QueueBackendStatus));
+ 	size = add_size(size, sizeof(AsyncQueueControl));
+ 
+ 	asyncQueueControl = (AsyncQueueControl *)
+ 		ShmemInitStruct("Async Queue Control", size, &found);
+ 
+ 	if (!asyncQueueControl)
+ 		elog(ERROR, "out of memory");
+ 
+ 	if (!found)
+ 	{
+ 		int		i;
+ 		SET_QUEUE_POS(QUEUE_HEAD, 0, 0);
+ 		SET_QUEUE_POS(QUEUE_TAIL, 0, 0);
+ 		for (i = 0; i < MaxBackends; i++)
+ 		{
+ 			SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ 			QUEUE_BACKEND_PID(i) = InvalidPid;
+ 			QUEUE_BACKEND_XID(i) = InvalidTransactionId;
+ 		}
+ 	}
+ 
+ 	AsyncCtl->PagePrecedes = asyncQueuePagePrecedesPhysically;
+ 	SimpleLruInit(AsyncCtl, "Async Ctl", NUM_ASYNC_BUFFERS, 0,
+ 				  AsyncCtlLock, "pg_notify");
+ 	AsyncCtl->do_fsync = false;
+ 	asyncQueueControl->lastQueueFillWarn = GetCurrentTimestamp();
+ 
+ 	if (!found)
+ 	{
+ 		LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ 		LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
+ 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
+ 		AsyncCtl->shared->page_dirty[slotno] = true;
+ 		SimpleLruWritePage(AsyncCtl, slotno, NULL);
+ 		LWLockRelease(AsyncCtlLock);
+ 		LWLockRelease(AsyncQueueLock);
+ 
+ 		SimpleLruTruncate(AsyncCtl, 0);
+ 	}
+ }
+ 
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the channel to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *channel, const char *payload)
  {
+ 
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", channel);
! 
! 	/*
! 	 * XXX - do we now need a guc parameter max_notifies_per_txn?
! 	 */ 
  
  	/* no point in making duplicate entries in the list ... */
! 	if (!AsyncExistsPendingNotify(channel, payload))
  	{
+ 		Notification *n;
  		/*
  		 * The name list needs to live until end of transaction, so store it
  		 * in the transaction context.
***************
*** 210,221 ****
  
  		oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  		/*
! 		 * Ordering of the list isn't important.  We choose to put new entries
! 		 * on the front, as this might make duplicate-elimination a tad faster
! 		 * when the same condition is signaled many times in a row.
  		 */
! 		pendingNotifies = lcons(pstrdup(relname), pendingNotifies);
  
  		MemoryContextSwitchTo(oldcontext);
  	}
--- 487,509 ----
  
  		oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
+ 		n = (Notification *) palloc(sizeof(Notification));
+ 		n->channel = pstrdup(channel);
+ 		if (payload)
+ 			n->payload = pstrdup(payload);
+ 		else
+ 			n->payload = "";
+ 		/* will set the xid and the srcPid later... */
+ 		n->xid = InvalidTransactionId;
+ 		n->srcPid = InvalidPid;
+ 		/* we don't set n->position here. It is unknown and we won't do anything
+ 		 * with it at this point anyway. */
+ 
  		/*
! 		 * We want to preserve the order so we need to append every
! 		 * notification. See comments at AsyncExistsPendingNotify().
  		 */
! 		pendingNotifies = lappend(pendingNotifies, n);
  
  		MemoryContextSwitchTo(oldcontext);
  	}
***************
*** 259,270 ****
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", relname, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, relname);
  }
  
  /*
--- 547,558 ----
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", channel, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, channel);
  }
  
  /*
***************
*** 273,288 ****
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", relname, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, relname);
  }
  
  /*
--- 561,576 ----
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", channel, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, channel);
  }
  
  /*
***************
*** 306,313 ****
  /*
   * Async_UnlistenOnExit
   *
-  *		Clean up the pg_listener table at backend exit.
-  *
   *		This is executed if we have done any LISTENs in this backend.
   *		It might not be necessary anymore, if the user UNLISTENed everything,
   *		but we don't try to detect that case.
--- 594,599 ----
***************
*** 315,331 ****
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
  	/*
! 	 * We need to start/commit a transaction for the unlisten, but if there is
! 	 * already an active transaction we had better abort that one first.
! 	 * Otherwise we'd end up committing changes that probably ought to be
! 	 * discarded.
  	 */
! 	AbortOutOfAnyTransaction();
! 	/* Now we can do the unlisten */
! 	StartTransactionCommand();
! 	Async_UnlistenAll();
! 	CommitTransactionCommand();
  }
  
  /*
--- 601,625 ----
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
+ 	bool	advanceTail = false;
+ 
+ 	AbortOutOfAnyTransaction();
+ 
+ 	LWLockAcquire(AsyncCommitOrderLock, LW_SHARED);
+ 	QUEUE_BACKEND_XID(MyBackendId) = InvalidTransactionId;
+ 	LWLockRelease(AsyncCommitOrderLock);
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
  	/*
! 	 * If we were the slowest reader (our position is the tail), advance it.
  	 */
! 	if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
! 		advanceTail = true;
! 	LWLockRelease(AsyncQueueLock);
! 
! 	if (advanceTail)
! 		asyncQueueAdvanceTail();
  }
  
  /*
***************
*** 348,357 ****
  	/* We can deal with pending NOTIFY though */
  	foreach(p, pendingNotifies)
  	{
! 		const char *relname = (const char *) lfirst(p);
  
  		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   relname, strlen(relname) + 1);
  	}
  
  	/*
--- 642,656 ----
  	/* We can deal with pending NOTIFY though */
  	foreach(p, pendingNotifies)
  	{
! 		AsyncQueueEntry qe;
! 		Notification   *n;
! 
! 		n = (Notification *) lfirst(p);
! 
! 		asyncQueueNotificationToEntry(n, &qe);
  
  		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   &qe, qe.length);
  	}
  
  	/*
***************
*** 363,386 ****
  }
  
  /*
!  * AtCommit_Notify
!  *
!  *		This is called at transaction commit.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, insert or delete
!  *		tuples in pg_listener accordingly.
   *
!  *		If there are outbound notify requests in the pendingNotifies list,
!  *		scan pg_listener for matching tuples, and either signal the other
!  *		backend or send a message to our own frontend.
   *
!  *		NOTE: we are still inside the current transaction, therefore can
!  *		piggyback on its committing of changes.
   */
  void
! AtCommit_Notify(void)
  {
- 	Relation	lRel;
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
--- 662,681 ----
  }
  
  /*
!  * AtCommit_NotifyBeforeCommit
   *
!  *		This is called at transaction commit, before actually committing to
!  *		clog.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, update our
!  *		"listenChannels" list.
   *
!  *		If there are outbound notify requests in the pendingNotifies list, add
!  *		them to the global queue and signal any backend that is listening.
   */
  void
! AtCommit_NotifyBeforeCommit(void)
  {
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
***************
*** 397,406 ****
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify");
  
! 	/* Acquire ExclusiveLock on pg_listener */
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
--- 692,702 ----
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyBeforeCommit");
  
! 	Assert(backendSendsNotifications == false);
! 	Assert(backendExecutesListen == false);
! 	Assert(backendExecutesUnlisten == false);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
***************
*** 410,508 ****
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_Listen(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN:
! 				Exec_Unlisten(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAll(lRel);
  				break;
  		}
- 
- 		/* We must CCI after each action in case of conflicting actions */
- 		CommandCounterIncrement();
  	}
  
! 	/* Perform any pending notifies */
  	if (pendingNotifies)
! 		Send_Notify(lRel);
  
  	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that notified backends see our tuple updates when they look. Else they
! 	 * might disregard the signal, which would make the application programmer
! 	 * very unhappy.  Also, this prevents race conditions when we have just
! 	 * inserted a listening tuple.
  	 */
! 	heap_close(lRel, NoLock);
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify: done");
  }
  
  /*
!  * Exec_Listen --- subroutine for AtCommit_Notify
!  *
!  *		Register the current backend as listening on the specified relation.
   */
! static void
! Exec_Listen(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
! 	Datum		values[Natts_pg_listener];
! 	bool		nulls[Natts_pg_listener];
! 	NameData	condname;
! 	bool		alreadyListener = false;
! 
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Listen(%s,%d)", relname, MyProcPid);
  
! 	/* Detect whether we are already listening on this relname */
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
  
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
  		{
! 			alreadyListener = true;
! 			/* No need to scan the rest of the table */
! 			break;
  		}
  	}
! 	heap_endscan(scan);
  
! 	if (alreadyListener)
! 		return;
  
! 	/*
! 	 * OK to insert a new tuple
! 	 */
! 	memset(nulls, false, sizeof(nulls));
  
! 	namestrcpy(&condname, relname);
! 	values[Anum_pg_listener_relname - 1] = NameGetDatum(&condname);
! 	values[Anum_pg_listener_listenerpid - 1] = Int32GetDatum(MyProcPid);
! 	values[Anum_pg_listener_notification - 1] = Int32GetDatum(0);		/* no notifies pending */
  
! 	tuple = heap_form_tuple(RelationGetDescr(lRel), values, nulls);
  
! 	simple_heap_insert(lRel, tuple);
  
! #ifdef NOT_USED					/* currently there are no indexes */
! 	CatalogUpdateIndexes(lRel, tuple);
! #endif
  
! 	heap_freetuple(tuple);
  
  	/*
! 	 * now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
--- 706,973 ----
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_Listen(actrec->condname);
  				break;
  			case LISTEN_UNLISTEN:
! 				Exec_Unlisten(actrec->condname);
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAll();
  				break;
  		}
  	}
  
! 	/*
! 	 * Perform any pending notifies.
! 	 */
  	if (pendingNotifies)
! 		Send_Notify();
  
  	/*
! 	 * Grab the AsyncCommitOrderLock to ensure we know the commit order.
! 	 * In case we have only sent notifications and have not executed LISTEN
! 	 * or UNLISTEN, a shared lock is sufficient.
  	 */
! 	if (backendExecutesListen || backendExecutesUnlisten)
! 		LWLockAcquire(AsyncCommitOrderLock, LW_EXCLUSIVE);
! 	else if (backendSendsNotifications)
! 		LWLockAcquire(AsyncCommitOrderLock, LW_SHARED);
! }
! 
! /*
!  * AtCommit_NotifyAfterCommit
!  *
!  *		This is called at transaction commit, after committing to clog.
!  *
!  *		Notify the listening backends.
!  */
! void
! AtCommit_NotifyAfterCommit(void)
! {
! 	QueuePosition	head;
! 	ListCell	   *p, *q;
! 
! 	/* Allow transactions that have not executed LISTEN/UNLISTEN/NOTIFY to
! 	 * return as soon as possible */
! 	if (!pendingActions && !backendSendsNotifications)
! 		return;
! 
! 	if (backendExecutesListen || backendExecutesUnlisten)
! 	{
! 		LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 		head = QUEUE_HEAD;
! 		LWLockRelease(AsyncQueueLock);
! 	}
! 
! 	foreach(p, pendingActions)
! 	{
! 		ListenAction *actrec = (ListenAction *) lfirst(p);
! 
! 		switch (actrec->action)
! 		{
! 			case LISTEN_LISTEN:
! 				Assert(backendExecutesListen);
! 				asyncQueuePhaseInOut(LISTEN_LISTEN, actrec->condname, head);
! 				break;
! 			case LISTEN_UNLISTEN:
! 				Assert(backendExecutesUnlisten);
! 				asyncQueuePhaseInOut(LISTEN_UNLISTEN, actrec->condname, head);
! 				break;
! 			case LISTEN_UNLISTEN_ALL:
! 				Assert(backendExecutesUnlisten);
! 				foreach(q, listenChannels)
! 				{
! 					char *lchan = (char *) lfirst(q);
! 					asyncQueuePhaseInOut(LISTEN_UNLISTEN, lchan, head);
! 				}
! 				break;
! 		}
! 	}
! 
! 	if (backendSendsNotifications)
! 		QUEUE_BACKEND_XID(MyBackendId) = InvalidTransactionId;
! 
! 	if (backendSendsNotifications
! 	 		|| backendExecutesListen
! 			|| backendExecutesUnlisten)
! 	{
! 		Assert(LWLockHeldByMe(AsyncCommitOrderLock));
! 		LWLockRelease(AsyncCommitOrderLock);
! 	}
! 
! 	if (backendSendsNotifications)
! 		SignalBackends();
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyAfterCommit: done");
  }
  
  /*
!  * This function is executed for every notification found in the queue in order
!  * to check if the current backend is listening on that channel. Not sure if we
!  * should further optimize this, for example convert to a sorted array and
!  * allow binary search on it...
   */
! static bool
! IsListeningOn(const char *channel)
  {
! 	ListCell   *p, *q;
  
! 	foreach(p, listenChannels)
  	{
! 		ActionPhaseInOut   *act;
! 		char			   *lchan = (char *) lfirst(p);
! 		bool				vote;
  
! 		if (strcmp(lchan, channel) == 0)
  		{
! 			vote = true;
! 			foreach(q, ActionPhaseInOutList)
! 			{
! 				act = (ActionPhaseInOut *) lfirst(q);
! 				if (strcmp(act->channel, channel) != 0)
! 					continue;
! 				if (act->kind == LISTEN_LISTEN)
! 					vote = true;
! 				if (act->kind == LISTEN_UNLISTEN)
! 					vote = false;
! 			}
! 			return vote;
  		}
  	}
! 	return false;
! }
  
! /*
!  * This is a less strict version of IsListeningOn().
!  *
!  * Think of the following:
!  *
!  * Backend 1:            Backend 2:
!  * LISTEN
!  * COMMIT
!  *                       NOTIFY
!  * UNLISTEN
!  *                       COMMIT
!  * COMMIT
!  *
!  * At the end backend 1 would say IsListeningOn() == false, but
!  * IsInListenChannels() == true.
!  *
!  * IsInListenChannels() is called from ProcessIncomingNotify() to check if
!  * a notification could be of interest.
!  */
! static bool
! IsInListenChannels(const char *channel)
! {
! 	ListCell   *p;
! 	char	   *lchan;
  
! 	foreach(p, listenChannels)
! 	{
! 		lchan = (char *) lfirst(p);
! 		if (strcmp(lchan, channel) == 0)
! 			return true;
! 	}
! 	return false;
! }
! 
! Datum
! pg_listening(PG_FUNCTION_ARGS)
! {
! 	FuncCallContext	   *funcctx;
! 	ListCell		  **lcp;
! 
! 	/* stuff done only on the first call of the function */
! 	if (SRF_IS_FIRSTCALL())
! 	{
! 		MemoryContext	oldcontext;
! 
! 		/* create a function context for cross-call persistence */
! 		funcctx = SRF_FIRSTCALL_INIT();
! 
! 		/*
! 		 * switch to memory context appropriate for multiple function calls
! 		 */
! 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
  
! 		/* allocate memory for user context */
! 		lcp = (ListCell **) palloc(sizeof(ListCell *));
! 		if (listenChannels != NIL)
! 			*lcp = list_head(listenChannels);
! 		else
! 			*lcp = NULL;
! 		funcctx->user_fctx = (void *) lcp;
  
! 		MemoryContextSwitchTo(oldcontext);
! 	}
  
! 	/* stuff done on every call of the function */
! 	funcctx = SRF_PERCALL_SETUP();
! 	lcp = (ListCell **) funcctx->user_fctx;
  
! 	while (*lcp != NULL)
! 	{
! 		char   *channel = (char *) lfirst(*lcp);
  
! 		*lcp = (*lcp)->next;
! 
! 		if (IsListeningOn(channel))
! 			SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
! 	}
! 
! 	SRF_RETURN_DONE(funcctx);
! }
! 
! /*
!  * Exec_Listen --- subroutine for AtCommit_NotifyBeforeCommit
!  *
!  *		Register the current backend as listening on the specified channel.
!  */
! static void
! Exec_Listen(const char *channel)
! {
! 	MemoryContext oldcontext;
! 
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Listen(%s,%d)", channel, MyProcPid);
! 
! 	/*
! 	 * This line has to be here and not after the IsInListenChannels() call to
! 	 * match the AtCommit_NotifyAfterCommit() checks.
! 	 */
! 	backendExecutesListen = true;
! 
! 	/* Detect whether we are already listening on this channel */
! 	if (IsInListenChannels(channel))
! 		return;
! 
! 	/*
! 	 * OK to insert to the list.
! 	 */
! 	if (listenChannels == NIL)
! 	{
! 		/*
! 		 * This is our first LISTEN, establish our pointer.
! 		 */
! 		LWLockAcquire(AsyncCommitOrderLock, LW_SHARED);
! 		QUEUE_BACKEND_XID(MyBackendId) = InvalidTransactionId;
! 		LWLockRelease(AsyncCommitOrderLock);
! 
! 		LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 		QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;
! 		QUEUE_BACKEND_PID(MyBackendId) = MyProcPid;
! 		LWLockRelease(AsyncQueueLock);
! 	}
! 
! 	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
! 	listenChannels = lappend(listenChannels, pstrdup(channel));
! 	MemoryContextSwitchTo(oldcontext);
  
  	/*
! 	 * Now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
***************
*** 514,550 ****
  /*
   * Exec_Unlisten --- subroutine for AtCommit_Notify
   *
!  *		Remove the current backend from the list of listening backends
!  *		for the specified relation.
   */
  static void
! Exec_Unlisten(Relation lRel, const char *relname)
  {
- 	HeapScanDesc scan;
- 	HeapTuple	tuple;
- 
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Unlisten(%s,%d)", relname, MyProcPid);
  
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
! 
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
! 		{
! 			/* Found the matching tuple, delete it */
! 			simple_heap_delete(lRel, &tuple->t_self);
! 
! 			/*
! 			 * We assume there can be only one match, so no need to scan the
! 			 * rest of the table
! 			 */
! 			break;
! 		}
! 	}
! 	heap_endscan(scan);
  
  	/*
  	 * We do not complain about unlistening something not being listened;
--- 979,993 ----
  /*
   * Exec_Unlisten --- subroutine for AtCommit_Notify
   *
!  *		Remove the specified channel from "listenChannels".
   */
  static void
! Exec_Unlisten(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Unlisten(%s,%d)", channel, MyProcPid);
  
! 	backendExecutesUnlisten = true;
  
  	/*
  	 * We do not complain about unlistening something not being listened;
***************
*** 555,690 ****
  /*
   * Exec_UnlistenAll --- subroutine for AtCommit_Notify
   *
!  *		Update pg_listener to unlisten all relations for this backend.
   */
  static void
! Exec_UnlistenAll(Relation lRel)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple;
! 	ScanKeyData key[1];
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAll");
  
! 	/* Find and delete all entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
  
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 		simple_heap_delete(lRel, &lTuple->t_self);
  
! 	heap_endscan(scan);
  }
  
  /*
!  * Send_Notify --- subroutine for AtCommit_Notify
!  *
!  *		Scan pg_listener for tuples matching our pending notifies, and
!  *		either signal the other backend or send a message to our own frontend.
   */
  static void
! Send_Notify(Relation lRel)
  {
! 	TupleDesc	tdesc = RelationGetDescr(lRel);
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 
! 	/* preset data to update notify column to MyProcPid */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(MyProcPid);
! 
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		listenerPID = listener->listenerpid;
  
! 		if (!AsyncExistsPendingNotify(relname))
! 			continue;
  
! 		if (listenerPID == MyProcPid)
  		{
! 			/*
! 			 * Self-notify: no need to bother with table update. Indeed, we
! 			 * *must not* clear the notification field in this path, or we
! 			 * could lose an outside notify, which'd be bad for applications
! 			 * that ignore self-notify messages.
! 			 */
! 			if (Trace_notify)
! 				elog(DEBUG1, "AtCommit_Notify: notifying self");
  
! 			NotifyMyFrontEnd(relname, listenerPID);
  		}
  		else
  		{
- 			if (Trace_notify)
- 				elog(DEBUG1, "AtCommit_Notify: notifying pid %d",
- 					 listenerPID);
- 
  			/*
! 			 * If someone has already notified this listener, we don't bother
! 			 * modifying the table, but we do still send a NOTIFY_INTERRUPT
! 			 * signal, just in case that backend missed the earlier signal for
! 			 * some reason.  It's OK to send the signal first, because the
! 			 * other guy can't read pg_listener until we unlock it.
! 			 *
! 			 * Note: we don't have the other guy's BackendId available, so
! 			 * this will incur a search of the ProcSignal table.  That's
! 			 * probably not worth worrying about.
  			 */
! 			if (SendProcSignal(listenerPID, PROCSIG_NOTIFY_INTERRUPT,
! 							   InvalidBackendId) < 0)
  			{
! 				/*
! 				 * Get rid of pg_listener entry if it refers to a PID that no
! 				 * longer exists.  Presumably, that backend crashed without
! 				 * deleting its pg_listener entries. This code used to only
! 				 * delete the entry if errno==ESRCH, but as far as I can see
! 				 * we should just do it for any failure (certainly at least
! 				 * for EPERM too...)
! 				 */
! 				simple_heap_delete(lRel, &lTuple->t_self);
  			}
! 			else if (listener->notification == 0)
  			{
! 				/* Rewrite the tuple with my PID in notification column */
! 				rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 				simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 				CatalogUpdateIndexes(lRel, rTuple);
! #endif
  			}
  		}
  	}
  
! 	heap_endscan(scan);
  }
  
  /*
   * AtAbort_Notify
   *
!  *		This is called at transaction abort.
   *
!  *		Gets rid of pending actions and outbound notifies that we would have
!  *		executed if the transaction got committed.
   */
  void
  AtAbort_Notify(void)
  {
  	ClearPendingActionsAndNotifies();
  }
  
--- 998,1390 ----
  /*
   * Exec_UnlistenAll --- subroutine for AtCommit_Notify
   *
!  *		Unlisten on all channels for this backend.
   */
  static void
! Exec_UnlistenAll(void)
  {
! 	ListCell   *p;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAll(%d)", MyProcPid);
  
! 	foreach(p, listenChannels)
! 	{
! 		char *lchan = (char *) lfirst(p);
! 		Exec_Unlisten(lchan);
! 	}
! }
  
! static bool
! asyncQueueIsFull(void)
! {
! 	QueuePosition	lookahead = QUEUE_HEAD;
! 	Size			remain = QUEUE_PAGESIZE - QUEUE_POS_OFFSET(lookahead) - 1;
! 	Size			advance = Min(remain, NOTIFY_PAYLOAD_MAX_LENGTH);
  
! 	/*
! 	 * Check what happens if we wrote a maximally sized entry. Would we go to a
! 	 * new page? If not, then our queue can not be full (because we can still
! 	 * fill at least the current page with at least one more entry).
! 	 */
! 	if (!asyncQueueAdvance(&lookahead, advance))
! 		return false;
! 
! 	/*
! 	 * The queue is full if switching to a new page would put us on the page
! 	 * of the tail pointer.
! 	 */
! 	return QUEUE_POS_PAGE(lookahead) == QUEUE_POS_PAGE(QUEUE_TAIL);
  }
  
  /*
!  * The function advances the position to the next entry. In case we jump to
!  * a new page the function returns true, else false.
   */
+ static bool
+ asyncQueueAdvance(QueuePosition *position, int entryLength)
+ {
+ 	int		pageno = QUEUE_POS_PAGE(*position);
+ 	int		offset = QUEUE_POS_OFFSET(*position);
+ 	bool	pageJump = false;
+ 
+ 	/*
+ 	 * Move to the next writing position: First jump over what we have just
+ 	 * written or read.
+ 	 */
+ 	offset += entryLength;
+ 	Assert(offset < QUEUE_PAGESIZE);
+ 
+ 	/*
+ 	 * In a second step check if another entry can be written to the page. If
+ 	 * it can, stay here, we have reached the next position. If not, then we
+ 	 * need to move on to the next page.
+ 	 */
+ 	if (offset + AsyncQueueEntryEmptySize >= QUEUE_PAGESIZE)
+ 	{
+ 		pageno++;
+ 		if (pageno > QUEUE_MAX_PAGE)
+ 			/* wrap around */
+ 			pageno = 0;
+ 		offset = 0;
+ 		pageJump = true;
+ 	}
+ 
+ 	SET_QUEUE_POS(*position, pageno, offset);
+ 	return pageJump;
+ }
+ 
+ static void
+ asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
+ {
+ 	Assert(n->channel != NULL);
+ 	Assert(n->payload != NULL);
+ 	Assert(strlen(n->payload) <= NOTIFY_PAYLOAD_MAX_LENGTH);
+ 
+ 	/* The terminator is already included in AsyncQueueEntryEmptySize */
+ 	qe->length = AsyncQueueEntryEmptySize + strlen(n->payload);
+ 	qe->srcPid = MyProcPid;
+ 	qe->dboid = MyDatabaseId;
+ 	qe->xid = GetCurrentTransactionId();
+ 	strcpy(qe->channel, n->channel);
+ 	strcpy(qe->payload, n->payload);
+ }
+ 
  static void
! asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n,
! 							  QueuePosition pos)
  {
! 	n->channel = pstrdup(qe->channel);
! 	n->payload = pstrdup(qe->payload);
! 	n->srcPid = qe->srcPid;
! 	n->xid = qe->xid;
! 	n->position = pos;
! }
! 
! static List *
! asyncQueueAddEntries(List *notifications)
! {
! 	AsyncQueueEntry	qe;
! 	int				pageno;
! 	int				offset;
! 	int				slotno;
  
! 	/*
! 	 * Note that we are holding exclusive AsyncQueueLock already.
! 	 */
! 	LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
! 	pageno = QUEUE_POS_PAGE(QUEUE_HEAD);
! 	slotno = SimpleLruReadPage(AsyncCtl, pageno, true, InvalidTransactionId);
! 	AsyncCtl->shared->page_dirty[slotno] = true;
! 
! 	do
! 	{
! 		Notification   *n;
  
! 		if (asyncQueueIsFull())
  		{
! 			/* document that we will not enter the if statement further down */
! 			Assert(QUEUE_POS_OFFSET(QUEUE_HEAD) != 0);
! 			break;
! 		}
! 
! 		n = (Notification *) linitial(notifications);
  
! 		asyncQueueNotificationToEntry(n, &qe);
! 
! 		offset = QUEUE_POS_OFFSET(QUEUE_HEAD);
! 		/*
! 		 * Check whether or not the entry still fits on the current page.
! 		 */
! 		if (offset + qe.length < QUEUE_PAGESIZE)
! 		{
! 			notifications = list_delete_first(notifications);
  		}
  		else
  		{
  			/*
! 			 * Write a dummy entry to fill up the page. Actually readers will
! 			 * only check dboid and since it won't match any reader's database
! 			 * oid, they will ignore this entry and move on.
  			 */
! 			qe.length = QUEUE_PAGESIZE - offset - 1;
! 			qe.dboid = InvalidOid;
! 			qe.channel[0] = '\0';
! 			qe.payload[0] = '\0';
! 			qe.xid = InvalidTransactionId;
! 		}
! 		memcpy((char*) AsyncCtl->shared->page_buffer[slotno] + offset,
! 			   &qe, qe.length);
! 
! 	} while (!asyncQueueAdvance(&(QUEUE_HEAD), qe.length)
! 			 && notifications != NIL);
! 
! 	if (QUEUE_POS_OFFSET(QUEUE_HEAD) == 0)
! 	{
! 		/*
! 		 * If the next entry needs to go to a new page, prepare that page
! 		 * already.
! 		 */
! 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
! 		AsyncCtl->shared->page_dirty[slotno] = true;
! 	}
! 	LWLockRelease(AsyncCtlLock);
! 
! 	return notifications;
! }
! 
! static void
! asyncQueueFillWarning()
! {
! 	/*
! 	 * Caller must hold exclusive AsyncQueueLock.
! 	 */
! 	TimestampTz		t;
! 	double			fillDegree;
! 	int				occupied;
! 	int				tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
! 	int				headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
! 
! 	occupied = headPage - tailPage;
! 
! 	if (occupied == 0)
! 		return;
! 	
! 	if (!asyncQueuePagePrecedesPhysically(tailPage, headPage))
! 		/* head has wrapped around, tail not yet */
! 		occupied += QUEUE_MAX_PAGE;
! 
! 	fillDegree = (float) occupied / (float) QUEUE_MAX_PAGE;
! 
! 	if (fillDegree < 0.5)
! 		return;
! 
! 	t = GetCurrentTimestamp();
! 
! 	if (TimestampDifferenceExceeds(asyncQueueControl->lastQueueFillWarn,
! 								   t, QUEUE_FULL_WARN_INTERVAL))
! 	{
! 		QueuePosition	min = QUEUE_HEAD;
! 		int32			minPid = InvalidPid;
! 		int				i;
! 
! 		for (i = 0; i < MaxBackends; i++)
! 			if (QUEUE_BACKEND_PID(i) != InvalidPid)
  			{
! 				min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
! 				if (QUEUE_POS_EQUAL(min, QUEUE_BACKEND_POS(i)))
! 					minPid = QUEUE_BACKEND_PID(i);
  			}
! 
! 		if (fillDegree < 0.75)
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 50%% full. "
! 								 "Among the slowest backends: %d", minPid)));
! 		else
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 75%% full. "
! 								 "Among the slowest backends: %d", minPid)));
! 
! 		asyncQueueControl->lastQueueFillWarn = t;
! 	}
! }
! 
! static List *
! asyncQueueSendingXids(void)
! {
! 	/* Caller must hold exclusive lock on AsyncCommitOrderLock */
! 	int		i;
! 	List   *xidList = NIL;
! 
! 	for (i = 0; i < MaxBackends; i++)
! 		if (QUEUE_BACKEND_XID(i) != InvalidTransactionId)
! 			xidList = lappend_int(xidList, QUEUE_BACKEND_XID(i));
! 
! 	return xidList;
! }
! 
! static void
! asyncQueuePhaseInOut(ListenActionKind kind, const char *channel,
! 					 QueuePosition limitPos)
! {
! 	/* Caller must hold exclusive lock on AsyncCommitOrderLock (for
! 	 * asyncQueueSendingXids()) */
! 	MemoryContext 		oldcontext;
! 	ActionPhaseInOut   *entry;
! 
! 	Assert(kind == LISTEN_LISTEN || kind == LISTEN_UNLISTEN);
! 
! 	if (kind == LISTEN_UNLISTEN && !IsListeningOn(channel))
! 		return;
! 	/* we cannot take the same shortcut for LISTEN_LISTEN and return if we are
! 	 * listening already. The reason is that we add the channel for every new
! 	 * LISTEN into the list of channels in Exec_Listen() and here we cannot
! 	 * tell anymore if there is already another previous LISTEN. */
! 
! 	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
! 
! 	entry = (ActionPhaseInOut *) palloc(sizeof(ActionPhaseInOut));
! 	entry->channel = pstrdup(channel);
! 	entry->xids = asyncQueueSendingXids();
! 	entry->kind = kind;
! 	entry->limitPos = limitPos;
! 
! 	ActionPhaseInOutList = lappend(ActionPhaseInOutList, entry);
! 
! 	MemoryContextSwitchTo(oldcontext);
! }
! 
! 
! /*
!  * Send_Notify --- subroutine for AtCommit_Notify
!  *
!  * Add the pending notifications to the queue and signal the listening
!  * backends.
!  *
!  * A full queue is very uncommon and should really not happen, given that we
!  * have so much space available in our slru pages. Nevertheless we need to
!  * deal with this possibility. Note that when we get here we are in the process
!  * of committing our transaction, we have not yet committed to clog but this
!  * would be the next step. So at this point in time we can still roll the
!  * transaction back.
!  */
! static void
! Send_Notify()
! {
! 	Assert(pendingNotifies != NIL);
! 
! 	backendSendsNotifications = true;
! 
! 	LWLockAcquire(AsyncCommitOrderLock, LW_SHARED);
! 	QUEUE_BACKEND_XID(MyBackendId) = GetCurrentTransactionId();
! 	LWLockRelease(AsyncCommitOrderLock);
! 
! 	while (pendingNotifies != NIL)
! 	{
! 		LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 		asyncQueueFillWarning();
! 		if (asyncQueueIsFull())
! 			ereport(ERROR,
! 					(errcode(ERRCODE_TOO_MANY_ENTRIES),
! 					errmsg("Too many notifications in the queue")));
! 		pendingNotifies = asyncQueueAddEntries(pendingNotifies);
! 		LWLockRelease(AsyncQueueLock);
! 	}
! }
! 
! /*
!  * Send signals to all listening backends. Since we hold the lock in EXCLUSIVE
!  * mode anyway, we also check each backend's position and skip signalling any
!  * backend that is already up to date.
!  *
!  * Since we know the BackendId and the Pid the signalling is quite cheap.
!  */
! static void
! SignalBackends(void)
! {
! 	QueuePosition	pos;
! 	ListCell	   *p1, *p2;
! 	int				i;
! 	int32			pid;
! 	List		   *pids = NIL;
! 	List		   *ids = NIL;
! 	int				count = 0;
! 
! 	/* Signal everybody who is LISTENing to any channel. */
! 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 	for (i = 0; i < MaxBackends; i++)
! 	{
! 		pid = QUEUE_BACKEND_PID(i);
! 		if (pid != InvalidPid)
! 		{
! 			count++;
! 			pos = QUEUE_BACKEND_POS(i);
! 			if (!QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
  			{
! 				pids = lappend_int(pids, pid);
! 				ids = lappend_int(ids, i);
  			}
  		}
  	}
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	forboth(p1, pids, p2, ids)
+ 	{
+ 		pid = (int32) lfirst_int(p1);
+ 		i = lfirst_int(p2);
+ 		/*
+ 		 * Should we check for failure? Can it happen that a backend
+ 		 * has crashed without the postmaster starting over?
+ 		 */
+ 		if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, i) < 0)
+ 			elog(WARNING, "Error signalling backend %d", pid);
+ 	}
  
! 	if (count == 0)
! 	{
! 		/* No backend is listening at all, signal myself so we can clean up
! 		 * the queue. */
! 		SendProcSignal(MyProcPid, PROCSIG_NOTIFY_INTERRUPT, MyBackendId);
! 	}
  }
  
  /*
   * AtAbort_Notify
   *
!  *	This is called at transaction abort.
   *
!  *	Gets rid of pending actions and outbound notifies that we would have
!  *	executed if the transaction got committed.
!  *
!  *	Even though we have not committed, we need to signal the listening backends
!  *	because our notifications might block readers from processing the queue.
!  *	Now that the transaction has aborted, they can go on and skip our
!  *	notifications.
   */
  void
  AtAbort_Notify(void)
  {
+ 	if (backendSendsNotifications)
+ 		SignalBackends();
+ 
  	ClearPendingActionsAndNotifies();
  }
  
***************
*** 940,968 ****
  }
  
  /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan pg_listener for arriving notifies, report them to my front end,
!  *		and clear the notification field in pg_listener until next time.
   *
!  *		NOTE: since we are outside any transaction, we must create our own.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	Relation	lRel;
! 	TupleDesc	tdesc;
! 	ScanKeyData key[1];
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 	bool		catchup_enabled;
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
--- 1640,2005 ----
  }
  
  /*
+  * This function asks for a page with read-only access. Once we hold the lock,
+  * we read the page's entries and pass back the list of notifications for the
+  * calling function to deliver. The list contains only notifications from
+  * transactions that have already committed.
+  *
+  * We stop when we either reach the stop position or move to a new page.
+  *
+  * The function returns true once we have reached the end or a notification of
+  * a transaction that is still running and false if we have just finished with
+  * the page.
+  */
+ static bool
+ asyncQueueGetEntriesByPage(QueuePosition *current,
+ 						   QueuePosition stop,
+ 						   List **notifications)
+ {
+ 	AsyncQueueEntry	qe;
+ 	Notification   *n;
+ 	int				slotno;
+ 	bool			reachedStop = false;
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		return true;
+ 
+ 	slotno = SimpleLruReadPage_ReadOnly(AsyncCtl, current->page,
+ 										InvalidTransactionId);
+ 	do {
+ 		char *readPtr = (char *) (AsyncCtl->shared->page_buffer[slotno]);
+ 		readPtr += current->offset;
+ 
+ 		if (QUEUE_POS_EQUAL(*current, stop))
+ 		{
+ 			reachedStop = true;
+ 			break;
+ 		}
+ 
+ 		memcpy(&qe, readPtr, AsyncQueueEntryEmptySize);
+ 
+ 		if (qe.dboid == MyDatabaseId)
+ 		{
+ 			if (TransactionIdDidCommit(qe.xid))
+ 			{
+ 				if (IsInListenChannels(qe.channel))
+ 				{
+ 					if (qe.length > AsyncQueueEntryEmptySize)
+ 						memcpy(&qe, readPtr, qe.length);
+ 					n = (Notification *) palloc(sizeof(Notification));
+ 					asyncQueueEntryToNotification(&qe, n, *current);
+ 					*notifications = lappend(*notifications, n);
+ 				}
+ 			}
+ 			else
+ 			{
+ 				if (!TransactionIdDidAbort(qe.xid))
+ 				{
+ 					/*
+ 					 * The transaction has neither committed nor aborted so
+ 					 * far.
+ 					 */
+ 					reachedStop = true;
+ 					break;
+ 				}
+ 			}
+ 		}
+ 		/*
+ 		 * The call to asyncQueueAdvance just jumps over what we have
+ 		 * just read. If there is no more space for the next record on the
+ 		 * current page, it will also switch to the beginning of the next page.
+ 		 */
+ 	} while(!asyncQueueAdvance(current, qe.length));
+ 
+ 	LWLockRelease(AsyncCtlLock);
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		reachedStop = true;
+ 
+ 	return reachedStop;
+ }
+ 
+ static bool
+ asyncQueueCheckDelivery(Notification *n, QueuePosition head)
+ {
+ 	ActionPhaseInOut   *act;
+ 	ListCell		   *lc;
+ 	bool				vote = true;
+ 
+ 	foreach(lc, ActionPhaseInOutList)
+ 	{
+ 		act = (ActionPhaseInOut *) lfirst(lc);
+ 
+ 		if (strcmp(act->channel, n->channel) != 0)
+ 			continue;
+ 
+ 		if (act->kind == LISTEN_LISTEN)
+ 		{
+ 			if (QUEUE_POS_LT(n->position, act->limitPos, head))
+ 			{
+ 				/*
+ 				 * When LISTEN committed, n->xid was still running. As n->xid
+ 				 * has committed by now, we need to deliver its notification.
+ 				 *
+ 				 * If n->xid was not running then it is a committed transaction
+ 				 * and we must not deliver notifications from already committed
+ 				 * transactions.
+ 				 */
+ 				if (list_member_int(act->xids, n->xid))
+ 					vote = true;
+ 				else
+ 					vote = false;
+ 			}
+ 			else
+ 				/*
+ 				 * n->xid committed, when LISTEN was already fully established
+ 				 */
+ 				vote = true;
+ 		}
+ 		else
+ 		{
+ 			Assert(act->kind == LISTEN_UNLISTEN);
+ 			if (QUEUE_POS_LT(n->position, act->limitPos, head))
+ 			{
+ 				/*
+ 				 * When UNLISTEN committed, n->xid was still running. As n->xid
+ 				 * has committed by now, we must not deliver its notification.
+ 				 *
+ 				 * If n->xid was already committed, we need to deliver its
+ 				 * notification (because we assume that there has been a LISTEN
+ 				 * previously).
+ 				 */
+ 				if (list_member_int(act->xids, n->xid))
+ 					vote = false;
+ 				else
+ 					vote = true;
+ 			}
+ 			else
+ 				/*
+ 				 * n->xid committed, when UNLISTEN was already fully
+ 				 * established.
+ 				 */
+ 				vote = false;
+ 		}
+ 	}
+ 
+ 	return vote;
+ }
+ 
+ static void
+ asyncQueueCleanUpPhaseInOut(QueuePosition pos, QueuePosition head)
+ {
+ 	ActionPhaseInOut   *act;
+ 	ListCell		   *p, *q;
+ 	List			   *deletedChannels = NIL;
+ 
+ 	while ((p = list_head(ActionPhaseInOutList)))
+ 	{
+ 		act = (ActionPhaseInOut *) lfirst(p);
+ 		if (QUEUE_POS_LE(act->limitPos, pos, head))
+ 		{
+ 			list_free(act->xids);
+ 			if (act->kind == LISTEN_LISTEN)
+ 			{
+ 				pfree(act->channel);
+ 			}
+ 			else
+ 			{
+ 				Assert(act->kind == LISTEN_UNLISTEN);
+ 				/* do not free act->channel, we reuse it... */
+ 				deletedChannels = lappend(deletedChannels, act->channel);
+ 			}
+ 			ActionPhaseInOutList =
+ 							list_delete_cell(ActionPhaseInOutList, p, NULL);
+ 			
+ 			if (ActionPhaseInOutList == NIL)
+ 				break;
+ 		}
+ 		else
+ 		{
+ 			/* the entries are ordered by limitPos, if we don't have a hit
+ 			 * now we won't have a further hit later on either */
+ 			break;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Check for every channel that we want to delete from listenChannels if
+ 	 * it is still in use by a subsequently issued LISTEN. There clearly is
+ 	 * room for improving the performance of this check but we expect the lists
+ 	 * to be really short anyway...
+ 	 */
+ 	foreach(p, deletedChannels)
+ 	{
+ 		char   *candidate = (char *) lfirst(p);
+ 		bool	found = false;
+ 		foreach(q, ActionPhaseInOutList)
+ 		{
+ 			act = (ActionPhaseInOut *) lfirst(q);
+ 			if (strcmp(candidate, act->channel) == 0)
+ 			{
+ 				found = true;
+ 				break;
+ 			}
+ 		}
+ 		if (found == false)
+ 		{
+ 			ListCell *prev = NULL;
+ 			foreach(q, listenChannels)
+ 			{
+ 				char *lchan = (char *) lfirst(q);
+ 				if (strcmp(lchan, candidate) == 0)
+ 				{
+ 					pfree(lchan);
+ 					listenChannels = list_delete_cell(listenChannels, q, prev);
+ 					Assert(!IsInListenChannels(candidate));
+ 					break;
+ 				}
+ 				prev = q;
+ 			}
+ 		}
+ 	}
+ 
+ 	if (listenChannels == NIL && ActionPhaseInOutList == NIL)
+ 	{
+ 		bool advanceTail = false;
+ 
+ 		LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 		QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
+ 		Assert(QUEUE_BACKEND_XID(MyBackendId) == InvalidTransactionId);
+ 		if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
+ 			advanceTail = true;
+ 		LWLockRelease(AsyncQueueLock);
+ 
+ 		if (advanceTail)
+ 			/* Move forward the tail pointer and try to truncate. */
+ 			asyncQueueAdvanceTail();
+ 	}
+ }
+ 
+ static void
+ asyncQueueReadAllNotifications(void)
+ {
+ 	QueuePosition	pos;
+ 	QueuePosition	oldpos;
+ 	QueuePosition	head;
+ 	List		   *notifications;
+ 	ListCell	   *lc;
+ 	Notification   *n;
+ 	bool			advanceTail = false;
+ 	bool			reachedStop;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	pos = oldpos = QUEUE_BACKEND_POS(MyBackendId);
+ 	head = QUEUE_HEAD;
+ 	/*
+ 	 * We could have signalled ourselves because nobody was listening when we
+  	 * sent out notifications.
+  	 */
+ 	if (QUEUE_BACKEND_PID(MyBackendId) == InvalidPid)
+ 		advanceTail = true;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (advanceTail)
+ 	{
+ 		asyncQueueAdvanceTail();
+ 		return;
+ 	}
+ 
+ 	/* Nothing to do, we have read all notifications already. */
+ 	if (QUEUE_POS_EQUAL(pos, head))
+ 		return;
+ 
+ 	do 
+ 	{
+ 		/*
+ 		 * Our stop position is what we found to be the head's position when
+ 		 * we entered this function. It might have changed already. But if it
+ 		 * has, we will receive (or have already received and queued) another
+ 		 * signal and come here again.
+ 		 *
+ 		 * We are not holding AsyncQueueLock here! The queue can only extend
+ 		 * beyond the head pointer (see above) and we leave our backend's
+ 		 * pointer where it is so nobody will truncate or rewrite pages under
+ 		 * us.
+ 		 */
+ 		reachedStop = false;
+ 
+ 		notifications = NIL;
+ 		reachedStop = asyncQueueGetEntriesByPage(&pos, head, &notifications);
+ 
+ 		foreach(lc, notifications)
+ 		{
+ 			n = (Notification *) lfirst(lc);
+ 			if (asyncQueueCheckDelivery(n, head))
+ 				NotifyMyFrontEnd(n->channel, n->payload, n->srcPid);
+ 		}
+ 		asyncQueueCleanUpPhaseInOut(pos, head);
+ 	} while (!reachedStop);
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	QUEUE_BACKEND_POS(MyBackendId) = pos;
+ 	if (QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL))
+ 		advanceTail = true;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (advanceTail)
+ 		/* Move forward the tail pointer and try to truncate. */
+ 		asyncQueueAdvanceTail();
+ }
+ 
+ static void
+ asyncQueueAdvanceTail()
+ {
+ 	QueuePosition	min;
+ 	int				i;
+ 	int				tailp;
+ 	int				headp;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ 	min = QUEUE_HEAD;
+ 	for (i = 0; i < MaxBackends; i++)
+ 		if (QUEUE_BACKEND_PID(i) != InvalidPid)
+ 			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
+ 
+ 	tailp = QUEUE_POS_PAGE(QUEUE_TAIL);
+ 	headp = QUEUE_POS_PAGE(QUEUE_HEAD);
+ 	QUEUE_TAIL = min;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	/* This is our wraparound check */
+ 	if ((asyncQueuePagePrecedesLogically(tailp, QUEUE_POS_PAGE(min), headp)
+ 			&& asyncQueuePagePrecedesPhysically(tailp, headp))
+ 		|| tailp == QUEUE_POS_PAGE(min))
+ 	{
+ 		/*
+ 		 * SimpleLruTruncate() will ask for AsyncCtlLock but will also
+ 		 * release the lock again.
+ 		 *
+ 		 * XXX this could be optimized, to call SimpleLruTruncate only when we
+ 		 * know we can truncate something.
+ 		 */
+ 		SimpleLruTruncate(AsyncCtl, QUEUE_POS_PAGE(min));
+ 	}
+ }
+ 
+ /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan the queue for arriving notifications and report them to my front
!  *		end.
   *
!  *		NOTE: we are outside of any transaction here.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	bool			catchup_enabled;
! 
! 	Assert(GetCurrentTransactionIdIfAny() == InvalidTransactionId);
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
***************
*** 974,1037 ****
  
  	notifyInterruptOccurred = 0;
  
! 	StartTransactionCommand();
! 
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
! 	tdesc = RelationGetDescr(lRel);
! 
! 	/* Scan only entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
! 
! 	/* Prepare data for rewriting 0 into notification field */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(0);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		sourcePID = listener->notification;
! 
! 		if (sourcePID != 0)
! 		{
! 			/* Notify the frontend */
! 
! 			if (Trace_notify)
! 				elog(DEBUG1, "ProcessIncomingNotify: received %s from %d",
! 					 relname, (int) sourcePID);
! 
! 			NotifyMyFrontEnd(relname, sourcePID);
! 
! 			/*
! 			 * Rewrite the tuple with 0 in notification column.
! 			 */
! 			rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 			simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 			CatalogUpdateIndexes(lRel, rTuple);
! #endif
! 		}
! 	}
! 	heap_endscan(scan);
! 
! 	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that other backends see our tuple updates when they look. Otherwise, a
! 	 * transaction started after this one might mistakenly think it doesn't
! 	 * need to send this backend a new NOTIFY.
! 	 */
! 	heap_close(lRel, NoLock);
! 
! 	CommitTransactionCommand();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
--- 2011,2017 ----
  
  	notifyInterruptOccurred = 0;
  
! 	asyncQueueReadAllNotifications();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
***************
*** 1051,1070 ****
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(char *relname, int32 listenerPID)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, listenerPID, sizeof(int32));
! 		pq_sendstring(&buf, relname);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 		{
! 			/* XXX Add parameter string here later */
! 			pq_sendstring(&buf, "");
! 		}
  		pq_endmessage(&buf);
  
  		/*
--- 2031,2047 ----
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(const char *channel, const char *payload, int32 srcPid)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, srcPid, sizeof(int32));
! 		pq_sendstring(&buf, channel);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 			pq_sendstring(&buf, payload);
  		pq_endmessage(&buf);
  
  		/*
***************
*** 1074,1096 ****
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", relname);
  }
  
! /* Does pendingNotifies include the given relname? */
  static bool
! AsyncExistsPendingNotify(const char *relname)
  {
  	ListCell   *p;
  
! 	foreach(p, pendingNotifies)
! 	{
! 		const char *prelname = (const char *) lfirst(p);
  
! 		if (strcmp(prelname, relname) == 0)
  			return true;
  	}
  
  	return false;
  }
  
--- 2051,2107 ----
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", channel);
  }
  
! /* Does pendingNotifies include the given channel/payload? */
  static bool
! AsyncExistsPendingNotify(const char *channel, const char *payload)
  {
  	ListCell   *p;
+ 	Notification *n;
  
! 	if (pendingNotifies == NIL)
! 		return false;
! 
! 	if (payload == NULL)
! 		payload = "";
! 
! 	/*
! 	 * We need to append new elements to the end of the list in order to keep
! 	 * the order. However, on the other hand we'd like to check the list
! 	 * backwards in order to make duplicate-elimination a tad faster when the
! 	 * same condition is signaled many times in a row. So as a compromise we
! 	 * check the tail element first which we can access directly. If this
! 	 * doesn't match, we check the rest of the list.
! 	 */
  
! 	n = (Notification *) llast(pendingNotifies);
! 	if (strcmp(n->channel, channel) == 0)
! 	{
! 		Assert(n->payload != NULL);
! 		if (strcmp(n->payload, payload) == 0)
  			return true;
  	}
  
+ 	/*
+ 	 * Note the difference from foreach(): we stop once p is the last element,
+ 	 * so the last element, which we have checked already, is skipped.
+  	 */
+ 	for(p = list_head(pendingNotifies);
+ 		p != list_tail(pendingNotifies);
+ 		p = lnext(p))
+ 	{
+ 		n = (Notification *) lfirst(p);
+ 
+ 		if (strcmp(n->channel, channel) == 0)
+ 		{
+ 			Assert(n->payload != NULL);
+ 			if (strcmp(n->payload, payload) == 0)
+ 				return true;
+ 		}
+ 	}
+ 
  	return false;
  }
  
***************
*** 1107,1112 ****
--- 2118,2127 ----
  	 */
  	pendingActions = NIL;
  	pendingNotifies = NIL;
+ 
+ 	backendSendsNotifications = false;
+ 	backendExecutesListen = false;
+ 	backendExecutesUnlisten = false;
  }
  
  /*
***************
*** 1124,1128 ****
  	 * there is any significant delay before I commit.	OK for now because we
  	 * disallow COMMIT PREPARED inside a transaction block.)
  	 */
! 	Async_Notify((char *) recdata);
  }
--- 2139,2149 ----
  	 * there is any significant delay before I commit.	OK for now because we
  	 * disallow COMMIT PREPARED inside a transaction block.)
  	 */
! 	AsyncQueueEntry		*qe = (AsyncQueueEntry *) recdata;
! 
! 	Assert(qe->dboid == MyDatabaseId);
! 	Assert(qe->length == len);
! 
! 	Async_Notify(qe->channel, qe->payload);
  }
+ 
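(Not part of the patch.) For readers following the async.c changes, the page/offset arithmetic in asyncQueueAdvance() and the look-ahead test in asyncQueueIsFull() can be tried out in isolation. The sketch below is a standalone approximation under stated assumptions: the constants PAGESIZE, MAX_PAGE, ENTRY_EMPTY and ENTRY_MAX are tiny made-up stand-ins for QUEUE_PAGESIZE, QUEUE_MAX_PAGE, AsyncQueueEntryEmptySize and the maximal advance, and QueuePos, queue_advance(), queue_is_full() and fill_until_full() are hypothetical names. There is no shared memory, locking or SLRU access here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the patch's constants; values are illustrative. */
enum
{
	PAGESIZE = 64,		/* plays the role of QUEUE_PAGESIZE */
	MAX_PAGE = 3,		/* plays the role of QUEUE_MAX_PAGE; pages are 0..MAX_PAGE */
	ENTRY_EMPTY = 16,	/* plays the role of AsyncQueueEntryEmptySize */
	ENTRY_MAX = 40		/* largest advance the look-ahead tries */
};

typedef struct
{
	int			page;
	int			offset;
} QueuePos;

/*
 * Step over an entry of len bytes and return true if we moved to a new page.
 * The caller must make sure the entry fits on the current page (the patch
 * writes a dummy filler entry when it does not).
 */
static bool
queue_advance(QueuePos *pos, int len)
{
	pos->offset += len;
	assert(pos->offset < PAGESIZE);

	/* Stay on this page only if a minimal entry would still fit after us. */
	if (pos->offset + ENTRY_EMPTY >= PAGESIZE)
	{
		pos->page = (pos->page == MAX_PAGE) ? 0 : pos->page + 1;	/* wrap */
		pos->offset = 0;
		return true;
	}
	return false;
}

/* Full means: a maximal entry would push the head onto the tail's page. */
static bool
queue_is_full(QueuePos head, QueuePos tail)
{
	QueuePos	lookahead = head;
	int			remain = PAGESIZE - head.offset - 1;
	int			step = (remain < ENTRY_MAX) ? remain : ENTRY_MAX;

	if (!queue_advance(&lookahead, step))
		return false;			/* still room on the current page */
	return lookahead.page == tail.page;
}

/* Write len-byte entries until the queue reports full; return how many fit. */
static int
fill_until_full(QueuePos *head, QueuePos tail, int len)
{
	int			written = 0;

	while (!queue_is_full(*head, tail))
	{
		queue_advance(head, len);
		written++;
	}
	return written;
}
```

With these toy numbers and a tail that never leaves page 0, 20-byte entries fit three to a page, and the queue reports full as soon as the head sits on the last page at an offset from which a maximal entry would wrap onto the tail's page. The check is deliberately pessimistic: it reports full while part of the last page is still unused.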
diff -cr cvs/src/backend/nodes/copyfuncs.c cvs.build/src/backend/nodes/copyfuncs.c
*** cvs/src/backend/nodes/copyfuncs.c	2009-12-09 11:24:41.000000000 +0100
--- cvs.build/src/backend/nodes/copyfuncs.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 2765,2770 ****
--- 2765,2771 ----
  	NotifyStmt *newnode = makeNode(NotifyStmt);
  
  	COPY_STRING_FIELD(conditionname);
+ 	COPY_STRING_FIELD(payload);
  
  	return newnode;
  }
diff -cr cvs/src/backend/nodes/equalfuncs.c cvs.build/src/backend/nodes/equalfuncs.c
*** cvs/src/backend/nodes/equalfuncs.c	2009-12-09 11:24:41.000000000 +0100
--- cvs.build/src/backend/nodes/equalfuncs.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 1322,1327 ****
--- 1322,1328 ----
  _equalNotifyStmt(NotifyStmt *a, NotifyStmt *b)
  {
  	COMPARE_STRING_FIELD(conditionname);
+ 	COMPARE_STRING_FIELD(payload);
  
  	return true;
  }
diff -cr cvs/src/backend/nodes/outfuncs.c cvs.build/src/backend/nodes/outfuncs.c
*** cvs/src/backend/nodes/outfuncs.c	2009-12-09 11:24:41.000000000 +0100
--- cvs.build/src/backend/nodes/outfuncs.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 1813,1818 ****
--- 1813,1819 ----
  	WRITE_NODE_TYPE("NOTIFY");
  
  	WRITE_STRING_FIELD(conditionname);
+ 	WRITE_STRING_FIELD(payload);
  }
  
  static void
diff -cr cvs/src/backend/nodes/readfuncs.c cvs.build/src/backend/nodes/readfuncs.c
*** cvs/src/backend/nodes/readfuncs.c	2009-12-09 11:24:41.000000000 +0100
--- cvs.build/src/backend/nodes/readfuncs.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 231,236 ****
--- 231,237 ----
  	READ_LOCALS(NotifyStmt);
  
  	READ_STRING_FIELD(conditionname);
+ 	READ_STRING_FIELD(payload);
  
  	READ_DONE();
  }
diff -cr cvs/src/backend/parser/gram.y cvs.build/src/backend/parser/gram.y
*** cvs/src/backend/parser/gram.y	2009-12-09 11:24:49.000000000 +0100
--- cvs.build/src/backend/parser/gram.y	2009-12-09 11:26:03.000000000 +0100
***************
*** 397,403 ****
  %type <boolean> opt_varying opt_timezone
  
  %type <ival>	Iconst SignedIconst
! %type <str>		Sconst comment_text
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
--- 397,403 ----
  %type <boolean> opt_varying opt_timezone
  
  %type <ival>	Iconst SignedIconst
! %type <str>		Sconst comment_text notify_payload
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
***************
*** 6039,6048 ****
   *
   *****************************************************************************/
  
! NotifyStmt: NOTIFY ColId
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
  					$$ = (Node *)n;
  				}
  		;
--- 6039,6054 ----
   *
   *****************************************************************************/
  
! notify_payload:
! 			Sconst								{ $$ = $1; }
! 			| /*EMPTY*/							{ $$ = NULL; }
! 		;
! 
! NotifyStmt: NOTIFY ColId notify_payload
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
+ 					n->payload = $3;
  					$$ = (Node *)n;
  				}
  		;
diff -cr cvs/src/backend/storage/ipc/ipci.c cvs.build/src/backend/storage/ipc/ipci.c
*** cvs/src/backend/storage/ipc/ipci.c	2009-12-09 11:24:52.000000000 +0100
--- cvs.build/src/backend/storage/ipc/ipci.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 219,224 ****
--- 219,225 ----
  	 */
  	BTreeShmemInit();
  	SyncScanShmemInit();
+ 	AsyncShmemInit();
  
  #ifdef EXEC_BACKEND
  
diff -cr cvs/src/backend/storage/lmgr/lwlock.c cvs.build/src/backend/storage/lmgr/lwlock.c
*** cvs/src/backend/storage/lmgr/lwlock.c	2009-12-09 11:24:52.000000000 +0100
--- cvs.build/src/backend/storage/lmgr/lwlock.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 24,29 ****
--- 24,30 ----
  #include "access/clog.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "storage/ipc.h"
***************
*** 174,179 ****
--- 175,183 ----
  	/* multixact.c needs two SLRU areas */
  	numLocks += NUM_MXACTOFFSET_BUFFERS + NUM_MXACTMEMBER_BUFFERS;
  
+ 	/* async.c needs one per page for the AsyncQueue */
+ 	numLocks += NUM_ASYNC_BUFFERS;
+ 
  	/*
  	 * Add any requested by loadable modules; for backwards-compatibility
  	 * reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
diff -cr cvs/src/backend/tcop/utility.c cvs.build/src/backend/tcop/utility.c
*** cvs/src/backend/tcop/utility.c	2009-12-09 11:24:48.000000000 +0100
--- cvs.build/src/backend/tcop/utility.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 876,883 ****
  		case T_NotifyStmt:
  			{
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
! 
! 				Async_Notify(stmt->conditionname);
  			}
  			break;
  
--- 876,887 ----
  		case T_NotifyStmt:
  			{
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
! 				if (stmt->payload
! 					&& strlen(stmt->payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1)
! 					ereport(ERROR,
! 							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 							 errmsg("payload string too long")));
! 				Async_Notify(stmt->conditionname, stmt->payload);
  			}
  			break;
  
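(Not part of the patch.) The length test added to utility.c rejects a payload when strlen() exceeds NOTIFY_PAYLOAD_MAX_LENGTH - 1, which is one byte stricter than the Assert in asyncQueueNotificationToEntry(), so the Assert should always hold for payloads that pass this check. A minimal sketch of that relationship follows, assuming a made-up limit of 8 (the real value of NOTIFY_PAYLOAD_MAX_LENGTH is not shown in this part of the patch) and hypothetical helper names normalize_payload(), payload_is_acceptable() and store_payload(). A NULL payload is mapped to the empty string, as AsyncExistsPendingNotify() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical limit for illustration; not the patch's real value. */
#define PAYLOAD_MAX_LENGTH 8

/* A missing payload is treated as the empty string. */
static const char *
normalize_payload(const char *payload)
{
	return payload ? payload : "";
}

/* Mirrors the utility.c test: reject when strlen > limit - 1. */
static bool
payload_is_acceptable(const char *payload)
{
	return strlen(normalize_payload(payload)) <= PAYLOAD_MAX_LENGTH - 1;
}

/*
 * Copy an already validated payload into a fixed-size buffer. The assert
 * plays the role of the Assert in the queue-entry code; the terminating
 * NUL fits because the accepted length is at most limit - 1.
 */
static void
store_payload(char dst[PAYLOAD_MAX_LENGTH], const char *payload)
{
	payload = normalize_payload(payload);
	assert(strlen(payload) < PAYLOAD_MAX_LENGTH);
	strcpy(dst, payload);
}
```

In this sketch a 7-character payload is accepted and an 8-character one is rejected, so the buffer always has room for the terminator.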
diff -cr cvs/src/bin/initdb/initdb.c cvs.build/src/bin/initdb/initdb.c
*** cvs/src/bin/initdb/initdb.c	2009-12-09 11:24:34.000000000 +0100
--- cvs.build/src/bin/initdb/initdb.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 2469,2474 ****
--- 2469,2475 ----
  		"pg_xlog",
  		"pg_xlog/archive_status",
  		"pg_clog",
+ 		"pg_notify",
  		"pg_subtrans",
  		"pg_twophase",
  		"pg_multixact/members",
diff -cr cvs/src/bin/psql/common.c cvs.build/src/bin/psql/common.c
*** cvs/src/bin/psql/common.c	2009-12-09 11:24:34.000000000 +0100
--- cvs.build/src/bin/psql/common.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 555,562 ****
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" received from server process with PID %d.\n"),
! 				notify->relname, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
--- 555,562 ----
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" (%s) received from server process with PID %d.\n"),
! 				notify->relname, notify->extra, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
diff -cr cvs/src/bin/psql/tab-complete.c cvs.build/src/bin/psql/tab-complete.c
*** cvs/src/bin/psql/tab-complete.c	2009-12-09 11:24:34.000000000 +0100
--- cvs.build/src/bin/psql/tab-complete.c	2009-12-09 11:26:03.000000000 +0100
***************
*** 2087,2093 ****
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(relname) FROM pg_catalog.pg_listener WHERE substring(pg_catalog.quote_ident(relname),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
--- 2087,2093 ----
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(channel) FROM pg_catalog.pg_listening() AS channel WHERE substring(pg_catalog.quote_ident(channel),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
diff -cr cvs/src/include/access/slru.h cvs.build/src/include/access/slru.h
*** cvs/src/include/access/slru.h	2009-12-09 11:24:53.000000000 +0100
--- cvs.build/src/include/access/slru.h	2009-12-09 11:26:03.000000000 +0100
***************
*** 16,21 ****
--- 16,40 ----
  #include "access/xlogdefs.h"
  #include "storage/lwlock.h"
  
+ /*
+  * Define segment size.  A page is the same BLCKSZ as is used everywhere
+  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
+  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
+  * or 64K transactions for SUBTRANS.
+  *
+  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
+  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
+  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+  * take no explicit notice of that fact in this module, except when comparing
+  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
+  *
+  * Note: this file currently assumes that segment file names will be four
+  * hex digits.	This sets a lower bound on the segment size (64K transactions
+  * for 32-bit TransactionIds).
+  */
+ #define SLRU_PAGES_PER_SEGMENT	32
+ 
  
  /*
   * Page status codes.  Note that these do not include the "dirty" bit.
diff -cr cvs/src/include/catalog/pg_proc.h cvs.build/src/include/catalog/pg_proc.h
*** cvs/src/include/catalog/pg_proc.h	2009-12-09 11:24:52.000000000 +0100
--- cvs.build/src/include/catalog/pg_proc.h	2009-12-09 11:26:03.000000000 +0100
***************
*** 4081,4086 ****
--- 4081,4088 ----
  DESCR("get the prepared statements for this session");
  DATA(insert OID = 2511 (  pg_cursor PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,16,16,16,1184}" "{o,o,o,o,o,o}" "{name,statement,is_holdable,is_binary,is_scrollable,creation_time}" _null_ pg_cursor _null_ _null_ _null_ ));
  DESCR("get the open cursors for this session");
+ DATA(insert OID = 2187 (  pg_listening	PGNSP	PGUID 12 1 10 0 f f f t t s 0 0 25 "" _null_ _null_ _null_ _null_ pg_listening _null_ _null_ _null_ ));
+ DESCR("get the channels that the current backend listens to");
  DATA(insert OID = 2599 (  pg_timezone_abbrevs	PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,1186,16}" "{o,o,o}" "{abbrev,utc_offset,is_dst}" _null_ pg_timezone_abbrevs _null_ _null_ _null_ ));
  DESCR("get the available time zone abbreviations");
  DATA(insert OID = 2856 (  pg_timezone_names		PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,1186,16}" "{o,o,o,o}" "{name,abbrev,utc_offset,is_dst}" _null_ pg_timezone_names _null_ _null_ _null_ ));
diff -cr cvs/src/include/commands/async.h cvs.build/src/include/commands/async.h
*** cvs/src/include/commands/async.h	2009-12-09 11:24:53.000000000 +0100
--- cvs.build/src/include/commands/async.h	2009-12-09 11:26:03.000000000 +0100
***************
*** 13,28 ****
  #ifndef ASYNC_H
  #define ASYNC_H
  
  extern bool Trace_notify;
  
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_Notify(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
--- 13,42 ----
  #ifndef ASYNC_H
  #define ASYNC_H
  
+ /*
+  * How long can a payload string possibly be? Actually it needs to be one
+  * byte less to provide space for the trailing terminating '\0'.
+  */
+ #define NOTIFY_PAYLOAD_MAX_LENGTH	8000
+ 
+ /*
+  * How many page slots do we reserve ?
+  */
+ #define NUM_ASYNC_BUFFERS			4
+ 
  extern bool Trace_notify;
  
+ extern void AsyncShmemInit(void);
+ 
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname, const char *payload);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_NotifyBeforeCommit(void);
! extern void AtCommit_NotifyAfterCommit(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
***************
*** 43,46 ****
--- 57,62 ----
  extern void notify_twophase_postcommit(TransactionId xid, uint16 info,
  						   void *recdata, uint32 len);
  
+ extern Datum pg_listening(PG_FUNCTION_ARGS);
+ 
  #endif   /* ASYNC_H */
diff -cr cvs/src/include/nodes/parsenodes.h cvs.build/src/include/nodes/parsenodes.h
*** cvs/src/include/nodes/parsenodes.h	2009-12-09 11:24:52.000000000 +0100
--- cvs.build/src/include/nodes/parsenodes.h	2009-12-09 11:26:03.000000000 +0100
***************
*** 2070,2075 ****
--- 2070,2076 ----
  {
  	NodeTag		type;
  	char	   *conditionname;	/* condition name to notify */
+ 	char	   *payload;		/* the payload string to be conveyed */
  } NotifyStmt;
  
  /* ----------------------
diff -cr cvs/src/include/storage/lwlock.h cvs.build/src/include/storage/lwlock.h
*** cvs/src/include/storage/lwlock.h	2009-12-09 11:24:53.000000000 +0100
--- cvs.build/src/include/storage/lwlock.h	2009-12-09 11:26:03.000000000 +0100
***************
*** 67,72 ****
--- 67,75 ----
  	AutovacuumLock,
  	AutovacuumScheduleLock,
  	SyncScanLock,
+ 	AsyncCtlLock,
+ 	AsyncQueueLock,
+ 	AsyncCommitOrderLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff -cr cvs/src/include/utils/errcodes.h cvs.build/src/include/utils/errcodes.h
*** cvs/src/include/utils/errcodes.h	2009-12-09 11:24:53.000000000 +0100
--- cvs.build/src/include/utils/errcodes.h	2009-12-09 11:26:03.000000000 +0100
***************
*** 318,323 ****
--- 318,324 ----
  #define ERRCODE_STATEMENT_TOO_COMPLEX		MAKE_SQLSTATE('5','4', '0','0','1')
  #define ERRCODE_TOO_MANY_COLUMNS			MAKE_SQLSTATE('5','4', '0','1','1')
  #define ERRCODE_TOO_MANY_ARGUMENTS			MAKE_SQLSTATE('5','4', '0','2','3')
+ #define ERRCODE_TOO_MANY_ENTRIES			MAKE_SQLSTATE('5','4', '0','3','1')
  
  /* Class 55 - Object Not In Prerequisite State (class borrowed from DB2) */
  #define ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE	MAKE_SQLSTATE('5','5', '0','0','0')
diff -cr cvs/src/test/regress/expected/guc.out cvs.build/src/test/regress/expected/guc.out
*** cvs/src/test/regress/expected/guc.out	2009-12-09 11:24:34.000000000 +0100
--- cvs.build/src/test/regress/expected/guc.out	2009-12-09 11:26:03.000000000 +0100
***************
*** 532,540 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
!   relname  
! -----------
   foo_event
  (1 row)
  
--- 532,540 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
!  pg_listening 
! --------------
   foo_event
  (1 row)
  
***************
*** 571,579 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
!  relname 
! ---------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
--- 571,579 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
!  pg_listening 
! --------------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
diff -cr cvs/src/test/regress/expected/sanity_check.out cvs.build/src/test/regress/expected/sanity_check.out
*** cvs/src/test/regress/expected/sanity_check.out	2009-12-09 11:24:34.000000000 +0100
--- cvs.build/src/test/regress/expected/sanity_check.out	2009-12-09 11:26:03.000000000 +0100
***************
*** 106,112 ****
   pg_inherits             | t
   pg_language             | t
   pg_largeobject          | t
-  pg_listener             | f
   pg_namespace            | t
   pg_opclass              | t
   pg_operator             | t
--- 106,111 ----
***************
*** 153,159 ****
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (142 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
--- 152,158 ----
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (141 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
diff -cr cvs/src/test/regress/sql/guc.sql cvs.build/src/test/regress/sql/guc.sql
*** cvs/src/test/regress/sql/guc.sql	2009-12-09 11:24:33.000000000 +0100
--- cvs.build/src/test/regress/sql/guc.sql	2009-12-09 11:26:03.000000000 +0100
***************
*** 165,171 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 165,171 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
***************
*** 174,180 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 174,180 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
#43Robert Haas
robertmhaas@gmail.com
In reply to: Joachim Wieland (#42)
Re: Listen / Notify - what to do when the queue is full

On Wed, Dec 9, 2009 at 5:43 AM, Joachim Wieland <joe@mcknight.de> wrote:

Hi,

On Mon, Dec 7, 2009 at 5:38 AM, Greg Smith <greg@2ndquadrant.com> wrote:

I'm going to mark this one "returned with feedback", and I
do hope that work continues on this patch so it's early in the queue for the
final CommitFest for this version.  It seems like it just needs a bit more
time to let the design mature and get the kinks worked out and it could turn
into a useful feature--I know I've wanted NOTIFY with a payload for years.

I am perfectly fine with postponing the patch to the next commitfest. To get
some more feedback and to allow everyone to play with it, I am attaching the
latest version of the patch.

What has changed:

Transactional processing is now hopefully correct:

Examples:

Backend 1:                    Backend 2:

transaction starts
NOTIFY foo;
commit starts
                             transaction starts
                             LISTEN foo;
                             commit starts
                             commit to clog
commit to clog

=> Backend 2 will receive Backend 1's notification.

Backend 1:                    Backend 2:

transaction starts
NOTIFY foo;
commit starts
                             transaction starts
                             UNLISTEN foo;
                             commit starts
                             commit to clog
commit to clog

=> Backend 2 will not receive Backend 1's notification.

This is done by introducing an additional "AsyncCommitOrderLock". It is grabbed
in exclusive mode by transactions that execute LISTEN / UNLISTEN and in shared
mode by transactions that execute NOTIFY only. LISTEN/UNLISTEN transactions then
register the XIDs of the NOTIFYing transactions that are about to commit
at the same time in order to later find out which notifications are visible and
which ones are not.
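
To make the two examples above concrete, here is a minimal sketch of the
visibility rule they imply. It is not code from the patch: the commit sequence
numbers and the names notification_visible and NO_COMMIT are assumptions made
for illustration, and the model covers a single LISTEN ... UNLISTEN interval
only.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: every commit to clog gets an increasing sequence
 * number starting at 1.  NO_COMMIT means "this event has not happened".
 */
#define NO_COMMIT 0

/*
 * A listener sees a notification if the notifier committed to clog after
 * the listener's LISTEN committed and before its UNLISTEN committed.
 */
bool
notification_visible(unsigned listen_commit, unsigned unlisten_commit,
					 unsigned notify_commit)
{
	/* an UNLISTEN, if present, must come after the LISTEN it cancels */
	assert(unlisten_commit == NO_COMMIT || unlisten_commit > listen_commit);

	if (listen_commit == NO_COMMIT || notify_commit == NO_COMMIT)
		return false;			/* never listened, or notifier rolled back */
	if (notify_commit < listen_commit)
		return false;			/* notifier committed before the LISTEN */
	if (unlisten_commit != NO_COMMIT && notify_commit > unlisten_commit)
		return false;			/* notifier committed after the UNLISTEN */
	return true;
}
```

In the first example the LISTEN commits at 2 and the NOTIFY at 3, so the
notification is visible. In the second one the backend was already listening
(say since 1), the UNLISTEN commits at 2 and the NOTIFY at 3, so it is not.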

If the queue is full, any other transaction that is trying to place a
notification into the queue is rolled back! This is basically a consequence of
the former. Two warnings will show up in the log: one once the queue is more
than 50% full and another once it is more than 75% full. The biggest risk of
running into a full queue probably comes from backends that are LISTENing and
are idle in transaction.
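
A rough sketch of how such a fill-level check could look, assuming the queue
is a circular range of pages between a tail and a head page. The names
pages_in_use, queue_fill_warning and the fill_warning enum are made up for
this illustration and are not taken from the patch.

```c
#include <assert.h>

/* Hypothetical warning levels matching the 50% and 75% thresholds. */
enum fill_warning
{
	FILL_OK = 0,
	FILL_WARN_50 = 1,
	FILL_WARN_75 = 2
};

/* Number of pages between tail and head in a circular page space. */
long long
pages_in_use(long long head_page, long long tail_page, long long max_pages)
{
	assert(max_pages > 0);
	/* the extra "+ max_pages" keeps the result non-negative after wraparound */
	return ((head_page - tail_page) % max_pages + max_pages) % max_pages;
}

/* Decide which warning, if any, applies to the current fill level. */
enum fill_warning
queue_fill_warning(long long head_page, long long tail_page,
				   long long max_pages)
{
	long long	used = pages_in_use(head_page, tail_page, max_pages);

	if (used * 4 > max_pages * 3)
		return FILL_WARN_75;	/* more than 75% full */
	if (used * 2 > max_pages)
		return FILL_WARN_50;	/* more than 50% full */
	return FILL_OK;
}
```

The arithmetic uses long long so that the multiplications cannot overflow
even with INT_MAX pages.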

I have added a function pg_listening() which just returns the names of the
channels that a backend is listening to.

I especially invite people who know more about the transactional stuff than I
do to take a close look at what I have done regarding notification visibility.

One open question regarding the payload is whether we need to limit it to ASCII
to avoid conversion issues between different backend character sets?
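
If we do restrict it, the check itself would be cheap. Below is a minimal
sketch, assuming we combine it with the existing length limit
(NOTIFY_PAYLOAD_MAX_LENGTH is from the patch; the function name
payload_is_acceptable is hypothetical, and the real code would report the
failure through ereport/elog rather than return a flag).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NOTIFY_PAYLOAD_MAX_LENGTH	8000

/*
 * Return true if the payload fits into the queue entry (leaving room for
 * the terminating '\0') and consists of 7-bit ASCII bytes only.
 */
bool
payload_is_acceptable(const char *payload)
{
	const unsigned char *p;

	if (payload == NULL)
		return true;			/* NOTIFY without a payload */
	if (strlen(payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1)
		return false;			/* too long */
	for (p = (const unsigned char *) payload; *p != '\0'; p++)
	{
		if (*p > 127)
			return false;		/* byte outside the ASCII range */
	}
	return true;
}
```

The cast to unsigned char matters on platforms where plain char is signed,
since bytes above 127 would otherwise compare as negative values.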

The second open issue is what we should do regarding 2PC. These options have
been brought up so far:

 1) allow NOTIFY in 2PC but it can happen that the transaction needs to be
    rolled back if the queue is full
 2) disallow NOTIFY in 2PC altogether
 3) put notifications to the queue on PREPARE TRANSACTION and make backends not
    advance their pointers further than those notifications but wait for the
    2PC transaction to commit. 2PC transactions would never fail but you
    effectively stop the notification system until the 2PC transaction commits.
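
To illustrate what option 3 would mean for readers, here is a small sketch.
QueuePosition mirrors the struct in the patch, while queue_pos,
position_precedes and reader_limit are hypothetical helpers, and page
wraparound is ignored for simplicity.

```c
#include <assert.h>
#include <stdbool.h>

/* A position in the queue: a page number and an offset in that page. */
typedef struct QueuePosition
{
	int			page;
	int			offset;
} QueuePosition;

/* Convenience constructor for a position. */
QueuePosition
queue_pos(int page, int offset)
{
	QueuePosition pos;

	pos.page = page;
	pos.offset = offset;
	return pos;
}

/* True if a comes strictly before b (no wraparound handling here). */
bool
position_precedes(QueuePosition a, QueuePosition b)
{
	if (a.page != b.page)
		return a.page < b.page;
	return a.offset < b.offset;
}

/*
 * How far may a reader advance?  Normally up to the head, but not past the
 * notifications of the oldest prepared (not yet committed) 2PC transaction.
 */
QueuePosition
reader_limit(QueuePosition head, bool have_prepared,
			 QueuePosition oldest_prepared)
{
	if (have_prepared && position_precedes(oldest_prepared, head))
		return oldest_prepared;
	return head;
}
```

This also shows the drawback mentioned above: as long as have_prepared stays
true, no reader gets past oldest_prepared, so the tail cannot advance either.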

Comments?

Joachim - This no longer applies - please rebase, repost, and add a
link to the new version to the commitfest app.

Jeff - Do you want to continue reviewing this for the next CommitFest,
or hand off to another reviewer?

Thanks,

...Robert

#44Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#43)
Re: Listen / Notify - what to do when the queue is full

On Thu, 2010-01-07 at 13:40 -0500, Robert Haas wrote:

Joachim - This no longer applies - please rebase, repost, and add a
link to the new version to the commitfest app.

Jeff - Do you want to continue reviewing this for the next CommitFest,
or hand off to another reviewer?

Sure, I'll review it.

Regards,
Jeff Davis

#45Joachim Wieland
joe@mcknight.de
In reply to: Robert Haas (#43)
1 attachment(s)
Re: Listen / Notify - what to do when the queue is full

On Thu, Jan 7, 2010 at 7:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Joachim - This no longer applies - please rebase, repost, and add a
link to the new version to the commitfest app.

Updated patch attached.

From my point of view these are the current open questions:

- is the general approach reasonable (i.e. to put notifications into
an slru-based queue instead of into shared memory to allow for a
better handling of notification bursts)

- is the transactional behavior correct (see upthread)

- do we need to limit the payload to pure ASCII ? I think yes, we need
to. I also think we need to reject other payloads with elog(ERROR...).

- how to deal with 2PC (see upthread) ?

- how to deal with hot standby (has not been discussed yet)

Joachim

Attachments:

listennotify.7.diff (text/x-diff; charset=US-ASCII)
diff -cr cvs.head/src/backend/access/transam/slru.c cvs.build/src/backend/access/transam/slru.c
*** cvs.head/src/backend/access/transam/slru.c	2010-01-05 12:39:22.000000000 +0100
--- cvs.build/src/backend/access/transam/slru.c	2010-01-08 11:00:55.000000000 +0100
***************
*** 58,83 ****
  #include "storage/shmem.h"
  #include "miscadmin.h"
  
- 
- /*
-  * Define segment size.  A page is the same BLCKSZ as is used everywhere
-  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
-  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
-  * or 64K transactions for SUBTRANS.
-  *
-  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
-  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
-  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
-  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
-  * take no explicit notice of that fact in this module, except when comparing
-  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
-  *
-  * Note: this file currently assumes that segment file names will be four
-  * hex digits.	This sets a lower bound on the segment size (64K transactions
-  * for 32-bit TransactionIds).
-  */
- #define SLRU_PAGES_PER_SEGMENT	32
- 
  #define SlruFileName(ctl, path, seg) \
  	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
  
--- 58,63 ----
diff -cr cvs.head/src/backend/access/transam/xact.c cvs.build/src/backend/access/transam/xact.c
*** cvs.head/src/backend/access/transam/xact.c	2010-01-05 12:39:22.000000000 +0100
--- cvs.build/src/backend/access/transam/xact.c	2010-01-08 11:00:55.000000000 +0100
***************
*** 1729,1737 ****
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* NOTIFY commit must come before lower-level cleanup */
! 	AtCommit_Notify();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
  
--- 1729,1738 ----
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* Insert notifications sent by the NOTIFY command into the queue */
! 	AtCommit_NotifyBeforeCommit();
  
+ 	Assert(s->state == TRANS_INPROGRESS);
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
  
***************
*** 1805,1810 ****
--- 1806,1816 ----
  
  	AtEOXact_MultiXact();
  
+ 	/*
+ 	 * Clean up Notify buffers and signal listening backends.
+ 	 */
+ 	AtCommit_NotifyAfterCommit();
+ 
  	ResourceOwnerRelease(TopTransactionResourceOwner,
  						 RESOURCE_RELEASE_LOCKS,
  						 true, true);
diff -cr cvs.head/src/backend/catalog/Makefile cvs.build/src/backend/catalog/Makefile
*** cvs.head/src/backend/catalog/Makefile	2010-01-06 22:30:05.000000000 +0100
--- cvs.build/src/backend/catalog/Makefile	2010-01-08 11:01:53.000000000 +0100
***************
*** 30,36 ****
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
! 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_listener.h pg_description.h \
  	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
--- 30,36 ----
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
! 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_description.h \
  	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
diff -cr cvs.head/src/backend/commands/async.c cvs.build/src/backend/commands/async.c
*** cvs.head/src/backend/commands/async.c	2010-01-05 12:39:22.000000000 +0100
--- cvs.build/src/backend/commands/async.c	2010-01-08 11:00:55.000000000 +0100
***************
*** 14,44 ****
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  * 1. Multiple backends on same machine.  Multiple backends listening on
!  *	  one relation.  (Note: "listening on a relation" is not really the
!  *	  right way to think about it, since the notify names need not have
!  *	  anything to do with the names of relations actually in the database.
!  *	  But this terminology is all over the code and docs, and I don't feel
!  *	  like trying to replace it.)
!  *
!  * 2. There is a tuple in relation "pg_listener" for each active LISTEN,
!  *	  ie, each relname/listenerPID pair.  The "notification" field of the
!  *	  tuple is zero when no NOTIFY is pending for that listener, or the PID
!  *	  of the originating backend when a cross-backend NOTIFY is pending.
!  *	  (We skip writing to pg_listener when doing a self-NOTIFY, so the
!  *	  notification field should never be equal to the listenerPID field.)
!  *
!  * 3. The NOTIFY statement itself (routine Async_Notify) just adds the target
!  *	  relname to a list of outstanding NOTIFY requests.  Actual processing
!  *	  happens if and only if we reach transaction commit.  At that time (in
!  *	  routine AtCommit_Notify) we scan pg_listener for matching relnames.
!  *	  If the listenerPID in a matching tuple is ours, we just send a notify
!  *	  message to our own front end.  If it is not ours, and "notification"
!  *	  is not already nonzero, we set notification to our own PID and send a
!  *	  PROCSIG_NOTIFY_INTERRUPT signal to the receiving process (indicated by
!  *	  listenerPID).
!  *	  BTW: if the signal operation fails, we presume that the listener backend
!  *	  crashed without removing this tuple, and remove the tuple for it.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
--- 14,79 ----
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  *
!  * 1. Multiple backends on same machine. Multiple backends listening on
!  *	  several channels. (This was previously called a "relation" even though it
!  *	  is just an identifier and has nothing to do with a database relation.)
!  *
!  * 2. There is one central queue in the form of Slru backed file based storage
!  *    (directory pg_notify/), with several pages mapped into shared memory.
!  *
!  *    There is no central storage of which backend listens on which channel,
!  *    every backend has its own list.
!  *
!  *    Every backend that is listening on at least one channel registers by
!  *    entering its Pid into the array of all backends. It then scans all
!  *    incoming notifications and compares the notified channels with its list.
!  *
!  *    In case there is a match it delivers the corresponding notification to
!  *    its frontend.
!  *
!  * 3. The NOTIFY statement (routine Async_Notify) stores the notification
!  *    in a list which will not be processed until at transaction end. Every
!  *    notification can additionally send a "payload" which is an extra text
!  *    parameter to convey arbitrary information to the recipient.
!  *
!  *    Duplicate notifications from the same transaction are sent out as one
!  *    notification only. This is done to save work when for example a trigger
!  *    on a 2 million row table fires a notification for each row that has been
!  *    changed. If the applications needs to receive every single notification
!  *    that has been sent, it can easily add some unique string into the extra
!  *    payload parameter.
!  *
!  *    Once the transaction commits, AtCommit_NotifyBeforeCommit() performs the
!  *    required changes regarding listeners (Listen/Unlisten) and then adds the
!  *    pending notifications to the head of the queue. The head pointer of the
!  *    queue always points to the next free position and a position is just a
!  *    page number and the offset in that page. This is done before marking the
!  *    transaction as committed in clog. If we run into problems writing the
!  *    notifications, we can still call elog(ERROR, ...) and the transaction
!  *    will roll back.
!  *
!  *    Once we have put all of the notifications into the queue, we return to
!  *    CommitTransaction() which will then commit to clog.
!  *
!  *    In order to ensure transactional processing there is AsyncCommitOrderLock
!  *    that has to be grabbed exclusively by all notifications that send NOTIFYs
!  *    do LISTENs and UNLISTENs but only for the time when those transactions
!  *    commit to clog. For example, one issue here is that a transaction sending
!  *    notifications could store them into the list while another transaction
!  *    that started later does a LISTEN on a channel and commits. Then it has to
!  *    see the notifications of the longer running transaction, because it
!  *    committed earlier (even though it started later).
!  *
!  *    After clog commit we are called another time
!  *    (AtCommit_NotifyAfterCommit()). If we have executed either LISTEN or
!  *    UNLISTEN, we register any running notifying transactions and release the
!  *    lock to be able to solve exactly the problem described above. We then
!  *    check if we need to signal the backends. In SignalBackends() we scan the
!  *    list of listening backends and send a PROCSIG_NOTIFY_INTERRUPT to every
!  *    backend that has set its Pid (we don't know which backend is listening on
!  *    which channel so we need to send a signal to every listening backend). We
!  *    can exclude backends that are already up to date.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
***************
*** 46,97 ****
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  * 5. Inbound-notify processing consists of scanning pg_listener for tuples
!  *	  matching our own listenerPID and having nonzero notification fields.
!  *	  For each such tuple, we send a message to our frontend and clear the
!  *	  notification field.  BTW: this routine has to start/commit its own
!  *	  transaction, since by assumption it is only called from outside any
!  *	  transaction.
!  *
!  * Like NOTIFY, LISTEN and UNLISTEN just add the desired action to a list
!  * of pending actions.	If we reach transaction commit, the changes are
!  * applied to pg_listener just before executing any pending NOTIFYs.  This
!  * method is necessary because to avoid race conditions, we must hold lock
!  * on pg_listener from when we insert a new listener tuple until we commit.
!  * To do that and not create undue hazard of deadlock, we don't want to
!  * touch pg_listener until we are otherwise done with the transaction;
!  * in particular it'd be uncool to still be taking user-commanded locks
!  * while holding the pg_listener lock.
!  *
!  * Although we grab ExclusiveLock on pg_listener for any operation,
!  * the lock is never held very long, so it shouldn't cause too much of
!  * a performance problem.  (Previously we used AccessExclusiveLock, but
!  * there's no real reason to forbid concurrent reads.)
   *
!  * An application that listens on the same relname it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * PID.  (As of FE/BE protocol 2.0, the backend's PID is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.  Note,
!  * however, that we do *not* guarantee that a separate frontend message will
!  * be sent for every outside NOTIFY.  Since there is only room for one
!  * originating PID in pg_listener, outside notifies occurring at about the
!  * same time may be collapsed into a single message bearing the PID of the
!  * first outside backend to perform the NOTIFY.
   *-------------------------------------------------------------------------
   */
  
  #include "postgres.h"
  
  #include <unistd.h>
  #include <signal.h>
  
  #include "access/heapam.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
! #include "catalog/pg_listener.h"
  #include "commands/async.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
--- 81,127 ----
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  * 5. Inbound-notify processing consists of reading all of the notifications
!  *	  that have arrived since scanning last time. We read every notification
!  *	  until we reach the head pointer's position. Then we check if we were the
!  *	  laziest backend: if our pointer is set to the same position as the global
!  *	  tail pointer is set, then we set it further to the second-laziest
!  *	  backend (We can identify it by inspecting the positions of all other
!  *	  backends' pointers). Whenever we move the tail pointer we also truncate
!  *	  now unused pages (i.e. delete files in pg_notify/ that are no longer
!  *	  used).
   *
!  * An application that listens on the same channel it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * Pid.  (As of FE/BE protocol 2.0, the backend's Pid is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.
   *-------------------------------------------------------------------------
   */
  
+ /* XXX 
+  *
+  * TODO:
+  *  - guc parameter max_notifies_per_txn ??
+  *  - adapt comments
+  *  - 2PC
+  *  - limit to ASCII?
+  */
+ 
  #include "postgres.h"
  
  #include <unistd.h>
  #include <signal.h>
  
  #include "access/heapam.h"
+ #include "access/slru.h"
+ #include "access/transam.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
! #include "catalog/pg_type.h"
  #include "commands/async.h"
+ #include "funcapi.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
***************
*** 108,115 ****
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction.  As explained above,
!  * we don't actually modify pg_listener until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
--- 138,145 ----
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction. As explained above,
!  * we don't actually send notifications until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
***************
*** 134,140 ****
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all relnames NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
--- 164,170 ----
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all channels NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
***************
*** 149,160 ****
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
- static List *pendingNotifies = NIL;		/* list of C strings */
  
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
  
  /*
!  * State for inbound notifies consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
--- 179,321 ----
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
  
+ typedef struct QueuePosition
+ {
+ 	int				page;
+ 	int				offset;
+ } QueuePosition;
+ 
+ typedef struct Notification
+ {
+ 	char		   *channel;
+ 	char		   *payload;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ 	QueuePosition	position;
+ } Notification;
+ 
+ typedef struct AsyncQueueEntry
+ {
+ 	/*
+ 	 * this record has the maximal length, but usually we limit it to
+ 	 * AsyncQueueEntryEmptySize + strlen(payload).
+ 	 */
+ 	Size			length;
+ 	Oid				dboid;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ 	char			channel[NAMEDATALEN];
+ 	char			payload[NOTIFY_PAYLOAD_MAX_LENGTH];
+ } AsyncQueueEntry;
+ #define AsyncQueueEntryEmptySize \
+ 	 (sizeof(AsyncQueueEntry) - NOTIFY_PAYLOAD_MAX_LENGTH + 1)
+ 
+ #define	InvalidPid				(-1)
+ #define QUEUE_POS_PAGE(x)		((x).page)
+ #define QUEUE_POS_OFFSET(x)		((x).offset)
+ #define QUEUE_POS_EQUAL(x,y) \
+ 	 ((x).page == (y).page ? (x).offset == (y).offset : false)
+ #define SET_QUEUE_POS(x,y,z) \
+ 	do { \
+ 		(x).page = (y); \
+ 		(x).offset = (z); \
+ 	} while (0)
+ /* the logically earlier of positions x and y, with z = HEAD */
+ #define QUEUE_POS_MIN(x,y,z) \
+ 	(asyncQueuePagePrecedesLogically((x).page, (y).page, (z).page) ? (x) : \
+ 		 asyncQueuePagePrecedesLogically((y).page, (x).page, (z).page) ? (y) : \
+ 			 (x).offset < (y).offset ? (x) : \
+ 			 	(y))
+ #define QUEUE_POS_LT(x,y,z) \
+ 		(QUEUE_POS_EQUAL(x,QUEUE_POS_MIN(x,y,z)) && !QUEUE_POS_EQUAL(x,y))
+ #define QUEUE_POS_LE(x,y,z) \
+ 		(QUEUE_POS_EQUAL(x,QUEUE_POS_MIN(x,y,z)))
+ #define QUEUE_BACKEND_POS(i)		asyncQueueControl->backend[(i)].pos
+ #define QUEUE_BACKEND_PID(i)		asyncQueueControl->backend[(i)].pid
+ #define QUEUE_BACKEND_XID(i)		asyncQueueControl->backend[(i)].xid
+ #define QUEUE_HEAD					asyncQueueControl->head
+ #define QUEUE_TAIL					asyncQueueControl->tail
+ 
+ typedef struct QueueBackendStatus
+ {
+ 	int32			pid;
+ 	QueuePosition	pos;
+ 	TransactionId	xid;  /* this is protected by AsyncCommitOrderLock:
+ 							 no lock is required to read our own entry,
+ 							 LW_SHARED is sufficient to write our own entry,
+ 							 LW_EXCLUSIVE is needed to read everybody else's. */
+ } QueueBackendStatus;
+ 
+ /*
+  * The AsyncQueueControl structure is protected by the AsyncQueueLock.
+  *
+  * In SHARED mode, backends will only inspect their own entries as well as
+  * head and tail pointers. Consequently we can allow a backend to update its
+  * own record while holding only a shared lock (since no other backend will
+  * inspect it).
+  *
+  * In EXCLUSIVE mode, backends can inspect the entries of other backends and
+  * also change head and tail pointers.
+  *
+  * In order to avoid deadlocks, whenever we need both locks, we always first
+  * get AsyncQueueLock and then AsyncCtlLock.
+  */
+ typedef struct AsyncQueueControl
+ {
+ 	QueuePosition		head;		/* head points to the next free location */
+ 	QueuePosition 		tail;		/* the global tail is equivalent to the
+ 									   tail of the "slowest" backend */
+ 	TimestampTz			lastQueueFillWarn;	/* when the queue is full we only
+ 											   want to log that once in a
+ 											   while */
+ 	int					listeningBackends;
+ 	QueueBackendStatus	backend[1];	/* actually this one has as many entries as
+ 									 * connections are allowed (MaxBackends) */
+ 	/* DO NOT ADD FURTHER STRUCT MEMBERS HERE */
+ } AsyncQueueControl;
+ 
+ static AsyncQueueControl   *asyncQueueControl;
+ static SlruCtlData			AsyncCtlData;
+ 
+ #define AsyncCtl					(&AsyncCtlData)
+ #define QUEUE_PAGESIZE				BLCKSZ
+ #define QUEUE_FULL_WARN_INTERVAL	5000	/* warn at most once every 5s */
+ 
+ /*
+  * slru.c currently assumes that all filenames are four characters of hex
+  * digits. That means that we can use segments 0000 through FFFF.
+  * Each segment contains SLRU_PAGES_PER_SEGMENT pages which gives us
+  * the pages from 0 to SLRU_PAGES_PER_SEGMENT * 0xFFFF.
+  *
+  * It's of course easy to enhance slru.c but those pages give us so much
+  * space already that it doesn't seem worth the trouble...
+  *
+  * It's a legal test case to define QUEUE_MAX_PAGE as a very small multiple of
+  * SLRU_PAGES_PER_SEGMENT to test queue-full behaviour.
+  */
+ #define QUEUE_MAX_PAGE			(SLRU_PAGES_PER_SEGMENT * 0xFFFF)
+ 
+ static List *pendingNotifies = NIL;				/* list of Notifications */
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
+ static List *listenChannels = NIL;	/* list of channels we are listening to */
+ 
+ static bool backendSendsNotifications = false;
+ static bool backendExecutesListen = false;
+ static bool backendExecutesUnlisten = false; 
+ 
+ typedef struct
+ {
+ 	char			   *channel;
+ 	List			   *xids;
+ 	QueuePosition		limitPos;
+ 	ListenActionKind	kind;
+ } ActionPhaseInOut;
+ 
+ static List *ActionPhaseInOutList = NIL;
  
  /*
!  * State for inbound notifications consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
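
A note on the entry layout in the hunk above: the following is a minimal sketch of how the AsyncQueueEntryEmptySize arithmetic works out. It assumes NOTIFY_PAYLOAD_MAX_LENGTH is 8000 (its definition is not part of this excerpt) and uses <stdint.h> types as stand-ins for Size, Oid, TransactionId and int32. The Sketch* and sketch_* names are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NAMEDATALEN					64		/* PostgreSQL's default */
#define NOTIFY_PAYLOAD_MAX_LENGTH	8000	/* assumed, not shown in this excerpt */

/* Stand-in for AsyncQueueEntry with plain C types. */
typedef struct SketchQueueEntry
{
	size_t		length;
	uint32_t	dboid;
	uint32_t	xid;
	int32_t		srcPid;
	char		channel[NAMEDATALEN];
	char		payload[NOTIFY_PAYLOAD_MAX_LENGTH];
} SketchQueueEntry;

/* Size of an entry with an empty payload: everything but the payload
 * buffer, plus one byte for the terminating NUL. */
#define SKETCH_ENTRY_EMPTY_SIZE \
	(sizeof(SketchQueueEntry) - NOTIFY_PAYLOAD_MAX_LENGTH + 1)

/* Number of bytes an entry with this payload would occupy in the queue. */
static size_t
sketch_entry_length(const char *payload)
{
	size_t		len = strlen(payload);

	assert(len < NOTIFY_PAYLOAD_MAX_LENGTH);
	return SKETCH_ENTRY_EMPTY_SIZE + len;
}
```

Because sizeof() includes any trailing struct padding, the computed length can be a few bytes larger than strictly necessary; it is never smaller than the header plus the NUL-terminated payload.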
***************
*** 171,207 ****
  
  bool		Trace_notify = false;
  
- 
  static void queue_listen(ListenActionKind action, const char *condname);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static void Exec_Listen(Relation lRel, const char *relname);
! static void Exec_Unlisten(Relation lRel, const char *relname);
! static void Exec_UnlistenAll(Relation lRel);
! static void Send_Notify(Relation lRel);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(char *relname, int32 listenerPID);
! static bool AsyncExistsPendingNotify(const char *relname);
  static void ClearPendingActionsAndNotifies(void);
  
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the relation to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", relname);
  
  	/* no point in making duplicate entries in the list ... */
! 	if (!AsyncExistsPendingNotify(relname))
  	{
  		/*
  		 * The name list needs to live until end of transaction, so store it
  		 * in the transaction context.
--- 332,484 ----
  
  bool		Trace_notify = false;
  
  static void queue_listen(ListenActionKind action, const char *condname);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static bool IsListeningOn(const char *channel);
! static bool IsInListenChannels(const char *channel);
! static void asyncQueuePhaseInOut(ListenActionKind kind, const char *channel,
! 								 QueuePosition limitPos);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
! static void Exec_Listen(const char *channel);
! static void Exec_Unlisten(const char *channel);
! static void Exec_UnlistenAll(void);
! static void SignalBackends(void);
! static void Send_Notify(void);
! static bool asyncQueuePagePrecedesPhysically(int p, int q);
! static bool asyncQueuePagePrecedesLogically(int p, int q, int head);
! static bool asyncQueueAdvance(QueuePosition *position, int entryLength);
! static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
! static void asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n,
! 										  QueuePosition pos);
! static List *asyncQueueAddEntries(List *notifications);
! static bool asyncQueueGetEntriesByPage(QueuePosition *current,
! 									   QueuePosition stop,
! 									   List **notifications);
! static void asyncQueueReadAllNotifications(void);
! static void asyncQueueAdvanceTail(void);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(const char *channel,
! 							 const char *payload,
! 							 int32 srcPid);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
  static void ClearPendingActionsAndNotifies(void);
  
+ /*
+  * We will work on the page range of 0..(SLRU_PAGES_PER_SEGMENT * 0xFFFF).
+  * asyncQueuePagePrecedesPhysically just checks numerically without any magic if
+  * one page precedes another one.
+  *
+  * On the other hand, when asyncQueuePagePrecedesLogically does that check, it
+  * takes the current head page number into account. If we have wrapped
+  * around, it can happen that p precedes q, even though p > q (if the head page
+  * is in between the two).
+  */ 
+ static bool
+ asyncQueuePagePrecedesPhysically(int p, int q)
+ {
+ 	return p < q;
+ }
+ 
+ static bool
+ asyncQueuePagePrecedesLogically(int p, int q, int head)
+ {
+ 	if (p <= head && q <= head)
+ 		return p < q;
+ 	if (p > head && q > head)
+ 		return p < q;
+ 	if (p <= head)
+ 	{
+ 		Assert(q > head);
+ 		/* q is older */
+ 		return false;
+ 	}
+ 	else
+ 	{
+ 		Assert(p > head && q <= head);
+ 		/* p is older */
+ 		return true;
+ 	}
+ }
+ 
+ void
+ AsyncShmemInit(void)
+ {
+ 	bool	found;
+ 	int		slotno;
+ 	Size	size;
+ 
+ 	/*
+ 	 * Remember that sizeof(AsyncQueueControl) already contains one member of
+ 	 * QueueBackendStatus, so we only need to add the status space requirement
+ 	 * for MaxBackends-1 backends.
+ 	 */
+ 	size = mul_size(MaxBackends-1, sizeof(QueueBackendStatus));
+ 	size = add_size(size, sizeof(AsyncQueueControl));
+ 
+ 	asyncQueueControl = (AsyncQueueControl *)
+ 		ShmemInitStruct("Async Queue Control", size, &found);
+ 
+ 	if (!asyncQueueControl)
+ 		elog(ERROR, "out of memory");
+ 
+ 	if (!found)
+ 	{
+ 		int		i;
+ 		SET_QUEUE_POS(QUEUE_HEAD, 0, 0);
+ 		SET_QUEUE_POS(QUEUE_TAIL, 0, 0);
+ 		for (i = 0; i < MaxBackends; i++)
+ 		{
+ 			SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ 			QUEUE_BACKEND_PID(i) = InvalidPid;
+ 			QUEUE_BACKEND_XID(i) = InvalidTransactionId;
+ 		}
+ 	}
+ 
+ 	AsyncCtl->PagePrecedes = asyncQueuePagePrecedesPhysically;
+ 	SimpleLruInit(AsyncCtl, "Async Ctl", NUM_ASYNC_BUFFERS, 0,
+ 				  AsyncCtlLock, "pg_notify");
+ 	AsyncCtl->do_fsync = false;
+ 	asyncQueueControl->lastQueueFillWarn = GetCurrentTimestamp();
+ 
+ 	if (!found)
+ 	{
+ 		LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ 		LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
+ 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
+ 		AsyncCtl->shared->page_dirty[slotno] = true;
+ 		SimpleLruWritePage(AsyncCtl, slotno, NULL);
+ 		LWLockRelease(AsyncCtlLock);
+ 		LWLockRelease(AsyncQueueLock);
+ 
+ 		SimpleLruTruncate(AsyncCtl, 0);
+ 	}
+ }
+ 
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the channel to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *channel, const char *payload)
  {
+ 
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", channel);
! 
! 	/*
! 	 * XXX - do we now need a guc parameter max_notifies_per_txn?
! 	 */ 
  
  	/* no point in making duplicate entries in the list ... */
! 	if (!AsyncExistsPendingNotify(channel, payload))
  	{
+ 		Notification *n;
  		/*
  		 * The name list needs to live until end of transaction, so store it
  		 * in the transaction context.
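
To illustrate the wraparound comparison that asyncQueuePagePrecedesLogically() performs in the hunk above, here is a standalone sketch of the same rule. The function name page_precedes_logically is hypothetical; the logic is intended to match the patch, condensed into one comparison of which "side" of the head each page lies on.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Pages at or below the head were written in the current pass over the
 * page range. Pages above the head are leftovers from before the queue
 * wrapped around, so they are older than anything at or below the head.
 */
static bool
page_precedes_logically(int p, int q, int head)
{
	bool		p_current = (p <= head);
	bool		q_current = (q <= head);

	assert(p >= 0 && q >= 0 && head >= 0);

	/* Same side of the head: plain numeric order applies. */
	if (p_current == q_current)
		return p < q;

	/* Different sides: the page above the head is the older one. */
	return !p_current;
}
```

With head = 3, page 10 precedes page 2 even though 10 > 2, which is the case the comment in the patch describes.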
***************
*** 210,221 ****
  
  		oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  		/*
! 		 * Ordering of the list isn't important.  We choose to put new entries
! 		 * on the front, as this might make duplicate-elimination a tad faster
! 		 * when the same condition is signaled many times in a row.
  		 */
! 		pendingNotifies = lcons(pstrdup(relname), pendingNotifies);
  
  		MemoryContextSwitchTo(oldcontext);
  	}
--- 487,509 ----
  
  		oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
+ 		n = (Notification *) palloc(sizeof(Notification));
+ 		n->channel = pstrdup(channel);
+ 		if (payload)
+ 			n->payload = pstrdup(payload);
+ 		else
+ 			n->payload = "";
+ 		/* will set the xid and the srcPid later... */
+ 		n->xid = InvalidTransactionId;
+ 		n->srcPid = InvalidPid;
+ 		/* we don't set n->position here. It is unknown and we won't do anything
+ 		 * with it at this point anyway. */
+ 
  		/*
! 		 * We want to preserve the order so we need to append every
! 		 * notification. See comments at AsyncExistsPendingNotify().
  		 */
! 		pendingNotifies = lappend(pendingNotifies, n);
  
  		MemoryContextSwitchTo(oldcontext);
  	}
***************
*** 259,270 ****
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", relname, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, relname);
  }
  
  /*
--- 547,558 ----
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", channel, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, channel);
  }
  
  /*
***************
*** 273,288 ****
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", relname, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, relname);
  }
  
  /*
--- 561,576 ----
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", channel, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, channel);
  }
  
  /*
***************
*** 306,313 ****
  /*
   * Async_UnlistenOnExit
   *
-  *		Clean up the pg_listener table at backend exit.
-  *
   *		This is executed if we have done any LISTENs in this backend.
   *		It might not be necessary anymore, if the user UNLISTENed everything,
   *		but we don't try to detect that case.
--- 594,599 ----
***************
*** 315,331 ****
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
  	/*
! 	 * We need to start/commit a transaction for the unlisten, but if there is
! 	 * already an active transaction we had better abort that one first.
! 	 * Otherwise we'd end up committing changes that probably ought to be
! 	 * discarded.
  	 */
! 	AbortOutOfAnyTransaction();
! 	/* Now we can do the unlisten */
! 	StartTransactionCommand();
! 	Async_UnlistenAll();
! 	CommitTransactionCommand();
  }
  
  /*
--- 601,625 ----
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
+ 	bool	advanceTail = false;
+ 
+ 	AbortOutOfAnyTransaction();
+ 
+ 	LWLockAcquire(AsyncCommitOrderLock, LW_SHARED);
+ 	QUEUE_BACKEND_XID(MyBackendId) = InvalidTransactionId;
+ 	LWLockRelease(AsyncCommitOrderLock);
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
  	/*
! 	 * If we have been the last backend, advance the tail pointer.
  	 */
! 	if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
! 		advanceTail = true;
! 	LWLockRelease(AsyncQueueLock);
! 
! 	if (advanceTail)
! 		asyncQueueAdvanceTail();
  }
  
  /*
***************
*** 348,357 ****
  	/* We can deal with pending NOTIFY though */
  	foreach(p, pendingNotifies)
  	{
! 		const char *relname = (const char *) lfirst(p);
  
  		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   relname, strlen(relname) + 1);
  	}
  
  	/*
--- 642,656 ----
  	/* We can deal with pending NOTIFY though */
  	foreach(p, pendingNotifies)
  	{
! 		AsyncQueueEntry qe;
! 		Notification   *n;
! 
! 		n = (Notification *) lfirst(p);
! 
! 		asyncQueueNotificationToEntry(n, &qe);
  
  		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   &qe, qe.length);
  	}
  
  	/*
***************
*** 363,386 ****
  }
  
  /*
!  * AtCommit_Notify
!  *
!  *		This is called at transaction commit.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, insert or delete
!  *		tuples in pg_listener accordingly.
   *
!  *		If there are outbound notify requests in the pendingNotifies list,
!  *		scan pg_listener for matching tuples, and either signal the other
!  *		backend or send a message to our own frontend.
   *
!  *		NOTE: we are still inside the current transaction, therefore can
!  *		piggyback on its committing of changes.
   */
  void
! AtCommit_Notify(void)
  {
- 	Relation	lRel;
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
--- 662,681 ----
  }
  
  /*
!  * AtCommit_NotifyBeforeCommit
   *
!  *		This is called at transaction commit, before actually committing to
!  *		clog.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, update our
!  *		"listenChannels" list.
   *
!  *		If there are outbound notify requests in the pendingNotifies list, add
!  *		them to the global queue and signal any backend that is listening.
   */
  void
! AtCommit_NotifyBeforeCommit(void)
  {
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
***************
*** 397,406 ****
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify");
  
! 	/* Acquire ExclusiveLock on pg_listener */
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
--- 692,702 ----
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyBeforeCommit");
  
! 	Assert(backendSendsNotifications == false);
! 	Assert(backendExecutesListen == false);
! 	Assert(backendExecutesUnlisten == false);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
***************
*** 410,508 ****
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_Listen(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN:
! 				Exec_Unlisten(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAll(lRel);
  				break;
  		}
- 
- 		/* We must CCI after each action in case of conflicting actions */
- 		CommandCounterIncrement();
  	}
  
! 	/* Perform any pending notifies */
  	if (pendingNotifies)
! 		Send_Notify(lRel);
  
  	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that notified backends see our tuple updates when they look. Else they
! 	 * might disregard the signal, which would make the application programmer
! 	 * very unhappy.  Also, this prevents race conditions when we have just
! 	 * inserted a listening tuple.
  	 */
! 	heap_close(lRel, NoLock);
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify: done");
  }
  
  /*
!  * Exec_Listen --- subroutine for AtCommit_Notify
!  *
!  *		Register the current backend as listening on the specified relation.
   */
! static void
! Exec_Listen(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
! 	Datum		values[Natts_pg_listener];
! 	bool		nulls[Natts_pg_listener];
! 	NameData	condname;
! 	bool		alreadyListener = false;
! 
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Listen(%s,%d)", relname, MyProcPid);
  
! 	/* Detect whether we are already listening on this relname */
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
  
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
  		{
! 			alreadyListener = true;
! 			/* No need to scan the rest of the table */
! 			break;
  		}
  	}
! 	heap_endscan(scan);
  
! 	if (alreadyListener)
! 		return;
  
! 	/*
! 	 * OK to insert a new tuple
! 	 */
! 	memset(nulls, false, sizeof(nulls));
  
! 	namestrcpy(&condname, relname);
! 	values[Anum_pg_listener_relname - 1] = NameGetDatum(&condname);
! 	values[Anum_pg_listener_listenerpid - 1] = Int32GetDatum(MyProcPid);
! 	values[Anum_pg_listener_notification - 1] = Int32GetDatum(0);		/* no notifies pending */
  
! 	tuple = heap_form_tuple(RelationGetDescr(lRel), values, nulls);
  
! 	simple_heap_insert(lRel, tuple);
  
! #ifdef NOT_USED					/* currently there are no indexes */
! 	CatalogUpdateIndexes(lRel, tuple);
! #endif
  
! 	heap_freetuple(tuple);
  
  	/*
! 	 * now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
--- 706,973 ----
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_Listen(actrec->condname);
  				break;
  			case LISTEN_UNLISTEN:
! 				Exec_Unlisten(actrec->condname);
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAll();
  				break;
  		}
  	}
  
! 	/*
! 	 * Perform any pending notifies.
! 	 */
  	if (pendingNotifies)
! 		Send_Notify();
  
  	/*
! 	 * Grab the AsyncCommitOrderLock to ensure we know the commit order.
! 	 * In case we have only sent notifications and have not executed LISTEN
! 	 * or UNLISTEN, a shared lock is sufficient.
  	 */
! 	if (backendExecutesListen || backendExecutesUnlisten)
! 		LWLockAcquire(AsyncCommitOrderLock, LW_EXCLUSIVE);
! 	else if (backendSendsNotifications)
! 		LWLockAcquire(AsyncCommitOrderLock, LW_SHARED);
! }
! 
! /*
!  * AtCommit_NotifyAfterCommit
!  *
!  *		This is called at transaction commit, after committing to clog.
!  *
!  *		Notify the listening backends.
!  */
! void
! AtCommit_NotifyAfterCommit(void)
! {
! 	QueuePosition	head;
! 	ListCell	   *p, *q;
! 
! 	/* Allow transactions that have not executed LISTEN/UNLISTEN/NOTIFY to
! 	 * return as soon as possible */
! 	if (!pendingActions && !backendSendsNotifications)
! 		return;
! 
! 	if (backendExecutesListen || backendExecutesUnlisten)
! 	{
! 		LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 		head = QUEUE_HEAD;
! 		LWLockRelease(AsyncQueueLock);
! 	}
! 
! 	foreach(p, pendingActions)
! 	{
! 		ListenAction *actrec = (ListenAction *) lfirst(p);
! 
! 		switch (actrec->action)
! 		{
! 			case LISTEN_LISTEN:
! 				Assert(backendExecutesListen);
! 				asyncQueuePhaseInOut(LISTEN_LISTEN, actrec->condname, head);
! 				break;
! 			case LISTEN_UNLISTEN:
! 				Assert(backendExecutesUnlisten);
! 				asyncQueuePhaseInOut(LISTEN_UNLISTEN, actrec->condname, head);
! 				break;
! 			case LISTEN_UNLISTEN_ALL:
! 				Assert(backendExecutesUnlisten);
! 				foreach(q, listenChannels)
! 				{
! 					char *lchan = (char *) lfirst(q);
! 					asyncQueuePhaseInOut(LISTEN_UNLISTEN, lchan, head);
! 				}
! 				break;
! 		}
! 	}
! 
! 	if (backendSendsNotifications)
! 		QUEUE_BACKEND_XID(MyBackendId) = InvalidTransactionId;
! 
! 	if (backendSendsNotifications
! 	 		|| backendExecutesListen
! 			|| backendExecutesUnlisten)
! 	{
! 		Assert(LWLockHeldByMe(AsyncCommitOrderLock));
! 		LWLockRelease(AsyncCommitOrderLock);
! 	}
! 
! 	if (backendSendsNotifications)
! 		SignalBackends();
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyAfterCommit: done");
  }
  
  /*
!  * This function is executed for every notification found in the queue in order
!  * to check if the current backend is listening on that channel. Not sure if we
!  * should further optimize this, for example convert to a sorted array and
!  * allow binary search on it...
   */
! static bool
! IsListeningOn(const char *channel)
  {
! 	ListCell   *p;
  
! 	foreach(p, listenChannels)
  	{
! 		ActionPhaseInOut   *act;
! 		char			   *lchan = (char *) lfirst(p);
! 		bool				vote;
  
! 		if (strcmp(lchan, channel) == 0)
  		{
! 			vote = true;
! 			foreach(p, ActionPhaseInOutList)
! 			{
! 				act = (ActionPhaseInOut *) lfirst(p);
! 				if (strcmp(act->channel, channel) != 0)
! 					continue;
! 				if (act->kind == LISTEN_LISTEN)
! 					vote = true;
! 				if (act->kind == LISTEN_UNLISTEN)
! 					vote = false;
! 			}
! 			return vote;
  		}
  	}
! 	return false;
! }
  
! /*
!  * This is a less strict version of IsListeningOn().
!  *
!  * Think of the following:
!  *
!  * Backend 1:            Backend 2:
!  * LISTEN
!  * COMMIT
!  *                       NOTIFY
!  * UNLISTEN
!  *                       COMMIT
!  * COMMIT
!  *
!  * At the end backend 1 would say IsListeningOn() == false, but
!  * IsInListenChannels() == true.
!  *
!  * IsInListenChannels() is called from ProcessIncomingNotify() to check if
!  * a notification could be of interest.
!  */
! static bool
! IsInListenChannels(const char *channel)
! {
! 	ListCell   *p;
! 	char	   *lchan;
  
! 	foreach(p, listenChannels)
! 	{
! 		lchan = (char *) lfirst(p);
! 		if (strcmp(lchan, channel) == 0)
! 			return true;
! 	}
! 	return false;
! }
! 
! Datum
! pg_listening(PG_FUNCTION_ARGS)
! {
! 	FuncCallContext	   *funcctx;
! 	ListCell		  **lcp;
! 
! 	/* stuff done only on the first call of the function */
! 	if (SRF_IS_FIRSTCALL())
! 	{
! 		MemoryContext	oldcontext;
! 
! 		/* create a function context for cross-call persistence */
! 		funcctx = SRF_FIRSTCALL_INIT();
! 
! 		/*
! 		 * switch to memory context appropriate for multiple function calls
! 		 */
! 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
  
! 		/* allocate memory for user context */
! 		lcp = (ListCell **) palloc(sizeof(ListCell *));
! 		if (listenChannels != NIL)
! 			*lcp = list_head(listenChannels);
! 		else
! 			*lcp = NULL;
! 		funcctx->user_fctx = (void *) lcp;
  
! 		MemoryContextSwitchTo(oldcontext);
! 	}
  
! 	/* stuff done on every call of the function */
! 	funcctx = SRF_PERCALL_SETUP();
! 	lcp = (ListCell **) funcctx->user_fctx;
  
! 	while (*lcp != NULL)
! 	{
! 		char   *channel = (char *) lfirst(*lcp);
  
! 		*lcp = (*lcp)->next;
! 
! 		if (IsListeningOn(channel))
! 			SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
! 	}
! 
! 	SRF_RETURN_DONE(funcctx);
! }
! 
! /*
!  * Exec_Listen --- subroutine for AtCommit_NotifyBeforeCommit
!  *
!  *		Register the current backend as listening on the specified channel.
!  */
! static void
! Exec_Listen(const char *channel)
! {
! 	MemoryContext oldcontext;
! 
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Listen(%s,%d)", channel, MyProcPid);
! 
! 	/*
! 	 * This line has to be here and not after the IsInListenChannels() call to
! 	 * match the AtCommit_NotifyAfterCommit() checks.
! 	 */
! 	backendExecutesListen = true;
! 
! 	/* Detect whether we are already listening on this channel */
! 	if (IsInListenChannels(channel))
! 		return;
! 
! 	/*
! 	 * OK to insert to the list.
! 	 */
! 	if (listenChannels == NIL)
! 	{
! 		/*
! 		 * This is our first LISTEN, establish our pointer.
! 		 */
! 		LWLockAcquire(AsyncCommitOrderLock, LW_SHARED);
! 		QUEUE_BACKEND_XID(MyBackendId) = InvalidTransactionId;
! 		LWLockRelease(AsyncCommitOrderLock);
! 
! 		LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 		QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;
! 		QUEUE_BACKEND_PID(MyBackendId) = MyProcPid;
! 		LWLockRelease(AsyncQueueLock);
! 	}
! 
! 	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
! 	listenChannels = lappend(listenChannels, pstrdup(channel));
! 	MemoryContextSwitchTo(oldcontext);
  
  	/*
! 	 * Now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
***************
*** 514,550 ****
  /*
   * Exec_Unlisten --- subroutine for AtCommit_Notify
   *
!  *		Remove the current backend from the list of listening backends
!  *		for the specified relation.
   */
  static void
! Exec_Unlisten(Relation lRel, const char *relname)
  {
- 	HeapScanDesc scan;
- 	HeapTuple	tuple;
- 
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Unlisten(%s,%d)", relname, MyProcPid);
  
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
! 
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
! 		{
! 			/* Found the matching tuple, delete it */
! 			simple_heap_delete(lRel, &tuple->t_self);
! 
! 			/*
! 			 * We assume there can be only one match, so no need to scan the
! 			 * rest of the table
! 			 */
! 			break;
! 		}
! 	}
! 	heap_endscan(scan);
  
  	/*
  	 * We do not complain about unlistening something not being listened;
--- 979,993 ----
  /*
   * Exec_Unlisten --- subroutine for AtCommit_Notify
   *
!  *		Remove the specified channel from "listenChannels".
   */
  static void
! Exec_Unlisten(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Unlisten(%s,%d)", channel, MyProcPid);
  
! 	backendExecutesUnlisten = true;
  
  	/*
  	 * We do not complain about unlistening something not being listened;
***************
*** 555,690 ****
  /*
   * Exec_UnlistenAll --- subroutine for AtCommit_Notify
   *
!  *		Update pg_listener to unlisten all relations for this backend.
   */
  static void
! Exec_UnlistenAll(Relation lRel)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple;
! 	ScanKeyData key[1];
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAll");
  
! 	/* Find and delete all entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
  
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 		simple_heap_delete(lRel, &lTuple->t_self);
  
! 	heap_endscan(scan);
  }
  
  /*
!  * Send_Notify --- subroutine for AtCommit_Notify
!  *
!  *		Scan pg_listener for tuples matching our pending notifies, and
!  *		either signal the other backend or send a message to our own frontend.
   */
  static void
! Send_Notify(Relation lRel)
  {
! 	TupleDesc	tdesc = RelationGetDescr(lRel);
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 
! 	/* preset data to update notify column to MyProcPid */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(MyProcPid);
! 
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		listenerPID = listener->listenerpid;
  
! 		if (!AsyncExistsPendingNotify(relname))
! 			continue;
  
! 		if (listenerPID == MyProcPid)
  		{
! 			/*
! 			 * Self-notify: no need to bother with table update. Indeed, we
! 			 * *must not* clear the notification field in this path, or we
! 			 * could lose an outside notify, which'd be bad for applications
! 			 * that ignore self-notify messages.
! 			 */
! 			if (Trace_notify)
! 				elog(DEBUG1, "AtCommit_Notify: notifying self");
  
! 			NotifyMyFrontEnd(relname, listenerPID);
  		}
  		else
  		{
- 			if (Trace_notify)
- 				elog(DEBUG1, "AtCommit_Notify: notifying pid %d",
- 					 listenerPID);
- 
  			/*
! 			 * If someone has already notified this listener, we don't bother
! 			 * modifying the table, but we do still send a NOTIFY_INTERRUPT
! 			 * signal, just in case that backend missed the earlier signal for
! 			 * some reason.  It's OK to send the signal first, because the
! 			 * other guy can't read pg_listener until we unlock it.
! 			 *
! 			 * Note: we don't have the other guy's BackendId available, so
! 			 * this will incur a search of the ProcSignal table.  That's
! 			 * probably not worth worrying about.
  			 */
! 			if (SendProcSignal(listenerPID, PROCSIG_NOTIFY_INTERRUPT,
! 							   InvalidBackendId) < 0)
  			{
! 				/*
! 				 * Get rid of pg_listener entry if it refers to a PID that no
! 				 * longer exists.  Presumably, that backend crashed without
! 				 * deleting its pg_listener entries. This code used to only
! 				 * delete the entry if errno==ESRCH, but as far as I can see
! 				 * we should just do it for any failure (certainly at least
! 				 * for EPERM too...)
! 				 */
! 				simple_heap_delete(lRel, &lTuple->t_self);
  			}
! 			else if (listener->notification == 0)
  			{
! 				/* Rewrite the tuple with my PID in notification column */
! 				rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 				simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 				CatalogUpdateIndexes(lRel, rTuple);
! #endif
  			}
  		}
  	}
  
! 	heap_endscan(scan);
  }
  
  /*
   * AtAbort_Notify
   *
!  *		This is called at transaction abort.
   *
!  *		Gets rid of pending actions and outbound notifies that we would have
!  *		executed if the transaction got committed.
   */
  void
  AtAbort_Notify(void)
  {
  	ClearPendingActionsAndNotifies();
  }
  
--- 998,1390 ----
  /*
   * Exec_UnlistenAll --- subroutine for AtCommit_Notify
   *
!  *		Unlisten on all channels for this backend.
   */
  static void
! Exec_UnlistenAll(void)
  {
! 	ListCell   *p;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAll(%d)", MyProcPid);
  
! 	foreach(p, listenChannels)
! 	{
! 		char *lchan = (char *) lfirst(p);
! 		Exec_Unlisten(lchan);
! 	}
! }
  
! static bool
! asyncQueueIsFull()
! {
! 	QueuePosition	lookahead = QUEUE_HEAD;
! 	Size			remain = QUEUE_PAGESIZE - QUEUE_POS_OFFSET(lookahead) - 1;
! 	Size			advance = Min(remain, NOTIFY_PAYLOAD_MAX_LENGTH);
  
! 	/*
! 	 * Check what happens if we wrote a maximally sized entry. Would we go to a
! 	 * new page? If not, then our queue can not be full (because we can still
! 	 * fill at least the current page with at least one more entry).
! 	 */
! 	if (!asyncQueueAdvance(&lookahead, advance))
! 		return false;
! 
! 	/*
! 	 * The queue is full if with a switch to a new page we reach the page
! 	 * of the tail pointer.
! 	 */
! 	return QUEUE_POS_PAGE(lookahead) == QUEUE_POS_PAGE(QUEUE_TAIL);
  }
  
  /*
!  * The function advances the position to the next entry. In case we jump to
!  * a new page the function returns true, else false.
   */
+ static bool
+ asyncQueueAdvance(QueuePosition *position, int entryLength)
+ {
+ 	int		pageno = QUEUE_POS_PAGE(*position);
+ 	int		offset = QUEUE_POS_OFFSET(*position);
+ 	bool	pageJump = false;
+ 
+ 	/*
+ 	 * Move to the next writing position: First jump over what we have just
+ 	 * written or read.
+ 	 */
+ 	offset += entryLength;
+ 	Assert(offset < QUEUE_PAGESIZE);
+ 
+ 	/*
+ 	 * In a second step check if another entry can be written to the page. If
+ 	 * it can, stay here: we have reached the next position. If not, then we
+ 	 * need to move on to the next page.
+ 	 */
+ 	if (offset + AsyncQueueEntryEmptySize >= QUEUE_PAGESIZE)
+ 	{
+ 		pageno++;
+ 		if (pageno > QUEUE_MAX_PAGE)
+ 			/* wrap around */
+ 			pageno = 0;
+ 		offset = 0;
+ 		pageJump = true;
+ 	}
+ 
+ 	SET_QUEUE_POS(*position, pageno, offset);
+ 	return pageJump;
+ }
+ 
+ static void
+ asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
+ {
+ 		Assert(n->channel != NULL);
+ 		Assert(n->payload != NULL);
+ 		Assert(strlen(n->payload) <= NOTIFY_PAYLOAD_MAX_LENGTH);
+ 
+ 		/* The terminator is already included in AsyncQueueEntryEmptySize */
+ 		qe->length = AsyncQueueEntryEmptySize + strlen(n->payload);
+ 		qe->srcPid = MyProcPid;
+ 		qe->dboid = MyDatabaseId;
+ 		qe->xid = GetCurrentTransactionId();
+ 		strcpy(qe->channel, n->channel);
+ 		strcpy(qe->payload, n->payload);
+ }
+ 
  static void
! asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n,
! 							  QueuePosition pos)
  {
! 	n->channel = pstrdup(qe->channel);
! 	n->payload = pstrdup(qe->payload);
! 	n->srcPid = qe->srcPid;
! 	n->xid = qe->xid;
! 	n->position = pos;
! }
! 
! static List *
! asyncQueueAddEntries(List *notifications)
! {
! 	AsyncQueueEntry	qe;
! 	int				pageno;
! 	int				offset;
! 	int				slotno;
  
! 	/*
! 	 * Note that we are holding exclusive AsyncQueueLock already.
! 	 */
! 	LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
! 	pageno = QUEUE_POS_PAGE(QUEUE_HEAD);
! 	slotno = SimpleLruReadPage(AsyncCtl, pageno, true, InvalidTransactionId);
! 	AsyncCtl->shared->page_dirty[slotno] = true;
! 
! 	do
! 	{
! 		Notification   *n;
  
! 		if (asyncQueueIsFull())
  		{
! 			/* document that we will not enter the if statement further down */
! 			Assert(QUEUE_POS_OFFSET(QUEUE_HEAD) != 0);
! 			break;
! 		}
! 
! 		n = (Notification *) linitial(notifications);
  
! 		asyncQueueNotificationToEntry(n, &qe);
! 
! 		offset = QUEUE_POS_OFFSET(QUEUE_HEAD);
! 		/*
! 		 * Check whether or not the entry still fits on the current page.
! 		 */
! 		if (offset + qe.length < QUEUE_PAGESIZE)
! 		{
! 			notifications = list_delete_first(notifications);
  		}
  		else
  		{
  			/*
! 			 * Write a dummy entry to fill up the page. Actually readers will
! 			 * only check dboid and since it won't match any reader's database
! 			 * oid, they will ignore this entry and move on.
  			 */
! 			qe.length = QUEUE_PAGESIZE - offset - 1;
! 			qe.dboid = InvalidOid;
! 			qe.channel[0] = '\0';
! 			qe.payload[0] = '\0';
! 			qe.xid = InvalidTransactionId;
! 		}
! 		memcpy((char*) AsyncCtl->shared->page_buffer[slotno] + offset,
! 			   &qe, qe.length);
! 
! 	} while (!asyncQueueAdvance(&(QUEUE_HEAD), qe.length)
! 			 && notifications != NIL);
! 
! 	if (QUEUE_POS_OFFSET(QUEUE_HEAD) == 0)
! 	{
! 		/*
! 		 * If the next entry needs to go to a new page, prepare that page
! 		 * already.
! 		 */
! 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
! 		AsyncCtl->shared->page_dirty[slotno] = true;
! 	}
! 	LWLockRelease(AsyncCtlLock);
! 
! 	return notifications;
! }
! 
! static void
! asyncQueueFillWarning()
! {
! 	/*
! 	 * Caller must hold exclusive AsyncQueueLock.
! 	 */
! 	TimestampTz		t;
! 	double			fillDegree;
! 	int				occupied;
! 	int				tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
! 	int				headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
! 
! 	occupied = headPage - tailPage;
! 
! 	if (occupied == 0)
! 		return;
! 	
! 	if (!asyncQueuePagePrecedesPhysically(tailPage, headPage))
! 		/* head has wrapped around, tail not yet */
! 		occupied += QUEUE_MAX_PAGE;
! 
! 	fillDegree = (float) occupied / (float) QUEUE_MAX_PAGE;
! 
! 	if (fillDegree < 0.5)
! 		return;
! 
! 	t = GetCurrentTimestamp();
! 
! 	if (TimestampDifferenceExceeds(asyncQueueControl->lastQueueFillWarn,
! 								   t, QUEUE_FULL_WARN_INTERVAL))
! 	{
! 		QueuePosition	min = QUEUE_HEAD;
! 		int32			minPid = InvalidPid;
! 		int				i;
! 
! 		for (i = 0; i < MaxBackends; i++)
! 			if (QUEUE_BACKEND_PID(i) != InvalidPid)
  			{
! 				min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
! 				if (QUEUE_POS_EQUAL(min, QUEUE_BACKEND_POS(i)))
! 					minPid = QUEUE_BACKEND_PID(i);
  			}
! 
! 		if (fillDegree < 0.75)
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 50%% full. "
! 								 "Among the slowest backends: %d", minPid)));
! 		else
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 75%% full. "
! 								 "Among the slowest backends: %d", minPid)));
! 
! 		asyncQueueControl->lastQueueFillWarn = t;
! 	}
! }
! 
! static List *
! asyncQueueSendingXids(void)
! {
! 	/* Caller must hold exclusive lock on AsyncCommitOrderLock */
! 	int		i;
! 	List   *xidList = NIL;
! 
! 	for (i = 0; i < MaxBackends; i++)
! 		if (QUEUE_BACKEND_XID(i) != InvalidTransactionId)
! 			xidList = lappend_int(xidList, QUEUE_BACKEND_XID(i));
! 
! 	return xidList;
! }
! 
! static void
! asyncQueuePhaseInOut(ListenActionKind kind, const char *channel,
! 					 QueuePosition limitPos)
! {
! 	/* Caller must hold exclusive lock on AsyncCommitOrderLock (for
! 	 * asyncQueueSendingXids()) */
! 	MemoryContext 		oldcontext;
! 	ActionPhaseInOut   *entry;
! 
! 	Assert(kind == LISTEN_LISTEN || kind == LISTEN_UNLISTEN);
! 
! 	if (kind == LISTEN_UNLISTEN && !IsListeningOn(channel))
! 		return;
! 	/* We cannot take the same shortcut for LISTEN_LISTEN and return if we are
! 	 * listening already. Exec_Listen() adds the channel to the list of
! 	 * channels for every new LISTEN, so at this point we can no longer tell
! 	 * whether there was an earlier LISTEN. */
! 
! 	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
! 
! 	entry = (ActionPhaseInOut *) palloc(sizeof(ActionPhaseInOut));
! 	entry->channel = pstrdup(channel);
! 	entry->xids = asyncQueueSendingXids();
! 	entry->kind = kind;
! 	entry->limitPos = limitPos;
! 
! 	ActionPhaseInOutList = lappend(ActionPhaseInOutList, entry);
! 
! 	MemoryContextSwitchTo(oldcontext);
! }
! 
! 
! /*
!  * Send_Notify --- subroutine for AtCommit_Notify
!  *
!  * Add the pending notifications to the queue and signal the listening
!  * backends.
!  *
!  * A full queue is very uncommon and should really not happen, given that we
!  * have so much space available in our slru pages. Nevertheless we need to
!  * deal with this possibility. Note that when we get here we are in the process
!  * of committing our transaction: we have not yet committed to clog, but that
!  * would be the next step. So at this point we can still roll the transaction
!  * back.
!  */
! static void
! Send_Notify()
! {
! 	Assert(pendingNotifies != NIL);
! 
! 	backendSendsNotifications = true;
! 
! 	LWLockAcquire(AsyncCommitOrderLock, LW_SHARED);
! 	QUEUE_BACKEND_XID(MyBackendId) = GetCurrentTransactionId();
! 	LWLockRelease(AsyncCommitOrderLock);
! 
! 	while (pendingNotifies != NIL)
! 	{
! 		LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 		asyncQueueFillWarning();
! 		if (asyncQueueIsFull())
! 			ereport(ERROR,
! 					(errcode(ERRCODE_TOO_MANY_ENTRIES),
! 					errmsg("too many notifications in the queue")));
! 		pendingNotifies = asyncQueueAddEntries(pendingNotifies);
! 		LWLockRelease(AsyncQueueLock);
! 	}
! }
! 
! /*
!  * Send signals to all listening backends. Since we take the EXCLUSIVE lock
!  * anyway, we also check the position of each backend, and if it is already
!  * up to date we don't signal it.
!  *
!  * Since we know the BackendId and the Pid the signalling is quite cheap.
!  */
! static void
! SignalBackends(void)
! {
! 	QueuePosition	pos;
! 	ListCell	   *p1, *p2;
! 	int				i;
! 	int32			pid;
! 	List		   *pids = NIL;
! 	List		   *ids = NIL;
! 	int				count = 0;
! 
! 	/* Signal everybody who is LISTENing to any channel. */
! 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 	for (i = 0; i < MaxBackends; i++)
! 	{
! 		pid = QUEUE_BACKEND_PID(i);
! 		if (pid != InvalidPid)
! 		{
! 			count++;
! 			pos = QUEUE_BACKEND_POS(i);
! 			if (!QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
  			{
! 				pids = lappend_int(pids, pid);
! 				ids = lappend_int(ids, i);
  			}
  		}
  	}
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	forboth(p1, pids, p2, ids)
+ 	{
+ 		pid = (int32) lfirst_int(p1);
+ 		i = lfirst_int(p2);
+ 		/*
+ 		 * Should we check for failure? Can it happen that a backend
+ 		 * has crashed without the postmaster starting over?
+ 		 */
+ 		if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, i) < 0)
+ 			elog(WARNING, "Error signalling backend %d", pid);
+ 	}
  
! 	if (count == 0)
! 	{
! 		/* No backend is listening at all, signal myself so we can clean up
! 		 * the queue. */
! 		SendProcSignal(MyProcPid, PROCSIG_NOTIFY_INTERRUPT, MyBackendId);
! 	}
  }
  
  /*
   * AtAbort_Notify
   *
!  *	This is called at transaction abort.
   *
!  *	Gets rid of pending actions and outbound notifies that we would have
!  *	executed if the transaction got committed.
!  *
!  *	Even though we have not committed, we need to signal the listening backends
!  *	because our notifications might block readers from processing the queue.
!  *	Now that the transaction has aborted, they can go on and skip our
!  *	notifications.
   */
  void
  AtAbort_Notify(void)
  {
+ 	if (backendSendsNotifications)
+ 		SignalBackends();
+ 
  	ClearPendingActionsAndNotifies();
  }
  
***************
*** 940,968 ****
  }
  
  /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan pg_listener for arriving notifies, report them to my front end,
!  *		and clear the notification field in pg_listener until next time.
   *
!  *		NOTE: since we are outside any transaction, we must create our own.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	Relation	lRel;
! 	TupleDesc	tdesc;
! 	ScanKeyData key[1];
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 	bool		catchup_enabled;
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
--- 1640,2005 ----
  }
  
  /*
+  * This function asks for a page with read-only access. Once we have the
+  * lock, we read the page's content and pass back the list of notifications
+  * for the calling function to deliver. The list contains all notifications
+  * from transactions that have already committed.
+  *
+  * We stop when we either reach the stop position or move to a new page.
+  *
+  * The function returns true once we have reached the stop position or a
+  * notification of a transaction that is still running, and false if we have
+  * just finished with the page.
+  */
+ static bool
+ asyncQueueGetEntriesByPage(QueuePosition *current,
+ 						   QueuePosition stop,
+ 						   List **notifications)
+ {
+ 	AsyncQueueEntry	qe;
+ 	Notification   *n;
+ 	int				slotno;
+ 	bool			reachedStop = false;
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		return true;
+ 
+ 	slotno = SimpleLruReadPage_ReadOnly(AsyncCtl, current->page,
+ 										InvalidTransactionId);
+ 	do {
+ 		char *readPtr = (char *) (AsyncCtl->shared->page_buffer[slotno]);
+ 		readPtr += current->offset;
+ 
+ 		if (QUEUE_POS_EQUAL(*current, stop))
+ 		{
+ 			reachedStop = true;
+ 			break;
+ 		}
+ 
+ 		memcpy(&qe, readPtr, AsyncQueueEntryEmptySize);
+ 
+ 		if (qe.dboid == MyDatabaseId)
+ 		{
+ 			if (TransactionIdDidCommit(qe.xid))
+ 			{
+ 				if (IsInListenChannels(qe.channel))
+ 				{
+ 					if (qe.length > AsyncQueueEntryEmptySize)
+ 						memcpy(&qe, readPtr, qe.length);
+ 					n = (Notification *) palloc(sizeof(Notification));
+ 					asyncQueueEntryToNotification(&qe, n, *current);
+ 					*notifications = lappend(*notifications, n);
+ 				}
+ 			}
+ 			else
+ 			{
+ 				if (!TransactionIdDidAbort(qe.xid))
+ 				{
+ 					/*
+ 					 * The transaction has neither committed nor aborted so
+ 					 * far.
+ 					 */
+ 					reachedStop = true;
+ 					break;
+ 				}
+ 			}
+ 		}
+ 		/*
+ 		 * The call to asyncQueueAdvance just jumps over what we have
+ 		 * just read. If there is no more space for the next record on the
+ 		 * current page, it will also switch to the beginning of the next page.
+ 		 */
+ 	} while(!asyncQueueAdvance(current, qe.length));
+ 
+ 	LWLockRelease(AsyncCtlLock);
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		reachedStop = true;
+ 
+ 	return reachedStop;
+ }
+ 
+ static bool
+ asyncQueueCheckDelivery(Notification *n, QueuePosition head)
+ {
+ 	ActionPhaseInOut   *act;
+ 	ListCell		   *lc;
+ 	bool				vote = true;
+ 
+ 	foreach(lc, ActionPhaseInOutList)
+ 	{
+ 		act = (ActionPhaseInOut *) lfirst(lc);
+ 
+ 		if (strcmp(act->channel, n->channel) != 0)
+ 			continue;
+ 
+ 		if (act->kind == LISTEN_LISTEN)
+ 		{
+ 			if (QUEUE_POS_LT(n->position, act->limitPos, head))
+ 			{
+ 				/*
+ 				 * When LISTEN committed, n->xid was still running. As n->xid
+ 				 * has committed by now, we need to deliver its notification.
+ 				 *
+ 				 * If n->xid was not running then it is a committed transaction
+ 				 * and we must not deliver notifications from already committed
+ 				 * transactions.
+ 				 */
+ 				if (list_member_int(act->xids, n->xid))
+ 					vote = true;
+ 				else
+ 					vote = false;
+ 			}
+ 			else
+ 				/*
+ 				 * n->xid committed, when LISTEN was already fully established
+ 				 */
+ 				vote = true;
+ 		}
+ 		else
+ 		{
+ 			Assert(act->kind == LISTEN_UNLISTEN);
+ 			if (QUEUE_POS_LT(n->position, act->limitPos, head))
+ 			{
+ 				/*
+ 				 * When UNLISTEN committed, n->xid was still running. As n->xid
+ 				 * has committed by now, we must not deliver its notification.
+ 				 *
+ 				 * If n->xid was already committed, we need to deliver its
+ 				 * notification (because we assume that there has been a LISTEN
+ 				 * previously).
+ 				 */
+ 				if (list_member_int(act->xids, n->xid))
+ 					vote = false;
+ 				else
+ 					vote = true;
+ 			}
+ 			else
+ 				/*
+ 				 * n->xid committed, when UNLISTEN was already fully
+ 				 * established.
+ 				 */
+ 				vote = false;
+ 		}
+ 	}
+ 
+ 	return vote;
+ }
+ 
+ static void
+ asyncQueueCleanUpPhaseInOut(QueuePosition pos, QueuePosition head)
+ {
+ 	ActionPhaseInOut   *act;
+ 	ListCell		   *p, *q;
+ 	List			   *deletedChannels = NIL;
+ 
+ 	while ((p = list_head(ActionPhaseInOutList)))
+ 	{
+ 		act = (ActionPhaseInOut *) lfirst(p);
+ 		if (QUEUE_POS_LE(act->limitPos, pos, head))
+ 		{
+ 			list_free(act->xids);
+ 			if (act->kind == LISTEN_LISTEN)
+ 			{
+ 				pfree(act->channel);
+ 			}
+ 			else
+ 			{
+ 				Assert(act->kind == LISTEN_UNLISTEN);
+ 				/* do not free act->channel, we reuse it... */
+ 				deletedChannels = lappend(deletedChannels, act->channel);
+ 			}
+ 			ActionPhaseInOutList =
+ 							list_delete_cell(ActionPhaseInOutList, p, NULL);
+ 			
+ 			if (ActionPhaseInOutList == NIL)
+ 				break;
+ 		}
+ 		else
+ 		{
+ 			/* the entries are ordered by limitPos, if we don't have a hit
+ 			 * now we won't have a further hit later on either */
+ 			break;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Check for every channel that we want to delete from listenChannels if
+ 	 * it is still in use by a subsequently issued LISTEN. There clearly is
+ 	 * room for improving the performance of this check but we expect the lists
+ 	 * to be really short anyway...
+ 	 */
+ 	foreach(p, deletedChannels)
+ 	{
+ 		char   *candidate = (char *) lfirst(p);
+ 		bool	found = false;
+ 		foreach(q, ActionPhaseInOutList)
+ 		{
+ 			act = (ActionPhaseInOut *) lfirst(q);
+ 			if (strcmp(candidate, act->channel) == 0)
+ 			{
+ 				found = true;
+ 				break;
+ 			}
+ 		}
+ 		if (found == false)
+ 		{
+ 			ListCell *prev = NULL;
+ 			foreach(q, listenChannels)
+ 			{
+ 				char *lchan = (char *) lfirst(q);
+ 				if (strcmp(lchan, candidate) == 0)
+ 				{
+ 					listenChannels = list_delete_cell(listenChannels, q, prev);
+ 					Assert(!IsInListenChannels(lchan));
+ 					pfree(lchan);
+ 					break;
+ 				}
+ 				prev = q;
+ 			}
+ 		}
+ 	}
+ 
+ 	if (listenChannels == NIL && ActionPhaseInOutList == NIL)
+ 	{
+ 		bool advanceTail = false;
+ 
+ 		LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 		QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
+ 		Assert(QUEUE_BACKEND_XID(MyBackendId) == InvalidTransactionId);
+ 		if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
+ 			advanceTail = true;
+ 		LWLockRelease(AsyncQueueLock);
+ 
+ 		if (advanceTail)
+ 			/* Move forward the tail pointer and try to truncate. */
+ 			asyncQueueAdvanceTail();
+ 	}
+ }
+ 
+ static void
+ asyncQueueReadAllNotifications(void)
+ {
+ 	QueuePosition	pos;
+ 	QueuePosition	oldpos;
+ 	QueuePosition	head;
+ 	List		   *notifications;
+ 	ListCell	   *lc;
+ 	Notification   *n;
+ 	bool			advanceTail = false;
+ 	bool			reachedStop;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	pos = oldpos = QUEUE_BACKEND_POS(MyBackendId);
+ 	head = QUEUE_HEAD;
+ 	/*
+ 	 * We could have signalled ourselves because nobody was listening when we
+  	 * sent out notifications.
+  	 */
+ 	if (QUEUE_BACKEND_PID(MyBackendId) == InvalidPid)
+ 		advanceTail = true;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (advanceTail)
+ 	{
+ 		asyncQueueAdvanceTail();
+ 		return;
+ 	}
+ 
+ 	/* Nothing to do, we have read all notifications already. */
+ 	if (QUEUE_POS_EQUAL(pos, head))
+ 		return;
+ 
+ 	do 
+ 	{
+ 		/*
+ 		 * Our stop position is what we found to be the head's position when
+ 		 * we entered this function. It might have changed already. But if it
+ 		 * has, we will receive (or have already received and queued) another
+ 		 * signal and come here again.
+ 		 *
+ 		 * We are not holding AsyncQueueLock here! The queue can only extend
+ 		 * beyond the head pointer (see above) and we leave our backend's
+ 		 * pointer where it is so nobody will truncate or rewrite pages under
+ 		 * us.
+ 		 */
+ 		reachedStop = false;
+ 
+ 		notifications = NIL;
+ 		reachedStop = asyncQueueGetEntriesByPage(&pos, head, &notifications);
+ 
+ 		foreach(lc, notifications)
+ 		{
+ 			n = (Notification *) lfirst(lc);
+ 			if (asyncQueueCheckDelivery(n, head))
+ 				NotifyMyFrontEnd(n->channel, n->payload, n->srcPid);
+ 		}
+ 		asyncQueueCleanUpPhaseInOut(pos, head);
+ 	} while (!reachedStop);
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	QUEUE_BACKEND_POS(MyBackendId) = pos;
+ 	if (QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL))
+ 		advanceTail = true;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (advanceTail)
+ 		/* Move forward the tail pointer and try to truncate. */
+ 		asyncQueueAdvanceTail();
+ }
+ 
+ static void
+ asyncQueueAdvanceTail()
+ {
+ 	QueuePosition	min;
+ 	int				i;
+ 	int				tailp;
+ 	int				headp;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ 	min = QUEUE_HEAD;
+ 	for (i = 0; i < MaxBackends; i++)
+ 		if (QUEUE_BACKEND_PID(i) != InvalidPid)
+ 			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
+ 
+ 	tailp = QUEUE_POS_PAGE(QUEUE_TAIL);
+ 	headp = QUEUE_POS_PAGE(QUEUE_HEAD);
+ 	QUEUE_TAIL = min;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	/* This is our wraparound check */
+ 	if ((asyncQueuePagePrecedesLogically(tailp, QUEUE_POS_PAGE(min), headp)
+ 			&& asyncQueuePagePrecedesPhysically(tailp, headp))
+ 		|| tailp == QUEUE_POS_PAGE(min))
+ 	{
+ 		/*
+ 		 * SimpleLruTruncate() will ask for AsyncCtlLock but will also
+ 		 * release the lock again.
+ 		 *
+ 		 * XXX this could be optimized, to call SimpleLruTruncate only when we
+ 		 * know we can truncate something.
+ 		 */
+ 		SimpleLruTruncate(AsyncCtl, QUEUE_POS_PAGE(min));
+ 	}
+ }
+ 
+ /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan the queue for arriving notifications and report them to my front
!  *		end.
   *
!  *		NOTE: we are outside of any transaction here.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	bool			catchup_enabled;
! 
! 	Assert(GetCurrentTransactionIdIfAny() == InvalidTransactionId);
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
***************
*** 974,1037 ****
  
  	notifyInterruptOccurred = 0;
  
! 	StartTransactionCommand();
! 
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
! 	tdesc = RelationGetDescr(lRel);
! 
! 	/* Scan only entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
! 
! 	/* Prepare data for rewriting 0 into notification field */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(0);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		sourcePID = listener->notification;
! 
! 		if (sourcePID != 0)
! 		{
! 			/* Notify the frontend */
! 
! 			if (Trace_notify)
! 				elog(DEBUG1, "ProcessIncomingNotify: received %s from %d",
! 					 relname, (int) sourcePID);
! 
! 			NotifyMyFrontEnd(relname, sourcePID);
! 
! 			/*
! 			 * Rewrite the tuple with 0 in notification column.
! 			 */
! 			rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 			simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 			CatalogUpdateIndexes(lRel, rTuple);
! #endif
! 		}
! 	}
! 	heap_endscan(scan);
! 
! 	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that other backends see our tuple updates when they look. Otherwise, a
! 	 * transaction started after this one might mistakenly think it doesn't
! 	 * need to send this backend a new NOTIFY.
! 	 */
! 	heap_close(lRel, NoLock);
! 
! 	CommitTransactionCommand();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
--- 2011,2017 ----
  
  	notifyInterruptOccurred = 0;
  
! 	asyncQueueReadAllNotifications();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
***************
*** 1051,1070 ****
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(char *relname, int32 listenerPID)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, listenerPID, sizeof(int32));
! 		pq_sendstring(&buf, relname);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 		{
! 			/* XXX Add parameter string here later */
! 			pq_sendstring(&buf, "");
! 		}
  		pq_endmessage(&buf);
  
  		/*
--- 2031,2047 ----
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(const char *channel, const char *payload, int32 srcPid)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, srcPid, sizeof(int32));
! 		pq_sendstring(&buf, channel);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 			pq_sendstring(&buf, payload);
  		pq_endmessage(&buf);
  
  		/*
***************
*** 1074,1096 ****
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", relname);
  }
  
! /* Does pendingNotifies include the given relname? */
  static bool
! AsyncExistsPendingNotify(const char *relname)
  {
  	ListCell   *p;
  
! 	foreach(p, pendingNotifies)
! 	{
! 		const char *prelname = (const char *) lfirst(p);
  
! 		if (strcmp(prelname, relname) == 0)
  			return true;
  	}
  
  	return false;
  }
  
--- 2051,2107 ----
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", channel);
  }
  
! /* Does pendingNotifies include the given channel/payload? */
  static bool
! AsyncExistsPendingNotify(const char *channel, const char *payload)
  {
  	ListCell   *p;
+ 	Notification *n;
  
! 	if (pendingNotifies == NIL)
! 		return false;
! 
! 	if (payload == NULL)
! 		payload = "";
! 
! 	/*
! 	 * We need to append new elements to the end of the list in order to keep
! 	 * the order. On the other hand, we'd like to check the list backwards in
! 	 * order to make duplicate-elimination a tad faster when the same
! 	 * condition is signaled many times in a row. So as a compromise we check
! 	 * the tail element first, which we can access directly. If this doesn't
! 	 * match, we check the rest of the list.
! 	 */
  
! 	n = (Notification *) llast(pendingNotifies);
! 	if (strcmp(n->channel, channel) == 0)
! 	{
! 		Assert(n->payload != NULL);
! 		if (strcmp(n->payload, payload) == 0)
  			return true;
  	}
  
+ 	/*
+ 	 * Note the difference from foreach(): we stop when p is the last element.
+ 	 * We don't check the last element here because we checked it above.
+  	 */
+ 	for(p = list_head(pendingNotifies);
+ 		p != list_tail(pendingNotifies);
+ 		p = lnext(p))
+ 	{
+ 		n = (Notification *) lfirst(p);
+ 
+ 		if (strcmp(n->channel, channel) == 0)
+ 		{
+ 			Assert(n->payload != NULL);
+ 			if (strcmp(n->payload, payload) == 0)
+ 				return true;
+ 		}
+ 	}
+ 
  	return false;
  }
  
***************
*** 1107,1112 ****
--- 2118,2127 ----
  	 */
  	pendingActions = NIL;
  	pendingNotifies = NIL;
+ 
+ 	backendSendsNotifications = false;
+ 	backendExecutesListen = false;
+ 	backendExecutesUnlisten = false;
  }
  
  /*
***************
*** 1124,1128 ****
  	 * there is any significant delay before I commit.	OK for now because we
  	 * disallow COMMIT PREPARED inside a transaction block.)
  	 */
! 	Async_Notify((char *) recdata);
  }
--- 2139,2149 ----
  	 * there is any significant delay before I commit.	OK for now because we
  	 * disallow COMMIT PREPARED inside a transaction block.)
  	 */
! 	AsyncQueueEntry		*qe = (AsyncQueueEntry *) recdata;
! 
! 	Assert(qe->dboid == MyDatabaseId);
! 	Assert(qe->length == len);
! 
! 	Async_Notify(qe->channel, qe->payload);
  }
+ 
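(Not part of the patch.) For anyone who wants to experiment with the full-queue test outside the backend, the following is a minimal standalone sketch of the logic in asyncQueueAdvance() and asyncQueueIsFull() above. The Toy* names and the tiny constants are hypothetical, picked only so the wraparound is easy to follow; the real code uses QUEUE_PAGESIZE, QUEUE_MAX_PAGE, AsyncQueueEntryEmptySize and NOTIFY_PAYLOAD_MAX_LENGTH and works on shared memory under AsyncQueueLock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy constants, far smaller than the values in the patch. */
#define TOY_PAGESIZE   64	/* bytes per page */
#define TOY_MAX_PAGE   3	/* pages are numbered 0..TOY_MAX_PAGE */
#define TOY_ENTRY_MIN  8	/* size of an entry with an empty payload */
#define TOY_ENTRY_MAX  32	/* size of the largest entry considered */

typedef struct ToyPos
{
	int			page;
	int			offset;
} ToyPos;

/*
 * Advance pos past an entry of len bytes. If no further entry fits on the
 * current page, move to the start of the next page (wrapping around after
 * TOY_MAX_PAGE) and return true; otherwise return false.
 */
static bool
toy_advance(ToyPos *pos, int len)
{
	int			offset = pos->offset + len;

	assert(offset < TOY_PAGESIZE);

	if (offset + TOY_ENTRY_MIN >= TOY_PAGESIZE)
	{
		pos->page = (pos->page >= TOY_MAX_PAGE) ? 0 : pos->page + 1;
		pos->offset = 0;
		return true;
	}

	pos->offset = offset;
	return false;
}

/*
 * The queue counts as full if writing a worst-case entry at head would
 * force a page switch onto the page that tail still points to.
 */
static bool
toy_is_full(ToyPos head, ToyPos tail)
{
	ToyPos		lookahead = head;
	int			remain = TOY_PAGESIZE - lookahead.offset - 1;
	int			advance = (remain < TOY_ENTRY_MAX) ? remain : TOY_ENTRY_MAX;

	if (!toy_advance(&lookahead, advance))
		return false;			/* still room on the current page */

	return lookahead.page == tail.page;
}
```

With these toy values, a head at page 3, offset 40 and a tail on page 0 is reported as full, because the next page switch would wrap around onto the tail's page. The same head with the tail on page 1 is not full.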
diff -cr cvs.head/src/backend/nodes/copyfuncs.c cvs.build/src/backend/nodes/copyfuncs.c
*** cvs.head/src/backend/nodes/copyfuncs.c	2010-01-06 22:30:06.000000000 +0100
--- cvs.build/src/backend/nodes/copyfuncs.c	2010-01-08 11:00:55.000000000 +0100
***************
*** 2770,2775 ****
--- 2770,2776 ----
  	NotifyStmt *newnode = makeNode(NotifyStmt);
  
  	COPY_STRING_FIELD(conditionname);
+ 	COPY_STRING_FIELD(payload);
  
  	return newnode;
  }
diff -cr cvs.head/src/backend/nodes/equalfuncs.c cvs.build/src/backend/nodes/equalfuncs.c
*** cvs.head/src/backend/nodes/equalfuncs.c	2010-01-06 22:30:06.000000000 +0100
--- cvs.build/src/backend/nodes/equalfuncs.c	2010-01-08 11:00:55.000000000 +0100
***************
*** 1324,1329 ****
--- 1324,1330 ----
  _equalNotifyStmt(NotifyStmt *a, NotifyStmt *b)
  {
  	COMPARE_STRING_FIELD(conditionname);
+ 	COMPARE_STRING_FIELD(payload);
  
  	return true;
  }
diff -cr cvs.head/src/backend/nodes/outfuncs.c cvs.build/src/backend/nodes/outfuncs.c
*** cvs.head/src/backend/nodes/outfuncs.c	2010-01-06 22:30:06.000000000 +0100
--- cvs.build/src/backend/nodes/outfuncs.c	2010-01-08 11:00:55.000000000 +0100
***************
*** 1817,1822 ****
--- 1817,1823 ----
  	WRITE_NODE_TYPE("NOTIFY");
  
  	WRITE_STRING_FIELD(conditionname);
+ 	WRITE_STRING_FIELD(payload);
  }
  
  static void
diff -cr cvs.head/src/backend/nodes/readfuncs.c cvs.build/src/backend/nodes/readfuncs.c
*** cvs.head/src/backend/nodes/readfuncs.c	2010-01-05 12:39:25.000000000 +0100
--- cvs.build/src/backend/nodes/readfuncs.c	2010-01-08 11:00:55.000000000 +0100
***************
*** 231,236 ****
--- 231,237 ----
  	READ_LOCALS(NotifyStmt);
  
  	READ_STRING_FIELD(conditionname);
+ 	READ_STRING_FIELD(payload);
  
  	READ_DONE();
  }
diff -cr cvs.head/src/backend/parser/gram.y cvs.build/src/backend/parser/gram.y
*** cvs.head/src/backend/parser/gram.y	2010-01-06 22:30:07.000000000 +0100
--- cvs.build/src/backend/parser/gram.y	2010-01-08 11:05:17.000000000 +0100
***************
*** 399,405 ****
  
  %type <ival>	Iconst SignedIconst
  %type <list>	Iconst_list
! %type <str>		Sconst comment_text
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
--- 399,405 ----
  
  %type <ival>	Iconst SignedIconst
  %type <list>	Iconst_list
! %type <str>		Sconst comment_text notify_payload
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
***************
*** 6074,6083 ****
   *
   *****************************************************************************/
  
! NotifyStmt: NOTIFY ColId
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
  					$$ = (Node *)n;
  				}
  		;
--- 6074,6089 ----
   *
   *****************************************************************************/
  
! notify_payload:
! 			Sconst								{ $$ = $1; }
! 			| /*EMPTY*/							{ $$ = NULL; }
! 		;
! 
! NotifyStmt: NOTIFY ColId notify_payload
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
+ 					n->payload = $3;
  					$$ = (Node *)n;
  				}
  		;
diff -cr cvs.head/src/backend/storage/ipc/ipci.c cvs.build/src/backend/storage/ipc/ipci.c
*** cvs.head/src/backend/storage/ipc/ipci.c	2010-01-05 12:39:28.000000000 +0100
--- cvs.build/src/backend/storage/ipc/ipci.c	2010-01-08 11:00:55.000000000 +0100
***************
*** 219,224 ****
--- 219,225 ----
  	 */
  	BTreeShmemInit();
  	SyncScanShmemInit();
+ 	AsyncShmemInit();
  
  #ifdef EXEC_BACKEND
  
diff -cr cvs.head/src/backend/storage/lmgr/lwlock.c cvs.build/src/backend/storage/lmgr/lwlock.c
*** cvs.head/src/backend/storage/lmgr/lwlock.c	2010-01-05 12:39:29.000000000 +0100
--- cvs.build/src/backend/storage/lmgr/lwlock.c	2010-01-08 11:00:55.000000000 +0100
***************
*** 24,29 ****
--- 24,30 ----
  #include "access/clog.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "storage/ipc.h"
***************
*** 174,179 ****
--- 175,183 ----
  	/* multixact.c needs two SLRU areas */
  	numLocks += NUM_MXACTOFFSET_BUFFERS + NUM_MXACTMEMBER_BUFFERS;
  
+ 	/* async.c needs one per page for the AsyncQueue */
+ 	numLocks += NUM_ASYNC_BUFFERS;
+ 
  	/*
  	 * Add any requested by loadable modules; for backwards-compatibility
  	 * reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
diff -cr cvs.head/src/backend/tcop/utility.c cvs.build/src/backend/tcop/utility.c
*** cvs.head/src/backend/tcop/utility.c	2010-01-06 22:30:08.000000000 +0100
--- cvs.build/src/backend/tcop/utility.c	2010-01-08 11:04:38.000000000 +0100
***************
*** 928,936 ****
  		case T_NotifyStmt:
  			{
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
  				PreventCommandDuringRecovery();
  
! 				Async_Notify(stmt->conditionname);
  			}
  			break;
  
--- 928,943 ----
  		case T_NotifyStmt:
  			{
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
+ 				/* XXX the new listen/notify version can be enabled
+ 				 * for Hot Standby */
  				PreventCommandDuringRecovery();
  
! 				if (stmt->payload
! 					&& strlen(stmt->payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1)
! 					ereport(ERROR,
! 							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 							 errmsg("payload string too long")));
! 				Async_Notify(stmt->conditionname, stmt->payload);
  			}
  			break;
  
diff -cr cvs.head/src/bin/initdb/initdb.c cvs.build/src/bin/initdb/initdb.c
*** cvs.head/src/bin/initdb/initdb.c	2010-01-08 10:48:31.000000000 +0100
--- cvs.build/src/bin/initdb/initdb.c	2010-01-08 11:00:55.000000000 +0100
***************
*** 2458,2463 ****
--- 2458,2464 ----
  		"pg_xlog",
  		"pg_xlog/archive_status",
  		"pg_clog",
+ 		"pg_notify",
  		"pg_subtrans",
  		"pg_twophase",
  		"pg_multixact/members",
diff -cr cvs.head/src/bin/psql/common.c cvs.build/src/bin/psql/common.c
*** cvs.head/src/bin/psql/common.c	2010-01-05 12:39:33.000000000 +0100
--- cvs.build/src/bin/psql/common.c	2010-01-08 11:00:55.000000000 +0100
***************
*** 555,562 ****
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" received from server process with PID %d.\n"),
! 				notify->relname, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
--- 555,562 ----
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" (%s) received from server process with PID %d.\n"),
! 				notify->relname, notify->extra, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
diff -cr cvs.head/src/bin/psql/tab-complete.c cvs.build/src/bin/psql/tab-complete.c
*** cvs.head/src/bin/psql/tab-complete.c	2010-01-05 12:39:33.000000000 +0100
--- cvs.build/src/bin/psql/tab-complete.c	2010-01-08 11:00:55.000000000 +0100
***************
*** 2099,2105 ****
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(relname) FROM pg_catalog.pg_listener WHERE substring(pg_catalog.quote_ident(relname),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
--- 2099,2105 ----
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(channel) FROM pg_catalog.pg_listening() AS channel WHERE substring(pg_catalog.quote_ident(channel),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
diff -cr cvs.head/src/include/access/slru.h cvs.build/src/include/access/slru.h
*** cvs.head/src/include/access/slru.h	2010-01-05 12:39:34.000000000 +0100
--- cvs.build/src/include/access/slru.h	2010-01-08 11:00:56.000000000 +0100
***************
*** 16,21 ****
--- 16,40 ----
  #include "access/xlogdefs.h"
  #include "storage/lwlock.h"
  
+ /*
+  * Define segment size.  A page is the same BLCKSZ as is used everywhere
+  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
+  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
+  * or 64K transactions for SUBTRANS.
+  *
+  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
+  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
+  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+  * take no explicit notice of that fact in this module, except when comparing
+  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
+  *
+  * Note: this file currently assumes that segment file names will be four
+  * hex digits.	This sets a lower bound on the segment size (64K transactions
+  * for 32-bit TransactionIds).
+  */
+ #define SLRU_PAGES_PER_SEGMENT	32
+ 
  
  /*
   * Page status codes.  Note that these do not include the "dirty" bit.
diff -cr cvs.head/src/include/catalog/pg_proc.h cvs.build/src/include/catalog/pg_proc.h
*** cvs.head/src/include/catalog/pg_proc.h	2010-01-08 10:48:32.000000000 +0100
--- cvs.build/src/include/catalog/pg_proc.h	2010-01-08 11:00:56.000000000 +0100
***************
*** 4084,4089 ****
--- 4084,4091 ----
  DESCR("get the prepared statements for this session");
  DATA(insert OID = 2511 (  pg_cursor PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,16,16,16,1184}" "{o,o,o,o,o,o}" "{name,statement,is_holdable,is_binary,is_scrollable,creation_time}" _null_ pg_cursor _null_ _null_ _null_ ));
  DESCR("get the open cursors for this session");
+ DATA(insert OID = 2187 (  pg_listening	PGNSP	PGUID 12 1 10 0 f f f t t s 0 0 25 "" _null_ _null_ _null_ _null_ pg_listening _null_ _null_ _null_ ));
+ DESCR("get the channels that the current backend listens to");
  DATA(insert OID = 2599 (  pg_timezone_abbrevs	PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,1186,16}" "{o,o,o}" "{abbrev,utc_offset,is_dst}" _null_ pg_timezone_abbrevs _null_ _null_ _null_ ));
  DESCR("get the available time zone abbreviations");
  DATA(insert OID = 2856 (  pg_timezone_names		PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,1186,16}" "{o,o,o,o}" "{name,abbrev,utc_offset,is_dst}" _null_ pg_timezone_names _null_ _null_ _null_ ));
diff -cr cvs.head/src/include/commands/async.h cvs.build/src/include/commands/async.h
*** cvs.head/src/include/commands/async.h	2010-01-05 12:39:35.000000000 +0100
--- cvs.build/src/include/commands/async.h	2010-01-08 11:00:56.000000000 +0100
***************
*** 13,28 ****
  #ifndef ASYNC_H
  #define ASYNC_H
  
  extern bool Trace_notify;
  
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_Notify(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
--- 13,42 ----
  #ifndef ASYNC_H
  #define ASYNC_H
  
+ /*
+  * How long can a payload string possibly be? Actually it needs to be one
+  * byte less to provide space for the trailing terminating '\0'.
+  */
+ #define NOTIFY_PAYLOAD_MAX_LENGTH	8000
+ 
+ /*
+  * How many page slots do we reserve ?
+  */
+ #define NUM_ASYNC_BUFFERS			4
+ 
  extern bool Trace_notify;
  
+ extern void AsyncShmemInit(void);
+ 
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname, const char *payload);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_NotifyBeforeCommit(void);
! extern void AtCommit_NotifyAfterCommit(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
***************
*** 43,46 ****
--- 57,62 ----
  extern void notify_twophase_postcommit(TransactionId xid, uint16 info,
  						   void *recdata, uint32 len);
  
+ extern Datum pg_listening(PG_FUNCTION_ARGS);
+ 
  #endif   /* ASYNC_H */
diff -cr cvs.head/src/include/nodes/parsenodes.h cvs.build/src/include/nodes/parsenodes.h
*** cvs.head/src/include/nodes/parsenodes.h	2010-01-06 22:30:09.000000000 +0100
--- cvs.build/src/include/nodes/parsenodes.h	2010-01-08 11:00:56.000000000 +0100
***************
*** 2082,2087 ****
--- 2082,2088 ----
  {
  	NodeTag		type;
  	char	   *conditionname;	/* condition name to notify */
+ 	char	   *payload;		/* the payload string to be conveyed */
  } NotifyStmt;
  
  /* ----------------------
diff -cr cvs.head/src/include/storage/lwlock.h cvs.build/src/include/storage/lwlock.h
*** cvs.head/src/include/storage/lwlock.h	2010-01-05 12:39:36.000000000 +0100
--- cvs.build/src/include/storage/lwlock.h	2010-01-08 11:00:56.000000000 +0100
***************
*** 67,72 ****
--- 67,75 ----
  	AutovacuumLock,
  	AutovacuumScheduleLock,
  	SyncScanLock,
+ 	AsyncCtlLock,
+ 	AsyncQueueLock,
+ 	AsyncCommitOrderLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff -cr cvs.head/src/include/utils/errcodes.h cvs.build/src/include/utils/errcodes.h
*** cvs.head/src/include/utils/errcodes.h	2010-01-05 12:39:36.000000000 +0100
--- cvs.build/src/include/utils/errcodes.h	2010-01-08 11:00:56.000000000 +0100
***************
*** 318,323 ****
--- 318,324 ----
  #define ERRCODE_STATEMENT_TOO_COMPLEX		MAKE_SQLSTATE('5','4', '0','0','1')
  #define ERRCODE_TOO_MANY_COLUMNS			MAKE_SQLSTATE('5','4', '0','1','1')
  #define ERRCODE_TOO_MANY_ARGUMENTS			MAKE_SQLSTATE('5','4', '0','2','3')
+ #define ERRCODE_TOO_MANY_ENTRIES			MAKE_SQLSTATE('5','4', '0','3','1')
  
  /* Class 55 - Object Not In Prerequisite State (class borrowed from DB2) */
  #define ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE	MAKE_SQLSTATE('5','5', '0','0','0')
diff -cr cvs.head/src/test/regress/expected/guc.out cvs.build/src/test/regress/expected/guc.out
*** cvs.head/src/test/regress/expected/guc.out	2009-11-22 06:20:41.000000000 +0100
--- cvs.build/src/test/regress/expected/guc.out	2010-01-08 11:00:56.000000000 +0100
***************
*** 532,540 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
!   relname  
! -----------
   foo_event
  (1 row)
  
--- 532,540 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
!  pg_listening 
! --------------
   foo_event
  (1 row)
  
***************
*** 571,579 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
!  relname 
! ---------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
--- 571,579 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
!  pg_listening 
! --------------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
diff -cr cvs.head/src/test/regress/expected/sanity_check.out cvs.build/src/test/regress/expected/sanity_check.out
*** cvs.head/src/test/regress/expected/sanity_check.out	2010-01-05 12:39:38.000000000 +0100
--- cvs.build/src/test/regress/expected/sanity_check.out	2010-01-08 14:28:26.000000000 +0100
***************
*** 107,113 ****
   pg_language             | t
   pg_largeobject          | t
   pg_largeobject_metadata | t
-  pg_listener             | f
   pg_namespace            | t
   pg_opclass              | t
   pg_operator             | t
--- 107,112 ----
***************
*** 154,160 ****
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (143 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
--- 153,159 ----
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (142 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
diff -cr cvs.head/src/test/regress/sql/guc.sql cvs.build/src/test/regress/sql/guc.sql
*** cvs.head/src/test/regress/sql/guc.sql	2009-10-21 22:38:58.000000000 +0200
--- cvs.build/src/test/regress/sql/guc.sql	2010-01-08 11:00:56.000000000 +0100
***************
*** 165,171 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 165,171 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
***************
*** 174,180 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 174,180 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
#46Merlin Moncure
mmoncure@gmail.com
In reply to: Joachim Wieland (#45)
Re: Listen / Notify - what to do when the queue is full

On Fri, Jan 8, 2010 at 7:48 AM, Joachim Wieland <joe@mcknight.de> wrote:

- do we need to limit the payload to pure ASCII ? I think yes, we need
to. I also think we need to reject other payloads with elog(ERROR...).

Just noticed this...don't you mean UTF8? Are we going to force non
English speaking users to send all payloads in English?

merlin

#47Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Merlin Moncure (#46)
Re: Listen / Notify - what to do when the queue is full

Merlin Moncure wrote:

On Fri, Jan 8, 2010 at 7:48 AM, Joachim Wieland <joe@mcknight.de> wrote:

- do we need to limit the payload to pure ASCII ? I think yes, we need
to. I also think we need to reject other payloads with elog(ERROR...).

Just noticed this...don't you mean UTF8? Are we going to force non
English speaking users to send all payloads in English?

hmm, ASCII only sounds weird, though doing UTF8 would have obvious
conversion problems in some circumstances - what about bytea (or why
_do_ we have to limit this to something)?

Stefan

#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#46)
Re: Listen / Notify - what to do when the queue is full

Merlin Moncure <mmoncure@gmail.com> writes:

On Fri, Jan 8, 2010 at 7:48 AM, Joachim Wieland <joe@mcknight.de> wrote:

- do we need to limit the payload to pure ASCII ? I think yes, we need
to. I also think we need to reject other payloads with elog(ERROR...).

Just noticed this...don't you mean UTF8? Are we going to force non
English speaking users to send all payloads in English?

No, he meant ASCII. Otherwise we're going to have to deal with encoding
conversion issues.

regards, tom lane
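
The two restrictions under discussion (the length limit from the patch and the proposed ASCII-only rule) can be sketched as a single check. This is only an illustration: NOTIFY_PAYLOAD_MAX_LENGTH and the length comparison come from the patch, while check_notify_payload(), payload_is_ascii() and the PayloadCheck enum are hypothetical names, and the patch itself reports the error through ereport() rather than a return code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Value taken from the patch's async.h; it includes the trailing '\0'. */
#define NOTIFY_PAYLOAD_MAX_LENGTH 8000

/* Hypothetical result codes, not part of the patch. */
typedef enum
{
    PAYLOAD_OK,
    PAYLOAD_TOO_LONG,
    PAYLOAD_NOT_ASCII
} PayloadCheck;

/* True if every byte of s is in the 7-bit ASCII range. */
bool
payload_is_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
    {
        if ((unsigned char) *s > 0x7F)
            return false;
    }
    return true;
}

/* Hypothetical helper combining the length and ASCII checks. */
PayloadCheck
check_notify_payload(const char *payload)
{
    if (payload == NULL)
        return PAYLOAD_OK;          /* NOTIFY without a payload */

    /* Same comparison as the utility.c hunk of the patch. */
    if (strlen(payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1)
        return PAYLOAD_TOO_LONG;

    if (!payload_is_ascii(payload))
        return PAYLOAD_NOT_ASCII;

    return PAYLOAD_OK;
}
```

With a check like this, any payload containing a multibyte character in any server encoding is rejected outright, which is what avoids the conversion question but also what the objections in this thread are about.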

#49Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#48)
Re: Listen / Notify - what to do when the queue is full

On Fri, Jan 8, 2010 at 1:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

On Fri, Jan 8, 2010 at 7:48 AM, Joachim Wieland <joe@mcknight.de> wrote:

- do we need to limit the payload to pure ASCII ? I think yes, we need
to. I also think we need to reject other payloads with elog(ERROR...).

Just noticed this...don't you mean UTF8?  Are we going to force non
English speaking users to send all payloads in English?

No, he meant ASCII.  Otherwise we're going to have to deal with encoding
conversion issues.

That seems pretty awkward...instead of forcing an ancient, useless to
90% of the world encoding, why not send bytea (if necessary hex/b64
encoded)? I'm just trying to imagine how databases encoded in non
ascii superset encodings would use this feature...

If we must use ascii, we should probably offer conversion functions
to/from text, right? I definitely understand the principle of the
utility of laziness, but is this a proper case of simply dumping the
problem onto the user?

merlin

#50Andrew Chernow
ac@esilo.com
In reply to: Stefan Kaltenbrunner (#47)
Re: Listen / Notify - what to do when the queue is full

conversion problems in some circumstances - what about bytea (or why
_do_ we have to limit this to something?).

I agree with bytea. Zero conversions and the most flexible. Payload
encoding/format should be decided by the user.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/

#51Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Andrew Chernow (#50)
Re: Listen / Notify - what to do when the queue is full

Andrew Chernow wrote:

conversion problems in some circumstances - what about bytea (or why
_do_ we have to limit this to something?).

I agree with bytea. Zero conversions and the most flexible. Payload
encoding/format should be decided by the user.

yeah, that is exactly why I think this would be the most flexible
option...

Stefan

#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#49)
Re: Listen / Notify - what to do when the queue is full

Merlin Moncure <mmoncure@gmail.com> writes:

On Fri, Jan 8, 2010 at 1:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

No, he meant ASCII. Otherwise we're going to have to deal with encoding
conversion issues.

That seems pretty awkward...instead of forcing an ancient, useless to
90% of the world encoding, why not send bytea

You mean declare the notify parameter as bytea instead of text, and dump
all encoding conversion issues onto the user? We could do that I guess.
I'm not convinced that it's either more functional or more convenient to
use than a parameter that's declared as text and restricted to ASCII.

If we must use ascii, we should probably offer conversion functions
to/from text, right?

There's encode() and decode(), which is what people would wind up using
quite a lot of the time if we declare the parameter to be bytea.

regards, tom lane
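
For readers who want to see what the hex route costs, here is a minimal C sketch of the transformation that encode(..., 'hex') performs on the SQL side. The function names hex_encode_payload() and hex_encode_cstring() are hypothetical and exist only for this illustration; they are not PostgreSQL or libpq functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: write the lowercase hex form of src[0..len) into dst,
 * followed by '\0'. dst must have room for 2 * len + 1 bytes.
 * Returns the number of hex digits written.
 */
size_t
hex_encode_payload(const unsigned char *src, size_t len, char *dst)
{
    static const char digits[] = "0123456789abcdef";
    size_t      i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < len; i++)
    {
        dst[2 * i] = digits[src[i] >> 4];
        dst[2 * i + 1] = digits[src[i] & 0x0F];
    }
    dst[2 * len] = '\0';
    return 2 * len;
}

/* Convenience wrapper for NUL-terminated input in any encoding. */
size_t
hex_encode_cstring(const char *s, char *dst)
{
    assert(s != NULL);
    return hex_encode_payload((const unsigned char *) s, strlen(s), dst);
}
```

The result is pure ASCII whatever the input bytes were, but it is twice as long, so under the 8000-byte limit in the patch a hex-encoded payload could carry at most 3999 bytes of original data.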

#53Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#52)
Re: Listen / Notify - what to do when the queue is full

On Fri, Jan 8, 2010 at 4:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

On Fri, Jan 8, 2010 at 1:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

No, he meant ASCII.  Otherwise we're going to have to deal with encoding
conversion issues.

That seems pretty awkward...instead of forcing an ancient, useless to
90% of the world encoding, why not send bytea

You mean declare the notify parameter as bytea instead of text, and dump
all encoding conversion issues onto the user?  We could do that I guess.
I'm not convinced that it's either more functional or more convenient to
use than a parameter that's declared as text and restricted to ASCII.

If we must use ascii, we should probably offer conversion functions
to/from text, right?

There's encode() and decode(), which is what people would wind up using
quite a lot of the time if we declare the parameter to be bytea.

All good points...IF the notify payload can be the result of an
expression, so that in non-constant cases where you expect non-ASCII
data you can throw in a quick encode. I had assumed it couldn't
(wrongly?) because the current notify relname takes a strict literal.
If the following works, disregard the rest and I concede ASCII is fine:
notify test encode('some_string', 'hex');

A quick look at the gram.y changes suggests this won't work, if I read
them correctly. How would someone from Japan issue a notify/payload from
the console?

merlin

#54Greg Sabino Mullane
greg@turnstep.com
In reply to: Merlin Moncure (#49)
Re: Listen / Notify - what to do when the queue is full

- do we need to limit the payload to pure ASCII ? I think yes, we need
to. I also think we need to reject other payloads with elog(ERROR...).

...[snip other followups]

On the one hand, I don't see the problem with ASCII here - the
payload is meant as a quick shorthand convenience, not a literal payload
of important information. On the other, it should at least match
the current rules for the listen and notify names themselves, which
means allowing more than ASCII.

--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 201001102303
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8


#55Peter Eisentraut
peter_e@gmx.net
In reply to: Greg Sabino Mullane (#54)
Re: Listen / Notify - what to do when the queue is full

On mån, 2010-01-11 at 04:05 +0000, Greg Sabino Mullane wrote:

On the one hand, I don't see the problem with ASCII here - the
payload is meant as a quick shorthand convenience, not a literal payload
of important information.

Is it not? The notify name itself is already a quick shorthand
convenience. Who knows what the payload is actually meant for. Have
use cases been presented and analyzed?

#56Arnaud Betremieux
arnaud.betremieux@keyconsulting.fr
In reply to: Peter Eisentraut (#55)
Re: Listen / Notify - what to do when the queue is full

A use case : use NOTIFY in a rule to send the primary key of a row that
has been updated (for instance to manage a cache).

This requires a patch on top of this one, and it really is a separate
concern, but I thought I'd give the use case anyway, since I believe it
is relevant to the issues here.

I can see four kinds of NOTIFY statements :

1) The existing case : NOTIFY channel
2) With Joachim's patch : NOTIFY channel 'payload'
3) My use case : NOTIFY channel 'pay'||'load' (actually NOTIFY
channel '<table_name>#'||OLD.id)
4) Taken one step further : NOTIFY channel (SELECT payload FROM payloads
WHERE ...)

I'm working on a proof of concept patch to use Joachim's new notify
function to introduce case 3. I think this means going through the
planner and executor, so I might as well do case 4 as well. A use case I
can see for case 4 is sending information in a rule or trigger about an
updated object, when that information is stored in a separate table
(versioning or audit information for example).

Cases 1 and 2 could remain utility commands, while cases 3 and 4 could
go through the planner and the executor, the notify plan node calling
Joachim's new notify function on execution.

Best regards,
Arnaud Betremieux
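
The payload in case 3 ('<table_name>#' followed by the row id) is easy to build on the C side as well. The sketch below is only an illustration of that format with a bounds check against the payload limit from the patch; build_row_payload() is a hypothetical name and the use of a long for the key is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Value taken from the patch's async.h; it includes the trailing '\0'. */
#define NOTIFY_PAYLOAD_MAX_LENGTH 8000

/*
 * Hypothetical helper: format "<table_name>#<id>" into dst.
 * Returns the payload length, or -1 if it does not fit into dstsize bytes
 * (including the terminating '\0').
 */
int
build_row_payload(char *dst, size_t dstsize, const char *table_name, long id)
{
    int         n;

    assert(dst != NULL && table_name != NULL);
    n = snprintf(dst, dstsize, "%s#%ld", table_name, id);
    if (n < 0 || (size_t) n >= dstsize)
        return -1;              /* encoding error or truncated output */
    return n;
}
```

A caller would pass a buffer of NOTIFY_PAYLOAD_MAX_LENGTH bytes and treat -1 as "payload string too long", matching the error the patch raises for oversized literals.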

On 11/01/2010 07:58, Peter Eisentraut wrote:

On mån, 2010-01-11 at 04:05 +0000, Greg Sabino Mullane wrote:

On the one hand, I don't see the problem with ASCII here - the
payload is meant as a quick shorthand convenience, not a literal payload
of important information.

Is it not? The notify name itself is already a quick shorthand
convenience. Who knows what the payload is actually meant for. Have
use cases been presented and analyzed?

#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Arnaud Betremieux (#56)
Re: Listen / Notify - what to do when the queue is full

Arnaud Betremieux <arnaud.betremieux@keyconsulting.fr> writes:

3) My use case : NOTIFY channel 'pay'||'load' (actually NOTIFY
channel '<table_name>#'||OLD.id)
4) Taken one step further : NOTIFY channel (SELECT payload FROM payloads
WHERE ...)

I'm working on a proof of concept patch to use Joachim's new notify
function to introduce case 3. I think this means going through the
planner and executor, so I might as well do case 4 as well.

It would be a lot less work to introduce a function like send_notify()
that could be invoked within a regular SELECT. Pushing a utility
statement through the planner/executor code path will do enough violence
to the system design that such a patch would probably be rejected out of
hand.

regards, tom lane

#58Andrew Chernow
ac@esilo.com
In reply to: Arnaud Betremieux (#56)
Re: Listen / Notify - what to do when the queue is full

Arnaud Betremieux wrote:

A use case : use NOTIFY in a rule to send the primary key of a row that
has been updated (for instance to manage a cache).

This requires a patch on top of this one, and it really is a separate
concern, but I thought I'd give the use case anyway, since I believe it
is relevant to the issues here.

I can see four kinds of NOTIFY statements :

1) The existing case : NOTIFY channel
2) With Joachim's patch : NOTIFY channel 'payload'
3) My use case : NOTIFY channel 'pay'||'load' (actually NOTIFY
channel '<table_name>#'||OLD.id)
4) Taken one step further : NOTIFY channel (SELECT payload FROM payloads
WHERE ...)

I know I'd be looking to send utf8 and byteas. Can notify as it stands today
take an expression for the payload (#4)?

The other issue is that libpq expects a string, so if non-C-string-safe data
is to be sent, either a protocol change is needed or the server must hex encode
all payloads before transit and libpq must decode them, which would also require
adding a 'payload_len' member to PGnotify. The latter is better IMHO, as
protocol changes are nasty. This is only needed to support bytea, though. If
all we want is utf8, then there is no issue with libpq.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
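
To make the 'payload_len' point concrete: once a payload may contain zero bytes, the decoded data can no longer be treated as a C string, so the length has to travel separately. The following is a minimal sketch of the decoding side under that assumption; hex_value() and hex_decode_payload() are hypothetical names and are not part of libpq.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int
hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper: decode a NUL-terminated hex string into dst.
 * Returns the number of decoded bytes (what a 'payload_len' member would
 * hold), or -1 if the input is malformed or dst is too small.
 */
long
hex_decode_payload(const char *hex, unsigned char *dst, size_t dstcap)
{
    size_t      hexlen;
    size_t      i;

    assert(hex != NULL && dst != NULL);
    hexlen = strlen(hex);
    if (hexlen % 2 != 0 || hexlen / 2 > dstcap)
        return -1;

    for (i = 0; i < hexlen / 2; i++)
    {
        int         hi = hex_value(hex[2 * i]);
        int         lo = hex_value(hex[2 * i + 1]);

        if (hi < 0 || lo < 0)
            return -1;
        dst[i] = (unsigned char) ((hi << 4) | lo);
    }
    return (long) (hexlen / 2);
}
```

Decoding "610062" yields the three bytes 'a', 0, 'b'; strlen() on that buffer would report 1, which is why an explicit length is needed for bytea-style payloads.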

#59Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#57)
Re: Listen / Notify - what to do when the queue is full

On Mon, Jan 11, 2010 at 8:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm working on a proof of concept patch to use Joachim's new notify
function to introduce case 3. I think this means going through the
planner and executor, so I might as well do case 4 as well.

It would be a lot less work to introduce a function like send_notify()
that could be invoked within a regular SELECT.  Pushing a utility

+1

IMO, this neatly solves the problem and addresses some of the concerns
I was raising upthread. A bytea version would be nice but a text only
version is workable.

merlin

#60Arnaud Betremieux
arnaud.betremieux@keyconsulting.fr
In reply to: Tom Lane (#57)
Re: Listen / Notify - what to do when the queue is full

On 11/01/2010 14:25, Tom Lane wrote:

Arnaud Betremieux<arnaud.betremieux@keyconsulting.fr> writes:

3) My use case : NOTIFY channel 'pay'||'load' (actually NOTIFY
channel '<table_name>#'||OLD.id)
4) Taken one step further : NOTIFY channel (SELECT payload FROM payloads
WHERE ...)

I'm working on a proof of concept patch to use Joachim's new notify
function to introduce case 3. I think this means going through the
planner and executor, so I might as well do case 4 as well.

It would be a lot less work to introduce a function like send_notify()
that could be invoked within a regular SELECT. Pushing a utility
statement through the planner/executor code path will do enough violence
to the system design that such a patch would probably be rejected out of
hand.

Introducing a send_notify function does sound a lot simpler and cleaner,
and I think I'll try it this way. The only thing that bothers me is the
syntax :

... DO ALSO SELECT send_notify('payload')
... DO ALSO SELECT send_notify(a) FROM b

How about a new grammar for NOTIFY <channel> a_expr, which would go
through the rewriter to be transformed as a SELECT ?
so NOTIFY (SELECT a FROM b) would become SELECT send_notify(SELECT a
FROM b) ?

#61Jeff Davis
pgsql@j-davis.com
In reply to: Joachim Wieland (#45)
Re: Listen / Notify - what to do when the queue is full

Initial comments:

* compiler warnings

ipci.c: In function ‘CreateSharedMemoryAndSemaphores’:
ipci.c:228: warning: implicit declaration of function ‘AsyncShmemInit’

* 2PC

Adds complexity, and I didn't see any clear, easy solution after
reading the thread. I don't see this as a showstopper, so I'd leave
this until later.

* Hot Standby

It would be nice to have NOTIFY work with HS, but I don't think that's
going to happen this cycle. I don't think this is a showstopper,
either.

* ASCII?

I tend to think that encoding should be handled properly here, or we
should use BYTEA. My reasoning is that TEXT is a datatype, whereas ASCII
is not, and I don't see a reason to invent it now. We might as well use
real TEXT (with encoding support) or use BYTEA which works fine for
ASCII anyway.

* send_notify()

Someone suggested a function like this, which sounds useful to me. Not
a showstopper, but if it's not too difficult it might be worth
including. (Incidentally, the function signature should match the type
of the payload, which is unclear if we're going to use ASCII.)

* AsyncCommitOrderLock

This is a big red flag to me. The name by itself is scary. The fact
that it is held across a series of fairly interesting operations is
even scarier.

As you say in the comments, it's acquired before recording the
transaction commit in the clog, and released afterward. What you
actually have is something like:

AtCommit_NotifyBeforeCommit();

HOLD_INTERRUPTS();

s->state = TRANS_COMMIT;

latestXid = RecordTransactionCommit(false);

TRACE_POSTGRESQL_TRANSACTION_COMMIT(MyProc->lxid);

ProcArrayEndTransaction(MyProc, latestXid);

CallXactCallbacks(XACT_EVENT_COMMIT);

ResourceOwnerRelease(TopTransactionResourceOwner,
                     RESOURCE_RELEASE_BEFORE_LOCKS,
                     true, true);

AtEOXact_Buffers(true);

AtEOXact_RelationCache(true);

AtEarlyCommit_Snapshot();

AtEOXact_Inval(true);

smgrDoPendingDeletes(true);

AtEOXact_MultiXact();

AtCommit_NotifyAfterCommit();

That is a lot of stuff happening between the acquisition and
release. There are two things particularly scary about this (to me):

* holding an LWLock while performing I/O (in smgrDoPendingDeletes())

* holding an LWLock while acquiring another LWLock (e.g.
ProcArrayEndTransaction())

An LWLock-based deadlock is a hard deadlock -- not detected by the
deadlock detector, and there's not much you can do even if it were --
right in the transaction-completing code. That means that the whole
system would be locked up. I'm not sure that such a deadlock condition
exists, but it seems likely (if not, it's only because other areas of
the code avoided this practice), and hard to prove that it's safe. And
it's probably bad for performance to increase the length of time
transactions are serialized, even if only for cases that involve
LISTEN/NOTIFY.

I believe this needs a re-think. What is the real purpose of
AsyncCommitOrderLock, and can we achieve that another way? It seems
that you're worried about a transaction that issues a LISTEN and
commits not getting a notification from a NOTIFYing transaction
that commits concurrently (and slightly after the LISTEN). But
SignalBackends() is called after transaction commit, and should signal
all backends who committed a LISTEN before that time, right?

* The transaction IDs are used because Send_Notify() is called before
the AsyncCommitOrderLock acquire, and so the backend could potentially
be reading uncommitted notifications that are "about" to be committed
(or aborted). Then, the queue is not read further until that transaction
completes. That's not really commented effectively, and I suspect the
process could be simpler. For instance, why can't the backend always
read all of the data from the queue, notifying if the transaction is
committed and saving to a local list otherwise (which would be checked
on the next wakeup)?

* Can you clarify the big comment at the top of async.c? It's a helpful
overview, but it glosses over some of the touchy synchronization steps
going on. I find myself trying to make a map of these steps myself.

Regards,
Jeff Davis

#62Joachim Wieland
joe@mcknight.de
In reply to: Jeff Davis (#61)
Re: Listen / Notify - what to do when the queue is full

Hi Jeff,

Thanks a lot for your review. I will reply to it in detail later, but
I'd like to answer your two main questions right away.

On Tue, Jan 19, 2010 at 8:08 AM, Jeff Davis <pgsql@j-davis.com> wrote:

* AsyncCommitOrderLock

I believe this needs a re-think. What is the real purpose of
AsyncCommitOrderLock, and can we achieve that another way? It seems
that you're worried about a transaction that issues a LISTEN and
commits not getting a notification from a NOTIFYing transaction
that commits concurrently (and slightly after the LISTEN).

Yes, that is exactly the point. However I am not worried about a
notification getting lost but rather about determining the visibility
of the notifications.

In the end we need to know the order of LISTEN, UNLISTEN and NOTIFY
commits to determine who should receive which notifications. Since you
cannot determine retrospectively whether xid1 committed before or
after xid2, I enforced the order with an LWLock and by saving the list
of xids currently being committed.

There are also two examples in
http://archives.postgresql.org/pgsql-hackers/2009-12/msg00790.php
about that issue.

But SignalBackends() is called after transaction commit, and should signal
all backends who committed a LISTEN before that time, right?

Yes, every listening backend is signaled, but that doesn't help to
determine the exact order of the almost-concurrent events that
happened before.

* The transaction IDs are used because Send_Notify() is called before
the AsyncCommitOrderLock acquire, and so the backend could potentially
be reading uncommitted notifications that are "about" to be committed
(or aborted). Then, the queue is not read further until that transaction
completes. That's not really commented effectively, and I suspect the
process could be simpler. For instance, why can't the backend always
read all of the data from the queue, notifying if the transaction is
committed and saving to a local list otherwise (which would be checked
on the next wakeup)?

It's true that the backends could always read up to the end of the
queue and copy everything into the local memory. However you still
need to apply the same checks before you deliver the notifications:
You need to make sure that the transaction has committed and that you
were listening to the channels of the notifications at the time they
got sent / committed. Also you need to copy really _everything_
because you could start to listen to a channel after copying its
uncommitted notifications.

There are other reasons (tail pointer management and signaling
strategy) but in the end it seemed more straightforward to stop as
soon as we hit an uncommitted notification. We will receive a signal
for it eventually anyway and can then start again and read further.
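
The "stop at the first uncommitted notification" rule can be sketched as follows. This is an illustrative model only, not the code in async.c: the queue is a plain array, the `state` field stands in for a clog lookup on the entry's xid, and `read_until_uncommitted` is a hypothetical name. Channel filtering is left out to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical commit states; the real patch would consult the clog by xid. */
typedef enum XactState
{
	XACT_IN_PROGRESS,
	XACT_COMMITTED,
	XACT_ABORTED
} XactState;

typedef struct QueueEntry
{
	unsigned int xid;			/* transaction that queued the entry */
	XactState	state;			/* stand-in for a clog lookup on xid */
} QueueEntry;

/*
 * Advance a backend's read position from *pos towards head.  Committed
 * entries are counted as deliverable, aborted ones are skipped, and the
 * scan stops at the first entry whose transaction is still in progress.
 * Returns the number of deliverable entries; *pos is left pointing at the
 * first entry that has not been consumed.
 */
size_t
read_until_uncommitted(const QueueEntry *queue, size_t head, size_t *pos)
{
	size_t		delivered = 0;

	assert(*pos <= head);
	while (*pos < head)
	{
		const QueueEntry *e = &queue[*pos];

		if (e->state == XACT_IN_PROGRESS)
			break;				/* resume here after the next signal */
		if (e->state == XACT_COMMITTED)
			delivered++;
		(*pos)++;
	}
	return delivered;
}
```

Because the position never moves past an in-progress entry, a second call after that transaction commits picks up exactly where the first one stopped.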

Also I think (but I have no numbers about it) that it makes the
backends work more on the same slru pages.

Joachim

#63Arnaud Betremieux
arnaud.betremieux@keyconsulting.fr
In reply to: Jeff Davis (#61)
1 attachment(s)
Re: Listen / Notify - what to do when the queue is full

Regarding the send_notify function, I have been working on it and have a
patch (attached) that applies on top of Joachim's.

It introduces a send_notify SQL function that calls the Async_Notify C
function.

It's pretty straightforward and will need to be refined to take into
account the encoding decisions that are made in Async_Notify, but it
should be a good starting point, and it doesn't introduce much in the
way of complexity.

On a side note, since that might be a discussion for another thread, it
would be nice, and more DBA-friendly, to have some kind of syntactic
sugar around it, so that one could write
NOTIFY channel (SELECT col FROM table);
instead of
SELECT send_notify('channel', (SELECT col FROM table));

Best regards,
Arnaud Betremieux

Attachments:

listennotify.function.diff (text/x-patch)
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 0545fd7..aed8428 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -386,3 +389,24 @@ pg_typeof(PG_FUNCTION_ARGS)
 {
 	PG_RETURN_OID(get_fn_expr_argtype(fcinfo->flinfo, 0));
 }
+
+
+/*
+ * send_notify -
+ *	  Send a notification to listening clients
+ */
+Datum
+send_notify(PG_FUNCTION_ARGS)
+{
+	text	   *channel = PG_GETARG_TEXT_PP(0);
+	text	   *payload = PG_GETARG_TEXT_PP(1);
+	const char *channelStr;
+	const char *payloadStr;
+
+	channelStr = text_to_cstring(channel);
+	payloadStr = text_to_cstring(payload);
+
+	Async_Notify(channelStr, payloadStr);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 784da1b..471c9f7 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4728,7 +4728,8 @@ DATA(insert OID = 3113 (  last_value	PGNSP PGUID 12 1 0 0 f t f t f i 1 0 2283 "
 DESCR("fetch the last row value");
 DATA(insert OID = 3114 (  nth_value		PGNSP PGUID 12 1 0 0 f t f t f i 2 0 2283 "2283 23" _null_ _null_ _null_ _null_ window_nth_value _null_ _null_ _null_ ));
 DESCR("fetch the Nth row value");
-
+DATA(insert OID = 3115 (  send_notify  PGNSP PGUID 12 1 0 0 f f f f f v 2 0 2278 "25 25" _null_ _null_ _null_ _null_ send_notify _null_ _null_ _null_));
+DESCR("send a notification to clients");

 /*
  * Symbolic values for provolatile column: these indicate whether the result
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 91411a4..0e6f6e3 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1031,4 +1031,6 @@ extern Datum pg_prepared_statement(PG_FUNCTION_ARGS);
 /* utils/mmgr/portalmem.c */
 extern Datum pg_cursor(PG_FUNCTION_ARGS);

+extern Datum send_notify(PG_FUNCTION_ARGS);
+
 #endif   /* BUILTINS_H */
#64Jeff Davis
pgsql@j-davis.com
In reply to: Joachim Wieland (#42)
Re: Listen / Notify - what to do when the queue is full

On Wed, 2009-12-09 at 11:43 +0100, Joachim Wieland wrote:

Examples:

Backend 1:                    Backend 2:

transaction starts
NOTIFY foo;
commit starts
                              transaction starts
                              LISTEN foo;
                              commit starts
                              commit to clog
commit to clog

=> Backend 2 will receive Backend 1's notification.

How does the existing notification mechanism solve this problem? Is it
really a problem? Why would Backend2 expect to receive the notification?

Backend 1:                    Backend 2:

transaction starts
NOTIFY foo;
commit starts
                              transaction starts
                              UNLISTEN foo;
                              commit starts
                              commit to clog
commit to clog

=> Backend 2 will not receive Backend 1's notification.

This is the same problem, except that it doesn't matter. A spurious
notification is not a bug, right?

Regards,
Jeff Davis

#65Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#64)
Re: Listen / Notify - what to do when the queue is full

Jeff Davis <pgsql@j-davis.com> writes:

How does the existing notification mechanism solve this problem? Is it
really a problem? Why would Backend2 expect to receive the notification?

The intended way to use LISTEN/NOTIFY for status tracking is

1. LISTEN foo; (and commit the listen)
2. examine current database state
3. assume that we'll get a NOTIFY for any change that commits
subsequently to what we saw in step 2

In the current implementation, a transaction that is in process of
commit during step 1 might possibly not see your pg_listener record
as committed, and so it might not send you a NOTIFY for whatever it did.
If it still hasn't committed when you perform step 2, then you'd fail to
see its changes as part of your initial state, *and* you'd not get a
NOTIFY when the changes did become visible. The way we prevent this
race condition is that a listener takes exclusive lock on pg_listener
before entering its record, and doesn't release the lock until after
committing. Notifiers likewise take exclusive lock. This serializes
things so that either the modifying transaction commits before the
listener completes step 1 (and hence the listener will see its updates
in step 2), or the listener is guaranteed to have a committed record
in pg_listener when the modifying process determines whom to notify.

I guess Joachim is trying to provide a similar guarantee for the new
implementation, but I'm not clear on why it would require locking.
The new implementation is broadcast and ISTM it shouldn't require the
modifying transaction to know which processes are listening.

I haven't read the patch but I agree that the description you give is
pretty scary from a performance standpoint. More locks around
transaction commit doesn't seem like a good idea. If they're only taken
when an actual LISTEN or NOTIFY has happened in the current transaction,
that'd be okay (certainly no worse than what happens now) but the naming
suggested that this'd happen unconditionally.

regards, tom lane

#66Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#65)
Re: Listen / Notify - what to do when the queue is full

On Tue, 2010-01-19 at 19:05 -0500, Tom Lane wrote:

I guess Joachim is trying to provide a similar guarantee for the new
implementation, but I'm not clear on why it would require locking.
The new implementation is broadcast and ISTM it shouldn't require the
modifying transaction to know which processes are listening.

I think there is a better way. I'll dig into it a little more.

I haven't read the patch but I agree that the description you give is
pretty scary from a performance standpoint. More locks around
transaction commit doesn't seem like a good idea.

I was also worried about holding multiple LWLocks at once -- is such
practice generally avoided in the rest of the code?

If they're only taken
when an actual LISTEN or NOTIFY has happened in the current transaction,
that'd be okay (certainly no worse than what happens now) but the naming
suggested that this'd happen unconditionally.

It appears that the locks are only taken when LISTEN or NOTIFY is
involved.

Regards,
Jeff Davis

#67Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#66)
Re: Listen / Notify - what to do when the queue is full

Jeff Davis <pgsql@j-davis.com> writes:

I was also worried about holding multiple LWLocks at once -- is such
practice generally avoided in the rest of the code?

It's allowed but remember that there is no deadlock detection in lwlock.c.
You must be very certain that there is only one possible order in which
such locks could be taken. Interactions with heavyweight locks would be
bad news as well.
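
One common way to guarantee a single acquisition order is to give every lock a rank and refuse any acquisition that does not go strictly upward. The sketch below illustrates that idea only; it is not how lwlock.c works, and `LockRankTracker`, `rank_acquire` and `rank_release` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HELD 8

/* Hypothetical per-backend record of the lock ranks currently held. */
typedef struct LockRankTracker
{
	int			held[MAX_HELD];	/* ranks, in acquisition order */
	int			nheld;
} LockRankTracker;

/*
 * Record acquisition of a lock with the given rank.  Returns false, and
 * records nothing, if the rank is not strictly greater than the rank of
 * the most recently acquired lock, i.e. if the order would be violated.
 */
bool
rank_acquire(LockRankTracker *t, int rank)
{
	assert(t->nheld < MAX_HELD);
	if (t->nheld > 0 && rank <= t->held[t->nheld - 1])
		return false;
	t->held[t->nheld++] = rank;
	return true;
}

/* Release the most recently acquired lock (LIFO release only, for brevity). */
void
rank_release(LockRankTracker *t)
{
	assert(t->nheld > 0);
	t->nheld--;
}
```

If every code path acquires locks through such a check, two backends can never wait on each other in opposite orders, which is the property that has to be proven by hand when no deadlock detector is available.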

It appears that the locks are only taken when LISTEN or NOTIFY is
involved.

On the whole it might be better if a heavyweight lock were used,
such that it'll automatically clean up after commit. (I'm still
wondering if we couldn't do without the lock altogether though.)

regards, tom lane

#68Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#67)
Re: Listen / Notify - what to do when the queue is full

On Tue, 2010-01-19 at 19:24 -0500, Tom Lane wrote:

Jeff Davis <pgsql@j-davis.com> writes:

I was also worried about holding multiple LWLocks at once -- is such
practice generally avoided in the rest of the code?

It's allowed but remember that there is no deadlock detection in lwlock.c.
You must be very certain that there is only one possible order in which
such locks could be taken. Interactions with heavyweight locks would be
bad news as well.

That was my worry initially.

On the whole it might be better if a heavyweight lock were used,
such that it'll automatically clean up after commit. (I'm still
wondering if we couldn't do without the lock altogether though.)

Yes, I think there's a better way as well. I'll look into it.

Regards,
Jeff Davis

#69Joachim Wieland
joe@mcknight.de
In reply to: Tom Lane (#65)
Re: Listen / Notify - what to do when the queue is full

On Wed, Jan 20, 2010 at 1:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I guess Joachim is trying to provide a similar guarantee for the new
implementation, but I'm not clear on why it would require locking.
The new implementation is broadcast and ISTM it shouldn't require the
modifying transaction to know which processes are listening.

It is rather about a listening backend seeing a notification in the
global queue without knowing if it should deliver the notification to
its frontend or not. The backend needs to know if its own LISTEN
committed before or after the NOTIFY committed that it sees in the
queue. As I understand it, there is no way to find out whether a
transaction has committed before or after another transaction. If we
had this, it would be easy without a lock.

I haven't read the patch but I agree that the description you give is
pretty scary from a performance standpoint.  More locks around
transaction commit doesn't seem like a good idea.  If they're only taken
when an actual LISTEN or NOTIFY has happened in the current transaction,
that'd be okay (certainly no worse than what happens now) but the naming
suggested that this'd happen unconditionally.

The lock is taken exclusively by transactions doing LISTEN/UNLISTEN
and in shared mode by transactions that execute only NOTIFY. It should
really not degrade performance but I understand Jeff's concerns about
deadlocks.
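
The locking rule described here can be written down as a small decision function. This is a sketch of the rule as stated in this message, not code from the patch; `CommitLockMode` and `commit_lock_mode` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock modes for the commit-order lock discussed above. */
typedef enum CommitLockMode
{
	COMMIT_LOCK_NONE,			/* transaction did no LISTEN/UNLISTEN/NOTIFY */
	COMMIT_LOCK_SHARED,			/* transaction executed only NOTIFY */
	COMMIT_LOCK_EXCLUSIVE		/* transaction executed LISTEN or UNLISTEN */
} CommitLockMode;

/*
 * Pick the lock mode for a committing transaction.  LISTEN/UNLISTEN wins
 * over NOTIFY, so a transaction that did both takes the exclusive lock.
 */
CommitLockMode
commit_lock_mode(bool did_listen_or_unlisten, bool did_notify)
{
	if (did_listen_or_unlisten)
		return COMMIT_LOCK_EXCLUSIVE;
	if (did_notify)
		return COMMIT_LOCK_SHARED;
	return COMMIT_LOCK_NONE;
}
```

Under this rule, concurrent NOTIFY-only transactions do not serialize against each other; they only wait for a committing LISTEN or UNLISTEN.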

Joachim

#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joachim Wieland (#69)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland <joe@mcknight.de> writes:

On Wed, Jan 20, 2010 at 1:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I guess Joachim is trying to provide a similar guarantee for the new
implementation, but I'm not clear on why it would require locking.

It is rather about a listening backend seeing a notification in the
global queue without knowing if it should deliver the notification to
its frontend or not. The backend needs to know if its own LISTEN
committed before or after the NOTIFY committed that it sees in the
queue.

In that case I think you've way overcomplicated matters. Just deliver
the notification. We don't really care if the listener gets additional
notifications; the only really bad case would be if it failed to get an
event that was generated after it committed a LISTEN.

regards, tom lane

#71Joachim Wieland
joe@mcknight.de
In reply to: Tom Lane (#70)
Re: Listen / Notify - what to do when the queue is full

On Wed, Jan 20, 2010 at 5:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

In that case I think you've way overcomplicated matters.  Just deliver
the notification.  We don't really care if the listener gets additional
notifications; the only really bad case would be if it failed to get an
event that was generated after it committed a LISTEN.

Okay, what about unprocessed notifications in the queue and a backend
executing UNLISTEN: can we assume that it is not interested in
notifications anymore once it executes UNLISTEN and discard all of
them even though there might be notifications that have been sent (and
committed) before the UNLISTEN committed?

Joachim

#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joachim Wieland (#71)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland <joe@mcknight.de> writes:

Okay, what about unprocessed notifications in the queue and a backend
executing UNLISTEN: can we assume that it is not interested in
notifications anymore once it executes UNLISTEN and discard all of
them even though there might be notifications that have been sent (and
committed) before the UNLISTEN committed?

Yes. That is the case with the existing implementation as well, no?
We don't consider sending notifies until transaction end, so anything
that commits during the xact in which you UNLISTEN will get dropped.
Again, a little bit of sloppiness here doesn't seem important. Issuing
UNLISTEN implies the client is not interested anymore.
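
The relaxed rule amounts to checking a notification's channel against whatever the backend is listening on at the moment it reads the queue, with no comparison of commit order. A minimal sketch of that rule follows, assuming a small fixed-size channel list; `ListenSet` and its functions are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CHANNELS 16

/* Hypothetical set of channels a backend has committed a LISTEN on. */
typedef struct ListenSet
{
	const char *channels[MAX_CHANNELS];
	int			nchannels;
} ListenSet;

bool
listen_set_contains(const ListenSet *ls, const char *channel)
{
	int			i;

	for (i = 0; i < ls->nchannels; i++)
	{
		if (strcmp(ls->channels[i], channel) == 0)
			return true;
	}
	return false;
}

/* Applied when a transaction containing LISTEN commits. */
void
listen_set_add(ListenSet *ls, const char *channel)
{
	if (listen_set_contains(ls, channel))
		return;
	assert(ls->nchannels < MAX_CHANNELS);
	ls->channels[ls->nchannels++] = channel;
}

/* Applied when a transaction containing UNLISTEN commits. */
void
listen_set_remove(ListenSet *ls, const char *channel)
{
	int			i;

	for (i = 0; i < ls->nchannels; i++)
	{
		if (strcmp(ls->channels[i], channel) == 0)
		{
			/* overwrite the removed slot with the last entry */
			ls->channels[i] = ls->channels[ls->nchannels - 1];
			ls->nchannels--;
			return;
		}
	}
}

/*
 * Deliver a committed notification if and only if the backend is listening
 * on its channel right now.  Extra notifications after a fresh LISTEN and
 * dropped ones after an UNLISTEN are both accepted by this rule.
 */
bool
should_deliver(const ListenSet *ls, const char *channel)
{
	return listen_set_contains(ls, channel);
}
```

Notifications still queued when an UNLISTEN commits are simply never matched, which is the "little bit of sloppiness" considered acceptable here.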

regards, tom lane

#73Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#72)
Re: Listen / Notify - what to do when the queue is full

On Wed, 2010-01-20 at 15:54 -0500, Tom Lane wrote:

Joachim Wieland <joe@mcknight.de> writes:

Okay, what about unprocessed notifications in the queue and a backend
executing UNLISTEN: can we assume that it is not interested in
notifications anymore once it executes UNLISTEN and discard all of
them even though there might be notifications that have been sent (and
committed) before the UNLISTEN committed?

Yes. That is the case with the existing implementation as well, no?
We don't consider sending notifies until transaction end, so anything
that commits during the xact in which you UNLISTEN will get dropped.

Only if the transaction containing UNLISTEN commits. Are you saying it
would also be OK to drop NOTIFYs if a backend's UNLISTEN transaction
aborts?

Again, a little bit of sloppiness here doesn't seem important. Issuing
UNLISTEN implies the client is not interested anymore.

Thinking out loud: If we're taking this approach, I wonder if it might
be a good idea to PreventTransactionChain for LISTEN and UNLISTEN? It
might simplify things for users because they wouldn't be expecting
transaction-like behavior, except for the NOTIFYs themselves.

Regards,
Jeff Davis

#74Joachim Wieland
joe@mcknight.de
In reply to: Jeff Davis (#73)
Re: Listen / Notify - what to do when the queue is full

On Wed, Jan 20, 2010 at 11:08 PM, Jeff Davis <pgsql@j-davis.com> wrote:

Yes.  That is the case with the existing implementation as well, no?
We don't consider sending notifies until transaction end, so anything
that commits during the xact in which you UNLISTEN will get dropped.

Only if the transaction containing UNLISTEN commits. Are you saying it
would also be OK to drop NOTIFYs if a backend's UNLISTEN transaction
aborts?

If the backend's UNLISTEN transaction aborts, then it has never
executed UNLISTEN...

So it will continue to get notifications (if it has executed a LISTEN before).

Joachim

#75Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#73)
Re: Listen / Notify - what to do when the queue is full

Jeff Davis <pgsql@j-davis.com> writes:

On Wed, 2010-01-20 at 15:54 -0500, Tom Lane wrote:

Yes. That is the case with the existing implementation as well, no?
We don't consider sending notifies until transaction end, so anything
that commits during the xact in which you UNLISTEN will get dropped.

Only if the transaction containing UNLISTEN commits. Are you saying it
would also be OK to drop NOTIFYs if a backend's UNLISTEN transaction
aborts?

No, I would say not, but that wasn't being proposed was it? The
decisions about what to do are only made at/after commit.

Thinking out loud: If we're taking this approach, I wonder if it might
be a good idea to PreventTransactionChain for LISTEN and UNLISTEN?

That shouldn't be necessary IMO. There's never been such a restriction
before.

regards, tom lane

#76Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#67)
Re: Listen / Notify - what to do when the queue is full

On Tue, 2010-01-19 at 19:24 -0500, Tom Lane wrote:

(I'm still
wondering if we couldn't do without the lock altogether though.)

Here's the problem as I see it:

If we insert the notifications into the queue before actually recording
the commit, there's a window in between where another backend could
perform the expected sequence as you wrote:

1. LISTEN foo; (and commit the listen)
2. examine current database state
3. assume that we'll get a NOTIFY for any change that commits
subsequently to what we saw in step 2

and miss the NOTIFYs, and not see the updated database state.

But I don't think that the NOTIFYs will actually be missed. Once put
into the queue, the notification will only be removed from the queue
after all backends have read it. But no backend will advance past it as
long as the notification is from an uncommitted transaction. By the time
the notifying transaction is committed, the listening transaction will
also be committed, and therefore subscribed to the queue.

The newly-listening backend will be awakened properly as well, because
that's done after the notifying transaction commits, and therefore will
wake up any listening transactions that committed earlier.

However, there's still a problem inserting into the queue when no
backends are listening. Perhaps that can be solved right before we wake
up the listening backends after the notifying transaction commits: if
there are no listening backends, clear the queue.

We still might get spurious notifications if they were committed before
the LISTEN transaction was committed. And we also might get spurious
notifications if the UNLISTEN doesn't take effect quite quickly enough.
Those are both acceptable.

If the above scheme is too complex, we can always use a heavyweight
lock. However, there's no pg_listener so it's not obvious what LOCKTAG
to use. We can just pick something arbitrary, like the Oid of the new
pg_listening() function, I suppose. Is there any precedent for that?

Thoughts?

Regards,
Jeff Davis

#77Joachim Wieland
joe@mcknight.de
In reply to: Jeff Davis (#76)
Re: Listen / Notify - what to do when the queue is full

On Thu, Jan 21, 2010 at 3:06 AM, Jeff Davis <pgsql@j-davis.com> wrote:

Here's the problem as I see it:

You are writing a lot of true facts but I fail to see a real
problem... What exactly do you see as a problem?

The only time you are writing "problem" is in this paragraph:

However, there's still a problem inserting into the queue when no
backends are listening. Perhaps that can be solved right before we wake
up the listening backends after the notifying transaction commits: if
there are no listening backends, clear the queue.

This is already done: in SignalBackends(), a notifying transaction
counts the number of listening backends. If no other backend is
listening, then it signals itself so that the queue gets cleaned (i.e.
the global pointer gets forwarded and the slru pages will be truncated
if possible).
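
Forwarding the global pointer can be pictured as taking the minimum read position over all listening backends, and jumping straight to the head when nobody is listening. The sketch below models only that computation; `BackendSlot` and `compute_new_tail` are hypothetical names, positions are plain counters, and page wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-backend entry in shared memory. */
typedef struct BackendSlot
{
	bool		listening;		/* has this backend committed a LISTEN? */
	size_t		pos;			/* next queue position it will read */
} BackendSlot;

/*
 * Compute the new queue tail: the smallest read position among listening
 * backends, or head itself when no backend is listening.  Everything
 * before the returned position has been read by all listeners and can be
 * truncated.
 */
size_t
compute_new_tail(const BackendSlot *slots, size_t nslots, size_t head)
{
	size_t		tail = head;
	size_t		i;

	for (i = 0; i < nslots; i++)
	{
		if (!slots[i].listening)
			continue;
		assert(slots[i].pos <= head);
		if (slots[i].pos < tail)
			tail = slots[i].pos;
	}
	return tail;
}
```

With no listeners the loop never lowers `tail`, so the whole queue becomes reclaimable, which corresponds to the self-signaling case described above.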

I have been working on simplifying the patch yesterday, I still need
to adapt comments and review it again but I am planning to post the
new version tonight. Then we have a common base again to discuss
further :-)

Thanks for your review,
Joachim

#78Jeff Davis
pgsql@j-davis.com
In reply to: Joachim Wieland (#77)
Re: Listen / Notify - what to do when the queue is full

On Thu, 2010-01-21 at 10:14 +0100, Joachim Wieland wrote:

On Thu, Jan 21, 2010 at 3:06 AM, Jeff Davis <pgsql@j-davis.com> wrote:

Here's the problem as I see it:

You are writing a lot of true facts but I fail to see a real
problem... What exactly do you see as a problem?

I worded that in a confusing way; I apologize. My point was that I don't
think we need a lock, because I don't see any situation in which the
notifications would be lost.

I have been working on simplifying the patch yesterday, I still need
to adapt comments and review it again but I am planning to post the
new version tonight. Then we have a common base again to discuss
further :-)

Sounds good.

Regards,
Jeff Davis

#79Joachim Wieland
joe@mcknight.de
In reply to: Jeff Davis (#78)
1 attachment(s)
Re: Listen / Notify - what to do when the queue is full

On Thu, Jan 21, 2010 at 6:21 PM, Jeff Davis <pgsql@j-davis.com> wrote:

I have been working on simplifying the patch yesterday, I still need
to adapt comments and review it again but I am planning to post the
new version tonight. Then we have a common base again to discuss
further :-)

Sounds good.

Attached is a new version of the patch. I haven't changed it with
regard to 2PC or the ASCII / bytea issues, but the transactional
behavior is now less strict, which allows removing the
AsyncCommitOrderLock.

I will leave the commitfest status on "Waiting on author" but I will
be travelling for the weekend and wanted to share my current version.

Joachim

Attachments:

listennotify.8.diff (text/x-diff; charset=US-ASCII)
diff -cr cvs.head/src/backend/access/transam/slru.c cvs.build/src/backend/access/transam/slru.c
*** cvs.head/src/backend/access/transam/slru.c	2010-01-05 12:39:22.000000000 +0100
--- cvs.build/src/backend/access/transam/slru.c	2010-01-22 00:42:56.000000000 +0100
***************
*** 58,83 ****
  #include "storage/shmem.h"
  #include "miscadmin.h"
  
- 
- /*
-  * Define segment size.  A page is the same BLCKSZ as is used everywhere
-  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
-  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
-  * or 64K transactions for SUBTRANS.
-  *
-  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
-  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
-  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
-  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
-  * take no explicit notice of that fact in this module, except when comparing
-  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
-  *
-  * Note: this file currently assumes that segment file names will be four
-  * hex digits.	This sets a lower bound on the segment size (64K transactions
-  * for 32-bit TransactionIds).
-  */
- #define SLRU_PAGES_PER_SEGMENT	32
- 
  #define SlruFileName(ctl, path, seg) \
  	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
  
--- 58,63 ----
diff -cr cvs.head/src/backend/access/transam/xact.c cvs.build/src/backend/access/transam/xact.c
*** cvs.head/src/backend/access/transam/xact.c	2010-01-20 20:08:24.000000000 +0100
--- cvs.build/src/backend/access/transam/xact.c	2010-01-22 00:53:47.000000000 +0100
***************
*** 1728,1735 ****
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* NOTIFY commit must come before lower-level cleanup */
! 	AtCommit_Notify();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
--- 1728,1735 ----
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* Insert notifications sent by the NOTIFY command into the queue */
! 	AtCommit_NotifyBeforeCommit();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
***************
*** 1804,1809 ****
--- 1804,1814 ----
  
  	AtEOXact_MultiXact();
  
+ 	/*
+ 	 * Clean up Notify buffers and signal listening backends.
+ 	 */
+ 	AtCommit_NotifyAfterCommit();
+ 
  	ResourceOwnerRelease(TopTransactionResourceOwner,
  						 RESOURCE_RELEASE_LOCKS,
  						 true, true);
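
The ordering that the two new calls establish can be sketched as follows. This is a simplified, standalone model and not backend code: CommitTrace and commit_with_notifies are hypothetical names, and the queue_write_ok flag stands in for a real write error such as a full disk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record of which commit steps have happened. */
typedef struct CommitTrace
{
	bool		queue_written;		/* notifications placed in the queue */
	bool		clog_committed;		/* transaction marked committed */
	bool		listeners_signaled; /* listening backends were signaled */
} CommitTrace;

/*
 * Model of the commit sequence: insert into the queue first (this may
 * still fail and abort the transaction), then commit to clog, then signal.
 */
static bool
commit_with_notifies(CommitTrace *t, bool queue_write_ok)
{
	t->queue_written = false;
	t->clog_committed = false;
	t->listeners_signaled = false;

	/* step 1: what AtCommit_NotifyBeforeCommit() does in the patch */
	if (!queue_write_ok)
		return false;			/* error before commit: roll back */
	t->queue_written = true;

	/* step 2: the commit is recorded in clog */
	t->clog_committed = true;

	/* step 3: what AtCommit_NotifyAfterCommit() does in the patch */
	assert(t->queue_written && t->clog_committed);
	t->listeners_signaled = true;
	return true;
}
```

The point of the split is visible in the failure path: when the queue write fails, nothing has been committed and nobody has been signaled.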
diff -cr cvs.head/src/backend/catalog/Makefile cvs.build/src/backend/catalog/Makefile
*** cvs.head/src/backend/catalog/Makefile	2010-01-06 22:30:05.000000000 +0100
--- cvs.build/src/backend/catalog/Makefile	2010-01-22 00:42:56.000000000 +0100
***************
*** 30,36 ****
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
! 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_listener.h pg_description.h \
  	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
--- 30,36 ----
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
! 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_description.h \
  	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
diff -cr cvs.head/src/backend/commands/async.c cvs.build/src/backend/commands/async.c
*** cvs.head/src/backend/commands/async.c	2010-01-05 12:39:22.000000000 +0100
--- cvs.build/src/backend/commands/async.c	2010-01-22 00:45:34.000000000 +0100
***************
*** 14,44 ****
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  * 1. Multiple backends on same machine.  Multiple backends listening on
!  *	  one relation.  (Note: "listening on a relation" is not really the
!  *	  right way to think about it, since the notify names need not have
!  *	  anything to do with the names of relations actually in the database.
!  *	  But this terminology is all over the code and docs, and I don't feel
!  *	  like trying to replace it.)
!  *
!  * 2. There is a tuple in relation "pg_listener" for each active LISTEN,
!  *	  ie, each relname/listenerPID pair.  The "notification" field of the
!  *	  tuple is zero when no NOTIFY is pending for that listener, or the PID
!  *	  of the originating backend when a cross-backend NOTIFY is pending.
!  *	  (We skip writing to pg_listener when doing a self-NOTIFY, so the
!  *	  notification field should never be equal to the listenerPID field.)
!  *
!  * 3. The NOTIFY statement itself (routine Async_Notify) just adds the target
!  *	  relname to a list of outstanding NOTIFY requests.  Actual processing
!  *	  happens if and only if we reach transaction commit.  At that time (in
!  *	  routine AtCommit_Notify) we scan pg_listener for matching relnames.
!  *	  If the listenerPID in a matching tuple is ours, we just send a notify
!  *	  message to our own front end.  If it is not ours, and "notification"
!  *	  is not already nonzero, we set notification to our own PID and send a
!  *	  PROCSIG_NOTIFY_INTERRUPT signal to the receiving process (indicated by
!  *	  listenerPID).
!  *	  BTW: if the signal operation fails, we presume that the listener backend
!  *	  crashed without removing this tuple, and remove the tuple for it.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
--- 14,68 ----
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  *
!  * 1. Multiple backends on same machine. Multiple backends listening on
!  *	  several channels. (This was previously called a "relation" even though it
!  *	  is just an identifier and has nothing to do with a database relation.)
!  *
!  * 2. There is one central queue, kept in SLRU-backed, file-based storage
!  *    (directory pg_notify/), with several pages mapped into shared memory.
!  *
!  *    There is no central record of which backend listens on which channel;
!  *    every backend has its own list.
!  *
!  *    Every backend that is listening on at least one channel registers by
!  *    entering its Pid into the array of all backends. It then scans all
!  *    incoming notifications and compares the notified channels with its list.
!  *
!  *    In case there is a match it delivers the corresponding notification to
!  *    its frontend.
!  *
!  * 3. The NOTIFY statement (routine Async_Notify) stores the notification
!  *    in a list which will not be processed until transaction end. Every
!  *    notification can additionally send a "payload" which is an extra text
!  *    parameter to convey arbitrary information to the recipient.
!  *
!  *    Duplicate notifications from the same transaction are sent out as one
!  *    notification only. This is done to save work when for example a trigger
!  *    on a 2 million row table fires a notification for each row that has been
!  *    changed. If the application needs to receive every single notification
!  *    that has been sent, it can easily add some unique string into the extra
!  *    payload parameter.
!  *
!  *    Once the transaction commits, AtCommit_NotifyBeforeCommit() performs the
!  *    required changes regarding listeners (Listen/Unlisten) and then adds the
!  *    pending notifications to the head of the queue. The head pointer of the
!  *    queue always points to the next free position and a position is just a
!  *    page number and the offset in that page. This is done before marking the
!  *    transaction as committed in clog. If we run into problems writing the
!  *    notifications, we can still call elog(ERROR, ...) and the transaction
!  *    will roll back.
!  *
!  *    Once we have put all of the notifications into the queue, we return to
!  *    CommitTransaction() which will then commit to clog.
!  *
!  *    After clog commit we are called another time
!  *    (AtCommit_NotifyAfterCommit()). Here we check if we need to signal the
!  *    backends. In SignalBackends() we scan the list of listening backends and
!  *    send a PROCSIG_NOTIFY_INTERRUPT to every backend that has set its Pid (we
!  *    don't know which backend is listening on which channel so we need to send
!  *    a signal to every listening backend). We can exclude backends that are
!  *    already up to date.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
***************
*** 46,97 ****
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  * 5. Inbound-notify processing consists of scanning pg_listener for tuples
!  *	  matching our own listenerPID and having nonzero notification fields.
!  *	  For each such tuple, we send a message to our frontend and clear the
!  *	  notification field.  BTW: this routine has to start/commit its own
!  *	  transaction, since by assumption it is only called from outside any
!  *	  transaction.
!  *
!  * Like NOTIFY, LISTEN and UNLISTEN just add the desired action to a list
!  * of pending actions.	If we reach transaction commit, the changes are
!  * applied to pg_listener just before executing any pending NOTIFYs.  This
!  * method is necessary because to avoid race conditions, we must hold lock
!  * on pg_listener from when we insert a new listener tuple until we commit.
!  * To do that and not create undue hazard of deadlock, we don't want to
!  * touch pg_listener until we are otherwise done with the transaction;
!  * in particular it'd be uncool to still be taking user-commanded locks
!  * while holding the pg_listener lock.
!  *
!  * Although we grab ExclusiveLock on pg_listener for any operation,
!  * the lock is never held very long, so it shouldn't cause too much of
!  * a performance problem.  (Previously we used AccessExclusiveLock, but
!  * there's no real reason to forbid concurrent reads.)
   *
!  * An application that listens on the same relname it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * PID.  (As of FE/BE protocol 2.0, the backend's PID is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.  Note,
!  * however, that we do *not* guarantee that a separate frontend message will
!  * be sent for every outside NOTIFY.  Since there is only room for one
!  * originating PID in pg_listener, outside notifies occurring at about the
!  * same time may be collapsed into a single message bearing the PID of the
!  * first outside backend to perform the NOTIFY.
   *-------------------------------------------------------------------------
   */
  
  #include "postgres.h"
  
  #include <unistd.h>
  #include <signal.h>
  
  #include "access/heapam.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
! #include "catalog/pg_listener.h"
  #include "commands/async.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
--- 70,116 ----
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  *    Inbound-notify processing consists of reading all of the notifications
!  *	  that have arrived since scanning last time. We read every notification
!  *	  until we reach either a notification from an uncommitted transaction or
!  *	  the head pointer's position. Then we check if we were the laziest
!  *	  backend: if our pointer is at the same position as the global tail
!  *	  pointer, we advance the tail to the second-laziest backend's position
!  *	  (which we find by inspecting the positions of all other backends'
!  *	  pointers). Whenever we move the tail pointer we also truncate pages that
!  *	  are now unused (i.e. delete the corresponding files in pg_notify/).
   *
!  * An application that listens on the same channel it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * Pid.  (As of FE/BE protocol 2.0, the backend's Pid is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.
   *-------------------------------------------------------------------------
   */
  
+ /* XXX 
+  *
+  * TODO:
+  *  - guc parameter max_notifies_per_txn ??
+  *  - 2PC
+  *  - limit to ASCII?
+  *  - check truncation
+  */
+ 
  #include "postgres.h"
  
  #include <unistd.h>
  #include <signal.h>
  
  #include "access/heapam.h"
+ #include "access/slru.h"
+ #include "access/transam.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
! #include "catalog/pg_type.h"
  #include "commands/async.h"
+ #include "funcapi.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
***************
*** 108,115 ****
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction.  As explained above,
!  * we don't actually modify pg_listener until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
--- 127,134 ----
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction. As explained above,
!  * we don't actually update listenChannels until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
***************
*** 126,132 ****
  typedef struct
  {
  	ListenActionKind action;
! 	char		condname[1];	/* actually, as long as needed */
  } ListenAction;
  
  static List *pendingActions = NIL;		/* list of ListenAction */
--- 145,151 ----
  typedef struct
  {
  	ListenActionKind action;
! 	char		channel[1];	/* actually, as long as needed */
  } ListenAction;
  
  static List *pendingActions = NIL;		/* list of ListenAction */
***************
*** 134,140 ****
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all relnames NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
--- 153,159 ----
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all channels NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
***************
*** 149,160 ****
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
- static List *pendingNotifies = NIL;		/* list of C strings */
  
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
  
  /*
!  * State for inbound notifies consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
--- 168,288 ----
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
  
+ typedef struct QueuePosition
+ {
+ 	int				page;
+ 	int				offset;
+ } QueuePosition;
+ 
+ typedef struct Notification
+ {
+ 	char		   *channel;
+ 	char		   *payload;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ } Notification;
+ 
+ typedef struct AsyncQueueEntry
+ {
+ 	/*
+ 	 * this record has the maximal length, but usually we limit it to
+ 	 * AsyncQueueEntryEmptySize + strlen(payload).
+ 	 */
+ 	Size			length;
+ 	Oid				dboid;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ 	char			channel[NAMEDATALEN];
+ 	char			payload[NOTIFY_PAYLOAD_MAX_LENGTH];
+ } AsyncQueueEntry;
+ #define AsyncQueueEntryEmptySize \
+ 	 (sizeof(AsyncQueueEntry) - NOTIFY_PAYLOAD_MAX_LENGTH + 1)
+ 
+ #define	InvalidPid				(-1)
+ #define QUEUE_POS_PAGE(x)		((x).page)
+ #define QUEUE_POS_OFFSET(x)		((x).offset)
+ #define QUEUE_POS_EQUAL(x,y) \
+ 	 ((x).page == (y).page ? (x).offset == (y).offset : false)
+ #define SET_QUEUE_POS(x,y,z) \
+ 	do { \
+ 		(x).page = (y); \
+ 		(x).offset = (z); \
+ 	} while (0)
+ /* returns whichever of positions x and y is logically older, z = HEAD */
+ #define QUEUE_POS_MIN(x,y,z) \
+ 	(asyncQueuePagePrecedesLogically((x).page, (y).page, (z).page) ? (x) : \
+ 		 asyncQueuePagePrecedesLogically((y).page, (x).page, (z).page) ? (y) : \
+ 			 (x).offset < (y).offset ? (x) : \
+ 			 	(y))
+ #define QUEUE_BACKEND_POS(i)		asyncQueueControl->backend[(i)].pos
+ #define QUEUE_BACKEND_PID(i)		asyncQueueControl->backend[(i)].pid
+ #define QUEUE_HEAD					asyncQueueControl->head
+ #define QUEUE_TAIL					asyncQueueControl->tail
+ 
+ typedef struct QueueBackendStatus
+ {
+ 	int32			pid;
+ 	QueuePosition	pos;
+ } QueueBackendStatus;
+ 
+ /*
+  * The AsyncQueueControl structure is protected by the AsyncQueueLock.
+  *
+  * In SHARED mode, backends will only inspect their own entries as well as
+  * head and tail pointers. Consequently we can allow a backend to update its
+  * own record while holding only a shared lock (since no other backend will
+  * inspect it).
+  *
+  * In EXCLUSIVE mode, backends can inspect the entries of other backends and
+  * also change head and tail pointers.
+  *
+  * In order to avoid deadlocks, whenever we need both locks, we always first
+  * get AsyncQueueLock and then AsyncCtlLock.
+  */
+ typedef struct AsyncQueueControl
+ {
+ 	QueuePosition		head;		/* head points to the next free location */
+ 	QueuePosition 		tail;		/* the global tail is equivalent to the
+ 									   tail of the "slowest" backend */
+ 	TimestampTz			lastQueueFillWarn;	/* when the queue is full we only
+ 											   want to log that once in a
+ 											   while */
+ 	QueueBackendStatus	backend[1];	/* actually this one has as many entries as
+ 									 * connections are allowed (MaxBackends) */
+ 	/* DO NOT ADD FURTHER STRUCT MEMBERS HERE */
+ } AsyncQueueControl;
+ 
+ static AsyncQueueControl   *asyncQueueControl;
+ static SlruCtlData			AsyncCtlData;
+ 
+ #define AsyncCtl					(&AsyncCtlData)
+ #define QUEUE_PAGESIZE				BLCKSZ
+ #define QUEUE_FULL_WARN_INTERVAL	5000	/* warn at most once every 5s */
+ 
+ /*
+  * slru.c currently assumes that all filenames are four characters of hex
+  * digits. That means that we can use segments 0000 through FFFF.
+  * Each segment contains SLRU_PAGES_PER_SEGMENT pages which gives us
+  * the pages from 0 to SLRU_PAGES_PER_SEGMENT * 0xFFFF.
+  *
+  * It's of course easy to enhance slru.c but those pages give us so much
+  * space already that it doesn't seem worth the trouble...
+  *
+  * It's an interesting test case to define QUEUE_MAX_PAGE to a very small
+  * multiple of SLRU_PAGES_PER_SEGMENT to test queue full behaviour.
+  */
+ #define QUEUE_MAX_PAGE			(SLRU_PAGES_PER_SEGMENT * 0xFFFF)
+ 
+ static List *pendingNotifies = NIL;				/* list of Notifications */
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
+ static List *listenChannels = NIL;	/* list of channels we are listening to */
+ 
+ /* has this backend sent notifications in the current transaction ? */
+ static bool backendSendsNotifications = false;
  
  /*
!  * State for inbound notifications consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
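
To illustrate how the length of a queue entry follows from AsyncQueueEntryEmptySize, here is a standalone sketch. The struct layout and the macro are copied from the hunk above with PostgreSQL's types replaced by stdint equivalents. NOTIFY_PAYLOAD_MAX_LENGTH is not defined in this part of the patch, so 8000 is an assumed value, and entry_length is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NAMEDATALEN 64					/* PostgreSQL default */
#define NOTIFY_PAYLOAD_MAX_LENGTH 8000	/* assumed value, for illustration */

typedef struct AsyncQueueEntry
{
	size_t		length;
	uint32_t	dboid;
	uint32_t	xid;
	int32_t		srcPid;
	char		channel[NAMEDATALEN];
	char		payload[NOTIFY_PAYLOAD_MAX_LENGTH];
} AsyncQueueEntry;

/* size of an entry with an empty payload: keeps one byte for the '\0' */
#define AsyncQueueEntryEmptySize \
	(sizeof(AsyncQueueEntry) - NOTIFY_PAYLOAD_MAX_LENGTH + 1)

/* Hypothetical helper: bytes an entry with this payload occupies. */
static size_t
entry_length(const char *payload)
{
	size_t		plen = strlen(payload);

	assert(plen < NOTIFY_PAYLOAD_MAX_LENGTH);
	return AsyncQueueEntryEmptySize + plen;
}
```

With this layout a notification without payload costs only the fixed header plus the channel name buffer, rather than the full size of the struct.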
***************
*** 171,224 ****
  
  bool		Trace_notify = false;
  
! 
! static void queue_listen(ListenActionKind action, const char *condname);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static void Exec_Listen(Relation lRel, const char *relname);
! static void Exec_Unlisten(Relation lRel, const char *relname);
! static void Exec_UnlistenAll(Relation lRel);
! static void Send_Notify(Relation lRel);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(char *relname, int32 listenerPID);
! static bool AsyncExistsPendingNotify(const char *relname);
  static void ClearPendingActionsAndNotifies(void);
  
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the relation to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", relname);
  
  	/* no point in making duplicate entries in the list ... */
! 	if (!AsyncExistsPendingNotify(relname))
! 	{
! 		/*
! 		 * The name list needs to live until end of transaction, so store it
! 		 * in the transaction context.
! 		 */
! 		MemoryContext oldcontext;
  
! 		oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
! 		/*
! 		 * Ordering of the list isn't important.  We choose to put new entries
! 		 * on the front, as this might make duplicate-elimination a tad faster
! 		 * when the same condition is signaled many times in a row.
! 		 */
! 		pendingNotifies = lcons(pstrdup(relname), pendingNotifies);
  
! 		MemoryContextSwitchTo(oldcontext);
! 	}
  }
  
  /*
--- 299,465 ----
  
  bool		Trace_notify = false;
  
! static void queue_listen(ListenActionKind action, const char *channel);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static bool IsListeningOn(const char *channel);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
! static void Exec_Listen(const char *channel);
! static void Exec_Unlisten(const char *channel);
! static void Exec_UnlistenAll(void);
! static void SignalBackends(void);
! static void Send_Notify(void);
! static bool asyncQueuePagePrecedesPhysically(int p, int q);
! static bool asyncQueuePagePrecedesLogically(int p, int q, int head);
! static bool asyncQueueAdvance(QueuePosition *position, int entryLength);
! static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
! static void asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n);
! static List *asyncQueueAddEntries(List *notifications);
! static bool asyncQueueGetEntriesByPage(QueuePosition *current,
! 									   QueuePosition stop,
! 									   List **notifications);
! static void asyncQueueReadAllNotifications(void);
! static void asyncQueueAdvanceTail(void);
! static void asyncQueueUnregister(void);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(const char *channel,
! 							 const char *payload,
! 							 int32 srcPid);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
  static void ClearPendingActionsAndNotifies(void);
  
+ /*
+  * We will work on the page range of 0..(SLRU_PAGES_PER_SEGMENT * 0xFFFF).
+  * asyncQueuePagePrecedesPhysically just checks numerically without any magic
+  * if one page precedes another one.
+  *
+  * On the other hand, when asyncQueuePagePrecedesLogically does that check, it
+  * takes the current head page number into account. If we have wrapped
+  * around, it can happen that p precedes q, even though p > q (if the head page
+  * is in between the two).
+  */ 
+ static bool
+ asyncQueuePagePrecedesPhysically(int p, int q)
+ {
+ 	return p < q;
+ }
+ 
+ static bool
+ asyncQueuePagePrecedesLogically(int p, int q, int head)
+ {
+ 	if (p <= head && q <= head)
+ 		return p < q;
+ 	if (p > head && q > head)
+ 		return p < q;
+ 	if (p <= head)
+ 	{
+ 		Assert(q > head);
+ 		/* q is older */
+ 		return false;
+ 	}
+ 	else
+ 	{
+ 		Assert(p > head && q <= head);
+ 		/* p is older */
+ 		return true;
+ 	}
+ }
+ 
+ void
+ AsyncShmemInit(void)
+ {
+ 	bool	found;
+ 	int		slotno;
+ 	Size	size;
+ 
+ 	/*
+ 	 * Remember that sizeof(AsyncQueueControl) already contains one member of
+ 	 * QueueBackendStatus, so we only need to add the status space requirement
+ 	 * for MaxBackends-1 backends.
+ 	 */
+ 	size = mul_size(MaxBackends-1, sizeof(QueueBackendStatus));
+ 	size = add_size(size, sizeof(AsyncQueueControl));
+ 
+ 	asyncQueueControl = (AsyncQueueControl *)
+ 		ShmemInitStruct("Async Queue Control", size, &found);
+ 
+ 	if (!asyncQueueControl)
+ 		elog(ERROR, "out of memory");
+ 
+ 	if (!found)
+ 	{
+ 		int		i;
+ 		SET_QUEUE_POS(QUEUE_HEAD, 0, 0);
+ 		SET_QUEUE_POS(QUEUE_TAIL, 0, 0);
+ 		for (i = 0; i < MaxBackends; i++)
+ 		{
+ 			SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ 			QUEUE_BACKEND_PID(i) = InvalidPid;
+ 		}
+ 	}
+ 
+ 	AsyncCtl->PagePrecedes = asyncQueuePagePrecedesPhysically;
+ 	SimpleLruInit(AsyncCtl, "Async Ctl", NUM_ASYNC_BUFFERS, 0,
+ 				  AsyncCtlLock, "pg_notify");
+ 	AsyncCtl->do_fsync = false;
+ 	asyncQueueControl->lastQueueFillWarn = GetCurrentTimestamp();
+ 
+ 	if (!found)
+ 	{
+ 		LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
+ 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
+ 		SimpleLruWritePage(AsyncCtl, slotno, NULL);
+ 		LWLockRelease(AsyncCtlLock);
+ 
+ 		SimpleLruTruncate(AsyncCtl, 0);
+ 	}
+ }
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the channel to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *channel, const char *payload)
  {
+ 	Notification *n;
+ 	MemoryContext oldcontext;
+ 
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", channel);
  
  	/* no point in making duplicate entries in the list ... */
! 	if (AsyncExistsPendingNotify(channel, payload))
! 		return;
! 
! 	/*
! 	 * The notification needs to live until end of transaction, so store it
! 	 * in the transaction context.
! 	 */
! 	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
! 	n = (Notification *) palloc(sizeof(Notification));
! 	n->channel = pstrdup(channel);
! 	if (payload)
! 		n->payload = pstrdup(payload);
! 	else
! 		n->payload = "";
  
! 	/* will set the xid and the srcPid later... */
! 	n->xid = InvalidTransactionId;
! 	n->srcPid = InvalidPid;
  
! 	/*
! 	 * We want to preserve the order so we need to append every
! 	 * notification. See comments at AsyncExistsPendingNotify().
! 	 */
! 	pendingNotifies = lappend(pendingNotifies, n);
! 
! 	MemoryContextSwitchTo(oldcontext);
  }
  
  /*
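
The wraparound case handled by asyncQueuePagePrecedesLogically() is easier to see with concrete numbers. The function below is a condensed, standalone restatement of the same decision table under a hypothetical name (page_precedes_logically); it is meant as a reading aid, not as a replacement for the code in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Does page p logically precede page q, given the current head page?
 * Pages numerically above head were written before the queue wrapped
 * around, so they are older than any page at or below head.
 */
static bool
page_precedes_logically(int p, int q, int head)
{
	assert(p >= 0 && q >= 0 && head >= 0);

	/* both pages on the same side of head: plain numeric comparison */
	if ((p <= head) == (q <= head))
		return p < q;

	/* different sides: the page beyond head is the older one */
	return p > head;
}
```

For example, with the head on page 10 after a wraparound, page 90 precedes page 5 even though 90 > 5, while pages 3 and 7 compare numerically.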
***************
*** 226,236 ****
   *		Common code for listen, unlisten, unlisten all commands.
   *
   *		Adds the request to the list of pending actions.
!  *		Actual update of pg_listener happens during transaction commit.
!  *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  static void
! queue_listen(ListenActionKind action, const char *condname)
  {
  	MemoryContext oldcontext;
  	ListenAction *actrec;
--- 467,477 ----
   *		Common code for listen, unlisten, unlisten all commands.
   *
   *		Adds the request to the list of pending actions.
!  *		Actual update of the notification queue happens during transaction
!  *		commit.
   */
  static void
! queue_listen(ListenActionKind action, const char *channel)
  {
  	MemoryContext oldcontext;
  	ListenAction *actrec;
***************
*** 244,252 ****
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  	/* space for terminating null is included in sizeof(ListenAction) */
! 	actrec = (ListenAction *) palloc(sizeof(ListenAction) + strlen(condname));
  	actrec->action = action;
! 	strcpy(actrec->condname, condname);
  
  	pendingActions = lappend(pendingActions, actrec);
  
--- 485,493 ----
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  	/* space for terminating null is included in sizeof(ListenAction) */
! 	actrec = (ListenAction *) palloc(sizeof(ListenAction) + strlen(channel));
  	actrec->action = action;
! 	strcpy(actrec->channel, channel);
  
  	pendingActions = lappend(pendingActions, actrec);
  
***************
*** 259,270 ****
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", relname, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, relname);
  }
  
  /*
--- 500,511 ----
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", channel, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, channel);
  }
  
  /*
***************
*** 273,288 ****
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", relname, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, relname);
  }
  
  /*
--- 514,529 ----
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", channel, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, channel);
  }
  
  /*
***************
*** 306,313 ****
  /*
   * Async_UnlistenOnExit
   *
-  *		Clean up the pg_listener table at backend exit.
-  *
   *		This is executed if we have done any LISTENs in this backend.
   *		It might not be necessary anymore, if the user UNLISTENed everything,
   *		but we don't try to detect that case.
--- 547,552 ----
***************
*** 315,331 ****
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
- 	/*
- 	 * We need to start/commit a transaction for the unlisten, but if there is
- 	 * already an active transaction we had better abort that one first.
- 	 * Otherwise we'd end up committing changes that probably ought to be
- 	 * discarded.
- 	 */
  	AbortOutOfAnyTransaction();
! 	/* Now we can do the unlisten */
! 	StartTransactionCommand();
! 	Async_UnlistenAll();
! 	CommitTransactionCommand();
  }
  
  /*
--- 554,561 ----
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
  	AbortOutOfAnyTransaction();
! 	Exec_UnlistenAll();
  }
  
  /*
***************
*** 348,357 ****
  	/* We can deal with pending NOTIFY though */
  	foreach(p, pendingNotifies)
  	{
! 		const char *relname = (const char *) lfirst(p);
  
  		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   relname, strlen(relname) + 1);
  	}
  
  	/*
--- 578,592 ----
  	/* We can deal with pending NOTIFY though */
  	foreach(p, pendingNotifies)
  	{
! 		AsyncQueueEntry qe;
! 		Notification   *n;
! 
! 		n = (Notification *) lfirst(p);
! 
! 		asyncQueueNotificationToEntry(n, &qe);
  
  		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   &qe, qe.length);
  	}
  
  	/*
***************
*** 363,386 ****
  }
  
  /*
!  * AtCommit_Notify
!  *
!  *		This is called at transaction commit.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, insert or delete
!  *		tuples in pg_listener accordingly.
   *
!  *		If there are outbound notify requests in the pendingNotifies list,
!  *		scan pg_listener for matching tuples, and either signal the other
!  *		backend or send a message to our own frontend.
   *
!  *		NOTE: we are still inside the current transaction, therefore can
!  *		piggyback on its committing of changes.
   */
  void
! AtCommit_Notify(void)
  {
- 	Relation	lRel;
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
--- 598,617 ----
  }
  
  /*
!  * AtCommit_NotifyBeforeCommit
   *
!  *		This is called at transaction commit, before actually committing to
!  *		clog.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, update our
!  *		"listenChannels" list.
   *
!  *		If there are outbound notify requests in the pendingNotifies list, add
!  *		them to the global queue.  Listeners are signaled after the clog commit.
   */
  void
! AtCommit_NotifyBeforeCommit(void)
  {
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
***************
*** 397,406 ****
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify");
  
! 	/* Acquire ExclusiveLock on pg_listener */
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
--- 628,636 ----
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyBeforeCommit");
  
! 	Assert(backendSendsNotifications == false);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
***************
*** 410,508 ****
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_Listen(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN:
! 				Exec_Unlisten(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAll(lRel);
  				break;
  		}
- 
- 		/* We must CCI after each action in case of conflicting actions */
- 		CommandCounterIncrement();
  	}
  
- 	/* Perform any pending notifies */
- 	if (pendingNotifies)
- 		Send_Notify(lRel);
- 
  	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that notified backends see our tuple updates when they look. Else they
! 	 * might disregard the signal, which would make the application programmer
! 	 * very unhappy.  Also, this prevents race conditions when we have just
! 	 * inserted a listening tuple.
  	 */
! 	heap_close(lRel, NoLock);
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify: done");
  }
  
  /*
!  * Exec_Listen --- subroutine for AtCommit_Notify
!  *
!  *		Register the current backend as listening on the specified relation.
   */
! static void
! Exec_Listen(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
! 	Datum		values[Natts_pg_listener];
! 	bool		nulls[Natts_pg_listener];
! 	NameData	condname;
! 	bool		alreadyListener = false;
  
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Listen(%s,%d)", relname, MyProcPid);
  
! 	/* Detect whether we are already listening on this relname */
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
  
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
! 		{
! 			alreadyListener = true;
! 			/* No need to scan the rest of the table */
! 			break;
! 		}
  	}
- 	heap_endscan(scan);
  
! 	if (alreadyListener)
! 		return;
  
! 	/*
! 	 * OK to insert a new tuple
! 	 */
! 	memset(nulls, false, sizeof(nulls));
  
! 	namestrcpy(&condname, relname);
! 	values[Anum_pg_listener_relname - 1] = NameGetDatum(&condname);
! 	values[Anum_pg_listener_listenerpid - 1] = Int32GetDatum(MyProcPid);
! 	values[Anum_pg_listener_notification - 1] = Int32GetDatum(0);		/* no notifies pending */
  
! 	tuple = heap_form_tuple(RelationGetDescr(lRel), values, nulls);
  
! 	simple_heap_insert(lRel, tuple);
  
! #ifdef NOT_USED					/* currently there are no indexes */
! 	CatalogUpdateIndexes(lRel, tuple);
! #endif
  
! 	heap_freetuple(tuple);
  
  	/*
! 	 * now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
--- 640,806 ----
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_Listen(actrec->channel);
  				break;
  			case LISTEN_UNLISTEN:
! 				Exec_Unlisten(actrec->channel);
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAll();
  				break;
  		}
  	}
  
  	/*
! 	 * Perform any pending notifies.
  	 */
! 	if (pendingNotifies)
! 		Send_Notify();
! }
! 
! /*
!  * AtCommit_NotifyAfterCommit
!  *
!  *		This is called at transaction commit, after committing to clog.
!  *
!  *		Notify the listening backends.
!  */
! void
! AtCommit_NotifyAfterCommit(void)
! {
! 	/* Allow transactions that have not executed LISTEN/UNLISTEN/NOTIFY to
! 	 * return as soon as possible */
! 	if (!pendingActions && !backendSendsNotifications)
! 		return;
! 
! 	if (backendSendsNotifications)
! 		SignalBackends();
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyAfterCommit: done");
  }
  
  /*
!  * This function is executed for every notification found in the queue in order
!  * to check if the current backend is listening on that channel. Not sure if we
!  * should further optimize this, for example convert to a sorted array and
!  * allow binary search on it...
   */
! static bool
! IsListeningOn(const char *channel)
  {
! 	ListCell   *p;
! 	char	   *lchan;
  
! 	foreach(p, listenChannels)
! 	{
! 		lchan = (char *) lfirst(p);
! 		if (strcmp(lchan, channel) == 0)
! 			return true;
! 	}
! 	return false;
! }
! 
! Datum
! pg_listening(PG_FUNCTION_ARGS)
! {
! 	FuncCallContext	   *funcctx;
! 	ListCell		  **lcp;
  
! 	/* stuff done only on the first call of the function */
! 	if (SRF_IS_FIRSTCALL())
  	{
! 		MemoryContext	oldcontext;
  
! 		/* create a function context for cross-call persistence */
! 		funcctx = SRF_FIRSTCALL_INIT();
! 
! 		/*
! 		 * switch to memory context appropriate for multiple function calls
! 		 */
! 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
! 
! 		/* allocate memory for user context */
! 		lcp = (ListCell **) palloc(sizeof(ListCell *));
! 		if (listenChannels != NIL)
! 			*lcp = list_head(listenChannels);
! 		else
! 			*lcp = NULL;
! 		funcctx->user_fctx = (void *) lcp;
! 
! 		MemoryContextSwitchTo(oldcontext);
  	}
  
! 	/* stuff done on every call of the function */
! 	funcctx = SRF_PERCALL_SETUP();
! 	lcp = (ListCell **) funcctx->user_fctx;
  
! 	while (*lcp != NULL)
! 	{
! 		char   *channel = (char *) lfirst(*lcp);
! 
! 		*lcp = (*lcp)->next;
! 		SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
! 	}
! 
! 	SRF_RETURN_DONE(funcctx);
! }
! 
! /*
!  * Exec_Listen --- subroutine for AtCommit_NotifyBeforeCommit
!  *
!  *		Register the current backend as listening on the specified channel.
!  */
! static void
! Exec_Listen(const char *channel)
! {
! 	MemoryContext oldcontext;
  
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Listen(%s,%d)", channel, MyProcPid);
  
! 	/* Detect whether we are already listening on this channel */
! 	if (IsListeningOn(channel))
! 		return;
  
! 	/*
! 	 * OK to insert to the list.
! 	 */
! 	if (listenChannels == NIL)
! 	{
! 		/*
! 		 * This is our first LISTEN, so establish our pointer.
! 		 * We set our pointer to the global tail pointer; this way we make
! 		 * sure that we get all of the notifications. We might get a few more,
! 		 * but that doesn't hurt.
! 		 */
! 		LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 		QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;
! 		QUEUE_BACKEND_PID(MyBackendId) = MyProcPid;
! 		LWLockRelease(AsyncQueueLock);
  
! 		/*
! 		 * Try to move our pointer forward as far as possible. This will skip
! 		 * over already committed notifications. Still, we could get
! 		 * notifications that have already committed before we started to
! 		 * LISTEN.
! 		 *
! 		 * Note that we are not yet listening on anything, so we won't deliver
! 		 * any notification.
! 		 *
! 		 * This will also advance the global tail pointer if necessary.
! 		 */
! 		asyncQueueReadAllNotifications();
! 	}
  
! 	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
! 	listenChannels = lappend(listenChannels, pstrdup(channel));
! 	MemoryContextSwitchTo(oldcontext);
  
  	/*
! 	 * Now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
***************
*** 514,550 ****
  /*
   * Exec_Unlisten --- subroutine for AtCommit_Notify
   *
!  *		Remove the current backend from the list of listening backends
!  *		for the specified relation.
   */
  static void
! Exec_Unlisten(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Unlisten(%s,%d)", relname, MyProcPid);
  
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
! 
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
  		{
! 			/* Found the matching tuple, delete it */
! 			simple_heap_delete(lRel, &tuple->t_self);
! 
! 			/*
! 			 * We assume there can be only one match, so no need to scan the
! 			 * rest of the table
! 			 */
  			break;
  		}
  	}
! 	heap_endscan(scan);
  
  	/*
  	 * We do not complain about unlistening something not being listened;
--- 812,843 ----
  /*
   * Exec_Unlisten --- subroutine for AtCommit_Notify
   *
!  *		Remove a specified channel from "listenChannels".
   */
  static void
! Exec_Unlisten(const char *channel)
  {
! 	ListCell *q;
! 	ListCell *prev;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Unlisten(%s,%d)", channel, MyProcPid);
  
! 	prev = NULL;
! 	foreach(q, listenChannels)
  	{
! 		char *lchan = (char *) lfirst(q);
! 		if (strcmp(lchan, channel) == 0)
  		{
! 			pfree(lchan);
! 			listenChannels = list_delete_cell(listenChannels, q, prev);
  			break;
  		}
+ 		prev = q;
  	}
! 
! 	if (listenChannels == NIL)
! 		asyncQueueUnregister();
  
  	/*
  	 * We do not complain about unlistening something not being listened;
***************
*** 555,690 ****
  /*
   * Exec_UnlistenAll --- subroutine for AtCommit_Notify
   *
!  *		Update pg_listener to unlisten all relations for this backend.
   */
  static void
! Exec_UnlistenAll(Relation lRel)
  {
- 	HeapScanDesc scan;
- 	HeapTuple	lTuple;
- 	ScanKeyData key[1];
- 
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAll");
  
! 	/* Find and delete all entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
  
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 		simple_heap_delete(lRel, &lTuple->t_self);
  
! 	heap_endscan(scan);
  }
  
  /*
!  * Send_Notify --- subroutine for AtCommit_Notify
!  *
!  *		Scan pg_listener for tuples matching our pending notifies, and
!  *		either signal the other backend or send a message to our own frontend.
   */
  static void
! Send_Notify(Relation lRel)
  {
! 	TupleDesc	tdesc = RelationGetDescr(lRel);
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 
! 	/* preset data to update notify column to MyProcPid */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(MyProcPid);
! 
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		listenerPID = listener->listenerpid;
  
! 		if (!AsyncExistsPendingNotify(relname))
! 			continue;
  
! 		if (listenerPID == MyProcPid)
  		{
! 			/*
! 			 * Self-notify: no need to bother with table update. Indeed, we
! 			 * *must not* clear the notification field in this path, or we
! 			 * could lose an outside notify, which'd be bad for applications
! 			 * that ignore self-notify messages.
! 			 */
! 			if (Trace_notify)
! 				elog(DEBUG1, "AtCommit_Notify: notifying self");
  
! 			NotifyMyFrontEnd(relname, listenerPID);
  		}
  		else
  		{
- 			if (Trace_notify)
- 				elog(DEBUG1, "AtCommit_Notify: notifying pid %d",
- 					 listenerPID);
- 
  			/*
! 			 * If someone has already notified this listener, we don't bother
! 			 * modifying the table, but we do still send a NOTIFY_INTERRUPT
! 			 * signal, just in case that backend missed the earlier signal for
! 			 * some reason.  It's OK to send the signal first, because the
! 			 * other guy can't read pg_listener until we unlock it.
! 			 *
! 			 * Note: we don't have the other guy's BackendId available, so
! 			 * this will incur a search of the ProcSignal table.  That's
! 			 * probably not worth worrying about.
  			 */
! 			if (SendProcSignal(listenerPID, PROCSIG_NOTIFY_INTERRUPT,
! 							   InvalidBackendId) < 0)
  			{
! 				/*
! 				 * Get rid of pg_listener entry if it refers to a PID that no
! 				 * longer exists.  Presumably, that backend crashed without
! 				 * deleting its pg_listener entries. This code used to only
! 				 * delete the entry if errno==ESRCH, but as far as I can see
! 				 * we should just do it for any failure (certainly at least
! 				 * for EPERM too...)
! 				 */
! 				simple_heap_delete(lRel, &lTuple->t_self);
  			}
! 			else if (listener->notification == 0)
  			{
! 				/* Rewrite the tuple with my PID in notification column */
! 				rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 				simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 				CatalogUpdateIndexes(lRel, rTuple);
! #endif
  			}
  		}
  	}
  
! 	heap_endscan(scan);
  }
  
  /*
   * AtAbort_Notify
   *
!  *		This is called at transaction abort.
   *
!  *		Gets rid of pending actions and outbound notifies that we would have
!  *		executed if the transaction got committed.
   */
  void
  AtAbort_Notify(void)
  {
  	ClearPendingActionsAndNotifies();
  }
  
--- 848,1220 ----
  /*
   * Exec_UnlistenAll --- subroutine for AtCommit_Notify
   *
!  *		Unlisten on all channels for this backend.
   */
  static void
! Exec_UnlistenAll(void)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAll(%d)", MyProcPid);
! 
! 	list_free_deep(listenChannels);
! 	listenChannels = NIL;
  
! 	asyncQueueUnregister();
! }
! 
! static void
! asyncQueueUnregister(void)
! {
! 	bool	  advanceTail = false;
! 
! 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 	QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
! 	/*
! 	 * If we have been the last backend, advance the tail pointer.
! 	 */
! 	if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
! 		advanceTail = true;
! 	LWLockRelease(AsyncQueueLock);
! 
! 	if (advanceTail)
! 		asyncQueueAdvanceTail();
! }
! 
! static bool
! asyncQueueIsFull(void)
! {
! 	QueuePosition	lookahead = QUEUE_HEAD;
! 	Size			remain = QUEUE_PAGESIZE - QUEUE_POS_OFFSET(lookahead) - 1;
! 	Size			advance = Min(remain, NOTIFY_PAYLOAD_MAX_LENGTH);
  
! 	/*
! 	 * Check what happens if we wrote a maximally sized entry. Would we go to a
! 	 * new page? If not, then our queue can not be full (because we can still
! 	 * fill at least the current page with at least one more entry).
! 	 */
! 	if (!asyncQueueAdvance(&lookahead, advance))
! 		return false;
  
! 	/*
! 	 * The queue is full if switching to a new page would bring us to the
! 	 * page of the tail pointer.
! 	 */
! 	return QUEUE_POS_PAGE(lookahead) == QUEUE_POS_PAGE(QUEUE_TAIL);
  }
  
  /*
!  * The function advances the position to the next entry. In case we jump to
!  * a new page the function returns true, else false.
   */
+ static bool
+ asyncQueueAdvance(QueuePosition *position, int entryLength)
+ {
+ 	int		pageno = QUEUE_POS_PAGE(*position);
+ 	int		offset = QUEUE_POS_OFFSET(*position);
+ 	bool	pageJump = false;
+ 
+ 	/*
+ 	 * Move to the next writing position: First jump over what we have just
+ 	 * written or read.
+ 	 */
+ 	offset += entryLength;
+ 	Assert(offset < QUEUE_PAGESIZE);
+ 
+ 	/*
+ 	 * In a second step check if another entry can be written to the page. If
+ 	 * it can, stay here: we have reached the next position. If not, then we
+ 	 * need to move on to the next page.
+ 	 */
+ 	if (offset + AsyncQueueEntryEmptySize >= QUEUE_PAGESIZE)
+ 	{
+ 		pageno++;
+ 		if (pageno > QUEUE_MAX_PAGE)
+ 			/* wrap around */
+ 			pageno = 0;
+ 		offset = 0;
+ 		pageJump = true;
+ 	}
+ 
+ 	SET_QUEUE_POS(*position, pageno, offset);
+ 	return pageJump;
+ }
+ 
+ static void
+ asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
+ {
+ 	Assert(n->channel != NULL);
+ 	Assert(n->payload != NULL);
+ 	Assert(strlen(n->payload) <= NOTIFY_PAYLOAD_MAX_LENGTH);
+ 
+ 	/* The terminator is already included in AsyncQueueEntryEmptySize */
+ 	qe->length = AsyncQueueEntryEmptySize + strlen(n->payload);
+ 	qe->srcPid = MyProcPid;
+ 	qe->dboid = MyDatabaseId;
+ 	qe->xid = GetCurrentTransactionId();
+ 	strcpy(qe->channel, n->channel);
+ 	strcpy(qe->payload, n->payload);
+ }
+ 
  static void
! asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n)
! {
! 	n->channel = pstrdup(qe->channel);
! 	n->payload = pstrdup(qe->payload);
! 	n->srcPid = qe->srcPid;
! 	n->xid = qe->xid;
! }
! 
! /*
!  * Add the notifications to the queue: we go page by page here, i.e. we stop
!  * once we have to go to a new page but we will be called again and then fill
!  * that next page. If an entry does not fit on a page anymore, we write a dummy
!  * entry with an InvalidOid as the database oid in order to fill the page. So
!  * every page is always used up to the last byte which simplifies reading the
!  * page later.
!  *
!  * We are holding AsyncQueueLock already from the caller and grab AsyncCtlLock
!  * here in this function.
!  *
!  * We are passed the list of notifications to write and return the
!  * not-yet-written notifications back. Eventually we will return NIL.
!  */
! static List *
! asyncQueueAddEntries(List *notifications)
  {
! 	AsyncQueueEntry	qe;
! 	int				pageno;
! 	int				offset;
! 	int				slotno;
  
! 	/*
! 	 * Note that we are holding exclusive AsyncQueueLock already.
! 	 */
! 	LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
! 	pageno = QUEUE_POS_PAGE(QUEUE_HEAD);
! 	slotno = SimpleLruReadPage(AsyncCtl, pageno, true, InvalidTransactionId);
! 	AsyncCtl->shared->page_dirty[slotno] = true;
  
! 	do
! 	{
! 		Notification   *n;
! 
! 		if (asyncQueueIsFull())
  		{
! 			/* document that we will not go into the if-block further down */
! 			Assert(QUEUE_POS_OFFSET(QUEUE_HEAD) != 0);
! 			break;
! 		}
  
! 		n = (Notification *) linitial(notifications);
! 
! 		asyncQueueNotificationToEntry(n, &qe);
! 
! 		offset = QUEUE_POS_OFFSET(QUEUE_HEAD);
! 		/*
! 		 * Check whether or not the entry still fits on the current page.
! 		 */
! 		if (offset + qe.length < QUEUE_PAGESIZE)
! 		{
! 			notifications = list_delete_first(notifications);
  		}
  		else
  		{
  			/*
! 			 * Write a dummy entry to fill up the page. Actually readers will
! 			 * only check dboid and since it won't match any reader's database
! 			 * oid, they will ignore this entry and move on.
  			 */
! 			qe.length = QUEUE_PAGESIZE - offset - 1;
! 			qe.dboid = InvalidOid;
! 			qe.channel[0] = '\0';
! 			qe.payload[0] = '\0';
! 			qe.xid = InvalidTransactionId;
! 		}
! 		memcpy((char*) AsyncCtl->shared->page_buffer[slotno] + offset,
! 			   &qe, qe.length);
! 
! 	} while (!asyncQueueAdvance(&(QUEUE_HEAD), qe.length)
! 			 && notifications != NIL);
! 
! 	if (QUEUE_POS_OFFSET(QUEUE_HEAD) == 0)
! 	{
! 		/*
! 		 * We need to continue on a new page: stop here, but prepare that
! 		 * page already.
! 		 */
! 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
! 		AsyncCtl->shared->page_dirty[slotno] = true;
! 	}
! 	LWLockRelease(AsyncCtlLock);
! 
! 	return notifications;
! }
! 
! static void
! asyncQueueFillWarning(void)
! {
! 	/*
! 	 * Caller must hold exclusive AsyncQueueLock.
! 	 */
! 	TimestampTz		t;
! 	double			fillDegree;
! 	int				occupied;
! 	int				tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
! 	int				headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
! 
! 	occupied = headPage - tailPage;
! 
! 	if (occupied == 0)
! 		return;
! 	
! 	if (!asyncQueuePagePrecedesPhysically(tailPage, headPage))
! 		/* head has wrapped around, tail not yet */
! 		occupied += QUEUE_MAX_PAGE;
! 
! 	fillDegree = (double) occupied / (double) QUEUE_MAX_PAGE;
! 
! 	if (fillDegree < 0.5)
! 		return;
! 
! 	t = GetCurrentTimestamp();
! 
! 	if (TimestampDifferenceExceeds(asyncQueueControl->lastQueueFillWarn,
! 								   t, QUEUE_FULL_WARN_INTERVAL))
! 	{
! 		QueuePosition	min = QUEUE_HEAD;
! 		int32			minPid = InvalidPid;
! 		int				i;
! 
! 		for (i = 0; i < MaxBackends; i++)
! 			if (QUEUE_BACKEND_PID(i) != InvalidPid)
  			{
! 				min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
! 				if (QUEUE_POS_EQUAL(min, QUEUE_BACKEND_POS(i)))
! 					minPid = QUEUE_BACKEND_PID(i);
  			}
! 
! 		if (fillDegree < 0.75)
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 50%% full. "
! 								 "Among the slowest backends: %d", minPid)));
! 		else
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 75%% full. "
! 								 "Among the slowest backends: %d", minPid)));
! 
! 		asyncQueueControl->lastQueueFillWarn = t;
! 	}
! }
! 
! /*
!  * Send_Notify --- subroutine for AtCommit_NotifyBeforeCommit
!  *
!  * Add the pending notifications to the queue.
!  *
!  * A full queue is very uncommon and should really not happen, given that we
!  * have so much space available in the slru pages. Nevertheless we need to
!  * deal with this possibility. Note that when we get here we are in the process
!  * of committing our transaction, we have not yet committed to clog but this
!  * would be the next step. So at this point in time we can still roll the
!  * transaction back.
!  */
! static void
! Send_Notify(void)
! {
! 	backendSendsNotifications = true;
! 
! 	while (pendingNotifies != NIL)
! 	{
! 		LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 		asyncQueueFillWarning();
! 		if (asyncQueueIsFull())
! 			ereport(ERROR,
! 					(errcode(ERRCODE_TOO_MANY_ENTRIES),
! 					errmsg("too many notifications in the queue")));
! 		pendingNotifies = asyncQueueAddEntries(pendingNotifies);
! 		LWLockRelease(AsyncQueueLock);
! 	}
! }
! 
! /*
!  * Send signals to all listening backends. Since we hold the EXCLUSIVE lock
!  * anyway, we also check the position of the other backends and do not signal
!  * any backend that is already up-to-date. This can happen if concurrent
!  * notifying transactions have sent a signal and the signaled backend has read
!  * the other notifications and ours in the same step.
!  *
!  * Since we know the BackendId and the Pid the signalling is quite cheap.
!  */
! static void
! SignalBackends(void)
! {
! 	QueuePosition	pos;
! 	ListCell	   *p1, *p2;
! 	int				i;
! 	int32			pid;
! 	List		   *pids = NIL;
! 	List		   *ids = NIL;
! 	int				count = 0;
! 
! 	/* Signal everybody who is LISTENing to any channel. */
! 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 	for (i = 0; i < MaxBackends; i++)
! 	{
! 		pid = QUEUE_BACKEND_PID(i);
! 		if (pid != InvalidPid)
! 		{
! 			count++;
! 			pos = QUEUE_BACKEND_POS(i);
! 			if (!QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
  			{
! 				pids = lappend_int(pids, pid);
! 				ids = lappend_int(ids, i);
  			}
  		}
  	}
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	forboth(p1, pids, p2, ids)
+ 	{
+ 		pid = (int32) lfirst_int(p1);
+ 		i = lfirst_int(p2);
+ 		/*
+ 		 * Should we check for failure? Can it happen that a backend
+ 		 * has crashed without the postmaster starting over?
+ 		 */
+ 		if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, i) < 0)
+ 			elog(WARNING, "Error signalling backend %d", pid);
+ 	}
  
! 	if (count == 0)
! 	{
! 		/* No backend is listening at all, try to clean up the queue.
! 		 * Even if a backend has started to listen since we determined
! 		 * count to be 0, advancing the tail does not hurt. Our
! 		 * notifications are committed already and a newly listening
! 		 * backend would skip over them anyway. */
! 		asyncQueueAdvanceTail();
! 	}
  }
  
  /*
   * AtAbort_Notify
   *
!  *	This is called at transaction abort.
!  *
!  *	Gets rid of pending actions and outbound notifies that we would have
!  *	executed if the transaction got committed.
   *
!  *	Even though we have not committed, we need to signal the listening backends
!  *	because our notifications might block readers from processing the queue.
!  *	Now that the transaction has aborted, they can go on and skip over our
!  *	notifications. They could find notifications past ours that they need to
!  *	deliver.
   */
  void
  AtAbort_Notify(void)
  {
+ 	if (backendSendsNotifications)
+ 		SignalBackends();
+ 
  	ClearPendingActionsAndNotifies();
  }
  
***************
*** 940,968 ****
  }
  
  /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan pg_listener for arriving notifies, report them to my front end,
!  *		and clear the notification field in pg_listener until next time.
   *
!  *		NOTE: since we are outside any transaction, we must create our own.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	Relation	lRel;
! 	TupleDesc	tdesc;
! 	ScanKeyData key[1];
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 	bool		catchup_enabled;
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
--- 1470,1708 ----
  }
  
  /*
+  * This function asks for a page with ReadOnly access. Once we have the
+  * lock, we read the whole content and pass back the list of notifications
+  * that the calling function will then deliver. The list contains only
+  * notifications from committed transactions on channels we listen on.
+  *
+  * We stop when we either reach the stop position or move to a new page.
+  *
+  * The function returns true once we have reached the end or a notification of
+  * a transaction that is still running and false if we have finished with
+  * the page. In other words: once it returns true there is no point in calling
+  * it again.
+  */
+ static bool
+ asyncQueueGetEntriesByPage(QueuePosition *current,
+ 						   QueuePosition stop,
+ 						   List **notifications)
+ {
+ 	AsyncQueueEntry	qe;
+ 	Notification   *n;
+ 	int				slotno;
+ 	bool			reachedStop = false;
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		return true;
+ 
+ 	slotno = SimpleLruReadPage_ReadOnly(AsyncCtl, current->page,
+ 										InvalidTransactionId);
+ 	do {
+ 		char *readPtr = (char *) (AsyncCtl->shared->page_buffer[slotno]);
+ 
+ 		if (QUEUE_POS_EQUAL(*current, stop))
+ 		{
+ 			reachedStop = true;
+ 			break;
+ 		}
+ 
+ 		readPtr += current->offset;
+ 		/* at first we only read the header of the notification */
+ 		memcpy(&qe, readPtr, AsyncQueueEntryEmptySize);
+ 
+ 		if (qe.dboid == MyDatabaseId)
+ 		{
+ 			if (TransactionIdDidCommit(qe.xid))
+ 			{
+ 				if (IsListeningOn(qe.channel))
+ 				{
+ 					if (qe.length > AsyncQueueEntryEmptySize)
+ 					{
+ 						/* now we know that we are interested in the
+ 						 * notification and read it completely. */
+ 						memcpy(&qe, readPtr, qe.length);
+ 					}
+ 					n = (Notification *) palloc(sizeof(Notification));
+ 					asyncQueueEntryToNotification(&qe, n);
+ 					*notifications = lappend(*notifications, n);
+ 				}
+ 			}
+ 			else
+ 			{
+ 				if (!TransactionIdDidAbort(qe.xid))
+ 				{
+ 					/*
+ 					 * The transaction has neither committed nor aborted so
+ 					 * far.
+ 					 */
+ 					reachedStop = true;
+ 					break;
+ 				}
+ 				/*
+ 				 * Here we know that the transaction has aborted, we just
+ 				 * ignore its notifications.
+ 				 */
+ 			}
+ 		}
+ 		/*
+ 		 * The call to asyncQueueAdvance just jumps over what we have
+ 		 * just read. If there is no more space for the next record on the
+ 		 * current page, it will also switch to the beginning of the next page.
+ 		 */
+ 	} while(!asyncQueueAdvance(current, qe.length));
+ 
+ 	/*
+ 	 * Release the lock that we implicitly got from
+ 	 * SimpleLruReadPage_ReadOnly().
+ 	 */
+ 	LWLockRelease(AsyncCtlLock);
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		reachedStop = true;
+ 
+ 	return reachedStop;
+ }
+ 
+ 
+ static void
+ asyncQueueReadAllNotifications(void)
+ {
+ 	QueuePosition	pos;
+ 	QueuePosition	oldpos;
+ 	QueuePosition	head;
+ 	List		   *notifications;
+ 	ListCell	   *lc;
+ 	Notification   *n;
+ 	bool			advanceTail = false;
+ 	bool			reachedStop;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	pos = oldpos = QUEUE_BACKEND_POS(MyBackendId);
+ 	head = QUEUE_HEAD;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (QUEUE_POS_EQUAL(pos, head))
+ 	{
+ 		/* Nothing to do, we have read all notifications already. */
+ 		return;
+ 	}
+ 
+ 	do 
+ 	{
+ 		/*
+ 		 * Our stop position is what we found to be the head's position when
+ 		 * we entered this function. It might have changed already. But if it
+ 		 * has, we will receive (or have already received and queued) another
+ 		 * signal and come here again.
+ 		 *
+ 		 * We are not holding AsyncQueueLock here! The queue can only extend
+ 		 * beyond the head pointer (see above) and we leave our backend's
+ 		 * pointer where it is so nobody will truncate or rewrite pages under
+ 		 * us. In particular, we don't want to hold a lock while sending the
+ 		 * notifications to the frontend.
+ 		 */
+ 		reachedStop = false;
+ 
+ 		notifications = NIL;
+ 		reachedStop = asyncQueueGetEntriesByPage(&pos, head, &notifications);
+ 
+ 		/*
+ 		 * Note that we deliver everything that we see in the queue and that
+ 		 * matches our _current_ listening state.
+ 		 * In particular, we do not take into account different commit times.
+ 		 *
+ 		 * See the following example:
+ 		 *
+ 		 * Backend 1:                    Backend 2:
+ 		 *
+ 		 * transaction starts
+ 		 * NOTIFY foo;
+ 		 * commit starts
+ 		 *                               transaction starts
+ 		 *                               LISTEN foo;
+ 		 *                               commit starts
+ 		 * commit to clog
+ 		 *                               commit to clog
+ 		 *
+ 		 * It could happen that backend 2 sees the notification from
+ 		 * backend 1 in the queue and even though the notifying transaction
+ 		 * committed before the listening transaction, we still deliver the
+ 		 * notification.
+ 		 *
+ 		 * The idea is that an additional notification does not do any
+ 		 * harm; we just need to make sure that we do not miss a
+ 		 * notification.
+ 		 */
+ 		foreach(lc, notifications)
+ 		{
+ 			n = (Notification *) lfirst(lc);
+ 			NotifyMyFrontEnd(n->channel, n->payload, n->srcPid);
+ 		}
+ 	} while (!reachedStop);
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	QUEUE_BACKEND_POS(MyBackendId) = pos;
+ 	if (QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL))
+ 		advanceTail = true;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (advanceTail)
+ 		/* Move forward the tail pointer and try to truncate. */
+ 		asyncQueueAdvanceTail();
+ }
+ 
+ static void
+ asyncQueueAdvanceTail(void)
+ {
+ 	QueuePosition	min;
+ 	int				i;
+ 	int				tailp;
+ 	int				headp;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ 	min = QUEUE_HEAD;
+ 	for (i = 0; i < MaxBackends; i++)
+ 		if (QUEUE_BACKEND_PID(i) != InvalidPid)
+ 			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
+ 
+ 	tailp = QUEUE_POS_PAGE(QUEUE_TAIL);
+ 	headp = QUEUE_POS_PAGE(QUEUE_HEAD);
+ 	QUEUE_TAIL = min;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	/* This is our wraparound check */
+ 	if ((asyncQueuePagePrecedesLogically(tailp, QUEUE_POS_PAGE(min), headp)
+ 			&& asyncQueuePagePrecedesPhysically(tailp, headp))
+ 		|| tailp == QUEUE_POS_PAGE(min))
+ 	{
+ 		/*
+ 		 * SimpleLruTruncate() will ask for AsyncCtlLock but will also
+ 		 * release the lock again.
+ 		 *
+ 		 * XXX this could be optimized, to call SimpleLruTruncate only when we
+ 		 * know that we can truncate something.
+ 		 */
+ 		SimpleLruTruncate(AsyncCtl, QUEUE_POS_PAGE(min));
+ 	}
+ }
+ 
+ /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan the queue for arriving notifications and report them to my front
!  *		end.
   *
!  *		NOTE: we are outside of any transaction here.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	bool			catchup_enabled;
! 
! 	Assert(GetCurrentTransactionIdIfAny() == InvalidTransactionId);
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
***************
*** 974,1037 ****
  
  	notifyInterruptOccurred = 0;
  
! 	StartTransactionCommand();
! 
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
! 	tdesc = RelationGetDescr(lRel);
! 
! 	/* Scan only entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
! 
! 	/* Prepare data for rewriting 0 into notification field */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(0);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		sourcePID = listener->notification;
! 
! 		if (sourcePID != 0)
! 		{
! 			/* Notify the frontend */
! 
! 			if (Trace_notify)
! 				elog(DEBUG1, "ProcessIncomingNotify: received %s from %d",
! 					 relname, (int) sourcePID);
! 
! 			NotifyMyFrontEnd(relname, sourcePID);
! 
! 			/*
! 			 * Rewrite the tuple with 0 in notification column.
! 			 */
! 			rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 			simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 			CatalogUpdateIndexes(lRel, rTuple);
! #endif
! 		}
! 	}
! 	heap_endscan(scan);
! 
! 	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that other backends see our tuple updates when they look. Otherwise, a
! 	 * transaction started after this one might mistakenly think it doesn't
! 	 * need to send this backend a new NOTIFY.
! 	 */
! 	heap_close(lRel, NoLock);
! 
! 	CommitTransactionCommand();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
--- 1714,1720 ----
  
  	notifyInterruptOccurred = 0;
  
! 	asyncQueueReadAllNotifications();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
***************
*** 1051,1070 ****
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(char *relname, int32 listenerPID)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, listenerPID, sizeof(int32));
! 		pq_sendstring(&buf, relname);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 		{
! 			/* XXX Add parameter string here later */
! 			pq_sendstring(&buf, "");
! 		}
  		pq_endmessage(&buf);
  
  		/*
--- 1734,1750 ----
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(const char *channel, const char *payload, int32 srcPid)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, srcPid, sizeof(int32));
! 		pq_sendstring(&buf, channel);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 			pq_sendstring(&buf, payload);
  		pq_endmessage(&buf);
  
  		/*
***************
*** 1074,1096 ****
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", relname);
  }
  
! /* Does pendingNotifies include the given relname? */
  static bool
! AsyncExistsPendingNotify(const char *relname)
  {
  	ListCell   *p;
  
! 	foreach(p, pendingNotifies)
! 	{
! 		const char *prelname = (const char *) lfirst(p);
  
! 		if (strcmp(prelname, relname) == 0)
  			return true;
  	}
  
  	return false;
  }
  
--- 1754,1810 ----
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", channel);
  }
  
! /* Does pendingNotifies include the given channel/payload? */
  static bool
! AsyncExistsPendingNotify(const char *channel, const char *payload)
  {
  	ListCell   *p;
+ 	Notification *n;
  
! 	if (pendingNotifies == NIL)
! 		return false;
  
! 	if (payload == NULL)
! 		payload = "";
! 
! 	/*
! 	 * We need to append new elements to the end of the list in order to keep
! 	 * the order. However, on the other hand we'd like to check the list
! 	 * backwards in order to make duplicate-elimination a tad faster when the
! 	 * same condition is signaled many times in a row. So as a compromise we
! 	 * check the tail element first which we can access directly. If this
! 	 * doesn't match, we check the rest of whole list.
! 	 */
! 
! 	n = (Notification *) llast(pendingNotifies);
! 	if (strcmp(n->channel, channel) == 0)
! 	{
! 		Assert(n->payload != NULL);
! 		if (strcmp(n->payload, payload) == 0)
  			return true;
  	}
  
+ 	/*
+ 	 * Note the difference to foreach(). We stop if p is the last element
+ 	 * already. So we don't check the last element, we have checked it already.
+  	 */
+ 	for(p = list_head(pendingNotifies);
+ 		p != list_tail(pendingNotifies);
+ 		p = lnext(p))
+ 	{
+ 		n = (Notification *) lfirst(p);
+ 
+ 		if (strcmp(n->channel, channel) == 0)
+ 		{
+ 			Assert(n->payload != NULL);
+ 			if (strcmp(n->payload, payload) == 0)
+ 				return true;
+ 		}
+ 	}
+ 
  	return false;
  }
  
***************
*** 1107,1112 ****
--- 1821,1828 ----
  	 */
  	pendingActions = NIL;
  	pendingNotifies = NIL;
+ 
+ 	backendSendsNotifications = false;
  }
  
  /*
***************
*** 1124,1128 ****
  	 * there is any significant delay before I commit.	OK for now because we
  	 * disallow COMMIT PREPARED inside a transaction block.)
  	 */
! 	Async_Notify((char *) recdata);
  }
--- 1840,1850 ----
  	 * there is any significant delay before I commit.	OK for now because we
  	 * disallow COMMIT PREPARED inside a transaction block.)
  	 */
! 	AsyncQueueEntry		*qe = (AsyncQueueEntry *) recdata;
! 
! 	Assert(qe->dboid == MyDatabaseId);
! 	Assert(qe->length == len);
! 
! 	Async_Notify(qe->channel, qe->payload);
  }
+ 
diff -cr cvs.head/src/backend/nodes/copyfuncs.c cvs.build/src/backend/nodes/copyfuncs.c
*** cvs.head/src/backend/nodes/copyfuncs.c	2010-01-06 22:30:06.000000000 +0100
--- cvs.build/src/backend/nodes/copyfuncs.c	2010-01-22 00:42:56.000000000 +0100
***************
*** 2770,2775 ****
--- 2770,2776 ----
  	NotifyStmt *newnode = makeNode(NotifyStmt);
  
  	COPY_STRING_FIELD(conditionname);
+ 	COPY_STRING_FIELD(payload);
  
  	return newnode;
  }
diff -cr cvs.head/src/backend/nodes/equalfuncs.c cvs.build/src/backend/nodes/equalfuncs.c
*** cvs.head/src/backend/nodes/equalfuncs.c	2010-01-06 22:30:06.000000000 +0100
--- cvs.build/src/backend/nodes/equalfuncs.c	2010-01-22 00:42:56.000000000 +0100
***************
*** 1324,1329 ****
--- 1324,1330 ----
  _equalNotifyStmt(NotifyStmt *a, NotifyStmt *b)
  {
  	COMPARE_STRING_FIELD(conditionname);
+ 	COMPARE_STRING_FIELD(payload);
  
  	return true;
  }
diff -cr cvs.head/src/backend/nodes/outfuncs.c cvs.build/src/backend/nodes/outfuncs.c
*** cvs.head/src/backend/nodes/outfuncs.c	2010-01-06 22:30:06.000000000 +0100
--- cvs.build/src/backend/nodes/outfuncs.c	2010-01-22 00:42:56.000000000 +0100
***************
*** 1817,1822 ****
--- 1817,1823 ----
  	WRITE_NODE_TYPE("NOTIFY");
  
  	WRITE_STRING_FIELD(conditionname);
+ 	WRITE_STRING_FIELD(payload);
  }
  
  static void
diff -cr cvs.head/src/backend/nodes/readfuncs.c cvs.build/src/backend/nodes/readfuncs.c
*** cvs.head/src/backend/nodes/readfuncs.c	2010-01-05 12:39:25.000000000 +0100
--- cvs.build/src/backend/nodes/readfuncs.c	2010-01-22 00:42:56.000000000 +0100
***************
*** 231,236 ****
--- 231,237 ----
  	READ_LOCALS(NotifyStmt);
  
  	READ_STRING_FIELD(conditionname);
+ 	READ_STRING_FIELD(payload);
  
  	READ_DONE();
  }
diff -cr cvs.head/src/backend/parser/gram.y cvs.build/src/backend/parser/gram.y
*** cvs.head/src/backend/parser/gram.y	2010-01-06 22:30:07.000000000 +0100
--- cvs.build/src/backend/parser/gram.y	2010-01-22 00:42:56.000000000 +0100
***************
*** 399,405 ****
  
  %type <ival>	Iconst SignedIconst
  %type <list>	Iconst_list
! %type <str>		Sconst comment_text
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
--- 399,405 ----
  
  %type <ival>	Iconst SignedIconst
  %type <list>	Iconst_list
! %type <str>		Sconst comment_text notify_payload
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
***************
*** 6074,6083 ****
   *
   *****************************************************************************/
  
! NotifyStmt: NOTIFY ColId
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
  					$$ = (Node *)n;
  				}
  		;
--- 6074,6089 ----
   *
   *****************************************************************************/
  
! notify_payload:
! 			Sconst								{ $$ = $1; }
! 			| /*EMPTY*/							{ $$ = NULL; }
! 		;
! 
! NotifyStmt: NOTIFY ColId notify_payload
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
+ 					n->payload = $3;
  					$$ = (Node *)n;
  				}
  		;
diff -cr cvs.head/src/backend/storage/ipc/ipci.c cvs.build/src/backend/storage/ipc/ipci.c
*** cvs.head/src/backend/storage/ipc/ipci.c	2010-01-20 20:08:27.000000000 +0100
--- cvs.build/src/backend/storage/ipc/ipci.c	2010-01-22 00:44:47.000000000 +0100
***************
*** 20,25 ****
--- 20,26 ----
  #include "access/nbtree.h"
  #include "access/subtrans.h"
  #include "access/twophase.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "postmaster/autovacuum.h"
***************
*** 225,230 ****
--- 226,232 ----
  	 */
  	BTreeShmemInit();
  	SyncScanShmemInit();
+ 	AsyncShmemInit();
  
  #ifdef EXEC_BACKEND
  
diff -cr cvs.head/src/backend/storage/lmgr/lwlock.c cvs.build/src/backend/storage/lmgr/lwlock.c
*** cvs.head/src/backend/storage/lmgr/lwlock.c	2010-01-05 12:39:29.000000000 +0100
--- cvs.build/src/backend/storage/lmgr/lwlock.c	2010-01-22 00:42:56.000000000 +0100
***************
*** 24,29 ****
--- 24,30 ----
  #include "access/clog.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "storage/ipc.h"
***************
*** 174,179 ****
--- 175,183 ----
  	/* multixact.c needs two SLRU areas */
  	numLocks += NUM_MXACTOFFSET_BUFFERS + NUM_MXACTMEMBER_BUFFERS;
  
+ 	/* async.c needs one per page for the AsyncQueue */
+ 	numLocks += NUM_ASYNC_BUFFERS;
+ 
  	/*
  	 * Add any requested by loadable modules; for backwards-compatibility
  	 * reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
diff -cr cvs.head/src/backend/tcop/utility.c cvs.build/src/backend/tcop/utility.c
*** cvs.head/src/backend/tcop/utility.c	2010-01-20 20:08:28.000000000 +0100
--- cvs.build/src/backend/tcop/utility.c	2010-01-22 00:42:56.000000000 +0100
***************
*** 928,936 ****
  		case T_NotifyStmt:
  			{
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
  				PreventCommandDuringRecovery();
  
! 				Async_Notify(stmt->conditionname);
  			}
  			break;
  
--- 928,943 ----
  		case T_NotifyStmt:
  			{
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
+ 				/* XXX the new listen/notify version can be enabled
+ 				 * for Hot Standby */
  				PreventCommandDuringRecovery();
  
! 				if (stmt->payload
! 					&& strlen(stmt->payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1)
! 					ereport(ERROR,
! 							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 							 errmsg("payload string too long")));
! 				Async_Notify(stmt->conditionname, stmt->payload);
  			}
  			break;
  
diff -cr cvs.head/src/bin/initdb/initdb.c cvs.build/src/bin/initdb/initdb.c
*** cvs.head/src/bin/initdb/initdb.c	2010-01-08 10:48:31.000000000 +0100
--- cvs.build/src/bin/initdb/initdb.c	2010-01-22 00:42:56.000000000 +0100
***************
*** 2458,2463 ****
--- 2458,2464 ----
  		"pg_xlog",
  		"pg_xlog/archive_status",
  		"pg_clog",
+ 		"pg_notify",
  		"pg_subtrans",
  		"pg_twophase",
  		"pg_multixact/members",
diff -cr cvs.head/src/bin/psql/common.c cvs.build/src/bin/psql/common.c
*** cvs.head/src/bin/psql/common.c	2010-01-05 12:39:33.000000000 +0100
--- cvs.build/src/bin/psql/common.c	2010-01-22 00:42:56.000000000 +0100
***************
*** 555,562 ****
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" received from server process with PID %d.\n"),
! 				notify->relname, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
--- 555,562 ----
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" (%s) received from server process with PID %d.\n"),
! 				notify->relname, notify->extra, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
diff -cr cvs.head/src/bin/psql/tab-complete.c cvs.build/src/bin/psql/tab-complete.c
*** cvs.head/src/bin/psql/tab-complete.c	2010-01-05 12:39:33.000000000 +0100
--- cvs.build/src/bin/psql/tab-complete.c	2010-01-22 00:42:56.000000000 +0100
***************
*** 2099,2105 ****
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(relname) FROM pg_catalog.pg_listener WHERE substring(pg_catalog.quote_ident(relname),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
--- 2099,2105 ----
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(channel) FROM pg_catalog.pg_listening() AS channel WHERE substring(pg_catalog.quote_ident(channel),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
diff -cr cvs.head/src/include/access/slru.h cvs.build/src/include/access/slru.h
*** cvs.head/src/include/access/slru.h	2010-01-05 12:39:34.000000000 +0100
--- cvs.build/src/include/access/slru.h	2010-01-22 00:42:56.000000000 +0100
***************
*** 16,21 ****
--- 16,40 ----
  #include "access/xlogdefs.h"
  #include "storage/lwlock.h"
  
+ /*
+  * Define segment size.  A page is the same BLCKSZ as is used everywhere
+  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
+  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
+  * or 64K transactions for SUBTRANS.
+  *
+  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
+  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
+  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+  * take no explicit notice of that fact in this module, except when comparing
+  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
+  *
+  * Note: this file currently assumes that segment file names will be four
+  * hex digits.	This sets a lower bound on the segment size (64K transactions
+  * for 32-bit TransactionIds).
+  */
+ #define SLRU_PAGES_PER_SEGMENT	32
+ 
  
  /*
   * Page status codes.  Note that these do not include the "dirty" bit.
diff -cr cvs.head/src/include/catalog/pg_proc.h cvs.build/src/include/catalog/pg_proc.h
*** cvs.head/src/include/catalog/pg_proc.h	2010-01-20 20:08:29.000000000 +0100
--- cvs.build/src/include/catalog/pg_proc.h	2010-01-22 00:42:56.000000000 +0100
***************
*** 4097,4102 ****
--- 4097,4104 ----
  DESCR("get the prepared statements for this session");
  DATA(insert OID = 2511 (  pg_cursor PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,16,16,16,1184}" "{o,o,o,o,o,o}" "{name,statement,is_holdable,is_binary,is_scrollable,creation_time}" _null_ pg_cursor _null_ _null_ _null_ ));
  DESCR("get the open cursors for this session");
+ DATA(insert OID = 2187 (  pg_listening	PGNSP	PGUID 12 1 10 0 f f f t t s 0 0 25 "" _null_ _null_ _null_ _null_ pg_listening _null_ _null_ _null_ ));
+ DESCR("get the channels that the current backend listens to");
  DATA(insert OID = 2599 (  pg_timezone_abbrevs	PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,1186,16}" "{o,o,o}" "{abbrev,utc_offset,is_dst}" _null_ pg_timezone_abbrevs _null_ _null_ _null_ ));
  DESCR("get the available time zone abbreviations");
  DATA(insert OID = 2856 (  pg_timezone_names		PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,1186,16}" "{o,o,o,o}" "{name,abbrev,utc_offset,is_dst}" _null_ pg_timezone_names _null_ _null_ _null_ ));
diff -cr cvs.head/src/include/commands/async.h cvs.build/src/include/commands/async.h
*** cvs.head/src/include/commands/async.h	2010-01-05 12:39:35.000000000 +0100
--- cvs.build/src/include/commands/async.h	2010-01-22 00:42:56.000000000 +0100
***************
*** 13,28 ****
  #ifndef ASYNC_H
  #define ASYNC_H
  
  extern bool Trace_notify;
  
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_Notify(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
--- 13,42 ----
  #ifndef ASYNC_H
  #define ASYNC_H
  
+ /*
+  * How long can a payload string possibly be? Actually it needs to be one
+  * byte less to provide space for the trailing terminating '\0'.
+  */
+ #define NOTIFY_PAYLOAD_MAX_LENGTH	8000
+ 
+ /*
+  * How many page slots do we reserve ?
+  */
+ #define NUM_ASYNC_BUFFERS			4
+ 
  extern bool Trace_notify;
  
+ extern void AsyncShmemInit(void);
+ 
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname, const char *payload);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_NotifyBeforeCommit(void);
! extern void AtCommit_NotifyAfterCommit(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
***************
*** 43,46 ****
--- 57,62 ----
  extern void notify_twophase_postcommit(TransactionId xid, uint16 info,
  						   void *recdata, uint32 len);
  
+ extern Datum pg_listening(PG_FUNCTION_ARGS);
+ 
  #endif   /* ASYNC_H */
diff -cr cvs.head/src/include/nodes/parsenodes.h cvs.build/src/include/nodes/parsenodes.h
*** cvs.head/src/include/nodes/parsenodes.h	2010-01-20 20:08:30.000000000 +0100
--- cvs.build/src/include/nodes/parsenodes.h	2010-01-22 00:42:56.000000000 +0100
***************
*** 2081,2086 ****
--- 2081,2087 ----
  {
  	NodeTag		type;
  	char	   *conditionname;	/* condition name to notify */
+ 	char	   *payload;		/* the payload string to be conveyed */
  } NotifyStmt;
  
  /* ----------------------
diff -cr cvs.head/src/include/storage/lwlock.h cvs.build/src/include/storage/lwlock.h
*** cvs.head/src/include/storage/lwlock.h	2010-01-05 12:39:36.000000000 +0100
--- cvs.build/src/include/storage/lwlock.h	2010-01-22 00:44:25.000000000 +0100
***************
*** 67,72 ****
--- 67,74 ----
  	AutovacuumLock,
  	AutovacuumScheduleLock,
  	SyncScanLock,
+ 	AsyncCtlLock,
+ 	AsyncQueueLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff -cr cvs.head/src/include/utils/errcodes.h cvs.build/src/include/utils/errcodes.h
*** cvs.head/src/include/utils/errcodes.h	2010-01-05 12:39:36.000000000 +0100
--- cvs.build/src/include/utils/errcodes.h	2010-01-22 00:42:56.000000000 +0100
***************
*** 318,323 ****
--- 318,324 ----
  #define ERRCODE_STATEMENT_TOO_COMPLEX		MAKE_SQLSTATE('5','4', '0','0','1')
  #define ERRCODE_TOO_MANY_COLUMNS			MAKE_SQLSTATE('5','4', '0','1','1')
  #define ERRCODE_TOO_MANY_ARGUMENTS			MAKE_SQLSTATE('5','4', '0','2','3')
+ #define ERRCODE_TOO_MANY_ENTRIES			MAKE_SQLSTATE('5','4', '0','3','1')
  
  /* Class 55 - Object Not In Prerequisite State (class borrowed from DB2) */
  #define ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE	MAKE_SQLSTATE('5','5', '0','0','0')
diff -cr cvs.head/src/test/regress/expected/guc.out cvs.build/src/test/regress/expected/guc.out
*** cvs.head/src/test/regress/expected/guc.out	2009-11-22 06:20:41.000000000 +0100
--- cvs.build/src/test/regress/expected/guc.out	2010-01-22 00:42:56.000000000 +0100
***************
*** 532,540 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
!   relname  
! -----------
   foo_event
  (1 row)
  
--- 532,540 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
!  pg_listening 
! --------------
   foo_event
  (1 row)
  
***************
*** 571,579 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
!  relname 
! ---------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
--- 571,579 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
!  pg_listening 
! --------------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
diff -cr cvs.head/src/test/regress/expected/sanity_check.out cvs.build/src/test/regress/expected/sanity_check.out
*** cvs.head/src/test/regress/expected/sanity_check.out	2010-01-20 20:08:32.000000000 +0100
--- cvs.build/src/test/regress/expected/sanity_check.out	2010-01-22 00:42:56.000000000 +0100
***************
*** 107,113 ****
   pg_language             | t
   pg_largeobject          | t
   pg_largeobject_metadata | t
-  pg_listener             | f
   pg_namespace            | t
   pg_opclass              | t
   pg_operator             | t
--- 107,112 ----
***************
*** 154,160 ****
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (143 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
--- 153,159 ----
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (142 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
diff -cr cvs.head/src/test/regress/sql/guc.sql cvs.build/src/test/regress/sql/guc.sql
*** cvs.head/src/test/regress/sql/guc.sql	2009-10-21 22:38:58.000000000 +0200
--- cvs.build/src/test/regress/sql/guc.sql	2010-01-22 00:42:56.000000000 +0100
***************
*** 165,171 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 165,171 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
***************
*** 174,180 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 174,180 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
#80Jeff Davis
pgsql@j-davis.com
In reply to: Joachim Wieland (#79)
Re: Listen / Notify - what to do when the queue is full

Comments:

* In standard_ProcessUtility(), case NotifyStmt, you add a comment:

/* XXX the new listen/notify version can be enabled
* for Hot Standby */

but I don't think that's correct. We may be able to support LISTEN
on the standby, but not NOTIFY (right?). I don't think we should
be adding speculative comments anyway, because there is some work
still needed before HS can support LISTEN (notably, WAL-logging
NOTIFY).

* You have a TODO list as a comment. Can you remove it and explain
those items on list if they aren't already?

* You have the comment:

/*
* How long can a payload string possibly be? Actually it needs to be one
* byte less to provide space for the trailing terminating '\0'.
*/

That should be written more simply, like "Maximum size of the
payload, including terminating NULL."

* You have the Assert:

Assert(strlen(n->payload) <= NOTIFY_PAYLOAD_MAX_LENGTH);

which is inconsistent with the earlier test:

if (stmt->payload
&& strlen(stmt->payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1)
ereport(ERROR, ...

* ASCII-only is still an open issue.

* 2PC is still an open issue (notifications are lost on restart,
and there may be other problems, as well). I think it's easy enough
to throw an error on PREPARE TRANSACTION if there are any
notifications, right?

* There's a bug where an UNLISTEN can abort, and yet you still miss
the notification. This is because you unlisten before you actually
commit the transaction, and an error between those times will cause
the UNLISTEN to take effect even though the rest of the transaction
fails. For example:

-- session 1
LISTEN foo;
BEGIN;
UNLISTEN foo;

-- session 2
NOTIFY foo;

-- gdb in session 1
(gdb) break AtCommit_NotifyBeforeCommit
(gdb) c

-- session 1
COMMIT;

-- gdb in session 1
(gdb) finish
(gdb) p op_strict(7654322)
(gdb) quit

The notification is missed. It's fixed easily enough by doing the
UNLISTEN step in AtCommit_NotifyAfterCommit.
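
The fix proposed for the UNLISTEN bug can be illustrated with a toy model. This is only a sketch, not the patch's code: ToyBackend and all toy_* functions are hypothetical names, and the model tracks a single channel. UNLISTEN merely records a request, and the listening state changes only in the stand-in for the post-commit hook.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of one backend listening on one channel. */
typedef struct
{
    bool        listening;          /* effective state seen by notifiers */
    bool        pending_unlisten;   /* UNLISTEN issued in the open transaction */
} ToyBackend;

/* UNLISTEN inside a transaction only records the request. */
static void
toy_unlisten(ToyBackend *b)
{
    b->pending_unlisten = true;
}

/* Stand-in for the pre-commit hook: it must not touch the listening state. */
static void
toy_before_commit(ToyBackend *b)
{
    (void) b;
}

/* Stand-in for the post-commit hook: the commit succeeded, so apply it. */
static void
toy_after_commit(ToyBackend *b)
{
    if (b->pending_unlisten)
        b->listening = false;
    b->pending_unlisten = false;
}

/* Stand-in for abort: forget the request and keep listening. */
static void
toy_abort(ToyBackend *b)
{
    b->pending_unlisten = false;
}

/*
 * Model LISTEN; BEGIN; UNLISTEN; COMMIT, where an error strikes between
 * the pre-commit hook and the actual commit if commit_fails is true.
 * Returns whether the backend still listens afterwards.
 */
static bool
toy_run(bool commit_fails)
{
    ToyBackend  b = {true, false};

    toy_unlisten(&b);
    toy_before_commit(&b);
    if (commit_fails)
        toy_abort(&b);
    else
        toy_after_commit(&b);
    return b.listening;
}
```

With this ordering, toy_run(true) leaves the backend listening, so a NOTIFY sent by another session in the meantime is not missed, while toy_run(false) stops listening as requested.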

I'm still looking through some of the queue stuff, and I'll send an
update soon. I wanted to give you some feedback now though.

Regards,
Jeff Davis

#81Joachim Wieland
joe@mcknight.de
In reply to: Jeff Davis (#80)
1 attachment(s)
Re: Listen / Notify - what to do when the queue is full

On Sat, Jan 30, 2010 at 12:02 AM, Jeff Davis <pgsql@j-davis.com> wrote:

Comments:

* In standard_ProcessUtility(), case NotifyStmt, you add a comment:

   /* XXX the new listen/notify version can be enabled
    * for Hot Standby */

 but I don't think that's correct. We may be able to support LISTEN
 on the standby, but not NOTIFY (right?). I don't think we should
 be adding speculative comments anyway, because there is some work
 still needed before HS can support LISTEN (notably, WAL-logging
 NOTIFY).

I admit that it was not clear what I meant. The comment should only
address LISTEN / NOTIFY on the standby server. Do you see any problems
allowing it? Of course it's not all that useful because all
transactions are read-only but it still allows different clients to
communicate.
Of course listening on the standby server to notifications from the
primary server is not possible currently but from my point of view this
is a completely new use case for LISTEN/NOTIFY as it is not local but
across servers.

* You have a TODO list as a comment. Can you remove it and explain
 those items on list if they aren't already?

Sure.

I was wondering if we should have a hard limit on the maximum number
of notifications per transaction. You can now easily fill up your
backend's memory with notifications. However we had the same problem
with the old implementation already and nobody complained. The
difference is just that now you can fill it up a lot faster because
you can send a large payload.
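
A hard limit of the kind wondered about here could look roughly like the following. This is a hypothetical sketch only: the cap TOY_MAX_PENDING_BYTES, the struct and the function name are invented for illustration and do not appear in the patch. Each NOTIFY is charged for its channel and payload strings, and a request that would exceed the cap is refused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cap on notification bytes queued up by one transaction. */
#define TOY_MAX_PENDING_BYTES 1024

typedef struct
{
    size_t      pending_bytes;  /* bytes held by not-yet-committed NOTIFYs */
} ToyXactNotifyState;

/*
 * Account for one NOTIFY. Returns false if the notification would push
 * the transaction over the cap; a real backend would raise an error here.
 */
static bool
toy_add_pending(ToyXactNotifyState *st, const char *channel,
                const char *payload)
{
    /* both strings are stored with their terminating '\0' */
    size_t      need = strlen(channel) + 1 + strlen(payload) + 1;

    /* pending_bytes never exceeds the cap, so this cannot underflow */
    if (need > TOY_MAX_PENDING_BYTES - st->pending_bytes)
        return false;
    st->pending_bytes += need;
    return true;
}
```

Whether such a limit should count bytes or notifications, and what value it should have, is exactly the open question raised above.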

The second doubt I had is about the truncation behavior of slru. ISTM
that it doesn't truncate at the end of the page range once the head
pointer has already wrapped around.
There is the following comment in slru.c describing this fact:

/*
* While we are holding the lock, make an important safety check: the
* planned cutoff point must be <= the current endpoint page. Otherwise we
* have already wrapped around, and proceeding with the truncation would
* risk removing the current segment.
*/

I wanted to check if we can do anything about it and if we need to do
anything at all...
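
The effect described here can be reproduced in a toy model. The sketch below assumes a tiny page range of 0..7 and assumes that the safety check boils down to a plain numeric comparison of cutoff page and head page; all toy_* names are hypothetical and none of this is the actual slru.c or async.c code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tiny page range 0..TOY_MAX_PAGE that wraps around. */
#define TOY_MAX_PAGE 7

/* Number of pages from 'page' forward to 'head', following wraparound. */
static int
toy_distance_to_head(int page, int head)
{
    return (head - page + TOY_MAX_PAGE + 1) % (TOY_MAX_PAGE + 1);
}

/* p logically precedes q if p lies further behind the head than q does. */
static bool
toy_precedes_logically(int p, int q, int head)
{
    return toy_distance_to_head(p, head) > toy_distance_to_head(q, head);
}

/*
 * Toy version of the safety check quoted above: the planned cutoff page
 * must be <= the current head page, otherwise truncation is skipped.
 */
static bool
toy_truncate_allowed(int cutoff, int head)
{
    return cutoff <= head;
}
```

With the tail on page 6 and a head that has already wrapped around to page 1, page 6 still logically precedes the head, yet toy_truncate_allowed(6, 1) is false. Pages 2 to 5 therefore stay on disk until the tail wraps around as well, which is the behavior in doubt.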

* You have the comment:
 That should be written more simply [...]

done

* You have the Assert:

done
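
One way to keep the Assert and the ereport() test in agreement is to route both through a single predicate. The following is a sketch under that assumption; payload_fits is a hypothetical helper, not a function from the patch. Only the constant NOTIFY_PAYLOAD_MAX_LENGTH and its value are taken from the patch, read here as the maximum size including the terminating '\0'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Maximum payload size in bytes, including the terminating '\0'. */
#define NOTIFY_PAYLOAD_MAX_LENGTH 8000

/*
 * Hypothetical helper: true if the payload plus its terminator fits.
 * A NULL payload is treated like the empty string.
 */
static bool
payload_fits(const char *payload)
{
    if (payload == NULL)
        return true;
    /* same bound as "strlen(payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1" */
    return strlen(payload) < NOTIFY_PAYLOAD_MAX_LENGTH;
}
```

The statement-level check would then report an error when payload_fits() is false, and the later assertion would assert payload_fits() on the stored string, so the two bounds cannot drift apart.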

* ASCII-only is still an open issue.

still an open issue...

* 2PC is still an open issue (notifications are lost on restart,
 and there may be other problems, as well). I think it's easy enough
 to throw an error on PREPARE TRANSACTION if there are any
 notifications, right?

We could forbid them, yes. Notifications aren't lost on restart,
however; they get recorded in the 2PC state files. Currently the patch
works fine with 2PC (including server restart) with the exception that
in the queue full case we might need to roll back the prepared
transaction. Now the question is, should we forbid NOTIFY for 2PC
altogether only because in the unlikely event of a full queue we
cannot guarantee that we can commit the transaction?

One solution is to treat a 2PC transaction like a backend with its own
pointer to the queue. As long as the prepared transaction is not
committed, its pointer does not move and so we don't move forward the
global tail pointer. Here the NOTIFYs sent by the 2PC transaction are
already in the queue and the transaction can always commit. The
drawback of this idea is that if you forget to commit the prepared
transaction and leave it around uncommitted, your queue will fill up
inevitably because you do not truncate anymore...
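
The idea of giving a prepared transaction its own queue pointer can be sketched as follows. This is a simplified illustration, not the patch's asyncQueueAdvanceTail(): ToyReader and toy_new_tail are hypothetical names, and positions are plain increasing numbers, so wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reader slot: a listening backend or a prepared transaction. */
typedef struct
{
    int         in_use;     /* 0 if the slot is free */
    long        pos;        /* oldest queue position this reader still needs */
} ToyReader;

/*
 * The new tail is the smallest position any reader still needs, or the
 * head if nobody is reading. Everything before the tail may be truncated.
 */
static long
toy_new_tail(const ToyReader *readers, size_t n, long head)
{
    long        min = head;
    size_t      i;

    for (i = 0; i < n; i++)
        if (readers[i].in_use && readers[i].pos < min)
            min = readers[i].pos;
    return min;
}
```

A prepared transaction that is never committed keeps its slot in use with an unchanged pos, so toy_new_tail() returns that position forever and the queue can no longer be truncated past it. That is the drawback described above.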

* There's a bug where an UNLISTEN can abort, and yet you still miss
 the notification.
[...]
 The notification is missed. It's fixed easily enough by doing the
 UNLISTEN step in AtCommit_NotifyAfterCommit.

Thanks, very well spotted... Actually the same is true for LISTEN... I
have reworked the patch to do the changes to listenChannels only in
the post-commit functions.

I have also included Arnaud Betremieux's send_notify function. Should
we really call the function send_notify? What about
pg_send_notification or just pg_notify? I am not a native English
speaker but to me it sounds strange to send a verb (notify) and I'd
rather prefix the name with pg_...?
Currently the function always returns void. Is this the preferred return type?

Joachim

Attachments:

listennotify.9.diff (text/x-diff; charset=US-ASCII)
diff -cr cvs/src/backend/access/transam/slru.c cvs.build/src/backend/access/transam/slru.c
*** cvs/src/backend/access/transam/slru.c	2010-01-30 22:09:03.000000000 +0100
--- cvs.build/src/backend/access/transam/slru.c	2010-01-30 22:10:21.000000000 +0100
***************
*** 58,83 ****
  #include "storage/shmem.h"
  #include "miscadmin.h"
  
- 
- /*
-  * Define segment size.  A page is the same BLCKSZ as is used everywhere
-  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
-  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
-  * or 64K transactions for SUBTRANS.
-  *
-  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
-  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
-  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
-  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
-  * take no explicit notice of that fact in this module, except when comparing
-  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
-  *
-  * Note: this file currently assumes that segment file names will be four
-  * hex digits.	This sets a lower bound on the segment size (64K transactions
-  * for 32-bit TransactionIds).
-  */
- #define SLRU_PAGES_PER_SEGMENT	32
- 
  #define SlruFileName(ctl, path, seg) \
  	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
  
--- 58,63 ----
diff -cr cvs/src/backend/access/transam/xact.c cvs.build/src/backend/access/transam/xact.c
*** cvs/src/backend/access/transam/xact.c	2010-01-30 22:09:03.000000000 +0100
--- cvs.build/src/backend/access/transam/xact.c	2010-01-31 13:14:27.000000000 +0100
***************
*** 1728,1735 ****
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* NOTIFY commit must come before lower-level cleanup */
! 	AtCommit_Notify();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
--- 1728,1735 ----
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* Insert notifications sent by the NOTIFY command into the queue */
! 	AtCommit_NotifyBeforeCommit();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
***************
*** 1804,1809 ****
--- 1804,1814 ----
  
  	AtEOXact_MultiXact();
  
+ 	/*
+ 	 * Clean up Notify buffers and signal listening backends.
+ 	 */
+ 	AtCommit_NotifyAfterCommit();
+ 
  	ResourceOwnerRelease(TopTransactionResourceOwner,
  						 RESOURCE_RELEASE_LOCKS,
  						 true, true);
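A note on the xact.c hunks above: the queue write now happens before the clog commit, and the signaling happens after it. The following is a minimal sketch of that ordering in plain C. Every name in it (ToyXact, toy_commit, toy_queue_write, ...) is hypothetical and only mimics the sequence of steps described in this patch; it is not the PostgreSQL code itself.

```c
#include <assert.h>
#include <stdbool.h>

/* All names below are hypothetical; they only mimic the order of steps. */
typedef enum { STEP_QUEUE_WRITE, STEP_CLOG_COMMIT, STEP_SIGNAL } ToyStep;

typedef struct ToyXact
{
	ToyStep	steps[3];	/* steps performed, in order */
	int		nsteps;
	bool	committed;
} ToyXact;

static void
toy_record(ToyXact *x, ToyStep s)
{
	assert(x->nsteps < 3);
	x->steps[x->nsteps++] = s;
}

/* Stand-in for AtCommit_NotifyBeforeCommit(): may fail, e.g. disk full. */
static bool
toy_queue_write(ToyXact *x, bool disk_full)
{
	if (disk_full)
		return false;
	toy_record(x, STEP_QUEUE_WRITE);
	return true;
}

/* Returns true if the transaction committed, false if it rolled back. */
bool
toy_commit(ToyXact *x, bool disk_full)
{
	x->nsteps = 0;
	x->committed = false;

	/* 1. write notifications to the queue while we can still abort */
	if (!toy_queue_write(x, disk_full))
		return false;	/* nothing committed, nobody signaled */

	/* 2. mark the transaction committed (clog in the real code) */
	toy_record(x, STEP_CLOG_COMMIT);
	x->committed = true;

	/* 3. stand-in for AtCommit_NotifyAfterCommit(): wake up listeners */
	toy_record(x, STEP_SIGNAL);
	return true;
}
```

Calling toy_commit(&x, true) models a write error: the function returns before the commit step, so no listener is ever signaled for a transaction that did not commit.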
diff -cr cvs/src/backend/catalog/Makefile cvs.build/src/backend/catalog/Makefile
*** cvs/src/backend/catalog/Makefile	2010-01-30 22:08:49.000000000 +0100
--- cvs.build/src/backend/catalog/Makefile	2010-01-30 22:10:21.000000000 +0100
***************
*** 30,36 ****
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
! 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_listener.h pg_description.h \
  	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
--- 30,36 ----
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
! 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_description.h \
  	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
diff -cr cvs/src/backend/commands/async.c cvs.build/src/backend/commands/async.c
*** cvs/src/backend/commands/async.c	2010-01-30 22:08:52.000000000 +0100
--- cvs.build/src/backend/commands/async.c	2010-02-03 00:22:41.000000000 +0100
***************
*** 14,44 ****
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  * 1. Multiple backends on same machine.  Multiple backends listening on
!  *	  one relation.  (Note: "listening on a relation" is not really the
!  *	  right way to think about it, since the notify names need not have
!  *	  anything to do with the names of relations actually in the database.
!  *	  But this terminology is all over the code and docs, and I don't feel
!  *	  like trying to replace it.)
!  *
!  * 2. There is a tuple in relation "pg_listener" for each active LISTEN,
!  *	  ie, each relname/listenerPID pair.  The "notification" field of the
!  *	  tuple is zero when no NOTIFY is pending for that listener, or the PID
!  *	  of the originating backend when a cross-backend NOTIFY is pending.
!  *	  (We skip writing to pg_listener when doing a self-NOTIFY, so the
!  *	  notification field should never be equal to the listenerPID field.)
!  *
!  * 3. The NOTIFY statement itself (routine Async_Notify) just adds the target
!  *	  relname to a list of outstanding NOTIFY requests.  Actual processing
!  *	  happens if and only if we reach transaction commit.  At that time (in
!  *	  routine AtCommit_Notify) we scan pg_listener for matching relnames.
!  *	  If the listenerPID in a matching tuple is ours, we just send a notify
!  *	  message to our own front end.  If it is not ours, and "notification"
!  *	  is not already nonzero, we set notification to our own PID and send a
!  *	  PROCSIG_NOTIFY_INTERRUPT signal to the receiving process (indicated by
!  *	  listenerPID).
!  *	  BTW: if the signal operation fails, we presume that the listener backend
!  *	  crashed without removing this tuple, and remove the tuple for it.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
--- 14,68 ----
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  *
!  * 1. Multiple backends on same machine. Multiple backends listening on
!  *	  several channels. (A channel was previously called a "relation" even
!  *	  though it is just an identifier unrelated to any database relation.)
!  *
!  * 2. There is one central queue in the form of SLRU-backed, file-based storage
!  *    (directory pg_notify/), with several pages mapped into shared memory.
!  *
!  *    There is no central storage of which backend listens on which channel,
!  *    every backend has its own list.
!  *
!  *    Every backend that is listening on at least one channel registers by
!  *    entering its Pid into the array of all backends. It then scans all
!  *    incoming notifications and compares the notified channels with its list.
!  *
!  *    In case there is a match it delivers the corresponding notification to
!  *    its frontend.
!  *
!  * 3. The NOTIFY statement (routine Async_Notify) stores the notification
!  *    in a list which will not be processed until transaction end. Every
!  *    notification can additionally send a "payload" which is an extra text
!  *    parameter to convey arbitrary information to the recipient.
!  *
!  *    Duplicate notifications from the same transaction are sent out as one
!  *    notification only. This is done to save work when for example a trigger
!  *    on a 2 million row table fires a notification for each row that has been
!  *    changed. If the application needs to receive every single notification
!  *    that has been sent, it can easily add some unique string into the extra
!  *    payload parameter.
!  *
!  *    Once the transaction commits, AtCommit_NotifyBeforeCommit() performs the
!  *    required changes regarding listeners (Listen/Unlisten) and then adds the
!  *    pending notifications to the head of the queue. The head pointer of the
!  *    queue always points to the next free position and a position is just a
!  *    page number and the offset in that page. This is done before marking the
!  *    transaction as committed in clog. If we run into problems writing the
!  *    notifications, we can still call elog(ERROR, ...) and the transaction
!  *    will roll back.
!  *
!  *    Once we have put all of the notifications into the queue, we return to
!  *    CommitTransaction() which will then commit to clog.
!  *
!  *    After clog commit we are called another time
!  *    (AtCommit_NotifyAfterCommit()). Here we check if we need to signal the
!  *    backends. In SignalBackends() we scan the list of listening backends and
!  *    send a PROCSIG_NOTIFY_INTERRUPT to every backend that has set its Pid (we
!  *    don't know which backend is listening on which channel so we need to send
!  *    a signal to every listening backend). We can exclude backends that are
!  *    already up to date.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
***************
*** 46,84 ****
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  * 5. Inbound-notify processing consists of scanning pg_listener for tuples
!  *	  matching our own listenerPID and having nonzero notification fields.
!  *	  For each such tuple, we send a message to our frontend and clear the
!  *	  notification field.  BTW: this routine has to start/commit its own
!  *	  transaction, since by assumption it is only called from outside any
!  *	  transaction.
!  *
!  * Like NOTIFY, LISTEN and UNLISTEN just add the desired action to a list
!  * of pending actions.	If we reach transaction commit, the changes are
!  * applied to pg_listener just before executing any pending NOTIFYs.  This
!  * method is necessary because to avoid race conditions, we must hold lock
!  * on pg_listener from when we insert a new listener tuple until we commit.
!  * To do that and not create undue hazard of deadlock, we don't want to
!  * touch pg_listener until we are otherwise done with the transaction;
!  * in particular it'd be uncool to still be taking user-commanded locks
!  * while holding the pg_listener lock.
!  *
!  * Although we grab ExclusiveLock on pg_listener for any operation,
!  * the lock is never held very long, so it shouldn't cause too much of
!  * a performance problem.  (Previously we used AccessExclusiveLock, but
!  * there's no real reason to forbid concurrent reads.)
   *
!  * An application that listens on the same relname it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * PID.  (As of FE/BE protocol 2.0, the backend's PID is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.  Note,
!  * however, that we do *not* guarantee that a separate frontend message will
!  * be sent for every outside NOTIFY.  Since there is only room for one
!  * originating PID in pg_listener, outside notifies occurring at about the
!  * same time may be collapsed into a single message bearing the PID of the
!  * first outside backend to perform the NOTIFY.
   *-------------------------------------------------------------------------
   */
  
--- 70,91 ----
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  *    Inbound-notify processing consists of reading all of the notifications
!  *	  that have arrived since scanning last time. We read every notification
!  *	  until we reach either a notification from an uncommitted transaction or
!  *	  the head pointer's position. Then we check if we were the laziest
!  *	  backend: if our pointer was at the same position as the global tail
!  *	  pointer, we advance the tail to the position of the second-laziest
!  *	  backend (we can identify it by inspecting the positions of all other
!  *	  backends' pointers). Whenever we move the tail pointer we also truncate
!  *	  now-unused pages (i.e. delete files in pg_notify/ no longer in use).
   *
!  * An application that listens on the same channel it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * Pid.  (As of FE/BE protocol 2.0, the backend's Pid is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.
   *-------------------------------------------------------------------------
   */
  
***************
*** 88,97 ****
  #include <signal.h>
  
  #include "access/heapam.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
! #include "catalog/pg_listener.h"
  #include "commands/async.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
--- 95,107 ----
  #include <signal.h>
  
  #include "access/heapam.h"
+ #include "access/slru.h"
+ #include "access/transam.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
! #include "catalog/pg_type.h"
  #include "commands/async.h"
+ #include "funcapi.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
***************
*** 108,115 ****
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction.  As explained above,
!  * we don't actually modify pg_listener until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
--- 118,125 ----
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction. As explained above,
!  * we don't actually apply these actions until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
***************
*** 126,132 ****
  typedef struct
  {
  	ListenActionKind action;
! 	char		condname[1];	/* actually, as long as needed */
  } ListenAction;
  
  static List *pendingActions = NIL;		/* list of ListenAction */
--- 136,142 ----
  typedef struct
  {
  	ListenActionKind action;
! 	char		channel[1];	/* actually, as long as needed */
  } ListenAction;
  
  static List *pendingActions = NIL;		/* list of ListenAction */
***************
*** 134,140 ****
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all relnames NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
--- 144,150 ----
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all channels NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
***************
*** 149,160 ****
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
- static List *pendingNotifies = NIL;		/* list of C strings */
  
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
  
  /*
!  * State for inbound notifies consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
--- 159,281 ----
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
  
+ typedef struct QueuePosition
+ {
+ 	int				page;
+ 	int				offset;
+ } QueuePosition;
+ 
+ typedef struct Notification
+ {
+ 	char		   *channel;
+ 	char		   *payload;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ } Notification;
+ 
+ typedef struct AsyncQueueEntry
+ {
+ 	/*
+ 	 * this record has the maximal length, but usually we limit it to
+ 	 * AsyncQueueEntryEmptySize + strlen(payload).
+ 	 */
+ 	Size			length;
+ 	Oid				dboid;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ 	char			channel[NAMEDATALEN];
+ 	char			payload[NOTIFY_PAYLOAD_MAX_LENGTH];
+ } AsyncQueueEntry;
+ #define AsyncQueueEntryEmptySize \
+ 	 (sizeof(AsyncQueueEntry) - NOTIFY_PAYLOAD_MAX_LENGTH + 1)
+ 
+ #define	InvalidPid				(-1)
+ #define QUEUE_POS_PAGE(x)		((x).page)
+ #define QUEUE_POS_OFFSET(x)		((x).offset)
+ #define QUEUE_POS_EQUAL(x,y) \
+ 	 ((x).page == (y).page ? (x).offset == (y).offset : false)
+ #define SET_QUEUE_POS(x,y,z) \
+ 	do { \
+ 		(x).page = (y); \
+ 		(x).offset = (z); \
+ 	} while (0)
+ /* returns the logically earlier of positions x and y, with z = HEAD */
+ #define QUEUE_POS_MIN(x,y,z) \
+ 	(asyncQueuePagePrecedesLogically((x).page, (y).page, (z).page) ? (x) : \
+ 		 asyncQueuePagePrecedesLogically((y).page, (x).page, (z).page) ? (y) : \
+ 			 (x).offset < (y).offset ? (x) : \
+ 			 	(y))
+ #define QUEUE_BACKEND_POS(i)		asyncQueueControl->backend[(i)].pos
+ #define QUEUE_BACKEND_PID(i)		asyncQueueControl->backend[(i)].pid
+ #define QUEUE_HEAD					asyncQueueControl->head
+ #define QUEUE_TAIL					asyncQueueControl->tail
+ 
+ typedef struct QueueBackendStatus
+ {
+ 	int32			pid;
+ 	QueuePosition	pos;
+ } QueueBackendStatus;
+ 
+ /*
+  * The AsyncQueueControl structure is protected by the AsyncQueueLock.
+  *
+  * In SHARED mode, backends will only inspect their own entries as well as
+  * head and tail pointers. Consequently we can allow a backend to update its
+  * own record while holding only a shared lock (since no other backend will
+  * inspect it).
+  *
+  * In EXCLUSIVE mode, backends can inspect the entries of other backends and
+  * also change head and tail pointers.
+  *
+  * In order to avoid deadlocks, whenever we need both locks, we always first
+  * get AsyncQueueLock and then AsyncCtlLock.
+  */
+ typedef struct AsyncQueueControl
+ {
+ 	QueuePosition		head;		/* head points to the next free location */
+ 	QueuePosition 		tail;		/* the global tail is equivalent to the
+ 									   tail of the "slowest" backend */
+ 	TimestampTz			lastQueueFillWarn;	/* when the queue is full we only
+ 											   want to log that once in a
+ 											   while */
+ 	QueueBackendStatus	backend[1];	/* actually this one has as many entries as
+ 									 * connections are allowed (MaxBackends) */
+ 	/* DO NOT ADD FURTHER STRUCT MEMBERS HERE */
+ } AsyncQueueControl;
+ 
+ static AsyncQueueControl   *asyncQueueControl;
+ static SlruCtlData			AsyncCtlData;
+ 
+ #define AsyncCtl					(&AsyncCtlData)
+ #define QUEUE_PAGESIZE				BLCKSZ
+ #define QUEUE_FULL_WARN_INTERVAL	5000	/* warn at most once every 5s */
+ 
+ /*
+  * slru.c currently assumes that all filenames are four characters of hex
+  * digits. That means that we can use segments 0000 through FFFF.
+  * Each segment contains SLRU_PAGES_PER_SEGMENT pages which gives us
+  * the pages from 0 to SLRU_PAGES_PER_SEGMENT * 0xFFFF.
+  *
+  * It's of course easy to enhance slru.c but those pages give us so much
+  * space already that it doesn't seem worth the trouble...
+  *
+  * It's an interesting test case to define QUEUE_MAX_PAGE to a very small
+  * multiple of SLRU_PAGES_PER_SEGMENT to test queue full behaviour.
+  */
+ #define QUEUE_MAX_PAGE			(SLRU_PAGES_PER_SEGMENT * 0xFFFF)
+ 
+ static List *pendingNotifies = NIL;				/* list of Notifications */
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
+ static List *listenChannels = NIL;	/* list of channels we are listening to */
+ 
+ /* has this backend sent notifications in the current transaction ? */
+ static bool backendSendsNotifications = false;
+ /* has this backend executed a LISTEN in the current transaction ? */
+ static bool backendExecutesInitialListen = false;
  
  /*
!  * State for inbound notifications consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
***************
*** 171,224 ****
  
  bool		Trace_notify = false;
  
! 
! static void queue_listen(ListenActionKind action, const char *condname);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static void Exec_Listen(Relation lRel, const char *relname);
! static void Exec_Unlisten(Relation lRel, const char *relname);
! static void Exec_UnlistenAll(Relation lRel);
! static void Send_Notify(Relation lRel);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(char *relname, int32 listenerPID);
! static bool AsyncExistsPendingNotify(const char *relname);
  static void ClearPendingActionsAndNotifies(void);
  
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the relation to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", relname);
  
  	/* no point in making duplicate entries in the list ... */
! 	if (!AsyncExistsPendingNotify(relname))
! 	{
! 		/*
! 		 * The name list needs to live until end of transaction, so store it
! 		 * in the transaction context.
! 		 */
! 		MemoryContext oldcontext;
  
! 		oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
! 		/*
! 		 * Ordering of the list isn't important.  We choose to put new entries
! 		 * on the front, as this might make duplicate-elimination a tad faster
! 		 * when the same condition is signaled many times in a row.
! 		 */
! 		pendingNotifies = lcons(pstrdup(relname), pendingNotifies);
  
! 		MemoryContextSwitchTo(oldcontext);
! 	}
  }
  
  /*
--- 292,486 ----
  
  bool		Trace_notify = false;
  
! static void queue_listen(ListenActionKind action, const char *channel);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static bool IsListeningOn(const char *channel);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
! static void Exec_ListenBeforeCommit(const char *channel);
! static void Exec_ListenAfterCommit(const char *channel);
! static void Exec_UnlistenAfterCommit(const char *channel);
! static void Exec_UnlistenAllAfterCommit(void);
! static void SignalBackends(void);
! static void Send_Notify(void);
! static bool asyncQueuePagePrecedesPhysically(int p, int q);
! static bool asyncQueuePagePrecedesLogically(int p, int q, int head);
! static bool asyncQueueAdvance(QueuePosition *position, int entryLength);
! static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
! static void asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n);
! static List *asyncQueueAddEntries(List *notifications);
! static bool asyncQueueGetEntriesByPage(QueuePosition *current,
! 									   QueuePosition stop,
! 									   List **notifications);
! static void asyncQueueReadAllNotifications(void);
! static void asyncQueueAdvanceTail(void);
! static void asyncQueueUnregister(void);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(const char *channel,
! 							 const char *payload,
! 							 int32 srcPid);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
  static void ClearPendingActionsAndNotifies(void);
  
+ /*
+  * We will work on the page range of 0..(SLRU_PAGES_PER_SEGMENT * 0xFFFF).
+  * asyncQueuePagePrecedesPhysically just checks numerically without any magic
+  * if one page precedes another one.
+  *
+  * On the other hand, when asyncQueuePagePrecedesLogically does that check, it
+  * takes the current head page number into account. If we have wrapped
+  * around, it can happen that p precedes q, even though p > q (if the head page
+  * is in between the two).
+  */
+ static bool
+ asyncQueuePagePrecedesPhysically(int p, int q)
+ {
+ 	return p < q;
+ }
+ 
+ static bool
+ asyncQueuePagePrecedesLogically(int p, int q, int head)
+ {
+ 	if (p <= head && q <= head)
+ 		return p < q;
+ 	if (p > head && q > head)
+ 		return p < q;
+ 	if (p <= head)
+ 	{
+ 		Assert(q > head);
+ 		/* q is older */
+ 		return false;
+ 	}
+ 	else
+ 	{
+ 		Assert(p > head && q <= head);
+ 		/* p is older */
+ 		return true;
+ 	}
+ }
+ 
+ void
+ AsyncShmemInit(void)
+ {
+ 	bool	found;
+ 	int		slotno;
+ 	Size	size;
+ 
+ 	/*
+ 	 * Remember that sizeof(AsyncQueueControl) already contains one member of
+ 	 * QueueBackendStatus, so we only need to add the status space requirement
+ 	 * for MaxBackends-1 backends.
+ 	 */
+ 	size = mul_size(MaxBackends-1, sizeof(QueueBackendStatus));
+ 	size = add_size(size, sizeof(AsyncQueueControl));
+ 
+ 	asyncQueueControl = (AsyncQueueControl *)
+ 		ShmemInitStruct("Async Queue Control", size, &found);
+ 
+ 	if (!asyncQueueControl)
+ 		elog(ERROR, "out of memory");
+ 
+ 	if (!found)
+ 	{
+ 		int		i;
+ 		SET_QUEUE_POS(QUEUE_HEAD, 0, 0);
+ 		SET_QUEUE_POS(QUEUE_TAIL, 0, 0);
+ 		for (i = 0; i < MaxBackends; i++)
+ 		{
+ 			SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ 			QUEUE_BACKEND_PID(i) = InvalidPid;
+ 		}
+ 	}
+ 
+ 	AsyncCtl->PagePrecedes = asyncQueuePagePrecedesPhysically;
+ 	SimpleLruInit(AsyncCtl, "Async Ctl", NUM_ASYNC_BUFFERS, 0,
+ 				  AsyncCtlLock, "pg_notify");
+ 	AsyncCtl->do_fsync = false;
+ 	asyncQueueControl->lastQueueFillWarn = GetCurrentTimestamp();
+ 
+ 	if (!found)
+ 	{
+ 		LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
+ 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
+ 		SimpleLruWritePage(AsyncCtl, slotno, NULL);
+ 		LWLockRelease(AsyncCtlLock);
+ 
+ 		SimpleLruTruncate(AsyncCtl, 0);
+ 	}
+ }
+ 
+ 
+ /*
+  * send_notify -
+  *	  Send a notification to listening clients
+  */
+ Datum
+ send_notify(PG_FUNCTION_ARGS)
+ {
+ 	const char *channelStr;
+ 	const char *payloadStr;
+ 	text	   *channel = PG_GETARG_TEXT_PP(0);
+ 	text	   *payload = PG_GETARG_TEXT_PP(1);
+ 
+ 	channelStr = text_to_cstring(channel);
+ 	payloadStr = text_to_cstring(payload);
+ 
+ 	Async_Notify(channelStr, payloadStr);
+ 
+ 	PG_RETURN_VOID();
+ }
+ 
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the notification (channel, payload) to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *channel, const char *payload)
  {
+ 	Notification *n;
+ 	MemoryContext oldcontext;
+ 
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", channel);
! 
! 	if (payload && strlen(payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 				 errmsg("payload string too long")));
  
  	/* no point in making duplicate entries in the list ... */
! 	if (AsyncExistsPendingNotify(channel, payload))
! 		return;
  
! 	/*
! 	 * The notification needs to live until end of transaction, so store it
! 	 * in the transaction context.
! 	 */
! 	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
! 	n = (Notification *) palloc(sizeof(Notification));
! 	n->channel = pstrdup(channel);
! 	if (payload)
! 		n->payload = pstrdup(payload);
! 	else
! 		n->payload = "";
  
! 	/* will set the xid and the srcPid later... */
! 	n->xid = InvalidTransactionId;
! 	n->srcPid = InvalidPid;
! 
! 	/*
! 	 * We want to preserve the order so we need to append every
! 	 * notification. See comments at AsyncExistsPendingNotify().
! 	 */
! 	pendingNotifies = lappend(pendingNotifies, n);
! 
! 	MemoryContextSwitchTo(oldcontext);
  }
  
  /*
***************
*** 226,236 ****
   *		Common code for listen, unlisten, unlisten all commands.
   *
   *		Adds the request to the list of pending actions.
!  *		Actual update of pg_listener happens during transaction commit.
!  *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  static void
! queue_listen(ListenActionKind action, const char *condname)
  {
  	MemoryContext oldcontext;
  	ListenAction *actrec;
--- 488,498 ----
   *		Common code for listen, unlisten, unlisten all commands.
   *
   *		Adds the request to the list of pending actions.
!  *		Actual update of the notification queue happens during transaction
!  *		commit.
   */
  static void
! queue_listen(ListenActionKind action, const char *channel)
  {
  	MemoryContext oldcontext;
  	ListenAction *actrec;
***************
*** 244,252 ****
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  	/* space for terminating null is included in sizeof(ListenAction) */
! 	actrec = (ListenAction *) palloc(sizeof(ListenAction) + strlen(condname));
  	actrec->action = action;
! 	strcpy(actrec->condname, condname);
  
  	pendingActions = lappend(pendingActions, actrec);
  
--- 506,514 ----
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  	/* space for terminating null is included in sizeof(ListenAction) */
! 	actrec = (ListenAction *) palloc(sizeof(ListenAction) + strlen(channel));
  	actrec->action = action;
! 	strcpy(actrec->channel, channel);
  
  	pendingActions = lappend(pendingActions, actrec);
  
***************
*** 259,270 ****
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", relname, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, relname);
  }
  
  /*
--- 521,532 ----
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", channel, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, channel);
  }
  
  /*
***************
*** 273,288 ****
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", relname, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, relname);
  }
  
  /*
--- 535,550 ----
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", channel, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, channel);
  }
  
  /*
***************
*** 306,313 ****
  /*
   * Async_UnlistenOnExit
   *
-  *		Clean up the pg_listener table at backend exit.
-  *
   *		This is executed if we have done any LISTENs in this backend.
   *		It might not be necessary anymore, if the user UNLISTENed everything,
   *		but we don't try to detect that case.
--- 568,573 ----
***************
*** 315,331 ****
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
- 	/*
- 	 * We need to start/commit a transaction for the unlisten, but if there is
- 	 * already an active transaction we had better abort that one first.
- 	 * Otherwise we'd end up committing changes that probably ought to be
- 	 * discarded.
- 	 */
  	AbortOutOfAnyTransaction();
! 	/* Now we can do the unlisten */
! 	StartTransactionCommand();
! 	Async_UnlistenAll();
! 	CommitTransactionCommand();
  }
  
  /*
--- 575,582 ----
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
  	AbortOutOfAnyTransaction();
! 	Exec_UnlistenAllAfterCommit();
  }
  
  /*
***************
*** 348,357 ****
  	/* We can deal with pending NOTIFY though */
  	foreach(p, pendingNotifies)
  	{
! 		const char *relname = (const char *) lfirst(p);
  
  		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   relname, strlen(relname) + 1);
  	}
  
  	/*
--- 599,613 ----
  	/* We can deal with pending NOTIFY though */
  	foreach(p, pendingNotifies)
  	{
! 		AsyncQueueEntry qe;
! 		Notification   *n;
! 
! 		n = (Notification *) lfirst(p);
! 
! 		asyncQueueNotificationToEntry(n, &qe);
  
  		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   &qe, qe.length);
  	}
  
  	/*
***************
*** 363,386 ****
  }
  
  /*
!  * AtCommit_Notify
!  *
!  *		This is called at transaction commit.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, insert or delete
!  *		tuples in pg_listener accordingly.
   *
!  *		If there are outbound notify requests in the pendingNotifies list,
!  *		scan pg_listener for matching tuples, and either signal the other
!  *		backend or send a message to our own frontend.
   *
!  *		NOTE: we are still inside the current transaction, therefore can
!  *		piggyback on its committing of changes.
   */
  void
! AtCommit_Notify(void)
  {
- 	Relation	lRel;
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
--- 619,638 ----
  }
  
  /*
!  * AtCommit_NotifyBeforeCommit
   *
!  *		This is called at transaction commit, before actually committing to
!  *		clog.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, update our
!  *		"listenChannels" list.
   *
!  *		If there are outbound notify requests in the pendingNotifies list, add
!  *		them to the global queue. Listening backends are signaled after commit.
   */
  void
! AtCommit_NotifyBeforeCommit(void)
  {
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
***************
*** 397,406 ****
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify");
  
! 	/* Acquire ExclusiveLock on pg_listener */
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
--- 649,658 ----
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyBeforeCommit");
  
! 	Assert(backendSendsNotifications == false);
! 	Assert(backendExecutesInitialListen == false);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
***************
*** 410,508 ****
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_Listen(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN:
! 				Exec_Unlisten(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAll(lRel);
  				break;
  		}
- 
- 		/* We must CCI after each action in case of conflicting actions */
- 		CommandCounterIncrement();
  	}
  
- 	/* Perform any pending notifies */
- 	if (pendingNotifies)
- 		Send_Notify(lRel);
- 
  	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that notified backends see our tuple updates when they look. Else they
! 	 * might disregard the signal, which would make the application programmer
! 	 * very unhappy.  Also, this prevents race conditions when we have just
! 	 * inserted a listening tuple.
  	 */
! 	heap_close(lRel, NoLock);
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify: done");
  }
  
  /*
!  * Exec_Listen --- subroutine for AtCommit_Notify
!  *
!  *		Register the current backend as listening on the specified relation.
   */
! static void
! Exec_Listen(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
! 	Datum		values[Natts_pg_listener];
! 	bool		nulls[Natts_pg_listener];
! 	NameData	condname;
! 	bool		alreadyListener = false;
  
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Listen(%s,%d)", relname, MyProcPid);
  
! 	/* Detect whether we are already listening on this relname */
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
  
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
! 		{
! 			alreadyListener = true;
! 			/* No need to scan the rest of the table */
! 			break;
! 		}
  	}
- 	heap_endscan(scan);
  
! 	if (alreadyListener)
! 		return;
  
! 	/*
! 	 * OK to insert a new tuple
! 	 */
! 	memset(nulls, false, sizeof(nulls));
  
! 	namestrcpy(&condname, relname);
! 	values[Anum_pg_listener_relname - 1] = NameGetDatum(&condname);
! 	values[Anum_pg_listener_listenerpid - 1] = Int32GetDatum(MyProcPid);
! 	values[Anum_pg_listener_notification - 1] = Int32GetDatum(0);		/* no notifies pending */
  
! 	tuple = heap_form_tuple(RelationGetDescr(lRel), values, nulls);
  
! 	simple_heap_insert(lRel, tuple);
  
! #ifdef NOT_USED					/* currently there are no indexes */
! 	CatalogUpdateIndexes(lRel, tuple);
! #endif
  
! 	heap_freetuple(tuple);
  
  	/*
! 	 * now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
--- 662,847 ----
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_ListenBeforeCommit(actrec->channel);
  				break;
  			case LISTEN_UNLISTEN:
! 				/* there is no Exec_UnlistenBeforeCommit() */
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				/* there is no Exec_UnlistenAllBeforeCommit() */
  				break;
  		}
  	}
  
  	/*
! 	 * Perform any pending notifies.
  	 */
! 	if (pendingNotifies)
! 		Send_Notify();
! }
! 
! /*
!  * AtCommit_NotifyAfterCommit
!  *
!  *		This is called at transaction commit, after committing to clog.
!  *
!  *		Notify the listening backends.
!  */
! void
! AtCommit_NotifyAfterCommit(void)
! {
! 	ListCell   *p;
! 
! 	/* Allow transactions that have not executed LISTEN/UNLISTEN/NOTIFY to
! 	 * return as soon as possible */
! 	if (!pendingActions && !backendSendsNotifications)
! 		return;
! 
! 	/* Perform any pending listen/unlisten actions */
! 	foreach(p, pendingActions)
! 	{
! 		ListenAction *actrec = (ListenAction *) lfirst(p);
! 
! 		switch (actrec->action)
! 		{
! 			case LISTEN_LISTEN:
! 				Exec_ListenAfterCommit(actrec->channel);
! 				break;
! 			case LISTEN_UNLISTEN:
! 				Exec_UnlistenAfterCommit(actrec->channel);
! 				break;
! 			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAllAfterCommit();
! 				break;
! 		}
! 	}
! 
! 	if (backendSendsNotifications)
! 		SignalBackends();
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyAfterCommit: done");
  }
  
  /*
!  * This function is executed for every notification found in the queue in order
!  * to check if the current backend is listening on that channel. Not sure if we
!  * should further optimize this, for example convert to a sorted array and
!  * allow binary search on it...
   */
! static bool
! IsListeningOn(const char *channel)
  {
! 	ListCell   *p;
! 	char	   *lchan;
  
! 	foreach(p, listenChannels)
! 	{
! 		lchan = (char *) lfirst(p);
! 		if (strcmp(lchan, channel) == 0)
! 			return true;
! 	}
! 	return false;
! }
! 
! Datum
! pg_listening(PG_FUNCTION_ARGS)
! {
! 	FuncCallContext	   *funcctx;
! 	ListCell		  **lcp;
  
! 	/* stuff done only on the first call of the function */
! 	if (SRF_IS_FIRSTCALL())
  	{
! 		MemoryContext	oldcontext;
  
! 		/* create a function context for cross-call persistence */
! 		funcctx = SRF_FIRSTCALL_INIT();
! 
! 		/*
! 		 * switch to memory context appropriate for multiple function calls
! 		 */
! 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
! 
! 		/* allocate memory for user context */
! 		lcp = (ListCell **) palloc(sizeof(ListCell *));
! 		if (listenChannels != NIL)
! 			*lcp = list_head(listenChannels);
! 		else
! 			*lcp = NULL;
! 		funcctx->user_fctx = (void *) lcp;
! 
! 		MemoryContextSwitchTo(oldcontext);
  	}
  
! 	/* stuff done on every call of the function */
! 	funcctx = SRF_PERCALL_SETUP();
! 	lcp = (ListCell **) funcctx->user_fctx;
  
! 	while (*lcp != NULL)
! 	{
! 		char   *channel = (char *) lfirst(*lcp);
! 
! 		*lcp = (*lcp)->next;
! 		SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
! 	}
! 
! 	SRF_RETURN_DONE(funcctx);
! }
  
! /*
!  * Exec_ListenBeforeCommit --- subroutine for AtCommit_NotifyBeforeCommit
!  *
!  * Note that we only set our pointer here and do not yet add the channel to
!  * listenChannels. Since our transaction could still roll back, we do that only
!  * after commit. We know that our tail pointer won't move between here and
!  * directly after commit, so we won't miss a notification.
!  */
! static void
! Exec_ListenBeforeCommit(const char *channel)
! {
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_ListenBeforeCommit(%s,%d)", channel, MyProcPid);
  
! 	/* Detect whether we are already listening to something. */
! 	if (listenChannels != NIL)
! 		return;
  
! 	/*
! 	 * We need this variable to detect an aborted initial LISTEN.
! 	 * In that case we would set up our pointer but not listen on any channel.
! 	 * This state gets cleaned up again in AtAbort_Notify().
! 	 */
! 	backendExecutesInitialListen = true;
  
! 	/*
! 	 * This is our first LISTEN, establish our pointer.
! 	 * We set our pointer to the global tail pointer, this way we make
! 	 * sure that we get all of the notifications. We might get a few more
! 	 * but that doesn't hurt.
! 	 */
! 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 	QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;
! 	QUEUE_BACKEND_PID(MyBackendId) = MyProcPid;
! 	LWLockRelease(AsyncQueueLock);
  
! 	/*
! 	 * Try to move our pointer forward as far as possible. This will skip
! 	 * over already committed notifications. Still, we could get
! 	 * notifications that have already committed before we started to
! 	 * LISTEN.
! 	 *
! 	 * Note that we are not yet listening on anything, so we won't deliver
! 	 * any notification.
! 	 *
! 	 * This will also advance the global tail pointer if necessary.
! 	 */
! 	asyncQueueReadAllNotifications();
  
  	/*
! 	 * Now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
***************
*** 512,550 ****
  }
  
  /*
!  * Exec_Unlisten --- subroutine for AtCommit_Notify
   *
!  *		Remove the current backend from the list of listening backends
!  *		for the specified relation.
   */
  static void
! Exec_Unlisten(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Unlisten(%s,%d)", relname, MyProcPid);
  
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
! 
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
  		{
! 			/* Found the matching tuple, delete it */
! 			simple_heap_delete(lRel, &tuple->t_self);
! 
! 			/*
! 			 * We assume there can be only one match, so no need to scan the
! 			 * rest of the table
! 			 */
  			break;
  		}
  	}
! 	heap_endscan(scan);
  
  	/*
  	 * We do not complain about unlistening something not being listened;
--- 851,903 ----
  }
  
  /*
!  * Exec_ListenAfterCommit --- subroutine for AtCommit_NotifyAfterCommit
   *
!  * Add the channel to the list of channels we are listening on.
   */
  static void
! Exec_ListenAfterCommit(const char *channel)
  {
! 	MemoryContext oldcontext;
! 
! 	/* Detect whether we are already listening on this channel */
! 	if (IsListeningOn(channel))
! 		return;
! 
! 	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
! 	listenChannels = lappend(listenChannels, pstrdup(channel));
! 	MemoryContextSwitchTo(oldcontext);
! }
! 
! /*
!  * Exec_UnlistenAfterCommit --- subroutine for AtCommit_NotifyAfterCommit
!  *
!  * Remove a specified channel from "listenChannels".
!  */
! static void
! Exec_UnlistenAfterCommit(const char *channel)
! {
! 	ListCell *q;
! 	ListCell *prev;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAfterCommit(%s,%d)", channel, MyProcPid);
  
! 	prev = NULL;
! 	foreach(q, listenChannels)
  	{
! 		char *lchan = (char *) lfirst(q);
! 		if (strcmp(lchan, channel) == 0)
  		{
! 			pfree(lchan);
! 			listenChannels = list_delete_cell(listenChannels, q, prev);
  			break;
  		}
+ 		prev = q;
  	}
! 
! 	if (listenChannels == NIL)
! 		asyncQueueUnregister();
  
  	/*
  	 * We do not complain about unlistening something not being listened;
***************
*** 553,690 ****
  }
  
  /*
!  * Exec_UnlistenAll --- subroutine for AtCommit_Notify
   *
!  *		Update pg_listener to unlisten all relations for this backend.
   */
  static void
! Exec_UnlistenAll(Relation lRel)
  {
- 	HeapScanDesc scan;
- 	HeapTuple	lTuple;
- 	ScanKeyData key[1];
- 
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAll");
  
! 	/* Find and delete all entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
  
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 		simple_heap_delete(lRel, &lTuple->t_self);
  
! 	heap_endscan(scan);
  }
  
  /*
!  * Send_Notify --- subroutine for AtCommit_Notify
!  *
!  *		Scan pg_listener for tuples matching our pending notifies, and
!  *		either signal the other backend or send a message to our own frontend.
   */
  static void
! Send_Notify(Relation lRel)
  {
! 	TupleDesc	tdesc = RelationGetDescr(lRel);
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 
! 	/* preset data to update notify column to MyProcPid */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(MyProcPid);
! 
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		listenerPID = listener->listenerpid;
  
! 		if (!AsyncExistsPendingNotify(relname))
! 			continue;
  
! 		if (listenerPID == MyProcPid)
  		{
! 			/*
! 			 * Self-notify: no need to bother with table update. Indeed, we
! 			 * *must not* clear the notification field in this path, or we
! 			 * could lose an outside notify, which'd be bad for applications
! 			 * that ignore self-notify messages.
! 			 */
! 			if (Trace_notify)
! 				elog(DEBUG1, "AtCommit_Notify: notifying self");
  
! 			NotifyMyFrontEnd(relname, listenerPID);
  		}
  		else
  		{
- 			if (Trace_notify)
- 				elog(DEBUG1, "AtCommit_Notify: notifying pid %d",
- 					 listenerPID);
- 
  			/*
! 			 * If someone has already notified this listener, we don't bother
! 			 * modifying the table, but we do still send a NOTIFY_INTERRUPT
! 			 * signal, just in case that backend missed the earlier signal for
! 			 * some reason.  It's OK to send the signal first, because the
! 			 * other guy can't read pg_listener until we unlock it.
! 			 *
! 			 * Note: we don't have the other guy's BackendId available, so
! 			 * this will incur a search of the ProcSignal table.  That's
! 			 * probably not worth worrying about.
  			 */
! 			if (SendProcSignal(listenerPID, PROCSIG_NOTIFY_INTERRUPT,
! 							   InvalidBackendId) < 0)
  			{
! 				/*
! 				 * Get rid of pg_listener entry if it refers to a PID that no
! 				 * longer exists.  Presumably, that backend crashed without
! 				 * deleting its pg_listener entries. This code used to only
! 				 * delete the entry if errno==ESRCH, but as far as I can see
! 				 * we should just do it for any failure (certainly at least
! 				 * for EPERM too...)
! 				 */
! 				simple_heap_delete(lRel, &lTuple->t_self);
  			}
! 			else if (listener->notification == 0)
  			{
! 				/* Rewrite the tuple with my PID in notification column */
! 				rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 				simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 				CatalogUpdateIndexes(lRel, rTuple);
! #endif
  			}
  		}
  	}
  
! 	heap_endscan(scan);
  }
  
  /*
   * AtAbort_Notify
   *
!  *		This is called at transaction abort.
   *
!  *		Gets rid of pending actions and outbound notifies that we would have
!  *		executed if the transaction got committed.
   */
  void
  AtAbort_Notify(void)
  {
  	ClearPendingActionsAndNotifies();
  }
  
--- 906,1293 ----
  }
  
  /*
!  * Exec_UnlistenAllAfterCommit --- subroutine for AtCommit_NotifyAfterCommit
   *
!  *		Unlisten on all channels for this backend.
   */
  static void
! Exec_UnlistenAllAfterCommit(void)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAllAfterCommit(%d)", MyProcPid);
! 
! 	list_free_deep(listenChannels);
! 	listenChannels = NIL;
! 
! 	asyncQueueUnregister();
! }
! 
! static void
! asyncQueueUnregister(void)
! {
! 	bool	  advanceTail = false;
! 
! 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 	QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
! 	/*
! 	 * If we have been the last backend, advance the tail pointer.
! 	 */
! 	if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
! 		advanceTail = true;
! 	LWLockRelease(AsyncQueueLock);
! 
! 	if (advanceTail)
! 		asyncQueueAdvanceTail();
! }
  
! static bool
! asyncQueueIsFull(void)
! {
! 	QueuePosition	lookahead = QUEUE_HEAD;
! 	Size			remain = QUEUE_PAGESIZE - QUEUE_POS_OFFSET(lookahead) - 1;
! 	Size			advance = Min(remain, NOTIFY_PAYLOAD_MAX_LENGTH);
  
! 	/*
! 	 * Check what happens if we wrote a maximally sized entry. Would we go to a
! 	 * new page? If not, then our queue can not be full (because we can still
! 	 * fill at least the current page with at least one more entry).
! 	 */
! 	if (!asyncQueueAdvance(&lookahead, advance))
! 		return false;
  
! 	/*
! 	 * The queue is full if with a switch to a new page we reach the page
! 	 * of the tail pointer.
! 	 */
! 	return QUEUE_POS_PAGE(lookahead) == QUEUE_POS_PAGE(QUEUE_TAIL);
  }
  
  /*
!  * The function advances the position to the next entry. In case we jump to
!  * a new page the function returns true, else false.
   */
+ static bool
+ asyncQueueAdvance(QueuePosition *position, int entryLength)
+ {
+ 	int		pageno = QUEUE_POS_PAGE(*position);
+ 	int		offset = QUEUE_POS_OFFSET(*position);
+ 	bool	pageJump = false;
+ 
+ 	/*
+ 	 * Move to the next writing position: First jump over what we have just
+ 	 * written or read.
+ 	 */
+ 	offset += entryLength;
+ 	Assert(offset < QUEUE_PAGESIZE);
+ 
+ 	/*
+ 	 * In a second step, check whether another entry can be written to the
+ 	 * page. If it can, stay here; we have reached the next position. If not,
+ 	 * we need to move on to the next page.
+ 	 */
+ 	if (offset + AsyncQueueEntryEmptySize >= QUEUE_PAGESIZE)
+ 	{
+ 		pageno++;
+ 		if (pageno > QUEUE_MAX_PAGE)
+ 			/* wrap around */
+ 			pageno = 0;
+ 		offset = 0;
+ 		pageJump = true;
+ 	}
+ 
+ 	SET_QUEUE_POS(*position, pageno, offset);
+ 	return pageJump;
+ }
+ 
  static void
! asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
  {
! 	Assert(n->channel != NULL);
! 	Assert(n->payload != NULL);
! 	Assert(strlen(n->payload) < NOTIFY_PAYLOAD_MAX_LENGTH);
! 
! 	/* The terminator is already included in AsyncQueueEntryEmptySize */
! 	qe->length = AsyncQueueEntryEmptySize + strlen(n->payload);
! 	qe->srcPid = MyProcPid;
! 	qe->dboid = MyDatabaseId;
! 	qe->xid = GetCurrentTransactionId();
! 	strcpy(qe->channel, n->channel);
! 	strcpy(qe->payload, n->payload);
! }
! 
! static void
! asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n)
! {
! 	n->channel = pstrdup(qe->channel);
! 	n->payload = pstrdup(qe->payload);
! 	n->srcPid = qe->srcPid;
! 	n->xid = qe->xid;
! }
  
! /*
!  * Add the notifications to the queue. We go page by page here, i.e. we stop
!  * once we have to go to a new page, but we will be called again and then fill
!  * that next page. If an entry no longer fits on the page, we write a dummy
!  * entry with InvalidOid as the database oid in order to fill the page. So
!  * every page is always used up to the last byte, which simplifies reading the
!  * page later.
!  *
!  * We are holding AsyncQueueLock already from the caller and grab AsyncCtlLock
!  * here in this function.
!  *
!  * We are passed the list of notifications to write and return the
!  * not-yet-written notifications back. Eventually we will return NIL.
!  */
! static List *
! asyncQueueAddEntries(List *notifications)
! {
! 	AsyncQueueEntry	qe;
! 	int				pageno;
! 	int				offset;
! 	int				slotno;
  
! 	/*
! 	 * Note that we are holding exclusive AsyncQueueLock already.
! 	 */
! 	LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
! 	pageno = QUEUE_POS_PAGE(QUEUE_HEAD);
! 	slotno = SimpleLruReadPage(AsyncCtl, pageno, true, InvalidTransactionId);
! 	AsyncCtl->shared->page_dirty[slotno] = true;
! 
! 	do
! 	{
! 		Notification   *n;
! 
! 		if (asyncQueueIsFull())
  		{
! 			/* document that we will not go into the if-block further down */
! 			Assert(QUEUE_POS_OFFSET(QUEUE_HEAD) != 0);
! 			break;
! 		}
! 
! 		n = (Notification *) linitial(notifications);
  
! 		asyncQueueNotificationToEntry(n, &qe);
! 
! 		offset = QUEUE_POS_OFFSET(QUEUE_HEAD);
! 		/*
! 		 * Check whether or not the entry still fits on the current page.
! 		 */
! 		if (offset + qe.length < QUEUE_PAGESIZE)
! 		{
! 			notifications = list_delete_first(notifications);
  		}
  		else
  		{
  			/*
! 			 * Write a dummy entry to fill up the page. Actually readers will
! 			 * only check dboid and since it won't match any reader's database
! 			 * oid, they will ignore this entry and move on.
  			 */
! 			qe.length = QUEUE_PAGESIZE - offset - 1;
! 			qe.dboid = InvalidOid;
! 			qe.channel[0] = '\0';
! 			qe.payload[0] = '\0';
! 			qe.xid = InvalidTransactionId;
! 		}
! 		memcpy((char*) AsyncCtl->shared->page_buffer[slotno] + offset,
! 			   &qe, qe.length);
! 
! 	} while (!asyncQueueAdvance(&(QUEUE_HEAD), qe.length)
! 			 && notifications != NIL);
! 
! 	if (QUEUE_POS_OFFSET(QUEUE_HEAD) == 0)
! 	{
! 		/*
! 		 * We need to continue on a new page; stop here but prepare that
! 		 * page already.
! 		 */
! 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
! 		AsyncCtl->shared->page_dirty[slotno] = true;
! 	}
! 	LWLockRelease(AsyncCtlLock);
! 
! 	return notifications;
! }
! 
! static void
! asyncQueueFillWarning(void)
! {
! 	/*
! 	 * Caller must hold exclusive AsyncQueueLock.
! 	 */
! 	TimestampTz		t;
! 	double			fillDegree;
! 	int				occupied;
! 	int				tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
! 	int				headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
! 
! 	occupied = headPage - tailPage;
! 
! 	if (occupied == 0)
! 		return;
! 	
! 	if (!asyncQueuePagePrecedesPhysically(tailPage, headPage))
! 		/* head has wrapped around, tail not yet */
! 		occupied += QUEUE_MAX_PAGE;
! 
! 	fillDegree = (float) occupied / (float) QUEUE_MAX_PAGE;
! 
! 	if (fillDegree < 0.5)
! 		return;
! 
! 	t = GetCurrentTimestamp();
! 
! 	if (TimestampDifferenceExceeds(asyncQueueControl->lastQueueFillWarn,
! 								   t, QUEUE_FULL_WARN_INTERVAL))
! 	{
! 		QueuePosition	min = QUEUE_HEAD;
! 		int32			minPid = InvalidPid;
! 		int				i;
! 
! 		for (i = 0; i < MaxBackends; i++)
! 			if (QUEUE_BACKEND_PID(i) != InvalidPid)
  			{
! 				min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
! 				if (QUEUE_POS_EQUAL(min, QUEUE_BACKEND_POS(i)))
! 					minPid = QUEUE_BACKEND_PID(i);
  			}
! 
! 		if (fillDegree < 0.75)
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 50%% full. "
! 								 "Among the slowest backends: %d", minPid)));
! 		else
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 75%% full. "
! 								 "Among the slowest backends: %d", minPid)));
! 
! 		asyncQueueControl->lastQueueFillWarn = t;
! 	}
! }
! 
! /*
!  * Send_Notify --- subroutine for AtCommit_NotifyBeforeCommit
!  *
!  * Add the pending notifications to the queue.
!  *
!  * A full queue is very uncommon and should really not happen, given that we
!  * have so much space available in the slru pages. Nevertheless we need to
!  * deal with this possibility. Note that when we get here we are in the process
!  * of committing our transaction, we have not yet committed to clog but this
!  * would be the next step. So at this point in time we can still roll the
!  * transaction back.
!  */
! static void
! Send_Notify(void)
! {
! 	backendSendsNotifications = true;
! 
! 	while (pendingNotifies != NIL)
! 	{
! 		LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 		asyncQueueFillWarning();
! 		if (asyncQueueIsFull())
! 			ereport(ERROR,
! 					(errcode(ERRCODE_TOO_MANY_ENTRIES),
! 					errmsg("Too many notifications in the queue")));
! 		pendingNotifies = asyncQueueAddEntries(pendingNotifies);
! 		LWLockRelease(AsyncQueueLock);
! 	}
! }
! 
! /*
!  * Send signals to all listening backends. Since we have EXCLUSIVE lock anyway
!  * we also check the position of the other backends and in case that anyone is
!  * already up-to-date we don't signal it. This can happen if concurrent
!  * notifying transactions have sent a signal and the signaled backend has read
!  * the other notifications and ours in the same step.
!  *
!  * Since we know the BackendId and the Pid the signalling is quite cheap.
!  */
! static void
! SignalBackends(void)
! {
! 	QueuePosition	pos;
! 	ListCell	   *p1, *p2;
! 	int				i;
! 	int32			pid;
! 	List		   *pids = NIL;
! 	List		   *ids = NIL;
! 	int				count = 0;
! 
! 	/* Signal everybody who is LISTENing to any channel. */
! 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 	for (i = 0; i < MaxBackends; i++)
! 	{
! 		pid = QUEUE_BACKEND_PID(i);
! 		if (pid != InvalidPid)
! 		{
! 			count++;
! 			pos = QUEUE_BACKEND_POS(i);
! 			if (!QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
  			{
! 				pids = lappend_int(pids, pid);
! 				ids = lappend_int(ids, i);
  			}
  		}
  	}
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	forboth(p1, pids, p2, ids)
+ 	{
+ 		pid = (int32) lfirst_int(p1);
+ 		i = lfirst_int(p2);
+ 		/*
+ 		 * Should we check for failure? Can it happen that a backend
+ 		 * has crashed without the postmaster starting over?
+ 		 */
+ 		if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, i) < 0)
+ 			elog(WARNING, "Error signalling backend %d", pid);
+ 	}
  
! 	if (count == 0)
! 	{
! 		/* No backend is listening at all, try to clean up the queue.
! 		 * Even if a backend has started to listen in the meantime (after
! 		 * we determined count to be 0), advancing the tail does not
! 		 * hurt. Our notifications are committed already and a newly
! 		 * listening backend would skip over them anyway. */
! 		asyncQueueAdvanceTail();
! 	}
  }
  
  /*
   * AtAbort_Notify
   *
!  *	This is called at transaction abort.
   *
!  *	Gets rid of pending actions and outbound notifies that we would have
!  *	executed if the transaction got committed.
!  *
!  *	Even though we have not committed, we need to signal the listening backends
!  *	because our notifications might block readers from processing the queue.
!  *	Now that the transaction has aborted, they can go on and skip over our
!  *	notifications. They could find notifications past ours that they need to
!  *	deliver.
   */
  void
  AtAbort_Notify(void)
  {
+ 	if (backendSendsNotifications)
+ 		SignalBackends();
+ 
+ 	/*
+ 	 * If we LISTEN but then roll back the transaction we have set our pointer
+ 	 * but have not made the entry in listenChannels. In that case, remove
+ 	 * our pointer again.
+ 	 */
+ 	if (backendExecutesInitialListen)
+ 		/*
+ 		 * Checking listenChannels should be redundant but it can't hurt doing
+ 		 * it for safety reasons.
+ 		 */
+ 		if (listenChannels == NIL)
+ 			asyncQueueUnregister();
+ 
  	ClearPendingActionsAndNotifies();
  }
  
***************
*** 940,968 ****
  }
  
  /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan pg_listener for arriving notifies, report them to my front end,
!  *		and clear the notification field in pg_listener until next time.
   *
!  *		NOTE: since we are outside any transaction, we must create our own.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	Relation	lRel;
! 	TupleDesc	tdesc;
! 	ScanKeyData key[1];
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 	bool		catchup_enabled;
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
--- 1543,1781 ----
  }
  
  /*
+  * This function will ask for a page with ReadOnly access and once we have the
+  * lock, we read the whole content and pass back the list of notifications
+  * that the calling function will deliver then. The list will contain all
+  * notifications from transactions that have already committed.
+  *
+  * We stop if we have either reached the stop position or go to a new page.
+  *
+  * The function returns true once we have reached the end or a notification of
+  * a transaction that is still running and false if we have finished with
+  * the page. In other words: once it returns true there is no point in calling
+  * it again.
+  */
+ static bool
+ asyncQueueGetEntriesByPage(QueuePosition *current,
+ 						   QueuePosition stop,
+ 						   List **notifications)
+ {
+ 	AsyncQueueEntry	qe;
+ 	Notification   *n;
+ 	int				slotno;
+ 	bool			reachedStop = false;
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		return true;
+ 
+ 	slotno = SimpleLruReadPage_ReadOnly(AsyncCtl, current->page,
+ 										InvalidTransactionId);
+ 	do {
+ 		char *readPtr = (char *) (AsyncCtl->shared->page_buffer[slotno]);
+ 
+ 		if (QUEUE_POS_EQUAL(*current, stop))
+ 		{
+ 			reachedStop = true;
+ 			break;
+ 		}
+ 
+ 		readPtr += current->offset;
+ 		/* at first we only read the header of the notification */
+ 		memcpy(&qe, readPtr, AsyncQueueEntryEmptySize);
+ 
+ 		if (qe.dboid == MyDatabaseId)
+ 		{
+ 			if (TransactionIdDidCommit(qe.xid))
+ 			{
+ 				if (IsListeningOn(qe.channel))
+ 				{
+ 					if (qe.length > AsyncQueueEntryEmptySize)
+ 					{
+ 						/* now we know that we are interested in the
+ 						 * notification and read it completely. */
+ 						memcpy(&qe, readPtr, qe.length);
+ 					}
+ 					n = (Notification *) palloc(sizeof(Notification));
+ 					asyncQueueEntryToNotification(&qe, n);
+ 					*notifications = lappend(*notifications, n);
+ 				}
+ 			}
+ 			else
+ 			{
+ 				if (!TransactionIdDidAbort(qe.xid))
+ 				{
+ 					/*
+ 					 * The transaction has neither committed nor aborted so
+ 					 * far.
+ 					 */
+ 					reachedStop = true;
+ 					break;
+ 				}
+ 				/*
+ 				 * Here we know that the transaction has aborted, we just
+ 				 * ignore its notifications.
+ 				 */
+ 			}
+ 		}
+ 		/*
+ 		 * The call to asyncQueueAdvance just jumps over what we have
+ 		 * just read. If there is no more space for the next record on the
+ 		 * current page, it will also switch to the beginning of the next page.
+ 		 */
+ 	} while(!asyncQueueAdvance(current, qe.length));
+ 
+ 	/*
+ 	 * Release the lock that we implicitly got from
+ 	 * SimpleLruReadPage_ReadOnly().
+ 	 */
+ 	LWLockRelease(AsyncCtlLock);
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		reachedStop = true;
+ 
+ 	return reachedStop;
+ }
+ 
+ 
+ static void
+ asyncQueueReadAllNotifications(void)
+ {
+ 	QueuePosition	pos;
+ 	QueuePosition	oldpos;
+ 	QueuePosition	head;
+ 	List		   *notifications;
+ 	ListCell	   *lc;
+ 	Notification   *n;
+ 	bool			advanceTail = false;
+ 	bool			reachedStop;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	pos = oldpos = QUEUE_BACKEND_POS(MyBackendId);
+ 	head = QUEUE_HEAD;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (QUEUE_POS_EQUAL(pos, head))
+ 	{
+ 		/* Nothing to do, we have read all notifications already. */
+ 		return;
+ 	}
+ 
+ 	do 
+ 	{
+ 		/*
+ 		 * Our stop position is what we found to be the head's position when
+ 		 * we entered this function. It might have changed already. But if it
+ 		 * has, we will receive (or have already received and queued) another
+ 		 * signal and come here again.
+ 		 *
+ 		 * We are not holding AsyncQueueLock here! The queue can only extend
+ 		 * beyond the head pointer (see above) and we leave our backend's
+ 		 * pointer where it is so nobody will truncate or rewrite pages under
+ 		 * us. In particular, we don't want to hold a lock while sending the
+ 		 * notifications to the frontend.
+ 		 */
+ 		reachedStop = false;
+ 
+ 		notifications = NIL;
+ 		reachedStop = asyncQueueGetEntriesByPage(&pos, head, &notifications);
+ 
+ 		/*
+ 		 * Note that we deliver everything that we see in the queue and that
+ 		 * matches our _current_ listening state.
+ 		 * In particular, we do not take different commit times into account.
+ 		 *
+ 		 * See the following example:
+ 		 *
+ 		 * Backend 1:                    Backend 2:
+ 		 *
+ 		 * transaction starts
+ 		 * NOTIFY foo;
+ 		 * commit starts
+ 		 *                               transaction starts
+ 		 *                               LISTEN foo;
+ 		 *                               commit starts
+ 		 * commit to clog
+ 		 *                               commit to clog
+ 		 *
+ 		 * It could happen that backend 2 sees the notification from
+ 		 * backend 1 in the queue and even though the notifying transaction
+ 		 * committed before the listening transaction, we still deliver the
+ 		 * notification.
+ 		 *
+ 		 * The idea is that an additional notification does not do any
+ 		 * harm; we just need to make sure that we do not miss a
+ 		 * notification.
+ 		 */
+ 		foreach(lc, notifications)
+ 		{
+ 			n = (Notification *) lfirst(lc);
+ 			NotifyMyFrontEnd(n->channel, n->payload, n->srcPid);
+ 		}
+ 	} while (!reachedStop);
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	QUEUE_BACKEND_POS(MyBackendId) = pos;
+ 	if (QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL))
+ 		advanceTail = true;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (advanceTail)
+ 		/* Move forward the tail pointer and try to truncate. */
+ 		asyncQueueAdvanceTail();
+ }
+ 
+ static void
+ asyncQueueAdvanceTail(void)
+ {
+ 	QueuePosition	min;
+ 	int				i;
+ 	int				tailp;
+ 	int				headp;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ 	min = QUEUE_HEAD;
+ 	for (i = 0; i < MaxBackends; i++)
+ 		if (QUEUE_BACKEND_PID(i) != InvalidPid)
+ 			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
+ 
+ 	tailp = QUEUE_POS_PAGE(QUEUE_TAIL);
+ 	headp = QUEUE_POS_PAGE(QUEUE_HEAD);
+ 	QUEUE_TAIL = min;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	/* This is our wraparound check */
+ 	if ((asyncQueuePagePrecedesLogically(tailp, QUEUE_POS_PAGE(min), headp)
+ 			&& asyncQueuePagePrecedesPhysically(tailp, headp))
+ 		|| tailp == QUEUE_POS_PAGE(min))
+ 	{
+ 		/*
+ 		 * SimpleLruTruncate() will ask for AsyncCtlLock but will also
+ 		 * release the lock again.
+ 		 *
+ 		 * XXX this could be optimized, to call SimpleLruTruncate only when we
+ 		 * know that we can truncate something.
+ 		 */
+ 		SimpleLruTruncate(AsyncCtl, QUEUE_POS_PAGE(min));
+ 	}
+ }
+ 
+ /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan the queue for arriving notifications and report them to my front
!  *		end.
   *
!  *		NOTE: we are outside of any transaction here.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	bool			catchup_enabled;
! 
! 	Assert(GetCurrentTransactionIdIfAny() == InvalidTransactionId);
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
***************
*** 974,1037 ****
  
  	notifyInterruptOccurred = 0;
  
! 	StartTransactionCommand();
! 
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
! 	tdesc = RelationGetDescr(lRel);
! 
! 	/* Scan only entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
! 
! 	/* Prepare data for rewriting 0 into notification field */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(0);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		sourcePID = listener->notification;
! 
! 		if (sourcePID != 0)
! 		{
! 			/* Notify the frontend */
! 
! 			if (Trace_notify)
! 				elog(DEBUG1, "ProcessIncomingNotify: received %s from %d",
! 					 relname, (int) sourcePID);
! 
! 			NotifyMyFrontEnd(relname, sourcePID);
! 
! 			/*
! 			 * Rewrite the tuple with 0 in notification column.
! 			 */
! 			rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 			simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 			CatalogUpdateIndexes(lRel, rTuple);
! #endif
! 		}
! 	}
! 	heap_endscan(scan);
! 
! 	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that other backends see our tuple updates when they look. Otherwise, a
! 	 * transaction started after this one might mistakenly think it doesn't
! 	 * need to send this backend a new NOTIFY.
! 	 */
! 	heap_close(lRel, NoLock);
! 
! 	CommitTransactionCommand();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
--- 1787,1793 ----
  
  	notifyInterruptOccurred = 0;
  
! 	asyncQueueReadAllNotifications();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
***************
*** 1051,1070 ****
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(char *relname, int32 listenerPID)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, listenerPID, sizeof(int32));
! 		pq_sendstring(&buf, relname);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 		{
! 			/* XXX Add parameter string here later */
! 			pq_sendstring(&buf, "");
! 		}
  		pq_endmessage(&buf);
  
  		/*
--- 1807,1823 ----
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(const char *channel, const char *payload, int32 srcPid)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, srcPid, sizeof(int32));
! 		pq_sendstring(&buf, channel);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 			pq_sendstring(&buf, payload);
  		pq_endmessage(&buf);
  
  		/*
***************
*** 1074,1096 ****
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", relname);
  }
  
! /* Does pendingNotifies include the given relname? */
  static bool
! AsyncExistsPendingNotify(const char *relname)
  {
  	ListCell   *p;
  
! 	foreach(p, pendingNotifies)
! 	{
! 		const char *prelname = (const char *) lfirst(p);
  
! 		if (strcmp(prelname, relname) == 0)
  			return true;
  	}
  
  	return false;
  }
  
--- 1827,1883 ----
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", channel);
  }
  
! /* Does pendingNotifies include the given channel/payload? */
  static bool
! AsyncExistsPendingNotify(const char *channel, const char *payload)
  {
  	ListCell   *p;
+ 	Notification *n;
  
! 	if (pendingNotifies == NIL)
! 		return false;
! 
! 	if (payload == NULL)
! 		payload = "";
! 
! 	/*
! 	 * We need to append new elements to the end of the list in order to keep
! 	 * the order. However, on the other hand we'd like to check the list
! 	 * backwards in order to make duplicate-elimination a tad faster when the
! 	 * same condition is signaled many times in a row. So as a compromise we
! 	 * check the tail element first which we can access directly. If this
! 	 * doesn't match, we check the rest of the list.
! 	 */
  
! 	n = (Notification *) llast(pendingNotifies);
! 	if (strcmp(n->channel, channel) == 0)
! 	{
! 		Assert(n->payload != NULL);
! 		if (strcmp(n->payload, payload) == 0)
  			return true;
  	}
  
+ 	/*
+ 	 * Note the difference to foreach(). We stop if p is the last element
+ 	 * already. So we don't check the last element, we have checked it already.
+  	 */
+ 	for(p = list_head(pendingNotifies);
+ 		p != list_tail(pendingNotifies);
+ 		p = lnext(p))
+ 	{
+ 		n = (Notification *) lfirst(p);
+ 
+ 		if (strcmp(n->channel, channel) == 0)
+ 		{
+ 			Assert(n->payload != NULL);
+ 			if (strcmp(n->payload, payload) == 0)
+ 				return true;
+ 		}
+ 	}
+ 
  	return false;
  }
  
***************
*** 1107,1112 ****
--- 1894,1902 ----
  	 */
  	pendingActions = NIL;
  	pendingNotifies = NIL;
+ 
+ 	backendSendsNotifications = false;
+ 	backendExecutesInitialListen = false;
  }
  
  /*
***************
*** 1124,1128 ****
  	 * there is any significant delay before I commit.	OK for now because we
  	 * disallow COMMIT PREPARED inside a transaction block.)
  	 */
! 	Async_Notify((char *) recdata);
  }
--- 1914,1924 ----
  	 * there is any significant delay before I commit.	OK for now because we
  	 * disallow COMMIT PREPARED inside a transaction block.)
  	 */
! 	AsyncQueueEntry		*qe = (AsyncQueueEntry *) recdata;
! 
! 	Assert(qe->dboid == MyDatabaseId);
! 	Assert(qe->length == len);
! 
! 	Async_Notify(qe->channel, qe->payload);
  }
+ 
diff -cr cvs/src/backend/nodes/copyfuncs.c cvs.build/src/backend/nodes/copyfuncs.c
*** cvs/src/backend/nodes/copyfuncs.c	2010-01-30 22:08:50.000000000 +0100
--- cvs.build/src/backend/nodes/copyfuncs.c	2010-01-30 22:10:21.000000000 +0100
***************
*** 2771,2776 ****
--- 2771,2777 ----
  	NotifyStmt *newnode = makeNode(NotifyStmt);
  
  	COPY_STRING_FIELD(conditionname);
+ 	COPY_STRING_FIELD(payload);
  
  	return newnode;
  }
diff -cr cvs/src/backend/nodes/equalfuncs.c cvs.build/src/backend/nodes/equalfuncs.c
*** cvs/src/backend/nodes/equalfuncs.c	2010-01-30 22:08:50.000000000 +0100
--- cvs.build/src/backend/nodes/equalfuncs.c	2010-01-30 22:10:21.000000000 +0100
***************
*** 1325,1330 ****
--- 1325,1331 ----
  _equalNotifyStmt(NotifyStmt *a, NotifyStmt *b)
  {
  	COMPARE_STRING_FIELD(conditionname);
+ 	COMPARE_STRING_FIELD(payload);
  
  	return true;
  }
diff -cr cvs/src/backend/nodes/outfuncs.c cvs.build/src/backend/nodes/outfuncs.c
*** cvs/src/backend/nodes/outfuncs.c	2010-01-30 22:08:50.000000000 +0100
--- cvs.build/src/backend/nodes/outfuncs.c	2010-01-30 22:10:21.000000000 +0100
***************
*** 1818,1823 ****
--- 1818,1824 ----
  	WRITE_NODE_TYPE("NOTIFY");
  
  	WRITE_STRING_FIELD(conditionname);
+ 	WRITE_STRING_FIELD(payload);
  }
  
  static void
diff -cr cvs/src/backend/nodes/readfuncs.c cvs.build/src/backend/nodes/readfuncs.c
*** cvs/src/backend/nodes/readfuncs.c	2010-01-30 22:08:50.000000000 +0100
--- cvs.build/src/backend/nodes/readfuncs.c	2010-01-30 22:10:21.000000000 +0100
***************
*** 231,236 ****
--- 231,237 ----
  	READ_LOCALS(NotifyStmt);
  
  	READ_STRING_FIELD(conditionname);
+ 	READ_STRING_FIELD(payload);
  
  	READ_DONE();
  }
diff -cr cvs/src/backend/parser/gram.y cvs.build/src/backend/parser/gram.y
*** cvs/src/backend/parser/gram.y	2010-01-30 22:09:01.000000000 +0100
--- cvs.build/src/backend/parser/gram.y	2010-01-30 22:10:21.000000000 +0100
***************
*** 400,406 ****
  
  %type <ival>	Iconst SignedIconst
  %type <list>	Iconst_list
! %type <str>		Sconst comment_text
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
--- 400,406 ----
  
  %type <ival>	Iconst SignedIconst
  %type <list>	Iconst_list
! %type <str>		Sconst comment_text notify_payload
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
***************
*** 6113,6122 ****
   *
   *****************************************************************************/
  
! NotifyStmt: NOTIFY ColId
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
  					$$ = (Node *)n;
  				}
  		;
--- 6113,6128 ----
   *
   *****************************************************************************/
  
! notify_payload:
! 			Sconst								{ $$ = $1; }
! 			| /*EMPTY*/							{ $$ = NULL; }
! 		;
! 
! NotifyStmt: NOTIFY ColId notify_payload
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
+ 					n->payload = $3;
  					$$ = (Node *)n;
  				}
  		;
diff -cr cvs/src/backend/storage/ipc/ipci.c cvs.build/src/backend/storage/ipc/ipci.c
*** cvs/src/backend/storage/ipc/ipci.c	2010-01-30 22:09:08.000000000 +0100
--- cvs.build/src/backend/storage/ipc/ipci.c	2010-01-30 22:10:21.000000000 +0100
***************
*** 20,25 ****
--- 20,26 ----
  #include "access/nbtree.h"
  #include "access/subtrans.h"
  #include "access/twophase.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "postmaster/autovacuum.h"
***************
*** 225,230 ****
--- 226,232 ----
  	 */
  	BTreeShmemInit();
  	SyncScanShmemInit();
+ 	AsyncShmemInit();
  
  #ifdef EXEC_BACKEND
  
diff -cr cvs/src/backend/storage/lmgr/lwlock.c cvs.build/src/backend/storage/lmgr/lwlock.c
*** cvs/src/backend/storage/lmgr/lwlock.c	2010-01-30 22:09:08.000000000 +0100
--- cvs.build/src/backend/storage/lmgr/lwlock.c	2010-01-30 22:10:21.000000000 +0100
***************
*** 24,29 ****
--- 24,30 ----
  #include "access/clog.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "storage/ipc.h"
***************
*** 174,179 ****
--- 175,183 ----
  	/* multixact.c needs two SLRU areas */
  	numLocks += NUM_MXACTOFFSET_BUFFERS + NUM_MXACTMEMBER_BUFFERS;
  
+ 	/* async.c needs one per page for the AsyncQueue */
+ 	numLocks += NUM_ASYNC_BUFFERS;
+ 
  	/*
  	 * Add any requested by loadable modules; for backwards-compatibility
  	 * reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
diff -cr cvs/src/backend/tcop/utility.c cvs.build/src/backend/tcop/utility.c
*** cvs/src/backend/tcop/utility.c	2010-01-30 22:09:01.000000000 +0100
--- cvs.build/src/backend/tcop/utility.c	2010-01-31 14:10:51.000000000 +0100
***************
*** 930,936 ****
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
  				PreventCommandDuringRecovery();
  
! 				Async_Notify(stmt->conditionname);
  			}
  			break;
  
--- 930,936 ----
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
  				PreventCommandDuringRecovery();
  
! 				Async_Notify(stmt->conditionname, stmt->payload);
  			}
  			break;
  
diff -cr cvs/src/backend/utils/adt/misc.c cvs.build/src/backend/utils/adt/misc.c
*** cvs/src/backend/utils/adt/misc.c	2010-01-30 22:08:55.000000000 +0100
--- cvs.build/src/backend/utils/adt/misc.c	2010-01-31 19:11:05.000000000 +0100
***************
*** 386,388 ****
--- 386,389 ----
  {
  	PG_RETURN_OID(get_fn_expr_argtype(fcinfo->flinfo, 0));
  }
+ 
diff -cr cvs/src/bin/initdb/initdb.c cvs.build/src/bin/initdb/initdb.c
*** cvs/src/bin/initdb/initdb.c	2010-01-30 22:08:40.000000000 +0100
--- cvs.build/src/bin/initdb/initdb.c	2010-01-30 22:10:21.000000000 +0100
***************
*** 2458,2463 ****
--- 2458,2464 ----
  		"pg_xlog",
  		"pg_xlog/archive_status",
  		"pg_clog",
+ 		"pg_notify",
  		"pg_subtrans",
  		"pg_twophase",
  		"pg_multixact/members",
diff -cr cvs/src/bin/psql/common.c cvs.build/src/bin/psql/common.c
*** cvs/src/bin/psql/common.c	2010-01-30 22:08:41.000000000 +0100
--- cvs.build/src/bin/psql/common.c	2010-01-30 22:10:21.000000000 +0100
***************
*** 555,562 ****
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" received from server process with PID %d.\n"),
! 				notify->relname, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
--- 555,562 ----
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" (%s) received from server process with PID %d.\n"),
! 				notify->relname, notify->extra, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
diff -cr cvs/src/bin/psql/tab-complete.c cvs.build/src/bin/psql/tab-complete.c
*** cvs/src/bin/psql/tab-complete.c	2010-01-30 22:08:40.000000000 +0100
--- cvs.build/src/bin/psql/tab-complete.c	2010-01-30 22:10:21.000000000 +0100
***************
*** 2093,2099 ****
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(relname) FROM pg_catalog.pg_listener WHERE substring(pg_catalog.quote_ident(relname),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
--- 2093,2099 ----
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(channel) FROM pg_catalog.pg_listening() AS channel WHERE substring(pg_catalog.quote_ident(channel),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
diff -cr cvs/src/include/access/slru.h cvs.build/src/include/access/slru.h
*** cvs/src/include/access/slru.h	2010-01-30 22:09:13.000000000 +0100
--- cvs.build/src/include/access/slru.h	2010-01-30 22:10:21.000000000 +0100
***************
*** 16,21 ****
--- 16,40 ----
  #include "access/xlogdefs.h"
  #include "storage/lwlock.h"
  
+ /*
+  * Define segment size.  A page is the same BLCKSZ as is used everywhere
+  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
+  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
+  * or 64K transactions for SUBTRANS.
+  *
+  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
+  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
+  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+  * take no explicit notice of that fact in this module, except when comparing
+  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
+  *
+  * Note: this file currently assumes that segment file names will be four
+  * hex digits.	This sets a lower bound on the segment size (64K transactions
+  * for 32-bit TransactionIds).
+  */
+ #define SLRU_PAGES_PER_SEGMENT	32
+ 
  
  /*
   * Page status codes.  Note that these do not include the "dirty" bit.
diff -cr cvs/src/include/catalog/pg_proc.h cvs.build/src/include/catalog/pg_proc.h
*** cvs/src/include/catalog/pg_proc.h	2010-01-30 22:09:10.000000000 +0100
--- cvs.build/src/include/catalog/pg_proc.h	2010-01-31 20:16:07.000000000 +0100
***************
*** 4113,4118 ****
--- 4113,4122 ----
  DESCR("get the prepared statements for this session");
  DATA(insert OID = 2511 (  pg_cursor PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,16,16,16,1184}" "{o,o,o,o,o,o}" "{name,statement,is_holdable,is_binary,is_scrollable,creation_time}" _null_ pg_cursor _null_ _null_ _null_ ));
  DESCR("get the open cursors for this session");
+ DATA(insert OID = 3034 (  pg_listening	PGNSP	PGUID 12 1 10 0 f f f t t s 0 0 25 "" _null_ _null_ _null_ _null_ pg_listening _null_ _null_ _null_ ));
+ DESCR("get the channels that the current backend listens to");
+ DATA(insert OID = 3035 (  send_notify  PGNSP PGUID 12 1 0 0 f f f f f v 2 0 2278 "25 25" _null_ _null_ _null_ _null_ send_notify _null_ _null_ _null_));
+ DESCR("send a notification to clients");
  DATA(insert OID = 2599 (  pg_timezone_abbrevs	PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,1186,16}" "{o,o,o}" "{abbrev,utc_offset,is_dst}" _null_ pg_timezone_abbrevs _null_ _null_ _null_ ));
  DESCR("get the available time zone abbreviations");
  DATA(insert OID = 2856 (  pg_timezone_names		PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,1186,16}" "{o,o,o,o}" "{name,abbrev,utc_offset,is_dst}" _null_ pg_timezone_names _null_ _null_ _null_ ));
***************
*** 4760,4766 ****
  DATA(insert OID = 3114 (  nth_value		PGNSP PGUID 12 1 0 0 f t f t f i 2 0 2283 "2283 23" _null_ _null_ _null_ _null_ window_nth_value _null_ _null_ _null_ ));
  DESCR("fetch the Nth row value");
  
- 
  /*
   * Symbolic values for provolatile column: these indicate whether the result
   * of a function is dependent *only* on the values of its explicit arguments,
--- 4764,4769 ----
diff -cr cvs/src/include/commands/async.h cvs.build/src/include/commands/async.h
*** cvs/src/include/commands/async.h	2010-01-30 22:09:11.000000000 +0100
--- cvs.build/src/include/commands/async.h	2010-01-31 19:08:46.000000000 +0100
***************
*** 13,28 ****
  #ifndef ASYNC_H
  #define ASYNC_H
  
  extern bool Trace_notify;
  
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_Notify(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
--- 13,41 ----
  #ifndef ASYNC_H
  #define ASYNC_H
  
+ /*
+  * Maximum size of the payload, including the terminating NUL byte.
+  */
+ #define NOTIFY_PAYLOAD_MAX_LENGTH	8000
+ 
+ /*
+  * How many page slots do we reserve ?
+  */
+ #define NUM_ASYNC_BUFFERS			4
+ 
  extern bool Trace_notify;
  
+ extern void AsyncShmemInit(void);
+ 
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname, const char *payload);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_NotifyBeforeCommit(void);
! extern void AtCommit_NotifyAfterCommit(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
***************
*** 43,46 ****
--- 56,62 ----
  extern void notify_twophase_postcommit(TransactionId xid, uint16 info,
  						   void *recdata, uint32 len);
  
+ extern Datum pg_listening(PG_FUNCTION_ARGS);
+ extern Datum send_notify(PG_FUNCTION_ARGS);
+ 
  #endif   /* ASYNC_H */
diff -cr cvs/src/include/nodes/parsenodes.h cvs.build/src/include/nodes/parsenodes.h
*** cvs/src/include/nodes/parsenodes.h	2010-01-30 22:09:11.000000000 +0100
--- cvs.build/src/include/nodes/parsenodes.h	2010-01-30 22:10:21.000000000 +0100
***************
*** 2084,2089 ****
--- 2084,2090 ----
  {
  	NodeTag		type;
  	char	   *conditionname;	/* condition name to notify */
+ 	char	   *payload;		/* the payload string to be conveyed */
  } NotifyStmt;
  
  /* ----------------------
diff -cr cvs/src/include/storage/lwlock.h cvs.build/src/include/storage/lwlock.h
*** cvs/src/include/storage/lwlock.h	2010-01-30 22:09:13.000000000 +0100
--- cvs.build/src/include/storage/lwlock.h	2010-01-30 22:10:21.000000000 +0100
***************
*** 67,72 ****
--- 67,74 ----
  	AutovacuumLock,
  	AutovacuumScheduleLock,
  	SyncScanLock,
+ 	AsyncCtlLock,
+ 	AsyncQueueLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff -cr cvs/src/include/utils/errcodes.h cvs.build/src/include/utils/errcodes.h
*** cvs/src/include/utils/errcodes.h	2010-01-30 22:09:12.000000000 +0100
--- cvs.build/src/include/utils/errcodes.h	2010-01-30 22:10:21.000000000 +0100
***************
*** 318,323 ****
--- 318,324 ----
  #define ERRCODE_STATEMENT_TOO_COMPLEX		MAKE_SQLSTATE('5','4', '0','0','1')
  #define ERRCODE_TOO_MANY_COLUMNS			MAKE_SQLSTATE('5','4', '0','1','1')
  #define ERRCODE_TOO_MANY_ARGUMENTS			MAKE_SQLSTATE('5','4', '0','2','3')
+ #define ERRCODE_TOO_MANY_ENTRIES			MAKE_SQLSTATE('5','4', '0','3','1')
  
  /* Class 55 - Object Not In Prerequisite State (class borrowed from DB2) */
  #define ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE	MAKE_SQLSTATE('5','5', '0','0','0')
diff -cr cvs/src/test/regress/expected/guc.out cvs.build/src/test/regress/expected/guc.out
*** cvs/src/test/regress/expected/guc.out	2010-01-30 22:08:39.000000000 +0100
--- cvs.build/src/test/regress/expected/guc.out	2010-01-30 22:10:21.000000000 +0100
***************
*** 532,540 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
!   relname  
! -----------
   foo_event
  (1 row)
  
--- 532,540 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
!  pg_listening 
! --------------
   foo_event
  (1 row)
  
***************
*** 571,579 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
!  relname 
! ---------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
--- 571,579 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
!  pg_listening 
! --------------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
diff -cr cvs/src/test/regress/expected/sanity_check.out cvs.build/src/test/regress/expected/sanity_check.out
*** cvs/src/test/regress/expected/sanity_check.out	2010-01-30 22:08:40.000000000 +0100
--- cvs.build/src/test/regress/expected/sanity_check.out	2010-01-30 22:10:21.000000000 +0100
***************
*** 107,113 ****
   pg_language             | t
   pg_largeobject          | t
   pg_largeobject_metadata | t
-  pg_listener             | f
   pg_namespace            | t
   pg_opclass              | t
   pg_operator             | t
--- 107,112 ----
***************
*** 154,160 ****
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (143 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
--- 153,159 ----
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (142 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
diff -cr cvs/src/test/regress/sql/guc.sql cvs.build/src/test/regress/sql/guc.sql
*** cvs/src/test/regress/sql/guc.sql	2010-01-30 22:08:38.000000000 +0100
--- cvs.build/src/test/regress/sql/guc.sql	2010-01-30 22:10:21.000000000 +0100
***************
*** 165,171 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 165,171 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
***************
*** 174,180 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 174,180 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
#82Jeff Davis
pgsql@j-davis.com
In reply to: Joachim Wieland (#81)
Re: Listen / Notify - what to do when the queue is full

On Wed, 2010-02-03 at 00:10 +0100, Joachim Wieland wrote:

I admit that it was not clear what I meant. The comment should only
address LISTEN / NOTIFY on the standby server. Do you see any problems
allowing it?

The original comment was a part of the NotifyStmt case, and I don't
think we can support NOTIFY issued on a standby system -- surely there's
no way for the standby to communicate the notification to the master.
Anyway, this is getting a little sidetracked; I don't think we need to
worry about HS right now.

I was wondering if we should have a hard limit on the maximum number
of notifications per transaction. You can now easily fill up your
backend's memory with notifications. However we had the same problem
with the old implementation already and nobody complained. The
difference is just that now you can fill it up a lot faster because
you can send a large payload.

I don't see a need for that. The only way that is likely to happen is
with triggers, and there are other ways to fill up the memory using
triggers. Even if the option is there, all we can do is abort.

The second doubt I had is about the truncation behavior of slru. ISTM
that it doesn't truncate at the end of the page range once the head
pointer has already wrapped around.
There is the following comment in slru.c describing this fact:

/*
* While we are holding the lock, make an important safety check: the
* planned cutoff point must be <= the current endpoint page. Otherwise we
* have already wrapped around, and proceeding with the truncation would
* risk removing the current segment.
*/

I wanted to check if we can do anything about it and if we need to do
anything at all...

I'll have to take a look in more detail.
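
To make the wraparound concern concrete, here is a minimal sketch of the kind of safety check that slru.c comment describes. It is not the code in slru.c: the ring size SKETCH_NUM_PAGES and the function sketch_cutoff_is_safe are hypothetical names used only for illustration, and the page space is treated as a plain ring.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical size of the circular page space (illustration only). */
#define SKETCH_NUM_PAGES 16

/*
 * Return true if truncating up to "cutoff" is safe, i.e. the cutoff page
 * still lies within the live range that runs forward from tail to head,
 * wrapping around at SKETCH_NUM_PAGES.
 */
static bool
sketch_cutoff_is_safe(int tail, int head, int cutoff)
{
	int			live_span;
	int			cutoff_offset;

	assert(tail >= 0 && tail < SKETCH_NUM_PAGES);
	assert(head >= 0 && head < SKETCH_NUM_PAGES);
	assert(cutoff >= 0 && cutoff < SKETCH_NUM_PAGES);

	/* Distance from tail to head, walking forward around the ring. */
	live_span = (head - tail + SKETCH_NUM_PAGES) % SKETCH_NUM_PAGES;
	/* Distance from tail to the proposed cutoff, in the same direction. */
	cutoff_offset = (cutoff - tail + SKETCH_NUM_PAGES) % SKETCH_NUM_PAGES;

	return cutoff_offset <= live_span;
}
```

With the tail on page 14 and the head already wrapped around to page 2, a cutoff of page 1 is accepted while a cutoff of page 5 is rejected, because page 5 lies beyond the head.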

Now the question is, should we forbid NOTIFY for 2PC
altogether only because in the unlikely event of a full queue we
cannot guarantee that we can commit the transaction?

I believe that was the consensus in the thread. The people who use 2PC
are the people that want the most rock-solid guarantees. I don't like
forbidding feature combinations, but I don't see a good alternative.

One solution is to treat a 2PC transaction like a backend with its own
pointer to the queue. As long as the prepared transaction is not
committed, its pointer does not move and so we don't move forward the
global tail pointer. Here the NOTIFYs sent by the 2PC transaction are
already in the queue and the transaction can always commit. The
drawback of this idea is that if you forget to commit the prepared
transaction and leave it around uncommitted, your queue will fill up
inevitably because you do not truncate anymore...

There's also a problem if the power goes out, you restart, and then the
queue fills up before you COMMIT PREPARED.
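
The drawback can be illustrated with a small sketch. It assumes positions are simple increasing counters (wraparound is ignored) and that a prepared transaction occupies a reader slot just like a backend. SketchReader, sketch_global_tail and sketch_free_entries are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One reader slot: a listening backend or, in this idea, a prepared xact. */
typedef struct SketchReader
{
	long		pos;		/* next queue entry this reader will consume */
	bool		active;		/* false for unused slots */
} SketchReader;

/* The global tail is the minimum position over all active readers. */
static long
sketch_global_tail(const SketchReader *readers, size_t n, long head)
{
	long		min = head;
	size_t		i;

	for (i = 0; i < n; i++)
	{
		if (readers[i].active && readers[i].pos < min)
			min = readers[i].pos;
	}
	return min;
}

/* Entries that can still be written before the queue is full. */
static long
sketch_free_entries(long head, long tail, long capacity)
{
	assert(tail <= head && head - tail <= capacity);
	return capacity - (head - tail);
}
```

A single slot stuck at position 10 pins the tail at 10 no matter how far the other readers have advanced, so the free space only shrinks until that slot is released.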

* There's a bug where an UNLISTEN can abort, and yet you still miss
the notification.
[...]
The notification is missed. It's fixed easily enough by doing the
UNLISTEN step in AtCommit_NotifyAfterCommit.

Thanks, very well spotted... Actually the same is true for LISTEN... I
have reworked the patch to do the changes to listenChannels only in
the post-commit functions.

I'm worried that this creates the opposite problem: that a LISTEN
transaction might commit before a NOTIFY transaction, and yet miss the
notification.

It seems safest to me to add a backend (LISTEN) to the list before
commit, and remove a backend (UNLISTEN) after commit. That way we are
sure to only receive spurious notifications, and can't miss any.
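
As a rough sketch of that ordering rule (the enum and function names are made up for illustration and do not appear in the patch), registration can be modelled as a function of the action and the commit phase:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum SketchAction
{
	SKETCH_LISTEN,
	SKETCH_UNLISTEN
} SketchAction;

typedef enum SketchPhase
{
	SKETCH_BEFORE_COMMIT,
	SKETCH_AFTER_COMMIT
} SketchPhase;

/*
 * Return whether the backend counts as a listener once "action" has
 * reached "phase", given whether it was registered beforehand.
 */
static bool
sketch_registered(bool was_registered, SketchAction action, SketchPhase phase)
{
	assert(action == SKETCH_LISTEN || action == SKETCH_UNLISTEN);

	/* LISTEN: register early, so nothing sent around commit is missed. */
	if (action == SKETCH_LISTEN)
		return true;

	/* UNLISTEN: stay registered until the commit has really happened. */
	return (phase == SKETCH_AFTER_COMMIT) ? false : was_registered;
}
```

In both cases the backend is registered for the whole window around the commit, so the worst outcome in this model is an extra notification rather than a lost one.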

I have also included Arnaud Betremieux's send_notify function. Should
we really call the function send_notify? What about
pg_send_notification or just pg_notify?

Agreed: pg_notify sounds good to me.

Is [void] the preferred return type?

I can't think of anything else that it might return. NOTIFY doesn't
return much ;)

Regards,
Jeff Davis

#83Joachim Wieland
joe@mcknight.de
In reply to: Jeff Davis (#82)
Re: Listen / Notify - what to do when the queue is full

On Wed, Feb 3, 2010 at 2:05 AM, Jeff Davis <pgsql@j-davis.com> wrote:

Thanks, very well spotted... Actually the same is true for LISTEN... I
have reworked the patch to do the changes to listenChannels only in
the post-commit functions.

I'm worried that this creates the opposite problem: that a LISTEN
transaction might commit before a NOTIFY transaction, and yet miss the
notification.

See the following comment and let me know if you agree...

! /*
! * Exec_ListenBeforeCommit --- subroutine for AtCommit_NotifyBeforeCommit
! *
! * Note that we only set our pointer here and do not yet add the channel to
! * listenChannels. Since our transaction could still roll back we do this only
! * after commit. We know that our tail pointer won't move between here and
! * directly after commit, so we won't miss a notification.
! */

However this introduces a new problem when an initial LISTEN aborts:
Then we are not listening to anything but for other backends it looks
like we were. This is tracked by the boolean variable
backendExecutesInitialListen and gets cleaned up in AtAbort_Notify().

It seems safest to me to add a backend (LISTEN) to the list before
commit, and remove a backend (UNLISTEN) after commit. That way we are
sure to only receive spurious notifications, and can't miss any.

If a LISTEN aborted we would not only receive a few spurious
notifications from it but would receive notifications on this channel
forever even though we have never executed LISTEN on it successfully.
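
The three steps described above (set the pointer before commit, add the channel after commit, clean up on abort) could be sketched roughly as follows. This is a simplified model and not the patch's code: SketchBackend and the sketch_* functions are hypothetical, and initial_listen_pending only stands in for backendExecutesInitialListen.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct SketchBackend
{
	bool		has_queue_pointer;		/* visible to other backends */
	bool		listening;				/* channel is in the local list */
	bool		initial_listen_pending; /* stand-in for backendExecutesInitialListen */
} SketchBackend;

static SketchBackend
sketch_backend_new(void)
{
	SketchBackend b = {false, false, false};

	return b;
}

/* Before commit: only claim a queue pointer, do not add the channel yet. */
static SketchBackend
sketch_listen_before_commit(SketchBackend b)
{
	if (!b.has_queue_pointer)
	{
		b.has_queue_pointer = true;
		b.initial_listen_pending = true;
	}
	return b;
}

/* After commit: the transaction can no longer roll back, add the channel. */
static SketchBackend
sketch_listen_after_commit(SketchBackend b)
{
	assert(b.has_queue_pointer);
	b.listening = true;
	b.initial_listen_pending = false;
	return b;
}

/* On abort: undo the pointer only if this was the very first LISTEN. */
static SketchBackend
sketch_abort(SketchBackend b)
{
	if (b.initial_listen_pending)
	{
		b.has_queue_pointer = false;
		b.initial_listen_pending = false;
	}
	return b;
}
```

An aborted initial LISTEN leaves the backend with neither a pointer nor a channel, while an aborted later LISTEN keeps the state established by the earlier committed one.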

Joachim

#84Robert Haas
robertmhaas@gmail.com
In reply to: Joachim Wieland (#83)
Re: Listen / Notify - what to do when the queue is full

On Wed, Feb 3, 2010 at 4:34 AM, Joachim Wieland <joe@mcknight.de> wrote:

On Wed, Feb 3, 2010 at 2:05 AM, Jeff Davis <pgsql@j-davis.com> wrote:

Thanks, very well spotted... Actually the same is true for LISTEN... I
have reworked the patch to do the changes to listenChannels only in
the post-commit functions.

I'm worried that this creates the opposite problem: that a LISTEN
transaction might commit before a NOTIFY transaction, and yet miss the
notification.

See the following comment and let me know if you agree...

! /*
!  * Exec_ListenBeforeCommit --- subroutine for AtCommit_NotifyBeforeCommit
!  *
!  * Note that we do only set our pointer here and do not yet add the channel to
!  * listenChannels. Since our transaction could still roll back we do this only
!  * after commit. We know that our tail pointer won't move between here and
!  * directly after commit, so we won't miss a notification.
!  */

However this introduces a new problem when an initial LISTEN aborts:
Then we are not listening to anything but for other backends it looks
like we were. This is tracked by the boolean variable
backendExecutesInitialListen and gets cleaned up in AtAbort_Notify().

It seems safest to me to add a backend (LISTEN) to the list before
commit, and remove a backend (UNLISTEN) after commit. That way we are
sure to only receive spurious notifications, and can't miss any.

If a LISTEN aborted we would not only receive a few spurious
notifications from it but would receive notifications on this channel
forever even though we have never executed LISTEN on it successfully.

Jeff, do you think this patch is ready for committer? If so, please
mark it as such on commitfest.postgresql.org - otherwise, please
clarify what you think the action items are.

Thanks,

...Robert

#85Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#84)
Re: Listen / Notify - what to do when the queue is full

On Sun, 2010-02-07 at 00:18 -0500, Robert Haas wrote:

Jeff, do you think this patch is ready for committer? If so, please
mark it as such on commitfest.postgresql.org - otherwise, please
clarify what you think the action items are.

I'll post an update tomorrow.

Regards,
Jeff Davis

#86Joachim Wieland
joe@mcknight.de
In reply to: Jeff Davis (#82)
1 attachment(s)
Re: Listen / Notify - what to do when the queue is full

On Wed, Feb 3, 2010 at 2:05 AM, Jeff Davis <pgsql@j-davis.com> wrote:

The original comment was a part of the NotifyStmt case, and I don't
think we can support NOTIFY issued on a standby system -- surely there's
no way for the standby to communicate the notification to the master.
Anyway, this is getting a little sidetracked; I don't think we need to
worry about HS right now.

True but I was not talking about moving any notifications to different
servers. Clients listening on one server should receive the
notifications from NOTIFYs executed on this server, no matter if it is
a standby or the master server. It seems to me that currently we
cannot support Listen/Notify on the Hot Standby because it calls
GetCurrentTransactionId() which is not available there. On the other
hand this is just done because we fear that a transaction could
roll back while actually committing. On the read-only standby, however, a
transaction that commits always commits successfully because it has not
changed any data (right?)...

The second doubt I had is about the truncation behavior of slru. ISTM
that it doesn't truncate at the end of the page range once the head
pointer has already wrapped around.
There is the following comment in slru.c describing this fact:

    /*
     * While we are holding the lock, make an important safety check: the
     * planned cutoff point must be <= the current endpoint page. Otherwise we
     * have already wrapped around, and proceeding with the truncation would
     * risk removing the current segment.
     */

I wanted to check if we can do anything about it and if we need to do
anything at all...

I'll have to take a look in more detail.

This is still kind of an open item but it's an slru issue and should
also be true for other functionality that uses slru queues.
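
As a rough illustration of the safety check quoted from slru.c above, the following standalone sketch refuses to truncate whenever the newest page "precedes" the planned cutoff page. The function names and the callback type are hypothetical and this is not the slru.c code; the comparison is the plain numeric one that the patch installs as asyncQueuePagePrecedesPhysically.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical callback type: does page p precede page q? */
typedef bool (*PagePrecedesFn) (int p, int q);

/* Plain numeric comparison, like asyncQueuePagePrecedesPhysically. */
static bool
sketch_precedes_physically(int p, int q)
{
	return p < q;
}

/*
 * Returns true if truncating everything before cutoff_page looks safe.
 * If the newest page precedes the cutoff, we assume the page range has
 * wrapped around and refuse, since truncating could remove live data.
 */
static bool
sketch_truncation_allowed(PagePrecedesFn precedes,
						  int latest_page, int cutoff_page)
{
	if (precedes(latest_page, cutoff_page))
		return false;
	return true;
}
```

With a purely numeric comparison, a head that has wrapped to a low page number while the tail is still at a high page number always fails this check, which would match the observation that nothing is truncated at the end of the page range after the head has wrapped.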

Now the question is, should we forbid NOTIFY for 2PC
altogether only because in the unlikely event of a full queue we
cannot guarantee that we can commit the transaction?

I believe that was the consensus in the thread. The people who use 2PC
are the people that want the most rock-solid guarantees. I don't like
forbidding feature combinations, but I don't see a good alternative.

If this was consensus, then fine... I have forbidden it in the latest patch now.

One solution is to treat a 2PC transaction like a backend with its own
pointer to the queue. As long as the prepared transaction is not
committed, its pointer does not move and so we don't move forward the
global tail pointer. Here the NOTIFYs sent by the 2PC transaction are
already in the queue and the transaction can always commit. The
drawback of this idea is that if you forget to commit the prepared
transaction and leave it around uncommitted, your queue will fill up
inevitably because you do not truncate anymore...

There's also a problem if the power goes out, you restart, and then the
queue fills up before you COMMIT PREPARED.

This I don't understand... If power goes out and we restart, we'd
first put all notifications from the prepared transactions into the
queue. We know that they fit because they have fit earlier as well (we
wouldn't allow user connections until we have worked through all 2PC
state files).
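
The idea of treating a prepared transaction like a backend with its own queue pointer can be sketched as follows. This is a simplified standalone model with made-up names (PosSketch, ReaderSketch, sketch_compute_tail), and it assumes no page wraparound so that positions can be compared numerically. It only shows why an uncommitted prepared transaction would pin the global tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A queue position: page number plus offset within the page. */
typedef struct PosSketch
{
	int			page;
	int			offset;
} PosSketch;

/* A reader is either a listening backend or a prepared transaction. */
typedef struct ReaderSketch
{
	bool		in_use;
	bool		is_prepared_xact;	/* informational only in this sketch */
	PosSketch	pos;
} ReaderSketch;

/* Numeric ordering of positions; assumes the queue has not wrapped around. */
static bool
sketch_pos_before(PosSketch a, PosSketch b)
{
	if (a.page != b.page)
		return a.page < b.page;
	return a.offset < b.offset;
}

/* New global tail: the oldest position of any active reader, or the head. */
static PosSketch
sketch_compute_tail(const ReaderSketch *readers, size_t n, PosSketch head)
{
	PosSketch	tail = head;

	for (size_t i = 0; i < n; i++)
		if (readers[i].in_use && sketch_pos_before(readers[i].pos, tail))
			tail = readers[i].pos;
	return tail;
}
```

As long as the prepared transaction's slot stays in use, the computed tail cannot move past its position, so no pages behind it can be truncated. That is the drawback mentioned above for prepared transactions that are left around uncommitted.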

I'm worried that this creates the opposite problem: that a LISTEN
transaction might commit before a NOTIFY transaction, and yet miss the
notification.

(see my other email on this item...)

I have also included Arnaud Betremieux's send_notify function. Should
we really call the function send_notify? What about
pg_send_notification or just pg_notify?

Agreed: pg_notify sounds good to me.

Function name changed.

There was another problem that the slru files did not all get deleted
at server restart, which is fixed now.

Regarding the famous ASCII-restriction open item I have now realized
what I haven't thought of previously: notifications are not
transferred between databases, they always stay in one database. Since
libpq does the conversion between server and client encoding, it is
questionable if we really need to restrict this at all... But I am not
an encoding expert so whoever feels like he can confirm or refute
this, please speak up.

Joachim

Attachments:

listennotify.10.diff (text/x-diff; charset=US-ASCII; name=listennotify.10.diff)
diff -cr cvs/src/backend/access/transam/slru.c cvs.build/src/backend/access/transam/slru.c
*** cvs/src/backend/access/transam/slru.c	2010-01-30 22:09:03.000000000 +0100
--- cvs.build/src/backend/access/transam/slru.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 58,83 ****
  #include "storage/shmem.h"
  #include "miscadmin.h"
  
- 
- /*
-  * Define segment size.  A page is the same BLCKSZ as is used everywhere
-  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
-  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
-  * or 64K transactions for SUBTRANS.
-  *
-  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
-  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
-  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
-  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
-  * take no explicit notice of that fact in this module, except when comparing
-  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
-  *
-  * Note: this file currently assumes that segment file names will be four
-  * hex digits.	This sets a lower bound on the segment size (64K transactions
-  * for 32-bit TransactionIds).
-  */
- #define SLRU_PAGES_PER_SEGMENT	32
- 
  #define SlruFileName(ctl, path, seg) \
  	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
  
--- 58,63 ----
diff -cr cvs/src/backend/access/transam/xact.c cvs.build/src/backend/access/transam/xact.c
*** cvs/src/backend/access/transam/xact.c	2010-01-30 22:09:03.000000000 +0100
--- cvs.build/src/backend/access/transam/xact.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 1728,1735 ****
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* NOTIFY commit must come before lower-level cleanup */
! 	AtCommit_Notify();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
--- 1728,1735 ----
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* Insert notifications sent by the NOTIFY command into the queue */
! 	AtCommit_NotifyBeforeCommit();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
***************
*** 1804,1809 ****
--- 1804,1814 ----
  
  	AtEOXact_MultiXact();
  
+ 	/*
+ 	 * Clean up Notify buffers and signal listening backends.
+ 	 */
+ 	AtCommit_NotifyAfterCommit();
+ 
  	ResourceOwnerRelease(TopTransactionResourceOwner,
  						 RESOURCE_RELEASE_LOCKS,
  						 true, true);
diff -cr cvs/src/backend/catalog/Makefile cvs.build/src/backend/catalog/Makefile
*** cvs/src/backend/catalog/Makefile	2010-01-30 22:08:49.000000000 +0100
--- cvs.build/src/backend/catalog/Makefile	2010-02-07 12:57:49.000000000 +0100
***************
*** 30,36 ****
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
! 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_listener.h pg_description.h \
  	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
--- 30,36 ----
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
! 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_description.h \
  	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
diff -cr cvs/src/backend/commands/async.c cvs.build/src/backend/commands/async.c
*** cvs/src/backend/commands/async.c	2010-01-30 22:08:52.000000000 +0100
--- cvs.build/src/backend/commands/async.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 14,44 ****
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  * 1. Multiple backends on same machine.  Multiple backends listening on
!  *	  one relation.  (Note: "listening on a relation" is not really the
!  *	  right way to think about it, since the notify names need not have
!  *	  anything to do with the names of relations actually in the database.
!  *	  But this terminology is all over the code and docs, and I don't feel
!  *	  like trying to replace it.)
!  *
!  * 2. There is a tuple in relation "pg_listener" for each active LISTEN,
!  *	  ie, each relname/listenerPID pair.  The "notification" field of the
!  *	  tuple is zero when no NOTIFY is pending for that listener, or the PID
!  *	  of the originating backend when a cross-backend NOTIFY is pending.
!  *	  (We skip writing to pg_listener when doing a self-NOTIFY, so the
!  *	  notification field should never be equal to the listenerPID field.)
!  *
!  * 3. The NOTIFY statement itself (routine Async_Notify) just adds the target
!  *	  relname to a list of outstanding NOTIFY requests.  Actual processing
!  *	  happens if and only if we reach transaction commit.  At that time (in
!  *	  routine AtCommit_Notify) we scan pg_listener for matching relnames.
!  *	  If the listenerPID in a matching tuple is ours, we just send a notify
!  *	  message to our own front end.  If it is not ours, and "notification"
!  *	  is not already nonzero, we set notification to our own PID and send a
!  *	  PROCSIG_NOTIFY_INTERRUPT signal to the receiving process (indicated by
!  *	  listenerPID).
!  *	  BTW: if the signal operation fails, we presume that the listener backend
!  *	  crashed without removing this tuple, and remove the tuple for it.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
--- 14,68 ----
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  *
!  * 1. Multiple backends on same machine. Multiple backends listening on
!  *	  several channels. (This was previously called a "relation" even though it
!  *	  is just an identifier and has nothing to do with a database relation.)
!  *
!  * 2. There is one central queue in the form of Slru backed file based storage
!  *    (directory pg_notify/), with several pages mapped into shared memory.
!  *
!  *    There is no central storage of which backend listens on which channel,
!  *    every backend has its own list.
!  *
!  *    Every backend that is listening on at least one channel registers by
!  *    entering its Pid into the array of all backends. It then scans all
!  *    incoming notifications and compares the notified channels with its list.
!  *
!  *    In case there is a match it delivers the corresponding notification to
!  *    its frontend.
!  *
!  * 3. The NOTIFY statement (routine Async_Notify) stores the notification
!  *    in a list which will not be processed until transaction end. Every
!  *    notification can additionally send a "payload" which is an extra text
!  *    parameter to convey arbitrary information to the recipient.
!  *
!  *    Duplicate notifications from the same transaction are sent out as one
!  *    notification only. This is done to save work when for example a trigger
!  *    on a 2 million row table fires a notification for each row that has been
!  *    changed. If the application needs to receive every single notification
!  *    that has been sent, it can easily add some unique string into the extra
!  *    payload parameter.
!  *
!  *    Once the transaction commits, AtCommit_NotifyBeforeCommit() performs the
!  *    required changes regarding listeners (Listen/Unlisten) and then adds the
!  *    pending notifications to the head of the queue. The head pointer of the
!  *    queue always points to the next free position and a position is just a
!  *    page number and the offset in that page. This is done before marking the
!  *    transaction as committed in clog. If we run into problems writing the
!  *    notifications, we can still call elog(ERROR, ...) and the transaction
!  *    will roll back.
!  *
!  *    Once we have put all of the notifications into the queue, we return to
!  *    CommitTransaction() which will then commit to clog.
!  *
!  *    After clog commit we are called another time
!  *    (AtCommit_NotifyAfterCommit()). Here we check if we need to signal the
!  *    backends. In SignalBackends() we scan the list of listening backends and
!  *    send a PROCSIG_NOTIFY_INTERRUPT to every backend that has set its Pid (we
!  *    don't know which backend is listening on which channel so we need to send
!  *    a signal to every listening backend). We can exclude backends that are
!  *    already up to date.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
***************
*** 46,84 ****
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  * 5. Inbound-notify processing consists of scanning pg_listener for tuples
!  *	  matching our own listenerPID and having nonzero notification fields.
!  *	  For each such tuple, we send a message to our frontend and clear the
!  *	  notification field.  BTW: this routine has to start/commit its own
!  *	  transaction, since by assumption it is only called from outside any
!  *	  transaction.
!  *
!  * Like NOTIFY, LISTEN and UNLISTEN just add the desired action to a list
!  * of pending actions.	If we reach transaction commit, the changes are
!  * applied to pg_listener just before executing any pending NOTIFYs.  This
!  * method is necessary because to avoid race conditions, we must hold lock
!  * on pg_listener from when we insert a new listener tuple until we commit.
!  * To do that and not create undue hazard of deadlock, we don't want to
!  * touch pg_listener until we are otherwise done with the transaction;
!  * in particular it'd be uncool to still be taking user-commanded locks
!  * while holding the pg_listener lock.
!  *
!  * Although we grab ExclusiveLock on pg_listener for any operation,
!  * the lock is never held very long, so it shouldn't cause too much of
!  * a performance problem.  (Previously we used AccessExclusiveLock, but
!  * there's no real reason to forbid concurrent reads.)
   *
!  * An application that listens on the same relname it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * PID.  (As of FE/BE protocol 2.0, the backend's PID is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.  Note,
!  * however, that we do *not* guarantee that a separate frontend message will
!  * be sent for every outside NOTIFY.  Since there is only room for one
!  * originating PID in pg_listener, outside notifies occurring at about the
!  * same time may be collapsed into a single message bearing the PID of the
!  * first outside backend to perform the NOTIFY.
   *-------------------------------------------------------------------------
   */
  
--- 70,91 ----
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  *    Inbound-notify processing consists of reading all of the notifications
!  *	  that have arrived since scanning last time. We read every notification
!  *	  until we reach either a notification from an uncommitted transaction or
!  *	  the head pointer's position. Then we check if we were the laziest
!  *	  backend: if our pointer is set to the same position as the global tail
!  *	  pointer is set, then we set it further to the second-laziest backend (We
!  *	  can identify it by inspecting the positions of all other backends'
!  *	  pointers). Whenever we move the tail pointer we also truncate now unused
!  *	  pages (i.e. delete files in pg_notify/ that are no longer used).
   *
!  * An application that listens on the same channel it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * Pid.  (As of FE/BE protocol 2.0, the backend's Pid is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.
   *-------------------------------------------------------------------------
   */
  
***************
*** 88,97 ****
  #include <signal.h>
  
  #include "access/heapam.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
! #include "catalog/pg_listener.h"
  #include "commands/async.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
--- 95,107 ----
  #include <signal.h>
  
  #include "access/heapam.h"
+ #include "access/slru.h"
+ #include "access/transam.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
! #include "catalog/pg_type.h"
  #include "commands/async.h"
+ #include "funcapi.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
***************
*** 108,115 ****
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction.  As explained above,
!  * we don't actually modify pg_listener until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
--- 118,125 ----
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction. As explained above,
!  * we don't actually send notifications until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
***************
*** 126,132 ****
  typedef struct
  {
  	ListenActionKind action;
! 	char		condname[1];	/* actually, as long as needed */
  } ListenAction;
  
  static List *pendingActions = NIL;		/* list of ListenAction */
--- 136,142 ----
  typedef struct
  {
  	ListenActionKind action;
! 	char		channel[1];	/* actually, as long as needed */
  } ListenAction;
  
  static List *pendingActions = NIL;		/* list of ListenAction */
***************
*** 134,140 ****
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all relnames NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
--- 144,150 ----
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all channels NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
***************
*** 149,160 ****
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
- static List *pendingNotifies = NIL;		/* list of C strings */
  
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
  
  /*
!  * State for inbound notifies consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
--- 159,281 ----
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
  
+ typedef struct QueuePosition
+ {
+ 	int				page;
+ 	int				offset;
+ } QueuePosition;
+ 
+ typedef struct Notification
+ {
+ 	char		   *channel;
+ 	char		   *payload;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ } Notification;
+ 
+ typedef struct AsyncQueueEntry
+ {
+ 	/*
+ 	 * this record has the maximal length, but usually we limit it to
+ 	 * AsyncQueueEntryEmptySize + strlen(payload).
+ 	 */
+ 	Size			length;
+ 	Oid				dboid;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ 	char			channel[NAMEDATALEN];
+ 	char			payload[NOTIFY_PAYLOAD_MAX_LENGTH];
+ } AsyncQueueEntry;
+ #define AsyncQueueEntryEmptySize \
+ 	 (sizeof(AsyncQueueEntry) - NOTIFY_PAYLOAD_MAX_LENGTH + 1)
+ 
+ #define	InvalidPid				(-1)
+ #define QUEUE_POS_PAGE(x)		((x).page)
+ #define QUEUE_POS_OFFSET(x)		((x).offset)
+ #define QUEUE_POS_EQUAL(x,y) \
+ 	 ((x).page == (y).page ? (x).offset == (y).offset : false)
+ #define SET_QUEUE_POS(x,y,z) \
+ 	do { \
+ 		(x).page = (y); \
+ 		(x).offset = (z); \
+ 	} while (0)
+ /* return the logically earlier of positions x and y, with z = HEAD */
+ #define QUEUE_POS_MIN(x,y,z) \
+ 	(asyncQueuePagePrecedesLogically((x).page, (y).page, (z).page) ? (x) : \
+ 		 asyncQueuePagePrecedesLogically((y).page, (x).page, (z).page) ? (y) : \
+ 			 (x).offset < (y).offset ? (x) : \
+ 			 	(y))
+ #define QUEUE_BACKEND_POS(i)		asyncQueueControl->backend[(i)].pos
+ #define QUEUE_BACKEND_PID(i)		asyncQueueControl->backend[(i)].pid
+ #define QUEUE_HEAD					asyncQueueControl->head
+ #define QUEUE_TAIL					asyncQueueControl->tail
+ 
+ typedef struct QueueBackendStatus
+ {
+ 	int32			pid;
+ 	QueuePosition	pos;
+ } QueueBackendStatus;
+ 
+ /*
+  * The AsyncQueueControl structure is protected by the AsyncQueueLock.
+  *
+  * In SHARED mode, backends will only inspect their own entries as well as
+  * head and tail pointers. Consequently we can allow a backend to update its
+  * own record while holding only a shared lock (since no other backend will
+  * inspect it).
+  *
+  * In EXCLUSIVE mode, backends can inspect the entries of other backends and
+  * also change head and tail pointers.
+  *
+  * In order to avoid deadlocks, whenever we need both locks, we always first
+  * get AsyncQueueLock and then AsyncCtlLock.
+  */
+ typedef struct AsyncQueueControl
+ {
+ 	QueuePosition		head;		/* head points to the next free location */
+ 	QueuePosition 		tail;		/* the global tail is equivalent to the
+ 									   tail of the "slowest" backend */
+ 	TimestampTz			lastQueueFillWarn;	/* when the queue is full we only
+ 											   want to log that once in a
+ 											   while */
+ 	QueueBackendStatus	backend[1];	/* actually this one has as many entries as
+ 									 * connections are allowed (MaxBackends) */
+ 	/* DO NOT ADD FURTHER STRUCT MEMBERS HERE */
+ } AsyncQueueControl;
+ 
+ static AsyncQueueControl   *asyncQueueControl;
+ static SlruCtlData			AsyncCtlData;
+ 
+ #define AsyncCtl					(&AsyncCtlData)
+ #define QUEUE_PAGESIZE				BLCKSZ
+ #define QUEUE_FULL_WARN_INTERVAL	5000	/* warn at most once every 5s */
+ 
+ /*
+  * slru.c currently assumes that all filenames are four characters of hex
+  * digits. That means that we can use segments 0000 through FFFF.
+  * Each segment contains SLRU_PAGES_PER_SEGMENT pages which gives us
+  * the pages from 0 to SLRU_PAGES_PER_SEGMENT * 0xFFFF.
+  *
+  * It's of course easy to enhance slru.c but those pages give us so much
+  * space already that it doesn't seem worth the trouble...
+  *
+  * It's an interesting test case to define QUEUE_MAX_PAGE to a very small
+  * multiple of SLRU_PAGES_PER_SEGMENT to test queue full behaviour.
+  */
+ #define QUEUE_MAX_PAGE			(SLRU_PAGES_PER_SEGMENT * 0xFFFF)
+ 
+ static List *pendingNotifies = NIL;				/* list of Notifications */
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
+ static List *listenChannels = NIL;	/* list of channels we are listening to */
+ 
+ /* has this backend sent notifications in the current transaction ? */
+ static bool backendSendsNotifications = false;
+ /* has this backend executed a LISTEN in the current transaction ? */
+ static bool backendExecutesInitialListen = false;
  
  /*
!  * State for inbound notifications consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
***************
*** 171,224 ****
  
  bool		Trace_notify = false;
  
! 
! static void queue_listen(ListenActionKind action, const char *condname);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static void Exec_Listen(Relation lRel, const char *relname);
! static void Exec_Unlisten(Relation lRel, const char *relname);
! static void Exec_UnlistenAll(Relation lRel);
! static void Send_Notify(Relation lRel);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(char *relname, int32 listenerPID);
! static bool AsyncExistsPendingNotify(const char *relname);
  static void ClearPendingActionsAndNotifies(void);
  
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the relation to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", relname);
  
  	/* no point in making duplicate entries in the list ... */
! 	if (!AsyncExistsPendingNotify(relname))
! 	{
! 		/*
! 		 * The name list needs to live until end of transaction, so store it
! 		 * in the transaction context.
! 		 */
! 		MemoryContext oldcontext;
  
! 		oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
! 		/*
! 		 * Ordering of the list isn't important.  We choose to put new entries
! 		 * on the front, as this might make duplicate-elimination a tad faster
! 		 * when the same condition is signaled many times in a row.
! 		 */
! 		pendingNotifies = lcons(pstrdup(relname), pendingNotifies);
  
! 		MemoryContextSwitchTo(oldcontext);
! 	}
  }
  
  /*
--- 292,488 ----
  
  bool		Trace_notify = false;
  
! static void queue_listen(ListenActionKind action, const char *channel);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static bool IsListeningOn(const char *channel);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
! static void Exec_ListenBeforeCommit(const char *channel);
! static void Exec_ListenAfterCommit(const char *channel);
! static void Exec_UnlistenAfterCommit(const char *channel);
! static void Exec_UnlistenAllAfterCommit(void);
! static void SignalBackends(void);
! static void Send_Notify(void);
! static bool asyncQueuePagePrecedesPhysically(int p, int q);
! static bool asyncQueuePagePrecedesLogically(int p, int q, int head);
! static bool asyncQueueAdvance(QueuePosition *position, int entryLength);
! static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
! static void asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n);
! static List *asyncQueueAddEntries(List *notifications);
! static bool asyncQueueGetEntriesByPage(QueuePosition *current,
! 									   QueuePosition stop,
! 									   List **notifications);
! static void asyncQueueReadAllNotifications(void);
! static void asyncQueueAdvanceTail(void);
! static void asyncQueueUnregister(void);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(const char *channel,
! 							 const char *payload,
! 							 int32 srcPid);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
  static void ClearPendingActionsAndNotifies(void);
  
+ /*
+  * We will work on the page range of 0..(SLRU_PAGES_PER_SEGMENT * 0xFFFF).
+  * asyncQueuePagePrecedesPhysically just checks numerically without any magic
+  * if one page precedes another one.
+  *
+  * On the other hand, when asyncQueuePagePrecedesLogically does that check, it
+  * takes the current head page number into account. If we have wrapped
+  * around, it can happen that p precedes q, even though p > q (if the head page
+  * is in between the two).
+  */ 
+ static bool
+ asyncQueuePagePrecedesPhysically(int p, int q)
+ {
+ 	return p < q;
+ }
+ 
+ static bool
+ asyncQueuePagePrecedesLogically(int p, int q, int head)
+ {
+ 	if (p <= head && q <= head)
+ 		return p < q;
+ 	if (p > head && q > head)
+ 		return p < q;
+ 	if (p <= head)
+ 	{
+ 		Assert(q > head);
+ 		/* q is older */
+ 		return false;
+ 	}
+ 	else
+ 	{
+ 		Assert(p > head && q <= head);
+ 		/* p is older */
+ 		return true;
+ 	}
+ }
+ 
+ void
+ AsyncShmemInit(void)
+ {
+ 	bool	found;
+ 	int		slotno;
+ 	Size	size;
+ 
+ 	/*
+ 	 * Remember that sizeof(AsyncQueueControl) already contains one member of
+ 	 * QueueBackendStatus, so we only need to add the status space requirement
+ 	 * for MaxBackends-1 backends.
+ 	 */
+ 	size = mul_size(MaxBackends-1, sizeof(QueueBackendStatus));
+ 	size = add_size(size, sizeof(AsyncQueueControl));
+ 
+ 	asyncQueueControl = (AsyncQueueControl *)
+ 		ShmemInitStruct("Async Queue Control", size, &found);
+ 
+ 	if (!asyncQueueControl)
+ 		elog(ERROR, "out of memory");
+ 
+ 	if (!found)
+ 	{
+ 		int		i;
+ 		SET_QUEUE_POS(QUEUE_HEAD, 0, 0);
+ 		SET_QUEUE_POS(QUEUE_TAIL, 0, 0);
+ 		for (i = 0; i < MaxBackends; i++)
+ 		{
+ 			SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ 			QUEUE_BACKEND_PID(i) = InvalidPid;
+ 		}
+ 	}
+ 
+ 	AsyncCtl->PagePrecedes = asyncQueuePagePrecedesPhysically;
+ 	SimpleLruInit(AsyncCtl, "Async Ctl", NUM_ASYNC_BUFFERS, 0,
+ 				  AsyncCtlLock, "pg_notify");
+ 	AsyncCtl->do_fsync = false;
+ 	asyncQueueControl->lastQueueFillWarn = GetCurrentTimestamp();
+ 
+ 	if (!found)
+ 	{
+ 		SlruScanDirectory(AsyncCtl,
+ 						  QUEUE_MAX_PAGE + SLRU_PAGES_PER_SEGMENT,
+ 						  true);
+ 
+ 		LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
+ 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
+ 		SimpleLruWritePage(AsyncCtl, slotno, NULL);
+ 		LWLockRelease(AsyncCtlLock);
+ 	}
+ }
+ 
+ 
+ /*
+  * pg_notify -
+  *	  Send a notification to listening clients
+  */
+ Datum
+ pg_notify(PG_FUNCTION_ARGS)
+ {
+ 	const char *channelStr;
+ 	const char *payloadStr;
+ 	text	   *channel = PG_GETARG_TEXT_PP(0);
+ 	text	   *payload = PG_GETARG_TEXT_PP(1);
+ 
+ 	channelStr = text_to_cstring(channel);
+ 	payloadStr = text_to_cstring(payload);
+ 
+ 	Async_Notify(channelStr, payloadStr);
+ 
+ 	PG_RETURN_VOID();
+ }
+ 
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the channel to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *channel, const char *payload)
  {
+ 	Notification *n;
+ 	MemoryContext oldcontext;
+ 
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", channel);
! 
! 	if (payload && strlen(payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 				 errmsg("payload string too long")));
  
  	/* no point in making duplicate entries in the list ... */
! 	if (AsyncExistsPendingNotify(channel, payload))
! 		return;
  
! 	/*
! 	 * The name list needs to live until end of transaction, so store it
! 	 * in the transaction context.
! 	 */
! 	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
! 	n = (Notification *) palloc(sizeof(Notification));
! 	n->channel = pstrdup(channel);
! 	if (payload)
! 		n->payload = pstrdup(payload);
! 	else
! 		n->payload = "";
  
! 	/* will set the xid and the srcPid later... */
! 	n->xid = InvalidTransactionId;
! 	n->srcPid = InvalidPid;
! 
! 	/*
! 	 * We want to preserve the order so we need to append every
! 	 * notification. See comments at AsyncExistsPendingNotify().
! 	 */
! 	pendingNotifies = lappend(pendingNotifies, n);
! 
! 	MemoryContextSwitchTo(oldcontext);
  }
  
  /*
***************
*** 226,236 ****
   *		Common code for listen, unlisten, unlisten all commands.
   *
   *		Adds the request to the list of pending actions.
!  *		Actual update of pg_listener happens during transaction commit.
!  *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  static void
! queue_listen(ListenActionKind action, const char *condname)
  {
  	MemoryContext oldcontext;
  	ListenAction *actrec;
--- 490,500 ----
   *		Common code for listen, unlisten, unlisten all commands.
   *
   *		Adds the request to the list of pending actions.
!  *		Actual update of the notification queue happens during transaction
!  *		commit.
   */
  static void
! queue_listen(ListenActionKind action, const char *channel)
  {
  	MemoryContext oldcontext;
  	ListenAction *actrec;
***************
*** 244,252 ****
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  	/* space for terminating null is included in sizeof(ListenAction) */
! 	actrec = (ListenAction *) palloc(sizeof(ListenAction) + strlen(condname));
  	actrec->action = action;
! 	strcpy(actrec->condname, condname);
  
  	pendingActions = lappend(pendingActions, actrec);
  
--- 508,516 ----
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  	/* space for terminating null is included in sizeof(ListenAction) */
! 	actrec = (ListenAction *) palloc(sizeof(ListenAction) + strlen(channel));
  	actrec->action = action;
! 	strcpy(actrec->channel, channel);
  
  	pendingActions = lappend(pendingActions, actrec);
  
***************
*** 259,270 ****
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", relname, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, relname);
  }
  
  /*
--- 523,534 ----
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", channel, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, channel);
  }
  
  /*
***************
*** 273,288 ****
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", relname, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, relname);
  }
  
  /*
--- 537,552 ----
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", channel, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, channel);
  }
  
  /*
***************
*** 306,313 ****
  /*
   * Async_UnlistenOnExit
   *
-  *		Clean up the pg_listener table at backend exit.
-  *
   *		This is executed if we have done any LISTENs in this backend.
   *		It might not be necessary anymore, if the user UNLISTENed everything,
   *		but we don't try to detect that case.
--- 570,575 ----
***************
*** 315,331 ****
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
- 	/*
- 	 * We need to start/commit a transaction for the unlisten, but if there is
- 	 * already an active transaction we had better abort that one first.
- 	 * Otherwise we'd end up committing changes that probably ought to be
- 	 * discarded.
- 	 */
  	AbortOutOfAnyTransaction();
! 	/* Now we can do the unlisten */
! 	StartTransactionCommand();
! 	Async_UnlistenAll();
! 	CommitTransactionCommand();
  }
  
  /*
--- 577,584 ----
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
  	AbortOutOfAnyTransaction();
! 	Exec_UnlistenAllAfterCommit();
  }
  
  /*
***************
*** 340,386 ****
  	ListCell   *p;
  
  	/* It's not sensible to have any pending LISTEN/UNLISTEN actions */
! 	if (pendingActions)
  		ereport(ERROR,
  				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
! 				 errmsg("cannot PREPARE a transaction that has executed LISTEN or UNLISTEN")));
! 
! 	/* We can deal with pending NOTIFY though */
! 	foreach(p, pendingNotifies)
! 	{
! 		const char *relname = (const char *) lfirst(p);
! 
! 		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   relname, strlen(relname) + 1);
! 	}
! 
! 	/*
! 	 * We can clear the state immediately, rather than needing a separate
! 	 * PostPrepare call, because if the transaction fails we'd just discard
! 	 * the state anyway.
! 	 */
! 	ClearPendingActionsAndNotifies();
  }
  
  /*
!  * AtCommit_Notify
!  *
!  *		This is called at transaction commit.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, insert or delete
!  *		tuples in pg_listener accordingly.
   *
!  *		If there are outbound notify requests in the pendingNotifies list,
!  *		scan pg_listener for matching tuples, and either signal the other
!  *		backend or send a message to our own frontend.
   *
!  *		NOTE: we are still inside the current transaction, therefore can
!  *		piggyback on its committing of changes.
   */
  void
! AtCommit_Notify(void)
  {
- 	Relation	lRel;
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
--- 593,619 ----
  	ListCell   *p;
  
  	/* It's not sensible to have any pending LISTEN/UNLISTEN actions */
! 	if (pendingActions || pendingNotifies)
  		ereport(ERROR,
  				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
! 				 errmsg("cannot PREPARE a transaction that has executed LISTEN/UNLISTEN or NOTIFY")));
  }
  
  /*
!  * AtCommit_NotifyBeforeCommit
   *
!  *		This is called at transaction commit, before actually committing to
!  *		clog.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, update our
!  *		"listenChannels" list.
   *
!  *		If there are outbound notify requests in the pendingNotifies list, add
!  *		them to the global queue and signal any backend that is listening.
   */
  void
! AtCommit_NotifyBeforeCommit(void)
  {
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
***************
*** 397,406 ****
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify");
  
! 	/* Acquire ExclusiveLock on pg_listener */
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
--- 630,639 ----
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyBeforeCommit");
  
! 	Assert(backendSendsNotifications == false);
! 	Assert(backendExecutesInitialListen == false);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
***************
*** 410,508 ****
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_Listen(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN:
! 				Exec_Unlisten(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAll(lRel);
  				break;
  		}
- 
- 		/* We must CCI after each action in case of conflicting actions */
- 		CommandCounterIncrement();
  	}
  
- 	/* Perform any pending notifies */
- 	if (pendingNotifies)
- 		Send_Notify(lRel);
- 
  	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that notified backends see our tuple updates when they look. Else they
! 	 * might disregard the signal, which would make the application programmer
! 	 * very unhappy.  Also, this prevents race conditions when we have just
! 	 * inserted a listening tuple.
  	 */
! 	heap_close(lRel, NoLock);
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify: done");
  }
  
  /*
!  * Exec_Listen --- subroutine for AtCommit_Notify
!  *
!  *		Register the current backend as listening on the specified relation.
   */
! static void
! Exec_Listen(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
! 	Datum		values[Natts_pg_listener];
! 	bool		nulls[Natts_pg_listener];
! 	NameData	condname;
! 	bool		alreadyListener = false;
  
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Listen(%s,%d)", relname, MyProcPid);
  
! 	/* Detect whether we are already listening on this relname */
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
  
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
! 		{
! 			alreadyListener = true;
! 			/* No need to scan the rest of the table */
! 			break;
! 		}
  	}
- 	heap_endscan(scan);
  
! 	if (alreadyListener)
! 		return;
  
! 	/*
! 	 * OK to insert a new tuple
! 	 */
! 	memset(nulls, false, sizeof(nulls));
  
! 	namestrcpy(&condname, relname);
! 	values[Anum_pg_listener_relname - 1] = NameGetDatum(&condname);
! 	values[Anum_pg_listener_listenerpid - 1] = Int32GetDatum(MyProcPid);
! 	values[Anum_pg_listener_notification - 1] = Int32GetDatum(0);		/* no notifies pending */
  
! 	tuple = heap_form_tuple(RelationGetDescr(lRel), values, nulls);
  
! 	simple_heap_insert(lRel, tuple);
  
! #ifdef NOT_USED					/* currently there are no indexes */
! 	CatalogUpdateIndexes(lRel, tuple);
! #endif
  
! 	heap_freetuple(tuple);
  
  	/*
! 	 * now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
--- 643,828 ----
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_ListenBeforeCommit(actrec->channel);
  				break;
  			case LISTEN_UNLISTEN:
! 				/* there is no Exec_UnlistenBeforeCommit() */
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				/* there is no Exec_UnlistenAllBeforeCommit() */
  				break;
  		}
  	}
  
  	/*
! 	 * Perform any pending notifies.
  	 */
! 	if (pendingNotifies)
! 		Send_Notify();
! }
! 
! /*
!  * AtCommit_NotifyAfterCommit
!  *
!  *		This is called at transaction commit, after committing to clog.
!  *
!  *		Notify the listening backends.
!  */
! void
! AtCommit_NotifyAfterCommit(void)
! {
! 	ListCell   *p;
! 
! 	/* Allow transactions that have not executed LISTEN/UNLISTEN/NOTIFY to
! 	 * return as soon as possible */
! 	if (!pendingActions && !backendSendsNotifications)
! 		return;
! 
! 	/* Perform any pending listen/unlisten actions */
! 	foreach(p, pendingActions)
! 	{
! 		ListenAction *actrec = (ListenAction *) lfirst(p);
! 
! 		switch (actrec->action)
! 		{
! 			case LISTEN_LISTEN:
! 				Exec_ListenAfterCommit(actrec->channel);
! 				break;
! 			case LISTEN_UNLISTEN:
! 				Exec_UnlistenAfterCommit(actrec->channel);
! 				break;
! 			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAllAfterCommit();
! 				break;
! 		}
! 	}
! 
! 	if (backendSendsNotifications)
! 		SignalBackends();
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyAfterCommit: done");
  }
  
  /*
!  * This function is executed for every notification found in the queue in order
!  * to check if the current backend is listening on that channel. Not sure if we
!  * should further optimize this, for example convert to a sorted array and
!  * allow binary search on it...
   */
! static bool
! IsListeningOn(const char *channel)
  {
! 	ListCell   *p;
! 	char	   *lchan;
  
! 	foreach(p, listenChannels)
! 	{
! 		lchan = (char *) lfirst(p);
! 		if (strcmp(lchan, channel) == 0)
! 			return true;
! 	}
! 	return false;
! }
! 
! Datum
! pg_listening(PG_FUNCTION_ARGS)
! {
! 	FuncCallContext	   *funcctx;
! 	ListCell		  **lcp;
  
! 	/* stuff done only on the first call of the function */
! 	if (SRF_IS_FIRSTCALL())
  	{
! 		MemoryContext	oldcontext;
  
! 		/* create a function context for cross-call persistence */
! 		funcctx = SRF_FIRSTCALL_INIT();
! 
! 		/*
! 		 * switch to memory context appropriate for multiple function calls
! 		 */
! 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
! 
! 		/* allocate memory for user context */
! 		lcp = (ListCell **) palloc(sizeof(ListCell *));
! 		if (listenChannels != NIL)
! 			*lcp = list_head(listenChannels);
! 		else
! 			*lcp = NULL;
! 		funcctx->user_fctx = (void *) lcp;
! 
! 		MemoryContextSwitchTo(oldcontext);
  	}
  
! 	/* stuff done on every call of the function */
! 	funcctx = SRF_PERCALL_SETUP();
! 	lcp = (ListCell **) funcctx->user_fctx;
  
! 	while (*lcp != NULL)
! 	{
! 		char   *channel = (char *) lfirst(*lcp);
  
! 		*lcp = (*lcp)->next;
! 		SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
! 	}
  
! 	SRF_RETURN_DONE(funcctx);
! }
  
! /*
!  * Exec_ListenBeforeCommit --- subroutine for AtCommit_NotifyBeforeCommit
!  *
!  * Note that we only set our pointer here and do not yet add the channel to
!  * listenChannels. Since our transaction could still roll back we do this only
!  * after commit. We know that our tail pointer won't move between here and
!  * directly after commit, so we won't miss a notification.
!  */
! static void
! Exec_ListenBeforeCommit(const char *channel)
! {
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_ListenBeforeCommit(%s,%d)", channel, MyProcPid);
  
! 	/* Detect whether we are already listening to something. */
! 	if (listenChannels != NIL)
! 		return;
  
! 	/*
! 	 * We need this variable to detect an aborted initial LISTEN.
! 	 * In that case we would set up our pointer but not listen on any channel.
! 	 * This state gets cleaned up again in AtAbort_Notify().
! 	 */
! 	backendExecutesInitialListen = true;
  
  	/*
! 	 * This is our first LISTEN, establish our pointer.
! 	 * We set our pointer to the global tail pointer, this way we make
! 	 * sure that we get all of the notifications. We might get a few more
! 	 * but that doesn't hurt.
! 	 */
! 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 	QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;
! 	QUEUE_BACKEND_PID(MyBackendId) = MyProcPid;
! 	LWLockRelease(AsyncQueueLock);
! 
! 	/*
! 	 * Try to move our pointer forward as far as possible. This will skip
! 	 * over already committed notifications. Still, we could get
! 	 * notifications that have already committed before we started to
! 	 * LISTEN.
! 	 *
! 	 * Note that we are not yet listening on anything, so we won't deliver
! 	 * any notification.
! 	 *
! 	 * This will also advance the global tail pointer if necessary.
! 	 */
! 	asyncQueueReadAllNotifications();
! 
! 	/*
! 	 * Now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
***************
*** 512,550 ****
  }
  
  /*
!  * Exec_Unlisten --- subroutine for AtCommit_Notify
   *
!  *		Remove the current backend from the list of listening backends
!  *		for the specified relation.
   */
  static void
! Exec_Unlisten(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Unlisten(%s,%d)", relname, MyProcPid);
  
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
! 
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
  		{
! 			/* Found the matching tuple, delete it */
! 			simple_heap_delete(lRel, &tuple->t_self);
! 
! 			/*
! 			 * We assume there can be only one match, so no need to scan the
! 			 * rest of the table
! 			 */
  			break;
  		}
  	}
! 	heap_endscan(scan);
  
  	/*
  	 * We do not complain about unlistening something not being listened;
--- 832,884 ----
  }
  
  /*
!  * Exec_ListenAfterCommit --- subroutine for AtCommit_NotifyAfterCommit
   *
!  * Add the channel to the list of channels we are listening on.
   */
  static void
! Exec_ListenAfterCommit(const char *channel)
  {
! 	MemoryContext oldcontext;
! 
! 	/* Detect whether we are already listening on this channel */
! 	if (IsListeningOn(channel))
! 		return;
! 
! 	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
! 	listenChannels = lappend(listenChannels, pstrdup(channel));
! 	MemoryContextSwitchTo(oldcontext);
! }
! 
! /*
!  * Exec_UnlistenAfterCommit --- subroutine for AtCommit_NotifyAfterCommit
!  *
!  * Remove a specified channel from "listenChannels".
!  */
! static void
! Exec_UnlistenAfterCommit(const char *channel)
! {
! 	ListCell *q;
! 	ListCell *prev;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAfterCommit(%s,%d)", channel, MyProcPid);
  
! 	prev = NULL;
! 	foreach(q, listenChannels)
  	{
! 		char *lchan = (char *) lfirst(q);
! 		if (strcmp(lchan, channel) == 0)
  		{
! 			pfree(lchan);
! 			listenChannels = list_delete_cell(listenChannels, q, prev);
  			break;
  		}
+ 		prev = q;
  	}
! 
! 	if (listenChannels == NIL)
! 		asyncQueueUnregister();
  
  	/*
  	 * We do not complain about unlistening something not being listened;
***************
*** 553,690 ****
  }
  
  /*
!  * Exec_UnlistenAll --- subroutine for AtCommit_Notify
   *
!  *		Update pg_listener to unlisten all relations for this backend.
   */
  static void
! Exec_UnlistenAll(Relation lRel)
  {
- 	HeapScanDesc scan;
- 	HeapTuple	lTuple;
- 	ScanKeyData key[1];
- 
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAll");
  
! 	/* Find and delete all entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
  
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 		simple_heap_delete(lRel, &lTuple->t_self);
  
! 	heap_endscan(scan);
  }
  
  /*
!  * Send_Notify --- subroutine for AtCommit_Notify
   *
!  *		Scan pg_listener for tuples matching our pending notifies, and
!  *		either signal the other backend or send a message to our own frontend.
   */
! static void
! Send_Notify(Relation lRel)
  {
! 	TupleDesc	tdesc = RelationGetDescr(lRel);
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 
! 	/* preset data to update notify column to MyProcPid */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(MyProcPid);
! 
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		listenerPID = listener->listenerpid;
  
! 		if (!AsyncExistsPendingNotify(relname))
! 			continue;
  
! 		if (listenerPID == MyProcPid)
  		{
! 			/*
! 			 * Self-notify: no need to bother with table update. Indeed, we
! 			 * *must not* clear the notification field in this path, or we
! 			 * could lose an outside notify, which'd be bad for applications
! 			 * that ignore self-notify messages.
! 			 */
! 			if (Trace_notify)
! 				elog(DEBUG1, "AtCommit_Notify: notifying self");
  
! 			NotifyMyFrontEnd(relname, listenerPID);
  		}
  		else
  		{
- 			if (Trace_notify)
- 				elog(DEBUG1, "AtCommit_Notify: notifying pid %d",
- 					 listenerPID);
- 
  			/*
! 			 * If someone has already notified this listener, we don't bother
! 			 * modifying the table, but we do still send a NOTIFY_INTERRUPT
! 			 * signal, just in case that backend missed the earlier signal for
! 			 * some reason.  It's OK to send the signal first, because the
! 			 * other guy can't read pg_listener until we unlock it.
! 			 *
! 			 * Note: we don't have the other guy's BackendId available, so
! 			 * this will incur a search of the ProcSignal table.  That's
! 			 * probably not worth worrying about.
  			 */
! 			if (SendProcSignal(listenerPID, PROCSIG_NOTIFY_INTERRUPT,
! 							   InvalidBackendId) < 0)
  			{
! 				/*
! 				 * Get rid of pg_listener entry if it refers to a PID that no
! 				 * longer exists.  Presumably, that backend crashed without
! 				 * deleting its pg_listener entries. This code used to only
! 				 * delete the entry if errno==ESRCH, but as far as I can see
! 				 * we should just do it for any failure (certainly at least
! 				 * for EPERM too...)
! 				 */
! 				simple_heap_delete(lRel, &lTuple->t_self);
  			}
! 			else if (listener->notification == 0)
  			{
! 				/* Rewrite the tuple with my PID in notification column */
! 				rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 				simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 				CatalogUpdateIndexes(lRel, rTuple);
! #endif
  			}
  		}
  	}
  
! 	heap_endscan(scan);
  }
  
  /*
   * AtAbort_Notify
   *
!  *		This is called at transaction abort.
   *
!  *		Gets rid of pending actions and outbound notifies that we would have
!  *		executed if the transaction got committed.
   */
  void
  AtAbort_Notify(void)
  {
  	ClearPendingActionsAndNotifies();
  }
  
--- 887,1274 ----
  }
  
  /*
!  * Exec_UnlistenAllAfterCommit --- subroutine for AtCommit_NotifyAfterCommit
   *
!  *		Unlisten on all channels for this backend.
   */
  static void
! Exec_UnlistenAllAfterCommit(void)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAllAfterCommit(%d)", MyProcPid);
! 
! 	list_free_deep(listenChannels);
! 	listenChannels = NIL;
! 
! 	asyncQueueUnregister();
! }
! 
! static void
! asyncQueueUnregister(void)
! {
! 	bool	  advanceTail = false;
! 
! 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 	QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
! 	/*
! 	 * If we have been the last backend, advance the tail pointer.
! 	 */
! 	if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
! 		advanceTail = true;
! 	LWLockRelease(AsyncQueueLock);
! 
! 	if (advanceTail)
! 		asyncQueueAdvanceTail();
! }
! 
! static bool
! asyncQueueIsFull(void)
! {
! 	QueuePosition	lookahead = QUEUE_HEAD;
! 	Size			remain = QUEUE_PAGESIZE - QUEUE_POS_OFFSET(lookahead) - 1;
! 	Size			advance = Min(remain, NOTIFY_PAYLOAD_MAX_LENGTH);
! 
! 	/*
! 	 * Check what happens if we wrote a maximally sized entry. Would we go to a
! 	 * new page? If not, then our queue cannot be full (because we can still
! 	 * fill at least the current page with at least one more entry).
! 	 */
! 	if (!asyncQueueAdvance(&lookahead, advance))
! 		return false;
  
! 	/*
! 	 * The queue is full if with a switch to a new page we reach the page
! 	 * of the tail pointer.
! 	 */
! 	return QUEUE_POS_PAGE(lookahead) == QUEUE_POS_PAGE(QUEUE_TAIL);
! }
  
! /*
!  * The function advances the position to the next entry. In case we jump to
!  * a new page the function returns true, else false.
!  */
! static bool
! asyncQueueAdvance(QueuePosition *position, int entryLength)
! {
! 	int		pageno = QUEUE_POS_PAGE(*position);
! 	int		offset = QUEUE_POS_OFFSET(*position);
! 	bool	pageJump = false;
! 
! 	/*
! 	 * Move to the next writing position: First jump over what we have just
! 	 * written or read.
! 	 */
! 	offset += entryLength;
! 	Assert(offset < QUEUE_PAGESIZE);
  
! 	/*
! 	 * In a second step, check whether another entry still fits on the page.
! 	 * If so, stay here: we have reached the next position. If not, we need
! 	 * to move on to the next page.
! 	 */
! 	if (offset + AsyncQueueEntryEmptySize >= QUEUE_PAGESIZE)
! 	{
! 		pageno++;
! 		if (pageno > QUEUE_MAX_PAGE)
! 			/* wrap around */
! 			pageno = 0;
! 		offset = 0;
! 		pageJump = true;
! 	}
! 
! 	SET_QUEUE_POS(*position, pageno, offset);
! 	return pageJump;
! }
! 
! static void
! asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
! {
! 	Assert(n->channel != NULL);
! 	Assert(n->payload != NULL);
! 	Assert(strlen(n->payload) < NOTIFY_PAYLOAD_MAX_LENGTH);
! 
! 	/* The terminator is already included in AsyncQueueEntryEmptySize */
! 	qe->length = AsyncQueueEntryEmptySize + strlen(n->payload);
! 	qe->srcPid = MyProcPid;
! 	qe->dboid = MyDatabaseId;
! 	qe->xid = GetCurrentTransactionId();
! 	strcpy(qe->channel, n->channel);
! 	strcpy(qe->payload, n->payload);
! }
! 
! static void
! asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n)
! {
! 	n->channel = pstrdup(qe->channel);
! 	n->payload = pstrdup(qe->payload);
! 	n->srcPid = qe->srcPid;
! 	n->xid = qe->xid;
  }
  
  /*
!  * Add the notifications to the queue: we go page by page here, i.e. we stop
!  * once we have to go to a new page but we will be called again and then fill
!  * that next page. If an entry no longer fits on the page, we write a dummy
!  * entry with InvalidOid as the database oid in order to fill the page. So
!  * every page is always used up to the last byte which simplifies reading the
!  * page later.
   *
!  * We are holding AsyncQueueLock already from the caller and grab AsyncCtlLock
!  * here in this function.
!  *
!  * We are passed the list of notifications to write and return the
!  * not-yet-written notifications back. Eventually we will return NIL.
   */
! static List *
! asyncQueueAddEntries(List *notifications)
  {
! 	AsyncQueueEntry	qe;
! 	int				pageno;
! 	int				offset;
! 	int				slotno;
  
! 	/*
! 	 * Note that we are holding exclusive AsyncQueueLock already.
! 	 */
! 	LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
! 	pageno = QUEUE_POS_PAGE(QUEUE_HEAD);
! 	slotno = SimpleLruReadPage(AsyncCtl, pageno, true, InvalidTransactionId);
! 	AsyncCtl->shared->page_dirty[slotno] = true;
  
! 	do
! 	{
! 		Notification   *n;
! 
! 		if (asyncQueueIsFull())
  		{
! 			/* document that we will not go into the if-block further down */
! 			Assert(QUEUE_POS_OFFSET(QUEUE_HEAD) != 0);
! 			break;
! 		}
! 
! 		n = (Notification *) linitial(notifications);
! 
! 		asyncQueueNotificationToEntry(n, &qe);
  
! 		offset = QUEUE_POS_OFFSET(QUEUE_HEAD);
! 		/*
! 		 * Check whether or not the entry still fits on the current page.
! 		 */
! 		if (offset + qe.length < QUEUE_PAGESIZE)
! 		{
! 			notifications = list_delete_first(notifications);
  		}
  		else
  		{
  			/*
! 			 * Write a dummy entry to fill up the page. Actually readers will
! 			 * only check dboid and since it won't match any reader's database
! 			 * oid, they will ignore this entry and move on.
  			 */
! 			qe.length = QUEUE_PAGESIZE - offset - 1;
! 			qe.dboid = InvalidOid;
! 			qe.channel[0] = '\0';
! 			qe.payload[0] = '\0';
! 			qe.xid = InvalidTransactionId;
! 		}
! 		memcpy((char*) AsyncCtl->shared->page_buffer[slotno] + offset,
! 			   &qe, qe.length);
! 
! 	} while (!asyncQueueAdvance(&(QUEUE_HEAD), qe.length)
! 			 && notifications != NIL);
! 
! 	if (QUEUE_POS_OFFSET(QUEUE_HEAD) == 0)
! 	{
! 		/*
! 		 * We need to continue on a new page; stop here, but prepare that
! 		 * page already.
! 		 */
! 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
! 		AsyncCtl->shared->page_dirty[slotno] = true;
! 	}
! 	LWLockRelease(AsyncCtlLock);
! 
! 	return notifications;
! }
! 
! static void
! asyncQueueFillWarning(void)
! {
! 	/*
! 	 * Caller must hold exclusive AsyncQueueLock.
! 	 */
! 	TimestampTz		t;
! 	double			fillDegree;
! 	int				occupied;
! 	int				tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
! 	int				headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
! 
! 	occupied = headPage - tailPage;
! 
! 	if (occupied == 0)
! 		return;
! 	
! 	if (!asyncQueuePagePrecedesPhysically(tailPage, headPage))
! 		/* head has wrapped around, tail not yet */
! 		occupied += QUEUE_MAX_PAGE;
! 
! 	fillDegree = (float) occupied / (float) QUEUE_MAX_PAGE;
! 
! 	if (fillDegree < 0.5)
! 		return;
! 
! 	t = GetCurrentTimestamp();
! 
! 	if (TimestampDifferenceExceeds(asyncQueueControl->lastQueueFillWarn,
! 								   t, QUEUE_FULL_WARN_INTERVAL))
! 	{
! 		QueuePosition	min = QUEUE_HEAD;
! 		int32			minPid = InvalidPid;
! 		int				i;
! 
! 		for (i = 0; i < MaxBackends; i++)
! 			if (QUEUE_BACKEND_PID(i) != InvalidPid)
  			{
! 				min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
! 				if (QUEUE_POS_EQUAL(min, QUEUE_BACKEND_POS(i)))
! 					minPid = QUEUE_BACKEND_PID(i);
  			}
! 
! 		if (fillDegree < 0.75)
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 50%% full. "
! 								 "Among the slowest backends: %d", minPid)));
! 		else
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 75%% full. "
! 								 "Among the slowest backends: %d", minPid)));
! 
! 		asyncQueueControl->lastQueueFillWarn = t;
! 	}
! }
! 
! /*
!  * Send_Notify --- subroutine for AtCommit_NotifyBeforeCommit
!  *
!  * Add the pending notifications to the queue.
!  *
!  * A full queue is very uncommon and should really not happen, given that we
!  * have so much space available in the slru pages. Nevertheless we need to
!  * deal with this possibility. Note that when we get here we are in the process
!  * of committing our transaction, we have not yet committed to clog but this
!  * would be the next step. So at this point in time we can still roll the
!  * transaction back.
!  */
! static void
! Send_Notify(void)
! {
! 	backendSendsNotifications = true;
! 
! 	while (pendingNotifies != NIL)
! 	{
! 		LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 		asyncQueueFillWarning();
! 		if (asyncQueueIsFull())
! 			ereport(ERROR,
! 					(errcode(ERRCODE_TOO_MANY_ENTRIES),
! 					errmsg("too many notifications in the queue")));
! 		pendingNotifies = asyncQueueAddEntries(pendingNotifies);
! 		LWLockRelease(AsyncQueueLock);
! 	}
! }
! 
! /*
!  * Send signals to all listening backends. Since we hold the lock in EXCLUSIVE
!  * mode anyway, we also check the position of each backend and do not signal
!  * any that is already up to date. This can happen if concurrent notifying
!  * transactions have sent a signal and the signaled backend has read the other
!  * notifications and ours in the same step.
!  *
!  * Since we know the BackendId and the Pid the signalling is quite cheap.
!  */
! static void
! SignalBackends(void)
! {
! 	QueuePosition	pos;
! 	ListCell	   *p1, *p2;
! 	int				i;
! 	int32			pid;
! 	List		   *pids = NIL;
! 	List		   *ids = NIL;
! 	int				count = 0;
! 
! 	/* Signal everybody who is LISTENing to any channel. */
! 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 	for (i = 0; i < MaxBackends; i++)
! 	{
! 		pid = QUEUE_BACKEND_PID(i);
! 		if (pid != InvalidPid)
! 		{
! 			count++;
! 			pos = QUEUE_BACKEND_POS(i);
! 			if (!QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
  			{
! 				pids = lappend_int(pids, pid);
! 				ids = lappend_int(ids, i);
  			}
  		}
  	}
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	forboth(p1, pids, p2, ids)
+ 	{
+ 		pid = (int32) lfirst_int(p1);
+ 		i = lfirst_int(p2);
+ 		/*
+ 		 * Should we check for failure? Can it happen that a backend
+ 		 * has crashed without the postmaster starting over?
+ 		 */
+ 		if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, i) < 0)
+ 			elog(WARNING, "could not signal backend with PID %d", pid);
+ 	}
  
! 	if (count == 0)
! 	{
! 		/* No backend is listening at all, so try to clean up the queue.
! 		 * Even if a backend has started to listen since we found count
! 		 * to be 0, advancing the tail does not hurt: our notifications
! 		 * are committed already and a newly listening backend would
! 		 * skip over them anyway. */
! 		asyncQueueAdvanceTail();
! 	}
  }
  
  /*
   * AtAbort_Notify
   *
!  *	This is called at transaction abort.
   *
!  *	Gets rid of pending actions and outbound notifies that we would have
!  *	executed if the transaction got committed.
!  *
!  *	Even though we have not committed, we need to signal the listening backends
!  *	because our notifications might block readers from processing the queue.
!  *	Now that the transaction has aborted, they can go on and skip over our
!  *	notifications. They could find notifications past ours that they need to
!  *	deliver.
   */
  void
  AtAbort_Notify(void)
  {
+ 	if (backendSendsNotifications)
+ 		SignalBackends();
+ 
+ 	/*
+ 	 * If we LISTEN but then roll back the transaction we have set our pointer
+ 	 * but have not made the entry in listenChannels. In that case, remove
+ 	 * our pointer again.
+ 	 */
+ 	if (backendExecutesInitialListen)
+ 		/*
+ 		 * Checking listenChannels should be redundant but it can't hurt doing
+ 		 * it for safety reasons.
+ 		 */
+ 		if (listenChannels == NIL)
+ 			asyncQueueUnregister();
+ 
  	ClearPendingActionsAndNotifies();
  }
  
***************
*** 940,968 ****
  }
  
  /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan pg_listener for arriving notifies, report them to my front end,
!  *		and clear the notification field in pg_listener until next time.
   *
!  *		NOTE: since we are outside any transaction, we must create our own.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	Relation	lRel;
! 	TupleDesc	tdesc;
! 	ScanKeyData key[1];
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 	bool		catchup_enabled;
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
--- 1524,1762 ----
  }
  
  /*
+  * This function will ask for a page with ReadOnly access and once we have the
+  * lock, we read the whole content and pass back the list of notifications
+  * that the calling function will deliver then. The list will contain all
+  * notifications from transactions that have already committed.
+  *
+  * We stop once we reach the stop position or move on to a new page.
+  *
+  * The function returns true once we have reached the end or a notification of
+  * a transaction that is still running and false if we have finished with
+  * the page. In other words: once it returns true there is no point in calling
+  * it again.
+  */
+ static bool
+ asyncQueueGetEntriesByPage(QueuePosition *current,
+ 						   QueuePosition stop,
+ 						   List **notifications)
+ {
+ 	AsyncQueueEntry	qe;
+ 	Notification   *n;
+ 	int				slotno;
+ 	bool			reachedStop = false;
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		return true;
+ 
+ 	slotno = SimpleLruReadPage_ReadOnly(AsyncCtl, current->page,
+ 										InvalidTransactionId);
+ 	do {
+ 		char *readPtr = (char *) (AsyncCtl->shared->page_buffer[slotno]);
+ 
+ 		if (QUEUE_POS_EQUAL(*current, stop))
+ 		{
+ 			reachedStop = true;
+ 			break;
+ 		}
+ 
+ 		readPtr += current->offset;
+ 		/* at first we only read the header of the notification */
+ 		memcpy(&qe, readPtr, AsyncQueueEntryEmptySize);
+ 
+ 		if (qe.dboid == MyDatabaseId)
+ 		{
+ 			if (TransactionIdDidCommit(qe.xid))
+ 			{
+ 				if (IsListeningOn(qe.channel))
+ 				{
+ 					if (qe.length > AsyncQueueEntryEmptySize)
+ 					{
+ 						/* now we know that we are interested in the
+ 						 * notification and read it completely. */
+ 						memcpy(&qe, readPtr, qe.length);
+ 					}
+ 					n = (Notification *) palloc(sizeof(Notification));
+ 					asyncQueueEntryToNotification(&qe, n);
+ 					*notifications = lappend(*notifications, n);
+ 				}
+ 			}
+ 			else
+ 			{
+ 				if (!TransactionIdDidAbort(qe.xid))
+ 				{
+ 					/*
+ 					 * The transaction has neither committed nor aborted so
+ 					 * far.
+ 					 */
+ 					reachedStop = true;
+ 					break;
+ 				}
+ 				/*
+ 				 * Here we know that the transaction has aborted, we just
+ 				 * ignore its notifications.
+ 				 */
+ 			}
+ 		}
+ 		/*
+ 		 * The call to asyncQueueAdvance just jumps over what we have
+ 		 * just read. If there is no more space for the next record on the
+ 		 * current page, it will also switch to the beginning of the next page.
+ 		 */
+ 	} while(!asyncQueueAdvance(current, qe.length));
+ 
+ 	/*
+ 	 * Release the lock that we implicitly got from
+ 	 * SimpleLruReadPage_ReadOnly().
+ 	 */
+ 	LWLockRelease(AsyncCtlLock);
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		reachedStop = true;
+ 
+ 	return reachedStop;
+ }
+ 
+ 
+ static void
+ asyncQueueReadAllNotifications(void)
+ {
+ 	QueuePosition	pos;
+ 	QueuePosition	oldpos;
+ 	QueuePosition	head;
+ 	List		   *notifications;
+ 	ListCell	   *lc;
+ 	Notification   *n;
+ 	bool			advanceTail = false;
+ 	bool			reachedStop;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	pos = oldpos = QUEUE_BACKEND_POS(MyBackendId);
+ 	head = QUEUE_HEAD;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (QUEUE_POS_EQUAL(pos, head))
+ 	{
+ 		/* Nothing to do, we have read all notifications already. */
+ 		return;
+ 	}
+ 
+ 	do 
+ 	{
+ 		/*
+ 		 * Our stop position is what we found to be the head's position when
+ 		 * we entered this function. It might have changed already. But if it
+ 		 * has, we will receive (or have already received and queued) another
+ 		 * signal and come here again.
+ 		 *
+ 		 * We are not holding AsyncQueueLock here! The queue can only extend
+ 		 * beyond the head pointer (see above) and we leave our backend's
+ 		 * pointer where it is so nobody will truncate or rewrite pages under
+ 		 * us. In particular, we don't want to hold a lock while sending the
+ 		 * notifications to the frontend.
+ 		 */
+ 		reachedStop = false;
+ 
+ 		notifications = NIL;
+ 		reachedStop = asyncQueueGetEntriesByPage(&pos, head, &notifications);
+ 
+ 		/*
+ 		 * Note that we deliver everything that we see in the queue and that
+ 		 * matches our _current_ listening state.
+ 		 * In particular, we do not take different commit times into account.
+ 		 *
+ 		 * See the following example:
+ 		 *
+ 		 * Backend 1:                    Backend 2:
+ 		 *
+ 		 * transaction starts
+ 		 * NOTIFY foo;
+ 		 * commit starts
+ 		 *                               transaction starts
+ 		 *                               LISTEN foo;
+ 		 *                               commit starts
+ 		 * commit to clog
+ 		 *                               commit to clog
+ 		 *
+ 		 * It could happen that backend 2 sees the notification from
+ 		 * backend 1 in the queue and even though the notifying transaction
+ 		 * committed before the listening transaction, we still deliver the
+ 		 * notification.
+ 		 *
+ 		 * The idea is that an additional notification does not do any
+ 		 * harm; we just need to make sure that we do not miss a
+ 		 * notification.
+ 		 */
+ 		foreach(lc, notifications)
+ 		{
+ 			n = (Notification *) lfirst(lc);
+ 			NotifyMyFrontEnd(n->channel, n->payload, n->srcPid);
+ 		}
+ 	} while (!reachedStop);
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	QUEUE_BACKEND_POS(MyBackendId) = pos;
+ 	if (QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL))
+ 		advanceTail = true;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (advanceTail)
+ 		/* Move forward the tail pointer and try to truncate. */
+ 		asyncQueueAdvanceTail();
+ }
+ 
+ static void
+ asyncQueueAdvanceTail(void)
+ {
+ 	QueuePosition	min;
+ 	int				i;
+ 	int				tailp;
+ 	int				headp;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ 	min = QUEUE_HEAD;
+ 	for (i = 0; i < MaxBackends; i++)
+ 		if (QUEUE_BACKEND_PID(i) != InvalidPid)
+ 			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
+ 
+ 	tailp = QUEUE_POS_PAGE(QUEUE_TAIL);
+ 	headp = QUEUE_POS_PAGE(QUEUE_HEAD);
+ 	QUEUE_TAIL = min;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	/* This is our wraparound check */
+ 	if ((asyncQueuePagePrecedesLogically(tailp, QUEUE_POS_PAGE(min), headp)
+ 			&& asyncQueuePagePrecedesPhysically(tailp, headp))
+ 		|| tailp == QUEUE_POS_PAGE(min))
+ 	{
+ 		/*
+ 		 * SimpleLruTruncate() will ask for AsyncCtlLock but will also
+ 		 * release the lock again.
+ 		 *
+ 		 * XXX this could be optimized, to call SimpleLruTruncate only when we
+ 		 * know that we can truncate something.
+ 		 */
+ 		SimpleLruTruncate(AsyncCtl, QUEUE_POS_PAGE(min));
+ 	}
+ }
+ 
+ /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan the queue for arriving notifications and report them to my front
!  *		end.
   *
!  *		NOTE: we are outside of any transaction here.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	bool			catchup_enabled;
! 
! 	Assert(GetCurrentTransactionIdIfAny() == InvalidTransactionId);
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
***************
*** 974,1037 ****
  
  	notifyInterruptOccurred = 0;
  
! 	StartTransactionCommand();
! 
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
! 	tdesc = RelationGetDescr(lRel);
! 
! 	/* Scan only entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
! 
! 	/* Prepare data for rewriting 0 into notification field */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(0);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		sourcePID = listener->notification;
! 
! 		if (sourcePID != 0)
! 		{
! 			/* Notify the frontend */
! 
! 			if (Trace_notify)
! 				elog(DEBUG1, "ProcessIncomingNotify: received %s from %d",
! 					 relname, (int) sourcePID);
! 
! 			NotifyMyFrontEnd(relname, sourcePID);
! 
! 			/*
! 			 * Rewrite the tuple with 0 in notification column.
! 			 */
! 			rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 			simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 			CatalogUpdateIndexes(lRel, rTuple);
! #endif
! 		}
! 	}
! 	heap_endscan(scan);
! 
! 	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that other backends see our tuple updates when they look. Otherwise, a
! 	 * transaction started after this one might mistakenly think it doesn't
! 	 * need to send this backend a new NOTIFY.
! 	 */
! 	heap_close(lRel, NoLock);
! 
! 	CommitTransactionCommand();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
--- 1768,1774 ----
  
  	notifyInterruptOccurred = 0;
  
! 	asyncQueueReadAllNotifications();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
***************
*** 1051,1070 ****
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(char *relname, int32 listenerPID)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, listenerPID, sizeof(int32));
! 		pq_sendstring(&buf, relname);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 		{
! 			/* XXX Add parameter string here later */
! 			pq_sendstring(&buf, "");
! 		}
  		pq_endmessage(&buf);
  
  		/*
--- 1788,1804 ----
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(const char *channel, const char *payload, int32 srcPid)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, srcPid, sizeof(int32));
! 		pq_sendstring(&buf, channel);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 			pq_sendstring(&buf, payload);
  		pq_endmessage(&buf);
  
  		/*
***************
*** 1074,1096 ****
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", relname);
  }
  
! /* Does pendingNotifies include the given relname? */
  static bool
! AsyncExistsPendingNotify(const char *relname)
  {
  	ListCell   *p;
  
! 	foreach(p, pendingNotifies)
! 	{
! 		const char *prelname = (const char *) lfirst(p);
  
! 		if (strcmp(prelname, relname) == 0)
  			return true;
  	}
  
  	return false;
  }
  
--- 1808,1864 ----
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", channel);
  }
  
! /* Does pendingNotifies include the given channel/payload? */
  static bool
! AsyncExistsPendingNotify(const char *channel, const char *payload)
  {
  	ListCell   *p;
+ 	Notification *n;
  
! 	if (pendingNotifies == NIL)
! 		return false;
! 
! 	if (payload == NULL)
! 		payload = "";
! 
! 	/*
! 	 * We need to append new elements to the end of the list in order to keep
! 	 * the order. On the other hand, we'd like to check the list backwards to
! 	 * make duplicate elimination a tad faster when the same condition is
! 	 * signaled many times in a row. As a compromise, we check the tail
! 	 * element first, which we can access directly. If it doesn't match, we
! 	 * check the rest of the list.
! 	 */
  
! 	n = (Notification *) llast(pendingNotifies);
! 	if (strcmp(n->channel, channel) == 0)
! 	{
! 		Assert(n->payload != NULL);
! 		if (strcmp(n->payload, payload) == 0)
  			return true;
  	}
  
+ 	/*
+ 	 * Note the difference from foreach(): we stop when p reaches the last
+ 	 * element, which we have already checked above.
+ 	 */
+ 	for(p = list_head(pendingNotifies);
+ 		p != list_tail(pendingNotifies);
+ 		p = lnext(p))
+ 	{
+ 		n = (Notification *) lfirst(p);
+ 
+ 		if (strcmp(n->channel, channel) == 0)
+ 		{
+ 			Assert(n->payload != NULL);
+ 			if (strcmp(n->payload, payload) == 0)
+ 				return true;
+ 		}
+ 	}
+ 
  	return false;
  }
  
***************
*** 1107,1112 ****
--- 1875,1883 ----
  	 */
  	pendingActions = NIL;
  	pendingNotifies = NIL;
+ 
+ 	backendSendsNotifications = false;
+ 	backendExecutesInitialListen = false;
  }
  
  /*
***************
*** 1124,1128 ****
  	 * there is any significant delay before I commit.	OK for now because we
  	 * disallow COMMIT PREPARED inside a transaction block.)
  	 */
! 	Async_Notify((char *) recdata);
  }
--- 1895,1905 ----
  	 * there is any significant delay before I commit.	OK for now because we
  	 * disallow COMMIT PREPARED inside a transaction block.)
  	 */
! 	AsyncQueueEntry		*qe = (AsyncQueueEntry *) recdata;
! 
! 	Assert(qe->dboid == MyDatabaseId);
! 	Assert(qe->length == len);
! 
! 	Async_Notify(qe->channel, qe->payload);
  }
+ 
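
The fill-degree check in asyncQueueFillWarning() in the async.c diff above can
be looked at in isolation. What follows is only a minimal sketch, not code from
the patch: SKETCH_MAX_PAGE is a made-up stand-in for QUEUE_MAX_PAGE (whose real
value is not shown in this part of the patch), the sketch_* names are
hypothetical, and it assumes page numbers run from 0 to SKETCH_MAX_PAGE - 1 and
then wrap to 0.

```c
#include <assert.h>

/* Hypothetical stand-in for QUEUE_MAX_PAGE; not the patch's real value. */
#define SKETCH_MAX_PAGE 1000

/*
 * Number of pages between tail and head in a circular page space.
 * Assumes both page numbers lie in [0, SKETCH_MAX_PAGE).
 */
int
sketch_occupied_pages(int tail_page, int head_page)
{
	int			occupied;

	assert(tail_page >= 0 && tail_page < SKETCH_MAX_PAGE);
	assert(head_page >= 0 && head_page < SKETCH_MAX_PAGE);

	occupied = head_page - tail_page;
	if (occupied < 0)
		occupied += SKETCH_MAX_PAGE;	/* head has wrapped around, tail not yet */

	return occupied;
}

/*
 * Warning threshold that applies to the current fill degree:
 * 0 (below 50%), 50 (at least 50%, below 75%) or 75 (at least 75%).
 */
int
sketch_warn_level(int tail_page, int head_page)
{
	double		fill;

	fill = (double) sketch_occupied_pages(tail_page, head_page)
		/ (double) SKETCH_MAX_PAGE;

	if (fill < 0.5)
		return 0;
	return (fill < 0.75) ? 50 : 75;
}
```

With these assumptions, a tail on page 900 and a head on page 100 gives 200
occupied pages, so no warning would be due. The rate limiting via
lastQueueFillWarn and the search for the slowest backend are left out on
purpose.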
diff -cr cvs/src/backend/nodes/copyfuncs.c cvs.build/src/backend/nodes/copyfuncs.c
*** cvs/src/backend/nodes/copyfuncs.c	2010-01-30 22:08:50.000000000 +0100
--- cvs.build/src/backend/nodes/copyfuncs.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 2771,2776 ****
--- 2771,2777 ----
  	NotifyStmt *newnode = makeNode(NotifyStmt);
  
  	COPY_STRING_FIELD(conditionname);
+ 	COPY_STRING_FIELD(payload);
  
  	return newnode;
  }
diff -cr cvs/src/backend/nodes/equalfuncs.c cvs.build/src/backend/nodes/equalfuncs.c
*** cvs/src/backend/nodes/equalfuncs.c	2010-01-30 22:08:50.000000000 +0100
--- cvs.build/src/backend/nodes/equalfuncs.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 1325,1330 ****
--- 1325,1331 ----
  _equalNotifyStmt(NotifyStmt *a, NotifyStmt *b)
  {
  	COMPARE_STRING_FIELD(conditionname);
+ 	COMPARE_STRING_FIELD(payload);
  
  	return true;
  }
diff -cr cvs/src/backend/nodes/outfuncs.c cvs.build/src/backend/nodes/outfuncs.c
*** cvs/src/backend/nodes/outfuncs.c	2010-01-30 22:08:50.000000000 +0100
--- cvs.build/src/backend/nodes/outfuncs.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 1818,1823 ****
--- 1818,1824 ----
  	WRITE_NODE_TYPE("NOTIFY");
  
  	WRITE_STRING_FIELD(conditionname);
+ 	WRITE_STRING_FIELD(payload);
  }
  
  static void
diff -cr cvs/src/backend/nodes/readfuncs.c cvs.build/src/backend/nodes/readfuncs.c
*** cvs/src/backend/nodes/readfuncs.c	2010-01-30 22:08:50.000000000 +0100
--- cvs.build/src/backend/nodes/readfuncs.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 231,236 ****
--- 231,237 ----
  	READ_LOCALS(NotifyStmt);
  
  	READ_STRING_FIELD(conditionname);
+ 	READ_STRING_FIELD(payload);
  
  	READ_DONE();
  }
diff -cr cvs/src/backend/parser/gram.y cvs.build/src/backend/parser/gram.y
*** cvs/src/backend/parser/gram.y	2010-01-30 22:09:01.000000000 +0100
--- cvs.build/src/backend/parser/gram.y	2010-02-07 12:57:49.000000000 +0100
***************
*** 400,406 ****
  
  %type <ival>	Iconst SignedIconst
  %type <list>	Iconst_list
! %type <str>		Sconst comment_text
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
--- 400,406 ----
  
  %type <ival>	Iconst SignedIconst
  %type <list>	Iconst_list
! %type <str>		Sconst comment_text notify_payload
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
***************
*** 6113,6122 ****
   *
   *****************************************************************************/
  
! NotifyStmt: NOTIFY ColId
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
  					$$ = (Node *)n;
  				}
  		;
--- 6113,6128 ----
   *
   *****************************************************************************/
  
! notify_payload:
! 			Sconst								{ $$ = $1; }
! 			| /*EMPTY*/							{ $$ = NULL; }
! 		;
! 
! NotifyStmt: NOTIFY ColId notify_payload
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
+ 					n->payload = $3;
  					$$ = (Node *)n;
  				}
  		;
diff -cr cvs/src/backend/storage/ipc/ipci.c cvs.build/src/backend/storage/ipc/ipci.c
*** cvs/src/backend/storage/ipc/ipci.c	2010-01-30 22:09:08.000000000 +0100
--- cvs.build/src/backend/storage/ipc/ipci.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 20,25 ****
--- 20,26 ----
  #include "access/nbtree.h"
  #include "access/subtrans.h"
  #include "access/twophase.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "postmaster/autovacuum.h"
***************
*** 225,230 ****
--- 226,232 ----
  	 */
  	BTreeShmemInit();
  	SyncScanShmemInit();
+ 	AsyncShmemInit();
  
  #ifdef EXEC_BACKEND
  
diff -cr cvs/src/backend/storage/lmgr/lwlock.c cvs.build/src/backend/storage/lmgr/lwlock.c
*** cvs/src/backend/storage/lmgr/lwlock.c	2010-01-30 22:09:08.000000000 +0100
--- cvs.build/src/backend/storage/lmgr/lwlock.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 24,29 ****
--- 24,30 ----
  #include "access/clog.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "storage/ipc.h"
***************
*** 174,179 ****
--- 175,183 ----
  	/* multixact.c needs two SLRU areas */
  	numLocks += NUM_MXACTOFFSET_BUFFERS + NUM_MXACTMEMBER_BUFFERS;
  
+ 	/* async.c needs one per page buffer for the AsyncQueue */
+ 	numLocks += NUM_ASYNC_BUFFERS;
+ 
  	/*
  	 * Add any requested by loadable modules; for backwards-compatibility
  	 * reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
diff -cr cvs/src/backend/tcop/utility.c cvs.build/src/backend/tcop/utility.c
*** cvs/src/backend/tcop/utility.c	2010-01-30 22:09:01.000000000 +0100
--- cvs.build/src/backend/tcop/utility.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 930,936 ****
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
  				PreventCommandDuringRecovery();
  
! 				Async_Notify(stmt->conditionname);
  			}
  			break;
  
--- 930,936 ----
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
  				PreventCommandDuringRecovery();
  
! 				Async_Notify(stmt->conditionname, stmt->payload);
  			}
  			break;
  
diff -cr cvs/src/backend/utils/adt/misc.c cvs.build/src/backend/utils/adt/misc.c
*** cvs/src/backend/utils/adt/misc.c	2010-01-30 22:08:55.000000000 +0100
--- cvs.build/src/backend/utils/adt/misc.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 386,388 ****
--- 386,389 ----
  {
  	PG_RETURN_OID(get_fn_expr_argtype(fcinfo->flinfo, 0));
  }
+ 
diff -cr cvs/src/bin/initdb/initdb.c cvs.build/src/bin/initdb/initdb.c
*** cvs/src/bin/initdb/initdb.c	2010-01-30 22:08:40.000000000 +0100
--- cvs.build/src/bin/initdb/initdb.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 2458,2463 ****
--- 2458,2464 ----
  		"pg_xlog",
  		"pg_xlog/archive_status",
  		"pg_clog",
+ 		"pg_notify",
  		"pg_subtrans",
  		"pg_twophase",
  		"pg_multixact/members",
diff -cr cvs/src/bin/psql/common.c cvs.build/src/bin/psql/common.c
*** cvs/src/bin/psql/common.c	2010-01-30 22:08:41.000000000 +0100
--- cvs.build/src/bin/psql/common.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 555,562 ****
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" received from server process with PID %d.\n"),
! 				notify->relname, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
--- 555,562 ----
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" (%s) received from server process with PID %d.\n"),
! 				notify->relname, notify->extra, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
diff -cr cvs/src/bin/psql/tab-complete.c cvs.build/src/bin/psql/tab-complete.c
*** cvs/src/bin/psql/tab-complete.c	2010-01-30 22:08:40.000000000 +0100
--- cvs.build/src/bin/psql/tab-complete.c	2010-02-07 12:57:49.000000000 +0100
***************
*** 2093,2099 ****
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(relname) FROM pg_catalog.pg_listener WHERE substring(pg_catalog.quote_ident(relname),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
--- 2093,2099 ----
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(channel) FROM pg_catalog.pg_listening() AS channel WHERE substring(pg_catalog.quote_ident(channel),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
diff -cr cvs/src/include/access/slru.h cvs.build/src/include/access/slru.h
*** cvs/src/include/access/slru.h	2010-01-30 22:09:13.000000000 +0100
--- cvs.build/src/include/access/slru.h	2010-02-07 12:57:49.000000000 +0100
***************
*** 16,21 ****
--- 16,40 ----
  #include "access/xlogdefs.h"
  #include "storage/lwlock.h"
  
+ /*
+  * Define segment size.  A page is the same BLCKSZ as is used everywhere
+  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
+  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
+  * or 64K transactions for SUBTRANS.
+  *
+  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
+  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
+  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+  * take no explicit notice of that fact in this module, except when comparing
+  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
+  *
+  * Note: this file currently assumes that segment file names will be four
+  * hex digits.	This sets a lower bound on the segment size (64K transactions
+  * for 32-bit TransactionIds).
+  */
+ #define SLRU_PAGES_PER_SEGMENT	32
+ 
  
  /*
   * Page status codes.  Note that these do not include the "dirty" bit.
diff -cr cvs/src/include/catalog/pg_proc.h cvs.build/src/include/catalog/pg_proc.h
*** cvs/src/include/catalog/pg_proc.h	2010-01-30 22:09:10.000000000 +0100
--- cvs.build/src/include/catalog/pg_proc.h	2010-02-07 12:57:49.000000000 +0100
***************
*** 4113,4118 ****
--- 4113,4122 ----
  DESCR("get the prepared statements for this session");
  DATA(insert OID = 2511 (  pg_cursor PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,16,16,16,1184}" "{o,o,o,o,o,o}" "{name,statement,is_holdable,is_binary,is_scrollable,creation_time}" _null_ pg_cursor _null_ _null_ _null_ ));
  DESCR("get the open cursors for this session");
+ DATA(insert OID = 3034 (  pg_listening	PGNSP	PGUID 12 1 10 0 f f f t t s 0 0 25 "" _null_ _null_ _null_ _null_ pg_listening _null_ _null_ _null_ ));
+ DESCR("get the channels that the current backend listens to");
+ DATA(insert OID = 3035 (  pg_notify  PGNSP PGUID 12 1 0 0 f f f f f v 2 0 2278 "25 25" _null_ _null_ _null_ _null_ pg_notify _null_ _null_ _null_));
+ DESCR("send a notification to clients");
  DATA(insert OID = 2599 (  pg_timezone_abbrevs	PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,1186,16}" "{o,o,o}" "{abbrev,utc_offset,is_dst}" _null_ pg_timezone_abbrevs _null_ _null_ _null_ ));
  DESCR("get the available time zone abbreviations");
  DATA(insert OID = 2856 (  pg_timezone_names		PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,1186,16}" "{o,o,o,o}" "{name,abbrev,utc_offset,is_dst}" _null_ pg_timezone_names _null_ _null_ _null_ ));
***************
*** 4760,4766 ****
  DATA(insert OID = 3114 (  nth_value		PGNSP PGUID 12 1 0 0 f t f t f i 2 0 2283 "2283 23" _null_ _null_ _null_ _null_ window_nth_value _null_ _null_ _null_ ));
  DESCR("fetch the Nth row value");
  
- 
  /*
   * Symbolic values for provolatile column: these indicate whether the result
   * of a function is dependent *only* on the values of its explicit arguments,
--- 4764,4769 ----
diff -cr cvs/src/include/commands/async.h cvs.build/src/include/commands/async.h
*** cvs/src/include/commands/async.h	2010-01-30 22:09:11.000000000 +0100
--- cvs.build/src/include/commands/async.h	2010-02-07 12:57:49.000000000 +0100
***************
*** 13,28 ****
  #ifndef ASYNC_H
  #define ASYNC_H
  
  extern bool Trace_notify;
  
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_Notify(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
--- 13,41 ----
  #ifndef ASYNC_H
  #define ASYNC_H
  
+ /*
+  * Maximum size of the payload, including the terminating NUL byte.
+  */
+ #define NOTIFY_PAYLOAD_MAX_LENGTH	8000
+ 
+ /*
+  * How many page slots do we reserve?
+  */
+ #define NUM_ASYNC_BUFFERS			4
+ 
  extern bool Trace_notify;
  
+ extern void AsyncShmemInit(void);
+ 
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname, const char *payload);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_NotifyBeforeCommit(void);
! extern void AtCommit_NotifyAfterCommit(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
***************
*** 43,46 ****
--- 56,62 ----
  extern void notify_twophase_postcommit(TransactionId xid, uint16 info,
  						   void *recdata, uint32 len);
  
+ extern Datum pg_listening(PG_FUNCTION_ARGS);
+ extern Datum pg_notify(PG_FUNCTION_ARGS);
+ 
  #endif   /* ASYNC_H */
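
Since NOTIFY_PAYLOAD_MAX_LENGTH in the async.h diff above counts the terminator,
the longest payload string that fits has one character less than the constant.
A minimal sketch of such a length check follows. sketch_payload_fits() is a
hypothetical helper and not part of the patch; treating a NULL payload as the
empty string mirrors what AsyncExistsPendingNotify() does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Same value as in the patch's async.h; the limit includes the terminator. */
#define NOTIFY_PAYLOAD_MAX_LENGTH	8000

/*
 * Hypothetical helper: true if the payload plus its terminating NUL byte
 * fits into NOTIFY_PAYLOAD_MAX_LENGTH bytes. NULL counts as empty.
 */
bool
sketch_payload_fits(const char *payload)
{
	if (payload == NULL)
		return true;

	return strlen(payload) < NOTIFY_PAYLOAD_MAX_LENGTH;
}
```

Under this reading, a payload of 7999 characters is accepted and one of 8000
characters is rejected.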
diff -cr cvs/src/include/nodes/parsenodes.h cvs.build/src/include/nodes/parsenodes.h
*** cvs/src/include/nodes/parsenodes.h	2010-01-30 22:09:11.000000000 +0100
--- cvs.build/src/include/nodes/parsenodes.h	2010-02-07 12:57:49.000000000 +0100
***************
*** 2084,2089 ****
--- 2084,2090 ----
  {
  	NodeTag		type;
  	char	   *conditionname;	/* condition name to notify */
+ 	char	   *payload;		/* the payload string to be conveyed */
  } NotifyStmt;
  
  /* ----------------------
diff -cr cvs/src/include/storage/lwlock.h cvs.build/src/include/storage/lwlock.h
*** cvs/src/include/storage/lwlock.h	2010-01-30 22:09:13.000000000 +0100
--- cvs.build/src/include/storage/lwlock.h	2010-02-07 12:57:49.000000000 +0100
***************
*** 67,72 ****
--- 67,74 ----
  	AutovacuumLock,
  	AutovacuumScheduleLock,
  	SyncScanLock,
+ 	AsyncCtlLock,
+ 	AsyncQueueLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff -cr cvs/src/include/utils/errcodes.h cvs.build/src/include/utils/errcodes.h
*** cvs/src/include/utils/errcodes.h	2010-01-30 22:09:12.000000000 +0100
--- cvs.build/src/include/utils/errcodes.h	2010-02-07 12:57:49.000000000 +0100
***************
*** 318,323 ****
--- 318,324 ----
  #define ERRCODE_STATEMENT_TOO_COMPLEX		MAKE_SQLSTATE('5','4', '0','0','1')
  #define ERRCODE_TOO_MANY_COLUMNS			MAKE_SQLSTATE('5','4', '0','1','1')
  #define ERRCODE_TOO_MANY_ARGUMENTS			MAKE_SQLSTATE('5','4', '0','2','3')
+ #define ERRCODE_TOO_MANY_ENTRIES			MAKE_SQLSTATE('5','4', '0','3','1')
  
  /* Class 55 - Object Not In Prerequisite State (class borrowed from DB2) */
  #define ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE	MAKE_SQLSTATE('5','5', '0','0','0')
diff -cr cvs/src/test/regress/expected/guc.out cvs.build/src/test/regress/expected/guc.out
*** cvs/src/test/regress/expected/guc.out	2010-01-30 22:08:39.000000000 +0100
--- cvs.build/src/test/regress/expected/guc.out	2010-02-07 12:57:49.000000000 +0100
***************
*** 532,540 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
!   relname  
! -----------
   foo_event
  (1 row)
  
--- 532,540 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
!  pg_listening 
! --------------
   foo_event
  (1 row)
  
***************
*** 571,579 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
!  relname 
! ---------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
--- 571,579 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
!  pg_listening 
! --------------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
diff -cr cvs/src/test/regress/expected/sanity_check.out cvs.build/src/test/regress/expected/sanity_check.out
*** cvs/src/test/regress/expected/sanity_check.out	2010-01-30 22:08:40.000000000 +0100
--- cvs.build/src/test/regress/expected/sanity_check.out	2010-02-07 12:57:49.000000000 +0100
***************
*** 107,113 ****
   pg_language             | t
   pg_largeobject          | t
   pg_largeobject_metadata | t
-  pg_listener             | f
   pg_namespace            | t
   pg_opclass              | t
   pg_operator             | t
--- 107,112 ----
***************
*** 154,160 ****
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (143 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
--- 153,159 ----
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (142 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
diff -cr cvs/src/test/regress/sql/guc.sql cvs.build/src/test/regress/sql/guc.sql
*** cvs/src/test/regress/sql/guc.sql	2010-01-30 22:08:38.000000000 +0100
--- cvs.build/src/test/regress/sql/guc.sql	2010-02-07 12:57:49.000000000 +0100
***************
*** 165,171 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 165,171 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
***************
*** 174,180 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 174,180 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
#87Alvaro Herrera
alvherre@commandprompt.com
In reply to: Joachim Wieland (#86)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland wrote:

+ typedef struct AsyncQueueEntry
+ {
+ 	/*
+ 	 * this record has the maximal length, but usually we limit it to
+ 	 * AsyncQueueEntryEmptySize + strlen(payload).
+ 	 */
+ 	Size			length;
+ 	Oid				dboid;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ 	char			channel[NAMEDATALEN];
+ 	char			payload[NOTIFY_PAYLOAD_MAX_LENGTH];
+ } AsyncQueueEntry;
+ #define AsyncQueueEntryEmptySize \
+ 	 (sizeof(AsyncQueueEntry) - NOTIFY_PAYLOAD_MAX_LENGTH + 1)

These are the on-disk notifications, right? It seems to me a bit
wasteful to store channel name always as NAMEDATALEN bytes. Can we
truncate it at its strlen? I realize that this would cause the struct
definition to be uglier (you will no longer be able to have both channel
and payload pointers, only a char[1] pointer to a data area to which you
write both). Typical channel names should be short, so IMHO this is
worthwhile. Besides, I think the uglification of code this causes
should be fairly contained ...

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
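
To make the layout suggested above concrete, here is a minimal sketch of an entry whose channel and payload share a single trailing data area. It is not code from the patch: `CompactQueueEntry` and the `compact_entry_*` helpers are hypothetical names, `uint32_t` stands in for `Oid` and `TransactionId`, and a C99 flexible array member is used where the message says `char[1]`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical compact queue entry (an illustration, not the patch's struct).
 * uint32_t stands in for Oid and TransactionId, int32_t for int32.
 */
typedef struct CompactQueueEntry
{
	size_t		length;		/* total bytes used by this entry */
	uint32_t	dboid;
	uint32_t	xid;
	int32_t		srcPid;
	char		data[];		/* channel, '\0', payload, '\0' */
} CompactQueueEntry;

/* Size of the fixed header that precedes the string data. */
#define COMPACT_ENTRY_HEADER_SIZE offsetof(CompactQueueEntry, data)

/* Bytes needed for one entry: header plus both strings and their terminators. */
size_t
compact_entry_size(const char *channel, const char *payload)
{
	return COMPACT_ENTRY_HEADER_SIZE + strlen(channel) + 1 + strlen(payload) + 1;
}

/*
 * Write the strings and the length into buf and return the entry length.
 * The caller supplies a suitably aligned buffer of at least
 * compact_entry_size() bytes and sets the remaining header fields itself.
 */
size_t
compact_entry_fill(void *buf, const char *channel, const char *payload)
{
	CompactQueueEntry *entry = buf;
	size_t		chanlen = strlen(channel);

	entry->length = compact_entry_size(channel, payload);
	memcpy(entry->data, channel, chanlen + 1);
	memcpy(entry->data + chanlen + 1, payload, strlen(payload) + 1);
	return entry->length;
}

/* The channel name starts at the beginning of the data area. */
const char *
compact_entry_channel(const CompactQueueEntry *entry)
{
	return entry->data;
}

/* The payload starts right after the channel's terminating NUL. */
const char *
compact_entry_payload(const CompactQueueEntry *entry)
{
	return entry->data + strlen(entry->data) + 1;
}
```

With this layout a channel named "foo" costs 4 bytes in the entry rather than NAMEDATALEN bytes, at the price of locating the payload through an accessor instead of a struct field.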

#88Jeff Davis
pgsql@j-davis.com
In reply to: Joachim Wieland (#86)
Re: Listen / Notify - what to do when the queue is full

In this version of the patch, there is a compiler warning:

async.c: In function ‘AtPrepare_Notify’:
async.c:593: warning: unused variable ‘p’

and also two trivial merge conflicts: an OID conflict for the functions
you added, and a trivial code conflict.

On Sun, 2010-02-07 at 17:32 +0100, Joachim Wieland wrote:

On Wed, Feb 3, 2010 at 2:05 AM, Jeff Davis <pgsql@j-davis.com> wrote:

The original comment was a part of the NotifyStmt case, and I don't
think we can support NOTIFY issued on a standby system -- surely there's
no way for the standby to communicate the notification to the master.
Anyway, this is getting a little sidetracked; I don't think we need to
worry about HS right now.

True but I was not talking about moving any notifications to different
servers. Clients listening on one server should receive the
notifications from NOTIFYs executed on this server, no matter if it is
a standby or the master server.

I'm not sure I agree with that philosophy. If the driving use-case for
LISTEN/NOTIFY is cache invalidation, then a NOTIFY on the master should
make it to all listening backends on the slaves.

This is still kind of an open item but it's an slru issue and should
also be true for other functionality that uses slru queues.

I haven't found out anything new here.

This I don't understand... If power goes out and we restart, we'd
first put all notifications from the prepared transactions into the
queue. We know that they fit because they have fit earlier as well (we
wouldn't allow user connections until we have worked through all 2PC
state files).

Ok, it appears you've thought the 2PC interaction through more than I
have. Even if we don't include it this time, I'm glad to hear that there
is a plan to do so. Feel free to include support if you have it ready,
but it's late in the CF so I don't want you to get sidetracked on that
issue.

There was another problem that the slru files did not all get deleted
at server restart, which is fixed now.

Good catch.

Regarding the famous ASCII-restriction open item I have now realized
what I haven't thought of previously: notifications are not
transferred between databases, they always stay in one database. Since
libpq does the conversion between server and client encoding, it is
questionable if we really need to restrict this at all... But I am not
an encoding expert so whoever feels like he can confirm or refute
this, please speak up.

I would like to support encoded text, but I think there are other
problems. For instance, what if one server has a client_encoding that
doesn't support some of the glyphs being sent by the notifying backend?
Then we have a mess, because we can't deliver it.

I think we just have to say "ASCII only" here, because there's no
reasonable way to handle this, regardless of implementation. If the user
issues SELECT and the client_encoding can't support some of the glyphs,
we can throw an error. But for a notification? We just have no mechanism
for that.

Regards,
Jeff Davis

#89Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#88)
Re: Listen / Notify - what to do when the queue is full

On Mon, 2010-02-08 at 22:13 -0800, Jeff Davis wrote:

I would like to support encoded text, but I think there are other
problems. For instance, what if one server has a client_encoding that
doesn't support some of the glyphs being sent by the notifying backend?
Then we have a mess, because we can't deliver it.

I was thinking more about this. It seems clear that we want the backend
that issues the notify to only put 7-bit ASCII in the payload.

But if the client sends the letters 'string' as a payload, and the
representation in the server encoding is something other than the normal
7-bit ASCII representation of 'string', it will be incorrect, right?

Looking at the documentation, it appears that all of the server
encodings represent 7-bit ascii characters using the same 7-bit ascii
representation. Is that true?

Regards,
Jeff Davis

#90Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#89)
Re: Listen / Notify - what to do when the queue is full

Jeff Davis <pgsql@j-davis.com> writes:

Looking at the documentation, it appears that all of the server
encodings represent 7-bit ascii characters using the same 7-bit ascii
representation. Is that true?

Correct. We only support ASCII-superset encodings, both for frontend
and backend.

Limiting NOTIFY payloads to 7-bit would definitely avoid the issue.
The question is if that's more of a pain than a benefit.

regards, tom lane
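
Because every supported encoding is an ASCII superset, a 7-bit restriction can be checked byte by byte without knowing which encoding is in use. The following is only a sketch of such a check; `payload_is_7bit_ascii` is a hypothetical name and not a function from the patch.

```c
#include <stdbool.h>

/*
 * Return true if every byte of the NUL-terminated payload has its high bit
 * clear, i.e. the string is pure 7-bit ASCII.  An empty payload passes.
 */
bool
payload_is_7bit_ascii(const char *payload)
{
	const unsigned char *p;

	for (p = (const unsigned char *) payload; *p != '\0'; p++)
	{
		if (*p & 0x80)
			return false;	/* byte outside the 7-bit range */
	}
	return true;
}
```

A notifying backend could run a test like this before queueing the payload and raise an error on failure, so that a listener never receives a character its client_encoding cannot represent.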

#91Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#90)
Re: Listen / Notify - what to do when the queue is full

On Tue, Feb 9, 2010 at 4:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeff Davis <pgsql@j-davis.com> writes:

Looking at the documentation, it appears that all of the server
encodings represent 7-bit ascii characters using the same 7-bit ascii
representation. Is that true?

Correct.  We only support ASCII-superset encodings, both for frontend
and backend.

Limiting NOTIFY payloads to 7-bit would definitely avoid the issue.
The question is if that's more of a pain than a benefit.

I think it's a reasonable restriction for now. We have limited time
remaining here; and we can always consider relaxing the restriction in
the future when we have more time to think through the issues. It'll
still be a big improvement over what we have now.

...Robert

#92Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#90)
Re: Listen / Notify - what to do when the queue is full

On Tue, 2010-02-09 at 16:51 -0500, Tom Lane wrote:

Limiting NOTIFY payloads to 7-bit would definitely avoid the issue.
The question is if that's more of a pain than a benefit.

I don't see any alternative. If one backend sends a NOTIFY payload that
contains a non-ASCII character, there's a risk that we won't be able to
deliver it to another backend with a client_encoding that can't
represent that character.

Also, just the fact that client_encoding can be changed at pretty much
any time is a potential problem, because it's difficult to know whether
a particular notification was sent using the old client_encoding or the
new one (because it's asynchronous).

Regards,
Jeff Davis

#93Andrew Chernow
ac@esilo.com
In reply to: Jeff Davis (#92)
Re: Listen / Notify - what to do when the queue is full

Jeff Davis wrote:

On Tue, 2010-02-09 at 16:51 -0500, Tom Lane wrote:

Limiting NOTIFY payloads to 7-bit would definitely avoid the issue.
The question is if that's more of a pain than a benefit.

I don't see any alternative. If one backend sends a NOTIFY payload that

Wouldn't binary payloads be an alternative? NOTE: I may have missed this
discussion. Sorry if it has already been covered.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/

#94Jeff Davis
pgsql@j-davis.com
In reply to: Andrew Chernow (#93)
Re: Listen / Notify - what to do when the queue is full

On Tue, 2010-02-09 at 19:02 -0500, Andrew Chernow wrote:

Wouldn't binary payloads be an alternative? NOTE: I may have missed this
discussion. Sorry if it has already been covered.

The Notify struct has a "char *" field, which can't hold embedded NUL
bytes, so it can't really be binary. But it can't be arbitrary text,
because it has to be encoded in a way that works for every possible
client encoding (otherwise there's a possibility of an error, and no way
to handle it).

Also, the query starts out as text, so we need a way to interpret the
text in an encoding-independent way.

So, I think ASCII is the natural choice here.

Regards,
Jeff Davis

#95Andrew Dunstan
andrew@dunslane.net
In reply to: Jeff Davis (#94)
Re: Listen / Notify - what to do when the queue is full

Jeff Davis wrote:

Also, the query starts out as text, so we need a way to interpret the
text in an encoding-independent way.

So, I think ASCII is the natural choice here.

It's not worth hanging up this facility over this issue, ISTM. If we
want something more than ASCII, then a base64 or hex encoded string could
possibly meet the need in the first instance.

cheers

andrew
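
As an illustration of the hex workaround, the sketch below encodes arbitrary bytes (including embedded NUL bytes) into a pure-ASCII string on the sending side and decodes it again on the listening side. The function names are hypothetical, and the choice of lowercase digits is an assumption rather than anything the thread specifies.

```c
#include <stddef.h>
#include <string.h>

/* Encode len bytes from src as lowercase hex; dst needs 2 * len + 1 bytes. */
void
hex_encode_payload(const unsigned char *src, size_t len, char *dst)
{
	static const char digits[] = "0123456789abcdef";
	size_t		i;

	for (i = 0; i < len; i++)
	{
		dst[2 * i] = digits[src[i] >> 4];
		dst[2 * i + 1] = digits[src[i] & 0x0F];
	}
	dst[2 * len] = '\0';
}

/* Map one hex digit to its value, or -1 if the character is not a hex digit. */
static int
hex_digit_value(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Decode a hex string into dst.  Returns the number of bytes written, or -1
 * if the input has odd length, contains a non-hex character, or does not
 * fit into dstlen bytes.
 */
long
hex_decode_payload(const char *src, unsigned char *dst, size_t dstlen)
{
	size_t		srclen = strlen(src);
	size_t		i;

	if (srclen % 2 != 0 || srclen / 2 > dstlen)
		return -1;
	for (i = 0; i < srclen / 2; i++)
	{
		int			hi = hex_digit_value(src[2 * i]);
		int			lo = hex_digit_value(src[2 * i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		dst[i] = (unsigned char) ((hi << 4) | lo);
	}
	return (long) (srclen / 2);
}
```

Hex encoding doubles the size of the data, so the amount of raw data that fits into a single payload is roughly halved; base64 is more compact but follows the same pattern.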

#96Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#95)
Re: Listen / Notify - what to do when the queue is full

Andrew Dunstan <andrew@dunslane.net> writes:

Jeff Davis wrote:

So, I think ASCII is the natural choice here.

It's not worth hanging up this facility over this issue, ISTM. If we
want something more than ASCII, then a base64 or hex encoded string could
possibly meet the need in the first instance.

Yeah, that would serve people who want to push either binary or
non-ASCII data through the pipe. It would leave all risks of encoding
problems on the user's head, though.

regards, tom lane

#97Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#96)
Re: Listen / Notify - what to do when the queue is full

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Jeff Davis wrote:

So, I think ASCII is the natural choice here.

It's not worth hanging up this facility over this issue, ISTM. If we
want something more than ASCII, then a base64 or hex encoded string could
possibly meet the need in the first instance.

Yeah, that would serve people who want to push either binary or
non-ASCII data through the pipe. It would leave all risks of encoding
problems on the user's head, though.

True. It's a workaround, but I think it's acceptable at this stage. We
need to get some experience with this facility before we can refine it.

cheers

andrew

#98Alvaro Herrera
alvherre@commandprompt.com
In reply to: Andrew Dunstan (#97)
Re: Listen / Notify - what to do when the queue is full

Andrew Dunstan wrote:

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Jeff Davis wrote:

So, I think ASCII is the natural choice here.

It's not worth hanging up this facility over this issue, ISTM.
If we want something more than ASCII, then a base64 or hex
encoded string could possibly meet the need in the first
instance.

Yeah, that would serve people who want to push either binary or
non-ASCII data through the pipe. It would leave all risks of encoding
problems on the user's head, though.

True. It's a workaround, but I think it's acceptable at this stage.
We need to get some experience with this facility before we can
refine it.

Hmm? If we decide now that it's not going to have encoding conversion,
we won't be able to change it later.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#99Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#98)
Re: Listen / Notify - what to do when the queue is full

Alvaro Herrera <alvherre@commandprompt.com> writes:

Andrew Dunstan wrote:

True. It's a workaround, but I think it's acceptable at this stage.
We need to get some experience with this facility before we can
refine it.

Hmm? If we decide now that it's not going to have encoding conversion,
we won't be able to change it later.

How so? If the feature currently allows only ASCII data, the behavior
would be upward compatible with a future version that is able to accept
and convert non-ASCII characters.

regards, tom lane

#100Joachim Wieland
joe@mcknight.de
In reply to: Alvaro Herrera (#87)
1 attachment(s)
Re: Listen / Notify - what to do when the queue is full

On Mon, Feb 8, 2010 at 5:16 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

These are the on-disk notifications, right?  It seems to me a bit
wasteful to store channel name always as NAMEDATALEN bytes.  Can we
truncate it at its strlen?

Attached is a new and hopefully more or less final patch for LISTEN / NOTIFY.

The following items have been addressed in this patch:

- only store strlen(channel) instead of NAMEDATALEN bytes on disk
- limit to 7-bit ASCII
- forbid 2PC and LISTEN/NOTIFY for now
- documentation changes
- add missing tab completion for NOTIFY
- fix pg_notify() behavior with respect to NULLs and to parameters that
are too long or too short
- rebased to current HEAD, OID conflicts resolved

Joachim

Attachments:

listennotify.11.diff (text/x-patch; charset=US-ASCII)
diff -cr cvs.head/doc/src/sgml/catalogs.sgml cvs.build/doc/src/sgml/catalogs.sgml
*** cvs.head/doc/src/sgml/catalogs.sgml	2010-02-10 20:32:57.000000000 +0100
--- cvs.build/doc/src/sgml/catalogs.sgml	2010-02-10 20:42:10.000000000 +0100
***************
*** 169,179 ****
       </row>
  
       <row>
-       <entry><link linkend="catalog-pg-listener"><structname>pg_listener</structname></link></entry>
-       <entry>asynchronous notification support</entry>
-      </row>
- 
-      <row>
        <entry><link linkend="catalog-pg-namespace"><structname>pg_namespace</structname></link></entry>
        <entry>schemas</entry>
       </row>
--- 169,174 ----
***************
*** 3253,3320 ****
    </table>
   </sect1>
  
-  <sect1 id="catalog-pg-listener">
-   <title><structname>pg_listener</structname></title>
- 
-   <indexterm zone="catalog-pg-listener">
-    <primary>pg_listener</primary>
-   </indexterm>
- 
-   <para>
-    The catalog <structname>pg_listener</structname> supports the
-    <xref linkend="sql-listen" endterm="sql-listen-title"> and
-    <xref linkend="sql-notify" endterm="sql-notify-title">
-    commands.  A listener creates an entry in
-    <structname>pg_listener</structname> for each notification name
-    it is listening for.  A notifier scans <structname>pg_listener</structname>
-    and updates each matching entry to show that a notification has occurred.
-    The notifier also sends a signal (using the PID recorded in the table)
-    to awaken the listener from sleep.
-   </para>
- 
-   <table>
-    <title><structname>pg_listener</> Columns</title>
- 
-    <tgroup cols="3">
-     <thead>
-      <row>
-       <entry>Name</entry>
-       <entry>Type</entry>
-       <entry>Description</entry>
-      </row>
-     </thead>
- 
-     <tbody>
-      <row>
-       <entry><structfield>relname</structfield></entry>
-       <entry><type>name</type></entry>
-       <entry>
-        Notify condition name.  (The name need not match any actual
-        relation in the database; the name <structfield>relname</> is historical.)
-       </entry>
-      </row>
- 
-      <row>
-       <entry><structfield>listenerpid</structfield></entry>
-       <entry><type>int4</type></entry>
-       <entry>PID of the server process that created this entry</entry>
-      </row>
- 
-      <row>
-       <entry><structfield>notification</structfield></entry>
-       <entry><type>int4</type></entry>
-       <entry>
-        Zero if no event is pending for this listener.  If an event is
-        pending, the PID of the server process that sent the notification
-       </entry>
-      </row>
-     </tbody>
-    </tgroup>
-   </table>
- 
-  </sect1>
- 
- 
   <sect1 id="catalog-pg-namespace">
    <title><structname>pg_namespace</structname></title>
  
--- 3248,3253 ----
diff -cr cvs.head/doc/src/sgml/func.sgml cvs.build/doc/src/sgml/func.sgml
*** cvs.head/doc/src/sgml/func.sgml	2010-02-10 20:32:58.000000000 +0100
--- cvs.build/doc/src/sgml/func.sgml	2010-02-11 00:18:29.000000000 +0100
***************
*** 11572,11577 ****
--- 11572,11583 ----
        </row>
  
        <row>
+        <entry><literal><function>pg_listening</function>()</literal></entry>
+        <entry><type>set of text</type></entry>
+        <entry>channels that the session is currently listening on</entry>
+       </row>
+ 
+       <row>
         <entry><literal><function>session_user</function></literal></entry>
         <entry><type>name</type></entry>
         <entry>session user name</entry>
***************
*** 11738,11743 ****
--- 11744,11758 ----
     </para>
  
     <indexterm>
+     <primary>pg_listening</primary>
+    </indexterm>
+ 
+    <para>
+     <function>pg_listening</function> returns a set of channels that the
+     current session is listening to. See <xref linkend="sql-listen" endterm="sql-listen-title"> for more information.
+    </para>
+ 
+    <indexterm>
      <primary>version</primary>
     </indexterm>
  
diff -cr cvs.head/doc/src/sgml/libpq.sgml cvs.build/doc/src/sgml/libpq.sgml
*** cvs.head/doc/src/sgml/libpq.sgml	2010-02-10 20:32:59.000000000 +0100
--- cvs.build/doc/src/sgml/libpq.sgml	2010-02-10 20:42:10.000000000 +0100
***************
*** 4116,4125 ****
     can stop listening with the <command>UNLISTEN</command> command).  All
     sessions listening on a particular condition will be notified
     asynchronously when a <command>NOTIFY</command> command with that
!    condition name is executed by any session.  No additional information
!    is passed from the notifier to the listener.  Thus, typically, any
!    actual data that needs to be communicated is transferred through a
!    database table.  Commonly, the condition name is the same as the
     associated table, but it is not necessary for there to be any associated
     table.
    </para>
--- 4116,4124 ----
     can stop listening with the <command>UNLISTEN</command> command).  All
     sessions listening on a particular condition will be notified
     asynchronously when a <command>NOTIFY</command> command with that
!    condition name is executed by any session. A notification parameter can be
!    used to communicate additional data to a listener.
!    Commonly, the condition name is the same as the
     associated table, but it is not necessary for there to be any associated
     table.
    </para>
***************
*** 4153,4161 ****
     <function>PQfreemem</function>.  It is sufficient to free the
     <structname>PGnotify</structname> pointer; the
     <structfield>relname</structfield> and <structfield>extra</structfield>
!    fields do not represent separate allocations.  (At present, the
!    <structfield>extra</structfield> field is unused and will always point
!    to an empty string.)
    </para>
  
    <para>
--- 4152,4158 ----
     <function>PQfreemem</function>.  It is sufficient to free the
     <structname>PGnotify</structname> pointer; the
     <structfield>relname</structfield> and <structfield>extra</structfield>
!    fields do not represent separate allocations.
    </para>
  
    <para>
diff -cr cvs.head/doc/src/sgml/protocol.sgml cvs.build/doc/src/sgml/protocol.sgml
*** cvs.head/doc/src/sgml/protocol.sgml	2010-02-10 20:32:59.000000000 +0100
--- cvs.build/doc/src/sgml/protocol.sgml	2010-02-10 20:42:10.000000000 +0100
***************
*** 3192,3199 ****
  <listitem>
  <para>
                  Additional information passed from the notifying process.
-                 (Currently, this feature is unimplemented so the field
-                 is always an empty string.)
  </para>
  </listitem>
  </varlistentry>
--- 3192,3197 ----
diff -cr cvs.head/doc/src/sgml/ref/listen.sgml cvs.build/doc/src/sgml/ref/listen.sgml
*** cvs.head/doc/src/sgml/ref/listen.sgml	2008-11-14 11:22:47.000000000 +0100
--- cvs.build/doc/src/sgml/ref/listen.sgml	2010-02-10 23:53:03.000000000 +0100
***************
*** 74,79 ****
--- 74,86 ----
   </refsect1>
  
   <refsect1>
+   <title>Notes</title>
+   <para>
+    A transaction that has executed <command>LISTEN</command> cannot be prepared for a two-phase commit.
+   </para>
+  </refsect1>
+ 
+  <refsect1>
    <title>Parameters</title>
  
    <variablelist>
***************
*** 98,104 ****
  <programlisting>
  LISTEN virtual;
  NOTIFY virtual;
! Asynchronous notification "virtual" received from server process with PID 8448.
  </programlisting>
    </para>
   </refsect1>
--- 105,111 ----
  <programlisting>
  LISTEN virtual;
  NOTIFY virtual;
! Asynchronous notification "virtual" () received from server process with PID 8448.
  </programlisting>
    </para>
   </refsect1>
diff -cr cvs.head/doc/src/sgml/ref/notify.sgml cvs.build/doc/src/sgml/ref/notify.sgml
*** cvs.head/doc/src/sgml/ref/notify.sgml	2008-11-14 11:22:47.000000000 +0100
--- cvs.build/doc/src/sgml/ref/notify.sgml	2010-02-11 00:38:22.000000000 +0100
***************
*** 21,27 ****
  
   <refsynopsisdiv>
  <synopsis>
! NOTIFY <replaceable class="PARAMETER">name</replaceable>        
  </synopsis>
   </refsynopsisdiv>
  
--- 21,27 ----
  
   <refsynopsisdiv>
  <synopsis>
! NOTIFY <replaceable class="PARAMETER">name</replaceable> [ <replaceable class="PARAMETER">parameter</replaceable> ]
  </synopsis>
   </refsynopsisdiv>
  
***************
*** 29,36 ****
    <title>Description</title>
  
    <para>
!    The <command>NOTIFY</command> command sends a notification event to each
!    client application that has previously executed
     <command>LISTEN <replaceable class="parameter">name</replaceable></command>
     for the specified notification name in the current database.
    </para>
--- 29,37 ----
    <title>Description</title>
  
    <para>
!    The <command>NOTIFY</command> command sends a notification event together
!    with an optional notification parameter to each client application that has
!    previously executed
     <command>LISTEN <replaceable class="parameter">name</replaceable></command>
     for the specified notification name in the current database.
    </para>
***************
*** 39,54 ****
     <command>NOTIFY</command> provides a simple form of signal or
     interprocess communication mechanism for a collection of processes
     accessing the same <productname>PostgreSQL</productname> database.
!    Higher-level mechanisms can be built by using tables in the database to
!    pass additional data (beyond a mere notification name) from notifier to
!    listener(s).
    </para>
  
    <para>
     The information passed to the client for a notification event includes the notification
!    name and the notifying session's server process <acronym>PID</>.  It is up to the
!    database designer to define the notification names that will be used in a given
!    database and what each one means.
    </para>
  
    <para>
--- 40,56 ----
     <command>NOTIFY</command> provides a simple form of signal or
     interprocess communication mechanism for a collection of processes
     accessing the same <productname>PostgreSQL</productname> database.
!    A notification parameter can be sent along with the notification and
!    higher-level mechanisms for passing structured data can be built by using
!    tables in the database to pass additional data from notifier to listener(s).
    </para>
  
    <para>
     The information passed to the client for a notification event includes the notification
!    name, the notifying session's server process <acronym>PID</> and the
!    notification parameter (payload) which is an empty string if it has not been specified.
!    It is up to the database designer to define the notification names that will
!    be used in a given database and what each one means.
    </para>
  
    <para>
***************
*** 89,102 ****
    </para>
  
    <para>
!    <command>NOTIFY</command> behaves like Unix signals in one important
!    respect: if the same notification name is signaled multiple times in quick
!    succession, recipients might get only one notification event for several executions
!    of <command>NOTIFY</command>.  So it is a bad idea to depend on the number
!    of notifications received.  Instead, use <command>NOTIFY</command> to wake up
!    applications that need to pay attention to something, and use a database
!    object (such as a sequence) to keep track of what happened or how many times
!    it happened.
    </para>
  
    <para>
--- 91,111 ----
    </para>
  
    <para>
!    If the same notification name is signaled multiple times from the same
!    transaction and with an identical notification parameter (or an empty one), then
!    the listening backend will only receive one single notification. On the other hand,
!    notifications with distinct notification parameters will always be delivered as distinct
!    notifications. Similarly, notifications from different transactions will never
!    get folded into one notification. <command>NOTIFY</command> also guarantees
!    that notifications from the same transaction get delivered in the order they
!    were sent.
!   </para>
! 
!   <para>
!    An alternative to specifying a notification parameter is to use <command>NOTIFY</command> to
!    wake up applications that need to pay attention to something, and use a
!    database object (such as a sequence) to keep track of what happened or how many
!    times it happened.
    </para>
  
    <para>
***************
*** 111,122 ****
     notification event message) is the same as one's own session's
     <acronym>PID</> (available from <application>libpq</>).  When they
     are the same, the notification event is one's own work bouncing
!    back, and can be ignored.  (Despite what was said in the preceding
!    paragraph, this is a safe technique.
!    <productname>PostgreSQL</productname> keeps self-notifications
!    separate from notifications arriving from other sessions, so you
!    cannot miss an outside notification by ignoring your own
!    notifications.)
    </para>
   </refsect1>
  
--- 120,141 ----
     notification event message) is the same as one's own session's
     <acronym>PID</> (available from <application>libpq</>).  When they
     are the same, the notification event is one's own work bouncing
!    back, and can be ignored.
!   </para>
!  </refsect1>
! 
!  <refsect1>
!   <title>Notes</title>
!   <para>
!    A transaction that has executed <command>LISTEN</command> cannot be prepared for a two-phase commit.
!   </para>
!   <para>
!    To send a notification you can also use the function
!    <literal><function>pg_notify</function>(<type>text</type>,
!    <type>text</type>)</literal>. The function takes the channel name as the
!    first argument and the payload as the second. This could be more convenient
!    to use in triggers and you can also use a non-constant channel name and
!    parameter value.
    </para>
   </refsect1>
  
***************
*** 132,137 ****
--- 151,170 ----
       </para>
      </listitem>
     </varlistentry>
+    <varlistentry>
+     <term><replaceable class="PARAMETER">parameter</replaceable></term>
+     <listitem>
+      <para>
+       The notification parameter (payload) to be communicated along with the
+       notification. The character string is only allowed to consist of pure
+       ASCII 7-bit characters and must be shorter than 8000 characters.
+       Specifying a longer payload will cause an error. If you need to send
+       other characters or binary data, you need to take care of the encoding
+       and decoding (like base64) on your own. Alternatively you can store the
+       information in a database table and send the key of the record.
+      </para>
+     </listitem>
+    </varlistentry>
    </variablelist>
   </refsect1>
  
***************
*** 145,151 ****
  <programlisting>
  LISTEN virtual;
  NOTIFY virtual;
! Asynchronous notification "virtual" received from server process with PID 8448.
  </programlisting>
    </para>
   </refsect1>
--- 178,190 ----
  <programlisting>
  LISTEN virtual;
  NOTIFY virtual;
! Asynchronous notification "virtual" () received from server process with PID 8448.
! NOTIFY virtual 'This is the payload';
! Asynchronous notification "virtual" (This is the payload) received from server process with PID 8448.
! 
! LISTEN foo;
! SELECT pg_notify((SELECT 'fo' || 'o'), (SELECT 'pay' || 'load'));
! Asynchronous notification "foo" (payload) received from server process with PID 20801.
  </programlisting>
    </para>
   </refsect1>
diff -cr cvs.head/doc/src/sgml/ref/unlisten.sgml cvs.build/doc/src/sgml/ref/unlisten.sgml
*** cvs.head/doc/src/sgml/ref/unlisten.sgml	2008-11-14 11:22:47.000000000 +0100
--- cvs.build/doc/src/sgml/ref/unlisten.sgml	2010-02-10 23:53:12.000000000 +0100
***************
*** 83,88 ****
--- 83,92 ----
     At the end of each session, <command>UNLISTEN *</command> is
     automatically executed.
    </para>
+ 
+   <para>
+    A transaction that has executed <command>LISTEN</command> cannot be prepared for a two-phase commit.
+   </para>
   </refsect1>
  
   <refsect1>
***************
*** 94,100 ****
  <programlisting>
  LISTEN virtual;
  NOTIFY virtual;
! Asynchronous notification "virtual" received from server process with PID 8448.
  </programlisting>
    </para>
  
--- 98,104 ----
  <programlisting>
  LISTEN virtual;
  NOTIFY virtual;
! Asynchronous notification "virtual" () received from server process with PID 8448.
  </programlisting>
    </para>
  
diff -cr cvs.head/src/backend/access/transam/slru.c cvs.build/src/backend/access/transam/slru.c
*** cvs.head/src/backend/access/transam/slru.c	2010-01-05 12:39:22.000000000 +0100
--- cvs.build/src/backend/access/transam/slru.c	2010-02-10 20:42:10.000000000 +0100
***************
*** 58,83 ****
  #include "storage/shmem.h"
  #include "miscadmin.h"
  
- 
- /*
-  * Define segment size.  A page is the same BLCKSZ as is used everywhere
-  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
-  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
-  * or 64K transactions for SUBTRANS.
-  *
-  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
-  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
-  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
-  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
-  * take no explicit notice of that fact in this module, except when comparing
-  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
-  *
-  * Note: this file currently assumes that segment file names will be four
-  * hex digits.	This sets a lower bound on the segment size (64K transactions
-  * for 32-bit TransactionIds).
-  */
- #define SLRU_PAGES_PER_SEGMENT	32
- 
  #define SlruFileName(ctl, path, seg) \
  	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
  
--- 58,63 ----
diff -cr cvs.head/src/backend/access/transam/twophase_rmgr.c cvs.build/src/backend/access/transam/twophase_rmgr.c
*** cvs.head/src/backend/access/transam/twophase_rmgr.c	2010-01-05 12:39:22.000000000 +0100
--- cvs.build/src/backend/access/transam/twophase_rmgr.c	2010-02-10 20:42:10.000000000 +0100
***************
*** 25,31 ****
  {
  	NULL,						/* END ID */
  	lock_twophase_recover,		/* Lock */
- 	NULL,						/* notify/listen */
  	NULL,						/* pgstat */
  	multixact_twophase_recover	/* MultiXact */
  };
--- 25,30 ----
***************
*** 34,40 ****
  {
  	NULL,						/* END ID */
  	lock_twophase_postcommit,	/* Lock */
- 	notify_twophase_postcommit, /* notify/listen */
  	pgstat_twophase_postcommit,	/* pgstat */
  	multixact_twophase_postcommit /* MultiXact */
  };
--- 33,38 ----
***************
*** 43,49 ****
  {
  	NULL,						/* END ID */
  	lock_twophase_postabort,	/* Lock */
- 	NULL,						/* notify/listen */
  	pgstat_twophase_postabort,	/* pgstat */
  	multixact_twophase_postabort /* MultiXact */
  };
--- 41,46 ----
***************
*** 52,58 ****
  {
  	NULL,						/* END ID */
  	lock_twophase_standby_recover,		/* Lock */
- 	NULL,						/* notify/listen */
  	NULL,						/* pgstat */
  	NULL						/* MultiXact */
  };
--- 49,54 ----
diff -cr cvs.head/src/backend/access/transam/xact.c cvs.build/src/backend/access/transam/xact.c
*** cvs.head/src/backend/access/transam/xact.c	2010-02-10 20:33:05.000000000 +0100
--- cvs.build/src/backend/access/transam/xact.c	2010-02-10 20:42:10.000000000 +0100
***************
*** 1733,1740 ****
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* NOTIFY commit must come before lower-level cleanup */
! 	AtCommit_Notify();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
--- 1733,1740 ----
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* Insert notifications sent by the NOTIFY command into the queue */
! 	AtCommit_NotifyBeforeCommit();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
***************
*** 1812,1817 ****
--- 1812,1822 ----
  
  	AtEOXact_MultiXact();
  
+ 	/*
+ 	 * Clean up Notify buffers and signal listening backends.
+ 	 */
+ 	AtCommit_NotifyAfterCommit();
+ 
  	ResourceOwnerRelease(TopTransactionResourceOwner,
  						 RESOURCE_RELEASE_LOCKS,
  						 true, true);
diff -cr cvs.head/src/backend/catalog/Makefile cvs.build/src/backend/catalog/Makefile
*** cvs.head/src/backend/catalog/Makefile	2010-01-06 22:30:05.000000000 +0100
--- cvs.build/src/backend/catalog/Makefile	2010-02-10 20:42:10.000000000 +0100
***************
*** 30,36 ****
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
! 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_listener.h pg_description.h \
  	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
--- 30,36 ----
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
! 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_description.h \
  	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
diff -cr cvs.head/src/backend/commands/async.c cvs.build/src/backend/commands/async.c
*** cvs.head/src/backend/commands/async.c	2010-01-05 12:39:22.000000000 +0100
--- cvs.build/src/backend/commands/async.c	2010-02-11 01:19:01.000000000 +0100
***************
*** 14,44 ****
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  * 1. Multiple backends on same machine.  Multiple backends listening on
!  *	  one relation.  (Note: "listening on a relation" is not really the
!  *	  right way to think about it, since the notify names need not have
!  *	  anything to do with the names of relations actually in the database.
!  *	  But this terminology is all over the code and docs, and I don't feel
!  *	  like trying to replace it.)
!  *
!  * 2. There is a tuple in relation "pg_listener" for each active LISTEN,
!  *	  ie, each relname/listenerPID pair.  The "notification" field of the
!  *	  tuple is zero when no NOTIFY is pending for that listener, or the PID
!  *	  of the originating backend when a cross-backend NOTIFY is pending.
!  *	  (We skip writing to pg_listener when doing a self-NOTIFY, so the
!  *	  notification field should never be equal to the listenerPID field.)
!  *
!  * 3. The NOTIFY statement itself (routine Async_Notify) just adds the target
!  *	  relname to a list of outstanding NOTIFY requests.  Actual processing
!  *	  happens if and only if we reach transaction commit.  At that time (in
!  *	  routine AtCommit_Notify) we scan pg_listener for matching relnames.
!  *	  If the listenerPID in a matching tuple is ours, we just send a notify
!  *	  message to our own front end.  If it is not ours, and "notification"
!  *	  is not already nonzero, we set notification to our own PID and send a
!  *	  PROCSIG_NOTIFY_INTERRUPT signal to the receiving process (indicated by
!  *	  listenerPID).
!  *	  BTW: if the signal operation fails, we presume that the listener backend
!  *	  crashed without removing this tuple, and remove the tuple for it.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
--- 14,68 ----
  
  /*-------------------------------------------------------------------------
   * New Async Notification Model:
!  *
!  * 1. Multiple backends on same machine. Multiple backends listening on
!  *	  several channels. (This was previously called a "relation" even though it
!  *	  is just an identifier and has nothing to do with a database relation.)
!  *
!  * 2. There is one central queue in the form of SLRU-backed, file-based storage
!  *    (directory pg_notify/), with several pages mapped into shared memory.
!  *
!  *    There is no central storage of which backend listens on which channel,
!  *    every backend has its own list.
!  *
!  *    Every backend that is listening on at least one channel registers by
!  *    entering its Pid into the array of all backends. It then scans all
!  *    incoming notifications and compares the notified channels with its list.
!  *
!  *    In case there is a match it delivers the corresponding notification to
!  *    its frontend.
!  *
!  * 3. The NOTIFY statement (routine Async_Notify) stores the notification
!  *    in a list which will not be processed until transaction end. Every
!  *    notification can additionally send a "payload" which is an extra text
!  *    parameter to convey arbitrary information to the recipient.
!  *
!  *    Duplicate notifications from the same transaction are sent out as one
!  *    notification only. This is done to save work when, for example, a trigger
!  *    on a 2 million row table fires a notification for each row that has been
!  *    changed. If the application needs to receive every single notification
!  *    that has been sent, it can easily add some unique string to the extra
!  *    payload parameter.
!  *
!  *    Once the transaction commits, AtCommit_NotifyBeforeCommit() performs the
!  *    required changes regarding listeners (Listen/Unlisten) and then adds the
!  *    pending notifications to the head of the queue. The head pointer of the
!  *    queue always points to the next free position and a position is just a
!  *    page number and the offset in that page. This is done before marking the
!  *    transaction as committed in clog. If we run into problems writing the
!  *    notifications, we can still call elog(ERROR, ...) and the transaction
!  *    will roll back.
!  *
!  *    Once we have put all of the notifications into the queue, we return to
!  *    CommitTransaction() which will then commit to clog.
!  *
!  *    After clog commit we are called another time
!  *    (AtCommit_NotifyAfterCommit()). Here we check if we need to signal the
!  *    backends. In SignalBackends() we scan the list of listening backends and
!  *    send a PROCSIG_NOTIFY_INTERRUPT to every backend that has set its Pid (we
!  *    don't know which backend is listening on which channel so we need to send
!  *    a signal to every listening backend). We can exclude backends that are
!  *    already up to date.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
***************
*** 46,84 ****
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  * 5. Inbound-notify processing consists of scanning pg_listener for tuples
!  *	  matching our own listenerPID and having nonzero notification fields.
!  *	  For each such tuple, we send a message to our frontend and clear the
!  *	  notification field.  BTW: this routine has to start/commit its own
!  *	  transaction, since by assumption it is only called from outside any
!  *	  transaction.
!  *
!  * Like NOTIFY, LISTEN and UNLISTEN just add the desired action to a list
!  * of pending actions.	If we reach transaction commit, the changes are
!  * applied to pg_listener just before executing any pending NOTIFYs.  This
!  * method is necessary because to avoid race conditions, we must hold lock
!  * on pg_listener from when we insert a new listener tuple until we commit.
!  * To do that and not create undue hazard of deadlock, we don't want to
!  * touch pg_listener until we are otherwise done with the transaction;
!  * in particular it'd be uncool to still be taking user-commanded locks
!  * while holding the pg_listener lock.
!  *
!  * Although we grab ExclusiveLock on pg_listener for any operation,
!  * the lock is never held very long, so it shouldn't cause too much of
!  * a performance problem.  (Previously we used AccessExclusiveLock, but
!  * there's no real reason to forbid concurrent reads.)
   *
!  * An application that listens on the same relname it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * PID.  (As of FE/BE protocol 2.0, the backend's PID is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.  Note,
!  * however, that we do *not* guarantee that a separate frontend message will
!  * be sent for every outside NOTIFY.  Since there is only room for one
!  * originating PID in pg_listener, outside notifies occurring at about the
!  * same time may be collapsed into a single message bearing the PID of the
!  * first outside backend to perform the NOTIFY.
   *-------------------------------------------------------------------------
   */
  
--- 70,91 ----
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  *    Inbound-notify processing consists of reading all of the notifications
!  *	  that have arrived since the last scan. We read every notification
!  *	  until we reach either a notification from an uncommitted transaction or
!  *	  the head pointer's position. Then we check if we were the laziest
!  *	  backend: if our pointer was at the same position as the global tail
!  *	  pointer, we advance the tail to the second-laziest backend's position (we
!  *	  can identify it by inspecting the positions of all other backends'
!  *	  pointers). Whenever we move the tail pointer we also truncate now unused
!  *	  pages (i.e. delete files in pg_notify/ that are no longer used).
   *
!  * An application that listens on the same channel it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * Pid.  (As of FE/BE protocol 2.0, the backend's Pid is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.
   *-------------------------------------------------------------------------
   */
  
***************
*** 88,97 ****
  #include <signal.h>
  
  #include "access/heapam.h"
! #include "access/twophase_rmgr.h"
  #include "access/xact.h"
! #include "catalog/pg_listener.h"
  #include "commands/async.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
--- 95,106 ----
  #include <signal.h>
  
  #include "access/heapam.h"
! #include "access/slru.h"
! #include "access/transam.h"
  #include "access/xact.h"
! #include "catalog/pg_type.h"
  #include "commands/async.h"
+ #include "funcapi.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
***************
*** 108,115 ****
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction.  As explained above,
!  * we don't actually modify pg_listener until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
--- 117,124 ----
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction. As explained above,
!  * we don't actually apply these actions until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
***************
*** 126,132 ****
  typedef struct
  {
  	ListenActionKind action;
! 	char		condname[1];	/* actually, as long as needed */
  } ListenAction;
  
  static List *pendingActions = NIL;		/* list of ListenAction */
--- 135,141 ----
  typedef struct
  {
  	ListenActionKind action;
! 	char		channel[1];	/* actually, as long as needed */
  } ListenAction;
  
  static List *pendingActions = NIL;		/* list of ListenAction */
***************
*** 134,140 ****
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all relnames NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
--- 143,149 ----
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all channels NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
***************
*** 149,160 ****
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
- static List *pendingNotifies = NIL;		/* list of C strings */
  
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
  
  /*
!  * State for inbound notifies consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
--- 158,280 ----
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
  
+ typedef struct QueuePosition
+ {
+ 	int				page;
+ 	int				offset;
+ } QueuePosition;
+ 
+ typedef struct Notification
+ {
+ 	char		   *channel;
+ 	char		   *payload;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ } Notification;
+ 
+ typedef struct AsyncQueueEntry
+ {
+ 	/*
+ 	 * this record has the maximal length, but usually we limit it to
+ 	 * AsyncQueueEntryEmptySize + strlen(payload).
+ 	 */
+ 	Size			length;
+ 	Oid				dboid;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ 	char			data[NAMEDATALEN + NOTIFY_PAYLOAD_MAX_LENGTH];
+ } AsyncQueueEntry;
+ #define AsyncQueueEntryEmptySize \
+ 	 (sizeof(AsyncQueueEntry) - NOTIFY_PAYLOAD_MAX_LENGTH + 1 \
+ 							  - NAMEDATALEN + 1)
+ 
+ #define	InvalidPid				(-1)
+ #define QUEUE_POS_PAGE(x)		((x).page)
+ #define QUEUE_POS_OFFSET(x)		((x).offset)
+ #define QUEUE_POS_EQUAL(x,y) \
+ 	 ((x).page == (y).page ? (x).offset == (y).offset : false)
+ #define SET_QUEUE_POS(x,y,z) \
+ 	do { \
+ 		(x).page = (y); \
+ 		(x).offset = (z); \
+ 	} while (0)
+ /* the logically earlier of positions x and y, given z = HEAD */
+ #define QUEUE_POS_MIN(x,y,z) \
+ 	(asyncQueuePagePrecedesLogically((x).page, (y).page, (z).page) ? (x) : \
+ 		 asyncQueuePagePrecedesLogically((y).page, (x).page, (z).page) ? (y) : \
+ 			 (x).offset < (y).offset ? (x) : \
+ 			 	(y))
+ #define QUEUE_BACKEND_POS(i)		asyncQueueControl->backend[(i)].pos
+ #define QUEUE_BACKEND_PID(i)		asyncQueueControl->backend[(i)].pid
+ #define QUEUE_HEAD					asyncQueueControl->head
+ #define QUEUE_TAIL					asyncQueueControl->tail
+ 
+ typedef struct QueueBackendStatus
+ {
+ 	int32			pid;
+ 	QueuePosition	pos;
+ } QueueBackendStatus;
+ 
+ /*
+  * The AsyncQueueControl structure is protected by the AsyncQueueLock.
+  *
+  * In SHARED mode, backends will only inspect their own entries as well as
+  * head and tail pointers. Consequently we can allow a backend to update its
+  * own record while holding only a shared lock (since no other backend will
+  * inspect it).
+  *
+  * In EXCLUSIVE mode, backends can inspect the entries of other backends and
+  * also change head and tail pointers.
+  *
+  * In order to avoid deadlocks, whenever we need both locks, we always first
+  * get AsyncQueueLock and then AsyncCtlLock.
+  */
+ typedef struct AsyncQueueControl
+ {
+ 	QueuePosition		head;		/* head points to the next free location */
+ 	QueuePosition 		tail;		/* the global tail is equivalent to the
+ 									   tail of the "slowest" backend */
+ 	TimestampTz			lastQueueFillWarn;	/* when the queue is full we only
+ 											   want to log that once in a
+ 											   while */
+ 	QueueBackendStatus	backend[1];	/* actually this one has as many entries as
+ 									 * connections are allowed (MaxBackends) */
+ 	/* DO NOT ADD FURTHER STRUCT MEMBERS HERE */
+ } AsyncQueueControl;
+ 
+ static AsyncQueueControl   *asyncQueueControl;
+ static SlruCtlData			AsyncCtlData;
+ 
+ #define AsyncCtl					(&AsyncCtlData)
+ #define QUEUE_PAGESIZE				BLCKSZ
+ #define QUEUE_FULL_WARN_INTERVAL	5000	/* warn at most once every 5s */
+ 
+ /*
+  * slru.c currently assumes that all filenames are four characters of hex
+  * digits. That means that we can use segments 0000 through FFFF.
+  * Each segment contains SLRU_PAGES_PER_SEGMENT pages which gives us
+  * the pages from 0 to SLRU_PAGES_PER_SEGMENT * 0xFFFF.
+  *
+  * It's of course easy to enhance slru.c but those pages give us so much
+  * space already that it doesn't seem worth the trouble...
+  *
+  * It's an interesting test case to define QUEUE_MAX_PAGE to a very small
+  * multiple of SLRU_PAGES_PER_SEGMENT to test queue full behaviour.
+  */
+ #define QUEUE_MAX_PAGE			(SLRU_PAGES_PER_SEGMENT * 0xFFFF)
+ 
+ static List *pendingNotifies = NIL;				/* list of Notifications */
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
+ static List *listenChannels = NIL;	/* list of channels we are listening to */
+ 
+ /* has this backend sent notifications in the current transaction ? */
+ static bool backendSendsNotifications = false;
+ /* has this backend executed a LISTEN in the current transaction ? */
+ static bool backendExecutesInitialListen = false;
  
  /*
!  * State for inbound notifications consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
***************
*** 171,224 ****
  
  bool		Trace_notify = false;
  
! 
! static void queue_listen(ListenActionKind action, const char *condname);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static void Exec_Listen(Relation lRel, const char *relname);
! static void Exec_Unlisten(Relation lRel, const char *relname);
! static void Exec_UnlistenAll(Relation lRel);
! static void Send_Notify(Relation lRel);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(char *relname, int32 listenerPID);
! static bool AsyncExistsPendingNotify(const char *relname);
  static void ClearPendingActionsAndNotifies(void);
  
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the relation to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", relname);
  
! 	/* no point in making duplicate entries in the list ... */
! 	if (!AsyncExistsPendingNotify(relname))
  	{
! 		/*
! 		 * The name list needs to live until end of transaction, so store it
! 		 * in the transaction context.
! 		 */
! 		MemoryContext oldcontext;
  
! 		oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
! 		/*
! 		 * Ordering of the list isn't important.  We choose to put new entries
! 		 * on the front, as this might make duplicate-elimination a tad faster
! 		 * when the same condition is signaled many times in a row.
! 		 */
! 		pendingNotifies = lcons(pstrdup(relname), pendingNotifies);
  
! 		MemoryContextSwitchTo(oldcontext);
! 	}
  }
  
  /*
--- 291,514 ----
  
  bool		Trace_notify = false;
  
! static void queue_listen(ListenActionKind action, const char *channel);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static bool IsListeningOn(const char *channel);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
! static void Exec_ListenBeforeCommit(const char *channel);
! static void Exec_ListenAfterCommit(const char *channel);
! static void Exec_UnlistenAfterCommit(const char *channel);
! static void Exec_UnlistenAllAfterCommit(void);
! static void SignalBackends(void);
! static void Send_Notify(void);
! static bool asyncQueuePagePrecedesPhysically(int p, int q);
! static bool asyncQueuePagePrecedesLogically(int p, int q, int head);
! static bool asyncQueueAdvance(QueuePosition *position, int entryLength);
! static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
! static void asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n);
! static List *asyncQueueAddEntries(List *notifications);
! static bool asyncQueueGetEntriesByPage(QueuePosition *current,
! 									   QueuePosition stop,
! 									   List **notifications);
! static void asyncQueueReadAllNotifications(void);
! static void asyncQueueAdvanceTail(void);
! static void asyncQueueUnregister(void);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(const char *channel,
! 							 const char *payload,
! 							 int32 srcPid);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
  static void ClearPendingActionsAndNotifies(void);
  
+ /*
+  * We will work on the page range of 0..(SLRU_PAGES_PER_SEGMENT * 0xFFFF).
+  * asyncQueuePagePrecedesPhysically just checks numerically without any magic
+  * if one page precedes another one.
+  *
+  * On the other hand, when asyncQueuePagePrecedesLogically does that check, it
+  * takes the current head page number into account. If we have wrapped
+  * around, it can happen that p precedes q, even though p > q (if the head page
+  * is in between the two).
+  */
+ static bool
+ asyncQueuePagePrecedesPhysically(int p, int q)
+ {
+ 	return p < q;
+ }
+ 
+ static bool
+ asyncQueuePagePrecedesLogically(int p, int q, int head)
+ {
+ 	if (p <= head && q <= head)
+ 		return p < q;
+ 	if (p > head && q > head)
+ 		return p < q;
+ 	if (p <= head)
+ 	{
+ 		Assert(q > head);
+ 		/* q is older */
+ 		return false;
+ 	}
+ 	else
+ 	{
+ 		Assert(p > head && q <= head);
+ 		/* p is older */
+ 		return true;
+ 	}
+ }
+ 
+ void
+ AsyncShmemInit(void)
+ {
+ 	bool	found;
+ 	int		slotno;
+ 	Size	size;
+ 
+ 	/*
+ 	 * Remember that sizeof(AsyncQueueControl) already contains one member of
+ 	 * QueueBackendStatus, so we only need to add the status space requirement
+ 	 * for MaxBackends-1 backends.
+ 	 */
+ 	size = mul_size(MaxBackends-1, sizeof(QueueBackendStatus));
+ 	size = add_size(size, sizeof(AsyncQueueControl));
+ 
+ 	asyncQueueControl = (AsyncQueueControl *)
+ 		ShmemInitStruct("Async Queue Control", size, &found);
+ 
+ 	if (!asyncQueueControl)
+ 		elog(ERROR, "out of memory");
+ 
+ 	if (!found)
+ 	{
+ 		int		i;
+ 		SET_QUEUE_POS(QUEUE_HEAD, 0, 0);
+ 		SET_QUEUE_POS(QUEUE_TAIL, 0, 0);
+ 		for (i = 0; i < MaxBackends; i++)
+ 		{
+ 			SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ 			QUEUE_BACKEND_PID(i) = InvalidPid;
+ 		}
+ 	}
+ 
+ 	AsyncCtl->PagePrecedes = asyncQueuePagePrecedesPhysically;
+ 	SimpleLruInit(AsyncCtl, "Async Ctl", NUM_ASYNC_BUFFERS, 0,
+ 				  AsyncCtlLock, "pg_notify");
+ 	AsyncCtl->do_fsync = false;
+ 	asyncQueueControl->lastQueueFillWarn = GetCurrentTimestamp();
+ 
+ 	if (!found)
+ 	{
+ 		SlruScanDirectory(AsyncCtl,
+ 						  QUEUE_MAX_PAGE + SLRU_PAGES_PER_SEGMENT,
+ 						  true);
+ 
+ 		LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
+ 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
+ 		SimpleLruWritePage(AsyncCtl, slotno, NULL);
+ 		LWLockRelease(AsyncCtlLock);
+ 	}
+ }
+ 
+ 
+ /*
+  * pg_notify -
+  *	  Send a notification to listening clients
+  */
+ Datum
+ pg_notify(PG_FUNCTION_ARGS)
+ {
+ 	const char *channel;
+ 	const char *payload;
+ 
+ 	if (PG_ARGISNULL(0))
+ 		channel = "";
+ 	else
+ 		channel = text_to_cstring(PG_GETARG_TEXT_PP(0));
+ 
+ 	if (PG_ARGISNULL(1))
+ 		payload = "";
+ 	else
+ 		payload = text_to_cstring(PG_GETARG_TEXT_PP(1));
+ 
+ 	Async_Notify(channel, payload);
+ 
+ 	PG_RETURN_VOID();
+ }
+ 
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the channel to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *channel, const char *payload)
  {
+ 	Notification   *n;
+ 	MemoryContext	oldcontext;
+ 	int				i;
+ 
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", channel);
  
! 	/* a channel name must be specified */
! 	if (!channel || !strlen(channel))
! 		ereport(ERROR,
! 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 				 errmsg("channel name cannot be empty")));
! 
! 	if (strlen(channel) >= NAMEDATALEN)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 				 errmsg("channel name too long")));
! 
! 	if (payload)
  	{
! 		if (strlen(payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1)
! 			ereport(ERROR,
! 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 					 errmsg("payload string too long")));
! 
! 		for (i = 0; payload[i] != '\0'; i++)
! 			if (payload[i] < 32 || payload[i] > 126)
! 				ereport(ERROR,
! 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 						 errmsg("invalid character in payload"),
! 						 errdetail("only 7-bit ASCII characters allowed")));
! 	}
  
! 	/* no point in making duplicate entries in the list ... */
! 	if (AsyncExistsPendingNotify(channel, payload))
! 		return;
  
! 	/*
! 	 * The name list needs to live until end of transaction, so store it
! 	 * in the transaction context.
! 	 */
! 	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
! 	n = (Notification *) palloc(sizeof(Notification));
! 	n->channel = pstrdup(channel);
! 	if (payload)
! 		n->payload = pstrdup(payload);
! 	else
! 		n->payload = "";
! 
! 	/* will set the xid and the srcPid later... */
! 	n->xid = InvalidTransactionId;
! 	n->srcPid = InvalidPid;
! 
! 	/*
! 	 * We want to preserve the order so we need to append every
! 	 * notification. See comments at AsyncExistsPendingNotify().
! 	 */
! 	pendingNotifies = lappend(pendingNotifies, n);
! 
! 	MemoryContextSwitchTo(oldcontext);
  }
  
  /*
***************
*** 226,236 ****
   *		Common code for listen, unlisten, unlisten all commands.
   *
   *		Adds the request to the list of pending actions.
!  *		Actual update of pg_listener happens during transaction commit.
!  *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  static void
! queue_listen(ListenActionKind action, const char *condname)
  {
  	MemoryContext oldcontext;
  	ListenAction *actrec;
--- 516,526 ----
   *		Common code for listen, unlisten, unlisten all commands.
   *
   *		Adds the request to the list of pending actions.
!  *		Actual update of the notification queue happens during transaction
!  *		commit.
   */
  static void
! queue_listen(ListenActionKind action, const char *channel)
  {
  	MemoryContext oldcontext;
  	ListenAction *actrec;
***************
*** 244,252 ****
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  	/* space for terminating null is included in sizeof(ListenAction) */
! 	actrec = (ListenAction *) palloc(sizeof(ListenAction) + strlen(condname));
  	actrec->action = action;
! 	strcpy(actrec->condname, condname);
  
  	pendingActions = lappend(pendingActions, actrec);
  
--- 534,542 ----
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  	/* space for terminating null is included in sizeof(ListenAction) */
! 	actrec = (ListenAction *) palloc(sizeof(ListenAction) + strlen(channel));
  	actrec->action = action;
! 	strcpy(actrec->channel, channel);
  
  	pendingActions = lappend(pendingActions, actrec);
  
***************
*** 259,270 ****
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", relname, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, relname);
  }
  
  /*
--- 549,560 ----
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", channel, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, channel);
  }
  
  /*
***************
*** 273,288 ****
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", relname, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, relname);
  }
  
  /*
--- 563,578 ----
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", channel, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, channel);
  }
  
  /*
***************
*** 306,313 ****
  /*
   * Async_UnlistenOnExit
   *
-  *		Clean up the pg_listener table at backend exit.
-  *
   *		This is executed if we have done any LISTENs in this backend.
   *		It might not be necessary anymore, if the user UNLISTENed everything,
   *		but we don't try to detect that case.
--- 596,601 ----
***************
*** 315,331 ****
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
- 	/*
- 	 * We need to start/commit a transaction for the unlisten, but if there is
- 	 * already an active transaction we had better abort that one first.
- 	 * Otherwise we'd end up committing changes that probably ought to be
- 	 * discarded.
- 	 */
  	AbortOutOfAnyTransaction();
! 	/* Now we can do the unlisten */
! 	StartTransactionCommand();
! 	Async_UnlistenAll();
! 	CommitTransactionCommand();
  }
  
  /*
--- 603,610 ----
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
  	AbortOutOfAnyTransaction();
! 	Exec_UnlistenAllAfterCommit();
  }
  
  /*
***************
*** 337,386 ****
  void
  AtPrepare_Notify(void)
  {
- 	ListCell   *p;
- 
  	/* It's not sensible to have any pending LISTEN/UNLISTEN actions */
! 	if (pendingActions)
  		ereport(ERROR,
  				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
! 				 errmsg("cannot PREPARE a transaction that has executed LISTEN or UNLISTEN")));
! 
! 	/* We can deal with pending NOTIFY though */
! 	foreach(p, pendingNotifies)
! 	{
! 		const char *relname = (const char *) lfirst(p);
! 
! 		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   relname, strlen(relname) + 1);
! 	}
! 
! 	/*
! 	 * We can clear the state immediately, rather than needing a separate
! 	 * PostPrepare call, because if the transaction fails we'd just discard
! 	 * the state anyway.
! 	 */
! 	ClearPendingActionsAndNotifies();
  }
  
  /*
!  * AtCommit_Notify
!  *
!  *		This is called at transaction commit.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, insert or delete
!  *		tuples in pg_listener accordingly.
   *
!  *		If there are outbound notify requests in the pendingNotifies list,
!  *		scan pg_listener for matching tuples, and either signal the other
!  *		backend or send a message to our own frontend.
   *
!  *		NOTE: we are still inside the current transaction, therefore can
!  *		piggyback on its committing of changes.
   */
  void
! AtCommit_Notify(void)
  {
- 	Relation	lRel;
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
--- 616,643 ----
  void
  AtPrepare_Notify(void)
  {
  	/* It's not sensible to have any pending LISTEN/UNLISTEN actions */
! 	if (pendingActions || pendingNotifies)
  		ereport(ERROR,
  				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
! 				 errmsg("cannot PREPARE a transaction that has executed LISTEN/UNLISTEN or NOTIFY")));
  }
  
  /*
!  * AtCommit_NotifyBeforeCommit
   *
!  *		This is called at transaction commit, before actually committing to
!  *		clog.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, update our
!  *		"listenChannels" list.
   *
!  *		If there are outbound notify requests in the pendingNotifies list, add
!  *		them to the global queue and signal any backend that is listening.
   */
  void
! AtCommit_NotifyBeforeCommit(void)
  {
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
***************
*** 397,406 ****
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify");
  
! 	/* Acquire ExclusiveLock on pg_listener */
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
--- 654,663 ----
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyBeforeCommit");
  
! 	Assert(backendSendsNotifications == false);
! 	Assert(backendExecutesInitialListen == false);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
***************
*** 410,508 ****
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_Listen(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN:
! 				Exec_Unlisten(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAll(lRel);
  				break;
  		}
- 
- 		/* We must CCI after each action in case of conflicting actions */
- 		CommandCounterIncrement();
  	}
  
- 	/* Perform any pending notifies */
- 	if (pendingNotifies)
- 		Send_Notify(lRel);
- 
  	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that notified backends see our tuple updates when they look. Else they
! 	 * might disregard the signal, which would make the application programmer
! 	 * very unhappy.  Also, this prevents race conditions when we have just
! 	 * inserted a listening tuple.
  	 */
! 	heap_close(lRel, NoLock);
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify: done");
  }
  
  /*
!  * Exec_Listen --- subroutine for AtCommit_Notify
!  *
!  *		Register the current backend as listening on the specified relation.
   */
! static void
! Exec_Listen(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
! 	Datum		values[Natts_pg_listener];
! 	bool		nulls[Natts_pg_listener];
! 	NameData	condname;
! 	bool		alreadyListener = false;
  
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Listen(%s,%d)", relname, MyProcPid);
  
! 	/* Detect whether we are already listening on this relname */
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
  
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
! 		{
! 			alreadyListener = true;
! 			/* No need to scan the rest of the table */
! 			break;
! 		}
  	}
- 	heap_endscan(scan);
  
! 	if (alreadyListener)
! 		return;
  
! 	/*
! 	 * OK to insert a new tuple
! 	 */
! 	memset(nulls, false, sizeof(nulls));
  
! 	namestrcpy(&condname, relname);
! 	values[Anum_pg_listener_relname - 1] = NameGetDatum(&condname);
! 	values[Anum_pg_listener_listenerpid - 1] = Int32GetDatum(MyProcPid);
! 	values[Anum_pg_listener_notification - 1] = Int32GetDatum(0);		/* no notifies pending */
  
! 	tuple = heap_form_tuple(RelationGetDescr(lRel), values, nulls);
  
! 	simple_heap_insert(lRel, tuple);
  
! #ifdef NOT_USED					/* currently there are no indexes */
! 	CatalogUpdateIndexes(lRel, tuple);
! #endif
  
! 	heap_freetuple(tuple);
  
  	/*
! 	 * now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
--- 667,852 ----
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_ListenBeforeCommit(actrec->channel);
  				break;
  			case LISTEN_UNLISTEN:
! 				/* there is no Exec_UnlistenBeforeCommit() */
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				/* there is no Exec_UnlistenAllBeforeCommit() */
  				break;
  		}
  	}
  
  	/*
! 	 * Perform any pending notifies.
  	 */
! 	if (pendingNotifies)
! 		Send_Notify();
! }
! 
! /*
!  * AtCommit_NotifyAfterCommit
!  *
!  *		This is called at transaction commit, after committing to clog.
!  *
!  *		Notify the listening backends.
!  */
! void
! AtCommit_NotifyAfterCommit(void)
! {
! 	ListCell   *p;
! 
! 	/* Allow transactions that have not executed LISTEN/UNLISTEN/NOTIFY to
! 	 * return as soon as possible */
! 	if (!pendingActions && !backendSendsNotifications)
! 		return;
! 
! 	/* Perform any pending listen/unlisten actions */
! 	foreach(p, pendingActions)
! 	{
! 		ListenAction *actrec = (ListenAction *) lfirst(p);
! 
! 		switch (actrec->action)
! 		{
! 			case LISTEN_LISTEN:
! 				Exec_ListenAfterCommit(actrec->channel);
! 				break;
! 			case LISTEN_UNLISTEN:
! 				Exec_UnlistenAfterCommit(actrec->channel);
! 				break;
! 			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAllAfterCommit();
! 				break;
! 		}
! 	}
! 
! 	if (backendSendsNotifications)
! 		SignalBackends();
  
  	ClearPendingActionsAndNotifies();
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyAfterCommit: done");
  }
  
  /*
!  * This function is executed for every notification found in the queue in order
!  * to check if the current backend is listening on that channel. Not sure if we
!  * should further optimize this, for example convert to a sorted array and
!  * allow binary search on it...
   */
! static bool
! IsListeningOn(const char *channel)
  {
! 	ListCell   *p;
! 	char	   *lchan;
  
! 	foreach(p, listenChannels)
! 	{
! 		lchan = (char *) lfirst(p);
! 		if (strcmp(lchan, channel) == 0)
! 			return true;
! 	}
! 	return false;
! }
! 
! Datum
! pg_listening(PG_FUNCTION_ARGS)
! {
! 	FuncCallContext	   *funcctx;
! 	ListCell		  **lcp;
  
! 	/* stuff done only on the first call of the function */
! 	if (SRF_IS_FIRSTCALL())
  	{
! 		MemoryContext	oldcontext;
  
! 		/* create a function context for cross-call persistence */
! 		funcctx = SRF_FIRSTCALL_INIT();
! 
! 		/*
! 		 * switch to memory context appropriate for multiple function calls
! 		 */
! 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
! 
! 		/* allocate memory for user context */
! 		lcp = (ListCell **) palloc(sizeof(ListCell *));
! 		if (listenChannels != NIL)
! 			*lcp = list_head(listenChannels);
! 		else
! 			*lcp = NULL;
! 		funcctx->user_fctx = (void *) lcp;
! 
! 		MemoryContextSwitchTo(oldcontext);
  	}
  
! 	/* stuff done on every call of the function */
! 	funcctx = SRF_PERCALL_SETUP();
! 	lcp = (ListCell **) funcctx->user_fctx;
  
! 	while (*lcp != NULL)
! 	{
! 		char   *channel = (char *) lfirst(*lcp);
  
! 		*lcp = (*lcp)->next;
! 		SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
! 	}
  
! 	SRF_RETURN_DONE(funcctx);
! }
  
! /*
!  * Exec_ListenBeforeCommit --- subroutine for AtCommit_NotifyBeforeCommit
!  *
!  * Note that we only set our pointer here and do not yet add the channel to
!  * listenChannels. Since our transaction could still roll back we do this only
!  * after commit. We know that our tail pointer won't move between here and
!  * directly after commit, so we won't miss a notification.
!  */
! static void
! Exec_ListenBeforeCommit(const char *channel)
! {
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_ListenBeforeCommit(%s,%d)", channel, MyProcPid);
! 
! 	/* Detect whether we are already listening to something. */
! 	if (listenChannels != NIL)
! 		return;
! 
! 	/*
! 	 * We need this variable to detect an aborted initial LISTEN.
! 	 * In that case we would set up our pointer but not listen on any channel.
! 	 * This state gets cleaned up again in AtAbort_Notify().
! 	 */
! 	backendExecutesInitialListen = true;
  
! 	/*
! 	 * This is our first LISTEN, establish our pointer.
! 	 * We set our pointer to the global tail pointer, this way we make
! 	 * sure that we get all of the notifications. We might get a few more
! 	 * but that doesn't hurt.
! 	 */
! 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 	QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;
! 	QUEUE_BACKEND_PID(MyBackendId) = MyProcPid;
! 	LWLockRelease(AsyncQueueLock);
  
! 	/*
! 	 * Try to move our pointer forward as far as possible. This will skip
! 	 * over already committed notifications. Still, we could get
! 	 * notifications that have already committed before we started to
! 	 * LISTEN.
! 	 *
! 	 * Note that we are not yet listening on anything, so we won't deliver
! 	 * any notification.
! 	 *
! 	 * This will also advance the global tail pointer if necessary.
! 	 */
! 	asyncQueueReadAllNotifications();
  
  	/*
! 	 * Now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
***************
*** 512,550 ****
  }
  
  /*
!  * Exec_Unlisten --- subroutine for AtCommit_Notify
   *
!  *		Remove the current backend from the list of listening backends
!  *		for the specified relation.
   */
  static void
! Exec_Unlisten(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Unlisten(%s,%d)", relname, MyProcPid);
  
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
! 
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
  		{
! 			/* Found the matching tuple, delete it */
! 			simple_heap_delete(lRel, &tuple->t_self);
! 
! 			/*
! 			 * We assume there can be only one match, so no need to scan the
! 			 * rest of the table
! 			 */
  			break;
  		}
  	}
! 	heap_endscan(scan);
  
  	/*
  	 * We do not complain about unlistening something not being listened;
--- 856,908 ----
  }
  
  /*
!  * Exec_ListenAfterCommit --- subroutine for AtCommit_NotifyAfterCommit
   *
!  * Add the channel to the list of channels we are listening on.
   */
  static void
! Exec_ListenAfterCommit(const char *channel)
  {
! 	MemoryContext oldcontext;
! 
! 	/* Detect whether we are already listening on this channel */
! 	if (IsListeningOn(channel))
! 		return;
! 
! 	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
! 	listenChannels = lappend(listenChannels, pstrdup(channel));
! 	MemoryContextSwitchTo(oldcontext);
! }
! 
! /*
!  * Exec_UnlistenAfterCommit --- subroutine for AtCommit_NotifyAfterCommit
!  *
!  * Remove a specified channel from "listenChannels".
!  */
! static void
! Exec_UnlistenAfterCommit(const char *channel)
! {
! 	ListCell *q;
! 	ListCell *prev;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAfterCommit(%s,%d)", channel, MyProcPid);
  
! 	prev = NULL;
! 	foreach(q, listenChannels)
  	{
! 		char *lchan = (char *) lfirst(q);
! 		if (strcmp(lchan, channel) == 0)
  		{
! 			pfree(lchan);
! 			listenChannels = list_delete_cell(listenChannels, q, prev);
  			break;
  		}
+ 		prev = q;
  	}
! 
! 	if (listenChannels == NIL)
! 		asyncQueueUnregister();
  
  	/*
  	 * We do not complain about unlistening something not being listened;
***************
*** 553,690 ****
  }
  
  /*
!  * Exec_UnlistenAll --- subroutine for AtCommit_Notify
   *
!  *		Update pg_listener to unlisten all relations for this backend.
   */
  static void
! Exec_UnlistenAll(Relation lRel)
  {
- 	HeapScanDesc scan;
- 	HeapTuple	lTuple;
- 	ScanKeyData key[1];
- 
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAll");
  
! 	/* Find and delete all entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
  
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 		simple_heap_delete(lRel, &lTuple->t_self);
  
! 	heap_endscan(scan);
  }
  
  /*
!  * Send_Notify --- subroutine for AtCommit_Notify
   *
!  *		Scan pg_listener for tuples matching our pending notifies, and
!  *		either signal the other backend or send a message to our own frontend.
   */
! static void
! Send_Notify(Relation lRel)
  {
! 	TupleDesc	tdesc = RelationGetDescr(lRel);
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 
! 	/* preset data to update notify column to MyProcPid */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(MyProcPid);
! 
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		listenerPID = listener->listenerpid;
  
! 		if (!AsyncExistsPendingNotify(relname))
! 			continue;
  
! 		if (listenerPID == MyProcPid)
  		{
! 			/*
! 			 * Self-notify: no need to bother with table update. Indeed, we
! 			 * *must not* clear the notification field in this path, or we
! 			 * could lose an outside notify, which'd be bad for applications
! 			 * that ignore self-notify messages.
! 			 */
! 			if (Trace_notify)
! 				elog(DEBUG1, "AtCommit_Notify: notifying self");
  
! 			NotifyMyFrontEnd(relname, listenerPID);
  		}
  		else
  		{
- 			if (Trace_notify)
- 				elog(DEBUG1, "AtCommit_Notify: notifying pid %d",
- 					 listenerPID);
- 
  			/*
! 			 * If someone has already notified this listener, we don't bother
! 			 * modifying the table, but we do still send a NOTIFY_INTERRUPT
! 			 * signal, just in case that backend missed the earlier signal for
! 			 * some reason.  It's OK to send the signal first, because the
! 			 * other guy can't read pg_listener until we unlock it.
! 			 *
! 			 * Note: we don't have the other guy's BackendId available, so
! 			 * this will incur a search of the ProcSignal table.  That's
! 			 * probably not worth worrying about.
  			 */
! 			if (SendProcSignal(listenerPID, PROCSIG_NOTIFY_INTERRUPT,
! 							   InvalidBackendId) < 0)
  			{
! 				/*
! 				 * Get rid of pg_listener entry if it refers to a PID that no
! 				 * longer exists.  Presumably, that backend crashed without
! 				 * deleting its pg_listener entries. This code used to only
! 				 * delete the entry if errno==ESRCH, but as far as I can see
! 				 * we should just do it for any failure (certainly at least
! 				 * for EPERM too...)
! 				 */
! 				simple_heap_delete(lRel, &lTuple->t_self);
  			}
! 			else if (listener->notification == 0)
  			{
! 				/* Rewrite the tuple with my PID in notification column */
! 				rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 				simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 				CatalogUpdateIndexes(lRel, rTuple);
! #endif
  			}
  		}
  	}
  
! 	heap_endscan(scan);
  }
  
  /*
   * AtAbort_Notify
   *
!  *		This is called at transaction abort.
   *
!  *		Gets rid of pending actions and outbound notifies that we would have
!  *		executed if the transaction got committed.
   */
  void
  AtAbort_Notify(void)
  {
  	ClearPendingActionsAndNotifies();
  }
  
--- 911,1302 ----
  }
  
  /*
!  * Exec_UnlistenAllAfterCommit --- subroutine for AtCommit_NotifyAfterCommit
   *
!  *		Unlisten on all channels for this backend.
   */
  static void
! Exec_UnlistenAllAfterCommit(void)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAllAfterCommit(%d)", MyProcPid);
! 
! 	list_free_deep(listenChannels);
! 	listenChannels = NIL;
! 
! 	asyncQueueUnregister();
! }
! 
! static void
! asyncQueueUnregister(void)
! {
! 	bool	  advanceTail = false;
! 
! 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 	QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
! 	/*
! 	 * If we have been the last backend, advance the tail pointer.
! 	 */
! 	if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
! 		advanceTail = true;
! 	LWLockRelease(AsyncQueueLock);
! 
! 	if (advanceTail)
! 		asyncQueueAdvanceTail();
! }
! 
! static bool
! asyncQueueIsFull(void)
! {
! 	QueuePosition	lookahead = QUEUE_HEAD;
! 	Size			remain = QUEUE_PAGESIZE - QUEUE_POS_OFFSET(lookahead) - 1;
! 	Size			advance = Min(remain, NOTIFY_PAYLOAD_MAX_LENGTH);
  
! 	/*
! 	 * Check what happens if we wrote a maximally sized entry. Would we go to a
! 	 * new page? If not, then our queue can not be full (because we can still
! 	 * fill at least the current page with at least one more entry).
! 	 */
! 	if (!asyncQueueAdvance(&lookahead, advance))
! 		return false;
! 
! 	/*
! 	 * The queue is full if with a switch to a new page we reach the page
! 	 * of the tail pointer.
! 	 */
! 	return QUEUE_POS_PAGE(lookahead) == QUEUE_POS_PAGE(QUEUE_TAIL);
! }
! 
! /*
!  * The function advances the position to the next entry. In case we jump to
!  * a new page the function returns true, else false.
!  */
! static bool
! asyncQueueAdvance(QueuePosition *position, int entryLength)
! {
! 	int		pageno = QUEUE_POS_PAGE(*position);
! 	int		offset = QUEUE_POS_OFFSET(*position);
! 	bool	pageJump = false;
! 
! 	/*
! 	 * Move to the next writing position: First jump over what we have just
! 	 * written or read.
! 	 */
! 	offset += entryLength;
! 	Assert(offset < QUEUE_PAGESIZE);
  
! 	/*
! 	 * In a second step check if another entry can be written to the page. If
! 	 * it can, stay here: we have reached the next position. If not, then we
! 	 * need to move on to the next page.
! 	 */
! 	if (offset + AsyncQueueEntryEmptySize >= QUEUE_PAGESIZE)
! 	{
! 		pageno++;
! 		if (pageno > QUEUE_MAX_PAGE)
! 			/* wrap around */
! 			pageno = 0;
! 		offset = 0;
! 		pageJump = true;
! 	}
! 
! 	SET_QUEUE_POS(*position, pageno, offset);
! 	return pageJump;
! }
  
! static void
! asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
! {
! 	Assert(n->channel != NULL);
! 	Assert(n->payload != NULL);
! 	Assert(strlen(n->payload) < NOTIFY_PAYLOAD_MAX_LENGTH);
! 	Assert(strlen(n->channel) < NAMEDATALEN);
! 
! 	/* The terminators are already included in AsyncQueueEntryEmptySize */
! 	qe->length = AsyncQueueEntryEmptySize + strlen(n->payload)
! 										  + strlen(n->channel);
! 	qe->srcPid = MyProcPid;
! 	qe->dboid = MyDatabaseId;
! 	qe->xid = GetCurrentTransactionId();
! 	strcpy(qe->data, n->channel);
! 	Assert(*(qe->data + strlen(n->channel)) == '\0');
! 	strcpy(qe->data + strlen(n->channel) + 1, n->payload);
! }
! 
! static void
! asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n)
! {
! 	n->channel = pstrdup(qe->data);
! 	Assert(*(qe->data + strlen(qe->data)) == '\0');
! 	n->payload = pstrdup(qe->data + strlen(qe->data) + 1);
! 	n->srcPid = qe->srcPid;
! 	n->xid = qe->xid;
  }
  
  /*
!  * Add the notifications to the queue: we go page by page here, i.e. we stop
!  * once we have to go to a new page but we will be called again and then fill
!  * that next page. If an entry no longer fits on the page, we write a dummy
!  * entry with an InvalidOid as the database oid in order to fill the page. So
!  * every page is always used up to the last byte which simplifies reading the
!  * page later.
!  *
!  * We are holding AsyncQueueLock already from the caller and grab AsyncCtlLock
!  * here in this function.
   *
!  * We are passed the list of notifications to write and return the
!  * not-yet-written notifications back. Eventually we will return NIL.
   */
! static List *
! asyncQueueAddEntries(List *notifications)
  {
! 	AsyncQueueEntry	qe;
! 	int				pageno;
! 	int				offset;
! 	int				slotno;
  
! 	/*
! 	 * Note that we are holding exclusive AsyncQueueLock already.
! 	 */
! 	LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
! 	pageno = QUEUE_POS_PAGE(QUEUE_HEAD);
! 	slotno = SimpleLruReadPage(AsyncCtl, pageno, true, InvalidTransactionId);
! 	AsyncCtl->shared->page_dirty[slotno] = true;
! 
! 	do
! 	{
! 		Notification   *n;
  
! 		if (asyncQueueIsFull())
  		{
! 			/* document that we will not go into the if-block further down */
! 			Assert(QUEUE_POS_OFFSET(QUEUE_HEAD) != 0);
! 			break;
! 		}
! 
! 		n = (Notification *) linitial(notifications);
  
! 		asyncQueueNotificationToEntry(n, &qe);
! 
! 		offset = QUEUE_POS_OFFSET(QUEUE_HEAD);
! 		/*
! 		 * Check whether or not the entry still fits on the current page.
! 		 */
! 		if (offset + qe.length < QUEUE_PAGESIZE)
! 		{
! 			notifications = list_delete_first(notifications);
  		}
  		else
  		{
  			/*
! 			 * Write a dummy entry to fill up the page. Actually readers will
! 			 * only check dboid and since it won't match any reader's database
! 			 * oid, they will ignore this entry and move on.
  			 */
! 			qe.length = QUEUE_PAGESIZE - offset - 1;
! 			qe.dboid = InvalidOid;
! 			qe.data[0] = '\0'; /* empty channel */
! 			qe.data[1] = '\0'; /* empty payload */
! 			qe.xid = InvalidTransactionId;
! 		}
! 		memcpy((char*) AsyncCtl->shared->page_buffer[slotno] + offset,
! 			   &qe, qe.length);
! 
! 	} while (!asyncQueueAdvance(&(QUEUE_HEAD), qe.length)
! 			 && notifications != NIL);
! 
! 	if (QUEUE_POS_OFFSET(QUEUE_HEAD) == 0)
! 	{
! 		/*
! 		 * we need to continue on a new page, stop here but prepare that
! 		 * page already.
! 		 */
! 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
! 		AsyncCtl->shared->page_dirty[slotno] = true;
! 	}
! 	LWLockRelease(AsyncCtlLock);
! 
! 	return notifications;
! }
! 
! static void
! asyncQueueFillWarning(void)
! {
! 	/*
! 	 * Caller must hold exclusive AsyncQueueLock.
! 	 */
! 	TimestampTz		t;
! 	double			fillDegree;
! 	int				occupied;
! 	int				tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
! 	int				headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
! 
! 	occupied = headPage - tailPage;
! 
! 	if (occupied == 0)
! 		return;
! 	
! 	if (!asyncQueuePagePrecedesPhysically(tailPage, headPage))
! 		/* head has wrapped around, tail not yet */
! 		occupied += QUEUE_MAX_PAGE;
! 
! 	fillDegree = (float) occupied / (float) QUEUE_MAX_PAGE;
! 
! 	if (fillDegree < 0.5)
! 		return;
! 
! 	t = GetCurrentTimestamp();
! 
! 	if (TimestampDifferenceExceeds(asyncQueueControl->lastQueueFillWarn,
! 								   t, QUEUE_FULL_WARN_INTERVAL))
! 	{
! 		QueuePosition	min = QUEUE_HEAD;
! 		int32			minPid = InvalidPid;
! 		int				i;
! 
! 		for (i = 0; i < MaxBackends; i++)
! 			if (QUEUE_BACKEND_PID(i) != InvalidPid)
  			{
! 				min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
! 				if (QUEUE_POS_EQUAL(min, QUEUE_BACKEND_POS(i)))
! 					minPid = QUEUE_BACKEND_PID(i);
  			}
! 
! 		if (fillDegree < 0.75)
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 50%% full. "
! 								 "Among the slowest backends: %d", minPid)));
! 		else
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 75%% full. "
! 								 "Among the slowest backends: %d", minPid)));
! 
! 		asyncQueueControl->lastQueueFillWarn = t;
! 	}
! }
! 
! /*
!  * Send_Notify --- subroutine for AtCommit_NotifyBeforeCommit
!  *
!  * Add the pending notifications to the queue.
!  *
!  * A full queue is very uncommon and should really not happen, given that we
!  * have so much space available in the slru pages. Nevertheless we need to
!  * deal with this possibility. Note that when we get here we are in the process
!  * of committing our transaction, we have not yet committed to clog but this
!  * would be the next step. So at this point in time we can still roll the
!  * transaction back.
!  */
! static void
! Send_Notify(void)
! {
! 	backendSendsNotifications = true;
! 
! 	while (pendingNotifies != NIL)
! 	{
! 		LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 		asyncQueueFillWarning();
! 		if (asyncQueueIsFull())
! 			ereport(ERROR,
! 					(errcode(ERRCODE_TOO_MANY_ENTRIES),
! 					errmsg("too many notifications in the queue")));
! 		pendingNotifies = asyncQueueAddEntries(pendingNotifies);
! 		LWLockRelease(AsyncQueueLock);
! 	}
! }
! 
! /*
!  * Send signals to all listening backends. Since we have EXCLUSIVE lock anyway
!  * we also check the position of the other backends and in case that anyone is
!  * already up-to-date we don't signal it. This can happen if concurrent
!  * notifying transactions have sent a signal and the signaled backend has read
!  * the other notifications and ours in the same step.
!  *
!  * Since we know the BackendId and the Pid the signalling is quite cheap.
!  */
! static void
! SignalBackends(void)
! {
! 	QueuePosition	pos;
! 	ListCell	   *p1, *p2;
! 	int				i;
! 	int32			pid;
! 	List		   *pids = NIL;
! 	List		   *ids = NIL;
! 	int				count = 0;
! 
! 	/* Signal everybody who is LISTENing to any channel. */
! 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 	for (i = 0; i < MaxBackends; i++)
! 	{
! 		pid = QUEUE_BACKEND_PID(i);
! 		if (pid != InvalidPid)
! 		{
! 			count++;
! 			pos = QUEUE_BACKEND_POS(i);
! 			if (!QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
  			{
! 				pids = lappend_int(pids, pid);
! 				ids = lappend_int(ids, i);
  			}
  		}
  	}
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	forboth(p1, pids, p2, ids)
+ 	{
+ 		pid = (int32) lfirst_int(p1);
+ 		i = lfirst_int(p2);
+ 		/*
+ 		 * Should we check for failure? Can it happen that a backend
+ 		 * has crashed without the postmaster starting over?
+ 		 */
+ 		if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, i) < 0)
+ 			elog(WARNING, "Error signalling backend %d", pid);
+ 	}
  
! 	if (count == 0)
! 	{
! 		/* No backend is listening at all, so try to clean up the queue.
! 		 * Even if a backend has started to listen in the meantime (after
! 		 * we determined count to be 0), advancing the tail does not
! 		 * hurt. Our notifications are committed already and a newly
! 		 * listening backend would skip over them anyway. */
! 		asyncQueueAdvanceTail();
! 	}
  }
  
  /*
   * AtAbort_Notify
   *
!  *	This is called at transaction abort.
   *
!  *	Gets rid of pending actions and outbound notifies that we would have
!  *	executed if the transaction got committed.
!  *
!  *	Even though we have not committed, we need to signal the listening backends
!  *	because our notifications might block readers from processing the queue.
!  *	Now that the transaction has aborted, they can go on and skip over our
!  *	notifications. They could find notifications past ours that they need to
!  *	deliver.
   */
  void
  AtAbort_Notify(void)
  {
+ 	if (backendSendsNotifications)
+ 		SignalBackends();
+ 
+ 	/*
+ 	 * If we LISTEN but then roll back the transaction we have set our pointer
+ 	 * but have not made the entry in listenChannels. In that case, remove
+ 	 * our pointer again.
+ 	 */
+ 	if (backendExecutesInitialListen)
+ 		/*
+ 		 * Checking listenChannels should be redundant but it can't hurt doing
+ 		 * it for safety reasons.
+ 		 */
+ 		if (listenChannels == NIL)
+ 			asyncQueueUnregister();
+ 
  	ClearPendingActionsAndNotifies();
  }
  
***************
*** 940,968 ****
  }
  
  /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan pg_listener for arriving notifies, report them to my front end,
!  *		and clear the notification field in pg_listener until next time.
   *
!  *		NOTE: since we are outside any transaction, we must create our own.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	Relation	lRel;
! 	TupleDesc	tdesc;
! 	ScanKeyData key[1];
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 	bool		catchup_enabled;
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
--- 1552,1786 ----
  }
  
  /*
+  * This function asks for a page with read-only access. Once we hold the
+  * lock, we read its content and pass back the list of notifications that
+  * the calling function will then deliver. The list contains notifications
+  * of committed transactions in our database on channels we listen on.
+  *
+  * We stop if we have either reached the stop position or go to a new page.
+  *
+  * The function returns true once we have reached the end or a notification of
+  * a transaction that is still running and false if we have finished with
+  * the page. In other words: once it returns true there is no point in calling
+  * it again.
+  */
+ static bool
+ asyncQueueGetEntriesByPage(QueuePosition *current,
+ 						   QueuePosition stop,
+ 						   List **notifications)
+ {
+ 	AsyncQueueEntry	qe;
+ 	Notification   *n;
+ 	int				slotno;
+ 	bool			reachedStop = false;
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		return true;
+ 
+ 	slotno = SimpleLruReadPage_ReadOnly(AsyncCtl, current->page,
+ 										InvalidTransactionId);
+ 	do {
+ 		char *readPtr = (char *) (AsyncCtl->shared->page_buffer[slotno]);
+ 
+ 		if (QUEUE_POS_EQUAL(*current, stop))
+ 		{
+ 			reachedStop = true;
+ 			break;
+ 		}
+ 
+ 		readPtr += current->offset;
+ 		/* at first we only read the header of the notification */
+ 		memcpy(&qe, readPtr, AsyncQueueEntryEmptySize);
+ 
+ 		if (qe.dboid == MyDatabaseId)
+ 		{
+ 			if (TransactionIdDidCommit(qe.xid))
+ 			{
+ 				memcpy(&qe, readPtr, qe.length);
+ 				/* qe.data is the NUL terminated channel name */
+ 				if (IsListeningOn(qe.data))
+ 				{
+ 					n = (Notification *) palloc(sizeof(Notification));
+ 					asyncQueueEntryToNotification(&qe, n);
+ 					*notifications = lappend(*notifications, n);
+ 				}
+ 			}
+ 			else
+ 			{
+ 				if (!TransactionIdDidAbort(qe.xid))
+ 				{
+ 					/*
+ 					 * The transaction has neither committed nor aborted so
+ 					 * far.
+ 					 */
+ 					reachedStop = true;
+ 					break;
+ 				}
+ 				/*
+ 				 * Here we know that the transaction has aborted, we just
+ 				 * ignore its notifications.
+ 				 */
+ 			}
+ 		}
+ 		/*
+ 		 * The call to asyncQueueAdvance just jumps over what we have
+ 		 * just read. If there is no more space for the next record on the
+ 		 * current page, it will also switch to the beginning of the next page.
+ 		 */
+ 	} while(!asyncQueueAdvance(current, qe.length));
+ 
+ 	/*
+ 	 * Release the lock that we implicitly got from
+ 	 * SimpleLruReadPage_ReadOnly().
+ 	 */
+ 	LWLockRelease(AsyncCtlLock);
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		reachedStop = true;
+ 
+ 	return reachedStop;
+ }
+ 
+ 
+ static void
+ asyncQueueReadAllNotifications(void)
+ {
+ 	QueuePosition	pos;
+ 	QueuePosition	oldpos;
+ 	QueuePosition	head;
+ 	List		   *notifications;
+ 	ListCell	   *lc;
+ 	Notification   *n;
+ 	bool			advanceTail = false;
+ 	bool			reachedStop;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	pos = oldpos = QUEUE_BACKEND_POS(MyBackendId);
+ 	head = QUEUE_HEAD;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (QUEUE_POS_EQUAL(pos, head))
+ 	{
+ 		/* Nothing to do, we have read all notifications already. */
+ 		return;
+ 	}
+ 
+ 	do 
+ 	{
+ 		/*
+ 		 * Our stop position is what we found to be the head's position when
+ 		 * we entered this function. It might have changed already. But if it
+ 		 * has, we will receive (or have already received and queued) another
+ 		 * signal and come here again.
+ 		 *
+ 		 * We are not holding AsyncQueueLock here! The queue can only extend
+ 		 * beyond the head pointer (see above) and we leave our backend's
+ 		 * pointer where it is so nobody will truncate or rewrite pages under
+ 		 * us. In particular, we don't want to hold a lock while sending the
+ 		 * notifications to the frontend.
+ 		 */
+ 		reachedStop = false;
+ 
+ 		notifications = NIL;
+ 		reachedStop = asyncQueueGetEntriesByPage(&pos, head, &notifications);
+ 
+ 		/*
+ 		 * Note that we deliver everything that we see in the queue and that
+ 		 * matches our _current_ listening state.
+ 		 * In particular, we do not take different commit times into account.
+ 		 *
+ 		 * See the following example:
+ 		 *
+ 		 * Backend 1:                    Backend 2:
+ 		 *
+ 		 * transaction starts
+ 		 * NOTIFY foo;
+ 		 * commit starts
+ 		 *                               transaction starts
+ 		 *                               LISTEN foo;
+ 		 *                               commit starts
+ 		 * commit to clog
+ 		 *                               commit to clog
+ 		 *
+ 		 * It could happen that backend 2 sees the notification from
+ 		 * backend 1 in the queue and even though the notifying transaction
+ 		 * committed before the listening transaction, we still deliver the
+ 		 * notification.
+ 		 *
+ 		 * The idea is that an additional notification does not do any
+ 		 * harm; we just need to make sure that we do not miss a
+ 		 * notification.
+ 		 */
+ 		foreach(lc, notifications)
+ 		{
+ 			n = (Notification *) lfirst(lc);
+ 			NotifyMyFrontEnd(n->channel, n->payload, n->srcPid);
+ 		}
+ 	} while (!reachedStop);
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	QUEUE_BACKEND_POS(MyBackendId) = pos;
+ 	if (QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL))
+ 		advanceTail = true;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (advanceTail)
+ 		/* Move forward the tail pointer and try to truncate. */
+ 		asyncQueueAdvanceTail();
+ }
+ 
+ static void
+ asyncQueueAdvanceTail(void)
+ {
+ 	QueuePosition	min;
+ 	int				i;
+ 	int				tailp;
+ 	int				headp;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ 	min = QUEUE_HEAD;
+ 	for (i = 0; i < MaxBackends; i++)
+ 		if (QUEUE_BACKEND_PID(i) != InvalidPid)
+ 			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
+ 
+ 	tailp = QUEUE_POS_PAGE(QUEUE_TAIL);
+ 	headp = QUEUE_POS_PAGE(QUEUE_HEAD);
+ 	QUEUE_TAIL = min;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	/* This is our wraparound check */
+ 	if ((asyncQueuePagePrecedesLogically(tailp, QUEUE_POS_PAGE(min), headp)
+ 			&& asyncQueuePagePrecedesPhysically(tailp, headp))
+ 		|| tailp == QUEUE_POS_PAGE(min))
+ 	{
+ 		/*
+ 		 * SimpleLruTruncate() will ask for AsyncCtlLock but will also
+ 		 * release the lock again.
+ 		 *
+ 		 * XXX this could be optimized, to call SimpleLruTruncate only when we
+ 		 * know that we can truncate something.
+ 		 */
+ 		SimpleLruTruncate(AsyncCtl, QUEUE_POS_PAGE(min));
+ 	}
+ }
+ 
+ /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan the queue for arriving notifications and report them to my front
!  *		end.
   *
!  *		NOTE: we are outside of any transaction here.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	bool			catchup_enabled;
! 
! 	Assert(GetCurrentTransactionIdIfAny() == InvalidTransactionId);
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
***************
*** 974,1037 ****
  
  	notifyInterruptOccurred = 0;
  
! 	StartTransactionCommand();
! 
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
! 	tdesc = RelationGetDescr(lRel);
! 
! 	/* Scan only entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
! 
! 	/* Prepare data for rewriting 0 into notification field */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(0);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		sourcePID = listener->notification;
! 
! 		if (sourcePID != 0)
! 		{
! 			/* Notify the frontend */
! 
! 			if (Trace_notify)
! 				elog(DEBUG1, "ProcessIncomingNotify: received %s from %d",
! 					 relname, (int) sourcePID);
! 
! 			NotifyMyFrontEnd(relname, sourcePID);
! 
! 			/*
! 			 * Rewrite the tuple with 0 in notification column.
! 			 */
! 			rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 			simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 			CatalogUpdateIndexes(lRel, rTuple);
! #endif
! 		}
! 	}
! 	heap_endscan(scan);
! 
! 	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that other backends see our tuple updates when they look. Otherwise, a
! 	 * transaction started after this one might mistakenly think it doesn't
! 	 * need to send this backend a new NOTIFY.
! 	 */
! 	heap_close(lRel, NoLock);
! 
! 	CommitTransactionCommand();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
--- 1792,1798 ----
  
  	notifyInterruptOccurred = 0;
  
! 	asyncQueueReadAllNotifications();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
***************
*** 1051,1070 ****
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(char *relname, int32 listenerPID)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, listenerPID, sizeof(int32));
! 		pq_sendstring(&buf, relname);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 		{
! 			/* XXX Add parameter string here later */
! 			pq_sendstring(&buf, "");
! 		}
  		pq_endmessage(&buf);
  
  		/*
--- 1812,1828 ----
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(const char *channel, const char *payload, int32 srcPid)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, srcPid, sizeof(int32));
! 		pq_sendstring(&buf, channel);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 			pq_sendstring(&buf, payload);
  		pq_endmessage(&buf);
  
  		/*
***************
*** 1074,1096 ****
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", relname);
  }
  
! /* Does pendingNotifies include the given relname? */
  static bool
! AsyncExistsPendingNotify(const char *relname)
  {
  	ListCell   *p;
  
! 	foreach(p, pendingNotifies)
! 	{
! 		const char *prelname = (const char *) lfirst(p);
  
! 		if (strcmp(prelname, relname) == 0)
  			return true;
  	}
  
  	return false;
  }
  
--- 1832,1888 ----
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", channel);
  }
  
! /* Does pendingNotifies include the given channel/payload? */
  static bool
! AsyncExistsPendingNotify(const char *channel, const char *payload)
  {
  	ListCell   *p;
+ 	Notification *n;
  
! 	if (pendingNotifies == NIL)
! 		return false;
  
! 	if (payload == NULL)
! 		payload = "";
! 
! 	/*
! 	 * We need to append new elements to the end of the list in order to keep
! 	 * the order. However, on the other hand we'd like to check the list
! 	 * backwards in order to make duplicate-elimination a tad faster when the
! 	 * same condition is signaled many times in a row. So as a compromise we
! 	 * check the tail element first which we can access directly. If this
! 	 * doesn't match, we check the rest of the list.
! 	 */
! 
! 	n = (Notification *) llast(pendingNotifies);
! 	if (strcmp(n->channel, channel) == 0)
! 	{
! 		Assert(n->payload != NULL);
! 		if (strcmp(n->payload, payload) == 0)
  			return true;
  	}
  
+ 	/*
+ 	 * Note the difference to foreach(). We stop if p is the last element
+ 	 * already. So we don't check the last element, we have checked it already.
+ 	 */
+ 	for(p = list_head(pendingNotifies);
+ 		p != list_tail(pendingNotifies);
+ 		p = lnext(p))
+ 	{
+ 		n = (Notification *) lfirst(p);
+ 
+ 		if (strcmp(n->channel, channel) == 0)
+ 		{
+ 			Assert(n->payload != NULL);
+ 			if (strcmp(n->payload, payload) == 0)
+ 				return true;
+ 		}
+ 	}
+ 
  	return false;
  }
  
***************
*** 1107,1128 ****
  	 */
  	pendingActions = NIL;
  	pendingNotifies = NIL;
- }
  
! /*
!  * 2PC processing routine for COMMIT PREPARED case.
!  *
!  * (We don't have to do anything for ROLLBACK PREPARED.)
!  */
! void
! notify_twophase_postcommit(TransactionId xid, uint16 info,
! 						   void *recdata, uint32 len)
! {
! 	/*
! 	 * Set up to issue the NOTIFY at the end of my own current transaction.
! 	 * (XXX this has some issues if my own transaction later rolls back, or if
! 	 * there is any significant delay before I commit.	OK for now because we
! 	 * disallow COMMIT PREPARED inside a transaction block.)
! 	 */
! 	Async_Notify((char *) recdata);
  }
--- 1899,1906 ----
  	 */
  	pendingActions = NIL;
  	pendingNotifies = NIL;
  
! 	backendSendsNotifications = false;
! 	backendExecutesInitialListen = false;
  }
+ 
diff -cr cvs.head/src/backend/nodes/copyfuncs.c cvs.build/src/backend/nodes/copyfuncs.c
*** cvs.head/src/backend/nodes/copyfuncs.c	2010-01-30 22:06:33.000000000 +0100
--- cvs.build/src/backend/nodes/copyfuncs.c	2010-02-10 20:42:10.000000000 +0100
***************
*** 2771,2776 ****
--- 2771,2777 ----
  	NotifyStmt *newnode = makeNode(NotifyStmt);
  
  	COPY_STRING_FIELD(conditionname);
+ 	COPY_STRING_FIELD(payload);
  
  	return newnode;
  }
diff -cr cvs.head/src/backend/nodes/equalfuncs.c cvs.build/src/backend/nodes/equalfuncs.c
*** cvs.head/src/backend/nodes/equalfuncs.c	2010-01-30 22:06:33.000000000 +0100
--- cvs.build/src/backend/nodes/equalfuncs.c	2010-02-10 20:42:10.000000000 +0100
***************
*** 1325,1330 ****
--- 1325,1331 ----
  _equalNotifyStmt(NotifyStmt *a, NotifyStmt *b)
  {
  	COMPARE_STRING_FIELD(conditionname);
+ 	COMPARE_STRING_FIELD(payload);
  
  	return true;
  }
diff -cr cvs.head/src/backend/nodes/outfuncs.c cvs.build/src/backend/nodes/outfuncs.c
*** cvs.head/src/backend/nodes/outfuncs.c	2010-01-30 22:06:33.000000000 +0100
--- cvs.build/src/backend/nodes/outfuncs.c	2010-02-10 20:42:10.000000000 +0100
***************
*** 1818,1823 ****
--- 1818,1824 ----
  	WRITE_NODE_TYPE("NOTIFY");
  
  	WRITE_STRING_FIELD(conditionname);
+ 	WRITE_STRING_FIELD(payload);
  }
  
  static void
diff -cr cvs.head/src/backend/nodes/readfuncs.c cvs.build/src/backend/nodes/readfuncs.c
*** cvs.head/src/backend/nodes/readfuncs.c	2010-01-05 12:39:25.000000000 +0100
--- cvs.build/src/backend/nodes/readfuncs.c	2010-02-10 20:42:10.000000000 +0100
***************
*** 231,236 ****
--- 231,237 ----
  	READ_LOCALS(NotifyStmt);
  
  	READ_STRING_FIELD(conditionname);
+ 	READ_STRING_FIELD(payload);
  
  	READ_DONE();
  }
diff -cr cvs.head/src/backend/parser/gram.y cvs.build/src/backend/parser/gram.y
*** cvs.head/src/backend/parser/gram.y	2010-02-10 20:33:09.000000000 +0100
--- cvs.build/src/backend/parser/gram.y	2010-02-10 20:42:10.000000000 +0100
***************
*** 400,406 ****
  
  %type <ival>	Iconst SignedIconst
  %type <list>	Iconst_list
! %type <str>		Sconst comment_text
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
--- 400,406 ----
  
  %type <ival>	Iconst SignedIconst
  %type <list>	Iconst_list
! %type <str>		Sconst comment_text notify_payload
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
***************
*** 6113,6122 ****
   *
   *****************************************************************************/
  
! NotifyStmt: NOTIFY ColId
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
  					$$ = (Node *)n;
  				}
  		;
--- 6113,6128 ----
   *
   *****************************************************************************/
  
! notify_payload:
! 			Sconst								{ $$ = $1; }
! 			| /*EMPTY*/							{ $$ = NULL; }
! 		;
! 
! NotifyStmt: NOTIFY ColId notify_payload
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
+ 					n->payload = $3;
  					$$ = (Node *)n;
  				}
  		;
diff -cr cvs.head/src/backend/storage/ipc/ipci.c cvs.build/src/backend/storage/ipc/ipci.c
*** cvs.head/src/backend/storage/ipc/ipci.c	2010-01-20 20:08:27.000000000 +0100
--- cvs.build/src/backend/storage/ipc/ipci.c	2010-02-10 20:42:10.000000000 +0100
***************
*** 20,25 ****
--- 20,26 ----
  #include "access/nbtree.h"
  #include "access/subtrans.h"
  #include "access/twophase.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "postmaster/autovacuum.h"
***************
*** 225,230 ****
--- 226,232 ----
  	 */
  	BTreeShmemInit();
  	SyncScanShmemInit();
+ 	AsyncShmemInit();
  
  #ifdef EXEC_BACKEND
  
diff -cr cvs.head/src/backend/storage/lmgr/lwlock.c cvs.build/src/backend/storage/lmgr/lwlock.c
*** cvs.head/src/backend/storage/lmgr/lwlock.c	2010-01-05 12:39:29.000000000 +0100
--- cvs.build/src/backend/storage/lmgr/lwlock.c	2010-02-10 20:42:10.000000000 +0100
***************
*** 24,29 ****
--- 24,30 ----
  #include "access/clog.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "storage/ipc.h"
***************
*** 174,179 ****
--- 175,183 ----
  	/* multixact.c needs two SLRU areas */
  	numLocks += NUM_MXACTOFFSET_BUFFERS + NUM_MXACTMEMBER_BUFFERS;
  
+ 	/* async.c needs one per page for the AsyncQueue */
+ 	numLocks += NUM_ASYNC_BUFFERS;
+ 
  	/*
  	 * Add any requested by loadable modules; for backwards-compatibility
  	 * reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
diff -cr cvs.head/src/backend/tcop/utility.c cvs.build/src/backend/tcop/utility.c
*** cvs.head/src/backend/tcop/utility.c	2010-01-30 22:06:36.000000000 +0100
--- cvs.build/src/backend/tcop/utility.c	2010-02-10 20:51:11.000000000 +0100
***************
*** 930,936 ****
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
  				PreventCommandDuringRecovery();
  
! 				Async_Notify(stmt->conditionname);
  			}
  			break;
  
--- 930,936 ----
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
  				PreventCommandDuringRecovery();
  
! 				Async_Notify(stmt->conditionname, stmt->payload);
  			}
  			break;
  
diff -cr cvs.head/src/bin/initdb/initdb.c cvs.build/src/bin/initdb/initdb.c
*** cvs.head/src/bin/initdb/initdb.c	2010-01-30 22:06:37.000000000 +0100
--- cvs.build/src/bin/initdb/initdb.c	2010-02-10 20:42:10.000000000 +0100
***************
*** 2458,2463 ****
--- 2458,2464 ----
  		"pg_xlog",
  		"pg_xlog/archive_status",
  		"pg_clog",
+ 		"pg_notify",
  		"pg_subtrans",
  		"pg_twophase",
  		"pg_multixact/members",
diff -cr cvs.head/src/bin/psql/common.c cvs.build/src/bin/psql/common.c
*** cvs.head/src/bin/psql/common.c	2010-01-05 12:39:33.000000000 +0100
--- cvs.build/src/bin/psql/common.c	2010-02-10 20:42:10.000000000 +0100
***************
*** 555,562 ****
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" received from server process with PID %d.\n"),
! 				notify->relname, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
--- 555,562 ----
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" (%s) received from server process with PID %d.\n"),
! 				notify->relname, notify->extra, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
diff -cr cvs.head/src/bin/psql/tab-complete.c cvs.build/src/bin/psql/tab-complete.c
*** cvs.head/src/bin/psql/tab-complete.c	2010-01-30 22:06:37.000000000 +0100
--- cvs.build/src/bin/psql/tab-complete.c	2010-02-11 01:28:45.000000000 +0100
***************
*** 1852,1858 ****
  
  /* NOTIFY */
  	else if (pg_strcasecmp(prev_wd, "NOTIFY") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(relname) FROM pg_catalog.pg_listener WHERE substring(pg_catalog.quote_ident(relname),1,%d)='%s'");
  
  /* OPTIONS */
  	else if (pg_strcasecmp(prev_wd, "OPTIONS") == 0)
--- 1852,1858 ----
  
  /* NOTIFY */
  	else if (pg_strcasecmp(prev_wd, "NOTIFY") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(channel) FROM pg_catalog.pg_listening() AS channel WHERE substring(pg_catalog.quote_ident(channel),1,%d)='%s'");
  
  /* OPTIONS */
  	else if (pg_strcasecmp(prev_wd, "OPTIONS") == 0)
***************
*** 2093,2099 ****
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(relname) FROM pg_catalog.pg_listener WHERE substring(pg_catalog.quote_ident(relname),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
--- 2093,2099 ----
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(channel) FROM pg_catalog.pg_listening() AS channel WHERE substring(pg_catalog.quote_ident(channel),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
diff -cr cvs.head/src/include/access/slru.h cvs.build/src/include/access/slru.h
*** cvs.head/src/include/access/slru.h	2010-01-05 12:39:34.000000000 +0100
--- cvs.build/src/include/access/slru.h	2010-02-10 20:42:10.000000000 +0100
***************
*** 16,21 ****
--- 16,40 ----
  #include "access/xlogdefs.h"
  #include "storage/lwlock.h"
  
+ /*
+  * Define segment size.  A page is the same BLCKSZ as is used everywhere
+  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
+  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
+  * or 64K transactions for SUBTRANS.
+  *
+  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
+  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
+  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+  * take no explicit notice of that fact in this module, except when comparing
+  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
+  *
+  * Note: this file currently assumes that segment file names will be four
+  * hex digits.	This sets a lower bound on the segment size (64K transactions
+  * for 32-bit TransactionIds).
+  */
+ #define SLRU_PAGES_PER_SEGMENT	32
+ 
  
  /*
   * Page status codes.  Note that these do not include the "dirty" bit.
diff -cr cvs.head/src/include/access/twophase_rmgr.h cvs.build/src/include/access/twophase_rmgr.h
*** cvs.head/src/include/access/twophase_rmgr.h	2010-01-05 12:39:34.000000000 +0100
--- cvs.build/src/include/access/twophase_rmgr.h	2010-02-10 20:42:10.000000000 +0100
***************
*** 23,31 ****
   */
  #define TWOPHASE_RM_END_ID			0
  #define TWOPHASE_RM_LOCK_ID			1
! #define TWOPHASE_RM_NOTIFY_ID		2
! #define TWOPHASE_RM_PGSTAT_ID		3
! #define TWOPHASE_RM_MULTIXACT_ID	4
  #define TWOPHASE_RM_MAX_ID			TWOPHASE_RM_MULTIXACT_ID
  
  extern const TwoPhaseCallback twophase_recover_callbacks[];
--- 23,30 ----
   */
  #define TWOPHASE_RM_END_ID			0
  #define TWOPHASE_RM_LOCK_ID			1
! #define TWOPHASE_RM_PGSTAT_ID		2
! #define TWOPHASE_RM_MULTIXACT_ID	3
  #define TWOPHASE_RM_MAX_ID			TWOPHASE_RM_MULTIXACT_ID
  
  extern const TwoPhaseCallback twophase_recover_callbacks[];
diff -cr cvs.head/src/include/catalog/pg_proc.h cvs.build/src/include/catalog/pg_proc.h
*** cvs.head/src/include/catalog/pg_proc.h	2010-02-10 20:33:17.000000000 +0100
--- cvs.build/src/include/catalog/pg_proc.h	2010-02-10 20:55:24.000000000 +0100
***************
*** 4127,4132 ****
--- 4127,4136 ----
  DESCR("get the prepared statements for this session");
  DATA(insert OID = 2511 (  pg_cursor PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,16,16,16,1184}" "{o,o,o,o,o,o}" "{name,statement,is_holdable,is_binary,is_scrollable,creation_time}" _null_ pg_cursor _null_ _null_ _null_ ));
  DESCR("get the open cursors for this session");
+ DATA(insert OID = 3036 (  pg_listening	PGNSP	PGUID 12 1 10 0 f f f t t s 0 0 25 "" _null_ _null_ _null_ _null_ pg_listening _null_ _null_ _null_ ));
+ DESCR("get the channels that the current backend listens to");
+ DATA(insert OID = 3035 (  pg_notify  PGNSP PGUID 12 1 0 0 f f f f f v 2 0 2278 "25 25" _null_ _null_ _null_ _null_ pg_notify _null_ _null_ _null_));
+ DESCR("send a notification to clients");
  DATA(insert OID = 2599 (  pg_timezone_abbrevs	PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,1186,16}" "{o,o,o}" "{abbrev,utc_offset,is_dst}" _null_ pg_timezone_abbrevs _null_ _null_ _null_ ));
  DESCR("get the available time zone abbreviations");
  DATA(insert OID = 2856 (  pg_timezone_names		PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,1186,16}" "{o,o,o,o}" "{name,abbrev,utc_offset,is_dst}" _null_ pg_timezone_names _null_ _null_ _null_ ));
diff -cr cvs.head/src/include/commands/async.h cvs.build/src/include/commands/async.h
*** cvs.head/src/include/commands/async.h	2010-01-05 12:39:35.000000000 +0100
--- cvs.build/src/include/commands/async.h	2010-02-10 20:43:58.000000000 +0100
***************
*** 13,28 ****
  #ifndef ASYNC_H
  #define ASYNC_H
  
  extern bool Trace_notify;
  
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_Notify(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
--- 13,41 ----
  #ifndef ASYNC_H
  #define ASYNC_H
  
+ /*
+  * Maximum size of the payload, including terminating NUL.
+  */
+ #define NOTIFY_PAYLOAD_MAX_LENGTH	8000
+ 
+ /*
+  * The number of page slots that we reserve.
+  */
+ #define NUM_ASYNC_BUFFERS			4
+ 
  extern bool Trace_notify;
  
+ extern void AsyncShmemInit(void);
+ 
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname, const char *payload);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_NotifyBeforeCommit(void);
! extern void AtCommit_NotifyAfterCommit(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
***************
*** 43,46 ****
--- 56,62 ----
  extern void notify_twophase_postcommit(TransactionId xid, uint16 info,
  						   void *recdata, uint32 len);
  
+ extern Datum pg_listening(PG_FUNCTION_ARGS);
+ extern Datum pg_notify(PG_FUNCTION_ARGS);
+ 
  #endif   /* ASYNC_H */
diff -cr cvs.head/src/include/nodes/parsenodes.h cvs.build/src/include/nodes/parsenodes.h
*** cvs.head/src/include/nodes/parsenodes.h	2010-02-10 20:33:18.000000000 +0100
--- cvs.build/src/include/nodes/parsenodes.h	2010-02-10 20:42:10.000000000 +0100
***************
*** 2084,2089 ****
--- 2084,2090 ----
  {
  	NodeTag		type;
  	char	   *conditionname;	/* condition name to notify */
+ 	char	   *payload;		/* the payload string to be conveyed */
  } NotifyStmt;
  
  /* ----------------------
diff -cr cvs.head/src/include/storage/lwlock.h cvs.build/src/include/storage/lwlock.h
*** cvs.head/src/include/storage/lwlock.h	2010-02-10 20:33:18.000000000 +0100
--- cvs.build/src/include/storage/lwlock.h	2010-02-10 20:43:06.000000000 +0100
***************
*** 68,73 ****
--- 68,75 ----
  	AutovacuumScheduleLock,
  	SyncScanLock,
  	RelationMappingLock,
+  	AsyncCtlLock,
+  	AsyncQueueLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff -cr cvs.head/src/include/utils/errcodes.h cvs.build/src/include/utils/errcodes.h
*** cvs.head/src/include/utils/errcodes.h	2010-01-05 12:39:36.000000000 +0100
--- cvs.build/src/include/utils/errcodes.h	2010-02-10 20:42:10.000000000 +0100
***************
*** 318,323 ****
--- 318,324 ----
  #define ERRCODE_STATEMENT_TOO_COMPLEX		MAKE_SQLSTATE('5','4', '0','0','1')
  #define ERRCODE_TOO_MANY_COLUMNS			MAKE_SQLSTATE('5','4', '0','1','1')
  #define ERRCODE_TOO_MANY_ARGUMENTS			MAKE_SQLSTATE('5','4', '0','2','3')
+ #define ERRCODE_TOO_MANY_ENTRIES			MAKE_SQLSTATE('5','4', '0','3','1')
  
  /* Class 55 - Object Not In Prerequisite State (class borrowed from DB2) */
  #define ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE	MAKE_SQLSTATE('5','5', '0','0','0')
diff -cr cvs.head/src/test/regress/expected/guc.out cvs.build/src/test/regress/expected/guc.out
*** cvs.head/src/test/regress/expected/guc.out	2009-11-22 06:20:41.000000000 +0100
--- cvs.build/src/test/regress/expected/guc.out	2010-02-10 20:42:10.000000000 +0100
***************
*** 532,540 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
!   relname  
! -----------
   foo_event
  (1 row)
  
--- 532,540 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
!  pg_listening 
! --------------
   foo_event
  (1 row)
  
***************
*** 571,579 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
!  relname 
! ---------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
--- 571,579 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
!  pg_listening 
! --------------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
diff -cr cvs.head/src/test/regress/expected/sanity_check.out cvs.build/src/test/regress/expected/sanity_check.out
*** cvs.head/src/test/regress/expected/sanity_check.out	2010-01-20 20:08:32.000000000 +0100
--- cvs.build/src/test/regress/expected/sanity_check.out	2010-02-10 20:42:10.000000000 +0100
***************
*** 107,113 ****
   pg_language             | t
   pg_largeobject          | t
   pg_largeobject_metadata | t
-  pg_listener             | f
   pg_namespace            | t
   pg_opclass              | t
   pg_operator             | t
--- 107,112 ----
***************
*** 154,160 ****
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (143 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
--- 153,159 ----
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (142 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
diff -cr cvs.head/src/test/regress/sql/guc.sql cvs.build/src/test/regress/sql/guc.sql
*** cvs.head/src/test/regress/sql/guc.sql	2009-10-21 22:38:58.000000000 +0200
--- cvs.build/src/test/regress/sql/guc.sql	2010-02-10 20:42:10.000000000 +0100
***************
*** 165,171 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 165,171 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
***************
*** 174,180 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 174,180 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
#101 Simon Riggs
simon@2ndQuadrant.com
In reply to: Joachim Wieland (#100)
Re: Listen / Notify - what to do when the queue is full

On Thu, 2010-02-11 at 00:52 +0100, Joachim Wieland wrote:

On Mon, Feb 8, 2010 at 5:16 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

These are the on-disk notifications, right? It seems to me a bit
wasteful to store channel name always as NAMEDATALEN bytes. Can we
truncate it at its strlen?

Attached is a new and hopefully more or less final patch for LISTEN / NOTIFY.

The following items have been addressed in this patch:

- only store strlen(channel) instead of NAMEDATALEN bytes on disk
- limit to 7-bit ASCII
- forbid 2PC and LISTEN/NOTIFY for now
- documentation changes
- add missing tab completion for NOTIFY
- fix pg_notify() behavior with respect to NULLs, too long and too
short parameters
- rebased to current HEAD, OID conflicts resolved
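
For illustration only (this is not code from the patch): a minimal sketch of what storing strlen(channel) instead of a padded NAMEDATALEN field means for the on-disk entry size, together with the 7-bit ASCII check. The header layout, the constants and every sketch_* name are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NAMEDATALEN 64    /* assumed to mirror NAMEDATALEN */
#define SKETCH_PAYLOAD_MAX 8000  /* payload must be shorter than this */

/* Hypothetical fixed header that precedes the variable-length strings. */
typedef struct
{
    int          length;   /* total bytes used by this entry */
    unsigned int dboid;    /* database of the sender */
    unsigned int xid;      /* sender's transaction id */
    int          srcpid;   /* sender's process id */
} SketchEntryHeader;

/* True if every byte of s is 7-bit ASCII. */
bool
sketch_is_7bit_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
    {
        if ((unsigned char) *s > 127)
            return false;
    }
    return true;
}

/*
 * Bytes needed when only strlen(channel) is stored rather than a padded
 * NAMEDATALEN field.  Returns 0 if the input is rejected.
 */
size_t
sketch_entry_size(const char *channel, const char *payload)
{
    size_t chanlen;
    size_t paylen;

    assert(channel != NULL && payload != NULL);
    chanlen = strlen(channel);
    paylen = strlen(payload);

    if (chanlen == 0 || chanlen >= SKETCH_NAMEDATALEN)
        return 0;               /* too short or too long channel name */
    if (paylen >= SKETCH_PAYLOAD_MAX)
        return 0;               /* payload too long */
    if (!sketch_is_7bit_ascii(channel) || !sketch_is_7bit_ascii(payload))
        return 0;               /* non-ASCII bytes are refused */

    /* header + channel + NUL + payload + NUL */
    return sizeof(SketchEntryHeader) + chanlen + 1 + paylen + 1;
}
```

With a short channel name such as "foo" and an empty payload, the entry needs only five bytes beyond the header, where a padded name field would always cost NAMEDATALEN bytes.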

Some minor review comments without having taken in much of previous
discussion:

* Patch doesn't apply cleanly anymore, close.

* In async.c you say "new async notification model". Please say "async
notification model in 9.0 is". In (2) you say there is a central queue,
but don't describe purpose of queue or what it contains. Some other
typos in header comments.

* There is no mention of what to do with pg_notify at checkpoint. Look
at how pg_subtrans handles this. Should pg_notify do the same?

* Is there a lazy backend avoidance scheme as exists for relcache
invalidation messages? see storage/ipc/sinval...
OK, I see asyncQueueFillWarning() but nowhere else says it exists and
there aren't any comments in it to say what it does or why

* What do you expect the DBA to do when they receive a queue fill
warning? Where is that written down?

* Not clear of the purpose of backendSendsNotifications. In
AtCommit_NotifyAfterCommit() the logic seems strange. Code is
    /* Allow transactions that have not executed LISTEN/UNLISTEN/NOTIFY to
     * return as soon as possible */
    if (!pendingActions && !backendSendsNotifications)
        return;
but since backendSendsNotifications is set true earlier there is no fast
path.

* AtSubCommit_Notify doesn't seem to have a fastpath when no notify
commands have been executed.

* In Send_Notify() you throw ERROR while still holding the lock. It
seems better practice to drop the lock then ERROR.

* Why is Send_Notify() a separate function? It's only called from one
point in the code.

* We know that sometimes a FATAL error can occur without properly
processing the abort. Do we depend upon this never happening?

* Can we make NUM_ASYNC_BUFFERS = 8? 4 just seems too small

* backendSendsNotifications is future tense. The variable should be
called something like hasSentNotifications. Couple of other variables
with similar names

Hope that helps

--
Simon Riggs www.2ndQuadrant.com

#102Joachim Wieland
joe@mcknight.de
In reply to: Simon Riggs (#101)
1 attachment(s)
Re: Listen / Notify - what to do when the queue is full

On Sun, Feb 14, 2010 at 2:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Some minor review comments without having taken in much of previous
discussion:

* Patch doesn't apply cleanly anymore, close.

New version is rebased against current HEAD.

* In async.c you say "new async notification model". Please say "async
notification model in 9.0 is".

I haven't changed that line at all... ;-) But the fact that the
former "new async model" is now the old one really shows that a
version number makes sense here.

In (2) you say there is a central queue,

fixed with some additional comments.

* There is no mention of what to do with pg_notify at checkpoint. Look
at how pg_subtrans handles this. Should pg_notify do the same?

Actually we don't care... We even hope that the pg_notify pages are
not flushed at all. Notifications don't survive a server restart
anyway and upon restart we just delete whatever is in the directory.

* Is there a lazy backend avoidance scheme as exists for relcache
invalidation messages? see storage/ipc/sinval...

We cannot avoid a lazy backend. If a listening backend is (probably
idle) in a running transaction we don't clean up the notification
queue anymore but wait for the slowest backend to do it. But there are
other effects of long running transactions that you will probably
notice long before your notification queue gets filled up.
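
As an illustration of why one lazy backend holds up cleanup, here is a minimal sketch (not the patch's code) in which the queue tail can only advance to the smallest read position among all listening backends. The types, the sketch_* names and the decision to ignore page wraparound are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue position: page number plus byte offset in the page. */
typedef struct
{
    int page;
    int offset;
} SketchQueuePos;

/* Hypothetical per-backend slot; pid == 0 marks a slot with no listener. */
typedef struct
{
    int            pid;
    SketchQueuePos pos;     /* next entry this listener will read */
} SketchListener;

/* Nonzero if a lies strictly before b (wraparound is ignored here). */
int
sketch_pos_before(SketchQueuePos a, SketchQueuePos b)
{
    return a.page < b.page || (a.page == b.page && a.offset < b.offset);
}

/*
 * The tail may advance only to the minimum read position of all
 * listeners.  With no listeners, everything up to head can be discarded.
 */
SketchQueuePos
sketch_new_tail(const SketchListener *listeners, size_t n,
                SketchQueuePos head)
{
    SketchQueuePos min = head;
    size_t         i;

    assert(listeners != NULL || n == 0);
    for (i = 0; i < n; i++)
    {
        if (listeners[i].pid == 0)
            continue;           /* unused slot */
        if (sketch_pos_before(listeners[i].pos, min))
            min = listeners[i].pos;
    }
    return min;
}
```

A single listener stuck at an early page pins the result there no matter how far the other listeners have read, which is the situation described above.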

OK, I see asyncQueueFillWarning() but nowhere else says it exists and
there aren't any comments in it to say what it does or why
* What do you expect the DBA to do when they receive a queue fill
warning? Where is that written down?

Comment added as well as an errdetail for whoever sees this message:

WARNING: pg_notify queue is more than 50% full. Among the slowest backends: 27744
DETAIL: Cleanup can only proceed if this backend ends its current transaction

I have also added a note in the NOTIFY documentation about this.
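
A rough sketch of the check behind such a warning, under stated assumptions: the queue is measured in pages, the warning fires once more than half of the pages are in use, and it names the listener with the oldest read position. The function name and the array-based bookkeeping are hypothetical and are not taken from asyncQueueFillWarning().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the pid of the slowest listener if more than half of the queue
 * pages are in use, or 0 if no warning is due (or nobody is listening).
 * pids[i] == 0 marks an unused slot; read_pages[i] is the page that
 * listener i will read next.
 */
int
sketch_fill_warning_pid(int used_pages, int max_pages,
                        const int *pids, const int *read_pages, int n)
{
    int slowest_pid = 0;
    int slowest_page = 0;
    int i;

    assert(max_pages > 0 && used_pages >= 0 && n >= 0);
    assert((pids != NULL && read_pages != NULL) || n == 0);

    /* Compare against max/2 rather than doubling, to avoid int overflow. */
    if (used_pages <= max_pages / 2)
        return 0;

    for (i = 0; i < n; i++)
    {
        if (pids[i] == 0)
            continue;
        if (slowest_pid == 0 || read_pages[i] < slowest_page)
        {
            slowest_pid = pids[i];
            slowest_page = read_pages[i];
        }
    }
    return slowest_pid;
}
```

The returned pid is what a DBA would act on: that backend has to end its current transaction before cleanup can proceed.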

* Not clear of the purpose of backendSendsNotifications. In
AtCommit_NotifyAfterCommit() the logic seems strange. Code is
    /* Allow transactions that have not executed LISTEN/UNLISTEN/NOTIFY to
     * return as soon as possible */
    if (!pendingActions && !backendSendsNotifications)
        return;
but since backendSendsNotifications is set true earlier there is no fast
path.

What exactly is the strange logic that you're seeing? The fast path is
for transactions that do neither LISTEN/UNLISTEN nor NOTIFY. Without
LISTEN/UNLISTEN they have pendingActions = NIL and without NOTIFY they
have backendSendsNotifications = false and will return immediately.

backendSendsNotifications is only set to be true if the backend has
executed NOTIFY...
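
To make the three cases concrete, a small sketch of the predicate as described above. The struct is a stand-in invented for this example (the patch uses file-level variables), and a NULL pointer plays the role of NIL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-transaction state, named after the variables discussed. */
typedef struct
{
    const void *pendingActions;             /* NULL plays the role of NIL */
    bool        backendSendsNotifications;  /* set only by NOTIFY */
} SketchXactState;

/* True if commit-time notify processing can be skipped entirely. */
bool
sketch_can_take_fast_path(const SketchXactState *st)
{
    assert(st != NULL);
    return st->pendingActions == NULL && !st->backendSendsNotifications;
}
```

A transaction that ran neither LISTEN/UNLISTEN nor NOTIFY satisfies both conditions and returns immediately; running either kind of command disables the fast path.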

* AtSubCommit_Notify doesn't seem to have a fastpath when no notify
commands have been executed.

I haven't changed this function at all. I don't see much benefit
either as it's just concatenating empty lists. This immediately
returns from the respective list handling function.

* In Send_Notify() you throw ERROR while still holding the lock. It
seems better practice to drop the lock then ERROR.

Explicit LWLockRelease() added.
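
The pattern, sketched with a toy lock so that it compiles on its own: SketchLock is a plain flag standing in for an LWLock, and a -1 return value stands in for raising ERROR. None of these names come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an LWLock: just remembers whether it is held. */
typedef struct
{
    bool held;
} SketchLock;

typedef struct
{
    SketchLock lock;
    int        used;        /* slots currently occupied */
    int        capacity;    /* total slots available */
} SketchQueue;

void
sketch_acquire(SketchLock *l)
{
    assert(l != NULL && !l->held);
    l->held = true;
}

void
sketch_release(SketchLock *l)
{
    assert(l != NULL && l->held);
    l->held = false;
}

/*
 * Adds n entries.  Returns 0 on success and -1 if the queue is full.
 * The lock is never held when this function returns.
 */
int
sketch_enqueue(SketchQueue *q, int n)
{
    assert(q != NULL && n >= 0);
    sketch_acquire(&q->lock);
    if (n > q->capacity - q->used)
    {
        /* Drop the lock first, then report the failure. */
        sketch_release(&q->lock);
        return -1;
    }
    q->used += n;
    sketch_release(&q->lock);
    return 0;
}
```

Releasing before the error exit keeps the lock state obvious at every return point instead of relying on later cleanup.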

* Why is Send_Notify() a separate function? It's only called from one
point in the code.

I have now inlined Send_Notify().

* We know that sometimes a FATAL error can occur without properly
processing the abort. Do we depend upon this never happening?

What exactly could happen? I thought that a backend either does a
clean shutdown or that the postmaster starts over. We depend on a
listening backend to execute its on_shmem_exit() handlers before
leaving the game...

* Can we make NUM_ASYNC_BUFFERS = 8? 4 just seems too small

I have done some tests with NUM_ASYNC_BUFFERS and couldn't find any
measurable performance difference at all, so I kept it at 4. 8 isn't
exaggerated either and seems to be a fine number though. Changed to 8.

* backendSendsNotifications is future tense. The variable should be
called something like hasSentNotifications. Couple of other variables
with similar names

Variable names changed as proposed.

New patch attached, thanks for the review.

Joachim

Attachments:

listennotify.12.diff (text/x-patch; charset=US-ASCII)
diff -cr cvs.head/doc/src/sgml/catalogs.sgml cvs.build/doc/src/sgml/catalogs.sgml
*** cvs.head/doc/src/sgml/catalogs.sgml	2010-02-10 20:32:57.000000000 +0100
--- cvs.build/doc/src/sgml/catalogs.sgml	2010-02-14 16:04:25.000000000 +0100
***************
*** 169,179 ****
       </row>
  
       <row>
-       <entry><link linkend="catalog-pg-listener"><structname>pg_listener</structname></link></entry>
-       <entry>asynchronous notification support</entry>
-      </row>
- 
-      <row>
        <entry><link linkend="catalog-pg-namespace"><structname>pg_namespace</structname></link></entry>
        <entry>schemas</entry>
       </row>
--- 169,174 ----
***************
*** 3253,3320 ****
    </table>
   </sect1>
  
-  <sect1 id="catalog-pg-listener">
-   <title><structname>pg_listener</structname></title>
- 
-   <indexterm zone="catalog-pg-listener">
-    <primary>pg_listener</primary>
-   </indexterm>
- 
-   <para>
-    The catalog <structname>pg_listener</structname> supports the
-    <xref linkend="sql-listen" endterm="sql-listen-title"> and
-    <xref linkend="sql-notify" endterm="sql-notify-title">
-    commands.  A listener creates an entry in
-    <structname>pg_listener</structname> for each notification name
-    it is listening for.  A notifier scans <structname>pg_listener</structname>
-    and updates each matching entry to show that a notification has occurred.
-    The notifier also sends a signal (using the PID recorded in the table)
-    to awaken the listener from sleep.
-   </para>
- 
-   <table>
-    <title><structname>pg_listener</> Columns</title>
- 
-    <tgroup cols="3">
-     <thead>
-      <row>
-       <entry>Name</entry>
-       <entry>Type</entry>
-       <entry>Description</entry>
-      </row>
-     </thead>
- 
-     <tbody>
-      <row>
-       <entry><structfield>relname</structfield></entry>
-       <entry><type>name</type></entry>
-       <entry>
-        Notify condition name.  (The name need not match any actual
-        relation in the database; the name <structfield>relname</> is historical.)
-       </entry>
-      </row>
- 
-      <row>
-       <entry><structfield>listenerpid</structfield></entry>
-       <entry><type>int4</type></entry>
-       <entry>PID of the server process that created this entry</entry>
-      </row>
- 
-      <row>
-       <entry><structfield>notification</structfield></entry>
-       <entry><type>int4</type></entry>
-       <entry>
-        Zero if no event is pending for this listener.  If an event is
-        pending, the PID of the server process that sent the notification
-       </entry>
-      </row>
-     </tbody>
-    </tgroup>
-   </table>
- 
-  </sect1>
- 
- 
   <sect1 id="catalog-pg-namespace">
    <title><structname>pg_namespace</structname></title>
  
--- 3248,3253 ----
diff -cr cvs.head/doc/src/sgml/func.sgml cvs.build/doc/src/sgml/func.sgml
*** cvs.head/doc/src/sgml/func.sgml	2010-02-14 16:02:44.000000000 +0100
--- cvs.build/doc/src/sgml/func.sgml	2010-02-14 16:04:25.000000000 +0100
***************
*** 11574,11579 ****
--- 11574,11585 ----
        </row>
  
        <row>
+        <entry><literal><function>pg_listening</function>()</literal></entry>
+        <entry><type>set of text</type></entry>
+        <entry>channels that the session is currently listening on</entry>
+       </row>
+ 
+       <row>
         <entry><literal><function>session_user</function></literal></entry>
         <entry><type>name</type></entry>
         <entry>session user name</entry>
***************
*** 11740,11745 ****
--- 11746,11760 ----
     </para>
  
     <indexterm>
+     <primary>pg_listening</primary>
+    </indexterm>
+ 
+    <para>
+     <function>pg_listening</function> returns a set of channels that the
+     current session is listening to. See <xref linkend="sql-listen" endterm="sql-listen-title"> for more information.
+    </para>
+ 
+    <indexterm>
      <primary>version</primary>
     </indexterm>
  
diff -cr cvs.head/doc/src/sgml/libpq.sgml cvs.build/doc/src/sgml/libpq.sgml
*** cvs.head/doc/src/sgml/libpq.sgml	2010-02-10 20:32:59.000000000 +0100
--- cvs.build/doc/src/sgml/libpq.sgml	2010-02-14 16:04:25.000000000 +0100
***************
*** 4116,4125 ****
     can stop listening with the <command>UNLISTEN</command> command).  All
     sessions listening on a particular condition will be notified
     asynchronously when a <command>NOTIFY</command> command with that
!    condition name is executed by any session.  No additional information
!    is passed from the notifier to the listener.  Thus, typically, any
!    actual data that needs to be communicated is transferred through a
!    database table.  Commonly, the condition name is the same as the
     associated table, but it is not necessary for there to be any associated
     table.
    </para>
--- 4116,4124 ----
     can stop listening with the <command>UNLISTEN</command> command).  All
     sessions listening on a particular condition will be notified
     asynchronously when a <command>NOTIFY</command> command with that
!    condition name is executed by any session. A notification parameter can be
!    used to communicate additional data to a listener.
!    Commonly, the condition name is the same as the
     associated table, but it is not necessary for there to be any associated
     table.
    </para>
***************
*** 4153,4161 ****
     <function>PQfreemem</function>.  It is sufficient to free the
     <structname>PGnotify</structname> pointer; the
     <structfield>relname</structfield> and <structfield>extra</structfield>
!    fields do not represent separate allocations.  (At present, the
!    <structfield>extra</structfield> field is unused and will always point
!    to an empty string.)
    </para>
  
    <para>
--- 4152,4158 ----
     <function>PQfreemem</function>.  It is sufficient to free the
     <structname>PGnotify</structname> pointer; the
     <structfield>relname</structfield> and <structfield>extra</structfield>
!    fields do not represent separate allocations.
    </para>
  
    <para>
diff -cr cvs.head/doc/src/sgml/protocol.sgml cvs.build/doc/src/sgml/protocol.sgml
*** cvs.head/doc/src/sgml/protocol.sgml	2010-02-10 20:32:59.000000000 +0100
--- cvs.build/doc/src/sgml/protocol.sgml	2010-02-14 16:04:25.000000000 +0100
***************
*** 3192,3199 ****
  <listitem>
  <para>
                  Additional information passed from the notifying process.
-                 (Currently, this feature is unimplemented so the field
-                 is always an empty string.)
  </para>
  </listitem>
  </varlistentry>
--- 3192,3197 ----
diff -cr cvs.head/doc/src/sgml/ref/listen.sgml cvs.build/doc/src/sgml/ref/listen.sgml
*** cvs.head/doc/src/sgml/ref/listen.sgml	2008-11-14 11:22:47.000000000 +0100
--- cvs.build/doc/src/sgml/ref/listen.sgml	2010-02-14 16:04:25.000000000 +0100
***************
*** 74,79 ****
--- 74,86 ----
   </refsect1>
  
   <refsect1>
+   <title>Notes</title>
+   <para>
+    A transaction that has executed <command>LISTEN</command> cannot be prepared for a two-phase commit.
+   </para>
+  </refsect1>
+ 
+  <refsect1>
    <title>Parameters</title>
  
    <variablelist>
***************
*** 98,104 ****
  <programlisting>
  LISTEN virtual;
  NOTIFY virtual;
! Asynchronous notification "virtual" received from server process with PID 8448.
  </programlisting>
    </para>
   </refsect1>
--- 105,111 ----
  <programlisting>
  LISTEN virtual;
  NOTIFY virtual;
! Asynchronous notification "virtual" () received from server process with PID 8448.
  </programlisting>
    </para>
   </refsect1>
diff -cr cvs.head/doc/src/sgml/ref/notify.sgml cvs.build/doc/src/sgml/ref/notify.sgml
*** cvs.head/doc/src/sgml/ref/notify.sgml	2008-11-14 11:22:47.000000000 +0100
--- cvs.build/doc/src/sgml/ref/notify.sgml	2010-02-14 18:03:14.000000000 +0100
***************
*** 21,27 ****
  
   <refsynopsisdiv>
  <synopsis>
! NOTIFY <replaceable class="PARAMETER">name</replaceable>        
  </synopsis>
   </refsynopsisdiv>
  
--- 21,27 ----
  
   <refsynopsisdiv>
  <synopsis>
! NOTIFY <replaceable class="PARAMETER">name</replaceable> [ <replaceable class="PARAMETER">parameter</replaceable> ]
  </synopsis>
   </refsynopsisdiv>
  
***************
*** 29,36 ****
    <title>Description</title>
  
    <para>
!    The <command>NOTIFY</command> command sends a notification event to each
!    client application that has previously executed
     <command>LISTEN <replaceable class="parameter">name</replaceable></command>
     for the specified notification name in the current database.
    </para>
--- 29,37 ----
    <title>Description</title>
  
    <para>
!    The <command>NOTIFY</command> command sends a notification event together
!    with an optional notification parameter to each client application that has
!    previously executed
     <command>LISTEN <replaceable class="parameter">name</replaceable></command>
     for the specified notification name in the current database.
    </para>
***************
*** 39,54 ****
     <command>NOTIFY</command> provides a simple form of signal or
     interprocess communication mechanism for a collection of processes
     accessing the same <productname>PostgreSQL</productname> database.
!    Higher-level mechanisms can be built by using tables in the database to
!    pass additional data (beyond a mere notification name) from notifier to
!    listener(s).
    </para>
  
    <para>
     The information passed to the client for a notification event includes the notification
!    name and the notifying session's server process <acronym>PID</>.  It is up to the
!    database designer to define the notification names that will be used in a given
!    database and what each one means.
    </para>
  
    <para>
--- 40,56 ----
     <command>NOTIFY</command> provides a simple form of signal or
     interprocess communication mechanism for a collection of processes
     accessing the same <productname>PostgreSQL</productname> database.
!    A notification parameter can be sent along with the notification and
!    higher-level mechanisms for passing structured data can be built by using
!    tables in the database to pass additional data from notifier to listener(s).
    </para>
  
    <para>
     The information passed to the client for a notification event includes the notification
!    name, the notifying session's server process <acronym>PID</> and the
!    notification parameter (payload) which is an empty string if it has not been specified.
!    It is up to the database designer to define the notification names that will
!    be used in a given database and what each one means.
    </para>
  
    <para>
***************
*** 89,102 ****
    </para>
  
    <para>
!    <command>NOTIFY</command> behaves like Unix signals in one important
!    respect: if the same notification name is signaled multiple times in quick
!    succession, recipients might get only one notification event for several executions
!    of <command>NOTIFY</command>.  So it is a bad idea to depend on the number
!    of notifications received.  Instead, use <command>NOTIFY</command> to wake up
!    applications that need to pay attention to something, and use a database
!    object (such as a sequence) to keep track of what happened or how many times
!    it happened.
    </para>
  
    <para>
--- 91,111 ----
    </para>
  
    <para>
!    If the same notification name is signaled multiple times from the same
!    transaction and with an identical notification parameter (or an empty one), the
!    database server can decide to deliver a single notification only.
!    On the other hand, notifications with distinct notification parameters will
!    always be delivered as distinct notifications. Similarly, notifications from
!    different transactions will never get folded into one notification.
!    <command>NOTIFY</command> also guarantees that notifications from the same
!    transaction get delivered in the order they were sent.
!   </para>
! 
!   <para>
!    An alternative to specifying a notification parameter is to use <command>NOTIFY</command> to
!    wake up applications that need to pay attention to something, and use a
!    database object (such as a sequence) to keep track of what happened or how many
!    times it happened.
    </para>
  
    <para>
***************
*** 111,122 ****
     notification event message) is the same as one's own session's
     <acronym>PID</> (available from <application>libpq</>).  When they
     are the same, the notification event is one's own work bouncing
!    back, and can be ignored.  (Despite what was said in the preceding
!    paragraph, this is a safe technique.
!    <productname>PostgreSQL</productname> keeps self-notifications
!    separate from notifications arriving from other sessions, so you
!    cannot miss an outside notification by ignoring your own
!    notifications.)
    </para>
   </refsect1>
  
--- 120,154 ----
     notification event message) is the same as one's own session's
     <acronym>PID</> (available from <application>libpq</>).  When they
     are the same, the notification event is one's own work bouncing
!    back, and can be ignored.
!   </para>
!  </refsect1>
! 
!  <refsect1>
!   <title>Notes</title>
!   <para>
!    A transaction that has executed <command>LISTEN</command> cannot be prepared for a two-phase commit.
!   </para>
!   <para>
!    To send a notification you can also use the function
!    <literal><function>pg_notify</function>(<type>text</type>,
!    <type>text</type>)</literal>. The function takes the channel name as the
!    first argument and the payload as the second. This could be more convenient
!    to use in triggers and you can also use a non-constant channel name and
!    parameter value.
!   </para>
!   <para>
!    There is a limited queue for notifications. In case this queue is full, all
!    transactions calling <command>NOTIFY</command> will fail.
!   </para>
!   <para>
!    The queue is quite large and should be sufficiently sized for almost
!    every use case. However, no cleanup can take place if one of your backends
!    executes <command>LISTEN</command> and then enters a transaction for a very
!    long time. Once the queue is half full you will see warnings in the log file
!    pointing you to the backend that is preventing cleanup. In this case you should
!    make sure that this backend ends its current transaction so that cleanup can
!    proceed.
    </para>
   </refsect1>
  
***************
*** 132,137 ****
--- 164,183 ----
       </para>
      </listitem>
     </varlistentry>
+    <varlistentry>
+     <term><replaceable class="PARAMETER">parameter</replaceable></term>
+     <listitem>
+      <para>
+       The notification parameter (payload) to be communicated along with the
+       notification. The character string is only allowed to consist of pure
+       ASCII 7-bit characters and must be shorter than 8000 characters.
+       Specifying a longer payload will cause an error. If you need to send
+       other characters or binary data, you need to take care of the encoding
+       and decoding (like base64) on your own. Alternatively you can store the
+       information in a database table and send the key of the record.
+      </para>
+     </listitem>
+    </varlistentry>
    </variablelist>
   </refsect1>
  
***************
*** 145,151 ****
  <programlisting>
  LISTEN virtual;
  NOTIFY virtual;
! Asynchronous notification "virtual" received from server process with PID 8448.
  </programlisting>
    </para>
   </refsect1>
--- 191,203 ----
  <programlisting>
  LISTEN virtual;
  NOTIFY virtual;
! Asynchronous notification "virtual" () received from server process with PID 8448.
! NOTIFY virtual 'This is the payload';
! Asynchronous notification "virtual" (This is the payload) received from server process with PID 8448.
! 
! LISTEN foo;
! SELECT pg_notify((SELECT 'fo' || 'o'), (SELECT 'pay' || 'load'));
! Asynchronous notification "foo" (payload) received from server process with PID 20801.
  </programlisting>
    </para>
   </refsect1>
diff -cr cvs.head/doc/src/sgml/ref/unlisten.sgml cvs.build/doc/src/sgml/ref/unlisten.sgml
*** cvs.head/doc/src/sgml/ref/unlisten.sgml	2008-11-14 11:22:47.000000000 +0100
--- cvs.build/doc/src/sgml/ref/unlisten.sgml	2010-02-14 16:04:25.000000000 +0100
***************
*** 83,88 ****
--- 83,92 ----
     At the end of each session, <command>UNLISTEN *</command> is
     automatically executed.
    </para>
+ 
+   <para>
+    A transaction that has executed <command>LISTEN</command> cannot be prepared for a two-phase commit.
+   </para>
   </refsect1>
  
   <refsect1>
***************
*** 94,100 ****
  <programlisting>
  LISTEN virtual;
  NOTIFY virtual;
! Asynchronous notification "virtual" received from server process with PID 8448.
  </programlisting>
    </para>
  
--- 98,104 ----
  <programlisting>
  LISTEN virtual;
  NOTIFY virtual;
! Asynchronous notification "virtual" () received from server process with PID 8448.
  </programlisting>
    </para>
  
diff -cr cvs.head/src/backend/access/transam/slru.c cvs.build/src/backend/access/transam/slru.c
*** cvs.head/src/backend/access/transam/slru.c	2010-01-05 12:39:22.000000000 +0100
--- cvs.build/src/backend/access/transam/slru.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 58,83 ****
  #include "storage/shmem.h"
  #include "miscadmin.h"
  
- 
- /*
-  * Define segment size.  A page is the same BLCKSZ as is used everywhere
-  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
-  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
-  * or 64K transactions for SUBTRANS.
-  *
-  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
-  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
-  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
-  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
-  * take no explicit notice of that fact in this module, except when comparing
-  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
-  *
-  * Note: this file currently assumes that segment file names will be four
-  * hex digits.	This sets a lower bound on the segment size (64K transactions
-  * for 32-bit TransactionIds).
-  */
- #define SLRU_PAGES_PER_SEGMENT	32
- 
  #define SlruFileName(ctl, path, seg) \
  	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
  
--- 58,63 ----
diff -cr cvs.head/src/backend/access/transam/twophase_rmgr.c cvs.build/src/backend/access/transam/twophase_rmgr.c
*** cvs.head/src/backend/access/transam/twophase_rmgr.c	2010-01-05 12:39:22.000000000 +0100
--- cvs.build/src/backend/access/transam/twophase_rmgr.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 25,31 ****
  {
  	NULL,						/* END ID */
  	lock_twophase_recover,		/* Lock */
- 	NULL,						/* notify/listen */
  	NULL,						/* pgstat */
  	multixact_twophase_recover	/* MultiXact */
  };
--- 25,30 ----
***************
*** 34,40 ****
  {
  	NULL,						/* END ID */
  	lock_twophase_postcommit,	/* Lock */
- 	notify_twophase_postcommit, /* notify/listen */
  	pgstat_twophase_postcommit,	/* pgstat */
  	multixact_twophase_postcommit /* MultiXact */
  };
--- 33,38 ----
***************
*** 43,49 ****
  {
  	NULL,						/* END ID */
  	lock_twophase_postabort,	/* Lock */
- 	NULL,						/* notify/listen */
  	pgstat_twophase_postabort,	/* pgstat */
  	multixact_twophase_postabort /* MultiXact */
  };
--- 41,46 ----
***************
*** 52,58 ****
  {
  	NULL,						/* END ID */
  	lock_twophase_standby_recover,		/* Lock */
- 	NULL,						/* notify/listen */
  	NULL,						/* pgstat */
  	NULL						/* MultiXact */
  };
--- 49,54 ----
diff -cr cvs.head/src/backend/access/transam/xact.c cvs.build/src/backend/access/transam/xact.c
*** cvs.head/src/backend/access/transam/xact.c	2010-02-14 16:02:46.000000000 +0100
--- cvs.build/src/backend/access/transam/xact.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 1736,1743 ****
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* NOTIFY commit must come before lower-level cleanup */
! 	AtCommit_Notify();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
--- 1736,1743 ----
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
! 	/* Insert notifications sent by the NOTIFY command into the queue */
! 	AtCommit_NotifyBeforeCommit();
  
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
***************
*** 1815,1820 ****
--- 1815,1825 ----
  
  	AtEOXact_MultiXact();
  
+ 	/*
+ 	 * Clean up Notify buffers and signal listening backends.
+ 	 */
+ 	AtCommit_NotifyAfterCommit();
+ 
  	ResourceOwnerRelease(TopTransactionResourceOwner,
  						 RESOURCE_RELEASE_LOCKS,
  						 true, true);
diff -cr cvs.head/src/backend/catalog/Makefile cvs.build/src/backend/catalog/Makefile
*** cvs.head/src/backend/catalog/Makefile	2010-01-06 22:30:05.000000000 +0100
--- cvs.build/src/backend/catalog/Makefile	2010-02-14 16:04:25.000000000 +0100
***************
*** 30,36 ****
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
! 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_listener.h pg_description.h \
  	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
--- 30,36 ----
  	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
  	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
  	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
! 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_description.h \
  	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
  	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
  	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
diff -cr cvs.head/src/backend/commands/async.c cvs.build/src/backend/commands/async.c
*** cvs.head/src/backend/commands/async.c	2010-01-05 12:39:22.000000000 +0100
--- cvs.build/src/backend/commands/async.c	2010-02-14 18:09:42.000000000 +0100
***************
*** 13,44 ****
   */
  
  /*-------------------------------------------------------------------------
!  * New Async Notification Model:
!  * 1. Multiple backends on same machine.  Multiple backends listening on
!  *	  one relation.  (Note: "listening on a relation" is not really the
!  *	  right way to think about it, since the notify names need not have
!  *	  anything to do with the names of relations actually in the database.
!  *	  But this terminology is all over the code and docs, and I don't feel
!  *	  like trying to replace it.)
!  *
!  * 2. There is a tuple in relation "pg_listener" for each active LISTEN,
!  *	  ie, each relname/listenerPID pair.  The "notification" field of the
!  *	  tuple is zero when no NOTIFY is pending for that listener, or the PID
!  *	  of the originating backend when a cross-backend NOTIFY is pending.
!  *	  (We skip writing to pg_listener when doing a self-NOTIFY, so the
!  *	  notification field should never be equal to the listenerPID field.)
!  *
!  * 3. The NOTIFY statement itself (routine Async_Notify) just adds the target
!  *	  relname to a list of outstanding NOTIFY requests.  Actual processing
!  *	  happens if and only if we reach transaction commit.  At that time (in
!  *	  routine AtCommit_Notify) we scan pg_listener for matching relnames.
!  *	  If the listenerPID in a matching tuple is ours, we just send a notify
!  *	  message to our own front end.  If it is not ours, and "notification"
!  *	  is not already nonzero, we set notification to our own PID and send a
!  *	  PROCSIG_NOTIFY_INTERRUPT signal to the receiving process (indicated by
!  *	  listenerPID).
!  *	  BTW: if the signal operation fails, we presume that the listener backend
!  *	  crashed without removing this tuple, and remove the tuple for it.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
--- 13,73 ----
   */
  
  /*-------------------------------------------------------------------------
!  * Async Notification Model as of 9.0:
!  *
!  * 1. Multiple backends on same machine. Multiple backends listening on
!  *	  several channels. (A channel was previously called a "relation" even
!  *	  though it is just an identifier unrelated to any database relation.)
!  *
!  * 2. There is one central queue in the form of SLRU-backed, file-based storage
!  *    (directory pg_notify/), with several pages mapped into shared memory.
!  *    All listening backends read from that queue and all notifications are
!  *    placed in this queue, see the data structure AsyncQueueEntry.
!  *
!  *    There is no central storage of which backend listens on which channel,
!  *    every backend has its own list.
!  *
!  *    Every backend that is listening on at least one channel registers by
!  *    entering its Pid into the array of all backends. It then scans all
!  *    incoming notifications in the central queue and first compares the
!  *    database oid of the notification with its own database oid and then
!  *    compares the notified channel with the list of channels that it listens
!  *    to. All notifications without a match are just skipped.
!  *
!  *    In case there is a match it delivers the corresponding notification to
!  *    its frontend.
!  *
!  * 3. The NOTIFY statement (routine Async_Notify) stores the notification
!  *    in a list which will not be processed until transaction end. Every
!  *    notification can additionally send a "payload" which is an extra text
!  *    parameter to convey arbitrary information to the recipient.
!  *
!  *    Duplicate notifications from the same transaction are sent out as one
!  *    notification only. This is done to save work when for example a trigger
!  *    on a 2 million row table fires a notification for each row that has been
!  *    changed. If the application needs to receive every single notification
!  *    that has been sent, it can easily add some unique string into the extra
!  *    payload parameter.
!  *
!  *    Once the transaction commits, AtCommit_NotifyBeforeCommit() performs the
!  *    required changes regarding listeners (Listen/Unlisten) and then adds the
!  *    pending notifications to the head of the queue. The head pointer of the
!  *    queue always points to the next free position and a position is just a
!  *    page number and the offset in that page. This is done before marking the
!  *    transaction as committed in clog. If we run into problems writing the
!  *    notifications, we can still call elog(ERROR, ...) and the transaction
!  *    will roll back.
!  *
!  *    Once we have put all of the notifications into the queue, we return to
!  *    CommitTransaction() which will then commit to clog.
!  *
!  *    After clog commit we are called another time
!  *    (AtCommit_NotifyAfterCommit()). Here we check if we need to signal the
!  *    backends. In SignalBackends() we scan the list of listening backends and
!  *    send a PROCSIG_NOTIFY_INTERRUPT to every backend that has set its Pid (we
!  *    don't know which backend is listening on which channel so we need to send
!  *    a signal to every listening backend). We can exclude backends that are
!  *    already up to date.
   *
   * 4. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
   *	  can call inbound-notify processing immediately if this backend is idle
***************
*** 46,84 ****
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  * 5. Inbound-notify processing consists of scanning pg_listener for tuples
!  *	  matching our own listenerPID and having nonzero notification fields.
!  *	  For each such tuple, we send a message to our frontend and clear the
!  *	  notification field.  BTW: this routine has to start/commit its own
!  *	  transaction, since by assumption it is only called from outside any
!  *	  transaction.
!  *
!  * Like NOTIFY, LISTEN and UNLISTEN just add the desired action to a list
!  * of pending actions.	If we reach transaction commit, the changes are
!  * applied to pg_listener just before executing any pending NOTIFYs.  This
!  * method is necessary because to avoid race conditions, we must hold lock
!  * on pg_listener from when we insert a new listener tuple until we commit.
!  * To do that and not create undue hazard of deadlock, we don't want to
!  * touch pg_listener until we are otherwise done with the transaction;
!  * in particular it'd be uncool to still be taking user-commanded locks
!  * while holding the pg_listener lock.
!  *
!  * Although we grab ExclusiveLock on pg_listener for any operation,
!  * the lock is never held very long, so it shouldn't cause too much of
!  * a performance problem.  (Previously we used AccessExclusiveLock, but
!  * there's no real reason to forbid concurrent reads.)
   *
!  * An application that listens on the same relname it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * PID.  (As of FE/BE protocol 2.0, the backend's PID is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.  Note,
!  * however, that we do *not* guarantee that a separate frontend message will
!  * be sent for every outside NOTIFY.  Since there is only room for one
!  * originating PID in pg_listener, outside notifies occurring at about the
!  * same time may be collapsed into a single message bearing the PID of the
!  * first outside backend to perform the NOTIFY.
   *-------------------------------------------------------------------------
   */
  
--- 75,96 ----
   *	  block).  Otherwise the handler may only set a flag, which will cause the
   *	  processing to occur just before we next go idle.
   *
!  *    Inbound-notify processing consists of reading all of the notifications
!  *	  that have arrived since scanning last time. We read every notification
!  *	  until we reach either a notification from an uncommitted transaction or
!  *	  the head pointer's position. Then we check if we were the laziest
!  *	  backend: if our pointer equals the global tail pointer, we advance
!  *	  the tail to the position of the second-laziest backend (which we
!  *	  identify by inspecting the positions of all other backends'
!  *	  pointers). Whenever we move the tail pointer we also truncate now unused
!  *	  pages (i.e. delete files in pg_notify/ that are no longer used).
   *
!  * An application that listens on the same channel it notifies will get
   * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
   * by comparing be_pid in the NOTIFY message to the application's own backend's
!  * Pid.  (As of FE/BE protocol 2.0, the backend's Pid is provided to the
   * frontend during startup.)  The above design guarantees that notifies from
!  * other backends will never be missed by ignoring self-notifies.
   *-------------------------------------------------------------------------
   */
  
***************
*** 88,97 ****
  #include <signal.h>
  
  #include "access/heapam.h"
! #include "access/twophase_rmgr.h"
  #include "access/xact.h"
! #include "catalog/pg_listener.h"
  #include "commands/async.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
--- 100,111 ----
  #include <signal.h>
  
  #include "access/heapam.h"
! #include "access/slru.h"
! #include "access/transam.h"
  #include "access/xact.h"
! #include "catalog/pg_type.h"
  #include "commands/async.h"
+ #include "funcapi.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
***************
*** 108,115 ****
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction.  As explained above,
!  * we don't actually modify pg_listener until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
--- 122,129 ----
  
  /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
!  * all actions requested in the current transaction. As explained above,
!  * we don't actually apply these actions until we reach transaction commit.
   *
   * The list is kept in CurTransactionContext.  In subtransactions, each
   * subtransaction has its own list in its own CurTransactionContext, but
***************
*** 126,132 ****
  typedef struct
  {
  	ListenActionKind action;
! 	char		condname[1];	/* actually, as long as needed */
  } ListenAction;
  
  static List *pendingActions = NIL;		/* list of ListenAction */
--- 140,146 ----
  typedef struct
  {
  	ListenActionKind action;
! 	char		channel[1];	/* actually, as long as needed */
  } ListenAction;
  
  static List *pendingActions = NIL;		/* list of ListenAction */
***************
*** 134,140 ****
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all relnames NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
--- 148,154 ----
  static List *upperPendingActions = NIL; /* list of upper-xact lists */
  
  /*
!  * State for outbound notifies consists of a list of all channels NOTIFYed
   * in the current transaction.	We do not actually perform a NOTIFY until
   * and unless the transaction commits.	pendingNotifies is NIL if no
   * NOTIFYs have been done in the current transaction.
***************
*** 149,160 ****
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
- static List *pendingNotifies = NIL;		/* list of C strings */
  
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
  
  /*
!  * State for inbound notifies consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
--- 163,285 ----
   * condition name, it will get a self-notify at commit.  This is a bit odd
   * but is consistent with our historical behavior.
   */
  
+ typedef struct QueuePosition
+ {
+ 	int				page;
+ 	int				offset;
+ } QueuePosition;
+ 
+ typedef struct Notification
+ {
+ 	char		   *channel;
+ 	char		   *payload;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ } Notification;
+ 
+ typedef struct AsyncQueueEntry
+ {
+ 	/*
+ 	 * this record has the maximal length, but usually we limit it to
+ 	 * AsyncQueueEntryEmptySize + strlen(payload).
+ 	 */
+ 	Size			length;
+ 	Oid				dboid;
+ 	TransactionId	xid;
+ 	int32			srcPid;
+ 	char			data[NAMEDATALEN + NOTIFY_PAYLOAD_MAX_LENGTH];
+ } AsyncQueueEntry;
+ #define AsyncQueueEntryEmptySize \
+ 	 (sizeof(AsyncQueueEntry) - NOTIFY_PAYLOAD_MAX_LENGTH + 1 \
+ 							  - NAMEDATALEN + 1)
+ 
+ #define	InvalidPid				(-1)
+ #define QUEUE_POS_PAGE(x)		((x).page)
+ #define QUEUE_POS_OFFSET(x)		((x).offset)
+ #define QUEUE_POS_EQUAL(x,y) \
+ 	 ((x).page == (y).page ? (x).offset == (y).offset : false)
+ #define SET_QUEUE_POS(x,y,z) \
+ 	do { \
+ 		(x).page = (y); \
+ 		(x).offset = (z); \
+ 	} while (0)
+ /* yields the logically earlier of positions x and y, with z = HEAD */
+ #define QUEUE_POS_MIN(x,y,z) \
+ 	(asyncQueuePagePrecedesLogically((x).page, (y).page, (z).page) ? (x) : \
+ 		 asyncQueuePagePrecedesLogically((y).page, (x).page, (z).page) ? (y) : \
+ 			 (x).offset < (y).offset ? (x) : \
+ 				(y))
+ #define QUEUE_BACKEND_POS(i)		asyncQueueControl->backend[(i)].pos
+ #define QUEUE_BACKEND_PID(i)		asyncQueueControl->backend[(i)].pid
+ #define QUEUE_HEAD					asyncQueueControl->head
+ #define QUEUE_TAIL					asyncQueueControl->tail
+ 
+ typedef struct QueueBackendStatus
+ {
+ 	int32			pid;
+ 	QueuePosition	pos;
+ } QueueBackendStatus;
+ 
+ /*
+  * The AsyncQueueControl structure is protected by the AsyncQueueLock.
+  *
+  * In SHARED mode, backends will only inspect their own entries as well as
+  * head and tail pointers. Consequently we can allow a backend to update its
+  * own record while holding only a shared lock (since no other backend will
+  * inspect it).
+  *
+  * In EXCLUSIVE mode, backends can inspect the entries of other backends and
+  * also change head and tail pointers.
+  *
+  * In order to avoid deadlocks, whenever we need both locks, we always first
+  * get AsyncQueueLock and then AsyncCtlLock.
+  */
+ typedef struct AsyncQueueControl
+ {
+ 	QueuePosition		head;		/* head points to the next free location */
+ 	QueuePosition 		tail;		/* the global tail is equivalent to the
+ 									   tail of the "slowest" backend */
+ 	TimestampTz			lastQueueFillWarn;	/* when the queue is full we only
+ 											   want to log that once in a
+ 											   while */
+ 	QueueBackendStatus	backend[1];	/* actually this one has as many entries as
+ 									 * connections are allowed (MaxBackends) */
+ 	/* DO NOT ADD FURTHER STRUCT MEMBERS HERE */
+ } AsyncQueueControl;
+ 
+ static AsyncQueueControl   *asyncQueueControl;
+ static SlruCtlData			AsyncCtlData;
+ 
+ #define AsyncCtl					(&AsyncCtlData)
+ #define QUEUE_PAGESIZE				BLCKSZ
+ #define QUEUE_FULL_WARN_INTERVAL	5000	/* warn at most once every 5s */
+ 
+ /*
+  * slru.c currently assumes that all filenames are four characters of hex
+  * digits. That means that we can use segments 0000 through FFFF.
+  * Each segment contains SLRU_PAGES_PER_SEGMENT pages which gives us
+  * the pages from 0 to SLRU_PAGES_PER_SEGMENT * 0xFFFF.
+  *
+  * It's of course easy to enhance slru.c but those pages give us so much
+  * space already that it doesn't seem worth the trouble...
+  *
+  * It's an interesting test case to define QUEUE_MAX_PAGE to a very small
+  * multiple of SLRU_PAGES_PER_SEGMENT to test queue full behaviour.
+  */
+ #define QUEUE_MAX_PAGE			(SLRU_PAGES_PER_SEGMENT * 0xFFFF)
+ 
+ static List *pendingNotifies = NIL;				/* list of Notifications */
  static List *upperPendingNotifies = NIL;		/* list of upper-xact lists */
+ static List *listenChannels = NIL;	/* list of channels we are listening to */
+ 
+ /* has this backend sent notifications in the current transaction? */
+ static bool backendHasSentNotifications = false;
+ /* has this backend executed its first LISTEN in the current transaction? */
+ static bool backendHasExecutedInitialListen = false;
  
  /*
!  * State for inbound notifications consists of two flags: one saying whether
   * the signal handler is currently allowed to call ProcessIncomingNotify
   * directly, and one saying whether the signal has occurred but the handler
   * was not allowed to call ProcessIncomingNotify at the time.
***************
*** 171,224 ****
  
  bool		Trace_notify = false;
  
! 
! static void queue_listen(ListenActionKind action, const char *condname);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static void Exec_Listen(Relation lRel, const char *relname);
! static void Exec_Unlisten(Relation lRel, const char *relname);
! static void Exec_UnlistenAll(Relation lRel);
! static void Send_Notify(Relation lRel);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(char *relname, int32 listenerPID);
! static bool AsyncExistsPendingNotify(const char *relname);
  static void ClearPendingActionsAndNotifies(void);
  
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the relation to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", relname);
  
! 	/* no point in making duplicate entries in the list ... */
! 	if (!AsyncExistsPendingNotify(relname))
  	{
! 		/*
! 		 * The name list needs to live until end of transaction, so store it
! 		 * in the transaction context.
! 		 */
! 		MemoryContext oldcontext;
  
! 		oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
! 		/*
! 		 * Ordering of the list isn't important.  We choose to put new entries
! 		 * on the front, as this might make duplicate-elimination a tad faster
! 		 * when the same condition is signaled many times in a row.
! 		 */
! 		pendingNotifies = lcons(pstrdup(relname), pendingNotifies);
  
! 		MemoryContextSwitchTo(oldcontext);
! 	}
  }
  
  /*
--- 296,520 ----
  
  bool		Trace_notify = false;
  
! static void queue_listen(ListenActionKind action, const char *channel);
  static void Async_UnlistenOnExit(int code, Datum arg);
! static bool IsListeningOn(const char *channel);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
! static void Exec_ListenBeforeCommit(const char *channel);
! static void Exec_ListenAfterCommit(const char *channel);
! static void Exec_UnlistenAfterCommit(const char *channel);
! static void Exec_UnlistenAllAfterCommit(void);
! static void SignalBackends(void);
! static bool asyncQueuePagePrecedesPhysically(int p, int q);
! static bool asyncQueuePagePrecedesLogically(int p, int q, int head);
! static bool asyncQueueAdvance(QueuePosition *position, int entryLength);
! static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
! static void asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n);
! static List *asyncQueueAddEntries(List *notifications);
! static bool asyncQueueGetEntriesByPage(QueuePosition *current,
! 									   QueuePosition stop,
! 									   List **notifications);
! static void asyncQueueReadAllNotifications(void);
! static void asyncQueueAdvanceTail(void);
! static void asyncQueueUnregister(void);
! static void asyncQueueFillWarning(void);
! static bool asyncQueueIsFull(void);
  static void ProcessIncomingNotify(void);
! static void NotifyMyFrontEnd(const char *channel,
! 							 const char *payload,
! 							 int32 srcPid);
! static bool AsyncExistsPendingNotify(const char *channel, const char *payload);
  static void ClearPendingActionsAndNotifies(void);
  
+ /*
+  * We will work on the page range of 0..(SLRU_PAGES_PER_SEGMENT * 0xFFFF).
+  * asyncQueuePagePrecedesPhysically just checks numerically without any magic
+  * if one page precedes another one.
+  *
+  * On the other hand, when asyncQueuePagePrecedesLogically does that check, it
+  * takes the current head page number into account. If we have wrapped
+  * around, it can happen that p precedes q, even though p > q (if the head page
+  * is in between the two).
+  */
+ static bool
+ asyncQueuePagePrecedesPhysically(int p, int q)
+ {
+ 	return p < q;
+ }
+ 
+ static bool
+ asyncQueuePagePrecedesLogically(int p, int q, int head)
+ {
+ 	if (p <= head && q <= head)
+ 		return p < q;
+ 	if (p > head && q > head)
+ 		return p < q;
+ 	if (p <= head)
+ 	{
+ 		Assert(q > head);
+ 		/* q is older */
+ 		return false;
+ 	}
+ 	else
+ 	{
+ 		Assert(p > head && q <= head);
+ 		/* p is older */
+ 		return true;
+ 	}
+ }
+ 
+ void
+ AsyncShmemInit(void)
+ {
+ 	bool	found;
+ 	int		slotno;
+ 	Size	size;
+ 
+ 	/*
+ 	 * Remember that sizeof(AsyncQueueControl) already contains one member of
+ 	 * QueueBackendStatus, so we only need to add the status space requirement
+ 	 * for MaxBackends-1 backends.
+ 	 */
+ 	size = mul_size(MaxBackends-1, sizeof(QueueBackendStatus));
+ 	size = add_size(size, sizeof(AsyncQueueControl));
+ 
+ 	asyncQueueControl = (AsyncQueueControl *)
+ 		ShmemInitStruct("Async Queue Control", size, &found);
+ 
+ 	if (!asyncQueueControl)
+ 		elog(ERROR, "out of memory");
+ 
+ 	if (!found)
+ 	{
+ 		int		i;
+ 		SET_QUEUE_POS(QUEUE_HEAD, 0, 0);
+ 		SET_QUEUE_POS(QUEUE_TAIL, 0, 0);
+ 		for (i = 0; i < MaxBackends; i++)
+ 		{
+ 			SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ 			QUEUE_BACKEND_PID(i) = InvalidPid;
+ 		}
+ 	}
+ 
+ 	AsyncCtl->PagePrecedes = asyncQueuePagePrecedesPhysically;
+ 	SimpleLruInit(AsyncCtl, "Async Ctl", NUM_ASYNC_BUFFERS, 0,
+ 				  AsyncCtlLock, "pg_notify");
+ 	AsyncCtl->do_fsync = false;
+ 	asyncQueueControl->lastQueueFillWarn = GetCurrentTimestamp();
+ 
+ 	if (!found)
+ 	{
+ 		SlruScanDirectory(AsyncCtl,
+ 						  QUEUE_MAX_PAGE + SLRU_PAGES_PER_SEGMENT,
+ 						  true);
+ 
+ 		LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
+ 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
+ 		SimpleLruWritePage(AsyncCtl, slotno, NULL);
+ 		LWLockRelease(AsyncCtlLock);
+ 	}
+ }
+ 
+ 
+ /*
+  * pg_notify -
+  *	  Send a notification to listening clients
+  */
+ Datum
+ pg_notify(PG_FUNCTION_ARGS)
+ {
+ 	const char *channel;
+ 	const char *payload;
+ 
+ 	if (PG_ARGISNULL(0))
+ 		channel = "";
+ 	else
+ 		channel = text_to_cstring(PG_GETARG_TEXT_PP(0));
+ 
+ 	if (PG_ARGISNULL(1))
+ 		payload = "";
+ 	else
+ 		payload = text_to_cstring(PG_GETARG_TEXT_PP(1));
+ 
+ 	Async_Notify(channel, payload);
+ 
+ 	PG_RETURN_VOID();
+ }
+ 
  
  /*
   * Async_Notify
   *
   *		This is executed by the SQL notify command.
   *
!  *		Adds the channel to the list of pending notifies.
   *		Actual notification happens during transaction commit.
   *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  void
! Async_Notify(const char *channel, const char *payload)
  {
+ 	Notification   *n;
+ 	MemoryContext	oldcontext;
+ 	int				i;
+ 
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Notify(%s)", channel);
  
! 	/* a channel name must be specified */
! 	if (!channel || !strlen(channel))
! 		ereport(ERROR,
! 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 				 errmsg("channel name cannot be empty")));
! 
! 	if (strlen(channel) >= NAMEDATALEN)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 				 errmsg("channel name too long")));
! 
! 	if (payload)
  	{
! 		if (strlen(payload) > NOTIFY_PAYLOAD_MAX_LENGTH - 1)
! 			ereport(ERROR,
! 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 					 errmsg("payload string too long")));
! 
! 		for (i = 0; payload[i] != '\0'; i++)
! 			if (payload[i] < 32 || payload[i] > 126)
! 				ereport(ERROR,
! 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 						 errmsg("invalid character in payload"),
! 						 errdetail("only printable 7-bit ASCII characters allowed")));
! 	}
  
! 	/* no point in making duplicate entries in the list ... */
! 	if (AsyncExistsPendingNotify(channel, payload))
! 		return;
  
! 	/*
! 	 * The notification needs to live until end of transaction, so store it
! 	 * in the transaction context.
! 	 */
! 	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
! 	n = (Notification *) palloc(sizeof(Notification));
! 	n->channel = pstrdup(channel);
! 	if (payload)
! 		n->payload = pstrdup(payload);
! 	else
! 		n->payload = "";
! 
! 	/* will set the xid and the srcPid later... */
! 	n->xid = InvalidTransactionId;
! 	n->srcPid = InvalidPid;
! 
! 	/*
! 	 * We want to preserve the order so we need to append every
! 	 * notification. See comments at AsyncExistsPendingNotify().
! 	 */
! 	pendingNotifies = lappend(pendingNotifies, n);
! 
! 	MemoryContextSwitchTo(oldcontext);
  }
  
  /*
***************
*** 226,236 ****
   *		Common code for listen, unlisten, unlisten all commands.
   *
   *		Adds the request to the list of pending actions.
!  *		Actual update of pg_listener happens during transaction commit.
!  *		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   */
  static void
! queue_listen(ListenActionKind action, const char *condname)
  {
  	MemoryContext oldcontext;
  	ListenAction *actrec;
--- 522,532 ----
   *		Common code for listen, unlisten, unlisten all commands.
   *
   *		Adds the request to the list of pending actions.
!  *		Actual update of the notification queue happens during transaction
!  *		commit.
   */
  static void
! queue_listen(ListenActionKind action, const char *channel)
  {
  	MemoryContext oldcontext;
  	ListenAction *actrec;
***************
*** 244,252 ****
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  	/* space for terminating null is included in sizeof(ListenAction) */
! 	actrec = (ListenAction *) palloc(sizeof(ListenAction) + strlen(condname));
  	actrec->action = action;
! 	strcpy(actrec->condname, condname);
  
  	pendingActions = lappend(pendingActions, actrec);
  
--- 540,548 ----
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
  	/* space for terminating null is included in sizeof(ListenAction) */
! 	actrec = (ListenAction *) palloc(sizeof(ListenAction) + strlen(channel));
  	actrec->action = action;
! 	strcpy(actrec->channel, channel);
  
  	pendingActions = lappend(pendingActions, actrec);
  
***************
*** 259,270 ****
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", relname, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, relname);
  }
  
  /*
--- 555,566 ----
   *		This is executed by the SQL listen command.
   */
  void
! Async_Listen(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Listen(%s,%d)", channel, MyProcPid);
  
! 	queue_listen(LISTEN_LISTEN, channel);
  }
  
  /*
***************
*** 273,288 ****
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *relname)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", relname, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, relname);
  }
  
  /*
--- 569,584 ----
   *		This is executed by the SQL unlisten command.
   */
  void
! Async_Unlisten(const char *channel)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Async_Unlisten(%s,%d)", channel, MyProcPid);
  
  	/* If we couldn't possibly be listening, no need to queue anything */
  	if (pendingActions == NIL && !unlistenExitRegistered)
  		return;
  
! 	queue_listen(LISTEN_UNLISTEN, channel);
  }
  
  /*
***************
*** 306,313 ****
  /*
   * Async_UnlistenOnExit
   *
-  *		Clean up the pg_listener table at backend exit.
-  *
   *		This is executed if we have done any LISTENs in this backend.
   *		It might not be necessary anymore, if the user UNLISTENed everything,
   *		but we don't try to detect that case.
--- 602,607 ----
***************
*** 315,331 ****
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
- 	/*
- 	 * We need to start/commit a transaction for the unlisten, but if there is
- 	 * already an active transaction we had better abort that one first.
- 	 * Otherwise we'd end up committing changes that probably ought to be
- 	 * discarded.
- 	 */
  	AbortOutOfAnyTransaction();
! 	/* Now we can do the unlisten */
! 	StartTransactionCommand();
! 	Async_UnlistenAll();
! 	CommitTransactionCommand();
  }
  
  /*
--- 609,616 ----
  static void
  Async_UnlistenOnExit(int code, Datum arg)
  {
  	AbortOutOfAnyTransaction();
! 	Exec_UnlistenAllAfterCommit();
  }
  
  /*
***************
*** 337,386 ****
  void
  AtPrepare_Notify(void)
  {
- 	ListCell   *p;
- 
  	/* It's not sensible to have any pending LISTEN/UNLISTEN actions */
! 	if (pendingActions)
  		ereport(ERROR,
  				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
! 				 errmsg("cannot PREPARE a transaction that has executed LISTEN or UNLISTEN")));
! 
! 	/* We can deal with pending NOTIFY though */
! 	foreach(p, pendingNotifies)
! 	{
! 		const char *relname = (const char *) lfirst(p);
! 
! 		RegisterTwoPhaseRecord(TWOPHASE_RM_NOTIFY_ID, 0,
! 							   relname, strlen(relname) + 1);
! 	}
! 
! 	/*
! 	 * We can clear the state immediately, rather than needing a separate
! 	 * PostPrepare call, because if the transaction fails we'd just discard
! 	 * the state anyway.
! 	 */
! 	ClearPendingActionsAndNotifies();
  }
  
  /*
!  * AtCommit_Notify
   *
!  *		This is called at transaction commit.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, insert or delete
!  *		tuples in pg_listener accordingly.
   *
!  *		If there are outbound notify requests in the pendingNotifies list,
!  *		scan pg_listener for matching tuples, and either signal the other
!  *		backend or send a message to our own frontend.
!  *
!  *		NOTE: we are still inside the current transaction, therefore can
!  *		piggyback on its committing of changes.
   */
  void
! AtCommit_Notify(void)
  {
- 	Relation	lRel;
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
--- 622,649 ----
  void
  AtPrepare_Notify(void)
  {
  	/* It's not sensible to have any pending LISTEN/UNLISTEN actions */
! 	if (pendingActions || pendingNotifies)
  		ereport(ERROR,
  				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
! 				 errmsg("cannot PREPARE a transaction that has executed LISTEN/UNLISTEN or NOTIFY")));
  }
  
  /*
!  * AtCommit_NotifyBeforeCommit
   *
!  *		This is called at transaction commit, before actually committing to
!  *		clog.
   *
!  *		If there are pending LISTEN/UNLISTEN actions, update our
!  *		"listenChannels" list.
   *
!  *		If there are outbound notify requests in the pendingNotifies list, add
!  *		them to the global queue; listeners are signaled after clog commit.
   */
  void
! AtCommit_NotifyBeforeCommit(void)
  {
  	ListCell   *p;
  
  	if (pendingActions == NIL && pendingNotifies == NIL)
***************
*** 397,406 ****
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify");
  
! 	/* Acquire ExclusiveLock on pg_listener */
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
--- 660,669 ----
  	}
  
  	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyBeforeCommit");
  
! 	Assert(backendHasSentNotifications == false);
! 	Assert(backendHasExecutedInitialListen == false);
  
  	/* Perform any pending listen/unlisten actions */
  	foreach(p, pendingActions)
***************
*** 410,508 ****
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_Listen(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN:
! 				Exec_Unlisten(lRel, actrec->condname);
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAll(lRel);
  				break;
  		}
- 
- 		/* We must CCI after each action in case of conflicting actions */
- 		CommandCounterIncrement();
  	}
  
- 	/* Perform any pending notifies */
- 	if (pendingNotifies)
- 		Send_Notify(lRel);
- 
  	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that notified backends see our tuple updates when they look. Else they
! 	 * might disregard the signal, which would make the application programmer
! 	 * very unhappy.  Also, this prevents race conditions when we have just
! 	 * inserted a listening tuple.
  	 */
! 	heap_close(lRel, NoLock);
! 
! 	ClearPendingActionsAndNotifies();
  
! 	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_Notify: done");
  }
  
  /*
!  * Exec_Listen --- subroutine for AtCommit_Notify
   *
!  *		Register the current backend as listening on the specified relation.
   */
! static void
! Exec_Listen(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
! 	Datum		values[Natts_pg_listener];
! 	bool		nulls[Natts_pg_listener];
! 	NameData	condname;
! 	bool		alreadyListener = false;
  
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Listen(%s,%d)", relname, MyProcPid);
  
! 	/* Detect whether we are already listening on this relname */
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
  
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
  		{
! 			alreadyListener = true;
! 			/* No need to scan the rest of the table */
! 			break;
  		}
  	}
- 	heap_endscan(scan);
  
! 	if (alreadyListener)
! 		return;
  
! 	/*
! 	 * OK to insert a new tuple
! 	 */
! 	memset(nulls, false, sizeof(nulls));
  
! 	namestrcpy(&condname, relname);
! 	values[Anum_pg_listener_relname - 1] = NameGetDatum(&condname);
! 	values[Anum_pg_listener_listenerpid - 1] = Int32GetDatum(MyProcPid);
! 	values[Anum_pg_listener_notification - 1] = Int32GetDatum(0);		/* no notifies pending */
  
! 	tuple = heap_form_tuple(RelationGetDescr(lRel), values, nulls);
  
! 	simple_heap_insert(lRel, tuple);
  
! #ifdef NOT_USED					/* currently there are no indexes */
! 	CatalogUpdateIndexes(lRel, tuple);
! #endif
  
! 	heap_freetuple(tuple);
  
  	/*
! 	 * now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
--- 673,886 ----
  		switch (actrec->action)
  		{
  			case LISTEN_LISTEN:
! 				Exec_ListenBeforeCommit(actrec->channel);
  				break;
  			case LISTEN_UNLISTEN:
! 				/* there is no Exec_UnlistenBeforeCommit() */
  				break;
  			case LISTEN_UNLISTEN_ALL:
! 				/* there is no Exec_UnlistenAllBeforeCommit() */
  				break;
  		}
  	}
  
  	/*
! 	 * Perform any pending notifies.
  	 */
! 	if (pendingNotifies)
! 	{
! 		backendHasSentNotifications = true;
  
! 		while (pendingNotifies != NIL)
! 		{
! 			/*
! 			 * Add the pending notifications to the queue.
! 			 *
! 			 * A full queue is very uncommon and should really not happen,
! 			 * given that we have so much space available in the slru pages.
! 			 * Nevertheless we need to deal with this possibility. Note that
! 			 * when we get here we are in the process of committing our
! 			 * transaction, we have not yet committed to clog but this would be
! 			 * the next step. So at this point in time we can still roll the
! 			 * transaction back.
! 			 */
! 			LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 			asyncQueueFillWarning();
! 			if (asyncQueueIsFull())
! 			{
! 				LWLockRelease(AsyncQueueLock);
! 				ereport(ERROR,
! 						(errcode(ERRCODE_TOO_MANY_ENTRIES),
! 						errmsg("too many notifications in the queue")));
! 			}
! 			pendingNotifies = asyncQueueAddEntries(pendingNotifies);
! 			LWLockRelease(AsyncQueueLock);
! 		}
! 	}
  }
  
  /*
!  * AtCommit_NotifyAfterCommit
!  *
!  *		This is called at transaction commit, after committing to clog.
   *
!  *		Notify the listening backends.
   */
! void
! AtCommit_NotifyAfterCommit(void)
  {
! 	ListCell   *p;
  
! 	/* Allow transactions that have not executed LISTEN/UNLISTEN/NOTIFY to
! 	 * return as soon as possible */
! 	if (!pendingActions && !backendHasSentNotifications)
! 		return;
  
! 	/* Perform any pending listen/unlisten actions */
! 	foreach(p, pendingActions)
  	{
! 		ListenAction *actrec = (ListenAction *) lfirst(p);
  
! 		switch (actrec->action)
  		{
! 			case LISTEN_LISTEN:
! 				Exec_ListenAfterCommit(actrec->channel);
! 				break;
! 			case LISTEN_UNLISTEN:
! 				Exec_UnlistenAfterCommit(actrec->channel);
! 				break;
! 			case LISTEN_UNLISTEN_ALL:
! 				Exec_UnlistenAllAfterCommit();
! 				break;
  		}
  	}
  
! 	if (backendHasSentNotifications)
! 		SignalBackends();
  
! 	ClearPendingActionsAndNotifies();
! 
! 	if (Trace_notify)
! 		elog(DEBUG1, "AtCommit_NotifyAfterCommit: done");
! }
! 
! /*
!  * This function is executed for every notification found in the queue in order
!  * to check if the current backend is listening on that channel. Not sure if we
!  * should further optimize this, for example convert to a sorted array and
!  * allow binary search on it...
!  */
! static bool
! IsListeningOn(const char *channel)
! {
! 	ListCell   *p;
! 	char	   *lchan;
! 
! 	foreach(p, listenChannels)
! 	{
! 		lchan = (char *) lfirst(p);
! 		if (strcmp(lchan, channel) == 0)
! 			return true;
! 	}
! 	return false;
! }
  
! Datum
! pg_listening(PG_FUNCTION_ARGS)
! {
! 	FuncCallContext	   *funcctx;
! 	ListCell		  **lcp;
  
! 	/* stuff done only on the first call of the function */
! 	if (SRF_IS_FIRSTCALL())
! 	{
! 		MemoryContext	oldcontext;
  
! 		/* create a function context for cross-call persistence */
! 		funcctx = SRF_FIRSTCALL_INIT();
  
! 		/*
! 		 * switch to memory context appropriate for multiple function calls
! 		 */
! 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
! 
! 		/* allocate memory for user context */
! 		lcp = (ListCell **) palloc(sizeof(ListCell *));
! 		if (listenChannels != NIL)
! 			*lcp = list_head(listenChannels);
! 		else
! 			*lcp = NULL;
! 		funcctx->user_fctx = (void *) lcp;
  
! 		MemoryContextSwitchTo(oldcontext);
! 	}
! 
! 	/* stuff done on every call of the function */
! 	funcctx = SRF_PERCALL_SETUP();
! 	lcp = (ListCell **) funcctx->user_fctx;
! 
! 	while (*lcp != NULL)
! 	{
! 		char   *channel = (char *) lfirst(*lcp);
! 
! 		*lcp = (*lcp)->next;
! 		SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
! 	}
! 
! 	SRF_RETURN_DONE(funcctx);
! }
! 
! /*
!  * Exec_ListenBeforeCommit --- subroutine for AtCommit_NotifyBeforeCommit
!  *
!  * Note that we only set our pointer here and do not yet add the channel to
!  * listenChannels. Since our transaction could still roll back we do this only
!  * after commit. We know that our tail pointer won't move between here and
!  * directly after commit, so we won't miss a notification.
!  */
! static void
! Exec_ListenBeforeCommit(const char *channel)
! {
! 	if (Trace_notify)
! 		elog(DEBUG1, "Exec_ListenBeforeCommit(%s,%d)", channel, MyProcPid);
! 
! 	/* Detect whether we are already listening to something. */
! 	if (listenChannels != NIL)
! 		return;
  
  	/*
! 	 * We need this variable to detect an aborted initial LISTEN.
! 	 * In that case we would set up our pointer but not listen on any channel.
! 	 * This state gets cleaned up again in AtAbort_Notify().
! 	 */
! 	backendHasExecutedInitialListen = true;
! 
! 	/*
! 	 * This is our first LISTEN, establish our pointer.
! 	 * We set our pointer to the global tail pointer, this way we make
! 	 * sure that we get all of the notifications. We might get a few more
! 	 * but that doesn't hurt.
! 	 */
! 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 	QUEUE_BACKEND_POS(MyBackendId) = QUEUE_TAIL;
! 	QUEUE_BACKEND_PID(MyBackendId) = MyProcPid;
! 	LWLockRelease(AsyncQueueLock);
! 
! 	/*
! 	 * Try to move our pointer forward as far as possible. This will skip
! 	 * over already committed notifications. Still, we could get
! 	 * notifications that have already committed before we started to
! 	 * LISTEN.
! 	 *
! 	 * Note that we are not yet listening on anything, so we won't deliver
! 	 * any notification.
! 	 *
! 	 * This will also advance the global tail pointer if necessary.
! 	 */
! 	asyncQueueReadAllNotifications();
! 
! 	/*
! 	 * Now that we are listening, make sure we will unlisten before dying.
  	 */
  	if (!unlistenExitRegistered)
  	{
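The commit-time loop in this hunk keeps calling asyncQueueAddEntries() until it returns NIL, because each call fills at most the current page. A minimal sketch of that "write one page per call, hand back the rest" pattern follows. Everything in it is a toy assumption: ToyQueue, TOY_PAGE_SIZE, toy_add_entries() and toy_write_all() are hypothetical names, entries are reduced to their lengths, and the full-queue check and locking are left out.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 64		/* assumed toy page size, not QUEUE_PAGESIZE */

typedef struct
{
	int			page;			/* current head page */
	int			offset;			/* next free byte on that page */
} ToyQueue;

/*
 * Write entries lens[start..n) onto the current page only.  Returns the
 * index of the first entry that has not been written yet; the caller is
 * expected to call again until that index reaches n.
 */
static int
toy_add_entries(ToyQueue *q, const int *lens, int start, int n)
{
	int			i = start;

	while (i < n)
	{
		/* an entry larger than a page could never be written */
		assert(lens[i] > 0 && lens[i] <= TOY_PAGE_SIZE);

		if (q->offset + lens[i] > TOY_PAGE_SIZE)
		{
			/* does not fit: give up the rest of this page and stop */
			q->page++;
			q->offset = 0;
			break;
		}

		q->offset += lens[i];
		i++;

		if (q->offset == TOY_PAGE_SIZE)
		{
			/* page exactly full: move on and stop */
			q->page++;
			q->offset = 0;
			break;
		}
	}
	return i;
}

/* Caller side: loop until everything is written, count the calls made. */
static int
toy_write_all(ToyQueue *q, const int *lens, int n)
{
	int			next = 0;
	int			calls = 0;

	while (next < n)
	{
		next = toy_add_entries(q, lens, next, n);
		calls++;
	}
	return calls;
}
```

With entries of 30, 30, 30 and 10 bytes and an empty queue, the first call writes two entries and moves to the next page, and the second call writes the remaining two. In the patch the lock is released between calls, which is what keeps a large batch of notifications from blocking other writers for the whole batch.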
***************
*** 512,550 ****
  }
  
  /*
!  * Exec_Unlisten --- subroutine for AtCommit_Notify
   *
!  *		Remove the current backend from the list of listening backends
!  *		for the specified relation.
   */
  static void
! Exec_Unlisten(Relation lRel, const char *relname)
  {
! 	HeapScanDesc scan;
! 	HeapTuple	tuple;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_Unlisten(%s,%d)", relname, MyProcPid);
  
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(tuple);
! 
! 		if (listener->listenerpid == MyProcPid &&
! 			strncmp(NameStr(listener->relname), relname, NAMEDATALEN) == 0)
  		{
! 			/* Found the matching tuple, delete it */
! 			simple_heap_delete(lRel, &tuple->t_self);
! 
! 			/*
! 			 * We assume there can be only one match, so no need to scan the
! 			 * rest of the table
! 			 */
  			break;
  		}
  	}
! 	heap_endscan(scan);
  
  	/*
  	 * We do not complain about unlistening something not being listened;
--- 890,942 ----
  }
  
  /*
!  * Exec_ListenAfterCommit --- subroutine for AtCommit_NotifyAfterCommit
!  *
!  * Add the channel to the list of channels we are listening on.
!  */
! static void
! Exec_ListenAfterCommit(const char *channel)
! {
! 	MemoryContext oldcontext;
! 
! 	/* Detect whether we are already listening on this channel */
! 	if (IsListeningOn(channel))
! 		return;
! 
! 	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
! 	listenChannels = lappend(listenChannels, pstrdup(channel));
! 	MemoryContextSwitchTo(oldcontext);
! }
! 
! /*
!  * Exec_UnlistenAfterCommit --- subroutine for AtCommit_NotifyAfterCommit
   *
!  * Remove a specified channel from "listenChannels".
   */
  static void
! Exec_UnlistenAfterCommit(const char *channel)
  {
! 	ListCell *q;
! 	ListCell *prev;
  
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAfterCommit(%s,%d)", channel, MyProcPid);
  
! 	prev = NULL;
! 	foreach(q, listenChannels)
  	{
! 		char *lchan = (char *) lfirst(q);
! 		if (strcmp(lchan, channel) == 0)
  		{
! 			pfree(lchan);
! 			listenChannels = list_delete_cell(listenChannels, q, prev);
  			break;
  		}
+ 		prev = q;
  	}
! 
! 	if (listenChannels == NIL)
! 		asyncQueueUnregister();
  
  	/*
  	 * We do not complain about unlistening something not being listened;
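The LISTEN handling is split in two: before commit only the queue pointer is established, and the channel is added to listenChannels after commit, with the abort path undoing a pointer that was set by an initial LISTEN. A minimal sketch of that state machine, under simplifying assumptions: ToyListener and the toy_* functions are hypothetical, a fixed array replaces the List, the "registered" flag stands in for the backend's queue pointer, and resetting the initial-LISTEN flag after commit is an assumption about where the patch clears it.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_CHANNELS 8

typedef struct
{
	const char *channels[TOY_MAX_CHANNELS];	/* committed LISTENs */
	int			nchannels;
	bool		registered;		/* stands in for "queue pointer is set" */
	bool		initialListen;	/* pointer set by a not-yet-committed LISTEN */
} ToyListener;

static bool
toy_is_listening_on(const ToyListener *l, const char *channel)
{
	for (int i = 0; i < l->nchannels; i++)
		if (strcmp(l->channels[i], channel) == 0)
			return true;
	return false;
}

/* Before commit: only establish the pointer, do not record the channel. */
static void
toy_listen_before_commit(ToyListener *l)
{
	if (l->nchannels > 0)
		return;					/* already listening to something */
	l->initialListen = true;
	l->registered = true;
}

/* After commit: the LISTEN is now durable, record the channel. */
static void
toy_listen_after_commit(ToyListener *l, const char *channel)
{
	l->initialListen = false;	/* assumption: flag is cleared at commit */
	if (toy_is_listening_on(l, channel) || l->nchannels == TOY_MAX_CHANNELS)
		return;
	l->channels[l->nchannels++] = channel;
}

/* After commit: drop the channel, unregister when nothing is left. */
static void
toy_unlisten_after_commit(ToyListener *l, const char *channel)
{
	for (int i = 0; i < l->nchannels; i++)
	{
		if (strcmp(l->channels[i], channel) == 0)
		{
			l->channels[i] = l->channels[--l->nchannels];
			break;
		}
	}
	if (l->nchannels == 0)
		l->registered = false;
}

/* Abort: an initial LISTEN that never committed must not leave a pointer. */
static void
toy_at_abort(ToyListener *l)
{
	if (l->initialListen && l->nchannels == 0)
		l->registered = false;
	l->initialListen = false;
}
```

A LISTEN followed by a rollback leaves the toy listener unregistered. A committed LISTEN registers it exactly once even if repeated, and removing the last channel unregisters it again.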
***************
*** 553,690 ****
  }
  
  /*
!  * Exec_UnlistenAll --- subroutine for AtCommit_Notify
   *
!  *		Update pg_listener to unlisten all relations for this backend.
   */
  static void
! Exec_UnlistenAll(Relation lRel)
  {
- 	HeapScanDesc scan;
- 	HeapTuple	lTuple;
- 	ScanKeyData key[1];
- 
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAll");
  
! 	/* Find and delete all entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
  
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 		simple_heap_delete(lRel, &lTuple->t_self);
  
! 	heap_endscan(scan);
  }
  
  /*
!  * Send_Notify --- subroutine for AtCommit_Notify
!  *
!  *		Scan pg_listener for tuples matching our pending notifies, and
!  *		either signal the other backend or send a message to our own frontend.
   */
  static void
! Send_Notify(Relation lRel)
  {
! 	TupleDesc	tdesc = RelationGetDescr(lRel);
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 
! 	/* preset data to update notify column to MyProcPid */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(MyProcPid);
! 
! 	scan = heap_beginscan(lRel, SnapshotNow, 0, NULL);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		listenerPID = listener->listenerpid;
  
! 		if (!AsyncExistsPendingNotify(relname))
! 			continue;
  
! 		if (listenerPID == MyProcPid)
  		{
! 			/*
! 			 * Self-notify: no need to bother with table update. Indeed, we
! 			 * *must not* clear the notification field in this path, or we
! 			 * could lose an outside notify, which'd be bad for applications
! 			 * that ignore self-notify messages.
! 			 */
! 			if (Trace_notify)
! 				elog(DEBUG1, "AtCommit_Notify: notifying self");
  
! 			NotifyMyFrontEnd(relname, listenerPID);
  		}
  		else
  		{
- 			if (Trace_notify)
- 				elog(DEBUG1, "AtCommit_Notify: notifying pid %d",
- 					 listenerPID);
- 
  			/*
! 			 * If someone has already notified this listener, we don't bother
! 			 * modifying the table, but we do still send a NOTIFY_INTERRUPT
! 			 * signal, just in case that backend missed the earlier signal for
! 			 * some reason.  It's OK to send the signal first, because the
! 			 * other guy can't read pg_listener until we unlock it.
! 			 *
! 			 * Note: we don't have the other guy's BackendId available, so
! 			 * this will incur a search of the ProcSignal table.  That's
! 			 * probably not worth worrying about.
  			 */
! 			if (SendProcSignal(listenerPID, PROCSIG_NOTIFY_INTERRUPT,
! 							   InvalidBackendId) < 0)
  			{
! 				/*
! 				 * Get rid of pg_listener entry if it refers to a PID that no
! 				 * longer exists.  Presumably, that backend crashed without
! 				 * deleting its pg_listener entries. This code used to only
! 				 * delete the entry if errno==ESRCH, but as far as I can see
! 				 * we should just do it for any failure (certainly at least
! 				 * for EPERM too...)
! 				 */
! 				simple_heap_delete(lRel, &lTuple->t_self);
  			}
! 			else if (listener->notification == 0)
  			{
! 				/* Rewrite the tuple with my PID in notification column */
! 				rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 				simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 				CatalogUpdateIndexes(lRel, rTuple);
! #endif
  			}
  		}
  	}
  
! 	heap_endscan(scan);
  }
  
  /*
   * AtAbort_Notify
   *
!  *		This is called at transaction abort.
   *
!  *		Gets rid of pending actions and outbound notifies that we would have
!  *		executed if the transaction got committed.
   */
  void
  AtAbort_Notify(void)
  {
  	ClearPendingActionsAndNotifies();
  }
  
--- 945,1325 ----
  }
  
  /*
!  * Exec_UnlistenAllAfterCommit --- subroutine for AtCommit_NotifyAfterCommit
   *
!  *		Unlisten on all channels for this backend.
   */
  static void
! Exec_UnlistenAllAfterCommit(void)
  {
  	if (Trace_notify)
! 		elog(DEBUG1, "Exec_UnlistenAllAfterCommit(%d)", MyProcPid);
! 
! 	list_free_deep(listenChannels);
! 	listenChannels = NIL;
! 
! 	asyncQueueUnregister();
! }
! 
! static void
! asyncQueueUnregister(void)
! {
! 	bool	  advanceTail = false;
  
! 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 	QUEUE_BACKEND_PID(MyBackendId) = InvalidPid;
! 	/*
! 	 * If we have been the last backend, advance the tail pointer.
! 	 */
! 	if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(MyBackendId), QUEUE_TAIL))
! 		advanceTail = true;
! 	LWLockRelease(AsyncQueueLock);
  
! 	if (advanceTail)
! 		asyncQueueAdvanceTail();
! }
  
! static bool
! asyncQueueIsFull(void)
! {
! 	QueuePosition	lookahead = QUEUE_HEAD;
! 	Size			remain = QUEUE_PAGESIZE - QUEUE_POS_OFFSET(lookahead) - 1;
! 	Size			advance = Min(remain, NOTIFY_PAYLOAD_MAX_LENGTH);
! 
! 	/*
! 	 * Check what happens if we wrote a maximally sized entry. Would we go to a
! 	 * new page? If not, then our queue can not be full (because we can still
! 	 * fill at least the current page with at least one more entry).
! 	 */
! 	if (!asyncQueueAdvance(&lookahead, advance))
! 		return false;
! 
! 	/*
! 	 * The queue is full if with a switch to a new page we reach the page
! 	 * of the tail pointer.
! 	 */
! 	return QUEUE_POS_PAGE(lookahead) == QUEUE_POS_PAGE(QUEUE_TAIL);
  }
  
  /*
!  * The function advances the position to the next entry. In case we jump to
!  * a new page the function returns true, else false.
   */
+ static bool
+ asyncQueueAdvance(QueuePosition *position, int entryLength)
+ {
+ 	int		pageno = QUEUE_POS_PAGE(*position);
+ 	int		offset = QUEUE_POS_OFFSET(*position);
+ 	bool	pageJump = false;
+ 
+ 	/*
+ 	 * Move to the next writing position: First jump over what we have just
+ 	 * written or read.
+ 	 */
+ 	offset += entryLength;
+ 	Assert(offset < QUEUE_PAGESIZE);
+ 
+ 	/*
+ 	 * In a second step, check if another entry can be written to the page. If
+ 	 * it can, stay here: we have reached the next position. If not, then we
+ 	 * need to move on to the next page.
+ 	 */
+ 	if (offset + AsyncQueueEntryEmptySize >= QUEUE_PAGESIZE)
+ 	{
+ 		pageno++;
+ 		if (pageno > QUEUE_MAX_PAGE)
+ 			/* wrap around */
+ 			pageno = 0;
+ 		offset = 0;
+ 		pageJump = true;
+ 	}
+ 
+ 	SET_QUEUE_POS(*position, pageno, offset);
+ 	return pageJump;
+ }
+ 
+ static void
+ asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
+ {
+ 		Assert(n->channel != NULL);
+ 		Assert(n->payload != NULL);
+ 		Assert(strlen(n->payload) < NOTIFY_PAYLOAD_MAX_LENGTH);
+ 		Assert(strlen(n->channel) < NAMEDATALEN);
+ 
+ 		/* The terminators are already included in AsyncQueueEntryEmptySize */
+ 		qe->length = AsyncQueueEntryEmptySize + strlen(n->payload)
+ 											  + strlen(n->channel);
+ 		qe->srcPid = MyProcPid;
+ 		qe->dboid = MyDatabaseId;
+ 		qe->xid = GetCurrentTransactionId();
+ 		strcpy(qe->data, n->channel);
+ 		Assert(*(qe->data + strlen(n->channel)) == '\0');
+ 		strcpy(qe->data + strlen(n->channel) + 1, n->payload);
+ }
+ 
  static void
! asyncQueueEntryToNotification(AsyncQueueEntry *qe, Notification *n)
  {
! 	n->channel = pstrdup(qe->data);
! 	Assert(*(qe->data + strlen(qe->data)) == '\0');
! 	n->payload = pstrdup(qe->data + strlen(qe->data) + 1);
! 	n->srcPid = qe->srcPid;
! 	n->xid = qe->xid;
! }
! 
! /*
!  * Add the notifications to the queue: we go page by page here, i.e. we stop
!  * once we have to go to a new page but we will be called again and then fill
!  * that next page. If an entry no longer fits on the page, we write a dummy
!  * entry with an InvalidOid as the database oid in order to fill the page. So
!  * every page is always used up to the last byte which simplifies reading the
!  * page later.
!  *
!  * We are holding AsyncQueueLock already from the caller and grab AsyncCtlLock
!  * here in this function.
!  *
!  * We are passed the list of notifications to write and return the
!  * not-yet-written notifications back. Eventually we will return NIL.
!  */
! static List *
! asyncQueueAddEntries(List *notifications)
! {
! 	AsyncQueueEntry	qe;
! 	int				pageno;
! 	int				offset;
! 	int				slotno;
! 
! 	/*
! 	 * Note that we are holding exclusive AsyncQueueLock already.
! 	 */
! 	LWLockAcquire(AsyncCtlLock, LW_EXCLUSIVE);
! 	pageno = QUEUE_POS_PAGE(QUEUE_HEAD);
! 	slotno = SimpleLruReadPage(AsyncCtl, pageno, true, InvalidTransactionId);
! 	AsyncCtl->shared->page_dirty[slotno] = true;
  
! 	do
! 	{
! 		Notification   *n;
  
! 		if (asyncQueueIsFull())
  		{
! 			/* document that we will not go into the if-block further down */
! 			Assert(QUEUE_POS_OFFSET(QUEUE_HEAD) != 0);
! 			break;
! 		}
! 
! 		n = (Notification *) linitial(notifications);
  
! 		asyncQueueNotificationToEntry(n, &qe);
! 
! 		offset = QUEUE_POS_OFFSET(QUEUE_HEAD);
! 		/*
! 		 * Check whether or not the entry still fits on the current page.
! 		 */
! 		if (offset + qe.length < QUEUE_PAGESIZE)
! 		{
! 			notifications = list_delete_first(notifications);
  		}
  		else
  		{
  			/*
! 			 * Write a dummy entry to fill up the page. Actually readers will
! 			 * only check dboid and since it won't match any reader's database
! 			 * oid, they will ignore this entry and move on.
  			 */
! 			qe.length = QUEUE_PAGESIZE - offset - 1;
! 			qe.dboid = InvalidOid;
! 			qe.data[0] = '\0'; /* empty channel */
! 			qe.data[1] = '\0'; /* empty payload */
! 			qe.xid = InvalidTransactionId;
! 		}
! 		memcpy((char*) AsyncCtl->shared->page_buffer[slotno] + offset,
! 			   &qe, qe.length);
! 
! 	} while (!asyncQueueAdvance(&(QUEUE_HEAD), qe.length)
! 			 && notifications != NIL);
! 
! 	if (QUEUE_POS_OFFSET(QUEUE_HEAD) == 0)
! 	{
! 		/*
! 		 * We need to continue on a new page; stop here but prepare that
! 		 * page already.
! 		 */
! 		slotno = SimpleLruZeroPage(AsyncCtl, QUEUE_POS_PAGE(QUEUE_HEAD));
! 		AsyncCtl->shared->page_dirty[slotno] = true;
! 	}
! 	LWLockRelease(AsyncCtlLock);
! 
! 	return notifications;
! }
! 
! /*
!  * Here we calculate how full our queue already is. As the queue is quite
!  * large, we would probably only see a high filling degree with a long running
!  * idle transaction. We don't emit anything if the queue is less than half
!  * full.
!  *
!  * In case it is between 50% and 75% full, we log a warning and calculate the
!  * "slowest" backend to give a hint on which backend is preventing cleanup.
!  *
!  * There's a similar warning in case our queue is more than 75% full.
!  *
!  * The warnings show up only once every QUEUE_FULL_WARN_INTERVAL.
!  */
! static void
! asyncQueueFillWarning(void)
! {
! 	/*
! 	 * Caller must hold exclusive AsyncQueueLock.
! 	 */
! 	TimestampTz		t;
! 	double			fillDegree;
! 	int				occupied;
! 	int				tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
! 	int				headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
! 
! 	occupied = headPage - tailPage;
! 
! 	if (occupied == 0)
! 		return;
! 	
! 	if (!asyncQueuePagePrecedesPhysically(tailPage, headPage))
! 		/* head has wrapped around, tail not yet */
! 		occupied += QUEUE_MAX_PAGE;
! 
! 	fillDegree = (float) occupied / (float) QUEUE_MAX_PAGE;
! 
! 	if (fillDegree < 0.5)
! 		return;
! 
! 	t = GetCurrentTimestamp();
! 
! 	if (TimestampDifferenceExceeds(asyncQueueControl->lastQueueFillWarn,
! 								   t, QUEUE_FULL_WARN_INTERVAL))
! 	{
! 		QueuePosition	min = QUEUE_HEAD;
! 		int32			minPid = InvalidPid;
! 		int				i;
! 
! 		for (i = 0; i < MaxBackends; i++)
! 			if (QUEUE_BACKEND_PID(i) != InvalidPid)
  			{
! 				min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
! 				if (QUEUE_POS_EQUAL(min, QUEUE_BACKEND_POS(i)))
! 					minPid = QUEUE_BACKEND_PID(i);
  			}
! 
! 		if (fillDegree < 0.75)
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 50%% full. "
! 								 "Among the slowest backends: %d", minPid),
! 								errdetail("Cleanup can only proceed if "
! 										  "this backend ends its current "
! 										  "transaction")));
! 		else
! 			ereport(WARNING, (errmsg("pg_notify queue is more than 75%% full. "
! 								 "Among the slowest backends: %d", minPid),
! 								errdetail("Cleanup can only proceed if "
! 										  "this backend ends its current "
! 										  "transaction"))); 
! 
! 		asyncQueueControl->lastQueueFillWarn = t;
! 	}
! }
! 
! /*
!  * Send signals to all listening backends. Since we have EXCLUSIVE lock anyway
!  * we also check the position of the other backends and in case that anyone is
!  * already up-to-date we don't signal it. This can happen if concurrent
!  * notifying transactions have sent a signal and the signaled backend has read
!  * the other notifications and ours in the same step.
!  *
!  * Since we know the BackendId and the Pid the signalling is quite cheap.
!  */
! static void
! SignalBackends(void)
! {
! 	QueuePosition	pos;
! 	ListCell	   *p1, *p2;
! 	int				i;
! 	int32			pid;
! 	List		   *pids = NIL;
! 	List		   *ids = NIL;
! 	int				count = 0;
! 
! 	/* Signal everybody who is LISTENing to any channel. */
! 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
! 	for (i = 0; i < MaxBackends; i++)
! 	{
! 		pid = QUEUE_BACKEND_PID(i);
! 		if (pid != InvalidPid)
! 		{
! 			count++;
! 			pos = QUEUE_BACKEND_POS(i);
! 			if (!QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
  			{
! 				pids = lappend_int(pids, pid);
! 				ids = lappend_int(ids, i);
  			}
  		}
  	}
+ 	LWLockRelease(AsyncQueueLock);
  
! 	forboth(p1, pids, p2, ids)
! 	{
! 		pid = (int32) lfirst_int(p1);
! 		i = lfirst_int(p2);
! 		/*
! 		 * Should we check for failure? Can it happen that a backend
! 		 * has crashed without the postmaster starting over?
! 		 */
! 		if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, i) < 0)
! 			elog(WARNING, "Error signalling backend %d", pid);
! 	}
! 
! 	if (count == 0)
! 	{
! 		/* No backend is listening at all, try to clean up the queue.
! 		 * Even if a backend has started to listen in the meantime (after
! 		 * we determined count to be 0), advancing the tail does not
! 		 * hurt. Our notifications are committed already and a newly
! 		 * listening backend would skip over them anyway. */
! 		asyncQueueAdvanceTail();
! 	}
  }
  
  /*
   * AtAbort_Notify
   *
!  *	This is called at transaction abort.
   *
!  *	Gets rid of pending actions and outbound notifies that we would have
!  *	executed if the transaction got committed.
!  *
!  *	Even though we have not committed, we need to signal the listening backends
!  *	because our notifications might block readers from processing the queue.
!  *	Now that the transaction has aborted, they can go on and skip over our
!  *	notifications. They could find notifications past ours that they need to
!  *	deliver.
   */
  void
  AtAbort_Notify(void)
  {
+ 	if (backendHasSentNotifications)
+ 		SignalBackends();
+ 
+ 	/*
+ 	 * If we LISTEN but then roll back the transaction we have set our pointer
+ 	 * but have not made the entry in listenChannels. In that case, remove
+ 	 * our pointer again.
+ 	 */
+ 	if (backendHasExecutedInitialListen)
+ 		/*
+ 		 * Checking listenChannels should be redundant but it can't hurt doing
+ 		 * it for safety reasons.
+ 		*/
+ 		if (listenChannels == NIL)
+ 			asyncQueueUnregister();
+ 
  	ClearPendingActionsAndNotifies();
  }
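The position arithmetic in asyncQueueAdvance() and the full-queue test in asyncQueueIsFull() can be tried out in isolation. The sketch below mirrors their logic on a tiny ring of four pages. The constants and names (TOY_PAGESIZE, TOY_MAX_PAGE, TOY_ENTRY_MIN, TOY_ENTRY_MAX, ToyPos, toy_advance, toy_is_full) are assumptions chosen for illustration and are not the values used by the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGESIZE	8192	/* assumed page size */
#define TOY_MAX_PAGE	3		/* pages are numbered 0..3, then wrap */
#define TOY_ENTRY_MIN	16		/* assumed size of an entry with empty strings */
#define TOY_ENTRY_MAX	1024	/* assumed upper bound used for the lookahead */

typedef struct
{
	int			page;
	int			offset;
} ToyPos;

/*
 * Skip over an entry of the given length.  If not even a minimal entry
 * fits behind it, move to offset 0 of the next page (wrapping around)
 * and report the page jump.
 */
static bool
toy_advance(ToyPos *pos, int len)
{
	bool		jump = false;

	pos->offset += len;
	assert(pos->offset < TOY_PAGESIZE);

	if (pos->offset + TOY_ENTRY_MIN >= TOY_PAGESIZE)
	{
		pos->page = (pos->page >= TOY_MAX_PAGE) ? 0 : pos->page + 1;
		pos->offset = 0;
		jump = true;
	}
	return jump;
}

/*
 * Pretend to write one maximal entry at the head.  If that stays on the
 * current page there is still room.  Otherwise the queue is full exactly
 * when the page we would switch to is the page the tail is still on.
 */
static bool
toy_is_full(ToyPos head, ToyPos tail)
{
	ToyPos		look = head;
	int			remain = TOY_PAGESIZE - look.offset - 1;
	int			adv = (remain < TOY_ENTRY_MAX) ? remain : TOY_ENTRY_MAX;

	if (!toy_advance(&look, adv))
		return false;
	return look.page == tail.page;
}
```

With the head near the end of page 3 and the tail on page 0, the lookahead wraps to page 0 and the queue counts as full; with the tail on page 2 it does not. The check is deliberately conservative: it reports "full" one page early rather than letting the head overwrite a page that a reader may still need.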
  
***************
*** 940,968 ****
  }
  
  /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan pg_listener for arriving notifies, report them to my front end,
!  *		and clear the notification field in pg_listener until next time.
   *
!  *		NOTE: since we are outside any transaction, we must create our own.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	Relation	lRel;
! 	TupleDesc	tdesc;
! 	ScanKeyData key[1];
! 	HeapScanDesc scan;
! 	HeapTuple	lTuple,
! 				rTuple;
! 	Datum		value[Natts_pg_listener];
! 	bool		repl[Natts_pg_listener],
! 				nulls[Natts_pg_listener];
! 	bool		catchup_enabled;
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
--- 1575,1809 ----
  }
  
  /*
+  * This function will ask for a page with ReadOnly access and once we have the
+  * lock, we read the whole content and pass back the list of notifications
+  * that the calling function will deliver then. The list will contain all
+  * notifications from transactions that have already committed.
+  *
+  * We stop if we have either reached the stop position or go to a new page.
+  *
+  * The function returns true once we have reached the end or a notification of
+  * a transaction that is still running and false if we have finished with
+  * the page. In other words: once it returns true there is no point in calling
+  * it again.
+  */
+ static bool
+ asyncQueueGetEntriesByPage(QueuePosition *current,
+ 						   QueuePosition stop,
+ 						   List **notifications)
+ {
+ 	AsyncQueueEntry	qe;
+ 	Notification   *n;
+ 	int				slotno;
+ 	bool			reachedStop = false;
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		return true;
+ 
+ 	slotno = SimpleLruReadPage_ReadOnly(AsyncCtl, current->page,
+ 										InvalidTransactionId);
+ 	do {
+ 		char *readPtr = (char *) (AsyncCtl->shared->page_buffer[slotno]);
+ 
+ 		if (QUEUE_POS_EQUAL(*current, stop))
+ 		{
+ 			reachedStop = true;
+ 			break;
+ 		}
+ 
+ 		readPtr += current->offset;
+ 		/* at first we only read the header of the notification */
+ 		memcpy(&qe, readPtr, AsyncQueueEntryEmptySize);
+ 
+ 		if (qe.dboid == MyDatabaseId)
+ 		{
+ 			if (TransactionIdDidCommit(qe.xid))
+ 			{
+ 				memcpy(&qe, readPtr, qe.length);
+ 				/* qe.data is the NUL terminated channel name */
+ 				if (IsListeningOn(qe.data))
+ 				{
+ 					n = (Notification *) palloc(sizeof(Notification));
+ 					asyncQueueEntryToNotification(&qe, n);
+ 					*notifications = lappend(*notifications, n);
+ 				}
+ 			}
+ 			else
+ 			{
+ 				if (!TransactionIdDidAbort(qe.xid))
+ 				{
+ 					/*
+ 					 * The transaction has neither committed nor aborted so
+ 					 * far.
+ 					 */
+ 					reachedStop = true;
+ 					break;
+ 				}
+ 				/*
+ 				 * Here we know that the transaction has aborted, we just
+ 				 * ignore its notifications.
+ 				 */
+ 			}
+ 		}
+ 		/*
+ 		 * The call to asyncQueueAdvance just jumps over what we have
+ 		 * just read. If there is no more space for the next record on the
+ 		 * current page, it will also switch to the beginning of the next page.
+ 		 */
+ 	} while(!asyncQueueAdvance(current, qe.length));
+ 
+ 	/*
+ 	 * Release the lock that we implicitly got from
+ 	 * SimpleLruReadPage_ReadOnly().
+ 	 */
+ 	LWLockRelease(AsyncCtlLock);
+ 
+ 	if (QUEUE_POS_EQUAL(*current, stop))
+ 		reachedStop = true;
+ 
+ 	return reachedStop;
+ }
+ 
+ 
+ static void
+ asyncQueueReadAllNotifications(void)
+ {
+ 	QueuePosition	pos;
+ 	QueuePosition	oldpos;
+ 	QueuePosition	head;
+ 	List		   *notifications;
+ 	ListCell	   *lc;
+ 	Notification   *n;
+ 	bool			advanceTail = false;
+ 	bool			reachedStop;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	pos = oldpos = QUEUE_BACKEND_POS(MyBackendId);
+ 	head = QUEUE_HEAD;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (QUEUE_POS_EQUAL(pos, head))
+ 	{
+ 		/* Nothing to do, we have read all notifications already. */
+ 		return;
+ 	}
+ 
+ 	do 
+ 	{
+ 		/*
+ 		 * Our stop position is what we found to be the head's position when
+ 		 * we entered this function. It might have changed already. But if it
+ 		 * has, we will receive (or have already received and queued) another
+ 		 * signal and come here again.
+ 		 *
+ 		 * We are not holding AsyncQueueLock here! The queue can only extend
+ 		 * beyond the head pointer (see above) and we leave our backend's
+ 		 * pointer where it is so nobody will truncate or rewrite pages under
+ 		 * us. Especially we don't want to hold a lock while sending the
+ 		 * notifications to the frontend.
+ 		 */
+ 		reachedStop = false;
+ 
+ 		notifications = NIL;
+ 		reachedStop = asyncQueueGetEntriesByPage(&pos, head, &notifications);
+ 
+ 		/*
+ 		 * Note that we deliver everything that we see in the queue and that
+ 		 * matches our _current_ listening state.
+ 		 * Especially we do not take into account different commit times.
+ 		 *
+ 		 * See the following example:
+ 		 *
+ 		 * Backend 1:                    Backend 2:
+ 		 *
+ 		 * transaction starts
+ 		 * NOTIFY foo;
+ 		 * commit starts
+ 		 *                               transaction starts
+ 		 *                               LISTEN foo;
+ 		 *                               commit starts
+ 		 * commit to clog
+ 		 *                               commit to clog
+ 		 *
+ 		 * It could happen that backend 2 sees the notification from
+ 		 * backend 1 in the queue and even though the notifying transaction
+ 		 * committed before the listening transaction, we still deliver the
+ 		 * notification.
+ 		 *
+ 		 * The idea is that an additional notification does not do any
+ 		 * harm; we just need to make sure that we do not miss a
+ 		 * notification.
+ 		 */
+ 		foreach(lc, notifications)
+ 		{
+ 			n = (Notification *) lfirst(lc);
+ 			NotifyMyFrontEnd(n->channel, n->payload, n->srcPid);
+ 		}
+ 	} while (!reachedStop);
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
+ 	QUEUE_BACKEND_POS(MyBackendId) = pos;
+ 	if (QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL))
+ 		advanceTail = true;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	if (advanceTail)
+ 		/* Move forward the tail pointer and try to truncate. */
+ 		asyncQueueAdvanceTail();
+ }
+ 
+ static void
+ asyncQueueAdvanceTail(void)
+ {
+ 	QueuePosition	min;
+ 	int				i;
+ 	int				tailp;
+ 	int				headp;
+ 
+ 	LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+ 	min = QUEUE_HEAD;
+ 	for (i = 0; i < MaxBackends; i++)
+ 		if (QUEUE_BACKEND_PID(i) != InvalidPid)
+ 			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i), QUEUE_HEAD);
+ 
+ 	tailp = QUEUE_POS_PAGE(QUEUE_TAIL);
+ 	headp = QUEUE_POS_PAGE(QUEUE_HEAD);
+ 	QUEUE_TAIL = min;
+ 	LWLockRelease(AsyncQueueLock);
+ 
+ 	/* This is our wraparound check */
+ 	if ((asyncQueuePagePrecedesLogically(tailp, QUEUE_POS_PAGE(min), headp)
+ 			&& asyncQueuePagePrecedesPhysically(tailp, headp))
+ 		|| tailp == QUEUE_POS_PAGE(min))
+ 	{
+ 		/*
+ 		 * SimpleLruTruncate() will ask for AsyncCtlLock but will also
+ 		 * release the lock again.
+ 		 *
+ 		 * XXX this could be optimized, to call SimpleLruTruncate only when we
+ 		 * know that we can truncate something.
+ 		 */
+ 		SimpleLruTruncate(AsyncCtl, QUEUE_POS_PAGE(min));
+ 	}
+ }
+ 
+ /*
   * ProcessIncomingNotify
   *
   *		Deal with arriving NOTIFYs from other backends.
   *		This is called either directly from the PROCSIG_NOTIFY_INTERRUPT
   *		signal handler, or the next time control reaches the outer idle loop.
!  *		Scan the queue for arriving notifications and report them to my front
!  *		end.
   *
!  *		NOTE: we are outside of any transaction here.
   */
  static void
  ProcessIncomingNotify(void)
  {
! 	bool			catchup_enabled;
! 
! 	Assert(GetCurrentTransactionIdIfAny() == InvalidTransactionId);
  
  	/* Must prevent catchup interrupt while I am running */
  	catchup_enabled = DisableCatchupInterrupt();
***************
*** 974,1037 ****
  
  	notifyInterruptOccurred = 0;
  
! 	StartTransactionCommand();
! 
! 	lRel = heap_open(ListenerRelationId, ExclusiveLock);
! 	tdesc = RelationGetDescr(lRel);
! 
! 	/* Scan only entries with my listenerPID */
! 	ScanKeyInit(&key[0],
! 				Anum_pg_listener_listenerpid,
! 				BTEqualStrategyNumber, F_INT4EQ,
! 				Int32GetDatum(MyProcPid));
! 	scan = heap_beginscan(lRel, SnapshotNow, 1, key);
! 
! 	/* Prepare data for rewriting 0 into notification field */
! 	memset(nulls, false, sizeof(nulls));
! 	memset(repl, false, sizeof(repl));
! 	repl[Anum_pg_listener_notification - 1] = true;
! 	memset(value, 0, sizeof(value));
! 	value[Anum_pg_listener_notification - 1] = Int32GetDatum(0);
! 
! 	while ((lTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
! 	{
! 		Form_pg_listener listener = (Form_pg_listener) GETSTRUCT(lTuple);
! 		char	   *relname = NameStr(listener->relname);
! 		int32		sourcePID = listener->notification;
! 
! 		if (sourcePID != 0)
! 		{
! 			/* Notify the frontend */
! 
! 			if (Trace_notify)
! 				elog(DEBUG1, "ProcessIncomingNotify: received %s from %d",
! 					 relname, (int) sourcePID);
! 
! 			NotifyMyFrontEnd(relname, sourcePID);
! 
! 			/*
! 			 * Rewrite the tuple with 0 in notification column.
! 			 */
! 			rTuple = heap_modify_tuple(lTuple, tdesc, value, nulls, repl);
! 			simple_heap_update(lRel, &lTuple->t_self, rTuple);
! 
! #ifdef NOT_USED					/* currently there are no indexes */
! 			CatalogUpdateIndexes(lRel, rTuple);
! #endif
! 		}
! 	}
! 	heap_endscan(scan);
! 
! 	/*
! 	 * We do NOT release the lock on pg_listener here; we need to hold it
! 	 * until end of transaction (which is about to happen, anyway) to ensure
! 	 * that other backends see our tuple updates when they look. Otherwise, a
! 	 * transaction started after this one might mistakenly think it doesn't
! 	 * need to send this backend a new NOTIFY.
! 	 */
! 	heap_close(lRel, NoLock);
! 
! 	CommitTransactionCommand();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
--- 1815,1821 ----
  
  	notifyInterruptOccurred = 0;
  
! 	asyncQueueReadAllNotifications();
  
  	/*
  	 * Must flush the notify messages to ensure frontend gets them promptly.
***************
*** 1051,1070 ****
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(char *relname, int32 listenerPID)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, listenerPID, sizeof(int32));
! 		pq_sendstring(&buf, relname);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 		{
! 			/* XXX Add parameter string here later */
! 			pq_sendstring(&buf, "");
! 		}
  		pq_endmessage(&buf);
  
  		/*
--- 1835,1851 ----
   * Send NOTIFY message to my front end.
   */
  static void
! NotifyMyFrontEnd(const char *channel, const char *payload, int32 srcPid)
  {
  	if (whereToSendOutput == DestRemote)
  	{
  		StringInfoData buf;
  
  		pq_beginmessage(&buf, 'A');
! 		pq_sendint(&buf, srcPid, sizeof(int32));
! 		pq_sendstring(&buf, channel);
  		if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 3)
! 			pq_sendstring(&buf, payload);
  		pq_endmessage(&buf);
  
  		/*
***************
*** 1074,1096 ****
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", relname);
  }
  
! /* Does pendingNotifies include the given relname? */
  static bool
! AsyncExistsPendingNotify(const char *relname)
  {
  	ListCell   *p;
  
! 	foreach(p, pendingNotifies)
! 	{
! 		const char *prelname = (const char *) lfirst(p);
  
! 		if (strcmp(prelname, relname) == 0)
  			return true;
  	}
  
  	return false;
  }
  
--- 1855,1919 ----
  		 */
  	}
  	else
! 		elog(INFO, "NOTIFY for %s", channel);
  }
  
! /* Does pendingNotifies include the given channel/payload? */
  static bool
! AsyncExistsPendingNotify(const char *channel, const char *payload)
  {
  	ListCell   *p;
+ 	Notification *n;
  
! 	if (pendingNotifies == NIL)
! 		return false;
! 
! 	if (payload == NULL)
! 		payload = "";
  
! 	/*
! 	 * We need to append new elements to the end of the list in order to keep
! 	 * the order. However, on the other hand we'd like to check the list
! 	 * backwards in order to make duplicate-elimination a tad faster when the
! 	 * same condition is signaled many times in a row. So as a compromise we
! 	 * check the tail element first which we can access directly. If this
! 	 * doesn't match, we check the rest of the list.
! 	 *
! 	 * As we are not checking our parents' lists, we can still get duplicates
! 	 * in combination with subtransactions, like in:
! 	 *
! 	 * begin;
! 	 * notify foo '1';
! 	 * savepoint foo;
! 	 * notify foo '1';
! 	 * commit;
! 	 */
! 	n = (Notification *) llast(pendingNotifies);
! 	if (strcmp(n->channel, channel) == 0)
! 	{
! 		Assert(n->payload != NULL);
! 		if (strcmp(n->payload, payload) == 0)
  			return true;
  	}
  
+ 	/*
+ 	 * Note the difference to foreach(). We stop if p is the last element
+ 	 * already. So we don't check the last element, we have checked it already.
+  	 */
+ 	for(p = list_head(pendingNotifies);
+ 		p != list_tail(pendingNotifies);
+ 		p = lnext(p))
+ 	{
+ 		n = (Notification *) lfirst(p);
+ 
+ 		if (strcmp(n->channel, channel) == 0)
+ 		{
+ 			Assert(n->payload != NULL);
+ 			if (strcmp(n->payload, payload) == 0)
+ 				return true;
+ 		}
+ 	}
+ 
  	return false;
  }
  
***************
*** 1107,1128 ****
  	 */
  	pendingActions = NIL;
  	pendingNotifies = NIL;
- }
  
! /*
!  * 2PC processing routine for COMMIT PREPARED case.
!  *
!  * (We don't have to do anything for ROLLBACK PREPARED.)
!  */
! void
! notify_twophase_postcommit(TransactionId xid, uint16 info,
! 						   void *recdata, uint32 len)
! {
! 	/*
! 	 * Set up to issue the NOTIFY at the end of my own current transaction.
! 	 * (XXX this has some issues if my own transaction later rolls back, or if
! 	 * there is any significant delay before I commit.	OK for now because we
! 	 * disallow COMMIT PREPARED inside a transaction block.)
! 	 */
! 	Async_Notify((char *) recdata);
  }
--- 1930,1937 ----
  	 */
  	pendingActions = NIL;
  	pendingNotifies = NIL;
  
! 	backendHasSentNotifications = false;
! 	backendHasExecutedInitialListen = false;
  }
+ 
diff -cr cvs.head/src/backend/nodes/copyfuncs.c cvs.build/src/backend/nodes/copyfuncs.c
*** cvs.head/src/backend/nodes/copyfuncs.c	2010-02-14 16:02:46.000000000 +0100
--- cvs.build/src/backend/nodes/copyfuncs.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 2777,2782 ****
--- 2777,2783 ----
  	NotifyStmt *newnode = makeNode(NotifyStmt);
  
  	COPY_STRING_FIELD(conditionname);
+ 	COPY_STRING_FIELD(payload);
  
  	return newnode;
  }
diff -cr cvs.head/src/backend/nodes/equalfuncs.c cvs.build/src/backend/nodes/equalfuncs.c
*** cvs.head/src/backend/nodes/equalfuncs.c	2010-02-14 16:02:46.000000000 +0100
--- cvs.build/src/backend/nodes/equalfuncs.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 1325,1330 ****
--- 1325,1331 ----
  _equalNotifyStmt(NotifyStmt *a, NotifyStmt *b)
  {
  	COMPARE_STRING_FIELD(conditionname);
+ 	COMPARE_STRING_FIELD(payload);
  
  	return true;
  }
diff -cr cvs.head/src/backend/nodes/outfuncs.c cvs.build/src/backend/nodes/outfuncs.c
*** cvs.head/src/backend/nodes/outfuncs.c	2010-02-14 16:02:46.000000000 +0100
--- cvs.build/src/backend/nodes/outfuncs.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 1820,1825 ****
--- 1820,1826 ----
  	WRITE_NODE_TYPE("NOTIFY");
  
  	WRITE_STRING_FIELD(conditionname);
+ 	WRITE_STRING_FIELD(payload);
  }
  
  static void
diff -cr cvs.head/src/backend/nodes/readfuncs.c cvs.build/src/backend/nodes/readfuncs.c
*** cvs.head/src/backend/nodes/readfuncs.c	2010-02-14 16:02:46.000000000 +0100
--- cvs.build/src/backend/nodes/readfuncs.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 231,236 ****
--- 231,237 ----
  	READ_LOCALS(NotifyStmt);
  
  	READ_STRING_FIELD(conditionname);
+ 	READ_STRING_FIELD(payload);
  
  	READ_DONE();
  }
diff -cr cvs.head/src/backend/parser/gram.y cvs.build/src/backend/parser/gram.y
*** cvs.head/src/backend/parser/gram.y	2010-02-14 16:02:47.000000000 +0100
--- cvs.build/src/backend/parser/gram.y	2010-02-14 16:04:25.000000000 +0100
***************
*** 400,406 ****
  
  %type <ival>	Iconst SignedIconst
  %type <list>	Iconst_list
! %type <str>		Sconst comment_text
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
--- 400,406 ----
  
  %type <ival>	Iconst SignedIconst
  %type <list>	Iconst_list
! %type <str>		Sconst comment_text notify_payload
  %type <str>		RoleId opt_granted_by opt_boolean ColId_or_Sconst
  %type <list>	var_list
  %type <str>		ColId ColLabel var_name type_function_name param_name
***************
*** 6123,6132 ****
   *
   *****************************************************************************/
  
! NotifyStmt: NOTIFY ColId
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
  					$$ = (Node *)n;
  				}
  		;
--- 6123,6138 ----
   *
   *****************************************************************************/
  
! notify_payload:
! 			Sconst								{ $$ = $1; }
! 			| /*EMPTY*/							{ $$ = NULL; }
! 		;
! 
! NotifyStmt: NOTIFY ColId notify_payload
  				{
  					NotifyStmt *n = makeNode(NotifyStmt);
  					n->conditionname = $2;
+ 					n->payload = $3;
  					$$ = (Node *)n;
  				}
  		;
diff -cr cvs.head/src/backend/storage/ipc/ipci.c cvs.build/src/backend/storage/ipc/ipci.c
*** cvs.head/src/backend/storage/ipc/ipci.c	2010-01-20 20:08:27.000000000 +0100
--- cvs.build/src/backend/storage/ipc/ipci.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 20,25 ****
--- 20,26 ----
  #include "access/nbtree.h"
  #include "access/subtrans.h"
  #include "access/twophase.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "postmaster/autovacuum.h"
***************
*** 225,230 ****
--- 226,232 ----
  	 */
  	BTreeShmemInit();
  	SyncScanShmemInit();
+ 	AsyncShmemInit();
  
  #ifdef EXEC_BACKEND
  
diff -cr cvs.head/src/backend/storage/lmgr/lwlock.c cvs.build/src/backend/storage/lmgr/lwlock.c
*** cvs.head/src/backend/storage/lmgr/lwlock.c	2010-01-05 12:39:29.000000000 +0100
--- cvs.build/src/backend/storage/lmgr/lwlock.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 24,29 ****
--- 24,30 ----
  #include "access/clog.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
+ #include "commands/async.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "storage/ipc.h"
***************
*** 174,179 ****
--- 175,183 ----
  	/* multixact.c needs two SLRU areas */
  	numLocks += NUM_MXACTOFFSET_BUFFERS + NUM_MXACTMEMBER_BUFFERS;
  
+ 	/* async.c needs one per page for the AsyncQueue */
+ 	numLocks += NUM_ASYNC_BUFFERS;
+ 
  	/*
  	 * Add any requested by loadable modules; for backwards-compatibility
  	 * reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
diff -cr cvs.head/src/backend/tcop/utility.c cvs.build/src/backend/tcop/utility.c
*** cvs.head/src/backend/tcop/utility.c	2010-01-30 22:06:36.000000000 +0100
--- cvs.build/src/backend/tcop/utility.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 930,936 ****
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
  				PreventCommandDuringRecovery();
  
! 				Async_Notify(stmt->conditionname);
  			}
  			break;
  
--- 930,936 ----
  				NotifyStmt *stmt = (NotifyStmt *) parsetree;
  				PreventCommandDuringRecovery();
  
! 				Async_Notify(stmt->conditionname, stmt->payload);
  			}
  			break;
  
diff -cr cvs.head/src/bin/initdb/initdb.c cvs.build/src/bin/initdb/initdb.c
*** cvs.head/src/bin/initdb/initdb.c	2010-01-30 22:06:37.000000000 +0100
--- cvs.build/src/bin/initdb/initdb.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 2458,2463 ****
--- 2458,2464 ----
  		"pg_xlog",
  		"pg_xlog/archive_status",
  		"pg_clog",
+ 		"pg_notify",
  		"pg_subtrans",
  		"pg_twophase",
  		"pg_multixact/members",
diff -cr cvs.head/src/bin/psql/common.c cvs.build/src/bin/psql/common.c
*** cvs.head/src/bin/psql/common.c	2010-01-05 12:39:33.000000000 +0100
--- cvs.build/src/bin/psql/common.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 555,562 ****
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" received from server process with PID %d.\n"),
! 				notify->relname, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
--- 555,562 ----
  
  	while ((notify = PQnotifies(pset.db)))
  	{
! 		fprintf(pset.queryFout, _("Asynchronous notification \"%s\" (%s) received from server process with PID %d.\n"),
! 				notify->relname, notify->extra, notify->be_pid);
  		fflush(pset.queryFout);
  		PQfreemem(notify);
  	}
diff -cr cvs.head/src/bin/psql/tab-complete.c cvs.build/src/bin/psql/tab-complete.c
*** cvs.head/src/bin/psql/tab-complete.c	2010-01-30 22:06:37.000000000 +0100
--- cvs.build/src/bin/psql/tab-complete.c	2010-02-14 16:04:25.000000000 +0100
***************
*** 1852,1858 ****
  
  /* NOTIFY */
  	else if (pg_strcasecmp(prev_wd, "NOTIFY") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(relname) FROM pg_catalog.pg_listener WHERE substring(pg_catalog.quote_ident(relname),1,%d)='%s'");
  
  /* OPTIONS */
  	else if (pg_strcasecmp(prev_wd, "OPTIONS") == 0)
--- 1852,1858 ----
  
  /* NOTIFY */
  	else if (pg_strcasecmp(prev_wd, "NOTIFY") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(channel) FROM pg_catalog.pg_listening() AS channel WHERE substring(pg_catalog.quote_ident(channel),1,%d)='%s'");
  
  /* OPTIONS */
  	else if (pg_strcasecmp(prev_wd, "OPTIONS") == 0)
***************
*** 2093,2099 ****
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(relname) FROM pg_catalog.pg_listener WHERE substring(pg_catalog.quote_ident(relname),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
--- 2093,2099 ----
  
  /* UNLISTEN */
  	else if (pg_strcasecmp(prev_wd, "UNLISTEN") == 0)
! 		COMPLETE_WITH_QUERY("SELECT pg_catalog.quote_ident(channel) FROM pg_catalog.pg_listening() AS channel WHERE substring(pg_catalog.quote_ident(channel),1,%d)='%s' UNION SELECT '*'");
  
  /* UPDATE */
  	/* If prev. word is UPDATE suggest a list of tables */
diff -cr cvs.head/src/include/access/slru.h cvs.build/src/include/access/slru.h
*** cvs.head/src/include/access/slru.h	2010-01-05 12:39:34.000000000 +0100
--- cvs.build/src/include/access/slru.h	2010-02-14 16:04:25.000000000 +0100
***************
*** 16,21 ****
--- 16,40 ----
  #include "access/xlogdefs.h"
  #include "storage/lwlock.h"
  
+ /*
+  * Define segment size.  A page is the same BLCKSZ as is used everywhere
+  * else in Postgres.  The segment size can be chosen somewhat arbitrarily;
+  * we make it 32 pages by default, or 256Kb, i.e. 1M transactions for CLOG
+  * or 64K transactions for SUBTRANS.
+  *
+  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+  * page numbering also wraps around at 0xFFFFFFFF/xxxx_XACTS_PER_PAGE (where
+  * xxxx is CLOG or SUBTRANS, respectively), and segment numbering at
+  * 0xFFFFFFFF/xxxx_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need
+  * take no explicit notice of that fact in this module, except when comparing
+  * segment and page numbers in SimpleLruTruncate (see PagePrecedes()).
+  *
+  * Note: this file currently assumes that segment file names will be four
+  * hex digits.	This sets a lower bound on the segment size (64K transactions
+  * for 32-bit TransactionIds).
+  */
+ #define SLRU_PAGES_PER_SEGMENT	32
+ 
  
  /*
   * Page status codes.  Note that these do not include the "dirty" bit.
diff -cr cvs.head/src/include/access/twophase_rmgr.h cvs.build/src/include/access/twophase_rmgr.h
*** cvs.head/src/include/access/twophase_rmgr.h	2010-01-05 12:39:34.000000000 +0100
--- cvs.build/src/include/access/twophase_rmgr.h	2010-02-14 16:04:25.000000000 +0100
***************
*** 23,31 ****
   */
  #define TWOPHASE_RM_END_ID			0
  #define TWOPHASE_RM_LOCK_ID			1
! #define TWOPHASE_RM_NOTIFY_ID		2
! #define TWOPHASE_RM_PGSTAT_ID		3
! #define TWOPHASE_RM_MULTIXACT_ID	4
  #define TWOPHASE_RM_MAX_ID			TWOPHASE_RM_MULTIXACT_ID
  
  extern const TwoPhaseCallback twophase_recover_callbacks[];
--- 23,30 ----
   */
  #define TWOPHASE_RM_END_ID			0
  #define TWOPHASE_RM_LOCK_ID			1
! #define TWOPHASE_RM_PGSTAT_ID		2
! #define TWOPHASE_RM_MULTIXACT_ID	3
  #define TWOPHASE_RM_MAX_ID			TWOPHASE_RM_MULTIXACT_ID
  
  extern const TwoPhaseCallback twophase_recover_callbacks[];
diff -cr cvs.head/src/include/catalog/pg_proc.h cvs.build/src/include/catalog/pg_proc.h
*** cvs.head/src/include/catalog/pg_proc.h	2010-02-10 20:33:17.000000000 +0100
--- cvs.build/src/include/catalog/pg_proc.h	2010-02-14 16:04:25.000000000 +0100
***************
*** 4127,4132 ****
--- 4127,4136 ----
  DESCR("get the prepared statements for this session");
  DATA(insert OID = 2511 (  pg_cursor PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,16,16,16,1184}" "{o,o,o,o,o,o}" "{name,statement,is_holdable,is_binary,is_scrollable,creation_time}" _null_ pg_cursor _null_ _null_ _null_ ));
  DESCR("get the open cursors for this session");
+ DATA(insert OID = 3036 (  pg_listening	PGNSP	PGUID 12 1 10 0 f f f t t s 0 0 25 "" _null_ _null_ _null_ _null_ pg_listening _null_ _null_ _null_ ));
+ DESCR("get the channels that the current backend listens to");
+ DATA(insert OID = 3035 (  pg_notify  PGNSP PGUID 12 1 0 0 f f f f f v 2 0 2278 "25 25" _null_ _null_ _null_ _null_ pg_notify _null_ _null_ _null_));
+ DESCR("send a notification to clients");
  DATA(insert OID = 2599 (  pg_timezone_abbrevs	PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,1186,16}" "{o,o,o}" "{abbrev,utc_offset,is_dst}" _null_ pg_timezone_abbrevs _null_ _null_ _null_ ));
  DESCR("get the available time zone abbreviations");
  DATA(insert OID = 2856 (  pg_timezone_names		PGNSP PGUID 12 1 1000 0 f f f t t s 0 0 2249 "" "{25,25,1186,16}" "{o,o,o,o}" "{name,abbrev,utc_offset,is_dst}" _null_ pg_timezone_names _null_ _null_ _null_ ));
diff -cr cvs.head/src/include/commands/async.h cvs.build/src/include/commands/async.h
*** cvs.head/src/include/commands/async.h	2010-01-05 12:39:35.000000000 +0100
--- cvs.build/src/include/commands/async.h	2010-02-14 16:48:57.000000000 +0100
***************
*** 13,28 ****
  #ifndef ASYNC_H
  #define ASYNC_H
  
  extern bool Trace_notify;
  
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_Notify(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
--- 13,41 ----
  #ifndef ASYNC_H
  #define ASYNC_H
  
+ /*
+  * Maximum size of the payload, including the terminating NUL byte.
+  */
+ #define NOTIFY_PAYLOAD_MAX_LENGTH	8000
+ 
+ /*
+  * The number of page slots that we reserve.
+  */
+ #define NUM_ASYNC_BUFFERS			8
+ 
  extern bool Trace_notify;
  
+ extern void AsyncShmemInit(void);
+ 
  /* notify-related SQL statements */
! extern void Async_Notify(const char *relname, const char *payload);
  extern void Async_Listen(const char *relname);
  extern void Async_Unlisten(const char *relname);
  extern void Async_UnlistenAll(void);
  
  /* perform (or cancel) outbound notify processing at transaction commit */
! extern void AtCommit_NotifyBeforeCommit(void);
! extern void AtCommit_NotifyAfterCommit(void);
  extern void AtAbort_Notify(void);
  extern void AtSubStart_Notify(void);
  extern void AtSubCommit_Notify(void);
***************
*** 43,46 ****
--- 56,62 ----
  extern void notify_twophase_postcommit(TransactionId xid, uint16 info,
  						   void *recdata, uint32 len);
  
+ extern Datum pg_listening(PG_FUNCTION_ARGS);
+ extern Datum pg_notify(PG_FUNCTION_ARGS);
+ 
  #endif   /* ASYNC_H */
diff -cr cvs.head/src/include/nodes/parsenodes.h cvs.build/src/include/nodes/parsenodes.h
*** cvs.head/src/include/nodes/parsenodes.h	2010-02-14 16:02:49.000000000 +0100
--- cvs.build/src/include/nodes/parsenodes.h	2010-02-14 16:04:25.000000000 +0100
***************
*** 2097,2102 ****
--- 2097,2103 ----
  {
  	NodeTag		type;
  	char	   *conditionname;	/* condition name to notify */
+ 	char	   *payload;		/* the payload string to be conveyed */
  } NotifyStmt;
  
  /* ----------------------
diff -cr cvs.head/src/include/storage/lwlock.h cvs.build/src/include/storage/lwlock.h
*** cvs.head/src/include/storage/lwlock.h	2010-02-10 20:33:18.000000000 +0100
--- cvs.build/src/include/storage/lwlock.h	2010-02-14 16:04:25.000000000 +0100
***************
*** 68,73 ****
--- 68,75 ----
  	AutovacuumScheduleLock,
  	SyncScanLock,
  	RelationMappingLock,
+  	AsyncCtlLock,
+  	AsyncQueueLock,
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff -cr cvs.head/src/include/utils/errcodes.h cvs.build/src/include/utils/errcodes.h
*** cvs.head/src/include/utils/errcodes.h	2010-01-05 12:39:36.000000000 +0100
--- cvs.build/src/include/utils/errcodes.h	2010-02-14 16:04:25.000000000 +0100
***************
*** 318,323 ****
--- 318,324 ----
  #define ERRCODE_STATEMENT_TOO_COMPLEX		MAKE_SQLSTATE('5','4', '0','0','1')
  #define ERRCODE_TOO_MANY_COLUMNS			MAKE_SQLSTATE('5','4', '0','1','1')
  #define ERRCODE_TOO_MANY_ARGUMENTS			MAKE_SQLSTATE('5','4', '0','2','3')
+ #define ERRCODE_TOO_MANY_ENTRIES			MAKE_SQLSTATE('5','4', '0','3','1')
  
  /* Class 55 - Object Not In Prerequisite State (class borrowed from DB2) */
  #define ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE	MAKE_SQLSTATE('5','5', '0','0','0')
diff -cr cvs.head/src/test/regress/expected/guc.out cvs.build/src/test/regress/expected/guc.out
*** cvs.head/src/test/regress/expected/guc.out	2009-11-22 06:20:41.000000000 +0100
--- cvs.build/src/test/regress/expected/guc.out	2010-02-14 16:04:25.000000000 +0100
***************
*** 532,540 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
!   relname  
! -----------
   foo_event
  (1 row)
  
--- 532,540 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
!  pg_listening 
! --------------
   foo_event
  (1 row)
  
***************
*** 571,579 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
!  relname 
! ---------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
--- 571,579 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
!  pg_listening 
! --------------
  (0 rows)
  
  SELECT name FROM pg_prepared_statements;
diff -cr cvs.head/src/test/regress/expected/sanity_check.out cvs.build/src/test/regress/expected/sanity_check.out
*** cvs.head/src/test/regress/expected/sanity_check.out	2010-01-20 20:08:32.000000000 +0100
--- cvs.build/src/test/regress/expected/sanity_check.out	2010-02-14 16:04:25.000000000 +0100
***************
*** 107,113 ****
   pg_language             | t
   pg_largeobject          | t
   pg_largeobject_metadata | t
-  pg_listener             | f
   pg_namespace            | t
   pg_opclass              | t
   pg_operator             | t
--- 107,112 ----
***************
*** 154,160 ****
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (143 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
--- 153,159 ----
   timetz_tbl              | f
   tinterval_tbl           | f
   varchar_tbl             | f
! (142 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
diff -cr cvs.head/src/test/regress/sql/guc.sql cvs.build/src/test/regress/sql/guc.sql
*** cvs.head/src/test/regress/sql/guc.sql	2009-10-21 22:38:58.000000000 +0200
--- cvs.build/src/test/regress/sql/guc.sql	2010-02-14 16:04:25.000000000 +0100
***************
*** 165,171 ****
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 165,171 ----
  CREATE ROLE temp_reset_user;
  SET SESSION AUTHORIZATION temp_reset_user;
  -- look changes
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
***************
*** 174,180 ****
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT relname FROM pg_listener;
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
--- 174,180 ----
  -- discard everything
  DISCARD ALL;
  -- look again
! SELECT pg_listening();
  SELECT name FROM pg_prepared_statements;
  SELECT name FROM pg_cursors;
  SHOW vacuum_cost_delay;
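
For readers skimming the patch, the duplicate check in AsyncExistsPendingNotify() can be tried outside the backend. The following is only a minimal sketch under stated assumptions: it uses a plain array instead of the backend's List, and the names PendingNotify, same_notify and exists_pending are made up for the illustration and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the patch's Notification struct. */
typedef struct
{
	const char *channel;
	const char *payload;		/* never NULL; "" means "no payload" */
} PendingNotify;

/* Two notifications are duplicates if channel and payload both match. */
static bool
same_notify(const PendingNotify *n, const char *channel, const char *payload)
{
	assert(n->payload != NULL);
	return strcmp(n->channel, channel) == 0 &&
		strcmp(n->payload, payload) == 0;
}

/*
 * Return true if (channel, payload) is already in pending[0..count-1].
 * The newest entry is checked first, since the same condition signaled
 * many times in a row is the case worth speeding up.  Only on a miss do
 * we scan the older entries.
 */
static bool
exists_pending(const PendingNotify *pending, size_t count,
			   const char *channel, const char *payload)
{
	size_t		i;

	if (count == 0)
		return false;
	if (payload == NULL)
		payload = "";

	if (same_notify(&pending[count - 1], channel, payload))
		return true;

	/* The last element was checked above, so stop one short of it. */
	for (i = 0; i + 1 < count; i++)
	{
		if (same_notify(&pending[i], channel, payload))
			return true;
	}
	return false;
}
```

As in the patch, a NULL payload is treated as the empty string, so NOTIFY foo and NOTIFY foo '' count as the same notification within one transaction.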
#103Simon Riggs
simon@2ndQuadrant.com
In reply to: Joachim Wieland (#102)
Re: Listen / Notify - what to do when the queue is full

On Sun, 2010-02-14 at 17:22 +0100, Joachim Wieland wrote:

* There is no mention of what to do with pg_notify at checkpoint. Look
at how pg_subtrans handles this. Should pg_notify do the same?

Actually we don't care... We even hope that the pg_notify pages are
not flushed at all. Notifications don't survive a server restart
anyway and upon restart we just delete whatever is in the directory.

Suspected that was true, just checking it was commented somewhere.

--
Simon Riggs www.2ndQuadrant.com

#104Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joachim Wieland (#102)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland <joe@mcknight.de> writes:

+ #define ERRCODE_TOO_MANY_ENTRIES MAKE_SQLSTATE('5','4', '0','3','1')

Do you have any evidence that there is actually a DB2 error code
matching this, or is this errcode just invented? The one page
Google finds doesn't list it:
http://publib.boulder.ibm.com/iseries/v5r1/ic2924/index.htm?info/rzala/rzalastc.html

regards, tom lane

#105Simon Riggs
simon@2ndQuadrant.com
In reply to: Joachim Wieland (#102)
Re: Listen / Notify - what to do when the queue is full

On Sun, 2010-02-14 at 17:22 +0100, Joachim Wieland wrote:

New patch attached, thanks for the review.

Next set of questions

* Will this work during Hot Standby now? The barrier was that it wrote
to a table and so we could not allow that. ISTM this new version can and
should work with Hot Standby. Can you test that and if so, remove the
explicit barrier code and change tests and docs to enable it?

* We also discussed the idea of having a NOTIFY command that would work
from Primary to Standby. All this would need is some code to WAL log the
NOTIFY if not in Hot Standby and for some recovery code to send the
NOTIFY to any listeners on the standby. I would suggest that would be an
option on NOTIFY to WAL log the notification:
e.g. NOTIFY me 'with_payload' FOR STANDBY ALSO;

* Don't really like pg_listening() as a name. Perhaps pg_listening_to()
or pg_listening_on() or pg_listening_for() or pg_listening_channels() or
pg_listen_channels()

* I think it's confusing that pg_notify is both a data structure and a
function. Suggest changing one of those to avoid issues in
understanding. "Use pg_notify" might be confused by a DBA.

--
Simon Riggs www.2ndQuadrant.com

#106Joachim Wieland
joe@mcknight.de
In reply to: Simon Riggs (#105)
Re: Listen / Notify - what to do when the queue is full

On Sun, Feb 14, 2010 at 11:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Next set of questions

* Will this work during Hot Standby now? The barrier was that it wrote
to a table and so we could not allow that. ISTM this new version can and
should work with Hot Standby. Can you test that and if so, remove the
explicit barrier code and change tests and docs to enable it?

I have tested it already. The point where it currently fails is the
following line:

qe->xid = GetCurrentTransactionId();

We record the TransactionId (of the notifying transaction) in the
notification in order to later check if this transaction has committed
successfully or not. If you tell me how we can find this out in HS, we
might be done...

The reason why we are doing all this is because we fear that we can
not write the notifications to disk once we have committed to clog...
So we write them to disk before committing to clog and therefore need
to record the TransactionId.
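
To make the reasoning above concrete, here is a rough stand-alone sketch of a reader that uses the recorded xid to decide what to do with each queue entry. It is not the code from the patch: the toy commit log, the Toy* names and the policy of stopping at the first entry whose transaction is still in progress are all assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;

typedef enum
{
	TOY_IN_PROGRESS,			/* zero, so a static array starts here */
	TOY_COMMITTED,
	TOY_ABORTED
} ToyXactStatus;

/* Toy replacement for clog: one status slot per transaction id. */
#define TOY_MAX_XID 16
static ToyXactStatus toy_clog[TOY_MAX_XID];

/* Queue entry written before commit, tagged with the writer's xid. */
typedef struct
{
	ToyXid		xid;
	const char *channel;
	const char *payload;
} ToyQueueEntry;

/*
 * Read entries[*pos .. count-1].  Entries of committed transactions are
 * stored in delivered[], entries of aborted transactions are skipped,
 * and reading stops at the first entry whose transaction has not
 * finished yet.  Returns the number of delivered entries; *pos is
 * advanced past everything that was consumed.
 */
static size_t
toy_read_queue(const ToyQueueEntry *entries, size_t count,
			   size_t *pos, const ToyQueueEntry **delivered)
{
	size_t		n = 0;

	while (*pos < count)
	{
		const ToyQueueEntry *qe = &entries[*pos];

		assert(qe->xid < TOY_MAX_XID);
		if (toy_clog[qe->xid] == TOY_IN_PROGRESS)
			break;				/* outcome not known yet, retry later */
		if (toy_clog[qe->xid] == TOY_COMMITTED)
			delivered[n++] = qe;
		/* TOY_ABORTED: skip the entry as if it had never been written */
		(*pos)++;
	}
	return n;
}
```

The point of the sketch is only that the writer can fail or roll back after the entry is on disk, and the reader still ends up delivering nothing for that transaction.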

* We also discussed the idea of having a NOTIFY command that would work
from Primary to Standby. All this would need is some code to WAL log the
NOTIFY if not in Hot Standby and for some recovery code to send the
NOTIFY to any listeners on the standby. I would suggest that would be an
option on NOTIFY to WAL log the notification:
e.g. NOTIFY me 'with_payload' FOR STANDBY ALSO;

What should happen if you wanted to replay a NOTIFY WAL record in the
standby but cannot write to the pg_notify/ directory?

* Don't really like pg_listening() as a name. Perhaps pg_listening_to()
or pg_listening_on() or pg_listening_for() or pg_listening_channels() or
pg_listen_channels()

pg_listen_channels() sounds best to me but I leave this decision to a
native speaker.

* I think it's confusing that pg_notify is both a data structure and a
function. Suggest changing one of those to avoid issues in
understanding. "Use pg_notify" might be confused by a DBA.

You are talking about the libpq datastructure PGnotify I suppose... I
don't see it overly confusing but I wouldn't object changing it. There
was a previous discussion about the name, see the last paragraph of
http://archives.postgresql.org/message-id/dc7b844e1002021510i4aaa879fy8bbdd003729d28da@mail.gmail.com

Joachim

#107Simon Riggs
simon@2ndQuadrant.com
In reply to: Joachim Wieland (#106)
Re: Listen / Notify - what to do when the queue is full

On Mon, 2010-02-15 at 12:59 +0100, Joachim Wieland wrote:

* I think it's confusing that pg_notify is both a data structure and a
function. Suggest changing one of those to avoid issues in
understanding. "Use pg_notify" might be confused by a DBA.

You are talking about the libpq datastructure PGnotify I suppose... I
don't see it overly confusing but I wouldn't object changing it. There
was a previous discussion about the name, see the last paragraph of
http://archives.postgresql.org/message-id/dc7b844e1002021510i4aaa879fy8bbdd003729d28da@mail.gmail.com

No, which illustrates the confusion nicely!
Function and datastructure.

--
Simon Riggs www.2ndQuadrant.com

#108Simon Riggs
simon@2ndQuadrant.com
In reply to: Joachim Wieland (#106)
Re: Listen / Notify - what to do when the queue is full

On Mon, 2010-02-15 at 12:59 +0100, Joachim Wieland wrote:

On Sun, Feb 14, 2010 at 11:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Next set of questions

* Will this work during Hot Standby now? The barrier was that it wrote
to a table and so we could not allow that. ISTM this new version can and
should work with Hot Standby. Can you test that and if so, remove the
explicit barrier code and change tests and docs to enable it?

I have tested it already. The point where it currently fails is the
following line:

qe->xid = GetCurrentTransactionId();

We record the TransactionId (of the notifying transaction) in the
notification in order to later check if this transaction has committed
successfully or not. If you tell me how we can find this out in HS, we
might be done...

The reason why we are doing all this is because we fear that we can
not write the notifications to disk once we have committed to clog...
So we write them to disk before committing to clog and therefore need
to record the TransactionId.

That's a shame. So it will never work in Hot Standby mode unless you can
think of a different way.

* We also discussed the idea of having a NOTIFY command that would work
from Primary to Standby. All this would need is some code to WAL log the
NOTIFY if not in Hot Standby and for some recovery code to send the
NOTIFY to any listeners on the standby. I would suggest that would be an
option on NOTIFY to WAL log the notification:
e.g. NOTIFY me 'with_payload' FOR STANDBY ALSO;

What should happen if you wanted to replay a NOTIFY WAL record in the
standby but cannot write to the pg_notify/ directory?

Same thing that happens to any action that cannot be replayed. Why
should that be a problem?

--
Simon Riggs www.2ndQuadrant.com

#109Joachim Wieland
joe@mcknight.de
In reply to: Simon Riggs (#108)
Re: Listen / Notify - what to do when the queue is full

On Mon, Feb 15, 2010 at 1:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, 2010-02-15 at 12:59 +0100, Joachim Wieland wrote:

I have tested it already. The point where it currently fails is the
following line:

      qe->xid = GetCurrentTransactionId();

That's a shame. So it will never work in Hot Standby mode unless you can
think of a different way.

We could probably fake this on the Hot Standby in the following way:

We introduce a commit record for every notifying transaction and write
it into the queue itself. So right before writing anything else, we
write an entry which informs readers that the following records are
not yet committed. Then we write the actual notifications and commit.
In post-commit we return back to the commit record and flip its
status. Reading backends would stop at the commit record and we'd
signal them so that they can continue. This actually plays nicely with
Tom's intent to not have interleaved notifications in the queue (makes
things a bit easier but would probably work either way)...

However we'd need to make sure that we clean up that commit record
even if something weird happens (similar to TransactionIdDidAbort()
returning true) in order to allow the readers to proceed.

Comments?

Joachim

#110Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joachim Wieland (#109)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland <joe@mcknight.de> writes:

We could probably fake this on the Hot Standby in the following way:

We introduce a commit record for every notifying transaction and write
it into the queue itself. So right before writing anything else, we
write an entry which informs readers that the following records are
not yet committed. Then we write the actual notifications and commit.
In post-commit we return back to the commit record and flip its
status.

This doesn't seem likely to work --- it essentially makes commit non
atomic. There has to be one and only one authoritative reference as
to whether transaction X committed.

I think that having HS slave sessions issue notifies is a fairly silly
idea anyway. They can't write the database, so exactly what condition
are they going to be notifying others about?

What *would* be useful is for HS slaves to be able to listen for notify
messages issued by writing sessions on the master. This patch gets rid
of the need for LISTEN to change on-disk state, so in principle we can
do it. The only bit we seem to lack is WAL transmission of the messages
(plus of course synchronization in case a slave session is too slow
about picking up messages). Definitely a 9.1 project at this point
though.

regards, tom lane

#111Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#105)
Re: Listen / Notify - what to do when the queue is full

On Sun, 2010-02-14 at 22:44 +0000, Simon Riggs wrote:

* We also discussed the idea of having a NOTIFY command that would work
from Primary to Standby. All this would need is some code to WAL log the
NOTIFY if not in Hot Standby and for some recovery code to send the
NOTIFY to any listeners on the standby. I would suggest that would be an
option on NOTIFY to WAL log the notification:
e.g. NOTIFY me 'with_payload' FOR STANDBY ALSO;

My first reaction is that it should not be optional. If we allow a slave
system to LISTEN on a condition, what's the point if it doesn't receive
the notifications from the master?

Cache invalidation seems to be the driving use case for LISTEN/NOTIFY.
Only the master can invalidate the cache (as Tom points out downthread);
and users on the slave system want to know about that invalidation if
they are explicitly listening for it.

Regards,
Jeff Davis

#112Greg Sabino Mullane
greg@turnstep.com
In reply to: Joachim Wieland (#106)
Re: Listen / Notify - what to do when the queue is full

* We also discussed the idea of having a NOTIFY command that
would work from Primary to Standby.

Just curious, what's a use case for this?

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/

#113Jeff Davis
pgsql@j-davis.com
In reply to: Greg Sabino Mullane (#112)
Re: Listen / Notify - what to do when the queue is full

On Tue, 2010-02-16 at 16:02 +0000, Greg Sabino Mullane wrote:

* We also discussed the idea of having a NOTIFY command that
would work from Primary to Standby.

Just curious, what's a use case for this?

If you have some kind of cache above the DBMS, you need to invalidate it
when a part of the database is updated. It makes sense that every reader
would want to know about the update, not just those connected to the
master.

Regards,
Jeff Davis

#114Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joachim Wieland (#102)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland <joe@mcknight.de> writes:

[ listen/notify patch ]

Applied after rather a lot of hacking.

Aside from the issues previously raised, I changed the NOTIFY syntax to
include a comma between channel name and payload. The submitted syntax
with no comma looked odd to me, and it would have been a real nightmare
to extend if we ever decide we want to support expressions in NOTIFY.

I found a number of implementation problems having to do with wraparound
behavior and error recovery. I think they're all fixed, but any
remaining bugs are probably my fault not yours.

regards, tom lane

#115Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#105)
Re: Listen / Notify - what to do when the queue is full

Simon Riggs <simon@2ndQuadrant.com> writes:

* Don't really like pg_listening() as a name. Perhaps pg_listening_to()
or pg_listening_on() or pg_listening_for() or pg_listening_channels() or
pg_listen_channels()

BTW, I used pg_listening_channels() for that.

* I think it's confusing that pg_notify is both a data structure and a
function. Suggest changing one of those to avoid issues in
understanding. "Use pg_notify" might be confused by a DBA.

I didn't change that. The data structure is PGnotify, which seems
enough different from pg_notify to not be a real serious problem.
There is a duplication with the $PGDATA subdirectory pg_notify/,
but that one is not a user-facing name, so I thought it wasn't
really an issue.

regards, tom lane

#116Joachim Wieland
joe@mcknight.de
In reply to: Tom Lane (#114)
Re: Listen / Notify - what to do when the queue is full

On Tue, Feb 16, 2010 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Joachim Wieland <joe@mcknight.de> writes:

[ listen/notify patch ]

I found a number of implementation problems having to do with wraparound
behavior and error recovery.  I think they're all fixed, but any
remaining bugs are probably my fault not yours.

First, thanks for the rework you have done and thanks for applying this.

While I can see a lot of improvements over my version, I think the
logic in asyncQueueProcessPageEntries() needs to be reordered:

+ static bool
+ asyncQueueProcessPageEntries(QueuePosition *current,
+ 							 QueuePosition stop,
+ 							 char *page_buffer)
[...]
+ 	do
+ 	{
[...]
+ 		/*
+ 		 * Advance *current over this message, possibly to the next page.
+ 		 * As noted in the comments for asyncQueueReadAllNotifications, we
+ 		 * must do this before possibly failing while processing the message.
+ 		 */
+ 		reachedEndOfPage = asyncQueueAdvance(current, qe->length);
[...]
+ 			if (TransactionIdDidCommit(qe->xid))
[...]
+ 			else if (TransactionIdDidAbort(qe->xid))
[...]
+ 			else
+ 			{
+ 				/*
+ 				 * The transaction has neither committed nor aborted so far,
+ 				 * so we can't process its message yet.  Break out of the loop.
+ 				 */
+ 				reachedStop = true;
+ 				break;

In the beginning you are advancing *current, but later on you could
find out that the transaction is still running. As the position in the
queue has already advanced, you would miss one notification here
because you'd restart directly after this notification in the
queue...
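
To make the ordering problem concrete, here is a minimal, self-contained C
sketch. It is not the committed code: XactStatus, QueueEntry and
process_entries() are hypothetical stand-ins for the xid status checks and
the queue position, and the entries are fixed size rather than variable
length. It shows one possible reordering, under those assumptions: look at
the sender's state first, and step over the entry only once its fate is
known, while still advancing before the message is handed on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the result of the xid status checks. */
typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } XactStatus;

/* Hypothetical, simplified queue entry (real entries are variable length). */
typedef struct
{
    XactStatus  status;   /* state of the notifying transaction */
    const char *payload;  /* notification payload */
} QueueEntry;

/*
 * Read entries from *pos up to (not including) stop and copy the payloads
 * of committed transactions into out[].  Returns the number copied.
 * *pos is only moved past entries whose transaction has finished.
 */
size_t
process_entries(const QueueEntry *queue, size_t *pos, size_t stop,
                const char **out, size_t out_cap)
{
    size_t ndelivered = 0;

    assert(queue != NULL && pos != NULL && out != NULL);

    while (*pos < stop)
    {
        const QueueEntry *qe = &queue[*pos];

        /* Sender still running: stop WITHOUT advancing, retry it later. */
        if (qe->status == XACT_IN_PROGRESS)
            break;

        /* No room left for another payload: also stop without advancing. */
        if (qe->status == XACT_COMMITTED && ndelivered == out_cap)
            break;

        /* The entry's fate is known, so it is safe to step over it now. */
        (*pos)++;

        if (qe->status == XACT_COMMITTED)
            out[ndelivered++] = qe->payload;
        /* Entries from aborted transactions are skipped silently. */
    }
    return ndelivered;
}
```

With a queue holding a committed, an in-progress and a committed entry, the
first call delivers one payload and leaves the position on the in-progress
entry; once that transaction commits, a second call delivers the remaining
two.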

Joachim

#117Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joachim Wieland (#116)
Re: Listen / Notify - what to do when the queue is full

Joachim Wieland <joe@mcknight.de> writes:

While I can see a lot of improvements over my version, I think the
logic in asyncQueueProcessPageEntries() needs to be reordered:

Hmmm ... I was intending to cover the case of a failure in
TransactionIdDidCommit too, but I can see it will take a bit more
thought.

regards, tom lane

#118Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#110)
Re: Listen / Notify - what to do when the queue is full

On Mon, 2010-02-15 at 15:00 -0500, Tom Lane wrote:

Joachim Wieland <joe@mcknight.de> writes:

We could probably fake this on the Hot Standby in the following way:

We introduce a commit record for every notifying transaction and write
it into the queue itself. So right before writing anything else, we
write an entry which informs readers that the following records are
not yet committed. Then we write the actual notifications and commit.
In post-commit we return back to the commit record and flip its
status.

This doesn't seem likely to work --- it essentially makes commit non
atomic. There has to be one and only one authoritative reference as
to whether transaction X committed.

I thought a bit more about this and don't really understand why we need
an xid at all. When we discussed this before the role of a NOTIFY was to
remind us to refresh a cache, not as a way of delivering a transactional
payload. If the cache refresh use case is still the objective why does
it matter whether we commit or not when we issue a NOTIFY? Surely, the
rare case where we actually abort right at the end of the transaction
will just cause an unnecessary cache refresh.

I think that having HS slave sessions issue notifies is a fairly silly
idea anyway. They can't write the database, so exactly what condition
are they going to be notifying others about?

Agreed

What *would* be useful is for HS slaves to be able to listen for notify
messages issued by writing sessions on the master. This patch gets rid
of the need for LISTEN to change on-disk state, so in principle we can
do it. The only bit we seem to lack is WAL transmission of the messages
(plus of course synchronization in case a slave session is too slow
about picking up messages). Definitely a 9.1 project at this point
though.

OK

--
Simon Riggs www.2ndQuadrant.com

#119Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#118)
Re: Listen / Notify - what to do when the queue is full

On 2/18/10 9:58 AM, Simon Riggs wrote:

I thought a bit more about this and don't really understand why we need
an xid at all. When we discussed this before the role of a NOTIFY was to
remind us to refresh a cache, not as a way of delivering a transactional
payload. If the cache refresh use case is still the objective why does
it matter whether we commit or not when we issue a NOTIFY? Surely, the
rare case where we actually abort right at the end of the transaction
will just cause an unnecessary cache refresh.

Actually, even for that use, it doesn't wash to have notifies being sent
for transactions which have not yet committed. Think of cases (and I
have two applications) where data is being pushed into an external
non-transactional cache or queue (like Memcached, Redis or ApacheMQ)
from PostgreSQL on the backend. If the transaction fails, it's
important that the data not get pushed.

I guess I'm not following why HS would be different from a single server
in this regard, though?

Mind you, this is all 9.1 discussion, no?

--Josh Berkus

#120Merlin Moncure
mmoncure@gmail.com
In reply to: Simon Riggs (#118)
Re: Listen / Notify - what to do when the queue is full

On Thu, Feb 18, 2010 at 12:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, 2010-02-15 at 15:00 -0500, Tom Lane wrote:

Joachim Wieland <joe@mcknight.de> writes:

We could probably fake this on the Hot Standby in the following way:

We introduce a commit record for every notifying transaction and write
it into the queue itself. So right before writing anything else, we
write an entry which informs readers that the following records are
not yet committed. Then we write the actual notifications and commit.
In post-commit we return back to the commit record and flip its
status.

This doesn't seem likely to work --- it essentially makes commit non
atomic.  There has to be one and only one authoritative reference as
to whether transaction X committed.

I thought a bit more about this and don't really understand why we need
an xid at all. When we discussed this before the role of a NOTIFY was to
remind us to refresh a cache, not as a way of delivering a transactional
payload. If the cache refresh use case is still the objective why does
it matter whether we commit or not when we issue a NOTIFY? Surely, the
rare case where we actually abort right at the end of the transaction
will just cause an unnecessary cache refresh.

notifications serve many more purposes than cache refreshes...it's a
generic 'wake up and do something' to the client.

For example, one of those things could be for the client to shut down.
If the server errors out of the transaction that set up the client to
shut down, you probably wouldn't want the client to shut down. I
don't think that's a big deal really, but it conflicts with the old
behavior.

However, being able to send notifications immediately (not at end of
transaction) would be exceptionally useful in some cases. This
happens when the notifying backend is waiting on some sort of response
from the notified client. If you could NOTIFY IMMEDIATELY, then you
could ping the client and get the response in a single transaction
without using dblink based hacks.

merlin

#121Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#120)
Re: Listen / Notify - what to do when the queue is full

Merlin Moncure <mmoncure@gmail.com> writes:

On Thu, Feb 18, 2010 at 12:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I thought a bit more about this and don't really understand why we need
an xid at all. When we discussed this before the role of a NOTIFY was to
remind us to refresh a cache, not as a way of delivering a transactional
payload. If the cache refresh use case is still the objective why does
it matter whether we commit or not when we issue a NOTIFY? Surely, the
rare case where we actually abort right at the end of the transaction
will just cause an unnecessary cache refresh.

notifications serve many more purposes than cache refreshes...it's a
generic 'wake up and do something' to the client.

The point to my mind is that the previous implementation guaranteed that
failed transactions would not send notifies. I don't think we can just
drop that semantic consistency statement and not break applications.

Also, as Josh notes, even for cache refresh uses it is *critical* that
the notifies not be delivered to listeners till after the sender
commits; else you have race conditions where the listeners look for
changes before they can see them. So it's difficult to make it
much simpler than this anyhow.
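
The two guarantees above (failed transactions send nothing, and listeners
see nothing before the sender commits) can be illustrated with a small C
sketch. This models only the observable semantics, not the patch itself,
which writes to the queue before commit and relies on the xid check. Xact,
Listener, xact_notify(), xact_abort() and xact_commit() are hypothetical
names, and the fixed array sizes are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8    /* arbitrary limit for this sketch */
#define INBOX_CAP   32   /* arbitrary limit for this sketch */

/* Hypothetical transaction: notifications wait here until commit. */
typedef struct
{
    const char *pending[MAX_PENDING];
    size_t      npending;
} Xact;

/* Hypothetical listener: only committed notifications ever land here. */
typedef struct
{
    const char *inbox[INBOX_CAP];
    size_t      ninbox;
} Listener;

/* Remember a notification; nothing is visible to the listener yet. */
bool
xact_notify(Xact *x, const char *payload)
{
    assert(x != NULL);
    if (x->npending == MAX_PENDING)
        return false;
    x->pending[x->npending++] = payload;
    return true;
}

/* Abort: the pending notifications are simply thrown away. */
void
xact_abort(Xact *x)
{
    assert(x != NULL);
    x->npending = 0;
}

/*
 * Commit: hand over all pending notifications, or none if the inbox
 * cannot take them all (what to do then is left to the caller).
 */
bool
xact_commit(Xact *x, Listener *l)
{
    assert(x != NULL && l != NULL);
    if (l->ninbox + x->npending > INBOX_CAP)
        return false;
    for (size_t i = 0; i < x->npending; i++)
        l->inbox[l->ninbox++] = x->pending[i];
    x->npending = 0;
    return true;
}
```

A listener that re-reads the database on every inbox entry can therefore
never look for a change it is not yet able to see, and never reacts to a
transaction that rolled back.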

regards, tom lane