proposal: make NOTIFY list de-duplication optional
- new GUC in "Statement Behaviour" section, notify_duplicate_removal
(default true)
Initial discussion in this thread:
/messages/by-id/CAP_rwwmpzk9=SbjRZTOd05bDctyC43wNKnu_m37dYGvL4SAeSw@mail.gmail.com
Rationale: for some legitimate use cases, duplicate removal is not
required, and it gets O(N^2) cost on large COPY/ insert transactions.
Attachments:
postgres-async-notify-duplicate-removal-guc.patchtext/x-patch; charset=US-ASCII; name=postgres-async-notify-duplicate-removal-guc.patchDownload+46-12
On Fri, Feb 5, 2016 at 10:17 AM, Filip Rembiałkowski
<filip.rembialkowski@gmail.com> wrote:
- new GUC in "Statement Behaviour" section, notify_duplicate_removal
(default true)Initial discussion in this thread:
/messages/by-id/CAP_rwwmpzk9=SbjRZTOd05bDctyC43wNKnu_m37dYGvL4SAeSw@mail.gmail.comRationale: for some legitimate use cases, duplicate removal is not required,
and it gets O(N^2) cost on large COPY/ insert transactions.
I agree with what Merlin said about this:
/messages/by-id/CAHyXU0yoHe8Qc=yC10AHU1nFiA1tbHsg+35Ds-oEueUapo7t4g@mail.gmail.com
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Feb 5, 2016 at 10:17 AM, Filip Rembiałkowski
<filip.rembialkowski@gmail.com> wrote:- new GUC in "Statement Behaviour" section, notify_duplicate_removal
I agree with what Merlin said about this:
/messages/by-id/CAHyXU0yoHe8Qc=yC10AHU1nFiA1tbHsg+35Ds-oEueUapo7t4g@mail.gmail.com
Yeah, I agree that a GUC for this is quite unappetizing.
One idea would be to build a hashtable to aid with duplicate detection
(perhaps only once the pending-notify list gets long).
Another thought is that it's already the case that duplicate detection is
something of a "best effort" activity; note for example the comment in
AsyncExistsPendingNotify pointing out that we don't collapse duplicates
across subtransactions. Would it be acceptable to relax the standards
a bit further? For example, if we only checked for duplicates among the
last N notification list entries (for N say around 100), we'd probably
cover just about all the useful cases, and the runtime would stay linear.
The data structure isn't tremendously conducive to that, but it could be
done.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, 6 Feb 2016 at 12:50 Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I agree with what Merlin said about this:
/messages/by-id/CAHyXU0yoHe8Qc=yC10AHU1nFiA1tbHsg+35Ds-oEueUapo7t4g@mail.gmail.com
Yeah, I agree that a GUC for this is quite unappetizing.
How would you feel about a variant for calling NOTIFY?
The SQL syntax could be something like "NOTIFY [ALL] channel, payload"
where the ALL means "just send the notification already, nobody cares
whether there's an identical one in the queue".
Likewise we could introduce a three-argument form of pg_notify(text, text,
bool) where the final argument is whether you are interested in removing
duplicates.
Optimising the remove-duplicates path is still probably a worthwhile
endeavour, but if the user really doesn't care at all about duplication, it
seems silly to force them to pay any performance price for a behaviour they
didn't want, no?
Cheers,
BJ
Brendan Jurd <direvus@gmail.com> writes:
On Sat, 6 Feb 2016 at 12:50 Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, I agree that a GUC for this is quite unappetizing.
How would you feel about a variant for calling NOTIFY?
If we decide that this ought to be user-visible, then an extra NOTIFY
parameter would be the way to do it. I'd much rather it "just works"
though. In particular, if we do start advertising user control of
de-duplication, we are likely to start getting bug reports about every
case where it's inexact, eg the no-checks-across-subxact-boundaries
business.
Optimising the remove-duplicates path is still probably a worthwhile
endeavour, but if the user really doesn't care at all about duplication, it
seems silly to force them to pay any performance price for a behaviour they
didn't want, no?
I would only be impressed with that argument if it could be shown that
de-duplication was a significant fraction of the total cost of a typical
NOTIFY cycle. Obviously, you can make the O(N^2) term dominate if you
try, but I really doubt that it's significant for reasonable numbers of
notify events per transaction. One should also keep in mind that
duplicate events are going to cost extra processing on the
client-application side, too. In my experience with using NOTIFY, that
cost probably dwarfs the cost of emitting the messages.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/05/2016 08:49 PM, Tom Lane wrote:
Yeah, I agree that a GUC for this is quite unappetizing.
Agreed.
One idea would be to build a hashtable to aid with duplicate detection
(perhaps only once the pending-notify list gets long).Another thought is that it's already the case that duplicate detection is
something of a "best effort" activity; note for example the comment in
AsyncExistsPendingNotify pointing out that we don't collapse duplicates
across subtransactions. Would it be acceptable to relax the standards
a bit further? For example, if we only checked for duplicates among the
last N notification list entries (for N say around 100), we'd probably
cover just about all the useful cases, and the runtime would stay linear.
The data structure isn't tremendously conducive to that, but it could be
done.
I like the hashtable idea if it can be made workable.
cheers
andrew
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Feb 6, 2016 at 5:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Brendan Jurd <direvus@gmail.com> writes:
On Sat, 6 Feb 2016 at 12:50 Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, I agree that a GUC for this is quite unappetizing.
How would you feel about a variant for calling NOTIFY?
If we decide that this ought to be user-visible, then an extra NOTIFY
parameter would be the way to do it. I'd much rather it "just works"
though. In particular, if we do start advertising user control of
de-duplication, we are likely to start getting bug reports about every
case where it's inexact, eg the no-checks-across-subxact-boundaries
business.
It is not enough to say "database server can decide to deliver a
single notification only." - which is already said in the docs?
The ALL keyword would be a clearly separated "do-nothing" version.
Optimising the remove-duplicates path is still probably a worthwhile
endeavour, but if the user really doesn't care at all about duplication, it
seems silly to force them to pay any performance price for a behaviour they
didn't want, no?I would only be impressed with that argument if it could be shown that
de-duplication was a significant fraction of the total cost of a typical
NOTIFY cycle.
Even if a typical NOTIFY cycle excludes processing 10k or 100k
messages, why penalize users who have bigger transactions?
Obviously, you can make the O(N^2) term dominate if you
try, but I really doubt that it's significant for reasonable numbers of
notify events per transaction.
Yes, it is hard to observe for less than few thousands messages in one
transaction.
But big data happens. And then the numbers get really bad.
In my test for 40k messages, it is 400 ms versus 9 seconds. 22 times
slower. For 200k messages, it is 2 seconds versus 250 seconds. 125
times slower.
And I tested with very short payload strings, so strcmp() had not much to do.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
+1
... and a patch (only adding ALL keyword, no hash table implemented yet).
Show quoted text
On Sat, Feb 6, 2016 at 2:35 PM, Brendan Jurd <direvus@gmail.com> wrote:
On Sat, 6 Feb 2016 at 12:50 Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I agree with what Merlin said about this:
/messages/by-id/CAHyXU0yoHe8Qc=yC10AHU1nFiA1tbHsg+35Ds-oEueUapo7t4g@mail.gmail.com
Yeah, I agree that a GUC for this is quite unappetizing.
How would you feel about a variant for calling NOTIFY?
The SQL syntax could be something like "NOTIFY [ALL] channel, payload" where
the ALL means "just send the notification already, nobody cares whether
there's an identical one in the queue".Likewise we could introduce a three-argument form of pg_notify(text, text,
bool) where the final argument is whether you are interested in removing
duplicates.Optimising the remove-duplicates path is still probably a worthwhile
endeavour, but if the user really doesn't care at all about duplication, it
seems silly to force them to pay any performance price for a behaviour they
didn't want, no?Cheers,
BJ
Attachments:
postgres-notify-all.patchtext/x-patch; charset=US-ASCII; name=postgres-notify-all.patchDownload+53-17
On 02/07/2016 03:42 AM, Filip Rembiałkowski wrote:
+1
... and a patch (only adding ALL keyword, no hash table implemented yet).
Please stop top-posting, it's very disruptive. My comments are below,
where they belong.
On Sat, Feb 6, 2016 at 2:35 PM, Brendan Jurd <direvus@gmail.com> wrote:
On Sat, 6 Feb 2016 at 12:50 Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I agree with what Merlin said about this:
/messages/by-id/CAHyXU0yoHe8Qc=yC10AHU1nFiA1tbHsg+35Ds-oEueUapo7t4g@mail.gmail.com
Yeah, I agree that a GUC for this is quite unappetizing.
How would you feel about a variant for calling NOTIFY?
The SQL syntax could be something like "NOTIFY [ALL] channel, payload" where
the ALL means "just send the notification already, nobody cares whether
there's an identical one in the queue".Likewise we could introduce a three-argument form of pg_notify(text, text,
bool) where the final argument is whether you are interested in removing
duplicates.Optimising the remove-duplicates path is still probably a worthwhile
endeavour, but if the user really doesn't care at all about duplication, it
seems silly to force them to pay any performance price for a behaviour they
didn't want, no?
On 02/07/2016 03:42 AM, Filip Rembiałkowski wrote:
+1
... and a patch (only adding ALL keyword, no hash table implemented yet).
I only read through the patch, I didn't compile it or test it, but I
have a few comments:
You left the duplicate behavior with subtransactions, but didn't mention
it in the documentation. If I do NOTIFY DISTINCT chan, 'msg'; then I
expect only distinct notifications but I'll get duplicates if I'm in a
subtransaction. Either the documentation or the code needs to be fixed.
I seem to remember some discussion about not using DEFAULT parameters in
system functions so you should leave the old function alone and create a
new function with your use_all parameter. I don't recall the exact
reason why so hopefully someone else will enlighten me.
There is also no mention in the documentation about what happens if I do:
NOTIFY ALL chan, 'msg';
NOTIFY ALL chan, 'msg';
NOTIFY DISTINCT chan, 'msg';
NOTIFY ALL chan, 'msg';
Without testing, I'd say I'd get two messages, but it should be
explicitly mentioned somewhere.
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Feb 7, 2016 at 11:54 AM, Vik Fearing <vik@2ndquadrant.fr> wrote:
On 02/07/2016 03:42 AM, Filip Rembiałkowski wrote:
You left the duplicate behavior with subtransactions, but didn't mention
it in the documentation. If I do NOTIFY DISTINCT chan, 'msg'; then I
expect only distinct notifications but I'll get duplicates if I'm in a
subtransaction. Either the documentation or the code needs to be fixed.
agreed
I seem to remember some discussion about not using DEFAULT parameters in
system functions so you should leave the old function alone and create a
new function with your use_all parameter. I don't recall the exact
reason why so hopefully someone else will enlighten me.
I'm quite new to this; how do I pinpoint proper OID for a new catalog
object (function, in this case)?
Is there a better way than browsing fmgr files and guessing next available oid?
There is also no mention in the documentation about what happens if I do:
NOTIFY ALL chan, 'msg';
NOTIFY ALL chan, 'msg';
NOTIFY DISTINCT chan, 'msg';
NOTIFY ALL chan, 'msg';Without testing, I'd say I'd get two messages, but it should be
explicitly mentioned somewhere.
If it's four separate transactions, LISTEN'er should get four.
If it's in one transaction, LISTEN'er should get three.
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/07/2016 04:00 PM, Filip Rembiałkowski wrote:
On Sun, Feb 7, 2016 at 11:54 AM, Vik Fearing <vik@2ndquadrant.fr> wrote:
I seem to remember some discussion about not using DEFAULT parameters in
system functions so you should leave the old function alone and create a
new function with your use_all parameter. I don't recall the exact
reason why so hopefully someone else will enlighten me.I'm quite new to this; how do I pinpoint proper OID for a new catalog
object (function, in this case)?
Is there a better way than browsing fmgr files and guessing next available oid?
There is a shell script called `unused_oids` in src/include/catalog/.
There is also no mention in the documentation about what happens if I do:
NOTIFY ALL chan, 'msg';
NOTIFY ALL chan, 'msg';
NOTIFY DISTINCT chan, 'msg';
NOTIFY ALL chan, 'msg';Without testing, I'd say I'd get two messages, but it should be
explicitly mentioned somewhere.If it's four separate transactions, LISTEN'er should get four.
The question was for one transaction, I should have been clearer about that.
If it's in one transaction, LISTEN'er should get three.
This is surprising to me, I would think it would get only two. What is
your rationale for three?
Compare with the behavior of:
select 1
union all
select 1
union distinct
select 1
union all
select 1;
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Feb 7, 2016 at 4:37 PM, Vik Fearing <vik@2ndquadrant.fr> wrote:
There is also no mention in the documentation about what happens if I do:
NOTIFY ALL chan, 'msg';
NOTIFY ALL chan, 'msg';
NOTIFY DISTINCT chan, 'msg';
NOTIFY ALL chan, 'msg';Without testing, I'd say I'd get two messages, but it should be
explicitly mentioned somewhere.If it's four separate transactions, LISTEN'er should get four.
The question was for one transaction, I should have been clearer about that.
If it's in one transaction, LISTEN'er should get three.
This is surprising to me, I would think it would get only two. What is
your rationale for three?
It is a single transaction, but four separate commands.
NOTIFY ALL chan, 'msg';
-- send the message, save in the list/hash
NOTIFY ALL chan, 'msg';
-- ALL was specified, send the message even if it is on the list/hash
NOTIFY DISTINCT chan, 'msg';
-- default mode, skip message because it's in the list/hash
NOTIFY ALL chan, 'msg';
-- ALL was specified, send the message even if it is hashed/saved
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 8 February 2016 at 09:37, Filip Rembiałkowski <
filip.rembialkowski@gmail.com> wrote:
On Sun, Feb 7, 2016 at 4:37 PM, Vik Fearing <vik@2ndquadrant.fr> wrote:
There is also no mention in the documentation about what happens if I
do:
NOTIFY ALL chan, 'msg';
NOTIFY ALL chan, 'msg';
NOTIFY DISTINCT chan, 'msg';
NOTIFY ALL chan, 'msg';Without testing, I'd say I'd get two messages, but it should be
explicitly mentioned somewhere.If it's four separate transactions, LISTEN'er should get four.
The question was for one transaction, I should have been clearer about
that.
If it's in one transaction, LISTEN'er should get three.
This is surprising to me, I would think it would get only two. What is
your rationale for three?It is a single transaction, but four separate commands.
NOTIFY ALL chan, 'msg';
-- send the message, save in the list/hash
NOTIFY ALL chan, 'msg';
-- ALL was specified, send the message even if it is on the list/hash
NOTIFY DISTINCT chan, 'msg';
-- default mode, skip message because it's in the list/hash
NOTIFY ALL chan, 'msg';
-- ALL was specified, send the message even if it is hashed/saved
So in total three messages are sent?
Would it be correct to say that if ALL is specified then a message is
queued no matter what. If DISTINCT is specified then it is only queued if
no message with the same channel and argument is already queued for
delivery. Using DISTINCT can never decrease the total number of messages to
be sent.
Right?
If so, I think that's the right behaviour and the docs just need to be
explicit - an example like the above would be good, translated to be
friendlier to those who don't know the internal mechanics.
I've found the deduplication functionality of NOTIFY very frustrating in
the past and I see this as a significant improvement. Sometimes the *number
of times* something happened is significant too...
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 8, 2016 at 1:52 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
Would it be correct to say that if ALL is specified then a message is queued
no matter what. If DISTINCT is specified then it is only queued if no
message with the same channel and argument is already queued for delivery.
Yes, exactly.
Using DISTINCT can never decrease the total number of messages to be sent.
This sentence does not sound true. DISTINCT is the default, old
behaviour. It *can* decrease total number of messages (by
deduplication)
I've found the deduplication functionality of NOTIFY very frustrating in the past
and I see this as a significant improvement. Sometimes the *number of times*
something happened is significant too...
yep, same idea here.
Here is my next try, after suggestions from -perf and -hackers list:
* no GUC
* small addition to NOTIFY grammar: NOTIFY ALL/DISTINCT
* corresponding, 3-argument version of pg_notify(text,text,bool)
* updated the docs to include new syntax and clarify behavior
* no hashtable in AsyncExistsPendingNotify
(I don't see much sense in that part; and it can be well done
separately from this)
Attachments:
postgres-notify-all-v2.patchtext/x-patch; charset=US-ASCII; name=postgres-notify-all-v2.patchDownload+65-11
On 02/08/2016 09:33 PM, Filip Rembiałkowski wrote:
Here is my next try, after suggestions from -perf and -hackers list:
* no GUC
* small addition to NOTIFY grammar: NOTIFY ALL/DISTINCT
* corresponding, 3-argument version of pg_notify(text,text,bool)
* updated the docs to include new syntax and clarify behavior
* no hashtable in AsyncExistsPendingNotify
(I don't see much sense in that part; and it can be well done
separately from this)
Please add this to the next commitfest:
https://commitfest.postgresql.org/9/new/
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Feb 8, 2016 at 2:33 PM, Filip Rembiałkowski
<filip.rembialkowski@gmail.com> wrote:
Here is my next try, after suggestions from -perf and -hackers list:
* no GUC
* small addition to NOTIFY grammar: NOTIFY ALL/DISTINCT
* corresponding, 3-argument version of pg_notify(text,text,bool)
This is all sounding pretty good. I wonder if the third argument
should be a boolean however. If we make it 'text, 'send mode',
instead, we could leave some room for more specialization of the
queuing behavior.
For example, we've had a couple of requests over the years to have an
'immediate' mode which dumps the notification immediately to the
client without waiting for tx commit. This may or may not be a good
idea, but if it was ultimately proved to be, it could be introduced as
an alternate mode without adding an extra function.
merlin
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Feb 9, 2016 at 12:15 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
I wonder if the third argument
should be a boolean however. If we make it 'text, 'send mode',
instead, we could leave some room for more specialization of the
queuing behavior.For example, we've had a couple of requests over the years to have an
'immediate' mode which dumps the notification immediately to the
client without waiting for tx commit. This may or may not be a good
idea, but if it was ultimately proved to be, it could be introduced as
an alternate mode without adding an extra function.
But then it becomes disputable if SQL syntax change makes sense.
---we had this,
NOTIFY channel [ , payload ]
---and in this patch we have this
NOTIFY [ ALL | DISTINCT ] channel [ , payload ]
--- but maybe we should have this?
NOTIFY channel [ , payload [ , mode ] ]
I'm not sure which direction is better with non-standard SQL additions.
Recycling keywords or adding more commas?
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Feb 9, 2016 at 2:16 PM, Filip Rembiałkowski
<filip.rembialkowski@gmail.com> wrote:
But then it becomes disputable if SQL syntax change makes sense.
---we had this, NOTIFY channel [ , payload ] ---and in this patch we have this NOTIFY [ ALL | DISTINCT ] channel [ , payload ] --- but maybe we should have this? NOTIFY channel [ , payload [ , mode ] ]
I think using ALL to mean "don't worry about de-duplication" could be
a bit confusing, especially as there was some interest recently in
supporting wildcard notifications:
/messages/by-id/52693FC5.7070507@gmail.com
and conceivably we might want to support a way to notify all
listeners, i.e. NOTIFY * as proposed in that thread. If we ever
supported wildcard notifies, ALL may be easily confused to mean "all
channel names".
What about adopting the options-inside-parentheses format, the way
EXPLAIN does nowadays, something like:
NOTIFY (DEDUPLICATE FALSE, MODE IMMEDIATE) mychannel;
Josh
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Josh Kupershmidt <schmiddy@gmail.com> writes:
On Tue, Feb 9, 2016 at 2:16 PM, Filip Rembiałkowski
<filip.rembialkowski@gmail.com> wrote:But then it becomes disputable if SQL syntax change makes sense.
---we had this, NOTIFY channel [ , payload ] ---and in this patch we have this NOTIFY [ ALL | DISTINCT ] channel [ , payload ] --- but maybe we should have this? NOTIFY channel [ , payload [ , mode ] ]
What about adopting the options-inside-parentheses format, the way
EXPLAIN does nowadays, something like:
NOTIFY (DEDUPLICATE FALSE, MODE IMMEDIATE) mychannel;
FWIW, I think it would be a good thing if the NOTIFY statement syntax were
not remarkably different from the syntax used in the pg_notify() function
call. To do otherwise would certainly be confusing. So on the whole
I'd go with the "NOTIFY channel [ , payload [ , mode ] ]" option.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Small update. I had to add one thing in /contrib/tcn/.