64-bit queryId?

Started by Robert Haasover 8 years ago50 messageshackers
Jump to latest
#1Robert Haas
robertmhaas@gmail.com

Our Query currently has space for a 32-bit queryId, but that seems
reasonably likely to result in collisions:

https://en.wikipedia.org/wiki/Birthday_problem#Probability_table

If you have as many as 50,000 queries, there's a 25% probability of
having at least one collision; that doesn't seem particularly
unrealistic. Obviously, normalization reduces the number of distinct
queries a lot, but if queries are dynamically generated you might
still have quite a few of them.

How about widening the value to uint64?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#1)
Re: 64-bit queryId?

Robert Haas <robertmhaas@gmail.com> writes:

How about widening the value to uint64?

Doesn't really seem like that would guarantee no collisions.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#2)
Re: 64-bit queryId?

On Sat, Sep 30, 2017 at 1:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

How about widening the value to uint64?

Doesn't really seem like that would guarantee no collisions.

This moves the possibility of a 25% collision from 50k queries 3.3
billion. Not zero, but safe enough as a minimal change for many years
to come.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#2)
Re: 64-bit queryId?

On Sat, Sep 30, 2017 at 12:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

How about widening the value to uint64?

Doesn't really seem like that would guarantee no collisions.

Well, no duh. If you come up with a hash function that maps an
infinite domain onto a finite range while guaranteeing no collisions,
I will look forward to reading the paper with interest.

Assuming, however, that you don't manage to prove all known
mathematics inconsistent, what one might reasonably hope to do is
render collisions remote enough that one need not worry about them too
much in practice. From that point of view, reducing the probability
of a collision by several orders of magnitude seems worth doing if (1)
the cost isn't too much and (2) the probability of a collision as
things stand is significant. I argue that both of those things are
probably true.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Robert Haas (#4)
Re: 64-bit queryId?

On Sat, Sep 30, 2017 at 7:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Assuming, however, that you don't manage to prove all known
mathematics inconsistent, what one might reasonably hope to do is
render collisions remote enough that one need not worry about them too
much in practice.

Isn't that already true in the case of queryId? I've never heard any
complaints about collisions. Most people don't change
pg_stat_statements.max, so the probability of a collision is more like
1%. And, that's the probability of *any* collision, not the
probability of a collision that the user actually cares about. The
majority of entries in pg_stat_statements among those ten thousand
will not be interesting. Often, 90%+ will be one-hit wonders. If that
isn't true, then you're probably not going to find pg_stat_statements
very useful, because you have nothing to focus on.

I have heard complaints about a number of different things in
pg_stat_statements, like the fact that we don't always manage to
replace constants with '?'/'$n' characters in all cases. I heard about
that quite a few times during my time at Heroku. But never this.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Peter Geoghegan (#5)
Re: 64-bit queryId?

On Sat, Sep 30, 2017 at 8:39 AM, Peter Geoghegan <pg@bowt.ie> wrote:

Isn't that already true in the case of queryId? I've never heard any
complaints about collisions. Most people don't change
pg_stat_statements.max, so the probability of a collision is more like
1%. And, that's the probability of *any* collision, not the
probability of a collision that the user actually cares about. The
majority of entries in pg_stat_statements among those ten thousand
will not be interesting.

Correction: ten thousand is an example value of pg_stat_statements.max
in the docs, not the default value. The default is five thousand.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#5)
Re: 64-bit queryId?

On Sat, Sep 30, 2017 at 11:39 AM, Peter Geoghegan <pg@bowt.ie> wrote:

On Sat, Sep 30, 2017 at 7:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Assuming, however, that you don't manage to prove all known
mathematics inconsistent, what one might reasonably hope to do is
render collisions remote enough that one need not worry about them too
much in practice.

Isn't that already true in the case of queryId? I've never heard any
complaints about collisions. Most people don't change
pg_stat_statements.max, so the probability of a collision is more like
1%. And, that's the probability of *any* collision, not the
probability of a collision that the user actually cares about. The
majority of entries in pg_stat_statements among those ten thousand
will not be interesting. Often, 90%+ will be one-hit wonders. If that
isn't true, then you're probably not going to find pg_stat_statements
very useful, because you have nothing to focus on.

Well, I think that the fact that pg_stat_statements.max exists at all
is something that could be fixed now that we have DSA. It's hard to
tell right now whether the fact that we don't hear about collisions is
an artifact of that limit or of the underlying workload. Also, if
someone did have collisions, how would they even know?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#7)
Re: 64-bit queryId?

On 2017-09-30 11:50:08 -0400, Robert Haas wrote:

Well, I think that the fact that pg_stat_statements.max exists at all
is something that could be fixed now that we have DSA.

You normally *do* want a limit imo. And given that query strings are now
stored externally, I don't think there's a huge space concern anymore?

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#5)
Re: 64-bit queryId?

Peter Geoghegan <pg@bowt.ie> writes:

On Sat, Sep 30, 2017 at 7:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Assuming, however, that you don't manage to prove all known
mathematics inconsistent, what one might reasonably hope to do is
render collisions remote enough that one need not worry about them too
much in practice.

Isn't that already true in the case of queryId? I've never heard any
complaints about collisions.

More to the point: with 32-bit IDs, it's apparent that you shouldn't
really rely on them being unique, and should design your usage so that
it will survive collisions. Robert seems to be arguing that if we
merely made the IDs wider, it would be okay to design applications that
don't allow for that and would fail hard on a collision. I'm reminded
of Weinberg's famous line "If builders built houses the way programmers
build programs, the first woodpecker to come along would destroy
civilization".

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#9)
Re: 64-bit queryId?

On 2017-09-30 12:03:57 -0400, Tom Lane wrote:

Peter Geoghegan <pg@bowt.ie> writes:

On Sat, Sep 30, 2017 at 7:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Assuming, however, that you don't manage to prove all known
mathematics inconsistent, what one might reasonably hope to do is
render collisions remote enough that one need not worry about them too
much in practice.

Isn't that already true in the case of queryId? I've never heard any
complaints about collisions.

More to the point: with 32-bit IDs, it's apparent that you shouldn't
really rely on them being unique, and should design your usage so that
it will survive collisions. Robert seems to be arguing that if we
merely made the IDs wider, it would be okay to design applications that
don't allow for that and would fail hard on a collision. I'm reminded
of Weinberg's famous line "If builders built houses the way programmers
build programs, the first woodpecker to come along would destroy
civilization".

I think you're putting words and intent into Robert's mouth. If you
design a hashtable you're concerned about unnecessary collisions, even
if you're aware of them. Additionally, it's not clear you always can do
all that much about the collisions, without accepting undue overhead -
see e.g. pg_stat_statements.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#9)
Re: 64-bit queryId?

On Sat, Sep 30, 2017 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

More to the point: with 32-bit IDs, it's apparent that you shouldn't
really rely on them being unique, and should design your usage so that
it will survive collisions.

But the point is precisely that we do not do that. pg_stat_statements
"survives" collisions, but nothing good happens.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Alexander Korotkov
aekorotkov@gmail.com
In reply to: Peter Geoghegan (#5)
Re: 64-bit queryId?

On Sat, Sep 30, 2017 at 6:39 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Sat, Sep 30, 2017 at 7:34 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

Assuming, however, that you don't manage to prove all known
mathematics inconsistent, what one might reasonably hope to do is
render collisions remote enough that one need not worry about them too
much in practice.

Isn't that already true in the case of queryId? I've never heard any
complaints about collisions. Most people don't change
pg_stat_statements.max, so the probability of a collision is more like
1%. And, that's the probability of *any* collision, not the
probability of a collision that the user actually cares about. The
majority of entries in pg_stat_statements among those ten thousand
will not be interesting. Often, 90%+ will be one-hit wonders. If that
isn't true, then you're probably not going to find pg_stat_statements
very useful, because you have nothing to focus on.

I heard from customers that they periodically dump contents of
pg_stat_statements and then build statistics over long period of time. If
even they leave default pg_stat_statements.max unchanged, probability of
collision would be significantly higher.

I have heard complaints about a number of different things in

pg_stat_statements, like the fact that we don't always manage to
replace constants with '?'/'$n' characters in all cases. I heard about
that quite a few times during my time at Heroku. But never this.

I'm not sure that we should be oriented only by users complaints in this
problem by couple reasons.

1) Some defects could leave unrevealed by users during long periods of
time. We have examples of GiST picksplit algorithm to be bad resulting bad
query performance. However, users never complained about that because they
didn't know what is real fair performance for this kind of queries. In the
same way users may suffer from queryId collisions during long time without
noticing it. They might have inaccurate statistics of queries execution,
but they don't notice it because they have nothing to compare.

2) Ideally, we should fix potential problems before they materialize, not
only after users complaints.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#13Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#8)
Re: 64-bit queryId?

On Sat, Sep 30, 2017 at 11:55 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-09-30 11:50:08 -0400, Robert Haas wrote:

Well, I think that the fact that pg_stat_statements.max exists at all
is something that could be fixed now that we have DSA.

You normally *do* want a limit imo. And given that query strings are now
stored externally, I don't think there's a huge space concern anymore?

Well, that's probably true. I guess you don't want your query
generator to run the system out of memory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#13)
Re: 64-bit queryId?

On 2017-09-30 18:17:37 -0400, Robert Haas wrote:

On Sat, Sep 30, 2017 at 11:55 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-09-30 11:50:08 -0400, Robert Haas wrote:

Well, I think that the fact that pg_stat_statements.max exists at all
is something that could be fixed now that we have DSA.

You normally *do* want a limit imo. And given that query strings are now
stored externally, I don't think there's a huge space concern anymore?

Well, that's probably true. I guess you don't want your query
generator to run the system out of memory.

Hah. I'm just pondering running sqlsmith against a pg without such a
limit.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Bruce Momjian
bruce@momjian.us
In reply to: Alexander Korotkov (#12)
Re: 64-bit queryId?

On 30 September 2017 at 21:03, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

I heard from customers that they periodically dump contents of
pg_stat_statements and then build statistics over long period of time. If
even they leave default pg_stat_statements.max unchanged, probability of
collision would be significantly higher.

Indeed. It's simple enough to export stats to prometheus with queryid
as the key. Then even if the query ages out of the database stats you
have graphs and derivative metrics going further back.

I have to admit this was my first reaction to the idea of using sha1
hashes for git commits as well. But eventually I came around. If the
chances of a hash collision are smaller than a cosmic ray flipping a
bit or a digital electronics falling into a meta-stable state then I
had to admit there's not much value in being theoretically more
correct.

In practice if the query has aged out of pg_stat_statements and you're
exporting pg_stat_statements metrics to longer-term storage there's
really nothing more "correct" than just using a long enough hash and
assuming there are no collisions anyways. If 64-bits is still not
sufficient we could just go to 160-bit sha1 or 256-bit sha256.

Actually there's a reason I'm wondering if we shouldn't use a
cryptographic hash or even an HMAC. Currently if you're non-superuser
we, quite sensibly, hide the query text. But we also hide the queryid.
The latter is really inconvenient since it really makes the stats
utterly useless. I'm not sure what the rationale was but the only
thing I can think of is a fear that it's possible to reverse engineer
the query using brute force. An HMAC, or for most purposes even a
simple cryptographic hash with a secret salt would make that
impossible.

(I have other advances in pg_stat_statements I would love to see. It
would be so much more helpful if pg_stat_statements also kept a few
examples of query parameters such as the most recent set, the set that
caused the longest execution, maybe the set with the largest of each
metric.)

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#15)
Re: 64-bit queryId?

Greg Stark <stark@mit.edu> writes:

Indeed. It's simple enough to export stats to prometheus with queryid
as the key. Then even if the query ages out of the database stats you
have graphs and derivative metrics going further back.

I'm not really ready to buy into that as a supported use-case, because
it immediately leads to the conclusion that we need to promise stability
of query IDs across PG versions, which seems entirely infeasible to me
(at least with anything like the current method of deriving them).

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#16)
Re: 64-bit queryId?

On 1 October 2017 at 16:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Greg Stark <stark@mit.edu> writes:

Indeed. It's simple enough to export stats to prometheus with queryid
as the key. Then even if the query ages out of the database stats you
have graphs and derivative metrics going further back.

I'm not really ready to buy into that as a supported use-case, because
it immediately leads to the conclusion that we need to promise stability
of query IDs across PG versions, which seems entirely infeasible to me
(at least with anything like the current method of deriving them).

Well these kinds of monitoring systems tend to be used by operations
people who are a lot more practical and a lot less worried about
theoretical concerns like that.

What they're going to want to do is things like "alert on any query
using more than 0.1 cpu core" or "alert on any query being the top i/o
consumer that isn't amongst the top 10 over the previous day" or
"alert on any query using more than 4x the cpu or i/o on one database
than it does on average across all our databases". That last one is a
case where it would be nice if the queryid values were stable across
versions but it's hardly surprising that they aren't.

In context the point was merely that the default
pg_stat_statements.max of 5000 isn't sufficient to argue that 32-bit
values are enough. It wouldn't be hard for there to be 64k different
queries over time and across all the databases in a fleet and at that
point it becomes likely there'll be a 32-bit collision.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#17)
Re: 64-bit queryId?

On Sun, Oct 1, 2017 at 3:48 PM, Greg Stark <stark@mit.edu> wrote:

Well these kinds of monitoring systems tend to be used by operations
people who are a lot more practical and a lot less worried about
theoretical concerns like that.

+1, well said.

In context the point was merely that the default
pg_stat_statements.max of 5000 isn't sufficient to argue that 32-bit
values are enough. It wouldn't be hard for there to be 64k different
queries over time and across all the databases in a fleet and at that
point it becomes likely there'll be a 32-bit collision.

Yeah.

I think Alexander Korotkov's points are quite good, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Magnus Hagander
magnus@hagander.net
In reply to: Robert Haas (#18)
Re: 64-bit queryId?

On Mon, Oct 2, 2017 at 1:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Oct 1, 2017 at 3:48 PM, Greg Stark <stark@mit.edu> wrote:

Well these kinds of monitoring systems tend to be used by operations
people who are a lot more practical and a lot less worried about
theoretical concerns like that.

+1, well said.

+1 as well. I think these people would be perfectly find by it changing
across a version upgrade as long as they know that's the deal. *Most* of
the time there is no version upgrade going on, so it would work fine that
time.

Most operations people already deal with a lot of such parameters changing
around. I'm sure most of them would be more than happy with an improvement,
even if it's not mathematically perfect.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

#20Joshua D. Drake
jd@commandprompt.com
In reply to: Robert Haas (#18)
Re: 64-bit queryId?

On 10/01/2017 04:22 PM, Robert Haas wrote:

On Sun, Oct 1, 2017 at 3:48 PM, Greg Stark <stark@mit.edu> wrote:

Well these kinds of monitoring systems tend to be used by operations
people who are a lot more practical and a lot less worried about
theoretical concerns like that.

+1, well said.

In context the point was merely that the default
pg_stat_statements.max of 5000 isn't sufficient to argue that 32-bit
values are enough. It wouldn't be hard for there to be 64k different
queries over time and across all the databases in a fleet and at that
point it becomes likely there'll be a 32-bit collision.

Yeah.

I think Alexander Korotkov's points are quite good, too.

+1 to both of these as well.

jD

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
***** Unless otherwise stated, opinions are my own. *****

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Robert Haas
robertmhaas@gmail.com
In reply to: Joshua D. Drake (#20)
#22Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Joshua D. Drake (#20)
In reply to: Robert Haas (#21)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#23)
#25Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#24)
#26Michael Paquier
michael@paquier.xyz
In reply to: Alexander Korotkov (#25)
#27Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#25)
#28Vladimir Sitnikov
sitnikov.vladimir@gmail.com
In reply to: Robert Haas (#21)
#29Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#27)
#30Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#29)
#31Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#25)
#32Alexander Korotkov
aekorotkov@gmail.com
In reply to: Robert Haas (#31)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#32)
#34Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#31)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#34)
#36Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#35)
#37Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#36)
#38Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#37)
#39Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#38)
#40Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#39)
#41Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#40)
#42Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#41)
#43Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#42)
#44Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#43)
#45Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#44)
#46Robert Haas
robertmhaas@gmail.com
In reply to: Julien Rouhaud (#45)
#47Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#46)
#48Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#47)
#49Robert Haas
robertmhaas@gmail.com
In reply to: Julien Rouhaud (#48)
#50Julien Rouhaud
rjuju123@gmail.com
In reply to: Robert Haas (#49)