Sequence vs UUID
Hello. We were trying to understand whether we should use a UUID or a sequence
in general for primary keys. Many of the blogs I saw over the internet (one is
below), across multiple databases, mostly state that a sequence is better than
a UUID. But I think that in concurrent data load scenarios a UUID will spread
the contention point, whereas a sequence can be a single point of contention.
So we want to understand from the experts here: are there any clear rules, or
a pros-vs-cons list for each, to understand the exact scenarios in which we
should go for one over the other? Basically, I wanted to see if we can run
some tests on sample data to measure the percentage of overhead on read and
write performance with UUID vs sequence keys, to draw some general conclusion.
Also, considering that open source Postgres is the base for many databases
like Redshift etc., the results that apply to Postgres should apply to the
others as well.
https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/
On Jan 26, 2023, at 11:17, veem v <veema0000@gmail.com> wrote:
So we want to understand from experts here, if there are any clear rules available or if we have any pros vs cons list available for each of those to understand the exact scenario in which we should go for one over other?
Clear rules are a bit difficult to come by here, but there are general guidelines.
First, the distinction isn't exactly UUIDs vs sequences. There are two distinctions:
1. UUIDs vs bigints.
2. Sequential values vs random values.
The advantage of bigints vs UUIDs is that bigints will be faster for PostgreSQL to process in a variety of ways, as well as being half the size of a UUID (so, smaller tables, smaller indexes, etc.). The main advantage of UUIDs vs bigints is that, if you are using random values, there's existing easy-to-use infrastructure for generating UUIDs, versus generating random bigints yourself.
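Those size claims are easy to check directly in psql; a quick sketch (this assumes PostgreSQL 13 or later, where gen_random_uuid() is built in):

```sql
-- Byte sizes of the two key types as stored.
SELECT pg_column_size(1::bigint)         AS bigint_bytes,  -- 8
       pg_column_size(gen_random_uuid()) AS uuid_bytes;    -- 16
```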
The advantage of sequential values is that they interact much better with caching than random values. Random values will have a harder time maintaining a reasonable in-memory working set than sequential values. (Of course, if the database fits entirely in memory, this isn't as much of an issue). The advantage of random values is that they are more opaque; if there is a 123480102 in a sequential key, an attacker can be confident there's also a 123480103, which a random value avoids. There are algorithms for generating sequential values that avoid this by having hard-to-guess less significant digits.
Another advantage of sequential values is that they are (roughly) time-ordered, so they can be used to get "most recent" efficiently.
One concern about sequential values that generally is not a real issue is the bottleneck of creating new sequential values. The sequence functionality in PostgreSQL is very concurrency-friendly.
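For example, a sequence can preallocate a block of values per session, so concurrent inserters rarely touch the shared sequence state (a sketch; the table and sequence names are made up):

```sql
-- Each session caches 50 values from the sequence, reducing
-- inter-session contention on nextval().
CREATE SEQUENCE orders_id_seq CACHE 50;

CREATE TABLE orders (
    id     bigint PRIMARY KEY DEFAULT nextval('orders_id_seq'),
    placed timestamptz NOT NULL DEFAULT now()
);
```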
UUIDs can be generated in such a way that they have sequential properties; see:
On Thu, Jan 26, 2023 at 1:18 PM veem v <veema0000@gmail.com> wrote:
* UUIDs are big for an identifier (16 bytes); a sequence value is 4 or 8
* sequences are NOT a contention point; the database uses some tricks to work
around that
* UUIDs are basically random data, causing page fragmentation. This is
particularly bad in auto-clustering architectures like SQL Server
* Also, UUIDs can deliver very poor buffer hit ratios when sweeping ranges
of records on large tables.
merlin
On 1/26/23 14:36, Merlin Moncure wrote:
On Thu, Jan 26, 2023 at 1:18 PM veem v <veema0000@gmail.com> wrote:
"Sweeping ranges of records" by ID suggests you have information in your
id. If you're comfortable with sequence generation as a surrogate for
time-point of entry, by all means have at it.
For many of us, contemporaneously generated records have nothing to do
with each other. (And they carry attributes which group them.)
UUIDs do a very good job of avoiding id-overlap across domains (compare
to a sequence for each domain).
On 26/01/2023 20:17 CET veem v <veema0000@gmail.com> wrote:
I think that [1] provides a good summary. Performance consideration is just one
aspect. Is there a technical requirement for using UUID over sequential values?
If there's a single generator of primary keys, use bigint sequences. In case of
multiple generators (multi-master replication, sharding, clients generating IDs),
consider UUID.
There are arguments against sequential PKs, e.g. they give away too much info and
allow attacks such as forced browsing [2]. The first I can understand: you may
not want to reveal the number of users or customers. But access control should
prevent forced browsing.
[1]: https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/
[2]: https://owasp.org/www-community/attacks/Forced_browsing
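As a sketch of those two choices (the table names are made up, and gen_random_uuid() assumes PostgreSQL 13+):

```sql
-- Single generator: a bigint identity column backed by a sequence.
CREATE TABLE accounts (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);

-- Multiple generators (multi-master, sharding, client-minted IDs):
-- random UUIDs need no coordination between generators.
CREATE TABLE events (
    id      uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    payload jsonb
);
```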
--
Erik
You could always create a UUID matching table for anything displayed to
users and keep a private ID for anything internal. From my practical
standpoint, one is 8 bytes, the other is 16 or 32. Any database
implementation should guarantee a unique value. I have had cases where it
didn't work, but those were rare and possibly a coding error. Basically, I
don't see a practical difference apart from URL access and giving away a PK.
I haven't had an issue with it. Some people have. I still use them internally,
but my requirements are not public facing.
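A minimal sketch of that split (all names here are made up): keep a sequential key for internal joins and expose only a random UUID to users:

```sql
CREATE TABLE customers (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- internal only
    public_id uuid NOT NULL UNIQUE DEFAULT gen_random_uuid(),  -- shown in URLs
    name      text NOT NULL
);
-- The application looks rows up by public_id; foreign keys use id.
```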
Thanks,
Ben
On Thu, Jan 26, 2023, 4:55 PM Erik Wienhold <ewie@ewie.name> wrote:
On Thu, Jan 26, 2023 at 3:50 PM Rob Sargent <robjsargent@gmail.com> wrote:
On 1/26/23 14:36, Merlin Moncure wrote:
On Thu, Jan 26, 2023 at 1:18 PM veem v <veema0000@gmail.com> wrote:
"Sweeping ranges of records" by ID suggests you have information in your
id. If you're comfortable with sequence generation as a surrogate for
time-point of entry, by all means have at it.
Yeah, sequences preserve time locality. Clustering the table on the
primary key would then optimize certain cases. Clustering is useless on
guids. Sequences do better here and hybrid natural key models can do
better still. Buffer hit ratio and read amplification can be major issues
if not understood and controlled for.
For many of us, contemporaneously generated records have nothing to do
with each other. (And they carry attributes which group them.)
UUIDs do a very good job of avoiding id-overlap across domains (compare to
a sequence for each domain).
While true, there are other solutions to the issue, one or more of,
* separating id generation from storage
* reserving id ranges
* creating domain id composite alongside generated id
Relying solely on a guid to generate uniqueness seems like a broken model
anyway in the general case. Whatever thing is responsible for
suppressing information conflicts ought to be emitting the id.
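Reserving id ranges, for instance, can be as simple as giving each generator's sequence a disjoint range (a sketch; the names and ranges are arbitrary):

```sql
-- Node A mints ids in 1..999999999, node B in 1000000000..1999999999.
CREATE SEQUENCE ids_node_a START WITH 1          MAXVALUE 999999999;
CREATE SEQUENCE ids_node_b START WITH 1000000000 MAXVALUE 1999999999;
```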
merlin
On 1/26/23 15:55, Erik Wienhold wrote:
On 26/01/2023 20:17 CET veem v <veema0000@gmail.com> wrote:
There are arguments against sequential PK, e.g. they give away too much info and
allow attacks such as forced browsing[2]. The first I can understand: you may
not want to reveal the number of users or customers. But access control should
prevent forced browsing.
Shouldn't your application layer isolate the users from the database? UUIDs
are all over the DBs I manage, but the PKs are all sequences.
--
Born in Arizona, moved to Babylonia.
Hello.
I have been using UUIDs for quite a long time now. The reason I began to use UUIDs was the need to be able to move data between
databases and the need to create records outside the database. You should use a UUID as the primary key for a record and also have
some bookkeeping UUIDs in the record, like the "origin" of the record. In this way it is "easy" to handle different sources of
data.
We have also written some nice replicating software on this basis (postsync) that can check for alterations in one database
and update others. In this way we can keep one or many replicas of databases.
So forget about performance issues (there will ALWAYS be a need for faster systems). The ease and functionality with UUIDs
is so much better. Sequence keys are a terrible idea!
// GH
Den 2023-01-26 kl. 20:17, skrev veem v:
--
gorhas@raditex.nu
http://www.raditex.nu
Raditex Control AB
Bo 229, 715 91 ODENSBACKEN
Mob: 070-5530148
So forget about performance issues (there will ALWAYS be a need for faster systems). The ease and functionality with UUIDs
is so much better. Sequence keys are a terrible idea!
// GH
Wow. I am not alone.
On 27/01/2023 01:48 CET Ron <ronljohnsonjr@gmail.com> wrote:
On 1/26/23 15:55, Erik Wienhold wrote:
Shouldn't your application layer isolate the users from the database? UUIDs
are all over the DBs I manage, but the PKs are all sequences.
Yes, I meant the application layer, not Postgres' access control.
--
Erik
Why is it a terrible idea? I have been using them for years without a
single problem. I don't rely on them for creation order. Terrible seems a bit
extreme.
Thanks,
Ben
On Sat, Jan 28, 2023, 3:39 PM Erik Wienhold <ewie@ewie.name> wrote:
Type 4 UUIDs are sub-optimal for big tables because cache hit rates drop
through the floor.
This lesson was burned into my psyche waaaay back in the Clinton
administration. It was my task to speed up a five hour batch job which read
input records from a flat file, did some validations and then inserted
them. Simply sorting the input file on the primary key fields -- we used
natural keys, not synthetics -- dropped the run time to two hours. (VMS
SORT saved the day, because you could tell it the sort order you wanted;
thus, I could preserve the header record at the top of the file, and the
trailer record at the end of the file without jumping through a bunch of hoops.)
On 1/28/23 19:44, Benedict Holland wrote:
--
Born in Arizona, moved to Babylonia.
On Sat, Jan 28, 2023 at 6:02 PM Ron <ronljohnsonjr@gmail.com> wrote:
Type 4 UUIDs are sub-optimal for big tables because cache hit rates drop through the floor.
This can be mitigated with judicious use of a sequence at the front of
the uuidv4.
https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/
More effort than just calling the built-in gen_random_uuid() or
equivalent in app code, but a substantial performance gain for your
effort.
https://github.com/tvondra/sequential-uuids
And in a managed environment where you can't install custom
extensions, a fairly simple function that divides unix epoch seconds,
combined with a call to overlay(...), should suffice performance-wise.
At 60 seconds, this will loop every 45 days or so, and you can choose
how much "sequentialness" works for you, from 1 to 4 bytes at the
expense of pseudo-randomness.
-----------------------------
-- Generate time interval UUID
CREATE OR REPLACE FUNCTION gen_interval_uuid(
    interval_seconds int4 = 60,
    block_num_bytes int2 = 2
)
RETURNS uuid LANGUAGE sql VOLATILE PARALLEL SAFE AS $$
  SELECT encode(
           overlay(
             -- convert the uuid to a byte array
             uuid_send(gen_random_uuid())
             -- use only the bottom bytes of the epoch/interval counter
             PLACING substring(
               int4send((extract(epoch FROM now()) / interval_seconds)::int4)
               FROM (5 - block_num_bytes)
             )
             -- place them at the front of the uuid
             FROM 1
           ),
           -- convert the resulting byte array to hex for the uuid cast
           'hex')::uuid
  WHERE interval_seconds > 0 AND block_num_bytes BETWEEN 1 AND 4
$$;
-----------------------------
Technically this deviates from the v4 spec. You could always convert to a
UUIDv8, which is the intention behind that new version, even though the
standard hasn't been ratified yet.
Cheers,
Miles Elam
Then it's not a Type 4 UUID, which is perfectly fine; just not random.
Also, should now() be replaced by clock_timestamp(), so that it can be
called multiple times in the same transaction?
On 1/28/23 21:28, Miles Elam wrote:
On Sat, Jan 28, 2023 at 6:02 PM Ron <ronljohnsonjr@gmail.com> wrote:
--
Born in Arizona, moved to Babylonia.
On Sat, Jan 28, 2023 at 8:02 PM Ron <ronljohnsonjr@gmail.com> wrote:
Then it's not a Type 4 UUID, which is perfectly fine; just not random.
Yep, which is why it really should be re-versioned to UUIDv8 to be
pedantic. In everyday use though, almost certainly doesn't matter.
Also, should now() be replaced by clock_timestamp(), so that it can be
called multiple times in the same transaction?
Not necessary. Instead of 122 bits of entropy, you get 106 bits of
entropy and a new incremented prefix every minute. now() vs
clock_timestamp() wouldn't make a substantive difference. Should still
be reasonably safe against the birthday paradox for more than a
century when creating more than a million UUIDs per second.
I have a question. As I understand it, usage-wise there are multiple
benefits of UUID over sequences: for example, in a distributed app where we
may not be able to rely on a single-point generator like a sequence, in a
multi-master architecture, or with sharding.
Looking purely at performance, the key advantage of a sequence is its
smaller storage size: more index rows fit in cache, which benefits read
queries and should also benefit joins, and fetching a value from a sequence
is cheaper than calculating a UUID. The downside is that during write
operations it can be a point of contention under concurrent data load, as
every incoming request tries to modify the same table/index page/block. But
as mentioned in this blog
(https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/), a UUID
can be made sequential, e.g. with a timestamp prefix. However, isn't making
the UUID sequential actually a disadvantage again? It would become a
contention point for the unique index, as each incoming write will now
fight for the same block/page during concurrent data load and contend for
the same table block, or one side of the index branch/leaf blocks, whereas
with random UUIDs the writes were spread across multiple blocks, so there
was no contention on any specific block. Please correct me if my
understanding is wrong.
On Sun, 29 Jan, 2023, 10:33 am Miles Elam, <miles.elam@productops.com>
wrote:
I was trying to test the performance of simple reads and writes for bigint
vs UUID. What we see is a ~3x performance degradation when joining on the
UUID columns compared to the bigint columns. Even just generating UUIDs
versus sequence values is ~3x slower by itself. And insert performance on
the same table for ~10 million rows is ~1 min 39 sec for bigint vs
~3 min 11 sec for UUID. Is such an extent of performance degradation
expected for UUID?

CREATE TABLE test1_UUID ( id bigint, source_id varchar(36) PRIMARY KEY, Name varchar(20) );
CREATE TABLE test2_UUID ( id bigint, source_id varchar(36) PRIMARY KEY, Name varchar(20) );
CREATE TABLE test1_bigint ( id bigint PRIMARY KEY, source_id varchar(36), Name varchar(20) );
CREATE TABLE test2_bigint ( id bigint PRIMARY KEY, source_id varchar(36), Name varchar(20) );

Loaded the same 10 million rows into each.
explain analyze select * from test1_bigint a, test2_bigint b where a.id = b.id;

Merge Join  (cost=12.31..875534.52 rows=10000021 width=100) (actual time=0.042..6974.575 rows=10000000 loops=1)
  Merge Cond: (a.id = b.id)
  ->  Index Scan using test1_bigint_pkey on test1_bigint a  (cost=0.43..362780.75 rows=10000021 width=50) (actual time=0.020..2070.079 rows=10000000 loops=1)
  ->  Index Scan using test2_bigint_2_pkey on test2_bigint b  (cost=0.43..362780.75 rows=10000021 width=50) (actual time=0.019..2131.086 rows=10000000 loops=1)
Planning Time: 0.207 ms
Execution Time: 7311.210 ms

set enable_seqscan=off;
explain analyze select * from test1_UUID a, test2_UUID b where a.source_id = b.source_id;

Merge Join  (cost=2.75..2022857.05 rows=10000021 width=100) (actual time=0.043..21954.213 rows=10000000 loops=1)
  Merge Cond: ((a.source_id)::text = (b.source_id)::text)
  ->  Index Scan using test1_uuid_pkey on test1_UUID a  (cost=0.56..936420.18 rows=10000021 width=50) (actual time=0.022..7854.143 rows=10000000 loops=1)
  ->  Index Scan using test2_uuid_2_pkey on test2_UUID b  (cost=0.56..936437.90 rows=10000021 width=50) (actual time=0.017..7971.187 rows=10000000 loops=1)
Planning Time: 0.516 ms
Execution Time: 22292.801 ms
**********
create sequence myseq cache 32767;
select count(nextval('myseq')) from generate_series(1,10000000);
1 row retrieved starting from 1 in 4 s 521 ms (execution: 4 s 502 ms, fetching: 19 ms)

select count(gen_random_uuid()) from generate_series(1,10000000);
1 row retrieved starting from 1 in 11 s 145 ms (execution: 11 s 128 ms, fetching: 17 ms)
On Mon, 30 Jan, 2023, 4:59 pm veem v, <veema0000@gmail.com> wrote:
On Mon, Jan 30, 2023 at 5:11 PM veem v <veema0000@gmail.com> wrote:
CREATE TABLE test1_UUID ( id bigint,source_id varchar(36) PRIMARY KEY, Name varchar(20) );
Maybe if you used a "native" 16-byte uuid, instead of its textual
serialization with dashes (36 bytes + length overhead), the gap would
narrow.
Dominique Devienne <ddevienne@gmail.com> writes:
Yeah, especially if your database is not using C locale. The
strcoll or ICU-based comparisons done on string types can be
enormously more expensive than the memcmp() used for binary
types like native uuid.
regards, tom lane
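Restating one of the test tables with a native uuid key, per the suggestions above (a sketch; not re-benchmarked here):

```sql
-- 16-byte binary uuid instead of varchar(36): smaller index, and
-- comparisons are plain memcmp() rather than collation-aware strcoll().
CREATE TABLE test1_uuid_native (
    id        bigint,
    source_id uuid PRIMARY KEY,
    name      varchar(20)
);
```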