"global" & shared sequences

Started by Jonathan Vanasco, over 10 years ago. 4 messages. general
#1 Jonathan Vanasco
postgres@2xlp.com

Hoping to glean some advice from the more experienced....

The major component of our application currently tracks a few dozen object types, and the total number of objects is in the 100s Millions range. Postgres will potentially be tracking billions of objects.

Right now the primary key for our "core" objects is based on a per-table sequence, but each object also has a secondary id drawn from a global/shared sequence. We expose everything via a connected object graph, so we effectively needed a global sequence. We are currently scaled vertically (1x writer, 2x reader).

I'd like to avoid taking on any more technical debt, and am not thrilled with the current setup. Our internal relations all use the table's primary key, but the external (API, web) queries use the global id. Every table carries 2 indexes, and we need to convert a global id to a table id before doing a query. If we could replace the per-table primary key with the global id, we'd free up some disk space in the indexes and tables -- and could drop the performance cache that maps table ids to global ids.
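For concreteness, here's a minimal sketch of the change being considered; table and column names (`asset`, `global_id`, `global_id_seq`) are hypothetical:

```sql
-- Current shape: per-table PK plus a globally unique secondary id,
-- i.e. two indexes per table.
CREATE SEQUENCE global_id_seq;

CREATE TABLE asset (
    id        BIGSERIAL PRIMARY KEY,    -- per-table sequence
    global_id BIGINT NOT NULL UNIQUE DEFAULT nextval('global_id_seq')
);

-- Proposed shape: the shared sequence drives the primary key directly,
-- eliminating the second column and its index.
CREATE TABLE asset_proposed (
    id BIGINT PRIMARY KEY DEFAULT nextval('global_id_seq')
);
```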

The concerns that I have before moving ahead are:

1. General performance at different stages of DB size. With 18 sequences, our keys/indexes are simply smaller than they'd be with one shared sequence; I wonder how this will impact lookups and joins.
2. Managing this sequence the next time we scale the db (which would probably mean sharding, unless others have a suggestion).

If anyone has insights, they would be greatly appreciated.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jonathan Vanasco (#1)
Re: "global" & shared sequences

On 10/1/15 6:48 PM, Jonathan Vanasco wrote:

> 1. General performance at different stages of DB size. With 18 sequences, our keys/indexes are simply smaller than they'd be with one shared sequence; I wonder how this will impact lookups and joins.

I'm not really following here... the size of an index is determined by
the number of tuples in it and the average width of each tuple. So as
long as you're using the same size of data type, 18 vs 1 sequence won't
change the size of your indexes.

> 2. Managing this sequence the next time we scale the db (which would probably mean sharding, unless others have a suggestion).

Sequences are designed to be extremely fast to assign. If you ever did
find a single sequence being a bottleneck, you could always start
caching values in each backend. I think it'd be hard (if not impossible)
to turn a single global sequence into a real bottleneck.
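The per-backend caching Jim mentions is built into Postgres sequences; a sketch, with a hypothetical sequence name:

```sql
-- Each backend pre-allocates 50 values, so most nextval() calls never
-- touch shared state. Trade-off: values cached by a backend that
-- disconnects are discarded, leaving gaps in the id space.
ALTER SEQUENCE global_id_seq CACHE 50;

SELECT nextval('global_id_seq');
```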

If you start sharding you'll need to either create a composite ID where
part of the ID is a shard identifier (say, the top 8 bits), or assign
IDs in ranges dedicated to each shard. There's work being done right
now to make the range-assignment option a bit easier. Probably better
still would be sharding on something like object or customer; that way
you only have to look up which shard the customer lives in.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


#3 Jonathan Vanasco
postgres@2xlp.com
In reply to: Jim Nasby (#2)
Re: "global" & shared sequences

Thanks for the reply.

On Oct 2, 2015, at 3:26 PM, Jim Nasby wrote:

> I'm not really following here... the size of an index is determined by the number of tuples in it and the average width of each tuple. So as long as you're using the same size of data type, 18 vs 1 sequence won't change the size of your indexes.

I'm pretty much concerned with exactly that -- the general distribution of numbers, which affects the average size/length of each key.

Using an even distribution as an example, the average width of the keys can increase by two digits:

Since we have ~18 object types, the primary keys in each table might range from 1 to 9,999,999 (seven digits).
Using a shared sequence, the keys for the same dataset would range from 1 to roughly 180,000,000 (nine digits).

Each table is highly related, and may fkey onto 2-4 other tables... so I'm a bit wary of this change. But if it works for others, I'm fine with that!

> Sequences are designed to be extremely fast to assign. If you ever did find a single sequence being a bottleneck, you could always start caching values in each backend. I think it'd be hard (if not impossible) to turn a single global sequence into a real bottleneck.

I don't think so either, but everything I've read has been theoretical -- so I was hoping someone here could give the "yeah, no issue!" from experience. The closest production account I found was via the BDR plugin (the only relevant thing that turned up in a search), and there seemed to be anecdotal reports of sequences becoming bottlenecks -- but those involved its code that pre-generates allowable sequence ids on each node.


#4 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jonathan Vanasco (#3)
Re: "global" & shared sequences

On 10/2/15 4:08 PM, Jonathan Vanasco wrote:

> Using an even distribution as an example, the average width of the keys can increase by two digits:

Assuming you're using int4 or int8, that doesn't matter: both are
fixed-width types, so a nine-digit value occupies exactly as much space
on disk as a one-digit value. The only other possible issue I can think
of would be it somehow throwing the planner stats off, but I think the
odds of that are very small.

>> Sequences are designed to be extremely fast to assign. If you ever did find a single sequence being a bottleneck, you could always start caching values in each backend. I think it'd be hard (if not impossible) to turn a single global sequence into a real bottleneck.

> I don't think so either, but everything I've read has been theoretical -- so I was hoping someone here could give the "yeah, no issue!" from experience. The closest production account I found was via the BDR plugin (the only relevant thing that turned up in a search), and there seemed to be anecdotal reports of sequences becoming bottlenecks -- but those involved its code that pre-generates allowable sequence ids on each node.

You could always run a custom pgbench script with a PREPAREd SELECT
nextval() and compare that to a prepared SELECT currval(). You might
notice a difference at higher client counts with no caching, but I doubt
you'd see much difference with caching turned on.
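A sketch of such a benchmark run; the database name and sequence name are placeholders:

```shell
# Custom pgbench script exercising only the sequence assignment.
echo "SELECT nextval('global_id_seq');" > nextval.sql

# -n: skip vacuum, -M prepared: use prepared statements,
# -c 32: 32 clients, -j 4: 4 worker threads, -T 30: run 30 seconds.
pgbench -n -M prepared -f nextval.sql -c 32 -j 4 -T 30 testdb
```

Repeating the run with a `currval.sql` script (after seeding each session with one `nextval` call) gives the baseline to compare against.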
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
