init_sequence spill to hash table
Here /messages/by-id/24278.1352922571@sss.pgh.pa.us there
was some talk about init_sequence being a bottleneck when many sequences
are used in a single backend.
I think the attached implements what was talked about in the above link;
for me it seems to double the speed of a currval() loop over 30000
sequences, going from about 7 seconds to 3.5 on my laptop.
I thought I would post the patch early to see if this is actually wanted
before I do too much more work on it.
My implementation keeps using the linear list for sequences up to a
defined threshold (currently 32), then moves everything over to a
hash table and frees the list.
A more complete solution would contain regression tests to exercise the
hash table code.
I know there is a desire to move sequences over to a single table still,
but I see this as a separate patch, and storing current values in a hash
table for each backend should still be a win even if/when the single-table
stuff gets implemented.
Regards
David Rowley
Attachment: hashtab_seq_v0.1.patch (application/octet-stream)
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index f3344c6..daea137 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -60,15 +60,11 @@ typedef struct sequence_magic
* session. This is needed to hold onto nextval/currval state. (We can't
* rely on the relcache, since it's only, well, a cache, and may decide to
* discard entries.)
- *
- * XXX We use linear search to find pre-existing SeqTable entries. This is
- * good when only a small number of sequences are touched in a session, but
- * would suck with many different sequences. Perhaps use a hashtable someday.
*/
typedef struct SeqTableData
{
+ Oid relid; /* pg_class OID of this sequence, hash key must be first */
struct SeqTableData *next; /* link to next SeqTable object */
- Oid relid; /* pg_class OID of this sequence */
Oid filenode; /* last seen relfilenode of this sequence */
LocalTransactionId lxid; /* xact in which we last did a seq op */
bool last_valid; /* do we have a valid "last" value? */
@@ -81,7 +77,17 @@ typedef struct SeqTableData
typedef SeqTableData *SeqTable;
-static SeqTable seqtab = NULL; /* Head of list of SeqTable items */
+
+/*
+ * Marks threshold of the number of sequences we store in each backend
+ * in a linear list before we move the list over to a hash table.
+ * This improves performance for backends which must store many hundreds
+ * of sequences.
+ */
+#define SEQ_MOVE_TO_HASHTABLE_THRESHOLD 32
+
+static SeqTable seqlist = NULL; /* Head of list of SeqTable items */
+static HTAB *seqhashtab = NULL; /* Pointer to sequence hash table */
/*
* last_used_seq is updated by nextval() to point to the last used
@@ -92,6 +98,7 @@ static SeqTableData *last_used_seq = NULL;
static void fill_seq_with_data(Relation rel, HeapTuple tuple);
static int64 nextval_internal(Oid relid);
static Relation open_share_lock(SeqTable seq);
+static void switch_to_hashtable(void);
static void init_sequence(Oid relid, SeqTable *p_elm, Relation *p_rel);
static Form_pg_sequence read_seq_tuple(SeqTable elm, Relation rel,
Buffer *buf, HeapTuple seqtuple);
@@ -999,6 +1006,43 @@ open_share_lock(SeqTable seq)
}
/*
+ * Change from storing sequences in a linear list fashion
+ * to storing them in a hash table for fast lookups
+ */
+static void switch_to_hashtable(void)
+{
+ SeqTable next, hentry;
+ HASHCTL ctl;
+
+ Assert(seqhashtab == NULL);
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(SeqTableData);
+ ctl.hash = oid_hash;
+ ctl.alloc = malloc;
+
+ seqhashtab = hash_create("Sequence table", 1000, &ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_ALLOC);
+
+ /* Loop over the linear list and add each item into the hash table */
+ while (seqlist != NULL)
+ {
+ next = seqlist->next;
+
+ hentry = (SeqTable) hash_search(seqhashtab,
+ &seqlist->relid,
+ HASH_ENTER, NULL);
+
+ memcpy(hentry, seqlist, sizeof(SeqTableData));
+ hentry->next = NULL;
+
+ free(seqlist);
+ seqlist = next;
+ }
+}
+
+/*
* Given a relation OID, open and lock the sequence. p_elm and p_rel are
* output parameters.
*/
@@ -1007,22 +1051,51 @@ init_sequence(Oid relid, SeqTable *p_elm, Relation *p_rel)
{
SeqTable elm;
Relation seqrel;
+ int nsequences = 0;
+ bool found = false;
+
+ /*
+ * Most workload types will only use a small number of sequences during
+ * their tasks. However, certain workload types will have a much higher
+ * demand on sequences. For these high demand scenarios we'll store the
+ * sequences in a hash table, and since we're unable to tell initially if
+ * the workload will use many sequences we just start out assuming it won't
+ * and use a linear list to start off with; then, if we reach
+ * SEQ_MOVE_TO_HASHTABLE_THRESHOLD sequences, we switch over to use a
+ * hash table.
+ */
- /* Look to see if we already have a seqtable entry for relation */
- for (elm = seqtab; elm != NULL; elm = elm->next)
+ /* if the hash table exists then we'll search with that */
+ if (seqhashtab != NULL)
+ {
+ elm = (SeqTable) hash_search(seqhashtab, &relid, HASH_ENTER, &found);
+ }
+ else
+ {
+ /* no hash table, so perform linear search for the sequence */
+ for (elm = seqlist; elm != NULL; elm = elm->next, nsequences++)
{
if (elm->relid == relid)
+ {
+ found = true;
break;
}
+ }
+ if (!found)
+ {
/*
- * Allocate new seqtable entry if we didn't find one.
- *
- * NOTE: seqtable entries remain in the list for the life of a backend. If
- * the sequence itself is deleted then the entry becomes wasted memory,
- * but it's small enough that this should not matter.
+ * if the number of sequences found in the list has reached
+ * the threshold then we'll move the list over to a hash table
*/
- if (elm == NULL)
+ if (nsequences == SEQ_MOVE_TO_HASHTABLE_THRESHOLD)
+ {
+ switch_to_hashtable();
+ elm = (SeqTable) hash_search(seqhashtab, &relid, HASH_ENTER, &found);
+
+ Assert(found == false);
+ }
+ else
{
/*
* Time to make a new seqtable entry. These entries live as long as
@@ -1033,13 +1106,29 @@ init_sequence(Oid relid, SeqTable *p_elm, Relation *p_rel)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
+
+ /* put this sequence at the top of the list */
+ elm->next = seqlist;
+ seqlist = elm;
+ }
+ }
+ }
+
+ /*
+ * Allocate new seqtable entry if we didn't find one.
+ *
+ * NOTE: With the exception of the use of DISCARD SEQUENCES or DISCARD ALL
+ * seqtable entries remain in the list for the life of a backend. If
+ * the sequence itself is deleted then the entry becomes wasted memory,
+ * but it's small enough that this should not matter.
+ */
+ if (!found)
+ {
elm->relid = relid;
elm->filenode = InvalidOid;
elm->lxid = InvalidLocalTransactionId;
elm->last_valid = false;
elm->last = elm->cached = elm->increment = 0;
- elm->next = seqtab;
- seqtab = elm;
}
/*
@@ -1611,11 +1700,19 @@ ResetSequenceCaches(void)
{
SeqTableData *next;
- while (seqtab != NULL)
+ /* we should never have both a list and a table */
+ Assert(!(seqlist && seqhashtab));
+
+ while (seqlist != NULL)
{
- next = seqtab->next;
- free(seqtab);
- seqtab = next;
+ next = seqlist->next;
+ free(seqlist);
+ seqlist = next;
+ }
+
+ if (seqhashtab) {
+ hash_destroy(seqhashtab);
+ seqhashtab = NULL;
}
last_used_seq = NULL;
On 13.11.2013 11:55, David Rowley wrote:
I thought I would post the patch early to see if this is actually wanted
before I do too much more work on it.
Seems reasonable.
My implementation maintains using the linear list for sequences up to a
defined threshold (currently 32) then it moves everything over to a
hashtable and free's off the list.
Did you check how it would perform if you just always used the hash
table? Or if you just have a single entry before you move to hash table,
ie. set the threshold to 2? That would be slightly simpler.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Nov 14, 2013 at 12:15 AM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:
On 13.11.2013 11:55, David Rowley wrote:
I thought I would post the patch early to see if this is actually wanted
before I do too much more work on it.
Seems reasonable.
My implementation maintains using the linear list for sequences up to a
defined threshold (currently 32) then it moves everything over to a
hashtable and free's off the list.
Did you check how it would perform if you just always used the hash table?
Or if you just have a single entry before you move to hash table, ie. set
the threshold to 2? That would be slightly simpler.
I've just completed some more benchmarking of this. I didn't try dropping
the threshold down to 2 or 0 but I did tests at the cut over point and
really don't see much difference in performance between the list at 32 and
the hashtable at 33 sequences. The hash table version excels in the 16000
sequence test in comparison to the unpatched version.
Times are in milliseconds of the time it took to call currval() 100000
times for 1 sequence.

                 Patched    Unpatched   increased by
1 in cache       1856.452   1844.11     -1%
32 in cache      1841.84    1802.433    -2%
33 in cache      1861.558   not tested  N/A
16000 in cache   1963.711   10329.22    426%
There was not much point in testing 33 sequences with the unpatched version
as there is no switching to a hash table. Most likely the 1 and 2% drops in
speed are noise, as the only instructions added are a test in init_sequence
to check whether the hash table has been created, and a counter to count
the number of sequences in the list while still in list mode.
I also tested filling the cache with 30000 sequences then inserting 100000
records into a table which used currval() for the default of a column. The
sequence I used would have been half way up the linked list in the
unpatched version, so init_sequence would have had to loop 15000 times
before finding the sequence.
Here is the average and median over 8 runs of the INSERT statement:

          Patched    Unpatched   increased by
AVERAGE   482.2626   597.66375   24%
MEDIAN    471.2205   567.949     21%
Just for clarity, during testing I added the following line to
the switch_to_hashtable() function
elog(NOTICE, "moved sequences into hash table");
I've attached seqeuence.sql which includes all of my tests, the functions I
used for benchmarking and comments with the results that I got during
running the tests.
I've also attached the results formatted a little better in a spreadsheet.
Please let me know if there is something else you think I should test.
Regards
David Rowley
On 14.11.2013 14:38, David Rowley wrote:
I've just completed some more benchmarking of this. I didn't try dropping
the threshold down to 2 or 0 but I did tests at the cut over point and
really don't see much difference in performance between the list at 32 and
the hashtable at 33 sequences. The hash table version excels in the 16000
sequence test in comparison to the unpatched version.
Times are in milliseconds of the time it took to call currval() 100000
times for 1 sequence.

                 Patched    Unpatched   increased by
1 in cache       1856.452   1844.11     -1%
32 in cache      1841.84    1802.433    -2%
33 in cache      1861.558   not tested  N/A
16000 in cache   1963.711   10329.22    426%
If I understand those results correctly, the best case scenario with the
current code takes about 1800 ms. There's practically no difference with
N <= 32, where N is the number of sequences touched. The hash table
method also takes about 1800 ms when N=33. The performance of the hash
table is O(1), so presumably we can extrapolate from that that it's the
same for any N.
I think that means that we should just completely replace the list with
the hash table. The difference with a small N is lost in noise, so
there's no point in keeping the list as a fast path for small N. That'll
make the patch somewhat simpler.
- Heikki
Hi,
On 2013-11-13 22:55:43 +1300, David Rowley wrote:
Here /messages/by-id/24278.1352922571@sss.pgh.pa.us there
was some talk about init_sequence being a bottleneck when many sequences
are used in a single backend.
The attached I think implements what was talked about in the above link
which for me seems to double the speed of a currval() loop over 30000
sequences. It goes from about 7 seconds to 3.5 on my laptop.
I think it'd be a better idea to integrate the sequence caching logic
into the relcache. There's a comment about it:
* (We can't
* rely on the relcache, since it's only, well, a cache, and may decide to
* discard entries.)
but that's not really accurate anymore. We have the infrastructure for
keeping values across resets and we don't discard entries.
Since we already do a relcache lookup for every sequence manipulation
(c.f. init_sequence()) relying on it won't increase, but rather decrease
the overhead.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
I think it'd be a better idea to integrate the sequence caching logic
into the relcache. There's a comment about it:
* (We can't
* rely on the relcache, since it's only, well, a cache, and may decide to
* discard entries.)
but that's not really accurate anymore. We have the infrastructure for
keeping values across resets and we don't discard entries.
We most certainly *do* discard entries, if they're not open when a cache
flush event comes along.
I suppose it'd be possible to mark a relcache entry for a sequence
as locked-in-core, but that doesn't attract me at all. A relcache
entry is a whole lot larger than the amount of state we really need
to keep for a sequence.
One idea is to have a hashtable for the sequence-specific data,
but to add a link field to the relcache entry that points to the
non-flushable sequence hashtable entry. That would save the second
hashtable lookup as long as the relcache entry hadn't been flushed
since last use, while not requiring any violence to the lifespan
semantics of relcache entries. (Actually, if we did that, it might
not even be worth converting the list to a hashtable? Searches would
become a lot less frequent.)
regards, tom lane
On 2013-11-14 09:23:20 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
I think it'd be a better idea to integrate the sequence caching logic
into the relcache. There's a comment about it:
* (We can't
* rely on the relcache, since it's only, well, a cache, and may decide to
* discard entries.)
but that's not really accurate anymore. We have the infrastructure for
keeping values across resets and we don't discard entries.
We most certainly *do* discard entries, if they're not open when a cache
flush event comes along.
What I was aiming at is that we don't discard them because of a limited
cache size. I don't think it means much that we flush the entry when
it's changed but not referenced.
I suppose it'd be possible to mark a relcache entry for a sequence
as locked-in-core, but that doesn't attract me at all. A relcache
entry is a whole lot larger than the amount of state we really need
to keep for a sequence.
But effectively that's what already happens unless either somebody else
does an ALTER SEQUENCE or sinval overflow happened, right? So in
production workloads we already will keep both around since hopefully
neither altering a sequence nor sinval overflows should be a frequent
thing.
One idea is to have a hashtable for the sequence-specific data,
but to add a link field to the relcache entry that points to the
non-flushable sequence hashtable entry. That would save the second
hashtable lookup as long as the relcache entry hadn't been flushed
since last use, while not requiring any violence to the lifespan
semantics of relcache entries. (Actually, if we did that, it might
not even be worth converting the list to a hashtable? Searches would
become a lot less frequent.)
That's not a bad idea if we decide not to integrate them. And I agree,
there's not much need to have a separate hashtable in that case.
Greetings,
Andres Freund
Andres Freund <andres@2ndquadrant.com> writes:
On 2013-11-14 09:23:20 -0500, Tom Lane wrote:
We most certainly *do* discard entries, if they're not open when a cache
flush event comes along.
What I was aiming at is that we don't discard them because of a limited
cache size. I don't think it means much that we flush the entry when
it's changed but not referenced.
Well, I don't want non-user-significant events (such as an sinval queue
overrun) causing sequence state to get discarded. We would get bug
reports about lost sequence values.
regards, tom lane
On 2013-11-14 09:47:18 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
On 2013-11-14 09:23:20 -0500, Tom Lane wrote:
We most certainly *do* discard entries, if they're not open when a cache
flush event comes along.
What I was aiming at is that we don't discard them because of a limited
cache size. I don't think it means much that we flush the entry when
it's changed but not referenced.
Well, I don't want non-user-significant events (such as an sinval queue
overrun) causing sequence state to get discarded. We would get bug
reports about lost sequence values.
But we can easily do as you suggest and simply retain the entry in that
case. I am just not seeing the memory overhead argument as counting much
since we don't protect against it in normal operation.
Greetings,
Andres Freund
On Fri, Nov 15, 2013 at 3:03 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 14.11.2013 14:38, David Rowley wrote:
I've just completed some more benchmarking of this. I didn't try dropping
the threshold down to 2 or 0 but I did tests at the cut over point and
really don't see much difference in performance between the list at 32 and
the hashtable at 33 sequences. The hash table version excels in the 16000
sequence test in comparison to the unpatched version.
Times are in milliseconds of the time it took to call currval() 100000
times for 1 sequence.

                 Patched    Unpatched   increased by
1 in cache       1856.452   1844.11     -1%
32 in cache      1841.84    1802.433    -2%
33 in cache      1861.558   not tested  N/A
16000 in cache   1963.711   10329.22    426%

If I understand those results correctly, the best case scenario with the
current code takes about 1800 ms. There's practically no difference with N
<= 32, where N is the number of sequences touched. The hash table method
also takes about 1800 ms when N=33. The performance of the hash table is
O(1), so presumably we can extrapolate from that that it's the same for any
N.
I think that means that we should just completely replace the list with
the hash table. The difference with a small N is lost in noise, so there's
no point in keeping the list as a fast path for small N. That'll make the
patch somewhat simpler.
- Heikki
I had thought that maybe the most common workloads only touch 1 or 2
sequences, and though the cost may be small, I had thought there would be
some overhead in both cycles and memory usage in creating a hash table for
these light usages of sequences in a backend. It would certainly make the
patch simpler to remove this, and it would also mean that I could remove
the sometimes-unused ->next member from the SeqTableData struct, which is
just set to NULL now when in hash table mode. If you think it's the way to
go then I can make the change, though maybe I'll hold off on the refactor
for now as it looks like other ideas have come up around the relcache.
Regards
David Rowley
On Fri, Nov 15, 2013 at 3:12 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,
On 2013-11-13 22:55:43 +1300, David Rowley wrote:
Here /messages/by-id/24278.1352922571@sss.pgh.pa.us there
was some talk about init_sequence being a bottleneck when many sequences
are used in a single backend.
The attached I think implements what was talked about in the above link
which for me seems to double the speed of a currval() loop over 30000
sequences. It goes from about 7 seconds to 3.5 on my laptop.
I think it'd be a better idea to integrate the sequence caching logic
into the relcache. There's a comment about it:
* (We can't
* rely on the relcache, since it's only, well, a cache, and may decide to
* discard entries.)
but that's not really accurate anymore. We have the infrastructure for
keeping values across resets and we don't discard entries.
I just want to check this idea against an existing todo item to move
sequences into a single table, as it sounds like this would bind sequences
even more closely to being relations.
The todo item reads:
"Consider placing all sequences in a single table, or create a system view"
This had been on the back of my mind while implementing the hash table
stuff for init_sequence and again when doing my benchmarks where I created
30000 sequences and went through the pain of having a path on my file
system with 30000 8k files.
It sounds like your idea overlaps with this todo a little, so maybe this is
a good time to decide which would be best, though the more I think about
it, the more I think that moving sequences into a single table is a no-go.
So for implementing moving sequences into a single system table:
1. The search_path stuff makes this a bit more complex. It sounds like this
would require some duplication of the search_path logic.
2. There is also the problem with tracking object dependency.
Currently:
create sequence t_a_seq;
create table t (a int not null default nextval('t_a_seq'));
alter sequence t_a_seq owned by t.a;
drop table t;
drop sequence t_a_seq; -- already deleted by drop table t
ERROR: sequence "t_a_seq" does not exist
Moving sequences to a single table sounds like a special case for this
logic.
3. Would moving sequences to a table still have to check that no duplicate
object existed in pg_class?
Currently you can't have a sequence with the same name as a table
create sequence a;
create table a (a int);
ERROR: relation "a" already exists
It's not that I'm trying to shoot holes in moving sequences to a single
table; really I'd like to find a way to reduce the wastefulness of one
file per sequence lying around my file system. But if changing this is a
no-go then it would be better to take it off the todo list, and then we
wouldn't pour as much more concrete into the sequences-as-relations mould.
Regards
David Rowley
On Fri, Nov 15, 2013 at 3:03 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I think that means that we should just completely replace the list with
the hash table. The difference with a small N is lost in noise, so there's
no point in keeping the list as a fast path for small N. That'll make the
patch somewhat simpler.
- Heikki
Attached is a much more simple patch which gets rid of the initial linked
list.
Attachment: hashtab_seq_v0.2.patch (application/octet-stream)
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index f3344c6..0e40e29 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -60,15 +60,10 @@ typedef struct sequence_magic
* session. This is needed to hold onto nextval/currval state. (We can't
* rely on the relcache, since it's only, well, a cache, and may decide to
* discard entries.)
- *
- * XXX We use linear search to find pre-existing SeqTable entries. This is
- * good when only a small number of sequences are touched in a session, but
- * would suck with many different sequences. Perhaps use a hashtable someday.
*/
typedef struct SeqTableData
{
- struct SeqTableData *next; /* link to next SeqTable object */
- Oid relid; /* pg_class OID of this sequence */
+ Oid relid; /* pg_class OID of this sequence, hash key must be first */
Oid filenode; /* last seen relfilenode of this sequence */
LocalTransactionId lxid; /* xact in which we last did a seq op */
bool last_valid; /* do we have a valid "last" value? */
@@ -81,7 +76,7 @@ typedef struct SeqTableData
typedef SeqTableData *SeqTable;
-static SeqTable seqtab = NULL; /* Head of list of SeqTable items */
+static HTAB *seqhashtab = NULL; /* Pointer to sequence hash table */
/*
* last_used_seq is updated by nextval() to point to the last used
@@ -92,6 +87,7 @@ static SeqTableData *last_used_seq = NULL;
static void fill_seq_with_data(Relation rel, HeapTuple tuple);
static int64 nextval_internal(Oid relid);
static Relation open_share_lock(SeqTable seq);
+static void create_seq_hashtable(void);
static void init_sequence(Oid relid, SeqTable *p_elm, Relation *p_rel);
static Form_pg_sequence read_seq_tuple(SeqTable elm, Relation rel,
Buffer *buf, HeapTuple seqtuple);
@@ -998,6 +994,23 @@ open_share_lock(SeqTable seq)
return relation_open(seq->relid, NoLock);
}
+/* creates hash table for storing sequence data */
+static void create_seq_hashtable(void)
+{
+ HASHCTL ctl;
+
+ Assert(seqhashtab == NULL);
+
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(SeqTableData);
+ ctl.hash = oid_hash;
+ ctl.alloc = malloc;
+
+ seqhashtab = hash_create("Sequence table", 1000, &ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_ALLOC);
+}
+
/*
* Given a relation OID, open and lock the sequence. p_elm and p_rel are
* output parameters.
@@ -1007,39 +1020,28 @@ init_sequence(Oid relid, SeqTable *p_elm, Relation *p_rel)
{
SeqTable elm;
Relation seqrel;
+ bool found;
- /* Look to see if we already have a seqtable entry for relation */
- for (elm = seqtab; elm != NULL; elm = elm->next)
- {
- if (elm->relid == relid)
- break;
- }
+ if (seqhashtab == NULL)
+ create_seq_hashtable();
+
+ elm = (SeqTable) hash_search(seqhashtab, &relid, HASH_ENTER, &found);
/*
- * Allocate new seqtable entry if we didn't find one.
- *
- * NOTE: seqtable entries remain in the list for the life of a backend. If
- * the sequence itself is deleted then the entry becomes wasted memory,
- * but it's small enough that this should not matter.
- */
- if (elm == NULL)
+ * Initialize the new hash table entry if it did not exist already.
+ *
+ * NOTE: With the exception of the use of DISCARD SEQUENCES or DISCARD ALL
+ * hash entries are stored for the life of a backend. If the sequence
+ * itself is deleted then the entry becomes wasted memory, but it's small
+ * enough that this should not matter.
+ */
+ if (!found)
{
- /*
- * Time to make a new seqtable entry. These entries live as long as
- * the backend does, so we use plain malloc for them.
- */
- elm = (SeqTable) malloc(sizeof(SeqTableData));
- if (elm == NULL)
- ereport(ERROR,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
elm->relid = relid;
elm->filenode = InvalidOid;
elm->lxid = InvalidLocalTransactionId;
elm->last_valid = false;
elm->last = elm->cached = elm->increment = 0;
- elm->next = seqtab;
- seqtab = elm;
}
/*
@@ -1609,13 +1611,10 @@ seq_redo(XLogRecPtr lsn, XLogRecord *record)
void
ResetSequenceCaches(void)
{
- SeqTableData *next;
-
- while (seqtab != NULL)
+ if (seqhashtab)
{
- next = seqtab->next;
- free(seqtab);
- seqtab = next;
+ hash_destroy(seqhashtab);
+ seqhashtab = NULL;
}
last_used_seq = NULL;
On Fri, Nov 15, 2013 at 3:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@2ndquadrant.com> writes:
I think it'd be a better idea to integrate the sequence caching logic
into the relcache. There's a comment about it:
* (We can't
* rely on the relcache, since it's only, well, a cache, and may decide to
* discard entries.)
but that's not really accurate anymore. We have the infrastructure for
keeping values across resets and we don't discard entries.
We most certainly *do* discard entries, if they're not open when a cache
flush event comes along.
I suppose it'd be possible to mark a relcache entry for a sequence
as locked-in-core, but that doesn't attract me at all. A relcache
entry is a whole lot larger than the amount of state we really need
to keep for a sequence.
One idea is to have a hashtable for the sequence-specific data,
but to add a link field to the relcache entry that points to the
non-flushable sequence hashtable entry. That would save the second
hashtable lookup as long as the relcache entry hadn't been flushed
since last use, while not requiring any violence to the lifespan
semantics of relcache entries. (Actually, if we did that, it might
not even be worth converting the list to a hashtable? Searches would
become a lot less frequent.)
Unless I've misunderstood something, it looks like this would mean giving
heapam.c and relcache.c knowledge of sequences.
Currently relation_open is called from open_share_lock in sequence.c. The
only way I can see to do this would be to add something like
relation_open_sequence() in heapam.c, which means we'd need to invent
RelationIdGetSequenceRelation() and use that instead
of RelationIdGetRelation(), and somewhere along the line have it pass back
the SeqTableData struct which would be tagged onto RelIdCacheEnt.
I think it can be done but I don't think it will look pretty.
Perhaps if there was a more generic way... Would tagging some void
*rd_extra onto the RelationData be a better way, and just have sequence.c
make use of that for storing the SeqTableData?
Also I'm wondering what we'd do with all these pointers when someone does
DISCARD SEQUENCES; would we have to invalidate the relcache, or would it
just be a matter of looping over it, freeing the sequence data and setting
the pointers to NULL?
Regards
David Rowley
On 2013-11-15 14:22:30 +1300, David Rowley wrote:
On Fri, Nov 15, 2013 at 3:12 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,
On 2013-11-13 22:55:43 +1300, David Rowley wrote:
Here /messages/by-id/24278.1352922571@sss.pgh.pa.us there
was some talk about init_sequence being a bottleneck when many sequences
are used in a single backend.
The attached I think implements what was talked about in the above link
which for me seems to double the speed of a currval() loop over 30000
sequences. It goes from about 7 seconds to 3.5 on my laptop.
I think it'd be a better idea to integrate the sequence caching logic
into the relcache. There's a comment about it:
* (We can't
* rely on the relcache, since it's only, well, a cache, and may decide to
* discard entries.)
but that's not really accurate anymore. We have the infrastructure for
keeping values across resets and we don't discard entries.
I just want to check this idea against an existing todo item to move
sequences into a single table, as I think by the sounds of it this binds
sequences being relations even closer together.
This had been on the back of my mind while implementing the hash table
stuff for init_sequence and again when doing my benchmarks where I created
30000 sequences and went through the pain of having a path on my file
system with 30000 8k files.
Well. But in which real-world use cases is that actually the bottleneck?
1. The search_path stuff makes this a bit more complex. It sounds like this
would require some duplication of the search_path logic.
I'd assumed that if we were to do this, the sequences themselves would
still continue to live in pg_class. Just instead of a relfilenode
containing their state it would be stored in an extra table.
2. There is also the problem with tracking object dependency.
Currently:
create sequence t_a_seq;
create table t (a int not null default nextval('t_a_seq'));
alter sequence t_a_seq owned by t.a;
drop table t;
drop sequence t_a_seq; -- already deleted by drop table t
ERROR: sequence "t_a_seq" does not exist
Moving sequences to a single table sounds like a special case for this
logic.
There should already be code handling such dependencies.
4) Scalability problems: The one block sequences use already can be a
major contention issue when you have paralell inserts to the same
table. A workload which, unlike a couple thousand unrelated sequences, I
actually think is more realistic. So we'd need to force 1 sequence tuple
per block, which we currently cannot do.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2013-11-15 19:12:15 +1300, David Rowley wrote:
On Fri, Nov 15, 2013 at 3:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@2ndquadrant.com> writes:
I think it'd be a better idea to integrate the sequence caching logic
into the relcache. There's a comment about it:
* (We can't
* rely on the relcache, since it's only, well, a cache, and may decide to
* discard entries.)
but that's not really accurate anymore. We have the infrastructure for
keeping values across resets and we don't discard entries.
We most certainly *do* discard entries, if they're not open when a cache
flush event comes along.
I suppose it'd be possible to mark a relcache entry for a sequence
as locked-in-core, but that doesn't attract me at all. A relcache
entry is a whole lot larger than the amount of state we really need
to keep for a sequence.
One idea is to have a hashtable for the sequence-specific data,
but to add a link field to the relcache entry that points to the
non-flushable sequence hashtable entry. That would save the second
hashtable lookup as long as the relcache entry hadn't been flushed
since last use, while not requiring any violence to the lifespan
semantics of relcache entries. (Actually, if we did that, it might
not even be worth converting the list to a hashtable? Searches would
become a lot less frequent.)
Unless I've misunderstood something it looks like this would mean giving
heapam.c and relcache.c knowledge of sequences.
Currently relation_open is called from open_share_lock in sequence.c. The
only way I can see to do this would be to add something like
relation_open_sequence() in heapam.c which means we'd need to invent
RelationIdGetSequenceRelation() and use that instead
of RelationIdGetRelation() and somewhere along the line have it pass back
the SeqTableData struct which would be tagged onto RelIdCacheEnt.
Looks like indexes already go through that path, and store data in the
relcache, without requiring heapam.c to know all that much.
I think it can be done but I don't think it will look pretty.
Perhaps if there was a more generic way... Would tagging some void
*rd_extra onto the RelationData be a better way, and just have sequence.c
make use of that for storing the SeqTableData?
I'd rather have it properly typed in ->rd_sequence, maybe in a union
against the index members.
Also I'm wondering what we'd do with all these pointers when someone does
DISCARD SEQUENCES; would we have to invalidate the relcache or would it
just be a matter of looping over it, freeing off the sequence data and
setting the pointers to NULL?
Good question. Have a 'discard_count' member and global variable? That
starts at zero and gets incremented every time somebody discards?
Greetings,
Andres Freund
On 15.11.2013 07:47, David Rowley wrote:
On Fri, Nov 15, 2013 at 3:03 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
I think that means that we should just completely replace the list with
the hash table. The difference with a small N is lost in noise, so there's
no point in keeping the list as a fast path for small N. That'll make the
patch somewhat simpler.
Attached is a much simpler patch which gets rid of the initial linked
list.
Thanks, committed with minor copy-editing. I dialed down the initial
size of the hash table from 1000 to 16, that ought to be enough.
I ran a quick performance test of my own, based on the script you sent.
I modified it a bit to eliminate the PL/pgSQL overhead, making it more
heavily bottlenecked by the nextval/currval overhead. Results:

                       unpatched   patched
nextval, 10000 seqs        36772      2426
currval, 1 seq              1176      1069
currval, 2 seqs              865       857
currval, 4 seqs              742       759
currval, 5 seqs              718       711
currval, 10 seqs             680       668
currval, 100 seqs            871       656
currval, 1000 seqs          3507       700
currval, 10000 seqs        34742      1224
The performance when you touch only a few sequences is unchanged. When
you touch a lot of them, you gain. Just as you would expect.
Attached is the test script I used. After running the test, I realized
that there's a little flaw in the test methodology, but it doesn't
invalidate the results. I used the same backend for all the test runs,
so even when currval() is called repeatedly for a single sequence, the
linked list (or hash table, with the patch) nevertheless contains
entries for all 10000 sequences. However, the sequences actually used by
the test are always in the front of the list, because the nextval()
calls were made in the same order. But with the unpatched version, if
you called currval() on the last-initialized sequence repeatedly,
instead of the first-initialized one, you would get horrible
performance, even when you touch only a single sequence.
Regarding the more grandiose ideas of using the relcache or rewriting
the way sequences are stored altogether: this patch might become
obsolete if we do any of that stuff, but that's ok. The immediate
performance problem has been solved now, but those other ideas might be
worth pursuing for other reasons.
- Heikki
On 2013-11-15 12:31:54 +0200, Heikki Linnakangas wrote:
On 15.11.2013 07:47, David Rowley wrote:
On Fri, Nov 15, 2013 at 3:03 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
I think that means that we should just completely replace the list with
the hash table. The difference with a small N is lost in noise, so there's
no point in keeping the list as a fast path for small N. That'll make the
patch somewhat simpler.
Attached is a much simpler patch which gets rid of the initial linked
list.
Thanks, committed with minor copy-editing. I dialed down the initial size of
the hash table from 1000 to 16, that ought to be enough.
I am slightly confused by the use of HASH_CONTEXT without setting
ctl->hcxt, that works, but seems more like an accident.
Greetings,
Andres Freund
On Fri, Nov 15, 2013 at 11:31 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Thanks, committed with minor copy-editing. I dialed down the initial size
of the hash table from 1000 to 16, that ought to be enough.
Great. Thanks for committing.
Regards
David Rowley
- Heikki
On 15.11.2013 12:44, Andres Freund wrote:
On 2013-11-15 12:31:54 +0200, Heikki Linnakangas wrote:
On 15.11.2013 07:47, David Rowley wrote:
On Fri, Nov 15, 2013 at 3:03 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
I think that means that we should just completely replace the list with
the hash table. The difference with a small N is lost in noise, so there's
no point in keeping the list as a fast path for small N. That'll make the
patch somewhat simpler.
Attached is a much simpler patch which gets rid of the initial linked
list.
Thanks, committed with minor copy-editing. I dialed down the initial size of
the hash table from 1000 to 16, that ought to be enough.
I am slightly confused by the use of HASH_CONTEXT without setting
ctl->hcxt, that works, but seems more like an accident.
Ugh. Accident indeed. Thanks, fixed!
- Heikki