Usage of epoch in txid_current

Started by Amit Kapilaover 8 years ago54 messageshackers
Jump to latest
#1Amit Kapila
amit.kapila16@gmail.com

Hi,

Currently, txid_current and friends export a 64-bit format of
transaction id that is extended with an “epoch” counter so that it
will not wrap around during the life of an installation. The epoch
value it uses is based on the epoch that is maintained by checkpoint
(aka only checkpoint increments it).

Now if epoch changes multiple times between two checkpoints
(practically the chances of this are bleak, but there is a theoretical
possibility), then won't the computation of xids will go wrong?
Basically, it can give the same value of txid after wraparound if the
checkpoint doesn't occur between the two calls to txid_current.

Am I missing something which ensures that epoch gets incremented at or
after wraparound?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#2Alexander Korotkov
aekorotkov@gmail.com
In reply to: Amit Kapila (#1)
Re: Usage of epoch in txid_current

Hi!

On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Currently, txid_current and friends export a 64-bit format of
transaction id that is extended with an “epoch” counter so that it
will not wrap around during the life of an installation. The epoch
value it uses is based on the epoch that is maintained by checkpoint
(aka only checkpoint increments it).

Now if epoch changes multiple times between two checkpoints
(practically the chances of this are bleak, but there is a theoretical
possibility), then won't the computation of xids will go wrong?
Basically, it can give the same value of txid after wraparound if the
checkpoint doesn't occur between the two calls to txid_current.

AFAICS, yes, if epoch changes multiple times between two checkpoints, then
computation will go wrong. And it doesn't look like purely theoretical
possibility for me, because I think I know couple of instances of the edge
of this...

Am I missing something which ensures that epoch gets incremented at or

after wraparound?

I've checked the code, and it doesn't look for me that there is something
like this.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#3Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexander Korotkov (#2)
Re: Usage of epoch in txid_current

On Tue, Dec 5, 2017 at 2:49 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Currently, txid_current and friends export a 64-bit format of
transaction id that is extended with an “epoch” counter so that it
will not wrap around during the life of an installation. The epoch
value it uses is based on the epoch that is maintained by checkpoint
(aka only checkpoint increments it).

Now if epoch changes multiple times between two checkpoints
(practically the chances of this are bleak, but there is a theoretical
possibility), then won't the computation of xids will go wrong?
Basically, it can give the same value of txid after wraparound if the
checkpoint doesn't occur between the two calls to txid_current.

AFAICS, yes, if epoch changes multiple times between two checkpoints, then
computation will go wrong. And it doesn't look like purely theoretical
possibility for me, because I think I know couple of instances of the edge
of this...

Okay, it is quite strange that we haven't discovered this problem till
now. I think we should do something to fix it. One idea is that we
track epoch change in shared memory (probably in the same data
structure (VariableCacheData) where we track nextXid). We need to
increment it when the xid wraparound during xid allocation (in
GetNewTransactionId). Also, we need to make it persistent as which
means we need to log it in checkpoint xlog record and we need to write
a separate xlog record for the epoch change.

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#4Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#3)
Re: Usage of epoch in txid_current

On 2017-12-05 16:21:27 +0530, Amit Kapila wrote:

On Tue, Dec 5, 2017 at 2:49 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Currently, txid_current and friends export a 64-bit format of
transaction id that is extended with an “epoch” counter so that it
will not wrap around during the life of an installation. The epoch
value it uses is based on the epoch that is maintained by checkpoint
(aka only checkpoint increments it).

Now if epoch changes multiple times between two checkpoints
(practically the chances of this are bleak, but there is a theoretical
possibility), then won't the computation of xids will go wrong?
Basically, it can give the same value of txid after wraparound if the
checkpoint doesn't occur between the two calls to txid_current.

AFAICS, yes, if epoch changes multiple times between two checkpoints, then
computation will go wrong. And it doesn't look like purely theoretical
possibility for me, because I think I know couple of instances of the edge
of this...

I think it's not terribly likely principle, due to the required WAL
size. You need at least a commit record for each of 4 billion
transactions. Each commit record is at least 24bytes long, and in a
non-artificial scenario you additionally would have a few hundred bytes
of actual content of WAL. So we're talking about a distance of at least
0.5-2TB within a single checkpoint here. Not impossible, but not likely
either.

Okay, it is quite strange that we haven't discovered this problem till
now. I think we should do something to fix it. One idea is that we
track epoch change in shared memory (probably in the same data
structure (VariableCacheData) where we track nextXid). We need to
increment it when the xid wraparound during xid allocation (in
GetNewTransactionId). Also, we need to make it persistent as which
means we need to log it in checkpoint xlog record and we need to write
a separate xlog record for the epoch change.

I think it makes a fair bit of sense to not do the current crufty
tracking of xid epochs. I don't really how we got there, but it doesn't
make terribly much sense. Don't think we need additional WAL logging
though - we should be able to piggyback this onto the already existing
clog logging.

I kinda wonder if we shouldn't just track nextXid as a 64bit integer
internally, instead of bothering with tracking the epoch
separately. Then we can "just" truncate it in the cases where it's
stored in space constrained places etc.

Greetings,

Andres Freund

#5Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#4)
Re: Usage of epoch in txid_current

Andres,

* Andres Freund (andres@anarazel.de) wrote:

On 2017-12-05 16:21:27 +0530, Amit Kapila wrote:

On Tue, Dec 5, 2017 at 2:49 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Currently, txid_current and friends export a 64-bit format of
transaction id that is extended with an “epoch” counter so that it
will not wrap around during the life of an installation. The epoch
value it uses is based on the epoch that is maintained by checkpoint
(aka only checkpoint increments it).

Now if epoch changes multiple times between two checkpoints
(practically the chances of this are bleak, but there is a theoretical
possibility), then won't the computation of xids will go wrong?
Basically, it can give the same value of txid after wraparound if the
checkpoint doesn't occur between the two calls to txid_current.

AFAICS, yes, if epoch changes multiple times between two checkpoints, then
computation will go wrong. And it doesn't look like purely theoretical
possibility for me, because I think I know couple of instances of the edge
of this...

I think it's not terribly likely principle, due to the required WAL
size. You need at least a commit record for each of 4 billion
transactions. Each commit record is at least 24bytes long, and in a
non-artificial scenario you additionally would have a few hundred bytes
of actual content of WAL. So we're talking about a distance of at least
0.5-2TB within a single checkpoint here. Not impossible, but not likely
either.

At the bottom end, with a 30-minute checkpoint, that's about 300MB/s.
Certainly quite a bit and we might have trouble getting there for other
reasons, but definitely something that can be accomplished with even a
single SSD these days.

Okay, it is quite strange that we haven't discovered this problem till
now. I think we should do something to fix it. One idea is that we
track epoch change in shared memory (probably in the same data
structure (VariableCacheData) where we track nextXid). We need to
increment it when the xid wraparound during xid allocation (in
GetNewTransactionId). Also, we need to make it persistent as which
means we need to log it in checkpoint xlog record and we need to write
a separate xlog record for the epoch change.

I think it makes a fair bit of sense to not do the current crufty
tracking of xid epochs. I don't really how we got there, but it doesn't
make terribly much sense. Don't think we need additional WAL logging
though - we should be able to piggyback this onto the already existing
clog logging.

Don't you mean xact logging? ;)

I kinda wonder if we shouldn't just track nextXid as a 64bit integer
internally, instead of bothering with tracking the epoch
separately. Then we can "just" truncate it in the cases where it's
stored in space constrained places etc.

This sounds reasonable to me, at least, but I've not been in these
depths much.

Thanks!

Stephen

#6Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#5)
Re: Usage of epoch in txid_current

On December 5, 2017 10:01:43 AM PST, Stephen Frost <sfrost@snowman.net> wrote:

Andres,

* Andres Freund (andres@anarazel.de) wrote:

I think it makes a fair bit of sense to not do the current crufty
tracking of xid epochs. I don't really how we got there, but it

doesn't

make terribly much sense. Don't think we need additional WAL logging
though - we should be able to piggyback this onto the already

existing

clog logging.

Don't you mean xact logging? ;)

No. We log a WAL record at clog boundaries. Wraparounds have to be at one. We could just include the 64 bit xid there and would have reliable tracking.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#5)
Re: Usage of epoch in txid_current

Stephen Frost <sfrost@snowman.net> writes:

* Andres Freund (andres@anarazel.de) wrote:

I kinda wonder if we shouldn't just track nextXid as a 64bit integer
internally, instead of bothering with tracking the epoch
separately. Then we can "just" truncate it in the cases where it's
stored in space constrained places etc.

This sounds reasonable to me, at least, but I've not been in these
depths much.

+1 ... I think the reason it's like that is simply that nobody's revisited
the XID generator since we decided to require 64-bit integer support.

We'd need this for support of true 64-bit XIDs, too, though I'm unsure
whether that project is going anywhere anytime soon. In any case it
seems like a separable subset of that work.

regards, tom lane

#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#4)
Re: Usage of epoch in txid_current

On Tue, Dec 5, 2017 at 11:15 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-12-05 16:21:27 +0530, Amit Kapila wrote:

Okay, it is quite strange that we haven't discovered this problem till
now. I think we should do something to fix it. One idea is that we
track epoch change in shared memory (probably in the same data
structure (VariableCacheData) where we track nextXid). We need to
increment it when the xid wraparound during xid allocation (in
GetNewTransactionId). Also, we need to make it persistent as which
means we need to log it in checkpoint xlog record and we need to write
a separate xlog record for the epoch change.

I think it makes a fair bit of sense to not do the current crufty
tracking of xid epochs. I don't really how we got there, but it doesn't
make terribly much sense. Don't think we need additional WAL logging
though - we should be able to piggyback this onto the already existing
clog logging.

I kinda wonder if we shouldn't just track nextXid as a 64bit integer
internally, instead of bothering with tracking the epoch
separately. Then we can "just" truncate it in the cases where it's
stored in space constrained places etc.

We are using ShmemVariableCache->nextXid at many places so always
converting/truncating it to 32-bit number before using seems slightly
awkward, so we can think of using a separate nextBigXid 64bit number
as well. Either way, it is not clear to me how we will keep it
updated after recovery. Right now, the mechanism is quite simple, at
the beginning of a recovery we take the value of nextXid from
checkpoint record and then if any xlog record indicates xid that
follows nextXid, we advance it. Here, the point to note is that we
take the xid from the WAL record (which means that it assumes xids are
non-contiguous or some xids are consumed without being logged) and
increment it. Unless we plan to change something in that logic (like
storing 64-bit xids in WAL records), it is not clear to me how to make
it work. OTOH, recovering value of epoch which increments only at
wraparound seems fairly straightforward as described in my previous
email.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#9Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#8)
Re: Usage of epoch in txid_current

Hi,

On 2017-12-06 17:39:09 +0530, Amit Kapila wrote:

We are using ShmemVariableCache->nextXid at many places so always
converting/truncating it to 32-bit number before using seems slightly
awkward, so we can think of using a separate nextBigXid 64bit number
as well.

-many

It's not actually that many places. And a lot of them would should be
updated anyway, to be epoch aware. Let's not add this kind of crummyness
to avoid a few truncating casts here and there.

Either way, it is not clear to me how we will keep it
updated after recovery. Right now, the mechanism is quite simple, at
the beginning of a recovery we take the value of nextXid from
checkpoint record and then if any xlog record indicates xid that
follows nextXid, we advance it. Here, the point to note is that we
take the xid from the WAL record (which means that it assumes xids are
non-contiguous or some xids are consumed without being logged) and
increment it. Unless we plan to change something in that logic (like
storing 64-bit xids in WAL records), it is not clear to me how to make
it work. OTOH, recovering value of epoch which increments only at
wraparound seems fairly straightforward as described in my previous
email.

I think it should be fairly simple if simply add the 64bit xid to the
existing clog extension WAL records.

Greetings,

Andres Freund

#10Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#9)
Re: Usage of epoch in txid_current

On Wed, Dec 6, 2017 at 11:26 PM, Andres Freund <andres@anarazel.de> wrote:

Either way, it is not clear to me how we will keep it
updated after recovery. Right now, the mechanism is quite simple, at
the beginning of a recovery we take the value of nextXid from
checkpoint record and then if any xlog record indicates xid that
follows nextXid, we advance it. Here, the point to note is that we
take the xid from the WAL record (which means that it assumes xids are
non-contiguous or some xids are consumed without being logged) and
increment it. Unless we plan to change something in that logic (like
storing 64-bit xids in WAL records), it is not clear to me how to make
it work. OTOH, recovering value of epoch which increments only at
wraparound seems fairly straightforward as described in my previous
email.

I think it should be fairly simple if simply add the 64bit xid to the
existing clog extension WAL records.

IIUC, you mean to say that we should log the 64bit xid value in
CLOG_ZEROPAGE record while extending clog and that too we can do only
at wraparound. Now, maybe doing it every time also doesn't hurt, but
I think doing it at wraparound should be sufficient.

Just to be clear, I am not planning to pursue writing a patch for this
at the moment. So, if anybody else is interested or if Andres wants
to write it, I can help in the review.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#11Thomas Munro
thomas.munro@gmail.com
In reply to: Amit Kapila (#10)
Re: Usage of epoch in txid_current

On Fri, Dec 8, 2017 at 3:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 6, 2017 at 11:26 PM, Andres Freund <andres@anarazel.de> wrote:

Either way, it is not clear to me how we will keep it
updated after recovery. Right now, the mechanism is quite simple, at
the beginning of a recovery we take the value of nextXid from
checkpoint record and then if any xlog record indicates xid that
follows nextXid, we advance it. Here, the point to note is that we
take the xid from the WAL record (which means that it assumes xids are
non-contiguous or some xids are consumed without being logged) and
increment it. Unless we plan to change something in that logic (like
storing 64-bit xids in WAL records), it is not clear to me how to make
it work. OTOH, recovering value of epoch which increments only at
wraparound seems fairly straightforward as described in my previous
email.

I think it should be fairly simple if simply add the 64bit xid to the
existing clog extension WAL records.

IIUC, you mean to say that we should log the 64bit xid value in
CLOG_ZEROPAGE record while extending clog and that too we can do only
at wraparound. Now, maybe doing it every time also doesn't hurt, but
I think doing it at wraparound should be sufficient.

Can you please elaborate on why clog redo in particular would need to
use 64 bit xids? Is the 64 bit xid not derivable from the 32 bit xid
in the WAL + the current value of a new 64 bit next xid?

Just to be clear, I am not planning to pursue writing a patch for this
at the moment. So, if anybody else is interested or if Andres wants
to write it, I can help in the review.

I played around with this idea yesterday. Experiment-grade patch
attached. Approach:

1. Introduce a new type BigTransactionId (better names welcome).
2. Change ShmemVariableCache->nextXid to ShmemVariableCache->nextBigXid.
3. Change checkpoints to use nextBigXid too.
4. Change ReadNewTransactionId() to ReadNextBigTransactionId().
5. Remove GetNextXidAndEpoch() as it's now redundant.
6. Everywhere that was reading ShmemVariableCache->nextXid or calling
ReadNewTransactionId() but actually needs an xid now uses an explicit
conversion macro XidFromBigTransactionId().
7. Everywhere that was writing ShmemVariableCache->nextXid but only
has an xid now goes through a new function
AdvanceNextBigTransactionIdPast(xid), to handle recovery (since WAL
records have xids in them as mentioned by Amit).

I think it's probably a good idea to make it very explicit when moving
between big and small transaction IDs, hence the including of the word
'big' in variable and function names and the use of a function-like
macro (rather than implicit conversion, which C doesn't give me a good
way to prevent). Otherwise there is a class of bug that is hidden for
the first 2^32 transactions.

The logic in CreateCheckPoint() that assumed that there could be only
one wraparound per checkpoint is gone (the problem reported in this
thread). Note that AdvanceNextBigTransactionIdPast() contains logic
that infers an epoch, which is in some ways similar to what the
removed code was doing, but in this case I think all callers must have
an xid from the same or next epoch.

I'm probably missing a few details... this is only a first swing at
the problem. It does pass check-world and various xid wraparound
tests I came up with. Clearly big xids could spread to more places
than I show here. Do you see problems with this approach or have
better ideas?

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Track-the-next-xid-using-64-bits.patchapplication/octet-stream; name=0001-Track-the-next-xid-using-64-bits.patchDownload+154-207
#12Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#11)
Re: Usage of epoch in txid_current

On 2018-07-10 11:35:59 +1200, Thomas Munro wrote:

I played around with this idea yesterday. Experiment-grade patch
attached.

Cool!

I think it's probably a good idea to make it very explicit when moving
between big and small transaction IDs, hence the including of the word
'big' in variable and function names and the use of a function-like
macro (rather than implicit conversion, which C doesn't give me a good
way to prevent). Otherwise there is a class of bug that is hidden for
the first 2^32 transactions.

You could have BigTransactionId (maybe renamed to FullTransactionId?) be
a struct type. That'd prevent such issues. Most compilers these days
should be more than good enough to optimize passing around an 8byte
struct by value...

Greetings,

Andres Freund

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#12)
Re: Usage of epoch in txid_current

Andres Freund <andres@anarazel.de> writes:

On 2018-07-10 11:35:59 +1200, Thomas Munro wrote:

I think it's probably a good idea to make it very explicit when moving
between big and small transaction IDs, hence the including of the word
'big' in variable and function names and the use of a function-like
macro (rather than implicit conversion, which C doesn't give me a good
way to prevent). Otherwise there is a class of bug that is hidden for
the first 2^32 transactions.

You could have BigTransactionId (maybe renamed to FullTransactionId?) be
a struct type. That'd prevent such issues. Most compilers these days
should be more than good enough to optimize passing around an 8byte
struct by value...

Or, perhaps, use a struct in assert builds and int64 otherwise?
You could hide the ensuing notational differences in macros.

regards, tom lane

#14Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#13)
Re: Usage of epoch in txid_current

Hi,

On 2018-07-09 19:56:25 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2018-07-10 11:35:59 +1200, Thomas Munro wrote:

I think it's probably a good idea to make it very explicit when moving
between big and small transaction IDs, hence the including of the word
'big' in variable and function names and the use of a function-like
macro (rather than implicit conversion, which C doesn't give me a good
way to prevent). Otherwise there is a class of bug that is hidden for
the first 2^32 transactions.

You could have BigTransactionId (maybe renamed to FullTransactionId?) be
a struct type. That'd prevent such issues. Most compilers these days
should be more than good enough to optimize passing around an 8byte
struct by value...

Or, perhaps, use a struct in assert builds and int64 otherwise?
You could hide the ensuing notational differences in macros.

That should be doable. But I'd like to check if it's necessary
first. Optimizing passing an 8 byte struct shouldn't be hard for
compilers these days - especially when most things dealing with them are
inline functions. If we find that it's not a problem on contemporary
compilers, it might be worthwhile to use a bit more type safety in other
places too.

Here's what gcc does on O1:

#include <stdint.h>

typedef struct foo
{
uint64_t id;
} foo;

extern foo take_foo_struct(foo f, int i);
extern uint64_t take_foo_int(uint64_t id, int i);

foo take_foo_struct(foo f, int i)
{
f.id += i;
return f;
}

uint64_t take_foo_int(uint64_t id, int i)
{
id += i;
return id;
}

results in:

.file "test.c"
.text
.globl take_foo_struct
.type take_foo_struct, @function
take_foo_struct:
.LFB0:
.cfi_startproc
movslq %esi, %rax
addq %rdi, %rax
ret
.cfi_endproc
.LFE0:
.size take_foo_struct, .-take_foo_struct
.globl take_foo_int
.type take_foo_int, @function
take_foo_int:
.LFB1:
.cfi_startproc
movslq %esi, %rax
addq %rdi, %rax
ret
.cfi_endproc
.LFE1:
.size take_foo_int, .-take_foo_int
.ident "GCC: (Debian 7.3.0-24) 7.3.0"
.section .note.GNU-stack,"",@progbits

IOW, exactly the same code generated. Note that the compiler does *not*
see the callsites in this case, i.e. this is platform ABI conformant.

Greetings,

Andres Freund

#15Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#14)
Re: Usage of epoch in txid_current

On 2018-07-09 17:08:34 -0700, Andres Freund wrote:

Hi,

On 2018-07-09 19:56:25 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2018-07-10 11:35:59 +1200, Thomas Munro wrote:

I think it's probably a good idea to make it very explicit when moving
between big and small transaction IDs, hence the including of the word
'big' in variable and function names and the use of a function-like
macro (rather than implicit conversion, which C doesn't give me a good
way to prevent). Otherwise there is a class of bug that is hidden for
the first 2^32 transactions.

You could have BigTransactionId (maybe renamed to FullTransactionId?) be
a struct type. That'd prevent such issues. Most compilers these days
should be more than good enough to optimize passing around an 8byte
struct by value...

Or, perhaps, use a struct in assert builds and int64 otherwise?
You could hide the ensuing notational differences in macros.

That should be doable. But I'd like to check if it's necessary
first. Optimizing passing an 8 byte struct shouldn't be hard for
compilers these days - especially when most things dealing with them are
inline functions. If we find that it's not a problem on contemporary
compilers, it might be worthwhile to use a bit more type safety in other
places too.

Here's what gcc does on O1:

#include <stdint.h>

typedef struct foo
{
uint64_t id;
} foo;

extern foo take_foo_struct(foo f, int i);
extern uint64_t take_foo_int(uint64_t id, int i);

foo take_foo_struct(foo f, int i)
{
f.id += i;
return f;
}

uint64_t take_foo_int(uint64_t id, int i)
{
id += i;
return id;
}

results in:

.file "test.c"
.text
.globl take_foo_struct
.type take_foo_struct, @function
take_foo_struct:
.LFB0:
.cfi_startproc
movslq %esi, %rax
addq %rdi, %rax
ret
.cfi_endproc
.LFE0:
.size take_foo_struct, .-take_foo_struct
.globl take_foo_int
.type take_foo_int, @function
take_foo_int:
.LFB1:
.cfi_startproc
movslq %esi, %rax
addq %rdi, %rax
ret
.cfi_endproc
.LFE1:
.size take_foo_int, .-take_foo_int
.ident "GCC: (Debian 7.3.0-24) 7.3.0"
.section .note.GNU-stack,"",@progbits

IOW, exactly the same code generated. Note that the compiler does *not*
see the callsites in this case, i.e. this is platform ABI conformant.

FWIW, this is required by the x86-64 SYSV ABI. See
https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
3.2.3 Parameter Passing. Aggregates with scalar types up to "two
eightbytes" are passed via registers.

It's also the case for MSVC / windows
https://docs.microsoft.com/en-us/cpp/cpp/argument-passing-and-naming-conventions
https://docs.microsoft.com/en-us/cpp/build/parameter-passing

Small aggregates (8, 16, 32, or 64 bits) are passed in registers.

Greetings,

Andres Freund

#16Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#14)
Re: Usage of epoch in txid_current

On Tue, Jul 10, 2018 at 12:08 PM, Andres Freund <andres@anarazel.de> wrote:

On 2018-07-09 19:56:25 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2018-07-10 11:35:59 +1200, Thomas Munro wrote:

I think it's probably a good idea to make it very explicit when moving
between big and small transaction IDs, hence the including of the word
'big' in variable and function names and the use of a function-like
macro (rather than implicit conversion, which C doesn't give me a good
way to prevent). Otherwise there is a class of bug that is hidden for
the first 2^32 transactions.

You could have BigTransactionId (maybe renamed to FullTransactionId?) be
a struct type. That'd prevent such issues. Most compilers these days
should be more than good enough to optimize passing around an 8byte
struct by value...

Or, perhaps, use a struct in assert builds and int64 otherwise?
You could hide the ensuing notational differences in macros.

That should be doable. But I'd like to check if it's necessary
first. Optimizing passing an 8 byte struct shouldn't be hard for
compilers these days - especially when most things dealing with them are
inline functions. If we find that it's not a problem on contemporary
compilers, it might be worthwhile to use a bit more type safety in other
places too.

...

IOW, exactly the same code generated. Note that the compiler does *not*
see the callsites in this case, i.e. this is platform ABI conformant.

I like it. Here's a version that uses a struct named
FullTransactionId (yeah, that's a better name, thanks), defined in
transam.h because c.h didn't feel right.

Client code lost the ability to use operator < directly. I needed to
use a static inline function as a constructor. I lost the
interchangeability with the wide xids in txid.c, so I provided
U64FromFullTransactionId() (I think that'll be useful for
serialisation over the write too). I don't know what to think about
the encoding or meaning of non-normal xids in this thing.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Track-the-next-xid-using-64-bits-v2.patchapplication/octet-stream; name=0001-Track-the-next-xid-using-64-bits-v2.patchDownload+180-208
#17Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#16)
Re: Usage of epoch in txid_current

On 2018-07-10 13:20:52 +1200, Thomas Munro wrote:

defined in transam.h because c.h didn't feel right.

Yea, that looks better.

Client code lost the ability to use operator < directly. I needed to
use a static inline function as a constructor. I lost the
interchangeability with the wide xids in txid.c, so I provided
U64FromFullTransactionId() (I think that'll be useful for
serialisation over the write too).

Should probably, at a later point, introduce a SQL type for it too.

I don't know what to think about the encoding or meaning of non-normal
xids in this thing.

I'm not following what you mean by this?

diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 4faa21f5aef..fa0847afc81 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -261,7 +261,7 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);
startPage = TransactionIdToPage(oldestActiveXID);
-	endPage = TransactionIdToPage(ShmemVariableCache->nextXid);
+	endPage = TransactionIdToPage(XidFromFullTransactionId(ShmemVariableCache->nextFullXid));

These probably need an intermediate variable.

diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 394843f7e91..13020f54d98 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -73,7 +73,7 @@ GetNewTransactionId(bool isSubXact)

LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

-	xid = ShmemVariableCache->nextXid;
+	xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);

It's a bit over the top, I know, but I'd move the conversion to outside
the lock. Less because it makes a meaningful efficiency difference, and
more because I find it clearer, and easier to reason about if we ever go
to atomically incrementing nextFullXid.

@@ -6700,7 +6698,8 @@ StartupXLOG(void)
wasShutdown ? "true" : "false")));
ereport(DEBUG1,
(errmsg_internal("next transaction ID: %u:%u; next OID: %u",
-							 checkPoint.nextXidEpoch, checkPoint.nextXid,
+							 EpochFromFullTransactionId(checkPoint.nextFullXid),
+							 XidFromFullTransactionId(checkPoint.nextFullXid),
checkPoint.nextOid)));

I don't think we should continue to expose epochs, but rather go to full
xids where appropriate.

diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 895a51f89d5..f6731bfd28d 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -20,6 +20,7 @@

#include <time.h>

+#include "access/transam.h"
#include "access/xlog.h"
#include "access/xlog_internal.h"
#include "catalog/pg_control.h"
@@ -256,8 +257,8 @@ main(int argc, char *argv[])
printf(_("Latest checkpoint's full_page_writes: %s\n"),
ControlFile->checkPointCopy.fullPageWrites ? _("on") : _("off"));
printf(_("Latest checkpoint's NextXID:          %u:%u\n"),
-		   ControlFile->checkPointCopy.nextXidEpoch,
-		   ControlFile->checkPointCopy.nextXid);
+		   EpochFromFullTransactionId(ControlFile->checkPointCopy.nextFullXid),
+		   XidFromFullTransactionId(ControlFile->checkPointCopy.nextFullXid));
printf(_("Latest checkpoint's NextOID:          %u\n"),
ControlFile->checkPointCopy.nextOid);
printf(_("Latest checkpoint's NextMultiXactId:  %u\n"),

See above re exposing epoch.

--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -44,6 +44,32 @@
#define TransactionIdStore(xid, dest)	(*(dest) = (xid))
#define StoreInvalidTransactionId(dest) (*(dest) = InvalidTransactionId)
+#define EpochFromFullTransactionId(x)	((uint32) ((x).value >> 32))
+#define XidFromFullTransactionId(x)		((uint32) (x).value)
+#define U64FromFullTransactionId(x)		((x).value)
+#define FullTransactionIdPrecedes(a, b)	((a).value < (b).value)
+#define FullTransactionIdPrecedesOrEquals(a, b)	((a).value <= (b).value)
+
+/*
+ * A 64 bit value that contains an epoch and a TransactionId.  This is
+ * wrapped in a struct to prevent implicit conversion to/from TransactionId.
+ * Allowing such conversions seems likely to be error-prone.
+ */
+typedef struct FullTransactionId
+{
+    uint64  value;
+} FullTransactionId;
+
+static inline FullTransactionId
+MakeFullTransactionId(uint32 epoch, TransactionId xid)
+{
+	FullTransactionId result;
+
+	result.value = ((uint64) epoch) << 32 | xid;
+
+	return result;
+}
+
/* advance a transaction ID variable, handling wraparound correctly */
#define TransactionIdAdvance(dest)	\
do { \
@@ -52,6 +78,15 @@
(dest) = FirstNormalTransactionId; \
} while(0)
+/* advance a FullTransactionId lvalue, handling wraparound correctly */
+#define FullTransactionIdAdvance(dest) \
+	do { \
+		(dest).value++; \
+		if (XidFromFullTransactionId(dest) < FirstNormalTransactionId) \
+			(dest) = MakeFullTransactionId(EpochFromFullTransactionId(dest), \
+										   FirstNormalTransactionId); \
+	} while (0)

Yikes. Why isn't this an inline function? Because it assigns to dest?

Greetings,

Andres Freund

#18Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#17)
Re: Usage of epoch in txid_current

On Tue, Jul 10, 2018 at 1:30 PM, Andres Freund <andres@anarazel.de> wrote:

On 2018-07-10 13:20:52 +1200, Thomas Munro wrote:

I don't know what to think about the encoding or meaning of non-normal
xids in this thing.

I'm not following what you mean by this?

While defining FullTransactionIdPrecedes() in the image of
TransactionIdPrecedes(), I wondered whether the xid component of a
FullTransactionId would ever be one of the special values below
FirstNormalTransactionId that require special treatment. The question
doesn't actually arise in this code, however, so I ignored that
thought and simply used x < y.

diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 4faa21f5aef..fa0847afc81 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -261,7 +261,7 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);
startPage = TransactionIdToPage(oldestActiveXID);
-     endPage = TransactionIdToPage(ShmemVariableCache->nextXid);
+     endPage = TransactionIdToPage(XidFromFullTransactionId(ShmemVariableCache->nextFullXid));

These probably need an intermediate variable.

Ok, did that in a couple of places.

diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 394843f7e91..13020f54d98 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -73,7 +73,7 @@ GetNewTransactionId(bool isSubXact)

LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

-     xid = ShmemVariableCache->nextXid;
+     xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);

It's a bit over the top, I know, but I'd move the conversion to outside
the lock. Less because it makes a meaningful efficiency difference, and
more because I find it clearer, and easier to reason about if we ever go
to atomically incrementing nextFullXid.

Erm -- I can't read nextFullXid until I have the lock, and then I need
it in a 32 bit format for the code that follows immediately (I don't
currently have xidVacLimit in FullTransactionId format). I'm not sure
what you mean.

@@ -6700,7 +6698,8 @@ StartupXLOG(void)
wasShutdown ? "true" : "false")));
ereport(DEBUG1,
(errmsg_internal("next transaction ID: %u:%u; next OID: %u",
-                                                      checkPoint.nextXidEpoch, checkPoint.nextXid,
+                                                      EpochFromFullTransactionId(checkPoint.nextFullXid),
+                                                      XidFromFullTransactionId(checkPoint.nextFullXid),
checkPoint.nextOid)));

I don't think we should continue to expose epochs, but rather go to full
xids where appropriate.

OK, done.

Hmm, that's going to hurt some eyeballs, because other
fields are shown in 32 bit xid format. Should we also go to hex
everywhere so that we feeble humans can see the epoch component and
compare the xid component by eye?

Here's what I see on my test cluster:

Latest checkpoint's NextXID: 184683724519
Latest checkpoint's oldestXID: 4294901760

In hexadecimal money that'd be (with extra spaces, to line up with a
monospace font):

Latest checkpoint's NextXID: 0000002b0001fee7
Latest checkpoint's oldestXID: ffff0000

I left pg_resetwal's arguments -x and -e alone, but I suppose someone
might want a new switch that does both together...

+/* advance a FullTransactionId lvalue, handling wraparound correctly */
+#define FullTransactionIdAdvance(dest) \
+     do { \
+             (dest).value++; \
+             if (XidFromFullTransactionId(dest) < FirstNormalTransactionId) \
+                     (dest) = MakeFullTransactionId(EpochFromFullTransactionId(dest), \
+                                                                                FirstNormalTransactionId); \
+     } while (0)

Yikes. Why isn't this an inline function? Because it assigns to dest?

Following existing malpractice, and yeah because it requires an lvalue
so can't be changed without a different convention at the call site.
Fixed.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Track-the-next-xid-using-64-bits-v3.patchapplication/octet-stream; name=0001-Track-the-next-xid-using-64-bits-v3.patchDownload+186-213
#19Craig Ringer
craig@2ndquadrant.com
In reply to: Thomas Munro (#11)
Re: Usage of epoch in txid_current

On 10 July 2018 at 07:35, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

I played around with this idea yesterday. Experiment-grade patch
attached. Approach:

1. Introduce a new type BigTransactionId (better names welcome).

txid_current() should be changed to BigTransactionId too. It's currently
bigint.

Users are currently left in a real mess when it comes to pg_stat_activity
xids, etc, which are epoch-unqualified. txid_current() reports an
epoch-qualified xid, but there's no sensible and safe conversion from
txid_current()'s bigint to an epoch-qualified ID. age() takes 'xid',
everything takes 'xid', but is completely oblivious to epoch.

IMO all user-facing xids should be moved to BigTransactionId, with the
'xid' type ceasing to appear in any user facing role.

I think it's probably a good idea to make it very explicit when moving
between big and small transaction IDs, hence the including of the word
'big' in variable and function names and the use of a function-like
macro (rather than implicit conversion, which C doesn't give me a good
way to prevent).

The only way I know of to prevent it is to use a wrapper by-value struct.
I've used this technique in the past where it's critical that implicit
conversions don't occurr, but it sure isn't pretty.

typedef struct BigTransactionId
{
uint64 value;
} BigTransactionId;

instead of

typedef uint64 BigTransactionId;

It's annoying having to get the value member, but it's definitely highly
protective against errors. Pass-by-value isn't a problem, neither is
return-by-value.

BigTransactionId
add_one(BigTransactionId xid)
{
BigTransactionId ret;
ret.value = xid.value + 1;
return ret;
}

Personally I think it's potentially worth the required gymnastics if older
compilers do a sane job of code generation with this.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#20Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#19)
Re: Usage of epoch in txid_current

On 10 July 2018 at 10:32, Craig Ringer <craig@2ndquadrant.com> wrote:

I think it's probably a good idea to make it very explicit when moving
between big and small transaction IDs, hence the including of the word
'big' in variable and function names and the use of a function-like
macro (rather than implicit conversion, which C doesn't give me a good
way to prevent).

The only way I know of to prevent it is to use a wrapper by-value struct.

... and that's what I get for not finishing the thread before replying.

Anyway, please consider the other point re txid_current() and the user
facing xid type.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#21Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#19)
#22Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#21)
#23Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#18)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#14)
#25Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#25)
#27Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#26)
#28Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#27)
#29Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Thomas Munro (#28)
#30Thomas Munro
thomas.munro@gmail.com
In reply to: Dmitry Dolgov (#29)
#31Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#30)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#31)
#33Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#31)
#34Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#28)
#35Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#34)
#36Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#35)
#37Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#36)
#38Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#37)
#39Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#38)
#40Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#38)
#41Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#40)
#42Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#41)
#43Jeff Janes
jeff.janes@gmail.com
In reply to: Thomas Munro (#42)
#44Jeff Janes
jeff.janes@gmail.com
In reply to: Jeff Janes (#43)
#45emmjadea
emmjadea@gmail.com
In reply to: Thomas Munro (#40)
#46Alexander Korotkov
aekorotkov@gmail.com
In reply to: Thomas Munro (#42)
#47Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#46)
#48Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#47)
#49Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alexander Korotkov (#46)
#50Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#48)
#51Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alvaro Herrera (#49)
#52Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#49)
#53Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Korotkov (#46)
#54Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#52)