Option to ensure monotonic timestamps

Started by Brent Kerbyabout 8 years ago8 messageshackers
Jump to latest
#1Brent Kerby
blkerby@gmail.com

Hi, I'm new to Postgres hacking, and I'm interested in the possibility of a
new feature to make it possible to ensure that Postgres-generated
timestamps never decrease even if the system clock may step backwards. My
use case is that I'm implementing a form of temporal tables based on
transaction commit timestamps (as returned by pg_xact_commit_timestamp),
and to ensure the integrity of the system I need to know that the ordering
of the commit timestamps will always be consistent with the order in which
the transactions actually committed. I don't need the timestamps to be
unique; i.e., if transactions occur close together in time, then it's fine
for them to have the same timestamp -- just if the timestamps are different
then they must be in the right order. I would guess there may be other
scenarios where users may want to ensure the timestamps are monotonic, and
in general it would probably be desired for the monotonicity to apply
across all timestamps generated by a given Postgres server, not only the
commit timestamps.

I'm aware of the obvious alternative, which is simply to try to configure
the system clock so that it can't go backwards (e.g., using the option
"stepback 0" for ntpd). However, in virtual environments this could
potentially be difficult to achieve in a reliable way. And in any case,
since in my application the integrity of the data history hinges on the
timestamps being monotonic, I think it makes sense that this be enforceable
on the database level.

What I propose is that we could add a boolean configuration option, say
'ensure_monotonic_timestamps', that enables the following behavior: when
GetCurrentTimestamp is called (
https://doxygen.postgresql.org/backend_2utils_2adt_2timestamp_8c.html#a9822cdf3fd41b15851c0c18ddc80143c),
before it returns it checks if `result` is less than what was returned last
time (if any) that GetCurrentTimestamp was called, and if so it returns the
result from the previous call (after logging a warning), otherwise it
proceeds as normal. In its simplest form, this could be accomplished by
adding a global variable lastGetCurrentTimestamp that stores the result of
the previous call. Since GetCurrentTimestamp appears to be the source of
all of the significant system-generated timestamps, including commit
timestamps, this should produce the behavior I'm looking for.

One tricky thing is to figure out how to make this reliable even in the
situation where the database engine has to be restarted. When we're
starting up and have to initialize lastGetCurrentTimestamp, we need to make
sure to make sure we initialize it to be at least as large as the largest
previous result of GetCurrentTimestamp that made its way into the WAL
before shutdown, i.e., the largest previous result of GetCurrentTimestamp
that has the potential to be written out to tables upon recovery. What's
fuzzy to me is whether this would require writing new data to the WAL
specifically for this, or whether there are already timestamps (e.g., as
part of WAL metadata) that could serve this purpose.

Any thoughts?

- Brent Kerby

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brent Kerby (#1)
Re: Option to ensure monotonic timestamps

Brent Kerby <blkerby@gmail.com> writes:

What I propose is that we could add a boolean configuration option, say
'ensure_monotonic_timestamps', that enables the following behavior: when
GetCurrentTimestamp is called (
https://doxygen.postgresql.org/backend_2utils_2adt_2timestamp_8c.html#a9822cdf3fd41b15851c0c18ddc80143c),
before it returns it checks if `result` is less than what was returned last
time (if any) that GetCurrentTimestamp was called, and if so it returns the
result from the previous call (after logging a warning), otherwise it
proceeds as normal. In its simplest form, this could be accomplished by
adding a global variable lastGetCurrentTimestamp that stores the result of
the previous call.

The "global" variable would actually need to be cluster-wide, ie in shared
memory, which would imply contention and the need for locks. I think the
overhead of this would be mighty high, and the return pretty low.

It's also worth pointing out that if you don't trust the kernel clock,
simply clamping to the last returned value isn't likely to be terribly
satisfactory. What if $idiotsysadmin steps the clock back an hour?
We've had actual problems of that sort, for example with the stats
collector going AWOL for awhile because it thought it'd already written a
sufficiently new stats file. There's now an explicit check for clock-
went-backwards in pgstat_recv_inquiry, which will be broken in that sort
of scenario if you cause GetCurrentTimestamp to do clamping internally.

On the whole, the alternative of solving the problem at the kernel level
seems preferable to me.

regards, tom lane

#3Andres Freund
andres@anarazel.de
In reply to: Brent Kerby (#1)
Re: Option to ensure monotonic timestamps

Hi,

Leaving Tom's concerns aside:

On 2018-02-19 13:42:31 -0700, Brent Kerby wrote:

Hi, I'm new to Postgres hacking, and I'm interested in the possibility of a
new feature to make it possible to ensure that Postgres-generated
timestamps never decrease even if the system clock may step backwards. My
use case is that I'm implementing a form of temporal tables based on
transaction commit timestamps (as returned by pg_xact_commit_timestamp),
and to ensure the integrity of the system I need to know that the ordering
of the commit timestamps will always be consistent with the order in which
the transactions actually committed.

The acquiration of the commit timestamp and the actual visibility of the
commit will not necessarily be sufficient for many things. A backend can
theoretically sleep for an hour between

static TransactionId
RecordTransactionCommit(void)
{
...
SetCurrentTransactionStopTimestamp();
/* here */
XactLogCommitRecord(xactStopTimestamp,
nchildren, children, nrels, rels,
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
InvalidTransactionId /* plain commit */ );
}

static void
CommitTransaction(void)
{
...
/*
* We need to mark our XIDs as committed in pg_xact. This is where we
* durably commit.
*/
latestXid = RecordTransactionCommit();

/* here */

/*
* Let others know about no transaction in progress by me. Note that this
* must be done _before_ releasing locks we hold and _after_
* RecordTransactionCommit.
*/
ProcArrayEndTransaction(MyProc, latestXid);

whether that affects your approach I do not know.

Any thoughts?

Why are you looking to do something timestamp based in the first place?
It's a bit hard to give good advice without further information...

Greetings,

Andres Freund

#4Patrick Krecker
pkrecker@gmail.com
In reply to: Andres Freund (#3)
Re: Option to ensure monotonic timestamps

On Tue, Feb 20, 2018 at 9:51 AM, Andres Freund <andres@anarazel.de> wrote:

Hi,

Leaving Tom's concerns aside:

On 2018-02-19 13:42:31 -0700, Brent Kerby wrote:

Hi, I'm new to Postgres hacking, and I'm interested in the possibility of a
new feature to make it possible to ensure that Postgres-generated
timestamps never decrease even if the system clock may step backwards. My
use case is that I'm implementing a form of temporal tables based on
transaction commit timestamps (as returned by pg_xact_commit_timestamp),
and to ensure the integrity of the system I need to know that the ordering
of the commit timestamps will always be consistent with the order in which
the transactions actually committed.

The acquiration of the commit timestamp and the actual visibility of the
commit will not necessarily be sufficient for many things. A backend can
theoretically sleep for an hour between

static TransactionId
RecordTransactionCommit(void)
{
...
SetCurrentTransactionStopTimestamp();
/* here */
XactLogCommitRecord(xactStopTimestamp,
nchildren, children, nrels, rels,
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
InvalidTransactionId /* plain commit */ );
}

static void
CommitTransaction(void)
{
...
/*
* We need to mark our XIDs as committed in pg_xact. This is where we
* durably commit.
*/
latestXid = RecordTransactionCommit();

/* here */

/*
* Let others know about no transaction in progress by me. Note that this
* must be done _before_ releasing locks we hold and _after_
* RecordTransactionCommit.
*/
ProcArrayEndTransaction(MyProc, latestXid);

whether that affects your approach I do not know.

Any thoughts?

Why are you looking to do something timestamp based in the first place?
It's a bit hard to give good advice without further information...

Greetings,

Andres Freund

Hi Brent --

I haven't heard of temporal tables before, but I guess it's a feature
of SQL Server 2016. It sounds similar to some recent work in progress
to add "AS OF" to SELECT statements:
/messages/by-id/78aadf6b-86d4-21b9-9c2a-51f1efb8a499@postgrespro.ru

Patrick

#5Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#2)
Re: Option to ensure monotonic timestamps

Hi,

On 2018-02-20 12:32:22 -0500, Tom Lane wrote:

The "global" variable would actually need to be cluster-wide, ie in shared
memory, which would imply contention and the need for locks. I think the
overhead of this would be mighty high, and the return pretty low.

I think if we wanted to go for something like this (which I doubt), we'd
have that global variable as an atomic 64bit variable in shmem, and
*only* use it for stuff where the ordering actually matters. I.e. not
for transaction start times etc...

It's also worth pointing out that if you don't trust the kernel clock,
simply clamping to the last returned value isn't likely to be terribly
satisfactory. What if $idiotsysadmin steps the clock back an hour?
We've had actual problems of that sort, for example with the stats
collector going AWOL for awhile because it thought it'd already written a
sufficiently new stats file. There's now an explicit check for clock-
went-backwards in pgstat_recv_inquiry, which will be broken in that sort
of scenario if you cause GetCurrentTimestamp to do clamping internally.

I guess you could hack something together with CLOCK_MONOTONIC or such,
but brrrr.

Greetings,

Andres Freund

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#5)
Re: Option to ensure monotonic timestamps

Andres Freund <andres@anarazel.de> writes:

On 2018-02-20 12:32:22 -0500, Tom Lane wrote:

The "global" variable would actually need to be cluster-wide, ie in shared
memory, which would imply contention and the need for locks. I think the
overhead of this would be mighty high, and the return pretty low.

I think if we wanted to go for something like this (which I doubt), we'd
have that global variable as an atomic 64bit variable in shmem, and
*only* use it for stuff where the ordering actually matters. I.e. not
for transaction start times etc...

Then you've got problems with figuring out which usages "matter" and which
don't, and being sure you don't ever compare timestamps from the two
different sources. Seems mighty fragile to me, and reminiscent of the
replication problems that forced us to drop support for float timestamps.

In any case I'd not much like a system that mostly reported in system
clock time except transaction commit timestamps are on some other
timescale.

But really, the killer point here is your upthread comment that even if
GetCurrentTimestamp were guaranteed to return monotonic time, that would
not guarantee that commit timestamps match physical commit order, which
was the OP's goal. At least not unless we read the clock while holding
WALWriteLock, which I'm pretty sure everyone will say Ain't Happening.

I think your not-very-explicit suggestion that he should work in
commit LSNs, not timestamps at all, is a far superior answer.

regards, tom lane

#7Brent Kerby
blkerby@gmail.com
In reply to: Patrick Krecker (#4)
Re: Option to ensure monotonic timestamps

Right, I'm talking about temporal tables in the sense of the SQL:2011
standard. I know there's a Postgres extension temporal_tables by Vlad
Arkhipov (https://github.com/arkhipov/temporal_tables/) that approximates
this. There's also a way of doing it using only triggers written in
pgplsql, by Paolo Chiodi (
https://www.nearform.com/blog/time-travel-with-postgresql-on-amazon-rds/).
In these solutions, however, as well as in the SQL Server 2016 and many
other implementations, the transaction start time (as opposed to commit
time) is used as the time at which the data is considered to have changed,
which does not ensure consistency of the historical data: for instance, you
can end up with a situation where, when viewed "AS OF" certain time points,
the database will appear to have had non-unique primary keys and broken
foreign key references (e.g., see
https://dba.stackexchange.com/questions/143241/why-do-temporal-tables-log-the-begin-time-of-the-transaction/198204#198204
).

I wasn't aware of that recent work. The "AS OF" syntax seems useful,
although if I understand it correctly it doesn't provide the full power of
the temporal tables. With a full implementation of temporal tables, for
each temporal table there's a corresponding history table that can be
directly accessed by queries, making it possible for instance to see a list
of all changes that have affected rows satisfying certain conditions, or to
see the data "AS OF" not just constant times but "AS OF" some variable time
given by a column in another table that is being joined with (The ability
to do this is important in my application).

I agree with Tom's points and don't think that what I originally proposed
is a very good solution, but it still makes me uncomfortable to trust
blindly in the kernel clock when the integrity of the data hangs in the
balance. How about the following alternative proposal?: Instead of trying
to enforce monotonicity of all Postgres-generated timestamps, we look only
at the commit timestamps, and if at the time that we are about to commit we
detect that a violation occurs, instead of clamping (which I agree is ugly)
we abort the transaction with an error. And this should happen only if a
configuration option, say 'monotonic_commit_timestamp', is enabled. With
this approach, we only need to keep track of the previous commit timestamp,
which is already being done if "track_commit_timestamp" is enabled (which
should probably be a prerequisite for enabling
'monotonic_commit_timestamp'), so that should impose minimal overhead -- no
need for any additional locking or including anything more in the WAL,
right?

On Tue, Feb 20, 2018 at 11:09 AM, Patrick Krecker <pkrecker@gmail.com>
wrote:

Show quoted text

On Tue, Feb 20, 2018 at 9:51 AM, Andres Freund <andres@anarazel.de> wrote:

Hi,

Leaving Tom's concerns aside:

On 2018-02-19 13:42:31 -0700, Brent Kerby wrote:

Hi, I'm new to Postgres hacking, and I'm interested in the possibility

of a

new feature to make it possible to ensure that Postgres-generated
timestamps never decrease even if the system clock may step backwards.

My

use case is that I'm implementing a form of temporal tables based on
transaction commit timestamps (as returned by pg_xact_commit_timestamp),
and to ensure the integrity of the system I need to know that the

ordering

of the commit timestamps will always be consistent with the order in

which

the transactions actually committed.

The acquiration of the commit timestamp and the actual visibility of the
commit will not necessarily be sufficient for many things. A backend can
theoretically sleep for an hour between

static TransactionId
RecordTransactionCommit(void)
{
...
SetCurrentTransactionStopTimestamp();
/* here */
XactLogCommitRecord(xactStopTimestamp,
nchildren,

children, nrels, rels,

nmsgs,

invalMessages,

RelcacheInitFileInval, forceSyncCommit,

MyXactFlags,

InvalidTransactionId /* plain commit */ );

}

static void
CommitTransaction(void)
{
...
/*
* We need to mark our XIDs as committed in pg_xact.

This is where we

* durably commit.
*/
latestXid = RecordTransactionCommit();

/* here */

/*
* Let others know about no transaction in progress by me. Note

that this

* must be done _before_ releasing locks we hold and _after_
* RecordTransactionCommit.
*/
ProcArrayEndTransaction(MyProc, latestXid);

whether that affects your approach I do not know.

Any thoughts?

Why are you looking to do something timestamp based in the first place?
It's a bit hard to give good advice without further information...

Greetings,

Andres Freund

Hi Brent --

I haven't heard of temporal tables before, but I guess it's a feature
of SQL Server 2016. It sounds similar to some recent work in progress
to add "AS OF" to SELECT statements:
/messages/by-id/78aadf6b-86d4-21b9-
9c2a-51f1efb8a499@postgrespro.ru

Patrick

#8Brent Kerby
blkerby@gmail.com
In reply to: Tom Lane (#6)
Re: Option to ensure monotonic timestamps

The issue is that presence of timestamps is fundamental to the
functionality of temporal tables. The users need to.be able to make queries
on temporal tables in terms of timestamps; LSNs won't mean anything to
them. It would be an option to implement the temporal tables using LSNs
under the hood, but then it is still required to construct a monotonic
mapping between LSNs and timestamps in order for it to be usable. The most
natural method for constructing such a mapping would be to use the stored
commit timestamps; if these are monotonic, then that works, but we gain
little by storing the LSNs, since they will just be converted to timestamps
anyway when they're used. But if the timestamps aren't monotonic, then
we're faced with the same problem as before; we could try to patch up the
non-monotonicity in the mapping after-the-fact, using clamping or possibly
some more sophisticated method, but this is inefficient and could get ugly
fast. It would seem preferable to just ensure that the timestamps are
monotonic to begin with. And based on your observations of why we shouldn't
try to enforce this on every application of GetCurrentTimestamp, I think
maybe it would be cleaner to just enforce this on commit timestamps (and
only if enabled by a configuration option, of course), since this is all
that is needed and should be simpler and less expensive. And if a violation
occurs, we can just abort the transaction with an error, rather than
clamping the timestamp. This way we don't end up with an ugly scenario like
you pointed out, where we have one set of timestamps that are clamped
(i.e., for commit timestamps) and others that are not; and also this way,
if the $idiotsysadmin sets the clock back an hour, then we get errors
immediately instead of a whole hour of temporal table history being messed
up.

On Tue, Feb 20, 2018 at 11:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Andres Freund <andres@anarazel.de> writes:

On 2018-02-20 12:32:22 -0500, Tom Lane wrote:

The "global" variable would actually need to be cluster-wide, ie in

shared

memory, which would imply contention and the need for locks. I think

the

overhead of this would be mighty high, and the return pretty low.

I think if we wanted to go for something like this (which I doubt), we'd
have that global variable as an atomic 64bit variable in shmem, and
*only* use it for stuff where the ordering actually matters. I.e. not
for transaction start times etc...

Then you've got problems with figuring out which usages "matter" and which
don't, and being sure you don't ever compare timestamps from the two
different sources. Seems mighty fragile to me, and reminiscent of the
replication problems that forced us to drop support for float timestamps.

In any case I'd not much like a system that mostly reported in system
clock time except transaction commit timestamps are on some other
timescale.

But really, the killer point here is your upthread comment that even if
GetCurrentTimestamp were guaranteed to return monotonic time, that would
not guarantee that commit timestamps match physical commit order, which
was the OP's goal. At least not unless we read the clock while holding
WALWriteLock, which I'm pretty sure everyone will say Ain't Happening.

I think your not-very-explicit suggestion that he should work in
commit LSNs, not timestamps at all, is a far superior answer.

regards, tom lane