TABLESAMPLE patch is really in pretty sad shape

Started by Tom Lane · almost 11 years ago · 45 messages · pgsql-hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

The two contrib modules this patch added are nowhere near fit for public
consumption. They cannot clean up after themselves when dropped:

regression=# create extension tsm_system_rows;
CREATE EXTENSION
regression=# create table big as select i, random() as x from generate_series(1,1000000) i;
SELECT 1000000
regression=# create view v1 as select * from big tablesample system_rows(10);
CREATE VIEW
regression=# drop extension tsm_system_rows;
DROP EXTENSION

The view is still there, but is badly broken:

regression=# select * from v1;
ERROR: cache lookup failed for function 46379

Potentially this is a security issue, since a malicious user could
probably manage to create a Trojan horse function having the now-vacated
OID, whereupon use of the view would invoke that function.

Worse still, the broken pg_tablesample_method row is still there:

regression=# create extension tsm_system_rows;
ERROR: duplicate key value violates unique constraint "pg_tablesample_method_name_index"
DETAIL: Key (tsmname)=(system_rows) already exists.

And even if we fixed that, these modules will not survive a pg_upgrade
cycle, because pg_upgrade has no idea that it would need to create a
pg_tablesample_method row (remember that we do *not* replay the extension
script during binary upgrade). Raw inserts into system catalogs just
aren't a sane thing to do in extensions.

Some of the risks here come from what seems like a fundamentally
wrong decision to copy all of the info about a tablesample method out
of the pg_tablesample_method catalog *at parse time* and store it
permanently in the query parse tree. This makes any sort of "alter
tablesample method" DDL operation impossible in principle, because
any views/rules referencing the method have already copied the data.

On top of that, I find the basic implementation design rather dubious,
because it supposes that the tablesample filtering step must always
come first; the moment you say TABLESAMPLE you can kiss goodbye the
idea that the query will use any indexes. For example:

d2=# create table big as select i, random() as x from generate_series(1,10000000) i;
SELECT 10000000
d2=# create index on big(i);
CREATE INDEX
d2=# analyze big;
ANALYZE
d2=# explain analyze select * from big where i between 100 and 200;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Scan using big_i_idx on big (cost=0.43..10.18 rows=87 width=12) (actual time=0.022..0.088 rows=101 loops=1)
Index Cond: ((i >= 100) AND (i <= 200))
Planning time: 0.495 ms
Execution time: 0.141 ms
(4 rows)

d2=# explain analyze select * from big tablesample bernoulli(10) where i between 100 and 200;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Sample Scan (bernoulli) on big (cost=0.00..54055.13 rows=9 width=12) (actual time=0.028..970.051 rows=13 loops=1)
Filter: ((i >= 100) AND (i <= 200))
Rows Removed by Filter: 999066
Planning time: 0.182 ms
Execution time: 970.093 ms
(5 rows)

Now, maybe I don't understand the use-case for this feature, but I should
think it's meant for dealing with tables that are so big that you can't
afford to scan all the data. So, OK, a samplescan is hopefully cheaper
than a pure seqscan, but that doesn't mean that fetching 999079 rows and
discarding 999066 of them is a good plan design. There needs to be an
operational mode whereby we can use an index and do random sampling of
the TIDs the index returns. I do not insist that that has to appear in
version one of the feature --- but I am troubled by the fact that, by
exposing an oversimplified API for use by external modules, this patch is
doubling down on the assumption that no such operational mode will ever
need to be implemented.

There are a whole lot of lesser sins, such as documentation that was
clearly never copy-edited by a native speaker of English, badly designed
planner APIs (Paths really ought not have rowcounts different from the
underlying RelOptInfo), potential core dumps due to dereferencing values
that could be null (the guards for null values are in the wrong places
entirely), etc etc.

While there's nothing here that couldn't be fixed by nuking the contrib
modules and putting a week or two of concentrated work into fixing the
core code, I for one certainly don't have time to put that kind of effort
into TABLESAMPLE right now. Nor do I really have the interest; I find
this feature of pretty dubious value.

What are we going to do about this?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: TABLESAMPLE patch is really in pretty sad shape

I wrote:

> The two contrib modules this patch added are nowhere near fit for public
> consumption. They cannot clean up after themselves when dropped:
> ...
> Raw inserts into system catalogs just
> aren't a sane thing to do in extensions.

I had some thoughts about how we might fix that, without going to the
rather tedious lengths of creating a complete set of DDL infrastructure
for CREATE/DROP TABLESAMPLE METHOD.

First off, the extension API designed for the tablesample patch is
evidently modeled on the index AM API, which was not a particularly good
precedent --- it's not a coincidence that index AMs can't be added or
dropped on-the-fly. Modeling a server internal API as a set of
SQL-visible functions is problematic when the call signatures of those
functions can't be accurately described by SQL datatypes, and it's rather
pointless and inefficient when none of the functions in question is meant
to be SQL-callable. It's even less attractive if you don't think you've
got a completely stable API spec, because adding, dropping, or changing
signature of any one of the API functions then involves a pile of
easy-to-mess-up catalog changes on top of the actually useful work.
Not to mention then having to think about backwards compatibility of
your CREATE command's arguments.

We have a far better model to follow already, namely the foreign data
wrapper API. In that, there's a single SQL-visible function that returns
a dummy datatype indicating that it's an FDW handler, and when called,
it hands back a struct containing pointers to all the other functions
that the particular wrapper needs to supply (and, if necessary, the struct
could have non-function-pointer fields containing other info the core
system might need to know about the handler). We could similarly invent a
pseudotype "tablesample_handler" and represent each tablesample method by
a single SQL function returning tablesample_handler. Any future changes
in the API for tablesample handlers would then appear as changes in the C
definition of the struct returned by the handler, which requires no
SQL-visible thrashing, hence creates no headaches for pg_upgrade, and it
is pretty easy to design it so that failure to update an external module
that implements the API results in C compiler errors and/or sane fallback
behavior.

Once we've done that, I think we don't even need a special catalog
representing tablesample methods. Given "FROM foo TABLESAMPLE
bernoulli(...)", the parser could just look for a function bernoulli()
returning tablesample_handler, and it's done. The querytree would have an
ordinary function dependency on that function, which would be sufficient
to handle DROP dependency behaviors properly. (On reflection, maybe
better if it's "bernoulli(internal) returns tablesample_handler",
so as to guarantee that such a function doesn't create a conflict with
any user-defined function of the same name.)

Thoughts?

regards, tom lane

PS: now that I've written this rant, I wonder why we don't redesign the
index AM API along the same lines. It probably doesn't matter much at
the moment, but if we ever get serious about supporting index AM
extensions, I think we ought to consider doing that.


#3 Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#2)
Re: TABLESAMPLE patch is really in pretty sad shape

On Mon, Jul 13, 2015 at 7:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I wrote:
>
>> The two contrib modules this patch added are nowhere near fit for public
>> consumption. They cannot clean up after themselves when dropped:
>> ...
>> Raw inserts into system catalogs just
>> aren't a sane thing to do in extensions.
>
> I had some thoughts about how we might fix that, without going to the
> rather tedious lengths of creating a complete set of DDL infrastructure
> for CREATE/DROP TABLESAMPLE METHOD.
>
> First off, the extension API designed for the tablesample patch is
> evidently modeled on the index AM API, which was not a particularly good
> precedent --- it's not a coincidence that index AMs can't be added or
> dropped on-the-fly. Modeling a server internal API as a set of
> SQL-visible functions is problematic when the call signatures of those
> functions can't be accurately described by SQL datatypes, and it's rather
> pointless and inefficient when none of the functions in question is meant
> to be SQL-callable. It's even less attractive if you don't think you've
> got a completely stable API spec, because adding, dropping, or changing
> signature of any one of the API functions then involves a pile of
> easy-to-mess-up catalog changes on top of the actually useful work.
> Not to mention then having to think about backwards compatibility of
> your CREATE command's arguments.
>
> We have a far better model to follow already, namely the foreign data
> wrapper API. In that, there's a single SQL-visible function that returns
> a dummy datatype indicating that it's an FDW handler, and when called,
> it hands back a struct containing pointers to all the other functions
> that the particular wrapper needs to supply (and, if necessary, the struct
> could have non-function-pointer fields containing other info the core
> system might need to know about the handler). We could similarly invent a
> pseudotype "tablesample_handler" and represent each tablesample method by
> a single SQL function returning tablesample_handler. Any future changes
> in the API for tablesample handlers would then appear as changes in the C
> definition of the struct returned by the handler, which requires no
> SQL-visible thrashing, hence creates no headaches for pg_upgrade, and it
> is pretty easy to design it so that failure to update an external module
> that implements the API results in C compiler errors and/or sane fallback
> behavior.

That's elegant.

> Once we've done that, I think we don't even need a special catalog
> representing tablesample methods. Given "FROM foo TABLESAMPLE
> bernoulli(...)", the parser could just look for a function bernoulli()
> returning tablesample_handler, and it's done. The querytree would have an
> ordinary function dependency on that function, which would be sufficient
> to handle DROP dependency behaviors properly. (On reflection, maybe
> better if it's "bernoulli(internal) returns tablesample_handler",
> so as to guarantee that such a function doesn't create a conflict with
> any user-defined function of the same name.)
>
> Thoughts?

Regarding the fact that those two contrib modules can be part of a
-contrib package and could be installed, nuking those two extensions
from the tree and preventing the creation of custom tablesample
methods looks like the correct course of action for 9.5.

When looking at this patch I was also surprised that the BERNOULLI
method can only be applied to a physical relation and is not able to
fire on a materialized set of tuples, say the result of a WITH clause
for example.

> PS: now that I've written this rant, I wonder why we don't redesign the
> index AM API along the same lines. It probably doesn't matter much at
> the moment, but if we ever get serious about supporting index AM
> extensions, I think we ought to consider doing that.

Any API redesign looks to be clearly 9.6 work IMO at this point.
--
Michael


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#3)
Re: TABLESAMPLE patch is really in pretty sad shape

Michael Paquier <michael.paquier@gmail.com> writes:

> Regarding the fact that those two contrib modules can be part of a
> -contrib package and could be installed, nuking those two extensions
> from the tree and preventing the creation of custom tablesample
> methods looks like the correct course of action for 9.5.

TBH, I think the right thing to do at this point is to revert the entire
patch and send it back for ground-up rework. I think the high-level
design is wrong in many ways and I have about zero confidence in most
of the code details as well.

I'll send a separate message about high-level issues, but as far as code
details go, I started to do some detailed code review last night and only
got through contrib/tsm_system_rows/tsm_system_rows.c before deciding it
was hopeless. Let's have a look at my notes:

* Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group

Obsolete copyright date.

* IDENTIFICATION
* contrib/tsm_system_rows_rowlimit/tsm_system_rows.c

Wrong filename. (For the moment, I'll refrain from any value judgements
about the overall adequacy or quality of the comments in this patch, and
just point out obvious errors that should have been caught in review.)

typedef struct
{
SamplerRandomState randstate;
uint32 seed; /* random seed */
BlockNumber nblocks; /* number of block in relation */
int32 ntuples; /* number of tuples to return */
int32 donetuples; /* tuples already returned */
OffsetNumber lt; /* last tuple returned from current block */
BlockNumber step; /* step size */
BlockNumber lb; /* last block visited */
BlockNumber doneblocks; /* number of already returned blocks */
} SystemSamplerData;

This same typedef name is defined in three different places in the patch
(tablesample/system.c, tsm_system_rows.c, tsm_system_time.c). While that
might not amount to a bug, it's sure a recipe for confusion, especially
since the struct definitions are all different.

Datum
tsm_system_rows_init(PG_FUNCTION_ARGS)
{
TableSampleDesc *tsdesc = (TableSampleDesc *) PG_GETARG_POINTER(0);
uint32 seed = PG_GETARG_UINT32(1);
int32 ntuples = PG_ARGISNULL(2) ? -1 : PG_GETARG_INT32(2);

This is rather curious coding. Why should this function check only
one of its arguments for nullness? If it needs to defend against
any of them being null, it really needs to check all. But in point of
fact, this function is declared STRICT, which means it's a violation of
the function call protocol if the core code ever passes a null to it,
and so this test ought to be dead code.

A similar pattern of ARGISNULL checks in declared-strict functions exists
in all the tablesample modules, not just this one, showing that this is an
overall design error not just a thinko here. My inclination would be to
make the core code enforce non-nullness of all tablesample arguments so as
to make it unnecessary to check strictness of the tsm*init functions, but
in any case it is Not Okay to just pass nulls willy-nilly to strict C
functions.

Also, I find this coding pretty sloppy even without the strictness angle,
because the net effect of this way of dealing with nulls is that an
argument-must-not-be-null complaint is reported as "out of range",
which is both confusing and the wrong ERRCODE.

if (ntuples < 1)
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
errmsg("invalid sample size"),
errhint("Sample size must be positive integer value.")));

I don't find this to be good error message style. The secondary comment
is not a "hint", it's an ironclad statement of what you did wrong, so if
we wanted to phrase it like this it should be an errdetail not errhint.
But the whole thing is overly cute anyway because there is no reason at
all not to just say what we mean in the primary error message, eg
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
errmsg("sample size must be greater than zero")));

While we're on the subject, what's the reason for disallowing a sample
size of zero? Seems like a reasonable edge case.

/* All blocks have been read, we're done */
if (sampler->doneblocks > sampler->nblocks ||
sampler->donetuples >= sampler->ntuples)
PG_RETURN_UINT32(InvalidBlockNumber);

Okay, I lied, I *am* going to complain about this comment. Comments that
do not accurately describe the code they're attached to are worse than
useless.

/*
* Cleanup method.
*/
Datum
tsm_system_rows_end(PG_FUNCTION_ARGS)
{
TableSampleDesc *tsdesc = (TableSampleDesc *) PG_GETARG_POINTER(0);

pfree(tsdesc->tsmdata);

PG_RETURN_VOID();
}

This cleanup method is a waste of code space. There is no need to
pfree individual allocations at query execution end.

limitnode = linitial(args);
limitnode = estimate_expression_value(root, limitnode);

if (IsA(limitnode, RelabelType))
limitnode = (Node *) ((RelabelType *) limitnode)->arg;

if (IsA(limitnode, Const))
ntuples = DatumGetInt32(((Const *) limitnode)->constvalue);
else
{
/* Default ntuples if the estimation didn't return Const. */
ntuples = 1000;
}

That RelabelType check is a waste of code space (estimate_expression_value
would have collapsed RelabelType-atop-Const into just a Const). On the
other hand, the failure to check for constisnull is a bug, and the failure
to sanity-check the value (ie clamp zero or negative or
larger-than-table-size to a valid rowcount estimate) is another bug, one
that could easily cause a planner failure.

static uint32
random_relative_prime(uint32 n, SamplerRandomState randstate)
{
/* Pick random starting number, with some limits on what it can be. */
uint32 r = (uint32) sampler_random_fract(randstate) * n / 2 + n / 4,
t;

r is not terribly random; in fact, it's always exactly n/4, because
careless parenthesization here causes the float8 result of
sampler_random_fract() to be truncated to zero immediately on return.
In any case, what's the rationale for not letting r cover the whole
range from 1 to n-1?

/*
* This should only take 2 or 3 iterations as the probability of 2 numbers
* being relatively prime is ~61%.
*/
while ((t = gcd(r, n)) > 1)
{
CHECK_FOR_INTERRUPTS();
r /= t;
}

Actually, that's an infinite loop if r is initially zero, which will
always happen (given the previous bug) if n is initially 2 or 3.
Also, because this coding decreases r whenever it's not immediately
relatively-prime to n, I don't think that what we're getting is especially
"random"; it's certainly not going to be uniformly distributed, but will
have a bias towards smaller values. Perhaps a better technique would be
to select an all-new random r each time the gcd test fails. In short,
I'd do something more like

uint32 r;

/* Safety check to avoid infinite loop or zero result for small n. */
if (n <= 1)
return 1;

/*
* This should only take 2 or 3 iterations as the probability of 2 numbers
* being relatively prime is ~61%; but just in case, we'll include a
* CHECK_FOR_INTERRUPTS in the loop.
*/
do {
CHECK_FOR_INTERRUPTS();
r = (uint32) (sampler_random_fract(randstate) * n);
} while (r == 0 || gcd(r, n) > 1);

Note however that this coding would result in an unpredictable number
of iterations of the RNG, which might not be such a good thing if we're
trying to achieve repeatability. It doesn't matter in the context of
this module since the RNG is not used after initialization, but it would
matter if we then went on to do Bernoulli-style sampling. (Possibly that
could be dodged by reinitializing the RNG after the initialization steps.)

As I said, I haven't specifically tried to read the tablesample patch
other than this one contrib file. However, the bits of it that I've come
across while doing other work have almost invariably made me squint and
think "that needs to be rewritten". I think there is every reason to
assume that all the rest of it is about as buggy as this file is. If it's
to stay, it *must* get a line-by-line review from some committer-level
person; and I think there are other more important things for us to be
doing for 9.5.

regards, tom lane


#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
Re: TABLESAMPLE patch is really in pretty sad shape

I wrote:

> TBH, I think the right thing to do at this point is to revert the entire
> patch and send it back for ground-up rework. I think the high-level
> design is wrong in many ways and I have about zero confidence in most
> of the code details as well.
>
> I'll send a separate message about high-level issues,

And here's that. I do *not* claim that this is a complete list of design
issues with the patch, it's just things I happened to notice in the amount
of time I've spent so far (which is already way more than I wanted to
spend on TABLESAMPLE right now).

I'm not sure that we need an extensible sampling-method capability at all,
let alone that an API for that needs to be the primary focus of a patch.
Certainly the offered examples of extension modules aren't inspiring:
tsm_system_time is broken by design (more about that below) and nobody
would miss tsm_system_rows if it weren't there. What is far more
important is to get sampling capability into indexscans, and designing
an API ahead of time seems like mostly a distraction from that.

I'd think seriously about tossing the entire executor-stage part of the
patch, creating a stateless (history-independent) sampling filter that
just accepts or rejects TIDs, and sticking calls to that into all the
table scan node types. Once you had something like that working well
it might be worth considering whether to try to expose an API to
generalize it. But even then it's not clear that we really need any
more options than true-Bernoulli and block-level sampling.

The IBM paper I linked to in the other thread mentions that their
optimizer will sometimes choose to do Bernoulli sampling even if SYSTEM
was requested.  Probably that happens when it decides to do a simple
indexscan, because then there's no advantage to trying to cluster the
sampled rows.  But in the case of a bitmap scan, you could very easily do
either true Bernoulli or block-level sampling simply by adjusting the
rules about which bits you keep or clear in the bitmap (given that you
apply the filter between collection of the index bitmap and accessing the
heap, which seems natural).  The only case where a special scan type
really seems to be needed is if you want block-level sampling, the query
would otherwise use a seqscan, *and* the sampling percentage is pretty low
--- if you'd be reading every second or third block anyway, you're likely
better off with a plain seqscan so that the kernel sees sequential access
and does prefetching.  The current API doesn't seem to make it possible to
switch between seqscan and read-only-selected-blocks based on the sampling
percentage, but I think that could be an interesting optimization.
(Another bet that's been missed is having the samplescan logic request
prefetching when it is doing selective block reads.  The current API can't
support that AFAICS, since there's no expectation that nextblock calls
could be done asynchronously from nexttuple calls.)

Another issue with the API as designed is the examinetuple() callback.
Allowing sample methods to see invisible tuples is quite possibly the
worst idea in the whole patch. They can *not* do anything with such
tuples, or they'll totally break reproducibility: if the tuple is
invisible to your scan, it might well be or soon become invisible to
everybody, whereupon it would be subject to garbage collection at the
drop of a hat. So if an invisible tuple affects the sample method's
behavior at all, repeated scans in the same query would not produce
identical results, which as mentioned before is required both by spec
and for minimally sane query behavior. Moreover, examining the contents
of the tuple is unsafe (it might contain pointers to TOAST values that
no longer exist); and even if it were safe, what's the point? Sampling
that pays attention to the data is the very definition of biased. So
if we do re-introduce an API like the current one, I'd definitely lose
this bit and only allow sample methods to consider visible tuples.

On the point of reproducibility: the tsm_system_time method is utterly
incapable of producing identical results across repeated scans, because
there is no reason to believe it would get exactly as far in the same
amount of time each time. This might be all right across queries if
the method could refuse to work with REPEATABLE clauses (but there's
no provision for such a restriction in the current API). But it's not
acceptable within a query. Another problem with tsm_system_time is that
its cost/rowcount estimation is based on nothing more than wishful
thinking, and can never be based on anything more than wishful thinking,
because planner cost units are not time-based. Adding a comment that
says we'll nonetheless pretend they are milliseconds isn't a solution.
So that sampling method really has to go away and never come back,
whatever we might ultimately salvage from this patch otherwise.

(I'm not exactly convinced that the system or tsm_system_rows methods
are adequately reproducible either, given that their sampling pattern
will change when the relation block count changes. Perhaps that's
unavoidable, but it seems like it might be possible to define things
in such a way that adding blocks doesn't change which existing blocks
get sampled.)

A more localized issue that I noticed is that nowhere is it documented
what the REPEATABLE parameter value is. Digging in the code eventually
reveals that the value is assignment-coerced to an int4, which I find
rather problematic because a user might reasonably assume that the
parameter works like setseed's parameter (float8 in the range -1 to 1).
If he does then he'll get no errors and identical samples from say
REPEATABLE(0.1) and REPEATABLE(0.2), which is bad. On the other hand,
it looks like DB2 allows integer values, so implementing it just like
setseed might cause problems for people coming from DB2. I'm inclined to
suggest that we should define the parameter as being any float8 value,
and obtain a seed from it with hashfloat8(). That way, no matter whether
users think that usable values are fractional or integral, they'll get
sane behavior with different supplied seeds almost always producing
different samples.

regards, tom lane


#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#1)
Re: TABLESAMPLE patch is really in pretty sad shape

On 11 July 2015 at 21:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> What are we going to do about this?

I will address the points you raise, one by one.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#5)
Re: TABLESAMPLE patch is really in pretty sad shape

On 13 July 2015 at 17:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I wrote:
>
>> TBH, I think the right thing to do at this point is to revert the entire
>> patch and send it back for ground-up rework. I think the high-level
>> design is wrong in many ways and I have about zero confidence in most
>> of the code details as well.
>>
>> I'll send a separate message about high-level issues,
>
> And here's that. I do *not* claim that this is a complete list of design
> issues with the patch, it's just things I happened to notice in the amount
> of time I've spent so far (which is already way more than I wanted to
> spend on TABLESAMPLE right now).

Interesting that a time-based sample does indeed yield useful results. Good
to know.

That is exactly what this patch provides: a time-based sample, with reduced
confidence in the accuracy of the result. And other things too.

> I'm not sure that we need an extensible sampling-method capability at all,
> let alone that an API for that needs to be the primary focus of a patch.
> Certainly the offered examples of extension modules aren't inspiring:
> tsm_system_time is broken by design (more about that below) and nobody
> would miss tsm_system_rows if it weren't there.

Time based sampling isn't broken by design. It works exactly according to
the design.

Time-based sampling has been specifically requested by data visualization
developers who are trying to work out how to display anything useful from a
database beyond a certain size. The feature for PostgreSQL implements a
similar mechanism to that deployed already with BlinkDB, demonstrated at
VLDB 2012. I regard it as an Advanced feature worthy of deployment within
PostgreSQL, based upon user request.

Row based sampling offers the capability to retrieve a sample of a fixed
size however big the data set. I shouldn't need to point out that this is
already in use within the ANALYZE command. Since it implements a technique
already in use within PostgreSQL, I see no reason not to expose that to
users. As I'm sure you're aware, there is rigorous math backing up the use
of fixed size sampling, with recent developments from my research
colleagues at Manchester University that further emphasise the usefulness
of this feature for us.

> What is far more
> important is to get sampling capability into indexscans, and designing
> an API ahead of time seems like mostly a distraction from that.

This has come up as a blocker on TABLESAMPLE before. I've got no evidence
to show anyone cares about that. I can't imagine a use for it myself; it
was not omitted from this patch because it's difficult, it was omitted
because it's just useless. If anyone ever cares, they can add it in a later
release.

> I'd think seriously about tossing the entire executor-stage part of the
> patch, creating a stateless (history-independent) sampling filter that
> just accepts or rejects TIDs, and sticking calls to that into all the
> table scan node types. Once you had something like that working well
> it might be worth considering whether to try to expose an API to
> generalize it. But even then it's not clear that we really need any
> more options than true-Bernoulli and block-level sampling.

See above.

> The IBM paper I linked to in the other thread mentions that their
> optimizer will sometimes choose to do Bernoulli sampling even if SYSTEM
> was requested.

That sounds broken to me. This patch gives the user what the user asks
for: BERNOULLI or SYSTEM.

If you particularly like the idea of mixing the two sampling methods, you
can do so *because* the sampling method API is extensible for the user. So
Mr. DB2 can get it his way if he likes and he can call it SYSNOULLI if he
likes; others can get it the way they like also. No argument required.

It was very, very obvious that whatever sampling code anybody dreamt up
would be objected to. Clearly, we need a way to allow people to implement
whatever technique they wish. Stratified sampling, modified sampling, new
techniques. Whatever.

> Probably that happens when it decides to do a simple
> indexscan, because then there's no advantage to trying to cluster the
> sampled rows.  But in the case of a bitmap scan, you could very easily do
> either true Bernoulli or block-level sampling simply by adjusting the
> rules about which bits you keep or clear in the bitmap (given that you
> apply the filter between collection of the index bitmap and accessing the
> heap, which seems natural).  The only case where a special scan type
> really seems to be needed is if you want block-level sampling, the query
> would otherwise use a seqscan, *and* the sampling percentage is pretty low
> --- if you'd be reading every second or third block anyway, you're likely
> better off with a plain seqscan so that the kernel sees sequential access
> and does prefetching.  The current API doesn't seem to make it possible to
> switch between seqscan and read-only-selected-blocks based on the sampling
> percentage, but I think that could be an interesting optimization.
> (Another bet that's been missed is having the samplescan logic request
> prefetching when it is doing selective block reads.  The current API can't
> support that AFAICS, since there's no expectation that nextblock calls
> could be done asynchronously from nexttuple calls.)

Again, if you like it that way then write a plugin that way. That's why it's
extensible.

Another issue with the API as designed is the examinetuple() callback.
Allowing sample methods to see invisible tuples is quite possibly the
worst idea in the whole patch. They can *not* do anything with such
tuples, or they'll totally break reproducibility: if the tuple is
invisible to your scan, it might well be or soon become invisible to
everybody, whereupon it would be subject to garbage collection at the
drop of a hat. So if an invisible tuple affects the sample method's
behavior at all, repeated scans in the same query would not produce
identical results, which as mentioned before is required both by spec
and for minimally sane query behavior. Moreover, examining the contents
of the tuple is unsafe (it might contain pointers to TOAST values that
no longer exist); and even if it were safe, what's the point? Sampling
that pays attention to the data is the very definition of biased. So
if we do re-introduce an API like the current one, I'd definitely lose
this bit and only allow sample methods to consider visible tuples.

That looks like it might be a valid technical point. I'll think more on
that.

On the point of reproducibility: the tsm_system_time method is utterly
incapable of producing identical results across repeated scans, because
there is no reason to believe it would get exactly as far in the same
amount of time each time. This might be all right across queries if
the method could refuse to work with REPEATABLE clauses (but there's
no provision for such a restriction in the current API). But it's not
acceptable within a query. Another problem with tsm_system_time is that
its cost/rowcount estimation is based on nothing more than wishful
thinking, and can never be based on anything more than wishful thinking,
because planner cost units are not time-based. Adding a comment that
says we'll nonetheless pretend they are milliseconds isn't a solution.
So that sampling method really has to go away and never come back,
whatever we might ultimately salvage from this patch otherwise.

REPEATABLE clearly provides its meaning within the physical limits of the
approach chosen.

A time-based sample never could produce an immutable result. No further
discussion needed.

Perhaps we should document that the use of REPEATABLE is at best STABLE,
since the physical location of rows could have changed within the table
between executions. So really we should be documenting the fairly hazy
meaning in all cases. If you want a truly repeatable sample, save the
result into another table.

(I'm not exactly convinced that the system or tsm_system_rows methods
are adequately reproducible either, given that their sampling pattern
will change when the relation block count changes. Perhaps that's
unavoidable, but it seems like it might be possible to define things
in such a way that adding blocks doesn't change which existing blocks
get sampled.)
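One way to get that stability property (a hypothetical sketch, not anything in the patch): decide each block independently from a hash of (seed, blockno), so growing the relation adds new candidate blocks without disturbing the decisions already made for existing blocks:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: per-block sampling decisions that are stable
 * under relation growth.  Each block is accepted or rejected purely as
 * a function of (seed, blockno); the relation's total block count never
 * enters the decision, so appending blocks cannot change which of the
 * existing blocks are sampled.
 */
static int
block_is_sampled(uint32_t seed, uint32_t blockno, double fraction)
{
    /* Knuth multiplicative hash; good enough for illustration. */
    uint32_t h = (blockno ^ seed) * 2654435761u;

    return (double) h / 4294967296.0 < fraction;  /* divide by 2^32 */
}
```

The trade-off is that the *number* of sampled blocks is then only approximately `fraction * nblocks` rather than exact, which may or may not be acceptable for a SYSTEM-style method.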

A more localized issue that I noticed is that nowhere is it documented
what the REPEATABLE parameter value is.

So, we have a minor flaw in the docs. Thanks for the report.

Digging in the code eventually
reveals that the value is assignment-coerced to an int4, which I find
rather problematic because a user might reasonably assume that the
parameter works like setseed's parameter (float8 in the range -1 to 1).
If he does then he'll get no errors and identical samples from say
REPEATABLE(0.1) and REPEATABLE(0.2), which is bad. On the other hand,
it looks like DB2 allows integer values, so implementing it just like
setseed might cause problems for people coming from DB2. I'm inclined to
suggest that we should define the parameter as being any float8 value,
and obtain a seed from it with hashfloat8(). That way, no matter whether
users think that usable values are fractional or integral, they'll get
sane behavior with different supplied seeds almost always producing
different samples.

Sounds like some good ideas in that part. The main reason for allowing real
numbers is that it's fairly common to want to specify a sample size smaller
than a 1% sample, which as you mention above probably wouldn't be very much
faster than a SeqScan. Not something worth following DB2 on.
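The hashfloat8() idea Tom describes might look roughly like this standalone sketch (the real hashfloat8 lives in PostgreSQL's hash support code and additionally normalizes special values; the names here are illustrative only):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch (not the real hashfloat8): derive a sampling seed from an
 * arbitrary float8 REPEATABLE argument by hashing its bit pattern, so
 * that REPEATABLE(0.1) and REPEATABLE(0.2) yield different seeds
 * instead of both assignment-coercing to the same integer.
 */
static uint64_t
mix64(uint64_t x)
{
    /* splitmix64-style finalizer; it is bijective, so distinct bit
     * patterns always produce distinct 64-bit hashes. */
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

uint64_t
seed_from_float8(double repeatable)
{
    uint64_t bits;

    memcpy(&bits, &repeatable, sizeof(bits));  /* raw IEEE-754 bits */
    return mix64(bits);            /* caller can truncate to 32 bits */
}
```

With this scheme, fractional and integral REPEATABLE values alike map to well-spread seeds, so different supplied seeds almost always produce different samples, which is the behavior Tom argues for above.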

Your main points about how you would do sampling differently show me that
we were right to pursue an extensible approach. It's hardly a radical
thought given FDWs, custom scans and other APIs.

PostgreSQL is lucky enough to have many users with databases larger than 1
TB. We need to understand that they quite like the idea of running a short
query that uses few resources to get an approximate answer. Just look at
all the Wiki pages and blogs on this subject. Exact query techniques are
important, but so is sampling.

In terms of the patch, we've waited approximately a decade for TABLESAMPLE.
I don't see any reason to wait longer, based upon the arguments presented
here. I accept there may be technical points that need to be addressed; I
will look at those tomorrow.

We have a choice here. Please understand that an extension that allows
block sampling is already available and works well. This patch is not
required for some evil plan; it simply allows PostgreSQL users to gain the
benefit of some cool and useful features. The fact that both of us can cite
details of other implementations means we're overdue for this, and I'm happy
it's in 9.5.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#4)
Re: TABLESAMPLE patch is really in pretty sad shape

On 13 July 2015 at 14:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Paquier <michael.paquier@gmail.com> writes:

Regarding the fact that those two contrib modules can be part of a
-contrib package and could be installed, nuking those two extensions
from the tree and preventing the creating of custom tablesample
methods looks like a correct course of things to do for 9.5.

TBH, I think the right thing to do at this point is to revert the entire
patch and send it back for ground-up rework. I think the high-level
design is wrong in many ways and I have about zero confidence in most
of the code details as well.

Based on the various comments here, I don't see that as the right course of
action at this point.

There are no issues relating to security or data loss, just various fixable
issues in a low-impact feature, which in my view is an important feature
also.

If it's to stay, it *must* get a line-by-line review from some
committer-level person; and I think there are other more important things
for us to be doing for 9.5.

Honestly, I am very surprised by this. My feeling was the code was neat,
clear and complete, much more so than many patches I review. If I had
thought the patch or its implementation was in any way contentious I would
not have committed it.

I take responsibility for the state of the code and will put time into
addressing the concerns mentioned and others.

If we cannot resolve them in reasonable time, a revert is possible: there
is nothing riding on this from me.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#8)
Re: TABLESAMPLE patch is really in pretty sad shape

Simon Riggs <simon@2ndQuadrant.com> writes:

On 13 July 2015 at 14:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:

TBH, I think the right thing to do at this point is to revert the entire
patch and send it back for ground-up rework. I think the high-level
design is wrong in many ways and I have about zero confidence in most
of the code details as well.

There are no issues relating to security or data loss, just various fixable
issues in a low-impact feature, which in my view is an important feature
also.

There is a *very large* amount of work needed here, and I do not hear you
promising to do it. What I'm hearing is stonewalling, and I am not happy.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#9)
Re: TABLESAMPLE patch is really in pretty sad shape

On 14 July 2015 at 15:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On 13 July 2015 at 14:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:

TBH, I think the right thing to do at this point is to revert the entire
patch and send it back for ground-up rework. I think the high-level
design is wrong in many ways and I have about zero confidence in most
of the code details as well.

There are no issues relating to security or data loss, just various fixable
issues in a low-impact feature, which in my view is an important feature
also.

There is a *very large* amount of work needed here, and I do not hear you
promising to do it.

I thought I had done so clearly enough, happy to do so again.

I promise that either work will be done, or the patch will be reverted.
Since I have more time now, I view that as a realistic prospect.

What I'm hearing is stonewalling, and I am not happy.

I'm not sure what you mean by that but it sounds negative and is almost
certainly not justified, in this or other cases.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11Noah Misch
noah@leadboat.com
In reply to: Simon Riggs (#8)
Re: TABLESAMPLE patch is really in pretty sad shape

On Tue, Jul 14, 2015 at 11:14:55AM +0100, Simon Riggs wrote:

On 13 July 2015 at 14:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:

TBH, I think the right thing to do at this point is to revert the entire
patch and send it back for ground-up rework. I think the high-level
design is wrong in many ways and I have about zero confidence in most
of the code details as well.

Based on the various comments here, I don't see that as the right course of
action at this point.

There are no issues relating to security or data loss, just various fixable
issues in a low-impact feature, which in my view is an important feature
also.

If it's
to stay, it *must* get a line-by-line review from some committer-level
person; and I think there are other more important things for us to be
doing for 9.5.

Honestly, I am very surprised by this.

Tom's partial review found quite a crop of unvarnished bugs:

1. sample node can give different tuples across rescans within an executor run
2. missing dependency machinery to restrict dropping a sampling extension
3. missing "pg_dump --binary-upgrade" treatment
4. "potential core dumps due to dereferencing values that could be null"
5. factually incorrect comments
6. null argument checks in strict functions
7. failure to check for constisnull
8. "failure to sanity-check" ntuples
9. arithmetic errors in random_relative_prime()

(That's after sifting out design counterproposals, feature requests, and other
topics of regular disagreement. I erred on the side of leaving things off
that list.) Finding one or two like that during a complete post-commit review
would be business as usual. Finding nine in a partial review diagnoses a
critical shortfall in pre-commit review vigilance. Fixing the bugs found to
date will not cure that shortfall. A qualified re-review could cure it.

nm


#12Simon Riggs
simon@2ndQuadrant.com
In reply to: Noah Misch (#11)
Re: TABLESAMPLE patch is really in pretty sad shape

On 15 July 2015 at 05:58, Noah Misch <noah@leadboat.com> wrote:

If it's
to stay, it *must* get a line-by-line review from some committer-level
person; and I think there are other more important things for us to be
doing for 9.5.

Honestly, I am very surprised by this.

Tom's partial review found quite a crop of unvarnished bugs:

1. sample node can give different tuples across rescans within an executor run
2. missing dependency machinery to restrict dropping a sampling extension
3. missing "pg_dump --binary-upgrade" treatment
4. "potential core dumps due to dereferencing values that could be null"
5. factually incorrect comments
6. null argument checks in strict functions
7. failure to check for constisnull
8. "failure to sanity-check" ntuples
9. arithmetic errors in random_relative_prime()

(That's after sifting out design counterproposals, feature requests, and
other topics of regular disagreement. I erred on the side of leaving things
off that list.) Finding one or two like that during a complete post-commit
review would be business as usual. Finding nine in a partial review
diagnoses a critical shortfall in pre-commit review vigilance. Fixing the
bugs found to date will not cure that shortfall. A qualified re-review
could cure it.

Thank you for the summary of points. I agree with that list.

I will work on the re-review as you suggest.

1 and 4 relate to the sample API exposed, which needs some rework. We'll
see how big that is; at this time I presume it's not that hard, but I will
wait for Petr's opinion also when he returns on Friday.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Petr Jelinek
petr@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: TABLESAMPLE patch is really in pretty sad shape

On 2015-07-13 00:36, Tom Lane wrote:

We have a far better model to follow already, namely the foreign data
wrapper API. In that, there's a single SQL-visible function that returns
a dummy datatype indicating that it's an FDW handler, and when called,
it hands back a struct containing pointers to all the other functions
that the particular wrapper needs to supply (and, if necessary, the struct
could have non-function-pointer fields containing other info the core
system might need to know about the handler). We could similarly invent a
pseudotype "tablesample_handler" and represent each tablesample method by
a single SQL function returning tablesample_handler. Any future changes
in the API for tablesample handlers would then appear as changes in the C
definition of the struct returned by the handler, which requires no
SQL-visible thrashing, hence creates no headaches for pg_upgrade, and it
is pretty easy to design it so that failure to update an external module
that implements the API results in C compiler errors and/or sane fallback
behavior.

(back from vacation, going over this thread)

Yes, this sounds very sane (and we use something similar also for logical
decoding plugins, not just FDWs). I wish this had occurred to me before; I
would not have had to spend time on the DDL support, which didn't even get
in.
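For illustration, the handler pattern discussed here could look something like the following sketch (the struct and names such as `TsmRoutine` are hypothetical, modeled on the FDW precedent of a handler function returning an `FdwRoutine` struct; this is not actual PostgreSQL source):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the FDW-style handler pattern for tablesample
 * methods.  In the real design, the SQL-level function would be declared
 * "bernoulli(internal) RETURNS tablesample_handler" and would hand back
 * a pointer to a struct like this; future API changes only alter the C
 * struct, with no SQL-visible thrashing.
 */
typedef struct TsmRoutine
{
    /* Non-function fields the core system may need to know about. */
    int         repeatable_across_scans;

    /* The method's callbacks; this struct can grow in later releases. */
    void      (*init) (uint32_t seed, double percent);
    uint32_t  (*nextblock) (void);
    uint16_t  (*nexttuple) (uint32_t blockno, uint16_t maxoffset);
} TsmRoutine;

/* --- a trivial demo "method" implementing the callbacks --- */
static uint32_t cur_block;

static void
demo_init(uint32_t seed, double percent)
{
    cur_block = 0;
    (void) seed;
    (void) percent;
}

static uint32_t
demo_nextblock(void)
{
    return cur_block++;         /* visit every block, in order */
}

static uint16_t
demo_nexttuple(uint32_t blockno, uint16_t maxoffset)
{
    (void) blockno;
    return maxoffset;           /* always pick the last offset */
}

/* The handler: the single entry point the core system would look up. */
const TsmRoutine *
demo_tablesample_handler(void)
{
    static const TsmRoutine routine = {
        1,                      /* repeatable_across_scans */
        demo_init,
        demo_nextblock,
        demo_nexttuple
    };

    return &routine;
}
```

The appeal of this shape is exactly what Tom outlined: dropping the handler function takes the method with it via ordinary dependency tracking, and pg_upgrade needs no special knowledge of any catalog row.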

Once we've done that, I think we don't even need a special catalog
representing tablesample methods. Given "FROM foo TABLESAMPLE
bernoulli(...)", the parser could just look for a function bernoulli()
returning tablesample_handler, and it's done. The querytree would have an
ordinary function dependency on that function, which would be sufficient

It seems possible that we might indeed not need a catalog.

This would also simplify the parser part which currently contains
specialized function search code as we could most likely just reuse the
generic code.

to handle DROP dependency behaviors properly. (On reflection, maybe
better if it's "bernoulli(internal) returns tablesample_handler",
so as to guarantee that such a function doesn't create a conflict with
any user-defined function of the same name.)

The probability of conflict seems high with system(), so yeah, we'd need
some kind of differentiator.

PS: now that I've written this rant, I wonder why we don't redesign the
index AM API along the same lines. It probably doesn't matter much at
the moment, but if we ever get serious about supporting index AM
extensions, I think we ought to consider doing that.

+1

I think this is very relevant to the proposed sequence am patch as well.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#14Petr Jelinek
petr@2ndquadrant.com
In reply to: Tom Lane (#4)
Re: TABLESAMPLE patch is really in pretty sad shape

On 2015-07-13 15:39, Tom Lane wrote:

Datum
tsm_system_rows_init(PG_FUNCTION_ARGS)
{
    TableSampleDesc *tsdesc = (TableSampleDesc *) PG_GETARG_POINTER(0);
    uint32      seed = PG_GETARG_UINT32(1);
    int32       ntuples = PG_ARGISNULL(2) ? -1 : PG_GETARG_INT32(2);

This is rather curious coding. Why should this function check only
one of its arguments for nullness? If it needs to defend against
any of them being null, it really needs to check all. But in point of
fact, this function is declared STRICT, which means it's a violation of
the function call protocol if the core code ever passes a null to it,
and so this test ought to be dead code.

A similar pattern of ARGISNULL checks in declared-strict functions exists
in all the tablesample modules, not just this one, showing that this is an
overall design error not just a thinko here. My inclination would be to
make the core code enforce non-nullness of all tablesample arguments so as
to make it unnecessary to check strictness of the tsm*init functions, but
in any case it is Not Okay to just pass nulls willy-nilly to strict C
functions.

The reason for this ugliness was having to balance modularity against
following the SQL standard error codes for BERNOULLI and SYSTEM, which came
up as an issue during review (the original code did the checks before
calling the sampling method's functions, but produced just a generic error
code about a wrong parameter). I considered it okay-ish mainly because
those functions are not SQL-callable and the code which calls them knows
how to handle it correctly, but I understand why you don't.

I guess if we did what you proposed in another email in this thread -- not
exposing the API at the SQL level -- we could pass the additional
parameters as a List * and leave the validation completely to the function.
(And maybe not allow NULLs at all.)

Also, I find this coding pretty sloppy even without the strictness angle,
because the net effect of this way of dealing with nulls is that an
argument-must-not-be-null complaint is reported as "out of range",
which is both confusing and the wrong ERRCODE.

    if (ntuples < 1)
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("invalid sample size"),
                 errhint("Sample size must be positive integer value.")));

I don't find this to be good error message style. The secondary comment
is not a "hint", it's an ironclad statement of what you did wrong, so if
we wanted to phrase it like this it should be an errdetail not errhint.
But the whole thing is overly cute anyway because there is no reason at
all not to just say what we mean in the primary error message, eg
    ereport(ERROR,
            (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
             errmsg("sample size must be greater than zero")));

Same as above, although now that I re-read the standard I am sure I
misunderstood it the first time - it says:
"If S is the null value or if S < 0 (zero) or if S > 100, then an
exception condition is raised: data exception — invalid sample size."

I took it as a literal error message originally, but they just mean the
status code by it.

While we're on the subject, what's the reason for disallowing a sample
size of zero? Seems like a reasonable edge case.

Yes that's a bug.

    /* All blocks have been read, we're done */
    if (sampler->doneblocks > sampler->nblocks ||
        sampler->donetuples >= sampler->ntuples)
        PG_RETURN_UINT32(InvalidBlockNumber);

Okay, I lied, I *am* going to complain about this comment. Comments that
do not accurately describe the code they're attached to are worse than
useless.

That's a copy-pasto from tsm_system_time.

In short, I'd do something more like

    uint32      r;

    /* Safety check to avoid infinite loop or zero result for small n. */
    if (n <= 1)
        return 1;

    /*
     * This should only take 2 or 3 iterations as the probability of 2
     * numbers being relatively prime is ~61%; but just in case, we'll
     * include a CHECK_FOR_INTERRUPTS in the loop.
     */
    do {
        CHECK_FOR_INTERRUPTS();
        r = (uint32) (sampler_random_fract(randstate) * n);
    } while (r == 0 || gcd(r, n) > 1);

Note however that this coding would result in an unpredictable number
of iterations of the RNG, which might not be such a good thing if we're
trying to achieve repeatability. It doesn't matter in the context of
this module since the RNG is not used after initialization, but it would
matter if we then went on to do Bernoulli-style sampling. (Possibly that
could be dodged by reinitializing the RNG after the initialization steps.)

Bernoulli-style sampling does not need this kind of code, so it's not
really an issue. That is, unless you'd like to combine the linear probing
and Bernoulli of course, but I don't see any benefit in doing that.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#15Petr Jelinek
petr@2ndquadrant.com
In reply to: Tom Lane (#5)
Re: TABLESAMPLE patch is really in pretty sad shape

On 2015-07-13 18:00, Tom Lane wrote:

And here's that. I do *not* claim that this is a complete list of design
issues with the patch, it's just things I happened to notice in the amount
of time I've spent so far (which is already way more than I wanted to
spend on TABLESAMPLE right now).

I'm not sure that we need an extensible sampling-method capability at all,
let alone that an API for that needs to be the primary focus of a patch.
Certainly the offered examples of extension modules aren't inspiring:
tsm_system_time is broken by design (more about that below) and nobody
would miss tsm_system_rows if it weren't there. What is far more
important is to get sampling capability into indexscans, and designing
an API ahead of time seems like mostly a distraction from that.

I'd think seriously about tossing the entire executor-stage part of the
patch, creating a stateless (history-independent) sampling filter that
just accepts or rejects TIDs, and sticking calls to that into all the
table scan node types. Once you had something like that working well
it might be worth considering whether to try to expose an API to
generalize it. But even then it's not clear that we really need any
more options than true-Bernoulli and block-level sampling.

I think this is not really an API issue so much as an opposite approach on
what to implement first. In the first iteration I prioritized true block
sampling support, as that's what I've mostly been asked for.

But my plan was to add, at some later stage (9.6+), the ability to do
subquery scans etc. Basically a new SamplingFilter executor node which
would pass the tuples to examinetuple(), which would then decide what to
do with them. The selection between using nextblock/nexttuple and
examinetuple was supposed to be an extension of the API, where the
sampling method would say whether it supports examinetuple,
nextblock/nexttuple, or both. And eventually I wanted to rewrite bernoulli
to just use examinetuple() on top of whatever scan, once this additional
support was in. I think I explained this during the review stage.

The IBM paper I linked to in the other thread mentions that their
optimizer will sometimes choose to do Bernoulli sampling even if SYSTEM
was requested. Probably that happens when it decides to do a simple
indexscan, because then there's no advantage to trying to cluster the

Yeah, it happens when there is an index used in the WHERE clause with
higher selectivity than the percentage specified in the TABLESAMPLE
clause. This of course breaks one of the common use-cases though:
SELECT count(*) * 100 FROM table TABLESAMPLE SYSTEM(1) WHERE foo = bar;

sampled rows.  But in the case of a bitmap scan, you could very easily do
either true Bernoulli or block-level sampling simply by adjusting the
rules about which bits you keep or clear in the bitmap (given that you
apply the filter between collection of the index bitmap and accessing the
heap, which seems natural).  The only case where a special scan type
really seems to be needed is if you want block-level sampling, the query
would otherwise use a seqscan, *and* the sampling percentage is pretty low
--- if you'd be reading every second or third block anyway, you're likely
better off with a plain seqscan so that the kernel sees sequential access
and does prefetching.  The current API doesn't seem to make it possible to
switch between seqscan and read-only-selected-blocks based on the sampling
percentage, but I think that could be an interesting optimization.

Well, you can do that if you write your own sampling method. We don't do
that in the optimizer, and that's a design choice, because you can't really
do it at that high a level if you want to keep extensibility. And given the
number of people who asked whether they could do their own sampling when I
talked to them about this during the design stage, I consider the
extensibility more important. Especially since extensibility gives you the
option to do the switching anyway, albeit at a lower level and not out of
the box.

(Another bet that's been missed is having the samplescan logic request
prefetching when it is doing selective block reads. The current API can't
support that AFAICS, since there's no expectation that nextblock calls
could be done asynchronously from nexttuple calls.)

Not sure I follow; wouldn't it be possible to achieve this by setting
tsmseqscan to true (it's a misnomer then, I know)?

Another issue with the API as designed is the examinetuple() callback.
Allowing sample methods to see invisible tuples is quite possibly the
worst idea in the whole patch. They can *not* do anything with such
tuples, or they'll totally break reproducibility: if the tuple is
invisible to your scan, it might well be or soon become invisible to
everybody, whereupon it would be subject to garbage collection at the
drop of a hat. So if an invisible tuple affects the sample method's
behavior at all, repeated scans in the same query would not produce
identical results, which as mentioned before is required both by spec
and for minimally sane query behavior. Moreover, examining the contents
of the tuple is unsafe (it might contain pointers to TOAST values that
no longer exist); and even if it were safe, what's the point? Sampling
that pays attention to the data is the very definition of biased. So
if we do re-introduce an API like the current one, I'd definitely lose
this bit and only allow sample methods to consider visible tuples.

I didn't actually have the examinetuple() call on invisible tuples
originally; it was something I added after discussions with several people
off-list who mentioned, independently of each other, that it would be
useful to have this. I didn't realize the TOAST issue though, otherwise I
would not have done that.

(I'm not exactly convinced that the system or tsm_system_rows methods
are adequately reproducible either, given that their sampling pattern
will change when the relation block count changes. Perhaps that's
unavoidable, but it seems like it might be possible to define things
in such a way that adding blocks doesn't change which existing blocks
get sampled.)

Note that even the standard says that REPEATABLE only returns the same
result "provided certain implementation-defined conditions are satisfied"
(I am not saying we should not try, I am saying it's OK to have some
limitations there).

A more localized issue that I noticed is that nowhere is it documented
what the REPEATABLE parameter value is. Digging in the code eventually
reveals that the value is assignment-coerced to an int4, which I find
rather problematic because a user might reasonably assume that the
parameter works like setseed's parameter (float8 in the range -1 to 1).
If he does then he'll get no errors and identical samples from say
REPEATABLE(0.1) and REPEATABLE(0.2), which is bad. On the other hand,
it looks like DB2 allows integer values, so implementing it just like
setseed might cause problems for people coming from DB2. I'm inclined to
suggest that we should define the parameter as being any float8 value,
and obtain a seed from it with hashfloat8(). That way, no matter whether
users think that usable values are fractional or integral, they'll get
sane behavior with different supplied seeds almost always producing
different samples.

Sounds reasonable.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#16Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Petr Jelinek (#14)
Re: TABLESAMPLE patch is really in pretty sad shape

Petr Jelinek wrote:

On 2015-07-13 15:39, Tom Lane wrote:

I don't find this to be good error message style. The secondary comment
is not a "hint", it's an ironclad statement of what you did wrong, so if
we wanted to phrase it like this it should be an errdetail not errhint.
But the whole thing is overly cute anyway because there is no reason at
all not to just say what we mean in the primary error message, eg
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
errmsg("sample size must be greater than zero")));

Same as above, although now that I re-read the standard I am sure I
misunderstand it the first time - it says:
"If S is the null value or if S < 0 (zero) or if S > 100, then an exception
condition is raised: data exception — invalid sample size."
I took it as literal error message originally but they just mean status code
by it.

Yes, we must use a new errcode here: ERRCODE_INVALID_SAMPLE_SIZE, defined
as 2202H.
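In errcodes.txt terms that would presumably be a one-line addition to the Section 22 ("data exception") block, along these lines (a sketch of the file's format, not the committed line):

```
2202H    E    ERRCODE_INVALID_SAMPLE_SIZE    invalid_sample_size
```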

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Petr Jelinek (#13)
Re: TABLESAMPLE patch is really in pretty sad shape

Petr Jelinek wrote:

On 2015-07-13 00:36, Tom Lane wrote:

PS: now that I've written this rant, I wonder why we don't redesign the
index AM API along the same lines. It probably doesn't matter much at
the moment, but if we ever get serious about supporting index AM
extensions, I think we ought to consider doing that.

+1

I think this is very relevant to the proposed sequence am patch as well.

Hmm, how would this work? Would we have index AM implementations run
some function that registers their support methods somehow at startup?
Hopefully we're not going to have the index AMs become shared libraries.

In any case, if index AMs and sequence AMs go this route, that probably
means the column store AM we're working on will have to go the same
route too.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#17)
Re: TABLESAMPLE patch is really in pretty sad shape

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Petr Jelinek wrote:

On 2015-07-13 00:36, Tom Lane wrote:

PS: now that I've written this rant, I wonder why we don't redesign the
index AM API along the same lines. It probably doesn't matter much at
the moment, but if we ever get serious about supporting index AM
extensions, I think we ought to consider doing that.

I think this is very relevant to the proposed sequence am patch as well.

Hmm, how would this work? Would we have index AM implementations run
some function that registers their support methods somehow at startup?

Well, registration of new index AMs is an unsolved question ATM anyhow.
But what I'm imagining is that pg_am would reduce to about two columns,
amname and a handler function OID, and everything else that constitutes
the API for AMs would get moved down to the C level. We have to keep that
catalog because we still need index AMs to have OIDs that will represent
them in pg_opclass etc; but we don't need to nail the exact set of AM
interface functions into the catalog. (I'm not sure whether we'd want
to remove all the bool columns from pg_am. At the C level it would be
about as convenient to have them in a struct returned by the handler
function. But it's occasionally useful to have those properties
visible to SQL queries.)

I'm not clear on whether sequence AMs would need explicit catalog
representation, or could be folded down to just a single SQL function
with special signature as I suggested for tablesample handlers.
Is there any need for a sequence AM to have additional catalog
infrastructure like index AMs need?

Hopefully we're not going to have the index AMs become shared libraries.

Why not? If they can't be that, any claim that we've solved index AM
extensibility seems pretty weak.

In any case, if index AMs and sequence AMs go this route, that probably
means the column store AM we're working on will have to go the same
route too.

It's worth considering anyway. The FDW API has clearly been far more
successful than the index AM API in terms of being practically usable
by extensions.

regards, tom lane


#19Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#17)
Re: TABLESAMPLE patch is really in pretty sad shape

On Thu, Jul 16, 2015 at 10:33 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Petr Jelinek wrote:

On 2015-07-13 00:36, Tom Lane wrote:

PS: now that I've written this rant, I wonder why we don't redesign the
index AM API along the same lines. It probably doesn't matter much at
the moment, but if we ever get serious about supporting index AM
extensions, I think we ought to consider doing that.

+1

I think this is very relevant to the proposed sequence am patch as well.

Hmm, how would this work? Would we have index AM implementations run
some function that registers their support methods somehow at startup?
Hopefully we're not going to have the index AMs become shared libraries.

I recall a proposal by Alexander Korotkov about extensible access
methods although his proposal also included a CREATE AM command that
would add a pg_am row so that perhaps differs from what Tom seems to
allude to here.

/messages/by-id/CAPpHfdsXwZmojm6Dx+TJnpYk27kT4o7Ri6X_4OSWcByu1Rm+VA@mail.gmail.com

Thanks,
Amit


#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Langote (#19)
Re: TABLESAMPLE patch is really in pretty sad shape

Amit Langote <amitlangote09@gmail.com> writes:

On Thu, Jul 16, 2015 at 10:33 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Hmm, how would this work? Would we have index AM implementations run
some function that registers their support methods somehow at startup?

I recall a proposal by Alexander Korotkov about extensible access
methods although his proposal also included a CREATE AM command that
would add a pg_am row so that perhaps differs from what Tom seems to
allude to here.

I think we'd still need to invent CREATE AM if we wanted to allow index
AMs to be created as extensions; we'd still have to have the pg_am
catalog, and extensions still couldn't write rows directly into that,
for the same reasons I pointed out with respect to tablesample methods.

However, if the contents of pg_am could be boiled down to just a name and
a handler function, then that would represent a simple and completely
stable definition for CREATE AM's arguments, which would be a large
improvement over trying to reflect the current contents of pg_am directly
in a SQL statement. We add new columns to pg_am all the time, and that
would create huge backward-compatibility headaches if we had to modify
the behavior of a CREATE AM statement every time.
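With pg_am boiled down to a name and a handler, the whole SQL surface of such a command could stay this small (a hypothetical syntax sketch, not a committed design; the AM and handler names are invented):

```sql
-- Hypothetical: the statement names the AM and its handler function,
-- and everything else is negotiated at the C level via the handler.
CREATE ACCESS METHOD bloom TYPE INDEX HANDLER blhandler;
```

New pg_am columns would then never require touching this grammar; only the struct returned by the handler would grow.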

regards, tom lane


#21Petr Jelinek
petr@2ndquadrant.com
In reply to: Tom Lane (#18)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#21)
#23Petr Jelinek
petr@2ndquadrant.com
In reply to: Tom Lane (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#23)
#25Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#24)
#26Petr Jelinek
petr@2ndquadrant.com
In reply to: Tom Lane (#24)
#27Petr Jelinek
petr@2ndquadrant.com
In reply to: Petr Jelinek (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#26)
#29Emre Hasegeli
emre@hasegeli.com
In reply to: Petr Jelinek (#13)
#30Petr Jelinek
petr@2ndquadrant.com
In reply to: Tom Lane (#28)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#30)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#31)
#33Petr Jelinek
petr@2ndquadrant.com
In reply to: Tom Lane (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#34)
#36Petr Jelinek
petr@2ndquadrant.com
In reply to: Tom Lane (#35)
#37Petr Jelinek
petr@2ndquadrant.com
In reply to: Petr Jelinek (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#36)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#37)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#38)
#41Petr Jelinek
petr@2ndquadrant.com
In reply to: Tom Lane (#40)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#41)
#43Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#35)
#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#43)
#45Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#44)