TABLESAMPLE usage

Started by Tom Smithabout 10 years ago7 messagesgeneral

tomsmith1989sk@gmail.com

about 10 years ago

Hello:

I have a big table with that is always appended with new data with a unique
sequence id (always incremented, or timestamp as unique index) each row.
I'd like to sample, say 100 rows out of say 1000 rows evently across all
the rows,
so that it would return rows of1, 101, 201, 301 you get idea.
can TABLESAMPLE get one row for every 100 rows, based on the order
of the rows added to table using the timestamp as already indexed/sorted
sequence

Thanks

Vik Fearing

vik@postgresfriends.org

about 10 years ago

In reply to: Tom Smith (#1)

Re: TABLESAMPLE usage

On 01/25/2016 05:09 AM, Tom Smith wrote:

Hello:

I have a big table with that is always appended with new data with a unique
sequence id (always incremented, or timestamp as unique index) each row.
I'd like to sample, say 100 rows out of say 1000 rows evently across all
the rows,
so that it would return rows of1, 101, 201, 301 you get idea.
can TABLESAMPLE get one row for every 100 rows, based on the order
of the rows added to table using the timestamp as already indexed/sorted
sequence

No, TABLESAMPLE is intended to take a random sampling of the data using
various methods.

You're looking for something more like this:

select t.*
from generate_series(1, (select max(id) from t), 100) g
join t on t.id = g;
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Tom Smith

tomsmith1989sk@gmail.com

about 10 years ago

In reply to: Vik Fearing (#2)

Re: TABLESAMPLE usage

Thanks, the solution would work for fixed interval timestamp.
But the data I am dealing with has irregular timestamp so can not be
generated with exact steps.

I would consider this a special case/method of random sampling, evenly
distributed sampling according to the defined timestamp index.

On Mon, Jan 25, 2016 at 3:48 AM, Vik Fearing <vik@2ndquadrant.fr> wrote:

Show quoted text

On 01/25/2016 05:09 AM, Tom Smith wrote:

Hello:

I have a big table with that is always appended with new data with a

unique

sequence id (always incremented, or timestamp as unique index) each row.
I'd like to sample, say 100 rows out of say 1000 rows evently across all
the rows,
so that it would return rows of1, 101, 201, 301 you get idea.
can TABLESAMPLE get one row for every 100 rows, based on the order
of the rows added to table using the timestamp as already indexed/sorted
sequence

No, TABLESAMPLE is intended to take a random sampling of the data using
various methods.

You're looking for something more like this:

select t.*
from generate_series(1, (select max(id) from t), 100) g
join t on t.id = g;
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

Matija Lesar

matija.lesar@gmail.com

about 10 years ago

In reply to: Tom Smith (#3)

Re: TABLESAMPLE usage

On 25 January 2016 at 09:55, Tom Smith <tomsmith1989sk@gmail.com> wrote:

Thanks, the solution would work for fixed interval timestamp.
But the data I am dealing with has irregular timestamp so can not be
generated with exact steps.

I would consider this a special case/method of random sampling, evenly
distributed sampling according to the defined timestamp index.

On Mon, Jan 25, 2016 at 3:48 AM, Vik Fearing <vik@2ndquadrant.fr> wrote:

On 01/25/2016 05:09 AM, Tom Smith wrote:

Hello:

I have a big table with that is always appended with new data with a

unique

sequence id (always incremented, or timestamp as unique index) each

row.

I'd like to sample, say 100 rows out of say 1000 rows evently across all
the rows,
so that it would return rows of1, 101, 201, 301 you get idea.
can TABLESAMPLE get one row for every 100 rows, based on the order
of the rows added to table using the timestamp as already indexed/sorted
sequence

No, TABLESAMPLE is intended to take a random sampling of the data using
various methods.

You're looking for something more like this:

select t.*
from generate_series(1, (select max(id) from t), 100) g
join t on t.id = g;
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

Hi,

you can accomplish this with row_number()
<http://www.postgresql.org/docs/8.4/static/functions-window.html#FUNCTIONS-WINDOW-TABLE>
:

WITH data_cte as (
SELECT
id,
clock_timestamp() as ctimestamp
FROM generate_series(1,1000) as id
)
SELECT
*
FROM
(SELECT
id,
ctimestamp,
row_number() OVER (ORDER BY ctimestamp) as rownum
FROM data_cte
) as data_withrownumbers
WHERE
rownum%100=1;

Bye,
Matija Lesar

Vik Fearing

vik@postgresfriends.org

about 10 years ago

In reply to: Tom Smith (#3)

Re: TABLESAMPLE usage

On 01/25/2016 09:55 AM, Tom Smith wrote:

Thanks, the solution would work for fixed interval timestamp.
But the data I am dealing with has irregular timestamp so can not be
generated with exact steps.

I would consider this a special case/method of random sampling, evenly
distributed sampling according to the defined timestamp index.

You could probably create your own sampling method to do what you want.

See contrib modules tsm_system_rows and tsm_system_time for guidance.
http://www.postgresql.org/docs/current/static/tsm-system-rows.html
http://www.postgresql.org/docs/current/static/tsm-system-time.html
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Simon Riggs

simon@2ndQuadrant.com

about 10 years ago

In reply to: Matija Lesar (#4)

Re: TABLESAMPLE usage

On 25 January 2016 at 09:44, Matija Lesar <matija.lesar@gmail.com> wrote:

you can accomplish this with row_number()
<http://www.postgresql.org/docs/8.4/static/functions-window.html#FUNCTIONS-WINDOW-TABLE>
:

WITH data_cte as (
SELECT
id,
clock_timestamp() as ctimestamp
FROM generate_series(1,1000) as id
)
SELECT
*
FROM
(SELECT
id,
ctimestamp,
row_number() OVER (ORDER BY ctimestamp) as rownum
FROM data_cte
) as data_withrownumbers
WHERE
rownum%100=1;

You can, but its not very fast.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tom Smith

tomsmith1989sk@gmail.com

about 10 years ago

In reply to: Simon Riggs (#6)

Re: TABLESAMPLE usage

Yeah. I am looking for fastest possible method that Postgresql would
use its internal data structure knowledge to walk through the timestamp
index
and resturns every "nth" row

On Mon, Jan 25, 2016 at 5:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Show quoted text

On 25 January 2016 at 09:44, Matija Lesar <matija.lesar@gmail.com> wrote:

you can accomplish this with row_number()
<http://www.postgresql.org/docs/8.4/static/functions-window.html#FUNCTIONS-WINDOW-TABLE>
:

WITH data_cte as (
SELECT
id,
clock_timestamp() as ctimestamp
FROM generate_series(1,1000) as id
)
SELECT
*
FROM
(SELECT
id,
ctimestamp,
row_number() OVER (ORDER BY ctimestamp) as rownum
FROM data_cte
) as data_withrownumbers
WHERE
rownum%100=1;

You can, but its not very fast.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services