How to do faster DML

Started by Lok P · about 2 years ago · 67 messages · general
#1 Lok P
loknath.73@gmail.com

Hello All,
We have a non-partitioned table with ~4.8 billion rows, ~1.4 TB of data, an
average row size of ~313 bytes, and ~127 columns. Approximately 1 billion+
duplicate rows have been inserted into it. We want to remove the duplicate
data from this table and then create a PK/unique constraint so that
duplicate values cannot occur in future. We are struggling to do this.

Teammates suggested using the CTAS method, i.e. create a new table with the
unique record set and then drop the main table. Something like the below:

create table TAB1_New
as
SELECT * from TAB1 A
where ID in
(select min(ID) from TAB1
group by ID having count(ID)>=1 );

But for the above to work faster, they suggested creating an index on the
column used for the duplicate check, i.e. the ID column. Creating the index
itself took ~2 hrs+, and the index size now shows as ~116 GB.

Create index idx1 on TAB1(ID)
We then ran the SELECT subquery part of the CTAS statement to see if it
gives the correct unique record count. It ran for 2.5 hrs and then we
killed it. Below is the plan for it.

explain
select min(ID) from TAB1 A group by ID having count(ID)>=1

GroupAggregate (cost=0.71..6025790113.87 rows=29432861 width=46)
Group Key: ID
Filter: (count(ID) >= 1)
-> Index Only Scan using idx1 on TAB1 a (cost=0.71..5988060903.17
rows=4883397120 width=14)

I want to understand whether this can be done faster in any way. I am also
worried that recreating the PK constraint/index after deleting the
duplicates is going to run forever. Is there any way to make these heavy
operations faster on Postgres by giving the database more resources
through some parameter setup, like a parallel hint etc.? We have the
pg_hint_plan extension installed, but the parallel hint is not being
enforced when we add it to the query.

In Oracle we have parallel hints, direct-path read/write for faster
read/write operations, parallel index scans, etc. Is anything similar
available in Aurora Postgres to provide more horsepower and speed up
these batch operations? And how can we monitor the progress of a running
query?
Just to note: it's a db.r7g.8xlarge Aurora Postgres instance, 32 vCPUs,
256 GB RAM. PG version 15.4.

Regards
Lok

#2 Lok P
loknath.73@gmail.com
In reply to: Lok P (#1)
Re: How to do faster DML

Apologies, one correction: the query is like below, i.e. the filter will be
on ctid, which I believe is the equivalent of rowid in Oracle, so we will
not need the index on the ID column.

But it still runs long, so is there any other way to make the duplicate
removal faster?

Also wondering: the index creation that took ~2.5 hrs+, could it have been
made faster by allowing more DB resources through some session-level
parameter settings?

create table TAB1_New
as
SELECT * from TAB1 A
where CTID in
(select min(CTID) from TAB1
group by ID having count(ID)>=1 );


#3 Todd Lewis
tlewis@brickabode.com
In reply to: Lok P (#2)
Re: How to do faster DML

I copy/pasted your question into ChatGPT, and it gave me 10 specific
suggestions. Have you tried those?


--
Todd Lewis
tlewis@brickabode.com

#4 Ron
ronljohnsonjr@gmail.com
In reply to: Lok P (#2)
Re: How to do faster DML


Aurora is not PostgreSQL, so configurations might not work. Having said
that...
https://www.postgresql.org/docs/15/how-parallel-query-works.html

And have you analyzed the table lately? Also, what are your work_mem
and maintenance_work_mem settings?

#5 Lok P
loknath.73@gmail.com
In reply to: Ron (#4)
Re: How to do faster DML


Thank you.

Below are the values of the default parameters in this instance:

SHOW max_worker_processes;             -- 128
SHOW max_parallel_workers_per_gather;  -- 4
SHOW max_parallel_workers;             -- 32
SHOW max_parallel_maintenance_workers; -- 2
SHOW maintenance_work_mem;             -- 4155MB
SHOW work_mem;                         -- 8MB
SHOW shared_buffers;                   -- 22029684

When I ran the CTAS queries and the index creation, I did not have a very
clear idea of how these parameters relate to each other or how each helps
these operations, but I set a few of them as below before triggering those
in the same session:

SET max_parallel_workers_per_gather = 16;
SET max_parallel_maintenance_workers TO 16;
SET maintenance_work_mem TO '16GB';

The instance has ~256 GB of total memory, so how should I adjust/bump these
values when running heavy SELECT queries doing large sequential scans,
large index creation, SELECT queries with heavy sorting/ORDER BY, or
heavy JOINs?

I have not analyzed the table manually, but seeing the autovacuum and
autoanalyze columns getting populated in pg_stat_user_tables, I assumed it
is being done automatically.

By the way, if we run "ANALYZE tab1" on this ~1.5 TB table, will it run
long, and will any of the above parameters help expedite that ANALYZE
operation if I run it manually?

Regards
Lok

#6 Greg Sabino Mullane
greg@turnstep.com
In reply to: Lok P (#1)
Re: How to do faster DML

As a general rule, avoid heavy subselects like that. You don't need to
build a full list of duplicates before starting. Another approach:

create table mytable2 (like mytable1);

alter table mytable2 add primary key (id);

insert into mytable2 select * from mytable1 on conflict do nothing;

Given the size of your table, you probably want to divide that up.
As long as nothing is changing the original table, you could do:

insert into mytable2 select * from mytable1 order by ctid limit 10_000_000
offset 0;
insert into mytable2 select * from mytable1 order by ctid limit 10_000_000
offset 10_000_000;
insert into mytable2 select * from mytable1 order by ctid limit 10_000_000
offset 20_000_000;
etc.

Cheers,
Greg

#7 Francisco Olarte
folarte@peoplecall.com
In reply to: Greg Sabino Mullane (#6)
Re: How to do faster DML

On Sat, 3 Feb 2024 at 19:29, Greg Sabino Mullane <htamfids@gmail.com> wrote:
...

Given the size of your table, you probably want to divide that up.
As long as nothing is changing the original table, you could do:

insert into mytable2 select * from mytable1 order by ctid limit 10_000_000 offset 0;
insert into mytable2 select * from mytable1 order by ctid limit 10_000_000 offset 10_000_000;
insert into mytable2 select * from mytable1 order by ctid limit 10_000_000 offset 20_000_000;

Is it me or does this have the faint smell of quadratic behaviour? I
mean, you need to read and discard the first 10M rows to do OFFSET 10M
(also, I believe ctid order is implied in a sequential scan, but I'm not
so sure; if not, it would need a full table sort on each pass).

When doing things like this, I normally have some kind of unique index
and do it by selecting with a limit above the last read value (stored
in code, as I normally load first and index later, so I cannot select
max() fast on the target). Or, with some kind of "sparse" index (meaning
groups much smaller than the batch size) and a little code, you can
select where index_col > last order by index_col limit 10M, remember the
last received index_col, and reselect, discarding missing rows (or just
reselect when doing your on conflict do nothing, which also gives a fast
select max(id) due to the PK; it will work with an index on the id
column of the original even if not unique) to avoid that.

Also, I'm not sure whether ctid is ordered and you can select where
ctid > last. If that works, it is probably faster for immutable
origins.

Francisco Olarte.

#8 Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Lok P (#2)
Re: How to do faster DML

On 2024-02-03 19:25:12 +0530, Lok P wrote:


create table TAB1_New
as
SELECT  * from TAB1 A
where CTID in
      (select min(CTID) from TAB1
      group by ID having count(ID)>=1 );

That »having count(ID)>=1« seems redundant to me. Surely every id which
occurs in the table occurs at least once?

Since you want ID to be unique I assume that it is already almost
unique - so only a small fraction of the ids will be duplicates. So I
would start with creating a list of duplicates:

create table tab1_dups as
select id, count(*) from tab1 group by id having count(*) > 1;

This will still take some time because it needs to build a temporary
structure large enough to hold a count for each individual id. But at
least then you'll have a much smaller table to use for further cleanup.
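The two-step approach above can be illustrated with a hypothetical Python simulation, where each row is a (ctid, id) pair: first build the much smaller duplicate list, then keep only one copy (say, the min-ctid row) of each duplicated id:

```python
from collections import Counter

def find_dups(rows):
    """Like: create table tab1_dups as
       select id, count(*) from tab1 group by id having count(*) > 1"""
    counts = Counter(rid for _ctid, rid in rows)
    return {rid: n for rid, n in counts.items() if n > 1}

def dedup(rows, dups):
    """Keep unique-id rows as-is, plus the min-ctid row of each duplicated id."""
    keep_ctid = {}
    for ctid, rid in rows:
        if rid in dups:
            keep_ctid[rid] = min(keep_ctid.get(rid, ctid), ctid)
    return [(c, r) for c, r in rows if r not in dups or keep_ctid[r] == c]

rows = [(1, 'a'), (2, 'b'), (3, 'a'), (4, 'c'), (5, 'a')]
print(find_dups(rows))               # {'a': 3}
print(dedup(rows, find_dups(rows)))  # [(1, 'a'), (2, 'b'), (4, 'c')]
```

The point of the simulation is the data-volume asymmetry: the cleanup pass only has to consult the small duplicates map, not a billion-row list.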

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#9 Lok P
loknath.73@gmail.com
In reply to: Francisco Olarte (#7)
Re: How to do faster DML


Thank you.

Yes, I think the "on conflict do nothing" option looks promising, as it
will drop the duplicates on PK violation but keep the load going for
subsequent rows.

However, we have ~5 billion rows in the base table, and of those we expect
almost half, i.e. ~2 billion, to be duplicates. And you said, doing the
inserts using the "order by ctid offset" approach must cause one full
sequential scan of the whole table for loading each 10M-row chunk, and
that would take a long time, I believe.

I am still trying to understand the other approach which you suggested;
I am not able to follow "you can select where index_col > last order by
index_col limit 10M".
However, to get the max ID value of the last 10M loaded rows in the
target, are you saying that having a PK index on that target table column
(ID) will help, and we can save the max(ID) value in another table and
keep loading from the source table (as ID > the max_ID stored in the temp
table)?
OR
Would it be better to do it in one shot, but with higher values for some
parameters like "maintenance_work_mem" or "max_parallel_workers"?

#10 Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Lok P (#9)
Re: How to do faster DML

On 2024-02-04 02:14:20 +0530, Lok P wrote:

However , as we have ~5billion rows in the base table and out of that , we were
expecting almost half i.e. ~2billion would be duplicates.

That's way more than I expected from your original description. And it
of course raises the question whether it's a good idea to just throw
away all that data or if you need to keep that in a normalized way.

And you said, doing the inserts using the "order by CTID Offset"
approach must cause one full sequential scan of the whole table for
loading each chunk/10M of rows and that would take a long time I
believe.

I am still trying to understand the other approach which you suggested. Not
able to understand "you can select where index_col > last order by index_col
limit 10M".
However, to get the max ID value of the last 10M loaded rows in target, do you
say that having an PK index created on that target table column(ID) will
help,

Yes. Getting the maximum value from an index is a very fast operation.
You just have to traverse down the right edge of the tree (or you may
even be able to access the right-most leaf page directly).

 and we can save the max (ID) value subsequently in another table to fetch
and keep loading from the source table (as ID>Max_ID stored in temp table)?

Another table or a variable in a script (personally, if I need to do
something repeatedly, I usually write a script in the scripting language
I feel most comfortable in (which has been Python for the last 7 or 8
years, Perl before that) which gives you variables, loops, conditionals
and - above all - repeatability.

OR
Would it be better to do it in one shot only , but by setting a higher value of
some parameters like "maintenance_work_mem" or "max_parallel_workers"?

Hard to say. Normally, processing in fewer, bigger chunks is faster. But
RAM is much faster than disk (even with SSDs), so it might be faster to
make work_mem as large as you can and then use a chunk size which just
fits inside work_mem. Of course finding that sweet spot takes
experimentation, hence time, and it may make little sense to experiment
for 20 hours just to save 40 minutes.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#11 Alban Hertroys
haramrae@gmail.com
In reply to: Lok P (#1)
Re: How to do faster DML


~4.8 billion rows of which ~1 billion are duplicates… Wait a minute…

Did you verify that your ID column is larger than 32 bits?
Because if it's a 32-bit integer, the range of values it can hold is about 4.3 billion, after which it wraps around.

With ~4.8 billion rows that would result in about ~0.5 billion repeated ID values, giving you the reported ~1 billion duplicate ID's.
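A quick back-of-the-envelope check of that arithmetic (assuming, hypothetically, that the ID source silently reuses values after exhausting the 32-bit range):

```python
# If IDs wrap after 2**32 distinct values, every row beyond that point
# reuses an ID that already exists in the table.
total_rows = 4_800_000_000
id_space = 2**32                      # ~4.29 billion distinct 32-bit values

wrapped_rows = total_rows - id_space  # rows forced to reuse an earlier ID
rows_with_dup_ids = 2 * wrapped_rows  # both copies of each repeated ID

print(wrapped_rows)       # 505032704  -> ~0.5 billion repeated ID values
print(rows_with_dup_ids)  # 1010065408 -> ~1 billion rows with duplicate IDs
```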

If that’s the case, your duplicates obviously aren’t really duplicates and you require a different type of solution.


Are your duplicates exact duplicates? Or is there an order of preference among them?
And if so, what really makes those rows unique?

That matters for solutions on how to deduplicate these rows.

Regards,

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.

#12 Francisco Olarte
folarte@peoplecall.com
In reply to: Lok P (#9)
Re: How to do faster DML

Lok:

On Sat, 3 Feb 2024 at 21:44, Lok P <loknath.73@gmail.com> wrote:

On Sun, Feb 4, 2024 at 12:50 AM Francisco Olarte <folarte@peoplecall.com> wrote:

However , as we have ~5billion rows in the base table and out of that , we were expecting almost half i.e. ~2billion would be duplicates. And you said, doing the inserts using the "order by CTID Offset" approach must cause one full sequential scan of the whole table for loading each chunk/10M of rows and that would take a long time I believe.

I did not say "MUST CAUSE". In fact I said I believe it would not.

What I am going to say (now) is: test it. Make 1k, 10k, 100k, 1M-row
tables in a scratch database, and EXPLAIN and test your approaches there
with and without the index, etc. Not all sizes are needed, but testing
100k and 1M rows in 1k batches could reveal hidden quadratic behaviour.
EXPLAIN would show you unexpected sorts or scans.

I am still trying to understand the other approach which you suggested. Not able to understand "you can select where index_col > last order by index_col limit 10M," .
However, to get the max ID value of the last 10M loaded rows in target, do you say that having an PK index created on that target table column(ID) will help, and we can save the max (ID) value subsequently in another table to fetch and keep loading from the source table (as ID>Max_ID stored in temp table)?

I am a programmer by trade. When faced with problems like these,
unless I find a trivial solution, I tend to write auxiliary programs, as
it is much easier for me to write a thousand lines of Perl (or
Python/C/C++/Java) or a couple hundred of plpgsql (if I want to
avoid round trips) than to debug complicated SQL-only
workflows.

For your kind of problem I would write a program to read the rows and
insert them. As an example, let's say you have a non-unique index on ID
and are going to use the on-conflict-do-nothing route (I believe
this would be slow due to needing an index on the target
table to support it, but let's assume it is OK). To do that I might
just do a loop, starting with last_id = -1 (some id less than any other
id), selecting a chunk of rows with id >= last_id ordered by id and
inserting them. After doing that I may notice that I do not need the
index if the sort order is right, drop the index and the on-conflict,
and just do, for every row: if (id > last_id) insert, before storing
last_id = id.
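That refined loop (sorted input, no target index, write a row only if its id exceeds the last id written) can be sketched hypothetically in Python over in-memory (id, payload) tuples:

```python
# Deduplicate while copying a stream already sorted by id: a row is written
# only if its id is greater than the last id written, so duplicates are
# dropped without any index or ON CONFLICT on the target.
def dedup_copy(sorted_rows, chunk_size):
    target, last_id, chunk = [], None, []

    def flush(chunk, last_id):
        for rid, payload in chunk:
            if last_id is None or rid > last_id:  # first copy of this id wins
                target.append((rid, payload))
                last_id = rid
        return last_id

    for row in sorted_rows:
        chunk.append(row)
        if len(chunk) == chunk_size:      # emulate one batched round trip
            last_id = flush(chunk, last_id)
            chunk = []
    flush(chunk, last_id)                 # final partial chunk
    return target

rows = [(1, 'a'), (1, 'a'), (2, 'b'), (3, 'c'), (3, 'c'), (4, 'd')]
print(dedup_copy(rows, 3))  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```

The chunking only emulates the batched round trips; the dedup correctness comes purely from the input being sorted by id.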

Anyway, not knowing the real table characteristics and current usage
patterns, I cannot recommend anything concrete.

Would it be better to do it in one shot only , but by setting a higher value of some parameters like "maintenance_work_mem" or "max_parallel_workers"?

It depends on a lot of unknown ( to us ) things.

Francisco Olarte.

#13 Marcos Pegoraro
marcos@f10.com.br
In reply to: Greg Sabino Mullane (#6)
Re: How to do faster DML

insert into mytable2 select * from mytable1 order by ctid limit 10_000_000
offset 20_000_000;

You can use min/max values, grouping rows by 10_000_000 records, so you
don't need that offset; then generate the commands and run them:

select format('insert into mytable2 select * from mytable1 where i between
%s and %s;', min(i), max(i)) from mytable1 group by i/10_000_000;
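The same command generation can be done client-side; a hypothetical Python equivalent that buckets IDs by 10M and emits one ranged INSERT per bucket (table and column names are illustrative):

```python
from collections import defaultdict

# Group ids into buckets of 10M and emit one INSERT per bucket, replacing
# OFFSET with a BETWEEN range over each bucket's min/max.
def range_commands(ids, bucket_size=10_000_000):
    buckets = defaultdict(list)
    for i in ids:
        buckets[i // bucket_size].append(i)
    return [
        f"insert into mytable2 select * from mytable1 "
        f"where i between {min(v)} and {max(v)};"
        for _, v in sorted(buckets.items())
    ]

for cmd in range_commands([1, 5, 10_000_002, 10_000_007, 20_000_001]):
    print(cmd)
```

Each generated statement scans only its ID range (given an index on `i`), so the batches are independent and can even run concurrently.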

#14 Dennis White
dwhite@seawardmoon.com
In reply to: Marcos Pegoraro (#13)
Re: How to do faster DML

I'm surprised no one has mentioned that it's perhaps a good idea to
partition this table while adding the PK. By your own statements the table
is difficult to work with as is. Without partitioning, row inserts would
need to walk the whole PK index, which could be a factor. If this is a
static table then perhaps that's OK, but if not...

Anyway, I don't recall what type the ID is or how it's set, but I suggest
you seriously investigate using it to partition this table into manageable
smaller tables.
Best of luck.


#15Lok P
loknath.73@gmail.com
In reply to: Dennis White (#14)
Re: How to do faster DML


Thank you so much.

You are correct. This table has proven difficult to operate on. Every read
query is struggling, so partitioning is something we must consider, and we
hope that Postgres will handle this scale with a proper partitioning and
indexing strategy.

I have a few related questions.

1) Even after partitioning the target table, is there an option to disable
the primary and foreign keys to speed up the data load and re-enable them
once the load finishes? Would that be a good idea, or would re-enabling the
constraints after all the data is persisted be cumbersome/resource
intensive?

2) I understand there is theoretically no limit on the number or size of
partitions a table can have in Postgres. But I want to know from the
experts here, from real-life experience, whether there are any practical
(soft) limits, with respect to both partition size and partition count,
that we should keep in mind when deciding the partitioning strategy.
Note: currently this table will be around ~1TB in size and hold approx
~3 billion rows (post removal of duplicates), but per business needs it
may grow up to ~45 billion rows in future.

3) As each partition of this table is going to be very large and this will
be a read-intensive application, compressing the historical partitions
would save storage space and should also help read-query performance. Can
you please shed some light on the compression strategy we should follow
here (considering a daily range-partitioned table with transaction_date as
the partition key)?

Regards
Lok

#16Greg Sabino Mullane
greg@turnstep.com
In reply to: Lok P (#15)
Re: How to do faster DML

Partitioning will depend a lot on how you access the data. Is it always
using that transaction_date column in the where clause? Can you share the
table definition via \d? We've talked about this table quite a bit, but not
actually seen what it looks like. We are flying blind a little bit. You
mentioned your queries are slow, but can you share an explain analyze on
one of these slow queries?

45 billion rows is quite manageable. How many rows are in each day? You may
want to do a larger bucket than 24 hours per partition.

1)Even after partitioning the target table , to speed up the data load on

this table , Is there an option to disable the primary and foreign keys and
re-enable them post data load finishes. Will that be a good idea

No.

3)As the size of the table or each partition is going to be very large and

this will be a read intensive application, compressing the historical
partition will help us save the storage space and will also help the read
queries performance

I am not sure what this means. If you are not going to need the data
anymore, dump the data to deep storage and drop the partition.

Cheers,
Greg

#17Ron
ronljohnsonjr@gmail.com
In reply to: Lok P (#15)
Re: How to do faster DML


I have a few related questions.

1)Even after partitioning the target table , to speed up the data load on
this table , Is there an option to disable the primary and foreign keys and
re-enable them post data load finishes. Will that be a good idea or will it
be cumbersome/resource intensive to re-enable the constraints , after
persisting all the data in the table?

1. Load the children before attaching them to the parent.
2. Create the child indices, PK and FKs before attaching to the parent.
3. Do step 2 in multiple parallel jobs. (cron is your friend.)
4. Attach the children to the "naked" (no PK, no FK, no indices) parent.
5. Use the ONLY keyword on CREATE INDEX and ALTER TABLE to add the indices,
PK and FK to the parent *after* step 4.
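The steps above might look roughly like this in SQL. All names (parent_tbl, child_2024_02, id, transaction_date) and the date bounds are assumptions, and the ON ONLY / ATTACH PARTITION index forms need PostgreSQL 11+:

```sql
-- Steps 1-2: build and index each child while it is still standalone.
CREATE TABLE child_2024_02 (LIKE parent_tbl INCLUDING DEFAULTS);
-- ... bulk-load child_2024_02 here ...
CREATE INDEX child_2024_02_id_idx ON child_2024_02 (id);

-- Step 4: a CHECK constraint matching the partition bounds lets
-- ATTACH PARTITION skip its full validation scan of the child.
ALTER TABLE child_2024_02 ADD CONSTRAINT c_2024_02_bounds
  CHECK (transaction_date >= DATE '2024-02-01'
     AND transaction_date <  DATE '2024-03-01');
ALTER TABLE parent_tbl ATTACH PARTITION child_2024_02
  FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
ALTER TABLE child_2024_02 DROP CONSTRAINT c_2024_02_bounds;  -- now redundant

-- Step 5: add the parent-level index without rebuilding the children,
-- then attach each matching child index; the parent index becomes
-- valid once every partition's index has been attached.
CREATE INDEX parent_id_idx ON ONLY parent_tbl (id);
ALTER INDEX parent_id_idx ATTACH PARTITION child_2024_02_id_idx;
```

Step 3 is simply running the per-child CREATE INDEX statements from separate sessions (or cron jobs) so they build in parallel.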

2)I understand there is no limitation theoretically on the number or size

of partitions a table can have in postgres. But I want to know from experts
here, from their real life experience, if there exists any such thing which
we should take care of before deciding the partitioning strategy, so as to
have the soft limit (both with respect to size and number of partitions)
obeyed.
Note:- Currently this table will be around ~1TB in size and will hold
Approx ~3billion rows(post removal of duplicates). But as per business need
it may grow up to ~45 billion rows in future.

I'd keep it under 200 partitions, but my experience might be outdated.

3)As the size of the table or each partition is going to be very large and
this will be a read intensive application,

By PK? If so, partition by PK.


#18Lok P
loknath.73@gmail.com
In reply to: Greg Sabino Mullane (#16)
Re: How to do faster DML


Thank you.

The table has ~127 columns of various data types (combinations of numeric,
varchar, date, etc.). Its current size is ~1TB, holding ~3 billion rows,
with a row size of ~300 bytes.

Currently it holds a smaller volume, but in future up to ~500 million
rows/day will be inserted into this table. The plan is to persist at least
~3 months of transaction data, which will be around 45 billion rows in
total. If all goes well, we may eventually need to persist ~6 months'
worth of data in this database, which will be ~90 billion rows.

This table will always be queried with the transaction_date column as one
of the filter criteria, but the search criteria can span anywhere from a
day to a month's worth of transaction dates.

When you said "*You may want to do a larger bucket than 24 hours per
partition*", did you mean partitioning by week or so? By current math,
1TB of storage holds ~3 billion rows, so a daily range partition (holding
~500 million transactions/day) will be around ~100-150GB. Won't that be
too much data for a single partition to operate on, and wouldn't
coarsening the granularity (say, to weekly) make the partitions even
bulkier?

What I meant was: we will definitely need the data for querying purposes,
but to keep the storage space in control (and to help the read queries),
we were thinking of compressing the historical partitions. Is there any
compression strategy we should follow for that in Postgres?

With regard to loading data into the table faster, I wonder why you said
'*NO*' to loading the data first and creating/enabling the primary key and
foreign key constraints later. Because this is a child table and the
parent already has data in it, loading data into this table with the PK
and FK in place is very slow, as it validates them for each set of rows.
So we were thinking that doing the validation in one shot at a later stage
would be faster. Please suggest.
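One middle ground worth noting (a sketch with assumed names, not a tested recipe): PostgreSQL lets a foreign key be added as NOT VALID, which decouples the bulk load from the row-by-row check; the pre-existing rows are then verified in a single pass by VALIDATE CONSTRAINT, which takes only a SHARE UPDATE EXCLUSIVE lock:

```sql
-- Names are illustrative. New rows are still checked immediately;
-- only the rows already in the table skip validation until step two.
ALTER TABLE child_tbl
  ADD CONSTRAINT child_tbl_parent_fk
  FOREIGN KEY (parent_id) REFERENCES parent_tbl (id) NOT VALID;

-- Later, during a quiet window, validate all existing rows in one pass:
ALTER TABLE child_tbl VALIDATE CONSTRAINT child_tbl_parent_fk;
```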

I will try to collect some SELECT queries and post their EXPLAIN ANALYZE
output. Currently we are trying to get rid of the duplicates.

Regards
Lok

#19Lok P
loknath.73@gmail.com
In reply to: Ron (#17)
Re: How to do faster DML

On Sun, Feb 4, 2024 at 9:25 PM Ron Johnson <ronljohnsonjr@gmail.com> wrote:

1. Load the children before attaching them to the parent.
2. Create the child indices, PK and FKs before attaching to the parent.
3. Do step 2 in multiple parallel jobs. (cron is your friend.)
4. Attach the children to the "naked" (no PK, no FK, no indices) parent.
5. Use the ONLY keyword on CREATE INDEX and ALTER TABLE to add the
indices, PK and FK to the parent *after* step 4.

Thank you.

Can you please help me understand these steps a bit more precisely?

Say we have a parent table that already has data in it and a primary key
defined. We will load the child table first, after dropping its primary
and foreign keys, so that the data load is faster.

Then we will create the primary key index on the child table. When you
said to use multiple parallel jobs, did you mean creating the PK index on
each partition separately from different sessions, rather than on the
whole table with a single statement? Or is there some other, faster way to
create the PK index?

The last step is attaching the above child tables, with their PKs and FKs,
to the parent. This will validate every row for uniqueness and also ensure
the foreign key value is present in the parent table. Won't this step take
a long time?

#20Ron
ronljohnsonjr@gmail.com
In reply to: Lok P (#5)
Re: How to do faster DML

On Sat, Feb 3, 2024 at 11:09 AM Lok P <loknath.73@gmail.com> wrote:
[snip]

show maintenance_work_mem; - 4155MB
show work_mem; - 8MB
show shared_buffers ; -22029684

Those are pretty small values. What are your server specs?

#21Lok P
loknath.73@gmail.com
In reply to: Lok P (#18)
#22veem v
veema0000@gmail.com
In reply to: Lok P (#21)
#23Greg Sabino Mullane
greg@turnstep.com
In reply to: Lok P (#18)
#24Lok P
loknath.73@gmail.com
In reply to: Greg Sabino Mullane (#23)
#25Greg Sabino Mullane
greg@turnstep.com
In reply to: Lok P (#24)
#26Lok P
loknath.73@gmail.com
In reply to: Greg Sabino Mullane (#25)
#27veem v
veema0000@gmail.com
In reply to: Lok P (#26)
#28Lok P
loknath.73@gmail.com
In reply to: Greg Sabino Mullane (#23)
#29Greg Sabino Mullane
greg@turnstep.com
In reply to: Lok P (#28)
#30Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: veem v (#27)
#31veem v
veema0000@gmail.com
In reply to: Peter J. Holzer (#30)
#32Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Peter J. Holzer (#30)
#33Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: veem v (#31)
#34veem v
veema0000@gmail.com
In reply to: Peter J. Holzer (#33)
#35David G. Johnston
david.g.johnston@gmail.com
In reply to: veem v (#34)
#36Ron
ronljohnsonjr@gmail.com
In reply to: veem v (#34)
#37Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: veem v (#34)
#38Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Ron (#36)
#39veem v
veema0000@gmail.com
In reply to: Peter J. Holzer (#38)
#40Dominique Devienne
ddevienne@gmail.com
In reply to: veem v (#39)
#41Greg Sabino Mullane
greg@turnstep.com
In reply to: Lok P (#26)
#42Greg Sabino Mullane
greg@turnstep.com
In reply to: veem v (#39)
#43veem v
veema0000@gmail.com
In reply to: Greg Sabino Mullane (#42)
#44Ron
ronljohnsonjr@gmail.com
In reply to: veem v (#43)
#45veem v
veema0000@gmail.com
In reply to: Ron (#44)
#46Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Greg Sabino Mullane (#42)
#47Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: veem v (#43)
#48veem v
veema0000@gmail.com
In reply to: Peter J. Holzer (#47)
#49Ron
ronljohnsonjr@gmail.com
In reply to: veem v (#48)
#50veem v
veema0000@gmail.com
In reply to: Ron (#49)
#51Adrian Klaver
adrian.klaver@aklaver.com
In reply to: veem v (#50)
#52veem v
veema0000@gmail.com
In reply to: Adrian Klaver (#51)
#53Ron
ronljohnsonjr@gmail.com
In reply to: veem v (#52)
#54David G. Johnston
david.g.johnston@gmail.com
In reply to: veem v (#48)
#55Greg Sabino Mullane
greg@turnstep.com
In reply to: veem v (#52)
#56Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Greg Sabino Mullane (#55)
#57Greg Sabino Mullane
greg@turnstep.com
In reply to: Adrian Klaver (#56)
#58Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Greg Sabino Mullane (#57)
#59veem v
veema0000@gmail.com
In reply to: Adrian Klaver (#58)
#60David G. Johnston
david.g.johnston@gmail.com
In reply to: veem v (#59)
#61Greg Sabino Mullane
greg@turnstep.com
In reply to: veem v (#59)
#62Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: David G. Johnston (#54)
#63David G. Johnston
david.g.johnston@gmail.com
In reply to: Peter J. Holzer (#62)
#64Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: David G. Johnston (#63)
#65veem v
veema0000@gmail.com
In reply to: Peter J. Holzer (#64)
#66Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: veem v (#65)
#67Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Peter J. Holzer (#64)