PoC: pg_dump --filter-data <file> (like Oracle Where Clause on RMAN for specific tables)

Started by Kirk Wolak, 5 months ago, 7 messages
#1 Kirk Wolak
wolakk@gmail.com

Hackers,
I am asking for feedback on this. We just migrated a client to PG and
all is well, except that the developers requested a copy of recent
production data for testing.

We have a handful of tables that have 90% of the data going back 30 years.
We NEVER restore this data to Dev or Staging. We used a special RMAN
backup where these tables had a "WHERE clause" applied to them during the
backup/dump process.

It would be awesome if pg_dump offered something similar. I am willing
to code this up, but there are questions, and I want to do it in a way
that makes the most sense to the PG community.

My initial thought is a simple filtering file, of the format:
schema.table = WHERE ...
"schema"."Table" = WHERE ...
"schema"."t1" = LIMIT 50000

If --filter-data <filename> is not specified, nothing changes. If it
is, pg_dump would check whether the table appears in the file with a
filter; if so, it would read that filter and apply it to the resulting
COPY command.
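
For example, an entry such as (table and column names purely
illustrative)

public.orders = WHERE created_at > now() - interval '28 days'

would cause pg_dump to emit roughly

COPY (SELECT * FROM public.orders WHERE created_at > now() - interval '28 days') TO stdout;

rather than

COPY public.orders TO stdout;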

I don't believe this impacts pg_restore. But if we want the restore
process to warn that "pg_dump --filter-data was used, this is NOT a
complete dump!", I would be happy to include that in pg_restore as
well.

To make the point concrete: the full pg_dump takes 60 minutes (during
which we must turn off certain features to avoid errors and locking
issues). Excluding these tables, it takes 2-3 minutes. (Separately
dumping limited versions of them with \COPY takes 3 minutes.) And
frankly, we have enough backups of the many years of historical data;
we don't need daily snapshots of it, or to carry it around.

Thanks in advance. I am hoping that I am not the only one who would
benefit from a filtered dump (as opposed to all or nothing).

Finally, I considered allowing an entire query, which could support
data masking and more complex queries when FKs are involved. But that
seemed like a much bigger ask (and a potential foot-gun).

PS: A quick-hack feature we could leverage would be a flag on
--exclude-table-data that generates an EMPTY .dat file, with the
internal toc.dat still referencing it... THEN we simply overwrite that
file with the output of our manual \COPY command. This would be an
almost trivial change and would work for what we need, BUT it feels a
bit "off".
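
For reference, assuming an uncompressed directory-format dump
(-Fd -Z0) and purely illustrative table, column, and file names, the
overwrite step would look roughly like:

psql -d proddb -c "\copy (SELECT * FROM big_audit_log WHERE logged_at > now() - interval '28 days') TO 'dumpdir/4321.dat'"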

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kirk Wolak (#1)
Re: PoC: pg_dump --filter-data <file> (like Oracle Where Clause on RMAN for specific tables)

Kirk Wolak <wolakk@gmail.com> writes:

We have a handful of tables that have 90% of the data going back 30 years.
We NEVER restore this data to Dev or Staging. We used a special RMAN
backup where these tables had a "WHERE clause" applied to them during the
backup/dump process.

Have you considered partitioning these tables by date and then not
dumping the older partitions? That would fit into existing
functionality a lot better.
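
A minimal sketch of that approach might look like this (table, column,
and partition names are hypothetical):

CREATE TABLE orders (
    id bigint,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_old PARTITION OF orders
    FOR VALUES FROM (MINVALUE) TO ('2025-01-01');
CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

pg_dump --exclude-table-data=orders_old mydb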

regards, tom lane

#3 Greg Sabino Mullane
htamfids@gmail.com
In reply to: Kirk Wolak (#1)
Re: PoC: pg_dump --filter-data <file> (like Oracle Where Clause on RMAN for specific tables)

I've seen this idea pop up over the years, and it's still a good one.
Rather than invent new flags, I think a better approach would be to
convince pg_dump to dump a view, such that table foo has a view fooslice to
limit / filter the output. Then we can simply do:

pg_dump -t foo=view:fooslice

and under the hood pg_dump would do

COPY (SELECT * FROM fooslice) TO ...
rather than
COPY foo TO ...
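
For illustration, the view itself might be as simple as (the column
and cutoff are just examples):

CREATE VIEW fooslice AS
    SELECT * FROM foo
    WHERE created_at > now() - interval '28 days';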

While we could make the view mapping into a separate filtering file as you
suggest, that's more complexity and also a little more dangerous in an
action-at-a-distance way, so I'd rather have people be very specific in the
mapping on the command line.

I could swear I made a POC years ago for something like this, but it's long
gone. If I get some time, I'll give it a fresh attempt.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support

#4 Kirk Wolak
wolakk@gmail.com
In reply to: Tom Lane (#2)
Re: PoC: pg_dump --filter-data <file> (like Oracle Where Clause on RMAN for specific tables)

On Fri, Aug 15, 2025 at 12:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kirk Wolak <wolakk@gmail.com> writes:

We have a handful of tables that have 90% of the data going back 30 years.

We NEVER restore this data to Dev or Staging. We used a special RMAN
backup where these tables had a "WHERE clause" applied to them during the
backup/dump process.

Have you considered partitioning these tables by date and then not
dumping the older partitions? That would fit into existing
functionality a lot better.

regards, tom lane

I had not considered it until now. But since we are talking about
keeping only 7 - 28 days out of decades of data, I would assume that
window would require a fair bit of ongoing maintenance in production,
making it a tough sell to the client.

#5 Kirk Wolak
wolakk@gmail.com
In reply to: Greg Sabino Mullane (#3)
Re: PoC: pg_dump --filter-data <file> (like Oracle Where Clause on RMAN for specific tables)

On Fri, Aug 15, 2025 at 1:25 PM Greg Sabino Mullane <htamfids@gmail.com>
wrote:

I've seen this idea pop up over the years, and it's still a good one.
Rather than invent new flags, I think a better approach would be to
convince pg_dump to dump a view, such that table foo has a view fooslice to
limit / filter the output. Then we can simply do:

pg_dump -t foo=view:fooslice
..
While we could make the view mapping into a separate filtering file as you
suggest, that's more complexity and also a little more dangerous in an
action-at-a-distance way, so I'd rather have people be very specific in the
mapping on the command line.

Hmmm, first, we are talking about a full pg_dump with some data
filtering, so I don't think we can avoid adding some kind of switch.
Being able to do it on the command line is great for one or two
tables, but it gets unwieldy pretty quickly. Also, maintaining the
slices inside a production environment gets awkward when different
dumps use different parameters (our monthly dump goes back 35 days),
so for us, having different files with different settings makes the
most sense.
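
For example, with the proposed format we might keep two small files
(table and column names are illustrative):

daily.filter:
  public.audit_log = WHERE logged_at > now() - interval '7 days'

monthly.filter:
  public.audit_log = WHERE logged_at > now() - interval '35 days'

and pass the appropriate one via --filter-data=daily.filter or
--filter-data=monthly.filter.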

But this is why I am asking for input/feedback. The feedback is producing
ideas...

Thanks!

#6 jian he
jian.universality@gmail.com
In reply to: Kirk Wolak (#5)
Re: PoC: pg_dump --filter-data <file> (like Oracle Where Clause on RMAN for specific tables)

hi.

I just found this earlier thread:
/messages/by-id/CALAY4q8o00Sg5nemi2AuqNuLvmGLaR6br+YrvM41ZuSPjM2Qvg@mail.gmail.com
though maybe it's not all that helpful.

IMV,
pg_dump --option="view_name"
is better than
pg_dump -d cary --where="test1:a3 = ( select max(aa1) from test2 )" > testdump2

#7 Nikolay Samokhvalov
nik@postgres.ai
In reply to: jian he (#6)
Re: PoC: pg_dump --filter-data <file> (like Oracle Where Clause on RMAN for specific tables)

On Sat, Aug 23, 2025 at 8:05 AM jian he <jian.universality@gmail.com> wrote:

hi.

I just found this earlier thread:
/messages/by-id/CALAY4q8o00Sg5nemi2AuqNuLvmGLaR6br+YrvM41ZuSPjM2Qvg@mail.gmail.com
though maybe it's not all that helpful.

IMV,
pg_dump --option="view_name"
is better than
pg_dump -d cary --where="test1:a3 = ( select max(aa1) from test2 )" > testdump2

In some cases, we cannot or don't want to create a view. For example,
view creation might be banned in a project :) to avoid
dependency-management headaches. Or we simply might lack permissions
for DDL.

Back to the idea of being able to dump with an arbitrary filter -- I
think it's a great idea. COPY supports an arbitrary SELECT as input
and STDOUT as output, and that is what we usually use via psql, but in
doing so we lose a lot of the important mechanics pg_dump has.
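
For example, something along these lines (names purely illustrative):

psql -d mydb -c "\copy (SELECT * FROM audit_log WHERE logged_at > now() - interval '7 days') TO STDOUT" > audit_log_recent.copy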

At the same time, the approach with views brings much more than just
filtering out some rows -- we can do arbitrary transformations,
including projection, aggregation, and joining.

Ideally, it would be awesome to have a concept of a "virtual view",
implemented at the pg_dump level, to support any kind of
transformation while avoiding the need for DDL permissions or schema
changes. This could give huge freedom and enable lots of workflows
(e.g., for testing, replacing actual sensitive values with random data
on the fly would be extremely helpful!)

Nik