PostgreSQL DB in prod, test, debug

Started by Simon Connah · about 2 years ago · 8 messages · general
#1Simon Connah
simon.n.connah@protonmail.com

Hi,

This project uses Python 3.12, Flask, psycopg3 and PostgreSQL 15.

This is probably a stupid question so I apologise in advance.

I'm building a website using PostgreSQL. So far I've just been doing dev work on it and manually playing around with the database when I needed new tables or functions, for example. Now I want to start doing automated testing, so I need to import a clean snapshot of the database with no data and then run the automated tests against it.

What I think is the best way to do this is to do a pg_dump of the database (using the --schema-only flag) and then load it into a test-only database that gets created at the start of the unit tests and destroyed at the end. The automated tests will insert, update, delete and select data to test if it all still works.

My main question is: does this sound OK? And if so, is there a nice way to automate the dump/restore in Python?

Simon.

#2Daniel Gustafsson
daniel@yesql.se
In reply to: Simon Connah (#1)
Re: PostgreSQL DB in prod, test, debug

On 14 Feb 2024, at 10:59, Simon Connah <simon.n.connah@protonmail.com> wrote:

This is probably a stupid question so I apologise in advance.

There is no such thing.

What I think is the best way to do this is to do a pg_dump of the database (using the --schema-only flag) and then load it into a test only database that gets created at the start of the unit tests and destroyed at the end. The automated tests will insert, update, delete and select data to test if it all still works.

If the source of truth for your schema is the database, then sure. If the
source of truth is a .sql file in your source code repository then you should
use that. In essence, create the test database identically to how you create
the production database to ensure that you are testing what you will run in
production.
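Roughly, in Python, creating the test database from a checked-in .sql file could look like this (a minimal sketch; the file name schema.sql and the database name are just examples):

```python
import subprocess

def schema_load_cmd(dbname: str, schema_file: str) -> list[str]:
    # ON_ERROR_STOP makes psql exit non-zero on the first error instead
    # of ploughing on, so a broken schema fails the test run immediately.
    return ["psql", "-d", dbname, "-v", "ON_ERROR_STOP=1", "-f", schema_file]

def create_test_db(dbname: str = "myapp_test",
                   schema_file: str = "schema.sql") -> None:
    # Create a throwaway database and load the checked-in schema into it.
    subprocess.run(["createdb", dbname], check=True)
    subprocess.run(schema_load_cmd(dbname, schema_file), check=True)

def drop_test_db(dbname: str = "myapp_test") -> None:
    subprocess.run(["dropdb", "--if-exists", dbname], check=True)
```

Call create_test_db() at the start of the test run and drop_test_db() at the end; the same schema file then builds both test and production.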

--
Daniel Gustafsson

#3Erik Wienhold
ewie@ewie.name
In reply to: Simon Connah (#1)
Re: PostgreSQL DB in prod, test, debug

On 2024-02-14 10:59 +0100, Simon Connah wrote:

This project uses Python 3.12, Flask, psycopg3 and PostgreSQL 15.

This is probably a stupid question so I apologise in advance.

I'm building a website using PostgreSQL. So far I've just been doing
dev work on it and manually playing around with the database when I
needed new tables or functions, for example. Now I want to start doing
automated testing, so I need to import a clean snapshot of the database
with no data and then run the automated tests against it.

What I think is the best way to do this is to do a pg_dump of the
database (using the --schema-only flag)

You create a dump from the prod database each time? Yikes.

and then load it into a test only database that gets created at the
start of the unit tests and destroyed at the end. The automated tests
will insert, update, delete and select data to test if it all still
works.

My main question is does this sound OK? And if so is there a nice way
to automate the dump / restore in Python?

The database schema should be defined by migration scripts that you also
check into version control with the rest of your application sources.
Some people also prefer a separate repository just for the database
schema, depending on how tightly coupled database and application are,
or if there are other applications relying on the database schema.

I use Sqitch[1] which works well if you want to track an existing
database schema. Alembic is popular in Python but I don't know how it
works with an existing schema because Alembic migrations are usually not
written in plain SQL.

This is also useful for automated deployment because it allows you to
migrate the database to a specific schema version that is necessary for
your application.

For testing, I have a Bash script that starts a Docker container with
Postgres and then runs sqitch-deploy before running pytest. That can of
course be adapted to spin up a local Postgres instance instead. I also
use pgTAP[2] with that to just test the database schema.

You can also use testcontainers[3] to start/stop a Docker container in
pytest and run sqitch with subprocess before running all tests.
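A rough sketch of that (the db:pg:// target URI is Sqitch's format for Postgres; the attribute names follow recent testcontainers-python releases, and it needs the testcontainers package plus a running Docker daemon):

```python
import subprocess

def sqitch_target(user: str, password: str, host: str,
                  port: str, dbname: str) -> str:
    # Sqitch target URI for a PostgreSQL database.
    return f"db:pg://{user}:{password}@{host}:{port}/{dbname}"

def start_postgres_and_deploy():
    # Imported lazily so sqitch_target stays usable without Docker.
    from testcontainers.postgres import PostgresContainer

    pg = PostgresContainer("postgres:15")
    pg.start()
    target = sqitch_target(
        pg.username, pg.password,
        pg.get_container_host_ip(), pg.get_exposed_port(5432),
        pg.dbname,
    )
    subprocess.run(["sqitch", "deploy", target], check=True)
    return pg  # caller stops the container, e.g. in fixture teardown
```

In pytest you would wrap start_postgres_and_deploy() in a session-scoped fixture so the container is started once per test run.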

[1]: https://sqitch.org/
[2]: https://pgtap.org/
[3]: https://testcontainers-python.readthedocs.io/en/latest/postgres/README.html

--
Erik

#4Simon Connah
simon.n.connah@protonmail.com
In reply to: Daniel Gustafsson (#2)
Re: PostgreSQL DB in prod, test, debug

On Wednesday, 14 February 2024 at 10:25, Daniel Gustafsson <daniel@yesql.se> wrote:

On 14 Feb 2024, at 10:59, Simon Connah <simon.n.connah@protonmail.com> wrote:

This is probably a stupid question so I apologise in advance.

There is no such thing.

What I think is the best way to do this is to do a pg_dump of the database (using the --schema-only flag) and then load it into a test only database that gets created at the start of the unit tests and destroyed at the end. The automated tests will insert, update, delete and select data to test if it all still works.

If the source of truth for your schema is the database, then sure. If the
source of truth is a .sql file in your source code repository then you should
use that. In essence, create the test database identically to how you create
the production database to ensure that you are testing what you will run in
production.

--
Daniel Gustafsson

Thank you. I'm a bit disorganised with things at the moment. I'm kinda developing this site in an ad hoc manner which is probably a bad idea so I'm trying to fix up a few mistakes I made early on.

In terms of the database I've just added new functions as needed which I use instead of direct SQL in my Python code. Maybe I should sit down with Visio and try and do a diagram before I go any further?

I'll just write a simple script for backup and restore and call it before each test run, but have the schema-only dump come from production. At this point it doesn't really matter as the website is so small.
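Something like this, maybe (a sketch; the connection string and file names are placeholders):

```python
import subprocess

def dump_schema_cmd(prod_dsn: str) -> list[str]:
    # --schema-only skips all data; --no-owner keeps the dump loadable
    # under a different role on the test side.
    return ["pg_dump", "--schema-only", "--no-owner", prod_dsn]

def dump_schema(prod_dsn: str, out_file: str = "schema.sql") -> None:
    with open(out_file, "w") as f:
        subprocess.run(dump_schema_cmd(prod_dsn), stdout=f, check=True)

def restore_schema(test_dbname: str,
                   schema_file: str = "schema.sql") -> None:
    subprocess.run(
        ["psql", "-d", test_dbname, "-v", "ON_ERROR_STOP=1",
         "-f", schema_file],
        check=True,
    )
```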

Simon.

#5Simon Connah
simon.n.connah@protonmail.com
In reply to: Erik Wienhold (#3)
Re: PostgreSQL DB in prod, test, debug

On Wednesday, 14 February 2024 at 10:38, Erik Wienhold <ewie@ewie.name> wrote:

On 2024-02-14 10:59 +0100, Simon Connah wrote:

This project uses Python 3.12, Flask, psycopg3 and PostgreSQL 15.

This is probably a stupid question so I apologise in advance.

I'm building a website using PostgreSQL. So far I've just been doing
dev work on it and manually playing around with the database when I
needed new tables or functions, for example. Now I want to start doing
automated testing, so I need to import a clean snapshot of the database
with no data and then run the automated tests against it.

What I think is the best way to do this is to do a pg_dump of the
database (using the --schema-only flag)

You create a dump from the prod database each time? Yikes.

Sorry. That came out wrong. I don't take a dump each time I run tests but I will generally take a full dump when working on something specific to the database.

and then load it into a test only database that gets created at the
start of the unit tests and destroyed at the end. The automated tests
will insert, update, delete and select data to test if it all still
works.

My main question is does this sound OK? And if so is there a nice way
to automate the dump / restore in Python?

The database schema should be defined by migration scripts that you also
check into version control with the rest of your application sources.
Some people also prefer a separate repository just for the database
schema, depending on how tightly coupled database and application are,
or if there are other applications relying on the database schema.

I use Sqitch[1] which works well if you want to track an existing
database schema. Alembic is popular in Python but I don't know how it
works with an existing schema because Alembic migrations are usually not
written in plain SQL.

This is also useful for automated deployment because it allows you to
migrate the database to a specific schema version that is necessary for
your application.

For testing, I have a Bash script that starts a Docker container with
Postgres and then runs sqitch-deploy before running pytest. That can of
course be adapted to spin up a local Postgres instance instead. I also
use pgTAP[2] with that to just test the database schema.

You can also use testcontainers[3] to start/stop a Docker container in
pytest and run sqitch with subprocess before running all tests.

[1] https://sqitch.org/
[2] https://pgtap.org/
[3] https://testcontainers-python.readthedocs.io/en/latest/postgres/README.html

--
Erik

Thank you. Sqitch looks like a useful tool. I'll certainly look into all of that.

I've never really used Docker before. Instead I just spin up Vultr instances but I can read about Docker as well.

Simon.

#6Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Erik Wienhold (#3)
Re: PostgreSQL DB in prod, test, debug

On 2/14/24 02:38, Erik Wienhold wrote:

On 2024-02-14 10:59 +0100, Simon Connah wrote:

The database schema should be defined by migration scripts that you also
check into version control with the rest of your application sources.
Some people also prefer a separate repository just for the database
schema, depending on how tightly coupled database and application are,
or if there are other applications relying on the database schema.

I use Sqitch[1] which works well if you want to track an existing
database schema. Alembic is popular in Python but I don't know how it
works with an existing schema because Alembic migrations are usually not
written in plain SQL.

Just to be clear, you can start from scratch with Sqitch. The only thing
that needs to be done outside of it is the original CREATE DATABASE
statement. After that you can build the contents of the database from
the beginning. At any point you can move forward (deploy) or
backward (revert). For your use case the important feature is targets,
where each target is a different instance of the database; in your case
prod, test and debug. This allows you to make changes in test/debug and,
when they are verified good, deploy them to prod.
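The targets are just named URIs in sqitch.conf, roughly like this (host and database names here are made up):

```ini
[core]
    engine = pg

[target "prod"]
    uri = db:pg://app@prod-host/myapp

[target "test"]
    uri = db:pg://localhost/myapp_test

[target "debug"]
    uri = db:pg://localhost/myapp_debug
```

Then `sqitch deploy test` during development and, once verified, `sqitch deploy prod`.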

--
Adrian Klaver
adrian.klaver@aklaver.com

#7Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Daniel Gustafsson (#2)
Re: PostgreSQL DB in prod, test, debug

On 2024-02-14 11:25:48 +0100, Daniel Gustafsson wrote:

On 14 Feb 2024, at 10:59, Simon Connah <simon.n.connah@protonmail.com> wrote:

This is probably a stupid question so I apologise in advance.

There is no such thing.

What I think is the best way to do this is to do a pg_dump of the
database (using the --schema-only flag) and then load it into a test
only database that gets created at the start of the unit tests and
destroyed at the end. The automated tests will insert, update,
delete and select data to test if it all still works.

If the source of truth for your schema is the database, then sure. If the
source of truth is a .sql file in your source code repository then you should
use that.

I sort of do both for one of my projects:

I originally created the SQL script by running pg_dump on a manually
constructed test database (I do have code to create the empty schema,
but I had to insert the test data manually). That script went into the
git repo.

The test files all contain a fixture which drops and recreates the test
database using that sql file.

When I add a new test case I try to make do with the existing test data.

When I need additional data for a new test case, I create a new pristine
test database using the sql file, add the new data, and then create a
new sql file using pg_dump which is then committed with the test cases.

Same for migrations: If I need to migrate the schema, I run the
migration on the test database, then dump and commit it.

This project is small enough (86 tests in 10 files) that all test cases
can use the same test data. However, I could easily use different test
data for different tests.
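The fixture boils down to something like this (a sketch; the database name and testdata.sql are placeholders, and with pytest the same body works as a yield fixture):

```python
import subprocess
from contextlib import contextmanager

def recreate_cmds(dbname: str, sql_file: str) -> list[list[str]]:
    # Drop any leftover copy, recreate, then load the committed dump.
    return [
        ["dropdb", "--if-exists", dbname],
        ["createdb", dbname],
        ["psql", "-d", dbname, "-v", "ON_ERROR_STOP=1", "-f", sql_file],
    ]

@contextmanager
def fresh_test_db(dbname: str = "app_test",
                  sql_file: str = "testdata.sql"):
    for cmd in recreate_cmds(dbname, sql_file):
        subprocess.run(cmd, check=True)
    try:
        yield dbname
    finally:
        subprocess.run(["dropdb", dbname], check=True)
```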

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#8Jay Stanley
beansboy@cruzio.com
In reply to: Peter J. Holzer (#7)
Re: PostgreSQL DB in prod, test, debug

On 2024-02-15 23:51, Peter J. Holzer wrote:

On 2024-02-14 11:25:48 +0100, Daniel Gustafsson wrote:

On 14 Feb 2024, at 10:59, Simon Connah <simon.n.connah@protonmail.com> wrote:

This is probably a stupid question so I apologise in advance.

There is no such thing.

What I think is the best way to do this is to do a pg_dump of the
database (using the --schema-only flag) and then load it into a test
only database that gets created at the start of the unit tests and
destroyed at the end. The automated tests will insert, update,
delete and select data to test if it all still works.

If the source of truth for your schema is the database, then sure. If the
source of truth is a .sql file in your source code repository then you
should use that.

I sort of do both for one of my projects:

I originally created the SQL script by running pg_dump on a manually
constructed test database (I do have code to create the empty schema,
but I had to insert the test data manually). That script went into the
git repo.

The test files all contain a fixture which drops and recreates the test
database using that sql file.

When I add a new test case I try to make do with the existing test data.

When I need additional data for a new test case, I create a new pristine
test database using the sql file, add the new data, and then create a
new sql file using pg_dump which is then committed with the test cases.

Same for migrations: If I need to migrate the schema, I run the
migration on the test database, then dump and commit it.

This project is small enough (86 tests in 10 files) that all test cases
can use the same test data. However, I could easily use different test
data for different tests.

hp

I had a very similar issue a few years ago; our large-ish codebase needs
a lot of test suites (about 100) to run daily against specific database
conditions -- each test involves specific data in around 50 tables. At
first we had 'source of truth' SQL scripts checked into git, and with
Jenkins would run each to create a database for the test, load it with
the exact data needed for that suite of tests, run the tests and record
the results, then drop the database.

This worked fine for a few tests but became unmanageable as more tests
were added and the data volume increased. Instead, I created a 'master'
script that creates one _database template_ for each suite of tests by
creating a blank database and running those same build scripts in git,
then disconnecting from that target database and switching it to become
a template. This was re-done very infrequently -- only when we wanted
to use a different testing database, or needed to refresh test content,
drop tests or add more tests. That's right - we have about 100
templates; I've found copying a template is FAR faster than reloading it
from SQL.

When each test is run, it creates a new database from the template
appropriate for that suite of tests. When the test suite completes, the
new database is dropped. This sped up our tests by at least two orders
of magnitude, and it was also more reliable in a way, because each test
gets a binary copy of the test database: exactly how vacuumed each
table is, the exact state of the indexes, every block in the same place,
etc. Run times were definitely more consistent.
Note that while a database is in template mode, it cannot be changed
(unless a DBA switches it back to non-template-mode).

This has been in production for a few years, in multiple projects, and
hasn't really hit any issues; the one caveat is that you can't switch a
database to 'template mode' if anything is logged into the target.
According to the docs, I don't see an upper limit on the number of
template databases but there must be one -- I haven't tested it with
1000s of templates.

See the docs for 'CREATE DATABASE dbname TEMPLATE templatename' as well as
'ALTER DATABASE dbname IS_TEMPLATE true' (IS_TEMPLATE sets the
datistemplate column in pg_database).
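In psycopg3 terms, cloning a template could be sketched like this (CREATE DATABASE cannot run inside a transaction block, hence autocommit; names are examples, and CREATE DATABASE takes no bind parameters, so identifiers must be quoted rather than passed as values):

```python
def clone_from_template_sql(new_db: str, template: str) -> str:
    # Identifiers are double-quoted; names are assumed to be simple
    # here and should be validated or escaped in real code.
    return f'CREATE DATABASE "{new_db}" TEMPLATE "{template}"'

def clone_from_template(admin_dsn: str, new_db: str,
                        template: str) -> None:
    import psycopg  # lazy import: the SQL builder above needs no driver

    with psycopg.connect(admin_dsn, autocommit=True) as conn:
        conn.execute(clone_from_template_sql(new_db, template))
```

Each suite then runs against its freshly cloned database and drops it afterwards.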

- jay stanley
- https://cyc.com; The Next Generation of Enterprise AI