Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

Started by Andres Freund on 2023-08-07 · 39 messages · pgsql-hackers
#1 Andres Freund
andres@anarazel.de

Hi,

As some of you might have seen when running CI, cirrus-ci is restricting how
many CI cycles everyone can use for free (announcement at [1]). This takes
effect September 1st.

This obviously has consequences both for individual users of CI as well as
cfbot.

The first thing I think we should do is to lower the cost of CI. One thing I
had not entirely realized previously is that macos CI is by far the most
expensive CI to provide. That's not just the case with cirrus-ci, but also
with other providers. See the series of patches described later in this email.

To me, the situation for cfbot is different than the one for individual
users.

IMO, for the individual user case it's important to use CI for "free", without
a whole lot of complexity. Which imo rules out approaches like providing
$cloud_provider compute accounts; that's too much setup work. With the
improvements detailed below, cirrus' free CI would last for about ~65 runs /
month.

For cfbot I hope we can find funding to pay for compute to use for CI. The
most expensive bit, by far, is macos. To a significant degree due to macos
licensing terms not allowing more than 2 VMs on a physical host :(.

The reasons we chose cirrus-ci were

a) Ability to use full VMs, rather than a pre-selected set of environments,
which allows us to test a larger number of platforms

b) Ability to link to log files without requiring an account. E.g. github
actions doesn't allow viewing logs unless logged in.

c) Amount of compute available.

The set of free CI providers has shrunk since we chose cirrus, as have the
"free" resources provided. I started a wiki page, quite incomplete as of now,
at [4].

Potential paths forward for individual CI:

- migrate wholesale to another CI provider

- split CI tasks across different CI providers, rely on github et al
displaying the CI status for different platforms

- give up

Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.

Opinions as to what to do?

The attached series of patches:

1) Makes startup of macos instances faster, using more efficient caching of
the required packages. Also submitted as [2].

2) Introduces a template initdb that's reused during the tests. Also submitted
as [3].

3) Remove use of -DRANDOMIZE_ALLOCATED_MEMORY from macos tasks. It's
expensive. And CI also uses asan on linux, so I don't think it's really
needed.

4) Switch tasks to use debugoptimized builds. Previously many tasks used -Og,
to get decent backtraces etc. But the amount of CPU burned that way is too
large. One issue with that is that use of ccache becomes much more crucial;
uncached build times increase significantly.

5) Move use of -Dsegsize_blocks=6 from macos to linux

Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
could stop covering both meson and autoconf segsize_blocks. It does affect
runtime on linux as well.

6) Disable write cache flushes on windows

It's a bit ugly to do this without using the UI... Shaves off about 30s
from the tests.

7) pg_regress only checked once a second whether postgres started up, but
startup is usually much faster. Use pg_ctl's logic. It might be worth
replacing the use of psql with direct use of libpq in pg_regress instead;
it looks like the overhead of repeatedly starting psql is noticeable.

FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
like the following (depends on caching etc):

task costs in credits
linux-sanity: 0.01
linux-compiler-warnings: 0.05
linux-meson: 0.07
freebsd : 0.08
linux-autoconf: 0.09
windows : 0.18
macos : 0.28
total task runtime is 40.8
cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month
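Redoing the arithmetic from the rounded per-task figures above (the thread's 66.10 comes from unrounded values, so this sketch lands slightly lower):

```python
# Rounded per-task credit costs from the list above.
task_credits = {
    "linux-sanity": 0.01,
    "linux-compiler-warnings": 0.05,
    "linux-meson": 0.07,
    "freebsd": 0.08,
    "linux-autoconf": 0.09,
    "windows": 0.18,
    "macos": 0.28,
}
cost_per_run = sum(task_credits.values())      # ~0.76 credits per CI run
monthly_credits = 50
runs_per_month = monthly_credits / cost_per_run
print(round(cost_per_run, 2), round(runs_per_month, 1))  # 0.76 65.8
```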

Greetings,

Andres Freund

[1]: https://cirrus-ci.org/blog/2023/07/17/limiting-free-usage-of-cirrus-ci/
[2]: /messages/by-id/20230805202539.r3umyamsnctysdc7@awork3.anarazel.de
[3]: /messages/by-id/20220120021859.3zpsfqn4z7ob7afz@alap3.anarazel.de

Attachments:

v1-0001-ci-macos-used-cached-macports-install.patch (+122 -39)
v1-0002-Use-template-initdb-in-tests.patch (+161 -45)
v1-0003-ci-macos-Remove-use-of-DRANDOMIZE_ALLOCATED_MEMOR.patch (+0 -2)
v1-0004-ci-switch-tasks-to-debugoptimized-build.patch (+8 -9)
v1-0005-ci-Move-use-of-Dsegsize_blocks-6-from-macos-to-li.patch (+1 -2)
v1-0006-ci-windows-Disabling-write-cache-flushing-during-.patch (+5 -1)
v1-0007-regress-Check-for-postgres-startup-completion-mor.patch (+5 -3)
v1-0008-ci-Don-t-specify-amount-of-memory.patch (+0 -5)
#2 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

Hi,

On 2023-08-07 19:15:41 -0700, Andres Freund wrote:

The set of free CI providers has shrunk since we chose cirrus, as have the
"free" resources provided. I started a wiki page, quite incomplete as of now,
at [4].

Oops, as Thomas just noticed, I left off that link:

[4]: https://wiki.postgresql.org/wiki/CI_Providers

Greetings,

Andres Freund

#3 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#1)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

On 08/08/2023 05:15, Andres Freund wrote:

IMO, for the individual user case it's important to use CI for "free", without
a whole lot of complexity. Which imo rules approaches like providing
$cloud_provider compute accounts, that's too much setup work.

+1

With the improvements detailed below, cirrus' free CI would last
about ~65 runs / month.

I think that's plenty.

For cfbot I hope we can find funding to pay for compute to use for CI.

+1

Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.

Opinions as to what to do?

The resources for running our own system aren't free either. I'm sure we
can get sponsors for the cirrus-ci credits, or use donations.

I have been quite happy with Cirrus CI overall.

The attached series of patches:

All of this makes sense to me, although I don't use macos myself.

5) Move use of -Dsegsize_blocks=6 from macos to linux

Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
could stop covering both meson and autoconf segsize_blocks. It does affect
runtime on linux as well.

Could we have a comment somewhere on why we use -Dsegsize_blocks on
these particular CI runs? It seems pretty random. I guess the idea is to
have one autoconf task and one meson task with that option, to check
that the autoconf/meson option works?

6) Disable write cache flushes on windows

It's a bit ugly to do this without using the UI... Shaves off about 30s
from the tests.

A brief comment would be nice: "We don't care about persistence over
hard crashes in the CI, so disable write cache flushes to speed it up."

--
Heikki Linnakangas
Neon (https://neon.tech)

#4 Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#1)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

On 08.08.23 04:15, Andres Freund wrote:

Potential paths forward for individual CI:

- migrate wholesale to another CI provider

- split CI tasks across different CI providers, rely on github et al
displaying the CI status for different platforms

- give up

With the proposed optimizations, it seems you can still do a fair amount
of work under the free plan.

Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.

I think we should use the "compute credits" plan from Cirrus CI. It
should be possible to estimate the costs for that. Money is available,
I think.

#5 Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#4)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

Hi,

On 2023-08-08 16:28:49 +0200, Peter Eisentraut wrote:

On 08.08.23 04:15, Andres Freund wrote:

Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.

I think we should use the "compute credits" plan from Cirrus CI. It should
be possible to estimate the costs for that. Money is available, I think.

Unfortunately just doing that seems like it would end up considerably on the
too-expensive side. Here are the stats for last month's cfbot runtimes
(provided by Thomas):

task_name | sum
------------------------------------------------+------------
FreeBSD - 13 - Meson | 1017:56:09
Windows - Server 2019, MinGW64 - Meson | 00:00:00
SanityCheck | 76:48:41
macOS - Ventura - Meson | 873:12:43
Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
Linux - Debian Bullseye - Autoconf | 830:17:26
Linux - Debian Bullseye - Meson | 860:37:21
CompilerWarnings | 935:30:35
(8 rows)

If I did the math right, that's about 7000 credits (and 1 credit costs 1 USD).

task costs in credits
linux-sanity: 55.30
linux-autoconf: 598.04
linux-meson: 619.40
linux-compiler-warnings: 674.28
freebsd : 732.24
windows : 1201.09
macos : 3143.52
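Summing the per-task monthly credit figures above reproduces the "about 7000 credits" estimate (at 1 USD per credit):

```python
# Per-task monthly credit costs quoted above, for last month's cfbot load.
monthly_task_credits = {
    "linux-sanity": 55.30,
    "linux-autoconf": 598.04,
    "linux-meson": 619.40,
    "linux-compiler-warnings": 674.28,
    "freebsd": 732.24,
    "windows": 1201.09,
    "macos": 3143.52,
}
total = sum(monthly_task_credits.values())
macos_share = monthly_task_credits["macos"] / total
print(round(total, 2))           # ~7023.87 credits, i.e. ~7000 USD/month
print(f"{macos_share:.0%}")      # macos alone is ~45% of the bill
```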

Now, those times are before optimizing test runtime. And besides optimizing
the tasks, we can also optimize not running tests for docs patches etc. And
optimize cfbot to schedule a bit better.

But still, the costs don't look realistic to me.

If instead we were to use our own GCP account, it's a lot less. t2d-standard-4
instances, which are faster than what we use right now, cost $0.168984 / hour
as "normal" instances and $0.026764 / hour as "spot" instances right now [1].
Windows VMs are considerably more expensive due to licensing - $0.184/h in
addition.

Assuming spot instances, linux+freebsd tasks would cost ~100USD month (maybe
10-20% more in reality, due to a) spot instances getting terminated requiring
retries and b) disks).

Windows would be ~255 USD / month (same retries caveats).
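A rough sanity check of those estimates, using decimalized runtime hours from the cfbot table above and the quoted per-hour prices (the windows result comes out somewhat above the ~255 USD figure, presumably due to rounding differences; retries and disks are excluded, as noted):

```python
# Quoted GCP prices (USD/hour): t2d-standard-4 spot, plus the Windows
# licensing surcharge.
spot_price = 0.026764
windows_license = 0.184

# cfbot monthly runtimes from the table above, converted to decimal hours.
linux_freebsd_hours = 1017.94 + 76.81 + 830.29 + 860.62 + 935.51  # ~3721
windows_hours = 1251.14

linux_cost = linux_freebsd_hours * spot_price
windows_cost = windows_hours * (spot_price + windows_license)
print(round(linux_cost, 2), round(windows_cost, 2))  # ~99.59 ~263.69
```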

Given the cost of macos, it seems like it'd be by far the most affordable
to just buy 1-2 mac minis (2x ~660USD) and stick them on a shelf somewhere, as
persistent runners. Cirrus has builtin macos virtualization support - but can
only host two VMs on each mac, due to macos licensing restrictions. A single
mac mini would suffice to keep up with our unoptimized monthly runtime
(although there likely would be some overhead).

Greetings,

Andres Freund

[1]: https://cloud.google.com/compute/all-pricing

#6 Tristan Partin
tristan@neon.tech
In reply to: Andres Freund (#1)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:

FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
like the following (depends on caching etc):

task costs in credits
linux-sanity: 0.01
linux-compiler-warnings: 0.05
linux-meson: 0.07
freebsd : 0.08
linux-autoconf: 0.09
windows : 0.18
macos : 0.28
total task runtime is 40.8
cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month

I am not in the loop on the autotools vs meson stuff. How much longer do
we anticipate keeping autotools around? Seems like it could be a good
opportunity to reduce some CI usage if autotools were finally dropped,
but I know there are still outstanding tasks to complete.

Back of the napkin math says autotools is about 12% of the credit cost,
though I haven't looked to see if linux-meson and linux-autotools are
1:1.

--
Tristan Partin
Neon (https://neon.tech)

#7 Andres Freund
andres@anarazel.de
In reply to: Tristan Partin (#6)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

Hi,

On 2023-08-08 10:25:52 -0500, Tristan Partin wrote:

On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:

FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
like the following (depends on caching etc):

task costs in credits
linux-sanity: 0.01
linux-compiler-warnings: 0.05
linux-meson: 0.07
freebsd : 0.08
linux-autoconf: 0.09
windows : 0.18
macos : 0.28
total task runtime is 40.8
cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month

I am not in the loop on the autotools vs meson stuff. How much longer do we
anticipate keeping autotools around?

I think it depends in what fashion. We've been talking about supporting
building out-of-tree modules with "pgxs" for at least a 5 year support
window. But the replacement isn't yet finished [1], so that clock hasn't yet
started ticking.

Seems like it could be a good opportunity to reduce some CI usage if
autotools were finally dropped, but I know there are still outstanding tasks
to complete.

Back of the napkin math says autotools is about 12% of the credit cost,
though I haven't looked to see if linux-meson and linux-autotools are 1:1.

The autoconf task is actually doing quite useful stuff right now, leaving the
use of configure aside, as it builds with address sanitizer. Without that it'd
be a lot faster. But we'd lose, imo quite important, coverage. The tests
would run a bit faster with meson, but it'd be overall a difference on the
margins.

Greetings,

Andres Freund

[1]: https://github.com/anarazel/postgres/tree/meson-pkgconfig

#8 Tristan Partin
tristan@neon.tech
In reply to: Andres Freund (#7)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

On Tue Aug 8, 2023 at 10:38 AM CDT, Andres Freund wrote:

Hi,

On 2023-08-08 10:25:52 -0500, Tristan Partin wrote:

On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:

FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
like the following (depends on caching etc):

task costs in credits
linux-sanity: 0.01
linux-compiler-warnings: 0.05
linux-meson: 0.07
freebsd : 0.08
linux-autoconf: 0.09
windows : 0.18
macos : 0.28
total task runtime is 40.8
cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month

I am not in the loop on the autotools vs meson stuff. How much longer do we
anticipate keeping autotools around?

I think it depends in what fashion. We've been talking about supporting
building out-of-tree modules with "pgxs" for at least a 5 year support
window. But the replacement isn't yet finished [1], so that clock hasn't yet
started ticking.

Seems like it could be a good opportunity to reduce some CI usage if
autotools were finally dropped, but I know there are still outstanding tasks
to complete.

Back of the napkin math says autotools is about 12% of the credit cost,
though I haven't looked to see if linux-meson and linux-autotools are 1:1.

The autoconf task is actually doing quite useful stuff right now, leaving the
use of configure aside, as it builds with address sanitizer. Without that it'd
be a lot faster. But we'd lose, imo quite important, coverage. The tests
would run a bit faster with meson, but it'd be overall a difference on the
margins.

[1] https://github.com/anarazel/postgres/tree/meson-pkgconfig

Makes sense. Please let me know if I can help you out in any way for the
v17 development cycle besides what we have already talked about.

--
Tristan Partin
Neon (https://neon.tech)

#9 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#5)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

On 2023-Aug-08, Andres Freund wrote:

Given the cost of macos, it seems like it'd be by far the most affordable
to just buy 1-2 mac minis (2x ~660USD) and stick them on a shelf somewhere, as
persistent runners. Cirrus has builtin macos virtualization support - but can
only host two VMs on each mac, due to macos licensing restrictions. A single
mac mini would suffice to keep up with our unoptimized monthly runtime
(although there likely would be some overhead).

If using persistent workers is an option, maybe we should explore that.
I think we could move all or some of the Linux - Debian builds to
hardware that we already have in shelves (depending on how much compute
power is really needed.)

I think using other OSes is more difficult, mostly because I doubt we
want to deal with licenses; but even FreeBSD might not be a realistic
option, at least not in the short term.

Still,

task_name | sum
------------------------------------------------+------------
FreeBSD - 13 - Meson | 1017:56:09
Windows - Server 2019, MinGW64 - Meson | 00:00:00
SanityCheck | 76:48:41
macOS - Ventura - Meson | 873:12:43
Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
Linux - Debian Bullseye - Autoconf | 830:17:26
Linux - Debian Bullseye - Meson | 860:37:21
CompilerWarnings | 935:30:35
(8 rows)

moving just Debian, that might alleviate 76+830+860+935 hours from the
Cirrus infra, which is ~46%. Not bad.
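The ~46% figure checks out against the runtime table (a quick sketch parsing the HH:MM:SS sums above):

```python
# Share of total cfbot runtime from tasks that could move to community
# Linux hardware (SanityCheck, both Debian tasks, CompilerWarnings).
def hours(hms: str) -> float:
    """Convert an 'HHH:MM:SS' runtime sum to decimal hours."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60 + s / 3600

runtimes = {
    "FreeBSD - 13 - Meson": "1017:56:09",
    "Windows - Server 2019, MinGW64 - Meson": "00:00:00",
    "SanityCheck": "76:48:41",
    "macOS - Ventura - Meson": "873:12:43",
    "Windows - Server 2019, VS 2019 - Meson & ninja": "1251:08:06",
    "Linux - Debian Bullseye - Autoconf": "830:17:26",
    "Linux - Debian Bullseye - Meson": "860:37:21",
    "CompilerWarnings": "935:30:35",
}
movable = ("SanityCheck", "Linux - Debian Bullseye - Autoconf",
           "Linux - Debian Bullseye - Meson", "CompilerWarnings")
total = sum(hours(t) for t in runtimes.values())
moved = sum(hours(runtimes[k]) for k in movable)
print(f"{moved / total:.0%}")  # ~46%
```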

(How come Windows - Meson reports allballs?)

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"No tengo por qué estar de acuerdo con lo que pienso"
(Carlos Caszeli)

#10 Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#9)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

Hi,

On 2023-08-08 18:34:58 +0200, Alvaro Herrera wrote:

On 2023-Aug-08, Andres Freund wrote:

Given the cost of macos, it seems like it'd be by far the most affordable
to just buy 1-2 mac minis (2x ~660USD) and stick them on a shelf somewhere, as
persistent runners. Cirrus has builtin macos virtualization support - but can
only host two VMs on each mac, due to macos licensing restrictions. A single
mac mini would suffice to keep up with our unoptimized monthly runtime
(although there likely would be some overhead).

If using persistent workers is an option, maybe we should explore that.
I think we could move all or some of the Linux - Debian builds to
hardware that we already have in shelves (depending on how much compute
power is really needed.)

(76+830+860+935)/((365/12)*24) = 3.7

3.7 instances with 4 "vcores" are busy 100% of the time. So we'd need at least
~16 cpu threads - I think cirrus sometimes uses instances that disable HT, so
it'd perhaps be 16 cores actually.

I think using other OSes is more difficult, mostly because I doubt we
want to deal with licenses; but even FreeBSD might not be a realistic
option, at least not in the short term.

They can be VMs, so that shouldn't be a big issue.

task_name | sum
------------------------------------------------+------------
FreeBSD - 13 - Meson | 1017:56:09
Windows - Server 2019, MinGW64 - Meson | 00:00:00
SanityCheck | 76:48:41
macOS - Ventura - Meson | 873:12:43
Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
Linux - Debian Bullseye - Autoconf | 830:17:26
Linux - Debian Bullseye - Meson | 860:37:21
CompilerWarnings | 935:30:35
(8 rows)

moving just Debian, that might alleviate 76+830+860+935 hours from the
Cirrus infra, which is ~46%. Not bad.

(How come Windows - Meson reports allballs?)

It's mingw64, which we've marked as "manual", because we didn't have the cpu
cycles to run it.

Greetings,

Andres Freund

#11 Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#3)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

Hi,

On 2023-08-08 11:58:25 +0300, Heikki Linnakangas wrote:

On 08/08/2023 05:15, Andres Freund wrote:

With the improvements detailed below, cirrus' free CI would last
about ~65 runs / month.

I think that's plenty.

Not so sure, I would regularly exceed it, I think. But it definitely will
suffice for more casual contributors.

Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.

Opinions as to what to do?

The resources for running our own system aren't free either. I'm sure we can
get sponsors for the cirrus-ci credits, or use donations.

As outlined in my reply to Alvaro, just using credits likely is financially
not viable...

5) Move use of -Dsegsize_blocks=6 from macos to linux

Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
could stop covering both meson and autoconf segsize_blocks. It does affect
runtime on linux as well.

Could we have a comment somewhere on why we use -Dsegsize_blocks on these
particular CI runs? It seems pretty random. I guess the idea is to have one
autoconf task and one meson task with that option, to check that the
autoconf/meson option works?

Hm, some of that was in the commit message, but I should have added it to
.cirrus.yml as well.

Normally, the "relation segment" code basically has no coverage in our tests,
because we (quite reasonably) don't generate tables large enough. We've had
plenty of bugs that we didn't notice due to the code not being exercised
much. So it seemed useful to add CI coverage, by making the segments very
small.

I chose the tasks by looking at how long they took at the time, I think,
adding them to the slower ones.

6) Disable write cache flushes on windows

It's a bit ugly to do this without using the UI... Shaves off about 30s
from the tests.

A brief comment would be nice: "We don't care about persistence over hard
crashes in the CI, so disable write cache flushes to speed it up."

Turns out that patch doesn't work on its own anyway, at least not
reliably... I tested it by interactively logging into a windows vm and testing
it there. It doesn't actually seem to suffice when run in isolation, because
the relevant registry key doesn't yet exist. I haven't yet figured out the
magic incantations for adding the missing "intermediary", but I'm getting
there...

Greetings,

Andres Freund

#12 Robert Treat
xzilla@users.sourceforge.net
In reply to: Andres Freund (#11)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

On Tue, Aug 8, 2023 at 9:26 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-08-08 11:58:25 +0300, Heikki Linnakangas wrote:

On 08/08/2023 05:15, Andres Freund wrote:

With the improvements detailed below, cirrus' free CI would last
about ~65 runs / month.

I think that's plenty.

Not so sure, I would regularly exceed it, I think. But it definitely will
suffice for more casual contributors.

Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.

Opinions as to what to do?

The resources for running our own system aren't free either. I'm sure we can
get sponsors for the cirrus-ci credits, or use donations.

As outlined in my reply to Alvaro, just using credits likely is financially
not viable...

In case it's helpful, from an SPI oriented perspective, $7K/month is
probably an order of magnitude more than what we can sustain, so I
don't see a way to make that work without some kind of additional
magic that includes other non-profits and/or commercial companies
changing donation habits between now and September.

Purchasing a couple of mac minis (and/or similar gear) would be near
trivial though, just a matter of figuring out where/how to host it
(but I think infra can chime in on that if that's what gets decided).

The other likely option would be to seek out cloud credits from one of
the big three (or others); Amazon has continually said they would be
happy to donate more credits to us if we had a use, and I think some
of the other hosting providers have said similarly at times; so we'd
need to ask and hope it's not too bureaucratic.

Robert Treat
https://xzilla.net

#13 Andres Freund
andres@anarazel.de
In reply to: Robert Treat (#12)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

Hi,

On 2023-08-08 22:29:50 -0400, Robert Treat wrote:

In case it's helpful, from an SPI oriented perspective, $7K/month is
probably an order of magnitude more than what we can sustain, so I
don't see a way to make that work without some kind of additional
magic that includes other non-profits and/or commercial companies
changing donation habits between now and September.

Yea, I think that'd make no sense, even if we could afford it. I think the
patches I've written should drop it to 1/2 already. Thomas added some
throttling to push it down further.

Purchasing a couple of mac-mini's (and/or similar gear) would be near
trivial though, just a matter of figuring out where/how to host it
(but I think infra can chime in on that if that's what get's decided).

Cool. Because of the limitation of running two VMs at a time on macos and the
comparatively low cost of mac minis, it seems they beat alternative models by
a fair bit.

Pginfra/sysadmin: ^

Based on being off by an order of magnitude, as you mention earlier, it seems
that:

1) reducing test runtime etc, as already in progress
2) getting 2 mac minis as runners
3) using ~350 USD / mo in GCP costs for windows, linux, freebsd (*)

Would be viable for a month or three? I hope we can get some cloud providers
to chip in for 3), but I'd like to have something in place that doesn't depend
on that.

Given the cost of macos VMs at AWS, the only one of the big cloud providers
to have macos instances, I think we'd burn pointlessly quickly through
credits if we used VMs for that.

(*) I think we should be able to get below that, but ...

The other likely option would be to seek out cloud credits from one of
the big three (or others); Amazon has continually said they would be
happy to donate more credits to us if we had a use, and I think some
of the other hosting providers have said similarly at times; so we'd
need to ask and hope it's not too bureaucratic.

Yep.

I tried to start that process within microsoft, fwiw. Perhaps Joe and
Jonathan know how to start within AWS? And perhaps Noah inside GCP?

It'd be the least work to get it up and running in GCP, as it's already
running there, but should be quite doable at the others as well.

Greetings,

Andres Freund

#14 Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Andres Freund (#11)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

On Wed, Aug 9, 2023 at 3:26 AM Andres Freund <andres@anarazel.de> wrote:

6) Disable write cache flushes on windows

It's a bit ugly to do this without using the UI... Shaves off

about 30s

from the tests.

A brief comment would be nice: "We don't care about persistence over hard
crashes in the CI, so disable write cache flushes to speed it up."

Turns out that patch doesn't work on its own anyway, at least not
reliably... I tested it by interactively logging into a windows vm and
testing
it there. It doesn't actually seem to suffice when run in isolation,
because
the relevant registry key doesn't yet exist. I haven't yet figured out the
magic incantations for adding the missing "intermediary", but I'm getting
there...

You can find a good example on how to accomplish this in:

https://github.com/farag2/Utilities/blob/master/Enable_disk_write_caching.ps1

Regards,

Juan José Santamaría Flecha

#15 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#13)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

Hello

So pginfra had a little chat about this. Firstly, there's consensus
that it makes sense for pginfra to help out with some persistent workers
in our existing VM system; however there are some aspects that need
some further discussion, to avoid destabilizing the rest of the
infrastructure. We're looking into it and we'll let you know.

Hosting a couple of Mac Minis is definitely a possibility, if some
entity like SPI buys them. Let's take this off-list to arrange the
details.

Regards

--
Álvaro Herrera

#16 Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#13)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

On Tue, Aug 08, 2023 at 07:59:55PM -0700, Andres Freund wrote:

On 2023-08-08 22:29:50 -0400, Robert Treat wrote:

3) using ~350 USD / mo in GCP costs for windows, linux, freebsd (*)

The other likely option would be to seek out cloud credits

I tried to start that progress within microsoft, fwiw. Perhaps Joe and
Jonathan know how to start within AWS? And perhaps Noah inside GCP?

It'd be the least work to get it up and running in GCP, as it's already
running there

I'm looking at this. Thanks for bringing it to my attention.

#17 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

Hi,

On 2023-08-07 19:15:41 -0700, Andres Freund wrote:

As some of you might have seen when running CI, cirrus-ci is restricting how
much CI cycles everyone can use for free (announcement at [1]). This takes
effect September 1st.

This obviously has consequences both for individual users of CI as well as
cfbot.

[...]

Potential paths forward for individual CI:

- migrate wholesale to another CI provider

- split CI tasks across different CI providers, rely on github et al
displaying the CI status for different platforms

- give up

Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
could provide significant savings.

To make that possible, we need to make the compute resources for CI
configurable on a per-repository basis. After experimenting with a bunch of
ways to do that, I got stuck on it for a while. But as of today we have
sufficient macos runners for cfbot available, so... I think the approach I
finally settled on is decent, although not great. It's described in the "main"
commit message:
ci: Prepare to make compute resources for CI configurable

cirrus-ci will soon restrict the amount of free resources every user gets (as
have many other CI providers). For most users of CI that should not be an
issue. But e.g. for cfbot it will be an issue.

To allow configuring different resources on a per-repository basis, introduce
infrastructure for overriding the task execution environment. Unfortunately
this is not entirely trivial, as yaml anchors have to be defined before their
use, and cirrus-ci only allows injecting additional contents at the end of
.cirrus.yml.

To deal with that, move the definition of the CI tasks to
.cirrus.tasks.yml. The main .cirrus.yml is loaded first; then, if defined, the
file referenced by the REPO_CI_CONFIG_GIT_URL variable is added, followed by
the contents of .cirrus.tasks.yml. That allows REPO_CI_CONFIG_GIT_URL to
override the yaml anchors defined in .cirrus.yml.

Unfortunately git's default merge / rebase strategy does not handle copied
files, just renamed ones. To avoid painful rebasing over this change, this
commit just renames .cirrus.yml to .cirrus.tasks.yml, without adding a new
.cirrus.yml. That's done in the followup commit, which moves the relevant
portion of .cirrus.tasks.yml to .cirrus.yml. Until that is done,
REPO_CI_CONFIG_GIT_URL does not fully work.

The subsequent commit adds documentation on how to configure custom compute
resources to src/tools/ci/README.

Discussion: /messages/by-id/20230808021541.7lbzdefvma7qmn3w@awork3.anarazel.de
Backpatch: 15-, where CI support was added

I don't love moving most of the contents of .cirrus.yml into a new file, but I
don't see another way. I did implement it without that as well (see [1]), but
that ends up considerably harder to understand, and hardcodes what cfbot
needs. Splitting the commit, as explained above, at least makes git rebase
fairly painless. FWIW, I did merge the changes into 15, with only reasonable
conflicts (due to new tasks, autoconf->meson).
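
Conceptually, the three-file concatenation described in the commit message
works like this (a hypothetical sketch of the merged result; the anchor name,
image projects, and task are illustrative, not the actual definitions from the
patch):

```yaml
# 1) .cirrus.yml, loaded first: defines the default execution environment
linux_env_snippet: &linux_env
  compute_engine_instance:
    image_project: default-ci-images   # hypothetical default project
    cpu: 4

# 2) file referenced by REPO_CI_CONFIG_GIT_URL, appended next (optional):
#    redefines the same anchor, so later references resolve to it
linux_env_snippet: &linux_env
  compute_engine_instance:
    image_project: my-own-ci-project   # repo-specific override
    cpu: 16

# 3) .cirrus.tasks.yml, appended last: tasks merely reference the anchor
linux_task:
  <<: *linux_env
  test_script: make check
```

Because an alias resolves against the most recent definition of an anchor, the
override file only has to redefine the anchors, not the tasks themselves.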

A prerequisite commit converts "SanityCheck" and "CompilerWarnings" to use a
full VM instead of a container - that way providing custom compute resources
doesn't have to deal with containers in addition to VMs. It also looks like
the increased startup overhead is outweighed by the reduction in runtime
overhead.

I'm hoping to push this fairly soon, as I'll be on vacation the last week of
August. I'll be online intermittently though; if there are issues, I can react
(very limited connectivity from midday Aug 29th to midday Aug 31st though). I'd
appreciate a quick review or two.

Greetings,

Andres Freund

[1]: https://github.com/anarazel/postgres/commit/b95fd302161b951f1dc14d586162ed3d85564bfc

Attachments:

v3-0001-ci-Don-t-specify-amount-of-memory.patch (text/x-diff, +0/-5)
v3-0002-ci-Move-execution-method-of-tasks-into-yaml-templ.patch (text/x-diff, +57/-29)
v3-0003-ci-Use-VMs-for-SanityCheck-and-CompilerWarnings.patch (text/x-diff, +6/-9)
v3-0004-ci-Prepare-to-make-compute-resources-for-CI-confi.patch (text/x-diff, +63/-1)
v3-0005-ci-Make-compute-resources-for-CI-configurable.patch (text/x-diff, +90/-46)
v3-0006-ci-dontmerge-Example-custom-CI-configuration.patch (text/x-diff, +49/-1)
v3-0007-Use-template-initdb-in-tests.patch (text/x-diff, +161/-45)
v3-0008-ci-switch-tasks-to-debugoptimized-build.patch (text/x-diff, +8/-9)
v3-0009-ci-windows-Disabling-write-cache-flushing-during-.patch (text/x-diff, +26/-1)
v3-0010-regress-Check-for-postgres-startup-completion-mor.patch (text/x-diff, +5/-3)
#18Daniel Gustafsson
daniel@yesql.se
In reply to: Andres Freund (#17)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

On 23 Aug 2023, at 08:58, Andres Freund <andres@anarazel.de> wrote:

I'm hoping to push this fairly soon, as I'll be on vacation the last week of
August. I'll be online intermittently though, if there are issues, I can react
(very limited connectivity from midday Aug 29th to midday Aug 31st though). I'd
appreciate a quick review or two.

I've been reading over these and the thread, and while not within my area of
expertise, nothing really sticks out.

I'll do another pass, but below are a few small comments so far.

I don't know Windows well enough to know the implications, but should the below
file have some sort of warning about not doing that for production/shared
systems, only for dedicated test instances?

+++ b/src/tools/ci/windows_write_cache.ps1
@@ -0,0 +1,20 @@
+# Define the write cache to be power protected. This reduces the rate of cache
+# flushes, which seems to help metadata heavy workloads on NTFS. We're just
+# testing here anyway, so ...
+#
+# Let's do so for all disks, this could be useful beyond cirrus-ci.

One thing in 0010 caught my eye, and while not introduced in this patchset it
might be of interest here. In the below hunks we loop X ticks around
system(psql), with the loop assuming the server can come up really quickly and
sleeping if it doesn't. On my systems I always reach the pg_usleep after
failing the check, but if I reverse the order so that it first sleeps and then
checks, I only need one check instead of two.

@@ -2499,7 +2502,7 @@ regression_main(int argc, char *argv[],
 		else
 			wait_seconds = 60;
 
-		for (i = 0; i < wait_seconds; i++)
+		for (i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
 		{
 			/* Done if psql succeeds */
 			fflush(NULL);
@@ -2519,7 +2522,7 @@ regression_main(int argc, char *argv[],
 					 outputdir);
 			}
 
-			pg_usleep(1000000L);
+			pg_usleep(1000000L / WAITS_PER_SEC);
 		}
 		if (i >= wait_seconds)
 		{

It's a micro-optimization, but if we're changing things here to chase cycles it
might perhaps be worth doing?

--
Daniel Gustafsson

#19Andres Freund
andres@anarazel.de
In reply to: Daniel Gustafsson (#18)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

Hi,

On 2023-08-23 14:48:26 +0200, Daniel Gustafsson wrote:

On 23 Aug 2023, at 08:58, Andres Freund <andres@anarazel.de> wrote:

I'm hoping to push this fairly soon, as I'll be on vacation the last week of
August. I'll be online intermittently though, if there are issues, I can react
(very limited connectivity from midday Aug 29th to midday Aug 31st though). I'd
appreciate a quick review or two.

I've been reading over these and the thread, and while not within my area of
expertise, nothing really sticks out.

Thanks!

I'll do another pass, but below are a few small comments so far.

I don't know Windows well enough to know the implications, but should the below
file have some sort of warning about not doing that for production/shared
systems, only for dedicated test instances?

Ah, I should have explained that: I'm not planning to apply
- regress: Check for postgres startup completion more often
- ci: windows: Disabling write cache flushing during test
right now. Compared to the other patches the wins are much smaller and/or more
work is needed to make them good.

I think it might be worth going for
- ci: switch tasks to debugoptimized build
because that provides a fair bit of gain. But it might be more hurtful than
helpful due to costing more when ccache doesn't work...

+++ b/src/tools/ci/windows_write_cache.ps1
@@ -0,0 +1,20 @@
+# Define the write cache to be power protected. This reduces the rate of cache
+# flushes, which seems to help metadata heavy workloads on NTFS. We're just
+# testing here anyway, so ...
+#
+# Let's do so for all disks, this could be useful beyond cirrus-ci.

One thing in 0010 caught my eye, and while not introduced in this patchset it
might be of interest here. In the below hunks we loop X ticks around
system(psql), with the loop assuming the server can come up really quickly and
sleeping if it doesn't. On my systems I always reach the pg_usleep after
failing the check, but if I reverse the order so that it first sleeps and then
checks, I only need one check instead of two.

I think there are more effective ways to make this cheaper. The basic thing
would be to use libpq instead of forking off psql to make a connection
check. Medium term, I think we should invent a way for pg_ctl and other
tooling (including pg_regress) to wait for the service to come up. E.g. having
a named pipe that postmaster opens once the server is up, which should allow
multiple clients to use select/epoll/... to wait for it without looping.

ISTM making pg_regress use libpq w/ PQping() should be a pretty simple patch?
The non-polling approach obviously is even better, but also requires more
thought (and documentation and ...).

@@ -2499,7 +2502,7 @@ regression_main(int argc, char *argv[],
 		else
 			wait_seconds = 60;
 
-		for (i = 0; i < wait_seconds; i++)
+		for (i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
 		{
 			/* Done if psql succeeds */
 			fflush(NULL);
@@ -2519,7 +2522,7 @@ regression_main(int argc, char *argv[],
 					 outputdir);
 			}
 
-			pg_usleep(1000000L);
+			pg_usleep(1000000L / WAITS_PER_SEC);
 		}
 		if (i >= wait_seconds)
 		{

It's a micro-optimization, but if we're changing things here to chase cycles it
might perhaps be worth doing?

I wouldn't quite call not waiting for 1s for the server to start, when it does
so within a few ms, chasing cycles ;). For short tests it's a substantial
fraction of the overall runtime...

Greetings,

Andres Freund

#20Daniel Gustafsson
daniel@yesql.se
In reply to: Andres Freund (#19)
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
On 2023-08-23 14:48:26 +0200, Daniel Gustafsson wrote:

I'll do another pass, but below are a few small comments so far.

I don't know Windows well enough to know the implications, but should the below
file have some sort of warning about not doing that for production/shared
systems, only for dedicated test instances?

Ah, I should have explained that: I'm not planning to apply
- regress: Check for postgres startup completion more often
- ci: windows: Disabling write cache flushing during test
right now. Compared to the other patches the wins are much smaller and/or more
work is needed to make them good.

I think it might be worth going for
- ci: switch tasks to debugoptimized build
because that provides a fair bit of gain. But it might be more hurtful than
helpful due to costing more when ccache doesn't work...

Gotcha.

+++ b/src/tools/ci/windows_write_cache.ps1
@@ -0,0 +1,20 @@
+# Define the write cache to be power protected. This reduces the rate of cache
+# flushes, which seems to help metadata heavy workloads on NTFS. We're just
+# testing here anyway, so ...
+#
+# Let's do so for all disks, this could be useful beyond cirrus-ci.

One thing in 0010 caught my eye, and while not introduced in this patchset it
might be of interest here. In the below hunks we loop X ticks around
system(psql), with the loop assuming the server can come up really quickly and
sleeping if it doesn't. On my systems I always reach the pg_usleep after
failing the check, but if I reverse the order so that it first sleeps and then
checks, I only need one check instead of two.

I think there are more effective ways to make this cheaper. The basic thing
would be to use libpq instead of forking off psql to make a connection
check.

I had it in my head that not using libpq in pg_regress was a deliberate choice,
but I fail to find a reference to it in the archives.

It's a micro-optimization, but if we're changing things here to chase cycles it
might perhaps be worth doing?

I wouldn't quite call not waiting for 1s for the server to start, when it does
so within a few ms, chasing cycles ;). For short tests it's a substantial
fraction of the overall runtime...

Absolutely, I was referring to shifting the sleep to before the check to avoid
the extra check, not the reduction of the pg_usleep. Reducing the sleep is a
clear win.

--
Daniel Gustafsson

#21Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Andres Freund (#17)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Gustafsson (#20)
#23Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#22)
#24Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#22)
#25Andres Freund
andres@anarazel.de
In reply to: Nazir Bilal Yavuz (#21)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#24)
#27Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#27)
#29Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#28)
#30Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#29)
#31Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Andres Freund (#25)
#32Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#17)
#33Daniel Gustafsson
daniel@yesql.se
In reply to: Daniel Gustafsson (#23)
#34Daniel Gustafsson
daniel@yesql.se
In reply to: Daniel Gustafsson (#33)
#35Andres Freund
andres@anarazel.de
In reply to: Daniel Gustafsson (#34)
#36Daniel Gustafsson
daniel@yesql.se
In reply to: Andres Freund (#35)
#37Daniel Gustafsson
daniel@yesql.se
In reply to: Andres Freund (#35)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Gustafsson (#37)
#39Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#38)