ci: CCache churns through available space too quickly
Hi,
I noticed that a handful of CI runs already lead to exceeding the available
cache space. One can pay for more cache space, but I think the problem is
more that what we currently do doesn't work well.
With cirrus-ci all branches shared one cache, but that's not the case with
github actions. Except for being able to read caches from the default branch
(master in our case), other branches have completely separate cache
namespaces. That's probably the right call, safety wise, but makes our ccache
approach .. not great.
We should only upload a new cache when the ccache cache hit ratio of the
existing cache entry has gotten low.
We also chose the cache key poorly: if a branch name started with the name
of the default branch followed by a -, we'd always end up using the main
branch's cache.
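One way such a collision can arise is prefix matching on cache keys, which is how fallback keys (restore-keys) are compared by the GitHub cache action. The following is a minimal sketch under stated assumptions: the key layout `ccache<sep><branch><sep><suffix>`, the helper names and the `:` separator are hypothetical and not taken from the workflow, and whether the cache service accepts `:` in keys is not verified here.

```python
def make_key(branch, suffix, sep="-"):
    """Build a cache key; this layout is an assumption for illustration."""
    return f"ccache{sep}{branch}{sep}{suffix}"


def restore_candidates(prefix, keys):
    """Return all keys that a prefix-matched fallback key would accept."""
    return [k for k in keys if k.startswith(prefix)]


branches = ["master", "master-fix", "feature"]

# With "-" as separator, the prefix for "master" also matches "master-fix".
dash_keys = [make_key(b, "1001") for b in branches]
dash_matches = restore_candidates("ccache-master-", dash_keys)
print(dash_matches)

# With a separator that cannot occur in a git branch name, the prefix is
# unambiguous.
colon_keys = [make_key(b, "1001", sep=":") for b in branches]
colon_matches = restore_candidates("ccache:master:", colon_keys)
print(colon_matches)
```

The first list contains the keys of both `master` and `master-fix`; the second contains only the key of `master`.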
The attached patch fixes these, and a few other problems. See commit message
for details. With it I see a lot less cache churn and therefore also a higher
hit rate once one has more than 2-3 branches.
I'm not entirely happy with the amount of per job repetition this has. While
staying within the confines of a single .yml file, I couldn't find a better
way to deal with that. We could move a fair bit of that complexity into a
separate file, using so-called "composite actions". But that's a bit of
additional github actions specific stuff that one would be exposed to, so I'm
not sure we should go that way?
It would result in having only two references to ccache in each job (one
before the build, one after). Each of those could encapsulate a bunch of steps
defined in another file.
Thoughts?
Greetings,
Andres Freund
Hi,
Thank you for working on this!
On Fri, 5 Jun 2026 at 23:09, Andres Freund <andres@anarazel.de> wrote:
I noticed that a handful of CI runs already lead to exceeding the available
cache space. One can pay for more cache space, but I think the problem is
more that what we currently do doesn't work well.
With cirrus-ci all branches shared one cache, but that's not the case with
github actions. Except for being able to read caches from the default branch
(master in our case), other branches have completely separate cache
namespaces. That's probably the right call, safety wise, but makes our ccache
approach .. not great.
We should only upload a new cache when the ccache cache hit ratio of the
existing cache entry has gotten low.
This makes sense.
We also chose the cache key poorly: if a branch name started with the name
of the default branch followed by a -, we'd always end up using the main
branch's cache.
The attached patch fixes these, and a few other problems. See commit message
for details. With it I see a lot less cache churn and therefore also a higher
hit rate once one has more than 2-3 branches.
I'm not entirely happy with the amount of per job repetition this has. While
staying within the confines of a single .yml file, I couldn't find a better
way to deal with that. We could move a fair bit of that complexity into a
separate file, using so-called "composite actions". But that's a bit of
additional github actions specific stuff that one would be exposed to, so I'm
not sure we should go that way?
I think it looks okay, no need to use composite actions for this.
--------------------
I tested the patch and I confirm that it works as mentioned. Here is my review:
All the points you explained in the commit message are nice improvements!
Typo in commit message:
+ In my testing this utilizes the available cache space (10GB for personal
+ accounts) much more effictively than before.
Typo at 'effictively' in the commit message.
diff --git a/.github/workflows/pg-ci.yml b/.github/workflows/pg-ci.yml
index 8560e9389f6..86dc47de8db 100644
--- a/.github/workflows/pg-ci.yml
+++ b/.github/workflows/pg-ci.yml
+ - &ccache_decide_save_step
+ name: "ccache: Decide if cache should be uploaded"
+ id: ccache-pre-save
+ # [Decide to] store the cache whenever the cache was set up, so that
+ # incrementally addressing compiler errors/warnings doesn't have to
+ # start from scratch.
+ if: |
+ always() &&
+ steps.ccache-restore-branch.conclusion == 'success'
+ run: python3 src/tools/ci/gha_ccache_decide.py
Isn't the conclusion always 'success' unless GitHub itself has errors?
Also, we are directly running this script with the 'python3' command
but it might not be available on the PATH. I had some problems with
this on BSD images when we were using Cirrus. I am not sure we would
have such problems with GitHub Actions but I just wanted to mention
it.
diff --git a/src/tools/ci/gha_ccache_decide.py
b/src/tools/ci/gha_ccache_decide.py
new file mode 100644
index 00000000000..920f7bf9685
--- /dev/null
+++ b/src/tools/ci/gha_ccache_decide.py
+def main():
+ on_default_branch = os.environ["ON_DEFAULT_BRANCH"] == "true"
+ ccache_dir = os.environ["CCACHE_DIR"]
ccache_dir isn't used.
+ # compute cache hit ratio
+ hits, misses = parse_ccache_stats()
+ total = hits + misses
+ hit_pct = int(( hits / total) * 100) if total > 0 else 100
Extra space in '( hits'.
+ # If there were either barely any misses, or the cache hit ratio was high,
+ # there no point in generating a new cache entry. We have limited cache
+ # space.
+ should_save = misses > 10 and hit_pct < target_rate
We consider misses here but we don't mention it, we only mention hit
rate and target rate. I think this is not very important since we
can't possibly have a case that misses < 10 and hit_pct < target_rate.
If that is not the case, then I think we can remove misses from the
should_save calculation.
+ # Don't store ccache stats , otherwise we'd need to reset the cache access
Extra space before comma.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
On 2026-06-08 13:30:03 +0300, Nazir Bilal Yavuz wrote:
I think it looks okay, no need to use composite actions for this.
Cool.
I tested the patch and I confirm that it works as mentioned. Here is my review:
All the points you explained in the commit message are nice improvements!
Typo in commit message:
+ In my testing this utilizes the available cache space (10GB for personal
+ accounts) much more effictively than before.
Typo at 'effictively' in the commit message.
Oops.
diff --git a/.github/workflows/pg-ci.yml b/.github/workflows/pg-ci.yml
index 8560e9389f6..86dc47de8db 100644
--- a/.github/workflows/pg-ci.yml
+++ b/.github/workflows/pg-ci.yml
+ - &ccache_decide_save_step
+ name: "ccache: Decide if cache should be uploaded"
+ id: ccache-pre-save
+ # [Decide to] store the cache whenever the cache was set up, so that
+ # incrementally addressing compiler errors/warnings doesn't have to
+ # start from scratch.
+ if: |
+ always() &&
+ steps.ccache-restore-branch.conclusion == 'success'
+ run: python3 src/tools/ci/gha_ccache_decide.py
Isn't the conclusion always 'success' unless GitHub itself has errors?
I mean, the cache restoration *could* fail? Or another earlier step could. We
don't want to upload a new cache entry if we never got to building...
Also, we are directly running this script with the 'python3' command
but it might not be available on the PATH. I had some problems with
this on BSD images when we were using Cirrus. I am not sure we would
have such problems with GitHub Actions but I just wanted to mention
it.
I think we'll just have to address it if/when it becomes a problem. I don't
really see the alternative...
diff --git a/src/tools/ci/gha_ccache_decide.py b/src/tools/ci/gha_ccache_decide.py
new file mode 100644
index 00000000000..920f7bf9685
--- /dev/null
+++ b/src/tools/ci/gha_ccache_decide.py
+def main():
+ on_default_branch = os.environ["ON_DEFAULT_BRANCH"] == "true"
+ ccache_dir = os.environ["CCACHE_DIR"]
ccache_dir isn't used.
Ah, yea. It was earlier, but I removed that part (computed the cache size,
when this was a shell script, by using du. But that seemed too awkward in
python, so I removed it).
+ # If there were either barely any misses, or the cache hit ratio was high,
+ # there no point in generating a new cache entry. We have limited cache
+ # space.
+ should_save = misses > 10 and hit_pct < target_rate
We consider misses here but we don't mention it
I was trying to mention it, via "If there were either barely any misses".
, we only mention hit rate and target rate. I think this is not very
important since we can't possibly have a case that misses < 10 and hit_pct <
target_rate.
Why could we not have such a case? If we start building with some changes
that trigger cache misses, but there's a compiler error a few seconds in, that
seems like it'd precisely hit that case?
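The failed-build case can be made concrete with a small worked example. This is only a sketch: the hit-percentage formula and the `misses > 10 and hit_pct < target_rate` condition come from the quoted patch, while the function name, the `target_rate` of 90 and the sample numbers are assumptions for illustration.

```python
def should_save(hits, misses, target_rate):
    """Decide whether a new cache entry is worth uploading."""
    total = hits + misses
    hit_pct = int((hits / total) * 100) if total > 0 else 100
    return misses > 10 and hit_pct < target_rate


# Build aborted by a compiler error after a few compiles: the hit rate is
# low, but there are too few misses for a new cache entry to be worthwhile.
aborted = should_save(hits=2, misses=6, target_rate=90)

# Full build with many changed files: low hit rate and many misses.
stale = should_save(hits=600, misses=400, target_rate=90)

# Full build that was served almost entirely from the cache.
warm = should_save(hits=950, misses=50, target_rate=90)

print(aborted, stale, warm)
```

Only the second case results in a new cache entry; the first is exactly the combination of few misses and a low hit rate.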
Greetings,
Andres Freund
Hi,
On Mon, 8 Jun 2026 at 17:59, Andres Freund <andres@anarazel.de> wrote:
diff --git a/.github/workflows/pg-ci.yml b/.github/workflows/pg-ci.yml
index 8560e9389f6..86dc47de8db 100644
--- a/.github/workflows/pg-ci.yml
+++ b/.github/workflows/pg-ci.yml
+ - &ccache_decide_save_step
+ name: "ccache: Decide if cache should be uploaded"
+ id: ccache-pre-save
+ # [Decide to] store the cache whenever the cache was set up, so that
+ # incrementally addressing compiler errors/warnings doesn't have to
+ # start from scratch.
+ if: |
+ always() &&
+ steps.ccache-restore-branch.conclusion == 'success'
+ run: python3 src/tools/ci/gha_ccache_decide.py
Isn't the conclusion always 'success' unless GitHub itself has errors?
I mean, the cache restoration *could* fail? Or another earlier step could. We
don't want to upload a new cache entry if we never got to building...
I see, yes these points make sense.
Also, we are directly running this script with the 'python3' command
but it might not be available on the PATH. I had some problems with
this on BSD images when we were using Cirrus. I am not sure we would
have such problems with GitHub Actions but I just wanted to mention
it.
I think we'll just have to address it if/when it becomes a problem. I don't
really see the alternative...
Sounds good.
diff --git a/src/tools/ci/gha_ccache_decide.py b/src/tools/ci/gha_ccache_decide.py
new file mode 100644
index 00000000000..920f7bf9685
--- /dev/null
+++ b/src/tools/ci/gha_ccache_decide.py
+ # If there were either barely any misses, or the cache hit ratio was high,
+ # there no point in generating a new cache entry. We have limited cache
+ # space.
+ should_save = misses > 10 and hit_pct < target_rate
We consider misses here but we don't mention it
I was trying to mention it, via "If there were either barely any misses".
Sorry, what I meant was we don't mention in the logs, which is:
+ if not should_save:
+ print(f"hit rate {hit_pct} is above target of {target_rate}, skip creating new cache entry")
+ return 0
, we only mention hit rate and target rate. I think this is not very
important since we can't possibly have a case that misses < 10 and hit_pct <
target_rate.
Why could we not have such a case? If we start building with some changes
that trigger cache misses, but there's a compiler error a few seconds in, that
seems like it'd precisely hit that case?
Yes, you are right. I hadn't thought of the failure case. Then, it
would be good to mention that case in the log I mentioned above.
Otherwise, we will be printing the incorrect reason.
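A sketch of how the skip message could distinguish the two reasons. The threshold of 10 misses and the hit-percentage formula follow the quoted patch; the function name, the `min_misses` parameter and the message wording are assumptions, not what was eventually committed.

```python
def skip_reason(hits, misses, target_rate, min_misses=10):
    """Return why no new cache entry is needed, or None if one should be saved."""
    total = hits + misses
    hit_pct = int((hits / total) * 100) if total > 0 else 100
    if misses <= min_misses:
        # Covers e.g. a build that failed after a few compiles.
        return f"only {misses} cache misses, skip creating new cache entry"
    if hit_pct >= target_rate:
        return (f"hit rate {hit_pct}% is at or above target of "
                f"{target_rate}%, skip creating new cache entry")
    return None


for hits, misses in [(2, 6), (950, 50), (600, 400)]:
    reason = skip_reason(hits, misses, target_rate=90)
    print(reason if reason is not None else "creating new cache entry")
```

Checking the miss count first means an aborted build with a low hit rate is reported as having too few misses rather than as being above the target rate.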
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
On 2026-06-08 19:10:46 +0300, Nazir Bilal Yavuz wrote:
On Mon, 8 Jun 2026 at 17:59, Andres Freund <andres@anarazel.de> wrote:
diff --git a/src/tools/ci/gha_ccache_decide.py b/src/tools/ci/gha_ccache_decide.py
new file mode 100644
index 00000000000..920f7bf9685
--- /dev/null
+++ b/src/tools/ci/gha_ccache_decide.py
+ # If there were either barely any misses, or the cache hit ratio was high,
+ # there no point in generating a new cache entry. We have limited cache
+ # space.
+ should_save = misses > 10 and hit_pct < target_rate
We consider misses here but we don't mention it
I was trying to mention it, via "If there were either barely any misses".
Sorry, what I meant was we don't mention in the logs, which is:
+ if not should_save:
+ print(f"hit rate {hit_pct} is above target of {target_rate}, skip creating new cache entry")
+ return 0
Ah, makes sense.
I updated that, and after doing some minor polishing, pushed it.
Thanks for the quick review!
Greetings,
Andres
On Fri Jun 5, 2026 at 8:09 PM UTC, Andres Freund wrote:
Hi,
I noticed that a handful of CI runs already lead to exceeding the available
cache space. One can pay for more cache space, but I think the problem is
more that what we currently do doesn't work well.
With cirrus-ci all branches shared one cache, but that's not the case with
github actions. Except for being able to read caches from the default branch
(master in our case), other branches have completely separate cache
namespaces. That's probably the right call, safety wise, but makes our ccache
approach .. not great.
We should only upload a new cache when the ccache cache hit ratio of the
existing cache entry has gotten low.
I had started reviewing this patch the day it was originally sent, but
due to circumstances I couldn't finish the review before it was
committed. I had some thoughts with regard to improving the Python
script itself. Attached are some improvements that make the code
a little more pythonic as well as more easily usable locally for testing
purposes. Some of the patches may be more valuable than others.
--
Tristan Partin
PostgreSQL Contributors Team
AWS (https://aws.amazon.com)
Attachments:
v1-0001-Use-long-options-for-ccache-commands.patch (+3 -5)
v1-0002-Use-json-to-parse-ccache-statistics.patch (+4 -21)
v1-0003-Remove-the-shutil.which-call.patch (+3 -3)
v1-0004-Use-a-context-manager-to-write-grouped-CI-output.patch (+18 -12)
v1-0005-Provide-ccache-target-rate-as-an-input-to-the-scr.patch (+22 -16)
v1-0006-Check-that-GITHUB_OUTPUT-exists-before-trying-to-.patch (+5 -4)
v1-0007-Make-gha_ccache_decide.py-executable.patch (+0 -1)
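Two of the ideas named in the patch titles above, grouped CI output through a context manager and checking for GITHUB_OUTPUT before writing to it, can be sketched as follows. This is not the content of the attached patches: the names `ci_group` and `set_output`, the output name `should_save` and the printed statistics are hypothetical. The `::group::` / `::endgroup::` workflow commands and the `name=value` format of the file named by `GITHUB_OUTPUT` are standard GitHub Actions behaviour.

```python
import contextlib
import os


@contextlib.contextmanager
def ci_group(title):
    """Wrap everything printed inside the block in a collapsible log group."""
    print(f"::group::{title}", flush=True)
    try:
        yield
    finally:
        # Close the group even if the body raises.
        print("::endgroup::", flush=True)


def set_output(name, value):
    """Append name=value to $GITHUB_OUTPUT; fall back to printing locally.

    Returns True if the value was written to the output file.
    """
    path = os.environ.get("GITHUB_OUTPUT")
    if not path:
        print(f"(local run) {name}={value}")
        return False
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{name}={value}\n")
    return True


with ci_group("ccache statistics"):
    print("hits: 600, misses: 400")  # placeholder numbers

set_output("should_save", "true")
```

With the fallback in `set_output`, the script can be run outside of GitHub Actions without failing on a missing environment variable, which fits the stated goal of making it easier to test locally.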
On Sat, Jun 6, 2026 at 8:09 AM Andres Freund <andres@anarazel.de> wrote:
With cirrus-ci all branches shared one cache, but that's not the case with
github actions. Except for being able to read caches from the default branch
(master in our case), other branches have completely separate cache
namespaces. That's probably the right call, safety wise, but makes our ccache
approach .. not great.
For the record: based on what Andres explained about how GHA cache
sharing works, I taught cfbot to mirror the master branch in its
postgresql-cfbot/postgresql account, and use the latest successful CI
run from there to select the base commit for the cf/XXX branches it
maintains. IIUC that should work well for this cache sharing policy,
since master's cache should be uploaded and ready to reuse at that
point. Perhaps we'll also get some data on how successful these new
heuristics are?
https://github.com/postgresql-cfbot/postgresql/commits/master/
I also taught cfbot to delete old cf/XXX branches with no builds in
over 90 days (we went from over 9000 to 376...). That matches
Github's own retention period for logs, artefacts etc, so stale
branches are not very interesting and it seemed like a good idea not
to waste resources or clutter the UI with junk.
On Wed, Jun 17, 2026 at 3:02 PM Tristan Partin <tristan@partin.io> wrote:
I had started reviewing this patch the day it was originally sent, but
due to circumstances I couldn't finish the review before it was
committed. I had some thoughts with regard to improving the Python
script itself. Attached are some improvements that make the code
a little more pythonic as well as more easily usable locally for testing
purposes. Some of the patches may be more valuable than others.
The code in 0001-3 looks good to me (haven't reviewed the commit
messages, but I assume they'd be squashed up anyway).
I'm lukewarm on the remaining pieces.
--Jacob