PostgreSQL 9.4 and ZFS?
Does anybody here have any recommendations for using PostgreSQL 9.4 (latest)
with ZFS?
We've been running both on ZFS/CentOS 6 with excellent results, and are
considering putting the two together. In particular, the CoW nature (and
subsequent fragmentation/thrashing) of ZFS becomes largely irrelevant on SSDs;
the very act of wear leveling on an SSD is itself a form of intentional
thrashing that doesn't affect performance since SSDs have no meaningful seek
time. It would seem that PGCon 2013 even had a workshop on it!
https://www.pgcon.org/2013/schedule/events/612.en.html
The exact configuration we're contemplating is either (3x 400 RAIDZ1) or (4x
400 RAIDZ2) with Intel Enterprise SATA3 SSDs, with default (lz4) compression
enabled.
If this is a particularly good or bad idea, I'd like to hear it, and why?
Thanks,
BenP
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 9/29/2015 10:01 AM, Benjamin Smith wrote:
Does anybody here have any recommendations for using PostgreSQL 9.4 (latest)
with ZFS?
For databases, I've always used mirrored pools, not raidz*.
put pgdata in its own zfs file system in your zpool. on that dedicated
zfs, set the blocksize to 8k.
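In zfs terms, that layout would look something like this (pool, dataset, and device names are placeholders; note the filesystem property is actually recordsize, as corrected later in the thread):

```shell
# Hypothetical names/devices; adjust to your environment.
# Mirrored pool from two disks, then a dedicated dataset for pgdata
# with an 8 KiB record size to match PostgreSQL's block size.
zpool create tank mirror /dev/sda /dev/sdb
zfs create -o recordsize=8k tank/pgdata
zfs get recordsize tank/pgdata    # verify the property took effect
```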
--
john r pierce, recycling bits in santa cruz
On Tuesday, September 29, 2015 10:35:28 AM John R Pierce wrote:
On 9/29/2015 10:01 AM, Benjamin Smith wrote:
Does anybody here have any recommendations for using PostgreSQL 9.4
(latest) with ZFS?
For databases, I've always used mirrored pools, not raidz*.
put pgdata in its own zfs file system in your zpool. on that dedicated
zfs, set the blocksize to 8k.
Based on my reading here, that would be -o ashift=13 ?
HowDoesZFSonLinuxHandleAdvacedFormatDrives
EG: 2^13 = 8192
On Tue, Sep 29, 2015 at 01:08:20PM -0700, Benjamin Smith wrote:
On Tuesday, September 29, 2015 10:35:28 AM John R Pierce wrote:
On 9/29/2015 10:01 AM, Benjamin Smith wrote:
Does anybody here have any recommendations for using PostgreSQL 9.4
(latest) with ZFS?
For databases, I've always used mirrored pools, not raidz*.
put pgdata in its own zfs file system in your zpool. on that dedicated
zfs, set the blocksize to 8k.
Based on my reading here, that would be -o ashift=13 ?
HowDoesZFSonLinuxHandleAdvacedFormatDrives
EG: 2^13 = 8192
No, that would be:
zfs create -o recordsize=8192 <pool>/path/to/pgdata
this is for the DATASET, not the POOL
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ler@lerctr.org
US Mail: 7011 W Parmer Ln, Apt 1115, Austin, TX 78729-6961
On 9/29/2015 1:08 PM, Benjamin Smith wrote:
put pgdata in its own zfs file system in your zpool. on that dedicated
zfs, set the blocksize to 8k.
Based on my reading here, that would be -o ashift=13 ?
HowDoesZFSonLinuxHandleAdvacedFormatDrives
EG: 2^13 = 8192
sorry, I meant recordsize.
zfs set recordsize=8192 zpool/pgdata
--
john r pierce, recycling bits in santa cruz
Hi Benjamin,
if you're using compression, forget about that. You need to synchronize the ashift value to the
internal rowsize of your SSD, that's it. Make sure your SSD doesn't lie to you regarding writing
blocks and their respective order. In that case you might even choose to set sync=disabled. Also,
set atime=off and relatime=on. For faster snapshot transfers, you might like to set the checksum
algo to SHA256.
As always, put zfs.conf into /etc/modprobe.d with
options spl spl_kmem_cache_slab_limit=16384
options zfs zfs_arc_max=8589934592
you might want to adjust the zfs_arc_max value to your liking. Don't set it to more than 1/3 of
your RAM, just saying.
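A sketch of the property settings mentioned above, applied to a hypothetical dataset "tank/pgdata" (verify each against your ZFS version; sync=disabled trades durability for speed and is only sane if you trust the hardware):

```shell
# Per-dataset properties (dataset name is a placeholder)
zfs set atime=off tank/pgdata
zfs set relatime=on tank/pgdata
zfs set checksum=sha256 tank/pgdata
# Only if your SSD honors write ordering / has power-loss protection:
zfs set sync=disabled tank/pgdata

# Module options, exactly as given above, in /etc/modprobe.d/zfs.conf
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options spl spl_kmem_cache_slab_limit=16384
options zfs zfs_arc_max=8589934592
EOF
```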
I'm running the above configuration on >30 servers in production atm, plus about 10 in test/dev
environments; speed is awesome. No data loss so far, despite quite a few brownouts/blackouts already.
I'm using Ubuntu LTS (precise/trusty) with newest HWE, btw.
hope that helps,
Patric
Benjamin Smith schrieb am 29.09.2015 um 22:08:
On Tuesday, September 29, 2015 10:35:28 AM John R Pierce wrote:
On 9/29/2015 10:01 AM, Benjamin Smith wrote:
Does anybody here have any recommendations for using PostgreSQL 9.4 (latest) with ZFS?
For databases, I've always used mirrored pools, not raidz*.
put pgdata in its own zfs file system in your zpool. on that dedicated zfs, set the
blocksize to 8k.
Based on my reading here, that would be -o ashift=13 ?
HowDoesZFSonLinuxHandleAdvacedFormatDrives
EG: 2^13 = 8192
I use pg with zfs on FreeBSD; it works great. I tested ZoL a year ago, and it would
crash the OS under load.
Some notes for pg on FreeBSD from my experience:
1. if you use compression, an 8k recordsize makes the compression ratio poor. I
reached 7.x with gzip using the default record size (128k, if I remember) while getting
2.x with an 8k record size.
2. use a slog to reduce fragmentation and improve sync write performance.
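Adding a separate log device (SLOG) might look like this (pool and device names are hypothetical; mirroring the log guards against losing in-flight sync writes with the device):

```shell
# Attach a mirrored pair of fast devices as a dedicated ZIL/SLOG
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
zpool status tank    # the log vdev shows up under a "logs" section
```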
Hi,
On 09/30/2015 12:21 AM, Patric Bechtel wrote:
Hi Benjamin,
if you're using compression, forget about that. You need to
synchronize the ashift value to the internal rowsize of you SSD,
that's it. Make sure your SSD doesn't lie to you regarding writing
blocks and their respective order. In that case you might even choose
to set sync=disabled. Also, set atime=off and relatime=on. For faster
snapshot transfers, you might like to set the checksum algo to
SHA256.
What is "SSD rowsize"? Do you mean the size of the internal pages?
FWIW I've been doing extensive benchmarking of ZFS (on Linux), including
tests of different ashift values, and I see pretty much no difference
between ashift=12 and ashift=13 (4k vs 8k).
To show some numbers, these are pgbench results with 16 clients:
type scale ashift=12 ashift=13 rsize=8k logbias
----------------------------------------------------------------
ro small 53097 53159 53696 53221
ro medium 42869 43112 47039 46952
ro large 3127 3108 27736 28027
rw small 6593 6301 6384 6753
rw medium 1902 1890 4639 5034
rw large 561 554 2168 2585
small=150MB, medium=2GB, large=16GB (on a machine with 8GB of RAM)
The tests are "adding" the features, i.e. the columns are actually:
* ashift=12
* ashift=13
* ashift=13 + recordsize=8kB
* ashift=13 + recordsize=8kB + logbias=throughput
I've also done a few runs with compression, but that reduces the
performance a bit (understandably).
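For reference, numbers like the above can be produced with stock pgbench along these lines (the exact scale factors and flags are an assumption, not the poster's actual script):

```shell
# Initialize roughly the 2 GB "medium" dataset (scale factor is an estimate)
pgbench -i -s 130 bench
# Read-only run: 16 clients, select-only (-S), fixed duration
pgbench -c 16 -j 4 -S -T 300 bench
# Read-write run: same clients, default TPC-B-like mix
pgbench -c 16 -j 4 -T 300 bench
```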
As always, put zfs.conf into /etc/modprobe.d with
options spl spl_kmem_cache_slab_limit=16384
options zfs zfs_arc_max=8589934592
you might want to adjust the zfs_arc_max value to your liking. Don't set it to more than 1/3 of
your RAM, just saying.
Why? My understanding is that the ARC cache is ~ page cache, although
implemented differently and not as tightly integrated with the kernel,
but it should release the memory when needed and such. Perhaps not
letting it use all the RAM is a good idea, but 1/3 seems a bit too
aggressive?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 09/29/2015 07:01 PM, Benjamin Smith wrote:
Does anybody here have any recommendations for using PostgreSQL 9.4
(latest) with ZFS?
I think this really depends on the workload - if you have a lot of
random writes, CoW filesystems will perform significantly worse than
e.g. EXT4 or XFS, even on SSD.
We've been running both on ZFS/CentOS 6 with excellent results, and
are considering putting the two together. In particular, the CoW
nature (and subsequent fragmentation/thrashing) of ZFS becomes
largely irrelevant on SSDs; the very act of wear leveling on an SSD
is itself a form of intentional thrashing that doesn't affect
performance since SSDs have no meaningful seek time.
I don't think that's entirely true. Sure, SSD drives handle random I/O
much better than rotational storage, but it's not entirely free and
sequential I/O is still measurably faster.
It's true that the drives do internal wear leveling, but it probably
uses tricks that are impossible to do at the filesystem level (which is
oblivious to internal details of the SSD). CoW also increases the amount
of blocks that need to be reclaimed.
In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x faster
than ZFS. But of course, if the ZFS features are interesting for you,
maybe it's a reasonable price.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas,
Tomas Vondra schrieb am 30.09.2015 um 14:01:
Hi,
On 09/30/2015 12:21 AM, Patric Bechtel wrote:
Hi Benjamin,
if you're using compression, forget about that. You need to synchronize the ashift value to
the internal rowsize of your SSD, that's it. Make sure your SSD doesn't lie to you regarding
writing blocks and their respective order. In that case you might even choose to set
sync=disabled. Also, set atime=off and relatime=on. For faster snapshot transfers, you might
like to set the checksum algo to SHA256.
What is "SSD rowsize"? Do you mean the size of the internal pages?
Yep. In my experience, it helps write performance a lot, at least over an extended period of time
(less write amplification).
FWIW I've been doing extensive benchmarking of ZFS (on Linux), including tests of different
ashift values, and I see pretty much no difference between ashift=12 and ashift=13 (4k vs 8k).
To show some numbers, these are pgbench results with 16 clients:
type scale ashift=12 ashift=13 rsize=8k logbias
----------------------------------------------------------------
ro small 53097 53159 53696 53221
ro medium 42869 43112 47039 46952
ro large 3127 3108 27736 28027
rw small 6593 6301 6384 6753
rw medium 1902 1890 4639 5034
rw large 561 554 2168 2585
small=150MB, medium=2GB, large=16GB (on a machine with 8GB of RAM)
The tests are "adding" the features, i.e. the columns are actually:
* ashift=12
* ashift=13
* ashift=13 + recordsize=8kB
* ashift=13 + recordsize=8kB + logbias=throughput
I've also done a few runs with compression, but that reduces the performance a bit
(understandably).
I'm somewhat surprised by the influence of the rsize value. I will recheck that. In my case,
compression actually improved throughput quite a bit, but that might change depending on CPU speed
vs IO speed. Our CPUs are quite powerful, but the SSDs are just SATA Samsung/OCZ models at least
18 months old. Also, I measured write performance over several hours, to push the internal gc
of the SSD to its limits. We had some problems in the past with (e.g. Intel) SSDs and their
behaviour (<1MB/s), so that's why I put some emphasis on that.
As always, put zfs.conf into /etc/modprobe.d with
options spl spl_kmem_cache_slab_limit=16384
options zfs zfs_arc_max=8589934592
you might want to adjust the zfs_arc_max value to your liking. Don't set it to more than 1/3 of
your RAM, just saying.
Why? My understanding is that ARC cache is ~ page cache, although implemented differently and
not as tightly integrated with the kernel, but it should release the memory when needed and
such. Perhaps not letting it use all the RAM is a good idea, but 1/3 seems a bit too
aggressive?
First of all: the setting is somewhat 'disregarded' by zfs, as it's the net size of the buffer.
The gross side (with padding and aligning) isn't counted there, so in fact the cache fills up to
2/3 of the memory, which is plenty. Also, sometimes the arc shrinking process isn't as fast
as necessary, so leaving some headroom just in case isn't a bad strategy, IMHO.
Patric
On Wednesday, September 30, 2015 02:22:31 PM Tomas Vondra wrote:
I think this really depends on the workload - if you have a lot of
random writes, CoW filesystems will perform significantly worse than
e.g. EXT4 or XFS, even on SSD.
I'd be curious about the information you have that leads you to this
conclusion. As with many (most?) "rules of thumb", the devil is quite often
in the details.
We've been running both on ZFS/CentOS 6 with excellent results, and
are considering putting the two together. In particular, the CoW
nature (and subsequent fragmentation/thrashing) of ZFS becomes
largely irrelevant on SSDs; the very act of wear leveling on an SSD
is itself a form of intentional thrashing that doesn't affect
performance since SSDs have no meaningful seek time.
I don't think that's entirely true. Sure, SSD drives handle random I/O
much better than rotational storage, but it's not entirely free and
sequential I/O is still measurably faster.
It's true that the drives do internal wear leveling, but it probably
uses tricks that are impossible to do at the filesystem level (which is
oblivious to internal details of the SSD). CoW also increases the amount
of blocks that need to be reclaimed.
In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x faster
than ZFS. But of course, if the ZFS features are interesting for you,
maybe it's a reasonable price.
Again, the details would be highly interesting to me. What memory optimization
was done? Status of snapshots? Was the pool RAIDZ or mirrored vdevs? How many
vdevs? Was compression enabled? What ZFS release was this? Was this on Linux,
Free/Open/Net BSD, Solaris, or something else?
A 2x performance difference is almost inconsequential in my experience, where
growth is exponential. 2x performance change generally means 1 to 2 years of
advancement or deferment against the progression of hardware; our current,
relatively beefy DB servers are already older than that, and have an
anticipated life cycle of at least another couple years.
// Our situation //
Lots of RAM for the workload: 128 GB of ECC RAM with an on-disk DB size of ~
150 GB. Pretty much, everything runs straight out of RAM cache, with only
writes hitting disk. Smart reports 4/96 read/write ratio.
Query load: Constant, heavy writes and heavy use of temp tables in order to
assemble very complex queries. Pretty much the "worst case" mix of reads and
writes, average daily peak of about 200-250 queries/second.
16 Core XEON servers, 32 HT "cores".
SAS 3 Gbps
CentOS 6 is our O/S of choice.
Currently, we're running Intel 710 SSDs in a software RAID1 without trim
enabled and generally happy with the reliability and performance we see. We're
planning to upgrade storage soon (since we're over 50% utilization) and in the
process, bring the magic goodness of snapshots/clones from ZFS.
On Tue, Sep 29, 2015 at 1:01 PM, Benjamin Smith <ben@chico.com> wrote:
Does anybody here have any recommendations for using PostgreSQL 9.4 (latest)
with ZFS?
We've been running both on ZFS/CentOS 6 with excellent results, and are
considering putting the two together. In particular, the CoW nature (and
subsequent fragmentation/thrashing) of ZFS becomes largely irrelevant on SSDs;
the very act of wear leveling on an SSD is itself a form of intentional
thrashing that doesn't affect performance since SSDs have no meaningful seek
time. It would seem that PGCon 2013 even had a workshop on it!
https://www.pgcon.org/2013/schedule/events/612.en.html
The exact configuration we're contemplating is either (3x 400 RAIDZ1) or (4x
400 RAIDZ2) with Intel Enterprise SATA3 SSDs, with default (lz4) compression
enabled.
If this is a particularly good or bad idea, I'd like to hear it, and why?
Thanks,
BenP
We've run postgres on ZFS for years with great success (first on
OpenSolaris, now on OmniOS, and I personally run it on FreeBSD). The
snapshotting feature makes upgrades on large clusters much less scary
(snapshot and revert if it goes bad) and being able to bring a snapshot
backup up as a clone to restore an accidentally dropped table is great.
Others have given a lot of great advice as far as system tuning. Only other
thing I can add is you definitely do want your data directory on its own
pool. But I recommend putting the actual data in a folder under that pool
(possibly by major version name). For example if your pool is
/data/postgres
Create a folder under that directory to actually put the data:
mkdir /data/postgres/9.4
This allows pg_upgrade's --link option to work during major upgrades since
you can't have an upgrade destination on a different filesystem. Just make
a 9.5 directory in the same spot when the time comes around. With ZFS
snapshots available, there's really no reason not to use the --link option
to greatly speed up upgrades.
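Put together, the layout and upgrade described above might look like this (dataset names, mount points, and binary paths are examples, not the poster's exact setup):

```shell
# Dedicated pool/dataset for postgres, with per-major-version directories
zfs create tank/postgres                 # mounted at /data/postgres
mkdir /data/postgres/9.4                 # 9.4 data directory lives here

# At major-upgrade time: snapshot first, then hard-link upgrade in place
zfs snapshot tank/postgres@pre-9.5-upgrade
mkdir /data/postgres/9.5
pg_upgrade --link \
  -d /data/postgres/9.4 -D /data/postgres/9.5 \
  -b /usr/pgsql-9.4/bin  -B /usr/pgsql-9.5/bin
# If anything goes wrong: zfs rollback tank/postgres@pre-9.5-upgrade
```

Because both data directories sit on the same filesystem, --link can hard-link unchanged files instead of copying them.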
--
Keith Fiske
Database Administrator
OmniTI Computer Consulting, Inc.
http://www.keithf4.com
On 09/30/2015 07:33 PM, Benjamin Smith wrote:
On Wednesday, September 30, 2015 02:22:31 PM Tomas Vondra wrote:
I think this really depends on the workload - if you have a lot of
random writes, CoW filesystems will perform significantly worse than
e.g. EXT4 or XFS, even on SSD.
I'd be curious about the information you have that leads you to this
conclusion. As with many (most?) "rules of thumb", the devil is quite
often in the details.
A lot of testing done recently, and also experience with other CoW
filesystems (e.g. BTRFS explicitly warns about workloads with a lot of
random writes).
We've been running both on ZFS/CentOS 6 with excellent results, and
are considering putting the two together. In particular, the CoW
nature (and subsequent fragmentation/thrashing) of ZFS becomes
largely irrelevant on SSDs; the very act of wear leveling on an SSD
is itself a form of intentional thrashing that doesn't affect
performance since SSDs have no meaningful seek time.
I don't think that's entirely true. Sure, SSD drives handle random I/O
much better than rotational storage, but it's not entirely free and
sequential I/O is still measurably faster.
It's true that the drives do internal wear leveling, but it probably
uses tricks that are impossible to do at the filesystem level (which is
oblivious to internal details of the SSD). CoW also increases the amount
of blocks that need to be reclaimed.
In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x
faster than ZFS. But of course, if the ZFS features are interesting
for you, maybe it's a reasonable price.
Again, the details would be highly interesting to me. What memory
optimization was done? Status of snapshots? Was the pool RAIDZ or
mirrored vdevs? How many vdevs? Was compression enabled? What ZFS
release was this? Was this on Linux, Free/Open/Net BSD, Solaris, or
something else?
I'm not sure what you mean by "memory optimization" so the answer is
probably "no".
FWIW I don't have much experience with ZFS in production, all I have is
data from benchmarks I've recently done exactly with the goal to educate
myself on the differences of current filesystems.
The tests were done on Linux, with kernel 4.0.4 / zfs 0.6.4. So fairly
recent versions, IMHO.
My goal was to test the file systems under the same conditions and used
a single device (Intel S3700 SSD). I'm aware that this is not a perfect
test and ZFS offers interesting options (e.g. moving ZIL to a separate
device). I plan to benchmark some additional configurations with more
devices and such.
A 2x performance difference is almost inconsequential in my
experience, where growth is exponential. 2x performance change
generally means 1 to 2 years of advancement or deferment against the
progression of hardware; our current, relatively beefy DB servers
are already older than that, and have an anticipated life cycle of at
least another couple years.
I'm not sure I understand what you suggest here. What I'm saying is that
when I do a stress test on the same hardware, I do get ~2x the
throughput with EXT4/XFS, compared to ZFS.
// Our situation // Lots of RAM for the workload: 128 GB of ECC RAM
with an on-disk DB size of ~ 150 GB. Pretty much, everything runs
straight out of RAM cache, with only writes hitting disk. Smart
reports 4/96 read/write ratio.
So your active set fits into RAM? I'd guess all your writes are then WAL
+ checkpoints, which probably makes them rather sequential.
If that's the case, CoW filesystems may perform quite well - I was
mostly referring to workloads with a lot of random writes to the device.
Query load: Constant, heavy writes and heavy use of temp tables in
order to assemble very complex queries. Pretty much the "worst case"
mix of reads and writes, average daily peak of about 200-250
queries/second.
I'm not sure how much random I/O that actually translates to. According
to the numbers I've posted to this thread a few hours ago, a tuned ZFS on
a single SSD device handles ~2.5k tps (with a dataset ~2x the RAM). But
those are OLTP queries - your queries may write much more data. OTOH it
really does not matter that much if your active set fits into RAM,
because then it's mostly about writing to ZIL.
16 Core XEON servers, 32 HT "cores".
SAS 3 Gbps
CentOS 6 is our O/S of choice.
Currently, we're running Intel 710 SSDs in a software RAID1 without
trim enabled and generally happy with the reliability and performance
we see. We're planning to upgrade storage soon (since we're over 50%
utilization) and in the process, bring the magic goodness of
snapshots/clones from ZFS.
I presume by "software RAID1" you mean "mirrored vdev zpool", correct?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/30/2015 03:45 PM, Patric Bechtel wrote:
Hi Tomas,
Tomas Vondra schrieb am 30.09.2015 um 14:01:
Hi,
...
I've also done a few runs with compression, but that reduces the performance a bit
(understandably).
I'm somewhat surprised by the influence of the rsize value. I will recheck that. In my case, the
compression actually improved throughput quite a bit, but that might change depending on CPU speed
vs IO speed. Our CPU's are quite powerful, but the SSD are just SATA Samsung/OCZ models at least
18 months old. Also, I measured the write performance over several hours, to push the internal gc
of the SSD to its limits. We had some problems in the past with (e.g. Intel) SSD's and their
behaviour (<1MB/s), so that's why I put some emphasis on that.
I think it really depends on how random the workload is. If the workload is
random (as for example the workload simulated by pgbench), the
recordsize seems to matter a lot - perhaps your workload is not as
random? Same for compression.
I'm no ZFS expert, but I assume ARC tracks those record chunks, so
using 128kB records (default) means you have 16x fewer slots than with
8kB records. You may still cache the same amount of data, but the cache
may not adapt as well to your active set.
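To put rough numbers on that 16x figure (illustrative arithmetic only; the ARC's real accounting is more involved, and the 4 GiB cache size is a made-up example):

```python
# How many records can a fixed-size cache track at the default 128 KiB
# recordsize vs. an 8 KiB recordsize?
cache_bytes = 4 * 1024**3  # hypothetical 4 GiB ARC

slots_128k = cache_bytes // (128 * 1024)  # 32768 records
slots_8k = cache_bytes // (8 * 1024)      # 524288 records

print(slots_8k // slots_128k)  # 16 - sixteen times more slots at 8 KiB
```

More, smaller slots let the cache track a hot set scattered across the database instead of dragging in 128 KiB for every 8 KiB page touched.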
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wednesday, September 30, 2015 03:49:44 PM Keith Fiske wrote:
We've run postgres on ZFS for years with great success (first on
OpenSolaris, now on OmniOS, and I personally run it on FreeBSD). The
snapshotting feature makes upgrades on large clusters much less scary
(snapshot and revert if it goes bad) and being able to bring a snapshot
backup up as a clone to restore an accidentally dropped table is great.
Somebody mentioned some trouble running it with ZFS on Linux, which is exactly
how we're planning our roll out. (We're a RHEL/CentOS shop) Have you tried
that config, and has it worked for you?
Others have given a lot of great advice as far as system tuning. Only other
thing I can add is you definitely do want your data directory on its own
pool. But I recommend putting the actual data in a folder under that pool
(possibly by major version name). For example if your pool is /data/postgres
Create a folder under that directory to actually put the data:
mkdir /data/postgres/9.4
This allows pg_upgrade's --link option to work during major upgrades since
you can't have an upgrade destination on a different filesystem. Just make
a 9.5 directory in the same spot when the time comes around. With ZFS
snapshots available, there's really no reason not to use the --link option
to greatly speed up upgrades.
Recently, the PGDG RPMs provided by PostgreSQL have done something similar by
moving from /var/lib/pgsql/ to (EG) /var/lib/pgsql/9.4 and we've followed
suit, trying to keep things "stock" where possible.
Our intent is to make /var/lib/pgsql a filesystem in a pool containing no other
file systems, with SSD-based VDEVs that aren't shared for any other purpose.
On Wednesday, September 30, 2015 09:58:08 PM Tomas Vondra wrote:
On 09/30/2015 07:33 PM, Benjamin Smith wrote:
On Wednesday, September 30, 2015 02:22:31 PM Tomas Vondra wrote:
I think this really depends on the workload - if you have a lot of
random writes, CoW filesystems will perform significantly worse than
e.g. EXT4 or XFS, even on SSD.
I'd be curious about the information you have that leads you to this
conclusion. As with many (most?) "rules of thumb", the devil is quite
often in the details.
A lot of testing done recently, and also experience with other CoW
filesystems (e.g. BTRFS explicitly warns about workloads with a lot of
random writes).
We've been running both on ZFS/CentOS 6 with excellent results, and
are considering putting the two together. In particular, the CoW
nature (and subsequent fragmentation/thrashing) of ZFS becomes
largely irrelevant on SSDs; the very act of wear leveling on an SSD
is itself a form of intentional thrashing that doesn't affect
performance since SSDs have no meaningful seek time.
I don't think that's entirely true. Sure, SSD drives handle random I/O
much better than rotational storage, but it's not entirely free and
sequential I/O is still measurably faster.
It's true that the drives do internal wear leveling, but it probably
uses tricks that are impossible to do at the filesystem level (which is
oblivious to internal details of the SSD). CoW also increases the amount
of blocks that need to be reclaimed.
In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x
faster than ZFS. But of course, if the ZFS features are interesting
for you, maybe it's a reasonable price.
Again, the details would be highly interesting to me. What memory
optimization was done? Status of snapshots? Was the pool RAIDZ or
mirrored vdevs? How many vdevs? Was compression enabled? What ZFS
release was this? Was this on Linux, Free/Open/Net BSD, Solaris, or
something else?
I'm not sure what you mean by "memory optimization" so the answer is
probably "no".
I mean the full gamut:
Did you use an l2arc? Did you use a dedicated ZIL? What was arc_max set to?
How much RAM/GB was installed on the machine? How did you set up PG? (PG
defaults are historically horrible for higher-RAM machines)
FWIW I don't have much experience with ZFS in production, all I have is
data from benchmarks I've recently done exactly with the goal to educate
myself on the differences of current filesystems.
The tests were done on Linux, with kernel 4.0.4 / zfs 0.6.4. So fairly
recent versions, IMHO.
My goal was to test the file systems under the same conditions and used
a single device (Intel S3700 SSD). I'm aware that this is not a perfect
test and ZFS offers interesting options (e.g. moving ZIL to a separate
device). I plan to benchmark some additional configurations with more
devices and such.
Also, did you try with/without compression? My information so far is that
compression significantly improves overall performance.
A 2x performance difference is almost inconsequential in my
experience, where growth is exponential. 2x performance change
generally means 1 to 2 years of advancement or deferment against the
progression of hardware; our current, relatively beefy DB servers
are already older than that, and have an anticipated life cycle of at
least another couple years.
I'm not sure I understand what you suggest here. What I'm saying is that
when I do a stress test on the same hardware, I do get ~2x the
throughput with EXT4/XFS, compared to ZFS.
What I'm saying is only what it says on its face: a 50% performance difference
is rarely enough to make or break a production system; performance/capacity
reserves of 95% or more are fairly typical, which means the difference between
5% utilization and 10%. Even if latency rose by 50%, that's typically the
difference between 20ms and 30ms - not enough that anybody'd notice over the
'net for a SOAP/REST call, even if it's enough to make you want to optimize
things a bit.
// Our situation // Lots of RAM for the workload: 128 GB of ECC RAM
with an on-disk DB size of ~ 150 GB. Pretty much, everything runs
straight out of RAM cache, with only writes hitting disk. Smart
reports 4/96 read/write ratio.
So your active set fits into RAM? I'd guess all your writes are then WAL
+ checkpoints, which probably makes them rather sequential.
If that's the case, CoW filesystems may perform quite well - I was
mostly referring to workloads with a lot of random writes to the device.
That's *MY* hope, anyway! :)
Query load: Constant, heavy writes and heavy use of temp tables in
order to assemble very complex queries. Pretty much the "worst case"
mix of reads and writes, average daily peak of about 200-250 queries/second.
I'm not sure how much random I/O that actually translates to. According
to the numbers I've posted to this thread few hours ago, a tuned ZFS on
a single SSD device handles ~2.5k tps (with dataset ~2x the RAM). But
those are OLTP queries - your queries may write much more data. OTOH it
really does not matter that much if your active set fits into RAM,
because then it's mostly about writing to ZIL.
I personally don't yet know how much sense an SSD-backed ZIL makes when the
storage media is also SSD-based.
16 Core XEON servers, 32 HT "cores".
SAS 3 Gbps
CentOS 6 is our O/S of choice.
Currently, we're running Intel 710 SSDs in a software RAID1 without
trim enabled and generally happy with the reliability and performance
we see. We're planning to upgrade storage soon (since we're over 50%
utilization) and in the process, bring the magic goodness of
snapshots/clones from ZFS.
I presume by "software RAID1" you mean "mirrored vdev zpool", correct?
I mean "software RAID 1" with Linux/mdadm. We haven't put ZFS into production
use on any of our DB servers, yet.
Thanks for your input.
Ben
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Wed, Sep 30, 2015 at 4:58 PM, Benjamin Smith <ben@chico.com> wrote:
On Wednesday, September 30, 2015 03:49:44 PM Keith Fiske wrote:
We've run postgres on ZFS for years with great success (first on OpenSolaris, now on OmniOS, and I personally run it on FreeBSD). The snapshotting feature makes upgrades on large clusters much less scary (snapshot and revert if it goes bad), and being able to bring a snapshot backup up as a clone to restore an accidentally dropped table is great.
Somebody mentioned some trouble running it with ZFS on Linux, which is exactly how we're planning our roll out. (We're a RHEL/CentOS shop.) Have you tried that config, and has it worked for you?
We've not run it in production where I work and I haven't met anyone that
is doing it either. Personally, I tried it at home for a while when I used
to use Linux on my home server. But whenever there was a kernel or ZFS
update, I'd occasionally have problems with it booting up or seeing the ZFS
mount. Rebooting again usually fixed it, but it made me nervous every time
there was a kernel update. I switched to FreeBSD a few years ago to get
native ZFS support and haven't looked back since. As that was a few years
ago, things may have improved, but I couldn't speak to those improvements
anymore.
Others have given a lot of great advice as far as system tuning. Only other thing I can add is you definitely do want your data directory on its own pool. But I recommend putting the actual data in a folder under that pool (possibly by major version name). For example, if your pool is /data/postgres, create a folder under that directory to actually put the data:
mkdir /data/postgres/9.4
This allows pg_upgrade's --link option to work during major upgrades, since you can't have an upgrade destination on a different filesystem. Just make a 9.5 directory in the same spot when the time comes around. With ZFS snapshots available, there's really no reason not to use the --link option to greatly speed up upgrades.
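A rough sketch of the layout and upgrade flow described above (all paths, version numbers, and binary locations are illustrative, e.g. PGDG-style RPM paths; they are not taken from this thread):

```shell
# One pool/filesystem for Postgres, with per-version directories inside it,
# so pg_upgrade --link can hard-link files across versions (hard links only
# work within a single filesystem).
mkdir -p /data/postgres/9.4

# Snapshot first, so a bad upgrade can simply be rolled back.
zfs snapshot tank/postgres@pre-9.4-upgrade

# Init the new cluster, then upgrade using hard links instead of copies.
/usr/pgsql-9.4/bin/initdb -D /data/postgres/9.4
pg_upgrade --link \
  -b /usr/pgsql-9.3/bin -B /usr/pgsql-9.4/bin \
  -d /data/postgres/9.3 -D /data/postgres/9.4
```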
Recently, the PGDG RPMs provided by PostgreSQL have done something similar by moving from /var/lib/pgsql/ to (e.g.) /var/lib/pgsql/9.4, and we've followed suit, trying to keep things "stock" where possible.
Our intent is to make /var/lib/pgsql a filesystem in a pool containing no other file systems, with SSD-based vdevs that aren't shared for any other purpose.
On Wed, Sep 30, 2015 at 5:12 PM, Benjamin Smith <lists@benjamindsmith.com>
wrote:
On Wednesday, September 30, 2015 09:58:08 PM Tomas Vondra wrote:
On 09/30/2015 07:33 PM, Benjamin Smith wrote:
On Wednesday, September 30, 2015 02:22:31 PM Tomas Vondra wrote:
I think this really depends on the workload - if you have a lot of random writes, CoW filesystems will perform significantly worse than e.g. EXT4 or XFS, even on SSD.
I'd be curious about the information you have that leads you to this conclusion. As with many (most?) "rules of thumb", the devil is quite often in the details.
A lot of testing done recently, and also experience with other CoW filesystems (e.g. BTRFS explicitly warns about workloads with a lot of random writes).
We've been running both on ZFS/CentOS 6 with excellent results, and are considering putting the two together. In particular, the CoW nature (and subsequent fragmentation/thrashing) of ZFS becomes largely irrelevant on SSDs; the very act of wear leveling on an SSD is itself a form of intentional thrashing that doesn't affect performance since SSDs have no meaningful seek time.
I don't think that's entirely true. Sure, SSD drives handle random I/O much better than rotational storage, but it's not entirely free, and sequential I/O is still measurably faster. It's true that the drives do internal wear leveling, but it probably uses tricks that are impossible to do at the filesystem level (which is oblivious to internal details of the SSD). CoW also increases the amount of blocks that need to be reclaimed.
In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x faster than ZFS. But of course, if the ZFS features are interesting to you, maybe it's a reasonable price.
Again, the details would be highly interesting to me. What memory optimization was done? Status of snapshots? Was the pool RAIDZ or mirrored vdevs? How many vdevs? Was compression enabled? What ZFS release was this? Was this on Linux, Free/Open/NetBSD, Solaris, or something else?
I'm not sure what you mean by "memory optimization" so the answer is probably "no".
I mean the full gamut: Did you use an L2ARC? Did you use a dedicated ZIL? What was arc_max set to? How much RAM was installed on the machine? How did you set up PG? (PG defaults are historically horrible for higher-RAM machines.)
In my testing with pgbench I actually saw a decrease in performance with a ZIL enabled. I ended up just keeping the L2ARC and dropping the ZIL. A ZIL will not provide you with any speed boost as a database. On a NAS with NFS shares, for example, a ZIL would work well. ZIL is more for data protection than anything.
In production I run FreeBSD 10.1 with an NVMe mirror for L2ARC; the rest of the storage is spinning drives, with a combination of filesystem compression settings. For example, archival tablespaces and the log folder are on gzip compression on an external array. Faster stuff like the xlog is lz4 and on an internal array.
If you are interested I might still have the data from when I executed the
tests.
-Joseph Kregloh
FWIW I don't have much experience with ZFS in production, all I have is data from benchmarks I've recently done exactly with the goal to educate myself on the differences of current filesystems.
The tests were done on Linux, with kernel 4.0.4 / zfs 0.6.4. So fairly recent versions, IMHO.
My goal was to test the file systems under the same conditions, and I used a single device (Intel S3700 SSD). I'm aware that this is not a perfect test and ZFS offers interesting options (e.g. moving ZIL to a separate device). I plan to benchmark some additional configurations with more devices and such.
Also, did you try with/without compression? My information so far is that compression significantly improves overall performance.
A 2x performance difference is almost inconsequential in my experience, where growth is exponential. A 2x performance change generally means 1 to 2 years of advancement or deferment against the progression of hardware; our current, relatively beefy DB servers are already older than that, and have an anticipated life cycle of at
least another couple years.
On 10/1/15 8:50 AM, Joseph Kregloh wrote:
In my testing with pgbench I actually saw a decrease in performance with a ZIL enabled. I ended up just keeping the L2ARC and dropping the ZIL. A ZIL will not provide you with any speed boost as a database. On a NAS with NFS shares, for example, a ZIL would work well. ZIL is more for data protection than anything.
In production I run FreeBSD 10.1 with an NVMe mirror for L2ARC; the rest of the storage is spinning drives, with a combination of filesystem compression settings. For example, archival tablespaces and the log folder are on gzip compression on an external array. Faster stuff like the xlog is lz4 and on an internal array.
I'm not a ZFS expert, but my understanding is that a ZIL *that has lower latency than main storage* can be a performance win. This is similar to the idea of giving pg_xlog its own dedicated volume so that it's not competing with all the other IO traffic every time you do a COMMIT.
Recent versions of Postgres go to a lot of trouble to make fsync as painless as possible, so a ZIL might not help much in many cases. Where it could still help is if you're running synchronous_commit = true and you consistently get lower latency on the ZIL than on the vdevs; that will make every COMMIT run faster.
(BTW, this is all based on the assumption that ZFS treats fsync as a
synchronous request.)
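One way to gauge what commit-time fsync costs on a given setup (a sketch, assuming a scratch pgbench database; none of these commands come from this thread) is to compare COMMIT-bound throughput with synchronous commit on and off:

```shell
# Baseline: every COMMIT waits for the WAL fsync, so ZIL/SLOG latency matters.
pgbench -i -s 50 testdb
pgbench -c 8 -j 8 -T 60 testdb

# Turn off synchronous commit (risking a small window of lost transactions,
# not corruption) and rerun; a large gap suggests fsync latency is the
# bottleneck and a lower-latency ZIL could help.
psql -d testdb -c "ALTER SYSTEM SET synchronous_commit = off"
psql -d testdb -c "SELECT pg_reload_conf()"
pgbench -c 8 -j 8 -T 60 testdb
```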
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Thu, Oct 1, 2015 at 5:51 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
The ZIL, or ZFS Intent Log, is, as the name describes, just a log. It exists to replay transactions that may have been lost in the event of machine failure. If the machine crashes, then upon startup ZFS will replay the data stored on the ZIL device and try to fix any errors. During normal runtime the ZIL is only written to, never read from.
When there is no separate ZIL device, a synchronous write causes ZFS to store the data in RAM and in the ZIL residing on the vdev. Once it acknowledges that the data is all there, it will flush it from RAM to its final write location on the vdev.
When you have a fast dedicated ZIL device like an SSD or NVMe drive, it will do the same: store the data in RAM and on the fast ZIL device. Once acknowledged, it will also write from RAM to the vdev. In theory this gives you a faster acknowledgement time.
In either case you are still "bottlenecked" by the speed of the write from RAM to the zpool. Now for a small database without many writes, a separate ZIL would be awesome. But on a write-heavy database you will be acknowledging more writes, because of the ZIL, than you are physically able to flush from RAM to the zpool, thereby degrading performance.
At least this is how it works in my head.
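For reference, attaching the kind of fast, mirrored log device described above to an existing pool might look like this (the pool and device names are hypothetical):

```shell
# Add a mirrored SLOG (dedicated ZIL device) to pool "tank"; only
# synchronous writes land here before being flushed from RAM to the vdevs.
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
zpool status tank   # a "logs" section should now list the mirror
```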
-Joseph Kregloh