diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml index 3314088c1c2da6c2946d63cbb857569b5460b614..4733786f95c7ce9181dbd5a81f1d069f1d1c6121 100644 --- a/doc/src/sgml/wal.sgml +++ b/doc/src/sgml/wal.sgml @@ -1,4 +1,4 @@ - + Write-Ahead Logging (<acronym>WAL</acronym>) @@ -88,8 +88,11 @@ transaction identifiers. Once UNDO is implemented, pg_clog will no longer be required to be permanent; it will be possible to remove - pg_clog at shutdown, split it into segments - and remove old segments. + pg_clog at shutdown. (However, the urgency + of this concern has decreased greatly with the adoption of a segmented + storage method for pg_clog --- it is no longer + necessary to keep old pg_clog entries around + forever.) @@ -116,6 +119,18 @@ copying the data files (operating system copy commands are not suitable). + + + A difficulty standing in the way of realizing these benefits is that they + require saving WAL entries for considerable periods + of time (eg, as long as the longest possible transaction if transaction + UNDO is wanted). The present WAL format is + extremely bulky since it includes many disk page snapshots. + This is not a serious concern at present, since the entries only need + to be kept for one or two checkpoint intervals; but to achieve + these future benefits some sort of compressed WAL + format will be needed. + @@ -133,8 +148,8 @@ WAL logs are stored in the directory $PGDATA/pg_xlog, as - a set of segment files, each 16 MB in size. Each segment is - divided into 8 kB pages. The log record headers are described in + a set of segment files, each 16MB in size. Each segment is + divided into 8KB pages. The log record headers are described in access/xlog.h; record content is dependent on the type of event that is being logged. Segment files are given ever-increasing numbers as names, starting at @@ -147,8 +162,8 @@ The WAL buffers and control structure are in shared memory, and are handled by the backends; they are protected by lightweight locks. 
The demand on shared memory is dependent on the - number of buffers; the default size of the WAL - buffers is 64 kB. + number of buffers. The default size of the WAL + buffers is eight 8KB buffers, or 64KB. @@ -166,8 +181,8 @@ disk drives that falsely report a successful write to the kernel, when, in fact, they have only cached the data and not yet stored it on the disk. A power failure in such a situation may still lead to - irrecoverable data corruption; administrators should try to ensure - that disks holding PostgreSQL's data and + irrecoverable data corruption. Administrators should try to ensure + that disks holding PostgreSQL's log files do not make such false reports. @@ -179,11 +194,12 @@ checkpoint's position is saved in the file pg_control. Therefore, when recovery is to be done, the backend first reads pg_control and - then the checkpoint record; next it reads the redo record, whose - position is saved in the checkpoint, and begins the REDO operation. - Because the entire content of the pages is saved in the log on the - first page modification after a checkpoint, the pages will be first - restored to a consistent state. + then the checkpoint record; then it performs the REDO operation by + scanning forward from the log position indicated in the checkpoint + record. + Because the entire content of data pages is saved in the log on the + first page modification after a checkpoint, all pages changed since + the checkpoint will be restored to a consistent state. @@ -217,9 +233,9 @@ buffers. This is undesirable because LogInsert is used on every database low level modification (for example, tuple insertion) at a time when an exclusive lock is held on - affected data pages and the operation is supposed to be as fast as - possible; what is worse, writing WAL buffers may - also cause the creation of a new log segment, which takes even more + affected data pages, so the operation needs to be as fast as + possible. 
What is worse, writing WAL buffers may + also force the creation of a new log segment, which takes even more time. Normally, WAL buffers should be written and flushed by a LogFlush request, which is made, for the most part, at transaction commit time to ensure that @@ -230,7 +246,7 @@ one should increase the number of WAL buffers by modifying the WAL_BUFFERS parameter. The default number of WAL buffers is 8. Increasing this - value will have an impact on shared memory usage. + value will correspondingly increase shared memory usage. @@ -243,34 +259,28 @@ log (known as the redo record) it should start the REDO operation, since any changes made to data files before that record are already on disk. After a checkpoint has been made, any log segments written - before the undo records are removed, so checkpoints are used to free - disk space in the WAL directory. (When - WAL-based BAR is implemented, - the log segments can be archived instead of just being removed.) - The checkpoint maker is also able to create a few log segments for - future use, so as to avoid the need for - LogInsert or LogFlush to - spend time in creating them. + before the undo records are no longer needed and can be recycled or + removed. (When WAL-based BAR is + implemented, the log segments would be archived before being recycled + or removed.) - The WAL log is held on the disk as a set of 16 - MB files called segments. By default a new - segment is created only if more than 75% of the current segment is - used. One can instruct the server to pre-create up to 64 log segments + The checkpoint maker is also able to create a few log segments for + future use, so as to avoid the need for + LogInsert or LogFlush to + spend time in creating them. (If that happens, the entire database + system will be delayed by the creation operation, so it's better if + the files can be created in the checkpoint maker, which is not on + anyone's critical path.) 
+ By default a new 16MB segment file is created only if more than 75% of + the current segment has been used. This is inadequate if the system + generates more than 4MB of log output between checkpoints. + One can instruct the server to pre-create up to 64 log segments at checkpoint time by modifying the WAL_FILES configuration parameter. - - For faster after-crash recovery, it would be better to create - checkpoints more often. However, one should balance this against - the cost of flushing dirty data pages; in addition, to ensure data - page consistency, the first modification of a data page after each - checkpoint results in logging the entire page content, thus - increasing output to log and the log's size. - - The postmaster spawns a special backend process every so often to create the next checkpoint. A checkpoint is created every @@ -281,6 +291,35 @@ CHECKPOINT. + + Reducing CHECKPOINT_SEGMENTS and/or + CHECKPOINT_TIMEOUT causes checkpoints to be + done more often. This allows faster after-crash recovery (since + less work will need to be redone). However, one must balance this against + the increased cost of flushing dirty data pages more often. In addition, + to ensure data page consistency, the first modification of a data page + after each checkpoint results in logging the entire page content. + Thus a smaller checkpoint interval increases the volume of output to + the log, partially negating the goal of using a smaller interval, and + in any case causing more disk I/O. + + + + The number of 16MB segment files will always be at least + WAL_FILES + 1, and will normally not exceed + WAL_FILES + 2 * CHECKPOINT_SEGMENTS + + 1. This may be used to estimate space requirements for WAL. Ordinarily, + when an old log segment file is no longer needed, it is recycled (renamed + to become the next sequential future segment). 
If, due to a short-term + peak of log output rate, there are more than WAL_FILES + + 2 * CHECKPOINT_SEGMENTS + 1 segment files, then unneeded + segment files will be deleted instead of recycled until the system gets + back under this limit. (If this happens on a regular basis, + WAL_FILES should be increased to avoid it. Deleting log + segments that will only have to be created again later is expensive and + pointless.) + + The COMMIT_DELAY parameter defines for how many microseconds the backend will sleep after writing a commit @@ -294,6 +333,8 @@ Note that on most platforms, the resolution of a sleep request is ten milliseconds, so that any nonzero COMMIT_DELAY setting between 1 and 10000 microseconds will have the same effect. + Good values for these parameters are not yet clear; experimentation + is encouraged.
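
Reviewer note: the segment-count bounds added in this patch (at least WAL_FILES + 1, normally at most WAL_FILES + 2 * CHECKPOINT_SEGMENTS + 1, each segment 16MB) can be turned into a rough disk-space estimate. The following is a minimal sketch, not part of the patch; the function names and the example settings are hypothetical and chosen only for illustration.

```python
# Rough WAL disk-space estimate based on the formulas in the patch text.
# Function names and example values are illustrative assumptions.

SEGMENT_MB = 16  # segment file size stated in the documentation


def wal_segment_bounds(wal_files, checkpoint_segments):
    """Return (minimum, usual maximum) number of segment files."""
    minimum = wal_files + 1
    usual_max = wal_files + 2 * checkpoint_segments + 1
    return minimum, usual_max


def wal_space_mb(wal_files, checkpoint_segments):
    """Return (minimum, usual maximum) disk space in MB."""
    low, high = wal_segment_bounds(wal_files, checkpoint_segments)
    return low * SEGMENT_MB, high * SEGMENT_MB


if __name__ == "__main__":
    # Example settings only; substitute the values from your configuration.
    print(wal_segment_bounds(0, 3))
    print(wal_space_mb(0, 3))
```

With the example settings WAL_FILES = 0 and CHECKPOINT_SEGMENTS = 3, this gives between 1 and 7 segment files, or 16MB to 112MB. The upper figure is only the normal ceiling: as the patch text says, a short-term peak in log output can exceed it temporarily.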