1. 13 Mar 2016, 1 commit
  2. 11 Mar 2016, 1 commit
    • A
      Allow to trigger kernel writeback after a configurable number of writes. · 428b1d6b
      Committed by Andres Freund
      Currently writes to the main data files of postgres all go through the
      OS page cache. This means that some operating systems can end up
      collecting a large number of dirty buffers in their respective page
      caches.  When these dirty buffers are flushed to storage rapidly, be it
      because of fsync(), timeouts, or dirty ratios, latency for other reads
      and writes can increase massively.  This is the primary reason for
      regular massive stalls observed in real world scenarios and artificial
      benchmarks; on rotating disks stalls on the order of hundreds of seconds
      have been observed.
      
      On Linux it is possible to control this by reducing the global dirty
      limits significantly, mitigating the above problem. But global
      configuration is rather problematic because it'll affect other
      applications; also, PostgreSQL itself doesn't always want this
      behavior, e.g. for temporary files it's undesirable.
      
      Several operating systems allow some control over the kernel page
      cache. Linux has sync_file_range(2), several POSIX systems have msync(2)
      and posix_fadvise(2). sync_file_range(2) is preferable because it
      requires no special setup, whereas msync() requires the to-be-flushed
      range to be mmap'ed. For the purpose of flushing dirty data
      posix_fadvise(2) is the worst alternative, as flushing dirty data is
      just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages
      from the page cache.  Thus the feature is enabled by default only on
      Linux, but can be enabled on all systems that have any of the above
      APIs.
      
      While desirable and likely possible this patch does not contain an
      implementation for windows.
      
      With the infrastructure added, writes made via checkpointer, bgwriter
      and normal user backends can be flushed after a configurable number of
      writes. Each of these sources of writes is controlled by a separate GUC:
      checkpointer_flush_after, bgwriter_flush_after and backend_flush_after
      respectively. They're separate because the number of writes after which
      flushing is beneficial differs for each, and because the performance
      considerations of controlled flushing differ for each as well.
      
      A later patch will add checkpoint sorting - after that, flushes from the
      checkpoint will almost always be desirable. Bgwriter flushes are most of
      the time going to be random writes, which are slow on lots of storage
      hardware. Flushing in backends works well if the storage and bgwriter
      can keep up, but if not it can have negative consequences.  This patch
      is likely to have negative performance consequences without checkpoint
      sorting, but unfortunately so does sorting without flush control.
      
      Discussion: alpine.DEB.2.10.1506011320000.28433@sto
      Author: Fabien Coelho and Andres Freund
      428b1d6b
  3. 10 Mar 2016, 1 commit
    • A
      Introduce durable_rename() and durable_link_or_rename(). · 606e0f98
      Committed by Andres Freund
      Renaming a file using rename(2) is not guaranteed to be durable in the
      face of crashes, especially on filesystems like xfs and ext4 when
      mounted with data=writeback. To be certain that a rename() atomically
      replaces the previous file contents in the face of crashes and different
      filesystems, one has to fsync the old filename, rename the file, fsync
      the new filename, and fsync the containing directory.  This sequence is
      not generally adhered to currently, which exposes us to data-loss risks.
      To avoid having to repeat this arduous sequence, introduce
      durable_rename(), which wraps all that.
      
      Also add durable_link_or_rename(). Several places use link() (with a
      fallback to rename()) to rename a file, trying to avoid replacing the
      target file out of paranoia. Some of those rename sequences need to be
      durable as well. There seems little reason to extend several copies of
      the same logic, so centralize the link() callers.
      
      This commit does not yet make use of the new functions; they're used in
      a followup commit.
      
      Author: Michael Paquier, Andres Freund
      Discussion: 56583BDD.9060302@2ndquadrant.com
      Backpatch: All supported branches
      606e0f98
  4. 08 Mar 2016, 1 commit
  5. 03 Jan 2016, 1 commit
  6. 30 May 2015, 1 commit
    • T
      Remove special cases for ETXTBSY from new fsync'ing logic. · 57e1138b
      Committed by Tom Lane
      The argument that this is a sufficiently-expected case to be silently
      ignored seems pretty thin.  Andres had brought it up back when we were
      still considering that most fsync failures should be hard errors, and it
      probably would be legit not to fail hard for ETXTBSY --- but the same is
      true for EROFS and other cases, which is why we gave up on hard failures.
      ETXTBSY is surely not a normal case, so logging the failure seems fine
      from here.
      57e1138b
  7. 29 May 2015, 2 commits
    • T
      Fix fsync-at-startup code to not treat errors as fatal. · d8179b00
      Committed by Tom Lane
      Commit 2ce439f3 introduced a rather serious
      regression, namely that if its scan of the data directory came across any
      un-fsync-able files, it would fail and thereby prevent database startup.
      Worse yet, symlinks to such files also caused the problem, which meant that
      crash restart was guaranteed to fail on certain common installations such
      as older Debian.
      
      After discussion, we agreed that (1) failure to start is worse than any
      consequence of not fsync'ing is likely to be, therefore treat all errors
      in this code as nonfatal; (2) we should not chase symlinks other than
      those that are expected to exist, namely pg_xlog/ and tablespace links
      under pg_tblspc/.  The latter restriction avoids possibly fsync'ing a
      much larger part of the filesystem than intended, if the user has left
      random symlinks hanging about in the data directory.
      
      This commit takes care of that and also does some code beautification,
      mainly moving the relevant code into fd.c, which seems a much better place
      for it than xlog.c, and making sure that the conditional compilation for
      the pre_sync_fname pass has something to do with whether pg_flush_data
      works.
      
      I also relocated the call site in xlog.c down a few lines; it seems a
      bit silly to be doing this before ValidateXLOGDirectoryStructure().
      
      The similar logic in initdb.c ought to be made to match this, but that
      change is noncritical and will be dealt with separately.
      
      Back-patch to all active branches, like the prior commit.
      
      Abhijit Menon-Sen and Tom Lane
      d8179b00
    • T
      Fix assorted inconsistencies in our calls of readlink(). · 32f628be
      Committed by Tom Lane
      Ensure that we null-terminate the result string (one place in pg_rewind).
      Be paranoid about out-of-range results from readlink() (should not happen,
      but there is no good reason for some call sites to be careful about it and
      others not).  Consistently use the whole buffer, not sometimes one byte
      less.  Ensure we emit an appropriate errcode() in all cases.  Spell the
      error messages the same way.
      
      The only serious bug here is the missing null-termination in pg_rewind,
      which is new code, so no need for a back-patch.
      
      Abhijit Menon-Sen and Tom Lane
      32f628be
  8. 24 May 2015, 1 commit
  9. 19 May 2015, 1 commit
  10. 05 May 2015, 2 commits
    • R
      Fix some problems with patch to fsync the data directory. · 456ff086
      Committed by Robert Haas
      pg_win32_is_junction() was a typo for pgwin32_is_junction().  open()
      was used not only in a two-argument form, which breaks on Windows,
      but also where BasicOpenFile() should have been used.
      
      Per reports from Andrew Dunstan and David Rowley.
      456ff086
    • R
      Recursively fsync() the data directory after a crash. · 2ce439f3
      Committed by Robert Haas
      Otherwise, if there's another crash, some writes from after the first
      crash might make it to disk while writes from before the crash fail
      to make it to disk.  This could lead to data corruption.
      
      Back-patch to all supported versions.
      
      Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised
      by me.
      2ce439f3
  11. 07 Jan 2015, 1 commit
  12. 06 Nov 2014, 1 commit
    • H
      Move the backup-block logic from XLogInsert to a new file, xloginsert.c. · 2076db2a
      Committed by Heikki Linnakangas
      xlog.c is huge; this makes it a little bit smaller, which is nice. The
      functions related to putting together WAL records are in xloginsert.c,
      and the lower-level machinery for managing WAL buffers and such stays
      in xlog.c.
      
      Also move the definition of XLogRecord to a separate header file. This
      causes churn in the #includes of all the files that write WAL records, and
      redo routines, but it avoids pulling in xlog.h into most places.
      
      Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
      2076db2a
  13. 07 May 2014, 1 commit
    • B
      pgindent run for 9.4 · 0a783200
      Committed by Bruce Momjian
      This includes removing tabs after periods in C comments, which was
      applied to back branches, so this change should not affect backpatching.
      0a783200
  14. 01 May 2014, 1 commit
    • T
      Rationalize common/relpath.[hc]. · 2d001904
      Committed by Tom Lane
      Commit a7301839 created rather a mess by
      putting dependencies on backend-only include files into include/common.
      We really shouldn't do that.  To clean it up:
      
      * Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in
      catalog/catalog.h.  We won't consider this symbol part of the FE/BE API.
      
      * Push enum ForkNumber from relfilenode.h into relpath.h.  We'll consider
      relpath.h as the source of truth for fork numbers, since relpath.c was
      already partially serving that function, and anyway relfilenode.h was
      kind of a random place for that enum.
      
      * So, relfilenode.h now includes relpath.h rather than vice-versa.  This
      direction of dependency is fine.  (That allows most, but not quite all,
      of the existing explicit #includes of relpath.h to go away again.)
      
      * Push forkname_to_number from catalog.c to relpath.c, just to centralize
      fork number stuff a bit better.
      
      * Push GetDatabasePath from catalog.c to relpath.c; it was rather odd
      that the previous commit didn't keep this together with relpath().
      
      * To avoid needing relfilenode.h in common/, redefine the underlying
      function (now called GetRelationPath) as taking separate OID arguments,
      and make the APIs using RelFileNode or RelFileNodeBackend into macro
      wrappers.  (The macros have a potential multiple-eval risk, but none of
      the existing call sites have an issue with that; one of them had such a
      risk already anyway.)
      
      * Fix failure to follow the directions when "init" fork type was added;
      specifically, the errhint in forkname_to_number wasn't updated, and neither
      was the SGML documentation for pg_relation_size().
      
      * Fix tablespace-path-too-long check in CreateTableSpace() to account for
      fork-name component of maximum-length pathnames.  This requires putting
      FORKNAMECHARS into a header file, but it was rather useless (and
      actually unreferenced) where it was.
      
      The last couple of items are potentially back-patchable bug fixes,
      if anyone is sufficiently excited about them; but personally I'm not.
      
      Per a gripe from Christoph Berg about how include/common wasn't
      self-contained.
      2d001904
  15. 22 Mar 2014, 2 commits
  16. 13 Mar 2014, 1 commit
  17. 24 Jan 2014, 1 commit
    • T
      Allow use of "z" flag in our printf calls, and use it where appropriate. · ac4ef637
      Committed by Tom Lane
      Since C99, it's been standard for printf and friends to accept a "z" size
      modifier, meaning "whatever size size_t has".  Up to now we've generally
      dealt with printing size_t values by explicitly casting them to unsigned
      long and using the "l" modifier; but this is really the wrong thing on
      platforms where pointers are wider than longs (such as Win64).  So let's
      start using "z" instead.  To ensure we can do that on all platforms, teach
      src/port/snprintf.c to understand "z", and add a configure test to force
      use of that implementation when the platform's version doesn't handle "z".
      
      Having done that, modify a bunch of places that were using the
      unsigned-long hack to use "z" instead.  This patch doesn't pretend to have
      gotten everyplace that could benefit, but it catches many of them.  I made
      an effort in particular to ensure that all uses of the same error message
      text were updated together, so as not to increase the number of
      translatable strings.
      
      It's possible that this change will result in format-string warnings from
      pre-C99 compilers.  We might have to reconsider if there are any popular
      compilers that will warn about this; but let's start by seeing what the
      buildfarm thinks.
      
      Andres Freund, with a little additional work by me
      ac4ef637
  18. 08 Jan 2014, 1 commit
  19. 04 Sep 2013, 1 commit
  20. 10 Jun 2013, 1 commit
    • T
      Remove fixed limit on the number of concurrent AllocateFile() requests. · 007556bf
      Committed by Tom Lane
      AllocateFile(), AllocateDir(), and some sister routines share a small array
      for remembering requests, so that the files can be closed on transaction
      failure.  Previously that array had a fixed size, MAX_ALLOCATED_DESCS (32).
      While historically that had seemed sufficient, Steve Toutant pointed out
      that this meant you couldn't scan more than 32 file_fdw foreign tables in
      one query, because file_fdw depends on the COPY code which uses
      AllocateFile().  There are probably other cases, or will be in the future,
      where this nonconfigurable limit impedes users.
      
      We can't completely remove any such limit, at least not without a lot of
      work, since each such request requires a kernel file descriptor and most
      platforms limit the number we can have.  (In principle we could
      "virtualize" these descriptors, as fd.c already does for the main VFD pool,
      but not without an additional layer of overhead and a lot of notational
      impact on the calling code.)  But we can at least let the array size be
      configurable.  Hence, change the code to allow up to max_safe_fds/2
      allocated file requests.  On modern platforms this should allow several
      hundred concurrent file_fdw scans, or more if one increases the value of
      max_files_per_process.  To go much further than that, we'd need to do some
      more work on the data structure, since the current code for closing
      requests has potentially O(N^2) runtime; but it should still be all right
      for request counts in this range.
      
      Back-patch to 9.1 where contrib/file_fdw was introduced.
      007556bf
  21. 17 May 2013, 1 commit
    • T
      Fix fd.c to preserve errno where needed. · 6563fb2b
      Committed by Tom Lane
      PathNameOpenFile failed to ensure that the correct value of errno was
      returned to its caller after a failure (because it incorrectly supposed
      that free() can never change errno).  In some cases this would result
      in a user-visible failure because an expected ENOENT errno was replaced
      with something else.  Bogus EINVAL failures have been observed on OS X,
      for example.
      
      There were also a couple of places that could mangle an important value
      of errno if FDDEBUG was defined.  While the usefulness of that debug
      support is highly debatable, we might as well make it safe to use,
      so add errno save/restore logic to the DO_DB macro.
      
      Per bug #8167 from Nelson Minar, diagnosed by RhodiumToad.
      Back-patch to all supported branches.
      6563fb2b
  22. 28 Feb 2013, 1 commit
    • H
      Add support for piping COPY to/from an external program. · 3d009e45
      Committed by Heikki Linnakangas
      This includes backend "COPY TO/FROM PROGRAM '...'" syntax, and corresponding
      psql \copy syntax. Like with reading/writing files, the backend version is
      superuser-only, and in the psql version, the program is run in the client.
      
      In passing, the psql \copy STDIN/STDOUT syntax is subtly changed: if
      stdin/stdout is quoted, it's now interpreted as a filename. For example,
      "\copy foo from 'stdin'" now reads from a file called 'stdin', not from
      standard input. Before this, there was no way to specify a filename
      called stdin, stdout, pstdin or pstdout.
      
      This creates a new function in pgport, wait_result_to_str(), which can
      be used to convert the exit status of a process, as returned by wait(3),
      to a human-readable string.
      
      Etsuro Fujita, reviewed by Amit Kapila.
      3d009e45
  23. 22 Feb 2013, 1 commit
    • A
      Move relpath() to libpgcommon · a7301839
      Committed by Alvaro Herrera
      This enables non-backend code, such as pg_xlogdump, to use it easily.
      The previous location, in src/backend/catalog/catalog.c, made that
      essentially impossible because that file depends on many backend-only
      facilities; so this needs to live separately.
      a7301839
  24. 20 Feb 2013, 1 commit
  25. 08 Feb 2013, 2 commits
  26. 02 Jan 2013, 1 commit
  27. 27 Nov 2012, 1 commit
    • H
      Add OpenTransientFile, with automatic cleanup at end-of-xact. · 1f67078e
      Committed by Heikki Linnakangas
      Files opened with BasicOpenFile or PathNameOpenFile are not automatically
      cleaned up on error. That puts unnecessary burden on callers that only
      want to keep the file open for a short time. There is AllocateFile, but
      that returns a buffered FILE * stream, which in many cases is not the
      nicest API to work with. So add a function called OpenTransientFile,
      which returns an unbuffered fd that's cleaned up like the FILE * returned
      by AllocateFile().
      
      This plugs a few rare fd leaks in error cases:
      
      1. copy_file() - fixed by using OpenTransientFile instead of BasicOpenFile
      2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't
         use OpenTransientFile here because the fd is supposed to persist over
         transaction boundaries.
      3. lo_import/lo_export - fixed by using OpenTransientFile instead of
         PathNameOpenFile.
      
      In addition to plugging those leaks, this replaces many BasicOpenFile() calls
      with OpenTransientFile() that were not leaking, because the code meticulously
      closed the file on error. That wasn't strictly necessary, but IMHO it's good
      for robustness.
      
      The same leaks exist in older versions, but given the rarity of the issues,
      I'm not backpatching this. Not yet, anyway - it might be good to backpatch
      later, after this mechanism has had some more testing in master branch.
      1f67078e
  28. 18 Oct 2012, 1 commit
    • T
      Revert "Use "transient" files for blind writes, take 2". · 9bacf0e3
      Committed by Tom Lane
      This reverts commit fba105b1.
      That approach had problems with the smgr-level state not tracking what
      we really want to happen, and with the VFD-level state not tracking the
      smgr-level state very well either.  In consequence, it was still possible
      to hold kernel file descriptors open for long-gone tables (as in recent
      report from Tore Halset), and yet there were also cases of FDs being closed
      undesirably soon.  A replacement implementation will follow.
      9bacf0e3
  29. 29 Aug 2012, 1 commit
    • A
      Split resowner.h · 45326c5a
      Committed by Alvaro Herrera
      This lets files that are mere users of ResourceOwner not automatically
      include the headers for stuff that is managed by the resowner mechanism.
      45326c5a
  30. 22 Jul 2012, 1 commit
    • T
      Improve copydir() code for the case that fsync is off. · 2d46a57d
      Committed by Tom Lane
      We should avoid calling sync_file_range or posix_fadvise in this case,
      since (a) we don't really care if the data gets synced, and might as
      well save the kernel calls; (b) at least on Linux we know that the
      kernel might block us until it's scheduled the write.
      
      Also, avoid making a useless second traversal of the directory tree
      if we're not actually going to call fsync(2) after all.
      2d46a57d
  31. 14 Jul 2012, 1 commit
    • T
      Add fsync capability to initdb, and use sync_file_range() if available. · b966dd6c
      Committed by Tom Lane
      Historically we have not worried about fsync'ing anything during initdb
      (in fact, initdb intentionally passes -F to each backend launch to prevent
      it from fsync'ing).  But with filesystems getting more aggressive about
      caching data, that's not such a good plan anymore.  Make initdb do a pass
      over the finished data directory tree to fsync everything.  For testing
      purposes, the -N/--nosync flag can be used to restore the old behavior.
      
      Also, testing shows that on Linux, sync_file_range() is much faster than
      posix_fadvise() for hinting to the kernel that an fsync is coming,
      apparently because the latter blocks on a rather small request queue while
      the former doesn't.  So use this function if available in initdb, and also
      in the backend's pg_flush_data() (where it currently will affect only the
      speed of CREATE DATABASE's cloning step).
      
      We will later make pg_regress invoke initdb with the --nosync flag
      to avoid slowing down cases such as "make check" in contrib.  But
      let's not do so until we've shaken out any portability issues in this
      patch.
      
      Jeff Davis, reviewed by Andres Freund
      b966dd6c
  32. 11 Jun 2012, 1 commit
  33. 29 Mar 2012, 1 commit
    • H
      Inherit max_safe_fds to child processes in EXEC_BACKEND mode. · 5762a4d9
      Committed by Heikki Linnakangas
      Postmaster sets max_safe_fds by testing how many open file descriptors it
      can open, and that is normally inherited by all child processes at fork().
      Not so with EXEC_BACKEND, i.e. on Windows, however. Because of that, we
      effectively ignored max_files_per_process on Windows, and always assumed
      a conservative default of 32 simultaneous open files. That could have an
      impact on performance, if you need to access a lot of different files
      in a query. After this patch, the value is passed to child processes by
      save/restore_backend_variables() among many other global variables.
      
      It has been like this forever, but given the lack of complaints about it,
      I'm not backpatching this.
      5762a4d9
  34. 22 Mar 2012, 1 commit
  35. 28 Jan 2012, 1 commit
  36. 26 Jan 2012, 1 commit