1. 17 10月, 2013 1 次提交
    • J
      fs: buffer: move allocation failure loop into the allocator · 84235de3
      Johannes Weiner 提交于
      Buffer allocation has a very crude indefinite loop around waking the
      flusher threads and performing global NOFS direct reclaim because it can
      not handle allocation failures.
      
      The most immediate problem with this is that the allocation may fail due
      to a memory cgroup limit, where flushers + direct reclaim might not make
      any progress towards resolving the situation at all.  Because unlike the
      global case, a memory cgroup may not have any cache at all, only
      anonymous pages but no swap.  This situation will lead to a reclaim
      livelock with insane IO from waking the flushers and thrashing unrelated
      filesystem cache in a tight loop.
      
      Use __GFP_NOFAIL allocations for buffers for now.  This makes sure that
      any looping happens in the page allocator, which knows how to
      orchestrate kswapd, direct reclaim, and the flushers sensibly.  It also
      allows memory cgroups to detect allocations that can't handle failure
      and will allow them to ultimately bypass the limit if reclaim can not
      make progress.
      Reported-by: NazurIt <azurit@pobox.sk>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84235de3
  2. 04 7月, 2013 1 次提交
    • M
      mm: vmscan: take page buffers dirty and locked state into account · b4597226
      Mel Gorman 提交于
      Page reclaim keeps track of dirty and under writeback pages and uses it
      to determine if wait_iff_congested() should stall or if kswapd should
      begin writing back pages.  This fails to account for buffer pages that
      can be under writeback but not PageWriteback which is the case for
      filesystems like ext3 ordered mode.  Furthermore, PageDirty buffer pages
      can have all the buffers clean and writepage does no IO so it should not
      be accounted as congested.
      
      This patch adds an address_space operation that filesystems may
      optionally use to check if a page is really dirty or really under
      writeback.  An implementation is provided for for buffer_heads is added
      and used for block operations and ext3 in ordered mode.  By default the
      page flags are obeyed.
      
      Credit goes to Jan Kara for identifying that the page flags alone are
      not sufficient for ext3 and sanity checking a number of ideas on how the
      problem could be addressed.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4597226
  3. 22 5月, 2013 1 次提交
    • L
      mm: change invalidatepage prototype to accept length · d47992f8
      Lukas Czerner 提交于
      Currently there is no way to truncate partial page where the end
      truncate point is not at the end of the page. This is because it was not
      needed and the functionality was enough for file system truncate
      operation to work properly. However more file systems now support punch
      hole feature and it can benefit from mm supporting truncating page just
      up to the certain point.
      
      Specifically, with this functionality truncate_inode_pages_range() can
      be changed so it supports truncating partial page at the end of the
      range (currently it will BUG_ON() if 'end' is not at the end of the
      page).
      
      This commit changes the invalidatepage() address space operation
      prototype to accept range to be invalidated and update all the instances
      for it.
      
      We also change the block_invalidatepage() in the same way and actually
      make a use of the new length argument implementing range invalidation.
      
      Actual file system implementations will follow except the file systems
      where the changes are really simple and should not change the behaviour
      in any way .Implementation for truncate_page_range() which will be able
      to accept page unaligned ranges will follow as well.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      d47992f8
  4. 30 4月, 2013 2 次提交
  5. 21 4月, 2013 1 次提交
  6. 24 3月, 2013 1 次提交
    • K
      block: Remove bi_idx references · 4f2ac93c
      Kent Overstreet 提交于
      For immutable bvecs, all bi_idx usage needs to be audited - so here
      we're removing all the unnecessary uses.
      
      Most of these are places where it was being initialized on a bio that
      was just allocated, a few others are conversions to standard macros.
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      4f2ac93c
  7. 24 2月, 2013 1 次提交
  8. 23 2月, 2013 1 次提交
  9. 22 2月, 2013 1 次提交
    • D
      mm: only enforce stable page writes if the backing device requires it · 1d1d1a76
      Darrick J. Wong 提交于
      Create a helper function to check if a backing device requires stable
      page writes and, if so, performs the necessary wait.  Then, make it so
      that all points in the memory manager that handle making pages writable
      use the helper function.  This should provide stable page write support
      to most filesystems, while eliminating unnecessary waiting for devices
      that don't require the feature.
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own stable page guarantees or they don't block at all.
      The blocking behavior is back to what it was before 3.0 if you don't
      have a disk requiring stable page writes.
      
      Here's the result of using dbench to test latency on ext2:
      
      3.8.0-rc3:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        109347     0.028    59.817
       ReadX         347180     0.004     3.391
       Flush          15514    29.828   287.283
      
      Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
       WriteX        105556     0.029     4.273
       ReadX         335004     0.005     4.112
       Flush          14982    30.540   298.634
      
      Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, the maximum write latency drops considerably with this
      patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
      similarly, but see the cover letter for those results.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Acked-by: NSteven Whitehouse <swhiteho@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d1d1a76
  10. 15 1月, 2013 1 次提交
    • L
      vfs: add missing virtual cache flush after editing partial pages · 6d283dba
      Linus Torvalds 提交于
      Andrew Morton pointed this out a month ago, and then I completely forgot
      about it.
      
      If we read a partial last page of a block device, we will zero out the
      end of the page, but since that page can then be mapped into user space,
      we should also make sure to flush the cache on architectures that have
      virtual caches.  We have the flush_dcache_page() function for this, so
      use it.
      
      Now, in practice this really never matters, because nobody sane uses
      virtual caches to begin with, and they largely exist on old broken RISC
      arhitectures.
      
      And even if you did run on one of those obsolete CPU's, the whole "mmap
      and access the last partial page of a block device" behavior probably
      doesn't actually exist.  The normal IO functions (read/write) will never
      see the zeroed-out part of the page that migth not be coherent in the
      cache, because they honor the size of the device.
      
      So I'm marking this for stable (3.7 only), but I'm not sure anybody will
      ever care.
      Pointed-out-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org  # 3.7
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d283dba
  11. 14 1月, 2013 2 次提交
    • T
      block: add block_{touch|dirty}_buffer tracepoint · 5305cb83
      Tejun Heo 提交于
      The former is triggered from touch_buffer() and the latter
      mark_buffer_dirty().
      
      This is part of tracepoint additions to improve visiblity into
      dirtying / writeback operations for io tracer and userland.
      
      v2: Transformed writeback_dirty_buffer to block_dirty_buffer and made
          it share TP definition with block_touch_buffer.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5305cb83
    • T
      buffer: make touch_buffer() an exported function · f0059afd
      Tejun Heo 提交于
      We want to add a trace point to touch_buffer() but macros and inline
      functions defined in header files can't have tracing points.  Move
      touch_buffer() to fs/buffer.c and make it a proper function.
      
      The new exported function is also declared inline.  As most uses of
      touch_buffer() are inside buffer.c with nilfs2 as the only other user,
      the effect of this change should be negligible.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f0059afd
  12. 13 12月, 2012 2 次提交
  13. 12 12月, 2012 1 次提交
    • R
      mm: redefine address_space.assoc_mapping · 252aa6f5
      Rafael Aquini 提交于
      Overhaul struct address_space.assoc_mapping renaming it to
      address_space.private_data and its type is redefined to void*.  By this
      approach we consistently name the .private_* elements from struct
      address_space as well as allow extended usage for address_space
      association with other data structures through ->private_data.
      
      Also, all users of old ->assoc_mapping element are converted to reflect
      its new name and type change (->private_data).
      Signed-off-by: NRafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      252aa6f5
  14. 06 12月, 2012 1 次提交
  15. 05 12月, 2012 1 次提交
    • L
      vfs: avoid "attempt to access beyond end of device" warnings · 57302e0d
      Linus Torvalds 提交于
      The block device access simplification that avoided accessing the (racy)
      block size information (commit bbec0270: "blkdev_max_block: make
      private to fs/buffer.c") no longer checks the maximum block size in the
      block mapping path.
      
      That was _almost_ as simple as just removing the code entirely, because
      the readers and writers all check the size of the device anyway, so
      under normal circumstances it "just worked".
      
      However, the block size may be such that the end of the device may
      straddle one single buffer_head.  At which point we may still want to
      access the end of the device, but the buffer we use to access it
      partially extends past the end.
      
      The 'bd_set_size()' function intentionally sets the block size to avoid
      this, but mounting the device - or setting the block size by hand to
      some other value - can modify that block size.
      
      So instead, teach 'submit_bh()' about the special case of the buffer
      head straddling the end of the device, and turning such an access into a
      smaller IO access, avoiding the problem.
      
      This, btw, also means that unlike before, we can now access the whole
      device regardless of device block size setting.  So now, even if the
      device size is only 512-byte aligned, we can read and write even the
      last sector even when having a much bigger block size for accessing the
      rest of the device.
      
      So with this, we could now get rid of the 'bd_set_size()' block size
      code entirely - resulting in faster IO for the common case - but that
      would be a separate patch.
      Reported-and-tested-by: NRomain Francoise <romain@orebokech.com>
      Reporeted-and-tested-by: NMeelis Roos <mroos@linux.ee>
      Reported-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57302e0d
  16. 30 11月, 2012 2 次提交
    • L
      blkdev_max_block: make private to fs/buffer.c · bbec0270
      Linus Torvalds 提交于
      We really don't want to look at the block size for the raw block device
      accesses in fs/block-dev.c, because it may be changing from under us.
      So get rid of the max_block logic entirely, since the caller should
      already have done it anyway.
      
      That leaves the only user of this function in fs/buffer.c, so move the
      whole function there and make it static.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bbec0270
    • L
      fs/buffer.c: make block-size be per-page and protected by the page lock · 45bce8f3
      Linus Torvalds 提交于
      This makes the buffer size handling be a per-page thing, which allows us
      to not have to worry about locking too much when changing the buffer
      size.  If a page doesn't have buffers, we still need to read the block
      size from the inode, but we can do that with ACCESS_ONCE(), so that even
      if the size is changing, we get a consistent value.
      
      This doesn't convert all functions - many of the buffer functions are
      used purely by filesystems, which in turn results in the buffer size
      being fixed at mount-time.  So they don't have the same consistency
      issues that the raw device access can have.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45bce8f3
  17. 01 10月, 2012 1 次提交
    • T
      ext4: fix mtime update in nodelalloc mode · 041bbb6d
      Theodore Ts'o 提交于
      Commits 5e8830dc and 41c4d25f introduced a regression into
      v3.6-rc1 for ext4 in nodealloc mode, such that mtime updates would not
      take place for files modified via mmap if the page was already in the
      page cache.  This would also affect ext3 file systems mounted using
      the ext4 file system driver.
      
      The problem was that ext4_page_mkwrite() had a shortcut which would
      avoid calling __block_page_mkwrite() under some circumstances, and the
      above two commit transferred the responsibility of calling
      file_update_time() to __block_page_mkwrite --- which woudln't get
      called in some circumstances.
      
      Since __block_page_mkwrite() only has three callers,
      block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the
      best way to solve this is to move the responsibility for calling
      file_update_time() to its caller.
      
      This problem was found via xfstests #215 with a file system mounted
      with -o nodelalloc.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: stable@vger.kernel.org
      041bbb6d
  18. 23 8月, 2012 1 次提交
    • H
      block: replace __getblk_slow misfix by grow_dev_page fix · 676ce6d5
      Hugh Dickins 提交于
      Commit 91f68c89 ("block: fix infinite loop in __getblk_slow")
      is not good: a successful call to grow_buffers() cannot guarantee
      that the page won't be reclaimed before the immediate next call to
      __find_get_block(), which is why there was always a loop there.
      
      Yesterday I got "EXT4-fs error (device loop0): __ext4_get_inode_loc:3595:
      inode #19278: block 664: comm cc1: unable to read itable block" on console,
      which pointed to this commit.
      
      I've been trying to bisect for weeks, why kbuild-on-ext4-on-loop-on-tmpfs
      sometimes fails from a missing header file, under memory pressure on
      ppc G5.  I've never seen this on x86, and I've never seen it on 3.5-rc7
      itself, despite that commit being in there: bisection pointed to an
      irrelevant pinctrl merge, but hard to tell when failure takes between
      18 minutes and 38 hours (but so far it's happened quicker on 3.6-rc2).
      
      (I've since found such __ext4_get_inode_loc errors in /var/log/messages
      from previous weeks: why the message never appeared on console until
      yesterday morning is a mystery for another day.)
      
      Revert 91f68c89, restoring __getblk_slow() to how it was (plus
      a checkpatch nitfix).  Simplify the interface between grow_buffers()
      and grow_dev_page(), and avoid the infinite loop beyond end of device
      by instead checking init_page_buffers()'s end_block there (I presume
      that's more efficient than a repeated call to blkdev_max_block()),
      returning -ENXIO to __getblk_slow() in that case.
      
      And remove akpm's ten-year-old "__getblk() cannot fail ... weird"
      comment, but that is worrying: are all users of __getblk() really
      now prepared for a NULL bh beyond end of device, or will some oops??
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org # 3.0 3.2 3.4 3.5
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      676ce6d5
  19. 31 7月, 2012 2 次提交
  20. 13 7月, 2012 1 次提交
    • J
      block: fix infinite loop in __getblk_slow · 91f68c89
      Jeff Moyer 提交于
      Commit 080399aa ("block: don't mark buffers beyond end of disk as
      mapped") exposed a bug in __getblk_slow that causes mount to hang as it
      loops infinitely waiting for a buffer that lies beyond the end of the
      disk to become uptodate.
      
      The problem was initially reported by Torsten Hilbrich here:
      
          https://lkml.org/lkml/2012/6/18/54
      
      and also reported independently here:
      
          http://www.sysresccd.org/forums/viewtopic.php?f=13&t=4511
      
      and then Richard W.M.  Jones and Marcos Mello noted a few separate
      bugzillas also associated with the same issue.  This patch has been
      confirmed to fix:
      
          https://bugzilla.redhat.com/show_bug.cgi?id=835019
      
      The main problem is here, in __getblk_slow:
      
              for (;;) {
                      struct buffer_head * bh;
                      int ret;
      
                      bh = __find_get_block(bdev, block, size);
                      if (bh)
                              return bh;
      
                      ret = grow_buffers(bdev, block, size);
                      if (ret < 0)
                              return NULL;
                      if (ret == 0)
                              free_more_memory();
              }
      
      __find_get_block does not find the block, since it will not be marked as
      mapped, and so grow_buffers is called to fill in the buffers for the
      associated page.  I believe the for (;;) loop is there primarily to
      retry in the case of memory pressure keeping grow_buffers from
      succeeding.  However, we also continue to loop for other cases, like the
      block lying beond the end of the disk.  So, the fix I came up with is to
      only loop when grow_buffers fails due to memory allocation issues
      (return value of 0).
      
      The attached patch was tested by myself, Torsten, and Rich, and was
      found to resolve the problem in call cases.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Reported-and-Tested-by: NTorsten Hilbrich <torsten.hilbrich@secunet.com>
      Tested-by: NRichard W.M. Jones <rjones@redhat.com>
      Reviewed-by: NJosh Boyer <jwboyer@redhat.com>
      Cc: Stable <stable@vger.kernel.org>  # 3.0+
      [ Jens is on vacation, taking this directly  - Linus ]
      --
      Stable Notes: this patch requires backport to 3.0, 3.2 and 3.3.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91f68c89
  21. 31 5月, 2012 1 次提交
  22. 11 5月, 2012 1 次提交
    • J
      block: don't mark buffers beyond end of disk as mapped · 080399aa
      Jeff Moyer 提交于
      Hi,
      
      We have a bug report open where a squashfs image mounted on ppc64 would
      exhibit errors due to trying to read beyond the end of the disk.  It can
      easily be reproduced by doing the following:
      
      [root@ibm-p750e-02-lp3 ~]# ls -l install.img
      -rw-r--r-- 1 root root 142032896 Apr 30 16:46 install.img
      [root@ibm-p750e-02-lp3 ~]# mount -o loop ./install.img /mnt/test
      [root@ibm-p750e-02-lp3 ~]# dd if=/dev/loop0 of=/dev/null
      dd: reading `/dev/loop0': Input/output error
      277376+0 records in
      277376+0 records out
      142016512 bytes (142 MB) copied, 0.9465 s, 150 MB/s
      
      In dmesg, you'll find the following:
      
      squashfs: version 4.0 (2009/01/31) Phillip Lougher
      [   43.106012] attempt to access beyond end of device
      [   43.106029] loop0: rw=0, want=277410, limit=277408
      [   43.106039] Buffer I/O error on device loop0, logical block 138704
      [   43.106053] attempt to access beyond end of device
      [   43.106057] loop0: rw=0, want=277412, limit=277408
      [   43.106061] Buffer I/O error on device loop0, logical block 138705
      [   43.106066] attempt to access beyond end of device
      [   43.106070] loop0: rw=0, want=277414, limit=277408
      [   43.106073] Buffer I/O error on device loop0, logical block 138706
      [   43.106078] attempt to access beyond end of device
      [   43.106081] loop0: rw=0, want=277416, limit=277408
      [   43.106085] Buffer I/O error on device loop0, logical block 138707
      [   43.106089] attempt to access beyond end of device
      [   43.106093] loop0: rw=0, want=277418, limit=277408
      [   43.106096] Buffer I/O error on device loop0, logical block 138708
      [   43.106101] attempt to access beyond end of device
      [   43.106104] loop0: rw=0, want=277420, limit=277408
      [   43.106108] Buffer I/O error on device loop0, logical block 138709
      [   43.106112] attempt to access beyond end of device
      [   43.106116] loop0: rw=0, want=277422, limit=277408
      [   43.106120] Buffer I/O error on device loop0, logical block 138710
      [   43.106124] attempt to access beyond end of device
      [   43.106128] loop0: rw=0, want=277424, limit=277408
      [   43.106131] Buffer I/O error on device loop0, logical block 138711
      [   43.106135] attempt to access beyond end of device
      [   43.106139] loop0: rw=0, want=277426, limit=277408
      [   43.106143] Buffer I/O error on device loop0, logical block 138712
      [   43.106147] attempt to access beyond end of device
      [   43.106151] loop0: rw=0, want=277428, limit=277408
      [   43.106154] Buffer I/O error on device loop0, logical block 138713
      [   43.106158] attempt to access beyond end of device
      [   43.106162] loop0: rw=0, want=277430, limit=277408
      [   43.106166] attempt to access beyond end of device
      [   43.106169] loop0: rw=0, want=277432, limit=277408
      ...
      [   43.106307] attempt to access beyond end of device
      [   43.106311] loop0: rw=0, want=277470, limit=2774
      
      Squashfs manages to read in the end block(s) of the disk during the
      mount operation.  Then, when dd reads the block device, it leads to
      block_read_full_page being called with buffers that are beyond end of
      disk, but are marked as mapped.  Thus, it would end up submitting read
      I/O against them, resulting in the errors mentioned above.  I fixed the
      problem by modifying init_page_buffers to only set the buffer mapped if
      it fell inside of i_size.
      
      Cheers,
      Jeff
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NNick Piggin <npiggin@kernel.dk>
      
      --
      
      Changes from v1->v2: re-used max_block, as suggested by Nick Piggin.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      080399aa
  23. 26 4月, 2012 1 次提交
    • G
      fs/buffer.c: remove BUG() in possible but rare condition · 61065a30
      Glauber Costa 提交于
      While stressing the kernel with with failing allocations today, I hit the
      following chain of events:
      
      alloc_page_buffers():
      
      	bh = alloc_buffer_head(GFP_NOFS);
      	if (!bh)
      		goto no_grow; <= path taken
      
      grow_dev_page():
              bh = alloc_page_buffers(page, size, 0);
              if (!bh)
                      goto failed;  <= taken, consequence of the above
      
      and then the failed path BUG()s the kernel.
      
      The failure is inserted a litte bit artificially, but even then, I see no
      reason why it should be deemed impossible in a real box.
      
      Even though this is not a condition that we expect to see around every
      time, failed allocations are expected to be handled, and BUG() sounds just
      too much.  As a matter of fact, grow_dev_page() can return NULL just fine
      in other circumstances, so I propose we just remove it, then.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61065a30
  24. 29 3月, 2012 1 次提交
  25. 29 2月, 2012 1 次提交
  26. 04 1月, 2012 1 次提交
  27. 01 11月, 2011 1 次提交
  28. 31 10月, 2011 1 次提交
    • C
      writeback: Add a 'reason' to wb_writeback_work · 0e175a18
      Curt Wohlgemuth 提交于
      This creates a new 'reason' field in a wb_writeback_work
      structure, which unambiguously identifies who initiates
      writeback activity.  A 'wb_reason' enumeration has been
      added to writeback.h, to enumerate the possible reasons.
      
      The 'writeback_work_class' and tracepoint event class and
      'writeback_queue_io' tracepoints are updated to include the
      symbolic 'reason' in all trace events.
      
      And the 'writeback_inodes_sbXXX' family of routines has had
      a wb_stats parameter added to them, so callers can specify
      why writeback is being started.
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NCurt Wohlgemuth <curtw@google.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      0e175a18
  29. 28 10月, 2011 1 次提交
  30. 16 6月, 2011 1 次提交
    • J
      vfs: Fix data corruption after failed write in __block_write_begin() · f9f07b6c
      Jan Kara 提交于
      I've got a report of a file corruption from fsxlinux on ext3. The important
      operations to the page were:
      mapwrite to a hole
      partial write to the page
      read - found the page zeroed from the end of the normal write
      
      The culprit seems to be that if get_block() fails in __block_write_begin()
      (e.g. transient ENOSPC in ext3), the function does ClearPageUptodate(page).
      Thus when we retry the write, the logic in __block_write_begin() thinks zeroing
      of the page is needed and overwrites old data.  In fact, I don't see why we
      should ever need to zero the uptodate bit here - either the page was uptodate
      when we entered __block_write_begin() and it should stay so when we leave it,
      or it was not uptodate and noone had right to set it uptodate during
      __block_write_begin() so it remains !uptodate when we leave as well. So just
      remove clearing of the bit.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f9f07b6c
  31. 28 5月, 2011 1 次提交
  32. 27 5月, 2011 1 次提交
    • D
      mm/fs: add hooks to support cleancache · c515e1fd
      Dan Magenheimer 提交于
      This fourth patch of eight in this cleancache series provides the
      core hooks in VFS for: initializing cleancache per filesystem;
      capturing clean pages reclaimed by page cache; attempting to get
      pages from cleancache before filesystem read; and ensuring coherency
      between pagecache, disk, and cleancache.  Note that the placement
      of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
      change was required due to a patchset in 2.6.39.
      
      All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
      a check of a boolean global if CONFIG_CLEANCACHE is set but no
      cleancache "backend" has claimed cleancache_ops.
      
      Details and a FAQ can be found in Documentation/vm/cleancache.txt
      
      [v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      Signed-off-by: NDan Magenheimer <dan.magenheimer@oracle.com>
      Reviewed-by: NJeremy Fitzhardinge <jeremy@goop.org>
      Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik Van Riel <riel@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Andreas Dilger <adilger@sun.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      c515e1fd
  33. 26 5月, 2011 2 次提交
    • J
      vfs: Block mmapped writes while the fs is frozen · ea13a864
      Jan Kara 提交于
      We should not allow file modification via mmap while the filesystem is
      frozen. So block in block_page_mkwrite() while the filesystem is frozen.
      We cannot do the blocking wait in __block_page_mkwrite() since e.g. ext4
      will want to call that function with transaction started in some cases
      and that would deadlock. But we can at least do the non-blocking reliable
      check in __block_page_mkwrite() which is the hardest part anyway.
      
      We have to check for frozen filesystem with the page marked dirty and under
      page lock with which we then return from ->page_mkwrite(). Only that way we
      cannot race with writeback done by freezing code - either we mark the page
      dirty after the writeback has started, see freezing in progress and block, or
      writeback will wait for our page lock which is released only when the fault is
      done and then writeback will writeout and writeprotect the page again.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ea13a864
    • J
      vfs: Create __block_page_mkwrite() helper passing error values back · 24da4fab
      Jan Kara 提交于
      Create __block_page_mkwrite() helper which does all what block_page_mkwrite()
      does except that it passes back errors from __block_write_begin /
      block_commit_write calls.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      24da4fab
  34. 25 3月, 2011 1 次提交
    • D
      fs: protect inode->i_state with inode->i_lock · 250df6ed
      Dave Chinner 提交于
      Protect inode state transitions and validity checks with the
      inode->i_lock. This enables us to make inode state transitions
      independently of the inode_lock and is the first step to peeling
      away the inode_lock from the code.
      
      This requires that __iget() is done atomically with i_state checks
      during list traversals so that we don't race with another thread
      marking the inode I_FREEING between the state check and grabbing the
      reference.
      
      Also remove the unlock_new_inode() memory barrier optimisation
      required to avoid taking the inode_lock when clearing I_NEW.
      Simplify the code by simply taking the inode->i_lock around the
      state change and wakeup. Because the wakeup is no longer tricky,
      remove the wake_up_inode() function and open code the wakeup where
      necessary.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      250df6ed