1. 05 Nov, 2016 · 1 commit
  2. 03 Nov, 2016 · 1 commit
  3. 01 Nov, 2016 · 1 commit
  4. 28 Oct, 2016 · 1 commit
    • block: better op and flags encoding · ef295ecf
      Christoph Hellwig authored
      Now that we don't need the common flags to overflow outside the range
      of a 32-bit type we can encode them the same way for both the bio and
      request fields.  This in addition allows us to place the operation
      first (and make some room for more ops while we're at it) and to
      stop having to shift around the operation values.
      
      In addition this allows passing around only one value in the block layer
      instead of two (and eventually also in the file systems, but we can do
      that later) and thus clean up a lot of code.
      
      Last but not least this allows decreasing the size of the cmd_flags
      field in struct request to 32-bits.  Various functions passing this
      value could also be updated, but I'd like to avoid the churn for now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ef295ecf
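
      A minimal sketch of the encoding idea described above, in plain C with
      illustrative names (OP_BITS, OP_MASK and the example values are not the
      exact kernel macros): the operation occupies the low bits of a single
      32-bit value and the common flags sit above it, so the bio and request
      fields can share one layout.

        #include <stdint.h>
        #include <stdio.h>

        #define OP_BITS  8u                      /* room for more ops */
        #define OP_MASK  ((1u << OP_BITS) - 1)

        static inline uint32_t opf_op(uint32_t opf)    { return opf & OP_MASK; }
        static inline uint32_t opf_flags(uint32_t opf) { return opf & ~OP_MASK; }

        int main(void)
        {
                uint32_t write_op  = 1;              /* hypothetical op value */
                uint32_t sync_flag = 1u << OP_BITS;  /* hypothetical flag bit */
                uint32_t opf = write_op | sync_flag; /* one value, no shifting of the op */

                printf("op=%u flags=0x%x\n",
                       (unsigned)opf_op(opf), (unsigned)opf_flags(opf));
                return 0;
        }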
  5. 12 Oct, 2016 · 1 commit
  6. 28 Sep, 2016 · 1 commit
  7. 21 Jul, 2016 · 1 commit
  8. 27 Jun, 2016 · 1 commit
    • fs: export __block_write_full_page · b4bba389
      Benjamin Marzinski authored
      gfs2 needs to be able to skip the check to see if a page is outside of
      the file size when writing it out. gfs2 can get into a situation where
      it needs to flush its in-memory log to disk while a truncate is in
      progress. If the file being truncated has data journaling enabled, it is
      possible that there are data blocks in the log that are past the end of
      the file. gfs2 can't finish the log flush without either writing these
      blocks out or revoking them. Otherwise, if the node crashed, it could
      overwrite subsequent changes made by other nodes in the cluster when
      its journal was replayed.
      
      Unfortunately, there is no way to add log entries to the log during a
      flush. So gfs2 simply writes out the page instead. This situation can
      only occur when the truncate code still has the file locked exclusively,
      and hasn't marked this block as free in the metadata (which happens
      later in trunc_dealloc).  After gfs2 writes this page out, the truncation
      code will shortly invalidate it and write out any revokes if necessary.
      
      In order to make this work, gfs2 needs to be able to skip the check for
      writes outside the file size. Since the check exists in
      block_write_full_page, this patch exports __block_write_full_page, which
      doesn't have the check.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      b4bba389
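
      A sketch of how a gfs2-style ->writepage might use the newly exported
      helper; the __block_write_full_page() signature is assumed from
      fs/buffer.c of this era, and myfs_jdata_writepage() /
      myfs_get_block_noalloc() are hypothetical names:

        #include <linux/buffer_head.h>
        #include <linux/writeback.h>

        /* Write the page even if it sits beyond i_size: the block is still
         * pinned in the journal and will be invalidated/revoked shortly. */
        static int myfs_jdata_writepage(struct page *page,
                                        struct writeback_control *wbc)
        {
                struct inode *inode = page->mapping->host;

                return __block_write_full_page(inode, page,
                                               myfs_get_block_noalloc, wbc,
                                               end_buffer_async_write);
        }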
  9. 21 Jun, 2016 · 1 commit
    • fs: introduce iomap infrastructure · ae259a9c
      Christoph Hellwig authored
      Add infrastructure for multipage buffered writes.  This is implemented
      using a main iterator that applies an actor function to a range that
      can be written.
      
      This infrastructure is used to implement a buffered write helper, one
      to zero file ranges and one to implement the ->page_mkwrite VM
      operation.  All of them borrow a fair amount of code from fs/buffer.c
      for now by using an internal version of __block_write_begin that
      gets passed an iomap and builds the corresponding buffer head.
      
      The file system gets a set of paired ->iomap_begin and ->iomap_end
      calls which allow it to map/reserve a range and get a notification
      once the write code is finished with it.
      
      Based on earlier code from Dave Chinner.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      ae259a9c
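
      A sketch of the paired calls as described; the argument lists are
      assumed from the initial iomap code and the myfs_* names are
      hypothetical:

        #include <linux/iomap.h>

        static int myfs_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
                                    unsigned flags, struct iomap *iomap)
        {
                /* map or reserve [pos, pos + length) and describe it in *iomap */
                return 0;
        }

        static int myfs_iomap_end(struct inode *inode, loff_t pos, loff_t length,
                                  ssize_t written, unsigned flags, struct iomap *iomap)
        {
                /* notification that the write code is finished with the range */
                return 0;
        }

        static const struct iomap_ops myfs_iomap_ops = {
                .iomap_begin = myfs_iomap_begin,
                .iomap_end   = myfs_iomap_end,
        };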
  10. 08 Jun, 2016 · 3 commits
  11. 20 May, 2016 · 1 commit
    • mm, page_alloc: avoid looking up the first zone in a zonelist twice · c33d6c06
      Mel Gorman authored
      The allocator fast path looks up the first usable zone in a zonelist and
      then get_page_from_freelist does the same job in the zonelist iterator.
      This patch preserves the result of the first lookup so it is not repeated.
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                              fastmark-v1r20             initonce-v1r20
        Min      alloc-odr0-1               364.00 (  0.00%)           359.00 (  1.37%)
        Min      alloc-odr0-2               262.00 (  0.00%)           260.00 (  0.76%)
        Min      alloc-odr0-4               214.00 (  0.00%)           214.00 (  0.00%)
        Min      alloc-odr0-8               186.00 (  0.00%)           186.00 (  0.00%)
        Min      alloc-odr0-16              173.00 (  0.00%)           173.00 (  0.00%)
        Min      alloc-odr0-32              165.00 (  0.00%)           165.00 (  0.00%)
        Min      alloc-odr0-64              161.00 (  0.00%)           162.00 ( -0.62%)
        Min      alloc-odr0-128             159.00 (  0.00%)           161.00 ( -1.26%)
        Min      alloc-odr0-256             168.00 (  0.00%)           170.00 ( -1.19%)
        Min      alloc-odr0-512             180.00 (  0.00%)           181.00 ( -0.56%)
        Min      alloc-odr0-1024            190.00 (  0.00%)           190.00 (  0.00%)
        Min      alloc-odr0-2048            196.00 (  0.00%)           196.00 (  0.00%)
        Min      alloc-odr0-4096            202.00 (  0.00%)           202.00 (  0.00%)
        Min      alloc-odr0-8192            206.00 (  0.00%)           205.00 (  0.49%)
        Min      alloc-odr0-16384           206.00 (  0.00%)           205.00 (  0.49%)
      
      The benefit is negligible and the results are within the noise but each
      cycle counts.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c33d6c06
  12. 05 Apr, 2016 · 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely that it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it's a constant source of confusion about whether the
      PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
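
      An illustrative before/after for a typical call site, following the
      substitutions above (hypothetical fs code):

        /* before */
        pgoff_t index = pos >> PAGE_CACHE_SHIFT;
        page_cache_get(page);
        /* ... use the page ... */
        page_cache_release(page);

        /* after */
        pgoff_t index = pos >> PAGE_SHIFT;
        get_page(page);
        /* ... use the page ... */
        put_page(page);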
  13. 16 Mar, 2016 · 2 commits
  14. 07 Jan, 2016 · 1 commit
  15. 11 Nov, 2015 · 1 commit
    • vfs: remove unused wrapper block_page_mkwrite() · 5c500029
      Ross Zwisler authored
      The function currently called "__block_page_mkwrite()" used to be called
      "block_page_mkwrite()" until a wrapper for this function was added by:
      
      commit 24da4fab ("vfs: Create __block_page_mkwrite() helper passing
      	error values back")
      
      This wrapper, the current "block_page_mkwrite()", is now unused.
      __block_page_mkwrite() is used directly by ext4, nilfs2 and xfs.
      
      Remove the unused wrapper, rename __block_page_mkwrite() back to
      block_page_mkwrite() and update the comment above block_page_mkwrite().
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      5c500029
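
      A sketch of a filesystem ->page_mkwrite built on the renamed helper; the
      error-to-VM_FAULT conversion via block_page_mkwrite_return() and the
      myfs_* names are assumptions, not taken from this commit:

        #include <linux/buffer_head.h>
        #include <linux/mm.h>

        static int myfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
        {
                struct inode *inode = file_inode(vma->vm_file);
                int err;

                sb_start_pagefault(inode->i_sb);    /* caller handles freeze protection */
                err = block_page_mkwrite(vma, vmf, myfs_get_block);
                sb_end_pagefault(inode->i_sb);

                return block_page_mkwrite_return(err);  /* errno -> VM_FAULT_* */
        }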
  16. 07 Nov, 2015 · 1 commit
  17. 14 Aug, 2015 · 1 commit
  18. 29 Jul, 2015 · 2 commits
    • block: manipulate bio->bi_flags through helpers · b7c44ed9
      Jens Axboe authored
      Some places use helpers now, others don't. We only have the 'is set'
      helper, add helpers for setting and clearing flags too.
      
      It was a bit of a mess of atomic vs non-atomic access. With
      BIO_UPTODATE gone, we don't have any risk of concurrent access to the
      flags. So relax the restriction and don't make any of them atomic. The
      flags that do have serialization issues (reffed and chained), we
      already handle those separately.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b7c44ed9
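
      A sketch of the non-atomic helpers this commit describes (close to, but
      not necessarily verbatim, the merged versions):

        static inline bool bio_flagged(struct bio *bio, unsigned int bit)
        {
                return (bio->bi_flags & (1U << bit)) != 0;
        }

        static inline void bio_set_flag(struct bio *bio, unsigned int bit)
        {
                bio->bi_flags |= (1U << bit);
        }

        static inline void bio_clear_flag(struct bio *bio, unsigned int bit)
        {
                bio->bi_flags &= ~(1U << bit);
        }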
    • block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig authored
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not being persistent
      when bios are queued up, and of not being passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4246a0b6
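
      An illustrative completion callback under the new scheme: the error is
      read from bio->bi_error rather than from BIO_UPTODATE or an extra
      callback argument (struct my_io and myfs_end_io are hypothetical):

        #include <linux/bio.h>
        #include <linux/completion.h>

        struct my_io {
                struct completion done;
                int status;
        };

        static void myfs_end_io(struct bio *bio)
        {
                struct my_io *io = bio->bi_private;

                if (bio->bi_error)
                        io->status = bio->bi_error;  /* single errno, survives chaining */
                complete(&io->done);
                bio_put(bio);
        }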
  19. 02 Jun, 2015 · 6 commits
    • buffer: remove unused 'ret' variable · d2e73fcc
      Jens Axboe authored
      Merge hiccup on my part, due to a clash between the writeback
      changes and the EOPNOTSUPP removal in _submit_bh().
      Signed-off-by: Jens Axboe <axboe@fb.com>
      d2e73fcc
    • writeback: implement foreign cgroup inode detection · 2a814908
      Tejun Heo authored
      As concurrent write sharing of an inode is expected to be very rare
      and memcg only tracks page ownership on a first-use basis, severely
      confining the usefulness of such sharing, cgroup writeback tracks
      ownership per-inode.  While the support for concurrent write sharing
      of an inode is deemed unnecessary, an inode being written to by
      different cgroups at different points in time is a lot more common,
      and, more importantly, charging only by first-use can too readily lead
      to grossly incorrect behaviors (single foreign page can lead to
      gigabytes of writeback being incorrectly attributed).
      
      To resolve this issue, cgroup writeback detects the majority dirtier
      of an inode and will transfer the ownership to it.  To avoid
      unnecessary oscillation, the detection mechanism keeps track of
      history and gives out the switch verdict only if the foreign usage
      pattern is stable over a certain amount of time and/or writeback
      attempts.
      
      The detection mechanism has fairly low space and computation overhead.
      It adds 8 bytes to struct inode (one int and two u16's) and minimal
      amount of calculation per IO.  The detection mechanism converges to
      the correct answer usually in several seconds of IO time when there's
      a clear majority dirtier.  Even when there isn't, it can reach an
      acceptable answer fairly quickly under most circumstances.
      
      Please see wbc_detach_inode() for more details.
      
      This patch only implements detection.  Following patches will
      implement actual switching.
      
      v2: wbc_account_io() now checks whether the wbc is associated with a
          wb before dereferencing it.  This can happen when pageout() is
          writing pages directly without going through the usual writeback
          path.  As pageout() path is single-threaded, we don't want it to
          be blocked behind a slow cgroup and ultimately want it to delegate
          actual writing to the usual writeback path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2a814908
    • writeback: make writeback_control track the inode being written back · b16b1deb
      Tejun Heo authored
      Currently, for cgroup writeback, the IO submission paths directly
      associate the bio's with the blkcg from inode_to_wb_blkcg_css();
      however, it'd be necessary to keep more writeback context to implement
      foreign inode writeback detection.  wbc (writeback_control) is the
      natural fit for the extra context - it persists throughout the
      writeback of each inode and is passed all the way down to IO
      submission paths.
      
      This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and
      wbc_attach_fdatawrite_inode() which are used to associate wbc with the
      inode being written back.  IO submission paths now use wbc_init_bio()
      instead of directly associating bio's with blkcg themselves.  This
      leaves inode_to_wb_blkcg_css() w/o any user.  The function is removed.
      
      wbc currently only tracks the associated wb (bdi_writeback).  Future
      patches will add more for foreign inode detection.  The association is
      established under i_lock which will be depended upon when migrating
      foreign inodes to other wb's.
      
      As the inode-to-wb association, once established, currently never
      changes, going through wbc when initializing bio's doesn't cause any
      behavior changes.
      
      v2: submit_blk_blkcg() now checks whether the wbc is associated with a
          wb before dereferencing it.  This can happen when pageout() is
          writing pages directly without going through the usual writeback
          path.  As pageout() path is single-threaded, we don't want it to
          be blocked behind a slow cgroup and ultimately want it to delegate
          actual writing to the usual writeback path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b16b1deb
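
      A sketch of the pairing described above for a data-writeback path; the
      wbc_* function names come from the commit message, while the surrounding
      myfs_* code and the submit_bio() call form are assumptions:

        #include <linux/writeback.h>
        #include <linux/bio.h>

        static void myfs_write_one_bio(struct inode *inode, struct bio *bio)
        {
                struct writeback_control wbc = {
                        .sync_mode   = WB_SYNC_ALL,
                        .nr_to_write = LONG_MAX,
                };

                wbc_attach_fdatawrite_inode(&wbc, inode); /* associate wbc with the inode's wb */
                wbc_init_bio(&wbc, bio);                  /* tag the bio instead of doing it by hand */
                submit_bio(WRITE, bio);
                wbc_detach_inode(&wbc);                   /* drop the association when done */
        }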
    • buffer, writeback: make __block_write_full_page() honor cgroup writeback · bafc0dba
      Tejun Heo authored
      [__]block_write_full_page() is used to implement ->writepage in
      various filesystems.  All writeback logic is now updated to handle
      cgroup writeback and the block cgroup to issue IOs for is encoded in
      writeback_control and can be retrieved from the inode; however,
      [__]block_write_full_page() currently ignores the blkcg indicated by
      inode and issues all bio's without explicit blkcg association.
      
      This patch adds submit_bh_blkcg() which associates the bio with the
      specified blkio cgroup before issuing and uses it in
      __block_write_full_page() so that the issued bio's are associated with
      inode_to_wb_blkcg_css(inode).
      
      v2: Updated for per-inode wb association.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      bafc0dba
    • memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen authored
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat(), which returns the memcg later
      needed by mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c4843a75
    • page_writeback: revive cancel_dirty_page() in a restricted form · 11f81bec
      Tejun Heo authored
      cancel_dirty_page() had some issues and b9ea2515 ("page_writeback:
      clean up mess around cancel_dirty_page()") replaced it with
      account_page_cleaned() which makes the caller responsible for clearing
      the dirty bit; unfortunately, the planned changes for cgroup writeback
      support requires synchronization between dirty bit manipulation and
      stat updates.  While we can open-code such synchronization in each
      account_page_cleaned() callsite, that's gonna be unnecessarily awkward
      and verbose.
      
      This patch revives cancel_dirty_page() but in a more restricted form.
      All it does is TestClearPageDirty() followed by account_page_cleaned()
      invocation if the page was dirty.  This helper covers all
      account_page_cleaned() usages except for __delete_from_page_cache()
      which is a special case anyway and left alone.  As this leaves no
      module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
      from it.
      
      This patch just revives cancel_dirty_page() as a trivial wrapper to
      replace equivalent usages and doesn't introduce any functional
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      11f81bec
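
      A minimal sketch of the restricted form as described (not the exact
      merged code): clear the dirty bit and, only if it was set, undo the
      dirty accounting:

        static inline void cancel_dirty_page(struct page *page)
        {
                if (TestClearPageDirty(page))
                        account_page_cleaned(page, page_mapping(page));
        }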
  20. 27 May, 2015 · 1 commit
  21. 19 May, 2015 · 1 commit
  22. 15 Apr, 2015 · 1 commit
    • page_writeback: clean up mess around cancel_dirty_page() · b9ea2515
      Konstantin Khlebnikov authored
      This patch replaces cancel_dirty_page() with a helper function
      account_page_cleaned() which only updates counters.  It's called from
      truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
      Page is locked in both cases, page-lock protects against concurrent
      dirtiers: see commit 2d6d7f98 ("mm: protect set_page_dirty() from
      ongoing truncation").
      
      Delete_from_page_cache() shouldn't be called for dirty pages, they must
      be handled by the caller (either written or truncated).  This patch treats
      final dirty accounting fixup at the end of __delete_from_page_cache() as
      a debug check and adds WARN_ON_ONCE() around it.  If something removes
      dirty pages without proper handling that might be a bug and unwritten
      data might be lost.
      
      Hugetlbfs has no dirty pages accounting, ClearPageDirty() is enough
      here.
      
      cancel_dirty_page() in nfs_wb_page_cancel() is redundant.  This is a
      helper for nfs_invalidate_page() and it's called only in the case of
      complete invalidation.
      
      The mess started in v2.6.20 with commits 46d2277c ("Clean up
      and make try_to_free_buffers() not race with dirty pages") and
      3e67c098 ("truncate: clear page dirtiness before running
      try_to_free_buffers()").  The first was reverted right away in v2.6.20
      by commit ecdfc978 ("Resurrect 'try_to_free_buffers()' VM hackery"),
      the second in v2.6.25 by commit a2b34564 ("Fix dirty page accounting
      leak with ext3 data=journal").
      
      Custom fixes were introduced between these points.  NFS in v2.6.23, commit
      1b3b4a1a ("NFS: Fix a write request leak in nfs_invalidate_page()").
      Kludge in __delete_from_page_cache() in v2.6.24, commit 3a692790 ("Do
      dirty page accounting when removing a page from the page cache").  Since
      v2.6.25 all of them are redundant.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9ea2515
  23. 22 Oct, 2014 · 2 commits
    • fs: clarify rate limit suppressed buffer I/O errors · 432f16e6
      Robert Elliott authored
      When quiet_error applies rate limiting to buffer_io_error calls, what
      they apply to is unclear because the name is so generic, particularly
      if the messages are interleaved with others:
      
      [ 1936.063572] quiet_error: 664293 callbacks suppressed
      [ 1936.065297] Buffer I/O error on dev sdr, logical block 257429952, lost async page write
      [ 1936.067814] Buffer I/O error on dev sdr, logical block 257429953, lost async page write
      
      Also, the function uses printk_ratelimit(), although printk.h includes a
      comment advising "Please don't use... Instead use printk_ratelimited()."
      
      Change buffer_io_error to check the BH_Quiet bit itself, drop the
      printk_ratelimit call, and print using printk_ratelimited.
      
      This makes the messages look like:
      
      [  387.208839] buffer_io_error: 676394 callbacks suppressed
      [  387.210693] Buffer I/O error on dev sdr, logical block 211291776, lost async page write
      [  387.213432] Buffer I/O error on dev sdr, logical block 211291777, lost async page write
      Signed-off-by: Robert Elliott <elliott@hp.com>
      Reviewed-by: Webb Scales <webbnh@hp.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      432f16e6
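
      A sketch of the reworked helper as described: it checks BH_Quiet itself
      and prints via printk_ratelimited() (close to, but not verbatim, the
      merged code):

        static void buffer_io_error(struct buffer_head *bh, char *msg)
        {
                char b[BDEVNAME_SIZE];

                if (!test_bit(BH_Quiet, &bh->b_state))
                        printk_ratelimited(KERN_ERR
                                "Buffer I/O error on dev %s, logical block %llu%s\n",
                                bdevname(bh->b_bdev, b),
                                (unsigned long long)bh->b_blocknr, msg);
        }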
    • fs: merge I/O error prints into one line · b744c2ac
      Robert Elliott authored
      buffer.c uses two printk calls to print these messages:
      [67353.422338] Buffer I/O error on device sdr, logical block 212868488
      [67353.422338] lost page write due to I/O error on sdr
      
      In a busy system, they may be interleaved with other prints,
      losing the context for the second message.  Merge them into
      one line with one printk call so the prints are atomic.
      
      Also, differentiate between async page writes, sync page writes, and
      async page reads.
      
      Also, shorten "device" to "dev" to match the block layer prints:
      [67353.467906] blk_update_request: critical target error, dev sdr, sector
      1707107328
      
      Also, use %llu rather than %Lu.
      
      Resulting prints look like:
      [ 1356.437006] blk_update_request: critical target error, dev sdr, sector 1719693992
      [ 1361.383522] quiet_error: 659876 callbacks suppressed
      [ 1361.385816] Buffer I/O error on dev sdr, logical block 256902912, lost async page write
      [ 1361.385819] Buffer I/O error on dev sdr, logical block 256903644, lost async page write
      Signed-off-by: Robert Elliott <elliott@hp.com>
      Reviewed-by: Webb Scales <webbnh@hp.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b744c2ac
  24. 14 Oct, 2014 · 1 commit
    • fs: check bh blocknr earlier when searching lru · 9470dd5d
      Zach Brown authored
      It's very common for the buffer heads in the lru to have different block
      numbers.  By comparing the blocknr before the bdev and size we can
      reduce the cost of searching in the very common case where all the
      entries have the same bdev and size.
      
      In quick hot cache cycle counting tests on a single fs workstation this
      cut the cost of a miss by about 20%.
      
      A diff of the disassembly shows the reordering of the bdev and blocknr
      comparisons.  This is in such a tiny loop that skipping one comparison
      is a meaningful portion of the total work being done:
      
           1628:      83 c1 01                add    $0x1,%ecx
           162b:      83 f9 08                cmp    $0x8,%ecx
           162e:      74 60                   je     1690 <__find_get_block+0xa0>
           1630:      89 c8                   mov    %ecx,%eax
           1632:      65 4c 8b 04 c5 00 00    mov    %gs:0x0(,%rax,8),%r8
           1639:      00 00
           163b:      4d 85 c0                test   %r8,%r8
           163e:      4c 89 c3                mov    %r8,%rbx
           1641:      74 e5                   je     1628 <__find_get_block+0x38>
      -    1643:      4d 3b 68 30             cmp    0x30(%r8),%r13
      +    1643:      4d 3b 68 18             cmp    0x18(%r8),%r13
           1647:      75 df                   jne    1628 <__find_get_block+0x38>
      -    1649:      4d 3b 60 18             cmp    0x18(%r8),%r12
      +    1649:      4d 3b 60 30             cmp    0x30(%r8),%r12
           164d:      75 d9                   jne    1628 <__find_get_block+0x38>
           164f:      49 39 50 20             cmp    %rdx,0x20(%r8)
           1653:      75 d3                   jne    1628 <__find_get_block+0x38>
      Signed-off-by: Zach Brown <zab@zabbo.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9470dd5d
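
      The reordered test in C terms, sketched as a lookup-loop predicate (not
      the verbatim kernel code): b_blocknr differs most often, so it is
      compared before b_bdev and b_size:

        static inline int lru_entry_matches(const struct buffer_head *bh,
                                            struct block_device *bdev,
                                            sector_t block, unsigned size)
        {
                return bh && bh->b_blocknr == block &&
                       bh->b_bdev == bdev && bh->b_size == size;
        }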
  25. 10 Oct, 2014 · 3 commits
  26. 09 Oct, 2014 · 1 commit
    • fs: make cont_expand_zero interruptible · c2ca0fcd
      Mikulas Patocka authored
      This patch makes it possible to kill a process looping in
      cont_expand_zero. A process may spend a lot of time in this function, so
      it is desirable to be able to kill it.
      
      It happened to me that I wanted to copy a piece of data from the disk to a
      file. By mistake, I used the "seek" parameter to dd instead of "skip". Due
      to the "seek" parameter, dd attempted to extend the file and became stuck
      doing so - the only possibility was to reset the machine or wait many
      hours until the filesystem runs out of space and cont_expand_zero fails.
      We need this patch to be able to terminate the process.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c2ca0fcd
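
      The shape of the fix as described, sketched as the check inside the
      zero-filling loop (the surrounding loop code is illustrative, not
      verbatim):

        while (index < end_index) {
                /* ... zero-fill and write out one page ... */

                balance_dirty_pages_ratelimited(mapping);

                if (fatal_signal_pending(current)) {
                        err = -EINTR;
                        goto out;
                }
        }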
  27. 02 Oct, 2014 · 1 commit
    • vfs: fix data corruption when blocksize < pagesize for mmaped data · 90a80202
      Jan Kara authored
      ->page_mkwrite() is used by filesystems to allocate blocks under a page
      which is becoming writeably mmapped in some process' address space. This
      allows a filesystem to fail the page fault if there is not enough space
      available, the user exceeds their quota, or a similar problem happens,
      rather than silently discarding data later when writepage is called.
      
      However VFS fails to call ->page_mkwrite() in all the cases where
      filesystems need it when blocksize < pagesize. For example when
      blocksize = 1024, pagesize = 4096 the following is problematic:
        ftruncate(fd, 0);
        pwrite(fd, buf, 1024, 0);
        map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
        map[0] = 'a';       ----> page_mkwrite() for index 0 is called
        ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
        mremap(map, 1024, 10000, 0);
        map[4095] = 'a';    ----> no page_mkwrite() called
      
      At the moment ->page_mkwrite() is called, filesystem can allocate only
      one block for the page because i_size == 1024. Otherwise it would create
      blocks beyond i_size which is generally undesirable. But later at
      ->writepage() time, we also need to store data at offset 4095 but we
      don't have block allocated for it.
      
      This patch introduces a helper function filesystems can use to have
      ->page_mkwrite() called at all the necessary moments.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      90a80202
  28. 22 Sep, 2014 · 1 commit
    • Fix nasty 32-bit overflow bug in buffer i/o code. · f2d5a944
      Anton Altaparmakov authored
      On 32-bit architectures, the legacy buffer_head functions are not always
      handling the sector number with the proper 64-bit types, and will thus
      fail on 4TB+ disks.
      
      Any code that uses __getblk() (and thus bread(), breadahead(),
      sb_bread(), sb_breadahead(), sb_getblk()) and calls it with a 64-bit
      block number on a 32-bit arch (where "long" is 32-bit) causes an infinite
      loop in __getblk_slow(), with an infinite stream of errors logged to
      dmesg like this:
      
        __find_get_block_slow() failed. block=6740375944, b_blocknr=2445408648
        b_state=0x00000020, b_size=512
        device sda1 blocksize: 512
      
      Note how in hex block is 0x191C1F988 and b_blocknr is 0x91C1F988 i.e. the
      top 32-bits are missing (in this case the 0x1 at the top).
      
      This is because grow_dev_page() is broken and has a 32-bit overflow due
      to left-shifting the page index value (a pgoff_t - which is just 32 bits
      on 32-bit architectures) to form the block number.  The top bits get
      lost because the pgoff_t is not type cast to sector_t / 64-bit before
      the shift.
      
      This patch fixes this issue by type casting "index" to sector_t before
      doing the left shift.
      
      Note this is not a theoretical bug but has been seen in the field on a
      4TiB hard drive with logical sector size 512 bytes.
      
      This patch has been verified to fix the infinite loop problem on a
      3.17-rc5 kernel using a 4TB disk image mounted using "-o loop".  Without
      this patch, doing a "find /nt" where /nt is an NTFS volume causes the
      infinite loop 100% reproducibly, whilst with the patch it works fine as
      expected.
      Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2d5a944
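
      The essence of the fix, sketched from the description: widen the page
      index to sector_t before shifting so the block number is computed in 64
      bits even on 32-bit architectures (variable names are illustrative):

        pgoff_t index = page->index;       /* only 32 bits wide on 32-bit arches */

        /* broken: the shift happens in 32 bits and the top bits are lost
         *     block = index << sizebits;
         * fixed: cast to the 64-bit sector_t first, then shift */
        sector_t block = (sector_t)index << sizebits;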