1. 22 March 2012 (3 commits)
    • cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Authored by Mel Gorman
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
      investigating a large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both read and write
      sides with a sequence counter with just read barriers on the fast path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic if the nodemask changes in a
      manner that can cause a false failure.
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
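
      As a rough sketch of the fast path this produces (the cookie-style
      get_mems_allowed()/put_mems_allowed() interface follows the description
      above; the exact signatures and allocator entry point are approximations,
      not the verbatim hunk):

        unsigned int cpuset_mems_cookie;
        struct page *page;

        retry_cpuset:
                /* read side: sample the sequence counter, no full barriers */
                cpuset_mems_cookie = get_mems_allowed();

                page = __alloc_pages_nodemask(gfp_mask, order, zonelist, nodemask);

                /*
                 * put_mems_allowed() reports whether the nodemask stayed stable.
                 * If it changed while we allocated, a NULL page may be a false
                 * failure, so retry with the updated mask.
                 */
                if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
                        goto retry_cpuset;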
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small but the System CPU time is much
      improved and roughly in correlation to what oprofile reported (these
      performance figures are without profiling so skew is expected).  The
      actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
      To test the actual bug the commit fixed I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: update stale lock ordering comment for memory-failure.c · 9a3c531d
      Authored by Andi Kleen
      When i_mmap_lock changed to a mutex the locking order in memory failure
      was changed to take the sleeping lock first.  But the big fat mm lock
      ordering comment (BFMLO) wasn't updated.  Do this here.
      
      Pointed out by Andrew.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't set __GFP_WRITE on ramfs/sysfs writes · 1010bb1b
      Authored by Fengguang Wu
      There is not much point in skipping zones during allocation based on the
      dirty usage which they'll never contribute to.  And we'd like to avoid
      page reclaim waits when writing to ramfs/sysfs etc.
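
      A minimal sketch of the idea, assuming the mapping_cap_account_dirty()
      test is what distinguishes such mappings (details of the actual hunk may
      differ):

        /*
         * Only request dirty-aware zone placement for mappings that actually
         * account dirty pages; ramfs/sysfs-style mappings never do, so their
         * writes should not skip zones or wait on page reclaim.
         */
        gfp_t gfp_mask = mapping_gfp_mask(mapping);

        if (mapping_cap_account_dirty(mapping))
                gfp_mask |= __GFP_WRITE;

        page = __page_cache_alloc(gfp_mask);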
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 20 March 2012 (1 commit)
  3. 04 February 2012 (1 commit)
    • readahead: fix pipeline break caused by block plug · 3deaa719
      Authored by Shaohua Li
      Herbert Poetzl reported a performance regression since 2.6.39.  The test
      is a simple dd read, but with big block size.  The reason is:
      
      T1: ra (A, A+128k), (A+128k, A+256k)
      T2: lock_page for page A, submit the 256k
      T3: hit page A+128k, ra (A+256k, A+384k). The range isn't submitted
      because of the plug and there isn't any lock_page till we hit page A+256k,
      because all pages from A to A+256k are in memory
      T4: hit page A+256k, ra (A+384k, A+512k). Because of the plug, the range isn't
      submitted again.
      T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task is
      waiting for (A+256k, A+512k) to finish.
      
      There is no request to disk in T3 and T4, so readahead pipeline breaks.
      
      We really don't need a block plug for generic_file_aio_read() for buffered
      I/O.  The readahead code already has a plug and has fine-grained control
      over when I/O should be submitted.  Deleting the plug for buffered I/O
      fixes the regression.
      
      One side effect is that the plug makes the request size 256k, while it is
      128k without it.  This is because the default readahead size is 128k, and
      it is not a reason we need the plug here.
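
      For illustration, the shape of the nesting that goes away (a sketch, not
      the exact diff):

        /*
         * Before the fix, generic_file_aio_read() wrapped buffered reads in its
         * own plug, so readahead I/O queued under it was only flushed once the
         * task slept in lock_page():
         */
        struct blk_plug plug;

        blk_start_plug(&plug);
        do_generic_file_read(filp, ppos, &desc, file_read_actor);
        blk_finish_plug(&plug);

        /*
         * After the fix the top-level plug is simply dropped for buffered I/O;
         * readahead plugs around its own submissions and unplugs promptly.
         */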
      
      Vivek said:
      
      : We submit some readahead IO to device request queue but because of nested
      : plug, queue never gets unplugged.  When read logic reaches a page which is
      : not in page cache, it waits for page to be read from the disk
      : (lock_page_killable()) and that time we flush the plug list.
      :
      : So effectively read ahead logic is kind of broken in parts because of
      : nested plugging.  Removing top level plug (generic_file_aio_read()) for
      : buffered reads, will allow unplugging queue earlier for readahead.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Reported-by: Herbert Poetzl <herbert@13thfloor.at>
      Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 13 January 2012 (1 commit)
    • memcg: add mem_cgroup_replace_page_cache() to fix LRU issue · ab936cbc
      Authored by KAMEZAWA Hiroyuki
      Commit ef6a3c63 ("mm: add replace_page_cache_page() function") added a
      function replace_page_cache_page().  This function replaces a page in the
      radix-tree with a new page.  When doing this, the memory cgroup needs to
      fix up the accounting information: memcg needs to check the PCG_USED bit etc.
      
      In some (many?) cases, 'newpage' is on the LRU before calling
      replace_page_cache_page().  So, memcg's LRU accounting information should be
      fixed, too.
      
      This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
      In that function, the old page will be unaccounted without touching
      res_counter, and the new page will be accounted to the memcg (of the old
      page).  When overwriting pc->mem_cgroup of the newpage, take zone->lru_lock
      to avoid races with LRU handling.
      
      Background:
        replace_page_cache_page() is called by FUSE code in its splice() handling.
        Here, 'newpage' is replacing oldpage, but this newpage is not a newly allocated
        page and may be on the LRU. LRU mis-accounting is critical for the memory cgroup
        because rmdir() checks that the whole LRU is empty and that there is no account leak.
        If a page is on a different LRU than it should be, rmdir() will fail.
      
      This bug was added in March 2011, but no bug report yet.  I guess there
      are not many people who use memcg and FUSE at the same time with upstream
      kernels.
      
      The result of this bug is that the admin cannot destroy a memcg because of
      the account leak.  So, no panic, no deadlock.  And, even if an active cgroup
      exists, umount can succeed.  So there is no problem at shutdown.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 11 January 2012 (1 commit)
  6. 04 January 2012 (1 commit)
  7. 22 December 2011 (1 commit)
  8. 02 December 2011 (1 commit)
  9. 31 October 2011 (1 commit)
  10. 28 October 2011 (1 commit)
    • vfs: iov_iter: have iov_iter_advance decrement nr_segs appropriately · 39be79c1
      Authored by Jeff Layton
      Currently, when you call iov_iter_advance, the pointer to the iovec
      array can be incremented, but the nr_segs value in the iov_iter struct
      is not decremented. The result is an iov_iter struct with an nr_segs
      value that goes beyond the end of the array.
      
      While I'm not aware of anything that's specifically broken by this, it
      seems odd and a bit dangerous not to decrement that value. If someone
      were to trust the nr_segs value to be correct, then they could end up
      walking off the end of the array.
      
      Changing this might also provide some micro-optimization when dealing
      with the last iovec in an array. Many of the other routines that deal
      with iov_iter have optimized codepaths when nr_segs == 1.
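
      A simplified sketch of the behaviour after the change, using the iov_iter
      fields of that era (the real helper also special-cases nr_segs == 1 and
      short copies):

        /*
         * Advance the iterator: walk forward through the iovec array and keep
         * nr_segs in step with the iov pointer as segments are fully consumed.
         */
        static void iov_iter_advance_sketch(struct iov_iter *i, size_t bytes)
        {
                size_t skip = i->iov_offset + bytes;

                i->count -= bytes;
                while (i->nr_segs && skip >= i->iov->iov_len) {
                        skip -= i->iov->iov_len;
                        i->iov++;
                        i->nr_segs--;   /* don't let nr_segs run past the array */
                }
                i->iov_offset = skip;
        }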
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  11. 15 September 2011 (1 commit)
  12. 04 August 2011 (4 commits)
    • mm: clarify the radix_tree exceptional cases · 8079b1c8
      Authored by Hugh Dickins
      Make the radix_tree exceptional cases, mostly in filemap.c, clearer.
      
      It's hard to devise a suitable snappy name that illuminates the use by
      shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
      generality.  And akpm points out that /* radix_tree_deref_retry(page) */
      comments look like calls that have been commented out for unknown
      reason.
      
      Skirt the naming difficulty by rearranging these blocks to handle the
      transient radix_tree_deref_retry(page) case first; then just explain the
      remaining shmem/tmpfs swap case in a comment.
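
      The rearranged lookup then reads roughly like this (a sketch of the
      pattern in find_get_page(), not the verbatim hunk):

        rcu_read_lock();
        repeat:
                page = NULL;
                pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
                if (pagep) {
                        page = radix_tree_deref_slot(pagep);
                        if (unlikely(!page))
                                goto out;
                        if (radix_tree_exception(page)) {
                                /* transient state: the tree is being reshaped */
                                if (radix_tree_deref_retry(page))
                                        goto repeat;
                                /*
                                 * Otherwise shmem/tmpfs is storing a swap entry
                                 * here as an exceptional entry: return it
                                 * without trying to raise the page count.
                                 */
                                goto out;
                        }
                        if (!page_cache_get_speculative(page))
                                goto repeat;
                        /* has the page moved? */
                        if (unlikely(page != *pagep)) {
                                page_cache_release(page);
                                goto repeat;
                        }
                }
        out:
                rcu_read_unlock();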
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: a few small updates for radix-swap · 31475dd6
      Authored by Hugh Dickins
      Remove PageSwapBacked (!page_is_file_cache) cases from
      add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
      go through shmem_add_to_page_cache().
      
      Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
      and add a comment on swap entries to invalidate_mapping_pages().
      
      And mincore_page() uses find_get_page() on what might be shmem or a
      tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
      find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: let swap use exceptional entries · a2c16d6c
      Authored by Hugh Dickins
      If swap entries are to be stored along with struct page pointers in a
      radix tree, they need to be distinguished as exceptional entries.
      
      Most of the handling of swap entries in radix tree will be contained in
      shmem.c, but a few functions in filemap.c's common code need to check
      for their appearance: find_get_page(), find_lock_page(),
      find_get_pages() and find_get_pages_contig().
      
      So as not to slow their fast paths, tuck those checks inside the
      existing checks for unlikely radix_tree_deref_slot(); except for
      find_lock_page(), where it is an added test.  And make it a BUG in
      find_get_pages_tag(), which is not applied to tmpfs files.
      
      Part of the reason for eliminating shmem_readpage() earlier was to
      minimize the places where common code would need to allow for swap
      entries.
      
      The swp_entry_t known to swapfile.c must be massaged into a slightly
      different form when stored in the radix tree, just as it gets massaged
      into a pte_t when stored in page tables.
      
      In an i386 kernel this limits its information (type and page offset) to
      30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
      swapfile size of 128GB.  Which is less than the 512GB we previously
      allowed with X86_PAE (where the swap entry can occupy the entire upper
      32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
      there's not a new limitation on 64-bit (where swap filesize is already
      limited to 16TB by a 32-bit page offset).  Thirty areas of 128GB is
      probably still enough swap for a 64GB 32-bit machine.
      
      Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
      enforce filesize limit in read_swap_header(), just as for ptes.
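
      A sketch of the two conversions, assuming the exceptional-entry bit
      layout from the companion radix_tree patch (value shifted above the
      reserved low bits, low bits used as the tag):

        /* Pack a swp_entry_t into a radix_tree slot as an exceptional entry. */
        static inline void *swp_to_radix_entry(swp_entry_t entry)
        {
                unsigned long value = entry.val << RADIX_TREE_EXCEPTIONAL_SHIFT;

                return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
        }

        /* And back: strip the tag bits to recover the original swap entry. */
        static inline swp_entry_t radix_to_swp_entry(void *arg)
        {
                swp_entry_t entry;

                entry.val = (unsigned long)arg >> RADIX_TREE_EXCEPTIONAL_SHIFT;
                return entry;
        }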
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix_tree: exceptional entries and indices · 6328650b
      Authored by Hugh Dickins
      A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
      peculiar swap vector, instead keeping a file's swap entries in the same
      radix tree as its struct page pointers: thus saving memory, and
      simplifying its code and locking.
      
      This patch:
      
      The radix_tree is used by several subsystems for different purposes.  A
      major use is to store the struct page pointers of a file's pagecache for
      memory management.  But what if mm wanted to store something other than
      page pointers there too?
      
      The low bit of a radix_tree entry is already used to denote an indirect
      pointer, for internal use, and the unlikely radix_tree_deref_retry()
      case.
      
      Define the next bit as denoting an exceptional entry, and supply inline
      functions radix_tree_exception() to return non-0 in either unlikely
      case, and radix_tree_exceptional_entry() to return non-0 in the second
      case.
      
      If a subsystem already uses radix_tree with that bit set, no problem: it
      does not affect internal workings at all, but is defined for the
      convenience of those storing well-aligned pointers in the radix_tree.
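
      A sketch of the two helpers as described (bit values follow the
      convention above; treat the exact constant names as assumptions):

        #define RADIX_TREE_INDIRECT_PTR         1   /* low bit: internal indirect pointer */
        #define RADIX_TREE_EXCEPTIONAL_ENTRY    2   /* next bit: exceptional entry */

        /* Non-zero in either unlikely case: indirect pointer or exceptional. */
        static inline int radix_tree_exception(void *arg)
        {
                return unlikely((unsigned long)arg &
                        (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
        }

        /* Non-zero only for an exceptional entry. */
        static inline int radix_tree_exceptional_entry(void *arg)
        {
                return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
        }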
      
      The radix_tree_gang_lookups have an implicit assumption that the caller
      can deduce the offset of each entry returned e.g.  by the page->index of
      a struct page.  But that may not be feasible for some kinds of item to
      be stored there.
      
      radix_tree_gang_lookup_slot() now allows for an optional indices argument,
      an output array in which to return those offsets.  The same could be added
      to other radix_tree_gang_lookups, but for now keep it to the only one
      for which we need it.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 26 July 2011 (2 commits)
    • mm: consistent truncate and invalidate loops · b85e0eff
      Authored by Hugh Dickins
      Make the pagevec_lookup loops in truncate_inode_pages_range(),
      invalidate_mapping_pages() and invalidate_inode_pages2_range() more
      consistent with each other.
      
      They were relying upon page->index of an unlocked page, but apologizing
      for it: accept it, embrace it, add comments and WARN_ONs, and simplify the
      index handling.
      
      invalidate_inode_pages2_range() had special handling for a wrapped
      page->index + 1 = 0 case; but MAX_LFS_FILESIZE doesn't let us anywhere
      near there, and a corrupt page->index in the radix_tree could cause more
      trouble than that would catch.  Remove that wrapped handling.
      
      invalidate_inode_pages2_range() uses min() to limit the pagevec_lookup
      when near the end of the range: copy that into the other two, although
      it's less useful than you might think (it limits the use of the buffer,
      rather than the indices looked up).
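
      The common loop shape that results, sketched with simplified per-page
      handling (the three functions differ in what they do with each page):

        pgoff_t index = start;
        struct pagevec pvec;
        int i;

        pagevec_init(&pvec, 0);
        while (index <= end && pagevec_lookup(&pvec, mapping, index,
                        min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
                for (i = 0; i < pagevec_count(&pvec); i++) {
                        struct page *page = pvec.pages[i];

                        /* page->index of an unlocked page: accept it, but
                           recheck under the page lock before acting on it. */
                        index = page->index;
                        if (index > end)
                                break;

                        if (!trylock_page(page))
                                continue;
                        WARN_ON(page->index != index);
                        /* ... truncate or invalidate this page ... */
                        unlock_page(page);
                }
                pagevec_release(&pvec);
                cond_resched();
                index++;
        }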
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cleanup descriptions of filler arg · 5e5358e7
      Authored by Hugh Dickins
      The often-NULL data arg to read_cache_page() and read_mapping_page()
      functions is misdescribed as "destination for read data": no, it's the
      first arg to the filler function, often struct file * to ->readpage().
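
      A small usage sketch to make that concrete (this is essentially what
      read_mapping_page() does under the hood):

        /*
         * 'data' is just the first argument handed to the filler; for the
         * common ->readpage() filler it is the struct file *, not a buffer.
         */
        struct page *page = read_cache_page(mapping, index,
                                (filler_t *)mapping->a_ops->readpage, file);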
      
      Satisfy checkpatch.pl on those filler prototypes, and tidy up the
      declarations in linux/pagemap.h.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 21 July 2011 (1 commit)
    • fs: kill i_alloc_sem · bd5fe6c5
      Authored by Christoph Hellwig
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
      be released by a non-owner, and its write side is always mirrored by
      real exclusion.  Its intended use is to wait for all pending direct I/O
      requests to finish before starting a truncate.
      
      Replace it with a hand-grown construct:
      
       - exclusion for truncates is already guaranteed by i_mutex, so it can
         simply fall away
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
       - when i_dio_count reaches zero we wake up a pending truncate using
         wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to reach zero because the direct I/O count always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
      
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
      system).
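
      A sketch of the resulting discipline, assuming the inode_dio_wait() /
      inode_dio_done() helper names used for this scheme:

        /*
         * Direct I/O side: taking a new i_dio_count reference requires i_mutex
         * (or an equivalent), so no new DIO can start once truncate holds it.
         */
        atomic_inc(&inode->i_dio_count);
        /* ... submit and complete the direct I/O ... */
        inode_dio_done(inode);          /* drops the count; wakes a waiter at zero */

        /* Truncate side, with i_mutex already held: */
        inode_dio_wait(inode);          /* sleeps until i_dio_count reaches zero */
        /* ... safe to truncate: no direct I/O is in flight ... */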
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  15. 08 June 2011 (1 commit)
    • writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab
      Authored by Christoph Hellwig
      Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
      as it's currently the most contended lock in the system for metadata
      heavy workloads.  It won't help for single-filesystem workloads for
      which we'll need the I/O-less balance_dirty_pages, but at least we
      can dedicate a cpu to spinning on each bdi now for larger systems.
      
      Based on earlier patches from Nick Piggin and Dave Chinner.
      
      It reduces lock contention to 1/4 in this test case:
      10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
      
      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      vanilla 2.6.39-rc3:
                            inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
      
      2.6.39-rc3 + patch:
                      &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                      ------------------------
                      &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                      &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                      &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                      ------------------------
                      &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                      &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                      &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
      
      [hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old]
      [akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  16. 04 June 2011 (1 commit)
    • more conservative S_NOSEC handling · 9e1f1de0
      Authored by Al Viro
      Caching "we have already removed suid/caps" was overenthusiastic as merged.
      On network filesystems we might have had suid/caps set on another client,
      silently picked by this client on revalidate, all of that *without* clearing
      the S_NOSEC flag.
      
      AFAICS, the only reasonably sane way to deal with that is
      	* new superblock flag; unless set, S_NOSEC is not going to be set.
      	* local block filesystems set it in their ->mount() (more accurately,
      mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
      local block ones clear it)
      	* if any network filesystem (or a cluster one) wants to use S_NOSEC,
      it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
      inode attribute changes are picked from other clients.
      
      It's not an earth-shattering hole (anybody that can set suid on another client
      will almost certainly be able to write to the file before doing that anyway),
      but it's a bug that needs fixing.
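
      A tiny sketch of the gating rule, using an illustrative helper name (the
      real kernel helper may differ):

        /*
         * Only cache "nothing to strip" when the superblock has opted in; a
         * network filesystem that sets MS_NOSEC must also clear S_NOSEC when
         * it picks up attribute changes from another client.
         */
        static inline void cache_nosec(struct inode *inode)   /* illustrative name */
        {
                if (inode->i_sb->s_flags & MS_NOSEC)
                        inode->i_flags |= S_NOSEC;
        }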
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  17. 29 May 2011 (1 commit)
    • Cache xattr security drop check for write v2 · 69b45732
      Authored by Andi Kleen
      Some recent benchmarking on btrfs showed that a major scaling bottleneck
      on large systems is currently the xattr lookup on every write.
      
      Why xattr lookup on every write I hear you ask?
      
      write wants to drop suid and security related xattrs that could set
      capabilities for executables.  To do that it currently looks up
      security.capability on EVERY write (even for non executables) to decide
      whether to drop it or not.
      
      In btrfs this causes an additional tree walk, hitting some per file system
      locks and quite bad scalability. In a simple read workload on an 8S
      system I saw over 90% CPU time in spinlocks related to that.
      
      Chris Mason tells me this is also a problem in ext4, where it hits
      the global mbcache lock.
      
      This patch adds a simple per-inode flag to avoid this problem.  We only
      do the lookup once per file and, if there is no xattr, cache the
      decision.  All xattr changes clear the flag.
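
      A sketch of the write-path fast check described above; the slow-path
      helper name is a hypothetical stand-in for the real suid /
      security.capability work:

        static int maybe_remove_suid(struct dentry *dentry)   /* illustrative name */
        {
                struct inode *inode = dentry->d_inode;

                /* Fast path: cached "no suid bits, no security xattrs". */
                if (IS_NOSEC(inode))
                        return 0;

                /*
                 * Slow path: do the lookup once; on success the flag is set so
                 * later writes take the fast path.  Any xattr change clears it.
                 */
                return remove_suid_slow(dentry);   /* hypothetical stand-in */
        }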
      
      I also used the same flag to avoid the suid check, although
      that one is pretty cheap.
      
      A file system can also set this flag when it creates the inode,
      if it has a cheap way to do so.  This is done for some common file systems
      in follow-on patches.
      
      With this patch a major part of the lock contention disappears
      for btrfs. Some testing on smaller systems didn't show significant
      performance changes, but at least it helps the larger systems
      and is generally more efficient.
      
      v2: Rename is_sgid. add file system helper.
      Cc: chris.mason@oracle.com
      Cc: josef@redhat.com
      Cc: viro@zeniv.linux.org.uk
      Cc: agruen@linbit.com
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  18. 28 May 2011 (1 commit)
  19. 27 May 2011 (2 commits)
    • memcg: add the pagefault count into memcg stats · 456f998e
      Authored by Ying Han
      This adds two new stats to per-memcg memory.stat which track the number of
      page faults and the number of major page faults.
      
        "pgfault"
        "pgmajfault"
      
      They are different from the "pgpgin"/"pgpgout" stats, which count the number
      of pages charged/discharged to the cgroup and say nothing about reading or
      writing pages to disk.
      
      It is valuable to track the two stats both for measuring an application's
      performance and for the efficiency of the kernel page reclaim path.
      Counting page faults per process is useful, but we also need the aggregated
      value since processes are monitored and controlled on a cgroup basis in
      memcg.
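
      A sketch of where the counters get bumped in a fault path, assuming the
      per-memcg counting call this patch describes ('major_fault' is a
      placeholder condition, placement simplified):

        count_vm_event(PGFAULT);                    /* global vmstat, as before */
        mem_cgroup_count_vm_event(mm, PGFAULT);     /* new: aggregate into the memcg */

        if (major_fault) {                          /* e.g. page read from disk */
                count_vm_event(PGMAJFAULT);
                mem_cgroup_count_vm_event(mm, PGMAJFAULT);
        }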
      
      Functional test: check the total number of pgfault/pgmajfault of all
      memcgs and compare with global vmstat value:
      
        $ cat /proc/vmstat | grep fault
        pgfault 1070751
        pgmajfault 553
      
        $ cat /dev/cgroup/memory.stat | grep fault
        pgfault 1071138
        pgmajfault 553
        total_pgfault 1071142
        total_pgmajfault 553
      
        $ cat /dev/cgroup/A/memory.stat | grep fault
        pgfault 199
        pgmajfault 0
        total_pgfault 199
        total_pgmajfault 0
      
      Performance test: run the page fault test (pft) with 16 threads faulting in
      15G of anon pages in a 16G container.  There is no regression noticed in
      "flt/cpu/s".
      
      Sample output from pft:
      
        TAG pft:anon-sys-default:
          Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
          15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260
      
        +-------------------------------------------------------------------------+
            N           Min           Max        Median           Avg        Stddev
        x  10     16682.962     17344.027     16913.524     16928.812      166.5362
        +  10     16695.568     16923.896     16820.604     16824.652     84.816568
        No difference proven at 95.0% confidence
      
      [akpm@linux-foundation.org: fix build]
      [hughd@google.com: shmem fix]
      Signed-off-by: Ying Han <yinghan@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/fs: add hooks to support cleancache · c515e1fd
      Authored by Dan Magenheimer
      This fourth patch of eight in this cleancache series provides the
      core hooks in VFS for: initializing cleancache per filesystem;
      capturing clean pages reclaimed by page cache; attempting to get
      pages from cleancache before filesystem read; and ensuring coherency
      between pagecache, disk, and cleancache.  Note that the placement
      of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
      change was required due to a patchset in 2.6.39.
      
      All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
      a check of a boolean global if CONFIG_CLEANCACHE is set but no
      cleancache "backend" has claimed cleancache_ops.
      
      Details and a FAQ can be found in Documentation/vm/cleancache.txt
      
      [v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
      Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik Van Riel <riel@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Andreas Dilger <adilger@sun.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
  20. 25 May 2011 (6 commits)
  21. 25 March 2011 (2 commits)
    • fs: move i_wb_list out from under inode_lock · a66979ab
      Authored by Dave Chinner
      Protect the inode writeback list with a new global lock
      inode_wb_list_lock and use it to protect the list manipulations and
      traversals. This lock replaces the inode_lock as the inodes on the
      list can be validity checked while holding the inode->i_lock and
      hence the inode_lock is no longer needed to protect the list.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: protect inode->i_state with inode->i_lock · 250df6ed
      Authored by Dave Chinner
      Protect inode state transitions and validity checks with the
      inode->i_lock. This enables us to make inode state transitions
      independently of the inode_lock and is the first step to peeling
      away the inode_lock from the code.
      
      This requires that __iget() is done atomically with i_state checks
      during list traversals so that we don't race with another thread
      marking the inode I_FREEING between the state check and grabbing the
      reference.
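
      A sketch of the list-walk pattern this enables (simplified):

        /*
         * Pin the i_state checks and the reference grab under inode->i_lock so
         * another thread cannot mark the inode I_FREEING in between.
         */
        spin_lock(&inode->i_lock);
        if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
                spin_unlock(&inode->i_lock);
                continue;       /* being torn down: skip it */
        }
        __iget(inode);          /* safe: I_FREEING cannot be set under our i_lock */
        spin_unlock(&inode->i_lock);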
      
      Also remove the unlock_new_inode() memory barrier optimisation
      required to avoid taking the inode_lock when clearing I_NEW.
      Simplify the code by simply taking the inode->i_lock around the
      state change and wakeup. Because the wakeup is no longer tricky,
      remove the wake_up_inode() function and open code the wakeup where
      necessary.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  22. 23 March 2011 (6 commits)