1. 22 Sep 2010, 1 commit
  2. 21 Sep 2010, 1 commit
    • mm: further fix swapin race condition · 31c4a3d3
      Committed by Hugh Dickins
      Commit 4969c119 ("mm: fix swapin race condition") is now agreed to
      be incomplete.  There's a race, not very much less likely than the
      original race envisaged, in which it is further necessary to check that
      the swapcache page's swap has not changed.
      
      Here's the reasoning: cast in terms of reuse_swap_page(), but probably
      could be reformulated to rely on try_to_free_swap() instead, or on
      swapoff+swapon.
      
      A, faults into do_swap_page(): does page1 = lookup_swap_cache(swap1) and
      comes through the lock_page(page1).
      
      B, a racing thread of the same process, faults on the same address: does
      page1 = lookup_swap_cache(swap1) and now waits in lock_page(page1), but
      for whatever reason is unlucky not to get the lock any time soon.
      
      A carries on through do_swap_page(), a write fault, but cannot reuse the
      swap page1 (another reference to swap1).  Unlocks the page1 (but B
      doesn't get it yet), does COW in do_wp_page(), page2 now in that pte.
      
      C, perhaps the parent of A+B, comes in and write faults the same swap
      page1 into its mm, reuse_swap_page() succeeds this time, swap1 is freed.
      
      kswapd comes in after some time (B still unlucky) and swaps out some
      pages from A+B and C: it allocates the original swap1 to page2 in A+B,
      and some other swap2 to the original page1 now in C.  But does not
      immediately free page1 (actually it couldn't: B holds a reference),
      leaving it in swap cache for now.
      
      B at last gets the lock on page1, hooray! Is PageSwapCache(page1)? Yes.
      Is pte_same(*page_table, orig_pte)? Yes, because page2 has now been
      given the swap1 which page1 used to have.  So B proceeds to insert page1
      into A+B's page_table, though its content now belongs to C, quite
      different from what A wrote there.
      
      B ought to have checked that page1's swap was still swap1.
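      
      The fix is a recheck once B finally gets the page lock. A minimal
      sketch of the idea, assuming the check lands in do_swap_page() right
      after lock_page() (the unlock/retry path is simplified here):
      
          lock_page(page);
      
          /*
           * Back out if someone else reallocated our swap entry while we
           * slept on the page lock: the page must still be swap cache,
           * and page_private() must still hold the swp_entry_t value we
           * originally faulted on.
           */
          if (unlikely(!PageSwapCache(page) ||
                       page_private(page) != entry.val)) {
                  unlock_page(page);
                  goto out;       /* retry the fault from scratch */
          }
      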
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31c4a3d3
  3. 10 Sep 2010, 14 commits
  4. 29 Aug 2010, 1 commit
    • mm: fix hang on anon_vma->root->lock · f1819427
      Committed by Hugh Dickins
      After several hours, kbuild tests hang with anon_vma_prepare() spinning on
      a newly allocated anon_vma's lock - on a box with CONFIG_TREE_PREEMPT_RCU=y
      (which makes this very much more likely, but it could happen without).
      
      The ever-subtle page_lock_anon_vma() now needs a further twist: since
      anon_vma_prepare() and anon_vma_fork() are liable to change the ->root
      of a reused anon_vma structure at any moment, page_lock_anon_vma()
      needs to check page_mapped() again before succeeding, otherwise
      page_unlock_anon_vma() might address a different root->lock.
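      
      A sketch of the reworked lookup, with the RCU bookkeeping and error
      paths stripped out (page_anon_vma() is used here for brevity; the
      real function open-codes the page->mapping decoding):
      
          struct anon_vma *page_lock_anon_vma(struct page *page)
          {
                  struct anon_vma *anon_vma = page_anon_vma(page);
      
                  if (!anon_vma || !page_mapped(page))
                          return NULL;
      
                  spin_lock(&anon_vma->root->lock);
      
                  /*
                   * anon_vma_prepare()/anon_vma_fork() may have reused this
                   * anon_vma and changed its ->root while we were acquiring
                   * the lock.  If the page is still mapped, the root can no
                   * longer change under us; otherwise back out, so that
                   * page_unlock_anon_vma() never releases a different
                   * root->lock than the one taken here.
                   */
                  if (page_mapped(page))
                          return anon_vma;
      
                  spin_unlock(&anon_vma->root->lock);
                  return NULL;
          }
      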
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1819427
  5. 27 Aug 2010, 3 commits
    • percpu: fix a mismatch between code and comment · 54157c44
      Committed by Namhyung Kim
      When pcpu_build_alloc_info() searches for the best_upa value, it
      ignores the current candidate if the number of wasted units exceeds
      1/3 of the total number of CPUs.  But the comment on that code says it
      will be ignored if wastage is over 25%.  Modify the comment to match
      the code.
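      
      For reference, the check in question looks roughly like this (a sketch
      of the loop body in pcpu_build_alloc_info(); treat the exact wording
      of the comment as approximate):
      
          /* don't end up wasting more than 1/3 of the units */
          if (wasted > num_possible_cpus() / 3)
                  continue;       /* reject this upa candidate */
      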
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      54157c44
    • percpu: fix a memory leak in pcpu_extend_area_map() · a002d148
      Committed by Huang Shijie
      The original code did not free the old map.  This patch fixes it.
      
      tj: use @old as memcpy source instead of @chunk->map, and indentation
          and description update
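      
      A sketch of the fixed shape of pcpu_extend_area_map(), with locking
      and error handling omitted (pcpu_mem_free() is the percpu allocator's
      own free helper; treat the details as approximate):
      
          old_size = chunk->map_alloc * sizeof(chunk->map[0]);
          old = chunk->map;               /* remember the map being replaced */
      
          memcpy(new, old, old_size);     /* per tj: copy from @old */
      
          chunk->map_alloc = new_alloc;
          chunk->map = new;
      
          pcpu_mem_free(old, old_size);   /* the free the original code lacked */
      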
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      a002d148
    • writeback: do not lose wakeup events when forking bdi threads · 6628bc74
      Committed by Artem Bityutskiy
      This patch fixes the following issue:
      
      INFO: task mount.nfs4:1120 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mount.nfs4    D 00000000fffc6a21     0  1120   1119 0x00000000
       ffff880235643948 0000000000000046 ffffffff00000000 ffffffff00000000
       ffff880235643fd8 ffff880235314760 00000000001d44c0 ffff880235643fd8
       00000000001d44c0 00000000001d44c0 00000000001d44c0 00000000001d44c0
      Call Trace:
       [<ffffffff813bc747>] schedule_timeout+0x34/0xf1
       [<ffffffff813bc530>] ? wait_for_common+0x3f/0x130
       [<ffffffff8106b50b>] ? trace_hardirqs_on+0xd/0xf
       [<ffffffff813bc5c3>] wait_for_common+0xd2/0x130
       [<ffffffff8104159c>] ? default_wake_function+0x0/0xf
       [<ffffffff813beaa0>] ? _raw_spin_unlock+0x26/0x2a
       [<ffffffff813bc6bb>] wait_for_completion+0x18/0x1a
       [<ffffffff81101a03>] sync_inodes_sb+0xca/0x1bc
       [<ffffffff811056a6>] __sync_filesystem+0x47/0x7e
       [<ffffffff81105798>] sync_filesystem+0x47/0x4b
       [<ffffffff810e7ffd>] generic_shutdown_super+0x22/0xd2
       [<ffffffff810e80f8>] kill_anon_super+0x11/0x4f
       [<ffffffffa00d06d7>] nfs4_kill_super+0x3f/0x72 [nfs]
       [<ffffffff810e7b68>] deactivate_locked_super+0x21/0x41
       [<ffffffff810e7fd6>] deactivate_super+0x40/0x45
       [<ffffffff810fc66c>] mntput_no_expire+0xb8/0xed
       [<ffffffff810fc73b>] release_mounts+0x9a/0xb0
       [<ffffffff810fc7bb>] put_mnt_ns+0x6a/0x7b
       [<ffffffffa00d0fb2>] nfs_follow_remote_path+0x19a/0x296 [nfs]
       [<ffffffffa00d11ca>] nfs4_try_mount+0x75/0xaf [nfs]
       [<ffffffffa00d1790>] nfs4_get_sb+0x276/0x2ff [nfs]
       [<ffffffff810e7dba>] vfs_kern_mount+0xb8/0x196
       [<ffffffff810e7ef6>] do_kern_mount+0x48/0xe8
       [<ffffffff810fdf68>] do_mount+0x771/0x7e8
       [<ffffffff810fe062>] sys_mount+0x83/0xbd
       [<ffffffff810089c2>] system_call_fastpath+0x16/0x1b
      
      The reason for this hang was a race condition: when forking a bdi
      thread we used 'kthread_run()', so the thread started running _before_
      it was made visible in 'bdi->wb.task'. The bdi thread runs, does all
      the work, and goes to sleep. 'bdi->wb.task' is still NULL. And this is
      a dangerous time window.
      
      If at this time someone queues work for this bdi, they do not see the
      bdi thread and wake up the forker thread instead! But the forker has
      already forked this bdi thread; it just has not made it visible yet!
      
      The result is that we lose the wake-up event for this bdi thread and
      the NFS4 code waits forever.
      
      To fix the problem, we should use 'kthread_create()' for creating bdi
      threads, then make them visible in 'bdi->wb.task', and only after this
      wake them up. This is exactly what this patch does.
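      
      A sketch of the corrected ordering in the forker thread (simplified
      from mm/backing-dev.c; the real code also takes bdi->wb_lock around
      the assignment to avoid racing with bdi_queue_work()):
      
          struct task_struct *task;
      
          /* Create the thread stopped instead of kthread_run()ing it. */
          task = kthread_create(bdi_writeback_thread, &bdi->wb,
                                "flush-%s", dev_name(bdi->dev));
          if (!IS_ERR(task)) {
                  bdi->wb.task = task;    /* publish before it can sleep */
                  wake_up_process(task);  /* this wakeup can no longer be lost */
          }
      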
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      6628bc74
  6. 25 Aug 2010, 1 commit
  7. 24 Aug 2010, 1 commit
    • writeback: write_cache_pages doesn't terminate at nr_to_write <= 0 · 546a1924
      Committed by Dave Chinner
      I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have
      been. Enabling writeback tracing showed:
      
          flush-253:16-8516  [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
      
      Writeback is not terminating in background writeback if ->writepage is
      returning with wbc->nr_to_write == 0, resulting in sub-optimal single page
      writeback on XFS.
      
      Fix the write_cache_pages loop to terminate correctly when this situation
      occurs and so prevent this sub-optimal background writeback pattern. This
      improves sustained sequential buffered write performance from around
      250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM.
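      
      The shape of the fix in write_cache_pages(), sketched from the loop
      body after ->writepage returns (simplified from mm/page-writeback.c):
      
          /*
           * Decrement the budget for every page written and stop a
           * WB_SYNC_NONE (background/kupdate) walk once it is exhausted.
           * Integrity sync must keep going until all pages tagged for
           * writeback before entering the loop have been written.
           */
          if (--wbc->nr_to_write <= 0 &&
              wbc->sync_mode == WB_SYNC_NONE) {
                  done = 1;
                  break;
          }
      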
      
      Cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      546a1924
  8. 23 Aug 2010, 1 commit
  9. 21 Aug 2010, 7 commits
  10. 18 Aug 2010, 1 commit
  11. 16 Aug 2010, 1 commit
    • mm: fix up some user-visible effects of the stack guard page · d7824370
      Committed by Linus Torvalds
      This commit makes the stack guard page somewhat less visible to user
      space. It does this by:
      
       - not showing the guard page in /proc/<pid>/maps
      
         It looks like lvm-tools will actually read /proc/self/maps to figure
         out where all its mappings are, and effectively do a specialized
         "mlockall()" in user space.  By not showing the guard page as part of
         the mapping (by just adding PAGE_SIZE to the start for grows-up
         pages), lvm-tools ends up not being aware of it.
      
       - by also teaching the _real_ mlock() functionality not to try to lock
         the guard page.
      
         That would just expand the mapping down to create a new guard page,
         so there really is no point in trying to lock it in place.  (Both
         points are sketched in code below.)
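      
      A sketch of both points, written here for VM_GROWSDOWN stacks (the
      common case). The /proc side would live in fs/proc/task_mmu.c and the
      mlock side in mm/mlock.c; treat exact placement and names as
      approximate:
      
          /* 1: /proc/<pid>/maps -- report the segment as starting one page
           * up, so the guard page is never shown as part of the mapping. */
          unsigned long start = vma->vm_start;
          if (vma->vm_flags & VM_GROWSDOWN)
                  start += PAGE_SIZE;
      
          /* 2: mlock -- skip the guard page instead of faulting it in,
           * which would only expand the stack and create a new guard page
           * below the one being locked. */
          if (start == vma->vm_start && (vma->vm_flags & VM_GROWSDOWN))
                  start += PAGE_SIZE;
      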
      
      It would perhaps be nice to show the guard page specially in
      /proc/<pid>/maps (or at least mark grow-down segments some way), but
      let's not open ourselves up to more breakage by user space from
      programs that depend on the exact details of the 'maps' file.
      
      Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
      source code to see what was going on with the whole new warning.
      
      Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be>
      Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7824370
  12. 15 Aug 2010, 2 commits
  13. 14 Aug 2010, 2 commits
  14. 13 Aug 2010, 1 commit
    • mm: keep a guard page below a grow-down stack segment · 320b2b8d
      Committed by Linus Torvalds
      This is a rather minimally invasive patch to solve the problem of the
      user stack growing into a memory mapped area below it.  Whenever we fill
      the first page of the stack segment, expand the segment down by one
      page.
      
      Now, admittedly some odd application might _want_ the stack to grow down
      into the preceding memory mapping, and so we may at some point need to
      make this a process tunable (some people might also want to have more
      than a single page of guarding), but let's try the minimal approach
      first.
      
      Tested with a trivial application that maps a single page just below
      the stack, and then starts recursing.  Without this, we will get a SIGSEGV
      _after_ the stack has smashed the mapping.  With this patch, we'll get a
      nice SIGBUS just as the stack touches the page just above the mapping.
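      
      A sketch of the mechanism, modeled on the helper this patch adds to
      the anonymous-page fault path in mm/memory.c (treat the exact call
      site as approximate):
      
          /*
           * Called when faulting in an anonymous page: if this is the first
           * (lowest) page of a grows-down stack vma, pre-extend the stack
           * by one page so an unmapped guard page always remains below it.
           * expand_stack() fails -- and the fault gets a SIGBUS -- if that
           * would run into the mapping below.
           */
          static inline int check_stack_guard_page(struct vm_area_struct *vma,
                                                   unsigned long address)
          {
                  address &= PAGE_MASK;
                  if ((vma->vm_flags & VM_GROWSDOWN) && address == vma->vm_start)
                          return expand_stack(vma, address - PAGE_SIZE);
                  return 0;
          }
      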
      Requested-by: Keith Packard <keithp@keithp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      320b2b8d
  15. 12 Aug 2010, 3 commits
    • writeback: add comment to the dirty limit functions · 1babe183
      Committed by Wu Fengguang
      Document global_dirty_limits() and bdi_dirty_limit().
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1babe183
    • writeback: avoid unnecessary calculation of bdi dirty thresholds · 16c4042f
      Committed by Wu Fengguang
      Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(),
      so that the latter can be avoided when under the global dirty
      background threshold (which is the normal state for most systems).
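      
      Sketched usage after the split, simplified from the throttling loop in
      balance_dirty_pages(); the early break is the point, since most
      systems run below the background threshold most of the time:
      
          global_dirty_limits(&background_thresh, &dirty_thresh);
      
          /*
           * Throttling kicks in only near the dirty limits.  Below the
           * midpoint we are done and never pay for bdi_dirty_limit().
           */
          if (nr_reclaimable + nr_writeback <=
              (background_thresh + dirty_thresh) / 2)
                  break;
      
          bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
      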
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16c4042f
    • writeback: balance_dirty_pages(): reduce calls to global_page_state · e50e3720
      Committed by Wu Fengguang
      Reducing the number of times balance_dirty_pages calls global_page_state
      reduces the cache references and so improves write performance on a
      variety of workloads.
      
      'perf stats' of simple fio write tests shows the reduction in cache
      accesses.  The test is fio 'write,mmap,600Mb,pre_read' on an AMD
      AthlonX2 with 3Gb of memory (dirty_threshold approx 600 Mb), running
      each test 10 times, dropping the fastest & slowest values, then taking
      the average & standard deviation:
      
      		average (s.d.) in millions (10^6)
      2.6.31-rc8	648.6 (14.6)
      +patch		620.1 (16.5)
      
      This reduction is achieved by dropping clip_bdi_dirty_limit(), which
      rereads the counters to apply the dirty_threshold, and moving that
      check up into balance_dirty_pages(), where the counters have already
      been read.
      
      Also, rearranging the for loop to contain only one copy of the limit
      tests allows the pdflush test after the loop to use the local copies
      of the counters rather than rereading them.
      
      In the common case with no throttling it now calls global_page_state 5
      fewer times and bdi_stat 2 fewer.
      
      Fengguang:
      
      This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
      with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh)
      to avoid exceeding the dirty limit.  Since the bdi dirty limit is mostly
      accurate, we don't need to clip routinely.  A simple dirty limit check
      would be enough.
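      
      The replacement check, roughly (a sketch of the logic described above;
      variable names follow balance_dirty_pages()):
      
          /*
           * Throttle if either the per-bdi or the global limit has been
           * crossed; the explicit global test replaces clip_bdi_dirty_limit().
           */
          dirty_exceeded =
                  (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh) ||
                  (nr_reclaimable + nr_writeback > dirty_thresh);
      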
      
      The check is necessary because, in principle, we should throttle
      everything calling balance_dirty_pages() when we're over the total
      limit, as Peter said.
      
      We now set and clear dirty_exceeded not only based on the bdi dirty
      limits, but also on the global dirty limit.  The global limit check is
      added in place of clip_bdi_dirty_limit() for safety, and is not
      intended as a behavior change.  The bdi limits should be tight enough
      to keep all dirty pages under the global limit most of the time;
      occasionally exceeding it slightly should be OK.  The change makes the
      logic more obvious: the global limit is the ultimate goal and shall
      always be imposed.
      
      We may now start background writeback work based on outdated
      conditions.  That's safe because the bdi flush thread will (and has to)
      double-check the states.  It reduces overall overhead because a test
      based on old states still has a good chance of being right.
      
      [akpm@linux-foundation.org] fix uninitialized dirty_exceeded
      Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e50e3720