1. 08 Oct, 2010 (1 commit)
  2. 05 Oct, 2010 (2 commits)
    • ksm: fix bad user data when swapping · 4e31635c
      Authored by Hugh Dickins
      Building under memory pressure, with KSM on 2.6.36-rc5, collapsed with
      an internal compiler error: typically indicating an error in swapping.
      
      Perhaps there's a timing issue which makes it now more likely, perhaps
      it's just a long time since I tried for so long: this bug goes back to
      KSM swapping in 2.6.33.
      
      Notice how reuse_swap_page() allows an exclusive page to be reused, but
      only does SetPageDirty if it can delete it from swap cache right then -
      if it's currently under Writeback, it has to be left in cache and we
      don't SetPageDirty, but the page can be reused.  Fine, the dirty bit
      will get set in the pte; but notice how zap_pte_range() does not bother
      to transfer pte_dirty to page_dirty when unmapping a PageAnon.
      
      If KSM chooses to share such a page, it will look like a clean copy of
      swapcache, and not be written out to swap when its memory is needed;
      then stale data is read back from swap when it's needed again.
      
      We could fix this in reuse_swap_page() (or even refuse to reuse a
      page under writeback), but it's more honest to fix my oversight in
      KSM's write_protect_page().  Several days of testing on three machines
      confirms that this fixes the issue they showed.
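
      The shape of the fix, as a rough sketch of write_protect_page() in
      mm/ksm.c (2.6.36-era API, surrounding checks elided - not the
      verbatim diff):

          /* mm/ksm.c, write_protect_page() - sketch */
          entry = ptep_clear_flush(vma, addr, ptep);
          if (pte_dirty(entry))
                  set_page_dirty(page);   /* the missing pte_dirty -> page_dirty step */
          entry = pte_mkclean(pte_wrprotect(entry));
          set_pte_at_notify(mm, addr, ptep, entry);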
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ksm: fix page_address_in_vma anon_vma oops · 4829b906
      Authored by Hugh Dickins
      2.6.36-rc1 commit 21d0d443 "rmap:
      resurrect page_address_in_vma anon_vma check" was right to resurrect
      that check; but now that it's comparing anon_vma->roots instead of
      just anon_vmas, there's a danger of oopsing on a NULL anon_vma.
      
      In most cases no NULL anon_vma ever gets here; but it turns out that
      occasionally KSM, when enabled on a forked or forking process, will
      itself call page_address_in_vma() on a "half-KSM" page left over from
      an earlier failed attempt to merge - whose page_anon_vma() is NULL.
      
      It's my bug that those should be getting here at all: I thought they
      were already dealt with; this oops proves me wrong, and I'll fix it
      in the next release.  Such pages are effectively pinned until their
      process exits, since rmap cannot find their ptes (though swapoff can).
      
      For now just work around it by making page_address_in_vma() safe (and
      add a comment on why that check is wanted anyway).  A similar check
      in __page_check_anon_rmap() is safe because do_page_add_anon_rmap()
      already excluded KSM pages.
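
      Roughly, the workaround tolerates the NULL rather than dereferencing
      it for the ->root comparison - a sketch of page_address_in_vma() in
      mm/rmap.c, not the verbatim diff:

          /* mm/rmap.c, page_address_in_vma() - sketch of the NULL-safe check */
          if (PageAnon(page)) {
                  struct anon_vma *page__anon_vma = page_anon_vma(page);
                  /*
                   * The check is wanted for e.g. swapoff's unuse_vma(); but
                   * a half-KSM page may arrive with page_anon_vma() == NULL.
                   */
                  if (!vma->anon_vma || !page__anon_vma ||
                      vma->anon_vma->root != page__anon_vma->root)
                          return -EFAULT;
          }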
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 26 Sep, 2010 (1 commit)
  4. 25 Sep, 2010 (1 commit)
  5. 24 Sep, 2010 (4 commits)
  6. 23 Sep, 2010 (4 commits)
  7. 22 Sep, 2010 (1 commit)
  8. 21 Sep, 2010 (2 commits)
    • percpu: fix pcpu_last_unit_cpu · 46b30ea9
      Authored by Tejun Heo
      pcpu_first/last_unit_cpu are used to track which cpu has the first and
      last units assigned.  This in turn is used to determine the span of a
      chunk for map/unmap cache flushes and whether an address belongs to
      the first chunk or not in per_cpu_ptr_to_phys().
      
      When the number of possible CPUs isn't a power of two, a chunk may
      contain unassigned units towards its end.  The logic to determine
      pcpu_last_unit_cpu was incorrect when there was an unused unit at
      the end of a chunk: it failed to ignore the unused unit and
      assigned the unused marker NR_CPUS to pcpu_last_unit_cpu.
      
      CAI Qian discovered this through a kdump failure caused by a
      malfunctioning per_cpu_ptr_to_phys() on a KVM setup with 50
      possible CPUs.
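
      The shape of the corrected scan, as a rough sketch of
      pcpu_setup_first_chunk() in mm/percpu.c (not the verbatim diff): only
      units whose cpu_map entry is a real cpu may become
      pcpu_first/last_unit_cpu, so trailing unused units are skipped:

          /* mm/percpu.c, pcpu_setup_first_chunk() - sketch */
          for (i = 0; i < gi->nr_units; i++) {
                  cpu = gi->cpu_map[i];
                  if (cpu == NR_CPUS)     /* unused unit: never "first" or "last" */
                          continue;
                  if (pcpu_first_unit_cpu == NR_CPUS)
                          pcpu_first_unit_cpu = cpu;
                  pcpu_last_unit_cpu = cpu;       /* last *assigned* cpu wins */
          }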
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: CAI Qian <caiqian@redhat.com>
      Cc: stable@kernel.org
    • mm: further fix swapin race condition · 31c4a3d3
      Authored by Hugh Dickins
      Commit 4969c119 ("mm: fix swapin race condition") is now agreed to
      be incomplete.  There's a race, not very much less likely than the
      original race envisaged, in which it is further necessary to check that
      the swapcache page's swap has not changed.
      
      Here's the reasoning: cast in terms of reuse_swap_page(), but probably
      could be reformulated to rely on try_to_free_swap() instead, or on
      swapoff+swapon.
      
      A faults into do_swap_page(): does page1 = lookup_swap_cache(swap1) and
      comes through the lock_page(page1).
      
      B, a racing thread of the same process, faults on the same address: does
      page1 = lookup_swap_cache(swap1) and now waits in lock_page(page1), but
      for whatever reason is unlucky not to get the lock any time soon.
      
      A carries on through do_swap_page(), a write fault, but cannot reuse the
      swap page1 (another reference to swap1).  Unlocks the page1 (but B
      doesn't get it yet), does COW in do_wp_page(), page2 now in that pte.
      
      C, perhaps the parent of A+B, comes in and write faults the same swap
      page1 into its mm, reuse_swap_page() succeeds this time, swap1 is freed.
      
      kswapd comes in after some time (B still unlucky) and swaps out some
      pages from A+B and C: it allocates the original swap1 to page2 in A+B,
      and some other swap2 to the original page1 now in C.  But does not
      immediately free page1 (actually it couldn't: B holds a reference),
      leaving it in swap cache for now.
      
      B at last gets the lock on page1, hooray! Is PageSwapCache(page1)? Yes.
      Is pte_same(*page_table, orig_pte)? Yes, because page2 has now been
      given the swap1 which page1 used to have.  So B proceeds to insert page1
      into A+B's page_table, though its content now belongs to C, quite
      different from what A wrote there.
      
      B ought to have checked that page1's swap was still swap1.
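
      Roughly, the added check in do_swap_page() (mm/memory.c) - a sketch,
      not the verbatim diff: once B finally gets the page lock, it verifies
      not just PageSwapCache() and pte_same(), but that the page still
      backs the same swap entry:

          /* mm/memory.c, do_swap_page() - sketch */
          lock_page(page);
          /*
           * The page pin and the later pte_same() test are not enough: the
           * swap slot may have been recycled to another page meanwhile.
           */
          if (unlikely(!PageSwapCache(page) ||
                       page_private(page) != entry.val))
                  goto out_page;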
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 10 Sep, 2010 (14 commits)
  10. 29 Aug, 2010 (1 commit)
    • mm: fix hang on anon_vma->root->lock · f1819427
      Authored by Hugh Dickins
      After several hours, kbuild tests hang with anon_vma_prepare() spinning on
      a newly allocated anon_vma's lock - on a box with CONFIG_TREE_PREEMPT_RCU=y
      (which makes this very much more likely, but it could happen without).
      
      The ever-subtle page_lock_anon_vma() now needs a further twist: since
      anon_vma_prepare() and anon_vma_fork() are liable to change the ->root
      of a reused anon_vma structure at any moment, page_lock_anon_vma()
      needs to check page_mapped() again before succeeding, otherwise
      page_unlock_anon_vma() might address a different root->lock.
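
      That twist, as a rough sketch of page_lock_anon_vma() in mm/rmap.c
      (2.6.36-era locking, not the verbatim diff):

          /* mm/rmap.c, page_lock_anon_vma() - sketch of the re-check */
          anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
          anon_vma_lock(anon_vma);
          if (page_mapped(page))          /* still mapped: this anon_vma is in
                                             use, so ->root cannot have moved */
                  return anon_vma;
          anon_vma_unlock(anon_vma);      /* unmapped meanwhile: back off */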
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 27 Aug, 2010 (3 commits)
    • percpu: fix a mismatch between code and comment · 54157c44
      Authored by Namhyung Kim
      When pcpu_build_alloc_info() searches for the best upa value, it
      rejects a candidate if the number of wasted units exceeds 1/3 of the
      total number of cpus, but the comment on the code says it rejects
      wastage over 25%.  Modify the comment to match the code.
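
      The condition in question, sketched from the best_upa search in
      pcpu_build_alloc_info() (mm/percpu.c, not the verbatim code):

          /* mm/percpu.c, pcpu_build_alloc_info() - sketch */
          /* don't end up wasting more than 1/3 of the units (old comment said 25%) */
          if (wasted > num_possible_cpus() / 3)
                  continue;       /* reject this upa candidate */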
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • percpu: fix a memory leak in pcpu_extend_area_map() · a002d148
      Authored by Huang Shijie
      The original code did not free the old map.  This patch fixes it.
      
      tj: use @old as memcpy source instead of @chunk->map, and indentation
          and description update
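
      Roughly, the fix (a sketch of pcpu_extend_area_map() in mm/percpu.c,
      not the verbatim diff): keep a pointer to the old map, copy from it,
      and free it once pcpu_lock is dropped, since freeing may end in
      vfree(), which must not run under the IRQ-safe lock:

          /* mm/percpu.c, pcpu_extend_area_map() - sketch */
          old = chunk->map;
          memcpy(new, old, old_size);     /* copy from @old, per tj's note */
          chunk->map_alloc = new_alloc;
          chunk->map = new;
          spin_unlock_irqrestore(&pcpu_lock, flags);
          pcpu_mem_free(old, old_size);   /* this free was missing */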
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
    • writeback: do not lose wakeup events when forking bdi threads · 6628bc74
      Authored by Artem Bityutskiy
      This patch fixes the following issue:
      
      INFO: task mount.nfs4:1120 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mount.nfs4    D 00000000fffc6a21     0  1120   1119 0x00000000
       ffff880235643948 0000000000000046 ffffffff00000000 ffffffff00000000
       ffff880235643fd8 ffff880235314760 00000000001d44c0 ffff880235643fd8
       00000000001d44c0 00000000001d44c0 00000000001d44c0 00000000001d44c0
      Call Trace:
       [<ffffffff813bc747>] schedule_timeout+0x34/0xf1
       [<ffffffff813bc530>] ? wait_for_common+0x3f/0x130
       [<ffffffff8106b50b>] ? trace_hardirqs_on+0xd/0xf
       [<ffffffff813bc5c3>] wait_for_common+0xd2/0x130
       [<ffffffff8104159c>] ? default_wake_function+0x0/0xf
       [<ffffffff813beaa0>] ? _raw_spin_unlock+0x26/0x2a
       [<ffffffff813bc6bb>] wait_for_completion+0x18/0x1a
       [<ffffffff81101a03>] sync_inodes_sb+0xca/0x1bc
       [<ffffffff811056a6>] __sync_filesystem+0x47/0x7e
       [<ffffffff81105798>] sync_filesystem+0x47/0x4b
       [<ffffffff810e7ffd>] generic_shutdown_super+0x22/0xd2
       [<ffffffff810e80f8>] kill_anon_super+0x11/0x4f
       [<ffffffffa00d06d7>] nfs4_kill_super+0x3f/0x72 [nfs]
       [<ffffffff810e7b68>] deactivate_locked_super+0x21/0x41
       [<ffffffff810e7fd6>] deactivate_super+0x40/0x45
       [<ffffffff810fc66c>] mntput_no_expire+0xb8/0xed
       [<ffffffff810fc73b>] release_mounts+0x9a/0xb0
       [<ffffffff810fc7bb>] put_mnt_ns+0x6a/0x7b
       [<ffffffffa00d0fb2>] nfs_follow_remote_path+0x19a/0x296 [nfs]
       [<ffffffffa00d11ca>] nfs4_try_mount+0x75/0xaf [nfs]
       [<ffffffffa00d1790>] nfs4_get_sb+0x276/0x2ff [nfs]
       [<ffffffff810e7dba>] vfs_kern_mount+0xb8/0x196
       [<ffffffff810e7ef6>] do_kern_mount+0x48/0xe8
       [<ffffffff810fdf68>] do_mount+0x771/0x7e8
       [<ffffffff810fe062>] sys_mount+0x83/0xbd
       [<ffffffff810089c2>] system_call_fastpath+0x16/0x1b
      
      The reason for this hang was a race condition: when the flusher thread
      forks a bdi thread, we use 'kthread_run()', so the thread starts running
      _before_ we make it visible in 'bdi->wb.task'.  The bdi thread runs,
      does all its work, and goes to sleep; 'bdi->wb.task' is still NULL.
      This is a dangerous time window.

      If someone queues a work for this bdi during that window, they do not
      see the bdi thread and wake up the forker thread instead - but the
      forker has already forked this bdi thread, it just has not made it
      visible yet!

      The result is that we lose the wake-up event for this bdi thread and
      the NFS4 code waits forever.

      To fix the problem, use 'kthread_create()' for creating bdi threads,
      then make them visible in 'bdi->wb.task', and only after that wake
      them up.  This is exactly what this patch does.
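
      A rough sketch of the fixed forker path (mm/backing-dev.c, not the
      verbatim diff): create the thread without running it, publish it under
      wb_lock so 'bdi_queue_work()' can see it, and only then wake it:

          /* mm/backing-dev.c, bdi_forker_thread() - sketch */
          task = kthread_create(bdi_writeback_thread, &bdi->wb,
                                "flush-%s", dev_name(bdi->dev));
          if (!IS_ERR(task)) {
                  spin_lock_bh(&bdi->wb_lock);
                  bdi->wb.task = task;            /* visible before it runs */
                  spin_unlock_bh(&bdi->wb_lock);
                  wake_up_process(task);          /* no wakeup can be lost now */
          }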
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  12. 25 Aug, 2010 (1 commit)
  13. 24 Aug, 2010 (1 commit)
    • writeback: write_cache_pages doesn't terminate at nr_to_write <= 0 · 546a1924
      Authored by Dave Chinner
      I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have
      been. Enabling writeback tracing showed:
      
          flush-253:16-8516  [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
      
      Background writeback does not terminate when ->writepage returns with
      wbc->nr_to_write == 0, resulting in sub-optimal single-page writeback
      on XFS.
      
      Fix the write_cache_pages loop to terminate correctly when this situation
      occurs and so prevent this sub-optimal background writeback pattern. This
      improves sustained sequential buffered write performance from around
      250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM.
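
      A rough sketch of the corrected accounting in write_cache_pages()
      (mm/page-writeback.c, not the verbatim diff): charge every written page
      against the budget and stop non-integrity writeback as soon as it is
      spent, instead of letting nr_to_write go negative:

          /* mm/page-writeback.c, write_cache_pages() - sketch */
          if (--wbc->nr_to_write <= 0 &&
              wbc->sync_mode == WB_SYNC_NONE) {
                  /* budget spent; integrity sync (WB_SYNC_ALL) keeps going */
                  done = 1;
                  break;
          }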
      
      Cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  14. 23 Aug, 2010 (1 commit)
  15. 21 Aug, 2010 (3 commits)