提交 · 1a5a9906d4e8d1976b701f889d8f35d54b928f25 · openanolis / cloud-kernel

22 3月, 2012 1 次提交

mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906

由 Andrea Arcangeli 提交于 3月 21, 2012

In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode.  In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.

It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds).  The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().

Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously.  This is
probably why it wasn't common to run into this.  For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.

Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).

The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value.  Even if the real pmd is changing under the
value we hold on the stack, we don't care.  If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).

All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd.  The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds).  I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).

		if (pmd_trans_huge(*pmd)) {
			if (next-addr != HPAGE_PMD_SIZE) {
				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
				split_huge_page_pmd(vma->vm_mm, pmd);
			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
				continue;
			/* fall through */
		}
		if (pmd_none_or_clear_bad(pmd))

Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.

The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.

====== start quote =======
      mapcount 0 page_mapcount 1
      kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

      mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

        143 void pmd_clear_bad(pmd_t *pmd)
        144 {
    ->  145         pmd_ERROR(*pmd);
        146         pmd_clear(pmd);
        147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

       1381         if (mapcount != page_mapcount(page))
       1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
       1383                        mapcount, page_mapcount(page));
    -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

               virtual address space
              .---------------------.
              |                     |
              |                     |
            .-|---------------------|
            | |                     |
            | |                     |<-- B(fault)
            | |                     |
      2 MB  | |/////////////////////|-.
      huge <  |/////////////////////|  > A(range)
      page  | |/////////////////////|-'
            | |                     |
            | |                     |
            '-|---------------------|
              |                     |
              |                     |
              '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
      on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
             madvise_dontneed
               zap_page_range
                 unmap_vmas
                   unmap_page_range
                     zap_pud_range
                       zap_pmd_range
                         //
                         // Assume that this huge page has never been accessed.
                         // I.e. content of the PMD entry is zero (not mapped).
                         //
                         if (pmd_trans_huge(*pmd)) {
                             // We don't get here due to the above assumption.
                         }
                         //
                         // Assume that Thread B incurred a page fault and
             .---------> // sneaks in here as shown below.
             |           //
             |           if (pmd_none_or_clear_bad(pmd))
             |               {
             |                 if (unlikely(pmd_bad(*pmd)))
             |                     pmd_clear_bad
             |                     {
             |                       pmd_ERROR
             |                         // Log "bad pmd ..." message here.
             |                       pmd_clear
             |                         // Clear the page's PMD entry.
             |                         // Thread B incremented the map count
             |                         // in page_add_new_anon_rmap(), but
             |                         // now the page is no longer mapped
             |                         // by a PMD entry (-> inconsistency).
             |                     }
             |               }
             |
             v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
      in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
              // We get here due to the above assumption (PMD entry is zero).
              do_huge_pmd_anonymous_page
                alloc_hugepage_vma
                  // Allocate a new transparent huge page here.
                ...
                __do_huge_pmd_anonymous_page
                  ...
                  spin_lock(&mm->page_table_lock)
                  ...
                  page_add_new_anon_rmap
                    // Here we increment the page's map count (starts at -1).
                    atomic_set(&page->_mapcount, 0)
                  set_pmd_at
                    // Here we set the page's PMD entry which will be cleared
                    // when Thread A calls pmd_clear_bad().
                  ...
                  spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read).  Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated.  However, Thread A
    does not synchronize on that lock.

====== end quote =======

[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: NUlrich Obergfell <uobergfe@redhat.com>
Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: NLarry Woodman <lwoodman@redhat.com>
Acked-by: NRik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>		[2.6.38+]
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1a5a9906

20 3月, 2012 1 次提交
- C
  mm: remove the second argument of k[un]map_atomic() · 9b04c5fe
  由 Cong Wang 提交于 11月 25, 2011
```
Signed-off-by: NCong Wang <amwang@redhat.com>
```
  9b04c5fe
13 1月, 2012 1 次提交

mm: unify remaining mem_cont, mem, etc. variable names to memcg · 72835c86

由 Johannes Weiner 提交于 1月 12, 2012

Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

72835c86

11 1月, 2012 1 次提交

mm: avoid livelock on !__GFP_FS allocations · f90ac398

由 Mel Gorman 提交于 1月 10, 2012

Colin Cross reported;

  Under the following conditions, __alloc_pages_slowpath can loop forever:
  gfp_mask & __GFP_WAIT is true
  gfp_mask & __GFP_FS is false
  reclaim and compaction make no progress
  order <= PAGE_ALLOC_COSTLY_ORDER

  These conditions happen very often during suspend and resume,
  when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
  allocations into __GFP_WAIT.

  The oom killer is not run because gfp_mask & __GFP_FS is false,
  but should_alloc_retry will always return true when order is less
  than PAGE_ALLOC_COSTLY_ORDER.

In his fix, he avoided retrying the allocation if reclaim made no progress
and __GFP_FS was not set.  The problem is that this would result in
GFP_NOIO allocations failing that previously succeeded which would be very
unfortunate.

The big difference between GFP_NOIO and suspend converting GFP_KERNEL to
behave like GFP_NOIO is that normally flushers will be cleaning pages and
kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay.
The same does not necessarily apply during suspend as the storage device
may be suspended.

This patch special cases the suspend case to fail the page allocation if
reclaim cannot make progress and adds some documentation on how
gfp_allowed_mask is currently used.  Failing allocations like this may
cause suspend to abort but that is better than a livelock.

[mgorman@suse.de: Rework fix to be suspend specific]
[rientjes@google.com: Move suspended device check to should_alloc_retry]
Reported-by: NColin Cross <ccross@android.com>
Signed-off-by: NMel Gorman <mgorman@suse.de>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f90ac398

01 11月, 2011 1 次提交

oom: fix race while temporarily setting current's oom_score_adj · 43362a49

由 David Rientjes 提交于 10月 31, 2011

test_set_oom_score_adj() was introduced in 72788c38 ("oom: replace
PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
current's oom_score_adj for ksm and swapoff without requiring an
additional per-process flag.

Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
then reinstate the previous value is racy since it's possible that
userspace can set the value to something else itself before the old value
is reinstated.  That results in userspace setting current's oom_score_adj
to a different value and then the kernel immediately setting it back to
its previous value without notification.

To fix this, a new compare_swap_oom_score_adj() function is introduced
with the same semantics as the compare and swap CAS instruction, or
CMPXCHG on x86.  It is used to reinstate the previous value of
oom_score_adj if and only if the present value is the same as the old
value.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

43362a49

31 10月, 2011 1 次提交

mm: delete various needless include <linux/module.h> · e25934a5

由 Paul Gortmaker 提交于 5月 26, 2011

There is nothing modular in these files, and no reason to drag
in all the 357 headers that module.h brings with it, since
it just slows down compiles.
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>

e25934a5

04 8月, 2011 1 次提交

mm: let swap use exceptional entries · a2c16d6c

由 Hugh Dickins 提交于 8月 03, 2011

If swap entries are to be stored along with struct page pointers in a
radix tree, they need to be distinguished as exceptional entries.

Most of the handling of swap entries in radix tree will be contained in
shmem.c, but a few functions in filemap.c's common code need to check
for their appearance: find_get_page(), find_lock_page(),
find_get_pages() and find_get_pages_contig().

So as not to slow their fast paths, tuck those checks inside the
existing checks for unlikely radix_tree_deref_slot(); except for
find_lock_page(), where it is an added test. And make it a BUG in
find_get_pages_tag(), which is not applied to tmpfs files.

A part of the reason for eliminating shmem_readpage() earlier, was to
minimize the places where common code would need to allow for swap
entries.

The swp_entry_t known to swapfile.c must be massaged into a slightly
different form when stored in the radix tree, just as it gets massaged
into a pte_t when stored in page tables.

In an i386 kernel this limits its information (type and page offset) to
30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
swapfile size of 128GB. Which is less than the 512GB we previously
allowed with X86_PAE (where the swap entry can occupy the entire upper
32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
there's not a new limitation on 64-bit (where swap filesize is already
limited to 16TB by a 32-bit page offset). Thirty areas of 128GB is
probably still enough swap for a 64GB 32-bit machine.

Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
enforce filesize limit in read_swap_header(), just as for ptes.
Signed-off-by: NHugh Dickins <hughd@google.com>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a2c16d6c

21 7月, 2011 1 次提交

fs: seq_file - add event counter to simplify poll() support · f1514638

由 Kay Sievers 提交于 7月 12, 2011

Moving the event counter into the dynamically allocated 'struc seq_file'
allows poll() support without the need to allocate its own tracking
structure.

All current users are switched over to use the new counter.

Requested-by: Andrew Morton akpm@linux-foundation.org
Acked-by: NNeilBrown <neilb@suse.de>
Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
Signed-off-by: NKay Sievers <kay.sievers@vrfy.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

f1514638

28 6月, 2011 1 次提交

mm: move shmem prototypes to shmem_fs.h · 072441e2

由 Hugh Dickins 提交于 6月 27, 2011

Before adding any more global entry points into shmem.c, gather such
prototypes into shmem_fs.h.  Remove mm's own declarations from swap.h,
but for now leave the ones in mm.h: because shmem_file_setup() and
shmem_zero_setup() are called from various places, and we should not
force other subsystems to update immediately.
Signed-off-by: NHugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

072441e2

25 5月, 2011 1 次提交

oom: replace PF_OOM_ORIGIN with toggling oom_score_adj · 72788c38

由 David Rientjes 提交于 5月 24, 2011

There's a kernel-wide shortage of per-process flags, so it's always
helpful to trim one when possible without incurring a significant penalty.
 It's even more important when you're planning on adding a per- process
flag yourself, which I plan to do shortly for transparent hugepages.

PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
tendency to allocate large amounts of memory and should be preferred for
killing over other tasks.  We'd rather immediately kill the task making
the errant syscall rather than penalizing an innocent task.

This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.

The process's old oom_score_adj is stored and then set to
OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN.  The old
value is then reinstated when the process should no longer be considered a
high priority for oom killing.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

72788c38

24 3月, 2011 1 次提交

memcg: fix ugly initialization of return value is in caller · 56039efa

由 KAMEZAWA Hiroyuki 提交于 3月 23, 2011

Remove initialization of vaiable in caller of memory cgroup function.
Actually, it's return value of memcg function but it's initialized in
caller.

Some memory cgroup uses following style to bring the result of start
function to the end function for avoiding races.

   mem_cgroup_start_A(&(*ptr))
   /* Something very complicated can happen here. */
   mem_cgroup_end_A(*ptr)

In some calls, *ptr should be initialized to NULL be caller.  But it's
ugly.  This patch fixes that *ptr is initialized by _start function.
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

56039efa

23 3月, 2011 27 次提交

sys_swapon: fix inode locking · 2130781e

由 Cesar Eduardo Barros 提交于 3月 22, 2011

A conflict between 52c50567 ("mm: swap: unlock swapfile inode mutex
before closing file on bad swapfiles") and 83ef99be ("sys_swapon:
remove did_down variable") caused a double unlock of the inode mutex
(once in bad_swap: before the filp_close, once at the end just before
returning).

The patch which added the extra unlock cleared did_down to avoid
unlocking twice, but the other patch removed the did_down variable.

To fix, set inode to NULL after the first unlock, since it will be used
after that point only for the final unlock.

While checking this patch, I found a path which could unlock without
locking, in case the same inode was added as a swapfile twice. To fix,
move the setting of the inode variable further down, to just before
claim_swapfile, which will lock the inode before doing anything else.

Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2130781e

mm: remove inline from scan_swap_map() · 24b8ff7c

由 Cesar Eduardo Barros 提交于 3月 22, 2011

scan_swap_map() is a large function (224 lines), with several loops and a
complex control flow involving several gotos.

Given all that, it is a bit silly that it is marked as inline.  The
compiler agrees with me: on a x86-64 compile, it did not inline the
function.

Remove the "inline" and let the compiler decide instead.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

24b8ff7c

sys_swapon: separate final enabling of the swapfile · 40531542

由 Cesar Eduardo Barros 提交于 3月 22, 2011

The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it again at sys_swapoff on failure of try_to_unuse(). Move
this code to a separate function, and use it both in sys_swapon and
sys_swapoff.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

40531542

sys_swapoff: change order to match sys_swapon · c6a2b64b

由 Cesar Eduardo Barros 提交于 3月 22, 2011

The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it again at sys_swapoff on failure of try_to_unuse(), except
for the order of the operations within the lock. Since the order should
not matter, arbitrarily change sys_swapoff to match sys_swapon, in
preparation to making both share the same code.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c6a2b64b

sys_swapon: move printk outside lock · c69dbfb8

由 Cesar Eduardo Barros 提交于 3月 22, 2011

The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it again at sys_swapoff on failure of try_to_unuse(). To be
able to make both share the same code, move the printk() call in the
middle of it to just after it.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c69dbfb8

sys_swapon: remove nr_good_pages variable · 9c8100ef

由 Cesar Eduardo Barros 提交于 3月 22, 2011

It still exists within setup_swap_map_and_extents(), but after it
nr_good_pages == p->pages.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9c8100ef

sys_swapon: simplify error flow in setup_swap_map_and_extents() · bdb8e3f6

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bdb8e3f6

sys_swapon: separate parsing of bad blocks and extents · 915d4d7b

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Move the code which parses the bad block list and the extents to a
separate function. Only code movement, no functional changes.

This change uses the fact that, after the success path, nr_good_pages ==
p->pages.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

915d4d7b

sys_swapon: call swap_cgroup_swapon() earlier · 1421ef3c

由 Cesar Eduardo Barros 提交于 3月 22, 2011

The call to swap_cgroup_swapon is in the middle of loading the swap map
and extents. As it only does memory allocation and does not depend on
the swapfile layout (map/extents), it can be called earlier (or later).

Move it to just after the allocation of swap_map, since it is
conceptually similar (allocates a map).
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1421ef3c

sys_swapon: simplify error flow in read_swap_header() · 38719025

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

38719025

sys_swapon: separate parsing of swapfile header · ca8bd38b

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Move the code which parses and checks the swapfile header (except for
the bad block list) to a separate function. Only code movement, no
functional changes.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ca8bd38b

sys_swapon: move setting of swapfilepages near use · 5de771e4

由 Cesar Eduardo Barros 提交于 3月 22, 2011

There is no reason I can see to read inode->i_size long before it is
needed. Move its read to just before it is needed, to reduce the
variable lifetime.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NJesper Juhl <jj@chaosbits.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5de771e4

sys_swapon: simplify error flow in claim_swapfile() · 87ade72a

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

87ade72a

sys_swapon: separate bdev claim and inode lock · 4d0e1e10

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Move the code which claims the bdev (S_ISBLK) or locks the inode
(S_ISREG) to a separate function. Only code movement, no functional
changes.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4d0e1e10

sys_swapon: use a single error label · bd69010b

由 Cesar Eduardo Barros 提交于 3月 22, 2011

sys_swapon currently has two error labels, bad_swap and bad_swap_2.
bad_swap does the same as bad_swap_2 plus destroy_swap_extents() and
swap_cgroup_swapoff(); both are noops in the places where bad_swap_2 is
jumped to. With a single extra test for inode (matching the one in the
S_ISREG case below), all the error paths in the function can go to
bad_swap.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bd69010b

sys_swapon: do only cleanup in the cleanup blocks · 9b01c350

由 Cesar Eduardo Barros 提交于 3月 22, 2011

The only way error is 0 in the cleanup blocks is when the function is
returning successfully. In this case, the cleanup blocks were setting
S_SWAPFILE in the S_ISREG case. But this is not a cleanup.

Move the setting of S_SWAPFILE to just before the "goto out;" to make
this more clear. At this point, we do not need to test for inode because
it will never be NULL.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9b01c350

sys_swapon: remove bdev variable · f2090d2d

由 Cesar Eduardo Barros 提交于 3月 22, 2011

The bdev variable is always equivalent to (S_ISBLK(inode->i_mode) ?
p->bdev : NULL), as long as it being set is moved to a bit earlier. Use
this fact to remove the bdev variable.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f2090d2d

sys_swapon: move setting of error nearer use · 7de7fb6b

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Move the setting of the error variable nearer the goto in a few places.

Avoids calling PTR_ERR() if not IS_ERR() in two places, and makes the
error condition more explicit in two other places.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NJesper Juhl <jj@chaosbits.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7de7fb6b

sys_swapon: remove did_down variable · 83ef99be

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Since mutex_lock(&inode->i_mutex) is called just after setting inode,
did_down is always equivalent to (inode && S_ISREG(inode->i_mode)).

Use this fact to remove the did_down variable.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

83ef99be

sys_swapon: remove initial value of name variable · 28b36bd7

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Now there is nothing which jumps to the cleanup blocks before the name
variable is set. There is no need to set it initially to NULL anymore.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

28b36bd7

sys_swapon: simplify error flow in alloc_swap_info() · 730c0581

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

730c0581

sys_swapon: simplify error return from swap_info allocation · 2542e513

由 Cesar Eduardo Barros 提交于 3月 22, 2011

At this point in sys_swapon, there is nothing to free. Return directly
instead of jumping to the cleanup block at the end of the function.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2542e513

sys_swapon: separate swap_info allocation · 53cbb243

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Move the swap_info allocation to its own function. Only code movement,
no functional changes.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

53cbb243

sys_swapon: do not depend on "type" after allocation · e8e6c2ec

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Within sys_swapon, after the swap_info entry has been allocated, we
always have type == p->type and swap_info[type] == p. Use this fact to
reduce the dependency on the "type" local variable within the function,
as a preparation to move the allocation of the swap_info entry to a
separate function.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e8e6c2ec

sys_swapon: remove changelog from function comment · 80b0df12

由 Cesar Eduardo Barros 提交于 3月 22, 2011

Changelogs belong in the git history instead of in the source code.

Also, "The swapon system call" is redundant with
"SYSCALL_DEFINE2(swapon, ...)".
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NJesper Juhl <jj@chaosbits.net>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
[ Gaah. That's a _historical_ comment. But the patch-series depends on removal ]
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

80b0df12

sys_swapon: use vzalloc() instead of vmalloc/memset · 803d0c83

由 Cesar Eduardo Barros 提交于 3月 22, 2011

This patch series refactors the sys_swapon function.

sys_swapon is currently a very large function, with 313 lines (more than
12 25-line screens), which can make it a bit hard to read. This patch
series reduces this size by half, by extracting large chunks of related
code to new helper functions.

One of these chunks of code was nearly identical to the part of
sys_swapoff which is used in case of a failure return from
try_to_unuse(), so this patch series also makes both share the same
code.

As a side effect of all this refactoring, the compiled code gets a bit
smaller (from v1 of this patch series):

   text       data        bss        dec        hex    filename
  14012        944        276      15232       3b80    mm/swapfile.o.before
  13941        944        276      15161       3b39    mm/swapfile.o.after

This patch:

Use vzalloc() instead of vmalloc/memset.
Signed-off-by: NCesar Eduardo Barros <cesarb@cesarb.net>
Tested-by: NEric B Munson <emunson@mgebm.net>
Acked-by: NEric B Munson <emunson@mgebm.net>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Reviewed-by: NJesper Juhl <jj@chaosbits.net>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

803d0c83

mm: swap: unlock swapfile inode mutex before closing file on bad swapfiles · 52c50567

由 Mel Gorman 提交于 3月 22, 2011

If an administrator tries to swapon a file backed by NFS, the inode mutex is
taken (as it is for any swapfile) but later identified to be a bad swapfile
due to the lack of bmap and tries to cleanup. During cleanup, an attempt is
made to close the file but with inode->i_mutex still held. Closing an NFS
file syncs it which tries to acquire the inode mutex leading to deadlock. If
lockdep is enabled the following appears on the console;

    =============================================
    [ INFO: possible recursive locking detected ]
    2.6.38-rc8-autobuild #1
    ---------------------------------------------
    swapon/2192 is trying to acquire lock:
     (&sb->s_type->i_mutex_key#13){+.+.+.}, at: vfs_fsync_range+0x47/0x7c

    but task is already holding lock:
     (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7

    other info that might help us debug this:
    1 lock held by swapon/2192:
     #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7

    stack backtrace:
    Pid: 2192, comm: swapon Not tainted 2.6.38-rc8-autobuild #1
    Call Trace:
        __lock_acquire+0x2eb/0x1623
        find_get_pages_tag+0x14a/0x174
        pagevec_lookup_tag+0x25/0x2e
        vfs_fsync_range+0x47/0x7c
        lock_acquire+0xd3/0x100
        vfs_fsync_range+0x47/0x7c
        nfs_flush_one+0x0/0xdf [nfs]
        mutex_lock_nested+0x40/0x2b1
        vfs_fsync_range+0x47/0x7c
        vfs_fsync_range+0x47/0x7c
        vfs_fsync+0x1c/0x1e
        nfs_file_flush+0x64/0x69 [nfs]
        filp_close+0x43/0x72
        sys_swapon+0xa39/0xae7
        sysret_check+0x2e/0x69
        system_call_fastpath+0x16/0x1b

This patch releases the mutex if its held before calling filep_close()
so swapon fails as expected without deadlock when the swapfile is backed
by NFS.  If accepted for 2.6.39, it should also be considered a -stable
candidate for 2.6.38 and 2.6.37.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: <stable@kernel.org>		[2.6.37+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

52c50567

10 3月, 2011 1 次提交

block: remove per-queue plugging · 7eaceacc

由 Jens Axboe 提交于 3月 10, 2011

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

7eaceacc

25 2月, 2011 1 次提交

mm: fix refcounting in swapon · 8074b26f

由 Miklos Szeredi 提交于 2月 24, 2011

Grab a reference to bdev before calling blkdev_get(), which expects
the refcount to be already incremented and either returns success or
decrements the refcount and returns an error.

The bug was introduced by e525fd89 (block: make blkdev_get/put()
handle exclusive access), which didn't take into account this behavior
of blkdev_get().
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8074b26f

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功