1. 05 3月, 2008 4 次提交
  2. 04 3月, 2008 12 次提交
  3. 24 2月, 2008 4 次提交
  4. 20 2月, 2008 1 次提交
    • L
      Revert "SLUB: Alternate fast paths using cmpxchg_local" · 00e962c5
      Linus Torvalds 提交于
      This reverts commit 1f84260c, which is
      suspected to be the reason for some very occasional and hard-to-trigger
      crashes that usually look related to memory allocation (mostly reported
      in networking, but since that's generally the most common source of
      shortlived allocations - and allocations in interrupt contexts - that in
      itself is not a big clue).
      
      See for example
      	http://bugzilla.kernel.org/show_bug.cgi?id=9973
      	http://lkml.org/lkml/2008/2/19/278
      etc.
      
      One promising suspicion for what the root cause of bug is (which also
      explains why it's so hard to trigger in practice) came from Eric
      Dumazet:
      
         "I wonder how SLUB_FASTPATH is supposed to work, since it is affected
          by a classical ABA problem of lockless algo.
      
          cmpxchg_local(&c->freelist, object, object[c->offset]) can succeed,
          while an interrupt came (on this cpu), and several allocations were
          done, and one free was performed at the end of this interruption, so
          'object' was recycled.
      
          c->freelist can then contain the previous value (object), but
          object[c->offset] was changed by IRQ.
      
          We then put back in freelist an already allocated object."
      
      but another reason for the revert is simply that everybody agrees that
      this code was the main suspect just by virtue of the pattern of oopses.
      
      Cc: Torsten Kaiser <just.for.lkml@googlemail.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00e962c5
  5. 15 2月, 2008 9 次提交
  6. 14 2月, 2008 2 次提交
  7. 12 2月, 2008 3 次提交
    • K
      mempolicy: silently restrict nodemask to allowed nodes · 31f1de46
      KOSAKI Motohiro 提交于
      Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
      presence of memoryless nodes.  This patch attempts to fix that problem.
      
      Some background:
      
      numactl --interleave=all calls set_mempolicy(2) with a fully populated
      [out to MAXNUMNODES] nodemask.  set_mempolicy() [in do_set_mempolicy()]
      calls contextualize_policy() which requires that the nodemask be a
      subset of the current task's mems_allowed; else EINVAL will be returned.
      
      A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]
      i.e., nodes with memory.  So, a fully populated nodemask will be
      declared invalid if it includes memoryless nodes.
      
        NOTE:  the same thing will occur when running in a cpuset
               with restricted mem_allowed--for the same reason:
               node mask contains dis-allowed nodes.
      
      mbind(2), on the other hand, just masks off any nodes in the nodemask
      that are not included in the caller's mems_allowed.
      
      In each case [mbind() and set_mempolicy()], mpol_check_policy() will
      complain [again, resulting in EINVAL] if the nodemask contains any
      memoryless nodes.  This is somewhat redundant as mpol_new() will remove
      memoryless nodes for interleave policy, as will bind_zonelist()--called
      by mpol_new() for BIND policy.
      
      Proposed fix:
      
      1) modify contextualize_policy logic to:
         a) remember whether the incoming node mask is empty.
         b) if not, restrict the nodemask to allowed nodes, as is
            currently done in-line for mbind().  This guarantees
            that the resulting mask includes only nodes with memory.
      
            NOTE:  this is a [benign, IMO] change in behavior for
                   set_mempolicy().  Dis-allowed nodes will be
                   silently ignored, rather than returning an error.
      
         c) fold this code into mpol_check_policy(), replace 2 calls to
            contextualize_policy() to call mpol_check_policy() directly
            and remove contextualize_policy().
      
      2) In existing mpol_check_policy() logic, after "contextualization":
         a) MPOL_DEFAULT:  require that in coming mask "was_empty"
         b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
            contains at least one node.
         c) add a case for MPOL_PREFERRED:  if in coming was not empty
            and resulting mask IS empty, user specified invalid nodes.
            Return EINVAL.
         c) remove the now redundant check for memoryless nodes
      
      3) remove the now redundant masking of policy nodes for interleave
         policy from mpol_new().
      
      4) Now that mpol_check_policy() contextualizes the nodemask, remove
         the in-line nodes_and() from sys_mbind().  I believe that this
         restores mbind() to the behavior before the memoryless-nodes
         patch series.  E.g., we'll no longer treat an invalid nodemask
         with MPOL_PREFERRED as local allocation.
      
      [ Patch history:
      
        v1 -> v2:
         - Communicate whether or not incoming node mask was empty to
           mpol_check_policy() for better error checking.
         - As suggested by David Rientjes, remove the now unused
           cpuset_nodes_subset_current_mems_allowed() from cpuset.h
      
        v2 -> v3:
         - As suggested by Kosaki Motohito, fold the "contextualization"
           of policy nodemask into mpol_check_policy().  Looks a little
           cleaner. ]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31f1de46
    • J
      Be more robust about bad arguments in get_user_pages() · 900cf086
      Jonathan Corbet 提交于
      So I spent a while pounding my head against my monitor trying to figure
      out the vmsplice() vulnerability - how could a failure to check for
      *read* access turn into a root exploit? It turns out that it's a buffer
      overflow problem which is made easy by the way get_user_pages() is
      coded.
      
      In particular, "len" is a signed int, and it is only checked at the
      *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
      will execute once and decrement len to -1.  At that point, the loop will
      proceed until the next invalid address is found; in the process, it will
      likely overflow the pages array passed in to get_user_pages().
      
      I think that, if get_user_pages() has been asked to grab zero pages,
      that's what it should do.  Thus this patch; it is, among other things,
      enough to block the (already fixed) root exploit and any others which
      might be lurking in similar code.  I also think that the number of pages
      should be unsigned, but changing the prototype of this function probably
      requires some more careful review.
      Signed-off-by: NJonathan Corbet <corbet@lwn.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      900cf086
    • M
      slab: avoid double initialization & do initialization in 1 place · 7cc718d5
      Marcin Slusarz 提交于
      - alloc_slabmgmt: initialize all slab fields in 1 place
      - slab->nodeid was initialized twice: in alloc_slabmgmt
        and immediately after it in cache_grow
      Signed-off-by: NMarcin Slusarz <marcin.slusarz@gmail.com>
      CC: Christoph Lameter <clameter@sgi.com>
      Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      7cc718d5
  8. 10 2月, 2008 1 次提交
  9. 09 2月, 2008 4 次提交
    • N
      mm: special mapping nopage · b1d0e4f5
      Nick Piggin 提交于
      Convert special mapping install from nopage to fault.
      
      Because the "vm_file" is NULL for the special mapping, the generic VM
      code has messed up "vm_pgoff" thinking that it's an anonymous mapping
      and the offset does't matter.  For that reason, we need to undo the
      vm_pgoff offset that got added into vmf->pgoff.
      
      [ We _really_ should clean that up - either by making this whole special
        mapping code just use a real file entry rather than that ugly array of
        "struct page" pointers, or by just making the VM code realize that
        even if vm_file is NULL it may not be a regular anonymous mmap.
      							 - Linus ]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: linux-mm@kvack.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1d0e4f5
    • M
      CONFIG_HIGHPTE vs. sub-page page tables. · 2f569afd
      Martin Schwidefsky 提交于
      Background: I've implemented 1K/2K page tables for s390.  These sub-page
      page tables are required to properly support the s390 virtualization
      instruction with KVM.  The SIE instruction requires that the page tables
      have 256 page table entries (pte) followed by 256 page status table entries
      (pgste).  The pgstes are only required if the process is using the SIE
      instruction.  The pgstes are updated by the hardware and by the hypervisor
      for a number of reasons, one of them is dirty and reference bit tracking.
      To avoid wasting memory the standard pte table allocation should return
      1K/2K (31/64 bit) and 2K/4K if the process is using SIE.
      
      Problem: Page size on s390 is 4K, page table size is 1K or 2K.  That means
      the s390 version for pte_alloc_one cannot return a pointer to a struct
      page.  Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
      cannot return a pointer to a pte either, since that would require more than
      32 bit for the return value of pte_alloc_one (and the pte * would not be
      accessible since its not kmapped).
      
      Solution: The only solution I found to this dilemma is a new typedef: a
      pgtable_t.  For s390 pgtable_t will be a (pte *) - to be introduced with a
      later patch.  For everybody else it will be a (struct page *).  The
      additional problem with the initialization of the ptl lock and the
      NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
      a destructor pgtable_page_dtor.  The page table allocation and free
      functions need to call these two whenever a page table page is allocated or
      freed.  pmd_populate will get a pgtable_t instead of a struct page pointer.
       To get the pgtable_t back from a pmd entry that has been installed with
      pmd_populate a new function pmd_pgtable is added.  It replaces the pmd_page
      call in free_pte_range and apply_to_pte_range.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f569afd
    • A
      mount-options-fix-tmpfs-fix · b76db735
      Andrew Morton 提交于
      Documentation/SubmitCheckist, please.
      
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b76db735
    • A
      mount options: fix tmpfs · 680d794b
      akpm@linux-foundation.org 提交于
      Add .show_options super operation to tmpfs.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      680d794b