1. 17 July 2007, 10 commits
    • hugetlb: fix race in alloc_fresh_huge_page() · f96efd58
      Committed by Joe Jin
      That static `nid' index needs locking.  Without it we can end up calling
      alloc_pages_node() with an illegal node ID and the kernel crashes.
      Acked-by: Gurudas Pai <gurudas.pai@oracle.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f96efd58
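      A minimal sketch of the pattern this fix describes, assuming the usual
      nodemask helpers; the names are illustrative, not the literal
      mm/hugetlb.c diff:

        static DEFINE_SPINLOCK(nid_lock);

        static struct page *alloc_fresh_huge_page_sketch(gfp_t gfp, int order)
        {
            static int nid;     /* round-robin node index shared by all callers */
            int this_nid;

            /* Unlocked, two callers could both advance `nid` past the
             * online map and hand an illegal node ID to alloc_pages_node(). */
            spin_lock(&nid_lock);
            this_nid = nid;
            nid = next_node(nid, node_online_map);
            if (nid == MAX_NUMNODES)
                nid = first_node(node_online_map);
            spin_unlock(&nid_lock);

            return alloc_pages_node(this_nid, gfp, order);
        }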
    • vmscan: fix comments related to shrink_list() · 2706a1b8
      Committed by Anderson Briglia
      Fix the shrink_list name in comments in some files under the mm/ directory.
      Signed-off-by: Anderson Briglia <anderson.briglia@indt.org.br>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2706a1b8
    • slob: improved alignment handling · 55394849
      Committed by Nick Piggin
      Remove the core slob allocator's minimum alignment restrictions, and instead
      introduce the alignment restrictions at the slab API layer.  This lets us heed
      the ARCH_KMALLOC/SLAB_MINALIGN directives, and also use __alignof__ (unsigned
      long) for the default alignment (which should allow relaxed alignment
      architectures to take better advantage of SLOB's small minimum alignment).
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55394849
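      A hedged sketch of the layering the commit describes: the core allocator
      takes an explicit alignment and the slab API layer computes it.
      slob_alloc() here stands in for the internal entry point:

        #ifndef ARCH_KMALLOC_MINALIGN
        #define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long)
        #endif

        void *kmalloc_sketch(size_t size, gfp_t gfp)
        {
            /* The minimum alignment now lives here, at the API layer;
             * the core slob allocator only honours what it is passed. */
            int align = max_t(int, ARCH_KMALLOC_MINALIGN,
                              __alignof__(unsigned long));

            return slob_alloc(size, gfp, align);
        }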
    • slob: remove bigblock tracking · d87a133f
      Committed by Nick Piggin
      Remove the bigblock lists in favour of using compound pages and going directly
      to the page allocator.  Allocation size is stored in page->private, which also
      makes ksize more accurate than it previously was.
      
      Saves ~0.5K of code, and 12-24 bytes of overhead per >= PAGE_SIZE allocation.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d87a133f
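      A sketch of the >= PAGE_SIZE path the commit describes, under the
      assumption that compound pages and page->private are used as stated;
      the function names are hypothetical:

        static void *slob_big_alloc(size_t size, gfp_t gfp)
        {
            int order = get_order(size);
            struct page *page = alloc_pages(gfp | __GFP_COMP, order);

            if (!page)
                return NULL;
            page->private = size;   /* ksize() can now report it exactly */
            return page_address(page);
        }

        static size_t slob_big_ksize(const void *block)
        {
            return virt_to_page(block)->private;
        }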
    • slob: rework freelist handling · 95b35127
      Committed by Nick Piggin
      Improve slob by turning the freelist into a list of pages using struct page
      fields; each page then has a singly linked freelist of slob blocks, via a
      pointer in the struct page.
      
      - The first benefit is that the slob freelists can be indexed by a smaller
        type (2 bytes, if the PAGE_SIZE is reasonable).
      
      - Next is that freeing is much quicker because it does not have to traverse
        the entire freelist. Allocation can be slightly faster too, because we can
        skip almost-full freelist pages completely.
      
      - Slob pages are then freed immediately when they become empty, rather than
        having a periodic timer try to free them. This improves both efficiency
        and memory consumption.
      
      Then, we don't encode separate size and next fields into each slob block;
      rather, we use the sign bit to distinguish between "size" and "next". Size-1
      blocks contain a "next" offset, and others contain the "size" in the
      first unit and the "next" in the second unit.
      
      - This allows the minimum slob allocation alignment to go from 8 bytes to
        2 bytes on 32-bit and from 12 bytes to 2 bytes on 64-bit. In practice,
        it is best to align them to word size; however, some architectures
        (e.g. cris) could gain space savings from turning off this extra
        alignment.
      
      Then, make kmalloc use its own slob_block at the front of the allocation
      in order to encode allocation size, rather than rely on not overwriting
      slob's existing header block.
      
      - This reduces kmalloc allocation overhead similarly to alignment reductions.
      
      - Decouples kmalloc layer from the slob allocator.
      
      Then, add a page flag specific to slob pages.
      
      - This means kfree of a page aligned slob block doesn't have to traverse
        the bigblock list.
      
      I would get benchmarks, but my test box's network doesn't come up with
      slob before this patch. I think something is timing out. Anyway, things
      are faster after the patch.
      
      Code size goes up about 1K, however dynamic memory usage _should_ be
      lower even on relatively small memory systems.
      
      A future todo item is to restore the cyclic free-list search, rather than
      always beginning at the start.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95b35127
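      A simplified sketch of the sign-bit encoding described above; the 2-byte
      unit type and the helper names are assumptions, and the real code stores
      page-relative offsets rather than raw values:

        typedef short slobidx_t;    /* 2 bytes is enough for in-page offsets */

        static void set_slob(slobidx_t *s, slobidx_t size, slobidx_t next)
        {
            if (size > 1) {
                s[0] = size;    /* positive first unit: this is a "size" */
                s[1] = next;    /* the link lives in the second unit */
            } else {
                s[0] = -next;   /* negative unit: a size-1 block whose
                                 * value is the (negated) "next" offset */
            }
        }

        static slobidx_t slob_units(const slobidx_t *s)
        {
            return s[0] > 0 ? s[0] : 1;
        }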
    • MM: alloc_large_system_hash() can free some memory for non power-of-two bucketsize · 1037b83b
      Committed by Eric Dumazet
      alloc_large_system_hash() is called at boot time to allocate space for
      several large hash tables.
      
      Recently, the TCP hash table was changed and its bucketsize is no longer
      a power of two.
      
      On most setups, alloc_large_system_hash() allocates one big page (order >
      0) with __get_free_pages(GFP_ATOMIC, order).  This single high-order page
      has a power-of-two size, bigger than the needed size.

      We can free all pages that won't be used by the hash table.
      
      On a 1GB i386 machine, this patch saves 128 KB of LOWMEM memory.
      
      TCP established hash table entries: 32768 (order: 6, 393216 bytes)
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1037b83b
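      A sketch of the trick, assuming split_page() to make the tail pages
      independently freeable; the variable names are illustrative:

        unsigned long size = nentries * bucketsize;     /* actual need */
        int order = get_order(size);
        unsigned long table = __get_free_pages(GFP_ATOMIC, order);

        if (table) {
            unsigned long alloc_end = table + (PAGE_SIZE << order);
            unsigned long used = table + PAGE_ALIGN(size);

            /* Turn the high-order block into independent order-0 pages,
             * then return every page past the table's real end. */
            split_page(virt_to_page((void *)table), order);
            while (used < alloc_end) {
                free_page(used);
                used += PAGE_SIZE;
            }
        }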
    • Make /proc/slabinfo use seq_list_xxx helpers · b92151ba
      Committed by Pavel Emelianov
      This entry prints a header in the .start callback.  This is OK, but the
      more elegant solution would be to move it into the .show callback and use
      seq_list_start_head() in the .start one.

      I have left it as is in order to make the patch just switch to the new
      API and nothing more.
      
      [adobriyan@sw.ru: Wrong pointer was used as kmem_cache pointer]
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b92151ba
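      A sketch of what the seq_list_xxx conversion typically looks like;
      cache_chain, cache_chain_mutex and print_slabinfo_header() stand in
      for the real slab globals:

        static void *s_start(struct seq_file *m, loff_t *pos)
        {
            mutex_lock(&cache_chain_mutex);
            if (!*pos)
                print_slabinfo_header(m);   /* header still emitted in .start */
            return seq_list_start(&cache_chain, *pos);
        }

        static void *s_next(struct seq_file *m, void *p, loff_t *pos)
        {
            return seq_list_next(p, &cache_chain, pos);
        }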
    • MM: use DIV_ROUND_UP() in mm/memory.c · 68e116a3
      Committed by Rolf Eike Beer
      Replace a hand-coded version of DIV_ROUND_UP().
      Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68e116a3
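      For reference, the macro from <linux/kernel.h> and the kind of open-coded
      form it replaces (a representative example, not the exact mm/memory.c
      expression):

        #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))

        /* before: (size + PMD_SIZE - 1) / PMD_SIZE */
        /* after:  DIV_ROUND_UP(size, PMD_SIZE)     */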
    • hugetlb: remove unnecessary nid initialization · 31a5c6e4
      Committed by Nishanth Aravamudan
      nid is initialized to numa_node_id() but will either be overwritten in
      the loop or not used in the conditional. So remove the initialization.
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31a5c6e4
    • change zonelist order: zonelist order selection logic · f0c0b2b8
      Committed by KAMEZAWA Hiroyuki
      Make the zonelist creation policy selectable from a sysctl/boot option (v6).

      This patch makes NUMA's zonelist (of pgdat) order selectable.
      The available orders are Default (automatic), Node-based, and Zone-based.
      
      [Default Order]
      The kernel selects Node-based or Zone-based order automatically.
      
      [Node-based Order]
      This policy treats the locality of memory as the most important parameter.
      Zonelist order is created by each zone's locality. This means lower zones
      (ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
      IOW. ZONE_DMA will be in the middle of zonelist.
      current 2.6.21 kernel uses this.
      
      Pros.
       * A user can expect local memory as much as possible.
      Cons.
       * A lower zone will be exhausted before a higher zone. This may cause
         OOM_KILL.

      Maybe suitable if ZONE_DMA is relatively big, you never see OOM_KILL
      because of ZONE_DMA exhaustion, and you need the best locality.
      
      (example)
      assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
      
      *node(0)'s memory allocation order:
      
       node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.
      
      *node(1)'s memory allocation order:
      
       node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
      
      [Zone-based order]
      This policy treats the zone type as the most important parameter.
      The zonelist order is created from the zone-type order. This means a lower
      zone is never used before the higher zones are exhausted.
      IOW, ZONE_DMA will always be at the tail of the zonelist.
      
      Pros.
       * OOM_KILL (because of a lower zone) occurs only if all zones are
         exhausted.
      Cons.
       * Memory locality may not be the best.
      
      (example)
      assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
      
      *node(0)'s memory allocation order:
      
       node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.
      
      *node(1)'s memory allocation order:
      
       node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
      
      The boot option "numa_zonelist_order=" and a proc/sysctl entry are
      supported.

      command:
      %echo N > /proc/sys/vm/numa_zonelist_order

      will rebuild the zonelist in Node-based order.

      command:
      %echo Z > /proc/sys/vm/numa_zonelist_order

      will rebuild the zonelist in Zone-based order.
      
      Thanks to Lee Schermerhorn, who gave me much help and code.
      
      [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0c0b2b8
  2. 12 July 2007, 1 commit
    • security: Protection for exploiting null dereference using mmap · ed032189
      Committed by Eric Paris
      Add a new security check on mmap operations to see if the user is attempting
      to mmap into the low area of the address space.  The amount of space
      protected is indicated by the new proc tunable /proc/sys/vm/mmap_min_addr,
      which defaults to 0, preserving existing behavior.
      
      This patch uses a new SELinux security class, "memprotect."  Policy already
      contains a number of allow rules like a_t self:process * (unconfined_t being
      one of them), which mean that putting this check in the process class (its
      best current fit) would make it useless, as all user processes, which we also
      want to protect against, would be allowed.  Giving the new class the name
      memprotect will also make it possible for us to move some of the other
      memory-protection permissions out of 'process' and into the new class the
      next time we bump the policy version number (which I also think is a good
      idea for the future).
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Acked-by: Chris Wright <chrisw@sous-sol.org>
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: James Morris <jmorris@namei.org>
      ed032189
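      A sketch of the idea, not the actual SELinux hook: deny mappings below
      the tunable floor.  Using CAP_SYS_RAWIO as the override capability is an
      assumption of this sketch:

        unsigned long mmap_min_addr;    /* /proc/sys/vm/mmap_min_addr */

        static int mmap_low_check_sketch(unsigned long addr)
        {
            /* the default mmap_min_addr == 0 preserves the old behaviour */
            if (addr < mmap_min_addr && !capable(CAP_SYS_RAWIO))
                return -EACCES;
            return 0;
        }

      Raising the floor from userspace would then look like:

      %echo 65536 > /proc/sys/vm/mmap_min_addr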
  3. 10 July 2007, 3 commits
    • xip sendfile removal · d054fe3d
      Committed by Carsten Otte
      This patch removes xip_file_sendfile, the sendfile implementation for
      xip, without replacement. Those customers that use xip on s390 are not
      using sendfile() as far as we know, and s390 is so far the only platform
      this could potentially be used on.
      Having sendfile is not a popular feature for execute-in-place file
      systems; however, we have a working implementation of splice_read() based
      on fs/splice.c if anyone asks for it.
      At this point in time, it does not seem preferable to merge
      splice_read() for xip because it causes extra maintenance effort due to
      code duplication, and it requires a struct page behind the xip memory
      segment. We'd like to get rid of that in favor of supporting flash-based
      embedded platforms (Monta Vista work) soon.
      Signed-off-by: Carsten Otte <cotte@de.ibm.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      d054fe3d
    • shmem: convert to using splice instead of sendfile() · ae976416
      Committed by Hugh Dickins
      Remove shmem_file_sendfile and resurrect shmem_readpage, as used by tmpfs
      to support loop and sendfile in 2.4 and 2.5.  Now tmpfs can support splice,
      loop and sendfile in the simplest way, using generic_file_splice_read and
      generic_file_splice_write (with the aid of shmem_prepare_write).
      
      We could make some efficiency tweaks later, if there's a real need;
      but this is stable and works well as is.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      ae976416
    • sendfile: kill generic_file_sendfile() · 0452a4e5
      Committed by Jens Axboe
      It's no longer used.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      0452a4e5
  4. 09 July 2007, 1 commit
  5. 07 July 2007, 2 commits
  6. 06 July 2007, 1 commit
    • Fix slab redzone alignment · 87a927c7
      Committed by David Woodhouse
      Commit b46b8f19 fixed a couple of bugs
      by switching the redzone to 64 bits. Unfortunately, it neglected to
      ensure that the _second_ redzone, after the slab object, is aligned
      correctly. This caused illegal instruction faults on sparc32, which for
      some reason not entirely clear to me are not trapped and fixed up.
      
      Two things need to be done to fix this:
        - increase the object size, rounding up to alignof(long long) so
          that the second redzone can be aligned correctly.
        - If SLAB_STORE_USER is set but alignof(long long)==8, allow a
          full 64 bits of space for the user word at the end of the buffer,
          even though we may not _use_ the whole 64 bits.
      
      This patch should be a no-op on any 64-bit architecture or any 32-bit
      architecture where alignof(long long) == 4. Of the others, it's tested
      on ppc32 by myself and a very similar patch was tested on sparc32 by
      Mark Fortescue, who reported the new problem.
      
      Also, fix the conditions for FORCED_DEBUG, which hadn't been adjusted to
      the new sizes. Again noticed by Mark.
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87a927c7
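      A sketch of the two fixes in the commit's own terms, assuming the ALIGN()
      helper; this is not the literal mm/slab.c diff:

        if (flags & SLAB_RED_ZONE) {
            /* round the object up so the redzone placed *after* it
             * is 64-bit aligned even on 32-bit machines */
            size = ALIGN(size, __alignof__(unsigned long long));
            size += 2 * sizeof(unsigned long long);   /* redzones either side */
        }
        if (flags & SLAB_STORE_USER)
            /* reserve a full 64 bits for the user word, even though
             * we may not use all of them */
            size += sizeof(unsigned long long);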
  7. 04 July 2007, 1 commit
  8. 02 July 2007, 1 commit
  9. 29 June 2007, 1 commit
    • mm: kill validate_anon_vma to avoid mapcount BUG · 30acbaba
      Committed by Hugh Dickins
      validate_anon_vma gave a useful check on the integrity of the anon_vma list
      when Andrea was developing obj rmap; but it was not enabled in SLES9
      itself, nor in mainline, until Nick changed commented-out RMAP_DEBUG to
      configurable CONFIG_DEBUG_VM in 2.6.17.  Now Petr Vandrovec reports that
      its BUG_ON(mapcount > 100000) can easily crash a CONFIG_DEBUG_VM=y system.
      
      That limit was just an arbitrary number to protect against an infinite
      loop.  We could raise it to something enormous (depending on sizeof struct
      vma and size of memory?); but I rather think validate_anon_vma has outlived
      its usefulness, and is better just removed - which gives a magnificent
      performance boost to anything like Petr's test program ;)
      
      Of course, a very long anon_vma list is bad news for preemption latency,
      and I believe there has been one recent report of such: let's not forget
      that, but validate_anon_vma only makes it worse not better.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Petr Vandrovec <petr@vmware.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Cc: Andrea Arcangeli <andrea@suse.de>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      30acbaba
  10. 24 June 2007, 1 commit
  11. 22 June 2007, 1 commit
  12. 17 June 2007, 3 commits
  13. 16 June 2007, 1 commit
    • mm: Fix memory/cpu hotplug section mismatch and oops. · d09c6b80
      Committed by Paul Mundt
      When building with memory hotplug enabled and cpu hotplug disabled, we
      end up with the following section mismatch:
      
      WARNING: mm/built-in.o(.text+0x4e58): Section mismatch: reference to
      .init.text: (between 'free_area_init_node' and '__build_all_zonelists')
      
      This happens as a result of:
      
              -> free_area_init_node()
                -> free_area_init_core()
                  -> zone_pcp_init() <-- all __meminit up to this point
                    -> zone_batchsize() <-- marked as __cpuinit
      
      This happens because CONFIG_HOTPLUG_CPU=n sets __cpuinit to __init, but
      CONFIG_MEMORY_HOTPLUG=y unsets __meminit.
      
      Changing zone_batchsize() to __devinit fixes this.
      
      __devinit is the only thing that is common between CONFIG_HOTPLUG_CPU=y and
      CONFIG_MEMORY_HOTPLUG=y. In the long run, perhaps this should be moved to
      another section identifier completely. Without this, memory hot-add
      of offline nodes (via hotadd_new_pgdat()) will oops if CPU hotplug is
      not also enabled.
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      --
      
       mm/page_alloc.c |    2 +-
       1 file changed, 1 insertion(+), 1 deletion(-)
      d09c6b80
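      The diffstat corresponds to a one-line annotation change, roughly:

        -static int __cpuinit zone_batchsize(struct zone *zone)
        +static int __devinit zone_batchsize(struct zone *zone)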
  14. 09 June 2007, 4 commits
  15. 01 June 2007, 3 commits
  16. 31 May 2007, 2 commits
  17. 24 May 2007, 3 commits
  18. 22 May 2007, 1 commit
    • Detach sched.h from mm.h · e8edc6e0
      Committed by Alexey Dobriyan
      The first thing mm.h does is include sched.h, solely for the can_do_mlock()
      inline function, which dereferences "current".  By dealing with
      can_do_mlock(), mm.h can be detached from sched.h, which is good.  See
      below for why.
      
      This patch
      a) removes the unconditional inclusion of sched.h from mm.h
      b) makes can_do_mlock() a normal function in mm/mlock.c
      c) exports can_do_mlock() so as not to break compilation
      d) adds sched.h inclusions back to files that were getting it indirectly.
      e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that
         were getting them indirectly
      
      Net result is:
      a) mm.h users would get less code to open, read, preprocess, parse, ... if
         they don't need sched.h
      b) sched.h stops being dependency for significant number of files:
         on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
         after patch it's only 3744 (-8.3%).
      
      Cross-compile tested on
      
      	all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
      	alpha alpha-up
      	arm
      	i386 i386-up i386-defconfig i386-allnoconfig
      	ia64 ia64-up
      	m68k
      	mips
      	parisc parisc-up
      	powerpc powerpc-up
      	s390 s390-up
      	sparc sparc-up
      	sparc64 sparc64-up
      	um-x86_64
      	x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
      
      as well as my two usual configs.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8edc6e0
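      A rough sketch of the mechanical part of the move: the inline leaves mm.h
      (where its "current" dereference forced the sched.h include) and becomes
      an ordinary exported function:

        /* include/linux/mm.h: declaration only, no sched.h needed */
        extern int can_do_mlock(void);

        /* mm/mlock.c */
        int can_do_mlock(void)
        {
            if (capable(CAP_IPC_LOCK))
                return 1;
            if (current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur != 0)
                return 1;
            return 0;
        }
        EXPORT_SYMBOL(can_do_mlock);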