1. 07 Mar, 2010 25 commits
    • D
      elf coredump: add extended numbering support · 8d9032bb
      Daisuke HATAYAMA committed
      The current ELF dumper implementation can produce broken corefiles if
      the number of program headers exceeds 65535.  This number is determined
      by the number of vmas which the process has.  In particular, some extreme
      programs may use more than 65535 vmas.  (If you google max_map_count, you
      can find some users facing this problem.) This kind of program can never
      generate correct coredumps.
      
      This patch implements "extended numbering" that uses the sh_info field of
      the first section header instead of the e_phnum field in order to represent
      up to 4294967295 vmas.
      
      This is supported by
      AMD64-ABI(http://www.x86-64.org/documentation.html) and
      Solaris(http://docs.sun.com/app/docs/doc/817-1984/).
      Of course, we are preparing patches for gdb and binutils.
      Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d9032bb
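The extended numbering scheme described in the commit above can be sketched in user space as follows. This is a minimal model, not the kernel code: `struct core_counts`, `encode_phnum()` and `decode_phnum()` are hypothetical names, and only the two fields the commit message mentions are modelled. The escape value 0xffff is the one the ELF specification calls PN_XNUM.

```c
#include <assert.h>
#include <stdint.h>

/* Escape value stored in e_phnum when the real count needs more than
 * 16 bits (PN_XNUM in the ELF specification). */
#define PN_XNUM 0xffffu

/* Hypothetical, trimmed-down view of the two ELF fields involved. */
struct core_counts {
    uint16_t e_phnum;  /* program header count field of the ELF header */
    uint32_t sh_info;  /* sh_info field of the first section header */
};

/* Writer side: decide where the segment count is stored. */
static struct core_counts encode_phnum(uint32_t segments)
{
    struct core_counts c = { 0, 0 };

    if (segments >= PN_XNUM) {
        c.e_phnum = PN_XNUM;   /* marker: the real count is in sh_info */
        c.sh_info = segments;
    } else {
        c.e_phnum = (uint16_t)segments;
    }
    return c;
}

/* Reader side: recover the real number of program headers. */
static uint32_t decode_phnum(struct core_counts c)
{
    return c.e_phnum == PN_XNUM ? c.sh_info : c.e_phnum;
}
```

A reader that ignores the escape value would see 65535 program headers for any larger dump, which is why gdb and binutils need matching changes.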
    • D
      elf coredump: replace ELF_CORE_EXTRA_* macros by functions · 1fcccbac
      Daisuke HATAYAMA committed
      elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the corresponding
      macros to hide _multiline_ logic in functions.  This patch removes the
      #ifdefs and replaces ELF_CORE_EXTRA_* with corresponding functions.  For
      architectures not implementing ELF_CORE_EXTRA_*, we use weak functions in
      order to limit the scope of the change.
      
      This cleanup is for my next patches, but I think this cleanup itself is
      worth doing regardless of my final purpose.
      Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1fcccbac
    • D
      coredump: move dump_write() and dump_seek() into a header file · 088e7af7
      Daisuke HATAYAMA committed
      My next patch will replace ELF_CORE_EXTRA_* macros by functions, putting
      them into other newly created *.c files.  Then, each file will contain
      dump_write(), where each pair of binfmt_*.c and elfcore.c should be the
      same.  So, this patch moves them into a header file with dump_seek().
      Also, the patch deletes the confusing DUMP_WRITE macros in each file.
      Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      088e7af7
    • D
      sdio: put active devices into 1-bit mode during suspend · 6b5eda36
      Daniel Drake committed
      And bring them back to 4-bit mode during resume.
      Signed-off-by: Daniel Drake <dsd@laptop.org>
      Signed-off-by: Nicolas Pitre <nico@marvell.com>
      Cc: <linux-mmc@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b5eda36
    • N
      sdio: introduce API for special power management features · da68c4eb
      Nicolas Pitre committed
      This patch series provides the core changes needed to allow SDIO cards to
      remain powered and active while the host system is suspended, and let them
      wake up the host system when needed.  This is used to implement
      wake-on-lan with SDIO wireless cards at the moment.  Patches to add that
      support to the libertas driver will be posted separately.
      
      This patch:
      
      Some SDIO cards have the ability to keep on running autonomously when the
      host system is suspended, and wake it up when needed.  This however
      requires that the host controller preserve power to the card, and
      configure itself appropriately for wake-up.
      
      There are, however, four layers of abstraction involved: the host controller
      driver, the MMC core code, the SDIO card management code, and the actual
      SDIO function driver.  To keep things simple and manageable, host drivers
      must advertise their PM capabilities with a feature bitmask; function
      drivers can then query and set those features from their suspend method.
      Each layer in the suspend call chain is then expected to act upon those
      bits accordingly.
      
      [akpm@linux-foundation.org: fix typo in comment]
      Signed-off-by: Nicolas Pitre <nico@marvell.com>
      Cc: <linux-mmc@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da68c4eb
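The capability/flag handshake described in the commit above might be modelled like this. It is a user-space sketch under stated assumptions: the bit names, their values, `struct host` and the three helper functions are all hypothetical stand-ins and are not the kernel API introduced by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PM feature bits; names and values are illustrative only. */
#define PM_KEEP_POWER    (1u << 0)  /* keep the card powered during suspend */
#define PM_WAKE_SDIO_IRQ (1u << 1)  /* let the card's IRQ wake the host */

struct host {
    uint32_t pm_caps;   /* advertised by the host controller driver */
    uint32_t pm_flags;  /* requested by the SDIO function driver */
};

/* Function driver side: query what the host can do. */
static uint32_t host_pm_caps(const struct host *h)
{
    return h->pm_caps;
}

/* Function driver side: request features from its suspend method.
 * Bits the host did not advertise are rejected. */
static int host_set_pm_flags(struct host *h, uint32_t flags)
{
    if (flags & ~h->pm_caps)
        return -1;
    h->pm_flags |= flags;
    return 0;
}

/* Core side: a later layer in the suspend chain acts on the bits. */
static bool core_should_power_off(const struct host *h)
{
    return !(h->pm_flags & PM_KEEP_POWER);
}
```

The point of the bitmask is that no layer needs to know the others' internals: each one only reads or sets bits and acts on the ones it understands.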
    • B
      sdio: add quirk to clamp byte mode transfer · 3fb7fb4a
      Bing Zhao committed
      Some SDIO cards expect byte transfers not to exceed the configured block
      transfer size.  Add a quirk to that effect.
      
      Patches to make use of this quirk will be sent separately.
      Signed-off-by: Bing Zhao <bzhao@marvell.com>
      Signed-off-by: Nicolas Pitre <nico@marvell.com>
      Cc: <linux-mmc@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fb7fb4a
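The effect of such a quirk can be illustrated with a small size calculation. This is a sketch, not the patch itself: `max_byte_size()` and its parameters are hypothetical, and the 512-byte ceiling is assumed here as the usual limit of an SDIO byte-mode transfer.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed upper bound for a single byte-mode transfer. */
#define BYTE_MODE_LIMIT 512u

/* Largest byte-mode transfer to issue.
 * host_max:    what the host controller can handle (hypothetical input)
 * cur_blksize: block size currently configured on the function
 * clamp_quirk: true for cards that need the clamp described above */
static unsigned int max_byte_size(unsigned int host_max,
                                  unsigned int cur_blksize,
                                  bool clamp_quirk)
{
    unsigned int max = host_max < BYTE_MODE_LIMIT ? host_max : BYTE_MODE_LIMIT;

    /* Quirky cards must never see a byte transfer larger than a block. */
    if (clamp_quirk && cur_blksize < max)
        max = cur_blksize;
    return max;
}
```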
    • B
      lib: fix first line of kernel-doc for a few functions · 9a86e2ba
      Ben Hutchings committed
      The function name must be followed by a space, hyphen, space, and a short
      description.
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a86e2ba
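As an illustration of the rule in the commit above, a kernel-doc comment whose first line follows the "name, space, hyphen, space, short description" pattern might look like this. The function `clamp_int()` is a made-up example and is not one of the functions touched by the commit.

```c
#include <assert.h>

/**
 * clamp_int - limit a value to a closed range
 * @val: value to limit
 * @lo: lowest allowed result
 * @hi: highest allowed result
 *
 * Return: @val if it lies within [@lo, @hi], otherwise the nearer bound.
 */
static int clamp_int(int val, int lo, int hi)
{
    if (val < lo)
        return lo;
    if (val > hi)
        return hi;
    return val;
}
```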
    • R
      smp: fix documentation in include/linux/smp.h · cfd8d6c0
      Rakib Mullick committed
      smp: Fix documentation.
      
      Fix documentation in include/linux/smp.h: smp_processor_id()
      Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cfd8d6c0
    • H
      nodemask.h: remove macro any_online_node · 72c33688
      H Hartley Sweeten committed
      The macro any_online_node() is prone to producing sparse warnings due to
      the local symbol 'node'.  Since all the in-tree users are really
      requesting the first online node (the mask argument is either
      NODE_MASK_ALL or node_online_map) just use the first_online_node macro and
      remove the any_online_node macro since there are no users.
      Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Milton Miller <miltonm@bga.com>
      Cc: Nathan Fontenot <nfont@austin.ibm.com>
      Cc: Geoff Levand <geoffrey.levand@am.sony.com>
      Cc: Grant Likely <grant.likely@secretlab.ca>
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Benny Halevy <bhalevy@panasas.com>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72c33688
    • H
      cpumask: let num_*_cpus() function always return unsigned values · 221e3ebf
      Heiko Carstens committed
      Depending on CONFIG_SMP, the num_*_cpus() functions return unsigned or
      signed values.  Let them always return unsigned values to avoid strange
      casts.
      
      Fixes at least one warning:
      
       kernel/kprobes.c: In function 'register_kretprobe':
       kernel/kprobes.c:1038: warning: comparison of distinct pointer types lacks a cast
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      221e3ebf
    • D
      mm: add comment about deprecation of __GFP_NOFAIL · 478352e7
      David Rientjes committed
      __GFP_NOFAIL was deprecated in dab48dab, so add a comment that no new
      users should be added.
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      478352e7
    • J
      vmscan: detect mapped file pages used only once · 64574746
      Johannes Weiner committed
      The VM currently assumes that an inactive, mapped and referenced file page
      is in use and promotes it to the active list.
      
      However, every mapped file page starts out like this and thus a problem
      arises when workloads create a stream of such pages that are used only for
      a short time.  By flooding the active list with those pages, the VM
      quickly gets into trouble finding eligible reclaim candidates.  The result
      is long allocation latencies and eviction of the wrong pages.
      
      This patch reuses the PG_referenced page flag (used for unmapped file
      pages) to implement a usage detection that scales with the speed of LRU
      list cycling (i.e.  memory pressure).
      
      If the scanner encounters those pages, the flag is set and the page cycled
      again on the inactive list.  Only if it returns with another page table
      reference is it activated.  Otherwise it is reclaimed as 'not recently
      used cache'.
      
      This effectively changes the minimum lifetime of a used-once mapped file
      page from a full memory cycle to an inactive list cycle, which allows it
      to occur in linear streams without affecting the stable working set of the
      system.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64574746
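The two-pass detection described in the commit above can be modelled roughly as follows. This is a simplified user-space sketch, not the vmscan code: `struct file_page`, `enum page_action` and `check_references()` are hypothetical names, and only mapped file pages are considered.

```c
#include <assert.h>
#include <stdbool.h>

enum page_action { PAGE_RECLAIM, PAGE_KEEP, PAGE_ACTIVATE };

/* Hypothetical stand-in for a mapped file page on the inactive list. */
struct file_page {
    bool pg_referenced;  /* models the PG_referenced page flag */
};

/* Called each time the scanner reaches the page at the tail of the
 * inactive list; referenced_ptes is the number of page table
 * references seen since the previous visit. */
static enum page_action check_references(struct file_page *page,
                                         int referenced_ptes)
{
    bool seen_before = page->pg_referenced;

    page->pg_referenced = false;        /* test-and-clear the flag */

    if (referenced_ptes == 0)
        return PAGE_RECLAIM;            /* not recently used cache */

    page->pg_referenced = true;         /* remember this reference */
    return seen_before ? PAGE_ACTIVATE  /* referenced on two passes */
                       : PAGE_KEEP;     /* first sighting: cycle again */
}
```

A page that is touched once and never again is kept for one extra trip around the inactive list and then reclaimed, so a linear stream of such pages never reaches the active list.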
    • R
      mm/pm: force GFP_NOIO during suspend/hibernation and resume · 452aa699
      Rafael J. Wysocki committed
      There are quite a few GFP_KERNEL memory allocations made during
      suspend/hibernation and resume that may cause the system to hang, because
      the I/O operations they depend on cannot be completed due to the
      underlying devices being suspended.
      
      Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
      gfp_allowed_mask before suspend/hibernation and restoring the original
      values of these bits in gfp_allowed_mask during the subsequent resume.
      
      [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Reported-by: Maxim Levitsky <maximlevitsky@gmail.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      452aa699
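The mask manipulation described in the commit above can be sketched in user space like this. The bit values, the `SK_` prefixed names and the three helper functions are assumptions made for the example; only the idea of clearing the I/O and FS bits in a global allowed mask and restoring them on resume comes from the commit message.

```c
#include <assert.h>

/* Hypothetical flag bits standing in for __GFP_WAIT, __GFP_IO, __GFP_FS. */
#define SK_GFP_WAIT (1u << 0)
#define SK_GFP_IO   (1u << 1)
#define SK_GFP_FS   (1u << 2)

/* Every allocation request is filtered through this mask. */
static unsigned int gfp_allowed_mask = SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS;
static unsigned int saved_io_bits;

/* Before suspend/hibernation: forbid allocations that start I/O. */
static void suspend_restrict_gfp(void)
{
    saved_io_bits = gfp_allowed_mask & (SK_GFP_IO | SK_GFP_FS);
    gfp_allowed_mask &= ~(SK_GFP_IO | SK_GFP_FS);
}

/* During resume: put back exactly the bits that were cleared. */
static void resume_restore_gfp(void)
{
    gfp_allowed_mask |= saved_io_bits;
    saved_io_bits = 0;
}

/* What an allocator would actually honour for a given request. */
static unsigned int effective_gfp(unsigned int requested)
{
    return requested & gfp_allowed_mask;
}
```

Saving the cleared bits, rather than unconditionally setting them on resume, keeps the original values intact if they were already restricted before suspend.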
    • R
      mm: remove VM_LOCK_RMAP code · fc148a5f
      Rik van Riel committed
      When a VMA is in an inconsistent state during setup or teardown, the worst
      that can happen is that the rmap code will not be able to find the page.
      
      The mapping is in the process of being torn down (PTEs just got
      invalidated by munmap), or set up (no PTEs have been instantiated yet).
      
      It is also impossible for the rmap code to follow a pointer to an already
      freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
      teardown code needs to take before the VMA is removed from the anon_vma
      chain.
      
      Hence, we should not need the VM_LOCK_RMAP locking at all.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc148a5f
    • R
      rmap: move exclusively owned pages to own anon_vma in do_wp_page() · c44b6743
      Rik van Riel committed
      When the parent process breaks the COW on a page, both the original, which
      is mapped in the child, and the new page, which is mapped in the parent, end
      up in that same anon_vma.  Generally this won't be a problem, but for some
      workloads it could preserve the O(N) rmap scanning complexity.
      
      A simple fix is to ensure that, when a page which is mapped in the child
      gets reused in do_wp_page, because we already are the exclusive owner, the
      page gets moved to our own exclusive child's anon_vma.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c44b6743
    • R
      mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Rik van Riel committed
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach into the tens of
      thousands.  Real workloads are still a factor of 10 less process intensive
      than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
      This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time; there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5beb4930
    • A
      include/linux/fs.h: convert FMODE_* constants to hex · 19adf9c5
      Andrew Morton committed
      It was tolerable until Eric went and added 8388608.
      
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19adf9c5
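The readability problem behind the commit above is easy to demonstrate: 8388608 is a single bit, but that is only obvious in hex. The macro names below are hypothetical and do not correspond to any particular FMODE_* constant.

```c
#include <assert.h>

/* The same single-bit flag written both ways (hypothetical names). */
#define FLAG_DECIMAL 8388608u
#define FLAG_HEX     0x800000u

/* Compile-time check that the two spellings are the same value. */
static_assert(FLAG_DECIMAL == FLAG_HEX, "same bit, different notation");

/* Position of the set bit in a single-bit flag. */
static int bit_index(unsigned int flag)
{
    int index = 0;

    while (flag > 1u) {
        flag >>= 1;
        index++;
    }
    return index;
}
```

In hex the neighbouring flags line up as 0x200000, 0x400000, 0x800000, so a gap or a collision is visible at a glance.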
    • W
      readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM · 0141450f
      Wu Fengguang committed
      This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
      
      POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
      a 16K read will be carried out in 4 _sync_ 1-page reads.
      
      In other places, ra_pages==0 means
      - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
      - some IO error happened
      where multi-page read IO won't help or should be avoided.
      
      POSIX_FADV_RANDOM actually wants different semantics: to disable the
      *heuristic* readahead algorithm, and to use a dumb one which faithfully
      submits read IO for whatever the application requests.
      
      So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.
      
      Note that the random hint is not likely to help random read performance
      noticeably.  And it may be too permissive on huge request size (its IO
      size is not limited by read_ahead_kb).
      
      In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
      (NFS read) performance of the application increased by 313%!
      Tested-by: Quentin Barnes <qbarnes+nfs@yahoo-inc.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: <stable@kernel.org>			[2.6.33.x]
      Cc: <qbarnes+nfs@yahoo-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0141450f
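The "16K read in 4 one-page reads" arithmetic from the commit above can be checked with a tiny model. This is only a counting sketch under the assumption of 4 KiB pages; `pages_spanned()` and `reads_issued()` are hypothetical helpers and do not mirror the readahead code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */

/* Number of pages touched by a read of len bytes at byte offset off. */
static size_t pages_spanned(size_t off, size_t len)
{
    if (len == 0)
        return 0;
    return (off + len - 1) / SKETCH_PAGE_SIZE - off / SKETCH_PAGE_SIZE + 1;
}

/* Read requests issued for one application read.
 * random_flag == false models the old ra_pages == 0 behaviour;
 * random_flag == true models a dedicated "random" flag. */
static size_t reads_issued(bool random_flag, size_t pages)
{
    if (pages == 0)
        return 0;
    /* Old: one synchronous read per page.
     * New: one read covering exactly the requested pages. */
    return random_flag ? 1 : pages;
}
```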
    • A
      memory-hotplug: create /sys/firmware/memmap entry for new memory · d96ae530
      akpm@linux-foundation.org committed
      A memmap is a directory in sysfs which includes 3 text files: start, end
      and type.  For example:
      
      start: 	0x100000
      end:	0x7e7b1cff
      type:	System RAM
      
      Interface firmware_map_add was not called explicitly.  Remove it and add
      function firmware_map_add_hotplug as hotplug interface of memmap.
      
      Each memory entry has a memmap in sysfs.  When we hot-add new memory, sysfs
      does not export a memmap entry for it.  We add a call in function add_memory
      to function firmware_map_add_hotplug.
      
      Add a new function add_sysfs_fw_map_entry() to create a memmap entry; it
      will be called when initializing memmap and when hot-adding memory.
      
      [akpm@linux-foundation.org: un-kerneldoc a no longer kerneldoc comment]
      Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d96ae530
    • K
      mm: restore zone->all_unreclaimable to independence word · 93e4a89a
      KOSAKI Motohiro committed
      commit e815af95 ("change all_unreclaimable zone member to flags") changed
      the all_unreclaimable member to a bit flag.  But it had an undesirable side
      effect.  free_one_page() is one of the hottest paths in the Linux kernel, and
      adding atomic ops to it can reduce kernel performance a bit.
      
      Thus, this patch partially reverts that commit.  At least
      all_unreclaimable shouldn't share a memory word with other zone flags.
      
      [akpm@linux-foundation.org: fix patch interaction]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93e4a89a
    • L
      mm: remove free_hot_page() · fc91668e
      Li Hong committed
      free_hot_page() is just a wrapper around free_hot_cold_page() with
      parameter 'cold = 0'.  After adding a clear comment for
      free_hot_cold_page(), it is reasonable to remove a level of call.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Li Hong <lihong.hi@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc91668e
    • K
      mm: count swap usage · b084d435
      KAMEZAWA Hiroyuki committed
      A frequent question from users about memory management is how many swap
      entries a process is using.  This information will also give some hints
      to the oom-killer.
      
      Although we can count the number of swapents per process by scanning
      /proc/<pid>/smaps, this is very slow and not suitable for the usual process
      information tools such as 'ps' or 'top'.  (ps and top are already slow
      enough.)
      
      This patch adds a counter of swapents to mm_counter and updates it at each
      swap event.  The information is exported via the /proc/<pid>/status file as
      
      [kamezawa@bluextal memory]$ cat /proc/self/status
      Name:   cat
      State:  R (running)
      Tgid:   2910
      Pid:    2910
      PPid:   2823
      TracerPid:      0
      Uid:    500     500     500     500
      Gid:    500     500     500     500
      FDSize: 256
      Groups: 500
      VmPeak:    82696 kB
      VmSize:    82696 kB
      VmLck:         0 kB
      VmHWM:       432 kB
      VmRSS:       432 kB
      VmData:      172 kB
      VmStk:        84 kB
      VmExe:        48 kB
      VmLib:      1568 kB
      VmPTE:        40 kB
      VmSwap:        0 kB <=============== this.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b084d435
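A tool such as ps or top could pick the new field out of the status text with something like the following. This is a minimal sketch: `parse_vmswap_kb()` is a hypothetical helper that works on a buffer already read from /proc/<pid>/status, and it assumes the "VmSwap: <n> kB" line format shown in the commit above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the VmSwap value in kB from the contents of a status file,
 * or -1 if the field is not present. */
static long parse_vmswap_kb(const char *status)
{
    const char *line = status;

    while (line && *line) {
        long kb;

        /* Matches only if the line starts with the VmSwap label. */
        if (sscanf(line, "VmSwap: %ld kB", &kb) == 1)
            return kb;

        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

Reading one line from the status file is far cheaper than walking every mapping in smaps, which is the motivation given in the commit message.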
    • K
      mm: avoid false sharing of mm_counter · 34e55232
      KAMEZAWA Hiroyuki committed
      Per-mm stats are, by nature, an object shared among threads, and can be
      a cache-miss point in the page fault path.
      
      This patch adds a per-thread cache for mm_counter.  The RSS value will be
      counted into a struct in task_struct and synchronized with the mm's one at
      events.
      
      Now, in this patch, the event is the number of calls to handle_mm_fault.
      The per-thread value is added to the mm every 64 calls.
      
       A rough estimate with a small benchmark on parallel threads (2 threads) shows
       [before]
           4.5 cache-miss/faults
       [after]
           4.0 cache-miss/faults
       Anyway, the most contended object is mmap_sem if the number of threads grows.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34e55232
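The batching idea in the commit above can be sketched as follows. This is a single-threaded user-space model with hypothetical names (`struct mm_model`, `struct task_model`, `account_fault()`); in the kernel the shared counter would be updated atomically, which is exactly the cost the per-thread cache avoids on most faults.

```c
#include <assert.h>

#define SYNC_THRESHOLD 64  /* events between synchronizations */

/* Shared among all threads of a process. */
struct mm_model {
    long rss;
};

/* Private to one thread, so updates cause no cache-line sharing. */
struct task_model {
    struct mm_model *mm;
    long cached_rss;  /* not yet folded into mm->rss */
    int events;       /* page faults since the last sync */
};

/* Fold the thread-private delta into the shared counter. */
static void sync_counters(struct task_model *t)
{
    t->mm->rss += t->cached_rss;
    t->cached_rss = 0;
    t->events = 0;
}

/* Called once per page fault with the RSS change it caused. */
static void account_fault(struct task_model *t, long pages)
{
    t->cached_rss += pages;
    if (++t->events >= SYNC_THRESHOLD)
        sync_counters(t);
}
```

The trade-off is accuracy for locality: between syncs the shared value can lag behind by whatever each thread has cached.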
    • K
      mm: clean up mm_counter · d559db08
      KAMEZAWA Hiroyuki committed
      Presently, the per-mm statistics counters are defined by macros in sched.h.
      
      This patch changes them to
        - be defined in mm.h as inline functions
        - use an array instead of macro-generated names.
      
      This patch reduces the size of a future patch that modifies the
      implementation of the per-mm counters.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d559db08
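The "array plus inline functions" shape described in the commit above might look like this in a user-space model. The enum members, `struct mm_model` and the accessor names are chosen for illustration and should be read as assumptions, not as the exact definitions added to mm.h.

```c
#include <assert.h>

/* One index per statistic instead of one macro-generated field name. */
enum mm_counter_index {
    SK_MM_FILEPAGES,
    SK_MM_ANONPAGES,
    SK_NR_MM_COUNTERS  /* array size, always last */
};

struct mm_model {
    long count[SK_NR_MM_COUNTERS];
};

static inline long get_counter(const struct mm_model *mm, int member)
{
    return mm->count[member];
}

static inline void add_counter(struct mm_model *mm, int member, long value)
{
    mm->count[member] += value;
}

static inline void inc_counter(struct mm_model *mm, int member)
{
    add_counter(mm, member, 1);
}

static inline void dec_counter(struct mm_model *mm, int member)
{
    add_counter(mm, member, -1);
}
```

With all accesses funnelled through a few accessors, a later change of the underlying representation only has to touch these functions rather than every call site.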
    • A
      bitops: rename for_each_bit() to for_each_set_bit() · 984b3f57
      Akinobu Mita committed
      Rename for_each_bit to for_each_set_bit in the kernel source tree, to
      permit for_each_clear_bit(), should that ever be added.
      
      The patch includes a macro to map the old for_each_bit() onto the new
      for_each_set_bit().  This is a (very) temporary thing to ease the migration.
      
      [akpm@linux-foundation.org: add temporary for_each_bit()]
      Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Artem Bityutskiy <dedekind@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      984b3f57
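The rename and its temporary compatibility macro can be illustrated with a user-space analogue. This is a sketch over a single `unsigned long` word; the `_sketch` macros and `next_set_bit()` are hypothetical and much simpler than the kernel's bitmap helpers, which operate on arrays of words.

```c
#include <assert.h>
#include <limits.h>

/* Index of the first set bit at or after start, or nbits if none. */
static unsigned int next_set_bit(unsigned long word, unsigned int nbits,
                                 unsigned int start)
{
    assert(nbits <= sizeof(unsigned long) * CHAR_BIT);
    for (unsigned int i = start; i < nbits; i++)
        if (word & (1UL << i))
            return i;
    return nbits;
}

/* New, explicit name: visits only the bits that are set. */
#define for_each_set_bit_sketch(bit, word, nbits)              \
    for ((bit) = next_set_bit((word), (nbits), 0);             \
         (bit) < (nbits);                                      \
         (bit) = next_set_bit((word), (nbits), (bit) + 1))

/* Temporary alias so callers using the old name keep compiling. */
#define for_each_bit_sketch(bit, word, nbits) \
    for_each_set_bit_sketch(bit, word, nbits)

/* Example user: count the set bits among the low nbits of word. */
static unsigned int count_set_bits(unsigned long word, unsigned int nbits)
{
    unsigned int bit, n = 0;

    for_each_set_bit_sketch(bit, word, nbits)
        n++;
    return n;
}
```

Once every caller has moved to the new name, the alias can be deleted without touching the iterator itself.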
  2. 06 Mar, 2010 9 commits
  3. 05 Mar, 2010 6 commits
    • C
      quota: stop using QUOTA_OK / NO_QUOTA · efd8f0e6
      Christoph Hellwig committed
      Just use 0 / -EDQUOT directly - that's what it translates to anyway.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      efd8f0e6
    • C
      dquot: cleanup dquot initialize routine · 871a2931
      Christoph Hellwig committed
      Get rid of the initialize dquot operation - it is now always called from
      the filesystem and if a filesystem really needs its own (which none
      currently does) it can just call into its own routine directly.
      
      Rename the now static low-level dquot_initialize helper to __dquot_initialize
      and vfs_dq_init to dquot_initialize to have a consistent namespace.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      871a2931
    • C
      dquot: move dquot initialization responsibility into the filesystem · 907f4554
      Christoph Hellwig committed
      Currently various places in the VFS call vfs_dq_init directly.  This means
      we tie the quota code into the VFS.  Get rid of that and make the
      filesystem responsible for the initialization.   For most metadata operations
      this is a straightforward move into the methods, but for truncate and
      open it's a bit more complicated.
      
      For truncate we currently only call vfs_dq_init for the sys_truncate case
      because open already takes care of it for ftruncate and open(O_TRUNC) - the
      new code causes an additional vfs_dq_init for those which is harmless.
      
      For open the initialization is moved from do_filp_open into the open method,
      which means it happens slightly earlier now, and only for regular files.
      The latter is fine because we don't need to initialize it for operations
      on special files, and we already do it as part of the namespace operations
      for directories.
      
      Add a dquot_file_open helper that filesystems that support generic quotas
      can use to fill in ->open.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      907f4554
    • C
      dquot: cleanup dquot drop routine · 9f754758
      Christoph Hellwig committed
      Get rid of the drop dquot operation - it is now always called from
      the filesystem and if a filesystem really needs its own (which none
      currently does) it can just call into its own routine directly.
      
      Rename the now static low-level dquot_drop helper to __dquot_drop
      and vfs_dq_drop to dquot_drop to have a consistent namespace.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      9f754758
    • C
      dquot: cleanup dquot transfer routine · b43fa828
      Christoph Hellwig committed
      Get rid of the transfer dquot operation - it is now always called from
      the filesystem and if a filesystem really needs its own (which none
      currently does) it can just call into its own routine directly.
      
      Rename the now static low-level dquot_transfer helper to __dquot_transfer
      and vfs_dq_transfer to dquot_transfer to have a consistent namespace,
      and make the new dquot_transfer return a normal negative errno value
      which all callers expect.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      b43fa828
    • C
      dquot: cleanup inode allocation / freeing routines · 63936dda
      Christoph Hellwig committed
      Get rid of the alloc_inode and free_inode dquot operations - they are
      always called from the filesystem and if a filesystem really needs
      its own (which none currently does) it can just call into its
      own routine directly.
      
      Also get rid of the vfs_dq_alloc/vfs_dq_free wrappers and always
      call the lowlevel dquot_alloc_inode / dquot_free_inode routines
      directly, which now lose the number argument which is always 1.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      63936dda