1. 07 1月, 2009 40 次提交
    • J
      i2c: Drop I2C_CLASS_ALL · e1995f65
      Jean Delvare 提交于
      I2C_CLASS_ALL is almost never what bus driver authors really want.
      These i2c classes are really only about which devices must be probed,
      not what devices can be present. As device drivers get converted to the
      new i2c device driver model, only a few device types will keep relying
      on probing.
      Signed-off-by: NJean Delvare <khali@linux-fr.org>
      Acked-by: Sonic Zhang <sonic.zhang@analog.com> 
      e1995f65
    • L
      Fix up 64-bit byte swaps for most 32-bit architectures · ede6f5ae
      Linus Torvalds 提交于
      The __SWAB_64_THRU_32__ case of a 64-bit byte swap was depending on the
      no-longer-existant ___swab32() method (three underscores).  We got rid
      of some of the worst indirection and complexity, and now it should just
      use the 32-bit swab function that was defined right above it.
      Reported-and-tested-by: NNicolas Pitre <nico@cam.org>
      Reported-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Harvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ede6f5ae
    • H
      byteorder: remove the now unused byteorder.h · 637b180c
      Harvey Harrison 提交于
      This implementation caused problems in userspace which can, and does
      define _both_ __LITTLE_ENDIAN and __BIG_ENDIAN.
      Signed-off-by: NHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      637b180c
    • H
      byteorder: only use linux/swab.h · 991c0e6d
      Harvey Harrison 提交于
      The first step to make swab.h a regular header that will
      include an asm/swab.h with arch overrides.
      
      Avoid the gratuitous differences introduced in the new
      linux/swab.h by naming the ___constant_swabXX bits and
      __fswabXX bits exactly as found in the old implementation
      in byteorder/swab[b].h
      
      Use this new swab.h in byteorder/[big|little]_endian.h and
      remove the two old swab headers.
      
      Although the inclusion of asm/byteorder.h looks strange in
      linux/swab.h, this will allow each arch to move the actual
      arch overrides for the swab bits in an asm file and then
      the includes can be cleaned up without requiring a flag day
      for all arches at once.
      
      Keep providing __fswabXX in case some userspace was using them
      directly, but the revised __swabXX should be used instead in
      any new code and will always do constant folding not dependent
      on the optimization level, which means the __constant versions
      can be phased out in-kernel.
      
      Arches that use the old-style arch macros will lose their
      optimized versions until they move to the new style, but at
      least they will still compile.  Many arches have already moved
      and the patches to move the remaining arches are trivial.
      Signed-off-by: NHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      991c0e6d
    • W
      fs/exec.c: make do_coredump() void · 8cd3ac3a
      WANG Cong 提交于
      No one cares do_coredump()'s return value, and also it seems that it
      is also not necessary. So make it void.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NWANG Cong <wangcong@zeuux.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8cd3ac3a
    • R
      rapidio: remove excess kernel-doc notation · d3635abf
      Randy Dunlap 提交于
      Remove excess kernel-doc notation from rio header and driver:
      
      Warning(include/linux/rio_drv.h:399): Excess function parameter or struct member 'buffer' description in 'rio_get_inb_message'
      Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3635abf
    • D
      twl4030-gpio: cleanup debounce · cabb3fc4
      David Brownell 提交于
      Provide a static debounce configuration mechanism for twl4030 GPIOs,
      replacing the previous dynamic one.  The single user of that mechanism was
      for MMC card detect debouncing.
      
      Boards can provide a bitmask saying which GPIOs to debounce (30 msec).
      It's always enabled for pins with the MMC card-detect/VMMCx link active,
      so most boards won't need to set the debounce mask.
      
      This is a net code shrink, including runtime footprint.
      Signed-off-by: NDavid Brownell <dbrownell@users.sourceforge.net>
      Signed-off-by: NTony Lindgren <tony@atomide.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cabb3fc4
    • I
      autofs4: make autofs type usage explicit · a92daf6b
      Ian Kent 提交于
      - the type assigned at mount when no type is given is changed
        from 0 to AUTOFS_TYPE_INDIRECT. This was done because 0 and
        AUTOFS_TYPE_INDIRECT were being treated implicitly as the same
        type.
      
      - previously, an offset mount had it's type set to
        AUTOFS_TYPE_DIRECT|AUTOFS_TYPE_OFFSET but the mount control
        re-implementation needs to be able distinguish all three types.
        So this was changed to make the type setting explicit.
      
      - a type AUTOFS_TYPE_ANY was added for use by the re-implementation
        when checking if a given path is a mountpoint. It's not really a
        type as we use this to ask if a given path is a mountpoint in the
        autofs_dev_ioctl_ismountpoint() function.
      
      - functions to set and test the autofs mount types have been added to
        improve readability and make the type usage explicit.
      
      - the mount type is used from user space for the mount control
        re-implementtion so, for consistency, all the definitions have
        been moved to the user space include file include/linux/auto_fs4.h.
      Signed-off-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a92daf6b
    • I
      autofs4: improve parameter usage · 730c9eec
      Ian Kent 提交于
      The parameter usage in the device node ioctl code uses arg1 and arg2 as
      parameter names.  This patch redefines the parameter names to reflect what
      they actually are in an effort to make the code more readable.
      Signed-off-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      730c9eec
    • M
      kprobes: support probing module __exit function · e8386a0c
      Masami Hiramatsu 提交于
      Allows kprobes to probe __exit routine.  This adds flags member to struct
      kprobe.  When module is freed(kprobes hooks module_notifier to get this
      event), kprobes which probe the functions in that module are set to "Gone"
      flag to the flags member.  These "Gone" probes are never be enabled.
      Users can check the GONE flag through debugfs.
      
      This also removes mod_refcounted, because we couldn't free a module if
      kprobe incremented the refcount of that module.
      
      [akpm@linux-foundation.org: document some locking]
      [mhiramat@redhat.com: bugfix: pass aggr_kprobe to arch_remove_kprobe]
      [mhiramat@redhat.com: bugfix: release old_p's insn_slot before error return]
      Signed-off-by: NMasami Hiramatsu <mhiramat@redhat.com>
      Acked-by: NAnanth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Signed-off-by: NMasami Hiramatsu <mhiramat@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8386a0c
    • M
      kprobes: add kprobe_insn_mutex and cleanup arch_remove_kprobe() · 12941560
      Masami Hiramatsu 提交于
      Add kprobe_insn_mutex for protecting kprobe_insn_pages hlist, and remove
      kprobe_mutex from architecture dependent code.
      
      This allows us to call arch_remove_kprobe() (and free_insn_slot) while
      holding kprobe_mutex.
      Signed-off-by: NMasami Hiramatsu <mhiramat@redhat.com>
      Acked-by: NAnanth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12941560
    • M
      module: add within_module_core() and within_module_init() · a06f6211
      Masami Hiramatsu 提交于
      This series of patches allows kprobes to probe module's __init and __exit
      functions.  This means, you can probe driver initialization and
      terminating.
      
      Currently, kprobes can't probe __init function because these functions are
      freed after module initialization.  And it also can't probe module __exit
      functions because kprobe increments reference count of target module and
      user can't unload it.  this means __exit functions never be called unless
      removing probes from the module.
      
      To solve both cases, this series of patches introduces GONE flag and sets
      it when the target code is freed(for this purpose, kprobes hooks
      MODULE_STATE_* events).  This also removes refcount incrementing for
      allowing user to unload target module.  Users can check which probes are
      GONE by debugfs interface.  For taking timing of freeing module's .init
      text, these also include a patch which adds module's notifier of
      MODULE_STATE_LIVE event.
      
      This patch:
      
      Add within_module_core() and within_module_init() for checking whether an
      address is in the module .init.text section or .text section, and replace
      within() local inline functions in kernel/module.c with them.
      
      kprobes uses these functions to check where the kprobe is inserted.
      Signed-off-by: NMasami Hiramatsu <mhiramat@redhat.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a06f6211
    • D
      spi_gpio driver · d29389de
      David Brownell 提交于
      Generalize the old at91rm9200 "bootstrap" bitbanging SPI master driver as
      "spi_gpio", so it works with arbitrary GPIOs and can be configured through
      platform_data.  Such SPI masters support:
      
       - any number of bus instances (bus_num is the platform_device.id)
       - any number of chipselects (one GPIO per spi_device)
       - all four SPI_MODE values, and SPI_CS_HIGH
       - i/o word sizes from 1 to 32 bits;
       - devices configured as with any other spi_master controller
      
      When configured using platform_data, this provides relatively low clock
      rates.  On platforms that support inlined GPIO calls, significantly
      improved transfer speeds are also possible with a semi-custom driver.
      (It's still painful when accessing flash memory, but less so.)
      
      Sanity checked by using this version to replace both native controllers on
      a board with six different SPI slaves, relying on three different
      SPI_MODE_* values and both SPI_CS_HIGH settings for correct operation.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: NDavid Brownell <dbrownell@users.sourceforge.net>
      Acked-by: NMagnus Damm <damm@igel.co.jp>
      Tested-by: NMagnus Damm <damm@igel.co.jp>
      Cc: Torgil Svensson <torgil.svensson@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d29389de
    • H
      binfmts.h: include list.h · 5cf0cc4e
      Hiroshi Shimamoto 提交于
      linux_binfmt uses list_head, so list.h is needed.
      
      [akpm@linux-foundation.org: fix `make headerscheck']
      Signed-off-by: NHiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5cf0cc4e
    • J
    • E
      percpu_counter: FBC_BATCH should be a variable · 179f7ebf
      Eric Dumazet 提交于
      For NR_CPUS >= 16 values, FBC_BATCH is 2*NR_CPUS
      
      Considering more and more distros are using high NR_CPUS values, it makes
      sense to use a more sensible value for FBC_BATCH, and get rid of NR_CPUS.
      
      A sensible value is 2*num_online_cpus(), with a minimum value of 32 (This
      minimum value helps branch prediction in __percpu_counter_add())
      
      We already have a hotcpu notifier, so we can adjust FBC_BATCH dynamically.
      
      We rename FBC_BATCH to percpu_counter_batch since its not a constant
      anymore.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      179f7ebf
    • T
      poll: allow f_op->poll to sleep · 5f820f64
      Tejun Heo 提交于
      f_op->poll is the only vfs operation which is not allowed to sleep.  It's
      because poll and select implementation used task state to synchronize
      against wake ups, which doesn't have to be the case anymore as wait/wake
      interface can now use custom wake up functions.  The non-sleep restriction
      can be a bit tricky because ->poll is not called from an atomic context
      and the result of accidentally sleeping in ->poll only shows up as
      temporary busy looping when the timing is right or rather wrong.
      
      This patch converts poll/select to use custom wake up function and use
      separate triggered variable to synchronize against wake up events.  The
      only added overhead is an extra function call during wake up and
      negligible.
      
      This patch removes the one non-sleep exception from vfs locking rules and
      is beneficial to userland filesystem implementations like FUSE, 9p or
      peculiar fs like spufs as it's very difficult for those to implement
      non-sleeping poll method.
      
      While at it, make the following cosmetic changes to make poll.h and
      select.c checkpatch friendly.
      
      * s/type * symbol/type *symbol/		   : three places in poll.h
      * remove blank line before EXPORT_SYMBOL() : two places in select.c
      
      Oleg: spotted missing barrier in poll_schedule_timeout()
      Davide: spotted missing write barrier in pollwake()
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Brad Boyer <flar@allandria.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f820f64
    • D
      Create a DIV_ROUND_CLOSEST macro to do division with rounding · 9fe06081
      Darrick J. Wong 提交于
      Create a helper macro to divide two numbers and round the result to the
      nearest whole number.  This is a helper macro for hwmon drivers that want
      to convert incoming sysfs values per standard hwmon practice, though the
      macro itself can be used by anyone.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9fe06081
    • A
      Remove remaining unwinder code · f1883f86
      Alexey Dobriyan 提交于
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Gabor Gombas <gombasg@sztaki.hu>
      Cc: Jan Beulich <jbeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>,
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1883f86
    • M
      atomic_t: unify all arch definitions · ea435467
      Matthew Wilcox 提交于
      The atomic_t type cannot currently be used in some header files because it
      would create an include loop with asm/atomic.h.  Move the type definition
      to linux/types.h to break the loop.
      Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea435467
    • O
      mm: introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting · 901608d9
      Oleg Nesterov 提交于
      xacct_add_tsk() relies on do_exit()->update_hiwater_xxx() and uses
      mm->hiwater_xxx directly, this leads to 2 problems:
      
      - taskstats_user_cmd() can call fill_pid()->xacct_add_tsk() at any
        moment before the task exits, so we should check the current values of
        rss/vm anyway.
      
      - do_exit()->update_hiwater_xxx() calls are racy.  An exiting thread can
        be preempted right before mm->hiwater_xxx = new_val, and another thread
        can use A_LOT of memory and exit in between.  When the first thread
        resumes it can be the last thread in the thread group, in that case we
        report the wrong hiwater_xxx values which do not take A_LOT into
        account.
      
      Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() helpers and change
      xacct_add_tsk() to use them.  The first helper will also be used by
      rusage->ru_maxrss accounting.
      
      Kill do_exit()->update_hiwater_xxx() calls.  Unless we are going to
      decrease rss/vm there is no point to update mm->hiwater_xxx, and nobody
      can look at this mm_struct when exit_mmap() actually unmaps the memory.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      901608d9
    • N
      fs: sys_sync fix · 856bf4d7
      Nick Piggin 提交于
      s_syncing livelock avoidance was breaking data integrity guarantee of
      sys_sync, by allowing sys_sync to skip writing or waiting for superblocks
      if there is a concurrent sys_sync happening.
      
      This livelock avoidance is much less important now that we don't have the
      get_super_to_sync() call after every sb that we sync.  This was replaced
      by __put_super_and_need_restart.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      856bf4d7
    • N
      fs: remove WB_SYNC_HOLD · 4f5a99d6
      Nick Piggin 提交于
      Remove WB_SYNC_HOLD.  The primary motiviation is the design of my
      anti-starvation code for fsync.  It requires taking an inode lock over the
      sync operation, so we could run into lock ordering problems with multiple
      inodes.  It is possible to take a single global lock to solve the ordering
      problem, but then that would prevent a future nice implementation of "sync
      multiple inodes" based on lock order via inode address.
      
      Seems like a backward step to remove this, but actually it is busted
      anyway: we can't use the inode lists for data integrity wait: an inode can
      be taken off the dirty lists but still be under writeback.  In order to
      satisfy data integrity semantics, we should wait for it to finish
      writeback, but if we only search the dirty lists, we'll miss it.
      
      It would be possible to have a "writeback" list, for sys_sync, I suppose.
      But why complicate things by prematurely optimise?  For unmounting, we
      could avoid the "livelock avoidance" code, which would be easier, but
      again premature IMO.
      
      Fixing the existing data integrity problem will come next.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f5a99d6
    • H
      badpage: remove vma from page_remove_rmap · edc315fd
      Hugh Dickins 提交于
      Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
      And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
      page_dup_rmap(): we're trying to be more resilient about that than BUGs.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      edc315fd
    • H
      badpage: zap print_bad_pte on swap and file · 2509ef26
      Hugh Dickins 提交于
      Complete zap_pte_range()'s coverage of bad pagetable entries by calling
      print_bad_pte() on a pte_file in a linear vma and on a bad swap entry.
      That needs free_swap_and_cache() to tell it, which will also have shown
      one of those "swap_free" errors (but with much less information).
      
      Similar checks in fork's copy_one_pte()?  No, that would be more noisy
      than helpful: we'll see them when parent and child exec or exit.
      
      Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR
      case, that could only be a bug in sys_remap_file_pages(), not a bad pte.
      VM_FAULT_OOM rather than VM_FAULT_SIGBUS?  Well, okay, that is consistent
      with what happens if do_swap_page() operates a bad swap entry; but don't
      we have patches to be more careful about killing when VM_FAULT_OOM?
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2509ef26
    • H
      badpage: simplify page_alloc flag check+clear · 79f4b7bf
      Hugh Dickins 提交于
      Simplify the PAGE_FLAGS checking and clearing when freeing and allocating
      a page: check the same flags as before when freeing, clear ALL the flags
      (unless PageReserved) when freeing, check ALL flags off when allocating.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79f4b7bf
    • H
      swapfile: swapon randomize if nonrot · 20137a49
      Hugh Dickins 提交于
      Swap allocation has always started from the beginning of the swap area;
      but if we're dealing with a solidstate swap device which can only remap
      blocks within limited zones, that would sooner wear out the first zone.
      
      Therefore sys_swapon() test whether blk_queue is non-rotational, and if so
      randomize the cluster_next starting position for allocation.
      
      If blk_queue is nonrot, note SWP_SOLIDSTATE for later use, and report it
      with an "SS" at the right end of the kernel's "Adding ...  swap" message
      (so that if it's both nonrot and discardable, "SSD" will be shown there).
      Perhaps something should be shown in /proc/swaps (swapon -s), but we have
      to be more cautious before making any addition to that format.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20137a49
    • H
      swapfile: swap allocation use discard · 7992fde7
      Hugh Dickins 提交于
      When scan_swap_map() finds a free cluster of swap pages to allocate,
      discard the old contents of the cluster if the device supports discard.
      But don't bother when swap is so fragmented that we allocate single pages.
      
      Be careful about racing allocations made while we're scanning for a
      cluster; and hold up allocations made while we're discarding.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7992fde7
    • H
      swapfile: swapon use discard (trim) · 6a6ba831
      Hugh Dickins 提交于
      When adding swap, all the old data on swap can be forgotten: sys_swapon()
      discard all but the header page of the swap partition (or every extent but
      the header of the swap file), to give a solidstate swap device the
      opportunity to optimize its wear-levelling.
      
      If that succeeds, note SWP_DISCARDABLE for later use, and report it with a
      "D" at the right end of the kernel's "Adding ...  swap" message.  Perhaps
      something should be shown in /proc/swaps (swapon -s), but we have to be
      more cautious before making any addition to that format.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a6ba831
    • H
      swapfile: rearrange scan and swap_info · ebebbbe9
      Hugh Dickins 提交于
      Before making functional changes, rearrange scan_swap_map() to simplify
      subsequent diffs.  Actually, there is one functional change in there:
      leave cluster_nr negative while scanning for a new cluster - resetting it
      early increased the likelihood that when we have difficulty finding a free
      cluster, another task may come in and try doing exactly the same - just a
      waste of cpu.
      
      Before making functional changes, rearrange struct swap_info_struct
      slightly: flags will be needed as an unsigned long (for wait_on_bit), next
      is a good int to pair with prio, old_block_size is uninteresting so shift
      it to the end.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebebbbe9
    • H
      swapfile: remove SWP_ACTIVE mask · 22c6f8fd
      Hugh Dickins 提交于
      Remove the SWP_ACTIVE mask: it just obscures the SWP_WRITEOK flag.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22c6f8fd
    • K
      mm: make vread() and vwrite() declaration · 69beeb1d
      KOSAKI Motohiro 提交于
      Sparse output following warnings.
      
      mm/vmalloc.c:1436:6: warning: symbol 'vread' was not declared. Should it be static?
      mm/vmalloc.c:1474:6: warning: symbol 'vwrite' was not declared. Should it be static?
      
      However, it is used by /dev/kmem. fixed here.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69beeb1d
    • H
      mm: optimize get_scan_ratio for no swap · b962716b
      Hugh Dickins 提交于
      Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP.  Yes, the gcc
      optimizer gives us that, when nr_swap_pages is #defined as 0L.  Move usual
      declaration to swapfile.c: it never belonged in page_alloc.c.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b962716b
    • H
      mm: add add_to_swap stub · 60371d97
      Hugh Dickins 提交于
      If we add a failing stub for add_to_swap(), then we can remove the #ifdef
      CONFIG_SWAP from mm/vmscan.c.
      
      This was intended as a source cleanup, but looking more closely, it turns
      out that the !CONFIG_SWAP case was going to keep_locked for an anonymous
      page, whereas now it goes to the more suitable activate_locked, like the
      CONFIG_SWAP nr_swap_pages 0 case.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60371d97
    • H
      mm: remove gfp_mask from add_to_swap · ac47b003
      Hugh Dickins 提交于
      Remove gfp_mask argument from add_to_swap(): it's misleading because its
      only caller, shrink_page_list(), is not atomic at that point; and in due
      course (implementing discard) we'll sometimes want to allocate some memory
      with GFP_NOIO (as is used in swap_writepage) when allocating swap.
      
      No change to the gfp_mask passed down to add_to_swap_cache(): still use
      __GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before):
      though it's not obvious if that's the best combination to ask for here.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac47b003
    • H
      mm: try_to_free_swap replaces remove_exclusive_swap_page · a2c43eed
      Hugh Dickins 提交于
      remove_exclusive_swap_page(): its problem is in living up to its name.
      
      It doesn't matter if someone else has a reference to the page (raised
      page_count); it doesn't matter if the page is mapped into userspace
      (raised page_mapcount - though that hints it may be worth keeping the
      swap): all that matters is that there be no more references to the swap
      (and no writeback in progress).
      
      swapoff (try_to_unuse) has been removing pages from swapcache for years,
      with no concern for page count or page mapcount, and we used to have a
      comment in lookup_swap_cache() recognizing that: if you go for a page of
      swapcache, you'll get the right page, but it could have been removed from
      swapcache by the time you get page lock.
      
      So, give up asking for exclusivity: get rid of
      remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
      remove_exclusive_swap_page_count() which were spawned for the recent LRU
      work: replace them by the simpler try_to_free_swap() which just checks
      page_swapcount().
      
      Similarly, remove the page_count limitation from free_swap_and_count(),
      but assume that it's worth holding on to the swap if page is mapped and
      swap nowhere near full.  Add a vm_swap_full() test in free_swap_cache()?
      It would be consistent, but I think we probably have enough for now.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2c43eed
    • H
      mm: reuse_swap_page replaces can_share_swap_page · 7b1fe597
      Hugh Dickins 提交于
      A good place to free up old swap is where do_wp_page(), or do_swap_page(),
      is about to redirty the page: the data on disk is then stale and won't be
      read again; and if we do decide to write the page out later, using the
      previous swap location makes an unnecessary disk seek very likely.
      
      So give can_share_swap_page() the side-effect of delete_from_swap_cache()
      when it safely can.  And can_share_swap_page() was always a misleading
      name, the more so if it has a side-effect: rename it reuse_swap_page().
      
      Irrelevant cleanup nearby: remove swap_token_default_timeout definition
      from swap.h: it's used nowhere.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b1fe597
    • D
      mm: add dirty_background_bytes and dirty_bytes sysctls · 2da02997
      David Rientjes 提交于
      This change introduces two new sysctls to /proc/sys/vm:
      dirty_background_bytes and dirty_bytes.
      
      dirty_background_bytes is the counterpart to dirty_background_ratio and
      dirty_bytes is the counterpart to dirty_ratio.
      
      With growing memory capacities of individual machines, it's no longer
      sufficient to specify dirty thresholds as a percentage of the amount of
      dirtyable memory over the entire system.
      
      dirty_background_bytes and dirty_bytes specify quantities of memory, in
      bytes, that represent the dirty limits for the entire system.  If either
      of these values is set, its value represents the amount of dirty memory
      that is needed to commence either background or direct writeback.
      
      When a `bytes' or `ratio' file is written, its counterpart becomes a
      function of the written value.  For example, if dirty_bytes is written to
      be 8096, 8K of memory is required to commence direct writeback.
      dirty_ratio is then functionally equivalent to 8K / the amount of
      dirtyable memory:
      
      	dirtyable_memory = free pages + mapped pages + file cache
      
      	dirty_background_bytes = dirty_background_ratio * dirtyable_memory
      		-or-
      	dirty_background_ratio = dirty_background_bytes / dirtyable_memory
      
      		AND
      
      	dirty_bytes = dirty_ratio * dirtyable_memory
      		-or-
      	dirty_ratio = dirty_bytes / dirtyable_memory
      
      Only one of dirty_background_bytes and dirty_background_ratio may be
      specified at a time, and only one of dirty_bytes and dirty_ratio may be
      specified.  When one sysctl is written, the other appears as 0 when read.
      
      The `bytes' files operate on a page size granularity since dirty limits
      are compared with ZVC values, which are in page units.
      
      Prior to this change, the minimum dirty_ratio was 5 as implemented by
      get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
      written value between 0 and 100.  This restriction is maintained, but
      dirty_bytes has a lower limit of only one page.
      
      Also prior to this change, the dirty_background_ratio could not equal or
      exceed dirty_ratio.  This restriction is maintained in addition to
      restricting dirty_background_bytes.  If either background threshold equals
      or exceeds that of the dirty threshold, it is implicitly set to half the
      dirty threshold.
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2da02997
    • D
      mm: change dirty limit type specifiers to unsigned long · 364aeb28
      David Rientjes 提交于
      The background dirty and dirty limits are better defined with type
      specifiers of unsigned long since negative writeback thresholds are not
      possible.
      
      These values, as returned by get_dirty_limits(), are normally compared
      with ZVC values to determine whether writeback shall commence or be
      throttled.  Such page counts cannot be negative, so declaring the page
      limits as signed is unnecessary.
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      364aeb28
    • H
      mm: make page_lock_anon_vma() static · 2afd1c92
      Hugh Dickins 提交于
      page_lock_anon_vma() and page_unlock_anon_vma() were made available to
      show_page_path() in vmscan.c; but now that has been removed, make them
      static in rmap.c again, they're better kept private if possible.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2afd1c92